
What is Statistics?

The science of collecting, describing,
analyzing, and interpreting data.
Collecting data
Organizing data
Analyzing data
Interpreting data
Individuals and Variables

Individuals are people or objects
included in the study.

Variables are characteristics of the
individual to be measured or
observed.
Types of Variables
Variables can be classified as
discrete or continuous.
Discrete variables (such as class
size) consist of indivisible categories,
and continuous variables (such as
time or weight) are infinitely divisible
into whatever units a researcher may
choose. For example, time can be
measured to the nearest minute,
second, half-second, etc.
Variables

Quantitative Variable: The variable
is numerical, so operations such as
adding and averaging make sense.

Qualitative Variable: The variable
describes an individual through
grouping or categorization.

Which of the following is an example of a qualitative
variable?
a). Age b). Mass
c). Religious preference d). Batting average
Data

Population Data: The data are from
every individual of interest.

Sample Data: The data are from
only some of the individuals of
interest.
Parameter: a fixed, unknown number that
describes the population.

Statistic: a known value calculated from a
sample; often used to estimate a
parameter.
A properly chosen sample of 1600 people
across India was asked if they regularly
watch a certain television program, and
24% said yes. The parameter of interest
here is the true proportion of all people in
India who watch the program, while the
statistic is the value 24% obtained from the
sample of 1600 people.
Data
Which of the following Venn diagrams
shows the relationship between
population data and sample data?
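[Figure: four Venn diagrams, a) through d), each showing a different arrangement of a region P (population data) and a region S (sample data).]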
Levels of Measurement

Nominal Level: The data consist of
names, labels, or categories.

Ordinal Level: The data can be
ordered, but the differences between
data values are meaningless.
Levels of Measurement

Interval Level: The data can be
ordered and the differences between
data values are meaningful.

Ratio Level: The data can be
ordered, differences and ratios are
meaningful, and there is a
meaningful zero value.
Levels of Measurement

The freezing points of four liquids are
32°F, 6°F, 13°F, and 20°F. What is
the level of these measurements?

a). Nominal
b). Ordinal
c). Interval
d). Ratio
Researchers want data on taste of a group of
pineapples. A panel of tasters rates the
pineapples according to the categories
poor, acceptable, and good. Only some
of the pineapples are included in the taste
test. In this case, the _______ is taste. This is
a _________ variable. Because only some of
the pineapples in the field are included in the
study, we have a __________. The proportion
of pineapples in the sample with a taste rating
of good is a __________.
Researchers want data on taste of a group of
pineapples. A panel of tasters rates the
pineapples according to the categories
poor, acceptable, and good. Only some
of the pineapples are included in the taste
test. In this case, the variable is taste. This is
a qualitative variable. Because only some of
the pineapples in the field are included in the
study, we have a sample. The proportion of
pineapples in the sample with a taste rating of
good is a statistic.
Two Branches of Statistics

Descriptive Statistics: Organizing,
summarizing, and graphing
information from populations or
samples.

Inferential Statistics: Using
information from a sample to draw
conclusions about a population.
Application of Statistics
In the early 20th century, two of the most important areas of applied
statistics were population biology and agriculture.
Nowadays the ideas of statistics are everywhere.
Descriptive statistics are featured in every newspaper and magazine.
Statistical inference has become indispensable
to public health and medical research,
to marketing and quality control,
to education,
to accounting,
to economics,
to meteorological forecasting,
to polling and surveys,
to sports,
to insurance,
to gambling, and
to all research that makes any claim to being scientific.
Sampling Techniques
Simple Random Sampling, Sample
size = n
Each member of the population has an
equal chance of being selected.
Each sample of size n has an equal
chance of being selected.

Stratified sampling
[Figure: a population divided into Subgroups 1 through 4, with a sample drawn from the subgroups.]
Sampling Techniques
Systematic sampling
Number every member of the population.
Select every kth member.

Cluster sampling
Population is naturally divided into pre-existing
segments.
Make a random selection of clusters, then select
all members of each cluster.

Convenience sampling: Collect sample data
from a readily available population database.
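
The sampling techniques above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 numbered members, the helper names, the fixed seed, and the random starting point for the systematic sample are all hypothetical choices, not part of the slides.

```python
import random

def simple_random_sample(population, n, rng):
    """Every subset of size n is equally likely to be chosen."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Select every kth member, starting from a random offset (an assumption)."""
    start = rng.randrange(k)
    return population[start::k]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters, then keep all members of each."""
    chosen = rng.sample(clusters, n_clusters)
    return [member for cluster in chosen for member in cluster]

rng = random.Random(0)                 # fixed seed so the run is repeatable
population = list(range(1, 101))       # hypothetical members numbered 1..100

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)

# Hypothetical pre-existing segments: five clusters of 20 members each.
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
clustered = cluster_sample(clusters, 2, rng)

print("simple random:", sorted(srs))
print("systematic:   ", systematic)
print("cluster size: ", len(clustered))
```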
Guidelines For Planning a Statistical
Study
1. Identify individuals or objects of
interest.
2. Specify the variables.
3. Determine if you will use the
entire population. If not,
determine an appropriate
sampling method
4. Determine a data collection plan,
addressing privacy, ethics, and
confidentiality if necessary.
Guidelines For Planning a Statistical
Study
5. Collect data.
6. Analyze the data using appropriate
statistical methods.
7. Note any concerns about the data
and recommend any remedies for
further studies.
Census vs. Sample

In a census, measurements or
observations are obtained from the
entire population (uncommon and
often impractical).
In a sample, measurements or
observations are obtained from part
of the population (common).
Surveys
Collecting data from respondents by asking them
questions.
Survey Pitfalls
Nonresponse: undercoverage of population.
Truthfulness: respondents sometimes lie.
Faulty recall of respondent.
Hidden bias: due to poor question wording.
Vague wording: "sometimes," "often," "seldom."
Interviewer influence: who is asking the
questions and in what manner.
Voluntary response: relatively interested
individuals are more likely to participate.
Frequency Tables
A frequency table
organizes quantitative data.
partitions data into classes (intervals).
shows how many data values are in each
class.

Test Score    Number of Students
61-70         4
71-80         8
81-90         15
91-100        7
Data Classes and Class Frequency
Class: an interval of values.
Example: $61 \le x \le 70$

Frequency: the number of data values that fall
within a class.
Five data fall within the class $61 \le x \le 70$.

Relative Frequency: the proportion of data
values that fall within a class.
18% of the data fall within the class
$61 \le x \le 70$.
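
As a small sketch of how relative frequencies follow from class frequencies, the code below uses the counts from the test-score frequency table shown earlier (4, 8, 15, 7). The dictionary layout and variable names are assumptions made for illustration.

```python
# Class frequencies taken from the test-score table.
frequencies = {"61-70": 4, "71-80": 8, "81-90": 15, "91-100": 7}

total = sum(frequencies.values())

# Relative frequency = class frequency / total number of data values.
relative = {label: count / total for label, count in frequencies.items()}

for label, count in frequencies.items():
    print(f"{label:>7}  f = {count:2d}  relative f = {relative[label]:.3f}")
```

The relative frequencies always sum to 1, which is a quick check on the arithmetic.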
Structure of a Data Class
A data class is basically an interval on a number line.

It has:
A lower limit a and an
upper limit b.
A width.
A lower boundary and
an upper boundary
(integer data).
A midpoint.
Structure of a Data Class
A data class is basically an interval on a number line.

If a = 60 and b = 69
for integer data,
what is the value of
the lower boundary?

a). 60 b). 59.5

c). 9 d). 64.5


Constructing Data Classes
Find the class width.
$\text{class width} = \dfrac{\text{largest data value} - \text{smallest data value}}{\text{desired number of classes}}$
Increase the computed value to the next
higher whole number.

Find the class limits.
The lower limit of the leftmost class is set
equal to the smallest value in the data set.
Constructing Data Classes, cont'd
Find the class boundaries (integer data).
Subtract 0.5 from the lower class limit and
add 0.5 to the upper class limit.

For a certain data set, the minimum value is
25 and the maximum value is 58. If you
wish to partition the data into 5 classes,
what would be the class width?

a). 5 b). 6 c). 7 d). 8
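
The construction steps can be sketched for integer data as follows. Two conventions here are assumptions rather than something the slides state: the quotient is always increased to the next whole number (even when it is already whole), and each upper class limit is one less than the next lower limit. The function names are hypothetical.

```python
def class_width(smallest, largest, n_classes):
    """Quotient increased to the next higher whole number (integer data)."""
    return (largest - smallest) // n_classes + 1

def class_limits(smallest, width, n_classes):
    """Lower limit of the first class is the smallest data value."""
    return [(smallest + i * width, smallest + (i + 1) * width - 1)
            for i in range(n_classes)]

def class_boundaries(limits):
    """Subtract 0.5 from each lower limit and add 0.5 to each upper limit."""
    return [(low - 0.5, high + 0.5) for low, high in limits]

width = class_width(25, 58, 5)
limits = class_limits(25, width, 5)
boundaries = class_boundaries(limits)

print("width:", width)
print("limits:", limits)
print("boundaries:", boundaries)
```

For the question above, (58 - 25) / 5 = 6.6, which rounds up to a class width of 7.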


Building a Frequency Table
Find the class width, class limits, and class
boundaries of the data.
Use Tally marks to count the data in each class.
Record the frequencies (and relative frequencies if
desired) on the table.
Histograms
Histogram: a graphical summary of a frequency table.
Uses bars to plot the data classes versus the class
frequencies.
Making a Histogram
Make a frequency table.

Place class boundaries on horizontal axis.


Place frequencies on vertical axis.

For each class, draw a bar with height
equal to the class frequency and width
spanning the class boundaries (for integer data,
the upper limit minus the lower limit, plus 1).
Making a Histogram
Distribution Shapes
[Figure: example histograms of five distribution shapes: symmetric, uniform, bimodal, skewed left, and skewed right.]
Critical Thinking
A bimodal distribution shape might
indicate that the data are from two
different populations.
Outliers: data values that are very
different from other values in the data set.
Outliers may indicate data recording
errors.
Exploratory Data Analysis
EDA is the process of learning about a
data set by creating graphs.

EDA specifically looks for patterns and
trends in the data.

EDA also identifies extreme values.


Graphical Displays
represent the data.

induce the viewer to think about the
substance of the graphic.

should avoid distorting the message of
the data.
Bar Graphs
Used for qualitative or quantitative data.
Can be vertical or horizontal.
Bars are uniformly spaced and have
equal widths.
Length/height of bars indicate counts or
percentages of the variable.
Good practice requires including titles
and units and labeling axes.
Pareto Charts
A bar chart with two specific features:
Heights of bars represent frequencies.
Bars are vertical and are ordered from tallest
to shortest.
Circle Graphs/Pie Charts
Used for qualitative data
Wedges of the circle represent
proportions of the data that share a
common characteristic.
Good practice requires including a title
and either wedge labels or legend.
Time-Series
Shows data measurements in chronological order.
Data are plotted in order of occurrence at regular
intervals over a period of time.
Critical Thinking
which type of graph to use?
Bar graphs are useful for quantitative or
qualitative data.
Pareto charts identify the frequency in
decreasing order.
Circle graphs display how a total is
dispersed into several categories.
Time-series graphs display how data
change over time.
Critical Thinking
which type of graph to use?
What type of graph would be best for
showing the ice cream flavor preferences of
a group of 100 children?

a). Histogram b). Pareto graph


c). Time series graph d). Circle graph
Stem and Leaf Plots
Displays the distribution of the data while
maintaining the actual data values.
Each data value is split into a stem and a leaf.
Stem and Leaf Plot Construction
Critical Thinking
By looking at the stem-and-leaf display
sideways, we can see the
distribution shape of the
data.
Critical Thinking
Large gaps between stems containing
leaves, especially at the top or bottom,
suggest the existence of outliers.
Watch the outliers: are they data errors
or simply unusual data values?
Measures of Central Tendency
Average: a measure of the center value
or central tendency of a distribution of
values.

Three types of average:
Mode
Median
Mean
Mode
The mode is the most frequently occurring value in a
data set.

Example: Sixteen
students are asked
how many college
math classes they
have completed.

{0, 3, 2, 2, 1, 1, 0, 5,
1, 1, 0, 2, 2, 7,
1, 3}
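
A minimal sketch of finding the mode for the sixteen responses above, by tallying how often each value occurs. The variable names are illustrative only.

```python
from collections import Counter

# Number of college math classes completed by the sixteen students.
data = [0, 3, 2, 2, 1, 1, 0, 5, 1, 1, 0, 2, 2, 7, 1, 3]

counts = Counter(data)                      # value -> frequency
mode_value, mode_count = counts.most_common(1)[0]

print(dict(sorted(counts.items())))
print("mode:", mode_value, "occurs", mode_count, "times")
```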
Median
Finding the median:
1). Order the data from smallest to largest.

2). For an odd number of data values:
Median = Middle data value

3). For an even number of data values:
$\text{Median} = \dfrac{\text{sum of middle two values}}{2}$
Median
Find the median of the following data set.
{ 4, 6, 6, 7, 9, 12, 18, 19}

a). 6 b). 7 c). 8 d). 9
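
The median procedure can be sketched as a short function and applied to the data set in the question. The function name is a hypothetical choice.

```python
def median(values):
    """Middle value of the ordered data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [4, 6, 6, 7, 9, 12, 18, 19]
result = median(data)
print("median:", result)
```

With eight values, the two middle values are 7 and 9, so the median is 8.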


Mean
Sample mean: $\bar{x} = \dfrac{\sum x}{n}$
Population mean: $\mu = \dfrac{\sum x}{N}$

Find the mean of the following data set.


{3, 8, 5, 4, 8, 4, 10}

a). 8 b). 6.5 c). 6 d). 7
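
A short sketch of the mean calculation for the data set in the question: sum the values and divide by how many there are.

```python
data = [3, 8, 5, 4, 8, 4, 10]

total = sum(data)          # sum of all data values
n = len(data)              # number of data values
mean = total / n

print("sum:", total, "n:", n, "mean:", mean)
```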


Resistant Measures of Central Tendency

A resistant measure will not be affected
by extreme values in the data set.

The mean is not resistant to extreme
values.

The median is resistant to extreme
values.
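
To illustrate resistance, the sketch below adds one extreme value to a small data set and compares how far the mean and median move. The numbers are hypothetical and chosen only to make the effect visible.

```python
import statistics

values = [30, 32, 35, 38, 40]          # hypothetical data
with_extreme = values + [400]          # same data plus one extreme value

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(with_extreme)
median_after = statistics.median(with_extreme)

print("mean:  ", mean_before, "->", round(mean_after, 2))
print("median:", median_before, "->", median_after)
```

The mean moves from 35 to almost 96, while the median moves only from 35 to 36.5.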
Critical Thinking
Four levels of data: nominal, ordinal, interval,
ratio (Chapter 1)

Mode can be used with all four levels.

Median may be used with ordinal, interval, or
ratio level.

Mean may be used with interval or ratio level.

Critical Thinking
Mound-shaped data: values of
mean, median,
and mode are
nearly equal.
Critical Thinking
Skewed-left data: mean < median <
mode.
Critical Thinking
Skewed-right data: mean > median >
mode.
Measures of Variation
Three measures of variation:
range
variance
standard deviation

Range = Largest value - smallest value

Only two data values are used in the computation,
so much of the information in the data is lost.
Sample Variance and Standard Deviation
Sample Variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Sample Standard Deviation: $s = \sqrt{s^2}$
Find the standard deviation of the data set.
{2,4,6}

a). 2 b). 3 c). 4 d). 3.67
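
A minimal sketch of the sample variance and standard deviation, applied to the data set in the question. The function name is hypothetical; the divisor n - 1 is the one used for samples.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

data = [2, 4, 6]
s_squared = sample_variance(data)
s = math.sqrt(s_squared)

print("sample variance:", s_squared)
print("sample standard deviation:", s)
```

The deviations from the mean of 4 are -2, 0, and 2, so the variance is 8 / 2 = 4 and the standard deviation is 2.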


Population Variance
and Standard Deviation

Population Variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$
The Coefficient of Variation

For Samples: $CV = \dfrac{s}{\bar{x}} \cdot 100$
For Populations: $CV = \dfrac{\sigma}{\mu} \cdot 100$
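
The sketch below computes the coefficient of variation both ways for the small data set {2, 4, 6}, treating it first as a sample and then as a whole population. Reusing that data set here is an assumption made for illustration; the calls are from Python's standard `statistics` module.

```python
import statistics

data = [2, 4, 6]
mean = statistics.mean(data)

s = statistics.stdev(data)        # sample standard deviation (divisor n - 1)
sigma = statistics.pstdev(data)   # population standard deviation (divisor N)

cv_sample = s / mean * 100
cv_population = sigma / mean * 100

print("CV (sample):    ", round(cv_sample, 2), "%")
print("CV (population):", round(cv_population, 2), "%")
```

Because the CV is a ratio of two quantities in the same units, it is a unitless percentage.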
Percentiles and Quartiles
For whole numbers P, $1 \le P \le 99$, the Pth
percentile of a distribution is a value such
that P% of the data fall below it, and
(100 - P)% of the data fall at or above it.
Q1 = 25th Percentile
Q2 = 50th Percentile = The Median
Q3 = 75th Percentile
Quartiles and Interquartile Range (IQR)
Computing Quartiles
Five Number Summary
A listing of the following statistics:
Minimum, Q1, Median, Q3, Maximum

Box-and-whisker plot: represents the
five-number summary graphically.
Box-and-Whisker Plot Construction
Critical Thinking
Box-and-whisker plots display the spread of data
about the median.

If the median is centered and the whiskers are
about the same length, then the data distribution is
symmetric around the median.

Fences may be placed on either side of the box.
Values that lie beyond the fences are outliers. (See
problem 10)
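
A sketch of the five-number summary, the IQR, and outlier fences. Two conventions are assumptions, since the slides do not spell them out: quartiles are taken as the medians of the lower and upper halves of the ordered data, and the fences sit 1.5 IQR beyond the quartiles. The data set is borrowed from the earlier median question.

```python
import statistics

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum (quartiles as medians of the halves)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, the middle value belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

data = [4, 6, 6, 7, 9, 12, 18, 19]
minimum, q1, med, q3, maximum = five_number_summary(data)

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr      # assumed fence convention
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("five-number summary:", minimum, q1, med, q3, maximum)
print("IQR:", iqr, "fences:", lower_fence, upper_fence, "outliers:", outliers)
```

Other quartile conventions exist and can give slightly different values of Q1 and Q3 for the same data.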
Critical Thinking
Which of the following box-and-whiskers plots suggests a symmetric data distribution?

(a) (b) (c) (d)


The Four Moments

1st = mean (central tendency)
2nd = variance; its square root is the SD (dispersion)
3rd = skewness (lean / tail)
4th = kurtosis (peakedness /
flatness)
Scatter Diagrams
A graph in which pairs of points, (x, y), are plotted
with x on the horizontal axis and y on the vertical
axis.

The explanatory variable is x.

The response variable is y.

One goal of plotting paired data is to determine if
there is a linear relationship between x and y.
Paired Data (x, y)
Important Questions

How strong is the
linear correlation
between x and y?

What line best
represents the
data?
How Strong Is the Linear Correlation?
Not all relationships are linear.

Statisticians need a quantitative measure
of the strength of the linear association.
The Sample Correlation Coefficient r
Statisticians use the sample correlation coefficient r to
measure the strength of the linear correlation
between paired data.
1) r has no units.
2) $-1 \le r \le 1$
3) r > 0 indicates a positive relationship between x and
y , r < 0 indicates a negative relationship.
4) r = 0 indicates no linear relationship.
5) Switching the explanatory variable and response
variable does not change r.
6) Changing the units of the variables does not change
r.
A Computational Formula for r
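
A sketch of computing r, assuming the standard computational form $r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$. The paired data and the function name are hypothetical. The last lines check two of the listed properties: swapping x and y, or changing the units of x, leaves r unchanged.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r from sums of x, y, xy, x^2, y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

xs = [1, 2, 3, 4, 5]      # hypothetical paired data
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_swapped = correlation(ys, xs)                        # roles of x and y swapped
r_rescaled = correlation([100 * x for x in xs], ys)    # x in different units

print("r =", round(r, 4))
print("swapped:", round(r_swapped, 4), "rescaled:", round(r_rescaled, 4))
```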
Illustration
Caribou (x, in hundreds) and wolf (y) populations
Interpreting the Value of r
r=0
There is no linear relation for the points of the
scatter diagram.
Interpreting the Value of r
$r = 1$ or $r = -1$
There is a perfect linear relation between x
and y; all points lie on a straight line.
Interpreting the Value of r
0<r<1
The x and y values have a positive
correlation. As x increases, y tends to
increase.
Interpreting the Value of r
$-1 < r < 0$
The x and y values have a negative
correlation. As x increases, y tends to
decrease.
Which of the following shows a strong
negative correlation?

a). b).

c). d).
Critical Thinking

Expect r to vary from sample to sample.

So, consider the significance of r as well
as its value when assessing the strength
of a linear correlation. (Section 11.4)
Critical Thinking

$|r| \approx 1$ only implies a linear relationship
between x and y.

It does not imply a cause and effect
relationship between x and y.

The values of x and y may both depend
linearly on some third lurking variable.
Critical Thinking
Over the past few years, there has been a
strong positive relationship between the
annual consumption of coffee and the
number of computers sold per year.
Which conclusion is the best one to draw
from this strong correlation?
a). Coffee consumption stimulates computer
sales.
b). Computer users are sophisticated and
thus are inclined to drinking coffee.
c). The correlation is purely accidental.
Linear Regression
Linear Regression: a mathematical
technique for creating a linear model for
paired data.

Based on the least-squares criterion of
best fit.
Caribou and wolf populations in Denali
National Park
Questions
Do the data points have
a linear relationship?
How do we find an
equation for the best
fitting line?
Can we predict the value
of the response variable
for a new value of the
predictor variable?
What fractional part of
the variability in y is
associated with the
variability in x?
Least-Squares Criterion
Properties of the Regression
Equation
The point $(\bar{x}, \bar{y})$ is always on the least-squares line.

The slope tells us the amount that y
changes when x increases by one unit.
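
A sketch of fitting a least-squares line of the form $\hat{y} = a + bx$, assuming the standard formulas $b = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$ and $a = \bar{y} - b\bar{x}$. The symbols a and b, the function name, and the data are hypothetical. The final lines check the first property above: the point of means lies on the fitted line.

```python
def least_squares_line(xs, ys):
    """Return intercept a and slope b of the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

xs = [1, 2, 3, 4, 5]      # hypothetical paired data
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
predicted_at_mean = a + b * x_bar

print("intercept a =", round(a, 2), "slope b =", round(b, 2))
print("line at x-bar:", round(predicted_at_mean, 2), "y-bar:", y_bar)
```

Here the slope of 0.6 means the predicted y rises by 0.6 for each one-unit increase in x.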
Illustration
Caribou (x, in hundreds) and wolf (y)
populations
Illustration
Least-squares linear relationship between
caribou and wolf populations:

$\hat{y} = 22.35 + 1.60x$
Critical Thinking: Making Predictions

We can simply plug x values into the
regression equation to calculate y values.

Extrapolation may produce unrealistic
forecasts.
Coefficient of Determination
Another way to gauge the fit of the
regression equation is to calculate the
coefficient of determination, $r^2$.

1). Compute r. Simply square this value to
get $r^2$.
2). $r^2$ is the fractional amount of total
variation in y that can be explained using
the linear model.
3). $1 - r^2$ is the fractional amount of total
variation in y that is due to random chance
and/or to lurking variables.
Coefficient of Determination
The linear correlation coefficient for a set of
paired data is r = 0.86.

What fractional amount of the total variation
in y is due to random chance and/or to
lurking variables?

a). 0.86 b). 0.14 c). 0.74 d). 0.26
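
A short sketch of the arithmetic behind the question: square r to get the explained fraction, then subtract from 1 to get the unexplained fraction. The variable names are illustrative.

```python
r = 0.86

r_squared = r ** 2             # fraction of variation in y explained by the linear model
unexplained = 1 - r_squared    # fraction left to chance and/or lurking variables

print("r^2 =", round(r_squared, 4))
print("1 - r^2 =", round(unexplained, 4))
```

Rounded to two decimals, about 0.74 of the variation is explained and about 0.26 is not.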

