Statistics 1

What is Statistics?
The science of collecting, describing,
analyzing, and interpreting data.
Collecting data
Organizing data
Analyzing data
Interpreting data
Individuals and Variables
Individuals are people or objects included in the study.
Variables are characteristics of the individual to be measured or observed.
Types of Variables
Variables can be classified as
discrete or continuous.
Discrete variables (such as class
size) consist of indivisible categories,
and continuous variables (such as
time or weight) are infinitely divisible
into whatever units a researcher may
choose. For example, time can be
measured to the nearest minute,
second, half-second, etc.
Variables
Quantitative Variable: The variable is numerical, so operations such as
adding and averaging make sense.
Qualitative Variable: The variable describes an individual through
grouping or categorization.
Which of the following is an example of a qualitative variable?
a). Age b). Mass
c). Religious preference d). Batting average
Data
Population Data: The data are from every individual of interest.
Parameter: a fixed, unknown number that describes the population.
Sample Data: The data are from only some of the individuals of interest.
Statistic: a known value calculated from a sample; often used to estimate a
parameter.
A properly chosen sample of 1600 people
across India was asked if they regularly
watch a certain television program, and
24% said yes. The parameter of interest
here is the true proportion of all people in
India who watch the program, while the
statistic is the value 24% obtained from the
sample of 1600 people.
Data
Which of the following Venn diagrams
shows the relationship between
population data and sample data?
[Figure: four Venn diagrams, options a) to d), each showing a different arrangement of the sets P (population) and S (sample).]
Levels of Measurement
Nominal Level: The data consist of names, labels, or categories.
Ordinal Level: The data can be ordered, but the differences between
data values are meaningless.
Interval Level: The data can be ordered and the differences between
data values are meaningful.
Ratio Level: The data can be ordered, differences and ratios are
meaningful, and there is a meaningful zero value.
The freezing points of four liquids are
32°F, 6°F, 13°F, and 20°F. What is
the level of these measurements?
a). Nominal
b). Ordinal
c). Interval
d). Ratio
Researchers want data on taste of a group of
pineapples. A panel of tasters rates the
pineapples according to the categories
poor, acceptable, and good. Only some
of the pineapples are included in the taste
test. In this case, the _______ is taste. This is
a _________ variable. Because only some of
the pineapples in the field are included in the
study, we have a __________. The proportion
of pineapples in the sample with a taste rating
of good is a __________.
Researchers want data on taste of a group of
pineapples. A panel of tasters rates the
pineapples according to the categories
poor, acceptable, and good. Only some
of the pineapples are included in the taste
test. In this case, the variable is taste. This is
a qualitative variable. Because only some of
the pineapples in the field are included in the
study, we have a sample. The proportion of
pineapples in the sample with a taste rating of
good is a statistic.
Two Branches of Statistics
Descriptive Statistics: Organizing,

summarizing, and graphing
information from populations or
samples.
Inferential Statistics: Using

information from a sample to draw
conclusions about a population.
Application of Statistics
In the early 20th century, two of the most important areas of applied
statistics were population biology and agriculture.
Nowadays the ideas of statistics are everywhere.
Descriptive statistics are featured in every newspaper and magazine.
Statistical inference has become indispensable
to public health and medical research,
to marketing and quality control,
to education,
to accounting,
to economics,
to meteorological forecasting,
to polling and surveys,
to sports,
to insurance,
to gambling, and
to all research that makes any claim to being scientific.
Sampling Techniques
Simple Random Sampling, Sample
size = n
Each member of the population has an
equal chance of being selected.
Each sample of size n has an equal
chance of being selected.
Stratified sampling
[Figure: a population divided into Subgroups 1 to 4, with a sample drawn from the subgroups.]
Sampling Techniques
Systematic sampling
Number every member of the population.
Select every kth member.
Cluster sampling
Population is naturally divided into pre-existing
segments.
Make a random selection of clusters, then select
all members of each cluster.
Convenience sampling - Collect sample data

from a readily-available population database.
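A minimal Python sketch of three of the techniques above. The population numbered 1 to 100, the five clusters of 20, and the random starting point in the systematic sample are assumptions made for illustration only.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n members so that every sample of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, k, seed=None):
    """Pick a random start among the first k members, then take every kth member."""
    rng = random.Random(seed)
    start = rng.randrange(k)
    return population[start::k]

def cluster_sample(clusters, m, seed=None):
    """Randomly choose m clusters and keep all members of each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(clusters, m)
    return [member for cluster in chosen for member in cluster]

population = list(range(1, 101))  # hypothetical population, numbered 1..100
srs = simple_random_sample(population, 10, seed=1)
systematic = systematic_sample(population, 10, seed=1)
clusters = [population[i:i + 20] for i in range(0, 100, 20)]  # five made-up clusters
clustered = cluster_sample(clusters, 2, seed=1)

print(srs)
print(systematic)
print(len(clustered))
```

The seed only makes the draw repeatable; leave it out for a fresh sample each run.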
Guidelines For Planning a Statistical
Study
1. Identify individuals or objects of
interest.
2. Specify the variables.
3. Determine if you will use the
entire population. If not,
determine an appropriate
sampling method
4. Determine a data collection plan,
addressing privacy, ethics, and
confidentiality if necessary.
5. Collect data.
6. Analyze the data using appropriate
statistical methods.
7. Note any concerns about the data
and recommend any remedies for
further studies.
Census vs. Sample
In a census, measurements or
observations are obtained from the
entire population (uncommon and
often impractical).
In a sample, measurements or
observations are obtained from part
of the population (common).
Surveys
Collecting data from respondents by asking them
questions.
Survey Pitfalls
Nonresponse: undercoverage of population.
Truthfulness: respondents sometimes lie.
Faulty recall of respondent.
Hidden bias: due to poor question wording.
Vague wording: sometimes, often, seldom.
Interviewer influence: who is asking the
questions and in what manner.
Voluntary response: relatively interested
individuals are more likely to participate.
Frequency Tables
A frequency table
organizes quantitative data.
partitions data into classes (intervals).
shows how many data values are in each class.

Test Score    Number of Students
61-70         4
71-80         8
81-90         15
91-100        7
Data Classes and Class Frequency
Class: an interval of values.
Example: $61 \le x \le 70$
Frequency: the number of data values that fall within a class.
Five data fall within the class $61 \le x \le 70$.
Relative Frequency: the proportion of data values that fall within a class.
18% of the data fall within the class $61 \le x \le 70$.
Structure of a Data Class
A data class is basically an interval on a number line.
It has:
A lower limit a and an
upper limit b.
A width.
A lower boundary and
an upper boundary
(integer data).
A midpoint.
If a = 60 and b = 69
for integer data,
what is the value of
the lower boundary?
a). 60 b). 59.5
c). 9 d). 64.5

Constructing Data Classes
Find the class width.
$\text{class width} = \frac{\text{largest data value} - \text{smallest data value}}{\text{desired number of classes}}$
Increase the computed value to the next higher whole number.
Find the class limits.
The lower limit of the leftmost class is set
equal to the smallest value in the data set.
Constructing Data Classes, cont'd
Find the class boundaries (integer data).
Subtract 0.5 from the lower class limit and
add 0.5 to the upper class limit.
For a certain data set, the minimum value is

25 and the maximum value is 58. If you
wish to partition the data into 5 classes,
what would be the class width?
a). 5 b). 6 c). 7 d). 8
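The construction steps can be sketched in Python for the question above (minimum 25, maximum 58, 5 classes). Two assumptions are made here: "next higher whole number" is read strictly, so a value that is already whole is also raised by one, and for integer data each upper limit is taken as lower limit + width − 1.

```python
import math

def class_width(smallest, largest, num_classes):
    """Range divided by the number of classes, raised to the next higher whole number."""
    raw = (largest - smallest) / num_classes
    return math.floor(raw) + 1  # assumption: always move up, even if raw is whole

def integer_classes(smallest, largest, num_classes):
    """Class limits, boundaries, and midpoints for integer data."""
    width = class_width(smallest, largest, num_classes)
    classes = []
    lower = smallest  # leftmost lower limit is the smallest data value
    for _ in range(num_classes):
        upper = lower + width - 1  # assumption for integer data
        classes.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "midpoint": (lower + upper) / 2,
        })
        lower += width
    return classes

width = class_width(25, 58, 5)  # 33 / 5 = 6.6, raised to 7
classes = integer_classes(25, 58, 5)
print(width)
for c in classes:
    print(c["limits"], c["boundaries"], c["midpoint"])
```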

Building a Frequency Table
Find the class width, class limits, and class
boundaries of the data.
Use Tally marks to count the data in each class.
Record the frequencies (and relative frequencies if
desired) on the table.
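A short Python sketch of the tallying step, assuming the class limits are already chosen. The twelve test scores are made up for illustration.

```python
def frequency_table(data, class_limits):
    """Count how many values fall within each (lower, upper) class, inclusive."""
    rows = []
    for lower, upper in class_limits:
        count = sum(1 for x in data if lower <= x <= upper)
        rows.append((lower, upper, count, count / len(data)))
    return rows

scores = [62, 65, 70, 74, 78, 80, 83, 85, 88, 90, 93, 97]  # made-up scores
limits = [(61, 70), (71, 80), (81, 90), (91, 100)]
table = frequency_table(scores, limits)

for lower, upper, freq, rel in table:
    print(f"{lower}-{upper}: frequency {freq}, relative frequency {rel:.2f}")
```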
Histograms
Histogram: a graphical summary of a frequency table.
Uses bars to plot the data classes versus the class
frequencies.
Making a Histogram
Make a frequency table.
Place class boundaries on horizontal axis.

Place frequencies on vertical axis.
For each class, draw a bar with height equal to the class frequency
and width extending from the lower class boundary to the upper
class boundary.
Making a Histogram
Distribution Shapes
[Figure: five histogram shapes labeled Symmetric, Uniform, Bimodal, Skewed Left, and Skewed Right.]
Critical Thinking
A bimodal distribution shape might
indicate that the data are from two
different populations.
Outliers: data values that are very
different from other values in the data set.
Outliers may indicate data recording
errors.
Exploratory Data Analysis
EDA is the process of learning about a
data set by creating graphs.
EDA specifically looks for patterns and

trends in the data.
EDA also identifies extreme values.

Graphical Displays
represent the data.
induce the viewer to think about the

substance of the graphic.
should avoid distorting the message of

the data.
Bar Graphs
Used for qualitative or quantitative data.
Can be vertical or horizontal.
Bars are uniformly spaced and have
equal widths.
Length/height of bars indicate counts or
percentages of the variable.
Good practice requires including titles
and units and labeling axes.
Pareto Charts
A bar chart with two specific features:
Heights of bars represent frequencies.
Bars are vertical and are ordered from tallest
to shortest.
Circle Graphs/Pie Charts
Used for qualitative data
Wedges of the circle represent
proportions of the data that share a
common characteristic.
Good practice requires including a title
and either wedge labels or legend.
Time-Series
Shows data measurements in chronological order.
Data are plotted in order of occurrence at regular
intervals over a period of time.
Critical Thinking
Which type of graph to use?
Bar graphs are useful for quantitative or
qualitative data.
Pareto charts identify the frequency in
decreasing order.
Circle graphs display how a total is
dispersed into several categories.
Time-series graphs display how data
change over time.
Critical Thinking
What type of graph would be best for
showing the ice cream flavor preferences of
a group of 100 children?
a). Histogram b). Pareto graph

c). Time series graph d). Circle graph
Stem and Leaf Plots
Displays the distribution of the data while
maintaining the actual data values.
Each data value is split into a stem and a leaf.
Stem and Leaf Plot Construction
Critical Thinking
By looking at the stem-
and-leaf display
sideways, we can see the
distribution shape of the
data.
Critical Thinking
Large gaps between stems containing
leaves, especially at the top or bottom,
suggest the existence of outliers.
Watch the outliers: are they data errors
or simply unusual data values?
Measures of Central Tendency
Average: a measure of the center value
or central tendency of a distribution of
values.
Three types of average:

Mode
Median
Mean
Mode
The mode is the most frequently occurring value in a
data set.
Example: Sixteen
students are asked
how many college
math classes they
have completed.
{0, 3, 2, 2, 1, 1, 0, 5,
1, 1, 0, 2, 2, 7,
1, 3}
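The mode of this example can be checked with a short Python sketch. Returning a list of modes is a design choice made here, so that ties would also be reported.

```python
from collections import Counter

# Number of college math classes completed by the sixteen students
classes_completed = [0, 3, 2, 2, 1, 1, 0, 5, 1, 1, 0, 2, 2, 7, 1, 3]

counts = Counter(classes_completed)
highest = max(counts.values())
modes = sorted(value for value, count in counts.items() if count == highest)

print(modes, highest)
```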
Median
Finding the median:
1). Order the data from smallest to largest.
2). For an odd number of data values:
Median = Middle data value
3). For an even number of data values:
$\text{Median} = \frac{\text{Sum of middle two values}}{2}$
Median
Find the median of the following data set.
{ 4, 6, 6, 7, 9, 12, 18, 19}
a). 6 b). 7 c). 8 d). 9
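The three steps for finding the median can be sketched in Python and applied to the data set in the question.

```python
def median(values):
    """Order the data, then take the middle value or the mean of the middle two."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

result = median([4, 6, 6, 7, 9, 12, 18, 19])  # middle two values are 7 and 9
print(result)
```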

Mean
Sample mean: $\bar{x} = \frac{\sum x}{n}$
Population mean: $\mu = \frac{\sum x}{N}$
Mean
Find the mean of the following data set.

{3, 8, 5, 4, 8, 4, 10}
a). 8 b). 6.5 c). 6 d). 7
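A minimal Python check of the mean for the data set in the question: add the values and divide by how many there are.

```python
def mean(values):
    """Sum of the data values divided by the number of values."""
    return sum(values) / len(values)

data = [3, 8, 5, 4, 8, 4, 10]
total = sum(data)      # sum of the seven values
average = mean(data)
print(total, average)
```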

Resistant Measures of Central Tendency
A resistant measure will not be affected

by extreme values in the data set.
The mean is not resistant to extreme

values.
The median is resistant to extreme

values.
Critical Thinking
Four levels of data: nominal, ordinal, interval,
ratio (Chapter 1)
Mode can be used with all four levels.
Median may be used with ordinal, interval, or
ratio level.
Mean may be used with interval or ratio level.

Critical Thinking
Mound-shaped data: values of mean, median,
and mode are nearly equal.
Critical Thinking
Skewed-left data: mean < median < mode.
Critical Thinking
Skewed-right data: mean > median > mode.
Measures of Variation
Three measures of variation:
range
variance
standard deviation
Range = Largest value − smallest value
Only two data values are used in the computation,

so much of the information in the data is lost.
Sample Variance and Standard Deviation
Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Sample Standard Deviation: $s = \sqrt{s^2}$
Find the standard deviation of the data set.
{2,4,6}
a). 2 b). 3 c). 4 d). 3.67
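A small Python sketch of the sample variance and sample standard deviation, dividing by n − 1, applied to the data set in the question.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 6]
variance = sample_variance(data)  # deviations -2, 0, 2; squares sum to 8; 8 / 2
std = sample_std(data)
print(variance, std)
```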

Population Variance
and Standard Deviation
Population Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$
The Coefficient of Variation
For Samples: $CV = \frac{s}{\bar{x}} \cdot 100$
For Populations: $CV = \frac{\sigma}{\mu} \cdot 100$
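A hedged Python sketch comparing the two versions of the coefficient of variation. The data set [2, 4, 6] is used only for illustration; treating it as a sample gives a divisor of n − 1, while treating it as a whole population gives a divisor of N.

```python
import math

def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    xbar = sum(values) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in values) / (n - 1))

def population_std(values):
    """Population standard deviation (divides by N)."""
    N = len(values)
    mu = sum(values) / N
    return math.sqrt(sum((x - mu) ** 2 for x in values) / N)

def cv(std, mean):
    """Coefficient of variation: standard deviation as a percentage of the mean."""
    return std / mean * 100

data = [2, 4, 6]
center = sum(data) / len(data)
cv_sample = cv(sample_std(data), center)
cv_population = cv(population_std(data), center)
print(cv_sample, cv_population)
```

Because the CV is a ratio of two quantities in the same units, it has no units itself.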
Percentiles and Quartiles
For whole numbers P, $1 \le P \le 99$, the Pth
percentile of a distribution is a value such
that P% of the data fall below it, and
(100 − P)% of the data fall at or above it.
Q1 = 25th Percentile
Q2 = 50th Percentile = The Median
Q3 = 75th Percentile
Quartiles and Interquartile Range (IQR)
Computing Quartiles
Five Number Summary
A listing of the following statistics:
Minimum, Q1, Median, Q3, Maximum
Box-and-Whisker plot: represents the
five-number summary graphically.
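A Python sketch of a five-number summary for the data set {4, 6, 6, 7, 9, 12, 18, 19}. Quartile conventions vary between textbooks and software; this sketch assumes Q1 and Q3 are the medians of the lower and upper halves of the ordered data, with the middle value left out of both halves when the count is odd.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum (median-of-halves convention, assumed)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

summary = five_number_summary([4, 6, 6, 7, 9, 12, 18, 19])
iqr = summary[3] - summary[1]  # interquartile range, Q3 - Q1
print(summary, iqr)
```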
Box-and-Whisker Plot Construction
Critical Thinking
Box-and-whisker plots display the spread of data
about the median.
If the median is centered and the whiskers are

about the same length, then the data distribution is
symmetric around the median.
Fences may be placed on either side of the box.
Values that lie beyond the fences are outliers. (See
problem 10)
Critical Thinking
Which of the following box-and-whiskers plots suggests a symmetric data distribution?
(a) (b) (c) (d)

The Four Moments
1st = mean (central tendency)
2nd = variance, the square of the SD (dispersion)
3rd = skewness (lean / tail)
4th = kurtosis (peakedness / flatness)
Scatter Diagrams
A graph in which pairs of points, (x, y), are plotted
with x on the horizontal axis and y on the vertical
axis.
The explanatory variable is x.
The response variable is y.
One goal of plotting paired data is to determine if

there is a linear relationship between x and y.
Paired Data (x, y)
Important Questions
How strong is the

linear correlation
between x and y?
What line best

represents the
data?
How Strong Is the Linear Correlation?
Not all relationships are linearly correlated.
Statisticians need a quantitative measure

of the strength of the linear association.
The Sample Correlation Coefficient r
Statisticians use the sample correlation coefficient r to
measure the strength of the linear correlation
between paired data.
1) r has no units.
2) $-1 \le r \le 1$
3) r > 0 indicates a positive relationship between x and
y , r < 0 indicates a negative relationship.
4) r = 0 indicates no linear relationship.
5) Switching the explanatory variable and response
variable does not change r.
6) Changing the units of the variables does not change
r.
A Computational Formula for r
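One widely used computational form is $r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$. The Python sketch below applies it to made-up paired data (not the caribou and wolf data) and checks that swapping the variables or changing the units of x leaves r unchanged.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r from the sums of x, y, xy, x^2, and y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

xs = [1, 2, 3, 4, 5]  # made-up explanatory values
ys = [2, 3, 5, 4, 6]  # made-up response values

r = correlation(xs, ys)
r_swapped = correlation(ys, xs)                      # switch x and y
r_rescaled = correlation([100 * x for x in xs], ys)  # change the units of x
print(r, r_swapped, r_rescaled)
```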
Illustration
Caribou (x, in hundreds) and wolf (y) populations
Interpreting the Value of r
$r = 0$
There is no linear relation for the points of the
scatter diagram.
$r = 1$ or $r = -1$
There is a perfect linear relation between x
and y; all points lie on a straight line.
$0 < r < 1$
The x and y values have a positive
correlation. As x increases, y tends to
increase.
$-1 < r < 0$
The x and y values have a negative
correlation. As x increases, y tends to
decrease.
Which of the following shows a strong
negative correlation?
a). b).
c). d).
Critical Thinking
Expect r to vary from sample to sample.
So, consider the significance of r as well

as its value when assessing the strength
of a linear correlation. (Section 11.4)
Critical Thinking
$|r| \approx 1$ only implies a linear relationship
between x and y.
It does not imply a cause and effect

relationship between x and y.
The values of x and y may both depend

linearly on some third lurking variable.
Critical Thinking
Over the past few years, there has been a
strong positive relationship between the
annual consumption of coffee and the
number of computers sold per year.
Which conclusion is the best one to draw
from this strong correlation?
a). Coffee consumption stimulates computer
sales.
b). Computer users are sophisticated and
thus are inclined to drinking coffee.
c). The correlation is purely accidental.
Linear Regression
Linear Regression - a mathematical
technique for creating a linear model for
paired data.
Based on the least-squares criterion of

best fit.
Caribou and wolf populations in Denali
National Park
Questions
Do the data points have
a linear relationship?
How do we find an
equation for the best
fitting line?
Can we predict the value
of the response variable
for a new value of the
predictor variable?
What fractional part of
the variability in y is
associated with the
variability in x?
Least-Squares Criterion
Properties of the Regression
Equation
The point $(\bar{x}, \bar{y})$ is always on the least-squares line.
The slope tells us the amount that y

changes when x increases by one unit.
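Both properties can be checked with a short Python sketch. It uses the standard least-squares formulas for a line $\hat{y} = a + bx$, namely $b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$ and $a = \bar{y} - b\bar{x}$, and the paired data are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return intercept a and slope b of the least-squares line y-hat = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

def predict(a, b, x):
    """Plug an x value into the regression equation."""
    return a + b * x

xs = [1, 2, 3, 4, 5]  # made-up explanatory values
ys = [2, 3, 5, 4, 6]  # made-up response values

a, b = least_squares_line(xs, ys)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

print(a, b)
print(predict(a, b, x_bar), y_bar)               # the line passes through (x-bar, y-bar)
print(predict(a, b, 4) - predict(a, b, 3))       # change in y-hat per unit of x is b
```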
Illustration
Caribou (x, in hundreds) and wolf (y)
populations
Illustration
Least-squares linear relationship between
caribou and wolf populations:
$\hat{y} = 22.35 + 1.60x$
Critical Thinking: Making Predictions
We can simply plug in x values into the

regression equation to calculate y values.
Extrapolation may produce unrealistic

forecasts.
Coefficient of Determination
Another way to gauge the fit of the
regression equation is to calculate the
coefficient of determination, $r^2$.
1). Compute r. Simply square this value to
get $r^2$.
2). $r^2$ is the fractional amount of total
variation in y that can be explained using
the linear model.
3). $1 - r^2$ is the fractional amount of total
variation in y that is due to random chance
and/or to lurking variables.
The linear correlation coefficient for a set of
paired data is r = 0.86.
What fractional amount of the total variation

in y is due to random chance and/or to
lurking variables?
a). 0.86 b). 0.14 c). 0.74 d). 0.26
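The arithmetic for this question, as a minimal Python sketch: square r to get the explained fraction, then subtract from 1 for the fraction left unexplained by the linear model.

```python
r = 0.86

r_squared = r ** 2           # fraction of variation in y explained by the linear model
unexplained = 1 - r_squared  # fraction due to random chance and/or lurking variables

print(round(r_squared, 2), round(unexplained, 2))
```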
