Beruflich Dokumente
Kultur Dokumente
Version 2
2011
Instructor Comments:
This document contains a brief overview of basic statistics and core
terminology/concepts used in data analysis/problem-solving. It also includes
a practice test at the end of the document. Note: answers to the practice
test questions are included in an appendix.
Table of Contents
1.
2.
3.
4.
5.
Appendices:
A Useful Excel Functions
B Practice Test
Qualitative Data
1.2
Quantitative Data
2. DESCRIPTIVE STATISTICS
Descriptive statistics are used to summarize the characteristics of a data set.
Key Learning Skills
Understand the difference between a sample and a population
Understand the difference between a parameter and a statistic
Compute the mean, median, standard deviation, variance and range for a
sample data set
2.1 Sample Data versus Population Data
A population includes all items of a data set. For instance, one might have a
data set containing the height of every person in the United States. If the
desired information is available for all items in the population, we have what
is referred to as a census. In practice, we rarely have a complete set of
data. We usually collect data in samples or subsets of the population.
2.2 Parameters and Statistics
Numbers used to describe a population are parameters and often are
denoted using Greek letters. Numbers used to describe a sample data set
are called statistics. Statistics are used to estimate population parameters.
For example, the average of a data sample (denoted as X or ) provides an
estimate of the population mean, .
The difference between a statistic and a parameter is important because in
data analysis we often make inferences about a population based on sample
statistics. Since we rarely know every observation in a population, any
conclusions or recommendations that are made based on sample statistics
are subject to error. In general, we typically accept some margin of error
rather than incur the cost of measuring every observation.
Many descriptive statistics fall under two general types: measures of
central tendency or measures of dispersion/spread. Measures of central
tendency are used to identify the typical or most frequently occurring value.
The most commonly used measures of central tendency are mean and median.
Pat Hammett, PhD, University of Michigan
Mean
X 1 X 2 ... X N
N
The typical notation used to represent the mean for a sample of data is X ;
the Greek letter is used to denote the mean of a population. Again, the
terms, X or , are used to describe a sample estimate for a population mean.
Example: Suppose five students take a test and their scores are 70, 68, 71,
69 and 98.
Mean = (70+68+71+69+98)/5 = 75.2
Of note, the mean may be strongly influenced by extreme values. If we
excluded the student whose score was 98, the mean would change to 69.5.
To offset the influence of a single value within a data set, we may choose to
report the median as the best estimate of central tendency.
Median (also known as the 50th percentile) is the middle observation in a
data set. To determine the median, we may first rank the data set and then
select the middle value. If the data set has an odd number of observations,
the middle value is the observation number [N + 1] / 2. If the data set has
an even number of observations, the middle value is extrapolated as midway
between observation numbers N / 2 and [N / 2] + 1.
In the above example, the data ranked is 68, 69, 70, 71, and 98. Here, the
median is 70. If another student with a score of 60 was included, the new
median would 69.5 (69 + 70 / 2).
The median often is preferred if the data has extreme values (outliers) or
its distribution is skewed (e.g., if one of the tails of a bell-shaped curve is
significantly longer than the other). In the above example of student test
scores, the median provides a better representation of the center of the
distribution since 98 is an extreme value.
2.4 Dispersion Statistics (measures of variability)
Standard deviation (StDev) measures the dispersion of the individual
observations from the mean. In a sample data set, the standard deviation is
also referred to as the sample standard deviation or the root-mean-square
Srms and may be calculated using the following formula.
n
(X
i 1
X )2
n 1
S
2
(X
i 1
X )2
n 1
Above Example:
deviations!
3. FREQUENCY DISTRIBUTIONS
Frequency is used to describe the number of times a value or a range of values
occurs in a data set. Cumulative frequencies are used to describe the number of
observations less than, or greater than a specific value.
Key Learning Skills
Understand the difference between absolute, relative, and cumulative
frequencies
Create a frequency table
Create a histogram
3.1 Frequency Measures
10
Frequency Table
Combination
(1,1)
(1,2) (2,1)
(1,3) (3,1) (2,2)
(1,4) (4,1) (2,3) (3,2)
(1,5) (5,1) (2,4) (4,2)
(3,3)
(1,6) (6,1) (2,5) (5,2)
(3,4) (4,3)
(2,6) (6,2) (3,5) (5,3)
(4,4)
(3,6) (6,3) (4,5) (5,4)
(4,6) (6,4) (5,5)
(5,6) (6,5)
(6,6)
Sum
Absolute
Frequency
2
3
4
5
6
1
2
3
4
5
1
3
6
10
15
0.03
0.06
0.08
0.11
0.14
Cum.
Rel.
Freq,
0.03
0.08
0.17
0.28
0.42
21
0.17
0.58
26
0.14
0.72
9
10
11
12
Total
4
3
2
1
36
30
33
35
36
0.11
0.08
0.06
0.03
0.83
0.92
0.97
1.00
Cumulative Relative
Frequency
Freq
3.2 Histogram
A histogram is a graphical representation of a frequency table. Histograms
also may be used to show the shape of a distribution. Some common
distribution shapes are bell-shaped (normal), strictly decreasing
(exponential), or skewed. Skewed distributions are similar to normal
distributions only one tail is significantly larger than the other. For example,
a skewed right distribution has a basic bell-shaped curve with more
stretch/longer tail to the right (or to the left). The figure below shows each
of these shapes.
11
Skewed Right
30
30
25
25
Frequency
Frequency
Normal
20
15
10
20
15
10
Exponential (Decreasing)
30
Frequency
25
20
15
10
5
0
Frequency
Histogram
7
6
5
4
3
2
1
0
2
10
11
12
12
25.34
25.40
25.41
25.39
25.36
25.38
25.41
25.41
25.42
25.41
25.40
25.46
25.50
25.39
25.38
25.34
25.43
25.36
25.39
25.43
25.39
25.35
25.41
25.37
25.37
25.33
25.41
25.32
25.41
25.42
25.43
25.41
25.38
25.36
25.39
25.39
25.42
25.42
25.40
25.40
25.45
25.36
25.35
25.44
25.39
25.36
25.37
25.46
25.42
25.41
25.36
25.37
25.40
25.36
25.40
25.40
25.41
25.41
25.36
25.38
25.37
25.37
25.38
25.42
25.39
25.43
25.40
25.40
25.35
25.46
25.41
25.36
25.40
25.44
25.35
25.34
25.39
25.45
25.41
25.38
First, we arrange the data into frequency or bin ranges of equal width. The
selection of the number and width of the bins (frequency ranges) is
dependent on the analyst. For continuous data, a general rule of thumb is to
set the number of bins equal to the square root of the number of samples
(rounded to the nearest whole number). To obtain the bin width, divide the
range of the data set by the number of bins (rounded to desired precision of
measurement data). This example has 100 samples and a range of 0.23. Thus,
we might create 10 bins (=sqrt(100)) of width 0.02 mm (0.23/10 = 0.02). In
this example, the value 25.34 was chosen as the starting point because
relatively few values are below it. We typically use software to set the initial
starting point and bin widths and then may use experience/logic to adjust
them to provide the best representation of the distribution of a data set.
13
Bin Range
< 25.34
25.34 < X <= 25.36
25.36 < X <= 25.38
25.38 < X <= 25.4
25.40 < X <= 25.42
25.42 < X <= 25.44
25.44 < X <= 25.46
25.46 < X <= 25.48
25.48 < X <= 25.5
X > 25.50
Absolute
Frequency
3
13
14
24
27
6
9
2
1
1
Cumulative
Frequency
3
16
30
54
81
87
96
98
99
100
30
25
20
15
10
5
0
25
.3
4
25
.3
6
25
.3
8
25
.4
25
.4
2
25
.4
4
25
.4
6
25
.4
8
25
.5
M
or
e
Frequency
Histogram
Bin
Interpreting a Continuous Data Histogram
The absolute frequencies in a continuous data histogram represent the
number of observations that fall within a range. In the graph above, the
first column (labeled 25.34) represents the number of observations less
than or equal to 25.34. The second column (labeled 25.36) represents the
number of observations greater than 25.34 and less than or equal to 25.36.
The third column is the number of observations greater than 25.36 and less
than or equal to 25.38. Note: The above categories assume the use of cut
points to establish bins.
14
4. NORMAL DISTRIBUTION
The Normal distribution is the most commonly found pattern for randomly
measured data. It visually resembles a smooth, symmetrical, bell-shaped curve.
This distribution has been used to describe many variables/processes including
intelligence test results, part measurement results from inspection equipment,
and measurement errors. In fact, the failure to find a normal distribution when
studying a continuous process often suggests that some factor or special cause
is exerting an unusual amount of influence on the process (special cause of
variation exists).
Key Learning Skills
Understand Common Properties of the Normal Distribution and the
Standard Normal Distribution
Estimate the probability of an event given that the observed data follow
a normal distribution
4.1 Properties of the Normal Distribution
The figure below shows a normal distribution. In a normal distribution, the
mean, median, and mode (most frequently occurring item in the data set) are
equal. In addition, the number of standard deviations about the mean may be
represented by probabilities. For example, if data are normally distributed,
then 99.73% of the population values are expected to fall between +/- 3.
+/- 1 = 68.26%
+/- 2 = 95.46%
+/- 3 = 99.73%
15
16
17
The most common method used to determine the line that best fits data is
Least Squares Regression, which minimizes the squared deviations between
individual observations and the regression line.
The equations used to compute the slope (1) and Y-intercept (o) are:
n X i Yi X i Yi
n X i2 X i
1 X i
n
Note: n is the number of samples
Alternatively, you could use the excel functions =slope(Y-array,X-array) and
=intercept(Y-array,X-array). For example, if the Y-variable data are in Excel
work cells B2:B10 and the X-variable data are in cells A2:A10, then the
formula would be =slope(B2:B10,A2:A10).
Simple Linear Regression Example
Suppose you conduct an experiment to examine the relationships between
bicycle tire pressure, tire width, and the coefficient of rolling friction. From
experiments, you obtain the following:
Coefficient of Rolling Friction for Bicycle Tires
Pressure (PSI)
Width=1.25"
Width= 2"
20
25
30
35
40
45
50
55
60
65
70
75
0.0100
0.0095
0.0088
0.0081
0.0074
0.0067
0.0060
0.0058
0.0056
0.0054
0.0052
0.0050
0.0107
0.0100
0.0093
0.0086
0.0079
0.0073
0.0071
0.0068
0.0066
0.0063
0.0061
0.0058
18
Given these data, we may estimate the slope and Y-intercept for both
variables. Using excel, the following results may be obtained.
Width=1.25"
-0.0001
0.0115
slope
intercept
Width= 2"
-0.0001
0.0118
5.3 Correlation, R
The Pearson correlation coefficient, R, measures the extent to which two
continuous variables are linearly related. For example, you may want to
measure the correlation between tire pressure and the coefficient of rolling
friction in the above example.
Simple correlation may be measured using the following equation:
x yi y
n 1s x s y
19
Response Variable
Response Variable
A scatter plot is an effective tool for viewing the strength (strong versus
weak correlation) and direction (positive or negative). The figures below
show several examples with different correlation coefficients.
Predictor Variable
Predictor Variable
Response Variable
Response Variable
Predictor Variable
Predictor Variable
Correlation does NOT always indicate cause and effect. One should
not conclude that changes to one variable cause changes in another.
Properly controlled experiments are needed to verify that a
correlation relationship indicates causation.
20
Drawdepth
60.80
60.60
60.40
60.20
60.00
59.80
910
920
930
940
950
Tonnage
21
=percentile(array,value)
e.g., for the 95th percentile =percentile(array,0.95)
Slope
Intercept
Correlation
=correl(array 1, array 2)
22
Game
1
2
3
4
5
6
7
8
9
10
11
42
38
20
35
13
31
58
14
51
33
38
7
7
23
31
10
32
0
0
54
11
26
554
396
374
513
375
430
562
326
535
444
389
271
271
394
447
278
530
190
355
654
407
400
35
31
-3
4
3
-1
58
14
-3
22
12
N
Sum
Minimum
Maximum
11
373
13
58
11
201
0
54
11
4898
326
562
11
4197
190
654
11
172
-3
58
1. What is the average offense (total yards) per game for UM football team?
2. What is the standard deviation of offense per game for UM football team?
23
60
50
50
40
30
20
10
0
-10
100
Correlation = -0.75
70
60
Point Difference
Point Difference
70
40
30
20
10
0
200
300
400
500
-10
100
600
200
UM Offense
300
400
500
600
700
Opponent Offense
5. Which of the following statements appears true based on the available data?
a. Opponent offense (UM Defense) is a better predictor of point difference
than UM Offense.
b. UM offense is a better predictor of point difference than UM Defense.
c. Cannot tell based on the information given.
Using the raw data provided earlier, answer the following. HINT: copy the raw
data into excel, and use the excel functions provided in the appendix.
6. What is slope of the best fit line between UM Offense and Point Difference?
7. What is the y-intercept of the best fit line between UM Offense and Point
Difference?
24
25