Why statistics?
Decision making is often based on
analysis of data.
Statistics helps you to make sense of the
data by using tools that summarize,
present and analyze the data.
The decision maker can also ascertain the
level of confidence in those decisions.
Examples
How many newspapers should the vendor stock
to maximize revenue?
It depends on the probability distribution of demand and
on the expected profit.
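The newspaper question can be sketched as an expected-profit calculation. Every number below (price, cost, demand distribution) is hypothetical and chosen only for illustration, and the sketch assumes unsold copies are worthless.

```python
# Hypothetical inputs: none of these numbers come from the slides.
PRICE = 10   # revenue per newspaper sold
COST = 5     # cost per newspaper stocked (unsold copies are assumed worthless)
DEMAND_DIST = {10: 0.1, 20: 0.3, 30: 0.4, 40: 0.2}  # demand -> probability


def expected_profit(stock, demand_dist=DEMAND_DIST, price=PRICE, cost=COST):
    """Expected profit when `stock` copies are bought before demand is known."""
    return sum(
        prob * (price * min(demand, stock) - cost * stock)
        for demand, prob in demand_dist.items()
    )


def best_stock(demand_dist=DEMAND_DIST, price=PRICE, cost=COST):
    """Stock level, among the possible demand values, with the highest expected profit."""
    return max(demand_dist, key=lambda q: expected_profit(q, demand_dist, price, cost))


if __name__ == "__main__":
    for q in DEMAND_DIST:
        print(q, round(expected_profit(q), 2))
    print("best stock:", best_stock())
```

With these assumed inputs, stocking for the highest possible demand is not the best choice, because the cost of unsold copies outweighs the occasional extra sale.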
Descriptive Statistics
Collect
Organize
Summarize
Display
Analyze
Inferential Statistics
Predict and forecast values of population parameters
Test hypotheses about values of population parameters
Make decisions
Descriptive statistics - data and frequency distribution
The following are the departure delays, in minutes, of 42 flights selected
at random from a particular airport (35 of the 42 values are listed here).
10  12  45  13  40  13  20
45  95  38  67  47  55  56
45  50  27  50  15  26  34
12  25  48  40  25  50  42
48  53  44  23  56  46  22
Frequency distribution

Delay in minutes    Frequency    Relative frequency
0 - 15                  12             0.286
15 - 30                  8             0.190
30 - 45                  6             0.143
45 - 60                 14             0.333
60 or more               2             0.048
Total                   42             1.000
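The relative frequencies are the class counts divided by the total number of flights. A minimal sketch follows; the counts 8, 6 and 2 are inferred here from the relative frequencies and the total of 42, which is an assumption of this example.

```python
# Class counts for the 42 flights. The counts 8, 6 and 2 are inferred from
# the relative frequencies and the total of 42 (an assumption of this sketch).
CLASS_COUNTS = {
    "0 - 15": 12,
    "15 - 30": 8,
    "30 - 45": 6,
    "45 - 60": 14,
    "60 or more": 2,
}


def relative_frequencies(counts):
    """Divide each class count by the total number of observations."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


if __name__ == "__main__":
    for label, rel in relative_frequencies(CLASS_COUNTS).items():
        print(f"{label:<12}{CLASS_COUNTS[label]:>4}{rel:>8.3f}")
```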
Measures of Location
Arithmetic Mean
Median
Mode
Percentiles
Quartiles
Arithmetic mean
The mean of a data set is the average
of all the data values.
Sample mean: $\bar{x} = \frac{\sum x_i}{n}$
Population mean: $\mu = \frac{\sum x_i}{N}$
Mean example
Average delay in flight departure = 32.2381 minutes
Median
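It is the middle item in a data set that is
arranged in ascending/descending order.
If there are n observations, the median is at position (n+1)/2.
Computation rule:
if n is odd, (n+1)/2 is an integer and the median is that observation;
if n is even, the median is the average of the (n/2)th and (n/2 + 1)th observations.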
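Example
Sorted 42 observations (read down each column; the seven smallest
observations, positions 1 to 7, are missing from this listing and shown as "-"):

Positions 1-14    Positions 15-28    Positions 29-42
      -                 22                 45
      -                 23                 46
      -                 25                 47
      -                 25                 48
      -                 26                 48
      -                 27                 50
      -                 34                 50
     10                 38                 50
     12                 40                 53
     12                 40                 55
     13                 42                 56
     13                 44                 56
     15                 45                 67
     20                 45                 95

The median is the average of the 21st and 22nd observations
= (34 + 38)/2
= 36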
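The computation rule for the median can be sketched as follows. This is a minimal illustration of the (n+1)/2 rule; the function name is chosen here and is not from the slides.

```python
def median(values):
    """Median via the (n + 1) / 2 position rule (positions are 1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # Odd n: (n + 1) / 2 is an integer position.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the two observations on either side of (n + 1) / 2.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2
```

For n = 42 the two middle positions are 21 and 22, so the function would average those two observations, as in the example.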
Mode
The mode is the most frequently occurring observation.
The mode in the example is 0.
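A short sketch of finding the mode by counting occurrences. The sample list is hypothetical, and returning every tied value is a design choice of this example.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency, sorted."""
    counts = Counter(values)
    if not counts:
        raise ValueError("mode of an empty data set is undefined")
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)


if __name__ == "__main__":
    # Hypothetical delays in minutes, with several on-time (0 minute) flights.
    print(modes([0, 0, 0, 12, 45, 45]))
```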
Example
Calculate the 45th percentile of the airline
delay data.
The position of the 45th percentile is
45*(42+1)/100 = 19.35
Value of the 45th percentile
= 19th observation + 0.35 * (20th observation - 19th observation)
= 26 + 0.35*(27 - 26)
= 26.35
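The percentile calculation above uses the position p(n+1)/100 and interpolates between the two neighbouring observations. A minimal sketch of that rule follows; clamping positions that fall outside the data to the smallest or largest observation is an assumption of this sketch, not something the slides specify.

```python
def percentile_position(p, n):
    """1-based position of the p-th percentile among n sorted observations."""
    return p * (n + 1) / 100


def percentile(values, p):
    """p-th percentile by linear interpolation between neighbouring observations."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("percentile of an empty data set is undefined")
    pos = percentile_position(p, n)
    # Clamping outside the data range is a choice made for this sketch.
    if pos <= 1:
        return ordered[0]
    if pos >= n:
        return ordered[-1]
    whole = int(pos)        # 1-based position of the lower neighbour
    frac = pos - whole      # fractional part of the position
    lower = ordered[whole - 1]
    upper = ordered[whole]
    return lower + frac * (upper - lower)


if __name__ == "__main__":
    print(round(percentile_position(45, 42), 2))          # position for the airline example
    print(percentile([10, 20, 30, 40, 50, 60, 70, 80, 90], 45))  # hypothetical data
```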
Quartiles
Measures of Variability
Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation
Range
The range of a data set is the difference
between the largest and smallest data values.
It is the simplest measure of variability.
It is very sensitive to the smallest and largest
data values.
Example from airline delay data
Range = 95 - 0 = 95 minutes
Interquartile range
The interquartile range of a data set is the
difference between the third quartile and the first
quartile.
It is the range for the middle 50% of the data.
It overcomes the sensitivity to extreme data
values.
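The contrast between the range and the interquartile range can be sketched with a small hypothetical data set. Quartile positions here follow a k(n+1)/4 rule with linear interpolation, which is one common convention and an assumption of this example.

```python
def quartile(ordered, k):
    """k-th quartile (k = 1, 2, 3) of sorted data, at position k * (n + 1) / 4."""
    n = len(ordered)
    pos = k * (n + 1) / 4
    if pos <= 1:
        return ordered[0]
    if pos >= n:
        return ordered[-1]
    whole = int(pos)
    frac = pos - whole
    return ordered[whole - 1] + frac * (ordered[whole] - ordered[whole - 1])


def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def interquartile_range(values):
    """Third quartile minus first quartile."""
    ordered = sorted(values)
    return quartile(ordered, 3) - quartile(ordered, 1)


if __name__ == "__main__":
    base = [2, 4, 6, 8, 10, 12, 14]          # hypothetical data
    with_outlier = [2, 4, 6, 8, 10, 12, 100]  # same data with one extreme value
    print(data_range(base), interquartile_range(base))
    print(data_range(with_outlier), interquartile_range(with_outlier))
```

In this example the extreme value changes the range but leaves the interquartile range unchanged.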
Variance
The variance is a measure of variability
that utilizes all the data.
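It is based on the difference between the
value of each observation ($x_i$) and the
mean ($\bar{x}$ for a sample, $\mu$ for a population).
Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$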
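The two variance formulas differ only in the divisor (N for a population, n - 1 for a sample). A minimal sketch with a hypothetical data set:

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    if n == 0:
        raise ValueError("variance of an empty data set is undefined")
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


if __name__ == "__main__":
    data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean 5
    print(population_variance(data))
    print(round(sample_variance(data), 4))
```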
Standard deviation
The standard deviation of a data set is the
positive square root of the variance.
It is measured in the same units as the
data, making it easier than the variance
to compare with the mean.
If the data set is a sample, the standard
deviation is denoted s.
If the data set is a population, the standard
deviation is denoted $\sigma$ (sigma).
Coefficient of Variation
The coefficient of variation indicates how large the
standard deviation is in relation to the mean.
If the data set is a sample, the coefficient of variation
is computed as follows:
$\mathrm{CV} = \frac{s}{\bar{x}} (100)$
Example
Variance = 465.89 minutes squared
Standard deviation = 21.585 minutes
Coefficient of variation = (21.584/32.2381)(100) = 66.95%
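The arithmetic in this example can be checked with a short sketch that takes the variance and mean quoted above as given; the function name is chosen here for illustration.

```python
import math


def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return std_dev / mean * 100


if __name__ == "__main__":
    variance = 465.89   # minutes squared, as quoted in the example
    mean = 32.2381      # minutes, as quoted in the example
    std_dev = math.sqrt(variance)
    print(round(std_dev, 2))
    print(round(coefficient_of_variation(std_dev, mean), 2))
```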
Skewness
Skewness characterizes the degree of
asymmetry of a distribution around its
mean
Positively skewed
Symmetric or unskewed
Negatively skewed
[Figures: negatively skewed, symmetric, and positively skewed distributions]
Skewness - measure
The skewness of a distribution is measured by
$\gamma_1 = \frac{\sum (X - \mu)^3}{N \sigma^3}$
For a given data set you may use
Kurtosis
Kurtosis characterizes the relative
peakedness or flatness of a symmetric
distribution compared to the normal
distribution
Platykurtic (relatively flat)
Mesokurtic (normal)
Leptokurtic (relatively peaked)
[Figures: platykurtic (flat), mesokurtic (not too flat and not too peaked), and leptokurtic (peaked) distributions]
Kurtosis - measure
Kurtosis for a distribution is measured by $\beta_2 - 3$,
where
$\beta_2 = \frac{\sum (X - \mu)^4}{N \sigma^4}$
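The moment-based measures of skewness and kurtosis can be sketched as below. This uses the population formulas (dividing by N) and hypothetical data; spreadsheet and library functions often apply sample corrections and may give slightly different values.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k over all N values."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** k for x in values) / n


def skewness(values):
    """sum((x - mu)^3) / (N * sigma^3), using sigma^2 = second central moment."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values):
    """sum((x - mu)^4) / (N * sigma^4) - 3, zero for a normal distribution."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return central_moment(values, 4) / m2 ** 2 - 3


if __name__ == "__main__":
    print(skewness([1, 2, 3, 4, 5]))         # symmetric data
    print(skewness([1, 1, 1, 10]) > 0)       # long right tail
    print(round(excess_kurtosis([1, 2, 3, 4, 5]), 2))
```

A positive skewness value corresponds to a positively skewed distribution, and a negative excess kurtosis to a relatively flat (platykurtic) one.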
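Passengers and delay (in minutes) for the sampled flights. The original table
had repeating Passengers / Delay column pairs; its row and column alignment is
not recoverable, so the values are listed in the order they appear:
53  65  56  51  50  68  40  61  42  50  72
46  53  25  57  38  74  65  13  57  55  68
22  45  40  54  45  73  58  54  15  63  44
68  27  65  48  68  12  65  67  57  55  12
56  48  62  10  45  25  50  50  50  71  13
70  45  61  56  64  50  73  59  26  60  45
63  34  63  47  61  23  56  95  49  20  48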
Scatter Plot
Scatter Plots are used to identify any
underlying relationships among pairs of
data sets.
The plot consists of a scatter of points,
each point representing an observation.
Covariance
The covariance is a measure of the linear
association between two variables.
Positive values indicate a positive
relationship.
Negative values indicate a negative
relationship.
If the data sets are samples, the covariance
is denoted by $s_{xy}$:
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
= 20.42 in the airline example
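A minimal sketch of the sample covariance formula. The paired values are hypothetical and are not the airline data.

```python
def sample_covariance(xs, ys):
    """Sum of (x - x_bar) * (y - y_bar) over all pairs, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of observations")
    n = len(xs)
    if n < 2:
        raise ValueError("sample covariance needs at least two pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


if __name__ == "__main__":
    x = [1, 2, 3, 4]                               # hypothetical data
    print(sample_covariance(x, [2, 4, 6, 8]) > 0)  # y rises with x
    print(sample_covariance(x, [8, 6, 4, 2]) < 0)  # y falls as x rises
```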
Correlation Coefficient