Sept 2015
Associate Professor Dr Sanjay Rampal
MBBS, MPH, PhD, CPH (US NBPHE), AMM
Faculty of Medicine, University of Malaya
srampal@ummc.edu.my / rampal.s@gmail.com
CONTENTS
Measures of central tendency
Mean, Median, & Mode
Variability and Measures of Dispersion
1) Range
2) Interquartile range
3) Variance
4) Standard deviation
5) Coefficient of variation
Other measures of location
Normal Distribution & skewness
Sanjay Rampal
Summarizing Data 1
Mean
(1)
$\bar{X} = \frac{\sum x_i}{n} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n}$
Mean (arithmetic)
(2)
1, 3, 5, 7, 7, 8, 8, 9
n = 8
$\sum x_i = 1 + 3 + 5 + 7 + 7 + 8 + 8 + 9 = 48$
$\bar{X} = \frac{\sum x_i}{n} = \frac{48}{8} = 6$
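The same arithmetic can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the slides.

```python
# Values from the worked example on the slide.
values = [1, 3, 5, 7, 7, 8, 8, 9]

n = len(values)      # number of observations
total = sum(values)  # sum of all observations
mean = total / n     # arithmetic mean = sum / n

print(f"n = {n}, sum = {total}, mean = {mean}")
```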
Mean: Advantages
Mean: Disadvantages
It can be affected by extreme values in the
data set (outliers) and can therefore give a
misleading picture of the typical value
It is less representative when the distribution
is skewed
Including or excluding a single observation
will change the mean
It is more tedious to calculate by hand
Geometric mean
Harmonic mean
Generalized means
Weighted arithmetic mean
Truncated mean
Inter-quartile mean
Mean (Geometric)
It is an average that is useful for sets of
numbers that are interpreted according to
their product rather than their sum (the sum
being what the arithmetic mean uses), e.g. disease rates
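As an illustration, the geometric mean is the n-th root of the product of n positive values. The sketch below computes it through logarithms; the function name and the two example ratios are hypothetical and are not taken from the slides.

```python
import math

def geometric_mean(values):
    """Return the n-th root of the product of n positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    # exp(mean of logs) equals the n-th root of the product
    return math.exp(sum(math.log(v) for v in values) / len(values))

ratios = [2.0, 8.0]                        # hypothetical multiplicative factors
gm = geometric_mean(ratios)                # square root of 2 * 8
am = sum(ratios) / len(ratios)             # arithmetic mean, for comparison

print(f"geometric mean = {gm:.3f}, arithmetic mean = {am:.3f}")
```

The geometric mean is never larger than the arithmetic mean of the same positive values.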
Truncated Means
This is a useful measure of central tendency in the
presence of extreme values or outliers
The observations in the dataset are ordered and truncated:
the observations at either end comprising n% of the data are
discarded and the mean is calculated from the rest, where n
ranges from 5% to 50%
90% truncated mean: 5% of the observations at each
extreme are discarded
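A minimal sketch of a truncated mean, assuming the percentage is specified per side and rounded down to a whole number of observations. The function name and the data set (1 to 19 plus one extreme value) are hypothetical.

```python
def truncated_mean(values, percent_each_side):
    """Mean after discarding percent_each_side % of observations at each end."""
    ordered = sorted(values)
    k = len(ordered) * percent_each_side // 100   # observations dropped per side
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("nothing left after truncation")
    return sum(kept) / len(kept)

data = list(range(1, 20)) + [1000]         # hypothetical: 20 values, one outlier

plain_mean = sum(data) / len(data)
trimmed_90 = truncated_mean(data, 5)       # 90% truncated mean: 5% off each end

print(f"mean = {plain_mean}, 90% truncated mean = {trimmed_90}")
```

With 20 observations, 5% per side removes exactly one value at each end, so the single outlier no longer influences the result.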
Median
(1)
9, 7, 6, 5, 3, 1, 1
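For the seven values above, the median is the middle value once the data are ordered. The sketch below shows the usual rule (middle value for odd n, average of the two middle values for even n); the function name is illustrative.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

values = [9, 7, 6, 5, 3, 1, 1]   # data from the slide
ordered = sorted(values)
med = median(values)

print(ordered, "median =", med)
```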
Median
(2)
Median
(3)
Median: Advantages
Fairly easy to calculate and always exists
Relatively easy to interpret: half of the sample
(normally) lies above the median and half below
Is not affected by extreme data values
Used when the distribution of data is skewed
Does not use the values of observations, only their
ranks
Can be used with ordinal observations because the
calculation does not use the actual values of the
observations
Does not need a complete data set to calculate the
rank
Median: Disadvantages
Manually tedious to find for a large sample which is
not in order (Requires ordering)
Does not utilize all data values
Mode
(1)
Mode
(2)
Mode
(3)
Mode
(4)
Advantages
Quick and easy to calculate
Unaffected by extreme values
Disadvantages
May not be representative of the whole
sample as it does not use all values
Seldom useful for tests of statistical significance
1, 2, 3, 3, 4, 5
Mean ?
Median ?
Mode ?
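One way to check the answers to this exercise is with Python's standard `statistics` module. This is only a sketch for self-checking; the variable names are illustrative.

```python
import statistics

values = [1, 2, 3, 3, 4, 5]   # data from the exercise

mean = statistics.mean(values)       # sum / n
median = statistics.median(values)   # average of the two middle values (n is even)
mode = statistics.mode(values)       # most frequent value

print(f"mean = {mean}, median = {median}, mode = {mode}")
```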
Data: 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 6 6 6 6 7 7 7 8 8 9
Mean = 5
Median = 5
Mode = 5
(1)
(2)
Mean=median (symmetrical)
Mean>median (distribution skewed to right)
Mean<median (distribution skewed to left)
Variability / Dispersion
Dispersion is the variability of observed values around the measures of
central tendency
Data values in a sample are not all the same; the variation
between values is called dispersion
When the dispersion is large, the values are widely
scattered; when it is small, they are tightly clustered
Diagrams such as dot plots, box plots, and stem-and-leaf
plots are wider for samples with more dispersion
and narrower for samples with less
Measures of Dispersion
1) Range
2) Interquartile range
3) Variance
4) Standard deviation
5) Coefficient of variation
Range
(1)
Range
(2)
E.g.
0 1 2 3 4 5 6
Range ?
0 1 2 3 4 5 6 51
Range ?
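The two questions can be answered by taking the range as the largest value minus the smallest. The short sketch below (illustrative names only) shows how a single extreme value changes the result.

```python
def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

first = [0, 1, 2, 3, 4, 5, 6]          # first data set on the slide
second = [0, 1, 2, 3, 4, 5, 6, 51]     # same data plus one extreme value

range_first = value_range(first)
range_second = value_range(second)

print(f"range without 51: {range_first}, range with 51: {range_second}")
```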
Interquartile range
More on this in later slides
Measuring dispersion
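Real difference: $x_i - m(X)$
Absolute difference: $|x_i - m(X)|$
Mean absolute difference: $\frac{1}{n} \sum_{i=1}^{n} |x_i - m(X)|$
where m(X) is the chosen measure of central tendency (mean, median, or mode)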
Note:
The sample mean absolute deviation is a biased estimator
of the population mean absolute deviation
The sample median absolute deviation is an unbiased
estimator of the population median absolute deviation
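A small sketch of the mean absolute deviation, computed once around the mean and once around the median. The function name is hypothetical, and the data are reused from the arithmetic mean example earlier in the slides.

```python
import statistics

def mean_absolute_deviation(values, centre):
    """Average of |x_i - centre| over all observations."""
    return sum(abs(x - centre) for x in values) / len(values)

values = [1, 3, 5, 7, 7, 8, 8, 9]

centre_mean = statistics.mean(values)       # 6
centre_median = statistics.median(values)   # 7.0

mad_about_mean = mean_absolute_deviation(values, centre_mean)
mad_about_median = mean_absolute_deviation(values, centre_median)

print(f"about the mean: {mad_about_mean}, about the median: {mad_about_median}")
```

The deviation about the median is never larger than the deviation about any other centre, which is why the two results differ here.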
Deviation
Deviation: distance and direction from the mean
Deviation = value - mean
E.g.
Mean = 52
Scores = 45, 53, 50, 60
Deviation scores = -7, 1, -2, 8 (respectively)
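The deviation scores in this example can be reproduced as follows. The sketch also shows that deviations from the mean always sum to zero, which is one reason they are squared before averaging; the variable names are illustrative.

```python
scores = [45, 53, 50, 60]   # scores from the slide

mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]        # signed distance from the mean
squared = [d ** 2 for d in deviations]         # squaring removes the sign

print("mean =", mean)
print("deviations =", deviations, "sum =", sum(deviations))
print("sum of squared deviations =", sum(squared))
```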
Variance
(1)
$\sigma^2 = \frac{(1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2}{3} = \frac{2}{3} \approx 0.667$
Variance
(2)
Population: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
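A minimal sketch contrasting the population divisor (N) with the sample divisor (n - 1). It assumes the data set 1, 2, 3 (mean 2), which is what the worked variance example on the previous slide appears to use; the function names are illustrative.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def population_variance(values):
    return sum_of_squares(values) / len(values)        # divide by N

def sample_variance(values):
    return sum_of_squares(values) / (len(values) - 1)  # divide by n - 1

values = [1, 2, 3]

pop_var = population_variance(values)
samp_var = sample_variance(values)

print(f"population variance = {pop_var:.3f}, sample variance = {samp_var:.3f}")
```

The sample version is always the larger of the two, and the gap shrinks as n grows.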
Standard deviation
(1)
Population: $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
Sample: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
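Python's standard `statistics` module provides both versions: `pstdev` for the population formula and `stdev` for the sample formula. The sketch below applies them to the four scores from the deviation example (45, 53, 50, 60); treating those scores as either a population or a sample is purely for illustration.

```python
import statistics

scores = [45, 53, 50, 60]

pop_sd = statistics.pstdev(scores)    # divides the sum of squares by N
sample_sd = statistics.stdev(scores)  # divides the sum of squares by n - 1

print(f"population SD = {pop_sd:.3f}, sample SD = {sample_sd:.3f}")
```

Squaring either result gives back the corresponding variance.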
Standard deviation
(2)
Standard deviation
(3)
$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
Coefficient of variation
(1)
Coefficient of variation
(2)
$CV = \frac{s}{\bar{X}} \times 100\%$
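A short sketch of the coefficient of variation, taken here as the sample standard deviation divided by the mean and expressed as a percentage. The function name and the choice of data (the four scores from the deviation example) are illustrative assumptions.

```python
import statistics

def coefficient_of_variation(values):
    """Sample SD as a percentage of the mean (mean must be non-zero)."""
    return statistics.stdev(values) / statistics.mean(values) * 100

scores = [45, 53, 50, 60]
rescaled = [x * 10 for x in scores]   # same data in different units

cv_scores = coefficient_of_variation(scores)
cv_rescaled = coefficient_of_variation(rescaled)

print(f"CV = {cv_scores:.2f}%, CV after rescaling = {cv_rescaled:.2f}%")
```

Because the standard deviation and the mean scale together, the CV does not depend on the unit of measurement, which is what makes it useful for comparing variability across different scales.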
Coefficient of variation
(3)
Quantiles
Box plot
Scatter plot
Quantiles
Quantiles are a set of 'cut points' that divide
a sample of data into groups containing (as
far as possible) equal numbers of
observations
Examples of quantiles include
quartiles, quintiles, deciles, and percentiles
Quartiles
Quartiles divide an ordered data set into four
equal parts
[Figure: ordered data divided into Q1 to Q4 at 25%, 50% (Median), 75%, and 100%]
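As a sketch, the quartile cut points and the interquartile range (Q3 - Q1) can be obtained with `statistics.quantiles` from Python's standard library (Python 3.8 or later). The data are reused from the arithmetic mean example. Different software uses slightly different interpolation rules, so the exact Q1 and Q3 values may not match a hand calculation or another package.

```python
import statistics

values = [1, 3, 5, 7, 7, 8, 8, 9]

# n=4 asks for the three cut points that split the data into four groups
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1   # interquartile range

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}, IQR = {iqr}")
```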
Quartiles
(2)
E.g.
Quintiles
Quintiles are values that divide a sample of
data into 5 groups containing (as far as
possible) equal numbers of observations
[Figure: ordered data divided into Q1 to Q5 at 20%, 40%, 60%, and 80%]
Percentiles
The use of percentiles in the presentation of data
50th percentile = median
Summary of quantiles
k     Quantile name   No. of quantiles
2     Median          1
4     Quartiles       3
5     Quintiles       4
10    Deciles         9
100   Percentiles     99
[Figure: ordered data set with Q1, the median, and Q3 marked]
(1)
[Figure: box plot with labels for extreme values, outliers, whiskers, the upper fence (Q3 + 1.5 IQR), Q3 = P75, the median, and Q1 = P25]
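The sketch below flags outliers using the common convention that any point more than 1.5 IQR below Q1 or above Q3 lies outside the whiskers. That convention, the function name, and the reuse of the data from the range example are assumptions for illustration; box plot rules differ slightly between software packages.

```python
import statistics

def find_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]

values = [0, 1, 2, 3, 4, 5, 6, 51]   # data from the range example
outliers = find_outliers(values)

print("outliers:", outliers)
```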
(2)
Normal distribution
The Normal Curve is bell-shaped and
symmetrical.
It is unimodal (mean = median = mode)
Tails of the normal curve are asymptotic to
the horizontal axis (from $-\infty$ to $+\infty$); i.e. the curve
approaches the horizontal axis but never
touches it
$f(X) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(X - \mu)^2}{2\sigma^2} \right)$
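A minimal sketch of the normal density, with mean mu and standard deviation sigma as parameters. The function name is illustrative; the parameter values (mean 5, variance 1 or 4) are chosen to resemble the labels on the next slide.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)

peak_narrow = normal_pdf(5, mu=5, sigma=1)   # variance 1
peak_wide = normal_pdf(5, mu=5, sigma=2)     # variance 4: lower, wider curve

print(f"peak (variance 1) = {peak_narrow:.4f}, peak (variance 4) = {peak_wide:.4f}")
print("symmetry:", normal_pdf(4, 5, 1) == normal_pdf(6, 5, 1))
```

The curve is highest at the mean, symmetric about it, and becomes flatter as the variance increases.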
Normal distribution
(2)
[Figure: two panels of normal curves, density from 0.0 to 0.4 on the vertical axis and values from 1.5 to 8.5 on the horizontal axis; labels: Mean = 5, Mean = 6, Variance = 1, Variance = 4]
Mean ± 1 SD: 68.27% of observations
Mean ± 2 SD: 95.45% of observations
Mean ± 3 SD: 99.73% of observations
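These coverage figures can be checked numerically. For a normal distribution, the proportion of values within k standard deviations of the mean equals erf(k / sqrt(2)); the sketch below uses `math.erf` from the standard library, and the function name is illustrative.

```python
import math

def coverage_within(k):
    """Proportion of a normal distribution lying within k SD of the mean."""
    return math.erf(k / math.sqrt(2))

coverage = {k: coverage_within(k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} SD: {p * 100:.2f}%")
```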
Skewed Distributions
Skewness is defined as asymmetry in the
distribution of the sample data values
Values on one side of the distribution tend to be
further from the 'middle' than values on the
other side
Skewness
Skewness measures the extent to which a distribution
of values deviates from symmetry around the
mean
The simplest measure is Mean - Median
If Mean - Median > 0, then +ve skew
If Mean - Median < 0, then -ve skew
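The Mean - Median rule can be sketched as below. The function name is hypothetical and the three data sets are reused from earlier slides. This is a rough screening rule only; it can disagree with formal skewness statistics for some distributions.

```python
import statistics

def skew_direction(values):
    """Classify skew from the sign of (mean - median)."""
    difference = statistics.mean(values) - statistics.median(values)
    if difference > 0:
        return "positive"
    if difference < 0:
        return "negative"
    return "none"

right_tailed = [0, 1, 2, 3, 4, 5, 6, 51]   # mean 9, median 3.5
left_tailed = [1, 3, 5, 7, 7, 8, 8, 9]     # mean 6, median 7
symmetric = [1, 2, 3, 3, 4, 5]             # mean 3, median 3

for name, data in [("right_tailed", right_tailed),
                   ("left_tailed", left_tailed),
                   ("symmetric", symmetric)]:
    print(name, "->", skew_direction(data))
```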
Skewed distribution
+ve skewness
-ve skewness
Kurtosis
Curvature
Defined as a measure reflecting the degree to
which a distribution is peaked
Provides information regarding the height of a
distribution relative to the value of its standard
deviation
Can be divided into:
Mesokurtic: bell shaped
Leptokurtic: high peak (clustered around the mean)
Platykurtic: flat peak (more dispersed)
Tests of normality
D'Agostino-Pearson test
Kolmogorov-Smirnov test
Lilliefors test
Shapiro-Wilk W test (7 ≤ n ≤ 2000)
Shapiro-Francia W' test (5 ≤ n ≤ 5000)