# Full-time MBA: Business Statistics (BST510)

Descriptive Statistics:
Measures of Dispersion & Skewness

## Section One, Part 2

Paul Bottomley
Bottomleypa@cf.ac.uk
Silver, pp.24-26, 50-68
Measures of Dispersion
Measures of central tendency say nothing about the extent to
which data values are similar to one another.
Basis of segmentation and vendor selection.
Example: Number of days lost due to hangovers by a group
of French wine tasters.
October: 4, 5, 5, 5, 5, 5, 6
November: 0, 1, 3, 5, 5, 9, 12

## Various measures available, but are the data suitable?

Index of diversity (nominal)
Quartiles (ordinal)
Range, IQR, variance and standard deviation (metric)
The Range
Range = difference between maximum and minimum.
Example: French wine tasters, Oct. = 2, Nov. = 12.
Easy to calculate and interpret.
Lacks power: based on only two data values;
Strongly influenced by extreme values (potential outliers).
Always compare data sets of the same size: the larger the
sample, the greater the chance of selecting extreme values.

Example: Two data sets with the same mean and the same range,
but are they really equally spread out?
A = {0, 48, 49, 51, 52, 100}, B = {0, 1, 1, 99, 99, 100}
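A quick check of this example, as a minimal sketch: it confirms that A and B share a mean and a range, then uses the population standard deviation (covered later in these slides) to show that B is more spread out. The variable names are illustrative only.

```python
import statistics

# The two data sets from the example above.
a = [0, 48, 49, 51, 52, 100]
b = [0, 1, 1, 99, 99, 100]

for name, data in (("A", a), ("B", b)):
    mean = statistics.mean(data)
    data_range = max(data) - min(data)
    sd = statistics.pstdev(data)  # population SD (divides by n)
    print(f"{name}: mean={mean}, range={data_range}, SD={sd:.2f}")
```

Both sets have mean 50 and range 100, yet the SD is about 28.9 for A and about 49.3 for B, so the range alone hides the difference.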
Upper and Lower Quartiles

Identify the quarter and three-quarter points of a data series
when placed in ascending order, instead of the mid point (Md).
Lower (Q1): 25% of the data lies below Q1, 75% above it.
Upper (Q3): 75% of the data lies below Q3, 25% above it.
Think of Q1 as the median of the lower half of the sorted data,
and Q3 as the median of the upper half.

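[Figure: data in ascending order on a line marked Min., Lower Quartile, Median, Upper Quartile, Max. 25% of the data lies below the lower quartile and 75% below the upper quartile; the IQR spans the two quartiles.]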
Inter-Quartile Range (IQR)
IQR is the difference between the upper and lower quartiles.
Range of the middle 50% of data values.
More reliable: less influenced by outliers. But what about
the other 50% of data?
Metric variables only: it must be legitimate to add/subtract!

Don't confuse the quartile positions with quartile values.

Lower quartile (Q1): value in the position (n + 1)/4.
Upper quartile (Q3): value in the position 3(n + 1)/4.

Quartiles may require interpolation. Imagine loads of data
evenly spread between $X_i$ and $X_{i+1}$.
Calculating Quartiles: B&O Prices

## {580, 757, 800, 891, 1192, 1285, 1295, 1451}

Range: 1451 - 580 = 871
First find the positions of the lower and upper quartiles.
Q1: position = (n + 1)/4 = (8 + 1)/4 = 2.25
Q3: position = 3(n + 1)/4 = 3(8 + 1)/4 = 6.75

## Quartiles are not positions, but values in these positions.

Q1: value = 757 + 0.25 x (800 - 757) = 767.75
Q3: value = 1285 + 0.75 x (1295 - 1285) = 1292.50
IQR: 1292.50 - 767.75 = 524.75. Recall: units = £s
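The position-then-value procedure can be sketched in a few lines. This is an illustrative helper, not part of the course material: `value_at_position` is a hypothetical name, and it assumes the position lies between 1 and n.

```python
def value_at_position(sorted_data, position):
    """Return the value at a 1-based, possibly fractional, position.

    Assumes 1 <= position <= len(sorted_data).
    """
    whole = int(position)            # e.g. 2 for position 2.25
    fraction = position - whole      # e.g. 0.25
    lower = sorted_data[whole - 1]   # Python lists are 0-based
    if fraction == 0:
        return lower
    upper = sorted_data[whole]
    # Linear interpolation between the two neighbouring values.
    return lower + fraction * (upper - lower)


prices = sorted([580, 757, 800, 891, 1192, 1285, 1295, 1451])
n = len(prices)

q1 = value_at_position(prices, (n + 1) / 4)      # position 2.25
q3 = value_at_position(prices, 3 * (n + 1) / 4)  # position 6.75
iqr = q3 - q1
print(q1, q3, iqr)
```

Other software may use a different interpolation convention, so quartiles from a package can differ slightly from the (n + 1)/4 rule used here.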
Variance and Standard Deviation
Variance and SD measure the spread of the data around the
mean. They use all data values. Follow the steps below:
Calculate the mean ($\bar{X}$).
Subtract the mean from each data value (deviations).
But: sum of deviations = 0; mean = center of gravity.
Square each deviation, then add them all up.
Divide by the number of data points (n).

$$s^2 = \frac{\sum_i (X_i - \bar{X})^2}{n}, \qquad SD = \sqrt{\frac{\sum_i (X_i - \bar{X})^2}{n}}$$

SD is the square root of the average squared deviation from
the mean.
Standard Deviation: B&O Prices
(The Harder Way)

| $X_i$ | $X_i - \bar{X}$ | $(X_i - \bar{X})^2$ |
|---|---|---|
| 800 | -231.38 | 53536.70 |
| 891 | -140.38 | 19706.54 |
| 1295 | 263.62 | 69495.50 |
| 1451 | 419.62 | 176080.94 |
| 580 | -451.38 | 203743.90 |
| 1192 | 160.62 | 25798.78 |
| 1285 | 253.62 | 64323.10 |
| 757 | -274.38 | 75284.38 |
| Total: 8251 | 0 | 687969.84 |

Mean = 8251/8 = 1031.38
The variance is 687969.84 / 8 = 85996.23.
Units difficult to interpret. Now it is measured in £².
Solution: use the standard deviation in units of £s.
$SD = \sqrt{687969.84 / 8} = 293.25$
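The harder way maps directly onto code. The sketch below follows the same steps (mean, deviations, squared deviations, average, square root) for the B&O prices; variable names are illustrative.

```python
import math

prices = [800, 891, 1295, 1451, 580, 1192, 1285, 757]
n = len(prices)

mean = sum(prices) / n                              # step 1: the mean
deviations = [x - mean for x in prices]             # step 2: deviations
squared_deviations = [d ** 2 for d in deviations]   # step 3: square them
variance = sum(squared_deviations) / n              # step 4: divide by n
sd = math.sqrt(variance)                            # back to units of £

print(f"mean={mean:.2f}, variance={variance:.2f}, SD={sd:.2f}")
```

Because the code keeps full precision, its total of squared deviations (687969.875) differs slightly from the table's 687969.84, which is built from deviations rounded to two decimals. The variance and SD agree to two decimal places.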
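Standard Deviation: B&O Prices
(The Easier Way)
SD = square root of (Mean of the Squares minus Square of the Mean)

$$SD = \sqrt{\frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2}$$

$\sum X^2 = (X_1)^2 + (X_2)^2 + \dots + (X_n)^2$
$(\sum X)^2 = (X_1 + X_2 + \dots + X_n)^2$
Formula looks more complex, but needs fewer calculations.

| Price $X_i$ | $(X_i)^2$ |
|---|---|
| 800 | 640000 |
| 891 | 793881 |
| 1295 | 1677025 |
| 1451 | 2105401 |
| 580 | 336400 |
| 1192 | 1420864 |
| 1285 | 1651225 |
| 757 | 573049 |
| Total: 8251 | 9197845 |

$$SD = \sqrt{\frac{9197845}{8} - \left(\frac{8251}{8}\right)^2} = 293.25 \text{ (trust me!)}$$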
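For readers who would rather check than trust, here is a small sketch of the easier way. It compares the shortcut formula with Python's built-in population standard deviation; the variable names are illustrative.

```python
import math
import statistics

prices = [800, 891, 1295, 1451, 580, 1192, 1285, 757]
n = len(prices)

sum_x = sum(prices)                          # sum of the prices
sum_x_squared = sum(x ** 2 for x in prices)  # sum of the squared prices

# Mean of the squares minus square of the mean, then the square root.
sd_shortcut = math.sqrt(sum_x_squared / n - (sum_x / n) ** 2)

# Cross-check against the standard library's population SD.
sd_library = statistics.pstdev(prices)

print(f"shortcut={sd_shortcut:.2f}, library={sd_library:.2f}")
```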
Interpreting the Standard Deviation
Q: B&O's prices have an SD of about £300. High or low?
Difficult to say with only one data series. Easier to think in
comparative terms, but only if we have two (+) variables.

Q: What if we only had summary statistics and had lost the data?

We can still make claims about the proportion of data values
we would expect to find within a certain number of standard
deviations from the mean.

Q: Need to know if data has a bell-shaped distribution?

Try to imagine / picture the histogram.
If YES, use Empirical Rule; NO, use Chebyshev's Rule.
Interpreting the Standard Deviation
YES, use Empirical Rule
Mean ± 2 SD contains about 95% of all data values.
Mean ± 3 SD contains about 99.7% of all data values.
[Figure: bell-shaped histogram with the interval from mean - 2 SD to mean + 2 SD marked as containing 95% of data points.]
NO, use Chebyshev's Rule
Mean ± 2 SD contains at least 75% of the data.
Mean ± 3 SD contains at least 89% of the data.
[Figure: bell-shaped histogram with the interval from mean - 3 SD to mean + 3 SD marked as containing about 99% of data points.]
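The Chebyshev percentages quoted here are instances of the general bound 1 - 1/k² for k standard deviations either side of the mean. The slides give only the k = 2 and k = 3 cases; the formula and the function name below are additions for illustration.

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of data within k SDs of the mean (needs k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule is only informative for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3):
    print(f"within {k} SD: at least {chebyshev_min_proportion(k):.1%}")
```

The bound holds for any distribution shape, which is why it is lower than the Empirical Rule's figures for bell-shaped data.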
Is it Reasonable to Assume the
Distribution is Bell-shaped?
YES: Empirical Rule
Within what price range would we expect to find 95% of all
data values?

NO: Chebyshev's Rule
Expect at least 75% of all data values within this range.
This rule can be applied to any data series regardless of
the shape of the distribution (see later).
It is a conservative estimate: the minimum proportion.
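As a worked illustration, the range in question can be computed for the B&O prices as mean ± 2 SD, using the mean (1031.38) and population SD (293.25) from the earlier slides. The numerical range below is calculated here and is not quoted from the slides.

```python
mean = 1031.38  # B&O mean price in £ (from the slides)
sd = 293.25     # B&O population SD in £ (from the slides)

lower = mean - 2 * sd
upper = mean + 2 * sd
print(f"mean ± 2 SD: £{lower:.2f} to £{upper:.2f}")

# How many of the eight observed prices actually fall inside the range?
prices = [580, 757, 800, 891, 1192, 1285, 1295, 1451]
inside = [x for x in prices if lower <= x <= upper]
proportion_inside = len(inside) / len(prices)
print(f"proportion inside: {proportion_inside:.0%}")
```

With only eight prices, all of them fall inside the range, which is consistent with both the Empirical Rule (about 95%) and Chebyshev's Rule (at least 75%).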
SD: Population or Sample?
Samples are used when it is impossible or too expensive to
include every item / person of interest.
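Because samples are less likely to include values at the
extremes of the distribution, we divide by n - 1 rather than n
(harder way) or weight SD (easier way).

$$s_{n-1} = \sqrt{\frac{\sum_i (X_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2} \times \sqrt{\frac{n}{n - 1}} = SD \times \sqrt{\frac{n}{n - 1}}$$

Sample SD = 293.25 x √(8/7) = 293.25 x 1.069 = 313.48
Be careful: Excel has two commands. STDEV divides by n - 1 (sample); STDEVP divides by n (population).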
To avoid confusion: we will treat data as a population!
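The same population/sample distinction exists in Python's standard library, which makes for a quick check of the weighting factor. A minimal sketch for the B&O prices, with illustrative variable names:

```python
import math
import statistics

prices = [580, 757, 800, 891, 1192, 1285, 1295, 1451]
n = len(prices)

sd_population = statistics.pstdev(prices)  # divides by n
sd_sample = statistics.stdev(prices)       # divides by n - 1
correction = math.sqrt(n / (n - 1))        # weighting factor, sqrt(8/7)

print(f"population={sd_population:.2f}, sample={sd_sample:.2f}, "
      f"factor={correction:.3f}")
```

Computed at full precision the sample SD comes out near 313.50; the hand calculation gives 313.48 because it uses the factor rounded to 1.069.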
Comparing Measures of Dispersion

| Criterion | Quartiles | Range | IQR | Std. Dev. |
|---|---|---|---|---|
| Scale of measurement | Ordinal / Metric | Metric | Metric | Metric |
| Uses all data? | No | No | No | Yes |
| Unique? | Yes | Yes | Yes | Yes |
| Resistant to outliers? | Yes | No | Yes | No |
Relative Dispersion: Coefficient of Variation
With larger means we often find larger standard deviations
(height of men vs. women). Difficult to compare.
Coefficient of Variation (CV) = Std.Dev. / Mean.
Independent of units of measurement:
changing units from £s to pence or to \$ has no effect.
Useful for comparing: (i) different variables, (ii) same variable
over time, (iii) international comparisons.
| Brand | Mean (£) | Std. Dev. (£) | CV |
|---|---|---|---|
| B&O | 1031.38 | 293.25 | 0.28 |
| Sony | 680.23 | 278.32 | 0.41 |
| JVC | 445.78 | 150.44 | 0.34 |

## Greater variability in price of Sony TVs relative to B&O and JVC.
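The CV column can be reproduced from the means and standard deviations in the table. The sketch below also checks the claim that changing units leaves the CV unchanged; the dictionary layout is just one convenient way to hold the figures.

```python
import math

# (mean, standard deviation) in £, taken from the table above.
brands = {
    "B&O": (1031.38, 293.25),
    "Sony": (680.23, 278.32),
    "JVC": (445.78, 150.44),
}

cv = {brand: sd / mean for brand, (mean, sd) in brands.items()}
most_variable = max(cv, key=cv.get)

for brand, value in cv.items():
    print(f"{brand}: CV = {value:.2f}")
print("Highest relative variability:", most_variable)

# Converting £ to pence multiplies mean and SD by 100, so the ratio is unchanged.
mean_pence, sd_pence = 1031.38 * 100, 293.25 * 100
cv_pence = sd_pence / mean_pence
```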

Visualising Skewness (Shape)
Skewness measures the degree of symmetry of a distribution.
Histogram: useful graphical way to plot frequency of values against a
numerical scale.
Q: What is the relative position of the mean and median?
[Figure: three histograms of frequency against values 1 to 13. Symmetrical & bell-shaped (Md = Mean); positively skewed; negatively skewed. The median (Md) is marked on each skewed panel.]

Measuring Skewness
If the distribution is truly symmetrical, mean = median. Each
half is a mirror image of the other.
When data are positively skewed (income), mean is greater
than median; when data are negatively skewed (easy exam),
mean is less than median.

Pearson's Coefficient of Skewness = 3(Mean - Median) / Standard Deviation

Range: -3 to +3. Zero = symmetrical.
Values outside the range -1 to +1 are highly skewed.
Treat data with care: use robust statistics.
Measures of Skewness (2)
Pearson's Skewness: 3(Mean - Md) / Std.Dev
B&O Prices: 3(1031.38 - 1041.50) / 293.25 = -0.104
Very mild case of negative skew (bottom right figure).
SD meaningful; each half of the data has same properties.

Bowley's Skewness: [(Q3 - Md) - (Md - Q1)] / (Q3 - Q1)

Focuses on middle 50% of data values; more robust.
If the upper (lower) quartile is further from the median than the
lower (upper) quartile, then positive (negative) skew.
Range: -1 to +1. Zero is symmetrical;
Values outside the range -0.5 to +0.5 are highly skewed. Treat
data with care.

B&O Prices: [(1292.50 - 1041.50) - (1041.50 - 767.75)] /
(1292.50 - 767.75) = [251.00 - 273.75] / 524.75 = -0.043
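Both skewness coefficients are one-line formulas, so they are easy to check. A minimal sketch using the B&O summary figures from the slides; the function names are illustrative.

```python
def pearson_skewness(mean, median, sd):
    """Pearson's coefficient: 3(Mean - Median) / Standard Deviation."""
    return 3 * (mean - median) / sd


def bowley_skewness(q1, median, q3):
    """Bowley's coefficient: [(Q3 - Md) - (Md - Q1)] / (Q3 - Q1)."""
    return ((q3 - median) - (median - q1)) / (q3 - q1)


# B&O prices: mean, median, SD and quartiles as given in the slides.
pearson = pearson_skewness(mean=1031.38, median=1041.50, sd=293.25)
bowley = bowley_skewness(q1=767.75, median=1041.50, q3=1292.50)

print(f"Pearson: {pearson:.3f}, Bowley: {bowley:.3f}")
```

Both values are slightly negative and close to zero, which matches the reading of a very mild negative skew.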
Summarizing Metric Data

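[Figure: decision diagram covering measures of central tendency and measures of dispersion (standard deviation, range). It asks "Is the data skewed?" with branches for skewed and not skewed, and shows the Empirical Rule (95% within ± 2 SD) alongside Chebyshev's Rule (at least 75% within ± 2 SD).]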
Summary Measures Summarized
| Measure | Nominal | Ordinal | Metric |
|---|---|---|---|
| Examples | Do you own a car? What is your natural hair colour? | How often do you buy, daily, weekly? What level are you in the firm? | What is your date of birth? What is your annual income? |
| Central Tendency | Mode | Mode, Median | Mode, Median, Mean |
| Spread / Dispersion | Index of Diversity | Quartiles | Range, IQR, Std. Dev. |
| Skewness | | | Bowley's, Pearson's |

Increasing power: Nominal → Ordinal → Metric

Issues to consider: scale of measurement, presence of outliers,
purpose of the measure, different measures (complementary views).
Tukey's Box-and-Whiskers Plot
Graphical device for integrating measures of central tendency,
dispersion and skewness. (Variant of 5-number summary).
First draw a thin rectangle from lower to upper quartile and
mark the median as a line that crosses this box.
The box contains the middle 50% of data points (IQR).
Whiskers are the max. and min. values within the upper fence
= Q3 + (1.5*IQR) and lower fence = Q1 - (1.5*IQR).
Values beyond the fences are potential outliers.
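[Figure: box plot with the box from Q1 to Q3, the median marked inside it, a whisker on each side, and an outlier (*) beyond the fence. Axis marks: Q1 - (1.5 x IQR), Q1, Median, Q3, Q3 + (1.5 x IQR).]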
[Figure: box plot in which each whisker and each half of the box covers 25% of the data. Axis marks: Q1 - (1.5 x IQR), Q1, Median, Q3, Q3 + (1.5 x IQR).]

Example: TV Prices Dataset#1
301.81 756.66 460.00 150.19 635.30 239.99 904.05 206.60
417.82 882.05 176.97 466.69 259.47 478.90 173.66 333.79
673.69 1216.95 579.74 429.06 195.98 352.33 222.56 334.46
444.27 237.47 386.64 590.54 158.33 423.23 456.85

First arrange the data in ascending order.
Note: prices rounded to £s in the interest of simplification.

150 158 174 177 196 207 223 237
240 259 302 334 334 352 387 418
423 429 444 457 460 467 479 580
591 635 674 757 882 904 1217
Building the Box Plot
Find the position and value of the median (Q2), lower (Q1)
and upper (Q3) quartiles.
Easy with 31 data points. Position of median is (n + 1)/2 =
(31 + 1)/2 = 16th data point, namely 418.
Positions of the lower and upper quartiles are the 8th and
24th values, namely 237 and 580.
Next, find the inter-quartile range (IQR) = Q3 - Q1 =
580 - 237 = 343 (range, middle 50% of data).
[Figure: box drawn on a price axis from 0 to 1400, with the lower quartile, median and upper quartile marked.]
Building the Box Plot Cont.
Whiskers are the max. and min. data values between the
upper and lower fences (not always shown but should be!)
Upper: Q3 + (1.5 x IQR) = 580 + (1.5 x 343) = 1094.5
Lower: Q1 - (1.5 x IQR) = 237 - (1.5 x 343) = -277.5, so use 0 (a price cannot be negative).
The most expensive television (1217) is above the upper
fence, so it is a possible outlier (*).
The 2nd most expensive TV (904) is not an outlier.
[Figure: box plot on a price axis from 0 to 1400. The whiskers run from the cheapest TV to the most expensive TV inside the upper fence; the outlier beyond the upper fence is marked *.]
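The box plot ingredients for the TV prices can be reproduced with a short script. This sketch uses the rounded, sorted prices from the slides and relies on the fact that with n = 31 the quartile positions are whole numbers, so no interpolation is needed. That shortcut is an assumption that would not hold for every sample size.

```python
# Rounded TV prices in ascending order, as listed in the slides.
prices = [150, 158, 174, 177, 196, 207, 223, 237,
          240, 259, 302, 334, 334, 352, 387, 418,
          423, 429, 444, 457, 460, 467, 479, 580,
          591, 635, 674, 757, 882, 904, 1217]
n = len(prices)

# Positions are 1-based in the slides; Python lists are 0-based.
median = prices[(n + 1) // 2 - 1]   # 16th value
q1 = prices[(n + 1) // 4 - 1]       # 8th value
q3 = prices[3 * (n + 1) // 4 - 1]   # 24th value
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values that are still inside the fences.
inside = [x for x in prices if lower_fence <= x <= upper_fence]
outliers = [x for x in prices if x < lower_fence or x > upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences: {lower_fence} to {upper_fence}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```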
SPSS Summary Statistics:
Price of Selected Televisions
Statistic
Mean 436.97
Median 417.82
Mode 150.19 (multiple modes exist; the smallest value is shown)
Variance 62785.59
Std. Deviation(pop) 246.50
Minimum 150.19
Maximum 1216.95
Range 1066.76
Interquartile Range 342.27
Percentiles 25 237.47
Percentiles 50 417.82
Percentiles 75 579.74
Television Monthly Sales (Dataset#1)
[Figure: box plot of monthly sales on an axis from 0 to 2000, with the upper fence marked.]

Upper fence = Q3 + 1.5*IQR
= 836 + 1.5*(836 - 88) = 836 + 1122 = 1958
Childhood Consumerism:
Comparing Younger and Older Children
[Figure: side-by-side box plots for junior (N = 261) and senior (N = 296) children on a vertical scale from 1.0 to 4.5. Numbered cases are flagged as outliers above and below the whiskers.]