Arithmetic Mean for Raw Data
Example
The inner diameters of a particular grade of tire, based
on 5 sample measurements, are as follows (figures in
millimeters):
565, 570, 572, 568, 585
Applying the formula $\bar{X} = \frac{\sum X}{n}$
We get mean = (565+570+572+568+585)/5 = 572
Caution: Arithmetic Mean is affected by extreme values
or fluctuations in sampling. It is not the best average to
use when the data set contains extreme values (very
high or very low values).
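As a quick check of the arithmetic above, the formula can be sketched in Python. The names `diameters` and `arithmetic_mean` are illustrative and not part of the original material.

```python
# Tire inner diameters in millimeters, taken from the example above.
diameters = [565, 570, 572, 568, 585]


def arithmetic_mean(values):
    """Return the sum of the observations divided by their count."""
    return sum(values) / len(values)


mean_diameter = arithmetic_mean(diameters)
print(mean_diameter)  # 572.0
```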
Median
Median is the middlemost observation when you arrange data
in ascending or descending order of magnitude. That is, the
data are ranked and the middle value is picked. Median is
such that 50% of the observations are above the median and
50% of the observations are below the median.
Median is a very useful measure for ranked data in the context
of consumer preferences and rating. It is not affected by
extreme values but is affected by the number of observations.
Median = $\frac{n+1}{2}$ th value of ranked data
n = Number of observations in the sample
Note: If the sample size is an odd number then median is
(n+1)/2 th value in the ranked data. If the sample size is even,
then median will be between two middle values. You take the
average of these two middle values.
Median for Raw Data
Example -Odd Sample Size
Marks obtained by 7 students in Computer Science Exam
are given below. Compute the median.
45 40 60 80 90 65 55
Arranging the data after ranking gives
90 80 65 60 55 45 40
Median = (n+1)/2 th value in this set = (7+1)/2 th
observation = 4th observation = 60
Hence Median = 60 for this problem.
Median for Raw Data
Example - Even Sample Size
Diameter of a shaft in millimeters in a manufacturing unit is
given below for 10 samples. Calculate the median value.
2.50 2.45 2.55 2.60 2.46 2.43 2.56 2.58 2.66 2.65
Arranging the data in ascending order, you will get
2.43 2.45 2.46 2.50 2.55 2.56 2.58 2.60 2.65 2.66
The median falls between the 5th and 6th observations, that is,
between 2.55 and 2.56. Hence median = (2.55+2.56)/2
= 2.555
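The odd and even cases above can be sketched in one small Python function. This is a minimal illustration; the function and variable names are assumptions, not part of the original slides.

```python
def median(values):
    """Rank the data and return the middle value (odd n) or the
    average of the two middle values (even n)."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2


marks = [45, 40, 60, 80, 90, 65, 55]
shaft_diameters = [2.50, 2.45, 2.55, 2.60, 2.46, 2.43, 2.56, 2.58, 2.66, 2.65]

print(median(marks))  # 60
print(f"{median(shaft_diameters):.3f}")
```

Ranking in ascending rather than descending order does not change the result, since the middle position is the same from either end.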
Mode
Mode is that value which occurs most often. It has the
maximum frequency of occurrence. Mode is not affected
by extreme values.
Mode is a very useful measure when you want to decide which
item to keep in inventory, for example the most popular shirt
collar size during festival season. Median and mean will not be
helpful in this type of situation. Another example where
mode is the only answer is in determining the most
typical shoe size to be kept in stock in a shop selling
shoes.
Caution: In a few problems in real life, there will be more
than one mode such as bimodal and multi-modal values.
In these cases mode cannot be uniquely determined.
Mode for Raw Data
Example
The lives, in hours, of 10 flashlight batteries are as
follows. Find the mode.
340 350 340 340 320 340 330 330 340 350
340 occurs five times. Hence, mode = 340.
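A minimal Python sketch of the same count, using the standard library's `collections.Counter`. The function name `modes` is illustrative; it returns a list so that bimodal and multi-modal data, mentioned in the caution above, are reported rather than hidden.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the maximum frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


battery_life = [340, 350, 340, 340, 320, 340, 330, 330, 340, 350]
print(modes(battery_life))  # [340]
```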
Mean for Grouped Data
Formula for Mean is given by
$\bar{X} = \frac{\sum fX}{n}$
Where
$\bar{X}$ = Mean
$\sum fX$ = Sum of cross products of frequency in each class
with midpoint X of each class
n = Total number of observations (Total frequency) = $\sum f$
Mean for Grouped Data
Example
Find the arithmetic mean for the following continuous
frequency distribution:
Class 0-1 1-2 2-3 3-4 4-5 5-6
Frequency 1 4 8 7 3 2
Solution for the Example
Class    X     f    fX
0-1      0.5   1    0.5
1-2      1.5   4    6.0
2-3      2.5   8    20.0
3-4      3.5   7    24.5
4-5      4.5   3    13.5
5-6      5.5   2    11.0
Totals         25   75.5
Applying the formula $\bar{X} = \frac{\sum fX}{n}$ = 75.5/25 = 3.02
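The tabular calculation can be sketched in Python as follows. The tuple layout `(lower, upper, frequency)` and the function name are assumptions made for illustration.

```python
# (lower limit, upper limit, frequency) for each class in the example.
classes = [(0, 1, 1), (1, 2, 4), (2, 3, 8), (3, 4, 7), (4, 5, 3), (5, 6, 2)]


def grouped_mean(rows):
    """Mean of grouped data: sum of f * midpoint divided by total frequency."""
    n = sum(f for _, _, f in rows)
    sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in rows)
    return sum_fx / n


print(round(grouped_mean(classes), 2))  # 3.02
```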
Mean by short cut method
$\bar{X} = A + \frac{\sum fd}{n}$
Where A is the assumed mean (one can assume any value) and
d = X - A is the deviation of each mid-value from A.
If d = (X - A)/c, then the second term in the formula
is multiplied by c, where c is the class interval:
$\bar{X} = A + \frac{\sum fd}{n} \times c$
Assignment: find the mean using short-cut method
Example of short cut method
The table here presents the profit of 1400 companies. Find the
mean using two different methods.
Profit       No. of cos.
200-400      500
400-600      300
600-800      280
800-1000     120
1000-1200    100
1200-1400    80
1400-1600    20
Total        1400
Here A = 900 and c = 200.
Profit       Freq. (f)   Mid Point (X)   f X       d = (X-A)/c   f d
200-400      500         300             150,000   -3            -1500
400-600      300         500             150,000   -2            -600
600-800      280         700             196,000   -1            -280
800-1000     120         900             108,000   0             0
1000-1200    100         1100            110,000   1             100
1200-1400    80          1300            104,000   2             160
1400-1600    20          1500            30,000    3             60
Total        1400                        848,000   0             -2060
Direct method
$\bar{X} = \frac{\sum fX}{n}$ = 848,000/1400 = 605.714
Short cut method
$\bar{X} = A + \frac{\sum fd}{n} \times c$
= 900 + (-2060)(200)/1400 = 900 - 294.286
= 605.714
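The agreement between the two methods can be checked with a short Python sketch. It assumes A = 900 and c = 200 as in the worked example; the function names and data layout are illustrative.

```python
# (lower limit, upper limit, number of companies) from the profit table.
profit_classes = [
    (200, 400, 500), (400, 600, 300), (600, 800, 280), (800, 1000, 120),
    (1000, 1200, 100), (1200, 1400, 80), (1400, 1600, 20),
]


def direct_mean(rows):
    """Direct method: sum of f * X over n, with X the class midpoint."""
    n = sum(f for _, _, f in rows)
    return sum(f * (lo + hi) / 2 for lo, hi, f in rows) / n


def shortcut_mean(rows, assumed_mean, class_interval):
    """Short cut method: A + (sum of f*d / n) * c, with d = (X - A) / c."""
    n = sum(f for _, _, f in rows)
    sum_fd = sum(
        f * ((lo + hi) / 2 - assumed_mean) / class_interval
        for lo, hi, f in rows
    )
    return assumed_mean + sum_fd / n * class_interval


print(round(direct_mean(profit_classes), 3))
print(round(shortcut_mean(profit_classes, 900, 200), 3))
```

Any other assumed value A gives the same mean; a midpoint near the centre of the table simply keeps the d values small.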
Properties of Mean
Sum of deviations from mean is always zero.
Sum of squared deviations from mean is minimum.
If X = X1 + X2, then the mean of X is equal to the sum of the
means of X1 and X2 (if the numbers of observations are equal).
From two or more groups a pooled mean can be calculated.
Median for Grouped Data
Formula for Median is given by
Median = $L + \frac{(n/2) - m}{f} \times c$
Where
L = Lower limit of the median class
n = Total number of observations = $\sum f$
m = Cumulative frequency preceding the median class
f = Frequency of the median class
c = Class interval of the median class
Median for Grouped Data
Example
Find the median for the following continuous
frequency distribution:
Class 0-10 11-20 21-30 31-40 41-50
Frequency 5 8 13 7 7
Solution for the Example
Class    Frequency   Cumulative Frequency
0-10     5           5
11-20    8           13
21-30    13          26
31-40    7           33
41-50    7           40
Total    40
Substituting the relevant values in the formula
Median = $L + \frac{(n/2) - m}{f} \times c$, we have
Median = $21 + \frac{(40/2) - 13}{13} \times 10$
= 21 + (70/13) = 21 + 5.38 = 26.38
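A minimal Python sketch of the grouped-median calculation, using the same L, m, f and c as the worked example. The function name and the choice to pass lower limits and frequencies as parallel lists are assumptions for illustration.

```python
def grouped_median(lower_limits, frequencies, c):
    """Median = L + ((n/2 - m) / f) * c for the class where the
    cumulative frequency first reaches n/2."""
    n = sum(frequencies)
    m = 0  # cumulative frequency preceding the current class
    for L, f in zip(lower_limits, frequencies):
        if m + f >= n / 2:
            return L + (n / 2 - m) / f * c
        m += f


lower_limits = [0, 11, 21, 31, 41]
frequencies = [5, 8, 13, 7, 7]
print(round(grouped_median(lower_limits, frequencies, 10), 2))  # 26.38
```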
Mode for Grouped Data
Mode = $L + \frac{d_1}{d_1 + d_2} \times c$
Where L = Lower limit of the modal class
$d_1 = f_1 - f_0$
$d_2 = f_1 - f_2$
$f_1$ = Frequency of the modal class
$f_0$ = Frequency preceding the modal class
$f_2$ = Frequency succeeding the modal class
c = Class interval of the modal class
Mode for Grouped Data
Example
Example: Find the mode for the following
continuous frequency distribution:
Class 0-1 1-2 2-3 3-4 4-5 5-6
Frequency 1 4 8 7 3 2
Solution for the Example
Class Frequency
0-1 1
1-2 4
2-3 8
3-4 7
4-5 3
5-6 2
Total 25
Mode = $L + \frac{d_1}{d_1 + d_2} \times c$
L = 2
$d_1 = f_1 - f_0$ = 8-4 = 4
$d_2 = f_1 - f_2$ = 8-7 = 1
c = 1
Hence Mode = $2 + \frac{4}{5} \times 1$ = 2.8
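The grouped-mode formula can be sketched in Python as below. Treating a missing neighbouring class as having frequency 0 is an assumption of this sketch, and the function name is illustrative.

```python
def grouped_mode(lower_limits, frequencies, c):
    """Mode = L + d1 / (d1 + d2) * c for the class with the largest frequency."""
    i = max(range(len(frequencies)), key=lambda j: frequencies[j])
    f1 = frequencies[i]
    f0 = frequencies[i - 1] if i > 0 else 0                      # preceding class
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0   # succeeding class
    d1 = f1 - f0
    d2 = f1 - f2
    return lower_limits[i] + d1 / (d1 + d2) * c


lower_limits = [0, 1, 2, 3, 4, 5]
frequencies = [1, 4, 8, 7, 3, 2]
print(round(grouped_mode(lower_limits, frequencies, 1), 2))  # 2.8
```

If two classes tie for the largest frequency, this sketch silently picks the first one, which is exactly the situation where the mode is not uniquely determined.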
Class assignment
Find the Median and Mode for the following data
(salary structure of 1500 employees)
(Answer: Median = 33.46, Mode = 29.5)
Age    18-22  22-26  26-30  30-34  34-38  38-42  42-46  46-50  50-54  54-58
Freq   120    125    280    260    155    184    162    86     75     53
Comparison of Mean, Median, Mode
Mean: Defined as the arithmetic average of all observations
in the data set. Requires measurement on all observations.
Uniquely and comprehensively defined.
Median: Defined as the middle value in the data set arranged
in ascending or descending order. Does not require
measurement on all observations. Cannot be determined under
all conditions.
Mode: Defined as the most frequently occurring value in the
distribution; it has the largest frequency. Does not require
measurement on all observations. Not uniquely defined for
multi-modal situations.
Comparison of Mean, Median, Mode Cont.
Mean: Affected by extreme values. Can be treated
algebraically. That is, means of several groups can be
combined.
Median: Not affected by extreme values. Cannot be treated
algebraically. That is, medians of several groups cannot be
combined.
Mode: Not affected by extreme values. Cannot be treated
algebraically. That is, modes of several groups cannot be
combined.
Which central tendency to use
Type of data:
If data is badly skewed: Avoid the Mean
If gaps in the data: Avoid median
If uneven frequencies: Avoid Mode
Purpose of Analysis:
Representative value: Mean
Qualitative/ nominal variable: Mode
Partition point: Median
Which central tendency to use
Frequency distribution:
Open ended classes: Median or Mode
(except certain situations)
Others : Mean
Nature of data:
Time series data: Avoid Mean
Ratios/rates : Avoid Mean
Relationship
Mean, Median and Mode are related approximately as
follows:
(Mean - Mode) = 3 (Mean - Median)
For a completely symmetric distribution
(Normal distribution), the three
measures coincide with each other.
Fractiles / Quantiles
A FRACTILE is the value of an
observation which is located at a specified
place in a series of data. For example :
Median, which is located in the middle.
Various fractiles used are : Quartiles,
Deciles, Percentiles.
Median is the 50th percentile or 5th decile or
2nd quartile.
How to calculate fractile values
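$Q_n = P_{25n} = L + \frac{(nN/4) - m}{f} \times c$
$D_n = P_{10n} = L + \frac{(nN/10) - m}{f} \times c$
Here n is the order of the fractile, N is the total number of
observations, and L, m, f and c are defined as for the median,
taken for the class that contains the fractile.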
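A hedged Python sketch of the fractile formulas follows. It assumes that L, m, f and c play the same roles as in the grouped-median formula, applied to the class containing the fractile; the function name and the `position` argument (the fraction of N, such as 1/4 for the first quartile) are illustrative.

```python
def grouped_fractile(lower_limits, frequencies, c, position):
    """Return L + ((position * N - m) / f) * c for the class in which the
    cumulative frequency first reaches position * N.

    position is a fraction of N: n/4 for quartile n, n/10 for decile n,
    n/100 for percentile n.
    """
    N = sum(frequencies)
    target = position * N
    m = 0  # cumulative frequency preceding the current class
    for L, f in zip(lower_limits, frequencies):
        if m + f >= target:
            return L + (target - m) / f * c
        m += f


ages = [18, 22, 26, 30, 34, 38, 42, 46, 50, 54]
freq = [120, 125, 280, 260, 155, 184, 162, 86, 75, 53]

q1 = grouped_fractile(ages, freq, 4, 1 / 4)
q2 = grouped_fractile(ages, freq, 4, 2 / 4)
q3 = grouped_fractile(ages, freq, 4, 3 / 4)
print(round(q1, 2), round(q2, 2), round(q3, 2))
```

With `position` set to 2/4 the function reduces to the grouped median, which matches the statement that the median is the 2nd quartile.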
Class assignment: Calculate the fractiles from the data
Age    18-22  22-26  26-30  30-34  34-38  38-42  42-46  46-50  50-54  54-58
Freq   120    125    280    260    155    184    162    86     75     53
3) Measures of Dispersion
In simple terms, measures of dispersion indicate
how large the spread of the distribution is
around the central tendency. It answers
unambiguously the question " What is the
magnitude of departure from the average value
for different groups having identical averages?".
It is important to study the central tendency
along with dispersion to throw light on the
shape of the curve; to gauge whether there is
distortion to the bell shaped symmetrical normal
distribution curve that forms the foundation
stone upon which the entire statistical inference
is built.
Range
Range is the simplest of all measures of dispersion. It is
calculated as the difference between maximum and
minimum value in the data set.
Range = $X_{Maximum} - X_{Minimum}$
Example for Computing Range
The following data represent the percentage return on
investment for 10 mutual funds per annum. Calculate
Range.
12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9
Range = $X_{Maximum} - X_{Minimum}$ = 18-9 = 9
Range is an absolute measure and is defined
for a particular data set. It cannot be used
for comparison of two data sets.
Coefficient of Range is a relative measure
Coefft. of Range = (L - S)/(L + S), where L is the largest
and S the smallest value
If it is a small value, dispersion is less
Coefft. of Range is not a consistent measure,
thus it is not always used.
Example: take two samples, the first with extreme values
1 and 2, the second with extreme values 11 and 12.
Both have a range of 1, but their coefficients of range are
(2-1)/(2+1) = 1/3 for the first sample and
(12-11)/(12+11) = 1/23 for the second.
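The range and the coefficient of range can be sketched in Python as follows; the function names are illustrative. The two-sample example shows why the coefficient is described as inconsistent: equal ranges give very different coefficients.

```python
def value_range(values):
    """Difference between the maximum and minimum value."""
    return max(values) - min(values)


def coefficient_of_range(values):
    """(L - S) / (L + S), with L the largest and S the smallest value."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest)


returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]
print(value_range(returns))  # 9

print(coefficient_of_range([1, 2]))    # 1/3
print(coefficient_of_range([11, 12]))  # 1/23
```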
Range
Caution: Range is a good measure of spread in the
distribution only when a data set shows a stable
pattern of variation without extreme values. If one
of the components of range namely the maximum
value or minimum value becomes an extreme
value, then range should not be used.
Interquartile Range
Range is entirely dependent on maximum and
minimum values in the data set and is highly
misleading when one of them is an extreme value.
To overcome this deficiency, you can resort to
interquartile range. It is computed as the range
after eliminating the highest and lowest 25% of
observations in a data set that is arranged in
ascending order. Thus this measure is not
sensitive to extreme values.
Interquartile range = Range computed on middle
50% of the observations
Interquartile Range-Example
The following data represent the percentage return
on investment for 9 mutual funds per annum.
Calculate interquartile range.
Data Set: 12, 14, 11, 18, 10.5, 12, 14, 11, 9
Arranging in ascending order, the data set becomes
9, 10.5, 11, 11, 12, 12, 14, 14, 18
Ignore the first two (9, 10.5) and last two (14, 18)
observations in this data set. The remaining values make up
roughly the middle 50% of the data. They are 11, 11, 12, 12,
and 14. If you calculate the range of these, you get the
interquartile range.
Interquartile range = 14-11 = 3.
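The trimming procedure used in this example can be sketched in Python. Dropping `len(values) // 4` observations from each end is an assumption chosen to match the slide's example; definitions based on interpolated quartiles can give slightly different values.

```python
def interquartile_range_by_trimming(values):
    """Range of the data after dropping the lowest and highest quarter."""
    ranked = sorted(values)
    k = len(ranked) // 4          # observations dropped from each end
    middle = ranked[k:len(ranked) - k]
    return middle[-1] - middle[0]


returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]
print(interquartile_range_by_trimming(returns))  # 3
```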
Quartile Deviation
Quartile deviation = IQR/2
This is an absolute measure of dispersion,
not to be used for comparison
For comparison we use Coefficient of
Quartile deviation
Coefft. of QD = (Q3 - Q1)/(Q3 + Q1)
Mean Absolute Deviation(MAD)
Mean Absolute Deviation (MAD) is defined as the average based on the
deviations measured from arithmetic mean, in which all deviations are
treated as positive ignoring the actual sign. Unlike range, MAD is based
on all observations. Hence it reflects the dispersion of every item in the
distribution. In symbolic form, it is defined by the following formula.
MAD = $\frac{\sum |X - \bar{X}|}{n}$
Where
$\sum |X - \bar{X}|$ represents sum of all deviations from arithmetic mean
after ignoring sign
$\bar{X}$ = Arithmetic Mean
n = Number of observations in the sample (sample size)
Caution: Mean Absolute Deviation (MAD) has two weaknesses. 1) It
cannot be combined for several groups. 2) Ignoring the sign has serious
implications to a business manager attempting to measure the spread of
the distribution in a scientific manner.
Example for MAD
The following data represent the percentage return on
investment for 10 mutual funds per annum. Calculate MAD
(Please note that this is the same example used for computing
Range)
12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9
$\bar{X} = \frac{\sum X}{n}$ = (12+14+11+18+10.5+11.3+12+14+11+9)/10
= 12.28
$\sum |X - \bar{X}|$ = |12-12.28| + |14-12.28| + |11-12.28| + |18-12.28|
+ |10.5-12.28| + |11.3-12.28| + |12-12.28| + |14-12.28|
+ |11-12.28| + |9-12.28| = 18.32
MAD = $\frac{\sum |X - \bar{X}|}{n}$ = 18.32/10 = 1.832
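The same calculation as a short Python sketch; the function name is illustrative.

```python
def mean_absolute_deviation(values):
    """Average of the absolute deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]
print(round(mean_absolute_deviation(returns), 3))  # 1.832
```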
Standard Deviation
Standard deviation forms the basis for the discussion on
Inferential Statistics. It is a classic measure of dispersion. It
has many advantages over the rest of the measures of
variations. It is based on all observations. It is capable of
being algebraically treated which implies that you can
combine standard deviations of many groups. It plays a very
vital role in testing hypotheses and forming confidence
interval.
To define standard deviation, you need to define another
term called variance. In simple terms, standard deviation is
the square root of variance.
Important Terms with Notations
Sample Variance: $S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
Sample Standard Deviation: $S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
Population Variance: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Population Standard Deviation: $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
Where $\bar{X} = \frac{\sum X}{n}$ (Sample Mean) and
$\mu = \frac{\sum X}{N}$ (Population Mean)
n = Number of observations in the sample (Sample size)
N = Number of observations in the Population (Population Size)
Key Remarks
1. $S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$ is an unbiased
estimator of $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
2. $\bar{X} = \frac{\sum X}{n}$ is an unbiased
estimator of $\mu = \frac{\sum X}{N}$
3. The divisor n-1 is always used
while calculating sample variance
for ensuring property of being
unbiased
4. Standard deviation is always the
square root of variance
Example for Standard Deviation
The following data represent the percentage
return on investment for 10 mutual funds per
annum. Calculate the sample standard deviation.
12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9
Solution for the Example
[Spreadsheet figure not reproduced.]
Solution for the Example Cont.
From the spreadsheet of Microsoft Excel in the previous slide,
it is easy to see that
Mean = $\bar{X} = \frac{\sum X}{n}$ = 12.28 (In column A and row 14, 12.28 is
seen).
Sample Variance = $S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$ = 6.33 (In column D and
row 14, 6.33 is seen)
Sample Standard Deviation = S = $\sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$ = 2.52
(In column D and row 15, 2.52 is seen)
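The spreadsheet results can be reproduced with Python's standard `statistics` module, offered here as an alternative to the Excel sheet rather than the method used in the slides. `statistics.variance` and `statistics.stdev` use the n-1 divisor, while `statistics.pvariance` uses N.

```python
import statistics

returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]

mean = statistics.mean(returns)
sample_variance = statistics.variance(returns)        # divisor n - 1
sample_sd = statistics.stdev(returns)                 # square root of the above
population_variance = statistics.pvariance(returns)   # divisor N

print(round(mean, 2), round(sample_variance, 2), round(sample_sd, 2))
# 12.28 6.33 2.52
```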
Standard Deviation for Grouped Data-Example
Frequency Distribution of Return on Investment of
Mutual Funds
Return on Investment   Number of Mutual Funds
5-10                   10
10-15                  12
15-20                  16
20-25                  14
25-30                  8
Total                  60
Solution for the Example
[Spreadsheet figure not reproduced.]
From the spreadsheet of Microsoft Excel in the previous
slide, it is easy to see
Mean = $\bar{X} = \frac{\sum fX}{n}$ = 1040/60 = 17.333 (cell F10),
Standard Deviation = S = $\sqrt{\frac{\sum fX^2 - n\bar{X}^2}{n - 1}}$ = $\sqrt{\frac{2448.33}{59}}$
= 6.44
(Cell H12)
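A minimal Python sketch of the grouped calculation, using class midpoints and the n-1 divisor as in the example. The data layout and function name are assumptions for illustration.

```python
import math

# (lower limit, upper limit, number of mutual funds)
roi_classes = [(5, 10, 10), (10, 15, 12), (15, 20, 16), (20, 25, 14), (25, 30, 8)]


def grouped_mean_and_sd(rows):
    """Return (mean, sample SD) for grouped data using class midpoints."""
    n = sum(f for _, _, f in rows)
    mean = sum(f * (lo + hi) / 2 for lo, hi, f in rows) / n
    # Sum of f * (X - mean)^2, which equals sum of f*X^2 minus n * mean^2.
    squared_dev = sum(f * ((lo + hi) / 2 - mean) ** 2 for lo, hi, f in rows)
    return mean, math.sqrt(squared_dev / (n - 1))


mean, sd = grouped_mean_and_sd(roi_classes)
print(round(mean, 3), round(sd, 2))  # 17.333 6.44
```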
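Calculation of SD : Raw data
First the direct method: $\sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
Without deviation method: $\sqrt{\frac{\sum X^2}{n} - \bar{X}^2}$
Assumed mean method: $\sqrt{\frac{\sum d^2}{n} - \left[\frac{\sum d}{n}\right]^2}$, where d = X - A
(The last two forms divide by n rather than n-1.)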
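Calculation of SD : Grouped
First the direct method: $\sqrt{\frac{\sum f(X - \bar{X})^2}{n - 1}}$
Without deviation method: $\sqrt{\frac{\sum fX^2}{n} - \bar{X}^2}$
Assumed mean method: $\sqrt{\frac{\sum fd^2}{n} - \left[\frac{\sum fd}{n}\right]^2}$, where d = X - A
(If d = (X - A)/c, the assumed mean result is multiplied by c.)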
Class assignment
Find the average deviation and standard deviation of
the following data:
Sales    No. of shops
10-20    3
20-30    6
30-40    11
40-50    3
50-60    2
Solution: Mean = 825/25 = 33
Sales    f    X    fX    |X-33|   f|X-33|   (X-33)^2   f(X-33)^2
10-20    3    15   45    18       54        324        972
20-30    6    25   150   8        48        64         384
30-40    11   35   385   2        22        4          44
40-50    3    45   135   12       36        144        432
50-60    2    55   110   22       44        484        968
Total    25        825            204                  2800
AD = 204/25 = 8.16, Variance = 2800/25 = 112, SD = 10.58
Class Assignment :SD from Assumed mean
Use the above method to find the SD of the following data
of 79 students
Marks    No. of students
0-10     18
10-20    16
20-30    15
30-40    12
40-50    10
50-60    5
60-70    2
70-80    1
Deviation d = (X - A)/c, A = 25, c = 10
Class    X    f    fX     X^2     fX^2    d    fd    d^2   fd^2
0-10     5    18   90     25      450     -2   -36   4     72
10-20    15   16   240    225     3600    -1   -16   1     16
20-30    25   15   375    625     9375    0    0     0     0
30-40    35   12   420    1225    14700   1    12    1     12
40-50    45   10   450    2025    20250   2    20    4     40
50-60    55   5    275    3025    15125   3    15    9     45
60-70    65   2    130    4225    8450    4    8     16    32
70-80    75   1    75     5625    5625    5    5     25    25
Total         79   2055   17000   77575        8           242
Deviation method
SD = 10 [(242/79) - (8/79)(8/79)]^1/2
SD = 10 (1.747) = 17.47
Direct method
V = [(77575/79) - (2055/79)(2055/79)]
SD = {981.96 - 676.66}^1/2 = {305.30}^1/2
= 17.47
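The agreement between the deviation method and the direct method can be checked with a short Python sketch. Both functions divide by n, as in the worked solution; the function names and the parallel-list layout are illustrative.

```python
import math

midpoints = [5, 15, 25, 35, 45, 55, 65, 75]
students = [18, 16, 15, 12, 10, 5, 2, 1]


def sd_assumed_mean(xs, fs, A, c):
    """c * sqrt(sum(f*d^2)/n - (sum(f*d)/n)^2), with d = (X - A) / c."""
    n = sum(fs)
    d = [(x - A) / c for x in xs]
    sum_fd = sum(f * di for f, di in zip(fs, d))
    sum_fd2 = sum(f * di * di for f, di in zip(fs, d))
    return c * math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2)


def sd_direct(xs, fs):
    """sqrt(sum(f*X^2)/n - (sum(f*X)/n)^2)."""
    n = sum(fs)
    sum_fx = sum(f * x for f, x in zip(fs, xs))
    sum_fx2 = sum(f * x * x for f, x in zip(fs, xs))
    return math.sqrt(sum_fx2 / n - (sum_fx / n) ** 2)


print(round(sd_assumed_mean(midpoints, students, 25, 10), 2))  # 17.47
print(round(sd_direct(midpoints, students), 2))                # 17.47
```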
Relationship between S.D. and other measures of dispersion
In a symmetric distribution
Q.D. = $\frac{2}{3}\sigma$
A.D. = $\frac{4}{5}\sigma$
$\bar{X} \pm$ Q.D. covers 50% of the observations
Coefficient of Variation
Example
Consider two Sales Persons working in the same
territory. The sales performance of these two in the
context of selling PCs are given below. Comment on the
results.
                                Sales Person 1   Sales Person 2
Mean Sales (One year average)   50 units         75 units
Standard Deviation              5 units          25 units
Interpretation for the Example
The coefficient of variation (CV) is the standard deviation
divided by the mean. It is 5/50 = 0.10 or 10% for Sales Person 1
and 25/75 = 0.33 or 33% for Sales Person 2. It
seems Sales Person 1 performs better than Sales
Person 2, with less relative dispersion or scattering.
Sales Person 2 has a very high departure or
standard deviation from his average sales
achievement. The moral of the story is "don't get
carried away by absolute numbers". Look at the
scatter. Even though Sales Person 2 has achieved a
higher average, his performance is not consistent
and seems erratic.
Example:Coefficient of Variation
Since mean and variance alone are not enough
to compare two groups of data, CV is
used to measure the relative spread of
the data
Two factories which have 50 and 100
employees have average wages of
Rs. 120 per day and Rs. 85 per day.
The variances of wages in the two
factories are 9 and 16 respectively.
Find which factory has more uniformity
in wages.
CV for factory A = 3/120 x 100 = 2.5
CV for factory B = 4/85 x 100 = 4.7
Factory A has more uniform wages
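Both coefficient-of-variation examples can be reproduced with a small Python sketch. The function name is illustrative; CV is expressed as a percentage, and the factory standard deviations are taken as the square roots of the stated variances.

```python
import math


def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100


# Sales persons: (mean sales, standard deviation)
print(round(coefficient_of_variation(5, 50), 1))    # 10.0
print(round(coefficient_of_variation(25, 75), 1))   # 33.3

# Factories: variance 9 and 16, so SD = 3 and 4.
cv_a = coefficient_of_variation(math.sqrt(9), 120)
cv_b = coefficient_of_variation(math.sqrt(16), 85)
print(round(cv_a, 1), round(cv_b, 1))               # 2.5 4.7
```

The smaller CV marks the more consistent group in each pair.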
Skewness
Measure of asymmetry of a frequency distribution
Skewed to left
Symmetric or unskewed
Skewed to right
Kurtosis
Measure of flatness or peakedness of a frequency
distribution
Platykurtic (relatively flat)
Mesokurtic (normal)
Leptokurtic (relatively peaked)
Skewness and Kurtosis
[Figures: Skewness: skewed to left, symmetric, skewed to right.
Kurtosis: platykurtic (flat distribution), mesokurtic (not too flat
and not too peaked), leptokurtic (peaked distribution).]
Skewness
(i) (Mean - Mode)/S.D.
(ii) 3(Mean - Median)/S.D.
(iii) Bowley's:
BS = (Q3 + Q1 - 2 Median)/(Q3 - Q1)
Kelly's:
KS = P50 - (P10 + P90)/2
BASED ON MOMENTS
BETA1 = (Mu3)^2/(Mu2)^3
Kurtosis
Kurtosis is measured by Beta2
Beta2= (Mu4)/ (Mu2)^2
Where Mu2= (1/N) Sum(X-mean)^2
And Mu4= (1/N) Sum (X-Mean)^4
Kurtosis
Platykurtic: Flat
Mesokurtic: Normal
Leptokurtic: Highly peaked
Beta2= Mu4/(Mu2)^2
Where Mu4= 1/n( Sum fd^4)
and Mu2= 1/n( Sum fd^2)
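The moment-based measures can be sketched in Python for raw data. This assumes Mu_r is the r-th central moment, (1/N) Sum (X - Mean)^r, in line with the definitions of Mu2 and Mu4 above; the sample data are hypothetical and chosen only to be symmetric.

```python
def central_moment(values, r):
    """r-th central moment: average of (X - mean)^r."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** r for x in values) / len(values)


def beta1(values):
    """Moment-based skewness measure: Mu3^2 / Mu2^3."""
    return central_moment(values, 3) ** 2 / central_moment(values, 2) ** 3


def beta2(values):
    """Moment-based kurtosis measure: Mu4 / Mu2^2."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


sample = [1, 2, 3, 4, 5]  # hypothetical symmetric data
print(beta1(sample))  # 0.0, as expected for symmetric data
print(beta2(sample))
```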
Chebyshev's Theorem
Applies to any distribution, regardless of shape
Places lower limits on the percentages of
observations within a given number of standard
deviations from the mean
Empirical Rule
Applies only to roughly mound-shaped and
symmetric distributions
Specifies approximate percentages of observations
within a given number of standard deviations from
the mean
Relations between the Mean and Standard Deviation
Chebyshev's Theorem
At least $\left(1 - \frac{1}{k^2}\right)$ of the elements of
any distribution lie within k standard
deviations of the mean
At least                                                             Lie within (standard deviations of the mean)
$1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4}$ = 75%            2
$1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9}$ = 89%            3
$1 - \frac{1}{4^2} = 1 - \frac{1}{16} = \frac{15}{16}$ = 94%         4
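The three rows of the table follow directly from the bound 1 - 1/k^2, as this Python sketch shows. The function name is illustrative; the guard for k <= 1 reflects that the bound says nothing useful there.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k standard deviations."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4):
    print(k, f"{chebyshev_lower_bound(k):.0%}")
# 2 75%
# 3 89%
# 4 94%
```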
Empirical Rule
For roughly mound-shaped and
symmetric distributions,
approximately:
68% lie within 1 standard deviation of the mean
95% lie within 2 standard deviations of the mean
All lie within 3 standard deviations of the mean
1-8 Methods of Displaying Data
Pie Charts: Categories represented as percentages of total
Bar Graphs: Heights of rectangles represent group frequencies
Frequency Polygons: Height of line represents frequency
Ogives: Height of line represents cumulative frequency
Time Plots: Represent values over time
Pie Chart
[Figure not reproduced.]
Bar Chart
[Fig. 1-11 Airline Operating Expenses and Revenues: average revenues and
average expenses by airline (American, Continental, Delta, Northwest,
Southwest, United, USAir).]
Frequency Polygon and Ogive
[Figures: relative frequency polygon and ogive, each plotted against Sales.]
Time Plot
[Figure: Monthly Steel Production (Problem 1-46), millions of tons by month.]
1-9 Exploratory Data Analysis - EDA
Techniques to determine relationships and
trends, identify outliers and influential
observations, and quickly describe or
summarize data sets.
Stem-and-Leaf Displays
Quick-and-dirty listing of all observations
Conveys some of the same information as a histogram
Box Plots
Median
Lower and upper quartiles
Maximum and minimum
Example 1-8: Stem-and-Leaf Display
1 122355567
2 0111222346777899
3 012457
4 11257
5 0236
6 02
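A display of this kind can be generated with a few lines of Python. This sketch assumes two-digit whole-number observations, with the tens digit as the stem and the units digit as the leaf; the function name and the short data list are illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return one text row per stem: the stem, a space, then the sorted leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return [
        f"{stem} {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]


observations = [23, 11, 35, 12, 20, 12, 41]  # hypothetical data
for row in stem_and_leaf(observations):
    print(row)
# 1 122
# 2 03
# 3 5
# 4 1
```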
Elements of a Box Plot
[Figure: box plot showing the median, Q1, Q3 and the interquartile range;
inner fences at Q1 - 1.5(IQR) and Q3 + 1.5(IQR); outer fences at
Q1 - 3(IQR) and Q3 + 3(IQR); whiskers ending at the smallest data point
not below the inner fence and the largest data point not exceeding the
inner fence; a suspected outlier and an outlier marked beyond the fences.]
Box Plot
Example: Box Plot
[Figure not reproduced.]
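The fence positions named in the box plot figure can be computed directly from Q1 and Q3. In this Python sketch the quartiles are passed in as arguments, and the labels follow the common convention that points between the inner and outer fences are suspected outliers while points beyond the outer fences are outliers; that mapping and all names are assumptions of the sketch.

```python
def box_plot_fences(q1, q3):
    """Inner fences at Q1 - 1.5 IQR and Q3 + 1.5 IQR; outer at +/- 3 IQR."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3 * iqr, q3 + 3 * iqr),
    }


def classify(value, q1, q3):
    """Label a data point relative to the inner and outer fences."""
    fences = box_plot_fences(q1, q3)
    inner_low, inner_high = fences["inner"]
    outer_low, outer_high = fences["outer"]
    if value < outer_low or value > outer_high:
        return "outlier"
    if value < inner_low or value > inner_high:
        return "suspected outlier"
    return "within inner fences"


# Hypothetical quartiles: Q1 = 10, Q3 = 20, so IQR = 10.
print(box_plot_fences(10, 20))  # inner (-5.0, 35.0), outer (-20, 50)
print(classify(40, 10, 20))     # suspected outlier
print(classify(60, 10, 20))     # outlier
```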
1-10 Using the Computer: The Template Output
Using the Computer: Template Output for the Histogram
Using the Computer: Template Output for Histograms for Grouped Data
Using the Computer: Template Output for Frequency Polygons & the Ogive for Grouped Data
Using the Computer: Template Output for Two Frequency Polygons for Grouped Data
Using the Computer: Pie Chart Template Output
Using the Computer: Bar Chart Template Output
Using the Computer: Box Plot Template Output
Using the Computer: Box Plot Template to Compare Two Data Sets
Using the Computer: Time Plot Template
Using the Computer: Time Plot Comparison Template