
Handouts 04: Data Description (1)

Handout 04
Contents
Organization and Interpretation of data: Frequency distribution, graphical representation, Histogram, frequency curve
and Ogive. Central Measures: Arithmetic Mean, Geometric Mean, Harmonic Mean, Median, Mode, Quartiles, Deciles and
Percentiles for grouped and ungrouped data. Dispersion measures: variance, standard deviation, mean deviation, coefficient of
variation, Skewness.
Objectives
After careful study of this chapter, students should be able to compute and interpret the central measures and the
measures of dispersion.
References
1. Introduction to Statistical Theory, Shehzad Ahmad and Sher Muhammad Ch.
2. Elementary Statistics, 7th Edition, Allan G. Bluman
3. Statistics for Management, 7th Edition, Richard Levin and David Rubin
4. Statistics for Business and Economics, 10th Edition, David R. Anderson, Dennis J. Sweeney and Thomas A. Williams

Data Description
There are three main tasks in descriptive statistics: (i) collection and organization, (ii) analysis,
and (iii) interpretation of data.
(i) Collection and Organization of Data:
Graphically: through the use of charts and graphs
Numerically: through the use of tables of data
(ii) Analysis of Data:
Once the data is organized, we can go ahead and compute various quantities (called statistics or
parameters) associated with the data.
(iii) Interpretation of Data:
Once we have performed the analysis, we can use the information to make assertions about the real world
Samples versus Population
The term "population" is used in statistics to represent all possible measurements or outcomes
that are of interest to us in a particular study. The term "sample" refers to a portion of the population that
is representative of the population from which it was selected.
In order to use statistics to learn things about the population, the sample must be random. A
random sample is one in which every member of a population has an equal chance of being selected. The
most commonly used sample is a simple random sample. It requires that every possible sample of the
selected size has an equal chance of being used.
A parameter is a characteristic of a population. A statistic is a characteristic of a sample.
Inferential statistics enables you to make an educated guess about a population parameter based on a
statistic computed from a sample randomly drawn from that population.

Muhammad Naeem Sandhu, Assistant Professor, Department of Mathematics, University of Engineering and Technology, Lahore

Statistical procedures
Statistical procedures can be divided into two major categories: descriptive statistics and
inferential statistics.
(i) Descriptive Statistics

Descriptive statistics includes statistical procedures that we use to describe the population. The
data could be collected from either a sample or a population, but the results help us organize and describe
data. Descriptive statistics can only be used to describe the group that is being studied. Frequency
distributions, measures of central tendency (mean, median, and mode), and graphs like pie charts and bar
charts that describe the data are all examples of descriptive statistics.
(ii) Inferential Statistics
Inferential statistics is concerned with making predictions or inferences about a population from
observations and analysis of a sample. Regression analysis, tests of hypothesis and significance, and
analysis of variance are examples of inferential statistics.
(A) Frequency Distribution
The main object of descriptive statistics is to put the information contained in a set of data into a
more useable form.
By condensing the raw data into tabular form, we distribute the data into classes or categories
and determine the number of individuals belonging to each class, called the class frequency. A tabular
arrangement of data by classes together with the corresponding class frequencies is called a frequency
distribution or frequency table or categorical data. We can also use relative frequency and percentage
frequency in a frequency distribution, where
$$\text{relative frequency} = \frac{\text{frequency}}{n}, \qquad \text{percent frequency} = 100 \times \text{relative frequency}$$
Examples (1)
Thirty batteries were tested to determine how long they would last. The results, to the nearest
minute, were recorded as:
423, 369, 387, 411, 393, 394, 371, 377, 389, 409, 392, 408, 431, 401, 363, 391, 405, 382, 400,
381, 399, 415, 428, 422, 396, 372, 410, 419, 386, 390
Construct a frequency distribution table.
Solution
The lowest value is 363 and the highest is 431. Using the given data and a class interval of 10, the
interval for the first class is 360 to 369 and includes 363 (the lowest value). Remember, there should
always be enough class intervals so that the highest value is
included. The completed frequency distribution table should
look like this:
Life of batteries in minutes:
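The tally for the battery data can be checked with a short Python sketch. It assumes class limits 360-369, 370-379, and so on (width 10, starting at 360, as stated in the solution); the variable names are illustrative only.

```python
# Battery lives (minutes) from Examples (1).
lives = [423, 369, 387, 411, 393, 394, 371, 377, 389, 409,
         392, 408, 431, 401, 363, 391, 405, 382, 400, 381,
         399, 415, 428, 422, 396, 372, 410, 419, 386, 390]

width = 10          # class interval given in the solution
first_lower = 360   # lower limit of the first class (taken from the text)

n = len(lives)
# Keep adding classes until the highest value is covered.
lowers = list(range(first_lower, max(lives) + 1, width))

table = []
for lo in lowers:
    hi = lo + width - 1                       # upper class limit, e.g. 369
    freq = sum(lo <= x <= hi for x in lives)  # class frequency
    rel = freq / n                            # relative frequency
    table.append((lo, hi, freq, rel, 100 * rel))

print("Class     Freq  Rel.freq  Percent")
for lo, hi, freq, rel, pct in table:
    print(f"{lo}-{hi}  {freq:4d}  {rel:8.3f}  {pct:7.1f}")
```

The class frequencies add up to 30 and the relative frequencies add up to 1, which is a quick check that no observation was missed or counted twice.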


Examples (2)

These data represent the record high temperatures in degrees Fahrenheit (°F) for each of the 50
states. Construct a grouped frequency distribution for the data using 7 classes.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
Source: The World Almanac and Book of Facts
Example 2-2 page 41 Elementary Statistics by Bluman
Solution

Examples (3)
These data represent the record high temperatures in degrees Fahrenheit (°F) for each of the 50
states. Construct a grouped frequency distribution for the data using 7 classes.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
Source: The World Almanac and Book of Facts


Examples (4)
The data shown here represent the number of miles per gallon (mpg) that 30 selected four-wheel-
drive sports utility vehicles obtained in city driving. Construct a frequency distribution, and analyze the
distribution.
12 17 12 14 16 18 16 18 12 16 17 15 15 16 12
15 16 16 12 14 15 12 15 15 19 13 16 18 16 14
Source: Model Year Fuel Economy Guide. United States
Environmental Protection Agency.
The complete ungrouped frequency distribution is

In this case, almost one-half (14) of the vehicles get 15 or 16 miles per gallon.
The cumulative frequencies are:
Cumulative frequency
Less than 11.5 0
Less than 12.5 6
Less than 13.5 7
Less than 14.5 10
Less than 15.5 16
Less than 16.5 24
Less than 17.5 26
Less than 18.5 29
Less than 19.5 30
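The ungrouped frequencies and the cumulative frequencies listed above can be reproduced from the raw mpg data. The following is a minimal Python sketch; the variable names are illustrative, and "less than v + 0.5" is used as the upper class boundary of the value v, as in the list above.

```python
from collections import Counter
from itertools import accumulate

# Miles per gallon for the 30 vehicles in Examples (4).
mpg = [12, 17, 12, 14, 16, 18, 16, 18, 12, 16, 17, 15, 15, 16, 12,
       15, 16, 16, 12, 14, 15, 12, 15, 15, 19, 13, 16, 18, 16, 14]

counts = Counter(mpg)                    # frequency of each distinct value
values = sorted(counts)                  # 12, 13, ..., 19
freqs = [counts[v] for v in values]      # class frequencies
cum_freqs = list(accumulate(freqs))      # running totals

for v, f, cf in zip(values, freqs, cum_freqs):
    print(f"{v} mpg: frequency {f}, less than {v + 0.5}: {cf}")
```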
Exercise (1)
The number of passengers (in thousands) for the leading U.S. passenger airlines in 2004 is
indicated below. Use the data to construct a grouped frequency distribution and a cumulative frequency
distribution with a reasonable number of classes and comment on the shape of the distribution.
91,570 86,755 81,066 70,786 55,373
42,400 40,551 21,119 16,280 14,869
13,659 13,417 13,170 12,632 11,731
10,420 10,024 9,122 7,041 6,954
6,406 6,362 5,930 5,585 5,427


(B) Graphical Representation of Frequency Distribution


(1) Bar Charts
Bar graphs/charts provide a visual presentation of categorical data. Categorical data is a grouping
of data into discrete groups, such as months of the year, age group, shoe sizes, and animals. These
categories are usually qualitative. In a column bar chart, the categories appear along the horizontal axis;
the height of the bar corresponds to the value of each category.
For Example: The amount of sugar in 7 different foods was measured as a percent. The data is
summarized in the bar graph below.

(2) Pareto Charts


A Pareto chart is a bar graph. The lengths of the bars represent frequency or cost (time or money),
and are arranged with longest bars on the left and the shortest to the right. In this way the chart visually
depicts which situations are more significant.
We use Pareto charts when analyzing data about the frequency of problems or causes in a
process. For example, we may want to show customer complaints received in each of five categories.

The Pareto chart is a simple-to-use and powerful graphic for identifying where the majority of
problems in a process originate.


(3) Histogram
A histogram is a bar graph of raw data that creates a picture of the data distribution. The bars
represent the frequency of occurrence by classes of data. A histogram shows basic information about the
data set, such as central location, width of spread etc.
Histograms show how data can pile up; in any distribution of values, some values will occur
more frequently than others. The peaks on the histogram show where there is similarity among the data.
This is the central location, which is measured by mean, median, and mode. While these statistics provide
valuable information about the process, central location alone does not provide a complete picture of the
process. When you consider the spread of the data, you will see its extremes. The shape of the histogram
can show if the system leans toward one extreme or the other, or if there are multiple peaks.
When you use a histogram for prediction, the system must be stable. If not, the central location,
spread, and shape may vary dramatically in histograms created from data taken at different times and will
not be an accurate reflection of the process. If you are not using histograms to make predictions, stability
is not required.
We can construct histogram by taking class boundaries along x-axis and frequency along y-axis,
then constructing rectangular bars against each class boundary with a height according to the
corresponding frequency.
Examples (5)
Using the data given in Examples (1), we can construct a histogram by taking class boundaries along the
x-axis and frequency along the y-axis, then constructing a rectangular bar over each class with a
height equal to the corresponding frequency.
Further, by joining the midpoints of the tops of all the rectangular bars with a smooth curve, we
obtain a frequency curve, as shown in the figure. The smooth curve need not pass through all
the points.

(4) The Ogive


The third type of graph that can be used represents the cumulative frequencies for the classes.
This type of graph is called the cumulative frequency graph, or ogive. The cumulative frequency is the
sum of the frequencies accumulated up to the upper boundary of a class in the distribution.
Now, taking class boundaries along the x-axis and cumulative frequency along the y-axis and
constructing rectangular bars, we obtain a cumulative frequency histogram; joining the midpoints
of the tops of the bars with a smooth curve gives the cumulative frequency curve (or ogive).


If we find out mid points of each class limit / class boundary, then draw the smooth curve for the
cumulative frequency against the midpoints then the diagram would be as follows:
[Figure: Cumulative Frequency Curve or Ogive. x-axis: Mid Points; y-axis: C.F.]

(5) Relative Frequency Distribution


A relative frequency distribution presents frequencies in terms of fractions or percentages. We
obtain relative frequency by dividing each frequency by the total frequency in the data set.
If the bars in a relative frequency histogram are of equal width, the area of a particular bar is
proportional to the corresponding class relative frequency. If we let the total area of the bars equal to one,
then the area of a particular bar is equal to its corresponding class relative frequency.
e.g. Relative frequency of average inventory (in days) for 20 stores is given below.

Classes      Frequency    Relative Frequency
2.0-2.5          1              0.05
2.6-3.1          0              0.00
3.2-3.7          2              0.10
3.8-4.3          8              0.40
4.4-4.9          5              0.25
5.0-5.5          4              0.20
Total           20              1.00

Some conclusions:
The frequency of an average inventory of 4.4 to 4.9 days is 5.
The relative frequency of an average inventory of 4.4 to 4.9 days is 0.25.


Examples (6)
Construct a histogram and Ogive to represent the data shown for the record high temperatures for
each of the 50 states.

Classes      100-104  105-109  110-114  115-119  120-124  125-129  130-134
Frequency       2        8       18       13        7        1        1

The histogram and ogive are given below.
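The points needed to draw the ogive for this table can be computed as follows. This is only a sketch: it assumes each cumulative frequency is plotted at the upper class boundary (class limit + 0.5), in line with the definition of cumulative frequency given for the ogive, and it adds a starting point of zero at the first lower boundary. All names are illustrative.

```python
from itertools import accumulate

# Class limits and frequencies for the record high temperatures.
lower_limits = [100, 105, 110, 115, 120, 125, 130]
upper_limits = [104, 109, 114, 119, 124, 129, 134]
frequencies = [2, 8, 18, 13, 7, 1, 1]

# Class boundaries lie halfway between adjacent class limits.
lower_boundaries = [lo - 0.5 for lo in lower_limits]
upper_boundaries = [hi + 0.5 for hi in upper_limits]

# Cumulative frequency up to each upper boundary.
cumulative = list(accumulate(frequencies))

# Ogive points: (boundary, cumulative frequency), starting from zero.
ogive_points = [(lower_boundaries[0], 0)] + list(zip(upper_boundaries, cumulative))

for boundary, cf in ogive_points:
    print(f"less than {boundary}: {cf}")
```

Plotting these points and joining them gives the cumulative frequency curve; the last point always equals the total number of observations (50 here).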

Examples (7)
Here is a frequency distribution of the weights of 150 people who used a ski lift on a certain day.
Construct a histogram for these data.
Class Frequency Class Frequency
75-89 10 150-164 23
90-104 11 165-179 9
105-119 23 180-194 9
120-134 26 195-209 6
135-149 31 210-224 2
(a) What can you see from the histogram about the data that was not immediately apparent
from the frequency distribution?
(b) If each ski lift chair holds two people but is limited in total safe weight capacity to 400
pounds, what can the operator do to maximize the people capacity of the ski lift without
exceeding the safe weight capacity of a chair? Do the data support your proposal?
Solution
(a) The lower tail of the distribution is fatter (has more observations in it) than the upper tail.
(b) Because there are so few people who weigh 180 pounds or more, the operator can afford to
pair each person who appears to be heavy with a lighter person. This can be done without
greatly delaying any individual's turn at the lift.
Exercise (2)
The number of passengers (in thousands) for the leading U.S. passenger airlines in 2004 is
indicated below. Use the data to construct a grouped frequency distribution and a cumulative frequency
distribution with a reasonable number of classes and comment on the shape of the distribution.


91,570 86,755 81,066 70,786 55,373 42,400 40,551 21,119


16,280 14,869 13,659 13,417 13,170 12,632 11,731 10,420
10,024 9,122 7,041 6,954 6,406 6,362 5,930 5,585 5,427
Chap 2, Ex 2.1, prob. 12 Elementary Statistics by Bluman
Examples (8) (Histogram for unequal class intervals)
A company manufactures metal rods in different lengths. The table given below shows
information on a day's production of the company.

Length (cm)                     10-20  20-30  30-40  40-50  50-70  70-100  100-140
No. of metal rods (Frequency)     6      7      8     10     10      9       8

The first four intervals are of equal size, but the sizes of the 5th, 6th and 7th are unequal.
In such cases we find a proportional height for the rectangular bars, that is, the frequency divided
by the class width measured in units of 10 cm. So we construct the table as follows:

Class Boundaries   Frequency   Width of Class (in units)   Proportional Height
10-20                  6                  1                         6
20-30                  7                  1                         7
30-40                  8                  1                         8
40-50                 10                  1                        10
50-70                 10                  2                         5
70-100                 9                  3                         3
100-140                8                  4                         2
Now we construct histogram by taking class boundaries along x-axis and proportional height
along y-axis.
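The proportional heights can be computed directly from the class boundaries and frequencies. The sketch below assumes the narrowest class width (10 cm) is taken as one unit, which matches the table; the names are illustrative.

```python
# Class boundaries (cm) and frequencies for the metal rods example.
boundaries = [10, 20, 30, 40, 50, 70, 100, 140]
frequencies = [6, 7, 8, 10, 10, 9, 8]
unit = 10  # narrowest class width, used as one unit of width

heights = []
for lo, hi, f in zip(boundaries, boundaries[1:], frequencies):
    width_units = (hi - lo) / unit     # class width in units of 10 cm
    heights.append(f / width_units)    # proportional height of the bar

for lo, hi, h in zip(boundaries, boundaries[1:], heights):
    print(f"{lo}-{hi}: height {h:g}")
```

With these heights the area of each bar (height times width in units) equals the class frequency, so wide classes are not visually exaggerated.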

Exercises

(1) We have a sample of size 50 given by


2 3 9 0 4 4 1 5 4 8
5 3 6 6 0 2 2 7 6 4
8 4 3 3 1 0 8 7 5 1
3 4 7 2 4 7 5 2 6 3
1 7 5 4 6 4 2 5 3 4
Construct frequency distribution (a frequency table), a histogram, frequency curve,
cumulative frequency Histogram, cumulative frequency curve (or Ogive).


(2) The following is a frequency distribution of students of different ages, construct a histogram
Ages 18-19 20-24 25-29 30-34 35-44 45-59
No. 9 188 160 123 84 15
(3) Here are the ages of 30 people who bought video recorders at Liberty Music Shop last week:

26 37 40 18 14 45 32 68 31 37

20 32 15 27 46 44 62 58 30 42

22 26 44 41 34 55 50 63 29 22

(a) From looking at the data just as they are, what conclusions can you come to quickly about
Liberty's market?
(b) Construct a 6-category closed classification. Does having this enable you to conclude
anything more about Liberty's market?
(4) At a newspaper office, the time required to set the entire front page in type was recorded for 50
days. The data, to the nearest tenth of a minute, are given below.

20.8 22.8 21.9 22.0 20.7 20.9 25.0 22.2 22.8 20.1

25.3 20.7 22.5 21.2 23.8 23.3 20.9 22.9 23.5 19.5

23.7 20.3 23.6 19.0 25.1 25.0 19.5 24.1 24.2 21.8

21.3 21.5 23.1 19.9 24.2 24.1 19.8 23.9 22.8 23.9

19.7 24.2 23.8 20.7 23.8 24.3 21.1 20.9 21.6 22.7

(a) Arrange the data in an array from lowest to highest.


(b) Construct a frequency distribution and a less than cumulative frequency distribution from
the data, using the interval of 0.8 minute.
(c) Construct a frequency polygon from the data.
(d) Construct a less than ogive from the data.
(e) From your ogive, estimate what percentage of the time the front page can be set in less than
24 minutes.
(5) A department of agriculture has these data representing weekly growth (in inches) on samples of
newly planted spring corn:

0.4 1.9 1.5 0.9 0.3 1.6 0.4 1.5 1.2 0.8

0.9 0.7 0.9 0.7 0.9 1.5 0.5 1.5 1.7 1.8

(a) Arrange the data in an array from highest to lowest.


(b) Construct a relative frequency distribution using intervals of 0.25.
(c) From what you have done so far, what conclusions can you come to about growth in this
sample?
(d) Construct an ogive that will help you determine what proportion of the corn grew at more
than 1.0 inch a week.
(e) What was the approximate weekly growth rate of the middle item in the data array?


(6) The administrator of a hospital has ordered a study of the amount of time a patient must wait before
being treated by emergency room personnel. The following data were collected during a typical
day.
12 16 21 20 24 3 11 17 29 18
26 4 7 14 25 1 27 15 16 5
(a) Arrange the data in an array from lowest to highest. What comment can you make about
patient waiting time from your data array?
(b) Now construct a frequency distribution using 6 classes. What additional interpretation can
you give to the data from the frequency distribution?
(c) From an ogive, state how long 75 percent of the patients should expect to wait based on data?
(7) The Bureau of Labor Statistics has sampled 30 communities nationwide and compiled prices in each
community at the beginning and end of August in order to find out approximately how the
Consumer Price Index has changed during August. The percentage changes in prices for the 30
communities are as follows (Ref. Ex. 2.19, Statistics for Management, 7th Edition, by Levin and Rubin):

0.7 0.4 0.3 0.2 0.1 0.1 0.3 0.7 0.0 0.4
0.1 0.5 0.2 0.3 1.0 0.3 0.0 0.2 0.5 0.1
0.5 0.3 0.1 0.5 0.4 0.0 0.2 0.3 0.5 0.4

(a) Arrange the data in an array from lowest to highest.


(b) Using the following four equal-sized classes, create a frequency distribution: -0.5 to -0.2,
-0.1 to 0.2, 0.3 to 0.6 and 0.7 to 1.0.
(c) How many communities had prices that either did not change or that increased less than
1.0 percent?
(d) Are these data discrete or continuous?
(8) The following data are presented on the motor fuel octane ratings of several blends of gasoline:
88.5 94.7 84.3 90.1 89.0 89.8 91.6 90.3 90.0 91.5 89.9 98.8 88.3 90.4 91.2
90.6 92.2 87.7 91.1 86.7 93.4 96.1 89.6 90.4 91.6 90.7 88.6 88.3 94.2 85.3
90.1 89.3 91.1 92.2 83.4 91.0 88.2 88.5 93.3 87.4 91.1 90.5 100.3 87.6 92.7
98.7 93.0 94.4 90.4 91.2 86.7 94.2 90.8 90.1 91.8 88.4 92.6 93.7 96.5 84.3
93.2 88.6 88.7 92.7 89.3 91.0 87.5 87.8 88.3 89.2 88.9 89.8 92.7 93.3 86.7
91.0 90.9 89.9 91.8 89.7 92.2
Construct a histogram with 8 class intervals. (Montgomery, Exercise 6.3.14)
(9) In a group of 500 wage-earners, the weekly wages of 4% were under Rs.60 and those of 15%
were under Rs.62.50. 15% of the workers earned Rs.95 and over, and 5% of them got Rs.100 and
over. The median and quartile wages were Rs.82.25, Rs.72.75 and Rs.90.50; the 4th and 6th decile
wages were Rs.78.75 and Rs.85.25 respectively. Put the above information in the form of a
frequency distribution and estimate the mean wages of the 500 wage-earners therefrom.


(C) Averages
The following average measures are also called the central measures
(i) Arithmetic Mean
(ii) Geometric Mean
(iii) Harmonic Mean
(1) Arithmetic Mean
The Arithmetic mean or simply the mean is the most familiar average. It is defined as
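$$\text{Mean} = \frac{\text{Sum of all the observations}}{\text{Number of the observations}}$$
For ungrouped data,
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}, \quad (i = 1, 2, \ldots, n)$$
For grouped data,
$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum f_i x_i}{\sum f_i}, \quad \left(n = \sum f_i\right)$$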
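As a small check of the grouped-data formula, the Python sketch below applies it to the mpg data of Examples (4). The frequencies are the ones implied by the cumulative frequencies listed in that example; the variable names are illustrative.

```python
# Distinct mpg values (x_i) and their frequencies (f_i) from Examples (4).
values = [12, 13, 14, 15, 16, 17, 18, 19]
freqs = [6, 1, 3, 6, 8, 2, 3, 1]

n = sum(freqs)  # total number of observations, n = sum of f_i

# Grouped-data formula: sum(f_i * x_i) / sum(f_i).
grouped_mean = sum(f * x for f, x in zip(freqs, values)) / n

# Expanding the table back into raw observations gives the same mean
# via the ungrouped formula sum(x_i) / n.
raw = [x for x, f in zip(values, freqs) for _ in range(f)]
ungrouped_mean = sum(raw) / len(raw)

print(n, round(grouped_mean, 2), round(ungrouped_mean, 2))
```

Both formulas agree here because each class contains a single value. When a class is an interval, the class midpoint is normally used for x_i and the grouped mean is only an approximation of the mean of the raw data.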
Advantages of Arithmetic Mean
Its concept is familiar to most people and intuitively clear.
It is a measure that can be calculated, and it is unique because every data set has one and only one mean.
The mean is useful for performing statistical procedures such as comparing the means from several
data sets.
Disadvantages of Arithmetic Mean
It may be affected by extreme values that are not representative of the rest of the data, e.g. the
mean of the values 4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0 is 5.3. But if we exclude the value 9.0, the
answer is about 4.7. The one extreme value 9.0 distorts the value we get for the mean.
It can sometimes be time-consuming to compute.
We are unable to compute the mean for data with open-ended classes.
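The effect of the extreme value 9.0 in the example above can be checked with a few lines of Python. This is only an illustrative sketch using the standard library.

```python
from statistics import mean

values = [4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0]

mean_all = mean(values)                # mean including the extreme value
mean_without = mean(values[:-1])       # mean after dropping 9.0

print(round(mean_all, 1), round(mean_without, 1))
```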
Properties
Mean(a) = a, where a is a constant
Mean(X ± a) = Mean(X) ± a
Mean(bX) = b Mean(X)
The sum of the deviations from the mean is equal to zero.

For two sets of data with $n_1$ and $n_2$ values and means $\bar{X}_1$ and $\bar{X}_2$ respectively,
the joint mean is
$$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$$
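The last two properties follow directly from the definition of the mean. A short derivation is sketched below, using the same symbols as above; the explicit index i running over the n observations is the only added notation.

```latex
% Sum of deviations from the mean: since \sum x_i = n\bar{x},
\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0

% Joint mean: the two sets have totals n_1\bar{X}_1 and n_2\bar{X}_2, so
\bar{X} = \frac{\text{total of all values}}{\text{number of values}}
        = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2}
```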
Exercise
(1) Find the arithmetic mean, geometric mean and harmonic mean of the series
(i) $1, 2, 4, 8, 16, \ldots, 2^n$
(ii) $1, 3, 9, 27, 81, \ldots, 3^n$. (Sher)
(2) Find the average rate of
a. motion in case of a person who rides the first mile at the rate of 10 miles an hour, the next
mile at the rate of 8 miles per hour and the third mile at the rate of 6 miles per hour.
b. Increase in the population, which in the first decade has increased 20%, in the next 25%
and in the third 44%.
Problem 4-108 Elementary Statistics by Bluman, chapter 3, page 122


(2) The Weighted Mean


The weighted mean enables us to calculate an average that takes into account the importance of
each value to the overall total.
Examples (9)
The following table shows that a company uses three grades of labour- unskilled, semiskilled and
skilled- to produce two end products. The company wants to know the average cost of labour per hour for
each of the products.

Labor input in manufacturing process

                                   Labor Hours per Unit of Output
Grade of Labor    Hourly Wages     Product 1      Product 2
Unskilled            $5.00             1              4
Semiskilled          $7.00             2              3
Skilled              $9.00             5              3

A simple arithmetic average of the labor wage rates would be
$$\bar{x} = \frac{\sum x_i}{n} = \frac{\$5 + \$7 + \$9}{3} = \frac{\$21}{3} = \$7.00 \text{ / hour}$$
Using this average rate, we would compute the labor cost of one unit of product 1 to be $7 × (1 + 2
+ 5) = $56 and of one unit of product 2 to be $7 × (4 + 3 + 3) = $70. But these answers are incorrect.
To be correct, the answers must take into account that different amounts of each grade of labor are
used. We can determine the correct answers in the following manner.
For product 1, the total labor cost per unit is ($5 × 1) + ($7 × 2) + ($9 × 5) = $64, and since there are
8 hours of labor input, the average labor cost per hour is $64/8 = $8.00 per hour.
For product 2, the total labor cost per unit is ($5 × 4) + ($7 × 3) + ($9 × 3) = $68, and since there are
10 hours of labor input, the average labor cost per hour is $68/10 = $6.80 per hour.
Another way to calculate the correct average cost per hour for the two products is to take a
weighted average of the cost of the three grades of labor. To do this, we weight the hourly wage for each
grade by its proportion of the total labor required to produce the product.
One unit of product 1 requires 8 hours of labor. Unskilled labor uses 1/8 of this time, semiskilled
labor uses 2/8 of this time, and skilled labor uses 5/8 of this time. If we use these fractions as our
weights, then one hour of labor for product 1 costs an average of
$$\left(\tfrac{1}{8} \times \$5\right) + \left(\tfrac{2}{8} \times \$7\right) + \left(\tfrac{5}{8} \times \$9\right) = \$8.00 \text{ / hour}$$
Similarly, one hour of labor for product 2 costs an average of
$$\left(\tfrac{4}{10} \times \$5\right) + \left(\tfrac{3}{10} \times \$7\right) + \left(\tfrac{3}{10} \times \$9\right) = \$6.80 \text{ / hour}$$
We see that weighted average gives correct value for the average hourly labor costs of two
products because it takes into account that different amounts of each grade of labor are used in the
products.
The formula for calculating the weighted average is
$$\bar{x}_w = \frac{\sum (w \, x_i)}{\sum w}$$
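The weighted average of the labor example can be sketched in Python as follows. The function name weighted_mean is hypothetical (it is not part of any library), and the weights are the labor hours used in the worked calculation: 1, 2 and 5 hours for product 1, and 4, 3 and 3 hours for product 2.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

wages = [5.00, 7.00, 9.00]        # unskilled, semiskilled, skilled ($/hour)
hours_product_1 = [1, 2, 5]       # labor hours per unit of product 1
hours_product_2 = [4, 3, 3]       # labor hours per unit of product 2

avg_1 = weighted_mean(wages, hours_product_1)
avg_2 = weighted_mean(wages, hours_product_2)
simple = sum(wages) / len(wages)  # unweighted average, for comparison

print(f"simple: {simple:.2f}, product 1: {avg_1:.2f}, product 2: {avg_2:.2f}")
```

Using the hours themselves as weights gives the same result as using the fractions 1/8, 2/8, 5/8, because dividing every weight by the same total does not change the ratio.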


Exercise
(1) A salesperson drives 300 miles round trip at 30 miles per hour going to Chicago and 45 miles per
hour returning home. Find the average miles per hour.
(2) A bus driver drives 50 miles to West Chester at 40 miles per hour and returned driving 25 miles per
hour. Find the average miles per hour.
(3) A carpenter buys $500 worth of nails at $50 per pound and $500 worth of nails at $10 per pound.
Find the average cost of 1 pound of nails.
(4) The following are the monthly salaries in rupees of 30 employees of a firm:

139 126 114 100 88 62 77 99 103 108


144 129 148 63 69 148 132 118 142 116
123 104 95 80 85 106 123 140 134 133

The firm gave bonuses of Rs. 10, 15, 20, 25, 30 and 35 to individuals in the respective salary
groups: exceeding 60 but not exceeding 75, exceeding 75 but not exceeding 90, and so on up to
exceeding 135 but not exceeding 150. Find the average bonus paid per employee.
Examples (10)
Dave's Giveaway Store advertises, "If our average prices are not equal or lower than everyone
else's, you get it free." One of Dave's customers came into the store one day and threw on the counter
bills of sale for six items she bought from a competitor for an average price less than Dave's. (Statistics
for Management, 7th Ed., by Richard Levin and David Rubin, Chap. 3)
The items cost:
$1.29, $2.97, $3.49, $5.00, $7.50, $10.95
Dave's prices for the same six items are:
$1.35, $2.89, $3.19, $4.98, $7.59, $11.50
Dave told the customer, "My ad refers to a weighted average price of these items. Our average is lower
because our sales of these items have been
7, 9, 12, 8, 6, 3."
Is Dave getting himself into or out of trouble by talking about weighted averages?
Solution
With the unweighted average, we get
$$\bar{x}_C = \frac{\sum x_i}{n} = \frac{1.29 + 2.97 + 3.49 + 5.00 + 7.50 + 10.95}{6} = \frac{31.20}{6} = \$5.20 \text{ at the competition}$$
$$\bar{x}_D = \frac{\sum x_i}{n} = \frac{1.35 + 2.89 + 3.19 + 4.98 + 7.59 + 11.50}{6} = \frac{31.50}{6} = \$5.25 \text{ at Dave's}$$
With the weighted average,
$$\bar{x}_C = \frac{\sum (w \, x_i)}{\sum w} = \frac{7(1.29) + 9(2.97) + 12(3.49) + 8(5.00) + 6(7.50) + 3(10.95)}{7 + 9 + 12 + 8 + 6 + 3} = \frac{195.49}{45} = \$4.344 \text{ at the competition}$$
$$\bar{x}_D = \frac{\sum (w \, x_i)}{\sum w} = \frac{7(1.35) + 9(2.89) + 12(3.19) + 8(4.98) + 6(7.59) + 3(11.50)}{7 + 9 + 12 + 8 + 6 + 3} = \frac{193.62}{45} = \$4.303 \text{ at Dave's}$$
Although Dave is technically correct, the word "average" in popular usage is equivalent to "unweighted
average" in technical usage, and the typical customer will surely be angry with Dave's assertion (whether
he or she understands the technical point or not).
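The two kinds of average in this example can be recomputed side by side. The Python sketch below uses hypothetical helper names (simple_mean, weighted_mean) and takes Dave's unit sales as the weights for both stores, as in the solution.

```python
def simple_mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

competitor = [1.29, 2.97, 3.49, 5.00, 7.50, 10.95]   # competitor's prices ($)
daves = [1.35, 2.89, 3.19, 4.98, 7.59, 11.50]        # Dave's prices ($)
units_sold = [7, 9, 12, 8, 6, 3]                     # Dave's sales, used as weights

unweighted_c = simple_mean(competitor)
unweighted_d = simple_mean(daves)
weighted_c = weighted_mean(competitor, units_sold)
weighted_d = weighted_mean(daves, units_sold)

print(f"unweighted: competitor {unweighted_c:.2f}, Dave's {unweighted_d:.2f}")
print(f"weighted:   competitor {weighted_c:.3f}, Dave's {weighted_d:.3f}")
```

The unweighted means favour the competitor, while the weighted means favour Dave's, because Dave's lower prices fall on the items that sell most often.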


Examples (11)
Bennett Distribution Company, a subsidiary of major appliance manufacturer, is forecasting
regional sales for the next year. The Atlantic branch, with current yearly sales of $193.8 million, is
expected to achieve a sales growth of 7.25 percent; the Midwest branch, with current sales of $79.3
million, is expected to grow by 8.20 percent; and the Pacific branch, with sales of $57.5 million, is
expected to increase sales by 7.15 percent. What is the average rate of sales growth forecasted for next
year? (Statistics for Management, 7th Ed, by Richard Levin and David Rubin Chap 3)
Solution
$$\bar{x}_w = \frac{\sum (w \, x_i)}{\sum w} = \frac{193.8(7.25) + 79.3(8.20) + 57.5(7.15)}{193.8 + 79.3 + 57.5} = \frac{2466.435}{330.6} = 7.46\%$$
Exercise ( Bluman )
1. Find the weighted mean price of three models of automobiles sold. The number and price of
each model sold are shown in this list.
Model Number Price
A 8 $10,000
B 10 $12,000
C 12 $8,000
2. Using the weighted mean, find the average number of grams of fat per ounce of meat or fish that
a person would consume over a 5 day period if he ate these:
Meat or Fish Fat (g/oz)
3 oz fried shrimp 3.33
3 oz veal cutlet (broiled) 3.00
2 oz roast beef (lean) 2.50
2.5 oz fried chicken drumstick 4.40
4 oz tuna (canned in oil) 1.75
Source:- The World Almanac and Book of Facts
3. A recent survey of a new diet cola reported the following percentages of people who liked the
taste. Find the weighted mean of the percentages.
Area    % favored    Number Surveyed
1          40            1000
2          30            3000
3          50             800
4. The costs of three models of helicopters are shown below. Find the weighted mean of the costs of
the models
Model Number sold Cost
Sunscraper 9 $ 427,000
Skycoaster 6 $ 365,000
High-flyer 12 $ 725,000
5. An instructor grades exams, 20%; term paper, 30%; final exam, 50%. A student had grades of 83,
72, and 90, respectively, for exams, term paper, and final exam. Find the student's final average.
Use the weighted mean.
6. Another instructor gives four 1-hour exams and one final exam, which counts as two 1-hour
exams. Find a student's grade if she received 62, 83, 97, and 90 on the 1-hour exams and 82 on the
final exam.


(3) Geometric Mean


Sometimes when we are dealing with quantities that change over a period of time, we need to
know an average rate of change, such as an average growth rate over a period of several years. In such a
case, the simple arithmetic mean is inappropriate because it gives the wrong answer. What we need to find is
the geometric mean. The geometric mean is useful in finding the average of percentages, ratios, indexes,
or growth rates.
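The geometric mean G of a set of n positive values $x_1, x_2, \ldots, x_n$ is defined as the positive nth root
of their product, i.e.
$$G = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}, \quad \text{where } x_i > 0$$
Taking logarithms,
$$\log G = \frac{1}{n}\left[\log x_1 + \log x_2 + \cdots + \log x_n\right] = \frac{1}{n} \sum \log x_i$$
Hence
$$G = \text{antilog}\left[\frac{1}{n} \sum \log x_i\right]$$
For data in a grouped frequency distribution,
$$G = \text{antilog}\left[\frac{1}{n} \sum f_i \log x_i\right], \quad \text{where } n = \sum f_i$$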
Examples (12)
If a person receives a 20% raise after 1 year of service and a 10% raise after the second year of
service, the average percentage raise per year is not 15% but 14.89%, as shown.
$$GM = \sqrt{(1.2)(1.1)} = 1.1489$$
or, in percentage terms,
$$GM = \sqrt{(120)(110)} = 114.89$$
His salary is 120% of the previous salary at the end of the first year and 110% at the end of the second
year. This is equivalent to an average raise of 14.89%, since 114.89% - 100% = 14.89%.
This answer can also be shown by assuming that the person makes $10,000 to start and receives two
raises of 20% and 10%.
Raise 1 = $10,000 × 20% = $2000
Raise 2 = $12,000 × 10% = $1200
His total salary raise is $3200. This total is equivalent to
$10,000 × 14.89% = $1489.00
$11,489 × 14.89% = $1710.71
Total: $3199.71 ≈ $3200
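The logarithmic form of the geometric mean and the direct root form give the same value, which can be checked on the raise example. The sketch below is illustrative only: the function name geometric_mean is hypothetical here, and natural logarithms are used in place of common logarithms, which does not change the result.

```python
import math

def geometric_mean(values):
    """Geometric mean via logarithms: antilog of the mean of the logs."""
    if any(v <= 0 for v in values):
        raise ValueError("all values must be positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))

factors = [1.20, 1.10]                  # growth factors for the two raises

gm = geometric_mean(factors)            # logarithmic form
root_form = (1.20 * 1.10) ** (1 / 2)    # direct square-root form

start = 10_000
actual = start * 1.20 * 1.10            # salary after both raises
via_gm = start * gm ** 2                # two years at the average factor

print(round(gm, 4), round(actual, 2), round(via_gm, 2))
```

Applying the geometric mean factor twice reproduces the final salary exactly; the small gap between $3199.71 and $3200 in the text comes only from rounding the rate to 14.89%.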
A discussion:
Consider, for example, the growth of a savings account. Suppose we deposit $100 initially and let the
interest accrue at varying rates for 5 years. The growth is summarized in the following table.

Growth of a $100 deposit in a savings account

Year    Interest rate    Growth factor    Savings at the end of year
 1           7%              1.07              $107.00
 2           8%              1.08               115.56
 3          10%              1.10               127.12
 4          12%              1.12               142.37
 5          18%              1.18               168.00

The growth factor is calculated as
\[ \text{growth factor} = 1 + \frac{\text{interest rate}}{100} \]
The growth factor is the amount by which we multiply the savings at the beginning of the year to
get the savings at the end of the year.
The simple arithmetic mean of the growth factors would be (1.07 + 1.08 + 1.10 + 1.12 + 1.18) ÷ 5 = 1.11,
which corresponds to an average interest rate of 11 percent per year. If the bank gave interest at
a constant rate of 11 percent per year, however, a $100 deposit would grow in five years to
$100 × 1.11 × 1.11 × 1.11 × 1.11 × 1.11 = $168.51
The table shows that the actual figure is only $168.00. Thus the correct average growth factor
must be slightly less than 1.11.
To find the correct average growth factor, we multiply together the 5 yearly growth factors and
then take the 5th root of the product. The result is the geometric mean growth factor, which is the
appropriate average to use here.
\[ G.M = \sqrt[5]{1.07 \times 1.08 \times 1.10 \times 1.12 \times 1.18} = \sqrt[5]{1.679965} = 1.1093 \]
Notice that the correct average interest rate of 10.93 percent per year obtained with the geometric
mean is very close to the incorrect average rate of 11 percent obtained with the arithmetic mean.
This happens because the interest rates are relatively small.
In highly inflationary economies, banks pay high interest rates to attract savings. Suppose that
over 5 years in an unbelievably inflationary economy, banks pay interest at annual rates of 100,
200, 250, 300 and 400 percent, which correspond to growth factors of 2, 3, 3.5, 4, and 5.
(Calculate the average growth factor with both the arithmetic mean and the geometric mean, as in the
table above; you will find a significant difference.)
Solution
In 5 years, an initial deposit of $100 would grow to $100 × 2 × 3 × 3.5 × 4 × 5 = $42,000. The
arithmetic mean growth factor is (2 + 3 + 3.5 + 4 + 5)/5 = 3.5. This corresponds to an average interest
rate of 250 percent. Yet if the bank gave interest at a constant rate of 250 percent per year, then $100
would grow to $52,521.88 in 5 years:
$100 × 3.5 × 3.5 × 3.5 × 3.5 × 3.5 = $52,521.88
This answer exceeds the actual $42,000 by more than $10,500, a sizable error.
Let's use the formula for the geometric mean of a series of numbers to
determine the correct growth factor.
\[ GM = \sqrt[n]{\text{product of all } x \text{ values}} = \sqrt[5]{2 \times 3 \times 3.5 \times 4 \times 5} = \sqrt[5]{420} = 3.347 \quad \text{(average growth factor)} \]
This growth factor corresponds to an average interest rate of about 235 percent per year.
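The comparison between the two averages can be checked numerically. The following Python sketch is illustrative only: the helper name `geometric_mean` is an assumption of this example and not part of the handout. It computes the geometric mean through logarithms, as in the antilog formula, and compounds a $100 deposit with each average growth factor.

```python
import math

def geometric_mean(values):
    """Geometric mean as the antilog of the mean of the logs (hypothetical helper)."""
    if any(v <= 0 for v in values):
        raise ValueError("the geometric mean needs strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

growth_factors = [2, 3, 3.5, 4, 5]     # 100%, 200%, 250%, 300%, 400% interest
deposit = 100

am = sum(growth_factors) / len(growth_factors)   # arithmetic mean growth factor
gm = geometric_mean(growth_factors)              # geometric mean growth factor
actual = deposit * math.prod(growth_factors)     # true balance after 5 years

print(f"arithmetic mean factor: {am:.3f}")
print(f"geometric mean factor:  {gm:.3f}")
print(f"actual balance:         {actual:.2f}")
print(f"compounding with AM:    {deposit * am ** 5:.2f}")
print(f"compounding with GM:    {deposit * gm ** 5:.2f}")
```

Compounding with the geometric mean reproduces the actual balance, while compounding with the arithmetic mean overshoots it.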
Examples (13)
The growth in bad-debt expense for a company over the last few years is given below. Calculate the
average percentage increase in bad-debt expense over this time period. If this rate continues, estimate the
percentage increase in bad debt for 1997, relative to 1995.

Year      1989    1990    1991    1992    1993    1994    1995
Growth    0.11    0.09    0.075   0.08    0.095   0.108   0.120

Solution
The growth factors are 1 plus the growth rates, so
\[ G.M = \sqrt[7]{1.11(1.09)(1.075)(1.08)(1.095)(1.108)(1.120)} = \sqrt[7]{1.908769992} = 1.09675 \]
The average increase is 9.675 percent per year. The estimate for bad-debt expense in 1997 is
(1.09675)^2 - 1 = 0.2029, i.e. 20.29% higher than in 1995.


Exercise
Find the geometric mean of each of these.
a). The growth rates of the Living Life Insurance Corporation for the past 3 years
were 35, 24, and 18%.
b). A person received these percentage raises in salary over a 4-year period: 8, 6, 4,
and 5%.
c). A stock increased each year for 5 years at these percentages: 10, 8, 12, 9, and 3%.
d). The price increases, in percentages, for the cost of food in a specific geographic
region for the past 3 years were 1, 3, and 5.5%.
The advantages of geometric mean are
It is based on all observed values.
It gives equal weightage to all the observations.
It is not much affected by sampling variability.
The disadvantages of geometric mean are
It vanishes if any observation is zero.
In case of negative values, it cannot be computed at all.
(4) The Harmonic Mean
This mean is useful for finding the average speed. Suppose a person drove 100 miles at 40 miles
per hour and returned driving at 50 miles per hour. The average speed is not (40 + 50)/2 = 45 miles
per hour. The correct average is found as shown.
Since time = distance / rate,
\[ \text{Time 1} = \frac{100}{40} = 2.5 \text{ hours to make the trip, and} \qquad \text{Time 2} = \frac{100}{50} = 2 \text{ hours to return} \]
Hence the total time is 4.5 hours and the total distance driven is 200 miles. The average speed is
\[ \text{Rate} = \frac{\text{distance}}{\text{time}} = \frac{200}{4.5} = 44.44 \text{ miles per hour} \]
This value can also be found by using the harmonic mean:
\[ HM = \frac{2}{1/40 + 1/50} = 44.44 \]
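Definition
The harmonic mean is the reciprocal of the mean of the reciprocals.
For ungrouped data,
\[ H = \text{reciprocal of } \frac{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \cdots + \dfrac{1}{x_n}}{n} = \frac{n}{\sum (1/x_i)} \]
For grouped data,
\[ H = \text{reciprocal of } \frac{\sum (f_i / x_i)}{\sum f_i} = \frac{\sum f_i}{\sum (f_i / x_i)} \]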
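As a quick check of the definition, the sketch below (a minimal illustration; the function name `harmonic_mean` is an assumption of this example) compares the harmonic mean of the two speeds with the average speed obtained directly from total distance and total time.

```python
def harmonic_mean(values):
    """Reciprocal of the mean of the reciprocals (hypothetical helper)."""
    if any(v <= 0 for v in values):
        raise ValueError("the harmonic mean needs strictly positive values")
    return len(values) / sum(1 / v for v in values)

speeds = [40, 50]            # miles per hour, out and back
distance_each_way = 100      # miles

total_time = sum(distance_each_way / s for s in speeds)      # 2.5 h + 2 h
direct_average = len(speeds) * distance_each_way / total_time
hm = harmonic_mean(speeds)

print(f"direct average speed: {direct_average:.2f} mph")
print(f"harmonic mean:        {hm:.2f} mph")
```

The two results agree because equal distances are travelled at each speed; with unequal distances a weighted harmonic mean would be needed.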
The advantages of Harmonic mean are
It is based on all observed values.
It is an appropriate type for averaging rates and ratios.
The disadvantages of Harmonic mean are
It is neither easy to calculate nor to understand.
It cannot be calculated if one of the observations is zero.
It gives too much weightage to the smaller observations.
Examples (14)
Given the following frequency distribution of weights, calculate the mean, geometric mean and
harmonic mean of the weights.

Weights (grams)   65–84   85–104   105–124   125–144   145–164   165–184   185–204
f                   9       10        17        10         5         4         5

Solution
The necessary calculations are given below:

Weight (grams)   Frequency f_i   Midpoint x_i   f_i x_i   log x_i   f_i log x_i   f_i / x_i
65 – 84                9             74.5          670.5    1.8722     16.8498      0.12081
85 – 104              10             94.5          945.0    1.9754     19.7540      0.10582
105 – 124             17            114.5         1946.5    2.0589     35.0013      0.14847
125 – 144             10            134.5         1345.0    2.1287     21.2870      0.07435
145 – 164              5            154.5          772.5    2.1889     10.9445      0.03236
165 – 184              4            174.5          698.0    2.2418      8.9672      0.02292
185 – 204              5            194.5          972.5    2.2889     11.4445      0.02571
Total (Σ)             60             ---          7350.0     ---      124.2483      0.53044

The Mean of Weights:
The mean weight \(\bar{x}\) for grouped data is
\[ \bar{x} = \frac{\sum f_i x_i}{\sum f_i}, \qquad (i = 1, 2, \ldots, 7) \]
\[ \bar{x} = \frac{7350.0}{60} = 122.5 \text{ grams} \]
The Geometric Mean of Weights:
The geometric mean of the weights G for grouped data is
\[ G = \operatorname{antilog}\left[\frac{1}{\sum f_i}\sum f_i \log x_i\right] = \operatorname{antilog}\left(\frac{124.2483}{60}\right) = \operatorname{antilog}(2.0708) = 117.7 \text{ grams} \]
The Harmonic Mean of Weights:
The harmonic mean of the weights H for grouped data is
\[ H = \frac{\sum f_i}{\sum (f_i / x_i)} = \frac{60}{0.53044} = 113.11 \text{ grams} \]
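The three grouped-data means of this example can be recomputed directly from the class midpoints and frequencies. The Python sketch below is only a check under the example's own assumptions (midpoints stand in for every observation in a class); variable names are illustrative.

```python
import math

midpoints = [74.5, 94.5, 114.5, 134.5, 154.5, 174.5, 194.5]   # class midpoints x_i
freqs = [9, 10, 17, 10, 5, 4, 5]                               # frequencies f_i
n = sum(freqs)

pairs = list(zip(freqs, midpoints))
arithmetic = sum(f * x for f, x in pairs) / n
geometric = 10 ** (sum(f * math.log10(x) for f, x in pairs) / n)   # antilog of mean log
harmonic = n / sum(f / x for f, x in pairs)

print(f"arithmetic mean: {arithmetic:.2f} g")
print(f"geometric mean:  {geometric:.2f} g")
print(f"harmonic mean:   {harmonic:.2f} g")
```

The ordering harmonic < geometric < arithmetic seen here holds for any set of positive values that are not all equal.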


(D) Other Central Measures


(1) Median
The median is the measure of location most often reported for annual income and property value
data because a few extremely large incomes or property values can inflate the mean. In such cases, the
median is the preferred measure of central location.
The median is the value in the middle when the data x_1, x_2, ..., x_n of size n are sorted in ascending
order (smallest to largest).
- If n is odd, then the median is the middle value.
- If n is even, the median is the average of the two middle values.
Examples (15)
For instance, find the mean and median of two data sets, representing monthly salaries (in US dollars) of IT
engineers in the US:
X = [2710; 2755; 2850; 2880; 2880; 2890; 2920; 2940; 2950; 3050; 3130; 3325]; and
X* = [2710; 2755; 2850; 2880; 2880; 2890; 2920; 2940; 2950; 3050; 3130; 10000].
The mean of the data set X is
\[ \bar{X} = \frac{\sum x_i}{n} = 2940 \]
Since n = 12 is even, the middle two values are 2890 and 2920; the median of the data set X,
denoted by Med(X), is the average of these two values:
\[ \text{Median} = \text{Med}(X) = \frac{2890 + 2920}{2} = 2905 \]
Remark:
Whenever a data set contains extreme values, the median is often preferred to the mean as the measure of central
location.
The sample X* contains an extreme value, $10,000, and the new sample mean is
\[ \bar{X}^* = \frac{\sum x_i}{n} = 3496 > 2940 \]
But the median is unchanged, and so reflects the central tendency better:
\[ \text{Median} = \text{Med}(X^*) = \frac{2890 + 2920}{2} = 2905 \]
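The effect of the single extreme salary can be reproduced with Python's standard `statistics` module. This is a small illustrative check; the variable names are assumptions of the example.

```python
from statistics import mean, median

X = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]
X_star = X[:-1] + [10000]      # same data with the largest salary replaced by 10000

for name, data in (("X", X), ("X*", X_star)):
    print(f"{name}: mean = {mean(data):.2f}, median = {median(data):.1f}")
```

The mean moves from 2940 to about 3496, while the median stays at 2905 because only the ordering of the middle values matters.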
Exercise
In a study conducted by the Department of Mechanical Engineering and analyzed by the Statistics
Consulting Centre at Virginia Polytechnic Institute and State University, the steel rods supplied by two
different companies were compared. Ten sample springs were made out of the steel rods supplied by each
company and a measure of flexibility was recorded for each. The data are as follows:
Company A: 9.3 8.8 6.8 8.7 8.5 6.7 8.0 6.5 9.2 7.0
Company B: 11.0 9.8 9.9 10.2 10.1 9.7 11.0 11.1 10.2 9.6
Can you conclude that there is virtually no difference in means between the steel rods supplied by
the two companies? (Probability and Statistics by Walpole 8th Ed p-387)


Exercise
The tensile strength of silicone rubber is thought to be a function of curing temperature. A study
was carried out in which samples of 12 specimens of the rubber were prepared using curing temperatures
of 20°C and 45°C.
The data below show the tensile strength values in megapascals. Calculate the sample mean and
median for the data at the two temperatures. (Walpole p-35)

20°C:   2.07   2.14   2.22   2.03   2.21   2.03
        2.05   2.18   2.09   2.14   2.11   2.02
45°C:   2.52   2.15   2.49   2.03   2.37   2.05
        1.99   2.42   2.08   2.42   2.29   2.01

(2) Central Measure Mode


The mode is the value that is repeated most often in the data set.
A data set that has only one value that occurs with the greatest frequency is said to be unimodal.
If a data set has two values that occur with the same greatest frequency, both values are considered to be
the mode and the data set is said to be bimodal. If a data set has more than two values that occur with the
same greatest frequency, each value is used as the mode, and the data set is said to be multimodal. When
no data value occurs more than once, the data set is said to have no mode.
Advantages and Disadvantages
The mode, like the median, can be used as a central location for qualitative as well as
quantitative data.
Like the median, the mode is not unduly affected by extreme values. Even if the high
values are very high and the low values very low, we choose the most frequent value of
the data set to be the modal value.
We can use the mode even when one or more of the classes are open ended.
The mode is not used as often to measure the central tendency as are the mean and
median.
When a data set contains two, three or more modes, they are difficult to interpret and
compare.
e.g. The ages in years of the cars worked on by the Village Autohaus last week were
5 6 3 6 11 7 9 10 2 4 10 6 2 1 5. The mode in this case is 6.
Examples (16)
A computing student received the following grades in the subjects of his first semester, 2007:
Y = [6; 7; 6; 8; 5; 7; 6; 9; 10; 6]      Mode = 6
1, 2, 3, 4, 5, 6, 6, 7, 7      the modes are 6 and 7 (bimodal)
2, 3, 4, 2, 3, 4, 7, 8         the modes are 2, 3 and 4 (multimodal)
2, 3, 4, 5, 6, 7, 8            no mode
2, 2, 3, 3, 4, 4, 5, 5         no mode (every value occurs equally often)
In the case of grouped data (a frequency distribution),
\[ \text{Mode} = l + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h \]
where l = lower class boundary of the modal class,
f_m = frequency of the modal class, f_1 = frequency of the class preceding the modal class,
f_2 = frequency of the class following the modal class, and h = width of the modal class.


Exercise
The ages of residents in a community have the following distribution:

Class       47–51.9   52–56.9   57–61.9   62–66.9   67–71.9   72–76.9   77–81.9
Frequency      4         9        13        42        39        20         9

Estimate the modal value of the distribution.

(E) Measures of Position


In addition to measures of central tendency, there are measures of position. These measures
include percentiles, deciles, and quartiles. They are used to locate the relative position of a data value in
the data set.
Percentile is the position in hundredths that a data value holds in the distribution, Decile is the
position in tenths that a data value holds in the distribution, Quartile is the position in fourths that a data
value holds in the distribution.
(1) Quartiles
Quartiles divide the distribution into four groups, separated by Q1, Q2, Q3. Note that Q1 is the
same as the 25th percentile; Q2 is the same as the 50th percentile, or the median; Q3 corresponds to the
75th percentile, as shown:

For Q1 we check whether n/4 is an integer or a non-integer.
If n/4 is not an integer, then Q1 = the ([n/4] + 1)th item in the ordered data, where [n/4] denotes the integer part of n/4.
If n/4 is an integer, then Q1 = average of the (n/4)th and (n/4 + 1)th items.
Similarly, for Q2 and Q3 we check whether 2n/4 and 3n/4, respectively, are integers or non-integers, and then
find the values of Q2 and Q3 in the same way as for Q1.
When the data is in grouped form, then
\[ Q_1 = l + \frac{h}{f}\left(\frac{n}{4} - c\right) \]
Where
l = lower limit of the class for Q1
n = number of observations in the sample
c = sum of the frequencies in all classes preceding the class for Q1
f = frequency of the class for Q1
h = class interval of the class for Q1
Similarly we can find Q2 and Q3.
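For ungrouped data, the integer / non-integer rule can be written as one small function. The sketch below is a direct transcription of that rule, not a general-purpose quantile routine (other textbooks and software use different conventions); the name `positional_quantile` and its parameters are assumptions of this example. It is applied to the 12 salaries of Examples (15).

```python
def positional_quantile(data, k, parts):
    """k-th of `parts` quantile by the handout's positional rule (hypothetical helper).

    parts = 4 gives quartiles, 10 deciles, 100 percentiles.
    """
    if not data or not 0 < k < parts:
        raise ValueError("need non-empty data and 0 < k < parts")
    xs = sorted(data)
    whole, remainder = divmod(k * len(xs), parts)   # whole = [k*n/parts]
    if remainder:                  # k*n/parts is not an integer:
        return xs[whole]           # the ([k*n/parts] + 1)-th item, 1-based
    # k*n/parts is an integer: average that item and the next one
    return (xs[whole - 1] + xs[whole]) / 2

salaries = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]

q1 = positional_quantile(salaries, 1, 4)
q2 = positional_quantile(salaries, 2, 4)
q3 = positional_quantile(salaries, 3, 4)
d7 = positional_quantile(salaries, 7, 10)
print(q1, q2, q3, d7)
```

Here n/4 = 3, 2n/4 = 6 and 3n/4 = 9 are integers, so each quartile is an average of two neighbouring items, while 7n/10 = 8.4 is not, so D7 is the 9th item.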


(2) Deciles
Deciles divide the distribution into 10 groups, as shown. They are denoted by D1, D2, etc.

For D7 we check whether 7n/10 is an integer or a non-integer.
If 7n/10 is not an integer, then
D7 = the ([7n/10] + 1)th item in the ordered data
If 7n/10 is an integer, then
D7 = average of the (7n/10)th and (7n/10 + 1)th items
Similarly, for D2 and D3 we check whether 2n/10 and 3n/10, respectively, are integers or non-integers, and then
find the values of D2 and D3 in the same way as for D7.
When the data is in grouped form, then
\[ D_7 = l + \frac{h}{f}\left(\frac{7n}{10} - c\right) \]
Where
l = lower limit of the class for D7
n = number of observations in the sample
c = sum of the frequencies in all classes preceding the class for D7
f = frequency of the class for D7
h = class interval of the class for D7
Similarly we can find D2 and D3.
(3) Percentiles
Percentiles are position measures used in educational and health-related fields to indicate the
position of an individual in a group.
Percentiles are symbolized by
P1, P2, P3, . . . , P99
and divide the distribution into 100 equal groups.
For instance,
for P27 we check whether 27n/100 is an integer or a non-integer.
If 27n/100 is not an integer, then P27 = the ([27n/100] + 1)th item in the ordered data
If 27n/100 is an integer, then P27 = average of the (27n/100)th and (27n/100 + 1)th items
Similarly, for P25 and P30 we check whether 25n/100 and 30n/100, respectively, are integers or non-integers,
and then find the values of P25 and P30 in the same way as for P27.
When the data is in grouped form, then
\[ P_{27} = l + \frac{h}{f}\left(\frac{27n}{100} - c\right) \]
Where
l = lower limit of the class for P27
n = number of observations in the sample
c = sum of the frequencies in all classes preceding the class for P27
f = frequency of the class for P27
h = class interval of the class for P27
Similarly we can find P25 and P30.
Examples (17)
The weights in milligrams of 2538 seeds of the longleaf pine were as follows:

Weight (milligrams)   Number of seeds
10 – 24.9                   16
25 – 39.9                   68
40 – 54.9                  204
55 – 69.9                  233
70 – 84.9                  240
85 – 99.9                  655
100 – 114.9                803
115 – 129.9                294
130 – 144.9                 21
145 – 159.9                  4

(a) Find the average weight, the median weight and the most common weight (mode) of the seeds.
(b) Find the first and third quartiles. Find the third decile and the 45th percentile.

Solution:
The necessary calculations are given below:

Class boundaries (c.b)   No. of seeds (f)   Midpoint (x)       fx        Cumulative frequency (c.f)
9.95 – 24.95                   16              17.45           279.20            16
24.95 – 39.95                  68              32.45          2206.60            84
39.95 – 54.95                 204              47.45          9679.80           288
54.95 – 69.95                 233              62.45         14550.85           521
69.95 – 84.95                 240              77.45         18588.00           761
84.95 – 99.95                 655              92.45         60554.75          1416
99.95 – 114.95                803             107.45         86282.35          2219
114.95 – 129.95               294             122.45         36000.30          2513
129.95 – 144.95                21             137.45          2886.45          2534
144.95 – 159.95                 4             152.45           609.80          2538
Total (Σ)                    2538              ---          231638.10           ---


(a)
(i) Average weight:
\[ \bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{231638.10}{2538} = 91.27 \text{ milligrams} \]
(ii) Median = weight of the (n/2)th seed = weight of the (2538/2)th seed,
i.e. the 1269th seed, which lies in the group 84.95 – 99.95.
Our median class is therefore 84.95 – 99.95.
For grouped data the median is
\[ \text{Median} = l + \frac{h}{f}\left(\frac{n}{2} - C\right) \]
where
l = lower boundary of the median class = 84.95
n = number of observations in the sample = 2538
C = cumulative frequency preceding the median class = 761
f = frequency of the median class = 655
h = class interval of the median class = 15
\[ \text{Median} = 84.95 + \frac{15}{655}(1269 - 761) = 84.95 + 11.63 = 96.58 \text{ milligrams} \]
(iii) The class that carries the highest frequency is
99.95 – 114.95, which is thus the modal class.
For grouped data
\[ \text{Mode} = l + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h \]
where
l = lower class boundary of the modal class = 99.95
f_m = frequency of the modal class = 803
f_1 = frequency of the class preceding the modal class = 655
f_2 = frequency of the class following the modal class = 294
h = width of the class interval = 15
\[ \text{Mode} = 99.95 + \frac{803 - 655}{(803 - 655) + (803 - 294)} \times 15 = 99.95 + \frac{148}{148 + 509} \times 15 = 99.95 + \frac{148}{657} \times 15 \]
\[ = 99.95 + 3.38 = 103.33 \text{ milligrams} \]
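The grouped-data mode formula can be sketched in a few lines of Python. This is an illustration under the assumptions of the worked example (one clear modal class, classes described by their boundaries); the function name `grouped_mode` is hypothetical.

```python
def grouped_mode(boundaries, freqs):
    """Mode of a frequency distribution by the interpolation formula (hypothetical helper).

    boundaries has one more entry than freqs: class i runs from
    boundaries[i] to boundaries[i + 1].
    """
    i = max(range(len(freqs)), key=lambda j: freqs[j])    # index of the modal class
    fm = freqs[i]
    f1 = freqs[i - 1] if i > 0 else 0                     # preceding class frequency
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0        # following class frequency
    if fm == f1 and fm == f2:
        raise ValueError("no single modal class")
    l = boundaries[i]
    h = boundaries[i + 1] - boundaries[i]
    return l + (fm - f1) / ((fm - f1) + (fm - f2)) * h

boundaries = [9.95, 24.95, 39.95, 54.95, 69.95, 84.95,
              99.95, 114.95, 129.95, 144.95, 159.95]
seeds = [16, 68, 204, 233, 240, 655, 803, 294, 21, 4]

mode = grouped_mode(boundaries, seeds)
print(f"mode = {mode:.2f} milligrams")
```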
(b)
(i) For grouped data Q1 and Q3 are computed as
\[ Q_1 = l + \frac{h}{f}\left(\frac{n}{4} - C\right) \quad\text{and}\quad Q_3 = l + \frac{h}{f}\left(\frac{3n}{4} - C\right) \]
Now,
Q1 = weight of the (n/4)th seed = weight of the (2538/4)th seed, i.e. the 634.5th seed, which lies in the group 69.95 – 84.95. Thus
\[ Q_1 = 69.95 + \frac{15}{240}(634.5 - 521) = 69.95 + 7.09 = 77.04 \text{ milligrams} \]
And
Q3 = weight of the (3n/4)th seed = weight of the (3 × 2538/4)th seed, i.e. the 1903.5th seed, which lies in the group 99.95 – 114.95. Thus
\[ Q_3 = 99.95 + \frac{15}{803}(1903.5 - 1416) = 99.95 + 9.11 = 109.06 \text{ milligrams} \]
(ii) For grouped data D3 is computed as
\[ D_3 = l + \frac{h}{f}\left(\frac{3n}{10} - C\right) \]
Now,
D3 = weight of the (3n/10)th seed = weight of the (3 × 2538/10)th seed,
i.e. the 761.4th seed, which lies in the group 84.95 – 99.95. Thus
\[ D_3 = 84.95 + \frac{15}{655}(761.4 - 761) = 84.95 + 0.01 = 84.96 \text{ milligrams} \]
(iii) For grouped data P45 is computed as
\[ P_{45} = l + \frac{h}{f}\left(\frac{45n}{100} - C\right) \]
Now,
P45 = weight of the (45n/100)th seed = weight of the (45 × 2538/100)th seed,
i.e. the 1142.10th seed, which lies in the group 84.95 – 99.95. Thus
\[ P_{45} = 84.95 + \frac{15}{655}(1142.10 - 761) = 84.95 + 8.73 = 93.68 \text{ milligrams} \]
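Because the median, quartiles, deciles and percentiles of grouped data all share the same interpolation formula, they can be computed by one function that takes the required fraction p (0.25 for Q1, 0.3 for D3, 0.45 for P45, and so on). The Python sketch below is a minimal illustration; `grouped_quantile` is a hypothetical name.

```python
def grouped_quantile(boundaries, freqs, p):
    """Value below which a fraction p of grouped data lies: l + (h / f) * (p*n - C)."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    target = p * sum(freqs)          # position of the required item, e.g. n/4 for Q1
    cumulative = 0                   # C: cumulative frequency of the preceding classes
    for i, f in enumerate(freqs):
        if f > 0 and cumulative + f >= target:
            l = boundaries[i]
            h = boundaries[i + 1] - boundaries[i]
            return l + h / f * (target - cumulative)
        cumulative += f
    raise ValueError("quantile class not found")

boundaries = [9.95, 24.95, 39.95, 54.95, 69.95, 84.95,
              99.95, 114.95, 129.95, 144.95, 159.95]
seeds = [16, 68, 204, 233, 240, 655, 803, 294, 21, 4]

q1 = grouped_quantile(boundaries, seeds, 0.25)
med = grouped_quantile(boundaries, seeds, 0.50)
q3 = grouped_quantile(boundaries, seeds, 0.75)
d3 = grouped_quantile(boundaries, seeds, 0.30)
p45 = grouped_quantile(boundaries, seeds, 0.45)
print(f"Q1={q1:.2f}  median={med:.2f}  Q3={q3:.2f}  D3={d3:.2f}  P45={p45:.2f}")
```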
Quartiles, Deciles and Percentiles with the help of Ogive
Examples (18)
Suppose you kept a record of the marks of a quiz of 80 students. The exam is out of 10 and you
have grouped the marks and recorded the data in a frequency table shown below:


Graphically we can find out all the three quartiles as:

Similarly we can find out deciles and percentiles using ogive


Note
An ogive may also be constructed as follows.
First construct the cumulative frequency histogram; then join the midpoints of the tops of all
the rectangular bars with a smooth curve to obtain the cumulative frequency curve, or ogive. Alternatively,
plot the midpoint of each class against its cumulative frequency and
join the points with a smooth freehand curve.


(F). Measures of Dispersion

The mean of all the three curves is the same, but curve A has less spread (or variability) than
curve B, and curve B has less variability than curve C. If we measure only mean of these three
distributions, we will miss an important difference among the three curves. To increase the understanding
of the pattern of the data, we must also measure its dispersion.
This additional information enables us to judge the reliability of our measure of
central tendency. A wide spread of values away from the centre may indicate an unacceptable risk. A quantity
that measures this characteristic is called a measure of dispersion, scatter or variability. The main measures
are
(1) Range
The range R is defined as the difference between the largest value x_max and the smallest value x_min in a set of data,
i.e.
\[ R = x_{\max} - x_{\min} = x_n - x_0 \]
Its main disadvantage is that it depends on only two (extreme) values and may be seriously affected by
a single unusual observation. It is therefore an unsatisfactory measure of dispersion. However, it is appropriately
used in statistical quality control charts of manufactured products, daily temperatures, stock prices, etc.
The range is an absolute measure of dispersion. Its relative measure, known as the co-efficient of dispersion,
is defined as
\[ \text{co-efficient of dispersion} = \frac{x_n - x_0}{x_n + x_0} \]
(2) Inter-quartile Range
A measure of variability that overcomes the dependency on extreme values is the inter-quartile range
(IQR), defined as the difference between the third and first quartiles:
\[ IQR = Q_3 - Q_1 \]
In other words, the interquartile range is the range of the middle 50% of the data.
Half of this range is called the semi-interquartile range or the quartile deviation (Q.D); symbolically,
\[ Q.D = \frac{Q_3 - Q_1}{2} \]
For the data on monthly starting salaries, the quartiles are Q3 = 3600 and Q1 = 3465. Thus, the
interquartile range is 3600 - 3465 = 135.
The quartile deviation is also an absolute measure of dispersion. Its relative measure, called the
co-efficient of quartile deviation or the coefficient of semi-interquartile range, is defined as
\[ \text{co-efficient of quartile deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \]
which is used for comparing the variation in two or more sets of data.
(3) Mean Deviation
The mean deviation (M.D) of a set of data is defined as the arithmetic mean of the absolute deviations measured
from the mean, the median or the mode. The algebraic signs are disregarded to avoid
the difficulty arising from the property that the sum of the deviations of the observations from their
mean is zero.
\[ M.D = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \]
For grouped data with k classes having midpoints x_1, x_2, ..., x_k and corresponding frequencies
f_1, f_2, ..., f_k, where \(\sum_{i=1}^{k} f_i = n\), the mean deviation of the sample is given by
\[ M.D = \frac{\sum_{i=1}^{k} f_i\,|x_i - \bar{x}|}{n} \]
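Both forms of the mean deviation can be handled by one small function, since ungrouped data is just the special case in which every frequency is 1. The sketch below is illustrative: the function name is an assumption, the five-value data set is made up for the demonstration, and the grouped data reuses the midpoints and frequencies of the weights in Examples (14).

```python
def mean_deviation(values, freqs=None):
    """Mean absolute deviation about the arithmetic mean (hypothetical helper)."""
    if freqs is None:
        freqs = [1] * len(values)            # ungrouped data: each value counted once
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, values)) / n
    return sum(f * abs(x - mean) for f, x in zip(freqs, values)) / n

# Ungrouped, made-up data: mean is 6, absolute deviations are 4, 2, 0, 2, 4
md_simple = mean_deviation([2, 4, 6, 8, 10])

# Grouped data: class midpoints and frequencies of the weights example
midpoints = [74.5, 94.5, 114.5, 134.5, 154.5, 174.5, 194.5]
freqs = [9, 10, 17, 10, 5, 4, 5]
md_grouped = mean_deviation(midpoints, freqs)

print(f"ungrouped M.D = {md_simple:.2f}")
print(f"grouped M.D   = {md_grouped:.2f} grams")
```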
(4) Population Variance and Standard Deviation
The variance is the average of the squares of the distances of the values from the mean. The symbol for the
population variance is σ² (σ is the Greek lowercase letter sigma).
The symbolic definition of the variance is
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \ \text{(for ungrouped data)} \quad\text{and}\quad \sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{\sum f_i} \ \text{(for grouped data)} \]
An alternative formula is
\[ \sigma^2 = \frac{\sum X_i^2}{N} - \left(\frac{\sum X_i}{N}\right)^2 \ \text{(for ungrouped data)} \quad\text{and}\quad \sigma^2 = \frac{\sum f_i X_i^2}{\sum f_i} - \left(\frac{\sum f_i X_i}{\sum f_i}\right)^2 \ \text{(for grouped data)} \]
The positive square root of the variance is called the standard deviation. Symbolically,
\[ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \ \text{(for ungrouped data)} \quad\text{and}\quad \sigma = \sqrt{\frac{\sum f_i (x_i - \mu)^2}{\sum f_i}} \ \text{(for grouped data)} \]
The standard deviation is a very important concept that serves as a basic measure of variability. A smaller
value of the standard deviation indicates that most of the observations in the data are close to the mean,
while a larger value implies that the observations are scattered widely about the mean.
The standard deviation is an absolute measure of dispersion. Its relative measure, called the coefficient of standard deviation, is
defined as
\[ \text{Coefficient of S.D.} = \frac{\text{Standard Deviation}}{\text{Mean}} \]
(5) Sample Variance and Standard Deviation
In most cases the purpose of calculating a statistic is to estimate the corresponding parameter. For
example, the sample mean \(\bar{x}\) is used to estimate the population mean μ. The expression
\[ \frac{\sum (x_i - \bar{x})^2}{n} \]
does not give the best estimate of the population variance because, when the population is large and the
sample is small (usually less than 30), the variance computed by this formula usually underestimates the
population variance. Therefore, instead of dividing by n, find the variance of the sample by dividing by
n - 1, giving a slightly larger value and an unbiased estimate of the population variance. The formula for
the sample variance, denoted by s², is
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
and the standard deviation of a sample (denoted by s) is
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
(6) Properties of Variance
i). Var(a) = 0, where a is a constant
ii). Var(X + a) = Var(X) = σ²
iii). Var(aX) = a² Var(X)
iv). Var(X ± Y) = Var(X) + Var(Y), for independent X and Y
v). Let \(\bar{x}_1\) and \(s_1^2\) be the mean and variance of n1 observations and \(\bar{x}_2\) and \(s_2^2\) be the mean and
variance of n2 observations (n1 and n2 sufficiently large). Prove that the variance S² of the combined
n1 + n2 observations, whose mean is \(\bar{x}\), is
\[ S^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2} + \frac{n_1(\bar{x}_1 - \bar{x})^2}{n_1 + n_2} + \frac{n_2(\bar{x}_2 - \bar{x})^2}{n_1 + n_2} \]
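Properties (ii), (iii) and (v) can be checked numerically on small data sets. In the sketch below the two groups are made-up numbers chosen only for illustration, `combined_variance` is a hypothetical helper, and all variances use the divisor n (`statistics.pvariance`), which is the convention under which the combined-variance identity holds exactly.

```python
from statistics import mean, pvariance

def combined_variance(n1, m1, v1, n2, m2, v2):
    """Variance of two pooled groups from their sizes, means and variances (divisor n)."""
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n                        # combined mean
    return (n1 * v1 + n2 * v2) / n + n1 * (m1 - m) ** 2 / n + n2 * (m2 - m) ** 2 / n

group_a = [2, 4, 4, 6]             # made-up data
group_b = [10, 12, 15, 19, 24]     # made-up data

from_formula = combined_variance(len(group_a), mean(group_a), pvariance(group_a),
                                 len(group_b), mean(group_b), pvariance(group_b))
from_pooled_data = pvariance(group_a + group_b)

# Properties (ii) and (iii): adding 5 changes nothing, multiplying by 2 scales by 2**2
transformed = [2 * x + 5 for x in group_a]

print(from_formula, from_pooled_data)
print(pvariance(group_a), pvariance(transformed))
```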
Examples (19)
The breaking strength of test pieces of a certain alloy is given as under
95 103 97 130 96 73 78 95 89 68
82 79 69 67 83 108 94 87 93 117
Calculate the average breaking strength of the alloy and the standard deviation.

Breaking Strength (X)     X²        Breaking Strength (X)     X²
        67               4489               93               8649
        68               4624               94               8836
        69               4761               95               9025
        73               5329               95               9025
        78               6084               96               9216
        79               6241               97               9409
        82               6724              103              10609
        83               6889              108              11664
        87               7569              117              13689
        89               7921              130              16900
Total: ΣX = 1803, ΣX² = 167653

\[ \text{Mean} = \frac{\sum X}{n} = \frac{1803}{20} = 90.15 \]
\[ \sigma^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2 = \frac{167653}{20} - \left(\frac{1803}{20}\right)^2 = 8382.65 - 8127.0225 = 255.6275 \]
\[ \sigma = \sqrt{255.6275} = 15.99 \]
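The same calculation can be reproduced with Python's standard `statistics` module. This is an illustrative check rather than part of the handout: `pvariance` and `pstdev` divide by n, as in the worked solution, while `stdev` divides by n - 1 and is shown only for comparison.

```python
from statistics import mean, pstdev, pvariance, stdev

strengths = [95, 103, 97, 130, 96, 73, 78, 95, 89, 68,
             82, 79, 69, 67, 83, 108, 94, 87, 93, 117]
n = len(strengths)

# Shortcut formula used in the solution: sum(X^2)/n - (sum(X)/n)^2
shortcut_variance = sum(x * x for x in strengths) / n - (sum(strengths) / n) ** 2

print(f"mean              = {mean(strengths):.2f}")
print(f"variance (n)      = {pvariance(strengths):.4f}")
print(f"shortcut variance = {shortcut_variance:.4f}")
print(f"std dev (n)       = {pstdev(strengths):.2f}")
print(f"std dev (n - 1)   = {stdev(strengths):.2f}")
```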


Problems (Variance and Standard Deviation)


(1) For three sections of a statistics class consisting of 32, 28, and 40 students, the mean grades
on the final exams were 83, 80 and 76 with standard deviations 5, 6 and 4. Find the combined
mean and standard deviation of the class.
(2) By multiplying each of the numbers 3, 6, 2, 1, 7, 5 by 2 and then adding 5, we obtain 11, 17, 9, 7,
19, 15. What is the relationship between the variances and the means of the two sets?
(3) The first of two samples has 100 items with mean 15 and variance 9. If the whole group has
250 items with mean 15.6 and S.D = √13.44, find the standard deviation of the second
group.
(4) Two brands of cigarettes are compared to determine the variance of the difference D in their
nicotine content. Let X be the nicotine content of brand A, which has a variance of 5 mg², and Y be the nicotine content
of brand B, which has a variance of 4 mg², i.e. D = X - Y. It is assumed that X and Y are
independent. What is the variance of D?
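Examples (20)
(In the case of grouped data) Find the variance and standard deviation.

Classes     65-85   85-105   105-125   125-145   145-165   165-185   185-205
Frequency     9       10        17        10         5         4         5

Solution

Classes      x_i    f_i    f_i x_i    f_i x_i²
65-85         75      9      675        50625
85-105        95     10      950        90250
105-125      115     17     1955       224825
125-145      135     10     1350       182250
145-165      155      5      775       120125
165-185      175      4      700       122500
185-205      195      5      975       190125
Total                60     7380       980700

\[ \sigma^2 = \frac{\sum f_i X_i^2}{n} - \left(\frac{\sum f_i X_i}{n}\right)^2 = \frac{980700}{60} - \left(\frac{7380}{60}\right)^2 = 16345 - 15129 = 1216, \qquad \sigma = \sqrt{1216} = 34.87 \]
If the divisor n - 1 = 59 of the sample variance is used instead, the result is s² = 72960/59 = 1236.61 and s = 35.17.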
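The grouped-data variance depends on whether the sum of squared deviations is divided by n or by n - 1, and for this frequency table the two choices give 1216 and about 1236.61 respectively. The Python sketch below evaluates both from the class midpoints and frequencies; it is a check under the usual assumption that each observation sits at its class midpoint, and the variable names are illustrative.

```python
import math

midpoints = [75, 95, 115, 135, 155, 175, 195]   # class midpoints x_i
freqs = [9, 10, 17, 10, 5, 4, 5]                # frequencies f_i
n = sum(freqs)

sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))

sum_sq_dev = sum_fx2 - sum_fx ** 2 / n          # sum of f_i * (x_i - mean)^2
population_variance = sum_sq_dev / n            # divisor n
sample_variance = sum_sq_dev / (n - 1)          # divisor n - 1

print(f"sum f*x = {sum_fx}, sum f*x^2 = {sum_fx2}")
print(f"divisor n:     variance = {population_variance:.2f}, sd = {math.sqrt(population_variance):.2f}")
print(f"divisor n - 1: variance = {sample_variance:.2f}, sd = {math.sqrt(sample_variance):.2f}")
```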

(7) Coefficient of Variation


The variability of two or more sets of data cannot be compared unless we have a
relative measure of dispersion. For this purpose, Karl Pearson (1857-1936) introduced a relative measure
of variation, known as the co-efficient of variation (C.V), which expresses the standard deviation as a
percentage of the arithmetic mean of the data set. It is defined as
\[ C.V = \frac{s}{\bar{x}} \times 100 \]
The coefficient of variation allows you to compare standard deviations when the units are different,
for example, if a manager wanted to compare the standard deviations of two different variables, such as
the number of sales per salesperson over a 3-month period and the commissions made by these
salespeople.


Examples (21)
The mean of the number of sales of cars over a 3-month period is 87, and the standard deviation is 5. The
mean of the commissions is $5225, and the standard deviation is $773. Compare the variations of the two.
Solution
The coefficients of variation are
\[ C.V = \frac{s}{\bar{x}} \times 100 = \frac{5}{87} \times 100 = 5.7\% \quad \text{(sales)} \]
\[ C.V = \frac{s}{\bar{x}} \times 100 = \frac{773}{5225} \times 100 = 14.8\% \quad \text{(commissions)} \]
Since the coefficient of variation is larger for commissions, the commissions are more variable than the sales.
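The comparison amounts to two divisions, sketched below in Python. The function name `coefficient_of_variation` is an assumption of this example.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean (hypothetical helper)."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return std_dev / mean * 100

cv_sales = coefficient_of_variation(5, 87)            # number of cars sold
cv_commissions = coefficient_of_variation(773, 5225)  # dollars

print(f"C.V (sales)       = {cv_sales:.1f}%")
print(f"C.V (commissions) = {cv_commissions:.1f}%")
```

Because each standard deviation is divided by a mean in the same units, the results are unit-free percentages and can be compared directly.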
Exercise
The lengths (in feet) of the main span of the longest suspension bridges in the United States and the rest
of the world are shown below. Which set of data is more variable?
United States: 4205, 4200, 3800, 3500, 3478, 2800, 2800, 2310
World: 6570, 5538, 5328, 4888, 4626, 4544, 4518, 3970 (Bluman Ex. 3.2, 29)
(8) Range Rule of Thumb
The range can be used to approximate the standard deviation. The approximation is called the
range rule of thumb.
A rough estimate of the standard deviation is
\[ s \approx \frac{\text{range}}{4} \]
For example, the standard deviation for the data set 5, 8, 8, 9, 10, 12, and 13 is 2.7, and the range
is 13 - 5 = 8. The range rule of thumb gives s ≈ 2.
A note of caution should be mentioned here. The range rule of thumb is only an approximation
and should be used when the distribution of data values is unimodal and roughly symmetric.
The range rule of thumb can also be used to estimate the largest and smallest data values of a data set.
The smallest data value will be approximately 2 standard deviations below the mean, and the largest data
value will be approximately 2 standard deviations above the mean of the data set. The mean for the
previous data set is 9.3; hence,
Smallest data value ≈ x̄ - 2s = 9.3 - 2(2.7) = 3.9
Largest data value ≈ x̄ + 2s = 9.3 + 2(2.7) = 14.7
Notice that the smallest data value was 5, and the largest data value was 13. Again, these are
rough approximations. For many data sets, almost all data values will fall within 2 standard deviations of
the mean. Better approximations can be obtained by using Chebyshev's theorem and the empirical rule.
Chebyshev's theorem, developed by the Russian mathematician Chebyshev (1821-1894),
specifies the proportions of the spread in terms of the standard deviation.
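How rough the rule of thumb is can be seen by computing the sample standard deviation of the small data set exactly. The following Python sketch is illustrative only and uses `statistics.stdev`, which divides by n - 1.

```python
from statistics import mean, stdev

data = [5, 8, 8, 9, 10, 12, 13]

s = stdev(data)                                  # sample standard deviation
rule_of_thumb = (max(data) - min(data)) / 4      # range / 4

x_bar = mean(data)
low, high = x_bar - 2 * s, x_bar + 2 * s         # mean plus or minus 2 standard deviations

print(f"mean = {x_bar:.1f}, s = {s:.2f}, range/4 = {rule_of_thumb:.1f}")
print(f"mean - 2s = {low:.1f}, mean + 2s = {high:.1f}")
print(f"actual smallest = {min(data)}, actual largest = {max(data)}")
```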


(9) Empirical Rule and Chebyshev's Theorem

We start by examining a specific set of data. The following table shows the heights in inches of
100 randomly selected adult men. The mean and standard deviation of the data are, rounded to two
decimal places, x̄ = 69.92 and σ = 1.70. If we go through the data and count the number of observations
that are within one standard deviation of the mean, that is, that are between 69.92 - 1.70 = 68.22 and
69.92 + 1.70 = 71.62 inches, there are 69 of them. If we count the number of observations that are within
two standard deviations of the mean, that is, that are between 69.92 - 2(1.70) = 66.52 and 69.92 + 2(1.70)
= 73.32 inches, there are 95 of them.
All of the measurements are within three standard deviations of the mean, that is, between 69.92 - 3(1.70)
= 64.82 and 69.92 + 3(1.70) = 75.02 inches. These tallies are not coincidences, but are in agreement with
the following result that has been found to be widely applicable.

65.6  67.8  68.6  69.1  69.5  70.0  70.4  70.8  71.3  72.2
65.9  67.9  68.7  69.1  69.6  70.0  70.4  70.9  71.4  72.2
66.2  68.0  68.7  69.2  69.6  70.0  70.4  70.9  71.5  72.3
66.8  68.0  68.7  69.3  69.7  70.1  70.5  71.0  71.5  72.4
67.0  68.1  68.8  69.3  69.7  70.1  70.5  71.0  71.6  72.5
67.2  68.2  68.8  69.4  69.7  70.1  70.6  71.1  71.8  72.7
67.3  68.3  68.9  69.4  69.8  70.2  70.6  71.1  71.8  72.8
67.5  68.4  68.9  69.4  69.8  70.2  70.7  71.2  71.9  73.0
67.6  68.6  69.0  69.5  69.8  70.3  70.7  71.2  71.9  73.7
67.7  68.6  69.1  69.5  69.9  70.3  70.8  71.3  72.0  74.8
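The tallies quoted for this table can be checked with a short script. The sketch below copies the 100 heights row by row from the table and uses the rounded mean and standard deviation stated in the text; the helper name `count_within` is illustrative only.

```python
# Heights (inches) of the 100 men, copied row by row from the table above.
heights = [
    65.6, 67.8, 68.6, 69.1, 69.5, 70.0, 70.4, 70.8, 71.3, 72.2,
    65.9, 67.9, 68.7, 69.1, 69.6, 70.0, 70.4, 70.9, 71.4, 72.2,
    66.2, 68.0, 68.7, 69.2, 69.6, 70.0, 70.4, 70.9, 71.5, 72.3,
    66.8, 68.0, 68.7, 69.3, 69.7, 70.1, 70.5, 71.0, 71.5, 72.4,
    67.0, 68.1, 68.8, 69.3, 69.7, 70.1, 70.5, 71.0, 71.6, 72.5,
    67.2, 68.2, 68.8, 69.4, 69.7, 70.1, 70.6, 71.1, 71.8, 72.7,
    67.3, 68.3, 68.9, 69.4, 69.8, 70.2, 70.6, 71.1, 71.8, 72.8,
    67.5, 68.4, 68.9, 69.4, 69.8, 70.2, 70.7, 71.2, 71.9, 73.0,
    67.6, 68.6, 69.0, 69.5, 69.8, 70.3, 70.7, 71.2, 71.9, 73.7,
    67.7, 68.6, 69.1, 69.5, 69.9, 70.3, 70.8, 71.3, 72.0, 74.8,
]


def count_within(data, mean, sd, k):
    """Count observations x with mean - k*sd <= x <= mean + k*sd."""
    lo, hi = mean - k * sd, mean + k * sd
    return sum(lo <= x <= hi for x in data)


MEAN, SD = 69.92, 1.70  # rounded values quoted in the text
counts = {k: count_within(heights, MEAN, SD, k) for k in (1, 2, 3)}
print(counts)  # {1: 69, 2: 95, 3: 100}
```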

A relative frequency histogram for the data is shown in the figure below.


(1) Empirical Rule


When a distribution is bell-shaped (or what is called normal), the following statements, which
make up the empirical rule, are true.
Approximately 68% of the data values will fall within 1 standard deviation of the mean.
Approximately 95% of the data values will fall within 2 standard deviations of the mean.
Approximately 99.7% of the data values will fall within 3 standard deviations of the mean.

Remarks
Two key points in regard to the Empirical Rule are that the data distribution must be
approximately bell-shaped and that the percentages are only approximately true. The Empirical Rule does
not apply to data sets with severely asymmetric distributions, and the actual percentage of observations in
any of the intervals specified by the rule could be either greater or less than those given in the rule. We
see this with the example of the heights of the men: the Empirical Rule suggested 68 observations
between 68.22 and 71.62 inches but we counted 69.
Examples (22)
Heights of 18-year-old males have a bell-shaped distribution with mean 69.6 inches and standard
deviation 1.4 inches.
(a) About what proportion of all such men are between 68.2 and 71 inches tall?
(b) What interval centered on the mean should contain about 95% of all such men?
Solution

(a) Since the interval from 68.2 to 71.0 has endpoints x̄ - s and x̄ + s,
by the Empirical Rule about 68% of all 18-year-old males should have heights in this range.
(b) By the Empirical Rule the shortest such interval containing about 95% of the data is x̄ ± 2s. So the
interval from x̄ - 2s = 69.6 - 2(1.4) = 66.8 to x̄ + 2s = 69.6 + 2(1.4) = 72.4 contains about 95% of the
data values.


Examples (23)
Scores on IQ tests have a bell-shaped distribution with mean μ = 100 and standard deviation
σ = 10. Discuss what the Empirical Rule implies concerning individuals with IQ scores of 110, 120, and 130.
Solution
A sketch of the IQ distribution is given in Figure. The Empirical Rule states that
(i) approximately 68% of the IQ scores in the population lie between 90 and 110,
(ii) approximately 95% of the IQ scores in the population lie between 80 and 120, and
(iii) approximately 99.7% of the IQ scores in the population lie between 70 and 130.

Since 68% of the IQ scores lie within the interval from 90 to 110, it must be the case that 32% lie
outside that interval. By symmetry approximately half of that 32%, or 16% of all IQ scores, will lie above
110. If 16% lie above 110, then 84% lie below. We conclude that the IQ score 110 is the 84th percentile.
The same analysis applies to the score 120. Since approximately 95% of all IQ scores lie within
the interval from 80 to 120, only 5% lie outside it, and half of them, or 2.5% of all scores, are above 120.
The IQ score 120 is thus higher than 97.5% of all IQ scores, and is quite a high score.
By a similar argument, only 15/100 of 1% of all adults, or about one or two in every thousand,
would have an IQ score above 130. This fact makes the score 130 extremely high.
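The symmetry argument used for the scores 110, 120, and 130 can be written as a small calculation. In this sketch the dictionary `EMPIRICAL_WITHIN` and the function `percent_above` are illustrative names; the percentages are the approximate Empirical Rule values, so the results are approximations too.

```python
# Approximate Empirical Rule percentages within k standard deviations of the mean.
EMPIRICAL_WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}


def percent_above(k):
    """Approximate percent of a bell-shaped distribution above mean + k*sd.

    The share outside the interval is split equally between the two tails,
    which relies on the symmetry of the bell shape.
    """
    return (100.0 - EMPIRICAL_WITHIN[k]) / 2


mu, sigma = 100, 10  # IQ example from the text
for k in (1, 2, 3):
    score = mu + k * sigma
    above = percent_above(k)
    print(f"IQ {score}: about {above:.2f}% above, percentile about {100 - above:.2f}")
```

For k = 1 this gives 16% above and the 84th percentile, matching the reasoning in the solution.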
(2) Chebyshev's Theorem
The Empirical Rule does not apply to all data sets, only to those that are bell-shaped, and even
then is stated in terms of approximations. A result that applies to every data set is known as Chebyshev's
Theorem.
For any numerical data set,
1. at least 3/4 of the data lie within two standard deviations of the mean, that is, in the
interval with endpoints x̄ ± 2s for samples and with endpoints μ ± 2σ for populations;
2. at least 8/9 of the data lie within three standard deviations of the mean, that is, in the
interval with endpoints x̄ ± 3s for samples and with endpoints μ ± 3σ for populations;
3. at least 1 - 1/k² of the data lie within k standard deviations of the mean, that is, in the
interval with endpoints x̄ ± ks for samples and with endpoints μ ± kσ for populations,
where k is any positive whole number that is greater than 1.
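The guaranteed proportion 1 - 1/k² is easy to tabulate. The following is a minimal sketch using exact fractions; the function name `chebyshev_lower_bound` is an assumption of this example. The function also accepts non-integer k > 1, for which the bound holds as well, although the statement above restricts k to whole numbers.

```python
from fractions import Fraction


def chebyshev_lower_bound(k):
    """Minimum fraction of any data set lying within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / Fraction(k) ** 2


for k in (2, 3, 4):
    print(k, chebyshev_lower_bound(k))  # 2 3/4, then 3 8/9, then 4 15/16
```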


Remark
It is important to pay careful attention to the words "at least" at the beginning of each of the
three parts of Chebyshev's Theorem. The theorem gives the minimum proportion of the data which must
lie within a given number of standard deviations of the mean; the true proportions found within the
indicated regions could be greater than what the theorem guarantees.
Examples (24)

A sample of size n = 50 has mean x̄ = 28 and standard deviation s = 3. Without knowing anything
else about the sample, what can be said about the number of observations that lie in the interval (22,34)?
What can be said about the number of observations that lie outside that interval?
Solution
The interval (22,34) is the one that is formed by adding and subtracting two standard deviations
from the mean. By Chebyshev's Theorem, at least 3/4 of the data are within this interval. Since 3/4 of 50
is 37.5, this means that at least 37.5 observations are in the interval. But one cannot take a fractional
observation, so we conclude that at least 38 observations must lie inside the interval (22,34).
If at least 3/4 of the observations are in the interval, then at most 1/4 of them are outside it. Since
1/4 of 50 is 12.5, at most 12.5 observations are outside the interval. Since again a fraction of an
observation is impossible, at most 12 observations lie outside the interval (22,34).
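The rounding step in this solution (up for the count inside, down for the count outside) can be made explicit in code. This is a sketch under the assumption that k is a whole number; `chebyshev_counts` is a hypothetical helper name.

```python
import math
from fractions import Fraction


def chebyshev_counts(n, k):
    """Return (minimum count inside, maximum count outside) the interval mean +/- k*sd."""
    bound = 1 - Fraction(1, k * k)      # guaranteed proportion inside the interval
    min_inside = math.ceil(bound * n)   # counts are whole numbers, so round up
    return min_inside, n - min_inside


# Example 24: n = 50 and (22, 34) is the interval mean +/- 2s
print(chebyshev_counts(50, 2))  # (38, 12)
```

Because the two counts must add up to n, rounding the inside count up automatically rounds the outside count down.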


Examples (25)
The number of vehicles passing through a busy intersection between 8:00 a.m. and 10:00 a.m.
was observed and recorded on every weekday morning of the last year. The data set contains n = 251

numbers. The sample mean is x̄ = 725 and the sample standard deviation is s = 25. Identify which of the
following statements must be true.
a. On approximately 95% of the weekday mornings last year the number of vehicles passing
through the intersection from 8:00 a.m. to 10:00 a.m. was between 675 and 775.
b. On at least 75% of the weekday mornings last year the number of vehicles passing through the
intersection from 8:00 a.m. to 10:00 a.m. was between 675 and 775.
c. On at least 189 weekday mornings last year the number of vehicles passing through the
intersection from 8:00 a.m. to 10:00 a.m. was between 675 and 775.
d. On at most 25% of the weekday mornings last year the number of vehicles passing through the
intersection from 8:00 a.m. to 10:00 a.m. was either less than 675 or greater than 775.
e. On at most 12.5% of the weekday mornings last year the number of vehicles passing through
the intersection from 8:00 a.m. to 10:00 a.m. was less than 675.
f. On at most 25% of the weekday mornings last year the number of vehicles passing through the
intersection from 8:00 a.m. to 10:00 a.m. was less than 675.
Solution
a. Since it is not stated that the relative frequency histogram of the data is bell-shaped, the Empirical
Rule does not apply. Statement (a) is based on the Empirical Rule and therefore it might not be
correct.

b. Statement (b) is a direct application of part (1) of Chebyshev's Theorem because x̄ - 2s = 675 and
x̄ + 2s = 775. It must be correct.
c. Statement (c) says the same thing as statement (b) because 75% of 251 is 188.25, so the minimum
whole number of observations in this interval is 189. Thus statement (c) is definitely correct.
d. Statement (d) says the same thing as statement (b) but in different words, and therefore is
definitely correct.
e. Statement (d), which is definitely correct, states that at most 25% of the time either fewer than
675 or more than 775 vehicles passed through the intersection. Statement (e) says that half of that
25% corresponds to days of light traffic. This would be correct if the relative frequency histogram
of the data were known to be symmetric. But this is not stated; perhaps all of the observations
outside the interval (675,775) are less than 675. Thus statement (e) might not be correct.
f. Statement (d) is definitely correct and statement (d) implies statement (f): even if every
measurement that is outside the interval (675,775) is less than 675 (which is conceivable, since
symmetry is not known to hold), even so at most 25% of all observations are less than 675. Thus
statement (f) must definitely be correct.
Exercise
The mean of a distribution is 20 and the standard deviation is 2. Use Chebyshev's theorem.
a. At least what percentage of the values will fall between 10 and 30?
b. At least what percentage of the values will fall between 12 and 28? (Bluman ch. 3)


Exercise
The Energy Information Administration reported that the mean retail price per gallon of regular
grade gasoline was $2.30 (Energy Information Administration, February 27, 2006). Suppose that the
standard deviation was $0.10 and that the retail price per gallon has a bell-shaped distribution.
a. What percentage of regular grade gasoline sold between $2.20 and $2.40 per gallon?
b. What percentage of regular grade gasoline sold between $2.20 and $2.50 per gallon?
c. What percentage of regular grade gasoline sold for more than $2.50 per gallon?
(prob. 3.30, Sweeny Chap 3 )


(G) Exploratory Data Analysis


Exploratory data analysis enables us to use simple arithmetic and easy-to-draw pictures to
summarize data. In this section we continue exploratory data analysis by considering five-number
summaries and box plots. In a five-number summary, the following five numbers are used to summarize the data:
1. Smallest value
2. First quartile (Q1)
3. Median (Q2)
4. Third quartile (Q3)
5. Largest value
Examples (26)

The easiest way to develop a five-number summary is to first place the data in ascending order.
Then it is easy to identify the smallest value, the three quartiles, and the largest value. The monthly
starting salaries shown in the above table for a sample of 12 business school graduates are repeated here
in ascending order.

The median is 3505 and the quartiles are Q1 = 3465 and Q3 = 3600. Reviewing the data shows a
smallest value of 3310 and a largest value of 3925. Thus the five-number summary for the salary data is
3310, 3465, 3505, 3600, 3925. Approximately one-fourth, or 25%, of the observations are between
adjacent numbers in a five-number summary.
(1) Box Plot
A box plot is a graphical summary of data that is based on a five-number summary. A key to the
development of a box plot is the computation of the median and the quartiles, Q1 and Q3. The
interquartile range, IQR = Q3 - Q1, is also used. The following figure is the box plot for the monthly
starting salary data. The steps used to construct the box plot follow.
1. A box is drawn with the ends of the box located at the first and third quartiles. For the salary
data, Q1 = 3465 and Q3 = 3600. This box contains the middle 50% of the data.
2. A vertical line is drawn in the box at the location of the median (3505 for the salary data).
3. By using the interquartile range, IQR = Q3 - Q1, limits are located. The limits for the box plot
are 1.5(IQR) below Q1 and 1.5(IQR) above Q3. For the salary data, IQR = Q3 - Q1 = 3600 - 3465 = 135.
Thus, the limits are 3465 - 1.5(135) = 3262.5 and 3600 + 1.5(135) = 3802.5. Data outside these
limits are considered outliers.
4. The dashed lines in the figure are called whiskers. The whiskers are drawn from the ends of the box
to the smallest and largest values inside the limits computed in step 3. Thus, the whiskers end at salary
values of 3310 and 3730.
5. Finally, the location of each outlier is shown with the symbol *. In the figure we see one outlier, 3925.
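The limit calculation and the outlier check for the salary data can be sketched as follows. Only numbers stated in the text are used (Q1 = 3465, Q3 = 3600, and the salaries 3310, 3730, and 3925); the function names are illustrative assumptions.

```python
def box_plot_limits(q1, q3, factor=1.5):
    """Return the (lower, upper) limits located factor * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def is_outlier(value, lower, upper):
    """A value is flagged as an outlier when it falls outside the limits."""
    return value < lower or value > upper


lower, upper = box_plot_limits(3465, 3600)
print(lower, upper)  # 3262.5 3802.5

# Salaries named in the text: the whisker ends and the largest value
for salary in (3310, 3730, 3925):
    label = "outlier" if is_outlier(salary, lower, upper) else "inside limits"
    print(salary, label)
```

Only 3925 lies beyond the upper limit, which is why it is the single point marked with * in the box plot.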

Exercise
The nine measurements that follow are furnace temperatures recorded on successive
batches in a semiconductor manufacturing process (units are °F): 953, 950, 948, 955, 951, 949,
957, 954, 955.
(a) Calculate the sample mean, sample variance, and standard deviation.
(b) Find the median. How much could the largest temperature measurement
increase without changing the median value?
(c) Construct a box plot of the data.


(H) Measures of Skewness and Kurtosis


A fundamental task in many statistical analyses is to characterize the location and variability of a
data set. A further characterization of the data includes skewness and kurtosis.
(1) Skewness
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or
data set, is symmetric if it looks the same to the left and right of the center point.
If a curve is symmetrical, then the number of values deviating from the mean by a given amount is
the same below the mean as above it. This property is called symmetry.
Skewness is the degree of asymmetry, that is, the departure of a distribution from symmetry.

In a symmetrical distribution, the mean, median and mode coincide.


If the frequency curve of a distribution has a longer tail to the right of the central maximum than
to the left, the distribution is said to be skewed to the right or to have positive skewness.

In a positively skewed distribution, the mean exceeds the mode.

If the frequency curve of a distribution has a longer tail to the left of the central maximum than to
the right, the distribution is said to be skewed to the left or to have negative skewness.

In a negatively skewed distribution, the mean is smaller than the mode.


For univariate data, the formula for skewness is
$$\text{Skewness} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^3 / N}{s^3}$$

where X̄ is the mean, s is the standard deviation, and N is the number of data points.
Note that in computing the skewness, the s is computed with N in the denominator rather than N - 1.
Many software programs actually compute the adjusted Fisher-Pearson coefficient of skewness.



$$\text{Skewness} = \frac{\sqrt{N(N-1)}}{N-2} \cdot \frac{\sum_{i=1}^{N} (X_i - \bar{X})^3 / N}{s^3}$$
This is an adjustment for sample size. The adjustment approaches 1 as N gets large. For reference, the
adjustment factor is 1.49 for N = 5, 1.19 for N = 10, 1.08 for N = 20, 1.05 for N = 30, and 1.02 for N = 100.
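The quoted adjustment factors can be reproduced numerically. The sketch below assumes the usual form of the adjusted Fisher-Pearson factor, sqrt(N(N-1)) / (N-2); the function name `skewness_adjustment` is illustrative.

```python
import math


def skewness_adjustment(n):
    """Sample-size factor sqrt(N(N-1)) / (N-2) of the adjusted Fisher-Pearson coefficient."""
    if n < 3:
        raise ValueError("the factor is undefined for N < 3")
    return math.sqrt(n * (n - 1)) / (n - 2)


for n in (5, 10, 20, 30, 100):
    print(n, round(skewness_adjustment(n), 2))
```

Rounded to two decimals, the factors come out as 1.49, 1.19, 1.08, 1.05, and 1.02, in agreement with the values listed above.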
Karl Pearson proposed the following formula to measure skewness:

$$\text{Skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}}$$

A. L. Bowley introduced the following measure of skewness:

$$\text{Quartile coefficient of skewness} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$
This measure is equal to zero when the quartiles are equidistant from the median; the distribution is
then symmetrical. It is positive when the upper quartile is farther from the median than the lower quartile;
the distribution is then positively skewed. It is negative when the lower quartile is farther from
the median than the upper quartile; the distribution is then negatively skewed.
For a perfectly symmetrical curve, this measure is zero.
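As an illustration, the quartile coefficient can be evaluated for the monthly starting salary data used in the box plot discussion (Q1 = 3465, median 3505, Q3 = 3600). The function name `bowley_skewness` is an assumption of this sketch.

```python
def bowley_skewness(q1, q2, q3):
    """Quartile coefficient of skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    if q3 == q1:
        raise ValueError("Q3 and Q1 coincide, so the coefficient is undefined")
    return (q3 + q1 - 2 * q2) / (q3 - q1)


# Quartiles of the salary data quoted earlier in the handout
coef = bowley_skewness(3465, 3505, 3600)
print(round(coef, 3))  # 0.407
```

The coefficient is positive because the upper quartile (95 above the median) is farther from the median than the lower quartile (40 below it), so by this measure the salary data are positively skewed.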
Problems (Skewness)
1) What can you say about the skewness in each of the following cases?
(i) The median is 26.01, while the two quartiles are 13.73 and 38.29.
(ii) Mean = 140 and mode = 148.7
(iii) Mean = 129.5 and median = 128.7
2) Which of the following are correct in a positively skewed distribution, and which in a negatively skewed distribution?
(i) The arithmetic mean is greater than the mode.
(ii) The arithmetic mean is less than the mode.
(iii) The arithmetic mean is greater than the median.
(iv) The median is greater than the mode.
3) The lengths of stay on the cancer floor of Apolo Hospital were organized into a frequency
distribution. The mean length of stay was 28 days, the median 25 days, and the modal length 23
days. The standard deviation was computed to be 4.2 days.


(2) Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal
distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low
kurtosis tend to have light tails, or lack of outliers. A uniform distribution would be the extreme case.
The histogram is an effective graphical technique for showing both the skewness and kurtosis of
a data set.
Kurtosis is also described as the degree of peakedness of a distribution. A distribution having a
relatively high peak is called leptokurtic, whereas a flat-topped distribution is called platykurtic.
A frequency curve which is neither very peaked nor very flat-topped is called mesokurtic; the normal
curve is mesokurtic.

For univariate data, the formula for kurtosis is

$$\text{Kurtosis} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^4 / N}{s^4}$$

where X̄ is the mean, s is the standard deviation, and N is the number of data points.
The kurtosis of a normal distribution is 3 (mesokurtic). Denoting the kurtosis by b2, a leptokurtic
distribution has b2 > 3 and a platykurtic distribution has b2 < 3.
Another measure of kurtosis is the percentile coefficient of kurtosis:

$$k = \frac{\text{Q.D.}}{P_{90} - P_{10}}$$

where Q.D. is the quartile deviation,

$$\text{Q.D.} = \frac{Q_3 - Q_1}{2}$$
Examples (27)
Grouped data for the heights of 100 randomly selected male students are given below.

Height (inches)   Class mark, x   Frequency, f
59.5 - 62.5       61              5
62.5 - 65.5       64              18
65.5 - 68.5       67              42
68.5 - 71.5       70              27
71.5 - 74.5       73              8

Now
x̄ = (61×5 + 64×18 + 67×42 + 70×27 + 73×8) / 100 = 6745 / 100 = 67.45


For skewness,

Class mark, x   Frequency, f   xf     (x - x̄)   (x - x̄)² f   (x - x̄)³ f
61              5              305    -6.45     208.01       -1341.68
64              18             1152   -3.45     214.25       -739.15
67              42             2814   -0.45     8.51         -3.83
70              27             1890   2.55      175.57       447.70
73              8              584    5.55      246.42       1367.63
Total           100            6745             852.75       -269.33

$$\text{Variance} = \frac{\sum (X_i - \bar{X})^2 f}{N} = \frac{852.75}{100} = 8.5275$$

$$\text{Standard deviation} = s = \sqrt{8.5275} = 2.92$$

$$\text{Skewness} = \frac{\sum (X_i - \bar{X})^3 f / N}{s^3} = \frac{-269.33/100}{(2.92)^3} = \frac{-2.6933}{24.897} \approx -0.108$$

This means that the distribution is slightly negatively skewed.
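The grouped-data calculation can be cross-checked in code. The sketch below uses the class marks and frequencies from the table and divides the third central moment by s³, as the skewness formula requires; `grouped_central_moment` is an illustrative helper name, and s is computed with N in the denominator.

```python
import math

class_marks = [61, 64, 67, 70, 73]
freqs = [5, 18, 42, 27, 8]


def grouped_central_moment(xs, fs, r):
    """r-th central moment of grouped data: sum(f * (x - mean)**r) / N."""
    n = sum(fs)
    mean = sum(x * f for x, f in zip(xs, fs)) / n
    return sum(f * (x - mean) ** r for x, f in zip(xs, fs)) / n


m2 = grouped_central_moment(class_marks, freqs, 2)  # variance, N in the denominator
m3 = grouped_central_moment(class_marks, freqs, 3)  # third central moment
skewness = m3 / math.sqrt(m2) ** 3
print(round(m2, 4), round(skewness, 3))  # 8.5275 -0.108
```

The third central moment alone (about -2.69) is not the skewness; only after division by s³ does the coefficient become unit-free, here about -0.108.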

For kurtosis,

Class mark, x   Frequency, f   (x - x̄)   (x - x̄)⁴ f
61              5              -6.45     8653.84
64              18             -3.45     2550.05
67              42             -0.45     1.72
70              27             2.55      1141.63
73              8              5.55      7590.35
Total           100                      19937.60

$$\text{Kurtosis} = \frac{\sum (X_i - \bar{X})^4 f / N}{s^4} = \frac{19937.60/100}{(8.5275)^2} = \frac{199.376}{72.718} \approx 2.74 < 3$$

This means the frequency curve is relatively flat, that is, platykurtic.
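A short check of the kurtosis for the same grouped data follows. It is a sketch only: the variable names and the helper `classify_kurtosis` are assumptions of this example, and s⁴ is taken as the square of the variance computed with N in the denominator.

```python
class_marks = [61, 64, 67, 70, 73]
freqs = [5, 18, 42, 27, 8]

n = sum(freqs)
mean = sum(x * f for x, f in zip(class_marks, freqs)) / n
m2 = sum(f * (x - mean) ** 2 for x, f in zip(class_marks, freqs)) / n  # variance
m4 = sum(f * (x - mean) ** 4 for x, f in zip(class_marks, freqs)) / n  # fourth central moment
kurtosis = m4 / m2 ** 2  # s**4 equals the squared variance


def classify_kurtosis(b2, tol=1e-9):
    """Label a kurtosis value relative to the normal-distribution value of 3."""
    if abs(b2 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if b2 > 3 else "platykurtic"


print(round(kurtosis, 2), classify_kurtosis(kurtosis))  # 2.74 platykurtic
```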
