
Lecture 1: Introduction to Statistics, Data Analysis and Statistical Inference
Dr. Bilal A Usmani
STRUCTURE OF THE COURSE
Business Statistics
Approach of the course

• Software based approach:

1. Use of MS Excel where possible, for explaining the


concepts and solving the examples that are related
to Statistics, especially descriptive statistics.
2. An introduction to SPSS / R, very powerful statistical
software (R is free), that might take up one or
two lectures.
What is Statistics?

•Statistics is the science of data.

•A collection of procedures for


describing and analysing data.
• This involves collecting, classifying,
summarizing, organizing, analyzing, and
interpreting numerical information.
What is Statistics? (cont…)

• Putting it in other words, statistics is the


methodology which scientists and mathematicians
have developed for interpreting and drawing
conclusions from collected data.

• Everything that deals even remotely with the


collection, processing, interpretation and
presentation of data belongs to the domain of
statistics.
Scope of statistics

• Statistics is a meaningful, useful, science whose


broad scope of applications to business,
government, and the physical and social sciences
is almost limitless. Virtually every single subject
from Anthropology to Zoology …. A to Z!
• Statistics plays a key role in critical thinking,
whether in the classroom, on the job or in
everyday life.
The science of Statistics

Statistics is a discipline in its own right!


It would therefore be desirable to know the characteristic
features of statistics in order to appreciate and
understand its general nature.

Some of its important characteristics are given below:


• Statistics deals with the behavior of aggregates or large groups of
data. It has nothing to do with what is happening to a particular
individual or object of the aggregate.
The science of Statistics (cont…)

• Statistics deals with aggregates of observations of the same


kind rather than isolated figures.
• Statistics deals with variability that obscures underlying
patterns. No two objects in this universe are exactly alike. If
they were, there would have been no statistical problem.
• Statistics deals with uncertainties as every process of getting
observations whether controlled or uncontrolled, involves
deficiencies or chance variation. That is why we have to talk
in terms of probability.
The science of Statistics (cont…)

• Statistics deals with those characteristics or aspects of


things which can be described numerically either by
counts or by measurements.
• Statistics deals with those aggregates which are subject
to a number of random causes, e.g. the heights of
persons are subject to a number of causes such as race,
ancestry, age, diet, habits, climate and so forth.
The science of Statistics (cont…)

• Statistical laws are valid on the average or in the long


run. There is no guarantee that a certain law will hold in
all cases. Statistical inference is therefore made in the
face of uncertainty.
• Statistical results might be misleading or incorrect if
sufficient care in collecting, processing and interpreting
the data is not exercised or if the statistical data are
handled by a person who is not well versed in the subject
matter of statistics.
The Purpose of studying Statistics

• As it is such an important area of knowledge, it is


definitely useful to have a fairly good idea about
the way in which it works, and this is exactly the
purpose of this introductory course.
• The following points indicate some of the main
functions of this science:
• Statistics assists in summarizing large sets of data
in a form that is easily understandable.
The Purpose of studying Statistics
(contd…)

• Statistics assists in the efficient design of laboratory


and field experiments as well as surveys.
• Statistics assists in sound and effective planning in
any field of inquiry.
• Statistics assists in drawing general conclusions and in
making predictions of how much of a thing will
happen under given conditions.
TYPES OF STATISTICS
Types of Statistics

• There are two broad categories of statistics. They are


descriptive and inferential statistics
Descriptive statistics

• Descriptive statistics summarize population data


numerically or graphically by deriving:
• Statistics pertaining to central tendency such as the
mean, median, or mode
• Statistics pertaining to dispersion around the central
tendency such as the range or standard deviation or
statistics or graphs depicting the shape of a
distribution
Inferential statistics
• Inferential statistics allow one to infer population
parameters based upon sample statistics and to
model relationships within the data.

• The categories of inferential statistics are:


• Estimation is the group of techniques which allow
population values to be estimated from
sample data.
Inferential statistics (contd…)

• The two types of statistics in this category are


population parameter estimates and confidence
intervals.
• Modelling allows us to develop mathematical
equations which describe the interrelationships
between two or more variables.
• Hypothesis testing allows us to test for whether a
particular hypothesis we’ve developed is supported by
a systematic analysis of the data.
THE NATURE OF STATISTICS

DESCRIPTIVE STATISTICS

PROBABILITY

INFERENTIAL STATISTICS
FUNDAMENTAL ELEMENTS OF
STATISTICS
Fundamental elements of
statistics
1.Data and its collection
2.Variable
3. Population
4. Sample
1. Data and types of data

•Data is a collection of facts and


information.
•All data can be classified as one of
the two types:
• qualitative and quantitative
Quantitative data

• Quantitative data are measurements that


are recorded on a naturally occurring
numerical scale.
• For example, the current unemployment rate
for each of the 4 provinces, or the number of
convicted murderers who receive the death
penalty each year over a 10-year period.
Qualitative data

• Qualitative data are the measurements that


cannot be measured on a natural numerical
scale; they can only be classified into a
group of categories.
• For example, political party affiliation in a
sample of 50 voters, the defective status
(defective or not) of each of 100 computer
chips manufactured by Intel.
Primary and secondary data

• Data that have been originally collected (raw


data) and have not undergone any sort of
statistical treatment, are called PRIMARY data.
• Data that have undergone any sort of treatment
by statistical methods at least ONCE, i.e. the
data that have been collected, classified,
tabulated or presented in some form for a certain
purpose, are called SECONDARY data.
Data: Objective

• As far as the objectives of your research are


concerned, they should be stated in such a
way that you are absolutely clear about the
goal of your study ---
Exactly what it is that you
are trying to find out?
Data: Collection methodology

• As far as the methodology for DATA-


COLLECTION is concerned, you need to
consider:
1. Source of your data
(the statistical population)
2. Sampling Methodology
3. Instrument for collecting data
Collection of data

• The most important part of statistical work is


perhaps the collection of data.
• Statistical data are collected either by a
complete enumeration of the whole field, called
census, which in many cases would be too costly
and too time consuming as it requires large
number of enumerators and supervisory staff, or
by a partial enumeration associated with a
sample which saves much time and money.
Collection of primary data

One or more of the following methods are


employed to collect primary data:

i) Direct Personal Investigation.


ii) Indirect Investigation.
iii) Collection through Questionnaires.
iv) Collection through Enumerators.
v) Collection through Local Sources.
Collection of secondary data

The secondary data may be obtained from the


following sources:
i) Official, e.g. the publications of the Statistical Division, Ministry of
Finance, the Federal and Provincial Bureaus of Statistics, Ministries of
Food, Agriculture, Industry, Labour, etc.
ii) Semi-Official, e.g., State Bank, Railway Board, Central Cotton
Committee, Boards of Economic Inquiry, District Councils, Municipalities,
etc.
iii) Publications of Trade Associations, Chambers of Commerce, etc.
iv) Technical and Trade Journals and Newspapers.
v) Research Organizations such as universities, and other institutions
2. Variable

• A variable is a characteristic or property of


an individual population unit.
• Age, sex, business income and expenses,
country of birth, capital expenditure, class
grades, and vehicle type are examples of
variables.
Types of variables
Numeric variables

• Numeric variables have values that describe a


measurable quantity as a number. Therefore
numeric variables are quantitative variables. The
data collected for a numeric
variable are quantitative data.

• Numeric variables may be further described as


either continuous or discrete
Continuous variable

• A continuous variable is a numeric variable whose
observations can take any value within a certain
range of real numbers.

• Examples of continuous variables include


height, time, age, and temperature.
Discrete Variable

• A discrete variable is a numeric variable whose
observations can take only a countable number of
distinct values. A discrete variable cannot take the
value of a fraction between one value and
the next closest value.
• Examples of discrete variables include the number of
registered cars, number of business locations, and
number of children in a family, all of which are measured
as whole units (e.g. 1, 2, 3 cars).
Categorical variables

• Categorical variables have values that


describe a 'quality' or 'characteristic' of a
data unit, like 'what type' or 'which
category'. The data collected for a
categorical variable are qualitative data.
• Categorical variables may be further described
as ordinal or nominal:
Ordinal variable

• An ordinal variable is a categorical variable.


Observations can take a value that can be
logically ordered or ranked. The categories
associated with ordinal variables can be ranked
higher or lower than another, but do not
necessarily establish a numeric difference
between each category.
• Examples of ordinal categorical variables include academic
grades (e.g. A, B, C), clothing size (e.g. small, medium, large,
extra-large) and attitudes (e.g. strongly agree, agree, disagree,
strongly disagree).
Nominal variable

• A nominal variable is a categorical variable.
Observations can take a value that cannot be
organized in a logical sequence.
• Examples of nominal categorical variables
include sex, business type, eye color, religion
and brand.
3. Population

• It is a set of units (usually people, objects,


transactions, or events) that we are interested in
studying.
• For example, populations may include
• (1) all employed workers in a country
• (2) all registered voters in a city or country
• (3) everyone who is afflicted with AIDS
• (4) all the cars produced by a particular assembly line
• (5) the set of all accidents occurring in a particular
stretch of interstate highway during a holiday period.
4. Sample

• A sample is a subset of units of a population. It


is also known as representative data.

• The information contained in the sample can be


used to infer the characteristics of the population
from which it was drawn.
Sample (cont…)

• A sample, by definition, is a subset of the


population you are studying that is selected for
the actual research study.
• For example, instead of polling all 140 million registered voters
in the U.S. during a presidential election year, a pollster might
select and question a sample of just 1500 voters and
record the preference of each sampled voter.
Very
Important
Point!
DATA REPRESENTATION
Data representation

• Data can be presented in the text, in a table, or


pictorially as a chart, diagram or graph.

• Any of these may be appropriate to give


information the reader or viewer is supposed to
be able to assimilate "from cold" while reading or
listening.
TABLES
Tables
• A table is perhaps the simplest means of
summarizing a set of observations
• They can be used for all types of numerical data.
• Tables are commonly used in collecting and
organizing raw data during an experiment and
also for representing final data to be included in
a paper or report.
• The representation of data in a table is formally
referred to as “tabular presentation.”
https://www.ncsu.edu/labwrite/res/gh/gh-tables.html
Tables (cont…)

• When using a table to communicate your results


in your report, list specific data values or draw
comparisons between variables by listing
subtotals, totals, averages, percentages,
frequencies, statistical results, etc.
• Good tables should be easy to read across rows
and down columns, easy to understand, and easy
to refer to in the text of your report. They should
also include only relevant data from your results.
A slightly complicated table!
Look at the components

Parts of Previous table

• Title
• Table number
• Headings & Subheadings
• Table Body
• Table Spanner
• Dividers
• Table Notes
Types of Tables

•Textual (Word) Tables

•Statistical Tables

•Numerical Tables
GOOD
Practice
Versus
BAD
practice
FIGURES
Figures

• Another way of representing data is by using
figures.

“Graphical excellence is that which gives to the
viewer the greatest number of ideas in the shortest
time with the least ink in the smallest space.”

Edward R. Tufte
The main themes of a graph

• Graphs are a good means of describing, exploring or


summarising numerical data
• The use of a visual image can simplify complex
information and help to highlight patterns and trends in
the data.
• Graphs are designed to add understanding of information that is
difficult to convey with words.
• They are a particularly effective way of presenting a
large amount of data but can also be used instead of a
table to present smaller datasets.
The main themes of a graph (cont…)

• There are many different graph types to choose


from and a critical issue is to ensure that the
graph type selected is the most appropriate for
the data.
• In general, a graph:
• Must be clear, accurate and appropriate
• Should avoid mere decoration
• Needs a legend
Parts of a
Graph
(line)
Types of graphs

• A summary of the types of data that can be presented in


the most common types of graphs is provided in the next
slides
• It is followed by some general guidelines for designing
readily understandable graphs.
• There is more detailed information on the uses and good
design of particular types of graph in the literature.
• We will focus on graphs that can be used to compare
groups.
Graphs that can compare groups:
1. Bar chart
2. Scatter plot
3. Box plots
4. Stem and leaf plot
5. Line plot
6. Pie chart
1. Bar chart

• A bar chart is a graph that you can use to compare bar


heights of category measures. Bar charts can be made of
category tallies, of different statistics by categories, or of
summary values. The height of the bars signifies the
magnitude of the values.
• For example, bars could represent:
• Total sales for four branch stores for a year
• Mean diameters of parts manufactured by four different machines in
a factory during a week.
• Counts of visitors to four local tourist destinations during a weekend.
Example

For example, bars on


this bar chart
represent the counts
of paint flaws on an
automobile part.
Clustered and stacked bars

• You can represent subcategories on bar charts by


creating clusters of bars or by stacking bars.
• For example, suppose you want to track the
number of students at four regional high schools
by grade.
• Each cluster of bars represents a school, and each
bar within a cluster represents the number of
students in a grade.
Example 1

Creating clusters is helpful when you want


to compare subcategories within and
across categories. For example, this graph
shows:
East High has the most students.
Within East High, 12th grade has the most
students.
For each high school, the number of
students in each grade is similar.
The fewest students are in 9th grade at
West High, followed closely by the other
three grades at that high school.
Example 1: Stacked bars

Each stack of bars represents a


school, and each bar within a
stack represents the number of
students in a grade.

Stacking bars is helpful when


you want to compare
subcategories within categories
and categories with each other.
Histograms

• Perhaps the most commonly used type of graph is


the histogram.

• Whereas a bar chart is a pictorial representation


of a frequency distribution for either nominal or
ordinal data, a histogram depicts a frequency
distribution for discrete or continuous data.
Frequency Polygons

• The frequency polygon is similar to the histogram


in many respects.
• A frequency polygon uses the same two axes as a
histogram.
• It is constructed by placing a point at the center of
each interval such that the height of the point is
equal to the frequency or relative frequency
associated with that interval
2. Scatterplot

• A Scatter Plot has points that show the


relationship between two sets of data.
• The relationship between two variables is called
their correlation.
• The closer the data points come when plotted to
making a straight line, the higher the correlation
between the two variables, or the stronger the
relationship.
• It is an inferential plot and is used for continuous
data.
• It is affected by changing the order of the variables.
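
The statement that points lying closer to a straight line indicate a higher correlation can be illustrated with a minimal sketch. It computes the Pearson correlation coefficient, which is one common measure of linear relationship; the slides do not name a specific coefficient, so this choice, the function name `pearson_r` and the data are all assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y_line = [2, 4, 6, 8, 10]   # points exactly on a straight line
y_noisy = [2, 5, 5, 9, 9]   # points scattered around a line

r_line = pearson_r(x, y_line)
r_noisy = pearson_r(x, y_noisy)
print(r_line, r_noisy)
```

Swapping the two variables changes which axis each one is plotted on, but it does not change the value of this coefficient.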
Scatterplots and correlation
3. Boxplots

• Box plots require a single axis;


however, they display only a summary
of the data. It is used for factor data
having levels.
• The central box (depicted vertically here, but
it can also be horizontal) extends
from the 25th percentile to the 75th
percentile.
• The 25th and 75th percentiles of a data set are
called the quartiles of the data.
• The line running between the quartiles marks the
50th percentile of the data set; half the
observations are less than or equal to, whereas
the other half are greater than or equal to this
value.
• If the 50th percentile lies approximately halfway
between the two quartiles, this implies that the
observations in the center of the data set are
roughly symmetric.
• The lines projecting out from the box on either
side extend to the adjacent values of the plot.
The adjacent values are the most extreme
observations in the data set that are not more than
1.5 times the height of the box beyond either quartile.
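
As a rough sketch of the quantities a box plot displays, the following computes the quartiles and adjacent values for the building-floor data used in the median example of this lecture. It assumes the common convention that adjacent values are the most extreme observations within 1.5 box heights of the quartiles, and it assumes the "inclusive" interpolation rule for quartiles; other quartile rules give slightly different boxes.

```python
import statistics

# Number of floors in 15 buildings (data of the median example in this lecture).
floors = [5, 4, 3, 4, 5, 4, 3, 4, 5, 20, 5, 6, 32, 8, 27]

# 25th, 50th and 75th percentiles; the "inclusive" rule is an assumed convention.
q1, q2, q3 = statistics.quantiles(floors, n=4, method="inclusive")
box_height = q3 - q1

# Adjacent values: most extreme observations within 1.5 box heights of the quartiles.
lower_adjacent = min(v for v in floors if v >= q1 - 1.5 * box_height)
upper_adjacent = max(v for v in floors if v <= q3 + 1.5 * box_height)

# Observations beyond the adjacent values are usually plotted as separate points.
outliers = sorted(v for v in floors if v < lower_adjacent or v > upper_adjacent)
print(q1, q2, q3, lower_adjacent, upper_adjacent, outliers)
```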
4. Stem and leaf plot

• A stem-and-leaf diagram, also called a stem-and-


leaf plot, is a diagram that quickly summarizes
data while maintaining the individual data points.
• The "stem" is a column of the unique elements of
data after removing the last digit.
• The final digits ("leaves") of each column are then
placed in a row next to the appropriate column
and sorted in numerical order.
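
The construction described above can be sketched in a few lines. The function name `stem_and_leaf` and the data are hypothetical, and the sketch assumes non-negative whole numbers so that the last digit is the leaf and the remaining digits form the stem.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted list of leaves."""
    stems = defaultdict(list)
    for v in sorted(values):          # sorting first keeps each row of leaves in order
        stems[v // 10].append(v % 10)
    return dict(stems)

# Hypothetical two-digit observations, for illustration only.
data = [45, 32, 37, 46, 39, 36, 41, 48, 36]
plot = stem_and_leaf(data)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Every original observation can be read back from the display, which is what distinguishes it from a histogram.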
5. Line plots

• Line graphs are made by joining up points plotted


on a graph.
• It can be used to illustrate the relationship
between continuous quantities.
• Each point on the graph represents a pair of
values.
• Each value on the x-axis has a single
corresponding measurement on the y-axis.
• Adjacent points are connected by straight lines.
Most commonly, the scale along the horizontal
axis represents time.
• Consequently, we are able to trace the
chronological change in the quantity on the
vertical axis over a specified period.
[Figure: line plot titled "Inflation", with "Frequency" on the vertical axis and Category 1 to Category 4 on the horizontal axis]
6. Pie chart

• Pie charts are specific types of data presentation where


the data is represented in the form of a circle.
• In a pie chart, a circle is divided into various sections or
segments such that each sector or segment represents a
certain proportion or percentage of the total.
• In such a diagram, the total of all the given items is
equated to 360 degrees and the degrees of angles,
representing different items, are calculated
proportionately.
http://www.lofoya.com/Data-Interpretation/Pie-Charts/intro
• The entire diagram looks like a pie and its components
resemble slices cut from a pie. The pie chart is used to
show the break-up of one continuous variable into its
component parts.
• For example, chart on the next slide shows the
distribution of the sales of the car industry between six
car companies.

Pie chart: Explanation

• Looking at the chart below, we can infer that
Maruti accounts for 24 percent of the market
share, while GM accounts for 35 percent of the
market share, Ford for 4 percent of the market
share, Tata for 10 percent of the market share,
Hyundai for 15 percent of the market share and
Fiat for 12 percent of the market share.

Pie chart: In a nutshell

• The pie chart encompasses a circle of 360


degrees which represents 100 per cent of the
value of the continuous variable.
• Thus, 3.6 degrees on the pie chart represent 1
percent of the total value of the continuous
variable being represented.
• A single pie diagram can represent only one
continuous variable.
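
The rule that 1 percent of the total corresponds to 3.6 degrees can be checked with a short sketch that converts the market shares quoted in the car industry example into sector angles. The variable names are illustrative only.

```python
# Market shares in percent, as quoted in the car industry example.
shares = {"Maruti": 24, "GM": 35, "Ford": 4, "Tata": 10, "Hyundai": 15, "Fiat": 12}

total = sum(shares.values())

# Each sector's angle is its proportion of the total, scaled to 360 degrees.
angles = {name: 360 * pct / total for name, pct in shares.items()}

for name, angle in angles.items():
    print(f"{name}: {angle:.1f} degrees")
```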
• In terms of versatility of data representation, pie
charts are less versatile than bar charts,
x-y graphs or tables.
• However, their utility is in the fact that the
representation of data is cleaner and it gives an
immediate idea of the relative distribution of the
continuous variable amongst different sectors.

MEASUREMENT OF CENTRAL
TENDENCY AND DISPERSION
CENTRAL TENDENCY
Central tendency

The central tendency of a set of
measurements is the
tendency of the data to cluster, or
center, about certain numeric values.
• Measures of central tendency are sometimes
called measures of central location.
• They are also classed as summary statistics.
• They form part of the descriptive statistics of the
data.
• Some measures of central tendency are more
appropriate to use than others
Averages enable us to measure the
central tendency of variable data
https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php
Averages

• An average is a single value which is intended to


represent a set of data or a distribution as a whole.
• It is a more or less CENTRAL value AROUND which the
observations in the set of data or distribution usually
tend to cluster.
• As a measure of central tendency (i.e. an average)
indicates the location or general position of the
distribution on the X-axis, it is also known as a measure
of location or position.
• Let us consider an example:
• Suppose that we have the following two frequency
distributions:

Number of rooms   Suburb A   Suburb B
4                 0          0
5                 8          0
6                 27         8
7                 30         27
8                 16         30
9                 0          16
10                0          0

• If we draw the frequency polygon of the two frequency
distributions, we obtain

[Figure: frequency polygons of Suburb A and Suburb B; horizontal axis from 4 to 10, vertical axis from 0 to 35]
Inspection of these frequency polygons shows that they
have exactly the same shape. It is their position relative
to the horizontal axis (X-axis) which distinguishes them.

• If we compute the mean number of rooms per house for


each of the two suburbs, we will find that the average
number of rooms per house in A is 6.67 while in B it is
7.67.
• This difference of 1 is equivalent to the difference in
position of the two frequency polygons.

• Our interpretation of the above situation would be that
there are LARGER houses in suburb B than in suburb A, to
the extent that there is, on the average,
ONE more room in each house.
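
The two means quoted above can be reproduced from the frequency distributions with a short sketch. The helper name `grouped_mean` is hypothetical, and the first column of the table is taken to be the number of rooms per house.

```python
# Frequency distributions of the two suburbs (number of rooms and house counts).
rooms = [4, 5, 6, 7, 8, 9, 10]
suburb_a = [0, 8, 27, 30, 16, 0, 0]
suburb_b = [0, 0, 8, 27, 30, 16, 0]

def grouped_mean(values, freqs):
    """Arithmetic mean of values weighted by their frequencies."""
    return sum(v * f for v, f in zip(values, freqs)) / sum(freqs)

mean_a = grouped_mean(rooms, suburb_a)
mean_b = grouped_mean(rooms, suburb_b)
print(round(mean_a, 2), round(mean_b, 2), round(mean_b - mean_a, 2))
```

Because distribution B is distribution A shifted one unit to the right, the difference between the means is exactly the size of the shift.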
Various TYPES of Averages

• There are several types of averages each of which


has a use in specifically defined circumstances.

• The most common types of averages are:


1) The arithmetic mean,
2) The geometric mean,
3) The harmonic mean
4) The median, and
5) The mode
• The Arithmetic, Geometric and Harmonic means are
averages that are mathematical in character, and give an
indication of the magnitude of the observed values.
• The Median indicates the middle position while the
mode provides information about the most frequent
value in the distribution or the set of data.
• The Mode is defined as that value which occurs most
frequently in a set of data i.e. it indicates the most
common result.
Arithmetic Mean

• Expected value of X = E[X]


• Also called the “First moment” of X
• xi = values measured
• pi = Pr(X = xi) = Pr(we measure xi)
$$E[X] = \sum_{i=1}^{n} x_i p_i$$

Copyright 2004 David J. Lilja


Arithmetic Mean

• Without additional information, assume


• pi = constant = 1/n
• n = number of measurements
• Arithmetic mean
• Common “average”
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Potential Problem with Means

• Sample mean gives equal weight to all


measurements
• Outliers can have a large influence on the
computed mean value
• Distorts our intuition about the central tendency
of the measured values



Potential Problem with Means

[Figure: two plots, each with the position of the mean marked]
Geometric Mean

The geometric mean, G, of a set of n positive
values X1, X2, …, Xn is defined as the positive nth root
of their product.

$$G = \sqrt[n]{X_1 X_2 \cdots X_n} \qquad (\text{where } X_i > 0)$$

When n is large, the computation of the geometric


mean becomes laborious as we have to extract the
nth root of the product of all the values.
Geometric Mean

• The computation of the geometric mean is simplified by the use of logarithms.
• Taking logarithms to the base 10, we get

$$\log G = \frac{1}{n}\left(\log X_1 + \log X_2 + \cdots + \log X_n\right) = \frac{\sum \log X}{n}$$

Therefore

$$G = \text{antilog}\left(\frac{\sum \log X}{n}\right)$$
Example

Find the geometric mean of numbers:


• 45, 32, 37, 46, 39, 36, 41, 48, 36.
• Solution:
• We need to compute the numerical value of

$$\sqrt[9]{45 \times 32 \times 37 \times 46 \times 39 \times 36 \times 41 \times 48 \times 36}$$

• But, obviously, it is a bit cumbersome to find the


ninth root of a quantity. So we make use of
logarithms, as shown below:
X      log X
45     1.6532
32     1.5052
37     1.5682
46     1.6628
39     1.5911
36     1.5563
41     1.6128
48     1.6812
36     1.5563
Sum    14.3870

$$\log G = \frac{\sum \log X}{n} = \frac{14.3870}{9} = 1.5986$$

Hence $G = \text{antilog}(1.5986) = 39.68$
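
The logarithm method of this example can be checked with a minimal sketch. It averages the base-10 logarithms and takes the antilog, then compares the result with the direct ninth root of the product; the variable names are illustrative.

```python
import math

values = [45, 32, 37, 46, 39, 36, 41, 48, 36]
n = len(values)

# Logarithm route: mean of the base-10 logs, then the antilog (10 to that power).
mean_log = sum(math.log10(v) for v in values) / n
g_logs = 10 ** mean_log

# Direct route: nth root of the product of all the values.
g_direct = math.prod(values) ** (1 / n)

print(round(mean_log, 4), round(g_logs, 2), round(g_direct, 2))
```

Both routes give the same value up to floating-point rounding; the logarithm route simply avoids forming a very large product.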
• The above example pertained to the computation
of the geometric mean in case of raw data.
• Next, we consider the computation of the
geometric mean in the case of grouped data.
Geometric mean for grouped data

• In case of a frequency distribution having k classes
with midpoints X1, X2, …, Xk and the corresponding
frequencies f1, f2, …, fk (such that Σfi = n), the
geometric mean is given by

$$G = \sqrt[n]{X_1^{f_1} X_2^{f_2} \cdots X_k^{f_k}}$$
• Each value of X thus has to be multiplied by itself f
times, and the whole procedure becomes quite a
formidable task!
In terms of logarithms, the formula becomes

$$\log G = \frac{1}{n}\left(f_1 \log X_1 + f_2 \log X_2 + \cdots + f_k \log X_k\right) = \frac{\sum f \log X}{n}$$

$$G = \text{antilog}\left(\frac{\sum f \log X}{n}\right)$$
• Obviously, the above formula is much easier to
handle.

• Let us now apply it to an example.


• In the example of the EPA mileage ratings, we
have:
Mileage Rating   No. of Cars (f)   Class-mark (midpoint) X   log X    f log X
30.0 - 32.9      2                 31.45                     1.4976   2.9952
33.0 - 35.9      4                 34.45                     1.5372   6.1488
36.0 - 38.9      14                37.45                     1.5735   22.0290
39.0 - 41.9      8                 40.45                     1.6069   12.8552
42.0 - 44.9      2                 43.45                     1.6380   3.2760
Total            30                                                   47.3042

$$G = \text{antilog}\left(\frac{47.3042}{30}\right) = \text{antilog}(1.5768) = 37.74$$

• This means that, if we use the geometric mean to
measure the central tendency of this data set, then the
central value of the mileage of those 30 cars comes out
to be 37.74 miles per gallon.
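
As a sketch of the grouped-data calculation, the following applies the logarithmic formula to the class midpoints and frequencies of the EPA mileage table. The variable names are illustrative.

```python
import math

# Class midpoints and frequencies from the EPA mileage ratings table.
midpoints = [31.45, 34.45, 37.45, 40.45, 43.45]
freqs = [2, 4, 14, 8, 2]

n = sum(freqs)

# Sum of f * log10(X) over all classes, then antilog of its mean.
weighted_log_sum = sum(f * math.log10(x) for x, f in zip(midpoints, freqs))
g = 10 ** (weighted_log_sum / n)

print(n, round(weighted_log_sum, 4), round(g, 2))
```

The weighted sum differs from the tabulated 47.3042 in the fourth decimal place only because the table rounds each logarithm to four decimals.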
When should we use the geometric
mean?
• The question is, “When should we use the
geometric mean?”

• The answer to this question is that


when relative changes in some variable
quantity are averaged, we prefer
the geometric mean.
Arithmetic or geometric mean

• Suppose it is discovered that a firm’s turnover has


increased during 4 years by the following
amounts:
Year   Turnover    Percentage Compared With Year Earlier
1958   £ 2,000     –
1959   £ 2,500     125
1960   £ 5,000     200
1961   £ 7,500     150
1962   £ 10,500    140
• The yearly increase is shown in a percentage form
in the right-hand column i.e. the turnover of 1959
is 125 percent of the turnover of 1958, the
turnover of 1960 is 200 percent of the turnover of
1959, and so on.

• The firm’s owner may be interested in knowing


his average rate of turnover growth.
• If the arithmetic mean is adopted he finds his answer
to be:
• Arithmetic Mean

$$\frac{125 + 200 + 150 + 140}{4} = 153.75$$
i.e. we are concluding that the turnover for any year
is 153.75% of the turnover for the previous year.
• In other words, the turnover in each of the years
considered appears to be 53.75 per cent higher than in
the previous year.
• If this percentage is used to calculate the turnover from
1958 to 1962 inclusive, we obtain:
153.75% of £ 2,000 = £ 3,075
153.75% of £ 3,075 = £ 4,728
153.75% of £ 4,728 = £ 7,269
153.75% of £ 7,269 = £ 11,176
• Whereas the actual turnover figures were
Year Turnover
1958 £ 2,000
1959 £ 2,500
1960 £ 5,000
1961 £ 7,500
1962 £ 10,500
• It seems that both the individual figures and,
more important, the total at the end of the
period, are incorrect.
• Using the arithmetic mean has exaggerated the
‘average’ annual rate of increase in the turnover
of this firm.
• Obviously, we would like to rectify this false
impression.
• The geometric mean enables us to do so.

• Geometric mean of the yearly percentages:

$$\sqrt[4]{125 \times 200 \times 150 \times 140} = \sqrt[4]{525000000} = 151.37\%$$
• Now, if we utilize this particular value to obtain the
individual turnover figures, we find that:
151.37% of £2,000 = £3,027
151.37% of £3,027 = £4,583
151.37% of £4,583 = £6,937
151.37% of £6,937 = £10,500
• So that the turnover figure of 1962 is exactly the same as
what we had in the original data.
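
The comparison between the two averages can be reproduced with a short sketch. It applies each average growth percentage for four consecutive years to the 1958 turnover; the helper name `project` is hypothetical, and the sketch does not round the intermediate yearly figures as the slides do.

```python
import math

# Each year's turnover as a percentage of the previous year's (1959 to 1962).
relatives = [125, 200, 150, 140]

arithmetic = sum(relatives) / len(relatives)
geometric = math.prod(relatives) ** (1 / len(relatives))

def project(start, pct, years):
    """Apply the same yearly percentage to a starting value for several years."""
    value = start
    for _ in range(years):
        value *= pct / 100
    return value

end_arithmetic = project(2000, arithmetic, 4)
end_geometric = project(2000, geometric, 4)
print(round(arithmetic, 2), round(geometric, 2))
print(round(end_arithmetic), round(end_geometric))
```

Only the geometric mean reproduces the actual 1962 turnover of £ 10,500, because four equal multiplications by it have the same overall effect as the four actual yearly changes.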
Interpretation

• If the turnover of this company were to increase annually
at a constant rate, then the annual increase would have
been 51.37 percent. (On the average, each year’s
turnover is 51.37% higher than that in the previous year.)

• The above example clearly indicates the significance of


the geometric mean in a situation when relative changes
in a variable quantity are to be averaged.
Interpretation

• But we should bear in mind that such situations
are not encountered too often, and that the
occasion to calculate the geometric mean arises
less frequently than the arithmetic mean. (The
most frequently used measure of central
tendency is the arithmetic mean.)
Mode

• EXAMPLE:
• Suppose that the marks of eight students in a
particular test are as follows:
• 2, 7, 9, 5, 8, 9, 10, 9
• Obviously, the most common mark is 9.
• In other words,
Mode = 9.
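
Finding the mode amounts to counting how often each value occurs, which can be sketched as follows for the marks in this example.

```python
from collections import Counter

marks = [2, 7, 9, 5, 8, 9, 10, 9]

# Count the occurrences of each mark and pick the most frequent one.
counts = Counter(marks)
mode, frequency = counts.most_common(1)[0]
print(mode, frequency)
```

If several values tie for the highest frequency, `most_common(1)` returns only one of them, so a tie would need to be checked separately.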
THE MODE IN CASE OF A DISCRETE
FREQUENCY DISTRIBUTION:

In case of a discrete frequency distribution,


identification of the mode is immediate; one
simply finds that value which has the highest
frequency.
Example:

An airline found the following numbers of passengers in fifty flights of a


forty-seated plane.
Highest Frequency fm = 13
Occurs against the X value 39.
Hence:
Mode = x = 39
The mode is obviously 39 passengers and the company should be quite
satisfied that a 40 seater is the correct-size aircraft for this particular route.
THE MODE IN CASE OF THE FREQUENCY
DISTRIBUTION OF A CONTINUOUS VARIABLE:

In case of grouped data, the modal group is easily recognizable (the


one that has the highest frequency).
At what point within the modal group does the mode lie?
The answer is contained in the following formula:
$$\text{Mode} = l + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h$$
Where,
• l = lower class boundary of the modal class,
• fm = frequency of the modal class,
• f1 = frequency of the class preceding the
modal class,
• f2 = frequency of the class following modal
class, and
• h = length of class interval of the modal class
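In the example of the EPA mileage ratings, it is evident that the third
class is the modal class. The mode lies
somewhere between 35.95 and 38.95.
In order to apply the formula for the mode, we note that fm = 14,
f1 = 4 and f2 = 8, with l = 35.95 and h = 38.95 − 35.95 = 3.
Hence we obtain:
$$\text{Mode} = 35.95 + \frac{14 - 4}{(14 - 4) + (14 - 8)} \times 3 = 35.95 + 1.875 = 37.825$$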
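The arithmetic for the grouped mode can be sketched in Python as follows. The function name `grouped_mode` is hypothetical; the inputs are the values quoted for the EPA example, with the class width taken as the difference between the two stated class boundaries.

```python
def grouped_mode(l, fm, f1, f2, h):
    """Mode of grouped data: l + (fm - f1) / ((fm - f1) + (fm - f2)) * h."""
    return l + (fm - f1) / ((fm - f1) + (fm - f2)) * h

# Values quoted in the text; h is inferred from the class boundaries 35.95 and 38.95.
epa_mode = grouped_mode(l=35.95, fm=14, f1=4, f2=8, h=38.95 - 35.95)
print(round(epa_mode, 3))
```

The fraction weights the position of the mode inside the modal class by how much that class exceeds each neighbour, so the mode is pulled toward the neighbour with the higher frequency.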
Mode by considering the graphical representation.
[Figure: histogram of the EPA mileage ratings]
[Figure: frequency polygon of the same distribution]
Median

The median is the middle value of the series
when the variable values are placed in order of
magnitude.
OR
The median is defined as a value which divides a
set of data into two halves, one half comprising
observations greater than it and the other half
observations smaller than it. More precisely, the median is a
value at or below which 50% of the data lie.
The median value can be ascertained by inspection
in many series.
EXAMPLE-1:
The numbers of floors in 15 buildings at the center of a city are:
5, 4, 3, 4, 5, 4, 3, 4, 5, 20, 5, 6, 32, 8, 27
Arranging these values in ascending order, we obtain
3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 8, 20, 27, 32
Picking up the middle value, we obtain the median equal to 5.
Interpretation

• The median number of floors is 5. Out of those 15
buildings, 7 have up to 5 floors and 7 have 5
floors or more. We noticed earlier that the
arithmetic mean was distorted toward the few
extremely high values in the series and hence
became unrepresentative. The median = 5 is
much more representative of this series.
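To see the contrast between the two averages on the floors data, here is a minimal Python sketch using the standard `statistics` module. The variable names are illustrative.

```python
from statistics import mean, median

floors = [5, 4, 3, 4, 5, 4, 3, 4, 5, 20, 5, 6, 32, 8, 27]

floors_median = median(floors)  # middle value of the sorted series
floors_mean = mean(floors)      # pulled upward by the values 20, 27 and 32

print(sorted(floors))
print(floors_median, floors_mean)
```

Only three of the fifteen buildings lie above the mean, which is why the median describes a typical building better here.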
Median in Case of a Frequency Distribution of
a Continuous Variable:

In case of a frequency distribution, the median is given by the
formula
$$\text{Median} = l + \frac{h}{f}\left(\frac{n}{2} - c\right)$$
Where
l = lower class boundary of the median class (i.e. that class for
which the cumulative frequency is just in excess of n/2).
h = class interval size of the median class
f = frequency of the median class
n = Σf (the total number of observations)
c = cumulative frequency of the class preceding the median
class
Example of the EPA mileage ratings:
In this example, n = 30 and n/2 = 15.
Thus the third class is the median class. The median lies somewhere
between 35.95 and 38.95. Applying the above formula, we obtain
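The substitution can be sketched in Python as below. The function name `grouped_median` is hypothetical. The values l = 35.95, f = 14 and n = 30 come from the text and h = 3 follows from the class boundaries, but c = 6 is an assumption: the underlying table is not reproduced here, and 6 is the cumulative frequency implied by the stated median of 37.88.

```python
def grouped_median(l, h, f, n, c):
    """Median of grouped data: l + (h / f) * (n / 2 - c)."""
    return l + (h / f) * (n / 2 - c)

# l, f and n are quoted in the text; h = 3 follows from the class boundaries.
# c = 6 is an assumption: it is the value implied by the stated result of 37.88.
epa_median = grouped_median(l=35.95, h=3, f=14, n=30, c=6)
print(round(epa_median, 2))
```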
Interpretation:

This result implies that half of the cars have
mileage less than or up to 37.88 miles per gallon
whereas the other half of the cars have mileage
greater than 37.88 miles per gallon. As discussed
earlier, the median is preferable to the arithmetic
mean when there are a few very high or low
figures in a series. It is also exceedingly valuable
when one encounters a frequency distribution
having open-ended class intervals.
Empirical relation between the mean,
median and the mode

• The next concept that we will discuss is the
empirical relation between the mean,
median and the mode. This is a concept
which is not based on a rigid mathematical
formula; rather, it is based on observation.
In fact, the word ‘empirical’ implies ‘based
on observation’.
Empirical relation between the mean,
median and the mode

This concept relates to the relative positions of the mean,
median and the mode in case of a hump-shaped
distribution. In a single-peaked frequency
distribution, the values of the mean, median and mode
coincide if the frequency distribution is absolutely
symmetrical.
[Figure: THE SYMMETRIC CURVE, frequency f against X, with Mean = Median = Mode at the peak]
Empirical relation between the mean,
median and the mode

In the case of a skewed distribution,
the mean, median and mode do not
all lie on the same point. They are
pulled apart from each other, and
the empirical relation explains the
way in which this happens.
Experience tells us that in a
unimodal curve of moderate
skewness, the median is usually
sandwiched between the mean and
the mode.
Empirical relation between the mean,
median and the mode

EMPIRICAL RELATION BETWEEN THE MEAN, MEDIAN AND THE MODE

Mode = 3 Median – 2 Mean
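Rearranging the relation makes the "sandwich" statement explicit. The following is only an algebraic restatement of the empirical rule, not an exact law:

```latex
\text{Mode} = 3\,\text{Median} - 2\,\text{Mean}
\quad\Longleftrightarrow\quad
\text{Mean} - \text{Mode} = 3\,(\text{Mean} - \text{Median})
```

In words, for a moderately skewed unimodal distribution the distance between the mean and the mode is roughly three times the distance between the mean and the median, so the median lies between the other two.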


DISPERSION

Just as variable series differ with respect to
their location on the horizontal axis (having
different ‘average’ values), they also
differ in terms of the amount of variability
which they exhibit.
MEASURES OF DISPERSION

• A measure of dispersion is used to describe
the variability in a sample or population.

• It is usually used in conjunction with
a measure of central tendency, such as the
mean or median, to provide an overall
description of a set of data.
Absolute versus Relative Measures of
Dispersion

There are two types of measures of dispersion:
absolute and relative.
• An absolute measure of dispersion is one that measures
the dispersion in terms of the same units as the data, or in the
square of those units.
• For example, if the units of the data are rupees, meters,
kilograms, etc., the units of the measures of dispersion
will also be rupees, meters, kilograms, etc.
Absolute versus Relative Measures of
Dispersion

• On the other hand, a relative measure of dispersion is one
that is expressed in the form of a ratio, coefficient or
percentage and is independent of the units of
measurement.
• A relative measure of dispersion is useful for comparison
of data of different nature. A measure of central
tendency together with a measure of dispersion gives an
adequate description of data.
Why is it important to measure the
spread of data?

• There are many reasons why the measure of the spread
of data values is important, but one of the main reasons
is its relationship with measures of central
tendency.
• A measure of spread gives us an idea of how well the
mean, for example, represents the data. If the spread of
values in the data set is large, the mean is not as
representative of the data as it is when the spread is
small.
Why is it important to measure the
spread of data?

• This is because a large spread indicates that
there are probably large differences between
individual scores.

• Additionally, in research, it is often seen as
positive if there is little variation in each data
group, as it indicates that the values within the group are similar.
In a nutshell…

• Means hide information about variability /
dispersion.
• How “spread out” are the values?
• How much spread relative to the mean?
• What is the shape of the distribution of
values?
EXAMPLES

• In a technical college, it may well be the case that the
ages of a group of first-year students are quite
consistent, e.g. 17, 18, 18, 19, 18, 19, 19, 18, 17, 18 and
18 years.

• A class of evening students undertaking a course of study
in their spare time may show just the opposite situation,
e.g. 35, 23, 19, 48, 32, 24, 29, 37, 58, 18, 21 and 30.
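A short Python sketch makes the contrast between the two groups concrete. The helper `spread` is a hypothetical name for the simplest possible measure of variability, the gap between the largest and smallest value.

```python
from statistics import mean

first_year_ages = [17, 18, 18, 19, 18, 19, 19, 18, 17, 18, 18]
evening_ages = [35, 23, 19, 48, 32, 24, 29, 37, 58, 18, 21, 30]

def spread(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)

for label, ages in [("first-year", first_year_ages), ("evening", evening_ages)]:
    print(label, round(mean(ages), 2), spread(ages))
```

The first-year ages span only 2 years, while the evening students' ages span 40 years, even though each group could be summarized by a single average.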
EXAMPLES

• It is very clear from this example that the
variation that exists between the various values
of a data-set is of substantial importance.
• We obviously need to be aware of the amount of
variability present in a data-set if we are to come
to useful conclusions about the situation under
review.
EXAMPLES

The sizes of the classes in two comprehensive schools in
different areas are as follows:

Number of Pupils    Number of Classes
                    Area A    Area B
10 – 14                0         5
15 – 19                3         8
20 – 24               13        10
25 – 29               24        12
30 – 34               17        14
35 – 39                3         5
40 – 44                0         3
45 – 49                0         3
Total                 60        60
EXAMPLES

• If the arithmetic mean size of class is calculated, we discover
that the answer is identical:
27.33 pupils in both areas.

• Even though these two distributions share a common
average, it can readily be seen that they are entirely
DIFFERENT.
• And the graphs of the two distributions (given below) clearly
indicate this fact.
[Figure: graphs of the two class-size distributions (Area A and Area B); horizontal axis: Number of Pupils, vertical axis: Number of Classes]
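The claim that both areas share the same mean but differ in spread can be checked with a small Python sketch. It assumes that each class interval is represented by its midpoint (12 for 10–14, 17 for 15–19, and so on), and it uses the standard deviation only as one possible measure of spread; the function names are illustrative.

```python
from math import sqrt

# Class midpoints for 10–14, 15–19, ..., 45–49 (assumed representative values).
midpoints = [12, 17, 22, 27, 32, 37, 42, 47]
area_a = [0, 3, 13, 24, 17, 3, 0, 0]
area_b = [5, 8, 10, 12, 14, 5, 3, 3]

def grouped_mean(mids, freqs):
    """Arithmetic mean of grouped data using class midpoints."""
    return sum(m * f for m, f in zip(mids, freqs)) / sum(freqs)

def grouped_sd(mids, freqs):
    """Population standard deviation computed from class midpoints."""
    mu = grouped_mean(mids, freqs)
    variance = sum(f * (m - mu) ** 2 for m, f in zip(mids, freqs)) / sum(freqs)
    return sqrt(variance)

mean_a, mean_b = grouped_mean(midpoints, area_a), grouped_mean(midpoints, area_b)
sd_a, sd_b = grouped_sd(midpoints, area_a), grouped_sd(midpoints, area_b)
print(round(mean_a, 2), round(mean_b, 2))
print(round(sd_a, 2), round(sd_b, 2))
```

Both means come out at 27.33, while the standard deviation for Area B is clearly larger than for Area A, which is exactly the difference the averages alone conceal.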
EXAMPLES

• The question which must be posed and answered is ‘In what
way can these two situations be distinguished?’
• We need a measure of variability or DISPERSION to
accompany the relevant measure of position or ‘average’
used.
• The word ‘relevant’ is important here for we shall find one
measure of dispersion which expresses the scatter of values
round the arithmetic mean, another the scatter of values
round the median, and so forth. Each measure of dispersion
is associated with a particular ‘average’.
Histograms
[Figure: two histograms drawn side by side, each with a vertical axis running from 0 to 40]

• Similar mean values
• Widely different distributions
• How to capture this variability in one number?
Measures of dispersion

• We will be looking at different measures of
dispersion. These can be classified as:
• the range,
• variance,
• absolute deviation,
• standard deviation and
• quartiles, deciles and percentiles.
RANGE

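• The range is the difference between the highest
and lowest scores in a data set and is the simplest
measure of spread. So we calculate range as:

Range = maximum value - minimum value

[Figure: frequency curve (f against X) with the range marked as the distance from X0, the smallest value, to Xm, the largest value]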
RANGE

• The simplicity of the concept does not necessarily
invalidate it, but in general it gives no idea of the
DISTRIBUTION of the observations between the
two ends of the series.
• For this reason it is used principally as a
supplementary aid in the description of variable
data, in conjunction with other measures of
dispersion.
Disadvantages

• When the data are grouped into a frequency distribution, the
range is estimated by finding the difference between the upper
boundary of the highest class and the lower boundary of the
lowest class.
• However, because of the fact that it is computed from only the
two extreme values in a data-set, it has two serious
disadvantages.
• It ignores all the INFORMATION available from the intermediate
observations.
• It might give a MISLEADING picture of the spread in the data.
Disadvantages

From THIS point of view, it is an
unsatisfactory measure of dispersion.
However, it is APPROPRIATELY used in
statistical quality control charts of
manufactured products, daily temperatures,
stock prices, etc.
Example

• For example, let us consider the following data
set:
23 56 45 65 59 55 62 54 85 25

• The maximum value is 85 and the minimum value
is 23. This results in a range of 62, which is 85
minus 23.
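The same calculation as a minimal Python sketch, using only built-in functions (the variable names are illustrative):

```python
data = [23, 56, 45, 65, 59, 55, 62, 54, 85, 25]

# Range = maximum value - minimum value
data_range = max(data) - min(data)
print(max(data), min(data), data_range)
```

Note that the eight intermediate values play no part in the result, which is the first disadvantage listed above.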
Points to ponder!

• Whilst using the range as a measure of spread is limited, it does
set the boundaries of the scores.
• This can be useful if you are measuring a variable that has either a
critical low or high threshold (or both) that should not be crossed.
• The range will instantly inform you whether at least one value
broke these critical thresholds.
• In addition, the range can be used to detect any errors when
entering data. For example, if you have recorded the age of
school children in your study and your range is 7 to 123 years old,
you know you have made a mistake!
Quartiles, Deciles and Percentiles

• Consider the following figure:
[Figure: frequency curve (f against X) divided by the median into two halves of 50% each]
• It can be divided into four quarters; here the two halves are shown.
• The quartiles, together with the
median, achieve the division of the
total area into four equal parts.
• The first, second and third quartiles
are given by the formulae:
Quartiles, Deciles and Percentiles

• First quartile:
$$Q_1 = l + \frac{h}{f}\left(\frac{n}{4} - c\right)$$
What are l, n, h, f and c?
• Second quartile (i.e. median):
$$Q_2 = l + \frac{h}{f}\left(\frac{2n}{4} - c\right) = l + \frac{h}{f}\left(\frac{n}{2} - c\right)$$
• Third quartile:
$$Q_3 = l + \frac{h}{f}\left(\frac{3n}{4} - c\right)$$
[Figure: frequency curve divided into four areas of 25% each by Q1, Q2 (equal to the median) and Q3]
It is clear from the formula of the second quartile that the
second quartile is the same as the median.
Quartiles, Deciles and Percentiles

• The deciles and the percentiles give the division
of the total area into 10 and 100 equal parts
respectively.
• The formula for the first decile is
$$D_1 = l + \frac{h}{f}\left(\frac{n}{10} - c\right)$$
• The formulae for the subsequent deciles are
$$D_2 = l + \frac{h}{f}\left(\frac{2n}{10} - c\right), \qquad D_3 = l + \frac{h}{f}\left(\frac{3n}{10} - c\right)$$
It is easily seen that the 5th decile
is the same quantity as the median!
Quartiles, Deciles and Percentiles

• The formula for the first percentile is
$$P_1 = l + \frac{h}{f}\left(\frac{n}{100} - c\right)$$
• The formulae for the subsequent percentiles are
$$P_2 = l + \frac{h}{f}\left(\frac{2n}{100} - c\right), \qquad P_3 = l + \frac{h}{f}\left(\frac{3n}{100} - c\right)$$
• And so on…
It is easily seen that the 50th percentile
is the same quantity as the median!
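All of these formulae share one pattern, l + (h/f)(kn/parts − c), which the following Python sketch implements in a single hypothetical function, `grouped_quantile`. The frequency table used to exercise it is an assumption: only the third class (35.95 to 38.95, frequency 14), its neighbours (4 and 8) and n = 30 appear in the text, and the remaining classes are filled in for illustration.

```python
from itertools import accumulate

def grouped_quantile(boundaries, freqs, k, parts):
    """k-th of `parts` quantiles for grouped data: l + (h / f) * (k * n / parts - c)."""
    n = sum(freqs)
    target = k * n / parts
    cumulative = list(accumulate(freqs))
    # The quantile class is the first class whose cumulative frequency reaches the target.
    i = next(idx for idx, cf in enumerate(cumulative) if cf >= target)
    l = boundaries[i]                       # lower boundary of the quantile class
    h = boundaries[i + 1] - boundaries[i]   # class interval size
    c = cumulative[i - 1] if i > 0 else 0   # cumulative frequency before the class
    return l + (h / freqs[i]) * (target - c)

# Assumed EPA-style table: only the third class, its two neighbours and n = 30
# are stated in the text; the first and last classes are illustrative.
boundaries = [29.95, 32.95, 35.95, 38.95, 41.95, 44.95]
freqs = [2, 4, 14, 8, 2]

q2 = grouped_quantile(boundaries, freqs, 2, 4)      # second quartile
d5 = grouped_quantile(boundaries, freqs, 5, 10)     # fifth decile
p50 = grouped_quantile(boundaries, freqs, 50, 100)  # fiftieth percentile
print(round(q2, 2), round(d5, 2), round(p50, 2))
```

Under these assumptions the second quartile, the 5th decile and the 50th percentile all reproduce the median of 37.88.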
A simple example for estimating quartiles

• For example, consider the marks of the 100 students below, which
have been ordered from the lowest to the highest scores.
Order Score Order Score Order Score Order Score Order Score
1st 35 21st 42 41st 53 61st 64 81st 74
2nd 37 22nd 42 42nd 53 62nd 64 82nd 74
3rd 37 23rd 44 43rd 54 63rd 65 83rd 74
4th 38 24th 44 44th 55 64th 66 84th 75
5th 39 25th 45 45th 55 65th 67 85th 75
6th 39 26th 45 46th 56 66th 67 86th 76
7th 39 27th 45 47th 57 67th 67 87th 77
8th 39 28th 45 48th 57 68th 67 88th 77
9th 39 29th 47 49th 58 69th 68 89th 79
10th 40 30th 48 50th 58 70th 69 90th 80
11th 40 31st 49 51st 59 71st 69 91st 81
12th 40 32nd 49 52nd 60 72nd 69 92nd 81
13th 40 33rd 49 53rd 61 73rd 70 93rd 81
14th 40 34th 49 54th 62 74th 70 94th 81
15th 40 35th 51 55th 62 75th 71 95th 81
16th 41 36th 51 56th 62 76th 71 96th 81
17th 41 37th 51 57th 63 77th 71 97th 83
18th 42 38th 51 58th 63 78th 72 98th 84
19th 42 39th 52 59th 64 79th 74 99th 84
20th 42 40th 52 60th 64 80th 74 100th 85
A simple example for estimating quartiles

• The first quartile (Q1) lies between the 25th and 26th student's
marks,
• The second quartile (Q2) between the 50th and 51st student's
marks, and
• The third quartile (Q3) between the 75th and 76th student's
marks.
First quartile (Q1) = (45 + 45) ÷ 2 = 45
Second quartile (Q2) = (58 + 59) ÷ 2 = 58.5
Third quartile (Q3) = (71 + 71) ÷ 2 = 71
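The three quartiles can be reproduced from the ordered marks with a short Python sketch. The helper `between` is a hypothetical name; it averages the two scores on either side of a position, which is the convention used in this example (library routines may use other interpolation rules and give slightly different values).

```python
# The 100 ordered marks from the table, 20 per row.
scores = [
    35, 37, 37, 38, 39, 39, 39, 39, 39, 40, 40, 40, 40, 40, 40, 41, 41, 42, 42, 42,
    42, 42, 44, 44, 45, 45, 45, 45, 47, 48, 49, 49, 49, 49, 51, 51, 51, 51, 52, 52,
    53, 53, 54, 55, 55, 56, 57, 57, 58, 58, 59, 60, 61, 62, 62, 62, 63, 63, 64, 64,
    64, 64, 65, 66, 67, 67, 67, 67, 68, 69, 69, 69, 70, 70, 71, 71, 71, 72, 74, 74,
    74, 74, 74, 75, 75, 76, 77, 77, 79, 80, 81, 81, 81, 81, 81, 81, 83, 84, 84, 85,
]

def between(sorted_values, position):
    """Average of the values at 1-based `position` and `position + 1`."""
    return (sorted_values[position - 1] + sorted_values[position]) / 2

n = len(scores)
q1 = between(scores, n // 4)      # 25th and 26th marks
q2 = between(scores, n // 2)      # 50th and 51st marks
q3 = between(scores, 3 * n // 4)  # 75th and 76th marks
print(q1, q2, q3)
```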
Reflection from the example

• In the above example, we have an even number of scores
(100 students, rather than an odd number, such as 99
students). This means that when we calculate the
quartiles, we take the sum of the two scores around
each quartile and then halve them (hence Q1 = (45 + 45) ÷
2 = 45). However, if we had an odd number of scores
(say, 99 students), we would only need to take one score
for each quartile (that is, the 25th, 50th and 75th
scores). You should recognize that the second quartile is
also the median.
Reflection from the example

• Quartiles are a useful measure of spread because they are much
less affected by outliers or a skewed data set than the equivalent
measures of mean and standard deviation.
• For this reason, quartiles are often reported along with the
median as the best choice of measure of spread and central
tendency, respectively, when dealing with skewed data and/or data
with outliers.
• A common way of expressing quartiles is as an interquartile range.
The interquartile range describes the difference between the
third quartile (Q3) and the first quartile (Q1), telling us about the
range of the middle half of the scores in the distribution.
Reflection from the example

Hence, for our 100 students:
Interquartile range = Q3 - Q1
= 71 - 45
= 26
• However, it should be noted that in journals and other
publications you will usually see the interquartile range reported
as 45 to 71, rather than the calculated range.
• A slight variation on this is the semi-interquartile range, which is
half the interquartile range = ½ (Q3 - Q1). Hence, for our 100
students, this would be 26 ÷ 2 = 13. This is also known as the
quartile deviation.
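Both quantities follow directly from the two quartiles, as this small Python sketch shows. The function names are illustrative.

```python
def interquartile_range(q1, q3):
    """Width of the interval that holds the middle half of the scores."""
    return q3 - q1

def quartile_deviation(q1, q3):
    """Semi-interquartile range: half the interquartile range."""
    return (q3 - q1) / 2

q1, q3 = 45, 71  # quartiles of the 100 students' marks, as given in the text
print(interquartile_range(q1, q3), quartile_deviation(q1, q3))
```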
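[Figure: frequency curve (f against X) with Q1 and Q3 marked; the distance between them is the inter-quartile range, and half of that distance is the quartile deviation (semi inter-quartile range)]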
Coefficient of Dispersion (COD)

• The range is an absolute measure of dispersion. Its relative
measure is known as the COEFFICIENT OF DISPERSION, and is
defined by the relation given below:

$$\text{COD} = \frac{\tfrac{1}{2}\,\text{Range}}{\text{Mid-Range}} = \frac{(X_m - X_0)/2}{(X_m + X_0)/2} = \frac{X_m - X_0}{X_m + X_0}$$
Coefficient of Dispersion (COD)

• This is a pure (i.e. dimensionless) number and is used for
the purposes of COMPARISON. (This is so because a pure
number can be compared with another pure number.)
• For example, if the coefficient of dispersion for one
data-set comes out to be 0.6 whereas the coefficient of
dispersion for another data-set comes out to be 0.4, then
it is obvious that there is a greater amount of dispersion in
the first data-set as compared with the second.
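As an illustration, the coefficient of dispersion can be computed for the ten-value data set used in the range example. This Python sketch assumes the definition (Xm − X0)/(Xm + X0), i.e. half the range divided by the mid-range; the function name is hypothetical.

```python
def coefficient_of_dispersion(values):
    """(Xm - X0) / (Xm + X0): half the range divided by the mid-range."""
    x_max, x_min = max(values), min(values)
    return (x_max - x_min) / (x_max + x_min)

data = [23, 56, 45, 65, 59, 55, 62, 54, 85, 25]
cod = coefficient_of_dispersion(data)
print(round(cod, 3))
```

Because the units cancel in the ratio, the result could be compared with the coefficient for a data set measured in entirely different units.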
Chebychev’s Inequality

• Any data set that is normally distributed, or in the shape
of a bell curve, has several features. One of them deals
with the spread of the data relative to the number
of standard deviations from the mean.

• In a normal distribution, we know that about 68% of the data is
within one standard deviation of the mean, about 95% is within two
standard deviations of the mean, and approximately
99.7% is within three standard deviations of the mean.
• But if the data set is not distributed in the shape of a bell curve,
then a different amount could be within one standard deviation.
Chebychev’s inequality provides a lower bound on the fraction of
data that falls within K standard deviations of the mean for any data
set.
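The slides do not state the bound itself. In its standard form, Chebychev's inequality says that at least 1 − 1/K² of the data lie within K standard deviations of the mean, for any K greater than 1. A minimal Python sketch of that bound (the function name is illustrative):

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k), 4))
```

For K = 2 the guarantee is at least 75% of the data, noticeably weaker than the roughly 95% that holds for a normal distribution; that is the price of making no assumption about the shape of the distribution.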
Why sampling?

• It is impractical, and usually impossible, to attempt to
study or survey every member of a population; studying a
sample of that population is a more attainable goal.
• A well-designed sample gives you the opportunity to have confidence in
the data, and also to consider what might happen if the
assumptions are wrong.
• If you perform your research with the wrong sample, or
just one that is inaccurately designed, you will
almost certainly get misleading results.
Techniques of Sampling

• Sampling techniques can be divided into
two major subsets:
• Probabilistic and
• Non-probabilistic
• There are four types of probabilistic
sampling: Random, Systematic, Cluster, and
Stratified.
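The four probabilistic techniques can be sketched in Python as follows. Everything here is illustrative: the population of 100 numbered units, the sample sizes, the two strata and the clusters of ten are hypothetical choices, not part of the lecture.

```python
import random

rng = random.Random(0)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 numbered units

# Simple random sampling: every unit has the same chance of selection.
simple = rng.sample(population, 10)

# Systematic sampling: random start, then every k-th unit.
k = len(population) // 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sampling: sample separately within each stratum (here, two halves).
strata = {"low": population[:50], "high": population[50:]}
stratified = [unit for units in strata.values() for unit in rng.sample(units, 5)]

# Cluster sampling: split into clusters, then take whole clusters chosen at random.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
cluster = [unit for chosen in rng.sample(clusters, 2) for unit in chosen]

print(simple, systematic, stratified, cluster, sep="\n")
```

The key design difference is where the randomness enters: over individual units (simple random), over the starting point (systematic), within each stratum (stratified), or over whole groups (cluster).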
5. Statistical inference

• A statistical inference is an estimate, prediction, or some
other generalization about a population based on
information contained in a sample. We use the
information contained in the smaller sample to learn
about the larger population.
Thus, from the sample of 1500 voters, the pollster may estimate
the percentage of all the voters who would vote for each
presidential candidate if the elections were held on the day the
poll was conducted, or he might use the information to predict
the outcome on election day.
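As a rough sketch of such an estimate in Python, suppose (hypothetically) that 780 of the 1500 sampled voters favour one candidate. The count of 780 is invented for illustration, and the standard error formula assumes a simple random sample.

```python
from math import sqrt

n = 1500          # sample size quoted in the text
supporters = 780  # hypothetical count for one candidate, not from the source

# Point estimate of the population proportion.
p_hat = supporters / n
# Approximate standard error, assuming simple random sampling.
std_error = sqrt(p_hat * (1 - p_hat) / n)

print(round(p_hat, 3), round(std_error, 4))
```

The sample proportion is the pollster's estimate for all voters, and the standard error gives a sense of how much that estimate would vary from one sample of 1500 to another.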