
BC C104

Business Statistics

Edition: Summer 2011

SIKKIM MANIPAL UNIVERSITY


Directorate of Distance Education

B1463

SIKKIM MANIPAL UNIVERSITY (SMU DDE)


Dean
Directorate of Distance Education
Sikkim Manipal University (SMU DDE)

BOARD OF STUDIES
Chairman
HOD Arts and Humanities
SMU DDE

Rajesh A.R., Head Employment,


Manipal Universal Learning Pvt Ltd

Additional Registrar
SMU DDE

Dr Gayathri Devi, Dean, SMU DE

Controller of Examination
SMU DDE

Srinath P.S., Additional Registrar, Student Evaluation,


SMU DDE

Dr Ramesh Murthy, Director, SMU DE


Dr Shivram Krishnan, Professor & HOD, A&H, SMU DDE

Ashok Kumar K., Additional Registrar, SMU DDE


Prof. Ramesh Murthy
Principal Academics
Manipal Universal Learning Pvt Ltd

Prof. S.N. Maheshwari, Director General, Delhi Institute of


Advanced Studies, Delhi (Formerly, Principal, Hindu College,
Delhi University & Professor & Dean, Faculty of Commerce and
Business Management , Goa University)
Dr Anil Singh, Associate Professor, University of Delhi

Authors:
J.S. Chandan: Units(1.3-1.10, 2.1-2.4, 3.3, units-4, 5, 8, 11) Copyright J.S. Chandan, 2011
G.S. Monga: (Unit-9) Copyright G.S Monga, 2011
Vijay Gupta: Unit(3.4-3.11) Copyright Vijay Gupta, 2011
C.R. Kothari: Units(unit-6, 7, 10) Copyright C.R. Kothari, 2011
Vikas Publishing House: Units(1.1-1.2, 2.5-2.10, 3.1-3.2, units-12-14) Copyright Reserved, 2011
This book is a distance education module comprising a collection of learning materials for our students.
All rights reserved. No part of this work may be reproduced in any form by any means without permission
in writing from Sikkim Manipal University, Gangtok, Sikkim. Printed and Published on behalf of Sikkim
Manipal University, Gangtok, Sikkim by Mr Rajkumar Mascreen, GM, Manipal Universal Learning Pvt
Ltd. Manipal - 576 104. Printed at Manipal Press Limited, Manipal.
Information contained in this book has been published by VIKAS Publishing House Pvt. Ltd. and has
been obtained by its Authors from sources believed to be reliable and are correct to the best of their
knowledge. However, the Publisher and its Authors shall in no event be liable for any errors, omissions
or damages arising out of use of this information and specifically disclaim any implied warranties or
merchantability or fitness for any particular use.

Vikas is the registered trademark of Vikas Publishing House Pvt. Ltd.


VIKAS PUBLISHING HOUSE PVT LTD
E-28, Sector-8, Noida - 201301 (UP)
Phone: 0120-4078900 Fax: 0120-4078999
Regd. Office: 576, Masjid Road, Jangpura, New Delhi 110 014
Website: www.vikaspublishing.com Email: helpline@vikaspublishing.com

Business Statistics

Contents
Unit 1    Information and Data Sources    1-22
Unit 2    Data Collection Methods    23-42
Unit 3    Data Analysis Techniques    43-85
Unit 4    Index Numbers    87-118
Unit 5    Data Representation    119-139
Unit 6    Correlation    141-164
Unit 7    Regression    165-187
Unit 8    Time Series    189-214
Unit 9    Testing of Hypothesis    215-235
Unit 10    Chi-Square Test    237-249
Unit 11    t-Test, z-Test and Analysis of Variance    251-278
Unit 12    Research Report Writing    279-301
Unit 13    Exercise I    303-311
Unit 14    Exercise II    313-327

SUBJECT INTRODUCTION
Business Statistics
Statistics is considered a mathematical science pertaining to the collection,
analysis, interpretation or explanation and presentation of data. The subject of
statistics is primarily concerned with making decisions about various disciplines
of market and employment, such as stock market trends, unemployment rates
in various sectors of industries, demographic shifts, interest rates, inflation rates
over the years, and so on. Statistics is also considered a science that deals with
numbers or figures describing the state of affairs of various situations with which
we are generally and specifically concerned.
This book, Business Statistics, comprises fourteen units.
Unit 1- Information and Data Sources: Explains the need for information in
decision making. It defines a problem and discusses how information is
evaluated and processed. It also defines the various types of data.
Unit 2- Data Collection Methods: Discusses different methods of data
collection, such as observation, questionnaire, interviews and experiments. It
also lists the merits and demerits of data collection methods.
Unit 3- Data Analysis Techniques: Explains the various techniques of analysing
data, including percentage, ratio, average, mean, mode, median, quartiles, range
and standard deviation.
Unit 4- Index Numbers: Defines and classifies index numbers. It also explains
the methods of construction of different types of index numbers.
Unit 5- Data Representation: Lists the various tools of data representation,
including tables, graphs and diagrams, and discusses their features.
Unit 6- Correlation: Defines correlation analysis. It also discusses the concepts
of coefficient of determination, coefficient of correlation, Karl Pearson's coefficient
and Spearman's rank correlation.
Unit 7- Regression: Defines the term regression and lists the assumptions in
regression analysis. It also describes the simple regression model, scatter
diagram method and least square method.
Unit 8- Time Series: Lists the components of time series. It also describes the
various methods of measuring trends and seasonal variations.

Unit 9- Testing of Hypothesis: Defines a hypothesis and lists its characteristics.
It also explains the various ways of formulating hypotheses.
Unit 10- Chi-Square Test: Discusses the meaning, characteristics and
significance of Chi-square test. It also lists the areas of application of Chi-square
test and steps involved in finding the value of Chi-square test.
Unit 11- t-Test, z-Test and Analysis of Variance: Describes the method to
perform the t-Test, z-Test and Analysis of Variance. It also identifies the conditions
in which these tests are applicable.
Unit 12- Research Report Writing: Describes the types, characteristics and
mechanics of report and explains how to write a good report. It also discusses
the types of research reports.
Unit 13- Exercise I
Unit 14- Exercise II
Objectives of studying the subject
After studying this subject, you should be able to:
Explain why information or data is needed for decision-making
Explain the various techniques of collecting and analysing data
Define what index numbers are and use different methods to construct
index numbers
Use tables, graphs and diagrams for representing data
Perform correlation analysis and regression analysis
Describe the various methods of measuring trends and seasonal variations
Formulate and test hypotheses
Use Chi-square test, t-test, z-test and analysis of variance
Describe the various aspects of report writing, including its types,
characteristics and mechanics

Business Statistics

Unit 1

Unit 1

Information and Data Sources

Structure
1.1 Introduction
Objectives
1.2 Need for Information in Decision-Making
1.3 Types of Data
1.4 Data Sources: Primary vs Secondary
1.5 Research Procedure
1.6 Summary
1.7 Glossary
1.8 Terminal Questions
1.9 Answers
1.10 Further Reading

1.1 Introduction
Information is processed from raw data. It is verified to be accurate, specific
and organized for a special purpose. The value of information lies solely in its
ability to affect a behaviour, decision or outcome.
In this unit, you will learn about information, decision-making, data and its
various types. The information should be context specific and available when it
is required, i.e., timely. Data is the numerical result of measurements. The
arrangement of the collected data defines its type. Data can be the basis of
graphs, images, or observations of a set of variables. Raw or unprocessed data
refers to a collection of numbers, characters, images or other outputs from
devices that collect information to convert physical quantities into symbols.
Statistics is the science of the collection, organization, and interpretation of
data. It deals with all aspects of this, including the planning of data collection in
terms of the design of surveys and experiments. You will also learn about
variables and random variables. A variable is any characteristic which can assume
different values.
In probability and statistics, a random or stochastic variable refers to a
variable whose value results from a measurement on some type of random
process. In formal terms, it refers to a function from a probability space, typically
to the real numbers, which is measurable. Intuitively, a random variable is a
numerical description of the outcome of an experiment, for example, the possible
results of rolling two dice: (1, 1), (1, 2), etc. Random variables can be classified
as either discrete or as continuous. The former refers to a random variable that
may assume either a finite number of values or an infinite sequence of values,
while the latter refers to a variable that may assume any numerical value in an
interval or collection of intervals. An example of a random variable of mixed
type would be based on an experiment where a coin is flipped and the spinner
is spun only if the result of the coin toss is heads.
A random variable can also be divided into two main categories, qualitative
random variables and quantitative random variables. The classification of data is
done before processing it. It involves separating items according to similar
characteristics and grouping them into four classes: geographical, chronological,
qualitative and quantitative.
Further in this unit, you will learn about primary data, secondary data and
the sources from which these are collected. The validity and accuracy of the
final judgement is most crucial and depends on how well the data was gathered
in the first place. The quality of data will greatly affect the conclusions and
hence, utmost importance must be given to this process and every possible
precaution should be taken to ensure accuracy while gathering and collecting
data.

Objectives
After studying this unit, you should be able to:
Explain why information is needed in decision-making
Define a problem, evaluate and process information, and take a decision
Explain the meaning and scope of data and list the types of variables
Define variable and its types
Differentiate between primary and secondary data
Explain the procedures of conducting research, including the methods of
collecting primary and secondary data

1.2 Need for Information in Decision-Making


Information plays a vital role in decision-making. It is provided by the information
system set up in the organization. Information consists of data (facts and figures)
which is processed and retrieved to be used for forecasting and decision-making.

In an information-oriented and information-driven society, everyone is a user
and a provider of information. Tremendous growth in technology in general and
communication technologies in particular has served as a powerful driving force
in industry. With the advent of computer systems, communication technologies
have gained more power and have given birth to specialized fields such as
information sciences and information technology.
On analysis we find that the real driving force behind this growth is the
information behind these technologies and not the technologies themselves.
The following steps are used while taking decisions based on information.

1.2.1 Defining the Problem


A problem understood properly is more than half its solution. This requires a proper
definition of the problem and identification of the issue to be covered, which is
decided on the basis of analysis of the information provided or gathered.
Figure 1.1 shows a fish-bone diagram, which helps in understanding and analysing
a complex problem by modelling its interlinked causes.
Figure 1.1 Fish-bone Diagram
[Diagram: Causes 1 to 8, each with numbered sub-causes (1, 2, 3), branch toward the Major Effect.]

Example: The fishbone diagram portrays various causes for an effect or problem
and is often used in brainstorming sessions.
The given diagram was drawn by a manufacturing team in order to
understand the source of periodic iron contamination. Six generic terms were
used to prompt ideas while the branches portray the causes of the problem.


Fishbone Diagram
[Diagram: six main branches (Materials, Measurement, Methods, Environment, Manpower, Machines) lead to the effect 'Iron in Product'. Legible causes include solvent contamination, raw materials, lab error, sampling, analytical procedure, calibration, rust near sample point, rusty pipes, heat exchanger leak, materials of construction and inexperienced analyst.]

The figure shows that the term 'machines' contains the idea 'materials of
construction', which shows four kinds of equipment having specific machine
numbers. However, it must be noted that some ideas appear twice. Calibration
appears under 'methods' as a factor in the analytical procedure and under
'measurement' as a cause of lab error.

1.2.2 Evaluating the Information


The information should be context specific and available when it is required,
i.e., timely. As a decision-maker, you must evaluate the accuracy of information
and the sources of information used in taking a decision. If there are many
sources available for information, then select the source that can provide
authentic information. The following are the four states in which information can
be categorized:
Information you have and you are aware of it.
Information you do not have but you are aware of it.
Information that you have but you are not aware of it.
Information you do not have and you are not aware of it.


Figure 1.2 Information Pyramid
[Pyramid, from top to bottom: Books, Articles, and Documents; Interim Information Products; Raw Data.]

An information pyramid explains the various sources of information as
shown in Figure 1.2. Easy sources of information are books, articles and
documents, which are shown at the top of the pyramid. Other sources of
information are based on raw data collected by someone else or collected by
you. Information is produced by converting raw data into meaningful information
and is presented in an easily understandable format.

1.2.3 Processing the Information


When information is specifically arranged according to the requirement or
problem, it is termed knowledge. Relevant information is extracted from
various sources. If the information is not relevant or does not meet the requirement,
the decision-maker either uses other sources or collects additional
accurate data for further analysis and decision-making. The information is judged
for its relevance, validity and inter-dependence. These are evaluated and
integrated to arrive at a conclusion and take a decision. The process is shown
in Figure 1.3.

Figure 1.3 Processing the Information
[Flow diagram with boxes: Information Collected; Usable Information; Additional Information Needed; Value-adding to Information; Information Required for Decision-making.]


1.2.4 Taking the Decision


The process of decision-making is interactive: all concerned persons
in an establishment or organization give their opinions so that a decision can be
taken on the basis of the information collected. The decision taken should be able
to solve the problem. If the problem is not solved, the decision is reviewed and
re-analysed. It may be examined with more insight and further modified to meet
various needs. Figure 1.4 shows problems and their relationship with decisions.

Figure 1.4 Relationship between Problem and Decision
[Diagram linking Problems and Decisions.]

Thus, an organization must be able to take effective decisions to organize
its activities based on relevant information. It must develop proper mechanisms
for efficient and harmonized information exchange between various departments.

Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Accurate and timely _____________ is considered as one of the
most powerful resources.
(b) When information is specifically arranged according to the
requirement or problem, it is termed as ______________.
2. State whether true or false.
(a) A problem understood properly is more than half its solution.
(b) If a problem is solved, then the decision taken is reviewed and reanalysed.


1.3 Types of Data


Data
Data can be defined as the qualitative or quantitative attributes of a variable or
set of variables. They are usually the results of measurements and can be
presented in the form of graphs, images or observations of a set of variables.
Data is considered to be the lowest level of abstraction from which information
and knowledge are derived.
Data comprise the numerical results of measurements. Data can also be
used in singular sense such as a set of data. For example, if we ask the students
in a classroom their ages and we write down their ages as they tell us, then a
collection of these numbers would be considered as data. Similarly, information
regarding incomes of families, IQ scores of students, test scores of students in
a class, heights of policemen in Mumbai, and so on, when collected, is known
as data. If this data is written down as collected, then it is known as raw data. If
this data is written in an ascending or descending order, then it would be called
ordered data. If this ordered data is arranged in arrays of rows and columns,
then the data is known to be presented in an ordered array.
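The progression from raw data to ordered data to an ordered array can be sketched in a few lines of Python. This is only an illustration: the ages are made-up values, and the helper name `to_ordered_array` and the choice of four columns are assumptions, not part of the text.

```python
# Hypothetical ages, written down in the order the students reported them.
raw_data = [19, 22, 18, 21, 20, 18, 23, 19, 20, 21, 22, 18]

# Ordered data: the same values in ascending or descending order.
ascending = sorted(raw_data)
descending = sorted(raw_data, reverse=True)


def to_ordered_array(values, columns):
    """Arrange already-ordered values into rows of `columns` entries each."""
    return [values[i:i + columns] for i in range(0, len(values), columns)]


# Ordered array: the ordered data laid out in rows and columns.
ordered_array = to_ordered_array(ascending, columns=4)

for row in ordered_array:
    print(row)
```

The raw list keeps the order of collection; sorting changes only the arrangement, never the values themselves.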

1.3.1 Types of Variables


Variable
A variable is any characteristic that can assume different values. Age, height,
IQ, and so on are all variables since their values can change when applied to
different people. For example, Mr X is a variable since X can represent anybody.
On the other hand, a constant will always have the same value. For example,
the number of days in a week is constant and will always remain the same.
Consider the following illustration:
Let x + 6 > 10 be an inequality. Now, if x is a whole number, then it can
have any value greater than 4. While the values 6 and 10 are constant and do
not change, x can be 5, 6, 7... and up to any value. Thus, x is a variable which
can have any number of different values.
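As a small illustration of the inequality above, the following Python sketch checks which whole numbers satisfy x + 6 > 10. The names `LEFT_CONSTANT`, `RIGHT_CONSTANT` and `satisfies`, and the search range 0 to 10, are chosen here for the example only; the variable x itself has no upper limit.

```python
LEFT_CONSTANT = 6    # constant: its value never changes
RIGHT_CONSTANT = 10  # constant: its value never changes


def satisfies(x):
    """Return True when the whole number x satisfies x + 6 > 10."""
    return x + LEFT_CONSTANT > RIGHT_CONSTANT


# x is the variable: try each whole number in a limited, arbitrary range.
solutions = [x for x in range(0, 11) if satisfies(x)]
print(solutions)
```

Every whole number greater than 4 passes the check, while the two constants stay fixed throughout.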
There are two types of variables: discrete variable and continuous variable.
A discrete variable takes whole number values and consists of distinct,
recognizable individual elements that can be counted, such as the number of
books in a library. Similarly, the number of children in a family would be considered
as values of a discrete variable, since the children can be counted exactly.

On the other hand, a continuous variable is a variable whose values can
theoretically take on an infinite number of values within a given range of values.
Hence, these values are measured rather than counted. However,
since the measurement value would depend upon how accurately we measure
it, any exact value would simply be one of the infinite number of values on a
continuous scale between two given points. For example, the height of a child
touches every one of the infinite number of points between, let us say, 40 inches
and 40.1 inches as he/she grows from 40 inches to 40.1 inches. Accordingly,
the value of a continuous variable is more accurately defined if it is stated as
being between two points such as 40 inches and 40.1 inches.
A random variable
Roughly speaking, in probability and statistics a random or stochastic variable
is a variable whose value results from a measurement on some type of random
process. Formally, it is a measurable function from a probability space, typically
to the real numbers (for finite probability spaces, the measurability
requirement is superfluous). Intuitively, a random variable is a numerical
description of the outcome of an experiment, such as the possible results of
rolling two dice: (1, 1), (1, 2), and so on. Random variables can be classified
as either discrete (the variable may assume a finite number of values or an
infinite sequence of values) or continuous (the variable may assume any
numerical value in an interval or a collection of intervals). The possible values
of a random variable can represent the possible outcomes of a
yet-to-be-performed experiment, or the potential values of a quantity whose
already-existing value is uncertain (e.g., as a result of incomplete information or
imprecise measurements). Realizations of a random variable are known as
random variates.
Random variables are of two types, namely, discrete and continuous. For
a continuous random variable, the probability of any specific value is zero,
whereas the probability of some infinite set of values (such as an interval of
non-zero length) can be positive. A random variable can be mixed, with a part
of its probability spread out over an interval like any typical continuous variable,
and another part of it concentrated on particular values like a discrete variable.
These classifications are equivalent to the categorization of probability
distributions.
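The idea of a random variable as a function on the outcomes of an experiment can be made concrete with the two-dice example. The sketch below assumes two fair six-sided dice and takes the sum of the faces as the random variable; the function name `total` is illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
sample_space = list(product(range(1, 7), repeat=2))


def total(outcome):
    """A random variable: maps an outcome such as (1, 2) to a real number."""
    return outcome[0] + outcome[1]


# Probability distribution of this discrete random variable.
counts = Counter(total(outcome) for outcome in sample_space)
distribution = {
    value: Fraction(count, len(sample_space))
    for value, count in sorted(counts.items())
}

for value, probability in distribution.items():
    print(value, probability)
```

Because the variable takes only the eleven values 2 to 12, it is discrete, and its probabilities add up to one.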
A random variable is a phenomenon of interest in which the observed
outcomes of an activity are entirely by chance, are absolutely unpredictable and
may differ from response to response. By definition of randomness, each possible
entity has the same chance of being considered. For instance, lottery drawings
are considered to be random drawings so that each number has exactly the
same chance of being picked. Similarly, the value of the outcome of a toss of
a fair coin is random, since a head or a tail has the same chance of occurring.
A random variable may be qualitative or quantitative in nature. The
qualitative random variables yield categorical responses so that the responses
fit into one category or another. For example, a response to a question such as
'Are you currently unemployed?' would fit in the category of either 'yes' or 'no'.
On the other hand, quantitative random variables yield numerical responses.
For example, responses to questions such as 'How many rooms are there in
your house?' or 'How many children are there in the family?' would be numerical
values. Being whole numbers, these values are considered discrete;
they are the values of discrete quantitative random variables. On the other
hand, responses to questions like 'How tall are you?' or 'How much do you
weigh?' would be the values of continuous quantitative random variables, since
these values are measured and not counted. Some examples of these variables
are:
(i) Qualitative random variables
Sex of students in the class
Political affiliation of a faculty member in the college
Opinions of economists regarding the economic conditions in the
country
(ii) Quantitative random variables
(a) Discrete quantitative random variables
Number of people attending a conference
Number of eggs in the refrigerator
Number of children at a summer camp
(b) Continuous quantitative random variables
Heights of models in a beauty contest
Weights of people joining a diet programme
Lengths of steel bars produced in a given production run
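The three kinds of random variable listed above can be told apart with a rough rule of thumb, sketched below in Python. It assumes that categorical responses are stored as text, counted values as integers and measured values as decimals; the function `classify_variable` and the survey data are hypothetical.

```python
def classify_variable(values):
    """Classify a variable from its observed values (a rough heuristic)."""
    if all(isinstance(v, str) for v in values):
        return "qualitative"
    # bool is excluded because Python treats True/False as integers.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "discrete quantitative"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "continuous quantitative"
    return "mixed or unknown"


# Hypothetical survey responses, one list per question.
survey = {
    "currently_unemployed": ["yes", "no", "no"],
    "rooms_in_house": [3, 5, 2],
    "height_cm": [162.5, 170.2, 158.0],
}

classification = {
    name: classify_variable(values) for name, values in survey.items()
}

for name, kind in classification.items():
    print(name, "->", kind)
```

The storage type is only a proxy: a measured height recorded to the nearest whole centimetre would still be a continuous variable, so the result should be checked against how the value was obtained (counted or measured).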

1.3.2 Classification of Data


When the raw data has been collected and edited, it should be put into an
ordered form (ascending or descending order), so that it can be looked at more
objectively. The next important step towards processing the data is classification.
Classification means separating items according to similar characteristics and
grouping them into various classes. The items in different classes will differ
from each other on the basis of some characteristics or attributes. Classification
of data is very similar to sorting of mail at a post office, where mail is classified
according to its geographical destination and may further be classified into the
type of mail such as first class, parcel post, and so on. The data may be classified
into four broad classes:
(i) Geographical. This classification groups the data according to locational
differences among the items. The geographical areas are usually listed
in alphabetical order for easy reference. For example, a book listing
colleges and universities in various states in the USA would first list the states
in alphabetical order and then the colleges and universities within
each state in alphabetical order.
(ii) Chronological. This classification includes data according to the time
period in which the items under consideration occurred. For example, the
sales of automobiles in India over the last ten years may be grouped
according to the year in which such sales took place.
(iii) Qualitative. In this type of classification, the data is grouped together
according to some distinguished characteristic or attribute such as religion,
sex, age, national origin, and so on. This classification simply identifies
whether a given attribute is present or absent in a given population. For
example, the population may be divided into two classes: male and female.
Then the attribute of male will go into one class and the attribute of female
will go into the other.
(iv) Quantitative. This refers to the classification of data according to some
attribute which has magnitude and can be measured such as classification
according to weight, height, income, and so on. For example, the salaries
of professors at a university may be classified according to their rank
such as instructor, assistant professor, associate professor and full
professor.
Hence, the collected data should be arranged systematically to give it shape,
form and meaning. The division of the data into homogeneous groups according
to their characteristics, recorded in a statistical inquiry, is called classification.
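The four classes described above can be illustrated by grouping one small data set in four different ways. The following Python sketch uses invented vehicle sales records; the field names, the helper `classify` and the price bands are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sales records; every name and figure is made up.
records = [
    {"state": "Kerala", "year": 2009, "segment": "car", "price_lakh": 4.2},
    {"state": "Assam", "year": 2010, "segment": "truck", "price_lakh": 12.5},
    {"state": "Kerala", "year": 2010, "segment": "car", "price_lakh": 6.8},
    {"state": "Assam", "year": 2009, "segment": "car", "price_lakh": 3.9},
]


def classify(items, key):
    """Group items into classes that share the same value of key(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(sorted(groups.items()))  # classes listed in sorted order


def price_band(record):
    """Quantitative classes; the class limit of 5 is chosen arbitrarily."""
    return "under 5" if record["price_lakh"] < 5 else "5 and over"


geographical = classify(records, lambda r: r["state"])    # by location
chronological = classify(records, lambda r: r["year"])    # by time period
qualitative = classify(records, lambda r: r["segment"])   # by attribute
quantitative = classify(records, price_band)              # by magnitude

for label, groups in [("geographical", geographical),
                      ("chronological", chronological),
                      ("qualitative", qualitative),
                      ("quantitative", quantitative)]:
    print(label, {name: len(members) for name, members in groups.items()})
```

The same four records fall into different homogeneous groups depending on which characteristic is used as the basis of classification.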


Self-Assessment Questions
3. State whether true or false.
(a) If the data is written in an ascending or descending order, it would
be called ordered data.
(b) Items in different classes will differ from each other on the basis of
some characteristics or attributes.
4. Fill in the blanks with the appropriate terms.
(a) A ______________ is any characteristic that can assume different
values.
(b) Classification means separating items according to similar
________________ and grouping them into various classes.

1.4 Data Sources: Primary vs Secondary


The statistical data, as previously discussed, may be classified under two
categories depending upon the sources utilized. These categories are:
1. Primary Data. Primary data is data collected by the investigator
personally for the purpose of a specific inquiry or study. Such data is original
in character and is generated by surveys conducted by individuals or
research institutions. For example, if a researcher is interested in knowing
what women think about the issue of abortion, he/she must undertake a
survey and collect data on the opinions of women by asking relevant
questions. Such data would be considered primary data.
2. Secondary Data. When an investigator uses the data which has already
been collected by others, such data is called secondary data. This data is
primary data for the agency that collected it and becomes secondary
data for someone else who uses this data for his own purposes. Secondary
data can be obtained from journals, reports, government publications,
publications of professional and research organizations and so on. For
example, if a researcher desires to analyse the weather conditions of
different regions, he can get the required information or data from the
records of the meteorology department. Secondary data is
less expensive to collect in terms of money and time, and the quality of this
data may even be better under certain situations, because it may have
been collected by persons who were specifically trained for that purpose.
However, such secondary data must be used with utmost care. The reason
is that such data may be full of errors due to the fact that the purpose of
the collection of data by the primary agency may have been different
from that of the user of the secondary data. Additionally, there may have
been biases introduced during collection of data or analysis of data. For
example, the size of the sample may have been inadequate or there may
have been arithmetical or definitional errors. Hence, it is necessary to
critically investigate the validity of secondary data as well as the credibility
of the primary data collection agency.
Sources of Data
The following are some of the sources of data for collecting first hand information.
Census
World Bank
WHO (World Health Organization)
NSSO (National Sample Survey Organization)
Economic Survey
National Family and Health Surveys
SRS Surveys
Multiple Indicator Survey
CSO, RBI, Gov.nic.in, CMIE
Since the quality of the results obtained from statistical data for the purpose
of using these outcomes for managerial decision-making depends upon the
quality of the collected information itself, it is important that a sound investigative
process be established to ensure that the data is highly representative and
unbiased. This requires a high degree of skill, and certain precautionary
measures must also be taken.
Activity 1
Collect first hand information from five families in your neighbourhood on
education, health and economic status. Tabulate the data as qualitative or
quantitative. Also classify the attributes as per the four measurement scales.


Self-Assessment Questions
5. Fill in the blanks with the appropriate terms.
(a) ___________ data is one which is collected by the investigator himself
for the purpose of a specific inquiry or study.
(b) When an investigator uses the data which has already been collected
by others, such data is called ________________ data.
6. Choose the right answer from the given options.
(a) To collect first hand information, we use ________________.
(i) Census
(ii) Interview
(iii) Observation
(iv) Questionnaire
(b) It is necessary to critically investigate the validity of ___________ data.
(i) World Bank
(ii) Census
(iii) Secondary
(iv) Primary

1.5 Research Procedure


In general, all data, whether qualitative or quantitative, is measured in some
form. Even discrete quantitative data which is counted can fit into some form of
measurement. There are four widely accepted levels of measurement. These
levels, from the weakest on the one extreme to the strongest on the other, in
order are: Nominal scale, Ordinal scale, Interval scale and Ratio scale. Before
discussing these various measurement levels, let us look at some of the attributes
possessed by these scales. (The scales themselves are explained later.)
(i) Magnitude. This is the quantitative value that exists or is assigned to an
attribute or characteristic and such values, when compared, will determine
whether the value of a given attribute in one case is greater than, equal to
or less than the value of the same attribute in another case. For example,
if student X gets 100 per cent marks in the final examination in a course
and student Y gets 40 per cent in the same exam, then student X may be
considered as more knowledgeable in that area than student Y.

(ii) Equal intervals. Some measurement scales are constructed in such a
manner that an interval between any two points along the scale has the
same magnitude as an interval of the same size between any other two
points along the same scale. For example, the difference in heights of
students between 60 inches and 63 inches is the same in magnitude as the
difference between 70 inches and 73 inches. This means that the interval
is 3 inches, no matter where it is measured on the scale. There may be
some exceptions to this rule. For instance, the difference between an IQ
of 180 and an IQ of 190 may not represent the same difference in
intelligence as the difference between an IQ of 80 and an IQ of 90, even
though numerically both differences have the same value.
(iii) Absolute zero point. The third attribute of the measurement scale is the
presence or absence of the zero point where the attribute has no value at
all. For example, the characteristic of height of a person does not have an
absolute zero point, since positive quantitative value of the attribute always
exists, no matter what the age of the person may be. On the other hand,
the number of TV sets in a family can have an absolute zero value if the
family has no TV set at all. In some unique cases we may assign a zero
value to an attribute for qualitative comparison purposes even when the
value of such an attribute is a positive quantitative number. For example,
we may say that an unintelligent person has zero intelligence, even though
it does not mean absolute zero.

1.5.1 Measurement of Scale


In light of the three attributes, let us now consider and discuss the four
measurement scales.
(i) Nominal scale. Applied to qualitative data only, it is also known as
classificatory scale, where the objects or items are classified into various
discrete and distinct groups or categories without any ranking or order
associated with such classified data. It does not possess any of the three
attributes discussed earlier: magnitude, equal intervals and absolute zero
point. It is the weakest form of measurement, so much so that some statisticians
do not consider it a scale at all. Examples of nominal scale would be
categorizing people according to their religion such as Christian, Muslim,
Hindu and so on, or according to their political affiliation such as Democrat,
Republican or Socialist. Other categories of nominal scale may be smoking
or non-smoking, ownership of house or no ownership of house, and so
on.
(ii) Ordinal scale. Also known as ranking scale, it possesses only the attribute
of magnitude. This means that various categories of items can be
compared with each other only in order of rank assigned to these
categories. However, these ranks only indicate which category is
greater or better; they do not indicate the magnitude of the difference
among these categories. For example, the students in a class may be
categorized according to their grades of A, B, C, D and F where A is
better than B, and so on, and the classification is from the highest grade
to the lowest grade. Another example of ordinal scaling would be the
classification of teaching faculty ranks in the colleges as full professors,
associate professors, assistant professors and instructors.
(iii) Interval scale. The interval scale measures the values of quantitative
random variables and identifies not only which category is greater or better
but also by how much. It is a stronger form of measurement and possesses
two attributes, which are magnitude and equal intervals. It does not
possess, however, the absolute zero point. Temperature in degrees Celsius
or Fahrenheit and calendar time are standard examples, since their zero
points are arbitrary. In this text, height and weight are also treated as
interval scale measurements because a person's height or weight is never
actually zero, although many statisticians classify them under the ratio
scale.
(iv) Ratio scale. The ratio scale is also used for measurement of quantitative
random variables, but it differs from interval scale in that it has a true zero
point, meaning that the values of such variables can be zero. It permits
mathematical manipulations such as division and multiplication.
Examples of ratio scale are temperature measured on an absolute scale such
as Kelvin, the number of students registered in various classes, and so on.
A temperature of zero kelvin means the total absence of heat and it is
also possible that zero students are registered for a given class. Similarly,
heights and weights, though considered here in interval scale, can have
hypothetical zero values.
These measurement scales assist in designing survey methods for the
purpose of collecting relevant data.
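The relationship between the three attributes and the four scales can be sketched in a few lines of Python. This is only an illustration of the classification described above, not a standard routine; the names Attributes, SCALES and measurement_scale are hypothetical, and the example variables are taken from the text.

```python
from typing import NamedTuple


class Attributes(NamedTuple):
    """The three attributes of a measurement scale discussed in Section 1.5."""
    magnitude: bool
    equal_intervals: bool
    absolute_zero: bool


# Each scale possesses one more attribute than the scale before it.
SCALES = {
    Attributes(False, False, False): "nominal",
    Attributes(True, False, False): "ordinal",
    Attributes(True, True, False): "interval",
    Attributes(True, True, True): "ratio",
}


def measurement_scale(attrs: Attributes) -> str:
    """Return the scale that matches a combination of attributes."""
    try:
        return SCALES[attrs]
    except KeyError:
        # For example, equal intervals without magnitude fits no scale.
        raise ValueError(f"no scale matches {attrs}") from None


# Illustrative classifications based on the examples in the text.
examples = {
    "religion": Attributes(False, False, False),
    "class grade (A to F)": Attributes(True, False, False),
    "number of TV sets in a family": Attributes(True, True, True),
}
for variable, attrs in examples.items():
    print(variable, "->", measurement_scale(attrs))
```

The lookup table makes the ordering of the scales visible: moving from nominal to ratio, each level keeps the attributes of the previous level and adds one more.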

1.5.2 Methods of Collecting Primary and Secondary Data


Planning the Study
Before any procedures for data collection are established, the purpose and the
scope of the study must be clearly specified. If any similar studies have been
conducted, prior to the current one, then the investigator may want to use some
secondary data in his own study, and may redefine his objectives on the basis
of the previous studies conducted. The scope of the study must take into
consideration the field to be covered, and the time period in which to conduct
the study. The time span is very important, because in certain areas, the
conditions change very quickly, and hence, by the time the study is completed,
it may become irrelevant. The statistical units and the desired accuracy of such
units must be clearly specified.
Methods of Collecting Primary Data
Primary data is collected by the investigator for a specific study. This data should
be unique in nature and should be kept confidential until it is published. The following
are the methods of collecting primary data.
Questionnaires: These are the most popular means of collecting primary
data. A questionnaire is designed for a specific problem and can be used,
for example, for interviewing or for a telephone survey. It can be posted,
e-mailed or faxed and can be used for a large number of people or
organizations. When it is sent out in this way, it does not require prior
arrangements and there is no interviewer bias. The questionnaire must not
be too long, too complex, uninteresting or too personal. The questions
asked must be simple so that the respondent can read all questions and
reply. The basic subject of the questionnaire must be made clear in a
covering letter. The researcher must state his/her identity, why the data
is being collected and a declaration of confidentiality and anonymity. A
request and instructions to return the duly filled questionnaire must be
mentioned with the return date. You can word the request as, 'It would be
greatly appreciated if you could return the completed questionnaire
by..........'
Interviews: This is a technique basically used to know the mind-set, likings
or behaviour of the person being interviewed. Interviews can be conducted on a
personal one-to-one basis or in a group. Interviews can be of structured,
semi-structured and unstructured types. The structured type is based on a
carefully worded interview plan. In the semi-structured type, the interview is
based on questions that provide scope to the respondent to answer at length.
The unstructured type is also termed an in-depth interview. The interviewer
starts with general questions to encourage the respondent to talk without
restraint. To conduct an interview, the researcher has to prepare a list of
topics on which information is required, select the type of interview, frame
the relevant questions and then fix an appointment with the respondent.
Telephone interview: This is a type of personal interview which is conducted
over the telephone instead of face to face. It gives a high response rate and
the answers can be taped for keeping record. This method can be used only if
the respondent has a telephone.

Focus group interviews: This type of interview is conducted by a qualified
representative on a small group of respondents in a non-structured and natural
manner. The representative leads the conversation and the main idea is to get
insights by carefully listening to a small selected group of people on specific
subjects.
Observation: In this method the behavioural styles of specific people,
objects and happenings are recorded in a systematic way. Observational
methods can be structured or unstructured, disguised or undisguised, natural,
personal, mechanical, participant and non-participant. In structured
observation, the researcher decides in advance what is to be observed and how the
observed records will be analysed, while in unstructured observation the
researcher observes all phases of the event and then records the relevant ones.
In personal observation, a researcher watches the real behaviour as it happens.
In participant observation, the researcher becomes part of the group being
investigated, while in non-participant observation the researcher does not
communicate with the group being observed.
Methods of Collecting Secondary Data
The chief sources of secondary data may be broadly classified into the following
two groups:
(i) Published sources
(ii) Unpublished sources
(i) Published sources: There are a number of national organizations and
international agencies which collect and publish statistical data relating to
business, trade, labour, price, consumption, production, etc. These
publications are useful sources of secondary data. Some of these
published sources are as follows:
1. Official publications of the Central and State Governments such as
monthly abstract of statistics, national income statistics and vital
statistics of India.
2. Publications of semi-government organizations, e.g., the Reserve
Bank of India bulletin.
3. Publications of research institutions, e.g., the publications of the
Indian Council of Agricultural Research (ICAR), New Delhi.
4. Publications of commercial and financial institutions, e.g., the
publications of the Federation of Indian Chambers of Commerce and
Industry (FICCI).

5. Reports of various committees and commissions appointed by the
government, such as the Wanchoo Commission Report on Taxation.
6. Newspapers and periodicals like Economic Times and Statesman
Yearbook also publish useful statistical data.
7. International publications like the U.N. Statistical Yearbook and
Demographic Yearbook.
(ii) Unpublished sources: These include the records maintained by private
firms or business houses, which may not like to release their data to any
outside agency, and the studies carried out by research scholars in
universities or research institutes. Both may provide useful statistical data.
Precautions in the use of secondary data: Secondary data should be
used with extra caution since they have been collected by someone other than
the investigator. Before using such data, the investigator must be satisfied in
regard to the reliability, accuracy, adequacy and suitability of the data to the
given problem under investigation. Before using secondary data, the investigator
should examine the following questions.
1. Is the data suitable for the purpose of investigation? For this, he should
compare the objectives, the nature and the scope of the given enquiry
with the original investigation. He should also confirm that the various
terms and units were clearly defined and were uniform throughout the
earlier investigation and these definitions are suitable for the present
enquiry as well.
2. Is the data reliable? For this, the investigator himself should be satisfied
about the following:
(i) The reliability, integrity and experience of the collecting organization
(ii) The reliability of the source of information
(iii) The methods used for collection and analysis of the data.
(iv) The degree of accuracy desired by the organization that compiled the data.
3. Is the data adequate? Adequacy of data is to be judged in the light of the
requirements of the survey and the geographical areas covered by the
available data. Adequacy of data is also to be considered in the light of
the time period for which the data is available.
Hence, in order to arrive at conclusions free from limitations and
inaccuracies, the secondary data must be subjected to thorough scrutiny and
editing before it is accepted for use.
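The three questions above can be applied as a simple checklist before a secondary source is accepted. The following Python sketch is only one possible way of recording the outcome of such a scrutiny; the class SecondarySource, the functions failed_checks and acceptable, and the example source are all hypothetical, and the judgement behind each True or False still has to be made by the investigator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SecondarySource:
    """Outcome of the investigator's scrutiny of one secondary source."""
    name: str
    suitable: bool   # Question 1: does it fit the purpose of the enquiry?
    reliable: bool   # Question 2: are the collector and methods trustworthy?
    adequate: bool   # Question 3: does it cover the required area and period?


def failed_checks(source: SecondarySource) -> List[str]:
    """List the checks that the source did not pass."""
    checks = {
        "suitability": source.suitable,
        "reliability": source.reliable,
        "adequacy": source.adequate,
    }
    return [label for label, passed in checks.items() if not passed]


def acceptable(source: SecondarySource) -> bool:
    """A source is accepted only if it passes all three checks."""
    return not failed_checks(source)


# A made-up source that covers too short a time period for the enquiry.
bulletin = SecondarySource(
    name="hypothetical trade bulletin",
    suitable=True,
    reliable=True,
    adequate=False,
)
print(acceptable(bulletin), failed_checks(bulletin))
```

Recording which check failed, and not only the final decision, shows whether the data can still be used after editing or must be rejected.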

Activity 2
Observe a group of students participating in a debate competition. Collect
data on their behaviour and categorize them as most active, moderately
active and less active.

Self-Assessment Questions
7. Fill in the blanks with the appropriate terms.
(a) _____________ is the quantitative value that exists or is assigned
to an attribute or characteristic.
(b) Absolute zero point refers to the _____________ which has no value
at all on measurement scale.
8. State whether true or false.
(a) Nominal scale is the weakest form of measurement so that some
statisticians do not consider it as a scale at all.
(b) In the observation method, the behavioural styles of specific people,
objects and happenings are recorded in an unsystematic way.

1.6 Summary
Let us recapitulate the important concepts discussed in this unit:
• Information plays a vital role in decision-making. It is provided by the
  information system set up in the organization. The management depends
  on information systems for effective decision-making. Information consists
  of data (facts and figures) which is processed and retrieved to be used for
  forecasting and decision-making.
• The information should be context specific and available when it is required.
• When information is specifically arranged according to the requirement
  or problem, it is termed as knowledge.
• Data comprise the numerical results of any measurement. Data can also
  be used in singular sense, such as a set of data.
• A variable is any characteristic that can assume different values. There
  are two types of variables: discrete variable and continuous variable.
• A random variable is a phenomenon in which the observed outcomes of
  an activity are entirely a matter of chance, are absolutely unpredictable
  and may differ from response to response. By definition of randomness,
  each possible entity has the same chance of being considered. A random
  variable may be qualitative or quantitative in nature.
• Classification means separating items according to similar characteristics
  and grouping them into various classes. The data may be classified into
  four broad classes as (i) geographical, (ii) chronological, (iii) qualitative
  and (iv) quantitative.
• The statistical data may be classified under two categories depending upon
  the sources utilized as primary data and secondary data.
• Primary data is data that is collected by the investigator himself for the
  purpose of a specific inquiry or study. Such data is original in character and
  is generated by surveys conducted by individuals or research institutions.
• When an investigator uses the data which has already been collected by
  others, such data is called secondary data. This data is primary data for
  the agency that collected it and becomes secondary data for someone
  else who uses this data for his own purposes.
• The various sources which provide first hand information are the
  Census, World Bank, WHO, NSSO, Economic Survey, Demographic and
  Health Surveys, etc.
• In general, all data, whether qualitative or quantitative, is measured in
  some form. There are four widely accepted levels of measurement. These
  levels, from the weakest on the one extreme to the strongest on the other,
  in order are nominal scale, ordinal scale, interval scale and ratio scale.

1.7 Glossary
Data: Numerical results of any measurement
Variable: Any characteristic that can assume different values
Random variable: A qualitative or quantitative phenomenon in which the
observed outcomes of an activity are entirely a matter of chance, are
absolutely unpredictable and may differ from response to response.
Primary data: Data collected by the investigator for the purpose of a
specific inquiry or study. The data is original in character and is generated
by surveys conducted by individuals or research institutions.
Secondary data: When an investigator uses the data which has already
been collected by others, then the data is secondary data for the
investigator but it remains primary data for those who collected it. It is
obtained from journals, reports, government publications, etc.

1.8 Terminal Questions


1. What role does information play in decision-making?
2. How is information evaluated and processed? Explain with the help of
examples.
3. Define the various types of data with the help of examples.
4. Explain the four categories of data classification with the help of examples.
5. Differentiate between primary data and secondary data. Under what
circumstances would secondary data be more useful than primary data?
6. Describe in detail the four types of measurement scales. Illustrate your
explanation with examples.
7. What are the various modes of data collection? Under what circumstances
would each method be more suitable as compared to other methods?
Give reasons for your beliefs.

1.9 Answers
Answers to Self-Assessment Questions
1. (a) Information; (b) Knowledge
2. (a) True; (b) False
3. (a) True; (b) True
4. (a) Variable; (b) Characteristics
5. (a) Primary; (b) Secondary
6. (a) i; (b) iii
7. (a) Magnitude; (b) Attribute
8. (a) True; (b) False

Answers to Terminal Questions


1. Refer to Section 1.2
2. Refer to Sections 1.2.2 and 1.2.3
3. Refer to Section 1.3
4. Refer to Section 1.3.2
5. Refer to Section 1.4
6. Refer to Section 1.5
7. Refer to Section 1.5.2

1.10 Further Reading


1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2002.
2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand
& Sons, 2007.

Endnote
1. Aggarwal, Y.P. Statistical Methods, New Delhi: Sterling Publishers, 1986, p.5.

Unit 2

Data Collection Methods

Structure
2.1 Introduction
Objectives
2.2 Observation
2.3 Questionnaire
2.4 Interviews
2.5 Experiments
2.6 Summary
2.7 Glossary
2.8 Terminal Questions
2.9 Answers
2.10 Further Reading

2.1 Introduction
In the previous unit, you learnt about information and data sources. Data sources
help in collecting data.
In this unit, you will learn about the various data collection methods.
The unit describes the advantages and shortfalls of various types of
observations. You will also learn about the process of preparing a
questionnaire, what should be kept in mind while drafting it and what
pattern of questions should be adopted, i.e., dichotomous, multiple choice
or open questions. You will also learn about the different modes of
interviews along with their merits and demerits. Accurate records have to be
made to keep people updated about the current scenario of the society. As
there are several methods of data collection, the methods that consume the
least amount of time are usually put into use. Data collection techniques such
as questionnaires and interviews play a vital role in collecting a large amount
of information in a short period of time and hence are discussed in this
unit. Experiments are resorted to when it is necessary to collect factual data
and nothing is available for reference. They may also be conducted to verify a
theory. An experiment is a study conducted under controlled conditions.

Objectives
After studying this unit, you should be able to:
• Prepare a questionnaire
• Explain the significance of interviews
• Discuss other modes of data collection along with their advantages
• Explain the importance of experiments

2.2 Observation
Observation may be defined as recording behavioural patterns without verbal
communication.
Primary data can be collected using the following method.
Direct personal observation. Under this method, the investigator
presents himself personally before the informant and obtains first hand
information. This method is most suitable when the field of enquiry is small and
a greater degree of accuracy is required.
We shall now see the merits and limitations of the observation method.
Merits
(i) The first hand information obtained by the investigator is bound to
be more reliable and accurate since the investigator can extract the
correct information by removing doubts, if any, in the minds of the
respondents regarding certain questions.
(ii) High response rate, since the answers to various questions are
obtained on the spot.
(iii) It permits explanation of questions concerning difficult subject matter.
(iv) It permits evaluation of respondent, his circumstances and reliability.
(v) This method is useful where spontaneity of response is required.
(vi) It provides personal rapport, which helps to overcome reluctance to
respond.
(vii) Where the investigator and the informant talk face to face, it becomes
possible to explore questions in depth.
(viii) Information is collected promptly rather than trickling in over time.

Limitations
(i) This method is suitable only for intensive studies and not for extensive
enquiries.
(ii) This method is time-consuming and the investigation may have to
be spanned over a long period.
(iii) This method is highly subjective in nature and the results of the
enquiry may be adversely affected by the personal bias, whim and
prejudices of the investigator.
Activity 1
Find a situation in which direct personal observation is the perfect method
for data collection.

Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Observation may be defined as recording ___________ patterns
without verbal communication.
(b) Direct personal observation method is most suitable when the field
of ____________ is small and a greater degree of accuracy is
required.
2. State whether true or false.
(a) Direct personal observation does not permit explanation of questions
concerning difficult subject matter.
(b) Direct personal observation provides personal rapport, which helps
overcome reluctance to respond.

2.3 Questionnaire
The questionnaire method can be used in two ways: by mailing the
questionnaires or by sending them through enumerators.

2.3.1 Mailed Questionnaire Method


Under this method, the investigator prepares a questionnaire containing a number
of questions pertaining to the field of enquiry. These questionnaires are sent by
post to the informants together with a polite covering letter explaining in detail
the aims and objectives of collecting the information and requesting the
respondents to cooperate by furnishing the correct replies and returning the
questionnaire duly filled in. In order to ensure quick response, the return postage
expenses are usually borne by the investigator. This method is usually adopted
by research workers, private individuals and non-official agencies. The success
of this method depends upon the proper drafting of the questionnaire and the
cooperation of the respondents.
Merits
(i) By this method, a large field of investigation may be covered at a
very low cost. In fact, this is the most economical method in terms of
time, money and manpower.
(ii) Errors due to personal bias of the investigators or enumerators are
completely eliminated as the information is supplied by the person
concerned in his own handwriting.
Limitations
(i) This method can be used only if the respondents are educated and
can understand the questions well, and reply in their own handwriting.
(ii) Sometimes, the informants may not send back the schedules and
even if they return the schedules, they may be incorrectly filled in.
(iii) Sometimes, the informants are not willing to give written information
in their own handwriting on certain personal questions like income,
personal habits and property.
(iv) There is no scope for asking supplementary questions for cross-checking
of the information supplied by the respondents.

2.3.2 Questionnaire Sent Through Enumerators


Under this method, instead of sending the questionnaire through post, the
investigator appoints agents known as enumerators, who go to the respondents
personally with the questionnaire, ask them the questions given therein, and
record their replies. This method is generally used by business houses, large
public enterprises and research institutions.
Merits
(i) The information collected through this method is more reliable as
the enumerators can explain in detail the objectives and aims of the
enquiry to the respondents and win their cooperation.
(ii) Since the enumerators personally call on the respondents, there is
very little non-response.
(iii) This technique can be used with advantage even if the respondents
are illiterate.
(iv) The enumerators can effectively check the accuracy of the
information supplied through some intelligent cross-questioning by
asking supplementary questions.
Limitations
(i) The method is more expensive and can only be used by financially
strong bodies or institutions.
(ii) It is more time-consuming than the mailed questionnaire method.
(iii) The success of the method depends on the skill and efficiency of
the enumerators who collect the information and also on the efficiency
and wisdom with which the questionnaire is drafted.

2.3.3 Drafting or Framing the Questionnaire


Since the questionnaire is the only medium of communication between the
investigator and the respondents, it must be designed or drafted with utmost
care and caution so that all the relevant and essential information for the enquiry
may be collected without any difficulty, ambiguity or vagueness. Designing of
questionnaire, therefore, requires a high degree of skill and experience on the
part of the investigator. No hard and fast rules can be laid down for designing or
framing a questionnaire. However, it would help if the following general points
are borne in mind while drafting a questionnaire:
1. The size of the questionnaire should be as small as possible. The number
of questions should be kept to the minimum keeping in view the nature,
objectives and purpose of enquiry. Respondents' time should not be wasted
by asking irrelevant and unimportant questions. Fifteen to twenty-five
questions may be regarded as a fair number. If a larger number of questions is
unavoidable in any enquiry, the questionnaire should preferably be divided
into two or more parts.
2. Questions should be clear, brief, unambiguous, non-offending, courteous
in tone, corroborative in nature and to the point.
3. Questions should be logically arranged.
4. Questions should be short, simple and easy to understand. The usage of
vague or multiple meaning words should be avoided. Unless the
respondents are technically trained, the use of technical terms should be
avoided.
5. Questions should be so designed that the respondents can easily
comprehend and answer them. Questions involving mathematical
calculations should not be asked.
6. Questions of sensitive or personal nature should be avoided.
7. The questionnaire should provide necessary instructions to the
enumerators.
8. If a particular question needs clarification, it should be explained by way
of a footnote.
9. Questions should be capable of objective answer. Various types of
questions in the questionnaire may be grouped under three categories:
(i) Dichotomous or simple alternate questions in which the
respondent has to choose between two clear-cut alternatives like
'Yes or No', 'Right or Wrong', 'Either, Or', and so on. This
technique can be applied elegantly in situations where two clear-cut
alternatives exist.
(ii) Multiple choice questions in which the respondent is asked to select
one out of a number of responses. All possible answers to a question
are listed and the respondent chooses one of these. Such questions
save time and facilitate tabulation. This method should be used only
if a few alternative answers exist to a particular question.
(iii) Open questions are those in which no alternative answers are
suggested and the respondents are free to express their frank and
independent opinions on the problem in their own words usually in
an essay form.
10. Cross-checks: The questionnaire should be so designed as to provide a
cross-check on the accuracy of the information supplied by the
respondents by including some connected questions.
11. Pre-testing the questionnaire: The questionnaire should be tried on a
small group before using it for the given enquiry. This will help in improving
or modifying the questionnaire in the light of the drawbacks, shortcomings
and problems faced by the investigator in the pre-test.

12. A covering letter, stating briefly the aims and objectives of the enquiry,
soliciting cooperation of the respondents, and explaining various terms
and concepts, should be enclosed along with the questionnaire.
13. In case of a mailed questionnaire method, a self-addressed stamped
envelope should be enclosed.
14. To ensure quick response, the respondents may be offered incentives in
the form of gift coupons, a sample of the product to be introduced, or a
promise to supply a copy of the findings after the survey work is over.
15. Method of tabulation and analysis, whether hand-operated, machine-operated
or computerized, should also be kept in mind while designing
the questionnaire.
16. Lastly, the questionnaire should be made attractive by a proper layout
and an appealing get-up.
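Some of the points above, in particular the three categories of questions in point 9 and the suggested length in point 1, can be checked mechanically when a questionnaire is prepared for computerized tabulation. The Python sketch below is only an illustration under stated assumptions: the names Question, design_problems, accepts and within_suggested_length are hypothetical, the rule that a multiple choice question lists at least three alternatives is an assumption made for this example, and the two sample questions are borrowed from the specimen questionnaire in Section 2.3.4.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Question:
    text: str
    kind: str  # "dichotomous", "multiple_choice" or "open"
    options: List[str] = field(default_factory=list)


def design_problems(question: Question) -> List[str]:
    """Return the design problems found; an empty list means none."""
    problems = []
    if question.kind == "dichotomous" and len(question.options) != 2:
        problems.append("dichotomous question needs exactly two alternatives")
    elif question.kind == "multiple_choice" and len(question.options) < 3:
        # Assumption for this sketch: fewer than three is not a real choice list.
        problems.append("multiple choice question needs at least three alternatives")
    elif question.kind == "open" and question.options:
        problems.append("open question should not list alternatives")
    elif question.kind not in ("dichotomous", "multiple_choice", "open"):
        problems.append("unknown question type")
    return problems


def accepts(question: Question, answer: str) -> bool:
    """Check a reply: open questions take any non-empty text,
    closed questions only one of the listed alternatives."""
    if question.kind == "open":
        return answer.strip() != ""
    return answer in question.options


def within_suggested_length(questions: List[Question],
                            low: int = 15, high: int = 25) -> bool:
    """Point 1 suggests fifteen to twenty-five questions as a fair number."""
    return low <= len(questions) <= high


new_or_used = Question("Did you buy this car new or used?",
                       "dichotomous", ["New", "Used"])
car_size = Question("What size of car do you own?",
                    "multiple_choice", ["Luxury", "Mid-size", "Compact"])

print(design_problems(new_or_used), accepts(car_size, "Compact"))
```

Checks of this kind do not replace pre-testing on a small group of respondents; they only catch mechanical slips, such as a dichotomous question that offers three alternatives.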

2.3.4 A Specimen Questionnaire


This hypothetical study is adapted from a study developed by Deepak Mahendru
in India. Assume that this study involves 200 professors in New York colleges
who are asked about their interest in buying automobiles. The basic objective
of this survey is to determine certain marketing trends among the population of
professors in New York regarding their automobile buying patterns. These trends
are based upon the following factors:
• The profile of the decision-maker who finally decides to buy a particular
  type of car.
• People around the decision-maker who influence the decision-making
  process.
• The factors affecting the selection of a particular dealer of cars.
• People in the family who make or affect decisions regarding the maximum
  budget that can be allocated for purchasing a car.
• The effect of various options available in the car.
• The image and reliability of the company that makes these cars.
• The effect of heavy promotion on television about the utility of the car on
  the decision-maker.

(For the sake of simplicity, it is assumed that the professors have only
one car in the family.)
The Questionnaire
1. General
Name: ...................................................................................
Age: ......................................................................................
Sex: M .................... F ....................
Marital status: Married .................. Unmarried .................
Number of members in the family
1-2...................
3-4...................
5-6...................
Over 6..............
Yearly income
Less than 30,000...................
30,000-39,999......................
40,000-49,999......................
50,000 and more...................
2. What type of car do you own now?
.................Indian
.................Japanese
.................European
3. What size of car do you own?
.................Luxury
.................Mid-size
.................Compact
4. Did you buy this car new or used?
.................New....................Used
5. If you bought a used car, did you buy it from a dealer or a private party?
.................Dealer.................Private party


6. If you bought a new car, how long have you owned this car?
.................Number of years
7. If you bought a used car, how old is this car now?
..............Number of years
8. Price paid for the car..........New..........Used
9. Who influenced your decision to purchase the above brand of car?
Indicate if more than one.
...............Yourself ...................... Your wife
...............Your children ...................... Your friend
...............Your neighbour ...................... Your colleague
Others..................................................................................
10. Indicate as to who decided about the budget allocation for the car.
...............Yourself
...............Your spouse
...............Family decision
11. If you bought your car from a dealer, then who influenced your decision
regarding the selection of a particular dealer?
...............Yourself
...............Your friend
...............Your colleague
...............Family decision
12. How did you come to know about this dealer?
...............TV commercial
...............Newspapers
...............Personal references
...............Others
13. Rank the following factors that affected the final decision at the time of
purchasing the car. A rank of 1 measures the most important factor, a
rank of 2 measures the second most important factor, and so on.
...............Very inconvenient without the car
...............Money was available


...............Reputation of car manufacturer
...............Discounts offered
...............Interest rate on financing
...............Guarantees and warranties offered
...............Others
14. Did you make an extensive survey regarding price comparisons after
you decided to buy the particular car? ............ Yes......... No
15. If you bought a used car, how did you learn about it?............ Newspapers
...............Friend ............... Others
16. In order of preference, what were the major reasons for buying a used
car?
...............Unavailability of adequate funds
...............Cheaper insurance
...............Lack of parking garage
...............Condition of the car
...............Others
17. Which of the following media do you think is most effective in creating an
impact on the potential customer relative to a particular brand of car?
...............TV ............... Newspapers
...............Magazines ............... Favourable news reports
...............Word of mouth ............... Others

The responses to such questions would form the basis of analysis in
order to achieve the set marketing objectives.
Activity 2
Draft a questionnaire to collect data on any awareness program such as
polio awareness.


Self-Assessment Questions
3. State whether true or false.
(a) Mailed questionnaires are sent by post to the informants together
with a polite covering letter explaining in detail the aims and objectives
of collecting the information and requesting the respondents to
cooperate by furnishing the correct replies and returning the
questionnaire duly filled in.
(b) Designing of questionnaire requires a high degree of skill and
experience on the part of the investigator.
4. Fill in the blanks with the appropriate terms.
(a) If a particular question needs clarification, it should be explained by
way of a __________________.
(b) Questions should be ________________ arranged.

2.4 Interviews
Indirect personal interview. Under this method, instead of directly approaching
the informants, the investigator interviews several third persons who are directly
or indirectly concerned with the subject matter of the enquiry and who are in
possession of the requisite information. Such a procedure is followed by the
enquiry committees and commissions appointed by the Government of India.
The committee selects persons, known as witnesses, and collects information
from them by getting answers to questions decided in advance. This method is
highly suitable where direct personal investigation is not practicable either
because the informants are unwilling or reluctant to supply information or where
the information desired is complex and the study in hand is extensive.
Merits
(i) This method is less costly and less time-consuming than direct
personal investigation.
(ii) Under this method, the enquiry can be formulated and conducted
more effectively and efficiently as it is possible to obtain the views
and suggestions of the experts on the given problem.


Limitations
The success of this method depends upon:
(i) The representative character of the witnesses.
(ii) The personal knowledge of the witnesses about the subject matter
of enquiry.
(iii) The personal prejudices of the witnesses as regards definiteness in
stating what is wanted.
(iv) The ability of the interviewer to extract information from the witnesses
by asking appropriate questions and cross-questions.

2.4.1 Other Methods


Telephone survey. Under this method, the investigator, instead of presenting
himself before the informants, contacts them on telephone and collects
information from them.
Merits
(i) The method is more convenient than personal interview.
(ii) This method is less time-consuming and can be applied even to
extensive fields of enquiries. Telephone survey has all the other merits
of personal interview.
Limitations
(i) This method excludes those who do not have a telephone as also
those who have unlisted telephones.
(ii) This method is also subjective in nature and personal bias, whim
and prejudices of the investigator may adversely affect the results of
the enquiry.
Information received through local agents. Under this method, the information
is not collected formally by the investigator; instead, local agents, commonly
known as correspondents, are appointed in different parts of the area under
investigation.
These agents collect information in their areas and transmit the same to the
investigator. They apply their own judgement as to the best method of obtaining
information. This method is usually employed by newspapers or periodical
agencies which require information in different fields such as economic trends,
business, stock and share markets, sports, politics and so on.


Merits
(i) This method is very cheap and economical for extensive
investigations.
(ii) The required information can be obtained expeditiously since only
rough estimates are required.
Limitations
(i) Since the correspondents apply their own judgement about the
method of collecting the information, the results are often vitiated
due to personal prejudices and whims of the correspondents. The
data so obtained is thus not so reliable.
(ii) This method is suitable only if the purpose of investigation is to obtain
rough and approximate estimates. It is unsuited where a high degree
of accuracy is desired.
Activity 3
How will you conduct an interview if the person is not ready to give it? Give
an example.

Self-Assessment Questions
5. Fill in the blanks with the appropriate terms.
(a) The committee selects persons, known as ____________ and
collects information from them by getting answers to questions
decided in advance.
(b) The local agents collect information in their areas and ____________
the same to the investigator.
6. Fill in the blanks with the appropriate terms.
(a) The success of the interview method depends upon the
______________ character of the witnesses.
(b) The telephone survey method is more convenient than personal
________________.


2.5 Experiments
Experiments are another method of collecting data. Experiments are resorted to
when it is necessary to collect factual data and nothing is available for
reference. An experiment may also be conducted to verify a theory. It is a study
conducted under controlled conditions. Experiments are made by researchers to
understand the cause and effect relationships. Such relationships are also
studied in observational studies, but there the researcher has no control on how
subjects are assigned to groups.
Experimental design
An experimental design consists of information-gathering exercises in which the
variation is under the control of the experimenter. In observational studies,
there is no control on condition. Mostly, an experimenter wants to know the
effect of some process on certain objects, which are taken as experimental
units. Such units may be, for example, a small section of people or a few
groups. Such designs find broad application in natural and social sciences.
The random design experiment is very helpful in situations when we have
to analyse a huge amount of outcome data. The word experiment or random
experiment is used when we face an uncertain situation and we need to have
some observations about the situation. Random does not imply haphazard. We
need to be careful to ensure that appropriate random methods are used. The
actual result of the uncertain situation is referred to as an outcome or sample
point. In a random experiment, nothing can be said with certainty about the
outcome. An experiment may comprise one or more observations. If there is a
single observation, we use the term random trial or simply trial. An electric fan,
for example, may be selected from a factory to examine whether or not it is
defective. Selecting a single fan is a trial. We can select as many fans as we
wish. The number of observations will be equal to the number of fans selected.
The properties of a random experiment may be listed as follows:
We can repeat the experiment any number of times.
A random trial comprises at least two possible outcomes.
We cannot say with certainty about the outcome of the random trial or
random experiment.
There are three things in common in all statistical experiments.
1. The experiment can have many possible outcomes.
2. We can specify each possible outcome in advance.
3. The outcome of the experiment is dependent on chance.

A coin toss, for example, has all the attributes of a statistical experiment.
In this case, there is more than one possible outcome. It is possible to specify
each possible outcome (i.e., heads or tails) in advance. Finally, there is an
element of chance, since the outcome is uncertain.
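The properties of a random experiment listed above can be illustrated with a
short simulation. The following is a minimal Python sketch and is not part of
the original study material: the function name run_trials and the assumption
that all outcomes are equally likely unless weights are supplied are
illustrative choices.

```python
import random


def run_trials(outcomes, n_trials, weights=None, seed=None):
    """Repeat a random trial n_trials times and count how often each outcome occurs.

    outcomes: the possible outcomes, all of which are known in advance.
    weights:  optional relative chances; equal chances are assumed if omitted.
    """
    rng = random.Random(seed)
    counts = {outcome: 0 for outcome in outcomes}
    for _ in range(n_trials):
        # One trial: the result depends on chance alone.
        result = rng.choices(outcomes, weights=weights, k=1)[0]
        counts[result] += 1
    return counts


# A coin toss repeated 1000 times (a fair coin is assumed).
coin_counts = run_trials(["heads", "tails"], n_trials=1000, seed=1)
print(coin_counts)
```

Running the sketch with different seeds gives different counts, which reflects
the element of chance, while the set of possible outcomes stays the same.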
Analysis of the experimental design has the foundation of variance
analysis. Analysis of variance is a collection of models in which the observed
variance is partitioned into components due to different factors, which are
then estimated and tested.
We now consider another experiment where eight objects are to be
weighed using a pan balance and a set of standard weights. Each weighing
measures the difference between the weight of the objects in the left pan and
that of the objects in the right pan: standard weights are added to the
lighter pan until the equilibrium point is reached. Each measurement has a
random error that averages zero. The standard deviation of the probability
distribution of the errors is the same value $\sigma$ for all weighings, and
the errors of different weighings are independent. We denote the true weights
by $\theta_1, \dots, \theta_8$.
Experiments considered are,
1. Weighing of each object in one pan, while the other is empty. We denote
$X_i$ as the measured weight of the ith object, where i varies from 1 to 8.
2. Carrying out eight weighings as per the schedule given below. We take the
measured difference as $Y_i$, where i varies from 1 to 8.

Weighing   Left pan           Right pan
1st        1 2 3 4 5 6 7 8    (empty)
2nd        1 2 3 8            4 5 6 7
3rd        1 4 5 8            2 3 6 7
4th        1 6 7 8            2 3 4 5
5th        2 4 6 8            1 3 5 7
6th        2 5 7 8            1 3 4 6
7th        3 4 7 8            1 2 5 6
8th        3 5 6 8            1 2 4 7

The estimated value of the weight $\theta_1$ is

$$\hat{\theta}_1 = \frac{Y_1 + Y_2 + Y_3 + Y_4 - Y_5 - Y_6 - Y_7 - Y_8}{8}.$$

Estimated values for the weights of the other items are found similarly. For
example, the estimate of $\theta_2$ is

$$\hat{\theta}_2 = \frac{Y_1 + Y_2 - Y_3 - Y_4 + Y_5 + Y_6 - Y_7 - Y_8}{8}.$$

In decision-making one has to choose better alternatives. Here, $\sigma^2$ is
the variance of the estimate $X_1$ of $\theta_1$ for the first experiment, but
$\sigma^2/8$ is the variance for the second experiment. Thus, the second
experiment gives eight times more precision for a single item, and it
estimates all items simultaneously with the same precision. To reach the same
precision by weighing the items separately, 64 weighings would be needed
instead of the 8 weighings of the second experiment. Note, however, that the
estimates for the items in the second experiment all depend on the same eight
measurements, so their errors are related to one another.
This also serves as an example for the design of experiments that involve
combinatorial designs.
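The weighing schedule above can be checked numerically. The sketch below is an
illustration under stated assumptions and is not part of the original text:
the true weights, the error standard deviation and the normally distributed
errors are hypothetical choices, and each measured difference is taken as left
pan minus right pan. Each estimate adds the measurements in which the object
was in the left pan, subtracts those in which it was in the right pan, and
divides by 8.

```python
import random
import statistics

# Objects placed in the left pan at each of the eight weighings (from the
# schedule in the text); all remaining objects go in the right pan.
LEFT_PAN = [
    {1, 2, 3, 4, 5, 6, 7, 8},
    {1, 2, 3, 8},
    {1, 4, 5, 8},
    {1, 6, 7, 8},
    {2, 4, 6, 8},
    {2, 5, 7, 8},
    {3, 4, 7, 8},
    {3, 5, 6, 8},
]


def sign(weighing, obj):
    """+1 if the object is in the left pan at this weighing, -1 otherwise."""
    return 1 if obj in LEFT_PAN[weighing] else -1


def measure(true_weights, sigma, rng):
    """Simulate the eight measured differences (left pan minus right pan)."""
    return [
        sum(sign(k, j + 1) * w for j, w in enumerate(true_weights))
        + rng.gauss(0.0, sigma)  # assumed normal error with mean zero
        for k in range(8)
    ]


def estimate(y):
    """Estimate each weight as the signed sum of the measurements divided by 8."""
    return [sum(sign(k, j) * y[k] for k in range(8)) / 8 for j in range(1, 9)]


rng = random.Random(0)
true_weights = [10.0, 12.0, 15.0, 20.0, 24.0, 30.0, 35.0, 40.0]  # hypothetical
sigma = 1.0  # hypothetical error standard deviation

# Without measurement error the schedule recovers every weight.
exact = estimate(measure(true_weights, 0.0, rng))
print(exact)

# Repeat the second experiment many times and look at the spread of the
# estimate of the first weight; it should be close to sigma**2 / 8.
first_estimates = [
    estimate(measure(true_weights, sigma, rng))[0] for _ in range(5000)
]
variance_first = statistics.pvariance(first_estimates)
print(variance_first)
```

The simulated variance can be compared with the value sigma squared divided by
8 quoted in the text; weighing the first object alone would give a variance
close to sigma squared.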
1. Selecting a problem: For designing an experiment, one must select a
problem and state it clearly. This will direct the design as well as the
outcomes of the experiment. Questions such as Who, What, When, Why and
How need to be addressed. Let us take the case of automobile accidents
and design an experiment for this. We then collect data for this
experiment. Depending on how the problem is stated, the aim of the
experiment may be different: it may lead either to the design of a road
surface for existing automobiles or to a brand new automobile. To make
research more precise and cover greater depth, proven models should
underlie the design of the experiment.
2. Determining dependent variables: Dependent variables are the variables
measured in the experiment, and there may be several of them. They
should be split into system-level and individual-level variables. The
questions an experiment asks are posed only at the system level, and the
variables are chosen so that a conclusion can be drawn. Further, such
conclusions should be supported from as many different angles as
possible. Such operations are called converging operations. System-level
dependent variables tell, for example, how many experimenters there are
while a certain task is being done. Individual-level dependent variables
are measurements taken for a particular subject. Such measurements of
dependent variables are then reduced and analysed.
Dependent variables may consist of different kinds of measures, such as
performance measures and subjective measures. Performance measures tell
the time taken by the participant in completing the task and the number of
mistakes made during the task. Subjective measures tell about the method
used or not used by participants.
3. Determining independent variables: These are the variables that are
manipulated in the experiment. They are often related to people, typically
sex, age, level of education, general work experience or vision. To ensure
that the specifications are met, subjects are to be screened prior to
running the experiment.
4. Determining the number of levels of independent variables: This
determines the number of experimental conditions to be manipulated. If
an experiment is to be designed for assessing the relative performance of
a few automobiles, say 10, then the independent variable has 10 levels.
5. Determining the possible combinations: There is a need to establish
types of combinations in independent variables. Only then can an
experiment be taken as valid.
6. Determining the number of observations: Depending on the desired
analysis, certain factors are to be considered before deciding on the
number of observations. This includes the number of trials to be taken to
get familiarized with the experiment.
7. Redesign: This is necessary for obtaining an optimal design. Redesign
is essential when there are certain lacunae in the experiment design.
Inconsistencies are caused by inaccuracy in stating the problem,
selection of inadequate variables and non-availability of desired apparatus.
Recommended timeframe for redesign is:
Planning and scheduling 44 per cent
Testing 6-10 per cent
Reduction, analysis and writing 45-50 per cent
8. Randomization: A trial that is randomized and controlled is most reliable
and impartial. Randomization is a process of assigning participants not by
choice, but by chance, either to the group receiving the intervention under
investigation or to the control group. This ensures that the trial is not
steered towards preferred results.
9. Data collection: Data collection must ensure that these experiments are
supported by factual data. This lies in collecting raw data and adhering
to the experimental conditions. The data here may be very large.
10. Data reduction: For data reduction, raw data are broken into manageable
chunks for further utilization. Some of the data may not be found pertinent
and thus need to be excluded from the analysis.
11. Data verification: This is essential and is mostly carried out by plotting
the reduced data, which gives a visual picture of how the data are located.
Points that lie far from the rest may indicate erroneous data collection.
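The randomization step described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the participant
identifiers are hypothetical, the group labels treatment and control are
chosen for the example, and an equal split between the two groups is assumed.

```python
import random


def randomize(participants, seed=None):
    """Assign participants to two groups by chance, not by choice."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


participants = [f"P{i:02d}" for i in range(1, 11)]  # hypothetical identifiers
groups = randomize(participants, seed=42)
print(groups["treatment"])
print(groups["control"])
```

Because the shuffle decides the assignment, neither the experimenter nor the
participants can influence who ends up in which group.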

True experimental design needs an environment that is created to control
spurious data that may mislead the experimental conclusion. A purchase
laboratory is the approach most suited for this. Researchers modify one
variable at a time to determine the effect on sales volume. Virtual purchase
labs, which are Internet-based labs, are becoming popular.

Self-Assessment Questions
7. Fill in the blanks with the appropriate terms.
(a) Experiments are made by _______________ to understand the
cause and effect relationships.
(b) Analysis of the experimental design has the foundation of
___________ analysis.
8. State whether true or false.
(a) In observational studies, there is no control on condition.
(b) In decision-making one has to choose worse alternatives.

2.6 Summary
Let us recapitulate the important concepts discussed in this unit:
Observation may be defined as recording behavioural patterns without
verbal communication.
The questionnaire method of data collection can be used either by mailing
the questionnaires or by sending them through enumerators.
The questionnaire is the only medium of communication between the
investigator and the respondents, so it must be designed or drafted with
utmost care and caution so that all the relevant and essential information
for the enquiry may be collected without any difficulty, ambiguity or
vagueness.
Instead of directly approaching the informants, the investigator can
interview several third persons who are directly or indirectly concerned
with the subject matter of the enquiry and who are in possession of the
requisite information using indirect personal interview method.
The investigator, instead of presenting himself before the informants,
contacts them on telephone and collects information from them.

Experiments are resorted to when it is necessary to collect factual data
and nothing is available for reference. An experiment may also be conducted
to verify a theory. It is a study conducted under controlled conditions.

2.7 Glossary
Direct personal observation: In this, the investigator himself is present
before the informant and obtains first hand information.
Mailed questionnaire method: In this, the investigator prepares a
questionnaire containing a number of questions pertaining to the field of
enquiry.
Questionnaire sent through enumerators: In this, the investigator
appoints agents known as enumerators, who go to the respondents
personally with the questionnaire and record the respondents' replies.
Indirect personal interviews: In this, the investigator interviews several
third persons who are directly or indirectly concerned with the subject
matter of the enquiry and who are in possession of the requisite
information.
Telephone survey: In this, the investigator contacts the informants on
telephone and collects the information.
Information received through local agents: In this, the information is
not collected formally by the investigator, but by local agents commonly
known as correspondents.

2.8 Terminal Questions


1. What is observation? Why is it important for data collection?
2. Discuss the features of indirect personal interview.
3. Discuss the merits and demerits of both types of questionnaires.
4. What points must be considered while drafting a questionnaire?
5. How is information received through local agents? What are its merits
and demerits?
6. What is the experiment method? What role does it play in data collection?

2.9 Answers
Answers to Self-Assessment Questions
1. (a) Behavioural; (b) Enquiry
2. (a) False; (b) True
3. (a) True; (b) True
4. (a) Footnote; (b) Logically
5. (a) Witnesses; (b) Transmit
6. (a) Representative; (b) Interview
7. (a) Researchers; (b) Variance
8. (a) True; (b) False

Answers to Terminal Questions


1. Refer Section 2.2
2. Refer Section 2.3.1 and 2.3.2
3. Refer Section 2.3.3
4. Refer Section 2.4
5. Refer Section 2.4.1
6. Refer Section 2.5

2.10 Further Reading


1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2002.
2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand
& Sons, 2007.


Unit 3

Data Analysis Techniques

Structure
3.1 Introduction
    Objectives
3.2 Percentages, Ratios and Averages
3.3 Mean, Mode and Median
3.4 Quartiles
3.5 Range
3.6 Standard Deviation
3.7 Summary
3.8 Glossary
3.9 Terminal Questions
3.10 Answers
3.11 Further Reading

3.1 Introduction
In the previous unit, you learnt about various data collection methods. The
collected data is analysed to get useful information. In this unit, you will learn
about the various techniques of data analysis. A percentage is the result
obtained by multiplying a ratio by 100. If 50% of the students in a class are
girls, it means that out of every 100 students, 50 are girls. A ratio is a
comparison between two values. It shows the number of times one value is
contained in or contains the other. For example, if the ratio of girls to boys
in a class is 1:2, it means that the number of boys is two times the number of
girls. An average is a measure of the central value of the data set. A measure
of central tendency is a single value that attempts to describe a set of data
by identifying the central position within that set of data. The three common
measures of central tendency, mean, median and mode, are explained in this
unit. Dispersion tells us about the spread of data. The commonly used measures
of dispersion are quartile deviation, range and standard deviation.

Objectives
After studying this unit, you should be able to:
Evaluate percentages, ratios and averages
Calculate arithmetic mean, median and mode

Evaluate and represent data using quartiles, deciles and percentiles
Calculate range and standard deviation

3.2 Percentages, Ratios and Averages


Cent is a French word for hundred. Per cent means 'for every hundred', and it
is a powerful tool for comparison of numerical and statistical data.
Percentage is used in business and economic fields for making comparisons of
profit, growth rate, magnitude, performance, etc. The concept of percentage
applies mainly to ratios. A ratio, when multiplied by 100, becomes a percentage.
An average is the measure of central tendency of a set of numbers. We
often come across problems of finding an average value for a set of
numbers. For example, a student has secured 60% in mathematics, 70% in physics
and 80% in chemistry. To find the average, we calculate (60 +
70 + 80)/3 = 70%. Average is also known as arithmetic mean. The general formula
for finding the average of n numbers x1, x2, x3, ..., xn is
An = (x1 + x2 + x3 + ... + xn)/n.

3.2.1 Percentage
Mathematically, percentage value is calculated for ratios that have a denominator.
A denominator is the base value of a percentage. If there is a ratio 3 to 10
(3/10), this literally means 3 in 10. To convert it into a percentage, we
multiply it by 100 (hundred), and it is then expressed as 30% (or 30 per cent).
When a value of measured quantity is subject to some change, this can
be recorded as:
(i) Absolute value change
(ii) Percentage change
These two changes are related to each other.
(i) Absolute value change: This is defined as the actual change in the
quantity. For example, if there is a sales figure of 220 crores in the year
2000 and 250 crores in the year 2001, the absolute value change is 30
crores.
(ii) Percentage change: Here, change is expressed as a ratio to the original
value and then multiplied by 100 (hundred). In the example cited above,
Percentage change = (Absolute value change/Original quantity) × 100 =
(30/220) × 100 = 13.64%. Percentage change is always taken with
reference to its original value, unless otherwise stated. The changes
expressed as percentage present a better picture of the change.
Percentage point change and percentage change: Percentage point
change only notes the difference between two percentages, whereas percentage
change notes the change with reference to the original value. This is
explained with the help of example 3.1.
Example 3.1: Savings expressed as percentage of Gross Domestic Product
(GDP) was 20% in 2000 and 25% in 2002. What is the percentage point change
and percentage change in this period?
Solution: Percentage point change in savings rate = 25 - 20 = 5 percentage
points (five percentage points).
Percentage change of savings rate = (25 - 20)/20 × 100 = 25% (twenty
five per cent).
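The three kinds of change discussed in this section can be computed as
follows. This is a minimal Python sketch; the function names are illustrative
and not taken from the text, and the figures are those of the sales example
and Example 3.1.

```python
def absolute_change(old, new):
    """Actual change in the quantity."""
    return new - old


def percentage_change(old, new):
    """Change relative to the original value, multiplied by 100."""
    return (new - old) / old * 100


def percentage_point_change(old_pct, new_pct):
    """Difference between two percentages, in percentage points."""
    return new_pct - old_pct


# Sales of 220 crores in 2000 and 250 crores in 2001.
print(absolute_change(220, 250))               # 30
print(round(percentage_change(220, 250), 2))   # 13.64

# Savings rate of 20% in 2000 and 25% in 2002 (Example 3.1).
print(percentage_point_change(20, 25))         # 5
print(percentage_change(20, 25))               # 25.0
```

The last two lines show why the distinction matters: the same movement from
20% to 25% is a change of 5 percentage points but a percentage change of 25%.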
Numerator and Denominator
The numerator has a direct relationship with ratio or percentage. When numerator
increases, the ratio also increases, if denominator remains constant. The
denominator has an inverse relationship with ratio or percentage. When the
denominator increases, ratio or percentage decreases and when the denominator
decreases, ratio or percentage increases.
If changes take place both in the numerator as well as in the denominator,
first solve for change in any one of them keeping the other one constant and
then in the new value of the percentage, use the change in the other.
Example 3.2 will illustrate this concept.
Example 3.2: Petrol prices increase by 20%. Ramesh has decided to reduce
his consumption so that he does not incur additional expenditure. By what
percentage should he reduce the petrol consumption?
Solution: Let us assume that Ramesh consumes 100 litres of petrol. Let the
price of petrol be Rs x per litre. He was paying Rs 100x. After the increase
the price is Rs 1.2x per litre, so for 100 litres he would now have to pay
1.2x × 100 = Rs 120x.
Suppose he reduces consumption to y litres so that his expenditure stays at
Rs 100x. Then 1.2x × y = 100x, so y = 100/1.2 = 83.33.
Percentage reduction in consumption = 100 - y = 100 - 83.33 = 16.67%.
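The reasoning of Example 3.2 holds for any price increase. The sketch below is
an illustrative generalization, not part of the original solution: the
function name is hypothetical, and it assumes that expenditure equals price
multiplied by quantity and must stay constant.

```python
def reduction_to_hold_spending(price_increase_pct):
    """Percentage cut in consumption that keeps expenditure unchanged.

    If the price is multiplied by a factor f, the quantity must be
    multiplied by 1/f, so the reduction is (1 - 1/f) * 100.
    """
    factor = 1 + price_increase_pct / 100
    return (1 - 1 / factor) * 100


print(round(reduction_to_hold_spending(20), 2))  # 16.67, as in Example 3.2
```

Note that the required reduction (16.67%) is smaller than the price increase
(20%), because it is measured against the original consumption.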

3.2.2 Ratio
When a comparison is carried out between two numbers, it is useful to know
how many times one number is greater or smaller than the other. Thus, we are
often required to express one number as a fraction of the other. The ratio of
a number a to a number b is defined as the quotient a/b.
The numbers that form the ratio are known as terms of the ratio. Numerator
of the ratio is known as antecedent and the denominator is known as consequent.
A ratio of homogeneous quantities has no unit; it is just a number. In the
case of heterogeneous quantities, the unit depends on the units of the
numerator and denominator. For example, specific gravity, which is the ratio
of two densities, is unitless. Current, in electricity, is the ratio of the
charge that flows to the time taken, so current is measured in coulombs per
second. This unit has a special name, the ampere.
Ratios can be expressed as percentages by multiplying them by 100. For
example, the ratio 3/5 = 0.6 can be expressed as 0.6 × 100 = 60%.
Properties of Ratio
(i) If numerator and denominator are multiplied by the same number, ratio
remains unchanged. This means a/b = ma/mb.
(ii) If numerator and denominator are divided by the same number, ratio
remains unchanged. This means a/b = (a/m)/(b/m).
(iii) To compare magnitudes of two ratios, their denominator should be equated
and values of numerator will then decide which one is greater. If we
compare values of 8/3 and 11/4, we have to make a common denominator.
We multiply 8/3 by 4 in numerator as well as denominator and get 32/12.
We then multiply 11/4 by 3 in both, numerator and denominator and get
33/12. Thus, we find that 11/4 > 8/3.
(iv) Ratio of two fractions can be expressed as ratio of two integers. Thus,
a/b : c/d = ad/bc.
(v) If either of the terms of a ratio is a surd, then this ratio will never be an
integer unless both the terms are equal or numerator is an integral multiple
of the denominator. Thus, the ratio of sqrt(3)/sqrt(2) will never be an integer.
(vi) When two ratios are multiplied, their numerators and denominators are
also multiplied. For example, a/b × c/d = ac/bd.
(vii) When ratio a/b is compounded with itself, the resulting ratio, a^2/b^2,
is known as the duplicate ratio; a^3/b^3 is the triplicate ratio and
a^0.5/b^0.5 is the sub-duplicate ratio of a/b.
(viii) If a/b = c/d = e/f = g/h = k, then, (a+c+e+g)/(b+d+f+h) = k.


(ix) If a1/b1, a2/b2, a3/b3, ..., an/bn are unequal fractions with positive
denominators, then the ratio (a1 + a2 + a3 + ... + an)/(b1 + b2 + b3 + ... + bn)
lies between the lowest and the highest of these fractions.
(x) If there are two equations containing three unknowns, a1x + b1y + c1z
= 0 and a2x + b2y + c2z = 0, then the values of x, y and z cannot be
determined unless we get a third equation, but the proportion x : y : z
can be found.
(xi) If a/b > 1 and k is a positive number smaller than b, then (a + k)/(b +
k) < a/b and (a - k)/(b - k) > a/b. Similarly, if a/b < 1 and k is a
positive number smaller than b, then (a + k)/(b + k) > a/b and
(a - k)/(b - k) < a/b. Here a and b are taken to be positive.
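Several of the properties above can be verified with exact fractions. The
following Python sketch uses the standard fractions module; the helper name
combined_ratio and the particular numbers, other than 8/3 and 11/4, are
illustrative assumptions.

```python
from fractions import Fraction


def combined_ratio(pairs):
    """(a1 + a2 + ... + an)/(b1 + b2 + ... + bn) for a list of (a, b) pairs."""
    return Fraction(sum(a for a, _ in pairs), sum(b for _, b in pairs))


# Property (iii): compare 8/3 and 11/4 over the common denominator 12.
print(Fraction(8, 3) < Fraction(11, 4))  # True, since 32/12 < 33/12

# Property (viii): equal ratios give the same combined ratio.
equal = [(1, 2), (2, 4), (3, 6), (4, 8)]
print(combined_ratio(equal))  # 1/2

# Property (ix): for unequal ratios the combined ratio lies in between.
unequal = [(1, 2), (2, 3), (3, 4)]
values = [Fraction(a, b) for a, b in unequal]
middle = combined_ratio(unequal)
print(min(values) < middle < max(values))  # True

# Property (xi): adding or subtracting k from both terms.
a, b, k = 5, 3, 1  # a/b > 1
print(Fraction(a + k, b + k) < Fraction(a, b) < Fraction(a - k, b - k))  # True
a, b, k = 3, 5, 1  # a/b < 1
print(Fraction(a - k, b - k) < Fraction(a, b) < Fraction(a + k, b + k))  # True
```

Exact fractions are used instead of decimals so that the comparisons are not
affected by rounding.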

3.2.3 Averages
An average is the measure of central tendency of a set of numbers. The general
formula for finding the average of n numbers x1, x2, x3, ..., xn is
An = (x1 + x2 + x3 + ... + xn)/n. There is another type of average, known as
weighted average.
When there are two or more groups with known averages, then the
combined average is found by weighted average. If we have r groups having
averages A1, A2, A3, ..., Ar and numbers of elements n1, n2, n3, ..., nr, then
the weighted average is given as:
Aw = (n1A1 + n2A2 + n3A3 + ... + nrAr)/(n1 + n2 + n3 + ... + nr)
An average is also known as an arithmetic mean.
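The two formulas can be written as short Python functions. This is a minimal
sketch: the function names are illustrative, the marks 60, 70 and 80 come from
the earlier example, and the two groups used for the weighted average are
hypothetical.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their number."""
    return sum(values) / len(values)


def weighted_average(averages, sizes):
    """Combined average of groups with known averages and group sizes."""
    total = sum(n * a for a, n in zip(averages, sizes))
    return total / sum(sizes)


print(mean([60, 70, 80]))  # 70.0

# Hypothetical groups: 20 students averaging 60 and 30 students averaging 75.
print(weighted_average([60, 75], [20, 30]))  # 69.0
```

The combined average (69) is closer to 75 than to 60 because the second group
is larger.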
Example 3.3: A man travels from point A to point B at 60 kmph and returns at
100 kmph. Find the average speed.
Solution: Average speed = Total distance/Total time taken.
Let the distance between A and B be d. Time taken for going from A to B is
d/60 and for returning to A is d/100.
Total time is d/60 + d/100 = 16d/600.
Total distance = 2d.
Hence, average speed = 2d/[d/60 + d/100] = 2d × 600/(16d) = 75 kmph.
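The result of Example 3.3 does not depend on the distance d. A short Python
sketch, with an illustrative function name and the assumption that every leg
of the journey covers the same distance, makes this explicit:

```python
def average_speed(speeds):
    """Average speed over legs of equal distance.

    Total distance divided by total time reduces to the number of legs
    divided by the sum of the reciprocals of the speeds.
    """
    return len(speeds) / sum(1 / s for s in speeds)


print(round(average_speed([60, 100]), 2))  # 75.0, as in Example 3.3
```

The answer, 75 kmph, is lower than the simple average of 60 and 100 (80 kmph)
because more time is spent travelling at the slower speed.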
Example 3.4: The average marks of 20 students in an examination was reduced
by 2 when the topper of the class, who secured 90 marks, was replaced by a new
student. What was the score of this new student?
Solution: Let the average marks before the replacement, when the topper is
included, be x. There are 20 students, so the total marks are 20x. The new
average is x - 2 and hence the total marks are 20(x - 2) = 20x - 40. Thus,
there is a reduction of 40 marks, and this must be because the new student got
40 marks less than the student he replaced. So, he got only 90 - 40 = 50 marks.
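The argument of Example 3.4 can be checked with a one-line calculation. The
function below is an illustrative sketch with a hypothetical name; it assumes
that only one member of the group is replaced.

```python
def replacement_score(n_students, old_score, change_in_average):
    """Score of the newcomer when one member is replaced.

    The total changes by n_students * change_in_average, and the whole
    change is due to the difference between the two scores.
    """
    return old_score + n_students * change_in_average


print(replacement_score(20, 90, -2))  # 50, as in Example 3.4
```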
Activity 1
An investor buys Rs 1200 worth of shares in a company each month. During
the first five months, he bought the shares at a price of Rs 10, Rs 12, Rs
15, Rs 20 and Rs 24 per share. After 5 months what is the average price
paid for the shares by him?

Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Percentage value is calculated for ratios that have a
___________________.
(b) Numerator of the ratio is known as _________________ and the
denominator is known as consequent.
2. State whether true or false.
(a) An average is the measure of central tendency of a set of numbers.
(b) The numerator does not have a direct relationship with ratio or
percentage.

3.3 Mean, Mode and Median


3.3.1 Arithmetic Mean
There are several commonly used measures such as arithmetic mean, mode
and median. These values are very useful not only in presenting the overall
picture of the entire data but also for the purpose of making comparisons among
two or more sets of data.
As an example, questions like "How hot is the month of June in Delhi?"
can be answered generally by a single figure of the average for that month.
Similarly, suppose we want to find out if boys and girls of age 10 years differ in
height for the purpose of making comparisons. Then, by taking the average
height of boys of that age and the average height of girls of the same age, we
can compare and record the differences.
While arithmetic mean is the most commonly used measure of central
tendency, mode and median are more suitable measures under certain sets of
conditions and for certain types of data. However, each measure of central
tendency should meet the following requisites.
1. It should be easy to calculate and understand.
2. It should be rigidly defined. It should have only one interpretation so that
the personal prejudice or the bias of the investigator does not affect its
usefulness.
3. It should be representative of the data. If it is calculated from a sample,
the sample should be random enough to be accurately representing the
population.
4. It should have a sampling stability. It should not be affected by sampling
fluctuations. This means that if we pick ten different groups of college
students at random and compute the average of each group, then we
should expect to get approximately the same value from each of these
groups.
5. It should not be affected much by extreme values. If a few very small or
very large items are present in the data, they will unduly influence the
value of the average by shifting it to one side or other, so that the average
would not be really typical of the entire series. Hence, the average chosen
should be such that it is not unduly affected by such extreme values.
Arithmetic mean is also commonly known as the mean. Even though
average, in general, means measure of central tendency, when we use the
word average in our daily routine, we always mean the arithmetic average. The
term is widely used by almost everyone in daily communication. We speak of
an individual being an average student or of average intelligence. We always
talk about average family size or average family income or grade point average
(GPA) for students, and so on.
For discussion purposes, let us assume a variable X which stands for
some scores such as the ages of students. Let the ages of 5 students be 19,
20, 22, 22 and 17 years. Then variable X would represent these ages as follows:
X: 19, 20, 22, 22, 17
Placing the Greek symbol Σ (Sigma) before X would indicate a command
that all values of X are to be added together. Thus:
ΣX = 19 + 20 + 22 + 22 + 17

The mean is computed by adding all the data values and dividing the sum by
the number of such values. The symbol used for the sample average is X̄, so that:

\[ \bar{X} = \frac{19 + 20 + 22 + 22 + 17}{5} = \frac{100}{5} = 20 \]

In general, if there are n values in the sample, then

\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} \]

In other words,

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}, \qquad i = 1, 2, \ldots, n \]

According to this formula, the mean can be obtained by adding up all
values of Xi, where the value of i starts at 1 and ends at n with unit increments
so that i = 1, 2, 3, ... n, and dividing the sum by n.
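The sample mean formula can be sketched in a few lines of Python. This is an illustrative sketch only; the function name sample_mean is an assumption and does not come from the text. The data are the five ages used above.

```python
def sample_mean(values):
    """Arithmetic mean: the sum of all values divided by their number n."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


ages = [19, 20, 22, 22, 17]  # ages of the 5 students used in the text
print(sample_mean(ages))     # 100 / 5
```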
If instead of taking a sample, we take the entire population in our
calculations of the mean, then the symbol for the mean of the population is
μ (mu) and the size of the population is N, so that:

\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N}, \qquad i = 1, 2, \ldots, N \]

In real-life cases, a population is usually very large and hence the
population mean is considered an unknown constant. The value of N is also
very large: it may run into the thousands or millions, or may even be infinite.
The sample mean is thus used as an estimator of the population mean.
If we have the data in grouped discrete form with frequencies, then the
sample mean is given by:

\[ \bar{X} = \frac{\sum f(X)}{\sum f} \]

Here, Σf = Summation of all frequencies = n
Σf(X) = Summation of each value of X multiplied by its
corresponding frequency (f)
Example 3.5: Let us take the ages of 10 students as follows:
19, 20, 22, 22, 17, 22, 20, 23, 17, 18
Solution: This data can be arranged in a frequency distribution as follows:

(X)       (f)       f(X)
17        2         34
18        1         18
19        1         19
20        2         40
22        3         66
23        1         23
Total     10        200

In this case, we have Σf = 10 and Σf(X) = 200, so that:

\[ \bar{X} = \frac{\sum f(X)}{\sum f} = \frac{200}{10} = 20 \]
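The frequency-table method of Example 3.5 can be reproduced with a short Python sketch. It is illustrative only: the use of collections.Counter and the variable names are assumptions, not part of the original text.

```python
from collections import Counter

ages = [19, 20, 22, 22, 17, 22, 20, 23, 17, 18]

# Build the frequency distribution: value X -> frequency f.
frequency = Counter(ages)

sum_f = sum(frequency.values())                      # sum of f
sum_fx = sum(x * f for x, f in frequency.items())    # sum of f(X)
grouped_mean = sum_fx / sum_f

for x in sorted(frequency):
    print(x, frequency[x], x * frequency[x])
print("sum f =", sum_f, "sum fX =", sum_fx, "mean =", grouped_mean)
```

The grouped calculation gives the same mean as adding the ten ages directly, since multiplying each value by its frequency is only a shortcut for repeated addition.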
Example 3.6: Calculate the mean of the marks of 46 students given in the
following table.
Frequency of Marks of 46 Students

Marks (X)     Frequency (f)
9             1
10            2
11            3
12            6
13            10
14            11
15            7
16            3
17            2
18            1
Total         46

Solution: This is a discrete frequency distribution, and the mean is calculated
using the equation X̄ = Σf(X)/Σf. The following table shows the method of
obtaining Σf(X).

Marks (X)     Frequency (f)     f(X)
9             1                 9
10            2                 20
11            3                 33
12            6                 72
13            10                130
14            11                154
15            7                 105
16            3                 48
17            2                 34
18            1                 18
              Σf = 46           Σf(X) = 623

\[ \bar{X} = \frac{\sum f(X)}{\sum f} = \frac{623}{46} = 13.54 \]

Example 3.7: The mean age of a group of 100 persons (grouped in intervals
10, 12,..., etc.) was found to be 32.02. Later, it was discovered that age 57
was misread as 27. Find the corrected mean.
Solution: Let the mean be denoted by X̄. So, putting the given values in the
formula of arithmetic mean, we have,

\[ 32.02 = \frac{\sum X}{100}, \quad \text{i.e.,} \quad \sum X = 3202 \]

\[ \text{Correct } \sum X = 3202 - 27 + 57 = 3232 \]

\[ \text{Correct AM} = \frac{3232}{100} = 32.32 \]
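The correction procedure of Example 3.7 (recover the total, swap the misread value, divide again) can be sketched as follows. The function name corrected_mean is a hypothetical helper introduced only for illustration.

```python
def corrected_mean(reported_mean, n, misread_value, true_value):
    """Correct a mean after one observation was recorded wrongly.

    Step 1: recover the reported total (mean * n).
    Step 2: remove the misread value and add the true one.
    Step 3: divide the corrected total by n.
    """
    reported_total = reported_mean * n
    corrected_total = reported_total - misread_value + true_value
    return corrected_total / n


# 100 persons, reported mean 32.02, age 57 misread as 27.
print(round(corrected_mean(32.02, 100, 27, 57), 2))
```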

Example 3.8: The mean monthly salary paid to all employees in a company is
Rs 500. The monthly salaries paid to male and female employees average Rs
520 and Rs 420, respectively. Determine the percentage of males and females
employed by the company.
Solution: Let N1 be the number of males and N2 be the number of females
employed by the company. Also, let x̄1 and x̄2 be the monthly average salaries
paid to male and female employees and x̄ be the mean monthly salary paid to
all the employees.

\[ \bar{x} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2} \]

or

\[ 500 = \frac{520 N_1 + 420 N_2}{N_1 + N_2} \]

or

\[ 20 N_1 = 80 N_2 \]

or

\[ \frac{N_1}{N_2} = \frac{80}{20} = \frac{4}{1} \]

Hence, the males and females are in the ratio of 4 : 1 or 80 per cent are
males and 20 per cent are females in those employed by the company.
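The same result can be obtained by solving the combined-mean equation for the share of the first group. The sketch below is illustrative only; the function name share_of_first_group is an assumption. If p is the proportion of males, then overall mean = p × (male mean) + (1 - p) × (female mean), which gives p = (overall mean - female mean)/(male mean - female mean).

```python
def share_of_first_group(overall_mean, mean_first, mean_second):
    """Proportion of the first group implied by a combined mean.

    Solves overall = p * mean_first + (1 - p) * mean_second for p.
    """
    if mean_first == mean_second:
        raise ValueError("group means must differ to identify the shares")
    return (overall_mean - mean_second) / (mean_first - mean_second)


male_share = share_of_first_group(500, 520, 420)
print(f"males: {male_share:.0%}, females: {1 - male_share:.0%}")
```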
The Weighted Arithmetic Mean
In the computation of arithmetic mean we had given equal importance to each
observation in the series. This equal importance may be misleading if the
individual values constituting the series have different importance as in the
following example:
The Raja Toy shop sells:

Toy Cars at              Rs 3 each
Toy Locomotives at       Rs 5 each
Toy Aeroplanes at        Rs 7 each
Toy Double Deckers at    Rs 9 each

What shall be the average price of the toys sold, if the shop sells 4 toys,
one of each kind?

\[ \text{Mean price, i.e., } \bar{x} = \frac{\sum x}{4} = \text{Rs } \frac{24}{4} = \text{Rs } 6 \]

In this case, the importance of each observation (price quotation) is equal
inasmuch as one toy of each variety has been sold. In the above computation
of the arithmetic mean, this fact has been taken care of by including the price
of each toy only once.
But if the shop sells 100 toys: 50 cars, 25 locomotives, 15 aeroplanes and
10 double deckers, the importance of the four price quotations to the dealer is
not equal as a source of earning revenue. In fact, their respective importance
is equal to the number of units of each toy sold, i.e.,
The importance of Toy Car            50
The importance of Locomotive         25
The importance of Aeroplane          15
The importance of Double Decker      10

It may be noted that 50, 25, 15, 10 are the quantities of the various classes of
toys sold. It is for these quantities that the term "weights" is used in statistical language.
Weight is represented by symbol w, and Σw represents the sum of weights.
While determining the average price of toys sold, these weights are of
great importance and are taken into account in the manner illustrated below:

\[ \bar{x} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4}{w_1 + w_2 + w_3 + w_4} = \frac{\sum wx}{\sum w} \]

where w1, w2, w3, w4 are the respective weights of x1, x2, x3, x4 which in
turn represent the price of four varieties of toys, viz., car, locomotive, aeroplane
and double decker, respectively.

\[ \bar{x} = \frac{(50 \times 3) + (25 \times 5) + (15 \times 7) + (10 \times 9)}{50 + 25 + 15 + 10} = \frac{150 + 125 + 105 + 90}{100} = \frac{470}{100} = \text{Rs } 4.70 \]

Table 3.1 summarizes the steps taken in the computation of the weighted
arithmetic mean.
Table 3.1 Weighted Arithmetic Mean of Toys Sold by the Raja Toy Shop

Toys             Price per Toy (Rs) x     Number Sold w     Price × Weight xw
Car              3                        50                150
Locomotive       5                        25                125
Aeroplane        7                        15                105
Double Decker    9                        10                90
                                          Σw = 100          Σxw = 470

Σw = 100; Σwx = 470

\[ \bar{x} = \frac{\sum wx}{\sum w} = \frac{470}{100} = 4.70 \]
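The weighted arithmetic mean can be sketched in Python as follows. This is an illustrative sketch under the definitions given above; the function name weighted_mean is an assumption. With equal weights it reduces to the simple mean of Rs 6, and with the quantities sold as weights it gives Rs 4.70.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w * x) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the sum of weights must not be zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


prices = [3, 5, 7, 9]            # car, locomotive, aeroplane, double decker
units_sold = [50, 25, 15, 10]    # weights w

print(weighted_mean(prices, [1, 1, 1, 1]))   # one toy of each kind
print(weighted_mean(prices, units_sold))     # 100 toys in total
```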

The weighted arithmetic mean is particularly useful where we have to
compute the mean of means. If we are given two arithmetic means, one for
each of two different series, in respect of the same variable, and are required to
find the arithmetic mean of the combined series, the weighted arithmetic mean
is the only suitable method of its determination.
Example 3.9: The arithmetic mean of daily wages of two manufacturing concerns
A Ltd. and B Ltd. is Rs 5 and Rs 7, respectively. Determine the average daily
wages of both concerns if the number of workers employed were 2,000 and
4,000 respectively.
Solution: (i) Multiply each average (viz. 5 and 7), by the number of workers in
the concern it represents.
(ii) Add up the two products obtained in (i) above.
(iii) Divide the total obtained in (ii) by the total number of workers.
Weighted Mean of Mean Wages of A Ltd. and B Ltd.

Manufacturing     Mean Wages     Workers          Mean Wages ×
Concern           (Rs) x         Employed w       Workers Employed wx
A Ltd.            5              2,000            10,000
B Ltd.            7              4,000            28,000
                                 Σw = 6,000       Σwx = 38,000

\[ \bar{x} = \frac{\sum wx}{\sum w} = \frac{38{,}000}{6{,}000} = \text{Rs } 6.33 \]

The above mentioned examples explain that Arithmetic Means and
Percentages are not original data. They are derived figures and their importance
is relative to the original data from which they are obtained. This relative
importance must be taken into account by weighting while averaging them
(means and percentage).
Advantages of Mean
1. Its concept is familiar to most people and is intuitively clear.
2. Every data set has a mean, which is unique and describes the entire data
to some degree. For example, when we say that the average salary of a
professor is Rs 25,000 per month, it gives us a reasonable idea about the
salaries of professors.
3. It is a measure that can be easily calculated.
4. It includes all values of the data set in its calculation.
5. Its value varies very little from sample to sample taken from the same
population.
6. It is useful for performing statistical procedures such as computing and
comparing the means of several data sets.

Disadvantages of Mean
1. It is affected by extreme values, and hence is not very reliable when
the data set has extreme values, especially when these extreme values
are on one side of the ordered data. Thus, the mean of such data is not
truly representative of the data. For example, the average age of three
persons of ages 4, 6 and 80 years is 30.
2. It is tedious to compute for a large data set as every point in the data set
is to be used in computations.
3. We are unable to compute the mean for a data set that has open-ended
classes either at the high or at the low end of the scale.
4. The mean cannot be calculated for qualitative characteristics such as
beauty or intelligence, unless these can be converted into quantitative
figures such as intelligence into IQs.

3.3.2 Median
The second measure of central tendency that has a wide usage in statistical
works is the median. Median is that value of a variable which divides the series
in such a manner that the number of items below it is equal to the number of
items above it. Half of the total number of observations lies below the median
and half above it. The median is thus a positional average.
The median of ungrouped data is found easily if the items are first arranged
in order of the magnitude. The median may then be located simply by counting,
and its value can be obtained by reading the value of the middle observations.
If we have five observations whose values are 8, 10, 1, 3 and 5, the values are
first arrayed: 1, 3, 5, 8 and 10. It is now apparent that the value of the median is
5, since two observations are below that value and two observations are above
it. When there is an even number of cases, there is no actual middle item and
the median is taken to be the average of the values of the items lying on either
side of (N + 1)/2, where N is the total number of items. Thus, if the values of six
items of a series are 1, 2, 3, 5, 8 and 10, then the median is the value of item
number (6 + 1)/2 = 3.5, which is approximated as the average of the third and
the fourth items, i.e., (3+5)/2 = 4.
Thus, the steps required for obtaining median are:
1. Arrange the data as an array of increasing magnitude.
2. Obtain the value of the (N + 1)/2th item.
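The two steps above can be sketched in Python. This is a minimal illustration; the function name median_of is an assumption. For an even number of items, the (N + 1)/2th position falls between two items, and the sketch averages them as described in the text.

```python
def median_of(values):
    """Median of ungrouped data.

    Step 1: arrange the data in increasing order.
    Step 2: take the (N + 1)/2th item; when N is even this position
    falls between two items, so their average is used.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_of([8, 10, 1, 3, 5]))      # odd number of items
print(median_of([1, 2, 3, 5, 8, 10]))   # even number of items
```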

Frequency is the number of times a given data occurs in a data set. A relative
frequency is the fraction of times a data occurs. Cumulative frequency is the
accumulation of previous relative frequencies. For example, the data below gives
the number of hours devoted by 20 students of a class to study at home:
5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3
The following table gives the frequency distribution, relative frequency
distribution and cumulative frequency distribution:

Hours     Number of Students     Relative          Cumulative
          (Frequency)            Frequency         Frequency
2         3                      3/20 = 0.15       0.15
3         5                      5/20 = 0.25       0.15 + 0.25 = 0.4
4         3                      3/20 = 0.15       0.4 + 0.15 = 0.55
5         6                      6/20 = 0.3        0.55 + 0.3 = 0.85
6         2                      2/20 = 0.1        0.85 + 0.1 = 0.95
7         1                      1/20 = 0.05       0.95 + 0.05 = 1
Total     20
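The three distributions can be built from the raw data with a short Python sketch. It is illustrative only; the function name frequency_table is an assumption. The cumulative column is computed from the running count divided by the total, which avoids accumulating rounding errors from repeated decimal addition.

```python
from collections import Counter

hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]


def frequency_table(data):
    """Return rows of (value, frequency, relative frequency, cumulative)."""
    counts = Counter(data)
    total = len(data)
    rows = []
    running_count = 0
    for value in sorted(counts):
        running_count += counts[value]
        rows.append((value,
                     counts[value],
                     counts[value] / total,
                     running_count / total))
    return rows


for value, freq, relative, cumulative in frequency_table(hours):
    print(f"{value:>5} {freq:>5} {relative:>6.2f} {cumulative:>6.2f}")
```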

Even in the case of grouped data, the procedure for obtaining median is
straightforward as long as the variable is discrete or non-continuous as is clear
from the following example.
Example 3.10: Obtain the median size of shoes sold from the following data.
Number of Shoes Sold by Size in One Year

Size      Number of Pairs     Cumulative Total
5         30                  30
5½        40                  70
6         50                  120
6½        150                 270
7         300                 570
7½        600                 1170
8         950                 2120
8½        820                 2940
9         750                 3690
9½        440                 4130
10        250                 4380
10½       150                 4530
11        40                  4570
11½       39                  4609
Total     4609

Solution: Median is the value of the (N + 1)/2th = (4609 + 1)/2th = 2305th item.
Since the items are already arranged in ascending order (size-wise), the size of
the 2305th item is easily determined by constructing the cumulative frequency.
The cumulative total first reaches 2305 at size 8½ (2120 pairs are of size 8 or
smaller and 2940 pairs are of size 8½ or smaller). Thus, the
median size of shoes sold is 8½, the size of the 2305th item.
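Locating the (N + 1)/2th item through the cumulative total can be sketched as follows. This is an illustrative sketch only: the function name discrete_grouped_median is an assumption, the half sizes are written as decimals (5.5 for 5½), and the sketch assumes the median position is a whole number, as it is here because N is odd.

```python
sizes = [5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5]
pairs_sold = [30, 40, 50, 150, 300, 600, 950, 820, 750, 440, 250, 150, 40, 39]


def discrete_grouped_median(values, frequencies):
    """Value of the (N + 1)/2th item in a discrete frequency distribution.

    Walks down the cumulative total until it first reaches the median
    position. Assumes the values are already in ascending order.
    """
    total = sum(frequencies)
    position = (total + 1) / 2
    cumulative = 0
    for value, frequency in zip(values, frequencies):
        cumulative += frequency
        if cumulative >= position:
            return value
    raise ValueError("frequencies must sum to a positive number")


print(discrete_grouped_median(sizes, pairs_sold))
```

For the shoe data the position is 2305, and the cumulative total passes it in the row for size 8½ (cumulative total 2940), since only 2120 pairs are of size 8 or smaller.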
In the case of grouped data with continuous variable, the determination
of median is a bit more involved. Consider the following table where the data
relating to the distribution of male workers by average monthly earnings is given.
Clearly the median of 6291 is the earnings of the (6291 + 1)/2 = 3146th worker
arranged in ascending order of earnings.
From the cumulative frequency, it is clear that this worker has his income
in the class interval 67.5-72.5. But, it is impossible to determine his exact income.
We therefore resort to approximation by assuming that the 795 workers of this
class are distributed uniformly across the interval 67.5 to 72.5. The median
worker is the (3146 - 2713) = 433rd of these 795, and hence, the value corresponding
to him can be approximated as,

\[ 67.5 + \frac{433}{795} \times (72.5 - 67.5) = 67.5 + 2.72 = 70.22 \]

Distribution of Male Workers by Average Monthly Earnings

Group No.     Monthly Earnings (Rs)     No. of Workers     Cumulative No. of Workers
1             27.5-32.5                 120                120
2             32.5-37.5                 152                272
3             37.5-42.5                 170                442
4             42.5-47.5                 214                656
5             47.5-52.5                 410                1066
6             52.5-57.5                 429                1495
7             57.5-62.5                 568                2063
8             62.5-67.5                 650                2713
9             67.5-72.5                 795                3508
10            72.5-77.5                 915                4423
11            77.5-82.5                 745                5168
12            82.5-87.5                 530                5698
13            87.5-92.5                 259                5957
14            92.5-97.5                 152                6109
15            97.5-102.5                107                6216
16            102.5-107.5               50                 6266
17            107.5-112.5               25                 6291
Total                                   6291

The value of the median can thus be put in the form of the formula,

\[ Me = l + \frac{\frac{N + 1}{2} - C}{f} \times i \]

where l is the lower limit of the median class, i its
width, f its frequency, C the cumulative frequency up to (but not including) the
median class, and N is the total number of cases.
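The median formula for a continuous frequency distribution can be sketched in Python. This is an illustration under stated assumptions: the function name grouped_median is hypothetical, all classes are assumed to have the same width i, and the observations within the median class are assumed to be uniformly spread, as in the text.

```python
def grouped_median(lower_limits, width, frequencies):
    """Median by interpolation: Me = l + (((N + 1)/2 - C) / f) * i."""
    total = sum(frequencies)            # N
    position = (total + 1) / 2          # rank of the median item
    cumulative = 0                      # C, cumulative frequency so far
    for lower, frequency in zip(lower_limits, frequencies):
        if cumulative + frequency >= position:
            return lower + (position - cumulative) / frequency * width
        cumulative += frequency
    raise ValueError("frequencies must sum to a positive number")


# Monthly earnings classes 27.5-32.5, 32.5-37.5, ..., 107.5-112.5.
lower_limits = [27.5 + 5 * k for k in range(17)]
workers = [120, 152, 170, 214, 410, 429, 568, 650, 795,
           915, 745, 530, 259, 152, 107, 50, 25]

print(round(grouped_median(lower_limits, 5, workers), 2))
```

For the earnings data this locates the median class 67.5-72.5 (C = 2713, f = 795) and gives a median of about 70.2.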
Finding Median by Graphical Analysis
The median can quite conveniently be determined by reference to the ogive
which plots the cumulative frequency against the variable. The value of the item
below which half the items lie, can easily be read from the ogive as is shown in
example 3.11.
Example 3.11: Obtain the median of data given in the following table.
Monthly Earnings     Frequency     Less Than     More Than
27.5                 __                          6291
32.5                 120           120           6171
37.5                 152           272           6019
42.5                 170           442           5849
47.5                 214           656           5635
52.5                 410           1066          5225
57.5                 429           1495          4796
62.5                 568           2063          4228
67.5                 650           2713          3578
72.5                 795           3508          2783
77.5                 915           4423          1868
82.5                 745           5168          1123
87.5                 530           5698          593
92.5                 259           5957          334
97.5                 152           6109          182
102.5                107           6216          75
107.5                50            6266          25
112.5                25            6291

Solution: It is clear that this is grouped data. The first class is 27.5-32.5, whose
frequency is 120, and the last class is 107.5-112.5, whose frequency is 25.
Figure 3.1 shows the ogive of less than cumulative frequency. The median is
the value below which N/2 = 6291/2 = 3145.5 items lie, which is read
off from Figure 3.2 as about 70. More accuracy than this is unobtainable because
of the space limitation on the earning scale.
[Figure: "less than" and "more than" cumulative frequency curves (ogives). Horizontal axis: Monthly Earnings in Rupees, 27.5 to 112.5; vertical axis: Number of Workers, up to 6291; the median is marked on the earnings axis.]

Figure 3.1 Median Determination by Plotting Less than and More than Cumulative
Frequency

The median can also be determined by plotting both less than and more
than cumulative frequency as shown in Figure 3.1. It should be obvious that the
two curves should intersect at the median of the data.

[Figure: cumulative frequency curve (ogive). Horizontal axis: Monthly Earnings in Rupees, 27.5 to 112.5; vertical axis: Number of Workers, up to 6000; the median is marked on the earnings axis.]

Figure 3.2 Median

Advantages of Median
1. Median is a positional average and hence the extreme values in the data
set do not affect it as much as they do to the mean.
2. Median is easy to understand and can be calculated from any kind of
data, even from grouped data with open-ended classes.
3. We can find the median even when our data set is qualitative and can be
arranged in the ascending or the descending order, such as average
beauty or average intelligence.
4. Similar to mean, median is also unique, meaning that there is only one
median in a given set of data.
5. Median can be located visually when the data is in the form of ordered
data.
6. The sum of absolute differences of all values in the data set from the
median value is minimum. This means that it is less than any other value
of central tendency in the data set, which makes it more central in certain
situations.

Disadvantages of Median
1. The data must be arranged in order to find the median. This can be very
time consuming for a large number of elements in the data set.
2. The value of the median is affected more by sampling variations. Different
samples from the same population may give significantly different values
of the median.
3. The calculation of median in case of grouped data is based on the
assumption that the values of observation are evenly spaced over the
entire class interval and this is usually not so.
4. Median is comparatively less stable than mean, particularly for small
samples, due to fluctuations in sampling.
5. Median is not suitable for further mathematical treatment. For example,
we cannot compute the median of the combined group from the median
values of different groups.

3.3.3 Mode
The mode is that value of the variable which occurs or repeats itself the greatest
number of times. The mode is the "most fashionable" size in the sense that it is
the most common and typical, and is defined by Zizek as "the value occurring
most frequently in a series (or group of items) and around which the other items
are distributed most densely."
The mode of a distribution is the value at the point around which the items
tend to be most heavily concentrated. It is the most frequent or the most common
value, provided that a sufficiently large number of items are available, to give a
smooth distribution. It will correspond to the value of the maximum point
(ordinate), of a frequency distribution if it is an ideal or smooth distribution. It
may be regarded as the most typical of a series of values. The modal wage, for
example, is the wage received by more individuals than any other wage. The
modal hat size is that, which is worn by more persons than any other single
size.
It may be noted that the occurrence of one or a few extremely high or low
values has no effect upon the mode. If a series of data is unclassified, that is, has
not been either arrayed or put into a frequency distribution, the mode cannot
be readily located.

Taking first an extremely simple example, if seven men are receiving daily
wages of Rs 5, 6, 7, 7, 7, 8 and 10, it is clear that the modal wage is Rs 7 per
day. If we have a series such as 2, 3, 5, 6, 7, 10 and 11, it is apparent that there
is no mode.
There are several methods of estimating the value of the mode. But, it is
seldom that the different methods of ascertaining the mode give us identical
results. Consequently, it becomes necessary to decide as to which method
would be most suitable for the purpose in hand. In order that a choice of the
method may be made, we should understand each of the methods and the
differences that exist among them.
The four important methods of estimating mode of a series are: (i) Locating
the most frequently repeated value in the array; (ii) Estimating the mode by
interpolation; (iii) Locating the mode by graphic method; and (iv) Estimating the
mode from the mean and the median. Only the last three methods are discussed
in this unit.
Estimating the Mode by Interpolation. In the case of continuous
frequency distributions, the problem of determining the value of the mode is not
so simple as it might have appeared from the foregoing description. Having
located the modal class of the data, the next problem in the case of continuous
series is to interpolate the value of the mode within this modal class.
The interpolation is made by the use of any one of the following formulae:

\[ \text{(i)} \quad Mo = l_1 + \frac{f_2}{f_0 + f_2} \times i \]

\[ \text{(ii)} \quad Mo = l_2 - \frac{f_0}{f_0 + f_2} \times i \]

\[ \text{(iii)} \quad Mo = l_1 + \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \times i \]

where l1 is the lower limit of the modal class, l2 is the upper limit of the
modal class, f0 equals the frequency of the preceding class in value, f1 equals
the frequency of the modal class in value, f2 equals the frequency of the following
class (class next to modal class) in value, and i equals the interval of the modal
class.

Example 3.12: Determine the mode for the data given in the following table.
Wage Group     Frequency (f)
14-18          6
18-22          18
22-26          19
26-30          12
30-34          5
34-38          4
38-42          3
42-46          2
46-50          1
50-54          0
54-58

Solution: In the given data, 22-26 is the modal class since it has the largest
frequency. The lower limit of the modal class is 22, its upper limit is 26, its
frequency is 19, the frequency of the preceding class is 18, and of the following
class is 12. The class interval is 4. Using the various methods of determining
mode, we have,

\[ \text{(i)} \quad Mo = 22 + \frac{12}{18 + 12} \times 4 = 22 + \frac{8}{5} = 23.6 \]

\[ \text{(ii)} \quad Mo = 26 - \frac{18}{18 + 12} \times 4 = 26 - \frac{12}{5} = 23.6 \]

\[ \text{(iii)} \quad Mo = 22 + \frac{19 - 18}{(19 - 18) + (19 - 12)} \times 4 = 22 + \frac{4}{8} = 22.5 \]
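The three interpolation formulae can be compared side by side with a short Python sketch. It is illustrative only; the function names are assumptions introduced here, and the arguments follow the symbols defined above (l1, l2, f0, f1, f2 and i).

```python
def mode_formula_i(l1, f0, f2, i):
    """Formula (i): start at the lower limit, pulled up by f2."""
    return l1 + f2 / (f0 + f2) * i


def mode_formula_ii(l2, f0, f2, i):
    """Formula (ii): start at the upper limit, pulled down by f0."""
    return l2 - f0 / (f0 + f2) * i


def mode_formula_iii(l1, f0, f1, f2, i):
    """Formula (iii): uses the differences (f1 - f0) and (f1 - f2)."""
    return l1 + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * i


# Example 3.12: modal class 22-26, f0 = 18, f1 = 19, f2 = 12, i = 4.
print(round(mode_formula_i(22, 18, 12, 4), 2))
print(round(mode_formula_ii(26, 18, 12, 4), 2))
print(round(mode_formula_iii(22, 18, 19, 12, 4), 2))
```

Formulae (i) and (ii) always agree with each other, because l2 = l1 + i and the two fractions add up to 1; formula (iii) generally gives a different estimate.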

In formulae (i) and (ii), the frequency of the classes adjoining the modal
class is used to pull the estimate of the mode away from the midpoint towards
either the upper or lower class limit. In this particular case, the frequency of the
class preceding the modal class is more than the frequency of the class following
and therefore, the estimated mode is less than the midvalue of the modal class.
This seems quite logical. If the frequencies are more on one side of the modal
class than on the other it can be reasonably concluded that the items in the
modal class are concentrated more towards the class limit of the adjoining class
with the larger frequency.
Formula (iii) is also based on a logic similar to that of (i) and (ii). In this
case, to interpolate the value of the mode within the modal class, the differences
between the frequency of the modal class, and the respective frequencies of
the classes adjoining it are used. This formula usually gives better results than
the other two formulae and results exactly equal to those obtained by the
graphic method. Formulae (i) and (ii) give values which are different from the
value obtained by formula (iii) and are closer to the central point of the modal
class. If the frequencies of the classes adjoining the modal class are equal, the mode is
expected to be located at the midvalue of the modal class, but if the frequency
on one of the sides is greater, the mode will be pulled away from the central
point. It will be pulled more and more if the difference between the frequencies
of the classes adjoining the modal class is higher and higher. In Example 3.12,
the frequency of the modal class is 19 and that of preceding class is 18. So, the
mode should be quite close to the lower limit of the modal class. The midpoint
of the modal class is 24 and lower limit of the modal class is 22.
Locating the Mode by the Graphic Method. The method of graphic
interpolation is illustrated in Figure 3.3. The upper corners of the rectangle over
the modal class have been joined by straight lines to those of the adjoining
rectangles as shown in the diagram; the right corner to the corresponding one
of the adjoining rectangle on the left, etc. If a perpendicular is drawn from the
point of intersection of these lines, we have a value for the mode indicated on
the base line. The graphic approach is, in principle, similar to the arithmetic
interpolation explained earlier.
The mode may also be determined graphically from an ogive or cumulative
frequency curve. It is found by drawing a perpendicular to the base from that
point on the curve where the curve is most nearly vertical, i.e., steepest (in
other words, where it passes through the greatest distance vertically and smallest
distance horizontal). The point where it cuts the base gives us the value of the
mode. How accurately this method determines the mode is governed by:
(i) The shape of the ogive, (ii) The scale on which the curve is drawn.
Estimating the Mode from the Mean and the Median. There usually
exists a relationship among the mean, median and mode for moderately
asymmetrical distributions. If the distribution is symmetrical, the mean, median
and mode will have identical values, but if the distribution is skewed (moderately)
the mean, median and mode will pull apart. If the distribution tails off towards
higher values, the mean and the median will be greater than the mode. If it tails
off towards lower values, the mode will be greater than either of the other two
measures. In either case, the median will be about one-third as far away from
the mean as the mode is. This means that,
Mode = Mean - 3 (Mean - Median)
= 3 Median - 2 Mean

Figure 3.3 Method of Mode Determination by Graphic Interpolation

In the case of the average monthly earnings, the mean is 68.53 and the
median is 70.2. If these values are substituted in the above formula, we get,

\[ \text{Mode} = 68.5 - 3(68.5 - 70.2) = 68.5 + 5.1 = 73.6 \]

According to the formula used earlier,

\[ \text{Mode} = l_1 + \frac{f_2}{f_0 + f_2} \times i = 72.5 + \frac{745}{795 + 745} \times 5 = 72.5 + 2.4 = 74.9 \]

or

\[ \text{Mode} = l_1 + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times i = 72.5 + \frac{915 - 795}{2 \times 915 - 795 - 745} \times 5 = 72.5 + \frac{120}{290} \times 5 = 74.57 \]

The difference between the two estimates is due to the fact that the
assumed relationship between the mean, median and mode does not always
hold, and it is evidently not valid in this case.
Example 3.13: (i) In a moderately asymmetrical distribution, the mode and mean
are 32.1 and 35.4 respectively. Calculate the median.
(ii) If the mode and median of a moderately asymmetrical series are
respectively 16'' and 15.7'', what would be its most probable mean?
(iii) In a moderately skewed distribution, the mean and the median are
respectively 25.6 and 26.1 inches. What is the mode of the distribution?
Solution: (i) We know,

\[ \text{Mean} - \text{Mode} = 3 (\text{Mean} - \text{Median}) \]

or

\[ 3\,\text{Median} = \text{Mode} + 2\,\text{Mean} \]

or

\[ \text{Median} = \frac{32.1 + 2 \times 35.4}{3} = \frac{102.9}{3} = 34.3 \]

(ii)

\[ 2\,\text{Mean} = 3\,\text{Median} - \text{Mode} \]

or

\[ \text{Mean} = \frac{1}{2}(3 \times 15.7 - 16.0) = \frac{31.1}{2} = 15.55 \]

(iii)

\[ \text{Mode} = 3\,\text{Median} - 2\,\text{Mean} = 3 \times 26.1 - 2 \times 25.6 = 78.3 - 51.2 = 27.1 \]
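The empirical relationship Mode = 3 Median - 2 Mean can be rearranged to give any one of the three measures from the other two. The sketch below is illustrative only; the function names are assumptions, and the relationship itself is only an approximation that holds for moderately asymmetrical distributions.

```python
def mode_from(mean, median):
    """Mode = 3 Median - 2 Mean."""
    return 3 * median - 2 * mean


def median_from(mean, mode):
    """Median = (Mode + 2 Mean) / 3."""
    return (mode + 2 * mean) / 3


def mean_from(median, mode):
    """Mean = (3 Median - Mode) / 2."""
    return (3 * median - mode) / 2


print(round(median_from(mean=35.4, mode=32.1), 2))   # part (i)
print(round(mean_from(median=15.7, mode=16.0), 2))   # part (ii)
print(round(mode_from(mean=25.6, median=26.1), 2))   # part (iii)
```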

Advantages of Mode
1. Similar to median, the mode is not affected by extreme values in the data.
2. Its value can be obtained in open-ended distributions without ascertaining
the class limits.
3. It can be easily used to describe qualitative phenomenon. For example, if
most people prefer a certain brand of tea, then this will become the modal
point.
4. Mode is easy to calculate and understand. In some cases, it can be located
simply by observation or inspection.
Disadvantages of Mode
1. Quite often, there is no modal value.
2. It can be bi-modal or multi-modal, or it can have all modal values making
its significance more difficult to measure.
3. If there is more than one modal value, the data is difficult to interpret.
4. A mode is not suitable for algebraic manipulations.
5. Since the mode is the value of maximum frequency in the data set, it
cannot be rigidly defined if such frequency occurs at the beginning or at
the end of the distribution.
6. It does not include all observations in the data set, and hence is less reliable
in most situations.
Activity 2
The following figures represent the number of books issued at the counter
of a commerce library in 11 different days. Calculate the median.
96, 180, 98, 75, 270, 20, 102, 100, 94, 75, 200.

Self-Assessment Questions
3. State whether true or false.
(a) The mean is computed by adding all the data values and dividing it
by the number of such values.
(b) The mode is that value of the variable which occurs or repeats itself
the greatest number of times.
4. Fill in the blanks with the appropriate terms.
(a) Weight is represented by symbol w, and Σw represents the
___________ of weights.
(b) Median is that ______________ of a variable which divides the series
in such a manner that the number of items below it is equal to the
number of items above it.

3.4 Quartiles
Some measures, other than measures of central tendency, are often employed
when summarizing or describing a set of data where it is necessary to divide
the data into equal parts. These are positional measures and are called quantiles
and consist of quartiles, deciles and percentiles. The quartiles divide the data
into four equal parts. The deciles divide the total ordered data into ten equal
parts and the percentiles divide the data into 100 equal parts. Consequently,
there are three quartiles, nine deciles and 99 percentiles. The quartiles are
denoted by the symbol Q, which is subscripted as Q1, Q2 and Q3.
Here, Q1 is the point in the ordered data which has 25 per cent of the data
below it and 75 per cent of the data above it. In other words,

$$Q_1 = \text{the value of the } \left(\frac{n+1}{4}\right)\text{th ordered observation in the data.}$$

Similarly, Q2 divides the data in the middle, and is also equal to the median.
Its value is given by:

$$Q_2 = \text{the value of the } 2\left(\frac{n+1}{4}\right)\text{th ordered observation in the data.}$$

Similarly, we can calculate the values of various deciles. For instance,

$$D_1 = \text{the value of the } \left(\frac{n+1}{10}\right)\text{th observation in the ordered data, and}$$

$$D_7 = \text{the value of the } 7\left(\frac{n+1}{10}\right)\text{th observation in the ordered data.}$$

Percentiles are generally used in the research area of education where
people are given standard tests and it is desirable to compare the relative position
of the subjects' performance on the test. Percentiles are similarly calculated as:

$$P_7 = \text{the value of the } 7\left(\frac{n+1}{100}\right)\text{th observation in the ordered data,}$$

and,

$$P_{69} = \text{the value of the } 69\left(\frac{n+1}{100}\right)\text{th observation in the ordered data.}$$
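The positional rule above can be sketched in Python. This is a minimal illustration, not part of the original text: the function name `positional_value` and the sample data are hypothetical, and the use of linear interpolation when k(n + 1)/parts is not a whole number is an assumption of this sketch.

```python
def positional_value(data, k, parts):
    """Value at the k(n+1)/parts-th ordered observation (1-based position).

    Fractional positions are resolved by linear interpolation between the
    two neighbouring observations (an assumption of this sketch).
    """
    ordered = sorted(data)
    n = len(ordered)
    position = k * (n + 1) / parts            # 1-based position in the ordered data
    position = min(max(position, 1), n)       # keep the position inside the data
    lower = int(position)                     # whole part of the position
    fraction = position - lower               # fractional part of the position
    if lower >= n:
        return float(ordered[-1])
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


values = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]   # illustrative data, n = 11
q1 = positional_value(values, 1, 4)     # (11 + 1)/4 = 3rd observation
q2 = positional_value(values, 2, 4)     # 2(11 + 1)/4 = 6th observation, the median
d7 = positional_value(values, 7, 10)    # 7(11 + 1)/10 = 8.4th observation
print(q1, q2, d7)
```

With whole-number positions (as for Q1 and Q2 here) no interpolation is needed and the function simply picks the corresponding ordered observation.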
Quartiles
The formula for calculating the values of quartiles for grouped data is given as
follows:
Q = L + (j/f)C
Where,
Q = The quartile under consideration.
L = Lower limit of the class interval which contains the value of Q.
j = The number of units we lack from the class interval which contains
the value of Q, in reaching the value of Q.
f = Frequency of the class interval containing Q.


C = Size of the class interval.
Let us assume, we took the data of the ages of 100 students and a frequency
distribution for this data has been constructed as shown.
The frequency distribution is as follows:

Ages (CI)          Mid-point (X)     (f)      fX         fX²
16 and upto 17         16.5            4       66      1089.0
17 and upto 18         17.5           14      245      4287.5
18 and upto 19         18.5           18      333      6160.5
19 and upto 20         19.5           28      546     10647.0
20 and upto 21         20.5           20      410      8405.0
21 and upto 22         21.5           12      258      5547.0
22 and upto 23         22.5            4       90      2025.0
Total                                100     1948     38161.0

In our case, in order to find Q1, where Q1 is the cut-off point so that 25 per
cent of the data is below this point and 75 per cent of the data is above, we see
that the first group has 4 students and the second group has 14 students, making
a total of 18 students. Since Q1 cuts off at 25 students, it is the third class
interval which contains Q1. This means that the value of L in our formula is 18.
Since we already have 18 students in the first two groups, we need 7
more students from the third group to make it a total of 25 students, which is the
position of Q1. Hence, the value of (j) is 7. Also, since the frequency of this third
class interval which contains Q1 is 18, the value of (f) in our formula is 18. The
size of the class interval C is given as 1. Substituting these values in the formula
for Q, we get,
Q1 = 18 + (7/18) × 1
= 18 + 0.39 = 18.39
This means that 25 per cent of the students are below 18.39 years of age
and 75 per cent are above this age.
Similarly, we can calculate the value of Q2, using the same formula. Hence,
Q2 = L + (j/f)C
= 19 + (14/28)1
= 19.5
This also happens to be the median.
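The grouped-data calculation can be checked with a short Python sketch. The function name `grouped_quantile` is hypothetical, and the code assumes equal class widths and no empty class at the cut-off point; it simply applies Q = L + (j/f)C to the age distribution given above.

```python
def grouped_quantile(lower_limits, freqs, width, fraction):
    """Apply Q = L + (j / f) * C for the point cutting off `fraction` of the data."""
    target = fraction * sum(freqs)       # e.g. 25 students for Q1 out of 100
    cumulative = 0                       # students in the classes before this one
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= target:     # this class contains the quantile
            j = target - cumulative      # units still needed from this class
            return lower + (j / f) * width
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")


lower_limits = [16, 17, 18, 19, 20, 21, 22]   # lower class limits of the age table
freqs = [4, 14, 18, 28, 20, 12, 4]            # class frequencies, total 100

q1 = grouped_quantile(lower_limits, freqs, 1, 0.25)
q2 = grouped_quantile(lower_limits, freqs, 1, 0.50)
q3 = grouped_quantile(lower_limits, freqs, 1, 0.75)
print(round(q1, 2), round(q2, 2), round(q3, 2))
```

The same function gives deciles and percentiles by changing `fraction`, for example 0.7 for the seventh decile.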

By using the same formula and the same logic we can calculate the values
of all deciles as well as percentiles.
We have defined the median as the value of the item which is located at
the centre of the array. We can define other measures which are located at
other specified points. Thus, the nth percentile of an array is the value of the
item such that n per cent of the items lie below it. Clearly then, the nth percentile Pn of
grouped data is given by,

$$P_n = l + \frac{\dfrac{nN}{100} - C}{f} \times i$$

Here, l is the lower limit of the class in which the (nN/100)th item lies, i its width,
f its frequency, C the cumulative frequency upto (but not including) this class,
and N is the total number of items.
We can similarly define the nth decile as the value of the item below
which (nN/10) items of the array lie. Clearly,

$$D_n = P_{10n} = l + \frac{\dfrac{nN}{10} - C}{f} \times i$$

where the symbols have the same meanings as before.

The other most commonly referred to measures of location are the
quartiles. Thus, the nth quartile is the value of the item which lies at the n(N/4)th
item. Clearly, Q2, the second quartile, is the median for grouped data.

$$Q_n = P_{25n} = l + \frac{\dfrac{nN}{4} - C}{f} \times i$$

Self-Assessment Questions
5. Fill in the blanks with the appropriate terms.
(a) The positional measures are called ______________ and consist of
quartiles, deciles and percentiles.
(b) The Nth percentile of an ____________ is the value of the item
such that N per cent items lie below it.
6. State whether true or false.
(a) The quartiles divide the data into eight equal parts.
(b) The deciles divide the total ordered data into ten equal parts.
3.5 Range
The crudest measure of dispersion is the range of the distribution. Range of
any series is the difference between the highest and the lowest values in the
series. If the marks received in an examination taken by 248 students are
arranged in the ascending order, then the range will be equal to the difference
between the highest and the lowest marks.
In a frequency distribution, the range is taken to be the difference between
the lower limit of the class at the lower extreme of the distribution and the upper
limit of the class at the upper extreme.
Table 3.2 Weekly Earnings of Labourers in Four Workshops of the Same Type

Weekly Earnings                     No. of Workers
(Rs)              Workshop A   Workshop B   Workshop C   Workshop D
15–16                ...          ...           2           ...
17–18                ...           2            4           ...
19–20                ...           4            4            4
21–22                10           10           10           14
23–24                22           14           16           16
25–26                20           18           14           16
27–28                14           16           12           12
29–30                14           10            6           12
31–32                ...           6            6            4
33–34                ...          ...           2            2
35–36                ...          ...          ...          ...
37–38                ...          ...           4           ...
Total                80           80           80           80
Mean                 25.5         25.5         25.5         25.5


Consider the data on weekly earnings of workers in four workshops given
in the table. We note the following:

Workshop      A      B      C      D
Range         9     15     23     15


From these figures, it is clear that the greater the range, the greater is the
variation of the values in the group.
The range is a measure of absolute dispersion and as such, cannot be


usefully employed for comparing the variability of two distributions expressed in
different units. The amount of dispersion measured, say, in pounds, is not
comparable with dispersion measured in inches. So, the need of measuring
relative dispersion arises.
An absolute measure can be converted into a relative measure if we divide
it by some other value regarded as standard for the purpose. We may use the
mean of the distribution or any other positional average as the standard.
For Table 3.2, the relative dispersion would be:

$$\text{Workshop A} = \frac{9}{25.5} \qquad \text{Workshop C} = \frac{23}{25.5}$$

$$\text{Workshop B} = \frac{15}{25.5} \qquad \text{Workshop D} = \frac{15}{25.5}$$


An alternate method of converting an absolute variation into a relative


one would be, to use the total of the extremes as the standard. This will be
equal to dividing the difference of the extreme items by the total of the extreme
items. Thus,
$$\text{Relative Dispersion} = \frac{\text{Difference of extreme items, i.e., Range}}{\text{Sum of extreme items}}$$

The relative dispersion of the series is called the coefficient or the ratio of
dispersion. In our example of weekly earnings of workers considered earlier,
the coefficients would be:
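$$\text{Workshop A} = \frac{9}{21 + 30} = \frac{9}{51} \qquad \text{Workshop B} = \frac{15}{17 + 32} = \frac{15}{49}$$

$$\text{Workshop C} = \frac{23}{15 + 38} = \frac{23}{53} \qquad \text{Workshop D} = \frac{15}{19 + 34} = \frac{15}{53}$$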
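As a quick check of the figures above, the following Python sketch recomputes the range, the range relative to the mean, and the coefficient of range for each workshop. The function and variable names are illustrative only; the extreme class limits and the common mean of 25.5 are read from Table 3.2.

```python
def value_range(low, high):
    """Absolute dispersion: highest value minus lowest value."""
    return high - low


def coefficient_of_range(low, high):
    """Relative dispersion: range divided by the sum of the extreme items."""
    return (high - low) / (high + low)


# (lowest class limit, highest class limit) for each workshop in Table 3.2
extremes = {"A": (21, 30), "B": (17, 32), "C": (15, 38), "D": (19, 34)}
MEAN_EARNINGS = 25.5   # common mean of the four workshops

ranges = {w: value_range(lo, hi) for w, (lo, hi) in extremes.items()}
relative_to_mean = {w: r / MEAN_EARNINGS for w, r in ranges.items()}
coefficients = {w: coefficient_of_range(lo, hi) for w, (lo, hi) in extremes.items()}

for w in extremes:
    print(w, ranges[w], round(relative_to_mean[w], 3), round(coefficients[w], 3))
```

Both relative measures are pure ratios, so they can be compared across distributions expressed in different units.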

Merits and Limitations of Range


Merits. Of the various characteristics that a good measure of dispersion should
possess, the range has only two, viz (i) it is easy to understand, and (ii) its
computation is simple.
Limitations. Apart from the aforesaid two qualities, the range does not satisfy the
other tests of a good measure, and hence it is often termed a crude measure
of dispersion.
The following are the limitations that are inherent in the range as a concept
of variability:

(i) Since it is based on two extreme cases in the entire distribution, the range
may be considerably changed if either of the extreme cases happens to
drop out, while the removal of any other case would not affect it at all.
(ii) It does not tell anything about the distribution of values in the series
relative to a measure of central tendency.
(iii) It cannot be computed when distribution has open-end classes.
(iv) It does not take into account the entire data.
These limitations can be illustrated with the data given in Table 3.3.
The table is designed to illustrate three distributions with the same number
of cases but different variability. The removal of two extreme students from
section A would make its range equal to that of B or C.
Table 3.3 Distribution with the Same Number of Cases, but Different Variability

                      No. of Students
Class        Section A    Section B    Section C
0–10            ...          ...          ...
10–20             1          ...          ...
20–30            12           12           19
30–40            17           20           18
40–50            29           35           16
50–60            18           25           18
60–70            16           10           18
70–80             6            8           21
80–90            11          ...          ...
90–100          ...          ...          ...
Total           110          110          110
Range            80           60           60

The greater range of A is not a description of the entire group of 110


students, but of the two most extreme students only. Further, though sections
B and C have the same range, the students in section B cluster more closely
around the central tendency of the group than they do in section C. Thus, the
range fails to reveal the greater homogeneity of B or the greater dispersion of
C. Due to this defect, it is seldom used as a measure of dispersion.
Specific Uses of Range
In spite of the numerous limitations of the range as a measure of dispersion,
there are the following circumstances when it is the most appropriate one:
(i) In situations where the extremes involve some hazard for which
preparation should be made, it may be more important to know the most
extreme cases to be encountered than to know anything else about the
distribution. For example, an explorer would like to know the lowest and
the highest temperatures on record in the region he is about to enter; or
an engineer would like to know the maximum rainfall during 24 hours for
the construction of a storage.
(ii) In the study of prices of securities, range has a special field of activity.
Thus, to highlight fluctuations in the prices of shares or bullion, it is a
common practice to indicate the range over which the prices have moved
during a certain period of time. This information, besides being of use to
the operators, gives an indication of the stability of the bullion market, or
that of the investment climate.
(iii) In statistical quality control, the range is used as a measure of variation.
For example, we determine the range over which, variations in quality are
due to random causes, which is made the basis for the fixation of control
limits.

Self-Assessment Questions
7. Fill in the blanks with the appropriate terms.
(a) Range of any series is the ______________ between the highest
and the lowest values in the series.
(b) The relative dispersion of the series is called the
___________________ or the ratio of dispersion.
8. State whether true or false.
(a) The crudest measure of dispersion is the range of the distribution.
(b) An absolute measure can not be converted into a relative measure
if we divide it by some other value regarded as standard for the
purpose.

3.6 Standard Deviation


By far, the most universally used and the most useful measure of dispersion is
the standard deviation or the root mean square deviation about the mean. We
have seen that all the methods of measuring dispersion so far discussed are
not universally adopted for want of adequacy and accuracy. The range is not
satisfactory as its magnitude is determined by most extreme cases in the entire
group. Further, the range is not stable because it depends on items whose
size is largely a matter of chance. The mean deviation is also an
unsatisfactory measure of scatter, as it ignores the algebraic signs of deviation.
We desire a measure of scatter which is free from these shortcomings. To
some extent, standard deviation is one such measure.
The calculation of standard deviation differs in the following respects from
that of mean deviation. First, in calculating standard deviation, the deviations
are squared. This is done so as to get rid of negative signs without committing
algebraic violence. Further, the squaring of deviations provides added weight
to the extreme items, a desirable feature for certain types of series.
Second, the deviations are always recorded from the arithmetic mean,
because although the sum of absolute deviations is the minimum when measured from the median, the
sum of squares of deviations is minimum when deviations are measured from
the arithmetic average. The measure of deviation from $\bar{x}$ is represented by $\sigma$.
Thus, standard deviation, $\sigma$ (sigma), is defined as the square root of the
mean of the squares of the deviations of individual items from their arithmetic
mean.

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}$$

For grouped data (discrete variables)

$$\sigma = \sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f}}$$

and, for grouped data (continuous variables)

$$\sigma = \sqrt{\frac{\sum f (M - \bar{x})^2}{\sum f}}$$

Where, M is the mid-value of the group.


The use of these formulae is illustrated by the following examples.
Example 3.14: Compute the standard deviation for the following data:
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21

Solution: Here the formula $\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{N}}$ is appropriate. We first calculate
the mean as $\bar{x} = \sum x / N = 176/11 = 16$, and then calculate the deviations as
follows:

   x        (x − x̄)      (x − x̄)²
  11          −5            25
  12          −4            16
  13          −3             9
  14          −2             4
  15          −1             1
  16           0             0
  17          +1             1
  18          +2             4
  19          +3             9
  20          +4            16
  21          +5            25
Σx = 176                Σ(x − x̄)² = 110

Thus, by using the formula, we get

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{N}} = \sqrt{\frac{110}{11}} = \sqrt{10} = 3.16$$
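The calculation in Example 3.14 can be reproduced with a few lines of Python. This is an illustrative sketch; the function name `std_dev` is hypothetical. The standard library function `statistics.pstdev` computes the same population standard deviation (divisor N) and is used here only as a cross-check.

```python
from math import sqrt
from statistics import pstdev


def std_dev(values):
    """Square root of the mean of squared deviations from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sqrt(sum(squared_deviations) / n)


data = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
sigma = std_dev(data)
print(round(sigma, 2), round(pstdev(data), 2))
```

Note that the divisor is N, as in the formula of this unit, and not N − 1.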

Example 3.15: Find the standard deviation of the data in the following
distribution:

x      12     13     14     15     16     17     18     20
f       4     11     32     21     15      8      5      4

Solution: For this discrete variable grouped data, we use the formula
$\sigma = \sqrt{\dfrac{\sum f (x - \bar{x})^2}{\sum f}}$. Since for the calculation of $\bar{x}$ we need $\sum fx$, and then for $\sigma$ we
need $\sum f (x - \bar{x})^2$, the calculations are conveniently made in the following
format:

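   x        f        fx      d = x − x̄      d²       fd²
  12        4        48         −3           9        36
  13       11       143         −2           4        44
  14       32       448         −1           1        32
  15       21       315          0           0         0
  16       15       240          1           1        15
  17        8       136          2           4        32
  18        5        90          3           9        45
  20        4        80          5          25       100
Total     100      1500                              304

Here,

$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{1500}{100} = 15$$

and

$$\sigma = \sqrt{\frac{\sum fd^2}{\sum f}} = \sqrt{\frac{304}{100}} = \sqrt{3.04} = 1.74$$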
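The tabular working of Example 3.15 maps directly onto code. The sketch below is illustrative (the name `grouped_std_dev` is hypothetical); it weights each value by its frequency, exactly as the fx and fd² columns do.

```python
from math import sqrt


def grouped_std_dev(values, freqs):
    """Mean and standard deviation of a discrete frequency distribution."""
    total = sum(freqs)                                            # Σf
    mean = sum(f * x for x, f in zip(values, freqs)) / total      # Σfx / Σf
    variance = sum(f * (x - mean) ** 2
                   for x, f in zip(values, freqs)) / total        # Σfd² / Σf
    return mean, sqrt(variance)


x = [12, 13, 14, 15, 16, 17, 18, 20]
f = [4, 11, 32, 21, 15, 8, 5, 4]
mean, sigma = grouped_std_dev(x, f)
print(mean, round(sigma, 2))
```

For continuous classes the same function can be used with the class mid-values in place of `x`.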

Calculation of Standard Deviation by Short-cut Method


The three examples worked out previously have one common simplifying feature,
namely $\bar{x}$ in each turned out to be an integer, thus simplifying calculations. In
most cases, it is very unlikely that it will turn out to be so. In such cases, the
calculation of d and d² becomes quite time-consuming. Short-cut methods have
consequently been developed. These are on the same lines as those for the
calculation of the mean itself.
In the short-cut method, we calculate deviations x' from an assumed
mean A. Then, for ungrouped data

$$\sigma = \sqrt{\frac{\sum x'^2}{N} - \left(\frac{\sum x'}{N}\right)^2}$$

and for grouped data

$$\sigma = \sqrt{\frac{\sum f x'^2}{\sum f} - \left(\frac{\sum f x'}{\sum f}\right)^2}$$

This formula is valid for both discrete and continuous variables. In case of
continuous variables, x in the equation x' = x − A stands for the mid-value of the
class in question.

Note that the second term in each of the formulae is a correction term
because of the difference in the values of A and $\bar{x}$. When A is taken as $\bar{x}$ itself,
this correction is automatically reduced to zero. The following examples explain
the use of these formulae.
Example 3.16: Compute the standard deviation by the short-cut method for the
following data:
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21
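Solution: Let us assume that A = 15.

   x       x' = (x − 15)      x'²
  11           −4              16
  12           −3               9
  13           −2               4
  14           −1               1
  15            0               0
  16            1               1
  17            2               4
  18            3               9
  19            4              16
  20            5              25
  21            6              36
N = 11       Σx' = 11       Σx'² = 121

$$\sigma = \sqrt{\frac{\sum x'^2}{N} - \left(\frac{\sum x'}{N}\right)^2} = \sqrt{\frac{121}{11} - \left(\frac{11}{11}\right)^2} = \sqrt{11 - 1} = \sqrt{10} = 3.16$$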
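A short Python sketch of the short-cut method shows that the result does not depend on the assumed mean chosen. The function name `std_dev_shortcut` is hypothetical; the second assumed mean (A = 12) is an extra illustration that does not appear in the example.

```python
from math import sqrt


def std_dev_shortcut(values, assumed_mean):
    """Short-cut method: deviations x' are taken from an assumed mean A."""
    n = len(values)
    deviations = [x - assumed_mean for x in values]       # x' = x - A
    mean_square = sum(d * d for d in deviations) / n      # Σx'² / N
    correction = (sum(deviations) / n) ** 2               # (Σx' / N)²
    return sqrt(mean_square - correction)


data = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
sigma_15 = std_dev_shortcut(data, 15)   # A = 15, as in Example 3.16
sigma_12 = std_dev_shortcut(data, 12)   # a different assumed mean, same answer
print(round(sigma_15, 2), round(sigma_12, 2))
```

The correction term grows as A moves away from the true mean, but it always cancels the extra amount in the first term.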
Example 3.17: Calculate the standard deviation of the following data by the
short-cut method.

x     0–10    10–20    20–30    30–40    40–50    50–60    60–70    70–80
f      18       16       15       12       10        5        2        1


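Solution: The assumed mean is taken as A = 25, the mid-point of the class 20–30, and
the deviations x' are expressed in class-interval units.

Class     Midpoint    Frequency    Deviation from class     Deviation times    Squared deviation
            (x)          (f)       of assumed mean (x')     frequency (fx')    times frequency (fx'²)
0–10          5           18               −2                    −36                  72
10–20        15           16               −1                    −16                  16
20–30        25           15                0                      0                   0
30–40        35           12                1                     12                  12
40–50        45           10                2                     20                  40
50–60        55            5                3                     15                  45
60–70        65            2                4                      8                  32
70–80        75            1                5                      5                  25
Total                  Σf = 79                          Σfx' = −52 + 60 = 8      Σfx'² = 242

Since the deviations are from assumed mean and expressed in terms of
class-interval units,

$$\sigma = i \times \sqrt{\frac{\sum f x'^2}{N} - \left(\frac{\sum f x'}{N}\right)^2} = 10 \times \sqrt{\frac{242}{79} - \left(\frac{8}{79}\right)^2} = 10 \times 1.75 = 17.5$$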
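The step-deviation working of Example 3.17 can be sketched in Python as follows. The variable names are illustrative; the assumed mean A = 25 and class width i = 10 are the values used in the solution table.

```python
from math import sqrt

midpoints = [5, 15, 25, 35, 45, 55, 65, 75]   # class mid-values
freqs = [18, 16, 15, 12, 10, 5, 2, 1]         # class frequencies
A = 25    # assumed mean (mid-point of the class 20-30)
i = 10    # class width

steps = [(m - A) // i for m in midpoints]              # x' in class-interval units
n = sum(freqs)                                         # Σf = 79
sum_fx = sum(f * s for f, s in zip(freqs, steps))      # Σfx'
sum_fx2 = sum(f * s * s for f, s in zip(freqs, steps)) # Σfx'²

sigma = i * sqrt(sum_fx2 / n - (sum_fx / n) ** 2)
print(n, sum_fx, sum_fx2, round(sigma, 2))
```

Carried to two decimal places the result is about 17.47; the 17.5 in the worked solution comes from rounding the square root to 1.75 before multiplying by the class width.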
Combining Standard Deviations of Two Distributions
If we were given two sets of data of N1 and N2 items with means $\bar{x}_1$ and $\bar{x}_2$ and
standard deviations $\sigma_1$ and $\sigma_2$ respectively, we can obtain the mean and the
standard deviation $\bar{x}$ and $\sigma$ of the combined distribution by the following formulae:

$$\bar{x} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2}$$

and

$$\sigma = \sqrt{\frac{N_1 \sigma_1^2 + N_2 \sigma_2^2 + N_1 (\bar{x} - \bar{x}_1)^2 + N_2 (\bar{x} - \bar{x}_2)^2}{N_1 + N_2}}$$

Example 3.18: The mean and the standard deviations of two distributions of
100 and 150 items are 50, 5 and 40, 6 respectively. Find the standard deviation
of all taken together.
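Solution: Combined mean,

$$\bar{x} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2} = \frac{100 \times 50 + 150 \times 40}{100 + 150} = 44$$

Combined standard deviation,

$$\sigma = \sqrt{\frac{N_1 \sigma_1^2 + N_2 \sigma_2^2 + N_1 (\bar{x} - \bar{x}_1)^2 + N_2 (\bar{x} - \bar{x}_2)^2}{N_1 + N_2}}$$

$$= \sqrt{\frac{100 \times (5)^2 + 150 \times (6)^2 + 100 \times (44 - 50)^2 + 150 \times (44 - 40)^2}{100 + 150}} = 7.46$$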
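The combination formulae can be verified with a small Python sketch. The function name `combine` is hypothetical; the inputs are those of Example 3.18.

```python
from math import sqrt


def combine(n1, mean1, sd1, n2, mean2, sd2):
    """Mean and standard deviation of two distributions taken together."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    variance = (n1 * sd1 ** 2 + n2 * sd2 ** 2
                + n1 * (mean - mean1) ** 2
                + n2 * (mean - mean2) ** 2) / n
    return mean, sqrt(variance)


combined_mean, combined_sd = combine(100, 50, 5, 150, 40, 6)
print(combined_mean, round(combined_sd, 2))
```

The last two terms of the variance account for the distance of each group mean from the combined mean, which is why the combined standard deviation (7.46) exceeds both 5 and 6 here.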
Comparison of Various Measures of Dispersion
The range is the easiest to calculate measure of dispersion, but since it depends
on extreme values, it is extremely sensitive to the size of the sample and to the
sample variability. In fact, as the sample size increases, the range increases
dramatically, because the more the items one considers, the more likely it is
that some item will turn up which is larger than the previous maximum or smaller
than the previous minimum. So, in general, it is impossible to interpret properly
the significance of a given range unless the sample size is constant. It is for this
reason that there appears to be only one valid application of the range, namely
in statistical quality control where the same sample size is repeatedly used, so
that comparison of ranges are not distorted by differences in sample size.
The quartile deviations and other such positional measures of dispersions
are also easy to calculate, but suffer from the disadvantage that they are not
amenable to algebraic treatment. Similarly, the mean deviation is not suitable
because we cannot obtain the mean deviation of a combined series from the
deviations of component series. However, it is easy to interpret and easier to
calculate than the standard deviation.
The standard deviation of a set of data, on the other hand, is one of the
most important statistics describing it. It lends itself to rigorous algebraic treatment,
is rigidly defined and is based on all observations. It is therefore, quite insensitive
to sample size (provided the size is large enough) and is least affected by
sampling variations.
It is used extensively in testing of hypothesis about population parameters


based on sampling statistics.
In fact, the standard deviation has such stable mathematical properties
that it is used as a standard scale for measuring deviations from the mean. If
we are told that the performance of an individual is 10 points better than the
mean, it really does not tell us enough, for 10 points may or may not be a
large enough difference to be of significance. But, if we know that the σ for the
score is only 4 points, so that on this scale, the performance is 2.5σ better than
the mean, the statement becomes meaningful. This indicates an extremely good
performance. This sigma scale is a very commonly used scale for measuring
and specifying deviations which immediately suggest the significance of the
deviation.
The only disadvantage of the standard deviation lies in the amount of
work involved in its calculation, and the large weight it attaches to extreme
values because of the process of squaring involved in its calculations.
Activity 3
Calculate standard deviation from the following data:
Size of Item

10

11

12

Frequency

13

Self-Assessment Questions
9. Fill in the blanks with the appropriate terms.
(a) The squaring of deviations provides added _________________ to
the extreme items.
(b) Standard deviation, σ (sigma), is defined as the square root of the
mean of the _________________ of the deviations of individual items
from their arithmetic mean.
10. State whether true or false.
(a) In calculating standard deviation, the deviations are squared.
(b) The deviations are sometimes recorded from the arithmetic mean.

3.7 Summary
Let us recapitulate the important concepts discussed in this unit:

• Percentage is used in business and economic fields for making comparisons
  of profit, growth rate, magnitude, performance, etc. The concept of
  percentage applies mainly to ratios.
• Percentage point change only notes the change in percentage whereas
  percentage change notes the change with reference to the original value.
• An average is the measure of central tendency of a set of numbers.
• The mean is computed by adding all the data values and dividing the sum by
  the number of such values.
• In weighted arithmetic mean, the weight is represented by the symbol w,
  and Σw represents the sum of weights. It is used to compute the mean
  of means.
• Median is that value of a variable which divides the series in such a manner
  that the number of items below it is equal to the number of items above it.
  Half of the total number of observations lies below the median and half
  above it. The median is thus a positional average.
• The mode is that value of the variable which occurs or repeats itself the
  greatest number of times. The mode of a distribution is the value at the
  point around which the items tend to be most heavily concentrated. It is
  the most frequent or the most common value, provided that a sufficiently
  large number of items are available, to give a smooth distribution.
• The positional measures are called quantiles and consist of quartiles,
  deciles and percentiles. The quartiles divide the data into four equal parts.
  The deciles divide the total ordered data into ten equal parts and the
  percentiles divide the data into 100 equal parts.
• The crudest measure of dispersion is the range of the distribution. Range
  of any series is the difference between the highest and the lowest values
  in the series.
• In a frequency distribution, the range is taken to be the difference between
  the lower limit of the class at the lower extreme of the distribution and the
  upper limit of the class at the upper extreme.
• An absolute measure can be converted into a relative measure if we divide
  it by some other value regarded as standard for the purpose.
• In calculating standard deviation, the deviations are squared. This is done
  so as to get rid of negative signs without committing algebraic violence.
  The deviations are always recorded from the arithmetic mean.
• Standard deviation, σ (sigma), is defined as the square root of the mean
  of the squares of the deviations of individual items from their arithmetic
  mean.

3.8 Glossary
Mean: An arithmetic average and measure of central location.
Median: The measure of central tendency that appears in the centre of
an ordered data.
Mode: Another form of average that can be defined as the most frequently
occurring value in the data.
Quartile: A positional measure that divides the data into four equal parts.
Range: The difference between the maximum and minimum values. It
indicates the limits within which the values fall.
Standard deviation: A measure of the variability or dispersion of a
population, a data set, or a probability distribution. A low standard deviation
indicates that the data points tend to be very close to the same value (the
mean); while high standard deviation indicates that the data are spread
out over a large range of values.

3.9 Terminal Question


1. How is percentage change calculated? Name the two changes recorded
in it?
2. What is the significance of ratio and averages?
3. Explain the methods for calculating mean.
4. Explain the term median with the help of an example.
5. Explain the significance of mode in statistical calculations.
6. What are quartiles?
7. What is range? How is it calculated?
8. How is standard deviation calculated? Explain with the help of an example.
3.10 Answers
Answers to Self-Assessment Questions
1. (a) Denominator; (b) Antecedent
2. (a) True; (b) False
3. (a) True; (b) True
4. (a) Sum; (b) Value
5. (a) Quantiles; (b) Array
6. (a) False; (b) True
7. (a) Difference; (b) Coefficient
8. (a) True; (b) False
9. (a) Weight; (b) Squares
10. (a) True; (b) False

Answers to Terminal Questions


1. Refer Section 3.2.1
2. Refer Sections 3.2.2 and 3.2.3
3. Refer Section 3.3.1
4. Refer Section 3.3.2
5. Refer Section 3.3.3
6. Refer Section 3.4
7. Refer Section 3.5
8. Refer Section 3.6

3.11 Further Reading


1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.
2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand
& Sons, 2010.

Unit 4

Index Numbers

Structure
4.1 Introduction
Objectives
4.2 Index Numbers
4.3 Summary
4.4 Glossary
4.5 Terminal Questions
4.6 Answers
4.7 Further Reading

4.1 Introduction
In the previous unit you learnt about data analysis techniques such as measures
of dispersion. In this unit you will learn about index numbers, their various types
and the reasons why index numbers are required. Index numbers are a
specialized type of average. They are designed to measure the relative change
in the level of a phenomenon with respect to time, geographical locations or
some other characteristics. You will also learn about the different formulae and
methods devised for constructing index numbers and the problems one
faces while constructing them.

Objectives
After studying this unit, you should be able to:
Discuss the various formulae and methods used in constructing index
numbers
Construct index numbers
Use index numbers for various purposes.

4.2 Index Numbers


Index numbers are a specialized type of average. They are designed to measure
the relative change in the level of a phenomenon with respect to time,
geographical locations or some other characteristics.

Originally, index numbers were developed for measuring the effect of
changes in the price level. But today, index numbers are also used to measure
changes in industrial production, fluctuations in the level of business activities
or variations in the agricultural output, etc. In fact, if we want to get an idea as to
what is happening to an economy, we have to simply look at a few important
indices like those of industrial output, agricultural production and business activity.
In the words of G. Simpson and F. Kafka, "Index numbers are today one of the
most widely used statistical devices. They are used to take the pulse of the
economy and they have come to be used as indicators of inflationary or
deflationary tendencies."

4.2.1 Types of Index Numbers


Methods of Construction of Index Numbers
Methods of constructing index numbers can broadly be divided into two classes:
(i) Unweighted indices
(ii) Weighted indices
In case of unweighted indices, weights are not expressly assigned,
whereas in the weighted indices, weights are expressly assigned to the various
items. Each of these types may be further classified under two heads:
(i) Aggregate of prices method
(ii) Average of price relatives method
The following chart illustrates the various methods:

Index Numbers
    Unweighted
        Simple aggregate of prices
        Simple average of price relatives
    Weighted
        Weighted aggregate of prices
        Weighted average of price relatives

A. Unweighted Index Numbers


1. Simple Aggregate of Prices Method: Under this method, the total of
prices for all commodities in the current year is divided by the total of
prices for these commodities in the base year and the quotient is multiplied
by 100. Symbolically,

$$P_{01} = \frac{\sum P_1}{\sum P_0} \times 100$$

where,
ΣP1 = Total of current year prices for various commodities.
ΣP0 = Total of base year prices for various commodities.
This method of constructing index numbers is very simple and requires
the following steps for its computation.
(i) Total the prices of various commodities for each time period to get ΣP0
and ΣP1. These totals are in rupees.
(ii) Divide the total of the given time period, ΣP1, by the base period total,
ΣP0, and express the result in per cent, by multiplying the quotient by
100.
Example 4.1: From the following data, construct an index number of prices by
simple aggregative method for 1982 taking 1981 as the base:
Commodity      Unit       Price in 1981      Price in 1982
Milk           litre           2.00               2.50
Butter         kg             12.00              15.00
Cheese         kg             10.00              12.00
Bread          one             2.00               2.50
Eggs           dozen           4.00               5.00
Solution: Construction of index numbers

Commodity      Unit           P0                 P1
Milk           litre           2.00               2.50
Butter         kg             12.00              15.00
Cheese         kg             10.00              12.00
Bread          one             2.00               2.50
Eggs           dozen           4.00               5.00
                          ΣP0 = 30.00        ΣP1 = 37.00

$$P_{01} = \frac{\sum P_1}{\sum P_0} \times 100 = \frac{37}{30} \times 100 = 123.33\%$$

This means that as compared to 1981, there is a net increase of (123.33 − 100) =
23.33 per cent in 1982, in the prices of commodities included in the index.
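The simple aggregate of prices method amounts to two sums and a division, as the following Python sketch of Example 4.1 shows. The dictionary layout and the function name `simple_aggregate_index` are illustrative choices, not part of the text.

```python
def simple_aggregate_index(base_prices, current_prices):
    """P01 = (total of current year prices / total of base year prices) x 100."""
    return sum(current_prices) / sum(base_prices) * 100


# commodity: (price in 1981, price in 1982), from Example 4.1
prices = {
    "Milk": (2.00, 2.50),
    "Butter": (12.00, 15.00),
    "Cheese": (10.00, 12.00),
    "Bread": (2.00, 2.50),
    "Eggs": (4.00, 5.00),
}

p0 = [base for base, _ in prices.values()]
p1 = [current for _, current in prices.values()]
index_1982 = simple_aggregate_index(p0, p1)
print(sum(p0), sum(p1), round(index_1982, 2))
```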
This method suffers from two drawbacks,
(i) The unit by which each item is priced, introduces a concealed weight in
the simple aggregate of actual prices. For example, milk is quoted per
litre in Example 4.1. If the price is expressed in terms of per gallon, the
index might be very different.
(ii) Equal weightage is given to all the items irrespective of their relative
importance.
2. Simple Average of Price Relative Method
Under this method, the price relatives for each commodity are calculated and
their average is found out. The steps involved in the construction of this index
are:
(i) Obtain the price relative by dividing the price of each commodity in the
given time period, P1, by its price in the base period, P0, and express this
result in per cent, i.e., obtain $\dfrac{P_1}{P_0} \times 100$ for each commodity.

(ii) Average these price relatives for the given time period by dividing the
total of price relatives for different commodities by the number of
commodities. Symbolically,

$$P_{01} = \frac{\sum \left[ \dfrac{P_1}{P_0} \times 100 \right]}{N}$$

where, N refers to the number of commodities (items) whose price relatives
are thus averaged.
Example 4.2: From the data given in Example 4.1, compute the price index for
1982 with 1981 as base, by simple average of price relatives method.

Solution: Construction of price index

Commodity   Unit    Price in 1981   Price in 1982   Price relative
                      P0 (Rs.)        P1 (Rs.)      (P1/P0) × 100
Milk        litre       2.00            2.50        (2.50/2.00) × 100 = 125
Butter      kg         12.00           15.00        (15/12) × 100 = 125
Cheese      kg         10.00           12.00        (12/10) × 100 = 120
Bread       one         2.00            2.50        (2.50/2.00) × 100 = 125
Eggs        dozen       4.00            5.00        (5/4) × 100 = 125
N = 5                                               Σ[(P1/P0) × 100] = 620

$$P_{01} = \frac{\sum \left( \frac{P_1}{P_0} \times 100 \right)}{N} = \frac{620}{5} = 124$$
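The two unweighted methods can be checked with a short script. This is a minimal sketch using the data of Example 4.1; the function and variable names are illustrative and not part of the original text.

```python
# Prices (Rs.) from Example 4.1: base year 1981 (P0), current year 1982 (P1).
p0 = {"Milk": 2.00, "Butter": 12.00, "Cheese": 10.00, "Bread": 2.00, "Eggs": 4.00}
p1 = {"Milk": 2.50, "Butter": 15.00, "Cheese": 12.00, "Bread": 2.50, "Eggs": 5.00}


def simple_aggregative_index(p0, p1):
    """Sum of current year prices divided by sum of base year prices, times 100."""
    return sum(p1.values()) / sum(p0.values()) * 100


def simple_average_of_relatives(p0, p1):
    """Arithmetic mean of the price relatives (P1 / P0) * 100."""
    relatives = [p1[item] / p0[item] * 100 for item in p0]
    return sum(relatives) / len(relatives)


aggregative = simple_aggregative_index(p0, p1)
average_relative = simple_average_of_relatives(p0, p1)
print(round(aggregative, 2), round(average_relative, 2))
```

The first figure reproduces the result of Example 4.1 and the second that of Example 4.2.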

The simple average of price relative method is superior to the simple


aggregate of prices method in two respects:
(i) Since we are comparing price per litre with price per litre, and price per
kilogram with price per kilogram, the concealed weight due to use of
different units is completely removed.
(ii) The index is not influenced by extreme items, as equal importance is
given to all items.


However, the greatest drawback of unweighted indices is that equal


importance or weight is given to all items included in the index number, which is
not proper. As such, unweighted indices are of little use in practice.
B. Weighted Index Numbers
1. Weighted Aggregate of Prices Index: These indices are similar to the
simple aggregative type with the fundamental difference that weights are
assigned explicitly to the various items included in the index. In the matter
of assigning weights, authors differ. As a result, a large number of
formulae have been devised for constructing index numbers. Some of
the important formulae are as follows:
(i) Laspeyres' Method: In this method, base year quantities are taken
as weights. The formula for constructing the index is:

$$P_{01} = \frac{\sum P_1 q_0}{\sum P_0 q_0} \times 100$$

where
P1 = Price in the current year
P0 = Price in the base year
q0 = Quantity in the base year

According to this method, the index number for each year is obtained in
three steps:
(i) The price of each commodity in each year is multiplied by the base year
quantity of that commodity. For the base year, each product is symbolized
by P0q0, and for the current year by P1q0.
(ii) The products for each year are totalled and ΣP1q0 and ΣP0q0 are obtained.
(iii) ΣP1q0 is divided by ΣP0q0 and the quotient is multiplied by 100 to obtain
the index.
Example 4.3: From the following data, calculate the index number of prices for
1982 with 1972 as base using the Laspeyres method.
               1972                  1982
Item      Price   Quantity      Price   Quantity
A           2        8            4        6
B           5       10            6        5
C           4       14            5       10
D           2       19            2       13


Solution: Representing base year (1972) price by P0, base year quantity by q0,
current year (1982) price by P1 and current year quantity by q1 we have:
Commodity   P0   q0   P1   q1   P0q0   P1q0
A            2    8    4    6    16     32
B            5   10    6    5    50     60
C            4   14    5   10    56     70
D            2   19    2   13    38     38
Total                     ΣP0q0 = 160   ΣP1q0 = 200

Index number of prices by Laspeyres' method:

$$P_{01} = \frac{\sum P_1 q_0}{\sum P_0 q_0} \times 100 = \frac{200}{160} \times 100 = 125$$

Laspeyres index is very widely used. It tells us about the change in the
aggregate value of the base period list of goods when valued at a given period
price.
However, this index has one drawback. It does not take into consideration
the changes in the consumption pattern that take place with the passage of
time.
(ii) Paasche's Index: In this method, the current year quantities (q1) are taken
as weights. The formula for constructing this index is:

$$P_{01} = \frac{\sum P_1 q_1}{\sum P_0 q_1} \times 100$$

Steps for constructing Paasche's index are the same as those taken
in constructing Laspeyres' index, with the only difference that the price of each
commodity in each year is multiplied by the quantity of that commodity in the
current year rather than by the quantity in the base year.
Example 4.4: Taking the data given in Example 4.3, compute the index number
of prices for 1982 with 1972 as base, using Paasche's method.


Solution: Construction of Paasche's Index

Commodity   P0   q0   P1   q1   P0q1   P1q1
A            2    8    4    6    12     24
B            5   10    6    5    25     30
C            4   14    5   10    40     50
D            2   19    2   13    26     26
Total                     ΣP0q1 = 103   ΣP1q1 = 130

Index number of prices by Paasche's method:

$$P_{01} = \frac{\sum P_1 q_1}{\sum P_0 q_1} \times 100 = \frac{130}{103} \times 100 = 126.21$$

Although this method takes into consideration the changes in the


consumption pattern, the need for collecting data regarding quantities for each
year or each period makes the method very expensive. Hence, where the number
of commodities is large, Paasches method is not preferred.
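The Laspeyres and Paasche calculations differ only in which year's quantities serve as weights, which the following sketch makes explicit. It uses the data tabulated in the solution to Example 4.4; the function names are illustrative and not from the original text.

```python
# Each row is (P0, q0, P1, q1) for commodities A to D (base 1972, current 1982).
rows = [(2, 8, 4, 6), (5, 10, 6, 5), (4, 14, 5, 10), (2, 19, 2, 13)]


def laspeyres_price_index(rows):
    """Weighted aggregate of prices with base year quantities (q0) as weights."""
    numerator = sum(p1 * q0 for p0, q0, p1, q1 in rows)
    denominator = sum(p0 * q0 for p0, q0, p1, q1 in rows)
    return numerator / denominator * 100


def paasche_price_index(rows):
    """Weighted aggregate of prices with current year quantities (q1) as weights."""
    numerator = sum(p1 * q1 for p0, q0, p1, q1 in rows)
    denominator = sum(p0 * q1 for p0, q0, p1, q1 in rows)
    return numerator / denominator * 100


laspeyres = laspeyres_price_index(rows)  # 200 / 160 * 100
paasche = paasche_price_index(rows)      # 130 / 103 * 100
print(round(laspeyres, 2), round(paasche, 2))
```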
(iii) Bowley-Drobisch Method: This method is the simple arithmetic mean of
Laspeyres' and Paasche's indices. The formula for constructing the Bowley-Drobisch
index is:

$$P_{01} = \frac{\frac{\sum P_1 q_0}{\sum P_0 q_0} + \frac{\sum P_1 q_1}{\sum P_0 q_1}}{2} \times 100$$

or

$$P_{01} = \frac{L + P}{2}$$

where L = Laspeyres' index
P = Paasche's index
Example 4.5: Compute the index number of prices for 1976 with 1970 as base
using the Bowley-Drobisch method from the following data.
               1970                  1976
Items     Price   Quantity      Price   Quantity
1           2       20            5       15
2           4        4            8        5
3           1       10            2       12
4           5        5           10        6

Solution: Computation of price index by Bowley-Drobisch formula

Items   P0   q0   P1   q1   P0q0   P0q1   P1q0   P1q1
1        2   20    5   15    40     30    100     75
2        4    4    8    5    16     20     32     40
3        1   10    2   12    10     12     20     24
4        5    5   10    6    25     30     50     60
Total                        91     92    202    199

ΣP0q0 = 91, ΣP0q1 = 92, ΣP1q0 = 202, ΣP1q1 = 199

According to Bowley-Drobisch formula:

$$P_{01} = \frac{\frac{\sum P_1 q_0}{\sum P_0 q_0} + \frac{\sum P_1 q_1}{\sum P_0 q_1}}{2} \times 100 = \frac{\frac{202}{91} + \frac{199}{92}}{2} \times 100 = \frac{2.2198 + 2.1630}{2} \times 100$$

= 4.3828 × 50 = 219.14
(iv) Marshall-Edgeworth Method: In this method, the sums of base year and
current year quantities are taken as weights. The formula for constructing the
index is:

$$P_{01} = \frac{\sum P_1 (q_0 + q_1)}{\sum P_0 (q_0 + q_1)} \times 100$$

or

$$P_{01} = \frac{\sum P_1 q_0 + \sum P_1 q_1}{\sum P_0 q_0 + \sum P_0 q_1} \times 100$$

Example 4.6: For the data given in Example 4.5, compute index number of
prices for 1976 with 1970 as base using the Marshall-Edgeworth formula:
Solution: Computation of price index by Marshall-Edgeworth formula:

Item    P0   q0   P1   q1   P0q0   P0q1   P1q0   P1q1
1        2   20    5   15    40     30    100     75
2        4    4    8    5    16     20     32     40
3        1   10    2   12    10     12     20     24
4        5    5   10    6    25     30     50     60
Total                        91     92    202    199

ΣP0q0 = 91, ΣP0q1 = 92, ΣP1q0 = 202, ΣP1q1 = 199

According to Marshall-Edgeworth formula:

$$P_{01} = \frac{\sum P_1 (q_0 + q_1)}{\sum P_0 (q_0 + q_1)} \times 100 = \frac{\sum P_1 q_0 + \sum P_1 q_1}{\sum P_0 q_0 + \sum P_0 q_1} \times 100 = \frac{202 + 199}{91 + 92} \times 100 = \frac{401}{183} \times 100$$

= 219.13 (approximately)
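Both the Bowley-Drobisch and Marshall-Edgeworth indices are built from the same four column totals, so they can be computed together. The sketch below uses the data of Example 4.5; the helper name `column_totals` is an illustrative assumption, not from the original text.

```python
# Each row is (P0, q0, P1, q1) for items 1 to 4 (base 1970, current 1976).
rows = [(2, 20, 5, 15), (4, 4, 8, 5), (1, 10, 2, 12), (5, 5, 10, 6)]


def column_totals(rows):
    """Return the totals of P0q0, P0q1, P1q0 and P1q1 over all items."""
    p0q0 = sum(p0 * q0 for p0, q0, p1, q1 in rows)
    p0q1 = sum(p0 * q1 for p0, q0, p1, q1 in rows)
    p1q0 = sum(p1 * q0 for p0, q0, p1, q1 in rows)
    p1q1 = sum(p1 * q1 for p0, q0, p1, q1 in rows)
    return p0q0, p0q1, p1q0, p1q1


p0q0, p0q1, p1q0, p1q1 = column_totals(rows)

# Bowley-Drobisch: arithmetic mean of the Laspeyres and Paasche ratios.
bowley = (p1q0 / p0q0 + p1q1 / p0q1) / 2 * 100

# Marshall-Edgeworth: weights are the sums of base and current year quantities.
marshall_edgeworth = (p1q0 + p1q1) / (p0q0 + p0q1) * 100

print(round(bowley, 2), round(marshall_edgeworth, 2))
```

The two results are close but not identical, since one averages two ratios while the other pools numerators and denominators before dividing.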
(v) Kelly's Method: In this method, neither base year nor current year quantities
are taken as weights. Instead, the quantities of some reference year or the
average quantity of two or more years may be taken as weights. The formula
for constructing the index is:

$$P_{01} = \frac{\sum P_1 q}{\sum P_0 q} \times 100$$

where q is the quantity of some reference year.


Example 4.7: Calculate the index number of prices for 1981 with 1980 as base
year for the following data, using Kelly's method.

Item      Quantity (units)   Price in 1980   Price in 1981
Bricks          10               100             160
Timber           7               200             210
Board           15                50              60
Sand             9                20              30
Cement          10                10              14

Solution: Computation of price index by Kelly's method:

Item       q     P0     P1     P0q     P1q
Bricks    10    100    160    1000    1600
Timber     7    200    210    1400    1470
Board     15     50     60     750     900
Sand       9     20     30     180     270
Cement    10     10     14     100     140
Total                   ΣP0q = 3430   ΣP1q = 4380

According to Kelly's method:

$$P_{01} = \frac{\sum P_1 q}{\sum P_0 q} \times 100 = \frac{4380}{3430} \times 100 = 127.697$$
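Kelly's formula needs only one set of fixed quantities, which keeps the computation short. A minimal sketch with the data of Example 4.7 (variable names are illustrative):

```python
# Each entry is (q, P0, P1): fixed reference quantity, 1980 price, 1981 price.
items = {
    "Bricks": (10, 100, 160),
    "Timber": (7, 200, 210),
    "Board": (15, 50, 60),
    "Sand": (9, 20, 30),
    "Cement": (10, 10, 14),
}

# Value of the fixed basket at base year prices and at current year prices.
base_value = sum(q * p0 for q, p0, p1 in items.values())
current_value = sum(q * p1 for q, p0, p1 in items.values())

kelly = current_value / base_value * 100
print(base_value, current_value, round(kelly, 3))
```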

(vi) Fisher's Ideal Index: This method is the geometric mean of Laspeyres'
and Paasche's indices.
The formula for constructing the index is:

$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}} \times 100$$

Fisher's formula is known as the 'ideal' index because of the following reasons:


(i) It takes into account prices and quantities of both the current year as well
as the base year.
(ii) It uses geometric mean which, theoretically, is the best average for
constructing index numbers.
(iii) It satisfies both the time reversal test and the factor reversal test.
(iv) It is free from bias. The weight biases embodied in Laspeyres and
Paasches methods are crossed geometrically, and thus, eliminated
completely.
Example 4.8: Construct the index number of prices for the year 1980 with 1979
as base using Fisher's Ideal Method.

                  1979                  1980
Commodity    Price   Quantity      Price   Quantity
A             20        8           40        6
B             50       10           60        5
C             40       15           50       10
D             20       20           20       15

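Solution: Construction of price index by Fisher's Ideal Formula:

Commodity   P0   q0   P1   q1   P0q0   P0q1   P1q0   P1q1
A           20    8   40    6    160    120    320    240
B           50   10   60    5    500    250    600    300
C           40   15   50   10    600    400    750    500
D           20   20   20   15    400    300    400    300
Total                           1660   1070   2070   1340

ΣP0q0 = 1660, ΣP0q1 = 1070, ΣP1q0 = 2070, ΣP1q1 = 1340

Price index by Fisher's Ideal Formula is:

$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}} \times 100 = \sqrt{\frac{2070}{1660} \times \frac{1340}{1070}} \times 100$$

$$= \sqrt{1.247 \times 1.252} \times 100 = \sqrt{1.5612} \times 100$$

= 1.25 × 100 = 125 (approximately)
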
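Since Fisher's index is the geometric mean of the Laspeyres and Paasche indices, it always lies between them. The sketch below, using the data of Example 4.8, illustrates this; the names are illustrative and not from the original text.

```python
from math import sqrt

# Each row is (P0, q0, P1, q1) for commodities A to D (base 1979, current 1980).
rows = [(20, 8, 40, 6), (50, 10, 60, 5), (40, 15, 50, 10), (20, 20, 20, 15)]

p0q0 = sum(p0 * q0 for p0, q0, p1, q1 in rows)
p0q1 = sum(p0 * q1 for p0, q0, p1, q1 in rows)
p1q0 = sum(p1 * q0 for p0, q0, p1, q1 in rows)
p1q1 = sum(p1 * q1 for p0, q0, p1, q1 in rows)

laspeyres = p1q0 / p0q0 * 100
paasche = p1q1 / p0q1 * 100

# Fisher's ideal index: geometric mean of the two indices above.
fisher = sqrt(laspeyres * paasche)

print(round(laspeyres, 2), round(paasche, 2), round(fisher, 2))
```

Carried to two decimal places the result is slightly below the rounded figure of 125 obtained in the worked solution, because the solution rounds the intermediate ratios.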


2. Weighted Average of Price Relatives
This method is similar to the simple average of price relatives method with the
fundamental difference that, explicit weights are assigned to each commodity
included in the index. Since price relatives are in percentages, the weights used
are value weights.
The following steps are taken in the construction of weighted average of
price relatives index:
(i) Calculate the price relatives, $\left( \frac{P_1}{P_0} \times 100 \right)$, for each commodity.

(ii) Determine the value weight of each commodity in the group by multiplying
its price in base year by its quantity in the base year, i.e., calculate P0q0
for each commodity. If, however, current year quantities are given, then
the weights shall be represented by P1q1.
(iii) Multiply the price relative of each commodity by its value weight as
calculated in (ii).


(iv) Sum up the products obtained under (iii).


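(v) Divide the total obtained in (iv) above by the total of the value weights.
Symbolically, the index number obtained by the method of weighted average of
price relatives is:

$$P_{01} = \frac{\sum \left[ \left( \frac{P_1}{P_0} \times 100 \right) P_0 q_0 \right]}{\sum P_0 q_0} \quad \text{or} \quad P_{01} = \frac{\sum PV}{\sum V}$$

where P denotes the price relative and V the value weight (P0q0).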

This method is also known as Family Budget method.


Example 4.9: Calculate consumer price index using weighted average of price
relatives method for the year 1986 with 1985 as base for the following data:
                            Price (in Rs)
Commodity    Quantity     1985      1986
A              100          8        12
B               25          6         8
C               10          5        15
D               20         10        25

Solution: Calculation of consumer price index

Commodity    q0    P0    P1    Price relative          P0q0       PV
                               (P1/P0) × 100, or P     or V
A           100     8    12         150.00              800     120000
B            25     6     8         133.33              150      20000
C            10     5    15         300.00               50      15000
D            20    10    25         250.00              200      50000
Total                                               ΣV = 1200  ΣPV = 205000

Weighted average of price relative index, or consumer price index:

$$P_{01} = \frac{\sum \left[ \left( \frac{P_1}{P_0} \times 100 \right) P_0 q_0 \right]}{\sum P_0 q_0} = \frac{\sum PV}{\sum V} = \frac{205000}{1200} = 170.83$$
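The weighted average of price relatives can be sketched as follows with the data of Example 4.9. The names are illustrative. As a cross-check, the sketch also computes the Laspeyres index: with base year value weights (P0q0) the two are algebraically identical, because P0 cancels in each product.

```python
# Each row is (q0, P0, P1): base year quantity, 1985 price, 1986 price.
rows = [(100, 8, 12), (25, 6, 8), (10, 5, 15), (20, 10, 25)]

total_weighted_relatives = 0.0  # sum of PV
total_weights = 0.0             # sum of V
for q0, p0, p1 in rows:
    relative = p1 / p0 * 100    # price relative P
    weight = p0 * q0            # base year value weight V
    total_weighted_relatives += relative * weight
    total_weights += weight

weighted_relative_index = total_weighted_relatives / total_weights

# Cross-check: Laspeyres index on the same data.
laspeyres = sum(p1 * q0 for q0, p0, p1 in rows) / sum(p0 * q0 for q0, p0, p1 in rows) * 100

print(round(weighted_relative_index, 2), round(laspeyres, 2))
```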

Problems in the Construction of Index Numbers


Different problems are faced in the construction of different types of index
numbers. We shall deal here with only those problems that must be tackled
before constructing index numbers of prices.
Definition of Purpose
It is absolutely necessary that the purpose of the index numbers be rigorously
defined. This would help in deciding the nature of data to be collected, the
choice of the base year, the formula to be used and other related matters. For
example, if an index number is intended to measure consumer prices, it must
not include wholesale prices. Similarly, if a consumer price index number is
intended to measure the changes in the cost of living of families with low incomes,
great care should be exercised not to include goods ordinarily used by
middle-income and upper-income groups. In fact, before constructing index numbers,
we must precisely know what we want to measure, and what we intend to use
this measurement for.
Selection of a Base Period
In order to make comparison between prices referring to several time periods,
some point of reference is almost always established. This point of reference is
called the base period. The prices of a certain time period are taken as the
standard, and assigned the value of 100 per cent. Though the selection of the
base period would primarily depend upon the purpose of the index, there are
two important guidelines to consider in choosing a base.
(i) The base period should be a period of normal and stable economic
conditions. It should be free from abnormalities and random or irregular
fluctuations like wars, earthquakes, famines, strikes, lock-outs, booms,
depressions, etc.
(ii) The base year should not be too distant in the past. Since the index
numbers are useful in decision-making, and economic practices are often
a matter of the short run, we should choose a base which is relatively
close to the year being studied.
Fixed Base and Chain Base. While selecting the base year, a decision
has to be made whether the base shall remain fixed or not. If the period of
comparison is fixed for all current years, it is called fixed base method. If, on the
other hand, the prices of the current year are linked with the prices of the
preceding year and not with the fixed year or period, it is called chain base
method. Chain base method is useful in cases where there are quick and frequent


changes in fashion, tastes and habits of the people. In such cases comparison
with the preceding year is more worthwhile.
Selection of Commodities or Items
While constructing an index number, it is not possible to take into account all
the items whose price changes are to be represented by the index number.
Hence, the need for selecting a sample. For example, while constructing a
general purpose wholesale price index, it is impossible to take all the items.
Thus, only a few representative items are selected from the whole lot. While
selecting the sample, the following points should be kept in mind.
(i) The selected commodity or item should be representative of the tastes,
customs and necessities of the people to whom the index number relates.
(ii) It should be stable in quality and as far as possible should be standardized
or graded so that it can easily be identified after a time lapse.
(iii) The sample should be as large as possible. Theoretically, the larger the
number of items, the more accurate would be the results disclosed by an
index number. But it must be noted that, larger the number of items, the
greater shall be the cost and time taken.
(iv) As different varieties of a commodity are sold in the market, a decision
has to be made as to which variety should be included in the index
numbers. Ordinarily, all those varieties which are in common use should
be included.
Obtaining Price Quotations
After selecting the items, the next problem is to collect their prices. The price of
a commodity varies from place to place and even from shop to shop in the
same market. Just as it is not possible to include all the commodities in an index
number, it is similarly impractical to collect price quotations from all places where
a commodity is bought or sold. Thus, a selection is to be made of representative
places and shops. Generally, such places and shops are selected where the
commodity is bought and sold in large quantities. After selecting the places and
shops from where price quotations are to be obtained, the next step is to appoint
some representatives who will supply the price quotations from time to time.
Since prices can be quoted in two ways, i.e., either by expressing the
quantity of commodity per unit of money or by expressing the quantity of money
per unit of commodity, a decision has to be made regarding the manner in
which prices are to be quoted. It is better to quote the price of a commodity X as
50 paise per kg rather than quoting it as 2 kg per one rupee.


Another decision in regard to price quotations is whether the wholesale


prices or the retail prices are to be collected. In general, the larger the number
of quotations, the better it is. Ordinarily, however, at least one quotation per
week in case of weekly indices, and at least four quotations per month in case
of monthly indices are essential.
Choice of Average
Since index numbers are specialized averages, a decision has to be made as
to which particular average (i.e., arithmetic mean, mode, median, harmonic
mean or geometric mean) should be used for the construction of index numbers.
Mode, median and harmonic mean are almost never used in the construction of
index numbers.
Therefore, a choice has to be made between arithmetic mean and
geometric mean. Though, theoretically, geometric mean is better for the purpose,
arithmetic mean due to its simplicity of computation is more commonly used.
Choice of Weights
A suitable method is devised by which the varying importance of different items
is taken into account. This is done by assigning weights. The term 'weight'
refers to the relative importance of different items in the construction of the index.
There are two methods of assigning weights, (i) Implicit, and (ii) Explicit.
In the case of implicit weighting, a commodity or its variety is included in the
index a number of times. In the case of explicit weighting, on the other hand,
some outward evidence of importance of various items in the index is given.
Selection of an Appropriate Formula
A large number of formulae have been devised for constructing index numbers.
A decision has, therefore, to be made as to which formula is the most suitable
for the purpose depending upon the availability of the data regarding the prices
and quantities of the selected commodities in the base and/or current year.
Quantity or Volume Index Numbers
Price indices measure changes in the price level of certain commodities. On
the other hand, quantity or volume index numbers measure the changes in the
physical volume of goods produced, distributed or consumed. These indices
are important indicators of the level of output in the economy or in parts of it.
The quantity indices can be obtained easily by replacing p by q and vice
versa, in the various formulae discussed earlier.
The quantity index by different methods is:


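(i) Laspeyres' Method:

$$Q_{01} = \frac{\sum q_1 P_0}{\sum q_0 P_0} \times 100$$

(ii) Paasche's Method:

$$Q_{01} = \frac{\sum q_1 P_1}{\sum q_0 P_1} \times 100$$

(iii) Bowley-Drobisch Method:

$$Q_{01} = \frac{\frac{\sum q_1 P_0}{\sum q_0 P_0} + \frac{\sum q_1 P_1}{\sum q_0 P_1}}{2} \times 100$$

(iv) Marshall-Edgeworth Method:

$$Q_{01} = \frac{\sum q_1 (P_0 + P_1)}{\sum q_0 (P_0 + P_1)} \times 100$$

(v) Fisher's Ideal Index:

$$Q_{01} = \sqrt{\frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}} \times 100$$

(vi) Kelly's Method:

$$Q_{01} = \frac{\sum q_1 P}{\sum q_0 P} \times 100$$
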

Example 4.10: Compute quantity index for the year 1982 with base 1980
= 100, for the following data, using (i) Laspeyres' method, (ii) Paasche's method,
(iii) Bowley-Drobisch method, (iv) Marshall-Edgeworth method, and (v) Fisher's
ideal formula.

                   Prices               Quantities
Commodity      1980     1982         1980     1982
A              5.00     6.50           5        7
B              7.75     8.80           6       10
C              9.63     7.75           4        6
D             12.50    12.75           9        9

Solution: Computation of quantity index

Commodity    P0     q0     P1     q1     q0P0     q0P1     q1P0     q1P1
A           5.00     5    6.50     7    25.00    32.50    35.00    45.50
B           7.75     6    8.80    10    46.50    52.80    77.50    88.00
C           9.63     4    7.75     6    38.52    31.00    57.78    46.50
D          12.50     9   12.75     9   112.50   114.75   112.50   114.75
Total                                  222.52   231.05   282.78   294.75

Σq0P0 = 222.52, Σq0P1 = 231.05, Σq1P0 = 282.78, Σq1P1 = 294.75

(i) Laspeyres' quantity index:

$$Q_{01} = \frac{\sum q_1 P_0}{\sum q_0 P_0} \times 100 = \frac{282.78}{222.52} \times 100 = 127.08$$

(ii) Paasche's quantity index:

$$Q_{01} = \frac{\sum q_1 P_1}{\sum q_0 P_1} \times 100 = \frac{294.75}{231.05} \times 100 = 127.57$$

(iii) Bowley-Drobisch quantity index:

$$Q_{01} = \frac{\frac{\sum q_1 P_0}{\sum q_0 P_0} + \frac{\sum q_1 P_1}{\sum q_0 P_1}}{2} \times 100 = \frac{\frac{282.78}{222.52} + \frac{294.75}{231.05}}{2} \times 100 = \frac{1.2708 + 1.2757}{2} \times 100$$

= 127.325

(iv) Marshall-Edgeworth quantity index:

$$Q_{01} = \frac{\sum q_1 P_0 + \sum q_1 P_1}{\sum q_0 P_0 + \sum q_0 P_1} \times 100 = \frac{282.78 + 294.75}{222.52 + 231.05} \times 100$$

= 127.33

(v) Quantity index by Fisher's ideal formula:

$$Q_{01} = \sqrt{\frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}} \times 100 = \sqrt{\frac{282.78}{222.52} \times \frac{294.75}{231.05}} \times 100$$

= 1.273 × 100
= 127.3
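The quantity indices follow from the price formulae by interchanging the roles of prices and quantities. The sketch below reproduces the five results of Example 4.10; the variable names are illustrative and not part of the original text.

```python
from math import sqrt

# Each row is (P0, q0, P1, q1) for commodities A to D (base 1980, current 1982).
rows = [(5.00, 5, 6.50, 7), (7.75, 6, 8.80, 10), (9.63, 4, 7.75, 6), (12.50, 9, 12.75, 9)]

q0p0 = sum(q0 * p0 for p0, q0, p1, q1 in rows)
q0p1 = sum(q0 * p1 for p0, q0, p1, q1 in rows)
q1p0 = sum(q1 * p0 for p0, q0, p1, q1 in rows)
q1p1 = sum(q1 * p1 for p0, q0, p1, q1 in rows)

laspeyres_q = q1p0 / q0p0 * 100                     # base year prices as weights
paasche_q = q1p1 / q0p1 * 100                       # current year prices as weights
bowley_q = (laspeyres_q + paasche_q) / 2            # arithmetic mean of the two
marshall_q = (q1p0 + q1p1) / (q0p0 + q0p1) * 100    # weights P0 + P1
fisher_q = sqrt(laspeyres_q * paasche_q)            # geometric mean of the two

for name, value in [("Laspeyres", laspeyres_q), ("Paasche", paasche_q),
                    ("Bowley-Drobisch", bowley_q), ("Marshall-Edgeworth", marshall_q),
                    ("Fisher", fisher_q)]:
    print(f"{name}: {value:.2f}")
```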

Value Index Numbers


Value means price times quantity. Thus, a value index V is the sum of the values
of a given year divided by the sum of the values for the base year. The formula,
therefore, is:

$$V = \frac{\sum P_1 q_1}{\sum P_0 q_0} \times 100$$

where V = value index.

In most cases, the value figures given in the formula may be stated more
simply as:

$$V = \frac{\sum V_1}{\sum V_0} \times 100$$

In this type of index, both price and quantity are variable in the numerator.
Weights are not to be applied because they are inherent in the value figures. A
value index, therefore, is an aggregate of values.
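As an illustration (not part of the original text), the column totals already computed in Example 4.8, $\sum P_0 q_0 = 1660$ and $\sum P_1 q_1 = 1340$, give the following value index for 1980 on 1979 as base:

$$V = \frac{\sum P_1 q_1}{\sum P_0 q_0} \times 100 = \frac{1340}{1660} \times 100 \approx 80.72$$

That is, although the price index for the same data is about 125, the total value of the goods fell by roughly 19 per cent because the quantities declined.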
Tests of Consistency
As there are several formulae for constructing index numbers, the problem is to
select the most appropriate formula in a given situation. Irving Fisher has
suggested two tests for selecting an appropriate formula. These are:
(i) Time reversal test
(ii) Factor reversal test
Time reversal test
According to Fisher, the formula for calculating the index should be such that it
gives the same ratio between one point of comparison and another no matter
which of the two is taken as base. In other words, the index number prepared
forward should be the reciprocal of the index number prepared backward. Thus,
if from 1982 to 1983, the prices of a basket of goods have increased from Rs
400 to Rs 800, the index number for 1983 with 1982 as base is 200 per cent.
Now if the index number for 1983 with 1982 as base is 200 per cent, the index
number for 1982 with 1983 as base should be 50 per cent. One figure is the
reciprocal of the other and their product (2 × 0.5) is unity. Therefore, the time
reversal test is satisfied if P01 × P10 = 1, the indices being expressed as ratios
rather than percentages.
Time reversal test is satisfied by:
(i) Fisher's Ideal Formula
(ii) Marshall-Edgeworth Method
(iii) Kelly's Method
(iv) Simple Geometric Mean of Price Relatives
Factor reversal test
According to Fisher, the formula for constructing the index number should permit
not only the interchange of the two times without giving inconsistent results, it
should also permit the interchange of weights without giving inconsistent results.
Simply stated, the test is satisfied if the change in price multiplied by the
change in quantity is equal to the total change in value. Thus, factor reversal
test is satisfied if:

$$P_{01} \times Q_{01} = \frac{\sum P_1 q_1}{\sum P_0 q_0}$$

where P01 represents the change in price in the current year, Q01 represents
the change in quantity in the current year, ΣP1q1 represents total value in the
current year, and ΣP0q0 represents total value in the base year.
The factor reversal test is satisfied only by Fisher's Ideal Formula. Thus,
Fisher's formula satisfies both the time reversal test and the factor reversal test.
Proof
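According to Fisher's Ideal Index:

$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}}$$

$$P_{10} = \sqrt{\frac{\sum P_0 q_1}{\sum P_1 q_1} \times \frac{\sum P_0 q_0}{\sum P_1 q_0}}$$

$$Q_{01} = \sqrt{\frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}}$$

(i) Thus,

$$P_{01} \times P_{10} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1} \times \frac{\sum P_0 q_1}{\sum P_1 q_1} \times \frac{\sum P_0 q_0}{\sum P_1 q_0}} = \sqrt{1} = 1$$

Hence, the time reversal test is satisfied.

(ii) Similarly, according to Fisher's Ideal Formula:

$$P_{01} \times Q_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1} \times \frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}}$$

$$= \sqrt{\frac{\sum P_1 q_1}{\sum P_0 q_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_0}} = \frac{\sum P_1 q_1}{\sum P_0 q_0}$$

Hence, the factor reversal test is also satisfied by Fisher's Ideal Formula.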
Besides these two tests, two other tests have been suggested by some
authors.
These are, (i) Unit test, (ii) Circular test
Unit test
According to the unit test, the formula for constructing index numbers should be
independent of the units in which prices and quantities are quoted. This test is
satisfied by all the formulae discussed above except the simple aggregative index,
whose value changes with the unit in which each price is quoted.
Circular test
This test is just an extension of the time reversal test for more than two periods
and is based on the shiftability of the base period. This test requires the index
number to work in a circular manner such that if an index is constructed for the
year a on base year b, and for the year b on base year c, we should get the
same result as if we calculate directly an index for year a on base year c without
going through b as an intermediary. Thus, if there are three periods a, b and c,
the circular test is satisfied if, denoting the three periods by 0, 1 and 2,

$$P_{01} \times P_{12} \times P_{20} = 1$$

The circular test is satisfied by the index number formulae based on:
(i) Simple aggregate of prices.
(ii) Kelly's method or fixed weighted aggregate of prices.
An index which satisfies this test has the advantage of reducing the
computations every time a change in the base year has to be made. Such
indices can be adjusted from year to year without referring each time to the
original base.


Example 4.11: From the following data, show that Fisher's Ideal Index satisfies
both the time reversal test and the factor reversal test.

                  1980                  1981
Commodity    Price   Quantity      Price   Quantity
A              4       10            5        8
B              6        8            9        7
C             14        5            7       12
D              3       12            6        8
E              5        7            8        5
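Solution: Computation for time reversal test and factor reversal test

Commodity   P0   q0   P1   q1   P0q0   P0q1   P1q0   P1q1
A            4   10    5    8    40     32     50     40
B            6    8    9    7    48     42     72     63
C           14    5    7   12    70    168     35     84
D            3   12    6    8    36     24     72     48
E            5    7    8    5    35     25     56     40
Total                           229    291    285    275

ΣP0q0 = 229, ΣP0q1 = 291, ΣP1q0 = 285, ΣP1q1 = 275

(i) Time reversal test is satisfied when P01 × P10 = 1.
According to Fisher's ideal index,

$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}}$$

and

$$P_{10} = \sqrt{\frac{\sum P_0 q_1}{\sum P_1 q_1} \times \frac{\sum P_0 q_0}{\sum P_1 q_0}}$$

$$P_{01} \times P_{10} = \sqrt{\frac{285}{229} \times \frac{275}{291} \times \frac{291}{275} \times \frac{229}{285}} = \sqrt{1} = 1$$

Hence, time reversal test is satisfied.

(ii) Factor reversal test is satisfied when

$$P_{01} \times Q_{01} = \frac{\sum P_1 q_1}{\sum P_0 q_0}$$

$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}} \quad \text{and} \quad Q_{01} = \sqrt{\frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}}$$

$$P_{01} \times Q_{01} = \sqrt{\frac{285}{229} \times \frac{275}{291} \times \frac{291}{229} \times \frac{275}{285}} = \sqrt{\frac{275 \times 275}{229 \times 229}} = \frac{275}{229} = \frac{\sum P_1 q_1}{\sum P_0 q_0}$$

Hence, the factor reversal test is satisfied.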
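Both tests can also be verified numerically. The sketch below uses the data of Example 4.11 and, for contrast, shows that the Laspeyres index does not pass the time reversal test on the same data. The helper names and the device of swapping tuple positions are illustrative assumptions, not part of the original text.

```python
from math import sqrt, isclose

# Each row is (P0, q0, P1, q1) for commodities A to E (1980 and 1981).
rows = [(4, 10, 5, 8), (6, 8, 9, 7), (14, 5, 7, 12), (3, 12, 6, 8), (5, 7, 8, 5)]


def laspeyres_ratio(rows):
    """Laspeyres index as a ratio (not multiplied by 100)."""
    return sum(a1 * w0 for a0, w0, a1, w1 in rows) / sum(a0 * w0 for a0, w0, a1, w1 in rows)


def paasche_ratio(rows):
    """Paasche index as a ratio (not multiplied by 100)."""
    return sum(a1 * w1 for a0, w0, a1, w1 in rows) / sum(a0 * w1 for a0, w0, a1, w1 in rows)


def fisher_ratio(rows):
    """Fisher's ideal index as a ratio: geometric mean of the two ratios above."""
    return sqrt(laspeyres_ratio(rows) * paasche_ratio(rows))


# Interchange the two periods (1981 becomes the base) for the time reversal test.
swapped_periods = [(p1, q1, p0, q0) for p0, q0, p1, q1 in rows]
# Interchange prices and quantities to obtain the quantity index.
swapped_factors = [(q0, p0, q1, p1) for p0, q0, p1, q1 in rows]

p01 = fisher_ratio(rows)
p10 = fisher_ratio(swapped_periods)
q01 = fisher_ratio(swapped_factors)
value_ratio = sum(p1 * q1 for p0, q0, p1, q1 in rows) / sum(p0 * q0 for p0, q0, p1, q1 in rows)

print("Fisher time reversal holds:", isclose(p01 * p10, 1.0))
print("Fisher factor reversal holds:", isclose(p01 * q01, value_ratio))

laspeyres_product = laspeyres_ratio(rows) * laspeyres_ratio(swapped_periods)
print("Laspeyres P01 x P10 =", round(laspeyres_product, 4))
```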


Fixed and Chain Base Indices
As stated earlier, the base may be fixed or changing. It is said to be fixed when
the period of comparison or the base year is fixed for all current years. Thus, if
the indices of 1971, 1972, 1973 and 1974 are all calculated with 1970 as the
base year, such indices will be called fixed base indices. If, on the other hand,
the whole series of index numbers is not related to any one base period, but the
indices for different years are obtained by relating each year's price to that of
the immediately preceding year, the indices so obtained are called chain base
indices. For example, in the case of chain base indices, for 1974, 1973 will be
the base; for 1973, 1972 will be the base; for 1972, 1971 will be the base, and
so on. The relatives obtained by the chain base method are called link relatives,
whereas the relatives obtained by the fixed base method are called chain
relatives.
Example 4.12: From the following data relating to the wholesale prices of wheat
for six years, construct index numbers using (i) 1980 as base, and (ii) By chain
base method.
Year     Price (per quintal)     Year     Price (per quintal)
               Rs.                              Rs.
1980           100               1983           130
1981           120               1984           140
1982           125               1985           150

Solution: (i) Computation of index numbers with 1980 as base

Year     Price of wheat     Index number (1980 = 100)
1980          100           100
1981          120           (120/100) × 100 = 120
1982          125           (125/100) × 100 = 125
1983          130           (130/100) × 100 = 130
1984          140           (140/100) × 100 = 140
1985          150           (150/100) × 100 = 150

(ii) Construction of link relative indices (chain base method)

Year     Price of wheat     Link relative index
1980          100           100
1981          120           (120/100) × 100 = 120
1982          125           (125/120) × 100 = 104.167
1983          130           (130/125) × 100 = 104
1984          140           (140/130) × 100 = 107.692
1985          150           (150/140) × 100 = 107.14

Conversion of Link Relatives into Chain Relatives


Chain relatives or chain indices can be obtained either directly or by converting
link relatives into chain relatives with the help of the following formula:
$$\text{Chain relative for current year} = \frac{\text{Link relative for the current year} \times \text{Chain relative for the previous year}}{100}$$

Taking the data from Example 4.12, we can show the method of conversion
as follows:

Year     Price of wheat     Link relative     Chain relative
1980          100              100.00         100
1981          120              120.00         (120 × 100)/100 = 120
1982          125              104.167        (104.167 × 120)/100 = 125
1983          130              104.00         (104 × 125)/100 = 130
1984          140              107.692        (107.692 × 130)/100 = 140
1985          150              107.14         (107.14 × 140)/100 = 150
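The link relatives of Example 4.12 and their conversion back into chain relatives can be sketched as below. The variable names are illustrative and not part of the original text.

```python
# Wholesale price of wheat (Rs. per quintal) from Example 4.12.
prices = {1980: 100, 1981: 120, 1982: 125, 1983: 130, 1984: 140, 1985: 150}
years = sorted(prices)

# Link relative: each year's price as a percentage of the preceding year's price.
link = {years[0]: 100.0}
for previous, current in zip(years, years[1:]):
    link[current] = prices[current] / prices[previous] * 100

# Chain relative: link relative of the current year times the chain relative
# of the previous year, divided by 100.
chain = {years[0]: 100.0}
for previous, current in zip(years, years[1:]):
    chain[current] = link[current] * chain[previous] / 100

for year in years:
    print(year, round(link[year], 3), round(chain[year], 2))
```

With a single commodity, the chain relatives coincide with the fixed base indices (1980 = 100), as the table above shows.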

Base Shifting
Sometimes, it becomes necessary to shift the base from one period to another.
This becomes necessary either because the previous base has become too old
and useless for comparison purposes or because comparison has to be made
with another series of index numbers having different base period. This can be
done in two ways,
(i) By reconstructing the series with the new base. This means that the
relatives of each individual item are constructed with the new base and
thus an entirely new series is formed.
(ii) By using a shorter method which is as follows: divide each index number
of the series by the index number of the time period selected as new
base and multiply the quotient by 100. Symbolically,

$$\text{Index number (based on new base year)} = \frac{\text{Current year's old index number}}{\text{New base year's old index number}} \times 100$$

Example 4.13: The following are the index numbers of prices with 1939 as
base,
Year:            1939    1940    1945    1950    1955    1960
Index Number:     100     110     120     200     400     380

Shift the base to the year 1950.


Solution: Index numbers with 1950 as base (1950 = 100)

Year     Index number (1939 = 100)     Index number (1950 = 100)
1939              100                  (100/200) × 100 = 50
1940              110                  (110/200) × 100 = 55
1945              120                  (120/200) × 100 = 60
1950              200                  (200/200) × 100 = 100
1955              400                  (400/200) × 100 = 200
1960              380                  (380/200) × 100 = 190
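The shorter method of base shifting amounts to one division per index number. A minimal sketch with the series of Example 4.13 (the function name `shift_base` is an illustrative assumption):

```python
# Index numbers of prices with 1939 = 100, from Example 4.13.
old_series = {1939: 100, 1940: 110, 1945: 120, 1950: 200, 1955: 400, 1960: 380}


def shift_base(series, new_base_year):
    """Divide every index by the old index of the new base year and multiply by 100."""
    base_value = series[new_base_year]
    return {year: value / base_value * 100 for year, value in series.items()}


shifted = shift_base(old_series, 1950)
for year, value in shifted.items():
    print(year, round(value, 2))
```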

Splicing
Sometimes, an index number series is discontinued because its base has
become too old and so it has lost its utility. A new series of index numbers may
be computed with some recent year as base. For example, the weights of an
index number may have become out of date and a new index with new weights
may be constructed. This would result in two series of index numbers. It may
sometimes be necessary to connect the two series of index number into one
continuous series. The procedure employed for connecting an old series of
index numbers with a revised series, in order to make the series continuous is
called splicing. The process of splicing is very simple and is similar to the one
used in shifting the base. The spliced index numbers are calculated with the
help of the following formula:
$$\text{Spliced index number} = \frac{\text{Current year's new index number} \times \text{New base year's old index number}}{100}$$


Example 4.14: Index A was started in 1969 and continued up to 1975, in which
year another index B was started. Splice the index B to index A so that a
continuous series of index numbers from 1969 up to date may be available:

Year:                        1969   1970   1971   1972   1973   1974   1975
(A) Index Numbers (Old):      100    120    130    200    300    350    400

Year:                        1975   1976   1977   1978   1979   1980
(B) Index Numbers (New):      100    110     90    110     98     96

Solution: Index B spliced to Index A

Year     Old Index Nos.     New Index Nos.     Index B Spliced to Index A
                                               (Base 1969 = 100)
1969          100
1970          120
1971          130
1972          200
1973          300
1974          350
1975          400                100           (400 × 100)/100 = 400
1976                             110           (400 × 110)/100 = 440
1977                              90           (400 × 90)/100 = 360
1978                             110           (400 × 110)/100 = 440
1979                              98           (400 × 98)/100 = 392
1980                              96           (400 × 96)/100 = 384

Splicing is very useful for making comparison between new and old index
numbers.
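Splicing can be sketched as rescaling the new series so that it agrees with the old one in the overlapping year. The code below uses the data of Example 4.14; the function name is illustrative. It divides by the new index of the overlap year rather than by a fixed 100, which is the same thing here because index B equals 100 in 1975.

```python
# Index A (1969 = 100) and index B (1975 = 100) from Example 4.14.
old_index = {1969: 100, 1970: 120, 1971: 130, 1972: 200, 1973: 300, 1974: 350, 1975: 400}
new_index = {1975: 100, 1976: 110, 1977: 90, 1978: 110, 1979: 98, 1980: 96}


def splice_new_to_old(old_index, new_index, overlap_year):
    """Express the new series on the old base and append it to the old series."""
    factor = old_index[overlap_year] / new_index[overlap_year]
    spliced = dict(old_index)
    for year, value in new_index.items():
        spliced[year] = value * factor
    return spliced


spliced = splice_new_to_old(old_index, new_index, 1975)
for year in sorted(spliced):
    print(year, round(spliced[year], 2))
```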

Deflating
Deflating is the process of making allowances for the effect of changing price
levels. With increasing price levels, the purchasing power of money is reduced.
As a result, the real wage figures are reduced and the real wages become less
than the money wages. To get the real wage figure, the money wage figure may
be reduced to the extent the price level has risen. The process of calculating
the real wages by applying index numbers to the money wages so as to allow
for the change in the price level is called deflating. Thus, deflating is the process
by which a series of money wages or incomes can be corrected for price changes
to find out the level of real wages or incomes. This is done with the help of the
following formula:
$$\text{Real wage} = \frac{\text{Money wage}}{\text{Price index}} \times 100$$

$$\text{Real wage index} = \frac{\text{Real wages for the current year}}{\text{Real wages for the base year}} \times 100$$


Example 4.15: The average of monthly wages in different years is as follows:


Year:          1977  1978  1979  1980  1981  1982  1983
Wages (Rs):     200   240   350   360   360   380   400
Price Index:    100   150   200   220   230   250   250
Calculate real wages index numbers.
Solution: Construction of real wage indices
Year   Wages (Rs)   Price index   Real wages                   Real wages index (1977 = 100)
1977   200          100           (200/100) × 100 = 200        (200/200) × 100 = 100
1978   240          150           (240/150) × 100 = 160        (160/200) × 100 = 80
1979   350          200           (350/200) × 100 = 175        (175/200) × 100 = 87.5
1980   360          220           (360/220) × 100 = 163.63     (163.63/200) × 100 = 81.81
1981   360          230           (360/230) × 100 = 156.52     (156.52/200) × 100 = 78.26
1982   380          250           (380/250) × 100 = 152        (152/200) × 100 = 76
1983   400          250           (400/250) × 100 = 160        (160/200) × 100 = 80
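
The two deflating formulas can be checked against Example 4.15 with a short Python sketch. The function names real_wage and real_wage_index are hypothetical, chosen only for this illustration.

```python
def real_wage(money_wage, price_index):
    """Real wage = (money wage / price index) x 100."""
    return money_wage * 100 / price_index


def real_wage_index(current_real_wage, base_real_wage):
    """Real wage index = (current real wage / base-year real wage) x 100."""
    return current_real_wage * 100 / base_real_wage


# Data from Example 4.15
years = [1977, 1978, 1979, 1980, 1981, 1982, 1983]
wages = [200, 240, 350, 360, 360, 380, 400]          # money wages (Rs)
price_index = [100, 150, 200, 220, 230, 250, 250]    # 1977 = 100

real = [real_wage(w, p) for w, p in zip(wages, price_index)]
base = real[0]  # 1977 is the base year
index = [real_wage_index(r, base) for r in real]

for y, r, i in zip(years, real, index):
    print(y, round(r, 2), round(i, 2))
```

Rounding to two decimals gives 163.64 and 81.82 for 1980, whereas the worked table truncates these figures to 163.63 and 81.81; the other years agree exactly.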

Uses and Importance of Index Numbers


Index numbers have become indispensable for analysing economic and business
conditions, although they are used in almost all sciences: natural, social and
physical. The main uses of index numbers can be summarized as follows:
1. They help in framing suitable policies
Index numbers of the data relating to prices, production, profits, imports
and exports, personnel and financial matters are indispensable for any
organization in framing suitable policies and formulation of executive
decisions. For example, the cost of living index numbers help the
employers in deciding the increase in dearness allowance of their
employees or adjusting their salaries and wages in accordance with
changes in their cost of living.
2. Index numbers help in studying trends and tendencies
Since index numbers study the relative changes in the level of a
phenomenon over a period of time, the time series so formed enable us
to study the general trend of the phenomenon under study. For example, by
studying the index numbers of wholesale prices in India for the last ten
years, we can say that the general price level in India is showing an upward
trend as it is rising year after year. Similarly, by examining the index
numbers of production (industrial and agricultural), volume of trade, imports
and exports, etc., for the last few years, we can draw useful conclusions
about the trend of production and business activity.
3. Index numbers are very useful in deflating
In time-series analysis, index numbers are used to adjust the original
data for price changes, or to adjust wage changes for cost of living changes
and thus transform nominal wages into real wages. Moreover, nominal
income can be transformed into real income, and nominal sales into real
sales through appropriate index numbers.
Activity 1
Collect data on wholesale prices of rice for 5 continuous years starting
from year 2005. Construct index numbers using 2005 as base.


Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Index number shows by its ______________ the changes in a
magnitude which is not susceptible either to accurate measurement
in itself or to direct valuation in practice.
(b) _________________ is the process of making allowances for the
effect of changing price levels.
2. State whether true or false.
(a) The simple average of price relative method is superior to the simple
aggregate of prices method.
(b) The term weight refers to the relative importance of similar items in
the construction of index.

4.3 Summary
Let us recapitulate the important concepts discussed in this unit:

Index numbers are a specialized type of average. They are designed to
measure the relative change in the level of a phenomenon with respect to
time, geographical locations or some other characteristics.
In case of unweighted indices, weights are not expressly assigned,
whereas in the weighted indices, weights are expressly assigned to the
various items.
Weighted indices are similar to the simple aggregative type. The
fundamental difference is that weights are assigned explicitly to the various
items included in the index.
It is absolutely necessary that the purpose of the index numbers be rigorously
defined. This would help in deciding the nature of data to be collected, the
choice of the base year, the formula to be used and other related matters.
While selecting the base year, a decision has to be made whether the
base shall remain fixed or not. If the period of comparison is fixed for all
current years, it is called fixed base method. If, on the other hand, the
prices of the current year are linked with the prices of the preceding year
and not with the fixed year or period, it is called chain base method.


Value means price times quantity. Thus, a value index V is the sum of
the value of a given year divided by the sum of the values for the base
year.
Deflating is the process of making allowances for the effect of changing
price levels. With increasing price levels, the purchasing power of money
is reduced.

4.4 Glossary
Index numbers: The index number measures the relative change in the
magnitude of a group of related, distinct variables in two or more situations.
Index numbers can be used to measure changes in prices, wages,
production, employment, national income, etc., over a period of time.
Splicing: The process employed for connecting an old series of index
numbers with a revised series in order to make the series continuous.
Deflating: The process of making allowances for the effect of changing
price levels.

4.5 Terminal Questions


1. Explain the importance of index numbers.
2. Broadly discuss the division of the two methods of construction of index
numbers.
3. Describe the Marshall-Edgeworth method for constructing index numbers.
4. Why is it necessary to define the purpose of index numbers?
5. Differentiate between fixed and chain based indices.

4.6 Answers
Answers to Self-Assessment Questions
1. (a) Variations; (b) Deflating
2. (a) True; (b) False


Answers to Terminal Questions


1. Refer Section 4.2
2. Refer Section 4.2.1
3. Refer Section 4.2.1
4. Refer Section 4.2.1
5. Refer Section 4.2.1

4.7 Further Reading


1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007
2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand
& Sons, 2010


Unit 5

Data Representation

Structure
5.1 Introduction
Objectives
5.2 Tables
5.3 Graphs
5.4 Diagrams
5.5 Summary
5.6 Glossary
5.7 Terminal Questions
5.8 Answers
5.9 Further Reading

5.1 Introduction
In the previous unit, you learnt about index numbers, which are a specialized
type of average.
In this unit, you will learn about the construction of tables, diagrams and
graphs and how important these are to a business and their usages. In any type
of business firm, a large amount of raw data is generated from various business
sources. Such data becomes quite cumbersome and confusing for management
to handle and analyse. In a business firm, data can be of various types, relating
to various categories such as number of each item of the inventory, record of
sales from different departments, keeping an account of all kinds of bills and so
on. It is almost impossible for management to deal with all this data in raw form.
Therefore, such data must be presented in a suitable and summarized form
without any loss of relevant information so that it can be efficiently used for
decision-making. Hence, we construct appropriate tables, graphs and diagrams
to interpret and summarize the entire set of raw data.
In view of the ever increasing importance of statistical data in business
operations and their management, this unit discusses the presentation of data
in the form of graphs, tables and diagrams, their importance and use.


Objectives
After studying this unit, you should be able to:
Explain the types of tables, graphs and diagrams
Construct tables, graphs and diagrams
Describe the concept of frequency polygon and relative frequency
Explain the construction of ogive curves and their types
Construct histograms
Represent and evaluate data in diagrammatic and graphic forms

5.2 Tables
Classification of data is usually followed by tabulation, which is considered the
mechanical part of classification.
Tabulation is the systematic arrangement of data in columns and rows.
The analysis of data is done by arranging the columns and rows to facilitate
comparisons.
Tabulation has the following objectives:
(i) Simplicity. The removal of unnecessary details gives a clear and
concise picture of the data
(ii) Economy of space and time
(iii) Ease in comprehension and remembering
(iv) Facility of comparisons. Comparisons within a table and with other
tables may be made
(v) Ease in handling of totals, analysis, interpretation, etc.

5.2.1 Construction of Tables


A table is constructed depending on the type of information to be presented and
the requirements of statistical analysis. The following are the essential features
of a table:
(i) Title. It should have a clear and relevant title, which describes the contents
of the table. The title should be brief and self explanatory.
(ii) Stubs and captions. It should have clear headings and sub headings.
Column headings are called captions and row headings are called stubs.
The stubs are usually wider than the captions.


(iii) Unit. It should indicate all the units used.
(iv) Body. The body of the table should contain all information arranged
according to description.
(v) Headnote. The headnote or prefatory note, placed just below the title, in
a less prominent type, gives some additional explanation about the table.
Sometimes, the headnote consists of the unit of measurement.
(vi) Footnotes. A footnote at the bottom of the table may clarify some omissions
or special features.
A source note gives information about the source used, if any.
(vii) Arrangement of data. Data may be arranged according to requirements
in chronological, alphabetical, geographical, or any other order.
(viii) Emphasis. The items to be emphasized may be put in different print or
marked suitably.
(ix) Other details. Percentages, ratios, etc. should be shown in separate
columns. Thick and thin lines should be drawn at proper places.
A table should be easy to read and should contain only the relevant details.
If the aim of clarification is not achieved, the table should be redesigned.

5.2.2 Types of Tables


Depending on the nature of the data and other requirements, tables may be
divided into various types.
General tables or Reference tables. These contain detailed information
for general use and reference, e.g., tables published by government agencies.
Specific purpose or Derivative tables. These are usually summarized from
general tables and are useful for comparison and analytical purposes. Averages,
percentages etc. are incorporated along with information in these tables.
Simple and Complex tables. A table showing only one characteristic is a
simple table. The complex table shows two or more characteristics or groups of
items.


Table 5.1 is an example of a simple table.

Table 5.1 Cinema Attendance among Adult Male Factory Workers in Bombay, March 1972

Frequency                     Number of Workers
Less than once a month        3780
1 to 4 times a month          1652
More than 4 times a month     1026

Table 5.2 is an example of a complex table and is the result of a survey on
the cinema-going habits of adult factory workers.

Table 5.2 Cinema Attendance among Adult Male Factory Workers in Bombay, March 1972

Cinema Attendance             Single                  Married
Frequency                     Under 30    Over 30     Under 30    Over 30
Less than once a month        122         374         1404        1880
1 to 4 times a month          1046        202         289         115
More than 4 times a month     881         23          112         10
Total                         2049        599         1805        2005

It is obvious that the tabular form of classification of data is a great
improvement over the narrative form.
Frequently, table construction involves deciding which attribute should be
taken as primary and which as secondary. For the previous table, we can also
consider whether it would be improved further if 'under 30' and '30 and over'
had been the main column headings and 'single' and 'married' the sub headings.
The modifications depend on the purpose of the table. If the activities of age
groups are to be compared, it is best left as it stands. But if a comparison
between men of different marital status is required, the change would be an
improvement.
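
One way to experiment with the choice of primary and secondary attribute is to hold the counts of Table 5.2 keyed by all three attributes and regroup them as needed. The following Python sketch assumes that representation; the helper totals_by is hypothetical and not part of the text.

```python
from collections import defaultdict

# Counts from Table 5.2, keyed by (attendance, marital status, age group).
counts = {
    ("Less than once a month", "Single", "Under 30"): 122,
    ("Less than once a month", "Single", "Over 30"): 374,
    ("Less than once a month", "Married", "Under 30"): 1404,
    ("Less than once a month", "Married", "Over 30"): 1880,
    ("1 to 4 times a month", "Single", "Under 30"): 1046,
    ("1 to 4 times a month", "Single", "Over 30"): 202,
    ("1 to 4 times a month", "Married", "Under 30"): 289,
    ("1 to 4 times a month", "Married", "Over 30"): 115,
    ("More than 4 times a month", "Single", "Under 30"): 881,
    ("More than 4 times a month", "Single", "Over 30"): 23,
    ("More than 4 times a month", "Married", "Under 30"): 112,
    ("More than 4 times a month", "Married", "Over 30"): 10,
}


def totals_by(counts, *positions):
    """Sum the counts over every attribute except those at the given key positions."""
    totals = defaultdict(int)
    for key, n in counts.items():
        totals[tuple(key[p] for p in positions)] += n
    return dict(totals)


# Marital status as primary heading, age as secondary (the layout of Table 5.2).
by_status_then_age = totals_by(counts, 1, 2)
# Age as primary heading, marital status as secondary (the alternative layout).
by_age_then_status = totals_by(counts, 2, 1)
# Dropping both attributes leaves a simple table of attendance only.
by_attendance = totals_by(counts, 0)

print(by_status_then_age)
print(by_age_then_status)
print(by_attendance)
```

Both groupings contain the same four column totals; only the order in which the headings nest changes, which is exactly the design decision discussed above.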

5.2.3 Advantages of Tabulation of Data


(i) Tabulated data can be more easily understood and grasped than
untabulated data.


(ii) A table facilitates comparisons between subdivisions and with other tables.
(iii) It enables the required figures to be located easily.
(iv) It reveals patterns within the figures which might otherwise not have been
obvious, e.g., from the previous table, we can conclude that regular and
frequent cinema attendance is mainly confined to the younger age group.
(v) It makes the summation of items and the detection of errors and omissions,
easier.
(vi) It obviates repetition of explanatory phrases and headings and hence
takes less space.

Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Tabulation is the _____________ arrangement of data in columns
and rows.
(b) Tabulated data can be more easily understood and grasped than
_____________ data.
2. State whether true or false.
(a) A table showing two characteristics is a simple table.
(b) A table facilitates comparisons between subdivisions and with other
tables.

5.3 Graphs
In a graph, the independent variable should always be placed on the horizontal
or X-axis and the dependent variable on the vertical or Y-axis.

5.3.1 Line Graph


Here, the points are plotted on paper (or graph paper) and joined by straight
lines. Generally, continuous variables are plotted by the line graph.


Example 5.1: The monthly averages of Retail Price Index from 1996 to 2003
(Jan. 1996 = 100) were as follows:
Year:                 1996   1997   1998   1999   2000   2001   2002   2003
Retail Price Index:   100    105.8  109.0  109.6  110.7  114.5  119.3  122.3

Draw a diagram to display these figures.


Solution: Here, years are plotted along the horizontal axis and the retail price
index along the vertical axis.
Erect perpendiculars to the horizontal axis at the points marked for the
years 1996, 1997, ..., 2003 and cut off these ordinates according to the given
retail price index values, so that one point is plotted on the paper for each year.
Join these points by straight lines.

[Figure: line graph of the Retail Price Index (vertical axis, 100 to 125) against Year (horizontal axis, 1996 to 2003)]

5.3.2 Frequency Polygon


A frequency polygon is a line chart of frequency distribution in which, either the
values of discrete variables or midpoints of class intervals are plotted against
the frequencies and these plotted points are joined together by straight lines.
Since the frequencies generally do not start at zero or end at zero, this diagram
as such would not touch the horizontal axis. However, since the area under the
entire curve is the same as that of a histogram which is 100 per cent of the data
presented, the curve can be enclosed so that the starting point is joined with a
fictitious preceding point whose value is zero. This ensures that the start of the
curve is at horizontal axis and the last point is joined with a fictitious succeeding
point whose value is also zero, so that the curve ends at the horizontal axis.
This enclosed diagram is known as the frequency polygon.
We can construct the frequency polygon (Figure 5.1) from Table 5.3
presented for the ages of 30 workers as follows:


Table 5.3 Ages of 30 Workers

Class Interval (years)    Mid-Point    (f)
15 upto 25                20           5
25 upto 35                30           3
35 upto 45                40           7
45 upto 55                50           5
55 upto 65                60           3
65 upto 75                70           7

[Figure: frequency polygon through the points (20, 5), (30, 3), (40, 7), (50, 5), (60, 3) and (70, 7)]

Figure 5.1 Frequency Polygon Curve
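
The construction described above, including the two fictitious zero-frequency points that close the polygon, can be sketched in Python. The function polygon_vertices is a hypothetical helper that assumes class intervals of equal width.

```python
def polygon_vertices(class_intervals, frequencies):
    """Return the (x, y) vertices of a closed frequency polygon.

    class_intervals: list of (lower, upper) boundaries, all of equal width.
    frequencies: class frequencies in the same order.
    """
    midpoints = [(lower + upper) / 2 for lower, upper in class_intervals]
    width = class_intervals[0][1] - class_intervals[0][0]
    # Fictitious classes with zero frequency, one before the first class and
    # one after the last, bring the polygon down to the horizontal axis.
    start = (midpoints[0] - width, 0)
    end = (midpoints[-1] + width, 0)
    return [start] + list(zip(midpoints, frequencies)) + [end]


# Data from Table 5.3
intervals = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
freqs = [5, 3, 7, 5, 3, 7]

vertices = polygon_vertices(intervals, freqs)
print(vertices)
```

Joining consecutive vertices by straight lines gives the enclosed diagram; the first and last vertices lie on the horizontal axis at 10 and 80.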

5.3.3 Relative Frequency


In a frequency distribution, if the frequency in each class interval is converted
into a proportion, dividing it by the total frequency, we get a series of proportions
called relative frequencies. A distribution presented with relative frequencies
rather than actual frequencies is called a relative frequency distribution. The
sum of all relative frequencies in a distribution is 1.
Example 5.2: Calculate relative frequency from the given table.
Class Interval    Frequency
25-35             7
35-45             9
45-55             22
55-65             7
65-75             3
75-85             2

Solution: This example shows that the sum of all relative frequencies in a
distribution is 1.
Class Interval    Frequency    Relative Frequency    Explanation
25-35             7            0.14                  7/50 = 0.14
35-45             9            0.18                  9/50 = 0.18
45-55             22           0.44                  etc.
55-65             7            0.14
65-75             3            0.06
75-85             2            0.04
Total             50           1.00
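
The calculation in Example 5.2 can be reproduced with a few lines of Python. The function name relative_frequencies is hypothetical and used only for this sketch.

```python
def relative_frequencies(frequencies):
    """Divide each class frequency by the total frequency."""
    total = sum(frequencies)
    return [f / total for f in frequencies]


# Frequencies from Example 5.2 (class intervals 25-35 up to 75-85)
freqs = [7, 9, 22, 7, 3, 2]
rel = relative_frequencies(freqs)

print([round(r, 2) for r in rel])
print(round(sum(rel), 2))
```

Because every frequency is divided by the same total, the relative frequencies always add up to 1, whatever the size of the data set.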

The concept of relative frequencies is useful in sampling theory. It can
also be used to compare two frequency distributions that have unequal total
frequencies but the same series of class intervals, as in the following example.
Example 5.3: Compare the following frequency distributions.
Class Interval    f1    f2
10-20             5     12
20-30             10    24
30-40             6     30
40-50             3     19
50-60             1     15

Solution: The following table shows the comparison.


Class Interval    f1    f2     Rel. Freq. f1    Rel. Freq. f2
10-20             5     12     0.20             0.12
20-30             10    24     0.40             0.24
30-40             6     30     0.24             0.30
40-50             3     19     0.12             0.19
50-60             1     15     0.04             0.15
Total             25    100    1.00             1.00

A direct visual comparison of two frequency distributions can be made by
drawing their frequency polygons.
Example 5.4: Draw frequency polygons for the relative frequency distributions
given in Example 5.3.
Solution: The following is the frequency polygon for the relative frequencies as
mentioned in Example 5.3.

[Figure: frequency polygons of the two relative frequency distributions of Example 5.3; Relative Frequency (0.1 to 0.4) on the vertical axis against Class marks (15 to 65) on the horizontal axis]

5.3.4 Ogive Curves


Cumulative frequency curve or ogive is the graphic representation of a cumulative
frequency distribution. Ogives are of two types: one is the 'less than' ogive and
the other is the 'greater than' ogive. Both these ogives are constructed based
on Table 5.4 for the ages of 30 workers.
Table 5.4 Cumulative Frequency

Class Interval (years)    Mid-point    (f)    Cum. Freq. (less than)    Cum. Freq. (greater than)
15 and upto 25            20           5      5 (less than 25)          30 (more than 15)
25 and upto 35            30           3      8 (less than 35)          25 (more than 25)
35 and upto 45            40           7      15 (less than 45)         22 (more than 35)
45 and upto 55            50           5      20 (less than 55)         15 (more than 45)
55 and upto 65            60           3      23 (less than 65)         10 (more than 55)
65 and upto 75            70           7      30 (less than 75)         7 (more than 65)

(i) 'Less than' ogive. In this case, the 'less than' cumulative frequencies are
plotted against the upper boundaries of their respective class intervals.
Figure 5.2 shows the 'less than' ogive.

[Figure: 'less than' cumulative frequency plotted against Class Interval]

Figure 5.2 'Less than' Ogive

(ii) 'Greater than' ogive. In this case, the 'greater than' cumulative frequencies
are plotted against the lower boundaries of their respective class intervals.

[Figure: 'greater than' cumulative frequency plotted against Class Interval]

Figure 5.3 'Greater than' Ogive
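
The two cumulative columns of Table 5.4 and the points to be plotted for each ogive can be derived as in the following Python sketch. The variable names are illustrative only.

```python
from itertools import accumulate

# Class intervals and frequencies for the ages of 30 workers
intervals = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
freqs = [5, 3, 7, 5, 3, 7]

# 'Less than' cumulative frequencies: running total from the first class,
# plotted against the upper class boundaries.
less_than = list(accumulate(freqs))
less_than_points = [(upper, cf) for (_, upper), cf in zip(intervals, less_than)]

# 'Greater than' cumulative frequencies: running total from the last class,
# plotted against the lower class boundaries.
greater_than = list(accumulate(reversed(freqs)))[::-1]
greater_than_points = [(lower, cf) for (lower, _), cf in zip(intervals, greater_than)]

print(less_than_points)
print(greater_than_points)
```

The 'less than' series rises to the total frequency of 30, while the 'greater than' series starts at 30 and falls.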

These ogives can be used for comparison purposes. Several ogives can
be drawn on the same grid, preferably with different colours for easier
visualization and differentiation.
Although diagrams and graphs are powerful and effective media for
presenting statistical data, they can only represent a limited amount of information
and they are not of much help when intensive analysis of data is required.


5.3.5 Histograms
A histogram is a graphical description of data and is constructed from a
frequency table. It displays the distribution of a data set and is used for
statistical as well as mathematical calculations.
The word histogram is derived from the Greek word 'histos', which means
'anything set upright', and 'gramma', which means 'drawing, record, writing'. It is
considered the most important basic tool of the statistical quality control process.
In this type of representation, the given data is plotted in the form of a
series of rectangles. Class intervals are marked along the X-axis and the
frequencies along the Y-axis according to a suitable scale. Unlike the bar chart,
which is one dimensional, meaning that only the length of the bar is important
and not the width, a histogram is two-dimensional in which both the length and
the width are important. A histogram is constructed from a frequency distribution
of grouped data, where the height of the rectangle is proportional to the
respective frequency and the width represents the class interval. Each rectangle
is joined with the other and any blank spaces between the rectangles would
mean that the category is empty and there are no values in that class interval.
Let us construct a histogram for our example of ages of 30 workers. For
convenience's sake, we will present the frequency distribution along with the
midpoint of each interval, where the midpoint is simply the average of the values
of the lower and the upper boundary of each class interval. The frequency
distribution table is shown as follows:
Class Interval (years)    Midpoint    (f)
15 and upto 25            20          5
25 and upto 35            30          3
35 and upto 45            40          7
45 and upto 55            50          5
55 and upto 65            60          3
65 and upto 75            70          7

The histogram of this data would be shown as follows:

[Figure: histogram of the ages of 30 workers; frequency (up to 7) on the vertical axis, Class Interval on the horizontal axis]
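
As a rough stand-in for the drawing, the following Python sketch prints a text histogram for the ages of 30 workers. It is only an illustration: the bars run horizontally rather than vertically, the classes are assumed to be of equal width so that bar length alone reflects frequency, and the function text_histogram is hypothetical.

```python
# Class intervals and frequencies for the ages of 30 workers
intervals = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
freqs = [5, 3, 7, 5, 3, 7]


def text_histogram(intervals, frequencies, symbol="#"):
    """Return one line per class; bar length is proportional to frequency."""
    lines = []
    for (lower, upper), f in zip(intervals, frequencies):
        lines.append(f"{lower:>2} up to {upper:<2} | {symbol * f} {f}")
    return lines


for line in text_histogram(intervals, freqs):
    print(line)
```

No gaps are left between consecutive classes, mirroring the rule that the rectangles of a histogram are joined to one another.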

Activity 1
The following frequency distribution represents the number of days during
a year that the faculty of the college was absent from work due to illness.

Number of Days    Number of Employees
0-2               5
3-5               10
6-8               20
9-11              10
12-14             5
Total             50

(a) Construct a frequency distribution for this data.


(b) Construct a greater than cumulative frequency distribution as well
as a less than cumulative frequency distribution for this data.
(c) How many employees were absent for less than 3 days during the
year?
(d) How many employees were absent for more than 8 days during the
year?
(e) Draw a frequency polygon for this data.


Self-Assessment Questions
3. State whether true or false.
(a) In a graph, the independent variable should always be placed on the
vertical axis.
(b) A distribution presented with relative frequencies rather than actual
frequencies is called a relative frequency distribution.
4. Fill in the blanks with the appropriate terms.
(a) A direct visual comparison of two ____________ distributions can
be made by drawing their frequency polygons.
(b) A histogram is constructed from a frequency distribution of a grouped
data, where the height of the rectangle is _______________ to the
respective frequency and the width represents the class interval.

5.4 Diagrams
The data we collect can often be more easily understood for interpretation if it is
presented graphically or pictorially. Diagrams and graphs give visual indications
of magnitudes, groupings, trends and patterns in the data. These important
features are more simply presented in the form of graphs. Also, diagrams facilitate
comparisons between two or more sets of data.
The diagrams should be clear and easy to read and understand. Too
much information should not be shown in the same diagram; otherwise, it may
become cumbersome and confusing. Each diagram should include a brief and
self explanatory title dealing with the subject matter. The scale of the presentation
should be chosen in such a way that the resulting diagram is of appropriate
size. The intervals on the vertical as well as the horizontal axis should be of
equal size; otherwise, distortions would occur.
Diagrams are more suitable to illustrate data which is discrete, while
continuous data is better represented by graphs. The following are the
diagrammatic and the graphic representation methods that are commonly used.

5.4.1 One Dimensional Diagrams


Bars are simply vertical lines where the lengths of the bars are proportional to
their corresponding numerical values. The width of the bar is unimportant but
all bars should have the same width so as not to confuse the reader of the
diagram. Additionally, the bars should be equally spaced.
Example 5.5: Construct a subdivided bar chart for the three types of expenditures
in dollars for a family of four for the years 1988, 1989, 1990 and 1991 given as
follows:
Year    Food    Education    Other    Total
1988    3000    2000         3000     8000
1989    3500    3000         4000     10500
1990    4000    3500         5000     12500
1991    5000    5000         6000     16000


Solution: The subdivided bar chart would be as follows:


[Figure: A Subdivided Bar Diagram; Expenditure (0 to 16000) on the vertical axis against Year (1988 to 1991) on the horizontal axis, each bar subdivided into Food, Education and Other]


Percentage Component Bars or Divided Bar Charts


When the component lengths in the previous case represent the percentage
of each component (instead of the actual amount), we get percentage
component bar charts. The heights of all the bars will be the same, as shown in
Figure 5.4.


Figure 5.4 Percentage Component Bar Chart showing Expenses and Savings of
Mr X
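
The conversion from actual amounts to percentage components can be sketched in Python. Since the figures behind Figure 5.4 are not given in the text, this sketch applies the idea to the family expenditure data of Example 5.5 instead; the function percentage_components is hypothetical.

```python
# Expenditure data from Example 5.5 (dollars)
expenditure = {
    1988: {"Food": 3000, "Education": 2000, "Other": 3000},
    1989: {"Food": 3500, "Education": 3000, "Other": 4000},
    1990: {"Food": 4000, "Education": 3500, "Other": 5000},
    1991: {"Food": 5000, "Education": 5000, "Other": 6000},
}


def percentage_components(components):
    """Convert component amounts to percentages of their total."""
    total = sum(components.values())
    return {name: amount * 100 / total for name, amount in components.items()}


for year, components in expenditure.items():
    shares = percentage_components(components)
    print(year, {name: round(p, 1) for name, p in shares.items()})
```

Each year's percentages add up to 100, which is why all bars in a percentage component chart have the same height.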

Multiple Bar Charts


In multiple bar charts the interrelated component parts are shown in adjoining
bars, coloured or marked differently, thus allowing comparison between different
parts as shown in Figure 5.5.

Figure 5.5 Multiple Bar Chart showing Expenses and Savings of Mr X

These charts can be used if the overall total is not required. Some charts
given earlier show totals also.

5.4.2 Two Dimensional Diagrams


Two dimensional diagrams take two components of data for representation.
These are also called area diagrams as they consider two dimensions. The
types are rectangles, squares and pie. They can be best explained with the
help of a squares diagram.


Squares: The square diagram is easy and simple to draw. Take the square root
of the values of various given items that are to be shown in the diagrams and
then select a suitable scale to draw the squares.
Example 5.6: The yield of rice in kg per acre for five countries is as follows:

Country                        USA     Australia    UK      Canada    India
Yield of rice (kg per acre)    6400    1600         2500    3600      4900


Represent this data using square diagram.


Solution: To draw the square diagrams calculate as follows:
Country      Yield    Square root    Side of the square in cm
U.S.A        6400     80             4
Australia    1600     40             2
U.K.         2500     50             2.5
Canada       3600     60             3
India        4900     70             3.5

[Figure: five squares with sides 4 cm, 2 cm, 2.5 cm, 3 cm and 3.5 cm]
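
The sides used in Example 5.6 follow from the square roots and a drawing scale. In the Python sketch below, the scale of 1 cm to 20 units of square root is inferred from the worked figures (80 becomes 4 cm) and is an assumption, as is the helper name square_side_cm.

```python
import math

# Yield of rice in kg per acre, from Example 5.6
yields = {"USA": 6400, "Australia": 1600, "UK": 2500, "Canada": 3600, "India": 4900}

# Scale inferred from the worked solution: 1 cm represents 20 units of square root.
UNITS_PER_CM = 20


def square_side_cm(value, units_per_cm=UNITS_PER_CM):
    """Side of the square, in cm, whose area is proportional to the value."""
    return math.sqrt(value) / units_per_cm


sides = {country: square_side_cm(v) for country, v in yields.items()}
print(sides)
```

Because the side is proportional to the square root, the area of each square is proportional to the yield it represents.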

5.4.3 Pie Diagram


This type of diagram enables us to show the partitioning of a total into its
component parts. The diagram is in the form of a circle and is also called a pie
because the entire diagram looks like a pie and the components resemble slices
cut from it. The size of the slice represents the proportion of the component out
of the whole.
Example 5.7: The following figures relate to the cost of the construction of a
house. The various components of cost that go into it are represented as
percentages of the total cost.


Item              % Expenditure
Labour            25
Cement, Bricks    30
Steel             15
Timber, Glass     20
Miscellaneous     10


Construct a pie chart for the above data.


Solution: The pie chart for this data is presented as follows:
[Figure: pie chart with slices Labour 25%, Cement, Bricks 30%, Steel 15%, Timber, Glass 20% and Misc 10%]


Pie charts are very useful for comparison purposes, especially when there
are only a few components. If there are too many components, it may become
confusing to differentiate the relative values in the pie.
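
To draw the slices to proportion, each component's share of the total is converted into a central angle out of 360 degrees. The text does not spell this step out, so the following Python sketch is an illustration under that standard convention; the function pie_angles is hypothetical.

```python
# Cost components from Example 5.7 (percentages of total cost)
costs = {
    "Labour": 25,
    "Cement, Bricks": 30,
    "Steel": 15,
    "Timber, Glass": 20,
    "Miscellaneous": 10,
}


def pie_angles(values):
    """Central angle in degrees of each slice: share of the total times 360."""
    total = sum(values.values())
    return {name: v * 360 / total for name, v in values.items()}


angles = pie_angles(costs)
for name, angle in angles.items():
    print(name, angle)
```

The same function works for raw counts as well as percentages, since it divides by the total before scaling to 360 degrees.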

5.4.4 Three Dimensional Diagrams


Three dimensional diagrams are also termed volume diagrams and consist of
cubes, cylinders, spheres, etc. In these diagrams, three dimensions, namely
length, width and height, are taken into account. When cubes are used, the side
of each cube is drawn in proportion to the cube root of the magnitude of the data.
Example 5.8: Represent the following data using volume diagram.
Category         Number of Students
Undergraduate    64000
Postgraduate     27000
Professionals    8000

Solution: The sides of cubes are calculated as follows:

Category         Number of Students    Cube Root    Side of Cube
Undergraduate    64000                 40           4 cm
Postgraduate     27000                 30           3 cm
Professionals    8000                  20           2 cm

[Figure: three cubes with sides 4 cm, 3 cm and 2 cm]

Activity 2
The following table represents the racial breakdown of people in the Flushing
area in Queens, New York.
Race      White      Black     Hispanic    Asians    Others
Number    205,000    30,520    20,300      15,650    5,400

Construct a pie chart to represent this data. (Make sure that the slices of
the pie proportionately represent the various ethnic populations.)

Self-Assessment Questions
5. Fill in the blanks with the appropriate terms.
(a) Each diagram should include a brief and self ______________ title
dealing with the subject matter.
(b) Bars are simply vertical lines where the ______________ of the bars
are proportional to their corresponding numerical values.
6. State whether true or false.
(a) Diagrams and graphs give visual indications of magnitudes,
groupings, trends and patterns in the data.
(b) Diagrams facilitate comparisons between two or more sets of data.


5.5 Summary
Let us recapitulate the important concepts discussed in this unit:
Classification of data is usually followed by tabulation, which is considered
the mechanical part of classification.
Tabulation is the systematic arrangement of data in columns and rows.
The analysis of the data is done by arranging the columns and rows to
facilitate comparisons.
A table should be easy to read and should contain only the relevant details.
If the aim of clarification is not achieved, the table should be redesigned.
In a graph, the independent variable should always be placed on the
horizontal or X-axis and the dependent variable on the vertical or Y-axis.
A frequency polygon is a line chart of frequency distribution in which the
values of discrete variables or midpoints of class intervals are plotted
against the frequencies and these plotted points are joined together by
straight lines.
In a frequency distribution, if the frequency in each class interval is
converted into a proportion, dividing it by the total frequency, we get a
series of proportions called relative frequencies.
Cumulative frequency curve or ogive is the graphic representation of a
cumulative frequency distribution. Ogives are of two types, 'less than'
and 'greater than' ogives.
A histogram is a graphical description of data and is constructed from a
frequency table. It displays the distribution of a data set and is
used for statistical as well as mathematical calculations.
Diagrams and graphs give visual indications of magnitudes, groupings,
trends and patterns in the data.
A pie diagram illustrates the partitioning of a total into its component parts.

5.6 Glossary
Table: The systematic arrangement of data in columns and rows.
Frequency polygon: A line chart of frequency distribution in which the
values of discrete variables or midpoints of class intervals are plotted
against the frequencies and these plotted points are joined together by
straight lines.
Relative frequency: The series of proportions obtained by dividing the
frequency of each class interval by the total frequency.
Ogive curve: A graphic representation of a cumulative frequency
distribution.
Histogram: A graphical description of data constructed from a
frequency table. It displays the distribution of a data set and is
used for statistical as well as mathematical calculations.
Pie diagram: A diagram that enables us to show the partitioning of a total
into its component parts.

5.7 Terminal Questions


1. What are the essential features of a table?
2. Giving suitable examples distinguish between a simple and a complex
table.
3. Explain frequency polygon giving an example.
4. Define relative frequency. What are the areas where relative frequency is
considered useful?
5. What is an ogive curve? Explain its types and significance.
6. How are histograms useful in data representation?
7. What features should be kept in mind while drawing a diagram?
8. Explain one dimensional, two dimensional and three dimensional diagrams
with the help of examples.

5.8 Answers
Answers to Self-Assessment Questions
1. (a) Systematic; (b) Untabulated
2. (a) False; (b) True
3. (a) False; (b) True

4. (a) Frequency; (b) Proportional


5. (a) Explanatory; (b) Lengths
6. (a) True; (b) True

Answers to Terminal Questions


1. Refer Section 5.2.1
2. Refer Section 5.2.2
3. Refer Section 5.3.2
4. Refer Section 5.3.3
5. Refer Section 5.3.4
6. Refer Section 5.3.5
7. Refer Section 5.4
8. Refer Sections 5.4.1, 5.4.2 and 5.4.4

5.9 Further Reading


1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.
2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand
& Sons, 2010.

Unit 6

Correlation

Structure
6.1 Introduction
Objectives
6.2 Correlation Analysis
6.3 Coefficient of Correlation
6.4 Spearman's Rank Correlation
6.5 Summary
6.6 Glossary
6.7 Terminal Questions
6.8 Answers
6.9 Further Reading

6.1 Introduction
In the previous unit, you learnt about various data representation techniques
and their significance in decision-making.
In this unit, you will learn about correlation analysis. Correlation is one of
the most significant statistics. Correlation can be defined as the interdependence
between variable quantities. If the values of two variables change with respect
to each other, then they are said to be correlated. For example, if the variables
are stock prices and the price of one stock increases at the same time as the price
of another stock increases, then the two stock prices are positively correlated.
If the price of one stock goes down when the price of the other increases, then
the two stock prices are negatively correlated. However, if we are unable to find
a consistent pattern in the variation of the two stock prices, then they are
uncorrelated.
The strength of correlation is measured by the coefficient of correlation.
The value of the coefficient of correlation lies in the interval [−1, 1]. Positive
correlations lie between 0 and 1; 0 means that there is no correlation; negative
correlations lie between 0 and −1. The purpose of doing correlations is to allow
us to make a prediction about one variable based on what we know about
another variable.

Objectives
After studying this unit, you should be able to:
Explain correlation analysis
Evaluate coefficient of determination and coefficient of correlation
Calculate probable error of the coefficient of correlation
Calculate correlation using various methods
Define limitations of correlation analysis

6.2 Correlation Analysis


Correlation analysis is a statistical tool generally used to describe the degree to
which one variable is related to another. The relationship, if any, is usually
assumed to be a linear one. In fact, the word correlation refers to the relationship
or interdependence between two variables. There are various phenomena that
are related to each other. For instance, when demand of a certain commodity
increases, its price goes up, and when its demand decreases, its price goes
down.
On the basis of the theory of correlation, one can study the comparative
changes occurring in two related phenomena and examine their cause-effect
relation. It should be borne in mind that a relationship like 'a black cat
causes bad luck' cannot be explained by the theory of correlation, since such
relationships are imaginary and incapable of being justified mathematically. Thus,
correlation is concerned with relationships between two related and quantifiable
variables. For example, if the height of students as well as the height of the
trees increases, then we cannot call it a correlation because the two phenomena
are not related to each other.
Correlation can be positive or negative. The sign of the correlation
coefficient between two stock prices shows whether the two stock prices are
positively or negatively correlated. If the coefficient of correlation is greater than
zero but not greater than 1, then the stock prices are positively correlated and
move in the same direction. If the coefficient of correlation is less than zero but
not less than −1, then the stock prices are negatively correlated and move in
opposite directions.
The correlation coefficient's numerical value shows the strength of the
correlation between the two stock prices. The stronger the positive correlation,
the closer will be the value of the correlation coefficient to +1. The stronger the
negative correlation, the closer will be the correlation coefficient to −1. If the two
stock prices are perfectly uncorrelated, the value of the correlation coefficient is
zero. This can be explained as under:

Changes in Independent    Changes in Dependent    Nature of
Variable                  Variable                Correlation
Increase (+)              Increase (+)            Positive (+)
Decrease (−)              Decrease (−)            Positive (+)
Increase (+)              Decrease (−)            Negative (−)
Decrease (−)              Increase (+)            Negative (−)

Statisticians have developed two measures for describing the correlation


between two variables, viz., the coefficient of determination and the coefficient
of correlation. We now explain, illustrate and interpret the two coefficients
concerning the relationship between two variables.

6.2.1 The Coefficient of Determination


The coefficient of determination (symbolically indicated as $r^2$, though some people
would prefer to put it as $R^2$) is a measure of the degree of linear association or
correlation between two variables, say X and Y, one of which happens to be an
independent variable and the other dependent. This coefficient is based on the
following two kinds of variations:
(i) The variation of the Y values around the fitted regression line, viz.,
$\sum (Y - \hat{Y})^2$, technically known as the unexplained variation.
(ii) The variation of the Y values around their own mean, viz.,
$\sum (Y - \bar{Y})^2$, technically known as the total variation.

If we subtract the unexplained variation from the total variation, we obtain
what is known as the explained variation, i.e., the variation explained by the line
of regression. Thus,
Explained variation = (Total variation) − (Unexplained variation)

\[ \sum (\hat{Y} - \bar{Y})^2 = \sum (Y - \bar{Y})^2 - \sum (Y - \hat{Y})^2 \]
The Total and Explained as well as Unexplained variations are shown in


Figure 6.1.

[Figure 6.1: Income ('00 Rs) on the X-axis against Consumption Expenditure ('00 Rs) on the Y-axis, showing the regression line of Y on X, the mean lines of X and Y, and the total variation (Y − Ȳ), unexplained variation (Y − Ŷ) and explained variation (Ŷ − Ȳ) at a specific point.]

Figure 6.1 Diagram Showing Total, Explained and Unexplained Variations

The coefficient of determination is that fraction of the total variation of Y
which is explained by the regression line. In other words, the coefficient of
determination is the ratio of explained variation to total variation in the Y variable
related to the X variable. The coefficient of determination can be stated
algebraically as follows:

\[ r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2} \]

Alternatively, $r^2$ can also be stated as under:

\[ r^2 = 1 - \frac{\text{Unexplained variation}}{\text{Total variation}} = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} \]


6.2.2 Interpreting $r^2$
The coefficient of determination explains how much of the variation in one factor
can be explained by its relationship to another factor. It is the square
of the correlation coefficient. For example, if you have two sets of scores on Tests
X and Y and they correlate at r = 0.90, the coefficient of determination $r^2$ will be
0.81. This can be interpreted as: 81% of the variance in Test X is
explained by Test Y.
As a matter of practice, the squared correlation should be interpreted,
because the correlation coefficient itself is misleading in suggesting the existence of
more correlation than really exists, and the problem gets worse as the correlation
approaches zero.
Example 6.1: Calculate the coefficient of determination ($r^2$) using the data given
below and analyse the result.

Observations                            1    2    3    4    5    6    7    8    9   10
Income (X) ('00 Rs)                    41   65   50   57   96   94  110   30   79   65
Consumption Expenditure (Y) ('00 Rs)   44   60   39   51   80   68   84   34   55   48

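Solution: $r^2$ can be worked out as shown below:

\[ r^2 = 1 - \frac{\text{Unexplained variation}}{\text{Total variation}} = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} \]

Since $\sum (Y - \bar{Y})^2 = \sum Y^2 - n\bar{Y}^2$, we can write,

\[ r^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum Y^2 - n\bar{Y}^2} \]

Calculating and putting the various values, we have the following equation:

\[ r^2 = 1 - \frac{260.54}{34223 - 10(56.3)^2} = 1 - \frac{260.54}{2526.10} = 0.897 \]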
Analysis of Result: The regression equation used to calculate the value of the
coefficient of determination ($r^2$) from the sample data shows that about 90% of
the variations in consumption expenditure can be explained. In other words, it
means that the variations in income explain about 90% of variations in
consumption expenditure.
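
The calculation in Example 6.1 can be checked with a short Python sketch. It fits the least-squares line to the ten observations and then applies the formula "1 minus unexplained variation over total variation". The function name `coefficient_of_determination` is illustrative only and does not come from the text. Because the sketch keeps the unrounded slope and intercept, its unexplained variation differs slightly from the 260.54 quoted above, but the ratio still rounds to about 0.897.

```python
# Data from Example 6.1: income (X) and consumption expenditure (Y), in '00 Rs
X = [41, 65, 50, 57, 96, 94, 110, 30, 79, 65]
Y = [44, 60, 39, 51, 80, 68, 84, 34, 55, 48]


def coefficient_of_determination(xs, ys):
    """Return (a, b, r_squared) for the least-squares line Y-hat = a + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope and intercept of the regression line of Y on X
    b = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (
        sum(x * x for x in xs) - n * x_bar ** 2
    )
    a = y_bar - b * x_bar
    fitted = [a + b * x for x in xs]
    # Unexplained variation: squared deviations of Y from the fitted values
    unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    # Total variation: squared deviations of Y from its own mean
    total = sum((y - y_bar) ** 2 for y in ys)
    return a, b, 1 - unexplained / total


a, b, r2 = coefficient_of_determination(X, Y)
print(f"a = {a:.3f}, b = {b:.3f}, r^2 = {r2:.3f}")
```

The fitted line is close to the equation Ŷ = 14.000 + 0.616X quoted later in Section 6.3.3.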
Activity 1
Using the various correlation methods discussed in the unit, compute the
correlation for the following data:

Person    Height (x)    Self Esteem (y)
1         68            4.1
2         71            4.6
3         62            3.8
4         75            4.4
5         58            3.2
6         60            3.1

Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Correlation is concerned with relationship between two related and
____________ variables.
(b) Coefficients of _____________ is that fraction of the total variation
of Y which is explained by the regression line.
2. State whether true or false.
(a) The word correlation refers to the relationship or interdependence
between two variables.
(b) Correlation can either be positive or negative.

6.3 Coefficient of Correlation


The coefficient of correlation, symbolically denoted by r, is another important
measure to describe how well one variable is explained by another. It measures
the degree of relationship between the two causally related variables. The value
of this coefficient can never be more than +1 or less than −1. Thus, +1 and −1
are the limits of this coefficient. For a unit change in the independent variable, if
there happens to be a constant change in the dependent variable in the same
direction, then the value of the coefficient will be +1, indicating perfect
positive correlation; but if such a change occurs in the opposite direction, the
value of the coefficient will be −1, indicating perfect negative correlation. In
practical life, the possibility of obtaining either a perfect positive or a perfect
negative correlation is very remote, particularly in respect of phenomena
concerning social sciences. If the coefficient of correlation has a zero value,
then it means that there exists no correlation between the variables under study.
There are several methods of finding the coefficient of correlation but the
following ones are considered important:
(i) Coefficient of Correlation by the Method of Least Squares.
(ii) Coefficient of Correlation using Simple Regression Coefficients.
(iii) Coefficient of Correlation through Product Moment Method or Karl
Pearson's Coefficient of Correlation.
Whichever of these above mentioned three methods we adopt, we get
the same value of r.
(i) Coefficient of Correlation by the Method of Least Squares
Under this method, first of all, the estimating equation is obtained using the least
squares method of simple regression analysis. The equation is worked out as,

\[ \hat{Y} = a + bX_i \]

Total variation = $\sum (Y - \bar{Y})^2$
Unexplained variation = $\sum (Y - \hat{Y})^2$
Explained variation = $\sum (\hat{Y} - \bar{Y})^2$

Then, by applying the following formulae, we can find the value of the coefficient
of correlation:

\[ r = \sqrt{r^2} = \sqrt{\frac{\text{Explained variation}}{\text{Total variation}}} = \sqrt{1 - \frac{\text{Unexplained variation}}{\text{Total variation}}} = \sqrt{1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}} \]

This clearly shows that the coefficient of correlation happens to be the
square root of the coefficient of determination.
The short-cut formula for finding the value of r by the method of least squares
can be written as follows:

\[ r = \sqrt{\frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}} \]

Where,
a = Y-intercept
b = Slope of the estimating equation
X = Values of the independent variable
Y = Values of the dependent variable
$\bar{Y}$ = Mean of the observed values of Y
n = Number of items in the sample
(i.e., pairs of observed data)

The plus (+) or the minus (−) sign of the coefficient of correlation worked
out by the method of least squares is related to the sign of b in the estimating
equation, viz., $\hat{Y} = a + bX_i$. If b has a minus sign, the sign of r will also be minus,
but if b has a plus sign, then the sign of r will also be plus. The value of r
indicates the degree along with the direction of the relationship between the
two variables X and Y.
(ii) Coefficient of Correlation using Simple Regression Coefficients
Under this method, the estimating equation of Y and the estimating equation of
X are worked out using the method of least squares. From these estimating
equations we find the regression coefficient of X on Y, i.e., the slope of the
estimating equation of X (symbolically written as $b_{XY}$), and this happens to be
equal to $r\,\sigma_X/\sigma_Y$. Similarly, we find the regression coefficient of Y on X, i.e., the
slope of the estimating equation of Y (symbolically written as $b_{YX}$), and this
happens to be equal to $r\,\sigma_Y/\sigma_X$. For finding r, the square root of the product of
these two regression coefficients is worked out as follows (see Endnote 1):

\[ r = \sqrt{b_{XY} \cdot b_{YX}} = \sqrt{r\frac{\sigma_X}{\sigma_Y} \cdot r\frac{\sigma_Y}{\sigma_X}} = \sqrt{r^2} = r \]

As stated earlier, the sign of r will depend upon the sign of the regression
coefficients. If they have a minus sign, then r will take a minus sign, but the sign
of r will be positive if the regression coefficients have positive signs.

6.3.1 Karl Pearson's Coefficient


Karl Pearson's method is the most widely used method of measuring the relationship
between two variables. This coefficient is based on the following assumptions:
(i) There is a linear relationship between the two variables, which means that
a straight line would be obtained if the observed data are plotted on a graph.
(ii) The two variables are causally related, which means that one of the
variables is independent and the other one is dependent.
(iii) A large number of independent causes are operating in both the variables
so as to produce a normal distribution.
According to Karl Pearson, r can be worked out as under:

\[ r = \frac{\sum xy}{n\,\sigma_X\,\sigma_Y} \]

Where,
$x = (X - \bar{X})$, the deviation of X from its mean
$y = (Y - \bar{Y})$, the deviation of Y from its mean
$\sigma_X$ = Standard deviation of the X series, equal to $\sqrt{\sum x^2 / n}$
$\sigma_Y$ = Standard deviation of the Y series, equal to $\sqrt{\sum y^2 / n}$
n = Number of pairs of X and Y observed.

A short-cut formula, known as the Product Moment Formula, can be
derived from the above stated formula as under:

\[ r = \frac{\sum xy}{n\,\sigma_X\,\sigma_Y} = \frac{\sum xy}{n\sqrt{\dfrac{\sum x^2}{n}}\sqrt{\dfrac{\sum y^2}{n}}} = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}} \]

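The above formulae are based on obtaining true means (viz., $\bar{X}$ and $\bar{Y}$)
first and then doing all other calculations. This happens to be a tedious task,
particularly if the true means are in fractions. To avoid difficult calculations, we
make use of the assumed means in taking out deviations and doing the related
calculations. In such a situation, we can use the following formula for finding
the value of r (see Endnote 2):
(i) In case of ungrouped data:

\[ r = \frac{\dfrac{\sum dX \cdot dY}{n} - \left(\dfrac{\sum dX}{n}\right)\left(\dfrac{\sum dY}{n}\right)}{\sqrt{\dfrac{\sum dX^2}{n} - \left(\dfrac{\sum dX}{n}\right)^2}\;\sqrt{\dfrac{\sum dY^2}{n} - \left(\dfrac{\sum dY}{n}\right)^2}} \]

or

\[ r = \frac{\sum dX \cdot dY - \dfrac{\sum dX \cdot \sum dY}{n}}{\sqrt{\sum dX^2 - \dfrac{(\sum dX)^2}{n}}\;\sqrt{\sum dY^2 - \dfrac{(\sum dY)^2}{n}}} \]

Where,
$\sum dX = \sum (X - X_A)$, with $X_A$ = Assumed average of X
$\sum dY = \sum (Y - Y_A)$, with $Y_A$ = Assumed average of Y
$\sum dX^2 = \sum (X - X_A)^2$
$\sum dY^2 = \sum (Y - Y_A)^2$
$\sum dX \cdot dY = \sum (X - X_A)(Y - Y_A)$
n = Number of pairs of observations of X and Y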
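
The point of the assumed-mean formula is that the value of r does not depend on which assumed averages are chosen. The following Python sketch illustrates this under stated assumptions: the function name `pearson_assumed_mean` and the five data pairs are hypothetical and are not taken from the text. It applies the second form of the ungrouped-data formula twice, once with the true means and once with arbitrary assumed means.

```python
import math


def pearson_assumed_mean(xs, ys, x_a, y_a):
    """Pearson's r from deviations taken about assumed means x_a and y_a."""
    n = len(xs)
    dx = [x - x_a for x in xs]
    dy = [y - y_a for y in ys]
    numerator = sum(p * q for p, q in zip(dx, dy)) - sum(dx) * sum(dy) / n
    var_x_part = sum(p * p for p in dx) - sum(dx) ** 2 / n
    var_y_part = sum(q * q for q in dy) - sum(dy) ** 2 / n
    return numerator / math.sqrt(var_x_part * var_y_part)


# Hypothetical data, used only to illustrate the formula
xs = [2, 4, 5, 7, 9]
ys = [3, 5, 4, 8, 10]

r_true_means = pearson_assumed_mean(xs, ys, sum(xs) / len(xs), sum(ys) / len(ys))
r_assumed = pearson_assumed_mean(xs, ys, 5, 6)  # convenient whole-number averages
print(round(r_true_means, 6), round(r_assumed, 6))
```

Both calls give the same r, so the assumed averages can be picked purely for ease of arithmetic.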

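(ii) In case of grouped data:

\[ r = \frac{\dfrac{\sum f\,dX \cdot dY}{n} - \left(\dfrac{\sum f\,dX}{n}\right)\left(\dfrac{\sum f\,dY}{n}\right)}{\sqrt{\dfrac{\sum f\,dX^2}{n} - \left(\dfrac{\sum f\,dX}{n}\right)^2}\;\sqrt{\dfrac{\sum f\,dY^2}{n} - \left(\dfrac{\sum f\,dY}{n}\right)^2}} \]

or

\[ r = \frac{\sum f\,dX \cdot dY - \dfrac{\sum f\,dX \cdot \sum f\,dY}{n}}{\sqrt{\sum f\,dX^2 - \dfrac{(\sum f\,dX)^2}{n}}\;\sqrt{\sum f\,dY^2 - \dfrac{(\sum f\,dY)^2}{n}}} \]

Where,
$\sum f\,dX \cdot dY = \sum f (X - X_A)(Y - Y_A)$
$\sum f\,dX = \sum f (X - X_A)$
$\sum f\,dY = \sum f (Y - Y_A)$
$\sum f\,dY^2 = \sum f (Y - Y_A)^2$
$\sum f\,dX^2 = \sum f (X - X_A)^2$
n = Number of pairs of observations of X and Y.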

6.3.2 Probable Error (P.E.) of the Coefficient of Correlation


Probable Error (P.E.) of r is very useful in interpreting the value of r and is
worked out as under for Karl Pearson's coefficient of correlation:

\[ \text{P.E.} = 0.6745\,\frac{1 - r^2}{\sqrt{n}} \]

If r is less than its P.E., it is not at all significant. If r is more than its P.E., there
is correlation. If r is more than 6 times its P.E. and greater than 0.5, then it is
considered significant.
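
As a minimal sketch, the probable error and the rule of thumb above can be written in Python as follows. The function names `probable_error` and `interpret_r` are illustrative. Comparing the absolute value of r with the P.E. is an assumption made here so that negative correlations are handled too; the text states the rule only for r itself.

```python
import math


def probable_error(r, n):
    """Probable error of Karl Pearson's r for n pairs of observations."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


def interpret_r(r, n):
    """Apply the rule of thumb from the text (using |r|, an assumption)."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe and abs(r) > 0.5:
        return "significant"
    return "correlation present"


# r = 0.95 from n = 9 pairs, the values obtained in Example 6.2 of this unit
pe = probable_error(0.95, 9)
print(round(pe, 4), interpret_r(0.95, 9))
```

For r = 0.95 and n = 9 the probable error is about 0.022, so r is far more than six times its P.E.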
Example 6.2:
From the following data calculate r between X and Y applying the following
three methods:
(i) The method of least squares.
(ii) The method based on regression coefficients.
(iii) The product moment method of Karl Pearson.
Verify the obtained result of any one method with that of another.

X    1    2    3    4    5    6    7    8    9
Y    9    8   10   12   11   13   14   16   15

Solution:
Let us develop the following table for calculating the value of r:

X     Y     X²    Y²     XY
1     9     1     81     9
2     8     4     64     16
3     10    9     100    30
4     12    16    144    48
5     11    25    121    55
6     13    36    169    78
7     14    49    196    98
8     16    64    256    128
9     15    81    225    135

n = 9
ΣX = 45      ΣY = 108      ΣX² = 285      ΣY² = 1356      ΣXY = 597
X̄ = 5        Ȳ = 12

(i) Coefficient of correlation by the method of least squares is worked out as
follows:
First of all find out the estimating equation,

\[ \hat{Y} = a + bX_i \]

Where,

\[ b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{597 - 9(5)(12)}{285 - 9(25)} = \frac{597 - 540}{285 - 225} = \frac{57}{60} = 0.95 \]

and

\[ a = \bar{Y} - b\bar{X} = 12 - 0.95(5) = 12 - 4.75 = 7.25 \]

Hence,

\[ \hat{Y} = 7.25 + 0.95X_i \]

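Now r can be worked out as under by the method of least squares,

\[ r = \sqrt{1 - \frac{\text{Unexplained variation}}{\text{Total variation}}} = \sqrt{1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}} = \sqrt{\frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}} = \sqrt{\frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}} \]

This is as per the short-cut formula,

\[ r = \sqrt{\frac{7.25(108) + 0.95(597) - 9(12)^2}{1356 - 9(12)^2}} = \sqrt{\frac{783 + 567.15 - 1296}{1356 - 1296}} = \sqrt{\frac{54.15}{60}} = \sqrt{0.9025} = 0.95 \]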

(ii) Coefficient of correlation by the method based on regression coefficients
is worked out as follows:
Regression coefficient of Y on X,

\[ b_{YX} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{597 - 9(5)(12)}{285 - 9(5)^2} = \frac{597 - 540}{285 - 225} = \frac{57}{60} \]

Regression coefficient of X on Y,

\[ b_{XY} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2} = \frac{597 - 9(5)(12)}{1356 - 9(12)^2} = \frac{597 - 540}{1356 - 1296} = \frac{57}{60} \]

Hence,

\[ r = \sqrt{b_{YX} \cdot b_{XY}} = \sqrt{\frac{57}{60} \times \frac{57}{60}} = \frac{57}{60} = 0.95 \]

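(iii) Coefficient of correlation by the product moment method of Karl Pearson
is worked out as under:

\[ r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\sum X^2 - n\bar{X}^2}\,\sqrt{\sum Y^2 - n\bar{Y}^2}} = \frac{597 - 9(5)(12)}{\sqrt{285 - 9(5)^2}\,\sqrt{1356 - 9(12)^2}} = \frac{597 - 540}{\sqrt{285 - 225}\,\sqrt{1356 - 1296}} = \frac{57}{\sqrt{60}\,\sqrt{60}} = \frac{57}{60} = 0.95 \]
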
Hence, we get the value of r = 0.95. We get the same value applying the
other two methods also. Therefore, whichever method we apply, the results will
be the same.
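
The three routes to r in Example 6.2 can be reproduced with a short Python sketch. The variable names are illustrative; the data are X = 1 to 9 and the Y values from the solution table. Taking the sign of r from the slope via `math.copysign` follows the rule stated in the text that r carries the sign of the regression coefficients.

```python
import math

# Data of Example 6.2
X = [1, 2, 3, 4, 5, 6, 7, 8, 9]
Y = [9, 8, 10, 12, 11, 13, 14, 16, 15]

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
sum_y = sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
s_xy = sum_xy - n * x_bar * y_bar               # 597 - 540 = 57
s_xx = sum(x * x for x in X) - n * x_bar ** 2   # 285 - 225 = 60
s_yy = sum(y * y for y in Y) - n * y_bar ** 2   # 1356 - 1296 = 60

# (i) Method of least squares: fit Y-hat = a + bX, then use the short-cut formula
b = s_xy / s_xx
a = y_bar - b * x_bar
r_least_squares = math.copysign(
    math.sqrt((a * sum_y + b * sum_xy - n * y_bar ** 2) / s_yy), b
)

# (ii) Regression coefficients: r is the square root of b_YX * b_XY
b_yx = s_xy / s_xx
b_xy = s_xy / s_yy
r_regression = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

# (iii) Product moment method of Karl Pearson
r_product_moment = s_xy / math.sqrt(s_xx * s_yy)

print(r_least_squares, r_regression, r_product_moment)
```

All three values agree with the hand calculation of 0.95, up to floating-point rounding.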

6.3.3 Some Other Measures


Two other measures are often talked about along with the coefficients of
determination and correlation. These are as follows:
(i) Coefficient of Nondetermination. Instead of using the coefficient of
determination, sometimes the coefficient of nondetermination is used.
The coefficient of nondetermination (denoted by $k^2$) is the ratio of unexplained
variation to total variation in the Y variable related to the X variable.
Algebraically, we can write it as follows:

\[ k^2 = \frac{\text{Unexplained variation}}{\text{Total variation}} = \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} \]

Concerning the data of Example 6.1 of this unit, the coefficient of
nondetermination will be calculated as follows:

\[ k^2 = \frac{260.54}{2526.10} = 0.103 \]

The value of $k^2$ shows that about 10% of the variation in consumption
expenditure remains unexplained by the regression equation we had
worked out, viz., $\hat{Y} = 14.000 + 0.616X_i$. In simple terms, this means that
variables other than X are responsible for 10% of the variations in the
dependent variable Y in the given case.
The coefficient of nondetermination can as well be worked out as under:

\[ k^2 = 1 - r^2 \]

Accordingly, for Example 6.1, it will be equal to 1 − 0.897 = 0.103.
Note: Always remember that $r^2 + k^2 = 1$.
(ii) Coefficient of Alienation. Based on $k^2$, we can work out one more measure,
namely the coefficient of alienation, symbolically written as k. Thus,
Coefficient of alienation, i.e., $k = \sqrt{k^2}$
Unlike $r^2 + k^2 = 1$, the sum of r and k will not be equal to 1 unless one of
the two coefficients is 1, in which case the other coefficient must be
zero. In all other cases, r + k > 1 (taking r as positive). The coefficient of
alienation is not a popular measure from a practical point of view and is used very rarely.
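
A few lines of Python make the relationship between these measures concrete. This is only a sketch: it starts from the rounded value of 0.897 for the coefficient of determination reported for Example 6.1, and the variable names are illustrative.

```python
import math

r_squared = 0.897            # coefficient of determination from Example 6.1
k_squared = 1 - r_squared    # coefficient of nondetermination
k = math.sqrt(k_squared)     # coefficient of alienation
r = math.sqrt(r_squared)     # coefficient of correlation (positive here)

print(round(k_squared, 3), round(k, 3), round(r + k, 3))
```

The squares add up to 1, while r + k comes out well above 1, as the text states.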
Activity 2
Two random variables have the regression equations,
3X + 2Y − 26 = 0
6X + Y − 31 = 0
Find the mean value of X as well as of Y and the correlation coefficient
between X and Y. If the variance of X is 25, find $\sigma_Y$ from the data given
above.

Self-Assessment Questions
3. State whether true or false.
(a) The value of this coefficient can never be more than +1 or less
than -1.
(b) Coefficient of determination (denoted by k2) is the ratio of unexplained
variation to total variation in the Y variable related to the X variable.
4. Fill in the blanks with the appropriate terms.
(a) The coefficient of correlation, symbolically denoted by 'r', measures
the degree of relationship between the two _____________ related
variables.
(b) If r is less than its probable error (P.E.), it is not at all significant but
if r is more than P.E., there is_______________.
6.4 Spearman's Rank Correlation


If observations on two variables are given in the form of ranks and not as
numerical values, it is possible to compute what is known as rank correlation
between the two series.
The rank correlation, written as ρ, is a descriptive index of agreement
between ranks over individuals. It is the same as the ordinary coefficient of
correlation computed on ranks, but its formula is simpler.

\[ \rho = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)} \]

Here, n is the number of observations and $D_i$, the positive difference
between ranks associated with the individual i.
Like r, the rank correlation lies between −1 and +1.
Example 6.3: The ranks given by two judges to 10 individuals are as follows:

              Rank given by
Individual    Judge I (x)    Judge II (y)    D = |x − y|    D²
1             1              7               6              36
2             2              5               3              9
3             7              8               1              1
4             9              10              1              1
5             8              9               1              1
6             6              4               2              4
7             4              1               3              9
8             3              6               3              9
9             10             3               7              49
10            5              2               3              9
                                                            ΣD² = 128


Solution: The rank correlation is given by,

\[ \rho = 1 - \frac{6\sum D^2}{n^3 - n} = 1 - \frac{6 \times 128}{10^3 - 10} = 1 - 0.776 = 0.224 \]

The value of ρ = 0.224 shows that the agreement between the judges is
not high.
Example 6.4: Consider example 6.3 to compute r and then compare.


Solution: The simple coefficient of correlation r for the previous data is calculated
as follows:

x     y     x²     y²     xy
1     7     1      49     7
2     5     4      25     10
7     8     49     64     56
9     10    81     100    90
8     9     64     81     72
6     4     36     16     24
4     1     16     1      4
3     6     9      36     18
10    3     100    9      30
5     2     25     4      10

Σx = 55     Σy = 55     Σx² = 385     Σy² = 385     Σxy = 321

\[ r = \frac{321 - 10\left(\frac{55}{10}\right)\left(\frac{55}{10}\right)}{\sqrt{385 - 10\left(\frac{55}{10}\right)^2}\,\sqrt{385 - 10\left(\frac{55}{10}\right)^2}} = \frac{18.5}{\sqrt{82.5 \times 82.5}} = \frac{18.5}{82.5} = 0.224 \]

This shows that the Spearman ρ for any two sets of ranks is the same as
the Pearson r for the set of ranks. But it is much easier to compute ρ.
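
The agreement between the two coefficients can be checked numerically. The sketch below, with illustrative function names `spearman_rho` and `pearson_r`, uses the two judges' ranks from Example 6.3. Note that the exact equality relies on the ranks containing no ties, which is the case here.

```python
import math

# Ranks given by the two judges in Example 6.3
judge_1 = [1, 2, 7, 9, 8, 6, 4, 3, 10, 5]
judge_2 = [7, 5, 8, 10, 9, 4, 1, 6, 3, 2]


def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation: 1 - 6 * sum(D^2) / (n^3 - n)."""
    n = len(rank_x)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


def pearson_r(xs, ys):
    """Ordinary product-moment correlation computed from raw sums."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    return s_xy / math.sqrt(s_xx * s_yy)


rho = spearman_rho(judge_1, judge_2)
r = pearson_r(judge_1, judge_2)
print(round(rho, 3), round(r, 3))
```

Both functions return approximately 0.224 for these ranks.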
Often, the ranks are not given. Instead, the numerical values of
observations are given. In such a case, we must attach the ranks to these
values to calculate ρ.
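Example 6.5: From the following table, compute the coefficient of correlation
between age of husbands and age of wives:

Age of                          Age of wives
husbands    15-25   25-35   35-45   45-55   55-65   65-75   Total
15-25       1       1       -       -       -       -       2
25-35       2       12      1       -       -       -       15
35-45       -       4       10      1       -       -       15
45-55       -       -       3       6       1       -       10
55-65       -       -       -       2       4       2       8
65-75       -       -       -       -       1       2       3
Total       3       17      14      9       6       4       53

Solution: Taking step deviations of the class mid-points from the assumed mean 40
(class 35-45), with $d_x$ for the age of wives and $d_y$ for the age of husbands,

\[ r = \frac{N\sum f d_x d_y - \sum f d_x \sum f d_y}{\sqrt{N\sum f d_x^2 - \left(\sum f d_x\right)^2}\,\sqrt{N\sum f d_y^2 - \left(\sum f d_y\right)^2}} = \frac{53 \times 86 - 10 \times 16}{\sqrt{53 \times 98 - 10^2}\,\sqrt{53 \times 92 - 16^2}} \]

= 0.907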
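
The sums used in Example 6.5 can be recomputed from the two-way frequency table. The Python sketch below rests on assumptions: the matrix `freq` is a reading of the table with husbands' age classes as rows and wives' age classes as columns, and step deviations are taken from an assumed mean of 40 in units of the class width 10. Under that reading the sums come out as 86, 10, 16, 98 and 92, which are the figures appearing in the formula.

```python
import math

# Rows: age of husbands; columns: age of wives (classes 15-25, ..., 65-75)
freq = [
    [1, 1, 0, 0, 0, 0],
    [2, 12, 1, 0, 0, 0],
    [0, 4, 10, 1, 0, 0],
    [0, 0, 3, 6, 1, 0],
    [0, 0, 0, 2, 4, 2],
    [0, 0, 0, 0, 1, 2],
]
midpoints = [20, 30, 40, 50, 60, 70]
assumed_mean, width = 40, 10
d = [(m - assumed_mean) // width for m in midpoints]   # step deviations -2..3

cells = [(freq[i][j], d[j], d[i]) for i in range(6) for j in range(6)]
N = sum(f for f, _, _ in cells)
sum_fdx = sum(f * dx for f, dx, _ in cells)            # wives (columns)
sum_fdy = sum(f * dy for f, _, dy in cells)            # husbands (rows)
sum_fdx2 = sum(f * dx * dx for f, dx, _ in cells)
sum_fdy2 = sum(f * dy * dy for f, _, dy in cells)
sum_fdxdy = sum(f * dx * dy for f, dx, dy in cells)

r = (N * sum_fdxdy - sum_fdx * sum_fdy) / (
    math.sqrt(N * sum_fdx2 - sum_fdx ** 2) * math.sqrt(N * sum_fdy2 - sum_fdy ** 2)
)
print(N, sum_fdxdy, sum_fdx, sum_fdy, sum_fdx2, sum_fdy2, round(r, 3))
```

Because r is unaffected by a change of origin and scale, working in step deviations gives the same coefficient as working with the ages themselves.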

Example 6.6: If covariance between X and Y variables is 10 and the variances
of X and Y are 16 and 9 respectively, find the coefficient of correlation.
Solution:
Covariance of X and Y, $\mu_{11} = \dfrac{\sum xy}{N} = 10$
Variance of X, $\sigma_x^2 = 16 \Rightarrow \sigma_x = 4$
Variance of Y, $\sigma_y^2 = 9 \Rightarrow \sigma_y = 3$
Thus,

\[ r = \frac{\sum xy / N}{\sigma_x\,\sigma_y} = \frac{\mu_{11}}{\sigma_x\,\sigma_y} = \frac{10}{4 \times 3} = 0.833 \]

Example 6.7: The marks of 8 candidates in Mathematics and English are given
below:

Mathematics    76    90    98    69    54    82    67    52
English        25    37    56    12    7     36    23    11

Calculate the rank coefficient of correlation.

Solution:

Marks in       Marks in    Rank in             Rank in         Rank Difference
Mathematics    English     Mathematics (R1)    English (R2)    (D) = (R1 − R2)    D²
76             25          4                   4               0                  0
90             37          2                   2               0                  0
98             56          1                   1               0                  0
69             12          5                   6               −1                 1
54             7           7                   8               −1                 1
82             36          3                   3               0                  0
67             23          6                   5               +1                 1
52             11          8                   7               +1                 1
Total                                                          ΣD = 0             ΣD² = 4

Here,
N = 8
Rank correlation coefficient,

\[ R = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6(4)}{8^3 - 8} \]

= 0.952
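
When only marks are given, the ranking step can be automated. The following Python sketch is an illustration and not part of the original solution: the helper `rank_descending` is a hypothetical name, and it assumes that no two candidates share the same mark in a subject, which holds for the data of Example 6.7.

```python
# Marks of the 8 candidates in Example 6.7
mathematics = [76, 90, 98, 69, 54, 82, 67, 52]
english = [25, 37, 56, 12, 7, 36, 23, 11]


def rank_descending(values):
    """Rank 1 goes to the highest value (assumes all values are distinct)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation: 1 - 6 * sum(D^2) / (N^3 - N)."""
    n = len(rank_x)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


r1 = rank_descending(mathematics)
r2 = rank_descending(english)
rho = spearman_rho(r1, r2)
print(r1, r2, round(rho, 3))
```

The marks 67 (Mathematics) and 23 (English) receive ranks 6 and 5 respectively, and the rank differences sum to zero as they must.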

Example 6.8: Compute rank correlation coefficient from the following data of
marks obtained by eight students in the papers of Physics and Mathematics:

Marks in Physics        15    20    27    13    45    60    20    75
Marks in Mathematics    50    30    55    30    25    10    30    70


Solution:

Marks in    Marks in       Rank in            Rank in              Difference
Physics     Mathematics    Physics            Mathematics          in Ranks (D)    D²
15          50             7                  3                    4               16
20          30             (5 + 6)/2 = 5.5    (4 + 5 + 6)/3 = 5    0.5             0.25
27          55             4                  2                    2               4
13          30             8                  (4 + 5 + 6)/3 = 5    3               9
45          25             3                  7                    −4              16
60          10             2                  8                    −6              36
20          30             (5 + 6)/2 = 5.5    (4 + 5 + 6)/3 = 5    0.5             0.25
75          70             1                  1                    0               0
Total                                                              0               ΣD² = 81.5

In this example, two students have secured equal marks, viz., 20 in Physics,
so the ranks awarded to them are the arithmetic means of the ranks that they
would have got (viz., 5 and 6) had they differed at least by a small number, and
so the ranks awarded to them are (5 + 6)/2 = 5.5 each.

Similarly, three students who got equal marks (30 each) in Mathematics
were accorded the rank (4 + 5 + 6)/3 = 5 each.

Now, with m = 2 tied values in Physics and n = 3 tied values in Mathematics,

\[ R = 1 - \frac{6\left[\sum D^2 + \dfrac{m^3 - m}{12} + \dfrac{n^3 - n}{12}\right]}{N^3 - N} = 1 - \frac{6\left[81.5 + \dfrac{2^3 - 2}{12} + \dfrac{3^3 - 3}{12}\right]}{8^3 - 8} = 0 \]

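
The treatment of tied marks in Example 6.8 can be sketched in Python as follows. The function names are illustrative. Tied values share the arithmetic mean of the ranks they would otherwise occupy, and a correction of (m³ − m)/12 is added to ΣD² for every group of m tied values, as in the formula above.

```python
from collections import Counter

# Marks of the eight students in Example 6.8
physics = [15, 20, 27, 13, 45, 60, 20, 75]
mathematics = [50, 30, 55, 30, 25, 10, 30, 70]


def rank_with_ties(values):
    """Rank 1 = highest value; tied values get the mean of their ranks."""
    ordered = sorted(values, reverse=True)
    counts = Counter(values)
    # ordered.index(v) + 1 is the first rank occupied by the tied group of v
    return [ordered.index(v) + 1 + (counts[v] - 1) / 2 for v in values]


def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m tied values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)


def spearman_with_ties(xs, ys):
    rank_x, rank_y = rank_with_ties(xs), rank_with_ties(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    adjusted = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return sum_d2, 1 - 6 * adjusted / (n ** 3 - n)


sum_d2, rho = spearman_with_ties(physics, mathematics)
print(rank_with_ties(physics), rank_with_ties(mathematics), sum_d2, rho)
```

For these marks ΣD² is 81.5, the two corrections are 0.5 and 2, and the coefficient comes out as zero.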

Example 6.9: Ten competitors in a beauty contest are ranked by three judges
in the following order:

1st Judge    1    6    5    10    3    2     4    9     7    8
2nd Judge    3    5    8    4     7    10    2    1     6    9
3rd Judge    6    4    9    8     1    2     3    10    5    7

Use the rank correlation coefficient to determine which pair of judges
has the nearest approach to common tastes in beauty.
R1        R2        R3        (R1 − R2)² = D²    (R2 − R3)² = D²    (R1 − R3)² = D²
1         3         6         4                  9                  25
6         5         4         1                  1                  4
5         8         9         9                  1                  16
10        4         8         36                 16                 4
3         7         1         16                 36                 4
2         10        2         64                 64                 0
4         2         3         4                  1                  1
9         1         10        64                 81                 1
7         6         5         1                  1                  4
8         9         7         1                  4                  1
N = 10    N = 10    N = 10    ΣD² = 200          ΣD² = 214          ΣD² = 60

Rank correlation between the judgements of 1st and 2nd judges:

\[ R_{12} = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 200}{10^3 - 10} = -0.212 \]

Rank correlation between the judgements of 2nd and 3rd judges:

\[ R_{23} = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 214}{10^3 - 10} = -0.297 \]

Rank correlation between the judgements of 1st and 3rd judges:

\[ R_{13} = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 60}{10^3 - 10} = 0.636 \]

Since the coefficient of rank correlation is maximum in the judgements of


first and third judges, we conclude that they have the nearest approach to
common tastes in beauty.
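
The pairwise comparison in Example 6.9 can be sketched in Python. The dictionary keys and the function name are illustrative; the ranks are those of the three judges as listed in the solution table.

```python
from itertools import combinations

# Ranks awarded to the ten competitors by each judge (Example 6.9)
judges = {
    "1st": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "2nd": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "3rd": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}


def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation: 1 - 6 * sum(D^2) / (N^3 - N)."""
    n = len(rank_x)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Rank correlation for every pair of judges
results = {
    (a, b): spearman_rho(judges[a], judges[b]) for a, b in combinations(judges, 2)
}
closest_pair = max(results, key=results.get)

for pair, value in results.items():
    print(pair, round(value, 3))
print("Nearest tastes:", closest_pair)
```

The first two pairs involving the 2nd judge give negative coefficients, so the 1st and 3rd judges are the pair with the highest rank correlation.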

Self-Assessment Questions
5. Fill in the blanks with the appropriate terms.
(a) If observations on two variables are given in the form of ranks and
not as __________ values, then it is possible to compute rank
correlation between the two series.

(b) The _____________________ for any two sets of ranks is the
same as the Pearson r for the set of ranks.
6. State whether true or false.
(a) The rank correlation, written as ρ, is a descriptive index of agreement
between ranks over individuals.
(b) Like r, the rank correlation lies between −1 and +1.

6.5 Summary
Let us recapitulate the important concepts discussed in this unit:
Correlation analysis is the statistical tool generally used to describe the
degree to which, one variable is related to another.
The theory by means of which quantitative connections between two sets
of phenomena are determined is called the Theory of Correlation.
Correlation can either be positive or it can be negative.
The coefficient of determination can have a value ranging from zero to
one. The value of one can occur only if the unexplained variation is zero,
which simply means that all the data points in the Scatter diagram fall
exactly on the regression line.
The coefficient of correlation, symbolically denoted by r, is another
important measure to describe how well one variable is explained by
another. It measures the degree of relationship between the two causally
related variables. The value of this coefficient can never be more than +1
or less than −1.
Karl Pearson's method is the most widely used method of measuring the
relationship between two variables.
If r is less than its P.E., it is not at all significant. If r is more than P.E.,
there is correlation.
If observations on two variables are given in the form of ranks and not as
numerical values, it is possible to compute what is known as rank
correlation between the two series.

6.6 Glossary
Correlation analysis: A statistical tool used to describe the degree to
which one variable is related to another.
Coefficient of determination: A measure of the degree of linear
association or correlation between two variables, one of which must be
an independent variable and the other, a dependent variable.
Coefficient of correlation: It is symbolically denoted by r and is an
important measure to describe how well one variable is explained by
another. It measures the degree of relationship between the two causally
related variables.

6.7 Terminal Questions


1. What is the importance of correlation analysis?
2. How will you determine the coefficient of determination?
3. Explain the method to calculate the coefficient of correlation using simple
regression coefficient.
4. Describe Karl Pearson's method of measuring coefficient of correlation.
5. What is the relationship between coefficient of nondetermination and
coefficient of alienation?
6. What is Spearman's rank correlation?

6.8 Answers
Answers to Self-Assessment Questions
1. (a) Quantifiable; (b) Determination
2. (a) True; (b) True
3. (a) True; (b) False
4. (a) Causally; (b) Correlation
5. (a) Numerical; (b) Spearman
6. (a) True; (b) True
Answers to Terminal Questions


1. Refer Section 6.2
2. Refer Sections 6.2.1 and 6.2.2
3. Refer Section 6.3
4. Refer Section 6.3.1
5. Refer Section 6.3.3
6. Refer Section 6.4

6.9 Further Reading


1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.
2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand
& Sons, 2010.
Endnotes
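1. Remember the short-cut formulae to work out $b_{XY}$ and $b_{YX}$:

$$b_{XY} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2} \quad \text{and} \quad b_{YX} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$$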

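2. In case we take the assumed mean to be zero for the X variable as well as for the Y variable, then our
formula will be as under:

$$r = \frac{\dfrac{\sum XY}{n} - \left(\dfrac{\sum X}{n}\right)\left(\dfrac{\sum Y}{n}\right)}{\sqrt{\dfrac{\sum X^2}{n} - \left(\dfrac{\sum X}{n}\right)^2}\;\sqrt{\dfrac{\sum Y^2}{n} - \left(\dfrac{\sum Y}{n}\right)^2}}$$

or

$$r = \frac{\sum XY - \dfrac{\sum X \sum Y}{n}}{\sqrt{\sum X^2 - \dfrac{(\sum X)^2}{n}}\;\sqrt{\sum Y^2 - \dfrac{(\sum Y)^2}{n}}}$$

or

$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\sum X^2 - n\bar{X}^2}\;\sqrt{\sum Y^2 - n\bar{Y}^2}}$$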

Unit 7

Regression

Structure
7.1 Introduction
Objectives
7.2 Regression Analysis
7.3 Simple Linear Regression Model
7.4 Summary
7.5 Glossary
7.6 Terminal Questions
7.7 Answers
7.8 Further Reading

7.1 Introduction
In the previous unit, you learnt about correlation, a technique that measures
the degree of relationship between variables.
In this unit, you will learn about regression analysis. Regression is a
statistical measure that determines the strength of relationship between a
dependent variable (variable to be predicted) and one or more independent
variables (variables on which the prediction is based). It is a commonly used
tool in forecasting and financial analysis. For instance, suppose you want to
forecast sales for your company and it is seen that your company's sales go up
and down depending on changes in GDP. The sales you are forecasting would
be the dependent variable because their value depends on the value of GDP,
which, in turn, would be the independent variable. You would then need to
determine the strength of the relationship between these two variables in order
to forecast sales. If GDP increases/decreases by 1%, how much will your sales
increase or decrease? The regression equation is y=bx+a, where y is the
dependent variable which we intend to forecast, x is the independent variable,
b is the slope of the regression and a is the y-intercept.
You can use this simple model to solve your business problems. If your
research leads you to believe that the next GDP change will be a certain
percentage, you can plug that percentage into the model and generate a sales
forecast. This can help you develop a more objective plan and budget for the
upcoming year. You will also learn about the scatter diagram, least squares
method and standard error of estimate.

Objectives
After studying this unit, you should be able to:
Describe how assumptions are made in regression analysis
Explain simple linear regression model
Define scatter diagram method and least square method
Judge the accuracy of estimating equation
Compute and interpret standard error of the estimate

7.2 Regression Analysis


The term regression was first used in 1877 by Sir Francis Galton who made a
study that showed that the height of children born to tall parents will tend to
move back or regress toward the mean height of the population. He designated
the word regression as the name of the process of predicting one variable from
another variable. Regression analysis is a statistical technique that attempts to
explore and model the relationship between two or more variables. For example,
an analyst may want to know if there is a relationship between road accidents
and the age of the driver. If we find a correlation between these two, then it is
possible to make use of this relationship in making estimates and to forecast
the number of road accidents (dependent variable) on the basis of
the age of the drivers (independent variable). Regression analysis forms an
important part of the statistical analysis of the data obtained from designed
experiments. The results of regression along with the results from the analysis
of variance provide information that is useful to identify significant factors in an
experiment and explore the nature of the relationship between these factors
and the response. Similarly, an investigator may employ regression analysis to
test a theory involving a cause and effect relationship. All this explains why
regression analysis is an extremely useful tool, especially in problems of business
and industry involving predictions.

7.2.1 Assumptions in Regression Analysis


While making use of the regression techniques for making predictions, it is
always assumed that:
(a) There is an actual relationship between the dependent and independent
variables.
(b) The values of the dependent variable are random but the values of the
independent variable are fixed quantities without error and are chosen by
the experimenter.
(c) There is a clear indication of the direction of the relationship. This means that
the dependent variable is a function of the independent variable. (For example,
when we say that advertising has an effect on sales, we are not saying
that sales has an effect on advertising).
(d) The conditions (that existed when the relationship between the dependent
and independent variable was estimated by the regression) are the same
when the regression model is being used. In other words, it simply means
that the relationship has not changed since the regression equation was
computed.
(e) The analysis is to be used to predict values within the range (and not for
values outside the range) for which it is valid.
Activity 1
Construct a regression line for r = +1.00 and r = −1.00.

Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) The values of the dependent variable are random but the values of
the independent variable are fixed quantities without error and are
chosen by the ________________.
(b) The conditions that existed when the relationship between the
dependent and independent variable was estimated by the regression
are the same when the ___________ model is being used.
2. State whether true or false.
(a) The regression analysis is to be used to predict values within the
range (and not for values outside the range) for which it is valid.
(b) There is not an actual relationship between the dependent and
independent variables.


7.3 Simple Linear Regression Model


In case of simple linear regression analysis, a single variable is used to predict
another variable on the assumption of linear relationship (i.e., relationship of
the type defined by Y = a + bX) between the given variables. The variable to be
predicted is called the dependent variable and the variable on which the prediction
is based is called the independent variable.
The simple linear regression model¹ (or the Regression Line) is stated as,
$Y_i = a + bX_i + e_i$
Where,
$Y_i$ is the dependent variable
$X_i$ is the independent variable
$e_i$ is the unpredictable random element (usually called the
residual or error term)

(a) a represents the Y-intercept, i.e., the intercept specifies the value of the
dependent variable when the independent variable has a value of zero.
(But this term has practical meaning only if a zero value for the independent
variable is possible).
(b) b is a constant, indicating the slope of the regression line. Slope of the
line indicates the amount of change in the value of the dependent variable
for a unit change in the independent variable.
If the two constants (viz., a and b) are known, the accuracy of our prediction
of Y (denoted by $\hat{Y}$ and read as Y-hat) depends on the magnitude of the values
of $e_i$. If, in the model, all the $e_i$ tend to have very large values, then the estimates
will not be very good, but if these values are relatively small, then the predicted
values ($\hat{Y}$) will tend to be close to the true values ($Y_i$).
Estimating the Intercept and Slope of the Regression Model (or Estimating
the Regression Equation)
The two constants or the parameters viz., a and b in the regression model for
the entire population or universe are generally unknown and as such are
estimated from sample information. The following are the two methods used for
estimation:
(a) Scatter diagram method
(b) Least squares method


7.3.1 Scatter Diagram Method


This method makes use of the Scatter diagram, also known as Dot diagram.
A Scatter diagram² represents two series on graph paper, with the known variable,
i.e., the independent variable, plotted on the X-axis and the variable to be estimated,
i.e., the dependent variable, plotted on the Y-axis. Figure 7.1 is the scatter
diagram for the following information:

Income X                 Consumption Expenditure Y
(Hundreds of Rupees)     (Hundreds of Rupees)
      41                        44
      65                        60
      50                        39
      57                        51
      96                        80
      94                        68
     110                        84
      30                        34
      79                        55
      65                        48

The scatter diagram by itself is not sufficient for predicting values of the
dependent variable. Some formal expression of the relationship between the
two variables is necessary for predictive purposes. For the purpose, one may
simply take a ruler and draw a straight line through the points in the scatter
diagram, determine the intercept and the slope of that line, and then define
the line as $\hat{Y} = a + bX_i$, with the help of which we can
predict Y for a given value of X. But there are shortcomings in this approach.
For example, if five different persons draw such a straight line in the same
scatter diagram, it is possible that there may be five different estimates of a and
b, especially when the dots are more dispersed in the diagram. Hence, the
estimates cannot be worked out reliably through this approach alone. A more systematic
and statistical method is required to estimate the constants of the predictive
equation. The least squares method is used to draw the best fit line.


Figure 7.1 Scatter Diagram (Y-axis: Consumption Expenditure, '00 Rs; both axes scaled from 0 to 120)

7.3.2 Least Square Method


Least squares method of fitting a line (the line of best fit or the regression line)
through the scatter diagram is a method which minimizes the sum of the squared
vertical deviations from the fitted line. In other words, the line to be fitted will
pass through the points of the scatter diagram in such a way that the sum of the
squares of the vertical deviations of these points from the line will be a minimum.
The meaning of the least squares criterion can be easily understood
through reference to Figure 7.2 drawn below, where the earlier figure in scatter
diagram has been reproduced along with a line which represents the least
squares line fit to the data.

Figure 7.2 Scatter Diagram, Regression Line and Short Vertical Lines Representing e


In Figure 7.2, the vertical deviations of the individual points from the line
are shown as the short vertical lines joining the points to the least squares line.
These deviations will be denoted by the symbol e. The value of e varies from
one point to another. In some cases it is positive, while in others it is negative.
If the line drawn happens to be the least squares line, then the sum of the squared
values of $e_i$ is the least possible. It is because of this feature that the method is
known as the Least Squares Method.
Why we insist on minimizing the sum of squared deviations is a question
that needs explanation. If we denote the deviation of the actual value $Y$ from
the estimated value $\hat{Y}$ as $(Y - \hat{Y})$ or $e_i$, it is logical that we want
$\sum (Y - \hat{Y})$, or $\sum_{i=1}^{n} e_i$, to be as small as possible. However, merely examining
$\sum (Y - \hat{Y})$, or $\sum_{i=1}^{n} e_i$, is inappropriate, since any $e_i$ can be positive or negative. Large
positive values and large negative values could cancel one another. But large
values of $e_i$, regardless of their sign, indicate a poor prediction. Even if we ignore
the signs while working out $\sum_{i=1}^{n} |e_i|$, where $|e_i| = e_i$ if $e_i \ge 0$ and
$|e_i| = -e_i$ if $e_i < 0$, the difficulties may continue. Hence, the
standard procedure is to eliminate the effect of signs by squaring each
deviation. Squaring each term accomplishes two purposes, viz., (i) it magnifies
(or penalizes) the larger errors, and (ii) it cancels the effect of the positive and
negative values (since a negative error when squared becomes positive). The
choice of minimizing the sum of squared errors rather than the sum of the
absolute values implies a preference for many small errors over a few large
errors. Hence, in obtaining the regression line, we follow the approach that the
sum of the squared deviations be a minimum and on this basis work out the
values of its constants, viz., a and b, also known as the intercept and the slope
of the line. This is done with the help of the following two normal equations:³
$\sum Y = na + b\sum X$

$\sum XY = a\sum X + b\sum X^2$
In the above two equations, a and b are unknowns and all other values,
viz., $\sum X$, $\sum Y$, $\sum X^2$ and $\sum XY$, are the sums, sums of squares and sums of
cross products to be calculated from the sample data, and n means the number
of observations in the sample.
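The two normal equations can be solved mechanically once the four sums are known. The following is a minimal Python sketch of that step, not part of the original text; the function name `fit_least_squares` is an illustrative assumption, and the data are the income and consumption figures tabulated in Section 7.3.1.

```python
def fit_least_squares(xs, ys):
    """Solve the two normal equations for the intercept a and slope b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Eliminating a from  sum(Y) = n*a + b*sum(X)  and
    # sum(XY) = a*sum(X) + b*sum(X^2)  gives the slope b.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # The first normal equation then gives the intercept a.
    a = (sum_y - b * sum_x) / n
    return a, b


# Income (X) and consumption expenditure (Y), both in hundreds of rupees.
income = [41, 65, 50, 57, 96, 94, 110, 30, 79, 65]
consumption = [44, 60, 39, 51, 80, 68, 84, 34, 55, 48]

a, b = fit_least_squares(income, consumption)
print(round(a, 3), round(b, 3))
```

Working from the raw sums mirrors the hand calculation in the text; for large values a deviation-from-mean form is usually less sensitive to rounding.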


The following examples explain the Least squares method.


Example 7.1: Fit a regression line $\hat{Y} = a + bX_i$ by the method of Least squares
to the given sample information.

Observations                            1    2    3    4    5    6    7    8    9   10
Income (X) ('00 Rs)                    41   65   50   57   96   94  110   30   79   65
Consumption Expenditure (Y) ('00 Rs)   44   60   39   51   80   68   84   34   55   48

Solution: We are to fit a regression line $\hat{Y} = a + bX_i$ to the given data by the
method of Least squares. Accordingly, we work out the a and b values with
the help of the normal equations as stated above and, for the purpose,
work out the $\sum X$, $\sum Y$, $\sum XY$ and $\sum X^2$ values from the given sample information in the
following table.

Summations for Regression Equation

Observations   Income X    Consumption Expenditure Y      XY        X²        Y²
               ('00 Rs)    ('00 Rs)
     1            41              44                     1804      1681      1936
     2            65              60                     3900      4225      3600
     3            50              39                     1950      2500      1521
     4            57              51                     2907      3249      2601
     5            96              80                     7680      9216      6400
     6            94              68                     6392      8836      4624
     7           110              84                     9240     12100      7056
     8            30              34                     1020       900      1156
     9            79              55                     4345      6241      3025
    10            65              48                     3120      4225      2304
  n = 10      ΣX = 687        ΣY = 563            ΣXY = 42358  ΣX² = 53173  ΣY² = 34223

Putting the values in the required normal equations we have,

563 = 10a + 687b
42358 = 687a + 53173b
Solving these two equations for a and b we obtain,
a = 14.000 and b = 0.616

Hence, the equation for the required regression line is,

$\hat{Y} = a + bX_i$

or,

$\hat{Y} = 14.000 + 0.616X_i$

This equation is known as the regression equation of Y on X, from which
Y values can be estimated for given values of the X variable.⁴

7.3.3 Checking the Accuracy of Equation
After finding the regression line as stated above, one can check its accuracy
also. The method to be used for the purpose follows from the mathematical
property of a line fitted by the method of least squares, viz., the individual positive
and negative errors must sum to zero. In other words, using the estimating
equation one must find out whether the term $\sum (Y - \hat{Y})$ is zero and, if this is so,
then one can be reasonably sure that no mistake has been committed in
determining the estimating equation.
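The check can be sketched in a few lines of Python. This is an illustration added here under the assumption that the data are those of Example 7.1; it also shows that when the coefficients are rounded (a = 14.000, b = 0.616), the errors sum to a small nonzero figure purely because of rounding, so "zero" should be read as "zero up to rounding".

```python
income = [41, 65, 50, 57, 96, 94, 110, 30, 79, 65]
consumption = [44, 60, 39, 51, 80, 68, 84, 34, 55, 48]

n = len(income)
mean_x = sum(income) / n
mean_y = sum(consumption) / n

# Least squares slope and intercept in deviation-from-mean form.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(income, consumption))
     / sum((x - mean_x) ** 2 for x in income))
a = mean_y - b * mean_x

# Errors from the unrounded line: these cancel out up to floating-point noise.
exact_error_sum = sum(y - (a + b * x) for x, y in zip(income, consumption))

# Errors from the rounded equation Y-hat = 14.000 + 0.616 X.
rounded_error_sum = sum(y - (14.000 + 0.616 * x)
                        for x, y in zip(income, consumption))

print(exact_error_sum, rounded_error_sum)
```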

The Problem of Prediction

When we talk about prediction or estimation, we usually imply that if the
relationship $Y_i = a + bX_i + e_i$ exists, then the regression equation, $\hat{Y} = a + bX_i$,
provides a base for making estimates of the value of Y which will be
associated with particular values of X. In Example 7.1, we worked out the
regression equation for the income and consumption data as,
$\hat{Y} = 14.000 + 0.616X_i$

On the basis of this equation we can make a point estimate of Y for any
given value of X. Suppose we wish to estimate the consumption expenditure of
individuals with income of Rs 10,000. We substitute X = 100 (income in hundreds
of rupees) in our equation and get an estimate of consumption expenditure as follows:
$\hat{Y} = 14.000 + 0.616 \times 100 = 75.60$

Thus, the regression relationship indicates that individuals with Rs 10,000 of


income may be expected to spend approximately Rs 7,560 on consumption.
But this is only an expected or an estimated value and it is possible that the
actual consumption expenditure of an individual with that income may
deviate from this amount; if so, our estimate will be in error, the
likelihood of which will be high if the estimate is applied to any one individual.
The interval estimate method is considered better and it states an interval in
which the expected consumption expenditure may fall. Remember that the
wider the interval, the greater the level of confidence we can have, but the
width of the interval (or what is technically known as the precision of the
estimate) is associated with a specified level of confidence and is dependent
on the variability (consumption expenditure in our case) found in the sample.
This variability is measured by the standard deviation of the error term, e,
and is popularly known as the standard error of the estimate.

Standard Error of Estimate


Standard error of estimate is a measure developed by the statisticians for
measuring the reliability of the estimating equation. Like standard deviation,
the Standard Error (S.E.) of $\hat{Y}$ measures the variability or scatter of the
observed values of Y around the regression line. Standard Error of Estimate
(S.E. of $\hat{Y}$) is worked out as under:

$$\text{S.E. of } \hat{Y} \ (\text{or } S_e) = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{\sum e^2}{n - 2}}$$

where,
S.E. of $\hat{Y}$ (or $S_e$) = Standard error of the estimate
$Y$ = Observed value of Y
$\hat{Y}$ = Estimated value of Y
$e$ = The error term = $(Y - \hat{Y})$
$n$ = Number of observations in the sample
Note: In the above formula, n − 2 is used instead of n because
two degrees of freedom are lost in basing the estimate on the variability of the
sample observations about the line with two constants, viz., a and b, whose
position is determined by those same sample observations.
The square of the $S_e$, also known as the variance of the error term, is the
basic measure of reliability. The larger the variance, the more significant the
magnitudes of the e's and the less reliable the regression analysis in predicting
the data.
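As an illustration of the formula, the sketch below computes the standard error of estimate for the income and consumption data of Example 7.1, using the rounded coefficients a = 14.000 and b = 0.616 from that example. The function name is a hypothetical choice made for this sketch, and the resulting figure is not quoted in the original text.

```python
import math


def standard_error_of_estimate(xs, ys, a, b):
    """Return sqrt( sum((Y - Y_hat)^2) / (n - 2) ) for the line Y_hat = a + bX."""
    n = len(xs)
    sum_sq_errors = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # n - 2 degrees of freedom, because a and b are both estimated from the sample.
    return math.sqrt(sum_sq_errors / (n - 2))


income = [41, 65, 50, 57, 96, 94, 110, 30, 79, 65]
consumption = [44, 60, 39, 51, 80, 68, 84, 34, 55, 48]

se = standard_error_of_estimate(income, consumption, 14.000, 0.616)
print(round(se, 2))  # in hundreds of rupees, the same unit as Y
```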
Interpreting the Standard Error of Estimate and Finding the Confidence
Limits for the Estimate in Large and Small Samples
The larger the S.E. of estimate (SEe), the greater happens to be the dispersion,
or scattering, of given observations around the regression line. But if the S.E. of
estimate happens to be zero then the estimating equation is a perfect estimator
(i.e., cent per cent correct estimator) of the dependent variable.
In case of large samples, i.e., where n > 30 in a sample, it is assumed
that the observed points are normally distributed around the regression
line and we may find,
68% of all points within $\hat{Y} \pm 1\,SE_e$ limits
95.5% of all points within $\hat{Y} \pm 2\,SE_e$ limits
99.7% of all points within $\hat{Y} \pm 3\,SE_e$ limits
This rests on the assumptions that,
(i) The observed values of Y are normally distributed around each estimated
value of $\hat{Y}$ and;
(ii) The variance of the distributions around each possible value of $\hat{Y}$ is the
same.
In case of small samples, i.e., where n ≤ 30 in a sample, the t distribution
is used for finding the two limits more appropriately.
This is done as follows:
Upper limit = $\hat{Y} + t\,(SE_e)$
Lower limit = $\hat{Y} - t\,(SE_e)$
Where,
$\hat{Y}$ = The estimated value of Y for a given value of X.
$SE_e$ = The standard error of estimate.
$t$ = Table value of t for given degrees of freedom for a
specified confidence level.
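A short Python sketch of the small-sample limits follows. The inputs are illustrative assumptions: the point estimate 75.60 is the one obtained in the text for X = 100, the standard error 5.70 is a value assumed here for illustration (the original does not quote one), and t = 2.306 is the standard table value for 8 degrees of freedom (n − 2 with n = 10) at the 95% confidence level.

```python
def confidence_limits(y_hat, se, t_value):
    """Return (lower, upper) limits  Y_hat -/+ t * SE_e  as given in the text."""
    margin = t_value * se
    return y_hat - margin, y_hat + margin


y_hat = 75.60    # point estimate from the text, in hundreds of rupees
se = 5.70        # assumed standard error of estimate (illustrative)
t_value = 2.306  # t table value: 8 degrees of freedom, 95% confidence

lower, upper = confidence_limits(y_hat, se, t_value)
print(round(lower, 2), round(upper, 2))
```

The t value is passed in rather than computed so that the sketch needs only the standard library; in practice it would be read from a t table for n − 2 degrees of freedom.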

7.3.4 Some Other Details Concerning Simple Regression


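Sometimes the estimating equation of Y, also known as the Regression equation
of Y on X, is written as follows:

$$(\hat{Y} - \bar{Y}) = r\,\frac{\sigma_Y}{\sigma_X}\,(X_i - \bar{X})$$

or,

$$\hat{Y} = r\,\frac{\sigma_Y}{\sigma_X}\,(X_i - \bar{X}) + \bar{Y}$$

Where,
$r$ = Coefficient of simple correlation between X and Y
$\sigma_Y$ = Standard deviation of Y
$\sigma_X$ = Standard deviation of X
$\bar{X}$ = Mean of X
$\bar{Y}$ = Mean of Y
$\hat{Y}$ = Value of Y to be estimated
$X_i$ = Any given value of X for which Y is to be estimated.

This is based on the formula we have used, i.e., $\hat{Y} = a + bX_i$. The coefficient
of $X_i$ is defined as,

$$\text{Coefficient of } X_i = b = r\,\frac{\sigma_Y}{\sigma_X}$$

(Also known as the regression coefficient of Y on X, or the slope of the regression
line of Y on X, or $b_{YX}$.)

$$b_{YX} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\sum Y^2 - n\bar{Y}^2}\;\sqrt{\sum X^2 - n\bar{X}^2}} \times \frac{\sqrt{\sum Y^2 - n\bar{Y}^2}}{\sqrt{\sum X^2 - n\bar{X}^2}} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$$

and

$$a = -r\,\frac{\sigma_Y}{\sigma_X}\,\bar{X} + \bar{Y} = \bar{Y} - b\bar{X} \qquad \left(\text{since } b = r\,\frac{\sigma_Y}{\sigma_X}\right)$$

Similarly, the estimating equation of X, also known as the regression
equation of X on Y, can be stated as:

$$(\hat{X} - \bar{X}) = r\,\frac{\sigma_X}{\sigma_Y}\,(Y_i - \bar{Y})$$

or

$$\hat{X} = r\,\frac{\sigma_X}{\sigma_Y}\,(Y_i - \bar{Y}) + \bar{X}$$

and the regression coefficient of X on Y (or $b_{XY}$) is

$$b_{XY} = r\,\frac{\sigma_X}{\sigma_Y} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2}$$
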

If we are given the two regression equations as stated above, along with
the values of their constants a and b, and solve them simultaneously for X
and Y, then the values of X and Y so obtained are the mean value of X (i.e., $\bar{X}$)
and the mean value of Y (i.e., $\bar{Y}$).

If we are given the two regression coefficients (viz., $b_{XY}$ and $b_{YX}$), then we
can work out the value of the coefficient of correlation by just taking the square root
of the product of the regression coefficients as shown below:

$$r = \sqrt{b_{YX} \cdot b_{XY}} = \sqrt{r\,\frac{\sigma_Y}{\sigma_X} \cdot r\,\frac{\sigma_X}{\sigma_Y}} = \sqrt{r \cdot r} = r$$

The (±) sign of r will be determined on the basis of the sign of the regression
coefficients given. If the regression coefficients have a minus sign then r will be taken
with a minus (−) sign and if the regression coefficients have a plus sign then r will be
taken with a plus (+) sign. (Remember that both regression coefficients will
necessarily have the same sign, whether it is minus or plus, for their sign is
governed by the sign of the coefficient of correlation.)
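The rule, including the choice of sign, can be written as a small Python function. This is a sketch added for illustration; the function name and the validity checks are assumptions of this sketch, not part of the original text.

```python
import math


def correlation_from_coefficients(b_yx, b_xy):
    """Return r = sqrt(b_yx * b_xy), signed like the regression coefficients."""
    product = b_yx * b_xy
    if product < 0:
        raise ValueError("both regression coefficients must have the same sign")
    if product > 1:
        raise ValueError("product exceeds 1, so r would be greater than 1")
    r = math.sqrt(product)
    # r takes the common sign of the two regression coefficients.
    return -r if b_yx < 0 else r


r_positive = correlation_from_coefficients(0.768, 1.2)
r_negative = correlation_from_coefficients(-0.8, -0.45)
print(round(r_positive, 2), round(r_negative, 2))
```

The second check (product greater than 1) is the same test used in Example 7.3 to decide which of two given equations is the regression of Y on X.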
Example 7.2: Given is the following information:

                        X        Y
Mean                   39.5     47.5
Standard Deviation     10.8     17.8

Simple correlation coefficient between X and Y is +0.42.
Find the estimating equations of Y and X.
Solution:
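Estimating equation of Y can be worked out as,

$$(\hat{Y} - \bar{Y}) = r\,\frac{\sigma_Y}{\sigma_X}\,(X_i - \bar{X})$$

or

$$\hat{Y} = r\,\frac{\sigma_Y}{\sigma_X}\,(X_i - \bar{X}) + \bar{Y}$$

$$= 0.42 \times \frac{17.8}{10.8}\,(X_i - 39.5) + 47.5$$

$$= 0.69X_i - 27.25 + 47.5$$

$$= 0.69X_i + 20.25$$

Similarly, the estimating equation of X can be worked out as under:

$$(\hat{X} - \bar{X}) = r\,\frac{\sigma_X}{\sigma_Y}\,(Y_i - \bar{Y})$$

or

$$\hat{X} = r\,\frac{\sigma_X}{\sigma_Y}\,(Y_i - \bar{Y}) + \bar{X}$$

$$= 0.42 \times \frac{10.8}{17.8}\,(Y_i - 47.5) + 39.5$$

$$= 0.255Y_i - 12.11 + 39.5$$

$$= 0.255Y_i + 27.39$$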
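The same calculation can be sketched in Python, working from the means, standard deviations and r only. The function name `regression_lines` is an assumption made for this sketch. Because the code does not round the slope before computing the intercept, its constants differ slightly from hand-worked figures that round at each step.

```python
def regression_lines(mean_x, mean_y, sd_x, sd_y, r):
    """Return ((a, b) for Y on X, (a, b) for X on Y) from summary statistics."""
    b_yx = r * sd_y / sd_x          # slope of Y on X
    a_yx = mean_y - b_yx * mean_x   # intercept of Y on X
    b_xy = r * sd_x / sd_y          # slope of X on Y
    a_xy = mean_x - b_xy * mean_y   # intercept of X on Y
    return (a_yx, b_yx), (a_xy, b_xy)


(a_yx, b_yx), (a_xy, b_xy) = regression_lines(39.5, 47.5, 10.8, 17.8, 0.42)
print(f"Y-hat = {a_yx:.2f} + {b_yx:.3f} X")
print(f"X-hat = {a_xy:.2f} + {b_xy:.3f} Y")
```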
Example 7.3: Given is the following data:
Variance of X = 9
Regression equations:
4X − 5Y + 33 = 0
20X − 9Y − 107 = 0
Find: (i) Mean values of X and Y.
(ii) Coefficient of Correlation between X and Y.
(iii) Standard deviation of Y.
Solution:
(i) For finding the mean values of X and Y, we solve the two given regression
equations for the values of X and Y as follows:
4X − 5Y + 33 = 0        (1)
20X − 9Y − 107 = 0      (2)

If we multiply Equation (1) by 5, we have the following equations:
20X − 25Y = −165        (3)
20X − 9Y = 107          (2)

Subtracting Equation (2) from Equation (3) we get,
−16Y = −272
or
Y = 17
Putting this value of Y in Equation (1) we have,
4X = −33 + 5(17)

or

$$X = \frac{-33 + 85}{4} = \frac{52}{4} = 13$$

Hence, $\bar{X} = 13$ and $\bar{Y} = 17$


(ii) For finding the coefficient of correlation, first of all we presume one of the
two given regression equations as the estimating equation of X. Let
equation 4X − 5Y + 33 = 0 be the estimating equation of X, then we have,

$$\hat{X} = \frac{5Y_i}{4} - \frac{33}{4}$$

and from this we can write $b_{XY} = \frac{5}{4}$.

The other given equation is then taken as the estimating equation of Y
and can be written as,

$$\hat{Y} = \frac{20X_i}{9} - \frac{107}{9}$$

and from this we can write $b_{YX} = \frac{20}{9}$.

If the above equations are correct then r must be equal to,

$$r = \sqrt{\frac{5}{4} \times \frac{20}{9}} = \sqrt{\frac{25}{9}} = \frac{5}{3} = 1.67$$

which is an impossible value, since r can in no case be greater than 1.
Hence, we change our supposition about the estimating equations and,
by reversing it, we re-write the estimating equations as under:

$$\hat{X} = \frac{9Y_i}{20} + \frac{107}{20} \quad \text{and} \quad \hat{Y} = \frac{4X_i}{5} + \frac{33}{5}$$

Hence,

$$r = \sqrt{\frac{9}{20} \times \frac{4}{5}} = \sqrt{\frac{9}{25}} = \frac{3}{5} = 0.6$$

Since the regression coefficients have plus signs, we take r = +0.6.
(iii) Standard deviation of Y can be calculated as follows:
Variance of X = 9
Standard deviation of X = 3

$$b_{YX} = r\,\frac{\sigma_Y}{\sigma_X} \quad \text{i.e.,} \quad \frac{4}{5} = 0.6 \times \frac{\sigma_Y}{3} = 0.2\,\sigma_Y$$

Hence, $\sigma_Y = 4$

Alternatively, we can work it out as under:

$$b_{XY} = r\,\frac{\sigma_X}{\sigma_Y} \quad \text{i.e.,} \quad \frac{9}{20} = 0.6 \times \frac{3}{\sigma_Y} = \frac{1.8}{\sigma_Y}$$

Hence, $\sigma_Y = 4$
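The three steps of Example 7.3 can be retraced with a short Python sketch. It is an illustration only: each line is stored as a tuple (p, q, c) standing for pX + qY + c = 0, and the helper names are assumptions of this sketch.

```python
import math

# Each regression line is written as p*X + q*Y + c = 0.
line1 = (4.0, -5.0, 33.0)
line2 = (20.0, -9.0, -107.0)


def intersection(l1, l2):
    """Solve the two lines simultaneously; the solution is (mean X, mean Y)."""
    p1, q1, c1 = l1
    p2, q2, c2 = l2
    det = p1 * q2 - p2 * q1
    return (q1 * c2 - q2 * c1) / det, (p2 * c1 - p1 * c2) / det


def pick_coefficients(l1, l2):
    """Try both role assignments; keep the one whose product b_yx*b_xy is <= 1."""
    for y_on_x, x_on_y in ((l1, l2), (l2, l1)):
        b_yx = -y_on_x[0] / y_on_x[1]   # slope when the line is solved for Y
        b_xy = -x_on_y[1] / x_on_y[0]   # slope when the line is solved for X
        if 0 <= b_yx * b_xy <= 1:
            return b_yx, b_xy
    raise ValueError("no valid assignment of the two lines")


mean_x, mean_y = intersection(line1, line2)
b_yx, b_xy = pick_coefficients(line1, line2)
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

sd_x = math.sqrt(9)          # variance of X is given as 9
sd_y = b_yx * sd_x / r       # from b_yx = r * sd_y / sd_x
print(mean_x, mean_y, round(r, 2), round(sd_y, 2))
```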
Activity 2
Regression of savings (S) of a family on income (Y) may be expressed as
$S = a + \frac{Y}{m}$, where a and m are constants. In a random sample of 100 families,
the variance of savings is one-quarter of the variance of incomes and the
coefficient of correlation is found to be +0.4. Obtain the estimate of m.
Example 7.4: Heights of the father and son are given below. Find the height of
the son when the height of the father is 69 inches.

Father's height (inches)   71   68   66   67   70   71   70   73   72   65   66
Son's height (inches)      69   64   65   63   65   62   65   64   66   59   62


Solution: Let father's height be X and son's height be Y.

Regression line of Y on X

   X     (X − X̄) = x     x²      Y     (Y − Ȳ) = y     y²      xy
  71         +2           4     69         +5          25     +10
  68         −1           1     64          0           0       0
  66         −3           9     65         +1           1      −3
  67         −2           4     63         −1           1      +2
  70         +1           1     65         +1           1      +1
  71         +2           4     62         −2           4      −4
  70         +1           1     65         +1           1      +1
  73         +4          16     64          0           0       0
  72         +3           9     66         +2           4      +6
  65         −4          16     59         −5          25     +20
  66         −3           9     62         −2           4      +6
ΣX = 759   Σx = 0     Σx² = 74  ΣY = 704  Σy = 0    Σy² = 66  Σxy = 39

$$(Y - \bar{Y}) = r\,\frac{\sigma_y}{\sigma_x}\,(X - \bar{X})$$

$$\bar{Y} = \frac{\sum Y}{N} = \frac{704}{11} = 64; \qquad \bar{X} = \frac{\sum X}{N} = \frac{759}{11} = 69$$

Note

$$r\,\frac{\sigma_y}{\sigma_x} = \frac{\sum xy}{\sum x^2} = \frac{39}{74} = 0.527$$

$(Y - 64) = 0.527\,(X - 69)$
$Y = 0.527X + 27.64$
For X = 69, $Y = 0.527\,(69) + 27.64 = 64.003 \approx 64$.

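The deviation method used in Example 7.4 can be sketched in Python as follows. The variable and function names are illustrative assumptions; the slope is computed as Σxy / Σx², where x and y are deviations from the respective means.

```python
father = [71, 68, 66, 67, 70, 71, 70, 73, 72, 65, 66]
son = [69, 64, 65, 63, 65, 62, 65, 64, 66, 59, 62]

n = len(father)
mean_x = sum(father) / n
mean_y = sum(son) / n

# Deviations from the means.
dev_x = [x - mean_x for x in father]
dev_y = [y - mean_y for y in son]

# Slope of Y on X: sum(xy) / sum(x^2) in deviation form.
b = sum(dx * dy for dx, dy in zip(dev_x, dev_y)) / sum(dx * dx for dx in dev_x)


def predict_son_height(father_height):
    """Estimate Y from the line (Y - mean_y) = b (X - mean_x)."""
    return mean_y + b * (father_height - mean_x)


print(round(b, 3), predict_son_height(69))
```

Since 69 inches is exactly the mean of the fathers' heights, the estimate equals the mean of the sons' heights, which illustrates that the regression line passes through the point of means.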

Example 7.5: Obtain the two regression equations for the following data using
the method of least squares:

x     1    2    3    4    5
y     5    7    9   10   11

Solution:

   x        y        xy        x²        y²
   1        5         5         1        25
   2        7        14         4        49
   3        9        27         9        81
   4       10        40        16       100
   5       11        55        25       121
Σx = 15  Σy = 42  Σxy = 141  Σx² = 55  Σy² = 376

Regression equation of y on x:
y = a + bx
where
$\sum y = Na + b\sum x$
and
$\sum xy = a\sum x + b\sum x^2$
Thus,
42 = 5a + 15b          ...(i)
141 = 15a + 55b        ...(ii)
Solving (i) and (ii), we get a = 3.9 and b = 1.5
Thus, y = 3.9 + 1.5x
Regression equation of x on y:
x = a + by
where
$\sum x = Na + b\sum y$
and
$\sum xy = a\sum y + b\sum y^2$
Thus,
15 = 5a + 42b          ...(iii)
and
141 = 42a + 376b       ...(iv)
Solving (iii) and (iv), we get

$$a = -\frac{141}{58} \approx -2.43 \quad \text{and} \quad b = \frac{75}{116} \approx 0.647$$

Thus,

$$x = -2.43 + 0.647\,y$$

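Both regression equations of Example 7.5 can be obtained from one routine by swapping the roles of the two variables. The Python sketch below is illustrative (the function name is an assumption); solving equations (iii) and (iv) exactly gives b = 75/116 and a = −141/58 for the regression of x on y.

```python
def solve_normal_equations(xs, ys):
    """Return (a, b) for the least squares line  ys = a + b * xs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 10, 11]

# Regression of y on x: x is the predictor.
a_yx, b_yx = solve_normal_equations(x, y)
# Regression of x on y: swap the roles so that y is the predictor.
a_xy, b_xy = solve_normal_equations(y, x)

print(f"y = {a_yx:.2f} + {b_yx:.2f} x")
print(f"x = {a_xy:.2f} + {b_xy:.3f} y")
```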

Example 7.6: The following table shows the ages (x) and blood pressure (y) of
8 persons.

x     52   63   45   36   72   65   47   25
y     62   53   51   25   79   43   60   33

Obtain the regression equation of y on x and find the expected blood pressure of
a person who is 49 years old.
Solution: Let Ax = 50 and Ay = 50 (Assumed means)

   x     (x − 50) = dx     dx²       y     (y − 50) = dy     dy²      dxdy
  52          +2             4      62         +12           144      +24
  63         +13           169      53          +3             9      +39
  45          −5            25      51          +1             1       −5
  36         −14           196      25         −25           625     +350
  72         +22           484      79         +29           841     +638
  65         +15           225      43          −7            49     −105
  47          −3             9      60         +10           100      −30
  25         −25           625      33         −17           289     +425
Σx = 405   Σdx = 5    Σdx² = 1737  Σy = 406  Σdy = 6    Σdy² = 2058  Σdxdy = 1336

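$$(y - \bar{y}) = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$$

$$\bar{y} = \frac{\sum y}{N} = \frac{406}{8} = 50.75; \qquad \bar{x} = \frac{\sum x}{N} = \frac{405}{8} = 50.625$$

$$r\,\frac{\sigma_y}{\sigma_x} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_x^2 - \left(\sum d_x\right)^2} = \frac{8(1336) - (5)(6)}{8(1737) - 5^2} = 0.768$$

(y − 50.75) = 0.768 (x − 50.625)
or
y = 11.87 + 0.768x

$y_{49}$ = 11.87 + 0.768 (49) = 49.502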
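The assumed-mean method of Example 7.6 can be sketched in Python as below. Names such as `assumed_x` are illustrative. The slope uses deviations from the assumed means (50 and 50), while the line itself is anchored at the true means.

```python
age = [52, 63, 45, 36, 72, 65, 47, 25]
pressure = [62, 53, 51, 25, 79, 43, 60, 33]

assumed_x = 50   # assumed mean for x
assumed_y = 50   # assumed mean for y
n = len(age)

# Deviations from the assumed means.
dx = [x - assumed_x for x in age]
dy = [y - assumed_y for y in pressure]

# Slope of y on x from deviations about the assumed means.
numerator = n * sum(u * v for u, v in zip(dx, dy)) - sum(dx) * sum(dy)
denominator = n * sum(u * u for u in dx) - sum(dx) ** 2
b = numerator / denominator

mean_x = sum(age) / n
mean_y = sum(pressure) / n
a = mean_y - b * mean_x     # the line passes through the true means

estimate_at_49 = a + b * 49
print(round(b, 3), round(a, 2), round(estimate_at_49, 1))
```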


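Example 7.7: The equations of two regression lines obtained in a correlation analysis
of 60 observations are 5x = 6y + 24 and 1000y = 768x − 3608. What is the correlation
coefficient and what is its probable error?
Show that the ratio of the coefficient of variation of x to that of y is $\frac{5}{24}$. What is
the ratio of the variances of x and y?

Solution: The equations of the regression lines are given as
5x = 6y + 24 and 1000y = 768x − 3608

$$b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = \frac{6}{5} \qquad ...(i)$$

and

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = \frac{768}{1000} \qquad ...(ii)$$

Multiplying these, we get

$$b_{xy}\,b_{yx} = r^2 = \frac{6}{5} \times \frac{768}{1000} = 0.9216, \quad \text{so } r = \pm 0.96$$

Since both $b_{xy}$ and $b_{yx}$ are positive, the correlation coefficient r is also positive and
hence r = +0.96.
Also, probable error of r,

$$P.E._r = 0.6745\,\frac{1 - r^2}{\sqrt{N}} = 0.6745 \times \frac{1 - 0.96^2}{\sqrt{60}} \approx 0.0068$$

Also we know that each regression line passes through $(\bar{x}, \bar{y})$. So from the given
equations of these lines we have
$5\bar{x} = 6\bar{y} + 24$
and
$1000\bar{y} = 768\bar{x} - 3608$
Solving these we get
$\bar{x} = 6$ and $\bar{y} = 1$        ...(iii)
Also from (i), we have $r\,\frac{\sigma_x}{\sigma_y} = \frac{6}{5}$, where r = 0.96,
or

$$\frac{\sigma_x}{\sigma_y} = \frac{6}{5} \times \frac{1}{0.96} = \frac{5}{4} \qquad ...(iv)$$

And the ratio of the coefficient of variation of x to that of y

$$= \frac{\sigma_x / \bar{x}}{\sigma_y / \bar{y}} = \frac{\bar{y}}{\bar{x}} \times \frac{\sigma_x}{\sigma_y} = \frac{1}{6} \times \frac{5}{4} = \frac{5}{24} \qquad \text{(from (iii) and (iv))}$$

It follows from (iv) that the ratio of the variance of x to that of y is $\left(\frac{5}{4}\right)^2 = \frac{25}{16}$.
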
Self-Assessment Questions
3. State whether true or false.
(a) The scatter diagram by itself is not sufficient for predicting values of
the dependent variable.
(b) The interval estimate method is considered worse as it states an
interval in which the expected consumption expenditure may fall.
4. Fill in the blanks with the appropriate terms.
(a) In case of simple linear regression analysis, a single variable is used
to __________ another variable on the assumption of linear
relationship (i.e., relationship of the type defined by Y = a + bX)
between the given variables.
(b) Standard error of estimate is a measure developed by the statisticians
for measuring the reliability of the _____________ equation.

7.4 Summary
Let us recapitulate the important concepts discussed in this unit:
The term regression was first used in 1877 by Sir Francis Galton, who
designated it as the name of the process of predicting one variable from
another variable.
When there is a well established relationship between variables, it is
possible to make use of this relationship in making estimates and to
forecast the value of one variable (the unknown or the dependent variable)
on the basis of the other variable/s (the known or the independent
variable/s).
There is an actual relationship between dependent and independent
variables.


In case of simple linear regression analysis, a single variable is used to


predict another variable on the assumption of linear relationship (i.e.,
relationship of the type defined by Y = a + bX) between the given variables.
The variable to be predicted is called the dependent variable and the
variable on which the prediction is based is called the independent variable.
Scatter diagram is also known as Dot diagram. Scatter diagram represents
two series with the known variable, i.e., independent variable plotted on
the X-axis and the variable to be estimated, i.e., dependent variable to be
plotted on the Y-axis.
Least squares method of fitting a line (the line of best fit or the regression
line) through the scatter diagram is a method which minimizes the sum of
the squared vertical deviations from the fitted line.
The variability in sample is measured by the standard deviation of the
error term, e, and is popularly known as the standard error of the estimate.
The larger the S.E. of estimate (SEe), the greater happens to be the
dispersion, or scattering, of given observations around the regression
line.
The (±) sign of r will be determined on the basis of the sign of the regression
coefficients given. If regression coefficients have minus sign then r will be
taken with minus (−) sign and if regression coefficients have plus sign
then r will be taken with plus (+) sign.

7.5 Glossary
Regression analysis: A relationship used for making estimates and
forecasts about the value of one variable (the unknown or the dependent
variable) on the basis of the other variable/s (the known or the independent
variable/s).
Scatter diagram: Also known as a Dot diagram, used to represent two
series with the known variables, i.e., independent variable plotted on the
X-axis and the variable to be estimated, i.e., dependent variable to be
plotted on the Y-axis on a graph paper for the given information.
Standard error of estimate: A measure developed by statisticians for
measuring the reliability of the estimating equation.


7.6 Terminal Questions


1. Define regression analysis. How will you predict the value of a dependent
variable?
2. Differentiate between Scatter diagram and Least Squares method.
3. Can the accuracy of estimated equation be checked? Explain.
4. How is the standard error of estimate calculated?
5. What is a Scatter diagram? How does it help in studying correlation
between two variables? Explain.

7.7 Answers
Answers to Self-Assessment Questions
1. (a) Experimentor; (b) Regression
2. (a) True; (b) False
3. (a) True; (b) False
4. (a) Predict; (b) Estimating

Answers to Terminal Questions


1. Refer Section 7.2
2. Refer Section 7.3.1
3. Refer Sections 7.3.1 and 7.3.2
4. Refer Section 7.3.3
5. Refer Section 7.3.3

7.8 Further Reading


1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.
2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand
& Sons, 2010.

Endnotes
1. Usually the estimate of Y, denoted by $\hat{Y}$, is written as
$$\hat{Y} = a + bX_i$$
on the assumption that the random disturbance to the system averages out or has an
expected value of zero (i.e., e = 0) for any single observation. This regression model is
known as the regression line of Y on X, from which the value of Y can be estimated for
the given value of X.

2. [Figure: five scatter diagrams, numbered (1) to (5)]

The five diagrams depict five possible forms that a scatter diagram may assume. The first
indicates a perfect positive relationship, the second a perfect negative relationship, the
third no relationship, the fourth a positive relationship and the fifth a negative
relationship between the two variables under consideration.
3. If we proceed by centring each variable, i.e., setting its origin at its mean, then the two
equations will be as under:
$$\sum Y = na + b\sum X$$
$$\sum XY = a\sum X + b\sum X^2$$
But since $\sum Y$ and $\sum X$ will be zero, the first equation and the first term of the second
equation will disappear and we shall simply have the following equations:
$$\sum XY = b\sum X^2$$
$$b = \frac{\sum XY}{\sum X^2}$$
The value of a can then be worked out as:
$$a = \bar{Y} - b\bar{X}$$
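The centred calculation in endnote 3 can be sketched in Python as follows. This is only an illustration: the function name `fit_line_centred` and the small data set are hypothetical and do not come from the text.

```python
def fit_line_centred(xs, ys):
    """Fit Y = a + bX by measuring both variables from their means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Deviations from the means; each list sums to zero by construction.
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    # b = (sum of products of deviations) / (sum of squared X deviations)
    b = sum(u * v for u, v in zip(dx, dy)) / sum(u * u for u in dx)
    # a is recovered from the original means.
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data lying exactly on the line Y = 1 + 2X.
a, b = fit_line_centred([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```

Because the hypothetical points lie exactly on a line, the fitted intercept and slope reproduce that line.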

4. It should be pointed out that the equation used to estimate the Y variable values from
values of X should not be used to estimate the values of X variable from given values of
Y variable. Another regression equation (known as the regression equation of X on Y of
the type X = a + bY) that reverses the two value should be used if it is desired to estimate
X from value of Y.


Unit 8

Time Series

Structure
8.1 Introduction
Objectives
8.2 Components of Time Series
8.3 Different Methods of Measuring Trend
8.4 Different Methods of Measuring Seasonal Variations
8.5 Summary
8.6 Glossary
8.7 Terminal Questions
8.8 Answers
8.9 Further Reading

8.1 Introduction
In the previous unit, you learnt about regression analysis and its significance in
data analysis.
In this unit, you will learn how time series analysis differs from regression
analysis. We often see a number of charts on company drawing boards or in
newspapers, where we see lines going up and down from left to right on a
graph. The vertical axis represents a variable such as productivity or crime data
in the city and the horizontal axis represents the different periods of increasing
time such as days, weeks, months or years. A time series can be defined as a
set of numeric observations of a dependent variable, measured at specific points
in time in chronological order, usually at equal intervals. Analysis of the
movements of such variables over periods of time, in order to determine the
relationship of time to such variables, is referred to as time series analysis.
You will also learn that one of the major elements of planning and
specifically strategic planning of any organization is accurately forecasting the
future events that would have an impact on the operations of an organization.
Previous performances must be studied so as to forecast future activity. Even in
our daily lives, we plan our future events on the basis of a reasonable estimate
of the future environment that would affect our plans, whether it is forecasting
rain on our picnic on Saturday or forecasting economic conditions for ten years.
Textbook publishers, for example, must predict future sales of books to print
enough copies for students. Financial advisors must predict the values of a

variety of economic factors in order to advise clients regarding stocks, bonds


and other business opportunities. Similarly, hotel builders in a city must project
the future influx of tourists, and so on. The quality of such forecasts is strongly
related to all the relevant information that can be extracted and used from past
data. In that respect, time series can be used to determine patterns in past data
over a period of time and extrapolate the data into the future.

Objectives
After studying this unit, you should be able to:
Analyse the components of time series
Explain the different methods of measuring trend
Calculate simple averages and moving averages
Measure irregular variations and seasonal adjustments

8.2 Components of Time Series


The time series analysis method is quite accurate where the future is expected
to be similar to the past. The underlying assumption in time series is that the
same factors will continue to influence the future patterns of economic activity
in a similar manner as in the past. These techniques are fairly sophisticated and
require experts to use these methods.
The classical approach to analyse a time series is in terms of four distinct
types of variations or separate components that influence a time series.
1. Secular Trend or Simply Trend (T). Trend is a general long-term
movement in the time series value of the variable (Y) over a fairly long
period of time. The variable (Y) is the factor that we are interested in
evaluating for the future. It could be sales, population, crime rate and so
on.
These variables are observed over a long period of time and any
changes related to time are noted and calculated and a trend of these
changes is established.
If a trend can be determined and the rate of change can be
ascertained, then tentative estimates on the same series values into the
future can be made. However, such forecasts are based on the assumption
that the conditions affecting the steady growth or decline are reasonably
expected to remain unchanged in the future. A change in these conditions


would affect the forecasts. For example, a time series involving increase
in population over time is shown in Figure 8.1.

Figure 8.1 Time Series Graph on Population Increase

2. Cyclical Fluctuations (C). These refer to regular swings or patterns that


repeat over a long period of time. The movements are considered cyclical
only if they occur after time intervals of more than one year. These are the
changes that take place as a result of economic booms or depressions.
These may be up or down, and are recurrent in nature and have a duration
of several years usually lasting for two to ten years. These movements
also differ in intensity or amplitude and each phase of movement changes
gradually into the phase that follows it.
The cyclic variation for revenues in an industry against time is shown
graphically in Figure 8.2.

Figure 8.2 Cyclic Variation for Revenues

3. Seasonal Variation (S). This involves patterns of change that repeat over
a period of one year or less. Then they repeat from year to year and they
are brought about by fixed events. For example, sales of consumer items
increase prior to Deepawali due to the tradition of giving gifts.

Since these variations repeat during a period of twelve months, they
can be predicted fairly accurately. Some factors that cause seasonal
variations are:
(i) Season and climate. Changes in the climate and weather conditions
have a profound effect on sales. For example, the sale of umbrellas
in India is always more during monsoon season. Similarly, during
winter, there is a greater demand for woollen clothes and hot drinks,
while during summer months, there is an increase in sales of fans
and air conditioners.
(ii) Customs and festivals. Customs and traditions affect the pattern
of seasonal spending. For example, in India, festivals such as
Baisakhi and Diwali mean a big demand for sweets and candy.
An accurate assessment of seasonal behaviour is an aid in
business planning and scheduling such as in the area of production,
inventory control, personnel, advertising, and so on. The seasonal
fluctuations over four repeating quarters in a given year for sale of a
given item is illustrated in Figure 8.3.

Figure 8.3 Seasonal Fluctuations Over Four Quarters in a Year

4. Irregular or Random Variation (I). These variations are accidental,


random or simply due to chance factors. Thus, they are wholly
unpredictable. These fluctuations may be caused by such isolated incidents
as floods, famines, strikes or wars. Sudden changes in demand or a
breakthrough in a technological development may be included in this


category. Accordingly, it is almost impossible to isolate and measure the


value and the impact of these erratic movements on forecasting models
or techniques. This phenomenon is graphically shown in Figure 8.4.

Figure 8.4 Irregular or Random Variation

It is traditionally acknowledged that the value of the time series (Y) is a


function of the impact of variable trend (T), seasonal variation (S), cyclical
variation (C) and irregular fluctuation (I). These relationships may vary depending
upon assumptions and purposes. The effects of these four components might
be additive, multiplicative, or a combination thereof in a number of ways. However,
the traditional time series analysis model is characterized by multiplicative
relationship, so that:
Y = T × S × C × I

This model is appropriate for those situations where percentage changes


best represent movement in the series and the components are not viewed as
absolute values but as relative values.
Another approach to define the relationship may be additive, so that:
Y=T+S+C+I
This model is useful when the variations in the time series are in absolute
values and can be separated and traced to each of these four parts and each
part can be measured independently.
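The difference between the two models can be illustrated with a short Python sketch. The component values below are hypothetical and chosen only for illustration: in the multiplicative model S, C and I are relative factors around 1, while in the additive model they are absolute amounts in the units of the series.

```python
def multiplicative_value(t, s, c, i):
    """Y = T x S x C x I, with S, C and I expressed as relative factors."""
    return t * s * c * i


def additive_value(t, s, c, i):
    """Y = T + S + C + I, with every component in the units of Y."""
    return t + s + c + i


# Hypothetical components: trend of 400 units, a seasonal peak of +20%,
# a cyclical dip of 5% and an irregular effect of +2%.
y_mult = multiplicative_value(400, 1.20, 0.95, 1.02)

# The same story told in absolute units: +80 seasonal, -20 cyclical, +5 irregular.
y_add = additive_value(400, 80, -20, 5)

print(round(y_mult, 2), y_add)
```

In the multiplicative form the seasonal effect grows with the trend level, whereas in the additive form it stays a fixed amount.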


Activity 1
The Indian Motorcycle Company is concerned about declining sales in the
Western region. The following data shows monthly sales (in millions of ₹) of
the motorcycles for the past twelve months.

Month        Sales (in millions of ₹)
January      6.5
February     6.0
March        6.3
April        5.1
May          5.6
June         4.8
July         4.0
August       3.6
September    3.5
October      3.1
November     3.0
December     3.0

(i) Plot the trend line and describe the relationship between sales and
time.
(ii) What is the average monthly change in sales?
(iii) If the monthly sales fall below ₹ 2.4 million, then the Western region
office must be closed. Is it likely that the office will be closed during
the next six months?

Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Trend is a general long-term ____________ in the time series value
of the variable (Y) over a fairly long period of time.
(b) Cyclic fluctuations refer to ____________ swings or patterns that
repeat over a long period of time.


2. State whether true or false.


(a) The time series analysis method is quite accurate where the future
is expected to be similar to the past.
(b) Changes in the climate and weather conditions have a profound effect
on sales.

8.3 Different Methods of Measuring Trend


8.3.1 Trend Analysis
While chance variations are difficult to identify, separate, control or predict, a
more precise measurement of trend, cyclical effects and seasonal effects can
be made in order to make the forecasts more reliable. In this section, we discuss
techniques that would allow us to describe trend.
When a time series shows an upward or downward long-term linear trend,
then regression analysis can be used to estimate this trend and project the
trends into forecasting the future values of the variables involved. The equation
for the straight line which we have used to describe the linear relationship
between independent variable X and dependent variable Y is:
Y = b0 + b1X
Here,
b0 = Intercept on the Y-axis and b1 = Slope of the straight line
In time series analysis, the independent variable is time, so we will use
the symbol t in place of X and we will use the symbol Yt in place of Yc which
we have used previously.
Hence, the equation for linear trend is given as:
Yt = b0 + b1t
Here,
Yt = Forecast value of the time series in period t
b0 = Intercept of the trend line on Y-axis
b1 = Slope of the trend line
t = Time period
As discussed earlier, we can calculate the values of b0 and b1 by the
following formulae:
$$b_1 = \frac{n\sum ty - (\sum t)(\sum y)}{n\sum t^2 - (\sum t)^2}, \quad \text{and} \quad b_0 = \bar{y} - b_1\bar{t}$$


Here,
y = Actual value of the time series in period t
n = Number of periods
$\bar{y}$ = Average value of the time series = $\frac{\sum y}{n}$
$\bar{t}$ = Average value of t = $\frac{\sum t}{n}$
Knowing these values, we can calculate the value of Yt.
Example 8.1: A car fleet owner has five cars which have been in the fleet for
several different years. The manager wants to establish if there is a linear
relationship between the age of the car and the repairs in hundreds of dollars
for a given year. This way, he can predict the repair expenses for each year as
the cars become older. The information for the repair costs he collected for last
year on these cars is given as follows:
Car #    Age (t)    Repairs (Y)
1        1          4
2        3          6
3        3          7
4        5          7
5        6          9

The manager wants to predict the repair expenses for the next year for
the two cars that are three years old now.
Solution: The trend in repair costs suggests a linear relationship with the age
of the car, so that the linear regression equation is given as:
$$Y_t = b_0 + b_1 t$$
Here,
$$b_1 = \frac{n\sum ty - (\sum t)(\sum y)}{n\sum t^2 - (\sum t)^2}$$
and,
$$b_0 = \bar{y} - b_1\bar{t}$$


To calculate the various values, let us form a new table as follows:

         Age of Car (t)    Repair Cost (Y)    tY     t²
         1                 4                  4      1
         3                 6                  18     9
         3                 7                  21     9
         5                 7                  35     25
         6                 9                  54     36
Total    18                33                 132    80

Knowing that n = 5, let us substitute these values to calculate the
regression coefficients b0 and b1.
Then,
$$b_1 = \frac{5(132) - (18)(33)}{5(80) - (18)^2} = \frac{660 - 594}{400 - 324} = \frac{66}{76} = 0.87$$
and,
$$b_0 = \bar{y} - b_1\bar{t}$$
Here,
$$\bar{y} = \frac{\sum y}{n} = \frac{33}{5} = 6.6$$
and,
$$\bar{t} = \frac{\sum t}{n} = \frac{18}{5} = 3.6$$
Then,
$$b_0 = 6.6 - 0.87(3.6) = 6.6 - 3.13 = 3.47$$
Hence,
$$Y_t = 3.47 + 0.87t$$
The cars that are 3 years old now will be 4 years old next year, so that
t = 4.

Hence,
$$Y_{(4)} = 3.47 + 0.87(4) = 3.47 + 3.48 = 6.95$$
Accordingly, the repair costs on each car that is 3 years old now are
expected to be 6.95 hundred dollars, i.e., 695.00 dollars.
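The arithmetic of Example 8.1 can be checked with a short Python sketch. The function name `linear_trend` is only illustrative; the data are the ages and repair costs given in the example. Because the code does not round b0 and b1 before forecasting, the result may differ slightly from the hand calculation in later decimal places.

```python
def linear_trend(ts, ys):
    """Return (b0, b1) for the least squares trend line Yt = b0 + b1*t."""
    n = len(ts)
    sum_t = sum(ts)
    sum_y = sum(ys)
    sum_ty = sum(t * y for t, y in zip(ts, ys))
    sum_t2 = sum(t * t for t in ts)
    b1 = (n * sum_ty - sum_t * sum_y) / (n * sum_t2 - sum_t ** 2)
    b0 = sum_y / n - b1 * (sum_t / n)
    return b0, b1


ages = [1, 3, 3, 5, 6]       # age of each car (t)
repairs = [4, 6, 7, 7, 9]    # repair cost in hundreds (Y)

b0, b1 = linear_trend(ages, repairs)
forecast = b0 + b1 * 4       # cars that are 3 years old now will be 4 next year

print(round(b1, 2), round(b0, 2), round(forecast, 2))
```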

8.3.2 Measuring the Cyclical Effect


Cyclic variation, as we have discussed before, is a pattern that repeats over
time periods longer than one year. These variations are generally unpredictable
in relation to the time of occurrence, duration as well as amplitude. However,
these variations have to be separated and identified. The measure we use to
identify cyclical variation is the percentage of trend, and the procedure used is
known as the residual trend.
As we have discussed earlier, there are four components of time series.
These are secular trend (T), seasonal variation (S), cyclical variation (C) and
irregular (or chance) variation (I). Since the time period considered for seasonal
variation is less than one year, it can be excluded from the study, because when
we look at time series consisting of annual data spread over many years, then
only the secular trend, cyclical variation and irregular variation are considered.
Since secular trend component can be described by the trend line (usually
calculated by line of regression), we can isolate cyclical and irregular components
from the trend. Furthermore, since irregular variation occurs by chance and
cannot be predicted or identified accurately, it can be reasonably assumed that
most of the variation in time series left unexplained by the trend component can
be explained by the cyclical component. In that respect, cyclical variation can
be considered as the residual, once other causes of variation have been
identified.
The measure of cyclic variation as percentage of trend is calculated as
follows:
(i) Determine the trend line (usually by regression analysis).
(ii) Compute the trend value Yt for each time period (t) under consideration.
(iii) Calculate the ratio Y/Yt for each time period.
(iv) Multiply this ratio by 100 to get the percentage of trend, so that,
$$\text{Percentage of trend} = \frac{Y}{Y_t} \times 100$$

Example 8.2: The following is the data for energy consumption (measured in
quadrillions of BTU) in the United States from 1981 to 1986 as reported in the
statistical abstracts of the United States.
Year    Time Period (t)    Annual Energy Consumption (Y)
1981    1                  74.0
1982    2                  70.8
1983    3                  70.5
1984    4                  74.1
1985    5                  74.0
1986    6                  73.9

Assuming a linear trend, calculate the percentage of trend for each year
(cyclical variation).
Solution: First, we find the secular trend by the regression line method which is
given by:
$$Y_t = b_0 + b_1 t$$
Here,
$$b_1 = \frac{n\sum ty - (\sum t)(\sum y)}{n\sum t^2 - (\sum t)^2}$$
and,
$$b_0 = \bar{y} - b_1\bar{t}$$
Let us make a table for these values.
t          Y             tY             t²
1          74.0          74.0           1
2          70.8          141.6          4
3          70.5          211.5          9
4          74.1          296.4          16
5          74.0          370.0          25
6          73.9          443.4          36
Σt = 21    ΣY = 437.3    ΣtY = 1536.9   Σt² = 91

Substituting these values we get,
$$b_1 = \frac{6(1536.9) - (21)(437.3)}{6(91) - (21)^2} = \frac{9221.4 - 9183.3}{546 - 441} = \frac{38.1}{105} = 0.363$$
and,
$$b_0 = \bar{y} - b_1\bar{t}$$
Here,
$$\bar{y} = \frac{\sum y}{n} = \frac{437.3}{6} = 72.88, \qquad \bar{t} = \frac{\sum t}{n} = \frac{21}{6} = 3.5$$
Hence,
$$b_0 = 72.88 - 0.363(3.5) = 72.88 - 1.27 = 71.61$$
Then,
$$Y_t = 71.61 + 0.363t$$

Calculating the value of Yt for each time period, we get the following table
for percentage of trend (Y/Yt) × 100.

Time Period    Energy Consumption    Trend    Percentage of Trend
(t)            (Y)                   (Yt)     (Y/Yt) × 100
1              74.0                  71.97    102.82
2              70.8                  72.34    97.87
3              70.5                  72.70    96.97
4              74.1                  73.06    101.42
5              74.0                  73.43    100.77
6              73.9                  73.79    100.15

The following graph shows the actual energy consumption (Y), trend line
(Yt) and the cyclical fluctuations above and below the trend line over the time
period (t) for 6 years.

[Figure: actual energy consumption (Y) and trend line (Yt) plotted against time period (t)]

Frequently, we draw a graph of cyclic variation as the percentage of trend.


This process eliminates the trend line and isolates the cyclical component of
the time series.
It must be understood that cyclical fluctuations are not accurately
predictable, and hence, we cannot predict the future cyclic variations based
upon such past cyclic variations.

The percentage of trend figures show that in 1981, the actual consumption
of energy was 102.82% of expected consumption that year and in 1983, the
actual consumption was 96.97% of the expected consumption.
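The percentage-of-trend steps for Example 8.2 can be sketched in Python as follows. The variable names are illustrative. The code keeps b0 and b1 at full precision, so its output can differ from the hand-calculated table in the second decimal place, where the text rounds the coefficients first.

```python
years = [1981, 1982, 1983, 1984, 1985, 1986]
periods = [1, 2, 3, 4, 5, 6]
consumption = [74.0, 70.8, 70.5, 74.1, 74.0, 73.9]

# Step (i): fit the linear trend Yt = b0 + b1*t by least squares.
n = len(periods)
sum_t = sum(periods)
sum_y = sum(consumption)
sum_ty = sum(t * y for t, y in zip(periods, consumption))
sum_t2 = sum(t * t for t in periods)
b1 = (n * sum_ty - sum_t * sum_y) / (n * sum_t2 - sum_t ** 2)
b0 = sum_y / n - b1 * (sum_t / n)

# Step (ii): trend value for each period.
trend = [b0 + b1 * t for t in periods]

# Steps (iii) and (iv): ratio of actual to trend, expressed as a percentage.
percent_of_trend = [100 * y / yt for y, yt in zip(consumption, trend)]

for year, y, yt, pct in zip(years, consumption, trend, percent_of_trend):
    print(year, y, round(yt, 2), round(pct, 2))
```

Values above 100 indicate years in which consumption was above the trend line, and values below 100 indicate years below it.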


Self-Assessment Questions
3. State whether true or false.
(a) The four components of time series are secular trend (T), seasonal
variation (S), cyclical variation (C) and irregular (or chance) variation
(I).
(b) The measure used to identify cyclical variation is the residual trend
and the procedure used is the percentage of trend.
4. Fill in the blanks with the appropriate terms.
(a) When a time series shows an upward or downward long-term linear
trend, then regression analysis can be used to ______________
this trend and project the trends into forecasting the future values of
the variables involved.
(b) Cyclic variation is a pattern that ___________________ over time
periods longer than one year.

8.4 Different Methods of Measuring Seasonal Variations


Seasonal variation has been defined as the predictable and repetitive movement
around the trend line in a period of one year or less. For the measurement of
seasonal variation, the time interval involved may be in terms of days, weeks,
months or quarters. Because of the predictability of seasonal trends, we can
plan in advance to meet these variations. For example, study of seasonal
variations in the production data makes it possible to plan for hiring of additional
personnel for peak periods of production or to accumulate an inventory of raw
materials or to allocate vacation time to personnel, and so on. Some of the
methods used for the measurement of seasonal variations are described as
follows.

8.4.1 Simple Averages


This is the simplest method of isolating seasonal fluctuations in time series. It is
based on the assumption that the series contain only the seasonal and irregular
fluctuations. Assume that the time series involve monthly data over a time period
of, say, five years. Assume further that we want to find the seasonal index for


the month of March. (The seasonal variation will be the same for March in every
year. The seasonal index describes the degree of seasonal variation.)
Then the seasonal index for the month of March will be calculated as
follows:
$$\text{Seasonal index for March} = \frac{\text{Monthly average for March}}{\text{Average of monthly averages}} \times 100$$

The following steps can be used in the calculation of seasonal index


(variation) for the month of March (or any month), over the 5-year period,
regarding the sale of cars by one distributor.
(i) Calculate the average sale of cars for the month of March over the last
five years.
(ii) Calculate the average sale of cars for each month over the five years and
then calculate the average of these monthly averages.
(iii) Use the given formula to calculate seasonal index for March.
Let us say that the average sale of cars for the month of March over the
period of 5 years is 360, and the average of all monthly averages is 316. Then
the seasonal index for March = (360/316) × 100 = 113.92.
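The simple averages method can be sketched in Python as follows. The function name `seasonal_indices` and the two-year monthly sales figures are hypothetical and serve only to show the three steps; the single March calculation uses the figures quoted in the text.

```python
def seasonal_indices(values_by_month):
    """Simple averages method: {month: [value per year]} -> {month: index}."""
    # Average sale for each month across the years.
    monthly_avg = {m: sum(v) / len(v) for m, v in values_by_month.items()}
    # Average of the monthly averages.
    grand_avg = sum(monthly_avg.values()) / len(monthly_avg)
    # Each monthly average as a percentage of the grand average.
    return {m: 100 * avg / grand_avg for m, avg in monthly_avg.items()}


# The March calculation from the text: (360 / 316) x 100.
march_index = 100 * 360 / 316
print(round(march_index, 2))

# Hypothetical car sales for two years (not taken from the text).
sales = {
    "Jan": [300, 320], "Feb": [280, 300], "Mar": [350, 370],
    "Apr": [330, 340], "May": [310, 330], "Jun": [290, 300],
    "Jul": [270, 290], "Aug": [280, 300], "Sep": [320, 330],
    "Oct": [360, 380], "Nov": [340, 350], "Dec": [300, 310],
}
indices = seasonal_indices(sales)
print(round(indices["Mar"], 2), round(indices["Jul"], 2))
```

By construction the twelve indices average to 100, so a month with an index above 100 sells more than the typical month.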

8.4.2 Moving Averages


This is the most widely used method of measuring seasonal variations. The
seasonal index is based upon a mean of 100 with the degree of seasonal variation
(seasonal index) measured by variations away from this base value. For example,
if we look at the seasonality of rental of row boats at the lake during the three
summer months (a quarter) and we find that the seasonal index is 135 and we
also know that the total boat rentals for the entire last year was 1680, then we
can estimate the number of summer rentals for the row boats.
The average number of boats rented per quarter = 1680/4 = 420.
The seasonal index of 135 for the summer quarter means that the summer
rentals are 135 per cent of the average quarterly rentals.
Hence, summer rentals = 420 × (135/100) = 567.
The steps required to compute the seasonal index can be enumerated by
illustrating an example.


Example 8.3: Assume that a record of rental of row boats for the previous 3
years on a quarterly basis is given as follows:
Year    Rentals Per Quarter             Total
        I       II      III     IV
1991    350     300     450     400     1500
1992    330     360     500     410     1600
1993    370     350     520     440     1680

Solution:
Step 1. The first step is to calculate the four-quarter moving total for time series.
This total is associated with the middle data point in the set of values for the four
quarters, shown as follows.
Year    Quarters    Rentals    Moving Total
1991    I           350
        II          300
                               1500
        III         450
        IV          400

The moving total for the given values of four quarters is 1500, which is
simply the addition of the four quarter values. This value of 1500 is placed in the
middle of values 300 and 450 and recorded in the next column. For the next
moving total of the four quarters, we will drop the value of the first quarter, which
is 350, from the total and add the value of the fifth quarter (in other words, first
quarter of the next year), and this total will be placed in the middle of the next
two values, which are 450 and 400, and so on. These values of the moving
totals are shown in column 4 of the next table.
Step 2. The next step is to calculate the four-quarter moving average. This can be
done by dividing the four-quarter moving total, as calculated in Step 1, by 4,
since there are 4 quarters. The four-quarter moving average is recorded in column
5 in the next table. The entire table of calculations is shown as follows:


Year   Quarters   Rentals   Quarter   Quarter   Quarter    Percentage of
                            Moving    Moving    Centred    Actual to
                            Total     Average   Moving     Centred
                                                Average    Moving Average
(1)    (2)        (3)       (4)       (5)       (6)        (7)
1991   I          350
       II         300
                            1500      375.0
       III        450                           372.50     120.80
                            1480      370.0
       IV         400                           377.50     105.96
                            1540      385.0
1992   I          330                           391.25     84.35
                            1590      397.5
       II         360                           398.75     90.28
                            1600      400.0
       III        500                           405.00     123.45
                            1640      410.0
       IV         410                           408.75     100.30
                            1630      407.5
1993   I          370                           410.00     90.24
                            1650      412.5
       II         350                           416.25     84.08
                            1680      420.0
       III        520
       IV         440

Step 3. After the moving averages for each consecutive 4 quarters have been
taken, we centre these moving averages. As we see from the above table,
the quarterly moving average falls between the quarters. This is because the
number of quarters is even, namely 4. If we had an odd number of time periods,
such as 7 days of the week, then the moving average would already be centred
and the third step here would not be necessary. Accordingly, we centre our
averages in order to associate each average with the corresponding quarter,
rather than between the quarters. This is shown in column 6, where the centred
moving average is calculated as the average of the two consecutive moving
averages.
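Steps 1 to 3 can be sketched in Python as follows, using the row boat rentals of Example 8.3. The helper name `moving_totals` is only illustrative.

```python
# Quarterly rentals for 1991-1993 in chronological order.
rentals = [350, 300, 450, 400, 330, 360, 500, 410, 370, 350, 520, 440]


def moving_totals(values, window=4):
    """Sum of each run of `window` consecutive values."""
    return [sum(values[i:i + window]) for i in range(len(values) - window + 1)]


# Step 1: four-quarter moving totals (column 4).
totals = moving_totals(rentals)

# Step 2: four-quarter moving averages (column 5).
averages = [total / 4 for total in totals]

# Step 3: centred moving averages (column 6), the mean of each pair of
# consecutive moving averages. centred[k] lines up with rentals[k + 2],
# i.e. the first entry belongs to quarter III of 1991.
centred = [(a + b) / 2 for a, b in zip(averages, averages[1:])]

print(totals)
print(centred)
```

Twelve quarters give nine moving totals and eight centred values, which is why the first two and the last two quarters have no centred moving average.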
The moving average (or the centred moving average) aims to eliminate
seasonal and irregular fluctuations (S and I) from the original time series, so
that this average represents the cyclical and trend components of the series.
As the following graph shows for this data, the centred moving average
has smoothed the peaks and troughs of the original time series.

[Figure: original rentals series and centred moving average plotted by quarter]

Step 4. Column 7 in the table contains calculated entries which are percentages
of the actual values to the corresponding centred moving average values. For
example, the first centred moving average of 372.50 in the table
has the corresponding actual value of 450, so that the percentage of actual
value to centred moving average would be:
$$\frac{\text{Actual value}}{\text{Centred moving average value}} \times 100 = \frac{450}{372.5} \times 100 = 120.80$$
Step 5. The purpose of this step is to eliminate the remaining cyclical and irregular
fluctuations still present in the values in Column 7 of the table. This can be done
by calculating the modified mean for each quarter. The modified mean for each
quarter of the three-year time period under consideration is calculated as follows.


(i) Make a table of values in column 7 of the previous table (percentage of
actual to moving average values) for each quarter of the three years as
shown in the following table.

Year    Quarter I    Quarter II    Quarter III    Quarter IV
1991    -            -             120.80         105.96
1992    84.35        90.28         123.45         100.30
1993    90.24        84.08         -              -

(ii) We take the average of these values for each quarter. It should be noted
that if there are many years and quarters taken into consideration instead
of 3 years as we have taken, then the highest and the lowest values from
each quarterly data would be discarded and the average of the remaining
data would be considered. By discarding the highest and the lowest values
from each quarter data, we tend to reduce the extreme cyclical and irregular
fluctuations, which are further smoothed when we average the remaining
values. Thus, the modified mean can be considered as an index of
seasonal component. This modified mean for each quarter data is shown
as follows:

Quarter I = $\frac{84.35 + 90.24}{2}$ = 87.295
Quarter II = $\frac{90.28 + 84.08}{2}$ = 87.180
Quarter III = $\frac{120.80 + 123.45}{2}$ = 122.125
Quarter IV = $\frac{105.96 + 100.30}{2}$ = 103.13
Total = 399.73
The modified means as calculated here are preliminary seasonal indices.
These indices should average 100 per cent, or total 400 for the 4 quarters.
However, our total is 399.73. This can be corrected by the following step.
Step 6. First, we calculate an adjustment factor. This is done by dividing the
desired or expected total of 400, by the actual total obtained of 399.73, so that,
$$\text{Adjustment} = \frac{400}{399.73} = 1.0007$$


By multiplying the modified mean for each quarter by the adjustment factor,
we get the seasonal index for each quarter, so that,
Quarter I = 87.295 × 1.0007 = 87.356
Quarter II = 87.180 × 1.0007 = 87.241
Quarter III = 122.125 × 1.0007 = 122.201
Quarter IV = 103.13 × 1.0007 = 103.202
Total = 400.000
Average seasonal index = 400/4 = 100
(This average seasonal index is approximated to 100 because of rounding
off errors).
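Steps 4 to 6 can be sketched end to end in Python as follows. Variable names are illustrative. With only two ratios per quarter there is nothing to discard, so the modified mean reduces to a plain mean here, as in the text. The code works at full precision, so its indices can differ from the hand-calculated ones in the second or third decimal place.

```python
rentals = [350, 300, 450, 400, 330, 360, 500, 410, 370, 350, 520, 440]
quarter_of = [1, 2, 3, 4] * 3  # quarter number of each observation

# Centred four-quarter moving average; centred[k] lines up with rentals[k + 2].
moving_avg = [sum(rentals[i:i + 4]) / 4 for i in range(len(rentals) - 3)]
centred = [(a + b) / 2 for a, b in zip(moving_avg, moving_avg[1:])]

# Step 4: percentage of actual to centred moving average, grouped by quarter.
ratios = {q: [] for q in (1, 2, 3, 4)}
for k, cma in enumerate(centred):
    pos = k + 2
    ratios[quarter_of[pos]].append(100 * rentals[pos] / cma)

# Step 5: preliminary seasonal index for each quarter (mean of its ratios).
preliminary = {q: sum(v) / len(v) for q, v in ratios.items()}

# Step 6: scale the four indices so that they total exactly 400.
adjustment = 400 / sum(preliminary.values())
seasonal_index = {q: m * adjustment for q, m in preliminary.items()}

for q in (1, 2, 3, 4):
    print(q, round(seasonal_index[q], 2))
```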
The logical meaning behind this method is based on the fact that the
centred moving average represents the influence of secular trend and cyclical
fluctuations (T × C), so that dividing the original series by it eliminates
this influence. This may be represented by the following expression:
$$\frac{T \times S \times C \times I}{T \times C} = S \times I$$
Here, (T × S × C × I) is the influence of trend, seasonal variations, cyclic
fluctuations and irregular or chance variations.
Thus, the ratio to moving average represents the influence of seasonal
and irregular components. However, if these ratios for each quarter over a period
of years are averaged, then most random or irregular fluctuations would be
eliminated so that,
$$\frac{S \times I}{I} = S$$
and this would give us the value of seasonal influences.

8.4.3 Measuring Irregular Variation and Seasonal Adjustments


Typically, irregular variation is random in nature, unpredictable and occurs over
comparatively short periods of time. Because of its unpredictability, it is generally
not measured or explained mathematically. Usually, subjective and logical


reasoning explains such variation. For example, cold weather in Brazil and
Colombia is considered responsible for increases in the price of coffee beans,
because cold weather destroys coffee plants. Similarly, the Persian Gulf War,
an irregular factor, resulted in an increase in airline and ship travel for a number of
months because of the movement of personnel and supplies. However, the
irregular component can be isolated by eliminating other components from the
time series data. For example, time series data contains (T × S × C × I)
components and if we can eliminate the (T × S × C) elements from the data, then
we are left with the (I) component. We can follow the previous example to determine
the (I) component as follows. The data presented has already been provided or
calculated.
Year    Quarters    Rentals               Centred Moving    (T × S × C × I)/(T × C)
                    Time Series Values    Average (T × C)   = S × I
                    (T × S × C × I)
1991    I           350                   -                 -
        II          300                   -                 -
        III         450                   372.50            1.208
        IV          400                   377.50            1.060
1992    I           330                   391.25            0.843
        II          360                   398.75            0.903
        III         500                   405.00            1.235
        IV          410                   408.75            1.003
1993    I           370                   410.00            0.902
        II          350                   416.25            0.841
        III         520                   -                 -
        IV          440                   -                 -

The seasonal indices for each quarter have already been calculated as:
Quarter I = 87.356
Quarter II = 87.241
Quarter III = 122.201
Quarter IV = 103.202


Then the seasonal influence is given by:


Quarter I = 87.356/100 = 0.874
Quarter II = 87.241/100 = 0.872
Quarter III = 122.201/100 = 1.222
Quarter IV = 103.202/100 = 1.032
Making another table of (S × I) values and (S) values and dividing (S × I)
by (S) we get the values of (I) as follows:

Year    Quarters    (S × I)    (S)      (I)
1991    I           -          -        -
        II          -          -        -
        III         1.208      1.222    0.988
        IV          1.060      1.032    1.027
1992    I           0.843      0.874    0.965
        II          0.903      0.872    1.036
        III         1.235      1.222    1.011
        IV          1.003      1.032    0.972
1993    I           0.902      0.874    1.032
        II          0.841      0.872    0.964
        III         -          -        -
        IV          -          -        -
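The division of (S × I) by (S) can be sketched in Python as follows. The variable names are illustrative; the inputs are the rounded (S × I) ratios and seasonal influences given above for the eight quarters from 1991 III to 1993 II. Results can differ from the table in the third decimal place because of rounding.

```python
# (S x I) ratios for 1991 III through 1993 II, in chronological order.
s_times_i = [1.208, 1.060, 0.843, 0.903, 1.235, 1.003, 0.902, 0.841]

# Quarter number of each of those observations.
quarters = [3, 4, 1, 2, 3, 4, 1, 2]

# Seasonal influence (seasonal index / 100) for each quarter.
seasonal = {1: 0.874, 2: 0.872, 3: 1.222, 4: 1.032}

# I = (S x I) / S for every quarter that has a centred moving average.
irregular = [ratio / seasonal[q] for ratio, q in zip(s_times_i, quarters)]

for q, value in zip(quarters, irregular):
    print(q, round(value, 3))
```

Values close to 1 indicate quarters in which the irregular component had little effect on rentals.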

Seasonal Adjustments
Many times, we read about time series values as seasonally adjusted. This is
accomplished by dividing the original time series values by their corresponding
seasonal indices. These deseasonalized values allow more direct and equitable
comparisons of values from different time periods. For example, in comparing
the demands for rental row boats (example that we have been following), it
would not be equitable to compare the demand of second quarter (spring) with
the demand of third quarter (summer), when the demand is traditionally higher.
However, these demand values can be compared when we remove the seasonal
influence from these time series values.
The seasonally-adjusted values for the demand of row boats in each
quarter are based on the values previously calculated and shown as follows.

Year  Quarter  Rentals          (S)    Seasonally-Adjusted  Rounded-off
               (T × S × C × I)         Values               Values
1991  I        350              -      -                    -
      II       300              -      -                    -
      III      450              1.222  368.25               368
      IV       400              1.032  387.60               388
1992  I        330              0.874  377.57               378
      II       360              0.872  412.80               413
      III      500              1.222  409.16               409
      IV       410              1.032  397.29               397
1993  I        370              0.874  423.34               423
      II       350              0.872  401.38               401
      III      520              -      -                    -
      IV       440              -      -                    -

The seasonally-adjusted value for each quarter is calculated as:

$$\text{Seasonally-adjusted value} = \frac{\text{Original value}}{\text{Seasonal index}}$$

These calculations complete the process of separating and identifying
the four components of the time series, namely secular trend (T), seasonal
variation (S), cyclical variation (C) and irregular variation (I).
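
The seasonal adjustment described in this section can be sketched as follows. The helper name and data layout are assumptions for illustration only; the rentals and seasonal factors are those used in the row boat example.

```python
# Illustrative sketch: deseasonalize quarterly rentals by dividing each
# original value by the seasonal factor (seasonal index / 100) of its quarter.
# Names are hypothetical; the data are from the row boat example in the text.

rentals = [450, 400, 330, 360, 500, 410, 370, 350]  # 1991-III to 1993-II
quarters = ["III", "IV", "I", "II", "III", "IV", "I", "II"]
seasonal = {"I": 0.874, "II": 0.872, "III": 1.222, "IV": 1.032}


def seasonally_adjust(values, quarter_labels, seasonal_factors):
    """Return original value / seasonal factor for every observation."""
    return [v / seasonal_factors[q] for v, q in zip(values, quarter_labels)]


adjusted = seasonally_adjust(rentals, quarters, seasonal)
rounded = [round(v) for v in adjusted]
print(rounded)
```

Once adjusted, the second-quarter and third-quarter figures can be compared directly, because the seasonal peak of summer has been divided out.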
Activity 2
The following data represents the quarterly earnings per share of a software
company for the last four years.
             Quarter
Year         I      II     III    IV
1st year     0.27   0.35   0.43   1.25
2nd year     0.40   0.55   0.45   1.35
3rd year     0.52   0.70   0.53   1.55
4th year     0.60   0.80   0.64   1.85


Analyse the quarterly time series to determine the effects of the trend,
cyclical, seasonal and irregular components.


Self-Assessment Questions
5. Fill in the blanks with the appropriate terms.
(a) Seasonal variation has been defined as the ________________ and
repetitive movement around the trend line in a period of one year or
less.
(b) Time series values can be seasonally ______________ by dividing
the original time series values by their corresponding seasonal
indices.
6. State whether true or false.
(a) Simple average is the difficult method of isolating seasonal
fluctuations in time series.
(b) Irregular variation is random in nature, unpredictable and occurs over
comparatively short periods of time.

8.5 Summary
Let us recapitulate the important concepts discussed in this unit:
• The time series analysis method is quite accurate where the future is
expected to be similar to the past. The underlying assumption in time
series is that the same factors will continue to influence the future patterns
of economic activity in a similar manner as in the past.
• Trend is a general long-term movement in the time series value of the
variable (Y) over a fairly long period of time. The variable (Y) is the factor
that we are interested in evaluating for the future.
• Cyclic fluctuations refer to regular swings or patterns that repeat over a
long period of time. The movements are considered cyclical only if they
occur after time intervals of more than one year.
• Changes in the climate and weather conditions have a profound effect on
sales. Customs and traditions affect the pattern of seasonal spending.
• Irregular or random variations are accidental, random or simply due to
chance factors. Thus, they are wholly unpredictable.
• When a time series shows an upward or downward long-term linear trend,
then regression analysis can be used to estimate this trend and project
the trends into forecasting the future values of the variables involved.
• Cyclic variation is a pattern that repeats over time periods longer than
one year. These variations are generally unpredictable in relation to the
time of occurrence, duration as well as amplitude.
• The measure used to identify cyclical variation is the percentage of trend
and the procedure used is known as the residual trend.
• Seasonal variation has been defined as the predictable and repetitive
movement around the trend line in a period of one year or less. For the
measurement of seasonal variation, the time interval involved may be in
terms of days, weeks, months or quarters.
• Seasonal index describes the degree of seasonal variation.
• The moving average or the centred moving average aims to eliminate seasonal
and irregular fluctuations (S and I) from the original time series, so that this
average represents the cyclical and trend components of the series.
• Irregular variation is random in nature, unpredictable and occurs over
comparatively short periods of time.

8.6 Glossary
• Seasonal variation: Patterns of change that repeat over a period of one
year or less. The factors that cause seasonal variations are season and
climate and customs and festivals.
• Irregular variations: These variations are unpredictable and can be
accidental, random or simply due to chance factor.
• Cyclic variation: A pattern that repeats over time periods longer than
one year.

8.7 Terminal Questions


1. Differentiate between secular trend and cyclic fluctuations.
2. How is irregular variation caused?
3. Define seasonal variation.
4. What do you understand by trend analysis?
5. How will you measure cyclical effect?
6. Describe the simple average method of isolating seasonal fluctuations in
time series.
7. What are the ways of measuring irregular variation?


8. How are seasonal adjustments made?

8.8 Answers
Answers to Self-Assessment Questions
1. (a) Movement; (b) Regular
2. (a) True; (b) True
3. (a) True; (b) False
4. (a) Estimate; (b) Repeats
5. (a) Predictable; (b) Adjusted
6. (a) False; (b) True

Answers to Terminal Questions


1. Refer Section 8.2
2. Refer Section 8.2
3. Refer Section 8.2
4. Refer Section 8.3.1
5. Refer Section 8.3.2
6. Refer Section 8.4.1
7. Refer Section 8.4.3
8. Refer Section 8.4.3
8.9 Further Reading
1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.
2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand
& Sons, 2010


Unit 9

Testing of Hypothesis

Structure
9.1 Introduction
Objectives
9.2 Hypothesis Formulation
9.3 Summary
9.4 Glossary
9.5 Terminal Questions
9.6 Answers
9.7 Further Reading

9.1 Introduction
In the previous unit, you learnt about time series analysis and the isolation of
its trend, cyclical, seasonal and irregular components.
In this unit, you will learn about hypotheses, null and alternative hypotheses,
critical region, penalty, standard error and hypothesis testing. A hypothesis is an
assumption that is tested to find its logical or empirical consequences. It refers to
a provisional idea whose merit needs evaluation.
A hypothesis should be clear and accurate. Concepts such as the null and
alternative hypotheses make it possible to verify the testability of an assumption. During
the course of hypothesis testing, inferences are made about population parameters such as the
mean and the proportion. Any useful hypothesis will enable predictions
by reasoning, including deductive reasoning. Statistical decisions have to be
made in the presence of uncertainty. A typical null hypothesis states that the
population mean has a specified value $\mu$. Testing a statistical hypothesis
on the basis of a sample enables us to decide whether the hypothesis should
be accepted or rejected. The Critical Region (CR) or Rejection Region (RR) is
the set of values of the test statistic for which the null hypothesis is rejected in a
hypothesis test.

Objectives
After studying this unit, you should be able to:
• Describe the concepts of hypothesis and list the types of errors
• Explain the null and alternate hypotheses
• Discuss the concepts of critical region or the region of hypothesis rejection
• Calculate standard errors of statistics

9.2 Hypothesis Formulation


A hypothesis is an approximate assumption that a researcher wants to test for
its logical or empirical consequences. It refers to a provisional idea
whose merit needs evaluation. It is
often also regarded as a convenient mathematical approach for simplifying
cumbersome calculations. Setting up and testing hypotheses is an integral part of
statistical inference. Hypotheses are often statements about population
parameters such as the variance and the expected value. During the course of hypothesis
testing, inferences are made about population parameters such as the mean and the proportion.
Any useful hypothesis will enable predictions by reasoning, including deductive
reasoning. According to Karl Popper, a hypothesis must be falsifiable: a
proposition or theory cannot be called scientific if it does not admit the possibility
of being shown false. A hypothesis might predict the outcome of an experiment in a
laboratory setting or the observation of a phenomenon in nature. Thus, a hypothesis is a
proposed explanation of a phenomenon, or a proposal suggesting a possible correlation
between multiple phenomena.
The characteristics of a hypothesis are:
• Clear and accurate: A hypothesis should be clear and accurate so as to
draw a consistent conclusion.
• Statement of relationship between variables: If a hypothesis is
relational, it should state the relationship between different variables.
• Testability: A hypothesis should be open to testing so that other deductions
can be made from it and can be confirmed or disproved by observation.
The researcher should do some prior study to make the hypothesis a
testable one.
• Specific with limited scope: A hypothesis which is specific, with limited
scope, is more easily testable than a hypothesis with limitless scope. Therefore,
a researcher should devote more time to research on this kind of
hypothesis.
• Simplicity: A hypothesis should be stated in the most simple and clear
terms to make it understandable.
• Consistency: A hypothesis should be reliable and consistent with
established and known facts.
• Time limit: A hypothesis should be capable of being tested within a
reasonable time. In other words, the excellence of a
hypothesis is judged by the time taken to collect the data needed for the
test.
• Empirical reference: A hypothesis should explain or support all the
sufficient facts needed to understand what the problem is all about.
A hypothesis is a statement or assumption concerning a population. For
the purpose of decision-making, a hypothesis has to be verified and then
accepted or rejected. This is done with the help of observations. We test a
sample and make a decision on the basis of the result obtained. Decision-making
plays a significant role in different areas such as marketing, industry and
management.

9.2.1 Statistical Decision-Making

Testing a statistical hypothesis on the basis of a sample enables us to decide
whether the hypothesis should be accepted or rejected. Since the sample data give incomplete
information about the population, the result of the test need not be considered
to be final or unchallengeable. The procedure which, on the basis of sample
results, enables us to decide whether a hypothesis is to be accepted or rejected
is called Hypothesis Testing or Test of Significance.
Note 1: A test provides evidence, if any, against a hypothesis, usually called a null hypothesis.
The test cannot prove the hypothesis to be correct. It can only give some evidence against it.

The test of hypothesis is a procedure to decide whether to accept or
reject a hypothesis.
Note 2: The acceptance of a hypothesis implies only that there is no evidence from the sample that we
should believe otherwise.

The rejection of a hypothesis leads us to conclude that it is false. This
way of putting the problem is convenient because of the uncertainty inherent in
the problem. In view of this, we must always state a hypothesis that we
hope to reject.
A hypothesis stated in the hope of being rejected is called a null hypothesis
and is denoted by H0.


If H0 is rejected, it may lead to the acceptance of an alternative hypothesis
denoted by H1.
For example, a new fragrance soap is introduced in the market. The null
hypothesis H0, which may be rejected, is that the new soap is not better than
the existing soap.
Similarly, a die is suspected to be loaded. Roll the die a number of times
to test.
The null hypothesis H0: $p = 1/6$ for showing six.
The alternative hypothesis H1: $p \neq 1/6$.
For example, skulls found at an ancient site may all belong to race X or
race Y on the basis of their diameters. We may test a hypothesis about the
mean $\mu$ of the population from which the present skulls came. We have the
hypotheses,
H0: $\mu = \mu_x$, H1: $\mu = \mu_y$
Here, we should not insist on calling either hypothesis null and the other
alternative, since the reverse could also be true.

9.2.2 Committing Errors: Type I and Type II


Types of Errors: There are two types of errors in testing a statistical hypothesis,
which are as follows:
o Type I Error: In this type of error, you reject a null hypothesis
when it is true. It means rejection of a hypothesis which should have
been accepted. Its probability is denoted by $\alpha$ (alpha) and it is also known as alpha
error.
o Type II Error: In this type of error, you accept a null
hypothesis when it is not true. It means accepting a hypothesis which
should have been rejected. Its probability is denoted by $\beta$ (beta) and it is also known
as beta error.
Type I error can be controlled by fixing it at a lower level. For example, if
you fix it at 2%, then the maximum probability of committing a Type I error is 0.02.
But reducing Type I error has a disadvantage when the sample size is fixed, as
it increases the chances of Type II error. In other words, both
types of errors cannot be reduced simultaneously. The only solution to this
problem is to set an appropriate level by considering the costs and penalties
attached to them, or to strike a proper balance between both types of errors.


In a hypothesis test, a Type I error occurs when the null hypothesis is
rejected when it is in fact true; that is, H0 is wrongly rejected. For example, in a
clinical trial of a new drug, the null hypothesis might be that the new drug is no
better, on average, than the current drug; that is H0: there is no difference between
the two drugs on average. A Type I error would occur if we concluded that the
two drugs produced different effects, when in fact there was no difference
between them.
In a hypothesis test, a Type II error occurs when the null hypothesis H0, is
not rejected when it is in fact false. For example, in a clinical trial of a new drug,
the null hypothesis might be that the new drug is no better, on average, than the
current drug; that is H0: there is no difference between the two drugs on average.
A Type II error would occur if it were concluded that the two drugs produced the
same effect, that is, there is no difference between the two drugs on average,
when in fact they produced different ones.
In how many ways can we commit errors?
We reject a hypothesis when it may be true. This is Type I Error.
We accept a hypothesis when it may be false. This is Type II Error.
The other true situations are desirable:
We accept a hypothesis when it is true. We reject a hypothesis when it is
false.
            Accept H0                          Reject H0
H0 true     Accept true H0 (Desirable)         Reject true H0 (Type I Error)
H0 false    Accept false H0 (Type II Error)    Reject false H0 (Desirable)

The level of significance implies the probability of Type I error. A five per
cent level implies that the probability of committing a Type I error is 0.05. A one
per cent level implies 0.01 probability of committing Type I error.
Lowering the significance level, and hence the probability of Type I error, is
good but unfortunately, for a fixed sample size, it raises the probability of committing
a Type II error.


To sum up:
Type I Error: Rejecting H0 when H0 is true.
Type II Error: Accepting H0 when H0 is false.
Note. The probability of making a Type I error is the level of significance of a statistical test. It is
denoted by $\alpha$, where,

$$\alpha = \text{Prob.}(\text{Rejecting } H_0 \mid H_0 \text{ true})$$
$$1 - \alpha = \text{Prob.}(\text{Accepting } H_0 \mid H_0 \text{ true})$$

The probability of making a Type II error is denoted by $\beta$, where,

$$\beta = \text{Prob.}(\text{Accepting } H_0 \mid H_0 \text{ false})$$
$$1 - \beta = \text{Prob.}(\text{Rejecting } H_0 \mid H_0 \text{ false}) = \text{Prob.}(\text{The test correctly rejects } H_0 \text{ when } H_0 \text{ is false})$$

$1 - \beta$ is called the power of the test. It depends on the level of significance
$\alpha$, the sample size n and the parameter value.
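
The dependence of the power on the level of significance, the sample size and the parameter value can be illustrated with a small sketch. It assumes a right-tailed large-sample test of H0: mean = mu0 with known standard deviation; the function name and all the numbers used (mu0 = 11, true mean 12, standard deviation 3, n = 36) are hypothetical and chosen only for illustration.

```python
# Illustrative sketch: beta and power of a right-tailed z-test with known sigma.
# All names and numerical inputs are assumptions made for this example.
from math import sqrt
from statistics import NormalDist


def type_ii_error(mu0, mu1, sigma, n, alpha):
    """beta = Prob(accepting H0: mu = mu0 | true mean is mu1), right-tailed test."""
    std_normal = NormalDist()
    se = sigma / sqrt(n)
    # H0 is rejected when the sample mean exceeds this limit
    critical_mean = mu0 + std_normal.inv_cdf(1 - alpha) * se
    return std_normal.cdf((critical_mean - mu1) / se)


beta_5 = type_ii_error(mu0=11, mu1=12, sigma=3, n=36, alpha=0.05)
beta_1 = type_ii_error(mu0=11, mu1=12, sigma=3, n=36, alpha=0.01)
print(f"alpha = 0.05: beta = {beta_5:.3f}, power = {1 - beta_5:.3f}")
print(f"alpha = 0.01: beta = {beta_1:.3f}, power = {1 - beta_1:.3f}")
```

Running the sketch with the smaller alpha gives the larger beta, which is the trade-off between the two types of error described above.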

9.2.3 Null and Alternate Hypotheses

Hypothesis is usually considered the principal instrument in research. The
basic concepts regarding the testability of a hypothesis are as follows:
• Null Hypothesis: When two methods are compared in terms of
their superiority, the assumption that both methods are
equally good is called the null hypothesis. It is also known as the statistical
hypothesis and is symbolized as H0.
• Alternate Hypothesis: When two methods are compared in terms of
their superiority, the statement that a particular method is better or worse
than the other one is called the alternate hypothesis. It is symbolized
as H1.
Comparison of Null Hypothesis with Alternate Hypothesis
Following are the points of comparison between null hypothesis and alternate
hypothesis:
• Null hypothesis is always specific, while alternate hypothesis gives an
approximate value.
• The rejection of null hypothesis involves great risk, which is not the
case with alternate hypothesis.
• Null hypothesis is more frequently used in statistics than alternate
hypothesis because it is specific and is not based on probabilities.


The hypothesis to be tested is called the null hypothesis and is denoted
by H0. This is to be tested against other possible states of nature, called the alternative
hypothesis. The alternative is usually denoted by H1.
The null hypothesis implies that there is no difference between the statistic
and the population parameter. To test whether there is no difference between
the sample mean $\bar{X}$ and the population mean $\mu$, we write the null hypothesis,
$$H_0: \bar{X} = \mu$$
The alternative hypothesis would be,
$$H_1: \bar{X} \neq \mu$$
This means $\bar{X} > \mu$ or $\bar{X} < \mu$. This is called a two-tailed hypothesis.
The alternative hypothesis H1: $\bar{X} > \mu$ is right tailed.
The alternative hypothesis H1: $\bar{X} < \mu$ is left tailed.
These are one-sided or one-tailed alternatives.
Note 1: The alternative hypothesis H1 implies all such values of the parameter, which are not
specified by the null hypothesis H0.
Note 2: Testing a statistical hypothesis is a rule, which leads to a decision to accept or reject a
hypothesis.

A one-tailed test requires rejection of the null hypothesis when the sample
statistic is greater than the population value or less than the population value at
a certain level of significance.
1. We may want to test if the sample mean $\bar{X}$ exceeds the population mean $\mu$.
Then the null hypothesis is,
H0: $\bar{X} > \mu$
2. In the other case the null hypothesis could be,
H0: $\bar{X} < \mu$
Each of these two situations leads to a one-tailed test and has to be dealt
with in the same manner as the two-tailed test. Here, the critical region is on
one side only, right for > and left for <. Both Figures 9.1 and 9.2
show a five per cent level of test of significance.
For example, a minister in a certain government has an average life of 11
months without being involved in a scam. A new party claims to provide ministers
with an average life of more than 11 months without scam. We would like to test
if, on the average, the new ministers last longer than 11 months. We may write
the null hypothesis H0: $\mu = 11$ and the alternative hypothesis H1: $\mu > 11$.


Figure 9.1 H0: $\bar{X} > \mu$

Figure 9.2 H0: $\bar{X} < \mu$

9.2.4 Critical Region

The Critical Region (CR), or Rejection Region (RR), is the set of values of the test
statistic for which the null hypothesis is rejected in a hypothesis test. That is,
the sample space for the test statistic is partitioned into two regions; one region,
the critical region, will lead us to reject the null hypothesis H0, the other will not.
So, if the observed value of the test statistic is a member of the critical region,
we reject H0; if it is not a member of the critical region, we
do not reject H0.
We shall consider test problems arising out of Type I Error.
The level of significance of a test is the maximum probability with which
we are willing to take a risk of Type I error.
If we take a 5% significance level ($\alpha = 0.05$), we are 95% confident
($1 - \alpha = 0.95$) that a right decision has been made.
A 1% significance level ($\alpha = 0.01$) makes us 99% confident ($1 - \alpha = 0.99$)
about the correctness of the decision.
The critical region is the area of the sampling distribution in which the test
statistic must fall for the null hypothesis to be rejected.
We can say that the critical region corresponds to the range of values of
the statistic, which according to the test requires the hypothesis to be rejected.
Two-tailed and One-tailed Tests: A two-tailed test rejects the null
hypothesis if the sample mean is either significantly more or significantly less than the hypothesized
value of the mean of the population. It is considered to be apt when the null
hypothesis specifies some value and the alternate hypothesis is that the parameter is not
equal to that value. In a two-tailed test there are two
rejection regions, also called critical regions, as shown in Figure 9.3.
[Figure: acceptance and rejection regions in the case of a two-tailed test at the 5%
significance level. The acceptance region (accept H0 if the sample mean $\bar{X}$ falls in
this region) lies between the limits Z = -1.96 and Z = +1.96 and covers 0.475 of the area
on each side of the hypothesised mean, both taken together equal to 0.95 or 95% of the area.
Reject H0 if the sample mean $\bar{X}$ falls in either of the two rejection regions in the tails.]

Figure 9.3 Critical Region
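
The limits of the critical region for a given level of significance can be obtained from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `critical_z` is an assumption made for illustration.

```python
# Illustrative sketch: critical values of z for one-tailed and two-tailed tests.
# The function name is hypothetical.
from statistics import NormalDist


def critical_z(alpha, two_tailed=True):
    """Return the z value that cuts off the rejection region(s) of total area alpha."""
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


print(f"5% two-tailed: |z| > {critical_z(0.05):.2f}")
print(f"1% two-tailed: |z| > {critical_z(0.01):.3f}")
print(f"5% one-tailed:  z  > {critical_z(0.05, two_tailed=False):.3f}")
```

For a two-tailed test the area alpha is split equally between the two tails, which is why the 5% limits enclose 0.475 of the area on each side of the mean.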

Conditions for the Occurrence of One-tailed Test: When we want to test whether the population
mean is either lower or higher than some hypothesised value, a one-tailed
test is considered to be appropriate. When the alternative is that the mean is lower, the rejection region is only on the left
tail of the curve. This is known as a left-tailed test (refer Figure 9.4).


Figure 9.4 Left-Tailed Test

For example, what will happen if the acceptance region is made larger?
$\alpha$ will decrease. It will be more easily possible to accept H0 when H0 is false (Type
II error), i.e., it will lower the probability $\alpha$ of making a Type I error, but raise that
of $\beta$, the Type II error. $\alpha$, $\beta$ are probabilities of making an error; $1 - \alpha$, $1 - \beta$ are
probabilities of making correct decisions (refer Figure 9.5).

Figure 9.5 Acceptance Region



Example 9.1: Can we say $\alpha + \beta = 1$?
Solution: No. Each is concerned with a different type of error. However, the two are not
independent of each other.

9.2.5 Penalty

Usually Type II error is considered the worse of the two, though it is mainly the
circumstances of a case that decide the answer to this question.
If Type I error means accepting the hypothesis that a guilty person is
innocent and if Type II error means accepting the hypothesis that an innocent
person is guilty, then Type II error would be dangerous. The penalties and costs
associated with an error determine the balance or trade off between Type I and
Type II errors.
Usually Type I error is shown as the shaded area, say 5% of a normal
curve which is supposed to represent the data. If the sample statistic, say the
sample mean, falls in the shaded area, the hypothesis is rejected at 5 per cent
level of significance.

9.2.6 Standard Error

The concept of Standard Error (SE) of a statistic is used to test the precision of
a sample and provides the confidence limits for the corresponding population
parameter.
The statistic may be the sample arithmetic mean $\bar{X}$, the sample proportion
p, etc.
The SE of any such statistic is the standard deviation of the sampling
distribution of the statistic. Given below are the SEs in common use.
The SE of the difference between two means $\bar{X}_1$, $\bar{X}_2$ or two proportions $p_1$, $p_2$
from samples of sizes $n_1$, $n_2$ can be stated as,

$$SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

$$SE(p_1 - p_2) = \sqrt{\frac{P_1 Q_1}{n_1} + \frac{P_2 Q_2}{n_2}}$$

Where, n is the number of observations
$\bar{X}$ is the sample mean
$\sigma$ is the population standard deviation
p is the sample proportion, $q = 1 - p$
P is the population proportion, $Q = 1 - P$

9.2.7 Testing of Hypothesis


A note on statistical decision-making
Statistical decisions have to be made in the presence of uncertainty. In testing
of hypothesis, the choice is between H0 and H1. In estimation, there are several
choices available. The design of experiments requires one to choose between
the nature and extent of observations. All this has to be done in the presence of
uncertainty. A decision function D(x), assigns to every possible outcome a unique
action. This may result in loss, positive or negative, depending on an unknown
parameter w.
So, the loss function is L(w, D). Since it depends on the outcome x, it is a
random variable. Its expected value is called the risk function.

9.2.8 Tests for a Sample Mean $\bar{X}$

We have to test the null hypothesis that the population mean has a specified
value $\mu$, i.e., H0: $\bar{X} = \mu$. For large n, if H0 is true then,

$$z = \frac{\bar{X} - \mu}{SE(\bar{X})}$$

is approximately normal. The theoretical region for z
depending on the desired level of significance can be calculated.
For example, a factory produces items, each weighing 5 kg with variance
4. Can a random sample of size 900 with mean weight 4.45 kg be justified as
having been taken from this factory?

n = 900, $\bar{X} = 4.45$, $\mu = 5$, $\sigma = \sqrt{4} = 2$

$$z = \frac{\bar{X} - \mu}{SE(\bar{X})} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{4.45 - 5}{2/30} = -8.25$$

We have |z| > 3. The null hypothesis is rejected. The sample may not be
regarded as originating from the factory at the 0.27% level of significance
(corresponding to 99.73% acceptance region).
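
The factory example can be checked with a short sketch. The function name is an assumption made for illustration; the inputs are those given in the example, and the rejection rule |z| > 3 is the one used in the text.

```python
# Illustrative sketch: large-sample z-test for a sample mean.
# The function name is hypothetical; the data are from the factory example.
from math import sqrt


def z_statistic(sample_mean, mu, sigma, n):
    """z = (sample mean - hypothesised mean) / (sigma / sqrt(n))."""
    return (sample_mean - mu) / (sigma / sqrt(n))


z = z_statistic(sample_mean=4.45, mu=5, sigma=sqrt(4), n=900)
reject_h0 = abs(z) > 3  # 0.27% level of significance
print(f"z = {z:.2f}, reject H0: {reject_h0}")
```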

9.2.9 Test for Equality of Two Proportions

If $p_1$, $p_2$ are the proportions of some characteristic in two samples of sizes $n_1$, $n_2$,
drawn from populations with proportions $P_1$, $P_2$, then we have H0: $P_1 = P_2$ vs
H1: $P_1 \neq P_2$
Case (I): If H0 is true, then let $P_1 = P_2 = p$
Where, p can be found from the data:

$$p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}, \qquad q = 1 - p$$

p is the weighted mean of the two sample proportions.

$$SE(p_1 - p_2) = \sqrt{pq\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

$$z = \frac{p_1 - p_2}{SE(p_1 - p_2)}$$

is approximately normal (0, 1).
We write z ~ N(0, 1)
The usual rules for rejection or acceptance are applicable here.
Case (II): If it is assumed that the proportion under question is not the same in
the two populations from which the samples are drawn, and that $p_1$, $p_2$ are taken as
estimates of the true proportions, we write,

$$SE(p_1 - p_2) = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}, \qquad q_1 = 1 - p_1, \; q_2 = 1 - p_2$$

We can also write the confidence interval for $P_1 - P_2$.
For two independent samples of sizes $n_1$, $n_2$ selected from two binomial
populations, the $100(1 - \alpha)\%$ confidence limits for $P_1 - P_2$ are,

$$(p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$

The 90% confidence limits would be [with $\alpha = 0.1$, $100(1 - \alpha)\% = 90\%$]

$$(p_1 - p_2) \pm 1.645\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$

Example 9.2: Out of 5000 interviewees, 2400 are in favour of a proposal, and
out of another set of 2000 interviewees, 1200 are in favour. Is the difference
significant?
Solution: Given,

$$n_1 = 5000, \quad p_1 = \frac{2400}{5000} = 0.48, \qquad n_2 = 2000, \quad p_2 = \frac{1200}{2000} = 0.6$$

$$SE = \sqrt{\frac{0.48 \times 0.52}{5000} + \frac{0.6 \times 0.4}{2000}} = 0.013 \quad \text{(using Case (II))}$$

$$z = \frac{|p_1 - p_2|}{SE} = \frac{0.12}{0.013} = 9.2 > 3$$

The difference is highly significant at 0.27% level.
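
The two standard errors of Case (I) and Case (II) can be compared on the data of Example 9.2 with the sketch below. The function names are assumptions made for illustration; the pooled version follows Case (I) and the unpooled version follows Case (II).

```python
# Illustrative sketch: large-sample test for the equality of two proportions.
# Function names are hypothetical; the data are from Example 9.2.
from math import sqrt


def se_pooled(p1, n1, p2, n2):
    """Case (I): SE under H0, using the weighted mean of the two proportions."""
    p = (n1 * p1 + n2 * p2) / (n1 + n2)
    return sqrt(p * (1 - p) * (1 / n1 + 1 / n2))


def se_unpooled(p1, n1, p2, n2):
    """Case (II): SE when the two population proportions are not assumed equal."""
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


n1, n2 = 5000, 2000
p1, p2 = 2400 / n1, 1200 / n2

se2 = se_unpooled(p1, n1, p2, n2)
z_case2 = (p1 - p2) / se2
z_case1 = (p1 - p2) / se_pooled(p1, n1, p2, n2)
print(f"Case (II): SE = {se2:.3f}, |z| = {abs(z_case2):.1f}")
print(f"Case (I):  |z| = {abs(z_case1):.1f}")
```

Both versions give |z| well above 3 for these data, so the conclusion of the example does not depend on which standard error is used.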

9.2.10 Large Sample Test for Equality of Two Means $\bar{X}_1$, $\bar{X}_2$

Suppose two samples of sizes $n_1$ and $n_2$ are drawn from populations having
means $\mu_1$, $\mu_2$ and standard deviations $\sigma_1$, $\sigma_2$.
To test the equality of means $\bar{X}_1$, $\bar{X}_2$ we write,
$$H_0: \mu_1 = \mu_2$$
$$H_1: \mu_1 \neq \mu_2$$
If we assume H0 is true then,

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

is approximately normally distributed with mean 0 and S.D. = 1.
We write z ~ N(0, 1)
As usual, if |z| > 2 we reject H0 at 4.55% level of significance and so on.
Example 9.3: Two groups of sizes 121 and 81 are subjected to tests. Their
means are found to be 84 and 81 and standard deviations 10 and 12. Test for
the significance of difference between the groups.
Solution: $\bar{X}_1 = 84$, $n_1 = 121$, $\sigma_1 = 10$
$\bar{X}_2 = 81$, $n_2 = 81$, $\sigma_2 = 12$

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{84 - 81}{\sqrt{\dfrac{10^2}{121} + \dfrac{12^2}{81}}} = 1.86 < 1.96$$

The difference is not significant at the 5% level of significance.
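
Example 9.3 can be reproduced with the following sketch. The function name is an assumption made for illustration; the inputs are those of the example and 1.96 is the two-tailed 5% limit used in the text.

```python
# Illustrative sketch: large-sample z-test for the equality of two means.
# The function name is hypothetical; the data are from Example 9.3.
from math import sqrt


def two_mean_z(mean1, sd1, n1, mean2, sd2, n2):
    """z = (mean1 - mean2) / sqrt(sd1^2/n1 + sd2^2/n2)."""
    return (mean1 - mean2) / sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


z = two_mean_z(mean1=84, sd1=10, n1=121, mean2=81, sd2=12, n2=81)
significant = abs(z) > 1.96  # two-tailed test at the 5% level
print(f"z = {z:.2f}, significant at 5%: {significant}")
```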

9.2.11 Small Sample Tests of Significance

The sampling distribution of many statistics for large samples is approximately
normal. For small samples with n < 30, the normal distribution, as used in
Example 9.3, can be used only if the sample is from a normal population with
known $\sigma$.
If $\sigma$ is not known, we can use Student's t distribution instead of the normal.
We then replace $\sigma$ by the sample standard deviation with some modification as
given below.
Let $x_1, x_2, ..., x_n$ be a random sample of size n drawn from a normal
population with mean $\mu$ and S.D. $\sigma$. Then,

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n-1}}$$

Here, t follows Student's t distribution with n - 1 degrees of freedom.
Note: For small samples of n < 30, the term $\sqrt{n-1}$, in $SE = s/\sqrt{n-1}$, corrects the bias
resulting from the use of the sample standard deviation as an estimator of $\sigma$.

Also,

$$\frac{s^2}{S^2} = \frac{n-1}{n} \quad \text{or} \quad s = S\sqrt{\frac{n-1}{n}}$$

where $s^2 = \sum (x - \bar{X})^2 / n$ and $S^2 = \sum (x - \bar{X})^2 / (n-1)$.

Procedure: Small Samples
To test the null hypothesis H0: $\bar{X} = \mu$, against the alternative hypothesis
H1: $\bar{X} \neq \mu$:
Calculate $t = \dfrac{\bar{X} - \mu}{SE(\bar{X})}$ and compare it with the table value with n - 1 degrees
of freedom (d.f.) at the $\alpha$ per cent level of significance.
If this value > table value, reject H0
If this value < table value, accept H0
(Significance level idea same as for large samples)
We can also find the 95% (or any other) confidence limits for $\mu$.
For the two-tailed test (use the same rules as for large samples; substitute
t for z) the 95% confidence limits are,

$$\bar{X} \pm t_{0.025} \, s/\sqrt{n-1}$$

Rejection Region. At the $\alpha$% level, for a two-tailed test, if $|t| > t_{\alpha/2}$ reject.
For a one-tailed test, (right) if $t > t_{\alpha}$ reject
(left) if $t < -t_{\alpha}$ reject
At the 5 per cent level the three cases are,
If $|t| > t_{0.025}$, reject (two-tailed)
If $t > t_{0.05}$, reject (one-tailed, right)
If $t < -t_{0.05}$, reject (one-tailed, left)
For proportions, the same procedure is to be followed.


Example 9.4: A firm produces tubes of diameter 2 cm. A sample of 10 tubes is
found to have a mean diameter of 2.01 cm and variance 0.004. Is the difference
significant? Given $t_{0.05, 9} = 2.26$
Solution:

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n-1}} = \frac{2.01 - 2}{\sqrt{0.004}/\sqrt{10-1}} = \frac{0.01}{0.021} = 0.48$$

Since |t| < 2.26, the difference is not significant at the 5% level.
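
The small-sample statistic of Example 9.4 can be computed with the sketch below. The function name is an assumption made for illustration; the table value 2.26 is the one given in the example. The unrounded statistic comes out slightly below the 0.48 obtained in the example, because the example rounds the standard error to 0.021 before dividing.

```python
# Illustrative sketch: small-sample t statistic with SE = s / sqrt(n - 1),
# where s is the sample standard deviation computed with divisor n.
# The function name is hypothetical; the data are from Example 9.4.
from math import sqrt


def t_statistic(sample_mean, mu, s, n):
    """t = (sample mean - hypothesised mean) / (s / sqrt(n - 1))."""
    return (sample_mean - mu) / (s / sqrt(n - 1))


t = t_statistic(sample_mean=2.01, mu=2, s=sqrt(0.004), n=10)
table_value = 2.26  # given in the example for 9 d.f.
significant = abs(t) > table_value
print(f"t = {t:.3f}, significant at 5%: {significant}")
```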


9.2.12 Paired Observations


t-Test for the Difference of Means
Let $(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$ be the pairs of values for the same subjects, e.g.,
sales data before (x) and after (y) an advertisement campaign, or the
performance of candidates before (x) and after (y) training.
We have to test the significance of the difference between the x and y values.
For each pair $(x_i, y_i)$, find $d_i = x_i - y_i$
H0: $\mu_1 = \mu_2$, i.e., no difference before and after, and H1: $\mu_1 \neq \mu_2$
We find the mean $\bar{d}$ of the d values and use the statistic:

$$t = \frac{\bar{d}}{S/\sqrt{n}}, \qquad S^2 = \frac{\sum (d - \bar{d})^2}{n-1}$$

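Example 9.5: Eleven students were given a test and their marks noted. After training, their marks in a second test were noted. Do the marks indicate any benefit from training?
Solution:

Student              1    2    3    4    5    6    7    8    9   10   11
First test (x)      23   20   19   21   18   20   18   17   23   16   19
Second test (y)     24   19   22   18   20   22   20   20   23   20   17
d (second − first)   1   −1    3   −3    2    2    2    3    0    4   −2

$$\bar{d} = \frac{\sum d}{n} = \frac{11}{11} = 1$$

$$s = \sqrt{\frac{\sum (d - \bar{d})^2}{n - 1}} = \sqrt{\frac{50}{10}} = 2.24, \qquad df = 11 - 1 = 10$$

$$t = \frac{\bar{d}}{s/\sqrt{n}} = \frac{1}{2.24/\sqrt{11}} = 1.48$$

Since 1.48 is less than the table value of t for 10 d.f. at the 5% level (2.228 for a two-tailed test), the difference is not significant.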
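As a cross-check on Example 9.5, the paired statistic can be recomputed directly from the marks. The sketch below assumes the 22 marks are eleven first-test marks followed by eleven second-test marks, and takes each difference as second minus first; the function name `paired_t` is our own.

```python
import math

def paired_t(before, after):
    """Return (mean difference, S, t) for paired observations, d = after - before."""
    d = [a - b for b, a in zip(before, after)]
    n = len(d)
    d_bar = sum(d) / n
    # S uses the divisor (n - 1), as in the formula for paired observations
    s = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
    t = d_bar / (s / math.sqrt(n))
    return d_bar, s, t

first_test = [23, 20, 19, 21, 18, 20, 18, 17, 23, 16, 19]
second_test = [24, 19, 22, 18, 20, 22, 20, 20, 23, 20, 17]

d_bar, s, t = paired_t(first_test, second_test)
print(f"mean difference = {d_bar:.2f}, S = {s:.2f}, t = {t:.2f}, d.f. = {len(first_test) - 1}")
```

With 10 degrees of freedom the computed t stays below the usual 5% table value, so the conclusion of the example is unchanged.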

9.2.13 Test for a Given Population Variance


Since the variance is the square of the standard deviation, whatever is said about a population variance can, for all practical purposes, be extended to the population standard deviation.
To test the hypothesis that a sample x1, x2, ..., xn of size n comes from a population with a specified variance $\sigma_0^2$:
Null hypothesis $H_0: \sigma^2 = \sigma_0^2$ (or $\sigma = \sigma_0$)
$H_1: \sigma^2 > \sigma_0^2$

Test statistic

$$\chi^2 = \frac{n s^2}{\sigma_0^2} = \frac{\sum (x - \bar{x})^2}{\sigma_0^2}$$

If $\chi^2$ is greater than the table value, we reject the null hypothesis.


Activity 1
A dice is rolled 49152 times. Of these, 25149 times it shows 4, 5 or 6. Test
the hypothesis that the dice is unbiased.

Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) A hypothesis is an approximate _________________ that a
researcher wants to test for its logical or empirical consequences.
(b) The Critical Region (CR) or Rejection Region (RR) is a set of values
for testing statistic for which the ________________ hypothesis is
rejected in a hypothesis test.
2. State whether true or false.
(a) Hypothesis should be clear and accurate so as to draw a consistent
conclusion.
(b) Type I error cannot be controlled by fixing it at a lower level.

9.3 Summary
Let us recapitulate the important concepts discussed in this unit:

A hypothesis is an approximate assumption that a researcher wants to
test for its logical or empirical consequences. It refers to a provisional
idea whose merit needs evaluation, but having no specific meaning.
A hypothesis should be reliable and consistent with established and known
facts.
In a hypothesis test, a Type I error occurs when the null hypothesis is
rejected when it is in fact true.
Null hypothesis is always specific, while alternate hypothesis gives an
approximate value.
A one-tailed test requires rejection of the null hypothesis when the sample
statistic is greater than the population value or less than the population
value at a certain level of significance.
The Critical Region (CR) or Rejection Region (RR) is a set of values for
testing statistic for which the null hypothesis is rejected in a hypothesis
test.
A two-tailed test rejects the null hypothesis if the sample mean is either
more or less than the hypothesized value of the mean of the population.
The concept of Standard Error (SE) of statistics is used to test the precision
of a sample and provides the confidence limits for the corresponding
population parameter.
The sampling distribution of many statistics for large samples is
approximately normal.

9.4 Glossary
Hypothesis: An approximate assumption about population parameters
like variance and expected value that is tested by a researcher for its
logical or empirical consequences.
Critical region: A set of values for testing statistic for which the null
hypothesis is rejected and the alternate hypothesis is accepted in a
hypothesis test.

Standard error: In statistics, it is used to test the precision of a sample
and provides the confidence limits for the corresponding population
parameter.

9.5 Terminal Questions


1. What is a hypothesis? Explain the characteristics of hypothesis.
2. Explain the importance of statistical decision-making.
3. What are type I and type II errors?
4. Differentiate between null and alternative hypotheses.
5. Describe critical region with the help of an example.
6. What are the conditions for the occurrence of one-tailed test?
7. What is penalty?
8. How is standard error calculated?

9.6 Answers
Answers to Self-Assessment Questions
1. (a) Assumption; (b) Null
2. (a) True; (b) False

Answers to Terminal Questions


1. Refer Section 9.2
2. Refer Section 9.2.1
3. Refer Section 9.2.2
4. Refer Section 9.2.3
5. Refer Section 9.2.4
6. Refer Section 9.2.4

7. Refer Section 9.2.5


8. Refer Section 9.2.6

9.7 Further Reading


1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.
2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand
& Sons, 2010.

Unit 10

Chi-Square Test

Structure
10.1 Introduction
Objectives
10.2 Chi-Square Test
10.3 Summary
10.4 Glossary
10.5 Terminal Questions
10.6 Answers
10.7 Further Reading

10.1 Introduction
In the previous unit you learnt about testing of hypothesis. The decision to accept or reject a null hypothesis can also be based on the $\chi^2$ statistic. In this unit you will learn about the Chi-square test, also called the Chi-squared or $\chi^2$ test. Any statistical hypothesis test in which the test statistic has a Chi-square distribution when the null hypothesis is true is termed a Chi-square test. In its most common form, the Chi-square test is a non-parametric test of statistical significance for bivariate tabular analysis, also known as cross-breaks. Amongst the several tests used in statistics for judging the significance of the sampling data, the Chi-square test, developed by Karl Pearson, is considered an important test. Chi-square, symbolically written as $\chi^2$ (pronounced as Ki-square), is a statistical measure with the help of which it is possible to assess the significance of the difference between the observed frequencies and the expected frequencies obtained from some hypothetical universe. Chi-square tests enable us to test and compare whether more than two population proportions can be considered equal. Hence, it is a statistical test commonly used to compare observed data with expected data and to test the null hypothesis, which states that there is no significant difference between the expected and the observed result.

Objectives
After studying this unit, you should be able to:
Explain the Chi-square test of significance
Describe the degrees of freedom

Define the conditions for the application of test
Explain the additive property of Chi-square

10.2 Chi-Square Test


Chi-square test is a non-parametric test of statistical significance for bivariate tabular analysis (also known as cross-breaks). Any appropriate test of statistical significance lets you know the degree of confidence you can have in accepting or rejecting a hypothesis. Typically, the Chi-square test is any statistical hypothesis test in which the test statistic has a chi-square distribution when the null hypothesis is true. It is performed on different samples (of people) who are different enough in some characteristic or aspect of their behaviour that we can generalize from the samples selected. The population from which our samples are drawn should also be different in the behaviour or characteristic.
Amongst the several tests used in statistics for judging the significance of the sampling data, the Chi-square test, developed by Karl Pearson, is considered an important test. Chi-square, symbolically written as $\chi^2$ (pronounced as Ki-square), is a statistical measure with the help of which it is possible to assess the significance of the difference between the observed frequencies and the expected frequencies obtained from some hypothetical universe. Chi-square tests enable us to test whether more than two population proportions can be considered equal. In order that the Chi-square test may be applicable, both the observed and the expected frequencies must be grouped in the same way, and the theoretical distribution must be adjusted to give the same total frequency as the observed frequencies. $\chi^2$ is calculated with the help of the following formula:

$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$

Where, $f_0$ means the observed frequency; and
$f_e$ means the expected frequency.

Whether or not a calculated value of $\chi^2$ is significant can be ascertained by looking at the tabulated values of $\chi^2$ (given in the appendix at the end of this book) for the given degrees of freedom at a certain level of significance (generally the 5% level is taken). If the calculated value of $\chi^2$ exceeds the table value, the difference between the observed and expected frequencies is taken as significant; but if the table value is more than the calculated value of $\chi^2$, then the difference between the observed and expected frequencies is considered insignificant, i.e., considered to have arisen as a result of chance, and as such can be ignored.

10.2.1 Degrees of Freedom


As already stated in the earlier unit, the number of independent constraints determines the number of degrees of freedom (or df). If there are 10 frequency classes and there is one independent constraint, then there are (10 − 1) = 9 degrees of freedom. Thus, if n is the number of groups and one constraint is placed by making the totals of observed and expected frequencies equal, df = (n − 1); when two constraints are placed by making the totals as well as the arithmetic means equal, then df = (n − 2), and so on. In the case of a contingency table (i.e., a table with two columns and more than two rows, or a table with two rows but more than two columns, or a table with more than two rows and more than two columns) or in the case of a 2 × 2 table, the degrees of freedom is worked out as follows:
df = (c − 1)(r − 1)
Where, c = Number of columns
r = Number of rows
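The two rules for degrees of freedom can be written as small helper functions. This is a minimal sketch; the function names are our own and are not part of any library.

```python
def df_frequency_classes(n_groups, n_constraints=1):
    """Degrees of freedom for n groups with the given number of linear constraints."""
    return n_groups - n_constraints

def df_contingency_table(rows, columns):
    """Degrees of freedom for an r x c contingency table: (c - 1)(r - 1)."""
    return (columns - 1) * (rows - 1)

print(df_frequency_classes(10))      # ten classes, totals made equal
print(df_frequency_classes(10, 2))   # totals and arithmetic means made equal
print(df_contingency_table(2, 2))    # a 2 x 2 table
print(df_contingency_table(3, 4))    # a table with 3 rows and 4 columns
```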

10.2.2 Conditions for the Application of Test


The following conditions should be satisfied before the test can be applied:
(i) Observations recorded and used are collected on a random basis.
(ii) All the members (or items) in the sample must be independent.
(iii) No group should contain very few items say less than 10. In cases where
the frequencies are less than 10, regrouping is done by combining the
frequencies of adjoining groups so that the new frequencies become
greater than 10. Some statisticians take this number as 5, but 10 is
regarded as better by most of the statisticians.
(iv) The overall number of items (i.e., N) must be reasonably large. It should
at least be 50, howsoever small the number of groups may be.
(v) The constraints must be linear. Constraints which involve linear equations
in the cell frequencies of a contingency table (i.e., equations containing
no squares or higher powers of the frequencies) are known as linear
constraints.

10.2.3 Areas of Application of Chi-Square Test


Chi-square test is applicable in a large number of problems. The test is, in fact, a technique through the use of which it is possible for us to (a) test the goodness of fit; (b) test the homogeneity of a number of frequency distributions; and (c) test the significance of association between two attributes. In other words, Chi-square test is a test of independence, goodness of fit and homogeneity. At times Chi-square test is used as a test of population variance also.
As a test of goodness of fit, $\chi^2$ test enables us to see how well the distribution of observed data fits the assumed theoretical distribution such as Binomial distribution, Poisson distribution or the Normal distribution.
As a test of independence, $\chi^2$ test helps explain whether or not two attributes are associated. For instance, we may be interested in knowing whether a new medicine is effective in controlling fever or not and $\chi^2$ test will help us in deciding this issue. In such a situation, we proceed on the null hypothesis that the two attributes (viz., new medicine and control of fever) are independent, which means that the new medicine is not effective in controlling fever. It may, however, be stated here that $\chi^2$ is not a measure of the degree of relationship or the form of relationship between two attributes; it is simply a technique of judging the significance of such association or relationship between two attributes.
As a test of homogeneity, $\chi^2$ test helps us in stating whether different samples come from the same universe. Through this test, we can also explain whether the results worked out on the basis of sample/samples are in conformity with a well defined hypothesis or the results fail to support the given hypothesis. As such the test can be taken as an important decision-making technique.
As a test of population variance, Chi-square is also used to test the significance of population variance through confidence intervals, specially in case of small samples.

10.2.4 Steps Involved in Finding the Value of Chi-Square


The various steps involved are as follows:
(i) First of all calculate the expected frequencies.
(ii) Obtain the difference between observed and expected frequencies and find out the squares of these differences, i.e., calculate $(f_0 - f_e)^2$.
(iii) Divide the quantity $(f_0 - f_e)^2$ obtained, as stated above, by the corresponding expected frequency to get $(f_0 - f_e)^2 / f_e$.
(iv) Then find the summation of the $(f_0 - f_e)^2 / f_e$ values, or what we call

$$\sum \frac{(f_0 - f_e)^2}{f_e}$$

This is the required $\chi^2$ value.
The $\chi^2$ value obtained as such should be compared with the relevant table value of $\chi^2$ and inference may be drawn as stated above.
The following examples illustrate the use of Chi-square test.
Example 10.1: A dice is thrown 132 times with the following results:

Number Turned Up    1    2    3    4    5    6
Frequency          16   20   25   14   29   28

Test the hypothesis that the dice is unbiased.
Solution: Let us take the hypothesis that the dice is unbiased. If that is so, the probability of obtaining any one of the six numbers is 1/6 and as such the expected frequency of any one number coming upward is $132 \times \frac{1}{6} = 22$. Now, we can write the observed frequencies along with expected frequencies and work out the value of $\chi^2$ as follows:

No. Turned   Observed         Expected         (f0 − fe)   (f0 − fe)²   (f0 − fe)²/fe
Up           Frequency (f0)   Frequency (fe)
1            16               22               −6          36           36/22
2            20               22               −2           4            4/22
3            25               22                3           9            9/22
4            14               22               −8          64           64/22
5            29               22                7          49           49/22
6            28               22                6          36           36/22

$$\sum \frac{(f_0 - f_e)^2}{f_e} = \frac{198}{22} = 9$$

Hence, the calculated value of $\chi^2$ = 9
Degrees of freedom in the given problem is (n − 1) = (6 − 1) = 5
The table value of $\chi^2$ for 5 degrees of freedom at 5% level of significance is 11.070. If we compare the calculated and table values of $\chi^2$, we find that the calculated value is less than the table value and as such could have arisen due to fluctuations of sampling. The result thus supports the hypothesis and it can be concluded that the dice is unbiased.
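The four steps of Section 10.2.4, applied to the dice data of Example 10.1, can be sketched as follows. The function name `chi_square` is our own, and the table value is hard-coded as 11.070, the standard tabulated 5% value for 5 degrees of freedom, rather than looked up programmatically.

```python
def chi_square(observed, expected):
    """Sum of (f0 - fe)^2 / fe over all classes."""
    return sum((f0 - fe) ** 2 / fe for f0, fe in zip(observed, expected))

observed = [16, 20, 25, 14, 29, 28]          # frequencies for faces 1 to 6
total = sum(observed)                        # 132 throws
expected = [total / 6] * len(observed)       # unbiased dice: 22 per face

chi2 = chi_square(observed, expected)
df = len(observed) - 1                       # one constraint: equal totals
TABLE_VALUE_5PCT_5DF = 11.070

reject = chi2 > TABLE_VALUE_5PCT_5DF
print(f"chi-square = {chi2:.2f}, d.f. = {df}, reject H0: {reject}")
```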
Example 10.2:
Find the value of $\chi^2$ for the following information:

Class                                  A    B    C    D    E
Observed Frequency                     8   29   44   15    4
Theoretical (or Expected) Frequency    7   24   38   24    7

Solution:
Since some of the frequencies are less than 10, we shall first regroup the given data as follows and then work out the value of $\chi^2$:

Class      Observed Frequency (f0)   Expected Frequency (fe)   (f0 − fe)   (f0 − fe)²/fe
A and B    (8 + 29) = 37             (7 + 24) = 31               6         36/31
C          44                        38                          6         36/38
D and E    (15 + 4) = 19             (24 + 7) = 31             −12         144/31

$$\sum \frac{(f_0 - f_e)^2}{f_e} = 6.76 \text{ approx.}$$

After regrouping there are three classes, so there are (3 − 1) = 2 degrees of freedom. The table value of $\chi^2$ for two degrees of freedom at 5% level of significance is 5.991. The calculated value of $\chi^2$ is higher than this table value, which means that the calculated value cannot be said to have arisen just because of chance. It is significant. Hence, the hypothesis that the observed frequencies agree with the theoretical frequencies does not hold good.
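The regrouping step in Example 10.2 can be sketched as below. The helper `merge_classes` and the choice of groups (A with B, C alone, D with E) follow the worked solution; the names are our own.

```python
def merge_classes(frequencies, groups):
    """Add up the frequencies of the adjoining classes listed in each group."""
    return [sum(frequencies[i] for i in group) for group in groups]

observed = [8, 29, 44, 15, 4]      # classes A to E
expected = [7, 24, 38, 24, 7]
groups = [(0, 1), (2,), (3, 4)]    # A and B, C, D and E

f0 = merge_classes(observed, groups)
fe = merge_classes(expected, groups)

chi2 = sum((o - e) ** 2 / e for o, e in zip(f0, fe))
df = len(f0) - 1
print(f0, fe, f"chi-square = {chi2:.3f}", f"d.f. = {df}")
```

Summing the unrounded terms gives a value marginally below the 6.76 quoted in the solution, which adds the three terms after rounding each to two decimals.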

10.2.5 Alternative Formula for Finding the Value of Chi-Square in a (2 × 2) Table

There is an alternative method of calculating the value of $\chi^2$ in the case of a (2 × 2) table. Let us write the cell frequencies and marginal totals in case of a (2 × 2) table as follows:

a          b          (a + b)
c          d          (c + d)
(a + c)    (b + d)    N

Then the formula for calculating the value of $\chi^2$ will be stated as follows:

$$\chi^2 = \frac{(ad - bc)^2 \, N}{(a + c)(b + d)(a + b)(c + d)}$$

Where, N means the total frequency, ad means the larger cross product, bc means the smaller cross product and (a + c), (b + d), (a + b) and (c + d) are the marginal totals. The alternative formula is rarely used in finding out the value of Chi-square as it is not applicable uniformly in all cases but can be used only in a (2 × 2) contingency table.
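A quick way to see that the alternative formula agrees with the usual one is to compute both for the same table. The cell frequencies below (10, 20, 30, 40) are hypothetical figures chosen only for illustration, and the function names are our own.

```python
def chi2_shortcut(a, b, c, d):
    """Alternative formula for a 2 x 2 table: (ad - bc)^2 N / product of marginal totals."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + c) * (b + d) * (a + b) * (c + d))

def chi2_usual(a, b, c, d):
    """Usual formula: expected cell = row total x column total / N."""
    n = a + b + c + d
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    cells = [[a, b], [c, d]]
    total = 0.0
    for i in range(2):
        for j in range(2):
            fe = rows[i] * cols[j] / n
            total += (cells[i][j] - fe) ** 2 / fe
    return total

# Hypothetical cell frequencies
shortcut = chi2_shortcut(10, 20, 30, 40)
usual = chi2_usual(10, 20, 30, 40)
print(f"shortcut = {shortcut:.4f}, usual = {usual:.4f}")
```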

10.2.6 Yates Correction


F. Yates suggested a correction in the $\chi^2$ value calculated in connection with a (2 × 2) table, particularly when cell frequencies are small (since no cell frequency should be less than 5 in any case, though 10 is better as stated earlier) and $\chi^2$ is just on the significance level. The correction suggested by Yates is popularly known as Yates' correction. It involves the reduction of the deviation of observed from expected frequencies, which of course reduces the value of $\chi^2$. The rule for correction is to adjust the observed frequency in each cell of a (2 × 2) table in such a way as to reduce the deviation of the observed from the expected frequency for that cell by 0.5, and this adjustment is made in all the cells without disturbing the marginal totals. The formula for finding the value of $\chi^2$ after applying Yates' correction is written as under:

$$\chi^2 (\text{corrected}) = \frac{N \, (|ad - bc| - 0.5N)^2}{(a + b)(c + d)(a + c)(b + d)}$$

In case we use the usual formula for calculating the value of Chi-square, viz.,

$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$

then Yates' correction can be applied as under:

$$\chi^2 (\text{corrected}) = \frac{(|f_{01} - f_{e1}| - 0.5)^2}{f_{e1}} + \frac{(|f_{02} - f_{e2}| - 0.5)^2}{f_{e2}} + \dots$$

It may again be emphasized that Yates' correction is made only in case of a (2 × 2) table and that too when cell frequencies are small.
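The two ways of writing the corrected value can be compared numerically. The sketch below uses hypothetical cell frequencies (10, 20, 30, 40) purely for illustration; the function names are our own.

```python
def chi2_yates_table(a, b, c, d):
    """Corrected chi-square from the cell frequencies of a 2 x 2 table."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - 0.5 * n) ** 2
    return numerator / ((a + b) * (c + d) * (a + c) * (b + d))

def chi2_yates_cells(observed, expected):
    """Corrected chi-square from observed and expected cell frequencies."""
    return sum((abs(f0 - fe) - 0.5) ** 2 / fe for f0, fe in zip(observed, expected))

# Hypothetical 2 x 2 table; expected = row total x column total / N
a, b, c, d = 10, 20, 30, 40
n = a + b + c + d
expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
            (c + d) * (a + c) / n, (c + d) * (b + d) / n]

from_table = chi2_yates_table(a, b, c, d)
from_cells = chi2_yates_cells([a, b, c, d], expected)
uncorrected = sum((f0 - fe) ** 2 / fe for f0, fe in zip([a, b, c, d], expected))
print(f"corrected = {from_table:.4f} ({from_cells:.4f}), uncorrected = {uncorrected:.4f}")
```

As the text states, the corrected value comes out smaller than the uncorrected one.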

10.2.7 Chi-Square as a Test of Population Variance


$\chi^2$ is used, at times, to test the significance of population variance $(\sigma_p)^2$ through confidence intervals. This, in other words, means that we can use the $\chi^2$ test to judge if a random sample has been drawn from a normal population with mean ($\mu$) and with specified variance $(\sigma_p)^2$. In such a situation, the test statistic for a null hypothesis will be as under:

$$\chi^2 = \frac{\sum (X_i - \bar{X}_s)^2}{(\sigma_p)^2} = \frac{n (\sigma_s)^2}{(\sigma_p)^2}$$

with (n − 1) degrees of freedom.

By comparing the calculated value (with the help of the above formula) with the table value of $\chi^2$ for (n − 1) df at a certain level of significance, we may accept or reject the null hypothesis. If the calculated value is equal to or less than the table value, the null hypothesis is to be accepted, but if the calculated value is greater than the table value, the hypothesis is rejected. All this can be made clear by an example.
Example 10.3:
Weight of 10 students is as follows:

Sl. No.          1    2    3    4    5    6    7    8    9   10
Weight in kg.   38   40   45   53   47   43   55   48   52   49

Can we say that the variance of the distribution of weights of all students from which the above sample of 10 students was drawn is equal to 20 square kg? Test this at 5% and 1% level of significance.
Solution:
First of all, we should work out the standard deviation of the sample ($\sigma_s$).
Calculation of the sample standard deviation:

Sl. No.   Xi (Weight in kg)   (Xi − X̄s)   (Xi − X̄s)²
1         38                  −9           81
2         40                  −7           49
3         45                  −2           04
4         53                  +6           36
5         47                   0           00
6         43                  −4           16
7         55                  +8           64
8         48                  +1           01
9         52                  +5           25
10        49                  +2           04
n = 10    ΣXi = 470                        Σ(Xi − X̄s)² = 280

$$\bar{X}_s = \frac{\sum X_i}{n} = \frac{470}{10} = 47 \text{ kg}$$

$$\sigma_s = \sqrt{\frac{\sum (X_i - \bar{X}_s)^2}{n}} = \sqrt{\frac{280}{10}} = \sqrt{28} = 5.3 \text{ kg}$$

$$(\sigma_s)^2 = 28$$

Taking the null hypothesis as $H_0: (\sigma_p)^2 = 20$, i.e., that the sample comes from a population with the specified variance,

$$\text{The test statistic } \chi^2 = \frac{n (\sigma_s)^2}{(\sigma_p)^2} = \frac{10 \times 28}{20} = \frac{280}{20} = 14$$

Degrees of freedom in this case is (n − 1) = 10 − 1 = 9
At 5% level of significance, the table value of $\chi^2$ = 16.92, and at 1% level of significance it is 21.67 for 9 df, and both these values are greater than the calculated value of $\chi^2$, which is 14. Hence, we accept the null hypothesis and conclude that the variance of the given distribution can be taken as 20 square kg at 5% as well as at 1% level of significance.
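The arithmetic of Example 10.3 can be reproduced with a short script. This is a sketch only: the function name `chi2_variance` is our own, and the two table values are those quoted in the example for 9 degrees of freedom.

```python
def chi2_variance(sample, hypothesised_variance):
    """Return (chi-square, d.f.) where chi-square = sum of squared deviations / sigma_p^2."""
    n = len(sample)
    mean = sum(sample) / n
    sum_of_squares = sum((x - mean) ** 2 for x in sample)
    return sum_of_squares / hypothesised_variance, n - 1

weights = [38, 40, 45, 53, 47, 43, 55, 48, 52, 49]
chi2, df = chi2_variance(weights, hypothesised_variance=20)

TABLE_5PCT, TABLE_1PCT = 16.92, 21.67  # for 9 d.f., as quoted in the example
print(f"chi-square = {chi2:.1f}, d.f. = {df}")
print("accept H0 at 5%:", chi2 <= TABLE_5PCT)
print("accept H0 at 1%:", chi2 <= TABLE_1PCT)
```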

10.2.8 Additive Property of Chi-Square ($\chi^2$)

An important property of $\chi^2$ is its additive nature. This means that several values of $\chi^2$ can be added together and if the degrees of freedom are also added, this number gives the degrees of freedom of the total value of $\chi^2$. Thus, if a number of $\chi^2$ values have been obtained from a number of samples of similar data, then, because of the additive nature of $\chi^2$, we can combine the various values of $\chi^2$ by just simply adding them. Such addition of various values of $\chi^2$ gives one value of $\chi^2$ which helps in forming a better idea about the significance of the problem under consideration. The following example illustrates the additive property of $\chi^2$.
Example 10.4: The following values of $\chi^2$ are obtained from different investigations carried out to examine the effectiveness of a recently invented medicine for checking malaria.

Investigation     1      2      3      4      5
$\chi^2$         2.5    3.2    4.1    3.7    4.5
df                1      1      1      1      1

What conclusion would you draw about the effectiveness of the new medicine on the basis of the five investigations taken together?
Solution: By adding all the values of $\chi^2$, we obtain a value equal to 18.0. Also, by adding the various d.f. as given in the question, we obtain a figure 5. We can now state that the value of $\chi^2$ for 5 degrees of freedom (when all the five investigations are taken together) is 18.0.
Let us take the hypothesis that the new medicine is not effective. The table value of $\chi^2$ for 5 degrees of freedom at 5% level of significance is 11.070. But our calculated value is higher than this table value, which means that the difference is significant and is not due to chance. As such the hypothesis is wrong and it can be concluded that the new medicine is effective in checking malaria.
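Pooling the five investigations of Example 10.4 amounts to two sums. The sketch below assumes one degree of freedom per investigation, which is what the total of 5 in the solution implies, and uses 11.070 as the standard tabulated 5% value for 5 degrees of freedom.

```python
chi2_values = [2.5, 3.2, 4.1, 3.7, 4.5]   # one value per investigation
degrees_of_freedom = [1, 1, 1, 1, 1]      # assumed: 1 d.f. each, giving a total of 5

pooled_chi2 = sum(chi2_values)
pooled_df = sum(degrees_of_freedom)

TABLE_VALUE_5PCT_5DF = 11.070
significant = pooled_chi2 > TABLE_VALUE_5PCT_5DF
print(f"pooled chi-square = {pooled_chi2:.1f} on {pooled_df} d.f.; significant: {significant}")
```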

10.2.9 Important Characteristics of Chi-Square ($\chi^2$) Test


(i) This test is based on frequencies and not on the parameters like mean
and standard deviation.
(ii) This test is used for testing the hypothesis and is not useful for estimation.
(iii) This test possesses the additive property.
(iv) This test can also be applied to a complex contingency table with several
classes and as such is a very useful test in research work.
(v) This test is an important non-parametric (or a distribution free) test as no
rigid assumptions are necessary in regard to the type of population and
no need of the parameter values. It involves less mathematical details.
A Word of Caution in Using $\chi^2$ Test
Chi-square test is no doubt a most frequently used test but its correct application
is equally an uphill task. It should be borne in mind that the test is to be applied
only when the individual observations of sample are independent which means
that the occurrence of one individual observation (event) has no effect upon the
occurrence of any other observation (event) in the sample under consideration.
The researcher, while applying this test, must remain careful about all these
things and must thoroughly understand the rationale of this important test before
using it and drawing inferences concerning his hypothesis.

Activity 1
200 digits were chosen at random from a set of tables. The frequencies of the digits were:

Digit        0    1    2    3    4    5    6    7    8    9
Frequency   18   19   23   21   16   25   22   20   21   15

Calculate $\chi^2$.
Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Chi-square test is a non-parametric test of statistical significance for
______________ tabular analysis.
(b) $\chi^2$ is used to test the significance of population variance $(\sigma_p)^2$ through
_____________ intervals.
2. State whether true or false.
(a) Chi-square tests enable us to test whether more than two population
proportions can be considered equal.
(b) Chi-square test is based on frequencies and also on the parameters
like mean and standard deviation.

10.3 Summary
Let us recapitulate the important concepts discussed in this unit:
Chi-square test is a non-parametric test of statistical significance for
bivariate tabular analysis (also known as cross-breaks).
The Chi-square test is any statistical hypothesis test, in which the test
statistics has a chi-square distribution when the null hypothesis is true.
Chi-square, symbolically written as $\chi^2$ (pronounced as Ki-square), is a
statistical measure with the help of which it is possible to assess the
significance of the difference between the observed frequencies and the
expected frequencies obtained from some hypothetical universe.
The correction suggested by Yates is popularly known as Yates' correction.
It involves the reduction of the deviation of observed from expected
frequencies, which of course reduces the value of $\chi^2$.
$\chi^2$ is used to test the significance of population variance $(\sigma_p)^2$ through
confidence intervals.
An important property of $\chi^2$ is its additive nature. This means that several
values of $\chi^2$ can be added together and if the degrees of freedom are also
added, this number gives the degrees of freedom of the total value of $\chi^2$.

10.4 Glossary
Chi-square test: A non-parametric test of statistical significance used to
compare observed data with expected data. It also tests the validity of
null hypothesis.
Degrees of freedom: The number of independent observations in a
sample of data to estimate a parameter of the population from which that
sample is drawn.

10.5 Terminal Questions


1. Explain Chi-square test. Why is it considered an important test in statistical
analysis?
2. Describe the term Degrees of Freedom.
3. Define the necessary conditions required for the application of test?
4. What are the areas of application of Chi-square test?
5. How will you find the value of Chi-square?
6. Define Yates correction formula for Chi-square.
7. Chi-square can be used as a test of population variance. Explain.
8. Describe the additive properties of Chi-square.
9. Explain the important characteristics of Chi-square test.

10.6 Answers
Answers to Self-Assessment Questions
1. (a) Bivariate; (b) Confidence
2. (a) True; (b) False

Answers to Terminal Questions


1. Refer Section 10.2
2. Refer Section 10.2.1
3. Refer Section 10.2.2
4. Refer Section 10.2.3
5. Refer Section 10.2.4
6. Refer Section 10.2.6
7. Refer Section 10.2.7
8. Refer Section 10.2.8
9. Refer Section 10.2.9

10.7 Further Reading


1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.
2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand
& Sons, 2010.

Unit 11

t-Test, z-Test and Analysis of Variance

Structure
11.1 Introduction
Objectives
11.2 t-Test
11.3 z-Test
11.4 Analysis of Variance
11.5 Summary
11.6 Glossary
11.7 Terminal Questions
11.8 Answers
11.9 Further Reading

11.1 Introduction
In the previous unit, you learnt about the Chi-squared or $\chi^2$ test, which is a non-parametric test of statistical significance for bivariate tabular analysis. In this unit you will learn about t-test, z-test and analysis of variance or ANOVA. z-test and t-test are basically similar, as both compare two means to suggest whether both samples come from the same population. A t-test is any statistical hypothesis test in which the test statistic follows a Student's t distribution if the null hypothesis is true. It is most commonly applied when the test statistic would follow a normal distribution if the population standard deviation were known, but that value has to be estimated from the sample. Similarly, a z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. In statistics, analysis of variance (ANOVA) is a collection of statistical models and their associated procedures in which the observed variance in a particular variable is partitioned into components attributable to different sources of variation.

Objectives
After studying this unit, you should be able to:
Explain the significance of t-test
Discuss the importance of z-test
Define analysis of variance or ANOVA
Explain degrees of freedom and F distribution
11.2 t-Test
William S. Gosset (pen name Student) developed a significance test and through it made a significant contribution to the theory of sampling applicable in case of small samples. When population variance is not known, the test is commonly known as Student's t-test and is based on the t distribution.
Like normal distribution, t distribution is also symmetrical but happens to be flatter than normal distribution. Moreover, there is a different t distribution for every possible sample size. As the sample size gets larger, the shape of the t distribution loses its flatness and becomes approximately equal to the normal distribution. In fact, for sample sizes of more than 30, the t distribution is so close to the normal distribution that we will use the normal to approximate the t distribution. Thus, when n is small, the t distribution is far from normal, but when n is infinite, it is identical to normal distribution.
For applying t-test in context of small samples, the t value is calculated first of all and then the calculated value is compared with the table value of t at a certain level of significance for the given degrees of freedom. If the calculated value of t exceeds the table value (say $t_{0.05}$), we infer that the difference is significant at 5% level, but if the calculated value of t is less than the corresponding table value, the difference is not treated as significant.
The t-test is used when two conditions are fulfilled:
(i) The sample size is less than 30, i.e., when n < 30.
(ii) The population standard deviation ($\sigma_p$) must be unknown.
In using the t-test, we assume the following:
(i) That the population is normal or approximately normal;
(ii) That the observations are independent and the samples are randomly
drawn samples;
(iii) That there is no measurement error;
(iv) That in the case of two samples, population variances are regarded as
equal if equality of the two population means is to be tested.
The following formulae are commonly used to calculate the t value:
(i) To test the significance of the mean of a random sample

$$t = \frac{|\bar{X} - \mu|}{SE_{\bar{X}}}$$

Where, $\bar{X}$ = Mean of the sample
$\mu$ = Mean of the universe
$SE_{\bar{X}}$ = S.E. of mean in case of small sample and is worked out as follows:

$$SE_{\bar{X}} = \frac{\sigma_s}{\sqrt{n}} = \frac{\sqrt{\dfrac{\sum (X_i - \bar{X})^2}{n - 1}}}{\sqrt{n}}$$

and the degrees of freedom = (n − 1)
The above stated formula for t can as well be stated as under:

$$t = \frac{|\bar{X} - \mu|}{SE_{\bar{X}}} = \frac{|\bar{X} - \mu|}{\sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}} \Big/ \sqrt{n}} = \frac{|\bar{X} - \mu|}{\sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}}} \times \sqrt{n}$$

If we want to work out the probable or fiducial limits of population mean ($\mu$) in case of small samples, we can use either of the following:
(a) Probable limits with 95% confidence level:

$$\mu = \bar{X} \pm SE_{\bar{X}} \, (t_{0.05})$$

(b) Probable limits with 99% confidence level:

$$\mu = \bar{X} \pm SE_{\bar{X}} \, (t_{0.01})$$

At other confidence levels, the limits can be worked out in a similar manner, taking the corresponding table value of t just as we have taken $t_{0.05}$ in (a) and $t_{0.01}$ in (b) above.
(ii) To test the difference between the means of the two samples

$$t = \frac{|\bar{X}_1 - \bar{X}_2|}{SE_{\bar{X}_1 - \bar{X}_2}}$$

Where, $\bar{X}_1$ = Mean of the sample 1
$\bar{X}_2$ = Mean of the sample 2
$SE_{\bar{X}_1 - \bar{X}_2}$ = Standard Error of difference between two sample means and is worked out as follows:

$$SE_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2}} \times \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

and the degrees of freedom = (n1 + n2 − 2).
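The standard error of the difference and the resulting t can be sketched as follows. The two small samples are hypothetical figures chosen only to make the arithmetic easy to follow, and the function name `two_sample_t` is our own.

```python
import math

def two_sample_t(sample1, sample2):
    """Return (t, d.f.) using the pooled standard error of the difference of means."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = sum(sample1) / n1, sum(sample2) / n2
    ss1 = sum((x - mean1) ** 2 for x in sample1)
    ss2 = sum((x - mean2) ** 2 for x in sample2)
    pooled_sd = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    standard_error = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
    return abs(mean1 - mean2) / standard_error, n1 + n2 - 2

# Hypothetical data for illustration only
t, df = two_sample_t([10, 12, 14], [8, 9, 10, 13])
print(f"t = {t:.3f} with {df} degrees of freedom")
```

The computed t would then be compared with the table value of t for (n1 + n2 − 2) degrees of freedom at the chosen level of significance.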


When the actual means are in fraction, then use of assumed means is convenient. In such a case, the standard deviation of difference, i.e.,

$$\sqrt{\frac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2}}$$

can be worked out by the following short-cut formula:

$$\sqrt{\frac{\sum (X_{1i} - A_1)^2 + \sum (X_{2i} - A_2)^2 - n_1 (\bar{X}_1 - A_1)^2 - n_2 (\bar{X}_2 - A_2)^2}{n_1 + n_2 - 2}}$$

Where, $A_1$ = Assumed mean of sample 1
$A_2$ = Assumed mean of sample 2
$\bar{X}_1$ = True mean of sample 1
$\bar{X}_2$ = True mean of sample 2

(iii) To test the significance of an observed correlation coefficient

$$t = \frac{r}{\sqrt{1 - r^2}} \times \sqrt{n - 2}$$

Here, t is based on (n - 2) degrees of freedom.


(iv) In context of the difference test
Difference test is applied in the case of paired data and in this context t is
calculated as under:
$$t = \frac{\bar{X}_{Diff} - 0}{\sigma_{Diff} / \sqrt{n}} = \frac{\bar{X}_{Diff} - 0}{\sigma_{Diff}} \times \sqrt{n}$$

Where, $\bar{X}_{Diff}$ or $\bar{D}$ = Mean of the differences of sample items.
0 = The value zero, on the hypothesis that there is no difference
$\sigma_{Diff}$ = Standard deviation of difference and is worked out as,

$$\sigma_{Diff} = \sqrt{\frac{\sum (D - \bar{X}_{Diff})^2}{n - 1}} \quad \text{or} \quad \sqrt{\frac{\sum D^2 - (\bar{D})^2 \times n}{n - 1}}$$

D = Differences
n = Number of pairs in two samples; t is based on (n - 1) degrees of
freedom.
The following examples would illustrate the application of t-test using the
above stated formulae.
Example 11.1:
A sample of 10 measurements of the diameter of a sphere gave a mean
$\bar{X}$ = 4.38 inches and a standard deviation, $\sigma_s$ = 0.06 inches. Find (a) 95% and
(b) 99% confidence limits for the actual diameter.
Solution:
On the basis of the given data the standard error of mean:

$$SE_{\bar{X}} = \frac{\sigma_s}{\sqrt{n - 1}} = \frac{0.06}{\sqrt{10 - 1}} = \frac{0.06}{3} = 0.02$$

Assuming the sample mean 4.38 inches to be the population mean, the
required limits are as follows:
(i) 95% confidence limits
= $\bar{X} \pm SE_{\bar{X}}(t_{0.05})$ with 9 degrees of freedom
= 4.38 ± 0.02(2.262)
= 4.38 ± 0.04524
i.e., 4.335 to 4.425
(ii) 99% confidence limits
= $\bar{X} \pm SE_{\bar{X}}(t_{0.01})$ with 9 degrees of freedom
= 4.38 ± 0.02(3.25) = 4.38 ± 0.0650
i.e., 4.3150 to 4.4450.
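The same limits can be checked with a short Python sketch. It assumes, as the solution does, that the standard error is the sample standard deviation divided by the square root of (n - 1), and it takes the table values of t (2.262 and 3.25 for 9 degrees of freedom) as given inputs rather than computing them. The function name `t_confidence_limits` is illustrative only.

```python
import math

def t_confidence_limits(mean, sd, n, t_table):
    """Return (lower, upper) limits: mean +/- SE * table value of t."""
    se = sd / math.sqrt(n - 1)   # standard error as used in Example 11.1
    margin = se * t_table
    return mean - margin, mean + margin

# Figures from Example 11.1; t values are read from the t table for 9 d.f.
low95, high95 = t_confidence_limits(4.38, 0.06, 10, 2.262)
low99, high99 = t_confidence_limits(4.38, 0.06, 10, 3.25)
print(round(low95, 3), round(high95, 3))
print(round(low99, 3), round(high99, 3))
```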


Example 11.2:
The sales data of an item in six shops before and after a special promotional
campaign are:
Shops
Before the promotional campaign:   53   28   31   48   50   42
After the campaign:                58   29   30   55   56   45

Can the campaign be judged to be a success? Test at 5% level of
significance.
Solution:
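We take the hypothesis that the campaign does not bring any improvement in
sales. We can thus write the null hypothesis as: mean sales after the campaign
equal mean sales before the campaign, against the one-sided alternative that
sales after the campaign are higher.
In order to judge this, we apply the difference test. For this purpose we
calculate the mean and standard deviation of differences in two sample items
as follows: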
Sales before      Sales after       Difference = D (i.e., increase     (D - D̄)     (D - D̄)²
campaign XBi      campaign XAi      or decrease after the campaign)
53                58                +5                                 +1.5        2.25
28                29                +1                                 -2.5        6.25
31                30                -1                                 -4.5        20.25
48                55                +7                                 +3.5        12.25
50                56                +6                                 +2.5        6.25
42                45                +3                                 -0.5        0.25
n = 6                               ΣD = 21                                        Σ(D - D̄)² = 47.50

Mean of difference or $\bar{X}_{Diff} = \dfrac{\sum D}{n} = \dfrac{21}{6} = 3.5$


Standard deviation of difference,

$$\sigma_{Diff} = \sqrt{\frac{\sum (D - \bar{D})^2}{n - 1}} = \sqrt{\frac{47.50}{6 - 1}} = 3.08$$

$$t = \frac{\bar{X}_{Diff} - 0}{\sigma_{Diff}} \times \sqrt{n} = \frac{3.5 - 0}{3.08} \times \sqrt{6}$$

= 1.14 × 2.45 = 2.793

Degrees of freedom = (n - 1) = (6 - 1) = 5
Table value of t at 5% level of significance for 5 degrees of freedom
= 2.015 for one-tailed test.
Since the calculated value of t is greater than its table value, the difference
is significant. Thus, the null hypothesis is rejected and the special promotional
campaign can be taken as a success.
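As a cross-check, the difference test of Example 11.2 can be sketched in Python. The function name `paired_t` is hypothetical; the data are the six before and after sales figures given in the example. Because no intermediate value is rounded, the statistic comes out at about 2.78, slightly below the hand-computed figure, and the conclusion is unchanged.

```python
import math

def paired_t(before, after):
    """Difference test for paired data: returns (mean difference, s.d. of
    differences, t, degrees of freedom)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Standard deviation of the differences with (n - 1) in the denominator
    sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
    t = (mean_diff - 0) / sd_diff * math.sqrt(n)
    return mean_diff, sd_diff, t, n - 1

before = [53, 28, 31, 48, 50, 42]
after = [58, 29, 30, 55, 56, 45]
mean_diff, sd_diff, t_value, df = paired_t(before, after)
print(mean_diff, round(sd_diff, 2), round(t_value, 2), df)
```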
Example 11.3:
Memory capacity of 9 students was tested before and after training. From the
following scores, state whether the training was effective or not.
Student:        1    2    3    4    5    6    7    8    9
Before (XBi):   10   15   9    3    7    12   16   17   4
After (XAi):    12   17   8    5    6    11   18   20   3

Solution:
We take the hypothesis that training was not effective. We can write,
$H_0: \bar{X}_A = \bar{X}_B$, $H_a: \bar{X}_A > \bar{X}_B$. We apply the difference test for which purpose first of
all we calculate the mean and standard deviation of difference as follows:

Students    Before XBi    After XAi    Difference = D    D²
1           10            12           +2                4
2           15            17           +2                4
3           9             8            -1                1
4           3             5            +2                4
5           7             6            -1                1
6           12            11           -1                1
7           16            18           +2                4
8           17            20           +3                9
9           4             3            -1                1
n = 9                                  ΣD = 7            ΣD² = 29

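$$\bar{D} = \frac{\sum D}{n} = \frac{7}{9} = 0.78$$

$$\sigma_{Diff} = \sqrt{\frac{\sum D^2 - (\bar{D})^2 \times n}{n - 1}} = \sqrt{\frac{29 - (0.78)^2 \times 9}{9 - 1}} = 1.71$$

$$t = \frac{\bar{D} - 0}{\sigma_{Diff} / \sqrt{n}} = \frac{0.78}{1.71 / \sqrt{9}} = 1.369$$

Degrees of freedom = (n - 1) = (9 - 1) = 8
Table value of t at 5% level of significance for 8 degrees of freedom
= 1.860 for one-tailed test.
Since the calculated value of t is less than its table value, the difference is
insignificant and the null hypothesis cannot be rejected. Hence it can be inferred that the training
was not effective.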
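The short-cut form of the standard deviation of differences, which needs only the sum of D and the sum of D squared, can be sketched as below for the data of Example 11.3. The function name `paired_t_from_sums` is illustrative. Without intermediate rounding the statistic is about 1.36, a little lower than the hand value, and it remains below the table value of 1.860.

```python
import math

def paired_t_from_sums(diffs):
    """Difference test using the short-cut form of the standard deviation:
    sqrt((sum of D^2 - n * mean(D)^2) / (n - 1))."""
    n = len(diffs)
    sum_d = sum(diffs)
    sum_d2 = sum(d * d for d in diffs)
    mean_d = sum_d / n
    sd_diff = math.sqrt((sum_d2 - mean_d ** 2 * n) / (n - 1))
    t = mean_d / (sd_diff / math.sqrt(n))
    return sum_d, sum_d2, sd_diff, t

before = [10, 15, 9, 3, 7, 12, 16, 17, 4]
after = [12, 17, 8, 5, 6, 11, 18, 20, 3]
diffs = [a - b for a, b in zip(after, before)]
sum_d, sum_d2, sd_diff, t_value = paired_t_from_sums(diffs)
print(sum_d, sum_d2, round(sd_diff, 2), round(t_value, 2))
```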
Example 11.4:
It was found that the coefficient of correlation between two variables calculated
from a sample of 25 items was 0.37. Test the significance of r at 5% level with
the help of t-test.
Solution:
To test the significance of r through t-test, we use the following formula for
calculating t value:

$$t = \frac{r}{\sqrt{1 - r^2}} \times \sqrt{n - 2} = \frac{0.37}{\sqrt{1 - (0.37)^2}} \times \sqrt{25 - 2} = 1.910$$

Degrees of freedom = (n - 2) = (25 - 2) = 23
The table value of t at 5% level of significance for 23 degrees of freedom
is 2.069 for a two-tailed test.
The calculated value of t is less than its table value, hence r is insignificant.
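The t value for a correlation coefficient is easy to verify numerically. The sketch below uses the figures of Example 11.4; the function name `t_from_r` is an assumption made for illustration. Evaluated in floating point, the statistic comes to about 1.91, which is below the table value of 2.069.

```python
import math

def t_from_r(r, n):
    """t statistic for an observed correlation coefficient r from n pairs,
    together with its (n - 2) degrees of freedom."""
    t = r / math.sqrt(1 - r ** 2) * math.sqrt(n - 2)
    return t, n - 2

t_value, df = t_from_r(0.37, 25)
print(round(t_value, 2), df)
```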
Activity 1
Select a variable. Compare the mean of the variable for a sample of 10 for
one group with the mean of the variable for a sample of 10 for a second
group using t-test.


Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) When population ____________ is not known, the test is commonly
known as Student's t-test and is based on the t distribution.
(b) In t-test for the case of two samples population variances are
regarded as equal if _____________ of the two population means is
to be tested.
2. State whether true or false.
(a) Like normal distribution, t distribution is not symmetrical but happens
to be flatter than normal distribution.
(b) When n is small, the t distribution is far from normal but when n is
infinite it is identical with normal distribution.

11.3 z-Test
A z-test is any statistical test for which the distribution of the test statistic can be
approximated by normal distribution under the null hypothesis.

11.3.1 z-Test for Testing the Significance of r in Case of Small Samples or z-Transformation

R.A. Fisher developed the z-test to test the significance of the correlation
coefficient in small samples. While applying the test, r of the sample is
transformed into z on account of which the test is also known as z transformation.
The z transformation is done as under:

$$z = \frac{1}{2} \log_e \frac{1 + r}{1 - r} = 1.15129 \log_{10} \frac{1 + r}{1 - r}$$

where, r represents correlation coefficient on the basis of sample.


The statistic z is used to test (i) Whether an observed value of r is
significantly different from a given hypothetical or known value of population
correlation (ii) Whether two sample values of r differ significantly from each
other.


The standard error of z is calculated as:

$$S.E._z = \frac{1}{\sqrt{n - 3}}$$

where, n means the number of pairs in a sample,
and

$$\zeta = 1.15129 \log_{10} \frac{1 + \rho}{1 - \rho}$$

where, $\rho$ represents the population correlation and $\zeta$ represents the population mean of z.
[Note. If $\rho$ is not known, then it is taken as zero, in which case $\zeta$ = 0]

Finally the value of the Standard Normal Variate (S.N.V.) is calculated as follows:

$$S.N.V. = \frac{|z - \zeta|}{\dfrac{1}{\sqrt{n - 3}}} = |z - \zeta| \sqrt{n - 3}$$

If the value of S.N.V. exceeds 1.96, the difference is significant at 5%
level.
The following example makes the application of z-test clear in testing the
significance of r.
Example 11.5:
Test the significance of the coefficient of correlation, r = 0.5, discovered in a
sample of 19 pairs against the hypothetical population correlation $\rho$ = 0.7. Apply z-transformation.
Solution:
The hypothesis that correlation coefficient in the population is 0.7 has to be
tested in this case.
Applying z-transformation, we obtain

$$z = 1.15129 \log_{10} \frac{1 + r}{1 - r} = 1.15129 \log_{10} \frac{1 + 0.5}{1 - 0.5} = 1.15129 \log_{10} \frac{1.5}{0.5}$$

= 1.15129 log 3
= 1.15129 × 0.4771 = 0.549

$$\zeta = 1.15129 \log_{10} \frac{1 + 0.7}{1 - 0.7} = 1.15129 \log_{10} \frac{1.7}{0.3}$$

= 1.15129 log 5.67
= 1.15129 × 0.7536 = 0.868

$$S.N.V. = \frac{|z - \zeta|}{\dfrac{1}{\sqrt{n - 3}}} = \frac{|0.549 - 0.868|}{\dfrac{1}{\sqrt{19 - 3}}} = \frac{0.319}{\dfrac{1}{\sqrt{16}}} = 0.319 \times 4 = 1.276$$

Since the difference (0.319) is only 1.276 times the S.E., it is insignificant
at 5% level and hence could have arisen due to sampling fluctuations. In other
words, the hypothesis stands and $\rho$ may be taken as 0.7.
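A small Python sketch of the z-transformation test for Example 11.5 follows. It uses the natural-logarithm form of the transformation, which is equivalent to the 1.15129 log10 form; the function names `fisher_z` and `snv_for_r` are illustrative and not taken from the text. With unrounded logarithms the ratio is about 1.27, still well below 1.96.

```python
import math

def fisher_z(r):
    """z-transformation: 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def snv_for_r(r, rho, n):
    """Standard Normal Variate for testing a sample r against a hypothetical
    population correlation rho, with S.E. = 1 / sqrt(n - 3)."""
    se = 1 / math.sqrt(n - 3)
    return abs(fisher_z(r) - fisher_z(rho)) / se

z_sample = fisher_z(0.5)
z_population = fisher_z(0.7)
snv = snv_for_r(0.5, 0.7, 19)
print(round(z_sample, 3), round(z_population, 3), round(snv, 2))
```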
As, it has been stated above, z-test is also used to test the significance of
the difference between two independent correlation coefficients. For this purpose,
first of all, r1 and r2 values are transformed in the similar manner (as stated
above) into z1 and z2 values respectively, and then the standard error of difference
between z1 and z2 is worked out as under:

$$S.E._{z_1 - z_2} = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}$$

where, n1 = Number of pairs in Sample 1
n2 = Number of pairs in Sample 2

Finally, we work out the ratio:

$$\frac{|z_1 - z_2|}{S.E._{z_1 - z_2}}$$

If this ratio is greater than 1.96, the difference will be significant at 5%
level and if this ratio is greater than 2.5758, the difference will be significant at
1% level. We take the following example to make the point clear:


Example 11.6:
Given is the following information:

            No. of Items in the Sample    Coefficient of Correlation
Sample 1    23                            0.40
Sample 2    19                            0.65

Test the significance of the difference, at 5% level, between the two given
values of coefficient of correlation, using z-transformation.
Solution:
Applying z-test, we obtain z1 and z2 values as under:

$$z_1 = 1.15129 \log_{10} \frac{1 + r_1}{1 - r_1} = 1.15129 \log \frac{1 + 0.4}{1 - 0.4} = 1.15129 \log 2.333 = 0.424$$

$$z_2 = 1.15129 \log_{10} \frac{1 + r_2}{1 - r_2} = 1.15129 \log \frac{1 + 0.65}{1 - 0.65} = 1.15129 \log 4.71 = 0.775$$

$$S.E._{z_1 - z_2} = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}} = \sqrt{\frac{1}{20} + \frac{1}{16}} = \sqrt{\frac{9}{80}} = 0.335$$

We now work out the ratio:

$$\frac{|z_1 - z_2|}{S.E._{z_1 - z_2}} = \frac{|0.424 - 0.775|}{0.335} = \frac{0.351}{0.335} = 1.05$$

As this ratio is less than 1.96, the difference between the two given values
of coefficient of correlation at 5% level is insignificant and it can be concluded
that the two samples come from the same population.
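The comparison of two independent correlation coefficients can be sketched in the same way. The function names below are hypothetical; the inputs are the figures of Example 11.6. The ratio works out to roughly 1.05, which is less than 1.96.

```python
import math

def fisher_z(r):
    """z-transformation: 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def ratio_two_r(r1, n1, r2, n2):
    """Ratio |z1 - z2| / S.E. for two independent correlation coefficients."""
    se_diff = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return abs(fisher_z(r1) - fisher_z(r2)) / se_diff

ratio = ratio_two_r(0.40, 23, 0.65, 19)
print(round(ratio, 2))
print("significant at 5% level" if ratio > 1.96 else "insignificant at 5% level")
```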

Self-Assessment Questions
3. State whether true or false.
(a) z-test is used to test the significance of the correlation coefficient in
small samples.
(b) The statistic z is used to test whether an observed value of r is
significantly different from a given hypothetical or known value of
population correlation.


4. Fill in the blanks with the appropriate terms.


(a) While applying the z test, r of the sample is transformed into z on
account of which the test is also known as z ____________________.
(b) z-test is used to test the significance of the _____________________
between two independent correlation coefficients.

11.4 Analysis of Variance


In business decisions, we are often involved in determining if there are significant
differences among various sample means, from which conclusions can be drawn
about the differences among various population means. In the previous chapters,
we discussed and evaluated the differences between two sample means. But
what if we have to compare more than 2 sample means? For example, we may
be interested in finding out if there are any significant differences in the average
sales figures of 4 different salesmen employed by the same company, or we
may be interested to find out if the average monthly expenditures of a family of
4 in 5 different localities are similar or not, or the telephone company may be
interested in checking whether there are any significant differences in the
average number of requests for information received in a given day among the
5 areas of New York City, and so on. The methodology used for such types of
determinations is known as Analysis of Variance.
This technique is one of the most powerful techniques in statistical analysis
and was developed by R.A. Fisher. It is also called the F-Test.
There are two types of classifications involved in the analysis of variance.
The one-way analysis of variance refers to the situations when only one factor or
variable is considered. For example, in testing for differences in sales for three
salesmen, we are considering only one factor, which is the salesman's selling
ability. In the second type of classification, the response variable of interest
may be affected by more than one factor. For example, the sales may be affected
not only by the salesman's selling ability, but also by the price charged or the
extent of advertising in a given area.
For the sake of simplicity and necessity, our discussion will be limited to
One-way Analysis of Variance.
The null hypothesis, that we are going to test, is based upon the assumption
that there is no significant difference among the means of different populations.


For example, if we are testing for differences in the means of k populations,
then,
$$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$$

The alternate hypothesis (H1) will state that at least two means are different
from each other. In order to accept the null hypothesis, all means must be
equal. Even if one mean is not equal to the others, then we cannot accept the
null hypothesis. The simultaneous comparison of several population means is
called ANalysis Of VAriance or ANOVA.
Assumptions
The methodology of ANOVA is based on the following assumptions.
(i) Each sample of size n is drawn randomly and each sample is independent
of the other samples.
(ii) The populations are normally distributed.
(iii) The populations from which the samples are drawn have equal variances.
This means that:
$\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \dots = \sigma_k^2$, for k populations.

11.4.1 The Rationale Behind Analysis of Variance

Why do we call it the Analysis of Variance, even though we are testing for
means? Why not simply call it the Analysis of Means? How do we test for means
by analysing the variances? As a matter of fact, in order to determine if the
means of several populations are equal, we do consider the measure of variance,
$\sigma^2$.
The population variance, $\sigma^2$, is estimated in two different ways, each one
by a different method. One approach is to compute
an estimator of $\sigma^2$ in such a manner that even if the population means are not
equal, it will have no effect on the value of this estimator. This means that the
differences in the values of the population means do not alter the value of $\sigma^2$ as
calculated by a given method. This estimator of $\sigma^2$ is the average of the variances
found within each of the samples. For example, if we take 10 samples of size n,
then each sample will have a mean and a variance. Then, the mean of these 10
variances would be considered as an unbiased estimator of $\sigma^2$, the population
variance, and its value remains appropriate irrespective of whether the population
means are equal or not. This is really done by pooling all the sample variances
to estimate a common population variance, which is the average of all sample
variances. This common variance is known as variance within samples or $\sigma^2_{within}$.

The second approach to calculate the estimate of $\sigma^2$ is based upon the
Central Limit Theorem and is valid only under the null hypothesis assumption
that all the population means are equal. This means that, if there are in fact no
differences among the population means, then the value of $\sigma^2$ computed by the
second approach should not differ significantly from the value of $\sigma^2$ computed
by the first approach.
Hence, if these two values of $\sigma^2$ are approximately the same, then we can decide
to accept the null hypothesis.
The second approach results in the following computation.
Based upon the Central Limit Theorem, we have previously found that
the standard error of the sample means is calculated by:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
or, the variance would be:
$$\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$$
or,
$$\sigma^2 = n\,\sigma_{\bar{X}}^2$$
Thus, by knowing the square of the standard error of the mean, $(\sigma_{\bar{X}})^2$, we
could multiply it by n and obtain a precise estimate of $\sigma^2$. This approach of
estimating $\sigma^2$ is known as $\sigma^2_{between}$. Now, if the null hypothesis is true, that is, if all
population means are equal, then the
$\sigma^2_{between}$ value should be approximately the same as the $\sigma^2_{within}$ value. A significant
difference between these two values would lead us to conclude that this
difference is the result of differences between the population means.
But, how do we know that any difference between these two values is
significant or not? How do we know whether this difference, if any, is simply due
to random sampling error or due to actual differences among the population
means?
R.A. Fisher developed a Fisher test or F-test to answer the above question.
He determined that the difference between $\sigma^2_{between}$ and $\sigma^2_{within}$ values could be
expressed as a ratio to be designated as the F-value, so that:
$$F = \frac{\sigma^2_{between}}{\sigma^2_{within}}$$
In the above case, if the population means are exactly the same, then
$\sigma^2_{between}$ will be equal to $\sigma^2_{within}$ and the value of F will be equal to 1.

However, because of sampling errors and other variations, some disparity
between these two values will be there, even when the null hypothesis is true,
meaning that all population means are equal. The extent of disparity between
the two variances and consequently, the value of F, will influence our decision
on whether to accept or reject the null hypothesis. It is logical to conclude that,
if the population means are not equal, then their sample means will also vary
greatly from one another, resulting in a larger value of $\sigma^2_{between}$ and hence a
larger value of F ($\sigma^2_{within}$ is based only on sample variances and not on sample
means and hence, is not affected by differences in sample means). Accordingly,
the larger the value of F, the more likely the decision to reject the null hypothesis.
But how large must the value of F be so as to reject the null hypothesis? The answer
is that the computed value of F must be larger than the critical value of F, given
in the table for a given level of significance and calculated number of degrees
of freedom. The F distribution is a family of curves, so that there are different
curves for different degrees of freedom.

11.4.2 Degrees of Freedom

We have talked about the F distribution being a family of curves, each curve
reflecting the degrees of freedom relative to both $\sigma^2_{between}$ and $\sigma^2_{within}$. This means
that the degrees of freedom are associated both with the numerator as well as
with the denominator of the F-ratio.
(i) The numerator. Since the variance between samples, $\sigma^2_{between}$, comes
from many samples, if there are k samples, then the degrees
of freedom associated with the numerator would be (k - 1).
(ii) The denominator is the mean of the variances of the k samples
and since each variance in each sample is associated with the size of
the sample (n), the degrees of freedom associated with each sample
would be (n - 1). Hence, the total degrees of freedom would be the sum
of degrees of freedom of k samples or,
df = k(n - 1), when each sample is of size n.

11.4.3 The F Distribution

The major characteristics of the F distribution are as follows:


(i) Unlike normal distribution, which is only one type of curve irrespective of
the value of the mean and the standard deviation, the F distribution is a
family of curves. A particular curve is determined by two parameters. These
are the degrees of freedom in the numerator and the degrees of freedom
in the denominator. The shape of the curve changes as the number of
degrees of freedom changes.
(ii) It is a continuous distribution and the value of F cannot be negative.
(iii) The curve representing the F distribution is positively skewed.
(iv) The values of F theoretically range from zero to infinity.
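A diagram of F distribution curve is shown below.

[Figure: F distribution curve, with the region marked "Do not reject H0" to the left of the critical value and the region marked "Reject H0" in the right tail.]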

The rejection region is only in the right end tail of the curve because
unlike z distribution and t distribution which had negative values for areas below
the mean, F distribution has only positive values by definition and only positive
values of F that are larger than the critical values of F, will lead to a decision to
reject the null hypothesis.
Computation of F
The F ratio contains only two elements, which are the variance between the
samples and the variance within the samples.
If all the means of samples were exactly equal and all samples were
exactly representative of their respective populations so that all the sample
means, were exactly equal to each other and to the population mean, then
there will be no variance. However, this can never be the case. We always have
variation, both between samples and within samples, even if we take these
samples randomly and from the same population. This variation is known as
the total variation.
The total variation, designated by $\sum (X - \bar{\bar{X}})^2$, where X represents
individual observations for all samples and $\bar{\bar{X}}$ is the grand mean of all sample
means and serves as the estimate of ($\mu$), the population mean, is also known as the total sum of
squares or SST, and is simply the sum of squared differences between each
observation and the overall mean. This total variation represents the contribution
of two elements. These elements are:
(A) Variance between samples. The variance between samples may be due
to the effect of different treatments, meaning that the population means may be
affected by the factor under consideration, thus, making the population means
actually different, and some variance may be due to the inter-sample variability.
This variance is also known as the sum of squares between samples. Let this
sum of squares be designated as SSB.
Then, SSB is calculated by the following steps:
(i) Take k samples of size n each and calculate the mean of each sample,
i.e., $\bar{X}_1, \bar{X}_2, \bar{X}_3, \dots, \bar{X}_k$.
(ii) Calculate the grand mean $\bar{\bar{X}}$ of the distribution of these sample means,
so that,
$$\bar{\bar{X}} = \frac{\sum_{i=1}^{k} \bar{X}_i}{k}$$
(iii) Take the difference between the means of the various samples and the
grand mean, i.e.,
$(\bar{X}_1 - \bar{\bar{X}}), (\bar{X}_2 - \bar{\bar{X}}), (\bar{X}_3 - \bar{\bar{X}}), \dots, (\bar{X}_k - \bar{\bar{X}})$
(iv) Square these deviations or differences individually, multiply each of these
squared deviations by its respective sample size and sum up all these
products, so that we get:
$$\sum_{i=1}^{k} n_i (\bar{X}_i - \bar{\bar{X}})^2, \text{ where } n_i = \text{Size of the } i\text{th sample.}$$
This will be the value of the SSB.


However, if the individual observations of all samples are not available,
and only the various means of these samples are available, where the samples
are either of the same size n or different sizes, n1, n2, n3, ....., nk, then the value
of SSB can be calculated as:
$$SSB = n_1 (\bar{X}_1 - \bar{\bar{X}})^2 + n_2 (\bar{X}_2 - \bar{\bar{X}})^2 + \dots + n_k (\bar{X}_k - \bar{\bar{X}})^2$$
where,
n1 = Number of items in sample 1
n2 = Number of items in sample 2
nk = Number of items in sample k
$\bar{X}_1$ = Mean of sample 1
$\bar{X}_2$ = Mean of sample 2
$\bar{X}_k$ = Mean of sample k
$\bar{\bar{X}}$ = Grand mean or average of all items in all samples.
(v) Divide SSB by the degrees of freedom, which are (k - 1), where k is the
number of samples and this would give us the value of $\sigma^2_{between}$, so that,
$$\sigma^2_{between} = \frac{SSB}{(k - 1)}$$
(This is also known as mean square between samples or MSB).


(B) Variance within samples. Even though each observation in a given sample
comes from the same population and is subjected to the same treatment, some
chance variation can still occur. This variance may be due to sampling errors or
other natural causes. This variance or sum of squares is calculated through the
following steps:
(i) Calculate the mean value of each sample, i.e., $\bar{X}_1, \bar{X}_2, \bar{X}_3, \dots, \bar{X}_k$.
(ii) Take one sample at a time and take the deviation of each item in the
sample from its mean. Do this for all the samples, so that we would have
a difference between each value in each sample and their respective
means for all values in all samples.
(iii) Square these differences and take a total sum of all these squared
differences (or deviations). This sum is also known as SSW or sum of
squares within samples.
(iv) Divide this SSW by the corresponding degrees of freedom. The degrees
of freedom are obtained by subtracting the total number of samples from
the total number of items. Thus, if N is the total number of items or
observations, and k is the number of samples, then,
df = (N - k)
These are the degrees of freedom within samples. (If all samples are of
equal size n, then df = k(n - 1), since (n - 1) are the degrees of freedom
for each sample and there are k samples).
(v) This figure SSW/df, is also known as $\sigma^2_{within}$, or MSW (mean of sum of
squares within samples).
Now, the value of F can be computed as:

$$F = \frac{\sigma^2_{between}}{\sigma^2_{within}} = \frac{SSB/df}{SSW/df} = \frac{SSB/(k - 1)}{SSW/(N - k)} = \frac{MSB}{MSW}$$

This value of F is then compared with the critical value of F from the table
and a decision is made about the validity of null hypothesis.

11.4.4 ANOVA Table

After various calculations for SSB, SSW and the degrees of freedom have been
made, these figures can be presented in a simple table called Analysis of
Variance table or simply ANOVA table, as follows:

ANOVA Table
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square             F
Treatment             SSB              (k - 1)              MSB = SSB / (k - 1)     MSB / MSW
Within                SSW              (N - k)              MSW = SSW / (N - k)
Total                 SST

Then,

$$F = \frac{MSB}{MSW}$$

A Short-Cut Method
The formula developed above for the computation of the values of F-statistic is
rather complex and time consuming when we have to calculate the variance
between samples and the variance within samples. However, a simpler short-cut
method for these sums of squares is available, which considerably reduces the
computational work. This technique is used through the following steps:
(i) Take the sum of all the observations of all the samples, either by adding
all the individual values, or by multiplying the mean of each sample by its
size and then adding up all these products as follows:
The Total Sum or $TS = n_1 \bar{X}_1 + n_2 \bar{X}_2 + \dots + n_k \bar{X}_k$, for k samples
(ii) Calculate the value of a correction factor. The Correction Factor (CF)
value is obtained by squaring the total sum obtained above and dividing it
by the total number of observations N, so that:
$$CF = \frac{(TS)^2}{N}$$
(iii) The total sum of squares is obtained by squaring all individual observations
of all samples, summing up these values and subtracting from this sum,
the CF.
In other words:
$$\text{Total sum of squares } SST = \sum X_1^2 + \sum X_2^2 + \dots + \sum X_k^2 - \frac{(TS)^2}{N}$$
Where,
$\sum X_1^2$ = Summation of squares for all Xs in sample 1.
$\sum X_2^2$ = Summation of squares for all Xs in sample 2.
:
:
$\sum X_k^2$ = Summation of squares for all Xs in sample k.
(iv) The sum of squares between the samples (SSB) is obtained by the
following formula:
$$SSB = \frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \dots + \frac{(\sum X_k)^2}{n_k} - \frac{(TS)^2}{N}$$
Where,
$(\sum X_1)^2$ = Square of the total of all values in sample 1.
$(\sum X_2)^2$ = Square of the total of all values in sample 2.
$(\sum X_k)^2$ = Square of the total of all values in sample k.
(v) Then sum of squares within samples SSW can be calculated as:
SSW = Total sum of squares - Sum of squares between samples
= SST - SSB
(vi) The rest of the procedure is similar to the previous method.


Example 11.7:
To test whether all professors teach the same material in different sections of
the introductory statistics class or not, four sections of the same course were
selected and a common test was administered to five students selected at
random from each section. The scores for each student from each section were
noted and are given below. We want to test for any differences in learning, as
reflected in the average scores for each section.
Student #   Section 1      Section 2      Section 3      Section 4
            Scores (X1)    Scores (X2)    Scores (X3)    Scores (X4)
1           8              12             10             12
2           10             12             13             15
3           12             10             11             13
4           10             8              12             10
5           5              13             14             10
Totals      ΣX1 = 45       ΣX2 = 55       ΣX3 = 60       ΣX4 = 60
Means       X̄1 = 9         X̄2 = 11        X̄3 = 12        X̄4 = 12

Solution:
A. The traditional method

(i) State the null hypothesis. We are assuming that there is no significant
difference among the average scores of students from these four sections
and hence, all professors are teaching the same material with the same
effectiveness, i.e.,
$$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$$

H1: Not all means are equal, i.e., at least two means differ from each other.
(ii) Establish a level of significance. Let $\alpha$ = 0.05.
(iii) Calculate the variance between the samples, as follows:
(a) The mean of each sample is:

$\bar{X}_1 = 9, \bar{X}_2 = 11, \bar{X}_3 = 12, \bar{X}_4 = 12$
(b) The grand mean or $\bar{\bar{X}}$ is:
$$\bar{\bar{X}} = \frac{\sum \bar{X}}{n} = \frac{9 + 11 + 12 + 12}{4} = 11$$
(c) Calculate the value of SSB:
$$SSB = \sum n (\bar{X} - \bar{\bar{X}})^2$$
$$= 5(9 - 11)^2 + 5(11 - 11)^2 + 5(12 - 11)^2 + 5(12 - 11)^2$$
= 20 + 0 + 5 + 5
= 30
(d) The variance between samples $\sigma^2_{between}$ or MSB is given by:
$$MSB = \frac{SSB}{df} = \frac{30}{(k - 1)} = \frac{30}{3} = 10$$

(iv) Calculate the variance within samples, as follows:


To find the sum of squares within samples (SSW), we square each
deviation between the individual value of each sample and its mean, for
all samples and then sum these squared deviations, as follows:
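Sample 1: $\bar{X}_1 = 9$
$$\sum (X_1 - \bar{X}_1)^2 = (8 - 9)^2 + (10 - 9)^2 + (12 - 9)^2 + (10 - 9)^2 + (5 - 9)^2$$
= 1 + 1 + 9 + 1 + 16
= 28

Sample 2: $\bar{X}_2 = 11$
$$\sum (X_2 - \bar{X}_2)^2 = (12 - 11)^2 + (12 - 11)^2 + (10 - 11)^2 + (8 - 11)^2 + (13 - 11)^2$$
= 1 + 1 + 1 + 9 + 4
= 16

Sample 3: $\bar{X}_3 = 12$
$$\sum (X_3 - \bar{X}_3)^2 = (10 - 12)^2 + (13 - 12)^2 + (11 - 12)^2 + (12 - 12)^2 + (14 - 12)^2$$
= 4 + 1 + 1 + 0 + 4
= 10

Sample 4: $\bar{X}_4 = 12$
$$\sum (X_4 - \bar{X}_4)^2 = (12 - 12)^2 + (15 - 12)^2 + (13 - 12)^2 + (10 - 12)^2 + (10 - 12)^2$$
= 0 + 9 + 1 + 4 + 4
= 18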


Then, SSW = 28 + 16 + 10 + 18 = 72

Now, the variance within samples, $\sigma^2_{within}$, or MSW is given by:
$$MSW = \frac{SSW}{df} = \frac{SSW}{(N - k)} = \frac{72}{20 - 4} = \frac{72}{16} = 4.5$$

Then, the F-ratio $= \dfrac{MSB}{MSW} = \dfrac{10}{4.5} = 2.22$.

Now, we check for the critical value of F from the table for $\alpha$ = 0.05 and
degrees of freedom as follows:
df (numerator) = (k - 1) = (4 - 1) = 3
df (denominator) = (N - k) = (20 - 4) = 16
This value of F from the table is given as 3.24. Now, since our
calculated value of F = 2.22 is less than the critical value of F = 3.24, we
cannot reject the null hypothesis.
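The traditional method can be reproduced with a short Python sketch. The function name `one_way_anova` is illustrative; the data are the four sections' scores from Example 11.7. The critical value 3.24 is taken from the F table as stated above, not computed.

```python
def one_way_anova(samples):
    """Traditional one-way ANOVA: returns SSB, SSW, MSB, MSW and F."""
    k = len(samples)                        # number of samples
    N = sum(len(s) for s in samples)        # total number of observations
    grand_mean = sum(sum(s) for s in samples) / N
    means = [sum(s) / len(s) for s in samples]
    # Between samples: squared deviation of each sample mean, weighted by size
    ssb = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # Within samples: squared deviation of each item from its own sample mean
    ssw = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    msb = ssb / (k - 1)
    msw = ssw / (N - k)
    return ssb, ssw, msb, msw, msb / msw

sections = [
    [8, 10, 12, 10, 5],     # Section 1
    [12, 12, 10, 8, 13],    # Section 2
    [10, 13, 11, 12, 14],   # Section 3
    [12, 15, 13, 10, 10],   # Section 4
]
ssb, ssw, msb, msw, f_ratio = one_way_anova(sections)
print(ssb, ssw, msb, msw, round(f_ratio, 2))
print("reject H0" if f_ratio > 3.24 else "cannot reject H0")
```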
B. The Short-Cut Method
Following the procedure outlined before for using the short-cut method, we get:
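(i) Total Sum (TS) = $\sum X$ = 220
(ii) Correction factor
$$CF = \frac{(TS)^2}{N} = \frac{(220)^2}{20} = 2420$$
(iii) Total sum of squares:
$$SST = \sum (X^2) - CF = 2522 - 2420 = 102$$
(iv) Sum of squares between the samples SSB is obtained by:
$$SSB = \sum_{i=1}^{k} \frac{(\sum X_i)^2}{n_i} - CF = \frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \dots + \frac{(\sum X_k)^2}{n_k} - CF$$
$$= \frac{(45)^2}{5} + \frac{(55)^2}{5} + \frac{(60)^2}{5} + \frac{(60)^2}{5} - 2420$$
= 405 + 605 + 720 + 720 - 2420
= 30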

(v) SSW can be calculated by:
SST - SSB = 102 - 30 = 72
Now the F value can be calculated as:
$$F = \frac{SSB/df}{SSW/df} = \frac{30/(k - 1)}{72/(N - k)} = \frac{30/3}{72/16} = \frac{10}{4.5} = 2.22$$
As we see, we get the same value of F as obtained by the traditional
method. So, we compare our value of F with the critical value of F from
the table for $\alpha$ = 0.05 and df (numerator = 3), and df (denominator = 16),
and we get the critical value of F as 3.24. As before, we cannot reject the null
hypothesis.
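The short-cut steps (total sum, correction factor, SST, SSB, then SSW by subtraction) can be sketched as follows. The function name `anova_shortcut` is hypothetical; the data are again the scores of Example 11.7.

```python
def anova_shortcut(samples):
    """Short-cut one-way ANOVA using the correction factor CF = TS^2 / N."""
    k = len(samples)
    N = sum(len(s) for s in samples)
    ts = sum(sum(s) for s in samples)                    # total sum of all items
    cf = ts ** 2 / N                                     # correction factor
    sst = sum(x * x for s in samples for x in s) - cf    # total sum of squares
    ssb = sum(sum(s) ** 2 / len(s) for s in samples) - cf
    ssw = sst - ssb
    f = (ssb / (k - 1)) / (ssw / (N - k))
    return ts, cf, sst, ssb, ssw, f

sections = [
    [8, 10, 12, 10, 5],     # Section 1
    [12, 12, 10, 8, 13],    # Section 2
    [10, 13, 11, 12, 14],   # Section 3
    [12, 15, 13, 10, 10],   # Section 4
]
ts, cf, sst, ssb, ssw, f_ratio = anova_shortcut(sections)
print(ts, cf, sst, ssb, ssw, round(f_ratio, 2))
```

Only sums and sums of squares are needed here, which is why the short-cut suits hand calculation: no deviation from a mean has to be worked out item by item.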
The ANOVA Table
We can construct an ANOVA table for the problem solved above as follows:
ANOVA Table
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square                           F
Treatment             SSB = 30         (k - 1) = 3          MSB = SSB / (k - 1) = 30 / 3 = 10     MSB / MSW = 10 / 4.5 = 2.22
Within (or error)     SSW = 72         (N - k) = 16         MSW = SSW / (N - k) = 72 / 16 = 4.5
Total                 SST = 102

Activity 2
Prepare a list of magazines and categorize them into three groups according
to the educational level of the readers as high, medium and low. Select six
advertisements randomly from each of the magazines and for each
advertisement collect three different readability measures. Perform one
way ANOVA tests to determine whether advertisement readabilities of the
three groups of magazines are different.


Self-Assessment Questions
5. Fill in the blanks with the appropriate terms.
(a) The simultaneous ______________ of several population means is
called analysis of variance or ANOVA.
(b) F ratio contains only _____________ elements, which are the
variance between the samples and the variance within the samples.
6. State whether true or false.
(a) The one-way analysis of variance refers to the situations when only
one factor or variable is considered.
(b) The F distribution is a family of curves, so that there are similar
curves for different degrees of freedom.

11.5 Summary
Let us recapitulate the important concepts discussed in this unit:
• William S. Gosset (pen name "Student") developed a significance test
and through it made a significant contribution to the theory of sampling
applicable to small samples. The test, used when the population variance
is not known, is commonly known as Student's t-test and is based on
the t distribution.
• When n is small, the t distribution is far from normal, but as n tends to
infinity it becomes identical with the normal distribution.
• To apply the t-test to small samples, the t value is calculated first and
the calculated value is then compared with the table value of t at a
certain level of significance for the given degrees of freedom.
• R.A. Fisher developed the z-test to test the significance of the correlation
coefficient in small samples. In applying the test, r of the sample is
transformed into z, on account of which the test is also known as the z
transformation.
• The statistic z is used to test (i) whether an observed value of r is
significantly different from a given hypothetical or known value of the
population correlation, and (ii) whether two sample values of r differ
significantly from each other.
• The one-way analysis of variance refers to situations in which only one
factor or variable is considered.
• The simultaneous comparison of several population means is called
Analysis of Variance or ANOVA.
• The F distribution is a family of curves, so that there are different curves
for different degrees of freedom.
• The F ratio contains only two elements, which are the variance between the
samples and the variance within the samples.

11.6 Glossary
t-test: Any statistical hypothesis test in which the test statistic follows a
Student's t distribution if the null hypothesis is true.
z-test: Any statistical test for which the distribution of the test statistic
under the null hypothesis can be approximated by a normal distribution.
ANOVA: In statistics, analysis of variance or ANOVA is a collection of
statistical models and their associated procedures in which the observed
variance in a particular variable is partitioned into components attributable
to different sources of variation.

11.7 Terminal Questions


1. Who developed t-test? When is it used?
2. Define z-test.
3. On what assumptions is the ANOVA methodology based?
4. What is the rationale behind analysis of variance?
5. Define degrees of freedom with respect to ANOVA.
6. What are the major characteristics of F distribution? How is F computed?
7. How is ANOVA table constructed? Explain with the help of an example.

11.8 Answers
Answers to Self-Assessment Questions
1. (a) Variance; (b) Equality
2. (a) False; (b) True
3. (a) True; (b) True
4. (a) Transformation; (b) Difference
5. (a) Comparison; (b) Two
6. (a) True; (b) False

Answers to Terminal Questions


1. Refer Section 11.2
2. Refer Section 11.3
3. Refer Section 11.4
4. Refer Section 11.4.1
5. Refer Section 11.4.2
6. Refer Section 11.4.3
7. Refer Section 11.4.4

11.9 Further Reading


1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007.
2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand
& Sons, 2010.

Unit 12

Research Report Writing

Structure
12.1 Introduction
Objectives
12.2 Introduction to Report Writing
12.3 Types of Research Reports
12.4 Summary
12.5 Glossary
12.6 Terminal Questions
12.7 Answers
12.8 Further Reading

12.1 Introduction

In the previous unit, you learnt about tests of significance for small samples: the t-test, the z-test and analysis of variance.


In this unit, you will learn the concept of report writing. Reports are used
for different purposes by different departments of an organization. Industries,
governments and businesses all need to prepare reports in order to collect
information and to keep a track of their performance and progress. The most
important aspect of a report is to convey information in clear terms. It should
provide the facts in a direct, straightforward and accurate manner. This unit
introduces you to the different types of reports and also shows you the correct
methods of report presentation.

Objectives
After studying this unit, you should be able to:
Describe the importance of reports
Explain the different types of reports
Define the characteristics of a good report
Describe the structure of a report
Use the correct method of presenting reports

12.2 Introduction to Report Writing


A report can be defined as a written document which presents information in a
specialized and concise manner. A list of employees prepared by the HR
department for salary distribution can be termed as a report. In other words, a
report is information presented in a logical and concise manner.
There is a difference between report writing and other compositions
because a report is written in a short and conventional format. A report should
cover all mandatory matters but nothing extra should be written. For writing a
report, at first the relevant data is collected and then it is presented in a concise
and objective manner. Then, after successfully establishing the structure of the
report, the formatting features that improve the look and readability of the report
are added.

12.2.1 Types of Reports


Reports can be divided into different categories. The two main types of reports
are:
Informational report
Interpretive report
Informational report
A report that consists of a collection of data or facts and is written in an orderly
way is called an informational report. The main purpose of this type of report is
to present the information in its original form without any conclusion and
recommendation. Informational reports are further divided into four types as
follows:
Inspection reports: Reports which record the results of inspecting products or
equipment, either to assure their proper functioning or to describe their
quality, are called inspection reports. This type of report is mainly used in
manufacturing organizations.
Inventory reports: Reports which are made to keep stock of various
things like furniture, equipment, stationery, utensils and other accessories
are called inventory reports.
Assessment reports: These reports are made to maintain the database
of the employees in an organization. Generally, these reports are useful
for the HR department.

Performance reports: Reports which are made to measure the
performance of the employees in an organization for different purposes
like appraisal or promotion are called performance reports.
Interpretive report
An interpretive report contains a collection of data with its interpretation or any
recommendation explicitly specified by the writer. This type of report also includes
data analysis and conclusions made by the report writer. Writing interpretive
reports is different from writing informational reports because they contain
different elements. The possible elements that can be used in interpretive reports
are:
Cover
Frontispiece
Title page
Copyright notice
Forwarding letter
Preface
Acknowledgements
Table of contents
List of illustrations
Abstract and summary
Introduction
Discussion
Conclusions
Recommendations
Appendices
List of references
Bibliography
Glossary
Index

12.2.2 Characteristics of a Good Report


The characteristics of a good report can be classified under the following four
heads:
Language and style of the report
Structure of the report
Presentation of the report
References in the report
Each of the above aspects of report writing needs to be given due attention
as they are interrelated. A report written in a lucid style but with
very little, or merely hypothetical, information is of no use to the reader.
Similarly, the report writer needs to avoid overcrowding the report with
information, which may leave the reader confused and lost in the data, so
that the report loses its appeal. A
systematic scrutiny of each of these aspects of a report is, therefore, necessary.
1. Language and Style of a Report: A report must have a logical structure
with a clear indication of where the ideas are leading. It should be able to
make a good first impression. The presentation of the report is very
important. All reports must be written in good language, using short
sentences and correct grammar and spellings. The main points to be
kept in mind in this light are as follows:
Context and style:
o Appropriate and informative title for the content of the report
o Crisp, specific, unbiased writing with minimal jargon
o Adequate analysis of prior relevant research
Questions/Hypotheses:
o Clearly stated questions or hypotheses
o Thorough operational definitions of key concepts along with
exact wording or measurement of key variables
Research procedures:
o Full and clear description of the research design
o Demographic profile of the participants/subjects
o Specific data gathering procedures

Data analysis:
o Appropriate inferential statistics for sample or experimental data
and appropriate use of descriptive statistics
o Clear and reasonable interpretation of the statistical findings,
accompanied by effective tables and figures
Summary:
o Fair assessment of the implications and limitations of the
findings
o Effective commentary on the overall implications of the findings
for theory and/or policy
2. Structure of a Report: Before you write a report, you should define the
high level structure of the report. Defining a clear logical structure will
make the report easier to write and to read. There are two types of report
structures, which are listed as follows:
Report Structure I: In general, the report writing structure comprises
the following subheadings:
o Title Page
o Abstract
o Table of Contents
o Introduction
o Technical Detail and Results
o Discussion and Conclusions
o References
o Appendices
Report Structure II: There is also a specific structure of report writing
pertaining to technical or scientific reports which is as follows:
o Introduction
o Background and Context
o Technical Details
o Results
o Discussion and Conclusion

Order of writing:
o Start with the technical chapters/sections.
o Follow with the discussion.
o Finally, write the conclusions, introduction and abstract, if you
are including any.
Appendix: The appendix should contain the following:
o Material that suits or goes well with the flow of the main report
but cannot be included in the main text of the report either
because it is too long or is not essential reading, for example,
lists of parameter values, etc.
o Bibliography, i.e., list of all the sources of material, you referred
to in your report.
3. Presentation of a Report: As stated earlier, neither a mass of data nor
a lucid style of writing is sufficient on its own for good report writing.
Both aspects need to be given due consideration, so that they combine
to give a simple, easy-to-read and comprehensive report. The same
goes for the presentation of the contents of the report. Printing mistakes
and inconsistent use of font size and style can distract the reader.
On the other hand, effective use of tables and figures to present the
data and the conclusions drawn from it facilitates easy
comprehension. The main points where due attention is required
on the part of the report writer are as follows:
Capitals: This requires taking care of the following aspects:
o Using capitals only for proper nouns, place names, organization
names, etc.
o Defining acronyms at the first point of usage. For example,
Incorporated (Inc).
o Using bold, italics or underlines for emphasis, instead of capitals.
Headings: The basic points to be kept in mind for headings are as
follows:
o Differentiate headings from the rest of the text using different
fonts, bold, italics or underlines.
o Maintain consistency in formatting headings using predefined
styles.
o Avoid headings beyond three levels.
Tables, figures and equations: In general, certain formatting
standards are followed when presenting tables and figures; these are as
follows:
o Descriptive labelling of all tables at the top with reference in the
text.
o All figures must be labelled descriptively at the top and must be
referenced in the text.
o All equations must be numbered consecutively.
General presentation:
o Sheets should be of white A4 size and printed on one side only.
o Text should be justified on both sides and leave a blank line
between paragraphs.
o A staple in the top right hand corner is sufficient for most of the
reports.
4. References in a Report: Several report types like scientific, engineering,
technical and census reports contain either original writing or text adopted
from previous work. As such, a report writer should be careful and should
avoid any violation of copyright laws and plagiarism. The necessary rule
of thumb in this regard can be stated as follows:
o Citations and referencing:
A citation is the acknowledgement in your writing of the work of
other authors and includes paraphrasing and making direct
quotes.
Unless citation is very necessary, you should write the material
in your own words. This shows that you have understood what
you have read and know how to apply it, to your own context.
Direct quotes should be used sparingly.
o Direct quotes:
Short direct quotes: These need to be placed between
quotation marks. For example, Rosenfield defines a cluster as
"a geographically bounded concentration of similar, related or
complementary businesses, with active channels for business
transactions, communications and dialogue that share
specialized infrastructure, common opportunities and threats."
This shows clearly that the words being used are not your own
words.
Longer direct quotes: There are occasions when it is useful
to include longer direct quotes. If you are quoting more than
forty words, you should again use quotation marks but also
indent the text. For example, "the sustainability of higher value
added industry is grounded in the diminishing significance of
cost structures. At the level of the European Union, a weak
capacity to innovate has been identified as an innovation, in
the sense of product, process, and organizational innovation,
accounts for a very large amount, perhaps 80 to 90 per cent of
the growth in productivity in advanced economies."
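
The rule for direct quotes described above (quotation marks for every direct quote, plus indentation once the quote runs to more than forty words) can be sketched as a small helper. This is an illustrative sketch only: the function name `format_direct_quote`, the four-space indent and the 70-character line width are assumptions made for the example, not requirements stated in the text.

```python
import textwrap

LONG_QUOTE_WORDS = 40  # threshold from the text: "more than forty words"


def format_direct_quote(quote, indent="    ", width=70):
    """Wrap a direct quote in quotation marks; also indent it if it is long."""
    quoted = '"' + quote.strip() + '"'
    if len(quote.split()) > LONG_QUOTE_WORDS:
        # Longer quote: keep the quotation marks and set the text off
        # as an indented block (indent and width are assumed values).
        return textwrap.fill(quoted, width=width,
                             initial_indent=indent,
                             subsequent_indent=indent)
    # Short quote: quotation marks only, left inline with the sentence.
    return quoted


short_quote = format_direct_quote("active channels for business transactions")
long_quote = format_direct_quote(" ".join(["word"] * 45))  # dummy 45-word quote
print(short_quote)
print(long_quote)
```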

12.2.3 Mechanics of Writing a Report


There are several parameters that are strictly followed while preparing technical
reports. The following points should be considered for writing a technical report:
Size and physical design: The manuscript, if handwritten, should be in
black or blue ink and on unruled paper of 8½" × 11" size. A margin of at
least one-and-a-half inches is set at the left side and half an inch at the
right side of the paper. The top and bottom margins should be one inch
each. If the manuscript is to be typed, then all typing should be double
spaced and on one side of the paper, except for the insertion of long
quotations.
Layout: According to the objective and nature of the research, the layout
of the report should be decided and followed in a proper manner.
Quotations: Quotations should be placed in quotation marks and
double spaced, forming an immediate part of the text. However, if a
quotation is too lengthy, then it should be single spaced and indented at
least half an inch to the right of the normal text margin.
Footnotes: Footnotes are meant for cross-references. They are placed
at the bottom of the page, separated from the textual material by a space
of half an inch and a line that is around one-and-a-half inches long.
Footnotes are always typed in single space, though they are divided from
one another by double space.
Documentation style: The first footnote reference to any given work
should be complete, giving all essential facts about the edition used. Such
footnotes follow a general sequence and order:
o In case of the single volume reference:
Author's name in normal order
Title of work, underlined to indicate italics
Place and date of publication
Page number reference
For example,
John Gassner, Masters of the Drama, New York: Dover Publications,
Inc., 1954, p. 315.
o In case of a multivolume reference:
Author's name in the normal order
Title of work, underlined to indicate italics
Place and date of publication
Number of the volume
Page number reference
For example,
George Birkbeck Hill, Life of Johnson, June 2004, Whitefish, Volume
2, p. 124.
o In case of works arranged alphabetically:
For works arranged alphabetically such as encyclopaedias and
dictionaries, no page reference is usually needed. In such cases,
the reference is given by the name of the topic.
Name of the encyclopaedia
Number of the edition
For example,
"Salamanca," Encyclopaedia Britannica, 14th Edition.
o In case of periodicals reference:
Name of the author in normal order
Title of article, in quotation marks
Name of the periodical, underlined to indicate italics
Volume number
Date of issuance
Pagination

For example,
Shahad, P.V., "Rajesh Jain's Ecosystem," in Business Today,
Vol. 14, December 18, 2005, p. 28.
o In case of multiple authorship:
If there are more than two authors or editors, then in the
documentation, the name of only the first is given and multiple
authorship is indicated by "et al." or "and others".
Author's name in normal order
Title of work, underlined to indicate italics
Place and date of publication
Pagination references
For example,
Alexandra K. Wigdor, Ability Testing: Uses, Consequences and
Controversies, 1981, p. 23.
Subsequent references to the same work need not be detailed.
If the work is cited again without any other work intervening, it may
be indicated as ibid., followed by a comma and the page number.
Punctuations and abbreviations in footnotes: Punctuation concerning
the book and author names has already been discussed. They are general
rules to be strictly adhered to. Some English and Latin abbreviations are
often used in bibliographies and footnotes to eliminate any repetition.
Table 12.1 shows the various English and Latin abbreviations used
in bibliographies and footnotes.
Table 12.1 English and Latin Abbreviations used in Bibliographies and Footnotes

| Abbreviation | Meaning |
|---|---|
| anon. | anonymous |
| ante | before |
| art. | article |
| aug. | augmented |
| bk. | book |
| bull. | bulletin |
| cf. | compare |
| ch. | chapter |
| col. | column |
| diss. | dissertation |
| ed. | editor, edition, edited |
| ed. cit. | edition cited |
| e.g. | exempli gratia: for example |
| enl. | enlarged |
| et al. | and others |
| et seq. | et sequens: and the following |
| ex. | example |
| f., ff. | and the following page(s) |
| fn. | footnote |
| ibid. (ibidem) | in the same place |
| id. (idem) | the same |
| ill., illus. or illust(s). | illustrated, illustration(s) |
| intro. | introduction |
| l., ll. | line(s) |
| loc. cit. | in the place cited; used as op. cit. |
| MS., MSS. | manuscript(s) |
| N.B. (nota bene) | note well |
| n.d. | no date |
| n.p. | no place |
| no pub. | no publisher |
| no(s). | number(s) |
| o.p. | out of print |
| op. cit. | in the work cited |
| p., pp. | page(s) |
| passim | here and there |
| post | after |

Use of statistics, charts and graphs: Statistics contribute to clarity and
simplicity in a report. They are usually presented in the form of tables,
charts, bars, line-graphs and pictograms.
Final draft: It requires careful scrutiny with regard to grammatical errors,
logical sequence and coherence in the sentences of the report.
Index: An index acts as a good guide to the reader. It can be prepared
both as subject index and author index, giving names of subjects and
names of authors, respectively. The names are followed by the page
numbers of the report, where they have appeared or been discussed.
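
The documentation style described in this section (a complete first footnote in the order author, title, place and date of publication, volume if any, and page, with "ibid." when the same work is cited again without any other work intervening) can be sketched as follows. This is a hedged illustration: the function names, the dictionary keys and the second page number (320) are hypothetical, and a later, non-consecutive reference to a work is simply given in full here because the text does not prescribe a shortened form for that case.

```python
def first_footnote(author, title, place, publisher, year, page, volume=None):
    """Build the first, complete footnote reference to a work."""
    parts = [author, title, f"{place}: {publisher}", str(year)]
    if volume is not None:
        parts.append(f"Volume {volume}")   # multivolume reference
    parts.append(f"p. {page}")
    return ", ".join(parts) + "."


def number_footnotes(citations):
    """Turn (work, page) pairs into footnote texts, using ibid. for
    consecutive references to the same work."""
    notes = []
    previous_key = None
    for work, page in citations:
        key = (work["author"], work["title"])
        if key == previous_key:
            notes.append(f"ibid., p. {page}.")
        else:
            notes.append(first_footnote(page=page, **work))
        previous_key = key
    return notes


# The single-volume example from the text; page 320 is a made-up second citation.
gassner = {"author": "John Gassner", "title": "Masters of the Drama",
           "place": "New York", "publisher": "Dover Publications, Inc.",
           "year": 1954}
notes = number_footnotes([(gassner, 315), (gassner, 320)])
for number, note in enumerate(notes, start=1):
    print(number, note)
```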

12.2.4 Research Report: An Overview


In simple terms, a research report is a written document which describes the
findings of an individual or group of individuals. It gives an account of something
seen, heard, done, etc. The findings may comprise information such as data,
surveys, resolutions or policies on which the concerned individual or individuals
have to submit their reports, which should include the proceedings as well as
the relevant conclusions.
The preparation and presentation of a research report is the most important
part of the research process. No matter how well designed the research study
is, it is of little value, unless communicated effectively to others in the form of a
research report. Moreover, if the report is confusing or poorly written, then the
time and effort spent on gathering and analysing data would be wasted. It is
therefore essential to summarize and communicate the result to the management
of an organization with the help of an understandable and logical research report.
Research reports are helpful during the research study, in the sense that
they facilitate maintenance of vast data in a logical way. Thus, in case the
researcher experiences any difficulty during the course of the study, it becomes
easier to refer to the contents of the report to get the relevant data. Research
report writing essentially involves systematic arrangement of data. This helps in
discovering flaws in reasoning, which may have been missed earlier while
conducting a research.
1. Format of a Research Report: The layout of the research report is of
utmost importance because the reader should be able to grasp logically
what has been said and not feel lost in the bulk of findings mentioned in
the research. This requires preparing a proper layout for the report. Report
layout means arranging the research findings in a comprehensible format.
The layout should contain the following points:
Preliminary pages: In the preliminary pages, the report should carry
a title and a date, followed by acknowledgements in the form of a
"Preface" or "Foreword". The Table of Contents should come next,
followed by a list of tables and illustrations. This facilitates easy
reading and quick location of the required information.

Main text: The main text comprises the complete outline of the
research report with all the details. The title of the research study is
repeated at the top of the first page of the main text, and then followed
with the other details on the pages numbered consecutively,
beginning with the second page. The main text can be classified
into the following sections:
o Introduction: The purpose of the introduction is to introduce the
research project to the readers. It should clearly state the
objectives of the research, i.e., it should clarify why the problem
was considered worth investigating. A brief summary of other
relevant research can be included as well, to enable the reader
to see the present study in that context.
o The methodology used for performing the study: The
introduction should contain answers to questions such as: How was
the study carried out? What was the basic design? What were
the experimental directions? What were the questions asked
in the questionnaires used? Besides this, the scope and
limitations of the study must be marked out.
o Statement of findings and recommendations: The research
report should comprise a statement of findings and
recommendations in a nontechnical language so that it is easily
comprehensible.
o Results: A detailed presentation of the findings of the study,
with supporting data in tabular forms along with the validation
of results, should be given. This section should contain statistical
summaries and deductions of the data rather than the raw data.
There should be a logical sequence and sectional presentation
of the results.
o Implications of the result: The researcher should restate
the results clearly and precisely at the end of the main
text. The implications derived from the results of the research
study should be stated there as well. The report should
also mention the conclusion drawn from the study, which should
be clearly related to the hypothesis stated in the introductory
section.
o Summary: The next step is to conclude the report with a short
summary, mentioning in brief the research problem, the
methodology, the major findings and the major conclusions
drawn from the research results.
o End matter: The end of the research report should consist of
appendices listed with respect to all technical data such as
questionnaires, sample information and mathematical
derivations. The bibliography of the referred sources and an
index should also be given.
2. Precautions for Writing Research Reports: A research report is the
means of conveying the research study to a specific target audience. The
following precautions should be taken while preparing the research report:
• It should be long enough to cover the subject and short enough to
preserve the reader's interest.
• It should not be dull and complicated.
• It should be simple, without the usage of abstract terms and technical
jargon.
• It should offer ready availability of findings with the help of charts,
tables and graphs, as readers prefer quick knowledge of main
findings.
• The layout of the report should be in accordance with the objective
of the research study.
• There should be no grammatical errors and the writing should adhere
to the techniques of report writing in case of quotations, footnotes and
documentation.
• It should be original, intellectual and should contribute to the solution
of a problem or add knowledge to the concerned field.
• Appendices should be listed with respect to all the technical data in
the report.
• It should be attractive, neat and clean, whether handwritten or typed.
• The report writer should not confuse the possessive form of the word
"it" with "it's". The correct possessive form of "it" is "its".
• A report should not have contractions. Examples are "didn't" or "it's".
In report writing, it is best to use the uncontracted form. Hence, the
examples would be replaced by "did not" and "it is". Using "Figure"
instead of "Fig." and "Table" instead of "Tab." will spare the reader
having to translate the abbreviations while reading. If abbreviations
are used, use them consistently throughout the report. For example,
do not switch between "versus" and "vs.".
• It is advisable to avoid using the word "very" and other such words
that try to embellish a description. They do not add any extra meaning
and, therefore, should be dropped.
• Repetition hampers lucidity. The report writer must avoid repeating
the same word more than once within a sentence.
• When you use the word "this" or "these", make sure you indicate to
what you are referring. This reduces the ambiguity in your writing
and helps to tie the sentences together.
• Do not use the word "they" to refer to a singular person. You can
either rewrite the sentence to avoid using such a reference or use
the singular "he" or "she".

12.2.5 Written and Oral Reports


A written report plays a vital role in every business operation. The manner in
which an organization writes business letters and business reports creates an
impression of its standard. Therefore, the organization should emphasize
improving the writing skills of its employees in order to maintain effective
relations with its customers.
Preparing an effective written report requires a lot of hard work. Therefore,
before you begin writing, it is important to know the objective, i.e., the purpose
of writing, and to collect and organize the required data.
1. Written Report: Writing a report is the best way to communicate, and
often the only way to convey one's ideas to others. Thus, it is necessary
that the writing should be effective. To improve the effectiveness of
a report, the following important points should be kept in mind:
• Take breaks in between writing, since this gives you the time to
incubate the ideas.
• Start writing a short manuscript first, and later on, the detailed one.
• Create an outline and organize the complete work.
• Make a checklist of the important points that are necessary to be
covered in the manuscript.
• Focus on one objective at a time.
• Use a dictionary and relevant reference materials as and when
required.

Principles of writing a report


To write a useful report, it is necessary to follow certain principles. The following
are the principles that must be followed while writing a report:
Principle of purpose: A report must have a clear and meaningful purpose
that can be translated into effective management action. A clear statement of
the purpose helps prepare a well-focussed report on which the
management can work. Specification of the purpose is important because:
o Reports are the analysis of facts and proposals.
o Reports are the record of a particular business activity.
Principle of organization: A report that is written should be well-designed
and well-ordered. The managerial plan of a report must include the
following:
o Purpose of report
o Information required to be included in the report
o Method used to collect report data
o Summary of the report
o Problems and solutions of the subject mentioned in the report
o An appendix that describes and confirms the content and conclusion
of the report
Principle of brevity: Reports should be concise. This is essential because:
o Long reports are costly.
o Long reports are difficult to examine.
o Long reports are prone to disapproval, as they seem inefficient.
o Long reports focus on irrelevant minor details, which may cause
major points to be overlooked.
Principle of clarity: Reports should be clear. Clarity can be maintained
by using simple language for writing the report. New terms, if any in the
report, should be properly explained to avoid confusion.
Principle of scheduling: Reports should be prepared when
there is no undue burden on the staff, so that the staff has sufficient time
to prepare them. However, the time between the gathering of
data and the generation of finished reports should not be long; otherwise,
the report may become outdated and useless.

Principle of cost: While preparing reports, it is necessary that a cost-benefit
analysis of the report should be done. A report should have minimum
cost and maximum benefit. If the cost of preparation of the report
is high but its benefit is low, then it is not advisable to prepare that report.
Different formats of written reports
A written report can be written in various formats, some of which are as follows:
Straight-line format: This format is used when the information is to be
presented in alphabetical, sequential or numerical order. This format is
used to generate descriptive reports.
Building blocks format: This format is used when the information
presented leads to some conclusion. The report in this format starts with
a brief introduction, presents some logical facts and ends with the
conclusions and recommendations.
Inverted pyramid format: The report in this format has the most important
item at the top, and the least important item at the bottom of the report.
That is, items are listed in the descending order with the most important
item at the top. This style of writing or format is also known as journalistic
style or format.
2. Oral Report: At times, oral presentation of the results that are drawn out
of research is considered effective, particularly in cases where policy
recommendations are to be made. This approach proves beneficial
because it provides a medium of interaction between the listener and the
speaker. This leads to a better understanding of the findings and their
implications. However, the main drawback of oral presentation is lack of
any permanent records related to the research. Oral presentation of the
report is also effective when it is supported by various visual devices
such as slides, wall charts and white boards that help in better
understanding of the research reports.
Advantages of oral reports
Oral reports help in direct communication without any delay. The following
are some of the advantages of an oral report:
It provides immediate feedback to the participants of the oral report.
Moreover, participants can also ask for further clarification, elaboration
and justifications.
It is time saving.


It helps develop relationships among employees by building a healthy
atmosphere in an organization.
It is an effective tool of persuasion in business.
It is economical as it saves a large amount of money spent on stationery.
It provides the speaker with the opportunity to correct himself and make
himself clear on the spot.
It helps speakers to immediately understand the reaction of the group
that they are addressing.
Disadvantages of oral reports
There are many disadvantages of oral reports; these include the following:
Oral reports may not always be time saving. Sometimes, the meeting
between the speaker and the listener can continue for a very long time
without any satisfactory conclusion.
A listener of the oral report cannot always retain the entire message.
The messages in the oral reports do not have any legal validity as they
are not documented.
Oral reports may sometimes be misleading, if the thoughts of the speaker
are not organized carefully.
Lengthy oral messages may sometimes cause problems.
Activity 1
Suppose you have to write a report on the socio-economic awareness of
your country. List the headings and the procedure that you will include for
your research purpose.

Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) A report that consists of a collection of data or facts and is written in
an orderly way is called an __________________ report.
(b) An interpretive report contains a collection of data with its
interpretation or any _____________________ explicitly specified
by the writer.


2. State whether true or false.


(a) A report should cover all mandatory matters but nothing extra should be
written.
(b) For writing a report, at first the relevant data is collected and then it is
presented in a concise and objective manner.

12.3 Types of Research Reports


Research reports are designed to convey and record the information that will
be of practical use to the reader. It is organized into distinct units of specific and
highly visible information. The kind of audience addressed in the research report
decides the type of report. Research reports can be categorized on the following
basis:
On the basis of information
On the basis of representation

12.3.1 Classification on the Basis of Information


Following are the ways in which the results of the research report can be
presented on the basis of information contained:
Technical report: A technical report is written for other researchers. In
writing technical reports, emphasis is placed mainly on the methods
that have been used to collect the information and the data, the
assumptions that are made and, finally, the various presentation
techniques that are used to present the findings and the data. Following
are the main features of a technical report:
o Summary: It covers a brief analysis of the findings of the research in
a very few pages.
o Nature: It contains the reasons for which the research is undertaken,
the analysis and the data that is required in order to prepare a report.
o Methods employed: It contains a description of the methods that
were employed in order to collect the data.
o Data: It covers a brief analysis of the various sources from which the
data has been collected with their features and drawbacks.
o Analysis of data and presentation of the findings: It contains the
various forms through which the data that has been analyzed, can be
presented.

o Conclusions: It contains a brief explanation of findings of the research.
o Bibliography: It contains a detailed list of the various sources
that have been consulted in order to conduct the research.
o Technical appendices: It contains the appendices for the technical
matters and for questionnaires and mathematical derivations.
o Index: The index of the technical report must be provided at the end
of the report.
Popular report: A popular report is formulated when there is a need to
draw the conclusions of the findings of the research report. The main
point to be kept in mind while formulating a popular report is that it
must be simple and attractive. It must be written in a very simple manner
that is understandable to all. It can also be made attractive by using
large print, various subheadings and occasional cartoons. Following are
the main points that must be kept in mind while preparing a popular report:
o Findings and their implications: While preparing a popular report,
main importance is given to the findings of the information and the
conclusions that can be drawn out of these findings.
o Recommendations for action: If there are any deviations in the report,
then recommendations are made for taking corrective action in order
to rectify the errors.
o Objective of the study: In a popular report, the specific objective for
which the research has been undertaken is presented.
o Methods employed: The report must contain the various methods
that have been employed in order to conduct a research.
o Results: The results of the research findings must be presented in a
suitable and appropriate manner by taking the help of charts and
diagrams.
o Technical appendices: The report must contain, in the form of
appendices, in-depth information on the methods used to collect the data.

12.3.2 Classification on the Basis of Representation


Following are the ways through which the results of the research report can be
presented on the basis of representation:
Written report
Oral report

For details of these two categories of reports, see Section 12.2.5.


Activity 2
Collect a sample of technical appendices from any printed research report.

Self-Assessment Questions
3. State whether true or false.
(a) Research reports are not designed to convey and record the
information that will be of practical use to the reader.
(b) The index of the technical report must be provided at the end of the
report.
4. Fill in the blanks with the appropriate terms.
(a) A _________________ report is formulated when there is a need to
draw the conclusions of the findings of the research report.
(b) If there are any _______________ in the report, then
recommendations are made for taking corrective action in order to
rectify the errors.

12.4 Summary
Let us recapitulate the important concepts discussed in this unit:
A report can be defined as a written document which presents information
in a specialized and concise manner.
There is a difference between report writing and other compositions
because a report is written in a short and conventional format. A report
should cover all mandatory matters but nothing extra should be written.
A report that consists of a collection of data or facts and is written in an
orderly way is called an informational report. The main purpose of this
type of report is to present the information in its original form without any
conclusion and recommendation.
An interpretive report contains a collection of data with its interpretation
or any recommendation explicitly specified by the writer.
Defining a clear logical structure will make the report easier to write and
to read.

A citation is the acknowledgement in your writing of the work of other
authors and includes paraphrasing and making direct quotes.
In simple terms, a research report is a written document which describes
the findings of an individual or group of individuals. It gives an account of
something seen, heard, done, etc. Research reports are helpful during
the research study, in the sense that they facilitate maintenance of vast
data in a logical way.
A written report plays a vital role in every business operation. Writing a
report is the best way to communicate, and often the only way to convey
one's ideas to others. Thus, it is necessary that the writing should be
effective.
Oral reports help in direct communication without any delay. It provides
the speaker with the opportunity to correct himself and make himself clear
on the spot.
In writing technical reports, emphasis is placed mainly on the
methods that have been used to collect the information and the data, the
assumptions that are made and, finally, the various presentation
techniques that are used to present the findings and the data.

12.5 Glossary
Report: A written document presenting information in a specialized and
concise manner.
Informational report: A report consisting of a collection of data or facts
written in an orderly manner.
Interpretive report: A report containing a collection of data with its
interpretation or any recommendation explicitly specified by the writer.
Research report: A written document describing the findings of some
individual or a group of individuals.

12.6 Terminal Questions


1. What are the different formats of written reports?
2. What are the points that you should keep in mind while writing a popular
report?


3. What care should be taken with the following elements of a report:


(i) Direct quotes
(ii) Citations
(iii) Referencing
4. State the mechanics of writing a report.
5. Differentiate between written and oral reports.
6. What are the different types of research reports?

12.7 Answers
Answers to Self-Assessment Questions
1. (a) Informational; (b) Recommendation
2. (a) True; (b) True
3. (a) False; (b) True
4. (a) Popular; (b) Deviations

Answers to Terminal Questions


1. Refer Section 12.2.1
2. Refer Section 12.2.2
3. Refer Section 12.2.2
4. Refer Section 12.2.3
5. Refer Section 12.2.5
6. Refer Section 12.3

12.8 Further Reading


1. Elhance, D.N. Fundamentals of Statistics. Allahabad: Kitab Mahal, 2007
2. Gupta, S.C. Fundamentals of Business Statistics. New Delhi: Sultan Chand
& Sons, 2010


Unit 13

Exercise-I

Example 1: How will you classify people according to gender using a nominal scale?
Solution:
In the example below, the number 1 is assigned to male and the number 2 is assigned
to female. We can just as easily assign the number 1 to female and 2 to male. The
purpose of the number is merely to name the characteristic or give it identity.

As we can see from the graphs, changing the number assigned to male and
female does not have any impact on the data - we still have the same number of men
and women in the data set.

Example 2:
What type of questions should be avoided in a questionnaire?

Solution:
The following type of questions should be avoided when preparing a questionnaire.
1. Embarrassing Questions: Embarrassing questions are questions that ask
respondents details about personal and private matters. Embarrassing questions
are mostly avoided because you would lose the trust of your respondents. Your
respondents might also feel uncomfortable to answer such questions and might
refuse to answer your questionnaire.
2. Positive/Negative Connotation Questions: Since most verbs, adjectives and
nouns in the English language have either positive or negative connotations,
questions are bound to take on a positive or negative connotation. While framing a
question, strong negative or positive overtones must be avoided. Depending on
the positive or negative connotation of your question, you will get different data.
Ideal questions should have neutral or subtle overtones.

3. Hypothetical Questions: Hypothetical questions are questions that are based
on speculation and fantasy. An example of a hypothetical question would be "If
you were the CEO of ABC organization, what would be the changes that you
would bring?" Questions of this type force the respondent to give his or her ideas
on a particular subject. However, these kinds of questions will not give you consistent
or clear data. Hypothetical questions are mostly avoided in questionnaires.

Example 3:
Find the mode of the following data set:
48, 45, 46, 35, 45, 46, 35, 57, 34, 46, 48, 48, 46, 67

Solution:
The mode is 46, which occurs 4 times.

Example 4:
Find the median of the following data set:
12, 18, 16, 21, 10, 13, 17, 19

Solution:
Arrange the data values in order from the lowest value to the highest value:
10, 12, 13, 16, 17, 18, 19, 21

The number of values in the data set is 8, which is even. So, the median is the
average of the two middle values.

Median = (4th data value + 5th data value) / 2
= (16 + 17) / 2
= 16.5

Example 5:
The marks of seven students in a mathematics test with a maximum possible mark of
20 are given below:
15, 13, 18, 16, 14, 17, 12

Find the mean of this set of data values.


Solution:
Mean = (15 + 13 + 18 + 16 + 14 + 17 + 12) / 7
= 105 / 7
= 15

So, the mean is 15.

Example 6:
Find the mean, median, mode and range for the following list of values:
13, 18, 13, 14, 13, 16, 14, 21, 13

Solution:
The mean is the usual average, so:
Mean = (13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) / 9 = 15
The median is the middle value, so arrange the data in ascending order as follows:
13, 13, 13, 13, 14, 14, 16, 18, 21
There are nine numbers in the list, so the middle one will be the (9 + 1) ÷ 2 = 10 ÷ 2 =
5th number:
13, 13, 13, 13, 14, 14, 16, 18, 21
So the median is 14.
The mode is the number that is repeated more often than any other, so the mode is 13.
The largest value in the list is 21 and the smallest is 13, so the range is 21 − 13 = 8.
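
The four measures in Example 6 can be cross-checked with Python's standard statistics module. This is only a minimal sketch; the variable names are illustrative and are not part of the original text.

```python
from statistics import mean, median, multimode

# Data from Example 6
values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

summary = {
    "mean": mean(values),                 # arithmetic average
    "median": median(values),             # middle value of the sorted list
    "modes": multimode(values),           # most frequent value(s), as a list
    "range": max(values) - min(values),   # largest minus smallest
}
print(summary)
```

multimode returns a list because a data set may have more than one mode; for this data it contains a single value.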

Example 7:
Calculate the standard deviation for the data given below:
4, 2, 5, 8, 6

Solution:
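Calculate the mean of the data:
Mean = (4 + 2 + 5 + 8 + 6) / 5 = 25 / 5 = 5
Now calculate the squared deviation from the mean, (x − 5)², for each value in the sample:
(4 − 5)² = 1, (2 − 5)² = 9, (5 − 5)² = 0, (8 − 5)² = 9, (6 − 5)² = 1
Sum of squared deviations = 1 + 9 + 0 + 9 + 1 = 20
Now the standard deviation is:
s = √[20 / (5 − 1)]
= √5
= 2.24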
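
The result 2.24 corresponds to the sample standard deviation, which divides by n − 1. As a hedged illustration, the sketch below computes it with Python's statistics module and, for comparison, also shows the population form that divides by n. The variable names are assumptions made for this sketch.

```python
from statistics import pstdev, stdev

# Data from Example 7
data = [4, 2, 5, 8, 6]

sample_sd = stdev(data)        # divides the sum of squared deviations by n - 1
population_sd = pstdev(data)   # divides the sum of squared deviations by n

print(round(sample_sd, 2), round(population_sd, 2))
```

The two forms differ noticeably for a sample this small, so it is worth stating which one is being reported.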

Example 8:
From the following data, construct the index number of prices for 1986 with 1980 as base,
using (i) Laspeyres' method, (ii) Paasche's method, (iii) Bowley-Drobisch method, (iv)
Marshall-Edgeworth method, (v) Fisher's ideal formula.

                       1980                              1986
Commodity   Price per unit   Expenditure      Price per unit   Expenditure
                             (in rupees)                       (in rupees)
A                 2               10                4               16
B                 3               12                6               18
C                 1                8                2               14
D                 4               20                8               32


Solution:
Since we are given the price and the total expenditure for the year 1980 and 1986, we
shall first calculate the quantities for the two years by dividing the expenditure by price,
and then we shall calculate the index numbers as follows:
Commodity   p0   q0   p1   q1   p0q0   p0q1   p1q0   p1q1
A            2    5    4    4    10      8     20     16
B            3    4    6    3    12      9     24     18
C            1    8    2    7     8      7     16     14
D            4    5    8    4    20     16     40     32
Total                            Σp0q0  Σp0q1  Σp1q0  Σp1q1
                                 = 50   = 40   = 100  = 80

(i) Laspeyres' price index or P01 = (Σp1q0 / Σp0q0) × 100
= (100 / 50) × 100 = 200

(ii) Paasche's price index or P01 = (Σp1q1 / Σp0q1) × 100
= (80 / 40) × 100 = 200

(iii) Bowley-Drobisch price index or P01 = [(Σp1q0 / Σp0q0) + (Σp1q1 / Σp0q1)] / 2 × 100
= [(100 / 50) + (80 / 40)] / 2 × 100 = 200

(iv) Marshall-Edgeworth price index or P01 = [(Σp1q0 + Σp1q1) / (Σp0q0 + Σp0q1)] × 100
= [(100 + 80) / (50 + 40)] × 100
= 200

(v) Fisher's ideal index of price or P01 = √[(Σp1q0 / Σp0q0) × (Σp1q1 / Σp0q1)] × 100
= √[(100 / 50) × (80 / 40)] × 100
= √(2 × 2) × 100
= 200
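
The five index numbers of Example 8 can be checked with a short script. This is a sketch under the assumption that the prices and quantities are those shown in the solution table; the helper name dot and the variable names are hypothetical.

```python
from math import sqrt

# Base-year (1980) and current-year (1986) prices and quantities, commodities A-D
p0 = [2, 3, 1, 4]
q0 = [5, 4, 8, 5]
p1 = [4, 6, 2, 8]
q1 = [4, 3, 7, 4]

def dot(prices, quantities):
    """Sum of price x quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))

p0q0, p0q1, p1q0, p1q1 = dot(p0, q0), dot(p0, q1), dot(p1, q0), dot(p1, q1)

laspeyres = p1q0 / p0q0 * 100                             # base-year quantities as weights
paasche = p1q1 / p0q1 * 100                               # current-year quantities as weights
bowley_drobisch = (laspeyres + paasche) / 2               # arithmetic mean of the two
marshall_edgeworth = (p1q0 + p1q1) / (p0q0 + p0q1) * 100  # pooled quantities as weights
fisher = sqrt(laspeyres * paasche)                        # geometric mean of the two

print(laspeyres, paasche, bowley_drobisch, marshall_edgeworth, fisher)
```

All five methods agree here because every price exactly doubles between the two years.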
Example 9:
Construct a pie chart in percentage for the given data of a publishing house (cost is
in ₹):
Promotion cost         10,000
Royalty cost           15,000
Binding cost           20,000
Paper cost             25,000
Transportation cost    10,000
Printing cost          20,000

Solution:
The following pie chart shows the percentage distribution of the expenditure incurred
in publishing a book as per the given data.


Example 10:
The ranks of 15 students in two subjects A and B are given below:
Student   Subject A   Subject B
1             1          10
2             2           7
3             3           2
4             4           6
5             5           4
6             6           8
7             7           3
8             8           1
9             9          11
10           10          15
11           11           9
12           12           5
13           13          14
14           14          12
15           15          13

Use Spearman's formula to find the rank Correlation Coefficient.

Solution:
Rank in A (R1)   Rank in B (R2)   D = (R1 − R2)   D²
      1               10              −9          81
      2                7              −5          25
      3                2               1           1
      4                6              −2           4
      5                4               1           1
      6                8              −2           4
      7                3               4          16
      8                1               7          49
      9               11              −2           4
     10               15              −5          25
     11                9               2           4
     12                5               7          49
     13               14              −1           1
     14               12               2           4
     15               13               2           4
   n = 15                           ΣD = 0     ΣD² = 272

Spearman's coefficient of correlation = 1 − 6ΣD² / [n(n² − 1)]
= 1 − (6 × 272) / [15(225 − 1)]
= 1 − 0.4857
= 0.5143
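
Spearman's formula is easy to verify programmatically. The sketch below is illustrative only (the variable names are assumptions) and applies the formula for untied ranks to the data of Example 10.

```python
# Ranks of the 15 students in subjects A and B (Example 10)
rank_a = list(range(1, 16))
rank_b = [10, 7, 2, 6, 4, 8, 3, 1, 11, 15, 9, 5, 14, 12, 13]

n = len(rank_a)
differences = [a - b for a, b in zip(rank_a, rank_b)]
sum_d_squared = sum(d ** 2 for d in differences)

# Spearman's rank correlation for untied ranks: 1 - 6*sum(D^2) / (n(n^2 - 1))
rho = 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

print(sum(differences), sum_d_squared, round(rho, 4))
```

The differences always sum to zero when both columns are complete rankings, which gives a quick check on the D column.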

Example 11:
Researchers at the European Centre for Road Safety Testing are trying to find out
how the age of cars affects their braking capability. They test a group of ten cars of
differing ages and find out the minimum stopping distances that the cars can achieve.
The results are set out in the table below:
Car   Age of car in months   Minimum stopping distance at 40 kph (metres)
A              9                          28.4
B             15                          29.3
C             24                          37.6
D             30                          36.2
E             38                          36.5
F             46                          35.3
G             53                          36.2
H             60                          44.1
I             64                          44.8
J             76                          47.2

Calculate the coefficient of correlation using the method of least squares.

Solution:
Let us develop the following table for calculating the value of r:

         X       Y        X²       Y²          XY
         9      28.4       81     806.56      255.6
        15      29.3      225     858.49      439.5
        24      37.6      576    1413.76      902.4
        30      36.2      900    1310.44     1086
        38      36.5     1444    1332.25     1387
        46      35.3     2116    1246.09     1623.8
        53      36.2     2809    1310.44     1918.6
        60      44.1     3600    1944.81     2646
        64      44.8     4096    2007.04     2867.2
        76      47.2     5776    2227.84     3587.2
Total  415     375.6    21623   14457.72    16713.3


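X̄ = 415 / 10 = 41.5, Ȳ = 375.6 / 10 = 37.56
Method of least squares:
b = (ΣXY − nX̄Ȳ) / (ΣX² − nX̄²)
= [16713.3 − 10(41.5)(37.56)] / [21623 − 10(41.5)²]
= (16713.3 − 15587.4) / (21623 − 17222.5)
= 1125.9 / 4400.5
= 0.2559

a = Ȳ − bX̄ = 37.56 − 0.2559(41.5) = 37.56 − 10.62 = 26.94

r² = (aΣY + bΣXY − nȲ²) / (ΣY² − nȲ²)
= [26.94(375.6) + 0.2559(16713.3) − 10(37.56)²] / [14457.72 − 10(37.56)²]
= (10118.664 + 4276.933 − 14107.536) / (14457.72 − 14107.536)
= 288.061 / 350.184
= 0.8226

r = √0.8226 = 0.91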
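
Because the hand calculation involves several rounded intermediate values, it can help to recompute the regression coefficients and r directly from the raw data. The following is a minimal sketch with illustrative variable names; it uses the equivalent product-moment form r = Sxy / √(Sxx × Syy) rather than the step-by-step least-squares form.

```python
from math import sqrt

# Data from Example 11: age of car (months) and minimum stopping distance (metres)
age = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]
distance = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(distance) / n

# Corrected sums of products and squares
s_xy = sum(x * y for x, y in zip(age, distance)) - n * mean_x * mean_y
s_xx = sum(x ** 2 for x in age) - n * mean_x ** 2
s_yy = sum(y ** 2 for y in distance) - n * mean_y ** 2

b = s_xy / s_xx                # slope of the regression of Y on X
a = mean_y - b * mean_x        # intercept
r = s_xy / sqrt(s_xx * s_yy)   # coefficient of correlation

print(round(mean_y, 2), round(b, 4), round(a, 2), round(r, 3))
```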


Unit 14

Exercise-II

Example 1: Find the most likely production corresponding to a rainfall of 40
from the following data:

                      Rainfall   Production
Average                  30        500 kg
Standard Deviation        5        100 kg

Coefficient of correlation = 0.8

Solution: Let Y stand for production and X for rainfall.
Now, the regression line of Y on X is given by
(Y − Ȳ) = r (σy / σx) (X − X̄)
(Y − 500) = 0.8 × (100 / 5) × (X − 30)
or Y = 20 + 16X
For X = 40, Y = 16(40) + 20 = 660 kg
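
The prediction can be reproduced from the summary statistics alone. The function below is a hypothetical helper written for this illustration, not something defined in the text.

```python
def predict_y(x, mean_x, mean_y, sd_x, sd_y, r):
    """Predict Y from the regression line of Y on X:
    Y - mean_y = r * (sd_y / sd_x) * (X - mean_x)."""
    slope = r * sd_y / sd_x
    return mean_y + slope * (x - mean_x)

# Rainfall (X): mean 30, SD 5; production (Y): mean 500 kg, SD 100 kg; r = 0.8
production = predict_y(40, mean_x=30, mean_y=500, sd_x=5, sd_y=100, r=0.8)
print(round(production, 2))
```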

Example 2: The tangent of the angle between the lines of regression of y on
x and x on y is 0.6 and σx = (1/2)σy. Find rxy.

Solution:
tan θ = [(1 − r²) / r] × [σxσy / (σx² + σy²)] = 0.6
Substituting σx = (1/2)σy:
tan θ = 0.6 = [(1 − r²) / r] × [(1/2)σy × σy / ((1/4)σy² + σy²)]
6/10 = [(1 − r²) / r] × [(1/2) / (1 + 1/4)]
or 2r² + 3r − 2 = 0
(r + 2)(2r − 1) = 0
r = −2 or 1/2
But r² must be ≤ 1, so r ≠ −2, and therefore the required value of r or rxy
is 1/2, i.e., 0.5.

Example 3: The following table shows the number of public sector industries
failures in India during the period 1987 to 1993. Using a four-year moving
average method, calculate the mean square error (MSE) for this data.
Year   Number of Failures
1987          32
1988          26
1989          30
1990          28
1991          24
1992          22
1993          26

Solution:
The 4-year moving averages are calculated as follows:
(1) 1987 to 1990: Moving average = (32 + 26 + 30 + 28) / 4 = 29
(2) 1988 to 1991: Moving average = (26 + 30 + 28 + 24) / 4 = 27
(3) 1989 to 1992: Moving average = (30 + 28 + 24 + 22) / 4 = 26
(4) 1990 to 1993: Moving average = (28 + 24 + 22 + 26) / 4 = 25
To calculate the value of MSE, the following table is constructed. Each moving
average is entered against the last year of its four-year period, and the error is
the time series value minus the moving average.

Year   Time Series Value (Y)   Moving Average   Error   Error Squared
1987           32
1988           26
1989           30
1990           28                    29           −1          1
1991           24                    27           −3          9
1992           22                    26           −4         16
1993           26                    25            1          1

Then,
MSE = (1 + 9 + 16 + 1) / 4 = 27 / 4 = 6.75
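
The moving averages and the MSE can be reproduced with a few lines of Python. This sketch follows the convention of the worked solution, in which each four-year average is compared with the last value of its window; the variable names are illustrative.

```python
# Number of failures, 1987 to 1993 (Example 3)
failures = [32, 26, 30, 28, 24, 22, 26]
window = 4

# One moving average per complete four-year window
moving_averages = [
    sum(failures[i:i + window]) / window
    for i in range(len(failures) - window + 1)
]

# Error = actual value in the last year of the window minus the moving average
errors = [
    failures[i + window - 1] - average
    for i, average in enumerate(moving_averages)
]

mse = sum(error ** 2 for error in errors) / len(errors)
print(moving_averages, errors, mse)
```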

Example 4: The Dean of the School of Business at Atlantic University, which


operates on a trimester system, has compiled the following quarterly new
enrolment of MBA students for the last 3 years from 1992 to 1994 and the
results are shown as follows:
Year   Fall   Winter   Spring   Summer
1992   200     180      185       95
1993   220     188      173       83
1994   220     176      161       87

By using the ratio to moving average method, calculate the seasonal


index for each trimester.
Solution: In order to calculate the seasonal indices for fall, winter, spring
and summer academic sessions, we need to find quarter moving averages,
quarter centred moving averages and percentages of actual to centred moving
averages as explained previously.
We construct the following table:
Each moving total and moving average falls between two quarters; in the table it is
entered against the later of the two.

Year   Quarter   Value   4-Quarter   4-Quarter   Centred    Percentage of
                         Moving      Moving      Moving     Actual to Centred
                         Total       Average     Average    Moving Average
(1)    (2)       (3)     (4)         (5)         (6)        (7)
1992   I         200
       II        180
       III       185     660         165         167.5      110.45
       IV         95     680         170         171.0       55.55
1993   I         220     688         172         170.5      129.03
       II        188     676         169         167.5      112.24
       III       173     664         166         166.0      104.22
       IV         83     664         166         164.5       50.46
1994   I         220     652         163         161.5      136.22
       II        176     640         160         160.5      109.66
       III       161     644         161
       IV         87

Now, we calculate the modified mean for each quarter. This can be
done by the following steps.
The first step is to make a table of values already calculated and placed
in column (7) of this table. These are the percentage of actual to moving
average values for the various quarters of the three years. These are shown
in the following table:
Year   Fall     Winter   Spring   Summer
1992     -        -      110.45   55.55
1993   129.03   112.24   104.22   50.46
1994   136.22   109.66     -        -

The second step is to take the average of these values for each quarter.
The modified mean for each quarter data is shown as follows:

Fall   = (129.03 + 136.22) / 2 = 265.25 / 2 = 132.625
Winter = (112.24 + 109.66) / 2 = 221.90 / 2 = 110.950
Spring = (110.45 + 104.22) / 2 = 214.67 / 2 = 107.335
Summer = (55.55 + 50.46) / 2 = 106.01 / 2 = 53.005
Total = 403.915

These modified means are preliminary seasonal indices. These should
average 100 or a total of 400 for these 4 quarters. However, our total is 403.915.
Accordingly, we calculate the adjustment factor as follows:
Adjustment Factor = 400 / 403.915 = 0.9903

We get the seasonal index for each quarter by multiplying the modified
mean for each quarter by the adjustment factor. Then, the seasonal index for
each quarter is shown as follows:
Fall:     132.625 × 0.9903 = 131.34
Winter:   110.950 × 0.9903 = 109.87
Spring:   107.335 × 0.9903 = 106.29
Summer:    53.005 × 0.9903 =  52.50
Total                      = 400.00
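
The whole ratio-to-moving-average procedure can be sketched in Python as follows. The variable names are assumptions made for this illustration. The script keeps unrounded intermediate values, so its results may differ from the hand-worked figures in the second decimal place.

```python
# Enrolment by quarter, Fall 1992 to Summer 1994 (Example 4)
values = [200, 180, 185, 95, 220, 188, 173, 83, 220, 176, 161, 87]
seasons = ["Fall", "Winter", "Spring", "Summer"]

# Four-quarter moving averages (nine complete windows)
moving_avg = [sum(values[i:i + 4]) / 4 for i in range(len(values) - 3)]

# Centred moving averages: mean of each adjacent pair of moving averages
centred = [(moving_avg[i] + moving_avg[i + 1]) / 2 for i in range(len(moving_avg) - 1)]

# Percentage of actual to centred moving average, grouped by season.
# The first centred average lines up with the third quarter (index 2).
ratios = {name: [] for name in seasons}
for i, cma in enumerate(centred):
    quarter = i + 2
    ratios[seasons[quarter % 4]].append(100 * values[quarter] / cma)

# Preliminary indices, then rescale so that the four indices total 400
means = {name: sum(r) / len(r) for name, r in ratios.items()}
adjustment = 400 / sum(means.values())
seasonal_index = {name: m * adjustment for name, m in means.items()}

for name in seasons:
    print(name, round(seasonal_index[name], 2))
```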

Example 5: In the previous problem which gives us the data about new
admissions into the MBA programme of the university for each trimester,
separate the seasonal and irregular influences on the time series and calculate
the irregular (I) component as well as the seasonally-adjusted values for
each quarter.
Solution:
We have already calculated the various values that are needed. We know
that:
Time Series Values = T × S × C × I
Centred Moving Average = T × C
Hence,
S × I = (T × S × C × I) / (T × C)

Let us restate the needed values in the following table.


Year   Quarter   T × S × C × I   T × C    S × I
1992   I             200
       II            180
       III           185         167.5    1.105
       IV             95         171.0    0.556
1993   I             220         170.5    1.290
       II            188         167.5    1.122
       III           173         166.0    1.042
       IV             83         164.5    0.505
1994   I             220         161.5    1.362
       II            176         160.5    1.097
       III           161
       IV             87


The seasonal indices for each quarter have already been calculated as:
Fall: 131.34
Winter: 109.87
Spring: 106.29
Summer: 52.50
Then the seasonal influence (S) is given by:
Fall: 131.34/100 = 1.3134
Winter: 109.87/100 = 1.0987
Spring: 106.29/100 = 1.0629
Summer: 52.50/100 = 0.5250
Now, we make another table with the (S × I) values as calculated in the
previous table and the (S) values for each quarter of fall, winter, spring and
summer. In this way, we can get the values of (I) by dividing the (S × I) values
by the (S) values. These are shown in the following table:
Year   Quarter   S × I    (S)      (I)
1992   I
       II
       III       1.105   1.0629   1.040
       IV        0.556   0.5250   1.059
1993   I         1.290   1.3134   0.982
       II        1.122   1.0987   1.021
       III       1.042   1.0629   0.980
       IV        0.505   0.5250   0.962
1994   I         1.362   1.3134   1.037
       II        1.097   1.0987   0.998
       III
       IV

Now, we can find the seasonally-adjusted values by dividing the original


time series values by their corresponding seasonal indices. This is shown as
follows:


Year   Quarter   Time Series Values   (S)      Seasonally-adjusted
                 T × S × C × I                 Values
1992   I              200
       II             180
       III            185             1.0629        174.05
       IV              95             0.5250        180.95
1993   I              220             1.3134        167.50
       II             188             1.0987        171.11
       III            173             1.0629        162.76
       IV              83             0.5250        158.09
1994   I              220             1.3134        167.50
       II             176             1.0987        160.19
       III            161
       IV              87

Example 6: The life time of electric bulbs for a random sample of 10
from a large consignment gave the following data:
Item                    1     2     3     4     5     6     7     8     9     10
Life (in '000 hours)   4.2   4.6   3.9   4.1   5.2   3.8   3.9   4.3   4.4   5.6

Can we accept the hypothesis that the average life time of bulbs is 4000
hours?
Solution:
Let us take the null hypothesis that there is no significant difference between
the sample mean and the hypothetical population mean.
Applying the t-test (as the sample is small in size, because 10 < 30),
t = [(x̄ − μ) / S] × √n

Calculation of x̄ and S

x       (x − x̄) = (x − 4.4)   (x − x̄)²
4.2           −0.2               0.04
4.6           +0.2               0.04
3.9           −0.5               0.25
4.1           −0.3               0.09
5.2           +0.8               0.64
3.8           −0.6               0.36
3.9           −0.5               0.25
4.3           −0.1               0.01
4.4             0                0
5.6           +1.2               1.44
Σx = 44                      Σ(x − x̄)² = 3.12

x̄ = Σx / N = 44 / 10 = 4.4
S = √[Σ(x − x̄)² / (n − 1)] = √(3.12 / 9) = 0.589
t = [(4.4 − 4) / 0.589] × √10 = 2.148

No. of degrees of freedom, ν = (n − 1) = (10 − 1) = 9.
For ν = 9, t0.05 = 2.262.

The calculated value of t is less than the table value. Hence the hypothesis
is accepted.
The average life time of the bulbs could be 4000 hours.
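
As a hedged cross-check, the one-sample t statistic for the bulb data can be computed with Python's standard library. The critical value 2.262 is taken from the text, not computed; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Life of the ten bulbs in thousands of hours (Example 6)
lives = [4.2, 4.6, 3.9, 4.1, 5.2, 3.8, 3.9, 4.3, 4.4, 5.6]
hypothesised_mean = 4.0   # 4000 hours, in the same units as the data

n = len(lives)
sample_mean = mean(lives)
sample_sd = stdev(lives)  # uses n - 1 in the denominator

t_statistic = (sample_mean - hypothesised_mean) / (sample_sd / sqrt(n))
critical_value = 2.262    # table value of t at the 5% level for 9 degrees of freedom

print(round(t_statistic, 3), abs(t_statistic) < critical_value)
```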
Example 7: A Personnel Manager is interested in trying to determine whether
absenteeism is greater on one day of the week than on another. His records
for the past year show this sample distribution:
Day of the week     Monday   Tuesday   Wednesday   Thursday   Friday
No. of Absentees      66       57         54          48        75

Test whether the absence is uniformly distributed over the week.


Solution:
Let us take the (null) hypothesis that absenteeism is uniformly distributed over
the week.


On the basis of this hypothesis, we should expect (66 + 57 + 54 + 48 + 75)/5
= 300/5 = 60 absentees on each day of the week.

fo    fe    (fo − fe)²/fe
66    60        0.60
57    60        0.15
54    60        0.60
48    60        2.40
75    60        3.75
            Σ = 7.50

χ² = Σ[(fo − fe)² / fe] = 7.5
ν = (n − 1) = (5 − 1) = 4
For ν = 4, χ²0.05 = 9.49
The calculated value of χ² is less than the table value. Hence, the (null)
hypothesis is accepted.
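
The goodness-of-fit statistic is simple enough to compute directly. The sketch below is illustrative; the critical value 9.49 is the table value quoted in the text, and the variable names are assumptions.

```python
# Observed absentees, Monday to Friday (Example 7)
observed = [66, 57, 54, 48, 75]

# Under the null hypothesis of a uniform distribution, every day has the same expectation
expected = sum(observed) / len(observed)

chi_square = sum((o - expected) ** 2 / expected for o in observed)
degrees_of_freedom = len(observed) - 1
critical_value = 9.49   # table value of chi-square at the 5% level for 4 degrees of freedom

print(round(chi_square, 2), degrees_of_freedom, chi_square < critical_value)
```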

Example 8: An automobile company gives you the following information
about age groups and the liking for a particular model of car which it plans to
introduce.
                                    Age groups
Persons                  Below 20   20-39   40-59    60    Total
Who liked the car           140       80      40     20     280
Who disliked the car         60       50      30     80     220
Total                       200      130      70    100     500

On the basis of this data, can it be concluded that the model appeal is
independent of the age groups? (Given ν = 3, χ²0.05 = 7.815)
Solution:
Let the null hypothesis be that the model appeal is independent of the age
group. Applying the χ² test:

fe11 = (Row total × Column total) / Grand total = (280 × 200) / 500 = 112
fe12 = (280 × 130) / 500 = 72.8
fe13 = (280 × 70) / 500 = 39.2 and so on.

Thus, the table of expected frequencies will be obtained as follows:

112    72.8   39.2    56    280
 88    57.2   30.8    44    220
200   130     70     100    500

fo      fe      (fo − fe)²   (fo − fe)²/fe
140    112.0      784.00        7.000
 60     88.0      784.00        8.910
 80     72.8       51.84        0.712
 50     57.2       51.84        0.906
 40     39.2        0.64        0.016
 30     30.8        0.64        0.021
 20     56.0     1296.00       23.143
 80     44.0     1296.00       29.454
                 Σ[(fo − fe)²/fe] = 70.162

Thus, χ² = 70.162
ν = (r − 1)(c − 1) = (2 − 1)(4 − 1) = 3
For ν = 3, χ²0.05 = 7.815


The calculated value is much greater than the table value. Hence, the null
hypothesis is rejected. We therefore, conclude that the model appeal is not
independent of the age groups.
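
The expected frequencies and the χ² statistic for a contingency table can be computed in a few lines. This is a minimal sketch with illustrative variable names; the critical value 7.815 is the one given in the question.

```python
# Observed frequencies (Example 8): rows = liked / disliked, columns = four age groups
observed = [
    [140, 80, 40, 20],
    [60, 50, 30, 80],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency of each cell = row total x column total / grand total
expected = [
    [row_total * column_total / grand_total for column_total in column_totals]
    for row_total in row_totals
]

chi_square = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 7.815   # table value of chi-square at the 5% level for 3 degrees of freedom

print(round(chi_square, 2), degrees_of_freedom, chi_square > critical_value)
```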


Example 9: A random sample of size 16 has 53 as mean. The sum of the
squares of the deviations from mean is 135. Can this sample be regarded
as taken from the population having 56 as mean? Obtain 95% and 99%
confidence limits of the mean of the population. (for ν = 15, t0.05 = 2.13 and
for ν = 15, t0.01 = 2.95).
Solution:
Let us take the (null) hypothesis that there is no significant difference between
the sample mean and the hypothetical population mean. Applying t-test:
t = [|x̄ − μ| / S] × √n
Given: x̄ = 53; μ = 56; n = 16; Σ(x − x̄)² = 135
S = √[Σ(x − x̄)² / (n − 1)] = √(135 / 15) = 3
t = (|53 − 56| / 3) × √16 = 4
ν = (n − 1) = (16 − 1) = 15
And for ν = 15, t0.05 = 2.13 (table value)
The calculated value of t is more than the table value. Hence, the null
hypothesis gets rejected. Thus, we can say that the sample has not come
from a population having 56 as mean.

95% confidence limits of the population mean: x̄ ± (S / √n) t0.05
= 53 ± (3 / √16)(2.13)
= 51.4 to 54.6

99% confidence limits of the population mean: x̄ ± (S / √n) t0.01
= 53 ± (3 / √16)(2.95)
= 50.788 to 55.212

Example 10: A random sample of 27 pairs of observations from a normal


population gives a correlation coefficient of 0.42. Is it likely that the variables
in the population are uncorrelated?
Solution:
Let the null hypothesis be there is no significant difference between the sample
correlation and correlation in the population. Applying t-test:

t = r × √[(n − 2) / (1 − r²)]
= 0.42 × √[(27 − 2) / (1 − 0.42²)] = 2.31

No. of degrees of freedom, ν = n − 2 = 27 − 2 = 25.
For ν = 25, t0.05 = 1.708. The calculated value of t is more than the table
value and hence the null hypothesis is rejected. Thus, it can be said that it is
unlikely that the variables in the population are uncorrelated.
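
The test statistic for a correlation coefficient depends only on r and n, so it is easy to verify. The sketch below uses illustrative variable names, and the critical value 1.708 is the table value quoted in the text.

```python
from math import sqrt

# Sample correlation and number of pairs (Example 10)
r = 0.42
n = 27

# t statistic for testing whether the population correlation is zero
t_statistic = r * sqrt((n - 2) / (1 - r ** 2))
degrees_of_freedom = n - 2
critical_value = 1.708   # table value quoted in the text for 25 degrees of freedom

print(round(t_statistic, 2), degrees_of_freedom, t_statistic > critical_value)
```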
Example 11: The mean of a sample of 100 units is 3 and its standard
deviation is 2. Find the standard error and estimate the sample error at (i)
5% level of significance (ii) 99.73% level of probability and (iii) 95.45% level
of confidence.
Solution:
Standard Error of mean, σx̄ = s / √n = 2 / √100 = 2 / 10 = 0.2

(i) Sample error at 5% level of significance = 0.2 × 1.96 = 0.392
(ii) Sample error at 99.73% level of probability = 0.2 × 3 = 0.6
(iii) Sample error at 95.45% level of confidence = 0.2 × 2 = 0.4.
Example 12: Set up an ANOVA table for the following per acre production
data for three kinds or varieties of wheat, each grown on 4 plots and state
if the variety differences are significant.

Plot of land   Per acre production data (variety of wheat)
                  A     B     C
1                 6     5     5
2                 7     5     4
3                 3     3     3
4                 8     7     4

Solution:
We can solve the problem either by the direct method or by the short-cut method;
in each case we get the same result. Both methods are worked out below.
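Direct Method:
First we calculate the mean of each of these samples.

$$\bar{x}_1 = \frac{6 + 7 + 3 + 8}{4} = 6$$

$$\bar{x}_2 = \frac{5 + 5 + 3 + 7}{4} = 5$$

$$\bar{x}_3 = \frac{5 + 4 + 3 + 4}{4} = 4$$

Mean of the sample means, or

$$\bar{\bar{x}} = \frac{\bar{x}_1 + \bar{x}_2 + \bar{x}_3}{3} = \frac{6 + 5 + 4}{3} = 5$$

Now, we work out SS between and SS within samples:

$$\text{SS between} = n_1(\bar{x}_1 - \bar{\bar{x}})^2 + n_2(\bar{x}_2 - \bar{\bar{x}})^2 + n_3(\bar{x}_3 - \bar{\bar{x}})^2$$
$$= 4(6 - 5)^2 + 4(5 - 5)^2 + 4(4 - 5)^2$$
$$= (4 + 0 + 4)$$
$$= 8$$

$$\text{SS within} = \sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2i} - \bar{x}_2)^2 + \sum (x_{3i} - \bar{x}_3)^2, \quad i = 1, 2, 3, 4$$
$$= \{(6 - 6)^2 + (7 - 6)^2 + (3 - 6)^2 + (8 - 6)^2\}$$
$$+ \{(5 - 5)^2 + (5 - 5)^2 + (3 - 5)^2 + (7 - 5)^2\}$$
$$+ \{(5 - 4)^2 + (4 - 4)^2 + (3 - 4)^2 + (4 - 4)^2\}$$
$$= 24$$

$$\text{SS for total variance} = \sum (x_{ij} - \bar{\bar{x}})^2, \quad i = 1, 2, 3, \ldots \text{ and } j = 1, 2, 3, \ldots$$
$$= \{(6 - 5)^2 + (7 - 5)^2 + (3 - 5)^2 + (8 - 5)^2\}$$
$$+ \{(5 - 5)^2 + (5 - 5)^2 + (3 - 5)^2 + (7 - 5)^2\}$$
$$+ \{(5 - 5)^2 + (4 - 5)^2 + (3 - 5)^2 + (4 - 5)^2\}$$
$$= 32$$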

Alternatively, (SS for total variance) can also be worked out thus:
SS for total = (SS between + SS within)
= (8 + 24)
= 32
We can now set up the ANOVA table for this problem:
Source of         SS    df             MS               F-ratio             5% F-limit
variation                                                                   (from the F-table)

Between sample     8    (3 - 1) = 2    8/2 = 4.00       4.00/2.67 = 1.5     F (2, 9) = 4.26

Within sample     24    (12 - 3) = 9   24/9 = 2.67

Total             32    (12 - 1) = 11

The above table shows that the calculated value of F is 1.5, which is less
than the table value of 4.26 at the 5% level with d.f. being $\nu_1 = 2$ and
$\nu_2 = 9$, and hence could have arisen due to chance. This analysis supports
the null hypothesis of no difference in sample means. We may, therefore, conclude
that the difference in wheat output due to varieties is insignificant and is just
a matter of chance.
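The direct method can be reproduced with a short script. This is a minimal sketch using only the data of this example; the variety labels "B" and "C" and all variable names are assumptions made for illustration, and the F-limit 4.26 is the table value quoted in the text.

```python
# Per acre production for each variety (one value per plot).
# The labels "B" and "C" are assumed names for the second and third variety.
samples = {
    "A": [6, 7, 3, 8],
    "B": [5, 5, 3, 7],
    "C": [5, 4, 3, 4],
}

k = len(samples)                                   # number of samples
n_total = sum(len(v) for v in samples.values())    # total number of items
means = {name: sum(v) / len(v) for name, v in samples.items()}

# With equal sample sizes, the overall mean equals the mean of the sample means.
grand_mean = sum(sum(v) for v in samples.values()) / n_total

ss_between = sum(len(v) * (means[name] - grand_mean) ** 2
                 for name, v in samples.items())
ss_within = sum((x - means[name]) ** 2
                for name, v in samples.items() for x in v)
ss_total = sum((x - grand_mean) ** 2
               for v in samples.values() for x in v)

df_between = k - 1
df_within = n_total - k
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

f_table = 4.26   # 5% F-limit for (2, 9) d.f., as quoted in the text
significant = f_ratio > f_table

print(f"SS between = {ss_between}, SS within = {ss_within}, SS total = {ss_total}")
print(f"F = {f_ratio:.2f}, significant at 5%: {significant}")
```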
Aliter (Short-cut Method):
In this case, we first take the total of all the individual values of the n items
and call it T.
T in the given case = 60
and
n = 12
Hence, the correction factor

$$= \frac{T^2}{n} = \frac{60^2}{12} = 300$$
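Now SS total, SS between and SS within can be worked out as under:

$$\text{SS total} = \sum x_{ij}^2 - \frac{T^2}{n}, \quad \text{where } i = 1, 2, 3, \ldots \text{ and } j = 1, 2, 3, \ldots$$
$$= (6^2 + 7^2 + 3^2 + 8^2 + 5^2 + 5^2 + 3^2 + 7^2 + 5^2 + 4^2 + 3^2 + 4^2) - \frac{60^2}{12} = 32$$

$$\text{SS between} = \sum \frac{T_j^2}{n_j} - \frac{T^2}{n}$$
$$= \left(\frac{24^2}{4} + \frac{20^2}{4} + \frac{16^2}{4}\right) - \frac{60^2}{12} = 8$$

$$\text{SS within} = \sum x_{ij}^2 - \sum \frac{T_j^2}{n_j}$$
$$= (332 - 308) = 24$$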


It may be noted that we get exactly the same result as we had obtained
in the case of the direct method. From here onwards, we can set up the ANOVA
table and interpret the F-ratio in the same manner as we have already done under
the direct method.
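The short-cut method can also be sketched in code, working from totals and squares instead of deviations. The variable names and the variety labels "B" and "C" are illustrative assumptions; the data are those of Example 12.

```python
# Per acre production for each variety; labels "B" and "C" are assumed names.
samples = {
    "A": [6, 7, 3, 8],
    "B": [5, 5, 3, 7],
    "C": [5, 4, 3, 4],
}

# Total of all items (T) and number of items (n).
T = sum(sum(v) for v in samples.values())
n = sum(len(v) for v in samples.values())

# Correction factor T^2 / n.
correction_factor = T ** 2 / n

# Sum of squares of all individual items.
sum_of_squares = sum(x ** 2 for v in samples.values() for x in v)

# Sum over samples of (sample total)^2 / (sample size).
sum_tj_sq_over_nj = sum(sum(v) ** 2 / len(v) for v in samples.values())

ss_total = sum_of_squares - correction_factor
ss_between = sum_tj_sq_over_nj - correction_factor
ss_within = sum_of_squares - sum_tj_sq_over_nj

print(f"T = {T}, n = {n}, correction factor = {correction_factor}")
print(f"SS total = {ss_total}, SS between = {ss_between}, SS within = {ss_within}")
```

Both routes give the same sums of squares; the short-cut form avoids computing each deviation from a mean, which is convenient when the means are not whole numbers.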
