Business Statistics
B1463
BOARD OF STUDIES
Chairman
HOD Arts and Humanities
SMU DDE
Additional Registrar
SMU DDE
Controller of Examination
SMU DDE
Authors:
J.S. Chandan: Units(1.3-1.10, 2.1-2.4, 3.3, units-4, 5, 8, 11) Copyright J.S. Chandan, 2011
G.S. Monga: (Unit-9) Copyright G.S Monga, 2011
Vijay Gupta: Unit(3.4-3.11) Copyright Vijay Gupta, 2011
C.R. Kothari: Units(unit-6, 7, 10) Copyright C.R. Kothari, 2011
Vikas Publishing House: Units(1.1-1.2, 2.5-2.10, 3.1-3.2, units-12-14) Copyright Reserved, 2011
This book is a distance education module comprising a collection of learning materials for our students.
All rights reserved. No part of this work may be reproduced in any form by any means without permission
in writing from Sikkim Manipal University, Gangtok, Sikkim. Printed and Published on behalf of Sikkim
Manipal University, Gangtok, Sikkim by Mr Rajkumar Mascreen, GM, Manipal Universal Learning Pvt
Ltd. Manipal - 576 104. Printed at Manipal Press Limited, Manipal.
Information contained in this book has been published by VIKAS Publishing House Pvt. Ltd. and has
been obtained by its Authors from sources believed to be reliable and is correct to the best of their
knowledge. However, the Publisher and its Authors shall in no event be liable for any errors, omissions
or damages arising out of use of this information and specifically disclaim any implied warranties of
merchantability or fitness for any particular use.
Contents
Unit 1   Information and Data Sources
Unit 2   Data Collection Methods
Unit 3   Data Analysis Techniques
Unit 4   Index Numbers
Unit 5   Data Representation
Unit 6   Correlation
Unit 7   Regression
Unit 8   Time Series
Unit 9   Testing of Hypothesis
Unit 10  Chi-Square Test
Unit 11  t-Test, z-Test and Analysis of Variance
Unit 12  Research Report Writing
Unit 13  Exercise I
Unit 14  Exercise II
SUBJECT INTRODUCTION
Business Statistics
Statistics is considered a mathematical science pertaining to the collection,
analysis, interpretation or explanation and presentation of data. The subject is
primarily concerned with supporting decisions in areas such as markets and
employment, for example stock market trends, unemployment rates in various
sectors of industry, demographic shifts, interest rates and inflation rates
over the years. Statistics is also considered a science that deals with
numbers or figures describing the state of affairs of the situations with which
we are concerned.
This book, Business Statistics, comprises fourteen units.
Unit 1- Information and Data Sources: Explains the need for information in
decision making. It defines a problem and discusses how information is
evaluated and processed. It also defines the various types of data.
Unit 2- Data Collection Methods: Discusses different methods of data
collection, such as observation, questionnaire, interviews and experiments. It
also lists the merits and demerits of data collection methods.
Unit 3- Data Analysis Techniques: Explains the various techniques of analysing
data, including percentage, ratio, average, mean, mode, median, quartiles, range
and standard deviation.
Unit 4- Index Numbers: Defines and classifies index numbers. It also explains
the methods of construction of different types of index numbers.
Unit 5- Data Representation: Lists the various tools of data representation,
including tables, graphs and diagrams, and discusses their features.
Unit 6- Correlation: Defines correlation analysis. It also discusses the concepts
of coefficient of determination, coefficient of correlation, Karl Pearson's coefficient
and Spearman's rank correlation.
Unit 7- Regression: Defines the term regression and lists the assumptions in
regression analysis. It also describes the simple regression model, scatter
diagram method and least squares method.
Unit 8- Time Series: Lists the components of time series. It also describes the
various methods of measuring trends and seasonal variations.
Unit 1
Structure
1.1 Introduction
Objectives
1.2
1.3
1.4
1.5
1.6 Summary
1.7 Glossary
1.8
1.9 Answers
1.10
1.1 Introduction
Information is processed from raw data. It is verified to be accurate, specific
and organized for a special purpose. The value of information lies solely in its
ability to affect a behaviour, decision or outcome.
In this unit, you will learn about information, decision-making, data and its
various types. The information should be context specific and available when it
is required, i.e., timely. Data is the numerical result of measurements. The
arrangement of the collected data defines its type. Data can be the basis of
graphs, images, or observations of a set of variables. Raw or unprocessed data
refers to a collection of numbers, characters, images or other outputs from
devices that collect information to convert physical quantities into symbols.
Statistics is the science of the collection, organization, and interpretation of
data. It deals with all aspects of this, including the planning of data collection in
terms of the design of surveys and experiments. You will also learn about
variables and random variables. A variable is any characteristic which can assume
different values.
In probability and statistics, a random or stochastic variable refers to a
variable whose value results from a measurement on some type of random
process. In formal terms, it refers to a function from a probability space, typically
to the real numbers, which is measurable. Intuitively, a random variable is a
Objectives
After studying this unit, you should be able to:
Explain why information is needed in decision-making
Define a problem, evaluate and process information, and take a decision
Explain the meaning and scope of data and list the types of variables
Define variable and its types
Differentiate between primary and secondary data
Explain the procedures of conducting research, including the methods of
collecting primary and secondary data
[Figure: generic fishbone diagram in which numbered causes and their sub-causes branch into a major effect]
Example: The fishbone diagram portrays various causes for an effect or problem
and is often used in brainstorming sessions.
The given diagram was drawn by a manufacturing team in order to
understand the source of periodic iron contamination. Six generic terms were
used to prompt ideas while the branches portray the causes of the problem.
[Figure: Fishbone Diagram. Effect: iron in product. Main branches: Materials, Measurement, Methods, Environment, Manpower and Machines]
The figure shows that the term 'machines' contains the idea 'materials of
construction', which shows four kinds of equipment having specific machine
numbers. However, it must be noted that some ideas appear twice. 'Calibration'
appears under 'methods' as a factor in the analytical procedure and under
'measurement' as a cause of lab error.
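The structure just described can be sketched as a nested mapping. The following is a minimal illustration in Python, assuming only the three branches named in the text; the rest of the figure is omitted, and the names fishbone and repeated_ideas are hypothetical.

```python
from collections import defaultdict

# Partial fishbone for the effect "Iron in product".
# Only the branches named in the text are included; the rest of the
# figure is omitted, and all names here are illustrative.
fishbone = {
    "Machines": {"Materials of construction": []},
    "Methods": {"Analytical procedure": ["Calibration"]},
    "Measurement": {"Lab error": ["Calibration"]},
}


def repeated_ideas(diagram):
    """Return ideas that appear under more than one main branch."""
    branches_by_idea = defaultdict(set)
    for branch, causes in diagram.items():
        for cause, sub_causes in causes.items():
            branches_by_idea[cause].add(branch)
            for sub_cause in sub_causes:
                branches_by_idea[sub_cause].add(branch)
    return {
        idea: sorted(branches)
        for idea, branches in branches_by_idea.items()
        if len(branches) > 1
    }


print(repeated_ideas(fishbone))
```

Listing the repeated ideas in this way may help a team notice when one underlying cause, such as calibration, has been raised from two different directions.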
[Figure: information flow diagram. Labels: Books, Articles and Documents; Raw Data; Information Collected; Usable Information; Interim Information Products; Additional Information Needed; Value-adding to Information; Information Required for Decision-making]
Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Accurate and timely _____________ is considered as one of the
most powerful resources.
(b) When information is specifically arranged according to the
requirement or problem, it is termed as ______________.
2. State whether true or false.
(a) A problem understood properly is more than half its solution.
(b) If a problem is solved, then the decision taken is reviewed and reanalysed.
are considered to be random drawings so that each number has exactly the
same chance of being picked. Similarly, the value of the outcome of a toss of
a fair coin is random, since a head or a tail has the same chance of occurring.
A random variable may be qualitative or quantitative in nature.
Qualitative random variables yield categorical responses, so that each response
fits into one category or another. For example, a response to a question such as
'Are you currently unemployed?' would fit in the category of either 'yes' or 'no'.
On the other hand, quantitative random variables yield numerical responses.
For example, responses to questions such as 'How many rooms are there in
your house?' or 'How many children are there in the family?' would be numerical
values. Also, these values, being whole numbers, are considered discrete values.
These are the values of discrete quantitative random variables. On the other
hand, responses to questions like 'How tall are you?' or 'How much do you
weigh?' would be the values of continuous quantitative random variables, since
these values are measured and not counted. Some examples of these variables
are:
(i) Qualitative random variables
Sex of students in the class
Political affiliation of a faculty member in the college
Opinions of economists regarding the economic conditions in the
country
(ii) Quantitative random variables
(a) Discrete quantitative random variables
Number of people attending a conference
Number of eggs in the refrigerator
Number of children at a summer camp
(b) Continuous quantitative random variables
Heights of models in a beauty contest
Weights of people joining a diet programme
Lengths of steel bars produced in a given production run
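The distinction between the three kinds of random variable can be illustrated with a short Python sketch. It assumes, purely for illustration, that categorical responses are recorded as text, counted values as whole numbers and measured values as decimals; the function name variable_kind and the sample responses are hypothetical.

```python
# Illustrative classifier for the three kinds of random variable
# described above. The rule used here (type of the observed values)
# is a simplifying assumption, not a general statistical test.

def variable_kind(values):
    """Classify observed values as qualitative, discrete or continuous."""
    if all(isinstance(v, str) for v in values):
        return "qualitative"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "discrete quantitative"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "continuous quantitative"
    raise ValueError("mixed or unsupported value types")


# Hypothetical responses, invented for illustration only.
observations = {
    "currently unemployed?": ["yes", "no", "no"],
    "rooms in the house": [3, 5, 2],
    "height in cm": [162.5, 171.0, 158.2],
}

for question, answers in observations.items():
    print(f"{question}: {variable_kind(answers)}")
```

In practice the kind of variable follows from how the characteristic is obtained (categorized, counted or measured), not from how it happens to be stored, so this rule is only a convenient shortcut.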
objectively. The next important step towards processing the data is classification.
Classification means separating items according to similar characteristics and
grouping them into various classes. The items in different classes will differ
from each other on the basis of some characteristics or attributes. Classification
of data is very similar to sorting of mail at a post office, where mail is classified
according to its geographical destination and may further be classified into the
type of mail such as first class, parcel post, and so on. The data may be classified
into four broad classes:
(i) Geographical. This classification groups the data according to locational
differences among the items. The geographical areas are usually listed
in alphabetical order for easy reference. For example, a book listing
colleges and universities in various states in the USA would first list the states
in alphabetical order and then the colleges and the universities within
these states in alphabetical order.
(ii) Chronological. This classification includes data according to the time
period in which the items under consideration occurred. For example, the
sales of automobiles in India over the last ten years may be grouped
according to the year in which such sales took place.
(iii) Qualitative. In this type of classification, the data is grouped together
according to some distinguishing characteristic or attribute such as religion,
sex, age, national origin, and so on. This classification simply identifies
whether a given attribute is present or absent in a given population. For
example, the population may be divided into two classes: male and female.
Then the attribute of male will go into one class and the attribute of female
will go into the other.
(iv) Quantitative. This refers to the classification of data according to some
attribute which has magnitude and can be measured such as classification
according to weight, height, income, and so on. For example, the salaries
of professors at a university may be classified according to their rank
such as instructor, assistant professor, associate professor and full
professor.
Hence, the collected data should be arranged systematically to give it shape,
form and meaning. The division of the data into homogeneous groups according
to their characteristics, recorded in a statistical inquiry, is called classification.
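The four bases of classification can be sketched as four ways of grouping the same records. The Python example below is a minimal illustration; the records are invented, and the helper names classify and income_class, as well as the class width of 10,000, are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only.
records = [
    {"state": "Kerala", "year": 2009, "sex": "female", "income": 28000},
    {"state": "Assam", "year": 2010, "sex": "male", "income": 41000},
    {"state": "Kerala", "year": 2010, "sex": "male", "income": 52000},
    {"state": "Assam", "year": 2009, "sex": "female", "income": 35000},
]


def classify(items, key):
    """Group items into classes that share the same value of key(item)."""
    classes = defaultdict(list)
    for item in items:
        classes[key(item)].append(item)
    return dict(classes)


def income_class(record):
    """Quantitative classes: income ranges of width 10,000 (an assumption)."""
    lower = record["income"] // 10000 * 10000
    return f"{lower}-{lower + 9999}"


geographical = classify(records, lambda r: r["state"])
chronological = classify(records, lambda r: r["year"])
qualitative = classify(records, lambda r: r["sex"])
quantitative = classify(records, income_class)

for name in sorted(geographical):  # alphabetical order, as in the text
    print(name, len(geographical[name]))
print(sorted(quantitative))
```

The same grouping function serves all four cases; only the characteristic used as the key changes.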
Self-Assessment Questions
3. State whether true or false.
(a) If the data is written in an ascending or descending order, it would
be called ordered data.
(b) Items in different classes will differ from each other on the basis of
some characteristics or attributes.
4. Fill in the blanks with the appropriate terms.
(a) A ______________ is any characteristic that can assume different
values.
(b) Classification means separating items according to similar
________________ and grouping them into various classes.
been collected by persons who were specifically trained for that purpose.
However, such secondary data must be used with utmost care. The reason
is that such data may be full of errors due to the fact that the purpose of
the collection of data by the primary agency may have been different
from that of the user of the secondary data. Additionally, there may have
been biases introduced during collection of data or analysis of data. For
example, the size of the sample may have been inadequate or there may
have been arithmetical or definitional errors. Hence, it is necessary to
critically investigate the validity of secondary data as well as the credibility
of the primary data collection agency.
Sources of Data
The following are some of the sources of data for collecting first hand information.
Census
World Bank
WHO (World Health Organization)
NSSO (National Sample Survey Organization)
Economic Survey
National Family and Health Surveys
SRS Surveys
Multiple Indicator Survey
CSO, RBI, Gov.nic.in, CMIE
The quality of the results obtained from statistical data, and hence of the
managerial decisions based on them, depends upon the quality of the collected
information itself. It is therefore important that a sound investigative
process be established to ensure that the data is highly representative and
unbiased. This requires a high degree of skill, and certain precautionary
measures must also be taken.
Activity 1
Collect first hand information from five families in your neighbourhood on
education, health and economic status. Tabulate the data as qualitative or
quantitative. Also classify the attributes as per the four measurement scales.
Self-Assessment Questions
5. Fill in the blanks with the appropriate terms.
(a) ___________ data is one which is collected by the investigator himself
for the purpose of a specific inquiry or study.
(b) When an investigator uses the data which has already been collected
by others, such data is called ________________ data.
6. Choose the right answer from the given options.
(a) To collect first hand information, we use ________________.
(i) Census
(ii) Interview
(iii) Observation
(iv) Questionnaire
(ii) Census
(iv) Primary
(ii) Ordinal scale. Also known as ranking scale, it possesses only the attribute
of magnitude. This means that various categories of items can be
compared with each other only in order of rank assigned to these
categories. However, these ranks only indicate which category is
greater or better, but do not indicate the magnitude of the difference
among these categories. For example, the students in a class may be
categorized according to their grades of A, B, C, D and F where A is
better than B, and so on, and the classification is from the highest grade
to the lowest grade. Another example of ordinal scaling would be the
classification of teaching faculty ranks in the colleges as full professors,
associate professors, assistant professors and instructors.
(iii) Interval scale. The interval scale measures the values of quantitative
random variables and identifies not only which category is greater or better
but also by how much. It is a stronger form of measurement and possesses
two attributes, which are magnitude and equal intervals. It does not,
however, possess an absolute zero point. Temperature in degrees Celsius
or Fahrenheit and calendar time are examples of the interval scale: the
difference between 20 and 30 degrees Celsius equals the difference between
30 and 40 degrees Celsius, but zero degrees Celsius does not mean the
absence of heat.
(iv) Ratio scale. The ratio scale is also used for measurement of quantitative
random variables, but it differs from the interval scale in that it has a true zero
point, meaning that a value of zero indicates the complete absence of the
quantity measured. It makes mathematical manipulations such as division
and multiplication meaningful. Examples of the ratio scale are physical
measurements such as height and weight, temperature on the Kelvin scale,
the number of students registered in various classes, and so on. A
temperature of zero kelvin means the total absence of heat, and it is
also possible that zero students are registered for a given class. Heights
and weights are sometimes loosely placed on the interval scale, but since
they have a true, if hypothetical, zero value, they belong to the ratio scale.
These measurement scales assist in designing survey methods for the
purpose of collecting relevant data.
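The practical consequence of the scales is that each one permits a different set of summary operations. The Python sketch below illustrates this; the table of meaningful operations is the conventional textbook one and is stated here as an assumption, and the names MEANINGFUL and is_meaningful, like the sample data, are hypothetical.

```python
import statistics

# Which summary operations are meaningful on each scale. The mapping is
# the conventional textbook one and is stated here as an assumption.
MEANINGFUL = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median"},
    "interval": {"mode", "median", "mean", "difference"},
    "ratio": {"mode", "median", "mean", "difference", "ratio"},
}


def is_meaningful(scale, operation):
    """Return True if the operation makes sense for data on this scale."""
    return operation in MEANINGFUL[scale]


# Ordinal data: grades ranked from highest (A) to lowest (F).
grade_rank = {"A": 1, "B": 2, "C": 3, "D": 4, "F": 5}
grades = ["B", "A", "C", "B", "F"]
ranked = sorted(grades, key=grade_rank.get)
median_grade = ranked[len(ranked) // 2]  # middle of an odd-sized list

# Ratio data: counts of registered students, where zero means "none".
registered = [0, 12, 30, 24]
mean_registered = statistics.mean(registered)

print(median_grade, mean_registered)
print(is_meaningful("ordinal", "mean"), is_meaningful("ratio", "ratio"))
```

A median grade can be reported because grades can be ranked, but an average grade would require equal intervals between grades, which the ordinal scale does not provide.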
consideration the field to be covered, and the time period in which to conduct
the study. The time span is very important, because in certain areas, the
conditions change very quickly, and hence, by the time the study is completed,
it may become irrelevant. The statistical units and the desired accuracy of such
units must be clearly specified.
Methods of Collecting Primary Data
Primary data is collected by the investigator for a specific study. This data should
be unique in nature and should be kept secret until it is published. The following
are the methods of collecting primary data.
Questionnaires: These are the most popular means of collecting primary
data. Questionnaires are designed as per specific problems. For example, a
questionnaire can be used for interviewing or for a telephone survey. It can be
posted, emailed or faxed and can be used for a large number of people or organizations.
It does not require prior arrangements and there is no interviewer bias. The
questionnaire must not be too long, too complex, uninteresting or too personal.
The questions asked must be simple so that the respondent can read all
questions and reply. The basic subject of the questionnaire must be made clear
in a covering letter. The researcher must state his/her identity, explain why the
data is being collected and declare that confidentiality and anonymity will be
maintained. A request and instructions to return the duly filled questionnaire
must be included, along with the return date. The request can be worded as:
'It would be greatly appreciated if you could return the completed questionnaire by..........'
Interviews: This is a technique basically used to know the mind-set, likings
or behaviour of the person being interviewed. Interviews can be conducted on a
personal one-to-one basis or in a group. Interviews can be of structured,
semi-structured and unstructured types. The structured type is based on a carefully
worded interview plan. In the semi-structured type, the interview is based on
questions that provide scope to the respondent to answer at length. The unstructured
type is also termed an in-depth interview. The interviewer starts with
general questions to encourage the respondent to talk without restraint. For
conducting an interview, the researcher has to prepare a list of topics on which
information is required, select the type of interview, frame the relevant
questions and then fix an appointment with the respondent.
Telephone interview: This is a type of interview conducted over the
telephone instead of on a personal or face-to-face basis. It gives a high response
rate and the answers can be taped for the record. This method can be used only if
the respondent has a telephone.
Activity 2
Observe a group of students participating in a debate competition. Collect
data on their behaviour and categorize them as most active, moderately
active and less active.
Self-Assessment Questions
7. Fill in the blanks with the appropriate terms.
(a) _____________ is the quantitative value that exists or is assigned
to an attribute or characteristic.
(b) Absolute zero point refers to the _____________ which has no value
at all on measurement scale.
8. State whether true or false.
(a) Nominal scale is the weakest form of measurement so that some
statisticians do not consider it as a scale at all.
(b) In the observation method, the behavioural styles of specific people,
objects and happenings are recorded in an unsystematic way.
1.6 Summary
Let us recapitulate the important concepts discussed in this unit:
Information plays a vital role in decision making. It is provided by the
information system set up in the organization. The management depends
on information systems for effective decision-making. Information consists
of data (facts and figures) which is processed and retrieved to be used for
forecasting and decision-making.
The information should be context specific and available when it is required.
When information is specifically arranged according to the requirement
or problem, it is termed as knowledge.
Data comprise the numerical results of any measurement. Data can also
be used in singular sense, such as a set of data.
A variable is any characteristic that can assume different values. There
are two types of variables: discrete variable and continuous variable.
1.7 Glossary
Data: Numerical results of any measurement
Variable: Any characteristic that can assume different values
Random variable: A qualitative or quantitative phenomenon whose
observed outcomes are determined by chance, are unpredictable
and may differ from response to response.
Primary data: Data collected by the investigator for the purpose of a
specific inquiry or study. The data is original in character and is generated
by surveys conducted by individuals or research institutions.
Secondary data: When an investigator uses the data which has already
been collected by others, then the data is secondary data for the
investigator but it remains primary data for those who collected it. It is
obtained from journals, reports, government publications, etc.
1.9 Answers
Answers to Self-Assessment Questions
1. (a) Information; (b) Knowledge
2. (a) True; (b) False
3. (a) True; (b) True
4. (a) Variable; (b) Characteristics
5. (a) Primary; (b) Secondary
6. (a) i; (b) iii
7. (a) Magnitude; (b) Attribute
8. (a) True; (b) False
Endnote
1. Aggarwal, Y.P. Statistical Methods, New Delhi: Sterling Publishers, 1986, p.5.
Unit 2
Structure
2.1 Introduction
Objectives
2.2 Observation
2.3 Questionnaire
2.4 Interviews
2.5 Experiments
2.6 Summary
2.7 Glossary
2.8 Terminal Questions
2.9 Answers
2.10 Further Reading
2.1 Introduction
In the previous unit, you learnt about information and data sources. Data sources
help in collecting data.
In this unit, you will learn about the various data collection methods.
The unit describes the advantages and shortfalls of various types of
observations. You will also learn about the process of preparing a
questionnaire, what should be kept in mind while drafting it and what
pattern of questions should be adopted, i.e., dichotomous, multiple choice
or open questions. You will also learn about the different modes of
interviews along with their merits and demerits. Accurate records have to be
kept so that people stay informed about current conditions in society. As
there are several methods of data collection, the methods that consume the
least amount of time are usually preferred. Data collection techniques such as
questionnaires and interviews play a vital role in collecting a large amount of
information in a short period of time and hence are discussed in this
unit. Experiments are resorted to when factual data must be collected and
nothing is available for reference. They may also be conducted to verify a
theory. An experiment is a study conducted under controlled conditions.
Objectives
After studying this unit, you should be able to:
Prepare a questionnaire
Explain the significance of interviews
Discuss other modes of data collection along with their advantages
Explain the importance of experiments
2.2 Observation
Observation may be defined as recording behavioural patterns without verbal
communication.
Primary data can be collected using the following method.
Direct personal observation. Under this method, the investigator
presents himself personally before the informant and obtains first-hand
information. This method is most suitable when the field of enquiry is small and
a greater degree of accuracy is required.
We shall now see the merits and limitations of the observation method.
Merits
(i) The first hand information obtained by the investigator is bound to
be more reliable and accurate since the investigator can extract the
correct information by removing doubts, if any, in the minds of the
respondents regarding certain questions.
(ii) High response rate, since the answers to various questions are
obtained on the spot.
(iii) It permits explanation of questions concerning difficult subject matter.
(iv) It permits evaluation of respondent, his circumstances and reliability.
(v) This method is useful where spontaneity of response is required.
(vi) It provides personal rapport, which helps to overcome reluctance to
respond.
(vii) Where the investigator and the informant talk face to face, it becomes
possible to explore questions in depth.
(viii) Information is collected promptly and there is no dribbling.
Limitations
(i) This method is suitable only for intensive studies and not for extensive
enquiries.
(ii) This method is time-consuming and the investigation may have to
be spanned over a long period.
(iii) This method is highly subjective in nature and the results of the
enquiry may be adversely affected by the personal bias, whim and
prejudices of the investigator.
Activity 1
Find a situation when direct personal observation is the perfect method
for data collection.
Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Observation may be defined as recording ___________ patterns
without verbal communication.
(b) Direct personal observation method is most suitable when the field
of ____________ is small and a greater degree of accuracy is
required.
2. State whether true or false.
(a) Direct personal observation does not permit explanation of questions
concerning difficult subject matter.
(b) Direct personal observation provides personal rapport, which helps
overcome reluctance to respond.
2.3 Questionnaire
The questionnaire method can be used either by mailing the questionnaires or
by sending them through enumerators.
post to the informants together with a polite covering letter explaining in detail
the aims and objectives of collecting the information and requesting the
respondents to cooperate by furnishing the correct replies and returning the
questionnaire duly filled in. In order to ensure quick response, the return postage
expenses are usually borne by the investigator. This method is usually adopted
by research workers, private individuals and non-official agencies. The success
of this method depends upon the proper drafting of the questionnaire and the
cooperation of the respondents.
Merits
(i) By this method, a large field of investigation may be covered at a
very low cost. In fact, this is the most economical method in terms of
time, money and manpower.
(ii) Errors due to personal bias of the investigators or enumerators are
completely eliminated as the information is supplied by the person
concerned in his own handwriting.
Limitations
(i) This method can be used only if the respondents are educated and
can understand the questions well, and reply in their own handwriting.
(ii) Sometimes, the informants may not send back the schedules and
even if they return the schedules, they may be incorrectly filled in.
(iii) Sometimes, the informants are not willing to give written information
in their own handwriting on certain personal questions like income,
personal habits and property.
(iv) There is no scope for asking supplementary questions for cross-checking
of the information supplied by the respondents.
12. A covering letter, stating briefly the aims and objectives of the enquiry,
soliciting cooperation of the respondents, and explaining various terms
and concepts, should be enclosed along with the questionnaire.
13. In case of a mailed questionnaire method, a self-addressed stamped
envelope should be enclosed.
14. To ensure quick response, the respondents may be offered incentives in
the form of gift coupons, a sample of the product to be introduced, or a
promise to supply a copy of the findings after the survey work is over.
15. Method of tabulation and analysis, whether hand-operated, machine-operated
or computerized, should also be kept in mind while designing
the questionnaire.
16. Lastly, the questionnaire should be made attractive by a proper layout
and an appealing get-up.
Page No. 29
Business Statistics
Unit 2
(For the sake of simplicity, it is assumed that the professors have only
one car in the family.)
The Questionnaire
1. General
Name: ...................................................................................
Age: ......................................................................................
Sex: M .................... F ....................
Marital status: Married .................. Unmarried .................
Number of members in the family
1-2...................
3-4...................
5-6...................
Over 6..............
Yearly income
Less than 30,000...................
30,000-39,999......................
40,000-49,999......................
50,000 and more...................
2. What type of car do you own now?
.................Indian
.................Japanese
.................European
3. What size of car do you own?
.................Luxury
.................Mid-size
.................Compact
4. Did you buy this car new or used?
.................New....................Used
5. If you bought a used car, did you buy it from a dealer or a private party?
.................Dealer.................Private party
Page No. 30
Business Statistics
Unit 2
6. If you bought a new car, how long have you owned this car?
.................Number of years
7. If you bought a used car, how old is this car now?
..............Number of years
8. Price paid for the car..........New..........Used
9. Who influenced your decision to purchase the above brand of car?
Indicate if more than one.
...............Yourself
Others.................................................................................. .
10. Indicate as to who decided about the budget allocation for the car.
...............Yourself
...............Your spouse
...............Family decision
11. If you bought your car from a dealer, then who influenced your decision
regarding the selection of a particular dealer?
...............Yourself
...............Your friend
...............Your colleague
...............Family decision
12. How did you come to know about this dealer?
...............TV commercial
...............Newspapers
...............Personal references
...............Others
13. Rank the following factors that affected the final decision at the time of
purchasing the car. A rank of 1 measures the most important factor, a
rank of 2 measures the second most important factor, and so on.
...............Very inconvenient without the car
...............Money was available
...............Newspapers
................Magazines
................Word of mouth
...............Others
Self-Assessment Questions
3. State whether true or false.
(a) Mailed questionnaires are sent by post to the informants together
with a polite covering letter explaining in detail the aims and objectives
of collecting the information and requesting the respondents to
cooperate by furnishing the correct replies and returning the
questionnaire duly filled in.
(b) Designing of questionnaire requires a high degree of skill and
experience on the part of the investigator.
4. Fill in the blanks with the appropriate terms.
(a) If a particular question needs clarification, it should be explained by
way of a __________________.
(b) Questions should be ________________ arranged.
2.4 Interviews
Indirect personal interview. Under this method, instead of directly approaching
the informants, the investigator interviews several third persons who are directly
or indirectly concerned with the subject matter of the enquiry and who are in
possession of the requisite information. Such a procedure is followed by the
enquiry committees and commissions appointed by the Government of India.
The committee selects persons, known as witnesses, and collects information
from them by getting answers to questions decided in advance. This method is
highly suitable where direct personal investigation is not practicable either
because the informants are unwilling or reluctant to supply information or where
the information desired is complex and the study in hand is extensive.
Merits
(i) This method is less costly and less time-consuming than direct
personal investigation.
(ii) Under this method, the enquiry can be formulated and conducted
more effectively and efficiently as it is possible to obtain the views
and suggestions of the experts on the given problem.
Limitations
The success of this method depends upon:
(i) The representative character of the witnesses.
(ii) The personal knowledge of the witnesses about the subject matter
of enquiry.
(iii) The personal prejudices of the witnesses as regards definiteness in
stating what is wanted.
(iv) The ability of the interviewer to extract information from the witnesses
by asking appropriate questions and cross-questions.
Merits
(i) This method is very cheap and economical for extensive
investigations.
(ii) The required information can be obtained expeditiously since only
rough estimates are required.
Limitations
(i) Since the correspondents apply their own judgement about the
method of collecting the information, the results are often vitiated
due to personal prejudices and whims of the correspondents. The
data so obtained is thus not so reliable.
(ii) This method is suitable only if the purpose of investigation is to obtain
rough and approximate estimates. It is unsuited where a high degree
of accuracy is desired.
Activity 3
How will you conduct an interview if the person is not ready to give it? Give
an example.
Self-Assessment Questions
5. Fill in the blanks with the appropriate terms.
(a) The committee selects persons, known as ____________ and
collects information from them by getting answers to questions
decided in advance.
(b) The local agents collect information in their areas and ____________
the same to the investigator.
6. Fill in the blanks with the appropriate terms.
(a) The success of the interview method depends upon the
______________ character of the witnesses.
(b) The telephone survey method is more convenient than personal
________________.
2.5 Experiments
Experiments are another method of collecting data. Experiments are resorted to
when it is necessary to collect factual data and nothing is available for reference.
They may also be conducted to verify a theory. An experiment is a study conducted
under controlled conditions. Experiments are made by researchers to understand the
cause and effect relationships. Such relationships are also examined in observational
studies, but there the researcher has no control over how subjects are assigned to groups.
Experimental design
This design covers information-gathering exercises in which the variation is under
the control of the experimenter. In observational studies, there is no control on
condition. Mostly, an experimenter wants to know the effect of some process on
certain objects, which are taken as experimental units. Such objects may be
a small section of people, a few groups, etc. Such designs find broad application
in the natural and social sciences.
The random design experiment is very helpful in situations where we have
to analyse a huge amount of outcome data. The term experiment, or random
experiment, is used when we face an uncertain situation and need to make
some observations about it. Random does not imply haphazard; we
need to take care that appropriate random methods are used. The
actual result of the uncertain situation is referred to as an outcome or sample
point. In a random experiment, nothing can be said with certainty about the
outcome. An experiment may comprise one or more observations. If there is a
single observation, we use the term random trial or simply trial. An electric fan,
for example, may be selected from a factory to examine whether or not it is
defective. Selecting a single fan is a trial. We can select as many fans as we
wish; the number of observations will be equal to the number of fans. The properties
of a random experiment may be listed as follows:
We can repeat the experiment any number of times.
A random trial comprises at least two possible outcomes.
We cannot say with certainty about the outcome of the random trial or
random experiment.
There are three things in common in all statistical experiments.
1. The experiment can have many possible outcomes.
2. We can specify each possible outcome in advance.
3. The outcome of the experiment is dependent on chance.
A coin toss, for example, has all the attributes of a statistical experiment.
In this case, there is more than one possible outcome. It is possible to specify
each possible outcome (i.e., heads or tails) in advance. Finally, there is an
element of chance, since the outcome is uncertain.
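The coin toss can be imitated on a computer. The following is only a minimal sketch in Python: the function name `toss_coin`, the number of tosses and the seed are illustrative assumptions and not part of the text. It shows the three attributes listed above: the outcomes are specified in advance, there is more than one of them, and each result depends on chance.

```python
import random

# Every possible outcome is specified in advance (the sample space).
SAMPLE_SPACE = ("heads", "tails")

def toss_coin(n_trials, seed=None):
    """Repeat the random trial n_trials times and return the list of outcomes."""
    rng = random.Random(seed)  # a fixed seed only makes the run repeatable
    return [rng.choice(SAMPLE_SPACE) for _ in range(n_trials)]

tosses = toss_coin(1000, seed=42)
share_heads = tosses.count("heads") / len(tosses)
print(f"Share of heads in {len(tosses)} tosses: {share_heads:.3f}")
```

No single toss can be predicted, but over many repetitions the share of heads settles near one half.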
Analysis of the experimental design has the foundation of variance
analysis. In variance analysis, the observed variance is partitioned
into components attributable to different factors, and these components
are then estimated and tested.
We now consider another experiment where eight objects are to be
weighed using a pan balance and a set of standard weights. Each weighing
measures the difference between the weight of the objects in the left pan and
that of the objects in the right pan: standard weights are added to the
lighter pan until the balance is in equilibrium. Each measurement has a random
error that averages zero. The standard deviation of the probability
distribution of the errors is the same number σ on every weighing, and the errors
on different weighings are independent. We denote the true weights as θ1, ..., θ8.
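Experiments considered are,
1. Weighing of each object in one pan, while the other is empty. We denote
Xi as the measured weight of the ith object, where i varies from 1 to 8.
2. Carrying out eight weighings as per the schedule given below. We take the measured
difference as Yi, where i varies from 1 to 8.

                 Left pan            Right pan
1st weighing:    1 2 3 4 5 6 7 8     (empty)
2nd:             1 2 3 8             4 5 6 7
3rd:             1 4 5 8             2 3 6 7
4th:             1 6 7 8             2 3 4 5
5th:             2 4 6 8             1 3 5 7
6th:             2 5 7 8             1 3 4 6
7th:             3 4 7 8             1 2 5 6
8th:             3 5 6 8             1 2 4 7

Taking each Yi as the left pan minus the right pan, a weight is estimated by adding
the Yi of the weighings in which the object was in the left pan, subtracting the
others, and dividing by 8. For the first two objects,
$$\hat{\theta}_1 = \frac{Y_1 + Y_2 + Y_3 + Y_4 - Y_5 - Y_6 - Y_7 - Y_8}{8}$$
$$\hat{\theta}_2 = \frac{Y_1 + Y_2 - Y_3 - Y_4 + Y_5 + Y_6 - Y_7 - Y_8}{8}$$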
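The benefit of the second experiment can be checked by simulation. The sketch below is an illustration under stated assumptions: the true weights in `theta`, the error standard deviation `sigma`, the normal error distribution and all function names are hypothetical, and each Yi is taken as left pan minus right pan. Because every estimate averages eight independent errors, its variance should come out close to σ²/8, against σ² for a single direct weighing.

```python
import random
import statistics

# Objects placed in the left pan at each of the eight weighings (the rest go right).
LEFT_PAN = [
    {1, 2, 3, 4, 5, 6, 7, 8},
    {1, 2, 3, 8}, {1, 4, 5, 8}, {1, 6, 7, 8},
    {2, 4, 6, 8}, {2, 5, 7, 8}, {3, 4, 7, 8}, {3, 5, 6, 8},
]

def sign(obj, k):
    """+1 if object obj is in the left pan at weighing k, otherwise -1."""
    return 1 if obj in LEFT_PAN[k] else -1

def weigh_by_schedule(theta, sigma, rng):
    """Return Y1..Y8: left-minus-right differences plus independent random errors."""
    return [
        sum(sign(obj, k) * theta[obj - 1] for obj in range(1, 9)) + rng.gauss(0, sigma)
        for k in range(8)
    ]

def estimate(obj, y):
    """Estimate one weight by adding or subtracting each Yi and dividing by 8."""
    return sum(sign(obj, k) * y[k] for k in range(8)) / 8

rng = random.Random(1)
theta = [10, 20, 30, 40, 50, 60, 70, 80]  # hypothetical true weights
sigma = 1.0                                # hypothetical error standard deviation

direct, scheduled = [], []
for _ in range(5000):
    direct.append(theta[0] + rng.gauss(0, sigma))                        # experiment 1
    scheduled.append(estimate(1, weigh_by_schedule(theta, sigma, rng)))  # experiment 2

print("variance of X1 (direct weighing):   ", statistics.pvariance(direct))
print("variance of estimate from schedule: ", statistics.pvariance(scheduled))
```

Both experiments use eight weighings, yet the scheduled design estimates every weight far more precisely.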
Self-Assessment Questions
7. Fill in the blanks with the appropriate terms.
(a) Experiments are made by _______________ to understand the
cause and effect relationships.
(b) Analysis of the experimental design has the foundation of
___________ analysis.
8. State whether true or false.
(a) In observational studies, there is no control on condition.
(b) In decision-making one has to choose worse alternatives.
2.6 Summary
Let us recapitulate the important concepts discussed in this unit:
Observation may be defined as recording behavioural patterns without
verbal communication.
The questionnaire method of data collection can be used either by mailing
the questionnaires or by sending them through enumerators.
The questionnaire is the only medium of communication between the
investigator and the respondents, so it must be designed or drafted with
utmost care and caution so that all the relevant and essential information
for the enquiry may be collected without any difficulty, ambiguity or
vagueness.
Instead of directly approaching the informants, the investigator can
interview several third persons who are directly or indirectly concerned
with the subject matter of the enquiry and who are in possession of the
requisite information using indirect personal interview method.
The investigator, instead of presenting himself before the informants,
contacts them on telephone and collects information from them.
2.7 Glossary
Direct personal observation: In this, the investigator himself is present
before the informant and obtains first hand information.
Mailed questionnaire method: In this, the investigator prepares a
questionnaire containing a number of questions pertaining to the field of
enquiry.
Questionnaire sent through enumerators: In this, the investigator
appoints agents known as enumerators, who go to the respondents
personally with the questionnaire and record the respondents'
replies.
Indirect personal interviews: In this, the investigator interviews several
third persons who are directly or indirectly concerned with the subject
matter of the enquiry and who are in possession of the requisite
information.
Telephone survey: In this, the investigator contacts the informants on
telephone and collects the information.
Information received through local agents: In this, the information is
not collected formally by the investigator, but by local agents commonly
known as correspondents.
2.9 Answers
Answers to Self-Assessment Questions
1. (a) Behavioural; (b) Enquiry
2. (a) False; (b) True
3. (a) True; (b) True
4. (a) Footnote; (b) Logically
5. (a) Witnesses; (b) Transmit
6. (a) Representative; (b) Interview
7. (a) Researchers; (b) Variance
8. (a) True; (b) False
Unit 3
Structure
3.1 Introduction
Objectives
3.2
3.3
3.4
3.5
3.6
3.7
3.8
3.9
3.10
3.11
3.1 Introduction
In the previous unit, you learnt about various data collection methods. The
collected data is analysed to get useful information. In this unit, you will learn
about the various techniques of data analysis. A percentage is a ratio expressed
as a part of 100; it is obtained by multiplying the ratio by 100. If 50% of the
students in a class are girls, it means that out of every 100 students, 50 are girls.
A ratio is a comparison between two values. It shows the number of times one value
is contained in or contains the other. For example, if the ratio of girls to boys in
a class is 1:2, it means that the number of boys is twice the number of girls. Average
is the measure of the middle value of the data set. A measure of central tendency is a
single value that attempts to describe a set of data by identifying the central
position within that set of data. The three common measures of central tendency,
mean, median and mode, are explained in this unit. Dispersion tells us about
the spread of data. The commonly used measures of dispersion are quartile
deviation, range and standard deviation.
Objectives
After studying this unit, you should be able to:
Evaluate percentages, ratios and averages
Calculate arithmetic mean, median and mode
3.2.1 Percentage
Mathematically, percentage value is calculated for ratios that have a denominator.
A denominator is the base value of a percentage. If there is a ratio 3 to 10
(3/10), this literally means 3 in 10. To convert it into percentage, we should multiply
it with 100 (hundred) and it is then expressed as 30% (or 30 per cent).
When a value of measured quantity is subject to some change, this can
be recorded as:
(i) Absolute value change
(ii) Percentage change
These two changes are related to each other.
(i) Absolute value change: This is defined as the actual change in the
quantity. For example, if there is a sales figure of 220 crores in the year
2000 and 250 crores in the year 2001, the absolute value change is 30
crores.
(ii) Percentage change: Here, change is expressed as a ratio of original
value and then multiplied by 100 (hundred). In the example cited above,
Percentage change = (Absolute value change/Original quantity) × 100 =
(30/220) × 100 = 13.64%. Percentage change is always taken with
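The two kinds of change can be computed as follows. This is a small illustrative sketch; the function name `percentage_change` is an assumption made for the example and does not come from the text.

```python
def percentage_change(original, new):
    """Return the change from original to new as a percentage of original."""
    if original == 0:
        raise ValueError("percentage change is undefined when the original value is zero")
    absolute_change = new - original          # (i) absolute value change
    return absolute_change / original * 100   # (ii) percentage change

# Sales of 220 crores in 2000 and 250 crores in 2001, as in the example above.
print(250 - 220)                              # 30
print(round(percentage_change(220, 250), 2))  # 13.64
```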
3.2.2 Ratio
When a comparison is carried out between two numbers, it is useful to know
how many times one number is greater or smaller than the other. Thus, we are
often required to express one number as the fraction of the other. Ratio of a
number a to a number b is defined as quotient of number a and b.
The numbers that form the ratio are known as terms of the ratio. Numerator
of the ratio is known as antecedent and the denominator is known as consequent.
A ratio of homogeneous quantities has no unit; it is just a number. In the case of
heterogeneous quantities, the unit of the ratio depends on the units of the numerator
and the denominator. For example, specific gravity, which is a ratio of two densities,
is unitless. Current, in electricity, is the ratio of the charge that flows to the time
taken, so current is measured in coulombs per second. This unit has the special name
ampere.
Ratios are expressed as percentages by multiplying them by 100. A
ratio is given as 3/5 = 0.6. This can be expressed as 0.6 × 100 = 60%.
Properties of Ratio
(i) If numerator and denominator are multiplied by the same number, ratio
remains unchanged. This means a/b = ma/mb.
(ii) If numerator and denominator are divided by the same number, ratio
remains unchanged. This means a/b = (a/m)/(b/m).
(iii) To compare magnitudes of two ratios, their denominator should be equated
and values of numerator will then decide which one is greater. If we
compare values of 8/3 and 11/4, we have to make a common denominator.
We multiply 8/3 by 4 in numerator as well as denominator and get 32/12.
We then multiply 11/4 by 3 in both, numerator and denominator and get
33/12. Thus, we find that 11/4 > 8/3.
(iv) Ratio of two fractions can be expressed as ratio of two integers. Thus,
a/b : c/d = ad/bc.
(v) If either of the terms of a ratio is a surd, then this ratio will never be an
integer unless both the terms are equal or numerator is an integral multiple
of the denominator. Thus, the ratio of sqrt(3)/sqrt(2) will never be an integer.
(vi) When two ratios are multiplied, their numerators and denominators are
also multiplied. For example, a/b × c/d = ac/bd.
(vii) When ratio a/b is compounded with itself, the resulting ratio, $a^2/b^2$, is known
as duplicate ratio and $a^3/b^3$ is triplicate ratio and $a^{0.5}/b^{0.5}$ is the sub-duplicate
ratio of a/b.
(viii) If a/b = c/d = e/f = g/h = k, then, (a+c+e+g)/(b+d+f+h) = k.
(ix) If a1/b1, a2/b2, a3/b3, ..., an/bn are unequal fractions with positive denominators,
then the ratio (a1 + a2 + a3 + ... + an)/(b1 + b2 + b3 + ... + bn) lies between the
lowest and the highest of these fractions.
(x) If there are two equations containing three unknowns as, a1x + b1y + c1z
= 0 and a2x + b2y + c2z = 0; then values of x, y and z can not be resolved
unless we get the third equation, but the proportion in which x, y and z lie
can be solved.
(xi) If the ratio a/b > 1 and k is a positive number smaller than b, then (a + k)/(b +
k) < a/b and (a − k)/(b − k) > a/b. Similarly, if a/b < 1 and k is a
positive number smaller than b, then (a + k)/(b + k) > a/b and (a − k)/(b − k) < a/b.
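Some of these properties can be verified with exact fractions. The sketch below uses Python's standard `fractions` module; the particular numbers chosen for properties (viii) and (xi) are illustrative assumptions, while 8/3 and 11/4 are the ratios compared in property (iii).

```python
from fractions import Fraction

# Property (iii): compare 8/3 and 11/4 on the common denominator 12.
r1, r2 = Fraction(8, 3), Fraction(11, 4)
print(r1 * 12, r2 * 12)  # 32 33
print(r2 > r1)           # True

# Property (viii): equal ratios a/b = c/d = e/f = k give (a+c+e)/(b+d+f) = k.
pairs = [(2, 3), (4, 6), (10, 15)]  # each ratio equals 2/3
combined = Fraction(sum(a for a, _ in pairs), sum(b for _, b in pairs))
print(combined)          # 2/3

# Property (xi): for a/b > 1, adding k to both terms lowers the ratio
# and subtracting k (with k < b) raises it.
a, b, k = 5, 3, 1
ratio = Fraction(a, b)
print(Fraction(a + k, b + k) < ratio < Fraction(a - k, b - k))  # True
```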
3.2.3 Averages
An average is the measure of central tendency of a set of numbers. The general
formula for finding an average of n numbers x1, x2, x3, ..., xn is An = (x1 + x2 + x3 + ... +
xn)/n. There is another type of average, known as weighted average.
When there are two or more groups with known averages, then the
combined average is found by weighted average. If we have r groups having
averages as A1, A2, A3,.., Ar and elements as n1, n2, n3,.., nr, then weighted
average is given as:
Aw = ( n1A1 + n2A2 + n3A3 +..+ nrAr)/( n1 + n2 + n3 +..+ nr)
An average is also known as an arithmetic mean.
Example 3.3: A man travels from point A to point B at 60 kmph and returns at
100 kmph. Find the average speed.
Solution: Average speed = Total distance/Total time taken.
Let the distance between A to B, be d. Time taken for going from A to B is
d/60 and for returning to A is d/100.
Total time is d/60 + d/100.
Total distance = 2d.
Hence, average speed = 2d/[d/60 + d/100] = 2d × 600/(16d) = 75 kmph.
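The same calculation can be written as total distance divided by total time. The following sketch is illustrative only; the function name `average_speed` and the use of exact fractions are choices made for this example.

```python
from fractions import Fraction

def average_speed(speeds, distance=1):
    """Average speed over equal stretches of length `distance`, one per speed."""
    total_distance = distance * len(speeds)
    total_time = sum(Fraction(distance) / Fraction(s) for s in speeds)
    return total_distance / total_time

print(average_speed([60, 100]))  # 75
print((60 + 100) / 2)            # 80.0, the simple average of the two speeds
```

The simple average of the two speeds overstates the result, because more time is spent travelling at the lower speed.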
Example 3.4: The average marks of 20 students in an examination fell by 2 when
the topper of the class, who had secured 90 marks, was replaced by a new student.
What was the score of this new student?
Solution: Let the average marks when the topper is included and not replaced by
the new student be x. There are 20 students, so the total marks are 20x. The new
average is x − 2 and hence the new total is 20(x − 2) = 20x − 40. Thus, there is a
reduction of 40 marks and this must be due to the new student who got 40
marks less than the student he replaced. So, he got only 90 − 40 = 50 marks.
Activity 1
An investor buys Rs 1200 worth of shares in a company each month. During
the first five months, he bought the shares at a price of Rs 10, Rs 12, Rs
15, Rs 20 and Rs 24 per share. After 5 months what is the average price
paid for the shares by him?
Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Percentage value is calculated for ratios that have a
___________________.
(b) Numerator of the ratio is known as _________________ and the
denominator is known as consequent.
2. State whether true or false.
(a) An average is the measure of central tendency of a set of numbers.
(b) The numerator does not have a direct relationship with ratio or
percentage.
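The mean is computed by adding all the data values and dividing the sum by
the number of such values. The symbol used for sample average is $\bar{X}$ so that:
$$\bar{X} = \frac{19 + 20 + 22 + 22 + 17}{5}$$
$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}$$
In other words,
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}, \quad i = 1, 2, \dots, n$$
$$\frac{\sum_{i=1}^{N} X_i}{N}, \quad i = 1, 2, \dots, N$$
$$\frac{\sum f(X)}{\sum f}$$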
(f)           f(X)
2             34
1             18
1             19
2             40
3             66
1             23
Total = 10    200
$$\frac{\sum f(X)}{\sum f} = \frac{200}{10} = 20$$
Example 3.6: Calculate the mean of the marks of 46 students given in the
following table.
Frequency of Marks of 46 Students
Marks (X)    Frequency (f)
9            1
10           2
11           3
12           6
13           10
14           11
15           7
16           3
17           2
18           1
Total        46
Marks (X)    Frequency (f)    f(X)
9            1                9
10           2                20
11           3                33
12           6                72
13           10               130
14           11               154
15           7                105
16           3                48
17           2                34
18           1                18
             Σf = 46          Σf(X) = 623
$$\bar{X} = \frac{\sum f(X)}{\sum f} = \frac{623}{46} = 13.54$$
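The computation for Example 3.6 can be reproduced in a few lines. This is a minimal sketch; the variable names are illustrative and the data are the marks and frequencies of the 46 students.

```python
marks = [9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
frequency = [1, 2, 3, 6, 10, 11, 7, 3, 2, 1]

total_f = sum(frequency)                                 # sum of f
total_fx = sum(f * x for f, x in zip(frequency, marks))  # sum of f(X)
mean = total_fx / total_f

print(total_f, total_fx, round(mean, 2))  # 46 623 13.54
```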
Example 3.7: The mean age of a group of 100 persons (grouped in intervals
10, 12,..., etc.) was found to be 32.02. Later, it was discovered that age 57
was misread as 27. Find the corrected mean.
Solution: Let the mean be denoted by $\bar{X}$. So, putting the given values in the
formula of arithmetic mean, we have,
$$32.02 = \frac{\sum X}{100}, \text{ i.e., } \sum X = 3202$$
$$\text{Correct } \sum X = 3202 - 27 + 57 = 3232$$
$$\text{Correct AM} = \frac{3232}{100} = 32.32$$
Example 3.8: The mean monthly salary paid to all employees in a company is
Rs 500. The monthly salaries paid to male and female employees average Rs
520 and Rs 420, respectively. Determine the percentage of males and females
employed by the company.
Solution: Let $N_1$ be the number of males and $N_2$ be the number of females
employed by the company. Also, let $\bar{x}_1$ and $\bar{x}_2$ be the monthly average salaries
paid to male and female employees and $\bar{x}$ be the mean monthly salary paid to
all the employees.
$$\bar{x} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2}$$
or
$$500 = \frac{520 N_1 + 420 N_2}{N_1 + N_2}$$
or
$$20 N_1 = 80 N_2$$
or
$$\frac{N_1}{N_2} = \frac{80}{20} = \frac{4}{1}$$
Hence, the males and females are in the ratio of 4 : 1, that is, 80 per cent of
those employed by the company are males and 20 per cent are females.
The Weighted Arithmetic Mean
In the computation of arithmetic mean we had given equal importance to each
observation in the series. This equal importance may be misleading if the
individual values constituting the series have different importance as in the
following example:
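The Raja Toy shop sells
Toy Cars at Rs 3 each
Toy Locomotives at Rs 5 each
Toy Aeroplanes at Rs 7 each
Toy Double Deckers at Rs 9 each
i.e.,
$$\bar{x} = \frac{\sum x}{4} = \text{Rs } \frac{24}{4} = \text{Rs } 6$$
Number sold: Toy Cars 50, Toy Locomotives 25, Toy Aeroplanes 15, Toy Double Deckers 10.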
It may be noted that 50, 25, 15, 10 are the quantities of the various classes of
toys sold. It is for these quantities that the term weights is used in statistical language.
Weight is represented by symbol w, and Σw represents the sum of weights.
While determining the average price of toys sold, these weights are of
great importance and are taken into account in the manner illustrated below:
$$\bar{x} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4}{w_1 + w_2 + w_3 + w_4} = \frac{\sum wx}{\sum w}$$
where $w_1, w_2, w_3, w_4$ are the respective weights of $x_1, x_2, x_3, x_4$, which in
turn represent the prices of the four varieties of toys, viz., car, locomotive, aeroplane
and double decker, respectively.
Table 3.1 summarizes the steps taken in the computation of the weighted
arithmetic mean.
Table 3.1 Weighted Arithmetic Mean of Toys Sold by the Raja Toy Shop
Toys             Price (Rs)    Number Sold    Price × Weight
                 x             w              xw
Car              3             50             150
Locomotive       5             25             125
Aeroplane        7             15             105
Double Decker    9             10             90
                               Σw = 100       Σxw = 470
Σw = 100; Σwx = 470
$$\bar{x} = \frac{\sum wx}{\sum w} = \frac{470}{100} = 4.70$$
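The weighted mean of Table 3.1 can be checked with a short sketch. The function name `weighted_mean` is illustrative. The price of Rs 9 for the double decker is an assumption inferred from the table's product of 90 for 10 toys sold; the other prices are those quoted for the Raja Toy shop.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

prices = [3, 5, 7, 9]          # car, locomotive, aeroplane, double decker (9 is inferred)
number_sold = [50, 25, 15, 10]

print(sum(prices) / len(prices))           # 6.0, the simple mean of the four prices
print(weighted_mean(prices, number_sold))  # 4.7, the mean price per toy actually sold
```

The weighted mean is lower than the simple mean because the cheapest toy accounts for half of all sales.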
Solution: (i) Multiply each average (viz. 5 and 7), by the number of workers in
the concern it represents.
(ii) Add up the two products obtained in (i) above.
(iii) Divide the total obtained in (ii) by the total number of workers.
Weighted Mean of Mean Wages of A Ltd. and B Ltd.
Manufacturing    Mean Wages    Workers Employed    Mean Wages × Workers Employed
Concern          x             w                   wx
A Ltd.           5             2,000               10,000
B Ltd.           7             4,000               28,000
                               Σw = 6,000          Σwx = 38,000
$$\bar{x} = \frac{\sum wx}{\sum w} = \frac{38{,}000}{6{,}000} = \text{Rs } 6.33$$
Disadvantages of Mean
1. It is affected by extreme values and hence is not very reliable when
the data set has extreme values, especially when these extreme values
are on one side of the ordered data. Thus, the mean of such data is not
truly representative of the data. For example, the ages of three
persons aged 4, 6 and 80 years give an average of 30.
2. It is tedious to compute for a large data set as every point in the data set
is to be used in computations.
3. We are unable to compute the mean for a data set that has open-ended
classes either at the high or at the low end of the scale.
4. The mean cannot be calculated for qualitative characteristics such as
beauty or intelligence, unless these can be converted into quantitative
figures such as intelligence into IQs.
3.3.2 Median
The second measure of central tendency that has a wide usage in statistical
works is the median. Median is that value of a variable which divides the series
in such a manner that the number of items below it is equal to the number of
items above it. Half of the total number of observations lies below the median
and half above it. The median is thus a positional average.
The median of ungrouped data is found easily if the items are first arranged
in order of the magnitude. The median may then be located simply by counting,
and its value can be obtained by reading the value of the middle observations.
If we have five observations whose values are 8, 10, 1, 3 and 5, the values are
first arrayed: 1, 3, 5, 8 and 10. It is now apparent that the value of the median is
5, since two observations are below that value and two observations are above
it. When there is an even number of cases, there is no actual middle item and
the median is taken to be the average of the values of the items lying on either
side of (N + 1)/2, where N is the total number of items. Thus, if the values of six
items of a series are 1, 2, 3, 5, 8 and 10, then the median is the value of item
number (6 + 1)/2 = 3.5, which is approximated as the average of the third and
the fourth items, i.e., (3+5)/2 = 4.
Thus, the steps required for obtaining median are:
1. Arrange the data as an array of increasing magnitude.
2. Obtain the value of the (N + 1)/2th item.
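The two steps can be sketched as below. The function name `median` is illustrative; the data are the two small series used above.

```python
def median(values):
    """Arrange the values in increasing order and return the middle one.

    With an even number of items, the two items on either side of
    position (N + 1)/2 are averaged.
    """
    ordered = sorted(values)  # step 1: array of increasing magnitude
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty series is undefined")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # step 2: the (N + 1)/2th item
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([8, 10, 1, 3, 5]))     # 5
print(median([1, 2, 3, 5, 8, 10]))  # 4.0
```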
Frequency is the number of times a given data value occurs in a data set. A relative
frequency is the fraction of times a data value occurs. Cumulative frequency is the
running total of the relative frequencies up to and including a given value. For
example, the data below gives the number of hours devoted by 20 students of a class
to study at home:
5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3
The following table gives the frequency distribution, relative frequency
distribution and cumulative frequency distribution:
Hours    Number of Students    Relative       Cumulative
         (Frequency)           Frequency      Frequency
2        3                     3/20=0.15      0.15
3        5                     5/20=0.25      0.15+0.25=0.4
4        3                     3/20=0.15      0.4+0.15=0.55
5        6                     6/20=0.3       0.55+0.3=0.85
6        2                     2/20=0.1       0.85+0.1=0.95
7        1                     1/20=0.05      0.95+0.05=1
Total    20
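The table can be built directly from the raw data. The sketch below uses `collections.Counter` from the Python standard library; the variable names are illustrative.

```python
from collections import Counter

hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

counts = Counter(hours)  # frequency of each value
n = len(hours)

rows = []
running_total = 0
for value in sorted(counts):
    running_total += counts[value]
    relative = counts[value] / n    # relative frequency
    cumulative = running_total / n  # cumulative (relative) frequency
    rows.append((value, counts[value], relative, cumulative))

for value, freq, relative, cumulative in rows:
    print(f"{value:>5} {freq:>5} {relative:>6.2f} {cumulative:>6.2f}")
```

Dividing the running count by n, rather than adding up rounded decimals, keeps the last cumulative figure at exactly 1.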
Even in the case of grouped data, the procedure for obtaining median is
straightforward as long as the variable is discrete or non-continuous as is clear
from the following example.
Example 3.10: Obtain the median size of shoes sold from the following data.
Number of Shoes Sold by Size in One Year
Size     Number of Pairs    Cumulative Total
5        30                 30
5½       40                 70
6        50                 120
6½       150                270
7        300                570
7½       600                1170
8        950                2120
8½       820                2940
9        750                3690
9½       440                4130
10       250                4380
10½      150                4530
11       40                 4570
11½      39                 4609
Total    4609
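The median is the size of the $\frac{N+1}{2}$th $= \frac{4609 + 1}{2}$th = 2305th item. Since the
items are already arranged in ascending order (size-wise), the size of the 2305th
item is easily determined by constructing the cumulative frequency. The cumulative
total reaches 2120 at size 8 and 2940 at size 8½, so the 2305th item is of size 8½.
Thus, the median size of shoes sold is 8½, the size of the 2305th item.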
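Locating the median item in a discrete frequency table amounts to finding the first row whose cumulative total reaches (N + 1)/2. The sketch below is illustrative; it uses `itertools.accumulate` and `bisect.bisect_left` from the standard library, and the sizes are written as decimals (5.5 for 5½, and so on). With the cumulative totals of the table, the search lands on the row whose cumulative total is 2940, i.e., size 8½.

```python
from bisect import bisect_left
from itertools import accumulate

sizes = [5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5]
pairs_sold = [30, 40, 50, 150, 300, 600, 950, 820, 750, 440, 250, 150, 40, 39]

cumulative = list(accumulate(pairs_sold))  # running totals: 30, 70, 120, ...
n = cumulative[-1]                         # total number of pairs sold
position = (n + 1) / 2                     # position of the median item

# First row whose cumulative total is at least the median position.
row = bisect_left(cumulative, position)
median_size = sizes[row]

print(n, position, cumulative[row], median_size)  # 4609 2305.0 2940 8.5
```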
In the case of grouped data with continuous variable, the determination
of median is a bit more involved. Consider the following table where the data
relating to the distribution of male workers by average monthly earnings is given.
Clearly the median of the 6291 earnings is the earnings of the (6291 + 1)/2 = 3146th worker
arranged in ascending order of earnings.
From the cumulative frequency, it is clear that this worker has his income
in the class interval 67.5–72.5. But, it is impossible to determine his exact income.
We therefore resort to approximation by assuming that the 795 workers of this
class are distributed uniformly across the interval 67.5 to 72.5. The median
worker is the (3146 − 2713) = 433rd of these 795, and hence, the value corresponding
to him can be approximated as,
$$67.5 + \frac{433}{795} \times (72.5 - 67.5) = 67.5 + 2.72 = 70.22$$
Monthly          No. of     Cumulative No.
Earnings (Rs)    Workers    of Workers
27.5–32.5        120        120
32.5–37.5        152        272
37.5–42.5        170        442
42.5–47.5        214        656
47.5–52.5        410        1066
52.5–57.5        429        1495
57.5–62.5        568        2063
62.5–67.5        650        2713
67.5–72.5        795        3508
72.5–77.5        915        4423
77.5–82.5        745        5168
82.5–87.5        530        5698
87.5–92.5        259        5957
92.5–97.5        152        6109
97.5–102.5       107        6216
102.5–107.5      50         6266
107.5–112.5      25         6291
Total            6291
The value of the median can thus be put in the form of the formula,
$$Me = l + \frac{\frac{N + 1}{2} - C}{f} \times i$$
where l is the lower limit of the median class, i its
width, f its frequency, C the cumulative frequency upto (but not including) the
median class, and N is the total number of cases.
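The formula can be applied mechanically once the median class has been located. The sketch below is an illustration under the same assumption of a uniform spread within the class; the function name `grouped_median` is hypothetical, and the data are the monthly earnings classes (width 5, starting at 27.5) with their frequencies.

```python
def grouped_median(lower_limits, width, frequencies):
    """Median of grouped data: l + (((N + 1)/2 - C) / f) * i."""
    n = sum(frequencies)
    position = (n + 1) / 2
    cumulative_before = 0  # C: cumulative frequency below the current class
    for lower, f in zip(lower_limits, frequencies):
        if cumulative_before + f >= position:  # this is the median class
            return lower + (position - cumulative_before) / f * width
        cumulative_before += f
    raise ValueError("the frequency distribution is empty")

frequencies = [120, 152, 170, 214, 410, 429, 568, 650, 795,
               915, 745, 530, 259, 152, 107, 50, 25]
lower_limits = [27.5 + 5 * k for k in range(len(frequencies))]

print(round(grouped_median(lower_limits, 5, frequencies), 2))  # 70.22
```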
Finding Median by Graphical Analysis
The median can quite conveniently be determined by reference to the ogive
which plots the cumulative frequency against the variable. The value of the item
below which half the items lie, can easily be read from the ogive as is shown in
example 3.11.
Example 3.11: Obtain the median of data given in the following table.
Monthly Earnings    Frequency    Less Than    More Than
27.5                --           --           6291
32.5                120          120          6171
37.5                152          272          6019
42.5                170          442          5849
47.5                214          656          5635
52.5                410          1066         5225
57.5                429          1495         4796
62.5                568          2063         4228
67.5                650          2713         3578
72.5                795          3508         2783
77.5                915          4423         1868
82.5                745          5168         1123
87.5                530          5698         593
92.5                259          5957         334
97.5                152          6109         182
102.5               107          6216         75
107.5               50           6266         25
112.5               25           6291         --
Solution: It is clear that this is grouped data. The first class is 27.5–32.5, whose
frequency is 120, and the last class is 107.5–112.5, whose frequency is 25.
Figure 3.1 shows the ogive of less than cumulative frequency. The median is
the value below which N/2 items lie, i.e., 6291/2 = 3145.5 items, which is read
off from Figure 3.2 as about 70. More accuracy than this is unobtainable because
of the space limitation on the earning scale.
[Figure: LESS THAN and MORE THAN cumulative frequency curves. Vertical axis: Number of Workers (0 to 6291); horizontal axis: monthly earnings (27.5 to 112.5); MEDIAN marked.]
Figure 3.1 Median Determination by Plotting Less than and More than Cumulative
Frequency
The median can also be determined by plotting both less than and more
than cumulative frequency as shown in Figure 3.1. It should be obvious that the
two curves should intersect at the median of the data.
[Figure: cumulative frequency curve (ogive). Vertical axis: Number of Workers (0 to 6000); horizontal axis: monthly earnings (27.5 to 112.5); MEDIAN marked.]
Advantages of Median
1. Median is a positional average and hence the extreme values in the data
set do not affect it as much as they do to the mean.
2. Median is easy to understand and can be calculated from any kind of
data, even from grouped data with open-ended classes.
3. We can find the median even when our data set is qualitative and can be
arranged in the ascending or the descending order, such as average
beauty or average intelligence.
4. Similar to mean, median is also unique, meaning that there is only one
median in a given set of data.
5. Median can be located visually when the data is in the form of ordered
data.
6. The sum of absolute differences of all values in the data set from the
median value is minimum. This means that it is less than any other value
of central tendency in the data set, which makes it more central in certain
situations.
Disadvantages of Median
1. The data must be arranged in order to find the median. This can be very
time consuming for a large number of elements in the data set.
2. The value of the median is affected more by sampling variations. Different
samples from the same population may give significantly different values
of the median.
3. The calculation of median in case of grouped data is based on the
assumption that the values of observation are evenly spaced over the
entire class interval and this is usually not so.
4. Median is comparatively less stable than mean, particularly for small
samples, due to fluctuations in sampling.
5. Median is not suitable for further mathematical treatment. For example,
we cannot compute the median of the combined group from the median
values of different groups.
3.3.3 Mode
The mode is that value of the variable which occurs or repeats itself the greatest
number of times. The mode is the most fashionable size in the sense that it is
the most common and typical, and is defined by Zizek as the value occurring
most frequently in a series (or group of items) and around which the other items
are distributed most densely.
The mode of a distribution is the value at the point around which the items
tend to be most heavily concentrated. It is the most frequent or the most common
value, provided that a sufficiently large number of items are available, to give a
smooth distribution. It will correspond to the value of the maximum point
(ordinate), of a frequency distribution if it is an ideal or smooth distribution. It
may be regarded as the most typical of a series of values. The modal wage, for
example, is the wage received by more individuals than any other wage. The
modal hat size is that, which is worn by more persons than any other single
size.
It may be noted that the occurrence of one or a few extremely high or low
values has no effect upon the mode. If a series of data is unclassified, that is,
if it has neither been arrayed nor put into a frequency distribution, the mode
cannot be readily located.
Taking first an extremely simple example, if seven men are receiving daily
wages of Rs 5, 6, 7, 7, 7, 8 and 10, it is clear that the modal wage is Rs 7 per
day. If we have a series such as 2, 3, 5, 6, 7, 10 and 11, it is apparent that there
is no mode.
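The two wage series above can be checked with a short Python sketch. The function name `modes` is only illustrative, and treating a series in which every value occurs once as having no mode is an assumption that follows the text's second example.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means there is no mode."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        # Every value occurs exactly once, so no value stands out as the mode.
        return []
    return sorted(v for v, c in counts.items() if c == top)

daily_wages = [5, 6, 7, 7, 7, 8, 10]
other_series = [2, 3, 5, 6, 7, 10, 11]
print(modes(daily_wages))    # [7]
print(modes(other_series))   # []
```

Returning a list rather than a single number also covers series in which two or more values tie for the highest frequency.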
There are several methods of estimating the value of the mode. But, it is
seldom that the different methods of ascertaining the mode give us identical
results. Consequently, it becomes necessary to decide as to which method
would be most suitable for the purpose in hand. In order that a choice of the
method may be made, we should understand each of the methods and the
differences that exist among them.
The four important methods of estimating mode of a series are: (i) Locating
the most frequently repeated value in the array; (ii) Estimating the mode by
interpolation; (iii) Locating the mode by graphic method; and (iv) Estimating the
mode from the mean and the median. Only the last three methods are discussed
in this unit.
Estimating the Mode by Interpolation. In the case of continuous
frequency distributions, the problem of determining the value of the mode is not
so simple as it might have appeared from the foregoing description. Having
located the modal class of the data, the next problem in the case of continuous
series is to interpolate the value of the mode within this modal class.
The interpolation is made by the use of any one of the following formulae:
(i) $Mo = l_1 + \frac{f_2}{f_0 + f_2} \times i$;
(ii) $Mo = l_2 - \frac{f_0}{f_0 + f_2} \times i$
(iii) $Mo = l_1 + \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \times i$
Where l1 is the lower limit of the modal class, l2 is the upper limit of the
modal class, f0 equals the frequency of the preceding class in value, f1 equals
the frequency of the modal class in value, f2 equals the frequency of the following
class (class next to modal class) in value, and i equals the interval of the modal
class.
Example 3.12: Determine the mode for the data given in the following table.
Wage Group      Frequency (f)
14–18                 6
18–22                18
22–26                19
26–30                12
30–34                 5
34–38                 4
38–42                 3
42–46                 2
46–50                 1
50–54                 0
54–58
Solution: In the given data, 22–26 is the modal class since it has the largest
frequency. The lower limit of the modal class is 22, its upper limit is 26, its
frequency is 19, the frequency of the preceding class is 18, and that of the
following class is 12. The class interval is 4. Using the various methods of
determining the mode, we have,
(i) $Mo = 22 + \frac{12}{18 + 12} \times 4 = 22 + \frac{8}{5} = 23.6$
(ii) $Mo = 26 - \frac{18}{18 + 12} \times 4 = 26 - \frac{12}{5} = 23.6$
(iii) $Mo = 22 + \frac{19 - 18}{(19 - 18) + (19 - 12)} \times 4 = 22 + \frac{4}{8} = 22.5$
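The three interpolation formulae can be written as small Python functions and checked against the figures of Example 3.12. This is only a sketch: the function names are illustrative, and the arguments follow the text's symbols (l1, l2 for the limits of the modal class, f0, f1, f2 for the frequencies of the preceding, modal and following classes, i for the class interval).

```python
def mode_from_lower_limit(l1, f0, f2, i):
    """Formula (i): move up from the lower limit by f2 / (f0 + f2) of the interval."""
    return l1 + f2 / (f0 + f2) * i

def mode_from_upper_limit(l2, f0, f2, i):
    """Formula (ii): move down from the upper limit by f0 / (f0 + f2) of the interval."""
    return l2 - f0 / (f0 + f2) * i

def mode_from_differences(l1, f0, f1, f2, i):
    """Formula (iii): weight by the differences between the modal and adjoining frequencies."""
    diff_before = f1 - f0
    diff_after = f1 - f2
    return l1 + diff_before / (diff_before + diff_after) * i

print(round(mode_from_lower_limit(22, 18, 12, 4), 2))      # 23.6
print(round(mode_from_upper_limit(26, 18, 12, 4), 2))      # 23.6
print(round(mode_from_differences(22, 18, 19, 12, 4), 2))  # 22.5
```

Formulae (i) and (ii) always agree with each other, since they measure the same point from opposite ends of the modal class.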
In formulae (i) and (ii), the frequency of the classes adjoining the modal
class is used to pull the estimate of the mode away from the midpoint towards
either the upper or lower class limit. In this particular case, the frequency of the
class preceding the modal class is more than the frequency of the class following
and therefore, the estimated mode is less than the midvalue of the modal class.
This seems quite logical. If the frequencies are more on one side of the modal
class than on the other it can be reasonably concluded that the items in the
modal class are concentrated more towards the class limit of the adjoining class
with the larger frequency.
Formula (iii) is also based on a logic similar to that of (i) and (ii). In this
case, to interpolate the value of the mode within the modal class, the differences
between the frequency of the modal class, and the respective frequencies of
the classes adjoining it are used. This formula usually gives results better than
the values obtained by the other two formulae, and exactly equal to the result
obtained by the graphic method. Formulae (i) and (ii) give values which differ
from the value obtained by formula (iii) and lie closer to the central point of
the modal class. If the frequencies of the classes adjoining the modal class are
equal, the mode is expected to be located at the midvalue of the modal class, but
if the frequency on one of the sides is greater, the mode will be pulled away from
the central point. The greater the difference between the frequencies of the
classes adjoining the modal class, the further it will be pulled. In Example 3.12,
the frequency of the modal class is 19 and that of the preceding class is 18. So,
the mode should be quite close to the lower limit of the modal class. The midpoint
of the modal class is 24 and the lower limit of the modal class is 22.
Locating the Mode by the Graphic Method. The method of graphic
interpolation is illustrated in Figure 3.3. The upper corners of the rectangle over
the modal class have been joined by straight lines to those of the adjoining
rectangles as shown in the diagram; the right corner to the corresponding one
of the adjoining rectangle on the left, etc. If a perpendicular is drawn from the
point of intersection of these lines, we have a value for the mode indicated on
the base line. The graphic approach is, in principle, similar to the arithmetic
interpolation explained earlier.
The mode may also be determined graphically from an ogive or cumulative
frequency curve. It is found by drawing a perpendicular to the base from that
point on the curve where the curve is most nearly vertical, i.e., steepest (in
other words, where it passes through the greatest distance vertically and the
smallest distance horizontally). The point where the perpendicular cuts the base gives us the value of the
mode. How accurately this method determines the mode is governed by:
(i) The shape of the ogive, (ii) The scale on which the curve is drawn.
Estimating the Mode from the Mean and the Median. There usually
exists a relationship among the mean, median and mode for moderately
asymmetrical distributions. If the distribution is symmetrical, the mean, median
and mode will have identical values, but if the distribution is skewed (moderately)
the mean, median and mode will pull apart. If the distribution tails off towards
higher values, the mean and the median will be greater than the mode. If it tails
off towards lower values, the mode will be greater than either of the other two
measures. In either case, the median will be about one-third as far away from
the mean as the mode is. This means that,
Mode = Mean − 3 (Mean − Median)
= 3 Median − 2 Mean
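In the case of the average monthly earnings, the mean is 68.53 and the
median is 70.2. If these values are substituted in the above formula, we get,
Mode = 3 (70.2) − 2 (68.53) = 210.6 − 137.06 = 73.54
According to the interpolation formulae used earlier,
$$\text{Mode} = l_1 + \frac{f_2}{f_0 + f_2} \times i = 72.5 + \frac{745}{795 + 745} \times 5 = 72.5 + 2.4 = 74.9$$
OR
$$\text{Mode} = l_1 + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times i = 72.5 + \frac{915 - 795}{2 \times 915 - 795 - 745} \times 5 = 72.5 + \frac{120}{290} \times 5 = 74.57$$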
The difference between the estimates arises because the assumed
relationship between the mean, median and mode does not always hold, and it
evidently does not hold in this case.
Example 3.13: (i) In a moderately asymmetrical distribution, the mode and mean
are 32.1 and 35.4 respectively. Calculate the median.
(ii) If the mode and median of a moderately asymmetrical series are
respectively 16'' and 15.7'', what would be its most probable mean?
(iii) In a moderately skewed distribution, the mean and the median are
respectively 25.6 and 26.1 inches. What is the mode of the distribution?
Solution: (i) We know,
Mean − Mode = 3 (Mean − Median)
or 3 Median = Mode + 2 Mean
or $\text{Median} = \frac{32.1 + 2 \times 35.4}{3} = \frac{102.9}{3} = 34.3$
(ii) 2 Mean = 3 Median − Mode
or $\text{Mean} = \frac{1}{2}(3 \times 15.7 - 16.0) = \frac{31.1}{2} = 15.55$
(iii) Mode = 3 Median − 2 Mean = 3 × 26.1 − 2 × 25.6 = 78.3 − 51.2 = 27.1
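The empirical relation Mode = 3 Median − 2 Mean can be rearranged to give whichever of the three averages is missing. The following Python sketch (the function names are illustrative) applies it to the three parts of Example 3.13; like the relation itself, it only gives an approximation for moderately skewed distributions.

```python
def mode_from_mean_median(mean, median):
    """Mode = 3 Median - 2 Mean."""
    return 3 * median - 2 * mean

def median_from_mean_mode(mean, mode):
    """3 Median = Mode + 2 Mean."""
    return (mode + 2 * mean) / 3

def mean_from_median_mode(median, mode):
    """2 Mean = 3 Median - Mode."""
    return (3 * median - mode) / 2

print(round(median_from_mean_mode(35.4, 32.1), 2))   # part (i)
print(round(mean_from_median_mode(15.7, 16.0), 2))   # part (ii)
print(round(mode_from_mean_median(25.6, 26.1), 2))   # part (iii)
```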
Advantages of Mode
1. Similar to median, the mode is not affected by extreme values in the data.
2. Its value can be obtained in open-ended distributions without ascertaining
the class limits.
3. It can be easily used to describe qualitative phenomenon. For example, if
most people prefer a certain brand of tea, then this will become the modal
point.
4. Mode is easy to calculate and understand. In some cases, it can be located
simply by observation or inspection.
Disadvantages of Mode
1. Quite often, there is no modal value.
2. A distribution can be bi-modal or multi-modal, or every value can be modal,
which makes the significance of the mode more difficult to assess.
3. If there is more than one modal value, the data is difficult to interpret.
4. A mode is not suitable for algebraic manipulations.
5. Since the mode is the value of maximum frequency in the data set, it
cannot be rigidly defined if such frequency occurs at the beginning or at
the end of the distribution.
6. It does not include all observations in the data set, and hence is less
reliable in most situations.
Activity 2
The following figures represent the number of books issued at the counter
of a commerce library on 11 different days. Calculate the median.
96, 180, 98, 75, 270, 20, 102, 100, 94, 75, 200.
Self-Assessment Questions
3. State whether true or false.
(a) The mean is computed by adding all the data values and dividing it
by the number of such values.
(b) The mode is that value of the variable which occurs or repeats itself
the greatest number of times.
4. Fill in the blanks with the appropriate terms.
(a) Weight is represented by symbol w, and Σw represents the
___________ of weights.
(b) Median is that ______________ of a variable which divides the series
in such a manner that the number of items below it is equal to the
number of items above it.
3.4 Quartiles
Some measures, other than measures of central tendency, are often employed
when summarizing or describing a set of data where it is necessary to divide
the data into equal parts. These are positional measures and are called quantiles
and consist of quartiles, deciles and percentiles. The quartiles divide the data
into four equal parts. The deciles divide the total ordered data into ten equal
parts and the percentiles divide the data into 100 equal parts. Consequently,
there are three quartiles, nine deciles and 99 percentiles. The quartiles are
denoted by the symbol Q, and the three quartiles are written as Q1, Q2 and Q3.
Here, Q1 is the point in the ordered data which has 25 per cent of the data
below it and 75 per cent of the data above it. In other words, Q1 is the value
corresponding to the $\left(\frac{n+1}{4}\right)$th ordered observation. Similarly, Q2 divides
the data in the middle, and is also equal to the median. Its value is given by:
Q2 = The value of the $2\left(\frac{n+1}{4}\right)$th ordered observation in the data.
Similarly, we can calculate the values of various deciles. For instance,
D1 = The value of the $\left(\frac{n+1}{10}\right)$th observation in the ordered data, and
D7 = The value of the $7\left(\frac{n+1}{10}\right)$th observation in the ordered data.
Percentiles are generally used in the research area of education where
people are given standard tests and it is desirable to compare the relative position
of a subject's performance on the test. Percentiles are similarly calculated as:
P7 = The value of the $7\left(\frac{n+1}{100}\right)$th observation in the ordered data,
and,
P69 = The value of the $69\left(\frac{n+1}{100}\right)$th observation in the ordered data.
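The positional rule above, the k(n + 1)/m-th ordered observation with m = 4, 10 or 100, can be sketched in Python as follows. The function names and the data are hypothetical, and the use of linear interpolation when the position is not a whole number is an assumption, since the text does not say how a fractional position should be handled.

```python
def value_at_position(ordered, position):
    """Return the value at a 1-based position in ordered data.

    Assumption: a fractional position is resolved by linear interpolation
    between the two neighbouring observations.
    """
    if position <= 1:
        return ordered[0]
    if position >= len(ordered):
        return ordered[-1]
    whole = int(position)
    fraction = position - whole
    below = ordered[whole - 1]
    above = ordered[whole]
    return below + fraction * (above - below)

def quantile(data, k, parts):
    """k-th of `parts` equal divisions: the k(n + 1)/parts-th ordered observation."""
    ordered = sorted(data)
    return value_at_position(ordered, k * (len(ordered) + 1) / parts)

marks = [14, 11, 19, 16, 12, 21, 13, 18, 15, 20, 17]  # hypothetical data, n = 11
print(quantile(marks, 1, 4))             # Q1: the 3rd ordered observation
print(quantile(marks, 2, 4))             # Q2: the 6th ordered observation (the median)
print(round(quantile(marks, 7, 10), 2))  # D7: position 8.4, interpolated
```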
Quartiles
The formula for calculating the values of quartiles for grouped data is given as
follows:
Q = L + (j/f)C
Where,
Q = The quartile under consideration.
L = Lower limit of the class interval which contains the value of Q.
j = The number of units we lack from the class interval which contains
the value of Q, in reaching the value of Q.
f = Frequency of the class interval which contains the value of Q.
C = Size of the class interval.
Ages (CI)         Mid-point (X)     (f)      f(X)      f(X)²
16 and upto 17        16.5            4        66      1089.0
17 and upto 18        17.5           14       245      4287.5
18 and upto 19        18.5           18       333      6160.5
19 and upto 20        19.5           28       546     10647.0
20 and upto 21        20.5           20       410      8405.0
21 and upto 22        21.5           12       258      5547.0
22 and upto 23        22.5            4        90      2025.0
                             Total = 100      1948     38161
In our case, in order to find Q1, where Q1 is the cut-off point so that 25 per
cent of the data is below this point and 75 per cent of the data is above, we see
that the first group has 4 students and the second group has 14 students, making
a total of 18 students. Since Q1 cuts off at 25 students, it is the third class
interval which contains Q1. This means that the value of L in our formula is 18.
Since we already have 18 students in the first two groups, we need 7
more students from the third group to make it a total of 25 students, which is the
value of Q1. Hence, the value of (j) is 7. Also, since the frequency of this third
class interval which contains Q1 is 18, the value of (f) in our formula is 18. The
size of the class interval C is given as 1. Substituting these values in the formula
for Q, we get,
Q1 = 18 + (7/18) × 1
= 18 + 0.39 = 18.39
This means that 25 per cent of the students are below 18.39 years of age
and 75 per cent are above this age.
Similarly, we can calculate the value of Q2, using the same formula. Hence,
Q2 = L + (j/f)C
= 19 + (14/28)1
= 19.5
This also happens to be the median.
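The procedure used above (accumulate the frequencies until the required item is reached, then apply Q = L + (j/f)C) can be sketched in Python. The function name and argument layout are illustrative assumptions; the data are the ages and frequencies of the table above.

```python
def grouped_quantile(lower_limits, freqs, width, target):
    """Apply Q = L + (j / f) * C to the class that contains the target-th item."""
    cumulative = 0
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cumulative + f >= target:
            j = target - cumulative      # items still needed from this class
            return lower + (j / f) * width
        cumulative += f
    raise ValueError("target exceeds the total frequency")

age_lower_limits = [16, 17, 18, 19, 20, 21, 22]
students = [4, 14, 18, 28, 20, 12, 4]          # total = 100

q1 = grouped_quantile(age_lower_limits, students, 1, 25)   # 25th student
q2 = grouped_quantile(age_lower_limits, students, 1, 50)   # 50th student
print(round(q1, 2), q2)
```

For Q1 the loop stops at the class 18 and upto 19 with j = 7, and for Q2 at the class 19 and upto 20 with j = 14, exactly as in the worked solution.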
By using the same formula and the same logic we can calculate the values
of all deciles as well as percentiles.
We have defined the median as the value of the item which is located at
the centre of the array. We can define other measures which are located at
other specified points. Thus, the nth percentile of an array is the value of the
item such that n per cent of the items lie below it. Clearly then, the nth
percentile Pn of grouped data is given by,
$$P_n = l + \frac{\frac{nN}{100} - C}{f} \times i$$
Here, l is the lower limit of the class in which the (nN/100)th item lies, i its width,
f its frequency, C the cumulative frequency upto (but not including) this class,
and N is the total number of items.
We can similarly define the nth decile as the value of the item below
which (nN/10) items of the array lie. Clearly,
$$D_n = P_{10n} = l + \frac{\frac{nN}{10} - C}{f} \times i$$
and, for the nth quartile,
$$Q_n = P_{25n} = l + \frac{\frac{nN}{4} - C}{f} \times i$$
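The percentile formula for grouped data can be sketched as a Python function, with deciles and quartiles obtained from it as the 10n-th and 25n-th percentiles. The names are illustrative, and the class list reuses the age distribution of the quartile example (lower limit, width, frequency) purely as test data.

```python
def grouped_percentile(classes, n):
    """n-th percentile of grouped data: l + ((n * N / 100 - C) / f) * i.

    `classes` is a list of (lower limit l, width i, frequency f) in ascending order.
    """
    total = sum(f for _, _, f in classes)        # N, the total number of items
    target = n * total / 100                     # position of the required item
    cumulative = 0                               # C, frequency below the current class
    for lower, width, f in classes:
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("percentile must lie between 0 and 100")

def grouped_decile(classes, n):
    return grouped_percentile(classes, 10 * n)

def grouped_quartile(classes, n):
    return grouped_percentile(classes, 25 * n)

ages = [(16, 1, 4), (17, 1, 14), (18, 1, 18), (19, 1, 28),
        (20, 1, 20), (21, 1, 12), (22, 1, 4)]
print(grouped_quartile(ages, 2))            # 19.5, the median found earlier
print(round(grouped_decile(ages, 7), 2))    # 7th decile
print(round(grouped_percentile(ages, 90), 2))
```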
Self-Assessment Questions
5. Fill in the blanks with the appropriate terms.
(a) The positional measures are called ______________ and consist of
quartiles, deciles and percentiles.
(b) The nth percentile of an ____________ is the value of the item
such that n per cent of the items lie below it.
6. State whether true or false.
(a) The quartiles divide the data into eight equal parts.
(b) The deciles divide the total ordered data into ten equal parts.
3.5 Range
The crudest measure of dispersion is the range of the distribution. Range of
any series is the difference between the highest and the lowest values in the
series. If the marks received in an examination taken by 248 students are
arranged in the ascending order, then the range will be equal to the difference
between the highest and the lowest marks.
In a frequency distribution, the range is taken to be the difference between
the lower limit of the class at the lower extreme of the distribution and the upper
limit of the class at the upper extreme.
Table 3.2 Weekly Earnings of Labourers in Four Workshops of the Same Type
Weekly Earnings                     No. of Workers
(Rs)              Workshop A   Workshop B   Workshop C   Workshop D
15–16                ...          ...            2          ...
17–18                ...            2            4          ...
19–20                ...            4            4            4
21–22                 10           10           10           14
23–24                 22           14           16           16
25–26                 20           18           14           16
27–28                 14           16           12           12
29–30                 14           10            6           12
31–32                ...            6            6            4
33–34                ...          ...            2            2
35–36                ...          ...          ...          ...
37–38                ...          ...            4          ...
Total                 80           80           80           80
Mean                25.5         25.5         25.5         25.5
Range                  9           15           23           15
From these figures, it is clear that the greater the range, the greater is the
variation of the values in the group.
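Dividing the range of each workshop by its mean, we get:
Workshop A = $\frac{9}{25.5}$        Workshop B = $\frac{15}{25.5}$
Workshop C = $\frac{23}{25.5}$        Workshop D = $\frac{15}{25.5}$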
The relative dispersion of the series is called the coefficient or the ratio of
dispersion. In our example of weekly earnings of workers considered earlier,
the coefficients would be:
Workshop A = $\frac{9}{21 + 30} = \frac{9}{51}$        Workshop B = $\frac{15}{17 + 32} = \frac{15}{49}$
Workshop C = $\frac{23}{15 + 38} = \frac{23}{53}$        Workshop D = $\frac{15}{19 + 34} = \frac{15}{53}$
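The ratios above divide the range by the sum of the two extreme limits. A minimal Python sketch of the range and of this coefficient, using the extreme class limits of the four workshops, might look as follows; the function names are illustrative.

```python
def value_range(lowest, highest):
    """Range: the difference between the highest and the lowest value."""
    return highest - lowest

def coefficient_of_range(lowest, highest):
    """Range divided by the sum of the two extremes."""
    return (highest - lowest) / (highest + lowest)

# (lower limit of the lowest class, upper limit of the highest class)
workshops = {"A": (21, 30), "B": (17, 32), "C": (15, 38), "D": (19, 34)}
for name, (low, high) in workshops.items():
    print(name, value_range(low, high), round(coefficient_of_range(low, high), 3))
```

Workshops B and D have the same range of 15 but slightly different coefficients, because the coefficient also depends on the level of the extreme values.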
(i) Since it is based on two extreme cases in the entire distribution, the range
may be considerably changed if either of the extreme cases happens to
drop out, while the removal of any other case would not affect it at all.
(ii) It does not tell anything about the distribution of values in the series
relative to a measure of central tendency.
(iii) It cannot be computed when distribution has open-end classes.
(iv) It does not take into account the entire data. These limitations can be
illustrated with the data given in Table 3.3.
The table is designed to illustrate three distributions with the same number
of cases but different variability. The removal of two extreme students from
section A would make its range equal to that of B or C.
Table 3.3 Distribution with the Same Number of Cases,
but Different Variability
                      No. of Students
Class         Section A   Section B   Section C
0–10             ...         ...         ...
10–20              1         ...         ...
20–30             12          12          19
30–40             17          20          18
40–50             29          35          16
50–60             18          25          18
60–70             16          10          18
70–80              6           8          21
80–90             11         ...         ...
90–100           ...         ...         ...
Total            110         110         110
Range             80          60          60
(i) In situations where the extremes involve some hazard for which
preparation should be made, it may be more important to know the most
extreme cases to be encountered than to know anything else about the
distribution. For example, an explorer would like to know the lowest and
the highest temperatures on record in the region he is about to enter; or
an engineer would like to know the maximum rainfall during 24 hours for
the construction of a storage.
(ii) In the study of prices of securities, range has a special field of activity.
Thus, to highlight fluctuations in the prices of shares or bullion, it is a
common practice to indicate the range over which the prices have moved
during a certain period of time. This information, besides being of use to
the operators, gives an indication of the stability of the bullion market, or
that of the investment climate.
(iii) In statistical quality control, the range is used as a measure of variation.
For example, we determine the range over which, variations in quality are
due to random causes, which is made the basis for the fixation of control
limits.
Self-Assessment Questions
7. Fill in the blanks with the appropriate terms.
(a) Range of any series is the ______________ between the highest
and the lowest values in the series.
(b) The relative dispersion of the series is called the
___________________ or the ratio of dispersion.
8. State whether true or false.
(a) The crudest measure of dispersion is the range of the distribution.
(b) An absolute measure can not be converted into a relative measure
if we divide it by some other value regarded as standard for the
purpose.
not universally adopted for want of adequacy and accuracy. The range is not
satisfactory as its magnitude is determined by most extreme cases in the entire
group. Further, the range is notable because it is dependent on the item whose
size is largely a matter of chance. Mean deviation method is also an
unsatisfactory measure of scatter, as it ignores the algebraic signs of deviation.
We desire a measure of scatter which is free from these shortcomings. To
some extent, standard deviation is one such measure.
The calculation of standard deviation differs in the following respects from
that of mean deviation. First, in calculating standard deviation, the deviations
are squared. This is done so as to get rid of negative signs without committing
algebraic violence. Further, the squaring of deviations provides added weight
to the extreme items, a desirable feature for certain types of series.
Second, the deviations are always recorded from the arithmetic mean,
because although the sum of deviations is the minimum from the median, the
sum of squares of deviations is minimum when deviations are measured from
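the arithmetic average. The deviation from $\bar{x}$ is represented by $\sigma$.
Thus, standard deviation, $\sigma$ (sigma), is defined as the square root of the
mean of the squares of the deviations of individual items from their arithmetic
mean.
For individual observations,
$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}$$
For a discrete frequency distribution,
$$\sigma = \sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f}}$$
For a continuous frequency distribution, with M as the mid-value of each class,
$$\sigma = \sqrt{\frac{\sum f (M - \bar{x})^2}{\sum f}}$$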
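Here, the formula $\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}$ is appropriate. We first calculate
the mean, $\bar{x} = \sum x / N = 176/11 = 16$, and then the deviations from it:
x         (x − x̄)     (x − x̄)²
11          −5           25
12          −4           16
13          −3            9
14          −2            4
15          −1            1
16           0            0
17          +1            1
18          +2            4
19          +3            9
20          +4           16
21          +5           25
Σx = 176               Σ(x − x̄)² = 110
Substituting in $\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}$, we get
$$\sigma = \sqrt{\frac{110}{11}} = \sqrt{10} = 3.16$$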
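The same calculation can be reproduced with a few lines of Python. This sketch follows the definition in the text (division by N, not N − 1); the function name is illustrative.

```python
import math

def std_dev(values):
    """Square root of the mean of squared deviations from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / n)

data = list(range(11, 22))        # 11, 12, ..., 21
print(round(std_dev(data), 2))    # 3.16
```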
Example 3.15: Find the standard deviation of the data in the following
distribution:
x     12    13    14    15    16    17    18    20
f      4    11    32    21    15     8     5     4
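Solution: For this discrete variable grouped data, we use the formula
$\sigma = \sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f}}$. Since for the calculation of $\bar{x}$ we need $\sum fx$, and then for $\sigma$ we
need $\sum f (x - \bar{x})^2$, the calculations are set out in the following table:
x        f        fx      d = x − x̄      d²       fd²
12       4        48         −3           9        36
13      11       143         −2           4        44
14      32       448         −1           1        32
15      21       315          0           0         0
16      15       240          1           1        15
17       8       136          2           4        32
18       5        90          3           9        45
20       4        80          5          25       100
      Σf = 100  Σfx = 1500                     Σfd² = 304
Here,
$\bar{x} = \sum fx / \sum f = 1500/100 = 15$
and
$$\sigma = \sqrt{\frac{\sum f d^2}{\sum f}} = \sqrt{\frac{304}{100}} = \sqrt{3.04} = 1.74$$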
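For a discrete frequency distribution the same steps (mean from Σfx, then Σfd²) can be sketched in Python and checked against Example 3.15. The function name is illustrative.

```python
import math

def grouped_std_dev(values, freqs):
    """Return (mean, standard deviation) of a discrete frequency distribution."""
    total = sum(freqs)                                         # sum of f
    mean = sum(f * x for x, f in zip(values, freqs)) / total   # sum of fx / sum of f
    sum_fd2 = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))
    return mean, math.sqrt(sum_fd2 / total)

x = [12, 13, 14, 15, 16, 17, 18, 20]
f = [4, 11, 32, 21, 15, 8, 5, 4]
mean, sigma = grouped_std_dev(x, f)
print(mean, round(sigma, 2))      # 15.0 1.74
```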
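$$\sigma = \sqrt{\frac{\sum x'^2}{N} - \left(\frac{\sum x'}{N}\right)^2}$$
$$\sigma = \sqrt{\frac{\sum f x'^2}{\sum f} - \left(\frac{\sum f x'}{\sum f}\right)^2}$$
This formula is valid for both discrete and continuous variables. In case of
continuous variables, x in the equation x' = x − A stands for the mid-value of the
class in question.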
Note that the second term in each of the formulae is a correction term
because of the difference in the values of A and $\bar{x}$. When A is taken as $\bar{x}$ itself,
this correction is automatically reduced to zero. The following examples explain
the use of these formulae.
Example 3.16: Compute the standard deviation by the short-cut method for the
following data:
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21
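Solution: Let us assume that A = 15.
x        x' = (x − 15)      x'²
11           −4              16
12           −3               9
13           −2               4
14           −1               1
15            0               0
16            1               1
17            2               4
18            3               9
19            4              16
20            5              25
21            6              36
N = 11     Σx' = 11      Σx'² = 121
$$\sigma = \sqrt{\frac{\sum x'^2}{N} - \left(\frac{\sum x'}{N}\right)^2} = \sqrt{\frac{121}{11} - \left(\frac{11}{11}\right)^2} = \sqrt{11 - 1} = \sqrt{10} = 3.16$$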
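A short Python sketch of the short-cut method shows that the answer does not depend on the assumed mean A, because the correction term removes the effect of the choice. The function name is illustrative.

```python
import math

def std_dev_shortcut(values, assumed_mean):
    """Short-cut method: sqrt(sum(x'^2)/N - (sum(x')/N)^2), where x' = x - A."""
    n = len(values)
    deviations = [x - assumed_mean for x in values]
    mean_of_squares = sum(d * d for d in deviations) / n
    correction = (sum(deviations) / n) ** 2
    return math.sqrt(mean_of_squares - correction)

data = list(range(11, 22))                    # 11, 12, ..., 21
print(round(std_dev_shortcut(data, 15), 2))   # A = 15, as in Example 3.16
print(round(std_dev_shortcut(data, 16), 2))   # A equal to the true mean: correction is zero
```

Both calls give 3.16, the value obtained earlier by the direct method.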
Example 3.17: Calculate the standard deviation of the following data by the
short-cut method.
x     0–10   10–20   20–30   30–40   40–50   50–60   60–70   70–80
f      18      16      15      12      10       5       2       1
Solution:
Class     Midpoint   Frequency   Deviation from class     Deviation times     Squared deviation
            (x)         (f)      of assumed mean (x')     frequency (fx')     times frequency (fx'²)
0–10         5          18              −2                     −36                   72
10–20       15          16              −1                     −16                   16
20–30       25          15               0                       0                    0
30–40       35          12               1                      12                   12
40–50       45          10               2                      20                   40
50–60       55           5               3                      15                   45
60–70       65           2               4                       8                   32
70–80       75           1               5                       5                   25
                      Σf = 79                         −52 + 60; Σfx' = 8        Σfx'² = 242
Since the deviations are from the assumed mean and expressed in terms of
class-interval units,
$$\sigma = i \times \sqrt{\frac{\sum f x'^2}{N} - \left(\frac{\sum f x'}{N}\right)^2} = 10 \times \sqrt{\frac{242}{79} - \left(\frac{8}{79}\right)^2} = 10 \times 1.75 = 17.5$$
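The step-deviation calculation of Example 3.17 can be sketched in Python as follows. The function name is illustrative; the assumed mean A = 25 and class interval i = 10 are the ones used in the solution table.

```python
import math

def std_dev_step_deviation(midpoints, freqs, assumed_mean, width):
    """sigma = i * sqrt(sum(f x'^2)/N - (sum(f x')/N)^2), with x' = (x - A) / i."""
    n = sum(freqs)
    steps = [(m - assumed_mean) / width for m in midpoints]
    sum_fx = sum(f * s for f, s in zip(freqs, steps))
    sum_fx2 = sum(f * s * s for f, s in zip(freqs, steps))
    return width * math.sqrt(sum_fx2 / n - (sum_fx / n) ** 2)

midpoints = [5, 15, 25, 35, 45, 55, 65, 75]
freqs = [18, 16, 15, 12, 10, 5, 2, 1]
print(round(std_dev_step_deviation(midpoints, freqs, 25, 10), 2))
```

Carried to two decimal places the result is 17.47; the worked solution rounds the square root to 1.75 first and therefore reports 17.5.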
Combining Standard Deviations of Two Distributions
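If we were given two sets of data of $N_1$ and $N_2$ items with means $\bar{x}_1$ and $\bar{x}_2$ and
standard deviations $\sigma_1$ and $\sigma_2$ respectively, we can obtain the mean $\bar{x}$ and the
standard deviation $\sigma$ of the combined distribution by the following formulae:
$$\bar{x} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2}$$
and
$$\sigma = \sqrt{\frac{N_1 \sigma_1^2 + N_2 \sigma_2^2 + N_1 (\bar{x} - \bar{x}_1)^2 + N_2 (\bar{x} - \bar{x}_2)^2}{N_1 + N_2}}$$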
Example 3.18: The mean and the standard deviations of two distributions of
100 and 150 items are 50, 5 and 40, 6 respectively. Find the standard deviation
of all taken together.
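Solution: Combined mean,
$$\bar{x} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2} = \frac{100 \times 50 + 150 \times 40}{100 + 150} = 44$$
$$\sigma = \sqrt{\frac{N_1 \sigma_1^2 + N_2 \sigma_2^2 + N_1 (\bar{x} - \bar{x}_1)^2 + N_2 (\bar{x} - \bar{x}_2)^2}{N_1 + N_2}}$$
$$= \sqrt{\frac{100 (5)^2 + 150 (6)^2 + 100 (44 - 50)^2 + 150 (44 - 40)^2}{100 + 150}} = 7.46$$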
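The combined mean and standard deviation of Example 3.18 can be verified with the following Python sketch; the function name and argument order are illustrative.

```python
import math

def combined_mean_and_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Combine two groups using the formulae for the combined mean and sigma."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    variance = (n1 * sd1 ** 2 + n2 * sd2 ** 2
                + n1 * (mean - mean1) ** 2
                + n2 * (mean - mean2) ** 2) / n
    return mean, math.sqrt(variance)

mean, sigma = combined_mean_and_sd(100, 50, 5, 150, 40, 6)
print(mean, round(sigma, 2))      # 44.0 7.46
```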
Comparison of Various Measures of Dispersion
The range is the easiest measure of dispersion to calculate, but since it depends
on extreme values, it is extremely sensitive to the size of the sample and to the
sample variability. In fact, as the sample size increases, the range tends to
increase, because the more items one considers, the more likely it is
that some item will turn up which is larger than the previous maximum or smaller
than the previous minimum. So, in general, it is impossible to interpret properly
the significance of a given range unless the sample size is constant. It is for this
reason that there appears to be only one valid application of the range, namely
in statistical quality control where the same sample size is repeatedly used, so
that comparison of ranges are not distorted by differences in sample size.
The quartile deviations and other such positional measures of dispersions
are also easy to calculate, but suffer from the disadvantage that they are not
amenable to algebraic treatment. Similarly, the mean deviation is not suitable
because we cannot obtain the mean deviation of a combined series from the
deviations of component series. However, it is easy to interpret and easier to
calculate than the standard deviation.
The standard deviation of a set of data, on the other hand, is one of the
most important statistics describing it. It lends itself to rigorous algebraic treatment,
is rigidly defined and is based on all observations. It is, therefore, quite insensitive
to sample size (provided the size is large enough) and is least affected by
sampling variations.
Self-Assessment Questions
9. Fill in the blanks with the appropriate terms.
(a) The squaring of deviations provides added _________________ to
the extreme items.
(b) Standard deviation, σ (sigma), is defined as the square root of the
mean of the _________________ of the deviations of individual items
from their arithmetic mean.
10. State whether true or false.
(a) In calculating standard deviation, the deviations are squared.
(b) The deviations are sometimes recorded from the arithmetic mean.
3.7 Summary
Let us recapitulate the important concepts discussed in this unit:
3.8 Glossary
Mean: An arithmetic average and measure of central location.
Median: The measure of central tendency that appears in the centre of
an ordered data set.
Mode: Another form of average that can be defined as the most frequently
occurring value in the data.
Quartile: A positional measure that divides the data into four equal parts.
Range: The difference between the maximum and minimum values. It
indicates the limits within which the values fall.
Standard deviation: A measure of the variability or dispersion of a
population, a data set, or a probability distribution. A low standard deviation
indicates that the data points tend to be very close to the same value (the
mean); while high standard deviation indicates that the data are spread
out over a large range of values.
3.10 Answers
Answers to Self-Assessment Questions
1. (a) Denominator; (b) Antecedent
2. (a) True; (b) False
3. (a) True; (b) True
4. (a) Sum; (b) Value
5. (a) Quantiles; (b) Array
6. (a) False; (b) True
7. (a) Difference; (b) Coefficient
8. (a) True; (b) False
9. (a) Weight; (b) Squares
10. (a) True; (b) False
Unit 4
Index Numbers
Structure
4.1 Introduction
Objectives
4.2 Index Numbers
4.3 Summary
4.4 Glossary
4.5 Terminal Questions
4.6 Answers
4.7 Further Reading
4.1 Introduction
In the previous unit you learnt about data analysis techniques such as measures
of dispersion. In this unit you will learn about index numbers, their various types
and the reasons why index numbers are required. Index numbers are a
specialized type of average. They are designed to measure the relative change
in the level of a phenomenon with respect to time, geographical location or
some other characteristic. You will also learn about the different formulae and
methods devised for constructing index numbers and the problems one
faces while constructing them.
Objectives
After studying this unit, you should be able to:
Discuss the various formulae and methods used in constructing index
numbers
Construct index numbers
Use index numbers for various purposes.
[Figure: classification of index numbers. Unweighted: simple aggregate of prices; simple average of price relatives. Weighted: weighted aggregate of prices; weighted average of price relatives.]
$$P_{01} = \frac{\sum P_1}{\sum P_0} \times 100$$
where ΣP1 is the total of the current year prices and ΣP0 is the total of the
base year prices.
Commodity     Unit      Price in 1981     Price in 1982
Milk          litre          2.00              2.50
Butter        kg            12.00             15.00
Cheese        kg            10.00             12.00
Bread         One            2.00              2.50
Eggs          dozen          4.00              5.00
Commodity     Unit        P0            P1
Milk          litre       2.00          2.50
Butter        kg         12.00         15.00
Cheese        kg         10.00         12.00
Bread         One         2.00          2.50
Eggs          dozen       4.00          5.00
                      ΣP0 = 30.00   ΣP1 = 37.00
$$P_{01} = \frac{\sum P_1}{\sum P_0} \times 100 = \frac{37}{30} \times 100 = 123.33\%$$
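The simple aggregate of prices can be reproduced with a short Python sketch. The function name is illustrative; the prices are those of the five commodities tabulated above.

```python
def simple_aggregate_index(base_prices, current_prices):
    """100 * (sum of current year prices) / (sum of base year prices)."""
    return 100 * sum(current_prices) / sum(base_prices)

prices_1981 = [2.00, 12.00, 10.00, 2.00, 4.00]   # milk, butter, cheese, bread, eggs
prices_1982 = [2.50, 15.00, 12.00, 2.50, 5.00]
print(round(simple_aggregate_index(prices_1981, prices_1982), 2))   # 123.33
```

Because the prices are simply added, a commodity quoted in a large unit (such as butter per kg) influences the result more than one quoted in a small unit.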
$$\frac{P_1}{P_0} \times 100$$
(ii) Average these price relatives for the given time period by dividing the
total of price relatives for different commodities by the number of
commodities. Symbolically,
$$P_{01} = \frac{\sum \left[ \frac{P_1}{P_0} \times 100 \right]}{N}$$
Commodity    Unit      P0        P1       Price Relative (P1/P0 × 100)
Milk         litre     2.00      2.50     (2.50/2.00) × 100 = 125
Butter       kg       12.00     15.00     (15/12) × 100 = 125
Cheese       kg       10.00     12.00     (12/10) × 100 = 120
Bread        one       2.00      2.50     (2.50/2.00) × 100 = 125
Eggs         dozen     4.00      5.00     (5/4) × 100 = 125
                                          Σ[(P1/P0) × 100] = 620
N = 5
$$P_{01} = \frac{\sum \left[ \frac{P_1}{P_0} \times 100 \right]}{N} = \frac{620}{5} = 124$$
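The simple average of price relatives can be sketched in Python as follows, using the same five commodities. The function name is illustrative.

```python
def average_of_price_relatives(base_prices, current_prices):
    """Arithmetic mean of the price relatives 100 * P1 / P0."""
    relatives = [100 * p1 / p0 for p0, p1 in zip(base_prices, current_prices)]
    return sum(relatives) / len(relatives)

p0 = [2.00, 12.00, 10.00, 2.00, 4.00]   # milk, butter, cheese, bread, eggs
p1 = [2.50, 15.00, 12.00, 2.50, 5.00]
print(average_of_price_relatives(p0, p1))   # 124.0
```

Unlike the simple aggregate, each commodity enters through a pure ratio, so the units in which prices are quoted no longer matter.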
$$P_{01} = \frac{\sum P_1 q_0}{\sum P_0 q_0} \times 100$$
where
P1 = Price in the current year
P0 = Price in the base year
q0 = Quantity in the base year
According to this method, the index number for each year is obtained in
three steps:
(i) The price of each commodity in each year is multiplied by the base year
quantity of that commodity. For the base year, each product is symbolized
by P0q0, and for the current year by P1q0.
(ii) The products for each year are totalled and ΣP1q0 and ΣP0q0 are obtained.
(iii) ΣP1q0 is divided by ΣP0q0 and the quotient is multiplied by 100 to obtain
the index.
Example 4.3: From the following data, calculate the index number of prices for
1982 with 1972 as base using Laspeyres' method.
                  1972                     1982
Item       Price     Quantity       Price     Quantity
             2           8            4           6
             5          10            6           5
             4          14            5          10
             2          19            2          13
Solution: Representing base year (1972) price by P0, base year quantity by q0,
current year (1982) price by P1 and current year quantity by q1 we have:
Commodity     P0     q0     P1     q1     P0q0     P1q0
               2      8      4      6      16       32
               5     10      6      5      50       60
               4     14      5     10      56       70
               2     19      2     13      38       38
                                      ΣP0q0 = 160   ΣP1q0 = 200
$$P_{01} = \frac{\sum P_1 q_0}{\sum P_0 q_0} \times 100 = \frac{200}{160} \times 100 = 125$$
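The three steps of this method translate directly into a short Python sketch. The function name is illustrative; the prices and quantities are those tabulated in the solutions to Examples 4.3 and 4.4.

```python
def laspeyres_index(p0, p1, q0):
    """100 * sum(P1 * q0) / sum(P0 * q0): base year quantities are the weights."""
    current_value = sum(p * q for p, q in zip(p1, q0))
    base_value = sum(p * q for p, q in zip(p0, q0))
    return 100 * current_value / base_value

p0 = [2, 5, 4, 2]       # 1972 prices
q0 = [8, 10, 14, 19]    # 1972 quantities
p1 = [4, 6, 5, 2]       # 1982 prices
print(laspeyres_index(p0, p1, q0))   # 125.0
```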
Laspeyres' index is very widely used. It tells us about the change in the
aggregate value of the base period list of goods when valued at a given period
price.
However, this index has one drawback. It does not take into consideration
the changes in the consumption pattern that take place with the passage of
time.
(ii) Paasche's Index: In this method, the current year quantities (q1), are taken
as weights. The formula for constructing this index is:
$$P_{01} = \frac{\sum P_1 q_1}{\sum P_0 q_1} \times 100$$
Steps for constructing Paasche's index are the same as those taken
in constructing Laspeyres' index with the only difference that the price of each
commodity in each year is multiplied by the quantity of that commodity in the
current year rather than by the quantity in the base year.
Example 4.4: Taking the data given in Example 4.3, compute the index number
of prices for 1982 with 1972 as base, using Paasche's method.
Commodity     P0     q0     P1     q1     P0q1     P1q1
               2      8      4      6      12       24
               5     10      6      5      25       30
               4     14      5     10      40       50
               2     19      2     13      26       26
                                      ΣP0q1 = 103   ΣP1q1 = 130
$$P_{01} = \frac{\sum P_1 q_1}{\sum P_0 q_1} \times 100 = \frac{130}{103} \times 100 = 126.21$$
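The only change from the previous method is the set of weights, as the following Python sketch of Example 4.4 shows. The function name is illustrative.

```python
def paasche_index(p0, p1, q1):
    """100 * sum(P1 * q1) / sum(P0 * q1): current year quantities are the weights."""
    current_value = sum(p * q for p, q in zip(p1, q1))
    base_value = sum(p * q for p, q in zip(p0, q1))
    return 100 * current_value / base_value

p0 = [2, 5, 4, 2]       # 1972 prices
p1 = [4, 6, 5, 2]       # 1982 prices
q1 = [6, 5, 10, 13]     # 1982 quantities
print(round(paasche_index(p0, p1, q1), 2))   # 126.21
```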
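$$P_{01} = \frac{\frac{\sum P_1 q_0}{\sum P_0 q_0} + \frac{\sum P_1 q_1}{\sum P_0 q_1}}{2} \times 100$$
or
$$P_{01} = \frac{L + P}{2}$$
Where L = Laspeyres' index
P = Paasche's index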
Example 4.5: Compute the index number of prices for 1976 with 1970 as base
using the Bowley-Drobisch method from the following data.
              1970                 1976
Items    Price   Quantity     Price   Quantity
  1        2       20           5       15
  2        4        4           8        5
  3        1       10           2       12
  4        5        5          10        6
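Solution:
Item    P0    q0    P1    q1    P0q0   P0q1   P1q0   P1q1
  1      2    20     5    15     40     30    100     75
  2      4     4     8     5     16     20     32     40
  3      1    10     2    12     10     12     20     24
  4      5     5    10     6     25     30     50     60
                        Total    91     92    202    199

$$P_{01} = \frac{\dfrac{\sum P_1 q_0}{\sum P_0 q_0} + \dfrac{\sum P_1 q_1}{\sum P_0 q_1}}{2} \times 100 = \frac{\dfrac{202}{91} + \dfrac{199}{92}}{2} \times 100 = \frac{2.2198 + 2.1630}{2} \times 100$$
$$= 4.3828 \times 50 = 219.14$$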
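As a cross-check on the arithmetic, the Bowley-Drobisch index can be computed as the simple average of the two component indices. The sketch below uses the data of Example 4.5; the names `value_sum` and `bowley_price_index` are illustrative assumptions.

```python
def value_sum(prices, quantities):
    """Sum of price * quantity over all items."""
    return sum(p * q for p, q in zip(prices, quantities))

def bowley_price_index(p0, q0, p1, q1):
    """Arithmetic mean of the Laspeyres and Paasche price indices."""
    laspeyres = 100 * value_sum(p1, q0) / value_sum(p0, q0)
    paasche = 100 * value_sum(p1, q1) / value_sum(p0, q1)
    return (laspeyres + paasche) / 2

# Data of Example 4.5 (1970 = base year, 1976 = current year)
p0 = [2, 4, 1, 5]
q0 = [20, 4, 10, 5]
p1 = [5, 8, 2, 10]
q1 = [15, 5, 12, 6]

print(round(bowley_price_index(p0, q0, p1, q1), 2))  # 219.14
```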
(iv) Marshall-Edgeworth Method: In this method, the sums of base year and
current year quantities are taken as weights. The formula for constructing the
index is:
$$P_{01} = \frac{\sum P_1 (q_0 + q_1)}{\sum P_0 (q_0 + q_1)} \times 100$$
or
$$P_{01} = \frac{\sum P_1 q_0 + \sum P_1 q_1}{\sum P_0 q_0 + \sum P_0 q_1} \times 100$$
Example 4.6: For the data given in Example 4.5, compute index number of
prices for 1976 with 1970 as base using the Marshall-Edgeworth formula:
Solution: Computation of price index by Marshall-Edgeworth formula:
Item    P0    q0    P1    q1    P0q0   P0q1   P1q0   P1q1
  1      2    20     5    15     40     30    100     75
  2      4     4     8     5     16     20     32     40
  3      1    10     2    12     10     12     20     24
  4      5     5    10     6     25     30     50     60
                        Total    91     92    202    199
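$$P_{01} = \frac{\sum P_1 (q_0 + q_1)}{\sum P_0 (q_0 + q_1)} \times 100 = \frac{\sum P_1 q_0 + \sum P_1 q_1}{\sum P_0 q_0 + \sum P_0 q_1} \times 100$$
$$= \frac{202 + 199}{91 + 92} \times 100 = \frac{401}{183} \times 100 \approx 219.13$$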
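The Marshall-Edgeworth weights are simply the item-wise sums of the two quantity columns. A short sketch with the data of Example 4.6 follows; the function name is an assumption introduced only for illustration.

```python
def marshall_edgeworth_price_index(p0, q0, p1, q1):
    """100 * sum(p1*(q0+q1)) / sum(p0*(q0+q1))."""
    weights = [a + b for a, b in zip(q0, q1)]          # q0 + q1 for each item
    numerator = sum(p * w for p, w in zip(p1, weights))
    denominator = sum(p * w for p, w in zip(p0, weights))
    return 100 * numerator / denominator

# Data of Example 4.6 (same as Example 4.5)
p0 = [2, 4, 1, 5]
q0 = [20, 4, 10, 5]
p1 = [5, 8, 2, 10]
q1 = [15, 5, 12, 6]

print(round(marshall_edgeworth_price_index(p0, q0, p1, q1), 2))  # 219.13
```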
(v) Kelly's Method: In this method, neither base year nor current year quantities
are taken as weights. Instead, the quantities of some reference year or the
average quantity of two or more years may be taken as weights. The formula
for constructing the index is:
$$P_{01} = \frac{\sum P_1 q}{\sum P_0 q} \times 100$$
 q (units)     P0      P1      P0q     P1q
    10        100     160     1000    1600
     7        200     210     1400    1470
    15         50      60      750     900
     9         20      30      180     270
    10         10      14      100     140
                     Total    3430    4380
$$P_{01} = \frac{\sum P_1 q}{\sum P_0 q} \times 100 = \frac{4380}{3430} \times 100 = 127.697$$
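Kelly's formula has the same shape as the Laspeyres formula, with a fixed set of reference quantities in place of the base year quantities. The sketch below reproduces the figures tabulated above; `kelly_price_index` is a hypothetical helper name.

```python
def kelly_price_index(p0, p1, q):
    """Kelly's fixed-weight price index: 100 * sum(p1*q) / sum(p0*q)."""
    numerator = sum(p * w for p, w in zip(p1, q))
    denominator = sum(p * w for p, w in zip(p0, q))
    return 100 * numerator / denominator

q = [10, 7, 15, 9, 10]          # fixed reference quantities (units)
p0 = [100, 200, 50, 20, 10]     # base year prices
p1 = [160, 210, 60, 30, 14]     # current year prices

print(round(kelly_price_index(p0, p1, q), 3))  # 127.697
```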
(vi) Fisher's Ideal Index: This index is the geometric mean of the Laspeyres
and Paasche's indices.
The formula for constructing the index is:
$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}} \times 100$$
Commodity   P0    q0    P1    q1    P0q0   P0q1   P1q0   P1q1
    A       20     8    40     6    160    120    320    240
    B       50    10    60     5    500    250    600    300
    C       40    15    50    10    600    400    750    500
    D       20    20    20    15    400    300    400    300
                            Total  1660   1070   2070   1340

$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}} \times 100 = \sqrt{\frac{2070}{1660} \times \frac{1340}{1070}} \times 100 \approx 124.97$$
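Fisher's index can be checked numerically as the geometric mean of the Laspeyres and Paasche ratios. This is a sketch only; the helper names are assumptions, and the data are the four commodities A to D tabulated above.

```python
from math import sqrt

def value_sum(prices, quantities):
    """Sum of price * quantity over all items."""
    return sum(p * q for p, q in zip(prices, quantities))

def fisher_price_index(p0, q0, p1, q1):
    """Geometric mean of the Laspeyres and Paasche price indices."""
    laspeyres = value_sum(p1, q0) / value_sum(p0, q0)
    paasche = value_sum(p1, q1) / value_sum(p0, q1)
    return 100 * sqrt(laspeyres * paasche)

# Commodities A, B, C, D
p0 = [20, 50, 40, 20]
q0 = [8, 10, 15, 20]
p1 = [40, 60, 50, 20]
q1 = [6, 5, 10, 15]

print(round(fisher_price_index(p0, q0, p1, q1), 2))  # 124.97
```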
(ii) Determine the value weight of each commodity in the group by multiplying
its price in base year by its quantity in the base year, i.e., calculate P0q0
for each commodity. If, however, current year quantities are given, then
the weights shall be represented by P1q1.
(iii) Multiply the price relative of each commodity by its value weight as
calculated in (ii).
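$$P_{01} = \frac{\sum \left[ \left( \dfrac{P_1}{P_0} \times 100 \right) P_0 q_0 \right]}{\sum P_0 q_0} \quad \text{or} \quad \frac{\sum PV}{\sum V}$$
where $P$ is the price relative $\frac{P_1}{P_0} \times 100$ and $V$ is the value weight $P_0 q_0$.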
                              Price
Commodity   Quantity     1985     1986
    A         100          8       12
    B          25          6        8
    C          10          5       15
    D          20         10       25

Commodity    q0     P0     P1     (P1/P0) × 100 or P     P0q0 or V       PV
    A       100      8     12          150.00               800        120000
    B        25      6      8          133.33               150         20000
    C        10      5     15          300.00                50         15000
    D        20     10     25          250.00               200         50000
                                                  Total     1200        205000

Weighted average of price relative index
$$P_{01} = \frac{\sum \left[ \left( \dfrac{P_1}{P_0} \times 100 \right) P_0 q_0 \right]}{\sum P_0 q_0} = \frac{\sum PV}{\sum V} = \frac{205000}{1200} = 170.83$$
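The steps of the weighted average of price relatives method translate directly into code. The sketch below uses the 1985 and 1986 figures tabulated above; the function name is an assumption.

```python
def weighted_price_relative_index(p0, p1, q0):
    """Weighted average of price relatives with base year value weights."""
    relatives = [100 * cur / base for base, cur in zip(p0, p1)]  # P = (p1/p0) * 100
    weights = [base * qty for base, qty in zip(p0, q0)]          # V = p0 * q0
    weighted_total = sum(r * w for r, w in zip(relatives, weights))
    return weighted_total / sum(weights)

# Commodities A, B, C, D (1985 = base year, 1986 = current year)
q0 = [100, 25, 10, 20]
p0 = [8, 6, 5, 10]
p1 = [12, 8, 15, 25]

print(round(weighted_price_relative_index(p0, p1, q0), 2))  # 170.83
```

Because each relative is multiplied by $P_0 q_0$, the $P_0$ cancels and the result coincides with the Laspeyres index $\sum P_1 q_0 / \sum P_0 q_0 \times 100 = 2050/1200 \times 100$.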
changes in fashion, tastes and habits of the people. In such cases comparison
with the preceding year is more worthwhile.
Selection of Commodities or Items
While constructing an index number, it is not possible to take into account all
the items whose price changes are to be represented by the index number.
Hence, the need for selecting a sample. For example, while constructing a
general purpose wholesale price index, it is impossible to take all the items.
Thus, only a few representative items are selected from the whole lot. While
selecting the sample, the following points should be kept in mind.
(i) The selected commodity or item should be representative of the tastes,
customs and necessities of the people to whom the index number relates.
(ii) It should be stable in quality and as far as possible should be standardized
or graded so that it can easily be identified after a time lapse.
(iii) The sample should be as large as possible. Theoretically, the larger the
number of items, the more accurate the results disclosed by an
index number. However, the larger the number of items, the
greater the cost and the time taken.
(iv) As different varieties of a commodity are sold in the market, a decision
has to be made as to which variety should be included in the index
numbers. Ordinarily, all those varieties which are in common use should
be included.
Obtaining Price Quotations
After selecting the items, the next problem is to collect their prices. The price of
a commodity varies from place to place and even from shop to shop in the
same market. Just as it is not possible to include all the commodities in an index
number, it is similarly impractical to collect price quotations from all places where
a commodity is bought or sold. Thus, a selection is to be made of representative
places and shops. Generally, such places and shops are selected where the
commodity is bought and sold in large quantities. After selecting the places and
shops from where price quotations are to be obtained, the next step is to appoint
some representatives who will supply the price quotations from time to time.
Since prices can be quoted in two ways, i.e., either by expressing the
quantity of commodity per unit of money or by expressing the quantity of money
per unit of commodity, a decision has to be made regarding the manner in
which prices are to be quoted. It is better to quote the price of a commodity X as
50 paise per kg rather than quoting it as 2 kg per one rupee.
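Laspeyres method:
$$Q_{01} = \frac{\sum q_1 P_0}{\sum q_0 P_0} \times 100$$
Paasche's method:
$$Q_{01} = \frac{\sum q_1 P_1}{\sum q_0 P_1} \times 100$$
Bowley-Drobisch method:
$$Q_{01} = \frac{\dfrac{\sum q_1 P_0}{\sum q_0 P_0} + \dfrac{\sum q_1 P_1}{\sum q_0 P_1}}{2} \times 100$$
Marshall-Edgeworth method:
$$Q_{01} = \frac{\sum q_1 (P_0 + P_1)}{\sum q_0 (P_0 + P_1)} \times 100$$
Fisher's ideal formula:
$$Q_{01} = \sqrt{\frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}} \times 100$$
Kelly's method:
$$Q_{01} = \frac{\sum q_1 P}{\sum q_0 P} \times 100$$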
Example 4.10: Compute quantity index for the year 1982 with base 1980
= 100, for the following data, using (i) Laspeyres method, (ii) Paasche's method,
(iii) Bowley-Drobisch method, (iv) Marshall-Edgeworth method, and (v) Fisher's
ideal formula.

                  Prices              Quantities
Commodity     1980     1982        1980     1982
    A          5.00     6.50         5        7
    B          7.75     8.80         6       10
    C          9.63     7.75         4        6
    D         12.50    12.75         9        9

Commodity     P0       q0      P1       q1     q0P0     q0P1     q1P0     q1P1
    A          5.00     5      6.50      7     25.00    32.50    35.00    45.50
    B          7.75     6      8.80     10     46.50    52.80    77.50    88.00
    C          9.63     4      7.75      6     38.52    31.00    57.78    46.50
    D         12.50     9     12.75      9    112.50   114.75   112.50   114.75
                                       Total  222.52   231.05   282.78   294.75
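(i) Laspeyres quantity index or Q01
$$= \frac{\sum q_1 P_0}{\sum q_0 P_0} \times 100 = \frac{282.78}{222.52} \times 100 = 127.08$$
(ii) Paasche's quantity index or Q01
$$= \frac{\sum q_1 P_1}{\sum q_0 P_1} \times 100 = \frac{294.75}{231.05} \times 100 = 127.57$$
(iii) Bowley-Drobisch quantity index or Q01
$$= \frac{\dfrac{\sum q_1 P_0}{\sum q_0 P_0} + \dfrac{\sum q_1 P_1}{\sum q_0 P_1}}{2} \times 100 = \frac{1.2708 + 1.2757}{2} \times 100 = 127.325$$
(iv) Marshall-Edgeworth quantity index or Q01
$$= \frac{\sum q_1 P_0 + \sum q_1 P_1}{\sum q_0 P_0 + \sum q_0 P_1} \times 100 = \frac{282.78 + 294.75}{222.52 + 231.05} \times 100 \approx 127.33$$
(v) Quantity index by Fisher's ideal formula or Q01
$$= \sqrt{\frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}} \times 100 = \sqrt{\frac{282.78}{222.52} \times \frac{294.75}{231.05}} \times 100$$
= 1.273 × 100
= 127.3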
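Quantity indices use the same formulas as price indices with the roles of price and quantity interchanged. The following sketch recomputes three of the results of Example 4.10; the variable and function names are assumptions chosen for this illustration.

```python
from math import sqrt

def value_sum(a, b):
    """Sum of item-wise products of two columns."""
    return sum(x * y for x, y in zip(a, b))

# Data of Example 4.10 (1980 = base year, 1982 = current year)
p0 = [5.00, 7.75, 9.63, 12.50]
p1 = [6.50, 8.80, 7.75, 12.75]
q0 = [5, 6, 4, 9]
q1 = [7, 10, 6, 9]

# Quantities change between numerator and denominator; prices act as weights.
laspeyres_q = 100 * value_sum(q1, p0) / value_sum(q0, p0)
paasche_q = 100 * value_sum(q1, p1) / value_sum(q0, p1)
fisher_q = sqrt(laspeyres_q * paasche_q)

print(round(laspeyres_q, 2), round(paasche_q, 2), round(fisher_q, 1))
# 127.08 127.57 127.3
```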
$$V = \frac{\sum P_1 q_1}{\sum P_0 q_0} \times 100$$ where V = value index
In most cases, the value figures given in the formula may be stated more
simply as:
$$V = \frac{\sum V_1}{\sum V_0}$$
In this type of index, both price and quantity are variable in the numerator.
Weights are not to be applied because they are inherent in the value figures. A
value index, therefore, is an aggregate of values.
Tests of Consistency
As there are several formulae for constructing index numbers, the problem is to
select the most appropriate formula in a given situation. Irving Fisher has
suggested two tests for selecting an appropriate formula. These are:
(i) Time reversal test
(ii) Factor reversal test
Time reversal test
According to Fisher, the formula for calculating the index should be such that it
gives the same ratio between one point of comparison and another no matter
which of the two is taken as base. In other words, the index number prepared
forward should be the reciprocal of the index number prepared backward. Thus,
if from 1982 to 1983, the prices of a basket of goods have increased from Rs
400 to Rs 800, the index number for 1983 with 1982 as base is 200 per cent.
Now if the index number for 1983 with 1982 as base is 200 per cent, the index
number for 1982 with 1983 as base should be 50 per cent. One figure is the
reciprocal of the other and their product (2 × 0.5) is unity. Therefore, the time reversal
test is satisfied if $P_{01} \times P_{10} = 1$.
The time reversal test is satisfied by:
(i) Fisher's Ideal Formula,
(ii) Marshall-Edgeworth Method
$$P_{01} \times Q_{01} = \frac{\sum P_1 q_1}{\sum P_0 q_0}$$
Where $P_{01}$ represents change in price in the current year, $Q_{01}$ represents
change in quantity in the current year, $\sum P_1 q_1$ represents total value in the current
year, and $\sum P_0 q_0$ represents total value in the base year.
The factor reversal test is satisfied only by Fisher's Ideal Formula. Thus,
Fisher's formula satisfies both the time reversal test and the factor reversal test.
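Proof
According to Fisher's Ideal Index:
$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}}$$
$$P_{10} = \sqrt{\frac{\sum P_0 q_1}{\sum P_1 q_1} \times \frac{\sum P_0 q_0}{\sum P_1 q_0}}$$
$$Q_{01} = \sqrt{\frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}}$$
(i) Thus,
$$P_{01} \times P_{10} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1} \times \frac{\sum P_0 q_1}{\sum P_1 q_1} \times \frac{\sum P_0 q_0}{\sum P_1 q_0}} = \sqrt{1} = 1$$
(ii)
$$P_{01} \times Q_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1} \times \frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}} = \sqrt{\frac{\sum P_1 q_1 \times \sum q_1 P_1}{\sum P_0 q_0 \times \sum q_0 P_0}} = \frac{\sum P_1 q_1}{\sum P_0 q_0}$$
Hence, the factor reversal test is also satisfied by Fisher's Ideal Formula.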
Besides these two tests, two other tests have been suggested by some
authors.
These are, (i) Unit test, (ii) Circular test
Unit test
According to the unit test, the formula for constructing index numbers should be
independent of the units in which prices and quantities are quoted. This test is
satisfied by all the formulae discussed above except the simple aggregative index method.
Circular test
This test is just an extension of the time reversal test for more than two periods
and is based on the shiftability of the base period. This test requires the index
number to work in a circular manner such that if an index is constructed for the
year a on base year b, and for the year b on base year c, we should get the
same result as if we calculate directly an index for year a on base year c without
going through b as an intermediary. Thus, if there are three periods a, b and c,
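the circular test is satisfied if
$$P_{ab} \times P_{bc} \times P_{ca} = 1$$
where each index is expressed as a ratio and $P_{xy}$ denotes the index for period $y$ with period $x$ as base.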
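Example 4.11: From the following data, show that Fisher's Ideal Index satisfies
both the time reversal test and the factor reversal test.

                   1980                  1981
Commodity     Price   Quantity      Price   Quantity
    A           4       10            5        8
    B           6        8            9        7
    C          14        5            7       12
    D           3       12            6        8
    E           5        7            8        5

Solution: Computation for time reversal test and factor reversal test

Commodity    P0    q0    P1    q1    P0q0   P0q1   P1q0   P1q1
    A         4    10     5     8     40     32     50     40
    B         6     8     9     7     48     42     72     63
    C        14     5     7    12     70    168     35     84
    D         3    12     6     8     36     24     72     48
    E         5     7     8     5     35     25     56     40
                             Total   229    291    285    275

Time reversal test:
$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}} = \sqrt{\frac{285}{229} \times \frac{275}{291}}$$
and
$$P_{10} = \sqrt{\frac{\sum P_0 q_1}{\sum P_1 q_1} \times \frac{\sum P_0 q_0}{\sum P_1 q_0}} = \sqrt{\frac{291}{275} \times \frac{229}{285}}$$
$$P_{01} \times P_{10} = \sqrt{\frac{285}{229} \times \frac{275}{291} \times \frac{291}{275} \times \frac{229}{285}} = \sqrt{1} = 1$$

Factor reversal test:
$$P_{01} = \sqrt{\frac{\sum P_1 q_0}{\sum P_0 q_0} \times \frac{\sum P_1 q_1}{\sum P_0 q_1}} \quad \text{and} \quad Q_{01} = \sqrt{\frac{\sum q_1 P_0}{\sum q_0 P_0} \times \frac{\sum q_1 P_1}{\sum q_0 P_1}}$$
$$P_{01} \times Q_{01} = \sqrt{\frac{285}{229} \times \frac{275}{291} \times \frac{291}{229} \times \frac{275}{285}} = \sqrt{\frac{275}{229} \times \frac{275}{229}} = \frac{275}{229} = \frac{\sum P_1 q_1}{\sum P_0 q_0}$$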
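Both tests can also be verified numerically for the data of Example 4.11. The sketch below is illustrative only; `fisher_ratio` is a hypothetical helper that returns Fisher's index as a ratio (not multiplied by 100), so the same function serves for price and quantity indices.

```python
from math import sqrt, isclose

def value_sum(a, b):
    """Sum of item-wise products of two columns."""
    return sum(x * y for x, y in zip(a, b))

def fisher_ratio(a0, b0, a1, b1):
    """Fisher's ideal index as a ratio.

    a0, a1: the variable being indexed in the base and current period.
    b0, b1: the weighting variable in the base and current period.
    """
    laspeyres = value_sum(a1, b0) / value_sum(a0, b0)
    paasche = value_sum(a1, b1) / value_sum(a0, b1)
    return sqrt(laspeyres * paasche)

# Data of Example 4.11 (1980 = period 0, 1981 = period 1)
p0 = [4, 6, 14, 3, 5]
q0 = [10, 8, 5, 12, 7]
p1 = [5, 9, 7, 6, 8]
q1 = [8, 7, 12, 8, 5]

p01 = fisher_ratio(p0, q0, p1, q1)   # price index, forward
p10 = fisher_ratio(p1, q1, p0, q0)   # price index, backward
q01 = fisher_ratio(q0, p0, q1, p1)   # quantity index, forward
value_ratio = value_sum(p1, q1) / value_sum(p0, q0)  # 275 / 229

print(isclose(p01 * p10, 1.0))          # time reversal test: True
print(isclose(p01 * q01, value_ratio))  # factor reversal test: True
```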
Year    Price of Wheat    Index No. (1980 = 100)
1980        100           100
1981        120           (120/100) × 100 = 120
1982        125           (125/100) × 100 = 125
1983        130           (130/100) × 100 = 130
1984        140           (140/100) × 100 = 140
1985        150           (150/100) × 100 = 150
Year    Price of Wheat    Link Relative Index
1980        100           100
1981        120           (120/100) × 100 = 120
1982        125           (125/120) × 100 = 104.167
1983        130           (130/125) × 100 = 104
1984        140           (140/130) × 100 = 107.692
1985        150           (150/140) × 100 = 107.14
Taking the data from Example 4.12, we can show the method of conversion
as follows:
Year    Price of wheat    Link relative    Chain relative
1980        100             100.00         100
1981        120             120.00         (120 × 100)/100 = 120
1982        125             104.167        (104.167 × 120)/100 = 125
1983        130             104.00         (104 × 125)/100 = 130
1984        140             107.692        (107.692 × 130)/100 = 140
1985        150             107.14         (107.14 × 140)/100 = 150
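The conversion between link relatives and chain relatives can be sketched as two short loops. The prices are the wheat prices of Example 4.12; the variable names are assumptions made for illustration.

```python
years = [1980, 1981, 1982, 1983, 1984, 1985]
prices = [100, 120, 125, 130, 140, 150]   # price of wheat

# Link relative: each year's price as a percentage of the preceding year's price.
link = [100.0]
for previous, current in zip(prices, prices[1:]):
    link.append(100 * current / previous)

# Chain relative: link relative of the year times the chain relative
# of the preceding year, divided by 100.
chain = [100.0]
for relative in link[1:]:
    chain.append(relative * chain[-1] / 100)

for year, l, c in zip(years, link, chain):
    print(year, round(l, 3), round(c, 2))
```

With a single commodity and 1980 = 100, the chain relatives reproduce the fixed base index numbers, which is a convenient check on the arithmetic.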
Base Shifting
Sometimes, it becomes necessary to shift the base from one period to another.
This becomes necessary either because the previous base has become too old
and useless for comparison purposes or because comparison has to be made
with another series of index numbers having different base period. This can be
done in two ways,
(i) By reconstructing the series with the new base. This means that the
relatives of each individual item are constructed with the new base and
thus an entirely new series is formed.
(ii) By using a shorter method which is as follows: divide each index number
of the series by the index number of the time period selected as new
base and multiply the quotient by 100. Symbolically,
$$\text{New index number} = \frac{\text{Old index number of the year}}{\text{Old index number of the new base year}} \times 100$$

Year    Index number    Index number (1950 = 100)
1939        100         (100/200) × 100 = 50
1940        110         (110/200) × 100 = 55
1945        120         (120/200) × 100 = 60
1950        200         (200/200) × 100 = 100
1955        400         (400/200) × 100 = 200
1960        380         (380/200) × 100 = 190
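The shorter method of base shifting amounts to one division per year. A minimal sketch, using the series tabulated above and shifting the base to 1950 (the names `old_series` and `shift_base` are assumptions):

```python
def shift_base(series, new_base_year):
    """Divide every index number by the index of the new base year, times 100."""
    base_value = series[new_base_year]
    return {year: 100 * value / base_value for year, value in series.items()}

old_series = {1939: 100, 1940: 110, 1945: 120, 1950: 200, 1955: 400, 1960: 380}
new_series = shift_base(old_series, 1950)

print(new_series)
# {1939: 50.0, 1940: 55.0, 1945: 60.0, 1950: 100.0, 1955: 200.0, 1960: 190.0}
```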
Splicing
Sometimes, an index number series is discontinued because its base has
become too old and so it has lost its utility. A new series of index numbers may
be computed with some recent year as base. For example, the weights of an
index number may have become out of date and a new index with new weights
may be constructed. This would result in two series of index numbers. It may
sometimes be necessary to connect the two series of index number into one
continuous series. The procedure employed for connecting an old series of
index numbers with a revised series, in order to make the series continuous is
called splicing. The process of splicing is very simple and is similar to the one
used in shifting the base. The spliced index numbers are calculated with the
help of the following formula:
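$$\text{Spliced index number} = \frac{\text{Index number of current year (new series)} \times \text{Old index number of new base year}}{100}$$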
Example 4.14: Index A was started in 1969 and continued up to 1975, in which
year another index B was started. Splice index B to index A so that a
continuous series of index numbers from 1969 to date may be available:

Year    Index A    Index B    Index B spliced to index A
1969      100
1970      120
1971      130
1972      200
1973      300
1974      350
1975      400        100      (400 × 100)/100 = 400
1976                 110      (400 × 110)/100 = 440
1977                  90      (400 × 90)/100 = 360
1978                 110      (400 × 110)/100 = 440
1979                  98      (400 × 98)/100 = 392
1980                  96      (400 × 96)/100 = 384

Splicing is very useful for making comparisons between new and old index
numbers.
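The splicing computation of Example 4.14 can be sketched as follows. The dictionary and variable names are assumptions; the overlap year is 1975, where index A stands at 400 and index B at 100.

```python
index_a_at_overlap = 400   # index A in 1975, the base year of index B
index_b = {1975: 100, 1976: 110, 1977: 90, 1978: 110, 1979: 98, 1980: 96}

# Spliced value = new-series index * old index of the new base year / 100
spliced = {year: value * index_a_at_overlap / 100 for year, value in index_b.items()}

print(spliced)
# {1975: 400.0, 1976: 440.0, 1977: 360.0, 1978: 440.0, 1979: 392.0, 1980: 384.0}
```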
Deflating
Deflating is the process of making allowances for the effect of changing price
levels. With increasing price levels, the purchasing power of money is reduced.
As a result, the real wage figures are reduced and the real wages become less
than the money wages. To get the real wage figure, the money wage figure may
be reduced to the extent the price level has risen. The process of calculating
the real wages by applying index numbers to the money wages so as to allow
for the change in the price level is called deflating. Thus, deflating is the process
by which a series of money wages or incomes can be corrected for price changes
to find out the level of real wages or incomes. This is done with the help of the
following formula:
$$\text{Real wage} = \frac{\text{Money wage}}{\text{Price index}} \times 100$$
Year    Money wage    Price index    Real wages                   Real wage index (1977 = 100)
1977       200           100         (200/100) × 100 = 200        (200/200) × 100 = 100
1978       240           150         (240/150) × 100 = 160        (160/200) × 100 = 80
1979       350           200         (350/200) × 100 = 175        (175/200) × 100 = 87.5
1980       360           220         (360/220) × 100 = 163.64     (163.64/200) × 100 = 81.82
1981       360           230         (360/230) × 100 = 156.52     (156.52/200) × 100 = 78.26
1982       380           250         (380/250) × 100 = 152        (152/200) × 100 = 76
1983       400           250         (400/250) × 100 = 160        (160/200) × 100 = 80
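Deflating a wage series is a single division per year, optionally followed by re-expressing the real wages as an index. The sketch below uses the wage and price index figures tabulated above; the variable names are assumptions.

```python
years = [1977, 1978, 1979, 1980, 1981, 1982, 1983]
money_wages = [200, 240, 350, 360, 360, 380, 400]
price_index = [100, 150, 200, 220, 230, 250, 250]

# Real wage = money wage / price index * 100
real_wages = [100 * wage / index for wage, index in zip(money_wages, price_index)]

# Real wage index with the first year (1977) as base
real_wage_index = [100 * wage / real_wages[0] for wage in real_wages]

for year, real, idx in zip(years, real_wages, real_wage_index):
    print(year, round(real, 2), round(idx, 2))
```

Although the money wage doubles between 1977 and 1983, the real wage falls from 200 to 160 because prices rise two and a half times.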
Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Index number shows by its ______________ the changes in a
magnitude which is not susceptible either to accurate measurement
in itself or to direct valuation in practice.
(b) _________________ is the process of making allowances for the
effect of changing price levels.
2. State whether true or false.
(a) The simple average of price relative method is superior to the simple
aggregate of prices method.
(b) The term weight refers to the relative importance of similar items in
the construction of index.
4.3 Summary
Let us recapitulate the important concepts discussed in this unit:
Value means price times quantity. Thus, a value index V is the sum of
the value of a given year divided by the sum of the values for the base
year.
Deflating is the process of making allowances for the effect of changing
price levels. With increasing price levels, the purchasing power of money
is reduced.
4.4 Glossary
Index numbers: An index number measures the relative change in the
magnitude of a group of related, distinct variables in two or more situations.
Index numbers can be used to measure changes in prices, wages,
production, employment, national income, etc., over a period of time.
Splicing: The process employed for connecting an old series of index
numbers with a revised series in order to make the series continuous.
Deflating: The process of making allowances for the effect of changing
price levels.
4.6 Answers
Answers to Self-Assessment Questions
1. (a) Variations; (b) Deflating
2. (a) True; (b) False
Unit 5
Data Representation
Structure
5.1 Introduction
Objectives
5.2 Tables
5.3 Graphs
5.4 Diagrams
5.5 Summary
5.6 Glossary
5.7 Terminal Questions
5.8 Answers
5.9 Further Reading
5.1 Introduction
In the previous unit, you learnt about index numbers, which are a specialized
type of average.
In this unit, you will learn how tables, diagrams and graphs are constructed,
how they are used and why they are important to a business. In any type
of business firm, a large amount of raw data is generated from various business
sources. Such data becomes quite cumbersome and confusing for management
to handle and analyse. In a business firm, data can be of various types, relating
to various categories such as number of each item of the inventory, record of
sales from different departments, keeping an account of all kinds of bills and so
on. It is almost impossible for management to deal with all this data in raw form.
Therefore, such data must be presented in a suitable and summarized form
without any loss of relevant information so that it can be efficiently used for
decision-making. Hence, we construct appropriate tables, graphs and diagrams
to interpret and summarize the entire set of raw data.
In view of the ever increasing importance of statistical data in business
operations and their management, this unit discusses the presentation of data
in the form of graphs, tables and diagrams, their importance and use.
Objectives
After studying this unit, you should be able to:
Explain the types of tables, graphs and diagrams
Construct tables, graphs and diagrams
Describe the concept of frequency polygon and relative frequency
Explain the construction of ogive curves and their types
Construct histograms
Represent and evaluate data in diagrammatic and graphic forms
5.2 Tables
Classification of data is usually followed by tabulation, which is considered the
mechanical part of classification.
Tabulation is the systematic arrangement of data in columns and rows.
The analysis of data is done by arranging the columns and rows to facilitate
comparisons.
Tabulation has the following objectives:
(i) Simplicity. The removal of unnecessary details gives a clear and
concise picture of the data
(ii) Economy of space and time
(iii) Ease in comprehension and remembering
(iv) Facility of comparisons. Comparisons within a table and with other
tables may be made
(v) Ease in handling of totals, analysis, interpretation, etc.
March 1972
Frequency                      Number of Workers
Less than once a month               3780
1 to 4 times a month                 1652
More than 4 times a month             926

                                     Single                 Married
Frequency                      Under 30   Over 30     Under 30   Over 30
Less than once a month            122       374         1404       1880
1 to 4 times a month             1046       202          289        115
More than 4 times a month         881        23          112         10
Total                            2049       599         1805       2005
(ii) A table facilitates comparisons between subdivisions and with other tables.
(iii) It enables the required figures to be located easily.
(iv) It reveals patterns within the figures which might otherwise not have been
obvious, e.g., from the previous table, we can conclude that regular and
frequent cinema attendance is mainly confined to the younger age group.
(v) It makes the summation of items and the detection of errors and omissions,
easier.
(vi) It obviates repetition of explanatory phrases and headings and hence
takes less space.
Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Tabulation is the _____________ arrangement of data in columns
and rows.
(b) Tabulated data can be more easily understood and grasped than
_____________ data.
2. State whether true or false.
(a) A table showing two characteristics is a simple table.
(b) A table facilitates comparisons between subdivisions and with other
tables.
5.3 Graphs
In a graph, the independent variable should always be placed on the horizontal
or X-axis and the dependent variable on the vertical or Y-axis.
Example 5.1: The monthly averages of Retail Price Index from 1996 to 2003
(Jan. 1996 = 100) were as follows:
Year     Retail Price Index
1996         100
1997         105.8
1998         109.0
1999         109.6
2000         110.7
2001         114.5
2002         119.3
2003         122.3

[Figure: line graph of the Retail Price Index (Y-axis, 100 to 125) against Year (X-axis, 1996 to 2003)]
Mid-Point    (f)
   20         5
   30         3
   40         7
   50         5
   60         3
   70         7

[Figure: frequency polygon joining the points (20, 5), (30, 3), (40, 7), (50, 5), (60, 3) and (70, 7)]
Solution: This example shows that the sum of all relative frequencies in a
distribution is 1.

Class Interval    Frequency    Relative Frequency    Explanation
   25–35              7             0.14             7/50 = 0.14
   35–45              9             0.18             9/50 = 0.18
   45–55             22             0.44             etc.
   55–65              7             0.14
   65–75              3             0.06
   75–85              2             0.04
   Total             50             1.00
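Relative frequencies are obtained by dividing each class frequency by the total frequency. A minimal sketch with the class intervals and frequencies of the table above (the variable names are assumptions):

```python
from math import isclose

class_intervals = ["25-35", "35-45", "45-55", "55-65", "65-75", "75-85"]
frequencies = [7, 9, 22, 7, 3, 2]

total = sum(frequencies)                            # 50
relative = [f / total for f in frequencies]         # each class as a proportion

for interval, f, r in zip(class_intervals, frequencies, relative):
    print(interval, f, round(r, 2))

print(isclose(sum(relative), 1.0))  # True: relative frequencies add up to 1
```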
  f1      f2      Rel. Freq. f1    Rel. Freq. f2
   5      12          0.20             0.12
  10      24          0.40             0.24
   6      30          0.24             0.30
   3      19          0.12             0.19
   1      15          0.04             0.15
  25     100          1.00             1.00
[Figure: relative frequency polygon; X-axis: Class marks (15 to 65), Y-axis: Relative Frequency (0.1 to 0.4)]
Class Interval (years)    Mid-point    (f)    Cum. Freq. (less than)    Cum. Freq. (greater than)
15 and up to 25              20         5              5                         30
25 and up to 35              30         3              8                         25
35 and up to 45              40         7             15                         22
45 and up to 55              50         5             20                         15
55 and up to 65              60         3             23                         10
65 and up to 75              70         7             30                          7
(i) 'Less than' ogive. In this case, the 'less than' cumulative frequencies are
plotted against the upper boundaries of their respective class intervals.
Figure 5.2 shows the 'less than' ogive.
[Figure 5.2: 'less than' ogive; X-axis: Class Interval]
(ii) 'Greater than' ogive. In this case, the 'greater than' cumulative frequencies
are plotted against the lower boundaries of their respective class intervals.
[Figure: 'greater than' ogive; X-axis: Class Interval, Y-axis: Cumulative Frequency]
These ogives can be used for comparison purposes. Several ogives can
be drawn on the same grid, preferably with different colours for easier
visualization and differentiation.
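The points plotted on the two ogives can be derived from the class frequencies with a running total. The sketch below uses the frequency distribution of the ages of 30 workers (class midpoints 20 to 70 with frequencies 5, 3, 7, 5, 3, 7); the variable names are assumptions.

```python
from itertools import accumulate

lower_bounds = [15, 25, 35, 45, 55, 65]
upper_bounds = [25, 35, 45, 55, 65, 75]
frequencies = [5, 3, 7, 5, 3, 7]
total = sum(frequencies)  # 30 workers

# 'Less than' cumulative frequencies: running total up to each upper boundary.
less_than = list(accumulate(frequencies))

# 'Greater than' cumulative frequencies: observations at or above each lower boundary.
greater_than = [total - below for below in [0] + less_than[:-1]]

print(list(zip(upper_bounds, less_than)))     # points of the 'less than' ogive
print(list(zip(lower_bounds, greater_than)))  # points of the 'greater than' ogive
```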
Although diagrams and graphs are powerful and effective media for
presenting statistical data, they can only represent a limited amount of information
and they are not of much help when intensive analysis of data is required.
5.3.5 Histograms
A histogram is the graphical description of data and is constructed from a
frequency table. It displays the distribution of a data set and is used for
statistical as well as mathematical calculations.
The word histogram is derived from the Greek word 'histos', which means
'anything set upright', and 'gramma', which means 'drawing, record, writing'. It is
considered the most important basic tool of the statistical quality control process.
In this type of representation, the given data is plotted in the form of a
series of rectangles. Class intervals are marked along the X-axis and the
frequencies along the Y-axis according to a suitable scale. Unlike the bar chart,
which is one dimensional, meaning that only the length of the bar is important
and not the width, a histogram is two-dimensional in which both the length and
the width are important. A histogram is constructed from a frequency distribution
of a grouped data, where the height of the rectangle is proportional to the
respective frequency and the width represents the class interval. Each rectangle
is joined with the other and any blank spaces between the rectangles would
mean that the category is empty and there are no values in that class interval.
Let us construct a histogram for our example of ages of 30 workers. For
convenience's sake, we will present the frequency distribution along with the
midpoint of each interval, where the midpoint is simply the average of the values
of the lower and the upper boundary of each class interval. The frequency
distribution table is shown as follows:
Class Interval (years)    Midpoint    (f)
15 and up to 25              20        5
25 and up to 35              30        3
35 and up to 45              40        7
45 and up to 55              50        5
55 and up to 65              60        3
65 and up to 75              70        7

[Figure: histogram of the ages of 30 workers; X-axis: Class Interval]
Activity 1
The following frequency distribution represents the number of days during
a year that the faculty of the college was absent from work due to illness.
Number of Days    Number of Employees
     0–2                  5
     3–5                 10
     6–8                 20
     9–11                10
    12–14                 5
    Total                50
Self-Assessment Questions
3. State whether true or false.
(a) In a graph, the independent variable should always be placed in a
vertical axis.
(b) A distribution presented with relative frequencies rather than actual
frequencies is called a relative frequency distribution.
4. Fill in the blanks with the appropriate terms.
(a) A direct visual comparison of two ____________ distributions can
be made by drawing their frequency polygons.
(b) A histogram is constructed from a frequency distribution of a grouped
data, where the height of the rectangle is _______________ to the
respective frequency and the width represents the class interval.
5.4 Diagrams
The data we collect can often be more easily understood for interpretation if it is
presented graphically or pictorially. Diagrams and graphs give visual indications
of magnitudes, groupings, trends and patterns in the data. These important
features are more simply presented in the form of graphs. Also, diagrams facilitate
comparisons between two or more sets of data.
The diagrams should be clear and easy to read and understand. Too
much information should not be shown in the same diagram; otherwise, it may
become cumbersome and confusing. Each diagram should include a brief and
self-explanatory title dealing with the subject matter. The scale of the presentation
should be chosen in such a way that the resulting diagram is of appropriate
size. The intervals on the vertical as well as the horizontal axis should be of
equal size; otherwise, distortions would occur.
Diagrams are more suitable to illustrate data which is discrete, while
continuous data is better represented by graphs. The following are the
diagrammatic and the graphic representation methods that are commonly used.
all bars should have the same width so as not to confuse the reader of the
diagram. Additionally, the bars should be equally spaced.
Example 5.5: Construct a subdivided bar chart for the three types of expenditures
in dollars for a family of four for the years 1988, 1989, 1990 and 1991 given as
follows:
Year     Food    Education    Other    Total
1988     3000      2000       3000      8000
1989     3500      3000       4000     10500
1990     4000      3500       5000     12500
1991     5000      5000       6000     16000

[Figure: subdivided bar chart with components Food, Education and Other; X-axis: Year (1988 to 1991), Y-axis: Expenditure]
Figure 5.4 Percentage Component Bar Chart showing Expenses and Savings of
Mr X
These charts can be used if the overall total is not required. Some charts
given earlier show totals also.
Squares: The square diagram is easy and simple to draw. Take the square root
of the values of various given items that are to be shown in the diagrams and
then select a suitable scale to draw the squares.
Example 5.6: Yield of rice in Kgs. per acre of five countries are as follows:
Country       Yield of rice in Kgs per acre
USA                  6400
Australia            1600
UK                   2500
Canada               3600
India                4900

Country       Yield    Square root    Side of square
U.S.A         6400         80             4 cm
Australia     1600         40             2 cm
U.K.          2500         50             2.5 cm
Canada        3600         60             3 cm
India         4900         70             3.5 cm

[Figure: five squares with sides 4 cm, 2 cm, 2.5 cm, 3 cm and 3.5 cm]
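The side of each square is the square root of the value divided by a common scale factor. In the sketch below, the scale of 20 units of square root per centimetre is inferred from the figures of Example 5.6 (80 corresponds to 4 cm); the names are assumptions.

```python
from math import sqrt

yields = {"USA": 6400, "Australia": 1600, "UK": 2500, "Canada": 3600, "India": 4900}
units_per_cm = 20  # assumed scale: a square root of 20 is drawn as 1 cm

sides_cm = {country: sqrt(value) / units_per_cm for country, value in yields.items()}

print(sides_cm)
# {'USA': 4.0, 'Australia': 2.0, 'UK': 2.5, 'Canada': 3.0, 'India': 3.5}
```

Because the side is proportional to the square root, the area of each square is proportional to the original yield.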
Item               % Expenditure
Labour                  25
Cement, Bricks          30
Steel                   15
Timber, Glass           20
Miscellaneous           10

[Figure: pie chart with sectors Labour 25%, Cement, Bricks 30%, Steel 15%, Timber, Glass 20% and Misc 10%]
Pie charts are very useful for comparison purposes, especially when there
are only a few components. If there are too many components, it may become
confusing to differentiate the relative values in the pie.
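When drawing a pie chart by hand, each percentage share is converted into a central angle out of 360 degrees. A small sketch using the percentage expenditure table above (the variable names are assumptions):

```python
shares = {
    "Labour": 25,
    "Cement, Bricks": 30,
    "Steel": 15,
    "Timber, Glass": 20,
    "Miscellaneous": 10,
}

# Central angle of each sector = percentage share of 360 degrees
angles = {item: 360 * percent / 100 for item, percent in shares.items()}

print(angles)
# {'Labour': 90.0, 'Cement, Bricks': 108.0, 'Steel': 54.0, 'Timber, Glass': 72.0, 'Miscellaneous': 36.0}
```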
Category          Number of Students
Undergraduate          64000
Postgraduate           27000
Professionals           8000

Category          Number of Students    Cube root    Side of Cube
Undergraduate          64000               40           4 cm
Postgraduate           27000               30           3 cm
Professional            8000               20           2 cm

[Figure: three cubes with sides 4 cm, 3 cm and 2 cm]
Activity 2
The following table represents the racial breakdown of people in the Flushing
area in Queens, New York.
Race        Number
White       205,000
Black        30,520
Hispanic     20,300
Asians       15,650
Others        5,400
Construct a pie chart to represent this data. (Make sure that the slices of
the pie proportionately represent the various ethnic populations.)
Self-Assessment Questions
5. Fill in the blanks with the appropriate terms.
(a) Each diagram should include a brief and self ______________ title
dealing with the subject matter.
(b) Bars are simply vertical lines where the ______________ of the bars
are proportional to their corresponding numerical values.
6. State whether true or false.
(a) Diagrams and graphs give visual indications of magnitudes,
groupings, trends and patterns in the data.
(b) Diagrams facilitate comparisons between two or more sets of data.
5.5 Summary
Let us recapitulate the important concepts discussed in this unit:
Classification of data is usually followed by tabulation, which is considered
the mechanical part of classification.
Tabulation is the systematic arrangement of data in columns and rows.
The analysis of the data is done by arranging the columns and rows to
facilitate comparisons.
A table should be easy to read and should contain only the relevant details.
If the aim of clarification is not achieved, the table should be redesigned.
In a graph, the independent variable should always be placed on the
horizontal or X-axis and the dependent variable on the vertical or Y-axis.
A frequency polygon is a line chart of frequency distribution in which the
values of discrete variables or midpoints of class intervals are plotted
against the frequencies and these plotted points are joined together by
straight lines.
In a frequency distribution, if the frequency in each class interval is
converted into a proportion, dividing it by the total frequency, we get a
series of proportions called relative frequencies.
Cumulative frequency curve or ogive is the graphic representation of a
cumulative frequency distribution. Ogives are of two types, 'less than'
and 'greater than' ogives.
A histogram is the graphical description of data and is constructed from a
frequency table. It displays the distribution of a data set and is
used for statistical as well as mathematical calculations.
Diagrams and graphs give visual indications of magnitudes, groupings,
trends and patterns in the data.
A pie diagram illustrates the partitioning of a total into its component parts.
5.6 Glossary
Table: The systematic arrangement of data in columns and rows.
Frequency polygon: A line chart of frequency distribution in which the
values of discrete variables or midpoints of class intervals are plotted
against the frequencies and these plotted points are joined together by
straight lines.
Relative frequency: The series of proportions obtained by converting
the frequency of each class interval into a proportion, dividing it by the total frequency.
Ogive curve: A graphic representation of a cumulative frequency
distribution.
Histogram: The graphical description of data constructed from a
frequency table. It displays the distribution of a data set and is
used for statistical as well as mathematical calculations.
Pie diagram: A diagram that enables us to show the partitioning of a total
into its component parts.
5.8 Answers
Answers to Self-Assessment Questions
1. (a) Systematic; (b) Untabulated
2. (a) False; (b) True
3. (a) False; (b) True
Unit 6
Correlation
Structure
6.1 Introduction
Objectives
6.2 Correlation Analysis
6.3 Coefficient of Correlation
6.4 Spearman's Rank Correlation
6.5 Summary
6.6 Glossary
6.7 Terminal Questions
6.8 Answers
6.9 Further Reading
6.1 Introduction
In the previous unit, you learnt about various data representation techniques
and their significance in decision-making.
In this unit, you will learn about correlation analysis. Correlation is one of
the most significant statistics. Correlation can be defined as the interdependence
between variable quantities. If the values of two variables change with respect
to each other, then they are said to be correlated. For example, if the variables
are stock prices and the price of one stock increases at the same time as the price
of another stock increases, then the two stock prices are positively correlated.
If the price of one stock goes down when the price of the other increases, then
the two stock prices are negatively correlated. However, if we are unable to find
a consistent pattern in the variation of the two stock prices, then they are
uncorrelated.
The strength of correlation is measured by the coefficient of correlation.
The value of the coefficient of correlation lies in the interval [−1, +1]. Positive
correlations lie between 0 and +1; 0 means that there is no correlation; negative
correlations lie between −1 and 0. The purpose of doing correlations is to allow
us to make a prediction about one variable based on what we know about
another variable.
Objectives
After studying this unit, you should be able to:
Explain correlation analysis
Evaluate coefficient of determination and coefficient of correlation
Calculate probable error of the coefficient of correlation
Calculate correlation using various methods
Define limitations of correlation analysis
the closer will be the value of the correlation coefficient to +1. The stronger the
negative correlation, the closer will be the correlation coefficient to −1. If the two
stock prices are perfectly uncorrelated, the value of the correlation coefficient is
zero. This can be explained as under:

Changes in Independent    Changes in Dependent    Nature of
Variable                  Variable                Correlation
Increase (+)              Increase (+)            Positive (+)
Decrease (−)              Decrease (−)            Positive (+)
Increase (+)              Decrease (−)            Negative (−)
Decrease (−)              Increase (+)            Negative (−)
(ii) The variation of the Y values around their own mean, viz., $\sum (Y - \bar{Y})^2$,
technically known as the total variation.
$$\sum (Y - \bar{Y})^2 - \sum (Y - \hat{Y})^2 = \sum (\hat{Y} - \bar{Y})^2$$
[Figure: Scatter diagram with income (X, in 00 Rs) on the X-axis and Y on the Y-axis, showing the regression line of Y on X, the mean lines of X and Y, and the total variation $(Y - \bar{Y})$, explained variation $(\hat{Y} - \bar{Y})$ and unexplained variation $(Y - \hat{Y})$ at a specific point.]
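$$r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$$
$$r^2 = 1 - \frac{\text{Unexplained variation}}{\text{Total variation}} = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$$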
6.2.2 Interpreting $r^2$
The coefficient of determination explains how much of the variation in one factor
can be caused or explained by its relationship to another factor. It is the square
of the correlation coefficient. For example, if you have two sets of scores on Tests
X and Y and they correlate at r = 0.90, the coefficient of determination $r^2$ will be
0.81. This can be interpreted as follows: 81% of the variance in Test X is
explained by Test Y.
As a matter of practice, the squared correlations should be interpreted,
because the correlation coefficient is misleading in suggesting the existence of
more correlation than really exists, and the problem gets worse as the correlation
approaches zero.
Example 6.1: Calculate the coefficient of determination ($r^2$) using the data given
below. Calculate and analyse the result.

Observations                           1    2    3    4    5    6    7    8    9   10
Income (X) (00 Rs)                    41   65   50   57   96   94  110   30   79   65
Consumption Expenditure (Y) (00 Rs)   44   60   39   51   80   68   84   34   55   48
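Since,
$$r^2 = 1 - \frac{\text{Unexplained variation}}{\text{Total variation}} = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$$
and, as $\sum (Y - \bar{Y})^2 = \sum Y^2 - n\bar{Y}^2$, we can write,
$$r^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum Y^2 - n\bar{Y}^2}$$
Calculating and putting the various values, we have the following equation:
$$r^2 = 1 - \frac{260.54}{34223 - 10(56.3)^2} = 1 - \frac{260.54}{2526.10} = 0.897$$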
Analysis of Result: The regression equation used to calculate the value of the
coefficient of determination (r2) from the sample data shows that, about 90% of
the variations in consumption expenditure can be explained. In other words, it
means that the variations in income explain about 90% of variations in
consumption expenditure.
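The calculation in Example 6.1 can be checked with a short script. The following is a minimal sketch, assuming the income and consumption figures of the example; the function name `coefficient_of_determination` is illustrative and not part of the text. It fits the least squares line and then applies $r^2 = 1 - \text{unexplained variation}/\text{total variation}$.

```python
# Income (X) and consumption expenditure (Y), both in 00 Rs, from Example 6.1.
income = [41, 65, 50, 57, 96, 94, 110, 30, 79, 65]
consumption = [44, 60, 39, 51, 80, 68, 84, 34, 55, 48]


def coefficient_of_determination(xs, ys):
    """Return r^2 = 1 - (unexplained variation / total variation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least squares slope and intercept of the estimating equation Y = a + bX.
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    sxx = sum(x * x for x in xs) - n * mean_x ** 2
    b = sxy / sxx
    a = mean_y - b * mean_x
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - unexplained / total


r_squared = coefficient_of_determination(income, consumption)
print(round(r_squared, 3))
```

Because the script keeps a and b at full precision, its unexplained variation differs slightly from the hand-computed 260.54, but the coefficient of determination still rounds to about 0.897.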
Activity 1
Using the various correlation methods discussed in the unit, compute the
correlation for the following data:
Person             1     2     3     4     5     6
Height (x)        68    71    62    75    58    60
Self Esteem (y)  4.1   4.6   3.8   4.4   3.2   3.1
Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Correlation is concerned with relationship between two related and
____________ variables.
(b) Coefficient of _____________ is that fraction of the total variation
of Y which is explained by the regression line.
2. State whether true or false.
(a) The word correlation refers to the relationship or interdependence
between two variables.
(b) Correlation can either be positive or negative.
the degree of relationship between the two causally related variables. The value
of this coefficient can never be more than +1 or less than −1. Thus, +1 and −1
are the limits of this coefficient. For a unit change in the independent variable, if
there happens to be a constant change in the dependent variable in the same
direction, then the value of the coefficient will be +1, indicative of perfect
positive correlation; but if such a change occurs in the opposite direction, the
value of the coefficient will be −1, indicating perfect negative correlation. In
practical life, the possibility of obtaining either a perfect positive or a perfect
negative correlation is very remote, particularly in respect of phenomena
concerning the social sciences. If the coefficient of correlation has a zero value,
then it means that there exists no correlation between the variables under study.
There are several methods of finding the coefficient of correlation but the
following ones are considered important:
(i) Coefficient of Correlation by the Method of Least Squares.
(ii) Coefficient of Correlation using Simple Regression Coefficients.
(iii) Coefficient of Correlation through Product Moment Method or Karl
Pearson's Coefficient of Correlation.
Whichever of these above mentioned three methods we adopt, we get
the same value of r.
(i) Coefficient of Correlation by the Method of Least Squares
Under this method, first of all, the estimating equation is obtained using the least
squares method of simple regression analysis. The equation is worked out as,
$$\hat{Y} = a + bX_i$$
$$\text{Total variation} = \sum (Y - \bar{Y})^2$$
$$\text{Unexplained variation} = \sum (Y - \hat{Y})^2$$
$$\text{Explained variation} = \sum (\hat{Y} - \bar{Y})^2$$
Then, by applying the following formulae, we can find the value of the coefficient
of correlation:
$$r = \sqrt{r^2} = \sqrt{\frac{\text{Explained variation}}{\text{Total variation}}} = \sqrt{1 - \frac{\text{Unexplained variation}}{\text{Total variation}}} = \sqrt{1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}}$$
This can also be written as,
$$r = \sqrt{\frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}}$$
Where,
a = Y-intercept
b = Slope of the estimating equation
X = Values of the independent variable
Y = Values of the dependent variable
$\bar{Y}$ = Mean of the observed values of Y
n = Number of items in the sample
(i.e., pairs of observed data)
The plus (+) or the minus (−) sign of the coefficient of correlation worked
out by the method of least squares is related to the sign of b in the estimating
equation, viz., $\hat{Y} = a + bX_i$. If b has a minus sign, the sign of r will also be minus,
but if b has a plus sign, then the sign of r will also be plus. The value of r
indicates the degree along with the direction of the relationship between the
two variables X and Y.
(ii) Coefficient of Correlation using Simple Regression Coefficients
Under this method, the estimating equation of Y and the estimating equation of
X are worked out using the method of least squares. From these estimating
equations we find the regression coefficient of X on Y, i.e., the slope of the
estimating equation of X (symbolically written as $b_{XY}$), which happens to be
equal to $r\frac{\sigma_X}{\sigma_Y}$, and the regression coefficient of Y on X, i.e., the slope of the
estimating equation of Y (symbolically written as $b_{YX}$), which happens to be
equal to $r\frac{\sigma_Y}{\sigma_X}$. For finding r, the square root of the product of
these two regression coefficients is worked out as follows:¹
$$r = \sqrt{b_{XY} \cdot b_{YX}} = \sqrt{r\frac{\sigma_X}{\sigma_Y} \cdot r\frac{\sigma_Y}{\sigma_X}} = \sqrt{r^2} = r$$
As stated earlier, the sign of r will depend upon the sign of the regression
coefficients. If they have a minus sign, then r will take a minus sign, but the sign
of r will be positive if the regression coefficients have positive signs.
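$$r = \frac{\sum xy}{n\,\sigma_X\,\sigma_Y}$$
Where,
x = $(X - \bar{X})$
y = $(Y - \bar{Y})$
$\sigma_X$ = Standard deviation of X series and is equal to $\sqrt{\frac{\sum x^2}{n}}$
$\sigma_Y$ = Standard deviation of Y series and is equal to $\sqrt{\frac{\sum y^2}{n}}$
Hence,
$$r = \frac{\sum xy}{n\,\sigma_X\,\sigma_Y} = \frac{\frac{\sum xy}{n}}{\sqrt{\frac{\sum x^2}{n}}\sqrt{\frac{\sum y^2}{n}}} = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}}$$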
The above formulae are based on obtaining true means (viz., $\bar{X}$ and $\bar{Y}$)
first and then doing all other calculations. This happens to be a tedious task,
particularly if the true means are in fractions. To avoid difficult calculations, we
make use of the assumed means in taking out deviations and doing the related
calculations. In such a situation, we can use the following formula for finding
the value of r:²
(i) In case of ungrouped data:
$$r = \frac{\frac{\sum dX \cdot dY}{n} - \left(\frac{\sum dX}{n}\right)\left(\frac{\sum dY}{n}\right)}{\sqrt{\frac{\sum dX^2}{n} - \left(\frac{\sum dX}{n}\right)^2}\,\sqrt{\frac{\sum dY^2}{n} - \left(\frac{\sum dY}{n}\right)^2}}$$
or
$$r = \frac{\sum dX \cdot dY - \frac{\sum dX \sum dY}{n}}{\sqrt{\sum dX^2 - \frac{(\sum dX)^2}{n}}\,\sqrt{\sum dY^2 - \frac{(\sum dY)^2}{n}}}$$
Where,
$dX = (X - X_A)$
$X_A$ = Assumed average of X
$dY = (Y - Y_A)$
$Y_A$ = Assumed average of Y
$dX^2 = (X - X_A)^2$
$dY^2 = (Y - Y_A)^2$
$dX \cdot dY = (X - X_A)(Y - Y_A)$
n = Number of pairs of observations of X and Y
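(ii) In case of grouped data:
$$r = \frac{\frac{\sum f\,dX \cdot dY}{n} - \left(\frac{\sum f\,dX}{n}\right)\left(\frac{\sum f\,dY}{n}\right)}{\sqrt{\frac{\sum f\,dX^2}{n} - \left(\frac{\sum f\,dX}{n}\right)^2}\,\sqrt{\frac{\sum f\,dY^2}{n} - \left(\frac{\sum f\,dY}{n}\right)^2}}$$
or
$$r = \frac{\sum f\,dX \cdot dY - \frac{\sum f\,dX \sum f\,dY}{n}}{\sqrt{\sum f\,dX^2 - \frac{(\sum f\,dX)^2}{n}}\,\sqrt{\sum f\,dY^2 - \frac{(\sum f\,dY)^2}{n}}}$$
The probable error (P.E.) of the coefficient of correlation is,
$$\text{P.E.} = 0.6745\,\frac{1 - r^2}{\sqrt{n}}$$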
If r is less than its P.E., it is not at all significant. If r is more than P.E., there
is correlation. If r is more than 6 times its P.E. and greater than 0.5, then it is
considered significant.
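The rule of thumb above can be expressed as a small helper. This is a minimal sketch, assuming the formula P.E. = 0.6745(1 − r²)/√n; the function names and the values r = 0.8, n = 25 are illustrative only, and comparing the absolute value of r with P.E. (so that negative correlations are treated symmetrically) is also an assumption.

```python
import math


def probable_error(r, n):
    """Probable error of r for n pairs of observations."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


def interpret(r, n):
    """Apply the rule of thumb stated in the text to r and its probable error."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe and abs(r) > 0.5:
        return "significant"
    return "correlation present"


# Hypothetical values chosen only to illustrate the rule.
print(probable_error(0.8, 25))
print(interpret(0.8, 25))
```

For r = 0.8 and n = 25 the probable error is 0.6745 × 0.36 / 5, so r is far more than six times its P.E. and the helper reports it as significant.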
Example 6.2:
From the following data calculate r between X and Y applying the following
three methods:
(i) The method of least squares.
(ii) The method based on regression coefficients.
(iii) The product moment method.

X    1    2    3    4    5    6    7    8    9
Y    9    8   10   12   11   13   14   16   15
Solution:
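Let us develop the following table for calculating the value of r:

  X      Y     X²     Y²     XY
  1      9      1     81      9
  2      8      4     64     16
  3     10      9    100     30
  4     12     16    144     48
  5     11     25    121     55
  6     13     36    169     78
  7     14     49    196     98
  8     16     64    256    128
  9     15     81    225    135

n = 9; $\sum X = 45$; $\sum Y = 108$; $\sum X^2 = 285$; $\sum Y^2 = 1356$; $\sum XY = 597$
$\bar{X} = 5$; $\bar{Y} = 12$

(i) Coefficient of correlation by the method of least squares:
$$\hat{Y} = a + bX_i$$
Where,
$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{597 - 9(5)(12)}{285 - 9(25)} = \frac{597 - 540}{285 - 225} = \frac{57}{60} = 0.95$$
and
$$a = \bar{Y} - b\bar{X} = 12 - 0.95(5) = 7.25$$
Hence,
$$\hat{Y} = 7.25 + 0.95X_i$$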
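$$r = \sqrt{1 - \frac{\text{Unexplained variation}}{\text{Total variation}}} = \sqrt{1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}} = \sqrt{\frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}}$$
$$r = \sqrt{\frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}} = \sqrt{\frac{54.15}{60}} = \sqrt{0.9025} = 0.95$$
(ii) Coefficient of correlation by the method based on regression coefficients:
Regression coefficient of Y on X,
i.e.,
$$b_{YX} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{597 - 9(5)(12)}{285 - 9(5)^2} = \frac{597 - 540}{285 - 225} = \frac{57}{60}$$
Regression coefficient of X on Y,
i.e.,
$$b_{XY} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2} = \frac{597 - 9(5)(12)}{1356 - 9(12)^2} = \frac{597 - 540}{1356 - 1296} = \frac{57}{60}$$
Hence,
$$r = \sqrt{b_{YX} \cdot b_{XY}} = \sqrt{\frac{57}{60} \times \frac{57}{60}} = \frac{57}{60} = 0.95$$
(iii) Coefficient of correlation by the product moment method:
$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\sum X^2 - n\bar{X}^2}\,\sqrt{\sum Y^2 - n\bar{Y}^2}} = \frac{597 - 9(5)(12)}{\sqrt{285 - 9(5)^2}\,\sqrt{1356 - 9(12)^2}} = \frac{597 - 540}{\sqrt{285 - 225}\,\sqrt{1356 - 1296}} = \frac{57}{\sqrt{60 \times 60}} = \frac{57}{60} = 0.95$$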
Hence, we get the value of r = 0.95. We get the same value applying the
other two methods also. Therefore, whichever method we apply, the results will
be the same.
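The agreement of the three methods can be verified numerically. The sketch below is an illustration only, using the nine (X, Y) pairs of Example 6.2; the variable names are assumptions made for this example.

```python
import math

# Data of Example 6.2.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [9, 8, 10, 12, 11, 13, 14, 16, 15]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sum_xy = sum(x * y for x, y in zip(xs, ys))
sxy = sum_xy - n * mean_x * mean_y                 # sum(XY) - n * mean(X) * mean(Y)
sxx = sum(x * x for x in xs) - n * mean_x ** 2     # sum(X^2) - n * mean(X)^2
syy = sum(y * y for y in ys) - n * mean_y ** 2     # sum(Y^2) - n * mean(Y)^2

# (i) Method of least squares: explained variation / total variation.
b_yx = sxy / sxx
a = mean_y - b_yx * mean_x
explained = a * sum(ys) + b_yx * sum_xy - n * mean_y ** 2
r_least_squares = math.copysign(math.sqrt(explained / syy), b_yx)

# (ii) Square root of the product of the two regression coefficients.
b_xy = sxy / syy
r_regression = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

# (iii) Product moment (Karl Pearson) formula.
r_product_moment = sxy / math.sqrt(sxx * syy)

print(r_least_squares, r_regression, r_product_moment)
```

In the first two methods the square root loses the sign, so the sketch restores it from the sign of the slope, as the text prescribes. All three values come out at approximately 0.95.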
$$k^2 = \frac{\text{Unexplained variation}}{\text{Total variation}} = \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} = \frac{260.54}{2526.10} = 0.103$$
Self-Assessment Questions
3. State whether true or false.
(a) The value of this coefficient can never be more than +1 or less
than -1.
(b) Coefficient of determination (denoted by $k^2$) is the ratio of unexplained
variation to total variation in the Y variable related to the X variable.
4. Fill in the blanks with the appropriate terms.
(a) The coefficient of correlation, symbolically denoted by 'r', measures
the degree of relationship between the two _____________ related
variables.
(b) If r is less than its probable error (P.E.), it is not at all significant but
if r is more than P.E., there is_______________.
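$$\rho = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)}$$

Rank by first    Rank by second    Difference
judge (x)        judge (y)         D = x − y       D²
    1                7                −6           36
    2                5                −3            9
    7                8                −1            1
    9               10                −1            1
    8                9                −1            1
    6                4                +2            4
    4                1                +3            9
    3                6                −3            9
   10                3                +7           49
    5                2                +3            9
                                            ΣD² = 128

$$\rho = 1 - \frac{6\sum D^2}{n^3 - n} = 1 - \frac{6 \times 128}{10^3 - 10} = 1 - 0.776 = 0.224$$
The value of $\rho$ = 0.224 shows that the agreement between the judges is
not high.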
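The rank correlation computed above can be reproduced with a few lines of code. This is a minimal sketch, assuming two lists of untied ranks of equal length; the function name `spearman_rho` and the list names are illustrative.

```python
def spearman_rho(ranks_1, ranks_2):
    """Spearman's rank correlation, 1 - 6*sum(D^2) / (n^3 - n), for untied ranks."""
    n = len(ranks_1)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d_squared / (n ** 3 - n)


# Ranks given by the two judges in the table above.
judge_1 = [1, 2, 7, 9, 8, 6, 4, 3, 10, 5]
judge_2 = [7, 5, 8, 10, 9, 4, 1, 6, 3, 2]

rho = spearman_rho(judge_1, judge_2)
print(round(rho, 3))
```

The squared rank differences add up to 128, which gives a coefficient of about 0.224, in line with the hand calculation.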
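   x      y     x²     y²     xy
   1      7      1     49      7
   2      5      4     25     10
   7      8     49     64     56
   9     10     81    100     90
   8      9     64     81     72
   6      4     36     16     24
   4      1     16      1      4
   3      6      9     36     18
  10      3    100      9     30
   5      2     25      4     10

$\sum x = 55$; $\sum y = 55$; $\sum x^2 = 385$; $\sum y^2 = 385$; $\sum xy = 321$

$$r = \frac{321 - 10\left(\frac{55}{10}\right)\left(\frac{55}{10}\right)}{\sqrt{385 - 10\left(\frac{55}{10}\right)^2}\,\sqrt{385 - 10\left(\frac{55}{10}\right)^2}} = \frac{18.5}{\sqrt{82.5 \times 82.5}} = \frac{18.5}{82.5} = 0.224$$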
This shows that the Spearman $\rho$ for any two sets of ranks is the same as
the Pearson r for the set of ranks. But it is much easier to compute $\rho$.
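The equality of the two coefficients on untied ranks can be checked directly. The following is a sketch under the assumption that both series are complete sets of ranks without ties; the function names are illustrative.

```python
import math


def spearman_rho(ranks_1, ranks_2):
    """Spearman's formula 1 - 6*sum(D^2) / (n^3 - n)."""
    n = len(ranks_1)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2)) / (n ** 3 - n)


def pearson_r(xs, ys):
    """Product moment correlation computed from raw sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    sxx = sum(x * x for x in xs) - n * mean_x ** 2
    syy = sum(y * y for y in ys) - n * mean_y ** 2
    return sxy / math.sqrt(sxx * syy)


# The two sets of ranks used in the worked calculation.
x_ranks = [1, 2, 7, 9, 8, 6, 4, 3, 10, 5]
y_ranks = [7, 5, 8, 10, 9, 4, 1, 6, 3, 2]

rho = spearman_rho(x_ranks, y_ranks)
r = pearson_r(x_ranks, y_ranks)
print(rho, r)
```

Both functions return the same value, about 0.224, up to floating-point rounding, although the Spearman formula needs only the rank differences.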
Often, the ranks are not given. Instead, the numerical values of
observations are given. In such a case, we must attach the ranks to these
values to calculate $\rho$.
Example 6.5: From the following table, compute the coefficient of correlation
between age of husbands and age of wives :
Age of
Husbands
15
25
35
45
55
65
25
35
45
55
65
75
Total
Age of wives
Total
15 25
25 35
35 45
45 55
55 65
65 75
1
2
1
12
4
1
10
3
1
6
2
1
4
1
2
2
2
15
15
10
8
3
17
14
53
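$$r = \frac{N\sum f d_x d_y - \sum f d_x \sum f d_y}{\sqrt{N\sum f d_x^2 - \left(\sum f d_x\right)^2}\;\sqrt{N\sum f d_y^2 - \left(\sum f d_y\right)^2}}$$
$$= \frac{53 \times 86 - 10 \times 16}{\sqrt{53 \times 98 - 10^2}\;\sqrt{53 \times 92 - 16^2}} = 0.907$$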
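$$\frac{\sum xy}{N} = 10$$
Variance of X, $\sigma_x^2 = 16 \Rightarrow \sigma_x = 4$
Variance of Y, $\sigma_y^2 = 9 \Rightarrow \sigma_y = 3$
Thus,
$$r = \frac{\sum xy}{N\,\sigma_x\,\sigma_y} = \frac{\sum xy}{N} \cdot \frac{1}{\sigma_x\,\sigma_y} = \frac{10}{4 \times 3} = 0.833$$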
Example 6.7: The marks of 8 candidates in Mathematics and English are given
below:

Mathematics   76   90   98   69   54   82   67   52
English       25   37   56   12    7   36   23   11
Solution:
Marks in       Marks in    Rank in             Rank in         Rank Difference
Mathematics    English     Mathematics (R1)    English (R2)    (D) = (R1 − R2)     D²
    76            25            4                   4                 0             0
    90            37            2                   2                 0             0
    98            56            1                   1                 0             0
    69            12            5                   6                −1             1
    54             7            7                   8                −1             1
    82            36            3                   3                 0             0
    67            23            6                   5                +1             1
    52            11            8                   7                +1             1
Total                                                              ΣD = 0       ΣD² = 4

Here, N = 8
Rank correlation coefficient,
$$R = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6(4)}{8^3 - 8} = 0.952$$
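When only the marks are given, the ranks have to be attached first. The sketch below illustrates one way of doing this for Example 6.7, assuming that the highest mark receives rank 1 and that there are no tied marks; the helper names are hypothetical.

```python
def attach_ranks(values):
    """Rank values in descending order (highest value gets rank 1); assumes no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(ranks_1, ranks_2):
    """Spearman's formula 1 - 6*sum(D^2) / (N^3 - N)."""
    n = len(ranks_1)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2)) / (n ** 3 - n)


# Marks of the eight candidates in Example 6.7.
mathematics = [76, 90, 98, 69, 54, 82, 67, 52]
english = [25, 37, 56, 12, 7, 36, 23, 11]

r1 = attach_ranks(mathematics)
r2 = attach_ranks(english)
rho = spearman_rho(r1, r2)
print(r1, r2, round(rho, 3))
```

The attached ranks differ for only four candidates, each by one place, so the sum of squared differences is 4 and the coefficient is about 0.952.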
Example 6.8: Compute rank correlation coefficient from the following data of
marks obtained by eight students in the papers of Physics and Mathematics:

Marks in Physics       15   20   27   13   45   60   20   75
Marks in Mathematics   50   30   55   30   25   10   30   70
Solution:
Marks in    Marks in       Rank in    Rank in        Difference
Physics     Mathematics    Physics    Mathematics    in Ranks (D)     D²
   15           50            7           3               4           16
   20           30            5.5         5               0.5          0.25
   27           55            4           2               2            4
   13           30            8           5               3            9
   45           25            3           7              −4           16
   60           10            2           8              −6           36
   20           30            5.5         5               0.5          0.25
   75           70            1           1               0            0
Total                                                            ΣD² = 81.5
In this example, two students have secured equal marks, viz., 20 in Physics,
so the ranks awarded to them are the arithmetic means of the ranks that they
would have got (viz., 5 and 6) had they differed at least by a small number, and
so the ranks awarded to them are $\frac{5 + 6}{2}$ = 5.5 each.
Similarly, three students who got equal marks (30 each) in Mathematics
were accorded the rank $\frac{4 + 5 + 6}{3}$ = 5 each.
Now, with m = 2 and n = 3 being the numbers of tied items in the two series,
$$R = 1 - \frac{6\left[\sum D^2 + \frac{m^3 - m}{12} + \frac{n^3 - n}{12}\right]}{N^3 - N}$$
$$= 1 - \frac{6\left[81.5 + \frac{2^3 - 2}{12} + \frac{3^3 - 3}{12}\right]}{8^3 - 8} = 0$$
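The treatment of tied marks can be sketched in code as follows. This is an illustration only: it assumes that tied values share the average of the ranks they would otherwise occupy and that a correction of (m³ − m)/12 is added for every group of m tied values; the function names are hypothetical.

```python
from collections import Counter


def average_ranks(values):
    """Descending ranks (highest value gets rank 1); tied values share the mean rank."""
    ordered = sorted(values, reverse=True)
    positions = {}
    for position, value in enumerate(ordered, start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m tied values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)


def rank_correlation_with_ties(xs, ys):
    """Rank correlation with the tie correction added to sum(D^2)."""
    n = len(xs)
    ranks_x = average_ranks(xs)
    ranks_y = average_ranks(ys)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    adjusted = sum_d_squared + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n ** 3 - n)


# Marks of the eight students in Example 6.8.
physics = [15, 20, 27, 13, 45, 60, 20, 75]
mathematics = [50, 30, 55, 30, 25, 10, 30, 70]

rho = rank_correlation_with_ties(physics, mathematics)
print(average_ranks(physics), average_ranks(mathematics), rho)
```

Here the sum of squared differences is 81.5 and the two corrections are 0.5 and 2, so the numerator 6 × 84 equals the denominator 504 and the coefficient is zero.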
Example 6.9: Ten competitors in a beauty contest are ranked by three judges
in the following order:

1st Judge    1    6    5   10    3    2    4    9    7    8
2nd Judge    3    5    8    4    7   10    2    1    6    9
3rd Judge    6    4    9    8    1    2    3   10    5    7
 R1     R2     R3    (R1 − R2)² = D²    (R2 − R3)² = D²    (R1 − R3)² = D²
  1      3      6          4                  9                 25
  6      5      4          1                  1                  4
  5      8      9          9                  1                 16
 10      4      8         36                 16                  4
  3      7      1         16                 36                  4
  2     10      2         64                 64                  0
  4      2      3          4                  1                  1
  9      1     10         64                 81                  1
  7      6      5          1                  1                  4
  8      9      7          1                  4                  1
N = 10  N = 10  N = 10   ΣD² = 200          ΣD² = 214          ΣD² = 60

$$R_{12} = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 200}{10^3 - 10} = -0.212$$
$$R_{23} = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 214}{10^3 - 10} = -0.297$$
$$R_{13} = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 60}{10^3 - 10} = 0.636$$
Self-Assessment Questions
5. Fill in the blanks with the appropriate terms.
(a) If observations on two variables are given in the form of ranks and
not as __________ values, then it is possible to compute rank
correlation between the two series.
6.5 Summary
Let us recapitulate the important concepts discussed in this unit:
Correlation analysis is the statistical tool generally used to describe the
degree to which, one variable is related to another.
The theory by means of which quantitative connections between two sets
of phenomena are determined is called the Theory of Correlation.
Correlation can either be positive or it can be negative.
The coefficient of determination can have a value ranging from zero to
one. The value of one can occur only if the unexplained variation is zero,
which simply means that all the data points in the Scatter diagram fall
exactly on the regression line.
The coefficient of correlation, symbolically denoted by r, is another
important measure to describe how well one variable is explained by
another. It measures the degree of relationship between the two causally
related variables. The value of this coefficient can never be more than +1
or less than −1.
Karl Pearson's method is the most widely used method of measuring the
relationship between two variables.
If r is less than its P.E., it is not at all significant. If r is more than P.E.,
there is correlation.
If observations on two variables are given in the form of ranks and not as
numerical values, it is possible to compute what is known as rank
correlation between the two series.
6.6 Glossary
Correlation analysis: A statistical tool used to describe the degree to
which one variable is related to another.
Coefficient of determination: A measure of the degree of linear
association or correlation between two variables, one of which must be
an independent variable and the other, a dependent variable.
Coefficient of correlation: It is symbolically denoted by r and is an
important measure to describe how well one variable is explained by
another. It measures the degree of relationship between the two causally
related variables.
6.8 Answers
Answers to Self-Assessment Questions
1. (a) Quantifiable; (b) Determination
2. (a) True; (b) True
3. (a) True; (b) False
4. (a) Causally; (b) Correlation
5. (a) Numerical; (b) Spearman
6. (a) True; (b) True
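1. $$b_{XY} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2} \quad \text{and} \quad b_{YX} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$$
2. In case we take the assumed mean to be zero for the X variable as well as for the Y variable, then our
formula will be as under:
$$r = \frac{\frac{\sum XY}{n} - \left(\frac{\sum X}{n}\right)\left(\frac{\sum Y}{n}\right)}{\sqrt{\frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2}\,\sqrt{\frac{\sum Y^2}{n} - \left(\frac{\sum Y}{n}\right)^2}}$$
or
$$r = \frac{\sum XY - \frac{\sum X \sum Y}{n}}{\sqrt{\sum X^2 - \frac{(\sum X)^2}{n}}\,\sqrt{\sum Y^2 - \frac{(\sum Y)^2}{n}}}$$
or
$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\sum X^2 - n\bar{X}^2}\,\sqrt{\sum Y^2 - n\bar{Y}^2}}$$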
Unit 7
Regression
Structure
7.1 Introduction
Objectives
7.2 Regression Analysis
7.3 Simple Linear Regression Model
7.4 Summary
7.5 Glossary
7.6 Terminal Questions
7.7 Answers
7.8 Further Reading
7.1 Introduction
In the previous unit, you learnt about correlation, a technique that measures the
degree of relationship between variables.
In this unit, you will learn about regression analysis. Regression is a
statistical measure that determines the strength of relationship between a
dependent variable (variable to be predicted) and, one or more independent
variables (variables on which the prediction is based). It is a commonly used
tool in forecasting and financial analysis. For instance, suppose you want to
forecast sales for your company and it is seen that your company's sales go up
and down depending on changes in GDP. The sales you are forecasting would
be the dependent variable because their value depends on the value of GDP,
which, in turn, would be the independent variable. You would then need to
determine the strength of the relationship between these two variables in order
to forecast sales. If GDP increases/decreases by 1%, how much will your sales
increase or decrease? The regression equation is y=bx+a, where y is the
dependent variable which we intend to forecast, x is the independent variable,
b is the slope of the regression and a is the y-intercept.
You can use this simple model to solve your business problems. If your
research leads you to believe that the next GDP change will be a certain
percentage, you can plug that percentage into the model and generate a sales
forecast. This can help you develop a more objective plan and budget for the
upcoming year. You will also learn about the scatter diagram, least squares
method and standard error of estimate.
Objectives
After studying this unit, you should be able to:
Describe how assumptions are made in regression analysis
Explain simple linear regression model
Define scatter diagram method and least square method
Judge the accuracy of estimating equation
Compute and interpret standard error of the estimate
(b) The values of the dependent variable are random but the values of the
independent variable are fixed quantities without error and are chosen by
the experimenter.
(c) There is clear indication of direction of the relationship. This means that
the dependent variable is a function of the independent variable. (For example,
when we say that advertising has an effect on sales, then we are not saying
that sales has an effect on advertising).
(d) The conditions (that existed when the relationship between the dependent
and independent variable was estimated by the regression) are the same
when the regression model is being used. In other words, it simply means
that the relationship has not changed since the regression equation was
computed.
(e) The analysis is to be used to predict values within the range (and not for
values outside the range) for which it is valid.
Activity 1
Construct a regression line for r = +1.00 and r = −1.00.
Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) The values of the dependent variable are random but the values of
the independent variable are fixed quantities without error and are
chosen by the ________________.
(b) The conditions that existed when the relationship between the
dependent and independent variable was estimated by the regression
are the same when the ___________ model is being used.
2. State whether true or false.
(a) The regression analysis is to be used to predict values within the
range (and not for values outside the range) for which it is valid.
(b) There is not an actual relationship between the dependent and
independent variables.
(a) a represents the Y-intercept, i.e., the intercept specifies the value of the
dependent variable when the independent variable has a value of zero.
(But this term has practical meaning only if a zero value for the independent
variable is possible).
(b) b is a constant, indicating the slope of the regression line. Slope of the
line indicates the amount of change in the value of the dependent variable
for a unit change in the independent variable.
If the two constants (viz., a and b) are known, the accuracy of our prediction
of Y (denoted by $\hat{Y}$ and read as Y-hat) depends on the magnitude of the values
of $e_i$. If in the model, all the $e_i$ tend to have very large values then the estimates
will not be very good, but if these values are relatively small, then the predicted
values ($\hat{Y}$) will tend to be close to the true values ($Y_i$).
Estimating the Intercept and Slope of the Regression Model (or Estimating
the Regression Equation)
The two constants or the parameters viz., a and b in the regression model for
the entire population or universe are generally unknown and as such are
estimated from sample information. The following are the two methods used for
estimation:
(a) Scatter diagram method
(b) Least squares method
Income X                 Consumption Expenditure Y
(Hundreds of Rupees)     (Hundreds of Rupees)
        41                       44
        65                       60
        50                       39
        57                       51
        96                       80
        94                       68
       110                       84
        30                       34
        79                       55
        65                       48
The scatter diagram by itself is not sufficient for predicting values of the
dependent variable. Some formal expression of the relationship between the
two variables is necessary for predictive purposes. For the purpose, one may
simply take a ruler and draw a straight line through the points in the scatter
diagram and this way can determine the intercept and the slope of the said line
and then the line can be defined as $\hat{Y} = a + bX_i$, with the help of which we can
predict Y for a given value of X. But there are shortcomings in this approach.
For example, if five different persons draw such a straight line in the same
scatter diagram, it is possible that there may be five different estimates of a and
b, specially when the dots are more dispersed in the diagram. Hence, the
estimates cannot be worked out only through this approach. A more systematic
and statistical method is required to estimate the constants of the predictive
equation. The least squares method is used to draw the best fit line.
[Figure: Scatter diagram, with the X-axis and the Y-axis each scaled from 0 to 120.]
In Figure 7.2, the vertical deviations of the individual points from the line
are shown as the short vertical lines joining the points to the least squares line.
These deviations will be denoted by the symbol e. The value of e varies from
one point to another. In some cases it is positive, while in others it is negative.
If the line drawn happens to be the least squares line, then the value of $\sum e_i^2$ is the
least possible. It is because of this feature that the method is known as the Least
Squares Method.
Why we insist on minimizing the sum of squared deviations is a question
that needs explanation. If we denote the deviations from the actual value Y to
the estimated value $\hat{Y}$ as $(Y - \hat{Y})$ or $e_i$, it is logical that we want the
$\sum (Y - \hat{Y})$ or $\sum_{i=1}^{n} e_i$ to be as small as possible. However, merely examining
$\sum (Y - \hat{Y})$ or $\sum_{i=1}^{n} e_i$ is inappropriate, since large
positive values and large negative values could cancel one another. But large
values of $e_i$, regardless of their sign, indicate a poor prediction. Even if we ignore
the signs while working out $\sum_{i=1}^{n} |e_i|$, where $|e_i| = e_i$ if $e_i \geq 0$ and $|e_i| = -e_i$
if $e_i < 0$, the difficulties remain, which is why the deviations are squared before
being summed.
$$\sum Y = na + b\sum X$$
$$\sum XY = a\sum X + b\sum X^2$$
In the above two equations, a and b are unknowns and all other values,
viz., $\sum X$, $\sum Y$, $\sum X^2$, $\sum XY$, are the sums of the products and cross products to be
calculated from the sample data, and n means the number of observations in
the sample.
Observations                           1    2    3    4    5    6    7    8    9   10
Income (X) (00 Rs)                    41   65   50   57   96   94  110   30   79   65
Consumption Expenditure (Y) (00 Rs)   44   60   39   51   80   68   84   34   55   48
                Income      Consumption
Observations    X           Expenditure Y       XY        X²        Y²
                (00 Rs)     (00 Rs)
     1            41            44             1804      1681      1936
     2            65            60             3900      4225      3600
     3            50            39             1950      2500      1521
     4            57            51             2907      3249      2601
     5            96            80             7680      9216      6400
     6            94            68             6392      8836      4624
     7           110            84             9240     12100      7056
     8            30            34             1020       900      1156
     9            79            55             4345      6241      3025
    10            65            48             3120      4225      2304
 n = 10       ΣX = 687      ΣY = 563       ΣXY = 42358  ΣX² = 53173  ΣY² = 34223

Solving the two normal equations with these totals, we get
a = 14.000
and
b = 0.616
or,
$$\hat{Y} = 14.000 + 0.616X_i$$
On the basis of this equation we can make a point estimate of Y for any
given value of X. Suppose we wish to estimate the consumption expenditure of
individuals with income of Rs 10,000. We substitute X = 100 for the same in our
equation and get an estimate of consumption expenditure as follows:
$$\hat{Y} = 14.000 + 0.616 \times 100 = 75.60$$
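The estimating equation and the point estimate can be reproduced from the raw data. The following is a minimal sketch, assuming the ten income and consumption observations of this unit; the function name `fit_line` is illustrative. It solves the two normal equations for a and b and then substitutes X = 100.

```python
def fit_line(xs, ys):
    """Solve the normal equations sum(Y) = n*a + b*sum(X) and
    sum(XY) = a*sum(X) + b*sum(X^2) for the intercept a and slope b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Income (X) and consumption expenditure (Y), both in 00 Rs.
income = [41, 65, 50, 57, 96, 94, 110, 30, 79, 65]
consumption = [44, 60, 39, 51, 80, 68, 84, 34, 55, 48]

a, b = fit_line(income, consumption)
point_estimate = a + b * 100  # income of Rs 10,000 corresponds to X = 100
print(round(a, 3), round(b, 3), round(point_estimate, 2))
```

With unrounded coefficients the estimate comes out marginally below the 75.60 obtained from the rounded values a = 14.000 and b = 0.616; the difference is purely a rounding effect.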
wider the interval, the greater the level of confidence we can have, but the
width of the interval (or what is technically known as the precision of the
estimate) is associated with a specified level of confidence and is dependent
on the variability (consumption expenditure in our case) found in the sample.
This variability is measured by the standard deviation of the error term, e,
and is popularly known as the standard error of the estimate.
$$SE_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{\sum e^2}{n - 2}}$$
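The standard error of the estimate can be computed from the residuals of the fitted line. This is a sketch only, assuming the income and consumption data used in this unit and the divisor n − 2 shown in the formula; the function name is illustrative.

```python
import math


def standard_error_of_estimate(xs, ys):
    """Return sqrt(sum((Y - Y_hat)^2) / (n - 2)) for the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y) / (
        sum(x * x for x in xs) - n * mean_x ** 2
    )
    a = mean_y - b * mean_x
    sum_squared_errors = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sum_squared_errors / (n - 2))


# Income (X) and consumption expenditure (Y), both in 00 Rs.
income = [41, 65, 50, 57, 96, 94, 110, 30, 79, 65]
consumption = [44, 60, 39, 51, 80, 68, 84, 34, 55, 48]

se = standard_error_of_estimate(income, consumption)
print(round(se, 2))
```

For these data the standard error is roughly 5.7 (in hundreds of rupees), which is the unit in which the limits around an estimate of Y are expressed.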
that the observed points are normally distributed around the regression
line and we may find,
68% of all points within $\hat{Y} \pm 1\,SE_e$ limits
95.5% of all points within $\hat{Y} \pm 2\,SE_e$ limits
99.7% of all points within $\hat{Y} \pm 3\,SE_e$ limits
This can be stated as,
(i) The observed values of Y are normally distributed around each estimated
value of $\hat{Y}$ and;
(ii) The variance of the distributions around each possible value of $\hat{Y}$ is the
same.
In case of small samples, i.e., where n ≤ 30 in a sample, the t distribution
is used for finding the two limits more appropriately.
This is done as follows:
Upper limit = $\hat{Y} + t\,(SE_e)$
Lower limit = $\hat{Y} - t\,(SE_e)$
Where,
$$(\hat{Y} - \bar{Y}) = r\frac{\sigma_Y}{\sigma_X}(X_i - \bar{X})$$
or,
$$\hat{Y} = r\frac{\sigma_Y}{\sigma_X}(X_i - \bar{X}) + \bar{Y}$$
Where,
$\bar{Y}$ = Mean of Y
$\hat{Y}$ = Value of Y to be estimated
$X_i$ = Any given value of X for which Y is to be
estimated.
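This is based on the formula we have used, i.e., $\hat{Y} = a + bX_i$. The coefficient
of $X_i$ is defined as,
$$\text{Coefficient of } X_i = b = r\frac{\sigma_Y}{\sigma_X}$$
(Also known as regression coefficient of Y on X or slope of the regression
line of Y on X) or $b_{YX}$.
$$b_{YX} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\sum Y^2 - n\bar{Y}^2}\,\sqrt{\sum X^2 - n\bar{X}^2}} \times \frac{\sqrt{\sum Y^2 - n\bar{Y}^2}}{\sqrt{\sum X^2 - n\bar{X}^2}} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$$
and
$$a = -r\frac{\sigma_Y}{\sigma_X}\bar{X} + \bar{Y} = \bar{Y} - b\bar{X} \quad \left(\text{since } b = r\frac{\sigma_Y}{\sigma_X}\right)$$
The estimating equation of X is obtained in the same way:
$$(\hat{X} - \bar{X}) = r\frac{\sigma_X}{\sigma_Y}(Y_i - \bar{Y})$$
or
$$\hat{X} = r\frac{\sigma_X}{\sigma_Y}(Y_i - \bar{Y}) + \bar{X}$$
and the
$$\text{Regression coefficient of X on Y (or } b_{XY}) = r\frac{\sigma_X}{\sigma_Y} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2}$$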
If we are given the two regression equations as stated above, along with
the values of the a and b constants, and solve them for the values of X
and Y, then the values of X and Y so obtained are the mean value of X (i.e., $\bar{X}$)
and the mean value of Y (i.e., $\bar{Y}$).
If we are given the two regression coefficients (viz., $b_{XY}$ and $b_{YX}$), then we
can work out the value of the coefficient of correlation by just taking the square root
of the product of the regression coefficients as shown below:
$$r = \sqrt{b_{YX} \cdot b_{XY}} = \sqrt{r\frac{\sigma_Y}{\sigma_X} \cdot r\frac{\sigma_X}{\sigma_Y}} = \sqrt{r \cdot r} = r$$
The (±) sign of r will be determined on the basis of the sign of the regression
coefficients given. If the regression coefficients have a minus sign then r will be taken
with a minus (−) sign and if the regression coefficients have a plus sign then r will be
taken with a plus (+) sign. (Remember that both regression coefficients will
necessarily have the same sign, whether it is minus or plus, for their sign is
governed by the sign of the coefficient of correlation.)
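Example 7.2: Given is the following information:

                        X        Y
Mean                  39.5     47.5
Standard Deviation    10.8     17.8

$$(\hat{Y} - \bar{Y}) = r\frac{\sigma_Y}{\sigma_X}(X_i - \bar{X})$$
or
$$\hat{Y} = r\frac{\sigma_Y}{\sigma_X}(X_i - \bar{X}) + \bar{Y} = 0.42\,\frac{17.8}{10.8}(X_i - 39.5) + 47.5$$
$$(\hat{X} - \bar{X}) = r\frac{\sigma_X}{\sigma_Y}(Y_i - \bar{Y})$$
or
$$\hat{X} = r\frac{\sigma_X}{\sigma_Y}(Y_i - \bar{Y}) + \bar{X} = 0.42\,\frac{10.8}{17.8}(Y_i - 47.5) + 39.5$$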
20X − 9Y − 107 = 0
Find: (i) Mean values of X and Y.
(ii) Coefficient of Correlation between X and Y.
(iii) Standard deviation of Y.
Solution:
(i) For finding the mean values of X and Y, we solve the two given regression
equations for the values of X and Y as follows:
4X − 5Y + 33 = 0        (1)
20X − 9Y − 107 = 0      (2)
Multiplying Equation (1) by 5 and rearranging both equations, we have,
20X − 25Y = −165        (3)
20X − 9Y = 107          (2)
Subtracting Equation (2) from Equation (3) we get,
−16Y = −272
or
Y = 17
Putting this value of Y in Equation (1) we have,
4X = −33 + 5(17)
or
$$X = \frac{-33 + 85}{4} = \frac{52}{4} = 13$$
Hence,
$\bar{X}$ = 13
and
$\bar{Y}$ = 17
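(ii) For finding the coefficient of correlation, first of all we presume one of the
two given regression equations as the estimating equation of X. Let
equation 4X − 5Y + 33 = 0 be the estimating equation of X, then we have,
$$\hat{X} = \frac{5Y_i}{4} - \frac{33}{4}$$
From this we can write $b_{XY} = \frac{5}{4}$.
The other given equation is then the estimating equation of Y, so that
$$\hat{Y} = \frac{20X_i}{9} - \frac{107}{9}$$
and $b_{YX} = \frac{20}{9}$. The product of these two coefficients, $\frac{5}{4} \times \frac{20}{9} = \frac{25}{9}$,
is greater than 1, which is not possible because it must equal $r^2$. Hence the
presumption is wrong and the equations must be taken the other way round:
$$\hat{X} = \frac{9Y_i}{20} + \frac{107}{20}, \quad b_{XY} = \frac{9}{20}$$
and
$$\hat{Y} = \frac{4X_i}{5} + \frac{33}{5}, \quad b_{YX} = \frac{4}{5}$$
Hence,
$$r = \sqrt{\frac{9}{20} \times \frac{4}{5}} = \sqrt{\frac{9}{25}} = \frac{3}{5} = 0.6$$
Since the regression coefficients have plus signs, we take r = +0.6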
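(iii) Standard deviation of Y can be calculated as follows:
Variance of X = 9
Standard deviation of X = 3
$$b_{YX} = r\frac{\sigma_Y}{\sigma_X}$$
$$\frac{4}{5} = 0.6\,\frac{\sigma_Y}{3} = 0.2\,\sigma_Y$$
Hence, $\sigma_Y$ = 4
Alternatively,
$$b_{XY} = r\frac{\sigma_X}{\sigma_Y}$$
$$\frac{9}{20} = 0.6 \times \frac{3}{\sigma_Y} = \frac{1.8}{\sigma_Y}$$
Hence, $\sigma_Y$ = 4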
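The three results of this example can be cross-checked with a short script. The sketch below is an illustration only: it assumes the two regression equations 4X − 5Y + 33 = 0 and 20X − 9Y − 107 = 0 and a variance of X equal to 9, and it uses Cramer's rule (a technique chosen here for brevity, not taken from the text) to solve for the means.

```python
import math

# The two regression lines written as p*X + q*Y = c.
# 4X - 5Y = -33 is the line of Y on X; 20X - 9Y = 107 is the line of X on Y.
p1, q1, c1 = 4, -5, -33
p2, q2, c2 = 20, -9, 107

# Both lines pass through (mean of X, mean of Y); solve by Cramer's rule.
determinant = p1 * q2 - q1 * p2
mean_x = (c1 * q2 - q1 * c2) / determinant
mean_y = (p1 * c2 - p2 * c1) / determinant

# Slopes: Y = (4/5)X + 33/5 gives b_YX; X = (9/20)Y + 107/20 gives b_XY.
b_yx = 4 / 5
b_xy = 9 / 20
r = math.sqrt(b_yx * b_xy)  # positive, because both coefficients are positive

sigma_x = math.sqrt(9)          # variance of X is 9
sigma_y = b_yx * sigma_x / r    # from b_YX = r * sigma_Y / sigma_X

print(mean_x, mean_y, r, sigma_y)
```

The script gives means of 13 and 17, a correlation coefficient of about 0.6 and a standard deviation of Y of about 4, matching the worked solution.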
Activity 2
Regression of savings (S) of a family on income (Y) may be expressed as
S a
Y
, where a and m are constants. In random sample of 100 families,
m
X (inches)                 71   68   66   67   70   71   70   73   72   65   66
Son's Height Y (inches)    69   64   65   63   65   62   65   64   66   59   62

  X     $(X - \bar{X}) = x$     x²       Y     $(Y - \bar{Y}) = y$     y²      xy
 71          +2                  4      69          +5                 25     +10
 68          −1                  1      64           0                  0       0
 66          −3                  9      65          +1                  1      −3
 67          −2                  4      63          −1                  1      +2
 70          +1                  1      65          +1                  1      +1
 71          +2                  4      62          −2                  4      −4
 70          +1                  1      65          +1                  1      +1
 73          +4                 16      64           0                  0       0
 72          +3                  9      66          +2                  4      +6
 65          −4                 16      59          −5                 25     +20
 66          −3                  9      62          −2                  4      +6

ΣX = 759; Σx = 0; Σx² = 74; ΣY = 704; Σy = 0; Σy² = 66; Σxy = 39
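$$(Y - \bar{Y}) = r\frac{\sigma_y}{\sigma_x}(X - \bar{X})$$
$$\bar{Y} = \frac{\sum Y}{N} = \frac{704}{11} = 64; \quad \bar{X} = \frac{\sum X}{N} = \frac{759}{11} = 69$$
$$r\frac{\sigma_y}{\sigma_x} = \frac{\sum xy}{\sum x^2} = \frac{39}{74} = 0.527$$
$$(Y - 64) = 0.527\,(X - 69)$$
$$Y = 0.527X + 27.64$$
Note: For X = 69, Y = 0.527 (69) + 27.64 = 64.003 ≈ 64.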
Example 7.5: Obtain the two regression equations for the following data using
the method of least squares:

x    1    2    3    4    5
y    5    7    9   10   11

  x      y      xy     x²     y²
  1      5       5      1     25
  2      7      14      4     49
  3      9      27      9     81
  4     10      40     16    100
  5     11      55     25    121

Σx = 15; Σy = 42; Σxy = 141; Σx² = 55; Σy² = 376

Regression equation of y on x:
y = a + bx
where
Σy = Na + bΣx
and
Σxy = aΣx + bΣx²
Thus,
42 = 5a + 15b            ...(i)
141 = 15a + 55b          ...(ii)
Solving (i) and (ii), we get a = 3.9 and b = 1.5
Thus, y = 3.9 + 1.5x
Regression equation of x on y:
x = a + by
where
Σx = Na + bΣy
and
Σxy = aΣy + bΣy²
Thus,
15 = 5a + 42b            ...(iii)
and
141 = 42a + 376b         ...(iv)
Business Statistics
Unit 7
x=
2
39 13
and b =
3
15
5
13 2
y
5
3
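The two pairs of normal equations can also be solved mechanically. The sketch below is a hedged illustration in Python: `fit_line` is a hypothetical helper name, and it solves the two normal equations in closed form rather than by elimination.

```python
# Least-squares line v = a + b*u from the two normal equations
#   sum(v)  = N*a      + b*sum(u)
#   sum(uv) = a*sum(u) + b*sum(u^2)
# `fit_line` is an illustrative name, not taken from the text.
def fit_line(us, vs):
    n = len(us)
    su, sv = sum(us), sum(vs)
    suv = sum(u * v for u, v in zip(us, vs))
    suu = sum(u * u for u in us)
    b = (n * suv - su * sv) / (n * suu - su ** 2)
    a = (sv - b * su) / n
    return a, b


x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 10, 11]

a_yx, b_yx = fit_line(x, y)  # regression of y on x
a_xy, b_xy = fit_line(y, x)  # regression of x on y (roles swapped)

print(round(a_yx, 3), round(b_yx, 3))  # 3.9 1.5
print(round(a_xy, 3), round(b_xy, 3))  # -2.431 0.647
```

Swapping the argument order is all that is needed to obtain the second regression line; the two slopes are not reciprocals of each other unless the correlation is perfect.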
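Example 7.6: The following table shows the ages (x) and blood pressure (y) of
8 persons.

| $x$ | $y$ |
|---|---|
| 52 | 62 |
| 63 | 53 |
| 45 | 51 |
| 36 | 25 |
| 72 | 79 |
| 65 | 43 |
| 47 | 60 |
| 25 | 33 |

Obtain the regression equation of y on x and find the expected blood pressure of
a person who is 49 years old.
Solution: Let $A_x = 50$ and $A_y = 50$ (Assumed means)

| $x$ | $(x - 50) = d_x$ | $d_x^2$ | $y$ | $(y - 50) = d_y$ | $d_y^2$ | $d_x d_y$ |
|---|---|---|---|---|---|---|
| 52 | +2 | 4 | 62 | +12 | 144 | +24 |
| 63 | +13 | 169 | 53 | +3 | 9 | +39 |
| 45 | −5 | 25 | 51 | +1 | 1 | −5 |
| 36 | −14 | 196 | 25 | −25 | 625 | +350 |
| 72 | +22 | 484 | 79 | +29 | 841 | +638 |
| 65 | +15 | 225 | 43 | −7 | 49 | −105 |
| 47 | −3 | 9 | 60 | +10 | 100 | −30 |
| 25 | −25 | 625 | 33 | −17 | 289 | +425 |
| $\sum x = 405$ | $\sum d_x = 5$ | $\sum d_x^2 = 1737$ | $\sum y = 406$ | $\sum d_y = 6$ | $\sum d_y^2 = 2058$ | $\sum d_x d_y = 1336$ |

The regression equation of y on x is:
$$(y - \bar{y}) = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$$
$$\bar{y} = \frac{\sum y}{N} = \frac{406}{8} = 50.75; \qquad \bar{x} = \frac{\sum x}{N} = \frac{405}{8} = 50.625$$
$$r\,\frac{\sigma_y}{\sigma_x} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_x^2 - \left(\sum d_x\right)^2} = \frac{8(1336) - (5)(6)}{8(1737) - (5)^2} = 0.768$$
Hence, $(y - 50.75) = 0.768\,(x - 50.625)$, i.e., $y = 0.768x + 11.87$.
For x = 49, y = 0.768 (49) + 11.87 ≈ 49.5.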
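The assumed-mean shortcut can be verified numerically. The following Python sketch assumes the same assumed means of 50 for both variables; the helper name `slope_from_assumed_means` is illustrative only.

```python
# Slope of y on x from deviations about assumed means (A_x, A_y).
# The result does not depend on which assumed means are chosen.
def slope_from_assumed_means(xs, ys, a_x, a_y):
    n = len(xs)
    dx = [x - a_x for x in xs]
    dy = [y - a_y for y in ys]
    num = n * sum(p * q for p, q in zip(dx, dy)) - sum(dx) * sum(dy)
    den = n * sum(p * p for p in dx) - sum(dx) ** 2
    return num / den


ages = [52, 63, 45, 36, 72, 65, 47, 25]
pressure = [62, 53, 51, 25, 79, 43, 60, 33]

b_yx = slope_from_assumed_means(ages, pressure, 50, 50)
x_bar = sum(ages) / len(ages)
y_bar = sum(pressure) / len(pressure)

# Expected blood pressure at age 49, using the line through the means.
expected_at_49 = y_bar + b_yx * (49 - x_bar)
print(round(b_yx, 3), round(expected_at_49, 1))  # 0.768 49.5
```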
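Example 7.7: The equations of two regression lines obtained in a correlation analysis
of 60 observations are 5x = 6y + 24 and 1000y = 768x − 3608. What is the correlation
coefficient and what is its probable error?
Show that the ratio of the coefficient of variation of x to that of y is $\frac{5}{24}$. What is
$$b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = \frac{6}{5} \qquad ...(i)$$
and
$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = \frac{768}{1000} \qquad ...(ii)$$
Multiplying these, we get
$$b_{xy} \cdot b_{yx} = r^2 = \frac{6}{5} \times \frac{768}{1000} = 0.9216, \qquad r = \pm 0.96$$
Since both $b_{xy}$ and $b_{yx}$ are positive, the correlation coefficient r is also positive and
hence r = + 0.96.
Also, probable error of r,
$$P.E._r = 0.6745\,\frac{1 - r^2}{\sqrt{N}}$$
$$P.E._r = 0.6745\,\frac{1 - 0.96^2}{\sqrt{60}} \approx 0.0068$$
Also we know that each regression line passes through $(\bar{x}, \bar{y})$. So from the given
equations of these lines we have
$$5\bar{x} = 6\bar{y} + 24 \qquad ...(iii)$$
and
$$1000\bar{y} = 768\bar{x} - 3608 \qquad ...(iv)$$
Solving these we get
$\bar{x} = 6$ and $\bar{y} = 1$
Also from (i), we have $r\,\frac{\sigma_x}{\sigma_y} = \frac{6}{5}$ where r = 0.96
or
$$\frac{\sigma_x}{\sigma_y} = \frac{6}{5} \times \frac{1}{0.96} = \frac{6}{5 \times 0.96} = \frac{5}{4}$$
Hence, the ratio of the coefficient of variation of x to that of y is
$$\frac{\sigma_x / \bar{x}}{\sigma_y / \bar{y}} = \frac{\sigma_x}{\sigma_y} \times \frac{\bar{y}}{\bar{x}} = \frac{5}{4} \times \frac{1}{6} = \frac{5}{24}$$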
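The arithmetic of this example is easy to reproduce. The Python sketch below is illustrative only: the function names are assumptions, and the means 6 and 1 are taken from the solution of the two line equations above.

```python
import math


def correlation_from_slopes(b_xy, b_yx):
    """r is the geometric mean of the two regression coefficients."""
    product = b_xy * b_yx
    if product < 0 or product > 1:
        # Signals that the two equations were assigned the wrong roles.
        raise ValueError("need same-sign slopes with b_xy * b_yx <= 1")
    return math.copysign(math.sqrt(product), b_xy)


def probable_error(r, n):
    """Probable error of r for n pairs of observations."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


b_xy, b_yx = 6 / 5, 768 / 1000
r = correlation_from_slopes(b_xy, b_yx)
pe = probable_error(r, 60)

sd_ratio = b_xy / r                  # sigma_x / sigma_y
x_bar, y_bar = 6, 1                  # means found from the two lines
cv_ratio = sd_ratio * y_bar / x_bar  # (sigma_x/x_bar) / (sigma_y/y_bar)

print(round(r, 2), round(pe, 4), round(cv_ratio, 4))  # 0.96 0.0068 0.2083
```

The range check in `correlation_from_slopes` mirrors the usual test for deciding which of two given equations is the regression of x on y: a product of slopes above 1 means the roles must be exchanged.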
Self-Assessment Questions
3. State whether true or false.
(a) The scatter diagram by itself is not sufficient for predicting values of
the dependent variable.
(b) The interval estimate method is considered worse as it states an
interval in which the expected consumption expenditure may fall.
4. Fill in the blanks with the appropriate terms.
(a) In case of simple linear regression analysis, a single variable is used
to __________ another variable on the assumption of linear
relationship (i.e., relationship of the type defined by Y = a + bX)
between the given variables.
(b) Standard error of estimate is a measure developed by the statisticians
for measuring the reliability of the _____________ equation.
7.4 Summary
Let us recapitulate the important concepts discussed in this unit:
The term regression was first used in 1877 by Sir Francis Galton who
made a study that showed the process of predicting one variable from
another variable.
When there is a well established relationship between variables, it is
possible to make use of this relationship in making estimates and to
forecast the value of one variable (the unknown or the dependent variable)
on the basis of the other variable/s (the known or the independent
variable/s).
There is an actual relationship between dependent and independent
variables.
7.5 Glossary
Regression analysis: A relationship used for making estimates and
forecasts about the value of one variable (the unknown or the dependent
variable) on the basis of the other variable/s (the known or the independent
variable/s).
Scatter diagram: Also known as a Dot diagram, used to represent two
series with the known variables, i.e., independent variable plotted on the
X-axis and the variable to be estimated, i.e., dependent variable to be
plotted on the Y-axis on a graph paper for the given information.
Standard error of estimate: A measure developed by statisticians for
measuring the reliability of the estimating equation.
7.7 Answers
Answers to Self-Assessment Questions
1. (a) Experimentor; (b) Regression
2. (a) True; (b) False
3. (a) True; (b) False
4. (a) Predict; (b) Estimating
Endnotes
1. Usually the estimate of Y, denoted by $\hat{Y}$, is written as,
$$\hat{Y} = a + bX_i$$
on the assumption that the random disturbance to the system averages out or has an
expected value of zero (i.e., e = 0) for any single observation. This regression model is
known as the Regression line of Y on X from which the value of Y can be estimated for
the given value of X.
2. [Figure: five scatter diagrams, numbered (1) to (5)]
The five possible forms that a scatter diagram may assume are depicted in the above
five diagrams. The first diagram indicates a perfect positive relationship, the second
a perfect negative relationship, the third no relationship, the fourth a positive
relationship and the fifth a negative relationship between the two variables under
consideration.
3. If we proceed centering each variable, i.e., setting its origin at its mean, then the two
equations will be as under:
$$\sum Y = na + b\sum X$$
$$\sum XY = a\sum X + b\sum X^2$$
But since $\sum Y$ and $\sum X$ will be zero, the first equation and the first term of the second
equation will disappear and we shall simply have the following equations:
$$\sum XY = b\sum X^2$$
$$b = \sum XY / \sum X^2$$
The value of a can then be worked out as:
$$a = \bar{Y} - b\bar{X}$$
4. It should be pointed out that the equation used to estimate the Y variable values from
values of X should not be used to estimate the values of X variable from given values of
Y variable. Another regression equation (known as the regression equation of X on Y of
the type X = a + bY) that reverses the two value should be used if it is desired to estimate
X from value of Y.
Unit 8
Time Series
Structure
8.1 Introduction
Objectives
8.2 Components of Time Series
8.3 Different Methods of Measuring Trend
8.4 Different Methods of Measuring Seasonal Variations
8.5 Summary
8.6 Glossary
8.7 Terminal Questions
8.8 Answers
8.9 Further Reading
8.1 Introduction
In the previous unit, you learnt about regression analysis and its significance in
data analysis.
In this unit, you will learn how time series analysis differs from regression
analysis. We often see a number of charts on company drawing boards or in
newspapers, where we see lines going up and down from left to right on a
graph. The vertical axis represents a variable such as productivity or crime data
in the city and the horizontal axis represents the different periods of increasing
time such as days, weeks, months or years. Analysis of the movements of such
variables over periods of time is referred to as time series analysis. A time series
can then be defined as a set of numeric observations of a dependent variable,
measured at specific points in time in chronological order, usually at equal
intervals, in order to determine the relationship of time to such variables.
You will also learn that one of the major elements of planning and
specifically strategic planning of any organization is accurately forecasting the
future events that would have an impact on the operations of an organization.
Previous performances must be studied so as to forecast future activity. Even in
our daily lives, we plan our future events on the basis of a reasonable estimate
of the future environment that would affect our plans, whether it is forecasting
rain on our picnic on Saturday or forecasting economic conditions for ten years.
Textbook publishers, for example, must predict future sales of books to print
enough copies for students. Financial advisors must predict the values of a
Objectives
After studying this unit, you should be able to:
Analyse the components of time series
Explain the different methods of measuring trend
Calculate simple averages and moving averages
Measure irregular variations and seasonal adjustments
would affect the forecasts. For example, a time series involving increase
in population over time is shown in Figure 8.1.
3. Seasonal Variation (S). This involves patterns of change that repeat over
a period of one year or less. Then they repeat from year to year and they
are brought about by fixed events. For example, sales of consumer items
increase prior to Deepawali due to the tradition of giving gifts.
Activity 1
The Indian Motorcycle Company is concerned about declining sales in the
Western region. The following data shows monthly sales (in millions of ₹) of
the motorcycles for the past twelve months.

| Month | Sales |
|---|---|
| January | 6.5 |
| February | 6.0 |
| March | 6.3 |
| April | 5.1 |
| May | 5.6 |
| June | 4.8 |
| July | 4.0 |
| August | 3.6 |
| September | 3.5 |
| October | 3.1 |
| November | 3.0 |
| December | 3.0 |

(i) Plot the trend line and describe the relationship between sales and
time.
(ii) What is the average monthly change in sales?
(iii) If the monthly sales fall below ₹ 2.4 million, then the West Coast
office must be closed. Is it likely that the office will be closed during
the next six months?
Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Trend is a general long-term ____________ in the time series value
of the variable (Y) over a fairly long period of time.
(b) Cyclic fluctuations refer to ____________ swings or patterns that
repeat over a long period of time.
| Age of car in years (t) | Repair expense (Y), in hundreds of ₹ |
|---|---|
| 1 | 4 |
| 3 | 6 |
| 3 | 7 |
| 5 | 7 |
| 6 | 9 |
The manager wants to predict the repair expenses for the next year for
the two cars that are three years old now.
Solution: The trend in repair costs suggests a linear relationship with the age
of the car, so that the linear regression equation is given as:
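$$Y_t = b_0 + b_1 t$$
Here,
$$b_1 = \frac{n\sum(tY) - (\sum t)(\sum Y)}{n\sum t^2 - (\sum t)^2}$$
and,
$$b_0 = \bar{y} - b_1\bar{t}$$

| t | Y | tY | $t^2$ |
|---|---|---|---|
| 1 | 4 | 4 | 1 |
| 3 | 6 | 18 | 9 |
| 3 | 7 | 21 | 9 |
| 5 | 7 | 35 | 25 |
| 6 | 9 | 54 | 36 |
| Total: 18 | 33 | 132 | 80 |

$$b_1 = \frac{5(132) - (18)(33)}{5(80) - (18)^2} = \frac{660 - 594}{400 - 324} = \frac{66}{76} = 0.87$$
and,
$$b_0 = \bar{y} - b_1\bar{t}$$
Here,
$$\bar{y} = \frac{\sum y}{n} = \frac{33}{5} = 6.6$$
and,
$$\bar{t} = \frac{\sum t}{n} = \frac{18}{5} = 3.6$$
Then,
$$b_0 = 6.6 - 0.87(3.6) = 6.6 - 3.13 = 3.47$$
Hence,
$$Y_t = 3.47 + 0.87t$$
The cars that are 3 years old now will be 4 years old next year, so that
t = 4.
Hence,
$$Y_t = 3.47 + 0.87(4) = 6.95$$
Accordingly, the repair costs on each car that is 3 years old now are
expected to be ₹ 695.00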
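The trend-line forecast can be reproduced with a short Python sketch. The helper name `linear_trend` is an assumption made for illustration; it applies the same formulas for the slope and intercept as above, without intermediate rounding.

```python
# Linear trend Y_t = b0 + b1*t fitted by least squares.
def linear_trend(ts, ys):
    n = len(ts)
    sum_t, sum_y = sum(ts), sum(ys)
    sum_ty = sum(t * y for t, y in zip(ts, ys))
    sum_tt = sum(t * t for t in ts)
    b1 = (n * sum_ty - sum_t * sum_y) / (n * sum_tt - sum_t ** 2)
    b0 = sum_y / n - b1 * sum_t / n
    return b0, b1


car_age = [1, 3, 3, 5, 6]      # t, in years
repair_cost = [4, 6, 7, 7, 9]  # Y, same units as in the table

b0, b1 = linear_trend(car_age, repair_cost)
forecast_age_4 = b0 + b1 * 4
print(round(b0, 2), round(b1, 2), round(forecast_age_4, 2))  # 3.47 0.87 6.95
```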
Example 8.2: The following is the data for energy consumption (measured in
quadrillions of BTU) in the United States from 1981 to 1986 as reported in the
statistical abstracts of the United States.
| Year | Annual Energy Consumption (Y) |
|---|---|
| 1981 | 74.0 |
| 1982 | 70.8 |
| 1983 | 70.5 |
| 1984 | 74.1 |
| 1985 | 74.0 |
| 1986 | 73.9 |
Assuming a linear trend, calculate the percentage of trend for each year
(cyclical variation).
Solution: First, we find the secular trend by the regression line method which is
given by:
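$$Y_t = b_0 + b_1 t$$
Here,
$$b_1 = \frac{n\sum(tY) - (\sum t)(\sum Y)}{n\sum t^2 - (\sum t)^2}$$
and,
$$b_0 = \bar{y} - b_1\bar{t}$$
Let us make a table for these values.

| t | Y | tY | $t^2$ |
|---|---|---|---|
| 1 | 74.0 | 74.0 | 1 |
| 2 | 70.8 | 141.6 | 4 |
| 3 | 70.5 | 211.5 | 9 |
| 4 | 74.1 | 296.4 | 16 |
| 5 | 74.0 | 370.0 | 25 |
| 6 | 73.9 | 443.4 | 36 |
| $\sum t = 21$ | $\sum Y = 437.3$ | $\sum tY = 1536.9$ | $\sum t^2 = 91$ |

$$b_1 = \frac{6(1536.9) - (21)(437.3)}{6(91) - (21)^2} = \frac{9221.4 - 9183.3}{546 - 441} = \frac{38.1}{105} = 0.363$$
and,
$$b_0 = \bar{y} - b_1\bar{t}$$
Here,
$$\bar{y} = \frac{\sum y}{n} = \frac{437.3}{6} = 72.88$$
$$\bar{t} = \frac{21}{6} = 3.5$$
Hence,
$$b_0 = 72.88 - 0.363(3.5) = 72.88 - 1.27 = 71.61$$
Then,
$$Y_t = 71.61 + 0.363t$$
Calculating the value of $Y_t$ for each time period, we get the following table
for percentage of trend $(Y/Y_t) \times 100$.

| Time Period (t) | Energy Consumption (Y) | Trend ($Y_t$) | Percentage of Trend $(Y/Y_t) \times 100$ |
|---|---|---|---|
| 1 | 74.0 | 71.97 | 102.82 |
| 2 | 70.8 | 72.34 | 97.87 |
| 3 | 70.5 | 72.70 | 96.97 |
| 4 | 74.1 | 73.06 | 101.42 |
| 5 | 74.0 | 73.43 | 100.78 |
| 6 | 73.9 | 73.79 | 100.15 |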
The following graph shows the actual energy consumption (Y), trend line
(Yt) and the cyclical fluctuations above and below the trend line over the time
period (t) for 6 years.
The percentage of trend figures show that in 1981, the actual consumption
of energy was 102.82% of expected consumption that year and in 1983, the
actual consumption was 96.97% of the expected consumption.
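The percentage-of-trend calculation can be sketched in Python as follows. Variable names are illustrative assumptions. Because the code keeps full precision for the trend coefficients, individual percentages may differ from the hand-worked figures in the second decimal place.

```python
# Percentage of trend = (actual / trend value) * 100 for each period.
consumption = [74.0, 70.8, 70.5, 74.1, 74.0, 73.9]  # 1981 to 1986
periods = list(range(1, len(consumption) + 1))      # t = 1, ..., 6

n = len(periods)
sum_t, sum_y = sum(periods), sum(consumption)
sum_ty = sum(t * y for t, y in zip(periods, consumption))
sum_tt = sum(t * t for t in periods)

# Least-squares trend line Y_t = b0 + b1*t.
b1 = (n * sum_ty - sum_t * sum_y) / (n * sum_tt - sum_t ** 2)
b0 = sum_y / n - b1 * sum_t / n

trend = [b0 + b1 * t for t in periods]
percent_of_trend = [100 * y / yt for y, yt in zip(consumption, trend)]

for t, y, yt, pct in zip(periods, consumption, trend, percent_of_trend):
    print(t, y, round(yt, 2), round(pct, 2))
```

Values above 100 mark years in which consumption ran above the fitted trend, and values below 100 mark years in which it ran below.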
Self-Assessment Questions
3. State whether true or false.
(a) The four components of time series are secular trend (T), seasonal
variation (S), cyclical variation (C) and irregular (or chance) variation
(I).
(b) The measure used to identify cyclical variation is the residual trend
and the procedure used is the percentage of trend.
4. Fill in the blanks with the appropriate terms.
(a) When a time series shows an upward or downward long-term linear
trend, then regression analysis can be used to ______________
this trend and project the trends into forecasting the future values of
the variables involved.
(b) Cyclic variation is a pattern that ___________________ over time
periods longer than one year.
the month of March. (The seasonal variation will be the same for March in every
year. Seasonal index describes the degree of seasonal variation).
Then the seasonal index for the month of March will be calculated as
follows:
$$\text{Seasonal Index for March} = \frac{\text{Monthly average for March}}{\text{Average of monthly averages}} \times 100$$
Example 8.3: Assume that a record of rental of row boats for the previous 3
years on a quarterly basis is given as follows:

| Year | Rentals Per Quarter: I | II | III | IV | Total |
|---|---|---|---|---|---|
| 1991 | 350 | 300 | 450 | 400 | 1500 |
| 1992 | 330 | 360 | 500 | 410 | 1600 |
| 1993 | 370 | 350 | 520 | 440 | 1680 |
Solution:
Step 1. The first step is to calculate the four-quarter moving total for time series.
This total is associated with the middle data point in the set of values for the four
quarters, shown as follows.
| Year | Quarters | Rentals | Moving Total |
|---|---|---|---|
| 1991 | I | 350 | |
| | II | 300 | |
| | | | 1500 |
| | III | 450 | |
| | IV | 400 | |
The moving total for the given values of four quarters is 1500, which is
simply the addition of the four quarter values. This value of 1500 is placed in the
middle of values 300 and 450 and recorded in the next column. For the next
moving total of the four quarters, we will drop the value of the first quarter, which
is 350, from the total and add the value of the fifth quarter (in other words, first
quarter of the next year), and this total will be placed in the middle of the next
two values, which are 450 and 400, and so on. These values of the moving
totals are shown in column 4 of the next table.
Step 2. The next step is to calculate the four-quarter moving average. This can be
done by dividing the four-quarter moving total, as calculated in Step 1, by 4,
since there are 4 quarters. The four-quarter moving average is recorded in column
5 in the next table. The entire table of calculations is shown as follows. Each
entry in columns (4) and (5) falls midway between the quarter on whose row it is
printed and the following quarter.

| Year (1) | Quarter (2) | Rentals (3) | Four-Quarter Moving Total (4) | Four-Quarter Moving Average (5) | Four-Quarter Centred Moving Average (6) | Percentage of Actual to Centred Moving Average (7) |
|---|---|---|---|---|---|---|
| 1991 | I | 350 | | | | |
| | II | 300 | 1500 | 375.0 | | |
| | III | 450 | 1480 | 370.0 | 372.50 | 120.80 |
| | IV | 400 | 1540 | 385.0 | 377.50 | 105.96 |
| 1992 | I | 330 | 1590 | 397.5 | 391.25 | 84.35 |
| | II | 360 | 1600 | 400.0 | 398.75 | 90.28 |
| | III | 500 | 1640 | 410.0 | 405.00 | 123.45 |
| | IV | 410 | 1630 | 407.5 | 408.75 | 100.30 |
| 1993 | I | 370 | 1650 | 412.5 | 410.00 | 90.24 |
| | II | 350 | 1680 | 420.0 | 416.25 | 84.08 |
| | III | 520 | | | | |
| | IV | 440 | | | | |
Step 3. After the moving averages for each consecutive 4 quarters have been
taken, then we centre these moving averages. As we see from the above table,
the quarterly moving average falls between the quarters. This is because the
number of quarters is even which is 4. If we had odd number of time periods,
such as 7 days of the week, then the moving average would already be centred
and the third step here would not be necessary. Accordingly, we centre our
averages in order to associate each average with the corresponding quarter,
rather than between the quarters. This is shown in column 6, where the centred
moving average is calculated as the average of the two consecutive moving
averages.
The moving average (or the centred moving average) aims to eliminate
seasonal and irregular fluctuations (S and I) from the original time series, so
that this average represents the cyclical and trend components of the series.
As the following graph shows for this data, the centred moving average
has smoothed the peaks and troughs of the original time series.
Step 4. Column 7 in the table contains calculated entries which are percentages
of the actual values to the corresponding centred moving average values. For
example, the first four quarters centred moving average of 372.50 in the table
has the corresponding actual value of 450, so that the percentage of actual
value to centred moving average would be:
$$\frac{\text{Actual Value}}{\text{Centred Moving Average Value}} \times 100 = \frac{450}{372.5} \times 100 = 120.80$$
Step 5. The purpose of this step is to eliminate the remaining cyclical and irregular
fluctuations still present in the values in Column 7 of the table. This can be done
by calculating the modified mean for each quarter. The modified mean for each
quarter of the three-year time period under consideration is calculated as follows.
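(i) The percentages in column 7 are arranged by quarter:

| Year | Quarter I | Quarter II | Quarter III | Quarter IV |
|---|---|---|---|---|
| 1991 | | | 120.80 | 105.96 |
| 1992 | 84.35 | 90.28 | 123.45 | 100.30 |
| 1993 | 90.24 | 84.08 | | |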
(ii) We take the average of these values for each quarter. It should be noted
that if there are many years and quarters taken into consideration instead
of 3 years as we have taken, then the highest and the lowest values from
each quarterly data would be discarded and the average of the remaining
data would be considered. By discarding the highest and the lowest values
from each quarter data, we tend to reduce the extreme cyclical and irregular
fluctuations, which are further smoothed when we average the remaining
values. Thus, the modified mean can be considered as an index of
seasonal component. This modified mean for each quarter data is shown
as follows:
$$\text{Quarter I} = \frac{84.35 + 90.24}{2} = 87.295$$
$$\text{Quarter II} = \frac{90.28 + 84.08}{2} = 87.180$$
$$\text{Quarter III} = \frac{120.80 + 123.45}{2} = 122.125$$
$$\text{Quarter IV} = \frac{105.96 + 100.30}{2} = 103.13$$
Total = 399.73
The modified means as calculated here are preliminary seasonal indices.
These indices should average 100 per cent, i.e., total 400 for the 4 quarters.
However, our total is 399.73. This can be corrected by the following step.
Step 6. First, we calculate an adjustment factor. This is done by dividing the
desired or expected total of 400, by the actual total obtained of 399.73, so that,
$$\text{Adjustment} = \frac{400}{399.73} = 1.0007$$
By multiplying the modified mean for each quarter by the adjustment factor,
we get the seasonal index for each quarter, so that,
Quarter I = 87.295 × 1.0007 = 87.356
Quarter II = 87.180 × 1.0007 = 87.241
Quarter III = 122.125 × 1.0007 = 122.201
Quarter IV = 103.13 × 1.0007 = 103.202
Total = 400.000
$$\text{Average seasonal index} = \frac{400}{4} = 100$$
$$\frac{T \times S \times C \times I}{T \times C} = S \times I$$
Here, (T × S × C × I) is the influence of trend, seasonal variations, cyclic
fluctuations and irregular or chance variations.
Thus, the ratio to moving average represents the influence of seasonal
and irregular components. However, if these ratios for each quarter over a period
of years are averaged, then most random or irregular fluctuations would be
eliminated so that,
$$\frac{S \times I}{I} = S$$
and this would give us the value of seasonal influences.
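Steps 1 to 6 can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `seasonal_indices` is hypothetical, the period is assumed to be even (so centring is needed), and a plain mean is used in Step 5 because only two percentages are available per quarter, so nothing is trimmed. Full precision is kept throughout, which means the results agree with the hand-worked indices only to about two decimals.

```python
# Ratio-to-moving-average sketch for quarterly data.
rentals = [350, 300, 450, 400,   # 1991, quarters I-IV
           330, 360, 500, 410,   # 1992
           370, 350, 520, 440]   # 1993


def seasonal_indices(series, period=4):
    """Return one seasonal index per season, scaled to average 100."""
    # Steps 1-2: moving averages over `period` consecutive values.
    moving_avg = [sum(series[i:i + period]) / period
                  for i in range(len(series) - period + 1)]
    # Step 3: centre by averaging neighbouring moving averages; the j-th
    # centred value lines up with series[j + period // 2].
    centred = [(a + b) / 2 for a, b in zip(moving_avg, moving_avg[1:])]
    # Step 4: percentage of actual to centred moving average, by season.
    by_season = [[] for _ in range(period)]
    for j, c in enumerate(centred):
        idx = j + period // 2
        by_season[idx % period].append(100 * series[idx] / c)
    # Step 5: mean per season (no trimming with only two values each).
    means = [sum(values) / len(values) for values in by_season]
    # Step 6: rescale so that the indices total 100 * period.
    factor = 100 * period / sum(means)
    return [m * factor for m in means]


indices = seasonal_indices(rentals)
for name, value in zip(["I", "II", "III", "IV"], indices):
    print(name, round(value, 2))
```

With more years of data, Step 5 would normally drop the highest and lowest percentage for each quarter before averaging, as described in the text.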
reasoning explains such variation. For example, cold weather in Brazil and
Colombia is considered responsible for increase in the price of coffee beans,
because cold weather destroys coffee plants. Similarly, the Persian Gulf War,
an irregular factor, resulted in an increase in airline and ship travel for a number of
months because of the movement of personnel and supplies. However, the
irregular component can be isolated by eliminating other components from the
time series data. For example, time series data contains (T × S × C × I)
components and if we can eliminate (T × S × C) elements from the data, then
we are left with (I) component. We can follow the previous example to determine
the (I) component as follows. The data presented has already been provided or
calculated.
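| Year | Quarters | Rentals: Time Series Values (T × S × C × I) | Centred Moving Average (T × C) | (T × S × C × I)/(T × C) = S × I |
|---|---|---|---|---|
| 1991 | I | 350 | | |
| | II | 300 | | |
| | III | 450 | 372.50 | 1.208 |
| | IV | 400 | 377.50 | 1.060 |
| 1992 | I | 330 | 391.25 | 0.843 |
| | II | 360 | 398.75 | 0.903 |
| | III | 500 | 405.00 | 1.235 |
| | IV | 410 | 408.75 | 1.003 |
| 1993 | I | 370 | 410.00 | 0.902 |
| | II | 350 | 416.25 | 0.841 |
| | III | 520 | | |
| | IV | 440 | | |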
The seasonal indices for each quarter have already been calculated as:
Quarter I = 87.356
Quarter II = 87.241
Quarter III = 122.201
Quarter IV = 103.202
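Dividing each S × I value by the corresponding seasonal index (S), expressed as a
ratio, leaves the irregular component (I):

| Year | Quarter | S × I | S | I = (S × I)/S |
|---|---|---|---|---|
| 1991 | I | | | |
| | II | | | |
| | III | 1.208 | 1.222 | 0.988 |
| | IV | 1.060 | 1.032 | 1.027 |
| 1992 | I | 0.843 | 0.874 | 0.965 |
| | II | 0.903 | 0.872 | 1.036 |
| | III | 1.235 | 1.222 | 1.011 |
| | IV | 1.003 | 1.032 | 0.972 |
| 1993 | I | 0.902 | 0.874 | 1.032 |
| | II | 0.841 | 0.872 | 0.964 |
| | III | | | |
| | IV | | | |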
Seasonal Adjustments
Many times, we read about time series values as seasonally adjusted. This is
accomplished by dividing the original time series values by their corresponding
seasonal indices. These deseasonalized values allow more direct and equitable
comparisons of values from different time periods. For example, in comparing
the demands for rental row boats (example that we have been following), it
would not be equitable to compare the demand of second quarter (spring) with
the demand of third quarter (summer), when the demand is traditionally higher.
However, these demand values can be compared when we remove the seasonal
influence from these time series values.
The seasonally-adjusted values for the demand of row boats in each
quarter are based on the values previously calculated and shown as follows.
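| Year | Quarter | Rentals (T × S × C × I) | Seasonal Index (S) | Seasonally-Adjusted Values | Rounded-off Values |
|---|---|---|---|---|---|
| 1991 | I | 350 | | | |
| | II | 300 | | | |
| | III | 450 | 1.222 | 368.25 | 368 |
| | IV | 400 | 1.032 | 387.60 | 388 |
| 1992 | I | 330 | 0.874 | 377.57 | 378 |
| | II | 360 | 0.872 | 412.84 | 413 |
| | III | 500 | 1.222 | 409.16 | 409 |
| | IV | 410 | 1.032 | 397.29 | 397 |
| 1993 | I | 370 | 0.874 | 423.34 | 423 |
| | II | 350 | 0.872 | 401.38 | 401 |
| | III | 520 | | | |
| | IV | 440 | | | |

$$\text{Seasonally-adjusted value} = \frac{\text{Original Value}}{\text{Seasonal Index}}$$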
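Seasonal adjustment is a single division per observation, as the Python sketch below illustrates. The names are assumptions for illustration, and the seasonal indices are entered as ratios rounded to three decimals (for example 87.356 per cent becomes 0.874).

```python
# Deseasonalize: original value divided by its seasonal index (as a ratio).
seasonal_index = {"I": 0.874, "II": 0.872, "III": 1.222, "IV": 1.032}

# (year, quarter, rentals) for the quarters that have a centred average.
observations = [
    ("1991", "III", 450), ("1991", "IV", 400),
    ("1992", "I", 330), ("1992", "II", 360),
    ("1992", "III", 500), ("1992", "IV", 410),
    ("1993", "I", 370), ("1993", "II", 350),
]


def deseasonalize(value, quarter):
    """Remove the seasonal influence from one observation."""
    return value / seasonal_index[quarter]


adjusted = [round(deseasonalize(v, q), 2) for _, q, v in observations]
for (year, quarter, value), adj in zip(observations, adjusted):
    print(year, quarter, value, adj)
```

After adjustment the spring and summer quarters can be compared directly, since the large summer peak in raw rentals is mostly a seasonal effect.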
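| Year | Quarter I | Quarter II | Quarter III | Quarter IV |
|---|---|---|---|---|
| 1st year | 0.27 | 0.35 | 0.43 | 1.25 |
| 2nd year | 0.40 | 0.55 | 0.45 | 1.35 |
| 3rd year | 0.52 | 0.70 | 0.53 | 1.55 |
| 4th year | 0.60 | 0.80 | 0.64 | 1.85 |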
Analyse the quarterly time series to determine the effects of the trend,
cyclical, seasonal and irregular components.
Self-Assessment Questions
5. Fill in the blanks with the appropriate terms.
(a) Seasonal variation has been defined as the ________________ and
repetitive movement around the trend line in a period of one year or
less.
(b) Time series values can be seasonally ______________ by dividing
the original time series values by their corresponding seasonal
indices.
6. State whether true or false.
(a) Simple average is the difficult method of isolating seasonal
fluctuations in time series.
(b) Irregular variation is random in nature, unpredictable and occurs over
comparatively short periods of time.
8.5 Summary
Let us recapitulate the important concepts discussed in this unit:
The time series analysis method is quite accurate where the future is
expected to be similar to the past. The underlying assumption in time
series is that the same factors will continue to influence the future patterns
of economic activity in a similar manner as in the past.
Trend is a general long-term movement in the time series value of the
variable (Y) over a fairly long period of time. The variable (Y) is the factor
that we are interested in evaluating for the future.
Cyclic fluctuations refer to regular swings or patterns that repeat over a
long period of time. The movements are considered cyclical only if they
occur after time intervals of more than one year.
Changes in the climate and weather conditions have a profound effect on
sales. Customs and traditions affect the pattern of seasonal spending.
Irregular or random variations are accidental, random or simply due to
chance factors. Thus, they are wholly unpredictable.
When a time series shows an upward or downward long-term linear trend,
then regression analysis can be used to estimate this trend and project
the trends into forecasting the future values of the variables involved.
Cyclic variation is a pattern that repeats over time periods longer than
one year. These variations are generally unpredictable in relation to the
time of occurrence, duration as well as amplitude.
The measure used to identify cyclical variation is the percentage of trend
and the procedure used is known as the residual trend.
Seasonal variation has been defined as the predictable and repetitive
movement around the trend line in a period of one year or less. For the
measurement of seasonal variation, the time interval involved may be in
terms of days, weeks, months or quarters.
Seasonal index describes the degree of seasonal variation.
The moving average or the centred moving average aims to eliminate seasonal
and irregular fluctuations (S and I) from the original time series, so that this
average represents the cyclical and trend components of the series.
Irregular variation is random in nature, unpredictable and occurs over
comparatively short periods of time.
8.6 Glossary
Seasonal variation: Patterns of change that repeat over a period of one
year or less. The factors that cause seasonal variations are season and
climate and customs and festivals.
Irregular variations: These variations are unpredictable and can be
accidental, random or simply due to chance factor.
Cyclic variation: A pattern that repeats over time periods longer than
one year.
8.8 Answers
Answers to Self-Assessment Questions
1. (a) Movement; (b) Regular
2. (a) True; (b) True
3. (a) True; (b) False
4. (a) Estimate; (b) Repeats
5. (a) Predictable; (b) Adjusted
6. (a) False; (b) True
Unit 9
Testing of Hypothesis
Structure
9.1 Introduction
Objectives
9.2 Hypothesis Formulation
9.3 Summary
9.4 Glossary
9.5 Terminal Questions
9.6 Answers
9.7 Further Reading
9.1 Introduction
In the previous unit, you learnt about time series analysis and its use in
forecasting.
In this unit, you will learn about hypothesis, null and alternative hypotheses,
critical region, penalty, standard error and hypothesis testing. A hypothesis is an
assumption that is tested to find its logical or empirical consequences. It refers to
a provisional idea whose merit needs evaluation.
A hypothesis should be clear and accurate. Various concepts, such as null and
alternative hypotheses, enable us to verify the testability of an assumption. During
the course of hypothesis testing, inferences are made about population parameters
such as the mean and the proportion. Any useful hypothesis will enable predictions
by reasoning, including deductive reasoning. Statistical decisions have to be
made in the presence of uncertainty. The null hypothesis tested about the
population mean is that it has a specific value $\mu$. Testing a statistical hypothesis
on the basis of a sample enables us to decide whether the hypothesis should
be accepted or rejected. The Critical Region (CR) or Rejection Region (RR) is
the set of values of the test statistic for which the null hypothesis is rejected in a
hypothesis test.
Objectives
After studying this unit, you should be able to:
Describe the concepts of hypothesis and list the types of errors
Explain the null and alternate hypotheses
9.2.1 Statistical Decision-Making
| | Accept $H_0$ | Reject $H_0$ |
|---|---|---|
| $H_0$ is true | Accept True $H_0$: Desirable | Reject True $H_0$: Type I Error |
| $H_0$ is false | Accept False $H_0$: Type II Error | Reject False $H_0$: Desirable |
The level of significance implies the probability of Type I error. A five per
cent level implies that the probability of committing a Type I error is 0.05. A one
per cent level implies 0.01 probability of committing Type I error.
Lowering the significance level, and hence the probability of a Type I error, is
desirable, but unfortunately it increases the probability of committing a
Type II error.
To sum up:
Type I Error: Rejecting H0 when H0 is true.
Type II Error: Accepting H0 when H0 is false.
Note. The probability of making a Type I error is the level of significance of a statistical test. It is
denoted by $\alpha$.
Where,
9.2.3
A one-tailed test requires rejection of the null hypothesis when the sample
statistic is greater than the population value or less than the population value at
a certain level of significance.
1. We may want to test if the sample mean $\bar{X}$ exceeds the population mean $\mu$.
Then the null hypothesis is,
$H_0: \bar{X} > \mu$
2. In the other case the null hypothesis could be,
$H_0: \bar{X} < \mu$
Each of these two situations leads to a one-tailed test and has to be dealt
with in the same manner as the two-tailed test. Here, the critical region is on
one side only, right for $\bar{X} > \mu$ and left for $\bar{X} < \mu$. Both the Figures 9.1 and 9.2 here
show a five per cent level of test of significance.
For example, a minister in a certain government has an average life of 11
months without being involved in a scam. A new party claims to provide ministers
with an average life of more than 11 months without scam. We would like to test
if, on the average, the new ministers last longer than 11 months. We may write
the null hypothesis $H_0: \mu = 11$ and alternative hypothesis $H_1: \mu > 11$.
9.2.4 Critical Region
The Critical Region (CR), or Rejection Region (RR), is the set of values of the test
statistic for which the null hypothesis is rejected in a hypothesis test. That is,
the sample space for the test statistic is partitioned into two regions: one region,
the critical region, leads us to reject the null hypothesis $H_0$; the other does not.
So, if the observed value of the test statistic is a member of the critical region,
we reject $H_0$; if it is not a member of the critical region, we do not reject $H_0$.
We shall consider test problems arising out of Type I Error.
The level of significance of a test is the maximum probability with which
we are willing to take a risk of Type I error.
If we take a 5% significance level ($\alpha = 0.05$), we are 95% confident
($1 - \alpha = 0.95$) that a right decision has been made.
A 1% significance level ($\alpha = 0.01$) makes us 99% confident ($1 - \alpha = 0.99$)
about the correctness of the decision.
The critical region is the area of the sampling distribution in which the test
statistic must fall for the null hypothesis to be rejected.
We can say that the critical region corresponds to the range of values of
the statistic, which according to the test requires the hypothesis to be rejected.
Two-tailed and One-tailed Tests: A two-tailed test rejects the null
hypothesis if the sample mean is either more or less than the hypothesized
[Figure: two-tailed test at the 5% level of significance, with rejection regions beyond the limits Z = −1.96 and Z = +1.96 and 0.475 of the area on either side of the hypothesized mean]
For example, what will happen if the acceptance region is made larger?
$\alpha$ will decrease. It will be more easily possible to accept $H_0$ when $H_0$ is false (Type
II error), i.e., it will lower the probability $\alpha$ of making a Type I error, but raise that
of $\beta$, Type II error. $\alpha$, $\beta$ are probabilities of making an error; $1 - \alpha$, $1 - \beta$ are
probabilities of making correct decisions (refer Figure 9.5).
9.2.5 Penalty
Usually Type II error is considered the worse of the two though, it is mainly the
circumstances of a case that decide the answer to this question.
If Type I error means accepting the hypothesis that a guilty person is
innocent and if Type II error means accepting the hypothesis that an innocent
person is guilty, then Type II error would be dangerous. The penalties and costs
associated with an error determine the balance or trade off between Type I and
Type II errors.
Usually Type I error is shown as the shaded area, say 5% of a normal
curve which is supposed to represent the data. If the sample statistic, say the
sample mean, falls in the shaded area, the hypothesis is rejected at 5 per cent
level of significance.
9.2.6 Standard Error
The concept of Standard Error (SE) of statistics is used to test the precision of
a sample and provides the confidence limits for the corresponding population
parameter.
The statistic may be the sample arithmetic mean, the sample proportion
p, etc.
The SE of any such statistic is the standard deviation of the sampling
distribution of the statistic. Given below is SE in common use.
$$SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
$$SE(p_1 - p_2) = \sqrt{\frac{P_1 Q_1}{n_1} + \frac{P_2 Q_2}{n_2}}$$
9.2.8
We have to test the null hypothesis that the population mean has a specified
value μ, i.e., H0: X̄ = μ. For large n, if H0 is true then,
$$z = \frac{\bar{X} - \mu}{SE(\bar{X})}$$
is approximately normal. The theoretical region for z
depending on the desired level of significance can be calculated.
n = 900, X̄ = 4.45, μ = 5, σ = √4 = 2
$$z = \frac{\bar{X} - \mu}{SE(\bar{X})} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{4.45 - 5}{2/30} = -8.25$$
We have |z| > 3. The null hypothesis is rejected. The sample may not be
regarded as originating from the factory at the 0.27% level of significance
(corresponding to the 99.73% acceptance region).
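The calculation above can be checked with a short script. This is a minimal sketch, assuming the figures recovered from the worked example (n = 900, sample mean 4.45, hypothesized mean 5, population variance 4); the function name `z_for_mean` is illustrative and does not come from the text.

```python
import math


def z_for_mean(sample_mean, mu0, sigma, n):
    """Large-sample z statistic for H0: the population mean equals mu0."""
    standard_error = sigma / math.sqrt(n)  # SE of the sample mean
    return (sample_mean - mu0) / standard_error


# Figures from the worked example; sigma is the square root of the variance (4).
z = z_for_mean(sample_mean=4.45, mu0=5.0, sigma=math.sqrt(4), n=900)
print(round(z, 2))

# |z| > 3 lies outside the 99.73% acceptance region, so H0 is rejected.
reject_h0 = abs(z) > 3
print(reject_h0)
```

The sign of z only indicates that the sample mean lies below the hypothesized mean; the decision rule uses its absolute value.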
9.2.9
If P1, P2 are proportions of some characteristic of two samples of sizes n1, n2,
drawn from populations with proportions P1, P2, then we have H0: P1 = P2 vs
H1: P1 ≠ P2
Case (I): If H0 is true, then let P1 = P2 = p
Where, p can be found from the data:
$$p = \frac{n_1 P_1 + n_2 P_2}{n_1 + n_2}$$
$$q = 1 - p$$
$$SE(P_1 - P_2) = \sqrt{pq\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
$$z = \frac{P_1 - P_2}{SE(P_1 - P_2)}$$
is approximately normal (0, 1).
We write z ~ N(0, 1)
The usual rules for rejection or acceptance are applicable here.
Case (II): If it is assumed that the proportion under question is not the same in
the two populations from which the samples are drawn and that P1, P2 are the
true proportions, we write,
$$SE(P_1 - P_2) = \sqrt{\frac{P_1 q_1}{n_1} + \frac{P_2 q_2}{n_2}}$$
$$(P_1 - P_2) \pm z_{\alpha/2} \sqrt{\frac{P_1 q_1}{n_1} + \frac{P_2 q_2}{n_2}}$$
$$(P_1 - P_2) \pm 1.645 \sqrt{\frac{P_1 q_1}{n_1} + \frac{P_2 q_2}{n_2}}$$
Example 9.2: Out of 5000 interviewees, 2400 are in favour of a proposal, and
out of another set of 2000 interviewees, 1200 are in favour. Is the difference
significant?
Solution: Given,
n1 = 5000, P1 = 2400/5000 = 0.48
n2 = 2000, P2 = 1200/2000 = 0.6
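The Case (I) procedure can be applied to these figures as follows. This is a minimal sketch, assuming the pooled-proportion formula given earlier for H0: P1 = P2 and a two-tailed 5% critical value of 1.96; the function name `two_proportion_z` is illustrative.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: P1 = P2 using the pooled proportion (Case I)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # same as (n1*p1 + n2*p2) / (n1 + n2)
    q = 1 - p
    se = math.sqrt(p * q * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# Example 9.2: 2400 of 5000 and 1200 of 2000 interviewees in favour.
z = two_proportion_z(2400, 5000, 1200, 2000)
print(round(z, 2))

# Two-tailed test at the 5% level.
significant_at_5_percent = abs(z) > 1.96
print(significant_at_5_percent)
```

Under these assumptions |z| is far larger than 1.96, so the difference between the two proportions would be treated as significant.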
Suppose two samples of sizes n1 and n2 are drawn from populations having
means μ1, μ2 and standard deviations σ1, σ2.
To test the equality of means X̄1, X̄2 we write,
H0: μ1 = μ2
H1: μ1 ≠ μ2
If we assume H0 is true then,
$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
is approximately normally distributed with mean 0 and S.D. = 1.
We write z ~ N(0, 1)
$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{84 - 81}{\sqrt{\dfrac{121}{n_1} + \dfrac{81}{n_2}}} = 1.86 < 1.96$$
9.2.11
$$SE(\bar{X}) = \frac{s}{\sqrt{n - 1}}$$
Also,
$$s^2 = \frac{n - 1}{n} S^2 \quad \text{or} \quad s = S\sqrt{\frac{n - 1}{n}}$$
Calculate
$$t = \frac{\bar{X} - \mu}{SE(\bar{X})}$$
and compare it with the table value with n − 1 degrees of freedom. For a two-tailed test:
If |t| > t0.05, reject H0.
If |t| < t0.05, accept H0.
$$t = \frac{\bar{X} - \mu}{s/\sqrt{n - 1}} = \frac{2.01 - 2}{\sqrt{0.004}/\sqrt{10 - 1}} = \frac{0.01}{0.021} = 0.48$$
$$t = \frac{\bar{d}}{S/\sqrt{n}}, \qquad S = \sqrt{\frac{\sum (d - \bar{d})^2}{n - 1}}$$
Example 9.5: Eleven students were given a test and their marks noted. After
training, their marks in a second test were noted. Do the marks indicate any
benefit from training?
Solution:
Student                  1   2   3   4   5   6   7   8   9  10  11
Marks before training   23  20  19  21  18  20  18  17  23  16  19
Marks after training    24  19  22  18  20  22  20  20  23  20  17
$$\bar{d} = \frac{\sum d_i}{n} = \frac{11}{11} = 1$$
$$s = \sqrt{\frac{\sum (d - \bar{d})^2}{n - 1}}$$
df = 11 − 1 = 10
t
10
2.49
0.121
2.24 / 11 2.49 11
$$H_0: \sigma^2 = \sigma_0^2$$
$$H_1: \sigma^2 \neq \sigma_0^2$$
Test statistic
$$\chi^2 = \frac{n s^2}{\sigma_0^2} = \frac{\sum (x - \bar{x})^2}{\sigma_0^2}$$
Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) A hypothesis is an approximate _________________ that a
researcher wants to test for its logical or empirical consequences.
(b) The Critical Region (CR) or Rejection Region (RR) is a set of values
for testing statistic for which the ________________ hypothesis is
rejected in a hypothesis test.
2. State whether true or false.
(a) Hypothesis should be clear and accurate so as to draw a consistent
conclusion.
(b) Type I error can not be controlled by fixing it at a lower level.
9.3 Summary
Let us recapitulate the important concepts discussed in this unit:
9.4 Glossary
Hypothesis: An approximate assumption about population parameters
like variance and expected value that is tested by a researcher for its
logical or empirical consequences.
Critical region: A set of values for testing statistic for which the null
hypothesis is rejected and the alternate hypothesis is accepted in a
hypothesis test.
9.6 Answers
Answers to Self-Assessment Questions
1. (a) Assumption; (b) Null
2. (a) True; (b) False
Unit 10
Chi-Square Test
Structure
10.1 Introduction
Objectives
10.2 Chi-Square Test
10.3 Summary
10.4 Glossary
10.5 Terminal Questions
10.6 Answers
10.7 Further Reading
10.1 Introduction
In the previous unit you learnt about testing of hypothesis. The test statistic for
accepting or rejecting a null hypothesis can also be evaluated using χ². In this unit you will
learn about the Chi-square test, also called the Chi-squared or χ² test. Any statistical
hypothesis test in which the test statistic has a Chi-square distribution when
the null hypothesis is true is termed a Chi-square test. The Chi-square test is a
non-parametric test of statistical significance for bivariate tabular analysis, also
known as cross-breaks. Amongst the several tests used in statistics for judging
the significance of the sampling data, the Chi-square test, developed by Karl Pearson,
is considered an important test. Chi-square, symbolically written as χ²
(pronounced as Ki-square), is a statistical measure with the help of which it is
possible to assess the significance of the difference between the observed
frequencies and the expected frequencies obtained from some hypothetical
universe. Chi-square tests enable us to test and compare whether more than
two population proportions can be considered equal. Hence, it is a statistical
test commonly used to compare observed data with expected data and to test
the null hypothesis, which states that there is no significant difference between
the expected and the observed result.
Objectives
After studying this unit, you should be able to:
Explain the Chi-square test of significance
Describe the degrees of freedom
$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$
Where, f0 is the observed frequency and fe is the expected frequency.
The expected frequency of any one number coming upward is 132 × 1/6 = 22. Now,
we can write the observed frequencies along with expected frequencies and
work out the value of χ² as follows:
No. Turned Up   Observed Frequency (f0)   Expected Frequency (fe)   (f0 − fe)   (f0 − fe)²   (f0 − fe)²/fe
1               16                        22                        −6          36           36/22
2               20                        22                        −2          4            4/22
3               25                        22                        3           9            9/22
4               14                        22                        −8          64           64/22
5               29                        22                        7           49           49/22
6               28                        22                        6           36           36/22
$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e} = 9$$
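The table above can be reproduced in a few lines. This is a minimal sketch, assuming the six observed frequencies shown and an unbiased die, so that every face has the same expected frequency; the variable names are illustrative.

```python
# Observed frequencies of faces 1 to 6 in 132 throws.
observed = [16, 20, 25, 14, 29, 28]

total_throws = sum(observed)  # 132
# For an unbiased die each face is expected total/6 times.
expected = [total_throws / len(observed)] * len(observed)

# Chi-square = sum of (f0 - fe)^2 / fe over all classes.
chi_square = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
print(round(chi_square, 2))
```

The result is then compared with the table value of χ² for the appropriate degrees of freedom to decide whether the die can be regarded as unbiased.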
Example 10.2:
Find the value of χ² for the following information:
Class   Observed Frequency   Theoretical (or Expected) Frequency
A       8                    7
B       29                   24
C       44                   38
D       15                   24
E       4                    7
Solution:
Since some of the frequencies are less than 10, we shall first regroup the given
data as follows and then work out the value of χ²:
Class     Observed Frequency (f0)   Expected Frequency (fe)   (f0 − fe)   (f0 − fe)²/fe
A and B   (8 + 29) = 37             (7 + 24) = 31             6           36/31
C         44                        38                        6           36/38
D and E   (15 + 4) = 19             (24 + 7) = 31             −12         144/31
$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e} = 6.76 \text{ approx.}$$
   a          b        (a + b)
   c          d        (c + d)
(a + c)    (b + d)        N
Then the formula for calculating the value of χ² will be stated as follows:
$$\chi^2 = \frac{(ad - bc)^2 \, N}{(a + c)(b + d)(a + b)(c + d)}$$
Where, N means the total frequency, ad means the larger cross product,
bc means the smaller cross product and (a + c), (b + d), (a + b) and (c + d) are
the marginal totals. The alternative formula is rarely used in finding out the
value of Chi-square as it is not applicable uniformly in all cases but can be used
only in a (2 × 2) contingency table.
$$\chi^2 (\text{corrected}) = \frac{N \left( |ad - bc| - 0.5N \right)^2}{(a + b)(c + d)(a + c)(b + d)}$$
In case we use the usual formula for calculating the value of Chi-square
viz., χ² = Σ (f0 − fe)²/fe, the corrected value is worked out as:
$$\chi^2 (\text{corrected}) = \frac{\left( |f_{01} - f_{e1}| - 0.5 \right)^2}{f_{e1}} + \frac{\left( |f_{02} - f_{e2}| - 0.5 \right)^2}{f_{e2}} + \dots$$
judge if a random sample has been drawn from a normal population with mean
(μ) and with specified variance (σp)². In such a situation, the test statistic for a
null hypothesis will be as under:
$$\chi^2 = \frac{\sum (X_i - \bar{X}_s)^2}{(\sigma_p)^2} = \frac{n(\sigma_s)^2}{(\sigma_p)^2}$$
By comparing the calculated value (with the help of the above formula)
with the table value of χ² for (n − 1) df at a certain level of significance, we may
accept or reject the null hypothesis. If the calculated value is equal to or less than
the table value, the null hypothesis is to be accepted but if the calculated value
is greater than the table value, the hypothesis is rejected. All this can be made
clear by an example.
Example 10.3:
Weight of 10 students is as follows:
Sl. No.         1    2    3    4    5    6    7    8    9   10
Weight in kg   38   40   45   53   47   43   55   48   52   49
Can we say that the variance of the distribution of weights of all students
from which the above sample of 10 students was drawn is equal to 20 square
kg? Test this at 5% and 1% level of significance.
Solution:
First of all, we should work out the standard deviation of the sample (σs).
Calculation of the sample standard deviation:
Sl. No.   Xi (Weight in kg)   (Xi − X̄s)   (Xi − X̄s)²
1         38                  −9           81
2         40                  −7           49
3         45                  −2           04
4         53                  +6           36
5         47                  0            00
6         43                  −4           16
7         55                  +8           64
8         48                  +1           01
9         52                  +5           25
10        49                  +2           04
n = 10    ΣXi = 470                        Σ(Xi − X̄s)² = 280
$$\bar{X}_s = \frac{\sum X_i}{n} = \frac{470}{10} = 47 \text{ kg}$$
$$\sigma_s = \sqrt{\frac{\sum (X_i - \bar{X}_s)^2}{n}} = \sqrt{\frac{280}{10}} = \sqrt{28} = 5.3 \text{ kg}$$
(σs)² = 28
Taking the null hypothesis as H0: (σp)² = (σs)², the test statistic is
$$\chi^2 = \frac{n(\sigma_s)^2}{(\sigma_p)^2} = \frac{10 \times 28}{20} = \frac{280}{20} = 14$$
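The computation for Example 10.3 can be sketched as below. The data and the hypothesized variance of 20 come from the example; the two critical values (16.919 at the 5% level and 21.666 at the 1% level, both for 9 degrees of freedom) are an assumption taken from standard χ² tables, since the table lookup is not reproduced in the text.

```python
weights = [38, 40, 45, 53, 47, 43, 55, 48, 52, 49]
hypothesized_variance = 20.0

n = len(weights)
sample_mean = sum(weights) / n
sum_of_squares = sum((x - sample_mean) ** 2 for x in weights)

# chi-square = sum (Xi - mean)^2 / hypothesized variance = n * (sample variance) / (sigma_p)^2
chi_square = sum_of_squares / hypothesized_variance
degrees_of_freedom = n - 1

# Assumed table values for 9 degrees of freedom (standard chi-square table).
critical_5_percent = 16.919
critical_1_percent = 21.666

accept_at_5_percent = chi_square <= critical_5_percent
accept_at_1_percent = chi_square <= critical_1_percent
print(chi_square, degrees_of_freedom, accept_at_5_percent, accept_at_1_percent)
```

Under these assumed table values the calculated value of 14 is below both critical values, which would mean the null hypothesis is accepted at both levels.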
Investigation   χ²    df
1               2.5   1
2               3.2   1
3               4.1   1
4               3.7   1
5               4.5   1
What conclusion would you draw about the effectiveness of the new
medicine on the basis of the five investigations taken together?
Solution: By adding all the values of χ², we obtain a value equal to 18.0. Also
by adding the various d.f. as given in the question, we obtain a figure 5. We can
now state that the value of χ² for 5 degrees of freedom (when all the five
investigations are taken together) is 18.0.
Let us take the hypothesis that the new medicine is not effective. The
table value of χ² for 5 degrees of freedom at 5% level of significance is 11.070.
But our calculated value is higher than this table value which means that the
difference is significant and is not due to chance. As such the hypothesis is
rejected and it can be concluded that the new medicine is effective in checking
malaria.
Activity 1
200 digits were chosen at random from a set of tables. The frequencies of
the digits were:
Digit        0    1    2    3    4    5    6    7    8    9
Frequency   18   19   23   21   16   25   22   20   21   15
Calculate χ².
Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) Chi-square test is a non-parametric test of statistical significance for
______________ tabular analysis.
(b) χ² is used to test the significance of population variance (σp)² through
_____________ intervals.
2. State whether true or false.
(a) Chi-square tests enable us to test whether more than two population
proportions can be considered equal.
(b) Chi-square test is based on frequencies and also on the parameters
like mean and standard deviation.
10.3 Summary
Let us recapitulate the important concepts discussed in this unit:
Chi-square test is a non-parametric test of statistical significance for
bivariate tabular analysis (also known as cross-breaks).
The Chi-square test is any statistical hypothesis test, in which the test
statistics has a chi-square distribution when the null hypothesis is true.
Chi-square, symbolically written as χ² (pronounced as Ki-square), is a
statistical measure with the help of which, it is possible to assess the
significance of the difference between the observed frequencies and the
expected frequencies obtained from some hypothetical universe.
10.4 Glossary
Chi-square test: A non-parametric test of statistical significance used to
compare observed data with expected data. It also tests the validity of
null hypothesis.
Degrees of freedom: The number of independent observations in a
sample of data to estimate a parameter of the population from which that
sample is drawn.
10.6 Answers
Answers to Self-Assessment Questions
1. (a) Bivariate; (b) Confidence
2. (a) True; (b) False
Unit 11
t-Test, z-Test and Analysis of Variance
Structure
11.1 Introduction
Objectives
11.2 t-Test
11.3 z-Test
11.4 Analysis of Variance
11.5 Summary
11.6 Glossary
11.7 Terminal Questions
11.8 Answers
11.9 Further Reading
11.1 Introduction
In the previous unit, you learnt about the Chi-squared or χ² test, which is a
non-parametric test of statistical significance for bivariate tabular analysis. In this
unit you will learn about the t-test, the z-test and analysis of variance or ANOVA. The z-test
and t-test are basically the same, as they compare two means to suggest
whether both samples come from the same population. A t-test is any statistical
hypothesis test in which the test statistic follows a Student's t distribution if the
null hypothesis is supported. It is most commonly applied when the test statistic
would follow a normal distribution if the population standard deviation were known. Similarly, a z-test is any statistical test for
which the distribution of the test statistic under the null hypothesis can be
approximated by a normal distribution. In statistics, analysis of variance (ANOVA)
is a collection of statistical models and their associated procedures in which the
observed variance in a particular variable is partitioned into components
attributable to different sources of variation.
Objectives
After studying this unit, you should be able to:
Explain the significance of t-test
Discuss the importance of z-test
Define analysis of variance or ANOVA
Explain degrees of freedom and F distribution
11.2 t-Test
William S. Gosset (pen name 'Student') developed a significance test and
through it made a significant contribution to the theory of sampling applicable in
case of small samples. When population variance is not known, the test is
commonly known as Student's t-test and is based on the t distribution.
Like normal distribution, t distribution is also symmetrical but happens to
be flatter than normal distribution. Moreover, there is a different t distribution
for every possible sample size. As the sample size gets larger, the shape of the
t distribution loses its flatness and becomes approximately equal to the normal
distribution. In fact, for sample sizes of more than 30, the t distribution is so
close to the normal distribution that we will use the normal to approximate the t
distribution. Thus, when n is small, the t distribution is far from normal, but when
n is infinite, it is identical to normal distribution.
For applying t-test in context of small samples, the t value is calculated
first of all and, then the calculated value is compared with the table value of t at
certain level of significance for given degrees of freedom. If the calculated value
of t exceeds the table value (say t0.05), we infer that the difference is significant
at the 5% level, but if the calculated value of t is less than the corresponding table value, the
difference is not treated as significant.
The t-test is used when two conditions are fulfilled:
(i) The sample size is less than 30, i.e., when n < 30.
(ii) The population standard deviation (σp) must be unknown.
In using the t-test, we assume the following:
(i) That the population is normal or approximately normal;
(ii) That the observations are independent and the samples are randomly
drawn samples;
(iii) That there is no measurement error;
(iv) That in the case of two samples, population variances are regarded as
equal if equality of the two population means is to be tested.
The following formulae are commonly used to calculate the t value:
(i) To test the significance of the mean of a random sample
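$$t = \frac{|\bar{X} - \mu|}{SE_{\bar{X}}}$$
Where,
$$SE_{\bar{X}} = \frac{\sigma_s}{\sqrt{n}}, \qquad \sigma_s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$
and the degrees of freedom = (n − 1)
The above stated formula for t can as well be stated as under:
$$t = \frac{|\bar{X} - \mu|}{SE_{\bar{X}}} = \frac{|\bar{X} - \mu|}{\sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}} \Big/ \sqrt{n}} = \frac{|\bar{X} - \mu|}{\sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}}} \times \sqrt{n}$$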
If we want to work out the probable or fiducial limits of population mean
(μ) in case of small samples, we can use either of the following:
(a) Probable limits with 95% confidence level:
$$\mu = \bar{X} \pm SE_{\bar{X}} \, (t_{0.05})$$
(b) Probable limits with 99% confidence level:
$$\mu = \bar{X} \pm SE_{\bar{X}} \, (t_{0.01})$$
At other confidence levels, the limits can be worked out in a similar manner,
taking the corresponding table value of t just as we have taken t0.05 in (a) and t0.01 in
(b) above.
(ii) To test the difference between the means of the two samples
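$$t = \frac{|\bar{X}_1 - \bar{X}_2|}{SE_{\bar{X}_1 - \bar{X}_2}}$$
Where,
$$SE_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2}} \times \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
The term
$$\frac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2}$$
can be worked out by the following short-cut formula:
$$\frac{\sum (X_{1i} - A_1)^2 + \sum (X_{2i} - A_2)^2 - n_1 (\bar{X}_1 - A_1)^2 - n_2 (\bar{X}_2 - A_2)^2}{n_1 + n_2 - 2}$$
Where, A1 and A2 are the assumed means of the two samples.
(iii) To test the significance of the coefficient of correlation
$$t = \frac{r}{\sqrt{1 - r^2}} \times \sqrt{n - 2}$$
(iv) To test the difference in case of paired samples (difference test)
$$t = \frac{\bar{X}_{Diff} - 0}{\sigma_{Diff}/\sqrt{n}} = \frac{\bar{X}_{Diff} - 0}{\sigma_{Diff}} \times \sqrt{n}$$
Where,
$$\sigma_{Diff} = \sqrt{\frac{\sum (D - \bar{X}_{Diff})^2}{(n - 1)}} \quad \text{or} \quad \sqrt{\frac{\sum D^2 - (\bar{D})^2 \, n}{(n - 1)}}$$
D = Differences
n = Number of pairs in two samples and is based on (n − 1) degrees of
freedom.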
The following examples would illustrate the application of t-test using the
above stated formulae.
Example 11.1:
A sample of 10 measurements of the diameter of a sphere gave a mean
X̄ = 4.38 inches and a standard deviation, σs = 0.06 inches. Find (a) 95% and
(b) 99% confidence limits for the actual diameter.
Solution:
On the basis of the given data the standard error of mean:
$$SE_{\bar{X}} = \frac{\sigma_s}{\sqrt{n - 1}} = \frac{0.06}{\sqrt{10 - 1}} = \frac{0.06}{3} = 0.02$$
Assuming the sample mean 4.38 inches to be the population mean, the
required limits are as follows:
(i) 95% confidence limits
= X̄ ± SE X̄ (t0.05) with 9 degrees of freedom
= 4.38 ± .02(2.262) = 4.38 ± .0452
i.e.,
4.335 to 4.425
(ii) 99% confidence limits
= X̄ ± SE X̄ (t0.01) with 9 degrees of freedom
= 4.38 ± .02(3.25) = 4.38 ± .0650
i.e.,
4.3150 to 4.4450.
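The limits in Example 11.1 can be reproduced with a small helper. This is a minimal sketch, assuming the standard error is taken as σs/√(n − 1) as in the solution, and that the table values of t for 9 degrees of freedom are 2.262 (5% level) and 3.25 (1% level); the function name is illustrative.

```python
import math


def t_confidence_limits(mean, sample_sd, n, t_table):
    """Return (lower, upper) limits: mean +/- SE * t, with SE = sd / sqrt(n - 1)."""
    standard_error = sample_sd / math.sqrt(n - 1)
    margin = standard_error * t_table
    return mean - margin, mean + margin


# Table values of t for 9 degrees of freedom (two-tailed): 2.262 at 5%, 3.25 at 1%.
low95, high95 = t_confidence_limits(4.38, 0.06, 10, 2.262)
low99, high99 = t_confidence_limits(4.38, 0.06, 10, 3.25)

print(round(low95, 3), round(high95, 3))
print(round(low99, 3), round(high99, 3))
```

A higher confidence level uses a larger table value of t and therefore gives wider limits.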
Example 11.2:
The sales data of an item in six shops before and after a special promotional
campaign are:
Before the promotional campaign   53   28   31   48   50   42
After the campaign                58   29   30   55   56   45
Solution:
Before   After   Difference (D)   (D − D̄)   (D − D̄)²
53       58      +5               +1.5       2.25
28       29      +1               −2.5       6.25
31       30      −1               −4.5       20.25
48       55      +7               +3.5       12.25
50       56      +6               +2.5       6.25
42       45      +3               −0.5       0.25
n = 6            ΣD = 21                     Σ(D − D̄)² = 47.50
$$\sigma_{Diff} = \sqrt{\frac{\sum (D - \bar{D})^2}{n - 1}} = \sqrt{\frac{47.50}{6 - 1}} = 3.08$$
$$t = \frac{\bar{X}_{Diff} - 0}{\sigma_{Diff}} \times \sqrt{n}$$
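The difference test for Example 11.2 can be worked through in code. This is a minimal sketch, assuming the before and after sales figures shown for the six shops and the paired formula t = (mean difference / σDiff) × √n with (n − 1) in the denominator of σDiff; the function name `paired_t` is illustrative.

```python
import math


def paired_t(before, after):
    """Return (mean difference, sd of differences, t) for paired observations."""
    differences = [a - b for b, a in zip(before, after)]
    n = len(differences)
    mean_diff = sum(differences) / n
    # Standard deviation of the differences with (n - 1) in the denominator.
    sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in differences) / (n - 1))
    t = (mean_diff - 0) / sd_diff * math.sqrt(n)
    return mean_diff, sd_diff, t


before = [53, 28, 31, 48, 50, 42]
after = [58, 29, 30, 55, 56, 45]

mean_diff, sd_diff, t = paired_t(before, after)
print(mean_diff, round(sd_diff, 2), round(t, 2))
```

The calculated t is then compared with the table value for (n − 1) = 5 degrees of freedom at the chosen level of significance.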
Student         1    2    3    4    5    6    7    8    9
Before (XBi)   10   15    9    3    7   12   16   17    4
After (XAi)    12   17    8    5    6   11   18   20    3
Solution:
We take the hypothesis that training was not effective. We can write,
H0: X̄A = X̄B, Ha: X̄A > X̄B. We apply the difference test for which purpose first of
all we calculate the mean and standard deviation of difference as follows:
Students   Before XBi   After XAi   Difference = D   D²
1          10           12          2                4
2          15           17          2                4
3          9            8           −1               1
4          3            5           2                4
5          7            6           −1               1
6          12           11          −1               1
7          16           18          2                4
8          17           20          3                9
9          4            3           −1               1
n = 9                               ΣD = 7           ΣD² = 29
$$\bar{D} = \frac{\sum D}{n} = \frac{7}{9} = 0.78$$
$$\sigma_{Diff} = \sqrt{\frac{\sum D^2 - (\bar{D})^2 \, n}{n - 1}} = \sqrt{\frac{29 - (0.78)^2 \times 9}{9 - 1}} = 1.71$$
$$t = \frac{0.78}{1.71} \times \sqrt{9} = 1.369$$
Degrees of freedom = (n − 1) = (9 − 1) = 8
Table value of t at 5% level of significance for 8 degrees of freedom
= 1.860 for one-tailed test.
Since the calculated value of t is less than its table value, the difference is
insignificant and the null hypothesis is accepted. Hence it can be inferred that the training
was not effective.
Example 11.4:
It was found that the coefficient of correlation between two variables calculated
from a sample of 25 items was 0.37. Test the significance of r at 5% level with
the help of t-test.
Solution:
To test the significance of r through t-test, we use the following formula for
calculating t value:
$$t = \frac{r}{\sqrt{1 - r^2}} \times \sqrt{n - 2} = \frac{0.37}{\sqrt{1 - (0.37)^2}} \times \sqrt{25 - 2} = 1.903$$
Degrees of freedom = (n − 2) = (25 − 2) = 23
The table value of t at 5% level of significance for 23 degrees of freedom
is 2.069 for a two-tailed test.
The calculated value of t is less than its table value, hence r is insignificant.
Activity 1
Select a variable. Compare the mean of the variable for a sample of 10 for
one group with the mean of the variable for a sample of 10 for a second
group using t-test.
Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) When population ____________ is not known, the test is commonly
known as Student's t-test and is based on the t distribution.
(b) In t-test for the case of two samples population variances are
regarded as equal if _____________ of the two population means is
to be tested.
2. State whether true or false.
(a) Like normal distribution, t distribution is not symmetrical but happens
to be flatter than normal distribution.
(b) When n is small, the t distribution is far from normal but when n is
infinite it is identical with normal distribution.
11.3 z-Test
A z-test is any statistical test for which the distribution of the test statistic can be
approximated by normal distribution under the null hypothesis.
11.3.1
R.A. Fisher developed the z-test to test the significance of the correlation
coefficient in small samples. While applying the test, r of the sample is
transformed into z on account of which the test is also known as z transformation.
The z transformation is done as under:
$$z = \frac{1}{2} \log_e \frac{1 + r}{1 - r} = 1.15129 \, \log_{10} \frac{(1 + r)}{(1 - r)}$$
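$$SE_z = \frac{1}{\sqrt{n - 3}}$$
Finally the value of the Standard Normal Variate (S.N.V.) is calculated as follows:
$$S.N.V. = \frac{|z - \xi|}{\dfrac{1}{\sqrt{n - 3}}} = |z - \xi| \sqrt{n - 3}$$
Where, ξ denotes the z-transformed value of the hypothetical population correlation ρ.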
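Transforming r = 0.5:
$$z = 1.15129 \log_{10} \frac{1 + r}{1 - r} = 1.15129 \log_{10} \frac{1 + 0.5}{1 - 0.5} = 1.15129 \log_{10} \frac{1.5}{0.5} = 1.15129 \log_{10} 3 = 1.15129 \times 0.4771 = 0.549$$
Transforming ρ = 0.7 (the transformed value is written ξ here):
$$\xi = 1.15129 \log_{10} \frac{1 + 0.7}{1 - 0.7} = 1.15129 \log_{10} \frac{1.7}{0.3} = 1.15129 \log_{10} 5.67 = 1.15129 \times 0.7536 = 0.868$$
$$S.N.V. = \frac{|z - \xi|}{\dfrac{1}{\sqrt{n - 3}}} = \frac{|0.549 - 0.868|}{\dfrac{1}{\sqrt{19 - 3}}} = \frac{0.319}{\dfrac{1}{\sqrt{16}}} = 0.319 \times 4 = 1.276$$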
Since the difference (0.319) is only 1.276 times the S.E., it is insignificant
at 5% level and hence could have arisen due to sampling fluctuations. In other
words, the hypothesis stands and ρ may be taken as 0.7.
As, it has been stated above, z-test is also used to test the significance of
the difference between two independent correlation coefficients. For this purpose,
first of all, r1 and r2 values are transformed in the similar manner (as stated
above) into z1 and z2 values respectively, and then the standard error of difference
between z1 and z2 is worked out as under:
$$S.E._{Diff \, (z_1 - z_2)} = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}$$
Finally, we work out the ratio:
$$\frac{|z_1 - z_2|}{S.E._{z_1 - z_2}}$$
Example 11.6:
Given the following information:
            No. of Items in the Sample   Coefficient of Correlation
Sample 1    23                           0.40
Sample 2    19                           0.65
Test the significance of the difference, at 5% level, between the two given
values of coefficient of correlation, using z-transformation.
Solution:
Applying z-test, we obtain z1 and z2 values as under:
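$$z_1 = 1.15129 \log_{10} \frac{1 + r_1}{1 - r_1} = 1.15129 \log_{10} \frac{1 + 0.40}{1 - 0.40} = 1.15129 \log_{10} 2.333 = 0.424$$
$$z_2 = 1.15129 \log_{10} \frac{1 + r_2}{1 - r_2} = 1.15129 \log_{10} \frac{1 + 0.65}{1 - 0.65} = 1.15129 \log_{10} 4.71 = 0.775$$
$$S.E._{z_1 - z_2} = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}} = \sqrt{\frac{1}{20} + \frac{1}{16}} = \sqrt{\frac{9}{80}} = 0.335$$
$$\frac{|z_1 - z_2|}{S.E._{z_1 - z_2}} = \frac{|0.424 - 0.775|}{0.335} = 1.05$$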
As this ratio is less than 1.96, the difference between the two given values
of coefficient of correlation at 5% level is insignificant and it can be concluded
that the two samples come from the same population.
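The z-transformation and the comparison of two correlation coefficients can be sketched as follows. This is a minimal illustration, assuming the transformation z = ½ logₑ[(1 + r)/(1 − r)] and the standard error √[1/(n1 − 3) + 1/(n2 − 3)] stated above; the function names are illustrative.

```python
import math


def fisher_z(r):
    """z transformation of a correlation coefficient r."""
    return 0.5 * math.log((1 + r) / (1 - r))


def correlation_difference_ratio(r1, n1, r2, n2):
    """Ratio |z1 - z2| / S.E. for two independent correlation coefficients."""
    z1, z2 = fisher_z(r1), fisher_z(r2)
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return abs(z1 - z2) / standard_error


# Example 11.6: r1 = 0.40 from 23 items, r2 = 0.65 from 19 items.
ratio = correlation_difference_ratio(0.40, 23, 0.65, 19)
print(round(ratio, 2))

# The difference is significant at the 5% level only if the ratio exceeds 1.96.
significant_at_5_percent = ratio > 1.96
print(significant_at_5_percent)
```

Using natural logarithms directly avoids the 1.15129 × log₁₀ conversion, which is only needed when working from common logarithm tables.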
Self-Assessment Questions
3. State whether true or false.
(a) z-test is used to test the significance of the correlation coefficient in
small samples.
(b) The statistic z is used to test whether an observed value of r is
significantly different from a given hypothetical or known value of
population correlation.
The alternate hypothesis (H1) will state that at least two means are different
from each other. In order to accept the null hypothesis, all means must be
equal. Even if one mean is not equal to the others, then we cannot accept the
null hypothesis. The simultaneous comparison of several population means is
called ANalysis Of VAriance or ANOVA.
Assumptions
The methodology of ANOVA is based on the following assumptions.
(i) Each sample of size n is drawn randomly and each sample is independent
of the other samples.
(ii) The populations are normally distributed.
(iii) The populations from which the samples are drawn have equal variances.
This means that:
$$\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \dots = \sigma_k^2, \text{ for } k \text{ populations.}$$
11.4.1
Why do we call it the Analysis of Variance, even though we are testing for
means? Why not simply call it the Analysis of Means? How do we test for means
by analysing the variances? As a matter of fact, in order to determine if the
means of several populations are equal, we do consider the measure of variance,
σ².
The population variance, σ², is estimated in two different ways, each
by a different method. One approach is to compute
an estimator of σ² in such a manner that even if the population means are not
equal, it will have no effect on the value of this estimator. This means that the
differences in the values of the population means do not alter the value of σ² as
calculated by a given method. This estimator of σ² is the average of the variances
found within each of the samples. For example, if we take 10 samples of size n,
then each sample will have a mean and a variance. Then, the mean of these 10
variances would be considered as an unbiased estimator of σ², the population
variance, and its value remains appropriate irrespective of whether the population
means are equal or not. This is really done by pooling all the sample variances
to estimate a common population variance, which is the average of all sample
variances. This common variance is known as variance within samples or σ²within.
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
or, the variance would be:
$$\sigma^2_{\bar{X}} = \frac{\sigma^2}{n}$$
or,
$$\sigma^2 = n \, \sigma^2_{\bar{X}}$$
$$F = \frac{\sigma^2_{between}}{\sigma^2_{within}}$$
In the above case, if the population means are exactly the same, then
σ²between will be equal to σ²within and the value of F will be equal to 1.
11.4.2
Degrees of Freedom
We have talked about the F distribution being a family of curves, each curve
reflecting the degrees of freedom relative to both σ²between and σ²within. This means
that the degrees of freedom are associated both with the numerator as well as
with the denominator of the F-ratio.
(i) The numerator. Since the variance between samples, σ²between, comes
from many samples and if there are k number of samples, then the degrees
of freedom associated with the numerator would be (k − 1).
(ii) The denominator is the mean variance of the variances of k samples
and since each variance in each sample is associated with the size of
the sample (n), the degrees of freedom associated with each sample
would be (n − 1). Hence, the total degrees of freedom would be the sum
of degrees of freedom of k samples or,
df = k(n − 1), when each sample is of size n.
11.4.3
The F Distribution
are the degrees of freedom in the numerator and the degrees of freedom
in the denominator. The shape of the curve changes as the number of
degrees of freedom changes.
(ii) It is a continuous distribution and the value of F cannot be negative.
(iii) The curve representing the F distribution is positively skewed.
(iv) The values of F theoretically range from zero to infinity.
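A diagram of F distribution curve is shown below.
[Figure: F distribution curve, with the 'Do not reject H0' region to the left of the critical value of F and the 'Reject H0' region in the right tail.]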
The rejection region is only in the right end tail of the curve because
unlike z distribution and t distribution which had negative values for areas below
the mean, F distribution has only positive values by definition and only positive
values of F that are larger than the critical values of F, will lead to a decision to
reject the null hypothesis.
Computation of F
The F ratio contains only two elements, which are the variance between the
samples and the variance within the samples.
If all the means of samples were exactly equal and all samples were
exactly representative of their respective populations so that all the sample
means, were exactly equal to each other and to the population mean, then
there will be no variance. However, this can never be the case. We always have
variation, both between samples and within samples, even if we take these
samples randomly and from the same population. This variation is known as
the total variation.
The total variation, designated by $\sum (X - \bar{\bar{X}})^2$, where X represents
individual observations for all samples and $\bar{\bar{X}}$ is the grand mean of all sample
means and equals (μ), the population mean, is also known as the total sum of
squares or SST, and is simply the sum of squared differences between each
observation and the overall mean. This total variation represents the contribution
of two elements. These elements are:
(A) Variance between samples. The variance between samples may be due
to the effect of different treatments, meaning that the population means may be
affected by the factor under consideration, thus, making the population means
actually different, and some variance may be due to the inter-sample variability.
This variance is also known as the sum of squares between samples. Let this
sum of squares be designated as SSB.
Then, SSB is calculated by the following steps:
(i) Take k samples of size n each and calculate the mean of each sample,
i.e., $\bar{X}_1, \bar{X}_2, \bar{X}_3, \dots, \bar{X}_k$.
(ii) Calculate the grand mean $\bar{\bar{X}}$ of the distribution of these sample means,
so that,
$$\bar{\bar{X}} = \frac{\sum_{i=1}^{k} \bar{x}_i}{k}$$
(iii) Take the difference between the means of the various samples and the
grand mean, i.e.,
$$(\bar{X}_1 - \bar{\bar{X}}), (\bar{X}_2 - \bar{\bar{X}}), (\bar{X}_3 - \bar{\bar{X}}), \dots, (\bar{X}_k - \bar{\bar{X}})$$
(iv) Square these differences, multiply each by the number of items in the
corresponding sample and add, so that,
$$SSB = \sum_{i=1}^{k} n_i (\bar{X}_i - \bar{\bar{X}})^2$$
where,
n1 = Number of items in sample 1
n2 = Number of items in sample 2
X̄1 = Mean of sample 1
X̄2 = Mean of sample 2
X̄k = Mean of sample k
$\bar{\bar{X}}$ = Grand mean or average of all items in all samples.
(v) Divide SSB by the degrees of freedom, which are (k − 1), where k is the
number of samples and this would give us the value of σ²between, so that,
$$\sigma^2_{between} = \frac{SSB}{(k - 1)}$$
$$F = \frac{\sigma^2_{between}}{\sigma^2_{within}} = \frac{SSB/df}{SSW/df} = \frac{SSB/(k - 1)}{SSW/(N - k)} = \frac{MSB}{MSW}$$
This value of F is then compared with the critical value of F from the table
and a decision is made about the validity of null hypothesis.
11.4.4
ANOVA Table
After various calculations for SSB, SSW and the degrees of freedom have been
made, these figures can be presented in a simple table called Analysis of
Variance table or simply ANOVA table, as follows:
ANOVA Table
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square             F
Treatment             SSB              (k − 1)              MSB = SSB/(k − 1)       MSB/MSW
Within                SSW              (N − k)              MSW = SSW/(N − k)
Total                 SST
Then,
$$F = \frac{MSB}{MSW}$$
A Short-Cut Method
The formula developed above for the computation of the values of F-statistic is
rather complex and time consuming when we have to calculate the variance
between samples and the variance within samples. However, a short-cut, simpler
method for these sum of squares is available, which considerably reduces the
computational work. This technique is used through the following steps:
(i) Take the sum of all the observations of all the samples, either by adding
all the individual values, or by multiplying the mean of each sample by its
size and then adding up all these products as follows:
$$\text{The Total Sum or } TS = n_1 \bar{X}_1 + n_2 \bar{X}_2 + \dots + n_k \bar{X}_k, \text{ for } k \text{ samples}$$
(ii) Calculate the value of a correction factor. The Correction Factor (CF)
value is obtained by squaring the total sum obtained above and dividing it
by the total number of observations N, so that:
$$CF = \frac{(TS)^2}{N}$$
(iii) The total sum of squares is obtained by squaring all individual observations
of all samples, summing up these values and subtracting from this sum,
the CF.
In other words:
$$\text{Total sum of squares } SST = \sum X_1^2 + \sum X_2^2 + \dots + \sum X_k^2 - \frac{(TS)^2}{N}$$
(iv) The sum of squares between samples, SSB, is obtained as:
$$SSB = \frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \dots + \frac{(\sum X_k)^2}{n_k} - \frac{(TS)^2}{N}$$
Example 11.7:
To test whether all professors teach the same material in different sections of
the introductory statistics class or not, four sections of the same course were
selected and a common test was administered to five students selected at
random from each section. The scores for each student from each section were
noted and are given below. We want to test for any differences in learning, as
reflected in the average scores for each section.
Student #   Section 1     Section 2     Section 3     Section 4
            Scores (X1)   Scores (X2)   Scores (X3)   Scores (X4)
1           8             12            10            12
2           10            12            13            15
3           12            10            11            13
4           10            8             12            10
5           5             13            14            10
Totals      ΣX1 = 45      ΣX2 = 55      ΣX3 = 60      ΣX4 = 60
Means       X̄1 = 9        X̄2 = 11       X̄3 = 12       X̄4 = 12
Solution:
A. The traditional method
(i) State the null hypothesis. We are assuming that there is no significant
difference among the average scores of students from these four sections
and hence, all professors are teaching the same material with the same
effectiveness, i.e.,
H0: μ1 = μ2 = μ3 = μ4
H1: All means are not equal or at least two means differ from each other.
(ii) Establish a level of significance. Let α = 0.05.
(iii) Calculate the variance between the samples, as follows:
(a) The mean of each sample is:
X̄1 = 9, X̄2 = 11, X̄3 = 12, X̄4 = 12
(b) The grand mean is:
$$\bar{\bar{X}} = \frac{\sum \bar{X}}{n} = \frac{9 + 11 + 12 + 12}{4} = 11$$
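$$MSB = \frac{SSB}{df} = \frac{(30)}{(k - 1)} = \frac{(30)}{3} = 10$$
Sample 1:
X̄1 = 9
$$\sum (X_1 - \bar{X}_1)^2 = (8 - 9)^2 + (10 - 9)^2 + (12 - 9)^2 + (10 - 9)^2 + (5 - 9)^2 = 1 + 1 + 9 + 1 + 16 = 28$$
Sample 2:
X̄2 = 11
$$\sum (X_2 - \bar{X}_2)^2 = (12 - 11)^2 + (12 - 11)^2 + (10 - 11)^2 + (8 - 11)^2 + (13 - 11)^2 = 1 + 1 + 1 + 9 + 4 = 16$$
Sample 3:
X̄3 = 12
$$\sum (X_3 - \bar{X}_3)^2 = (10 - 12)^2 + (13 - 12)^2 + (11 - 12)^2 + (12 - 12)^2 + (14 - 12)^2 = 4 + 1 + 1 + 0 + 4 = 10$$
Sample 4:
X̄4 = 12
$$\sum (X_4 - \bar{X}_4)^2 = (12 - 12)^2 + (15 - 12)^2 + (13 - 12)^2 + (10 - 12)^2 + (10 - 12)^2 = 0 + 9 + 1 + 4 + 4 = 18$$
SSW = 28 + 16 + 10 + 18 = 72
$$MSW = \frac{SSW}{df} = \frac{72}{(N - k)} = \frac{72}{20 - 4} = \frac{72}{16} = 4.5$$
$$F = \frac{MSB}{MSW} = \frac{10}{4.5} = 2.22$$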
Now, we check for the critical value of F from the table for α = 0.05 and
degrees of freedom as follows:
df (numerator) = (k − 1) = (4 − 1) = 3
df (denominator) = (N − k) = (20 − 4) = 16
This value of F from the table is given as 3.24. Now, since our
calculated value of F = 2.22 is less than the critical value of F = 3.24, we
cannot reject the null hypothesis.
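The traditional method can be checked with a short script. The following is a minimal Python sketch, not part of the original text: the function name one_way_anova and the variable names are illustrative assumptions, and the scores are those of the four sections in Example 11.7.

```python
def one_way_anova(samples):
    """Return the sums of squares, mean squares and F ratio for a list of samples."""
    k = len(samples)                              # number of samples
    n_total = sum(len(s) for s in samples)        # N, total number of observations
    grand_mean = sum(sum(s) for s in samples) / n_total
    means = [sum(s) / len(s) for s in samples]
    # Between-samples sum of squares: n_i * (sample mean - grand mean)^2
    ssb = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # Within-samples sum of squares: deviations from each sample's own mean
    ssw = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    msb = ssb / (k - 1)
    msw = ssw / (n_total - k)
    return {"SSB": ssb, "SSW": ssw, "MSB": msb, "MSW": msw, "F": msb / msw}


# Scores of the four sections in Example 11.7
sections = [
    [8, 10, 12, 10, 5],
    [12, 12, 10, 8, 13],
    [10, 13, 11, 12, 14],
    [12, 15, 13, 10, 10],
]
result = one_way_anova(sections)
print(result)
```

The critical value of F still has to be read from the F table (3.24 in the text); the sketch only reproduces the arithmetic.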
B. The Short-Cut Method
Following the procedure outlined before for using the short-cut method, we get:
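(i) Total Sum (TS) = ΣX = 220
(ii) Correction Factor:
\[ CF = \frac{(TS)^2}{N} = \frac{(220)^2}{20} = 2420 \]
The sum of squares between the samples is:
\[ SSB = \sum_{i=1}^{k} \frac{(\sum X_i)^2}{n_i} - CF = \frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \dots + \frac{(\sum X_k)^2}{n_k} - CF \]
\[ = \frac{(45)^2}{5} + \frac{(55)^2}{5} + \frac{(60)^2}{5} + \frac{(60)^2}{5} - 2420 \]
= 405 + 605 + 720 + 720 − 2420
= 30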
Source of     Sum of Squares    Degrees of       Mean Square                       F
Variation                       Freedom
Treatment     SSB = 30          (k − 1) = 3      MSB = SSB/(k − 1) = 30/3 = 10     F = MSB/MSW = 10/4.5 = 2.22
Within        SSW = 72          (N − k) = 16     MSW = SSW/(N − k) = 72/16 = 4.5
Total         SST = 102
Activity 2
Prepare a list of magazines and categorize them into three groups according
to the educational level of the readers as high, medium and low. Select six
advertisements randomly from each of the magazines and for each
advertisement collect three different readability measures. Perform one
way ANOVA tests to determine whether advertisement readabilities of the
three groups of magazines are different.
Self-Assessment Questions
5. Fill in the blanks with the appropriate terms.
(a) The simultaneous ______________ of several population means is
called analysis of variance or ANOVA.
(b) F ratio contains only _____________ elements, which are the
variance between the samples and the variance within the samples.
6. State whether true or false.
(a) The one-way analysis of variance refers to the situations when only
one factor or variable is considered.
(b) The F distribution is a family of curves, so that there are similar
curves for different degrees of freedom.
11.5 Summary
Let us recapitulate the important concepts discussed in this unit:
William S. Gosset (pen name 'Student') developed a significance test
and, through it, made a significant contribution to the theory of sampling
applicable to small samples. When the population variance is not
known, the test is commonly known as Student's t-test and is based on
the t distribution.
When n is small, the t distribution is far from normal but when n is infinite
it is identical with normal distribution.
For applying t-test in context of small samples, the t value is calculated
first of all and, then the calculated value is compared with the table value
of t at certain level of significance for given degrees of freedom.
R.A. Fisher developed the z-test to test the significance of the correlation
coefficient in small samples. While applying the test, r of the sample is
transformed into z on account of which the test is also known as z
transformation.
The statistic z is used to test (i) whether an observed value of r is
significantly different from a given hypothetical or known value of population
correlation (ii) whether two sample values of r differ significantly from
each other.
The one-way analysis of variance refers to the situations when only one
factor or variable is considered.
The simultaneous comparison of several population means is called
Analysis of Variance or ANOVA.
The F distribution is a family of curves, so that there are different curves
for different degrees of freedom.
F ratio contains only two elements, which are the variance between the
samples and the variance within the samples.
11.6 Glossary
t-test: Any statistical hypothesis test in which the test statistic follows a
Student's t distribution if the null hypothesis is true.
z-test: Any statistical test for which the distribution of the test statistic
under the null hypothesis can be approximated by a normal distribution.
ANOVA: In statistics, analysis of variance or ANOVA is a collection of
statistical models and their associated procedures in which the observed
variance in a particular variable is partitioned into components attributable
to different sources of variation.
11.8 Answers
Answers to Self-Assessment Questions
1. (a) Variance; (b) Equality
2. (a) False; (b) True
3. (a) True; (b) True
4. (a) Transformation; (b) Difference
5. (a) Comparison; (b) Two
6. (a) True; (b) False
Unit 12 Research Report Writing
Structure
12.1 Introduction
Objectives
12.2 Introduction to Report Writing
12.3 Types of Research Reports
12.4 Summary
12.5 Glossary
12.6 Terminal Questions
12.7 Answers
12.8 Further Reading
12.1 Introduction
Objectives
After studying this unit, you should be able to:
Describe the importance of reports
Explain the different types of reports
Define the characteristics of a good report
Describe the structure of a report
Use the correct method of presenting reports
Data analysis:
o Appropriate inferential statistics for sample or experimental data
and appropriate use of descriptive statistics
o Clear and reasonable interpretation of the statistical findings,
accompanied by effective tables and figures
Summary:
o Fair assessment of the implications and limitations of the
findings
o Effective commentary on the overall implications of the findings
for theory and/or policy
2. Structure of a Report: Before you write a report, you should define the
high level structure of the report. Defining a clear logical structure will
make the report easier to write and to read. There are two types of report
structures, which are listed as follows:
Report Structure I: In general, the report writing structure comprises
the following subheadings:
o Title Page
o Abstract
o Table of Contents
o Introduction
o Technical Detail and Results
o Discussion and Conclusions
o References
o Appendices
Report Structure II: There is also a specific structure of report writing
pertaining to technical or scientific reports which is as follows:
o Introduction
o Background and Context
o Technical Details
o Results
o Discussion and Conclusion
Order of writing:
o Start with the technical chapters/sections.
o Follow with the discussion.
o Finally, write the conclusions, introduction and abstract, if you
are including any.
Appendix: The appendix should contain the following:
o Material that suits or goes well with the flow of the main report
but cannot be included in the main text of the report either
because it is too long or is not essential reading, for example,
lists of parameter values, etc.
o Bibliography, i.e., list of all the sources of material, you referred
to in your report.
3. Presentation of a Report: As stated earlier, neither a mass of data nor
a lucid style of writing is sufficient on its own for good report writing.
Both aspects need to be given due consideration, so that they combine
to give a simple, easy-to-read and comprehensive report. The same
goes for the presentation of the contents of the report. Printing mistakes
and inconsistent use of font size and style can distract the reader.
On the other hand, effective use of tables and figures to present
the data and its conclusions facilitates easy
comprehension. The main points that require due attention
on the part of the report writer are as follows:
Capitals: This requires taking care of the following aspects:
o Using capitals only for proper nouns, place names, organization
names, etc.
o Defining acronyms at the first point of usage. For example,
Incorporated (Inc).
o Using bold, italics or underlines for emphasis, instead of capitals.
Headings: The basic points to be kept in mind for headings are as
follows:
o Differentiate headings from the rest of the text using different
fonts, bold, italics or underlines.
o Maintain consistency in formatting headings using predefined
styles.
o Avoid headings beyond three levels.
For example,
Shahad, P.V. 'Rajesh Jain's Ecosystem', in Business Today,
Vol. 14, December 18, p. 28, 2005.
o In case of multiple authorship:
If there are more than two authors or editors, then in the
documentation, the name of only the first is given and multiple
authorship is indicated by 'et al.' or 'and others'.
Author's name in normal order
Title of work, underlined to indicate italics
Place and date of publication
Pagination references
For example,
Alexandra K. Wigdor, Ability Testing: Uses Consequences and
Controversies, 1981, p.23.
Subsequent references to the same work need not be detailed.
If the work is cited again without any other work intervening, it may
be indicated as ibid, followed by a comma and the page number.
Punctuations and abbreviations in footnotes: Punctuation concerning
the book and author names has already been discussed. They are general
rules to be strictly adhered to. Some English and Latin abbreviations are
often used in bibliographies and footnotes to eliminate any repetition.
Table 12.1 shows the various English and Latin abbreviations used
in bibliographies and footnotes.
Table 12.1 English and Latin Abbreviations used in Bibliographies and Footnotes
Abbreviations                   Meaning
Anon.,                          Anonymous
Ante.,                          Before
Art.,                           Article
Aug.,                           Augmented
bk.,                            Book
bull.,                          Bulletin
cf.,                            Compare
ch.,                            Chapter
col.,                           Column
diss.,                          Dissertation
ed.,                            editor, edition, edited
ed. cit.,                       edition cited
e.g.,                           exempli gratia: for example
eng.,                           Enlarged
et al.,                         and others
et seq.,                        et sequens: and the following
ex.,                            Example
f., ff.,                        and the following
fig(s).,                        figure(s)
fn.,                            Footnote
ibid., ibidem                   in the same place
id., idem.,                     the same
ill., illus., or illust(s)      illustrated, illustration(s)
Intro., intro.,                 introduction
l., ll.,                        line(s)
loc. cit.,                      in the place cited; used as op. cit.,
MS., MSS.,                      Manuscript(s)
N.B., nota bene                 note well
n.d.,                           no date
n.p.,                           no place
no pub.,                        no publisher
no(s).,                         number(s)
o.p.,                           out of print
op. cit.:                       in the work cited
p., pp.,                        page(s)
passim:                         here and there
Post:                           after
Main text: The main text comprises the complete outline of the
research report with all the details. The title of the research study is
repeated at the top of the first page of the main text, and then followed
with the other details on the pages numbered consecutively,
beginning with the second page. The main text can be classified
into the following sections:
o Introduction: The purpose of introduction is to introduce the
research projects to the readers. It should clearly state the
objectives of research, i.e., it should clarify, why the problem
was considered worth investigating. A brief summary of other
relevant research can be included as well, to enable the reader
to see the present study in that context.
o The methodology used for performing the study: This
section should answer questions such as: How was
the study carried out? What was the basic design? What were
the experimental directions? What questions were asked
in the questionnaires used? Besides this, the scope and
limitations of the study must be marked out.
o Statement of findings and recommendations: The research
report should comprise a statement of findings and
recommendations in a nontechnical language so that it is easily
comprehensible.
o Results: A detailed presentation of the findings of the study,
with supporting data in tabular forms along with the validation
of results, should be given. This section should contain statistical
summaries and deductions of the data rather than the raw data.
There should be a logical sequence and sectional presentation
of the results.
o Implications of the result: The researcher should write down
his results clearly and precisely, again at the end of the main
text. The implications derived from the results of the research
study should be stated in the research plan. The report should
also mention the conclusion drawn from the study, which should
be clearly related to the hypothesis stated in the introductory
section.
o Summary: The next step is to conclude the report with a short
summary, mentioning in brief the research problem, the
are used, use them consistently throughout the report. For example,
do not switch between 'versus' and 'vs.'
It is advisable to avoid using the word 'very' and other such words
that try to embellish a description. They do not add any extra meaning,
and therefore, should be dropped.
Repetition hampers lucidity. The report writer must avoid repeating
the same word more than once within a sentence.
When you use the word 'this' or 'these', make sure you indicate
what you are referring to. This reduces the ambiguity in your writing
and helps to tie the sentences together.
Do not use the word 'they' to refer to a singular person. You can
either rewrite the sentence to avoid using such a reference or use
the singular 'he' or 'she'.
Principle of cost: While preparing a report, it is necessary to carry out a
cost-benefit analysis of the report. A report should keep costs to a minimum
and benefits to a maximum. If the cost of preparing the report
is high but its benefit is low, then it is not advisable to prepare that report.
Different formats of written reports
A written report can be written in various formats, some of which are as follows:
Straight-line format: This format is used when the information is to be
presented in alphabetical, sequential or numerical orders. This format is
used to generate descriptive reports.
Building blocks format: This format is used when the information
presented, leads to some conclusion. The report in this format starts with
a brief introduction, contains some logical facts and finally the conclusions
and recommendations.
Inverted pyramid format: The report in this format has the most important
item at the top, and the least important item at the bottom of the report.
That is, items are listed in the descending order with the most important
item at the top. This style of writing or format is also known as journalistic
style or format.
2. Oral Report: At times, oral presentation of the results that are drawn out
of research is considered effective, particularly in cases where policy
recommendations are to be made. This approach proves beneficial
because it provides a medium of interaction between the listener and the
speaker. This leads to a better understanding of the findings and their
implications. However, the main drawback of oral presentation is lack of
any permanent records related to the research. Oral presentation of the
report is also effective when it is supported by various visual devices
such as slides, wall charts and white boards that help in better
understanding of the research reports.
Advantages of oral reports
Oral reports help in direct communication without any delay. The following are
some of the advantages of an oral report:
It provides immediate feedback to the participants of the oral report.
Moreover, participants can also ask for further clarification, elaboration
and justifications.
It is time saving.
Self-Assessment Questions
1. Fill in the blanks with the appropriate terms.
(a) A report that consists of a collection of data or facts and is written in
an orderly way is called an __________________ report.
(b) An interpretive report contains a collection of data with its
interpretation or any _____________________ explicitly specified
by the writer.
Self-Assessment Questions
3. State whether true or false.
(a) Research reports are not designed to convey and record the
information that will be of practical use to the reader.
(b) The index of the technical report must be provided at the end of the
report.
4. Fill in the blanks with the appropriate terms.
(a) A _________________ report is formulated when there is a need to
draw the conclusions of the findings of the research report.
(b) If there are any _______________ in the report, then
recommendations are made for taking corrective action in order to
rectify the errors.
12.4 Summary
Let us recapitulate the important concepts discussed in this unit:
A report can be defined as a written document which presents information
in a specialized and concise manner.
There is a difference between report writing and other compositions
because a report is written in a short and conventional format. A report
should cover all mandatory matters but nothing extra should be written.
A report that consists of a collection of data or facts and is written in an
orderly way is called an informational report. The main purpose of this
type of report is to present the information in its original form without any
conclusion and recommendation.
An interpretive report contains a collection of data with its interpretation
or any recommendation explicitly specified by the writer.
Defining a clear logical structure will make the report easier to write and
to read.
12.5 Glossary
Report: A written document presenting information in a specialized and
concise manner.
Informational report: A report consisting of a collection of data or facts
written in an orderly manner.
Interpretive report: A report containing a collection of data with its
interpretation or any recommendation explicitly specified by the writer.
Research report: A written document describing the findings of some
individual or a group of individuals.
12.7 Answers
Answers to Self-Assessment Questions
1. (a) Informational; (b) Recommendation
2. (a) True; (b) True
3. (a) False; (b) True
4. (a) Popular; (b) Deviations
Unit 13
Exercise-I
Example 1: How will you classify people according to gender using a nominal scale?
Solution:
In the example below, the number 1 is assigned to male and the number 2 is assigned
to female. We can just as easily assign the number 1 to female and 2 to male. The
purpose of the number is merely to name the characteristic or give it identity.
As we can see from the graphs, changing the number assigned to male and
female does not have any impact on the data - we still have the same number of men
and women in the data set.
Example 2:
What type of questions should be avoided in a questionnaire?
Solution:
The following type of questions should be avoided when preparing a questionnaire.
1. Embarrassing Questions: Embarrassing questions are questions that ask
respondents details about personal and private matters. Embarrassing questions
are mostly avoided because you would lose the trust of your respondents. Your
respondents might also feel uncomfortable to answer such questions and might
refuse to answer your questionnaire.
2. Positive/Negative Connotation Questions: Since most verbs, adjectives and
nouns in the English language have either positive or negative connotations,
questions are bound to take on a positive or negative connotation. While framing a
question, strong negative or positive overtones must be avoided. Depending on
the positive or negative connotation of your question, you will get different data.
Ideal questions should have neutral or subtle overtones.
Example 3:
Find the mode of the following data set:
48, 45, 46, 35, 45, 46, 35, 57, 34, 46, 48, 48, 46, 67
Solution:
The mode is 46, which occurs 4 times.
Example 4:
Find the median of the following data set:
12, 18, 16, 21, 10, 13, 17, 19
Solution:
Arrange the data values in order from the lowest value to the highest value:
10, 12, 13, 16, 17, 18, 19, 21
The number of values in the data set is 8, which is even. So, the median is the
average of the two middle values.
Median = (16 + 17)/2 = 16.5
Example 5:
The marks of seven students in a mathematics test with a maximum possible mark of
20 are given below:
15, 13, 18, 16, 14, 17, 12
Find the mean of this set of marks.
Solution:
Mean = (15 + 13 + 18 + 16 + 14 + 17 + 12)/7 = 105/7 = 15
Example 6:
Find the mean, median, mode and range for the following list of values:
13, 18, 13, 14, 13, 16, 14, 21, 13
Solution:
The mean is the usual average, so:
Mean = (13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) / 9 = 15
The median is the middle value, so arrange the data in ascending order as follows:
13, 13, 13, 13, 14, 14, 16, 18, 21
There are nine numbers in the list, so the middle one will be the (9 + 1) ÷ 2 = 10 ÷ 2 =
5th number:
13, 13, 13, 13, 14, 14, 16, 18, 21
So the median is 14.
The mode is the number that is repeated more often than any other, so mode is 13.
The largest value in the list is 21 and the smallest is 13, so the range is 21 − 13 = 8.
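The four measures in Example 6 can be verified with Python's standard statistics module. This is a small illustrative sketch; the variable names are assumptions and are not part of the original text.

```python
import statistics

# List of values from Example 6
values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean_value = statistics.mean(values)        # sum of the values divided by their count
median_value = statistics.median(values)    # middle value of the sorted list
mode_value = statistics.mode(values)        # most frequently occurring value
value_range = max(values) - min(values)     # largest value minus smallest value

print(mean_value, median_value, mode_value, value_range)
```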
Example 7:
Calculate the standard deviation for the data given below:
4, 2, 5, 8, 6
Solution:
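Calculate the mean of the data:
\[ \bar{x} = \frac{4 + 2 + 5 + 8 + 6}{5} = 5 \]
Now, the standard deviation is:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{1 + 9 + 0 + 9 + 1}{4}} = \sqrt{5} \]
= 2.24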
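As a check on Example 7, the sketch below computes the standard deviation with Python's statistics module. It is an illustration under the assumption that the answer 2.24 uses the sample formula with divisor (n − 1); the population formula with divisor n is shown for comparison and gives a different value.

```python
import statistics

data = [4, 2, 5, 8, 6]

# Sample standard deviation: divides the sum of squared deviations by (n - 1)
sample_sd = statistics.stdev(data)
# Population standard deviation: divides by n instead
population_sd = statistics.pstdev(data)

print(round(sample_sd, 2), round(population_sd, 2))
```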
Example 8:
From the following data, construct index number of prices for 1986 with 1980 as base,
using (i) Laspeyres' method, (ii) Paasche's method, (iii) Bowley-Drobisch method, (iv)
Marshall-Edgeworth method, (v) Fisher's ideal formula.
                    1980                          1986
Commodity    Price       Expenditure       Price       Expenditure
             Per Unit    in Rupees         Per Unit    in Rupees
A            2           10                4           16
B            3           12                6           18
C            1            8                2           14
D            4           20                8           32
Solution:
Since we are given the price and the total expenditure for the year 1980 and 1986, we
shall first calculate the quantities for the two years by dividing the expenditure by price,
and then we shall calculate the index numbers as follows:
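Commodity    p0    q0    p1    q1    p0q0    p0q1    p1q0    p1q1
A             2     5     4     4     10       8      20      16
B             3     4     6     3     12       9      24      18
C             1     8     2     7      8       7      16      14
D             4     5     8     4     20      16      40      32
Total                                 50      40     100      80
(i) Laspeyres' price index:
\[ P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 = \frac{100}{50} \times 100 = 200 \]
(ii) Paasche's price index:
\[ P_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100 = \frac{80}{40} \times 100 = 200 \]
(iii) Bowley-Drobisch price index:
\[ P_{01} = \frac{\frac{\sum p_1 q_0}{\sum p_0 q_0} + \frac{\sum p_1 q_1}{\sum p_0 q_1}}{2} \times 100 = \frac{\frac{100}{50} + \frac{80}{40}}{2} \times 100 = 200 \]
(iv) Marshall-Edgeworth price index:
\[ P_{01} = \frac{\sum p_1 q_0 + \sum p_1 q_1}{\sum p_0 q_0 + \sum p_0 q_1} \times 100 = \frac{100 + 80}{50 + 40} \times 100 = 200 \]
(v) Fisher's ideal price index:
\[ P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100 = \sqrt{\frac{100}{50} \times \frac{80}{40}} \times 100 = \sqrt{2 \times 2} \times 100 = 200 \]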
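The five index numbers of Example 8 can be reproduced from the price and quantity columns of the solution table. The following Python sketch is illustrative only; the helper name total and the variable names are assumptions.

```python
import math

# Base-year (1980) and current-year (1986) prices and quantities for commodities A-D
p0 = [2, 3, 1, 4]
q0 = [5, 4, 8, 5]
p1 = [4, 6, 2, 8]
q1 = [4, 3, 7, 4]


def total(prices, quantities):
    """Sum of price x quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


laspeyres = total(p1, q0) / total(p0, q0) * 100      # base-year quantities as weights
paasche = total(p1, q1) / total(p0, q1) * 100        # current-year quantities as weights
bowley = (laspeyres + paasche) / 2                   # arithmetic mean of the two
marshall_edgeworth = (
    (total(p1, q0) + total(p1, q1)) / (total(p0, q0) + total(p0, q1)) * 100
)
fisher = math.sqrt(laspeyres * paasche)              # geometric mean of the two

print(laspeyres, paasche, bowley, marshall_edgeworth, fisher)
```

All five methods agree here because every price exactly doubled between the two years.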
Example 9:
Construct a pie chart in percentage for the given data of a publishing house (costs
are in rupees):
Item                   Cost
Promotion cost         10,000
Royalty cost           15,000
Binding cost           20,000
Paper cost             25,000
Transportation cost    10,000
Printing cost          20,000
Solution:
The following pie chart shows the percentage distribution of the expenditure incurred
in publishing a book as per the given data.
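The percentage share and the sector angle of each cost item in Example 9 can be worked out as sketched below. This is an illustrative Python sketch that only computes the figures needed to draw the chart; the dictionary and variable names are assumptions.

```python
# Cost items from Example 9 (in rupees)
costs = {
    "Promotion cost": 10_000,
    "Royalty cost": 15_000,
    "Binding cost": 20_000,
    "Paper cost": 25_000,
    "Transportation cost": 10_000,
    "Printing cost": 20_000,
}

total_cost = sum(costs.values())
# Percentage share of each item in the total expenditure
shares = {item: cost / total_cost * 100 for item, cost in costs.items()}
# Central angle of each sector: 1 per cent of the circle is 3.6 degrees
angles = {item: share * 3.6 for item, share in shares.items()}

for item in costs:
    print(f"{item}: {shares[item]:.0f}% ({angles[item]:.0f} degrees)")
```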
Example 10:
The ranks of 15 students in two subjects A and B are given below:
Student    Subject A    Subject B
1.          1           10
2.          2            7
3.          3            2
4.          4            6
5.          5            4
6.          6            8
7.          7            3
8.          8            1
9.          9           11
10.        10           15
11.        11            9
12.        12            5
13.        13           14
14.        14           12
15.        15           13
Solution:
Rank in A    Rank in B    D = (R1 − R2)    D²
(R1)         (R2)
 1           10           −9               81
 2            7           −5               25
 3            2            1                1
 4            6           −2                4
 5            4            1                1
 6            8           −2                4
 7            3            4               16
 8            1            7               49
 9           11           −2                4
10           15           −5               25
11            9            2                4
12            5            7               49
13           14           −1                1
14           12            2                4
15           13            2                4
n = 15                    ΣD = 0           ΣD² = 272
The rank correlation coefficient is:
\[ r = 1 - \frac{6 \sum D_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 272}{15(225 - 1)} \]
= 1 − 0.4857
= 0.5143
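The rank correlation of Example 10 can be verified with the short Python sketch below. The function name spearman_rank_correlation is an illustrative assumption; the formula is the one used in the solution and assumes there are no tied ranks.

```python
# Ranks of the 15 students in Example 10
rank_a = list(range(1, 16))
rank_b = [10, 7, 2, 6, 4, 8, 3, 1, 11, 15, 9, 5, 14, 12, 13]


def spearman_rank_correlation(r1, r2):
    """Rank correlation 1 - 6*sum(D^2) / (n(n^2 - 1)), valid when there are no ties."""
    n = len(r1)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


rho = spearman_rank_correlation(rank_a, rank_b)
print(round(rho, 4))
```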
Example 11:
Researchers at the European Centre for Road Safety Testing are trying to find out
how the age of cars affects their braking capability. They test a group of ten cars of
differing ages and find out the minimum stopping distances that the cars can achieve.
The results are set out in the table below:
Car    Age of car (X)    Minimum stopping distance (Y)
A       9                28.4
B      15                29.3
C      24                37.6
D      30                36.2
E      38                36.5
F      46                35.3
G      53                36.2
H      60                44.1
I      64                44.8
J      76                47.2
Solution:
Let us develop the following table for calculating the value of r:
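Car      X       Y        X²        Y²           XY
A         9      28.4       81       806.56       255.6
B        15      29.3      225       858.49       439.5
C        24      37.6      576      1413.76       902.4
D        30      36.2      900      1310.44      1086.0
E        38      36.5     1444      1332.25      1387.0
F        46      35.3     2116      1246.09      1623.8
G        53      36.2     2809      1310.44      1918.6
H        60      44.1     3600      1944.81      2646.0
I        64      44.8     4096      2007.04      2867.2
J        76      47.2     5776      2227.84      3587.2
Total   415     375.6    21623     14457.72     16713.3
X̄ = 415/10 = 41.5, Ȳ = 375.6/10 = 37.56
Method of least squares:
\[ b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{16713.3 - 15587.4}{21623 - 17222.5} = \frac{1125.9}{4400.5} = 0.2559 \]
\[ a = \bar{Y} - b\bar{X} = 37.56 - 0.2559 \times 41.5 = 26.94 \]
\[ r = \sqrt{\frac{a \sum Y + b \sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}} = \sqrt{\frac{288.07}{350.18}} = \sqrt{0.8226} = 0.91 \]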
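The correlation between the age of the cars (X) and their minimum stopping distance (Y) can be recomputed directly from the ten pairs of values in the table. The following Python sketch is illustrative; the function name pearson_r is an assumption, and it uses the product-moment form of r, which is algebraically equivalent to the least-squares form.

```python
import math

# X (age) and Y (minimum stopping distance) for cars A to J
x = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]
y = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]


def pearson_r(xs, ys):
    """Product-moment correlation computed from sums, as in the worked table."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum(a * b for a, b in zip(xs, ys)) - n * mean_x * mean_y
    sxx = sum(a * a for a in xs) - n * mean_x ** 2
    syy = sum(b * b for b in ys) - n * mean_y ** 2
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(x, y)
print(round(r, 3))
```

Carrying the means at full precision matters here, because the result is sensitive to the rounding of Ȳ.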
Unit 14
Exercise-II
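                      Rainfall    Production
Average               30          500 kg
Standard Deviation    5           100 kg
The regression equation of Y (production) on X (rainfall) is:
\[ (Y - \bar{Y}) = r \frac{\sigma_y}{\sigma_x} (X - \bar{X}) \]
\[ (Y - 500) = 0.8 \times \frac{100}{5} (X - 30) \]
or Y = 20 + 16X
For X = 40, Y = 16(40) + 20 = 660 kg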
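The regression line above can be rebuilt from the two means, the two standard deviations and r. The Python sketch below is illustrative only; the function name regression_y_on_x is an assumption, and r = 0.8 is taken from the worked equation.

```python
def regression_y_on_x(mean_x, mean_y, sd_x, sd_y, r):
    """Return (intercept, slope) of the regression line of Y on X."""
    slope = r * sd_y / sd_x                  # regression coefficient of Y on X
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Rainfall: mean 30, SD 5; production: mean 500 kg, SD 100 kg; r = 0.8
intercept, slope = regression_y_on_x(30, 500, 5, 100, 0.8)
predicted = intercept + slope * 40           # estimated production for rainfall of 40
print(intercept, slope, predicted)
```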
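σx = (1/2) σy. Find rxy.
Solution:
The angle θ between the two regression lines is given by:
\[ \tan\theta = \frac{1 - r^2}{r} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2} \]
Putting tan θ = 0.6 and σx = (1/2) σy:
\[ 0.6 = \frac{1 - r^2}{r} \cdot \frac{\frac{1}{2}\sigma_y \cdot \sigma_y}{\frac{1}{4}\sigma_y^2 + \sigma_y^2} \]
\[ \frac{6}{10} = \frac{1 - r^2}{r} \cdot \frac{\frac{1}{2}}{1 + \frac{1}{4}} = \frac{1 - r^2}{r} \cdot \frac{2}{5} \]
or 2r² + 3r − 2 = 0
(r + 2)(2r − 1) = 0
r = −2 or 1/2
Since r cannot be less than −1, the value of r is 1/2, i.e., rxy is 0.5.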
Example 3: The following table shows the number of public sector industries
failures in India during the period 1987 to 1993. Using a four-year moving
average method, calculate the mean square error (MSE) for this data.
Year    Number of Failures
1987    32
1988    26
1989    30
1990    28
1991    24
1992    22
1993    26
Solution:
The 4-year moving averages are calculated as follows:
(32 + 26 + 30 + 28)/4 = 29
(26 + 30 + 28 + 24)/4 = 27
(30 + 28 + 24 + 22)/4 = 26
(28 + 24 + 22 + 26)/4 = 25
The following table is constructed.
Year    Number of Failures    Moving Average    Error    Error Squared
1987    32
1988    26
1989    30
1990    28                    29                −1        1
1991    24                    27                −3        9
1992    22                    26                −4       16
1993    26                    25                 1        1
Then,
MSE = (1 + 9 + 16 + 1)/4 = 27/4 = 6.75
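The moving averages and the MSE of Example 3 can be recomputed as sketched below. This Python sketch is illustrative; the variable names are assumptions, and each 4-year moving average is compared with the actual value of the last year in its window, which is the alignment used in the solution table.

```python
# Number of failures, 1987 to 1993
failures = [32, 26, 30, 28, 24, 22, 26]
window = 4

# 4-year moving averages, each placed against the last year of its window
moving_averages = [
    sum(failures[i - window + 1 : i + 1]) / window
    for i in range(window - 1, len(failures))
]
# Error = actual value minus moving average for the same year
errors = [actual - ma for actual, ma in zip(failures[window - 1 :], moving_averages)]
mse = sum(e ** 2 for e in errors) / len(errors)

print(moving_averages, errors, mse)
```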
Year    Fall    Winter    Spring    Summer
1992    200     180       185       95
1993    220     188       173       83
1994    220     176       161       87
Year   Quarter   Values   4-Quarter   4-Quarter   4-Quarter   Percentage of
                          Moving      Moving      Centred     Actual to
                          Total       Average     Moving      Centred
                                                  Average     Moving Average
(1)    (2)       (3)      (4)         (5)         (6)         (7)
1992   I         200
       II        180
                          660         165
       III       185                              167.5       110.45
                          680         170
       IV         95                              171.0        55.55
                          688         172
1993   I         220                              170.5       129.03
                          676         169
       II        188                              167.5       112.24
                          664         166
       III       173                              166.0       104.22
                          664         166
       IV         83                              164.5        50.46
                          652         163
1994   I         220                              161.5       136.22
                          640         160
       II        176                              160.5       109.66
                          644         161
       III       161
       IV         87
Now, we calculate the modified mean for each quarter. This can be
done by the following steps.
The first step is to make a table of values already calculated and placed
in column (7) of this table. These are the percentage of actual to moving
average values for the various quarters of the three years. These are shown
in the following table:
Year    Fall      Winter    Spring    Summer
1992    -         -         110.45    55.55
1993    129.03    112.24    104.22    50.46
1994    136.22    109.66    -         -
The second step is to take the average of these values for each quarter.
The modified mean for each quarter data is shown as follows:
Fall   = (129.03 + 136.22)/2 = 132.625
Winter = (112.24 + 109.66)/2 = 110.950
Spring = (110.45 + 104.22)/2 = 107.335
Summer = (55.55 + 50.46)/2  = 53.005
Total = 403.915
The four modified means should total 400 (an average of 100 per quarter), so the
adjustment factor is 400/403.915 = 0.9903.
We get the seasonal index for each quarter by multiplying the modified
mean for each quarter by the adjustment factor. Then, the seasonal index for
each quarter is shown as follows:
Fall:      132.625 × 0.9903 = 131.34
Winter:    110.950 × 0.9903 = 109.87
Spring:    107.335 × 0.9903 = 106.29
Summer:     53.005 × 0.9903 =  52.50
Total                       = 400.00
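The whole ratio-to-moving-average calculation can be traced in a few lines of code. The Python sketch below is illustrative and rests on two assumptions: the variable names are invented for the example, and the modified mean is taken as the simple average of the two available percentages per quarter, as in the worked solution. Because the code does not round intermediate values, its results can differ from the text in the second decimal place.

```python
# Quarterly values for 1992-1994, in the order Fall, Winter, Spring, Summer
values = [200, 180, 185, 95, 220, 188, 173, 83, 220, 176, 161, 87]
seasons = ["Fall", "Winter", "Spring", "Summer"]

# 4-quarter moving averages, then centred moving averages of adjacent pairs
moving_avg = [sum(values[i : i + 4]) / 4 for i in range(len(values) - 3)]
centred = [(a + b) / 2 for a, b in zip(moving_avg, moving_avg[1:])]

# The first centred average lines up with the third quarter (index 2) of the series
ratios = {season: [] for season in seasons}
for offset, centred_value in enumerate(centred):
    index = offset + 2
    ratios[seasons[index % 4]].append(values[index] / centred_value * 100)

# Average percentage per quarter, then scale so that the four indices total 400
means = {season: sum(r) / len(r) for season, r in ratios.items()}
adjustment_factor = 400 / sum(means.values())
seasonal_index = {season: m * adjustment_factor for season, m in means.items()}

for season in seasons:
    print(season, round(seasonal_index[season], 2))
```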
Example 5: In the previous problem which gives us the data about new
admissions into the MBA programme of the university for each trimester,
separate the seasonal and irregular influences on the time series and calculate
the irregular (I) component as well as the seasonally-adjusted values for
each quarter.
Solution:
We have already calculated the various values that are needed. We know
that:
Time Series Values = T × S × C × I
Centred Moving Average = T × C
Hence,
\[ S \times I = \frac{T \times S \times C \times I}{T \times C} \]
Year    Quarter    T × S × C × I    T × C    S × I
1992    I          200
        II         180
        III        185              167.5    1.105
        IV          95              171.0    0.556
1993    I          220              170.5    1.290
        II         188              167.5    1.122
        III        173              166.0    1.042
        IV          83              164.5    0.505
1994    I          220              161.5    1.362
        II         176              160.5    1.097
        III        161
        IV          87
The seasonal indices for each quarter have already been calculated as:
Fall:      131.34
Winter:    109.87
Spring:    106.29
Summer:     52.50
Then the seasonal influence (S) is given by:
Fall:      131.34/100 = 1.3134
Winter:    109.87/100 = 1.0987
Spring:    106.29/100 = 1.0629
Summer:     52.50/100 = 0.5250
Now, we make another table with the (S × I) values as calculated in the
previous table and the (S) values for each quarter of fall, winter, spring and
summer. In this way, we can get the values of (I) by dividing the (S × I) values
by the (S) values. These are shown in the following table:
Year    Quarter    S × I    (S)       (I)
1992    I
        II
        III        1.105    1.0629    1.040
        IV         0.556    0.5250    1.059
1993    I          1.290    1.3134    0.982
        II         1.122    1.0987    1.021
        III        1.042    1.0629    0.980
        IV         0.505    0.5250    0.962
1994    I          1.362    1.3134    1.037
        II         1.097    1.0987    0.998
        III
        IV
Year    Quarter    Values    (S)       Seasonally-adjusted Values
1992    I          200
        II         180
        III        185       1.0629    174.05
        IV          95       0.5250    180.95
1993    I          220       1.3134    167.50
        II         188       1.0987    171.11
        III        173       1.0629    162.76
        IV          83       0.5250    158.09
1994    I          220       1.3134    167.50
        II         176       1.0987    160.19
        III        161
        IV          87
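The seasonally-adjusted values are the original values divided by the seasonal influence (S) of the corresponding quarter. The Python sketch below illustrates this for the eight quarters that have an (S) value in the table; the dictionary and variable names are assumptions made for the example.

```python
# Seasonal influence (S) by quarter: I = Fall, II = Winter, III = Spring, IV = Summer
seasonal_influence = {"I": 1.3134, "II": 1.0987, "III": 1.0629, "IV": 0.5250}

# (year, quarter, original value) for the quarters covered by the table
observations = [
    (1992, "III", 185), (1992, "IV", 95),
    (1993, "I", 220), (1993, "II", 188), (1993, "III", 173), (1993, "IV", 83),
    (1994, "I", 220), (1994, "II", 176),
]

# Deseasonalise: divide each original value by the (S) of its quarter
adjusted = [
    (year, quarter, value / seasonal_influence[quarter])
    for year, quarter, value in observations
]

for year, quarter, adjusted_value in adjusted:
    print(year, quarter, f"{adjusted_value:.2f}")
```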
Life (in '000 hours): 4.2, 4.6, 3.9, 4.1, 5.2, 3.8, 3.9, 4.3, 4.4, 5.6
Can we accept the hypothesis that the average life time of bulbs is 4000
hours?
Solution:
Let us take the null hypothesis that there is no significant difference between
the sample mean and the hypothetical population mean.
Applying the t-test (as the sample is small in size, because 10 < 30),
\[ t = \frac{\bar{x} - \mu}{S} \sqrt{n} \]
Calculation of x̄ and S
x          (x − x̄) = (x − 4.4)    (x − x̄)²
4.2        −0.2                    0.04
4.6        +0.2                    0.04
3.9        −0.5                    0.25
4.1        −0.3                    0.09
5.2        +0.8                    0.64
3.8        −0.6                    0.36
3.9        −0.5                    0.25
4.3        −0.1                    0.01
4.4         0                      0
5.6        +1.2                    1.44
Σx = 44                            Σ(x − x̄)² = 3.12
\[ \bar{x} = \frac{\sum x}{N} = \frac{44}{10} = 4.4 \]
\[ S = \sqrt{\frac{\sum (x - \bar{x})^2}{(n - 1)}} = \sqrt{\frac{3.12}{9}} = 0.589 \]
\[ t = \frac{4.4 - 4}{0.589} \sqrt{10} = 2.148 \]
ν = 9, t0.05 = 2.262.
The calculated value of t is less than the table value. Hence the hypothesis
is accepted.
The average life time of the bulbs could be 4000 hours.
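The t statistic for the bulb data can be recomputed as sketched below. This Python sketch is illustrative; the variable names are assumptions, and the critical value 2.262 is the table value quoted in the text rather than something the code derives.

```python
import math

# Life of the 10 bulbs, in thousands of hours
lives = [4.2, 4.6, 3.9, 4.1, 5.2, 3.8, 3.9, 4.3, 4.4, 5.6]
hypothesised_mean = 4.0          # 4000 hours, in thousands

n = len(lives)
sample_mean = sum(lives) / n
# Sample standard deviation with divisor (n - 1)
s = math.sqrt(sum((x - sample_mean) ** 2 for x in lives) / (n - 1))
t = (sample_mean - hypothesised_mean) / s * math.sqrt(n)

critical_t = 2.262               # table value of t for 9 degrees of freedom at 5%
accept_null = abs(t) < critical_t
print(round(sample_mean, 1), round(s, 3), round(t, 3), accept_null)
```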
Example 7: A Personnel Manager is interested in trying to determine whether
absenteeism is greater on one day of the week than on another. His records
for the past year show this sample distribution:
Day of the week    No. of Absentees
Monday             66
Tuesday            57
Wednesday          54
Thursday           48
Friday             75
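Solution:
Let the null hypothesis be that absenteeism is the same on every day of the week.
The total number of absentees is 300, so the expected frequency for each of the
five days is 300/5 = 60.
fo        fe        (fo − fe)²/fe
66        60        0.60
57        60        0.15
54        60        0.60
48        60        2.40
75        60        3.75
Total               7.50
\[ \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = 7.5 \]
ν = (n − 1) = (5 − 1) = 4
For ν = 4, χ²0.05 = 9.49
The calculated value of χ² is less than the table value. Hence, the (null)
hypothesis is accepted.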
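The χ² value for the absenteeism data can be checked with the short Python sketch below. It is illustrative only: the variable names are assumptions, and the expected frequencies are taken to be equal for all five days, as the solution's fe = 60 implies.

```python
# Observed number of absentees, Monday to Friday
observed = [66, 57, 54, 48, 75]

# Under the null hypothesis, absences are spread equally over the five days
expected_per_day = sum(observed) / len(observed)
chi_square = sum((fo - expected_per_day) ** 2 / expected_per_day for fo in observed)
degrees_of_freedom = len(observed) - 1

critical_value = 9.49            # table value of chi-square for 4 d.f. at 5%
accept_null = chi_square < critical_value
print(expected_per_day, chi_square, degrees_of_freedom, accept_null)
```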
Persons                20-39     40-59     60       Total
             140        80        40        20       280
              60        50        30        80       220
Total        200       130        70       100       500
On the basis of this data, can it be concluded that the model appeal is
independent of the age groups? (Given ν = 3, χ²0.05 = 7.815)
Solution:
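Let the null hypothesis be that the model appeal is independent of the age
group. Applying the χ² test, the expected frequencies are:
fe11 = (Row total × Column total)/Grand total = (280 × 200)/500 = 112
fe12 = (280 × 130)/500 = 72.8
fe13 = (280 × 70)/500 = 39.2 and so on.
The table of expected frequencies is:
112       72.8      39.2      56       280
 88       57.2      30.8      44       220
200      130        70       100       500
fo        fe        (fo − fe)²    (fo − fe)²/fe
140       112.0      784.00        7.000
 60        88.0      784.00        8.909
 80        72.8       51.84        0.712
 50        57.2       51.84        0.906
 40        39.2        0.64        0.016
 30        30.8        0.64        0.021
 20        56.0     1296.00       23.143
 80        44.0     1296.00       29.455
                    Σ(fo − fe)²/fe = 70.162
Thus, χ² = 70.162
ν = (r − 1)(c − 1) = (2 − 1)(4 − 1) = 3
The calculated value of χ² is much greater than the table value (χ²0.05 = 7.815
for ν = 3). Hence, the null hypothesis is rejected: the model appeal is not
independent of the age group.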
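The expected frequencies and the χ² value for this 2 × 4 table can be recomputed as sketched below. The Python sketch is illustrative; the variable names are assumptions, and the two rows of observed frequencies are entered in the order in which they appear in the fo column (140, 80, 40, 20 and 60, 50, 30, 80).

```python
# Observed frequencies: two rows of persons by four age groups
observed = [
    [140, 80, 40, 20],
    [60, 50, 30, 80],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency of a cell = row total x column total / grand total
expected = [
    [row_total * column_total / grand_total for column_total in column_totals]
    for row_total in row_totals
]
chi_square = sum(
    (fo - fe) ** 2 / fe
    for observed_row, expected_row in zip(observed, expected)
    for fo, fe in zip(observed_row, expected_row)
)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected, round(chi_square, 3), degrees_of_freedom)
```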
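\[ t = \frac{|\bar{x} - \mu|}{S} \sqrt{n} \]
x̄ = 53; μ = 56; n = 16; Σ(x − x̄)² = 135
\[ S = \sqrt{\frac{\sum (x - \bar{x})^2}{(n - 1)}} = \sqrt{\frac{135}{15}} = 3 \]
\[ t = \frac{|53 - 56|}{3} \sqrt{16} = 4 \]
ν = (n − 1) = (16 − 1) = 15
t0.05 = 2.13
95% confidence limits of the population mean:
\[ \bar{x} \pm t_{0.05} \frac{S}{\sqrt{n}} = 53 \pm 2.13 \times \frac{3}{\sqrt{16}} \]
= 51.4 to 54.6
99% confidence limits of the population mean:
\[ \bar{x} \pm t_{0.01} \frac{S}{\sqrt{n}} = 53 \pm 2.95 \times \frac{3}{\sqrt{16}} \]
= 50.788 to 55.212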
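The confidence limits above follow the pattern x̄ ± t × S/√n. The Python sketch below is illustrative; the function name confidence_limits is an assumption, and the t values 2.13 and 2.95 are the table values used in the calculation, not values computed by the code.

```python
import math


def confidence_limits(sample_mean, s, n, t_value):
    """Lower and upper limits: sample mean -/+ t * S / sqrt(n)."""
    margin = t_value * s / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Sample mean 53, S = 3, n = 16; t values for 15 degrees of freedom from the table
limits_with_t_2_13 = confidence_limits(53, 3, 16, 2.13)
limits_with_t_2_95 = confidence_limits(53, 3, 16, 2.95)
print(limits_with_t_2_13, limits_with_t_2_95)
```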
\[ t = r \sqrt{\frac{n - 2}{1 - r^2}} = 0.42 \sqrt{\frac{27 - 2}{1 - (0.42)^2}} = 2.31 \]
\[ \frac{s}{\sqrt{n}} = \frac{2}{\sqrt{100}} = \frac{2}{10} = 0.2 \]
Plot    Sample 1    Sample 2    Sample 3
1       6           5           5
2       7           5           4
3       3           3           3
4       8           7           4
Solution:
We can solve the problem either by the direct method or by short-cut method,
but in each case we shall get the same result. We try below both the methods.
Direct Method:
First we calculate the mean of each of these samples.
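\[ \bar{x}_1 = \frac{6 + 7 + 3 + 8}{4} = 6 \]
\[ \bar{x}_2 = \frac{5 + 5 + 3 + 7}{4} = 5 \]
\[ \bar{x}_3 = \frac{5 + 4 + 3 + 4}{4} = 4 \]
Mean of the sample means:
\[ \bar{\bar{x}} = \frac{\bar{x}_1 + \bar{x}_2 + \bar{x}_3}{3} = \frac{6 + 5 + 4}{3} = 5 \]
SS between = 4(6 − 5)² + 4(5 − 5)² + 4(4 − 5)²
= 4 + 0 + 4
= 8
SS within = {(6 − 6)² + (7 − 6)² + (3 − 6)² + (8 − 6)²}
+ {(5 − 5)² + (5 − 5)² + (3 − 5)² + (7 − 5)²}
+ {(5 − 4)² + (4 − 4)² + (3 − 4)² + (4 − 4)²}
= 14 + 8 + 2
= 24
SS for total variance = {(6 − 5)² + (7 − 5)² + (3 − 5)² + (8 − 5)²}
+ {(5 − 5)² + (5 − 5)² + (3 − 5)² + (7 − 5)²}
+ {(5 − 5)² + (4 − 5)² + (3 − 5)² + (4 − 5)²}
= 18 + 8 + 6
= 32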
Alternatively, (SS for total variance) can also be worked out thus:
SS for total = (SS between + SS within)
= (8 + 24)
= 32
We can now set up the ANOVA table for this problem:
Source of          SS    df             MS              F-ratio             5% F-limit
variation                                                                   (from the F-table)
Between sample      8    (3 − 1) = 2    8/2 = 4.00      4.00/2.67 = 1.5     F (2, 9) = 4.26
Within sample      24    (12 − 3) = 9   24/9 = 2.67
Total              32    (12 − 1) = 11
The above table shows that the calculated value of F is 1.5 which is less
than the table value of 4.26 at 5% level with d.f. being ν1 = 2 and ν2 = 9 and
hence could have arisen due to chance. This analysis supports the null
hypothesis of no difference in sample means. We may, therefore, conclude
that the difference in wheat output due to varieties is insignificant and is just
a matter of chance.
Aliter (Short-cut Method):
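In this case, we first take the total of all the individual values of n items and
call it T.
T in the given case = 60 and n = 12
Hence, the correction factor is:
\[ \frac{T^2}{n} = \frac{60^2}{12} = 300 \]
\[ \text{Total SS} = \sum x_{ij}^2 - \frac{T^2}{n}, \quad i = 1, 2, 3, \dots \text{ and } j = 1, 2, 3, \dots \]
= (6² + 7² + 3² + 8² + 5² + 5² + 3² + 7² + 5² + 4² + 3² + 4²) − 60²/12
= 332 − 300
= 32
\[ \text{SS between} = \sum \frac{T_j^2}{n_j} - \frac{T^2}{n} = \frac{24^2}{4} + \frac{20^2}{4} + \frac{16^2}{4} - \frac{60^2}{12} \]
= 308 − 300
= 8
\[ \text{SS within} = \sum x_{ij}^2 - \sum \frac{T_j^2}{n_j} \]
= 332 − 308
= 24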
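The short-cut method lends itself to a compact check. The Python sketch below is illustrative: the function name anova_shortcut and the variable names are assumptions, and the three samples are the wheat data with means 6, 5 and 4 used in the direct method.

```python
def anova_shortcut(samples):
    """Sums of squares and F ratio using the correction factor T^2 / n."""
    all_values = [x for sample in samples for x in sample]
    n = len(all_values)
    k = len(samples)
    correction_factor = sum(all_values) ** 2 / n
    ss_total = sum(x * x for x in all_values) - correction_factor
    # Sum of (sample total squared / sample size), minus the correction factor
    ss_between = sum(sum(s) ** 2 / len(s) for s in samples) - correction_factor
    ss_within = ss_total - ss_between
    f_ratio = (ss_between / (k - 1)) / (ss_within / (n - k))
    return ss_total, ss_between, ss_within, f_ratio


# The three samples of four observations each
samples = [[6, 7, 3, 8], [5, 5, 3, 7], [5, 4, 3, 4]]
ss_total, ss_between, ss_within, f_ratio = anova_shortcut(samples)
print(ss_total, ss_between, ss_within, round(f_ratio, 2))
```

The results agree with the direct method, which is the point of the short-cut: only totals and sums of squares are needed, not deviations from means.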