
APPLIED STATISTICS

IV Semester
COMPLEMENTARY COURSE
B Sc MATHEMATICS
(2011 Admission)
UNIVERSITY OF CALICUT
SCHOOL OF DISTANCE EDUCATION
Calicut University P.O. Malappuram, Kerala, India 673 635
STUDY MATERIAL
Prepared by
Dr. Aneesh Kumar.K.
Department of Statistics
Mahatma Gandhi College, Iritty
Keezhur.P.O. Kannur-670 703.
Layout: Computer Section, SDE

Reserved
CONTENTS
SYLLABUS
1. SKEWNESS AND KURTOSIS
Skewness
Test of skewness
Measures of skewness
Kurtosis
Measures of kurtosis
2. CORRELATION AND REGRESSION
Introduction
Scatter Diagram
Curve fitting
Regression lines
Pearson's Coefficient of correlation
Angle between the regression lines
Identification of regression lines and determination of correlation coefficient
Rank correlation coefficient
Partial and Multiple Correlations
Properties of residuals
Coefficient of multiple correlations
Coefficient of partial correlation
Testing the significance of observed simple correlation coefficient
3. TIME SERIES
Time series
Components of Time series
Mathematical model of time series
Methods of measuring secular trend
Method of measuring seasonal variations
4. STATISTICAL QUALITY CONTROL
Quality
Process and Product Control
Control chart
x-bar (mean) and R (range) chart
p-Chart
d-Chart
C-Chart
5. ANALYSIS OF VARIANCE
Analysis of variance
One way ANOVA
Two way ANOVA
SYLLABUS
Course-IV: Applied Statistics
Module 1: Univariate data: Skewness and kurtosis - Pearson's and Bowley's coefficients of
skewness - moment measures of skewness and kurtosis.
Module 2: Analysis of bivariate data: Curve fitting - fitting of straight lines, parabola,
power curve and exponential curve. Correlation - Pearson's correlation
coefficient and rank correlation coefficient - partial and multiple correlation -
formula for calculation in the 3-variable case - testing the significance of an observed
simple correlation coefficient. Regression - simple linear regression, the two
regression lines, regression coefficients and their properties.
Module 3: Time series: components of time series- measurement of trend by fitting
polynomials-computing moving averages-seasonal indices- simple average-
ratio to moving average
Module 4: Statistical Quality Control: Concept of statistical quality control, assignable
and chance causes, process control. Construction of control charts, 3 sigma
limits. Control chart for variables-X bar chart and R Chart. Control chart for
attributes-p chart, d chart and c chart.
Module 5: Analysis of Variance: One way and two way classifications. Null hypothesis,
total, between and within sum of squares. Assumptions - ANOVA table.
Books for reference:
1. A.M. Goon, M.K. Gupta and Das Gupta: Fundamentals of Statistics, Vol. 1; The World Press, Kolkata.
2. S.C. Gupta and V.K. Kapoor: Fundamentals of Mathematical Statistics; Sultan Chand and Sons.
3. S.P. Gupta: Statistical Methods.
4. E.L. Grant: Statistical Quality Control.
CHAPTER 1
SKEWNESS AND KURTOSIS
1.1. Skewness:
A measure of central tendency gives an idea of the average of a given set of
observations, while a measure of dispersion indicates how the observations are
scattered about a central value or among themselves.
The way in which the observations are distributed about the central value may differ
from one set of observations to another. If the observations are distributed identically
on either side of the central value, the distribution is said to be symmetric; otherwise
it is said to be skewed. Skewness is thus the lack of symmetry in a given set of
observations.
The shape of the frequency curve of a set of observations indicates its skewness.
A bell-shaped frequency curve indicates a symmetric distribution. A frequency curve
with a longer tail extending to the right is called positively skewed (skewed to the
right), while a frequency curve with a longer tail extending to the left is called
negatively skewed (skewed to the left).
For a symmetric distribution the mean, median and mode coincide, that is,
Mean = Median = Mode. For a positively skewed distribution, Mean > Median > Mode,
and for a negatively skewed distribution, Mean < Median < Mode.
The following are examples of the frequency curves of symmetric (or normal),
positively skewed and negatively skewed distributions.
1.2. Test of skewness
Skewness is present in a distribution if:
(i) The values of the mean, median and mode do not coincide.
(ii) When the values are plotted on a graph, they do not yield a normal bell-shaped
curve; that is, when the curve is divided vertically through its centre, the two
halves are unequal.
(iii) Frequencies on either side of the mode are not equal.
1.3. Measures of Skewness:
A measure of skewness expresses the amount of asymmetry in a given series of
observations. There are absolute and relative measures of skewness. An absolute
measure of skewness tells us the extent of asymmetry and whether it is positive or
negative. It is based on the difference between the mean and the mode: if the mean is
greater than the mode the skewness is positive, and if the mean is less than the mode
the skewness is negative. When the mean equals the mode, the distribution is symmetric.
An absolute measure is not adequate for comparing the skewness of two sets of
observations measured in different units. For comparison we therefore use relative
measures of skewness, known as coefficients of skewness. The following are some
important relative measures of skewness.
(i) Karl Pearson's coefficient of skewness
(ii) Bowley's coefficient of skewness
(iii) Measure of skewness based on moments
(iv) Kelly's coefficient of skewness
We discuss them one by one.
(i) Karl Pearson's coefficient of skewness
The coefficient of skewness suggested by Karl Pearson is denoted by J, where
$$J = \frac{\text{Mean} - \text{Mode}}{\text{S.D.}} \quad \text{or} \quad J = \frac{3(\text{Mean} - \text{Median})}{\text{S.D.}}.$$
The range of variation of J is (-3, 3). J > 0 for positively skewed, J < 0 for negatively skewed and J = 0 for symmetric observations.
(ii) Quartile measure of skewness (Bowley's measure)
Bowley suggested a coefficient of skewness based on the quartiles:
$$Q = \frac{(Q_3 - M) - (M - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1}.$$
Here $Q_1$ and $Q_3$ are the first and third quartiles of the given data and M is the median
(or second quartile). The range of variation of Q is (-1, 1). Q > 0 for positively skewed,
Q < 0 for negatively skewed and Q = 0 for symmetric observations.
(iii) Measure of skewness based on moments
The coefficient of skewness based on moments is
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} \quad \text{or} \quad \gamma_1 = \sqrt{\beta_1},$$
where $\mu_2$ and $\mu_3$ are the second and third central moments of the observations. But $\beta_1$ is a ratio of two non-negative quantities, so it is always non-negative. If $\mu_3 = 0$, then $\beta_1 = 0$, and the distribution is said to be symmetric. $\beta_1 > 0$ indicates that the distribution is skewed; the direction is then given by the sign of $\mu_3$. If $\mu_3$ is positive, the distribution is positively skewed, and if $\mu_3$ is negative, the distribution is negatively skewed.
(iv) Kelly's measure of skewness:
The coefficient of skewness suggested by Kelly is
$$S_k = \frac{D_9 + D_1 - 2D_5}{D_9 - D_1},$$
where $D_9$ and $D_1$ are the ninth and first deciles of the given set of observations and $D_5$ is the fifth decile, that is, the median of the set.
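Problem: Find Karl Pearson's coefficient of skewness from the following table.
Wage:            0-10   10-20   20-30   30-40   40-50   50-60   60-70   70-80
No. of persons:   12      18      35      42      50      45      20       8
Solution:
Karl Pearson's coefficient of skewness is
$$J = \frac{\text{Mean} - \text{Mode}}{\text{S.D.}},$$
where
$$\text{Mean} = \bar{x} = \frac{1}{N}\sum_i f_i x_i, \qquad \text{Mode} = l + \frac{c\,(f_1 - f_0)}{(f_1 - f_0) + (f_1 - f_2)}, \qquad \text{S.D.} = \sqrt{\frac{1}{N}\sum_i f_i x_i^2 - \bar{x}^2}.$$
Here l and c are the lower limit and width of the modal class, $f_1$ is its frequency, and $f_0$ and $f_2$ are the frequencies of the classes just before and just after it.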
Necessary calculations follow:
Class    Frequency (f)   Mid-value (x)     fx       fx²
0-10          12               5            60       300
10-20         18              15           270      4050
20-30         35              25           875     21875
30-40         42              35          1470     51450
40-50         50              45          2250    101250
50-60         45              55          2475    136125
60-70         20              65          1300     84500
70-80          8              75           600     45000
Total        230                          9300    444550
Mean $= \dfrac{9300}{230} = 40.43$
S.D. $= \sqrt{\dfrac{1}{N}\sum_i f_i x_i^2 - \bar{x}^2} = \sqrt{\dfrac{444550}{230} - \left(\dfrac{9300}{230}\right)^2} = 17.26$
Here N = 230. The class with the maximum frequency is 40-50; this is taken as the modal class.
Then l = 40, c = 10, $f_0 = 42$, $f_1 = 50$ and $f_2 = 45$.
Hence, Mode $= 40 + \dfrac{10\,(50 - 42)}{(50 - 42) + (50 - 45)} = 46.15$
Coefficient of skewness $J = \dfrac{40.43 - 46.15}{17.26} = -0.3314$.
Hence, the distribution is negatively skewed.
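The hand calculation above can be checked with a short script. This is only an illustrative sketch: the variable names are hypothetical, and the mode step assumes the modal class is neither the first nor the last class. Because no intermediate value is rounded, the last digit may differ slightly from the worked figures.

```python
import math

# Lower class limits, common class width and frequencies from the wage table.
lower = [0, 10, 20, 30, 40, 50, 60, 70]
width = 10
freq = [12, 18, 35, 42, 50, 45, 20, 8]

n = sum(freq)
mids = [l + width / 2 for l in lower]  # class mid-values

mean = sum(f * x for f, x in zip(freq, mids)) / n
variance = sum(f * x * x for f, x in zip(freq, mids)) / n - mean ** 2
sd = math.sqrt(variance)

# Mode by interpolation inside the modal class (the class of largest frequency).
# Assumption: the modal class has a neighbour on each side.
k = freq.index(max(freq))
f0, f1, f2 = freq[k - 1], freq[k], freq[k + 1]
mode = lower[k] + width * (f1 - f0) / ((f1 - f0) + (f1 - f2))

pearson_j = (mean - mode) / sd
print(round(mean, 2), round(sd, 2), round(mode, 2), round(pearson_j, 4))
```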
Problem: Calculate the quartile coefficient of skewness for the following data:
x :  93-97   98-102   103-107   108-112   113-117   118-122   123-127   128-132
f :    2        5        12        17        14         6         3         1
Solution:
Quartile coefficient of skewness
$$Q = \frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1}$$
Adjusted class     Frequency     Less-than cum. freq.
92.5-97.5              2                  2
97.5-102.5             5                  7
102.5-107.5           12                 19
107.5-112.5           17                 36
112.5-117.5           14                 50
117.5-122.5            6                 56
122.5-127.5            3                 59
127.5-132.5            1                 60
N = 60
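The class in which the $\frac{N}{4}$th = 15th observation lies is 102.5-107.5.
The class in which the $\frac{N}{2}$th = 30th observation lies is 107.5-112.5.
The class in which the $\frac{3N}{4}$th = 45th observation lies is 112.5-117.5.
Hence, writing $l_j$, $f_j$ and $c_j$ for the lower boundary, frequency and width of the class containing $Q_j$, and $m_j$ for the cumulative frequency below that class,
$$Q_1 = l_1 + \frac{\left(\frac{N}{4} - m_1\right)c_1}{f_1} = 102.5 + \frac{\left(\frac{60}{4} - 7\right)5}{12} = 105.83$$
$$Q_3 = l_3 + \frac{\left(\frac{3N}{4} - m_3\right)c_3}{f_3} = 112.5 + \frac{\left(\frac{3 \times 60}{4} - 36\right)5}{14} = 115.714$$
$$\text{Median } M = Q_2 = l_2 + \frac{\left(\frac{N}{2} - m_2\right)c_2}{f_2} = 107.5 + \frac{\left(\frac{60}{2} - 19\right)5}{17} = 110.735$$
Then
$$Q = \frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1} = \frac{115.714 + 105.83 - 2 \times 110.735}{115.714 - 105.83} = 0.0075.$$
The distribution is slightly positively skewed.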
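The interpolation used for $Q_1$, M and $Q_3$ is the same rule applied at different positions, so it can be sketched as one function. The helper name `grouped_quantile` and its argument layout are assumptions made for this illustration, not part of the course text. Working without rounding gives a coefficient close to, but not exactly equal to, the hand-computed value.

```python
def grouped_quantile(boundaries, freq, p):
    """Interpolated quantile of order p (0 < p < 1) for grouped data.

    boundaries[i] and boundaries[i + 1] are the class boundaries of class i.
    """
    n = sum(freq)
    target = p * n  # position of the required observation
    cum = 0         # cumulative frequency below the current class
    for i, f in enumerate(freq):
        if f > 0 and cum + f >= target:
            l = boundaries[i]
            c = boundaries[i + 1] - boundaries[i]
            return l + (target - cum) * c / f
        cum += f
    raise ValueError("p must lie strictly between 0 and 1")


boundaries = [92.5, 97.5, 102.5, 107.5, 112.5, 117.5, 122.5, 127.5, 132.5]
freq = [2, 5, 12, 17, 14, 6, 3, 1]

q1 = grouped_quantile(boundaries, freq, 0.25)
med = grouped_quantile(boundaries, freq, 0.50)
q3 = grouped_quantile(boundaries, freq, 0.75)
bowley = (q3 + q1 - 2 * med) / (q3 - q1)
print(round(q1, 3), round(med, 3), round(q3, 3), round(bowley, 4))
```

The same function gives deciles by passing p = 0.1 or p = 0.9.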
Problem: The first four moments about the value 5 of a distribution are 2, 20, 40 and 50. Calculate
the mean, variance, coefficient of skewness and coefficient of kurtosis and comment on the nature of
the distribution.
Solution:
Given the moments about A = 5,
$$\mu_1'(5) = \frac{1}{N}\sum_{i=1}^{k} f_i (x_i - 5) = 2; \qquad \mu_2'(5) = \frac{1}{N}\sum_{i=1}^{k} f_i (x_i - 5)^2 = 20;$$
$$\mu_3'(5) = \frac{1}{N}\sum_{i=1}^{k} f_i (x_i - 5)^3 = 40 \quad \text{and} \quad \mu_4'(5) = \frac{1}{N}\sum_{i=1}^{k} f_i (x_i - 5)^4 = 50.$$
Since $\mu_1'(5) = \frac{1}{N}\sum_{i=1}^{k} f_i (x_i - 5) = \bar{x} - 5 = 2$, the mean is $\bar{x} = 2 + 5 = 7$.
We have, in general,
$$\mu_r = \mu_r'(A) - {}^{r}C_1\, \mu_{r-1}'(A)\, \mu_1'(A) + {}^{r}C_2\, \mu_{r-2}'(A)\, [\mu_1'(A)]^2 - \dots + (-1)^r [\mu_1'(A)]^r.$$
Hence the variance is
$$\mu_2 = \mu_2'(5) - [\mu_1'(5)]^2 = 20 - 4 = 16$$
and
$$\mu_3 = \mu_3'(5) - 3\mu_2'(5)\,\mu_1'(5) + 2[\mu_1'(5)]^3 = 40 - 3 \times 20 \times 2 + 2 \times 2^3 = -64.$$
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(-64)^2}{(16)^3} = 1$$
Since $\mu_3$ is negative, the coefficient of skewness $\gamma_1 = -\sqrt{\beta_1} = -1$ is negative and the distribution is negatively skewed.
1.4. Kurtosis:
The word kurtosis in the Greek language means bulginess. The majority of frequency
curves are bell-shaped, more or less symmetric and unimodal, but the concentration of
observations in the neighbourhood of the mode may differ. If the frequency curve is almost
the same in shape as the graph of the function
$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}; \quad -\infty < x < +\infty,$$
then the curve is said to be a normal curve. The peakedness or flatness of a frequency curve in comparison
with the normal curve is known as kurtosis. If more observations are concentrated in the
neighbourhood of the mode, the curve becomes more peaked than the normal curve and the
distribution is said to be leptokurtic. A relatively lower concentration of observations in the
neighbourhood of the mode makes the curve flatter than the normal curve; the
distribution is then said to be platykurtic. A distribution whose frequency curve is almost the same
as the normal curve is said to be mesokurtic.
The shapes of the different types of frequency curves are given here.
W.S. Gosset humorously describes kurtosis as follows: platykurtic curves, like the
platypus, are squat with short tails; leptokurtic curves are high with long tails, like the
kangaroos noted for leaping. The sketch is as follows:
1.4.1. Measures of Kurtosis:
A measure of the peakedness or flatness of a frequency distribution is a measure of
kurtosis. The following are the various measures of kurtosis.
(i) Measure of kurtosis based on moments.
The coefficient of kurtosis based on moments is
$$\beta_2 = \frac{\mu_4}{\mu_2^2} \quad \text{or} \quad \gamma_2 = \beta_2 - 3,$$
where $\mu_4$ and $\mu_2$ are the fourth and second central moments of the set of observations. $\beta_2 > 3$ for a leptokurtic, $\beta_2 = 3$ for a mesokurtic and $\beta_2 < 3$ for a platykurtic distribution.
(ii) Measure of kurtosis based on quartiles.
Coefficient of kurtosis
$$k = \frac{\frac{1}{2}(Q_3 - Q_1)}{Q_{0.9} - Q_{0.1}},$$
where $Q_3$ and $Q_1$ are the third and first quartiles. $Q_{0.9}$ and $Q_{0.1}$ are the observations at the $\left(\frac{9N}{10}\right)$th and $\left(\frac{N}{10}\right)$th positions when the observations are arranged in ascending order of magnitude; they are called the 9th and 1st deciles. For a mesokurtic distribution, k is about 0.25 (0.263 for the normal curve).
Problem: The first four moments about the value 5 of a distribution are 2, 20, 40 and 50. Calculate
the mean, variance and coefficient of kurtosis and comment on the nature of the distribution.
Solution:
Given the moments about A = 5,
$$\mu_1'(5) = \frac{1}{N}\sum_{i=1}^{k} f_i (x_i - 5) = 2; \qquad \mu_2'(5) = \frac{1}{N}\sum_{i=1}^{k} f_i (x_i - 5)^2 = 20;$$
$$\mu_3'(5) = \frac{1}{N}\sum_{i=1}^{k} f_i (x_i - 5)^3 = 40 \quad \text{and} \quad \mu_4'(5) = \frac{1}{N}\sum_{i=1}^{k} f_i (x_i - 5)^4 = 50.$$
Since $\mu_1'(5) = \bar{x} - 5 = 2$, the mean is $\bar{x} = 2 + 5 = 7$.
We have, in general,
$$\mu_r = \mu_r'(A) - {}^{r}C_1\, \mu_{r-1}'(A)\, \mu_1'(A) + {}^{r}C_2\, \mu_{r-2}'(A)\, [\mu_1'(A)]^2 - \dots + (-1)^r [\mu_1'(A)]^r.$$
Hence
$$\mu_2 = \mu_2'(5) - [\mu_1'(5)]^2 = 20 - 4 = 16$$
$$\mu_3 = \mu_3'(5) - 3\mu_2'(5)\,\mu_1'(5) + 2[\mu_1'(5)]^3 = 40 - 3 \times 20 \times 2 + 2 \times 2^3 = -64$$
$$\mu_4 = \mu_4'(5) - 4\mu_3'(5)\,\mu_1'(5) + 6\mu_2'(5)\,[\mu_1'(5)]^2 - 3[\mu_1'(5)]^4 = 50 - 4 \times 40 \times 2 + 6 \times 20 \times 2^2 - 3 \times 2^4 = 162$$
$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{162}{16^2} = 0.6328$$
Since the coefficient of kurtosis is less than 3, the distribution is platykurtic.
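The conversion from moments about an arbitrary point A to central moments is a binomial expansion, which makes it easy to sketch in code. The function name `central_from_raw` is hypothetical; the formula it implements is the full binomial sum, whose last two terms combine to give the $(-1)^r[\mu_1'(A)]^r$ style endings used above.

```python
from math import comb


def central_from_raw(raw):
    """Central moments from moments about an arbitrary point A.

    raw[r - 1] is the r-th moment about A (r = 1, 2, ...).
    Returns a dict {r: mu_r}.
    """
    m = [1.0] + list(raw)  # m[0] = 1 is the zeroth moment
    d = m[1]               # d = mean - A
    central = {}
    for r in range(1, len(m)):
        # mu_r = sum over j of C(r, j) * m[r - j] * (-d)**j
        central[r] = sum(comb(r, j) * m[r - j] * (-d) ** j for j in range(r + 1))
    return central


mu = central_from_raw([2, 20, 40, 50])  # moments about A = 5
mean = 5 + 2
beta1 = mu[3] ** 2 / mu[2] ** 3
beta2 = mu[4] / mu[2] ** 2
print(mean, mu[2], mu[3], mu[4], beta1, beta2)
```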
Problem: Find the coefficient of kurtosis based on quartiles for the following data.
x :  10-15   16-20   21-25   26-30   31-35   36-40   41-45
f :    3       4       68      30      10       6       2
Solution:
Coefficient of kurtosis
$$k = \frac{\frac{1}{2}(Q_3 - Q_1)}{Q_{0.9} - Q_{0.1}}$$
Adjusted class     Frequency     Less-than cum. freq.
9.5-15.5               3                  3
15.5-20.5              4                  7
20.5-25.5             68                 75
25.5-30.5             30                105
30.5-35.5             10                115
35.5-40.5              6                121
40.5-45.5              2                123
N = 123
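The class in which the $\frac{N}{4}$th = 30.75th (that is, the 31st) observation lies is 20.5-25.5.
The class in which the $\frac{3N}{4}$th = 92.25th (that is, the 93rd) observation lies is 25.5-30.5.
The class in which the $\frac{N}{10}$th = 12.3th observation lies is 20.5-25.5.
The class in which the $\frac{9N}{10}$th = 110.7th (that is, the 111th) observation lies is 30.5-35.5.
Hence, with l, f and c the lower boundary, frequency and width of the class containing the required value and m the cumulative frequency below that class,
$$Q_1 = l_1 + \frac{\left(\frac{N}{4} - m_1\right)c_1}{f_1} = 20.5 + \frac{\left(\frac{123}{4} - 7\right)5}{68} = 22.25$$
$$Q_3 = l_3 + \frac{\left(\frac{3N}{4} - m_3\right)c_3}{f_3} = 25.5 + \frac{\left(\frac{3 \times 123}{4} - 75\right)5}{30} = 28.375$$
$$Q_{0.1} = l_{0.1} + \frac{\left(\frac{N}{10} - m_{0.1}\right)c_{0.1}}{f_{0.1}} = 20.5 + \frac{\left(\frac{123}{10} - 7\right)5}{68} = 20.89$$
$$Q_{0.9} = l_{0.9} + \frac{\left(\frac{9N}{10} - m_{0.9}\right)c_{0.9}}{f_{0.9}} = 30.5 + \frac{\left(\frac{9 \times 123}{10} - 105\right)5}{10} = 33.35$$
Coefficient of kurtosis
$$k = \frac{\frac{1}{2}(Q_3 - Q_1)}{Q_{0.9} - Q_{0.1}} = \frac{\frac{1}{2}(28.375 - 22.25)}{33.35 - 20.89} = 0.2458$$
Since k is close to 0.25, the curve is almost mesokurtic.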
EXERCISES
1. Define skewness. What are the measures of skewness?
2. Define kurtosis. Explain the various measures of kurtosis.
3. Calculate skewness and kurtosis for the following distribution:
Class:      1-5   6-10   11-15   16-20   21-25   26-30   31-35
Frequency:   3      4      68      30      10       6       2
4. Karl Pearson's coefficient of skewness of a distribution is 0.32. Its standard
deviation is 6.5 and the mean is 29.6. Find the mode and median.
5. The coefficient of skewness for a certain distribution based on the quartiles is 0.5. If
the sum of the upper and lower quartiles is 28 and the median is 11, find the values of
the upper and lower quartiles.
6. For the frequency distribution given below, calculate the coefficient of skewness
based on quartiles.
Monthly sales (Rs. in lakh)     No. of firms
Less than 20                         30
Less than 30                        225
Less than 40                        465
Less than 50                        580
Less than 60                        634
Less than 70                        644
Less than 80                        650
Less than 90                        665
Less than 100                       680
7. For a distribution the mean is 10, the variance is 16, the coefficient of skewness is 1 and the
coefficient of kurtosis is 4. Find the first four moments about the origin.
8. Obtain Karl Pearson's measure of skewness for the following data:
Class:      0-10   10-20   20-30   30-40   40-50   50-60   60-70
Frequency:    8      12      24      20      18      10       8
*************************
CHAPTER 2
CORRELATION AND REGRESSION
2.1. Introduction:
Let us consider two characteristics X and Y which are numerically measurable.
Assume X denotes the height and Y the corresponding weight of college students. For
a set of students, when we record their height (X) and weight (Y), we get two
values for each individual: one value corresponds to the height and the other
to the weight of that individual. We record the data for that individual as an
ordered pair. The procedure is repeated for all the students, and finally we get a set of
ordered pairs on X and Y. Such data on two characteristics are called bivariate data. The
study of whether there is any relation between these characteristics has two distinct
aspects: one is correlation analysis and the other is regression analysis.
Correlation analysis determines the degree of linear relationship between the
characteristics X and Y. Regression analysis establishes the nature of the linear relationship
between the characteristics. A simple way to get a rough idea of the correlation and
regression of the two characteristics is the scatter diagram method.
2.2. Scatter Diagram:
Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be the set of observations obtained on the two
characteristics X and Y. A diagram obtained by plotting these values on a graph is called
a scatter diagram. It consists of points scattered over the graph.
Consider the following scatter diagrams obtained by plotting the observations on
some X and Y.
In the first scatter diagram (figure 1), we observe that all the points are scattered
close to a straight line. The line is such that as X increases, Y also
increases. We can then roughly say that there exists a positive linear relation between X and Y.
Since the points are closely clustered around the straight line, there is a high degree of
linear relation.
In the second diagram (figure 2), we again observe that all the points are scattered
close to a straight line, but the line is such that as X increases, Y decreases.
We can then suspect that there exists a negative linear relation between X and Y. Here also the
points are closely clustered around the straight line, so the degree of linear relation is
high.
In the next scatter diagram (figure 3), no specific relation between X and Y is
observed. One can then infer that there is no correlation between X and Y.
2.3. Curve Fitting:
We have a set of observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ on two variables
(characteristics) X and Y. If we feel that there is some relation between these two
variables, let it be of the form $y_i = f(x_i, a_1, a_2, \dots, a_n) + \varepsilon_i$. Here $a_1, a_2, \dots, a_n$ are the constants
involved, known as parameters, and $\varepsilon_i$ is the error term, known as the residual error. For
example, if we feel there is a linear relation of the form $y = ax + b$, we can express it as
$y_i = f(x_i, a, b) + \varepsilon_i$ for the point $(x_i, y_i)$. The relation $y = ax + b$ is only our assumption
regarding the relation between X and Y, so the points $(x_i, y_i)$ may not all strictly obey
it. Using the assumed relation $y = ax + b$, we can calculate the
value of $y_i$ corresponding to each given value of $x_i$. The difference between the given $y_i$ value
for an $x_i$ value and the $y_i$ value calculated for that $x_i$ from the proposed relation is the
residual error. That is why we express the $y_i$ value as $y_i = f(x_i, a, b) + \varepsilon_i$. Hence the error
involved in the $y_i$ value is $\varepsilon_i = y_i - f(x_i, a, b)$.
In general, consider a relation between X and Y of the form
$y_i = f(x_i, a_1, a_2, \dots, a_n) + \varepsilon_i$. Then the residual error in the $y_i$ value is $\varepsilon_i = y_i - f(x_i, a_1, a_2, \dots, a_n)$. To
identify the relation between X and Y in terms of the parameters $a_1, a_2, \dots, a_n$, we must
estimate the values of these parameters. The best values of $a_1, a_2, \dots, a_n$ are those
which make the residual errors a minimum. The process of determining the
best values of the parameters $a_1, a_2, \dots, a_n$ is statistically known as curve fitting. The values of
the parameters are estimated using the principle of least squares.
The principle of least squares states that the best estimates of $a_1, a_2, \dots, a_n$ are those values
which minimize the sum of squares of the residual errors over all the $y_i$ values.
That is, we find the values of $a_1, a_2, \dots, a_n$ which minimize
$$E = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left[ y_i - f(x_i, a_1, a_2, \dots, a_n) \right]^2.$$
The values of $a_1, a_2, \dots, a_n$ which minimize E can be obtained by solving the
following equations:
$$\frac{\partial E}{\partial a_1} = 0, \quad \frac{\partial E}{\partial a_2} = 0, \quad \dots, \quad \frac{\partial E}{\partial a_n} = 0.$$
These equations are known as normal equations.
- Fitting of a straight line $y = ax + b$
Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be the observations taken. The aim is to fit the most
suitable straight line to the given data, that is, to estimate the best values of the
parameters a and b. By the principle of least squares, the best values of a and b
are those which minimize E, where
$$E = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left[ y_i - f(x_i, a, b) \right]^2 = \sum_{i=1}^{n} \left[ y_i - (a x_i + b) \right]^2.$$
The normal equations are $\frac{\partial E}{\partial a} = 0$ and $\frac{\partial E}{\partial b} = 0$.
$$\frac{\partial E}{\partial a} = 0 \;\Rightarrow\; \frac{\partial}{\partial a} \sum_{i=1}^{n} \left[ y_i - (a x_i + b) \right]^2 = 0 \;\Rightarrow\; -2 \sum_{i=1}^{n} \left[ y_i - (a x_i + b) \right] x_i = 0$$
$$\Rightarrow\; \sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i \qquad (1)$$
$$\frac{\partial E}{\partial b} = 0 \;\Rightarrow\; \frac{\partial}{\partial b} \sum_{i=1}^{n} \left[ y_i - (a x_i + b) \right]^2 = 0 \;\Rightarrow\; -2 \sum_{i=1}^{n} \left[ y_i - (a x_i + b) \right] = 0$$
$$\Rightarrow\; \sum_{i=1}^{n} y_i = a \sum_{i=1}^{n} x_i + n b \qquad (2)$$
Solving (1) and (2) using the given data, the best estimates of a and b can be obtained.
If the given X, Y values are large, the calculations can be made easier by transforming
them to U, V values of the form $u = \frac{x - a}{b}$ and $v = \frac{y - c}{d}$, where a, b, c and d here denote
conveniently chosen constants (an origin and a scale for each variable), not the parameters
of the line. Then fit a line of the form $v = a'u + b'$ and substitute back for u and
v to get the required relation in terms of X and Y.
Problem: Fit a straight line to the following data
x : 3 4 5 6 7
y : 4 5 6 8 10
Solution:
Consider a straight line of the form y = ax + b. We find the best values of a
and b using the normal equations
$$\sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i \quad \text{and} \quad \sum_{i=1}^{n} y_i = a \sum_{i=1}^{n} x_i + nb.$$
The calculations are as follows:
x        y        x²       xy
3        4         9       12
4        5        16       20
5        6        25       30
6        8        36       48
7       10        49       70
Σx = 25   Σy = 33   Σx² = 135   Σxy = 180
The normal equations corresponding to the given data are
180 = 135a + 25b ----- (1)
33 = 25a + 5b ----- (2)
Multiplying (2) by 5 we get 165 = 125a + 25b ----- (3)
(1) - (3) gives 15 = 10a, so a = 3/2 = 1.5.
Putting a = 3/2 in (2), 33 = 25 × (3/2) + 5b, so b = -0.9.
Hence the required fitted straight line is y = 1.5x - 0.9.
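The two normal equations can also be solved in closed form, which gives a quick check on the elimination above. The sketch below is illustrative only; `fit_line` is a hypothetical helper name, and it assumes the x values are not all equal (otherwise the determinant is zero).

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + b from the two normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Solve  sxy = a*sxx + b*sx  and  sy = a*sx + n*b  by Cramer's rule.
    det = n * sxx - sx * sx
    a = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b


a, b = fit_line([3, 4, 5, 6, 7], [4, 5, 6, 8, 10])
print(a, b)
```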
- Fitting of a curve $y = ax^2 + bx + c$
Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be the given data. To fit the given curve, we must
estimate the values of a, b and c. By the principle of least squares the best estimates are
those values of a, b and c which minimize E. Here
$$E = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left[ y_i - (a x_i^2 + b x_i + c) \right]^2.$$
The normal equations are $\frac{\partial E}{\partial a} = 0$, $\frac{\partial E}{\partial b} = 0$ and $\frac{\partial E}{\partial c} = 0$. On differentiation the normal
equations become
$$\sum_{i=1}^{n} x_i^2 y_i = a \sum_{i=1}^{n} x_i^4 + b \sum_{i=1}^{n} x_i^3 + c \sum_{i=1}^{n} x_i^2 \qquad (1)$$
$$\sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i^3 + b \sum_{i=1}^{n} x_i^2 + c \sum_{i=1}^{n} x_i \qquad (2)$$
$$\sum_{i=1}^{n} y_i = a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i + n c \qquad (3)$$
Solve these normal equations using the given data to get the values of a, b and c.
While solving problems, appropriate transformations may be used to reduce the
calculations, as illustrated in the case of fitting a straight line.
Problem: Fit a parabola to the following data:
x : 1960 1962 1964 1966 1968
y : 125 140 165 195 230
Solution:
Let the equation of the parabola be written in the form $y = a + bx + cx^2$, so that a is the constant term and c the coefficient of $x^2$.
To identify the best values of a, b and c, we use the following normal equations:
$$\sum_{i=1}^{n} x_i^2 y_i = a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i^3 + c \sum_{i=1}^{n} x_i^4 \qquad (1)$$
$$\sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 + c \sum_{i=1}^{n} x_i^3 \qquad (2)$$
$$\sum_{i=1}^{n} y_i = n a + b \sum_{i=1}^{n} x_i + c \sum_{i=1}^{n} x_i^2 \qquad (3)$$
Here the given values of the variables are large numbers. So we first transform
x and y to new variables u and v, fit a parabola for u and v, and from it derive
the parabola for x and y. The working is shown below, with $u = \frac{x - 1964}{2}$ and $v = \frac{y - 165}{5}$.
x       y      u     v     u²    u³    u⁴    uv    u²v
1960   125    -2    -8     4    -8    16    16    -32
1962   140    -1    -5     1    -1     1     5     -5
1964   165     0     0     0     0     0     0      0
1966   195     1     6     1     1     1     6      6
1968   230     2    13     4     8    16    26     52
Total          0     6    10     0    34    53     21
The normal equations in terms of u and v are
$$\sum u_i^2 v_i = a \sum u_i^2 + b \sum u_i^3 + c \sum u_i^4, \quad \sum u_i v_i = a \sum u_i + b \sum u_i^2 + c \sum u_i^3, \quad \sum v_i = n a + b \sum u_i + c \sum u_i^2.$$
For the given data these normal equations are
21 = 10a + 0b + 34c ----- (1)
53 = 0a + 10b + 0c ----- (2)
6 = 5a + 0b + 10c ----- (3)
From (2), b = 53/10 = 5.3.
(1) - 2 × (3) gives 9 = 14c, so c = 9/14 = 0.643.
Then from (3), 5a = 6 - 10(0.643), so a = -0.086.
The parabola in u and v is therefore
$$v = -0.086 + 5.3u + 0.643u^2.$$
Substituting $u = \frac{x - 1964}{2}$ and $v = \frac{y - 165}{5}$, we get
$$\frac{y - 165}{5} = -0.086 + 5.3\left(\frac{x - 1964}{2}\right) + 0.643\left(\frac{x - 1964}{2}\right)^2,$$
that is,
$$y = 164.57 + 13.25\,(x - 1964) + 0.8036\,(x - 1964)^2.$$
Expanding the brackets gives approximately $y = 0.8036x^2 - 3143.18x + 3073754.43$. Because x is large, the expanded coefficients must be carried to many significant figures, so the form centred at 1964 is the safer one to use for calculation.
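Because a single slip in the table of powers and products changes every later figure, it can be worth recomputing the sums directly from the data. The following is a minimal sketch with hypothetical helper names (`solve_linear`, `fit_parabola`); it fits $v = a + bu + cu^2$ to the transformed values $u = (x - 1964)/2$, $v = (y - 165)/5$ by building the three normal equations and solving them by Gaussian elimination.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_parabola(us, vs):
    """Least-squares fit of v = a + b*u + c*u**2 via the normal equations."""
    n = len(us)
    s = lambda p: sum(u ** p for u in us)                     # sum of u^p
    t = lambda p: sum((u ** p) * v for u, v in zip(us, vs))   # sum of u^p * v
    lhs = [[n, s(1), s(2)],
           [s(1), s(2), s(3)],
           [s(2), s(3), s(4)]]
    rhs = [t(0), t(1), t(2)]
    return solve_linear(lhs, rhs)


xs = [1960, 1962, 1964, 1966, 1968]
ys = [125, 140, 165, 195, 230]
us = [(x - 1964) / 2 for x in xs]
vs = [(y - 165) / 5 for y in ys]
a, b, c = fit_parabola(us, vs)
print(round(a, 4), round(b, 4), round(c, 4))
```

Fitted values of y then follow from y = 165 + 5(a + bu + cu²), without ever expanding in powers of the large x values.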
- Fitting of a curve $y = ab^x$
Taking logarithms to the base 10 on both sides, the curve $y = ab^x$ becomes
$\log y = \log a + x \log b$. Let $Y = \log y$, $A = \log a$ and $B = \log b$. The required curve is then of
the form $Y = A + Bx$, or $Y = Bx + A$. Given the x and Y values, it is easy to estimate
the parameters A and B using the method of fitting a straight line. We can then obtain a
and b as the antilogarithms of A and B respectively.
To fit a curve of the form $y = ab^x$ to the given set of observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$,
obtain the $Y_i$ values by taking the logarithm of the given $y_i$ values. Using the $x_i$ and $Y_i$
values, solve the following normal equations to estimate A and B:
$$\sum_{i=1}^{n} x_i Y_i = B \sum_{i=1}^{n} x_i^2 + A \sum_{i=1}^{n} x_i \qquad (1)$$
and
$$\sum_{i=1}^{n} Y_i = B \sum_{i=1}^{n} x_i + nA \qquad (2)$$
Solve (1) and (2) to obtain A and B; then, by taking the antilogarithms of A and B, we get a and
b. Hence the curve $y = ab^x$ is fitted.
- Fitting of a curve $y = ax^b$
After taking logarithms on both sides, the curve $y = ax^b$ can also be converted to the
form of a straight line. That is, the curve becomes $\log y = \log a + b \log x$. Let $Y = \log y$,
$X = \log x$ and $A = \log a$; then the curve becomes $Y = A + bX$, or $Y = bX + A$. Using the given
$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ values, obtain the $X_i$ and $Y_i$ values by taking logarithms of the $x_i$ and $y_i$
values. Then A and b can be found by solving the following normal equations:
$$\sum_{i=1}^{n} X_i Y_i = b \sum_{i=1}^{n} X_i^2 + A \sum_{i=1}^{n} X_i \qquad (1)$$
and
$$\sum_{i=1}^{n} Y_i = b \sum_{i=1}^{n} X_i + nA \qquad (2)$$
The value of a is obtained by taking the antilogarithm of A. Hence the required
curve is fitted.
- Fitting of a curve $y = ae^{bx}$
The method illustrated above can also be used for fitting $y = ae^{bx}$.
Taking logarithms on both sides, the curve becomes $\log y = \log a + bx \log e$. Let $Y = \log y$,
$A = \log a$ and $B = b \log e$; we get $Y = A + Bx$, or $Y = Bx + A$. From the given
$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ values, the $Y_i$ values are obtained by taking the logarithm of the $y_i$ values. Then use the
following normal equations to obtain A and B:
$$\sum_{i=1}^{n} x_i Y_i = B \sum_{i=1}^{n} x_i^2 + A \sum_{i=1}^{n} x_i \qquad (1)$$
and
$$\sum_{i=1}^{n} Y_i = B \sum_{i=1}^{n} x_i + nA \qquad (2)$$
Now a is the antilogarithm of A and $b = \dfrac{B}{\log e}$.
Problem: For the data given below, find the equation of the best-fitting exponential curve of the form $y = ae^{bx}$.
x : 1 2 3 4 5 6
y : 1.6 4.5 13.8 40.2 125 300
Solution:
Taking logarithms to base 10 on both sides, the given curve $y = ae^{bx}$ takes the
form $\log y = \log a + bx \log e$.
This is of the form $Y = A + Bx$,
where $Y = \log y$, $A = \log a$ and $B = b \log e$.
Using the values of Y and x, we can fit the line $Y = A + Bx$, that is, find the best
values of A and B. Using these values of A and B, we can get the values of a and b.
To ease the calculation we transform x to u, where $u = x - 3$.
Now, using u and Y, fit a line of the form $Y = A' + B'u$ using the normal equations
$$\sum_{i=1}^{n} u_i Y_i = B' \sum_{i=1}^{n} u_i^2 + A' \sum_{i=1}^{n} u_i \quad \text{and} \quad \sum_{i=1}^{n} Y_i = B' \sum_{i=1}^{n} u_i + nA'.$$
The calculations are as follows:
x      y       u = x - 3    Y = log10 y      uY       u²
1      1.6        -2           0.204       -0.408      4
2      4.5        -1           0.653       -0.653      1
3     13.8         0           1.140        0          0
4     40.2         1           1.604        1.604      1
5    125           2           2.097        4.194      4
6    300           3           2.477        7.431      9
Total              3           8.175       12.168     19
Here the normal equations for u and Y are
12.168 = 3A' + 19B' and 8.175 = 6A' + 3B'.
Solving these two equations we get A' = 1.13 and B' = 0.46.
Hence the line connecting u and Y is Y = 1.13 + 0.46u.
That is, Y = 1.13 + 0.46(x - 3), or Y = -0.25 + 0.46x.
This implies A = -0.25 and B = 0.46.
That is, $\log_{10} a = -0.25$ and $b \log_{10} e = 0.46$.
From these we get a = 0.56 and b = 0.46/0.4343 = 1.06.
Hence the required curve is $y = 0.56\,e^{1.06x}$.
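The log-linear method used in this problem can be sketched as follows. The function name `fit_exponential` is hypothetical, and the sketch assumes every y value is positive so that its logarithm exists. It works with x directly rather than u = x - 3 and keeps full precision in the logarithms, so its output agrees with the hand calculation only to about two significant figures.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by a least-squares line through (x, log10 y)."""
    n = len(xs)
    big_y = [math.log10(y) for y in ys]  # Y = log10 y
    sx, sy = sum(xs), sum(big_y)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * v for x, v in zip(xs, big_y))
    det = n * sxx - sx * sx
    big_b = (n * sxy - sx * sy) / det    # slope B = b * log10(e)
    big_a = (sy - big_b * sx) / n        # intercept A = log10(a)
    return 10 ** big_a, big_b / math.log10(math.e)


a, b = fit_exponential([1, 2, 3, 4, 5, 6], [1.6, 4.5, 13.8, 40.2, 125, 300])
print(round(a, 3), round(b, 3))
```

Note that this minimizes squared errors in log y, not in y itself, which is exactly what the normal equations for Y = A + Bx do.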
Problem: Fit a curve of the form $y = ax^b$ to the following data
x : 66 64 55 51 42 32 24
y : 2.5 7.5 12.5 17.5 25 40 75
Solution:
Taking logarithms on both sides of the required curve $y = ax^b$, we get
$\log y = \log a + b \log x$. This is of the form $Y = A + bX$, where $Y = \log y$, $A = \log a$ and
$X = \log x$.
The calculations are:
x      y       X = log x    Y = log y      XY        X²
66     2.5      1.8195       0.3979      0.7239    3.3106
64     7.5      1.8061       0.8751      1.5805    3.2619
55    12.5      1.7403       1.0969      1.9089    3.0286
51    17.5      1.7075       1.2430      2.1224    2.9156
42    25        1.6232       1.3979      2.2690    2.6347
32    40        1.5051       1.6021      2.4113    2.2653
24    75        1.3802       1.8750      2.5879    1.9049
Total          11.5819       8.4879     13.6036   19.3216
The normal equations for $Y = A + bX$ are
$$\sum_{i=1}^{n} X_i Y_i = b \sum_{i=1}^{n} X_i^2 + A \sum_{i=1}^{n} X_i \quad \text{and} \quad \sum_{i=1}^{n} Y_i = b \sum_{i=1}^{n} X_i + nA.$$
Here the normal equations are
13.6036 = 19.3216 b + 11.5819 A ---- (1)
8.4879 = 11.5819 b + 7 A ----- (2)
Solving these normal equations, we get b = -2.773 and A = 5.8008.
From A = 5.8008, we get a = antilog(A) = antilog(5.8008) = 632120.68.
Hence the required curve is $y = 632120.68\, x^{-2.773}$.
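For the power curve both variables are transformed, and the estimate of a is very sensitive to rounding: the determinant of the normal equations is small, and a is the antilogarithm of a number near 5.8. The sketch below (hypothetical function name `fit_power`, assuming all x and y are positive) repeats the fit with unrounded logarithms; it should be read as a rough check, since it returns values near but not identical to the four-decimal hand calculation.

```python
import math


def fit_power(xs, ys):
    """Fit y = a * x**b by a least-squares line through (log10 x, log10 y)."""
    n = len(xs)
    big_x = [math.log10(x) for x in xs]  # X = log10 x
    big_y = [math.log10(y) for y in ys]  # Y = log10 y
    sx, sy = sum(big_x), sum(big_y)
    sxx = sum(u * u for u in big_x)
    sxy = sum(u * v for u, v in zip(big_x, big_y))
    det = n * sxx - sx * sx
    b = (n * sxy - sx * sy) / det        # slope is the exponent b itself
    big_a = (sy - b * sx) / n            # intercept A = log10(a)
    return 10 ** big_a, b


a, b = fit_power([66, 64, 55, 51, 42, 32, 24],
                 [2.5, 7.5, 12.5, 17.5, 25, 40, 75])
print(round(a), round(b, 3))
```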
Problem: Fit a curve of the form $y = ab^x$ to the following data
x : 2 3 4 5 6
y : 144 172.8 207.4 248.8 298.6
Solution:
Taking logarithms on both sides of the required curve $y = ab^x$, we get
$\log y = \log a + x \log b$. This is of the form $Y = A + Bx$, where $Y = \log y$, $A = \log a$, and $B = \log b$.
Using the values of $Y = \log y$ and x, we can find the best values of A and B using
the normal equations for fitting the line $Y = A + Bx$. From these the values of a and b can
be found.
The calculations are:
x      y        Y = log y    xY       x^2
2      144      2.16         4.32     4
3      172.8    2.24         6.72     9
4      207.4    2.32         9.28     16
5      248.8    2.40         12       25
6      298.6    2.47         14.82    36
Sum: 20         11.59        47.14    90
That is, $\sum x = 20$, $\sum Y = 11.59$, $\sum xY = 47.14$ and $\sum x^2 = 90$.
The normal equations for $Y = A + Bx$ are
$$\sum_{i=1}^{n} x_i Y_i = B \sum_{i=1}^{n} x_i^2 + A \sum_{i=1}^{n} x_i \quad\text{and}\quad \sum_{i=1}^{n} Y_i = B \sum_{i=1}^{n} x_i + nA.$$
Here the normal equations are,
47.14 = 90 B + 20 A --- (1)
11.59 = 20 B + 5 A --- (2)
(1) - 4 × (2) gives 0.78 = 10 B, so B = 0.078.
Solving (2) using B = 0.078, we get A = 2.006.
Then a = antilog(2.006) = 101.3 and b = antilog(0.078) = 1.196.
Hence the required curve is $y = 101.3\,(1.196)^x$.
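Illustration (computational): the same fit can be sketched in Python. The function name `fit_exponential_curve` is an assumption made for this illustration. Since the program keeps the logarithms at full precision instead of two decimal places, its constants differ slightly from the hand-calculated ones.

```python
import math

def fit_exponential_curve(xs, ys):
    """Fit y = a * b**x by least squares on Y = log10(y) against x."""
    Y = [math.log10(y) for y in ys]
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(Y)
    sum_xy = sum(x * q for x, q in zip(xs, Y))
    sum_xx = sum(x * x for x in xs)
    # Normal equations: sum_xy = B*sum_xx + A*sum_x and sum_y = B*sum_x + n*A.
    B = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    A = (sum_y - B * sum_x) / n
    return 10 ** A, 10 ** B

xs = [2, 3, 4, 5, 6]
ys = [144, 172.8, 207.4, 248.8, 298.6]
a, b = fit_exponential_curve(xs, ys)
print(f"a = {a:.2f}, b = {b:.4f}")
```

The data happen to lie almost exactly on $y = 100\,(1.2)^x$, so the full-precision fit lands very near a = 100 and b = 1.2; the small gap from 101.3 and 1.196 comes from rounding log y to two decimals.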
2.4. Regression lines:
Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be the given set of observations on two variables X and
Y. A scatter plot of these points gives an idea of the linear relation between X and Y. If
a linear relation exists between X and Y, the line about which the points in the scatter
diagram cluster is called the regression line and the equation representing this line is
called the regression equation. There are two approaches for finding the regression line.
One is fitting a straight line of the form $y = ax + b$ to the given data by minimizing the
sum of squares of the errors in the y values. The other is fitting a straight line of the form
$x = cy + d$ to the data by minimizing the sum of squares of the errors in the x values.
If all the given $(x_i, y_i)$ values perfectly obey a linear relation, then the straight lines
fitted by the above two approaches will be the same. But in general the $(x_i, y_i)$ values
may not perfectly obey a linear relation, and hence the two approaches may give two
different straight lines for the given data. The straight line fitted to the data in the form
$y = ax + b$ by minimizing the sum of squares of the errors in the y values is known as the
regression line of y on x, and the straight line fitted to the data in the form $x = cy + d$ by
minimizing the sum of squares of the errors in the x values is known as the regression
line of x on y.
To obtain the regression line of y on x of the form $y = ax + b$ for the given data
$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, the following normal equations for fitting $y = ax + b$ are to be solved:
$$\sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i \quad (1)$$
$$\sum_{i=1}^{n} y_i = a \sum_{i=1}^{n} x_i + nb \quad (2)$$
Let us transform x and y to X and Y as $X = x - \bar{x}$ and $Y = y - \bar{y}$, where $\bar{x}$ and $\bar{y}$ are the
means of x and y respectively. Now the normal equations for fitting a straight line
connecting X and Y in the form $Y = aX + b$ are:
$$\sum_{i=1}^{n} X_i Y_i = a \sum_{i=1}^{n} X_i^2 + b \sum_{i=1}^{n} X_i \quad (3)$$
$$\sum_{i=1}^{n} Y_i = a \sum_{i=1}^{n} X_i + nb \quad (4)$$
But here,
$$\sum_{i=1}^{n} X_i = \sum_{i=1}^{n} (x_i - \bar{x}) = 0 \quad\text{and}\quad \sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} (y_i - \bar{y}) = 0.$$
Hence (3) becomes $\sum_{i=1}^{n} X_i Y_i = a \sum_{i=1}^{n} X_i^2 + b \cdot 0$, so that
$$a = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$
That is,
$$a = \frac{Cov(x, y)}{var(x)}.$$
Also (4) becomes $0 = a \cdot 0 + nb$, so that $b = 0$.
Then the straight line is
$$Y = \frac{Cov(x, y)}{var(x)} X + 0.$$
Hence the regression line of y on x is
$$(y - \bar{y}) = \frac{Cov(x, y)}{var(x)} (x - \bar{x}).$$
In a similar way, the regression line of x on y is derived as
$$(x - \bar{x}) = \frac{Cov(x, y)}{var(y)} (y - \bar{y}).$$
In the regression line of y on x, the coefficient of x, $\frac{P_{xy}}{\sigma_x^2} = \frac{Cov(x, y)}{var(x)}$, is known as the
regression coefficient of y on x, denoted by $b_{yx}$; and in the regression line of x on y, the
coefficient of y, $\frac{P_{xy}}{\sigma_y^2} = \frac{Cov(x, y)}{var(y)}$, is known as the regression coefficient of x on y, denoted by $b_{xy}$.
The regression line of y on x helps us to predict the value of y for a given value of x,
and the regression line of x on y helps us to predict the value of x for a given value of y.
Problem: Obtain the line of regression of y on x for the following data.
Age x : 66 38 56 42 72 36 63 47 55 45
BP : 145 124 147 125 160 118 149 128 150 124
Estimate the blood pressure of a man whose age is 55.
Solution:
The regression line of y on x is defined as
$$(y - \bar{y}) = \frac{P_{x,y}}{\sigma_x^2} (x - \bar{x}),$$
where $P_{x,y}$ = cov(X,Y) and $\sigma_x^2$ = V(X).
We use the given data to find the mean of x, the mean of y, cov(X,Y) and V(X).
The calculations are as follows:
x      y       x^2      xy
66     145     4356     9570
38     124     1444     4712
56     147     3136     8232
42     125     1764     5250
72     160     5184     11520
36     118     1296     4248
63     149     3969     9387
47     128     2209     6016
55     150     3025     8250
45     124     2025     5580
Sum: 520  1370  28408   72765
Mean of X = $\frac{520}{10} = 52$, Mean of Y = $\frac{1370}{10} = 137$
$$Cov(X,Y) = \frac{1}{n}\sum xy - \bar{x}\,\bar{y} = \frac{72765}{10} - 52 \times 137 = 152.5$$
$$V(X) = \frac{1}{n}\sum x^2 - \bar{x}^2 = \frac{28408}{10} - 52^2 = 136.8$$
Hence the regression line of y on x is
$$(y - 137) = \frac{152.5}{136.8}(x - 52), \quad\text{that is,}\quad y = 1.1148x + 79.03.$$
The blood pressure of a man whose age is x = 55 is obtained by substituting x = 55
in the derived regression equation of y on x. This gives the blood pressure
$$y = 1.1148 \times 55 + 79.03 = 140.34.$$
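Illustration (computational): the calculation above can be reproduced with a few lines of Python. This is a sketch only; the function name `regression_y_on_x` is an assumption made for this illustration, and it follows the formulas $b_{yx} = Cov(X,Y)/V(X)$ and $y - \bar{y} = b_{yx}(x - \bar{x})$ used in the solution.

```python
def regression_y_on_x(xs, ys):
    """Return (slope, intercept) of the regression line of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    slope = cov_xy / var_x                # regression coefficient b_yx
    intercept = mean_y - slope * mean_x   # line passes through the means
    return slope, intercept

age = [66, 38, 56, 42, 72, 36, 63, 47, 55, 45]
bp = [145, 124, 147, 125, 160, 118, 149, 128, 150, 124]
slope, intercept = regression_y_on_x(age, bp)
predicted = slope * 55 + intercept
print(f"y = {slope:.4f} x + {intercept:.2f}; estimate at x = 55: {predicted:.2f}")
```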
Problem: For 10 observations on X and Y, the following data were observed:
$$\sum x = 130,\quad \sum y = 200,\quad \sum x^2 = 2288,\quad \sum y^2 = 5506,\quad \sum xy = 3467$$
Obtain the regression line of Y on X. Find y when x = 16.
Solution:
The regression line of y on x is $(y - \bar{y}) = \frac{P_{x,y}}{\sigma_x^2}(x - \bar{x})$, where $P_{x,y}$ = cov(X,Y) and $\sigma_x^2$ = V(X).
$$Cov(X, Y) = \frac{1}{n}\sum xy - \bar{x}\,\bar{y} = \frac{1}{10}(3467) - \left(\frac{130}{10}\right)\left(\frac{200}{10}\right) = 86.7$$
$$V(X) = \frac{1}{n}\sum x^2 - \bar{x}^2 = \frac{1}{10}(2288) - \left(\frac{130}{10}\right)^2 = 59.8$$
The regression line of Y on X is
$$\left(y - \frac{200}{10}\right) = \frac{86.7}{59.8}\left(x - \frac{130}{10}\right), \quad\text{that is,}\quad y = 1.4498x + 1.1526.$$
When x = 16, we get
$$y = 1.4498 \times 16 + 1.1526 = 24.3494.$$
2.5. Pearson's Coefficient of correlation:
If there is a linear relation between the variables x and y, the degree of linear
relation is measured by the coefficient of correlation. If all the given $(x_i, y_i)$ points
almost satisfy a linear relation, then we say that there is a high degree of linear
relation between the variables. If the linear relation fitted for the variables is such that
an increase in one variable results in an increase in the other also, then
there is a direct (or positive) correlation between the variables. On the other
hand, if the linear relation fitted for the variables is such that an increase in
one variable results in a decrease in the other, then there is an inverse (or negative)
correlation between the variables. If there is no linear relation between
the variables, the correlation is zero.
The famous British statistician Karl Pearson suggested a measure of the
degree of correlation between two variables x and y, known as Pearson's coefficient of
correlation and denoted by $r_{xy}$, where
$$r_{xy} = \frac{P_{xy}}{\sigma_x \sigma_y} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}}{\sqrt{\left(\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2\right)\left(\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2\right)}}$$
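Theorem: For two variables x and y, $-1 \le r_{xy} \le +1$, where $r_{xy}$ is Pearson's coefficient of
correlation.
Proof:
Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be the observations on x and y. Consider
$\frac{(x_i - \bar{x})}{\sigma_x}$ and $\frac{(y_i - \bar{y})}{\sigma_y}$, where $\bar{x}$ and $\bar{y}$ are the means and $\sigma_x$ and $\sigma_y$ are the standard deviations of x
and y respectively.
We have
$$\left[\frac{(x_i - \bar{x})}{\sigma_x} \pm \frac{(y_i - \bar{y})}{\sigma_y}\right]^2 \ge 0,$$
because it is the square of a real number.
Adding all such terms for i = 1, 2, ..., n and dividing by n,
$$\frac{1}{n}\sum_i \left[\frac{(x_i - \bar{x})}{\sigma_x} \pm \frac{(y_i - \bar{y})}{\sigma_y}\right]^2 \ge 0.$$
On expansion,
$$\frac{1}{n}\sum_i \frac{(x_i - \bar{x})^2}{\sigma_x^2} + \frac{1}{n}\sum_i \frac{(y_i - \bar{y})^2}{\sigma_y^2} \pm 2\,\frac{1}{n}\sum_i \frac{(x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} \ge 0$$
$$\frac{1}{\sigma_x^2}\,\frac{1}{n}\sum_i (x_i - \bar{x})^2 + \frac{1}{\sigma_y^2}\,\frac{1}{n}\sum_i (y_i - \bar{y})^2 \pm 2\,\frac{1}{\sigma_x \sigma_y}\,\frac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y}) \ge 0$$
$$\frac{\sigma_x^2}{\sigma_x^2} + \frac{\sigma_y^2}{\sigma_y^2} \pm 2\,\frac{Cov(x, y)}{\sigma_x \sigma_y} \ge 0. \quad\text{That is,}\quad 1 + 1 \pm 2\,\frac{P_{xy}}{\sigma_x \sigma_y} \ge 0.$$
$$2 \pm 2 r_{xy} \ge 0. \quad\text{That is,}\quad 1 \pm r_{xy} \ge 0.$$
This gives $1 + r_{xy} \ge 0$ and $1 - r_{xy} \ge 0$.
That is, $r_{xy} \ge -1$ and $r_{xy} \le 1$. Hence
$$-1 \le r_{xy} \le +1.$$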
Remark: We have the regression coefficient of y on x, $b_{yx} = \frac{P_{xy}}{\sigma_x^2}$, and the regression
coefficient of x on y, $b_{xy} = \frac{P_{xy}}{\sigma_y^2}$. The geometric mean of these regression coefficients gives
the magnitude of the coefficient of correlation $r_{xy}$. The sign of the correlation is determined by
the sign of the covariance between x and y, $P_{xy}$. If $P_{xy}$ is positive, $r_{xy}$ is positive in sign, and if
$P_{xy}$ is negative, $r_{xy}$ is negative in sign.
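Theorem: (Invariance of correlation coefficient under linear transformation): A
transformation of the variables x and y to u and v of the form $u = \frac{x - A}{c}$ and $v = \frac{y - B}{d}$
makes no change in the coefficient of correlation between the variables. That is, $r_{xy} = r_{uv}$.
Proof:
Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ be the observations on x and y.
Then,
$$r_{xy} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
Let $u = \frac{x - A}{c}$ and $v = \frac{y - B}{d}$, where c and d are taken to be positive.
Then, Pearson's coefficient of correlation between u and v is
$$r_{uv} = \frac{\frac{1}{n}\sum_{i=1}^{n} (u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} (u_i - \bar{u})^2 \cdot \frac{1}{n}\sum_{i=1}^{n} (v_i - \bar{v})^2}}$$
$$= \frac{\frac{1}{n}\sum_{i=1}^{n} \left[\frac{x_i - A}{c} - \frac{\bar{x} - A}{c}\right]\left[\frac{y_i - B}{d} - \frac{\bar{y} - B}{d}\right]}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} \left[\frac{x_i - A}{c} - \frac{\bar{x} - A}{c}\right]^2 \cdot \frac{1}{n}\sum_{i=1}^{n} \left[\frac{y_i - B}{d} - \frac{\bar{y} - B}{d}\right]^2}}$$
$$= \frac{\frac{1}{n}\sum_{i=1}^{n} \left[\frac{x_i - \bar{x}}{c}\right]\left[\frac{y_i - \bar{y}}{d}\right]}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} \left[\frac{x_i - \bar{x}}{c}\right]^2 \cdot \frac{1}{n}\sum_{i=1}^{n} \left[\frac{y_i - \bar{y}}{d}\right]^2}}$$
$$= \frac{\frac{1}{cd}\,\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{cd}\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
$$= \frac{\frac{1}{cd} P_{xy}}{\frac{1}{cd} \sigma_x \sigma_y} = \frac{P_{xy}}{\sigma_x \sigma_y}.$$
Hence $r_{uv} = r_{xy}$.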
Problem: Find the coefficient of correlation for the following data on X and Y.
X: 65 66 67 67 68 69 70 72
Y: 67 68 65 68 72 72 69 71
Solution:
Coefficient of correlation, $r_{xy} = \frac{P_{xy}}{\sigma_x \sigma_y}$.
We have to find $\bar{x}$, $\bar{y}$, $P_{xy}$, $\sigma_x^2$ and $\sigma_y^2$, where
$$P_{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}; \quad \sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - (\bar{x})^2 \quad\text{and}\quad \sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n} y_i^2 - (\bar{y})^2.$$
The calculations are as follows:
x      y      x^2      y^2      xy
65     67     4225     4489     4355
66     68     4356     4624     4488
67     65     4489     4225     4355
67     68     4489     4624     4556
68     72     4624     5184     4896
69     72     4761     5184     4968
70     69     4900     4761     4830
72     71     5184     5041     5112
Sum: 544  552  37028   38132    37560
$$\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{8} \times 544 = 68; \quad \bar{y} = \frac{1}{n}\sum y_i = \frac{1}{8} \times 552 = 69$$
$$P_{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y} = \frac{1}{8} \times 37560 - 68 \times 69 = 3$$
$$\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - (\bar{x})^2 = \frac{1}{8} \times 37028 - (68)^2 = 4.5$$
$$\sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n} y_i^2 - (\bar{y})^2 = \frac{1}{8} \times 38132 - (69)^2 = 5.5$$
Coefficient of correlation,
$$r_{xy} = \frac{P_{xy}}{\sigma_x \sigma_y} = \frac{3}{\sqrt{4.5 \times 5.5}} = 0.603.$$
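Illustration (computational): the steps of this solution can be sketched in Python. The function name `pearson_r` is an assumption made for this illustration; it uses the same working formulas as above, namely $P_{xy} = \frac{1}{n}\sum xy - \bar{x}\bar{y}$ and $\sigma^2 = \frac{1}{n}\sum x^2 - \bar{x}^2$.

```python
import math

def pearson_r(xs, ys):
    """Pearson's coefficient of correlation from raw sums."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    p_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    return p_xy / math.sqrt(var_x * var_y)

X = [65, 66, 67, 67, 68, 69, 70, 72]
Y = [67, 68, 65, 68, 72, 72, 69, 71]
r = pearson_r(X, Y)
print(f"r = {r:.3f}")
```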
Problem: Calculate Karl Pearson's coefficient of correlation for the following data:
x: 10 12 13 16 17 20 25
y: 19 22 26 27 29 33 37
Solution:
Coefficient of correlation, $r = \frac{Cov(X, Y)}{S.D.(X) \times S.D.(Y)}$.
The problem can be solved by simply following the steps shown in the above example.
But for computational ease the problem can also be solved as in the following
illustration.
We have the result that the correlation coefficient is independent of change of origin
and scale. Hence we can calculate the correlation between X and Y after altering
X and Y by some linear transformation. Here, consider U = X - 16 and V = Y - 27.
The correlation between U and V is the same as the correlation between X and Y.
Correlation between U and V, $r = \frac{Cov(U, V)}{S.D.(U) \times S.D.(V)}$.
The calculations are:
x      y      U = X - 16    V = Y - 27    U^2     V^2     UV
10     19     -6            -8            36      64      48
12     22     -4            -5            16      25      20
13     26     -3            -1            9       1       3
16     27     0             0             0       0       0
17     29     1             2             1       4       2
20     33     4             6             16      36      24
25     37     9             10            81      100     90
Sum           1             4             159     230     187
$$Cov(U, V) = \frac{1}{n}\sum uv - \bar{u}\,\bar{v} = \frac{1}{7}(187) - \left(\frac{1}{7}\right)\left(\frac{4}{7}\right) = 26.71 - 0.082 = 26.628$$
$$V(U) = \frac{1}{n}\sum u^2 - \bar{u}^2 = \frac{1}{7}(159) - \left(\frac{1}{7}\right)^2 = 22.71 - 0.02 = 22.69$$
$$V(V) = \frac{1}{n}\sum v^2 - \bar{v}^2 = \frac{1}{7}(230) - \left(\frac{4}{7}\right)^2 = 32.86 - 0.327 = 32.533$$
Now, the correlation between U and V is
$$r = \frac{26.628}{\sqrt{22.69 \times 32.533}} = 0.98.$$
That is, the correlation coefficient of X and Y = 0.98.
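Illustration (computational): the invariance under change of origin used in this solution can be checked numerically. The sketch below, in Python, computes r once from the original data and once from the shifted values U = X - 16 and V = Y - 27; the function name `pearson_r` is an assumption made for this illustration, and it uses the deviation form of the formula.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

x = [10, 12, 13, 16, 17, 20, 25]
y = [19, 22, 26, 27, 29, 33, 37]
u = [xi - 16 for xi in x]   # change of origin for x
v = [yi - 27 for yi in y]   # change of origin for y

r_xy = pearson_r(x, y)
r_uv = pearson_r(u, v)
print(f"r_xy = {r_xy:.4f}, r_uv = {r_uv:.4f}")
```

Both values agree, as the theorem asserts, and both round to 0.98.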
2.6. Angle between the regression lines:
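The regression equations are
$$(y - \bar{y}) = \frac{P_{xy}}{\sigma_x^2}(x - \bar{x}) \quad\text{and}\quad (x - \bar{x}) = \frac{P_{xy}}{\sigma_y^2}(y - \bar{y}).$$
Since $r_{xy} = \frac{P_{xy}}{\sigma_x \sigma_y}$, the regression coefficient of y on x is $\frac{P_{xy}}{\sigma_x^2} = r_{xy}\frac{\sigma_y}{\sigma_x}$, and
the regression coefficient of x on y is $\frac{P_{xy}}{\sigma_y^2} = r_{xy}\frac{\sigma_x}{\sigma_y}$.
Hence the regression equations are
$$(y - \bar{y}) = r_{xy}\frac{\sigma_y}{\sigma_x}(x - \bar{x}) \quad (1)$$
$$(x - \bar{x}) = r_{xy}\frac{\sigma_x}{\sigma_y}(y - \bar{y}) \quad (2)$$
The regression equation of x on y can be rewritten as
$$(y - \bar{y}) = \frac{\sigma_y}{r_{xy}\sigma_x}(x - \bar{x}) \quad (3)$$
Now the regression equation of y on x [equation (1)] and that of x on y [equation (3)] can be
written in the form y = m x + c as follows:
$$y = r_{xy}\frac{\sigma_y}{\sigma_x}\,x - r_{xy}\frac{\sigma_y}{\sigma_x}\,\bar{x} + \bar{y} \quad (1)$$
$$y = \frac{\sigma_y}{r_{xy}\sigma_x}\,x - \frac{\sigma_y}{r_{xy}\sigma_x}\,\bar{x} + \bar{y} \quad (3)$$
From here, we get the slopes of these two regression lines as
$$m_1 = r_{xy}\frac{\sigma_y}{\sigma_x} \quad\text{and}\quad m_2 = \frac{\sigma_y}{r_{xy}\sigma_x}.$$
Let $\theta$ be the angle between the regression lines. Then,
$$\tan\theta = \frac{m_1 - m_2}{1 + m_1 m_2} = \frac{r_{xy}\frac{\sigma_y}{\sigma_x} - \frac{\sigma_y}{r_{xy}\sigma_x}}{1 + r_{xy}\frac{\sigma_y}{\sigma_x} \cdot \frac{\sigma_y}{r_{xy}\sigma_x}} = \frac{\left(\frac{r_{xy}^2 - 1}{r_{xy}}\right)\frac{\sigma_y}{\sigma_x}}{1 + \frac{\sigma_y^2}{\sigma_x^2}} = \frac{\left(\frac{r_{xy}^2 - 1}{r_{xy}}\right)\frac{\sigma_y}{\sigma_x}}{\frac{\sigma_x^2 + \sigma_y^2}{\sigma_x^2}}$$
$$\tan\theta = \frac{r_{xy}^2 - 1}{r_{xy}} \cdot \frac{\sigma_y}{\sigma_x} \cdot \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2}$$
$$\tan\theta = \frac{r_{xy}^2 - 1}{r_{xy}} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}$$
Remarks:
(i) For two variables x and y, if $r_{xy} = \pm 1$, we get $\tan\theta = 0$. This implies that the angle between
the regression lines is $\theta = \tan^{-1} 0 = 0$. That is, if a perfect linear relation exists
between x and y (whether it is direct or inverse), the angle between the regression lines is
zero. In other words, the two regression lines coincide.
(ii) If $r_{xy} = 0$, we get $\tan\theta = \infty$. This implies that the angle between the regression lines is
$\theta = \tan^{-1} \infty = 90^\circ$. That is, if no linear relation exists between x and y, the two
regression lines are perpendicular.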
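Illustration (computational): the formula for $\tan\theta$ and the two remarks can be checked with a small Python sketch. The function name `angle_between_regression_lines` is an assumption made for this illustration; it returns the acute angle, so it takes the absolute value of the expression for $\tan\theta$ and treats r = 0 as the limiting case of 90 degrees.

```python
import math

def angle_between_regression_lines(r, sd_x, sd_y):
    """Acute angle, in degrees, between the two regression lines."""
    if r == 0:
        # tan(theta) is infinite: the lines are perpendicular.
        return 90.0
    tan_theta = abs((r ** 2 - 1) / r) * (sd_x * sd_y) / (sd_x ** 2 + sd_y ** 2)
    return math.degrees(math.atan(tan_theta))

print(angle_between_regression_lines(1.0, 2.0, 3.0))   # perfect correlation
print(angle_between_regression_lines(0.0, 2.0, 3.0))   # no linear relation
print(angle_between_regression_lines(0.5, 1.0, 1.0))   # intermediate case
```

The standard deviations passed in the three calls are arbitrary values chosen only to exercise the formula.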
If there are two distinct regression lines, they intersect at a point. The
point of intersection of the regression lines can be obtained by solving the regression
equations for x and y. It can be done as follows:
We have the regression equation of y on x, $(y - \bar{y}) = r_{xy}\frac{\sigma_y}{\sigma_x}(x - \bar{x})$ --- (1), and the regression
equation of x on y, $(x - \bar{x}) = r_{xy}\frac{\sigma_x}{\sigma_y}(y - \bar{y})$ ---- (2).
Putting (2) in (1) gives
$$(y - \bar{y}) = r_{xy}\frac{\sigma_y}{\sigma_x} \cdot r_{xy}\frac{\sigma_x}{\sigma_y}(y - \bar{y})$$
$$(y - \bar{y}) = r_{xy}^2 (y - \bar{y})$$
$$(1 - r_{xy}^2)\,y = (1 - r_{xy}^2)\,\bar{y}, \quad\text{so that}\quad y = \bar{y}.$$
Putting $y = \bar{y}$ in (2) gives $(x - \bar{x}) = 0$, so that $x = \bar{x}$.
Hence the point of intersection of the regression lines is $(\bar{x}, \bar{y})$.
2.7. Identification of regression lines and determination of correlation coefficient
If we are given $a_1 x + b_1 y + c_1 = 0$ and $a_2 x + b_2 y + c_2 = 0$ as the two regression lines, we have
to identify which of them represents the regression line of y on x and which the regression line of x on
y. For this, we first assume that the first line $a_1 x + b_1 y + c_1 = 0$ is the regression line of y on x. Then we
express the line in terms of y as $y = -\frac{a_1}{b_1}x - \frac{c_1}{b_1}$. Then the regression coefficient of y on x is
$b_{yx} = -\frac{a_1}{b_1}$. If the first line is assumed to be the regression line of y on x, the second is the regression line
of x on y. It is written in terms of x as $x = -\frac{b_2}{a_2}y - \frac{c_2}{a_2}$. If so, the regression coefficient of x on y is
$b_{xy} = -\frac{b_2}{a_2}$.
We know that the geometric mean of the regression coefficients is the magnitude of the
coefficient of correlation $r_{xy}$, and that $-1 \le r_{xy} \le +1$.
Hence, if $b_{yx}\, b_{xy} = \frac{a_1 b_2}{b_1 a_2} \le 1$, we can confirm that our assumption regarding the
regression lines is correct. Otherwise the first line is the regression line of x on y and the
second is the regression line of y on x. Then the regression coefficients are $b_{xy} = -\frac{b_1}{a_1}$ and
$b_{yx} = -\frac{a_2}{b_2}$. Then the magnitude of the coefficient of correlation is $|r_{xy}| = \sqrt{\frac{a_2 b_1}{b_2 a_1}}$, which is the reciprocal of the value
obtained under the previous assumption.
Problem: The two regression lines are
$5x - 6y + 90 = 0$
$15x - 8y - 130 = 0$
Find (i) $\bar{x}$, $\bar{y}$ (ii) the regression coefficients of y on x and x on y (iii) the correlation coefficient.
Solution:
Solving the given two regression lines,
$5x - 6y + 90 = 0$ ----- (1) and $15x - 8y - 130 = 0$ ----- (2), we get $\bar{x}$, $\bar{y}$.
(2) - 3 × (1) gives 10y - 400 = 0, so y = 40.
Putting y = 40 in (1) gives 5x - 6 × 40 + 90 = 0, so x = 30.
Hence $(\bar{x}, \bar{y}) = (30, 40)$.
Assume the first line is the regression line of Y on X. Then the line can be expressed
as $y = \frac{5}{6}x + \frac{90}{6}$. This implies the regression coefficient of Y on X is $-\frac{a_1}{b_1} = \frac{5}{6}$. The second
line, X on Y, can be expressed as $x = \frac{8}{15}y + \frac{130}{15}$. Hence the regression coefficient of X on
Y is $-\frac{b_2}{a_2} = \frac{8}{15}$.
Then,
$$b_{yx}\, b_{xy} = \left(-\frac{a_1}{b_1}\right)\left(-\frac{b_2}{a_2}\right) = \frac{5}{6} \times \frac{8}{15} = 0.444 < 1.$$
Hence our assumption is true. That is, $5x - 6y + 90 = 0$ is the regression line of Y on X
and $15x - 8y - 130 = 0$ is the regression line of X on Y. Then the regression coefficient of Y
on X = $\frac{5}{6}$ = 0.833, the regression coefficient of X on Y = $\frac{8}{15}$ = 0.533, and the correlation
coefficient = $\sqrt{0.444}$ = 0.667 (positive, since the regression coefficients are positive).
Problem: Given that $14x + 12y - 3 = 0$ and $12x + 21y + 10 = 0$ are the regression lines for X
and Y. Identify the regression lines and find the correlation coefficient.
Solution:
Assume that $14x + 12y - 3 = 0$ is the regression line of Y on X. Then $y = -\frac{14}{12}x + \frac{3}{12}$.
This implies the regression coefficient of Y on X is $-\frac{a_1}{b_1} = -\frac{14}{12}$. The line
$12x + 21y + 10 = 0$ is then assumed to be the regression line of X on Y, so $x = -\frac{21}{12}y - \frac{10}{12}$. Then
the regression coefficient of X on Y is $-\frac{b_2}{a_2} = -\frac{21}{12}$.
Then,
$$b_{yx}\, b_{xy} = \left(-\frac{a_1}{b_1}\right)\left(-\frac{b_2}{a_2}\right) = \frac{14}{12} \times \frac{21}{12} = 2.04 > 1.$$
Hence our assumptions about the regression lines are NOT true.
Now, $12x + 21y + 10 = 0$ is the regression line of Y on X and the line
$14x + 12y - 3 = 0$ is the regression line of X on Y.
Then $y = -\frac{12}{21}x - \frac{10}{21}$, and the regression coefficient of Y on X is $-\frac{12}{21}$.
And $x = -\frac{12}{14}y + \frac{3}{14}$, so the regression coefficient of X on Y is $-\frac{12}{14}$.
Then,
$$b_{yx}\, b_{xy} = \frac{12}{21} \times \frac{12}{14} = 0.4898.$$
Since the regression coefficients are negative, the correlation coefficient is $-\sqrt{0.4898} = -0.6999$.
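Illustration (computational): the identification rule of Section 2.7 can be sketched in Python. The function names `identify_regression_lines` and `means_from_lines` are assumptions made for this illustration. Each line is given as a triple (a, b, c) standing for ax + by + c = 0, and r is taken as the geometric mean (square root of the product) of the two regression coefficients, with their common sign.

```python
import math

def identify_regression_lines(line1, line2):
    """Return (b_yx, b_xy, r, line_of_y_on_x) for two regression lines."""
    a1, b1, _ = line1
    a2, b2, _ = line2
    # Trial assumption: line1 is y on x and line2 is x on y.
    b_yx, b_xy = -a1 / b1, -b2 / a2
    y_on_x = line1
    if b_yx * b_xy > 1:
        # r^2 cannot exceed 1, so the roles must be interchanged.
        b_yx, b_xy = -a2 / b2, -b1 / a1
        y_on_x = line2
    r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
    return b_yx, b_xy, r, y_on_x

def means_from_lines(line1, line2):
    """Point of intersection of the two lines, i.e. (mean of x, mean of y)."""
    a1, b1, c1 = line1
    a2, b2, c2 = line2
    det = a1 * b2 - a2 * b1
    return (b1 * c2 - b2 * c1) / det, (a2 * c1 - a1 * c2) / det

first = identify_regression_lines((5, -6, 90), (15, -8, -130))
second = identify_regression_lines((14, 12, -3), (12, 21, 10))
x_bar, y_bar = means_from_lines((5, -6, 90), (15, -8, -130))
print(first, second, (x_bar, y_bar))
```

For the second pair of lines the trial assumption fails (the product exceeds 1), so the function reports the line 12x + 21y + 10 = 0 as the regression line of y on x.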
Problem: The regression lines are $y = ax + b$ and $x = cy + d$. If the two variables have the
same mean, show that $d(1 - a) = b(1 - c)$.
Solution:
The means of x and y are obtained by solving the regression lines for x and y.
Here the first line is $y = ax + b$ --(1) and the second is $x = cy + d$ --(2), that is,
$y = \frac{1}{c}x - \frac{d}{c}$ --(3)
(3) and (1) give $ax + b = \frac{1}{c}x - \frac{d}{c}$, so that
$$x = \left(b + \frac{d}{c}\right) \Big/ \left(\frac{1}{c} - a\right) = \frac{bc + d}{1 - ac}.$$
Then (1) gives
$$y = a\left(\frac{bc + d}{1 - ac}\right) + b = \frac{ad + b}{1 - ac}.$$
This implies
$$\bar{x} = \frac{bc + d}{1 - ac} \quad\text{and}\quad \bar{y} = \frac{ad + b}{1 - ac}.$$
If the means of the variables are equal, we can write
$$\frac{bc + d}{1 - ac} = \frac{ad + b}{1 - ac}.$$
This gives $bc + d = ad + b$, that is, $d - ad = b - bc$, so that
$$d(1 - a) = b(1 - c).$$
Problem: If the variables x and y satisfy the relation $ax + by + c = 0$, show that the
correlation between x and y is -1 or +1, according as a and b are of the same sign or not.
Solution:
Since the variables satisfy the relation $ax + by + c = 0$, we can write this
relation in the form of a line of y on x as $y = -\frac{a}{b}x - \frac{c}{b}$, and in the form of a line of x on y as
$x = -\frac{b}{a}y - \frac{c}{a}$. Then the regression coefficients of y on x and x on y are identified as $-\frac{a}{b}$ and $-\frac{b}{a}$
respectively. Then the magnitude of the coefficient of correlation is obtained as the
geometric mean of the regression coefficients, $\sqrt{\left(-\frac{a}{b}\right)\left(-\frac{b}{a}\right)} = 1$. Then the correlation
coefficient is +1 or -1 according as the regression coefficients are positive or negative.
The regression coefficients $-\frac{a}{b}$ and $-\frac{b}{a}$ are positive when a and b have
different signs, and they are negative when a and b have the same sign. Hence,
the coefficient of correlation is -1 or +1, according as a and b are of the same sign or not.
2.8. Rank correlation coefficient
When we are considering two characteristics which are qualitative in nature, it is
not possible to measure them numerically. For example, consider the
ability in drawing (let it be X) and the ability in music (let it be Y). It is not possible to
measure numerically the values of X and Y for an individual. But if there are n
individuals, it is possible to rank these n individuals according to their ability in drawing
(X) and according to their ability in music (Y). If these two characteristics have high
positive correlation, then the ranks obtained by the individuals on the basis of X and Y will be in
the same order. If these two characteristics have high negative correlation, then the ranks
obtained by the individuals on the basis of X and Y will be in reverse order. Using the ranks
obtained by the n individuals on the characteristics X and Y, a method of finding
the coefficient of correlation was derived by C. Spearman in 1904. The coefficient of
correlation for two characteristics calculated from the ranks is known as
Spearman's Rank Correlation Coefficient.
Let there be n individuals ranked according to the two qualitative characteristics
considered. Let $(x_i, y_i)$ denote the ranks of the $i^{th}$ individual when ranked according to the
two characteristics. So the $x_i$, $y_i$ values are the numbers from 1 to n.
Since the $x_i$ values are the numbers from 1 to n, the mean of the x values is
$$\bar{x} = \frac{\text{sum of first } n \text{ natural numbers}}{n} = \frac{1}{n} \cdot \frac{n(n + 1)}{2} = \frac{(n + 1)}{2}$$
Similarly,
$$\bar{y} = \frac{\text{sum of first } n \text{ natural numbers}}{n} = \frac{1}{n} \cdot \frac{n(n + 1)}{2} = \frac{(n + 1)}{2}$$
Variance of the $x_i$ values,
$$\sigma_x^2 = \frac{\text{sum of squares of first } n \text{ natural numbers}}{n} - \left[\frac{(n + 1)}{2}\right]^2 = \frac{1}{n} \cdot \frac{n(n + 1)(2n + 1)}{6} - \left[\frac{(n + 1)}{2}\right]^2 = \frac{n^2 - 1}{12};$$
Similarly, $\sigma_y^2 = \frac{n^2 - 1}{12}$.
Let $d_i = (x_i - y_i)$. This gives $\bar{d} = \bar{x} - \bar{y} = 0$.
Variance of the d values,
$$\sigma_d^2 = \frac{1}{n}\sum_{i=1}^{n} (d_i - \bar{d})^2 = \frac{1}{n}\sum_{i=1}^{n} [(x_i - y_i) - 0]^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - y_i)^2 = \frac{1}{n}\sum_{i=1}^{n} d_i^2$$
Since $\bar{x} = \bar{y}$, we can rewrite $\frac{1}{n}\sum_{i=1}^{n} d_i^2$ as
$$\frac{1}{n}\sum_{i=1}^{n} d_i^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x} - y_i + \bar{y})^2 = \frac{1}{n}\sum_{i=1}^{n} [(x_i - \bar{x}) - (y_i - \bar{y})]^2$$
$$\frac{1}{n}\sum_{i=1}^{n} d_i^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 + \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2 - 2\,\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
$$\frac{1}{n}\sum_{i=1}^{n} d_i^2 = \sigma_x^2 + \sigma_y^2 - 2\,cov(x, y)$$
But we have $cov(x, y) = r\,\sigma_x \sigma_y$, where r is the coefficient of correlation. Hence,
$$\frac{1}{n}\sum_{i=1}^{n} d_i^2 = \sigma_x^2 + \sigma_y^2 - 2 r\,\sigma_x \sigma_y$$
Since $\sigma_x^2 = \sigma_y^2 = \frac{n^2 - 1}{12}$, we get
$$\frac{1}{n}\sum_{i=1}^{n} d_i^2 = \frac{n^2 - 1}{12} + \frac{n^2 - 1}{12} - 2 r\,\frac{n^2 - 1}{12} = 2\,\frac{n^2 - 1}{12} - 2 r\,\frac{n^2 - 1}{12}$$
$$\frac{1}{n}\sum_{i=1}^{n} d_i^2 = \frac{n^2 - 1}{6}(1 - r)$$
$$1 - r = \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}, \quad\text{or}$$
$$\text{the coefficient of correlation } r = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}.$$
Problem: The following are the ranks obtained by 10 students in Statistics and Mathematics
Statistics: 1 2 3 4 5 6 7 8 9 10
Mathematics: 1 4 2 5 3 9 7 10 6 8
To what extent is the knowledge of students in the two subjects related?
Solution:
Here we have to find the rank correlation coefficient of the ranks in Statistics and
Mathematics. The rank correlation coefficient is defined as
$$r = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)},$$
where $d_i$ is the difference in ranks.
The calculations are:
Rank in Stat. x_i    Rank in Maths y_i    d_i = x_i - y_i    d_i^2
1                    1                    0                  0
2                    4                    -2                 4
3                    2                    1                  1
4                    5                    -1                 1
5                    3                    2                  4
6                    9                    -3                 9
7                    7                    0                  0
8                    10                   -2                 4
9                    6                    3                  9
10                   8                    2                  4
Sum                                                          36
Hence,
$$r = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 36}{10(10^2 - 1)} = 1 - 0.2182 = 0.7818$$
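Illustration (computational): the rank correlation for these ten students can be computed with a short Python sketch. The function name `spearman_from_ranks` is an assumption made for this illustration; it applies the formula $r = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$ directly and therefore assumes that no ranks are tied.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rank correlation for untied ranks."""
    n = len(rank_x)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

statistics_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
maths_rank = [1, 4, 2, 5, 3, 9, 7, 10, 6, 8]
r = spearman_from_ranks(statistics_rank, maths_rank)
print(f"r = {r:.4f}")
```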

Problem: 10 competitors in a music test were ranked by three judges A, B, and C in the following
order.
Ranks by A: 1 6 5 10 3 2 4 9 7 8
Ranks by B: 3 5 8 4 7 10 2 1 6 9
Ranks by C: 6 4 9 8 1 2 3 10 5 7
Discuss which pair of judges has the nearest approach to common likings in music.
Solution:
Here we have to find the rank correlation coefficient between each pair of judges,
considering the ranks they have given. The pair of judges with the highest correlation
coefficient is considered to have the nearest approach to common likings in music.
The calculations follow ($x_i$, $y_i$ and $z_i$ are the ranks given by A, B and C respectively):
x_i    y_i    z_i    x_i - y_i    x_i - z_i    y_i - z_i    (x_i - y_i)^2    (x_i - z_i)^2    (y_i - z_i)^2
1      3      6      -2           -5           -3           4                25               9
6      5      4      1            2            1            1                4                1
5      8      9      -3           -4           -1           9                16               1
10     4      8      6            2            -4           36               4                16
3      7      1      -4           2            6            16               4                36
2      10     2      -8           0            8            64               0                64
4      2      3      2            1            -1           4                1                1
9      1      10     8            -1           -9           64               1                81
7      6      5      1            2            1            1                4                1
8      9      7      -1           1            2            1                1                4
Sum                                                         200              60               214
Rank correlation between A and B,
$$r = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 200}{10(10^2 - 1)} = -0.212$$
Rank correlation between A and C,
$$r = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 60}{10(10^2 - 1)} = 0.6364$$
Rank correlation between B and C,
$$r = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 214}{10(10^2 - 1)} = -0.297$$
It can be observed that the judges A and C have the nearest approach to
common likings in music.
Problem: Find the rank correlation coefficient for the following data:
X: 92 89 87 86 84 77 71 63 53 50
Y: 86 83 91 77 68 85 52 82 37 57
Solution:
First, the given values of X and Y should be ranked. If an observation repeats, then
the sum of the ranks is equally divided among the repeated observations. (For example, when we are
ranking the observations in order, suppose a number, say a, comes in the 6th and 7th
positions; then the first and second a values are each assigned the rank 6.5.)
Here the observations are ranked in descending order. Then we find the rank
correlation coefficient.
x      y      Rank of X, x_i    Rank of Y, y_i    x_i - y_i    (x_i - y_i)^2
92     86     1                 2                 -1           1
89     83     2                 4                 -2           4
87     91     3                 1                 2            4
86     77     4                 6                 -2           4
84     68     5                 7                 -2           4
77     85     6                 3                 3            9
71     52     7                 9                 -2           4
63     82     8                 5                 3            9
53     37     9                 10                -1           1
50     57     10                8                 2            4
Sum                                                            44
Rank correlation coefficient,
$$r = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 44}{10(10^2 - 1)} = 0.733$$
Rank correlation coefficient when equal ranks (Tied ranks):
It may be noted that Spearman's rank correlation formula is derived on the
assumption that all the ranks are different. But in practice there are many situations
where more than one individual gets the same rank. Suppose that in a competition
three individuals received the 3rd rank. They would have been given the 3rd, 4th and 5th ranks if
there were slight differences in the evaluation. So we add 3, 4 and 5, which gives 12. Then
12 is equally divided among these three individuals. Hence we assign the rank 4 to each of
these three individuals. In such situations it is more accurate to calculate Pearson's
coefficient of correlation between the ranks directly, after assigning the average rank to
those with the same rank. But there is also a modified formula for Spearman's rank
correlation coefficient, which is as follows:
$$r = 1 - \frac{6\left[\sum_{i=1}^{n} d_i^2 + \frac{1}{12}\sum_i m_i(m_i^2 - 1) + \frac{1}{12}\sum_j m_j(m_j^2 - 1)\right]}{n(n^2 - 1)},$$
where $m_i$ stands for the number of times the $i^{th}$ rank repeats in the x series of ranks and
$m_j$ is the number of times the $j^{th}$ rank repeats in the y series of ranks when the average
ranks are assigned. The method is illustrated below:
Obtain the rank correlation coefficient for the following data:
X: 15 20 28 12 40 60 20 80
Y: 40 30 50 30 20 10 30 60
Illustration:
At first we assign ranks to the X and Y values. Here we have 8 pairs of observations. That is,
n = 8.
The ranks are:
X: 7 5.5 4 8 3 2 5.5 1
Y: 3 5 2 5 7 8 5 1
Here in the X values, 20 repeats twice, with the possible ranks 5 and 6. Hence their
average 5.5 is assigned to the value 20. Similarly in the Y values, 30 repeats thrice, with
possible ranks 4, 5 and 6. Hence their average 5 is assigned as the rank of the value 30.
Now the differences in ranks, $d_i = X_i - Y_i$, are:
$d_i$ : 4 0.5 2 3 -4 -6 0.5 0
$d_i^2$ : 16 0.25 4 9 16 36 0.25 0
This gives $\sum_i d_i^2 = 81.50$.
$m_i = 2$ (because in the X values, only the value 20 repeats, twice) and $m_j = 3$ (because in the Y
values, only the value 30 repeats, thrice).
Hence,
$$r = 1 - \frac{6\left[\sum_{i=1}^{n} d_i^2 + \frac{1}{12}\sum_i m_i(m_i^2 - 1) + \frac{1}{12}\sum_j m_j(m_j^2 - 1)\right]}{n(n^2 - 1)}$$
$$= 1 - \frac{6\left[81.50 + \frac{1}{12} \times 2(2^2 - 1) + \frac{1}{12} \times 3(3^2 - 1)\right]}{8(8^2 - 1)}$$
$$= 1 - \frac{6\,[81.50 + 0.5 + 2]}{8 \times 63} = 0.$$
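Illustration (computational): the tied-rank procedure can be sketched in Python. The function names `average_ranks`, `tie_correction` and `spearman_tied` are assumptions made for this illustration. Values are ranked in descending order, tied values share the average of the ranks they would occupy, and the correction term $\frac{1}{12}m(m^2 - 1)$ is added once for each group of m tied values.

```python
from collections import Counter

def average_ranks(values):
    """Descending ranks; tied values receive the mean of their ranks."""
    counts = Counter(values)
    return [sum(1 for w in values if w > v) + (counts[v] + 1) / 2
            for v in values]

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every group of m tied values."""
    return sum(m * (m * m - 1) / 12
               for m in Counter(values).values() if m > 1)

def spearman_tied(xs, ys):
    """Spearman's rank correlation with the correction for tied ranks."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_x, rank_y))
    adjusted = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n * (n * n - 1))

X = [15, 20, 28, 12, 40, 60, 20, 80]
Y = [40, 30, 50, 30, 20, 10, 30, 60]
print(average_ranks(X))
print(average_ranks(Y))
print(spearman_tied(X, Y))
```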
2.9. Partial and Multiple Correlations:
In a statistical study in which many variables are included, whenever we are
interested in studying the joint effect of a group of variables upon a variable not included
in that group, our study is on multiple correlation and multiple regression.
For example, in a study of the yield of a crop per acre (let it be $X_1$), the value of the
variable $X_1$ is a joint effect of the variables quality of seed ($X_2$), fertility of soil ($X_3$),
fertilizer used ($X_4$), irrigation facilities ($X_5$), weather conditions ($X_6$) and so on.
If we wish to consider the relation between two variables only, there are two
alternatives:
(i) We consider only those members of the observed data in which the
other variables have specified values. Or,
(ii) We may eliminate mathematically the effect of the other variables on the two
variables under consideration.
[The first method has the disadvantage that it limits the size of the data, and also that it is
applicable only to data in which the other variables have assigned values.]
In the second method it may not be possible to eliminate the entire influence of the other variables, but
the linear effect can easily be eliminated. The correlation and regression between only two
variables, eliminating the linear effects of the other variables considered, are called partial
correlation and partial regression.
Let us limit our discussion to three variables $X_1$, $X_2$ and $X_3$.
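The equation of the plane of regression of $X_1$ on $X_2$ and $X_3$ is
$$X_1 = a + b_{12.3} X_2 + b_{13.2} X_3 \quad (1)$$
Let the observations on $X_1$, $X_2$ and $X_3$ be measured from their respective means, i.e.,
$X_1 = (x_{1i} - \bar{x}_1)$, $X_2 = (x_{2i} - \bar{x}_2)$ and $X_3 = (x_{3i} - \bar{x}_3)$.
Then $\sum (x_{1i} - \bar{x}_1) = \sum (x_{2i} - \bar{x}_2) = \sum (x_{3i} - \bar{x}_3) = 0$. That is, $\sum X_1 = \sum X_2 = \sum X_3 = 0$.
Taking summation on (1), we get a = 0.
Then (1) implies
$$X_1 = b_{12.3} X_2 + b_{13.2} X_3 \quad (2)$$
The coefficients $b_{12.3}$ and $b_{13.2}$ are the partial regression coefficients of $X_1$ on $X_2$ and
of $X_1$ on $X_3$ respectively.
$e_{1.23} = b_{12.3} X_2 + b_{13.2} X_3$ is called the estimate of $X_1$ as given by the equation of the plane
of regression (2).
The quantity $X_{1.23} = X_1 - [b_{12.3} X_2 + b_{13.2} X_3]$ is called the error of estimate or residual.
In the subscript of the residual $X_{1.23}$, the subscript before the dot, i.e., 1, is known as the
primary subscript and the subscripts after the dot, i.e., 2 and 3, are called the secondary
subscripts.
The order of a regression coefficient is determined by the number of secondary
subscripts. For example, $b_{12.3}$ is a regression coefficient of order 1. In $b_{12.3}$, $X_2$ is independent
and $X_1$ is dependent. In $b_{21.3}$, $X_1$ is independent and $X_2$ is dependent.
In the equation of the plane of regression given in (2), the constants b are
determined by the principle of least squares.
Sum of squares of residuals,
$$S = \sum X_{1.23}^2 = \sum \left(X_1 - [b_{12.3} X_2 + b_{13.2} X_3]\right)^2$$
$$\frac{\partial S}{\partial b_{12.3}} = 0 \Rightarrow -2\sum X_2 \left(X_1 - [b_{12.3} X_2 + b_{13.2} X_3]\right) = 0$$
$$\frac{\partial S}{\partial b_{13.2}} = 0 \Rightarrow -2\sum X_3 \left(X_1 - [b_{12.3} X_2 + b_{13.2} X_3]\right) = 0$$
That is, $\sum X_2 X_{1.23} = 0$ and $\sum X_3 X_{1.23} = 0$, or
$$\sum X_1 X_2 - b_{12.3}\sum X_2^2 - b_{13.2}\sum X_2 X_3 = 0$$
$$\sum X_1 X_3 - b_{12.3}\sum X_2 X_3 - b_{13.2}\sum X_3^2 = 0 \quad (3)$$
Since the $X_i$'s are measured from their respective means, we have
$$\sigma_i^2 = \frac{1}{N}\sum X_i^2, \quad cov(X_i, X_j) = \frac{1}{N}\sum X_i X_j \quad\text{and}\quad r_{ij} = \frac{cov(X_i, X_j)}{\sigma_i \sigma_j} = \frac{\sum X_i X_j}{N \sigma_i \sigma_j}.$$
Hence the equations given in (3) give
$$r_{12}\sigma_1\sigma_2 - b_{12.3}\sigma_2^2 - b_{13.2}\, r_{23}\sigma_2\sigma_3 = 0$$
$$r_{13}\sigma_1\sigma_3 - b_{12.3}\, r_{23}\sigma_2\sigma_3 - b_{13.2}\sigma_3^2 = 0 \quad (4)$$
That is,
$$r_{12}\sigma_1 = b_{12.3}\sigma_2 + b_{13.2}\, r_{23}\sigma_3$$
$$r_{13}\sigma_1 = b_{12.3}\, r_{23}\sigma_2 + b_{13.2}\sigma_3$$
Solving these equations, we get
$$b_{12.3} = \frac{\begin{vmatrix} r_{12}\sigma_1 & r_{23}\sigma_3 \\ r_{13}\sigma_1 & \sigma_3 \end{vmatrix}}{\begin{vmatrix} \sigma_2 & r_{23}\sigma_3 \\ r_{23}\sigma_2 & \sigma_3 \end{vmatrix}} = \frac{\sigma_1}{\sigma_2} \cdot \frac{\begin{vmatrix} r_{12} & r_{23} \\ r_{13} & 1 \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}}$$
and,
$$b_{13.2} = \frac{\begin{vmatrix} \sigma_2 & r_{12}\sigma_1 \\ r_{23}\sigma_2 & r_{13}\sigma_1 \end{vmatrix}}{\begin{vmatrix} \sigma_2 & r_{23}\sigma_3 \\ r_{23}\sigma_2 & \sigma_3 \end{vmatrix}} = \frac{\sigma_1}{\sigma_3} \cdot \frac{\begin{vmatrix} 1 & r_{12} \\ r_{23} & r_{13} \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}}$$
If we write
$$\omega = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix},$$
and $\omega_{ij}$ is the cofactor of the $(i, j)^{th}$ element of $\omega$, then
$$b_{12.3} = -\frac{\sigma_1}{\sigma_2} \cdot \frac{\omega_{12}}{\omega_{11}} \quad\text{and}\quad b_{13.2} = -\frac{\sigma_1}{\sigma_3} \cdot \frac{\omega_{13}}{\omega_{11}}.$$
Now we get
$$X_1 = -\frac{\sigma_1}{\sigma_2} \cdot \frac{\omega_{12}}{\omega_{11}} X_2 - \frac{\sigma_1}{\sigma_3} \cdot \frac{\omega_{13}}{\omega_{11}} X_3$$
$$\frac{X_1}{\sigma_1}\omega_{11} + \frac{X_2}{\sigma_2}\omega_{12} + \frac{X_3}{\sigma_3}\omega_{13} = 0.$$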
2.10. Properties of residuals
(i) The sum of the products of any residual of order zero with any other residual of higher order is zero, provided the subscript of the former occurs among the secondary subscripts of the latter.
(ii) $\sum X_{1.2}X_{1.23} = \sum X_1X_{1.23} = \sum X_{1.23}^2$
(iii) The sum of the products of two residuals is zero if all the subscripts (primary as well as secondary) of the one occur among the secondary subscripts of the other. For example, $\sum X_{1.2}X_{3.12} = 0$ and $\sum X_{2.3}X_{1.23} = 0$.
2.11. Coefficient of multiple correlations


Consider the variables $X_1$, $X_2$ and $X_3$, each with N observations. The multiple correlation coefficient of $X_1$ on $X_2$ and $X_3$, usually denoted by $R_{1.23}$, is the simple correlation coefficient between $X_1$ and the joint effect of $X_2$ and $X_3$ on $X_1$. In other words, $R_{1.23}$ is the correlation coefficient between $X_1$ and its estimated value $e_{1.23}$ as given by the plane of regression of $X_1$ on $X_2$ and $X_3$.
That is,
$$R_{1.23} = \frac{\operatorname{cov}(X_1, e_{1.23})}{\sqrt{V(X_1)\,V(e_{1.23})}},$$
from which it is derived that
$$R_{1.23}^2 = \frac{r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23}}{1 - r_{23}^2}.$$
The multiple correlation coefficient measures the closeness of the association between the observed values and the expected values of a variable obtained from the multiple linear regression of that variable on the other variables. It can be proved that $0 \le R_{1.23} \le 1$.
If $R_{1.23} = 1$, the association is perfect and all the predicted values of $X_1$ coincide with the observed values of $X_1$.
If $R_{1.23} = 0$, then $X_1$ is completely uncorrelated with the predicted values of $X_1$. That is, the regression equation fails to throw any light on the value of $X_1$ when $X_2$ and $X_3$ are known.
2.12. Coefficient of partial correlation
The correlation coefficient between $X_1$ and $X_2$ after the linear effect of $X_3$ on each of them has been eliminated is called the partial correlation coefficient of $X_1$ and $X_2$.
The residual $X_{1.3} = X_1 - b_{13}X_3$ may be regarded as the part of the variable $X_1$ which remains after the linear effect of $X_3$ has been eliminated.
Similarly, $X_{2.3} = X_2 - b_{23}X_3$ is the part of $X_2$ obtained after eliminating the linear effect of $X_3$.
The partial correlation between $X_1$ and $X_2$, denoted by $r_{12.3}$, is given by,
$$r_{12.3} = \frac{\operatorname{cov}(X_{1.3}, X_{2.3})}{\sqrt{V(X_{1.3})\,V(X_{2.3})}}.$$
This is derived as,
$$r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{\left(1 - r_{13}^2\right)\left(1 - r_{23}^2\right)}}.$$
In a similar way the expressions for $r_{13.2}$ and $r_{23.1}$ can be obtained.
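Problem: For the variables $X_1$, $X_2$ and $X_3$, it is given that $\sigma_1 = 2$, $\sigma_2 = \sigma_3 = 3$, $r_{12} = 0.7$, $r_{23} = r_{31} = 0.5$. Find (i) $r_{23.1}$ (ii) $R_{1.23}$ and (iii) $b_{13.2}$.
Solution:
(i) We have,
$$r_{23.1} = \frac{r_{23} - r_{21}r_{31}}{\sqrt{\left(1 - r_{21}^2\right)\left(1 - r_{31}^2\right)}}$$
Hence,
$$r_{23.1} = \frac{0.5 - 0.7 \times 0.5}{\sqrt{\left(1 - 0.7^2\right)\left(1 - 0.5^2\right)}} = 0.2425.$$
(ii)
$$R_{1.23}^2 = \frac{r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23}}{1 - r_{23}^2}$$
Hence,
$$R_{1.23}^2 = \frac{0.7^2 + 0.5^2 - 2 \times 0.7 \times 0.5 \times 0.5}{1 - 0.5^2} = 0.52, \qquad R_{1.23} = 0.721.$$
(iii)
$$b_{13.2} = \frac{\sigma_1}{\sigma_3}\cdot\frac{\begin{vmatrix} 1 & r_{12} \\ r_{23} & r_{13} \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{23} & 1 \end{vmatrix}}$$
Hence,
$$b_{13.2} = \frac{2}{3}\cdot\frac{\begin{vmatrix} 1 & 0.7 \\ 0.5 & 0.5 \end{vmatrix}}{\begin{vmatrix} 1 & 0.5 \\ 0.5 & 1 \end{vmatrix}} = \frac{2}{3}\cdot\frac{0.5 - 0.35}{1 - 0.25} = 0.1333.$$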
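The three formulas used in this problem can be checked numerically. The following is a minimal Python sketch, not part of the original material; the function names `partial_corr`, `multiple_corr` and `regression_coeff_13_2` are illustrative assumptions.

```python
import math

def partial_corr(r_ab, r_ac, r_bc):
    """Partial correlation of a and b after removing the linear effect of c."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))

def multiple_corr(r12, r13, r23):
    """Multiple correlation coefficient R_{1.23} of X1 on X2 and X3."""
    r_squared = (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)
    return math.sqrt(r_squared)

def regression_coeff_13_2(r12, r13, r23, sigma1, sigma3):
    """Partial regression coefficient b_{13.2}."""
    return (sigma1 / sigma3) * (r13 - r12 * r23) / (1 - r23**2)

# Data of the worked problem (hypothetical variable names).
r12, r23, r13 = 0.7, 0.5, 0.5
sigma1, sigma3 = 2.0, 3.0

# For r_{23.1}: a = 2, b = 3, c = 1, so pass r23, r21 (= r12), r31 (= r13).
r23_1 = partial_corr(r23, r12, r13)
R1_23 = multiple_corr(r12, r13, r23)
b13_2 = regression_coeff_13_2(r12, r13, r23, sigma1, sigma3)

print(round(r23_1, 4), round(R1_23, 4), round(b13_2, 4))
```

The same `partial_corr` function gives $r_{12.3}$ or $r_{13.2}$ when the arguments are permuted accordingly.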
2.13. Testing the significance of observed simple correlation coefficient:
Let r be the correlation coefficient calculated from the given data set. To verify whether the variables are actually correlated in the population, we conduct a statistical test with null hypothesis $H_0 : \rho = 0$, where $\rho$ denotes the population correlation coefficient.
The test statistic considered is,
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}},$$
which follows Student's t distribution with n-2 degrees of freedom under $H_0$. Reject the hypothesis at significance level $\alpha$ if $|t| > t_{\alpha/2}$, where $t_{\alpha/2}$ is obtained from the table of the t distribution such that $P\left(|t| > t_{\alpha/2} \mid H_0\right) = \alpha$.
Problem: A sample of 27 pairs of observations from a normal population gives r=0.6. Is it likely that the variables are correlated at 5% significance level?
Solution:
Given r=0.6, n=27. To test $H_0 : \rho = 0$, the test statistic is
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}.$$
Here,
$$t = \frac{0.6\sqrt{27 - 2}}{\sqrt{1 - 0.6^2}} = 3.75.$$
The table of the t distribution for n-2=25 d.f. gives $t_{\alpha/2} = 2.06$ for $\alpha = 0.05$.
The calculated value of t is greater than the table value of t. Hence $H_0 : \rho = 0$ is rejected.
That is, the variables are correlated.
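The test can be reproduced in a few lines. This is a minimal Python sketch under the assumption that the critical value is read from a t table, as in the solution above; the function name `corr_t_statistic` is hypothetical.

```python
import math

def corr_t_statistic(r, n):
    """t statistic for testing H0: rho = 0 from a sample correlation r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

t_value = corr_t_statistic(0.6, 27)

# Two-sided 5% critical value for 25 d.f., taken from the t table as in the text.
critical_value = 2.06
reject_h0 = abs(t_value) > critical_value

print(round(t_value, 2), reject_h0)
```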
EXERCISES
1. What is a scatter diagram?
2. Explain the principle of least squares in curve fitting.
3. Explain the method of fitting a parabola to a given set of observations using the principle of least squares.
4. What are regression coefficients? How are they related to the correlation coefficient?
5. Given two regression lines 4y=9x+15 and 6y = 25x-7, identify the regression lines and obtain the coefficient of correlation between x and y.
6. Derive the regression lines for the variables x and y based on the observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
7. Derive the expression for the angle between two regression lines.
8. Calculate Karl Pearson's coefficient of correlation
X: 14 16 17 18 19 20 21 22 23
Y: 84 78 70 75 66 67 62 58 60
9. Prove that the coefficient of correlation always lies in the interval [-1,+1].
10. Show that the correlation coefficient is independent of change of origin and scale.
11. (i) Derive Spearman's rank correlation coefficient. (ii) Obtain the rank correlation coefficient for the following data:
X: 50 60 50 60 80 50 80 40 70
Y: 30 60 40 50 60 30 70 50 60
12. Obtain the rank correlation coefficient for the following data:
X: 50 60 50 60 80 50 80 40 70
Y: 30 60 40 50 60 30 70 50 60
13. Write a short note on partial and multiple correlations.
14. Write an expression for the multiple correlation coefficient $R_{1.23}$.
*******************
CHAPTER 3
TIME SERIES
3.1. Time series
A time series is a set of chronologically ordered values of a variable. Time series data are of two kinds: period data and point data. Period data refer to the value of a variable over a particular period of time. For example, the wheat production in India in the year 2009-2010 was 110 million tonnes. Point data give the value of a variable at a point of time. For example, the available stock of a product in a company was 8 tonnes at 4 PM on 31-3-2010.
Analysis of time series is a statistical device which can be used to understand, interpret and evaluate changes in a phenomenon over time. Time series analysis is of interest in several areas, such as economics, health, biology, and so on.
Time series analysis helps to:
(i) Study the past behavior of the phenomenon under consideration.
(ii) Compare actual performance with expected performance.
(iii) Record facts systematically, since this is done automatically in the process of time series data collection.
(iv) Forecast the nature of the phenomenon by analyzing the past time series data.
(v) Facilitate preparation, planning etc.
3.2. Components of Time series
Components of a time series are the various elements which can be extracted from
the observed data. These elements give some short term or long term properties of the
data. The elements are classified as (i) secular variation (Trend) (ii) Seasonal variation
(iii) Cyclical variation and (iv) Irregular variation. Among these the first two are long
term properties and the next two are short term properties.
(i) Secular variation (Trend): The general tendency of the time series data to increase, decrease or stagnate over a long period of time is called the secular trend, or simply the trend. Most business and economic series are influenced by some secular forces of change in which the underlying tendency is one of growth or decline. For example, series of data on population, national income etc. show an increasing trend, while series of data relating to the number of deaths due to epidemics, tuberculosis etc. show a declining trend.
(ii) Seasonal variation: Seasonal variation refers to rhythmic forces of change inherent in most time series, showing a regular or periodic pattern of movement over a span of less than a year that has the same or almost the same pattern year after year. Seasonal variations are usually measured over an interval within the calendar year.
Seasonal variation may be due to (i) natural forces and (ii) social customs and traditions.
(iii) Cyclical variation: Cyclical variation refers to the recurrent variations in a time series that extend over a longer period of time, usually two or more years. Most time series relating to economic and business activities show some kind of cyclic variation. The phases of such a cycle are prosperity, decline, depression and recovery.
(iv) Irregular variation: Irregular variations are those caused by unusual, unexpected events. These variations are not regular and do not repeat in a definite pattern. Events such as war, flood and strikes lead to irregular variations. Unlike the other components, such variations cannot be predicted.
3.3. Mathematical model of time series
One of the objectives of time series analysis is to give a general description of the behavior of the series. To achieve this objective, it is necessary to break down the series into its main components and to estimate the magnitude of each of these components separately. The method of analysis depends, to a large extent, on the hypothesis as to how the components interact. The simplest hypothesis is to assume that the effects of the distinct components are independent and additive. This leads us to the additive model of the time series,
$$Y_t = T_t + S_t + C_t + I_t,$$
where $Y_t$ is the value of the variable at time t, $T_t$ is the trend value, $S_t$ is the seasonal variation, $C_t$ is the cyclic variation and $I_t$ is the irregular variation. The component $S_t$ will be positive or negative according to the season of the year, and so will $C_t$. $I_t$ will also take positive or negative values, and for a long series the total $\sum I_t$ is assumed to be zero.
An alternative model, which is considered to be more useful, is the multiplicative model,
$$Y_t = T_t \times S_t \times C_t \times I_t.$$
3.4. Methods of measuring secular trend
The main objectives of measuring trend are (i) to describe the general underlying movement of the series and (ii) to eliminate the trend in order to bring into focus the other movements in the series. Some methods for estimating the trend are: (i) Free hand curve method (ii) Method of semi-averages (iii) Method of moving averages (iv) Method of least squares.
Free hand curve method:
This is a graphic method for measuring trend.
Procedure:
Step1: Draw the horizontal and vertical axes. Take time on the horizontal axis and the values on the vertical axis.
Step2: Plot the original data on the graph.
Step3: Join the points by a smooth curve.
Step4: Draw a straight line carefully through the middle of the curve.
Step5: The direction of the straight line shows the direction of the trend.
Problem: Fit a trend line to the following data using free hand curve method
Year: 1991 1992 1993 1994 1995 1996
Profit: 40 42 40 48 52 49
Solution:
Performing the steps given, the following graph showing trend line is obtained.
Merits of free hand method:
- It is the simplest and easiest method.
- It helps to understand the character of the time series and to know the trend.
Demerits of free hand method:
- It is highly subjective.
- It does not enable us to predict future values accurately.
Method of semi-average:
Under this method the original data are divided into two equal parts. When the number of years is even, they can be divided into two equal groups directly; if the number of years is odd, they are divided into two groups by omitting the middle year. The average of each group is found and plotted on the graph against the middle year of that group. Joining the two plotted points gives the trend line, which shows the semi-average trend. One can extend the line upward or downward to predict future or prior values.
Problem: Fit a trend line to the following data using semi-average method
Year: 1991 1992 1993 1994 1995 1996
Profit: 40 42 40 48 52 49
Solution:
First divide the given data into two equal sets. Then assign the average of the values of the first part to its middle year, and the average of the second part to the middle year of the second part. This gives:
Year    Profit    Semi-average
1991    40
1992    42        (40+42+40)/3 = 40.67
1993    40
1994    48
1995    52        (48+52+49)/3 = 49.67
1996    49
Plot the points (1992, 40.67) and (1995, 49.67) on a graph. The line joining these points is the trend line.
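The semi-average calculation can be sketched in Python as follows. This is an illustrative sketch rather than part of the original text; the function name `semi_average_trend` is an assumption, and the middle year of each half is taken as the mean of the years in that half.

```python
def semi_average_trend(years, values):
    """Return the two semi-average points and the slope of the line joining them.

    With an odd number of years, the middle year is omitted, as in the method above.
    """
    n = len(years)
    half = n // 2
    first = slice(0, half)
    second = slice(n - half, n)
    t1 = sum(years[first]) / half
    y1 = sum(values[first]) / half
    t2 = sum(years[second]) / half
    y2 = sum(values[second]) / half
    slope = (y2 - y1) / (t2 - t1)
    return (t1, y1), (t2, y2), slope

years = [1991, 1992, 1993, 1994, 1995, 1996]
profit = [40, 42, 40, 48, 52, 49]

point1, point2, slope = semi_average_trend(years, profit)

def trend_value(year):
    """Trend value read from the semi-average line for any year."""
    return point1[1] + slope * (year - point1[0])

print(point1, point2, slope)
```

Calling `trend_value` with a year beyond the data extends the line forward, which corresponds to extending the plotted trend line to predict future values.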
Method of moving average:
Under this method a series of successive averages is calculated from a series of values. Moving averages are calculated from values covering a fixed time interval, called the period of the moving average, and each is shown against the center of its interval. The period may be odd or even. Let a, b, c, d, e, f be the observations.
The three year moving averages are
$$\frac{a+b+c}{3},\; \frac{b+c+d}{3},\; \frac{c+d+e}{3},\; \ldots$$
The five year moving averages are
$$\frac{a+b+c+d+e}{5},\; \frac{b+c+d+e+f}{5},\; \ldots$$
If the moving average is to be calculated for an even period, the method is as follows:
Let the period be 4. First we calculate the average of the first four observations and place it between the second and third years. Then we omit the first year, find the average of the next four observations, and place it between the 3rd and 4th years, and so on.
To get a trend value against a definite year, we use the method of centering. For that, calculate the averages of successive pairs of the moving averages already calculated, taking the 1st and 2nd, the 2nd and 3rd, the 3rd and 4th, etc. We place these values against the middle of the two averages. This gives the moving averages for the 3rd year onwards.
Problem: Obtain trend values using
Year: 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
Profit: 40 42 40 48 52 49 50 54 55 52
(i) 3 year moving average method
(ii) 5 year moving average method
(iii) 4 year moving average method
Solution:
(i) The trend values as per the three year moving average method are calculated as follows:
Year    Profit    Three year total    Three year moving average
1991    40
1992    42        40+42+40=122        40.67
1993    40        42+40+48=130        43.33
1994    48        40+48+52=140        46.67
1995    52        48+52+49=149        49.67
1996    49        52+49+50=151        50.33
1997    50        49+50+54=153        51.00
1998    54        50+54+55=159        53.00
1999    55        54+55+52=161        53.67
2000    52
Hence trend values by three year moving average are:
Year: 1992 1993 1994 1995 1996 1997 1998 1999
Trend values: 40.67 43.33 46.67 49.67 50.33 51.00 53.00 53.67
(ii) The trend values as per the five year moving average method are calculated as follows:
Year    Profit    Five year total         Five year moving average
1991    40
1992    42
1993    40        40+42+40+48+52=222      44.40
1994    48        42+40+48+52+49=231      46.20
1995    52        40+48+52+49+50=239      47.80
1996    49        48+52+49+50+54=253      50.60
1997    50        52+49+50+54+55=260      52.00
1998    54        49+50+54+55+52=260      52.00
1999    55
2000    52
Hence trend values by five year moving average are:
Year: 1993 1994 1995 1996 1997 1998
Trend values: 44.40 46.20 47.80 50.60 52.00 52.00
(iii) The trend values as per the four year moving average method are calculated as follows. First find the four year totals and averages, each placed between the two middle years of its interval:
Years covered    Four year total       Four year average
1991-1994        40+42+40+48=170       42.50
1992-1995        42+40+48+52=182       45.50
1993-1996        40+48+52+49=189       47.25
1994-1997        48+52+49+50=199       49.75
1995-1998        52+49+50+54=205       51.25
1996-1999        49+50+54+55=208       52.00
1997-2000        50+54+55+52=211       52.75
Then center the averages by taking the mean of each successive pair:
Year    Two term total          Four year moving average
1993    42.50+45.50=88.00       44.00
1994    45.50+47.25=92.75       46.38
1995    47.25+49.75=97.00       48.50
1996    49.75+51.25=101.00      50.50
1997    51.25+52.00=103.25      51.63
1998    52.00+52.75=104.75      52.38
Hence trend values by four year moving average are:
Year: 1993 1994 1995 1996 1997 1998
Trend values: 44.00 46.38 48.50 50.50 51.63 52.38
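The moving average calculations for odd and even periods can be sketched in Python as below. This is a minimal illustration, not part of the original material; the function names `moving_average` and `centered_moving_average` are assumptions.

```python
def moving_average(values, period):
    """Simple moving averages over a sliding window of the given period."""
    return [
        sum(values[i:i + period]) / period
        for i in range(len(values) - period + 1)
    ]

def centered_moving_average(values, period):
    """Centered moving averages for an even period.

    Each value is the mean of two successive moving averages, so it falls
    against a definite year instead of between two years.
    """
    averages = moving_average(values, period)
    return [(averages[i] + averages[i + 1]) / 2 for i in range(len(averages) - 1)]

profit = [40, 42, 40, 48, 52, 49, 50, 54, 55, 52]

ma3 = moving_average(profit, 3)           # placed against 1992 ... 1999
ma5 = moving_average(profit, 5)           # placed against 1993 ... 1998
ma4 = centered_moving_average(profit, 4)  # placed against 1993 ... 1998

print([round(v, 2) for v in ma3])
print([round(v, 2) for v in ma5])
print([round(v, 2) for v in ma4])
```

For an odd period the k-th average belongs to the middle year of its window; for an even period the centering step is what aligns each value with a year.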
Method of least squares:
Using the principle of least squares, we can find linear as well as non-linear trend curves for a given time series.
- Linear trend: Using the method of fitting a straight line, we can find a trend line of the form y=ax+b for a given time series. The normal equations for finding a and b are
$$\sum_{i=1}^{n} x_iy_i = a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i \quad\text{and}\quad \sum_{i=1}^{n} y_i = a\sum_{i=1}^{n} x_i + nb.$$
To reduce calculations, we apply a change of origin. Two cases may arise: the number of years covered by the time series may be odd or even.
(i) If n is odd: We can take the origin at the middle year. Then x=0 will correspond to the middle year, x=1 to the next year, x=-1 to the preceding year, and so on. We shall then have $\sum_{i=1}^{n} x_i = 0$.
(ii) If n is even (let n=2m): Then there will be two middle years, the m-th and the (m+1)-th year. We take the origin midway between these two years and take a half year as the unit. Then the m-th year will correspond to x=-1, the (m+1)-th year to x=+1, and so on. We shall then again have $\sum_{i=1}^{n} x_i = 0$.
Then the normal equations reduce to
$$\sum_{i=1}^{n} x_iy_i = a\sum_{i=1}^{n} x_i^2 \quad\text{and}\quad \sum_{i=1}^{n} y_i = nb.$$
Then a and b are estimated as
$$a = \frac{\sum_{i=1}^{n} x_iy_i}{\sum_{i=1}^{n} x_i^2} \quad\text{and}\quad b = \frac{\sum_{i=1}^{n} y_i}{n}.$$
In a similar way a non-linear trend can also be estimated.
Problem: Fit a linear trend to the following data
Year : 1980 1981 1982 1983 1984 1985 1986
Profit : 125.5 136.1 142.9 158.3 171.3 197.7 200.8
Solution:
Taking the origin at the middle year 1983, the calculations are as follows:
Year (x)    u = x - 1983    Profit (y)    u^2    uy
1980        -3              125.5         9      -376.5
1981        -2              136.1         4      -272.2
1982        -1              142.9         1      -142.9
1983         0              158.3         0         0
1984        +1              171.3         1       171.3
1985        +2              197.7         4       395.4
1986        +3              200.8         9       602.4
Total        0              1132.6        28      377.5
Then,
$$a = \frac{\sum u_iy_i}{\sum u_i^2} = \frac{377.5}{28} = 13.48 \quad\text{and}\quad b = \frac{\sum y_i}{n} = \frac{1132.6}{7} = 161.8.$$
Hence the trend line is $y = 13.48u + 161.80$, where $u = x - 1983$.


Remark: If there is one more year, 1987, in the above problem, then n becomes even. Then take the midpoint of the middle years 1983 and 1984, which is 1983.5, as the origin, and change the scale so that a half year is the unit.
Then the x values become u values through $u = 2(x - 1983.5)$, that is, -7, -5, -3, -1, 1, 3, 5 and 7. Now proceed as above to find the equation of the trend line.
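The change of origin and the reduced normal equations can be sketched in Python as follows. This is an illustrative sketch assuming consecutive yearly observations; the function name `linear_trend` is hypothetical.

```python
def linear_trend(years, values):
    """Fit y = a*u + b by least squares after a change of origin.

    Odd n:  u = year - middle year.
    Even n: u = 2 * (year - midpoint of the two middle years), i.e. half-year units.
    Assumes the years are consecutive, so that sum(u) = 0.
    """
    n = len(years)
    if n % 2 == 1:
        origin = years[n // 2]
        u = [year - origin for year in years]
    else:
        origin = (years[n // 2 - 1] + years[n // 2]) / 2
        u = [2 * (year - origin) for year in years]
    a = sum(ui * yi for ui, yi in zip(u, values)) / sum(ui ** 2 for ui in u)
    b = sum(values) / n
    return a, b, u

years = [1980, 1981, 1982, 1983, 1984, 1985, 1986]
profit = [125.5, 136.1, 142.9, 158.3, 171.3, 197.7, 200.8]

a, b, u = linear_trend(years, profit)
print(round(a, 2), round(b, 2), u)

# Coding of the years when n is even (values here are placeholders, only u is of interest).
_, _, u_even = linear_trend(list(range(1980, 1988)), [0.0] * 8)
print(u_even)
```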
3.5. Method of measuring seasonal variations:
A study of the seasonal variation of a time series is useful because it helps to (i) analyse the seasonal pattern in a short-period time series, (ii) price articles and services so as to level up the seasonal variations in demand, (iii) plan future operations, and (iv) formulate policy decisions regarding purchase, production, inventory control etc.
The method of simple averages is an important method of studying seasonal variations.
(i) Method of simple averages:
The steps involved in this method are:
(i) Arrange the data by years, months or quarters, as the case may be, for a sufficiently long period of time.
(ii) Obtain the monthly averages from monthly data, or the quarterly averages from quarterly data etc., for the whole period covered.
(iii) Obtain the overall average for a month or quarter from all the monthly or quarterly averages obtained.
(iv) Obtain the seasonal indices for the different months or quarters by expressing each monthly or quarterly average as a percentage of the overall average.
If $\bar{X}$ is the overall average and $\bar{X}_k$ is the monthly average for the k-th month, then the seasonal index for the k-th month is
$$\frac{\bar{X}_k}{\bar{X}} \times 100.$$
Note: The sum of the seasonal indices must be 1,200 for monthly data and 400 for quarterly data.
Problem: Use the method of simple averages to determine quarterly seasonal indices for the following data on the production of a commodity in the years 1986, 1987 and 1988.
Production in lakhs of tonnes
Month        1986    1987    1988
January      12      15      16
February     11      14      15
March        10      13      14
April        14      16      16
May          15      16      15
June         15      15      17
July         16      17      16
August       13      12      13
September    11      13      10
October      10      12      10
November     12      13      11
December     15      14      15
Solution:
Setting the data in quarterly form and averaging each quarter over the three years, we get:
Production in lakhs of tonnes
Quarter           1986    1987    1988    Quarterly average
First quarter     33      42      45      40.00
Second quarter    44      47      48      46.33
Third quarter     40      42      39      40.33
Fourth quarter    37      39      36      37.33
Overall average = (40 + 46.33 + 40.33 + 37.33)/4 = 40.9975
Hence the seasonal indices for the various quarters are:
For the first quarter, the index is (40/40.9975) x 100 = 97.57.
For the second quarter, the index is (46.33/40.9975) x 100 = 113.01.
For the third quarter, the index is (40.33/40.9975) x 100 = 98.37.
For the fourth quarter, the index is (37.33/40.9975) x 100 = 91.05.
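The method of simple averages for this problem can be sketched in Python as below. This is an illustrative sketch, not part of the original material; the names `monthly`, `quarterly_totals` and `seasonal_indices` are assumptions. Because the code keeps full precision instead of rounding the quarterly averages to two decimals, its indices may differ from hand-computed ones in the second decimal place.

```python
# Monthly production (lakhs of tonnes), January to December, for each year.
monthly = {
    1986: [12, 11, 10, 14, 15, 15, 16, 13, 11, 10, 12, 15],
    1987: [15, 14, 13, 16, 16, 15, 17, 12, 13, 12, 13, 14],
    1988: [16, 15, 14, 16, 15, 17, 16, 13, 10, 10, 11, 15],
}

def quarterly_totals(months):
    """Sum twelve monthly values into four quarterly totals."""
    return [sum(months[3 * q:3 * q + 3]) for q in range(4)]

def seasonal_indices(data):
    """Seasonal index of each quarter: quarterly average as a percentage of the overall average."""
    totals = [quarterly_totals(months) for months in data.values()]
    quarter_averages = [sum(year[q] for year in totals) / len(totals) for q in range(4)]
    overall_average = sum(quarter_averages) / 4
    indices = [100 * avg / overall_average for avg in quarter_averages]
    return quarter_averages, overall_average, indices

quarter_averages, overall_average, indices = seasonal_indices(monthly)
print([round(v, 2) for v in quarter_averages])
print([round(v, 2) for v in indices], round(sum(indices), 2))
```

The final printed sum is a quick check of the rule that quarterly indices must total 400.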
EXERCISES
1. Define time series.
2. What are the various components of a time series?
3. What are the additive and multiplicative models in time series?
4. Explain various components of time series.
5. What are the methods of measuring trend in a time series?
6. Fit a trend line to the following data using semi-average method
Year: 1990 1991 1992 1993 1994 1995
Profit: 34 34 34 34 32 39
7. Obtain trend values using
Year: 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
Profit: 33 35 34 38 35 34 35 39 36 38
(i) 3 year moving average method
(ii) 5 year moving average method
(iii) 4 year moving average method
8. Explain any method of measuring seasonal variation.
*********************
CHAPTER 4
STATISTICAL QUALITY CONTROL
4.1. Quality:
Quality means conformity with certain prescribed standards. These standards may be in terms of size, weight, strength, colour, taste, etc. The quality standards are normally set by the makers of the product. The Government is also keen on quality standards. Only quality products are given Agmark and ISI labels.
Statistical Quality Control (SQC) refers to the statistical techniques employed for the maintenance of uniform quality in a continuous flow of manufactured products.
Definition: SQC is a simple statistical method for determining the extent to which quality goals are being met without necessarily checking every item produced, and for indicating whether or not the variations which occur are exceeding normal expectations.
SQC enables us to decide whether to reject or accept a particular product.
Some of the advantages of Statistical Quality Control are
1. The quality of the product is effectively checked and deviations from the quality are easily identified.
2. It protects the buyer from loss due to the rejection of the items purchased.
3. Quality consciousness among employees is increased.
4. SQC enables the producer to decide when to take corrective measures.
Even when a company intends every item it produces to be of good quality, some amount of variation among the items produced is observed. This variation is mainly due to two types of causes:
(i) Random or chance causes, which produce inevitable variations that are natural and allowable and cannot be removed or prevented. (ii) Assignable or preventable causes, which are neither natural nor inherent in the manufacturing process and can be prevented if the cause of such variation is identified.
The main purpose of SQC is to separate the preventable causes from the random causes.
4.2. Process and Product Control:
In a production process the quality is to be ensured. That is, the proportion of defective items is to be controlled. This is called process control. By product control we mean controlling the quality of the items produced by critical examination at strategic points.
4.3. Control chart:
A control chart is a statistical technique for process control. It is a graphical chart used to detect unusual variations occurring in the process of production that put the process out of control.
A control chart has three horizontal lines starting from the right hand side of the vertical line and parallel to the base line of the chart. The vertical line is meant for showing the quality statistic of each sample and is known as the quality scale. The base line is used for marking sample numbers and is termed the subgroup scale. The three horizontal lines, (i) the central line, (ii) the upper control limit and (iii) the lower control limit, are known as control lines.
Central line (CL): It passes through the middle of the chart parallel to the base line and represents the prescribed standard of quality of the product in the process. If the manufacturing process is in a perfect state of control and the product conforms perfectly to the prescribed standards, the sample points will coincide with this line.
Lower Control Limit (LCL): This is a dotted line parallel to and below the central line. This line represents the lowest limit of variation that can be tolerated.
Upper Control Limit (UCL): This is a dotted line parallel to and above the central line. This line represents the highest limit of variation that can be tolerated.
If the sample points lie between these dotted lines in the control chart, it is assumed that the variation in the process is wholly due to chance causes and the process is under control. Otherwise some assignable causes are involved in the process, the process is considered to be out of control, and remedial measures are to be taken.
Control chart for variables:
To control the quality of a characteristic which is measurable, like the diameter of a screw, the specific resistance of a wire etc., we use two types of control charts. They are charts for $\bar{x}$ (mean) and charts for R (range).
p-chart:
This type of chart is used to deal with characteristics which cannot be measured, but can be observed as absent or present, so that each item is classified as defective or non-defective.
c-chart:
This type of chart is applicable when the quality of the product is a discrete variable.
4.4. x (mean) and R (range) chart:
Though we expect a process to produce items which are exactly alike, some amount of variation will occur. This variation arises from the various elements of the production process, such as raw material, machine setting etc. Such variations are due to chance causes or assignable causes. The $\bar{x}$ and R charts reveal the presence or absence of assignable causes of variation in (i) the average, which is mostly related to machine setting, and (ii) the range, which is mainly related to negligence on the part of the operator.
The following are some factors to be considered in an $\bar{x}$ and R chart.
1. Measurement: Any method of measurement has its own variability. Hence quality is to be measured with the least possible variability. For this, (i) the use of faulty instruments is to be avoided, (ii) clear-cut definitions of the quality characteristic and of the method of taking measurements are to be given, and (iii) measurements are to be taken by expert hands.
2. Selection of samples: To make the control chart analysis more effective, an appropriate sample size and time between successive samples are to be selected depending upon the process to be controlled. Initially more frequent samples are required, but after the process has stabilized under control, the frequency may be reduced.
3. Calculation of $\bar{x}$ and R for each sub group:
Assume k samples are taken. Let $x_{ij}$, $j = 1, 2, \ldots, n$, be the measurements in the i-th sample for $i = 1, 2, \ldots, k$. Let $\bar{x}_i$, $R_i$ and $s_i$ respectively denote the mean, range and standard deviation of the i-th sample. Then,
$$\bar{x}_i = \frac{1}{n}\sum_{j} x_{ij}, \qquad R_i = \max_j x_{ij} - \min_j x_{ij}$$
and
$$s_i^2 = \frac{1}{n}\sum_{j}\left(x_{ij} - \bar{x}_i\right)^2.$$
Now, the averages of the sample means, sample ranges and sample standard deviations are,
$$\bar{\bar{x}} = \frac{1}{k}\sum_{i} \bar{x}_i; \qquad \bar{R} = \frac{1}{k}\sum_{i} R_i \qquad\text{and}\qquad \bar{s} = \frac{1}{k}\sum_{i} s_i.$$
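4. Control limits for $\bar{x}$:
Case 1: When the process mean $\mu$ and standard deviation $\sigma$ are known, the $3\sigma$ control limits for $\bar{x}$, that is, the interval within $3\,\text{S.D.}(\bar{x})$ of $E(\bar{x})$, are
$$\left[E(\bar{x}) - 3\,\text{S.D.}(\bar{x}),\; E(\bar{x}) + 3\,\text{S.D.}(\bar{x})\right]$$
That is,
$$\left[\mu - \frac{3\sigma}{\sqrt{n}},\; \mu + \frac{3\sigma}{\sqrt{n}}\right]$$
Then,
$$LCL_{\bar{x}} = \mu - \frac{3\sigma}{\sqrt{n}}; \qquad UCL_{\bar{x}} = \mu + \frac{3\sigma}{\sqrt{n}}$$
Case 2: When $\mu$ and $\sigma$ are unknown.
Then $\mu$ and $\sigma$ are estimated by $\bar{\bar{x}}$ and $\bar{R}/d_2$, where $d_2$ is a constant depending upon the sample size. Then the $3\sigma$ control limits for $\bar{x}$ are,
$$\left[\bar{\bar{x}} - \frac{3}{d_2\sqrt{n}}\bar{R},\; \bar{\bar{x}} + \frac{3}{d_2\sqrt{n}}\bar{R}\right]$$
Then,
$$LCL_{\bar{x}} = \bar{\bar{x}} - \frac{3}{d_2\sqrt{n}}\bar{R}; \qquad UCL_{\bar{x}} = \bar{\bar{x}} + \frac{3}{d_2\sqrt{n}}\bar{R}$$
A table of the values of $\dfrac{3}{d_2\sqrt{n}}$ is available for different values of n from 2 to 25.
$\sigma$ can also be estimated using $\bar{s}$, by the expression $\hat{\sigma} = \dfrac{\bar{s}}{c_2}$, where
$$c_2 = \sqrt{\frac{2}{n}}\cdot\frac{\left(\dfrac{n-2}{2}\right)!}{\left(\dfrac{n-3}{2}\right)!}$$
A table of the values of $\dfrac{3}{c_2\sqrt{n}}$ is available for different values of n from 2 to 25.
5. Control limits for R chart:
The $3\sigma$ limits for the R chart are
$$\left[\bar{R} - 3c\bar{R},\; \bar{R} + 3c\bar{R}\right] = \left[(1 - 3c)\bar{R},\; (1 + 3c)\bar{R}\right],$$
where c is a constant depending upon n.
The values of $(1 - 3c)$ and $(1 + 3c)$ are listed in tables for various values of n.
$$LCL_R = (1 - 3c)\bar{R} \qquad\text{and}\qquad UCL_R = (1 + 3c)\bar{R}$$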
Construction of $\bar{x}$ chart:
The $\bar{x}$ chart is constructed with the sample number along the X-axis and the corresponding sample mean $\bar{x}_i$ along the Y-axis. The points are plotted but not joined. The central line is drawn as a bold line parallel to the X-axis at the mean of the sample means, $\bar{\bar{x}}$.
The Upper Control Limit line and Lower Control Limit line are drawn as dotted lines parallel to the X-axis through the calculated Y values $UCL_{\bar{x}}$ and $LCL_{\bar{x}}$ respectively.
Points falling beyond the control lines, if any, indicate that assignable causes have put the process out of control. Now remove all the sample results which are beyond the control limits and revise the control limits for the remaining samples. The revision of the control limits is repeated until all the sample means are within the new control limits. The new limits are then extended for checking the quality of future products.
Construction of R- chart:
Since for small samples of size 2 to 15 the range provides a good estimate of $\sigma$, the range is commonly used in SQC to study the pattern of variation in quality.
To draw the R-chart, we proceed as in the case of the $\bar{x}$ chart. Take the sample number along the X-axis and the corresponding sample range $R_i$ along the Y-axis. The central line is drawn as a bold line parallel to the X-axis through $\bar{R}$, and the control lines are drawn through the calculated values of $LCL_R$ and $UCL_R$ as dotted lines parallel to the X-axis. Since the range is non-negative, $LCL_R$ must be non-negative. If it is calculated as negative in any case, it is taken as zero.
Note:
(i) The $\bar{x}$ chart is used to show the quality averages of the samples drawn from a given process, whereas the R chart is used to show the quality dispersions of the samples.
(ii) Usually the R chart is drawn first. If the R chart indicates that the dispersion of the quality of the process is out of control, it is better not to construct the $\bar{x}$ chart until the quality dispersion is brought under control.
(iii) For a normal population the probability of any point falling outside the $3\sigma$ control lines is only 0.0027. If a point falls outside the limits, it is desirable to take another, bigger sample. If the mean of the combined samples still lies outside the control limits, there should be a search for assignable causes.
Problem: Construct control charts for the mean and the range for the following data on fuses, samples of 5 being taken every hour (each set of 5 has been arranged in ascending order of magnitude; each column is one sample).
42 42 19 36 42 51 60 18 15 69 64 61
65 45 24 54 51 74 60 20 30 109 90 78
75 68 80 69 57 75 72 27 39 113 93 94
78 72 81 77 59 78 95 42 62 118 109 109
87 90 81 84 78 132 138 60 84 153 112 136
Solution:
The given 12 samples are arranged as follows, and the total, mean and range of each sample are calculated and listed below:
Sample number    Sample observations      Sample total    Sample mean    Sample range
1                42 65 75 78 87           347             69.4           45
2                42 45 68 72 90           317             63.4           48
3                19 24 80 81 81           285             57.0           62
4                36 54 69 77 84           320             64.0           48
5                42 51 57 59 78           287             57.4           36
6                51 74 75 78 132          410             82.0           81
7                60 60 72 95 138          425             85.0           78
8                18 20 27 42 60           167             33.4           42
9                15 30 39 62 84           230             46.0           69
10               69 109 113 118 153       562             112.4          84
11               64 90 93 109 112         468             93.6           48
12               61 78 94 109 136         478             95.6           75
Mean of the sample means x =sum of sample means/12
Sum of sample means = 859.2
859.2
71.6
12
x = =
Mean of the sample range R=sum of sample ranges/12
Sum of sample means = 716
716
59.66
12
R = =
For the $\bar{x}$ chart,
$$LCL_{\bar{x}} = \bar{\bar{x}} - \frac{3}{d_2\sqrt{n}}\,\bar{R}; \qquad UCL_{\bar{x}} = \bar{\bar{x}} + \frac{3}{d_2\sqrt{n}}\,\bar{R}$$
The value of $\frac{3}{d_2\sqrt{n}}$ for n = 5 is obtained from the table as 0.58.
Hence,
$$LCL_{\bar{x}} = 71.60 - 59.66 \times 0.58 = 37.0; \qquad UCL_{\bar{x}} = 71.60 + 59.66 \times 0.58 = 106.2$$
For the R chart,
The values of $(1 - 3c)$ and $(1 + 3c)$ for n = 5 are 0 and 2.115.
Hence, $LCL_R = (1 - 3c)\bar{R} = 0$ and
$$UCL_R = (1 + 3c)\bar{R} = 2.115 \times 59.66 = 126.181$$
The charts are drawn by plotting the sample means and the sample ranges against the sample
numbers, together with the central lines and the control limits obtained above.
Since two points lie beyond the control lines (the mean of sample 8, 33.4, is below
$LCL_{\bar{x}}$ and the mean of sample 10, 112.4, is above $UCL_{\bar{x}}$), the $\bar{x}$
chart shows a lack of control in the process average and suggests the presence of some
assignable causes, which are to be detected and corrected.
Since all the points in the R chart are within the control lines, the process variability is
under control.
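The calculation above can be checked with a short Python sketch. It is only an illustration, not part of the original solution: the variable names are hypothetical, and the table factors 0.58, 0 and 2.115 for n = 5 are entered by hand as assumptions (2.115 is the value that reproduces an upper range limit of about 126.18). The script works with unrounded averages, so its limits can differ from the hand calculation in the later decimals.

```python
# Sketch: x-bar and R chart limits for the fuse data (12 samples of size 5).
samples = [
    [42, 65, 75, 78, 87],
    [42, 45, 68, 72, 90],
    [19, 24, 80, 81, 81],
    [36, 54, 69, 77, 84],
    [42, 51, 57, 59, 78],
    [51, 74, 75, 78, 132],
    [60, 60, 72, 95, 138],
    [18, 20, 27, 42, 60],
    [15, 30, 39, 62, 84],
    [69, 109, 113, 118, 153],
    [64, 90, 93, 109, 112],
    [61, 78, 94, 109, 136],
]

# Table factors for n = 5 (assumed values, typed in by hand):
# A2 = 3 / (d2 * sqrt(n)), D3 = 1 - 3c, D4 = 1 + 3c.
A2, D3, D4 = 0.58, 0.0, 2.115

means = [sum(s) / len(s) for s in samples]
ranges = [max(s) - min(s) for s in samples]

grand_mean = sum(means) / len(means)      # central line of the x-bar chart
mean_range = sum(ranges) / len(ranges)    # central line of the R chart

xbar_lcl = grand_mean - A2 * mean_range
xbar_ucl = grand_mean + A2 * mean_range
r_lcl = max(0.0, D3 * mean_range)         # a negative limit is taken as zero
r_ucl = D4 * mean_range

# Sample numbers (1-based) whose plotted point falls outside the limits.
xbar_outside = [i + 1 for i, m in enumerate(means) if not xbar_lcl <= m <= xbar_ucl]
r_outside = [i + 1 for i, r in enumerate(ranges) if not r_lcl <= r <= r_ucl]

print(f"x-bar chart: CL={grand_mean:.2f}, LCL={xbar_lcl:.2f}, UCL={xbar_ucl:.2f}")
print(f"R chart:     CL={mean_range:.2f}, LCL={r_lcl:.2f}, UCL={r_ucl:.2f}")
print("x-bar points outside limits:", xbar_outside)
print("R points outside limits:", r_outside)
```

Listing the sample numbers that fall outside the limits makes it easy to see which hours need to be investigated for assignable causes.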
4.5. p-Chart:
The p-chart is the control chart for the fraction defective. It is used whenever the quality
characteristic is recorded as yes or no, that is, present or absent, or defective or
non-defective. Suppose n items are inspected and d of them are defective; then d is a
binomial variable with parameters n and P. The fraction defective P for a sample is
estimated by d/n.
To draw the p-chart, obtain the value of P for each sample. Then obtain the average
fraction defective from all the samples combined as
$$\bar{P} = \frac{\text{No. of defectives in all samples combined}}{\text{Total no. of items in all samples combined}}$$
The upper control limit and the lower control limit of the p-chart at the 3-sigma level are
$$UCL_P = P + 3\sigma_P; \qquad LCL_P = P - 3\sigma_P$$
where $\sigma_P = \sqrt{P(1-P)/n}$, which is estimated by $\sqrt{\bar{P}(1-\bar{P})/n}$.
Hence,
$$UCL_P = \bar{P} + 3\sqrt{\bar{P}(1-\bar{P})/n}; \qquad LCL_P = \bar{P} - 3\sqrt{\bar{P}(1-\bar{P})/n}$$
Since P is never negative, if $LCL_P$ is found to be negative, it is taken as zero.
Draw the central line, which is the horizontal line through $\bar{P}$.
Plot the fraction defective for each sample and check whether all the points are within the
control lines or not.
A point above $UCL_P$ indicates a lack of control and that the process has changed for the
worse. Such a point is called a high spot.
A point below $LCL_P$ indicates that the process has changed for the better. Such a point
is called a low spot, which indicates an improvement in quality.
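The p-chart limits can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `p_chart_limits` and the defect counts are hypothetical (invented for the example, not taken from the text), and all samples are assumed to have the same size n.

```python
import math

def p_chart_limits(defectives, n):
    """Return (p_bar, lcl, ucl) for samples of constant size n."""
    p_bar = sum(defectives) / (n * len(defectives))   # average fraction defective
    sigma = math.sqrt(p_bar * (1 - p_bar) / n)        # estimated sigma_P
    lcl = max(0.0, p_bar - 3 * sigma)                 # negative limit is taken as zero
    ucl = p_bar + 3 * sigma
    return p_bar, lcl, ucl

# Hypothetical data: defectives found in 8 samples of 50 items each.
defectives = [3, 5, 2, 4, 12, 3, 4, 2]
n = 50

p_bar, lcl, ucl = p_chart_limits(defectives, n)
fractions = [d / n for d in defectives]

# Sample numbers (1-based) of high spots and low spots.
high_spots = [i + 1 for i, p in enumerate(fractions) if p > ucl]
low_spots = [i + 1 for i, p in enumerate(fractions) if p < lcl]

print(f"CL={p_bar:.4f}, LCL={lcl:.4f}, UCL={ucl:.4f}")
print("high spots:", high_spots, "low spots:", low_spots)
```

When the lower limit is set to zero, no point can fall below it, so low spots can only occur when the computed lower limit is positive.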
4.6. d-Chart:
The d-chart is similar to the p-chart, but instead of plotting the fraction defective for
each sample, the number of defectives in each sample is plotted against the corresponding
sample number.
The lines of control are:
The central line passes through the value nP, where P is the probability of a defective.
If P is unknown, it is estimated from the samples of constant size as
$$\bar{P} = \frac{\text{No. of defectives in all samples combined}}{\text{Total no. of items in all samples combined}}$$
The UCL corresponding to the 3-sigma level is $nP + 3\sqrt{nP(1-P)}$, where n is the
(constant) sample size.
When P is not given, the UCL is $n\bar{P} + 3\sqrt{n\bar{P}(1-\bar{P})}$.
The LCL is $nP - 3\sqrt{nP(1-P)}$ or, when P is not given, $n\bar{P} - 3\sqrt{n\bar{P}(1-\bar{P})}$.
The LCL is taken as zero if it is calculated as negative.
Problem: The following data refer to visual defects found during inspection of the first 10
samples of size 100 each. Use them to obtain upper and lower control limits for the
percentage defective in samples of 100.
Sample No.          1   2   3   4   5   6   7   8   9  10
No. of defectives   4   8  11   3  11   7   7  16  12   6
Solution:
The total number of defectives in the given set of 10 samples of 100 each is 85.
Hence,
$$\bar{P} = \frac{\text{No. of defectives in all samples combined}}{\text{Total no. of items in all samples combined}} = \frac{85}{1000} = 0.085$$
The central line passes through $n\bar{P} = 100 \times 0.085 = 8.5$.
$$UCL = n\bar{P} + 3\sqrt{n\bar{P}(1-\bar{P})} = 100 \times 0.085 + 3\sqrt{100 \times 0.085 \times (1 - 0.085)} = 16.87$$
$$LCL = n\bar{P} - 3\sqrt{n\bar{P}(1-\bar{P})} = 100 \times 0.085 - 3\sqrt{100 \times 0.085 \times (1 - 0.085)} = 0.134$$
The chart is drawn by plotting the number of defectives against the sample number, together
with the central line and the control limits.
Since all the points are within the control lines, the process is under control.
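As a numerical check of this problem, the limits can be recomputed with a few lines of Python. The sketch below is not part of the original solution; the variable names are hypothetical, and only the data and the 3-sigma formulas come from the text.

```python
import math

# Number of defectives in the 10 samples of size 100 given in the problem.
defectives = [4, 8, 11, 3, 11, 7, 7, 16, 12, 6]
n = 100

p_bar = sum(defectives) / (n * len(defectives))   # 85 / 1000
center = n * p_bar                                # central line n * P-bar
spread = 3 * math.sqrt(n * p_bar * (1 - p_bar))   # 3-sigma half-width

ucl = center + spread
lcl = max(0.0, center - spread)                   # negative limit is taken as zero

# Sample numbers (1-based) whose count lies outside the control limits.
outside = [i + 1 for i, d in enumerate(defectives) if not lcl <= d <= ucl]

print(f"CL = {center:.2f}, UCL = {ucl:.2f}, LCL = {lcl:.3f}")
print("points outside limits:", outside)
```

An empty list of outside points corresponds to the conclusion that the process is under control.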
Note: If the number of items inspected, n, varies from sample to sample, then in the p-chart
separate control limits have to be computed for each sample while the central line stays the
same, whereas in the d-chart the central line nP also has to be computed for each sample.
Limits of this type are known as variable control limits. In such a situation the p-chart
is relatively simple and is preferred.
4.7. C-Chart (Control chart for number of defects per unit):
A defective item may contain one or more defects. In the p-chart and the d-chart the
number of defectives is counted, without regard to the number of defects per unit. The
C-chart concentrates on the number of defects per unit in samples of constant size.
Control limits of C-chart:
When a large sample is inspected and the number of defects is small, the probability p of
occurrence of a defect in any unit is very small. Hence the Poisson approximation to the
number of defects is appropriate, and the control limits are calculated from the Poisson
distribution.
When the number of defects is a Poisson variable, the average number of defects is the
parameter $\lambda$, and the standard deviation is $\sqrt{\lambda}$.
If $\lambda$ is known, at the 3-sigma level, UCL = $\lambda + 3\sqrt{\lambda}$ and
LCL = $\lambda - 3\sqrt{\lambda}$.
The central line passes through $\lambda$.
If $\lambda$ is unknown, it is estimated by
$$\hat{\lambda} = \text{average no. of defects} = \frac{\text{total no. of defects}}{\text{total units observed}}$$
Then the central line is through $\hat{\lambda}$,
UCL = $\hat{\lambda} + 3\sqrt{\hat{\lambda}}$ and
LCL = $\hat{\lambda} - 3\sqrt{\hat{\lambda}}$ (which is taken as zero, if calculated as negative).
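A minimal Python sketch of the C-chart limits follows, assuming the Poisson parameter is unknown and is estimated by the average number of defects per unit. The function name `c_chart_limits` and the defect counts are hypothetical and serve only as an illustration.

```python
import math

def c_chart_limits(defect_counts):
    """Return (center, lcl, ucl) for a C-chart with the parameter estimated from data."""
    lam = sum(defect_counts) / len(defect_counts)   # estimated average defects per unit
    spread = 3 * math.sqrt(lam)                     # Poisson: variance equals the mean
    return lam, max(0.0, lam - spread), lam + spread

# Hypothetical defect counts observed on 8 inspected units.
counts = [4, 6, 3, 5, 2, 7, 4, 1]

center, lcl, ucl = c_chart_limits(counts)
outside = [i + 1 for i, c in enumerate(counts) if not lcl <= c <= ucl]

print(f"CL = {center:.2f}, LCL = {lcl:.2f}, UCL = {ucl:.2f}")
print("units outside limits:", outside)
```

Because the Poisson variance equals its mean, a single estimate fixes both the central line and the width of the limits.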
EXERCISES
1. What is Statistical Quality Control (SQC)?
2. Distinguish between process control and product control.
3. What is a control chart?
4. Write a short note on the lines of control.
5. What are the things to be considered while drawing x bar chart?
6. Explain R chart.
7. Which control charts are used in the case of attributes? Explain.
8. Draw the mean and range charts and comment for the following data
Sample No. 1 2 3 4 5 6 7 8 9 10
Mean 43 49 37 44 45 37 51 46 43 47
Range 5 6 5 7 7 4 8 6 4 6
9. Discuss the construction of p chart when all samples are of same size.
10. During an examination of equal length of cloth, the following are the number of
defects observed:
2,3,4,0,5,6,7,4,3,2
Draw control chart for the number of defects and comment whether the process is
under control or not.
************************
CHAPTER 5
ANALYSIS OF VARIANCE
5.1. Analysis of Variance (ANOVA)
The difference between the means of two normal populations is tested using the
t-distribution. But testing the means of three or more populations together, based on
samples taken from them, is not possible with the t-test. Analysis of variance (ANOVA)
provides a statistical test of whether or not the means of several groups are all equal, and
therefore generalizes the t-test to more than two groups. Doing multiple two-sample t-tests
would result in an increased chance of committing a type I error. For this reason,
ANOVAs are useful in comparing two, three, or more means.
Analysis of variance was introduced by Prof. R. A. Fisher in the 1920s.
Assumptions in ANOVA
- The populations from which the samples were obtained must be normally or
approximately normally distributed.
- The samples must be independent.
- The variances of the populations must be equal.
5.2. One way ANOVA
Let the data be classified according to only one characteristic. Consider independent
samples of $n_1, n_2, \ldots, n_k$ observations from k different populations with population
means $\mu_1, \mu_2, \ldots, \mu_k$. One way ANOVA tests whether the arithmetic means of the
populations from which the k samples are randomly drawn are equal to one another, that is,
it tests the hypothesis
$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k.$$
Procedure for One way ANOVA
Consider k independent samples containing $n_1, n_2, \ldots, n_k$ items respectively. Let
$x_{ij}$ denote the i-th observation in the j-th class. The total number of items in the
samples is $n = n_1 + n_2 + \cdots + n_k$. The total variation among the observations is
split into two parts: (i) variation between the samples (or classes or treatments) and
(ii) variation within the samples. The first type of variation is due to assignable causes,
while the second type is due to chance. The main aim of analysis of variance is to examine
whether the means of all the populations from which the samples are taken are the same, in
view of the variability within the samples (classes).
The steps of ANOVA for testing $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ are listed as follows:
Step 1: Obtain the sample mean of each of the k samples. Let them be
$\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k$.
Step 2: The grand average $\bar{X}$ of all the n sample items is found as
$$\bar{X} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + \cdots + n_k\bar{x}_k}{n},$$
which reduces to $\bar{X} = \frac{\bar{x}_1 + \bar{x}_2 + \cdots + \bar{x}_k}{k}$ when all the
k samples are of the same size.
Step 3: Take the difference between the mean of each sample and the grand average.
Step 4: Square each of these differences, multiply it by the size $n_j$ of the corresponding
sample, and add. This number, $SSB = \sum_{j} n_j(\bar{x}_j - \bar{X})^2$, is the sum of
squares between the samples (SSB).
Step 5: Mean sum of squares between the samples is $MSB = \frac{SSB}{k-1}$.
Step 6: Take the deviations of the various items in a sample from the mean of the
respective sample and add the squares of such deviations over all samples. This number is
the sum of squares within the samples (SSW).
Step 7: Mean sum of squares within the samples is $MSW = \frac{SSW}{n-k}$.
Step 8: Calculate the F-ratio, $F = \frac{MSB}{MSW}$.
Step 9: Compare the calculated F with the table value of the F-distribution with
$(k-1, n-k)$ degrees of freedom.
Step 10: If the calculated value of F is less than the table value, ACCEPT the hypothesis
that the population means are equal; otherwise reject it.
ANOVA table for one way classified data:
Source of variation         Sum of     Degrees of   Mean sum of         F-ratio
                            squares    freedom      squares
Between samples             SSB        k-1          MSB = SSB/(k-1)     F = MSB/MSW
(class or treatment)
Within samples (or Error)   SSW        n-k          MSW = SSW/(n-k)
Total                       SST        n-1
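The "Total" row of the table rests on the standard decomposition of the total sum of squares, written out here as a sketch in the notation of this section ($x_{ij}$ is the i-th observation in the j-th class, $\bar{x}_j$ the j-th sample mean, $\bar{X}$ the grand average). The between-samples term carries the sample sizes $n_j$ as weights, and the degrees of freedom add up in the same way.

```latex
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{X}\right)^2}_{SST}
  = \underbrace{\sum_{j=1}^{k} n_j\left(\bar{x}_j-\bar{X}\right)^2}_{SSB}
  + \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}_j\right)^2}_{SSW},
\qquad
n-1 = (k-1) + (n-k).
```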
Illustration:
A sample of three bulbs is taken from each of four companies producing light bulbs. Their
lifetimes are tested and the results, in hundreds of hours, are as follows.
Companies
 A    B    C    D
18   20   18   15
17   16   16   18
15   15   16   15
Test whether the mean lifetimes of the bulbs from the four companies are equal.
Solution:
Let the mean lifetimes of the bulbs produced by the four companies be
$\mu_1, \mu_2, \mu_3$ and $\mu_4$.
To test $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$.
The means of the samples are
$$\bar{x}_1 = \frac{18+17+15}{3} = 16.67; \qquad \bar{x}_2 = \frac{20+16+15}{3} = 17;$$
$$\bar{x}_3 = \frac{18+16+16}{3} = 16.67 \quad \text{and} \quad \bar{x}_4 = \frac{15+18+15}{3} = 16.$$
Hence the grand mean,
$$\bar{X} = \frac{16.67+17+16.67+16}{4} = 16.585$$
Sum of squares between the samples, each squared difference being weighted by the sample
size $n_j = 3$:
$$SSB = 3\left[(16.67-16.585)^2 + (17-16.585)^2 + (16.67-16.585)^2 + (16-16.585)^2\right]$$
$$= 3\,(0.007225 + 0.172225 + 0.007225 + 0.342225) = 3 \times 0.5289 = 1.5867.$$
Then,
$$MSB = \frac{1.5867}{3} = 0.5289.$$
Sum of squares within the samples, SSW, is calculated as follows:
For the samples from company A: $\bar{x}_1 = 16.67$;
The sum of squares of deviations = $(18-16.67)^2 + (17-16.67)^2 + (15-16.67)^2$
= 1.7689 + 0.1089 + 2.7889 = 4.6667.
For the samples from company B: $\bar{x}_2 = 17$;
The sum of squares of deviations = $(20-17)^2 + (16-17)^2 + (15-17)^2$
= 9 + 1 + 4 = 14.
For the samples from company C: $\bar{x}_3 = 16.67$;
The sum of squares of deviations = $(18-16.67)^2 + (16-16.67)^2 + (16-16.67)^2$
= 1.7689 + 0.4489 + 0.4489 = 2.6667.
For the samples from company D: $\bar{x}_4 = 16$;
The sum of squares of deviations = $(15-16)^2 + (18-16)^2 + (15-16)^2$
= 1 + 4 + 1 = 6.
Hence,
SSW = 4.6667 + 14 + 2.6667 + 6 = 27.3334
$$MSW = \frac{4.6667 + 14 + 2.6667 + 6}{8} = 3.4167$$
$$F = \frac{MSB}{MSW} = \frac{0.5289}{3.4167} = 0.1548$$
The table is,
Source of variation         Sum of          Degrees of   Mean sum of      F-ratio
                            squares         freedom      squares
Between samples             SSB = 1.5867    4-1 = 3      MSB = 0.5289     F = 0.5289/3.4167
(class or treatment)                                                        = 0.1548
Within samples (or Error)   SSW = 27.3334   12-4 = 8     MSW = 3.4167
Total                       SST = 28.9201   11
From the table of the F distribution, the table value for (3, 8) d.f. at the 5%
significance level is 4.07.
Here, the calculated F value is less than the table F value. Hence ACCEPT the
hypothesis that the mean lifetimes of the bulbs from the four companies are equal.
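The one way ANOVA procedure can be sketched in plain Python and applied to the bulb data. This is an illustrative sketch, not the author's code: the function name `one_way_anova` and the dictionary keys are hypothetical. It uses the size-weighted between-samples sum of squares, $SSB = \sum_j n_j(\bar{x}_j - \bar{X})^2$, and unrounded means, so its figures can differ in the later decimals from a hand calculation with rounded means. The table value 4.07 is the one quoted in the text.

```python
def one_way_anova(groups):
    """One way ANOVA for a list of samples (each a list of numbers)."""
    k = len(groups)                              # number of samples
    n = sum(len(g) for g in groups)              # total number of items
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Between-samples sum of squares, weighted by sample size.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-samples sum of squares.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return {"SSB": ssb, "SSW": ssw, "MSB": msb, "MSW": msw,
            "F": msb / msw, "df": (k - 1, n - k)}

# Lifetimes (hundreds of hours) of three bulbs from each of four companies.
bulbs = {"A": [18, 17, 15], "B": [20, 16, 15], "C": [18, 16, 16], "D": [15, 18, 15]}

result = one_way_anova(list(bulbs.values()))
F_TABLE_5PCT = 4.07   # table value for (3, 8) d.f. quoted in the text

for name, value in result.items():
    print(name, value)
print("ACCEPT H0" if result["F"] < F_TABLE_5PCT else "REJECT H0")
```

Returning the sums of squares together with the degrees of freedom makes it straightforward to fill in the ANOVA table from the output.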
5.3. Two way ANOVA:
In one way ANOVA, we considered a variable influenced by various levels
(treatments) of a factor. Our test was, whether all treatments are alike with respect to their
mean effect or not. But in two way ANOVA, we consider a variable which is influenced
by various levels of two factors. It is to test whether these factors have any impact on the
variable.
Consider the case of the study of the effect of m varieties of training methods and n
varieties of foods on the intelligence quotient (IQ) of mn students.
Let $x_{ij}$ be the intelligence quotient of the student with the i-th training and the
j-th food.
Suppose the mn students are divided into m groups, where the i-th group receives the i-th
training programme for i = 1, 2, ..., m, and let these students also be divided into n
groups, where the j-th group receives the j-th food for j = 1, 2, ..., n. Let the data on
IQ be listed as follows:
                      Training group
                  1        2       ...     m
Food group   1    x_{11}   x_{21}  ...     x_{m1}
             2    x_{12}   x_{22}  ...     x_{m2}
             :    :        :       ...     :
             n    x_{1n}   x_{2n}  ...     x_{mn}
Here the hypotheses are
$$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_m$$
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_n$$
where $\alpha_i$ is the mean IQ due to the i-th training and $\beta_j$ is the mean IQ due to
the j-th food.
The steps for two way ANOVA are listed as follows:
Step 1: Obtain the mean of each of the m columns (trainings). Let them be
$\bar{x}_{1*}, \bar{x}_{2*}, \ldots, \bar{x}_{m*}$.
Obtain the mean of each of the n rows (foods). Let them be
$\bar{x}_{*1}, \bar{x}_{*2}, \ldots, \bar{x}_{*n}$.
Step 2: The grand average $\bar{X}$ of all the mn sample items is found as
$$\bar{X} = \frac{1}{nm}\sum_{i}\sum_{j} x_{ij}$$
Step 3: Take the differences between the row means and the grand average, and between the
column means and the grand average.
Step 4: Sum of squares between the treatments of food,
$$SSR = m\sum_{j=1}^{n}\left(\bar{x}_{*j} - \bar{X}\right)^2$$
Step 5: Mean sum of squares due to the treatments of various foods is
$$MSR = \frac{SSR}{n-1}.$$
Step 6: Sum of squares between the treatments of training,
$$SSC = n\sum_{i=1}^{m}\left(\bar{x}_{i*} - \bar{X}\right)^2$$
Step 7: Mean sum of squares due to the treatments of various trainings is
$$MSC = \frac{SSC}{m-1}.$$
Step 8: Total sum of squares,
$$SST = \sum_{i}\sum_{j}\left(x_{ij} - \bar{X}\right)^2$$
Step 9: Error sum of squares,
$$SSE = \sum_{i}\sum_{j}\left(x_{ij} - \bar{x}_{i*} - \bar{x}_{*j} + \bar{X}\right)^2$$
Step 10: Error mean sum of squares,
$$MSE = \frac{1}{(m-1)(n-1)}\sum_{i}\sum_{j}\left(x_{ij} - \bar{x}_{i*} - \bar{x}_{*j} + \bar{X}\right)^2$$
Calculate the F-ratio $F_R = \frac{MSR}{MSE}$, which follows $F_{n-1,\,(m-1)(n-1)}$.
Step 11: Compare the calculated F with the table value of the F-distribution with
$(n-1, (m-1)(n-1))$ degrees of freedom.
Step 12: If the calculated value of F is less than the table value, ACCEPT the
hypothesis that the effects of the foods are equal.
Calculate the F-ratio $F_C = \frac{MSC}{MSE}$, which follows $F_{m-1,\,(m-1)(n-1)}$.
Step 13: Compare the calculated F with the table value of the F-distribution with
$(m-1, (m-1)(n-1))$ degrees of freedom.
Step 14: If the calculated value of F is less than the table value, ACCEPT the
hypothesis that the effects of the trainings are equal.
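The steps above can be sketched in Python for a table with one observation per cell. This is an illustration only: the function name `two_way_anova`, the dictionary keys and the IQ scores are hypothetical (the data are invented for the example). Rows stand for the n foods and columns for the m trainings.

```python
def two_way_anova(table):
    """Two way ANOVA, one observation per cell.

    table[j][i] is the observation for row (food) j and column (training) i.
    """
    n = len(table)         # number of rows (foods)
    m = len(table[0])      # number of columns (trainings)
    grand = sum(sum(row) for row in table) / (m * n)
    row_means = [sum(row) / m for row in table]
    col_means = [sum(table[j][i] for j in range(n)) / n for i in range(m)]

    ssr = m * sum((r - grand) ** 2 for r in row_means)    # between foods
    ssc = n * sum((c - grand) ** 2 for c in col_means)    # between trainings
    sse = sum((table[j][i] - row_means[j] - col_means[i] + grand) ** 2
              for j in range(n) for i in range(m))        # error

    msr = ssr / (n - 1)
    msc = ssc / (m - 1)
    mse = sse / ((m - 1) * (n - 1))
    return {"SSR": ssr, "SSC": ssc, "SSE": sse,
            "F_R": msr / mse, "df_R": (n - 1, (m - 1) * (n - 1)),
            "F_C": msc / mse, "df_C": (m - 1, (m - 1) * (n - 1))}

# Hypothetical IQ scores: 4 foods (rows) by 3 trainings (columns).
iq = [
    [100, 104, 108],
    [102, 109, 110],
    [98, 105, 112],
    [104, 106, 114],
]

result = two_way_anova(iq)
for name, value in result.items():
    print(name, value)
```

Each F-ratio is then compared with the table value of the F-distribution for its own degrees of freedom, exactly as in Steps 11 to 14.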
EXERCISES
11. Define ANOVA.
12. What are the assumptions in ANOVA?
13. Differentiate between one way and two way ANOVA.
14. What is a control chart?
15. Explain the procedures of one way ANOVA.
16. Explain the situation where we are using two way ANOVA.
17. Explain the procedures of two way ANOVA.
18. The following table shows the lives in hours of four batches of electric
bulbs.
Batch 1: 1600 1610 1680 1700 1720 1800
Batch 2: 1710 1590 1610 1700 1670 1600
Batch 3: 1650 1580 1680 1670 1630 1560
Batch 4: 1750 1680 1570 1680 1620 1660
Perform an ANOVA on these values and show that a significance test does
not reject their homogeneity.
********************
