
Introduction to Linear Regression

and Correlation Analysis


 The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.
 The population correlation coefficient, denoted by ρ, and the sample correlation coefficient, denoted by 'r', can take on any value from -1 to 1.

Methods of Correlation Analysis


 Scatter diagram method

 Karl Pearson’s Correlation Coefficient

 Spearman’s Rank Correlation Method


Scatter Plots and Correlation

 A scatter plot (or scatter diagram) is used to show the relationship between two variables
 Correlation analysis is used to measure the strength of the association (linear relationship) between two variables
 It is only concerned with the strength of the relationship
 No causal effect is implied
Scatter Plot Examples
[Figure: four scatter plots — two showing linear relationships (positive and negative) and two showing curvilinear relationships]
Scatter Plot Examples (continued)
[Figure: four scatter plots — two showing strong relationships and two showing weak relationships]
Scatter Plot Examples (continued)
[Figure: scatter plot showing no relationship between x and y]
Correlation Coefficient (continued)

 The population correlation coefficient ρ (rho) measures the strength of the association between the variables
 The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations
Features of ρ and r
 Unit free
 Range between -1 and 1
 The closer to -1, the stronger the negative
linear relationship
 The closer to 1, the stronger the positive
linear relationship
 The closer to 0, the weaker the linear
relationship
Examples of Approximate r Values
[Figure: five scatter plots with approximate correlation values r = -1, r = -0.6, r = 0, r = +0.3, and r = +1]
Calculating the Correlation Coefficient by using Karl Pearson's Method
To measure the intensity of the relationship between the variables, Karl Pearson proposed a formula known as Karl Pearson's correlation coefficient:

r = Σ(x − x̄)(y − ȳ) / √[ Σ(x − x̄)² · Σ(y − ȳ)² ]

or the algebraic equivalent:

r = [ n Σxy − Σx Σy ] / √{ [ n Σx² − (Σx)² ] [ n Σy² − (Σy)² ] }

where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
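
As a minimal illustration of this formula, the sketch below computes r directly from paired samples; the function name and the example data are mine, not part of the original slides.

```python
import math

def pearson_r(x, y):
    """Karl Pearson's correlation coefficient from paired samples."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den

# Hypothetical paired observations
print(round(pearson_r([2, 4, 6, 8, 10], [3, 5, 8, 9, 12]), 4))
```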
Assumptions of using Pearson's Correlation Coefficient
 Pearson's correlation coefficient is appropriate to calculate when both variables 'x' and 'y' are measured on an interval or a ratio scale

 Both variables are normally distributed, and there is a linear relationship between these variables

 There is a cause-and-effect relationship between the two variables that influences the distribution of both variables.
Probable Error and Standard Error of the Coefficient of Correlation
 By using the probable error we can find whether the obtained correlation coefficient is significant or not

 P.E.(r) = 0.6745 × (1 − r²) / √n

 If |r| < 6 × P.E.(r), then the value of 'r' is not significant
 If |r| > 6 × P.E.(r), then the value of 'r' is significant
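
A small sketch of this significance check, under the assumption that the probable-error formula above applies; the function names are mine.

```python
import math

def probable_error(r, n):
    """Probable error of a sample correlation coefficient (formula above)."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

def is_significant(r, n):
    """Rule of thumb from the slide: |r| > 6 * P.E.(r) => 'r' is significant."""
    return abs(r) > 6 * probable_error(r, n)

# Hypothetical values: r = 0.76 from a sample of 10 pairs
print(probable_error(0.76, 10), is_significant(0.76, 10))
```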
Coefficient of Determination
 The coefficient of determination is denoted by r².
 It always has a value between 0 and 1
 By using the coefficient of determination we can find the strength of the relationship between the variables, but we lose the information about its direction
 If r² = 0, then no variation in y can be explained by the variable x
 If r² = 1, then the values of y are completely explained by x
Examples of Approximate R² Values
 R² = 1: Perfect linear relationship between x and y; 100% of the variation in y is explained by variation in x
[Figure: two scatter plots with all points falling exactly on a line, R² = 1]
Examples of Approximate R² Values (continued)
 0 < R² < 1: Weaker linear relationship between x and y; some but not all of the variation in y is explained by variation in x
[Figure: two scatter plots with points loosely clustered around a line]
Examples of Approximate R² Values (continued)
 R² = 0: No linear relationship between x and y; the value of y does not depend on x (none of the variation in y is explained by variation in x)
[Figure: scatter plot with a horizontal regression line, R² = 0]
Example:
 The sales manager of a copier company wants to determine whether there is a relationship between the number of sales calls made in a month and the number of copiers sold in that month. The manager selects a random sample of 10 representatives and determines the number of sales calls each representative made last month and the number of copiers sold. The sample information is given below.
Sales calls and Copier sales
Sales Person   Number of sales calls   No. of copiers sold
Medha          20                      30
Mahathi        40                      60
Nikhil         20                      40
Sai Ram        30                      60
Sathya         10                      30
Sashi          10                      40
Krishna        20                      40
Pavan          20                      50
Raman          20                      30
Hari           30                      70
Calculation Example
X      Y      XY      X²      Y²
20     30     600     400     900
40     60     2400    1600    3600
20     40     800     400     1600
30     60     1800    900     3600
10     30     300     100     900
10     40     400     100     1600
20     40     800     400     1600
20     50     1000    400     2500
20     30     600     400     900
30     70     2100    900     4900
ΣX=220  ΣY=450  ΣXY=10800  ΣX²=5600  ΣY²=22100
Calculation Example (continued)

r = [ n Σxy − Σx Σy ] / √{ [ n Σx² − (Σx)² ] [ n Σy² − (Σy)² ] }

  = [ 10(10800) − (220)(450) ] / √{ [ 10(5600) − (220)² ] [ 10(22100) − (450)² ] }

  = 0.759014

[Figure: scatter plot of copiers sold (Y) against sales calls (X) showing an upward-sloping pattern]

There is a positive relationship between the sales calls and the sales of the copiers.
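
For a quick cross-check of this result, a sketch using NumPy's built-in correlation; the variable names are mine, the data is from the table above.

```python
import numpy as np

# Sales calls (X) and copiers sold (Y) from the table above
calls = np.array([20, 40, 20, 30, 10, 10, 20, 20, 20, 30])
sold = np.array([30, 60, 40, 60, 30, 40, 40, 50, 30, 70])

# np.corrcoef returns the 2x2 correlation matrix; element [0, 1] is r
r = np.corrcoef(calls, sold)[0, 1]
print(round(r, 6))  # approximately 0.759
```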
Calculation Example II
Tree Height (y)   Trunk Diameter (x)   xy     y²      x²
35                8                    280    1225    64
49                9                    441    2401    81
27                7                    189    729     49
33                6                    198    1089    36
60                13                   780    3600    169
21                7                    147    441     49
45                11                   495    2025    121
51                12                   612    2601    144
Σy=321            Σx=73                Σxy=3142  Σy²=14111  Σx²=713
Calculation Example (continued)

r = [ n Σxy − Σx Σy ] / √{ [ n Σx² − (Σx)² ] [ n Σy² − (Σy)² ] }

  = [ 8(3142) − (73)(321) ] / √{ [ 8(713) − (73)² ] [ 8(14111) − (321)² ] }

  = 0.886

[Figure: scatter plot of Tree Height (y) against Trunk Diameter (x) with an upward-sloping pattern]

r = 0.886 → relatively strong positive linear association between x and y
Excel Output

Excel Correlation Output
Tools / Data Analysis / Correlation…

                  Tree Height   Trunk Diameter
Tree Height       1
Trunk Diameter    0.886231      1

The value 0.886231 is the correlation between Tree Height and Trunk Diameter
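
The same output can be reproduced outside Excel; a minimal sketch with pandas, assuming the tree data is typed in by hand (the column labels are mine).

```python
import pandas as pd

trees = pd.DataFrame({
    "Tree Height": [35, 49, 27, 33, 60, 21, 45, 51],
    "Trunk Diameter": [8, 9, 7, 6, 13, 7, 11, 12],
})

# Pairwise Pearson correlation matrix, analogous to Excel's Correlation tool
print(trees.corr())
```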
Ex: Pepsi Cola is studying the effect of its last advertising campaign. People chosen at random were called and asked how many cans of Pepsi Cola they had bought (X) in the past week and how many advertisements (Y) they had either read or seen in the past week.

X : 3   7   4   2   0   4   1   2
Y : 11  18  9   4   7   6   3   8

Calculate the coefficient of correlation and the coefficient of determination.
An economist wanted to find out if there was any
relationship between the unemployment rate in a country
and its inflation rate . Data gathered from 7 countries for
the year 2004 are given below.

Country   Unemployment rate (%)   Inflation rate (%)
A         4.0                     3.2
B         8.5                     8.2
C         5.5                     9.4
D         0.8                     5.1
E         7.3                     10.1
F         5.8                     7.8
G         2.1                     4.7
Find the degree of linear association between a country’s
unemployment and its level of inflation.
Spearman Rank Correlation (ρ)
 Correlation between the ranks of two individuals is known as rank correlation
 To measure the intensity of the relationship between the variables (having ordinal data), we use Spearman's rank correlation
 Spearman's rank correlation lies between +1 and -1
 If the rank correlation coefficient is +1 there is perfect positive correlation, and if it is -1 there is perfect negative correlation
Spearman's Rank Correlation is given by

ρ(x, y) = 1 − [ 6 Σd² ] / [ n(n² − 1) ]

where d = Rx − Ry and n is the number of pairs of observations.

When ranks are tied, we add a correction factor to Σd², giving

ρ(x, y) = 1 − [ 6 ( Σd² + correction factor ) ] / [ n(n² − 1) ]

where the correction factor is m(m² − 1)/12, and m is the number of times an item is repeated.
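
A minimal sketch of this rank-correlation computation using SciPy, which applies the tie correction internally; the sample ranks are hypothetical.

```python
from scipy.stats import spearmanr

# Hypothetical ranks assigned by two judges to the same six items
rank_x = [1, 2, 3, 4, 5, 6]
rank_y = [2, 1, 4, 3, 6, 5]

rho, p_value = spearmanr(rank_x, rank_y)  # handles tied ranks internally
print(round(rho, 3))
```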
Ten Competitors in a beauty contest are ranked by
three judges in the following order.

Judge I :1 6 5 10 3 2 4 9 7 8
Judge II :3 5 8 4 7 10 2 1 6 9
Judge III:6 4 9 8 1 2 3 10 5 7

Determine which pair of judges has the nearest


approach to common tastes in beauty.
Judge 1   Judge 2   Judge 3   D1 = R1 − R2   D2 = R1 − R3   D3 = R2 − R3   (D1)²   (D2)²   (D3)²
A financial analyst wanted to find out whether inventory turnover influences a company's earnings per share (in %). A random sample of 7 companies listed on a stock exchange was selected and the following data was obtained for each.

Company   Inventory turnover (No. of times)   Earnings per share (%)
A         4                                   11
B         5                                   9
C         7                                   13
D         8                                   7
E         6                                   13
F         3                                   8
G         5                                   8
Find the strength of association between inventory
turnover and earnings per share. Interpret the result.
Coefficient of Determination:

 The squared value of the coefficient of correlation is called the coefficient of determination.

 It indicates "the proportion of the total variability of the dependent variable that is accounted for or explained by the independent variable".

 It always lies between 0 and 1.

 The following table gives indices of industrial production and the number of registered unemployed people (in lakh). Calculate the value of the correlation coefficient.

Year                  1991  1992  1993  1994  1995  1996  1997  1998
Index of production   100   102   104   107   105   112   103   99
No. of unemployed     15    12    13    11    12    12    19    26
Introduction to Regression Analysis

 Regression analysis is used to:
 Predict the value of a dependent variable based on the value of at least one independent variable
 Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable
Simple Linear Regression Model

 Only one independent variable, x
 Relationship between x and y is described by a linear function
 Changes in y are assumed to be caused by changes in x
Types of Regression Models
[Figure: four scatter plots illustrating a positive linear relationship, a negative linear relationship, a relationship that is not linear, and no relationship]
Population Linear Regression

The population regression model:

y = β₀ + β₁x + ε

where y is the dependent variable, β₀ is the population y-intercept, β₁ is the population slope coefficient, x is the independent variable, and ε is the random error term (residual). β₀ + β₁x is the linear component and ε is the random error component.
Linear Regression Assumptions

 Error values (ε) are statistically independent
 Error values are normally distributed for any given value of x
 The probability distribution of the errors is normal
 The probability distribution of the errors has constant variance
 The underlying relationship between the x variable and the y variable is linear
Population Linear Regression (continued)

[Figure: scatter of points about the line y = β₀ + β₁x + ε, labelling the observed value of y for xᵢ, the predicted value of y for xᵢ, the random error εᵢ for that x value, the slope β₁, and the intercept β₀]
Estimated Regression Model
The sample regression line provides an estimate of the population regression line:

ŷᵢ = b₀ + b₁x

where ŷᵢ is the estimated (or predicted) y value, b₀ is the estimate of the regression intercept, b₁ is the estimate of the regression slope, and x is the independent variable.

The individual random error terms eᵢ have a mean of zero.
Least Squares Criterion

 b₀ and b₁ are obtained by finding the values of b₀ and b₁ that minimize the sum of the squared residuals:

Σe² = Σ(y − ŷ)² = Σ(y − (b₀ + b₁x))²
The Least Squares Equation
 The formulas for b₁ and b₀ are:

b₁ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

or the algebraic equivalent:

b₁ = [ Σxy − (Σx Σy)/n ] / [ Σx² − (Σx)²/n ]

and

b₀ = ȳ − b₁x̄
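
As a sketch of these formulas in code (the function and variable names are mine, not from the slides):

```python
def least_squares_fit(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
         sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data
print(least_squares_fit([1, 2, 3, 4, 5], [2.1, 4.3, 6.2, 7.9, 10.1]))
```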
Interpretation of the Slope and the Intercept

 b₀ is the estimated average value of y when the value of x is zero

 b₁ is the estimated change in the average value of y as a result of a one-unit change in x
Finding the Least Squares Equation

 The coefficients b₀ and b₁ will usually be found using computer software, such as Excel or Minitab

 Other regression measures will also be computed as part of computer-based regression analysis
Simple Linear Regression Example

 A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
 A random sample of 10 houses is selected
 Dependent variable (y) = house price in $1000s
 Independent variable (x) = square feet
Sample Data for House Price Model
House Price in $1000s (y)   Square Feet (x)
245    1400
312    1600
279    1700
308    1875
199    1100
219    1550
405    2350
324    2450
319    1425
255    1700
Regression Using Excel
 Tools / Data Analysis / Regression
Excel Output
Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

ANOVA
             df   SS           MS           F        Significance F
Regression   1    18934.9348   18934.9348   11.0848  0.01039
Residual     8    13665.5652   1708.1957
Total        9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580
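
This output can also be obtained without Excel; a minimal sketch with statsmodels OLS (the variable names are mine), which should approximately reproduce the coefficients above.

```python
import statsmodels.api as sm

sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

X = sm.add_constant(sqft)       # adds the intercept column
model = sm.OLS(price, X).fit()  # ordinary least squares
print(model.params)             # intercept ~ 98.25, slope ~ 0.10977
print(model.rsquared)           # ~ 0.58
```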
Graphical Presentation
 House price model: scatter plot and regression line

[Figure: scatter plot of House Price ($1000s) against Square Feet with the fitted line; intercept = 98.248, slope = 0.10977]

house price = 98.24833 + 0.10977 (square feet)
Interpretation of the Intercept, b₀

house price = 98.24833 + 0.10977 (square feet)

 b₀ is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values)
 Here, no houses had 0 square feet, so b₀ = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet
Interpretation of the Slope Coefficient, b₁

house price = 98.24833 + 0.10977 (square feet)

 b₁ measures the estimated change in the average value of Y as a result of a one-unit change in X
 Here, b₁ = 0.10977 tells us that the average value of a house increases by 0.10977($1000) = $109.77, on average, for each additional square foot of size
Example: House Prices

Estimated regression equation:
house price = 98.25 + 0.1098 (sq. ft.)

House Price in $1000s (y)   Square Feet (x)
245    1400
312    1600
279    1700
308    1875
199    1100
219    1550
405    2350
324    2450
319    1425
255    1700

Predict the price for a house with 2000 square feet
Example: House Prices (continued)
Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098 (2000)
            = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850
Example: Market Trend
 In finance, it is of interest to look at the relationship between Y, a stock's average return, and X, the overall market return. The slope coefficient computed by linear regression is called the stock's beta by investment analysts. A beta greater than 1 indicates that the stock is relatively sensitive to changes in the market; a beta less than 1 indicates that the stock is relatively insensitive. For the following data, compute the beta and suggest the market trend.

Overall Market Return % (X)   Average Return % (Y)
10    11
12    15
8     3
15    18
9     10
11    12
8     6
10    7
13    18
11    13
Properties of regression lines and their coefficients:
1. The correlation coefficient is the geometric mean of the two regression coefficients.
2. The sign of the correlation coefficient is the same as that of the regression coefficients.
3. Regression coefficients are independent of a change of origin but not of scale.
Problem
 The following data give the ages and Blood
Pressure of 10 women. Find
1. Correlation Coefficient between age and BP
2. Determine the least square regression
equation of BP on age
3. Estimate the BP of a woman whose age is
45
Data and Calculations
 Age (x) and Blood Pressure (y)

x (Age)   y (BP)   x²      y²       xy
56        147      3136    21609    8232
42        125      1764    15625    5250
36        118      1296    13924    4248
47        128      2209    16384    6016
49        145      2401    21025    7105
42        140      1764    19600    5880
60        155      3600    24025    9300
72        160      5184    25600    11520
63        149      3969    22201    9387
55        150      3025    22500    8250
Σx=522    Σy=1417  Σx²=28348  Σy²=202493  Σxy=75188
 Correlation coefficient:

r = [ n Σxy − Σx Σy ] / √{ [ n Σx² − (Σx)² ] [ n Σy² − (Σy)² ] }

  = [ 10(75188) − (522)(1417) ] / √{ [ 10(28348) − (522)² ] [ 10(202493) − (1417)² ] }

r = 0.891679
 Regression equation of y on x:

ŷ = b₀ + b₁x

b₁ = [ Σxy − (Σx Σy)/n ] / [ Σx² − (Σx)²/n ]   and   b₀ = ȳ − b₁x̄

b₁ = 1.11, b₀ = 83.755

 The regression equation is y = 83.755 + 1.11x
 When x = 45, y = 83.755 + 1.11(45) = 133.705
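
A quick sketch that re-runs this worked example with the data above, using the least-squares formulas (the variable names are mine):

```python
import numpy as np

age = np.array([56, 42, 36, 47, 49, 42, 60, 72, 63, 55])
bp = np.array([147, 125, 118, 128, 145, 140, 155, 160, 149, 150])

r = np.corrcoef(age, bp)[0, 1]  # ~ 0.8917
b1 = np.sum((age - age.mean()) * (bp - bp.mean())) / np.sum((age - age.mean()) ** 2)
b0 = bp.mean() - b1 * age.mean()  # ~ 83.76
print(r, b0, b1, b0 + b1 * 45)    # predicted BP at age 45 ~ 133.7
```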
Multiple Regression Analysis
 A linear regression equation with more than one independent variable is called a multiple regression model.

The linear regression equation with k independent variables takes the form:

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ... + βₖxₖ + ε

where
 y is the value of the dependent variable to be estimated
 β₀ is a constant
 β₁, β₂, ..., βₖ are the regression coefficients associated with each of the k independent variables
 ε is the random error due to chance

Let the fitted linear regression equation be

ŷ = b₀ + b₁x₁ + b₂x₂ + ... + bₖxₖ

which minimizes the sum of squared errors (SSE) = Σ(y − ŷ)²

where
 ŷ is the estimated value of the dependent variable y
 b₁, b₂, b₃, ..., bₖ are partial regression coefficients obtained by the principle of least squares.
 Let us consider the case of two independent variables and one dependent variable.

The multiple linear regression model involving two independent variables is:

y = β₀ + β₁x₁ + β₂x₂ + ε

where
 y is the dependent variable
 x₁ and x₂ are the independent variables
 ε is the random error due to chance
 β₀ is the y-intercept
 β₁, β₂ are the regression coefficients
Let the fitted multiple linear regression equation be

ŷ = b₀ + b₁x₁ + b₂x₂   or   ŷ = b₀ + b_y1.2 x₁ + b_y2.1 x₂

where
 ŷ is the estimated value of the dependent variable y
 x₁, x₂ are the independent variables
 b₀, b₁, b₂ are unknown constants determined by the principle of least squares, which minimizes the sum of squared errors (SSE) = Σ(y − ŷ)²

By solving the following normal equations the values of b₀, b₁, b₂ can be determined:

Σy   = n b₀ + b_y1.2 Σx₁ + b_y2.1 Σx₂
Σyx₁ = b₀ Σx₁ + b_y1.2 Σx₁² + b_y2.1 Σx₁x₂
Σyx₂ = b₀ Σx₂ + b_y1.2 Σx₁x₂ + b_y2.1 Σx₂²
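
These three normal equations form a small linear system in b₀, b_y1.2, b_y2.1; a minimal sketch solving it numerically with NumPy on hypothetical data:

```python
import numpy as np

# Hypothetical observations of y, x1, x2
y = np.array([10.0, 12.0, 15.0, 18.0, 21.0])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

n = len(y)
# Coefficient matrix and right-hand side of the normal equations above
A = np.array([
    [n, x1.sum(), x2.sum()],
    [x1.sum(), (x1 ** 2).sum(), (x1 * x2).sum()],
    [x2.sum(), (x1 * x2).sum(), (x2 ** 2).sum()],
])
rhs = np.array([y.sum(), (y * x1).sum(), (y * x2).sum()])

b0, b1, b2 = np.linalg.solve(A, rhs)
print(b0, b1, b2)
```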
Equivalently, write the fitted equation and its mean form:

y = b₀ + b_y1.2 x₁ + b_y2.1 x₂      ----(1)
ȳ = b₀ + b_y1.2 x̄₁ + b_y2.1 x̄₂      ----(2)

(1) − (2):
(y − ȳ) = b_y1.2 (x₁ − x̄₁) + b_y2.1 (x₂ − x̄₂), i.e. Y = b_y1.2 X₁ + b_y2.1 X₂

where Y = y − ȳ, X₁ = x₁ − x̄₁, X₂ = x₂ − x̄₂. The partial regression coefficients are then

b_y1.2 = [ (ΣYX₁)(ΣX₂²) − (ΣYX₂)(ΣX₁X₂) ] / [ (ΣX₁²)(ΣX₂²) − (ΣX₁X₂)² ]

b_y2.1 = [ (ΣYX₂)(ΣX₁²) − (ΣYX₁)(ΣX₁X₂) ] / [ (ΣX₁²)(ΣX₂²) − (ΣX₁X₂)² ]
Relationship between partial regression coefficients and correlation coefficients:

b_y1.2 = [ (r_y1 − r_y2 · r_12) / (1 − r_12²) ] · (σ_y / σ₁)

b_y2.1 = [ (r_y2 − r_y1 · r_12) / (1 − r_12²) ] · (σ_y / σ₂)

where

r_y1 = ΣYX₁ / √(ΣY² · ΣX₁²)   (the correlation between y and x₁)
r_y2 = ΣYX₂ / √(ΣY² · ΣX₂²)   (the correlation between y and x₂)
r_12 = ΣX₁X₂ / √(ΣX₁² · ΣX₂²)   (the correlation between x₁ and x₂)
 A marketing manager of a company wants to predict demand for the product. He strongly believes that demand is highly influenced by the annual average price of the product (in units) and advertising expenditure (Rs in lakh). He has collected past data on the effect of these factors on demand, given below:

Y    4   6   7   9   13  15
X1   15  12  8   6   4   3
X2   30  24  20  14  10  4
• The following results are obtained from measurements on length (in mm), volume (in cc) and weight (in gm) of 300 eggs:

x̄₁ = 55.95   x̄₂ = 51.48   ȳ = 56.03
σ₁ = 2.26    σ₂ = 4.39    σ_y = 4.41
r_y1 = 0.578   r_y2 = 0.581   r_12 = 0.974

Obtain the linear regression equation of egg weight on its length and volume. Hence estimate the weight of an egg whose length is 58 mm and volume is 52.5 cc.
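
A sketch of the correlation-based formulas from the earlier slide, which apply to summary data like that in this exercise; the function name and the sample values shown are hypothetical.

```python
def partial_regression_coeffs(ry1, ry2, r12, sigma_y, sigma_1, sigma_2):
    """b_y1.2 and b_y2.1 from simple correlations and standard deviations."""
    b_y1_2 = ((ry1 - ry2 * r12) / (1 - r12 ** 2)) * (sigma_y / sigma_1)
    b_y2_1 = ((ry2 - ry1 * r12) / (1 - r12 ** 2)) * (sigma_y / sigma_2)
    return b_y1_2, b_y2_1

# Hypothetical summary statistics; the intercept would follow from the means:
# b0 = y_bar - b_y1_2 * x1_bar - b_y2_1 * x2_bar
print(partial_regression_coeffs(0.6, 0.5, 0.3, 4.0, 2.0, 3.0))
```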
 The Federal Reserve is performing a preliminary
study to determine the relationship between
certain economic indicators and annual
percentage change in the gross national product
(GNP). Two such indicators being examined are
the amount of the federal government’s deficit (in
billions of dollars) and the Dow Jones Industrial
Average (the mean value over the year). Data for
6 years follow:
Change in GNP 2.5 -1.0 4.0 1.0 1.5 3.0
Federal Deficit 100.0 400.0 120.0 200.0 180.0 80.0
Dow Jones 2850 2100 3300 2400 2550 2700
i. Calculate the least squares equation that best
describes the data.
ii. What % change in GNP would be expected in a year
in which the federal deficit was $240 billion and the
mean Dow Jones value was 3000?
 Multiple correlation analysis:

It is a measure of association between a dependent variable and several independent variables taken together.

The coefficient of multiple correlation is given by:

R_y.12 = √[ ( r_y1² + r_y2² − 2 r_y1 r_y2 r_12 ) / ( 1 − r_12² ) ]

Its value always lies between 0 and 1.
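
A short sketch of this coefficient in code (the function name and sample values are mine):

```python
import math

def multiple_correlation(ry1, ry2, r12):
    """Coefficient of multiple correlation R_y.12 from simple correlations."""
    return math.sqrt((ry1 ** 2 + ry2 ** 2 - 2 * ry1 * ry2 * r12) / (1 - r12 ** 2))

print(multiple_correlation(0.6, 0.5, 0.3))  # hypothetical values
```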


 Coefficient of multiple determination:

It is the proportion of the total variation in the values of the dependent variable y that is accounted for or explained by the independent variables in the multiple regression model.

 The square of the coefficient of multiple correlation is called the coefficient of multiple determination.
