
Introduction to Linear Regression

and Correlation Analysis


 The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.
 The population correlation coefficient, denoted by ρ, and the sample correlation coefficient, denoted by 'r', can take on any value from -1 to 1.

Methods of Correlation Analysis


 Scatter diagram method

 Karl Pearson’s Correlation Coefficient

 Spearman’s Rank Correlation Method


Scatter Plots and Correlation

 A scatter plot (or scatter diagram) is used to show the relationship between two variables
 Correlation analysis is used to measure the strength of the association (linear relationship) between two variables
 It is only concerned with the strength of the relationship
 No causal effect is implied
Scatter Plot Examples
[Figure: four scatter plots — two showing linear relationships (positive and negative) and two showing curvilinear relationships]
Scatter Plot Examples (continued)
[Figure: four scatter plots — two showing strong relationships and two showing weak relationships]
Scatter Plot Examples (continued)
[Figure: scatter plot showing no relationship between x and y]
Correlation Coefficient (continued)

 The population correlation coefficient ρ (rho) measures the strength of the association between the variables
 The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations
Features of ρ and r
 Unit free
 Range between -1 and 1
 The closer to -1, the stronger the negative
linear relationship
 The closer to 1, the stronger the positive
linear relationship
 The closer to 0, the weaker the linear
relationship
Examples of Approximate r Values
[Figure: five scatter plots with approximate correlation values r = -1, r = -0.6, r = 0, r = +0.3, and r = +1]
Calculating the Correlation Coefficient by using Karl Pearson's Method
To measure the intensity of the relationship between the variables, Karl Pearson proposed a formula known as Karl Pearson's correlation coefficient:

r = Σ(x − x̄)(y − ȳ) / √[ Σ(x − x̄)² · Σ(y − ȳ)² ]

or the algebraic equivalent:

r = [ n Σxy − Σx Σy ] / √{ [ n Σx² − (Σx)² ] [ n Σy² − (Σy)² ] }

where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
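
As a minimal illustration of this formula, the sketch below computes r directly from paired samples; the function name and the example data are mine, not part of the original slides.

```python
import math

def pearson_r(x, y):
    """Karl Pearson's correlation coefficient from paired samples."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den

# Hypothetical paired observations
print(round(pearson_r([2, 4, 6, 8, 10], [3, 5, 8, 9, 12]), 4))
```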
Assumptions of using Pearson's Correlation Coefficient
 Pearson's correlation coefficient is appropriate to calculate when both variables 'x' and 'y' are measured on an interval or a ratio scale

 Both variables are normally distributed, and there is a linear relationship between these variables

 There is a cause-and-effect relationship between the two variables that influences the distribution of both variables.
Probable Error and Standard Error of the Coefficient of Correlation
 By using the probable error we can find whether the obtained correlation coefficient is significant or not

 P.E.(r) = 0.6745 × (1 − r²) / √n

 If |r| < 6 × P.E.(r), then the value of 'r' is not significant
 If |r| > 6 × P.E.(r), then the value of 'r' is significant
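
A small sketch of this significance check, under the assumption that the probable-error formula above applies; the function names are mine.

```python
import math

def probable_error(r, n):
    """Probable error of a sample correlation coefficient (formula above)."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

def is_significant(r, n):
    """Rule of thumb from the slide: |r| > 6 * P.E.(r) => 'r' is significant."""
    return abs(r) > 6 * probable_error(r, n)

# Hypothetical values: r = 0.76 from a sample of 10 pairs
print(probable_error(0.76, 10), is_significant(0.76, 10))
```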
Coefficient of Determination
 The coefficient of determination is denoted by r².
 It always has a value between 0 and 1
 By using the coefficient of determination we can find the strength of the relationship between the variables, but we lose the information about its direction
 If r² = 0, then no variation in y can be explained by the variable x
 If r² = 1, then the values of y are completely explained by x
Examples of Approximate R² Values
 R² = 1: Perfect linear relationship between x and y; 100% of the variation in y is explained by variation in x
[Figure: two scatter plots with all points falling exactly on a line, R² = 1]
Examples of Approximate R² Values (continued)
 0 < R² < 1: Weaker linear relationship between x and y; some but not all of the variation in y is explained by variation in x
[Figure: two scatter plots with points loosely clustered around a line]
Examples of Approximate R² Values (continued)
 R² = 0: No linear relationship between x and y; the value of y does not depend on x (none of the variation in y is explained by variation in x)
[Figure: scatter plot with a horizontal regression line, R² = 0]
Example:
 The sales manager of a copier company wants to determine whether there is a relationship between the number of sales calls made in a month and the number of copiers sold in that month. The manager selects a random sample of 10 representatives and determines the number of sales calls each representative made last month and the number of copiers sold. The sample information is given below.
Sales calls and Copier sales
Sales Person   Number of sales calls   No. of copiers sold
Medha          20                      30
Mahathi        40                      60
Nikhil         20                      40
Sai Ram        30                      60
Sathya         10                      30
Sashi          10                      40
Krishna        20                      40
Pavan          20                      50
Raman          20                      30
Hari           30                      70
Calculation Example
X      Y      XY      X²      Y²
20     30     600     400     900
40     60     2400    1600    3600
20     40     800     400     1600
30     60     1800    900     3600
10     30     300     100     900
10     40     400     100     1600
20     40     800     400     1600
20     50     1000    400     2500
20     30     600     400     900
30     70     2100    900     4900
ΣX=220  ΣY=450  ΣXY=10800  ΣX²=5600  ΣY²=22100
Calculation Example (continued)

r = [ n Σxy − Σx Σy ] / √{ [ n Σx² − (Σx)² ] [ n Σy² − (Σy)² ] }

  = [ 10(10800) − (220)(450) ] / √{ [ 10(5600) − (220)² ] [ 10(22100) − (450)² ] }

  = 0.759014

[Figure: scatter plot of copiers sold (Y) against sales calls (X) showing an upward-sloping pattern]

There is a positive relationship between the sales calls and the sales of the copiers.
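
For a quick cross-check of this result, a sketch using NumPy's built-in correlation; the variable names are mine, the data is from the table above.

```python
import numpy as np

# Sales calls (X) and copiers sold (Y) from the table above
calls = np.array([20, 40, 20, 30, 10, 10, 20, 20, 20, 30])
sold = np.array([30, 60, 40, 60, 30, 40, 40, 50, 30, 70])

# np.corrcoef returns the 2x2 correlation matrix; element [0, 1] is r
r = np.corrcoef(calls, sold)[0, 1]
print(round(r, 6))  # approximately 0.759
```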
Calculation Example II
Tree Height (y)   Trunk Diameter (x)   xy     y²      x²
35                8                    280    1225    64
49                9                    441    2401    81
27                7                    189    729     49
33                6                    198    1089    36
60                13                   780    3600    169
21                7                    147    441     49
45                11                   495    2025    121
51                12                   612    2601    144
Σy=321            Σx=73                Σxy=3142  Σy²=14111  Σx²=713
Calculation Example (continued)

r = [ n Σxy − Σx Σy ] / √{ [ n Σx² − (Σx)² ] [ n Σy² − (Σy)² ] }

  = [ 8(3142) − (73)(321) ] / √{ [ 8(713) − (73)² ] [ 8(14111) − (321)² ] }

  = 0.886

[Figure: scatter plot of Tree Height (y) against Trunk Diameter (x) with an upward-sloping pattern]

r = 0.886 → relatively strong positive linear association between x and y
Excel Output

Excel Correlation Output
Tools / Data Analysis / Correlation…

                  Tree Height   Trunk Diameter
Tree Height       1
Trunk Diameter    0.886231      1

The value 0.886231 is the correlation between Tree Height and Trunk Diameter
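
The same output can be reproduced outside Excel; a minimal sketch with pandas, assuming the tree data is typed in by hand (the column labels are mine).

```python
import pandas as pd

trees = pd.DataFrame({
    "Tree Height": [35, 49, 27, 33, 60, 21, 45, 51],
    "Trunk Diameter": [8, 9, 7, 6, 13, 7, 11, 12],
})

# Pairwise Pearson correlation matrix, analogous to Excel's Correlation tool
print(trees.corr())
```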
Ex: Pepsi Cola is studying the effect of its last advertising campaign. People chosen at random were called and asked how many cans of Pepsi Cola they had bought (X) in the past week and how many advertisements (Y) they had either read or seen in the past week.

X : 3   7   4   2   0   4   1   2
Y : 11  18  9   4   7   6   3   8

Calculate the coefficient of correlation and the coefficient of determination.
An economist wanted to find out if there was any
relationship between the unemployment rate in a country
and its inflation rate . Data gathered from 7 countries for
the year 2004 are given below.

Country   Unemployment rate (%)   Inflation rate (%)
A         4.0                     3.2
B         8.5                     8.2
C         5.5                     9.4
D         0.8                     5.1
E         7.3                     10.1
F         5.8                     7.8
G         2.1                     4.7
Find the degree of linear association between a country’s
unemployment and its level of inflation.
Spearman Rank Correlation (ρ)
 Correlation between the ranks of two individuals is known as rank correlation
 To measure the intensity of the relationship between the variables (having ordinal data), we use Spearman's rank correlation
 Spearman's rank correlation lies between +1 and -1
 If the rank correlation coefficient is +1 there is perfect positive correlation, and if it is -1 there is perfect negative correlation
Spearman's Rank Correlation is given by

ρ(x, y) = 1 − [ 6 Σd² ] / [ n(n² − 1) ]

where d = Rx − Ry and n is the number of pairs of observations.

When ranks are tied, we add a correction factor to Σd², giving

ρ(x, y) = 1 − [ 6 ( Σd² + correction factor ) ] / [ n(n² − 1) ]

where the correction factor is m(m² − 1)/12, and m is the number of times an item is repeated.
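
A minimal sketch of this rank-correlation computation using SciPy, which applies the tie correction internally; the sample ranks are hypothetical.

```python
from scipy.stats import spearmanr

# Hypothetical ranks assigned by two judges to the same six items
rank_x = [1, 2, 3, 4, 5, 6]
rank_y = [2, 1, 4, 3, 6, 5]

rho, p_value = spearmanr(rank_x, rank_y)  # handles tied ranks internally
print(round(rho, 3))
```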
Ten Competitors in a beauty contest are ranked by
three judges in the following order.

Judge I :1 6 5 10 3 2 4 9 7 8
Judge II :3 5 8 4 7 10 2 1 6 9
Judge III:6 4 9 8 1 2 3 10 5 7

Determine which pair of judges has the nearest


approach to common tastes in beauty.
Judge 1   Judge 2   Judge 3   D1 = R1 − R2   D2 = R1 − R3   D3 = R2 − R3   (D1)²   (D2)²   (D3)²
A financial analyst wanted to find out whether inventory turnover influences a company's earnings per share (in %). A random sample of 7 companies listed on a stock exchange was selected and the following data was obtained for each.

Company   Inventory turnover (No. of times)   Earnings per share (%)
A         4                                   11
B         5                                   9
C         7                                   13
D         8                                   7
E         6                                   13
F         3                                   8
G         5                                   8
Find the strength of association between inventory
turnover and earnings per share. Interpret the result.
Coefficient of Determination:

 The squared value of the coefficient of correlation is called the coefficient of determination.

 It indicates "the proportion of the total variability of the dependent variable that is accounted for or explained by the independent variable".

 It always lies between 0 and 1.

 The following table gives indices of industrial production and the number of registered unemployed people (in lakh). Calculate the value of the correlation coefficient.

Year                  1991  1992  1993  1994  1995  1996  1997  1998
Index of production   100   102   104   107   105   112   103   99
No. of unemployed     15    12    13    11    12    12    19    26
Introduction to Regression Analysis

 Regression analysis is used to:
 Predict the value of a dependent variable based on the value of at least one independent variable
 Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable
Simple Linear Regression Model

 Only one independent variable, x
 Relationship between x and y is described by a linear function
 Changes in y are assumed to be caused by changes in x
Types of Regression Models
[Figure: four scatter plots illustrating a positive linear relationship, a negative linear relationship, a relationship that is not linear, and no relationship]
Population Linear Regression

The population regression model:

y = β₀ + β₁x + ε

where y is the dependent variable, β₀ is the population y-intercept, β₁ is the population slope coefficient, x is the independent variable, and ε is the random error term (residual). β₀ + β₁x is the linear component and ε is the random error component.
Linear Regression Assumptions

 Error values (ε) are statistically independent
 Error values are normally distributed for any given value of x
 The probability distribution of the errors is normal
 The probability distribution of the errors has constant variance
 The underlying relationship between the x variable and the y variable is linear
Population Linear Regression (continued)

[Figure: scatter of points about the line y = β₀ + β₁x + ε, labelling the observed value of y for xᵢ, the predicted value of y for xᵢ, the random error εᵢ for that x value, the slope β₁, and the intercept β₀]
Estimated Regression Model
The sample regression line provides an estimate of the population regression line:

ŷᵢ = b₀ + b₁x

where ŷᵢ is the estimated (or predicted) y value, b₀ is the estimate of the regression intercept, b₁ is the estimate of the regression slope, and x is the independent variable.

The individual random error terms eᵢ have a mean of zero.
Least Squares Criterion

 b₀ and b₁ are obtained by finding the values of b₀ and b₁ that minimize the sum of the squared residuals:

Σe² = Σ(y − ŷ)² = Σ(y − (b₀ + b₁x))²
The Least Squares Equation
 The formulas for b₁ and b₀ are:

b₁ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

or the algebraic equivalent:

b₁ = [ Σxy − (Σx Σy)/n ] / [ Σx² − (Σx)²/n ]

and

b₀ = ȳ − b₁x̄
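
As a sketch of these formulas in code (the function and variable names are mine, not from the slides):

```python
def least_squares_fit(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
         sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data
print(least_squares_fit([1, 2, 3, 4, 5], [2.1, 4.3, 6.2, 7.9, 10.1]))
```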
Interpretation of the Slope and the Intercept

 b₀ is the estimated average value of y when the value of x is zero

 b₁ is the estimated change in the average value of y as a result of a one-unit change in x
Finding the Least Squares Equation

 The coefficients b₀ and b₁ will usually be found using computer software, such as Excel or Minitab

 Other regression measures will also be computed as part of computer-based regression analysis
Simple Linear Regression Example

 A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
 A random sample of 10 houses is selected
 Dependent variable (y) = house price in $1000s
 Independent variable (x) = square feet
Sample Data for House Price Model
House Price in $1000s (y)   Square Feet (x)
245    1400
312    1600
279    1700
308    1875
199    1100
219    1550
405    2350
324    2450
319    1425
255    1700
Regression Using Excel
 Tools / Data Analysis / Regression
Excel Output
Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

ANOVA
             df   SS           MS           F        Significance F
Regression   1    18934.9348   18934.9348   11.0848  0.01039
Residual     8    13665.5652   1708.1957
Total        9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580
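
This output can also be obtained without Excel; a minimal sketch with statsmodels OLS (the variable names are mine), which should approximately reproduce the coefficients above.

```python
import statsmodels.api as sm

sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

X = sm.add_constant(sqft)       # adds the intercept column
model = sm.OLS(price, X).fit()  # ordinary least squares
print(model.params)             # intercept ~ 98.25, slope ~ 0.10977
print(model.rsquared)           # ~ 0.58
```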
Graphical Presentation
 House price model: scatter plot and regression line

[Figure: scatter plot of House Price ($1000s) against Square Feet with the fitted line; intercept = 98.248, slope = 0.10977]

house price = 98.24833 + 0.10977 (square feet)
Interpretation of the Intercept, b₀

house price = 98.24833 + 0.10977 (square feet)

 b₀ is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values)
 Here, no houses had 0 square feet, so b₀ = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet
Interpretation of the Slope Coefficient, b₁

house price = 98.24833 + 0.10977 (square feet)

 b₁ measures the estimated change in the average value of Y as a result of a one-unit change in X
 Here, b₁ = 0.10977 tells us that the average value of a house increases by 0.10977($1000) = $109.77, on average, for each additional square foot of size
Example: House Prices

Estimated regression equation:
house price = 98.25 + 0.1098 (sq. ft.)

House Price in $1000s (y)   Square Feet (x)
245    1400
312    1600
279    1700
308    1875
199    1100
219    1550
405    2350
324    2450
319    1425
255    1700

Predict the price for a house with 2000 square feet
Example: House Prices (continued)
Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098 (2000)
            = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850
Example: Market Trend
 In finance, it is of interest to look at the relationship between Y, a stock's average return, and X, the overall market return. The slope coefficient computed by linear regression is called the stock's beta by investment analysts. A beta greater than 1 indicates that the stock is relatively sensitive to changes in the market; a beta less than 1 indicates that the stock is relatively insensitive. For the following data, compute the beta and suggest the market trend.

Overall Market Return % (X)   Average Return % (Y)
10    11
12    15
8     3
15    18
9     10
11    12
8     6
10    7
13    18
11    13
Properties of regression lines and their coefficients:
1. The correlation coefficient is the geometric mean of the two regression coefficients.
2. The sign of the correlation coefficient is the same as that of the regression coefficients.
3. Regression coefficients are independent of a change of origin but not of scale.
Problem
 The following data give the ages and Blood
Pressure of 10 women. Find
1. Correlation Coefficient between age and BP
2. Determine the least square regression
equation of BP on age
3. Estimate the BP of a woman whose age is
45
Data and Calculations
 Age (x) and Blood Pressure (y)

x (Age)   y (BP)   x²      y²       xy
56        147      3136    21609    8232
42        125      1764    15625    5250
36        118      1296    13924    4248
47        128      2209    16384    6016
49        145      2401    21025    7105
42        140      1764    19600    5880
60        155      3600    24025    9300
72        160      5184    25600    11520
63        149      3969    22201    9387
55        150      3025    22500    8250
Σx=522    Σy=1417  Σx²=28348  Σy²=202493  Σxy=75188
 Correlation coefficient:

r = [ n Σxy − Σx Σy ] / √{ [ n Σx² − (Σx)² ] [ n Σy² − (Σy)² ] }

  = [ 10(75188) − (522)(1417) ] / √{ [ 10(28348) − (522)² ] [ 10(202493) − (1417)² ] }

r = 0.891679
 Regression equation of y on x:

ŷ = b₀ + b₁x

b₁ = [ Σxy − (Σx Σy)/n ] / [ Σx² − (Σx)²/n ]   and   b₀ = ȳ − b₁x̄

b₁ = 1.11, b₀ = 83.755

 The regression equation is y = 83.755 + 1.11x
 When x = 45, y = 83.755 + 1.11(45) = 133.705
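
A quick sketch that re-runs this worked example with the data above, using the least-squares formulas (the variable names are mine):

```python
import numpy as np

age = np.array([56, 42, 36, 47, 49, 42, 60, 72, 63, 55])
bp = np.array([147, 125, 118, 128, 145, 140, 155, 160, 149, 150])

r = np.corrcoef(age, bp)[0, 1]  # ~ 0.8917
b1 = np.sum((age - age.mean()) * (bp - bp.mean())) / np.sum((age - age.mean()) ** 2)
b0 = bp.mean() - b1 * age.mean()  # ~ 83.76
print(r, b0, b1, b0 + b1 * 45)    # predicted BP at age 45 ~ 133.7
```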
Multiple Regression Analysis
 A linear regression equation with more than one independent variable is called a multiple regression model.

The linear regression equation with k independent variables takes the form:

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ... + βₖxₖ + ε

where
 y is the value of the dependent variable to be estimated
 β₀ is a constant
 β₁, β₂, ..., βₖ are the regression coefficients associated with each of the k independent variables
 ε is the random error due to chance

Let the fitted linear regression equation be

ŷ = b₀ + b₁x₁ + b₂x₂ + ... + bₖxₖ

which minimizes the sum of squared errors (SSE) = Σ(y − ŷ)²

where
 ŷ is the estimated value of the dependent variable y
 b₁, b₂, b₃, ..., bₖ are partial regression coefficients obtained by the principle of least squares.
 Let us consider the case of two independent variables and one dependent variable.

The multiple linear regression model involving two independent variables is:

y = β₀ + β₁x₁ + β₂x₂ + ε

where
 y is the dependent variable
 x₁ and x₂ are the independent variables
 ε is the random error due to chance
 β₀ is the y-intercept
 β₁, β₂ are the regression coefficients
Let the fitted multiple linear regression equation be

ŷ = b₀ + b₁x₁ + b₂x₂   or   ŷ = b₀ + b_y1.2 x₁ + b_y2.1 x₂

where
 ŷ is the estimated value of the dependent variable y
 x₁, x₂ are the independent variables
 b₀, b₁, b₂ are unknown constants determined by the principle of least squares, which minimizes the sum of squared errors (SSE) = Σ(y − ŷ)²

By solving the following normal equations the values of b₀, b₁, b₂ can be determined:

Σy   = n b₀ + b_y1.2 Σx₁ + b_y2.1 Σx₂
Σyx₁ = b₀ Σx₁ + b_y1.2 Σx₁² + b_y2.1 Σx₁x₂
Σyx₂ = b₀ Σx₂ + b_y1.2 Σx₁x₂ + b_y2.1 Σx₂²
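
These three normal equations form a small linear system in b₀, b_y1.2, b_y2.1; a minimal sketch solving it numerically with NumPy on hypothetical data:

```python
import numpy as np

# Hypothetical observations of y, x1, x2
y = np.array([10.0, 12.0, 15.0, 18.0, 21.0])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

n = len(y)
# Coefficient matrix and right-hand side of the normal equations above
A = np.array([
    [n, x1.sum(), x2.sum()],
    [x1.sum(), (x1 ** 2).sum(), (x1 * x2).sum()],
    [x2.sum(), (x1 * x2).sum(), (x2 ** 2).sum()],
])
rhs = np.array([y.sum(), (y * x1).sum(), (y * x2).sum()])

b0, b1, b2 = np.linalg.solve(A, rhs)
print(b0, b1, b2)
```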
Equivalently, write the fitted equation and its mean form:

y = b₀ + b_y1.2 x₁ + b_y2.1 x₂      ----(1)
ȳ = b₀ + b_y1.2 x̄₁ + b_y2.1 x̄₂      ----(2)

(1) − (2):
(y − ȳ) = b_y1.2 (x₁ − x̄₁) + b_y2.1 (x₂ − x̄₂), i.e. Y = b_y1.2 X₁ + b_y2.1 X₂

where Y = y − ȳ, X₁ = x₁ − x̄₁, X₂ = x₂ − x̄₂. The partial regression coefficients are then

b_y1.2 = [ (ΣYX₁)(ΣX₂²) − (ΣYX₂)(ΣX₁X₂) ] / [ (ΣX₁²)(ΣX₂²) − (ΣX₁X₂)² ]

b_y2.1 = [ (ΣYX₂)(ΣX₁²) − (ΣYX₁)(ΣX₁X₂) ] / [ (ΣX₁²)(ΣX₂²) − (ΣX₁X₂)² ]
Relationship between partial regression coefficients and correlation coefficients:

b_y1.2 = [ (r_y1 − r_y2 · r_12) / (1 − r_12²) ] · (σ_y / σ₁)

b_y2.1 = [ (r_y2 − r_y1 · r_12) / (1 − r_12²) ] · (σ_y / σ₂)

where

r_y1 = ΣYX₁ / √(ΣY² · ΣX₁²)   (the correlation between y and x₁)
r_y2 = ΣYX₂ / √(ΣY² · ΣX₂²)   (the correlation between y and x₂)
r_12 = ΣX₁X₂ / √(ΣX₁² · ΣX₂²)   (the correlation between x₁ and x₂)
 A marketing manager of a company wants to predict demand for the product. He strongly believes that demand is highly influenced by the annual average price of the product (in units) and advertising expenditure (Rs in lakh). He has collected past data on the effect of these factors on demand, given below:

Y    4   6   7   9   13  15
X1   15  12  8   6   4   3
X2   30  24  20  14  10  4
• The following results are obtained from measurements on length (in mm), volume (in cc) and weight (in gm) of 300 eggs:

x̄₁ = 55.95   x̄₂ = 51.48   ȳ = 56.03
σ₁ = 2.26    σ₂ = 4.39    σ_y = 4.41
r_y1 = 0.578   r_y2 = 0.581   r_12 = 0.974

Obtain the linear regression equation of egg weight on its length and volume. Hence estimate the weight of an egg whose length is 58 mm and volume is 52.5 cc.
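
A sketch of the correlation-based formulas from the earlier slide, which apply to summary data like that in this exercise; the function name and the sample values shown are hypothetical.

```python
def partial_regression_coeffs(ry1, ry2, r12, sigma_y, sigma_1, sigma_2):
    """b_y1.2 and b_y2.1 from simple correlations and standard deviations."""
    b_y1_2 = ((ry1 - ry2 * r12) / (1 - r12 ** 2)) * (sigma_y / sigma_1)
    b_y2_1 = ((ry2 - ry1 * r12) / (1 - r12 ** 2)) * (sigma_y / sigma_2)
    return b_y1_2, b_y2_1

# Hypothetical summary statistics; the intercept would follow from the means:
# b0 = y_bar - b_y1_2 * x1_bar - b_y2_1 * x2_bar
print(partial_regression_coeffs(0.6, 0.5, 0.3, 4.0, 2.0, 3.0))
```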
 The Federal Reserve is performing a preliminary
study to determine the relationship between
certain economic indicators and annual
percentage change in the gross national product
(GNP). Two such indicators being examined are
the amount of the federal government’s deficit (in
billions of dollars) and the Dow Jones Industrial
Average (the mean value over the year). Data for
6 years follow:
Change in GNP 2.5 -1.0 4.0 1.0 1.5 3.0
Federal Deficit 100.0 400.0 120.0 200.0 180.0 80.0
Dow Jones 2850 2100 3300 2400 2550 2700
i. Calculate the least squares equation that best
describes the data.
ii. What % change in GNP would be expected in a year
in which the federal deficit was $240 billion and the
mean Dow Jones value was 3000?
 Multiple correlation analysis:

It is a measure of association between a dependent variable and several independent variables taken together.

The coefficient of multiple correlation is given by:

R_y.12 = √[ ( r_y1² + r_y2² − 2 r_y1 r_y2 r_12 ) / ( 1 − r_12² ) ]

Its value always lies between 0 and 1.
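
A short sketch of this coefficient in code (the function name and sample values are mine):

```python
import math

def multiple_correlation(ry1, ry2, r12):
    """Coefficient of multiple correlation R_y.12 from simple correlations."""
    return math.sqrt((ry1 ** 2 + ry2 ** 2 - 2 * ry1 * ry2 * r12) / (1 - r12 ** 2))

print(multiple_correlation(0.6, 0.5, 0.3))  # hypothetical values
```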


 Coefficient of multiple determination:

It is the proportion of the total variation in the values of the dependent variable y that is accounted for or explained by the independent variables in the multiple regression model.

 The square of the coefficient of multiple correlation is called the coefficient of multiple determination.
