
Correlation and Regression

17-2

Regression
Regression explains the variation in one variable (the dependent variable) based on the variation in one or more other variables (the independent variables).
Ex: explaining the variation in sales of a product based on advertising expenses, number of salespeople, and number of sales offices. Correlation and regression are generally performed together.
Correlation measures the degree of association between two sets of quantitative data. (Ex: how advertising expenditure is correlated with other promotional expenditure.) Correlation is usually followed by regression analysis.

17-3

Variable Selection Methods


Stepwise procedures

Forward selection
Add one variable at a time to the model, on the basis of its F statistic.

Backward elimination
Remove one variable at a time, on the basis of its F statistic.

Stepwise regression
Add variables to the model and remove variables from the model, on the basis of the F statistic.

17-4

Using Statistics

Lines
[Figure: a line through points A and B in the (x, y) plane, with intercept β0 and slope β1]
Any two points (A and B), or an intercept and slope (β0 and β1), define a line on a two-dimensional surface.

Planes
[Figure: a plane through points A, B, and C over the (x1, x2) axes]
Any three points (A, B, and C), or an intercept and coefficients of x1 and x2 (β0, β1, and β2), define a plane in three-dimensional space.

17-5

Least-Squares Estimation: The Two-Variable Normal Equations

Minimizing the sum of squared errors with respect to the estimated coefficients b0, b1, and b2 yields the following normal equations, which can be solved for b0, b1, and b2:

$$\sum y = n b_0 + b_1 \sum x_1 + b_2 \sum x_2$$

$$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2$$

$$\sum x_2 y = b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2$$

17-6

Example

  Y    X1   X2   X1X2   X1²    X2²    X1Y    X2Y
  72   12    5     60    144     25    864    360
  76   11    8     88    121     64    836    608
  78   15    6     90    225     36   1170    468
  70   10    5     50    100     25    700    350
  68   11    3     33    121      9    748    204
  80   16    9    144    256     81   1280    720
  82   14   12    168    196    144   1148    984
  65    8    4     32     64     16    520    260
  62    8    3     24     64      9    496    186
  90   18   10    180    324    100   1620    900
 ---  ---  ---   ----   ----   ----   ----   ----
 743  123   65    869   1615    509   9382   5040

Normal equations:
743 = 10 b0 + 123 b1 + 65 b2
9382 = 123 b0 + 1615 b1 + 869 b2
5040 = 65 b0 + 869 b1 + 509 b2

Solving gives:
b0 = 47.164942
b1 = 1.5990404
b2 = 1.1487479

Estimated regression equation:
Ŷ = 47.164942 + 1.5990404 X1 + 1.1487479 X2
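These coefficients can be checked in code. The following is a minimal sketch (assuming numpy is available) that builds the design matrix from the table above and solves the normal equations in matrix form, (XᵀX)b = Xᵀy:

```python
import numpy as np

# Data from the table above
y  = np.array([72, 76, 78, 70, 68, 80, 82, 65, 62, 90])
x1 = np.array([12, 11, 15, 10, 11, 16, 14,  8,  8, 18])
x2 = np.array([ 5,  8,  6,  5,  3,  9, 12,  4,  3, 10])

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones_like(y, dtype=float), x1, x2])

# The three normal equations in matrix form: (X'X) b = X'y
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # approximately [47.164942, 1.5990404, 1.1487479]
```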

17-7

The k-Variable Multiple Regression Model

The population regression model of a dependent variable, Y, on a set of k independent variables, X1, X2, ..., Xk, is given by:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$

where β0 is the Y-intercept of the regression surface and each βi, i = 1, 2, ..., k, is the slope of the regression surface (sometimes called the response surface) with respect to Xi.

[Figure: the regression plane E[Y] = β0 + β1x1 + β2x2 over the (x1, x2) axes]

Model assumptions:
1. ε ~ N(0, σ²), independent of other errors.
2. The variables Xi are uncorrelated with the error term.

17-8

Simple and Multiple Least-Squares Regression

[Figure: fitted line ŷ = b0 + b1x through a scatter of (x, y) points]
In a simple regression model, the least-squares estimators minimize the sum of squared errors from the estimated regression line.

[Figure: fitted plane ŷ = b0 + b1x1 + b2x2 over the (x1, x2) axes]
In a multiple regression model, the least-squares estimators minimize the sum of squared errors from the estimated regression plane.

17-9

The Estimated Regression Relationship

The estimated regression relationship:

$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$

where Ŷ is the predicted value of Y, the value lying on the estimated regression surface. The terms bi, for i = 0, 1, ..., k, are the least-squares estimates of the population regression parameters βi.

The actual, observed value of Y is the predicted value plus an error:

$$y_j = b_0 + b_1 x_{1j} + b_2 x_{2j} + \dots + b_k x_{kj} + e_j, \quad j = 1, \dots, n$$

17-10

Product Moment Correlation

The product moment correlation, r, summarizes the strength of association between two metric (interval or ratio scaled) variables, say X and Y.

It is an index used to determine whether a linear or straight-line relationship exists between X and Y.

Because it was originally proposed by Karl Pearson, it is also known as the Pearson correlation coefficient. It is also referred to as simple correlation, bivariate correlation, or merely the correlation coefficient.

17-11

Product Moment Correlation

From a sample of n observations, X and Y, the product moment correlation, r, can be calculated as:

$$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$

Division of the numerator and denominator by (n − 1) gives

$$r = \frac{\sum_{i=1}^{n}\dfrac{(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}}{\sqrt{\sum_{i=1}^{n}\dfrac{(X_i - \bar{X})^2}{n-1}}\,\sqrt{\sum_{i=1}^{n}\dfrac{(Y_i - \bar{Y})^2}{n-1}}} = \frac{COV_{xy}}{S_x S_y}$$

17-12

Product Moment Correlation

r varies between -1.0 and +1.0.

The correlation coefficient between two variables will be the same regardless of their underlying units of measurement.

17-13

Partial Correlation
A partial correlation coefficient measures the
association between two variables after controlling for,
or adjusting for, the effects of one or more additional
variables.

$$r_{xy.z} = \frac{r_{xy} - (r_{xz})(r_{yz})}{\sqrt{1 - r_{xz}^2}\,\sqrt{1 - r_{yz}^2}}$$

Partial correlations have an order associated with them. The order indicates how many variables are being adjusted for or controlled.
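As an illustration, here is a minimal numpy sketch of the first-order formula above (the helper name partial_corr is made up for this example):

```python
import numpy as np

def partial_corr(x, y, z):
    """First-order partial correlation r_xy.z: the correlation of x and y
    after controlling for z, computed from the three simple correlations."""
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))
```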

17-14

Statistics Associated with Bivariate Regression Analysis

Bivariate regression model. The basic regression equation is Yi = β0 + β1Xi + ei, where Y = dependent or criterion variable, X = independent or predictor variable, β0 = intercept of the line, β1 = slope of the line, and ei is the error term associated with the i-th observation.

Coefficient of determination. The strength of association is measured by the coefficient of determination, r². It varies between 0 and 1 and signifies the proportion of the total variation in Y that is accounted for by the variation in X.

Estimated or predicted value. The estimated or predicted value of Yi is Ŷi = a + bx, where Ŷi is the predicted value of Yi, and a and b are estimators of β0 and β1, respectively.

17-15

Statistics Associated with Bivariate Regression Analysis

Scattergram. A scatter diagram, or scattergram, is a plot of the values of two variables for all the cases or observations.

Standard error of estimate. This statistic, SEE, is the standard deviation of the actual Y values from the predicted Ŷ values.

Standard error. The standard deviation of b, SEb, is called the standard error.

17-16

Statistics Associated with Bivariate Regression Analysis

Standardized regression coefficient. Also termed the beta coefficient or beta weight, this is the slope obtained by the regression of Y on X when the data are standardized.

Sum of squared errors. The distances of all the points from the regression line are squared and added together to arrive at the sum of squared errors, Σeⱼ², which is a measure of total error.

t statistic. A t statistic with n − 2 degrees of freedom can be used to test the null hypothesis that no linear relationship exists between X and Y, or H0: β1 = 0, where t = b / SEb.

17-17

Bivariate Regression Analysis: Test for Significance

The statistical significance of the linear relationship between X and Y may be tested by examining the hypotheses:

H0: β1 = 0
H1: β1 ≠ 0

A t statistic with n − 2 degrees of freedom can be used, where

$$t = \frac{b}{SE_b}$$

SEb denotes the standard deviation of b and is called the standard error.
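For illustration, scipy's linregress reports exactly this test; a small sketch (the data below are invented for the example):

```python
import numpy as np
from scipy import stats

# Illustrative data, invented for this sketch
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

res = stats.linregress(x, y)
t = res.slope / res.stderr   # t = b / SE_b, with n - 2 degrees of freedom
print(t, res.pvalue)         # res.pvalue tests H0: beta_1 = 0
```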

17-18

Multiple Regression

The general form of the multiple regression model is as follows:

Y = β0 + β1X1 + β2X2 + β3X3 + . . . + βkXk + e

which is estimated by the following equation:

Ŷ = a + b1X1 + b2X2 + b3X3 + . . . + bkXk

As before, the coefficient a represents the intercept, but the b's are now the partial regression coefficients.

17-19

Statistics Associated with Multiple Regression

Adjusted R². R², the coefficient of multiple determination, is adjusted for the number of independent variables and the sample size to account for diminishing returns. After the first few variables, additional independent variables do not contribute much:

$$R^2_{adj} = R^2 - \frac{k(1 - R^2)}{n - k - 1}$$

Coefficient of multiple determination. The strength of association in multiple regression is measured by the square of the multiple correlation coefficient, R², which is also called the coefficient of multiple determination.

F test. The F test is used to test the null hypothesis that the coefficient of multiple determination in the population is zero. This is equivalent to testing the null hypothesis H0: β1 = β2 = ... = βk = 0. The test statistic has an F distribution with k and (n − k − 1) degrees of freedom.
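A sketch of these quantities in code, assuming the statsmodels package and reusing the ten observations from the worked example on slide 17-6:

```python
import numpy as np
import statsmodels.api as sm

y  = np.array([72, 76, 78, 70, 68, 80, 82, 65, 62, 90])
x1 = np.array([12, 11, 15, 10, 11, 16, 14,  8,  8, 18])
x2 = np.array([ 5,  8,  6,  5,  3,  9, 12,  4,  3, 10])

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the intercept column
fit = sm.OLS(y, X).fit()

print(fit.rsquared)              # coefficient of multiple determination, R^2
print(fit.rsquared_adj)          # adjusted R^2
print(fit.fvalue, fit.f_pvalue)  # overall F test with k and n - k - 1 df
```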

17-20

Stepwise Regression

The purpose of stepwise regression is to select, from a large number of predictor variables, a small subset of variables that account for most of the variation in the dependent or criterion variable. In this procedure, the predictor variables enter or are removed from the regression equation one at a time.

Forward inclusion. Initially, there are no predictor variables in the regression equation. Predictor variables are entered one at a time, only if they meet certain criteria specified in terms of the F ratio. The order in which the variables are included is based on their contribution to the explained variance. (A forward-inclusion sketch follows below.)

Backward elimination. Initially, all the predictor variables are included in the regression equation. Predictors are then removed one at a time based on the F ratio for removal.
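A simplified sketch of forward inclusion, assuming statsmodels; it uses the t-test p-value of the entering variable in place of its partial F ratio (equivalent for a single entering variable, since F = t²). The helper name forward_inclusion and the dict-of-arrays input are made up for this example:

```python
import numpy as np
import statsmodels.api as sm

def forward_inclusion(y, candidates, p_in=0.05):
    """Greedy forward selection: at each step, add the candidate predictor
    with the smallest p-value, stopping when none clears the entry
    threshold p_in. `candidates` maps variable names to 1-D arrays."""
    selected, remaining = [], list(candidates.keys())
    while remaining:
        best_p, best_var = None, None
        for var in remaining:
            cols = [candidates[v] for v in selected + [var]]
            X = sm.add_constant(np.column_stack(cols))
            fit = sm.OLS(y, X).fit()
            p = fit.pvalues[-1]  # p-value of the newly added variable
            if best_p is None or p < best_p:
                best_p, best_var = p, var
        if best_p is None or best_p > p_in:
            break
        selected.append(best_var)
        remaining.remove(best_var)
    return selected
```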

17-21

Multicollinearity

Multicollinearity arises when intercorrelations among the predictors are very high. Multicollinearity can result in several problems, including:

The partial regression coefficients may not be estimated precisely. The standard errors are likely to be high.
The magnitudes as well as the signs of the partial regression coefficients may change from sample to sample.
It becomes difficult to assess the relative importance of the independent variables in explaining the variation in the dependent variable.
Predictor variables may be incorrectly included or removed in stepwise regression.
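A common diagnostic for this is the variance inflation factor (VIF); a sketch assuming statsmodels, again reusing the slide 17-6 data (a rule of thumb flags VIF values above about 10):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

y  = np.array([72, 76, 78, 70, 68, 80, 82, 65, 62, 90])
x1 = np.array([12, 11, 15, 10, 11, 16, 14,  8,  8, 18])
x2 = np.array([ 5,  8,  6,  5,  3,  9, 12,  4,  3, 10])

X = sm.add_constant(np.column_stack([x1, x2]))
# VIF for each predictor (column 0 is the intercept, so skip it)
for j in (1, 2):
    print(j, variance_inflation_factor(X, j))
```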

17-22

Decomposition of the Total Deviation in a Multiple Regression Model

[Figure: a data point and the fitted regression plane, showing the total deviation split into its two components]

Total deviation: Y − Ȳ
Error deviation: Y − Ŷ
Regression deviation: Ŷ − Ȳ

Total Deviation = Regression Deviation + Error Deviation
SST = SSR + SSE

17-23

The F Test of a Multiple Regression Model

A statistical test for the existence of a linear relationship between Y and any or all of the independent variables X1, X2, ..., Xk:

H0: β1 = β2 = ... = βk = 0
H1: Not all the βi (i = 1, 2, ..., k) are equal to 0

Source of     Sum of    Degrees of
Variation     Squares   Freedom        Mean Square                F Ratio
Regression    SSR       k              MSR = SSR/k                F = MSR/MSE
Error         SSE       n − (k + 1)    MSE = SSE/(n − (k + 1))
Total         SST       n − 1          MST = SST/(n − 1)
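To make the decomposition concrete, here is a sketch (numpy only, reusing the slide 17-6 data) that computes the ANOVA table entries by hand:

```python
import numpy as np

y  = np.array([72, 76, 78, 70, 68, 80, 82, 65, 62, 90])
x1 = np.array([12, 11, 15, 10, 11, 16, 14,  8,  8, 18])
x2 = np.array([ 5,  8,  6,  5,  3,  9, 12,  4,  3, 10])

n, k = len(y), 2
X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
sse = np.sum((y - y_hat) ** 2)      # error sum of squares
ssr = sst - sse                     # regression sum of squares

msr = ssr / k
mse = sse / (n - (k + 1))
f = msr / mse                       # F ratio with k and n-(k+1) df
print(ssr, sse, msr, mse, f)
```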

17-24

How Good is the Regression?

The mean square error is an unbiased estimator of the variance of the population errors, σ²:

$$MSE = \frac{SSE}{n - (k + 1)} = \frac{\sum (y - \hat{y})^2}{n - (k + 1)}$$

Standard error of estimate:

$$s = \sqrt{MSE}$$

The multiple coefficient of determination, R², measures the proportion of the variation in the dependent variable that is explained by the combination of the independent variables in the multiple regression model:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

17-25

Decomposition of the Sum of Squares and the Adjusted Coefficient of Determination

SST = SSR + SSE

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

The adjusted multiple coefficient of determination, R̄², is the coefficient of determination with the SSE and SST divided by their respective degrees of freedom:

$$\bar{R}^2 = 1 - \frac{SSE/(n - (k + 1))}{SST/(n - 1)}$$

Example 1:
s = 1.911
R-sq = 96.1%
R-sq(adj) = 95.0%

17-26

Measures of Performance in Multiple Regression and the ANOVA Table

Source of     Sum of    Degrees of
Variation     Squares   Freedom        Mean Square                F Ratio
Regression    SSR       k              MSR = SSR/k                F = MSR/MSE
Error         SSE       n − (k + 1)    MSE = SSE/(n − (k + 1))
Total         SST       n − 1          MST = SST/(n − 1)

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

$$F = \frac{MSR}{MSE} = \frac{R^2 / k}{(1 - R^2)/(n - (k + 1))}$$

$$\bar{R}^2 = 1 - \frac{SSE/(n - (k + 1))}{SST/(n - 1)} = 1 - \frac{MSE}{MST}$$

17-27

Tests of the Significance of Individual Regression Parameters

Hypothesis tests about individual regression slope parameters:

(1) H0: β1 = 0   H1: β1 ≠ 0
(2) H0: β2 = 0   H1: β2 ≠ 0
...
(k) H0: βk = 0   H1: βk ≠ 0

Test statistic for test i:

$$t_{(n-(k+1))} = \frac{b_i - 0}{s(b_i)}$$
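These per-coefficient t tests come directly out of the same kind of statsmodels fit sketched on slide 17-19 (same hedges apply; slide 17-6 data):

```python
import numpy as np
import statsmodels.api as sm

y  = np.array([72, 76, 78, 70, 68, 80, 82, 65, 62, 90])
x1 = np.array([12, 11, 15, 10, 11, 16, 14,  8,  8, 18])
x2 = np.array([ 5,  8,  6,  5,  3,  9, 12,  4,  3, 10])

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.tvalues)   # t = b_i / s(b_i) for each coefficient
print(fit.pvalues)   # two-sided p-values with n - (k + 1) df
```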

17-28

Testing the Validity of the Regression Model: Residual Plots

[Figure: residuals plotted against M1 (Example 11-2)]

It appears that the residuals are randomly distributed, with no pattern and with equal variance, as M1 increases.

17-29

Testing the Validity of the Regression Model: Residual Plots

[Figure: residuals plotted against Price]

It appears that the residuals increase as Price increases; the variance of the residuals is not constant.

17-30

Normal Probability Plot for the Residuals

[Figure: normal probability plot of the residuals]

A linear trend indicates that the residuals are normally distributed.

17-31

Investigating the Validity of the Regression: Outliers and Influential Observations

Outliers
[Figure: a scatter of points with one outlier (*); the regression line fitted with the outlier is pulled away from the regression line fitted without it]

Influential Observations
[Figure: a cluster of points showing no relationship among themselves, plus one point with a large value of xi; when all the data are included, the regression line is determined almost entirely by that far point]

17-32

Possible Relation in the Region between the Available Cluster of Data and the Far Point

[Figure: the original cluster, the far point with a large value of xi, and some of the possible data between them tracing a curve]

A more appropriate curvilinear relationship may hold (seen when the in-between data are known).

17-33

Stepwise Regression

Compute the F statistic for each variable not in the model.
Is there at least one variable with p-value < Pin?
No → Stop.
Yes → Enter the most significant variable (smallest p-value) into the model.
Calculate the partial F for every variable in the model.
Is there a variable with p-value > Pout?
Yes → Remove that variable and recalculate.
No → Return to the first step.

17-34

Partial F Tests and Variable Selection Methods

Full model:
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε

Reduced model:
Y = β0 + β1X1 + β2X2 + ε

Partial F test:
H0: β3 = β4 = 0
H1: β3 and β4 not both 0

Partial F statistic:

$$F_{(r,\,n-(k+1))} = \frac{(SSE_R - SSE_F)/r}{MSE_F}$$

where SSE_R is the sum of squared errors of the reduced model, SSE_F is the sum of squared errors of the full model; MSE_F is the mean square error of the full model [MSE_F = SSE_F/(n − (k + 1))]; r is the number of variables dropped from the full model.
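A minimal sketch of this statistic, assuming numpy and scipy (the helper name partial_f_test is made up; both design matrices are assumed to include an intercept column):

```python
import numpy as np
from scipy import stats

def partial_f_test(y, X_full, X_reduced):
    """Partial F test comparing a full model against a reduced model
    that drops r of the full model's predictors."""
    def sse(X):
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ b
        return resid @ resid

    n, k_plus_1 = X_full.shape                 # k + 1 columns incl. intercept
    r = X_full.shape[1] - X_reduced.shape[1]   # variables dropped
    sse_f, sse_r = sse(X_full), sse(X_reduced)
    mse_f = sse_f / (n - k_plus_1)
    f = ((sse_r - sse_f) / r) / mse_f
    p = stats.f.sf(f, r, n - k_plus_1)         # right-tail p-value
    return f, p
```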

17-35

Explaining Attitude Toward the City of Residence

Respondent No.   Attitude Toward   Duration of   Importance Attached
                 the City          Residence     to Weather
 1                6                10             3
 2                9                12            11
 3                8                12             4
 4                3                 4             1
 5               10                12            11
 6                4                 6             1
 7                5                 8             7
 8                2                 2             4
 9               11                18             8
10                9                 9            10
11               10                17             8
12                2                 2             5

17-36

Product Moment Correlation

The correlation coefficient may be calculated as follows:

X̄ = (10 + 12 + 12 + 4 + 12 + 6 + 8 + 2 + 18 + 9 + 17 + 2)/12 = 9.333
Ȳ = (6 + 9 + 8 + 3 + 10 + 4 + 5 + 2 + 11 + 9 + 10 + 2)/12 = 6.583

Σ(Xi − X̄)(Yi − Ȳ)
= (10 − 9.33)(6 − 6.58) + (12 − 9.33)(9 − 6.58)
+ (12 − 9.33)(8 − 6.58) + (4 − 9.33)(3 − 6.58)
+ (12 − 9.33)(10 − 6.58) + (6 − 9.33)(4 − 6.58)
+ (8 − 9.33)(5 − 6.58) + (2 − 9.33)(2 − 6.58)
+ (18 − 9.33)(11 − 6.58) + (9 − 9.33)(9 − 6.58)
+ (17 − 9.33)(10 − 6.58) + (2 − 9.33)(2 − 6.58)
= −0.3886 + 6.4614 + 3.7914 + 19.0814
+ 9.1314 + 8.5914 + 2.1014 + 33.5714
+ 38.3214 − 0.7986 + 26.2314 + 33.5714
= 179.6668

17-37

Product Moment Correlation

Similarly,

Σ(Xi − X̄)² = (10 − 9.33)² + (12 − 9.33)² + (12 − 9.33)² + (4 − 9.33)²
+ (12 − 9.33)² + (6 − 9.33)² + (8 − 9.33)² + (2 − 9.33)²
+ (18 − 9.33)² + (9 − 9.33)² + (17 − 9.33)² + (2 − 9.33)²
= 0.4489 + 7.1289 + 7.1289 + 28.4089
+ 7.1289 + 11.0889 + 1.7689 + 53.7289
+ 75.1689 + 0.1089 + 58.8289 + 53.7289
= 304.6668

Σ(Yi − Ȳ)² = (6 − 6.58)² + (9 − 6.58)² + (8 − 6.58)² + (3 − 6.58)²
+ (10 − 6.58)² + (4 − 6.58)² + (5 − 6.58)² + (2 − 6.58)²
+ (11 − 6.58)² + (9 − 6.58)² + (10 − 6.58)² + (2 − 6.58)²
= 0.3364 + 5.8564 + 2.0164 + 12.8164
+ 11.6964 + 6.6564 + 2.4964 + 20.9764
+ 19.5364 + 5.8564 + 11.6964 + 20.9764
= 120.9168

Thus,

r = 179.6668 / √((304.6668)(120.9168)) = 0.9361
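A quick cross-check of this arithmetic (a sketch assuming numpy):

```python
import numpy as np

# Duration of residence (X) and attitude toward the city (Y)
x = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2])
y = np.array([ 6,  9,  8, 3, 10, 4, 5, 2, 11, 9, 10, 2])

r = np.corrcoef(x, y)[0, 1]
print(round(r, 4))  # 0.9361
```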

17-38

Decomposition of the Total Variation

$$r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{SS_{reg}}{SS_y} = \frac{\text{Total variation} - \text{Error variation}}{\text{Total variation}} = \frac{SS_y - SS_{error}}{SS_y}$$

For the example above, r² = 0.9361² ≈ 0.876, so about 87.6% of the variation in attitude toward the city is explained by the variation in duration of residence.