
Chapter 13

Multiple Regression

Multiple Regression Model

Least Squares Method

Multiple Coefficient of Determination

Model Assumptions

Testing for Significance

Using the Estimated Regression Equation
for Estimation and Prediction

Qualitative Independent Variables

2006 Thomson/South-Western

Multiple Regression Model


The equation that describes how the
dependent variable y is related to the
independent variables x1, x2, . . . xp and an
error term is called the multiple regression
model.
y = β0 + β1x1 + β2x2 + . . . + βpxp + ε
where:
β0, β1, β2, . . . , βp are the parameters, and
ε is a random variable called the error term

2006 Thomson/South-Western

Multiple Regression Equation


The equation that describes how the mean
value of y is related to x1, x2, . . . xp is called
the multiple regression equation.
E(y) = β0 + β1x1 + β2x2 + . . . + βpxp

2006 Thomson/South-Western

Estimated Multiple Regression Equation


A simple random sample is used to
compute sample statistics b0, b1, b2, . . . , bp
that are used as the point estimators of the
parameters β0, β1, β2, . . . , βp.
The estimated multiple regression equation is:
ŷ = b0 + b1x1 + b2x2 + . . . + bpxp

2006 Thomson/South-Western

Estimation Process

Multiple Regression Model
y = β0 + β1x1 + β2x2 + . . . + βpxp + ε

Multiple Regression Equation
E(y) = β0 + β1x1 + β2x2 + . . . + βpxp

Unknown parameters are β0, β1, β2, . . . , βp

Sample Data:
x1  x2  . . .  xp  y
 .   .          .  .
 .   .          .  .

Estimated Multiple Regression Equation
ŷ = b0 + b1x1 + b2x2 + . . . + bpxp

Sample statistics are b0, b1, b2, . . . , bp

b0, b1, b2, . . . , bp provide estimates of β0, β1, β2, . . . , βp

2006 Thomson/South-Western

5

Least Squares Method

Least Squares Criterion


min Σ(yi − ŷi)²

Computation of Coefficient Values

The formulas for the regression coefficients


b0, b1, b2, . . . bp involve the use of matrix algebra.
We will rely on computer software packages to
perform the calculations.

2006 Thomson/South-Western
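The slides defer these matrix-algebra calculations to software. As a minimal sketch (our own illustration, not part of the slides; the helper name is made up), the coefficients b0, b1, . . . , bp can be computed with numpy:

import numpy as np

def least_squares_coefficients(X, y):
    # X: n x p array of independent variables, y: length-n array of observed responses.
    # Returns b0, b1, ..., bp minimizing the least squares criterion min Σ(yi − ŷi)².
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_design = np.column_stack([np.ones(len(y)), X])   # column of 1s gives the intercept b0
    b, *_ = np.linalg.lstsq(X_design, y, rcond=None)   # numerically stable least squares solve
    return b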

Multiple Regression Model

Example: Programmer Salary Survey


A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine if salary was related to the years of experience and the score on the firm's programmer aptitude test.

The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a sample of 20 programmers are shown on the next slide.

2006 Thomson/South-Western
7

Multiple Regression Model

Exper.  Score  Salary      Exper.  Score  Salary
  4      78     24            9     88     38
  7     100     43            2     73     26.6
  1      86     23.7         10     75     36.2
  5      82     34.3          5     81     31.6
  8      86     35.8          6     74     29
 10      84     38            8     87     34
  0      75     22.2          4     79     30.1
  1      80     23.1          6     94     33.9
  6      83     30            3     70     28.2
  6      91     33            3     89     30

2006 Thomson/South-Western
Multiple Regression Model


Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model:

y = β0 + β1x1 + β2x2 + ε

where
y = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test

2006 Thomson/South-Western

Solving for the Estimates of β0, β1, β2

Input Data:
x1   x2     y
 4   78    24
 7  100    43
 .    .     .
 .    .     .

→ Computer Package for Solving Multiple Regression Problems →

Least Squares Output:
b0 =
b1 =
b2 =
R2 =
etc.

2006 Thomson/South-Western

10

Solving for the Estimates of β0, β1, β2

Excel Worksheet (showing partial data entered)

     A            B                  C            D
 1   Programmer   Experience (yrs)   Test Score   Salary ($K)
 2   1             4                  78          24.0
 3   2             7                 100          43.0
 4   3             1                  86          23.7
 5   4             5                  82          34.3
 6   5             8                  86          35.8
 7   6            10                  84          38.0
 8   7             0                  75          22.2
 9   8             1                  80          23.1

Note: Rows 10-21 are not shown.

2006 Thomson/South-Western

11

Solving for the Estimates of β0, β1, β2

Excel's Regression Dialog Box

2006 Thomson/South-Western

12

Solving for the Estimates of β0, β1, β2

Excel's Regression Equation Output

              Coeffic.   Std. Err.   t Stat   P-value
Intercept     3.17394    6.15607     0.5156   0.61279
Experience    1.4039     0.19857     7.0702   1.9E-06
Test Score    0.25089    0.07735     3.2433   0.00478

Note: Columns F-I are not shown.

2006 Thomson/South-Western

13

Estimated Regression Equation

SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)
Note: Predicted salary will be in thousands of dollars.

2006 Thomson/South-Western

14
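As a quick worked example (our own arithmetic, not part of the slides): a programmer with 4 years of experience and an aptitude test score of 78 has a predicted salary of 3.174 + 1.404(4) + 0.251(78) = 3.174 + 5.616 + 19.578 ≈ 28.4, i.e., about $28,400.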

Interpreting the Coefficients


In multiple regression analysis, we interpret each regression coefficient as follows:

bi represents an estimate of the change in y corresponding to a 1-unit increase in xi when all other independent variables are held constant.

2006 Thomson/South-Western

15

Interpreting the Coefficients


b1 = 1.404

Salary is expected to increase by $1,404 for each additional year of experience (when the variable score on the programmer aptitude test is held constant).

2006 Thomson/South-Western

16

Interpreting the Coefficients


b2 = 0.251

Salary is expected to increase by $251 for each additional point scored on the programmer aptitude test (when the variable years of experience is held constant).

2006 Thomson/South-Western

17

Multiple Coefficient of Determination

Relationship Among SST, SSR, SSE

SST = SSR + SSE

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

2006 Thomson/South-Western

18
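As a minimal sketch (our own illustration, not part of the slides; the function name is made up), the three sums of squares can be computed directly from the observed and fitted values:

import numpy as np

def sums_of_squares(y, yhat):
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
    ssr = np.sum((yhat - y.mean()) ** 2)   # sum of squares due to regression
    sse = np.sum((y - yhat) ** 2)          # sum of squares due to error
    return sst, ssr, sse

For the salary example, R2 = SSR/SST reproduces the .834 reported on the slides that follow.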

Multiple Coefficient of Determination

Excel's ANOVA Output

ANOVA
             df    SS          MS          F          Significance F
Regression    2    500.3285    250.1643    42.76013   2.32774E-07
Residual     17     99.45697     5.85041
Total        19    599.7855

(SSR = 500.3285, SST = 599.7855)

2006 Thomson/South-Western

19

Multiple Coefficient of Determination


R2 = SSR/SST
R2 = 500.3285/599.7855 = .83418

2006 Thomson/South-Western

20

Adjusted Multiple Coefficient of Determination

Ra2 = 1 − (1 − R2)(n − 1)/(n − p − 1)

Ra2 = 1 − (1 − .834179)(20 − 1)/(20 − 2 − 1) = .814671
2006 Thomson/South-Western

21
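As a quick check of this arithmetic (our own, not part of the slides) with n = 20 observations and p = 2 independent variables:

n, p, r2 = 20, 2, 0.834179
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2_adj, 6))   # 0.814671, matching Excel's Adjusted R Square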

Adjusted Multiple Coefficient of Determination

Excel's Regression Statistics

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.913334059
R Square             0.834179103
Adjusted R Square    0.814670762
Standard Error       2.418762076
Observations         20

2006 Thomson/South-Western

22

Assumptions About the Error Term


The error ε is a random variable with mean of zero.

The variance of ε, denoted by σ2, is the same for all values of the independent variables.

The values of ε are independent.

The error ε is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by β0 + β1x1 + β2x2 + . . . + βpxp.

2006 Thomson/South-Western

23

Testing for Significance


In simple linear regression, the F and t tests provide the same conclusion.

In multiple regression, the F and t tests have different purposes.

2006 Thomson/South-Western

24

Testing for Significance: F Test


The F test is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.

The F test is referred to as the test for overall significance.

2006 Thomson/South-Western

25

Testing for Significance: t Test


If the F test shows an overall significance, the t test is used to determine whether each of the individual independent variables is significant.

A separate t test is conducted for each of the independent variables in the model.

We refer to each of these t tests as a test for individual significance.

2006 Thomson/South-Western

26

Testing for Significance: F Test


Hypotheses:      H0: β1 = β2 = . . . = βp = 0
                 Ha: One or more of the parameters is not equal to zero.

Test Statistic:  F = MSR/MSE

Rejection Rule:  Reject H0 if p-value < α or if F > Fα,
                 where Fα is based on an F distribution with p d.f. in the
                 numerator and n − p − 1 d.f. in the denominator.

2006 Thomson/South-Western

27

F Test for Overall Significance


Hypotheses:      H0: β1 = β2 = 0
                 Ha: One or both of the parameters is not equal to zero.

Rejection Rule:  For α = .05 and d.f. = 2, 17: F.05 = 3.59
                 Reject H0 if p-value < .05 or F > 3.59

2006 Thomson/South-Western

28
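As a quick check of these numbers (our own, not part of the slides) using SciPy's F distribution:

from scipy import stats

f_crit = stats.f.ppf(0.95, dfn=2, dfd=17)    # critical value F.05 ≈ 3.59
p_value = stats.f.sf(42.76, dfn=2, dfd=17)   # upper-tail p-value ≈ 2.3E-07 (Significance F)
print(round(f_crit, 2), p_value)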

F Test for Overall Significance

Excel's ANOVA Output

ANOVA
             df    SS          MS          F          Significance F
Regression    2    500.3285    250.1643    42.76013   2.32774E-07
Residual     17     99.45697     5.85041
Total        19    599.7855

The p-value (Significance F) is used to test for overall significance.

2006 Thomson/South-Western

29

F Test for Overall Significance


Test Statistic:  F = MSR/MSE = 250.16/5.85 = 42.76

Conclusion:      p-value < .05, so we can reject H0.
                 (Also, F = 42.76 > 3.59)

2006 Thomson/South-Western

30

Testing for Significance: t Test


Hypotheses:      H0: βi = 0
                 Ha: βi ≠ 0

Test Statistic:  t = bi / sbi

Rejection Rule:  Reject H0 if p-value < α or
                 if t < −tα/2 or t > tα/2, where tα/2
                 is based on a t distribution
                 with n − p − 1 degrees of freedom.

2006 Thomson/South-Western

31

t Test for Significance of Individual Parameters

Hypotheses:      H0: βi = 0
                 Ha: βi ≠ 0

Rejection Rule:  For α = .05 and d.f. = 17, t.025 = 2.11
                 Reject H0 if p-value < .05 or if |t| > 2.11

2006 Thomson/South-Western

32

t Test for Significance of Individual Parameters

Excel's Regression Equation Output

              Coeffic.   Std. Err.   t Stat   P-value
Intercept     3.17394    6.15607     0.5156   0.61279
Experience    1.4039     0.19857     7.0702   1.9E-06
Test Score    0.25089    0.07735     3.2433   0.00478

Note: Columns F-I are not shown.

The t statistic and p-value in the Experience row are used to test for the individual significance of Experience.

2006 Thomson/South-Western

33

t Test for Significance of Individual Parameters

Excel's Regression Equation Output

              Coeffic.   Std. Err.   t Stat   P-value
Intercept     3.17394    6.15607     0.5156   0.61279
Experience    1.4039     0.19857     7.0702   1.9E-06
Test Score    0.25089    0.07735     3.2433   0.00478

Note: Columns F-I are not shown.

The t statistic and p-value in the Test Score row are used to test for the individual significance of Test Score.

2006 Thomson/South-Western

34

t Test for Significance of Individual Parameters

Test Statistics:  t = b1/sb1 = 1.4039/.1986 = 7.07
                  t = b2/sb2 = .25089/.07735 = 3.24

Conclusions:      Reject both H0: β1 = 0 and H0: β2 = 0.
                  Both independent variables are significant.

2006 Thomson/South-Western

35
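As a quick check of these numbers (our own, not part of the slides) using SciPy's t distribution with n − p − 1 = 17 degrees of freedom:

from scipy import stats

t_crit = stats.t.ppf(0.975, df=17)       # t.025 ≈ 2.11
p_exper = 2 * stats.t.sf(7.07, df=17)    # two-sided p-value ≈ 1.9E-06 for Experience
p_score = 2 * stats.t.sf(3.24, df=17)    # two-sided p-value ≈ 0.0048 for Test Score
print(round(t_crit, 2), p_exper, round(p_score, 4))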

Testing for Significance: Multicollinearity


The term multicollinearity refers to the correlation among the independent variables.

When the independent variables are highly correlated (say, |r| > .7), it is not possible to determine the separate effect of any particular independent variable on the dependent variable.

2006 Thomson/South-Western

36

Testing for Significance: Multicollinearity


If the estimated regression equation is to be used only for predictive purposes, multicollinearity is usually not a serious problem.

Every attempt should be made to avoid including independent variables that are highly correlated.

2006 Thomson/South-Western

37
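As a minimal sketch (our own illustration, not part of the slides), the pairwise correlation between the two independent variables in the salary example can be checked with numpy before fitting the model:

import numpy as np

experience = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3])
test_score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
                       88, 73, 75, 81, 74, 87, 79, 94, 70, 89])

r = np.corrcoef(experience, test_score)[0, 1]
print(round(r, 3))   # if |r| > .7, the separate effects of x1 and x2 are hard to untangle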

Using the Estimated Regression Equation for Estimation and Prediction

The procedures for estimating the mean value of y and predicting an individual value of y in multiple regression are similar to those in simple regression.

We substitute the given values of x1, x2, . . . , xp into the estimated regression equation and use the corresponding value of ŷ as the point estimate.

2006 Thomson/South-Western

38

Using the Estimated Regression Equation for Estimation and Prediction

The formulas required to develop interval estimates for the mean value of ŷ and for an individual value of y are beyond the scope of the textbook.

Software packages for multiple regression will often provide these interval estimates.

2006 Thomson/South-Western

39

Qualitative Independent Variables


In many situations we must work with qualitative independent variables such as gender (male, female), method of payment (cash, check, credit card), etc.

For example, x2 might represent gender where x2 = 0 indicates male and x2 = 1 indicates female.

In this case, x2 is called a dummy or indicator variable.

2006 Thomson/South-Western

40

Qualitative Independent Variables

Example: Programmer Salary Survey

As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or information systems.

The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($1000s) for each of the sampled 20 programmers are shown on the next slide.

2006 Thomson/South-Western

41

Qualitative Independent Variables

Exper.  Score  Degr.  Salary      Exper.  Score  Degr.  Salary
  4      78    No      24           9      88    Yes     38
  7     100    Yes     43           2      73    No      26.6
  1      86    No      23.7        10      75    Yes     36.2
  5      82    Yes     34.3         5      81    No      31.6
  8      86    Yes     35.8         6      74    No      29
 10      84    Yes     38           8      87    Yes     34
  0      75    No      22.2         4      79    No      30.1
  1      80    No      23.1         6      94    Yes     33.9
  6      83    No      30           3      70    No      28.2
  6      91    Yes     33           3      89    No      30

2006 Thomson/South-Western

42

Estimated Regression Equation

ŷ = b0 + b1x1 + b2x2 + b3x3

where:
ŷ = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if individual does not have a graduate degree
     1 if individual does have a graduate degree

x3 is a dummy variable

2006 Thomson/South-Western

43
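As a minimal sketch (our own illustration, not part of the slides), the Yes/No graduate-degree answers can be coded as the 0/1 dummy variable x3 and the three-variable model fit with numpy:

import numpy as np

degree = ["No", "Yes", "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes",
          "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "No", "No"]
x3 = np.array([1 if d == "Yes" else 0 for d in degree])   # 1 = has a graduate degree

# experience, test_score, salary are the 20-programmer columns shown earlier
# X = np.column_stack([np.ones(20), experience, test_score, x3])
# b, *_ = np.linalg.lstsq(X, salary, rcond=None)           # b0, b1, b2, b3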

Qualitative Independent Variables

Excel's Regression Statistics

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.920215239
R Square             0.846796085
Adjusted R Square    0.818070351
Standard Error       2.396475101
Observations         20

2006 Thomson/South-Western

44

Qualitative Independent Variables

Excel's ANOVA Output

ANOVA
             df    SS         MS          F          Significance F
Regression    3    507.896    169.2987    29.47866   9.41675E-07
Residual     16     91.88949    5.743093
Total        19    599.7855

2006 Thomson/South-Western

45

Qualitative Independent Variables

Excel's Regression Equation Output

              Coeffic.   Std. Err.   t Stat   P-value
Intercept     7.94485    7.3808      1.0764   0.2977
Experience    1.14758    0.2976      3.8561   0.0014
Test Score    0.19694    0.0899      2.1905   0.04364
Grad. Degr.   2.28042    1.98661     1.1479   0.26789

Note: Columns F-I are not shown.

The Grad. Degr. coefficient is not significant (p-value = .26789 > .05).

2006 Thomson/South-Western

46

Qualitative Independent Variables

Excel's Regression Equation Output

              Coeffic.   Low. 95%    Up. 95%    Low. 95.0%   Up. 95.0%
Intercept     7.94485    -7.701739   23.5914    -7.7017385   23.591436
Experience    1.14758     0.516695    1.77847    0.51669483   1.7784686
Test Score    0.19694     0.00635     0.38752    0.00634964   0.3875243
Grad. Degr.   2.28042    -1.931002    6.49185   -1.9310017    6.4918494

Note: Columns C-E are hidden.

2006 Thomson/South-Western

47

More Complex Qualitative Variables


If a qualitative variable has k levels, k − 1 dummy variables are required, with each dummy variable being coded as 0 or 1.

For example, a variable with levels A, B, and C could be represented by x1 and x2 values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.

Care must be taken in defining and interpreting the dummy variables.

2006 Thomson/South-Western

48

More Complex Qualitative Variables


For example, a variable indicating level of education could be represented by x1 and x2 values as follows:

Highest Degree    x1   x2
Bachelor's         0    0
Master's           1    0
Ph.D.              0    1

2006 Thomson/South-Western

49
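As a minimal sketch (our own illustration, not part of the slides), pandas can generate the k − 1 dummy variables for a qualitative variable with k = 3 levels:

import pandas as pd

df = pd.DataFrame({"degree": ["Bachelors", "Masters", "Ph.D.", "Masters"]})
dummies = pd.get_dummies(df["degree"], drop_first=True)   # drops one level as the baseline
print(dummies)   # indicator columns for Masters and Ph.D.; Bachelors is the (0, 0) baseline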

End of Chapter 13

2006 Thomson/South-Western

50
