
Polynomial regression models

Possible models for when the response function is curved

Uses of polynomial models


When the true response function really is a
polynomial function.
(Very common!) When the true response
function is unknown or complex, but a
polynomial function approximates the true
function well.

Example
What is the impact of exercise on the human immune system?
Is the amount of immunoglobin in blood (y) related to maximal oxygen uptake (x), possibly in a curved manner?

[Scatter plot: immunoglobin (mg) versus maximal oxygen uptake (ml/kg)]

A quadratic polynomial regression function

$$Y_i = \beta_0 + \beta_1 X_i + \beta_{11} X_i^2 + \varepsilon_i$$

where:
Yi = amount of immunoglobin in blood (mg)
Xi = maximal oxygen uptake (ml/kg)
typical assumptions about the error terms (INE: independent, normally distributed, equal variance)

Estimated quadratic function

igg = -1464.40 + 88.3071 oxygen - 0.536247 oxygen**2
S = 106.427   R-Sq = 93.8 %   R-Sq(adj) = 93.3 %

[Regression plot: igg versus oxygen with the fitted quadratic curve]

Interpretation of the regression coefficients
If 0 is a possible x value, then b0 is the predicted response at x = 0. Otherwise, b0 has no meaningful interpretation.
b1 does not have a very helpful interpretation. It is the slope of the tangent line at x = 0.
b2 indicates the up/down direction of the curve:
b2 < 0 means the curve is concave down
b2 > 0 means the curve is concave up
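The interpretation of b1 and b2 can be checked numerically. The sketch below is only an illustration: it plugs in the estimated coefficients from the fitted equation igg = -1464.40 + 88.3071 oxygen - 0.536247 oxygen**2, and the function names (`fitted`, `tangent_slope`, `concavity`) are hypothetical helpers, not part of any software used in the slides.

```python
# Estimated coefficients of the fitted quadratic reported in the slides.
B0, B1, B2 = -1464.40, 88.3071, -0.536247


def fitted(x):
    """Predicted igg at oxygen uptake x."""
    return B0 + B1 * x + B2 * x ** 2


def tangent_slope(x):
    """Derivative of the fitted quadratic: b1 + 2 * b2 * x."""
    return B1 + 2 * B2 * x


def concavity(b2):
    """Direction of the curve, read off the sign of the quadratic coefficient."""
    return "concave down" if b2 < 0 else "concave up"


slope_at_zero = tangent_slope(0.0)   # equals b1 by construction
slope_at_50 = tangent_slope(50.0)    # slope inside the observed x range

print(f"slope at x = 0:  {slope_at_zero:.4f}")
print(f"slope at x = 50: {slope_at_50:.4f}")
print(f"curve is {concavity(B2)}")
```

The slope at x = 0 reproduces b1 exactly, which is why b1 is hard to interpret when 0 lies far outside the observed oxygen values.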

The regression equation is
igg = - 1464 + 88.3 oxygen - 0.536 oxygensq

Predictor      Coef   SE Coef      T      P   VIF
Constant    -1464.4     411.4  -3.56  0.001
oxygen        88.31     16.47   5.36  0.000  99.9
oxygensq    -0.5362    0.1582  -3.39  0.002  99.9

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       2  4602211  2301105  203.16  0.000
Residual Error  27   305818    11327
Total           29  4908029

Source     DF   Seq SS
oxygen      1  4472047
oxygensq    1   130164
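The summary statistics in this output follow directly from the analysis of variance table. As a rough sketch (variable names are my own, and the formulas are the standard definitions rather than anything specific to the software), S, R-Sq, R-Sq(adj) and F can be recomputed from the reported sums of squares and degrees of freedom:

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table in the output.
SS_REGRESSION, DF_REGRESSION = 4602211.0, 2
SS_ERROR, DF_ERROR = 305818.0, 27
SS_TOTAL, DF_TOTAL = 4908029.0, 29

mse = SS_ERROR / DF_ERROR            # mean square error
msr = SS_REGRESSION / DF_REGRESSION  # mean square regression

s = math.sqrt(mse)                            # residual standard error S
f_stat = msr / mse                            # overall F statistic
r_sq = SS_REGRESSION / SS_TOTAL               # R-Sq
r_sq_adj = 1.0 - mse / (SS_TOTAL / DF_TOTAL)  # R-Sq(adj)

print(f"S = {s:.1f}")
print(f"F = {f_stat:.2f}")
print(f"R-Sq = {100 * r_sq:.1f}%")
print(f"R-Sq(adj) = {100 * r_sq_adj:.1f}%")
```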

A multicollinearity problem

[Scatter plot: oxygensq versus oxygen]

Pearson correlation of oxygen and oxygensq = 0.995

Center the predictors

Mean of oxygen = 50.637

oxygen   oxcent   oxcentsq
  34.6  -16.037    257.185
  45.0   -5.637     31.776
  62.3   11.663    136.026
  58.9    8.263     68.277
  42.5   -8.137     66.211
  44.3   -6.337     40.158
  67.9   17.263    298.011
  58.5    7.863     61.827
  35.6  -15.037    226.111
  49.6   -1.037      1.075
  33.0  -17.637    311.064

OxCent = Oxygen - 50.637
OxCentSq = (Oxygen - 50.637)^2

Does it really work?

[Scatter plot: oxcentsq versus oxcent]

Pearson correlation of oxcent and oxcentsq = 0.219
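The effect of centering can be reproduced on a small scale. The sketch below uses only the 11 oxygen values listed in the slides (the full data set has 30 observations), so its correlations will not match the reported 0.995 and 0.219 exactly; it only illustrates the direction of the effect. The `pearson` helper is a hypothetical hand-written function, and the mean 50.637 is the full-sample mean quoted in the slides.

```python
import math


def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


# The 11 rows of oxygen shown in the slides (a subset of the 30 observations).
oxygen = [34.6, 45.0, 62.3, 58.9, 42.5, 44.3, 67.9, 58.5, 35.6, 49.6, 33.0]
MEAN_OXYGEN = 50.637  # mean of all 30 observations, as reported

oxcent = [x - MEAN_OXYGEN for x in oxygen]

r_raw = pearson(oxygen, [x ** 2 for x in oxygen])
r_centered = pearson(oxcent, [x ** 2 for x in oxcent])

print(f"corr(oxygen, oxygensq) on the subset = {r_raw:.3f}")
print(f"corr(oxcent, oxcentsq) on the subset = {r_centered:.3f}")
```

On this subset the raw predictor and its square are almost perfectly correlated, while the centered predictor and its square are only weakly correlated.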

A better quadratic polynomial regression function

$$Y_i = \beta_0^* + \beta_1^* x_i + \beta_{11}^* x_i^2 + \varepsilon_i$$

where $x_i = X_i - \bar{X}$ denotes the centered predictor, and

$\beta_0^*$ = mean response at the predictor mean
$\beta_1^*$ = linear effect coefficient
$\beta_{11}^*$ = quadratic effect coefficient

The regression equation is
igg = 1632 + 34.0 oxcent - 0.536 oxcentsq

Predictor      Coef   SE Coef      T      P  VIF
Constant    1632.20     29.35  55.61  0.000
oxcent       34.000     1.689  20.13  0.000  1.1
oxcentsq    -0.5362    0.1582  -3.39  0.002  1.1

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       2  4602211  2301105  203.16  0.000
Residual Error  27   305818    11327
Total           29  4908029

Source     DF   Seq SS
oxcent      1  4472047
oxcentsq    1   130164
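With only two predictors in the model, the variance inflation factor of each one is 1 / (1 - r^2), where r is the Pearson correlation between the two predictors. The sketch below applies that identity to the correlations reported in the slides (0.995 before centering, 0.219 after). Because those correlations are rounded, the results only approximately match the VIF values in the output; `vif_two_predictors` is a hypothetical helper name.

```python
def vif_two_predictors(r):
    """VIF when the model has exactly two predictors with correlation r."""
    return 1.0 / (1.0 - r ** 2)


# Correlations reported in the slides (rounded to three decimals).
R_UNCENTERED = 0.995  # oxygen versus oxygensq
R_CENTERED = 0.219    # oxcent versus oxcentsq

vif_uncentered = vif_two_predictors(R_UNCENTERED)
vif_centered = vif_two_predictors(R_CENTERED)

print(f"VIF before centering: about {vif_uncentered:.0f}")
print(f"VIF after centering:  about {vif_centered:.1f}")
```

The standard errors tell the same story: the standard error of the linear coefficient falls from 16.47 to 1.689 after centering, while the quadratic coefficient and its standard error are unchanged.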

Interpretation of the regression coefficients
b0 is the predicted response at the predictor mean.
b1 is the estimated slope of the tangent line at the predictor mean; typically it is also close to the estimated slope in the simple linear model.
b2 indicates the up/down direction of the curve:
b2 < 0 means the curve is concave down
b2 > 0 means the curve is concave up

Estimated regression function

igg = 1632.20 + 33.9995 oxcent - 0.536247 oxcent**2
S = 106.427   R-Sq = 93.8 %   R-Sq(adj) = 93.3 %

[Regression plot: igg versus oxcent with the fitted quadratic curve]

Similar estimates

igg = 1557.63 + 32.7427 oxcent
S = 124.783   R-Sq = 91.1 %   R-Sq(adj) = 90.8 %

[Regression plot: igg versus oxcent with the fitted straight line]

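The relationship between the two forms of the model

Original model:

$$\hat{Y}_i = b_0 + b_1 X_i + b_{11} X_i^2$$

Centered model:

$$\hat{Y}_i = b_0^* + b_1^* x_i + b_{11}^* x_i^2$$

Where:

$$b_0 = b_0^* - b_1^* \bar{X} + b_{11}^* \bar{X}^2$$
$$b_1 = b_1^* - 2 b_{11}^* \bar{X}$$
$$b_{11} = b_{11}^*$$

Centered fit:

$$\hat{Y}_i = 1632.2 + 34.0 x_i - 0.5362 x_i^2$$

Mean of oxygen = 50.637

$$b_0 = 1632.2 - 34(50.637) - 0.5362(50.637)^2 = -1464.3$$
$$b_1 = 34 - 2(-0.5362)(50.637) = 88.3$$
$$b_{11} = -0.5362$$

Original fit:

$$\hat{Y}_i = -1464.4 + 88.31 X_i - 0.536 X_i^2$$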
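The back-transformation from the centered coefficients to the original ones is easy to verify by machine. This is a minimal sketch under the assumption that the rounded values from the centered output (1632.2, 34.0, -0.5362) and the reported mean 50.637 are used; the function name `uncenter_quadratic` is hypothetical.

```python
def uncenter_quadratic(b0_star, b1_star, b11_star, x_bar):
    """Convert coefficients of a quadratic in (X - x_bar) to a quadratic in X.

    Expands b0* + b1*(X - x_bar) + b11*(X - x_bar)^2 and collects powers of X.
    """
    b0 = b0_star - b1_star * x_bar + b11_star * x_bar ** 2
    b1 = b1_star - 2 * b11_star * x_bar
    b11 = b11_star
    return b0, b1, b11


# Rounded estimates from the centered fit and the reported predictor mean.
b0, b1, b11 = uncenter_quadratic(1632.2, 34.0, -0.5362, 50.637)

print(f"b0  = {b0:.1f}")
print(f"b1  = {b1:.1f}")
print(f"b11 = {b11:.4f}")
```

The small gap between the recovered intercept (about -1464.3) and the reported -1464.4 comes from rounding in the inputs.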

[Residuals versus the fitted values (response is igg)]

[Normal probability plot of the residuals (response is igg)]

What is the predicted IgG if maximal oxygen uptake is 90?

Predicted Values for New Observations
New Obs     Fit  SE Fit          95.0% CI           95.0% PI
1        2139.6   219.2  (1689.8, 2589.5)  (1639.6, 2639.7)  XX
X  denotes a row with X values away from the center
XX denotes a row with very extreme X values

Values of Predictors for New Observations
New Obs  oxcent  oxcentsq
1          39.4      1549

There is an even greater danger in extrapolation when modeling data with a polynomial function, because of changes in direction.
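The change in direction can be located explicitly. As an illustrative sketch (using the estimated coefficients from the fitted equation igg = -1464.40 + 88.3071 oxygen - 0.536247 oxygen**2; the helper names are my own), the fitted parabola turns at x = -b1 / (2 b11), and the prediction at 90 lies beyond that turning point and far beyond the observed oxygen values of roughly 30 to 70:

```python
# Estimated coefficients of the uncentered quadratic fit from the slides.
B0, B1, B11 = -1464.40, 88.3071, -0.536247


def predict(x):
    """Fitted igg at maximal oxygen uptake x."""
    return B0 + B1 * x + B11 * x ** 2


# A quadratic changes direction where its derivative b1 + 2*b11*x is zero.
turning_point = -B1 / (2 * B11)

prediction_at_90 = predict(90.0)

print(f"fitted curve turns at oxygen = {turning_point:.1f}")
print(f"predicted igg at oxygen = 90: {prediction_at_90:.1f}")
print(f"predicted igg at the turning point: {predict(turning_point):.1f}")
```

Past the turning point the fitted curve predicts that igg decreases as oxygen uptake increases, a pattern that no observation in the data supports.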

It is possible to overfit the data with polynomial models.

y = -38.4 + 34.9762 x - 8.64286 x**2 + 0.666667 x**3
S = 2.62950   R-Sq = 64.0 %   R-Sq(adj) = 0.0 %

[Regression plot: y versus x with the fitted cubic curve]

It is even theoretically possible to fit the data perfectly.
If you have n data points with distinct x values, then a polynomial of order n-1 will fit the data perfectly, that is, it will pass through each data point.
But good statistical software will keep an unsuspecting user from fitting such a model.
** Error ** Not enough non-missing observations
to fit a polynomial of this order; execution
aborted
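The perfect fit is a property of interpolation, not of the data. The sketch below uses four made-up points (hypothetical values, not from the slides) and evaluates the unique cubic through them with Lagrange's formula, a technique swapped in here for illustration instead of least squares; every residual is zero regardless of what the y values are.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the degree n-1 polynomial through (xs[i], ys[i]) at x.

    Requires the xs to be distinct.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Hypothetical data: n = 4 points, so a cubic (order n - 1 = 3) fits exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, -1.0, 4.0, 2.0]

residuals = [y - lagrange_eval(xs, ys, x) for x, y in zip(xs, ys)]
print("residuals:", residuals)

# Between the data points the curve can swing well away from the data.
print("value at x = 2.5:", lagrange_eval(xs, ys, 2.5))
```

With zero residuals there are no degrees of freedom left to estimate the error variance, which is why software refuses to fit such a model.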

The hierarchical approach to model fitting
A widely accepted approach is to fit a higher-order model and then explore whether a lower-order (simpler) model is adequate.

$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \beta_{111} x_i^3 + \varepsilon_i$$

Is a first-order linear model (line) adequate?

$$H_0: \beta_{11} = \beta_{111} = 0$$
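A hypothesis of this kind is usually tested by comparing the error sum of squares of the full model with that of the reduced model (a partial F-test). The slides report no sums of squares for a cubic fit, so the sketch below instead illustrates the calculation on the quadratic igg example, testing whether the quadratic term can be dropped; the function name `partial_f` is hypothetical, and the reduced-model SSE is obtained by adding the sequential sum of squares of the squared term back to the residual sum of squares.

```python
import math


def partial_f(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic comparing a reduced model with the full model.

    df_* are the error degrees of freedom of each model.
    """
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator


# Quadratic igg fit: residual SS and error df from the ANOVA table.
SSE_FULL, DF_FULL = 305818.0, 27
# Dropping the squared term returns its sequential SS to the error.
SEQ_SS_QUADRATIC = 130164.0
SSE_REDUCED, DF_REDUCED = SSE_FULL + SEQ_SS_QUADRATIC, DF_FULL + 1

f_stat = partial_f(SSE_REDUCED, DF_REDUCED, SSE_FULL, DF_FULL)
s_reduced = math.sqrt(SSE_REDUCED / DF_REDUCED)

print(f"partial F for the quadratic term = {f_stat:.2f}")
print(f"S of the straight-line model = {s_reduced:.3f}")
```

Two cross-checks against the reported output: the partial F for a single term equals the square of its t statistic (-3.39 squared is about 11.49), and the residual standard error of the reduced model reproduces the S = 124.783 reported for the straight-line fit.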

The hierarchical approach to model fitting
If a polynomial term of a given order is retained, then all related lower-order terms are also retained.
That is, if the quadratic term is significant, you would use this regression function:

$$E(Y_i) = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2$$

and not this one:

$$E(Y_i) = \beta_0 + \beta_{11} x_i^2$$

Example
Quality of a product (y): a score between 0 and 100
Temperature (x1): degrees Fahrenheit
Pressure (x2): pounds per square inch

[Matrix plot of quality, temp, and pressure]

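A two-predictor, second-order polynomial regression function

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_{11} X_{i1}^2 + \beta_{22} X_{i2}^2 + \beta_{12} X_{i1} X_{i2} + \varepsilon_i$$

where:
$Y_i$ = quality
$X_{i1}$ = temperature
$X_{i2}$ = pressure
$\beta_{12}$ = interaction effect coefficient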
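In practice the squared and cross-product terms are simply extra columns computed from the two predictors. The sketch below shows one way to build a row of the design matrix for this model; the function name and the sample temperature and pressure values are hypothetical, chosen only to make the arithmetic visible.

```python
def second_order_terms(x1, x2):
    """Design-matrix row for a full second-order model in two predictors.

    Column order: intercept, x1, x2, x1^2, x2^2, x1*x2.
    """
    return [1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2]


# Hypothetical settings (temperature, pressure), not taken from the data set.
settings = [(80.0, 50.0), (90.0, 55.0), (100.0, 60.0)]

design_matrix = [second_order_terms(temp, pressure) for temp, pressure in settings]

for row in design_matrix:
    print(row)
```

The same function applied to centered predictors gives the columns of the centered model.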

The regression equation is
quality = - 5128 + 31.1 temp + 140 pressure
          - 0.133 tempsq - 1.14 presssq
          - 0.145 tp

Predictor       Coef   SE Coef       T      P     VIF
Constant     -5127.9     110.3  -46.49  0.000
temp          31.096     1.344   23.13  0.000  1154.5
pressure     139.747     3.140   44.50  0.000  1574.5
tempsq     -0.133389  0.006853  -19.46  0.000   973.0
presssq     -1.14422   0.02741  -41.74  0.000  1453.0
tp         -0.145500  0.009692  -15.01  0.000   304.0

S = 1.679   R-Sq = 99.3%   R-Sq(adj) = 99.1%

Again, some correlation

          quality    temp  pressure  tempsq  presssq
temp       -0.423
pressure    0.182   0.000
tempsq     -0.434   0.999     0.000
presssq     0.162   0.000     1.000  -0.000
tp         -0.227   0.773     0.632   0.772    0.632

Cell Contents: Pearson correlation

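A better two-predictor, second-order polynomial regression function

$$Y_i = \beta_0^* + \beta_1^* x_{i1} + \beta_2^* x_{i2} + \beta_{11}^* x_{i1}^2 + \beta_{22}^* x_{i2}^2 + \beta_{12}^* x_{i1} x_{i2} + \varepsilon_i$$

where:
$Y_i$ = quality
$x_{i1}$ = centered temperature
$x_{i2}$ = centered pressure
$\beta_{12}^*$ = interaction effect coefficient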

Reduced correlation

         quality   tcent   pcent  tpcent  tcentsq
tcent     -0.423
pcent      0.182   0.000
tpcent    -0.274   0.000   0.000
tcentsq   -0.355  -0.000   0.000   0.000
pcentsq   -0.762   0.000   0.000   0.000   -0.000

Cell Contents: Pearson correlation

The regression equation is
quality = 94.9 - 0.916 tcent + 0.788 pcent
          - 0.146 tpcent - 0.133 tcentsq
          - 1.14 pcentsq

Predictor       Coef   SE Coef       T      P  VIF
Constant     94.9259    0.7224  131.40  0.000
tcent       -0.91611   0.03957  -23.15  0.000  1.0
pcent        0.78778   0.07913    9.95  0.000  1.0
tpcent     -0.145500  0.009692  -15.01  0.000  1.0
tcentsq    -0.133389  0.006853  -19.46  0.000  1.0
pcentsq     -1.14422   0.02741  -41.74  0.000  1.0

S = 1.679   R-Sq = 99.3%   R-Sq(adj) = 99.1%

[Residuals versus the fitted values (response is quality)]

[Normal probability plot of the residuals (response is quality)]

Predicted Values for New Observations
New Obs     Fit  SE Fit          95.0% CI           95.0% PI
1        94.926   0.722  (93.424, 96.428)  (91.125, 98.726)

Values of Predictors for New Observations
New Obs   tcent   pcent  tpcent  tcentsq  pcentsq
1        0.0000  0.0000  0.0000   0.0000   0.0000
