A deterministic model does not have the random deviation component e, while a probabilistic
model does contain such a component.
Let
y = total number of goods purchased at a service station which sells only one grade of gas and one type of motor oil
x1 = gallons of gasoline purchased
x2 = number of quarts of motor oil purchased.
Then y is related to x1 and x2 in a deterministic fashion.
Let
y = IQ of a child
x1 = age of the child
x2 = total years of education of the parents.
Then y is related to x1 and x2 in a probabilistic fashion.
14.2
(mean y value for fixed values of x1, x2, x3) = 30 + 0.90x1 + 0.08x2 − 4.5x3
The average change in acceptable load associated with a 1-cm increase in left lateral bending, when grip endurance and trunk extension ratio are held fixed, is 0.90 kg.
The average change in acceptable load associated with a 1 N/kg increase in trunk extension ratio, when grip endurance and left lateral bending are held fixed, is −4.5 kg (a decrease of 4.5 kg).
f
P(13.5 < y < 33.5) = P((13.5 − 23.5)/5 < z < (33.5 − 23.5)/5) = P(−2 < z < 2) = 0.9544

14.3
14.4
For a Hispanic woman the value of x4 would be set to 0. The predicted ecology score is
y = 3.60 − 0.01(25 × 10) + 0.01(32) − 0.07(0) + 0.12(0) − 0.02(16) − 0.04(1) − 0.01(3) − 0.04(0) − 0.02(0) = 1.03
The coefficient of x3, i.e., β3, is equal to the mean difference in ecology score for men and women (men minus women), given that the other variables are the same. Its estimated value is −0.07.
Given that the other variables are the same, the increase in the mean ecology score associated with an increase in income of 1000 dollars is equal to β2, the coefficient of x2 in the regression equation. Its estimated value is 0.01.
Ideology and social class are categorical (qualitative) variables. An appropriate way of incorporating these variables in a regression model is to define indicator or dummy variables. Each of these variables takes on five different values, so four dummy variables need to be defined for each of ideology and social class. For instance, for ideology, the four dummy variables may be defined as follows:
dummy variable 1 = 1 if the individual is conservative and 0 otherwise;
dummy variable 2 = 1 if the individual is right of center and 0 otherwise;
dummy variable 3 = 1 if the individual is middle of the road and 0 otherwise;
dummy variable 4 = 1 if the individual is left of center and 0 otherwise.
For social class the dummy variables may be defined as follows:
dummy variable 1 = 1 if the individual belongs to the upper class and 0 otherwise;
dummy variable 2 = 1 if the individual belongs to the upper middle class
and 0 otherwise;
dummy variable 3 = 1 if the individual belongs to the middle class and 0 otherwise;
dummy variable 4 = 1 if the individual belongs to the lower middle class
and 0 otherwise.
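The dummy-variable scheme above can be sketched in code. This is a minimal illustration; the fifth ideology category is assumed here to be "liberal" (the baseline category's name is not given in the problem), and the helper name `ideology_dummies` is ours.

```python
# Encode the five-category ideology variable as four 0/1 dummy variables,
# following the scheme described above. The fifth (baseline) category is
# assumed here to be "liberal"; it maps to all four dummies being 0.
IDEOLOGY_LEVELS = ["conservative", "right of center", "middle of the road",
                   "left of center", "liberal"]

def ideology_dummies(label):
    """Return [dummy1, dummy2, dummy3, dummy4] for one individual."""
    if label not in IDEOLOGY_LEVELS:
        raise ValueError(f"unknown ideology: {label}")
    # One dummy per non-baseline category.
    return [1 if label == level else 0 for level in IDEOLOGY_LEVELS[:4]]

print(ideology_dummies("right of center"))  # -> [0, 1, 0, 0]
print(ideology_dummies("liberal"))          # -> [0, 0, 0, 0]
```

The same pattern applies to social class, with the remaining fifth class serving as the baseline that maps to all four dummies being 0.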
14.5
β1 = −6.6. Hence 6.6 is the expected decrease in yield associated with a one-unit increase in mean temperature between date of coming into hop and date of picking when the mean percentage of sunshine remains fixed.
β2 = −4.5. So 4.5 is the expected decrease in yield associated with a one-unit increase in mean percentage of sunshine when mean temperature remains fixed.
14.6
β2 = −1.40. The expected decrease in error percentage associated with a one-unit increase in character subtense, when level of backlight, viewing angle, and level of ambient light remain fixed, is equal to 1.40.
β3 = 0.02. The expected increase in error percentage associated with a one-unit increase in viewing angle, when level of backlight, character subtense, and level of ambient light remain fixed, is equal to 0.02.
14.7
The mean chlorine content at x = 8 is 564, while at x = 10 it is 570. So the mean chlorine
content is higher for x = 10 than for x = 8.
14.8
May 6 is 16 days after April 20. The values of x1 and x2 are 16 and 41180, respectively.
21.09 + 0.653(16) + 0.0022(41180) − 0.0206(16)² + 0.00004(41180)² = 67948.5564
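The arithmetic above is easy to check numerically; a quick verification of the Exercise 14.8 prediction:

```python
# Verify the prediction above: x1 = 16 (days after April 20) and x2 = 41180,
# plugged into the fitted equation (note the negative sign on the 0.0206 term).
x1, x2 = 16, 41180
y_hat = 21.09 + 0.653*x1 + 0.0022*x2 - 0.0206*x1**2 + 0.00004*x2**2
print(round(y_hat, 4))  # -> 67948.5564
```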
14.9
The parallel lines in each graph are attributable to the lack of interaction between the two
independent variables.
x2 = 1 or 0 (an indicator variable)
x3 = 1 or 0 (an indicator variable)
14.11
14.12
14.13
The additional predictors needed are the interaction terms x1x2 and x1x3.
No, because when an interaction term is present, changing the value of one independent
variable will also change the value of the interaction term. Thus, you cannot change one
model predictor while holding all the rest fixed.
Three dummy variables would be needed to incorporate a non-numerical variable with four
categories.
x3 = 1 if the car is a sub-compact, 0 otherwise
x4 = 1 if the car is a compact, 0 otherwise
x5 = 1 if the car is a midsize, 0 otherwise
y = α + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + e
14.14
x6 = x1x3, x7 = x1x4, and x8 = x1x5 are the additional predictors needed to incorporate
interaction between age and size class.
Yes, we can calculate the exact volume if we know the exact values of width (i.e., diameter) and
height of cylindrical cans.
a
The relationship between the volume of a cylindrical can and its width and height is a
deterministic one. So, in principle, there is no need for regression. However, if the cans are
not exactly cylindrical (they never are), then a suitably chosen regression model can be
useful.
b3 = −0.0096 is the estimated change in mean VO2 max associated with a one-unit increase in one-mile walk time when the values of the other predictor variables are fixed. The change is actually a decrease since the regression coefficient is negative.
b1 = 0.6566 is the estimated difference in mean VO2 max for males versus females (male minus
female) when the values of the other predictor variables are fixed.
R² = 1 − SSResid/SSTo = 1 − 30.1033/102.3922 = 1 − 0.294 = 0.706
s²e = SSResid/(n − (k + 1)) = 30.1033/(196 − 5) = 30.1033/191 = 0.157609
se = √0.157609 = 0.397
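These summary statistics follow directly from the sums of squares; a sketch of the computation, with the values as given above (n = 196 and k = 4 are read off the n − (k + 1) = 191 denominator):

```python
import math

# R-squared, estimated error variance, and error standard deviation from
# SSResid and SSTo, as computed above.
ss_resid, ss_to, n, k = 30.1033, 102.3922, 196, 4

r_sq = 1 - ss_resid / ss_to        # proportion of observed variation explained
s2_e = ss_resid / (n - (k + 1))    # estimated error variance
s_e = math.sqrt(s2_e)              # estimated error standard deviation

print(round(r_sq, 3), round(s2_e, 6), round(s_e, 3))  # -> 0.706 0.157609 0.397
```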
14.16
b1 = −2.18 is the estimated change in mean fish intake associated with a one-unit increase in water temperature when the values of the other predictor variables are fixed. Since the sign of b1 is negative, the change is actually a decrease.
b4 = 2.32 is the estimated change (increase) in mean fish intake associated with a one-unit
increase in speed
when the value of the other predictor variables are fixed.
R² = SSRegr/SSTo = 1486.9/(1486.9 + 2230.2) = 1486.9/3717.1 = 0.4
s²e = SSResid/(n − (k + 1)) = 2230.2/(26 − 5) = 2230.2/21 = 106.2
se = √106.20 = 10.305
Adjusted R² = 1 − [(n − 1)/(n − (k + 1))](SSResid/SSTo) = 1 − (25/21)(2230.2/3717.1) = 1 − 0.7143 = 0.2857
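The adjusted R² computation above can be sketched the same way:

```python
# Adjusted R-squared for Exercise 14.16 from the quantities above
# (SSResid = 2230.2, SSTo = 3717.1, n = 26 observations, k = 4 predictors).
ss_resid, ss_to, n, k = 2230.2, 3717.1, 26, 4

adj_r_sq = 1 - ((n - 1) / (n - (k + 1))) * (ss_resid / ss_to)
print(round(adj_r_sq, 4))  # -> 0.2857
```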
14.18
14.19
P-value = 0.01
df1 = k = 5, df2 = n − (k + 1) = 100 − 6 = 94; using df2 = 90, 0.05 > P-value > 0.01
The fitted model was y = α + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + e.
Ho: β1 = β2 = β3 = β4 = β5 = β6 = 0
Ha: at least one among β1, …, β6 is not zero
α = 0.01
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.83/6)/[(1 − 0.83)/30] = 24.41
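This same F computation recurs throughout the chapter; a small helper (the function name is ours, not the text's):

```python
# Model utility F statistic computed from R-squared:
# F = (R^2 / k) / [(1 - R^2) / (n - (k + 1))].
def model_utility_f(r_sq, k, error_df):
    return (r_sq / k) / ((1 - r_sq) / error_df)

# The test above: R^2 = 0.83, k = 6, n - (k + 1) = 30.
print(round(model_utility_f(0.83, 6, 30), 2))  # -> 24.41
```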
The fitted model was y = α + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + e.
Ho: β1 = β2 = β3 = β4 = β5 = β6 = β7 = 0
Ha: at least one among β1, …, β7 is not zero
α = 0.05
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.543/7)/[(1 − 0.543)/202] = 34.286
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.706/4)/[(1 − 0.706)/191] = 114.66
R² = 0.40
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.4/4)/[(1 − 0.4)/21] = 3.5
Ho: β1 = β2 = β3 = β4 = 0
Ha: at least one among β1, β2, β3, β4 is not zero
α = 0.01
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.908/4)/[(1 − 0.908)/26] = 64.15
R2 = 0.908. This means that 90.8% of the variation in the observed tar content values has been
explained by the fitted model.
se = 4.784. This means that the typical distance of an observation from the corresponding
mean value is 4.784.
14.24
The fitted model was y = α + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + β8x8 + β9x9 + e.
Ho: β1 = β2 = β3 = β4 = β5 = β6 = β7 = β8 = β9 = 0
Ha: at least one among β1, …, β9 is not zero
α = 0.05 (a value for α is not given in the problem; we use 0.05 for illustration)
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.06/9)/[(1 − 0.06)/1126] = 7.98582
R-Sq = 59.3%   R-Sq(adj) = 53.5%

Analysis of Variance
Source          DF      SS     MS      F      P
Regression       2  181364  90682  10.21  0.002
Residual Error  14  124331   8881
Total           16  305695

b
Ho: β1 = β2 = 0
Ha: At least one of the two βi's is not zero.
F = (SSRegr/k)/(SSResid/[n − (k + 1)]) = (181364/2)/(124331/14) = 10.21
From the Minitab output, the P-value = 0.002. Since the P-value is less than 0.05 (we have chosen α = 0.05 for illustration) the null hypothesis is rejected. The data suggests that the
multiple regression model is useful for predicting weight.
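The ANOVA-table F ratio above can be checked from the sums of squares alone:

```python
# F ratio for Exercise 14.25 from the ANOVA table above:
# F = (SSRegr / k) / (SSResid / [n - (k + 1)]).
ss_regr, ss_resid, k, error_df = 181364, 124331, 2, 14

f_stat = (ss_regr / k) / (ss_resid / error_df)
print(round(f_stat, 2))  # -> 10.21
```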
14.26
Using MINITAB to fit the required regression model yields the following output.
R-Sq = 75.0%   R-Sq(adj) = 71.9%

Analysis of Variance
Source          DF       SS       MS      F      P
Regression       2  0.41617  0.20809  24.02  0.000
Residual Error  16  0.13861  0.00866
Total           18  0.55478
Hence the fitted regression model is
catch_time = 1.44 - 0.0523 prey_length + 0.00397 prey_speed
b
We test Ho: 1 = 2 = 0 versus Ha: At least one of the two i's is not zero, using = 0.05.
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.75/2)/[(1 − 0.75)/16] = 24.02
(This F value can be read directly from the Analysis of Variance Table). From the MINITAB
output we note that the P-value is equal to 0.000 (to 3 decimals). Hence the null hypothesis is
rejected and we conclude that the multiple regression model is useful for predicting catch
time.
d
Using x = length/speed we obtain the following fitted regression equation from MINITAB
along with other useful output.
Predictor      Coef  SE Coef      T      P
Constant    1.58648  0.04803  33.03  0.000
x           -1.4044   0.3124  -4.50  0.000

S = 0.1221   R-Sq = 54.3%   R-Sq(adj) = 51.6%

Analysis of Variance
Source          DF       SS       MS      F      P
Regression       1  0.30135  0.30135  20.22  0.000
Residual Error  17  0.25342  0.01491
Total           18  0.55478
14.27
The model in part a has R2 = 0.75 and R2 (adj) = 0.719, whereas the model in part d has R2 =
0.543 and R2 (adj) = 0.516. Based on this information it appears that the model from part a
may be recommended for predicting catch time.
Using MINITAB to fit the required regression model yields the following output.
The regression equation is
volume = - 859 + 23.7 minwidth + 226 maxwidth + 225 elongation
Predictor       Coef  SE Coef      T      P
Constant      -859.2    272.9  -3.15  0.005
minwidth       23.72    85.66   0.28  0.784
maxwidth      225.81    85.76   2.63  0.015
elongation    225.24    90.65   2.48  0.021

S = 287.0   R-Sq = 67.6%   R-Sq(adj) = 63.4%

Analysis of Variance
Source          DF       SS       MS      F      P
Regression       3  3960700  1320233  16.03  0.000
Residual Error  23  1894141    82354
Total           26  5854841
b
Adjusted R2 takes into account the number of predictors used in the model whereas R2 does
not do so. In particular, adjusted R2 enables us to make a fair comparison of the
performances of models with differing numbers of predictors.
We test
Ho: β1 = β2 = β3 = 0
Ha: At least one of the three βi's is not zero.
α = 0.05 (for illustration)
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.676/3)/[(1 − 0.676)/23] = 16.03
The corresponding P-value is 0.000 (correct to 3 decimals). Since the P-value is less than α,
the null hypothesis is rejected. There does appear to be a useful linear relationship between y
and at least one of the three predictors.
14.28
Ho: β1 = β2 = β3 = 0
Ha: At least one of the three βi's is not zero.
α = 0.05
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.16/3)/[(1 − 0.16)/363] = 23.05
The fact that the test results are statistically significant when R2 = 0.16 is a bit surprising.
However, if n is quite large, even a model for which y is not very strongly related to the
predictors will be judged useful by the model utility test. This is analogous to what might
happen in the bivariate case if, for example, ρ = 0.02; with a large enough n,
Ho: ρ = 0
will be rejected in favor of the conclusion that there is a positive linear relationship.
0.16 = 1 − SSResid/SSTo, so SSResid/SSTo = 0.84.
Adjusted R² = 1 − [(n − 1)/(n − (k + 1))](SSResid/SSTo) = 1 − (366/363)(0.84) = 1 − 0.846942 = 0.153058
SSResid = 390.4347
SSTo = 7855.37 − 14(21.1071)² = 1618.2093
R² = 1 − SSResid/SSTo = 1227.7746/1618.2093 = 0.759
This means that 75.9 percent of the variation in the observed shear strength values has been
explained by the fitted model.
c
Ho: β1 = β2 = β3 = β4 = β5 = 0
Ha: at least one among β1, β2, β3, β4, β5 is not zero
α = 0.05
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.759/5)/[(1 − 0.759)/8] = 5.039
Ho: β1 = β2 = β3 = β4 = 0
Ha: at least one among β1, β2, β3, β4 is not zero
α = 0.05
F = (SSRegr/k)/(SSResid/[n − (k + 1)]) = (19.2/4)/(20/25) = 6
R² = SSRegr/SSTo = 19.2/39.2 = 0.4898
s²e = 20/25 = 0.8, se = √0.8 = 0.894
The value of R2 means that about 49% of the variability in error percentage has been
explained by the fitted model. The value of se means that a typical deviation from the mean
value corresponding to any specified set of values for the predictors is estimated to be 0.894.
c
Since the typical error without using the regression is estimated to be √(39.2/29) = 1.16 and the typical error using this fitted model is estimated to be 0.89, the reduction in error may not be large enough to justify the use of the regression model. However, this has to be decided based on practical considerations. Thus, the relatively large value of se and the moderate value of R² suggest that the estimated regression equation may not provide sufficiently good predictions of error rate.

14.31
Ho: β1 = β2 = 0
Ha: At least one of the two βi's is not zero.
α = 0.01
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.902/2)/[(1 − 0.902)/21] = 96.64
The F statistic is
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.90/k)/[(1 − 0.90)/(15 − (k + 1))] = 9(14 − k)/k
Substitute values of k (starting with 1) and compute the corresponding value of F, df1, and df2. Then, using Appendix Table VII, determine the P-value. Doing so will produce the following table.
k (= df1)      F    df2    P-value
    1     117.00     13    < 0.05
    2      54.00     12    < 0.05
    3      33.00     11    < 0.05
    4      22.50     10    < 0.05
    5      16.20      9    < 0.05
    6      12.00      8    < 0.05
    7       9.00      7    < 0.05
    8       6.75      6    < 0.05
    9       5.00      5    < 0.05
   10       3.60      4    > 0.05
As long as the number of predictor variables is less than 10, the corresponding model would be judged useful at level of significance 0.05.
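The table above is mechanical to reproduce; with R² = 0.90 and n = 15 the F statistic reduces to 9(14 − k)/k:

```python
# Reproduce the F column of the table above for k = 1, ..., 10 predictors.
# With R^2 = 0.90 and n = 15, F = (0.90/k) / (0.10/(14 - k)) = 9(14 - k)/k.
for k in range(1, 11):
    f = (0.90 / k) / ((1 - 0.90) / (15 - (k + 1)))
    print(k, round(f, 2), 14 - k)   # k, F, df2
```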
A high R2 value can often be obtained simply by including a great many predictors in the model,
even though the actual population relationship between y and the predictors is weak. To pass the
model utility test, a model with a high R2 based on relatively few predictors is needed. In addition, se
needs to be small.
14.33
Coef      Stdev     t-ratio      p
151.4     134.1        1.13  0.292
16.216    8.831        1.84  0.104
13.476    8.187        1.65  0.138
0.09353   0.07093      1.32  0.224
0.2528    0.1271       1.99  0.082
0.4922    0.2281       2.16  0.063
It can be seen (except for differences due to rounding errors) that the estimated regression equation
given in the problem is correct.
14.34
Predictor       Coef      Stdev  t-ratio      p
Constant      76.437      9.082     8.42  0.000
X1              7.35      10.80     0.68  0.506
X2              9.61      10.80     0.89  0.387
X3             0.915      1.068     0.86  0.404
X4           0.09632    0.09834     0.98  0.342
X1-SQ         13.452      6.599     2.04  0.058
X2-SQ          2.798      6.599     0.42  0.677
X3-SQ        0.02798    0.06599     0.42  0.677
X4-SQ      0.0003201  0.0002933     1.09  0.291
X1*X2          3.750      8.823     0.43  0.676
X1*X3         0.7500     0.8823     0.85  0.408
X1*X4        0.14167    0.05882     2.41  0.028
X2*X3         2.0000     0.8823     2.27  0.038
X2*X4        0.12500    0.05882     2.13  0.049
X3*X4       0.003333   0.005882     0.57  0.579

s = 0.3529   R-sq = 88.5%   R-sq(adj) = 78.3%

Analysis of Variance
SOURCE       DF       SS      MS     F      p
Regression   14  15.2641  1.0903  8.75  0.000
Error        16   1.9926  0.1245
Total        30  17.2568

a
Ho: β1 = β2 = ... = β14 = 0
Ha: At least one of the fourteen βi's is not zero.
α = 0.05
F = (SSRegr/k)/(SSResid/[n − (k + 1)]) = (15.2641/14)/(1.9926/16) = 8.76
From the Minitab output, the P-value ≈ 0. Since the P-value is less than α, the null hypothesis
is rejected. The data suggests that the fitted model has utility for predicting brightness of
finished paper.
R2 = 0.885. Hence, 88.5% of the variation in the observed brightness readings has been
explained by the fitted model.
SSResid = 1.9926. This is the sum of the squared prediction errors.
se = 0.3529. This is the typical error of prediction.
14.35
Coef     Stdev   t-ratio      p
35.83    53.54      0.67  0.508
0.676    1.436      0.47  0.641
1.2811   0.4243     3.02  0.005

R-sq = 55.0%   R-sq(adj) = 52.1%

Analysis of Variance
SOURCE      DF     SS     MS      F      p
Regression   2  20008  10004  18.95  0.000
Error       31  16369    528
Total       33  36377
Ho: β1 = β2 = 0
Ha: At least one of the two βi's is not zero.
F = (SSRegr/k)/(SSResid/[n − (k + 1)]) = (20008/2)/(16369/31) = 18.95
From the Minitab output, the P-value ≈ 0. Since the P-value is less than α, the null hypothesis is
rejected. The data suggests that the multiple regression model has utility for predicting infestation
rate.
Exercises 14.36 – 14.50
14.36
It should be confirmed that the estimated regression equation yields useful predictions before using it
to make predictions. This is the reason it is preferable to perform a model utility test before using an
estimated regression equation to make predictions.
14.37
The degrees of freedom for error is 100 (7 + 1) = 92. From Appendix Table III, the critical t
value is approximately 1.99.
Ho: β1 = 0
Ha: β1 ≠ 0
α = 0.05
t = b1/s_b1 with df = 92
t = 0.183/0.3055 = 0.599
P-value = 2(area under the 92 df t curve to the right of 0.599) ≈ 2(0.275) = 0.550.
Since the P-value exceeds α, the null hypothesis is not rejected. This means that there is not
sufficient evidence to conclude that there is a difference in the mean value of vacant lots that
are zoned for residential use and those that are not zoned for residential use.
14.38
14.39
Ho: β1 = β2 = 0
Ha: At least one of the two βi's is not zero.
α = 0.05
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.86/2)/[(1 − 0.86)/47] = 144.36
Ho: β2 = 0
Ha: β2 ≠ 0
α = 0.01
t = b2/s_b2 with df = 47
t = 0.0446/0.0103 = 4.33
The point estimate of the mean value of MDH activity for an electrical conductivity level of 40 is −0.1838 + 0.0272(40) + 0.0446(40²) = −0.1838 + 0.0272(40) + 0.0446(1600) = 72.2642.
The 90% confidence interval for the mean value of MDH activity for an electrical conductivity level of 40 is 72.2642 ± (1.68)(0.120) = 72.2642 ± 0.2016, or (72.0626, 72.4658).
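A numeric check of the Exercise 14.39 point estimate and interval, using the intercept −0.1838 and the interval half-width (1.68)(0.120) quoted above:

```python
# Point estimate of mean MDH activity at electrical conductivity x = 40 for
# the quadratic fit, plus the 90% confidence interval computed above.
b0, b1, b2 = -0.1838, 0.0272, 0.0446
x = 40

y_hat = b0 + b1*x + b2*x**2
lower, upper = y_hat - 1.68*0.120, y_hat + 1.68*0.120
print(round(y_hat, 4))                   # -> 72.2642
print(round(lower, 4), round(upper, 4))  # -> 72.0626 72.4658
```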
14.40
Fitting the three predictor model in Exercise 14.27 yielded the following MINITAB output.
The regression equation is
volume = - 859 + 23.7 minwidth + 226 maxwidth + 225 elongation

Predictor       Coef  SE Coef      T      P
Constant      -859.2    272.9  -3.15  0.005
minwidth       23.72    85.66   0.28  0.784
maxwidth      225.81    85.76   2.63  0.015
elongation    225.24    90.65   2.48  0.021

S = 287.0   R-Sq = 67.6%   R-Sq(adj) = 63.4%

Analysis of Variance
Source          DF       SS       MS      F      P
Regression       3  3960700  1320233  16.03  0.000
Residual Error  23  1894141    82354
Total           26  5854841
The t-ratio for minwidth is 0.28 with a corresponding P-value of 0.784 suggesting that
minwidth may be dropped from the model that already contains the predictors maxwidth
and elongation.
b
Predicted volume = −859.2 + 23.72(2.5) + 225.81(3.0) + 225.24(1.55) = 226.7.
Using MINITAB with x1 = minwidth = 2.5, x2 = maxwidth = 3.0, and x3 = elongation = 1.55, we obtain s_ŷ = 73.8. The t critical value for a 95% interval is obtained from a 23 df t curve and is equal to 2.0687. Also from MINITAB we get s²e = 82354. The prediction interval is 226.7 ± (2.0687)√(82354 + (73.8)²), i.e., (−386.28, 839.68).
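The prediction-interval arithmetic above can be sketched as:

```python
import math

# Prediction interval for Exercise 14.40(b): the prediction-error standard
# deviation combines the error variance with the standard error of the fit
# (s^2_e = 82354 and 73.8 are the MINITAB values quoted above).
y_hat, t_crit = 226.7, 2.0687
s2_e, sd_fit = 82354, 73.8

margin = t_crit * math.sqrt(s2_e + sd_fit**2)
print(round(y_hat - margin, 2), round(y_hat + margin, 2))  # -> -386.28 839.68
```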
14.41
The value 0.469 is an estimate of the expected change (increase) in the mean score of students
associated with a one unit increase in the student's expected score holding time spent
studying and student's grade point average constant.
Ho: β1 = β2 = β3 = 0
Ha: At least one of the three βi's is not zero.
α = 0.05
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.686/3)/[(1 − 0.686)/103] = 75.01
The estimated standard deviation of the prediction error is √(s²e + (1.2)²).
To determine s²e, proceed as follows. From the definition of R², it follows that SSResid = (1 − R²)SSTo. So SSResid = (1 − 0.686)(10200) = 3202.8.
Then s²e = 3202.8/103 = 31.095.
Ho: β1 = β2 = β3 = 0
Ha: At least one of the three βi's is not zero.
α = 0.05
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.774/3)/[(1 − 0.774)/296] = 337.91
Ho: β3 = 0
Ha: β3 ≠ 0
α = 0.05
t = b3/s_b3 with df = 296
t = 0.1011/0.6258 = 0.1616
P-value = 2(area under the 296 df t curve to the right of 0.1616) ≈ 2(0.44) = 0.88.
Since the P-value exceeds α, the null hypothesis is not rejected. The indicator variable x3 does
not appear useful and could be removed from the model if the predictor variables x1 and x2
remain in the model.
14.43
Ho: β3 = 0
Ha: β3 ≠ 0
α = 0.05
t = b3/s_b3 with df = 363
t = 0.00002/0.000009 = 2.22
P-value = 2(area under the 363 df t curve to the right of 2.22) ≈ 2(0.014) = 0.028.
Since the P-value is less than α, the null hypothesis is rejected. The conclusion is that the inclusion of
the interaction term is important.
14.44
Ho: β2 = 0
Ha: β2 ≠ 0
α = 0.05
The test statistic is t = b2/s_b2 with df = 4.
t = 1.7155/0.2036 = 8.42
Ho: β1 = β2 = β3 = 0
Ha: At least one of the three βi's is not zero.
α = 0.05
Test statistic: F = (SSRegr/k)/(SSResid/[n − (k + 1)]) = (5073.4/3)/(1854.1/6) = 5.47
Ho: β3 = 0
Ha: β3 ≠ 0
α = 0.05
The test statistic is t = b3/s_b3 with df = 6.
t = 8.4/199 = 0.04
P-value = 2(area under the 6 df t curve to the right of 0.04) ≈ 2(0.48) = 0.96.
Since the P-value exceeds α, the null hypothesis is not rejected. The data suggests that the
interaction term is not needed in the model, if the other two independent variables are in the
model.
c
No. The model utility test is testing all variables simultaneously (that is, as a group). The t
test is testing the contribution of an individual predictor when used in the presence of the
remaining predictors. Results indicate that, given two out of the three predictors are included
in the model, the third predictor may not be necessary.
14.46
Ho: β1 = β2 = 0
Ha: At least one of the two βi's is not zero.
α = 0.05
Test statistic: F = (SSRegr/k)/(SSResid/[n − (k + 1)]) = (812.38/2)/(29.928/25) = 339.306
Based on this interval, the estimated change in mean absorption associated with a one-unit
increase in starch damage while keeping the value of flour protein fixed is between 0.29826
and 0.373.
c
Ho: β1 = 0
Ha: β1 ≠ 0
α = 0.05
The test statistic is t = b1/s_b1 with df = 25.
t = 1.4423/0.2076 = 6.95
Ho: β2 = 0
Ha: β2 ≠ 0
α = 0.05
The test statistic is t = b2/s_b2 with df = 25.
t = 0.33563/0.01814 = 18.51
Since the P-value is less than α, the null hypothesis is rejected. The data suggests
that the variable x2 (starch damage) is useful when used together with the variable x1
(flour protein).
d
Since both null hypotheses were rejected, it appears that both independent variables are
important.
From (e), the point estimate is 55.446 and the estimated standard deviation of a + b1(11.7) + b2(57) is 0.522. From these and the MINITAB output, the estimated standard deviation of the prediction error at x1 = 11.7 and x2 = 57 is √(s²e + (0.522)²).
The point prediction for mean phosphate adsorption when x1 = 160 and x2 = 39 is at the midpoint of the given interval, so the point prediction is (21.40 + 27.20)/2 = 24.3. The t critical value for a 95% confidence interval is 2.23, so the standard error of the point prediction is (27.20 − 21.40)/[2(2.23)] = 1.30. The t critical value for a 99% confidence interval is 3.17. Therefore, the 99% confidence interval is
24.3 ± (3.17)(1.3) = 24.3 ± 4.121, or (20.179, 28.421).
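The back-calculation above — recovering the point estimate and standard error from a reported interval, then rescaling to a different confidence level — can be sketched as:

```python
# Exercise 14.47: recover the point prediction and its standard error from
# the reported 95% interval (21.40, 27.20), then form the 99% interval.
lo95, hi95 = 21.40, 27.20
t95, t99 = 2.23, 3.17

point = (lo95 + hi95) / 2            # midpoint of the interval
se = (hi95 - lo95) / (2 * t95)       # half-width divided by the t value
margin = t99 * round(se, 2)          # the text rounds se to 1.30 first
print(round(point, 1), round(se, 2))                       # -> 24.3 1.3
print(round(point - margin, 3), round(point + margin, 3))  # -> 20.179 28.421
```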
14.48
Ho: β1 = β2 = 0
Ha: At least one of the two βi's is not zero.
α = 0.10
Test statistic: F = (SSRegr/k)/(SSResid/[n − (k + 1)]) = (649.75/2)/(538.03/15) = 9.06
The point estimate and standard deviation of the maximum heart rate of an individual runner who is 43 years of age and weighs 65 kg are computed using s²e = 538.03/15 = 35.87.
The point estimate of the average maximum heart rate for all marathon runners who are 30 years old and weigh 77.2 kg is ŷ = 179 − 0.8(30) + 0.5(77.2) = 193.6.
The 90% confidence interval is 193.6 ± (1.75)(2.97) = 193.6 ± 5.2, or (188.4, 198.8).
14.49
The prediction interval is always wider than the confidence interval for a specified set of
values for the independent variables. This is because the standard deviation for the
prediction error when predicting a single y observation is larger than the standard deviation
for estimating the mean y value.
Ho: β1 = β2 = 0
Ha: At least one of the two βi's is not zero.
α = 0.05
Test statistic: F = (SSRegr/k)/(SSResid/[n − (k + 1)]) = (237.52/2)/(26.98/7) = 30.81
α = 0.05. From the MINITAB output the t-ratio for b1 is 6.57, and the t-ratio for b2 is
7.69. The P-values for the testing 1 = 0 and 2 = 0 would be twice the area under the 7 df t
curve to the right of 6.57 and 7.69, respectively. From Appendix Table IV, the
P-values are found to be practically zero. Both hypotheses would be rejected. The data
suggests that both the linear and quadratic terms are important.
With 95% confidence, the mean height of wheat plants treated with x = 2 (i.e., 10² = 100 μM of Mn) is estimated to be between 43 and 47.92 cm.
d
14.50
Coef       Stdev     t-ratio      p
109.771    1.273       86.26  0.000
-2.2943    0.1508     -15.22  0.004
0.032857   0.003614     9.09  0.012

R-sq = 99.7%   R-sq(adj) = 99.3%

Analysis of Variance
SOURCE      DF       SS      MS       F      p
Regression   2  1111.54  555.77  303.94  0.003
Error        2     3.66    1.83
Total        4  1115.20

Obs     X        Y      Fit  Stdev.Fit  Residual  St.Resid
  1   0.0  110.000  109.771      1.273     0.229      0.50
  2  10.0   90.000   90.114      0.824    -0.114     -0.11
  3  20.0   76.000   77.029      0.942    -1.029     -1.06
  4  30.0   72.000   70.514      0.824     1.486      1.39
  5  40.0   70.000   70.571      1.273    -0.571     -1.25

Ho: β1 = β2 = 0
Ha: At least one of the two βi's is not zero.
α = 0.05
Test statistic: F = MSRegr/MSE = 555.77/1.83 = 303.94
From the Minitab output, the P-value for the F test is 0.003.
Since the P-value is less than α, the null hypothesis is rejected. The data suggests that the
quadratic regression model is useful for predicting plant Mn.
c
R2 = 0.997 and se = 1.352. The R2 value means that 99.7% of the variation in the observed plant
Mn can be explained by the fitted quadratic regression, using distance from the fertilizer
band as the independent variable. The se value means that the typical error of prediction is
1.352.
To test Ho: β1 = 0: from the Minitab output the P-value is 0.004. Since the P-value is less than the α of 0.05, the null hypothesis would be rejected. This means that the linear term is
important to the model.
To test Ho: β2 = 0: from the Minitab output the P-value is 0.012. Since the P-value is less than the α of 0.05, the null hypothesis would be rejected. This means that the quadratic term is
useful in the model, if the linear term is in the model.
The 90% confidence interval for the mean plant Mn for plants that are 30 cm from the fertilizer band is 70.514 ± (2.92)(0.824) = 70.514 ± 2.406, or (68.108, 72.92).
One possible way would have been to start with the set of predictor variables consisting of all five
variables, along with all quadratic terms, and all interaction terms. Then, use a selection procedure
like backward elimination to arrive at the given estimated regression equation.
14.52
The model using the two variables x3 and x5 appears to be a good choice. It has an adjusted R² which is only 0.03 less than the largest adjusted R². This model has five fewer variables, and hence five more degrees of freedom for estimating the variance than the model with the largest adjusted R².
Choice of a model involves a compromise between parsimony and predictive ability.
14.53
The model using the three variables x3, x9, x10 appears to be a good choice. It has an adjusted R² which is only slightly smaller than the largest adjusted R². This model is almost as good as the model with the largest adjusted R² but has two fewer predictors.
14.54
The first candidate for elimination would be the variable, mean temperature. Its t-ratio is less than 2,
and hence, it can be eliminated from the model.
14.55
R² = 0.3092.
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.3092/9)/[(1 − 0.3092)/1846] = 91.8
If a backward elimination procedure was followed in the stepwise regression analysis, then
the statements in the paper suggest that all variables except daily cigarette consumption and
alcohol consumption were eliminated from the model. Of the two predictors left in the
model, cigarette consumption would have a larger t-ratio than alcohol consumption.
There is an alternative procedure called the forward selection procedure which is available in
most statistical software packages including MINITAB. According to this method one starts
with a model having only the intercept term and enters one predictor at a time into the
model. The predictor explaining most of the variance is entered first. The second predictor
entered into the model is the one that explains most of the remaining variance, and so on. If
the forward selection method was followed in the current problem then the statements in the
paper would suggest that the variable to enter the model first is daily cigarette consumption
and the next variable to enter the model is alcohol consumption. No further predictors were
entered into the model.
14.56
14.57
The t-ratios for the 9 coefficients (excluding the constant) are (in order)
-0.01/0.01 = -1.0, 0.01/0.01 = 1.0, -0.07/0.03 = -2.33, 0.12/0.06 = 2.0, -0.02/0.01 = -2.0,
-0.04/0.01 = -4.0, -0.01/0.03 = -0.33, -0.04/0.04 = -1.0, and -0.02/0.05 = -0.4.
The predictor x7 has the smallest t-ratio (in magnitude) associated with it: |t| = 0.33, much smaller than tout = 2.0. Hence x7 is the first candidate for elimination, and based on tout = 2.0, it would be eliminated from the model.
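The first backward-elimination step amounts to scanning the t-ratios for the smallest magnitude; a sketch using the coefficients and standard errors quoted in the text:

```python
# One backward-elimination step: compute t-ratios b_i / s_{b_i} for the nine
# predictors and flag the one with the smallest |t| (dropped if |t| < t_out = 2.0).
coefs = [-0.01, 0.01, -0.07, 0.12, -0.02, -0.04, -0.01, -0.04, -0.02]
ses   = [ 0.01, 0.01,  0.03, 0.06,  0.01,  0.01,  0.03,  0.04,  0.05]

t_ratios = [b / s for b, s in zip(coefs, ses)]
smallest = min(range(len(t_ratios)), key=lambda i: abs(t_ratios[i]))
print(f"x{smallest + 1}", round(t_ratios[smallest], 2))  # -> x7 -0.33
```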
No. The information given only helps us decide which of the nine predictors may be
eliminated from the model given that the remaining variables are present in the model. After
eliminating x7 from the model one has to refit the regression using the remaining eight
predictors. The results from this refitting will help decide the second candidate for
elimination from the model.
Standard error for the estimated coefficient of log of sales = (estimated coefficient)/t-ratio =
0.372/6.56 = 0.0567.
The predictor with the smallest (in magnitude) associated t-ratio is Return on Equity.
Therefore it is the first candidate for elimination from the model. It has a t-ratio equal to 0.33
which is much less than tout = 2.0. Therefore the predictor Return on Equity would be
eliminated from the model if a backward elimination method is used with tout = 2.0.
No. For the 1992 regression, the first candidate for elimination when using a backward
elimination procedure is CEO Tenure since it has the smallest t-ratio (in magnitude).
14.58
In step 1, the model that includes all four variables was used. In step 2, the model with the variable x1
(grass cover) deleted was used. In step 3, the model with the two variables x1 (grass cover) and x2
(mean soil depth) deleted was used. Neither of the remaining two variables can be removed, so the
final model uses the two variables x3 (angle of slope) and x4 (distance from cliff edge).
14.59
Using MINITAB, the best model with k variables has been found and summary statistics from each
are given below.
Number of Variables   Variables Included   R²      Adjusted R²   Cp
        1             x4                   0.824      0.819      14.0
        2             x2, x4               0.872      0.865       2.9
        3             x2, x3, x4           0.879      0.868       3.1
        4             x1, x2, x3, x4       0.879      0.865       5.0
The best model, using the procedure of minimizing Cp, would use variables x2, x4. Hence, the set of
predictor variables selected here is not the same as in problem 14.58.
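The Cp screening above is a simple minimization over the candidate models:

```python
# Pick the candidate model with the smallest Cp from the summary table above.
models = [("x4",            14.0),
          ("x2, x4",         2.9),
          ("x2, x3, x4",     3.1),
          ("x1, x2, x3, x4", 5.0)]

best = min(models, key=lambda m: m[1])
print(best[0])  # -> x2, x4
```

Minimizing Cp is only one criterion; as the comparison with Problem 14.58 shows, different selection procedures can settle on different predictor subsets.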
14.60
Multicollinearity might be a problem because most homes differ primarily with respect to the
number of bedrooms and bathrooms. Hence, it may be that the value of x3 (total rooms) is
approximately equal to c + x1 + x2. If so, multicollinearity will be present.
14.61
Using MINITAB, the best model with k variables has been found and summary statistics for each are
given below.
k   Variables Included     R²      Adjusted R²   Cp
1   x4                     0.067      0.026      5.8
2   x2, x4                 0.111      0.031      6.6
3   x1, x3, x4             0.221      0.110      5.4
4   x1, x3, x4, x5         0.293      0.151      5.4
5   x1, x2, x3, x4, x5     0.340      0.166      6.0
6.0
It appears that the model using x1, x3, x4 is the best model, using the criterion of minimizing Cp.
14.62
It would have to be greater than or equal to 0.723, because if you add variables to a model,
R2 cannot decrease.
It would have to be less than or equal to 0.723, because if you take variables out of a model,
R2 cannot increase.
Ho: β1 = β2 = β3 = . . . = β11 = 0
Ha: at least one among the βi's is not zero
α = 0.01
F = (R²/k)/[(1 − R²)/(n − (k + 1))] = (0.64/11)/[(1 − 0.64)/76] = 0.058182/0.004737 = 12.28
Appendix Table VII does not have entries for df1 = 11, but using df1 = 10 it can be determined
that 0.001 > P-value.
Since the P-value is less than , the null hypothesis is rejected. There does appear to be a
useful linear relationship between y and at least one of the predictors.
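As a cross-check, assuming SciPy is available, the model-utility F statistic and its P-value can be computed directly (`stats.f.sf` gives the upper-tail area):

```python
from scipy import stats

n, k, r2 = 88, 11, 0.64
# Model-utility F statistic based on R-squared.
F = (r2 / k) / ((1 - r2) / (n - (k + 1)))
# P-value: upper-tail area of the F distribution with df (k, n-(k+1)).
p_value = stats.f.sf(F, k, n - (k + 1))
print(round(F, 2))  # 12.28
```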
Adjusted R² = 1 − [(n − 1)/(n − (k + 1))](SSResid/SSTo)

To calculate adjusted R², we need the values of SSResid and SSTo. From the information
given, we obtain:

se² = (5.57)² = 31.0249 = SSResid/(88 − 12), so SSResid = 76(31.0249) = 2357.8924.

R² = 1 − SSResid/SSTo, so .64 = 1 − 2357.8924/SSTo. Then 2357.8924/SSTo = .36 and
SSTo = 2357.8924/.36 = 6549.7011.

So, Adjusted R² = 1 − (87/76)(2357.8924/6549.7011) = 1 − .4121 = .5879.
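The same bookkeeping can be sketched in Python (the variable names are illustrative):

```python
n, k = 88, 11
se2 = 5.57 ** 2                       # se^2 = SSResid / (n - (k + 1))
ss_resid = (n - (k + 1)) * se2        # 76 * 31.0249
ss_to = ss_resid / (1 - 0.64)         # solve R^2 = 1 - SSResid/SSTo for SSTo
adj_r2 = 1 - ((n - 1) / (n - (k + 1))) * (ss_resid / ss_to)
print(round(adj_r2, 4))  # 0.5879
```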
t-ratio = (b1 − 0)/s_b1 = 3.08, so s_b1 = b1/3.08 = .458/3.08 = .1487

The 95% confidence interval is b1 ± (t critical)s_b1 ≈ .458 ± 2(.1487) = .458 ± .2974.
From this interval, we estimate the value of β1 to be between 0.1606 and 0.7554.
d
Many of the variables have t-ratios that are close to zero. The one with the smallest t in
absolute value is x9: certificated staff-pupil ratio. For this reason, I would eliminate x9 first.
Ho: β3 = β4 = β5 = β6 = 0
Ha: at least one among β3, β4, β5, β6 is non-zero.
None of the procedures presented in this chapter could be used. The two procedures
presented test either all variables as a group or a single variable's contribution.
14.64
Ho: β1 = β2 = β3 = β4 = β5 = β6 = 0
Ha: at least one among β1, . . . , β6 is not zero
α = 0.05

F = (R²/k) / [(1 − R²)/(n − (k + 1))] = (.50/6) / [(1 − .50)/71] = .083333/.007042 = 11.83
R² = 1 − SSResid/SSTo, so .5 = 1 − 1.02/SSTo. Then 1.02/SSTo = .5 and
SSTo = 1.02/.5 = 2.04.

Adjusted R² = 1 − [(n − 1)/(n − (k + 1))](SSResid/SSTo) = 1 − (77/71)(1.02/2.04)
            = 1 − .5423 = .4577
Ho: β6 = 0
Ha: β6 ≠ 0
α = 0.05
t = −0.92
P-value = 2(area under the 71 df t curve to the left of −0.92) ≈ 2(0.18) = 0.36.
Since the P-value exceeds α, the null hypothesis is not rejected. The indicator variable x6
could be removed, provided the other five predictor variables remain in the model.
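A quick way to reproduce this two-sided P-value, assuming SciPy is available:

```python
from scipy import stats

t_obs, df = -0.92, 71
# Two-sided P-value: twice the lower-tail area for a negative t statistic.
p_value = 2 * stats.t.cdf(t_obs, df)
print(round(p_value, 2))  # 0.36
```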
14.65
Ho: β1 = β2 = 0
Ha: At least one of the two βi's is not zero.
α = 0.05

Test statistic: F = (SSRegr/k) / (SSResid/[n − (k + 1)])

F = (525.11/2) / (61.77/5) = 262.555/12.354 = 21.25
Ho: β2 = 0
Ha: β2 ≠ 0
α = 0.05

The test statistic is t = b2/s_b2, with df = 5.

t = 1.7679/.2712 = 6.52
14.66
Ho: β1 = β2 = β3 = β4 = 0
Ha: At least one of the four βi's is not zero.
α = 0.05

Test statistic: F = (R²/k) / [(1 − R²)/(n − (k + 1))]

F = (.69/4) / [(1 − .69)/28] = 15.58
Using the criterion of eliminating a variable if its t-ratio satisfies −2 ≤ t-ratio ≤ 2 (see
Problem 12.46), the variable x1 can be deleted.
The 95% confidence interval is 0.893 ± (2.05)(0.141) = 0.893 ± 0.289, i.e., (0.604, 1.182).
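The interval can be reproduced in Python, assuming SciPy is available for the t critical value:

```python
from scipy import stats

b, se_b, df = 0.893, 0.141, 28
t_crit = stats.t.ppf(0.975, df)        # about 2.048, quoted as 2.05 in the text
lower = b - t_crit * se_b
upper = b + t_crit * se_b
print(round(lower, 3), round(upper, 3))  # 0.604 1.182
```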
14.67
14.68
First, the model with all five variables is fit. The variable x3 has the t-ratio closest to zero. Its
value is .72. Since this value is less than 2 in absolute value, variable x3 is deleted and the
model using variables x1, x2, x4 and x5 is fit. In this model, variable x4 has the t-ratio closest to
zero (0.29). Since the t-ratio is less than 2, variable x4 is deleted and the model using variables
x1, x2 and x5 is fit. None of the variables in this three-variable regression has a t-ratio less
than 2 in absolute magnitude, so none can be deleted. The final regression model
contains variables x1, x2 and x5.
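This backward-elimination procedure can be sketched in Python. The sketch below is illustrative only: it fits ordinary least squares with NumPy on synthetic data (none of the numbers come from the problem), and the function names are invented for this example.

```python
import numpy as np

def t_ratios(X, y):
    """Fit OLS with an intercept and return the slope t-ratios."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    s2 = resid @ resid / (n - p - 1)                  # residual variance
    se = np.sqrt(np.diag(s2 * np.linalg.inv(Xd.T @ Xd)))
    return beta[1:] / se[1:]                          # skip the intercept

def backward_eliminate(X, y, names, cutoff=2.0):
    """Repeatedly drop the predictor whose t-ratio is closest to zero,
    stopping once every remaining |t-ratio| is at least `cutoff`."""
    keep = list(range(X.shape[1]))
    while keep:
        t = t_ratios(X[:, keep], y)
        worst = int(np.argmin(np.abs(t)))
        if abs(t[worst]) >= cutoff:
            break
        keep.pop(worst)
    return [names[i] for i in keep]

# Synthetic illustration: y truly depends on x1, x2 and x5 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + 4 * X[:, 4] + rng.normal(size=200)
print(backward_eliminate(X, y, ["x1", "x2", "x3", "x4", "x5"]))
```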
The R² of 0.786 means that 78.6% of the variation in total body electrical conductivity has
been explained by the fitted regression equation using the variables age, sex, and lean body
mass.
The se of 4.914 means that the typical deviation of an observed value from the value predicted
using the fitted equation, is 4.914.
The value of b2 is −6.988. The estimated difference in average total body electrical
conductivity associated with being a female rather than a male is −6.988, holding age and
lean body mass fixed. That is, other things being the same, females average about 7 units
lower in total body electrical conductivity than males.
14.69
First, the model using all four variables was fit. The variable age at loading (x3) was deleted because
it had the t-ratio closest to zero and it was between −2 and 2. Then, the model using the three
variables x1, x2, and x4 was fit. The variable time (x4) was deleted because its t-ratio was closest to
zero and was between −2 and 2. Finally, the model using the two variables x1 and x2 was fit. Neither
of these variables could be eliminated, since their t-ratios were greater than 2 in absolute magnitude.
The final model, then, includes slab thickness (x1) and load (x2). The predicted tensile strength for a
slab that is 25 cm thick, 150 days old, and is subjected to a load of 200 kg for 50 days is
ŷ = 13 − 0.487(25) + 0.0116(200) = 3.145.
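A quick arithmetic check of this prediction (the coefficient names are illustrative):

```python
# Coefficients of the final fitted model in 14.69:
# x1 = slab thickness in cm, x2 = load in kg (age and time were dropped).
a, b1, b2 = 13.0, -0.487, 0.0116
y_hat = a + b1 * 25 + b2 * 200
print(round(y_hat, 3))  # 3.145
```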
14.70
Ho: β1 = β2 = β3 = 0
Ha: At least one of the three βi's is not zero.
α = 0.05

Test statistic: F = (SSRegr/k) / (SSResid/[n − (k + 1)])

F = (454.63/3) / (368.51/21) = 8.63
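Assuming SciPy is available, this F ratio and its P-value can be verified directly. The error df of 21 implies n = 25 here (an inference from the output); small differences from 8.63 are rounding.

```python
from scipy import stats

ss_regr, ss_resid = 454.63, 368.51
k, n = 3, 25                          # error df = n - (k + 1) = 21
F = (ss_regr / k) / (ss_resid / (n - (k + 1)))
p_value = stats.f.sf(F, k, n - (k + 1))   # upper-tail area
print(round(F, 2))
```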
All of the t-ratios are between −2 and 2; however, the t-ratio closest to zero is 0.03. Thus,
variable x3 (fetus weight) can be eliminated from the model.
The t-ratio for the variable sex is between −2 and 2; hence, the variable sex can be eliminated.
The R² value of 0.527 means that 52.7% of the total variation in the observed fetus
progesterone levels can be explained by the variable fetus length. The se value of 4.116 means
that the typical deviation of an observation from the predicted value using the fitted equation
is 4.116 when predicting progesterone levels from fetus lengths.
The value 0.231 for b2 is the estimated expected change (increase) in progesterone level
associated with a one unit increase in fetus length.
It does not make sense to interpret the value of a as average progesterone level when length
is zero. The fetus is non-existent if length is zero, and hence, progesterone level would be
zero. The non-zero value for a is simply the estimate of the constant term in the
population regression function.
14.71
14.72
The claim is very reasonable because 14 is close to where the smooth curve has its highest
value.
Ho: β1 = β2 = . . . = β15 = 0
Ha: At least one of the fifteen βi's is not zero.
α = 0.05

Test statistic: F = (R²/k) / [(1 − R²)/(n − (k + 1))], with df1 = 15 and df2 = 20 − 16 = 4.

F = (.90/15) / [(1 − .9)/4] = 2.4
There are no entries for df1 = 15 in Appendix Table VII. Using a statistical calculator or a
computer software package that will compute P-values, it is found that P-value = 0.206.
Since the P-value exceeds α, the null hypothesis is not rejected. The data suggest that the
fitted regression equation is not useful for predicting stock price.
b
Part a shows that a high R² by itself does not imply that a model is useful. You might be
suspicious of a model that has a high R² value when the number of predictors is almost as
large as the number of observations. In this problem, k = 15 and n = 20.
For the model to be judged useful at the 0.05 level of significance, the computed F would
have to exceed the F critical value for α = 0.05 with df1 = 15 and df2 = 4, which is
5.86. This number was obtained from a more extensive F table than the one given in your
text. Your library should have the information, or you could use a computer software
package to determine this value.
Thus the model is judged useful only if

(R²/15) / [(1 − R²)/4] > 5.86, that is, 4R²/[15(1 − R²)] > 5.86.

Solving for R²: 4R² > 87.9(1 − R²), so 91.9R² > 87.9, giving R² > 87.9/91.9 = .956.

14.73
Predictor     Coef         Stdev         t-ratio    p
Constant      1.56450      0.07940       19.70      0.000
X1            0.23720      0.05556        4.27      0.000
X2            0.00024908   0.00003205     7.77      0.000

s = 0.05330    R-sq = 86.5%    R-sq(adj) = 85.3%

Analysis of Variance
SOURCE        DF    SS         MS         F        p
Regression     2    0.40151    0.20076    70.66    0.000
Error         22    0.06250    0.00284
Total         24    0.46402
b
Ho: β1 = β2 = 0
Ha: At least one of the two βi's is not zero.
α = 0.05

Test statistic: F = (SSRegr/k) / (SSResid/[n − (k + 1)])

F = (.40151/2) / (.0625/22) = 70.67

From the MINITAB output, the P-value associated with the F test is practically zero. Since the
P-value is less than α, the null hypothesis is rejected.
c
The value for R² is 0.865. This means that 86.5% of the total variation in the observed values
for profit margin has been explained by the fitted regression equation. The value for se is
0.0533. This means that the typical deviation of an observed value from the predicted value
is 0.0533, when predicting profit margin using this fitted regression equation.
No. Both variables have associated t-ratios that exceed 2 in absolute magnitude. Hence,
neither can be eliminated from the model.