
Stat 401B Exam 3

Fall 2016
(Corrected Version)

I have neither given nor received unauthorized assistance on this exam.

________________________________________________________
Name Signed Date

_________________________________________________________
Name Printed

ATTENTION!

Incorrect numerical answers unaccompanied by supporting reasoning will receive NO partial credit.

Correct numerical answers to difficult questions unaccompanied by supporting reasoning may not receive
full credit.

SHOW YOUR WORK/EXPLAIN YOURSELF!

1
1. There are data on the UCI Machine Learning Repository due originally to P. Tüfekci and H. Kaya
concerning the running of a power plant. Hourly information on atmospheric conditions and a plant
operating variable was collected over a number of years, along with the hourly energy output of the
plant. This question concerns MLR analyses of a random sample of 200 of the hourly periods made
treating mean "PE" (electrical power) as a function of the variables "AT" (ambient temperature in °C),
"AP" (ambient pressure in millibars), "RH" (relative humidity in %), and "V" (exhaust vacuum in cm
Hg).

5 pts a) Below is a graphic from the regsubsets() function of the "leaps" package for the n = 200
periods. Which 2 predictors seem to be most effective in predicting PE? What fraction of the raw
variability in PE do they account for?

8 pts b) Give the value of and degrees of freedom for an F statistic for comparing the full model
involving all predictors to the best 2-predictor model. (While it is not really needed to answer this
question, SSTot = 54670 for these n = 200 cases.)

F = ________________ d.f. = __________ , __________
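
A minimal sketch in R of how a partial F statistic of this kind can be computed from R-squared values. The function name partial_f is hypothetical, and the inputs are assumptions: rounded R-squared values of .934 for the full model and .925 for a candidate 2-predictor model, not values read from the regsubsets() graphic.

```r
# Partial F statistic for comparing a full MLR model to a nested reduced model,
# expressed through the two R-squared values (SSTot cancels out of the ratio).
partial_f <- function(r2_full, r2_reduced, n, p_full, p_reduced) {
  df_num <- p_full - p_reduced   # number of predictors dropped
  df_den <- n - p_full - 1       # error degrees of freedom for the full model
  f <- ((r2_full - r2_reduced) / df_num) / ((1 - r2_full) / df_den)
  list(F = f, df = c(df_num, df_den))
}

# ASSUMED inputs (rounded R-squared values, n = 200, 4 vs 2 predictors)
res <- partial_f(r2_full = 0.934, r2_reduced = 0.925,
                 n = 200, p_full = 4, p_reduced = 2)
res$F
res$df
pf(res$F, res$df[1], res$df[2], lower.tail = FALSE)  # upper-tail p-value
```

Because the R-squared inputs are rounded to three places, the resulting F value is only approximate. This form also shows why SSTot is not really needed.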

2
Below are some results (cross-validation root mean squared prediction error) from repeated 10-fold
cross-validation, and values of RMSE and R² for several MLR models for PE.

Model   Predictors Included   CV-RMSPE   RMSE   R-Squared
1       V                     8.92       8.81   .720
2       AT                    5.30       5.24   .901
3       AT,V                  4.66       4.70   .921
4       AT,RH                 4.56       4.60   .925
5       AT,AP,RH              4.61       4.60   .925
6       AT,V,RH               4.31       4.32   .934
7       AT,V,AP,RH            4.36       4.34   .934

4 pts c) Which of models 1-7 is most attractive on the basis of the table above? Explain.

4 pts d) What about the table above suggests that none of the models fit there suffers dramatic
over-fitting?

Below are some scatterplots of the data from the 200 sample hours.

4 pts e) Is there evidence of multicollinearity in these plots? If so, what is it?

3
2. There is an interesting "Banknote Authentication" data set on the UCI Machine Learning repository
that consists of 4 numerical features extracted from 400 × 400 grey scale images of real and counterfeit
banknotes. There are 610 counterfeit and 762 real notes represented in the data set. There is a printout
beginning on Page 8 of this exam from an attempt to model the probability that a note is counterfeit
(V5=1) as a function of the features (V1,V2,V3,V4). Use it to answer the following questions.

6 pts a) Which of the features V1,V2,V3,V4 appears to be least important in modeling the probability
that V5=1 (the note is counterfeit)? Explain.

6 pts b) Recall that if $p(u) = \exp(u)/(1 + \exp(u))$ then the "log odds" are
$u = \ln\left( p(u)/(1 - p(u)) \right)$. Give approximately 95% confidence limits for the increase in
log odds that a banknote is counterfeit accompanying a unit increase in V1 if the other features
V2,V3,V4 are held fixed.
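
A minimal sketch in R of Wald-type limits for a single logistic regression coefficient, using the V1 estimate and standard error from the Banknote printout. The variable names are illustrative, and the large-sample normal approximation is an assumption of the method.

```r
# Estimate and standard error for the V1 coefficient, copied from summary(bank.out)
b_v1  <- -7.8593
se_v1 <- 1.7383

# Approximate 95% limits: estimate +/- z * standard error
z <- qnorm(0.975)
limits_v1 <- b_v1 + c(-1, 1) * z * se_v1
limits_v1
```

Since the coefficient is the change in log odds per unit change in V1 with V2, V3, V4 held fixed, these limits are already on the log-odds scale.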

6 pts c) Give 2-sided approximately 95% confidence limits for the probability that a banknote with
features V1=.2,V2=.8,V3=.4,V4=-.6 is counterfeit.
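
A minimal sketch in R of one common way to get limits for a probability from the printed predict() output. It assumes the printed $fit and $se.fit are on the log-odds (link) scale, which is the default for predict() applied to a glm object, and it uses a normal approximation on that scale.

```r
# Values copied from predict(bank.out, newdata = unknown, se.fit = TRUE)
fit_link <- 0.6453872   # estimated log odds
se_link  <- 0.4574428   # standard error of the estimated log odds

# Approximate 95% limits on the log-odds scale
z <- qnorm(0.975)
limits_link <- fit_link + c(-1, 1) * z * se_link

# Map the limits to the probability scale with the inverse logit, exp(u) / (1 + exp(u))
limits_prob <- plogis(limits_link)
limits_prob
```

Building the interval on the link scale and then transforming keeps both limits inside (0, 1).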

4
3. A data set in the book Regression Analysis by Graybill and Iyer concerns how an optical reading,
y, measuring light transmitted through a chemical solution depends upon the concentration of a
chemical, x (in mg/l). A possible nonlinear (in coefficients $\beta_1$, $\beta_2$, and $\beta_3$) form
for the relationship between x and mean y is
$\mu_{y|x} = \beta_1 + \beta_2 \exp(-\beta_3 x)$ (*)
A printout beginning on Page 9 summarizes an analysis of the n = 12 pairs in the data set.

6 pts a) Suppose relationship (*) above holds and that for a given concentration the optical reading is
normally distributed with standard deviation σ. Give approximate 95% two-sided confidence limits
for this model parameter.
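
A minimal sketch in R of chi-square-based limits for a standard deviation, using the residual standard error and degrees of freedom from the optical printout. Carrying the linear-model chi-square result over to a nonlinear fit is an approximation, and the variable names are illustrative.

```r
# Residual standard error and its degrees of freedom from summary(optical.out)
s  <- 0.2262
df <- 9

# Approximate 95% two-sided limits for sigma:
# s * sqrt(df / upper chi-square quantile) and s * sqrt(df / lower chi-square quantile)
sigma_limits <- s * sqrt(df / qchisq(c(0.975, 0.025), df))
sigma_limits
```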

5 pts b) According to the relationship (*), as concentration, x, goes from 0 to ∞, the mean light
transmitted goes from $\beta_1 + \beta_2$ to $\beta_1$. The value of concentration, x, at which half of
the decrease in light transmission has been realized might be of interest. What is this in terms of the
model parameters? Give 95% two-sided confidence limits for this value of x.
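
A minimal sketch in R of one way to work this. Under (*), half of the decrease has been realized when exp(-b3 * x) = 1/2, that is, at x = ln(2)/b3. Because this is a decreasing function of b3, confidence limits for b3 can be transformed directly, with the endpoints swapping order. The numbers are copied from the optical printout and the variable names are illustrative.

```r
# Estimate and profile-based 95% limits for b3, from summary() and confint(optical.out)
b3_hat    <- 0.68276
b3_limits <- c(0.4017537, 1.0651215)

# Half-decrease concentration x = ln(2) / b3
x_half_hat <- log(2) / b3_hat

# ln(2)/b3 decreases in b3, so the upper limit for b3 gives the lower limit for x
x_half_limits <- rev(log(2) / b3_limits)
x_half_hat
x_half_limits
```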

4. On page 217 of the white Vardeman and Jobe text there are data of Koh, Morden, and Ogbourne
that concern axial breaking strengths of wooden dowel rods of 3 different lengths and 3 different
diameters. A printout beginning on Page 9 of this exam summarizes some computations with these
data.

6 pts a) What about the printed analyses of dowel strength makes direct analysis of y under the usual
one-way normal model assumptions seem inappropriate?

Instead we will henceforth consider analysis of y' = ln(y).


5
6 pts b) Make an interaction plot enhanced with error bars based on 95% confidence limits for
combination mean log strengths. What are your "margins of error" for this plotting? (Give a number.)

+ / − margin: ____________
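
A minimal sketch in R of the usual margin of error for a single combination mean, assuming the pooled standard deviation from the log-strength one-way fit (0.1962 on 27 degrees of freedom) and 4 dowels per diameter-length combination, as in the printout. The variable names are illustrative.

```r
# Pooled standard deviation and error degrees of freedom from the logstrength fit
s_pooled <- 0.1962
df_error <- 27
n_cell   <- 4     # observations per diameter-by-length combination

# Margin for a 95% confidence interval for one combination mean
margin <- qt(0.975, df_error) * s_pooled / sqrt(n_cell)
margin
```

Since the design is balanced, the same margin applies to every point on the interaction plot.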

6 pts c) Based on the plot above, which effects appear to be both statistically detectable and most
important? (Consider diameter and length main effects and interactions. List an order of importance.)

6 pts d) What items on the printout support your judgment in c)? Explain how they lend support.

6
5. Beginning on Page 12 there is R code and output corresponding to a balanced 3 × 2 × 3 experiment
on paper airplane flight distances (carried out in an undergraduate engineering statistics class). There
are 3 levels of the factor "Design," 2 levels of the factor (nose) "Weight," and 3 levels of the factor
"Paper" (type) in the study. Use the R output to answer the rest of the questions on this exam.

6 pts a) What is the value of $s_{\text{pooled}}$ for this data set? (Say where you found your value.)
What does this measure in the present context?

6 pts b) What is the relatively simple interpretation that is possible for these data? (What factorial effect(s)
dominate(s) and what does that mean about the flying of paper airplanes?) What on the output tells
you that this is so?

6 pts c) What type or types of airplanes fly furthest (according to the outcome of this study)? Explain.

6 pts d) What do you predict for the average flight distance of the type or types of planes you identified in
part c) based on a good simple model here?
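
A minimal sketch in R of fitting a Design-only model to the listed flight distances and predicting the mean distance for one design level. Treating a Design-only model as the "simple model" and picking Design level 2 for the illustration are both assumptions of this sketch. The distances are copied from the data listing in the printout.

```r
# Flight distances in the order listed on the printout (12 flights per design)
dist <- c(5.00, 6.00, 6.25, 7.00, 4.75, 4.50, 6.75, 7.25, 7.00, 10.00, 4.50, 4.50,
          10.00, 8.50, 15.50, 10.00, 5.50, 6.00, 10.00, 14.75, 16.50, 16.00, 6.00, 5.75,
          4.50, 4.50, 5.75, 4.50, 4.50, 4.25, 4.50, 5.00, 5.25, 4.50, 5.75, 4.25)
Design <- factor(rep(1:3, each = 12))

# One-factor model: mean distance depends on Design only (ASSUMED simple model)
fit <- lm(dist ~ Design)

# Predicted mean distance for Design level 2, with 95% confidence limits
pred <- predict(fit, newdata = data.frame(Design = factor(2, levels = 1:3)),
                interval = "confidence")
pred
```

With balanced data the fitted value is just the Design 2 sample mean, which agrees with the printed intercept plus the Design2 coefficient (7.09028 + 3.28472).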

7
R Code and Output for the Banknote Data
> Banknote[1:5,]
V1 V2 V3 V4 V5
1 3.62160 8.6661 -2.8073 -0.44699 0
2 4.54590 8.1674 -2.4586 -1.46210 0
3 3.86600 -2.6383 1.9242 0.10645 0
4 3.45660 9.5228 -4.0112 -3.59440 0
5 0.32924 -4.4552 4.5718 -0.98880 0

> summary(Banknote)
V1 V2 V3 V4 V5
Min. :-7.0421 Min. :-13.773 Min. :-5.2861 Min. :-8.5482 Min. :0.0000
1st Qu.:-1.7730 1st Qu.: -1.708 1st Qu.:-1.5750 1st Qu.:-2.4135 1st Qu.:0.0000
Median : 0.4962 Median : 2.320 Median : 0.6166 Median :-0.5867 Median :0.0000
Mean : 0.4337 Mean : 1.922 Mean : 1.3976 Mean :-1.1917 Mean :0.4446
3rd Qu.: 2.8215 3rd Qu.: 6.815 3rd Qu.: 3.1793 3rd Qu.: 0.3948 3rd Qu.:1.0000
Max. : 6.8248 Max. : 12.952 Max. :17.9274 Max. : 2.4495 Max. :1.0000

> bank.out<-glm(as.factor(V5)~V1+V2+V3+V4,data=Banknote,family=binomial())
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(bank.out)

Call:
glm(formula = as.factor(V5) ~ V1 + V2 + V3 + V4, family = binomial(),
data = Banknote)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.70001 0.00000 0.00000 0.00029 2.24614

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.3218 1.5589 4.697 2.64e-06 ***
V1 -7.8593 1.7383 -4.521 6.15e-06 ***
V2 -4.1910 0.9041 -4.635 3.56e-06 ***
V3 -5.2874 1.1612 -4.553 5.28e-06 ***
V4 -0.6053 0.3307 -1.830 0.0672 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1885.122 on 1371 degrees of freedom
Residual deviance: 49.891 on 1367 degrees of freedom
AIC: 59.891

Number of Fisher Scoring iterations: 12

> unknown<-data.frame(V1=.2,V2=.8,V3=.4,V4=-.6)
> predict(bank.out,newdata=unknown,se.fit=TRUE)
$fit
1
0.6453872

$se.fit
[1] 0.4574428

$residual.scale
[1] 1

8
R Code and Output for the Optical Data
> optical.out<-nls(y~b1+b2*exp(-b3*x),start=c(b1=0,b2=3,b3=1),trace=T)
1.142691 : 0 3 1
0.4897814 : 0.08377278 2.66283847 0.67210762
0.4604279 : 0.02919644 2.72294772 0.68326005
0.4604271 : 0.02874071 2.72328367 0.68274958
0.4604271 : 0.02875388 2.72327628 0.68276289
> summary(optical.out)

Formula: y ~ b1 + b2 * exp(-b3 * x)

Parameters:
Estimate Std. Error t value Pr(>|t|)
b1 0.02875 0.17152 0.168 0.870571
b2 2.72328 0.21054 12.935 4.05e-07 ***
b3 0.68276 0.14166 4.820 0.000947 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2262 on 9 degrees of freedom

Number of iterations to convergence: 4
Achieved convergence tolerance: 7.998e-07

> confint(optical.out)
Waiting for profiling to be done...
0.5970897 : 2.7232763 0.6827629
0.4785793 : 2.8120293 0.6082035
.
.
.
1.042523 : 0.3770538 2.4924962
1.041343 : 0.3664283 2.4959833
2.5% 97.5%
b1 -0.5093296 0.3499205
b2 2.2623059 3.2411076
b3 0.4017537 1.0651215
> predict(optical.out)
[1] 2.7520302 2.7520302 1.4046053 1.4046053 0.7238604 0.7238604 0.3799351 0.3799351
[9] 0.2061774 0.2061774 0.1183916 0.1183916

R Code and Output for the Dowel Strength Data

> cbind(type,diam,length,strength)
type diam length strength
[1,] 1 0.1250 4 51.5
[2,] 1 0.1250 4 37.4
[3,] 1 0.1250 4 59.3
[4,] 1 0.1250 4 58.5
[5,] 2 0.1250 8 5.2
[6,] 2 0.1250 8 6.4
[7,] 2 0.1250 8 9.0
[8,] 2 0.1250 8 6.3
[9,] 3 0.1250 12 2.5
[10,] 3 0.1250 12 3.3
[11,] 3 0.1250 12 2.6
[12,] 3 0.1250 12 1.9
[13,] 4 0.1875 4 225.3
[14,] 4 0.1875 4 233.9
[15,] 4 0.1875 4 211.2
[16,] 4 0.1875 4 212.8
[17,] 5 0.1875 8 47.0
[18,] 5 0.1875 8 79.2
[19,] 5 0.1875 8 88.7
[20,] 5 0.1875 8 70.2
[21,] 6 0.1875 12 18.4
[22,] 6 0.1875 12 22.4
[23,] 6 0.1875 12 18.9
[24,] 6 0.1875 12 16.6
[25,] 7 0.2500 4 358.8
[26,] 7 0.2500 4 309.6
[27,] 7 0.2500 4 343.5
[28,] 7 0.2500 4 357.8
[29,] 8 0.2500 8 127.1
[30,] 8 0.2500 8 158.0
[31,] 8 0.2500 8 194.0
[32,] 8 0.2500 8 133.0
[33,] 9 0.2500 12 68.9
[34,] 9 0.2500 12 40.5
[35,] 9 0.2500 12 50.3
[36,] 9 0.2500 12 65.6
>
> options(contrasts = rep("contr.sum", 2))
>
> aggregate(strength,by=list(type),FUN=mean)
Group.1 x
1 1 51.675
2 2 6.725
3 3 2.575
4 4 220.800
5 5 71.275
6 6 19.075
7 7 342.425
8 8 153.025
9 9 56.325
> aggregate(strength,by=list(type),FUN=sd)
Group.1 x
1 1 10.1411291
2 2 1.6111590
3 3 0.5737305
4 4 10.7706391
5 5 17.8593346
6 6 2.4267605
7 7 22.9722115
8 8 30.4237161
9 9 13.3027253
> summary(lm(strength~as.factor(type)))

Call:
lm(formula = strength ~ as.factor(type))

Residuals:
Min 1Q Median 3Q Max
-32.825 -3.363 -0.125 7.025 40.975

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 102.656 2.592 39.604 < 2e-16 ***
as.factor(type)1 -50.981 7.331 -6.954 1.79e-07 ***
as.factor(type)2 -95.931 7.331 -13.085 3.34e-13 ***
as.factor(type)3 -100.081 7.331 -13.651 1.23e-13 ***
as.factor(type)4 118.144 7.331 16.115 2.24e-15 ***
as.factor(type)5 -31.381 7.331 -4.280 0.00021 ***
as.factor(type)6 -83.581 7.331 -11.400 7.95e-12 ***
as.factor(type)7 239.769 7.331 32.704 < 2e-16 ***
as.factor(type)8 50.369 7.331 6.870 2.21e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.55 on 27 degrees of freedom
Multiple R-squared: 0.9848, Adjusted R-squared: 0.9803
F-statistic: 219 on 8 and 27 DF, p-value: < 2.2e-16
10
>
> logstrength<-log(strength)
> logstrength
[1] 3.9415818 3.6216707 4.0826093 4.0690268 1.6486586 1.8562980 2.1972246 1.8405496
[9] 0.9162907 1.1939225 0.9555114 0.6418539 5.4174328 5.4548937 5.3528056 5.3603528
[17] 3.8501476 4.3719763 4.4852599 4.2513483 2.9123507 3.1090610 2.9391619 2.8094027
[25] 5.8827651 5.7352811 5.8391871 5.8799742 4.8449742 5.0625950 5.2678582 4.8903491
[33] 4.2326562 3.7013020 3.9180051 4.1835757
>
> aggregate(logstrength,by=list(type),FUN=mean)
Group.1 x
1 1 3.9287221
2 2 1.8856827
3 3 0.9268946
4 4 5.3963712
5 5 4.2396830
6 6 2.9424941
7 7 5.8343019
8 8 5.0164441
9 9 4.0088847
> aggregate(logstrength,by=list(type),FUN=sd)
Group.1 x
1 1 0.21433043
2 2 0.22813681
3 3 0.22618831
4 4 0.04852411
5 5 0.27669684
6 6 0.12433499
7 7 0.06895316
8 8 0.19204237
9 9 0.24728988
> summary(lm(logstrength~as.factor(type)))

Call:
lm(formula = logstrength ~ as.factor(type))

Residuals:
Min 1Q Median 3Q Max
-0.38954 -0.09291 0.00828 0.13430 0.31154

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.79772 0.03269 116.162 < 2e-16 ***
as.factor(type)1 0.13100 0.09247 1.417 0.168
as.factor(type)2 -1.91204 0.09247 -20.677 < 2e-16 ***
as.factor(type)3 -2.87083 0.09247 -31.046 < 2e-16 ***
as.factor(type)4 1.59865 0.09247 17.288 3.95e-16 ***
as.factor(type)5 0.44196 0.09247 4.780 5.51e-05 ***
as.factor(type)6 -0.85523 0.09247 -9.249 7.38e-10 ***
as.factor(type)7 2.03658 0.09247 22.024 < 2e-16 ***
as.factor(type)8 1.21872 0.09247 13.180 2.82e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1962 on 27 degrees of freedom
Multiple R-squared: 0.9878, Adjusted R-squared: 0.9842
F-statistic: 273.8 on 8 and 27 DF, p-value: < 2.2e-16

>
> summary(lm(logstrength~as.factor(diam)*as.factor(length)))

Call:
lm(formula = logstrength ~ as.factor(diam) * as.factor(length))

Residuals:
Min 1Q Median 3Q Max
-0.38954 -0.09291 0.00828 0.13430 0.31154

11
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.79772 0.03269 116.162 < 2e-16 ***
as.factor(diam)1 -1.55062 0.04624 -33.538 < 2e-16 ***
as.factor(diam)2 0.39513 0.04624 8.546 3.69e-09 ***
as.factor(length)1 1.25541 0.04624 27.153 < 2e-16 ***
as.factor(length)2 -0.08378 0.04624 -1.812 0.08111 .
as.factor(diam)1:as.factor(length)1 0.42621 0.06539 6.518 5.46e-07 ***
as.factor(diam)2:as.factor(length)1 -0.05189 0.06539 -0.794 0.43435
as.factor(diam)1:as.factor(length)2 -0.27763 0.06539 -4.246 0.00023 ***
as.factor(diam)2:as.factor(length)2 0.13062 0.06539 1.998 0.05593 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1962 on 27 degrees of freedom
Multiple R-squared: 0.9878, Adjusted R-squared: 0.9842
F-statistic: 273.8 on 8 and 27 DF, p-value: < 2.2e-16

> anova(lm(logstrength~as.factor(diam)*as.factor(length)))
Analysis of Variance Table

Response: logstrength
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(diam) 2 46.748 23.3742 607.462 < 2.2e-16 ***
as.factor(length) 2 35.470 17.7348 460.900 < 2.2e-16 ***
as.factor(diam):as.factor(length) 4 2.081 0.5202 13.518 3.579e-06 ***
Residuals 27 1.039 0.0385
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R Code and Output for the Paper Airplane Data

> cbind(design,weight,paper,dist)
design weight paper dist
[1,] 1 1 1 5.00
[2,] 1 1 2 6.00
[3,] 1 1 1 6.25
[4,] 1 1 2 7.00
[5,] 1 1 1 4.75
[6,] 1 1 2 4.50
[7,] 1 2 1 6.75
[8,] 1 2 2 7.25
[9,] 1 2 1 7.00
[10,] 1 2 2 10.00
[11,] 1 2 1 4.50
[12,] 1 2 2 4.50
[13,] 2 1 1 10.00
[14,] 2 1 2 8.50
[15,] 2 1 1 15.50
[16,] 2 1 2 10.00
[17,] 2 1 1 5.50
[18,] 2 1 2 6.00
[19,] 2 2 1 10.00
[20,] 2 2 2 14.75
[21,] 2 2 1 16.50
[22,] 2 2 2 16.00
[23,] 2 2 1 6.00
[24,] 2 2 2 5.75
[25,] 3 1 1 4.50
[26,] 3 1 2 4.50
[27,] 3 1 1 5.75
[28,] 3 1 2 4.50
[29,] 3 1 1 4.50
[30,] 3 1 2 4.25
[31,] 3 2 1 4.50
[32,] 3 2 2 5.00
[33,] 3 2 1 5.25
[34,] 3 2 2 4.50
[35,] 3 2 1 5.75
[36,] 3 2 2 4.25
>
> Design<-as.factor(design)
> Weight<-as.factor(weight)
> Paper<-as.factor(paper)
>
> summary(lm(dist~Design*Weight*Paper))

Call:
lm(formula = dist ~ Design * Weight * Paper)

Residuals:
Min 1Q Median 3Q Max
-6.4167 -0.6042 0.0417 0.8542 5.6667

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.09028 0.48422 14.643 1.83e-13 ***
Design1 -0.96528 0.68479 -1.410 0.171
Design2 3.28472 0.68479 4.797 6.96e-05 ***
Weight1 -0.59028 0.48422 -1.219 0.235
Paper1 0.02083 0.48422 0.043 0.966
Design1:Weight1 0.04861 0.68479 0.071 0.944
Design2:Weight1 -0.53472 0.68479 -0.781 0.443
Design1:Paper1 -0.43750 0.68479 -0.639 0.529
Design2:Paper1 0.18750 0.68479 0.274 0.787
Weight1:Paper1 0.34028 0.48422 0.703 0.489
Design1:Weight1:Paper1 -0.17361 0.68479 -0.254 0.802
Design2:Weight1:Paper1 0.53472 0.68479 0.781 0.443
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.905 on 24 degrees of freedom
Multiple R-squared: 0.5392, Adjusted R-squared: 0.328
F-statistic: 2.553 on 11 and 24 DF, p-value: 0.02659

> anova(lm(dist~Design*Weight*Paper))
Analysis of Variance Table

Response: dist
Df Sum Sq Mean Sq F value Pr(>F)
Design 2 205.212 102.606 12.1557 0.0002259 ***
Weight 1 12.543 12.543 1.4860 0.2346823
Paper 1 0.016 0.016 0.0019 0.9660381
Design:Weight 2 6.295 3.148 0.3729 0.6926604
Design:Paper 2 3.469 1.734 0.2055 0.8156812
Weight:Paper 1 4.168 4.168 0.4938 0.4889847
Design:Weight:Paper 2 5.358 2.679 0.3174 0.7310780
Residuals 24 202.583 8.441
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

13
