Fall 2016
(Corrected Version)
________________________________________________________
Name Signed Date
_________________________________________________________
Name Printed
ATTENTION!
1
1. There are data on the UCI Machine Learning Repository, due originally to P. Tüfekci and H. Kaya, concerning the running of a power plant. Hourly information on atmospheric conditions and a plant operating variable was collected over a number of years, along with the hourly energy output of the plant. This question concerns MLR analyses of a random sample of 200 of the hourly periods, treating mean "PE" (electrical power) as a function of the variables "AT" (ambient temperature in °C), "AP" (ambient pressure in millibars), "RH" (relative humidity in %), and "V" (exhaust vacuum in cm Hg).
5 pts a) Below is a graphic from the "leaps" function regsubsets() for the n = 200 periods. Which 2 predictors seem to be most effective in predicting PE? What fraction of the raw variability in PE do they account for?
8 pts b) Give the value of, and the degrees of freedom for, an F statistic for comparing the full model involving all 4 predictors to the best 2-predictor model. (While it is not really needed to answer this question, SSTot = 54670 for these n = 200 cases.)
2
Below are some results (cross-validation root mean squared prediction error) from repeated 10-fold cross-validation, and values of RMSE and R² for several MLR models for PE.

Model   Predictors Included   CV-RMSPE   RMSE   R-Squared
  1     V                        8.92     8.81     .720
  2     AT                       5.30     5.24     .901
  3     AT,V                     4.66     4.70     .921
  4     AT,RH                    4.56     4.60     .925
  5     AT,AP,RH                 4.61     4.60     .925
  6     AT,V,RH                  4.31     4.32     .934
  7     AT,V,AP,RH               4.36     4.34     .934
4 pts c) Which of models 1-7 is most attractive on the basis of the table above? Explain.
4 pts d) What about the table above suggests that none of the models fit there suffers dramatic overfitting?
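For part b), an F statistic can be built directly from the R² values in the table. A sketch of the arithmetic (assuming, as the table suggests, that the best 2-predictor model is Model 4 with AT and RH; whether that matches the regsubsets() graphic should be checked):

```python
# F test comparing the full 4-predictor model (Model 7) to a reduced
# 2-predictor model, using the R-squared form of the statistic.
# Taking Model 4 (AT, RH) as the reduced model is an assumption here.
n = 200
r2_full, k_full = 0.934, 4   # Model 7: AT, V, AP, RH
r2_red, k_red = 0.925, 2     # Model 4: AT, RH

num_df = k_full - k_red      # numerator degrees of freedom
den_df = n - k_full - 1      # denominator degrees of freedom
F = ((r2_full - r2_red) / num_df) / ((1 - r2_full) / den_df)
print(round(F, 1), (num_df, den_df))
```

The same statistic could equivalently be computed from error sums of squares via SSE = (1 − R²)·SSTot with SSTot = 54670.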
Below are some scatterplots of the data from the 200 sample hours.
3
2. There is an interesting "Banknote Authentication" data set on the UCI Machine Learning Repository that consists of 4 numerical features extracted from 400 × 400 grey-scale images of real and counterfeit banknotes. There are 610 counterfeit and 762 real notes represented in the data set. There is a printout beginning on Page 8 of this exam from an attempt to model the probability that a note is counterfeit (V5=1) as a function of the features (V1,V2,V3,V4). Use it to answer the following questions.
6 pts a) Which of the features V1,V2,V3,V4 appears to be least important in modeling the probability
that V5=1 (the note is counterfeit)? Explain.
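Part a) can be framed mechanically from the Pr(>|z|) column of the glm() summary on the printout (a sketch; whether p-values alone settle "importance" is a judgment call):

```python
# Pr(>|z|) values for the four features, copied from the glm() summary
p_values = {"V1": 6.15e-06, "V2": 3.56e-06, "V3": 5.28e-06, "V4": 0.0672}

# The feature with the largest p-value is the weakest evidence of an effect
least_important = max(p_values, key=p_values.get)
print(least_important)
```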
6 pts b) Recall that if p(u) = exp(u)/(1 + exp(u)), then the "log odds" are u = ln(p(u)/(1 − p(u))). Give approximately 95% confidence limits for the increase in log odds that a banknote is counterfeit accompanying a unit increase in V1 if the other features V2, V3, V4 are held fixed.
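A sketch of the arithmetic for part b), using the V1 row of the glm() summary on the printout (estimate −7.8593, standard error 1.7383):

```python
# Approximate 95% CI for the V1 coefficient (change in log odds per unit V1),
# values copied from the glm() summary on the printout
est, se = -7.8593, 1.7383
z = 1.96  # standard normal 97.5th percentile

lo, hi = est - z * se, est + z * se
print(round(lo, 2), round(hi, 2))
```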
6 pts c) Give 2-sided approximately 95% confidence limits for the probability that a banknote with
features V1=.2,V2=.8,V3=.4,V4=-.6 is counterfeit.
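For part c), one standard approach is to build the interval on the logit scale, using the $fit and $se.fit values from predict() on the printout, and then transform the endpoints with p(u) = exp(u)/(1 + exp(u)). A sketch:

```python
import math

# predict(..., se.fit=TRUE) output from the printout; both values are on
# the logit (linear predictor) scale for a binomial glm
fit, se = 0.6453872, 0.4574428
z = 1.96

def expit(u):
    """Inverse of the log odds: p(u) = exp(u) / (1 + exp(u))."""
    return 1.0 / (1.0 + math.exp(-u))

lo_u, hi_u = fit - z * se, fit + z * se   # limits on the logit scale
lo_p, hi_p = expit(lo_u), expit(hi_u)     # transform endpoints to probabilities
print(round(lo_p, 3), round(hi_p, 3))
```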
4
3. A data set in the book Regression Analysis by Graybill and Iyer concerns how an optical reading, y, measuring light transmitted through a chemical solution depends upon the concentration of a chemical, x (in mg/l). A possible nonlinear (in coefficients β1, β2, and β3) form for the relationship between x and mean y is

μ_{y|x} = β1 + β2·exp(−β3·x)    (*)

A printout beginning on Page 9 summarizes an analysis of the n = 12 pairs in the data set.
6 pts a) Suppose relationship (*) above holds and that for a given concentration the optical reading is normally distributed with standard deviation σ. Give approximate 95% two-sided confidence limits for this model parameter.
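A sketch of one way to get limits for σ, assuming the final trace value printed by nls() (0.4604271) is the residual sum of squares, with n − 3 = 9 degrees of freedom: treat SSE/σ² as χ² with 9 df and invert. The χ² percentiles below are standard table values.

```python
import math

sse = 0.4604271   # final residual SS from the nls() trace (an assumption)
df = 12 - 3       # n = 12 pairs minus 3 estimated parameters

# chi-square percentiles for 9 df (standard table values)
chi2_upper = 19.023   # 97.5th percentile
chi2_lower = 2.700    # 2.5th percentile

sigma_hat = math.sqrt(sse / df)          # point estimate of sigma
lo = math.sqrt(sse / chi2_upper)         # lower confidence limit
hi = math.sqrt(sse / chi2_lower)         # upper confidence limit
print(round(sigma_hat, 3), round(lo, 3), round(hi, 3))
```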
5 pts b) According to the relationship (*), as concentration, x, goes from 0 to ∞, the mean light transmitted goes from β1 + β2 to β1. The value of concentration, x, at which half of the decrease in light transmission has been realized might be of interest. What is this in terms of the model parameters? Give 95% two-sided confidence limits for this value of x.
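Setting β1 + β2·exp(−β3·x) equal to β1 + β2/2 gives exp(−β3·x) = 1/2, so the half-decrease concentration is x = ln(2)/β3. Since this is a decreasing function of β3 alone, one approach (a sketch, not the only valid method) is to transform the endpoints of the confint() interval for b3 on the printout:

```python
import math

b3_hat = 0.68276                       # nls() estimate of beta3
b3_lo, b3_hi = 0.4017537, 1.0651215    # confint() limits for b3 (printout)

x_half = math.log(2) / b3_hat
# beta3 -> ln(2)/beta3 is decreasing, so the interval endpoints swap
x_lo = math.log(2) / b3_hi
x_hi = math.log(2) / b3_lo
print(round(x_half, 2), round(x_lo, 2), round(x_hi, 2))
```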
4. On page 217 of the white Vardeman and Jobe text there are data of Koh, Morden, and Ogbourne
that concern axial breaking strengths of wooden dowel rods of 3 different lengths and 3 different
diameters. A printout beginning on Page 9 of this exam summarizes some computations with these
data.
6 pts a) What about the printed analyses of dowel strength makes direct analysis of y under the usual
one-way normal model assumptions seem inappropriate?
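For part a), one concrete check (values copied from the two aggregate() calls on the printout) is how far the nine within-type standard deviations are from constant; their growth with the means is what motivates the log-scale analysis that follows on the printout:

```python
# Within-type sample means and standard deviations of dowel strength,
# copied from the aggregate() output on the printout (types 1-9)
means = [51.675, 6.725, 2.575, 220.800, 71.275, 19.075, 342.425, 153.025, 56.325]
sds = [10.1411291, 1.6111590, 0.5737305, 10.7706391, 17.8593346,
       2.4267605, 22.9722115, 30.4237161, 13.3027253]

# The sd's span roughly a 50-fold range and tend to track the means,
# contradicting the constant-variance assumption of the one-way normal model
ratio = max(sds) / min(sds)
print(round(ratio, 1))
```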
+ / − margin: ____________
6 pts c) Based on the plot above, which effects appear to be both statistically detectable and most
important? (Consider diameter and length main effects and interactions. List an order of importance.)
6 pts d) What items on the printout support your judgment in c)? Explain how they lend support.
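For parts c) and d), the anova() table on the printout (log-strength scale) gives one quantitative handle: compare the sums of squares attributed to each term. A sketch:

```python
# Sums of squares from the anova() table for the logstrength model
ss = {"diam": 46.748, "length": 35.470, "diam:length": 2.081, "residual": 1.039}
total = sum(ss.values())

# Share of the total variation attributable to each term
shares = {k: round(v / total, 3) for k, v in ss.items()}
# Effects ordered by importance (largest SS first), residual excluded
order = sorted((k for k in ss if k != "residual"), key=lambda k: -ss[k])
print(order, shares)
```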
6
5. Beginning on Page 12 there is R code and output corresponding to a balanced 3 × 2 × 3 experiment
on paper airplane flight distances (carried out in an undergraduate engineering statistics class). There
are 3 levels of the factor "Design," 2 levels of the factor (nose) "Weight," and 3 levels of the factor
"Paper" (type) in the study. Use the R output to answer the rest of the questions on this exam.
6 pts a) What is the value of s_pooled for this data set? (Say where you found your value.) What does this measure in the present context?
6 pts b) What is the relatively simple interpretation that is possible for these data? (What factorial effect(s)
dominate(s) and what does that mean about the flying of paper airplanes?) What on the output tells
you that this is so?
6 pts c) What type or types of airplanes fly furthest (according to the outcome of this study)? Explain.
6 pts d) What do you predict for the average flight distance of the type or types of planes you identified in
part c) based on a good simple model here?
7
R Code and Output for the Banknote Data
> Banknote[1:5,]
V1 V2 V3 V4 V5
1 3.62160 8.6661 -2.8073 -0.44699 0
2 4.54590 8.1674 -2.4586 -1.46210 0
3 3.86600 -2.6383 1.9242 0.10645 0
4 3.45660 9.5228 -4.0112 -3.59440 0
5 0.32924 -4.4552 4.5718 -0.98880 0
> summary(Banknote)
V1 V2 V3 V4 V5
Min. :-7.0421 Min. :-13.773 Min. :-5.2861 Min. :-8.5482 Min. :0.0000
1st Qu.:-1.7730 1st Qu.: -1.708 1st Qu.:-1.5750 1st Qu.:-2.4135 1st Qu.:0.0000
Median : 0.4962 Median : 2.320 Median : 0.6166 Median :-0.5867 Median :0.0000
Mean : 0.4337 Mean : 1.922 Mean : 1.3976 Mean :-1.1917 Mean :0.4446
3rd Qu.: 2.8215 3rd Qu.: 6.815 3rd Qu.: 3.1793 3rd Qu.: 0.3948 3rd Qu.:1.0000
Max. : 6.8248 Max. : 12.952 Max. :17.9274 Max. : 2.4495 Max. :1.0000
> bank.out<-glm(as.factor(V5)~V1+V2+V3+V4,data=Banknote,family=binomial())
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(bank.out)
Call:
glm(formula = as.factor(V5) ~ V1 + V2 + V3 + V4, family = binomial(),
data = Banknote)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.70001 0.00000 0.00000 0.00029 2.24614
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.3218 1.5589 4.697 2.64e-06 ***
V1 -7.8593 1.7383 -4.521 6.15e-06 ***
V2 -4.1910 0.9041 -4.635 3.56e-06 ***
V3 -5.2874 1.1612 -4.553 5.28e-06 ***
V4 -0.6053 0.3307 -1.830 0.0672 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> unknown<-data.frame(V1=.2,V2=.8,V3=.4,V4=-.6)
> predict(bank.out,newdata=unknown,se.fit=TRUE)
$fit
1
0.6453872
$se.fit
[1] 0.4574428
$residual.scale
[1] 1
8
R Code and Output for the Optical Data
> optical.out<-nls(y~b1+b2*exp(-b3*x),start=c(b1=0,b2=3,b3=1),trace=T)
1.142691 : 0 3 1
0.4897814 : 0.08377278 2.66283847 0.67210762
0.4604279 : 0.02919644 2.72294772 0.68326005
0.4604271 : 0.02874071 2.72328367 0.68274958
0.4604271 : 0.02875388 2.72327628 0.68276289
> summary(optical.out)
Formula: y ~ b1 + b2 * exp(-b3 * x)
Parameters:
Estimate Std. Error t value Pr(>|t|)
b1 0.02875 0.17152 0.168 0.870571
b2 2.72328 0.21054 12.935 4.05e-07 ***
b3 0.68276 0.14166 4.820 0.000947 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> confint(optical.out)
Waiting for profiling to be done...
0.5970897 : 2.7232763 0.6827629
0.4785793 : 2.8120293 0.6082035
.
.
.
1.042523 : 0.3770538 2.4924962
1.041343 : 0.3664283 2.4959833
2.5% 97.5%
b1 -0.5093296 0.3499205
b2 2.2623059 3.2411076
b3 0.4017537 1.0651215
> predict(optical.out)
[1] 2.7520302 2.7520302 1.4046053 1.4046053 0.7238604 0.7238604 0.3799351 0.3799351
[9] 0.2061774 0.2061774 0.1183916 0.1183916
9
[19,] 5 0.1875 8 88.7
[20,] 5 0.1875 8 70.2
[21,] 6 0.1875 12 18.4
[22,] 6 0.1875 12 22.4
[23,] 6 0.1875 12 18.9
[24,] 6 0.1875 12 16.6
[25,] 7 0.2500 4 358.8
[26,] 7 0.2500 4 309.6
[27,] 7 0.2500 4 343.5
[28,] 7 0.2500 4 357.8
[29,] 8 0.2500 8 127.1
[30,] 8 0.2500 8 158.0
[31,] 8 0.2500 8 194.0
[32,] 8 0.2500 8 133.0
[33,] 9 0.2500 12 68.9
[34,] 9 0.2500 12 40.5
[35,] 9 0.2500 12 50.3
[36,] 9 0.2500 12 65.6
>
> options(contrasts = rep("contr.sum", 2))
>
> aggregate(strength,by=list(type),FUN=mean)
Group.1 x
1 1 51.675
2 2 6.725
3 3 2.575
4 4 220.800
5 5 71.275
6 6 19.075
7 7 342.425
8 8 153.025
9 9 56.325
> aggregate(strength,by=list(type),FUN=sd)
Group.1 x
1 1 10.1411291
2 2 1.6111590
3 3 0.5737305
4 4 10.7706391
5 5 17.8593346
6 6 2.4267605
7 7 22.9722115
8 8 30.4237161
9 9 13.3027253
> summary(lm(strength~as.factor(type)))
Call:
lm(formula = strength ~ as.factor(type))
Residuals:
Min 1Q Median 3Q Max
-32.825 -3.363 -0.125 7.025 40.975
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 102.656 2.592 39.604 < 2e-16 ***
as.factor(type)1 -50.981 7.331 -6.954 1.79e-07 ***
as.factor(type)2 -95.931 7.331 -13.085 3.34e-13 ***
as.factor(type)3 -100.081 7.331 -13.651 1.23e-13 ***
as.factor(type)4 118.144 7.331 16.115 2.24e-15 ***
as.factor(type)5 -31.381 7.331 -4.280 0.00021 ***
as.factor(type)6 -83.581 7.331 -11.400 7.95e-12 ***
as.factor(type)7 239.769 7.331 32.704 < 2e-16 ***
as.factor(type)8 50.369 7.331 6.870 2.21e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = logstrength ~ as.factor(type))
Residuals:
Min 1Q Median 3Q Max
-0.38954 -0.09291 0.00828 0.13430 0.31154
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.79772 0.03269 116.162 < 2e-16 ***
as.factor(type)1 0.13100 0.09247 1.417 0.168
as.factor(type)2 -1.91204 0.09247 -20.677 < 2e-16 ***
as.factor(type)3 -2.87083 0.09247 -31.046 < 2e-16 ***
as.factor(type)4 1.59865 0.09247 17.288 3.95e-16 ***
as.factor(type)5 0.44196 0.09247 4.780 5.51e-05 ***
as.factor(type)6 -0.85523 0.09247 -9.249 7.38e-10 ***
as.factor(type)7 2.03658 0.09247 22.024 < 2e-16 ***
as.factor(type)8 1.21872 0.09247 13.180 2.82e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> summary(lm(logstrength~as.factor(diam)*as.factor(length)))
Call:
lm(formula = logstrength ~ as.factor(diam) * as.factor(length))
Residuals:
Min 1Q Median 3Q Max
-0.38954 -0.09291 0.00828 0.13430 0.31154
11
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.79772 0.03269 116.162 < 2e-16 ***
as.factor(diam)1 -1.55062 0.04624 -33.538 < 2e-16 ***
as.factor(diam)2 0.39513 0.04624 8.546 3.69e-09 ***
as.factor(length)1 1.25541 0.04624 27.153 < 2e-16 ***
as.factor(length)2 -0.08378 0.04624 -1.812 0.08111 .
as.factor(diam)1:as.factor(length)1 0.42621 0.06539 6.518 5.46e-07 ***
as.factor(diam)2:as.factor(length)1 -0.05189 0.06539 -0.794 0.43435
as.factor(diam)1:as.factor(length)2 -0.27763 0.06539 -4.246 0.00023 ***
as.factor(diam)2:as.factor(length)2 0.13062 0.06539 1.998 0.05593 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> anova(lm(logstrength~as.factor(diam)*as.factor(length)))
Analysis of Variance Table
Response: logstrength
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(diam) 2 46.748 23.3742 607.462 < 2.2e-16 ***
as.factor(length) 2 35.470 17.7348 460.900 < 2.2e-16 ***
as.factor(diam):as.factor(length) 4 2.081 0.5202 13.518 3.579e-06 ***
Residuals 27 1.039 0.0385
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
12
[33,] 3 2 1 5.25
[34,] 3 2 2 4.50
[35,] 3 2 1 5.75
[36,] 3 2 2 4.25
>
> Design<-as.factor(design)
> Weight<-as.factor(weight)
> Paper<-as.factor(paper)
>
> summary(lm(dist~Design*Weight*Paper))
Call:
lm(formula = dist ~ Design * Weight * Paper)
Residuals:
Min 1Q Median 3Q Max
-6.4167 -0.6042 0.0417 0.8542 5.6667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.09028 0.48422 14.643 1.83e-13 ***
Design1 -0.96528 0.68479 -1.410 0.171
Design2 3.28472 0.68479 4.797 6.96e-05 ***
Weight1 -0.59028 0.48422 -1.219 0.235
Paper1 0.02083 0.48422 0.043 0.966
Design1:Weight1 0.04861 0.68479 0.071 0.944
Design2:Weight1 -0.53472 0.68479 -0.781 0.443
Design1:Paper1 -0.43750 0.68479 -0.639 0.529
Design2:Paper1 0.18750 0.68479 0.274 0.787
Weight1:Paper1 0.34028 0.48422 0.703 0.489
Design1:Weight1:Paper1 -0.17361 0.68479 -0.254 0.802
Design2:Weight1:Paper1 0.53472 0.68479 0.781 0.443
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> anova(lm(dist~Design*Weight*Paper))
Analysis of Variance Table
Response: dist
Df Sum Sq Mean Sq F value Pr(>F)
Design 2 205.212 102.606 12.1557 0.0002259 ***
Weight 1 12.543 12.543 1.4860 0.2346823
Paper 1 0.016 0.016 0.0019 0.9660381
Design:Weight 2 6.295 3.148 0.3729 0.6926604
Design:Paper 2 3.469 1.734 0.2055 0.8156812
Weight:Paper 1 4.168 4.168 0.4938 0.4889847
Design:Weight:Paper 2 5.358 2.679 0.3174 0.7310780
Residuals 24 202.583 8.441
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
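For parts a) and d) of Question 5, a sketch of the arithmetic from the ANOVA table and coefficient listing above. With the sum-to-zero contrasts set by options(contrasts=...), the Design 3 effect is minus the sum of the two printed Design effects; reading "dist depends only on Design" as the good simple model reflects the F tests above, where only Design is significant.

```python
import math

# s_pooled is the square root of the residual mean square in the ANOVA table
mse = 8.441
s_pooled = math.sqrt(mse)

# Sum-to-zero Design main effects from the coefficient listing
mu = 7.09028
d1, d2 = -0.96528, 3.28472
d3 = -(d1 + d2)   # effects sum to zero under contr.sum

# Simple "Design main effects only" model: predicted mean flight distance
# for the best design (Design 2, the largest positive effect)
pred_best = mu + max(d1, d2, d3)
print(round(s_pooled, 2), pred_best)
```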
13