Chapter 12 introduces criteria for model selection and comparison and discusses some standard
methods of choosing models. Model selection depends on the objectives of the study. Ramsey and
Schafer identify three possible objectives (pp. 345-6) that will influence how you select a model or
models:
1. Adjusting for a large set of explanatory variables. We want to examine the effect of a
particular variable or variables after adjusting for the effect of other variables which we know
may affect the response.
2. Fishing for explanation. Which variables are important in explaining the response?
3. Prediction. All that is desired is a model that predicts the response well from the values of the
explanatory variables. Interpretation of the model is not a goal.
Before considering some criteria by which a “best” model might be selected among a class of models,
let’s review model development.
Example: Suppose we are studying differences in abundance of bird species in 3 forest habitats. The
habitats represent various levels of prescribed burns. The experiment itself consists of counting the
number of birds of each species type heard from a station within 100 meters in a 10-minute period.
Many stations were used in the study for replication.
Explanatory variables? Habitat type, Neighboring habitat type, elevation, slope, aspect, visibility, etc.
Suppose we have 8 candidate explanatory variables X1, …, X8. How many possible first-order models
are there? Each variable is either in or out of the model, so there are 2^8 − 1 = 255 possible
first-order models (excluding the model with no explanatory variables). With 20 variables, there are
2^20 − 1 = 1,048,575 models, and these are only the first-order models. Clearly fitting all possible
models is not a feasible prospect.
Chapter 12, page 2
Criteria for selecting models
1. R2 : R2 cannot decrease when variables are added so the model maximizing R2 is the one with
all the variables. Maximizing R2 is equivalent to minimizing SSE. R2 is an appropriate way to
compare models with the same number of explanatory variables (as long as the response
variable is the same). Be aware that measures like R2 based on correlations are sensitive to
outliers.
2. MSE = SSE/(n-p): MSE can increase when variables are added to the model so minimizing
MSE is a reasonable procedure. However, minimizing MSE is equivalent to maximizing
adjusted R2 (discussed below) and tends to overfit (include too many variables).
3. Adjusted R2 : This statistic adjusts R2 by including a penalty for the number of parameters in
the model. This statistic is closely related to both R2 and MSE, as shown below.
Adjusted R2 = (Total mean square − Residual mean square)/(Total mean square)
            = (MST − MSE)/MST = 1 − MSE/MST = R2 − (p − 1)(1 − R2)/(n − p)
where p is the number of coefficients (including the intercept) in the model. The third
expression shows that maximizing adjusted R2 is equivalent to minimizing MSE since MST is
fixed (it’s simply the variance of the response variable).
• Adjusted R2 tends to select models with too many variables (overfitting). This can be seen
from the fact that adjusted R2 will increase when a variable is added if the F statistic for
comparing the two models is greater than 1. This is a very generous criterion as this
corresponds to a significance level of around .5.
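The three expressions for adjusted R2 above are algebraically identical, which is easy to confirm numerically. The sums of squares below are hypothetical values chosen only to illustrate the identity.

```python
n, p = 50, 4             # hypothetical sample size and number of coefficients
SST, SSE = 200.0, 80.0   # hypothetical total and residual sums of squares

MST = SST / (n - 1)      # total mean square (variance of the response)
MSE = SSE / (n - p)      # residual mean square
R2 = 1 - SSE / SST

adj1 = (MST - MSE) / MST
adj2 = 1 - MSE / MST
adj3 = R2 - (p - 1) * (1 - R2) / (n - p)
print(round(adj1, 6), round(adj2, 6), round(adj3, 6))  # all three agree
```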
4. Mallows’ Cp: The Cp statistic assumes that the full model with all variables fits. Then Cp is
computed for a reduced model as
Cp = p + (n − p)(σ̂2 − σ̂2full)/σ̂2full = (n − p) σ̂2/σ̂2full + 2p − n
where p is the number of coefficients (including the intercept) in the reduced model.
• Note that σ̂ 2 is simply MSE (mean square error or mean square residual) for a model.
• Models with small values of Cp are considered better and, ideally, we look for the smallest
model with a Cp of around p or smaller. Some statistics programs will compute Cp for a
large set of models and plot Cp versus p, as in Display 12.9 on p. 357. Unfortunately, SPSS
does not compute Cp automatically.
• Cp assumes that the full model fits and satisfies all the regression model assumptions.
Outliers, unexplained nonlinearity, and nonconstant variance may seriously affect the
performance of Cp as a model selection tool.
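The two forms of the Cp formula are equivalent, which can be checked numerically; the σ̂2 values below are hypothetical MSEs for a reduced and a full model.

```python
n = 110
p = 4                     # coefficients (incl. intercept) in the reduced model
mse_full = 0.25           # hypothetical sigma-hat^2 from the full model
mse_red = 30.0 / (n - p)  # hypothetical MSE of the reduced model

# first form: p plus a term measuring the excess of the reduced model's MSE
cp_a = p + (n - p) * (mse_red - mse_full) / mse_full
# second form: (n - p) * MSE_red / MSE_full + 2p - n
cp_b = (n - p) * mse_red / mse_full + 2 * p - n
print(round(cp_a, 4), round(cp_b, 4))  # identical
```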
• Mallow's Cp is closely related to AIC. AIC has come to be preferred by many statisticians
in recent years.
5. Akaike's Information Criterion (AIC): The AIC statistic for a model is given by:
AIC = n ln(SSE/n) + 2p
where SSE = the error SS for the model under consideration and ln is natural log.
• The term 2p is the penalty for the number of parameters in the model.
• Ripley: “AIC has been criticized in asymptotic studies and simulation studies for tending to
over-fit, that is, choose a model at least as large as the true model. That is a virtue, not a
deficiency: this is a prediction-based criterion, not an explanation based one.'' BIC (below)
is a criterion based on “explanation” approach and places a bigger penalty on the number of
parameters.
• AIC can only be used to compare models. It is not an absolute measure of fit of the model
like R2 is. The model with the smallest AIC among those you examined may fit the data
best, but this does not mean it's a good model. Therefore, selecting which models to
consider (which variables, transformations, form of the model) and making sure the models
satisfy the regression model assumptions is very important.
• Since AIC is not an absolute measure of fit, many authors suggest reporting ∆AIC, the
difference between the AIC of each model and the AIC of the best fitting model. A further
suggestion is to consider all models with ∆AIC less than about 2 as having essentially equal
support.
• Neither AIC nor Cp nor R2 nor adjusted R2 can be used to compare models with different
response variables.
• AIC is based on the assumption that the models satisfy the regression model assumptions
and can be greatly affected by outliers.
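As a sketch of AIC and ΔAIC in use (the model names, SSEs, and parameter counts below are hypothetical, not the text's examples): compute AIC for each candidate, subtract the minimum, and treat models with ΔAIC under about 2 as having essentially equal support.

```python
import math

n = 110
models = {                    # name: (SSE, p) -- all values hypothetical
    "wind+temp":       (4.10, 3),
    "wind+temp+solar": (3.50, 4),
    "full quadratic":  (3.35, 7),
}
aic = {m: n * math.log(sse / n) + 2 * p for m, (sse, p) in models.items()}
best = min(aic.values())
delta = {m: a - best for m, a in aic.items()}
for m in models:
    print(m, round(aic[m], 2), round(delta[m], 2))
```

Here the larger model's ΔAIC is under 2, so it has essentially the same support as the AIC-best model, while the two-variable model's ΔAIC is far above 2.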
6. Bayesian Information Criterion (BIC). BIC is similar to AIC but the penalty on the number
of parameters is pln(n) where ln is the natural log. That is,
BIC = n ln(SSE/n) + p ln(n)
BIC is motivated by a Bayesian approach to model selection and is said not to tend to overfit
like AIC. Therefore, it may be better for model selection for “explanation.” The purpose of
having the penalty depend on the sample size n is to reduce the likelihood that small and
relatively unimportant parameters are included (which is more likely with large n).
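The only difference from AIC is the penalty: p ln(n) instead of 2p. Since ln(n) > 2 whenever n > e² ≈ 7.4, BIC's penalty per parameter is larger for essentially any real sample size, and it grows with n. A quick numerical sketch:

```python
import math

# BIC charges ln(n) per coefficient versus AIC's flat 2, so the gap
# between the two penalties widens as the sample size grows.
for n in (5, 10, 110, 1000):
    print(n, "ln(n) =", round(math.log(n), 3),
          "BIC penalty larger" if math.log(n) > 2 else "AIC penalty larger")
```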
7. PRESS Statistic (not in text): another prediction-based model selection statistic is the PRESS
statistic. It is calculated as follows: Remove the ith observation and fit the model with the
remaining n-1 observations. Then use this model to calculate a predicted value for the left-out
observation; call this predicted value Yi* . Compute Yi − Yi* , the difference between the
observed response and the predicted response from the model without the ith observation in it.
Repeat this process for each data value. The PRESS statistic is then defined as:
PRESS = Σ (Yi − Yi*)², summed over i = 1, …, n
• Leaving one observation out at a time is known as n-fold cross-validation or leave-one-out
cross-validation.
• The Yi − Yi* are called “deleted” residuals in SPSS. So the PRESS statistic can be
computed in SPSS by saving the deleted residuals, creating a new variable which is the
square of the deleted residuals, then computing the sum of this new variable using
Analyze…Descriptive Statistics…Descriptives and choosing Sum under Options.
• PRESS is similar to SSE, but is based on the deleted residuals rather than the raw residuals.
Unlike SSE, it’s possible for PRESS to increase when variables are added to the model.
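The recipe above can be sketched directly in code. This is an illustrative pure-Python version using simple linear regression on simulated data (not the text's examples): refitting without each observation in turn gives the deleted residuals, and their sum of squares is PRESS.

```python
import random

def fit_slr(xs, ys):
    # ordinary least squares for y = b0 + b1*x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

def press(xs, ys):
    # leave out each observation, refit, predict it, and accumulate
    # the squared deleted residuals
    total = 0.0
    for i in range(len(xs)):
        b0, b1 = fit_slr(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return total

random.seed(1)
xs = [float(i) for i in range(20)]
ys = [2.0 + 0.5 * x + random.gauss(0, 1) for x in xs]
b0, b1 = fit_slr(xs, ys)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
pr = press(xs, ys)
print(round(sse, 3), round(pr, 3))  # PRESS exceeds SSE
```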
The PRESS statistic is an example of the general idea of using cross-validation to assess the predictive
power of models. A model will generally predict the data it's based on better than new data, and bigger
models will necessarily do a better job of predicting the data they’re based on than smaller models:
SSE always decreases as more terms are added to the model. A less biased way of assessing the
predictive power of a model is to use the following general idea: fit a model using a subset of the data,
then validate the model using the remainder of the data. This is called cross-validation (abbreviated
CV).
In k-fold CV, the data are randomly split into k approximately equal-sized subsets. Each subset is left
out in turn and the model based on the remaining subsets is used to predict for the left-out subset. The
PRESS statistic is based on n-fold CV, that is, only one observation at a time is left out. Simulations
have suggested that smaller values of k may work better; 10-fold CV has become a standard method of
cross-validation. Cross-validation is most useful as a way to compare models rather than as an
absolute measure of how good the predictions will be. This is because the model used for prediction of
each subset is different than the model based on all the data that will actually be used to predict future
observations. Each of the models being compared should use the same splits of the data. It’s also best
to repeat the 10-fold CV several times and average the results.
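A sketch of k-fold CV along these lines (simulated data, simple linear regression; function and variable names are ours): split the indices into k random folds, hold each fold out in turn, and average the squared prediction errors.

```python
import random

def fit_slr(xs, ys):
    # ordinary least squares for y = b0 + b1*x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

def kfold_mse(xs, ys, k=10, seed=0):
    # random split into k roughly equal folds; each fold is predicted
    # from a model fit to the other k-1 folds
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_errs = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        b0, b1 = fit_slr([xs[i] for i in train], [ys[i] for i in train])
        sq_errs += [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in fold]
    return sum(sq_errs) / len(sq_errs)

random.seed(3)
xs = [random.uniform(0, 10) for _ in range(50)]
ys = [1 + 2 * x + random.gauss(0, 1) for x in xs]
cv = kfold_mse(xs, ys)            # estimate of prediction MSE
print(round(cv, 3))
```

To compare models, each candidate model would be run through the same folds (same `seed`), as the text recommends.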
Example
Ozone data without case 17. n = 110 cases. Dependent variable is log10(ozone).
• Choose several models a priori that make scientific sense. Use criteria above (like AIC and
BIC) to compare models.
• Examine all possible models involving the variables, including interactions, quadratic terms,
or both (this is what was done with the Ozone data). Generally feasible only up to 3 or 4
variables.
• Examine all main-effects models only (there are 2^k − 1 possible models, where k is the number
of variables). Consider interactions or other higher-order terms only after the main effects have
been selected.
• If the number of variables is large, select a subset of the variables first, perhaps based on the
correlation of each of the variables individually with the response and/or eliminating redundant
variables (ones which are highly correlated with another variable). Then proceed with one of
the above approaches.
• If the number of variables is large, use stepwise regression to select possible models. Stepwise
regression does not require examination of all models.
Some authors do not believe in stepwise methods and other procedures that search for “good-fitting”
models because they are essentially searching through many tens or hundreds of possible models,
whether they make any scientific sense or not, and picking the “best” ones. The more models you
consider, the higher the likelihood you will select the “wrong” one. Therefore, they believe, you should
select a few models a priori that you will compare. Others argue that there is no “right” model and that
if the goal is prediction, it does not matter if the model makes physical sense. In that case, cross-
validation (discussed above) might be an important tool.
Stepwise regression
Stepwise regression methods attempt to find models minimizing or maximizing some criterion without
examining every possible model. Stepwise methods are not guaranteed to find the best model (in
terms of the criterion selected), but simply try to find good models using a one-step-at-a-time
approach.
The three most common types of subset selection methods employed are outlined below. The criterion
used in these descriptions is the F statistic for comparing two nested models, but stepwise methods can
also use the associated P-value, or AIC or BIC as a criterion. The latter two are now generally
preferred to the F statistic or P-value. SPSS, however, only does stepwise regression with the F
statistic or P-value.
Forward Selection
1. Start with the model with only the constant.
2. Consider all models which consist of the current model plus one more term. For each term not
in the model, calculate its “F-to-enter” (the extra sum-of-squares F statistic). Identify the
variable with the largest F-to-enter. Higher order terms (interactions, quadratic terms) are
eligible for entry only if all lower order terms involved in them are already in the model. For
example, do not consider the interaction AxB for entry unless both A and B individually are
already in the model.
3. If the largest F-to-enter is greater than 4 (or some other user-specified number), add this
variable to get a new current model and return to step 2. If the largest F-to-enter is less than the
user-specified number, stop.
The criterion could also be the P-value for the F-test, in which case a term is added only if its P-value
is less than the user-specified cutoff (usually somewhere between .05 and .20). If a variable is a
categorical variable with more than 2 levels, we add all the indicator variables for this variable at once.
Note that once a variable has been entered it cannot be removed, even if its coefficient becomes
statistically nonsignificant with the addition of other variables, which is possible.
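The forward-selection steps above can be sketched in code. This is an illustrative pure-Python implementation on simulated data (first-order terms only; it is not how SPSS implements the procedure): at each step, compute the extra-sum-of-squares F-to-enter for every variable not yet in the model and add the winner if its F exceeds the cutoff.

```python
import random

def ols_sse(X, y):
    # residual SS of the least-squares fit; X rows include the intercept 1
    n, k = len(X), len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         for i in range(k)]
    b = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    for col in range(k):                      # Gaussian elimination
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv], b[col], b[piv] = A[piv], A[col], b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * c for a, c in zip(A[r], A[col])]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for i in range(k - 1, -1, -1):
        beta[i] = (b[i] - sum(A[i][j] * beta[j]
                              for j in range(i + 1, k))) / A[i][i]
    return sum((y[r] - sum(X[r][j] * beta[j] for j in range(k))) ** 2
               for r in range(n))

def forward_select(vars_, y, f_to_enter=4.0):
    n, chosen = len(y), []
    sse_cur = ols_sse([[1.0]] * n, y)         # constant-only model
    while True:
        best = None
        for name in vars_:
            if name in chosen:
                continue
            X = [[1.0] + [vars_[v][r] for v in chosen + [name]]
                 for r in range(n)]
            sse_new = ols_sse(X, y)
            p = len(chosen) + 2               # coefficients incl. intercept
            F = (sse_cur - sse_new) / (sse_new / (n - p))  # F-to-enter
            if best is None or F > best[0]:
                best = (F, name, sse_new)
        if best is None or best[0] <= f_to_enter:
            return chosen                     # stop: largest F too small
        sse_cur = best[2]
        chosen.append(best[1])

random.seed(2)
n = 60
x = {v: [random.gauss(0, 1) for _ in range(n)] for v in ("x1", "x2", "x3")}
y = [3 + 2 * a - 1.5 * b + random.gauss(0, 1)
     for a, b in zip(x["x1"], x["x2"])]       # x3 is pure noise
sel = forward_select(x, y)
print(sel)
```

With this simulated data, the two variables that actually drive the response enter the model; a real implementation would also enforce the hierarchy rules for higher-order terms described above.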
Backward Elimination
1. Start with the model with all of the candidate variables and any higher order terms which might
be important.
2. Calculate the F-to-remove for each variable in the current model (the extra-sum-of-squares test
statistic). Identify the variable with the smallest F-to-remove. A lower order term is eligible
for removal only if all higher order terms involving that variable have already been removed.
For example, the variable A is not eligible for removal if AxB is still in the model.
3. If the smallest F-to-remove is 4 (or some other user-specified number) or less, then remove that
variable to get a new current model and return to step 2. If the smallest F-to-remove is greater
than the user-specified number, stop.
• Again, the criterion for removal could be the P-value (remove a variable only if its P-value is
greater than the cutoff).
• Backward elimination is preferred to forward selection by many users because it does not
eliminate a term unless there is good reason to (forward selection, on the other hand, does not
include a term unless there is convincing evidence to include it).
Stepwise Selection
This method is a hybrid of the previous two, involving both forward selection and backward
elimination.
1. Start with the model with only the constant.
2. Do one step of forward selection.
3. Do one step of backward elimination.
4. Repeat steps 2 and 3 until no changes occur during one cycle of steps 2 and 3.
The F-to-enter must be greater than the F-to-remove; otherwise, you could have a never-ending cycle
of a variable being entered, then eliminated. If a P-value cutoff is used, then the P for entry must be
smaller than the P for removal.
Forward selection in SAT data (Case study 12.1) using P of .05 or less to enter. Preliminary analysis
presented in text suggested that log of percent taking exam (log(takers)) should be used in place of
takers.
Coefficientsa
Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) 1112.248 12.275 90.611 .000
Log10(takers) -135.896 9.476 -.900 -14.340 .000
2 (Constant) 1060.351 15.539 68.239 .000
Log10(takers) -148.061 8.459 -.981 -17.504 .000
expend 2.900 .646 .252 4.488 .000
3 (Constant) 851.315 87.022 9.783 .000
Log10(takers) -143.383 8.272 -.950 -17.333 .000
expend 2.698 .620 .234 4.350 .000
years 12.833 5.265 .127 2.438 .019
a. Dependent Variable: sat
Excluded Variablesd
Collinearity
Partial Statistics
Model Beta In t Sig. Correlation Tolerance
1 income .078a .997 .324 .144 .648
years .157a 2.592 .013 .354 .960
public .048a .755 .454 .109 .980
expend .252a 4.488 .000 .548 .897
rank .221a 1.028 .309 .148 .086
2 income -.057b -.783 .438 -.115 .533
years .127b 2.438 .019 .338 .943
public -.014b -.254 .801 -.037 .916
rank .101b .546 .588 .080 .084
3 income -.051c -.726 .472 -.108 .532
public .056c .938 .353 .138 .727
rank .369c 1.939 .059 .278 .067
a. Predictors in the Model: (Constant), Log10(takers)
b. Predictors in the Model: (Constant), Log10(takers), expend
c. Predictors in the Model: (Constant), Log10(takers), expend, years
d. Dependent Variable: sat
These three stepwise methods will not necessarily lead to the same model. In addition, changes in the F
or P-to-enter and F or P-to-remove can result in more or fewer variables in the final model.
The SPSS stepwise regression procedure has some disadvantages. SPSS has no way of knowing that
some variables may be higher order terms that involve lower order terms. Therefore, it cannot enforce
the restriction that higher order terms cannot be added before the corresponding lower order terms
have been added, nor that lower order terms cannot be eliminated until all higher order terms involving
them have been eliminated (that is why I used the SAT data and not the Ozone data with higher order
terms in this example). SPSS also cannot treat the set of indicator variables corresponding to a
categorical variable as one set of variables that should all be added or eliminated at once.
However, SPSS does allow you to define blocks of explanatory variables which can be treated
differently in stepwise regression. Therefore, for the ozone data, where I wanted to look at adding
two-way interactions and quadratic terms, I defined Block 1 to be Wind, MaxTemp and SolarRad and
Block 2 to be all the two way interactions and quadratic terms. I also defined the “Method” for Block
1 to be “Enter”, which means these variables will be in the starting model and cannot be eliminated. I
also defined the “Method” for Block 2 to be “Stepwise”, which means these variables can be added or
eliminated. The P-to-enter and P-to-remove were the default values of .05 and .10, respectively.
Ozone data, case #17 deleted: stepwise regression; Wind, MaxTemp and SolarRad forced to be in the
model.
Coefficientsa
Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) .114 .226 .504 .615
Wind speed (mph) -.030 .006 -.308 -4.779 .000
Maximum temperature (F) .019 .002 .519 7.830 .000
Solar radiation (langleys) .001 .000 .245 4.248 .000
2 (Constant) .518 .260 1.992 .049
Wind speed (mph) -.096 .024 -.980 -4.040 .000
Maximum temperature (F) .018 .002 .489 7.534 .000
Solar radiation (langleys) .001 .000 .247 4.429 .000
Wind^2 .003 .001 .676 2.868 .005
a. Dependent Variable: Log10(Ozone)
Excluded Variables
Collinearity
Partial Statistics
Model Beta In t Sig. Correlation Tolerance
1 Wind^2 .676 2.868 .005 .270 .052
MaxTemp^2 1.929 2.454 .016 .233 .005
SolarRad^2 -.359 -1.453 .149 -.140 .050
WindTemp -.776 -2.049 .043 -.196 .021
WindSolar -.256 -1.258 .211 -.122 .074
TempSolar 1.198 2.371 .020 .225 .012
2 MaxTemp^2 1.431 1.789 .076 .173 .004
SolarRad^2 -.384 -1.606 .111 -.156 .050
WindTemp -.021 -.038 .969 -.004 .010
WindSolar -.117 -.572 .568 -.056 .069
TempSolar .933 1.846 .068 .178 .011
One significant problem with using the F statistic or P-value is that the addition and elimination of
variables is not based on a criterion for comparing models – the final model is not necessarily
“optimal” in any sense. Why not add or eliminate variables based on one of the measures considered
in the first part of this handout, such as AIC or BIC?
The stepAIC function in the MASS library of S-Plus does stepwise regression using AIC (or BIC) as
the criterion. In forward selection, it looks for the single variable which reduces AIC the most; if no
variable reduces AIC, then it stops. In backward elimination, the goal is the same: find the variable
whose elimination reduces AIC the most. If no variable reduces AIC when it's eliminated, then stop.
In stepwise using both directions, find the addition or deletion which reduces AIC the most. Using AIC
has the additional appeal of not having to set arbitrary criteria for entering and removing variables.
The stepAIC function also handles categorical variables and interactions properly: an interaction
cannot be added unless all the variables involved in the interaction have been added; similarly, a
variable cannot be eliminated unless all higher order interactions involving that variable have been
eliminated. Unfortunately, stepAIC does not handle quadratic terms properly.
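The forward half of this can be sketched in Python (an illustration of the idea, not the MASS implementation; data and variable names are simulated): at each step, fit every one-variable addition and add the one that lowers AIC most, stopping when no addition lowers AIC.

```python
import math, random

def ols_sse(X, y):
    # residual SS of the least-squares fit; X rows include the intercept 1
    n, k = len(X), len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         for i in range(k)]
    b = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    for col in range(k):                      # Gaussian elimination
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv], b[col], b[piv] = A[piv], A[col], b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * c for a, c in zip(A[r], A[col])]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for i in range(k - 1, -1, -1):
        beta[i] = (b[i] - sum(A[i][j] * beta[j]
                              for j in range(i + 1, k))) / A[i][i]
    return sum((y[r] - sum(X[r][j] * beta[j] for j in range(k))) ** 2
               for r in range(n))

def aic(sse, n, p):
    return n * math.log(sse / n) + 2 * p

def forward_aic(vars_, y):
    n, chosen = len(y), []
    cur = aic(ols_sse([[1.0]] * n, y), n, 1)  # constant-only model
    while True:
        cands = []
        for name in vars_:
            if name not in chosen:
                X = [[1.0] + [vars_[v][r] for v in chosen + [name]]
                     for r in range(n)]
                cands.append((aic(ols_sse(X, y), n, len(chosen) + 2), name))
        if not cands or min(cands)[0] >= cur:
            return chosen                     # no addition lowers AIC
        cur, best = min(cands)
        chosen.append(best)

random.seed(4)
n = 80
x = {v: [random.gauss(0, 1) for _ in range(n)] for v in ("x1", "x2", "x3")}
y = [1 + 2 * a - b + random.gauss(0, 1)
     for a, b in zip(x["x1"], x["x2"])]       # x3 is pure noise
sel = forward_aic(x, y)
print(sel)
```

Note there is no arbitrary F-to-enter cutoff here: the stopping rule falls out of the AIC criterion itself, which is the appeal mentioned above. Passing the penalty ln(n) in place of 2 (as stepAIC's k=log(n) option does) would turn this into BIC-based selection.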
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 948.4490 10.2140 92.8574 0.0000
Coefficients:
(Intercept) log(takers) expend years rank
399.1147 -38.1005 3.995661 13.14731 4.400277
Stepwise regression starting with the main effects model and allowing all two-way interactions.
> stepAIC(mfull,list(upper=~.^2,lower=~1))
Start: AIC= 311.88
sat ~ log(takers) + income + years + public + expend + rank
Coefficients:
(Intercept) log(takers) years public expend rank years:public
2590.556 19.42852 -134.2278 -26.43972 4.347684 5.991911 1.661026
log(takers):public
-0.5848999
Coefficients:
(Intercept) log(takers) years public expend rank years:public
3274.012 -34.05226 -164.8157 -33.8661 4.651103 5.040749 2.042115
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.26236 0.52033 0.50423 0.61515
wind -0.06931 0.01450 -4.77854 0.00001
temp 0.04445 0.00568 7.82953 0.00000
solar 0.00219 0.00052 4.24768 0.00005
Stepwise regression using AIC: start with the main effects model and allow all two-way interactions
and quadratic terms; “lower” specifies the lowest allowable model, which is the main effects model.
> stepAIC(m1,list(upper=~.^2+wind^2+temp^2+solar^2,lower=m1))
Start: AIC= -163.82
log(ozone) ~ wind + temp + solar
Coefficients:
(Intercept) wind temp solar I(wind^2)
2.7000915 -0.19764083 0.016191722 -0.0024656831 0.0059294158
I(solar^2) temp:solar
-0.000012334129 0.0001202964
> stepAIC(m1,list(upper=~.^2+wind^2+temp^2+solar^2,lower=m1),k=log(110))
Start: AIC= -153.02
log(ozone) ~ wind + temp + solar
Coefficients:
(Intercept) wind temp solar I(wind^2)
1.1932358 -0.22081888 0.041915712 0.0022096915 0.0068982286
For this data set, there are other possible objectives besides finding good models for predicting SAT
score.
For example:
“After accounting for the percentage of students who took the test (Log(TAKERS)) and the median
class rank of the test-takers (RANK), which variables are important predictors of state SAT scores?”

“After accounting for the percentage of students who took the test (TAKERS) and the median class
rank of the test-takers (RANK), which states performed best for the amount of money they spend?”
The second question could be answered in this way. First, fit the
regression model involving the TAKERS and RANK variables. What do
the resulting residuals tell us? The residuals are the difference
in the observed SAT scores and those predicted by the variables
TAKERS and RANK. A positive residual means the SAT score is
higher than predicted and a negative residual means it is lower than predicted based on these 2
variables. The states could then be ranked based on these residuals.