One goal is to find a model that gives the best prediction of Y, given X1, X2, . . . , Xk, for some new observation or for a batch of new observations. In practice, we
emphasize estimating the regression of Y on the X's (see Chapter 8), E(Y | X1, X2, . . . , Xk),
which expresses the mean of Y as a function of the predictors. Using this goal, we may say
that our best model is reliable if it predicts well in a new sample. The details of the model
may be of little or no consequence, such as the inclusion of any particular variable or the
magnitude or sign of its regression coefficient. For example, in considering a sample of
systolic blood pressures, the goal may be simply to predict blood pressure (SBP) as a
function of demographic variables like AGE, RACE, and GENDER. We may not
necessarily care which variables are in the model or how they are defined, as long as the final
model obtained gives the best prediction possible.
In addition to the question of prediction is the question of validity, that is, obtaining
valid (i.e., accurate) estimates for one or more regression coefficients in a model and
then making inferences about the corresponding parameters of interest. The goal here is
to quantify the relationship between one or more independent variables of interest and
the dependent variable, controlling for the other variables. As an example, we might
wish to describe the relationship of SBP to AGE, controlling for RACE and GENDER.
In this case, we are focusing on regression coefficients involving AGE (including functions
of AGE); even though RACE and GENDER may remain in the model for control purposes,
their regression coefficients are not of primary interest.
In this chapter we shall focus on strategies for selecting the best model when the
primary goal of analysis is prediction. In the last section we shall briefly mention a
strategy for modeling in situations where the validity of the estimates of one or more
regression coefficients is of primary importance (see also Chapter 11).
1 Hocking (1976) provided an exceptionally thorough and well-written review of this topic. His
presentation is at a technical level somewhat higher than that of this text.
16-2 Steps in Selecting the Best Regression Equation
In selecting the best regression equation, we recommend the following steps:
1. Specify the maximum model (defined in Section 16-3) to be considered.
2. Specify a criterion for selecting a model.
3. Specify a strategy for applying the criterion.
4. Conduct the specified analysis.
5. Evaluate the reliability of the model chosen.
By following the above steps, the fuzzy idea of finding the best predictors of Y can be
converted into simple, concrete actions. Each step helps to insure reliability and to
reduce the work required. Specifying the maximum model forces the analyst to (1) state
the analysis goal clearly, (2) recognize the limitations of the data at hand, and (3)
describe the range of plausible models explicitly. All computer programs for
selecting models require that a maximum model be specified. In doing so, the analyst
should consider all available scientific knowledge. In turn, specifying the criterion for
selecting a model and the strategy for applying that criterion simplifies and speeds the
analysis process. Finally, whether the goal is prediction or validity, the reliability of the
chosen model must be demonstrated. Each step will be considered separately below.
We shall illustrate the above process of analysis via two examples. The first
example will be one considered earlier in Chapter 8. The data (given in Table 8-1) are
the hypothetical results of measuring a response variable weight (WGT) and predictor
variables height (HGT) and age (AGE) on 12 children. Once the methods have been
illustrated with this simple example, similar real data for over 200 subjects (Lewis and
Taylor, 1967; discussed in Chapter 12) will be used for a case study of the methods.
Throughout this chapter we shall make the important assumption that the maximum model
with k variables or some restriction of it with p ≤ k variables is the correct model for the
population. An important implication of this assumption is that the population
squared multiple correlation for the maximum model, namely ρ²(Y | X1, X2, . . . , Xk), is no larger
than that for the correct model (which may have fewer variables); as a result, adding more
predictors to the correct model does not increase the population squared multiple correlation
for the correct model. In turn, for the sample at hand, R²(Y | X1, X2, . . . , Xk) for the
maximum model is always at least as large as that for any subset model. However,
other sample-based criteria (discussed below) may not necessarily suggest that the
largest model is best.
To illustrate, we consider the data from Table 8-1, reproduced below:

Child      1   2   3   4   5   6   7   8   9  10  11  12
Y (WGT)   64  71  53  67  55  58  77  57  56  51  76  68
X1 (HGT)  57  59  49  62  51  50  55  48  42  42  61  57
X2 (AGE)   8  10   6  11   8   7  10   9  10   6  12   9
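As a quick check on a dataset this small, the values can be entered and summarized directly. The following sketch (ours, in Python, not part of the original text) computes the sample mean of WGT and the total corrected sum of squares SSY, which should match the "Total" line of the ANOVA tables later in the chapter.

```python
import numpy as np

# Table 8-1 data: weight, height, and age for the 12 children
WGT = np.array([64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68], dtype=float)
HGT = np.array([57, 59, 49, 62, 51, 50, 55, 48, 42, 42, 61, 57], dtype=float)
AGE = np.array([8, 10, 6, 11, 8, 7, 10, 9, 10, 6, 12, 9], dtype=float)

# Total (corrected) sum of squares for the response:
# SSY = sum over i of (Y_i - Ybar)^2
SSY = float(((WGT - WGT.mean()) ** 2).sum())
print(round(WGT.mean(), 2), round(SSY, 2))  # prints: 62.75 888.25
```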
One possible model (not necessarily the maximum one) to consider is given by
Y = β0 + β1X1 + β2X2 + E
This model allows for only a linear (planar) relationship between the response (WGT)
and the two predictors (HGT and AGE). Nevertheless, the nature of growth suggests
that the relationship, although monotonic in both HGT and AGE, may well be
nonlinear. This implies that at least one quadratic (squared) term may need to be
included in the model. Since HGT and AGE are highly correlated, and since the sample
size is very small,2 only (AGE)² will be considered. Should the interaction between
AGE and HGT be considered? Should transformations of the predictors and/or the
response be considered (e.g., replacing AGE and (AGE)² with log(AGE))? The
limitations of the data lead us to define the maximum model as

Y = β0 + β1X1 + β2X2 + β3X3 + E, where X3 = (AGE)²
There are many reasons for choosing a large maximum model. The most important
one is the desire to avoid making a Type II (false negative) error. In a regression
analysis, a Type II error corresponds to omitting a predictor that has a truly nonzero
regression coefficient in the population. Other reasons for considering a large
maximum model are the wishes to include:
2 Because the sample size is so small, it all but precludes drawing reliable conclusions from any regression
analysis. The small sample size is used here to simplify our discussion.
Recall that overfitting a model (including variables in the model with truly zero
regression coefficients in the population) will not introduce bias when estimating
population regression coefficients if the usual regression assumptions are met. However,
we must be careful that overfitting does not introduce harmful collinearity (Chapter 12).
Also, underfitting (i.e., leaving important predictors out of the final model) will
introduce bias in the estimated regression coefficients.
There are also important reasons for working with a small maximum model. With a
prediction goal, the need for reliability (to be discussed later) strongly argues for a
small maximum model, and with a validity goal, we want to focus on a few important
variables. In either case, we wish to avoid a Type I (false positive) error. In a regression
analysis, a Type I error corresponds to including a predictor that has a zero population
regression coefficient. The desire for parsimony is another important reason for
choosing a small maximum model. Practically unimportant but statistically significant
predictors can greatly confuse the interpretation of regression results; complex
interaction terms are particularly troublesome in this regard.
df error = n - k - 1 > 0
which is equivalent to the constraint
n > k + 1
The question then arises as to how many error degrees of freedom are needed.
Simple rules can be suggested to insure that the estimates for a single model are
reliable, but these rules are inadequate when considering a series of models. Later in
this chapter split-sample approaches will be offered as possible ways to assess
reliability.
One rule of thumb is to require at least 10 error degrees of freedom, namely,

n − k − 1 ≥ 10

or

n ≥ 10 + k + 1
Another rule of thumb for regression that has been suggested is to have at least 5 (or
10) observations per predictor; namely, require that

n ≥ 5k (or n ≥ 10k)
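The sample-size rules of thumb above are easy to tabulate. The helper below (an illustrative sketch, not from the text) returns the minimum n implied by each rule for a maximum model with k predictors.

```python
def min_sample_sizes(k):
    """Minimum sample size n implied by each rule of thumb, for k predictors."""
    return {
        "error df positive (n > k + 1)": k + 2,
        "at least 10 error df (n - k - 1 >= 10)": k + 11,
        "5 observations per predictor (n >= 5k)": 5 * k,
        "10 observations per predictor (n >= 10k)": 10 * k,
    }

# For the child-weight example with k = 3 candidate predictors:
sizes = min_sample_sizes(3)  # 5, 14, 15, and 30, respectively
```

With n = 12 children, even the mildest rule beyond a positive error df is barely met, which is why the text cautions that this sample is too small for reliable conclusions.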
Similarly, consider the variables GENDER, RACE (white = 0, black = 1), and their
interaction (GENRACE = GENDER x RACE). If no black females are represented in
the data, then the 1-degree-of-freedom interaction effect cannot be estimated. If a race-sex
cell is nearly empty, then the estimated interaction coefficient may be very unstable
(i.e., have a large variance).
If the collinearity problem cannot be overcome, then one should probably either (1)
conduct separate analyses for each form of X (for example, one analysis using X and
then a separate analysis using exp(X)), (2) eliminate some variables, or (3) impose
structure (e.g., a fixed order of tests) on the testing procedure. If nothing is done about
severe collinearity, then the estimated regression coefficients in the best model may be
highly unstable (i.e., have high variances) and possibly be quite far from the true parameter
values.
Obviously, the selection criterion should be related to the goal of the analysis.
For example, if the goal is reliable prediction of future observations, the selection
criterion should be somewhat liberal to avoid missing useful predictors.
(This leads us to note the distinctions among (1) numerical differences, (2)
statistically significant differences, and (3) scientifically important differences.
Numerical differences may or may not correspond to significant or important
differences. In turn, with sensitive tests (typical of analyses of large samples),
significant differences may or may not correspond to important differences.
Finally, with insensitive tests (typical of analyses of very small samples),
important differences may not be significant.)
Many selection criteria for choosing the best model have been suggested. Hocking
(1976), for example, reviewed eight candidates. We shall consider four criteria: R²p, Fp,
MSE(p), and Cp. Each will be defined and discussed separately below.
Before discussing these criteria, we must define the notation needed to understand
them. All four criteria attempt to compare two model equations: the maximum model
with k predictors and a restricted model with p predictors (p ≤ k). The maximum model
is

Y = β0 + β1X1 + β2X2 + · · · + βkXk + E
Let SSE(k) be the error sum of squares for the k-variable model, SSE(p) the error
sum of squares for the p-variable model, and so on. Also, SSY = Σi=1..n (Yi − Ȳ)² is the
total (corrected) sum of squares for the response Y. Often p = k − 1, which is the
case when evaluating the addition or deletion of a single variable. We shall assume
that the k − p variables under consideration for addition or deletion are denoted Xp+1, Xp+2,
. . . , Xk for notational convenience.
R²p = R²(Y | X1, X2, . . . , Xp) = 1 − SSE(p)/SSY
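In code, R²p is a one-liner. Using the error and total sums of squares reported later in the chapter for the two-variable child model (SSE = 195.43, SSY = 888.25), this sketch (ours) reproduces R² ≈ .78.

```python
def r_squared_p(sse_p, ssy):
    """R^2_p = 1 - SSE(p)/SSY."""
    return 1.0 - sse_p / ssy

# Two-variable child model (HGT and AGE): SSE = 195.43, SSY = 888.25
r2 = r_squared_p(195.43, 888.25)  # approximately .78
```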
Another reasonable criterion for selecting the best model is the F test statistic for
comparing the full and restricted models. This statistic, Fp, can be expressed in terms of sums
of squares for error (SSE's) as

Fp = [(SSE(p) − SSE(k))/(k − p)] / [SSE(k)/(n − k − 1)]
The third criterion to be considered for selecting the best model is the estimated error
variance for the p-variable model, namely,

MSE(p) = SSE(p)/(n − p − 1)    (16.5)
In considering one particular model, we have earlier symbolized this estimate as S2. The
quantity MSE(p) is an inviting choice for a selection criterion since we wish to find a model
with small residual variance.
A less obvious candidate for a selection criterion involving SSE(p) is Mallows's Cp,
namely,

Cp = SSE(p)/MSE(k) − [n − 2(p + 1)]    (16.6)
The Cp criterion helps us to decide how many variables to put in the best model, since it
achieves a value of approximately p + 1 if MSE(p) is roughly equal to MSE(k) (i.e., if the
correct model is of size p). Knowing the correct model size greatly aids in choosing the
best model.
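All four criteria can be computed directly from least-squares fits. The sketch below (ours, not the book's code) uses the 12-child data, takes the three-variable model with HGT, AGE, and (AGE)² as the maximum model (k = 3), and evaluates the two-variable candidate with p = 2. Note that, by algebra, Cp for the maximum model itself always works out to k + 1.

```python
import numpy as np

WGT = np.array([64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68], dtype=float)
HGT = np.array([57, 59, 49, 62, 51, 50, 55, 48, 42, 42, 61, 57], dtype=float)
AGE = np.array([8, 10, 6, 11, 8, 7, 10, 9, 10, 6, 12, 9], dtype=float)

def sse(y, predictors):
    """Error sum of squares for a least-squares fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

n, k, p = len(WGT), 3, 2
SSY = float(((WGT - WGT.mean()) ** 2).sum())
sse_k = sse(WGT, [HGT, AGE, AGE ** 2])      # maximum model (k = 3)
sse_p = sse(WGT, [HGT, AGE])                # candidate model (p = 2)
mse_k = sse_k / (n - k - 1)

r2_p  = 1 - sse_p / SSY                     # R^2_p
mse_p = sse_p / (n - p - 1)                 # MSE(p), equation (16.5)
f_p   = ((sse_p - sse_k) / (k - p)) / mse_k # F_p
c_p   = sse_p / mse_k - (n - 2 * (p + 1))   # Mallows C_p, equation (16.6)
```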
The criteria Fp, R²p, MSE(p), and Cp are intimately related. For example, the Fp test can
be expressed in terms of multiple squared correlations as follows:

Fp = [(R²k − R²p)/(k − p)] / [(1 − R²k)/(n − k − 1)]    (16.7)
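Equation (16.7) is algebraically identical to the SSE form of Fp (divide the numerator and denominator of the SSE form through by SSY). A quick numerical sketch with arbitrary illustrative values (ours, not from Table 16-1) confirms the equivalence.

```python
def f_p_from_sse(sse_p, sse_k, n, k, p):
    """F_p in terms of error sums of squares."""
    return ((sse_p - sse_k) / (k - p)) / (sse_k / (n - k - 1))

def f_p_from_r2(r2_p, r2_k, n, k, p):
    """F_p in terms of squared multiple correlations, equation (16.7)."""
    return ((r2_k - r2_p) / (k - p)) / ((1.0 - r2_k) / (n - k - 1))

# Arbitrary illustrative values
ssy, sse_p, sse_k, n, k, p = 888.25, 195.43, 190.0, 12, 3, 2
via_sse = f_p_from_sse(sse_p, sse_k, n, k, p)
via_r2  = f_p_from_r2(1 - sse_p / ssy, 1 - sse_k / ssy, n, k, p)
```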
Once all 2ᵏ − 1 models have been fitted, one would then assemble the fitted models
into sets involving from 1 to k variables and then order the models within each set according
to some criterion (e.g., 𝑅𝑝2 , Fp, MSE(p), or Cp).
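An all-possible-regressions sweep takes only a few lines with itertools. This sketch (ours) enumerates the 2³ − 1 = 7 subset models for the child data and records R²p for each, so the leader of each model size can be read off.

```python
from itertools import combinations
import numpy as np

WGT = np.array([64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68], dtype=float)
preds = {
    "HGT": np.array([57, 59, 49, 62, 51, 50, 55, 48, 42, 42, 61, 57], dtype=float),
    "AGE": np.array([8, 10, 6, 11, 8, 7, 10, 9, 10, 6, 12, 9], dtype=float),
}
preds["AGE2"] = preds["AGE"] ** 2  # the quadratic term X3 = (AGE)^2

SSY = float(((WGT - WGT.mean()) ** 2).sum())

def r2(names):
    """R^2_p for the subset model using the named predictors."""
    X = np.column_stack([np.ones(len(WGT))] + [preds[v] for v in names])
    beta, *_ = np.linalg.lstsq(X, WGT, rcond=None)
    r = WGT - X @ beta
    return 1.0 - float(r @ r) / SSY

# All 2^3 - 1 = 7 nonempty subsets, grouped by size
results = {names: r2(names)
           for size in (1, 2, 3)
           for names in combinations(sorted(preds), size)}
best_per_size = {size: max((m for m in results if len(m) == size), key=results.get)
                 for size in (1, 2, 3)}
```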
For our data, a summary of the results of this all-possible-regressions procedure is given
in Table 16-1. From the table, the leaders (in terms of R²p-values) in each of the sets involving
one, two, and three variables are given as follows:
Other aspects of Table 16-1 deserve comment. Consider the partial F statistics. For
a given variable in a given model, the associated partial F statistic assesses the
contribution made by that variable to the prediction of Y (WGT) over and above the
contributions made by other variables already in the given model. For example, for
model 4, which involves only X1 and X2, the partial F for X2 is F(X2 | X1) = 4.785; but
for model 7, which includes X1, X2, and X3, the partial F for X2 is F(X2 | X1, X3) =
0.140.
These partial F's are variables-added-last tests, each based on the MSE for the corre-
sponding model being fit for that row. Such tests must be treated with some caution. Any
test based on a model with fewer terms than the correct model will be biased, perhaps
substantially, because the test involves the use of biased error terms. Furthermore, so many
tests are computed that the Type I error rate should be higher than the nominal rate 𝛼.
Overall F tests are also affected by the estimate of the error variance used. In
variable selection, this estimate is important because biased tests conducted early in a
stepwise algorithm may stop the procedure prematurely and miss important predictors. As
calculated in Table 16-1, each overall F statistic is based on the corresponding MSE
listed in the same row. In contrast, if the MSE for the largest model, number 7, were used
in the denominator of each overall F test, each resulting test statistic would be an Fp
statistic. Such a statistic would involve a comparison of the R²p-value for a reduced
model (models 1 through 6) and the R²k-value for the largest model (model 7). For
example, for model 4,
In general, such a test is a multiple-partial F test. The small F-value here indicates that the
predictive abilities of the maximum model and model 4 do not differ significantly.
The minimum possible value of Cp is 2p − k + 1, which equals 2 when k = 3 and p = 2. (Note from (16.8) that this lower bound for Cp
is attained when Fp = 0, and that it can be negative.) Such a value is better than the
value of p + 1 = 3.
Step 2. Calculate the partial F statistic for every variable in the model as though it were
the last variable to enter (see Table 16-1).
Variable   Partial F*
HGT          6.827**
AGE          0.140
(AGE)²       0.010

* Based on 1 and 8 df.
(Recall that the partial F statistics above test whether the addition of the last variable to the
model significantly helps in predicting the dependent variable given that the other variables
are already in the model.)
Step 3. Focus on the lowest observed partial F test value (FL, say). In our example, FL =
0.010 for the variable (AGE)2.
Step 4. Compare this value FL with a preselected critical value (Fc, say) of the F
distribution (i.e., test for the significance of the partial FL). (a) If FL < Fc, remove from
the model the variable under consideration, recompute the regression equation for the
remaining variables, and repeat steps 2, 3, and 4. (b) If FL > Fc, adopt the complete
regression equation as calculated. In our example, if we work at the 10% level, FL =
0.010 < F1,8,0.90 = 3.46. Therefore, we remove (AGE)² from the model and recompute the equation using only
HGT and AGE. We then obtain
WGT = 6.553 + 0.722HGT + 2.050AGE with ANOVA table
Source       df    SS       MS       F        R²
Regression    2   692.82   346.41   15.95**  .7800
Residual      9   195.43    21.71
Total        11   888.25
With (AGE)2 out of the picture, the partial F's become 7.665 for HGT and 4.785 for
AGE (see Table 16-1). Our new Fc = F1,9,0.90 = 3.36, which is less than 4.785. Therefore,
the partial F for AGE is significant and we stop here with this model, which is the same
model that we arrived at using the all-possible-regressions procedure.
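Steps 2 through 4 can be automated. The sketch below (our illustration, not the book's code) repeatedly computes variables-added-last partial F statistics via least squares and deletes the weakest variable until every partial F exceeds a caller-supplied critical value. A single fixed Fc is used here for simplicity; a fuller implementation would look up the df-specific F quantile (e.g., via scipy) at each step.

```python
import numpy as np

def fit_sse(y, cols):
    """Error sum of squares for a least-squares fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def backward_eliminate(y, preds, f_crit):
    """preds: dict name -> array. Delete the smallest partial F while it is < f_crit."""
    current = dict(preds)
    while current:
        n, p = len(y), len(current)
        sse_full = fit_sse(y, list(current.values()))
        mse_full = sse_full / (n - p - 1)
        # variables-added-last partial F for each variable now in the model
        partial_f = {
            name: (fit_sse(y, [v for m, v in current.items() if m != name]) - sse_full)
                  / mse_full
            for name in current
        }
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] >= f_crit:
            break
        del current[weakest]
    return set(current)

WGT = np.array([64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68], dtype=float)
HGT = np.array([57, 59, 49, 62, 51, 50, 55, 48, 42, 42, 61, 57], dtype=float)
AGE = np.array([8, 10, 6, 11, 8, 7, 10, 9, 10, 6, 12, 9], dtype=float)
kept = backward_eliminate(WGT, {"HGT": HGT, "AGE": AGE, "AGE2": AGE ** 2}, f_crit=3.4)
```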
Step 1. Select as the first variable to enter the model that variable most highly
correlated with the dependent variable, and then fit the associated straight-line
regression equation. For our example we have the following correlations: rYX1 =
.814, rYX2 = .770, and rYX3 = .767. Thus, the first variable to enter is X1 = HGT.
If the F statistic in the table is not significant, we stop and conclude that no
independent variables are important predictors. If the F statistic is significant (as it
is), we include this variable (in our case, HGT) in the model and proceed to step 2.
Step 2. Calculate the partial F statistic associated with each remaining
variable based on a regression equation containing that variable and the
variable initially selected. For our example,
partial F(X2 | X1) = 4.785
partial F(X3 | X1) = 4.565
Step 3. Focus on the variable with the largest partial F statistic. For our data,
the variable AGE has the largest partial F (4.785).
Step 4. Test for the significance of the partial F statistic associated with the
variable selected in step 3. (a) If this test is significant, add the new variable to the
regression equation. (b) If this test is not significant, use in the model only the
variable added in step 1. For our example, since the partial F for AGE is
significant at the 10% level (F1,9,0.90 = 3.36), we add AGE to get the two-variable
model
Step 5. At each subsequent step, determine the partial F statistics for those
variables not yet in the model and then add to the model that variable which has
the largest partial F value if it is statistically significant. At any step, if the largest
partial F is not significant, no more variables are included in the model and the
process is terminated. For our example we have already added HGT and AGE to
the model. We now see if we should add (AGE)². The partial F for (AGE)²,
controlling for HGT and AGE, is given by

partial F(X3 | X1, X2) = 0.010
This value is not statistically significant, since F1,8,0.90 = 3.46. Again, we have arrived at the
same two-variable model chosen via the previously discussed methods.
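Forward selection admits a similar sketch (again ours, with a single fixed critical value standing in for the df-specific F quantiles): at each step, compute the partial F for every variable not yet in the model, add the largest if it beats the critical value, and stop otherwise.

```python
import numpy as np

def fit_sse(y, cols):
    """Error sum of squares for a least-squares fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def forward_select(y, preds, f_crit):
    """Add, one at a time, the candidate with the largest significant partial F."""
    chosen, remaining, n = [], dict(preds), len(y)
    while remaining:
        sse_base = fit_sse(y, [preds[m] for m in chosen])
        scores = {}
        for name, col in remaining.items():
            sse_new = fit_sse(y, [preds[m] for m in chosen] + [col])
            mse_new = sse_new / (n - (len(chosen) + 1) - 1)
            scores[name] = (sse_base - sse_new) / mse_new
        best = max(scores, key=scores.get)
        if scores[best] < f_crit:
            break
        chosen.append(best)
        del remaining[best]
    return chosen

WGT = np.array([64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68], dtype=float)
HGT = np.array([57, 59, 49, 62, 51, 50, 55, 48, 42, 42, 61, 57], dtype=float)
AGE = np.array([8, 10, 6, 11, 8, 7, 10, 9, 10, 6, 12, 9], dtype=float)
order = forward_select(WGT, {"HGT": HGT, "AGE": AGE, "AGE2": AGE ** 2}, f_crit=3.4)
```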
For our example the first step, as in the forward selection procedure, would be to
add the variable HGT to the model, since it has the highest significant correlation with
WGT. Next, as before, we would add AGE to the model, since it has a higher significant
partial correlation with WGT than does (AGE)², controlling for HGT. Now, before
testing to see if (AGE)² should also be added to the model, we look at the partial F of
HGT given that AGE is already in the model to see if we should now remove HGT. This
partial F is given by F(X1 | X2) = 7.665 (see Table 16-1), which exceeds F1,9,0.90 = 3.36.
Thus we do not remove HGT from the model. Next we check to see if we need to add
(AGE)²; of course, the answer is no, since we have dealt with this situation before.
The analysis-of-variance table that best summarizes the results obtained for our example
is thus as follows:
Source                df    SS       MS       F                R²
Regression {X1}        1   588.92   588.92   19.67**
           {X2|X1}     1   103.90   103.90    4.79 (P < .10)  .7800
Residual               9   195.43    21.71
Total                 11   888.25
Since almost all model selection methods involve using the data to generate
hypotheses for further study (often called "exploratory data analysis"), P-values and
other variable selection criteria must be utilized cleverly. For example, model selection
programs often allow specifying a P-value to help decide whether a variable is to be
considered significant enough to be included in a model. It is very helpful to specify the
"P-to-enter" to be 1.00 for forward selection. This guarantees that the process will go as
far as possible, thereby providing the maximum information for choosing the best model.
Similarly, specifying a value of 0.00 for the P-to-leave guarantees that a backward
elimination process will remove all variables.
16-5-6 Chunkwise Methods
The methods for selecting single variables just described can be generalized in a
very useful way. The basic idea is that any selection method in which a single variable is
added or deleted can be generalized to adding or deleting a group of variables. Consider,
for example, using backward elimination to build a model for which the response
variable is blood cholesterol level. Assume that three groups of predictors are available:
demographic (gender, race, age, and their pairwise interactions), anthropometric (height,
weight, and their interaction), and diet recall (amounts of five food types); hence, there
is a total of 6 + 3 + 5 = 14 predictors. The three groups of variables constitute chunks,
sets of predictors that are logically related and of equal importance (within a chunk) as
candidate predictors.
Several possible chunkwise testing methods are available. Choosing among them
depends on (1) the analyst's preference for backward, forward, or other selection
strategies and (2) the extent to which the analyst can logically group variables (i.e.,
form chunks) and then order the groups in importance. In many applications, an a priori
order exists among the chunks. For this example, the researcher may wish to consider
diet variables only after controlling for important demographic and anthropometric
variables. Imposing order in this fashion helps simplify the analysis, typically increases
reliability, and increases the chance of finding a scientifically plausible model. Hence,
whenever possible, we recommend imposing an order on chunk selection.
(If an a priori order among chunks exists, then that order determines
which chunk is considered first for deletion. Otherwise, it is typical
practice to choose first that chunk making the least important predictive
contribution (e.g., having the smallest multiple-partial F statistic).)
Thus, in our example, all of the demographic and anthropometric variables are forced to
remain in the model. If, for example, the chunk multiple-partial F test for the set of diet
variables is not significant, then the entire chunk can be deleted. If this test is significant,
then at least one of the variables in this diet chunk should be retained.
Of course, the simplest chunkwise method adds or deletes all variables in a chunk
together. However, a more sensitive approach is to manipulate single variables within a
significant chunk while keeping the other chunks in place. If we assume the diet chunk to
be important, then we must decide which of the diet variables to retain as important
predictors. A reasonable second step here would then be to require all demographic
variables and the important individual diet variables to be retained, while considering the
(second) chunk of anthropometric variables for deletion. The final step for this three-chunk
example would require the individual variables selected from the first two chunks
to remain in the model, while variables in the third (demographic) chunk are candidates
for deletion. Forward and stepwise single-variable selection methods can also be
generalized for use in chunkwise testing.
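A chunk test compares the model with and without the whole group of variables. This small helper (an illustration with made-up numbers, not values from the text) computes the multiple-partial F statistic for deleting a g-variable chunk from a model with k predictors.

```python
def chunk_partial_f(sse_reduced, sse_full, g, n, k):
    """Multiple-partial F for deleting a chunk of g variables from a k-predictor model.

    sse_reduced: error SS with the chunk removed; sse_full: error SS with it included.
    """
    mse_full = sse_full / (n - k - 1)
    return ((sse_reduced - sse_full) / g) / mse_full

# Hypothetical values: n = 100 subjects, k = 14 predictors, a 5-variable diet chunk
f_stat = chunk_partial_f(sse_reduced=300.0, sse_full=200.0, g=5, n=100, k=14)
```

The statistic would be referred to an F distribution with g and n − k − 1 degrees of freedom.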
Chunkwise methods for selecting variables can have substantial advantages over
single-variable selection methods. First, chunkwise methods effectively incorporate into
the analysis prior scientific knowledge and preferences about sets of variabl es. Second,
the number of possible models to be evaluated is reduced. If a chunk test is not
significant and the entire chunk of variables is deleted, then clearly no tests on individual
variables in that chunk are carried out. In many situations, such testing for group (or
chunk) significance can be more effective and reliable than testing variables one at a
time.
The most compelling way to assess the reliability of a chosen model is to conduct a new
study and test the fit of the chosen model to the new data. However, this approach is usually
expensive and sometimes intractable. The question then arises as to whether a single study
can achieve the two goals of finding the best model and assessing its reliability.
The above approach for assessing reliability is far too stringent when prediction is
the goal. More realistically, we suggest that a very good predictive model should (1)
predict as well in any new sample as it does in the sample at hand and (2) pass all
regression diagnostic tests for model adequacy applied to any new sample. These
comments lead us to consider a second approach to a split-sample analysis for assessing
reliability.
This second approach attempts to answer the question "Will the chosen model
predict well in a new sample?" This approach addresses a more modest, yet very
desirable, goal, as opposed to the former approach. With this second approach one
begins by conducting a model-building process using the data for the training group.
Suppose that the fitted prediction equation obtained is
For the children's weight example, this equation took the form
WGT = 6.553 + 0.722HGT + 2.050AGE
Next, use the estimated prediction equation, which is based only on the training group data,
to compute predicted values for this group. Denote this set of predicted values as {Ŷi1}, i =
1, 2, . . . , n1, with the subscript i indexing the subject and the subscript 1 denoting the
training group. For this training sample, let R²1
denote the sample squared multiple correlation, which equals the sample squared univariate
correlation between the observed and predicted response values.
Next, use the prediction equation for the training group to compute predicted values
for the holdout (or "validation") sample. Denote this set of predicted values as {Ŷi2}, i =
1, 2, . . . , n2. The correlation between the observed and predicted responses in the holdout
sample is called the "cross-validation correlation," and the drop from the training-sample
squared multiple correlation to the square of the cross-validation correlation is called the
shrinkage.
Using only half of the data to estimate the prediction equation appears rather wasteful. If
the shrinkage is small enough to indicate reliability of the model, then it seems reasonable to
pool the data and calculate pooled estimates of the regression coefficients.
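The split-sample computation can be sketched as follows (with synthetic data of our own, not the chapter's): fit on the training half only, predict both halves, and compare the training-sample R² with the squared cross-validation correlation; the difference is the shrinkage.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = 2.0 * x + rng.normal(scale=0.5, size=40)

# Split into training (group 1) and holdout (group 2) halves
x1, y1, x2, y2 = x[:20], y[:20], x[20:], y[20:]

# Fit the prediction equation on the training half only
X1 = np.column_stack([np.ones(20), x1])
beta, *_ = np.linalg.lstsq(X1, y1, rcond=None)

yhat1 = X1 @ beta                                 # predictions for training half
yhat2 = np.column_stack([np.ones(20), x2]) @ beta # predictions for holdout half

r2_train = np.corrcoef(y1, yhat1)[0, 1] ** 2      # training-sample R^2
cv_corr = np.corrcoef(y2, yhat2)[0, 1]            # cross-validation correlation
shrinkage = r2_train - cv_corr ** 2
```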
Depending upon the situation, using only half of the data for the training sample
analysis may be inadvisable. A useful rule of thumb is to increase the relative size of the
training sample as the total sample size decreases and to decrease the relative size of the
training sample as the total sample size increases. To illustrate this, consider two
situations. In the first situation, data from over 3,000 subjects were available for
regression analysis. The maximum model included fewer than 20 variables, consisting of
a chunk of demographic control variables and a chunk of pollution exposure variables.
The primary goal of the study was to test for the presence of a relationship between the
response variable (a measure of pulmonary function) and the pollution variables,
controlling for demographic effects. To choose the best set of demographic cont rol
variables, a 10% stratified random sample (n ≈ 300) was used as a training sample.
The second example concerned a study of the relationship between amount of body fat
measured by X-ray methods (CAT scans) and traditional measures such as skinfold thickness.
The primary goal was to provide an equation to predict the X-ray-measured body fat from
demographic information (gender and age) and simple anthropometric measures (body
weight, height, and three readings of skinfold thickness). The maximum model
contained fewer than 20 predictors. Data from approximately 200 subjects were
available. These data comprised nearly one year's collection from the hospital X-ray
laboratory and demanded substantial human effort and computer processing to extract.
Consequently, approximately 75% of the data were used as a training sample. These
two examples illustrate the general concept that the splitting proportion should be
tailored to the problem under study.
The third step is to choose a strategy for selecting variables. We prefer backward
elimination methods, with as much structure imposed on the process as possible. As
mentioned earlier, one procedure is to group the variables into chunks and then to order
the chunks by degree of importance. However, we shall not consider a chunkwise
strategy here.
After the seemingly best model is chosen, the fourth step is to evaluate the
reliability of the model. Since the goal here is to produce a reliable prediction equation,
this step is especially important. To implement a split-sample approach to assessing
reliability, the subjects were randomly assigned to either data set ONE (the training
group) or TWO (the holdout group). Random splitting was sex specific. The table below
summarizes the observed split-sample subject distribution:
TABLE 16-2
Descriptive statistics for sample ONE (n = 118)
a Equals AGE − 13.59 (the sample ONE mean age).
b Equals HGT − 61.27 (the sample ONE mean height).
Sample ONE will be used to build the best model. Then sample TWO will be used
to compute cross-validation correlation and shrinkage statistics.
16-8-1 Analysis of Sample ONE
Table 16-2 provides descriptive statistics for sample ONE, and Table 16 -3
provides the matrix of intercorrelations. To avoid computational problems, the
predictor variables were centered (see Table 16 -2 and Chapter 12).
Table 16-4 summarizes the backward elimination analysis using sample ONE.
For sample ONE, SSY = 35,700.6949 and MSE = 120.0027676 for the full model.
Table 16-4 specifies the order of variable elimination, the variables in the best
model for each possible model size, and C p and Ry for each such model. Since the
maximum model includes k = 13 variables, the top row corresponds to p = 13, for
which Ri, = .65042. Since the maximum model always has C p = k + 1, the C p -value
of 14 provides no information regarding the best model size. The second row
corresponds to a p = 12 variable model. The variable HT3 (listed in the preceding row) has
been deleted, giving Cp = 12. Similarly, the bottom row tells us that the best single-variable
model includes HGT (C p = 6.72, = .59422), the best two variable model includes
HGT and AGE3 (C p = 0.52, RP = .62177), and so on. In general, for row p in the table,
the variable in that row and all variables listed in rows below it are included in the best
p-variable model, while all variables listed in rows above row p are excluded.
An examination of Table 16-4 does not suggest any obvious choice for a best
model. The maximum R² is about .650, and the minimum is around .594, which is a
small range in which to operate. This indicates that many different-sized models
predict about equally well. This is not an unusual occurrence with moderate-to-strong
predictor intercorrelations, say, above .50 in absolute value (see Table 16-3). Since
R²7 = .650 and R²13 = .650, it seems unreasonable to use more than seven predictors.
However, when using only R²p, it is difficult to decide whether a model containing
fewer than seven predictors is appropriate.
In contrast, the Cp statistic suggests that only two variables are needed (recall that
Cp should be compared with the value p + 1). Observe that Cp is greater than p + 1 only
when p = 1, which indicts only the one-variable model in Table 16-4. Unfortunately,
the p = 2 model with HGT and AGE3 as predictors is not appealing since the linear and
quadratic terms AGE and AGE2 are not included. As discussed in detail in Chapter 13,
we strongly recommend including such lower-order terms in polynomial-type
regression models.
An apparent nonlinear relationship may indicate the need for transforming the
response and/or predictor variables. Since quadratic and cubic terms are included as
candidate predictors, log(WGT) and (WGT)^(1/2) can be tried as alternative response
variables. Such transformations can sometimes linearize a relationship (see Chapter
12). Using all 13 predictors, R² = .668 for the log transformation and .659 for the
square root transformation. Because neither transformation produces a substantial gain
in R², they will not be considered further.
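The transformation comparison above can be sketched as follows. The data are simulated stand-ins for the book's example, and one caveat applies: R² values computed on different response scales are only a rough guide, since the total sums of squares differ.

```python
# Comparing R^2 for raw, log, and square-root transformed responses.
# Simulated illustrative data, not the book's sample ONE data.
import numpy as np

def r2(y, X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return 1.0 - (r @ r) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(2)
n = 40
hgt = rng.uniform(45, 60, n)
age = rng.uniform(6, 10, n)
# Response generated log-linearly, so log(WGT) should fit well here
wgt = np.exp(0.02 * hgt + 0.05 * age + rng.normal(0, 0.05, n))

X = np.column_stack([np.ones(n), hgt, age])
for name, y in [("WGT", wgt), ("log(WGT)", np.log(wgt)),
                ("sqrt(WGT)", np.sqrt(wgt))]:
    print(f"{name:>9}: R^2 = {r2(y, X):.3f}")
```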
In additional exploratory analyses of the sample ONE data, models can be fitted in
two fixed orders: an interaction ordering (Table 16-5) and a power ordering (Table 16-6).
One can interpret Tables 16-5 and 16-6 similarly to Table 16-4; each row gives p,
the number of variables in the model, and the associated Cp- and Rp²-values. All variables on or
below line p are included in the fitted model, and all above line p are excluded.
In both tables, certain values of p are natural break points for model
evaluation. For example, in Table 16-5, p = 3 corresponds to including just the
linear terms of the predictors HGT, AGE, and MALE. The fact that C3 = 4.98 > 3 + 1
leads us to consider larger models. Including the pairwise interactions MALE ×
HGT and MALE × AGE gives p = 5 and C5 = 5.36 < 5 + 1, which suggests that
this five-variable model is reasonable. Notice that R5² = .626, and that each further
variable addition increases R² very little. In fact, the best model could be of size p
= 4, since C4 = 3.60 and R² is not appreciably reduced. Taken together, these
comments suggest choosing a four-variable model containing HGT, AGE, MALE,
and MALE × HGT, with R4² = .625 and C4 = 3.60.
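The break-point reasoning above amounts to a simple screening rule: among nested models fitted in a fixed order, flag the smallest p whose Cp falls at or below p + 1. A small helper encoding this rule, using the (p, Cp) pairs quoted from Table 16-5 in the text:

```python
# Screening rule: smallest p with Cp <= p + 1 among nested models.
def first_adequate(cp_by_p):
    """Return the smallest p with Cp <= p + 1, or None if none qualifies."""
    for p, cp in sorted(cp_by_p.items()):
        if cp <= p + 1:
            return p
    return None

table_16_5 = {3: 4.98, 4: 3.60, 5: 5.36}  # values quoted in the text
print(first_adequate(table_16_5))          # → 4
```

The rule picks p = 4 here, agreeing with the four-variable model suggested above; as the text notes, Cp alone rarely settles the choice, and R² and subject-matter knowledge enter as well.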
In this example, then, two possible models have been suggested, both with roughly the
same R²-value. Personal preference may be exercised within the constraints of
parsimony and scientific knowledge. Here, we prefer the five-variable model with
HGT, AGE, MALE, HT2, and AGE2 as predictors. We choose this model as best
because we expect growth to be nonlinear and because we prefer to avoid using
interaction terms. The chosen model is thus of the general form

WGT = β0 + β1(HGT) + β2(AGE) + β3(MALE) + β4(HGT)² + β5(AGE)² + E

and a predicted weight for the ith subject may be computed from the fitted version of
this equation.
Recall that MALE has a value of 1 for boys and 0 for girls. For males, the estimated
intercept is 100.9 − 3.153(1) = 97.747, while for females it is 100.9 − 3.153(0) =
100.9. For these particular data, then, the fitted model predicts that, on average, girls
outweigh boys of the same height and age by about 3 pounds. After a model has been
selected, it is necessary to consider residual analysis and other regression diagnostic
procedures (see Chapter 12). It can be shown that some large residuals are present, but
they do not justify a more complex model. Since they turn out not to be influential, they
need not be deleted to improve estimation accuracy.
When compared with Table 16-4, Table 16-8 illustrates the typical instability of
stepwise methods for selecting variables. Based on a backward elimination strategy,
almost no variable is in the same place as in the analysis for sample ONE (Table 16-4).
Validation residuals can then be computed, where only sample TWO subjects are used
and where the sample ONE equation is used to compute the predicted values. These
"unstandardized residuals" can be subjected to various residual analyses. The most
helpful entails univariate descriptive statistics, a box-and-whisker plot, and a plot of
the difference scores (Yi − Ŷi) versus the predicted values (Ŷi) (such plots are
described in Chapter 12). In our example, a few large positive residuals are present,
but they are neither sufficiently implausible nor influential to require further
investigation. Their presence does hint at why cubic terms and interactions keep
trying to creep into the model. These residuals could be reduced in size, but only by
adding many more predictors.
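The cross-validation check described above can be sketched as follows: fit the chosen model on one sample, then examine the raw ("unstandardized") residuals Yi − Ŷi for a second, independent sample. The data below are simulated stand-ins for samples ONE and TWO, with illustrative coefficients.

```python
# Cross-validation residuals: fit on "sample ONE", validate on
# "sample TWO". Simulated illustrative data, not the book's.
import numpy as np

rng = np.random.default_rng(3)

def make_sample(n):
    hgt = rng.uniform(45, 60, n)
    age = rng.uniform(6, 10, n)
    male = rng.integers(0, 2, n).astype(float)
    wgt = 100 + 0.8 * hgt + 2 * age - 3 * male + rng.normal(0, 4, n)
    # Design matrix for the chosen model: HGT, AGE, MALE, HT2, AGE2
    X = np.column_stack([np.ones(n), hgt, age, male, hgt**2, age**2])
    return X, wgt

X1, y1 = make_sample(30)        # "sample ONE": used only to fit
X2, y2 = make_sample(30)        # "sample TWO": used only to validate

beta, *_ = np.linalg.lstsq(X1, y1, rcond=None)
resid = y2 - X2 @ beta           # unstandardized validation residuals

print("mean:", round(resid.mean(), 2), " sd:", round(resid.std(ddof=1), 2))
```

In practice these residuals would then be summarized with the descriptive statistics and plots mentioned above (box-and-whisker plot, residuals versus predicted values).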
To be slightly more specific, a variable selection strategy with a validity goal would
first involve determining important interaction effects, followed by an evaluation of
confounding that is contingent on the results of the interaction assessment. For
examples of this type of strategy, see the epidemiologic-research-oriented
text of Kleinbaum, Kupper, and Morgenstern (1982, chaps. 21-24).
Problems
Note to the student: Use Cp for all of the problems in this chapter. Supplementary
problems can be created simply by using one of the other selection criteria.
1. Add the variables (HGT)² and (AGE × HGT) to the data set on WGT, HGT, AGE,
and (AGE)² given in Section 16-3. Then, using an available regression program,
determine the best regression model relating WGT to the five independent variables
using (a) the forward approach, (b) the backward approach, and (c) an approach that
first determines the best model using HGT and AGE as the only candidate
independent variables and then determines whether any second-order terms should be
added to the model. (Use α = .10.) Compare and discuss the models obtained for
each of the three approaches.
2. Using the data given in Problem 2 of Chapter 5 (with SBP as the dependent
variable), find the best regression model using α = .05 and the independent vari-
ables AGE, QUET, and SMK by the (a) forward approach, (b) backward approach,
and (c) all-possible-regressions approach. (d) Based on the results in parts (a)
through (c), select a model for further analysis to determine which, if any, of the
following interaction (i.e., product) terms should be added to the model: AQ =
AGE × QUET, AS = AGE × SMK, QS = QUET × SMK, and AQS = AGE × QUET
× SMK.
3. For the same data set as considered in Problem 2, use the stepwise-regression
approach to find the best regression models of SBP on QUET, AGE, and QUET ×
AGE for smokers and nonsmokers separately. (Use α = .05.) Compare the results for
each group with the answer to Problem 2(d).
4. For the data given in Problem 4 of Chapter 8, find (using α = .10) the best regression
model relating homicide rate (Y) to population size (X1), percent of families with
income less than $5,000 (X2), and unemployment rate (X3). Use (a) the stepwise
approach, (b) the backward approach, and (c) the all-possible-regressions approach.