
Part IV

Advanced Regression Models


Chapter 17
Polynomial Regression
In this chapter, we provide models to account for curvature in a data set.
This curvature may be an overall trend of the underlying population or it
may be a certain structure within a specified region of the predictor space. We
will explore two common methods in this chapter.

17.1 Polynomial Regression

In our earlier discussions on multiple linear regression, we have outlined ways


to check assumptions of linearity by looking for curvature in various plots.
For instance, we look at the plot of residuals versus the fitted values.
We also look at a scatterplot of the response value versus each predictor.
Sometimes, a plot of the response versus a predictor may also show some
curvature in that relationship. Such plots may suggest there is a nonlinear relationship. If we believe there is a nonlinear relationship between the
response and predictor(s), then one way to account for it is through a polynomial regression model:
Y = β_0 + β_1 X + β_2 X^2 + . . . + β_h X^h + ε,        (17.1)

where h is called the degree of the polynomial. For lower degrees, the
relationship has a specific name (i.e., h = 2 is called quadratic, h = 3 is
called cubic, h = 4 is called quartic, and so on). As for a bit of semantics,
it was noted at the beginning of the previous course how nonlinear regression
(which we discuss later) refers to the nonlinear behavior of the coefficients,


which are linear in polynomial regression. Thus, polynomial regression is still
considered linear regression!
In order to estimate equation (17.1), we would only need the response
variable (Y ) and the predictor variable (X). However, polynomial regression
models may have other predictor variables in them as well, which could lead
to interaction terms. So as you can see, equation (17.1) is a relatively simple
model, but you can imagine how the model can grow depending on your
situation!
For the most part, we implement the same analysis procedures as done
in multiple linear regression. To see how this fits into the multiple linear
regression framework, let us consider a very simple data set of size n = 50
that I generated (see Table 17.1). The data was generated from the quadratic
model
y_i = 5 + 12x_i − 3x_i^2 + ε_i,        (17.2)

where the ε_i's are assumed to be normally distributed with mean 0 and variance σ^2. A scatterplot of the data along with the fitted simple linear regression
line is given in Figure 17.1(a). As you can see, a linear regression line is not
a reasonable fit to the data.
Residual plots of this linear regression analysis are also provided in Figure
17.1. Notice in the residuals versus predictor plots how there is obvious curvature and it does not show uniform randomness as we have seen before. The
histogram appears heavily left-skewed and does not show the ideal bell-shape
for normality. Furthermore, the NPP seems to deviate from a straight line
and curves down at the extreme percentiles. These plots alone would suggest that there is something wrong with the model being used and especially
indicate the use of a higher-ordered model.
The matrices for the second-degree polynomial model are:

Y = [ y_1  ]    X = [ 1   x_1    x_1^2  ]    β = [ β_0 ]    ε = [ ε_1  ]
    [ y_2  ]        [ 1   x_2    x_2^2  ]        [ β_1 ]        [ ε_2  ]
    [  ...  ]        [ ...   ...     ...   ]        [ β_2 ]        [  ...  ]
    [ y_50 ]        [ 1   x_50   x_50^2 ]                        [ ε_50 ],
where the entries in Y and X would consist of the raw data. So as you can
see, we are in a setting where the analysis techniques used in multiple linear
regression (e.g., OLS) are applicable here.


Figure 17.1: (a) Scatterplot of the quadratic data with the OLS line. (b)
Residual plot for the OLS fit. (c) Histogram of the residuals. (d) NPP for
the Studentized residuals.



 i    xi      yi        i    xi      yi        i    xi      yi
 1    6.6    -45.4     21    8.4   -106.5     41    8      -95.8
 2   10.1   -176.6     22    7.2    -63       42    8.9   -126.2
 3    8.9   -127.1     23   13.2   -362.2     43   10.1   -179.5
 4    6      -31.1     24    7.1    -61       44   11.5   -252.6
 5   13.3   -366.6     25   10.4   -194       45   12.9   -338.5
 6    6.9    -53.3     26   10.8   -216.4     46    8.1    -97.3
 7    9     -131.1     27   11.9   -278.1     47   14.9   -480.5
 8   12.6   -320.9     28    9.7   -162.7     48   13.7   -393.6
 9   10.6   -204.8     29    5.4    -21.3     49    7.8    -87.6
10   10.3   -189.2     30   12.1   -284.8     50    8.5   -105.4
11   14.1   -421.2     31   12.1   -287.5
12    8.6   -113.1     32   12.1   -290.8
13   14.9   -482.3     33    9.2   -137.4
14    6.5    -42.9     34    6.7    -47.7
15    9.3   -144.8     35   12.1   -292.3
16    5.2    -14.2     36   13.2   -356.4
17   10.7   -211.3     37   11     -228.5
18    7.5    -75.4     38   13.1   -354.4
19   14.9   -482.7     39    9.2   -137.2
20   12.2   -295.6     40   13.2   -361.6

Table 17.1: The simulated 2-degree polynomial data set with n = 50 values.

Some general guidelines to keep in mind when estimating a polynomial regression model are:

- The fitted model is more reliable when it is built on a larger sample size n.

- Do not extrapolate beyond the limits of your observed values.

- Consider how large the size of the predictor(s) will be when incorporating higher degree terms, as this may cause overflow.

- Do not go strictly by low p-values to incorporate a higher degree term, but rather just use these to support your model only if the plot looks reasonable. This is sort of a situation where you need to determine practical significance versus statistical significance.

- In general, you should obey the hierarchy principle, which says that if your model includes X^h and X^h is shown to be a statistically significant predictor of Y, then your model should also include each X^j for all j < h, whether or not the coefficients for these lower-order terms are significant.

17.2 Response Surface Regression

A response surface model (RSM) is a method for determining a surface


predictive model based on one or more variables. In the context of RSMs, the
variables are often called factors, so to keep consistent with the corresponding methodology, we will utilize that term for this section. RSM methods are
usually discussed in a Design of Experiments course, but there is a relevant
regression component. Specifically, response surface regression is fitting a
polynomial regression with a certain structure of the predictors.
Many industrial experiments are conducted to discover which values of
given factor variables optimize a response. If each factor is measured at
three or more values, then a quadratic response surface can be estimated by
ordinary least squares regression. The predicted optimal value can be found
from the estimated surface if the surface is shaped like a hill or valley. If
the estimated surface is more complicated or if the optimum is far from the
region of the experiment, then the shape of the surface can be analyzed to
indicate the directions in which future experiments should be performed.
In polynomial regression, the predictors are often continuous with a large
number of different values. In response surface regression, the factors (of
which there are k) typically represent a quantitative measure where their
factor levels (of which there are p) are equally spaced and established at
the design stage of the experiment. This is what we call a p^k factorial
design because the analysis will involve all of the p^k different treatment
combinations. Our goal is to find a polynomial approximation that works
well in a specified region of the predictor space. As an example, we may
be performing an experiment with k = 2 factors where one of the factors
is a certain chemical concentration in a mixture. The factor levels for the
chemical concentration are 10%, 20%, and 30% (so p = 3). The factors are
then coded in the following way:

X*_{i,j} = (X_{i,j} − [max_i(X_{i,j}) + min_i(X_{i,j})]/2) / ([max_i(X_{i,j}) − min_i(X_{i,j})]/2),

where i = 1, . . . , n indexes the sample and j = 1, . . . , k indexes the factor.


For our example (assuming we label the chemical concentration factor as
1) we would have

X*_{i,1} = (10 − [30 + 10]/2) / ([30 − 10]/2) = −1,   if X_{i,1} = 10%;
X*_{i,1} = (20 − [30 + 10]/2) / ([30 − 10]/2) =  0,   if X_{i,1} = 20%;
X*_{i,1} = (30 − [30 + 10]/2) / ([30 − 10]/2) = +1,   if X_{i,1} = 30%.
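This coding is easy to carry out directly in R. The snippet below is only a sketch; the vector name conc and its values are hypothetical stand-ins for a factor measured at 10%, 20%, and 30%:
##########
conc <- c(10, 20, 30, 10, 20, 30)   # hypothetical factor levels (in %)
# Center at the midrange and scale by half the range, as in the formula above.
conc.coded <- (conc - (max(conc) + min(conc)) / 2) / ((max(conc) - min(conc)) / 2)
conc.coded   # gives -1, 0, +1 for 10%, 20%, 30%
##########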

Some aspects which differentiate a response surface regression model from the general context of a polynomial regression model include:

- In a response surface regression model, p is usually 2 or 3, and the number of levels is usually the same for each factor. More complex models can be developed outside of these constraints, but such a discussion is better dealt with in a Design of Experiments course.

- The factors are treated as categorical variables. Therefore, the X matrix will have a noticeable pattern based on the way the experiment was designed. Furthermore, the X matrix is often called the design matrix in response surface regression.

- The number of factor levels must be at least as large as the number of factors (p ≥ k).

- If examining a response surface with interaction terms, then the model must obey the hierarchy principle (this is not required of general polynomial models, although it is usually recommended).

- The number of factor levels must be greater than the order of the model (i.e., p > h).

- The number of observations (n) must be greater than the number of terms in the model (including all higher-order terms and interactions). It is desirable to have a larger n. A rule of thumb is to have at least 5 observations per term in the model.


Figure 17.2: (a) The points of a square portion of a design with factor levels coded at ±1. This is how a 2^2 factorial design is coded. (b) Illustration of the axial (or star) points of a design at (+a,0), (−a,0), (0,−a), and (0,+a). (c) A diagram which shows the combination of the previous two diagrams with the design center at (0,0). This final diagram is how a composite design is coded.

- Typically response surface regression models only have two-way interactions, while polynomial regression models can (in theory) have k-way interactions.

- The response surface regression models we outlined are for a factorial design. Figure 17.2 shows how a factorial design can be diagramed as a square using factorial points. More elaborate designs can be constructed, such as a central composite design, which takes into consideration axial (or star) points (also illustrated in Figure 17.2). Figure 17.2 pertains to a design with two factors while Figure 17.3 pertains to a design with three factors.
We mentioned that response surface regression follows the hierarchy principle. However, some texts and software do report ANOVA tables which do
not quite follow the hierarchy principle. While fundamentally there is nothing wrong with these tables, it really boils down to a matter of terminology.
If the hierarchy principle is not in place, then technically you are just performing a polynomial regression.
Figure 17.3: (a) The points of a cube portion of a design with factor levels coded at the corners of the cube. This is how a 2^3 factorial design is coded. (b) Illustration of the axial (or star) points of this design. (c) A diagram which shows the combination of the previous two diagrams with the design center at (0,0,0). This final diagram is how a composite design is coded.

Table 17.2 gives a list of all possible terms when assuming an hth-order response surface model with k factors. For any interaction X_i^{h_1} X_j^{h_2} (such that h_2 ≤ h_1) that appears in the model, the hierarchy principle says that at least the main factor effects for powers 1, . . . , h_1 must appear in the model, that all h_1-order interactions with the factor powers of 1, . . . , h_2 must appear in the model, and that all interactions of order less than h_1 must appear in the model. Luckily, response surface regression models (and polynomial models for that matter) rarely go beyond h = 3.
For the next step, an ANOVA table is usually constructed to assess the
significance of the model. Since the factor levels are all essentially treated as
categorical variables, the designed experiment will usually result in replicates
for certain factor level combinations. This is unlike multiple regression where
the predictors are usually assumed to be continuous and no predictor level
combinations are assumed to be replicated. Thus, a formal lack of fit test
is also usually incorporated. Furthermore, the SSR is also broken down
into the components making up the full model, so you can formally test the
contribution of those components to the fit of your model.
An example of a response surface regression ANOVA is given in Table
17.3. Since it is not possible to compactly show a generic ANOVA table nor
to compactly express the formulas, this example is for a quadratic model
with linear interaction terms. The formulas will be similar to their respective quantities defined earlier. For this example, assume that there are k factors, n observations, m unique levels of the factor level combinations, and q total regression parameters are needed for the full model.

Effect                   Relevant Terms
Main Factor              X_i, X_i^2, X_i^3, . . . , X_i^h for all i
Linear Interaction       X_i X_j for all i < j
Quadratic Interaction    X_i^2 X_j for i ≠ j, and X_i^2 X_j^2 for all i < j
Cubic Interaction        X_i^3 X_j, X_i^3 X_j^2 for i ≠ j, and X_i^3 X_j^3 for all i < j
  ...                      ...
hth-order Interaction    X_i^h X_j, X_i^h X_j^2, X_i^h X_j^3, . . . , X_i^h X_j^{h-1} for i ≠ j,
                         and X_i^h X_j^h for all i < j

Table 17.2: A table showing all of the terms that could be included in a response surface regression model. In the above, the indices for the factors are given by i = 1, . . . , k and j = 1, . . . , k.

In Table 17.3, the following partial sums of squares are used to compose the SSR value:
- The sum of squares due to the linear component is
  SSLIN = SSR(X_1, X_2, . . . , X_k).

- The sum of squares due to the quadratic component is
  SSQUAD = SSR(X_1^2, X_2^2, . . . , X_k^2 | X_1, X_2, . . . , X_k).

- The sum of squares due to the linear interaction component is
  SSINT = SSR(X_1 X_2, . . . , X_1 X_k, X_2 X_k, . . . , X_{k-1} X_k | X_1, X_2, . . . , X_k, X_1^2, X_2^2, . . . , X_k^2).
Source          df            SS        MS        F
Regression      q − 1         SSR       MSR       MSR/MSE
  Linear        k             SSLIN     MSLIN     MSLIN/MSE
  Quadratic     k             SSQUAD    MSQUAD    MSQUAD/MSE
  Interaction   q − 2k − 1    SSINT     MSINT     MSINT/MSE
Error           n − q         SSE       MSE
  Lack of Fit   m − q         SSLOF     MSLOF     MSLOF/MSPE
  Pure Error    n − m         SSPE      MSPE
Total           n − 1         SSTO

Table 17.3: ANOVA table for a response surface regression model with linear, quadratic, and linear interaction terms.

Other analysis techniques are commonly employed in response surface regression. For example, canonical analysis (which is a multivariate analysis tool) uses the eigenvalues and eigenvectors of the matrix of second-order parameters to characterize the shape of the response surface (e.g., whether the surface is flat or has some noticeable shape like a hill or a valley). There is also ridge analysis, which computes the estimated ridge of optimum response for increasing radii from the center of the original design. Since the context of these techniques is better suited for a Design of Experiments course, we will not develop their details here.

17.3 Examples

Example 1: Yield Data Set


This data set of size n = 15 contains measurements of yield from an experiment done at five different temperature levels. The variables are y = yield
and x = temperature in degrees Fahrenheit. Table 17.4 gives the data used
for this analysis. Figure 17.4 gives a scatterplot of the raw data and then another scatterplot with lines pertaining to a linear fit and a quadratic fit overlaid. Obviously the trend of this data is better suited to a quadratic fit.
 i   Temperature   Yield
 1        50         3.3
 2        50         2.8
 3        50         2.9
 4        70         2.3
 5        70         2.6
 6        70         2.1
 7        80         2.5
 8        80         2.9
 9        80         2.4
10        90         3.0
11        90         3.1
12        90         2.8
13       100         3.3
14       100         3.5
15       100         3.0

Table 17.4: The yield measurements data set pertaining to n = 15 observations.

Here we have the linear fit results:
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.306306   0.469075   4.917 0.000282 ***
temp        0.006757   0.005873   1.151 0.270641
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3913 on 13 degrees of freedom
Multiple R-Squared: 0.09242,    Adjusted R-squared: 0.0226
F-statistic: 1.324 on 1 and 13 DF, p-value: 0.2706
##########
Figure 17.4: The yield data set with (a) a linear fit and (b) a quadratic fit.

Here we have the quadratic fit results:
##########
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.9604811  1.2589183   6.323 3.81e-05 ***
temp        -0.1537113  0.0349408  -4.399 0.000867 ***
temp2        0.0010756  0.0002329   4.618 0.000592 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2444 on 12 degrees of freedom
Multiple R-Squared: 0.6732,    Adjusted R-squared: 0.6187
F-statistic: 12.36 on 2 and 12 DF, p-value: 0.001218
##########
We see that both temperature and temperature squared are significant predictors for the quadratic model (with p-values of 0.0009 and 0.0006, respectively) and that the fit is much better than the linear fit. From this output, we see the estimated regression equation is ŷ_i = 7.96050 − 0.15371x_i + 0.00108x_i^2. Furthermore, the ANOVA table below shows that the model we fit is statistically significant at the 0.05 significance level with a p-value of 0.0012. Thus, our model should include a quadratic term.
##########
Analysis of Variance Table

Response: yield
           Df  Sum Sq Mean Sq F value   Pr(>F)
Regression  2 1.47656 0.73828   12.36 0.001218 **
Residuals  12 0.71677 0.05973
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##########
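For reference, output like the above could be produced with a few lines of R. This is only a sketch, under the assumption that the data of Table 17.4 are stored in a data frame called yield.data with columns temp and yield (both names are hypothetical):
##########
fit.lin  <- lm(yield ~ temp, data = yield.data)              # straight-line fit
fit.quad <- lm(yield ~ temp + I(temp^2), data = yield.data)  # quadratic fit
summary(fit.lin)
summary(fit.quad)
anova(fit.lin, fit.quad)   # compares the nested linear and quadratic fits
##########
The I() wrapper is needed so that temp^2 is treated as the square of temperature rather than as a formula operator.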

Example 2: Odor Data Set


An experiment is designed to relate three variables (temperature, ratio, and
height) to a measure of odor in a chemical process. Each variable has three
levels, but the design was not constructed as a full factorial design (i.e., it is
not a 3^3 design). Nonetheless, we can still analyze the data using a response
surface regression routine. The data obtained was already coded and can be
found in Table 17.5.
Odor   Temperature   Ratio   Height
  66        -1         -1       0
  58        -1          0      -1
  65         0         -1      -1
 -31         0          0       0
  39         1         -1       0
  17         1          0      -1
   7         0          1      -1
 -35         0          0       0
  43        -1          1       0
  -5        -1          0       1
  43         0         -1       1
 -26         0          0       0
  49         1          1       0
 -40         1          0       1
 -22         0          1       1

Table 17.5: The odor data set measurements with the factor levels already
coded.

First we will fit a response surface regression model consisting of all of the
first-order and second-order terms. The summary of this fit is given below:
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -30.667     10.840  -2.829   0.0222 *
temp         -12.125      6.638  -1.827   0.1052
ratio        -17.000      6.638  -2.561   0.0336 *
height       -21.375      6.638  -3.220   0.0122 *
temp2         32.083      9.771   3.284   0.0111 *
ratio2        47.833      9.771   4.896   0.0012 **
height2        6.083      9.771   0.623   0.5509
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.77 on 8 degrees of freedom
Multiple R-Squared: 0.8683,    Adjusted R-squared: 0.7695
F-statistic: 8.789 on 6 and 8 DF, p-value: 0.003616
##########
As you can see, the square of height is the least statistically significant, so
we will drop that term and rerun the analysis. The summary of this new fit
is given below:
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -26.923      8.707  -3.092 0.012884 *
temp         -12.125      6.408  -1.892 0.091024 .
ratio        -17.000      6.408  -2.653 0.026350 *
height       -21.375      6.408  -3.336 0.008720 **
temp2         31.615      9.404   3.362 0.008366 **
ratio2        47.365      9.404   5.036 0.000703 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.12 on 9 degrees of freedom
Multiple R-Squared: 0.8619,    Adjusted R-squared: 0.7852
F-statistic: 11.23 on 5 and 9 DF, p-value: 0.001169
##########
By omitting the square of height, the temperature main effect has now become marginally significant. Note that the square of temperature is statistically significant. Since we are building a response surface regression model,
we must obey the hierarchy principle. Therefore temperature will be retained
in the model.
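Fits of this kind could be obtained in R along the following lines. This is a sketch only, assuming the coded data of Table 17.5 are in a data frame called odor.data with columns odor, temp, ratio, and height (names are hypothetical):
##########
rsm.full <- lm(odor ~ temp + ratio + height
                      + I(temp^2) + I(ratio^2) + I(height^2), data = odor.data)
rsm.red  <- update(rsm.full, . ~ . - I(height^2))  # drop the square of height
summary(rsm.full)
summary(rsm.red)
##########
Note that temp is kept in the reduced model even though it is only marginally significant, in keeping with the hierarchy principle.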

Finally, contour and surface plots can also be generated for the response
surface regression model. Figure 17.5 gives the contour plots (with odor as
the contours) for each of the three levels of height (Figure 17.6 gives color
versions of the plots). Notice how the contours are increasing as we go out to
the corner points of the design space (so it is as if we are looking down into
a cone). The surface plots of Figure 17.7 all look similar (with the exception
of the temperature scale), but notice the curvature present in these plots.



Figure 17.5: The contour plots of ratio versus temperature with odor as a
response for (a) height=-1, (b) height=0, and (c) height=+1.



Figure 17.6: The contour plots of ratio versus temperature with odor as a
response for (a) height=-1, (b) height=0, and (c) height=+1.



Figure 17.7: The surface plots of ratio versus temperature with odor as a
response for (a) height=-1, (b) height=0, and (c) height=+1.


Chapter 18
Biased Regression Methods
and Regression Shrinkage
Recall earlier that we dealt with multicollinearity (i.e., a near-linear relationship amongst some of the predictors) by centering the variables in order to
reduce the variance inflation factors (which reduces the linear dependency).
When multicollinearity occurs, the ordinary least squares estimates are still
unbiased, but the variances are very large. However, we can add a degree
of bias to the estimation process, thus reducing the variance (and standard
errors). This concept is known as the bias-variance tradeoff due to the
functional relationship between the two values. We proceed to discuss some
popular methods for producing biased regression estimates when faced with
a high degree of multicollinearity.
The assumptions made for these methods are mostly the same as in the
multiple linear regression model. Namely, we assume linearity, constant variance, and independence. Any apparent violation of these assumptions must
be dealt with first. However, these methods do not yield statistical intervals
due to uncertainty in the distributional assumption, so normality of the data
is not assumed.
One additional note is that the procedures in this section are often referred
to as shrinkage methods. They are called shrinkage methods because, as
we will see, the regression estimates we obtain cover a smaller range than
those from ordinary least squares.

18.1 Ridge Regression

Perhaps the most popular (albeit controversial) and widely studied biased
regression technique to deal with multicollinearity is ridge regression. Before we get into the computational side of ridge regression, let us recall from
the last course how to perform a correlation transformation (and the corresponding notation) which is performed by standardizing the variables.
The standardized X matrix is given as:
X* = (1/√(n−1)) ×
     [ (X_{1,1} − X̄_1)/s_{X_1}   (X_{1,2} − X̄_2)/s_{X_2}   . . .   (X_{1,p−1} − X̄_{p−1})/s_{X_{p−1}} ]
     [ (X_{2,1} − X̄_1)/s_{X_1}   (X_{2,2} − X̄_2)/s_{X_2}   . . .   (X_{2,p−1} − X̄_{p−1})/s_{X_{p−1}} ]
     [            ...                        ...             . . .                ...                  ]
     [ (X_{n,1} − X̄_1)/s_{X_1}   (X_{n,2} − X̄_2)/s_{X_2}   . . .   (X_{n,p−1} − X̄_{p−1})/s_{X_{p−1}} ],

which is an n × (p − 1) matrix, and the standardized Y vector is given as:

Y* = (1/√(n−1)) [ (Y_1 − Ȳ)/s_Y,  (Y_2 − Ȳ)/s_Y,  . . . ,  (Y_n − Ȳ)/s_Y ]^T,

which is still an n-dimensional vector. Here,

s_{X_j} = √( Σ_{i=1}^{n} (X_{i,j} − X̄_j)^2 / (n − 1) )

for j = 1, 2, . . . , (p − 1) and

s_Y = √( Σ_{i=1}^{n} (Y_i − Ȳ)^2 / (n − 1) ).

Remember that we have removed the column of 1s in forming X*, effectively


reducing the column dimension of the original X matrix by 1. Because of
this, we no longer can estimate an intercept term (b0 ), which may be an
important part of the analysis. When using the standardized variables, the
regression model of interest becomes:
Y* = X* β* + ε*,
where β* is now a (p − 1)-dimensional vector of standardized regression coefficients and ε* is an n-dimensional vector of errors pertaining to this standardized model. Thus, the ordinary least squares estimates are

β̂* = (X*^T X*)^{−1} X*^T Y* = r_{XX}^{−1} r_{XY},

where r_{XX} is the (p − 1) × (p − 1) correlation matrix of the predictors and r_{XY} is the (p − 1)-dimensional vector of correlation coefficients between the predictors and the response. Thus β̂* is a function of correlations, and hence we have performed a correlation transformation.
Notice further that

E[β̂*] = β*

and

V[β̂*] = σ^2 r_{XX}^{−1} = r_{XX}^{−1}.

For the variance-covariance matrix, σ^2 = 1 because we have standardized all of the variables.
Ridge regression adds a small value k (called a biasing constant) to the diagonal elements of the correlation matrix. (Recall that a correlation matrix has 1s down the diagonal, so it can sort of be thought of as a ridge.) Mathematically, we have

β̂^R = (r_{XX} + k I_{(p−1)×(p−1)})^{−1} r_{XY},

where 0 < k < 1, but k is usually less than 0.3. The amount of bias in this estimator is given by

E[β̂^R] − β* = [(r_{XX} + k I_{(p−1)×(p−1)})^{−1} r_{XX} − I_{(p−1)×(p−1)}] β*,

and the variance-covariance matrix is given by

V[β̂^R] = (r_{XX} + k I_{(p−1)×(p−1)})^{−1} r_{XX} (r_{XX} + k I_{(p−1)×(p−1)})^{−1}.

Remember that β̂^R is calculated on the standardized variables (these are sometimes called the standardized ridge regression estimates). We can transform back to the original scale (sometimes these are called the ridge regression estimates) by

β̂_j = (s_Y / s_{X_j}) β̂^R_j,
β̂_0 = ȳ − Σ_{j=1}^{p−1} β̂_j x̄_j,

where j = 1, 2, . . . , p − 1.
How do we choose k? Many methods exist, but there is no agreement
on which to use, mainly due to instability in the estimates asymptotically.
Two methods are primarily used: one graphical and one analytical. The first
method is called the fixed point method and uses the estimates provided
by fitting the correlation transformation via ordinary least squares. This
method suggests using
k = (p − 1) MSE / (β̂*^T β̂*),

where MSE is the mean square error obtained from the respective fit.
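A quick sketch of this computation in R (all object names below are hypothetical; fit.std is assumed to be the OLS fit of Y* on X* from the correlation transformation, fit without an intercept):
##########
b.std <- coef(fit.std)                       # standardized OLS estimates
mse   <- summary(fit.std)$sigma^2            # MSE from that fit
k.fp  <- length(b.std) * mse / sum(b.std^2)  # (p - 1) * MSE / (b' b)
##########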
Another method is the Hoerl-Kennard iterative method. This method calculates

k^{(t)} = (p − 1) MSE / (β̂^R_{k^{(t−1)}}^T β̂^R_{k^{(t−1)}}),

where t = 1, 2, . . .. Here, β̂^R_{k^{(t−1)}} pertains to the ridge regression estimates obtained when the biasing constant is k^{(t−1)}. This process is repeated until the difference between two successive estimates of k is negligible. The starting value for this method (k^{(0)}) is chosen to be the value of k calculated using the fixed point method.
Perhaps the most common method is a graphical method. The ridge
trace is a plot of the estimated ridge regression coefficients versus k. The
value of k is picked where the regression coefficients appear to have stabilized.
The smallest value of k is chosen as it introduces the smallest amount of bias.
There are criticisms regarding ridge regression. One major criticism is
that ordinary inference procedures are not available since exact distributional properties of the ridge estimator are not known. Another criticism
is in the subjective choice of k. While we mentioned a few of the methods
here, there are numerous methods found in the literature, each with their
own limitations. On the flip-side of these arguments lie some potential benefits of ridge regression. For example, it can accomplish what it sets out to
do, and that is reduce multicollinearity. Also, occasionally ridge regression
can provide an estimate of the mean response which is good for new values
that lie outside the range of our observations (called extrapolation). The
mean response found by ordinary least squares is known to not be good for
extrapolation.

18.2 Principal Components Regression

The method of principal components regression transforms the predictor variables to their principal components. Principal components of X*^T X* are extracted using the singular value decomposition (SVD) method, which says there exist orthogonal matrices U and P (i.e., U^T U = P^T P = I_{(p−1)×(p−1)}) such that

X* = U D P^T.

P is called the (factor) loadings matrix while the (principal component) scores matrix is defined as

Z = U D,

such that

Z^T Z = Λ.

Here, Λ is a (p − 1) × (p − 1) diagonal matrix consisting of the nonzero eigenvalues of X*^T X* on the diagonal (for simplicity, we assume that the eigenvalues are in decreasing order down the diagonal: λ_1 ≥ λ_2 ≥ . . . ≥ λ_{p−1} > 0). Notice that Z = X* P, which implies that each entry of the Z matrix is a linear combination of the entries of the corresponding column of the X* matrix. This is because the goal of principal components is to only keep those linear combinations which help explain a larger amount of the variation (as determined by using the eigenvalues described below).

Next, we regress Y* on Z. The model is

Y* = Z β_Z + ε*,

which has the least squares solution

β̂_Z = (Z^T Z)^{−1} Z^T Y*.

Severe multicollinearity is identified by very small eigenvalues. Multicollinearity is corrected by omitting those components which have small eigenvalues. Since the ith entry of β̂_Z corresponds to the ith component, simply set those entries of β̂_Z to 0 which have correspondingly small eigenvalues. For example, suppose you have 10 predictors (and hence 10 principal components). You find that the last three eigenvalues are relatively small and decide to omit these three components. Therefore, you set the last three entries of β̂_Z equal to 0.

With this value of β̂_Z, we can transform back to get the coefficients on the X* scale by

β̂_PC = P β̂_Z.

This is a solution to

Y* = X* β* + ε*.

Notice that we have not reduced the dimension of β̂_Z from the original calculation, but we have only set certain values equal to 0. Furthermore, as in ridge regression, we can transform back to the original scale by

β̂_{PC,j} = (s_Y / s_{X_j}) β̂*_{PC,j},
β̂_{PC,0} = ȳ − Σ_{j=1}^{p−1} β̂_{PC,j} x̄_j,

where j = 1, 2, . . . , p − 1.
How do you choose the number of eigenvalues to omit? This can be accomplished by looking at the cumulative percent variation explained by each of the (p − 1) components. For the jth component, this percentage is

( Σ_{i=1}^{j} λ_i / (λ_1 + λ_2 + . . . + λ_{p−1}) ) × 100,

where j = 1, 2, . . . , p − 1 (remember, the eigenvalues are in decreasing order). A common rule of thumb is that once you reach a component that explains roughly 80%–90% of the variation, then you can omit the remaining components.
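This criterion is simple to compute. A sketch in R (X.star is a hypothetical name for the standardized predictor matrix):
##########
lambda  <- svd(X.star)$d^2                  # eigenvalues of t(X.star) %*% X.star
cum.pct <- cumsum(lambda) / sum(lambda) * 100
cum.pct                                     # cumulative percent variation
which(cum.pct >= 80)[1]                     # first component reaching roughly 80%
##########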

18.3 Partial Least Squares

We next look at a procedure that is very similar to principal components regression. Here, we will attempt to construct the Z matrix from the last section in a different manner such that we are still interested in models of the form

Y* = Z β_Z + ε*.

Notice that in principal components regression the construction of the linear combinations in Z does not rely whatsoever on the response Y*. Yet, we use the estimate β̂_Z (from regressing Y* on Z) to help us build our final estimate. The method of partial least squares allows us to choose the linear combinations in Z such that they predict Y* as best as possible. We proceed to describe a common way to estimate with partial least squares.

First, define

S S^T = X*^T Y* Y*^T X*.

We construct the score vectors (i.e., the columns of Z) as

z_i = X* r_i,

for i = 1, . . . , p − 1. The challenge becomes to find the r_i values. r_1 is just the first eigenvector of S S^T. r_i for i = 2, . . . , p − 1 maximizes

r_{i−1}^T S S^T r_i,

subject to the constraint

r_{i−1}^T X*^T X* r_i = z_{i−1}^T z_i = 0.

Next, we regress Y* on Z, which has the least squares solution

β̂_Z = (Z^T Z)^{−1} Z^T Y*.

As in principal components regression, we can transform back to get the coefficients on the X* scale by

β̂_PLS = R β̂_Z,

which is a solution to

Y* = X* β* + ε*.
In the above, R is the matrix whose ith column is r_i. Furthermore, as in both ridge regression and principal components regression, we can transform back to the original scale by

β̂_{PLS,j} = (s_Y / s_{X_j}) β̂*_{PLS,j},
β̂_{PLS,0} = ȳ − Σ_{j=1}^{p−1} β̂_{PLS,j} x̄_j,

where j = 1, 2, . . . , p − 1.
The method described above is sometimes referred to as the SIMPLS
method. Another method commonly used is nonlinear iterative partial
least squares (NIPALS). NIPALS is more commonly used when you have
a vector of responses. While we do not discuss the differences between these
algorithms any further, we do discuss later the setting where we have a vector
of responses.

18.4 Inverse Regression

In simple linear regression, we introduced calibration intervals, which are a type of statistical interval for a predictor value given a value for the response. An inverse regression technique is essentially what is performed to find the calibration intervals (i.e., regress the predictor on the response), but calibration intervals do not extend easily to the multiple regression setting. However, we can still extend the notion of inverse regression when dealing with p − 1 predictors.

Let X_i be a p-dimensional vector (with first entry equal to 1 for an intercept, so that we actually have p − 1 predictors) such that

X = [ X_1^T ]
    [  ...   ]
    [ X_n^T ].
However, assume that p is actually quite large with respect to n. Inverse regression can actually be used as a tool for dimension reduction (i.e., reducing p), which reveals to us the most important aspects (or directions) of the data.¹ The tool commonly used is called Sliced Inverse Regression (or SIR). SIR uses the inverse regression curve E(X|Y = y), which falls into a reduced dimension space under certain conditions. SIR uses this curve to perform a weighted principal components analysis such that one can determine an effective subset of the predictors. The reason for reducing the dimensions of the predictors is because of the curse of dimensionality, which means that drawing inferences on the same number of data points in a higher dimensional space becomes difficult due to the sparsity of the data in the volume of the higher dimensional space compared to the volume of the lower dimensional space.²
When working with the classic linear regression model

Y = Xβ + ε

or a more general regression model

Y = f(X) + ε

for some real-valued function f, we know that the distribution of Y|X depends on X only through the p-dimensional variable β = (β_0, β_1, . . . , β_{p−1})^T. Dimension reduction claims that the distribution of Y|X depends on X only through the k-dimensional variable β = (β_1, . . . , β_k)^T such that k < p. This new vector is called the effective dimension reduction direction (or EDR-direction).
The inverse regression curve is computed by looking for E(X|Y = y), which is a curve in R^p, but consisting of p one-dimensional regressions (as opposed to one p-dimensional surface in standard regression). The center of the inverse regression curve is located at E(E(X|Y = y)) = E(X). Therefore, the centered inverse regression curve is

m(y) = E(X|Y = y) − E(X),

which is a p-dimensional curve in R^p. Next, the "slice" part of SIR comes from estimating m(y) by dividing the range of Y into H non-overlapping intervals (or slices), which are then used to compute the sample means, m̂_h, of each slice. These sample means are a crude estimate of m(y).

¹ When we have p large with respect to n, we use the terminology dimension reduction. However, when we are more concerned about which predictors are significant or which functional form is appropriate for our regression model (and the size of p is not too much of an issue), then we use the model selection terminology.

² As an example, consider 100 points on the unit interval [0,1], then imagine 100 points on the unit square [0,1] × [0,1], then imagine 100 points on the unit cube [0,1] × [0,1] × [0,1], and so on. As the dimension increases, the sparsity of the data makes it more difficult to make any relevant inferences about the data.
With the basics of the inverse regression model in place, we can introduce
an algorithm often used to estimate the EDR-direction vector for SIR:
1. Let Σ_X be the variance-covariance matrix of X. Using the standardized X matrix (i.e., the matrix defined earlier in this chapter as X*), we can rewrite the classic regression model as

   Y = X* β* + ε*

   or the more general regression model as

   Y = f(X*) + ε*,

   where β* = Σ_X^{1/2} β.

2. Divide the range of y_1, . . . , y_n into H non-overlapping slices (using the index h = 1, . . . , H). Let n_h be the number of observations within each slice and I_h{·} be the indicator function for this slice such that

   n_h = Σ_{i=1}^{n} I_h{y_i}.

3. Compute

   m̂_h(y_i) = n_h^{−1} Σ_{i=1}^{n} x_i I_h{y_i},

   which are the means of the H slices.

4. Calculate the estimate for Cov(m(y)) by

   V̂ = n^{−1} Σ_{h=1}^{H} n_h m̂_h(y_i) m̂_h(y_i)^T.

5. Identify the k largest eigenvalues λ̂_i and eigenvectors r̂_i of V̂. Construct the score vectors z_i = X* r̂_i as in partial least squares, which are the columns of Z. Then

   β̂* = (Z^T Z)^{−1} Z^T Y

   is the standardized EDR-direction vector.

6. Transform the standardized EDR-direction vector back to the original scale by

   β̂ = Σ_X^{−1/2} β̂*.

18.5 Regression Shrinkage and Connections with Variable Selection

Suppose we now wish to find the least squares estimate of the model Y = Xβ + ε, but subject to a set of equality constraints Aβ = a. It can be shown (by using Lagrange multipliers) that

β̂_CLS = β̂_OLS − (X^T X)^{−1} A^T [A (X^T X)^{−1} A^T]^{−1} [A β̂_OLS − a],

which is called the constrained least squares estimator. This is helpful when you wish to restrict β from being estimated in various areas of R^p. However, you can also have more complicated constraints (e.g., inequality constraints, quadratic constraints, etc.), in which case more sophisticated optimization techniques need to be utilized. The constraints are imposed to restrict the range of β, and so any corresponding estimate can be thought of as a shrinkage estimate as they are covering a smaller range than the ordinary least squares estimates. Ridge regression is a method providing shrinkage estimators, although they are biased. Oftentimes we hope to shrink our estimates to 0 by imposing certain constraints, but this may not always be possible.
A common regression shrinkage procedure is the least absolute shrinkage and selection operator, or LASSO. LASSO is also concerned with the setting of finding the least squares estimate of Y = Xβ + ε, but subject to the inequality constraint Σ_{j=1}^{p} |β_j| ≤ t, which is called an L1 penalty since we are looking at an L1-norm.³ Here, t ≥ 0 is a tuning parameter which the user sets to control the amount of shrinkage. If we let β̂ be the ordinary least squares estimate and let t_0 = Σ_{j=1}^{p} |β̂_j|, then values of t < t_0 will cause shrinkage of the solution towards 0, and some coefficients may be exactly 0. Because of this, LASSO is also accomplishing model (or subset) selection, as we can omit those predictors from the model whose coefficients become exactly 0.

³ Ordinary least squares minimizes Σ_{i=1}^{n} e_i^2, which is an L2 penalty.
It is important to restate the purposes of LASSO. Not only does it shrink the regression estimates, but it also provides a way to accomplish subset selection. Furthermore, ridge regression also serves this dual purpose, although we introduced ridge regression as a way to deal with multicollinearity and not as a first line effort for shrinkage. The way subset selection is performed using ridge regression is by imposing the inequality constraint Σ_{j=1}^{p} β_j^2 ≤ t, which is an L2 penalty. Many competitors to LASSO are available in the literature (such as regularized least absolute deviation and Dantzig selectors), but LASSO is one of the more commonly used methods. It should be noted that there are numerous efficient algorithms available for estimating with these procedures, but due to the level of detail necessary, we will not explore these techniques.
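Although these notes do not demonstrate it, penalized fits of this kind are readily available in R. The sketch below uses the glmnet package, which is not part of the examples in this text; x is assumed to be a numeric predictor matrix and y a response vector (hypothetical names):
##########
library(glmnet)
lasso.fit <- glmnet(x, y, alpha = 1)     # alpha = 1 gives the LASSO (L1) penalty
ridge.fit <- glmnet(x, y, alpha = 0)     # alpha = 0 gives the ridge (L2) penalty
cv.fit    <- cv.glmnet(x, y, alpha = 1)  # cross-validation to choose the tuning value
coef(cv.fit, s = "lambda.min")           # coefficients at the selected tuning value
##########
Coefficients that are exactly 0 in the LASSO output correspond to predictors dropped from the model.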

18.6 Examples

Example 1: GNP Data


This data set of size n = 16 contains macroeconomic data taken between
the years 1947 and 1962. The economic indicators recorded were the GNP
implicit price deflator (IPD), the GNP, the number of people unemployed,
the number of people in the armed forces, the population, and the number
of people employed. We wish to see if the GNP IPD can be modeled as a
function of the other variables. The data set is given in Table 18.1.
Year   GNP IPD       GNP   Unemployed   Armed Forces   Population   Employed
1947      83.0    234.289        235.6          159.0      107.608     60.323
1948      88.5    259.426        232.5          145.6      108.632     61.122
1949      88.2    258.054        368.2          161.6      109.773     60.171
1950      89.5    284.599        335.1          165.0      110.929     61.187
1951      96.2    328.975        209.9          309.9      112.075     63.221
1952      98.1    346.999        193.2          359.4      113.270     63.639
1953      99.0    365.385        187.0          354.7      115.094     64.989
1954     100.0    363.112        357.8          335.0      116.219     63.761
1955     101.2    397.469        290.4          304.8      117.388     66.019
1956     104.6    419.180        282.2          285.7      118.734     67.857
1957     108.4    442.769        293.6          279.8      120.445     68.169
1958     110.8    444.546        468.1          263.7      121.950     66.513
1959     112.6    482.704        381.3          255.2      123.366     68.655
1960     114.2    502.601        393.1          251.4      125.368     69.564
1961     115.7    518.173        480.6          257.2      127.852     69.331
1962     116.9    554.894        400.7          282.7      130.081     70.551

Table 18.1: The macroeconomic data set for the years 1947 to 1962.

First we run a multiple linear regression procedure to obtain the following output:
##########
Coefficients:
               Estimate  Std. Error t value Pr(>|t|)
(Intercept)  2946.85636  5647.97658   0.522   0.6144
GNP             0.26353     0.10815   2.437   0.0376 *
Unemployed      0.03648     0.03024   1.206   0.2585
Armed.Forces    0.01116     0.01545   0.722   0.4885
Population     -1.73703     0.67382  -2.578   0.0298 *
Year           -1.41880     2.94460  -0.482   0.6414
Employed        0.23129     1.30394   0.177   0.8631
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.195 on 9 degrees of freedom
Multiple R-Squared: 0.9926,    Adjusted R-squared: 0.9877
F-statistic: 202.5 on 6 and 9 DF, p-value: 4.426e-09
##########
As you can see, not many predictors appear statistically significant at the
0.05 significance level. We also have a fairly high R2 (over 99%). However,
by looking at the variance inflation factors, multicollinearity is obviously an
issue:
##########
         GNP   Unemployed Armed.Forces   Population         Year     Employed
  1214.57215     83.95865     12.15639    230.91221   2065.73394    220.41968
##########

In performing a ridge regression, we first obtain a trace plot of possible ridge coefficients (Figure 18.1). As you can see, the estimates of the regression coefficients shrink drastically until about 0.02. When using the Hoerl-Kennard method, a value of about k = 0.0068 is obtained. Other methods will certainly yield different estimates, which illustrates some of the criticism surrounding ridge regression.

Figure 18.1: Ridge regression trace plot with the biasing constant on the x-axis.
The resulting estimates from this ridge regression analysis are
##########
         GNP   Unemployed Armed.Forces   Population         Year     Employed
  25.3615288    3.3009416    0.7520553  -11.6992718   -6.5403380    0.7864825
##########

The estimates have obviously shrunk closer to 0 compared to the original estimates.
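A trace plot and estimates like those above could be produced with the MASS package. The following is only a sketch, under the assumption that the data of Table 18.1 sit in a data frame called gnp.data with the GNP IPD stored as GNP.IPD (the object names are hypothetical):
##########
library(MASS)
ridge.fits <- lm.ridge(GNP.IPD ~ GNP + Unemployed + Armed.Forces + Population
                       + Year + Employed, data = gnp.data,
                       lambda = seq(0, 0.1, by = 0.001))
matplot(ridge.fits$lambda, t(ridge.fits$coef), type = "l",
        xlab = "biasing constant", ylab = "standardized coefficients")  # ridge trace
select(ridge.fits)   # reports data-based choices of the biasing constant
##########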
Example 2: Acetylene Data
This data set of size n = 16 contains observations of the percentage of conversion of n-heptane to acetylene and three predictor variables. The response
variable is y = conversion of n-heptane to acetylene (%), x1 = reactor temperature (degrees Celsius), x2 = ratio of H2 to n-heptane (Mole ratio), and
x3 = contact time (in seconds). The data set is given in Table 18.2.
 i      Y     X1     X2      X3
 1   49.0   1300    7.5   0.0120
 2   50.2   1300    9.0   0.0120
 3   50.5   1300   11.0   0.0115
 4   48.5   1300   13.5   0.0130
 5   47.5   1300   17.0   0.0135
 6   44.5   1300   23.0   0.0120
 7   28.0   1200    5.3   0.0400
 8   31.5   1200    7.5   0.0380
 9   34.5   1200   11.0   0.0320
10   35.0   1200   13.5   0.0260
11   38.0   1200   17.0   0.0340
12   38.5   1200   23.0   0.0410
13   15.0   1100    5.3   0.0840
14   17.0   1100    7.5   0.0980
15   20.5   1100   11.0   0.0920
16   29.5   1100   17.0   0.0860

Table 18.2: The acetylene data set where Y =conversion of n-heptane to


acetylene (%), X1 =reactor temperature (degrees Celsius), X2 =ratio of H2 to
n-heptane (mole ratio), X3 =contact time (in seconds).
First we run a multiple linear regression procedure to obtain the following
output:
##########
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -121.26962   55.43571  -2.188   0.0492 *
reactor.temp     0.12685    0.04218   3.007   0.0109 *
H2.ratio         0.34816    0.17702   1.967   0.0728 .
cont.time      -19.02170  107.92824  -0.176   0.8630
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.767 on 12 degrees of freedom
Multiple R-Squared: 0.9198,    Adjusted R-squared: 0.8998
F-statistic: 45.88 on 3 and 12 DF, p-value: 7.522e-07
##########
As you can see, everything but contact time appears statistically significant
at the 0.05 significance level. We also have a fairly high R2 (nearly 90%).
However, by looking at the pairwise scatterplots for the predictors in Figure
18.2, there appears to be a distinctive linear relationship between contact
time and reactor temperature. This is further verified by looking at the
variance inflation factors:
##########
reactor.temp     H2.ratio    cont.time
   12.225045     1.061838    12.324964
##########

We will proceed with a principal components regression analysis. First


we perform the SVD of X* in order to get the Z matrix. Then, regressing Y* on Z yields
##########
Coefficients:
      Z1        Z2        Z3
-0.66277   0.03952  -0.57268
##########

The above is simply β̂_Z. From the SVD of X*, the factor loadings matrix is found to be

P = [ 0.6742704   0.2183362   0.70547061 ]
    [ 0.2956893   0.9551955   0.01301144 ]
    [ 0.6767033   0.1998269   0.70861973 ].

So transforming back yields

β̂_PC = P β̂_Z = [ 0.05150079 ]
                [ 0.15076806 ]
                [ 0.86220916 ].
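The computations just shown follow directly from the SVD. A sketch of how they might be carried out in R (X.star and Y.star are hypothetical names for the standardized acetylene predictors and response):
##########
sv <- svd(X.star)                    # X* = U D P^T
Z  <- sv$u %*% diag(sv$d)            # principal component scores
P  <- sv$v                           # factor loadings matrix
beta.Z  <- solve(t(Z) %*% Z, t(Z) %*% Y.star)  # regress Y* on Z
beta.PC <- P %*% beta.Z              # coefficients back on the X* scale
##########
Entries of beta.Z associated with very small eigenvalues (the squares of sv$d) would be set to 0 before the back-transformation if components were being omitted.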


Figure 18.2: Pairwise scatterplots for the predictors from the acetylene data
set. LOESS curves are also provided. Does there appear to be any possible
linear relationships between pairs of predictors?


Chapter 19
Piecewise and Nonparametric
Methods
This chapter focuses on regression models where we start to deviate from
the functional form discussed thus far. The first topic discusses a model
where different regressions are fit depending on which area of the predictor
space we are in. The second topic discusses non parametric models which,
as the name suggests, fits a model which is free of distributional assumptions
and subsequently does not have regression coefficients readily available for
estimation. This is best accomplished by using a smoother, which is a
tool for summarizing the trend of the response as a function of one or more
predictors. The resulting estimate of the trend is less variable than the
response itself.

19.1 Piecewise Linear Regression

A model that proposes a different linear relationship for different intervals (or
regions) of the predictor is called a piecewise linear regression model.
The predictor values at which the slope changes are called knots, which
we will discuss throughout this chapter. Such models are helpful when you
expect the linear trend of your data to change once you hit some threshold.
Usually the knot values are already predetermined due to previous studies
or standards that are in place. However, there are methods for estimating
the knot values (sometimes called changepoints in the context of piecewise
linear regression), but we will not explore such methods.


For simplicity, we construct the piecewise linear regression model for the
case of simple linear regression and also briefly discuss how this can be extended to the multiple regression setting. First, let us establish what the
simple linear regression model with one knot value (k1 ) looks like:
Y = β_0 + β_1 X_1 + β_2 (X_1 − k_1) I{X_1 > k_1} + ε,

where I{·} is the indicator function such that

I{X_1 > k_1} = 1 if X_1 > k_1, and 0 otherwise.

So, when X_1 ≤ k_1, the simple linear regression line is

E(Y) = β_0 + β_1 X_1,

and when X_1 > k_1, the simple linear regression line is

E(Y) = (β_0 − β_2 k_1) + (β_1 + β_2) X_1.
Such a regression model is fitted in the upper left-hand corner of Figure 19.1.
For more than one knot value, we can extend the above regression model
to incorporate other indicator values. Suppose we have c knot values (i.e.,
k1 , k2 , . . . , kc ) and we have n observations. Then the piecewise linear regression model is written as:
y_i = β_0 + β_1 x_{i,1} + β_2 (x_{i,1} − k_1) I{x_{i,1} > k_1} + . . . + β_{c+1} (x_{i,1} − k_c) I{x_{i,1} > k_c} + ε_i.

As you can see, this can be written more compactly as:

y = Xβ + ε,

where β is a (c + 2)-dimensional vector and

X = [ 1   x_{1,1}   (x_{1,1} − k_1) I{x_{1,1} > k_1}   . . .   (x_{1,1} − k_c) I{x_{1,1} > k_c} ]
    [ 1   x_{2,1}   (x_{2,1} − k_1) I{x_{2,1} > k_1}   . . .   (x_{2,1} − k_c) I{x_{2,1} > k_c} ]
    [ ...     ...                   ...                  . . .                  ...              ]
    [ 1   x_{n,1}   (x_{n,1} − k_1) I{x_{n,1} > k_1}   . . .   (x_{n,1} − k_c) I{x_{n,1} > k_c} ].

Furthermore, you can see how for more than one predictor you can construct
the X matrix to have columns as functions of the other predictors.


Figure 19.1: Plots illustrating continuous and discontinuous piecewise linear regressions with 1 and 2 knots.



Sometimes you may also have a discontinuity that needs to be reflected at the knots (see the right-hand side plots of Figure 19.1). This is easily reflected in the piecewise linear model we constructed above by adding one more term to the model. For each k_j where there is a discontinuity, you add the corresponding indicator random variable I{X_1 > k_j} as a regressor. Thus, the X matrix would have the column vector

[ I{x_{1,1} > k_j} ]
[ I{x_{2,1} > k_j} ]
[        ...        ]
[ I{x_{n,1} > k_j} ]

appended to it for each k_j where there is a discontinuity. Extending discontinuities to the case of more than one predictor is analogous.
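Models of this form can be fit with ordinary least squares once the extra columns are built. The following is a sketch, assuming a data frame dat with columns x and y and a single pre-determined knot k1 (the names and the knot value are hypothetical):
##########
k1 <- 0.5   # hypothetical, pre-determined knot
# Continuous piecewise fit: the slope changes at k1 but the line does not jump.
fit.cont <- lm(y ~ x + I(pmax(x - k1, 0)), data = dat)
# Discontinuous fit: adding the indicator itself allows a jump at k1.
fit.disc <- lm(y ~ x + I(pmax(x - k1, 0)) + I(x > k1), data = dat)
##########
The term pmax(x - k1, 0) is exactly (x − k1)I{x > k1} from the model above.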

19.2 Local Regression Methods

Nonparametric regression attempts to find a functional relationship between yi and xi (only one predictor):
y_i = m(x_i) + ε_i,

where m(·) is the regression function to estimate and E(ε_i) = 0. It is not necessary to assume constant variance and, in fact, one typically assumes that V(ε_i) = σ^2(x_i), where σ^2(·) is a continuous, bounded function.
Local regression is a method commonly used to model this nonparametric regression relationship. Specifically, local regression makes no global
assumptions about the function m(). Global assumptions are made in standard linear regression as we assume that the regression curve we estimate
(which is characterized by the regression coefficient vector ) properly models all of our data. However, local regression assumes that m() can be
well-approximated locally by a member from a simple class of parametric
functions (e.g., a constant, straight-line, quadratic curve, etc.) What drives
local regression is Taylor's theorem from Calculus, which says that any continuous function (which we assume that m(·) is) can be approximated with
a polynomial.
In this section, we discuss some of the common local regression methods
for estimating regressions nonparametrically.

19.2.1 Kernel Regression

One way of estimating m(·) is to use density estimation, which approximates the probability density function f(·) of a random variable X. Assuming we have n independent observations x_1, . . . , x_n from the random variable X, the kernel density estimator f̂_h(x) for estimating the density at x (i.e., f(x)) is defined as

f̂_h(x) = (1 / (nh)) Σ_{i=1}^{n} K( (x_i − x) / h ).
Here, K() is called the kernel function and h is called the bandwidth.
K() is a function often resembling a probability density function, but with
no parameters (some common kernel functions are provided in Table 19.1). h
controls the window width around x over which we perform the density estimation.
Thus, a kernel density estimator is essentially a weighting scheme (dictated
by the choice of kernel) which takes into consideration the proximity of a
point in the data set near x when given a bandwidth h. Furthermore, more
weight is given to points near x and less weight is given to points further
from x.
With the formalities established, one can perform a kernel regression of y_i on x_i to estimate m_h(·) with the Nadaraya-Watson estimator:

m̂_h(x) = Σ_{i=1}^{n} K( (x_i − x) / h ) y_i  /  Σ_{i=1}^{n} K( (x_i − x) / h ),

where m̂ has been subscripted to note its dependency on the bandwidth. As


you can see, this kernel regression estimator is just a weighted sum of the
observed responses.
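Written out directly, the estimator is just such a weighted average. A small sketch in R with a Gaussian kernel (x, y, and the bandwidth h are assumed to already exist):
##########
nw <- function(x0, x, y, h) {
  w <- dnorm((x - x0) / h)   # Gaussian kernel weights for the point x0
  sum(w * y) / sum(w)        # weighted average of the responses
}
# Base R's ksmooth(x, y, kernel = "normal", bandwidth = h) gives a very
# similar fit, although its bandwidth is scaled somewhat differently.
##########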
It is also possible to construct approximate confidence intervals and confidence bands using the Nadaraya-Watson estimator, but under some restrictive assumptions. An approximate 100(1 − α)% confidence interval is given by

m̂_h(x) ± z_{1−α/2} √( σ̂_h^2(x) ||K||_2^2 / (n h f̂_h(x)) ),

where h = cn^{−1/5} for some constant c > 0, z_{1−α/2} is the (1 − α/2)-quantile of the standard normal distribution, ||K||_2^2 = ∫ K^2(u) du, and

σ̂_h^2(x) = Σ_{i=1}^{n} K( (x_i − x) / h ) {y_i − m̂_h(x)}^2  /  Σ_{i=1}^{n} K( (x_i − x) / h ).

Kernel        K(u)
Triangle      (1 − |u|) I(|u| ≤ 1)
Beta          ((1 − u^2)^g / Beta(0.5, g + 1)) I(|u| ≤ 1)
Gaussian      (1 / √(2π)) exp(−u^2 / 2)
Cosinus       (1/2)(1 + cos(πu)) I(|u| ≤ 1)
Optcosinus    (π/4) cos(πu/2) I(|u| ≤ 1)

Table 19.1: A table of common kernel functions. In the above, I(|u| ≤ 1) is the indicator function yielding 1 if |u| ≤ 1 and 0 otherwise. For the beta kernel, the value g ≥ 0 is specified by the user and is a shape parameter. Common values of g are 0, 1, 2, and 3, which are called the uniform, Epanechnikov, biweight, and triweight kernels, respectively.

Next, let h = n^{−δ} for δ ∈ (1/5, 1/2). Then, under certain regularity conditions, an approximate 100(1 − α)% confidence band is given by

m̂_h(x) ± z_{n,α} √( σ̂_h^2(x) ||K||_2^2 / (n h f̂_h(x)) ),

where

z_{n,α} = [ log{−(1/2) log(1 − α)} / (2 log(n))^{1/2} + d_n ]^{1/2}

and

d_n = (2 log(n))^{1/2} + (2 log(n))^{−1/2} log( ||K′||_2^2 / (2π ||K||_2^2) )^{1/2}.
Some final notes about kernel regression include:

- Choice of kernel and bandwidth are still major issues in research. There are some general guidelines to follow and procedures that have been developed, but these are beyond the scope of this course.

- What we developed in this section is only for the case of one predictor. If you have multiple predictors (i.e., x_{1,i}, . . . , x_{p,i}), then one needs to use a multivariate kernel density estimator at a point x = (x_1, . . . , x_p)^T, which is defined as

  f̂_h(x) = (1/n) Σ_{i=1}^{n} (1 / Π_{j=1}^{p} h_j) K( (x_{i,1} − x_1)/h_1, . . . , (x_{i,p} − x_p)/h_p ).

  Multivariate kernels require more advanced methods and are difficult to use, as data sets with more predictors will often suffer from the curse of dimensionality.

19.2.2  Local Polynomial Regression and LOESS

Local polynomial modeling is similar to kernel regression estimation, but the fitted values are now produced by a locally weighted regression rather than by a locally weighted average. The theoretical basis for this approach is to do a Taylor series expansion of $m(x_i)$ around a value x:

$$m(x_i) \approx m(x) + m'(x)(x_i - x) + \frac{m''(x)(x_i - x)^2}{2} + \ldots + \frac{m^{(q)}(x)(x_i - x)^q}{q!},$$
for x in a neighborhood of $x_i$. It is then parameterized in a way such that

$$m(x_i) \approx \beta_0(x) + \beta_1(x)(x_i - x) + \beta_2(x)(x_i - x)^2 + \ldots + \beta_q(x)(x_i - x)^q, \qquad |x_i - x| \le h,$$
so that

$$\beta_0(x) = m(x),\quad \beta_1(x) = m'(x),\quad \beta_2(x) = m''(x)/2,\quad \ldots,\quad \beta_q(x) = m^{(q)}(x)/q!.$$

Note that the parameters are considered functions of x, hence the "local" aspect of this methodology.
Local polynomial fitting minimizes

$$\sum_{i=1}^{n}\left\{y_i - \sum_{j=0}^{q}\beta_j(x)(x_i - x)^j\right\}^2 K\left(\frac{x_i - x}{h}\right)$$
with respect to the $\beta_j(x)$ terms. Then, letting

$$\mathbf{X} = \begin{pmatrix} 1 & (x_1 - x) & \ldots & (x_1 - x)^q \\ 1 & (x_2 - x) & \ldots & (x_2 - x)^q \\ \vdots & \vdots & \ddots & \vdots \\ 1 & (x_n - x) & \ldots & (x_n - x)^q \end{pmatrix}, \qquad \boldsymbol{\beta}(x) = \begin{pmatrix} \beta_0(x) \\ \beta_1(x) \\ \vdots \\ \beta_q(x) \end{pmatrix},$$

and $\mathbf{W} = \mathrm{diag}\left\{K\left(\frac{x_1 - x}{h}\right), \ldots, K\left(\frac{x_n - x}{h}\right)\right\}$, the local least squares estimate can be written as

$$\hat{\boldsymbol{\beta}}(x) = \arg\min_{\boldsymbol{\beta}(x)}\,(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}(x))^T \mathbf{W} (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}(x)) = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{Y}.$$

Thus, we can estimate the $\nu$th derivative of m(x) by

$$\hat{m}^{(\nu)}(x) = \nu!\,\hat{\beta}_\nu(x).$$

Finally, for any x, we can perform inference on the $\beta_j(x)$ (or the $m^{(\nu)}(x)$) terms in a manner similar to weighted least squares.
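As a rough illustration of this weighted least squares step, the following R sketch computes a local quadratic estimate of m(x) at a single point; the data vectors, the evaluation point, and the bandwidth are hypothetical placeholders.
##########
# A minimal sketch of a local polynomial (here quadratic) fit at a point x0
# using kernel-weighted least squares; x, y, x0, and h are placeholders.
local.poly <- function(x0, x, y, h, q = 2) {
  w <- dnorm((x - x0) / h)                        # kernel weights
  X <- outer(x - x0, 0:q, "^")                    # columns 1, (x - x0), ..., (x - x0)^q
  b <- solve(t(X) %*% (w * X), t(X) %*% (w * y))  # (X'WX)^{-1} X'Wy
  b[1]                                            # beta_0(x0), the estimate of m(x0)
}
##########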
The method of LOESS (which stands for LOcally Estimated Scatterplot Smoother)¹ is commonly used for local polynomial fitting. However, LOESS is not a simple mathematical model, but rather an algorithm that, when given a value of X, computes an appropriate value of Y. The algorithm was designed so that the LOESS curve travels through the middle of the data and gives points closest to each X value the greatest weight in the smoothing process, thus limiting the influence of outliers.

¹There is also another version of LOESS called LOWESS, which stands for LOcally WEighted Scatterplot Smoother. The main difference is the weighting that is introduced during the smoothing process.

Suppose we have a set of observations $(x_1, y_1), \ldots, (x_n, y_n)$. LOESS follows a basic algorithm as follows:
1. Select a set of values partitioning $[x_{(1)}, x_{(n)}]$. Let $x_0$ be an individual value in this set.

2. For each observation, calculate the distance $d_i = |x_i - x_0|$. Let q be the number of observations in the neighborhood of $x_0$. The neighborhood is formally defined as the q smallest values of $d_i$, where $q = \lceil n\alpha \rceil$. Here $\alpha$ is the proportion of points to be selected and is called the span (usually chosen to be about 0.40), and $\lceil \cdot \rceil$ means to take the next largest integer if the calculated value is not already an integer.
3. Perform a weighted regression of the $y_i$'s on the $x_i$'s using only the points in the neighborhood. The weights are given by

$$w_i(x_0) = T\left(\frac{|x_i - x_0|}{d_q}\right),$$

where T(·) is the tricube weight function given by

$$T(u) = \begin{cases} (1 - |u|^3)^3, & \text{if } |u| < 1; \\ 0, & \text{if } |u| \ge 1, \end{cases}$$

and $d_q$ is the largest distance in the neighborhood of observations close to $x_0$. The weighted regression for $x_0$ is defined by the estimated regression coefficients

$$\hat{\boldsymbol{\beta}}_{\mathrm{LOESS}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} w_i(x_0)\left[y_i - (\beta_0 + \beta_1(x_i - x_0) + \beta_2(x_i - x_0)^2 + \ldots + \beta_h(x_i - x_0)^h)\right]^2.$$

For LOESS, usually h = 2 is sufficient.
4. Calculate the fitted values as

$$\hat{y}_{\mathrm{LOESS}}(x_0) = \hat{\beta}_{0,\mathrm{LOESS}} + \hat{\beta}_{1,\mathrm{LOESS}}\,x_0 + \hat{\beta}_{2,\mathrm{LOESS}}\,x_0^2 + \ldots + \hat{\beta}_{h,\mathrm{LOESS}}\,x_0^h.$$
5. Iterate the above procedure for another value of x0 .
Since outliers can have a large impact on least squares estimates, a robust weighted regression procedure may also be used to lessen the influence of outliers on the LOESS curve. This is done by replacing Step 3 in the algorithm above with a new set of weights. These weights are calculated by taking the q LOESS residuals

$$r_i = y_i - \hat{y}_{\mathrm{LOESS}}(x_i)$$

and calculating new weights given by

$$w_i = w_i^0\, B\left(\frac{|r_i|}{6M}\right).$$

For $w_i$, the value $w_i^0$ is the previous weight for this observation (where the first time you calculate this weight can be done by the original LOESS procedure we outlined), M is the median of the q absolute values of the residuals, and B(·) is the bisquare weight function given by

$$B(u) = \begin{cases} (1 - |u|^2)^2, & \text{if } |u| < 1; \\ 0, & \text{if } |u| \ge 1. \end{cases}$$
This robust procedure can be iterated up to 5 times for a given x0 .
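In practice, this algorithm is implemented by R's loess() function; a minimal sketch is given below, where the data frame and column names are hypothetical placeholders and family = "symmetric" requests the robust (bisquare) reweighting iterations.
##########
# A minimal sketch of a LOESS fit in R; the data frame 'quality' with columns
# score1 and score2 is a hypothetical placeholder.
fit <- loess(score2 ~ score1, data = quality,
             span = 0.40,           # the span (proportion of points per neighborhood)
             degree = 2,            # local quadratic fitting (h = 2)
             family = "symmetric")  # robust bisquare reweighting iterations
new.x <- data.frame(score1 = seq(min(quality$score1), max(quality$score1),
                                 length.out = 100))
pred <- predict(fit, newdata = new.x)
##########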
Some other notes about local regression methods include:

• Various forms of local regression exist in the literature. The main thing to note is that these are approximation methods with much of the theory being driven by Taylor's theorem from Calculus.

• Kernel regression is actually a special case of local regression.

• As with kernel regression, there is also an extension of local regression regarding multiple predictors. It requires use of a multivariate version of Taylor's theorem around the p-dimensional point $\mathbf{x}_0$. The model can include all main effects, pairwise combinations, and k-wise combinations of the predictors up to the order of h. Weights can then be defined, such as

$$w_i(\mathbf{x}_0) = T\left(\frac{\|\mathbf{x}_i - \mathbf{x}_0\|}{d_q}\right),$$

where again $\alpha$ is the span. The values of $\mathbf{x}_i$ can also be scaled so that the smoothness occurs the same way in all directions. However, note that this estimation is often difficult due to the curse of dimensionality.

19.2.3  Projection Pursuit Regression

Besides procedures like LOESS, there is also an exploratory method called projection pursuit regression (or PPR) which attempts to reveal possible nonlinear and interesting structures in

$$y_i = m(\mathbf{x}_i) + \epsilon_i$$

by looking at univariate regressions instead of complicated multiple regressions, thereby avoiding the curse of dimensionality. A pure nonparametric approach can lead to strong oversmoothing, since the sparseness of the predictor space requires a large neighborhood of observations for the local averaging to produce a reliable estimate. To estimate the response function m(·) from the data, the following PPR algorithm is typically used:
1. Set $r_i^{(0)} = y_i$.

2. For $j = 1, 2, \ldots$, maximize

$$R_{(j)}^2 = 1 - \frac{\sum_{i=1}^{n}\left\{r_i^{(j-1)} - \hat{m}_{(j)}(\hat{\boldsymbol{\alpha}}_{(j)}^T \mathbf{x}_i)\right\}^2}{\sum_{i=1}^{n}\left\{r_i^{(j-1)}\right\}^2}$$

by varying over the orthogonal parameters $\boldsymbol{\alpha}_{(j)} \in \mathbb{R}^p$ (i.e., $\|\boldsymbol{\alpha}_{(j)}\| = 1$) and a univariate regression function $\hat{m}_{(j)}(\cdot)$.

3. Compute new residuals

$$r_i^{(j)} = r_i^{(j-1)} - \hat{m}_{(j)}(\hat{\boldsymbol{\alpha}}_{(j)}^T \mathbf{x}_i).$$

4. Repeat steps 2 and 3 until $R_{(j)}^2$ becomes small. A small $R_{(j)}^2$ implies that $\hat{m}_{(j)}(\hat{\boldsymbol{\alpha}}_{(j)}^T \mathbf{x}_i)$ is approximately the zero function and we will not find any other useful direction.

The advantages of using PPR for estimation are that we are using univariate regressions, which are quick and easy to estimate. Also, PPR is able to approximate a fairly rich class of functions as well as ignore variables providing little to no information about m(·). Some disadvantages of using PPR include having to examine a p-dimensional parameter space to estimate each direction, and the interpretation of a single term may be difficult.
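A minimal sketch of PPR in R, using the ppr() function in the stats package, is given below; the data frame and variable names are hypothetical placeholders.
##########
# A minimal sketch of projection pursuit regression; the data frame 'dat' with
# response y and predictors x1 and x2 is a hypothetical placeholder.
fit <- ppr(y ~ x1 + x2, data = dat,
           nterms = 2,     # number of ridge terms kept in the final model
           max.terms = 5)  # number of terms fit before pruning back to nterms
summary(fit)               # reports the estimated projection directions
##########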

19.3  Smoothing Splines

A smoothing spline is a piecewise polynomial function where the polynomial pieces fit together at knots. Smoothing splines are continuous on the whole interval the function is defined on, including at the knots. Mathematically, a smoothing spline minimizes

$$\sum_{i=1}^{n}(y_i - f(x_i))^2 + \lambda\int_a^b [f''(t)]^2\,dt$$

among all twice continuously differentiable functions f(·), where $\lambda > 0$ is a smoothing parameter and $a \le x_{(1)} \le \ldots \le x_{(n)} \le b$ (where, recall, $x_{(1)}$ and $x_{(n)}$ are the minimum and maximum x values, respectively). The knots are usually chosen as the unique values of the predictors in the data set, but may also be a subset of them.

In the function above, the first term measures the closeness to the data while the second term penalizes curvature in the function. In fact, it can be shown that there exists an explicit, unique minimizer, and that minimizer is a cubic spline with knots at each of the unique values of the $x_i$.
The smoothing parameter $\lambda$ does just what its name suggests: it smooths the curve. Typically, $0 < \lambda \le 1$, but this need not be the case. When $\lambda > 1$, then $\lambda/(1+\lambda)$ is said to be the tuning parameter. Regardless, when the smoothing parameter is near 1, a smoother curve is produced. Smaller values of the smoothing parameter (values near 0) often produce rougher curves, as the curve interpolates nearer to the observed data points (i.e., the curves are essentially being drawn right to the location of the data points).
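A minimal sketch of fitting a cubic smoothing spline in R with smooth.spline() is given below; the data vectors are hypothetical placeholders, and spar is R's rescaled version of the smoothing parameter.
##########
# A minimal sketch of cubic smoothing splines; x and y are hypothetical data
# vectors.
fit1 <- smooth.spline(x, y, spar = 0.9)  # spar near 1 gives a smoother curve
fit2 <- smooth.spline(x, y)              # by default the smoothing parameter is
                                         # chosen by generalized cross-validation
plot(x, y)
lines(predict(fit1, seq(min(x), max(x), length.out = 100)))
##########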
Notice that the cubic smoothing spline introduced above is only capable of handling one predictor. Suppose now that we have p predictors $X_1, \ldots, X_p$.² We wish to consider the model

$$y_i = f(x_{i,1}, \ldots, x_{i,p}) + \epsilon_i = f(\mathbf{x}_i) + \epsilon_i,$$

for $i = 1, \ldots, n$, where f(·) belongs to the space of functions whose partial derivatives of order m exist and are in $L^2(\Omega_p)$, such that $\Omega_p$ is the domain of the p-dimensional random variable $\mathbf{X}$.³ In general, m and p must satisfy the constraint $2m - p > 0$.

²We will forego assuming an intercept for simplicity in this discussion.

³Basically, all this is saying is that the first m derivatives of f(·) exist when evaluated at our values of $x_{i,1}, \ldots, x_{i,p}$ for all i.
For a fixed $\lambda$ (i.e., the smoothing parameter), we estimate f by minimizing

$$\frac{1}{n}\sum_{i=1}^{n}[y_i - f(\mathbf{x}_i)]^2 + \lambda J_m(f),$$

which results in what is called a thin-plate smoothing spline. While there are several ways to define $J_m(f)$, a common way to define it for a thin-plate smoothing spline is by

$$J_m(f) = \int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty}\sum_{\Theta}\frac{m!}{t_1!\cdots t_p!}\left(\frac{\partial^m f}{\partial x_1^{t_1}\cdots\partial x_p^{t_p}}\right)^2 dx_1\cdots dx_p,$$

where $\Theta$ is the set of all permutations of $(t_1, \ldots, t_p)$ such that $\sum_{j=1}^{p} t_j = m$.
Numerous algorithms exist for estimation and have demonstrated fairly
stable numerical results. However, one must gently balance fitting the data
closely with avoiding characterizing the fit with excess variation. Fairly general procedures also exist for constructing confidence intervals and estimating
the smoothing parameter. The following subsection briefly describes the least
squares method usually driving these algorithms.

19.3.1  Penalized Least Squares

Penalized least squares estimates are a way to balance fitting the data closely while avoiding overfitting due to excess variation. A penalized least squares fit is a surface which minimizes a penalized least squares function over the class of all such surfaces meeting certain regularity conditions.
Let us assume we are in the case defined earlier for a thin-plate smoothing spline, but now there is also a parametric component. The model we will consider is

$$y_i = f(\mathbf{z}_i) + \mathbf{x}_i^T\boldsymbol{\beta} + \epsilon_i,$$

where $\mathbf{z}_i$ is a q-dimensional vector of covariates while $\mathbf{x}_i$ is a p-dimensional vector of covariates whose relationship with $y_i$ is characterized through $\boldsymbol{\beta}$. So notice that we have a parametric component to this function and a nonparametric component. Such a model is said to be semiparametric in nature, and such models are discussed in the last chapter.
The ordinary least squares estimate for our model estimates $f(\mathbf{z}_i)$ and $\boldsymbol{\beta}$ by minimizing

$$\frac{1}{n}\sum_{i=1}^{n}(y_i - f(\mathbf{z}_i) - \mathbf{x}_i^T\boldsymbol{\beta})^2.$$
However, the functional space of $f(\mathbf{z}_i)$ is so large that a function can always be found which interpolates the points perfectly, but this will simply reflect all random variation in the data. Penalized least squares attempts to fit the data well, but provide a degree of smoothness to the fit. Penalized least squares minimizes

$$\frac{1}{n}\sum_{i=1}^{n}(y_i - f(\mathbf{z}_i) - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda J_m(f),$$
where $J_m(f)$ is the penalty on the roughness of the function f(·). Again, the squared term of this function measures the goodness-of-fit, while the second term measures the smoothness associated with f(·). A larger $\lambda$ penalizes rougher fits, while a smaller $\lambda$ emphasizes the goodness-of-fit.

A final estimate of f for the penalized least squares method can be written as

$$\hat{f}(\mathbf{z}_i) = \hat{\delta} + \mathbf{z}_i^T\hat{\boldsymbol{\gamma}} + \sum_{k=1}^{n}\hat{\theta}_k B_k(\mathbf{x}_i),$$

where the $B_k$'s are basis functions dependent on the location of the $\mathbf{x}_i$'s and $\delta$, $\boldsymbol{\gamma}$, and the $\theta_k$'s are coefficients to be estimated. For a fixed $\lambda$, $(\delta, \boldsymbol{\gamma}, \boldsymbol{\theta})$ can be estimated. The smoothing parameter $\lambda$ can be chosen by minimizing the generalized cross-validation (or GCV) function. Write

$$\hat{\mathbf{y}} = \mathbf{A}(\lambda)\mathbf{y},$$
where $\mathbf{A}(\lambda)$ is called the smoothing matrix. Then the GCV criterion is defined as

$$V(\lambda) = \frac{\|(\mathbf{I}_{n\times n} - \mathbf{A}(\lambda))\mathbf{y}\|^2/n}{[\mathrm{tr}(\mathbf{I}_{n\times n} - \mathbf{A}(\lambda))/n]^2}$$

and

$$\hat{\lambda} = \arg\min_{\lambda} V(\lambda).$$
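As a rough sketch of how this criterion could be evaluated, the following R function computes V(λ) for a given smoother matrix; the function make.smoother(), assumed to return A(λ), is a hypothetical placeholder.
##########
# A minimal sketch of the GCV criterion; make.smoother(lambda) is assumed to
# return the n x n smoothing matrix A(lambda) and is a hypothetical placeholder.
gcv <- function(lambda, y, make.smoother) {
  n <- length(y)
  A <- make.smoother(lambda)
  num <- sum(((diag(n) - A) %*% y)^2) / n   # ||(I - A(lambda))y||^2 / n
  den <- (sum(diag(diag(n) - A)) / n)^2     # [tr(I - A(lambda))/n]^2
  num / den
}
# The smoothing parameter could then be chosen with, e.g.,
# optimize(gcv, interval = c(1e-6, 10), y = y, make.smoother = make.smoother).
##########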

19.4  Nonparametric Resampling Techniques for $\hat{\boldsymbol{\beta}}$

In this section, we discuss two commonly used resampling techniques, which are used for estimating characteristics of the sampling distribution of $\hat{\boldsymbol{\beta}}$. While we discuss these techniques for the regression parameter $\hat{\boldsymbol{\beta}}$, it should be noted that they can be generalized and applied to any parameter of interest. They can also be used for constructing nonparametric confidence intervals for the parameter(s) of interest.

19.4.1  The Bootstrap

Bootstrapping is a method where you resample from your data (often with replacement) in order to approximate the distribution of the data at hand. While bootstrapping procedures are conceptually very appealing (and they have been shown to possess certain asymptotic properties), they are computationally intensive. In the nonparametric regression routines we presented, standard regression assumptions were not made. In these nonstandard situations, bootstrapping provides a viable alternative for obtaining standard errors and confidence intervals for the regression coefficients and predicted values. In the regression setting, there are two types of bootstrapping methods that may be employed. Before we differentiate these methods, we first discuss bootstrapping in a little more detail.
In bootstrapping, you assume that your sample is actually the population of interest. You draw B samples (B is usually well over 1000) of size n from your original sample with replacement. With replacement means that each observation you draw for your sample is always selected from the entire set of values in your original sample. For each bootstrap sample, the regression results are computed and stored. For example, if B = 5000 and we are trying to estimate the sampling distribution of the regression coefficients for a simple linear regression, then the bootstrap will yield $(\hat{\beta}_{0,1}, \hat{\beta}_{1,1}), (\hat{\beta}_{0,2}, \hat{\beta}_{1,2}), \ldots, (\hat{\beta}_{0,5000}, \hat{\beta}_{1,5000})$ as your sample.
Now suppose that you want the standard errors and confidence intervals
for the regression coefficients. The standard deviation of the B estimates
provided by the bootstrapping scheme is the bootstrap estimate of the standard error for the respective regression coefficient. Furthermore, a bootstrap
confidence interval is found by sorting the B estimates of a regression coefficient and selecting the appropriate percentiles from the sorted list. For
example, a 95% bootstrap confidence interval would be given by the 2.5th
and 97.5th percentiles from the sorted list. Other statistics may be computed
in a similar manner.
One assumption which bootstrapping relies heavily on is that your sample approximates the population fairly well. Thus, bootstrapping does not
usually work well for small samples as they are likely not representative of
the underlying population. Bootstrapping methods should be reserved for medium sample sizes or larger (what constitutes a medium sample size is somewhat subjective).
Now we can turn our attention to the two bootstrapping techniques available in the regression setting. Assume for both methods that our sample
consists of the pairs (x1 , y1 ), . . . , (xn , yn ). Extending either method to the
case of multiple regression is analogous.
We can first bootstrap the observations. In this setting, the bootstrap
samples are selected from the original pairs of data. So the pairing of a
response with its measured predictor is maintained. This method is appropriate for data in which both the predictor and response were selected at
random (i.e., the predictor levels were not predetermined).
We can also bootstrap the residuals. The bootstrap samples in this setting are selected from what are called the Davison-Hinkley modified residuals, given by

$$e_i^* = \frac{e_i}{\sqrt{1 - h_{i,i}}} - \frac{1}{n}\sum_{j=1}^{n}\frac{e_j}{\sqrt{1 - h_{j,j}}},$$

where the $e_i$'s are the original regression residuals. We do not simply use the $e_i$'s because these lead to biased results. In each bootstrap sample, the randomly sampled modified residuals are added to the original fitted values, forming new values of y. Thus, the original structure of the predictors will remain the same while only the response will be changed. This method is appropriate for designed experiments where the levels of the predictor are
predetermined. Also, since the residuals are sampled and added back at
random, we must assume the variance of the residuals is constant. If not,
this method should not be used.
Finally, a $100 \times (1-\alpha)\%$ bootstrap confidence interval for the regression coefficient $\beta_i$ is given by

$$\left(\hat{\beta}_{i,\lfloor(\alpha/2)B\rfloor},\ \hat{\beta}_{i,\lceil(1-\alpha/2)B\rceil}\right),$$

where the subscripts refer to positions in the sorted list of bootstrap estimates. This can then be used to calculate a $100 \times (1-\alpha)\%$ bootstrap confidence interval for $\mathrm{E}(Y|X = x_h)$, which is given by

$$\left(\hat{\beta}_{0,\lfloor(\alpha/2)B\rfloor} + \hat{\beta}_{1,\lfloor(\alpha/2)B\rfloor}x_h,\ \hat{\beta}_{0,\lceil(1-\alpha/2)B\rceil} + \hat{\beta}_{1,\lceil(1-\alpha/2)B\rceil}x_h\right).$$
19.4.2  The Jackknife

Jackknifing, which is similar to bootstrapping, is used in statistical inference to estimate the bias and standard error (variance) of a statistic when a
random sample of observations is used to calculate it. The basic idea behind
the jackknife variance estimator lies in systematically recomputing the estimator of interest by leaving out one or more observations at a time from the
original sample. From this new set of replicates of the statistic, an estimate
for the bias and variance of the statistic can be calculated, which can then
be used to calculate jackknife confidence intervals.
Below we outline the steps for jackknifing in the simple linear regression
setting for simplicity, but the multiple regression setting is analogous:
1. Draw a sample of size n, $(x_1, y_1), \ldots, (x_n, y_n)$, and divide the sample into s independent groups, each of size d.

2. Omit the first set of d observations from the sample and estimate $\beta_0$ and $\beta_1$ from the (n - d) remaining observations (call these estimates $\hat{\beta}_0^{(J_1)}$ and $\hat{\beta}_1^{(J_1)}$, respectively). The remaining set of (n - d) observations is called the delete-d jackknife sample.

3. Omit each of the remaining sets of $2, \ldots, s$ groups in turn and estimate the respective regression coefficients. These are $\hat{\beta}_0^{(J_2)}, \ldots, \hat{\beta}_0^{(J_s)}$ and $\hat{\beta}_1^{(J_2)}, \ldots, \hat{\beta}_1^{(J_s)}$. Note that this results in s = n/d delete-d jackknife samples.
4. Obtain the (joint) probability distribution $F(\hat{\beta}_0^{(J)}, \hat{\beta}_1^{(J)})$ of the delete-d jackknife estimates. This may be done empirically or through investigating an appropriate distribution.

5. Calculate the jackknife regression coefficient estimates, which are the means of the $F(\hat{\beta}_0^{(J)}, \hat{\beta}_1^{(J)})$ distribution, as

$$\hat{\beta}_j^{(J)} = \frac{1}{s}\sum_{k=1}^{s}\hat{\beta}_j^{(J_k)},$$

for j = 0, 1. Thus, the delete-d jackknife (simple) linear regression equation is

$$y_i = \hat{\beta}_0^{(J)} + \hat{\beta}_1^{(J)} x_i + e_i.$$
The jackknife bias for each regression coefficient is

$$\widehat{\mathrm{bias}}_J(\hat{\beta}_j) = (n-1)\left(\hat{\beta}_j^{(J)} - \hat{\beta}_j\right),$$

where $\hat{\beta}_j$ is the estimate obtained when using the full sample of size n. The jackknife variance for each regression coefficient is

$$\widehat{\mathrm{var}}_J(\hat{\beta}_j) = \frac{(n-1)}{n}\left(\hat{\beta}_j^{(J)} - \hat{\beta}_j\right)^2,$$

which implies that the jackknife standard error is

$$\widehat{\mathrm{s.e.}}_J(\hat{\beta}_j) = \sqrt{\widehat{\mathrm{var}}_J(\hat{\beta}_j)}.$$

Finally, if normality is appropriate, then a $100 \times (1-\alpha)\%$ jackknife confidence interval for the regression coefficient $\beta_j$ is given by

$$\hat{\beta}_j^{(J)} \pm t_{n-2;\,1-\alpha/2}\,\widehat{\mathrm{s.e.}}_J(\hat{\beta}_j).$$
Otherwise, we can construct a fully nonparametric jackknife confidence interval in a similar manner as the bootstrap version. Namely,

$$\left(\hat{\beta}_{j,\lfloor(\alpha/2)s\rfloor}^{(J)},\ \hat{\beta}_{j,\lceil(1-\alpha/2)s\rceil}^{(J)}\right),$$

which can then be used to calculate a $100 \times (1-\alpha)\%$ jackknife confidence interval for $\mathrm{E}(Y|X = x_h)$, which is given by

$$\left(\hat{\beta}_{0,\lfloor(\alpha/2)s\rfloor}^{(J)} + \hat{\beta}_{1,\lfloor(\alpha/2)s\rfloor}^{(J)}x_h,\ \hat{\beta}_{0,\lceil(1-\alpha/2)s\rceil}^{(J)} + \hat{\beta}_{1,\lceil(1-\alpha/2)s\rceil}^{(J)}x_h\right).$$
While for moderately sized data the jackknife requires less computation, there are some drawbacks to using the jackknife. Since the jackknife is using fewer samples, it is only using limited information about $\hat{\boldsymbol{\beta}}$. In fact, the jackknife can be viewed as an approximation to the bootstrap (it is a linear approximation to the bootstrap in that the two are roughly equal for linear estimators). Moreover, the jackknife can perform quite poorly if the estimator of interest is not sufficiently smooth (intuitively, smooth can be thought of as meaning that small changes to the data result in small changes to the calculated statistic), which can especially occur when your sample is too small.
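A minimal sketch of the delete-1 (d = 1) jackknife for the regression coefficients is given below; the data frame name is a hypothetical placeholder, and the standard error uses the usual replicate-sum form of the delete-1 jackknife variance.
##########
# A minimal sketch of the delete-1 jackknife for simple linear regression; the
# data frame 'dat' with columns x and y is a hypothetical placeholder.
n <- nrow(dat)
full.fit <- coef(lm(y ~ x, data = dat))
jack <- sapply(1:n, function(i) coef(lm(y ~ x, data = dat[-i, ])))  # drop obs i
jack.mean <- rowMeans(jack)                      # jackknife coefficient estimates
jack.bias <- (n - 1) * (jack.mean - full.fit)    # jackknife bias estimates
jack.se   <- sqrt((n - 1) / n * rowSums((jack - jack.mean)^2))  # delete-1 SEs
##########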

19.5  Examples

Example 1: Packaging Data Set


This data set of size n = 15 contains measurements of yield from a packaging
plant where the manager wants to model the unit cost (y) of shipping lots
of a fragile product as a linear function of lot size (x). Table 19.2 gives the
data used for this analysis. Because of economies of scale, the manager believes that the cost per unit will decrease at a faster rate for lot sizes of more than 1000.
Based on the description of this data, we wish to fit a (continuous) piecewise regression with one knot value at k1 = 1000. Figure 19.2 gives a scatterplot of the raw data with a vertical line at the lot size of 1000. This appears
to be a good fit.
We can also obtain summary statistics regarding the fit
##########
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.0240268  0.1766955  22.774 3.05e-11 ***
lot.size    -0.0020897  0.0002052 -10.183 2.94e-07 ***
lot.size.I  -0.0013937  0.0003644  -3.825  0.00242 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09501 on 12 degrees of freedom
Multiple R-Squared: 0.9838,     Adjusted R-squared: 0.9811
F-statistic: 363.4 on 2 and 12 DF,  p-value: 1.835e-11
##########

 i   Unit Cost   Lot Size
 1      1.29       1150
 2      2.20        840
 3      2.26        900
 4      2.38        800
 5      1.77       1070
 6      1.25       1220
 7      1.87        980
 8      0.71       1300
 9      2.90        520
10      2.63        670
11      0.55       1420
12      2.31        850
13      1.90       1000
14      2.15        910
15      1.20       1230

Table 19.2: The packaging data set pertaining to n = 15 observations.
As can be seen, the two predictors are statistically significant for this piecewise linear regression model.
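One way such a continuous piecewise fit could have been constructed in R is sketched below; the data frame name packaging and the column name unit.cost are hypothetical placeholders, and building lot.size.I as a hinge term at the knot of 1000 is an assumption about how that predictor was formed.
##########
# A minimal sketch of the continuous piecewise (broken-stick) regression with a
# knot at 1000; the data frame 'packaging' and the column 'unit.cost' are
# hypothetical placeholders, and the hinge construction of lot.size.I is an
# assumption.
packaging$lot.size.I <- pmax(packaging$lot.size - 1000, 0)  # (lot.size - 1000)+
piece.fit <- lm(unit.cost ~ lot.size + lot.size.I, data = packaging)
summary(piece.fit)
##########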
Example 2: Quality Measurements Dataset (continued)
Recall that we fit a quadratic polynomial to the quality measurement data set. Let us also fit a nonparametric regression curve to this data and calculate
bootstrap confidence intervals for the slope parameters. Figure 19.3(a) shows
two LOESS curves with two different spans.
Here are some general things to think about when fitting data with
LOESS:
1. Which fit appears to be better to you?
Figure 19.2: A scatterplot of the packaging data set with a piecewise linear regression fitted to the data.

2. How do you think more data would affect the smoothness of the fits?
3. If we drive the span to 0, what type of regression line would you expect
to see?
4. If we drive the span to 1, what type of regression line would you expect
to see?
Figure 19.3(b) shows two kernel regression curves with two different bandwidths. A Gaussian kernel is used. Some things to think about when fitting
the data (as with the LOESS fit) are:
1. Which fit appears to be better to you?
2. How do you think more data would affect the smoothness of the fits?
3. What type of regression line would you expect to see as we change the
bandwidth?
4. How does the choice of kernel affect the fit?
When performing local fitting (as with kernel regression), the last two points
above are issues where there are still no clear solutions.
Figure 19.3: (a) A scatterplot of the quality data set and two LOESS fits with different spans. (b) A scatterplot of the quality data set and two kernel regression fits with different bandwidths.

Next, let us return to the orthogonal regression fit of this data. Recall
that the slope term for the orthogonal regression fit was 1.4835. Using a
nonparametric bootstrap (with B = 5000 bootstraps), we can obtain the
following bootstrap confidence intervals for the orthogonal slope parameter:
90% bootstrap confidence interval: (0.9677, 2.9408).
95% bootstrap confidence interval: (0.8796, 3.6184).
99% bootstrap confidence interval: (0.6473, 6.5323).
Remember that if you were to perform another bootstrap with B = 5000,
then the estimated intervals given above will be slightly different due to the
randomness of the resampling process!

Chapter 20
Regression Models with
Censored Data
Suppose we wish to estimate the parameters of a distribution where only a
portion of the data is known. When the remainder of the data has a measurement that exceeds (or falls below) some threshold and only that threshold
value is recorded for that observation, then the data are said to be censored.
When the data exceeds (or falls below) some threshold, but the data is omitted from the database, then the data are said to be truncated. This chapter
deals primarily with the analysis of censored data by first introducing the
area of reliability (survival) analysis and then presenting some of the basic
tools and models from this area as a segue into a regression setting. We also
devote a section to discussing truncated regression models.

20.1  Overview of Reliability and Survival Analysis

It is helpful to formally define the area of analysis which is heavily concerned


with estimating models with censored data. Survival analysis concerns
the analysis of data from biological events associated with the study of animals and humans. Reliability analysis concerns the analysis of data from
events associated with the study of engineering applications. We will utilize
terminology from both areas for the sake of completeness.
Survival (reliability) analysis studies the distribution of lifetimes (failure
times). The study will consist of the elapsed time between an initiating event


and a terminal event. For example:
• Study the time of individuals in a cancer study. The initiating time could be the diagnosis of cancer or the start of treatment. The terminal event could be death or cure of the disease.

• Study the lifetime of various machine motors. The initiating time could be the date the machine was first brought online. The terminal event could be complete machine failure or the first time it must be brought off-line for maintenance.
The data are a combination of complete and censored values, which means
a terminal event has occurred or not occurred, respectively. Formally, let Y
be the observed time from the study, T denote the actual event time (sometimes referred to as the latent variable), and t denote some known threshold value where the values are censored. Observations in a study can
be censored in the following manners:
• Right censoring: This occurs when an observation has dropped out, been removed from a study, or did not reach a terminal event prior to termination of the study. In other words, $Y \le T$ such that

$$Y = \begin{cases} T, & T < t; \\ t, & T \ge t. \end{cases}$$

• Left censoring: This occurs when an observation reaches a terminal event before the first time point in the study. In other words, $Y \ge T$ such that

$$Y = \begin{cases} T, & T > t; \\ t, & T \le t. \end{cases}$$

• Interval censoring: This occurs when a study has discrete time points and an observation reaches a terminal event between two of the time points. In other words, for discrete time increments $0 = t_1 < t_2 < \ldots < t_r < \infty$, we have $Y_1 < T < Y_2$ such that for $j = 1, \ldots, r-1$,

$$Y_1 = \begin{cases} t_j, & t_j < T < t_{j+1}; \\ 0, & \text{otherwise} \end{cases}$$

and

$$Y_2 = \begin{cases} t_{j+1}, & t_j < T < t_{j+1}; \\ \infty, & \text{otherwise.} \end{cases}$$

• Double censoring: This is when all of the above censored observations can occur in a study.
Moreover, there are two criteria that define the type of censoring in a study. If the experimenter controls the type of censoring, then we have non-random censoring, of which there are two types:

• Type I or time-truncated censoring, which occurs if an observation is still alive (in operation) when a test is terminated after a pre-determined length of time.

• Type II or failure-truncated censoring, which occurs if an observation is still alive (in operation) when a test is terminated after a pre-determined number of failures is reached.
Suppose T has probability density function f(t) with cumulative distribution function F(t). Since we are interested in survival times (lifetimes), the support of T is $(0, +\infty)$. There are 3 functions usually of interest in a survival (reliability) analysis:

• The survival function S(t) (or reliability function R(t)) is given by

$$S(t) = R(t) = \int_t^{+\infty} f(x)\,dx = 1 - F(t).$$

This is the probability that an individual survives (or something is reliable) beyond time t and is usually the first quantity studied.

• The hazard rate h(t) (or conditional failure rate) is the probability that an observation at time t will experience a terminal event in the next instant. It is given by

$$h(t) = \frac{f(t)}{S(t)} = \frac{f(t)}{R(t)}.$$

The empirical hazard (conditional failure) rate function is useful in identifying which probability distribution to use if it is not already specified.

• The cumulative hazard function H(t) (or cumulative conditional failure function) is given by

$$H(t) = \int_0^t h(x)\,dx.$$


These are only the basics when it comes to survival (reliability) analysis.
However, they provide enough of a foundation for our interests. We are
interested in the setting where a set of predictors (or covariates) is also measured with the observed time.

20.2  Censored Regression Model

Censored regression models (also called Tobit models) simply attempt to model the unknown variable T (which is assumed left-censored) as a linear combination of the covariates $X_1, \ldots, X_{p-1}$. For a sample of size n, we have

$$T_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_{p-1}x_{i,p-1} + \epsilon_i,$$

where $\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$. Based on this model, it can be shown that for the observed variable Y,

$$\mathrm{E}[Y_i\,|\,Y_i > t] = \mathbf{X}_i^T\boldsymbol{\beta} + \sigma\lambda(\alpha_i),$$

where $\alpha_i = (t - \mathbf{X}_i^T\boldsymbol{\beta})/\sigma$ and

$$\lambda(\alpha_i) = \frac{\phi(\alpha_i)}{1 - \Phi(\alpha_i)}$$

such that $\phi(\cdot)$ and $\Phi(\cdot)$ are the probability density function and cumulative distribution function of a standard normal random variable (i.e., N(0, 1)), respectively. Moreover, the quantity $\lambda(\alpha_i)$ is called the inverse Mills ratio, which reappears later in our discussion about the truncated regression model.

If we let $i_1$ be the index of all of the uncensored values and $i_2$ be the index of all of the left-censored values, then we can define a log-likelihood function for the estimation of the regression parameters (see Appendix C for further details on likelihood functions):

$$\ell(\boldsymbol{\beta}, \sigma) = -\frac{1}{2}\sum_{i_1}\left[\log(2\pi) + \log(\sigma^2) + \frac{(y_i - \mathbf{X}_i^T\boldsymbol{\beta})^2}{\sigma^2}\right] + \sum_{i_2}\log\left(1 - \Phi\left(\frac{\mathbf{X}_i^T\boldsymbol{\beta}}{\sigma}\right)\right).$$

Optimization of the above equation yields estimates for $\boldsymbol{\beta}$ and $\sigma$.


Now it should be noted that this is a very special case of a broader class of survival (reliability) regression models. However, it is commonly used, which is why it is usually treated separately from the broader class of regression models that are discussed in the next section.
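As a rough sketch, a Tobit model of this form could be fit in R with the survival package as follows; the variable names and censoring threshold t0 are hypothetical placeholders, and the observed response is assumed to be recorded at t0 whenever the latent value fell at or below it.
##########
# A minimal sketch of a left-censored (Tobit) regression via survival::survreg;
# y.obs, x, and the threshold t0 are hypothetical placeholders.
library(survival)
tobit.fit <- survreg(Surv(y.obs, y.obs > t0, type = "left") ~ x,
                     dist = "gaussian")   # normal errors, as in the Tobit model
summary(tobit.fit)
##########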
20.3  Survival (Reliability) Regression

Suppose we have n observations where we measure p - 1 covariates with the observed time. In our examples from the introduction, some covariates you may also measure include:

• Gender, age, weight, and previous ailments of the cancer patients.

• Manufacturer, metal used for the drive mechanisms, and running temperature of the machines.
Let X be the matrix of covariates as in the standard multiple regression model, but without the first column consisting of 1's (so it is an $n \times (p-1)$ matrix). Then we model

$$T^* = \beta_0 + \mathbf{X}^T\boldsymbol{\beta} + \epsilon,$$

where $\boldsymbol{\beta}$ is a (p - 1)-dimensional vector, $T^* = \ln T$, and $\epsilon$ has a certain distribution (which we will discuss shortly). Then

$$T = \exp(T^*) = e^{\beta_0 + \mathbf{X}^T\boldsymbol{\beta}}\,e^{\epsilon}.$$

So, the covariate acts multiplicatively on the survival time T.

The distribution of $\epsilon$ will allow us to determine the distribution of T. Each possible probability distribution has a different h(t). Furthermore, in a survival regression setting, we assume the hazard rate at time t for an individual has the form

$$h(t\,|\,\mathbf{X}) = h_0(t)\,k(\mathbf{X}^T\boldsymbol{\beta}) = h_0(t)\,e^{\mathbf{X}^T\boldsymbol{\beta}}.$$

In the above, $h_0(t)$ is called the baseline hazard and is the value of the hazard function when $\mathbf{X} = \mathbf{0}$ or when $\boldsymbol{\beta} = \mathbf{0}$. Note in the expression for T that we separated out the intercept term $\beta_0$ as it becomes part of the baseline hazard. Also, k(·) in the equation for $h(t\,|\,\mathbf{X})$ is a specified link function, which for our purposes will be $e^{(\cdot)}$.
Next we discuss some of the possible (and common) distributions assumed for $\epsilon$. We do not write out the density formulas here, but they can be found in
most statistical texts. The parameters for your distribution help control three
primary aspects of the density curve: location, scale, and shape. You will
want to consider the properties your data appear to exhibit (or historically
have exhibited) when determining which of the following to use:
• The normal distribution with location parameter (mean) $\mu$ and scale parameter (variance) $\sigma^2$. As we have seen, this is one of the more commonly used distributions in statistics, but it is infrequently used as a lifetime distribution since it allows negative values while lifetimes are always positive. One possibility is to consider a truncated normal or a log transformation (which we discuss next).

• The lognormal distribution with a location parameter, a scale parameter, and a shape parameter. The location parameter gives the minimum value of the random variable T, and the scale and shape parameters of the lognormal distribution are the location and scale parameters of the underlying normal distribution, respectively. Note that if T has a lognormal distribution, then ln(T) has a normal distribution.

• The Weibull distribution with a location parameter, a scale parameter, and a shape parameter. The Weibull distribution is probably most commonly used for time-to-failure data since it is fairly flexible to work with. The location parameter again gives the minimum value of the random variable T and is often set to 0 so that the support of T is positive. Setting the location parameter to 0 provides the more commonly used two-parameter Weibull distribution.

• The Gumbel distribution (or extreme-value distribution) with a location parameter and a scale parameter is sometimes used, but more often it is presented due to its relationship to the Weibull distribution. If T has a Weibull distribution, then ln(T) has a Gumbel distribution.

• The exponential distribution with a location parameter and a scale parameter (the reciprocal of the scale is sometimes called the rate). The location parameter again gives the minimum value of the random variable T and is often set to 0 so that the support of T is positive. Setting the location parameter to 0 results in what is usually referred to as the exponential distribution. The exponential distribution is a model for lifetimes with a constant failure rate. If T has an exponential distribution with location parameter 0, then ln(T) has a standard Gumbel distribution (i.e., the scale of the Gumbel distribution is 1).

• The logistic distribution with a location parameter and a scale parameter. This distribution is very similar to the normal distribution, but is used in cases where there are heavier tails (i.e., higher probability of the data occurring out in the tails of the distribution).

• The log-logistic distribution with a location parameter, a scale parameter, and a shape parameter. The location parameter again gives the minimum value of the random variable T and is often set to 0 so that the support of T is positive. Setting the location parameter to 0 results in what is usually referred to as the log-logistic distribution. If T has a log-logistic distribution with location parameter 0, then ln(T) has a logistic distribution whose location and scale parameters are determined by the log-logistic scale and shape parameters.

• The gamma distribution with a location parameter, a scale parameter, and a shape parameter. The gamma distribution is a competitor to the Weibull distribution, but is more mathematically complicated and thus avoided where the Weibull appears to provide a good fit. The gamma distribution also arises because the sum of independent exponential random variables is gamma distributed. The location parameter again gives the minimum value of the random variable T and is often set to 0 so that the support of T is positive. Setting the location parameter to 0 results in what is usually referred to as the gamma distribution.

• The beta distribution has two shape parameters, as well as two location parameters, A and B, which denote the minimum and maximum of the data. If the beta distribution is used for lifetime data, then it appears when fitting data which are assumed to have an absolute minimum and absolute maximum. Thus, A and B are almost always assumed known.

Note that the above is not an exhaustive list, but provides some of the more commonly used distributions in statistical texts and software. Also, there is an abuse of notation in that the reuse of certain parameter names across distributions does not imply a mathematical relationship between all of the distributions where those names appear.
Estimation of the parameters can be accomplished in two primary ways.
One way is to construct a probability plot of the chosen distribution with your data and then apply least squares regression to this plot. Another, perhaps
more appropriate, approach is to use maximum likelihood estimation as it
STAT 501

D. S. Young

CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA 297


can be shown to be optimum in most situations and provides estimates of
standard errors, and thus confidence limits. Maximum likelihood estimation
is commonly accomplished by using a Newton-Raphson algorithm.

20.4  Cox Proportional Hazards Regression

Recall from the last section that we set $T^* = \ln(T)$, where the hazard function is $h(t\,|\,\mathbf{X}) = h_0(t)e^{\mathbf{X}^T\boldsymbol{\beta}}$. The Cox formulation of this relationship gives

$$\ln(h(t)) = \ln(h_0(t)) + \mathbf{X}^T\boldsymbol{\beta},$$

which yields the following form of the linear regression model:

$$\ln\left(\frac{h(t)}{h_0(t)}\right) = \mathbf{X}^T\boldsymbol{\beta}.$$

Exponentiating both sides yields a ratio of the actual hazard rate and the baseline hazard rate, which is called the relative risk:

$$\frac{h(t)}{h_0(t)} = e^{\mathbf{X}^T\boldsymbol{\beta}} = \prod_{i=1}^{p-1}e^{\beta_i x_i}.$$

Thus, the regression coefficients have the interpretation as the relative risk when the value of a covariate is increased by 1 unit. The estimates of the regression coefficients are interpreted as follows:

• A positive coefficient means there is an increase in the risk, which decreases the expected survival (failure) time.

• A negative coefficient means there is a decrease in the risk, which increases the expected survival (failure) time.

• The ratio of the estimated risk functions for two different sets of covariates (i.e., two groups) can be used to examine the likelihood of Group 1's survival (failure) time relative to Group 2's survival (failure) time.
Remember, for this model the intercept term has been absorbed by the baseline hazard. The model we developed above is the Cox Proportional Hazards regression model and does not include t on the right-hand side. Thus, the relative risk is constant for all values of t. Estimation for this regression model is usually done by maximum likelihood, and Newton-Raphson is usually the algorithm used. Usually, the baseline hazard is found nonparametrically, so the estimation procedure for the entire model is said to be semiparametric. Additionally, if there are failure time ties in the data, then the likelihood gets more complex and an approximation to the likelihood is usually used (such as the Breslow approximation or the Efron approximation).
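A minimal sketch of fitting a Cox proportional hazards model in R is given below; the data frame and covariate names are hypothetical placeholders.
##########
# A minimal sketch of a Cox proportional hazards fit; the data frame 'dat' with
# follow-up time, status (1 = event, 0 = censored), and covariates age and
# group is a hypothetical placeholder.
library(survival)
cox.fit <- coxph(Surv(time, status) ~ age + group, data = dat,
                 ties = "efron")  # Efron approximation for tied failure times
summary(cox.fit)                  # the exp(coef) column gives the relative risks
##########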

20.5  Diagnostic Procedures

Depending on the survival regression model being used, the diagnostic measures presented here may have a slightly different formulation. We do present
somewhat of a general form for these measures, but the emphasis is on the
purpose of each measure. It should also be noted that one can perform formal hypothesis testing and construct statistical intervals based on various
estimates.
Cox-Snell Residuals

In the previous regression models we studied, residuals were defined as a difference between observed and fitted values. For survival regression, in order to check the overall fit of a model, the Cox-Snell residual for the ith observation in a data set is used and defined as

$$r_{C_i} = \hat{H}_0(t_i)\,e^{\mathbf{X}_i^T\hat{\boldsymbol{\beta}}}.$$

In the above, $\hat{\boldsymbol{\beta}}$ is the maximum likelihood estimate of the regression coefficient vector and $\hat{H}_0(t_i)$ is a maximum likelihood estimate of the baseline cumulative hazard function $H_0(t_i)$, defined as

$$H_0(t) = \int_0^t h_0(x)\,dx.$$

Notice that $r_{C_i} > 0$ for all i. The way we check for goodness-of-fit with the Cox-Snell residuals is to estimate the cumulative hazard rate of the residuals (call this $\hat{H}_{r_C}(r_{C_i})$) from whatever distribution you are assuming, and then plot $\hat{H}_{r_C}(r_{C_i})$ versus $r_{C_i}$. A good fit would be suggested if they form roughly a straight line (like we looked for in probability plots).
Martingale Residuals

Define a censoring indicator for the ith observation as

$$\delta_i = \begin{cases} 0, & \text{if observation } i \text{ is censored;} \\ 1, & \text{if observation } i \text{ is uncensored.} \end{cases}$$

In order to identify the best functional form for a covariate, given the assumed functional form of the remaining covariates, we use the Martingale residual for the ith observation, which is defined as

$$\hat{M}_i = \delta_i - r_{C_i}.$$

The $\hat{M}_i$ values fall in the interval $(-\infty, 1]$ and are always negative for censored values. The $\hat{M}_i$ values are plotted against the $x_{j,i}$, where j represents the index of the covariate for which we are trying to identify the best functional form. Plotting a smooth fitted curve over this data set will indicate what sort of function (if any) should be applied to $x_{j,i}$. Note that the martingale residuals are not symmetrically distributed about 0, but asymptotically they have mean 0.
Deviance Residuals

Outlier detection in a survival regression model can be done using the deviance residual for the ith observation:

$$D_i = \mathrm{sgn}(\hat{M}_i)\sqrt{-2\left(\ell_i(\hat{\boldsymbol{\theta}}) - \ell_{S_i}(\hat{\theta}_i)\right)}.$$

For $D_i$, $\ell_i(\hat{\boldsymbol{\theta}})$ is the ith log-likelihood evaluated at $\hat{\boldsymbol{\theta}}$, which is the maximum likelihood estimate of the model's parameter vector $\boldsymbol{\theta}$. $\ell_{S_i}(\hat{\theta}_i)$ is the log-likelihood of the saturated model evaluated at the maximum likelihood estimate $\hat{\theta}_i$. A saturated model is one where n parameters (i.e., $\theta_1, \ldots, \theta_n$) fit the n observations perfectly.

The $D_i$ values should behave like a standard normal sample. A normal probability plot of the $D_i$ values and a plot of $D_i$ versus the fitted $\ln(t)_i$ values will help to determine if any values are fairly far from the bulk of the data. It should be noted that this only applies to cases where light to moderate censoring occurs.
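A minimal sketch of extracting these residuals from a fitted Cox model in R is given below; cox.fit and the status column are hypothetical placeholders from a coxph() fit like the one sketched earlier.
##########
# A minimal sketch of residual-based diagnostics for a Cox model; cox.fit and
# dat$status are hypothetical placeholders.
mart <- residuals(cox.fit, type = "martingale")  # martingale residuals
dev  <- residuals(cox.fit, type = "deviance")    # deviance residuals
cs   <- dat$status - mart                        # Cox-Snell residuals: delta_i - M_i
qqnorm(dev)                                      # deviance residuals should look
qqline(dev)                                      #   roughly standard normal
##########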
Partial Deviance

Finally, we can also consider hierarchical (nested) models. We start by defining the model deviance:

$$\Lambda = \sum_{i=1}^{n} D_i^2.$$

Suppose we are interested in seeing if adding additional covariates to our model significantly improves the fit from our original model. Suppose we calculate the model deviances under each model. Denote these model deviances as $\Lambda_R$ and $\Lambda_F$ for the reduced model (our original model) and the full model (our model with all covariates included), respectively. Then, a measure of the fit can be obtained using the partial deviance:

$$\Lambda^* = \Lambda_R - \Lambda_F = -2\left(\hat{\ell}(\hat{\boldsymbol{\theta}}_R) - \hat{\ell}(\hat{\boldsymbol{\theta}}_F)\right) = -2\log\left(\frac{\hat{L}(\hat{\boldsymbol{\theta}}_R)}{\hat{L}(\hat{\boldsymbol{\theta}}_F)}\right),$$

where $\hat{\ell}(\hat{\boldsymbol{\theta}}_R)$ and $\hat{\ell}(\hat{\boldsymbol{\theta}}_F)$ are the log-likelihood functions evaluated at the maximum likelihood estimates of the reduced and full models, respectively. Luckily, this is a likelihood ratio statistic and has the corresponding asymptotic $\chi^2$ distribution. A large value of $\Lambda^*$ (large with respect to the corresponding $\chi^2$ distribution) indicates that the additional covariates improve the overall fit of the model. A small value of $\Lambda^*$ means they add nothing significant to the model and you can keep the original set of covariates. Notice that this procedure is similar to the extra sum of squares procedure developed in the previous course.
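For Cox models fit in R, this nested comparison can be carried out with anova(), which reports the likelihood ratio (partial deviance) test; the model objects and covariates below are hypothetical placeholders.
##########
# A minimal sketch of comparing nested Cox models; the data frame 'dat' and its
# covariates are hypothetical placeholders.
reduced.fit <- coxph(Surv(time, status) ~ age, data = dat)
full.fit    <- coxph(Surv(time, status) ~ age + group + weight, data = dat)
anova(reduced.fit, full.fit)   # likelihood ratio test against a chi-square reference
##########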

20.6  Truncated Regression Models

Truncated regression models are used in cases where observations with


values for the response variable that are below and/or above certain thresholds are systematically excluded from the sample. Therefore, entire observations are missing so that neither the dependent nor independent variables
STAT 501

D. S. Young

CHAPTER 20. REGRESSION MODELS WITH CENSORED DATA 301


are known. For example, suppose we had wages and years of schooling for
a sample of employees. Some persons for this study are excluded from the
sample because their earned wages fall below the minimum wage. So the
data would be missing for these individuals.
Truncated regression models are often confused with the censored regression models that we introduced earlier. In censored regression models, only
the value of the dependent variable is clustered at a lower and/or upper
threshold value, while values of the independent variable(s) are still known.
In truncated regression models, entire observations are systematically omitted from the sample based on the lower and/or upper threshold values. Regardless, if we know that the data has been truncated, we can adjust our
estimation technique to account for the bias introduced by omitting values
from the sample. This will allow for more accurate inferences about the entire population. However, if we are solely interested in the population that
does not fall outside the threshold value(s), then we can rely on standard
techniques that we have already introduced, namely ordinary least squares.
Let us formulate the general framework for truncated distributions. Suppose that X is a random variable with a probability density function $f_X$ and associated cumulative distribution function $F_X$ (the discrete setting is defined analogously). Consider the two-sided truncation $a < X < b$. Then the truncated distribution is given by

$$f_X(x\,|\,a < X < b) = \frac{g_X(x)}{F_X(b) - F_X(a)},$$

where

$$g_X(x) = \begin{cases} f_X(x), & a < x < b; \\ 0, & \text{otherwise.} \end{cases}$$

Similarly, one-sided truncated distributions can be defined by assuming a or b is set at the respective, natural bound of the support for the distribution of X (i.e., $F_X(a) = 0$ or $F_X(b) = 1$, respectively). So a bottom-truncated (or left-truncated) distribution is given by

$$f_X(x\,|\,a < X) = \frac{g_X(x)}{1 - F_X(a)},$$

while a top-truncated (or right-truncated) distribution is given by

$$f_X(x\,|\,X < b) = \frac{g_X(x)}{F_X(b)}.$$

gX (x) is then defined accordingly for whichever distribution with which you
are working.
Consider the canonical multiple linear regression model
Yi = XT
i + i ,
where i iid N (0, 2 ). If no truncation (or censoring) is assumed with the
data, then normal distribution theory yields
2
Yi |Xi N (XT
i , ).

When truncating the response, the distribution, and consequently the mean
and variance of the truncated distribution, must be adjusted accordingly.
Consider the three possible truncation settings of a < Y < b (two-sided
truncation), a < Yi (bottom-truncation), and Yi < b (top-truncation). Let
T
T
i = (a XT
i )/, i = (b Xi )/, and i = (yi Xi )/, such that yi
is the realization of the random variable Yi . Moreover, recall that (z) is the
inverse Mills ratio applied to the value of z and let
(z) = (z)[((z))1 1].
Then using established results for the truncated normal distribution, the three different truncated probability density functions are

$$f_{Y|\mathbf{X}}(y_i\,|\,\mathcal{T}, \mathbf{X}_i^T\boldsymbol{\beta}, \sigma) = \begin{cases} \dfrac{\frac{1}{\sigma}\phi(\zeta_i)}{1 - \Phi(\alpha_i)}, & \mathcal{T} = \{a < Y_i\} \text{ and } a < y_i; \\[8pt] \dfrac{\frac{1}{\sigma}\phi(\zeta_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)}, & \mathcal{T} = \{a < Y_i < b\} \text{ and } a < y_i < b; \\[8pt] \dfrac{\frac{1}{\sigma}\phi(\zeta_i)}{\Phi(\gamma_i)}, & \mathcal{T} = \{Y_i < b\} \text{ and } y_i < b, \end{cases}$$

while the respective truncated cumulative distribution functions are

$$F_{Y|\mathbf{X}}(y_i\,|\,\mathcal{T}, \mathbf{X}_i^T\boldsymbol{\beta}, \sigma) = \begin{cases} \dfrac{\Phi(\zeta_i) - \Phi(\alpha_i)}{1 - \Phi(\alpha_i)}, & \mathcal{T} = \{a < Y_i\} \text{ and } a < y_i; \\[8pt] \dfrac{\Phi(\zeta_i) - \Phi(\alpha_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)}, & \mathcal{T} = \{a < Y_i < b\} \text{ and } a < y_i < b; \\[8pt] \dfrac{\Phi(\zeta_i)}{\Phi(\gamma_i)}, & \mathcal{T} = \{Y_i < b\} \text{ and } y_i < b. \end{cases}$$


Furthermore, the means of the three different truncated distributions are

$$\mathrm{E}[Y_i\,|\,\mathcal{T}, \mathbf{X}_i] = \begin{cases} \mathbf{X}_i^T\boldsymbol{\beta} + \sigma\lambda(\alpha_i), & \mathcal{T} = \{a < Y_i\}; \\[6pt] \mathbf{X}_i^T\boldsymbol{\beta} + \sigma\dfrac{\phi(\alpha_i) - \phi(\gamma_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)}, & \mathcal{T} = \{a < Y_i < b\}; \\[6pt] \mathbf{X}_i^T\boldsymbol{\beta} - \sigma\kappa(\gamma_i), & \mathcal{T} = \{Y_i < b\}, \end{cases}$$

while the corresponding variances are

$$\mathrm{Var}[Y_i\,|\,\mathcal{T}, \mathbf{X}_i] = \begin{cases} \sigma^2\left\{1 - \lambda(\alpha_i)[\lambda(\alpha_i) - \alpha_i]\right\}, & \mathcal{T} = \{a < Y_i\}; \\[6pt] \sigma^2\left\{1 + \dfrac{\alpha_i\phi(\alpha_i) - \gamma_i\phi(\gamma_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)} - \left(\dfrac{\phi(\alpha_i) - \phi(\gamma_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)}\right)^2\right\}, & \mathcal{T} = \{a < Y_i < b\}; \\[6pt] \sigma^2\left\{1 - \kappa(\gamma_i)[\kappa(\gamma_i) + \gamma_i]\right\}, & \mathcal{T} = \{Y_i < b\}. \end{cases}$$

Using the distributions defined above, the likelihood function can be found and maximum likelihood procedures can be employed. Note that the likelihood functions will not have a closed-form solution, and thus numerical techniques must be employed to find the estimates of $\boldsymbol{\beta}$ and $\sigma$.
It is also important to underscore the type of estimation method used in
a truncated regression setting. The maximum likelihood estimation method
that we just described will be used when you are interested in a regression
equation that characterizes the entire population, including the observations
that were truncated. If you are interested in characterizing just the subpopulation of observations that were not truncated, then ordinary least squares
can be used. In the context of the example provided at the beginning of
this section, if we regressed wages on years of schooling and were only interested in the employees who made above the minimum wage, then ordinary
least squares can be used for estimation. However, if we were interested in
all of the employees, including those who happened to be excluded due to
not meeting the minimum wage threshold, then maximum likelihood can be
employed.
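As a rough sketch, the maximum likelihood approach can be carried out in R with the truncreg package, while the subpopulation analysis reduces to ordinary least squares; the data frame and variable names are hypothetical placeholders.
##########
# A minimal sketch of truncated regression by maximum likelihood with the
# truncreg package; the data frame 'scores' with columns x and y is a
# hypothetical placeholder.
library(truncreg)
trunc.fit <- truncreg(y ~ x, data = scores,
                      point = 50,          # the truncation threshold
                      direction = "left")  # observations with y below 50 excluded
summary(trunc.fit)

# If interest lies only in the non-truncated subpopulation, ordinary least
# squares on the observed cases can be used instead.
ols.fit <- lm(y ~ x, data = scores)
##########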

20.7  Examples

Example 1: Motor Dataset


This data set of size n = 16 contains observations from a temperature-accelerated life test for electric motors. The motorettes were tested at four
different temperature levels and when testing terminated, the failure times
were recorded. The data can be found in Table 20.1.
Hours   Censor   Count   Temperature
 8064      0       10        150
 1764      1        1        170
 2772      1        1        170
 3444      1        1        170
 3542      1        1        170
 3780      1        1        170
 4860      1        1        170
 5196      1        1        170
 5448      0        3        170
  408      1        2        190
 1344      1        2        190
 1440      1        1        190
 1680      0        5        190
  408      1        2        220
  504      1        3        220
  528      0        5        220

Table 20.1: The motor data set measurements, with censoring occurring if a 0 appears in the Censor column.

This data set is actually a very common data set analyzed in survival
analysis texts. We will proceed to fit it with a Weibull survival regression
model. The results from this analysis are
##########
              Value Std. Error     z        p
(Intercept) 17.0671    0.93588  18.24 2.65e-74
count        0.3180    0.15812   2.01 4.43e-02
temp        -0.0536    0.00591  -9.07 1.22e-19
Log(scale)  -1.2646    0.24485  -5.17 2.40e-07

Scale= 0.282

Weibull distribution
Loglik(model)= -95.6   Loglik(intercept only)= -110.5
        Chisq= 29.92 on 2 degrees of freedom, p= 3.2e-07
Number of Newton-Raphson Iterations: 9
n= 16
##########
As we can see, the two covariates are statistically significant. Furthermore,
the scale which is estimated (at 0.282) is the scale pertaining to the distribution being fit for this model (i.e., Weibull). It too is found to be statistically
significant.
While we appear to have a decent fitting model, we will turn to looking at
the deviance residuals. Figure 20.1(a) gives a plot of the deviance residuals
versus ln(time). As you can see, there does appear to be one value with a
deviance residual of almost -3. This value may be cause for concern. Also,
Figure 20.1(b) gives the NPP plot for these residuals and they do appear to
fit along a straight line, with the trend being somewhat impacted by that
residual in question. One could attempt to remove this point and rerun the
analysis, but the overall fit seems to be good and there are no indications in
the study that this was an incorrectly recorded point, so we will leave it in
the analysis.
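For reference, output like that shown above is typically produced by a call of the following form; the data frame name motor is a hypothetical placeholder, and the Censor column is assumed to be coded 1 for an observed failure and 0 for a censored time.
##########
# A minimal sketch of the Weibull survival regression for the motor data; the
# data frame name 'motor' and its lowercase column names are hypothetical
# placeholders.
library(survival)
weib.fit <- survreg(Surv(hours, censor) ~ count + temp,
                    data = motor, dist = "weibull")
summary(weib.fit)
##########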
Example 2: Logical Reasoning Dataset
Suppose that an educational researcher administered a (hypothetical) test
meant to relate one's logical reasoning with their mathematical skills. n =
100 participants were chosen for this study and the (simulated) data is provided in Table 20.2. The test consists of a logical reasoning section (where the
participants received a score between 0 and 10) and a mathematical problem
solving section (where the participants receive a score between 0 and 80).
The scores from the mathematics section (y) were regressed on the scores
from the logical reasoning section (x). The researcher was interested in only
those individuals who received a score of 50 or better on the mathematics
section as they would be used for the next portion of the study, so the data
was truncated at y = 50.
Figure 20.1: (a) Plot of the deviance residuals. (b) NPP plot for the deviance residuals.

Figure 20.2 shows the data with different regression fits depending on the assumptions that are made. The solid black circles are all of the participants with a score of 50 or better on the mathematics section, while the open red circles indicate those values that are either truncated (as in Figure 20.2(a))

or censored (as in Figure 20.2(b)). The dark blue line on the left is the
truncated regression line that is estimated using ordinary least squares. So
the interpretation of this line will only apply to those data that were not
truncated, which is what the researcher is interested in. The estimated model
is given below:
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  50.8473     1.1847  42.921  < 2e-16 ***
x             1.6884     0.1871   9.025 7.84e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.439 on 80 degrees of freedom
Multiple R-squared: 0.5045,     Adjusted R-squared: 0.4983
F-statistic: 81.46 on 1 and 80 DF,  p-value: 7.835e-14
##########
Notice that the estimated regression line never drops below the level of truncation (i.e., y = 50) within the domain of the x variable.
Suppose now that the researcher is interested in all of the data and, say,
there is some problem with recovering those participants that were in the
truncated portion of the sample. Then the truncated regression line can be
estimated via the method of maximum likelihood estimation, which is the
light blue line in Figure 20.2(a). This line can (and will likely) go beyond
the level of truncation since the estimation method is accounting for the
truncation. The estimated model is given below:
##########
Coefficients :
            Estimate Std. Error t-value  Pr(>|t|)
(Intercept) 47.69938    1.89772 25.1351 < 2.2e-16 ***
x            2.09223    0.26979  7.7551 8.882e-15 ***
sigma        4.81855    0.46211 10.4273 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -231.21 on 3 Df
##########
This will be helpful for the researcher to say something about the broader
population of individuals tested other than those who only received a math
score of 50 or higher. Moreover, notice that both methods yield highly significant slope and intercept terms for this data, as would be expected by
observing the strong linear trend in this data.
Figure 20.2(b) shows the estimate obtained when using a survival regression fit when assuming normal errors. Suppose the data had inadvertently
been censored at y = 50. So all of the red open circles now correspond to
a solid red circle in Figure 20.2(b). Since the data is now treated as left-censored, we are actually fitting a Tobit regression model. The Tobit fit is
given by the green line and the results are given below:
##########
              Value Std. Error    z       p
(Intercept)  52.21     1.0110  51.6 0.0e+00
x             1.50     0.1648   9.1 8.9e-20
Log(scale)    1.46     0.0766  19.0 7.8e-81

Scale= 4.3

Gaussian distribution
Loglik(model)= -241.6   Loglik(intercept only)= -267.1
        Chisq= 51.03 on 1 degrees of freedom, p= 9.1e-13
Number of Newton-Raphson Iterations: 5
n= 100
##########
Moreover, the dashed red line in both figures is the ordinary least squares fit
(assuming all of the data values are known and used in the estimation) and
is simply provided for comparative purposes. The estimates for this fit are
given below:
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  47.2611     0.9484   49.83   <2e-16 ***
x             2.1611     0.1639   13.19   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.778 on 98 degrees of freedom
Multiple R-squared: 0.6396,    Adjusted R-squared: 0.6359
F-statistic: 173.9 on 1 and 98 DF, p-value: < 2.2e-16
##########
As you can see, the structure of your data and the underlying assumptions can change your estimates, because you are in fact estimating different models. The regression lines in Figure 20.2 are a good example of how different assumptions can alter the final estimates that you report.
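To make the comparison concrete, here is a minimal R sketch (not the course's original code) of how the four fits in Figure 20.2 might be produced. The simulated data below only mimic the structure of Table 20.2, and the truncreg and AER packages are assumed to be installed.
##########
library(truncreg)   # truncated regression by maximum likelihood
library(AER)        # tobit(), a wrapper around survival::survreg()

set.seed(1)
scores <- data.frame(x = seq(0, 10, length.out = 100))
scores$y <- 47 + 2.2 * scores$x + rnorm(100, sd = 4.8)

## Ordinary least squares using all of the data (dashed red line).
fit_ols <- lm(y ~ x, data = scores)

## OLS using only the non-truncated observations (dark blue line).
fit_trunc_ols <- lm(y ~ x, data = subset(scores, y >= 50))

## Truncated-regression MLE (light blue line); truncation from below at 50.
fit_trunc_mle <- truncreg(y ~ x, data = subset(scores, y >= 50),
                          point = 50, direction = "left")

## Tobit (left-censored) fit (green line): scores below 50 recorded as 50.
scores$y_cens <- pmax(scores$y, 50)
fit_tobit <- tobit(y_cens ~ x, left = 50, data = scores)

summary(fit_trunc_mle)
summary(fit_tobit)
##########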

Figure 20.2: (a) A plot of the logical reasoning data. The red circles have been truncated as they fall below 50. The maximum likelihood fit for the truncated regression (solid light blue line) and the ordinary least squares fit for the truncated data set (solid dark blue line) are shown. The ordinary least squares line (which includes the truncated values for the estimation) is shown for reference. (b) The logical reasoning data with a Tobit regression fit provided (solid green line). The data has been censored at 50 (i.e., the solid red dots are included in the data). Again, the ordinary least squares line has been provided for reference.

   x      y      x      y      x      y      x      y
0.00  46.00   2.53  49.95   5.05  64.31   7.58  50.91
0.10  56.38   2.63  51.58   5.15  68.22   7.68  65.51
0.20  45.59   2.73  59.50   5.25  58.39   7.78  61.32
0.30  53.66   2.83  50.84   5.35  58.55   7.88  71.37
0.40  40.05   2.93  55.65   5.45  60.40   7.98  76.97
0.51  46.62   3.03  51.55   5.56  57.10   8.08  56.72
0.61  44.56   3.13  49.16   5.66  58.64   8.18  67.90
0.71  47.20   3.23  58.59   5.76  58.93   8.28  65.30
0.81  57.06   3.33  51.90   5.86  61.30   8.38  61.62
0.91  49.18   3.43  62.95   5.96  60.75   8.48  68.68
1.01  51.06   3.54  57.74   6.06  58.67   8.59  69.43
1.11  51.75   3.64  54.37   6.16  60.67   8.69  64.82
1.21  46.73   3.74  58.21   6.26  59.46   8.79  63.81
1.31  42.04   3.84  55.44   6.36  65.49   8.89  59.27
1.41  48.83   3.94  58.62   6.46  60.96   8.99  62.23
1.52  51.81   4.04  53.63   6.57  57.36   9.09  64.78
1.62  57.35   4.14  43.46   6.67  59.83   9.19  64.88
1.72  49.91   4.24  57.42   6.77  57.40   9.29  72.30
1.82  49.82   4.34  60.64   6.87  62.96   9.39  65.18
1.92  61.53   4.44  50.99   6.97  67.02   9.49  78.35
2.02  47.40   4.55  50.42   7.07  65.93   9.60  64.62
2.12  54.78   4.65  54.68   7.17  63.55   9.70  76.85
2.22  48.94   4.75  54.40   7.27  61.99   9.80  68.57
2.32  55.13   4.85  60.21   7.37  64.48   9.90  61.29
2.42  43.57   4.95  58.70   7.47  62.61  10.00  71.46

Table 20.2: The test scores from n = 100 participants for a logical reasoning section (x) and a mathematics section (y).

Chapter 21
Nonlinear Regression
All of the models we have discussed thus far have been linear in the parameters (i.e., linear in the beta terms). For example, polynomial regression was
used to model curvature in our data by using higher-ordered values of the
predictors. However, the final regression model was just a linear combination
of higher-ordered predictors.
Now we are interested in studying the nonlinear regression model:
$$Y = f(X, \beta) + \epsilon,$$
where $X$ is a vector of $p$ predictors, $\beta$ is a vector of $k$ parameters, $f(\cdot)$ is some known regression function, and $\epsilon$ is an error term whose distribution may or may not be normal. Notice that we no longer necessarily have the dimension of the parameter vector simply one greater than the number of predictors. Some examples of nonlinear regression models are:
$$y_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} + \epsilon_i,$$
$$y_i = \frac{\beta_0 + \beta_1 x_i}{1 + \beta_2 e^{\beta_3 x_i}} + \epsilon_i,$$
$$y_i = \beta_0 + (0.4 - \beta_0)e^{-\beta_1(x_i - 5)} + \epsilon_i.$$

However, there are some nonlinear models which are actually called intrinsically linear because they can be made linear in the parameters by a simple transformation. For example,
$$Y = \frac{\beta_0 X}{\beta_1 + X}$$
can be rewritten as
$$\frac{1}{Y} = \frac{1}{\beta_0} + \frac{\beta_1}{\beta_0}\cdot\frac{1}{X} = \beta_0^{*} + \beta_1^{*}X^{*},$$
which is linear in the transformed quantities $\beta_0^{*} = 1/\beta_0$, $\beta_1^{*} = \beta_1/\beta_0$, and $X^{*} = 1/X$. In such cases, transforming a model to its linear form often provides better inference procedures and confidence intervals, but one must be cognizant of the effects that the transformation has on the distribution of the errors.
We will discuss some of the basics of fitting and inference with nonlinear
regression models. There is a great deal of theory, practice, and computing
associated with nonlinear regression and we will only get to scratch the surface of this topic. We will then turn to a few specific regression models and
discuss generalized linear models.

21.1 Nonlinear Least Squares
We initially consider the setting
$$Y_i = f(X_i, \beta) + \epsilon_i,$$
where the $\epsilon_i$ are iid normal with mean 0 and constant variance $\sigma^2$. For this setting, we can rely on some of the least squares theory we have developed over the course. For other, nonnormal error terms, different techniques need to be employed.
First, let
$$Q = \sum_{i=1}^{n}\left(y_i - f(X_i, \beta)\right)^2.$$
In order to find
$$\hat{\beta} = \arg\min_{\beta} Q,$$
we first find each of the partial derivatives of Q with respect to $\beta_j$:
$$\frac{\partial Q}{\partial \beta_j} = -2\sum_{i=1}^{n}\left[y_i - f(X_i, \beta)\right]\frac{\partial f(X_i, \beta)}{\partial \beta_j}.$$
Then, we set each of the above partial derivatives equal to 0 and replace the parameters $\beta_k$ by their estimates $\hat{\beta}_k$. This yields the estimating equations
$$\sum_{i=1}^{n}y_i\frac{\partial f(X_i, \beta)}{\partial \beta_k}\bigg|_{\beta=\hat{\beta}} - \sum_{i=1}^{n}f(X_i, \hat{\beta})\frac{\partial f(X_i, \beta)}{\partial \beta_k}\bigg|_{\beta=\hat{\beta}} = 0,$$
for $k = 0, 1, \ldots, p - 1$.
The solutions to these estimating equations are nonlinear in the parameter estimates $\hat{\beta}_k$ and are often difficult to solve, even in the simplest cases. Hence, iterative numerical methods are often employed. Even more difficulty arises in that multiple solutions may be possible!

21.1.1 A Few Algorithms
We will discuss a few incarnations of methods used in nonlinear least squares estimation. It should be noted that this is NOT an exhaustive list of algorithms, but rather an introduction to some of the more commonly implemented algorithms.
First let us introduce some notation used in these algorithms:

Since these numerical algorithms are iterative, let $\hat{\beta}^{(t)}$ be the estimated value of $\beta$ at time $t$. When $t = 0$, this symbolizes a user-specified starting value for the algorithm.

Let
$$\epsilon = \begin{pmatrix}\epsilon_1\\ \vdots\\ \epsilon_n\end{pmatrix} = \begin{pmatrix}y_1 - f(X_1, \beta)\\ \vdots\\ y_n - f(X_n, \beta)\end{pmatrix}$$
be the n-dimensional vector of error terms, and let $e$ again denote the residual vector.

Let
$$\nabla Q(\beta) = \frac{\partial Q(\beta)}{\partial \beta} = \begin{pmatrix}\frac{\partial\|\epsilon\|^2}{\partial\beta_1}\\ \vdots\\ \frac{\partial\|\epsilon\|^2}{\partial\beta_k}\end{pmatrix}$$
be the gradient of the sum of squared errors, where $Q(\beta) = \|\epsilon\|^2 = \sum_{i=1}^{n}\epsilon_i^2$ is the sum of squared errors.

Let
$$J(\beta) = \begin{pmatrix}\frac{\partial\epsilon_1}{\partial\beta_1} & \cdots & \frac{\partial\epsilon_1}{\partial\beta_k}\\ \vdots & \ddots & \vdots\\ \frac{\partial\epsilon_n}{\partial\beta_1} & \cdots & \frac{\partial\epsilon_n}{\partial\beta_k}\end{pmatrix}$$
be the Jacobian matrix.

Let
$$H(\beta) = \frac{\partial^2 Q(\beta)}{\partial\beta\,\partial\beta^{T}} = \begin{pmatrix}\frac{\partial^2\|\epsilon\|^2}{\partial\beta_1\partial\beta_1} & \cdots & \frac{\partial^2\|\epsilon\|^2}{\partial\beta_1\partial\beta_k}\\ \vdots & \ddots & \vdots\\ \frac{\partial^2\|\epsilon\|^2}{\partial\beta_k\partial\beta_1} & \cdots & \frac{\partial^2\|\epsilon\|^2}{\partial\beta_k\partial\beta_k}\end{pmatrix}$$
be the Hessian matrix (i.e., the matrix of second-order and mixed partial derivatives).

In the following algorithms, we will use the notation established above with $\nabla Q(\hat{\beta}^{(t)}) = \nabla Q(\beta)|_{\beta=\hat{\beta}^{(t)}}$, $H(\hat{\beta}^{(t)}) = H(\beta)|_{\beta=\hat{\beta}^{(t)}}$, and $J(\hat{\beta}^{(t)}) = J(\beta)|_{\beta=\hat{\beta}^{(t)}}$.
The classical method based on the gradient approach is Newton's method, which starts at $\hat{\beta}^{(0)}$ and iteratively calculates
$$\hat{\beta}^{(t+1)} = \hat{\beta}^{(t)} - [H(\hat{\beta}^{(t)})]^{-1}\nabla Q(\hat{\beta}^{(t)})$$
until a convergence criterion is achieved. The difficulty in this approach is that inversion of the Hessian matrix can be computationally difficult. In particular, the Hessian is not always positive definite unless the algorithm is initialized with a good starting value, which may be difficult to find.
A modification to Newton's method is the Gauss-Newton algorithm, which, unlike Newton's method, can only be used to minimize a sum of squares function. The advantage of the Gauss-Newton algorithm is that it no longer requires calculation of the Hessian matrix, but rather approximates it using the Jacobian. The gradient and the approximation to the Hessian matrix can be written as
$$\nabla Q(\beta) = 2J(\beta)^{T}\epsilon \quad\text{and}\quad H(\beta) \approx 2J(\beta)^{T}J(\beta).$$
Thus, the iterative approximation based on the Gauss-Newton method yields
$$\hat{\beta}^{(t+1)} = \hat{\beta}^{(t)} - \Delta(\hat{\beta}^{(t)}) = \hat{\beta}^{(t)} - [J(\hat{\beta}^{(t)})^{T}J(\hat{\beta}^{(t)})]^{-1}J(\hat{\beta}^{(t)})^{T}e,$$
where we have defined $\Delta(\hat{\beta}^{(t)})$ to be everything that is subtracted from $\hat{\beta}^{(t)}$.
Convergence is not always guaranteed with the Gauss-Newton algorithm. Since the steps for this method may be too large (thus leading to divergence), one can incorporate a partial step by using
$$\hat{\beta}^{(t+1)} = \hat{\beta}^{(t)} - \alpha\,\Delta(\hat{\beta}^{(t)})$$
such that $0 < \alpha < 1$. However, if $\alpha$ is close to 0, an alternative method is the Levenberg-Marquardt method, which calculates
$$\Delta(\hat{\beta}^{(t)}) = (J(\hat{\beta}^{(t)})^{T}J(\hat{\beta}^{(t)}) + \lambda D)^{-1}J(\hat{\beta}^{(t)})^{T}e,$$
where $D$ is a positive diagonal matrix (often taken as the identity matrix) and $\lambda$ is the so-called Marquardt parameter. The above is optimized over $\lambda$, which limits the length of the step taken at each iteration and improves an ill-conditioned Hessian matrix.
For these algorithms, you will want to try the easiest one to calculate for a given nonlinear problem. Ideally, you would like to be able to use the algorithms in the order they were presented. Newton's method will give you an accurate estimate if the Hessian is not ill-conditioned. The Gauss-Newton algorithm will give you a good approximation to the solution Newton's method should have arrived at, but convergence is not always guaranteed. Finally, the Levenberg-Marquardt method can take care of computational difficulties arising with the other methods, but searching for $\lambda$ can be tedious.
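As a rough illustration (a sketch under assumptions, not production code or the course's own implementation), the R function below iterates the Gauss-Newton update for an invented exponential-saturation mean function. It uses the Jacobian of the mean function (rather than of the residuals), so the step is added instead of subtracted, and the Jacobian is approximated numerically.
##########
gauss_newton <- function(x, y, beta, f, max_iter = 50, tol = 1e-8) {
  for (t in seq_len(max_iter)) {
    e <- y - f(x, beta)                      # current residual vector
    ## crude numerical Jacobian of the mean function with respect to beta
    J <- sapply(seq_along(beta), function(j) {
      h <- 1e-6 * max(abs(beta[j]), 1)
      bp <- beta; bp[j] <- bp[j] + h
      (f(x, bp) - f(x, beta)) / h
    })
    step <- solve(crossprod(J), crossprod(J, e))   # (J'J)^{-1} J'e
    beta <- beta + as.vector(step)
    if (sqrt(sum(step^2)) < tol) break
  }
  beta
}

## Example usage on simulated data (values here are invented):
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 5 * (1 - exp(-0.7 * x)) + rnorm(50, sd = 0.2)
f <- function(x, b) b[1] * (1 - exp(-b[2] * x))
gauss_newton(x, y, beta = c(1, 1), f = f)
##########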

21.2 Exponential Regression
One simple nonlinear model is the exponential regression model
$$y_i = \beta_0 + \beta_1 e^{\beta_2 x_{i,1} + \ldots + \beta_{p+1}x_{i,p}} + \epsilon_i,$$
where the $\epsilon_i$ are iid normal with mean 0 and constant variance $\sigma^2$. Notice that if $\beta_0 = 0$, then the above is intrinsically linear by taking the natural logarithm of both sides.
Exponential regression is probably one of the simplest nonlinear regression models. An example where an exponential regression is often utilized is when relating the concentration of a substance (the response) to elapsed time (the predictor).
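As a hypothetical illustration (the data, variable names, and starting values below are invented), such a model can be fit in R with nls():
##########
set.seed(1)
dat <- data.frame(time = 0:10)
dat$conc <- 10 * exp(-0.3 * dat$time) + rnorm(11, sd = 0.2)

fit_exp <- nls(conc ~ b0 + b1 * exp(b2 * time), data = dat,
               start = list(b0 = 0, b1 = 10, b2 = -0.3))
summary(fit_exp)
##########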

21.3 Logistic Regression
Logistic regression models the relationship between predictor variables and a categorical response variable. For example, you could use logistic regression to model the relationship between various measurements of a manufactured specimen (such as dimensions and chemical composition) and whether a crack greater than 10 mils will occur (a binary variable: either yes or no). Logistic regression helps us estimate the probability of falling into a certain level of the categorical response given a set of predictors. You can choose from three types of logistic regression, depending on the nature of your categorical response variable:
Binary Logistic Regression: Used when the response is binary (i.e., it has two possible outcomes). The cracking example given above would utilize binary logistic regression. Other examples of binary responses could include passing or failing a test, responding yes or no on a survey, and having high or low blood pressure.

Nominal Logistic Regression: Used when there are three or more categories with no natural ordering to the levels. Examples of nominal responses could include departments at a business (e.g., marketing, sales, HR), type of search engine used (e.g., Google, Yahoo!, MSN), and color (black, red, blue, orange).

Ordinal Logistic Regression: Used when there are three or more categories with a natural ordering to the levels, but the ranking of the levels does not necessarily mean the intervals between them are equal. Examples of ordinal responses could be how you rate the effectiveness of a college course on a scale of 1-5, levels of flavors for hot wings, and medical condition (e.g., good, stable, serious, critical).
The problems with logistic regression include nonnormal error terms, nonconstant error variance, and constraints on the response function (i.e., the response is bounded between 0 and 1). We will investigate ways of dealing with these in the logistic regression setting.

21.3.1 Binary Logistic Regression
The multiple binary logistic regression model is the following:
$$\pi = \frac{e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_{p-1}X_{p-1}}}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_{p-1}X_{p-1}}} = \frac{e^{X^{T}\beta}}{1 + e^{X^{T}\beta}}, \qquad (21.1)$$
where here $\pi$ denotes a probability and not the irrational number.


$\pi$ is the probability that an observation is in a specified category of the binary Y variable. Notice that the model describes the probability of an event happening as a function of the X variables; for instance, it might provide estimates of the probability that an older person has heart disease. With the logistic model, estimates of $\pi$ from equation (21.1) will always be between 0 and 1. The reasons are:

The numerator $e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_{p-1}X_{p-1}}$ must be positive, because it is a power of a positive value (e).

The denominator of the model is (1 + numerator), so the answer will always be less than 1.

With one X variable, the theoretical model for $\pi$ has an elongated S shape (or sigmoidal shape) with asymptotes at 0 and 1, although in sample estimates we may not see this S shape if the range of the X variable is limited.
There are algebraically equivalent ways to write the logistic regression model in equation (21.1).
First is
$$\frac{\pi}{1 - \pi} = e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_{p-1}X_{p-1}}, \qquad (21.2)$$
which is an equation that describes the odds of being in the current category of interest. By definition, the odds for an event is $P/(1 - P)$ such that P is the probability of the event. For example, if you are at the racetrack and there is an 80% chance that a certain horse will win the race, then the odds are .80/(1 - .80) = 4, or 4:1.
Second is
$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X_1 + \ldots + \beta_{p-1}X_{p-1}, \qquad (21.3)$$
which states that the logarithm of the odds is a linear function of the X variables (and is often called the log odds).
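As a hedged illustration (the data below are simulated and the object names are not from the text), a binary logistic model of this form can be fit in R with glm() and the logit link; exponentiating the coefficients gives the odds interpretation of equation (21.2).
##########
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- rbinom(100, 1, plogis(-0.5 + 1.2 * dat$x1 - 0.8 * dat$x2))

fit <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))
summary(fit)                            # Wald z-tests for each coefficient
exp(coef(fit))                          # odds ratios e^{beta_j}
head(predict(fit, type = "response"))   # fitted probabilities pi-hat
##########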
In order to discuss goodness-of-fit measures and residual diagnostics for binary logistic regression, it is necessary to at least define the likelihood (see Appendix C for a further discussion). For a sample of size n, the likelihood for a binary logistic regression is given by:
$$L(\beta; y, X) = \prod_{i=1}^{n}\pi_i^{y_i}(1 - \pi_i)^{1 - y_i} = \prod_{i=1}^{n}\left(\frac{e^{X_i^{T}\beta}}{1 + e^{X_i^{T}\beta}}\right)^{y_i}\left(\frac{1}{1 + e^{X_i^{T}\beta}}\right)^{1 - y_i}.$$
This yields the log likelihood:
$$\ell(\beta) = \sum_{i=1}^{n}\left[y_i X_i^{T}\beta - \log(1 + e^{X_i^{T}\beta})\right].$$
Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like iteratively reweighted least squares is used to find an estimate of the regression coefficients, $\hat{\beta}$. Once this value of $\hat{\beta}$ has been obtained, we may proceed to define various goodness-of-fit measures and calculated residuals. The residuals we present serve the same purpose as in linear regression: when plotted versus the response, they will help identify suspect data points. It should also be noted that the following is by no means an exhaustive list of diagnostic procedures, but rather some of the more common methods that are used.

Odds Ratio

The odds ratio (which we will denote by OR) determines the relationship between a predictor and the response and is available only when the logit link is used. The odds ratio can be any nonnegative number. An odds ratio of 1 serves as the baseline for comparison and indicates there is no association between the response and predictor. If the odds ratio is greater than 1, then the odds of success are higher for the reference level of the factor (or for higher levels of a continuous predictor). If the odds ratio is less than 1, then the odds of success are less for the reference level of the factor (or for higher levels of a continuous predictor). Values farther from 1 represent stronger degrees of association. For binary logistic regression, the odds of success are
$$\frac{\pi}{1 - \pi} = e^{X^{T}\beta}.$$
This exponential relationship provides an interpretation for $\beta$: the odds increase multiplicatively by $e^{\beta_j}$ for every one-unit increase in $X_j$. More formally, the odds ratio between two sets of predictors (say $X^{(1)}$ and $X^{(2)}$) is given by
$$\mathrm{OR} = \frac{\left(\pi/(1-\pi)\right)|_{X=X^{(1)}}}{\left(\pi/(1-\pi)\right)|_{X=X^{(2)}}}.$$
Wald Test

The Wald test is the test of significance for regression coefficients in logistic regression (recall that we use t-tests in linear regression). For maximum likelihood estimates, the ratio
$$Z = \frac{\hat{\beta}_i}{\mathrm{s.e.}(\hat{\beta}_i)}$$
can be used to test $H_0: \beta_i = 0$. The standard normal curve is used to determine the p-value of the test. Furthermore, confidence intervals can be constructed as
$$\hat{\beta}_i \pm z_{1-\alpha/2}\,\mathrm{s.e.}(\hat{\beta}_i).$$
Raw Residual

The raw residual is the difference between the actual response and the estimated probability from the model. The formula for the raw residual is
$$r_i = y_i - \hat{\pi}_i.$$
Pearson Residual

The Pearson residual corrects for the unequal variance in the raw residuals by dividing by the standard deviation. The formula for the Pearson residuals is
$$p_i = \frac{r_i}{\sqrt{\hat{\pi}_i(1 - \hat{\pi}_i)}}.$$
Deviance Residuals

Deviance residuals are also popular because the sum of squares of these residuals is the deviance statistic. The formula for the deviance residual is
$$d_i = \mathrm{sgn}(y_i - \hat{\pi}_i)\sqrt{2\left[y_i\log\left(\frac{y_i}{\hat{\pi}_i}\right) + (1 - y_i)\log\left(\frac{1 - y_i}{1 - \hat{\pi}_i}\right)\right]}.$$
Hat Values

The hat matrix serves a similar purpose as in the case of linear regression (to measure the influence of each observation on the overall fit of the model), but the interpretation is not as clear due to its more complicated form. The hat values are given by
$$h_{i,i} = \hat{\pi}_i(1 - \hat{\pi}_i)x_i^{T}(\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}x_i,$$
where $\mathbf{W}$ is an $n \times n$ diagonal matrix with the values of $\hat{\pi}_i(1 - \hat{\pi}_i)$ for $i = 1, \ldots, n$ on the diagonal. As before, a hat value is large if $h_{i,i} > 2p/n$.
Studentized Residuals

We can also report Studentized versions of some of the earlier residuals. The Studentized Pearson residuals are given by
$$sp_i = \frac{p_i}{\sqrt{1 - h_{i,i}}}$$
and the Studentized deviance residuals are given by
$$sd_i = \frac{d_i}{\sqrt{1 - h_{i,i}}}.$$
C and $\bar{C}$

C and $\bar{C}$ are extensions of Cook's distance for logistic regression. $\bar{C}$ measures the overall change in fitted logits due to deleting the ith observation for all points excluding the one deleted, while C includes the deleted point. They are defined by:
$$C_i = \frac{p_i^2 h_{i,i}}{(1 - h_{i,i})^2}$$
and
$$\bar{C}_i = \frac{p_i^2 h_{i,i}}{1 - h_{i,i}}.$$

Goodness-of-Fit Tests

Overall performance of the fitted model can be measured by two different chi-square tests. There is the Pearson chi-square statistic
$$P = \sum_{i=1}^{n}p_i^2$$
and the deviance statistic
$$G = \sum_{i=1}^{n}d_i^2.$$
Both of these statistics are approximately chi-square distributed with $n - p$ degrees of freedom. When a test is rejected, there is a statistically significant lack of fit. Otherwise, there is no evidence of lack of fit.
These goodness-of-fit tests are analogous to the F-test in the analysis of variance table for ordinary regression. The null hypothesis is
$$H_0: \beta_1 = \beta_2 = \ldots = \beta_{k-1} = 0.$$
A significant p-value means that at least one of the X variables is a predictor of the probabilities of interest.
In general, one can also use the likelihood ratio test for testing the null hypothesis that any subset of the $\beta$'s is equal to 0. Suppose we test that $r < p$ of the $\beta$'s are equal to 0. Then the likelihood ratio test statistic is given by:
$$-2\left(\ell(\hat{\beta}^{(0)}) - \ell(\hat{\beta})\right),$$
where $\ell(\hat{\beta}^{(0)})$ is the log likelihood of the model specified by the null hypothesis, evaluated at the maximum likelihood estimate of that reduced model. This test statistic has a $\chi^2$ distribution with $p - r$ degrees of freedom.
One additional test is Brown's test, which has a test statistic to judge the fit of the logistic model to the data. The formula for the general alternative with two degrees of freedom is:
$$T = s^{T}C^{-1}s,$$
where $s^{T} = (s_1, s_2)$ and $C$ is the covariance matrix of $s$. The formulas for $s_1$ and $s_2$ are:
$$s_1 = \sum_{i=1}^{n}(y_i - \hat{\pi}_i)\left(1 + \frac{\log(\hat{\pi}_i)}{1 - \hat{\pi}_i}\right)$$
$$s_2 = \sum_{i=1}^{n}(y_i - \hat{\pi}_i)\left(1 + \frac{\log(1 - \hat{\pi}_i)}{\hat{\pi}_i}\right).$$
The formula for the symmetric alternative with 1 degree of freedom is:
$$\frac{(s_1 + s_2)^2}{\mathrm{Var}(s_1 + s_2)}.$$
To interpret the test, if the p-value is less than your accepted significance level, then reject the null hypothesis that the model fits the data adequately.
DFDEV and DFCHI

DFDEV and DFCHI are statistics that measure the change in deviance and in Pearson's chi-square, respectively, that occurs when an observation is deleted from the data set. Large values of these statistics indicate observations that have not been fitted well. The formulas for these statistics are
$$\mathrm{DFDEV}_i = d_i^2 + \bar{C}_i$$
and
$$\mathrm{DFCHI}_i = \frac{\bar{C}_i}{h_{i,i}}.$$

$R^2$

The calculation of $R^2$ used in linear regression does not extend directly to logistic regression. The version of $R^2$ used in logistic regression is defined as
$$R^2 = \frac{\ell(\hat{\beta}_0) - \ell(\hat{\beta})}{\ell(\hat{\beta}_0) - \ell_S(\hat{\beta})},$$
where $\ell(\hat{\beta}_0)$ is the log likelihood of the model when only the intercept is included and $\ell_S(\hat{\beta})$ is the log likelihood of the saturated model (i.e., where a model is fit perfectly to the data). This $R^2$ does go from 0 to 1, with 1 being a perfect fit.
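As a hedged sketch (object names are illustrative, not the text's code), the diagnostics above can be computed in R from the glm() fit `fit` of the earlier sketch:
##########
pi_hat <- fitted(fit)                        # estimated probabilities
r_raw  <- residuals(fit, type = "response")  # raw residuals y - pi-hat
r_pear <- residuals(fit, type = "pearson")   # Pearson residuals
r_dev  <- residuals(fit, type = "deviance")  # deviance residuals
h      <- hatvalues(fit)                     # hat values h_{i,i}

P <- sum(r_pear^2)   # Pearson chi-square statistic
G <- sum(r_dev^2)    # deviance statistic (equals fit$deviance)

sp_res <- r_pear / sqrt(1 - h)   # Studentized Pearson residuals
sd_res <- r_dev  / sqrt(1 - h)   # Studentized deviance residuals
##########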

21.3.2 Nominal Logistic Regression
In binary logistic regression, we only had two possible outcomes. For nominal logistic regression, we will consider the possibility of having k possible outcomes. When k > 2, such responses are known as polytomous.¹ The multiple nominal logistic regression model (sometimes called the multinomial logistic regression model) is given by the following:
$$\pi_j = \begin{cases}\dfrac{e^{X^{T}\beta_j}}{1 + \sum_{j'=2}^{k}e^{X^{T}\beta_{j'}}}, & j = 2, \ldots, k;\\[2ex] \dfrac{1}{1 + \sum_{j'=2}^{k}e^{X^{T}\beta_{j'}}}, & j = 1,\end{cases} \qquad (21.4)$$
where again $\pi_j$ denotes a probability and not the irrational number. Notice that $k - 1$ of the groups have their own set of $\beta$ values. Furthermore, since $\sum_{j=1}^{k}\pi_j = 1$, we set the $\beta$ values for group 1 equal to 0 (this is what we call the reference group). Notice that when k = 2, we are back to binary logistic regression.
$\pi_j$ is the probability that an observation is in one of the k categories. The likelihood for the nominal logistic regression model is given by:
$$L(\beta; y, X) = \prod_{i=1}^{n}\prod_{j=1}^{k}\pi_{i,j}^{y_{i,j}}(1 - \pi_{i,j})^{1 - y_{i,j}},$$
where the subscript (i, j) means the ith observation belongs to the jth group. This yields the log likelihood:
$$\ell(\beta) = \sum_{i=1}^{n}\sum_{j=1}^{k}\left[y_{i,j}\log(\pi_{i,j}) + (1 - y_{i,j})\log(1 - \pi_{i,j})\right].$$

Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like iteratively reweighted least squares is used to find an estimate of the regression coefficients, $\hat{\beta}$.

¹The word polychotomous is sometimes used, but note that this is not actually a word!

An odds ratio (OR) of 1 serves as the baseline for comparison. If OR = 1, then there is no association between the response and predictor. If OR > 1, then the odds of success are higher for the reference level of the factor (or for higher levels of a continuous predictor). If OR < 1, then the odds of success are less for the reference level of the factor (or for higher levels of a continuous predictor). Values farther from 1 represent stronger degrees of association. For nominal logistic regression, the odds ratio comparing two different levels of the predictors, say $X^{(1)}$ and $X^{(2)}$, is:
$$\mathrm{OR} = \frac{(\pi_j/\pi_1)|_{X=X^{(1)}}}{(\pi_j/\pi_1)|_{X=X^{(2)}}}.$$
Many of the procedures discussed for binary logistic regression can be extended to nominal logistic regression with the appropriate modifications.
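As a hedged sketch (simulated data, not from the text), a nominal logistic regression can be fit in R with nnet::multinom(); the first level of the factor response acts as the reference group.
##########
library(nnet)
set.seed(1)
dat <- data.frame(x1 = rnorm(150), x2 = rnorm(150))
dat$group <- factor(sample(c("A", "B", "C"), 150, replace = TRUE))

fit_nom <- multinom(group ~ x1 + x2, data = dat, trace = FALSE)
summary(fit_nom)
exp(coef(fit_nom))   # odds ratios of each level relative to the reference "A"
##########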

21.3.3 Ordinal Logistic Regression
For ordinal logistic regression, we again consider k possible outcomes as in nominal logistic regression, except that the order matters. The multiple ordinal logistic regression model is the following:
$$\sum_{j=1}^{k^{*}}\pi_j = \frac{e^{\beta_{0,k^{*}} + X^{T}\beta}}{1 + e^{\beta_{0,k^{*}} + X^{T}\beta}}, \qquad (21.5)$$
such that $k^{*} \le k$, $\beta_{0,1} \le \beta_{0,2} \le \ldots \le \beta_{0,k}$, and again $\pi_j$ denotes a probability. Notice that this model is a cumulative sum of probabilities which involves just changing the intercept of the linear regression portion (so $\beta$ is now $(p-1)$-dimensional and $\mathbf{X}$ is $n \times (p-1)$ such that the first column of this matrix is not a column of 1's). Also, it still holds that $\sum_{j=1}^{k}\pi_j = 1$.
$\pi_j$ is still the probability that an observation is in one of the k categories, but we are constrained by the model written in equation (21.5). The likelihood for the ordinal logistic regression model is given by:
$$L(\beta; y, X) = \prod_{i=1}^{n}\prod_{j=1}^{k}\pi_{i,j}^{y_{i,j}}(1 - \pi_{i,j})^{1 - y_{i,j}},$$
where the subscript (i, j) means the ith observation belongs to the jth group.
This yields the log likelihood:
$$\ell(\beta) = \sum_{i=1}^{n}\sum_{j=1}^{k}\left[y_{i,j}\log(\pi_{i,j}) + (1 - y_{i,j})\log(1 - \pi_{i,j})\right].$$
Notice that this is identical to the nominal logistic regression likelihood. Thus, maximization again has no closed-form solution, so we defer to a procedure like iteratively reweighted least squares.
For ordinal logistic regression, a proportional odds model is used to determine the odds ratio. Again, an odds ratio (OR) of 1 serves as the baseline for comparison between two predictor levels, say $X^{(1)}$ and $X^{(2)}$. Only one parameter and one odds ratio is calculated for each predictor. Suppose we are interested in calculating the odds of $X^{(1)}$ relative to $X^{(2)}$. If OR = 1, then there is no association between the response and these two predictor levels. If OR > 1, then the odds of success are higher for the predictor level $X^{(1)}$. If OR < 1, then the odds of success are less for the predictor level $X^{(1)}$. Values farther from 1 represent stronger degrees of association. For ordinal logistic regression, the odds ratio utilizes cumulative probabilities and their complements and is given by:
$$\mathrm{OR} = \frac{\left.\sum_{j=1}^{k^{*}}\pi_j\right|_{X=X^{(1)}}\Big/\left(1 - \left.\sum_{j=1}^{k^{*}}\pi_j\right|_{X=X^{(1)}}\right)}{\left.\sum_{j=1}^{k^{*}}\pi_j\right|_{X=X^{(2)}}\Big/\left(1 - \left.\sum_{j=1}^{k^{*}}\pi_j\right|_{X=X^{(2)}}\right)}.$$
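As a hedged sketch (the ordered response below is simulated by cutting a latent variable; none of this is the text's code), a proportional odds model can be fit in R with MASS::polr():
##########
library(MASS)
set.seed(1)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
latent <- 1.5 * dat$x1 - 1.0 * dat$x2 + rlogis(200)
dat$rating <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
                  labels = c("low", "medium", "high"), ordered_result = TRUE)

fit_ord <- polr(rating ~ x1 + x2, data = dat, method = "logistic")
summary(fit_ord)
exp(coef(fit_ord))   # proportional-odds ratios for the predictors
##########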

21.4 Poisson Regression
The Poisson distribution for a random variable X has the following probability mass function for a given value X = x:
$$P(X = x|\lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!},$$
for $x = 0, 1, 2, \ldots$. Notice that the Poisson distribution is characterized by the single parameter $\lambda$, which is the mean rate of occurrence for the event being measured. For the Poisson distribution, it is assumed that large counts (with respect to the value of $\lambda$) are rare.
Poisson regression is similar to logistic regression in that the dependent variable (Y) is not a continuous, normally distributed response. Specifically, Y is an observed count that follows the Poisson distribution, but the rate $\lambda$ is now determined by a set of p predictors $X = (X_1, \ldots, X_p)^{T}$. The expression relating these quantities is
$$\lambda = \exp\{X^{T}\beta\}.$$
Thus, the fundamental Poisson regression model for observation i is given by
$$P(Y_i = y_i|X_i, \beta) = \frac{e^{-\exp\{X_i^{T}\beta\}}\exp\{X_i^{T}\beta\}^{y_i}}{y_i!}.$$
That is, for a given set of predictors, the outcome follows a Poisson distribution with rate $\exp\{X^{T}\beta\}$.
In order to discuss goodness-of-fit measures and residual diagnostics for Poisson regression, it is necessary to at least define the likelihood. For a sample of size n, the likelihood for a Poisson regression is given by:
$$L(\beta; y, X) = \prod_{i=1}^{n}\frac{e^{-\exp\{X_i^{T}\beta\}}\exp\{X_i^{T}\beta\}^{y_i}}{y_i!}.$$
This yields the log likelihood:
$$\ell(\beta) = \sum_{i=1}^{n}y_i X_i^{T}\beta - \sum_{i=1}^{n}\exp\{X_i^{T}\beta\} - \sum_{i=1}^{n}\log(y_i!).$$
Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like iteratively reweighted least squares is used to find an estimate of the regression coefficients, $\hat{\beta}$. Once this value of $\hat{\beta}$ has been obtained, we may proceed to define various goodness-of-fit measures and calculated residuals. The residuals we present serve the same purpose as in linear regression: when plotted versus the response, they will help identify suspect data points.
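As a hedged sketch (simulated count data, not from the text; the rate mirrors the one used in Example 3 later in this chapter), a Poisson regression with the canonical log link is fit in R with glm():
##########
set.seed(1)
dat <- data.frame(x = runif(30, 2, 24))
dat$y <- rpois(30, lambda = exp(0.50 + 0.07 * dat$x))

fit_pois <- glm(y ~ x, data = dat, family = poisson(link = "log"))
summary(fit_pois)
exp(coef(fit_pois))   # multiplicative effect on the rate per unit increase in x
##########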
Goodness-of-Fit

Overall performance of the fitted model can be measured by two different chi-square tests. There is the Pearson statistic
$$P = \sum_{i=1}^{n}\frac{(y_i - \exp\{X_i^{T}\hat{\beta}\})^2}{\exp\{X_i^{T}\hat{\beta}\}}$$
and the deviance statistic
$$G = 2\sum_{i=1}^{n}\left[y_i\log\left(\frac{y_i}{\exp\{X_i^{T}\hat{\beta}\}}\right) - (y_i - \exp\{X_i^{T}\hat{\beta}\})\right].$$

Both of these statistics are approximately chi-square distributed with $n - p$ degrees of freedom. When a test is rejected, there is a statistically significant lack of fit. Otherwise, there is no evidence of lack of fit.
Overdispersion means that the actual covariance matrix for the observed data exceeds that for the specified model for Y|X. For a Poisson distribution, the mean and the variance are equal, but in practice the data almost never reflect this fact, and we often have overdispersion in the Poisson regression model because the variance is greater than the mean. In addition to testing goodness-of-fit, the Pearson statistic can also be used as a test of overdispersion. Note that overdispersion can also be measured in the logistic regression models that were discussed earlier.
Deviance

Recall the measure of deviance introduced in the study of survival regressions and logistic regression. The measure of deviance for the Poisson regression setting is given by
$$D(y, \hat{\mu}) = 2\left[\ell_S(\hat{\beta}) - \ell(\hat{\beta})\right],$$
where $\ell_S(\hat{\beta})$ is the log likelihood of the saturated model (i.e., where a model is fit perfectly to the data). This measure of deviance (which differs from the deviance statistic defined earlier) is a generalization of the sum of squares from linear regression. The deviance also has an approximate chi-square distribution.
Pseudo $R^2$

The value of $R^2$ used in linear regression also does not extend to Poisson regression. One commonly used measure is the pseudo $R^2$, defined as
$$R^2 = \frac{\ell(\hat{\beta}) - \ell(\hat{\beta}_0)}{\ell_S(\hat{\beta}) - \ell(\hat{\beta}_0)},$$
where $\ell(\hat{\beta}_0)$ is the log likelihood of the model when only the intercept is included. The pseudo $R^2$ goes from 0 to 1, with 1 being a perfect fit.
Raw Residual

The raw residual is the difference between the actual response and the estimated value from the model. Remember that the variance is equal to the mean for a Poisson random variable, so we expect that the variances of the residuals are unequal. This can lead to difficulties in the interpretation of the raw residuals, yet they are still used. The formula for the raw residual is
$$r_i = y_i - \exp\{X_i^{T}\hat{\beta}\}.$$
Pearson Residual

The Pearson residual corrects for the unequal variance in the raw residuals by dividing by the standard deviation. The formula for the Pearson residuals is
$$p_i = \frac{r_i}{\sqrt{\hat{\phi}\exp\{X_i^{T}\hat{\beta}\}}},$$
where
$$\hat{\phi} = \frac{1}{n - p}\sum_{i=1}^{n}\frac{(y_i - \exp\{X_i^{T}\hat{\beta}\})^2}{\exp\{X_i^{T}\hat{\beta}\}}$$
is a dispersion parameter to help control overdispersion.


Deviance Residuals

Deviance residuals are also popular because the sum of squares of these residuals is the deviance statistic. The formula for the deviance residual is
$$d_i = \mathrm{sgn}(y_i - \exp\{X_i^{T}\hat{\beta}\})\sqrt{2\left[y_i\log\left(\frac{y_i}{\exp\{X_i^{T}\hat{\beta}\}}\right) - (y_i - \exp\{X_i^{T}\hat{\beta}\})\right]}.$$
Hat Values

The hat matrix serves the same purpose as in the case of linear regression: to measure the influence of each observation on the overall fit of the model. The hat values, $h_{i,i}$, are the diagonal entries of the hat matrix
$$\mathbf{H} = \mathbf{W}^{1/2}\mathbf{X}(\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}^{1/2},$$
where $\mathbf{W}$ is an $n \times n$ diagonal matrix with the values of $\exp\{X_i^{T}\hat{\beta}\}$ on the diagonal. As before, a hat value is large if $h_{i,i} > 2p/n$.

Studentized Residuals

Finally, we can also report Studentized versions of some of the earlier residuals. The Studentized Pearson residuals are given by
$$sp_i = \frac{p_i}{\sqrt{1 - h_{i,i}}}$$
and the Studentized deviance residuals are given by
$$sd_i = \frac{d_i}{\sqrt{1 - h_{i,i}}}.$$

21.5 Generalized Linear Models
All of the regression models we have considered (both linear and nonlinear) actually belong to a family of models called generalized linear models. Generalized linear models provide a generalization of ordinary least squares regression that relates the random term (the response Y) to the systematic term (the linear predictor $X^{T}\beta$) via a link function (denoted by $g(\cdot)$). Specifically, we have the relation
$$E(Y) = \mu = g^{-1}(X^{T}\beta),$$
so $g(\mu) = X^{T}\beta$. Some common link functions are:

The identity link:
$$g(\mu) = \mu = X^{T}\beta,$$
which is used in traditional linear regression.

The logit link:
$$g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right) = X^{T}\beta, \quad\text{so that}\quad \mu = \frac{e^{X^{T}\beta}}{1 + e^{X^{T}\beta}},$$
which is used in logistic regression.

The log link:
$$g(\mu) = \log(\mu) = X^{T}\beta, \quad\text{so that}\quad \mu = e^{X^{T}\beta},$$
which is used in Poisson regression.

The probit link:
$$g(\mu) = \Phi^{-1}(\mu) = X^{T}\beta, \quad\text{so that}\quad \mu = \Phi(X^{T}\beta),$$
where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. This link function is also sometimes called the normit link. It can also be used in logistic regression.

The complementary log-log link:
$$g(\mu) = \log(-\log(1 - \mu)) = X^{T}\beta, \quad\text{so that}\quad \mu = 1 - \exp\{-e^{X^{T}\beta}\},$$
which can also be used in logistic regression. This link function is also sometimes called the gompit link.

The power link:
$$g(\mu) = \mu^{\lambda} = X^{T}\beta, \quad\text{so that}\quad \mu = (X^{T}\beta)^{1/\lambda},$$
where $\lambda \neq 0$. This is used in other regressions which we do not explore (such as gamma regression and inverse Gaussian regression).

Also, the variance is typically a function of the mean and is often written as
$$\mathrm{Var}(Y) = V(\mu) = V(g^{-1}(X^{T}\beta)).$$
The random variable Y is assumed to belong to an exponential family distribution where the density can be expressed in the form
$$q(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\},$$
where $a(\cdot)$, $b(\cdot)$, and $c(\cdot)$ are specified functions, $\theta$ is a parameter related to the mean of the distribution, and $\phi$ is called the dispersion parameter. Many probability distributions belong to the exponential family. For example, the normal distribution is used for traditional linear regression, the binomial distribution is used for logistic regression, and the Poisson distribution is used for Poisson regression. Other exponential family distributions lead to gamma regression, inverse Gaussian (normal) regression, and negative binomial regression, just to name a few.
The unknown parameters, $\beta$, are typically estimated with maximum likelihood techniques (in particular, using iteratively reweighted least squares), Bayesian methods (which we will touch on in the advanced topics section), or quasi-likelihood methods. The quasi-likelihood is a function which possesses similar properties to the log-likelihood function and is most often used with count or binary data. Specifically, for a realization y of the random variable Y, it is defined as
$$Q(\mu; y) = \int_{y}^{\mu}\frac{y - t}{\sigma^2 V(t)}\,dt,$$
where $\sigma^2$ is a scale parameter. There are also tests using likelihood ratio statistics for model development to determine if any predictors may be dropped from the model.
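As a hedged sketch (simulated data and illustrative names only), several of the links above are available in R through the family argument of glm():
##########
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y_cont  <- 1 + 2 * dat$x + rnorm(100)            # continuous response
dat$y_bin   <- rbinom(100, 1, plogis(0.5 * dat$x))   # binary response
dat$y_count <- rpois(100, exp(0.3 + 0.4 * dat$x))    # count response

fit_identity <- glm(y_cont  ~ x, data = dat, family = gaussian(link = "identity"))
fit_logit    <- glm(y_bin   ~ x, data = dat, family = binomial(link = "logit"))
fit_probit   <- glm(y_bin   ~ x, data = dat, family = binomial(link = "probit"))
fit_cloglog  <- glm(y_bin   ~ x, data = dat, family = binomial(link = "cloglog"))
fit_log      <- glm(y_count ~ x, data = dat, family = poisson(link = "log"))
##########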

21.6 Examples
Example 1: Nonlinear Regression Example

A simple model for population growth towards an asymptote is the logistic model
$$y_i = \frac{\beta_1}{1 + e^{\beta_2 + \beta_3 x_i}} + \epsilon_i,$$
where $y_i$ is the population size at time $x_i$, $\beta_1$ is the asymptote towards which the population grows, $\beta_2$ reflects the size of the population at time $x = 0$ (relative to its asymptotic size), and $\beta_3$ controls the growth rate of the population.
We fit this model to Census population data for the United States (in millions) ranging from 1790 through 1990 (see Table 21.1). The data are graphed in Figure 21.1(a) and the line represents the fit of the logistic population growth model.
To fit the logistic model to the U.S. Census data, we need starting values for the parameters. It is often important in nonlinear least squares estimation to choose reasonable starting values, which generally requires some insight into the structure of the model. We know that $\beta_1$ represents the asymptotic population. The data in Figure 21.1(a) show that in 1990 the U.S. population stood at about 250 million and did not appear to be close to an asymptote;
year   population
1790      3.929
1800      5.308
1810      7.240
1820      9.638
1830     12.866
1840     17.069
1850     23.192
1860     31.443
1870     39.818
1880     50.156
1890     62.948
1900     75.995
1910     91.972
1920    105.711
1930    122.775
1940    131.669
1950    150.697
1960    179.323
1970    203.302
1980    226.542
1990    248.710

Table 21.1: The U.S. Census data.

so as not to extrapolate too far beyond the data, let us set the starting value of $\beta_1$ to 350. It is convenient to scale time so that $x_1 = 0$ in 1790 and so that the unit of time is 10 years. Then substituting $\beta_1 = 350$ and $x = 0$ into the model, using the value $y_1 = 3.929$ from the data, and assuming that the error is 0, we have
$$3.929 = \frac{350}{1 + e^{\beta_2 + \beta_3(0)}}.$$
Solving for $\beta_2$ gives us a plausible starting value for this parameter:
$$e^{\beta_2} = \frac{350}{3.929} - 1 \quad\Longrightarrow\quad \beta_2 = \log\left(\frac{350}{3.929} - 1\right) \approx 4.5.$$
Finally, returning to the data, at time $x = 1$ (i.e., at the second Census, performed in 1800), the population was $y_2 = 5.308$. Using this value, along with the previously determined starting values for $\beta_1$ and $\beta_2$, and again setting the error to 0, we have
$$5.308 = \frac{350}{1 + e^{4.5 + \beta_3(1)}}.$$
Solving for $\beta_3$ we get
$$e^{4.5 + \beta_3} = \frac{350}{5.308} - 1 \quad\Longrightarrow\quad \beta_3 = \log\left(\frac{350}{5.308} - 1\right) - 4.5 \approx -0.3.$$
e4.5+3 =

So now we have starting values for the nonlinear least squares algorithm
that we use. Below is the output from running a Gauss-Newton algorithm
for optimization. As you can see, the starting values resulted in convergence
with values not too far from our guess.
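As a hedged sketch of a call that could produce output like the one below (the data frame is built here from Table 21.1 and the starting values derived above; the object names are illustrative):
##########
census <- data.frame(
  year = seq(1790, 1990, by = 10),
  population = c(3.929, 5.308, 7.240, 9.638, 12.866, 17.069, 23.192, 31.443,
                 39.818, 50.156, 62.948, 75.995, 91.972, 105.711, 122.775,
                 131.669, 150.697, 179.323, 203.302, 226.542, 248.710))
census$time <- (census$year - 1790) / 10   # time in decades, 0 in 1790

fit_logis <- nls(population ~ beta1 / (1 + exp(beta2 + beta3 * time)),
                 data = census,
                 start = list(beta1 = 350, beta2 = 4.5, beta3 = -0.3))
summary(fit_logis)
##########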
##########
Formula: population ~ beta1/(1 + exp(beta2 + beta3 * time))

Parameters:
       Estimate Std. Error t value Pr(>|t|)
beta1 389.16551   30.81197   12.63 2.20e-10 ***
beta2   3.99035    0.07032   56.74  < 2e-16 ***
beta3  -0.22662    0.01086  -20.87 4.60e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.45 on 18 degrees of freedom

Number of iterations to convergence: 6
Achieved convergence tolerance: 1.492e-06
##########

Figure 21.1: (a) Plot of the Census data with the logistic functional fit. (b) Plot of the residuals versus the year.
Figure 21.1(b) is a plot of the residuals versus the year. As you can see, the logistic functional form that we chose captures the gross characteristics of the data, but some of the nuances do not appear to be as well characterized. Since there are indications of some cyclical behavior, a model incorporating correlated errors or, perhaps, trigonometric functions could be investigated.
Example 2: Binary Logistic Regression Example
We will first perform a binary logistic regression analysis. The data set we
will use is data published on n = 27 leukemia patients. The data (found in
Table 21.2) has a response variable of whether leukemia remission occurred
(REMISS), which is given by a 1. The independent variables are cellularity
of the marrow clot section (CELL), smear differential percentage of blasts
(SMEAR), percentage of absolute marrow leukemia cell infiltrate (INFIL),
percentage labeling index of the bone marrow leukemia cells (LI), absolute
number of blasts in the peripheral blood (BLAST), and the highest temperature prior to start of treatment (TEMP).
The following gives the estimated logistic regression equation and associated significance tests. The reference level of remission is 1 for these data.
##########
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   64.25808   74.96480   0.857    0.391
cell          30.83006   52.13520   0.591    0.554
smear         24.68632   61.52601   0.401    0.688
infil        -24.97447   65.28088  -0.383    0.702
li             4.36045    2.65798   1.641    0.101
blast         -0.01153    2.26634  -0.005    0.996
temp        -100.17340   77.75289  -1.288    0.198

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.372  on 26  degrees of freedom
Residual deviance: 21.594  on 20  degrees of freedom
AIC: 35.594

Number of Fisher Scoring iterations: 8
##########
As you can see, the index of the bone marrow leukemia cells appears to be
closest to a significant predictor of remission occurring. After looking at
various subsets of the data, it is found that a significant model is one which
only includes the labeling index as a predictor.
##########
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -3.777      1.379  -2.740  0.00615 **
li             2.897      1.187   2.441  0.01464 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.372  on 26  degrees of freedom
Residual deviance: 26.073  on 25  degrees of freedom
AIC: 30.073

Number of Fisher Scoring iterations: 4

Odds Ratio: 18.125
95% Confidence Interval: 1.770 185.562
##########

Figure 21.2: (a) Plot of the deviance residuals. (b) Plot of the Pearson residuals.
Notice that the odds ratio for LI is 18.12. It is calculated as $e^{2.897}$. The 95% confidence interval is calculated as $e^{2.897 \pm z_{0.975}\times 1.187}$, where $z_{0.975} = 1.960$ is the 97.5th percentile of the standard normal distribution. The interpretation of the odds ratio is that for every increase of 1 unit in LI, the estimated odds of remission are multiplied by 18.12. However, since LI appears to fall between 0 and 2, it may make more sense to say that for every 0.1 unit increase in LI, the estimated odds of remission are multiplied by $e^{2.897\times 0.1} = 1.337$. So, assume that we have CELL = 1.0 and TEMP = 0.97. Then:

At LI = 0.8, the estimated odds of remission are $\exp\{-3.777 + 2.897\times 0.8\} = 0.232$.

At LI = 0.9, the estimated odds of remission are $\exp\{-3.777 + 2.897\times 0.9\} = 0.310$.

The odds ratio is 0.232/0.310, which is the ratio of the odds of remission when LI = 0.8 compared to the odds when LI = 0.9. Notice that $0.232 \times 1.337 = 0.310$, which demonstrates the multiplicative effect of $e^{0.1\hat{\beta}_1}$ on the odds.

Figure 21.2 also gives plots of the deviance residuals and the Pearson
residuals. These plots seem to be okay.
Example 3: Poisson Regression Example

Table 21.3 consists of a simulated data set of size n = 30 such that the response (Y) follows a Poisson distribution with rate $\lambda = \exp\{0.50 + 0.07X\}$. A plot of the response versus the predictor is given in Figure 21.3.

Figure 21.3: Scatterplot of the simulated Poisson data set.

The following gives the analysis of the Poisson regression data:
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.007217   0.989060   0.007    0.994
x           0.306982   0.066799   4.596 8.37e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 3.977365)

    Null deviance: 195.37  on 29  degrees of freedom
Residual deviance: 111.37  on 28  degrees of freedom
AIC: 130.49

Number of Fisher Scoring iterations: 2
##########

Figure 21.4: (a) Plot of the deviance residuals. (b) Plot of the Pearson residuals.
As you can see, the predictor is highly significant.
Finally, Figure 21.4 also provides plots of the deviance residuals and Pearson residuals versus the fitted values. These plots appear to be good for a
Poisson fit. Further diagnostic plots can also be produced and model selection techniques can be employed when faced with multiple predictors.

REMISS CELL SMEAR INFIL   LI BLAST TEMP
     1 0.80  0.83  0.66 1.90  1.10 1.00
     1 0.90  0.36  0.32 1.40  0.74 0.99
     0 0.80  0.88  0.70 0.80  0.18 0.98
     0 1.00  0.87  0.87 0.70  1.05 0.99
     1 0.90  0.75  0.68 1.30  0.52 0.98
     0 1.00  0.65  0.65 0.60  0.52 0.98
     1 0.95  0.97  0.92 1.00  1.23 0.99
     0 0.95  0.87  0.83 1.90  1.35 1.02
     0 1.00  0.45  0.45 0.80  0.32 1.00
     0 0.95  0.36  0.34 0.50  0.00 1.04
     0 0.85  0.39  0.33 0.70  0.28 0.99
     0 0.70  0.76  0.53 1.20  0.15 0.98
     0 0.80  0.46  0.37 0.40  0.38 1.01
     0 0.20  0.39  0.08 0.80  0.11 0.99
     0 1.00  0.90  0.90 1.10  1.04 0.99
     1 1.00  0.84  0.84 1.90  2.06 1.02
     0 0.65  0.42  0.27 0.50  0.11 1.01
     0 1.00  0.75  0.75 1.00  1.32 1.00
     0 0.50  0.44  0.22 0.60  0.11 0.99
     1 1.00  0.63  0.63 1.10  1.07 0.99
     0 1.00  0.33  0.33 0.40  0.18 1.01
     0 0.90  0.93  0.84 0.60  1.59 1.02
     1 1.00  0.58  0.58 1.00  0.53 1.00
     0 0.95  0.32  0.30 1.60  0.89 0.99
     1 1.00  0.60  0.60 1.70  0.96 0.99
     1 1.00  0.69  0.69 0.90  0.40 0.99
     0 1.00  0.73  0.73 0.70  0.40 0.99

Table 21.2: The leukemia data set. Descriptions of the variables are given in the text.

 i   xi  yi     i   xi  yi
 1    2   0    16   16   7
 2   15   6    17   13   6
 3   19   4    18    6   2
 4   14   1    19   16   5
 5   16   5    20   19   5
 6   15   2    21   24   6
 7    9   2    22    9   2
 8   17  10    23   12   5
 9   10   3    24    7   1
10   23  10    25    9   3
11   14   2    26    7   3
12   14   6    27   15   3
13    9   5    28   21   4
14    5   2    29   20   6
15   17   2    30   20   9

Table 21.3: Simulated data for the Poisson regression example.

Chapter 22
Multivariate Multiple Regression
Up until now, we have only been concerned with univariate responses (i.e.,
the case where the response Y is simply a single value for each observation).
However, sometimes you may have multiple responses measured for each observation, whether it be different characteristics or perhaps measurements
taken over time. When our regression setting must accommodate multiple
responses for a single observation, the technique is called multivariate regression.

22.1 The Model
A multivariate multiple regression model is a multivariate linear model


that describes how a vector of responses (or y-variables) relates to a set of
predictors (or x-variables). For example, you may have a newly machined
component which is divided into four sections (or sites). Various experimental
predictors may be the temperature and amount of stress induced on the
component. The responses may be the average length of the cracks that
develop at each of the four sites.
The general structure of a multivariate multiple regression model is as follows:

A set of $p - 1$ predictors, or independent variables, are measured for each of the $i = 1, \ldots, n$ observations:
$$X_i = \begin{pmatrix}X_{i,1}\\ \vdots\\ X_{i,p-1}\end{pmatrix}.$$

A set of m responses, or dependent variables, are measured for each of the $i = 1, \ldots, n$ observations:
$$Y_i = \begin{pmatrix}Y_{i,1}\\ \vdots\\ Y_{i,m}\end{pmatrix}.$$

Each of the $j = 1, \ldots, m$ responses has its own regression model:
$$Y_{i,j} = \beta_{0,j} + \beta_{1,j}X_{i,1} + \beta_{2,j}X_{i,2} + \ldots + \beta_{p-1,j}X_{i,p-1} + \epsilon_{i,j}.$$
Vectorizing the above model for a single observation yields:
$$Y_i^{T} = (1\;\;X_i^{T})\mathbf{B} + \epsilon_i^{T},$$
where
$$\mathbf{B} = (\beta_1\;\;\beta_2\;\;\ldots\;\;\beta_m) = \begin{pmatrix}\beta_{0,1} & \beta_{0,2} & \ldots & \beta_{0,m}\\ \beta_{1,1} & \beta_{1,2} & \ldots & \beta_{1,m}\\ \vdots & \vdots & \ddots & \vdots\\ \beta_{p-1,1} & \beta_{p-1,2} & \ldots & \beta_{p-1,m}\end{pmatrix}$$
and
$$\epsilon_i = \begin{pmatrix}\epsilon_{i,1}\\ \vdots\\ \epsilon_{i,m}\end{pmatrix}.$$
Notice that $\epsilon_i$ is the vector of errors for the ith observation.
Finally, we may explicitly write down the multivariate multiple regression model:
$$\mathbf{Y}_{n\times m} = \begin{pmatrix}Y_1^{T}\\ \vdots\\ Y_n^{T}\end{pmatrix} = \begin{pmatrix}1 & X_1^{T}\\ \vdots & \vdots\\ 1 & X_n^{T}\end{pmatrix}\begin{pmatrix}\beta_{0,1} & \beta_{0,2} & \ldots & \beta_{0,m}\\ \beta_{1,1} & \beta_{1,2} & \ldots & \beta_{1,m}\\ \vdots & \vdots & \ddots & \vdots\\ \beta_{p-1,1} & \beta_{p-1,2} & \ldots & \beta_{p-1,m}\end{pmatrix} + \begin{pmatrix}\epsilon_1^{T}\\ \vdots\\ \epsilon_n^{T}\end{pmatrix} = \mathbf{X}_{n\times p}\mathbf{B}_{p\times m} + \boldsymbol{\epsilon}_{n\times m}.$$
Or, more compactly, without the dimensional subscripts we will write:
$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \boldsymbol{\epsilon}.$$

22.2 Estimation and Statistical Regions
Least Squares

Extending least squares theory from the multiple regression setting to the multivariate multiple regression setting is fairly intuitive. The biggest hurdle is dealing with the matrix calculations (which statistical packages perform for you anyhow). We can also formulate similar assumptions for the multivariate model.
Let
$$\epsilon_{(j)} = \begin{pmatrix}\epsilon_{1,j}\\ \vdots\\ \epsilon_{n,j}\end{pmatrix},$$
which is the vector of errors for the jth response across all n observations. We assume that $E(\epsilon_{(j)}) = \mathbf{0}$ and $\mathrm{Cov}(\epsilon_{(i)}, \epsilon_{(k)}) = \sigma_{i,k}\mathbf{I}_{n\times n}$ for each $i, k = 1, \ldots, m$. Notice that the errors from the m responses on the same observation have variance-covariance matrix $\Sigma = \{\sigma_{i,k}\}$, but errors from different observations are uncorrelated.
The least squares estimate for $\mathbf{B}$ is simply given by:
$$\hat{\mathbf{B}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}.$$
Using $\hat{\mathbf{B}}$, we can calculate the predicted values as
$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\mathbf{B}}$$
and the residuals as
$$\hat{\boldsymbol{\epsilon}} = \mathbf{Y} - \hat{\mathbf{Y}}.$$
Furthermore, an estimate of $\Sigma$ (which is the maximum likelihood estimate of $\Sigma$) is given by:
$$\hat{\Sigma} = \frac{1}{n}\hat{\boldsymbol{\epsilon}}^{T}\hat{\boldsymbol{\epsilon}}.$$
n
Hypothesis Testing
Suppose we are interested in testing the hypothesis that our multivariate
responses do not depend on the predictors Xi,q+1 , . . . , Xi,p1 . We can partition B to consist of two matrices: one with the regression coefficients of
the predictors we assume will remain in the model and one with the regression coefficients we wish to test. Similarly, we can partition X in a similar
manner. Formally, the test is
H0 : (2) = 0,
where

B=

(1)
(2)

and
X=

X1 X2

Here X2 is an n (p q 1) matrix of predictors corresponding to the null


hypothesis and X1 is an n (q) matrix of predictors we assume will remain
in the model. Furthermore, (2) and (1) are (p q 1) m and q m
matrices, respectively, for these predictor matrices.
Under the null hypothesis, we can calculate
= (XT X )1 XT Y

(1)
1
1
1
and
(1) )T (Y X

1 = (Y X1

1 (1) )/n.
These values (which are the maximum likelihood estimates under the null hypothesis) can be used to calculate one of four commonly used multivariate test statistics:
$$\text{Wilks' Lambda} = \frac{|n\hat{\Sigma}|}{|n\hat{\Sigma}_1|}$$
$$\text{Pillai's Trace} = \mathrm{tr}[(\hat{\Sigma}_1 - \hat{\Sigma})\hat{\Sigma}_1^{-1}]$$
$$\text{Hotelling-Lawley Trace} = \mathrm{tr}[(\hat{\Sigma}_1 - \hat{\Sigma})\hat{\Sigma}^{-1}]$$
$$\text{Roy's Greatest Root} = \frac{\hat{\eta}_1}{1 + \hat{\eta}_1}.$$
In the above, $\hat{\eta}_1$ is the largest nonzero eigenvalue of $(\hat{\Sigma}_1 - \hat{\Sigma})\hat{\Sigma}^{-1}$. Also, the value $|\hat{\Sigma}|$ is the determinant of the variance-covariance matrix and is called the generalized variance, which assigns a single numerical value to express the overall variation of this multivariate problem. All of the above test statistics have approximate F distributions with degrees of freedom that are more complicated to calculate than what we have seen. Most statistical packages will report at least one of the above, if not all four. For large sample sizes, the associated p-values will likely be similar, but various situations (such as many large eigenvalues of $(\hat{\Sigma}_1 - \hat{\Sigma})\hat{\Sigma}^{-1}$ or a relatively small sample size) will lead to a discrepancy between the results. In this case, it is usually accepted to report the Wilks' lambda value as this is the likelihood ratio test.
Confidence Regions

One problem is to predict the mean responses corresponding to fixed values $x_h$ of the predictors. Using various distributional results concerning $\hat{\mathbf{B}}^{T}x_h$ and $\hat{\Sigma}$, it can be shown that the $100(1-\alpha)\%$ simultaneous confidence intervals for $E(Y_i|X = x_h) = x_h^{T}\beta_i$ are
$$x_h^{T}\hat{\beta}_i \pm \sqrt{\frac{m(n-p-2)}{n-p-1-m}F_{m,n-p-1-m;1-\alpha}}\sqrt{x_h^{T}(\mathbf{X}^{T}\mathbf{X})^{-1}x_h\left(\frac{n}{n-p-2}\right)\hat{\sigma}_{i,i}},$$
for $i = 1, \ldots, m$. Here, $\hat{\beta}_i$ is the ith column of $\hat{\mathbf{B}}$ and $\hat{\sigma}_{i,i}$ is the ith diagonal element of $\hat{\Sigma}$. Also, notice that the simultaneous confidence intervals are constructed for each of the m entries of the response vector, which is why they are considered simultaneous. Furthermore, the collection of these simultaneous intervals yields what we call a $100(1-\alpha)\%$ confidence region for $\mathbf{B}^{T}x_h$.
Prediction Regions

Another problem is to predict new responses $Y_h = \mathbf{B}^{T}x_h + \epsilon_h$. Again skipping over a discussion of various distributional assumptions, it can be shown that the $100(1-\alpha)\%$ simultaneous prediction intervals for the individual responses $Y_{h,i}$ are
$$x_h^{T}\hat{\beta}_i \pm \sqrt{\frac{m(n-p-2)}{n-p-1-m}F_{m,n-p-1-m;1-\alpha}}\sqrt{\left(1 + x_h^{T}(\mathbf{X}^{T}\mathbf{X})^{-1}x_h\right)\left(\frac{n}{n-p-2}\right)\hat{\sigma}_{i,i}},$$
for $i = 1, \ldots, m$. The quantities here are the same as those in the simultaneous confidence intervals. Furthermore, the collection of these simultaneous prediction intervals is called a $100(1-\alpha)\%$ prediction region for $y_h$.
MANOVA

The multivariate analysis of variance (MANOVA) table is similar to its univariate counterpart. The sum of squares values in a MANOVA are no longer scalar quantities, but rather matrices. Hence, the entries in the MANOVA table are called sums of squares and cross-products (SSCPs). These quantities are described in a little more detail below:

The sum of squares and cross-products for total is $\mathrm{SSCPTO} = \sum_{i=1}^{n}(\mathbf{Y}_i - \bar{\mathbf{Y}})(\mathbf{Y}_i - \bar{\mathbf{Y}})^{T}$, which is the sum of squared deviations from the overall mean vector of the $\mathbf{Y}_i$'s. SSCPTO is a measure of the overall variation in the $\mathbf{Y}$ vectors. The corresponding total degrees of freedom are $n - 1$.

The sum of squares and cross-products for the errors is $\mathrm{SSCPE} = \sum_{i=1}^{n}(\mathbf{Y}_i - \hat{\mathbf{Y}}_i)(\mathbf{Y}_i - \hat{\mathbf{Y}}_i)^{T}$, which is the sum of squared observed errors (residuals) for the observed data vectors. SSCPE is a measure of the variation in $\mathbf{Y}$ that is not explained by the multivariate regression. The corresponding error degrees of freedom are $n - p$.

The sum of squares and cross-products due to the regression is $\mathrm{SSCPR} = \mathrm{SSCPTO} - \mathrm{SSCPE}$, and it is a measure of the total variation in $\mathbf{Y}$ that can be explained by the regression with the predictors. The corresponding model degrees of freedom are $p - 1$.

Formally, a MANOVA table is given in Table 22.1.
Source        df       SSCP
Regression    p - 1    $\sum_{i=1}^{n}(\hat{\mathbf{Y}}_i - \bar{\mathbf{Y}})(\hat{\mathbf{Y}}_i - \bar{\mathbf{Y}})^{T}$
Error         n - p    $\sum_{i=1}^{n}(\mathbf{Y}_i - \hat{\mathbf{Y}}_i)(\mathbf{Y}_i - \hat{\mathbf{Y}}_i)^{T}$
Total         n - 1    $\sum_{i=1}^{n}(\mathbf{Y}_i - \bar{\mathbf{Y}})(\mathbf{Y}_i - \bar{\mathbf{Y}})^{T}$

Table 22.1: MANOVA table for the multivariate multiple linear regression model.

Notice in the MANOVA table that we do not define any mean square values or an F-statistic. Rather, a test of the significance of the multivariate multiple regression model is carried out using a Wilks' lambda quantity similar to
$$\Lambda^{*} = \frac{\left|\sum_{i=1}^{n}(\mathbf{Y}_i - \hat{\mathbf{Y}}_i)(\mathbf{Y}_i - \hat{\mathbf{Y}}_i)^{T}\right|}{\left|\sum_{i=1}^{n}(\mathbf{Y}_i - \bar{\mathbf{Y}})(\mathbf{Y}_i - \bar{\mathbf{Y}})^{T}\right|},$$
which, after a suitable transformation, will follow a $\chi^2$ distribution. However, depending on the number of variables and the number of trials, modified versions of this test statistic must be used, which will affect the degrees of freedom for the corresponding $\chi^2$ distribution.
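As a hedged sketch (continuing the simulated fit `fit_mv` from the earlier sketch), R reports Wilks' lambda and the other multivariate test statistics via anova() on a multivariate lm object or via manova():
##########
anova(fit_mv, test = "Wilks")    # multivariate tests for each predictor
summary(manova(cbind(Y1, Y2) ~ X1 + X2, data = dat), test = "Pillai")
##########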

22.3 Reduced Rank Regression
Reduced rank regression is a way of constraining the multivariate linear regression model so that the regression coefficient matrix has less than full rank. The objective in reduced rank regression is to minimize the sum of squared residuals subject to a reduced rank condition. Without the rank condition, the estimation problem is an ordinary least squares problem. Reduced rank regression is important in that it contains as special cases the classical statistical techniques of principal component analysis, canonical variate and correlation analysis, linear discriminant analysis, exploratory factor analysis, multiple correspondence analysis, and other linear methods of analyzing multivariate data. It is also heavily utilized in neural network modeling and econometrics.
Recall that the multivariate regression model is
$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \boldsymbol{\epsilon},$$
where $\mathbf{Y}$ is an $n \times m$ matrix, $\mathbf{X}$ is an $n \times p$ matrix, and $\mathbf{B}$ is a $p \times m$ matrix of regression parameters. A reduced rank regression occurs when we have the rank constraint
$$\mathrm{rank}(\mathbf{B}) = t \le \min(p, m),$$
with equality yielding the traditional least squares setting. When the rank condition above holds, there exist two non-unique full rank matrices $\mathbf{A}_{p\times t}$ and $\mathbf{C}_{t\times m}$ such that
$$\mathbf{B} = \mathbf{A}\mathbf{C}.$$
Moreover, there may be an additional set of predictors, say $\mathbf{W}$, such that $\mathbf{W}$ is an $n \times q$ matrix. Letting $\mathbf{D}$ denote a $q \times m$ matrix of regression parameters, we can then write the reduced rank regression model as follows:
$$\mathbf{Y} = \mathbf{X}\mathbf{A}\mathbf{C} + \mathbf{W}\mathbf{D} + \boldsymbol{\epsilon}.$$
In order to get estimates for the reduced rank regression model, first note that $E(\epsilon_{(j)}) = \mathbf{0}$ and $\mathrm{Var}(\epsilon_{(j)}) = \mathbf{I}$. For simplicity in the following, let $\mathbf{Z}_0 = \mathbf{Y}$, $\mathbf{Z}_1 = \mathbf{X}$, and $\mathbf{Z}_2 = \mathbf{W}$. Next, we define the moment matrices $\mathbf{M}_{i,j} = \mathbf{Z}_i^{T}\mathbf{Z}_j/m$ for $i, j = 0, 1, 2$ and $\mathbf{S}_{i,j} = \mathbf{M}_{i,j} - \mathbf{M}_{i,2}\mathbf{M}_{2,2}^{-1}\mathbf{M}_{2,j}$ for $i, j = 0, 1$. Then, the parameter estimates for the reduced rank regression model are as follows:
$$\hat{\mathbf{A}} = (\hat{v}_1, \ldots, \hat{v}_t)$$
$$\hat{\mathbf{C}}^{T} = \mathbf{S}_{0,1}\hat{\mathbf{A}}(\hat{\mathbf{A}}^{T}\mathbf{S}_{1,1}\hat{\mathbf{A}})^{-1}$$
$$\hat{\mathbf{D}} = \mathbf{M}_{0,2}\mathbf{M}_{2,2}^{-1} - \hat{\mathbf{C}}^{T}\hat{\mathbf{A}}^{T}\mathbf{M}_{1,2}\mathbf{M}_{2,2}^{-1},$$
where $(\hat{v}_1, \ldots, \hat{v}_t)$ are the eigenvectors corresponding to the t largest eigenvalues $\hat{\lambda}_1, \ldots, \hat{\lambda}_t$ of $|\lambda\mathbf{S}_{1,1} - \mathbf{S}_{1,0}\mathbf{S}_{0,0}^{-1}\mathbf{S}_{0,1}| = 0$, and where the solution is unique only up to multiplication by an arbitrary $t \times t$ matrix of full rank.
STAT 501

D. S. Young

CHAPTER 22. MULTIVARIATE MULTIPLE REGRESSION

22.4

349

Example

Example: Amitriptyline Data

This example analyzes conjectured side effects of amitriptyline, a drug some physicians prescribe as an antidepressant. Data were gathered on n = 17 patients admitted to a hospital with an overdose of amitriptyline. The two response variables are Y1 = total TCAD plasma level and Y2 = amount of amitriptyline present in the TCAD plasma level. The five predictors measured are X1 = gender (0 for male and 1 for female), X2 = amount of the drug taken at the time of overdose, X3 = PR wave measurement, X4 = diastolic blood pressure, and X5 = QRS wave measurement. Table 22.2 gives the data set and we wish to fit a multivariate multiple linear regression model.
  Y1    Y2   X1    X2   X3  X4   X5
 3389  3149   1  7500  220   0  140
 1101   653   1  1975  200   0  100
 1131   810   0  3600  205  60  111
  596   448   1   675  160  60  120
  896   844   1   750  185  70   83
 1767  1450   1  2500  180  60   80
  807   493   1   350  154  80   98
 1111   941   0  1500  200  70   93
  645   547   1   375  137  60  105
  628   392   1  1050  167  60   74
 1360  1283   1  3000  180  60   80
  652   458   1   450  160  64   60
  860   722   1  1750  135  90   79
  500   384   0  2000  160  60   80
  781   501   0  4500  180   0  100
 1070   405   0  1500  170  90  120
 1754  1520   1  3000  180   0  129

Table 22.2: The amitriptyline data set.
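Output like that shown below can be obtained in R by fitting a linear model with a matrix response. The following is a minimal sketch, assuming the data of Table 22.2 are stored in a data frame called ami with columns Y1, Y2, X1, ..., X5 (hypothetical names):

# Multivariate multiple linear regression: two responses, five predictors.
fit <- lm(cbind(Y1, Y2) ~ X1 + X2 + X3 + X4 + X5, data = ami)
coef(fit)       # matrix of regression estimates, one column per response
summary(fit)    # separate coefficient tables for Y1 and Y2
anova(fit)      # multivariate tests across the two responses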

First we obtain the regression estimates for each response:
##########
Coefficients:
                    Y1         Y2
(Intercept) -2879.4782 -2728.7085
X1            675.6508   763.0298
X2              0.2849     0.3064
X3             10.2721     8.8962
X4              7.2512     7.2056
X5              7.5982     4.9871
##########

Then we can obtain individual ANOVA tables for each response and see that the multiple regression model for each response is statistically significant.
##########
Response Y1 :
            Df  Sum Sq Mean Sq F value    Pr(>F)
Regression   5 6835932 1367186  17.286 6.983e-05 ***
Residuals   11  870008   79092
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Response Y2 :
            Df  Sum Sq Mean Sq F value    Pr(>F)
Regression   5 6669669 1333934  15.598 0.0001132 ***
Residuals   11  940709   85519
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##########
The following also gives the SSCP matrices for this fit:
##########
$SSCPR
        Y1      Y2
Y1 6835932 6709091
Y2 6709091 6669669

$SSCPE
         Y1       Y2
Y1 870008.3 765676.5
Y2 765676.5 940708.9

$SSCPTO
        Y1      Y2
Y1 7705940 7474767
Y2 7474767 7610378
##########
We can also see which predictors are statistically significant for each response:
##########
Response Y1 :
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.879e+03  8.933e+02  -3.224 0.008108 **
X1           6.757e+02  1.621e+02   4.169 0.001565 **
X2           2.849e-01  6.091e-02   4.677 0.000675 ***
X3           1.027e+01  4.255e+00   2.414 0.034358 *
X4           7.251e+00  3.225e+00   2.248 0.046026 *
X5           7.598e+00  3.849e+00   1.974 0.074006 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 281.2 on 11 degrees of freedom
Multiple R-Squared: 0.8871,     Adjusted R-squared: 0.8358
F-statistic: 17.29 on 5 and 11 DF, p-value: 6.983e-05

Response Y2 :
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.729e+03  9.288e+02  -2.938 0.013502 *
X1           7.630e+02  1.685e+02   4.528 0.000861 ***
X2           3.064e-01  6.334e-02   4.837 0.000521 ***
X3           8.896e+00  4.424e+00   2.011 0.069515 .
X4           7.206e+00  3.354e+00   2.149 0.054782 .
X5           4.987e+00  4.002e+00   1.246 0.238622
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 292.4 on 11 degrees of freedom
Multiple R-Squared: 0.8764,     Adjusted R-squared: 0.8202
F-statistic: 15.6 on 5 and 11 DF, p-value: 0.0001132
##########

Figure 22.1: Plots of the Studentized residuals versus fitted values for the response (a) total TCAD plasma level and the response (b) amount of amitriptyline present in TCAD plasma level.
We can proceed to drop certain predictors from the model in an attempt to improve the fit, as well as view residual plots to assess the regression assumptions. Figure 22.1 gives the Studentized residual plots for each of the responses. Notice that the plots have a fairly random pattern, but there is one value that is high with respect to the fitted values. We could formally test (e.g., with a Levene's test) whether this affects the constant variance assumption, and also study pairwise scatterplots for any potential multicollinearity in this model.


Chapter 23
Data Mining
The field of Statistics is constantly being presented with larger and more
complex data sets than ever before. The challenge for the Statistician is to
be able to make sense of all of this data, extract important patterns, and
find meaningful trends. We refer to the general tools and the approaches for
dealing with these challenges in massive data sets as data mining.1
Data mining problems typically involve an outcome measurement which
we wish to predict based on a set of feature measurements. The set of
these observed measurements is called the training data. From these training data, we attempt to build a learner, which is a model used to predict the
outcome for new subjects. These learning problems are (roughly) categorized
as either supervised or unsupervised. A supervised learning problem is
one where the goal is to predict the value of an outcome measure based on a
number of input measures, such as classification with labeled samples from
the training data. An unsupervised learning problem is one where there is
no outcome measure and the goal is to describe the associations and patterns
among a set of input measures, which involves clustering unlabeled training
data by partitioning a set of features into a number of statistical classes. The
regression problems that are the focus of this text are (generally) supervised
learning problems.
Data mining is an extensive field in and of itself. In fact, many of the
methods utilized in this field are regression-based. For example, smoothing
splines, shrinkage methods, and multivariate regression methods are all often
found in data mining. The purpose of this chapter will not be to revisit these
methods, but rather to add to our toolbox additional regression methods that happen to be utilized more in data mining problems.

1 Data mining is also referred to as statistical learning or machine learning.

23.1 Some Notes on Variable and Model Selection

When faced with high-dimensional data, it is often desirable to perform some variable selection procedure. Methods discussed earlier, such as best subsets, forward selection, and backwards elimination, can be used; however, these can be very computationally expensive to implement. Shrinkage methods like the LASSO can be implemented, but these too can be expensive.
Another alternative used in variable selection and commonly discussed in the context of data mining is least angle regression or LARS. LARS is a stagewise procedure that uses a simple mathematical formula to accelerate the computations relative to the other variable selection procedures we have discussed. Only p steps are required for the full set of solutions, where p is the number of predictors. The LARS procedure starts with all coefficients equal to zero, and then finds the predictor most correlated with the response, say X_{j1}. We take the largest step possible in the direction of this predictor until some other predictor, say X_{j2}, has as much correlation with the current residual. LARS then proceeds in a direction equiangular between the two predictors until a third variable, say X_{j3}, earns its way into the most correlated set. LARS then proceeds equiangularly between X_{j1}, X_{j2}, and X_{j3} (along the "least angle direction") until a fourth variable enters. This continues until all p predictors have entered the model, and then the analyst studies these p models to determine which yields an appropriate level of parsimony.
A related methodology to LARS is forward stagewise regression. Forward stagewise regression starts by taking the residuals between the response values and their mean (i.e., all of the regression slopes are set to 0). Call this vector r. Then, find the predictor most correlated with r, say X_{j1}. Update the regression coefficient β_{j1} by setting β_{j1} = β̃_{j1} + δ_{j1}, where δ_{j1} = ε · corr(r, X_{j1}) for some small ε > 0 and β̃_{j1} is the old value of β_{j1}. Finally, update r by setting it equal to r̃ − δ_{j1} X_{j1}, such that r̃ is the old value of r. Repeat this process until no predictor has any correlation with r.
LARS and forward stagewise regression are very computationally efficient. In fact, a slight modification to the LARS algorithm can calculate all possible LASSO estimates for a given problem. Moreover, a different modification to LARS efficiently implements forward stagewise regression. In fact, the acronym for LARS includes an "S" at the end to reflect its connection to LASSO and forward stagewise regression.
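As a hedged sketch of how these procedures can be run in practice, the R package lars implements LARS, the full LASSO path, and forward stagewise regression through a single function; the predictor matrix x and response vector y below are hypothetical:

library(lars)
fit.lar   <- lars(x, y, type = "lar")                # least angle regression
fit.stage <- lars(x, y, type = "forward.stagewise")  # forward stagewise regression
fit.lasso <- lars(x, y, type = "lasso")              # all LASSO solutions
plot(fit.lar)    # coefficient paths over the p steps
coef(fit.lar)    # coefficients at each step, for judging parsimony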
Earlier in the text we also introduced the bootstrap as a way to get bootstrap confidence intervals for the regression parameters. However, the notion of the bootstrap can also be extended to fitting a regression model. Suppose that we have p − 1 feature measurements and one outcome variable. Let Z = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} be our training data, to which we wish to fit a model such that we obtain the prediction f̂(x) at each input x. Bootstrap aggregation or bagging averages this prediction over a collection of bootstrap samples, thus reducing its variance. For each bootstrap sample Z*_b, b = 1, 2, ..., B, we fit our model, which yields the prediction f̂*_b(x). The bagging estimate is then defined by

f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂*_b(x).

Denote the empirical distribution function by P̂, which puts equal probability 1/n on each of the data points (x_i, y_i). The true bagging estimate is defined by E_P̂ f̂*(x), where Z* = {(x*_1, y*_1), (x*_2, y*_2), ..., (x*_n, y*_n)} and each (x*_i, y*_i) ~ P̂. Note that the bagging estimate given above is a Monte Carlo estimate of the true bagging estimate, which it approaches as B → ∞. The bagging approach can be used in other model selection approaches throughout Statistics and data mining.
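A minimal sketch of the bagging idea is given below, assuming a hypothetical data frame dat with response y and predictors x1 and x2, and using an ordinary linear model as the procedure being bagged:

bag.predict <- function(dat, newdata, B = 100) {
  preds <- matrix(NA, nrow = nrow(newdata), ncol = B)
  for (b in 1:B) {
    idx <- sample(nrow(dat), replace = TRUE)      # draw bootstrap sample Z*_b
    fit <- lm(y ~ x1 + x2, data = dat[idx, ])     # fit the model to Z*_b
    preds[, b] <- predict(fit, newdata = newdata) # prediction from the bth fit
  }
  rowMeans(preds)   # Monte Carlo bagging estimate: average of the B predictions
}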

23.2 Classification and Support Vector Regression

Classification is the problem of identifying the subpopulation to which new observations belong (i.e., their subpopulation labels are unknown) on the basis of a training set of data containing observations whose subpopulation memberships are known. The classification problem is often contrasted with clustering, where the problem is to analyze a data set and determine how (or if) the data set can be divided into groups. In data mining, classification

is a supervised learning problem while clustering is an unsupervised learning


problem.
In this chapter, we will focus on a special classification technique which has many regression applications in data mining. Support Vector Machines (or SVMs) perform classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories. SVM models are closely related to neural networks, which we discuss later in this chapter. The predictor variables are called attributes, and a transformed attribute that is used to define the hyperplane is called a feature. The task of choosing the most suitable representation is known as feature selection. A set of features that describes one case (i.e., a row of predictor values) is called a vector. So the goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors near the hyperplane are the support vectors.
Suppose we wish to perform classification with the data shown in Figure
23.1(a) and our data has a categorical target variable with two categories.
Also assume that the attributes have continuous values. Figure 23.1(b) provides a snapshot of how we perform SVM modeling. The SVM analysis
attempts to find a 1-dimensional hyperplane (i.e., a line) that separates the
cases based on their target categories. There are an infinite number of possible lines and we show only one in Figure 23.1(b). The question is which line
is optimal and how do we define that line.
The dashed lines drawn parallel to the separating line mark the distance
between the dividing line and the closest vectors to the line. The distance
between the dashed lines is called the margin. The vectors (i.e., points)
that constrain the width of the margin are the support vectors. An SVM
analysis finds the line (or, in general, hyperplane) that is oriented so that the
margin between the support vectors is maximized. Unfortunately, the data we deal with are not generally as simple as the data in Figure 23.1. The challenge will be to develop an SVM model that accommodates such characteristics as:
1. more than two attributes;

2. separation of the points with nonlinear curves;

3. handling of cases where the clusters cannot be completely separated; and

4. handling classification with more than two categories.

Figure 23.1: (a) A plot of the data where classification is to be performed. (b) The data where a support vector machine has been used. The points near the parallel dashed lines are the support vectors. The region between the parallel dashed lines is called the margin, which is the region we want to optimize.


The setting with nonlinear curves and where clusters cannot be completely separated is illustrated in Figure 23.2. Without loss of generality, our discussion will mainly be focused on the one-attribute, one-feature setting.
Moreover, we will be utilizing support vectors in order to build a regression
relationship that fits our data adequately.
A little more terminology is necessary before we move into the regression
discussion. A loss function represents the loss in utility associated with
an estimate being wrong (i.e., different from either a desired or a true
value) as a function of a measure of the degree of wrongness (generally
the difference between the estimated value and the true or desired value).
When discussing SVM modeling in the regression setting, the loss function
will need to incorporate a distance measure as well.
As a quick illustration of some common loss functions, look at Figure 23.3. Figure 23.3(a) is a quadratic loss function, which is what we use in classical ordinary least squares. Figure 23.3(b) is a Laplacian loss function, which is less sensitive to outliers than the quadratic loss function. Figure 23.3(c) is Huber's loss function, which is a robust loss function that has optimal properties when the underlying distribution of the data is unknown. Finally, Figure 23.3(d) is called the ε-insensitive loss function, which enables a sparse set of support vectors to be obtained.

Figure 23.2: A plot of data where a support vector machine has been used for classification. The data was generated where we know that the circles belong to group 1 and the triangles belong to group 2. The white contours show where the margin is; however, there are clearly some values that have been misclassified since the two clusters are not well-separated. The points that are solid were used as the training data.
In Support Vector Regressions (or SVRs), the input is first mapped onto an N-dimensional feature space using some fixed (nonlinear) mapping, and then a linear model is constructed in this feature space. Using mathematical notation, the linear model (in the feature space) is given by

f(x, ω) = Σ_{j=1}^{N} ω_j g_j(x) + b,

where the g_j(·), j = 1, ..., N, denote a set of nonlinear transformations and b is a bias term. If the data are assumed to be of zero mean (as is usually the case), then the bias term is dropped. Note that b is not considered stochastic in this model and is not akin to the error terms we have studied in previous models.

Figure 23.3: Plots of the (a) quadratic loss, (b) Laplace loss, (c) Huber's loss, and (d) ε-insensitive loss functions.
The optimal regression function is given by the minimum of the functional

Φ(ω, ξ⁻, ξ⁺) = (1/2)‖ω‖² + C Σ_{i=1}^{n} (ξ_i⁺ + ξ_i⁻),

where C is a pre-specified constant and ξ⁻, ξ⁺ are slack variables representing upper and lower constraints (respectively) on the output of the system. In other words, we have the following constraints:

y_i − f(x_i, ω) ≤ ε + ξ_i⁺
f(x_i, ω) − y_i ≤ ε + ξ_i⁻
ξ_i⁻, ξ_i⁺ ≥ 0, i = 1, ..., n,

where y_i is defined through the loss function we are using. The four loss functions we show in Figure 23.3 are as follows:

Quadratic Loss:
L_2(f(x) − y) = (f(x) − y)²

Laplace Loss:
L_1(f(x) − y) = |f(x) − y|

Huber's Loss²:
L_H(f(x) − y) = (1/2)(f(x) − y)²,        for |f(x) − y| < μ;
L_H(f(x) − y) = μ|f(x) − y| − μ²/2,      otherwise.

ε-Insensitive Loss:
L_ε(f(x) − y) = 0,                       for |f(x) − y| < ε;
L_ε(f(x) − y) = |f(x) − y| − ε,          otherwise.
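As an illustration only, the four loss functions can be written directly as functions of the residual r = f(x) − y; the default values of mu and eps below are arbitrary choices, not values from the text:

quad.loss    <- function(r) r^2                         # quadratic (least squares)
laplace.loss <- function(r) abs(r)                      # Laplace (absolute error)
huber.loss   <- function(r, mu = 1)                     # Huber, with threshold mu
  ifelse(abs(r) < mu, 0.5 * r^2, mu * abs(r) - mu^2 / 2)
eps.loss     <- function(r, eps = 0.1)                  # eps-insensitive
  pmax(abs(r) - eps, 0)
curve(eps.loss(x, eps = 0.5), from = -2, to = 2)        # e.g., view one of them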

Depending on which loss function is chosen, an appropriate optimization problem can be specified, which can involve kernel methods. Moreover, specification of the kernel type as well as values like C, ε, and μ all control the complexity of the model in different ways. There are many subtleties depending on which loss function is used, and the investigator should become familiar with the loss function being employed. Regardless, the optimization approach will require the use of numerical methods.

2 The quantity μ is a specified threshold constant.
It is also desirable to strike a balance between complexity and the error that is present with the fitted model. Test error (also known as generalization error) is the expected prediction error over an independent test sample and is given by

Err = E[L(Y, f̂(X))],

where X and Y are drawn randomly from their joint distribution. This expectation is an average of everything that is random in this set-up, including the randomness in the training sample that produced the estimate f̂(·). Training error is the average loss over the training sample and is given by

err = (1/n) Σ_{i=1}^{n} L(y_i, f̂(x_i)).

We would like to know the test error of our estimated model f̂(·). As the model increases in complexity, it is able to capture more complicated underlying structures in the data, which decreases bias. But then the estimation error increases, which increases variance. This is known as the bias-variance tradeoff. In between there is an optimal model complexity that gives minimum test error.
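As a small illustrative sketch (not from the original analysis), the training error and an estimate of the test error under squared-error loss can be compared with a simple hold-out split; the data frame dat with columns y and x, and the cubic polynomial fit, are assumptions:

set.seed(1)
n     <- nrow(dat)
train <- sample(n, size = floor(0.7 * n))           # 70/30 training/test split
fit   <- lm(y ~ poly(x, 3), data = dat[train, ])    # a model of moderate complexity
err.train <- mean((dat$y[train] - predict(fit))^2)  # training error (err)
err.test  <- mean((dat$y[-train] -
                   predict(fit, newdata = dat[-train, ]))^2)  # estimated test error (Err)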

23.3 Boosting and Regression Transfer

Transfer learning is the notion that it is easier to learn a new concept (such
as how to play racquetball) if you are already familiar with a similar concept
(such as knowing how to play tennis). In the context of supervised learning, inductive transfer learning is often framed as the problem of learning
a concept of interest, called the target concept, given data from multiple
sources: a typically small amount of target data that reflects the target concept, and a larger amount of source data that reflects one or more different,
but possibly related, source concepts.
While most algorithms addressing this notion are in classification settings,
some of the common algorithms can be extended to the regression setting to
help us build our models. The approach we discuss is called boosting or

boosted regression. Boosted regression is highly flexible in that it allows the researcher to specify the feature measurements without specifying their functional relationship to the outcome measurement. Because of this flexibility, a boosted model will tend to fit better than a linear model, and therefore inferences made based on the boosted model may have more credibility.
Our goal is to learn a model of a concept c_target mapping feature vectors from the feature space containing X to the response space Y. We are given a set of training instances T_target = {(x_i, y_i)}, with x_i ∈ X and y_i ∈ Y for i = 1, ..., n, that reflect c_target. In addition, we are given data sets T_source^1, ..., T_source^B reflecting B different, but possibly related, source concepts also mapping X to Y. In order to learn the most accurate possible model of c_target, we must decide how to use both the target and source data sets. If T_target is sufficiently large, we can likely learn a good model using only this data. However, if T_target is small and one or more of the source concepts is similar to c_target, then we may be able to use the source data to improve our model.
Regression transfer algorithms fit into two basic categories: those that
make use of models trained on the source data, and those that use the source
data directly as training data. The two algorithms presented here fit into
each of these categories and are inspired by two boosting-based algorithms
for classification transfer: ExpBoost and AdaBoost. The regression analogues
that we present are called ExpBoost.R2 and AdaBoost.R2. Boosting is an ensemble method in which a sequence of models (or hypotheses) h_1, ..., h_m, each mapping from X to Y, are iteratively fit to some transformation of a data set using a base learner. The outputs of these models are then combined into a final hypothesis, which we denote as h_f. We can now formalize the two regression transfer algorithms.
AdaBoost.R2

Input the labeled target data set T of size n, the maximum number of iterations B, and a base learning algorithm called Learner. Unless otherwise specified, set the initial weight vector w^1 such that w_i^1 = 1/n for i = 1, ..., n. For t = 1, ..., B:

1. Call Learner with the training set T and the distribution w^t, and get a hypothesis h_t : X → R.

2. Calculate the adjusted error e_i^t for each instance. Let D_t = max_i |y_i − h_t(x_i)|, so that e_i^t = |y_i − h_t(x_i)|/D_t.

3. Calculate the adjusted error of h_t, which is ε_t = Σ_{i=1}^{n} e_i^t w_i^t. If ε_t ≥ 0.5, then stop and set B = t − 1.

4. Let β_t = ε_t/(1 − ε_t).

5. Update the weight vector as w_i^{t+1} = w_i^t β_t^{(1 − e_i^t)}/Z_t, such that Z_t is a normalizing constant.

Output the hypothesis h_f, which is the (weighted) median of h_t(x) for t = 1, ..., B, using ln(1/β_t) as the weight for hypothesis h_t.

The method used in AdaBoost.R2 is to express each error in relation to the largest error D = max_i |e_i| in such a way that each adjusted error e′_i is in the range [0, 1]. In particular, one of three possible loss functions is used: e′_i = e_i/D (linear), e′_i = e_i²/D² (quadratic), or e′_i = 1 − exp(−e_i/D) (exponential). The degree to which instance x_i is reweighted in iteration t thus depends on how large the error of h_t is on x_i relative to the error on the worst instance.
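A compact sketch of AdaBoost.R2 is given below, with the linear adjusted error and with regression trees (via the rpart package) playing the role of Learner; the data frame is assumed to have its response in a column named y, and this is an illustration rather than a reference implementation:

library(rpart)
adaboost.r2 <- function(dat, B = 50) {
  n <- nrow(dat)
  w <- rep(1 / n, n)                               # initial weight vector w^1
  fits <- list(); beta <- numeric(0)
  for (t in 1:B) {
    fit <- rpart(y ~ ., data = dat, weights = w)   # step 1: call Learner
    res <- abs(dat$y - predict(fit, dat))
    e   <- res / max(res)                          # step 2: linear adjusted errors
    eps <- sum(e * w)                              # step 3: adjusted error of h_t
    if (eps >= 0.5) break
    b <- eps / (1 - eps)                           # step 4: beta_t
    w <- w * b^(1 - e); w <- w / sum(w)            # step 5: reweight and normalize
    fits[[t]] <- fit; beta <- c(beta, b)
  }
  list(hypotheses = fits, weights = log(1 / beta)) # combine via a weighted median
}                                                  # of the h_t(x) when predicting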
ExpBoost.R2

Input the labeled target data set T of size n, the maximum number of iterations B, and a base learning algorithm called Learner. Unless otherwise specified, set the initial weight vector w^1 such that w_i^1 = 1/n for i = 1, ..., n. Moreover, each source data set gets assigned to one expert from the set of experts H^B = {h^1, ..., h^B}. For t = 1, ..., B:

1. Call Learner with the training set T and the distribution w^t, and get a hypothesis h_t : X → R.

2. Calculate the adjusted error e_i^t for each instance. Let D_t = max_i |y_i − h_t(x_i)|, so that e_i^t = |y_i − h_t(x_i)|/D_t.

3. Calculate the adjusted error of h_t, which is ε_t = Σ_{i=1}^{n} e_i^t w_i^t. If ε_t ≥ 0.5, then stop and set B = t − 1.

4. Calculate the weighted errors of each expert in H^B on the current weighting scheme. If any expert in H^B has a lower weighted error than h_t, then replace h_t with this best expert.

5. Let β_t = ε_t/(1 − ε_t).

6. Update the weight vector as w_i^{t+1} = w_i^t β_t^{(1 − e_i^t)}/Z_t, such that Z_t is a normalizing constant.

Output the hypothesis h_f, which is the (weighted) median of h_t(x) for t = 1, ..., B, using ln(1/β_t) as the weight for hypothesis h_t.

As can be seen, ExpBoost.R2 is similar to the AdaBoost.R2 algorithm, but with a few minor differences.

23.4 CART and MARS

Classification and regression trees (CART) is a nonparametric tree-based method which partitions the predictor space into a set of rectangles and then fits a simple model (like a constant) in each one. While these models seem conceptually simple, they are actually quite powerful.
Suppose we have one response (y_i) and p predictors (x_{i,1}, ..., x_{i,p}) for i = 1, ..., n. First we partition the predictor space into M regions (say, R_1, ..., R_M) and model the response as a constant c_m in each region:

f(x) = Σ_{m=1}^{M} c_m I(x ∈ R_m).

Then, minimizing the sum of squares

Σ_{i=1}^{n} (y_i − f(x_i))²

yields

ĉ_m = Σ_{i=1}^{n} y_i I(x_i ∈ R_m) / Σ_{i=1}^{n} I(x_i ∈ R_m).

We proceed to grow the tree by finding the best binary partition in terms of the c_m values. This is generally computationally infeasible, which leads to the use of a greedy search algorithm. Typically, the tree is grown until a small node size (such as 5) is reached, and then a method for pruning the tree is implemented.
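A minimal sketch of growing and then pruning a regression tree with the rpart package follows; the data frame dat (with response y) and the particular pruning value are assumptions for illustration:

library(rpart)
tree <- rpart(y ~ ., data = dat, method = "anova",            # regression tree
              control = rpart.control(minsplit = 5, cp = 0))  # grow a large tree
printcp(tree)                     # cross-validated error for each subtree size
pruned <- prune(tree, cp = 0.02)  # prune back, with cp chosen from the printcp() table
predict(pruned, newdata = dat)    # the piecewise-constant fit (the c_m values)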
Multivariate adaptive regression splines (MARS) is another nonparametric method that can be viewed as a modification of CART and is well-suited for high-dimensional problems. MARS uses expansions in piecewise linear basis functions of the form (x − t)_+ and (t − x)_+, where the + subscript simply means we take the positive part (e.g., (x − t)_+ = (x − t)I(x > t)). These two functions together are called a reflected pair.

In MARS, each function is piecewise linear with a knot at the value t. The idea is to form a reflected pair for each predictor X_j with knots at each observed value x_{i,j} of that predictor. Therefore, the collection of basis functions for j = 1, ..., p is

C = {(X_j − t)_+, (t − X_j)_+ : t ∈ {x_{1,j}, ..., x_{n,j}}}.
MARS proceeds like a forward stepwise regression model selection procedure, but instead of selecting the predictors to use, we use functions from the set C and their products. Thus, the model has the form

f(X) = β_0 + Σ_{m=1}^{M} β_m h_m(X),

where each h_m(X) is a function in C or a product of two or more such functions.
You can also think of MARS as selecting a weighted sum of basis functions from the set of (a large number of) basis functions that span all values
of each predictor (i.e., that set would consist of one basis function and knot
value t for each distinct value of each predictor variable). The MARS algorithm then searches over the space of all inputs and predictor values (knot
locations t) as well as interactions between variables. During this search,
an increasingly larger number of basis functions are added to the model (selected from the set of possible basis functions) to maximize an overall least
squares goodness-of-fit criterion. As a result of these operations, MARS automatically determines the most important independent variables as well as
the most significant interactions among them.
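A hedged sketch of fitting MARS in R with the earth package (one common implementation) is given below; the data frame dat with response y is hypothetical:

library(earth)
fit <- earth(y ~ ., data = dat, degree = 2)  # degree = 2 allows products (interactions)
summary(fit)   # the selected hinge (reflected-pair) basis functions and coefficients
evimp(fit)     # estimated variable importance from the forward/backward passes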
23.5 Neural Networks

With the exponential growth in available data and advancement in computing power, researchers in statistics, artificial intelligence, and data mining
have been faced with the challenge to develop simple, flexible, powerful procedures for modeling large data sets. One such model is the neural network
approach, which attempts to model the response as a nonlinear function of
various linear combinations of the predictors. Neural networks were first used
as models for the human brain.
The most commonly used neural network model is the single-hidden-layer, feedforward neural network (sometimes called the single-layer perceptron). In this neural network model, the ith response y_i is modeled as a nonlinear function f_Y of m derived predictor values S_{i,0}, S_{i,1}, ..., S_{i,m−1}:

y_i = f_Y(α_0 S_{i,0} + α_1 S_{i,1} + ... + α_{m−1} S_{i,m−1}) + ε_i = f_Y(S_i^T α) + ε_i,

where α = (α_0, α_1, ..., α_{m−1})^T and S_i = (S_{i,0}, S_{i,1}, ..., S_{i,m−1})^T. Here S_{i,0} equals 1 and, for j = 1, ..., m − 1, the jth derived predictor value for the ith observation, S_{i,j}, is a nonlinear function f_j of a linear combination of the original predictors:

S_{i,j} = f_j(X_i^T β_j),

where β_j = (β_{j,0}, β_{j,1}, ..., β_{j,p−1})^T and X_i = (X_{i,0}, X_{i,1}, ..., X_{i,p−1})^T with X_{i,0} = 1. The functions f_Y, f_1, ..., f_{m−1} are called activation functions. Finally, we can combine all of the above to form the neural network model as:

y_i = f_Y(S_i^T α) + ε_i = f_Y(α_0 + Σ_{j=1}^{m−1} α_j f_j(X_i^T β_j)) + ε_i.

There are various numerical optimization algorithms for fitting neural networks (e.g., quasi-Newton methods and conjugate-gradient algorithms). One important thing to note is that parameter estimation in neural networks often utilizes penalized least squares to control the level of overfitting. The penalized least squares criterion is given by:

Q = Σ_{i=1}^{n} [y_i − f_Y(α_0 + Σ_{j=1}^{m−1} α_j f_j(X_i^T β_j))]² + p_λ(α, β_1, ..., β_{m−1}),

where the penalty term is given by:

p_λ(α, β_1, ..., β_{m−1}) = λ [ Σ_{i=0}^{m−1} α_i² + Σ_{i=1}^{m−1} Σ_{j=1}^{p−1} β_{i,j}² ].
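As a hedged sketch, a single-hidden-layer network of this type can be fit in R with the nnet package, where the decay argument plays the role of the penalty weight λ in the criterion above; the data frame dat (with response y) and the particular settings are assumptions:

library(nnet)
fit <- nnet(y ~ ., data = dat,
            size = 5,        # number of derived predictors (hidden units)
            linout = TRUE,   # identity output activation f_Y for regression
            decay = 0.01,    # weight-decay penalty (the lambda above)
            maxit = 500)     # iteration limit for the numerical optimizer
predict(fit, newdata = dat)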

Finally, there is also a modeling technique which is similar to the tree-based methods discussed earlier. The hierarchical mixture-of-experts model (HME model) is a parametric tree-based method which recursively splits the function of interest at each node. However, the splits are done probabilistically and the probabilities are functions of the predictors. The model is written as

f(y_i) = Σ_{j1=1}^{k1} π_{j1}(x_i, γ) Σ_{j2=1}^{k2} π_{j2}(x_i, γ_{j1}) ··· Σ_{jr=1}^{kr} π_{jr}(x_i, γ_{j1,j2,...,j_{r−1}}) g(y_i; x_i, θ_{j1,j2,...,jr}),

which has a tree structure with r levels (i.e., r levels where probabilistic splits occur). The π(·) functions provide the probabilities for the splitting and, in addition to being dependent on the predictors, they also have their own set of parameters (the different γ values) requiring estimation (these mixing
proportions are modeled using logistic regressions). Finally, the θ's are simply the parameter vectors for the regressions modeled at each terminal node of the tree constructed using the HME structure.
tree constructed using the HME structure.
The HME model is similar to CART; however, unlike CART, it does not provide a hard split at each node (i.e., either a node splits or it does not). The HME model incorporates these predictor-dependent mixing proportions, which provide a "soft" probabilistic split at each node. The HME model can also be thought of as being in the middle of a continuum where at one end we have CART (which provides hard splits) and at the other end is mixtures of regressions (which is closely related to the HME model, but the mixing proportions which provide the soft probabilistic splits are no longer predictor-dependent). We will discuss mixtures of regressions at the end of this chapter.

23.6 Examples

Example 1: Simulated Neural Network Data


In this very simple toy example, we have provided two features (i.e., input
neurons X1 and X2 ) and one response measurement (i.e., output neuron Y )
which are given in Table 23.1. The model fit is one where X1 and X2 do not
interact on Y . A single hidden-layer and a double hidden-layer neural net
model are each fit to this data to highlight the difference in the fits. Below
is the output (for each neural net) which shows the results from training
the model. A total of 5 training samples were used and a threshold value
of 0.01 was used as a stopping criterion. The stopping criterion pertains to
the partial derivatives of the error function and once we fall beneath that
threshold value, then the algorithm stops for that training sample.
X1  X2  Y
 0   0  0
 1   0  1
 0   1  1
 1   1  0

Table 23.1: The simulated neural network data.

##########
5 repetitions were calculated.

            Error Reached Threshold Steps
3    0.3480998376    0.007288519715    41
1    0.5000706639    0.009004839727    14
2    0.5000949028    0.009028409036    26
4    0.5001216674    0.008221843135    35
5    0.5007429970    0.007923316336    10

5 repetitions were calculated.

            Error Reached Threshold Steps
4 0.0002241701811    0.009294280160    61
2 0.0004741186530    0.008171862296   193
5 0.2516368073472    0.006640846189    88
3 0.3556429122848    0.007036160421    46
1 0.5015928330534    0.009549108455    25
##########

Figure 23.4: (a) The fitted single hidden-layer neural net model to the toy data (error 0.3481, 41 steps). (b) The fitted double hidden-layer neural net model to the toy data (error 0.000224, 61 steps).

In the above output, the first group of 5 repetitions pertains to the single hidden-layer neural net. For those 5 repetitions, the third training sample yielded the smallest error (about 0.3481). The second group of 5 repetitions pertains to the double hidden-layer neural net. For those 5 repetitions, the fourth training sample yielded the smallest error (about 0.0002). The increase in complexity of the neural net has yielded a smaller training error. The fitted neural net models are depicted in Figure 23.4.
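The training output above has the form produced by the R package neuralnet; a hedged sketch of calls that could generate fits like those in Figure 23.4 is given below (the number of hidden units per layer is an assumption made here):

library(neuralnet)
toy <- data.frame(X1 = c(0, 1, 0, 1),   # the data of Table 23.1
                  X2 = c(0, 0, 1, 1),
                  Y  = c(0, 1, 1, 0))
net1 <- neuralnet(Y ~ X1 + X2, data = toy, hidden = 1,
                  threshold = 0.01, rep = 5)      # single hidden layer
net2 <- neuralnet(Y ~ X1 + X2, data = toy, hidden = c(2, 2),
                  threshold = 0.01, rep = 5)      # double hidden layer
plot(net1)   # network diagrams in the style of Figure 23.4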
Example 2: Motorcycle Accident Data
This data set is from a simulated accident involving different motorcycles.
The time in milliseconds until impact and the g-force measurement of acceleration are recorded. The data are provided in Table 23.2 and plotted in
Figure 23.5(a). Given the obvious nonlinear trend that is present with this
data, we will attempt to fit a support vector regression to this data.
A support vector regression using an ε-insensitive loss function is fitted to these data. Fits with ε ∈ {0.01, 0.10, 0.70} are shown in Figure 23.5(b). As ε decreases, different characteristics of the data are emphasized, but the level of complexity of the model is increased. As noted earlier, we want to try and strike a good balance regarding the model complexity. For the training error, we get values of 0.177, 0.168, and 0.250 for the three levels of ε. Since our objective is to minimize the training error, the value ε = 0.10 (which has a training error of 0.168) is chosen. This corresponds to the green line in Figure 23.5(b).

Figure 23.5: (a) Data from a simulated motorcycle accident where the time until impact (in milliseconds) is plotted versus the recorded head acceleration (in g). (b) The data with different values of ε used for the support vector regression obtained with an ε-insensitive loss function. Note how the smaller the ε, the more features you pick up in the fit, but the complexity of the model also increases.
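A hedged sketch of producing fits like those in Figure 23.5(b) with the e1071 package is given below, assuming the data of Table 23.2 are stored in a data frame called motorcycle with columns times and accel:

library(e1071)
eps.values <- c(0.01, 0.10, 0.70)
fits <- lapply(eps.values, function(eps)
  svm(accel ~ times, data = motorcycle,
      type = "eps-regression", epsilon = eps))    # eps-insensitive SVR
plot(motorcycle$times, motorcycle$accel, xlab = "times", ylab = "accel")
for (f in fits) lines(motorcycle$times, predict(f, motorcycle))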

Obs. Times Accel.   Obs. Times Accel.   Obs. Times Accel.
  1    2.4    0.0    33   18.6 -112.5    64   32.0   54.9
  2    2.6   -1.3    34   19.2 -123.1    65   32.8   46.9
  3    3.2   -2.7    35   19.4  -85.6    66   33.4   16.0
  4    3.6    0.0    36   19.6 -127.2    67   33.8   45.6
  5    4.0   -2.7    37   20.2 -123.1    68   34.4    1.3
  6    6.2   -2.7    38   20.4 -117.9    69   34.8   75.0
  7    6.6   -2.7    39   21.2 -134.0    70   35.2  -16.0
  8    6.8   -1.3    40   21.4 -101.9    71   35.4   69.6
  9    7.8   -2.7    41   21.8 -108.4    72   35.6   34.8
 10    8.2   -2.7    42   22.0 -123.1    73   36.2  -37.5
 11    8.8   -1.3    43   23.2 -123.1    74   38.0   46.9
 12    9.6   -2.7    44   23.4 -128.5    75   39.2    5.4
 13   10.0   -2.7    45   24.0 -112.5    76   39.4   -1.3
 14   10.2   -5.4    46   24.2  -95.1    77   40.0  -21.5
 15   10.6   -2.7    47   24.6  -53.5    78   40.4  -13.3
 16   11.0   -5.4    48   25.0  -64.4    79   41.6   30.8
 17   11.4    0.0    49   25.4  -72.3    80   42.4   29.4
 18   13.2   -2.7    50   25.6  -26.8    81   42.8    0.0
 19   13.6   -2.7    51   26.0   -5.4    82   43.0   14.7
 20   13.8    0.0    52   26.2 -107.1    83   44.0   -1.3
 21   14.6  -13.3    53   26.4  -65.6    84   44.4    0.0
 22   14.8   -2.7    54   27.0  -16.0    85   45.0   10.7
 23   15.4  -22.8    55   27.2  -45.6    86   46.6   10.7
 24   15.6  -40.2    56   27.6    4.0    87   47.8  -26.8
 25   15.8  -21.5    57   28.2   12.0    88   48.8  -13.3
 26   16.0  -42.9    58   28.4  -21.5    89   50.6    0.0
 27   16.2  -21.5    59   28.6   46.9    90   52.0   10.7
 28   16.4   -5.4    60   29.4  -17.4    91   53.2  -14.7
 29   16.6  -59.0    61   30.2   36.2    92   55.0   -2.7
 30   16.8  -71.0    62   31.0   75.0    93   55.4   -2.7
 31   17.6  -37.5    63   31.2    8.1    94   57.6   10.7
 32   17.8  -99.1

Table 23.2: The motorcycle data.


Chapter 24
Advanced Topics
This chapter presents topics for which theory beyond the scope of this course needs to be developed alongside their applicability. The topics are not arranged in any particular order, but rather are just a sample of some of the more advanced regression procedures that are available. Not all computer software has the capabilities to perform analysis on the models presented here.

24.1 Semiparametric Regression

Semiparametric regression is concerned with flexible modeling of nonlinear functional relationships in regression analysis by building a model consisting of both parametric and nonparametric components. We have already visited a semiparametric model with the Cox proportional hazards model. In this model, there is the baseline hazard, which is nonparametric, and then the hazards ratio, which is parametric.

Suppose we have n = 200 observations where y is the response, x1 is a predictor taking on only the values 1, 2, 3, or 4, and x2, x3, and x4 are predictors taking on values between 0 and 1. A semiparametric regression model of interest for this setting is

y_i = β_0 + β_1 z_{2,i} + β_2 z_{3,i} + β_3 z_{4,i} + m(x_{2,i}, x_{3,i}, x_{4,i}) + ε_i,

where

z_{j,i} = I{x_{1,i} = j}.

In other words, we are using the leave-one-out method for the levels of x1.

The results of fitting a semiparametric regression model are given in Figure 24.1. There are noticeable functional forms for x2 and x3; however, the contribution of x4 appears to be almost 0. In fact, this is exactly how the data were generated. The data were generated according to:

y_i = 5.15487 + e^{2 x_{1,i}} + 0.2 x_{2,i}^{11} (10(1 − x_{2,i}))^6 + 10 (10 x_{2,i})^3 (1 − x_{2,i})^{10} + e_i,

where the e_i were generated according to a normal distribution with mean 0 and variance 4. Notice how the x_{4,i} term was not used in the data generation, which is reflected in the plot and in the significance of the smoothing term from the output below:
##########
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.9660     0.2939  13.494  < 2e-16 ***
x12           1.8851     0.4176   4.514 1.12e-05 ***
x13           3.8264     0.4192   9.128  < 2e-16 ***
x14           6.1100     0.4181  14.615  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
        edf Est.rank      F p-value
s(x2) 1.729    4.000 25.301  <2e-16 ***
s(x3) 7.069    9.000 45.839  <2e-16 ***
s(x4) 1.000    1.000  0.057   0.811
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.78   Deviance explained = 79.4%
GCV score = 4.5786   Scale est. = 4.2628   n = 200
##########
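Output of the form shown above can be produced with the mgcv package in R; a hedged sketch follows, where the data frame dat with columns y, x1 (a factor with levels 1 through 4), x2, x3, and x4 is hypothetical:

library(mgcv)
fit <- gam(y ~ x1 + s(x2) + s(x3) + s(x4), data = dat)  # parametric x1, smooth x2-x4
summary(fit)          # parametric coefficients and approximate smooth-term tests
plot(fit, pages = 1)  # estimated smooth functions, as in Figure 24.1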

There are actually many general forms of semiparametric regression models. We will list a few of them. In the following outline, X = (X1 , . . . , Xp )T
pertains to the predictors and may be partitioned such that X = (UT , VT )T
where U = (U1 , . . . , Ur )T , V = (V1 , . . . , Vs )T , and r + s = p. Also, m()
is a nonparametric function and g() is a link function as established in the
discussion on generalized linear models.
Figure 24.1: Semiparametric regression fits of the generated data.


Additive Models: In the model

E(Y|X) = β_0 + Σ_{j=1}^{p} m_j(X_j),

we have a fixed intercept term and wish to estimate p nonparametric functions - one for each of the predictors.

Partial Linear Models: In the model

E(Y|U, V) = U^T β + m(V),

we have the sum of a purely parametric part and a purely nonparametric part, which involves parametric estimation routines and nonparametric estimation routines, respectively. This is the type of model used in the generation of the example given above.

Generalized Additive Models: In the model

E(Y|X) = g(β_0 + Σ_{j=1}^{p} m_j(X_j)),

we have the same setting as in an additive model, but a link function relates the sum of functions to the response variable. This is the model fitted to the example above.

Generalized Partial Linear Models: In the model

E(Y|U, V) = g(U^T β + m(V)),

we have the same setting as in a partial linear model, but a link function relates the sum of parametric and nonparametric components to the response.

Generalized Partial Linear Partial Additive Models: In the model

E(Y|U, V) = g(U^T β + Σ_{j=1}^{s} m_j(V_j)),

we have the sum of a parametric component and the sum of s individual nonparametric functions, but there is also a link function that relates this sum to the response.
Another method (which is often a semiparametric regression model due to its exploratory nature) is the projection pursuit regression method discussed earlier. Projection pursuit stands for a class of exploratory projection techniques. This class contains statistical methods designed for analyzing high-dimensional data using low-dimensional projections. The aim of projection pursuit regression is to reveal possible nonlinear relationships between a response and a very large number of predictors, with the ultimate goal of finding interesting structures hidden within the high-dimensional data.
To conclude this section, let us outline the general context of the three
classes of regression models:
Parametric Models: These models are fully determined up to a parameter vector. If the underlying assumptions are correct, then the fitted
model can easily be interpreted and estimated accurately. If the assumptions are violated, then fitted parametric estimates may provide
inconsistencies and misleading interpretations.
Nonparametric Models: These models provide flexible models and avoid
the restrictive parametric form. However, they may be difficult to interpret and yield inaccurate estimates for a large number of regressors.
Semiparametric Models: These models combine parametric and nonparametric components. They allow easy interpretation of the parametric
component while providing the flexibility of the nonparametric component.

24.2 Random Effects Regression and Multilevel Regression

The next model we consider is not unlike growth curve models. Suppose
we have responses measured on each subject repeatedly. However, we no
longer assume that the same number of responses are measured for each
subject (such data is often called longitudinal data or trajectory data).
In addition, the regression parameters are now subject-specific parameters.
The regression parameters are considered random effects and are assumed
to follow their own distribution. Earlier, we only discussed the sampling
distribution of the regression parameters and the regression parameters were
assumed fixed (i.e., they were assumed to be fixed effects).
Figure 24.2: Scatterplot of the infant data with a trajectory (in this case, a
quadratic response curve) fitted to each infant.

As an example, consider a sample of 40 infants used in the study of a


habituation task. Suppose the infants were broken into four groups and
studied by four different psychologists. A similar habituation task is given to
the four groups, but the number of times it is performed in each group differs.
Furthermore, it is suspected that each infant will have its own trajectory
when a response curve is constructed. All of the data are presented in Figure
24.2 with a quadratic response curve fitted to each infant. When broken
down further, notice in Figure 24.3 how each group has a set of infants
with a different number of responses. Furthermore, notice how a different
trajectory was fit to each infant. Each of these trajectories has its own set
of regression parameter estimates.
Let us formulate the linear model for this setting. Suppose we have i = 1, ..., N subjects and each subject has a response vector y_i which consists of n_i measurements (notice that n is subscripted by i to signify the varying number of measurements nested within each subject; if all subjects have the same number of measurements, then n_i ≡ n). The random effects regression model is given by:

y_i = X_i β_i + ε_i,

where the X_i are known n_i × p design matrices, the β_i are regression parameters for subject i, and the ε_i are n_i × 1 vectors of random within-subject residuals distributed independently as N_{n_i}(0, σ² I_{n_i × n_i}). Furthermore, the β_i are assumed to be multivariate normally distributed with mean vector μ_β and variance-covariance matrix Σ_β. Given these assumptions, it can be shown that the y_i are marginally distributed as independent normals with mean X_i μ_β and variance-covariance matrix X_i Σ_β X_i^T + σ² I_{n_i × n_i}.

Figure 24.3: Plots for each group of infants, where each group has a different number of measurements.
Another regression model, not unrelated to random effects regression models, involves imposing another model structure on the regression coefficients. These are called multilevel (hierarchical) regression models. For example, suppose the random effects regression model above is a simple linear case (i.e., β_i = (β_{0,i}  β_{1,i})^T). We may assume the regression coefficients in the random effects regression model from above to have the following structure:

β_i = γ_0 + γ_1 u_i + δ_i.

In this regression relationship, we would also have observed the u_i's, and we assume that the δ_i are iid N(0, σ_δ²) for all i. Then, we estimate γ_0 and γ_1 directly from the data. Note the hierarchical structure of this model, hence the name.

Estimation of the random effects regression model and the multilevel regression model requires more sophisticated methods. Some common estimation methods include the use of empirical or hierarchical Bayes estimates, iteratively reweighted maximum marginal likelihood methods, and EM algorithms. Various statistical intervals can also be constructed for these models.
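As a hedged illustration, a random effects (growth-curve style) regression of this type can be fit in R with the lme4 package, which uses (restricted) maximum likelihood rather than the Bayesian or EM approaches mentioned above; the long-format data frame infants, with columns response, time, and id, is hypothetical:

library(lme4)
fit <- lmer(response ~ time + I(time^2) +   # population-average quadratic curve
              (time + I(time^2) | id),      # subject-specific random coefficients
            data = infants)
summary(fit)   # mean curve, covariance of the subject-level coefficients, and sigma^2
coef(fit)$id   # per-subject coefficient estimates (the fitted trajectories)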

24.3 Functional Linear Regression Analysis

Functional data consist of observations which can be treated as functions rather than numeric vectors. One example is fluorescence curves used in photosynthesis research, where the curve reflects the biological processes which occur during the plant's initial exposure to sunlight. Longitudinal data can be considered a type of functional data, such as taking repeated measurements over time on the same subject (e.g., blood pressure or cholesterol readings). Functional regression models are of the form

y_i(t) = μ(t)ρ(x_i) + ε_i(t),

where y_i(t), μ(t), and ε_i(t) represent the functional response, the average curve, and the error process, respectively. ρ(x_i) is a multiplicative effect modifying
the average curve according to the predictors. So each of the i = 1, ..., n trajectories (or functions) is observed at points t_1, ..., t_k in time, where k is large. In other words, we are trying to fit a regression surface for a collection of functions (i.e., we actually observe trajectories and not individual data points).

Functional regression models do sound similar to random effects regression models, but differ in a few ways. In random effects regression models, we assume that each observation's set of regression coefficients is a random variable from some distribution. However, functional regression models do not place distributions on the regression coefficients; the coefficients are treated as separate functions which are characterized by the densely sampled set of points over t_1, ..., t_k. Also, random effects regression models easily accommodate trajectories of varying dimensions, whereas this is not reflected in a functional regression model.

Estimation of μ(t) is beyond the scope of this discussion as it requires knowledge of Fourier series and more advanced multivariate techniques. Furthermore, estimation of μ(t) is intrinsically an infinite-dimensional problem. However, estimates found in the literature have been shown to possess desirable properties of an estimator. While there are also hypothesis tests available concerning these models, difficulties still exist with using these models for prediction.

24.4 Mediation Regression

Consider the following research questions found in psychology:

- Will changing social norms about science improve children's achievement in scientific disciplines?
- Can changes in cognitive attributions reduce depression?
- Does trauma affect brain stem activation in a way that inhibits memory?
Such questions suggest a chain of relations where a predictor variable affects
another variable, which then affects the response variable. A mediation
regression model attempts to identify and explicate the mechanism that
underlies an observed relationship between an independent variable and a
dependent variable, via the inclusion of a third explanatory variable called a


mediator variable.
Instead of modeling a direct, causal relationship between the independent and dependent variables, a mediation model hypothesizes that the independent variable causes the mediator variable which, in turn, causes the dependent variable. Mediation models are generally utilized in the area of psychometrics, while other scientific disciplines (including statistics) have criticized the methodology. One such criticism is that sometimes the roles of the mediator variable and the dependent variable can be switched and yield a model which explains the data equally well, thus causing identifiability issues. The model we present simply has one independent variable, one dependent variable, and one mediator variable. Models including more of any of these variables are possible to construct.
The following three regression models are used in our discussion:

1. Y = γ_0 + γ_1 X + ε_1
2. Y = β_0 + β_1 X + β_2 M + ε_2
3. M = α_0 + α_1 X + ε_3

The first model is the simple linear regression model we are familiar with. The coefficient γ_1 is the relationship between X and Y that we typically wish to study (in causal analysis, this is written as X → Y). The second and third models show how we incorporate the mediator variable into this framework so that X causes the mediator M and M causes Y (i.e., X → M → Y). So β_1 is the coefficient relating X to Y adjusted for M, β_2 is the coefficient relating M to Y adjusted for X, α_1 is the coefficient relating X to M, and ε_1, ε_2, and ε_3 are error terms for the three relationships. Figure 24.4 gives a diagram showing these relationships, sans the error terms.

The mediated effect in the above models can be calculated in two ways - either as the product α̂_1 β̂_2 or as the difference γ̂_1 − β̂_1. There are various methods for estimating these coefficients, including those based on ordinary least squares and maximum likelihood theory. To test for significance, the chosen quantity (i.e., either α̂_1 β̂_2 or γ̂_1 − β̂_1) is divided by its standard error and then the ratio is compared to a standard normal distribution. Thus, confidence intervals for the mediation effect are readily available by using the 100(1 − α/2)th percentile of the standard normal distribution.
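A hedged sketch of this calculation in R is given below, using the commonly used first-order (Sobel) standard error for the product α̂_1 β̂_2; the standard error formula is an assumption here, since the text does not specify one, and the data frame dat with columns Y, M, and X is hypothetical:

fit1 <- lm(Y ~ X, data = dat)       # model 1
fit2 <- lm(Y ~ X + M, data = dat)   # model 2: beta_1 and beta_2
fit3 <- lm(M ~ X, data = dat)       # model 3: alpha_1
a <- coef(fit3)["X"];  se.a <- coef(summary(fit3))["X", "Std. Error"]
b <- coef(fit2)["M"];  se.b <- coef(summary(fit2))["M", "Std. Error"]
med    <- a * b                                   # mediated effect alpha_1 * beta_2
se.med <- sqrt(a^2 * se.b^2 + b^2 * se.a^2)       # first-order (Sobel) standard error
z      <- med / se.med                            # compare to a standard normal
ci     <- med + c(-1, 1) * qnorm(0.975) * se.med  # 95% confidence interval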
Figure 24.4: Diagram showing the basic flow of a mediation regression model (independent variable X → mediator variable M → dependent variable Y).

Finally, the strength and form of mediated effects may depend on yet another variable. These variables, which affect the hypothesized relationship

amongst the variables already in our model, are called moderator variables
and are often tested as an interaction effect. An XM interaction in the second equation above that is significantly different from 0 suggests that the β_2 coefficient differs across different levels of X. These different coefficient levels may
reflect mediation as a manipulation, thus altering the relationship between
M and Y . The moderator variables may be either a manipulated factor in
an experimental setting (e.g., dosage of medication) or a naturally occurring
variable (e.g., gender). By examining moderator effects, one can investigate
whether the experiment differentially affects subgroups of individuals. Three
primary models involving moderator variables are typically studied:
Moderated mediation: The simplest of the three, this model has a variable which mediates the effects of an independent variable on a dependent variable, and the mediated effect depends on the level of another variable (i.e., the moderator). Thus, the mediational mechanism differs for subgroups of the study. This model is more complex from an interpretative viewpoint when the moderator is continuous. Basically, you have either X → M and/or M → Y dependent on levels of another variable (call it Z).
Mediated moderation: This occurs when a mediator is intermediate in the causal sequence from an interaction effect to a dependent variable. The purpose of this model is to determine the mediating variables that explain the interaction effect.
Mediated baseline by treatment moderation: This model is a special
case of the mediated moderation model. The basic interpretation of
the mediated effect in this model is that the mediated effect depends
on the baseline level of the mediator. This scenario is common in prevention and treatment research, where the effects of an intervention are
often stronger for participants who are at higher risk on the mediating
variable at the time they enter the program.

24.5 Meta-Regression Models

In statistics, a meta-analysis combines the results of several studies that address a set of related research hypotheses. In its simplest form, this is normally done by identifying a common measure of effect size, which is a descriptive statistic that quantifies the estimated magnitude of a relationship between variables without making any inherent assumption about whether such a relationship in the sample reflects a true relationship for the population. In a meta-analysis, a weighted average might be used as the output. The weighting might be related to the sample sizes within the individual studies. Typically, there are other differences between the studies that need to be allowed for, but the general aim of a meta-analysis is to more powerfully estimate the true effect size, as opposed to a smaller effect size derived in a single study under a given set of assumptions and conditions.
Meta-regressions are similar in essence to classic regressions, in which
a response variable is predicted according to the values of one or more predictor variables. In meta-regression, the response variable is the effect estimate
(for example, a mean difference, a risk difference, a log odds ratio or a log
risk ratio). The predictor variables are characteristics of studies that might
influence the size of intervention effect. These are often called potential
effect modifiers or covariates. Meta-regressions usually differ from simple
regressions in two ways. First, larger studies have more influence on the relationship than smaller studies, since studies are weighted by the precision
of their respective effect estimate. Second, it is wise to allow for the residual heterogeneity among intervention effects not modeled by the predictor variables. This gives rise to the random-effects meta-regression, which we discuss later.
The regression coefficient obtained from a meta-regression analysis will
describe how the response variable (the intervention effect) changes with a
unit increase in the predictor variable (the potential effect modifier). The
statistical significance of the regression coefficient is a test of whether there
is a linear relationship between intervention effect and the predictor variable.
If the intervention effect is a ratio measure, the log-transformed value of the
intervention effect should always be used in the regression model, and the
exponential of the regression coefficient will give an estimate of the relative
change in intervention effect with a unit increase in the predictor variable.
Generally, three types of meta-regression models are commonplace in the
literature:
Simple meta-regression: This model can be specified as:
yi = 0 + 1 xi,1 + 2 xi,2 + . . . + p1 xi,p1 + ,
where yi is the effect size in study i and 0 (i.e., the intercept) is the
estimated overall effect size. The variables xi,j , for j = 1, . . . , (p 1),
specify different characteristics of the study and  specifies the between
study variation. Note that this model does not allow specification of
within study variation.
Fixed-effect meta-regression: This model assumes that the true effect size is distributed as N(\theta, \sigma^2), where \sigma^2 is the within-study variance of the effect size. A fixed-effect¹ meta-regression model thus allows for within-study variability, but no between-study variability, because all studies have the identical expected fixed effect size \theta; i.e., \epsilon = 0. This model can be specified as:

y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \ldots + \beta_{p-1} x_{i,p-1} + \epsilon_i,

where \sigma_i^2 is the variance of the effect size in study i. Fixed-effect meta-regressions ignore between-study variation. As a result, parameter estimates are biased if between-study variation cannot be ignored. Furthermore, generalizations to the population are not possible.

¹Note that for the fixed-effect model, no plural is used as only ONE true effect across all studies is assumed.


Random effects meta-regression: This model rests on the assumption that \theta in N(\theta, \sigma^2) is a random variable following a hyper-distribution N(\theta^*, \tau^2). The model can be specified as:

y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \ldots + \beta_{p-1} x_{i,p-1} + \eta + \epsilon_i,

where \sigma_i^2 is the variance of the effect size in study i. The between-study variance \tau^2 (the variance of \eta) is estimated using common estimation procedures for random effects models (such as restricted maximum likelihood (REML) estimators).
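To make the three specifications concrete, the sketch below shows how fixed-effect and random effects meta-regressions might be fit in R. It is only an illustration under assumed inputs: the effect sizes yi, their within-study variances vi, and the moderator x are simulated, and the metafor package's rma() function is used as one possible implementation of the FE and REML fits.

# Hypothetical meta-regression sketch (simulated data, not from the text).
library(metafor)

set.seed(101)
k  <- 12                                    # number of studies
x  <- runif(k, 0, 10)                       # a potential effect modifier
vi <- runif(k, 0.01, 0.10)                  # assumed within-study variances
yi <- 0.2 + 0.05 * x + rnorm(k, sd = 0.15)  # assumed observed effect sizes

# Fixed-effect meta-regression: no between-study variance (tau^2 = 0).
fe.fit <- rma(yi, vi, mods = ~ x, method = "FE")

# Random effects meta-regression: tau^2 estimated by REML.
re.fit <- rma(yi, vi, mods = ~ x, method = "REML")

summary(re.fit)   # the slope estimates how the effect changes per unit of x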

24.6 Bayesian Regression

Bayesian inference is concerned with updating our model (which is based on previous beliefs) as a result of receiving incoming data. Bayesian inference is based on Bayes' Theorem, which says for two events A and B,

P(A|B) = \frac{P(B|A)P(A)}{P(B)}.

We update our model by treating the parameter(s) of interest as a random


variable and defining a distribution for the parameters based on previous
beliefs (this distribution is called a prior distribution). This is multiplied
by the likelihood function of our model and then divided by the marginal
density function (which is the joint density function with the parameter
integrated out). The result is called the posterior distribution. Luckily,
the marginal density function is just a normalizing constant and does not
usually have to be calculated in practice.
For multiple linear regression, the ordinary least squares estimate

\hat{\beta} = (X^T X)^{-1} X^T y

is constructed from the frequentist's view (along with the maximum likelihood estimate \hat{\sigma}^2 of \sigma^2) in that we assume there are enough measurements of the predictors to say something meaningful about the response. In the Bayesian view, we assume we have only a small sample of the possible measurements and we seek to correct our estimate by borrowing information from a larger set of similar observations.

The (conditional) likelihood is given as:

\ell(y|X, \beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \|y - X\beta\|^2 \right\}.

We seek a conjugate prior (a prior which yields a joint density that is of the same functional form as the likelihood). Since the likelihood is quadratic in \beta, we re-write the likelihood so it is normal in (\beta - \hat{\beta}). Write

\|y - X\beta\|^2 = \|y - X\hat{\beta}\|^2 + (\beta - \hat{\beta})^T (X^T X)(\beta - \hat{\beta}).

Now rewrite the likelihood as

\ell(y|X, \beta, \sigma^2) \propto (\sigma^2)^{-v/2} \exp\left\{ -\frac{v s^2}{2\sigma^2} \right\} (\sigma^2)^{-(n-v)/2} \exp\left\{ -\frac{1}{2\sigma^2} \|X(\beta - \hat{\beta})\|^2 \right\},

where v s^2 = \|y - X\hat{\beta}\|^2 and v = n - p, with p as the number of parameters to estimate. This suggests a form for the priors:

\pi(\beta, \sigma^2) = \pi(\sigma^2)\pi(\beta|\sigma^2).
The prior distributions are characterized by hyperparameters, which are parameter values (often data-dependent) which the researcher specifies. The prior for \sigma^2, \pi(\sigma^2), is an inverse gamma distribution with a shape hyperparameter and a scale hyperparameter. The prior for \beta, \pi(\beta|\sigma^2), is a multivariate normal distribution with location and dispersion hyperparameters \tilde{\beta} and \Lambda. This yields the joint posterior distribution:
f(\beta, \sigma^2|y, X) \propto \ell(y|X, \beta, \sigma^2)\,\pi(\beta|\sigma^2)\,\pi(\sigma^2)
\propto \exp\left\{ -\frac{1}{2\sigma^2} \left( n\bar{s} + (\beta - \bar{\beta})^T (\Lambda^{-1} + X^T X)(\beta - \bar{\beta}) \right) \right\},

where

\bar{\beta} = (\Lambda^{-1} + X^T X)^{-1} (\Lambda^{-1}\tilde{\beta} + X^T X\hat{\beta}),
n\bar{s} = \hat{\sigma}^2(n - p) + (\bar{\beta} - \tilde{\beta})^T \Lambda^{-1} (\bar{\beta} - \tilde{\beta}) + (\bar{\beta} - \hat{\beta})^T X^T X (\bar{\beta} - \hat{\beta}).
Finally, it can be shown that the distribution of \beta|X, y is a multivariate t-distribution with n + p - 1 degrees of freedom such that:

E(\beta|X, y) = \bar{\beta},
Cov(\beta|X, y) = \frac{\bar{s}(\Lambda^{-1} + X^T X)^{-1}}{n + p - 3}.
Furthermore, the distribution of \sigma^2|X, y is an inverse gamma distribution with shape parameter (n + p)/2 and scale parameter 0.5\,\bar{s}(n + p).
One can also construct statistical intervals based on draws simulated from a Bayesian posterior distribution. A 100(1 - \alpha)% credible interval is constructed by taking the middle 100(1 - \alpha)% of the values simulated from the parameter's posterior distribution. The interpretation of these intervals is that there is a 100(1 - \alpha)% chance that the true population parameter lies in the credible interval which is constructed (which is how many people initially try to interpret confidence intervals).
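As a small illustration of the conjugate updating described above, the base R sketch below computes the posterior mean \bar{\beta} for a simulated data set. The prior hyperparameters \tilde{\beta} and \Lambda below are arbitrary choices made only for the example.

# Conjugate Bayesian linear regression sketch (simulated data;
# the prior hyperparameters are arbitrary choices for illustration).
set.seed(42)
n <- 50
x <- rnorm(n)
X <- cbind(1, x)                       # design matrix with intercept
y <- 2 + 3 * x + rnorm(n, sd = 1)

p        <- ncol(X)
beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # OLS / ML estimate

beta.tilde <- rep(0, p)                # prior location for beta
Lambda     <- diag(10, p)              # prior dispersion (fairly vague)

# Posterior mean:
# beta.bar = (Lambda^{-1} + X'X)^{-1} (Lambda^{-1} beta.tilde + X'X beta.hat)
Lambda.inv <- solve(Lambda)
A          <- Lambda.inv + t(X) %*% X
beta.bar   <- solve(A, Lambda.inv %*% beta.tilde + t(X) %*% X %*% beta.hat)

beta.bar   # shrunk from the OLS estimate toward the prior location

A credible interval could then be obtained by simulating from the multivariate t posterior (or by MCMC) and taking the middle 100(1 - \alpha)% of the draws.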

24.7 Quantile Regression

The \tau th quantile of a random variable X is the value of x such that

P(X \le x) = \tau.

For example, if \tau = \frac{1}{2}, then the corresponding value of x would be the median.
This concept of quantiles can also be extended to the regression setting. In a quantile regression, we have a data set of size n with response y and predictors x = (x_1, \ldots, x_{p-1})^T, and we seek the coefficient vector \beta that minimizes the criterion

\sum_{i=1}^{n} \rho_\tau\!\left( y_i - \zeta(x_i^T, \beta) \right),

where \zeta(\cdot) is some parametric function and \rho_\tau(\cdot) is called the linear check function, defined as

\rho_\tau(x) = \tau x - x I\{x < 0\}.

For linear regression, \zeta(x_i^T, \beta) = x_i^T\beta.
We actually encountered quantile regression earlier: least absolute deviations regression is just the case of quantile regression where \tau = 1/2.
Figure 24.5 gives a plot relating food expenditure to a family's monthly household income. Overlaid on the plot is a dashed red line which gives the ordinary least squares fit. The solid blue line is the least absolute deviation fit (i.e., \tau = 0.50). The gray lines (from bottom to top) are the quantile regression fits for \tau = 0.05, 0.10, 0.25, 0.75, 0.90, and 0.95, respectively. Essentially, what this says is that households with the highest food expenditures will likely have larger regression coefficients (such as the \tau = 0.95 regression quantile), while those with the lowest food expenditures will likely have smaller regression coefficients (such as the \tau = 0.05 regression quantile).

Figure 24.5: Various quantile regression fits for the food expenditures data set (food expenditure versus household income, with the mean (LSE) fit and median (LAE) fit highlighted).

The estimates for each of these quantile regressions are as follows:
##########
Coefficients:
              tau= 0.05   tau= 0.10   tau= 0.25
(Intercept) 124.8800408 110.1415742  95.4835396
x             0.3433611   0.4017658   0.4741032

              tau= 0.75   tau= 0.90   tau= 0.95
(Intercept)  62.3965855  67.3508721  64.1039632
x             0.6440141   0.6862995   0.7090685

Degrees of freedom: 235 total; 233 residual
##########
Estimation for quantile regression can be done through linear programming or other optimization procedures. Furthermore, statistical intervals can
also be computed.
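Output of the kind shown above can be produced with the quantreg package. The sketch below uses simulated income and expenditure values rather than the actual data set, so the numbers will not match the output above.

# Quantile regression sketch with simulated data (not the food
# expenditure data, so estimates will differ from the output above).
library(quantreg)

set.seed(7)
income <- runif(235, 400, 5000)
food   <- 100 + 0.5 * income + rnorm(235, sd = 0.1 * income)

taus <- c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95)
fit  <- rq(food ~ income, tau = taus)   # one quantile regression fit per tau

coef(fit)    # matrix of intercepts and slopes across the requested quantiles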
24.8 Monotone Regression

Suppose we have a set of data (x_1, y_1), \ldots, (x_n, y_n). For ease of notation, let us assume there is already an ordering on the predictor variable. Specifically, we assume that x_1 \le \ldots \le x_n. Monotonic regression is a technique where we attempt to find a weighted least squares fit of the responses y_1, \ldots, y_n to a set of scalars a_1, \ldots, a_n with corresponding weights w_1, \ldots, w_n, subject to monotonicity constraints giving a simple or partial ordering of the responses. In other words, the fitted values are supposed to increase (or decrease) monotonically as the predictor increases, and the regression line we fit is piecewise constant (which resembles a step function). The weighted least squares problem for
monotonic regression is given by the following quadratic program:
\arg\min_{a} \sum_{i=1}^{n} w_i (y_i - a_i)^2

and is subject to one of two possible constraints. If the direction of the


trend is to be monotonically increasing, then the process is called isotonic regression and the constraint is a_i \ge a_j for all i > j. If the direction of the trend is to be monotonically decreasing, then the process is called antitonic regression and the constraint is a_i \le a_j for all i > j. More generally, one can also perform monotonic regression under an L_p criterion for p > 0:
\arg\min_{a} \sum_{i=1}^{n} w_i |y_i - a_i|^p,

with the appropriate constraints imposed for isotonic or antitonic regression.
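For the unweighted least squares case, base R's isoreg() performs isotonic regression directly, and an antitonic fit can be obtained by negating the response. The sketch below uses simulated data.

# Isotonic regression sketch with base R (unweighted case, simulated data).
set.seed(3)
x <- sort(runif(30, 0, 10))
y <- log(x + 1) + rnorm(30, sd = 0.2)   # noisy but increasing trend

iso.fit <- isoreg(x, y)   # solves min sum (y_i - a_i)^2 s.t. a_1 <= ... <= a_n
iso.fit$yf                # fitted step-function values

# Antitonic (monotonically decreasing) fit: negate the response and refit.
anti.fit <- isoreg(x, -y)
-anti.fit$yf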


Monotonic regression does have its place in statistical inference. For example, in astronomy, gamma-ray burst flux measurements are recorded over time. On the log scale, one can identify an area of flaring, which is a region where the flux measurements tend to increase. Such an area could be fit using an isotonic regression.
An example of an isotonic regression fitted to a made-up data set is given
in Figure 24.6. The top plot gives the actual isotonic regression fit. The
horizontal lines represent the values of the scalars minimizing the weighted
least squares problem given earlier. The bottom plot shows the cumulative
sums of the responses plotted against the predictors.

Figure 24.6: An example of an isotonic regression fit (top panel: the isotonic regression fit; bottom panel: the cumulative data and convex minorant).

The piecewise regression
line which is plotted is called the convex minorant. Each predictor value where this convex minorant touches the cumulative sum corresponds to a predictor value where the slope changes in the isotonic regression plot.

24.9 Spatial Regression

Suppose an econometrician is trying to quantify the price of a house. In doing so, he will surely need to incorporate neighborhood effects (e.g., how much is the house across the street valued at, as well as the one next door?). However, house prices in an adjacent neighborhood may also have an impact on the price, while house prices in the adjacent county likely will not.
The framework for such modeling is likely to incorporate some sort of spatial
effect as houses nearest to the home of interest are likely to have a greater
impact on the price while homes further away will have a smaller or negligible
impact.
Spatial regression deals with the specification, estimation, and diagnostic analysis of regression models which incorporate spatial effects. Two
broad classes of spatial effects are often distinguished: spatial heterogeneity and spatial dependency. We will provide a brief overview of both
types of effects, but it should be noted that we will only skim the surface of
what is a very rich area.
A spatial regression model reflecting spatial heterogeneity is written locally as

Y = X\beta(g) + \epsilon,

where g indicates that the regression coefficients are to be estimated locally at the coordinates specified by g and \epsilon is an error term distributed with mean 0 and variance \sigma^2. This model is called geographically weighted regression or GWR. The estimation of \beta(g) is found using a weighting scheme such that

\hat{\beta}(g) = (X^T W(g) X)^{-1} X^T W(g) Y.
The weights in the geographic weighting matrix W(g) are chosen such that those observations near the point in space where the parameter estimates are desired have more influence on the result than those observations further away. This model is essentially a local regression model like the one discussed in the section on LOESS. While the choice of a geographic (or spatially) weighted matrix is a blend of art and science, one commonly used weight is the Gaussian weight function, where the diagonal entries of the n \times n matrix W(g) are:

w_i(g) = \exp\{-d_i/h\},

where d_i is the Euclidean distance between observation i and location g, while h is the bandwidth.
The resulting parameter estimates or standard errors for the spatial heterogeneity model may be mapped in order to examine local variations in the
parameter estimates. Hypothesis tests are also possible regarding this model.
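Because the GWR estimate above is just a weighted least squares fit at each location, it can be sketched directly in base R. The coordinates, bandwidth, and weight function below are illustrative assumptions; dedicated packages automate steps such as the bandwidth choice.

# Hand-rolled GWR estimate at a single location g (illustrative data).
set.seed(11)
n      <- 100
coords <- cbind(runif(n), runif(n))       # spatial coordinates of observations
x      <- rnorm(n)
y      <- 1 + 2 * x + rnorm(n)
X      <- cbind(1, x)

g <- c(0.5, 0.5)                          # location where beta(g) is wanted
h <- 0.2                                  # assumed bandwidth

d <- sqrt(rowSums((coords - matrix(g, n, 2, byrow = TRUE))^2))  # distances d_i
w <- exp(-d / h)                          # weight function from the text
W <- diag(w)

# beta.hat(g) = (X' W(g) X)^{-1} X' W(g) Y
beta.g <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
beta.g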
Spatial regression models also accommodate spatial dependency in two
major ways: through a spatial lag dependency (where the spatial correlation
occurs in the dependent variable) or a spatial error dependency (where the
spatial correlation occurs through the error term). A spatial lag model is
a spatial regression model which models the response as a function of not
only the predictors, but also values of the response observed at other (likely
neighboring) locations:
y_i = f(y_{j(i)}; \rho) + X_i^T\beta + \epsilon_i,
where j(i) is an index including all of the neighboring locations j of i such that i \neq j. The function f can be very general, but typically is simplified by using a spatially weighted matrix (as introduced earlier).
Assuming a spatially weighted matrix W(g) which has row-standardized spatial weights (i.e., \sum_{j=1}^{n} w_{i,j} = 1), we obtain a mixed regressive, spatial autoregressive model:

y_i = \rho\sum_{j=1}^{n} w_{i,j} y_j + X_i^T\beta + \epsilon_i,

where \rho is the spatial autoregressive coefficient. In matrix notation, we have

Y = \rho W(g)Y + X\beta + \epsilon.

The proper solution to the equation for all observations requires (after some matrix algebra)

Y = (I_{n\times n} - \rho W)^{-1}X\beta + (I_{n\times n} - \rho W)^{-1}\epsilon

to be solved simultaneously for \rho and \beta.
The inclusion of a spatial lag is similar to a time series model, although with a fundamental difference. Unlike time dependency, a spatial dependency is multidirectional, implying feedback effects and simultaneity. More precisely, if i and j are neighboring locations, then y_j enters on the right-hand side of the equation for y_i, but y_i also enters on the right-hand side of the equation for y_j.
In a spatial error model, the spatial autocorrelation does not enter as an additional variable in the model, but rather enters only through its effects on the covariance structure of the random disturbance term. In other words, Var(\epsilon) = \Sigma such that the off-diagonals of \Sigma are not 0. One common way to model the error structure is through direct representation, which is similar to the weighting scheme used in GWR. In this setting, the off-diagonals of \Sigma are given by \Sigma_{i,j} = \sigma^2 g(d_{i,j}, \phi), where again d_{i,j} is the Euclidean distance between locations i and j and \phi is a vector of parameters which may include a bandwidth parameter.
Another way to model the error structure is through a spatial process,
such as specifying the error terms to have a spatial autoregressive structure
as in the spatial lag model from earlier:

\epsilon = \lambda W(g)\epsilon + u,

where u is a vector of random error terms. Other spatial processes exist, such as a conditional autoregressive process and a spatial moving average process, both of which resemble the corresponding time series processes. Estimation of these spatial regression models can be accomplished through various techniques, but the techniques differ depending on whether you have a spatial lag dependency or a spatial error dependency. Such estimation methods include maximum likelihood estimation, the use of instrumental variables, and semiparametric methods.
There are also tests for the spatial autocorrelation coefficient, of which the most notable uses Moran's I statistic. Moran's I statistic is calculated as

I = \frac{e^T W(g) e / S_0}{e^T e / n},

where e is a vector of ordinary least squares residuals, W(g) is a geographic weighting matrix, and S_0 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{i,j} is a normalizing factor. Then, Moran's I test can be based on a normal approximation using a standardized value of the I statistic such that

E(I) = \frac{\mathrm{tr}(MW)}{n - p}

and

Var(I) = \frac{\mathrm{tr}(MWMW^T) + \mathrm{tr}(MWMW) + [\mathrm{tr}(MW)]^2}{(n - p)(n - p + 2)},

where M = I_{n\times n} - X(X^T X)^{-1}X^T.
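The statistic itself is simple to compute once a weighting matrix is available. The sketch below evaluates I for the OLS residuals of a simulated model, with a made-up distance-based weight matrix standing in for W(g).

# Computing Moran's I for OLS residuals (simulated data; the
# distance-decay weight matrix is an illustrative stand-in for W(g)).
set.seed(23)
n      <- 80
coords <- cbind(runif(n), runif(n))
x      <- rnorm(n)
y      <- 1 + 0.5 * x + rnorm(n)

e <- residuals(lm(y ~ x))                 # ordinary least squares residuals

D <- as.matrix(dist(coords))              # pairwise Euclidean distances
W <- exp(-D / 0.3); diag(W) <- 0          # simple decay weights, zero diagonal
S0 <- sum(W)                              # normalizing factor

I <- as.numeric((t(e) %*% W %*% e / S0) / (t(e) %*% e / n))
I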


As an example, let us consider 1978 house prices in Boston, to which we will try to fit a spatial regression model with spatial error dependency. There are 20 variables measured for 506 locations. Certain transformations on the predictors have already been performed per the investigators' claims. In particular, only 13 of the predictors are of interest. First, a test on the spatial autocorrelation coefficient is performed:
##########
Global Moran's I for regression residuals

data:
model: lm(formula = log(MEDV) ~ CRIM + ZN + INDUS + CHAS +
    I(NOX^2) + I(RM^2) + AGE + log(DIS) + log(RAD) + TAX +
    PTRATIO + B + log(LSTAT), data = boston.c)
weights: boston.listw

Moran I statistic standard deviate = 14.5085, p-value < 2.2e-16
alternative hypothesis: two.sided
sample estimates:
Observed Moran's I       Expectation          Variance
      0.4364296993     -0.0168870829      0.0009762383
##########
As can be seen, the p-value is very small and so the spatial autocorrelation
coefficient is significant.
Next, we attempt to fit a spatial regression model with spatial error dependency including those variables that the investigator specified:
##########
Call: errorsarlm(formula = log(MEDV) ~ CRIM + ZN + INDUS + CHAS
    + I(NOX^2) + I(RM^2) + AGE + log(DIS) + log(RAD) + TAX
    + PTRATIO + B + log(LSTAT), data = boston.c,
    listw = boston.listw)

Residuals:
       Min         1Q     Median         3Q        Max
-0.6476342 -0.0676007  0.0011091  0.0776939  0.6491629

Type: error
Coefficients: (asymptotic standard errors)
               Estimate  Std. Error  z value   Pr(>|z|)
(Intercept)  3.85706025  0.16083867  23.9809  < 2.2e-16
CRIM        -0.00545832  0.00097262  -5.6120  2.000e-08
ZN           0.00049195  0.00051835   0.9491  0.3425907
INDUS        0.00019244  0.00282240   0.0682  0.9456389
CHAS1       -0.03303428  0.02836929  -1.1644  0.2442466
I(NOX^2)    -0.23369337  0.16219194  -1.4408  0.1496286
I(RM^2)      0.00800078  0.00106472   7.5145  5.707e-14
AGE         -0.00090974  0.00050116  -1.8153  0.0694827
log(DIS)    -0.10889420  0.04783714  -2.2764  0.0228249
log(RAD)     0.07025730  0.02108181   3.3326  0.0008604
TAX         -0.00049870  0.00012072  -4.1311  3.611e-05
PTRATIO     -0.01907770  0.00564160  -3.3816  0.0007206
B            0.00057442  0.00011101   5.1744  2.286e-07
log(LSTAT)  -0.27212781  0.02323159 -11.7137  < 2.2e-16

Lambda: 0.70175 LR test value: 211.88 p-value: < 2.22e-16
Asymptotic standard error: 0.032698
    z-value: 21.461 p-value: < 2.22e-16
Wald statistic: 460.59 p-value: < 2.22e-16

Log likelihood: 255.8946 for error model
ML residual variance (sigma squared): 0.018098, (sigma: 0.13453)
Number of observations: 506
Number of parameters estimated: 16
AIC: -479.79, (AIC for lm: -269.91)
##########
As can be seen, there are some predictors that do not appear to be significant.
Model selection procedures can be employed or other transformations can be
tried in order to improve the fit of this model.
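Output like that shown above can be generated with the spdep and spatialreg packages. The sketch below only indicates the general workflow; mydata, x1, x2, and the spatial weights object nb.listw are placeholders that would have to be built from the actual locations (e.g., with nb2listw()).

# Sketch of the spatial error model workflow (placeholder data and weights).
library(spdep)
library(spatialreg)   # errorsarlm() lives here in recent package versions

# Ordinary least squares fit first.
ols.fit <- lm(y ~ x1 + x2, data = mydata)            # mydata is a placeholder

# Test the OLS residuals for spatial autocorrelation via Moran's I.
lm.morantest(ols.fit, listw = nb.listw)

# Fit the spatial error model by maximum likelihood.
sem.fit <- errorsarlm(y ~ x1 + x2, data = mydata, listw = nb.listw)
summary(sem.fit)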

24.10 Circular Regression

A circular random variable is one which takes values on the circumference of a circle (i.e., the angle is in the range (0, 2\pi) radians or (0°, 360°)). A circular-circular regression is used to determine the relationship between a circular predictor variable X and a circular response variable Y. Circular data occur when there is periodicity to the phenomenon at hand or when there are naturally angular measurements. An example could be determining the relationship between wind direction measurements taken on an aircraft (the response) and wind direction measurements taken by radar (the predictor). Another related model is one where only the response is a circular variable while the predictor is linear. This is called a circular-linear regression. Both types of circular regression models can be given by

y_i = \beta_0 + \beta_1 x_i + \epsilon_i \pmod{2\pi}.
The expression \epsilon_i \pmod{2\pi} is read as "\epsilon_i modulus 2\pi" and is a way of expressing the remainder of the quantity \epsilon_i/(2\pi).² In this model, \epsilon_i is a circular random error assumed to follow a von Mises distribution with circular mean 0 and concentration parameter \kappa.

²For example, 11 (mod 7) = 4 because 11 divided by 7 leaves a remainder of 4.

The von Mises distribution is the circular analog of the univariate normal distribution, but has a more complex form. The von Mises distribution with circular mean \mu and concentration parameter \kappa is defined on the range x \in [0, 2\pi), with probability density function
f(x) = \frac{e^{\kappa\cos(x - \mu)}}{2\pi I_0(\kappa)}

and cumulative distribution function

F(x) = \frac{1}{2\pi I_0(\kappa)}\left( x I_0(\kappa) + 2\sum_{j=1}^{\infty}\frac{I_j(\kappa)\sin(j(x - \mu))}{j} \right).

In the above, I_p(\kappa) is called a modified Bessel function of the first kind of order p. The Bessel function is the contour integral

I_p(z) = \frac{1}{2\pi i}\oint e^{(z/2)(t - 1/t)}\, t^{-(p+1)}\, dt,

where the contour encloses the origin and traverses in a counterclockwise direction in the complex plane such that i = \sqrt{-1}. Maximum likelihood


estimates can be obtained for the circular regression models (with minor
differences in the details when dealing with a circular predictor or linear
predictor). Needless to say, such formulas do not lend themselves well to closed-form solutions. Thus we turn to numerical methods, which go beyond
the scope of this course.
As an example, suppose we have a data set of size n = 100 where Y is a circular response and X is a continuous predictor (so a circular-linear regression model will be built). The error terms are assumed to follow a von Mises distribution with circular mean 0 and concentration parameter \kappa (for this generated data, \kappa = 1.9). The error terms used in the generation of this data can be plotted on a circular histogram as given in Figure 24.7(a).
Estimates for the circular-linear regression fit are given below:
##########
Circular-Linear Regression

Coefficients:
     Estimate Std. Error t value Pr(>|t|)
[1,]   6.7875     1.1271   6.022 8.61e-10 ***
[2,]   0.9618     0.2223   4.326 7.58e-06 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1

Log-Likelihood: 55.89

Summary: (mu in radians)
  mu: 0.4535 ( 0.08698 )  kappa: 1.954 ( 0.2421 )
p-values are approximated using normal distribution
##########

Figure 24.7: (a) Plot of the von Mises error terms used in the generation of the sample data. (b) Plot of the continuous predictor (X) versus the circular response (Y) along with the circular-linear regression fit.
Notice that the maximum likelihood estimates of \mu and \kappa are 0.4535 and 1.954, respectively. Both estimates are close to the values used for generation of the error terms. Furthermore, the values in parentheses next to these estimates are the standard errors for the estimates, both of which are relatively small.
A rough way of looking at the data and estimated circular-linear regression equation is given in Figure 24.7(b). This is difficult to display since we are trying to look at a circular response versus a continuous predictor. Packages specific to circular regression modeling provide better graphical alternatives.
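Since the circular-linear model above has a von Mises likelihood, its parameters can also be estimated by direct numerical maximization in base R. The sketch below does this with optim() on simulated data; the wrapped-normal errors are only a convenient stand-in for von Mises errors, and dedicated packages provide ready-made fitting routines and plots.

# Direct maximum likelihood for the circular-linear model
# y_i = beta0 + beta1 * x_i + eps_i (mod 2*pi), eps_i ~ von Mises(0, kappa).
set.seed(5)
n   <- 100
x   <- runif(n, -2, 2)
eps <- rnorm(n, 0, 0.6)                   # wrapped-normal stand-in for von Mises errors
y   <- (1.0 + 0.8 * x + eps) %% (2 * pi)  # circular response in [0, 2*pi)

# Negative log-likelihood of the von Mises circular-linear model.
negll <- function(par) {
  b0 <- par[1]; b1 <- par[2]; kappa <- exp(par[3])   # kappa kept positive
  mu <- b0 + b1 * x                                  # cos() handles the mod 2*pi
  -(sum(kappa * cos(y - mu)) - n * log(2 * pi * besselI(kappa, 0)))
}

fit <- optim(c(0, 0, 0), negll)
c(beta0 = fit$par[1], beta1 = fit$par[2], kappa = exp(fit$par[3]))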

24.11 Mixtures of Regressions

Consider a large data set consisting of the heights of males and females. When looking at the distribution of this data, the data for the males will (on average) be higher than that of the females. A histogram of this data would clearly show two distinct bumps or modes. Knowing the gender label of each subject would allow one to account for that subgroup in the analysis being used. However, what happens if the gender label of each subject were lost? In other words, we don't know which observation belongs to which gender. The setting where data appear to come from multiple subgroups, but there is no label providing such identification, is the focus of the area called mixture modeling.

Figure 24.8: (a) Plot of spark-ignition engine fuel data with equivalence ratio as the response and the measure of nitrogen oxide emissions as the predictor. (b) Plot of the same data with EM algorithm estimates from a 2-component mixture of regressions fit.
There are many issues one should be cognizant of when building a mixture
model. In particular, maximum likelihood estimation can be quite complex
since the likelihood does not yield closed-form solutions and there are identifiability issues (however, the use of a Newton-Raphson or EM algorithm
usually provides a good solution). One alternative is to use a Bayesian approach with Markov Chain Monte Carlo (MCMC) methods, but this too has
its own set of complexities. While we do not explore these issues, we do see
how a mixture model can occur in the regression setting.
A mixture of linear regressions model can be used when it appears that there is more than one regression line that could fit the data due to some underlying characteristic (i.e., a latent variable). Suppose we have n observations, each of which belongs to one of k groups. If we knew to which group an observation belonged (i.e., its label), then we could write down explicitly the linear regression model given that observation i belongs to group j:

y_i = X_i^T\beta_j + \epsilon_{ij},

such that \epsilon_{ij} is normally distributed with mean 0 and variance \sigma_j^2. Notice how the regression coefficients and variance terms are different for each group. However, now assume that the labels are unobserved. In this case, we can only assign a probability that observation i came from group j. Specifically, the density function for the mixture of linear regressions model is:


f(y_i) = \sum_{j=1}^{k} \lambda_j (2\pi\sigma_j^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma_j^2}(y_i - X_i^T\beta_j)^2 \right\},

such that \sum_{j=1}^{k} \lambda_j = 1. Estimation is done by using the likelihood (or rather
log likelihood) function based on the above density. For maximum likelihood,
one typically uses an EM algorithm.
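An EM fit of this kind can be obtained with the mixtools package. The sketch below fits a 2-component mixture of regressions to simulated data with one increasing and one decreasing component; the actual equivalence-ratio data used in the example that follows are not reproduced here.

# Two-component mixture of linear regressions via EM (simulated data).
library(mixtools)

set.seed(9)
n <- 200
x <- runif(n, 0, 4)
z <- rbinom(n, 1, 0.5)                          # unobserved component labels
y <- ifelse(z == 1,
            0.5 + 0.1 * x + rnorm(n, sd = 0.05),   # increasing component
            1.2 - 0.1 * x + rnorm(n, sd = 0.05))   # decreasing component

em.fit <- regmixEM(y, x, k = 2)   # EM algorithm for the mixture of regressions
em.fit$beta                       # component regression coefficients
em.fit$sigma                      # component error standard deviations
em.fit$lambda                     # estimated mixing proportions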
As an example, consider the data set which gives the equivalence ratios and peak nitrogen oxide emissions in a study using pure ethanol as a spark-ignition engine fuel. A plot of the equivalence ratios versus the measure of nitrogen oxide is given in Figure 24.8(a). Suppose one wanted to predict the equivalence ratio from the amount of nitrogen oxide emissions. As you can see, there appear to be groups of data where separate regressions appear appropriate (one with a positive trend and one with a negative trend). Figure 24.8(b) gives the same plot, but with estimates from an EM algorithm overlaid. The EM algorithm estimates for this data are \hat{\beta}_1 = (0.565, 0.085)^T, \hat{\beta}_2 = (1.247, -0.083)^T, \hat{\sigma}_1^2 = 0.00188, and \hat{\sigma}_2^2 = 0.00058.
It should be noted that mixtures of regressions appear in many areas.
For example, in economics it is called switching regimes. In the social
sciences it is called latent class regressions. As we saw earlier, the neural


networking terminology calls this model (without the hierarchical structure)
the mixture-of-experts problem.
