Chapter 17
Polynomial Regression
In this chapter, we provide models to account for curvature in a data set.
This curvature may reflect an overall trend of the underlying population, or it
may be a structure specific to a certain region of the predictor space. We
will explore two common methods in this chapter.
17.1 Polynomial Regression

The polynomial regression model with a single predictor is
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \ldots + \beta_h x_i^h + \epsilon_i, \tag{17.1}$$
where h is called the degree of the polynomial. For lower degrees, the
relationship has a specific name (e.g., h = 2 is called quadratic, h = 3 is
called cubic, h = 4 is called quartic, and so on). As for a bit of semantics,
it was noted at the beginning of the previous course that nonlinear regression
refers to models that are nonlinear in the parameters; a polynomial model
describes curvature in the data, but it is still linear in the regression
coefficients.
$$\mathbf{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{50} \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_{50} & x_{50}^2 \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}, \quad
\boldsymbol{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{50} \end{pmatrix},$$
where the entries in Y and X would consist of the raw data. So as you can
see, we are in a setting where the analysis techniques used in multiple linear
regression (e.g., OLS) are applicable here.
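As a minimal illustration, a quadratic polynomial can be fit by ordinary least squares in R; the data frame name dat and its columns x and y here are assumptions, not part of the original example.

# Fit a degree-2 polynomial by OLS ('dat' with columns x and y is assumed)
fit <- lm(y ~ x + I(x^2), data = dat)
summary(fit)    # t-tests for the linear and quadratic terms
# Equivalently, orthogonal polynomials reduce the collinearity between x and x^2:
fit.orth <- lm(y ~ poly(x, degree = 2), data = dat)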
Figure 17.1: (a) Scatterplot of the quadratic data with the OLS line. (b)
Residual plot for the OLS fit. (c) Histogram of the residuals. (d) NPP for
the Studentized residuals.
 i    xi      yi     |  i    xi      yi     |  i    xi      yi
 1    6.6    -45.4   | 21    8.4   -106.5   | 41    8.0    -95.8
 2   10.1   -176.6   | 22    7.2    -63.0   | 42    8.9   -126.2
 3    8.9   -127.1   | 23   13.2   -362.2   | 43   10.1   -179.5
 4    6.0    -31.1   | 24    7.1    -61.0   | 44   11.5   -252.6
 5   13.3   -366.6   | 25   10.4   -194.0   | 45   12.9   -338.5
 6    6.9    -53.3   | 26   10.8   -216.4   | 46    8.1    -97.3
 7    9.0   -131.1   | 27   11.9   -278.1   | 47   14.9   -480.5
 8   12.6   -320.9   | 28    9.7   -162.7   | 48   13.7   -393.6
 9   10.6   -204.8   | 29    5.4    -21.3   | 49    7.8    -87.6
10   10.3   -189.2   | 30   12.1   -284.8   | 50    8.5   -105.4
11   14.1   -421.2   | 31   12.1   -287.5   |
12    8.6   -113.1   | 32   12.1   -290.8   |
13   14.9   -482.3   | 33    9.2   -137.4   |
14    6.5    -42.9   | 34    6.7    -47.7   |
15    9.3   -144.8   | 35   12.1   -292.3   |
16    5.2    -14.2   | 36   13.2   -356.4   |
17   10.7   -211.3   | 37   11.0   -228.5   |
18    7.5    -75.4   | 38   13.1   -354.4   |
19   14.9   -482.7   | 39    9.2   -137.2   |
20   12.2   -295.6   | 40   13.2   -361.6   |
Table 17.1: The simulated 2-degree polynomial data set with n = 50 values.
In general, you should obey the hierarchy principle, which says that
if your model includes $X^h$ and $X^h$ is shown to be a statistically significant
predictor of $Y$, then your model should also include each $X^j$ for
all $j < h$, whether or not the coefficients for these lower-order terms
are significant.
17.2 Response Surface Regression

Response surface regression models are usually fit to data from a designed
experiment, with the factor levels coded so that they are centered at 0 and
scaled to run from -1 to +1. The coded value of the jth factor for the ith
observation is
$$X_{i,j} = \frac{(\text{original level}) - (\max + \min)/2}{(\max - \min)/2},$$
where max and min are the largest and smallest levels of that factor in the
design. For example, if the first factor is run at levels of 10%, 20%, and 30%,
then
$$X_{i,1} = \begin{cases} \dfrac{10 - (30+10)/2}{(30-10)/2} = -1, & \text{if the level is } 10\%; \\[1ex] \dfrac{20 - (30+10)/2}{(30-10)/2} = 0, & \text{if the level is } 20\%; \\[1ex] \dfrac{30 - (30+10)/2}{(30-10)/2} = +1, & \text{if the level is } 30\%. \end{cases}$$
Figure 17.2: (a) The points of a square portion of a design with factor levels
coded at ±1. This is how a $2^2$ factorial design is coded. (b) Illustration of
the axial (or star) points of a design at (+a, 0), (-a, 0), (0, -a), and (0, +a). (c)
A diagram which shows the combination of the previous two diagrams with
the design center at (0, 0). This final diagram is how a composite design is
coded.
Typically response surface regression models only have two-way interactions while polynomial regression models can (in theory) have k-way
interactions.
The response surface regression models we outlined are for a factorial
design. Figure 17.2 shows how a factorial design can be diagramed as
a square using factorial points. More elaborate designs can be constructed, such as a central composite design, which takes into consideration axial (or star) points (also illustrated in Figure 17.2). Figure
17.2 pertains to a design with two factors while Figure 17.3 pertains to
a design with three factors.
We mentioned that response surface regression follows the hierarchy principle. However, some texts and software do report ANOVA tables which do
not quite follow the hierarchy principle. While fundamentally there is nothing wrong with these tables, it really boils down to a matter of terminology.
If the hierarchy principle is not in place, then technically you are just performing a polynomial regression.
Table 17.2 gives a list of all possible terms when assuming an $h$th-order
response surface model with k factors. For any interaction that appears in
Figure 17.3: (a) The points of a cube portion of a design with factor levels
coded at the corners of the cube. This is how a $2^3$ factorial design is coded.
(b) Illustration of the axial (or star) points of this design. (c) A diagram
which shows the combination of the previous two diagrams with the design
center at (0, 0, 0). This final diagram is how a composite design is coded.
the model (e.g., $X_i^{h_1}X_j^{h_2}$ such that $h_2 \leq h_1$), then the hierarchy principle
says that at least the main factor effects for powers $1, \ldots, h_1$ must appear in the
model, that all $h_1$-order interactions with the factor powers $1, \ldots, h_2$ must
appear in the model, and that all interactions of order less than $h_1$ must appear
in the model. Luckily, response surface regression models (and polynomial
models for that matter) rarely go beyond $h = 3$.
For the next step, an ANOVA table is usually constructed to assess the
significance of the model. Since the factor levels are all essentially treated as
categorical variables, the designed experiment will usually result in replicates
for certain factor level combinations. This is unlike multiple regression where
the predictors are usually assumed to be continuous and no predictor level
combinations are assumed to be replicated. Thus, a formal lack of fit test
is also usually incorporated. Furthermore, the SSR is also broken down
into the components making up the full model, so you can formally test the
contribution of those components to the fit of your model.
An example of a response surface regression ANOVA is given in Table
17.3. Since it is not possible to compactly show a generic ANOVA table nor
to compactly express the formulas, this example is for a quadratic model
with linear interaction terms. The formulas will be similar to their respective quantities defined earlier. For this example, assume that there are k factors, q regression parameters (including the intercept), m distinct factor-level combinations in the design, and n total observations.
Relevant Terms
$X_i, X_i^2, X_i^3, \ldots, X_i^h$ for all $i$
$X_i X_j$ for all $i < j$
$X_i^2 X_j$ for $i \neq j$ and $X_i^2 X_j^2$ for all $i < j$
$X_i^3 X_j, X_i^3 X_j^2$ for $i \neq j$ and $X_i^3 X_j^3$ for all $i < j$
$\vdots$
$X_i^h X_j, X_i^h X_j^2, X_i^h X_j^3, \ldots, X_i^h X_j^{h-1}$ for $i \neq j$ and $X_i^h X_j^h$ for all $i < j$

Table 17.2: A table showing all of the terms that could be included in a
response surface regression model. In the above, the indices for the factors
are given by $i = 1, \ldots, k$ and $j = 1, \ldots, k$.
Source         df           SS       MS        F
Regression     q - 1        SSR      MSR       MSR/MSE
  Linear       k            SSLIN    MSLIN     MSLIN/MSE
  Quadratic    k            SSQUAD   MSQUAD    MSQUAD/MSE
  Interaction  q - 2k - 1   SSINT    MSINT     MSINT/MSE
Error          n - q        SSE      MSE
  Lack of Fit  m - q        SSLOF    MSLOF     MSLOF/MSPE
  Pure Error   n - m        SSPE     MSPE
Total          n - 1        SSTO

Table 17.3: ANOVA table for a response surface regression model with linear,
quadratic, and linear interaction terms.
17.3 Examples
Figure 17.4: The yield data set with (a) a linear fit and (b) a quadratic fit.
Table 17.5: The odor data set measurements with the factor levels already
coded.
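Before looking at the fitted summary, here is a minimal R sketch of how a full second-order model for these data could be specified; the data frame name odor and the variable names temp, ratio, and height are assumptions based on the output that follows.

# Sketch: full second-order response surface model in coded units
# (data frame 'odor' with columns odor, temp, ratio, height is assumed)
fit <- lm(odor ~ temp + ratio + height +
                 I(temp^2) + I(ratio^2) + I(height^2),
          data = odor)
summary(fit)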
First we will fit a response surface regression model consisting of all of the
first-order and second-order terms. The summary of this fit is given below:
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -30.667     10.840  -2.829   0.0222 *
temp         -12.125      6.638  -1.827   0.1052
ratio        -17.000      6.638  -2.561   0.0336 *
height       -21.375      6.638  -3.220   0.0122 *
temp2         32.083      9.771   3.284   0.0111 *
ratio2        47.833      9.771   4.896   0.0012 **
height2        6.083      9.771   0.623   0.5509
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##########
Finally, contour and surface plots can also be generated for the response
surface regression model. Figure 17.5 gives the contour plots (with odor as
the contours) for each of the three levels of height (Figure 17.6 gives color
versions of the plots). Notice how the contours are increasing as we go out to
the corner points of the design space (so it is as if we are looking down into
a cone). The surface plots of Figure 17.7 all look similar (with the exception
of the temperature scale), but notice the curvature present in these plots.
Figure 17.5: The contour plots of ratio versus temperature with odor as a
response for (a) height=-1, (b) height=0, and (c) height=+1.
Figure 17.6: The contour plots of ratio versus temperature with odor as a
response for (a) height=-1, (b) height=0, and (c) height=+1.
Figure 17.7: The surface plots of ratio versus temperature with odor as a
response for (a) height=-1, (b) height=0, and (c) height=+1.
Chapter 18
Biased Regression Methods
and Regression Shrinkage
Recall earlier that we dealt with multicollinearity (i.e., a near-linear relationship amongst some of the predictors) by centering the variables in order to
reduce the variance inflation factors (which reduces the linear dependency).
When multicollinearity occurs, the ordinary least squares estimates are still
unbiased, but the variances are very large. However, we can add a degree
of bias to the estimation process, thus reducing the variance (and standard
errors). This concept is known as the bias-variance tradeoff due to the
functional relationship between the two values. We proceed to discuss some
popular methods for producing biased regression estimates when faced with
a high degree of multicollinearity.
The assumptions made for these methods are mostly the same as in the
multiple linear regression model. Namely, we assume linearity, constant variance, and independence. Any apparent violation of these assumptions must
be dealt with first. However, these methods do not yield statistical intervals
due to uncertainty in the distributional assumption, so normality of the data
is not assumed.
One additional note is that the procedures in this section are often referred
to as shrinkage methods. They are called shrinkage methods because, as
we will see, the regression estimates we obtain cover a smaller range than
those from ordinary least squares.
18.1 Ridge Regression
Perhaps the most popular (albeit controversial) and widely studied biased
regression technique to deal with multicollinearity is ridge regression. Before we get into the computational side of ridge regression, let us recall from
the last course how to perform a correlation transformation (and the corresponding notation) which is performed by standardizing the variables.
The standardized X matrix is given as:
$$\mathbf{X}^{*} = \frac{1}{\sqrt{n-1}}\begin{pmatrix}
\frac{X_{1,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{1,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{1,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}} \\
\frac{X_{2,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{2,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{2,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{X_{n,1}-\bar{X}_1}{s_{X_1}} & \frac{X_{n,2}-\bar{X}_2}{s_{X_2}} & \cdots & \frac{X_{n,p-1}-\bar{X}_{p-1}}{s_{X_{p-1}}}
\end{pmatrix},$$
and the standardized response vector is
$$\mathbf{Y}^{*} = \frac{1}{\sqrt{n-1}}\begin{pmatrix}
\frac{Y_1-\bar{Y}}{s_Y} \\ \frac{Y_2-\bar{Y}}{s_Y} \\ \vdots \\ \frac{Y_n-\bar{Y}}{s_Y}
\end{pmatrix},$$
where
$$s_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}}$$
and the $s_{X_j}$ are defined analogously.
With these standardized variables, the ordinary least squares estimate becomes
$$\hat{\boldsymbol{\beta}}^{*} = \mathbf{r}_{XX}^{-1}\mathbf{r}_{XY},$$
where $\mathbf{r}_{XX}$ is the $(p-1)\times(p-1)$ correlation matrix of the predictors and
$\mathbf{r}_{XY}$ is the $(p-1)$-dimensional vector of correlation coefficients between the
predictors and the response. Thus $\hat{\boldsymbol{\beta}}^{*}$ is a function of correlations and hence
we have performed a correlation transformation.
Notice further that
$$\mathrm{E}[\hat{\boldsymbol{\beta}}^{*}] = \boldsymbol{\beta}^{*}$$
and
$$\mathrm{V}[\hat{\boldsymbol{\beta}}^{*}] = \sigma^2\mathbf{r}_{XX}^{-1} = \mathbf{r}_{XX}^{-1}.$$
For the variance-covariance matrix, $\sigma^2 = 1$ because we have standardized all
of the variables.
Ridge regression adds a small value $k$ (called a biasing constant) to
the diagonal elements of the correlation matrix. (Recall that a correlation
matrix has 1s down the diagonal, so it can sort of be thought of as a ridge.)
Mathematically, we have
$$\hat{\boldsymbol{\beta}}^{R} = (\mathbf{r}_{XX} + k\mathbf{I}_{(p-1)\times(p-1)})^{-1}\mathbf{r}_{XY},$$
where $0 < k < 1$, but usually less than 0.3. The amount of bias in this
estimator is given by
$$\mathrm{E}[\hat{\boldsymbol{\beta}}^{R}] - \boldsymbol{\beta}^{*} = \left[(\mathbf{r}_{XX} + k\mathbf{I}_{(p-1)\times(p-1)})^{-1}\mathbf{r}_{XX} - \mathbf{I}_{(p-1)\times(p-1)}\right]\boldsymbol{\beta}^{*},$$
and the variance-covariance matrix is given by
$$\mathrm{V}[\hat{\boldsymbol{\beta}}^{R}] = (\mathbf{r}_{XX} + k\mathbf{I}_{(p-1)\times(p-1)})^{-1}\mathbf{r}_{XX}(\mathbf{r}_{XX} + k\mathbf{I}_{(p-1)\times(p-1)})^{-1}.$$
Remember that $\hat{\boldsymbol{\beta}}^{R}$ is calculated on the standardized variables (these are sometimes
called the standardized ridge regression estimates). We can transform back to the
original scale via
$$\hat{\beta}_j = \frac{s_Y}{s_{X_j}}\hat{\beta}^{R}_j \qquad\text{and}\qquad \hat{\beta}_0 = \bar{y} - \sum_{j=1}^{p-1}\hat{\beta}_j\bar{x}_j,$$
where $j = 1, 2, \ldots, p-1$.
How do we choose k? Many methods exist, but there is no agreement
on which to use, mainly due to instability in the estimates asymptotically.
Two methods are primarily used: one graphical and one analytical. The first
method is called the fixed point method and uses the estimates provided
by fitting the correlation transformation via ordinary least squares. This
method suggests using
$$k = \frac{(p-1)\,\mathrm{MSE}}{\hat{\boldsymbol{\beta}}^{*T}\hat{\boldsymbol{\beta}}^{*}},$$
where MSE is the mean square error obtained from the respective fit.
Another method is the Hoerl-Kennard iterative method. This method
calculates
$$k^{(t)} = \frac{(p-1)\,\mathrm{MSE}}{\hat{\boldsymbol{\beta}}^{(t-1)T}\hat{\boldsymbol{\beta}}^{(t-1)}},$$
where $\hat{\boldsymbol{\beta}}^{(t-1)}$ is the ridge estimate computed with the biasing constant $k^{(t-1)}$;
the iteration is continued until the change in $k^{(t)}$ is negligible.
18.2 Principal Components Regression
Suppose, for example, that we decide to omit three of the principal components.
Therefore, you set the last three entries of $\hat{\boldsymbol{\beta}}_{Z}$ equal to 0.
With this value of $\hat{\boldsymbol{\beta}}_{Z}$, we can transform back to get the coefficients on the
X scale by
$$\hat{\boldsymbol{\beta}}_{PC} = \mathbf{P}\hat{\boldsymbol{\beta}}_{Z}.$$
This is a solution to
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}.$$
Notice that we have not reduced the dimension of $\hat{\boldsymbol{\beta}}_{Z}$ from the original
calculation, but have only set certain values equal to 0. Furthermore, as in
ridge regression, we can transform back to the original scale by
$$\hat{\beta}_{PC,j} = \frac{s_Y}{s_{X_j}}\hat{\beta}^{*}_{PC,j} \qquad\text{and}\qquad \hat{\beta}_{PC,0} = \bar{y} - \sum_{j=1}^{p-1}\hat{\beta}_{PC,j}\bar{x}_j,$$
where $j = 1, 2, \ldots, p-1$.
How do you choose the number of eigenvalues to omit? This can be
accomplished by looking at the cumulative percent variation explained by
each of the $(p-1)$ components. For the jth component, this percentage is
$$\frac{\sum_{i=1}^{j}\lambda_i}{\lambda_1 + \lambda_2 + \ldots + \lambda_{p-1}}\times 100,$$
where $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_{p-1}$ are the ordered eigenvalues.
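A short base-R sketch of these principal components regression steps is given below; the predictor matrix X, the response y, and the choice of two retained components are assumptions made purely for illustration.

# Sketch: principal components regression (X is an n x p predictor matrix, y a response)
pc <- prcomp(scale(X))                              # components of the standardized predictors
cumsum(pc$sdev^2) / sum(pc$sdev^2) * 100            # cumulative percent variation explained
k <- 2                                              # number of components retained (illustrative)
Z <- pc$x[, 1:k]
fit <- lm(scale(y) ~ Z - 1)                         # regress standardized y on the retained scores
beta.Z  <- c(coef(fit), rep(0, ncol(X) - k))        # omitted components get coefficient 0
beta.PC <- pc$rotation %*% beta.Z                   # transform back to the standardized X scale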
18.3 Partial Least Squares Regression
As in principal components regression, we can transform back to get the
coefficients on the X scale by
$$\hat{\boldsymbol{\beta}}_{PLS} = \mathbf{R}\hat{\boldsymbol{\beta}}_{Z},$$
which is a solution to
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}.$$
We can also transform back to the original scale:
$$\hat{\beta}_{PLS,j} = \frac{s_Y}{s_{X_j}}\hat{\beta}^{*}_{PLS,j} \qquad\text{and}\qquad \hat{\beta}_{PLS,0} = \bar{y} - \sum_{j=1}^{p-1}\hat{\beta}_{PLS,j}\bar{x}_j,$$
where $j = 1, 2, \ldots, p-1$.
The method described above is sometimes referred to as the SIMPLS
method. Another method commonly used is nonlinear iterative partial
least squares (NIPALS). NIPALS is more commonly used when you have
a vector of responses. While we do not discuss the differences between these
algorithms any further, we do discuss later the setting where we have a vector
of responses.
18.4 Inverse Regression
Write the design matrix in terms of its rows as
$$\mathbf{X} = \begin{pmatrix} \mathbf{X}_1^{T} \\ \vdots \\ \mathbf{X}_n^{T} \end{pmatrix}.$$
However, assume that p is actually quite large with respect to n. Inverse
regression can then be used as a tool for dimension reduction (i.e., reducing p),
which reveals to us the most important aspects (or directions) of the data.¹ ²

¹ When we have p large with respect to n, we use the terminology dimension reduction.
However, when we are more concerned about which predictors are significant or which
functional form is appropriate for our regression model (and the size of p is not too much
of an issue), then we use the model selection terminology.

² As an example, consider 100 points on the unit interval [0, 1], then imagine 100 points
on the unit square [0, 1] x [0, 1], then imagine 100 points on the unit cube [0, 1] x [0, 1] x [0, 1],
and so on. As the dimension increases, the sparsity of the data makes it more difficult to
make any relevant inferences about the data.
where = X .
2. Divide the range of $y_1, \ldots, y_n$ into H non-overlapping slices (using the
index $h = 1, \ldots, H$). Let $n_h$ be the number of observations within each
slice and $I_h\{\cdot\}$ be the indicator function for this slice, such that
$$n_h = \sum_{i=1}^{n} I_h\{y_i\}.$$

3. Compute the slice means
$$\hat{m}_h = n_h^{-1}\sum_{i=1}^{n}\mathbf{x}_i I_h\{y_i\}.$$

4. Construct
$$\hat{\mathbf{V}} = \sum_{h=1}^{H} n_h\,\hat{m}_h\hat{m}_h^{T},$$
along with the eigenvalues $\hat{\lambda}_i$ and eigenvectors $\hat{\mathbf{r}}_i$ of $\hat{\mathbf{V}}$.

5. Identify the k largest eigenvalues $\hat{\lambda}_i$ and construct the score vectors
$\mathbf{z}_i = \mathbf{X}\hat{\mathbf{r}}_i$ as in partial least squares, which form the matrix $\mathbf{Z}$. Then
$$\hat{\boldsymbol{\beta}}_{Z} = (\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}\mathbf{Y}.$$
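The following base-R sketch mirrors the slicing steps above; the predictor matrix X, response y, number of slices H, and number of retained directions k are all assumptions, and the full standardization of X used in some presentations is replaced here by simple centering for brevity.

# Minimal sketch of sliced inverse regression directions
sir_directions <- function(X, y, H = 10, k = 2) {
  Xc    <- scale(X, center = TRUE, scale = FALSE)                   # center the predictors
  slice <- cut(rank(y, ties.method = "first"), H, labels = FALSE)   # H roughly equal slices of y
  M     <- matrix(0, ncol(X), ncol(X))
  for (h in unique(slice)) {
    idx <- slice == h
    mh  <- colMeans(Xc[idx, , drop = FALSE])                        # slice mean of the predictors
    M   <- M + sum(idx) * tcrossprod(mh)                            # weighted outer product
  }
  eig <- eigen(M)
  Xc %*% eig$vectors[, 1:k, drop = FALSE]                           # scores for the k largest eigenvalues
}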
18.5 Constrained Least Squares
Suppose we now wish to find the least squares estimate of the model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$,
but subject to a set of equality constraints $\mathbf{A}\boldsymbol{\beta} = \mathbf{a}$. It can be shown
(by using Lagrange multipliers) that the constrained estimate is
$$\hat{\boldsymbol{\beta}}_{C} = \hat{\boldsymbol{\beta}} + (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{A}^{T}\left[\mathbf{A}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{A}^{T}\right]^{-1}(\mathbf{a} - \mathbf{A}\hat{\boldsymbol{\beta}}),$$
where $\hat{\boldsymbol{\beta}}$ is the ordinary least squares estimate, i.e., the minimizer of $\sum_{i=1}^{n} e_i^2$.
18.6 Examples
Year   GNP       Unemployed   Armed.Forces   Population   Employed
1947   234.289      235.6        159.0         107.608      60.323
1948   259.426      232.5        145.6         108.632      61.122
1949   258.054      368.2        161.6         109.773      60.171
1950   284.599      335.1        165.0         110.929      61.187
1951   328.975      209.9        309.9         112.075      63.221
1952   346.999      193.2        359.4         113.270      63.639
1953   365.385      187.0        354.7         115.094      64.989
1954   363.112      357.8        335.0         116.219      63.761
1955   397.469      290.4        304.8         117.388      66.019
1956   419.180      282.2        285.7         118.734      67.857
1957   442.769      293.6        279.8         120.445      68.169
1958   444.546      468.1        263.7         121.950      66.513
1959   482.704      381.3        255.2         123.366      68.655
1960   502.601      393.1        251.4         125.368      69.564
1961   518.173      480.6        257.2         127.852      69.331
1962   554.894      400.7        282.7         130.081      70.551

Table 18.1: The macroeconomic data set for the years 1947 to 1962.
  Unemployed   Armed.Forces     Population           Year
    83.95865       12.15639      230.91221     2065.73394
Figure 18.1: Ridge regression trace plot with the biasing constant on the x-axis.

In performing a ridge regression, we first obtain a trace plot of possible ridge
coefficients (Figure 18.1). As you can see, the estimates of the
regression coefficients shrink drastically until about 0.02. When using the
Hoerl-Kennard method, a value of about k = 0.0068 is obtained. Other
methods will certainly yield different estimates which illustrates some of the
criticism surrounding ridge regression.
The resulting estimates from this ridge regression analysis are
##########
         GNP    Unemployed  Armed.Forces    Population          Year      Employed
  25.3615288     3.3009416     0.7520553   -11.6992718    -6.5403380     0.7864825
##########
  Y      X1      X2      X3
 49.0   1300     7.5   0.0120
 50.2   1300     9.0   0.0120
 50.5   1300    11.0   0.0115
 48.5   1300    13.5   0.0130
 47.5   1300    17.0   0.0135
 44.5   1300    23.0   0.0120
 28.0   1200     5.3   0.0400
 31.5   1200     7.5   0.0380
 34.5   1200    11.0   0.0320
 35.0   1200    13.5   0.0260
 38.0   1200    17.0   0.0340
 38.5   1200    23.0   0.0410
 15.0   1100     5.3   0.0840
 17.0   1100     7.5   0.0980
 20.5   1100    11.0   0.0920
 29.5   1100    17.0   0.0860
##########
   H2.ratio    cont.time           Z3
   1.061838    12.324964     -0.57268
##########

The above is simply $\hat{\boldsymbol{\beta}}_{Z}$. From the SVD of $\mathbf{X}^{*}$, the corresponding column of the
factor loadings matrix is found to be
$$\begin{pmatrix} 0.05150079 \\ 0.15076806 \\ 0.86220916 \end{pmatrix}.$$
Figure 18.2: Pairwise scatterplots for the predictors from the acetylene data
set. LOESS curves are also provided. Does there appear to be any possible
linear relationships between pairs of predictors?
Chapter 19
Piecewise and Nonparametric
Methods
This chapter focuses on regression models where we start to deviate from
the functional form discussed thus far. The first topic discusses a model
where different regressions are fit depending on which area of the predictor
space we are in. The second topic discusses nonparametric models which,
as the name suggests, are free of distributional assumptions and subsequently
do not have regression coefficients readily available for
estimation. This is best accomplished by using a smoother, which is a
tool for summarizing the trend of the response as a function of one or more
predictors. The resulting estimate of the trend is less variable than the
response itself.
19.1 Piecewise Linear Regression
A model that proposes a different linear relationship for different intervals (or
regions) of the predictor is called a piecewise linear regression model.
The predictor values at which the slope changes are called knots, which
we will discuss throughout this chapter. Such models are helpful when you
expect the linear trend of your data to change once you hit some threshold.
Usually the knot values are already predetermined due to previous studies
or standards that are in place. However, there are methods for estimating
the knot values (sometimes called changepoints in the context of piecewise
linear regression), but we will not explore such methods.
$$\mathbf{X} = \begin{pmatrix}
1 & x_{1,1} & (x_{1,1} - k_1)I\{x_{1,1} > k_1\} & \cdots & (x_{1,1} - k_c)I\{x_{1,1} > k_c\} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n,1} & (x_{n,1} - k_1)I\{x_{n,1} > k_1\} & \cdots & (x_{n,1} - k_c)I\{x_{n,1} > k_c\}
\end{pmatrix}.$$
Furthermore, you can see how for more than one predictor you can construct
the X matrix to have columns as functions of the other predictors.
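A minimal R sketch of this construction for a single knot is shown below; the data frame dat, its columns x and y, and the knot value k are assumptions.

# Sketch: piecewise linear regression with one knot at k
k <- 1000                                               # assumed knot location
fit <- lm(y ~ x + I((x - k) * (x > k)), data = dat)     # continuous at the knot
# Allowing a jump at the knot adds the indicator column described below:
fit.disc <- lm(y ~ x + I((x - k) * (x > k)) + I(x > k), data = dat)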
Figure: example piecewise linear fits of Y on X with 1 knot, 1 knot (discontinuous), 2 knots, and 2 knots (discontinuous).
To allow a discontinuity at knot $k_j$, the $\mathbf{X}$ matrix has the column
$$\begin{pmatrix} I\{x_{1,1} > k_j\} \\ I\{x_{2,1} > k_j\} \\ \vdots \\ I\{x_{n,1} > k_j\} \end{pmatrix}$$
appended to it for each $k_j$ where there is a discontinuity. Extending discontinuities
to the case of more than one predictor is analogous.
19.2 Nonparametric Regression
Nonparametric regression attempts to find a functional relationship between $y_i$ and $x_i$ (only one predictor):
$$y_i = m(x_i) + \epsilon_i,$$
where $m(\cdot)$ is the regression function to estimate and $\mathrm{E}(\epsilon_i) = 0$. It is not
necessary to assume constant variance and, in fact, one typically assumes
that $\mathrm{Var}(\epsilon_i) = \sigma^2(x_i)$, where $\sigma^2(\cdot)$ is a continuous, bounded function.
Local regression is a method commonly used to model this nonparametric
regression relationship. Specifically, local regression makes no global
assumptions about the function $m(\cdot)$. Global assumptions are made in standard
linear regression, as we assume that the regression curve we estimate
(which is characterized by the regression coefficient vector $\boldsymbol{\beta}$) properly models
all of our data. However, local regression assumes that $m(\cdot)$ can be
well-approximated locally by a member of a simple class of parametric
functions (e.g., a constant, straight line, quadratic curve, etc.). What drives
local regression is Taylor's theorem from calculus, which says that any continuous
function (which we assume $m(\cdot)$ is) can be approximated with
a polynomial.
In this section, we discuss some of the common local regression methods
for estimating regressions nonparametrically.
19.2.1 Kernel Regression
One way of estimating $m(\cdot)$ is to use density estimation, which approximates
the probability density function $f(\cdot)$ of a random variable X. Assuming we have
n independent observations $x_1, \ldots, x_n$ from the random variable
X, the kernel density estimator $\hat{f}_h(x)$ for estimating the density at x (i.e.,
$f(x)$) is defined as
$$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right).$$
Here, $K(\cdot)$ is called the kernel function and h is called the bandwidth.
$K(\cdot)$ is a function often resembling a probability density function, but with
no parameters (some common kernel functions are provided in Table 19.1). h
controls the width of the window around x within which we perform the density
estimation. Thus, a kernel density estimator is essentially a weighting scheme
(dictated by the choice of kernel) which takes into consideration the proximity of each
point in the data set to x for a given bandwidth h: more weight is given to
points near x and less weight is given to points farther from x.
With the formalities established, one can perform a kernel regression of
$y_i$ on $x_i$ and estimate $m_h(\cdot)$ with the Nadaraya-Watson estimator:
$$\hat{m}_h(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)}.$$
Kernel        $K(u)$
Triangular    $(1 - |u|)\,I(|u| \leq 1)$
Beta          $\dfrac{(1 - u^2)^g}{\mathrm{Beta}(0.5,\, g+1)}\,I(|u| \leq 1)$
Gaussian      $\dfrac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}u^2}$
Cosinus       $\dfrac{1}{2}(1 + \cos(\pi u))\,I(|u| \leq 1)$
Optcosinus    $\dfrac{\pi}{4}\cos\!\left(\dfrac{\pi u}{2}\right)\,I(|u| \leq 1)$

Table 19.1: Some common kernel functions.
A local estimate of the residual variance is
$$\hat{\sigma}_h^2(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)\{y_i - \hat{m}_h(x)\}^2}{\sum_{i=1}^{n} K\!\left(\frac{x_i - x}{h}\right)},$$
so that the variance of $\hat{m}_h(x)$ can be estimated by
$$\frac{\hat{\sigma}_h^2(x)\,\|K\|_2^2}{nh\,\hat{f}_h(x)}.$$
These quantities can be used to construct an approximate $100(1-\alpha)\%$ uniform
confidence band for $m(\cdot)$ of the form
$$\hat{m}_h(x) \pm \left(\frac{-\log\{-\tfrac{1}{2}\log(1-\alpha)\}}{(2\log(n))^{1/2}} + d_n\right)\left(\frac{\hat{\sigma}_h^2(x)\,\|K\|_2^2}{nh\,\hat{f}_h(x)}\right)^{1/2},$$
where
$$d_n = (2\log(n))^{1/2} + (2\log(n))^{-1/2}\log\!\left(\frac{\|K'\|_2}{2\sqrt{\pi}\,\|K\|_2}\right).$$
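A Nadaraya-Watson style fit with a Gaussian kernel is available in base R through ksmooth; the vectors x and y and the bandwidth value below are assumptions for illustration.

# Sketch: kernel regression with a Gaussian kernel (vectors x and y are assumed)
fit <- ksmooth(x, y, kernel = "normal", bandwidth = 2, n.points = 200)
plot(x, y)
lines(fit$x, fit$y)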
Some final notes about kernel regression include:
Choice of kernel and bandwidth are still major issues in research. There
are some general guidelines to follow and procedures that have been
developed, but are beyond the scope of this course.
What we developed in this section is only for the case of one predictor.
If you have multiple predictors (i.e., $x_{i,1}, \ldots, x_{i,p}$), then one needs to
use a multivariate kernel density estimator at a point $\mathbf{x} = (x_1, \ldots, x_p)^{T}$,
which is defined as
$$\hat{f}_{\mathbf{h}}(\mathbf{x}) = \frac{1}{n\prod_{j=1}^{p}h_j}\sum_{i=1}^{n} K\!\left(\frac{x_{i,1}-x_1}{h_1},\ldots,\frac{x_{i,p}-x_p}{h_p}\right).$$
Multivariate kernels require more advanced methods and are difficult
to use as data sets with more predictors will often suffer from the curse
of dimensionality.
19.2.2 Local Polynomial Regression

Local polynomial regression fits a low-order polynomial at each point x, using only the observations with $|x_i - x| \leq h$.
Letting
$$\boldsymbol{\beta}(x) = \begin{pmatrix} \beta_0(x) \\ \beta_1(x) \\ \vdots \\ \beta_q(x) \end{pmatrix}, \qquad
\mathbf{X} = \begin{pmatrix}
1 & (x_1 - x) & \cdots & (x_1 - x)^q \\
1 & (x_2 - x) & \cdots & (x_2 - x)^q \\
\vdots & \vdots & \ddots & \vdots \\
1 & (x_n - x) & \cdots & (x_n - x)^q
\end{pmatrix},$$
and $\mathbf{W} = \mathrm{diag}\left\{K\!\left(\frac{x_1 - x}{h}\right), \ldots, K\!\left(\frac{x_n - x}{h}\right)\right\}$,
the local least squares estimate can be written as
$$\hat{\boldsymbol{\beta}}(x) = \arg\min_{\boldsymbol{\beta}(x)}\;(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}(x))^{T}\mathbf{W}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}(x)) = (\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}\mathbf{Y}.$$
Finally, for any x, we can perform inference on the $\beta_j(x)$ (or the corresponding
$m^{(j)}(x)$) terms in a manner similar to weighted least squares.
The method of LOESS (which stands for LOcally Estimated Scatterplot
Smoother)1 is commonly used for local polynomial fitting. However, LOESS
uses only the fraction of the data nearest to $x_0$ (the span) and, at each $x_0$,
minimizes a locally weighted sum of squares of the form
$$\sum_{i=1}^{n} w_i\left[y_i - (\beta_0 + \beta_1(x_i - x_0) + \ldots + \beta_h(x_i - x_0)^h)\right]^2.$$
For LOESS, usually $h = 2$ is sufficient.

1 There is also another version of LOESS called LOWESS, which stands for LOcally
WEighted Scatterplot Smoother. The main difference is the weighting that is introduced
during the smoothing process.
A robust version of LOESS down-weights observations with large residuals by
updating the weights as
$$w_i = w_i^{0}\, B\!\left(\frac{|r_i|}{6M}\right).$$
For $w_i$, the value $w_i^{0}$ is the previous weight for this observation (where the first
time you calculate this weight can be done by the original LOESS procedure
we outlined), $M$ is the median of the q absolute values of the residuals, and
$B(\cdot)$ is the bisquare weight function given by
$$B(u) = \begin{cases} (1 - |u|^2)^2, & \text{if } |u| < 1; \\ 0, & \text{if } |u| \geq 1. \end{cases}$$
This robust procedure can be iterated up to 5 times for a given x0 .
Some other notes about local regression methods include:
Various forms of local regression exist in the literature. The main
thing to note is that these are approximation methods with much of
the theory being driven by Taylor's theorem from calculus.
Kernel regression is actually a special case of local regression.
As with kernel regression, there is also an extension of local regression
regarding multiple predictors. It requires use of a multivariate version
of Taylor's theorem around the p-dimensional point $\mathbf{x}_0$, with the span again
controlling the fraction of the data used locally. The values of $x_i$ can also be
scaled so that the smoothness occurs the same way in all directions. However,
note that this estimation is often difficult due to the curse of dimensionality.
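A local polynomial (LOESS) fit is available directly in base R; the vectors x and y and the span value are assumptions here.

# Sketch: LOESS fit (vectors x and y are assumed; span controls the local fraction of data)
fit <- loess(y ~ x, span = 0.75, degree = 2, family = "symmetric")  # "symmetric" uses the robust re-weighting
plot(x, y)
ord <- order(x)
lines(x[ord], predict(fit)[ord])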
19.2.3 Projection Pursuit Regression
1. Set $r_i^{(0)} = y_i$.

2. For $j = 1, 2, \ldots$, maximize
$$R^2_{(j)} = 1 - \frac{\sum_{i=1}^{n}\left\{r_i^{(j-1)} - \hat{m}_{(j)}(\hat{\boldsymbol{\theta}}_{(j)}^{T}\mathbf{x}_i)\right\}^2}{\sum_{i=1}^{n}\left\{r_i^{(j-1)}\right\}^2}$$
over the projection direction $\boldsymbol{\theta}_{(j)}$ and the univariate smooth $\hat{m}_{(j)}(\cdot)$.

3. Update the residuals as
$$r_i^{(j)} = r_i^{(j-1)} - \hat{m}_{(j)}(\hat{\boldsymbol{\theta}}_{(j)}^{T}\mathbf{x}_i).$$
The advantages of using PPR for estimation are that we are using univariate
regressions, which are quick and easy to estimate. Also, PPR is able to
approximate a fairly rich class of functions as well as ignore variables providing
little to no information about $m(\cdot)$. Some disadvantages of using PPR
include having to examine a p-dimensional parameter space to estimate each
direction $\boldsymbol{\theta}_{(j)}$, and the fact that interpretation of a single term may be difficult.
19.3 Smoothing Splines
A (cubic) smoothing spline estimate of the regression function is obtained by
minimizing the penalized criterion
$$\sum_{i=1}^{n}\{y_i - \mu(x_i)\}^2 + \lambda\int\{\mu''(t)\}^2\,dt$$
over all twice-differentiable functions $\mu(\cdot)$, where $\lambda \geq 0$ is a smoothing parameter.
For the case of p predictors, one instead minimizes
$$\frac{1}{n}\sum_{i=1}^{n}\{y_i - \mu(\mathbf{x}_i)\}^2 + \lambda J_m(\mu),$$
which results in what is called a thin-plate smoothing spline. While there
are several ways to define $J_m(\mu)$, a common way to define it for a thin-plate
smoothing spline is by
$$J_m(\mu) = \int_{-\infty}^{+\infty}\!\!\cdots\!\int_{-\infty}^{+\infty}\;\sum_{\Omega}\frac{m!}{t_1!\cdots t_p!}\left(\frac{\partial^m \mu}{\partial x_1^{t_1}\cdots \partial x_p^{t_p}}\right)^2 dx_1\cdots dx_p,$$
where $\Omega$ is the set of all permutations of $(t_1, \ldots, t_p)$ such that $\sum_{j=1}^{p} t_j = m$.
Numerous algorithms exist for estimation and have demonstrated fairly
stable numerical results. However, one must gently balance fitting the data
closely with avoiding characterizing the fit with excess variation. Fairly general procedures also exist for constructing confidence intervals and estimating
the smoothing parameter. The following subsection briefly describes the least
squares method usually driving these algorithms.
19.3.1 Penalized Least Squares
Penalized least squares estimates are a way to balance fitting the data
closely while avoiding overfitting due to excess variation. A penalized least
squares criterion takes the form
$$\frac{1}{n}\sum_{i=1}^{n}\left\{y_i - \mu(\mathbf{z}_i) - \mathbf{x}_i^{T}\boldsymbol{\beta}\right\}^2 + \lambda J_m(\mu),$$
where $J_m(\mu)$ is the penalty on the roughness of the function $\mu(\cdot)$. Again, the
squared term of this function measures the goodness-of-fit, while the second
term measures the smoothness associated with $\mu(\cdot)$. A larger $\lambda$ penalizes
rougher fits, while a smaller $\lambda$ emphasizes the goodness-of-fit.
A final estimate for the penalized least squares method can be written as
$$\hat{\mu}(\mathbf{z}_i) = \hat{\alpha} + \mathbf{z}_i^{T}\hat{\boldsymbol{\gamma}} + \sum_{k=1}^{n}\hat{\delta}_k B_k(\mathbf{x}_i),$$
where the $B_k$'s are basis functions dependent on the location of the $\mathbf{x}_i$'s and
$\alpha$, $\boldsymbol{\gamma}$, and $\boldsymbol{\delta}$ are coefficients to be estimated. For a fixed $\lambda$, $(\alpha, \boldsymbol{\gamma}, \boldsymbol{\delta})$ can be
estimated. The smoothing parameter $\lambda$ can be chosen by minimizing the
generalized cross-validation (or GCV) function. Write
$$\hat{\mathbf{y}} = \mathbf{A}(\lambda)\mathbf{y},$$
so that the GCV function is
$$V(\lambda) = \frac{\|(\mathbf{I}_{n\times n} - \mathbf{A}(\lambda))\mathbf{y}\|^2/n}{\left[\mathrm{tr}(\mathbf{I}_{n\times n} - \mathbf{A}(\lambda))/n\right]^2}$$
and
$$\hat{\lambda} = \arg\min_{\lambda} V(\lambda).$$
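For the simpler one-predictor case, base R provides a cubic smoothing spline with the smoothing parameter chosen by GCV; the vectors x and y below are assumed.

# Sketch: cubic smoothing spline with lambda chosen by generalized cross-validation
fit <- smooth.spline(x, y, cv = FALSE)   # cv = FALSE selects the smoothing parameter by GCV
plot(x, y)
lines(predict(fit, x = sort(x)))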
19.4 Resampling Methods

19.4.1 The Bootstrap
Bootstrapping is a method where you resample from your data (often with
replacement) in order to approximate the distribution of the data at hand.
While conceptually bootstrapping procedures are very appealing (and they
have been shown to possess certain asymptotic properties), they are computationally intensive. In the nonparametric regression routines we presented,
standard regression assumptions were not made. In these nonstandard situations, bootstrapping provides a viable alternative for providing standard
errors and confidence intervals for the regression coefficients and predicted
values. When in the regression setting, there are two types of bootstrapping
methods that may be employed. Before we differentiate these methods, we
first discuss bootstrapping in a little more detail.
In bootstrapping, you assume that your sample is actually the population of interest. You draw B samples (B is usually well over 1000)
of size n from your original sample with replacement. With replacement
means that each observation you draw for your sample is always selected
from the entire set of values in your original sample. For each bootstrap
sample, the regression results are computed and stored. For example, if
B = 5000 and we are trying to estimate the sampling distribution of the
regression coefficients in a simple linear regression, then the 5000 bootstrap
fits yield $(\hat{\beta}_{0,1}, \hat{\beta}_{1,1}), (\hat{\beta}_{0,2}, \hat{\beta}_{1,2}), \ldots, (\hat{\beta}_{0,5000}, \hat{\beta}_{1,5000})$ as your sample.
Now suppose that you want the standard errors and confidence intervals
for the regression coefficients. The standard deviation of the B estimates
provided by the bootstrapping scheme is the bootstrap estimate of the standard error for the respective regression coefficient. Furthermore, a bootstrap
confidence interval is found by sorting the B estimates of a regression coefficient and selecting the appropriate percentiles from the sorted list. For
example, a 95% bootstrap confidence interval would be given by the 2.5th
and 97.5th percentiles from the sorted list. Other statistics may be computed
in a similar manner.
One assumption which bootstrapping relies heavily on is that your sample approximates the population fairly well. Thus, bootstrapping does not
usually work well for small samples as they are likely not representative of
the underlying population. Bootstrapping methods should be relegated to
medium sample sizes or larger (what constitutes a medium sample size is
somewhat subjective).
Now we can turn our attention to the two bootstrapping techniques available in the regression setting. Assume for both methods that our sample
consists of the pairs (x1 , y1 ), . . . , (xn , yn ). Extending either method to the
case of multiple regression is analogous.
We can first bootstrap the observations. In this setting, the bootstrap
samples are selected from the original pairs of data. So the pairing of a
response with its measured predictor is maintained. This method is appropriate for data in which both the predictor and response were selected at
random (i.e., the predictor levels were not predetermined).
We can also bootstrap the residuals. The bootstrap samples in this setting
are selected from what are called the Davison-Hinkley modified residuals, given by
$$\tilde{e}_i = \frac{e_i}{\sqrt{1 - h_{i,i}}} - \frac{1}{n}\sum_{j=1}^{n}\frac{e_j}{\sqrt{1 - h_{j,j}}},$$
where the $e_i$'s are the original regression residuals.
the ei s because these lead to biased results. In each bootstrap sample, the
randomly sampled modified residuals are added to the original fitted values
forming new values of y. Thus, the original structure of the predictors will
remain the same while only the response will be changed. This method is
appropriate for designed experiments where the levels of the predictor are
fixed in advance.

Once the B bootstrap estimates are sorted, a $100(1-\alpha)\%$ bootstrap interval
for a coefficient $\beta_i$ is given by
$$\left(\hat{\beta}_{i,\lfloor\frac{\alpha}{2}B\rfloor},\ \hat{\beta}_{i,\lceil(1-\frac{\alpha}{2})B\rceil}\right),$$
and an interval for the mean response at $x_h$ by
$$\left(\hat{\beta}_{0,\lfloor\frac{\alpha}{2}B\rfloor} + \hat{\beta}_{1,\lfloor\frac{\alpha}{2}B\rfloor}x_h,\ \hat{\beta}_{0,\lceil(1-\frac{\alpha}{2})B\rceil} + \hat{\beta}_{1,\lceil(1-\frac{\alpha}{2})B\rceil}x_h\right).$$
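A minimal R sketch of bootstrapping the observations for a simple linear regression is given below; the data frame dat with columns x and y is an assumption.

# Sketch: bootstrap standard error and percentile interval for the slope
set.seed(1)
B <- 5000
slopes <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)     # resample (x, y) pairs with replacement
  coef(lm(y ~ x, data = dat[idx, ]))[2]        # refit and store the slope
})
sd(slopes)                                     # bootstrap standard error
quantile(slopes, c(0.025, 0.975))              # 95% percentile interval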
19.4.2 The Jackknife
Jackknifing, which is similar to bootstrapping, is used in statistical inference to estimate the bias and standard error (variance) of a statistic when a
random sample of observations is used to calculate it. The basic idea behind
the jackknife variance estimator lies in systematically recomputing the estimator of interest by leaving out one or more observations at a time from the
original sample. From this new set of replicates of the statistic, an estimate
for the bias and variance of the statistic can be calculated, which can then
be used to calculate jackknife confidence intervals.
Below we outline the steps for jackknifing in the simple linear regression
setting for simplicity, but the multiple regression setting is analogous:
1. Draw a sample of size n, $(x_1, y_1), \ldots, (x_n, y_n)$, and divide the sample into
s independent groups, each of size d.

2. Omit the first set of d observations from the sample and estimate $\beta_0$
and $\beta_1$ from the $(n - d)$ remaining observations (call these estimates
$\hat{\beta}_0^{(J_1)}$ and $\hat{\beta}_1^{(J_1)}$, respectively). The remaining set of $(n - d)$ observations
is called the delete-d jackknife sample.

3. Omit each of the remaining sets of $2, \ldots, s$ groups in turn and estimate
the respective regression coefficients. These are $\hat{\beta}_0^{(J_2)}, \ldots, \hat{\beta}_0^{(J_s)}$
and $\hat{\beta}_1^{(J_2)}, \ldots, \hat{\beta}_1^{(J_s)}$. Note that this results in $s = n/d$ delete-d jackknife
samples.
4. Obtain the (joint) probability distribution $F^{(J)}(\hat{\beta}_0, \hat{\beta}_1)$ of the delete-d jackknife
estimates. This may be done empirically or by investigating an appropriate distribution.

5. Calculate the jackknife regression coefficient estimate, which is the
mean of the $F^{(J)}(\hat{\beta}_0, \hat{\beta}_1)$ distribution, as
$$\hat{\beta}_j^{(J)} = \frac{1}{s}\sum_{k=1}^{s}\hat{\beta}_j^{(J_k)},$$
along with the jackknife estimate of the standard error,
$$\widehat{\text{s.e.}}_J(\hat{\beta}_j) = \sqrt{\widehat{\text{var}}_J(\hat{\beta}_j)}.$$

Finally, if normality is appropriate, then a $100(1-\alpha)\%$ jackknife confidence
interval for the regression coefficient $\beta_j$ is given by
$$\hat{\beta}_j^{(J)} \pm t_{n-2;1-\alpha/2}\,\widehat{\text{s.e.}}_J(\hat{\beta}_j).$$
Otherwise, we can construct a fully nonparametric jackknife confidence interval
in a similar manner as the bootstrap version. Namely,
$$\left(\hat{\beta}^{(J)}_{j,\lfloor\frac{\alpha}{2}s\rfloor},\ \hat{\beta}^{(J)}_{j,\lceil(1-\frac{\alpha}{2})s\rceil}\right),$$
where the subscripts index the ordered delete-d jackknife estimates.
While for moderately sized data the jackknife requires less computation,
there are some drawbacks to using the jackknife. Since the jackknife is using
fewer samples, it is only using limited information about $\hat{\boldsymbol{\beta}}$. In fact,
the jackknife can be viewed as an approximation to the bootstrap (it is a
linear approximation to the bootstrap in that the two are roughly equal for
linear estimators). Moreover, the jackknife can perform quite poorly if the
estimator of interest is not sufficiently smooth (intuitively, smooth can be
thought of as meaning that small changes to the data result in small changes to the
calculated statistic), which can especially occur when your sample is too small.
19.5 Examples
Figure 19.2: A scatterplot of the packaging data set with a piecewise linear
regression fitted to the data.
Figure 19.3(a) shows two LOESS fits with two different spans. Some things to think about when fitting the data are:
1. Which fit appears to be better to you?
2. How do you think more data would affect the smoothness of the fits?
3. If we drive the span to 0, what type of regression line would you expect
to see?
4. If we drive the span to 1, what type of regression line would you expect
to see?
Figure 19.3(b) shows two kernel regression curves with two different bandwidths. A Gaussian kernel is used. Some things to think about when fitting
the data (as with the LOESS fit) are:
1. Which fit appears to be better to you?
2. How do you think more data would affect the smoothness of the fits?
3. What type of regression line would you expect to see as we change the
bandwidth?
4. How does the choice of kernel affect the fit?
Figure 19.3: (a) A scatterplot of the quality data set and two LOESS fits
with different spans. (b) A scatterplot of the quality data set and two kernel
regression fits with different bandwidths.
Next, let us return to the orthogonal regression fit of this data. Recall
that the slope term for the orthogonal regression fit was 1.4835. Using a
nonparametric bootstrap (with B = 5000 bootstraps), we can obtain the
following bootstrap confidence intervals for the orthogonal slope parameter:
90% bootstrap confidence interval: (0.9677, 2.9408).
95% bootstrap confidence interval: (0.8796, 3.6184).
99% bootstrap confidence interval: (0.6473, 6.5323).
Remember that if you were to perform another bootstrap with B = 5000,
then the estimated intervals given above will be slightly different due to the
randomness of the resampling process!
Chapter 20
Regression Models with
Censored Data
Suppose we wish to estimate the parameters of a distribution where only a
portion of the data is known. When the remainder of the data has a measurement that exceeds (or falls below) some threshold and only that threshold
value is recorded for that observation, then the data are said to be censored.
When the data exceeds (or falls below) some threshold, but the data is omitted from the database, then the data are said to be truncated. This chapter
deals primarily with the analysis of censored data by first introducing the
area of reliability (survival) analysis and then presenting some of the basic
tools and models from this area as a segue into a regression setting. We also
devote a section to discussing truncated regression models.
20.1 Reliability and Survival Functions

A central quantity in reliability (survival) analysis is the hazard function, which
relates the density $f(t)$ to the survival function $S(t)$ (also written as the
reliability function $R(t)$):
$$h(t) = \frac{f(t)}{S(t)} = \frac{f(t)}{R(t)}.$$
20.2 Censored Regression Models

Censored regression models (also called the Tobit model) simply attempt to model
the unknown variable T (which is assumed left-censored) as
a linear combination of the covariates $X_1, \ldots, X_{p-1}$. For a sample of size n,
we have
$$T_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_{p-1}x_{i,p-1} + \epsilon_i,$$
where $\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$. Based on this model, it can be shown for the
observed variable Y that
$$\mathrm{E}[Y_i \mid Y_i > t] = \mathbf{X}_i^{T}\boldsymbol{\beta} + \sigma\lambda(\alpha_i),$$
where $\alpha_i = (t - \mathbf{X}_i^{T}\boldsymbol{\beta})/\sigma$ and
$$\lambda(\alpha_i) = \frac{\phi(\alpha_i)}{1 - \Phi(\alpha_i)},$$
such that $\phi(\cdot)$ and $\Phi(\cdot)$ are the probability density function and cumulative
distribution function of a standard normal random variable (i.e., N(0, 1)),
respectively. Moreover, the quantity $\lambda(\alpha_i)$ is called the inverse Mills ratio,
which reappears later in our discussion of the truncated regression model.

If we let $i_1$ be the index of all of the uncensored values and $i_2$ be the index
of all of the left-censored values, then we can define a log-likelihood function
for the estimation of the regression parameters (see Appendix C for further
details on likelihood functions):
$$\ell(\boldsymbol{\beta}, \sigma) = -\frac{1}{2}\sum_{i \in i_1}\left[\log(2\pi) + \log(\sigma^2) + (y_i - \mathbf{X}_i^{T}\boldsymbol{\beta})^2/\sigma^2\right] + \sum_{i \in i_2}\log\left(1 - \Phi(\mathbf{X}_i^{T}\boldsymbol{\beta}/\sigma)\right).$$
20.3 Parametric Survival Regression Models

Writing $T^{*} = \ln(T)$, a survival (accelerated failure time) regression model takes
the form
$$T = e^{\beta_0 + \mathbf{X}^{T}\boldsymbol{\beta}}[T_0],$$
where $T_0 = e^{\epsilon}$. So, the covariate acts multiplicatively on the survival time T.

The distribution of $\epsilon$ will allow us to determine the distribution of T.
Each possible probability distribution has a different h(t). Furthermore, in
a survival regression setting, we assume the hazard rate at time t for an
individual has the form:
$$h(t \mid \mathbf{X}) = h_0(t)\,k(\mathbf{X}^{T}\boldsymbol{\beta}) = h_0(t)\,e^{\mathbf{X}^{T}\boldsymbol{\beta}}.$$
In the above, $h_0(t)$ is called the baseline hazard and is the value of the
hazard function when $\mathbf{X} = \mathbf{0}$ or when $\boldsymbol{\beta} = \mathbf{0}$. Note in the expression for
T that we separated out the intercept term $\beta_0$ as it becomes part of the
baseline hazard. Also, $k(\cdot)$ in the equation for $h(t \mid \mathbf{X})$ is a specified link
function, which for our purposes will be $e^{(\cdot)}$.
Next we discuss some of the possible (and common) distributions assumed
for $\epsilon$. We do not write out the density formulas here, but they can be found in
most statistical texts. The parameters of your distribution help control features
such as the location, scale, and shape of the implied hazard.
20.4 Proportional Hazards (Cox) Regression
Recall from the last section that we set $T^{*} = \ln(T)$, where the hazard function
is $h(t \mid \mathbf{X}) = h_0(t)e^{\mathbf{X}^{T}\boldsymbol{\beta}}$. The Cox formulation of this relationship gives:
$$\ln(h(t)) = \ln(h_0(t)) + \mathbf{X}^{T}\boldsymbol{\beta},$$
which yields the following form of the linear regression model:
$$\ln\!\left(\frac{h(t)}{h_0(t)}\right) = \mathbf{X}^{T}\boldsymbol{\beta}.$$
Exponentiating both sides yields a ratio of the actual hazard rate and the baseline
hazard rate, which is called the relative risk:
$$\frac{h(t)}{h_0(t)} = e^{\mathbf{X}^{T}\boldsymbol{\beta}} = \prod_{i=1}^{p-1} e^{\beta_i x_i}.$$
Thus, the regression coefficients have the interpretation as the relative risk
when the value of a covariate is increased by 1 unit. The estimates of the
regression coefficients are interpreted as follows:
A positive coefficient means there is an increase in the risk, which
decreases the expected survival (failure) time.
A negative coefficient means there is a decrease in the risk, which increases the expected survival (failure) time.
The ratio of the estimated risk functions for two different sets of covariates (i.e., two groups) can be used to examine the likelihood of Group
1's survival (failure) time relative to Group 2's survival (failure) time.
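A proportional hazards fit of this kind can be obtained in R with coxph from the survival package; the data frame dat and the variable names time, status, age, and treatment are assumptions.

library(survival)
# Sketch: Cox proportional hazards regression
fit <- coxph(Surv(time, status) ~ age + treatment, data = dat)
summary(fit)   # the exp(coef) column gives the estimated relative risks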
20.5 Diagnostic Procedures
Depending on the survival regression model being used, the diagnostic measures presented here may have a slightly different formulation. We do present
somewhat of a general form for these measures, but the emphasis is on the
purpose of each measure. It should also be noted that one can perform formal hypothesis testing and construct statistical intervals based on various
estimates.
Cox-Snell Residuals

In the previous regression models we studied, residuals were defined as a
difference between observed and fitted values. For survival regression, in
order to check the overall fit of a model, the Cox-Snell residual for the ith
observation in a data set is used and defined as:
$$r_{C_i} = \hat{H}_0(t_i)\,e^{\mathbf{X}_i^{T}\hat{\boldsymbol{\beta}}},$$
where $\hat{H}_0(\cdot)$ is the estimated baseline cumulative hazard. Notice that $r_{C_i} > 0$
for all i. The way we check for goodness-of-fit with the Cox-Snell residuals is to
estimate the cumulative hazard rate of the residuals and plot it against the
residuals themselves; a plot close to a straight line through the origin with slope
1 indicates an adequate fit.
$$D_i = \mathrm{sgn}(M_i)\sqrt{2\left\{\ell_{S_i}(\hat{\theta}_i) - \ell_i(\hat{\theta})\right\}}.$$
For $D_i$, $\ell_i(\hat{\theta})$ is the ith log-likelihood evaluated at $\hat{\theta}$, which is the maximum
likelihood estimate of the model's parameter vector $\theta$, and $\ell_{S_i}(\hat{\theta}_i)$ is the log-likelihood
of the saturated model evaluated at its maximum likelihood estimate.
A saturated model is one where n parameters (i.e., $\theta_1, \ldots, \theta_n$) fit the n
observations perfectly.

The $D_i$ values should behave like a standard normal sample. A normal
probability plot of the $D_i$ values and a plot of $D_i$ versus the fitted $\ln(t)_i$
values will help to determine if any values are fairly far from the bulk of the data.
20.6 Truncated Regression Models
For a random variable X with density $g_X(\cdot)$ and distribution function $F_X(\cdot)$,
the two-sided truncated density is
$$f_X(x) = \begin{cases} \dfrac{g_X(x)}{F_X(b) - F_X(a)}, & a < x < b; \\[1ex] 0, & \text{otherwise}, \end{cases}$$
while for bottom-truncation and top-truncation it is, respectively,
$$\frac{g_X(x)}{1 - F_X(a)} \qquad\text{and}\qquad \frac{g_X(x)}{F_X(b)}.$$
When truncating the response, the distribution, and consequently the mean
and variance of the truncated distribution, must be adjusted accordingly.
Consider the three possible truncation settings of $a < Y_i < b$ (two-sided
truncation), $a < Y_i$ (bottom-truncation), and $Y_i < b$ (top-truncation). Let
$\alpha_i = (a - \mathbf{X}_i^{T}\boldsymbol{\beta})/\sigma$, $\gamma_i = (b - \mathbf{X}_i^{T}\boldsymbol{\beta})/\sigma$, and $\delta_i = (y_i - \mathbf{X}_i^{T}\boldsymbol{\beta})/\sigma$, such that $y_i$
is the realization of the random variable $Y_i$. Moreover, recall that $\lambda(z)$ is the
inverse Mills ratio applied to the value z, and let
$$\kappa(z) = \lambda(z)\left[(\Phi(z))^{-1} - 1\right].$$
Then, using established results for the truncated normal distribution, the
three different truncated probability density functions are
$$f_{Y|X}(y_i \mid \boldsymbol{\beta}, \mathbf{X}_i^{T}, \sigma, \cdot) = \begin{cases}
\dfrac{\frac{1}{\sigma}\phi(\delta_i)}{1 - \Phi(\alpha_i)}, & \cdot = \{a < Y_i\}; \\[1.5ex]
\dfrac{\frac{1}{\sigma}\phi(\delta_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)}, & \cdot = \{a < Y_i < b\}; \\[1.5ex]
\dfrac{\frac{1}{\sigma}\phi(\delta_i)}{\Phi(\gamma_i)}, & \cdot = \{Y_i < b\},
\end{cases}$$
with corresponding cumulative distribution functions
$$F_{Y|X}(y_i \mid \boldsymbol{\beta}, \mathbf{X}_i^{T}, \sigma, \cdot) = \begin{cases}
\dfrac{\Phi(\delta_i) - \Phi(\alpha_i)}{1 - \Phi(\alpha_i)}, & \cdot = \{a < Y_i\}; \\[1.5ex]
\dfrac{\Phi(\delta_i) - \Phi(\alpha_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)}, & \cdot = \{a < Y_i < b\}; \\[1.5ex]
\dfrac{\Phi(\delta_i)}{\Phi(\gamma_i)}, & \cdot = \{Y_i < b\}.
\end{cases}$$
The corresponding means are
$$\mathrm{E}[Y_i \mid \boldsymbol{\beta}, \mathbf{X}_i] = \begin{cases}
\mathbf{X}_i^{T}\boldsymbol{\beta} + \sigma\lambda(\alpha_i), & \cdot = \{a < Y_i\}; \\[0.5ex]
\mathbf{X}_i^{T}\boldsymbol{\beta} + \sigma\dfrac{\phi(\alpha_i) - \phi(\gamma_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)}, & \cdot = \{a < Y_i < b\}; \\[1.5ex]
\mathbf{X}_i^{T}\boldsymbol{\beta} - \sigma\kappa(\gamma_i), & \cdot = \{Y_i < b\},
\end{cases}$$
while the corresponding variances are
$$\mathrm{Var}[Y_i \mid \boldsymbol{\beta}, \mathbf{X}_i] = \begin{cases}
\sigma^2\{1 - \lambda(\alpha_i)[\lambda(\alpha_i) - \alpha_i]\}, & \cdot = \{a < Y_i\}; \\[0.5ex]
\sigma^2\left\{1 + \dfrac{\alpha_i\phi(\alpha_i) - \gamma_i\phi(\gamma_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)} - \left(\dfrac{\phi(\alpha_i) - \phi(\gamma_i)}{\Phi(\gamma_i) - \Phi(\alpha_i)}\right)^2\right\}, & \cdot = \{a < Y_i < b\}; \\[1.5ex]
\sigma^2\{1 - \kappa(\gamma_i)[\kappa(\gamma_i) + \gamma_i]\}, & \cdot = \{Y_i < b\}.
\end{cases}$$
Using the distributions defined above, the likelihood function can be found
and maximum likelihood procedures can be employed. Note that the likelihood functions will not have a closed-form solution and thus numerical
techniques must be employed to find the estimates of and .
It is also important to underscore the type of estimation method used in
a truncated regression setting. The maximum likelihood estimation method
that we just described will be used when you are interested in a regression
equation that characterizes the entire population, including the observations
that were truncated. If you are interested in characterizing just the subpopulation of observations that were not truncated, then ordinary least squares
can be used. In the context of the example provided at the beginning of
this section, if we regressed wages on years of schooling and were only interested in the employees who made above the minimum wage, then ordinary
least squares can be used for estimation. However, if we were interested in
all of the employees, including those who happened to be excluded due to
not meeting the minimum wage threshold, then maximum likelihood can be
employed.
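For reference, a maximum likelihood fit of a truncated regression of this kind can be obtained in R with the truncreg package; the data frame dat, its columns x and y, and the truncation point of 50 used below are assumptions based on the later example.

library(truncreg)
# Sketch: truncated regression by maximum likelihood (response truncated from below at 50)
fit <- truncreg(y ~ x, data = dat, point = 50, direction = "left")
summary(fit)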
20.7 Examples
Table 20.1: The motor data set measurements with censoring occurring if a
0 appears in the Censor column.
This data set is actually a very common data set analyzed in survival
analysis texts. We will proceed to fit it with a Weibull survival regression
model. The results from this analysis are
##########
              Value Std. Error      z        p
(Intercept) 17.0671    0.93588  18.24 2.65e-74
count        0.3180    0.15812   2.01 4.43e-02
temp        -0.0536    0.00591  -9.07 1.22e-19
Log(scale)  -1.2646    0.24485  -5.17 2.40e-07

Scale= 0.282
##########
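A fit of this form can be produced with survreg from the survival package; the data frame motor and the column names hours, censor, count, and temp are assumptions based on Table 20.1 and the output above.

library(survival)
# Sketch: Weibull survival regression for the motor data
fit <- survreg(Surv(hours, censor) ~ count + temp, data = motor, dist = "weibull")
summary(fit)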
Figure 20.1: (a) Plot of the deviance residuals. (b) NPP plot for the deviance
residuals.
or censored (as in Figure 20.2(b)). The dark blue line on the left is the
truncated regression line that is estimated using ordinary least squares. So
the interpretation of this line will only apply to those data that were not
truncated, which is what the researcher is interested in. The estimated model
is given below:
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  50.8473     1.1847  42.921  < 2e-16 ***
x             1.6884     0.1871   9.025 7.84e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.439 on 80 degrees of freedom
Multiple R-squared: 0.5045,  Adjusted R-squared: 0.4983
F-statistic: 81.46 on 1 and 80 DF,  p-value: 7.835e-14
##########
Notice that the estimated regression line never drops below the level of truncation (i.e., y = 50) within the domain of the x variable.
Gaussian distribution
Loglik(model)= -241.6
Loglik(intercept only)= -267.1
Chisq= 51.03 on 1 degrees of freedom, p= 9.1e-13
Number of Newton-Raphson Iterations: 5
n= 100
##########
Moreover, the dashed red line in both figures is the ordinary least squares fit
(assuming all of the data values are known and used in the estimation) and
is simply provided for comparative purposes. The estimates for this fit are
given below:
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  47.2611     0.9484   49.83   <2e-16 ***
x             2.1611     0.1639   13.19   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.778 on 98 degrees of freedom
Multiple R-squared: 0.6396,  Adjusted R-squared: 0.6359
F-statistic: 173.9 on 1 and 98 DF,  p-value: < 2.2e-16
##########
As you can see, the structure of your data and underlying assumptions
can change your estimate - namely because you are attempting to estimate
different models. The regression lines in Figure 20.2 are a good example of
how different assumptions can alter the final estimates that you report.
Figure 20.2: (a) A plot of the logical reasoning data. The red circles have
been truncated as they fall below 50. The maximum likelihood fit for the
truncated regression (solid light blue line) and the ordinary least squares fit
for the truncated data set (solid dark blue line) are shown. The ordinary
least squares line (which includes the truncated values for the estimation) is
shown for reference. (b) The logical reasoning data with a Tobit regression
fit provided (solid green line). The data has been censored at 50 (i.e., the
solid red dots are included in the data). Again, the ordinary least squares
line has been provided for reference.
   x       y    |    x       y    |    x       y    |    x       y
  0.00   46.00  |   2.53   49.95  |   5.05   64.31  |   7.58   50.91
  0.10   56.38  |   2.63   51.58  |   5.15   68.22  |   7.68   65.51
  0.20   45.59  |   2.73   59.50  |   5.25   58.39  |   7.78   61.32
  0.30   53.66  |   2.83   50.84  |   5.35   58.55  |   7.88   71.37
  0.40   40.05  |   2.93   55.65  |   5.45   60.40  |   7.98   76.97
  0.51   46.62  |   3.03   51.55  |   5.56   57.10  |   8.08   56.72
  0.61   44.56  |   3.13   49.16  |   5.66   58.64  |   8.18   67.90
  0.71   47.20  |   3.23   58.59  |   5.76   58.93  |   8.28   65.30
  0.81   57.06  |   3.33   51.90  |   5.86   61.30  |   8.38   61.62
  0.91   49.18  |   3.43   62.95  |   5.96   60.75  |   8.48   68.68
  1.01   51.06  |   3.54   57.74  |   6.06   58.67  |   8.59   69.43
  1.11   51.75  |   3.64   54.37  |   6.16   60.67  |   8.69   64.82
  1.21   46.73  |   3.74   58.21  |   6.26   59.46  |   8.79   63.81
  1.31   42.04  |   3.84   55.44  |   6.36   65.49  |   8.89   59.27
  1.41   48.83  |   3.94   58.62  |   6.46   60.96  |   8.99   62.23
  1.52   51.81  |   4.04   53.63  |   6.57   57.36  |   9.09   64.78
  1.62   57.35  |   4.14   43.46  |   6.67   59.83  |   9.19   64.88
  1.72   49.91  |   4.24   57.42  |   6.77   57.40  |   9.29   72.30
  1.82   49.82  |   4.34   60.64  |   6.87   62.96  |   9.39   65.18
  1.92   61.53  |   4.44   50.99  |   6.97   67.02  |   9.49   78.35
  2.02   47.40  |   4.55   50.42  |   7.07   65.93  |   9.60   64.62
  2.12   54.78  |   4.65   54.68  |   7.17   63.55  |   9.70   76.85
  2.22   48.94  |   4.75   54.40  |   7.27   61.99  |   9.80   68.57
  2.32   55.13  |   4.85   60.21  |   7.37   64.48  |   9.90   61.29
  2.42   43.57  |   4.95   58.70  |   7.47   62.61  |  10.00   71.46

Table 20.2: The test scores from n = 100 participants for a logical reasoning
section (x) and a mathematics section (y).
Chapter 21
Nonlinear Regression
All of the models we have discussed thus far have been linear in the parameters (i.e., linear in the beta terms). For example, polynomial regression was
used to model curvature in our data by using higher-ordered values of the
predictors. However, the final regression model was just a linear combination
of higher-ordered predictors.
Now we are interested in studying the nonlinear regression model:
$$Y = f(\mathbf{X}, \boldsymbol{\beta}) + \epsilon,$$
where $\mathbf{X}$ is a vector of p predictors, $\boldsymbol{\beta}$ is a vector of k parameters, $f(\cdot)$ is
some known regression function, and $\epsilon$ is an error term whose distribution
may or may not be normal. Notice that we no longer necessarily have the
dimension of the parameter vector simply one greater than the number of
predictors. Some examples of nonlinear regression models are:
$$y_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} + \epsilon_i$$
$$y_i = \frac{\beta_0 + \beta_1 x_i}{1 + \beta_2 e^{\beta_3 x_i}} + \epsilon_i$$
$$y_i = \beta_0 + (0.4 - \beta_0)e^{-\beta_1(x_i - 5)} + \epsilon_i.$$
However, there are some nonlinear models which are actually called intrinsically
linear because they can be made linear in the parameters by a
simple transformation. For example:
$$Y = \frac{\beta_0 X}{\beta_1 + X}$$
can be rewritten as
$$\frac{1}{Y} = \frac{1}{\beta_0} + \frac{\beta_1}{\beta_0}\cdot\frac{1}{X} = \theta_0 + \theta_1\frac{1}{X},$$
which is linear in the transformed parameters $\theta_0$ and $\theta_1$.
which is linear in the transformed variables 0 and 1 . In such cases, transforming a model to its linear form often provides better inference procedures
and confidence intervals, but one must be cognizant of the effects that the
transformation has on the distribution of the errors.
We will discuss some of the basics of fitting and inference with nonlinear
regression models. There is a great deal of theory, practice, and computing
associated with nonlinear regression and we will only get to scratch the surface of this topic. We will then turn to a few specific regression models and
discuss generalized linear models.
21.1 Nonlinear Least Squares

In order to find
$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} Q,$$
where $Q = \sum_{i=1}^{n}\{y_i - f(\mathbf{X}_i, \boldsymbol{\beta})\}^2$ is the sum of squared errors, we take the
partial derivative of Q with respect to each $\beta_k$.
Then, we set each of the above partial derivatives equal to 0 and the parameters
$\beta_k$ are each replaced by $\hat{\beta}_k$. This yields:
$$\sum_{i=1}^{n}\left[y_i - f(\mathbf{X}_i, \hat{\boldsymbol{\beta}})\right]\left.\frac{\partial f(\mathbf{X}_i, \boldsymbol{\beta})}{\partial\beta_k}\right|_{\boldsymbol{\beta}=\hat{\boldsymbol{\beta}}} = 0,$$
for $k = 0, 1, \ldots, p-1$.
The solutions to these equations (the critical values of the above partial derivatives) for
nonlinear regression are nonlinear in the parameter estimates $\hat{\beta}_k$ and are
often difficult to solve, even in the simplest cases. Hence, iterative numerical
methods are often employed. Even more difficulty arises in that multiple
solutions may be possible!
21.1.1 A Few Algorithms
Let
$$\boldsymbol{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} = \begin{pmatrix} y_1 - f(\mathbf{X}_1, \boldsymbol{\beta}) \\ \vdots \\ y_n - f(\mathbf{X}_n, \boldsymbol{\beta}) \end{pmatrix}$$
and let
$$\nabla Q(\boldsymbol{\beta}) = \begin{pmatrix} \dfrac{\partial\|\boldsymbol{\epsilon}\|^2}{\partial\beta_1} \\ \vdots \\ \dfrac{\partial\|\boldsymbol{\epsilon}\|^2}{\partial\beta_k} \end{pmatrix}$$
be the gradient of the sum of squared errors, where $Q(\boldsymbol{\beta}) = \|\boldsymbol{\epsilon}\|^2 = \sum_{i=1}^{n}\epsilon_i^2$ is the sum of squared errors.
Let
$$\mathbf{J}(\boldsymbol{\beta}) = \begin{pmatrix}
\frac{\partial\epsilon_1}{\partial\beta_1} & \cdots & \frac{\partial\epsilon_1}{\partial\beta_k} \\
\vdots & \ddots & \vdots \\
\frac{\partial\epsilon_n}{\partial\beta_1} & \cdots & \frac{\partial\epsilon_n}{\partial\beta_k}
\end{pmatrix}$$
be the Jacobian of the errors, and let
$$\mathbf{H}(\boldsymbol{\beta}) = \frac{\partial^2 Q(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}^{T}} = \begin{pmatrix}
\frac{\partial^2\|\boldsymbol{\epsilon}\|^2}{\partial\beta_1\partial\beta_1} & \cdots & \frac{\partial^2\|\boldsymbol{\epsilon}\|^2}{\partial\beta_1\partial\beta_k} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2\|\boldsymbol{\epsilon}\|^2}{\partial\beta_k\partial\beta_1} & \cdots & \frac{\partial^2\|\boldsymbol{\epsilon}\|^2}{\partial\beta_k\partial\beta_k}
\end{pmatrix}$$
be the Hessian of the sum of squared errors. We also write $\mathbf{J}^{(t)} = \mathbf{J}(\boldsymbol{\beta})|_{\boldsymbol{\beta}=\hat{\boldsymbol{\beta}}^{(t)}}$
for the Jacobian evaluated at the current iterate.
The classical method based on the gradient approach is Newton's method,
which starts at $\hat{\boldsymbol{\beta}}^{(0)}$ and iteratively calculates
$$\hat{\boldsymbol{\beta}}^{(t+1)} = \hat{\boldsymbol{\beta}}^{(t)} - \left[\mathbf{H}(\hat{\boldsymbol{\beta}}^{(t)})\right]^{-1}\nabla Q(\hat{\boldsymbol{\beta}}^{(t)}).$$
The Gauss-Newton approach instead approximates the Hessian using only the
Jacobian, iterating
$$\hat{\boldsymbol{\beta}}^{(t+1)} = \hat{\boldsymbol{\beta}}^{(t)} - \left[\mathbf{J}(\hat{\boldsymbol{\beta}}^{(t)})^{T}\mathbf{J}(\hat{\boldsymbol{\beta}}^{(t)})\right]^{-1}\mathbf{J}(\hat{\boldsymbol{\beta}}^{(t)})^{T}\mathbf{e},$$
where $\mathbf{e}$ is the vector of residuals evaluated at $\hat{\boldsymbol{\beta}}^{(t)}$.
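In R, nls() carries out this kind of Gauss-Newton iteration; the sketch below uses the third example model from the start of the chapter, with assumed vectors x and y and illustrative starting values.

# Sketch: nonlinear least squares via Gauss-Newton iterations
fit <- nls(y ~ b0 + (0.4 - b0) * exp(-b1 * (x - 5)),
           start = list(b0 = 0.1, b1 = 1),
           trace = TRUE)   # prints the residual sum of squares at each iteration
summary(fit)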
21.2 Exponential Regression
21.3 Logistic Regression
21.3.1 Binary Logistic Regression

The binary logistic regression model relates the probability of success $\pi$ to the
predictors through
$$\pi = \frac{e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_{p-1}X_{p-1}}}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_{p-1}X_{p-1}}}. \tag{21.1}$$
This relationship can be re-expressed in two equivalent ways. First is
$$\frac{\pi}{1-\pi} = e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_{p-1}X_{p-1}}, \tag{21.2}$$
which states that the odds of success are an exponential function of the X variables.
Second is
$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \ldots + \beta_{p-1}X_{p-1}, \tag{21.3}$$
which states that the logarithm of the odds is a linear function of the
X variables (and is often called the log odds).
In order to discuss goodness-of-fit measures and residual diagnostics for
binary logistic regression, it is necessary to at least define the likelihood (see
Appendix C for a further discussion). For a sample of size n, the likelihood
for a binary logistic regression is given by:
$$L(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}) = \prod_{i=1}^{n}\pi_i^{y_i}(1 - \pi_i)^{1-y_i} = \prod_{i=1}^{n}\left(\frac{e^{\mathbf{X}_i^{T}\boldsymbol{\beta}}}{1 + e^{\mathbf{X}_i^{T}\boldsymbol{\beta}}}\right)^{y_i}\left(\frac{1}{1 + e^{\mathbf{X}_i^{T}\boldsymbol{\beta}}}\right)^{1-y_i}.$$
This yields the log-likelihood:
$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left[y_i\mathbf{X}_i^{T}\boldsymbol{\beta} - \log\left(1 + e^{\mathbf{X}_i^{T}\boldsymbol{\beta}}\right)\right].$$
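This likelihood is maximized in R by glm with a binomial family; the data frame dat and the predictor names x1 and x2 below are assumptions.

# Sketch: binary logistic regression by maximum likelihood
fit <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))
summary(fit)      # Wald z-tests for the coefficients
exp(coef(fit))    # estimated odds ratios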
Odds Ratio

The odds ratio (which we will write as $\theta$) determines the relationship between
a predictor and response and is available only when the logit link is used.
The odds ratio can be any nonnegative number. An odds ratio of 1 serves
as the baseline for comparison and indicates there is no association between
the response and predictor. If the odds ratio is greater than 1, then the odds
of success are higher for the reference level of the factor (or for higher levels
of a continuous predictor). If the odds ratio is less than 1, then the odds of
success are less for the reference level of the factor (or for higher levels of
a continuous predictor). Values farther from 1 represent stronger degrees of
association. For binary logistic regression, the odds of success are:
$$\frac{\pi}{1-\pi} = e^{\mathbf{X}^{T}\boldsymbol{\beta}}.$$
This exponential relationship provides an interpretation for $\boldsymbol{\beta}$: the odds
increase multiplicatively by $e^{\beta_j}$ for every one-unit increase in $X_j$. More formally,
the odds ratio between two sets of predictors (say $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$) is
given by
$$\theta = \frac{(\pi/(1-\pi))|_{\mathbf{X}=\mathbf{X}^{(1)}}}{(\pi/(1-\pi))|_{\mathbf{X}=\mathbf{X}^{(2)}}}.$$
Wald Test

The Wald test is the test of significance for regression coefficients in logistic
regression (recall that we use t-tests in linear regression). For maximum
likelihood estimates, the ratio
$$Z = \frac{\hat{\beta}_i}{\text{s.e.}(\hat{\beta}_i)}$$
can be treated as approximately standard normal under the null hypothesis that $\beta_i = 0$.
Pearson Residual

The Pearson residual corrects for the unequal variance in the raw residuals
by dividing by the standard deviation. The formula for the Pearson residual is
$$p_i = \frac{r_i}{\sqrt{\hat{\pi}_i(1 - \hat{\pi}_i)}},$$
where $r_i = y_i - \hat{\pi}_i$ is the raw residual.
Deviance Residuals

Deviance residuals are also popular because the sum of squares of these
residuals is the deviance statistic. The formula for the deviance residual is
$$d_i = \pm\sqrt{2\left[y_i\log\left(\frac{y_i}{\hat{\pi}_i}\right) + (1 - y_i)\log\left(\frac{1 - y_i}{1 - \hat{\pi}_i}\right)\right]},$$
where the sign is taken to be the sign of $y_i - \hat{\pi}_i$.
Hat Values

The hat matrix serves a similar purpose as in the case of linear regression,
namely to measure the influence of each observation on the overall fit of the model,
but the interpretation is not as clear due to its more complicated form. The
hat values are given by
$$h_{i,i} = \hat{\pi}_i(1 - \hat{\pi}_i)\,\mathbf{x}_i^{T}(\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{x}_i,$$
where $\mathbf{W} = \mathrm{diag}\{\hat{\pi}_i(1 - \hat{\pi}_i)\}$. The corresponding standardized Pearson and
deviance residuals are
$$\frac{p_i}{\sqrt{1 - h_{i,i}}} \qquad\text{and}\qquad \frac{d_i}{\sqrt{1 - h_{i,i}}}.$$
C and C-bar

$C$ and $\bar{C}$ are extensions of Cook's distance for logistic regression. $C_i$ measures
the overall change in fitted logits due to deleting the ith observation for all
points excluding the one deleted, while $\bar{C}_i$ includes the deleted point. They
are defined by:
$$C_i = \frac{p_i^2 h_{i,i}}{(1 - h_{i,i})^2}$$
and
$$\bar{C}_i = \frac{p_i^2 h_{i,i}}{1 - h_{i,i}}.$$
Goodness-of-Fit Tests

Overall performance of the fitted model can be measured by two different
chi-square tests. There is the Pearson chi-square statistic
$$P = \sum_{i=1}^{n} p_i^2,$$
and the deviance statistic
$$G = \sum_{i=1}^{n} d_i^2.$$
A likelihood ratio comparison against the intercept-only model can also be made through
$$-2\left(\ell(\hat{\boldsymbol{\beta}}^{(0)}) - \ell(\hat{\boldsymbol{\beta}})\right),$$
where $\hat{\boldsymbol{\beta}}^{(0)}$ is the estimate when only the intercept is included.
One additional test is Brown's test, which has a test statistic to judge
the fit of the logistic model to the data. The formula for the general alternative
with two degrees of freedom is:
$$T = \mathbf{s}^{T}\mathbf{C}^{-1}\mathbf{s},$$
where $\mathbf{s}^{T} = (s_1, s_2)$ and $\mathbf{C}$ is the covariance matrix of $\mathbf{s}$. The formulas for $s_1$
and $s_2$ are:
$$s_1 = \sum_{i=1}^{n}(y_i - \hat{\pi}_i)\left(1 + \frac{\log(\hat{\pi}_i)}{1 - \hat{\pi}_i}\right)$$
and
$$s_2 = \sum_{i=1}^{n}(y_i - \hat{\pi}_i)\left(1 + \frac{\log(1 - \hat{\pi}_i)}{\hat{\pi}_i}\right).$$
The formula for the symmetric alternative with 1 degree of freedom is:
$$\frac{(s_1 + s_2)^2}{\mathrm{Var}(s_1 + s_2)}.$$
To interpret the test, if the p-value is less than your accepted significance
level, then reject the null hypothesis that the model fits the data adequately.
DFDEV and DFCHI

DFDEV and DFCHI are statistics that measure the change in deviance and in Pearson's chi-square, respectively, that occurs when an observation is deleted from the data set. Large values of these statistics indicate observations that have not been fitted well. The formulas for these statistics are

\text{DFDEV}_i = d_i^2 + \bar{C}_i

and

\text{DFCHI}_i = \frac{\bar{C}_i}{h_{i,i}}.
R^2

The calculation of R^2 used in linear regression does not extend directly to logistic regression. The version of R^2 used in logistic regression is defined as

R^2 = \frac{\ell(\hat{\beta}_0) - \ell(\hat{\beta})}{\ell(\hat{\beta}_0) - \ell_S(\hat{\beta})},

where \ell(\hat{\beta}_0) is the log likelihood of the model when only the intercept is included and \ell_S(\hat{\beta}) is the log likelihood of the saturated model (i.e., where a model is fit perfectly to the data). This R^2 does go from 0 to 1 with 1 being a perfect fit.
21.3.2 Nominal Logistic Regression

In binomial logistic regression, we only had two possible outcomes. For nominal logistic regression, we will consider the possibility of having k possible outcomes. When k > 2, such responses are known as polytomous.¹ The multiple nominal logistic regression model (sometimes called the multinomial logistic regression model) is given by the following:

\pi_j = \begin{cases} \dfrac{e^{X^T \beta_j}}{1 + \sum_{j=2}^{k} e^{X^T \beta_j}}, & j = 2, \ldots, k; \\ \dfrac{1}{1 + \sum_{j=2}^{k} e^{X^T \beta_j}}, & j = 1, \end{cases}   (21.4)

where again \pi_j denotes a probability and not the irrational number. Notice that k - 1 of the groups have their own set of \beta values. Furthermore, since \sum_{j=1}^{k} \pi_j = 1, we set the \beta values for group 1 to be 0 (this is what we call the reference group). Notice that when k = 2, we are back to binary logistic regression.

\pi_j is the probability that an observation is in one of the k categories. The likelihood for the nominal logistic regression model is given by:

L(\beta; y, X) = \prod_{i=1}^{n} \prod_{j=1}^{k} \pi_{i,j}^{y_{i,j}},

where the subscript (i, j) means the ith observation belongs to the jth group. This yields the log likelihood:

\ell(\beta) = \sum_{i=1}^{n} \sum_{j=1}^{k} y_{i,j} \log(\pi_{i,j}).

¹The word "polychotomous" is sometimes used, but note that this is not actually a word!
The odds ratio for comparing group j to the reference group at two sets of predictors (say X^{(1)} and X^{(2)}) is given by

\frac{(\pi_j/\pi_1)|_{X = X^{(1)}}}{(\pi_j/\pi_1)|_{X = X^{(2)}}}.
21.3.3 Ordinal Logistic Regression

When the k possible outcomes are ordered, a cumulative logit (proportional odds) model can be used. The cumulative probability of falling in one of the first k categories is modeled as

\sum_{j=1}^{k} \pi_j = \frac{e^{\beta_{0,k} + X^T \beta}}{1 + e^{\beta_{0,k} + X^T \beta}},   (21.5)

where each cumulative logit has its own intercept \beta_{0,k} but a common set of slopes \beta. The likelihood is again

L(\beta; y, X) = \prod_{i=1}^{n} \prod_{j=1}^{k} \pi_{i,j}^{y_{i,j}},

where the subscript (i, j) means the ith observation belongs to the jth group, and the corresponding log likelihood is

\ell(\beta) = \sum_{i=1}^{n} \sum_{j=1}^{k} y_{i,j} \log(\pi_{i,j}).
21.4
Poisson Regression
The Poisson distribution for a random variable X has the following probability mass function for a given value X = x:

P(X = x | \lambda) = \frac{e^{-\lambda}\lambda^x}{x!},

for \lambda > 0 and x = 0, 1, 2, \ldots. In Poisson regression, the response is assumed, conditional on the predictors, to follow a Poisson distribution:

P(Y_i = y_i | X_i, \beta) = \frac{e^{-\exp\{X_i^T \beta\}} \exp\{X_i^T \beta\}^{y_i}}{y_i!}.

That is, for a given set of predictors, the count outcome follows a Poisson distribution with rate \exp\{X_i^T \beta\}.
In order to discuss goodness-of-fit measures and residual diagnostics for Poisson regression, it is necessary to at least define the likelihood. For a sample of size n, the likelihood for a Poisson regression is given by:

L(\beta; y, X) = \prod_{i=1}^{n} \frac{e^{-\exp\{X_i^T \beta\}} \exp\{X_i^T \beta\}^{y_i}}{y_i!}.

This yields the log likelihood:

\ell(\beta) = \sum_{i=1}^{n} y_i X_i^T \beta - \sum_{i=1}^{n} \exp\{X_i^T \beta\} - \sum_{i=1}^{n} \log(y_i!).
Two overall goodness-of-fit measures parallel those for logistic regression. The Pearson chi-square statistic is

P = \sum_{i=1}^{n} \frac{(y_i - \exp\{X_i^T \hat{\beta}\})^2}{\exp\{X_i^T \hat{\beta}\}},

and the deviance is

D(y, \hat{\mu}) = 2\sum_{i=1}^{n} \left[ y_i \log\left(\frac{y_i}{\exp\{X_i^T \hat{\beta}\}}\right) - (y_i - \exp\{X_i^T \hat{\beta}\}) \right] = 2\left(\ell_S(\hat{\beta}) - \ell(\hat{\beta})\right),

where \ell_S(\hat{\beta}) is the log likelihood of the saturated model (i.e., where a model is fit perfectly to the data). This measure of deviance (which differs from the deviance statistic defined earlier) is a generalization of the sum of squares from linear regression. The deviance also has an approximate chi-square distribution.
Pseudo R^2

The value of R^2 used in linear regression also does not extend to Poisson regression. One commonly used measure is the pseudo R^2, defined as

R^2 = \frac{\ell(\hat{\beta}_0) - \ell(\hat{\beta})}{\ell(\hat{\beta}_0) - \ell_S(\hat{\beta})},

where \ell(\hat{\beta}_0) is the log likelihood of the model when only the intercept is included. The pseudo R^2 goes from 0 to 1 with 1 being a perfect fit.
Raw Residual

The raw residual is the difference between the actual response and the estimated value from the model. Remember that the variance is equal to the mean for a Poisson random variable. Therefore, we expect that the variances of the residuals are unequal. This can lead to difficulties in the interpretation of the raw residuals, yet it is still used. The formula for the raw residual is

r_i = y_i - \exp\{X_i^T \hat{\beta}\}.
Pearson Residual

The Pearson residual corrects for the unequal variance in the raw residuals by dividing by the standard deviation. The formula for the Pearson residuals is

p_i = \frac{r_i}{\sqrt{\hat{\phi}\exp\{X_i^T \hat{\beta}\}}},

where

\hat{\phi} = \frac{1}{n - p}\sum_{i=1}^{n} \frac{(y_i - \exp\{X_i^T \hat{\beta}\})^2}{\exp\{X_i^T \hat{\beta}\}}

is an estimate of the dispersion.
Hat Values

The hat matrix serves the same purpose as in the case of linear regression: to measure the influence of each observation on the overall fit of the model. The hat values, h_{i,i}, are the diagonal entries of the hat matrix

H = W^{1/2} X (X^T W X)^{-1} X^T W^{1/2},

where W is the diagonal matrix of weights from the fit. The studentized deviance residuals are then

sd_i = \frac{d_i}{\sqrt{1 - h_{i,i}}}.
21.5 Generalized Linear Models

All of the regression models we have considered (both linear and nonlinear) actually belong to a family of models called generalized linear models. Generalized linear models provide a generalization of ordinary least squares regression that relates the random term (the response Y) to the systematic term (the linear predictor X^T \beta) via a link function (denoted by g(\cdot)). Specifically, we have the relation

E(Y) = \mu = g^{-1}(X^T \beta),

so g(\mu) = X^T \beta. Some common link functions are:
The identity link:

g(\mu) = \mu = X^T \beta,

which is used in traditional linear regression.

The logit link:

g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right) = X^T \beta \;\Rightarrow\; \mu = \frac{e^{X^T \beta}}{1 + e^{X^T \beta}},

which is used in logistic regression.

The log link:

g(\mu) = \log(\mu) = X^T \beta \;\Rightarrow\; \mu = e^{X^T \beta},

which is used in Poisson regression.

The complementary log-log link:

g(\mu) = \log(-\log(1 - \mu)) = X^T \beta \;\Rightarrow\; \mu = 1 - \exp\{-e^{X^T \beta}\},

which can also be used in logistic regression. This link function is also sometimes called the gompit link.

The power link:

g(\mu) = \mu^{\lambda} = X^T \beta \;\Rightarrow\; \mu = (X^T \beta)^{1/\lambda},

where \lambda \neq 0. This is used in other regressions which we do not explore (such as gamma regression and inverse Gaussian regression).
Also, the variance is typically a function of the mean and is often written as

\text{Var}(Y) = V(\mu) = V(g^{-1}(X^T \beta)).

The random variable Y is assumed to belong to an exponential family distribution where the density can be expressed in the form

q(y; \theta, \psi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\psi)} + c(y, \psi)\right\},

where a(\cdot), b(\cdot), and c(\cdot) are specified functions, \theta is a parameter related to the mean of the distribution, and \psi is called the dispersion parameter. Many probability distributions belong to the exponential family. For example, the normal distribution is used for traditional linear regression, the binomial distribution is used for logistic regression, and the Poisson distribution is used for Poisson regression. Other exponential family distributions include the gamma and inverse Gaussian distributions.
21.6 Examples

So as not to extrapolate too far beyond the data, let us set the starting value of \beta_1 to 350. It is convenient to scale time so that x_1 = 0 in 1790, and so that the unit of time is 10 years. Then substituting \beta_1 = 350 and x = 0 into the model, using the value y_1 = 3.929 from the data, and assuming that the error is 0, we have

3.929 = \frac{350}{1 + e^{\beta_2 + \beta_3(0)}},

so that

e^{\beta_2} = \frac{350}{3.929} - 1 \approx 88.1,

which suggests the starting value \beta_2 = 4.5 (since \log(88.1) \approx 4.5). Repeating this with the second observation (x = 1) and the model

\frac{350}{1 + e^{4.5 + \beta_3(1)}}

allows us to solve for e^{4.5 + \beta_3} and hence obtain a starting value for \beta_3.
So now we have starting values for the nonlinear least squares algorithm
that we use. Below is the output from running a Gauss-Newton algorithm
for optimization. As you can see, the starting values resulted in convergence
with values not too far from our guess.
##########
Formula: population ~ beta1/(1 + exp(beta2 + beta3 * time))

Parameters:
        Estimate Std. Error t value Pr(>|t|)
beta1  389.16551   30.81197   12.63 2.20e-10 ***
beta2    3.99035    0.07032   56.74  < 2e-16 ***
beta3   -0.22662    0.01086  -20.87 4.60e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##########
Figure 21.1: (a) Plot of the Census data with the logistic functional fit. (b) Plot of the residuals versus the year.
The following gives the estimated logistic regression equation and associated significance tests for the leukemia data in Table 21.2. The reference group for remission is 1 for this data.
##########
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   64.25808   74.96480   0.857    0.391
cell          30.83006   52.13520   0.591    0.554
smear         24.68632   61.52601   0.401    0.688
infil        -24.97447   65.28088  -0.383    0.702
li             4.36045    2.65798   1.641    0.101
blast         -0.01153    2.26634  -0.005    0.996
temp        -100.17340   77.75289  -1.288    0.198

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.372  on 26  degrees of freedom
Residual deviance: 21.594  on 20  degrees of freedom
AIC: 35.594
##########
Figure 21.2: (a) Plot of the deviance residuals. (b) Plot of the Pearson residuals.

Figure 21.2 also gives plots of the deviance residuals and the Pearson residuals. These plots seem to be okay.
Example 3: Poisson Regression Example

Table 21.3 consists of a simulated data set of size n = 30 such that the response (Y) follows a Poisson distribution with rate \lambda = \exp\{0.50 + 0.07X\}. A plot of the response versus the predictor is given in Figure 21.3.

The following gives the analysis of the Poisson regression data:
##########
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.007217   0.989060   0.007    0.994
x           0.306982   0.066799   4.596 8.37e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 3.977365)

    Null deviance: 195.37  on 29  degrees of freedom
Residual deviance: 111.37  on 28  degrees of freedom
AIC: 130.49
##########

Figure 21.4: (a) Plot of the deviance residuals. (b) Plot of the Pearson residuals.
REMISS  CELL  SMEAR  INFIL    LI  BLAST  TEMP
     1  0.80   0.83   0.66  1.90   1.10  1.00
     1  0.90   0.36   0.32  1.40   0.74  0.99
     0  0.80   0.88   0.70  0.80   0.18  0.98
     0  1.00   0.87   0.87  0.70   1.05  0.99
     1  0.90   0.75   0.68  1.30   0.52  0.98
     0  1.00   0.65   0.65  0.60   0.52  0.98
     1  0.95   0.97   0.92  1.00   1.23  0.99
     0  0.95   0.87   0.83  1.90   1.35  1.02
     0  1.00   0.45   0.45  0.80   0.32  1.00
     0  0.95   0.36   0.34  0.50   0.00  1.04
     0  0.85   0.39   0.33  0.70   0.28  0.99
     0  0.70   0.76   0.53  1.20   0.15  0.98
     0  0.80   0.46   0.37  0.40   0.38  1.01
     0  0.20   0.39   0.08  0.80   0.11  0.99
     0  1.00   0.90   0.90  1.10   1.04  0.99
     1  1.00   0.84   0.84  1.90   2.06  1.02
     0  0.65   0.42   0.27  0.50   0.11  1.01
     0  1.00   0.75   0.75  1.00   1.32  1.00
     0  0.50   0.44   0.22  0.60   0.11  0.99
     1  1.00   0.63   0.63  1.10   1.07  0.99
     0  1.00   0.33   0.33  0.40   0.18  1.01
     0  0.90   0.93   0.84  0.60   1.59  1.02
     1  1.00   0.58   0.58  1.00   0.53  1.00
     0  0.95   0.32   0.30  1.60   0.89  0.99
     1  1.00   0.60   0.60  1.70   0.96  0.99
     1  1.00   0.69   0.69  0.90   0.40  0.99
     0  1.00   0.73   0.73  0.70   0.40  0.99

Table 21.2: The leukemia data set. Descriptions of the variables are given in the text.
 i   xi  yi     i   xi  yi
 1    2   0    16   16   7
 2   15   6    17   13   6
 3   19   4    18    6   2
 4   14   1    19   16   5
 5   16   5    20   19   5
 6   15   2    21   24   6
 7    9   2    22    9   2
 8   17  10    23   12   5
 9   10   3    24    7   1
10   23  10    25    9   3
11   14   2    26    7   3
12   14   6    27   15   3
13    9   5    28   21   4
14    5   2    29   20   6
15   17   2    30   20   9

Table 21.3: The simulated Poisson regression data set with n = 30 values.
Chapter 22
Multivariate Multiple
Regression
Up until now, we have only been concerned with univariate responses (i.e.,
the case where the response Y is simply a single value for each observation).
However, sometimes you may have multiple responses measured for each observation, whether it be different characteristics or perhaps measurements
taken over time. When our regression setting must accommodate multiple
responses for a single observation, the technique is called multivariate regression.
22.1 The Model

Suppose that for each of the i = 1, \ldots, n observations we measure a set of p - 1 predictors:

X_i = \begin{pmatrix} X_{i,1} \\ \vdots \\ X_{i,p-1} \end{pmatrix}.
A set of m responses, or dependent variables, are measured for each of the i = 1, \ldots, n observations:

Y_i = \begin{pmatrix} Y_{i,1} \\ \vdots \\ Y_{i,m} \end{pmatrix}.

Each of the j = 1, \ldots, m responses has its own regression model:

Y_{i,j} = \beta_{0,j} + \beta_{1,j} X_{i,1} + \beta_{2,j} X_{i,2} + \ldots + \beta_{p-1,j} X_{i,p-1} + \epsilon_{i,j}.

Vectorizing the above model for a single observation yields:

Y_i^T = (1 \; X_i^T) B + \epsilon_i^T,

where

B = \begin{pmatrix} \beta_1 & \beta_2 & \ldots & \beta_m \end{pmatrix} = \begin{pmatrix} \beta_{0,1} & \beta_{0,2} & \ldots & \beta_{0,m} \\ \beta_{1,1} & \beta_{1,2} & \ldots & \beta_{1,m} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{p-1,1} & \beta_{p-1,2} & \ldots & \beta_{p-1,m} \end{pmatrix} \quad \text{and} \quad \epsilon_i = \begin{pmatrix} \epsilon_{i,1} \\ \vdots \\ \epsilon_{i,m} \end{pmatrix}.

Notice that \epsilon_i is the vector of errors for the ith observation.
Finally, we may explicitly write down the multivariate multiple regression model:

Y_{n \times m} = \begin{pmatrix} Y_1^T \\ \vdots \\ Y_n^T \end{pmatrix} = \begin{pmatrix} 1 & X_1^T \\ \vdots & \vdots \\ 1 & X_n^T \end{pmatrix} \begin{pmatrix} \beta_{0,1} & \beta_{0,2} & \ldots & \beta_{0,m} \\ \beta_{1,1} & \beta_{1,2} & \ldots & \beta_{1,m} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{p-1,1} & \beta_{p-1,2} & \ldots & \beta_{p-1,m} \end{pmatrix} + \begin{pmatrix} \epsilon_1^T \\ \vdots \\ \epsilon_n^T \end{pmatrix} = X_{n \times p} B_{p \times m} + \epsilon_{n \times m}.

Or more compactly, without the dimensional subscripts, we will write:

Y = XB + \epsilon.
22.2 Least Squares

Extending least squares theory from the multiple regression setting to the multivariate multiple regression setting is fairly intuitive. The biggest hurdle is dealing with the matrix calculations (which statistical packages perform for you anyhow). We can also formulate similar assumptions for the multivariate model.

Let

\epsilon_{(j)} = \begin{pmatrix} \epsilon_{1,j} \\ \vdots \\ \epsilon_{n,j} \end{pmatrix},

which is the vector of errors for the jth response measured on all n observations. We assume that E(\epsilon_{(j)}) = 0 and Cov(\epsilon_{(i)}, \epsilon_{(k)}) = \sigma_{i,k} I_{n \times n} for each i, k = 1, \ldots, m. In other words, the m responses on a given observation have variance-covariance matrix \Sigma = \{\sigma_{i,k}\}, while different observations are uncorrelated.

The least squares estimate for B is simply given by:

\hat{B} = (X^T X)^{-1} X^T Y.
The fitted values are given by \hat{Y} = X\hat{B} and the residuals are

\hat{\epsilon} = Y - \hat{Y}.

An estimate of \Sigma is then \hat{\Sigma} = \hat{\epsilon}^T \hat{\epsilon} / n.
Hypothesis Testing

Suppose we are interested in testing the hypothesis that our multivariate responses do not depend on the predictors X_{i,q+1}, \ldots, X_{i,p-1}. We can partition B to consist of two matrices: one with the regression coefficients of the predictors we assume will remain in the model and one with the regression coefficients we wish to test. Similarly, we can partition X in a similar manner. Formally, the test is

H_0: \beta_{(2)} = 0,

where

B = \begin{pmatrix} \beta_{(1)} \\ \beta_{(2)} \end{pmatrix} \quad \text{and} \quad X = \begin{pmatrix} X_1 & X_2 \end{pmatrix}.

Under H_0, the corresponding estimates are

\hat{\beta}_{(1)} = (X_1^T X_1)^{-1} X_1^T Y \quad \text{and} \quad \hat{\Sigma}_1 = (Y - X_1 \hat{\beta}_{(1)})^T (Y - X_1 \hat{\beta}_{(1)}) / n.
These values (which are maximum likelihood estimates under the null hypothesis) can be used to calculate one of four commonly used multivariate test statistics:

Wilks' Lambda = \frac{|n\hat{\Sigma}|}{|n\hat{\Sigma}_1|}

Pillai's Trace = \text{tr}[(\hat{\Sigma}_1 - \hat{\Sigma})\hat{\Sigma}_1^{-1}]

Hotelling-Lawley Trace = \text{tr}[(\hat{\Sigma}_1 - \hat{\Sigma})\hat{\Sigma}^{-1}]

Roy's Greatest Root = \frac{\eta_1}{1 + \eta_1}.

In the above, \eta_1 is the largest nonzero eigenvalue of (\hat{\Sigma}_1 - \hat{\Sigma})\hat{\Sigma}^{-1}. Also, the value |\hat{\Sigma}| is the determinant of the variance-covariance matrix and is called the generalized variance, which assigns a single numerical value to express the overall variation of this multivariate problem. All of the above test statistics have approximate F distributions with degrees of freedom which are more complicated to calculate than what we have seen. Most statistical packages will report at least one of the above if not all four. For large sample sizes, the associated p-values will likely be similar, but various situations (such as many large eigenvalues of (\hat{\Sigma}_1 - \hat{\Sigma})\hat{\Sigma}^{-1} or a relatively small sample size) will lead to a discrepancy between the results. In this case, it is usually accepted to report the Wilks' lambda value as this is the likelihood ratio test.
Confidence Regions

One problem is to predict the mean responses corresponding to fixed values x_h of the predictors. Using various distributional results concerning \hat{B}^T x_h and \hat{\Sigma}, it can be shown that the 100(1 - \alpha)% simultaneous confidence intervals for the mean responses x_h^T \beta_i are

x_h^T \hat{\beta}_i \pm \sqrt{\frac{m(n - p - 2)}{n - p - 1 - m} F_{m, n-p-1-m; 1-\alpha}} \sqrt{x_h^T (X^T X)^{-1} x_h \left(\frac{n}{n - p - 2}\right)\hat{\sigma}_{i,i}},

for i = 1, \ldots, m. Here, \hat{\beta}_i is the ith column of \hat{B} and \hat{\sigma}_{i,i} is the ith diagonal element of \hat{\Sigma}. Also, notice that the simultaneous confidence intervals are constructed for each of the m entries of the response vector, thus why they are considered simultaneous. Furthermore, the collection of these simultaneous intervals yields what we call a 100(1 - \alpha)% confidence region for \hat{B}^T x_h.
Prediction Regions

Another problem is to predict new responses Y_h = B^T x_h + \epsilon_h. Again, skipping over a discussion on various distributional assumptions, it can be shown that the 100(1 - \alpha)% simultaneous prediction intervals for the individual responses Y_{h,i} are

x_h^T \hat{\beta}_i \pm \sqrt{\frac{m(n - p - 2)}{n - p - 1 - m} F_{m, n-p-1-m; 1-\alpha}} \sqrt{(1 + x_h^T (X^T X)^{-1} x_h)\left(\frac{n}{n - p - 2}\right)\hat{\sigma}_{i,i}},

for i = 1, \ldots, m. The quantities here are the same as those in the simultaneous confidence intervals. Furthermore, the collection of these simultaneous prediction intervals are called a 100(1 - \alpha)% prediction region for y_h.
MANOVA

The multivariate analysis of variance (MANOVA) table is similar to its univariate counterpart. The sum of squares values in a MANOVA are no longer scalar quantities, but rather matrices. Hence, the entries in the MANOVA table are called sum of squares and cross-products (SSCPs). These quantities are described in a little more detail below:

The sum of squares and cross-products for total is SSCPTO = \sum_{i=1}^{n} (Y_i - \bar{Y})(Y_i - \bar{Y})^T, which is the sum of squared deviations from the overall mean vector of the Y_i's. SSCPTO is a measure of the overall variation in the Y vectors. The corresponding total degrees of freedom are n - 1.

The sum of squares and cross-products for the errors is SSCPE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)(Y_i - \hat{Y}_i)^T, which is the sum of squared observed errors (residuals) for the observed data vectors. SSCPE is a measure of the variation in Y that is not explained by the multivariate regression. The corresponding error degrees of freedom are n - p.
Source       df     SSCP
Regression   p - 1  \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})(\hat{Y}_i - \bar{Y})^T
Error        n - p  \sum_{i=1}^{n} (Y_i - \hat{Y}_i)(Y_i - \hat{Y}_i)^T
Total        n - 1  \sum_{i=1}^{n} (Y_i - \bar{Y})(Y_i - \bar{Y})^T

Table 22.1: MANOVA table for the multivariate multiple linear regression model.
Notice in the MANOVA table that we do not define any mean square values or an F-statistic. Rather, a test of the significance of the multivariate multiple regression model is carried out using a Wilks' lambda quantity similar to

\Lambda = \frac{\left|\sum_{i=1}^{n} (Y_i - \hat{Y}_i)(Y_i - \hat{Y}_i)^T\right|}{\left|\sum_{i=1}^{n} (Y_i - \bar{Y})(Y_i - \bar{Y})^T\right|},

which (after a suitable transformation) approximately follows a \chi^2 distribution. However, depending on the number of variables and the number of trials, modified versions of this test statistic must be used, which will affect the degrees of freedom for the corresponding \chi^2 distribution.
22.3 Reduced Rank Regression

Reduced rank regression provides a unified framework that encompasses
the classical statistical techniques of principal component analysis, canonical variate and correlation analysis, linear discriminant analysis, exploratory
factor analysis, multiple correspondence analysis, and other linear methods
of analyzing multivariate data. It is also heavily utilized in neural network
modeling and econometrics.
Recall that the multivariate regression model is

Y = XB + \epsilon,

where Y is an n \times m matrix, X is an n \times p matrix, and B is a p \times m matrix of regression parameters. A reduced rank regression occurs when we have the rank constraint

\text{rank}(B) = t \leq \min(p, m),

with equality yielding the traditional least squares setting. When the rank condition above holds, then there exist two non-unique full rank matrices A_{p \times t} and C_{t \times m} such that

B = AC.

Moreover, there may be an additional set of predictors, say W, such that W is an n \times q matrix. Letting D denote a q \times m matrix of regression parameters, we can then write the reduced rank regression model as follows:

Y = XAC + WD + \epsilon.

In order to get estimates for the reduced rank regression model, first note that E(\epsilon_{(j)}) = 0 and Var(\epsilon_{(j)}) = I_{m \times m}. For simplicity in the following, let Z_0 = Y, Z_1 = X, and Z_2 = W. Next, we define the moment matrices M_{i,j} = Z_i^T Z_j / m for i, j = 0, 1, 2 and S_{i,j} = M_{i,j} - M_{i,2} M_{2,2}^{-1} M_{2,j} for i, j = 0, 1. Then, the parameter estimates for the reduced rank regression model are as follows:

\hat{A} = (\hat{V}_1, \ldots, \hat{V}_t)\Gamma

\hat{C}^T = S_{0,1} \hat{A} (\hat{A}^T S_{1,1} \hat{A})^{-1}

\hat{D}^T = M_{0,2} M_{2,2}^{-1} - \hat{C}^T \hat{A}^T M_{1,2} M_{2,2}^{-1},

where (\hat{V}_1, \ldots, \hat{V}_t) are the eigenvectors corresponding to the t largest eigenvalues \hat{\lambda}_1, \ldots, \hat{\lambda}_t of |\lambda S_{1,1} - S_{1,0} S_{0,0}^{-1} S_{0,1}| = 0 and where \Gamma is an arbitrary t \times t matrix with full rank.
22.4 Example

  Y2  X1    X2   X3  X4   X5
3149   1  7500  220   0  140
 653   1  1975  200   0  100
 810   0  3600  205  60  111
 448   1   675  160  60  120
 844   1   750  185  70   83
1450   1  2500  180  60   80
 493   1   350  154  80   98
 941   0  1500  200  70   93
 547   1   375  137  60  105
 392   1  1050  167  60   74
1283   1  3000  180  60   80
 458   1   450  160  64   60
 722   1  1750  135  90   79
 384   0  2000  160  60   80
 501   0  4500  180   0  100
 405   0  1500  170  90  120
1520   1  3000  180   0  129
The estimated regression coefficients for each response are:

##########
                    Y1          Y2
(Intercept) -2879.4782  -2728.7085
X1            675.6508    763.0298
X2              0.2849      0.3064
X3             10.2721      8.8962
X4              7.2512      7.2056
X5              7.5982      4.9871
##########

Then we can obtain individual ANOVA tables for each response and see that the multiple regression model for each response is statistically significant.

##########
Response Y1 :
            Df  Sum Sq Mean Sq F value    Pr(>F)
Regression   5 6835932 1367186  17.286 6.983e-05 ***
Residuals   11  870008   79092
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Response Y2 :
            Df  Sum Sq Mean Sq F value    Pr(>F)
Regression   5 6669669 1333934  15.598 0.0001132 ***
Residuals   11  940709   85519
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##########

The following also gives the SSCP matrices for this fit:

##########
$SSCPR
        Y1      Y2
Y1 6835932 6709091
Y2 6709091 6669669

$SSCPE
         Y1       Y2
Y1 870008.3 765676.5
Y2 765676.5 940708.9

$SSCPTO
        Y1      Y2
Y1 7705940 7474767
Y2 7474767 7610378
##########
We can also see which predictors are statistically significant for each response:

##########
Response Y1 :

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.879e+03  8.933e+02  -3.224 0.008108 **
X1           6.757e+02  1.621e+02   4.169 0.001565 **
X2           2.849e-01  6.091e-02   4.677 0.000675 ***
X3           1.027e+01  4.255e+00   2.414 0.034358 *
X4           7.251e+00  3.225e+00   2.248 0.046026 *
X5           7.598e+00  3.849e+00   1.974 0.074006 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 281.2 on 11 degrees of freedom
Multiple R-Squared: 0.8871,  Adjusted R-squared: 0.8358
F-statistic: 17.29 on 5 and 11 DF,  p-value: 6.983e-05

Response Y2 :

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.729e+03  9.288e+02  -2.938 0.013502 *
X1           7.630e+02  1.685e+02   4.528 0.000861 ***
X2           3.064e-01  6.334e-02   4.837 0.000521 ***
X3           8.896e+00  4.424e+00   2.011 0.069515 .
X4           7.206e+00  3.354e+00   2.149 0.054782 .
X5           4.987e+00  4.002e+00   1.246 0.238622
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##########
Figure 22.1: Plots of the Studentized residuals versus fitted values for the response (a) total TCAD plasma level and the response (b) amount of amitriptyline present in TCAD plasma level.
Chapter 23
Data Mining
The field of Statistics is constantly being presented with larger and more
complex data sets than ever before. The challenge for the Statistician is to
be able to make sense of all of this data, extract important patterns, and
find meaningful trends. We refer to the general tools and the approaches for
dealing with these challenges in massive data sets as data mining.1
Data mining problems typically involve an outcome measurement which
we wish to predict based on a set of feature measurements. The set of
these observed measurements is called the training data. From these training data, we attempt to build a learner, which is a model used to predict the
outcome for new subjects. These learning problems are (roughly) categorized
as either supervised or unsupervised. A supervised learning problem is
one where the goal is to predict the value of an outcome measure based on a
number of input measures, such as classification with labeled samples from
the training data. An unsupervised learning problem is one where there is
no outcome measure and the goal is to describe the associations and patterns
among a set of input measures, which involves clustering unlabeled training
data by partitioning a set of features into a number of statistical classes. The
regression problems that are the focus of this text are (generally) supervised
learning problems.
Data mining is an extensive field in and of itself. In fact, many of the
methods utilized in this field are regression-based. For example, smoothing
splines, shrinkage methods, and multivariate regression methods are all often
found in data mining. The purpose of this chapter will not be to revisit these methods, but rather to give a brief overview of some additional techniques commonly used in data mining.
23.1
In fact, a slight modification to the LARS algorithm can calculate all possible
LASSO estimates for a given problem. Moreover, a different modification
to LARS efficiently implements forward stagewise regression. In fact, the
acronym for LARS includes an S at the end to reflect its connection to
LASSO and forward stagewise regression.
Earlier in the text we also introduced the bootstrap as a way to get bootstrap confidence intervals for the regression parameters. However, the notion
of the bootstrap can also be extended to fitting a regression model. Suppose that we have p - 1 feature measurements and one outcome variable. Let Z = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} be our training data that we wish to fit a model to such that we obtain the prediction \hat{f}(x) at each input x. Bootstrap aggregation or bagging averages this prediction over a collection of bootstrap samples, thus reducing its variance. For each bootstrap sample Z_b, b = 1, 2, \ldots, B, we fit our model, which yields the prediction \hat{f}_b(x). The bagging estimate is then defined by

\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(x).
23.2 Support Vector Machines
Figure 23.2: A plot of data where a support vector machine has been used
for classification. The data was generated where we know that the circles
belong to group 1 and the triangles belong to group 2. The white contours
show where the margin is; however, there are clearly some values that have
been misclassified since the two clusters are not well-separated. The points
that are solid were used as the training data.
is less sensitive to outliers than the quadratic loss function. Figure 23.3(c) is Huber's loss function, which is a robust loss function that has optimal properties when the underlying distribution of the data is unknown. Finally, Figure 23.3(d) is called the \epsilon-insensitive loss function, which enables a sparse set of support vectors to be obtained.
In Support Vector Regressions (or SVRs), the input is first mapped onto an N-dimensional feature space using some fixed (nonlinear) mapping, and then a linear model is constructed in this feature space. Using mathematical notation, the linear model (in the feature space) is given by

f(x, \omega) = \sum_{j=1}^{N} \omega_j g_j(x) + b,
Figure 23.3: Plots of the (a) quadratic loss, (b) Laplace loss, (c) Huber's loss, and (d) \epsilon-insensitive loss functions.
then the bias term is dropped. Note that b is not considered stochastic in
this model and is not akin to the error terms we have studied in previous
models.
The optimal regression function is given by the minimum of the functional

\Phi(\omega, \xi) = \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{n} (\xi_i^+ + \xi_i^-),

where C is a pre-specified constant and \xi^+, \xi^- are slack variables representing upper and lower constraints (respectively) on the output of the system. In other words, we have the following constraints:

y_i - f(x_i, \omega) \leq \epsilon + \xi_i^+
f(x_i, \omega) - y_i \leq \epsilon + \xi_i^-
\xi_i^+, \xi_i^- \geq 0, \quad i = 1, \ldots, n,

where \epsilon is determined by the loss function we are using. The four loss functions we show in Figure 23.3 are as follows:
functions we show in Figure 23.3 are as follows:
Quadratic Loss:
L2 = (f (x) y) = (f (x) y)2
Laplace Loss:
L1 = (f (x) y) = |f (x) y|
Hubers Loss2 :
LH =
1
(f (x)
2
y)2 ,
|f (x) y|
2
,
2
-Insensitive Loss:
L =
0,
for |f (x) y| < ;
|f (x) y| , otherwise.
Depending on which loss function is chosen, an appropriate optimization problem can be specified, which can involve kernel methods. Moreover, specification of the kernel type as well as values like C, \epsilon, and \delta all control the complexity of the model in different ways. There are many subtleties depending on which loss function is used, and the investigator should become familiar with the loss function being employed. Regardless, the optimization approach will require the use of numerical methods.
It is also desirable to strike a balance between complexity and the error that is present with the fitted model. Test error (also known as generalization error) is the expected prediction error over an independent test sample and is given by

\text{Err} = E[L(Y, \hat{f}(X))],

where X and Y are drawn randomly from their joint distribution. This expectation is an average of everything that is random in this set-up, including the randomness in the training sample that produced the estimate \hat{f}(\cdot). Training error is the average loss over the training sample and is given by

\overline{\text{err}} = \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{f}(x_i)).

We would like to know the test error of our estimated model \hat{f}(\cdot). As the model increases in complexity, it is able to capture more complicated underlying structures in the data, which thus decreases bias. But then the estimation error increases, which thus increases variance. This is known as the bias-variance tradeoff. In between there is an optimal model complexity that gives minimum test error.
23.3 Transfer Learning
Transfer learning is the notion that it is easier to learn a new concept (such
as how to play racquetball) if you are already familiar with a similar concept
(such as knowing how to play tennis). In the context of supervised learning, inductive transfer learning is often framed as the problem of learning
a concept of interest, called the target concept, given data from multiple
sources: a typically small amount of target data that reflects the target concept, and a larger amount of source data that reflects one or more different,
but possibly related, source concepts.
While most algorithms addressing this notion are in classification settings,
some of the common algorithms can be extended to the regression setting to
help us build our models. The approach we discuss is called boosting, which iteratively reweights the training observations. In outline, at iteration t the algorithm computes a weighted error \epsilon_t = \sum_{i=1}^{n} e_i^t w_i^t over the training observations, checks whether \epsilon_t \geq 0.5, sets \beta_t = \epsilon_t/(1 - \epsilon_t), and then updates each observation's weight in proportion to \beta_t^{1 - e_i^t}.
23.4 Classification and Regression Trees
Classification and regression trees (CART) is a nonparametric tree-based method which partitions the predictor space into a set of rectangles and then fits a simple model (like a constant) in each one. While they seem conceptually simple, they are actually quite powerful.

Suppose we have one response (y_i) and p predictors (x_{i,1}, \ldots, x_{i,p}) for i = 1, \ldots, n. First we partition the predictor space into M regions (say, R_1, \ldots, R_M) and model the response as a constant c_m in each region:

f(x) = \sum_{m=1}^{M} c_m I(x \in R_m).

If we adopt the sum of squared errors \sum_{i=1}^{n} (y_i - f(x_i))^2 as our criterion, then the best estimate of c_m is simply the average of the responses falling in region R_m:

\hat{c}_m = \frac{\sum_{i=1}^{n} y_i I(x_i \in R_m)}{\sum_{i=1}^{n} I(x_i \in R_m)}.
We proceed to grow the tree by finding the best binary partition in terms of the c_m values. This is generally computationally infeasible, which leads to the use of a greedy search algorithm. Typically, the tree is grown until a small node size (such as 5 nodes) is reached and then a method for pruning the tree is implemented.
Multivariate adaptive regression splines (MARS) is another nonparametric method that can be viewed as a modification of CART and is well-suited for high-dimensional problems. MARS uses expansions in piecewise linear basis functions of the form (x - t)_+ and (t - x)_+, where the + subscript simply means we take the positive part (e.g., (x - t)_+ = (x - t)I(x > t)). These two functions together are called a reflected pair.

In MARS, each function is piecewise linear with a knot at t. The idea is to form a reflected pair for each predictor X_j with knots at each observed value x_{i,j} of that predictor. Therefore, the collection of basis functions for j = 1, \ldots, p is

C = \{(X_j - t)_+, (t - X_j)_+\}_{t \in \{x_{1,j}, \ldots, x_{n,j}\}}.

MARS proceeds like a forward stepwise regression model selection procedure, but instead of selecting the predictors to use, we use functions from the set C and their products. Thus, the model has the form

f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X),

where each h_m(X) is a function (or a product of functions) from C.
23.5
Neural Networks
With the exponential growth in available data and advancement in computing power, researchers in statistics, artificial intelligence, and data mining
have been faced with the challenge to develop simple, flexible, powerful procedures for modeling large data sets. One such model is the neural network
approach, which attempts to model the response as a nonlinear function of
various linear combinations of the predictors. Neural networks were first used
as models for the human brain.
The most commonly used neural network model is the single-hidden-layer, feedforward neural network (sometimes called the single-layer perceptron). In this neural network model, the ith response y_i is modeled as a nonlinear function f_Y of m derived predictor values, S_{i,0}, S_{i,1}, \ldots, S_{i,m-1}:

y_i = f_Y(\alpha_0 S_{i,0} + \alpha_1 S_{i,1} + \ldots + \alpha_{m-1} S_{i,m-1}) + \epsilon_i = f_Y(S_i^T \alpha) + \epsilon_i,

where

\alpha = \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_{m-1} \end{pmatrix} \quad \text{and} \quad S_i = \begin{pmatrix} S_{i,0} \\ S_{i,1} \\ \vdots \\ S_{i,m-1} \end{pmatrix}.

S_{i,0} equals 1 and, for j = 1, \ldots, m - 1, the jth derived predictor value for the ith observation, S_{i,j}, is a nonlinear function f_j of a linear combination of the original predictors:

S_{i,j} = f_j(X_i^T \gamma_j),

where

\gamma_j = \begin{pmatrix} \gamma_{j,0} \\ \gamma_{j,1} \\ \vdots \\ \gamma_{j,p-1} \end{pmatrix} \quad \text{and} \quad X_i = \begin{pmatrix} X_{i,0} \\ X_{i,1} \\ \vdots \\ X_{i,p-1} \end{pmatrix},

and X_{i,0} = 1. The functions f_Y, f_1, \ldots, f_{m-1} are called activation functions. Finally, we can combine all of the above to form the neural network model as:

y_i = f_Y(S_i^T \alpha) + \epsilon_i = f_Y\left(\alpha_0 + \sum_{j=1}^{m-1} \alpha_j f_j(X_i^T \gamma_j)\right) + \epsilon_i.
The parameters are typically estimated by minimizing a penalized least squares criterion of the form

\sum_{i=1}^{n} \left[ y_i - f_Y\left(\alpha_0 + \sum_{j=1}^{m-1} \alpha_j f_j(X_i^T \gamma_j)\right)\right]^2 + \lambda\, p(\alpha, \gamma_1, \ldots, \gamma_{m-1}),

where the penalty is

p(\alpha, \gamma_1, \ldots, \gamma_{m-1}) = \sum_{i=0}^{m-1} \alpha_i^2 + \sum_{i=1}^{m-1}\sum_{j=1}^{p-1} \gamma_{i,j}^2.
Finally, there is also a modeling technique which is similar to the tree-based methods discussed earlier. The hierarchical mixture-of-experts model (HME model) is a parametric tree-based method which recursively splits the function of interest at each node. However, the splits are done probabilistically and the probabilities are functions of the predictors. The model is written as

f(y_i) = \sum_{j_1=1}^{k_1} g_{j_1}(x_i, \gamma) \sum_{j_2=1}^{k_2} g_{j_2}(x_i, \gamma_{j_1}) \cdots \sum_{j_r=1}^{k_r} g_{j_r}(x_i, \gamma_{j_1,\ldots,j_{r-1}}) f_{j_1,\ldots,j_r}(y_i),

which has a tree structure with r levels (i.e., r levels where probabilistic splits occur). The g(\cdot) functions provide the probabilities for the splitting and, in addition to being dependent on the predictors, they also have their own set of parameters (the different \gamma values) requiring estimation; these mixing probabilities must be estimated along with the parameters of the terminal models.
23.6 Examples

X2  Y
 0  0
 0  1
 1  1
 1  0
Figure 23.4: (a) The fitted single hidden-layer neural net model to the toy data. (b) The fitted double hidden-layer neural net model to the toy data.
In the above output, the first group of 5 repetitions pertains to the single hidden-layer neural net. For those 5 repetitions, the third training sample yielded the smallest error (about 0.3481). The second group of 5 repetitions pertains to the double hidden-layer neural net. For those 5 repetitions, the fourth training sample yielded the smallest error (about 0.0002). The increase in complexity of the neural net has yielded a smaller training error. The fitted neural net models are depicted in Figure 23.4.

Figure 23.5: (a) Data from a simulated motorcycle accident where the time until impact (in milliseconds) is plotted versus the recorded head acceleration (in g). (b) The data with different values of \epsilon used for the support vector regression obtained with an \epsilon-insensitive loss function. Note how the smaller the \epsilon, the more features you pick up in the fit, but the complexity of the model also increases.
Example 2: Motorcycle Accident Data

This data set is from a simulated accident involving different motorcycles. The time in milliseconds until impact and the g-force measurement of acceleration are recorded. The data are provided in Table 23.2 and plotted in Figure 23.5(a). Given the obvious nonlinear trend that is present with this data, we will attempt to fit a support vector regression to this data.

A support vector regression using an \epsilon-insensitive loss function is fitted to this data. The values \epsilon \in \{0.01, 0.10, 0.70\} are fitted to this data and are shown in Figure 23.5(b). As \epsilon decreases, different characteristics of the data are emphasized, but the level of complexity of the model is increased. As noted earlier, we want to try and strike a good balance regarding the model complexity. For the training error, we get values of 0.177, 0.168, and 0.250 for the three levels of \epsilon. Since our objective is to minimize the training error, the value of \epsilon = 0.10 (which has a training error of 0.168) is chosen. This corresponds to the green line in Figure 23.5(b).
Obs.  Times  Accel.    Obs.  Times  Accel.    Obs.  Times  Accel.
  1     2.4     0.0     33    18.6  -112.5     64    32.0    54.9
  2     2.6    -1.3     34    19.2  -123.1     65    32.8    46.9
  3     3.2    -2.7     35    19.4   -85.6     66    33.4    16.0
  4     3.6     0.0     36    19.6  -127.2     67    33.8    45.6
  5     4.0    -2.7     37    20.2  -123.1     68    34.4     1.3
  6     6.2    -2.7     38    20.4  -117.9     69    34.8    75.0
  7     6.6    -2.7     39    21.2  -134.0     70    35.2   -16.0
  8     6.8    -1.3     40    21.4  -101.9     71    35.4    69.6
  9     7.8    -2.7     41    21.8  -108.4     72    35.6    34.8
 10     8.2    -2.7     42    22.0  -123.1     73    36.2   -37.5
 11     8.8    -1.3     43    23.2  -123.1     74    38.0    46.9
 12     9.6    -2.7     44    23.4  -128.5     75    39.2     5.4
 13    10.0    -2.7     45    24.0  -112.5     76    39.4    -1.3
 14    10.2    -5.4     46    24.2   -95.1     77    40.0   -21.5
 15    10.6    -2.7     47    24.6   -53.5     78    40.4   -13.3
 16    11.0    -5.4     48    25.0   -64.4     79    41.6    30.8
 17    11.4     0.0     49    25.4   -72.3     80    42.4    29.4
 18    13.2    -2.7     50    25.6   -26.8     81    42.8     0.0
 19    13.6    -2.7     51    26.0    -5.4     82    43.0    14.7
 20    13.8     0.0     52    26.2  -107.1     83    44.0    -1.3
 21    14.6   -13.3     53    26.4   -65.6     84    44.4     0.0
 22    14.8    -2.7     54    27.0   -16.0     85    45.0    10.7
 23    15.4   -22.8     55    27.2   -45.6     86    46.6    10.7
 24    15.6   -40.2     56    27.6     4.0     87    47.8   -26.8
 25    15.8   -21.5     57    28.2    12.0     88    48.8   -13.3
 26    16.0   -42.9     58    28.4   -21.5     89    50.6     0.0
 27    16.2   -21.5     59    28.6    46.9     90    52.0    10.7
 28    16.4    -5.4     60    29.4   -17.4     91    53.2   -14.7
 29    16.6   -59.0     61    30.2    36.2     92    55.0    -2.7
 30    16.8   -71.0     62    31.0    75.0     93    55.4    -2.7
 31    17.6   -37.5     63    31.2     8.1     94    57.6    10.7
 32    17.8   -99.1

Table 23.2: The simulated motorcycle accident data set, giving the times until impact (in milliseconds) and the recorded head accelerations (in g).
Chapter 24
Advanced Topics
This chapter presents topics where theory beyond the scope of this course
needs to be developed with the applicability. The topics are not arranged
in any particular order, but rather are just a sample of some of the more
advanced regression procedures that are available. Not all computer software
has the capabilities to perform analysis on the models presented here.
24.1
Semiparametric Regression
Semiparametric regression is concerned with flexible modeling of nonlinear functional relationships in regression analysis by building a model consisting of both parametric and nonparametric components. We have already
visited a semiparametric model with the Cox proportional hazards model. In
this model, there is the baseline hazard, which is nonparametric, and then
the hazards ratio, which is parametric.
Suppose we have n = 200 observations where y is the response, x1 is
a predictor taking on only values of 1, 2, 3 or 4, and x2 , x3 and x4 are
predictors taking on values between 0 and 1. A semiparametric regression
model of interest for this setting is
y_i = \beta_0 + \beta_1 z_{2,i} + \beta_2 z_{3,i} + \beta_3 z_{4,i} + m(x_{2,i}, x_{3,i}, x_{4,i}) + \epsilon_i,

where

z_{j,i} = I\{x_{1,i} = j\}.

In other words, we are using the leave-one-out method for the levels of x_1.
The results of fitting a semiparametric regression model are given in Figure 24.1. There are noticeable functional forms for x_2 and x_3; however, x_4 appears to almost be 0. In fact, this is exactly how the data was generated. The data were generated according to:

y_i = 5.15487 + e^{2x_{i,1}} + 0.2 x_{2,i}^{11}(10(1 - x_{2,i}))^6 + 10(10 x_{2,i})^3 (1 - x_{2,i})^{10} + e_i.

There are actually many general forms of semiparametric regression models. We will list a few of them. In the following outline, X = (X_1, \ldots, X_p)^T pertains to the predictors and may be partitioned such that X = (U^T, V^T)^T where U = (U_1, \ldots, U_r)^T, V = (V_1, \ldots, V_s)^T, and r + s = p. Also, m(\cdot) is a nonparametric function and g(\cdot) is a link function as established in the discussion on generalized linear models.
Figure 24.1: Estimated smooth terms s(x2, 1.73), s(x3, 7.07), and s(x4, 1), along with the partial effect of x1, from the semiparametric regression fit.
The additive model:

Y = \beta_0 + \sum_{j=1}^{p} m_j(X_j) + \epsilon.

The generalized additive model:

g(E(Y)) = \beta_0 + \sum_{j=1}^{p} m_j(X_j).

The partially linear generalized additive model:

g(E(Y)) = U^T\beta + \sum_{j=1}^{s} m_j(V_j).
24.2
The next model we consider is not unlike growth curve models. Suppose
we have responses measured on each subject repeatedly. However, we no
longer assume that the same number of responses are measured for each
subject (such data is often called longitudinal data or trajectory data).
In addition, the regression parameters are now subject-specific parameters.
The regression parameters are considered random effects and are assumed
to follow their own distribution. Earlier, we only discussed the sampling
distribution of the regression parameters and the regression parameters were
assumed fixed (i.e., they were assumed to be fixed effects).
Figure 24.2: Scatterplot of the infant data with a trajectory (in this case, a quadratic response curve) fitted to each infant.

Figure 24.3: Plots for each group of infants where each group has a different number of measurements.
24.3
24.4
Mediation Regression
Figure 24.4: Diagram showing the basic flow of a mediation regression model, with the independent variable X, the mediator variable M, and the dependent variable Y.
Variables that alter the strength or direction of the relationships amongst the variables already in our model are called moderator variables and are often tested as an interaction effect. A significantly nonzero X × M interaction in the second equation above suggests that the \beta_2 coefficient differs across different levels of X. These different coefficient levels may reflect mediation as a manipulation, thus altering the relationship between M and Y. The moderator variables may be either a manipulated factor in an experimental setting (e.g., dosage of medication) or a naturally occurring variable (e.g., gender). By examining moderator effects, one can investigate whether the experiment differentially affects subgroups of individuals. Three primary models involving moderator variables are typically studied:

Moderated mediation: The simplest of the three, this model has a variable which mediates the effects of an independent variable on a dependent variable, and the mediated effect depends on the level of another variable (i.e., the moderator). Thus, the mediational mechanism differs for subgroups of the study. This model is more complex from an interpretative viewpoint when the moderator is continuous. Basically, you have either X -> M and/or M -> Y dependent on levels of another variable (call it Z).
24.5 Meta-Regression Models

Note that for the fixed-effect model, no plural is used as only ONE true effect across all studies is assumed.
24.6 Bayesian Regression

Bayesian methods are built on Bayes' theorem, which for events A and B states that

P(A|B) = \frac{P(B|A)P(A)}{P(B)}.

The usual least squares estimate of \beta is constructed from the frequentist's view (along with the maximum likelihood estimate \hat{\sigma}^2 of \sigma^2) in that we assume there are enough measurements of the predictors to say something meaningful about the response. In the Bayesian view, we assume we have only a small sample of the possible measurements and we seek to correct our estimate by borrowing information from a larger set of similar observations.
First note the decomposition

\|y - X\beta\|^2 = \|y - X\hat{\beta}\|^2 + \|X(\beta - \hat{\beta})\|^2.

Now rewrite the likelihood as

\ell(y|X, \beta, \sigma^2) \propto (\sigma^2)^{-(n-v)/2}\exp\left\{-\frac{1}{2\sigma^2}\|X(\beta - \hat{\beta})\|^2\right\}(\sigma^2)^{-v/2}\exp\left\{-\frac{vs^2}{2\sigma^2}\right\},

where vs^2 = \|y - X\hat{\beta}\|^2 and v = n - p with p as the number of parameters to estimate. This suggests a form for the priors:

\pi(\beta, \sigma^2) = \pi(\sigma^2)\pi(\beta|\sigma^2).

The prior distributions are characterized by hyperparameters, which are parameter values (often data-dependent) which the researcher specifies. The prior for \sigma^2, \pi(\sigma^2), is an inverse gamma distribution with a shape hyperparameter and a scale hyperparameter. The prior for \beta, \pi(\beta|\sigma^2), is a multivariate normal distribution with location and dispersion hyperparameters \tilde{\beta} and \tilde{\Sigma}. This yields the joint posterior distribution:

f(\beta, \sigma^2|y, X) \propto \ell(y|X, \beta, \sigma^2)\pi(\beta|\sigma^2)\pi(\sigma^2) \propto \sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\left(\tilde{s} + (\beta - \tilde{\beta}^*)^T(\tilde{\Sigma}^{-1} + X^T X)(\beta - \tilde{\beta}^*)\right)\right\},

where

\tilde{\beta}^* = (\tilde{\Sigma}^{-1} + X^T X)^{-1}(\tilde{\Sigma}^{-1}\tilde{\beta} + X^T X\hat{\beta})

and \tilde{s} collects the remaining terms, including twice the scale hyperparameter of the inverse gamma prior, \hat{\sigma}^2(n - p), and quadratic forms in (\tilde{\beta} - \tilde{\beta}^*) and (\hat{\beta} - \tilde{\beta}^*) weighted by \tilde{\Sigma}^{-1} and X^T X, respectively. Finally, it can be shown that the distribution of \beta|X, y is a multivariate-t distribution with n + p - 1 degrees of freedom such that:

E(\beta|X, y) = \tilde{\beta}^*

\text{Cov}(\beta|X, y) = \frac{\tilde{s}(\tilde{\Sigma}^{-1} + X^T X)^{-1}}{n + p - 3}.
24.7 Quantile Regression

Recall that the ordinary least squares estimate minimizes the sum of squared errors

\hat{\beta} = \arg\min_{\beta}\sum_{i=1}^{n}(y_i - x_i^T\beta)^2,

whereas quantile regression replaces the squared-error loss with an asymmetric absolute loss so that a specified quantile \tau of the conditional distribution of Y is modeled.
Figure 24.5: Various quantile regression fits for the food expenditures data set.
\tau = 0.95 regression quantile), while those with the lowest food expenditures will likely have smaller regression coefficients (such as the \tau = 0.05 regression quantile). The estimates for each of these quantile regressions are as follows:
##########
Coefficients:
              tau= 0.05   tau= 0.10   tau= 0.25
(Intercept) 124.8800408 110.1415742  95.4835396
x             0.3433611   0.4017658   0.4741032

              tau= 0.75   tau= 0.90   tau= 0.95
(Intercept)  62.3965855  67.3508721  64.1039632
x             0.6440141   0.6862995   0.7090685

Degrees of freedom: 235 total; 233 residual
##########
Estimation for quantile regression can be done through linear programming or other optimization procedures. Furthermore, statistical intervals can
also be computed.
24.8
Monotone Regression
Suppose we have a set of data (x_1, y_1), \ldots, (x_n, y_n). For ease of notation, let us assume there is already an ordering on the predictor variable. Specifically, we assume that x_1 \leq \ldots \leq x_n. Monotonic regression is a technique where we attempt to find a weighted least squares fit of the responses y_1, \ldots, y_n to a set of scalars a_1, \ldots, a_n with corresponding weights w_1, \ldots, w_n, subject to monotonicity constraints giving a simple or partial ordering of the responses. In other words, the responses are supposed to increase (or decrease) as the predictor increases, and the regression line we fit is piecewise constant (which resembles a step function). The weighted least squares problem for monotonic regression is given by the following quadratic program:

\arg\min_{a}\sum_{i=1}^{n} w_i(y_i - a_i)^2 \quad \text{subject to } a_1 \leq a_2 \leq \ldots \leq a_n.

A more general version replaces the squared error by a power p \geq 1:

\arg\min_{a}\sum_{i=1}^{n} w_i|y_i - a_i|^p.
Figure 24.6: (a) An isotonic regression fit to the data. (b) The cumulative sums of the responses plotted against the predictor, along with the convex minorant.
The line which is plotted is called the convex minorant. Each predictor value
where this convex minorant intersects at the value of the cumulative sum
is the same value of the predictor where the slope changes in the isotonic
regression plot.
24.9
Spatial Regression
Two broad classes of spatial effects are often distinguished: spatial heterogeneity and spatial dependency. We will provide a brief overview of both types of effects, but it should be noted that we will only skim the surface of what is a very rich area.
A spatial regression model reflecting spatial heterogeneity is written locally as

Y = X\beta(g) + \epsilon,

where g indicates that the regression coefficients are to be estimated locally at the coordinates specified by g and \epsilon is an error term distributed with mean 0 and variance \sigma^2. This model is called geographically weighted regression or GWR. The estimate of \beta(g) is found using a weighting scheme such that

\hat{\beta}(g) = (X^T W(g) X)^{-1} X^T W(g) Y.

The weights in the geographic weighting matrix W(g) are chosen such that those observations near the point in space where the parameter estimates are desired have more influence on the result than those observations further away. This model is essentially a local regression model like the one discussed in the section on LOESS. While the choice of a geographic (or spatially) weighted matrix is a blend of art and science, one commonly used weight is the Gaussian weight function, where the diagonal entries of the n \times n matrix W(g) are:

w_i(g) = \exp\{-d_i/h\},

where d_i is the Euclidean distance between observation i and location g, while h is the bandwidth.
The resulting parameter estimates or standard errors for the spatial heterogeneity model may be mapped in order to examine local variations in the
parameter estimates. Hypothesis tests are also possible regarding this model.
Spatial regression models also accommodate spatial dependency in two
major ways: through a spatial lag dependency (where the spatial correlation
occurs in the dependent variable) or a spatial error dependency (where the
spatial correlation occurs through the error term). A spatial lag model is
a spatial regression model which models the response as a function of not
only the predictors, but also values of the response observed at other (likely
neighboring) locations:
y_i = f(y_{j(i)}; \theta) + X_i^T\beta + \epsilon_i.

A common specification takes the spatially lagged term to be a weighted average of neighboring responses, with spatial autocorrelation coefficient \rho:

y_i = \rho\sum_{j=1}^{n} w_{i,j} y_j + X_i^T\beta + \epsilon_i.

A spatial error model instead places the spatial structure in the disturbance term, for example y = X\beta + u with u = \lambda W u + \epsilon,
where u is a vector of random error terms. Other spatial processes exist, such as a conditional autoregressive process and a spatial moving average process, both of which resemble similar time series processes.

Estimation of these spatial regression models can be accomplished through various techniques, but they differ depending on whether you have a spatial lag dependency or a spatial error dependency. Such estimation methods include maximum likelihood estimation, the use of instrumental variables, and semiparametric methods.
There are also tests for the spatial autocorrelation coefficient, of which the most notable uses Moran's I statistic. Moran's I statistic is calculated as

I = \frac{e^T W(g) e / S_0}{e^T e / n},

where e is a vector of ordinary least squares residuals, W(g) is a geographic weighting matrix, and S_0 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{i,j} is a normalizing factor. Then, Moran's I test can be based on a normal approximation using a standardized I statistic with

E(I) = \text{tr}(MW)/(n - p),

where M = I - X(X^T X)^{-1}X^T, and with Var(I) computed from similar trace expressions involving M and W.
##########
weights: boston.listw

Moran I statistic standard deviate = 14.5085, p-value < 2.2e-16
alternative hypothesis: two.sided
sample estimates:
Observed Moran's I    Expectation      Variance
      0.4364296993  -0.0168870829  0.0009762383
##########
As can be seen, the p-value is very small and so the spatial autocorrelation
coefficient is significant.
Next, we attempt to fit a spatial regression model with spatial error dependency including those variables that the investigator specified:
##########
##########
Call: errorsarlm(formula = log(MEDV) ~ CRIM + ZN + INDUS + CHAS
    + I(NOX^2) + I(RM^2) + AGE + log(DIS) + log(RAD) + TAX
    + PTRATIO + B + log(LSTAT), data = boston.c,
    listw = boston.listw)

Residuals:
       Min         1Q     Median         3Q        Max
-0.6476342 -0.0676007  0.0011091  0.0776939  0.6491629

Type: error
Coefficients: (asymptotic standard errors)
               Estimate  Std. Error  z value   Pr(>|z|)
(Intercept)  3.85706025  0.16083867  23.9809  < 2.2e-16
CRIM        -0.00545832  0.00097262  -5.6120  2.000e-08
ZN           0.00049195  0.00051835   0.9491  0.3425907
INDUS        0.00019244  0.00282240   0.0682  0.9456389
CHAS1       -0.03303428  0.02836929  -1.1644  0.2442466
I(NOX^2)    -0.23369337  0.16219194  -1.4408  0.1496286
I(RM^2)      0.00800078  0.00106472   7.5145  5.707e-14
AGE         -0.00090974  0.00050116  -1.8153  0.0694827
log(DIS)    -0.10889420  0.04783714  -2.2764  0.0228249
log(RAD)     0.07025730  0.02108181   3.3326  0.0008604
TAX         -0.00049870  0.00012072  -4.1311  3.611e-05
PTRATIO     -0.01907770  0.00564160  -3.3816  0.0007206
B            0.00057442  0.00011101   5.1744  2.286e-07
log(LSTAT)  -0.27212781  0.02323159 -11.7137  < 2.2e-16
##########
24.10
Circular Regression
error assumed to follow a von Mises distribution with circular mean 0 and concentration parameter \kappa. The von Mises distribution is the circular analog of the univariate normal distribution, but has a more complex form. The von Mises distribution with circular mean \mu and concentration parameter \kappa is defined on the range x \in [0, 2\pi), with probability density function

f(x) = \frac{e^{\kappa\cos(x - \mu)}}{2\pi I_0(\kappa)}

and cumulative distribution function

F(x) = \frac{1}{2\pi I_0(\kappa)}\left(x I_0(\kappa) + 2\sum_{j=1}^{\infty}\frac{I_j(\kappa)\sin(j(x - \mu))}{j}\right).

In the above, I_p(\kappa) is called a modified Bessel function of the first kind of order p. The Bessel function is the contour integral

I_p(z) = \frac{1}{2\pi i}\oint e^{(z/2)(t - 1/t)} t^{-(p+1)}\, dt,

where the contour encloses the origin and traverses in a counterclockwise direction.
Figure 24.7: (a) Plot of the von Mises error terms used in the generation of the sample data. (b) Plot of the continuous predictor (X) versus the circular response (Y) along with the circular-linear regression fit.

##########
[1,]  6.7875  1.1271
[2,]  0.9618  0.2223

Log-Likelihood: 55.89
##########
24.11
Mixtures of Regressions
Consider a large data set consisting of the heights of males and females.
When looking at the distribution of this data, the data for the males will (on
average) be higher than that of the females. A histogram of this data would
clearly show two distinct bumps or modes. Knowing the gender labels of each
subject would allow one to account for that subgroup in the analysis being
used. However, what happens if the gender label of each subject were lost?
In other words, we don't know which observation belongs to which gender.
The setting where data appears to be from multiple subgroups, but there is
no label providing such identification, is the focus of the area called mixture
modeling.
Figure 24.8: (a) Plot of spark-ignition engine fuel data with equivalence ratio as the response and the measure of nitrogen oxide emissions. (b) Plot of the same data with EM algorithm estimates from a 2-component mixture of regressions fit.
There are many issues one should be cognizant of when building a mixture
model. In particular, maximum likelihood estimation can be quite complex
since the likelihood does not yield closed-form solutions and there are identifiability issues (however, the use of a Newton-Raphson or EM algorithm
usually provides a good solution). One alternative is to use a Bayesian approach with Markov Chain Monte Carlo (MCMC) methods, but this too has
its own set of complexities. While we do not explore these issues, we do see
how a mixture model can occur in the regression setting.
A mixture of linear regressions model can be used when it appears
that there is more than one regression line that could fit this data due to
some underlying characteristic (i.e., a latent variable). Suppose we have n
observations which belongs to one of k groups. If we knew to which group
an observation belonged (i.e., its label), then we could write down explicitly
the linear regression model given that observation i belongs to group j:
yi = XT
i j + ij ,
such that ij is normally distributed with mean 0 and variance j2 . Notice
how the regression coefficients and variance terms are different for each group.
However, now assume that the labels are unobserved. In this case, we can
only assign a probability that observation i came from group j. Specifically,
the density function for the mixture of linear regression model is:
k
X
1
T
2
2 1/2
f (yi ) =
j (2j )
exp 2 (yi Xi j ) ,
2j
j=1
P
such that kj=1 j = 1. Estimation is done by using the likelihood (or rather
log likelihood) function based on the above density. For maximum likelihood,
one typically uses an EM algorithm.
As an example, consider the data set which gives the equivalence ratios and peak nitrogen oxide emissions in a study using pure ethanol as a spark-ignition engine fuel. A plot of the equivalence ratios versus the measure of nitrogen oxide is given in Figure 24.8. Suppose one wanted to predict the equivalence ratio from the amount of nitrogen oxide emissions. As you can see, there appear to be groups of data where separate regressions appear appropriate (one with a positive trend and one with a negative trend). Figure 24.8(b) gives the same plot, but with estimates from an EM algorithm overlaid. EM algorithm estimates for this data are \hat{\beta}_1 = (0.565, 0.085)^T, \hat{\beta}_2 = (1.247, -0.083)^T, \hat{\sigma}_1^2 = 0.00188, and \hat{\sigma}_2^2 = 0.00058.

It should be noted that mixtures of regressions appear in many areas. For example, in economics it is called switching regimes. In the social sciences, it is sometimes referred to as latent class regression.