
Federal University of Bahia


Polytechnic School of Engineering
Chemical Engineering Department
Industrial Engineering Program

LINEAR MODEL SELECTION


AND REGULARIZATION
Prof. Karla Oliveira Esquerre
karlaesquerre@ufba.br

http://gamma.ufba.br/

Linear Model Selection and Regularization

Linear models have distinct advantages in terms of inference and, on real-world problems, are often surprisingly competitive in relation to non-linear methods.

Alternative fitting procedures can yield better prediction accuracy and better model interpretability than ordinary least squares.


Linear Model Selection and Regularization


Prediction accuracy:
• If the true relationship is approximately linear, the least squares estimates will have low bias
• If n ≫ p, the least squares estimates tend to also have low variance
• If n is not much larger than p, there can be a lot of variability in the least squares fit, leading to overfitting and poor predictions on future observations
• If p > n, there is no longer a unique least squares coefficient estimate: the variance is infinite, so the method cannot be used at all

Solution: constraining or shrinking the estimated coefficients, which can substantially reduce the variance with a negligible increase in bias.

Linear Model Selection and Regularization


Model interpretability:
• Excluding irrelevant variables from a multiple regression model, i.e., performing feature selection or variable selection, yields a model that is easier to interpret.


Linear Model Selection and Regularization


Important classes of methods:

• Subset selection: identifying a subset of the p predictors believed to be related to the response
• Shrinkage (regularization): the estimated coefficients are shrunken towards zero
• Dimension reduction: projecting the p predictors into an M-dimensional subspace, where M < p

Subset Selection: Best Subset Selection


Fit a separate least squares regression for each possible combination of the p predictors: 2^p models in total.
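A minimal R sketch of best subset selection (not from the slides), using regsubsets() from the leaps package; the simulated data, variable names and true coefficients are illustrative assumptions.

# Best subset selection on simulated data with leaps::regsubsets()
library(leaps)

set.seed(1)
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)   # only X1 and X2 truly matter
dat <- data.frame(y = y, x)

# method = "exhaustive" (the default) searches all 2^p candidate models;
# nvmax is the largest subset size to report
best_fit <- regsubsets(y ~ ., data = dat, nvmax = p)
summary(best_fit)$which    # best model for each number of predictors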


Subset Selection: Best Subset Selection

Best model for a given number of predictors (red). ISLR Figure 6.1

Why do we need caution in analyzing these graphs?

Subset Selection: Best Subset Selection


Why do we need caution in analyzing these graphs?
• RSS decreases and R² increases as the number of features included in the model increases.
• A low RSS or a high R² indicates a model with a low training error, not necessarily a low test error.

An enormous search space can lead to overfitting and high variance of the coefficient estimates.


Subset Selection: Best Subset Selection


The same best subset selection algorithm used for least squares regression can be applied to other types of models, such as logistic regression, BUT with deviance used in place of RSS.

• Negative two times the maximized log-likelihood
• The smaller the deviance, the better the fit
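As a small side illustration (not on the slides): for a logistic regression fitted with glm() in R, the residual deviance reported by deviance() equals minus two times the maximized log-likelihood when the response is binary; the simulated data are an assumption of this sketch.

# Deviance of a logistic regression fit on simulated binary data
set.seed(2)
x <- rnorm(200)
ybin <- rbinom(200, size = 1, prob = plogis(0.5 + 1.5 * x))

fit <- glm(ybin ~ x, family = binomial)
deviance(fit)                  # residual deviance of the fitted model
-2 * as.numeric(logLik(fit))   # the same quantity: -2 * maximized log-likelihood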

Subset Selection: Best Subset Selection

What happens when p increases?

p = 10 → about 1,000 possible models
p = 20 → over 1,000,000 possible models
p = 40 → best subset selection becomes computationally infeasible even with extremely fast modern computers


Subset Selection: Stepwise Selection


Stepwise methods explore a far more restricted set of models than best subset selection. Why?

Subset Selection: Forward Stepwise Selection


Begins with a model containing no predictors, and then adds predictors to the model, one at a time, until all of the predictors are in the model.
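A minimal R sketch (not from the slides) of forward and backward stepwise selection via the method argument of regsubsets(); the simulated data frame is an illustrative assumption.

# Forward and backward stepwise selection with leaps::regsubsets()
library(leaps)

set.seed(1)
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)
dat <- data.frame(y = y, x)

fwd_fit <- regsubsets(y ~ ., data = dat, nvmax = p, method = "forward")
bwd_fit <- regsubsets(y ~ ., data = dat, nvmax = p, method = "backward")
summary(fwd_fit)$which    # variables retained at each step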


Subset Selection: Forward Stepwise Selection

1 + Σ_{k=0}^{p−1} (p − k) = 1 + p(p + 1)/2 possible models

What happens when p increases?

p = 20 → 211 possible models (versus more than 1,000,000 for best subset selection)

Subset Selection: Forward Stepwise Selection

Why, even though forward stepwise selection tends to do well in practice, is it not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors?


Subset Selection: Forward Stepwise Selection

Forward stepwise selection can be applied in the high-dimensional setting where n < p; however, in this case it is possible to construct submodels M_0, ..., M_{n−1} only, since each submodel is fit using least squares, which will not yield a unique solution if p ≥ n.

Subset Selection: Backward Stepwise Selection


Begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time.


Subset Selection: Backward Stepwise Selection


Properties:
• The backward selection approach searches through only 1 + p(p + 1)/2 models, so it can be applied when p is large.
• It is not guaranteed to yield the best model containing a subset of the p predictors.
• n must be larger than p so that the full model can be fit.

Subset Selection: Hybrid Approaches


Variables are added to the model sequentially, but they are also removed when they no longer provide an improvement in the model fit.

Hybrid approaches closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.


Choosing the Optimal Model

Training error can be a poor estimate of the test error.

Therefore, RSS and R² are not suitable for selecting the best model among a collection of models with different numbers of predictors.


Choosing the Optimal Model

So, we wish to choose a model with a low test error.


There are two common approaches:

• Indirectly estimate the test error by making an adjustment to the training error to account for the bias due to overfitting.
• Directly estimate the test error, using either a validation set approach or a cross-validation approach.


Choosing the Optimal Model

C_p, the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the adjusted R² can be used to adjust the training error for the model size.


Choosing the Optimal Model


C_p

For a fitted least squares model containing d predictors, the C_p estimate of the test MSE is computed using the equation

C_p = (1/n) ( RSS + 2 d σ̂² )

where σ̂² is an estimate of the variance of the error ε associated with each response measurement.

If σ̂² is an unbiased estimate of σ², C_p is an unbiased estimate of the test MSE. As a consequence, the C_p statistic tends to take on a small value for models with a low test error.


Choosing the Optimal Model


AIC criterion

Is defined for a large class of models fit by maximum likelihood. In the case of the standard linear model Y = β_0 + β_1 X_1 + ⋯ + β_p X_p + ε with Gaussian errors, maximum likelihood and least squares are the same thing. So,

AIC = (1/(n σ̂²)) ( RSS + 2 d σ̂² )    (for simplicity, an additive constant has been omitted)

For least squares models, C_p and AIC are proportional to each other.


Choosing the Optimal Model


BIC criterion

Is derived from a Bayesian point of view (and looks similar to C_p and AIC):

BIC = (1/n) ( RSS + log(n) d σ̂² )    (up to irrelevant constants)

BIC will also tend to take on a small value for a model with a low test error. Since log n > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than C_p.

Choosing the Optimal Model


Adjusted R²

Remember: R² = 1 − RSS/TSS, where TSS = Σ_{i=1}^n ( y_i − ȳ )².

Adjusted R² = 1 − [ RSS/(n − d − 1) ] / [ TSS/(n − 1) ]

RSS decreases as more variables are included in the model, so R² increases as more variables are added.

Maximizing the adjusted R² is equivalent to minimizing RSS/(n − d − 1), which may increase or decrease due to the presence of d in the denominator.
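A minimal R sketch (not from the slides): the summary of a regsubsets() fit reports C_p, BIC and adjusted R² for the best model of each size, so the three criteria can be compared directly; the simulated data are an illustrative assumption.

# Comparing model sizes with Cp, BIC and adjusted R^2 from leaps
library(leaps)

set.seed(1)
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)
dat <- data.frame(y = y, x)

s <- summary(regsubsets(y ~ ., data = dat, nvmax = p))

which.min(s$cp)      # model size with the smallest Cp
which.min(s$bic)     # model size with the smallest BIC
which.max(s$adjr2)   # model size with the largest adjusted R^2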


Choosing the Optimal Model


Validation and cross-validation advantages:
+ Provide a direct estimate of the test error.
+ Make fewer assumptions about the true underlying model.
+ Can be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors) or hard to estimate the error variance σ².


Choosing the Optimal Model


Why do we need the one-standard-error rule? Because many models may show a similar estimated test MSE.

1. Calculate the standard error of the estimated test MSE for each model size.
2. Select the smallest model whose estimated test MSE is within one standard error of the lowest point on the curve (see the R sketch below).
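A minimal sketch of the one-standard-error rule, assuming a hypothetical K × M matrix cv_err of cross-validated MSEs (K folds, M candidate model sizes); the matrix here is filled with placeholder values just so the code runs.

# One-standard-error rule applied to a matrix of CV errors
set.seed(3)
cv_err <- matrix(rexp(10 * 8), nrow = 10, ncol = 8)        # placeholder CV MSEs

mean_err <- colMeans(cv_err)                               # estimated test MSE per size
se_err   <- apply(cv_err, 2, sd) / sqrt(nrow(cv_err))      # its standard error

best      <- which.min(mean_err)                           # size with smallest CV error
threshold <- mean_err[best] + se_err[best]
one_se    <- min(which(mean_err <= threshold))             # smallest size within 1 SE
one_se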

Shrinkage Methods

The two best-known techniques for shrinking the


regression coefficients towards zero are

• Ridge regression
• Lasso


Shrinkage Methods: Ridge Regression

Remember: least squares fitting estimates β_0, β_1, ..., β_p using the values that minimize

RSS = Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )²


Shrinkage Methods: Ridge Regression


The ridge regression coefficient estimates β̂^R are the values that minimize

Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )² + λ Σ_{j=1}^p β_j² = RSS + λ Σ_{j=1}^p β_j²

where λ ≥ 0 is a tuning parameter to be determined separately.
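A minimal R sketch (not from the slides): ridge regression over a grid of λ values with glmnet(), where alpha = 0 selects the ridge penalty; the simulated data and the λ grid are illustrative assumptions.

# Ridge regression over a lambda grid with glmnet
library(glmnet)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)

grid <- 10^seq(4, -2, length = 100)                  # lambda grid, from large to small
ridge_fit <- glmnet(x, y, alpha = 0, lambda = grid)

dim(coef(ridge_fit))    # (p + 1) coefficients, one column per lambda value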

Shrinkage Methods: Ridge Regression

Ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.

The second term, λ Σ_{j=1}^p β_j² (the shrinkage penalty), is small when β_1, ..., β_p are close to zero, so it has the effect of shrinking the estimates of β_j towards zero.


Shrinkage Methods: Ridge Regression


λ controls the relative impact of these two terms on the regression coefficient estimates:

• when λ = 0, the penalty term has no effect, and ridge regression will produce the least squares estimates
• when λ → ∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero.


Shrinkage Methods: Ridge Regression


Ridge regression will produce a different set of coefficient estimates, β̂_λ^R, for each value of λ.

We use cross-validation to select a good value for λ.

β_0 is not shrunken. If the columns of the data matrix have been centered to have mean zero before ridge regression is performed, then the estimated intercept takes the form β̂_0 = ȳ = Σ_{i=1}^n y_i / n.


Shrinkage Methods: Ridge Regression


‖β‖₂ = sqrt( Σ_{j=1}^p β_j² ) is the ℓ2 norm; it measures the distance of β from 0.

‖β̂_λ^R‖₂ / ‖β̂‖₂ measures the amount that the ridge regression coefficient estimates have been shrunken towards zero; a small value indicates that they have been shrunken very close to zero.

Standardized ridge regression coefficients. ISLR Figure 6.4

Shrinkage Methods: Ridge Regression

Why is it best to apply ridge regression after standardizing the predictors?

Because the ridge regression coefficient estimates are not scale equivariant: x_j β̂_{j,λ}^R depends not only on λ but also on the scaling of the predictors. Standardizing the predictors puts them all on the same scale:

x̃_ij = x_ij / sqrt( (1/n) Σ_{i=1}^n ( x_ij − x̄_j )² )
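As a small side illustration (an assumption of this sketch, not stated on the slide): in R, scale() standardizes each column, and glmnet() already standardizes the predictors internally by default (standardize = TRUE). Note that scale() divides by the usual (n − 1)-denominator standard deviation, while the formula above uses a 1/n denominator; the two differ only by a constant factor.

# Standardizing a predictor matrix before shrinkage methods
x <- matrix(rnorm(100 * 5), 100, 5)

x_std <- scale(x)             # subtract column means, divide by column SDs
round(colMeans(x_std), 10)    # approximately 0
apply(x_std, 2, sd)           # exactly 1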

Shrinkage Methods: Ridge Regression


Why does ridge regression improve over least squares?
• Because of the bias-variance trade-off.

• As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.
OR
• As λ increases, the shrinkage of the ridge coefficient estimates leads to a substantial reduction in the variance of the predictions, at the expense of a (slight) increase in bias.

Shrinkage Methods: Ridge Regression

Squared bias (black), variance (green), and test mean squared error (purple); minimum possible MSE (horizontal dashed lines); ridge regression models for which the MSE is smallest (purple crosses). ISLR Figure 6.4


Shrinkage Methods: Ridge Regression

Because of its high variance, the MSE associated with the least squares fit, when λ = 0, is almost as high as that of the null model for which all coefficient estimates are zero, when λ = ∞.

As ‖β̂_λ^R‖₂ / ‖β̂‖₂ increases, the fits become more flexible, so the bias decreases and the variance increases.

Shrinkage Methods: Ridge Regression

― Ridge regression will include all p predictors in the final model (subset selection methods do not), unless λ = ∞.

― Model interpretability becomes a challenge when p is large (accuracy is not the problem).

+ The lasso overcomes this disadvantage.


Shrinkage Methods: Lasso


The lasso coefficient estimates β̂^L_λ are the values that minimize

Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )² + λ Σ_{j=1}^p |β_j| = RSS + λ Σ_{j=1}^p |β_j|

where λ ≥ 0 is a tuning parameter to be determined separately.
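A minimal R sketch (not from the slides): the lasso with glmnet(), where alpha = 1 selects the ℓ1 penalty; the simulated data and the chosen λ are illustrative assumptions.

# The lasso with glmnet
library(glmnet)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)

lasso_fit <- glmnet(x, y, alpha = 1)   # glmnet picks its own lambda sequence
coef(lasso_fit, s = 0.5)               # sparse coefficient vector at lambda = 0.5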

Shrinkage Methods: Lasso


ℓ1 penalty (lasso):

Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )² + λ Σ_{j=1}^p |β_j| = RSS + λ Σ_{j=1}^p |β_j|

ℓ2 penalty (ridge regression):

Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )² + λ Σ_{j=1}^p β_j² = RSS + λ Σ_{j=1}^p β_j²


Shrinkage Methods: Lasso

The ℓ1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large.

+ The lasso yields sparse models, that is, models that involve only a subset of the variables.

Shrinkage Methods: Lasso


Another formulation for the lasso and ridge regression:

Lasso: minimize over β   Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )²   subject to   Σ_{j=1}^p |β_j| ≤ s

Ridge: minimize over β   Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )²   subject to   Σ_{j=1}^p β_j² ≤ s


Shrinkage Methods: Lasso

Contours of the error RSS (red) and constraint regions (solid areas) for the lasso, |β_1| + |β_2| ≤ s (left), and ridge regression, β_1² + β_2² ≤ s (right). ISLR Figure 6.7

Best subset selection can be written in the same form:
minimize over β   Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )²   subject to   Σ_{j=1}^p I( β_j ≠ 0 ) ≤ s

Shrinkage Methods: Lasso

Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?

The lasso constraint region has sharp corners (a diamond when p = 2, a polyhedron or polytope when p > 2); when the RSS contours touch the constraint region at a corner, one or more coefficients are exactly zero, so the lasso performs feature selection.


Comparing the Lasso and the Ridge Regression


Lasso performs better in a setting where a relatively
small number of predictors have substantial
coefficients, and the remaining predictors have
coefficients that are very small or that equal zero.

Ridge regression will perform better when the


response is a function of many predictors, all with
coefficients of roughly equal size.

Comparing the Lasso and the Ridge Regression

The lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions and models that are easier to interpret.


Comparing the Lasso and the Ridge Regression


The fact that some lasso coefficients are shrunken entirely to zero explains why the lasso performs feature selection.

The type of shrinkage performed by the lasso is known as soft-thresholding.

The ridge regression and lasso coefficient estimates for a simple setting with n = p and X a diagonal matrix with 1's on the diagonal. ISLR Figure 6.10

Comparing the Lasso and the Ridge Regression

Ridge regression shrinks every dimension of the data by the same proportion, whereas the lasso shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.


Bayesian Interpretation for Ridge Regression and Lasso


A Bayesian viewpoint for regression assumes that:
• The coefficient vector β has some prior distribution p(β), where β = (β_0, β_1, ..., β_p)ᵀ
• The likelihood of the data can be written as f(Y | X, β), where X = (X_1, ..., X_p)
• The posterior distribution takes the form
  p(β | X, Y) ∝ f(Y | X, β) p(β | X) = f(Y | X, β) p(β), where X is fixed
The posterior mode gives the most likely value for β, given the data.

Bayesian Interpretation for Ridge Regression and Lasso

• For Y = β_0 + X_1 β_1 + ⋯ + X_p β_p + ε, assume that p(β) = Π_{j=1}^p g(β_j) for some density function g, and that the errors are independent and drawn from a normal distribution.

→ If g is a Gaussian distribution with mean zero and standard deviation a function of λ, the posterior mode for β is given by the ridge regression solution.

→ If g is a double-exponential (Laplace) distribution with mean zero and scale parameter a function of λ, the posterior mode for β is the lasso solution. ISLR Figure 6.11


Selecting the Tuning Parameter

Choose a grid of λ values and compute the cross-validation error (LOOCV or ten-fold CV) for each value of λ; then select the value for which the cross-validation error is smallest and re-fit the model using all of the observations.

Predictors that are related to the response are often referred to as signal variables, and unrelated predictors as noise variables.
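A minimal R sketch (not from the slides): choosing λ by ten-fold cross-validation with cv.glmnet(); lambda.min minimizes the CV error and lambda.1se applies the one-standard-error rule. The simulated data are an illustrative assumption.

# Choosing lambda for the lasso by cross-validation
library(glmnet)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)

cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
cv_fit$lambda.min               # lambda with the smallest CV error
cv_fit$lambda.1se               # largest lambda within one SE of the minimum
coef(cv_fit, s = "lambda.1se")  # coefficients at the one-SE choice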


So far…

Methods which have controlled variance by using a subset of the original variables, or by shrinking their coefficients towards zero (in both cases using the original variables X_1, ..., X_p).


Dimension Reduction Methods


Approaches that transform the predictors and then fit a least squares model using the transformed variables.

Z_1, Z_2, ..., Z_M represent M < p linear combinations of our original p predictors:

Z_m = Σ_{j=1}^p φ_jm X_j for some constants φ_1m, ..., φ_pm, with Σ_{j=1}^p φ_jm² = 1.

The linear regression model is then fit by least squares:

y_i = θ_0 + Σ_{m=1}^M θ_m z_im + ε_i,  i = 1, ..., n

Dimension Reduction Methods


The dimension of the problem has been reduced from p + 1 to M + 1:

Σ_{m=1}^M θ_m z_im = Σ_{m=1}^M θ_m Σ_{j=1}^p φ_jm x_ij = Σ_{j=1}^p ( Σ_{m=1}^M θ_m φ_jm ) x_ij = Σ_{j=1}^p β_j x_ij

where β_j = Σ_{m=1}^M θ_m φ_jm


Dimension Reduction Methods

+ Dimension reduction serves to constrain the estimated β_j coefficients.

― The constraint may bias the coefficient estimates.

+ When M ≪ p, the variance of the fitted coefficients can be significantly reduced.

Dimension Reduction Methods


Dimension reduction methods work in two steps:

1. The transformed predictors Z_1, Z_2, ..., Z_M are obtained
2. The model is fit using these M transformed predictors

How are the φ_jm's obtained?


Principal Component Regression (PCR)


Steps:
1. Standardize each predictor prior to generating the PCs
2. Construct the first M principal components
3. Use the components as the predictors in a linear regression model
4. Fit the model coefficients by least squares (see the R sketch below)
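A minimal R sketch (not from the slides) of PCR with pcr() from the pls package; scale = TRUE standardizes the predictors and validation = "CV" cross-validates over the number of components. The simulated data are an illustrative assumption.

# Principal component regression with the pls package
library(pls)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)
dat <- data.frame(y = y, x)

pcr_fit <- pcr(y ~ ., data = dat, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")   # CV error vs number of components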


Principal Component Regression (PCR)

Why is PCR more closely related to ridge regression


than to the lasso?

Why can ridge regression be considered as a


continuous version of PCR?

Why is PCR considered an unsupervised method?


Partial Least Squares (PLS)

PLS identifies a new set of features Z_1, ..., Z_M that not only approximate the old features well but are also related to the response.


Partial Least Squares (PLS)


Steps:
1. Standardize each predictor and the response
2. Compute the first direction Z_1 by setting each φ_j1 equal to the coefficient from the simple linear regression of Y onto X_j
3. In computing Z_1 = Σ_{j=1}^p φ_j1 X_j, PLS places the highest weight on the variables that are most strongly related to the response (see the R sketch below)
Propose an algorithm to compute the principal components!
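A minimal R sketch (not from the slides) of PLS with plsr() from the pls package; as before, the simulated data and the cross-validation settings are illustrative assumptions.

# Partial least squares with the pls package
library(pls)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)
dat <- data.frame(y = y, x)

pls_fit <- plsr(y ~ ., data = dat, scale = TRUE, validation = "CV")
summary(pls_fit)    # % variance explained and cross-validated prediction error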


Partial Least Squares (PLS)


The number M of partial least squares directions is a tuning parameter that is typically chosen by cross-validation.

In practice, PLS often performs no better than ridge regression or PCR. While the supervised dimension reduction of PLS can reduce bias, it also has the potential to increase variance, so that the overall benefit of PLS relative to PCR is a wash.

Considerations in High Dimensions

Data sets containing more features than observations (p > n) are often referred to as high-dimensional.

What goes wrong in high dimensions?


Considerations in High Dimensions


When p ≥ n, regardless of the values of the observations, the least squares regression line will fit the data exactly, overfitting the data. Such a model will perform extremely poorly on an independent test set.

So, when p > n or p ≈ n, a simple least squares regression line is too flexible and hence overfits the data.

Considerations in High Dimensions


R² increases to 1 and the training set MSE decreases to 0 as more features are included, although the test set MSE increases.

Simulated example with n = 20. Figure 6.10


Considerations in High Dimensions

C_p, AIC and BIC approaches are not appropriate in the high-dimensional setting because estimating σ̂² is problematic (σ̂² = 0); similarly, the adjusted R² can easily take a value of 1.


Regression in High Dimensions


Regularization or shrinkage plays a key role in high-dimensional problems.
Appropriate tuning parameter selection is crucial for good predictive performance.

– The test error tends to increase as the dimensionality of the problem increases, unless the additional features are truly associated with the response.


Regression in High Dimensions


Big data issue:
The new technologies that allow the collection of measurements for thousands or millions of features are a double-edged sword: they can lead to improved predictive models if these features are in fact relevant to the problem, but will lead to worse results if the features are not relevant. Even when they are relevant, the variance incurred in fitting their coefficients may outweigh the reduction in bias that they bring.

Interpreting Results in High Dimensions

In the high-dimensional setting, the multicollinearity problem is extreme, so we can never identify the best coefficients for use in the regression.
We must be careful not to overstate the results obtained, since what we have identified is simply one of many possible models, and it must be validated on independent data sets.


Moment of Reflection

Do I want to look back and


know I could have done better?


Homework

Conceptual Exercises: ISL Chapter 6
Applied Exercises (optional): ISL Chapter 6

Use R Markdown
Deadline: -
Submit the assignment on Moodle
Max. 3 members per group
