
Federal University of Bahia


Polytechnic School of Engineering
Chemical Engineering Department
Industrial Engineering Program

LINEAR MODEL SELECTION


AND REGULARIZATION
Prof. Karla Oliveira Esquerre
karlaesquerre@ufba.br

http://gamma.ufba.br/

Linear Model Selection and Regularization

Linear models have distinct advantages in terms of inference and, on real-world problems, are often surprisingly competitive in relation to non-linear methods.

Alternative fitting procedures can yield better prediction accuracy and better model interpretability than ordinary least squares.


Linear Model Selection and Regularization


Prediction accuracy:
• If the true relationship is approximately linear, the least squares estimates will have low bias
• If n ≫ p, the least squares estimates tend to also have low variance
• If n is not much larger than p, there can be a lot of variability in the least squares fit, leading to overfitting and poor predictions on future observations
• If p > n, there is no longer a unique least squares coefficient estimate: the variance is infinite, so the method cannot be used at all

Solution: constraining or shrinking the estimated coefficients, which can substantially reduce the variance with a negligible increase in bias.

Linear Model Selection and Regularization


Model interpretability:
• Excluding irrelevant variables from a multiple regression model, i.e., performing feature selection or variable selection, yields a model that is easier to interpret.


Linear Model Selection and Regularization


Important classes of methods:

• Subset selection: identifying a subset of the p predictors believed to be related to the response
• Shrinkage (regularization): the estimated coefficients are shrunken towards zero
• Dimension reduction: projecting the p predictors into an M-dimensional subspace, where M < p

Subset Selection: Best Subset Selection


Fit a separate least squares regression for each possible combination of the p predictors: 2^p models in total.
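A minimal R sketch of best subset selection (not from the slides), using regsubsets() from the leaps package; the simulated data, variable names and true coefficients are illustrative assumptions.

# Best subset selection on simulated data with leaps::regsubsets()
library(leaps)

set.seed(1)
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)   # only X1 and X2 truly matter
dat <- data.frame(y = y, x)

# method = "exhaustive" (the default) searches all 2^p candidate models;
# nvmax is the largest subset size to report
best_fit <- regsubsets(y ~ ., data = dat, nvmax = p)
summary(best_fit)$which    # best model for each number of predictors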


Subset Selection: Best Subset Selection

Best model for a given number of predictors (red). ISLR Figure 6.1

Why do we need caution in analyzing these graphs?

Subset Selection: Best Subset Selection


Why do we need caution in analyzing these graphs?
• RSS decreases and R² increases as the number of features included in the model increases.
• A low RSS or a high R² indicates a model with a low training error, not necessarily a low test error.

An enormous search space can lead to overfitting and high variance of the coefficient estimates.


Subset Selection: Best Subset Selection


The same best subset selection algorithm used for least squares regression can be applied to other types of models, such as logistic regression, BUT with deviance used in place of RSS.

• Negative two times the maximized log-likelihood
• The smaller the deviance, the better the fit
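As a small side illustration (not on the slides): for a logistic regression fitted with glm() in R, the residual deviance reported by deviance() equals minus two times the maximized log-likelihood when the response is binary; the simulated data are an assumption of this sketch.

# Deviance of a logistic regression fit on simulated binary data
set.seed(2)
x <- rnorm(200)
ybin <- rbinom(200, size = 1, prob = plogis(0.5 + 1.5 * x))

fit <- glm(ybin ~ x, family = binomial)
deviance(fit)                  # residual deviance of the fitted model
-2 * as.numeric(logLik(fit))   # the same quantity: -2 * maximized log-likelihood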

Subset Selection: Best Subset Selection

What happens when p increases?

p = 10 → about 1,000 possible models
p = 20 → over 1,000,000 possible models
p = 40 → best subset selection becomes computationally infeasible even with extremely fast modern computers


Subset Selection: Stepwise Selection


Stepwise methods explore a far more restricted set of models than best subset selection. Why?

Subset Selection: Forward Stepwise Selection


Begins with a model containing no predictors, and then adds predictors to the model, one at a time, until all of the predictors are in the model.
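A minimal R sketch (not from the slides) of forward and backward stepwise selection via the method argument of regsubsets(); the simulated data frame is an illustrative assumption.

# Forward and backward stepwise selection with leaps::regsubsets()
library(leaps)

set.seed(1)
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)
dat <- data.frame(y = y, x)

fwd_fit <- regsubsets(y ~ ., data = dat, nvmax = p, method = "forward")
bwd_fit <- regsubsets(y ~ ., data = dat, nvmax = p, method = "backward")
summary(fwd_fit)$which    # variables retained at each step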


Subset Selection: Forward Stepwise Selection

1 + Σ_{k=0}^{p−1} (p − k) = 1 + p(p + 1)/2 possible models

What happens when p increases?

p = 20 → 211 possible models (versus more than 1,000,000 for best subset selection)

Subset Selection: Forward Stepwise Selection

Why, even though forward stepwise selection tends to do well in practice, is it not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors?


Subset Selection: Forward Stepwise Selection

Forward stepwise selection can be applied in the high-dimensional setting where n < p; however, in this case it is possible to construct submodels M_0, ..., M_{n−1} only, since each submodel is fit using least squares, which will not yield a unique solution if p ≥ n.

Subset Selection: Backward Stepwise Selection


Begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time.


Subset Selection: Backward Stepwise Selection


Properties:
• The backward selection approach searches through only 1 + p(p + 1)/2 models, so it can be applied when p is large.
• It is not guaranteed to yield the best model containing a subset of the p predictors.
• n must be larger than p so that the full model can be fit.

Subset Selection: Hybrid Approaches


Variables are added to the model sequentially, but they are also removed when they no longer provide an improvement in the model fit.

Hybrid approaches closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.


Choosing the Optimal Model

Training error can be a poor estimate of the test error.

Therefore, RSS and R² are not suitable for selecting the best model among a collection of models with different numbers of predictors.


Choosing the Optimal Model

So, we wish to choose a model with a low test error.


There are two common approaches:

• Indirectly estimate the test error by making an adjustment to the training error to account for the bias due to overfitting.
• Directly estimate the test error, using either a validation set approach or a cross-validation approach.


Choosing the Optimal Model

C_p, the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the adjusted R² can be used to adjust the training error for the model size.


Choosing the Optimal Model


C_p

For a fitted least squares model containing d predictors, the C_p estimate of the test MSE is computed using the equation

C_p = (1/n) ( RSS + 2 d σ̂² )

where σ̂² is an estimate of the variance of the error ε associated with each response measurement.

If σ̂² is an unbiased estimate of σ², C_p is an unbiased estimate of the test MSE. As a consequence, the C_p statistic tends to take on a small value for models with a low test error.


Choosing the Optimal Model


AIC criterion

Is defined for a large class of models fit by maximum likelihood. In the case of the standard linear model Y = β_0 + β_1 X_1 + ⋯ + β_p X_p + ε with Gaussian errors, maximum likelihood and least squares are the same thing. So,

AIC = (1/(n σ̂²)) ( RSS + 2 d σ̂² )    (for simplicity, an additive constant has been omitted)

For least squares models, C_p and AIC are proportional to each other.


Choosing the Optimal Model


BIC criterion

Is derived from a Bayesian point of view (and looks similar to C_p and AIC):

BIC = (1/n) ( RSS + log(n) d σ̂² )    (up to irrelevant constants)

BIC will also tend to take on a small value for a model with a low test error. Since log n > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than C_p.

Choosing the Optimal Model


Adjusted R²

Remember: R² = 1 − RSS/TSS, where TSS = Σ_{i=1}^n ( y_i − ȳ )².

Adjusted R² = 1 − [ RSS/(n − d − 1) ] / [ TSS/(n − 1) ]

RSS decreases as more variables are included in the model, so R² increases as more variables are added.

Maximizing the adjusted R² is equivalent to minimizing RSS/(n − d − 1), which may increase or decrease due to the presence of d in the denominator.
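A minimal R sketch (not from the slides): the summary of a regsubsets() fit reports C_p, BIC and adjusted R² for the best model of each size, so the three criteria can be compared directly; the simulated data are an illustrative assumption.

# Comparing model sizes with Cp, BIC and adjusted R^2 from leaps
library(leaps)

set.seed(1)
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)
dat <- data.frame(y = y, x)

s <- summary(regsubsets(y ~ ., data = dat, nvmax = p))

which.min(s$cp)      # model size with the smallest Cp
which.min(s$bic)     # model size with the smallest BIC
which.max(s$adjr2)   # model size with the largest adjusted R^2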


Choosing the Optimal Model


Validation and cross-validation advantages:
+ Provide a direct estimate of the test error.
+ Make fewer assumptions about the true underlying model.
+ Can be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors) or hard to estimate the error variance σ².


Choosing the Optimal Model


Why do we need the one-standard-error rule? Because many models may show a similar estimated test MSE.

1. Calculate the standard error of the estimated test MSE for each model size.
2. Select the smallest model whose estimated test MSE is within one standard error of the lowest point on the curve (see the R sketch below).
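A minimal sketch of the one-standard-error rule, assuming a hypothetical K × M matrix cv_err of cross-validated MSEs (K folds, M candidate model sizes); the matrix here is filled with placeholder values just so the code runs.

# One-standard-error rule applied to a matrix of CV errors
set.seed(3)
cv_err <- matrix(rexp(10 * 8), nrow = 10, ncol = 8)        # placeholder CV MSEs

mean_err <- colMeans(cv_err)                               # estimated test MSE per size
se_err   <- apply(cv_err, 2, sd) / sqrt(nrow(cv_err))      # its standard error

best      <- which.min(mean_err)                           # size with smallest CV error
threshold <- mean_err[best] + se_err[best]
one_se    <- min(which(mean_err <= threshold))             # smallest size within 1 SE
one_se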

Shrinkage Methods

The two best-known techniques for shrinking the


regression coefficients towards zero are

• Ridge regression
• Lasso


Shrinkage Methods: Ridge Regression

Remember: least squares fitting estimates β_0, β_1, ..., β_p using the values that minimize

RSS = Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )²


Shrinkage Methods: Ridge Regression


The ridge regression coefficient estimates β̂^R are the values that minimize

Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )² + λ Σ_{j=1}^p β_j² = RSS + λ Σ_{j=1}^p β_j²

where λ ≥ 0 is a tuning parameter to be determined separately.
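A minimal R sketch (not from the slides): ridge regression over a grid of λ values with glmnet(), where alpha = 0 selects the ridge penalty; the simulated data and the λ grid are illustrative assumptions.

# Ridge regression over a lambda grid with glmnet
library(glmnet)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)

grid <- 10^seq(4, -2, length = 100)                  # lambda grid, from large to small
ridge_fit <- glmnet(x, y, alpha = 0, lambda = grid)

dim(coef(ridge_fit))    # (p + 1) coefficients, one column per lambda value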

Shrinkage Methods: Ridge Regression

Ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.

The second term, λ Σ_{j=1}^p β_j² (the shrinkage penalty), is small when β_1, ..., β_p are close to zero, so it has the effect of shrinking the estimates of β_j towards zero.


Shrinkage Methods: Ridge Regression


λ controls the relative impact of these two terms on the regression coefficient estimates:

• when λ = 0, the penalty term has no effect, and ridge regression will produce the least squares estimates
• when λ → ∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero.


Shrinkage Methods: Ridge Regression


Ridge regression will produce a different set of coefficient estimates, β̂_λ^R, for each value of λ.

We use cross-validation to select a good value for λ.

β_0 is not shrunken. If the columns of the data matrix have been centered to have mean zero before ridge regression is performed, then the estimated intercept takes the form β̂_0 = ȳ = Σ_{i=1}^n y_i / n.


Shrinkage Methods: Ridge Regression


‖β‖₂ = sqrt( Σ_{j=1}^p β_j² ) is the ℓ2 norm; it measures the distance of β from 0.

‖β̂_λ^R‖₂ / ‖β̂‖₂ measures the amount that the ridge regression coefficient estimates have been shrunken towards zero; a small value indicates that they have been shrunken very close to zero.

Standardized ridge regression coefficients. ISLR Figure 6.4

Shrinkage Methods: Ridge Regression

Why is it best to apply ridge regression after standardizing the predictors?

Because the ridge regression coefficient estimates are not scale equivariant: x_j β̂_{j,λ}^R depends not only on λ but also on the scaling of the predictors. Standardizing the predictors puts them all on the same scale:

x̃_ij = x_ij / sqrt( (1/n) Σ_{i=1}^n ( x_ij − x̄_j )² )
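As a small side illustration (an assumption of this sketch, not stated on the slide): in R, scale() standardizes each column, and glmnet() already standardizes the predictors internally by default (standardize = TRUE). Note that scale() divides by the usual (n − 1)-denominator standard deviation, while the formula above uses a 1/n denominator; the two differ only by a constant factor.

# Standardizing a predictor matrix before shrinkage methods
x <- matrix(rnorm(100 * 5), 100, 5)

x_std <- scale(x)             # subtract column means, divide by column SDs
round(colMeans(x_std), 10)    # approximately 0
apply(x_std, 2, sd)           # exactly 1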

Shrinkage Methods: Ridge Regression


Why does ridge regression improve over least squares?
• Because of the bias-variance trade-off.

• As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.
OR
• As λ increases, the shrinkage of the ridge coefficient estimates leads to a substantial reduction in the variance of the predictions, at the expense of a (slight) increase in bias.

Shrinkage Methods: Ridge Regression

Squared bias (black), variance (green), and test mean squared error (purple); minimum possible MSE (horizontal dashed lines); ridge regression models for which the MSE is smallest (purple crosses). ISLR Figure 6.4


Shrinkage Methods: Ridge Regression

Because of its high variance, the MSE associated with the least squares fit, when λ = 0, is almost as high as that of the null model for which all coefficient estimates are zero, when λ = ∞.

As ‖β̂_λ^R‖₂ / ‖β̂‖₂ increases, the fits become more flexible, so the bias decreases and the variance increases.

Shrinkage Methods: Ridge Regression

― Ridge regression will include all p predictors in the final model (subset selection methods do not), unless λ = ∞.

― Model interpretability becomes a challenge when p is large (accuracy is not the problem).

+ The lasso overcomes this disadvantage.


Shrinkage Methods: Lasso


The lasso coefficient estimates β̂^L_λ are the values that minimize

Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )² + λ Σ_{j=1}^p |β_j| = RSS + λ Σ_{j=1}^p |β_j|

where λ ≥ 0 is a tuning parameter to be determined separately.
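A minimal R sketch (not from the slides): the lasso with glmnet(), where alpha = 1 selects the ℓ1 penalty; the simulated data and the chosen λ are illustrative assumptions.

# The lasso with glmnet
library(glmnet)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)

lasso_fit <- glmnet(x, y, alpha = 1)   # glmnet picks its own lambda sequence
coef(lasso_fit, s = 0.5)               # sparse coefficient vector at lambda = 0.5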

Shrinkage Methods: Lasso


ℓ1 penalty (lasso):

Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )² + λ Σ_{j=1}^p |β_j| = RSS + λ Σ_{j=1}^p |β_j|

ℓ2 penalty (ridge regression):

Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )² + λ Σ_{j=1}^p β_j² = RSS + λ Σ_{j=1}^p β_j²


Shrinkage Methods: Lasso

The ℓ1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large.

+ The lasso yields sparse models, that is, models that involve only a subset of the variables.

Shrinkage Methods: Lasso


Another formulation for the lasso and ridge regression:

Lasso: minimize over β   Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )²   subject to   Σ_{j=1}^p |β_j| ≤ s

Ridge: minimize over β   Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )²   subject to   Σ_{j=1}^p β_j² ≤ s


Shrinkage Methods: Lasso

Contours of the error RSS (red) and constraint regions (solid areas) for the lasso, |β_1| + |β_2| ≤ s (left), and ridge regression, β_1² + β_2² ≤ s (right). ISLR Figure 6.7

Best subset selection can be written in the same form:
minimize over β   Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )²   subject to   Σ_{j=1}^p I( β_j ≠ 0 ) ≤ s

Shrinkage Methods: Lasso

Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?

The lasso constraint region has sharp corners (a diamond when p = 2, a polyhedron or polytope when p > 2); when the RSS contours touch the constraint region at a corner, one or more coefficients are exactly zero, so the lasso performs feature selection.


Comparing the Lasso and the Ridge Regression


Lasso performs better in a setting where a relatively
small number of predictors have substantial
coefficients, and the remaining predictors have
coefficients that are very small or that equal zero.

Ridge regression will perform better when the


response is a function of many predictors, all with
coefficients of roughly equal size.

Comparing the Lasso and the Ridge Regression

The lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions and models that are easier to interpret.


Comparing the Lasso and the Ridge Regression


The fact that some lasso coefficients are shrunken entirely to zero explains why the lasso performs feature selection.

The type of shrinkage performed by the lasso is known as soft-thresholding.

The ridge regression and lasso coefficient estimates for a simple setting with n = p and X a diagonal matrix with 1's on the diagonal. ISLR Figure 6.10

Comparing the Lasso and the Ridge Regression

Ridge regression shrinks every dimension of the data by the same proportion, whereas the lasso shrinks all coefficients toward zero by a similar amount, and sufficiently small coefficients are shrunken all the way to zero.


Bayesian Interpretation for Ridge Regression and Lasso


A Bayesian viewpoint for regression assumes that:
• The coefficient vector β has some prior distribution p(β), where β = (β_0, β_1, ..., β_p)ᵀ
• The likelihood of the data can be written as f(Y | X, β), where X = (X_1, ..., X_p)
• The posterior distribution takes the form
  p(β | X, Y) ∝ f(Y | X, β) p(β | X) = f(Y | X, β) p(β), where X is fixed
The posterior mode gives the most likely value for β, given the data.

Bayesian Interpretation for Ridge Regression and Lasso

• For Y = β_0 + X_1 β_1 + ⋯ + X_p β_p + ε, assume that p(β) = Π_{j=1}^p g(β_j) for some density function g, and that the errors are independent and drawn from a normal distribution.

→ If g is a Gaussian distribution with mean zero and standard deviation a function of λ, the posterior mode for β is given by the ridge regression solution.

→ If g is a double-exponential (Laplace) distribution with mean zero and scale parameter a function of λ, the posterior mode for β is the lasso solution. ISLR Figure 6.11


Selecting the Tuning Parameter

Choose a grid of λ values and compute the cross-validation error (LOOCV or ten-fold CV) for each value of λ; then select the value for which the cross-validation error is smallest and re-fit the model using all of the observations.

Predictors that are related to the response are often referred to as signal variables, and unrelated predictors as noise variables.
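A minimal R sketch (not from the slides): choosing λ by ten-fold cross-validation with cv.glmnet(); lambda.min minimizes the CV error and lambda.1se applies the one-standard-error rule. The simulated data are an illustrative assumption.

# Choosing lambda for the lasso by cross-validation
library(glmnet)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)

cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
cv_fit$lambda.min               # lambda with the smallest CV error
cv_fit$lambda.1se               # largest lambda within one SE of the minimum
coef(cv_fit, s = "lambda.1se")  # coefficients at the one-SE choice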


So far…

Methods which have controlled variance by using a subset of the original variables, or by shrinking their coefficients towards zero (in both cases using the original variables X_1, ..., X_p).


Dimension Reduction Methods


Approaches that transform the predictors and then fit a least squares model using the transformed variables.

Z_1, Z_2, ..., Z_M represent M < p linear combinations of our original p predictors:

Z_m = Σ_{j=1}^p φ_jm X_j for some constants φ_1m, ..., φ_pm, with Σ_{j=1}^p φ_jm² = 1.

The linear regression model is then fit by least squares:

y_i = θ_0 + Σ_{m=1}^M θ_m z_im + ε_i,  i = 1, ..., n

Dimension Reduction Methods


The dimension of the problem has been reduced from p + 1 to M + 1:

Σ_{m=1}^M θ_m z_im = Σ_{m=1}^M θ_m Σ_{j=1}^p φ_jm x_ij = Σ_{j=1}^p ( Σ_{m=1}^M θ_m φ_jm ) x_ij = Σ_{j=1}^p β_j x_ij

where β_j = Σ_{m=1}^M θ_m φ_jm


Dimension Reduction Methods

+ Dimension reduction serves to constrain the estimated β_j coefficients.

― The constraint may bias the coefficient estimates.

+ When M ≪ p, the variance of the fitted coefficients can be significantly reduced.

Dimension Reduction Methods


Dimension reduction methods work in two steps:

1. The transformed predictors Z_1, Z_2, ..., Z_M are obtained
2. The model is fit using these M transformed predictors

How are the φ_jm's obtained?


Principal Component Regression (PCR)


Steps:
1. Standardize each predictor prior to generating the PCs
2. Construct the first M principal components
3. Use the components as the predictors in a linear regression model
4. Fit the model coefficients by least squares (see the R sketch below)
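A minimal R sketch (not from the slides) of PCR with pcr() from the pls package; scale = TRUE standardizes the predictors and validation = "CV" cross-validates over the number of components. The simulated data are an illustrative assumption.

# Principal component regression with the pls package
library(pls)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)
dat <- data.frame(y = y, x)

pcr_fit <- pcr(y ~ ., data = dat, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")   # CV error vs number of components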


Principal Component Regression (PCR)

Why is PCR more closely related to ridge regression


than to the lasso?

Why can ridge regression be considered as a


continuous version of PCR?

Why is PCR considered an unsupervised method?


Partial Least Squares (PLS)

PLS identifies a new set of features Z_1, ..., Z_M that not only approximate the old features well but are also related to the response.


Partial Least Squares (PLS)


Steps:
1. Standardize each predictor and the response
2. Compute the first direction Z_1 by setting each φ_j1 equal to the coefficient from the simple linear regression of Y onto X_j
3. In computing Z_1 = Σ_{j=1}^p φ_j1 X_j, PLS places the highest weight on the variables that are most strongly related to the response (see the R sketch below)
Propose an algorithm to compute the principal components!
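A minimal R sketch (not from the slides) of PLS with plsr() from the pls package; as before, the simulated data and the cross-validation settings are illustrative assumptions.

# Partial least squares with the pls package
library(pls)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(n)
dat <- data.frame(y = y, x)

pls_fit <- plsr(y ~ ., data = dat, scale = TRUE, validation = "CV")
summary(pls_fit)    # % variance explained and cross-validated prediction error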


Partial Least Squares (PLS)


The number M of partial least squares directions is a tuning parameter that is typically chosen by cross-validation.

In practice, PLS often performs no better than ridge regression or PCR. While the supervised dimension reduction of PLS can reduce bias, it also has the potential to increase variance, so that the overall benefit of PLS relative to PCR is a wash.

Considerations in High Dimensions

Data sets containing more features than observations (p > n) are often referred to as high-dimensional.

What goes wrong in high dimensions?


Considerations in High Dimensions


When p ≥ n, regardless of the values of the observations, the least squares regression line will fit the data exactly, overfitting the data. Such a model will perform extremely poorly on an independent test set.

So, when p > n or p ≈ n, a simple least squares regression line is too flexible and hence overfits the data.

Considerations in High Dimensions


R² increases to 1 and the training set MSE decreases to 0 as more features are included, although the test set MSE increases.

Simulated example with n = 20. Figure 6.10


Considerations in High Dimensions

C_p, AIC and BIC approaches are not appropriate in the high-dimensional setting because estimating σ̂² is problematic (σ̂² = 0); similarly, the adjusted R² can easily take a value of 1.


Regression in High Dimensions


Regularization or shrinkage plays a key role in high-dimensional problems.
Appropriate tuning parameter selection is crucial for good predictive performance.

– The test error tends to increase as the dimensionality of the problem increases, unless the additional features are truly associated with the response.


Regression in High Dimensions


Big data issue:
The new technologies that allow the collection of measurements for thousands or millions of features are a double-edged sword: they can lead to improved predictive models if these features are in fact relevant to the problem, but will lead to worse results if the features are not relevant. Even when they are relevant, the variance incurred in fitting their coefficients may outweigh the reduction in bias that they bring.

Interpreting Results in High Dimensions

In the high-dimensional setting, the multicollinearity problem is extreme, so we can never identify the best coefficients for use in the regression.
We must be careful not to overstate the results obtained, since what we have identified is simply one of many possible models, and it must be validated on independent data sets.


Moment of Reflection

Do I want to look back and


know I could have done better?


Homework

Conceptual Exercises: ISL Chapter 6
Applied Exercises (optional): ISL Chapter 6

Use R Markdown
Deadline: -
Submit the assignment on Moodle
Max. 3 members per group
