
Advanced Econometrics

Professor: Sukjin Han

November 13, 2018

Lecture 14 - Roadmap

Tree-based methods, continued
- Bagging
- Random forests
- Boosting
Ch. 8 of ISL

Aggregating Trees

The goal is to construct more powerful prediction models
- Bagging
- Random forests
- Boosting

Bagging

In Lecture 5, we learned how to use the bootstrap to, e.g., calculate
standard errors
Here we use it in a totally different context
The decision trees discussed in Lecture 13 suffer from high variance
- e.g., if we split the training data into two halves and fit a decision tree to
  each half, we get quite different results
Bootstrap aggregation (bagging)
- A general-purpose procedure for reducing the variance of a SL method
- Particularly useful in the context of decision trees
Intuition:
- Given a set of n independent obs's Z_1, ..., Z_n, each with variance σ²,
  the variance of the mean Z̄ of the obs's is given by σ²/n

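A quick numerical check of the σ²/n intuition (a minimal sketch; the values of n, σ², and the number of replications are arbitrary choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 50, 4.0, 100_000   # illustrative values (not from the lecture)

# Draw `reps` independent samples of n obs's, each obs with variance sigma2,
# and compare the variance of the sample mean across samples with sigma2 / n.
Z = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
print(Z.mean(axis=1).var())   # close to sigma2 / n = 0.08
```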
Bagging
So, in order to reduce the variance and thus increase the prediction
accuracy of a SL method,
- take many training sets from the population
- build separate prediction models, and average the resulting predictions

That is, calculate f̂^1(x), f̂^2(x), ..., f̂^B(x) using B training sets, and
average them to obtain a single low-variance SL model

    f̂_avg(x) = (1/B) Σ_{b=1}^{B} f̂^b(x)

- Not practical, since we generally do not have access to multiple
  training sets
- Instead, we can bootstrap!

Bagging:

    f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂^{*b}(x)

where f̂^{*b} is the model fit to the b-th bootstrapped training set
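A minimal sketch of f̂_bag for regression trees, using scikit-learn's DecisionTreeRegressor and a hand-rolled bootstrap loop; the simulated data and B = 100 are illustrative assumptions, not from the lecture:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Illustrative training data (assumed); any regression sample (X, y) would do
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)

B = 100
trees = []
for b in range(B):
    idx = rng.integers(0, n, size=n)      # bootstrap sample (drawn with replacement)
    tree = DecisionTreeRegressor()        # grown deep, not pruned: low bias, high variance
    trees.append(tree.fit(X[idx], y[idx]))

def f_bag(x_new):
    """Bagged prediction: average of the B bootstrap-tree predictions."""
    preds = np.column_stack([t.predict(x_new) for t in trees])
    return preds.mean(axis=1)

print(f_bag(np.array([[0.5]])))           # bagged prediction at x = 0.5
```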
Bagging
Bagging is particularly useful for decision trees
We construct B regression trees using B bootstrapped training sets,
and average the resulting predictions
- Trees are not pruned, so each of them has high variance but low bias
- Averaging these B trees reduces the variance

So far, bagging in regression

How to extend bagging to a classification problem?
- For a given test obs, record the class predicted by each of the B trees,
  and take the majority vote, i.e., the overall prediction is the most
  commonly occurring class among the B predictions
- Figure 8.8
  - The choice of B is not critical, and a large value does not cause
    overfitting

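A sketch of the classification version with an explicit majority vote; again the simulated data and B = 100 are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Illustrative binary classification data (assumed)
n = 300
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

B = 100
trees = []
for b in range(B):
    idx = rng.integers(0, n, size=n)      # bootstrap sample
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def predict_majority(x_new):
    """Overall prediction: the most common class among the B per-tree predictions."""
    votes = np.column_stack([t.predict(x_new) for t in trees]).astype(int)
    counts = np.apply_along_axis(np.bincount, 1, votes, minlength=2)
    return counts.argmax(axis=1)          # majority vote for each test obs

print(predict_majority(np.array([[0.2, -0.1]])))
```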
Out-of-Bag Error Estimation

It turns out that there is a very easy way to estimate the test error of a
bagged model, without using CV
One can show that, on average, each bootstrap sample (and thus each
bagged tree) makes use of around 2/3 of the obs's (why?)
- The remaining 1/3 are referred to as the out-of-bag (OOB) obs's
Then we can predict the response for the i-th obs using each of the
trees in which that obs was OOB
- This will yield around B/3 predictions for the i-th obs
- Average them (for regression) or take a majority vote (for
  classification) to obtain a single prediction for i
- Then calculate the overall OOB MSE (for regression) or classification
  error (for classification)

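A sketch of OOB error estimation that tracks, for each bootstrapped tree, which obs's were left out; the simulated data and B are assumed for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Same illustrative regression data as before (assumed)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)

B = 200
oob_sum = np.zeros(n)     # running sum of OOB predictions for each obs
oob_count = np.zeros(n)   # number of trees for which each obs was OOB (roughly B/3)

for b in range(B):
    idx = rng.integers(0, n, size=n)            # bootstrap sample uses ~2/3 of the obs's
    oob = np.setdiff1d(np.arange(n), idx)       # the remaining ~1/3 are the OOB obs's
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    oob_sum[oob] += tree.predict(X[oob])        # predict obs i only with trees where i was OOB
    oob_count[oob] += 1

oob_pred = oob_sum / oob_count                  # average the ~B/3 predictions per obs
print("OOB MSE:", np.mean((y - oob_pred) ** 2))
print("average OOB appearances per obs:", oob_count.mean())
```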
Bagging

One drawback of bagging is lack of interpretability
We can still calculate a measure of variable importance
- Record the decrease in the RSS (or the Gini index) due to splits over a
  given predictor, averaged over all B trees
- Figure 8.9

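A sketch of this variable-importance measure in scikit-learn, which exposes it as feature_importances_ (the impurity decrease from splits on each predictor, averaged over the trees and normalized); setting max_features=None makes every predictor a split candidate, i.e., plain bagging. The simulated data below are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Illustrative data (assumed): only the first two of five predictors matter
n, p = 500, 5
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)

# max_features=None: all p predictors are split candidates at every split,
# so this is plain bagging of (unpruned) regression trees
bag = RandomForestRegressor(n_estimators=200, max_features=None,
                            random_state=0).fit(X, y)

# Impurity decrease (RSS reduction, for regression) from splits on each
# predictor, averaged over the trees and normalized to sum to one
print(bag.feature_importances_)
```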
Random Forests
Random forests provide an improvement over bagged trees by way of
decorrelating the trees
Similar to the bagging procedure, but each time a split in a tree is
considered,
- a random sample of m predictors is chosen as split candidates from the
  full set of p predictors (typically m ≈ √p)
Why "decorrelating"?
- If there is one very strong predictor in the dataset, most of the trees
  will use this predictor in the top split, and all the bagged trees will look
  similar
- Hence the predictions from the bagged trees will be highly correlated
- Averaging highly correlated predictions does not lead to a large reduction in variance
Random forests are particularly helpful when we have a large number
of correlated predictors
- e.g., high-dimensional biological datasets, such as gene expression data (Figure 8.10)
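In scikit-learn the random-forest tweak amounts to the max_features argument; a sketch, reusing the illustrative simulation from the variable-importance example above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)

# Each split considers a fresh random sample of about sqrt(p) predictors,
# which decorrelates the trees relative to bagging (max_features=None)
rf = RandomForestRegressor(n_estimators=200, max_features="sqrt",
                           random_state=0).fit(X, y)
print(rf.predict(X[:3]))
```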
Boosting
Another approach for improving the performance of decision trees
Like bagging, boosting is a general approach that can be applied to
many SL methods
Boosting is similar to bagging, but the trees are grown sequentially
- Each tree is grown using info from previously grown trees
- No bootstrap sampling is involved
- Instead, each tree is fit on a modified version of the original dataset

The goal is to combine a large number of decision trees, f̂^1, ..., f̂^B,
- by "learning slowly"
- In general, SL methods that learn slowly tend to perform well
Given the current model, we fit a tree using the residuals from that
model, rather than the outcome Y, as the response
- Then we add this new tree into the fitted function in order to update
  the residuals
Boosting
Algorithm:
1. Set f̂(x) = 0 and r_i = y_i for all i in the training set
2. For b = 1, ..., B, repeat:
   a. Fit a tree f̂^b with d splits (d + 1 terminal nodes) to the training data (X, r)
   b. Update f̂ by adding in a shrunken version of the new tree:

          f̂(x) ← f̂(x) + λ f̂^b(x)

   c. Update the residuals,

          r_i ← r_i − λ f̂^b(x_i)

3. Output the boosted model,

          f̂(x) = Σ_{b=1}^{B} λ f̂^b(x)

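A sketch that follows the three steps of the algorithm literally, using regression stumps; the simulated data and the values of B, λ, and d are illustrative choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)

B, lam, d = 1000, 0.01, 1            # number of trees, shrinkage, splits per tree
r = y.copy()                         # step 1: f_hat = 0, residuals r_i = y_i
trees = []

for b in range(B):                   # step 2
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1)   # d splits => d + 1 terminal nodes
    tree.fit(X, r)                   # 2a: fit to (X, r), not to the outcome y
    trees.append(tree)
    r -= lam * tree.predict(X)       # 2b/2c: add the shrunken tree, update the residuals

def f_boost(x_new):
    """Step 3: the boosted model, sum over b of lambda * f^b(x)."""
    return lam * np.sum([t.predict(x_new) for t in trees], axis=0)

print(f_boost(np.array([[0.5]])))
```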
Boosting

Three tuning parameters:
- The number of trees B: unlike the previous methods, boosting can
  overfit if B is too large
- The shrinkage parameter λ ≥ 0: typically 0.01 or 0.001 (small λ
  requires large B)
- The number d of splits in each tree: often d = 1 works well

Figure 8.11

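These three parameters map onto n_estimators, learning_rate, and max_depth in scikit-learn's GradientBoostingRegressor, which with squared-error loss implements essentially the residual-fitting scheme above; a minimal sketch with assumed data and parameter values:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)

boost = GradientBoostingRegressor(
    n_estimators=1000,    # B: too large a value can overfit
    learning_rate=0.01,   # lambda: small shrinkage, so a large B is needed
    max_depth=1,          # d = 1: each tree is a stump
).fit(X, y)

print(boost.predict([[0.5]]))
```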
