Regression Trees
Bob Stine
Dept of Statistics, Wharton School
University of Pennsylvania
Trees
Familiar metaphor
Biology
Decision tree (medical diagnosis)
Org chart
Properties
Process
Models as averages
Old Idea
Binning data
Trade-off: bias vs. variance (illustrated in the sketch below)
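A minimal sketch of the binning idea in Python (the data here are simulated purely for illustration): the prediction is the mean response within each bin, so a few wide bins give high bias while many narrow bins give high variance.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)

def binned_fit(x, y, n_bins):
    """Predict with the mean of y inside equal-width bins of x."""
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    means = np.array([y[idx == b].mean() if np.any(idx == b) else y.mean()
                      for b in range(n_bins)])
    def predict(x_new):
        j = np.clip(np.digitize(x_new, edges) - 1, 0, n_bins - 1)
        return means[j]
    return predict

coarse = binned_fit(x, y, 4)    # smooth but biased
fine = binned_fit(x, y, 50)     # flexible but noisy (high variance)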
Classical Example
Classification tree
G² = −2 log-likelihood = 2 × entropy
Stop?
Example: ANES 2008
Regression tree with dummy response
Use the Select Rows command in the tree nodes
Recursive Partitioning (CART)
Comments
Growing Tree
Search for best splitting variable
Numerical variable
Categorical variable
Partition cases into two mutually exclusive groups
Lots of groups if the number of labels k is large (2^(k−1) − 1 possible splits)
Greedy search (see the sketch below)
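A minimal sketch of the greedy search for one numeric predictor, assuming a squared-error criterion: try every cut point and keep the one that most reduces the within-group sum of squares. CART repeats this over all predictors (for a categorical variable, over the 2^(k−1) − 1 subset splits) and then recurses on each child node.

import numpy as np

def best_split(x, y):
    """Best binary split of numeric x under a squared-error criterion."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_cut, best_sse = None, np.inf
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                       # no cut point between tied values
        left, right = ys[:i], ys[i:]
        sse = (((left - left.mean()) ** 2).sum()
               + ((right - right.mean()) ** 2).sum())
        if sse < best_sse:
            best_cut, best_sse = (xs[i - 1] + xs[i]) / 2, sse
    return best_cut, best_sse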
Splitting Criteria
Numerous choices
Log-likelihood for classification tree
LogWorth: −log10 of the split's p-value (see the sketch below)
Cross-validation
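A hedged sketch of the log-likelihood criterion and a JMP-style LogWorth, using the G² statistic defined earlier; the degrees of freedom and any multiplicity adjustments vary by implementation, so df=1 below is a simplification.

import numpy as np
from scipy.stats import chi2

def g2(labels):
    """G^2 = -2 * multinomial log-likelihood of the observed proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -2.0 * (counts * np.log(p)).sum()

def logworth(parent, left, right, df=1):
    """-log10 p-value of the split's G^2 reduction."""
    drop = g2(parent) - (g2(left) + g2(right))   # likelihood-ratio statistic
    return -np.log10(chi2.sf(drop, df))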
Common Limitations
Binary split on an observed variable
Some tree methods allow splits on linear combinations
Slower to fit since there are many more possible splits
Discrete fit
Greedy search
Over-fitting
Example: ANES
Classify those who did not vote
Use 3-level validation variable
Demographics
Missing indicators
Sample weights?
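A sketch of this workflow in scikit-learn, assuming a hypothetical DataFrame anes with a 0/1 nonvoter response plus demographic predictors and missing-value indicators; the two-stage split below mimics a 3-level train/validate/test column.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = anes.drop(columns="nonvoter")        # demographics + missing indicators
y = anes["nonvoter"]

# Mimic a 3-level validation column: 60% train, 20% validate, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=1)

tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
print("validation accuracy:", tree.score(X_val, y_val))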
Nice interface
Can force splits at any location
Validation properties
Would be a nice feature for NN as well!
Mosaic Plot
Summary of tree
[Mosaic plot: Romney vs. Obama]
Estimated Tree
Very parallel structure
Classify Missing
Majority vote
Results
60/40 split among observed voters
Things to Improve?
Highly variable
Example from training sample
Averaging Trees
Boosting: simple models
Bagging (see the sketch below)
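A minimal sketch of bagging, assuming X_train and y_train are numpy arrays: fit one tree per bootstrap resample and average the predicted probabilities. A random forest adds one more ingredient, a random subset of features at each split.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_probs(X_train, y_train, X_new, n_trees=100, seed=0):
    """Average predicted probabilities over trees fit to bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    probs = np.zeros(len(X_new))
    for _ in range(n_trees):
        idx = rng.integers(0, n, n)                 # bootstrap resample
        t = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        probs += t.predict_proba(X_new)[:, 1]       # P(class 1)
    return probs / n_trees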
Random Forest
Highly variable
Sharp boundaries, huge variation in fit at edges of bins
Random forest
Random Forest
Summary of forest
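A sketch of fitting a forest with scikit-learn, reusing the hypothetical train/validate split from the tree example above.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=500, random_state=1)
forest.fit(X_train, y_train)
print("validation accuracy:", forest.score(X_val, y_val))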
Forest Results
Confusion matrix
Goodness of fit summary
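A sketch of both summaries on the held-back test set, continuing the forest example above; the misclassification rate stands in for the goodness-of-fit summary.

from sklearn.metrics import confusion_matrix

pred = forest.predict(X_test)
print(confusion_matrix(y_test, pred))                      # rows = truth
print("misclassification rate:", (pred != y_test).mean())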
Calibration Plot
Smooth ROC
[Calibration plot: slope b ≈ 1.1]
Boosting
Original method called AdaBoost
Weaknesses
Shrinkage: learning rate of 0.1 or smaller
Boosted Trees
Simple models
Refit to the training sample, but put more weight on cases not fit well so far (see the sketch below)
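A compact sketch of the AdaBoost reweighting scheme, with stumps as the simple models and y coded −1/+1: after each round, upweight the cases the current stump got wrong so the next fit concentrates on them.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """AdaBoost with stumps; y must be coded -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start with equal weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()               # weighted training error
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)         # upweight the misfit cases
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)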
Boosting Results
Confusion matrix
Variation in choice of test sample?
[Confusion matrices: boosted tree vs. random forest]
Calibration Plots
Results for test sample with boosting
Similar benefits obtained by forest
Boosting is a bit more predictive
[Calibration plot: slope b ≈ 1.1]
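A sketch of how such a calibration plot can be built, assuming p_hat holds predicted probabilities for the test sample: bin the predictions and compare each bin's average prediction with the observed rate; a slope near 1 indicates good calibration.

import numpy as np
import matplotlib.pyplot as plt

def calibration_plot(p_hat, y, n_bins=10):
    """Observed rate vs. average predicted probability in each bin."""
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(p_hat, edges) - 1, 0, n_bins - 1)
    keep = [b for b in range(n_bins) if np.any(idx == b)]
    pred = [p_hat[idx == b].mean() for b in keep]
    obs = [y[idx == b].mean() for b in keep]
    plt.plot(pred, obs, "o-")
    plt.plot([0, 1], [0, 1], "--")       # reference: perfect calibration
    plt.xlabel("predicted probability")
    plt.ylabel("observed rate")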
Comparison of Predictions
[Scatterplots of boosted tree vs. random forest predictions, training and test samples; r = 0.99 in both]
Take-Aways
Model averaging
Boosting, bagging smooth predictions
Borrow strength
Over-fitting
Control with cross-validation
Analogous to the use of CV in tuning a neural net (see the sketch below)
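A sketch of CV-based tuning with scikit-learn's gradient boosting, as one way to control over-fitting; the parameter grid here is illustrative, not a recommendation.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"learning_rate": [0.01, 0.05, 0.1],
                "n_estimators": [100, 300, 1000]},
    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)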
Next Time
Thursday
Friday