Boosting
Two-class Classification
Observations are classified into two or more classes, coded by a response variable Y taking values 1, 2, ..., K.
We have a feature vector X = (X_1, X_2, ..., X_p), and we hope to build a classification rule C(X) to assign a class label to an individual with feature X.
We have a sample of pairs (y_i, x_i), i = 1, ..., N. Note that each x_i is a vector x_i = (x_{i1}, x_{i2}, ..., x_{ip}).
Example: Y indicates whether an email is spam or not. X represents the relative frequency of a subset of specially chosen words in the email message.
The technology described here estimates C(X) directly, or via the probability function P(C = k | X).
Classification Trees
Represented by a series of binary splits.
Each internal node represents a value query on one of the variables, e.g. "Is X_3 > 0.4?" If the answer is Yes, go right; else go left.
The terminal nodes are the decision nodes. Typically each terminal node is dominated by one of the classes.
The tree is grown using training data, by recursive splitting.
The tree is often pruned to an optimal size, evaluated by cross-validation.
New observations are classified by passing their X down to a terminal node of the tree, and then using majority vote.
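As a concrete illustration, here is a minimal sketch in R of growing, pruning, and using such a tree with the rpart package; the simulated data and variable names are purely illustrative:

library(rpart)

set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- factor(ifelse(dat$x2 > 0.39 & dat$x3 > -1.575, 1, 0))

# Grow a large tree by recursive splitting (cp = 0 disables early stopping) ...
fit <- rpart(y ~ ., data = dat, method = "class",
             control = rpart.control(cp = 0, minbucket = 5))

# ... then prune to the size minimizing cross-validated error.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# A new observation is passed down to a terminal node and classified
# by the majority vote there.
predict(pruned, data.frame(x1 = 0, x2 = 1, x3 = 0), type = "class")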
Classification Tree
[Figure: a classification tree grown on simulated data. The root (10/30 misclassified) splits on x.2 < 0.39, with a further split on x.3 < -1.575; nodes show misclassification counts (3/21, 2/9, 2/5, 0/16).]
Properties of Trees
Can handle huge datasets.
Can handle mixed predictors: quantitative and qualitative.
Easily ignore redundant variables.
Handle missing data elegantly.
Small trees are easy to interpret.
Large trees are hard to interpret.
Often prediction performance is poor.
Predictors
48 quantitative predictors: the percentage of words in the email that match a given word. Examples include business, address, internet, free, and george. The idea was that these could be customized for individual users.
6 quantitative predictors: the percentage of characters in the email that match a given character. The characters are ch;, ch(, ch[, ch!, ch$, and ch#.
The average length of uninterrupted sequences of capital letters: CAPAVE.
The length of the longest uninterrupted sequence of capital letters: CAPMAX.
The sum of the length of uninterrupted sequences of capital letters: CAPTOT.
Details
A test set of size 1536 was randomly chosen, leaving 3065 observations in the training set.
A full tree was grown on the training set, with splitting continuing until a minimum bucket size of 5 was reached.
This bushy tree was pruned back using cost-complexity pruning, and the tree size was chosen by 10-fold cross-validation.
We then compute the test error and ROC curve on the test data.
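A sketch of this recipe in R, assuming the spam data as shipped in the kernlab package (response column type with levels spam/nonspam); the ROC curve is traced by sweeping a threshold over the predicted probability of spam:

library(rpart)
library(kernlab)
data(spam)   # 4601 emails; response column `type`

set.seed(1)
test_idx <- sample(nrow(spam), 1536)
train <- spam[-test_idx, ]
test  <- spam[test_idx, ]

# Grow a full tree (minimum bucket size 5), then prune by cost-complexity,
# choosing the size by 10-fold cross-validation.
fit <- rpart(type ~ ., data = train, method = "class",
             control = rpart.control(minbucket = 5, cp = 0, xval = 10))
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Test error
pred <- predict(pruned, test, type = "class")
mean(pred != test$type)

# ROC curve from the predicted class probabilities
prob <- predict(pruned, test, type = "prob")[, "spam"]
cuts <- sort(unique(c(0, prob, 1)))
sens <- sapply(cuts, function(t) mean(prob[test$type == "spam"] >= t))
spec <- sapply(cuts, function(t) mean(prob[test$type == "nonspam"] < t))
plot(1 - spec, sens, type = "l", xlab = "1 - Specificity", ylab = "Sensitivity")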
Average percentage of words or characters in an email message equal to the indicated word or character, for the spam and email classes:

        george   you  your    hp  free   hpl   ch!   our    re   edu  remove
spam      0.00  2.26  1.38  0.02  0.52  0.01  0.51  0.51  0.13  0.01    0.28
email     1.27  1.27  0.44  0.90  0.07  0.43  0.11  0.18  0.42  0.29    0.01
[Figure: the pruned tree for the spam data, with test-set misclassification counts at each node (root: 600/1536). Splits include ch$ < 0.0555, remove < 0.06, hp < 0.405, ch! < 0.191, george < 0.15, CAPAVE < 2.907, free < 0.065, CAPMAX < 10.5, business < 0.145, receive < 0.125, edu < 0.045, and our < 1.2; terminal nodes are labeled spam or email.]
SPAM Data
[Figure: ROC curve (Sensitivity vs Specificity) for the classification tree on the SPAM test data; TREE error: 8.7%.]
TREE vs SVM
[Figure: ROC curves (Sensitivity vs Specificity) comparing the classification tree and an SVM on the SPAM test data.]
Here X = (X_1, X_2).
[Figure: the two classes, labeled 0 and 1, scattered in the (X_1, X_2) plane; the classification rule C(X) assigns one of the two labels to each region.]
[Figure: a training sample of size 200 from the same problem, plotted in the (X_1, X_2) plane.]
Deterministic problem; noise comes from the sampling distribution of X.
Use a training sample of size 200.
Here the Bayes error is 0%.
Classification Tree
[Figure: the classification tree grown on the 200-point training sample. The root (94/200 misclassified) splits on x.2 < -1.06711, with further splits on x.2 < 1.14988, x.1 < 1.13632, x.1 < -0.900735, x.1 < -1.1668, x.2 < -0.823968, and x.1 < -1.07831; terminal nodes show misclassification counts.]
[Figure: the partition of the (X_1, X_2) plane induced by the tree, overlaid on the training data.]
The tree yields a classification rule C(X), with error rates around 30%.
Model Averaging
Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.
Random Forests (Breiman, 1999): Fancier version of bagging.
In general Boosting > Random Forests > Bagging > Single Tree.
Bagging
Bagging, or bootstrap aggregation, averages a given procedure over many samples to reduce its variance: a poor man's Bayes. See p. 246.
Suppose C(S, x) is a classifier, such as a tree, based on our training data S, producing a predicted class label at input point x.
To bag C, we draw bootstrap samples S^1, ..., S^B, each of size N, with replacement from the training data. Then

$$C_{\mathrm{bag}}(x) = \text{Majority Vote}\ \{C(S^b, x)\}_{b=1}^{B}.$$

Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However, any simple structure in C (e.g. a tree) is lost.
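A minimal hand-rolled version of this in R, using rpart trees; the simulated data and the choice B = 25 are illustrative:

library(rpart)

set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- factor(ifelse(dat$x2 > 0.39 & dat$x3 > -1.575, 1, 0))

B <- 25
trees <- lapply(seq_len(B), function(b) {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]  # bootstrap sample S^b of size N
  rpart(y ~ ., data = boot, method = "class")
})

# C_bag(x): majority vote over the B bootstrap trees
votes <- sapply(trees, function(tr) as.character(predict(tr, dat, type = "class")))
c_bag <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(c_bag != dat$y)   # training error of the bagged classifier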
[Figure: the original tree (root split x.2 < 0.39, then x.3 < -1.575) alongside five trees grown on bootstrap samples: x.2 < 0.36 with x.1 < -0.965; x.2 < 0.39 with x.3 < -1.575; x.4 < 0.395 with x.3 < -1.575; x.2 < 0.255 with x.3 < -1.385; and x.2 < 0.38 with x.3 < -1.61. The trees vary noticeably across bootstrap samples.]
[Figure: the training sample in the (X_1, X_2) plane, with the classification boundary obtained by bagging the trees.]
Random Forests
Refinement of bagged trees; quite popular.
At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = sqrt(p), where p is the number of features.
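In R this is provided by the randomForest package; a minimal sketch, again assuming the kernlab spam data:

library(randomForest)
library(kernlab)
data(spam)

set.seed(1)
# mtry is the number m of features sampled at each split;
# floor(sqrt(p)) is the usual default for classification.
rf <- randomForest(type ~ ., data = spam, ntree = 500,
                   mtry = floor(sqrt(ncol(spam) - 1)))
rf   # printing shows the out-of-bag (OOB) estimate of the error rate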
ROC curve for TREE, SVM and Random Forest on SPAM data
[Figure: ROC curves (Sensitivity vs Specificity) for the tree, the SVM, and the random forest on the SPAM test data.]
Boosting
Average many trees, each grown to re-weighted versions of the training data.
[Figure: schematic of boosting. The first classifier C_1(x) is grown on the training sample; each subsequent classifier C_2(x), C_3(x), ..., C_M(x) is grown on a re-weighted version of the data, and the results are combined.]
Boosting vs Bagging
2000 points from Nested Spheres in R^10.
[Figure: test error of Bagging and AdaBoost as a function of the number of terms.]
Boosting Stumps
A stump is a two-node tree, after a single split. Boosting stumps works remarkably well on the nested-spheres problem.
[Figure: test error versus number of boosting iterations for boosted stumps, compared with a single stump.]
[Figure: training error versus number of terms.]
Noisy Problems
Nested Gaussians in 10 dimensions.
Bayes error is 25%.
[Figure: test error versus number of terms, with the Bayes error (25%) marked.]
Boosting fits an additive model

$$f(x) = \sum_{m=1}^{M} \beta_m\, b(x; \gamma_m).$$
Additive Trees
Simple example: stagewise least squares. Fix the past M-1 functions, and update the Mth using a tree:

$$\min_{f_M \in \mathrm{Tree}(x)} \mathrm{E}\Big(Y - \sum_{m=1}^{M-1} f_m(x) - f_M(x)\Big)^2.$$

Equivalently, with the residual $R = Y - \sum_{m=1}^{M-1} f_m(x)$, solve

$$\min_{f_M \in \mathrm{Tree}(x)} \mathrm{E}\big(R - f_M(x)\big)^2,$$

i.e. fit a tree to the current residuals.
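A minimal sketch of this stagewise least-squares loop in R, fitting a small rpart tree to the current residuals at each step (data simulated; no shrinkage, and the individual trees are not stored, for clarity):

library(rpart)

set.seed(2)
n <- 500
df <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
y <- df$x1^2 + df$x2^2 + rnorm(n, sd = 0.1)

M <- 100
fit <- rep(mean(y), n)            # f_0: start from the mean
for (m in 1:M) {
  df$r <- y - fit                 # residual R = Y - sum of past trees
  tree_m <- rpart(r ~ x1 + x2, data = df,
                  control = rpart.control(maxdepth = 2, cp = 0))
  fit <- fit + predict(tree_m, df)   # add the new tree f_M
}
mean((y - fit)^2)                 # training MSE shrinks as M grows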
With $w_i^{(m)} = \exp(-y_i f_{m-1}(x_i))$, at the mth step we solve

$$\min_{\beta,\, G} \sum_{i=1}^{N} w_i^{(m)} \exp(-\beta\, y_i\, G(x_i)).$$
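A minimal sketch of the resulting AdaBoost.M1 loop with stumps in R; the simulated two-class data (labels in {-1, 1}) are illustrative, and cp = -1 simply forces rpart to make its one split:

library(rpart)

set.seed(3)
n <- 400
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- ifelse(df$x1^2 + df$x2^2 > 2, 1, -1)   # two classes, labels in {-1, 1}
df$yf <- factor(y)

M <- 50
w <- rep(1 / n, n)                # w_i^(1): equal weights
alpha <- numeric(M)
Fx <- numeric(n)                  # running additive fit

for (m in 1:M) {
  stump <- rpart(yf ~ x1 + x2, data = df, weights = w,
                 control = rpart.control(maxdepth = 1, cp = -1, minsplit = 2))
  G <- ifelse(predict(stump, df, type = "class") == "1", 1, -1)
  err <- sum(w * (G != y)) / sum(w)          # weighted misclassification error
  alpha[m] <- log((1 - err) / err)           # weight of this classifier
  w <- w * exp(alpha[m] * (G != y))          # up-weight the mistakes
  Fx <- Fx + alpha[m] * G
}
mean(sign(Fx) != y)               # training error of the weighted majority vote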
[Figure: loss functions for two-class classification, plotted against the margin y·f: misclassification, exponential, binomial deviance, squared error, and support vector loss.]

The exponential loss is minimized by

$$f(x) = \frac{1}{2} \log \frac{\Pr(Y = 1 | x)}{\Pr(Y = -1 | x)}.$$
Gradient Boosting
General boosting algorithm that works with a variety of different loss functions. Models include regression, resistant regression, K-class classification, and risk modeling.
Gradient Boosting builds additive tree models, for example, for representing the logits in logistic regression.
Tree size is a parameter that determines the order of interaction (next slide).
Gradient Boosting inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance.
Gradient Boosting is described in detail in The Elements of Statistical Learning, section 10.10.
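A sketch with the gbm package, again assuming the kernlab spam data; the Bernoulli (binomial deviance) loss gives an additive tree model for the logit, and all tuning values here are illustrative:

library(gbm)
library(kernlab)
data(spam)

# 0/1 response for the Bernoulli loss
spam$y <- as.numeric(spam$type == "spam")

set.seed(1)
fit <- gbm(y ~ . - type, data = spam,
           distribution = "bernoulli",  # additive model for the logit
           n.trees = 1000,
           interaction.depth = 1,       # stumps: additive in the predictors
           shrinkage = 0.05,
           cv.folds = 5)

best <- gbm.perf(fit, method = "cv")    # tree count chosen by cross-validation
summary(fit, n.trees = best)            # relative importance of the predictors
plot(fit, i.var = "remove", n.trees = best)  # partial dependence on `remove`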
Tree Size
[Figure: test error on the nested-spheres problem for AdaBoost, stumps, 10-node trees, and 100-node trees, as a function of the number of terms.]

Tree size determines the interaction order of the fitted model:

$$f(X) = \sum_{j} f_j(X_j) + \sum_{jk} f_{jk}(X_j, X_k) + \sum_{jkl} f_{jkl}(X_j, X_k, X_l) + \cdots$$
Stumps win!
Since the true decision boundary is the surface of a sphere, the function that describes it has the form

$$f(X) = X_1^2 + X_2^2 + \cdots + X_p^2 - c = 0.$$

Boosted stumps via Gradient Boosting return reasonable approximations to these quadratic functions.
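One way to see this numerically: a sketch fitting boosted stumps with gbm to simulated 10-dimensional sphere data and inspecting the partial dependence on each coordinate (all settings illustrative):

library(gbm)

set.seed(4)
n <- 2000; p <- 10
X <- as.data.frame(matrix(rnorm(n * p), n, p))
# label by whether the point falls outside a centered sphere
X$y <- as.numeric(rowSums(X[, 1:p]^2) > qchisq(0.5, df = p))

fit <- gbm(y ~ ., data = X, distribution = "bernoulli",
           n.trees = 400, interaction.depth = 1, shrinkage = 0.1)

# The partial dependence of the fitted logit on each coordinate
# should look roughly quadratic:
plot(fit, i.var = 1, n.trees = 400)
plot(fit, i.var = 2, n.trees = 400)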
Coordinate Functions for Additive Logistic Trees
[Figure: the fitted coordinate functions f_1(x_1), ..., f_10(x_10) of the additive logistic tree model.]
For the spam data, the boosted model is additive in the log-odds:

$$\log \frac{\Pr(\text{spam} \mid x)}{\Pr(\text{email} \mid x)} = \sum_{j} f_j(x_j).$$
[Figure: relative importance (0-100) of the predictors for the spam data.]
[Figure: partial dependence of the fitted log-odds on selected predictors, including edu, remove, and hp.]
[Table: comparison of Neural Nets, SVM, CART, GAM, KNN/kernel methods, and Gradient Boosting on four criteria: robustness to outliers in input space, insensitivity to monotone transformations of the inputs, interpretability, and predictive power.]
Software
R: free GPL statistical computing environment available from CRAN; implements the S language. Includes:
randomForest: implementation of Leo Breiman's algorithms.
rpart: Terry Therneau's implementation of classification and regression trees.
gbm: Greg Ridgeway's implementation of Friedman's gradient boosting algorithm.
Salford Systems: commercial implementation of trees, random forests and gradient boosting.
Splus (Insightful): commercial version of S.
Weka: GPL software from the University of Waikato, New Zealand. Includes trees, random forests and many other procedures.