
Boosting

Trevor Hastie, Stanford University

Trees, Bagging, Random Forests and Boosting


Classification Trees
Bagging: Averaging Trees
Random Forests: Cleverer Averaging of Trees
Boosting: Cleverest Averaging of Trees
Methods for improving the performance of weak learners such as
Trees. Classification trees are adaptive and robust, but do not
generalize well. The techniques discussed here enhance their
performance considerably.


Two-class Classification
Observations are classified into two or more classes, coded by a
response variable Y taking values 1, 2, . . . , K.
We have a feature vector X = (X1, X2, . . . , Xp), and we hope
to build a classification rule C(X) to assign a class label to an
individual with feature X.
We have a sample of pairs (yi, xi), i = 1, . . . , N. Note that
each of the xi are vectors xi = (xi1, xi2, . . . , xip).
Example: Y indicates whether an email is spam or not. X
represents the relative frequency of a subset of specially chosen
words in the email message.
The technology described here estimates C(X) directly, or via
the probability function P(C = k|X).


Classification Trees
Represented by a series of binary splits.
Each internal node represents a value query on one of the
variables, e.g. "Is X3 > 0.4?". If the answer is "Yes", go right,
else go left.
The terminal nodes are the decision nodes. Typically each
terminal node is dominated by one of the classes.
The tree is grown using training data, by recursive splitting.
The tree is often pruned to an optimal size, evaluated by
cross-validation.
New observations are classified by passing their X down to a
terminal node of the tree, and then using majority vote.
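As an illustration (not from the slides), a minimal sketch of growing, pruning
and using such a tree with the rpart package in R; the data frames train and
test, with factor response y, are hypothetical placeholders.

library(rpart)

## Grow a large tree by recursive binary splitting (no pruning yet).
fit <- rpart(y ~ ., data = train, method = "class",
             control = rpart.control(cp = 0, minbucket = 5))

## Prune back using the cross-validated cost-complexity table stored by rpart.
cp.best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = cp.best)

## Classify new observations by passing them down to a terminal node.
pred <- predict(pruned, newdata = test, type = "class")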


Classification Tree

[Figure: a small example classification tree. The root node (10/30 misclassified)
splits on x.2 < 0.39; the node with 3/21 misclassified is split further on
x.3 < -1.575, giving terminal nodes with 2/5 and 0/16 misclassified, while the
other branch is a terminal node with 2/9 misclassified.]


Properties of Trees
Can handle huge datasets
Can handle mixed predictors: quantitative and qualitative
Easily ignore redundant variables
Handle missing data elegantly
Small trees are easy to interpret;
large trees are hard to interpret
Often prediction performance is poor


Example: Predicting e-mail spam

data from 4601 email messages
goal: predict whether an email message is spam (junk email) or
good.
input features: relative frequencies in a message of 57 of the
most commonly occurring words and punctuation marks in all
the training email messages.
for this problem not all errors are equal; we want to avoid
filtering out good email, while letting spam get through is not
desirable but less serious in its consequences.
we coded spam as 1 and email as 0.
A system like this would be trained for each user separately
(e.g. their word lists would be different)


Predictors
48 quantitative predictors: the percentage of words in the
email that match a given word. Examples include business,
address, internet, free, and george. The idea was that these
could be customized for individual users.
6 quantitative predictors: the percentage of characters in the
email that match a given character. The characters are ch;,
ch(, ch[, ch!, ch$, and ch#.
The average length of uninterrupted sequences of capital
letters: CAPAVE.
The length of the longest uninterrupted sequence of capital
letters: CAPMAX.
The sum of the length of uninterrupted sequences of capital
letters: CAPTOT.
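As an aside (not part of the slides), a small base-R sketch of how the three
capital-letter features could be computed from a raw message string; the
variable msg is a hypothetical example.

msg <- "FREE money!!! Call NOW to CLAIM your PRIZE"

## Uninterrupted runs of capital letters and their lengths.
runs <- regmatches(msg, gregexpr("[A-Z]+", msg))[[1]]
caps <- nchar(runs)

CAPAVE <- mean(caps)   # average run length
CAPMAX <- max(caps)    # longest run
CAPTOT <- sum(caps)    # total length of all runs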


Details
A test set of size 1536 was randomly chosen, leaving 3065
observations in the training set.
A full tree was grown on the training set, with splitting
continuing until a minimum bucket size of 5 was reached.
This bushy tree was pruned back using cost-complexity
pruning, and the tree size was chosen by 10-fold
cross-validation.
We then compute the test error and ROC curve on the test
data.


Some important features


39% of the training data were spam.
Average percentage of words or characters in an email message
equal to the indicated word or character. We have chosen the
words and characters showing the largest difference between spam
and email.
           george    you   your     hp   free    hpl
  spam       0.00   2.26   1.38   0.02   0.52   0.01
  email      1.27   1.27   0.44   0.90   0.07   0.43

                !    our     re    edu  remove
  spam       0.51   0.51   0.13   0.01    0.28
  email      0.11   0.18   0.42   0.29    0.01


[Figure: pruned classification tree for the spam data. The root node (email,
600/1536) splits on ch$ < 0.0555; lower splits are on remove, hp, ch!, george,
CAPAVE, 1999, free, CAPMAX, business, receive, edu and our. Most terminal
nodes are strongly dominated by either spam or email.]


ROC curve for pruned tree on SPAM data

[Figure: ROC curve (sensitivity vs. specificity) for the pruned tree on the
SPAM test data. TREE error: 8.7%.]

Overall error rate on test data: 8.7%.
ROC curve obtained by varying the threshold c0 of the classifier:
C(X) = +1 if P(+1|X) > c0.
Sensitivity: proportion of true spam identified.
Specificity: proportion of true email identified.

We may want specificity to be high, and suffer some spam:
Specificity: 95% => Sensitivity: 79%
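A minimal sketch (not from the slides) of tracing out such a curve by varying
the threshold on the predicted spam probability, reusing the hypothetical
pruned tree and test set from the earlier rpart sketch; the label coding
("spam"/"email" in test$y) is also an assumption.

## Predicted P(spam | x) on the test set, and the true labels.
prob  <- predict(pruned, newdata = test, type = "prob")[, "spam"]
truth <- test$y

roc.point <- function(c0) {
  pred <- ifelse(prob > c0, "spam", "email")
  c(sensitivity = mean(pred[truth == "spam"]  == "spam"),
    specificity = mean(pred[truth == "email"] == "email"))
}

## Sensitivity/specificity pairs over a grid of thresholds c0.
roc <- t(sapply(seq(0, 1, by = 0.01), roc.point))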


ROC curve for TREE vs SVM on SPAM data

[Figure: ROC curves (sensitivity vs. specificity) for TREE and SVM on the SPAM
test data. SVM error: 6.7%; TREE error: 8.7%.]

Comparing ROC curves on the test data is a good way to compare
classifiers. SVM dominates TREE here.


Toy Classification Problem

[Figure: scatter plot of the training data in (X1, X2), class labels shown as
0s and 1s, with the Bayes decision boundary drawn in black. Bayes error rate:
0.25.]

Here X = (X1, X2).
Data X and Y, with Y taking values +1 or -1.
The black boundary is the Bayes Decision Boundary - the best one can do.
Goal: Given N training pairs (Xi, Yi), produce a classifier C(X) in {-1, +1}.
Also estimate the probability of the class labels P(Y = +1|X).


Toy Example - No Noise

[Figure: scatter plot of the training data in (X1, X2), class labels shown as
0s and 1s. Bayes error rate: 0.]

Deterministic problem; noise comes from the sampling distribution of X.
Use a training sample of size 200.
Here the Bayes Error is 0%.


Classification Tree

[Figure: classification tree grown on the 200-point toy training sample
(94/200 misclassified at the root), with splits on x.2 and x.1.]


Decision Boundary: Tree

[Figure: decision boundary of the classification tree on the toy training
data. Error rate: 0.073.]

When the nested spheres are in 10 dimensions, Classification Trees produce a
rather noisy and inaccurate rule C(X), with error rates around 30%.


Model Averaging
Classification trees can be simple, but often produce noisy (bushy)
or weak (stunted) classifiers.
Bagging (Breiman, 1996): Fit many large trees to
bootstrap-resampled versions of the training data, and classify
by majority vote.
Boosting (Freund & Schapire, 1996): Fit many large or small
trees to reweighted versions of the training data. Classify by
weighted majority vote.
Random Forests (Breiman 1999): Fancier version of bagging.
In general Boosting > Random Forests > Bagging > Single Tree.


Bagging
Bagging or bootstrap aggregation averages a given procedure over
many samples, to reduce its variance: a poor man's Bayes. See
pp 246.
Suppose C(S, x) is a classifier, such as a tree, based on our training
data S, producing a predicted class label at input point x.
To bag C, we draw bootstrap samples S*1, . . . , S*B, each of size N,
with replacement from the training data. Then

    Cbag(x) = Majority Vote { C(S*b, x), b = 1, . . . , B }.

Bagging can dramatically reduce the variance of unstable
procedures (like trees), leading to improved prediction. However,
any simple structure in C (e.g. a tree) is lost.
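A minimal sketch (not from the slides) of bagging a classification tree with
rpart; train (with factor response y), test and the number of bootstrap
samples B are hypothetical placeholders.

library(rpart)

B <- 100
votes <- sapply(1:B, function(b) {
  idx  <- sample(nrow(train), replace = TRUE)        # bootstrap sample S*b
  tree <- rpart(y ~ ., data = train[idx, ], method = "class",
                control = rpart.control(cp = 0))     # grow a large tree
  as.character(predict(tree, newdata = test, type = "class"))
})

## Majority vote over the B trees for each test observation.
bag.pred <- apply(votes, 1, function(v) names(which.max(table(v))))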



[Figure: the original tree (root 10/30, split on x.2 < 0.39, then x.3 < -1.575)
alongside five trees grown on bootstrap samples of the training data. The
bootstrap trees differ in their split variables and split points (x.2, x.1,
x.3, x.4), illustrating the tree-to-tree variability that bagging averages
over.]


Decision Boundary: Bagging

[Figure: decision boundary of the bagged trees on the toy training data.
Error rate: 0.032.]

Bagging averages many trees, and produces smoother decision boundaries.


Random forests
refinement of bagged trees; quite popular
at each tree split, a random sample of m features is drawn, and
only those m features are considered for splitting. Typically
m = sqrt(p) or log2(p), where p is the number of features.
For each tree grown on a bootstrap sample, the error rate for
observations left out of the bootstrap sample is monitored.
This is called the "out-of-bag" error rate.
random forests tries to improve on bagging by de-correlating
the trees. Each tree has the same expectation.
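A minimal sketch (not from the slides) of fitting a random forest to the spam
data with the randomForest package; train and test, with factor response spam,
are hypothetical placeholders.

library(randomForest)

p  <- ncol(train) - 1                     # number of features
rf <- randomForest(spam ~ ., data = train,
                   ntree = 500, mtry = floor(sqrt(p)))

rf                                        # printed summary includes the OOB error
pred <- predict(rf, newdata = test)
mean(pred != test$spam)                   # test error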


ROC curve for TREE, SVM and Random Forest on SPAM data

[Figure: ROC curves (sensitivity vs. specificity) for TREE, SVM and Random
Forest on the SPAM test data. Random Forest error: 5.0%; SVM error: 6.7%;
TREE error: 8.7%.]

Random Forest dominates both other methods on the SPAM data: 5.0% error.
Used 500 trees with default settings for the randomForest package in R.


Boosting

[Schematic: Training Sample -> C1(x); Weighted Sample -> C2(x);
Weighted Sample -> C3(x); . . . ; Weighted Sample -> CM(x).]

Average many trees, each grown to re-weighted versions
of the training data.

Final Classifier is a weighted average of the classifiers:

    C(x) = sign[ Σ_{m=1}^M αm Cm(x) ]


Boosting vs Bagging

[Figure: test error vs. number of terms for Bagging and AdaBoost, using
100-node trees.]

2000 points from Nested Spheres in R^10.
Bayes error rate is 0%.
Trees are grown best-first without pruning.
Leftmost term is a single tree.


AdaBoost (Freund & Schapire, 1996)

1. Initialize the observation weights wi = 1/N, i = 1, 2, . . . , N.
2. For m = 1 to M repeat steps (a)-(d):
   (a) Fit a classifier Cm(x) to the training data using weights wi.
   (b) Compute the weighted error of the newest tree

       errm = Σ_{i=1}^N wi I(yi ≠ Cm(xi)) / Σ_{i=1}^N wi.

   (c) Compute αm = log[(1 - errm)/errm].
   (d) Update weights for i = 1, . . . , N:

       wi <- wi exp[αm I(yi ≠ Cm(xi))]

       and renormalize the wi to sum to 1.
3. Output C(x) = sign[ Σ_{m=1}^M αm Cm(x) ].
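A compact sketch (not from the slides) of this algorithm in R, using rpart
stumps as the weak learner; x (a data frame of predictors), y (coded -1/+1)
and the number of iterations M are hypothetical placeholders.

library(rpart)

M <- 100; N <- length(y)
w <- rep(1 / N, N)                         # step 1: equal weights
dat <- data.frame(y = factor(y), x)
trees <- vector("list", M); alpha <- numeric(M)

for (m in 1:M) {
  ## (a) fit a weighted stump (two-node tree)
  fit  <- rpart(y ~ ., data = dat, weights = w, method = "class",
                control = rpart.control(maxdepth = 1, minsplit = 2, cp = 0))
  pred <- as.numeric(as.character(predict(fit, dat, type = "class")))
  miss <- as.numeric(pred != y)            # I(yi != Cm(xi))
  err  <- sum(w * miss) / sum(w)           # (b) weighted error
  alpha[m] <- log((1 - err) / err)         # (c)
  w <- w * exp(alpha[m] * miss)            # (d) reweight ...
  w <- w / sum(w)                          #     ... and renormalize
  trees[[m]] <- fit
}

## Final classifier: sign of the weighted vote.
score <- Reduce(`+`, lapply(1:M, function(m)
  alpha[m] * as.numeric(as.character(predict(trees[[m]], dat, type = "class")))))
C.hat <- sign(score)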

Boosting Stumps

[Figure: test error vs. boosting iterations for boosted stumps on the
nested-spheres problem, with reference lines for a single stump and a
400-node tree.]

A stump is a two-node tree, after a single split.
Boosting stumps works remarkably well on the nested-spheres problem.


Training Error

[Figure: train and test error vs. number of terms.]

Nested spheres in 10-Dimensions.
Bayes error is 0%.
Boosting drives the training error to zero.
Further iterations continue to improve test error in many examples.

Noisy Problems

[Figure: train and test error vs. number of terms for boosting with stumps;
the Bayes error (25%) is marked as a horizontal line.]

Nested Gaussians in 10-Dimensions.
Bayes error is 25%.
Boosting with stumps.
Here the test error does increase, but quite slowly.


Stagewise Additive Modeling

Boosting builds an additive model

    f(x) = Σ_{m=1}^M βm b(x; γm).

Here b(x; γm) is a tree, and γm parametrizes the splits.
We do things like that in statistics all the time!
GAMs: f(x) = Σ_j fj(xj)
Basis expansions: f(x) = Σ_{m=1}^M θm hm(x)
Traditionally the parameters fm, θm are fit jointly (i.e. least
squares, maximum likelihood).
With boosting, the parameters (βm, γm) are fit in a stagewise
fashion. This slows the process down, and overfits less quickly.


Additive Trees
Simple example: stagewise least-squares?
Fix the past M-1 functions, and update the Mth using a tree:

    min_{fM ∈ Tree(x)} E( Y - Σ_{m=1}^{M-1} fm(x) - fM(x) )^2

If we define the current residuals to be

    R = Y - Σ_{m=1}^{M-1} fm(x)

then at each stage we fit a tree to the residuals:

    min_{fM ∈ Tree(x)} E( R - fM(x) )^2
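A minimal sketch (not from the slides) of this stagewise least-squares idea in
R: repeatedly fit a small regression tree to the current residuals and add it
to the fit. x (a data frame of predictors), the numeric response y and the
number of stages M are hypothetical placeholders.

library(rpart)

M <- 200
dat <- data.frame(x)
f <- rep(0, length(y))                 # current fit f(x)
trees <- vector("list", M)

for (m in 1:M) {
  dat$r <- y - f                       # current residuals R = y - f(x)
  tree  <- rpart(r ~ ., data = dat, method = "anova",
                 control = rpart.control(maxdepth = 2, cp = 0))
  f <- f + predict(tree, dat)          # add the fitted tree to the expansion
  trees[[m]] <- tree
}

In practice each update would usually be shrunk, f <- f + nu * predict(tree, dat)
with a small nu, as on the shrinkage step described later.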


Stagewise Least Squares

Suppose we have available a basis family b(x; γ) parametrized by γ.
After m-1 steps, suppose we have the model

    f_{m-1}(x) = Σ_{j=1}^{m-1} βj b(x; γj).

At the mth step we solve

    min_{β,γ} Σ_{i=1}^N ( yi - f_{m-1}(xi) - β b(xi; γ) )^2.

Denoting the residuals at the mth stage by rim = yi - f_{m-1}(xi),
the previous step amounts to

    min_{β,γ} Σ_{i=1}^N ( rim - β b(xi; γ) )^2.

Thus the term βm b(x; γm) that best fits the current residuals is
added to the expansion at each step.


Adaboost: Stagewise Modeling

AdaBoost builds an additive logistic regression model

    f(x) = log [ Pr(Y = 1|x) / Pr(Y = -1|x) ] = Σ_{m=1}^M αm Gm(x)

by stagewise fitting using the loss function

    L(y, f(x)) = exp(-y f(x)).

Given the current f_{m-1}(x), our solution for (αm, Gm) is

    arg min_{α,G} Σ_{i=1}^N exp[ -yi ( f_{m-1}(xi) + α G(xi) ) ]

where Gm(x) ∈ {-1, 1} is a tree classifier and αm is a coefficient.


With wi(m) = exp(-yi f_{m-1}(xi)), this can be re-expressed as

    arg min_{α,G} Σ_{i=1}^N wi(m) exp(-α yi G(xi)).

We can show that this leads to the Adaboost algorithm; see pp 305.
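The solution is not worked out on the slide; a sketch of the usual
exponential-loss calculation (the argument behind the pp 305 reference),
written in LaTeX:

% Split the criterion by whether G classifies x_i correctly:
\sum_i w_i^{(m)} e^{-\alpha y_i G(x_i)}
  = (e^{\alpha} - e^{-\alpha}) \sum_i w_i^{(m)} I(y_i \ne G(x_i))
    + e^{-\alpha} \sum_i w_i^{(m)}.

% So for any \alpha > 0, G_m minimizes the weighted misclassification error,
G_m = \arg\min_G \sum_i w_i^{(m)} I(y_i \ne G(x_i)),

% and minimizing over \alpha gives
\alpha_m = \tfrac{1}{2} \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m},
\qquad
\mathrm{err}_m = \frac{\sum_i w_i^{(m)} I(y_i \ne G_m(x_i))}{\sum_i w_i^{(m)}}.

The α used in the AdaBoost weight update is twice this minimizer, which only
rescales f(x) and leaves the sign of the final classifier unchanged.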


Why Exponential Loss?

[Figure: loss as a function of the margin y·f, comparing misclassification,
exponential, binomial deviance, squared error and support vector losses.]

e^{-yF(x)} is a monotone, smooth upper bound on misclassification loss at x.
Leads to a simple reweighting scheme.
Has the logit transform as population minimizer:

    f(x) = (1/2) log [ Pr(Y = 1|x) / Pr(Y = -1|x) ]

Other more robust loss functions, like binomial deviance, are available.
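The population minimizer quoted above follows from a one-line calculation (not
shown on the slide), written in LaTeX with p = Pr(Y = 1|x):

E\bigl[e^{-Y f(x)} \mid x\bigr] = p\, e^{-f(x)} + (1-p)\, e^{f(x)},
\qquad
-p\, e^{-f(x)} + (1-p)\, e^{f(x)} = 0
\;\Longrightarrow\;
f^*(x) = \tfrac{1}{2} \log \frac{p}{1-p}.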


General Stagewise Algorithm

We can do the same for more general loss functions, not only least
squares.
1. Initialize f0(x) = 0.
2. For m = 1 to M:
   (a) Compute
       (βm, γm) = arg min_{β,γ} Σ_{i=1}^N L( yi, f_{m-1}(xi) + β b(xi; γ) ).
   (b) Set fm(x) = f_{m-1}(x) + βm b(x; γm).
Sometimes we replace step (b) in item 2 by
   (b') Set fm(x) = f_{m-1}(x) + ν βm b(x; γm).
Here ν is a shrinkage factor, and often ν < 0.1. Shrinkage slows the
stagewise model-building even more, and typically leads to better
performance.


Gradient Boosting
General boosting algorithm that works with a variety of
different loss functions. Models include regression, resistant
regression, K-class classification and risk modeling.
Gradient Boosting builds additive tree models, for example, for
representing the logits in logistic regression.
Tree size is a parameter that determines the order of
interaction (next slide).
Gradient Boosting inherits all the good features of trees
(variable selection, missing data, mixed predictors), and
improves on the weak features, such as prediction performance.
Gradient Boosting is described in detail in section 10.10.
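A minimal sketch (not from the slides) of gradient boosting on the spam data
with the gbm package mentioned on the software slide; the data frames train
and test with a 0/1 response spam01, and the tuning values, are hypothetical.

library(gbm)

## Boosted trees for the logit of P(spam | x). interaction.depth controls the
## tree size (and hence the interaction order); shrinkage is the factor nu above.
fit <- gbm(spam01 ~ ., data = train,
           distribution = "bernoulli",
           n.trees = 3000, interaction.depth = 5,
           shrinkage = 0.05, cv.folds = 5)

best <- gbm.perf(fit, method = "cv")           # CV-selected number of trees
phat <- predict(fit, newdata = test, n.trees = best, type = "response")
mean((phat > 0.5) != test$spam01)              # test misclassification error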


Tree Size

[Figure: test error vs. number of terms for boosting with stumps, 10-node
trees, 100-node trees, and Adaboost.]

The tree size J determines the interaction order of the model:

    η(X) = Σ_j ηj(Xj) + Σ_{jk} ηjk(Xj, Xk) + Σ_{jkl} ηjkl(Xj, Xk, Xl) + · · ·


Stumps win!
Since the true decision boundary is the surface of a sphere, the
function that describes it has the form

    f(X) = X1^2 + X2^2 + . . . + Xp^2 - c = 0.

Boosted stumps via Gradient Boosting returns reasonable
approximations to these quadratic functions.

[Figure: "Coordinate Functions for Additive Logistic Trees", showing the
fitted functions f1(x1), . . . , f10(x10).]


Spam Example Results

With 3000 training and 1500 test observations, Gradient Boosting
fits an additive logistic model

    f(x) = log [ Pr(spam|x) / Pr(email|x) ]

using trees with J = 6 terminal nodes.

Gradient Boosting achieves a test error of 4%, compared to 5.3% for
an additive GAM, 5.0% for Random Forests, and 8.7% for CART.
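The next two slides summarize the fitted model through variable importance and
partial dependence. A sketch (not from the slides) of how such summaries can
be produced from a fitted gbm model, assuming the hypothetical fit and best
from the earlier gbm sketch:

## Relative influence of each predictor (the variable-importance bar chart).
imp <- summary(fit, n.trees = best)

## Partial dependence of the fitted log-odds on single predictors.
plot(fit, i.var = "remove", n.trees = best)
plot(fit, i.var = "edu",    n.trees = best)
plot(fit, i.var = "hp",     n.trees = best)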


Spam: Variable Importance

[Figure: relative importance (0-100) of the 57 predictors in the boosted
model. The most important are !, $, hp, remove, free, CAPAVE, your, CAPMAX,
george, CAPTOT, edu, you and our; the least important include 3d, addresses,
labs, telnet, 857 and 415.]

Spam: Partial Dependence

[Figure: partial dependence of the fitted log-odds of spam on the predictors
remove, edu and hp.]


Comparison of Learning Methods

Some characteristics of different learning methods, each rated good, fair or
poor.

[Table: Neural Nets, SVM, CART, GAM, KNN/Kernel and Gradient Boosting compared
on: natural handling of data of mixed type; handling of missing values;
robustness to outliers in input space; insensitivity to monotone
transformations of inputs; computational scalability (large N); ability to
deal with irrelevant inputs; ability to extract linear combinations of
features; interpretability; predictive power.]


Software
R: free GPL statistical computing environment available from
CRAN, implements the S language. Includes:
randomForest: implementation of Leo Breiman's algorithms.
rpart: Terry Therneau's implementation of classification
and regression trees.
gbm: Greg Ridgeway's implementation of Friedman's
gradient boosting algorithm.
Salford Systems: Commercial implementation of trees, random
forests and gradient boosting.
Splus (Insightful): Commercial version of S.
Weka: GPL software from University of Waikato, New Zealand.
Includes Trees, Random Forests and many other procedures.
