
Overfitting in decision trees
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
2015-2016 Emily Fox & Carlos Guestrin Machine Learning Specialization
Review of loan default prediction
[Figure: loan applications go into an intelligent loan application review system, which labels each application Safe or Risky.]


Decision tree review

T(xi) = Traverse decision tree

Input: xi (loan application)

Start
  Credit?
    excellent: Safe
    fair: Term?
      3 years: Risky
      5 years: Safe
    poor: Income?
      high: Term?
        3 years: Risky
        5 years: Safe
      low: Risky
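
The traversal T(xi) can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the nested-dict encoding of the tree, the lowercase feature names and the `traverse` helper are all assumptions made for the example.

```python
# Hypothetical nested-dict encoding of the loan tree shown on the slide.
# A dict is a decision node; a plain string is a leaf label.
TREE = {
    "feature": "credit",
    "children": {
        "excellent": "Safe",
        "fair": {"feature": "term",
                 "children": {"3 years": "Risky", "5 years": "Safe"}},
        "poor": {
            "feature": "income",
            "children": {
                "high": {"feature": "term",
                         "children": {"3 years": "Risky", "5 years": "Safe"}},
                "low": "Risky",
            },
        },
    },
}


def traverse(tree, x):
    """Follow the branch matching x at each decision node until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        node = node["children"][x[node["feature"]]]
    return node


applicant = {"credit": "poor", "income": "high", "term": "5 years"}
print(traverse(TREE, applicant))  # Safe
```

Only the features that lie on the path taken are looked up, so an applicant with excellent credit is classified from the `credit` value alone.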


Overfitting review



Overfitting in logistic regression
[Figure: classification error versus model complexity, with curves for training error and true error.]

Error = (# mistakes) / (# data points)

Overfitting if there exists w* such that:
  training_error(w*) > training_error(ŵ)
  true_error(w*) < true_error(ŵ)
where ŵ denotes the learned coefficients.


Overfitting
Overconfident predictions

[Figure: two panels, Logistic Regression (Degree 6) and Logistic Regression (Degree 20).]


Overfitting in decision trees



Decision stump (Depth 1): Split on x[1]

y values           -    +
Root              18   13
x[1] < -0.07      13    3
x[1] >= -0.07      4   11


What happens when we increase depth?

Training error reduces with depth

Tree depth       depth = 1   depth = 2   depth = 3   depth = 5   depth = 10
Training error   0.22        0.13        0.10        0.03        0.00

[Figure: decision boundary for each depth.]


Deeper trees lower training error

Depth 10 (training error = 0.0)

[Figure: training error versus tree depth.]


Training error = 0: Is this model perfect?
Depth 10 (training error = 0.0)

NOT PERFECT

Why training error reduces with depth?

Loan status counts are (Safe, Risky).

Root: 22 Safe, 18 Risky
Split on Credit?
  excellent: 9 Safe, 0 Risky  -> predict Safe
  good:      9 Safe, 4 Risky  -> predict Safe
  fair:      4 Safe, 14 Risky -> predict Risky

Tree              Training error
(root)            0.45
split on credit   0.20

Training error improved by 0.25 because of the split.

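
The two training errors can be checked by hand: each node predicts its majority class, so the minority count in each node is the number of mistakes. A short sketch of that arithmetic, using the counts from the slide (the helper name `classification_error` is an assumption for this example):

```python
def classification_error(groups):
    """groups: list of (safe_count, risky_count), one pair per leaf.

    Each leaf predicts its majority class, so it misclassifies the minority.
    """
    mistakes = sum(min(safe, risky) for safe, risky in groups)
    total = sum(safe + risky for safe, risky in groups)
    return mistakes / total


root_error = classification_error([(22, 18)])                   # 18 / 40
split_error = classification_error([(9, 0), (9, 4), (4, 14)])   # (0 + 4 + 4) / 40
print(round(root_error, 2), round(split_error, 2), round(root_error - split_error, 2))
# 0.45 0.2 0.25
```
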
Feature split selection algorithm
Given a subset of data M (a node in a tree)
For each feature hi(x):
  1. Split data of M according to feature hi(x)
  2. Compute classification error of split
Choose feature h*(x) with lowest classification error

By design, the chosen split never increases training error, and it usually reduces it.
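
A minimal sketch of this selection rule, assuming categorical features stored as dict rows. The helper names and the six-row toy dataset are hypothetical and are not taken from the course data.

```python
from collections import Counter, defaultdict


def node_mistakes(labels):
    """Mistakes made when a node predicts its majority label."""
    return len(labels) - max(Counter(labels).values())


def split_error(data, feature, target):
    """Classification error after splitting `data` on `feature`."""
    groups = defaultdict(list)
    for row in data:
        groups[row[feature]].append(row[target])
    return sum(node_mistakes(labels) for labels in groups.values()) / len(data)


def best_feature(data, features, target):
    """Feature h*(x) whose split has the lowest classification error."""
    return min(features, key=lambda f: split_error(data, f, target))


# Hypothetical toy data, made up for illustration only.
data = [
    {"credit": "excellent", "term": "3 years", "y": "Safe"},
    {"credit": "excellent", "term": "5 years", "y": "Safe"},
    {"credit": "poor", "term": "3 years", "y": "Risky"},
    {"credit": "poor", "term": "5 years", "y": "Risky"},
    {"credit": "fair", "term": "3 years", "y": "Risky"},
    {"credit": "fair", "term": "5 years", "y": "Safe"},
]
print(best_feature(data, ["credit", "term"], "y"))  # credit
```

On this toy data, splitting on `credit` leaves one mistake out of six, while splitting on `term` leaves two, so `credit` is chosen.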


Decision trees overfitting
on loan data



Principle of Occam's razor:
Simpler trees are better


Principle of Occam's Razor
"Among competing hypotheses, the one with fewest assumptions should be selected",
William of Occam, 13th Century

Symptoms: S1 and S2

Diagnosis 1: 2 diseases. Two diseases D1 and D2, where D1 explains S1 and D2 explains S2.
OR
Diagnosis 2: 1 disease. Disease D3 explains both symptoms S1 and S2. (SIMPLER)


Occam's Razor for decision trees
When two trees have similar classification error on the validation set, pick the simpler one

Complexity      Train error   Validation error
Simple          0.23          0.24
Moderate        0.12          0.15   (same validation error as Complex)
Complex         0.07          0.15
Super complex   0             0.18   (overfit)

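
The selection rule can be written down directly. The sketch below uses the four rows of the table; the `tolerance` parameter, which decides how close two validation errors must be to count as "similar", is an assumption added for the example.

```python
# (name, train_error, validation_error), ordered from simplest to most complex,
# copied from the table on this slide.
candidates = [
    ("Simple", 0.23, 0.24),
    ("Moderate", 0.12, 0.15),
    ("Complex", 0.07, 0.15),
    ("Super complex", 0.0, 0.18),
]


def pick_simplest(candidates, tolerance=0.0):
    """Return the simplest tree whose validation error is within
    `tolerance` of the best validation error."""
    best = min(val for _, _, val in candidates)
    for name, _, val in candidates:  # simplest first
        if val <= best + tolerance:
            return name


print(pick_simplest(candidates))  # Moderate
```

Note that training error plays no role in the choice: Complex and Super complex both have lower training error than Moderate, but neither has lower validation error.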
Which tree is simpler?

[Figure: two candidate trees, one of them marked SIMPLER.]


Modified tree learning problem
Find a simple decision tree with low classification error

[Figure: the range of trees from simple trees to complex trees, with example trees T1(X), T2(X) and T4(X).]


How do we pick simpler trees?

1. Early Stopping: Stop learning algorithm before tree becomes too complex

2. Pruning: Simplify tree after learning algorithm terminates


Early stopping for
learning decision trees



Deeper trees
Increasing complexity

Model complexity increases with depth

[Figure: panels for Depth = 1, Depth = 2 and Depth = 10.]


Early stopping condition 1:
Limit the depth of a tree



Restrict tree learning to shallow trees?
[Figure: classification error versus tree depth, from simple trees to complex trees, with curves for training error and true error and max_depth marked on the depth axis.]

Early stopping condition 1:
Limit depth of tree
Stop tree building when depth = max_depth

[Figure: classification error versus tree depth, with max_depth marked on the depth axis.]

Picking value for max_depth???
Use a validation set or cross-validation.

[Figure: classification error versus tree depth, with max_depth marked on the depth axis.]

Early stopping condition 2:
Use classification error to
limit depth of tree



Decision tree recursion review
Loan status counts are (Safe, Risky).

Root: 22 Safe, 18 Risky
Credit?
  excellent: 16 Safe, 0 Risky  -> Safe
  fair:      1 Safe, 2 Risky   -> build decision stump with subset of data where Credit = fair
  poor:      5 Safe, 16 Risky  -> build decision stump with subset of data where Credit = poor

Split selection for credit=poor

Loan status counts are (Safe, Risky).

Root: 22 Safe, 18 Risky
Credit?
  excellent: 16 Safe, 0 Risky  -> Safe
  fair:      1 Safe, 2 Risky
  poor:      5 Safe, 16 Risky

Splits for credit=poor   Classification error
(no split)               0.24
split on term            0.24
split on income          0.24

No split improves classification error: Stop!


Early stopping condition 2:
No split improves classification error
Loan status counts are (Safe, Risky).

Root: 22 Safe, 18 Risky
Credit?
  excellent: 16 Safe, 0 Risky  -> Safe
  fair:      1 Safe, 2 Risky   -> build decision stump with subset of data where Credit = fair
  poor:      5 Safe, 16 Risky  -> Risky (early stopping condition 2)

Splits for credit=poor   Classification error
(no split)               0.24
split on term            0.24
split on income          0.24


Practical notes about stopping when
classification error doesn't decrease

1. Typically, add magic parameter ε
   - Stop if error doesn't decrease by more than ε

2. Some pitfalls to this rule (see pruning section)

3. Very useful in practice
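
As a rough sketch, the rule with a threshold amounts to a one-line comparison. The function name `should_stop` and the default of zero for the threshold are assumptions for this example.

```python
def should_stop(error_before, error_after, epsilon=0.0):
    """Early stopping condition 2: stop when the best available split
    does not reduce classification error by more than epsilon."""
    return (error_before - error_after) <= epsilon


# credit=poor node: every candidate split leaves the error at 0.24.
print(should_stop(0.24, 0.24))                # True
# Root node: splitting on credit takes the error from 0.45 to 0.20.
print(should_stop(0.45, 0.20, epsilon=0.01))  # False
```
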


Early stopping condition 3:
Stop if number of data points
contained in a node is too small



Can we trust nodes with very few points?
Loan status counts are (Safe, Risky).

Root: 22 Safe, 18 Risky
Credit?
  excellent: 16 Safe, 0 Risky  -> Safe
  fair:      1 Safe, 2 Risky   -> only 3 data points! Stop recursing
  poor:      5 Safe, 16 Risky  -> Risky

Early stopping condition 3:
Stop when data points in a node <= Nmin
Example: Nmin = 10

Loan status counts are (Safe, Risky).

Root: 22 Safe, 18 Risky
Credit?
  excellent: 16 Safe, 0 Risky  -> Safe
  fair:      1 Safe, 2 Risky   -> Risky (early stopping condition 3)
  poor:      5 Safe, 16 Risky  -> Risky

Summary of decision trees
with early stopping



Early stopping: Summary

1. Limit tree depth: Stop splitting after a certain depth

2. Classification error: Do not consider any split that does not cause a sufficient decrease in classification error

3. Minimum node size: Do not split an intermediate node which contains too few data points


Greedy decision tree learning

Step 1: Start with an empty tree

Step 2: Select a feature to split data

For each split of the tree:
  Step 3: If nothing more to do, make predictions
          (stopping conditions 1 & 2, or early stopping conditions 1, 2 & 3)
  Step 4: Otherwise, go to Step 2 & continue (recurse) on this split
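
The four steps and the stopping rules can be combined into one recursive function. What follows is a sketch under stated assumptions, not the course's reference implementation: features are categorical, rows are dicts, and the parameter names `max_depth`, `min_node_size` and `min_error_reduction` are chosen for this example. Passing `min_error_reduction=None` switches early stopping condition 2 off.

```python
from collections import Counter, defaultdict


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def mistakes(labels):
    """Mistakes made when predicting the majority label."""
    return len(labels) - max(Counter(labels).values())


def split(data, feature):
    groups = defaultdict(list)
    for row in data:
        groups[row[feature]].append(row)
    return groups


def build_tree(data, features, target, depth=0,
               max_depth=10, min_node_size=1, min_error_reduction=0.0):
    labels = [row[target] for row in data]
    leaf = majority(labels)
    # Stopping conditions 1 & 2: all labels agree, or no features left.
    if len(set(labels)) == 1 or not features:
        return leaf
    # Early stopping conditions 1 & 3: depth limit, or too few data points.
    if depth >= max_depth or len(data) <= min_node_size:
        return leaf

    # Step 2: pick the feature whose split makes the fewest mistakes.
    def mistakes_after(feature):
        return sum(mistakes([r[target] for r in group])
                   for group in split(data, feature).values())

    best = min(features, key=mistakes_after)
    # Early stopping condition 2: the best split must reduce error enough.
    reduction = (mistakes(labels) - mistakes_after(best)) / len(data)
    if min_error_reduction is not None and reduction <= min_error_reduction:
        return leaf
    # Step 4: recurse on each branch of the chosen split.
    remaining = [f for f in features if f != best]
    return {
        "feature": best,
        "children": {
            value: build_tree(subset, remaining, target, depth + 1,
                              max_depth, min_node_size, min_error_reduction)
            for value, subset in split(data, best).items()
        },
    }


xor = [{"x1": a, "x2": b, "y": a != b}
       for a in (False, True) for b in (False, True)]
stump = build_tree(xor, ["x1", "x2"], "y")
full = build_tree(xor, ["x1", "x2"], "y", min_error_reduction=None)
print(isinstance(stump, dict), isinstance(full, dict))  # False True
```

On XOR data the default settings stop at the root, because no single split lowers the error, while disabling condition 2 lets the learner grow the depth-2 tree that classifies all four points correctly.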


Overfitting in Decision Trees:
Pruning

OPTIONAL

Stopping condition summary
Stopping conditions:
1. All examples have the same target value
2. No more features to split on

Early stopping conditions:
1. Limit tree depth
2. Do not consider splits that do not cause a sufficient decrease in classification error
3. Do not split an intermediate node which contains too few data points


Exploring some challenges
with early stopping conditions



Challenge with early stopping condition 1
Hard to know exactly
when to stop
[Figure: classification error versus tree depth, from simple trees to complex trees, with curves for training error and true error and max_depth marked on the depth axis.]

Is early stopping condition 2 a good idea?
[Figure: classification error versus tree depth, with a point marked "Stop because of zero decrease in classification error".]

Early stopping condition 2:
Don't stop if error doesn't decrease???

y = x[1] xor x[2]

x[1]    x[2]    y
False   False   False
False   True    True
True    False   True
True    True    False

Root y values: 2 True, 2 False
Error = 2/4 = 0.5

Tree     Classification error
(root)   0.5


Consider split on x[1]

y = x[1] xor x[2] (same four data points)

Root y values: 2 True, 2 False
Split on x[1]:
  x[1] = True:  1 True, 1 False
  x[1] = False: 1 True, 1 False
Error = 2/4 = 0.5

Tree            Classification error
(root)          0.5
Split on x[1]   0.5

Consider split on x[2]

y = x[1] xor x[2] (same four data points)

Root y values: 2 True, 2 False
Split on x[2]:
  x[2] = True:  1 True, 1 False
  x[2] = False: 1 True, 1 False
Error = 2/4 = 0.5

Tree            Classification error
(root)          0.5
Split on x[1]   0.5
Split on x[2]   0.5

Neither feature improves training error. Stop now???

Final tree with early stopping condition 2

y = x[1] xor x[2] (same four data points)

Root y values: 2 True, 2 False
Final tree: root only, predicting True

Tree                                 Classification error
with early stopping condition 2      0.5


Without early stopping condition 2

y = x[1] xor x[2] (same four data points)

Leaf counts are (True, False).

Root: 2 True, 2 False
x[1]
  True (1, 1): x[2]
    True  (0, 1) -> False
    False (1, 0) -> True
  False (1, 1): x[2]
    True  (1, 0) -> True
    False (0, 1) -> False

Tree                                 Classification error
with early stopping condition 2      0.5
without early stopping condition 2   0.0

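
The XOR numbers are easy to reproduce. The sketch below rebuilds the four data points and measures the classification error of each grouping; the helper names `error` and `group_by` are assumptions for this example.

```python
from collections import Counter
from itertools import product

# The four XOR data points: (x[1], x[2], y) with y = x[1] xor x[2].
data = [(x1, x2, x1 != x2) for x1, x2 in product([False, True], repeat=2)]


def error(groups):
    """Classification error when each group predicts its majority label."""
    wrong = sum(len(g) - max(Counter(g).values()) for g in groups)
    return wrong / sum(len(g) for g in groups)


def group_by(rows, key):
    """Collect the y values of rows that share the same key."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row[2])
    return list(groups.values())


root = error(group_by(data, lambda r: None))                 # no split
split_x1 = error(group_by(data, lambda r: r[0]))             # split on x[1]
split_x2 = error(group_by(data, lambda r: r[1]))             # split on x[2]
split_both = error(group_by(data, lambda r: (r[0], r[1])))   # x[1], then x[2]
print(root, split_x1, split_x2, split_both)  # 0.5 0.5 0.5 0.0
```

Either single split looks useless on its own, yet the two splits together separate the data perfectly, which is exactly what a greedy, one-step-ahead stopping rule cannot see.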
Early stopping condition 2: Pros and Cons
Pros:
- A reasonable heuristic for early stopping to
avoid useless splits

Cons:
- Too short-sighted: We may miss out on good splits that occur right after useless splits


Tree pruning



Two approaches to picking simpler trees

1. Early Stopping: Stop the learning algorithm before the tree becomes too complex

2. Pruning: Simplify the tree after the learning algorithm terminates

Complements early stopping

Pruning: Intuition
Train a complex tree, simplify later

[Figure: a complex tree is simplified into a simpler tree.]


Pruning motivation
[Figure: classification error versus tree depth, from simple tree to complex tree, with curves for training error and true error.]

Don't stop too early. Simplify after tree is built.

Example 1: Which tree is simpler?

Tree 1:
Start
  Credit?
    excellent: Safe
    fair: Term?
      3 years: Risky
      5 years: Safe
    poor: Income?
      high: Term?
        3 years: Risky
        5 years: Safe
      low: Risky

OR

Tree 2 (SIMPLER):
Start
  Credit?
    excellent: Safe
    fair: Safe
    poor: Risky

Example 2: Which tree is simpler???

Tree 1:
Start
  Credit?
    excellent: Safe
    good: Safe
    fair: Safe
    bad: Risky
    poor: Risky

OR

Tree 2:
Start
  Term?
    3 years: Risky
    5 years: Safe


Simple measure of complexity of tree

L(T) = # of leaf nodes


Start
  Credit?
    excellent: Safe
    good: Safe
    fair: Safe
    bad: Risky
    poor: Risky
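
Counting leaves is a short recursion. The sketch below assumes a nested-dict tree encoding (a dict is a decision node, anything else is a leaf), which is an illustrative choice rather than anything prescribed by the slides; the leaf labels do not affect the count.

```python
def count_leaves(tree):
    """L(T): number of leaf nodes in a nested-dict tree."""
    if not isinstance(tree, dict):
        return 1
    return sum(count_leaves(child) for child in tree["children"].values())


# Hypothetical encodings of a five-way Credit? split and a two-way Term? split.
credit_tree = {"feature": "credit",
               "children": {"excellent": "Safe", "good": "Safe", "fair": "Safe",
                            "bad": "Risky", "poor": "Risky"}}
term_tree = {"feature": "term",
             "children": {"3 years": "Risky", "5 years": "Safe"}}

print(count_leaves(credit_tree), count_leaves(term_tree))  # 5 2
```
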


Which tree has lower L(T)?
L(T1) = 5, L(T2) = 2

T1:
Start
  Credit?
    excellent: Safe
    good: Safe
    fair: Safe
    bad: Risky
    poor: Risky

OR

T2 (SIMPLER):
Start
  Term?
    3 years: Risky
    5 years: Safe

Balance simplicity & predictive power
Too complex, risk of overfitting:
Start
  Credit?
    excellent: Safe
    fair: Term?
      3 years: Risky
      5 years: Safe
    poor: Income?
      high: Term?
        3 years: Risky
        5 years: Safe
      low: Risky

Too simple, high classification error:
Start
  Risky

Desired total quality format
Want to balance:
i. How well tree fits data
ii. Complexity of tree

Total cost = measure of fit + measure of complexity

- measure of fit (classification error): large # = bad fit to training data
- measure of complexity: large # = likely to overfit

Consider specific total cost

Total cost =
classification error + number of leaf nodes

Error(T) L(T)



Balancing fit and complexity

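Total cost C(T) = Error(T) + λ L(T)

λ: tuning parameter
If λ = 0: only Error(T) counts, so this is standard decision tree learning
If λ = ∞: the complexity penalty dominates, so the best tree is the root alone (predict the majority class)
If λ in between: balance of fit and complexity
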
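
To see how the tuning parameter shifts the choice, the sketch below scores three candidate trees at several values of λ. The (error, number of leaves) pairs are hypothetical values chosen for illustration, and a large finite λ stands in for λ = ∞.

```python
# Hypothetical (classification error, number of leaves) for three candidate trees.
candidates = {"root only": (0.45, 1), "medium": (0.26, 5), "deep": (0.25, 6)}


def best_tree(candidates, lam):
    """Name of the tree minimizing C(T) = Error(T) + lam * L(T)."""
    return min(candidates,
               key=lambda name: candidates[name][0] + lam * candidates[name][1])


for lam in (0.0, 0.03, 10.0):  # 10.0 plays the role of a very large lambda
    print(lam, best_tree(candidates, lam))
# 0.0 deep
# 0.03 medium
# 10.0 root only
```
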
Use total cost to simplify trees

[Figure: a complex tree is reduced to a simpler tree by total quality based pruning.]


Tree pruning algorithm



Pruning Intuition
Tree T:
Start
  Credit?
    excellent: Safe
    fair: Term?
      3 years: Risky
      5 years: Safe
    poor: Income?
      high: Term?
        3 years: Risky
        5 years: Safe
      low: Risky


Step 1: Consider a split

Tree T:
Start
  Credit?
    excellent: Safe
    fair: Term?
      3 years: Risky
      5 years: Safe
    poor: Income?
      high: Term?   <- candidate for pruning
        3 years: Risky
        5 years: Safe
      low: Risky

Step 2: Compute total cost C(T) of split

Tree T:
Start
  Credit?
    excellent: Safe
    fair: Term?
      3 years: Risky
      5 years: Safe
    poor: Income?
      high: Term?   <- candidate for pruning
        3 years: Risky
        5 years: Safe
      low: Risky

λ = 0.03
C(T) = Error(T) + λ L(T)

Tree   Error   #Leaves   Total
T      0.25

Step 3: Undo the splits on Tsmaller

Tree Tsmaller:
Start
  Credit?
    excellent: Safe
    fair: Term?
      3 years: Risky
      5 years: Safe
    poor: Income?
      high: Safe   <- replace split by leaf node?
      low: Risky

λ = 0.03
C(T) = Error(T) + λ L(T)

Tree       Error   #Leaves   Total
T          0.25    6         0.43
Tsmaller   0.26

Prune if total cost is lower: C(Tsmaller) < C(T)

Tree Tsmaller:
Start
  Credit?
    excellent: Safe
    fair: Term?
      3 years: Risky
      5 years: Safe
    poor: Income?
      high: Safe   <- replace split by leaf node? YES!
      low: Risky

λ = 0.03
C(T) = Error(T) + λ L(T)

Tree       Error   #Leaves   Total
T          0.25    6         0.43
Tsmaller   0.26    5         0.41

Worse training error but lower overall cost

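
The totals in the table can be checked directly. They are consistent with a tuning parameter of 0.03: 0.25 + 0.03 × 6 = 0.43 and 0.26 + 0.03 × 5 = 0.41. A short sketch of the arithmetic (the function name `total_cost` is an assumption for this example):

```python
def total_cost(error, num_leaves, lam):
    """C(T) = Error(T) + lam * L(T)."""
    return error + lam * num_leaves


lam = 0.03
cost_T = total_cost(0.25, 6, lam)
cost_T_smaller = total_cost(0.26, 5, lam)
print(round(cost_T, 2), round(cost_T_smaller, 2), cost_T_smaller < cost_T)
# 0.43 0.41 True
```

The pruned tree makes slightly more training mistakes, but the saving of one leaf outweighs that, so the split is pruned.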
Step 5: Repeat Steps 1-4 for every split

Decide if each split can be pruned

Start
  Credit?
    excellent: Safe
    fair: Term?
      3 years: Risky
      5 years: Safe
    poor: Income?
      high: Safe
      low: Risky


Decision tree pruning algorithm
Start at bottom of tree T and traverse up, apply prune_split to each decision node M

prune_split(T, M):
  1. Compute total cost of tree T using
     C(T) = Error(T) + λ L(T)
  2. Let Tsmaller be tree after pruning subtree below M
  3. Compute total cost complexity of Tsmaller
     C(Tsmaller) = Error(Tsmaller) + λ L(Tsmaller)
  4. If C(Tsmaller) < C(T), prune to Tsmaller
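
One way to sketch this procedure in Python is shown below. It is an illustration under assumptions, not the course's code: the tree is a nested dict, rows are dicts, and all names are chosen for the example. Instead of recomputing the cost of the whole tree at every node, it compares only the part of the cost that changes (the mistakes and leaves below M, with mistakes divided by the total number of data points), which gives the same decision because the rest of the tree contributes equally to C(T) and C(Tsmaller).

```python
from collections import Counter


def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree["children"][x[tree["feature"]]]
    return tree


def count_leaves(tree):
    if not isinstance(tree, dict):
        return 1
    return sum(count_leaves(c) for c in tree["children"].values())


def prune(node, rows, target, lam, n_total):
    """Bottom-up pruning; `rows` are the data points that reach `node`."""
    if not isinstance(node, dict) or not rows:
        return node
    # Visit the children first, so decisions are made from the bottom up.
    for value, child in list(node["children"].items()):
        subset = [r for r in rows if r[node["feature"]] == value]
        node["children"][value] = prune(child, subset, target, lam, n_total)
    labels = [r[target] for r in rows]
    leaf = Counter(labels).most_common(1)[0][0]
    # Contribution of this subtree to C(T), and of a single leaf to C(Tsmaller).
    cost_subtree = (sum(predict(node, r) != r[target] for r in rows) / n_total
                    + lam * count_leaves(node))
    cost_leaf = sum(y != leaf for y in labels) / n_total + lam * 1
    return leaf if cost_leaf < cost_subtree else node


# Hypothetical toy example: the split on "b" is useless and gets pruned.
tree = {"feature": "a",
        "children": {0: {"feature": "b", "children": {0: "N", 1: "N"}},
                     1: "P"}}
rows = [{"a": 0, "b": 0, "y": "N"}, {"a": 0, "b": 1, "y": "N"},
        {"a": 1, "b": 0, "y": "P"}, {"a": 1, "b": 1, "y": "P"}]
pruned = prune(tree, rows, "y", lam=0.03, n_total=len(rows))
print(pruned)  # {'feature': 'a', 'children': {0: 'N', 1: 'P'}}
```

The split on `a` survives because replacing it with a leaf would add two mistakes out of four, far more than the 0.03 saved by removing one leaf.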


Summary of overfitting in
decision trees



What you can do now
Identify overfitting in decision trees
Prevent overfitting with early stopping
- Limit tree depth
- Do not consider splits that do not reduce classification error
- Do not split intermediate nodes with only a few points
Prevent overfitting by pruning complex trees
- Use a total cost formula that balances classification error and tree complexity
- Use total cost to merge potentially complex trees into simpler ones


Thank you to Dr. Krishna Sridhar

Dr. Krishna Sridhar
Staff Data Scientist, Dato, Inc.