An Introduction to CART
Dan Steinberg
8880 Rio San Diego Dr., Suite 1045
San Diego, CA 92108
619.543.8880 - 619.543.8888 fax
http://www.salford-systems.com
dstein@salford-systems.com
An Introduction to CART - Salford Systems 1999
Slide 1
Goals of Tutorial
Provide overview of:
  Methodology
  Historical background
  Motivation for research
  New developments
Introduce software
Slide 2
Slide 3
Slide 4
Slide 5
Slide 6
Slide 7
Histology
Analytical Cellular Pathology
Annals of Mathematics and Artificial Intelligence
Annals of Statistics
Annals of the New York Academy of Sciences
Applied Artificial Intelligence
Applied Statistics
Archives of Psychiatric Nursing
Arthritis and Rheumatism
Artificial Intelligence in Medicine
Automatica
Biometrics
Bone Joint Surgery
Brain Topography
British Journal of Haematology
Cancer
Cancer Treatment Reports
Cardiology
Circulation
Cognitive Psychology
Computer Journal
Computer Speech and Language
Computers in Biology and Medicine
Computers and Biomedical Research
Controlled Clinical Trials
Death Studies
Economics Letters
Europhysics Letters
Family Practice
Financial Analysts Journal
Financial Management
Health Physics
IEEE Transactions on Information Theory
IEEE Transactions on Neural Networks
IEEE Transactions on Pattern Analysis and Machine Intelligence
International Journal of Forecasting
International Journal of Man-Machine Studies
International Neural Network Conference
Investigative Radiology
Journal of Clinical Oncology
Journal of Biomedical Engineering
Journal of Cancer Research and Clinical Oncology
Slide 8
Dermatology
Journal of the American Statistical Association
Journal of the American Veterinary Medical Association
Journal of the Operational Research Society
Journal of Toxicology and Environmental Health
Knowledge Acquisition
Machine Learning
Management Science
Marketing Research: A Magazine of Management & Applications
Medical Care
Medical Decision Making
Medical Information
Methods of Information in Medicine
Psychiatry
Quarterly Journal of Business & Economics
Revista Medica de Chile
Scandinavian Journal of Gastroenterology
Science
Statistics in Medicine
Statistical Science
Statistics and Computing
Technometrics
The American Statistician
The Annals of Statistics
The Journal of Pediatrics
The Journal of Behavioral Economics
The Journal of Finance
The Journal of Industrial Economics
The Journal of Rheumatology
The Journal of Risk and Insurance
The Journal of Wildlife Management
The New England Journal of Medicine
Therapie
Ultrasonic Imaging
World Journal of Surgery
Slide 9
Slide 10
Slide 11
Slide 12
Slide 13
So what is CART?
Best illustrated with a famous example: the UCSD Heart Disease study
Given: the diagnosis of a heart attack based on
  Chest pain
  Indicative EKGs
  Elevation of enzymes typically released by damaged heart muscle
Slide 14
PATIENTS = 215 (SURVIVE 178, 82.8%; DEAD 37, 17.2%)
Is BP <= 91?
  Yes -> Terminal Node A (SURVIVE 6, 30.0%; DEAD 14, 70.0%): NODE = DEAD
  No  -> PATIENTS = 195 (SURVIVE 172, 88.2%; DEAD 23, 11.8%)
         Is AGE <= 62.5?
           Yes -> Terminal Node B: NODE = SURVIVE
           No  -> PATIENTS = 91 (SURVIVE 70, 76.9%; DEAD 21, 23.1%)
                  Is SINUS <= .5?
                    Yes -> Terminal Node C (SURVIVE 14, 50.0%; DEAD 14, 50.0%): NODE = DEAD
                    No  -> Terminal Node D (SURVIVE 56, 88.9%; DEAD 7, 11.1%): NODE = SURVIVE
Slide 15
Slide 16
Standard split:
If the answer to the question is YES, a case goes left; otherwise it goes right
This is the form of all primary splits
Slide 17
Slide 18
Classification is determined by following a case's path down the tree
Terminal nodes are associated with a single class
Any case arriving at a terminal node is assigned to that class
Slide 19
Accuracy of a Tree
For classification trees
All objects reaching a terminal node are classified in the same way
e.g. All heart attack victims with BP > 91 and AGE less than 62.5 are classified as SURVIVORS (a runnable sketch of such a tree follows below)
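A minimal sketch of growing such a tree with scikit-learn's DecisionTreeClassifier, which implements CART-style binary splitting. The BP/AGE/SINUS names follow the slides, but the data below is synthetic, not the UCSD data:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 215
X = np.column_stack([
    rng.normal(110, 25, n),    # BP: minimum systolic blood pressure
    rng.integers(40, 85, n),   # AGE
    rng.integers(0, 2, n),     # SINUS: sinus tachycardia indicator (0/1)
])
# Synthetic outcome loosely mimicking the slide's risk pattern
y = ((X[:, 0] <= 91) | ((X[:, 1] > 62.5) & (X[:, 2] == 1))).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["BP", "AGE", "SINUS"]))

Every case reaching a leaf is assigned that leaf's class, exactly as described above.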
Slide 20
Class     Classified SURVIVE   Classified DEAD   Total   % Correct
SURVIVE   158                  20                178     88%
DEAD      9                    28                37      75%
Total     167                  48                215     86.51% correct overall
Slide 21
Slide 22
Slide 23
Core Questions
How is a CART tree generated?
What assurances do we have that it will give correct answers?
Is CART better than other decision tree tools?
Slide 24
Slide 25
[Plot legend: data point; summary point]
Use the median or mean of the Y values in each region to relate Y to X
No attempt to fit a function
Each region gets its own fitted Y value, a constant
Straightforward in two or three dimensions
Slide 26
Problems with
Non-parametric Approaches
Curse of dimensionality
Slide 27
Slide 28
Iris Versicolor
Iris Virginica
Slide 29
PETALLEN & PETALWID variables used to produce a tree
Key: = Species 1; = Species 2; = Species 3
Slide 30
Slide 31
Each partition is
analyzed separately
An Introduction to CART l- Salford Systems 1999
Slide 32
Tree Pruning
Optimal Tree Selection
We will first conduct a brief tour of the entire process
Slide 33
Splitting Rules
One key to generating a CART tree is asking the right question
Question always in the form:
Is VARIABLE <= VALUE?
Slide 34
Consider first the variable AGE; in our data set it has minimum value 40
The rule
"Is AGE <= 40?" will separate out these cases, the youngest persons, to the left
Slide 35
Split Tables

Sorted by AGE:
AGE   BP    SINUST   OUTCOME
40    91    0        SURVIVE
40    110   0        SURVIVE
40    83    1        DEAD
43    99    0        SURVIVE
43    78    1        DEAD
43    135   0        SURVIVE
45    120   0        SURVIVE
48    119   1        DEAD
48    122   0        SURVIVE
49    150   0        DEAD
49    110   1        SURVIVE

Sorted by BP:
BP    SINUST   OUTCOME
78    1        DEAD
83    1        DEAD
91    0        SURVIVE
99    0        SURVIVE
110   0        SURVIVE
110   1        SURVIVE
119   1        DEAD
120   0        SURVIVE
122   0        SURVIVE
135   0        SURVIVE
150   0        DEAD
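A sketch of the scan CART performs over such a split table: for each predictor, every midpoint between consecutive distinct values is tried as a cutpoint and scored by its Gini improvement. The data below are the 11 cases from the tables above (DEAD coded 1):

def gini(labels):
    # Two-class Gini impurity: 1 - p^2 - (1 - p)^2
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) ** 2

def best_split(xs, labels):
    parent = gini(labels)
    values = sorted(set(xs))
    best_c, best_gain = None, 0.0
    for lo, hi in zip(values, values[1:]):
        c = (lo + hi) / 2.0  # candidate cutpoint: "Is x <= c?"
        left = [y for x, y in zip(xs, labels) if x <= c]
        right = [y for x, y in zip(xs, labels) if x > c]
        pL, pR = len(left) / len(labels), len(right) / len(labels)
        gain = parent - pL * gini(left) - pR * gini(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

AGE = [40, 40, 40, 43, 43, 43, 45, 48, 48, 49, 49]
BP = [91, 110, 83, 99, 78, 135, 120, 119, 122, 150, 110]
DEAD = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0]
print("AGE:", best_split(AGE, DEAD))
print("BP: ", best_split(BP, DEAD))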
Slide 36
Slide 37
Categorical Predictors
CART considers all possible splits based on a categorical predictor
Example: four regions - A, B, C, D

Left     Right
A        B, C, D
B        A, C, D
C        A, B, D
D        A, B, C
A, B     C, D
A, C     B, D
A, D     B, C
Slide 38
Categorical Predictors
e.g. If RACE has 5 levels then there are 2^(5-1) = 16 race groupings; all are searched as if they were separate variables (the enumeration is sketched below)
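A sketch of the enumeration for a categorical predictor: every way of sending a nonempty subset of levels left (with its complement right) is a candidate split, and fixing one level on the left avoids counting each partition twice. With k = 4 levels this yields 2^(4-1) - 1 = 7 distinct splits, matching the table above:

from itertools import combinations

levels = ["A", "B", "C", "D"]
splits = []
for r in range(1, len(levels)):
    for left in combinations(levels, r):
        if levels[0] in left:  # keep one of each mirror-image pair
            right = tuple(l for l in levels if l not in left)
            splits.append((left, right))
for left, right in splits:
    print(left, "|", right)
print(len(splits), "splits")  # 7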
Slide 39
Slide 40
Detailed Example:
GYM Cluster Model
Problem: needed to understand a market
research clustering scheme
Clusters were created using 18 variables
and conventional clustering software
Wanted simple rules to describe cluster
membership
Experimented with CART tree to see if
there was an intuitive story
Slide 41
Identification # of member
Cluster assigned from clustering scheme (10 level categorical coded 1-10)
Racquet ball usage (binary indicator coded 0, 1)
Number of on-peak racquet ball uses
Pool usage (binary indicator coded 0, 1)
Number of on-peak pool uses
Percent of pool and racquet ball usage
Total number of pool and racquet ball uses
Number of on-peak aerobics classes attended
Number of off-peak aerobics classes attended
Difference between number of on- and off-peak aerobics visits
Number of visits to tanning salon
Personal trainer (binary indicator coded 0, 1)
Number of classes taken
Number of supplements/vitamins/frozen dinners purchased
Small business discount (binary indicator coded 0, 1)
Terms of offer
Index variable for package price
Monthly fee paid
Fitness score
Number of family members
Home ownership (binary indicator coded 0, 1)
Slide 42
Slide 43
Slide 44
===========================
CLASSIFICATION TREE DIAGRAM
===========================
[Tree diagram: internal nodes 1-10; terminal regions 1-11]
Slide 45
Slide 46
Surrogate          Split        Assoc.     Improve.
1  PERSTRN  s      0.500000     0.914703   0.127490
2  FIT      s      6.353424     0.523818   0.110693
3  NFAMMEM  s      4.500000     0.206078   0.030952
4  TANNING  s      2.500000     0.029640   0.007039
5  NSUPPS   s     30.500000     0.029236   0.003991
6  CLASSES  s      1.500000     0.011499   0.006856

Competitor         Split        Improve.
1  PERSTRN         0.500000     0.127490
2  FIT             3.107304     0.111190
3  ANYRAQT         0.500000     0.102227
4  PLRQTPCT        0.031280     0.090114
5  TPLRCT          0.500000     0.090085
6  ONRCT           0.500000     0.077620
7  NFAMMEM         2.500000     0.055136
8  CLASSES         0.500000     0.045368
9  IPAKPRIC        7.632159     0.033419
10 OFFAER          0.500000     0.022993
Slide 47
Node report: N = 2363 and N = 233; 0.055145; class = 6; cost = 0.041550; Parent C.T. = 0.205059 (0.077476); 0.033113

Class   N = 2363: count, prop.   N = 233: count, prop.
1       307    0.131271          8     0.031321
2       0      0.000000          0     0.000000
3       0      0.000000          0     0.000000
4       22     0.009236          0     0.000000
5       0      0.000000          0     0.000000
6       1901   0.794941          199   0.922524
7       79     0.036572          0     0.000000
8       0      0.000000          0     0.000000
9       0      0.000000          0     0.000000
10      54     0.027979          26    0.046155
Slide 48
Overall: 15.86%
Slide 49
Slide 50
Slide 51
Note that this twoing split yields a MUCH more balanced result, both in the size of the children and in the number of classes going left and right.
Slide 52
PATIENTS = 215 (SURVIVE 178, 82.8%; DEAD 37, 17.2%)
Is X5 <= 10?
  Yes -> Terminal Node A: SURVIVE 178 (100.0%), DEAD 0 (0.0%)
  No  -> Terminal Node B: SURVIVE 0 (0.0%), DEAD 37 (100.0%)
Slide 53
PATIENTS = 215 (SURVIVE 178, 82.8%; DEAD 37, 17.2%)
Is AGE <= 55?
To assess, we may want to compare this to another split
  Yes -> Terminal Node A: SURVIVE 62 (73.0%), DEAD 23 (27.0%)
  No  -> Terminal Node B: SURVIVE 104 (80.0%), DEAD 26 (20.0%)
Slide 54
PATIENTS = 215 (SURVIVE 178, 82.8%; DEAD 37, 17.2%)
Is AGE <= 55?
  Yes (40% of cases): .27 DEAD, .73 SURVIVE -> Terminal Node A: SURVIVE 73.0%, DEAD 27.0%
  No  (60% of cases): .20 DEAD, .80 SURVIVE -> Terminal Node B: SURVIVE 80.0%, DEAD 20.0%

PATIENTS = 215 (SURVIVE 178, 82.8%; DEAD 37, 17.2%)
Is BP <= 102?
  Yes (32% of cases): .25 DEAD, .75 SURVIVE -> Terminal Node A: SURVIVE 75.0%, DEAD 25.0%
  No  (68% of cases): .13 DEAD, .87 SURVIVE -> Terminal Node B: SURVIVE 87.0%, DEAD 13.0%
Slide 55
Slide 56
[Figure: impurity of the node t plotted against p(1|t), rising from 0 at p = 0 and p = 1 to its maximum at p = .5]
Slide 57
Slide 58
Slide 59
Why impurity?
Why not predictive accuracy?
(1) Can almost always improve purity
Often unable to improve accuracy for a specific node
As we will see later: if the parent and both children are all assigned to the same class, then the split does nothing for accuracy
Slide 60
Example of Predictive Accuracy Splitting Rule
Split takes error rate of 400 in root down to 200 overall
CUSTOMERS = 800 (Buyer 400, 50.0%; No-Buyer 400, 50.0%)
  Terminal Node A: Buyer 300 (75.0%), No-Buyer 100 (25.0%)
  Terminal Node B: Buyer 100 (25.0%), No-Buyer 300 (75.0%)
Slide 61
CUSTOMERS = 800 (Buyer 400, 50.0%; No-Buyer 400, 50.0%)
  Terminal Node A: Buyer 200 (33.3%), No-Buyer 400 (66.6%)
  Terminal Node B: Buyer 200 (100.0%), No-Buyer 0 (0.0%)
Slide 62
Split #1:  Left: Buyer 300, No-Buyer 100  |  Right: Buyer 100, No-Buyer 300
Split #2:  Left: Buyer 200, No-Buyer 400  |  Right: Buyer 200, No-Buyer 0
Slide 63
Slide 64
Root: Responder 100 (10%), Non-Responder 900 (90%)
  Left: Responder 60 (13%), Non-Responder 400 (87%)
  Right: Responder 40 (7%), Non-Responder 500 (93%)
Slide 65
Decrease in impurity: i(parent) - probL * impurityL - probR * impurityR
Slide 66
Slide 67
Modified Twoing
Other rules available: Ordered Twoing, Symmetric Gini
Slide 68
Node B: class probabilities (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
Slide 69
Improvement = 5/6 - pL(4/6) - pR(4/6) = 1/6
Slide 70
Slide 71
Twoing criterion:
Delta i(s,t) = (pL * pR / 4) * [ Sum_j | p(j|tL) - p(j|tR) | ]^2
The best twoing split maximizes Delta i(s,t)
The optimal superclass C1* is given by: C1* = { j : p(j|tL) >= p(j|tR) }
Slide 72
Twoing Interpreted
(pL * pR / 4) * [ Sum_j | p(j|tL) - p(j|tR) | ]^2
Slide 73
Over the k classes: (pL * pR) * Sum_j | p(j|tL) - p(j|tR) | (sketched in code below)
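A minimal sketch of the twoing criterion as reconstructed above; p_left and p_right are the class-probability vectors in the two children, and pL, pR are the shares of cases sent left and right:

def twoing(pL, pR, p_left, p_right):
    # (pL*pR/4) * [ sum_j |p(j|tL) - p(j|tR)| ]^2
    spread = sum(abs(a - b) for a, b in zip(p_left, p_right))
    return (pL * pR / 4.0) * spread ** 2

# Even split sending classes {1,2,3} left and {4,5,6} right:
p_left = [1/3, 1/3, 1/3, 0, 0, 0]
p_right = [0, 0, 0, 1/3, 1/3, 1/3]
print(twoing(0.5, 0.5, p_left, p_right))  # (0.25/4) * 2**2 = 0.25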
Slide 74
Slide 75
GINI or Twoing: Which splitting criterion to use?
The monograph suggests Gini is usually better
Will be problem dependent
When end-cut splits (uneven sizes) need to be avoided, use twoing
  but one can also just set POWER>0
Slide 76
Slide 77
Parent node class counts: A 40, B 30, C 20, D 10

Split (i):   Left: A 40, B 0, C 0, D 10   |   Right: A 0, B 30, C 20, D 0
Split (ii):  Left: A 40, B 0, C 0, D 0    |   Right: A 0, B 30, C 20, D 10
Slide 78
Prior Probabilities
An essential component of any CART analysis
A prior probability distribution for the classes is needed to do proper splitting, since prior probabilities are used to calculate the p_i in the Gini formula
Example: Suppose the data contain
  99% type A (nonresponders)
  1% type B (responders)
Slide 79
Root: A 900, B 100
  Left: A 300, B 10 (1/3 of all A, 1/10 of all B) -> Class as A
  Right: A 600, B 90 (2/3 of all A, 9/10 of all B) -> Class as B
Slide 80
Slide 81
Slide 82
Classify node t as class 1 if:
pi(1) * N1(t) / N1 > pi(2) * N2(t) / N2
Under data priors, pi(1) / pi(2) = N1 / N2, and the rule reduces to N1(t) > N2(t) (see the sketch below)
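A sketch of the priors-adjusted assignment rule above, applied to the node counts from the A/B example; note how the right child flips from class B under equal priors to class A under data priors:

def assign(pi1, pi2, n1_t, n2_t, N1, N2):
    # Class 1 wins when pi(1)*N1(t)/N1 exceeds pi(2)*N2(t)/N2
    return 1 if pi1 * n1_t / N1 > pi2 * n2_t / N2 else 2

# Right child of the example tree: A 600 of 900, B 90 of 100
print(assign(0.5, 0.5, 600, 90, 900, 100))  # equal priors -> 2 (class B)
print(assign(0.9, 0.1, 600, 90, 900, 100))  # data priors  -> 1 (class A)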
Slide 83
Root: A 900, B 100
  Left: A 300, B 10
  Right: A 600, B 90
Slide 84
Example: 4 segments
The root node will get 1/4 correct, 3/4 wrong
Below the root, classification will be accomplished by weighting by priors
Slide 85
Slide 86
PRIORS EQUAL
PRIORS DATA
PRIORS MIX
PRIORS = n1,n2
PRIORS = 2,1
PRIORS = .67, .33
Slide 87
p(i,t) = pi(i) * Ni(t) / Ni
p(t) = Sum_j pi(j) * Nj(t) / Nj
With data priors pi(i) = Ni / N:
p(t) = Sum_i (Ni / N) * (Ni(t) / Ni) = N(t) / N
Slide 88
Using Priors to Reflect Importance
Putting a larger prior on a class will tend to decrease its misclassification rate
Hence priors can be used to shape an analysis
Example:
MISCLASSIFICATION RATES BY CLASS FOR ALTERNATIVE PRIORS: Number Misclassified (N) and Probability of Misclassification (%)

Class   Sample Size
1       12
2       49
3       139
Total   200

Misclassified under alternative priors: N = 96, 109, 108, 115; % = 50.0, 30.6, 67.6
Slide 89
Slide 90
Costs of Misclassification
Some errors are more serious than others
May want to up-weight the error of misclassifying malignant tumors
Slide 91
Slide 92
                   Classified as
                   non-buyer   buyer
Truth  non-buyer
Truth  buyer
Slide 93
Specifying Costs
EXPLICIT COST COMMANDS
MISCLASS COST=3 CLASS 2 AS 1
Specifying in GUI:
Slide 94
Cost Matrix
Can specify ANY cost matrix by specifying each cell separately
Shorthand forms of the command:
Slide 95
[Example cost matrix over classes A, B, C with off-diagonal entries 1, 2, 3, 4]
Slide 96
Classify node t as class 1 if:
C(2|1) * pi(1) * N1(t) / N1 > C(1|2) * pi(2) * N2(t) / N2
In general, prefer class i over class j if:
C(j|i) * pi(i) * Ni(t) / Ni > C(i|j) * pi(j) * Nj(t) / Nj
(a sketch follows below)
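The same rule with costs folded in, as a sketch; C(2|1) is the cost of calling a true class-1 case class 2, so raising it pushes more nodes toward a class-1 label:

def assign_with_costs(c21, c12, pi1, pi2, n1_t, n2_t, N1, N2):
    # Assign class 1 when mislabeling its cases would cost more
    # than mislabeling the class-2 cases present in the node.
    return 1 if c21 * pi1 * n1_t / N1 > c12 * pi2 * n2_t / N2 else 2

print(assign_with_costs(1, 1, 0.5, 0.5, 600, 90, 900, 100))  # unit costs -> 2
print(assign_with_costs(3, 1, 0.5, 0.5, 600, 90, 900, 100))  # C(2|1)=3  -> 1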
Slide 97
Example: Classifying Segments Without Costs
Table Without Costs (MDL5A.DAT)

                            PREDICTED CLASS
ACTUAL CLASS      1      2      3      4      5      6   ACTUAL TOTAL
1             45.00   1.00   1.00   3.00   0.00   1.00   51.00
2              4.00  32.00   1.00   7.00   8.00  10.00   62.00
3              9.00  20.00   3.00  10.00   6.00   6.00   54.00
4              4.00  11.00   1.00   2.00   2.00   2.00   22.00
5              6.00   3.00   0.00   0.00  29.00   1.00   39.00
6              2.00   1.00   0.00   2.00   3.00  15.00   23.00
PRED. TOT.    70.00  68.00   6.00  24.00  48.00  35.00  251.00
CORRECT        0.88   0.52   0.06   0.09   0.74   0.65
SUCCESS IND.   0.68   0.27  -0.16   0.00   0.59   0.56
TOT. CORRECT   0.50
Slide 98
Example: Classifying Segments With Costs
Table With Costs (MDL5EO.DAT)

                            PREDICTED CLASS
ACTUAL CLASS      1      2      3      4      5      6   ACTUAL TOTAL
1             45.00   4.00   2.00   0.00   0.00   0.00   51.00
2              3.00  33.00  10.00   3.00   6.00   7.00   62.00
3              0.00  15.00  27.00   3.00   5.00   4.00   54.00
4              4.00  10.00   3.00   3.00   2.00   0.00   22.00
5              7.00  12.00   0.00   0.00  19.00   1.00   39.00
6              1.00   9.00   4.00   1.00   1.00   7.00   23.00
PRED. TOT.    60.00  83.00  46.00  10.00  33.00  19.00  251.00
CORRECT        0.88   0.53   0.50   0.14   0.49   0.30
SUCCESS IND.   0.68   0.29   0.29   0.05   0.33   0.21
TOT. CORRECT   0.53
Slide 99
Comparison: Classification With and Without Costs

                          PREDICTED CLASS
% CORRECT           1     2     3     4     5     6
Without Costs    0.88  0.52  0.06  0.09  0.74  0.65
With Costs       0.88  0.53  0.50  0.14  0.49  0.30
Slide 100
Gini: i(t) = 1 - Sum_i p(i|t)^2
With costs: i(t) = Sum_{i != j} [ C(i|j) + C(j|i) ] * p(i|t) * p(j|t)
Rule uses symmetrized costs
Slide 101
Slide 102
Slide 103
Surrogates: Mimicking Alternatives to Primary Splitters
A primary splitter is the best splitter of a node
A surrogate is a splitter that splits in a fashion similar to the primary
Surrogate: a variable with possibly equivalent information
Why useful?
  Reveals structure of the information in the variables
  If the primary is expensive or difficult to gather and the surrogate is not,
  then consider using the surrogate instead; the loss in predictive accuracy might be slight
Slide 104
Association
How well can another variable mimic the primary splitter?
Predictive association between the primary and surrogate splitters
Consider another variable and allow reverse splits
  A reverse split sends cases to the RIGHT if the condition is met
  The standard split (primary) always sends cases to the LEFT
e.g. if the primary sends 400 cases left and 100 cases right with PRIORS DATA, then pL = .8 and pR = .2
If the default sends all cases left, then 400 cases are matched and 100 mismatched
The default mismatch rate is .2
A credible surrogate must yield a mismatch rate LESS THAN the default
Slide 105
Evaluating Surrogates
Matches are evaluated on a case-by-case basis
If the surrogate matches on 85% of cases, then its error rate is 15%
Association = (default mismatch rate - surrogate mismatch rate) / default mismatch rate (a sketch follows below)
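A sketch of that association computation: the default rule sends every case with the larger child (mismatch rate min(pL, pR)), and the surrogate is credited with the share of that default error it removes:

def association(p_left, surrogate_mismatch):
    # Fraction of the default mismatch rate removed by the surrogate
    default = min(p_left, 1.0 - p_left)
    return (default - surrogate_mismatch) / default

# Primary sends 80% of cases left; surrogate disagrees with it on 15%:
print(association(0.80, 0.15))  # (0.20 - 0.15) / 0.20 = 0.25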
Slide 106
(100 cases per class)
                     Left   Right
Primary Split
  Class A             90      10
  Class B             80      20
  Class C             15      85
Competitor Split
  Class A             80      20
  Class B             25      75
  Class C             14      86
Surrogate Split
  Class A             78      22
  Class B             74      26
  Class C             21      79
Slide 107
Surrogate          Split        Assoc.     Improve.
1  PERSTRN  s      0.500000     0.914703   0.127490
2  FIT      s      6.353424     0.523818   0.110693
3  NFAMMEM  s      4.500000     0.206078   0.030952
4  TANNING  s      2.500000     0.029640   0.007039
5  NSUPPS   s     30.500000     0.029236   0.003991
6  CLASSES  s      1.500000     0.011499   0.006856

Competitor         Split        Improve.
1  PERSTRN         0.500000     0.127490
2  FIT             3.107304     0.111190
3  ANYRAQT         0.500000     0.102227
4  PLRQTPCT        0.031280     0.090114
5  TPLRCT          0.500000     0.090085
6  ONRCT           0.500000     0.077620
7  NFAMMEM         2.500000     0.055136
8  CLASSES         0.500000     0.045368
9  IPAKPRIC        7.632159     0.033419
10 OFFAER          0.500000     0.022993
Slide 108
Classification Is Accomplished by the Entire Tree, Not Just One Node
Even if a case goes the wrong way at a node (say the surrogate is imperfect), it is not necessarily a problem
The case may get correctly classified at the next node or further down the tree
CART trees will frequently contain self-correcting information
Slide 109
By default, a splitter qualifies as a surrogate ONLY if its association value is > 0
Note surrogates are ranked in order of association
A relatively weak surrogate might have better IMPROVEMENT than the best surrogate
A variable can be both a good competitor and a good surrogate
Competitor split values might be different from surrogate split values
Slide 110
Slide 111
Variable Importance
How can we assess the relative importance of the variables in our data?
The issue is a measure of the splitting potential of all variables
CART includes a general measure with very interesting characteristics
The most important variables might not ever be used to split a node!
Importance is related to both potential and actual splitting behavior
If one variable, say EDUCATION, is masked by another, say INCOME, then...
  Both variables have SIMILAR splitting information
  In any one node only one variable will be best, but the other may be a close 2nd best
If INCOME appears just once as a primary splitter and EDUCATION is never a primary splitter
  But EDUCATION appears as a strong surrogate in several different nodes
  Then EDUCATION may have more potential splitting power
  Means that if you had to live with just one of the variables, EDUCATION would most likely be the better performer (a sketch of such masking follows below)
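A sketch of masking with scikit-learn. Its feature_importances_ counts only primary-split improvements (it computes no surrogates), so a nearly equivalent variable can score close to zero even though a surrogate-based importance like CART's would rate it highly:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
income = rng.normal(50, 10, 500)
education = 0.9 * income + rng.normal(0, 1, 500)  # nearly collinear with income
y = (income > 50).astype(int)
X = np.column_stack([income, education])

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(dict(zip(["INCOME", "EDUCATION"], tree.feature_importances_)))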
Slide 112
Slide 113
Suppose you grow a large exploratory tree and review importances
Then find an optimal tree via test set or CV, yielding a smaller tree
The optimal tree is the SAME as the exploratory tree in the top nodes,
YET importances might be quite different.
WHY? Because the larger tree uses more nodes, including the deepest nodes, to compute the importance
When comparing results, be sure to compare similar or same-sized trees
Slide 114
Scaled importances:
Larger exploratory tree   Smaller optimal tree
100.000                   100.000
93.804                    73.055
78.773                    71.180
68.529                    32.635
66.770                    28.914
65.695                    18.787
61.355                    12.268
37.319                    5.699
32.853                    0.000
32.767                    0.000
Slide 115
Modifying Importance: Down-Weighting Weak Surrogates
Each surrogate gets full credit for any improvement it might yield on a split
Credit accrues in proportion to improvement, irrespective of the degree of association
May want to down-weight surrogates
Slide 116
Competitor Splits
CART searches for the BEST splitter at every node
To accomplish this it first finds the best split for a specific variable
Then it repeats this search over all other variables
Results in a split quality measure for EVERY variable
The variable with the highest score is the PRIMARY SPLITTER
The other variables are the COMPETITORS
Can see as many COMPETITORS as you want; they have all been computed
BOPTIONS COMPETITORS=5 is the default for the number to PRINT
To see scores for every variable, try BOPTIONS COMPETITORS=250
Slide 117
Printing Many Competitors

Node Competitor   Improve.
1  G1       0.040
2  SU       0.040
3  AG6      0.037
4  AG5      0.035
5  SBHI7    0.035
6  AH       0.034
7  AI       0.033
8  SBHI8    0.031
9  HISB     0.031
10 N5       0.031
11 G12      0.029
12 SO       0.029
13 SBHI4    0.029
14 SBHI5    0.029
15 G7       0.027
16 MIRA8    0.027
17 ABHI8    0.026
18 G4       0.026
19 SBLO5    0.025
20 AC       0.025
21 N1       0.025
22 BBDA7    0.025
23 SBHI6    0.025
24 AL       0.025
25 SC       0.025
26 ABHI5    0.024
Slide 118
If the functional relationship is y = X + error, CART will generate a sequence of splits of the form
Is X < c1?
Is X < c2?
Is X < c3?
etc. (a regression-tree sketch follows below)
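A sketch of that staircase behavior with a scikit-learn regression tree fit to a noiseless linear target; the fitted model is a step function and the split points c1, c2, c3 can be read off the tree:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
y = X.ravel()  # exactly linear target y = x

tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, y)
# Internal nodes carry real thresholds; leaves are stored as -2
print(sorted(t for t in tree.tree_.threshold if t != -2))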
Slide 119
Linear combination splits (coefficients normalized) are searched using a stepwise algorithm; the global best may not be found
Backward deletion is used: start with all numeric variables and back down
May want to limit the search to larger nodes
  LINEAR N=500 prevents the search when the number of cases in a node < 500
Benefit to favoring a small number of variables in combinations
  Easier to interpret and assess
  Does the combination make any empirical sense?
Slide 120
Slide 121
Slide 122
LINEAR N = 12347
when the learn sample is 12347 will prevent linear
combination splits below the root
Slide 123
Slide 124
Tree Pruning
Optimal Tree Selection
Slide 125
Slide 126
Slide 127
Slide 128
Slide 129
Tree Pruning
Take some large tree, such as the maximal tree
The tree may be radically overfit
  It tracks all the idiosyncrasies of THIS data set
  It tracks patterns that may not be found in other data sets
Analogous to a regression with a very large number of variables
Slide 130
Cost-Complexity Pruning
Begin with a large enough tree, one that is larger than the truth
With no processing constraints, grow until there is 1 case in each terminal node
Slide 131
Cost-Complexity Pruning
R_alpha(T) = R(T) + alpha * |T|, where |T| is the number of terminal nodes
Slide 132
As a tree grows, resubstitution error R(T) falls steadily while the test-sample estimate Rts(T) eventually increases
Much previous disenchantment with tree methods was due to reliance on the learn-sample R(T)
Can lead to disasters when applied to new data

No. Terminal Nodes   R(T)   Rts(T)
 71                  .00    .42
 63                  .00    .40
 58                  .03    .39
 40                  .10    .32
 34                  .12    .32
 19                  .20    .31
*10                  .29    .30
  9                  .32    .34
  7                  .41    .47
  6                  .46    .54
  5                  .53    .61
  2                  .75    .82
  1                  .86    .91
Slide 133
Slide 134
Slide 135
Pruning
Start with the complexity penalty set to 0
The best resubstitution-cost tree is then the maximal tree (the biggest tree has the smallest error)
Start increasing the complexity penalty
At some point the penalty is big enough to warrant choosing a smaller tree, since the penalty on nodes exceeds the gain in accuracy
Example: pruning our CLUSTER tree from 6 to 5 terminal nodes (a scikit-learn sketch follows below)
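A sketch of the same idea with scikit-learn's cost-complexity pruning: as the penalty alpha grows, weakest links are pruned and the tree sequence shrinks (the breast-cancer data set is used only as a stand-in):

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas[::5]:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print("alpha=%.5f  terminal nodes=%d" % (alpha, pruned.get_n_leaves()))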
Slide 136
Cost of the 6-terminal-node tree: 0.388242 * 0.807546 + 6 * alpha
Cost of the 5-terminal-node tree: 0.461266 * 0.807546 + 5 * alpha
Tree 15 - Tree 14: the two costs are equal,
(0.461266 * 0.807546 + 5 * alpha) - (0.388242 * 0.807546 + 6 * alpha) = 0, when
alpha = (0.461266 - 0.388242) * 0.807546 = 0.059, where alpha = complexity penalty
Slide 137
Cases   Class   Cost
13161   9       0.668535
10204   9       0.581630
 2957   2       0.102462
Slide 138
Order of Pruning
Prune away the "weakest link": the nodes that add least to overall accuracy
If several nodes have the same complexity threshold, they all prune away simultaneously
Hence more than two terminal nodes could be cut off in one pruning
Often happens in larger trees; less likely as the tree gets smaller
Slide 139
Slide 140
Slide 141
[Figure: estimated error R^(T_k) plotted against tree size |T~_k|, 0 to 50 terminal nodes]
Slide 142
Slide 143
The selected tree should exhibit very similar accuracy when applied to new data
This is the key to the CART selection procedure: applicability to external data
BUT the tree is NOT necessarily unique; other tree structures may be as good
Other tree structures might be found by
  Using other learning data
  Using other growing rules
  Running CV with a different random number seed
Some variability of results needs to be expected
The overall story should be similar for good-sized data sets
Slide 144
Cross-Validation
Cross-validation is a recent computer-intensive development in statistics
Purpose is to protect oneself from overfitting errors
  Don't want to capitalize on chance or track idiosyncrasies of this data
  Idiosyncrasies which will NOT be observed on fresh data
Ideally we would like a large test data set to evaluate trees, N > 5000
Practically, some studies don't have sufficient data to spare for testing
Cross-validation will use the SAME data for learning and for testing
Slide 145
10-Fold Cross-Validation:
The Industry Standard
Slide 146
Cross-Validation Procedure
Divide the data into 10 equal parts; each run uses 9 parts for learning and holds out 1 part for test:
Run 1: parts 1-9 Learn, part 10 Test
Run 2: parts 1-8 and 10 Learn, part 9 Test
ETC...
Run 10: parts 2-10 Learn, part 1 Test
Slide 147
Cross-Validation Details
Every observation is used as a test case exactly once
In 10-fold CV each observation is used as a learning case 9 times
In 10-fold CV 10 auxiliary CART trees are grown in addition to the initial tree
When all 10 CV trees are done, the error rates are cumulated (summed)
Summing of error rates is done by TREE complexity
Summed error (cost) rates are then attributed to the INITIAL tree
Observe that no two of these trees need be the same
Even the primary splitter of the root node may differ across CV trees
Results are subject to random fluctuations, sometimes severe (a sketch follows below)
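A sketch of the mechanics with scikit-learn: ten folds, each case a test case exactly once and a learning case nine times, with the fold results pooled into one honest error estimate (the iris data is used only as a stand-in):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("10-fold CV error rate: %.3f" % (1.0 - scores.mean()))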
Slide 148
Slide 149
CART produces a summary of the CV results for each tree at the end of the output
The table titled "CV TREE COMPETITOR LISTINGS" reports results for each CV tree: Root (TOP), Left Child of Root (LEFT), and Right Child of Root (RIGHT)
Want to know how stable the results are: do different variables split the root node in different trees?
The only difference between different CV trees is the random elimination of 1/10 of the data
Slide 150
Cross-Validation Results
Since CV can produce so much output, it is necessary to summarize and select
Regardless of what CART produces for a final tree, CV runs generate one full tree for each CV fold
10-fold CV produces
  10 CV maximal trees
  The initial maximal tree
  The final pruned tree IF IT SURVIVES
Slide 151
Cross-Validation: Getting Selected Output
May want to see all intermediate tree results or only some of it
Least amount of tree detail: tree pictures and terminal node summary only
Obtained with two commands working together:
LOPTIONS NOPRINT
BOPTIONS COPIOUS
Slide 152
CV-TREE Variable Scaled Importance Measures

Variable  INIT   CV1    CV2    CV3    CV4    CV5    CV6    CV7    CV8    CV9    CV10   FINAL
V01       0.001  0.044  0.039  0.050  0.038  0.068  0.123  0.114  0.045  0.137  0.048  0.000
V02       0.000  0.072  0.056  0.051  0.082  0.122  0.048  0.123  0.000  0.067  0.044  0.000
V03       0.002  0.203  0.074  0.114  0.130  0.211  0.269  0.193  0.202  0.127  0.135  0.079
V04       0.000  0.199  0.149  0.203  0.065  0.177  0.116  0.190  0.195  0.256  0.223  0.000
V05       0.001  0.574  0.439  0.286  0.682  0.816  0.370  0.346  0.685  0.405  1.000  0.328
Slide 153
TOP SPLIT

MAIN SPLIT:
CV 1:  V15  split 2.396  improve 0.134
CV 2:  V07  split 2.450  improve 0.140
[etc...]
CV 10: V09  split 3.035  improve 0.152
FINAL: V15  split 2.396  improve 0.138

COMPETITORS, ranks 1-4 (variable, split, improvement):
V07 2.450 0.131;  V11 3.454 0.136;  V09 2.984 0.127;  V15 2.396 0.132
V13 3.057 0.124;  V09 3.045 0.125;  V11 3.454 0.124;  V13 2.797 0.123
V07 2.450 0.145;  V07 2.450 0.137;  V15 2.284 0.140;  V09 3.035 0.131
V14 2.106 0.123;  V11 3.454 0.129;  V13 3.057 0.123;  V13 3.057 0.122
Slide 154
COMPETITORS, ranks 1-4 (variable, split, improvement):
V11 3.024 0.051;  V13 2.254 0.045;  V10 2.661 0.049;  V09 4.136 0.031
V17 0.253 0.030;  V05 0.237 0.029;  V12 3.214 0.029;  V17 4.347 0.027
V05 0.567 0.098;  V11 3.024 0.053;  V07 2.460 0.089;  V10 2.661 0.047
V13 2.867 0.086;  V12 3.214 0.030;  V06 2.172 0.078;  V17 0.157 0.028
Slide 155
COMPETITORS, ranks 1-4 (variable, split, improvement):
V05 0.725 0.067;  V11 3.343 0.089;  V13 2.253 0.065;  V09 2.973 0.088
V12 2.589 0.056;  V17 0.157 0.060;  V07 2.951 0.055;  V12 3.221 0.054
V16 2.143 0.038;  V05 0.725 0.071;  V08 2.575 0.036;  V13 2.701 0.060
V15 2.284 0.033;  V06 2.172 0.058;  V13 3.087 0.031;  V07 2.436 0.051
[etc...]
CV 10: V07  split 1.852  improve 0.041
FINAL: V11  split 3.454  improve 0.094
Slide 156
Slide 157
Slide 158
Terminal node report:

Node   N     Prob    Class   Cost            Class N (1/2/3)   Class Prob (1/2/3)     Complexity Threshold
1      26    0.092   1       0.104 [0.326]   23 / 3 / 0        0.896 / 0.104 / 0.000  0.016
2      9     0.029   2       0.447 [0.890]   0 / 5 / 4         0.000 / 0.553 / 0.447  0.016
3      108   0.352   2       0.226 [0.298]   15 / 85 / 8       0.153 / 0.774 / 0.074  0.057
4      28    0.093   3       0.270 [0.489]   7 / 0 / 21        0.270 / 0.000 / 0.730  0.043
5      53    0.187   1       0.121 [0.247]   46 / 1 / 6        0.879 / 0.017 / 0.104  0.043
6      76    0.246   3       0.159 [0.259]   2 / 10 / 64       0.029 / 0.130 / 0.841  0.103
Slide 159
During this analysis each observation was a test case exactly once
Record whether it was classified correctly or incorrectly as a test case
Repeat the CV=10 run with a new random number seed 100 to 500 times (a sketch follows below)
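A sketch of that repetition: rerun 10-fold CV under many random seeds and look at the spread of the cross-validated error (again with the iris data as a stand-in):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
errors = []
for seed in range(100):  # the slide suggests 100 to 500 repetitions
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
    errors.append(1.0 - scores.mean())
print("mean error %.3f, sd %.3f" % (np.mean(errors), np.std(errors)))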
Slide 160
USE NEWDATA
SAVE FORECAST
TREE CELLPHONE
CASE
Slide 161
Slide 162
Slide 163
Slide 164
Records of the Output Dataset

ID   NODE   RESPONSE   DEPTH   PATH1   PATH2   PATH3   PATH4
1    5      1          5       2       5       6       7
2    7      0          3       2       5       -7      0
3    1      0          4       2       3       4       -1
4    5      1          5       2       5       6       7
5    5      1          5       2       5       6       7
6    4      0          5       2       5       6       7
7    4      0          5       2       5       6       7

CASE 2: From root node 1, splits to node 2, splits again to node 5, splits again, reaching terminal node 7; final DEPTH = 3
Slide 165
Slide 166
Slide 167
Fast runs, since only one tree is grown
If you have a trivial model (perfect classifiers) you'll learn quickly, and won't waste time testing
If your resubstitution error rate is very high, test-based results will only be worse
Look for some good predictor variables while still exploring
Slide 168
Slide 169
Slide 170
Slide 171
Practical Advice
Experiment with PRIORS
Slide 172
Slide 173
Slide 174
Slide 175
Slide 176
Cross-Validated      Resubstitution   Complexity
Relative Cost        Relative Cost    Parameter
1.077 +/- 0.071      0.031            0.000
1.031 +/- 0.073      0.046            0.004
1.023 +/- 0.075      0.062            0.008
1.046 +/- 0.075      0.177            0.012
1.046 +/- 0.075      0.208            0.015
0.992 +/- 0.075      0.277            0.017
1.008 +/- 0.076      0.400            0.021
1.008 +/- 0.076      0.492            0.023
1.015 +/- 0.075      0.577            0.042
1.046 +/- 0.075      0.677            0.050
1.000 +/- 0.000      1.000            0.081
Slide 177
Cross-Validated      Resubstitution   Complexity
Relative Cost        Relative Cost    Parameter
0.831 +/- 0.073      0.146            0.000
0.846 +/- 0.074      0.169            0.004
0.877 +/- 0.074      0.185            0.008
0.862 +/- 0.074      0.223            0.010
0.892 +/- 0.075      0.269            0.012
0.854 +/- 0.075      0.423            0.015
0.815 +/- 0.075      0.492            0.017
0.854 +/- 0.075      0.577            0.042
0.869 +/- 0.075      0.677            0.050
1.000 +/- 0.000      1.000            0.081
Slide 178
               CROSS VALIDATION              LEARNING SAMPLE
Class   Prior Prob.   N     N Mis-     Cost     N     N Mis-     Cost
                            Classified                Classified
0       0.500         130   57         0.438    130   21         0.169
1       0.500         65    36         0.554    65    7          0.108
Total   1.000         195   93                  195   29

               CROSS VALIDATION              LEARNING SAMPLE
Class   Prior Prob.   N     N Mis-     Cost     N     N Mis-     Cost
                            Classified                Classified
0       0.500         130   54         0.415    130   36         0.277
1       0.500         65    26         0.400    65    14         0.215
Total   1.000         195   80                  195   50
Slide 179
Troubleshooting Problem: Cross-Validation Breaks Down
CART requires at least one case in each level of the dependent variable (DPV)
If there are only a few cases for a given level in the data set, a random test set during cross-validation might get no observations in that level
May need to aggregate levels, OR
Delete levels with very low representation from the problem
Slide 180
Troubleshooting
NO TREE CREATED
The suggestions made in the PRACTICAL ADVICE
section should get around this
If not, request COPIOUS OUTPUT for CV runs
First, set complexity high enough to limit tree growth
Allow maybe 6 terminal nodes to get started
Slide 181
Tree Sequence
Dependent variable: CLT

Tree   Terminal   Cross-Validated       Resubstitution   Complexity   Relative
       Nodes      Relative Error        Relative Error   Parameter    Complexity
1      30         1.4237 +/- 0.0848     0.5150           0.0000       0.0000
10     18         1.3095 +/- 0.0735     0.6153           0.0110       0.7625
11     17         1.2745 +/- 0.0683     0.6275           0.0121       0.8401
12     15         1.2711 +/- 0.0682     0.6523           0.0124       0.8603
13     14         1.2631 +/- 0.0678     0.6648           0.0125       0.8646
14     13         1.1898 +/- 0.0530     0.6784           0.0135       0.9372
15     11         1.1891 +/- 0.0515     0.7116           0.0166       1.1508
16     7          1.0782 +/- 0.0446     0.7814           0.0174       1.2076
17     6          1.0591 +/- 0.0322     0.8050           0.0236       1.6343
18     4          1.0377 +/- 0.0261     0.8618           0.0284       1.9639
19**   1          1.0000 +/- 0.0003     1.0000           0.0461       3.1890
Slide 182
Occurs when the test error rate for a tree of any size is higher than for the root node
In this case CART prunes ALL branches away, leaving no tree
Could also be caused by SERULE=1 and a large SE
Trees were actually generated, including the initial and all CV trees; CART just maintains nothing worthwhile to print
Need some output to diagnose
Try an exploratory tree with ERROR EXPLORE
  Grows the maximal tree but does not test
Maximum size of any initial tree or CV tree is controlled by the COMPLEXITY parameter
Slide 183
Slide 184
Memory Management: Commands and Reports
Issue the MEMORY command after problem setup & before BUILD
The example comes from a small PC version
Windows and minicomputer versions will have much larger workspaces,
from 100,000 to 32,000,000 workspace elements
Slide 185
Memory Management

Current Memory Requirements
TOTAL: 90969   AVAILABLE: 55502   DEFICIT: 35467
DATA: 37464    ANALYSIS: 53505

Adjustable Parameters      Current Value   Solution Value   Solution Surplus
Learning Set               1338            700              295
Depth                      32              NONE             (19788)
Nodes                      257             NONE             (19581)
Atom                       10              NONE             (20693)
Subsampling                1000            NONE             (5190)
Continuous Ind Var         27              12               353
Dependent Var Classes      5               NONE             (34248)
() Denotes Deficit at Extreme Allowable Parameter Value
Slide 186
Slide 187
Memory Management: LIMIT Experiments
Set some LIMITs to let the tree get started
After results are obtained, refine with BOPTIONS COMPLEXITY, etc.
Try LIMIT LEARN=1000, SUBSAMPLE=500 in the above situation and get some further feedback from the memory analyzer
Slide 188
Slide 189
Slide 190
Slide 191
Probabilistic Splits
Conventional CART uses a "hard" split
"Is X <= c?" has a yes/no answer
Means that a small change in the value of a variable can have a large change on the outcome
[Figure: a hard split at cutpoint c1]
Slide 192
Probabilistic Splits
For X < c1 a case always goes left, and for X > c2 it always goes right
For c1 < X < c2 a case goes left with probability proportional to how close it is to c1
Slide 193
Split Revisitation
Conventional CART is myopic; it looks only at its own child nodes
It cannot consider the entire tree or grandchildren
Split revisitation, developed by Breiman, re-examines trees after growth and initial pruning
It begins with a tree from the original tree sequence
It notes the SIZE (K terminal nodes) and impurity of this tree
Slide 194
Split Revisitation
Go down to leaves of tree and expand tree from the leaves
Grow via normal splitting
Slide 195
Split Revisitation
The result of this process is a re-optimized K-terminal-node tree
The search procedure is then repeated for the (K-1)-terminal-node tree
Slide 196