Beruflich Dokumente
Kultur Dokumente
Introducing Decision
Theory Analysis (DTA)
and Classification and
Regression Trees (CART)
John J. McArdle
University of Southern California
Fall 2013
Introducing CART
1. Introducing
Classification and
Regression
Trees (DTA-CART)
Particularly in the social sciences, there are two powerful reasons for
believing that it is a mistake to assume that the various influences
are additive. In the first place, there are already many instances
known of powerful interaction effects advanced education helps a
man more than it does a woman when it comes to making money, [...]
Second, the measured classifications are only proxies for more than
one construct. [...] We may have interaction effects not because the
world is full of interactions, but because our variables have to interact
to produce the theoretical constructs that really matter. (Morgan
and Sonquist, 1963, p 416.)
Whenever these interrelationships become very complexcontaining
non linearity and interactionthe usefulness of classical approaches
is limited. (Press et al., 1969, p 364.)
History of DTA/CART
History of DTA/CART
The famous
SPAM filter
as a tree
(Hastie et
al.,
2001)
H[2]
ed
0.05
0.00
0.90
W[1]
W[2]
0.49
eg
0.05
An example of a longitudinal cross-lagged
path analysis
2. Binary Outcomes
with Logistic
Regression
LogOdds
1=+0.11
0=-5.3
Age (Years)
3. Binary Outcomes
with
Classification Trees
CART Techniques
CART is based on the utterly simple ideas that we can deal
with a multivariate prediction/classification by (a) using
only parts of the database at any time (partition,
segmentation), and then (b) doing some procedure over
and over again in subsets (i.e., recursively).
In steps:
0: We define one DV (binary) and a set (many, p) of IVs.
1. We search for the single best predictor of the DV among ALL IVs,
considering ALL ordered splits of IVs (i.e., all 2x2 tables).
2. We split the data (partition) into two parts according to the optimal
splitting rule developed in step 1 (for 2x2 tables).
3. We reapply the search strategy on each part of the data.
4. We do this over and over again (recursively) until a final split is
not warranted or a stopping criterion has been reached.
Clearly this would not be possible without modern day computers.
10
Impossible Calculations?
11
Complete Success?
Adding Priors
12
'
Es t i mat ed
Pr obabi l i t y
1. 0
P_CART5
0. 8
0. 9
0. 7
0. 8
0. 7
0. 6
0. 5
0. 5
4. Continuous
Outcomes with
Linear Regression
0. 4
0. 4
0. 3
0. 3
0. 2
0. 2
0. 1
0. 0
0. 1
20
30
40
50
Age
60
70
20
30
40
50
60
70
Age
13
Linear: BD = B0 + B1 * ED(years-11)
120
WAIS_BD
100
80
60
40
20
0
-1 2
-8
-4
0
YEARED1 1
DF
Sum of
Squares
Mean
Square
F Value
Pr > F
379.17
<.0001
Model
Error
Corrected Total
2
1877
1879
270165
668694
938859
Root MSE
Dependent Mean
Coeff Var
18.87477
52.85461
35.71074
135083
356.25695
R-Square
Adj R-Sq
0.2878
0.2870
Parameter Estimates
Variable
DF
Parameter
Estimate
Intercept
yage50
yed12
1
1
1
51.25935
-0.35293
2.80412
Standard
Error
t Value
Pr > |t|
Standardized
Estimate
0.52158
0.02358
0.14112
98.28
-14.96
19.87
<.0001
<.0001
<.0001
0
-0.29687
0.39418
14
5. Continuous
Outcomes with
Regression Trees
15
16
Multivariate Clustering-Taxonomy
6. Classification
Based on Continuous
Outcomes
17
Multivariate Clustering-Taxonomy
95% Confidence
Ellipsoids of our data
Y(2)
All Data
Y(1)
95% Confidence
Ellipsoids
Cluster 1
Centroid
Y(2)
Cluster 2
Centroid
Y(1)
18
Approaches to Clustering-Taxonomy
- 1
- 2
- 3
- 3
- 2
- 1
Z_Bl oc k s
19
'
Z_Voc ab
2
Criterion Based on Final Seeds =
0.7005
Cluster Summary
Maximum Distance
RMS Std
from Seed
Radius
Nearest
Distance Between
Cluster
Frequency
Deviation
to Observation
Exceeded
Cluster
Cluster Centroids
__________________________________________________________________________________________________
1
551
0.7629
2.7820
2
2.1494
2
1129
0.6682
2.2602
1
2.1494
Statistics for Variables
Variable
Total STD
Within STD
R-Square
RSQ/(1-RSQ)
__________________________________________________________________
Z_Vocab
1.00000
0.68594
0.529767
1.126607
Z_Blocks
1.00000
0.71499
0.489096
0.957316
OVER-ALL
1.00000
0.70061
0.509432
1.038453
Pseudo F Statistic = 1742.52
Approximate Expected Over-All R-Squared =
0.37575
Cubic Clustering Criterion =
22.536
WARNING: The two values above are invalid for correlated variables.
Cluster Means
Cluster
Z_Vocab
Z_Blocks
___________________________________________
1
-1.041560545
-1.000780939
2
0.508326212
0.488424017
Variable
Mean
Std Dev
Minimum
Maximum
18.50000
3.00000
-2.79519
-2.31949
1.00000
0.03018
72.00000
16.00000
1.73833
2.12645
2.00000
2.78205
- 1
- 2
- 3
- 3
- 2
- 1
Z_Bl oc ks
yearsage
yearsed
Z_Vocab
Z_Blocks
CLUSTER
DISTANCE
1680
1680
1680
1680
1680
1680
42.23810
11.41250
2.5777E-7
2.49072E-7
1.67202
0.90061
18.03073
3.2698
1.00000
1.00000
0.46962
0.41290
Pearson Correlation Coefficients, N = 1680
yearsage
yearsed
Z_Vocab
Z_Blocks
CLUSTER
DISTANCE
yearsage
1.00000
-0.28420
0.00745
-0.38731
-0.18273
-0.02227
yearsed
-0.28420
1.00000
0.58864
0.49478
0.51908
-0.09180
Z_Vocab
0.00745
0.58864
1.00000
0.52641
0.72785
-0.12258
Z_Blocks
-0.38731
0.49478
0.52641
1.00000
0.69935
-0.11220
CLUSTER
Cluster
-0.18273
0.51908
0.72785
0.69935
1.00000
-0.15283
DISTANCE
-0.02227
-0.09180
-0.12258
-0.11220
-0.15283
1.00000
20
Canonical Analysis
Canonical Analysis
H = Type III SSCP Matrix for CLUSTER
E = Error SSCP Matrix
Adjusted Approximate
Squared
Canonical
Canonical
Standard
Canonical
Correlation Correlation
Error Correlation
Adjusted Approximate
Squared
Canonical
Canonical
Standard
Canonical
Correlation Correlation
Error Correlation
0.817364
0.817324
0.008100
0.668083
0.520378
0.520074
0.017796
0.270793
Likelihood
Eigenvalue Ratio
Approximate
F Value Num DF Den DF Pr > F
Likelihood
Eigenvalue Ratio
Approximate
F Value Num DF Den DF Pr > F
2.0128
0.33191661
1687.74
1677 <.0001
0.3714 0.72920654
311.38
1677 <.0001
NOTE: The F statistic is exact.
Canonical Structure
Canonical Structure
Total
Can1
Between
Can1
Within
Can1
Total
Can1
Between
Can1
Within
Can1
-0.3511
0.9975
-1.0000
1.0000
-0.3050
0.9966
S.E.
Est./S.E.
P-Value
12.694
8.858
0.000
0.753
0.669
92.805
83.520
0.000
0.000
15.011
16.036
14.691
26.562
0.000
0.000
18.875
5.178
0.000
1.795
1.976
14.476
15.667
0.000
0.000
19.749
48.047
4.211
8.910
0.000
0.000
0.031
0.064
11.984
8.086
0.000
0.000
Z_Vocab
Z_Blocks
0.8905
0.8556
1.0000
1.0000
0.7481
0.6896
2.
3.
4.
5.
yearsage
yearsed
Estimate
Latent Class 1
WAIS_VO WITH
WAIS_BD
112.444
Means
WAIS_VO
69.879
WAIS_BD
55.911
Variances
WAIS_VO
220.521
WAIS_BD
425.932
Latent Class 2
WAIS_VO WITH
WAIS_BD
97.736
Means
WAIS_VO
25.978
WAIS_BD
30.953
Variances
WAIS_VO
83.155
WAIS_BD
428.096
STDYX Standardization
WAIS_VO WITH
WAIS_BD
0.367
WAIS_BD
0.518
21
FASTCLUS Cluster
Assignments
Cluster 1
Cluster 2
307
(18.3%)
1,117
(66.5%)
Minimum
0.507
Maximum
0.998
Probability in Class 2
N
Average
1117
0.995
Minimum
0.564
Maximum
1.00
Disagreement
Probability in Class 1
N
Average
12
0.704
Minimum
0.505
Maximum
0.872
Probability in Class 2
N
Average
307
0.946
Minimum
0.521
Maximum
1.00
7. Unsupervised
CART with
Continuous Outcomes
22
23
8. Introducing
SEM+Trees
24
SEM
TREES
SEM-Trees
25
2 {,
n
Residuals:
Min
1Q
Median
3Q
Max
-3.07219 -0.68178 0.00666 0.72954 3.09040
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept)
1.02631
0.05499
18.66
<2e-16 ***
x1.2TRUE
2.00439
0.06451
31.07
<2e-16 ***
x2.2TRUE
2.91782
0.06449
45.25
<2e-16 ***
Residual standard error: 1.02 on 998 degrees of freedom
Multiple R-squared: 0.7527,
Adjusted R-squared: 0.7522
F-statistic: 1519 on 2 and 998 DF, p-value: < 2.2e-16
The total misfit for this joint model is the weighted sum of
these likelihoods, written as
Split L2{, } =
na{a,
a} +
2 { ,
nb
b
b}
R Regression Output
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min
1Q Median
3Q
Max
-3.0847 -0.6854 0.0149 0.7092 3.9269
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.02756
0.03310
31.04
<2e-16 ***
x1
1.90252
0.03197
59.52
<2e-16 ***
x2
2.99095
0.03332
89.77
<2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
1
Residual standard error: 1.045 on 998 degrees of freedom
Multiple R-squared: 0.9191,
Adjusted R-squared: 0.919
F-statistic: 5673 on 2 and 998 DF, p-value: < 2.2e-16
26
R RPART Output
Call:
rpart(formula = y.2 ~ x1.2 + x2.2, method = "anova")
n= 1001
CP nsplit rel error
xerror
xstd
1 0.5134325
0 1.0000000 1.0019140 0.03580360
2 0.1295204
1 0.4865675 0.4890493 0.01906300
3 0.1100367
2 0.3570472 0.3585805 0.01558064
4 0.0100000
3 0.2470104 0.2486406 0.01109804
Node number 1: 1001 observations,
complexity
param=0.5134325
mean=3.444344, MSE=4.194623
left son=2 (506 obs) right son=3 (495 obs)
Primary splits:
x2.2 < 0.5 to the left, improve=0.5134325, (0 missing)
x1.2 < 0.5 to the left, improve=0.2453740, (0 missing)
Node number 2: 506 observations,
complexity param=0.1295204
mean=1.992849, MSE=2.147889
left son=4 (262 obs) right son=5 (244 obs)
Primary splits:
x1.2 < 0.5 to the left, improve=0.5003832, (0 missing)
R RPART Output
Node number 3: 495 observations,
complexity param=0.1100367
mean=4.928094, MSE=1.93167
left son=6 (252 obs) right son=7 (243 obs)
Primary splits:
x1.2 < 0.5 to the left, improve=0.4831997, (0 missing)
Node number 4: 262 observations
mean=0.9923852, MSE=0.9943195
Node number 5: 244 observations
mean=3.067118, MSE=1.157737
Node number 6: 252 observations
mean=3.979386, MSE=1.064824
Node number 7: 243 observations
mean=5.91194, MSE=0.9292872
12
X1
12
e2
X2
22
27
Results of OpenMX
free parameters:
name matrix row col
Estimate Std.Error lbound ubound
1
B1
A
y x1 1.9941156879 0.03077067
2
B2
A
y x2 3.0290359794 0.03212727
3
V1
S x1 x1 1.0968105126 0.04902384
4 C12
S x1 x2 -0.0009623423 0.03320199
5
V2
S x2 x2 1.0061455782 0.04496839
6
Ve
S ey ey 1.0395564636 0.04646684
7
B0
M
1
y 1.0095534064 0.03223348
8 Mx1
M
1 x1 0.0218588355 0.03309820
9 Mx2
M
1 x2 -0.0063300329 0.03170186
observed statistics: 3003
estimated parameters: 9
degrees of freedom: 2994
-2 log likelihood: 8659.608
saturated -2 log likelihood: NA
number of observations: 1001
chi-square: NA
p: NA
Information Criteria:
df Penalty Parameters Penalty Sample-Size Adjusted
AIC
2671.608
8677.608
NA
BIC -12025.203
8721.787
8693.203
CFI: NA
TLI: NA
RMSEA: NA
timestamp: 2013-05-29 21:07:07
frontend time: 0.0904429 secs
backend time: 0.1859481 secs
independent submodels time: 2.503395e-05 secs
wall clock time: 0.2764161 secs
cpu time: 0.2764161 secs
openmx version number: 1.2.4-2063
28
9. Summary &
Discusssion
29
30