[Figure: applying the learned model to unlabeled test records]

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
• Two classes
  - Predicting tumor cells as benign or malignant
  - Classifying credit card transactions as legitimate or fraudulent
• Multiple classes
  - Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
  - Categorizing news stories as finance, weather, entertainment, sports, etc.
• Decision trees
• Rule-based methods
• Logistic regression
• Discriminant analysis
• k-Nearest neighbor (instance-based learning)
• Naïve Bayes
• Neural networks
• Support vector machines
• Bayesian belief networks
Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

[Figure: two decision trees fit to these data. One tests Refund at the root
(a splitting node), then Marital Status (MarSt: Single, Divorced vs. Married),
then Taxable Income (TaxInc < 80K vs. > 80K), ending in NO/YES classification
nodes. The other tests MarSt at the root, then Refund.]

There can be more than one tree that fits the same data!
Test data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branches that match the
test record:

  Refund = No      ->  take the "No" branch to the MarSt node
  MarSt = Married  ->  take the "Married" branch, reaching a leaf

The leaf is NO, so assign Cheat = "No" to the test record.

[Figure, repeated over six slides as the traversal advances: the tree —
Refund (Yes -> NO; No -> MarSt), MarSt (Single, Divorced -> TaxInc;
Married -> NO), TaxInc (< 80K -> NO; > 80K -> YES).]
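The traversal above can be sketched in Python. This is an illustrative encoding of the slide's tree, not code from the course; the record keys ("Refund", "MarSt", "TaxInc") are chosen to match the figure, and the tie at exactly 80K is sent down the "> 80K" branch by assumption.

```python
def predict(record):
    """Walk the Refund -> MarSt -> TaxInc decision tree for one record."""
    if record["Refund"] == "Yes":
        return "No"                      # Refund = Yes: leaf NO
    if record["MarSt"] == "Married":
        return "No"                      # Married: leaf NO
    # Single or Divorced: test taxable income (< 80K -> NO, else YES)
    return "No" if record["TaxInc"] < 80_000 else "Yes"

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}
print(predict(test_record))  # -> No (Cheat assigned "No", as in the slides)
```

The nested-if form mirrors the tree exactly: each `if` is an internal node, each `return` a leaf.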
• Many algorithms:
  - Hunt's algorithm (one of the earliest)
  - CART
  - ID3, C4.5
  - SLIQ, SPRINT
[Figure: Hunt's algorithm grown step by step on the 10-record training set
(Tid, Refund, Marital Status, Taxable Income, Cheat). Step 1: a single
"Don't Cheat" leaf. Step 2: split on Refund (Yes -> Don't Cheat; No ->
Don't Cheat). Step 3: under Refund = No, split on Marital Status
(Married -> Don't Cheat; Single, Divorced -> Cheat). Step 4: under
Single/Divorced, split on Taxable Income (< 80K -> Don't Cheat;
>= 80K -> Cheat).]
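The recursion above can be sketched in Python. This is a simplified, hypothetical rendition rather than the course's code: it handles only the categorical attributes (Taxable Income is continuous and omitted) and picks split attributes in a fixed order instead of scoring candidates with an impurity measure.

```python
from collections import Counter

def hunts(records, attributes, target="Cheat"):
    """Simplified Hunt's algorithm: grow a tree by recursively splitting
    impure nodes. Pure node -> leaf; attributes exhausted -> majority leaf.
    A full implementation would score candidate splits (e.g. with Gini)."""
    labels = [r[target] for r in records]
    if len(set(labels)) == 1:                    # node is pure
        return labels[0]
    if not attributes:                           # no tests left
        return Counter(labels).most_common(1)[0][0]
    attr = attributes[0]                         # naive fixed-order choice
    node = {attr: {}}
    for value in sorted({r[attr] for r in records}):
        subset = [r for r in records if r[attr] == value]
        node[attr][value] = hunts(subset, attributes[1:], target)
    return node

# The 10-record training set from the slides (income column omitted).
rows = [("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
        ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
        ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
        ("No", "Single", "Yes")]
records = [{"Refund": r, "Marital": m, "Cheat": c} for r, m, c in rows]

tree = hunts(records, ["Refund", "Marital"])
print(tree)
```

On this data the recursion reproduces the slide's progression: Refund = Yes is already pure ("No"), and under Refund = No the Marital Status split separates Married ("No") from Single/Divorced (majority "Yes").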
• Greedy strategy
  - Split the records at each node based on an attribute test that optimizes some chosen criterion.
• Issues
  - Determine how to split the records
    - How to specify the structure of the split?
    - What is the best attribute / attribute value for splitting?
• Depends on the number of ways to split:
  - Binary (two-way) split
  - Multi-way split
• Binary split: divides values into two subsets; need to find the optimal partitioning.

  CarType: {Sports, Luxury} vs. {Family}   OR   {Family, Luxury} vs. {Sports}

• What about this split?  Size: {Small, Large} vs. {Medium}
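Finding the optimal binary partitioning means considering every way of dividing the attribute's values into two non-empty subsets. A small illustrative sketch (not from the slides) enumerates them; for k values there are 2^(k-1) - 1 distinct partitions:

```python
from itertools import combinations

def binary_partitions(values):
    """Enumerate all two-subset (binary) partitions of a categorical
    attribute's values: 2^(k-1) - 1 partitions for k distinct values."""
    values = sorted(values)
    parts = []
    rest = values[1:]
    # Pin the first value to the left subset to avoid mirror duplicates.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {values[0], *combo}
            right = set(values) - left
            if right:                      # skip the empty-right partition
                parts.append((left, right))
    return parts

for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
    print(sorted(left), "vs", sorted(right))
```

For CarType this prints exactly the three candidate splits, including the two shown above plus {Family, Sports} vs. {Luxury}.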
• Greedy approach: nodes with homogeneous class distribution are preferred.
• Need a measure of node impurity:
  - Non-homogeneous: high degree of impurity
  - Homogeneous: low degree of impurity
• Common measures:
  - Gini index
  - Entropy
  - Misclassification error
[Figure: comparing two candidate splits at a node with impurity M0.
Attribute A? (Yes/No) produces children with impurities M1 and M2, combining
to M12; Attribute B? (Yes/No) produces M3 and M4, combining to M34.]

  Gain = M0 - M12   vs.   M0 - M34

Choose the attribute that maximizes the gain.

Jeff Howbert    Introduction to Machine Learning    Winter 2012
Measure of impurity: Gini index

  GINI(t) = 1 - Σ_j [ p(j | t) ]²

where p(j | t) is the relative frequency of class j at node t.

Examples (six records, two classes):

  C1 = 0, C2 = 6:  p(C1) = 0/6 = 0,  p(C2) = 6/6 = 1
                   Gini = 1 - p(C1)² - p(C2)² = 1 - 0 - 1 = 0
  C1 = 1, C2 = 5:  p(C1) = 1/6,  p(C2) = 5/6
                   Gini = 1 - (1/6)² - (5/6)² = 0.278
  C1 = 2, C2 = 4:  p(C1) = 2/6,  p(C2) = 4/6
                   Gini = 1 - (2/6)² - (4/6)² = 0.444

When a node is split into k partitions (children), the quality of the split is

  GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

where n_i is the number of records at child i and n the number of records at
the parent node.
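Both formulas translate directly into a few lines of Python. This is an illustrative sketch, not course code; it reproduces the three example node impurities above:

```python
def gini(counts):
    """Gini index of a node: 1 - sum_j p(j|t)^2, given class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split: sum_i (n_i / n) * GINI(i),
    children = list of per-child class-count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(round(gini([0, 6]), 3))   # 0.0   (pure node)
print(round(gini([1, 5]), 3))   # 0.278
print(round(gini([2, 4]), 3))   # 0.444
```

For instance, splitting ten records into a (3, 3) child and a (4, 0) child gives `gini_split([[3, 3], [4, 0]])` = 0.6 × 0.5 + 0.4 × 0 = 0.3.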
Computing Gini for a categorical attribute (CarType), three candidate splits:

  Multi-way:      Family  Sports  Luxury
            C1      1       2       1        Gini = 0.393
            C2      4       1       1

  Binary:   {Sports, Luxury}  {Family}
            C1      3            1           Gini = 0.400
            C2      2            4

  Binary:   {Family, Luxury}  {Sports}
            C1      2            2           Gini = 0.419
            C2      1            5

For a continuous attribute (Taxable Income), sort the values and compute the
Gini at each candidate split position:

  [Table: class counts on each side of every candidate threshold, with the
  resulting Gini values:
  0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420
  The minimum, Gini = 0.300, occurs at the candidate split between the
  incomes 95K and 100K.]
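The sorted-threshold search can be sketched as follows, using the (income, Cheat) pairs from the training table. An illustrative sketch, not course code; it assumes midpoint thresholds between adjacent sorted values (so the best split lands at 97.5, between 95K and 100K), and redefines the Gini helper so the block is self-contained:

```python
def gini(counts):
    """Gini index of a node from its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

# (income in $K, Cheat) pairs from the training table in the slides
data = [(125, "No"), (100, "No"), (70, "No"), (120, "No"), (95, "Yes"),
        (60, "No"), (220, "No"), (85, "Yes"), (75, "No"), (90, "Yes")]

data.sort()
best = None
for i in range(len(data) - 1):
    t = (data[i][0] + data[i + 1][0]) / 2          # midpoint threshold
    left = [lbl for v, lbl in data if v <= t]
    right = [lbl for v, lbl in data if v > t]
    score = sum(len(side) / len(data) *
                gini([side.count("No"), side.count("Yes")])
                for side in (left, right))
    if best is None or score < best[0]:
        best = (score, t)

print(best)  # (0.3, 97.5): the split "income <= 97.5" minimizes the Gini
```

The <= 97.5 side holds 3 No / 3 Yes (Gini 0.5, weight 0.6) and the > 97.5 side holds 4 No / 0 Yes (Gini 0, weight 0.4), giving the 0.300 minimum in the table above.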
• Advantages:
  - Inexpensive to construct
  - Extremely fast at classifying unknown records
  - Easy to interpret for small-sized trees
  - Accuracy comparable to other classification techniques for many simple data sets
• Disadvantages:
  - Easy to overfit
  - Decision boundary restricted to being parallel to attribute axes
MATLAB interlude: matlab_demo_04.m, Part A
• Generalization
• Measuring classifier performance
• Overfitting, underfitting
• Validation
• Accuracy

  a = number of test samples with label correctly predicted
  b = number of test samples with label incorrectly predicted

  accuracy = a / (a + b)

  Example: 75 samples in the test set, 62 predicted correctly:
  accuracy = 62 / 75 = 0.827
• Limitations of accuracy
  - Consider a two-class problem:
    - number of class 1 test samples = 9990
    - number of class 2 test samples = 10
  - What if the model predicts everything to be class 1?
    - accuracy is extremely high: 9990 / 10000 = 99.9%
    - but the model will never correctly predict any sample in class 2
    - in this case accuracy is misleading and does not give a good picture of model quality
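The imbalance pathology above is easy to reproduce numerically. A small illustrative sketch (hypothetical labels "c1"/"c2", not from the slides) with a model that always predicts the majority class:

```python
# Imbalanced test set: 9990 class-1 samples, 10 class-2 samples.
actual = ["c1"] * 9990 + ["c2"] * 10
predicted = ["c1"] * 10000          # model that always predicts class 1

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
recall_c2 = sum(a == p == "c2" for a, p in zip(actual, predicted)) / 10

print(accuracy)    # 0.999 -- looks excellent
print(recall_c2)   # 0.0   -- but class 2 is never detected
```

Accuracy alone rewards the trivial majority predictor; per-class metrics like the class-2 recall expose it.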
• Confusion matrix

  Example (continued from two slides back):

                            actual class
                         class 1   class 2
  predicted   class 1      21         6
  class       class 2       7        41

  accuracy = (21 + 41) / (21 + 6 + 7 + 41) = 62 / 75
• Confusion matrix: derived metrics (for two classes)

                            actual class
                         class 1      class 2
                         (negative)   (positive)
  predicted   class 1     21 (TN)      6 (FN)
  (negative)
  class       class 2      7 (FP)     41 (TP)
  (positive)

  TN: true negatives    FN: false negatives
  FP: false positives   TP: true positives
• Confusion matrix: derived metrics, continued (same matrix: TN = 21,
  FN = 6, FP = 7, TP = 41)

  sensitivity = TP / (TP + FN)
  specificity = TN / (TN + FP)
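Plugging the slide's counts into these formulas (an illustrative sketch, not course code):

```python
# Confusion matrix counts from the slides (class 1 = negative, class 2 = positive)
TN, FN = 21, 6
FP, TP = 7, 41

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 62 / 75
sensitivity = TP / (TP + FN)                    # true positive rate
specificity = TN / (TN + FP)                    # true negative rate

print(round(accuracy, 3), round(sensitivity, 3), round(specificity, 3))
# 0.827 0.872 0.75
```

Note how the three numbers tell different stories: overall accuracy is 0.827, but the classifier catches positives (sensitivity 0.872) somewhat better than it rejects negatives (specificity 0.75).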
MATLAB interlude: matlab_demo_04.m, Part B
The stage of optimization also matters; for example, as the number of
iterations in a gradient descent optimization grows, the fit moves from
underfitting through the optimal fit to overfitting.

[Figure: decision boundary distorted by a noise point.]
Sources of overfitting: insufficient examples

• Model complexity should therefore be considered when evaluating a model.
• Post-pruning
  - Grow the full decision tree
  - Trim nodes of the full tree in a bottom-up fashion
  - If generalization error improves after trimming, replace the sub-tree by a leaf node.
  - The class label of the leaf node is determined from the majority class of instances in the sub-tree
  - Can use various measures of generalization error for post-pruning (see textbook)
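The bottom-up procedure can be sketched as reduced-error pruning against a validation set. This is an illustrative, simplified sketch (the nested-dict tree format, helper names, and the tiny validation set are all hypothetical, not from the slides); here the validation records make the MarSt sub-tree pure noise, so it collapses to a leaf:

```python
from collections import Counter

def classify(tree, record):
    """Route a record down a nested-dict tree {attr: {value: subtree-or-label}}."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][record[attr]]
    return tree

def prune(tree, val_records, target="Cheat"):
    """Bottom-up reduced-error pruning: replace a sub-tree with a
    majority-class leaf if that strictly improves error on the validation
    records routed to it (mutates and returns the tree)."""
    if not isinstance(tree, dict) or not val_records:
        return tree
    attr = next(iter(tree))
    for value, child in tree[attr].items():        # prune children first
        routed = [r for r in val_records if r[attr] == value]
        tree[attr][value] = prune(child, routed, target)
    majority = Counter(r[target] for r in val_records).most_common(1)[0][0]
    leaf_err = sum(r[target] != majority for r in val_records)
    sub_err = sum(classify(tree, r) != r[target] for r in val_records)
    return majority if leaf_err < sub_err else tree

# Full tree from the slides; a hypothetical validation set on which the
# Marital sub-tree only adds errors, so pruning collapses it to a "No" leaf.
tree = {"Refund": {"Yes": "No",
                   "No": {"Marital": {"Married": "No",
                                      "Single": "Yes",
                                      "Divorced": "Yes"}}}}
val = [{"Refund": "Yes", "Marital": "Single", "Cheat": "No"},
       {"Refund": "No", "Marital": "Single", "Cheat": "No"},
       {"Refund": "No", "Marital": "Single", "Cheat": "No"},
       {"Refund": "No", "Marital": "Married", "Cheat": "No"},
       {"Refund": "No", "Marital": "Divorced", "Cheat": "No"}]

print(prune(tree, val))  # -> {'Refund': {'Yes': 'No', 'No': 'No'}}
```

The leaf's label is the majority class of the routed validation records, matching the rule above; the Refund test survives because pruning it would not strictly improve the validation error.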
MATLAB interlude: matlab_demo_04.m, Part C
[Figure: partitioning the available data into TRAIN and TEST sets]
• Training set
  - Used to drive model building and parameter optimization
• Validation set
  - Used to gauge the status of generalization error
  - Results can be used to guide decisions during the training process
  - Typically used mostly to optimize a small number of high-level meta-parameters, e.g. regularization constants, number of gradient descent iterations
• Test set
  - Used only for the final assessment of model quality, after training + validation are completely finished
• Holdout
• Cross-validation
• Leave-one-out (LOO)
• Holdout
  - Pro: results in a single model that can be used directly in production
  - Con: can be wasteful of data
  - Con: a single static holdout partition has the potential to be unrepresentative and statistically misleading
• Cross-validation and leave-one-out (LOO)
  - Con: do not lead directly to a single production model
  - Pro: use all available data for evaluation
  - Pro: many partitions of the data help average out statistical variability
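The "many partitions" idea behind cross-validation can be sketched in a few lines. An illustrative sketch, not course code; folds are taken without shuffling for simplicity, and each sample is held out exactly once:

```python
def kfold_indices(n, k):
    """Partition sample indices 0..n-1 into k disjoint validation folds."""
    return [list(range(i, n, k)) for i in range(k)]

n, k = 10, 5
for fold, val_idx in enumerate(kfold_indices(n, k)):
    train_idx = [i for i in range(n) if i not in val_idx]
    print(f"fold {fold}: validate on {val_idx}, train on {train_idx}")
# With k = n this degenerates to leave-one-out (LOO).
```

Averaging the validation score over the k folds uses every sample for evaluation while still keeping each fold's validation data out of its training set.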