
Classification: Decision Trees

Dr. Faisal Kamiran

© Tan, Steinbach, Kumar: Introduction to Data Mining, 4/18/2004. Dr. Faisal Kamiran, Information Technology University
Classification: Definition

• Given a collection of records (the training set):
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.
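The hold-out procedure described above can be sketched in a few lines of Python. This is only an illustration: the function names, the 30% test fraction, the fixed seed, the toy records and the hand-written stand-in model are all assumptions, not part of the slides.

```python
import random

def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and return (training_set, test_set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def holdout_accuracy(model, test_set):
    """Fraction of test records for which the model predicts the true class."""
    hits = sum(1 for attributes, label in test_set if model(attributes) == label)
    return hits / len(test_set)

# Toy data (hypothetical): ten (attributes, class) pairs; class is "Yes" when x >= 5.
records = [({"x": i}, "Yes" if i >= 5 else "No") for i in range(10)]
training_set, test_set = holdout_split(records)

# Stand-in for a model learned from training_set (hypothetical, hand-written).
model = lambda attributes: "Yes" if attributes["x"] >= 5 else "No"

print(len(training_set), len(test_set), holdout_accuracy(model, test_set))
```

The key point is that the test records play no part in building the model; they are only used afterwards to estimate how well it generalizes.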

Illustrating Classification Task

Training set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Learn Model: training set → model
Apply Model: model + test set → predicted classes

Examples of Classification Tasks

• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Many different types of models

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
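A minimal sketch of how such a rule set could be applied to a record. The slide does not say how conflicts between rules are resolved, so the first-match ordering used here is an assumption, as are the function name `classify` and the two example records.

```python
# Each rule: (name, conditions that must all hold, predicted class).
RULES = [
    ("R1", {"Give Birth": "no", "Can Fly": "yes"}, "Birds"),
    ("R2", {"Give Birth": "no", "Live in Water": "yes"}, "Fishes"),
    ("R3", {"Give Birth": "yes", "Blood Type": "warm"}, "Mammals"),
    ("R4", {"Give Birth": "no", "Can Fly": "no"}, "Reptiles"),
    ("R5", {"Live in Water": "sometimes"}, "Amphibians"),
]

def classify(record, rules=RULES, default=None):
    """Return (rule name, class) for the first rule whose conditions all match.

    First-match ordering is an assumption made for this sketch.
    """
    for name, conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return name, label
    return None, default

# Hypothetical records, not taken from the slides.
hawk = {"Give Birth": "no", "Can Fly": "yes", "Live in Water": "no", "Blood Type": "warm"}
bear = {"Give Birth": "yes", "Can Fly": "no", "Live in Water": "no", "Blood Type": "warm"}

print(classify(hawk))  # ('R1', 'Birds')
print(classify(bear))  # ('R3', 'Mammals')
```

A record can satisfy more than one rule, so the outcome depends on the conflict-resolution strategy; a different strategy (for example voting) would need a different `classify`.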

Metrics of Classifier Performance

• Focus on the predictive capability of a model
• Confusion matrix:

                  PREDICTED CLASS
                  Yes     No
ACTUAL    Yes     TP      FN
CLASS     No      FP      TN

  TP (true positive), FN (false negative), FP (false positive), TN (true negative)

• Most widely used metric:

  $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
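As a small worked example, the four confusion-matrix counts and the accuracy can be computed from paired lists of actual and predicted classes. The function names and the eight example labels are made up for illustration.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count (TP, FN, FP, TN) for a two-class problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1      # actual Yes, predicted Yes
        elif a == positive:
            fn += 1      # actual Yes, predicted No
        elif p == positive:
            fp += 1      # actual No, predicted Yes
        else:
            tn += 1      # actual No, predicted No
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical labels for eight test records.
actual    = ["Yes", "Yes", "No", "No",  "No", "Yes", "No", "No"]
predicted = ["Yes", "No",  "No", "Yes", "No", "Yes", "No", "No"]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(*counts))  # (2, 1, 1, 4) 0.75
```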
Classification Techniques

• Decision Tree based Methods
• Rule-based Methods
• Memory-based Reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines

Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes: NO
  No:  MarSt?
    Single, Divorced: TaxInc?
      < 80K:  NO
      >= 80K: YES
    Married: NO

Another Example of Decision Tree

Training data: the same ten records (Tid 1-10) as in "Example of a Decision Tree".

MarSt?
  Married: NO
  Single, Divorced: Refund?
    Yes: NO
    No:  TaxInc?
      < 80K:  NO
      >= 80K: YES

There could be more than one tree that fits the same data!
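The claim can be checked directly. The sketch below hard-codes the two trees as Python functions and counts their errors on the ten training records. The function names are illustrative, and treating an income of exactly 80K as the ">= 80K" branch is an assumption taken from the last Hunt's Algorithm slide.

```python
# Training data from the slides: (Tid, Refund, Marital Status, Taxable Income in K, Cheat).
TRAINING = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

def tree_refund_first(refund, marital, income):
    """Tree rooted at Refund ("Example of a Decision Tree")."""
    if refund == "Yes":
        return "No"
    if marital == "Married":
        return "No"
    return "No" if income < 80 else "Yes"

def tree_marst_first(refund, marital, income):
    """Tree rooted at MarSt ("Another Example of Decision Tree")."""
    if marital == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return "No" if income < 80 else "Yes"

def training_errors(tree):
    """Tids of training records that the tree misclassifies."""
    return [tid for tid, refund, marital, income, cheat in TRAINING
            if tree(refund, marital, income) != cheat]

print(training_errors(tree_refund_first), training_errors(tree_marst_first))  # [] []
```

Both trees classify every training record correctly. In this particular case they also encode the same decision function and differ only in the order in which the attributes are tested.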

Decision Tree Classification Task

Training set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Learn Model: training set → decision tree
Apply Model: decision tree + test set → predicted classes

Apply Model to Test Data

Start from the root of the tree.

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund?
  Yes: NO
  No:  MarSt?
    Single, Divorced: TaxInc?
      < 80K:  NO
      >= 80K: YES
    Married: NO

Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund?
  Yes: NO
  No:  MarSt?                        <- Refund = No
    Single, Divorced: TaxInc?
      < 80K:  NO
      >= 80K: YES
    Married: NO                      <- MarSt = Married: leaf reached

Assign Cheat to “No”.
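The traversal can be sketched as a loop over a small tree data structure. The tuple-based node representation, the helper names and the handling of exactly 80K as ">= 80K" (taken from the last Hunt's Algorithm slide) are assumptions made for this sketch.

```python
def same(value):
    """Branch label equals the attribute value (used for Refund)."""
    return value

def marst_branch(value):
    return "Married" if value == "Married" else "Single, Divorced"

def income_branch(value_k):
    return "< 80K" if value_k < 80 else ">= 80K"

# Internal node: (attribute, function mapping the attribute value to a branch label, children).
# Leaf: a class label string.
TAXINC = ("TaxInc", income_branch, {"< 80K": "NO", ">= 80K": "YES"})
MARST = ("MarSt", marst_branch, {"Single, Divorced": TAXINC, "Married": "NO"})
TREE = ("Refund", same, {"Yes": "NO", "No": MARST})

def classify(record, node=TREE):
    """Walk from the root to a leaf; return (class label, list of tests taken)."""
    path = []
    while not isinstance(node, str):
        attribute, to_branch, children = node
        branch = to_branch(record[attribute])
        path.append(f"{attribute} = {branch}")
        node = children[branch]
    return node, path

record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
label, path = classify(record)
print(" -> ".join(path), "=>", label)  # Refund = No -> MarSt = Married => NO
```

Only the attributes on the path from the root to the leaf are inspected; here Taxable Income is never tested.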

Decision Tree Induction Algorithms

A number of algorithms:
• Hunt’s
  – Hunt’s Algorithm (1966)
• Quinlan’s
  – ID3, Iterative Dichotomiser 3 (1975): uses entropy
  – C4.5 / 4.8 / 5.0 (1993): uses entropy
• Breiman’s
  – CART, Classification And Regression Trees (1984): uses Gini
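The slide names the two impurity measures without defining them. Using the standard definitions (entropy is -Σ p·log2 p and the Gini index is 1 - Σ p², summed over the class proportions p), a minimal sketch looks as follows; the function names are illustrative, and the labels are the Cheat column of the ten-record training table used elsewhere in these slides.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Relative frequency of each class among the labels."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def entropy(labels):
    """Entropy in bits: -sum(p * log2(p)) over the classes present."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

def gini(labels):
    """Gini index: 1 - sum(p^2) over the classes present."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

# Cheat column of the training data (Tid 1-10): 7 x No, 3 x Yes.
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(round(entropy(cheat), 3), round(gini(cheat), 3))  # 0.881 0.42
```

Both measures are 0 for a node whose records all share one class and grow as the classes become more evenly mixed.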

Hunt’s Algorithm

• In Hunt’s algorithm, a decision tree is grown recursively by partitioning the training records into successively purer subsets.

General Structure of Hunt’s Algorithm

• Let Dt be the set of training records that reach a node t.
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  – If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

Training data (Dt at the root node, whose label is still undetermined):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
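The three cases translate almost directly into a recursive function. The sketch below is a simplified illustration, not the original algorithm's full specification: it handles only categorical attributes (Taxable Income is left out), tests attributes in the order given instead of choosing the best one, uses the parent's majority class as the default class for children, and falls back to a majority leaf when no attributes remain. All names are hypothetical.

```python
from collections import Counter

def hunt(records, attributes, default_class):
    """records: list of (attribute dict, class); attributes: name -> possible values.

    Returns a class label (leaf) or a tuple (attribute name, {value: subtree}).
    """
    if not records:                          # empty Dt: leaf with the default class
        return default_class
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:                # all records in one class: leaf
        return labels[0]
    if not attributes:                       # nothing left to test (assumption: majority leaf)
        return majority
    # Mixed classes: split on an attribute test. Placeholder choice: first remaining attribute.
    name = next(iter(attributes))
    remaining = {k: v for k, v in attributes.items() if k != name}
    children = {}
    for value in attributes[name]:
        subset = [(r, y) for r, y in records if r[name] == value]
        children[value] = hunt(subset, remaining, majority)
    return (name, children)

# Refund and Marital Status columns of the training data (Tid 1-10), with the Cheat class.
rows = [("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
        ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
        ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
        ("No", "Single", "Yes")]
records = [({"Refund": r, "Marital Status": m}, c) for r, m, c in rows]
attributes = {"Refund": ["Yes", "No"],
              "Marital Status": ["Single", "Married", "Divorced"]}

tree = hunt(records, attributes, default_class="No")
print(tree)
```

Without the income attribute the Single records cannot be separated, so that branch ends in a majority leaf.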

General Structure of Hunt’s Algorithm
Input: Dataset D
Output: Decision tree t

Induce(D):
  If all tuples t in D have label + then return +
  If all tuples t in D have label - then return -
  For all split criteria C:
    D1,C = { t in D | t satisfies C }
    D2,C = D - D1,C
    Measure Quality(D1,C, D2,C)
  Let C be the best split
  Return the tree with root test C, whose "yes" branch is Induce(D1,C) and whose "no" branch is Induce(D2,C)

Training data: the ten records (Tid 1-10) with attributes Refund, Marital Status, Taxable Income and class Cheat.
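A runnable sketch of Induce follows. The pseudocode leaves the quality measure and the set of split criteria open, so the weighted Gini index and the three candidate criteria used here are assumptions, as are the dictionary-based tree representation and the majority-leaf fallback when no criterion separates the data. Class "+" stands for Cheat = Yes and "-" for Cheat = No.

```python
def gini(labels):
    """Gini index of a two-class label list ("+" / "-")."""
    if not labels:
        return 0.0
    p = sum(1 for y in labels if y == "+") / len(labels)
    return 1.0 - p * p - (1.0 - p) ** 2

def quality(d1, d2):
    """Weighted Gini of the two parts; lower is better (assumed quality measure)."""
    n = len(d1) + len(d2)
    return (len(d1) * gini([y for _, y in d1]) + len(d2) * gini([y for _, y in d2])) / n

def induce(data, criteria):
    """data: non-empty list of (record, label); criteria: name -> predicate on a record."""
    if not data:
        raise ValueError("induce() needs at least one tuple")
    labels = {y for _, y in data}
    if labels == {"+"}:
        return "+"
    if labels == {"-"}:
        return "-"
    best = None
    for name, test in criteria.items():
        d1 = [(x, y) for x, y in data if test(x)]
        d2 = [(x, y) for x, y in data if not test(x)]
        if not d1 or not d2:
            continue                      # criterion does not split the data
        score = quality(d1, d2)
        if best is None or score < best[0]:
            best = (score, name, d1, d2)
    if best is None:                      # assumption: fall back to a majority leaf
        plus = sum(1 for _, y in data if y == "+")
        return "+" if 2 * plus >= len(data) else "-"
    _, name, d1, d2 = best
    return {"test": name, "yes": induce(d1, criteria), "no": induce(d2, criteria)}

# Training data (Tid 1-10): (Refund, Marital Status, Taxable Income in K), class.
rows = [("Yes", "Single", 125, "-"), ("No", "Married", 100, "-"), ("No", "Single", 70, "-"),
        ("Yes", "Married", 120, "-"), ("No", "Divorced", 95, "+"), ("No", "Married", 60, "-"),
        ("Yes", "Divorced", 220, "-"), ("No", "Single", 85, "+"), ("No", "Married", 75, "-"),
        ("No", "Single", 90, "+")]
data = [((r, m, i), c) for r, m, i, c in rows]

# Hypothetical candidate criteria.
criteria = {
    "Refund = Yes": lambda x: x[0] == "Yes",
    "MarSt = Married": lambda x: x[1] == "Married",
    "TaxInc < 80K": lambda x: x[2] < 80,
}

tree = induce(data, criteria)
print(tree)
```

With these choices the root test is MarSt = Married, followed by Refund = Yes and then TaxInc < 80K, which is the shape of the tree in "Another Example of Decision Tree". A different quality measure or criteria set could produce a different tree.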
Hunt’s Algorithm

Step 1 (training data: Tid 1-10, as listed under "General Structure of Hunt’s Algorithm"): the tree is a single leaf.

Don’t Cheat

Hunt’s Algorithm

Step 2: split on Refund.

Refund?
  Yes: Don’t Cheat
  No:  Don’t Cheat

Hunt’s Algorithm

Step 3: under Refund = No, split on Marital Status.

Refund?
  Yes: Don’t Cheat
  No:  Marital Status?
    Single, Divorced: Cheat
    Married: Don’t Cheat

Hunt’s Algorithm

Step 4: under Marital Status = Single, Divorced, split on Taxable Income.

Refund?
  Yes: Don’t Cheat
  No:  Marital Status?
    Single, Divorced: Taxable Income?
      < 80K:  Don’t Cheat
      >= 80K: Cheat
    Married: Don’t Cheat
