© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004. Dr. Faisal Kamiran, Information Technology University
Classification: Definition
Illustrating Classification Task
Apply Model:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of Classification Task
Metrics of Classifier Performance
Example of a Decision Tree
Splitting Attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
Another Example of Decision Tree
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
MarSt
  Married: NO
  Single, Divorced: Refund
    Yes: NO
    No: TaxInc
      < 80K: NO
      > 80K: YES
There could be more than one tree that fits the same data!
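The claim that more than one tree fits the same data can be checked directly. The sketch below hard-codes the ten training records and two trees: the MarSt-rooted tree on this slide and the Refund-rooted tree used in the "Apply Model to Test Data" slides. The function names are illustrative, and sending an income of exactly 80K down the "> 80K" branch is an assumption; the slides do not say which branch 80K takes.

```python
# Training records from the slide:
# (Tid, Refund, Marital Status, Taxable Income in K, Cheat)
RECORDS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]


def income_leaf(income):
    # Assumption: exactly 80K follows the "> 80K" branch.
    return "No" if income < 80 else "Yes"


def tree_marst_root(refund, marital, income):
    """MarSt at the root, then Refund, then TaxInc."""
    if marital == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return income_leaf(income)


def tree_refund_root(refund, marital, income):
    """Refund at the root, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if marital == "Married":
        return "No"
    return income_leaf(income)


def training_errors(tree):
    """Return the Tids of training records that the tree misclassifies."""
    return [tid for tid, refund, marital, income, cheat in RECORDS
            if tree(refund, marital, income) != cheat]


if __name__ == "__main__":
    print(training_errors(tree_marst_root))   # []
    print(training_errors(tree_refund_root))  # []
```

Both trees test the attributes in a different order, yet neither misclassifies any of the ten records.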
Decision Tree Classification Task
Model: Decision Tree
Apply Model:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Apply Model to Test Data
Start from the root of tree.
Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?
Refund
  Yes: NO
  No: MarSt
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
    Married: NO
Apply Model to Test Data
Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?
Refund
  Yes: NO
  No: MarSt
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
    Married: NO
Assign Cheat to “No”
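The walk from the root to a leaf can be sketched as a loop over a nested structure. The dictionary layout, the `classify` function, and the `taxinc_branch` helper are hypothetical names chosen for illustration; sending exactly 80K down the "> 80K" branch is an assumption, although this test record never reaches the TaxInc node.

```python
def taxinc_branch(value):
    # Assumption: an income of exactly 80K follows the "> 80K" branch.
    return "< 80K" if value < 80 else "> 80K"


# TaxInc subtree: a numeric test, so it carries a branch-selection function.
TAXINC = {
    "attribute": "TaxInc",
    "select": taxinc_branch,
    "branches": {"< 80K": "No", "> 80K": "Yes"},
}

# Refund-rooted tree from the slide; leaves are Cheat labels.
TREE = {
    "attribute": "Refund",
    "branches": {
        "Yes": "No",
        "No": {
            "attribute": "MarSt",
            "branches": {
                "Married": "No",
                "Single": TAXINC,
                "Divorced": TAXINC,
            },
        },
    },
}


def classify(node, record):
    """Follow the record from the root to a leaf; return (label, path taken)."""
    path = []
    while isinstance(node, dict):
        attr = node["attribute"]
        select = node.get("select", lambda v: v)  # categorical: value is the branch
        branch = select(record[attr])
        path.append((attr, branch))
        node = node["branches"][branch]
    return node, path


if __name__ == "__main__":
    test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
    label, path = classify(TREE, test_record)
    print(path)   # [('Refund', 'No'), ('MarSt', 'Married')]
    print(label)  # No
```

The path shows the two tests applied to the record (Refund, then MarSt) before the leaf NO is reached.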
Decision Tree Classification Task
Model: Decision Tree
Apply Model:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Decision Tree Induction Algorithms
A number of algorithms:
• Hunt’s
– Hunt's Algorithm (1966)
• Quinlan's
– ID3, Iterative Dichotomiser 3 (1975), uses Entropy
– C4.5 / 4.8 / 5.0 (1993), uses Entropy
• Breiman's
– CART: Classification And Regression Trees (1984), uses Gini
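The two impurity measures named above can be computed from class labels alone. A minimal sketch, using the Cheat column of the ten-record example from the earlier slides (7 No, 3 Yes); the function names are illustrative and not taken from any of the listed algorithms' implementations.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)) over the classes present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 - sum(p_i ** 2) over the classes present."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


if __name__ == "__main__":
    cheat = ["No"] * 7 + ["Yes"] * 3
    print(round(entropy(cheat), 3))  # 0.881
    print(round(gini(cheat), 3))     # 0.42
```

Both measures are zero for a node whose records all share one class and largest when the classes are evenly mixed.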
Hunt’s Algorithm
General Structure of Hunt’s Algorithm
Let Dt be the set of training records that reach a node t
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
General Structure of Hunt’s Algorithm
Input: Dataset D
Output: Decision tree t
Induce(D):
If all tuples t in D have label + then
[Figure: remaining steps not recovered; a test node with a "yes" branch to Induce(D1,C) and a "no" branch to Induce(D2,C)]
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
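The recursive structure of Induce can be sketched as follows. This is not the slide's exact pseudocode, which is only partly recoverable: the sketch assumes a multiway split on categorical values instead of the yes/no split into D1 and D2, picks the split attribute by lowest weighted Gini, falls back to the majority class when no attributes remain, and pre-bins Taxable Income at 80K. All names (`induce`, `partition`, `predict`, `ROWS`) are hypothetical.

```python
from collections import Counter

# Ten-record example; Taxable Income is pre-binned at 80K (an assumption).
_RAW = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
ROWS = [
    {"Refund": r, "MarSt": m, "TaxInc": "<80K" if inc < 80 else ">=80K", "Cheat": c}
    for r, m, inc, c in _RAW
]


def gini(rows):
    """Gini index of the Cheat labels in rows."""
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in Counter(r["Cheat"] for r in rows).values())


def partition(rows, attr):
    """Group rows by their value of attr."""
    parts = {}
    for r in rows:
        parts.setdefault(r[attr], []).append(r)
    return parts


def induce(rows, attrs):
    """Grow a tree for the records D_t that reach the current node."""
    labels = [r["Cheat"] for r in rows]
    # All records in D_t share one class: the node becomes a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Assumption: with no attributes left, return a majority-class leaf.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]

    # Assumption: choose the attribute with the lowest weighted Gini.
    def weighted_gini(attr):
        return sum(len(p) / len(rows) * gini(p) for p in partition(rows, attr).values())

    best = min(attrs, key=weighted_gini)
    rest = [a for a in attrs if a != best]
    # Recurse on each subset, as in Induce(D1, ...) and Induce(D2, ...).
    return {best: {v: induce(p, rest) for v, p in partition(rows, best).items()}}


def predict(tree, row):
    """Follow row down the induced tree to a leaf label."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree


if __name__ == "__main__":
    tree = induce(ROWS, ["Refund", "MarSt", "TaxInc"])
    print(tree)
    print(all(predict(tree, r) == r["Cheat"] for r in ROWS))  # True
```

With these assumptions the root test is MarSt, so the induced tree differs in shape from the Refund-rooted tree on the earlier slides while still fitting all ten training records.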
Hunt’s Algorithm
Tid  Refund  Marital Status  Taxable Income  Cheat
Hunt’s Algorithm
Tid  Refund  Marital Status  Taxable Income  Cheat
[Figure: partial tree. Leaf: Don’t Cheat. Marital Status node: Single, Divorced -> Cheat; Married -> Don’t Cheat]
Hunt’s Algorithm
Tid  Refund  Marital Status  Taxable Income  Cheat
[Figure: partial tree; surviving leaf labels: Don’t Cheat, Cheat]