Classification: Basic Concepts, Decision Trees, and Model Evaluation
Introduction to Data Mining, Chapter 4
08/09/2015
Classification

- Classification is the task of assigning objects to one of several predefined categories
  - Emails: SPAM or not?
  - Patients: high or low risk?
  - Astronomy: star, galaxy, nebula, etc.
  - News stories: finance, weather, entertainment, sports, etc.

Why?

- Descriptive modeling
- Predictive modeling: predict the class label of previously unseen records
  - Automatically assign a class label when presented with the attributes of the record
[Figure: general approach for building a classification model. A learning algorithm induces a model from a labeled Training Set (Induction); the model is then applied to a Test Set of records with unknown class labels (Deduction). A good model should correctly predict the class labels of such unseen data.]
Classification techniques

- Rule-based
- Naive Bayes
- Random forests
- k-nearest neighbors
- ...

Machine learning vs. data mining

- Machine learning is focused on developing and designing learning algorithms
- Data mining is performed by a person who has a goal in mind and uses machine learning techniques on a specific dataset
- Much of the work is concerned with data (pre)processing and feature engineering
Today

- Decision trees

Objectives for the learning algorithm

- Should fit the input data well (Induction: a model is learned from the Training Set)
- Should correctly predict class labels for unseen data (Deduction: the model is applied to the Test Set)
Confusion matrix

- Based on the number of records correctly and incorrectly predicted by the model
- Counts are tabulated in a table called the confusion matrix
- Various performance metrics are computed based on this matrix

                          Predicted class
                          Positive                Negative
  Actual     Positive     True Positives (TP)     False Negatives (FN)
  class      Negative     False Positives (FP)    True Negatives (TN)
Example: criminal trial (positive = guilty)

                          Predicted class
                          Convicted               Freed
  Actual     Guilty       True Positive (TP)      False Negative (FN)
  class      Innocent     False Positive (FP)     True Negative (TN)

- Type II Error (False Negative): failing to raise an alarm, i.e., letting a guilty person go free (error of impunity)
- Type I Error (False Positive): raising a false alarm, i.e., convicting an innocent person
Accuracy and error rate

- Accuracy: $\mathrm{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}} = \frac{TP + TN}{TP + FP + TN + FN}$
- Error rate: $\mathrm{Error\ rate} = \frac{\text{number of wrong predictions}}{\text{total number of predictions}} = \frac{FP + FN}{TP + FP + TN + FN} = 1 - \mathrm{Accuracy}$
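As a quick check (a minimal sketch, not from the slides), both metrics follow directly from the four confusion-matrix counts:

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of correct predictions: (TP + TN) / total."""
    return (tp + tn) / (tp + fp + tn + fn)

def error_rate(tp, fp, tn, fn):
    """Fraction of wrong predictions: (FP + FN) / total = 1 - accuracy."""
    return (fp + fn) / (tp + fp + tn + fn)

# Hypothetical counts: 50 TP, 10 FP, 35 TN, 5 FN
print(accuracy(50, 10, 35, 5))    # 0.85
print(error_rate(50, 10, 35, 5))  # 0.15
```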
Decision tree structure

- Internal node: exactly one incoming edge, two or more outgoing edges
[Figure: two different decision trees induced from the same training data (attributes Refund, Marital Status, Taxable Income; class label Cheat).]

- There could be more than one tree that fits the same data!
Apply model to test data

[Figure: applying the model to a test record. Start at the root node (Refund: Yes/No); for Refund = No, test Marital Status (MarSt); for Single or Divorced, test Taxable Income (TaxInc < 80K vs. > 80K); the leaf node reached (NO or YES) is the predicted class label.]
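A minimal sketch of that walk in code (not from the slides; the nested-dict representation and the simplified MarSt branches are my own, with Divorced omitted):

```python
# Hypothetical encoding of the tree from the figure.
tree = {
    "attr": "Refund",
    "branches": {
        "Yes": {"label": "NO"},                     # leaf
        "No": {
            "attr": "MarSt",
            "branches": {
                "Married": {"label": "NO"},         # leaf
                "Single": {
                    "attr": "TaxInc",
                    "threshold": 80_000,            # numeric test: < 80K vs >= 80K
                    "branches": {"lt": {"label": "NO"}, "ge": {"label": "YES"}},
                },
            },
        },
    },
}

def classify(node, record):
    """Follow the branch matching the record's attribute value until a leaf."""
    while "label" not in node:
        value = record[node["attr"]]
        if "threshold" in node:                     # continuous attribute
            value = "lt" if value < node["threshold"] else "ge"
        node = node["branches"][value]
    return node["label"]

print(classify(tree, {"Refund": "No", "MarSt": "Single", "TaxInc": 90_000}))  # YES
```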
Learning a decision tree

- Finding the optimal tree is computationally infeasible (NP-hard)
- Greedy strategies are used
  - Grow a decision tree by making a series of locally optimum decisions about which attribute to use for splitting the data
Hunt's algorithm

- Grows the tree recursively: if all records at a node belong to the same class, make the node a leaf; otherwise choose an attribute test, partition the records, and recurse on each child (a sketch follows below)

Design issues

- How to specify the attribute test condition?
- How to determine the best split?
- Determine when to stop splitting
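A self-contained sketch of that recursion (illustrative only, assuming categorical attributes and Gini as the split criterion; all function names are my own):

```python
from collections import Counter

def gini(records):
    """Gini impurity of a set of records (each a dict with a "class" key)."""
    n = len(records)
    return 1 - sum((c / n) ** 2 for c in Counter(r["class"] for r in records).values())

def hunt(records, attrs):
    labels = {r["class"] for r in records}
    if len(labels) == 1:                          # stopping criterion: pure node
        return {"label": labels.pop()}
    if not attrs:                                 # no attributes left: majority leaf
        return {"label": Counter(r["class"] for r in records).most_common(1)[0][0]}

    def split(attr):                              # group records by attribute value
        groups = {}
        for r in records:
            groups.setdefault(r[attr], []).append(r)
        return groups

    # Greedy step: pick the attribute whose split has the lowest weighted impurity.
    best = min(attrs, key=lambda a: sum(
        len(g) / len(records) * gini(g) for g in split(a).values()))
    node = {"attr": best, "branches": {}}
    rest = [a for a in attrs if a != best]
    for value, subset in split(best).items():
        node["branches"][value] = hunt(subset, rest)   # recurse on each child
    return node
```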
Attribute types for test conditions

- Nominal, e.g., CarType: Family, Sports, Luxury
- Ordinal, e.g., Size: Small, Medium, Large
- Continuous: many distinct values
  - Discretization to form an ordinal categorical attribute
  - Static: discretize once at the beginning
  - Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering (see the sketch below)
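The two bucketing schemes just mentioned might be sketched like this (a hedged illustration using NumPy; function names are my own):

```python
import numpy as np

def equal_width_bins(values, k):
    """Equal interval bucketing: k bins of identical width over the value range."""
    edges = np.linspace(min(values), max(values), k + 1)
    return np.digitize(values, edges[1:-1])       # bin index 0..k-1 for each value

def equal_freq_bins(values, k):
    """Equal frequency bucketing: bin edges at the empirical k-quantiles."""
    edges = np.quantile(values, np.linspace(0, 1, k + 1))
    return np.digitize(values, edges[1:-1])

incomes = [60, 70, 75, 85, 90, 95, 100, 110, 120, 125, 220]
print(equal_width_bins(incomes, 3))   # the outlier (220) skews width-based bins
print(equal_freq_bins(incomes, 3))    # quantile bins keep per-bin counts balanced
```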
Which test condition is the best?

- Non-homogeneous class distribution: high degree of impurity
- Homogeneous class distribution: low degree of impurity
Measures of node impurity

- $P(i|t)$ = fraction of records belonging to class $i$ at a given node $t$
- $c$ is the number of classes

Entropy:

$$\mathrm{Entropy}(t) = -\sum_{i=0}^{c-1} P(i|t) \log_2 P(i|t)$$

- Maximum ($\log_2 n_c$) when records are equally distributed among all classes, implying least information
- Minimum (0.0) when all records belong to one class, implying most information

Gini index:

$$\mathrm{Gini}(t) = 1 - \sum_{i=0}^{c-1} P(i|t)^2$$

- Maximum ($1 - 1/n_c$) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information

Classification error:

$$\mathrm{Error}(t) = 1 - \max_i P(i|t)$$

Exercise

Compute the Gini index for the following class distributions:

  Node  C1  C2
  (a)    0   6
  (b)    1   5
  (c)    2   4

Solution for (a): $P(C1) = 0/6 = 0$, $P(C2) = 6/6 = 1$,
$\mathrm{Gini} = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0$
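To verify the remaining rows (an illustrative Python sketch, not part of the slides), all three measures can be computed from the class counts:

```python
import math

def impurities(counts):
    """Entropy, Gini index, and classification error for a list of class counts."""
    n = sum(counts)
    probs = [c / n for c in counts]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    gini = 1 - sum(p ** 2 for p in probs)
    error = 1 - max(probs)
    return entropy, gini, error

for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, impurities(counts))
# (0, 6) -> (0.0, 0.0, 0.0): pure node, minimum impurity under every measure
# (2, 4) -> (~0.92, ~0.44, ~0.33)
```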
Determining the best split

- Let M0 be the impurity of the parent node before splitting
- For each candidate split, compute the weighted average impurity of its child nodes: M12 combines children M1 and M2; M34 combines children M3 and M4
- Compare Gain = M0 - M12 vs. M0 - M34 and choose the split with the higher gain (see the sketch below)
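In code (a minimal sketch under the same Gini-based setup; the counts are hypothetical):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gain(parent_counts, children_counts):
    """Gain = M0 minus the weighted average impurity of the child nodes."""
    n = sum(parent_counts)
    m0 = gini(parent_counts)
    weighted = sum(sum(ch) / n * gini(ch) for ch in children_counts)
    return m0 - weighted

parent = (7, 5)
split_a = [(4, 1), (3, 4)]    # children M1, M2 -> combined impurity M12
split_b = [(6, 2), (1, 3)]    # children M3, M4 -> combined impurity M34
print(gain(parent, split_a), gain(parent, split_b))  # pick the larger gain
```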
Stopping criteria and pruning

- Stop expanding a node when all the records belong to the same class
- Decision trees are easy to interpret for small-sized trees
- Post-pruning
  - Grow the decision tree to its entirety
  - Trim the nodes of the decision tree in a bottom-up fashion
  - If generalization error improves after trimming, replace the sub-tree by a leaf node
  - The class label of the leaf node is determined from the majority class of instances in the sub-tree

Estimating generalization error

- Holdout: reserve 2/3 for training and 1/3 for testing (validation set)
- Cross validation
  - Partition data into k disjoint subsets
  - k-fold: train on k-1 partitions, test on the remaining one
  - Leave-one-out: k = n
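A from-scratch sketch of k-fold cross validation (illustrative; `train` and `evaluate` stand in for whatever classifier and metric are being assessed):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Partition indices 0..n-1 into k disjoint, nearly equal-sized subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, k, train, evaluate):
    """Train on k-1 partitions, test on the remaining one; average the k scores."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        test_set = [data[j] for j in test_idx]
        train_set = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        model = train(train_set)
        scores.append(evaluate(model, test_set))
    return sum(scores) / k

# Leave-one-out is the special case k = len(data).
```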
Expressivity

[Figure: a decision tree with tests x < 0.43 and y < 0.33 partitions the unit square into axis-parallel rectangular regions labeled + and -.]

[Figure: a class boundary of the form x + y < 1 is not axis-parallel; a decision tree can only approximate such a boundary with many rectilinear splits, since each test involves a single attribute.]
Use-case:
Web Robot Detection
Assignment 1

Tasks

- Task 1: Given a training data set and a test set, build a classifier
  - You have to build it from scratch
  - You are free to pick your programming language
  - Submit code and predicted class labels for the test set
  - Accuracy has to reach a certain threshold
- Task 2: Submit a short report describing
  - What processing steps you applied
  - Which are the most important features of the dataset

Online evaluation

- A real-time leaderboard will be available for the submissions (updated after each git push)
- Two tracks, with results reported separately
Practicalities
- Data set and specific instructions will be made available today
- Deadlines