How to Construct an ROC Curve

- Use a classifier that produces a posterior probability P(+|A) for each test instance A
- Sort the instances according to P(+|A) in decreasing order
- Apply a threshold at each unique value of P(+|A)
- Count the number of TP, FP, TN, FN at each threshold
- TP rate, TPR = TP/(TP + FN)
- FP rate, FPR = FP/(FP + TN)

Instance  P(+|A)  True Class   FPR   TPR
   1       0.95       +         0    1/5
   2       0.93       +         0    2/5
   3       0.87       -        1/5   2/5
   4       0.85       -
   5       0.85       -
   6       0.85       +        3/5   3/5
   7       0.76       -        4/5   3/5
   8       0.53       +        4/5   4/5
   9       0.43       -         1    4/5
  10       0.25       +         1     1

(Instances 4-6 share the threshold value 0.85, so together they contribute a single (FPR, TPR) point.)
Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
[ROC curve: TPR plotted against FPR (axes 0.0 to 1.0) for the points in the table above]
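The construction above can be sketched in a few lines of Python (function and variable names are my own): sweep a threshold over each unique value of P(+|A) and count TP and FP to get one (FPR, TPR) point per threshold.

```python
# Sketch of the ROC construction described above, using the 10 instances
# from the table (scores P(+|A) and true classes).
def roc_points(scores, labels):
    """Return one (FPR, TPR) pair per unique score threshold, descending."""
    pos = sum(1 for y in labels if y == '+')
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '+')
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '-')
        points.append((fp / neg, tp / pos))
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']
pts = roc_points(scores, labels)
# e.g. the threshold 0.85 yields the single point (FPR, TPR) = (3/5, 3/5)
```

Note that the three tied scores at 0.85 produce one threshold, hence one point, matching the table.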
Model Evaluation

- How to compare the relative performance among competing models?
Comparing Performance of 2 Models

- Given two models, say M1 and M2, which is better?
- Usually the models are evaluated on the same test sample.
- Make use of the correlation between their predictions.

Errors of M1 and M2 are independent. Let C1 and C2 indicate whether M1 and M2, respectively, classify an instance correctly (1) or incorrectly (0):

Prob.                         C2
                 incorrect (0)   correct (1)
C1 incorrect (0)     0.04           0.16
   correct (1)       0.16           0.64

Let X = C1 - C2:

X      -1     0     +1
P(X)   0.16   0.68  0.16      E(X) = 0,  VAR(X) = 0.32
Errors of M1 and M2 are positively correlated:

Prob.                         C2
                 incorrect (0)   correct (1)
C1 incorrect (0)     0.18           0.02
   correct (1)       0.02           0.78

Let X = C1 - C2:

X      -1    0    +1
P(X)   .02   .96  .02       E(X) = 0
VAR(X) = E((X - E(X))^2) = 0.02(-1)^2 + 0.96(0)^2 + 0.02(1)^2 = 0.04

- Larger differences are more likely if errors are independent, and less likely if errors are positively correlated.
- Hence, an observed difference may be regarded as significant for models with positively correlated errors but not for models with independent errors.
- Our test should reflect (make use of) this property.
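The two variance computations above can be checked mechanically (function and variable names are my own):

```python
# E(X) and VAR(X) for X = C1 - C2 under the two joint tables above
# (independent vs. positively correlated errors).
def diff_moments(joint):
    """joint[(c1, c2)] = P(C1=c1, C2=c2) with c in {0, 1}; X = C1 - C2."""
    px = {-1: joint[(0, 1)],
           0: joint[(0, 0)] + joint[(1, 1)],
           1: joint[(1, 0)]}
    mean = sum(x * p for x, p in px.items())
    var = sum(p * (x - mean) ** 2 for x, p in px.items())
    return mean, var

independent = {(0, 0): 0.04, (0, 1): 0.16, (1, 0): 0.16, (1, 1): 0.64}
correlated = {(0, 0): 0.18, (0, 1): 0.02, (1, 0): 0.02, (1, 1): 0.78}
m1, v1 = diff_moments(independent)   # E(X) = 0, VAR(X) = 0.32
m2, v2 = diff_moments(correlated)    # E(X) = 0, VAR(X) = 0.04
```

The smaller variance under positive correlation is exactly why the same observed difference carries more evidence there.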
Comparing Performance of 2 Models

Test
    H0: e1 = e2
against
    Ha: e1 ≠ e2
where ei denotes the true error rate of model i.

Count                      Model M2
                     incorrect   correct
Model M1  incorrect      6          14
          correct       24          56
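The slides do not name a specific test here, so as an illustration (an assumption on my part) here is an exact two-sided sign test on the discordant pairs, which is one standard way to test H0: e1 = e2 on paired predictions:

```python
from math import comb

# Hypothetical helper: exact two-sided sign test on the discordant pairs
# (instances where exactly one model errs). Under H0: e1 = e2, the count
# b is Binomial(b + c, 1/2).
def sign_test_pvalue(b, c):
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p1 = sign_test_pvalue(14, 24)  # first table: not significant at 0.05
p2 = sign_test_pvalue(2, 12)   # second table below: significant at 0.05
```

Only the off-diagonal (discordant) counts enter the test; the cases where both models agree carry no information about which is better.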
Count                      Model M2
                     incorrect   correct
Model M1  incorrect     18           2
          correct       12          68

[Figure: probability distributions over the counts (x-axes 0-38 and 0-14)]
Data Mining
Classification: Alternative Techniques

Lecture Notes for Chapter 5
Introduction to Data Mining
by Tan, Steinbach, Kumar

Bayes (Generative) Classifier

- A probabilistic framework for solving classification problems
- Conditional probability:
    P(C|A) = P(A,C) / P(A)
    P(A|C) = P(A,C) / P(C)
- Bayes theorem:
    P(C|A) = P(A|C) P(C) / P(A)
    P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002

- Can we estimate P(C|A1, A2, ..., An) directly from data?
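Plugging the numbers above into Bayes' theorem (variable names are my own):

```python
# Bayes' theorem with the values from the example above.
p_s_given_m = 0.5        # P(S|M)
p_m = 1 / 50000          # P(M)
p_s = 1 / 20             # P(S)
p_m_given_s = p_s_given_m * p_m / p_s   # P(M|S) = 0.0002
```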
- X and Y are independent iff P(X,Y) = P(X) P(Y) or, equivalently, P(X|Y) = P(X).
  Intuition: Y doesn't provide any information about X (and vice versa).
- X and Y are independent given Z iff P(X,Y|Z) = P(X|Z) P(Y|Z) or, equivalently, P(X|Y,Z) = P(X|Z).

Naïve Bayes Classifier

- Assume independence among the attributes Ai when the class is given:
    P(A1, A2, ..., An | Cj) = P(A1|Cj) P(A2|Cj) ... P(An|Cj)
- Can estimate P(Ai|Cj) for all Ai and Cj.
- Now we only need to estimate m × n probabilities per class.
How to Estimate Probabilities from Data?

Tid  Refund  Marital Status  Taxable Income  Evade
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single          70K             No
 4   Yes     Married         120K            No
 5   No      Divorced        95K             Yes
 6   No      Married         60K             No
 7   Yes     Divorced        220K            No
 8   No      Single          85K             Yes
 9   No      Married         75K             No
10   No      Single          90K             Yes

- Class: P(C) = Nc/N, e.g., P(No) = 7/10, P(Yes) = 3/10
- For discrete attributes: P(Ai|Ck) = |Aik| / Nc, where |Aik| is the number of instances having attribute value Ai and belonging to class Ck.
  Examples: P(Status=Married|No) = 4/7, P(Refund=Yes|Yes) = 0
- For continuous attributes:
  - Discretize the range into bins
  - Two-way split: (A < v) or (A >= v); choose only one of the two splits as the new attribute
  - Probability density estimation:
    - Assume the attribute follows a normal distribution
    - Use the data to estimate the parameters of the distribution (i.e., mean and standard deviation)
    - Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai|c)
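The counting estimates above can be sketched directly from the 10-record table (variable names are my own; only the columns used in the estimates are kept):

```python
# (Refund, Marital Status, Evade) for the 10 records in the table above.
records = [
    ("Yes", "Single",   "No"),  ("No", "Married",  "No"),
    ("No",  "Single",   "No"),  ("Yes", "Married", "No"),
    ("No",  "Divorced", "Yes"), ("No", "Married",  "No"),
    ("Yes", "Divorced", "No"),  ("No", "Single",   "Yes"),
    ("No",  "Married",  "No"),  ("No", "Single",   "Yes"),
]
n = len(records)
n_no = sum(1 for _, _, c in records if c == "No")        # Nc = 7
p_class_no = n_no / n                                    # P(No) = 7/10
p_married_given_no = (
    sum(1 for _, m, c in records if m == "Married" and c == "No") / n_no
)                                                        # P(Status=Married|No) = 4/7
```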
How to Estimate Probabilities from Data?

- Normal distribution:

    P(Ai|cj) = 1/sqrt(2π σij²) · exp(-(Ai - μij)² / (2 σij²))

  One for each (Ai, cj) pair.
- For (Income, Class=No): sample mean = 110, sample variance = 2975

    P(Income=120|No) = 1/sqrt(2π · 2975) · exp(-(120 - 110)² / (2 · 2975)) = 0.0072

[Plot: estimated class-conditional densities of Income for Class=No and Class=Yes]
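The density estimate above, as a short sketch (function name is my own):

```python
from math import exp, pi, sqrt

# Normal density with the sample parameters for Income given Class=No.
def normal_density(x, mean, var):
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

p_income_120_no = normal_density(120, 110, 2975)   # ≈ 0.0072
```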
Example of Naïve Bayes Classifier

Given a test record: X = (Refund=No, Married, Income=120K)

Naïve Bayes classifier:
    P(Refund=Yes|No) = 3/7              P(Refund=No|No) = 4/7
    P(Refund=Yes|Yes) = 0               P(Refund=No|Yes) = 1
    P(Marital Status=Single|No) = 2/7
    P(Marital Status=Divorced|No) = 1/7
    P(Marital Status=Married|No) = 4/7
    P(Marital Status=Single|Yes) = 2/7
    P(Marital Status=Divorced|Yes) = 1/7
    P(Marital Status=Married|Yes) = 0
    For taxable income:
        If class=No: sample mean = 110, sample variance = 2975
        If class=Yes: sample mean = 90, sample variance = 25

- P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No)
                = 4/7 × 4/7 × 0.0072 = 0.0024
- P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes)
                 = 1 × 0 × 1.2 × 10^-9 = 0

Since P(X|No) P(No) > P(X|Yes) P(Yes), therefore P(No|X) > P(Yes|X) => Class = No

Naïve Bayes Classifier

- If one of the conditional probabilities is zero, then the entire expression becomes zero.
- Probability estimation:
    Original:  P(Ai|C) = Nic / Nc
    Laplace:   P(Ai|C) = (Nic + 1) / (Nc + a)
  where a is the number of values of attribute Ai.
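The worked example and the Laplace correction above, as a sketch (names are my own):

```python
# Per-attribute conditionals for X = (Refund=No, Married, Income=120K).
p_x_no = (4 / 7) * (4 / 7) * 0.0072     # P(X|No) ≈ 0.0024
p_x_yes = 1 * 0 * 1.2e-9                # P(X|Yes) = 0: one zero factor kills it
p_no, p_yes = 7 / 10, 3 / 10
predicted = "No" if p_x_no * p_no > p_x_yes * p_yes else "Yes"

# Laplace smoothing from the slide avoids the zero count:
def laplace(n_ic, n_c, a):
    """P(Ai|C) = (Nic + 1) / (Nc + a), a = number of values of Ai."""
    return (n_ic + 1) / (n_c + a)

# P(Marital Status=Married|Yes): Nic = 0, Nc = 3, a = 3 values
p_married_yes_smoothed = laplace(0, 3, 3)   # 1/6 instead of 0
```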
Example of Naïve Bayes Classifier

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record:
Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?

A: attributes, M: mammals, N: non-mammals

    P(A|M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
    P(A|N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
    P(A|M) P(M) = 0.06 × 7/20 = 0.021
    P(A|N) P(N) = 0.0042 × 13/20 = 0.0027

P(A|M) P(M) > P(A|N) P(N) => Mammals

Naïve Bayes (Summary)

- Robust to isolated noise points
- Handle missing values by ignoring the instance during probability estimate calculations
- Robust to irrelevant attributes
- Independence assumption may not hold for some attributes; use other techniques such as Bayesian Belief Networks (BBN)
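The mammals computation above, written out (variable names are my own):

```python
# Per-attribute conditionals for the test record
# (Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no),
# with 7 mammals and 13 non-mammals out of 20 training animals.
p_a_given_m = (6/7) * (6/7) * (2/7) * (2/7)        # ≈ 0.06
p_a_given_n = (1/13) * (10/13) * (3/13) * (4/13)   # ≈ 0.0042
score_m = p_a_given_m * 7 / 20    # P(A|M) P(M) ≈ 0.021
score_n = p_a_given_n * 13 / 20   # P(A|N) P(N) ≈ 0.0027
label = "mammals" if score_m > score_n else "non-mammals"
```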
- Independence assumption may not hold for (some) attributes, but:
- If we evaluate on error rate, then all that matters, in the binary case, is whether the probability estimate is on the right side of 0.5.
- With more than two classes similar reasoning applies, but the margin of error becomes smaller.
- For the ROC curve, what matters is that we get the probabilities in the right order.

Suppose P(Yes|A1=a1, ..., An=an) = 0.7 is the true probability of class Yes for a given attribute vector. To minimize the error rate, we should classify this attribute vector as Yes. As long as our estimate P̂(Yes|A1=a1, ..., An=an) > 0.5, we will assign the optimal class, even though the probability estimate itself may be way off! If we evaluate on likelihood, this doesn't fly!
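A toy illustration of this point (the numbers are my own): a poorly calibrated estimate still picks the error-minimizing class as long as it falls on the same side of 0.5 as the true probability.

```python
true_p = 0.7       # true P(Yes | a1, ..., an)
estimate = 0.51    # far from 0.7, but still above 0.5
optimal = "Yes" if true_p > 0.5 else "No"
predicted = "Yes" if estimate > 0.5 else "No"
# Same decision despite the bad estimate; a likelihood-based evaluation
# would penalize the miscalibration heavily.
```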
Example

C=0:
         A2=0  A2=1  P(A1)
A1=0     0.3   0.1   0.4
A1=1     0.1   0.5   0.6
P(A2)    0.4   0.6   1

C=1:
         A2=0  A2=1  P(A1)
A1=0     0.6   0.1   0.7
A1=1     0.1   0.2   0.3
P(A2)    0.7   0.3   1

P(C=0) = 1/2, P(C=1) = 1/2.

P(C=0|A1=1, A2=1)
  = P(A1=1, A2=1|C=0) P(C=0) / [P(A1=1, A2=1|C=0) P(C=0) + P(A1=1, A2=1|C=1) P(C=1)]
  = (0.5 × 0.5) / (0.5 × 0.5 + 0.2 × 0.5) = 0.25/0.35 = 5/7 ≈ 0.71
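The posterior computation above can be checked from the two joint tables (variable names are my own; the outer key is A1, the inner key is A2):

```python
# Joint distributions P(A1, A2 | C) from the tables above.
joint_c0 = {0: {0: 0.3, 1: 0.1}, 1: {0: 0.1, 1: 0.5}}
joint_c1 = {0: {0: 0.6, 1: 0.1}, 1: {0: 0.1, 1: 0.2}}
prior = 0.5   # P(C=0) = P(C=1) = 1/2

num = joint_c0[1][1] * prior                 # P(A1=1, A2=1|C=0) P(C=0)
den = num + joint_c1[1][1] * prior           # total probability of the evidence
posterior = num / den                        # P(C=0|A1=1, A2=1) = 5/7 ≈ 0.71
```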