
How to Construct an ROC Curve

- Use a classifier that produces a posterior probability P(+|A) for each test instance A.
- Sort the instances by P(+|A) in decreasing order.
- Apply a threshold at each unique value of P(+|A).
- Count the number of TP, FP, TN, FN at each threshold.
- TP rate: TPR = TP / (TP + FN)
- FP rate: FPR = FP / (FP + TN)

    Instance   P(+|A)   True class   FPR   TPR
    1          0.95     +            0     1/5
    2          0.93     +            0     2/5
    3          0.87     -            1/5   2/5
    4          0.85     -
    5          0.85     -
    6          0.85     +            3/5   3/5
    7          0.76     -            4/5   3/5
    8          0.53     +            4/5   4/5
    9          0.43     -            1     4/5
    10         0.25     +            1     1

(Instances 4, 5, and 6 tie at P(+|A) = 0.85, so that threshold produces the single point (3/5, 3/5).)
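
As an illustration (my own sketch in Python, not part of the original slides), the same procedure applied to the scores and labels from the table above:

    scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
    labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']   # true classes

    pairs = sorted(zip(scores, labels), reverse=True)   # sort by P(+|A), decreasing
    n_pos = labels.count('+')
    n_neg = labels.count('-')

    roc_points = []
    for threshold in sorted(set(scores), reverse=True):
        # Predict '+' for every instance whose score is >= the threshold.
        tp = sum(1 for s, y in pairs if s >= threshold and y == '+')
        fp = sum(1 for s, y in pairs if s >= threshold and y == '-')
        roc_points.append((fp / n_neg, tp / n_pos))      # (FPR, TPR)

    print(roc_points)   # matches the non-blank (FPR, TPR) rows of the table, as decimals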

How to Construct an ROC Curve

A second example, in which the classifier ranks every positive instance above every negative one:

    Instance   P(+)   True class   FPR   TPR
    1          0.95   +            0     1/5
    2          0.93   +            0     2/5
    3          0.87   +            0     3/5
    4          0.85   +            0     4/5
    5          0.83   +            0     1
    6          0.80   -            1/5   1
    7          0.76   -            2/5   1
    8          0.53   -            3/5   1
    9          0.43   -            4/5   1
    10         0.25   -            1     1

How to Construct an ROC Curve

A third example, in which positive and negative instances alternate in the ranking:

    Instance   P(+)   True class   FPR   TPR
    1          0.95   +            0     1/5
    2          0.93   -            1/5   1/5
    3          0.87   +            1/5   2/5
    4          0.85   -            2/5   2/5
    5          0.83   +            2/5   3/5
    6          0.80   -            3/5   3/5
    7          0.76   +            3/5   4/5
    8          0.53   -            4/5   4/5
    9          0.43   +            4/5   1
    10         0.25   -            1     1

[ROC plot: TPR versus FPR for this table.]
How to Construct an ROC Curve

[ROC plot: TPR versus FPR.]

Model Evaluation

- Metrics for Performance Evaluation: How to evaluate the performance of a model?
- Methods for Performance Evaluation: How to obtain reliable estimates?
- Methods for Model Comparison: How to compare the relative performance among competing models?

Confidence Interval for Accuracy

- Prediction can be regarded as a Bernoulli trial.
  - A Bernoulli trial has 2 possible outcomes.
  - Possible outcomes for a prediction: correct or wrong.
  - A collection of Bernoulli trials has a binomial distribution: x ~ Bin(N, p), where x is the number of correct predictions.
  - E.g.: toss a fair coin 50 times; how many heads would turn up? Expected number of heads = N × p = 50 × 0.5 = 25.
- Given x (# of correct predictions), or equivalently acc = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)?

Confidence Interval for Accuracy

- For large test sets (N > 30), acc has approximately a normal distribution with mean p and variance p(1-p)/N:

    P( -Z_{α/2} < (acc - p) / sqrt( p(1-p)/N ) < Z_{α/2} ) = 1 - α

  [Figure: standard normal density; the central area between -Z_{α/2} and Z_{α/2} equals 1 - α.]

- Equivalently, a (1-α)·100% confidence interval for p:

    acc ± Z_{α/2} · sqrt( acc (1-acc) / N )

Confidence Interval for Accuracy

- Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  - N = 100, acc = 0.8
  - Let 1 - α = 0.95 (95% confidence)
  - From the probability table, Z_{α/2} = 1.96

        1 - α    Z_{α/2}
        0.99     2.58
        0.98     2.33
        0.95     1.96
        0.90     1.65

- The resulting interval for different test-set sizes (acc = 0.8):

        N          50      100     500     1000    5000
        p(lower)   0.689   0.722   0.765   0.775   0.789
        p(upper)   0.911   0.878   0.835   0.825   0.811
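
A small Python sketch (my own, not from the slides) that applies the interval acc ± Z_{α/2}·sqrt(acc(1-acc)/N) and reproduces the table above:

    import math

    def accuracy_confidence_interval(acc, n, z=1.96):
        """Approximate (1 - alpha) confidence interval for the true accuracy p,
        assuming acc is normal with mean p and variance p(1-p)/N (large N)."""
        half_width = z * math.sqrt(acc * (1.0 - acc) / n)
        return acc - half_width, acc + half_width

    for n in (50, 100, 500, 1000, 5000):
        lower, upper = accuracy_confidence_interval(0.8, n)
        print(f"N={n:5d}  p(lower)={lower:.3f}  p(upper)={upper:.3f}")
    # e.g. N=  100  p(lower)=0.722  p(upper)=0.878, matching the table above
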
Comparing Performance of 2 Models

- Given two models, say M1 and M2, which is better?
- Usually the models are evaluated on the same test sample.
- Make use of the correlation between their predictions.

Errors of M1 and M2 are independent

(C1 and C2 indicate whether M1 and M2, respectively, classify a given instance correctly (1) or incorrectly (0).)

    Prob.              C2 incorrect (0)   C2 correct (1)
    C1 incorrect (0)   0.04               0.16
    C1 correct (1)     0.16               0.64

Let X = C1 - C2:

    X      -1     0      +1
    P(X)   0.16   0.68   0.16

    E(X) = 0
    VAR(X) = E((X - E(X))²) = 0.16·(-1)² + 0.68·(0)² + 0.16·(1)² = 0.32

Strong positive correlation

    Prob.              C2 incorrect (0)   C2 correct (1)
    C1 incorrect (0)   0.18               0.02
    C1 correct (1)     0.02               0.78

Let X = C1 - C2:

    X      -1     0      +1
    P(X)   0.02   0.96   0.02

    E(X) = 0
    VAR(X) = E((X - E(X))²) = 0.02·(-1)² + 0.96·(0)² + 0.02·(1)² = 0.04

Comparing Performance of 2 Models

- Larger differences are more likely if the errors are independent, and less likely if the errors are positively correlated.
- Hence, an observed difference may be regarded as significant for models with positively correlated errors but not for models with independent errors.
- Our test should reflect (make use of) this property.
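
A quick Python sketch (my own, with variable names of my choosing) that recomputes Var(X) for both joint tables above:

    def variance_of_difference(p_both_wrong, p_only_m1_wrong, p_only_m2_wrong, p_both_right):
        """Var(X) for X = C1 - C2, where the arguments are the joint probabilities
        P(C1, C2) in the order (0,0), (0,1), (1,0), (1,1)."""
        p = {-1: p_only_m1_wrong, 0: p_both_wrong + p_both_right, +1: p_only_m2_wrong}
        mean = sum(x * px for x, px in p.items())
        return sum((x - mean) ** 2 * px for x, px in p.items())

    print(variance_of_difference(0.04, 0.16, 0.16, 0.64))   # independent errors: 0.32
    print(variance_of_difference(0.18, 0.02, 0.02, 0.78))   # positively correlated: 0.04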

Comparing Performance of 2 Models

- Make a cross-table of the correct and incorrect predictions of M1 and M2:

    Count          M2 incorrect   M2 correct
    M1 incorrect   a              b
    M1 correct     c              d

- Ignore cells a and d (both incorrect, both correct).
- If the models were equally good, we would expect the counts in cells b and c to be in balance.
- Under the null hypothesis that the models have the same error rate, the count in cell b has a binomial distribution with n = n(b) + n(c) and p = 0.5.
Comparing Performance of 2 Models

We test the null hypothesis

    H0 : e1 = e2

against

    Ha : e1 ≠ e2

where e_i denotes the true error rate of model i.

Errors of M1 and M2 are independent

    Count          M2 incorrect   M2 correct
    M1 incorrect   6              14
    M1 correct     24             56

[Figure: binomial distribution with n = 38 and p = 0.5. The observed count b = 14 gives a p-value of 0.14.]

Comparing Performance of 2 Models

Errors of M1 and M2 are positively correlated

    Count          M2 incorrect   M2 correct
    M1 incorrect   18             2
    M1 correct     12             68

[Figure: binomial distribution with n = 14 and p = 0.5. The observed count b = 2 gives a p-value of 0.012.]

Comparing Performance of 2 Models

Although the difference in error rate is the same in both cases, the independent case produced a p-value of 0.14 (typically not regarded as significant), leading to the conclusion that we cannot reject the null hypothesis that both models have the same error rate.

The example with positively correlated errors produces a p-value of 0.012, leading to the conclusion that M1 has a significantly lower error rate than M2.
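
A short Python sketch (my own) of this exact binomial test; the counts 14/24 and 2/12 are the off-diagonal cells of the two tables above:

    from math import comb

    def exact_binomial_p_value(b, c):
        """Two-sided exact test of H0: b ~ Bin(n, 0.5) with n = b + c,
        where b and c are the two off-diagonal counts."""
        n = b + c
        k = min(b, c)
        one_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * one_tail)

    print(exact_binomial_p_value(14, 24))   # independent errors: ~0.14
    print(exact_binomial_p_value(2, 12))    # positively correlated: ~0.013 (0.012 on the slide,
                                            # depending on how the two-sided p-value is defined)
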
Data Mining
Classification: Alternative Techniques

Lecture Notes for Chapter 5
Introduction to Data Mining
by Tan, Steinbach, Kumar

Bayes (Generative) Classifier

- A probabilistic framework for solving classification problems
- Conditional probability:

    P(C | A) = P(A, C) / P(A)
    P(A | C) = P(A, C) / P(C)

- Bayes theorem:

    P(C | A) = P(A | C) P(C) / P(A)

Example of Bayes Theorem

- Given:
  - A doctor knows that meningitis causes a stiff neck 50% of the time.
  - The prior probability of any patient having meningitis is 1/50,000.
  - The prior probability of any patient having a stiff neck is 1/20.
- If a patient has a stiff neck, what is the probability that he/she has meningitis?

    P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002

Bayesian Classifiers

- Consider each attribute and the class label as random variables.
- Given a record with attributes (A1, A2, ..., An), the goal is to predict the class C.
  - Specifically, we want to find the value of C that maximizes P(C | A1, A2, ..., An).
- Can we estimate P(C | A1, A2, ..., An) directly from data?

Bayesian Classifiers

- Approach: compute the posterior probability P(C | A1, A2, ..., An) for all values of C using Bayes theorem:

    P(C | A1 A2 ... An) = P(A1 A2 ... An | C) P(C) / P(A1 A2 ... An)

- Choose the value of C that maximizes P(C | A1, A2, ..., An).
- This is equivalent to choosing the value of C that maximizes P(A1, A2, ..., An | C) P(C).
- How to estimate P(A1, A2, ..., An | C)?

Curse of dimensionality

- How to estimate P(A1, A2, ..., An | C)?
- If each attribute is discrete with, say, 5 possible values, then estimating each possible combination requires 5^n probabilities per class.
- For 10 attributes (n = 10) this is about ten million probabilities; in general, m^n probabilities per class, where m is the number of values per attribute.
- This simple approach runs into the curse of dimensionality.
- To be practical, we need to make some simplifying assumptions.
Conditional Independence

- X and Y are independent iff P(X, Y) = P(X) P(Y), or, equivalently, P(X | Y) = P(X).
  - Intuition: Y doesn't provide any information about X (and vice versa).
- X and Y are independent given Z iff P(X, Y | Z) = P(X | Z) P(Y | Z), or, equivalently, P(X | Y, Z) = P(X | Z).
  - Intuition: if we know the value of Z, then Y doesn't provide any information about X (and vice versa).

Naïve Bayes Classifier

- Assume independence among the attributes Ai when the class is given:

    P(A1, A2, ..., An | Cj) = P(A1 | Cj) P(A2 | Cj) ... P(An | Cj)

- We can estimate P(Ai | Cj) for all Ai and Cj.
- Now we only need to estimate m × n probabilities per class.
- A new point is classified as Cj if P(Cj) ∏ P(Ai | Cj) is maximal.

How to Estimate Probabilities from Data?

    Tid   Refund   Marital Status   Taxable Income   Evade
    1     Yes      Single           125K             No
    2     No       Married          100K             No
    3     No       Single           70K              No
    4     Yes      Married          120K             No
    5     No       Divorced         95K              Yes
    6     No       Married          60K              No
    7     Yes      Divorced         220K             No
    8     No       Single           85K              Yes
    9     No       Married          75K              No
    10    No       Single           90K              Yes

(Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class.)

- Class prior: P(C) = Nc / N
  - e.g., P(No) = 7/10, P(Yes) = 3/10
- For discrete attributes: P(Ai | Ck) = |Aik| / Nc, where |Aik| is the number of instances having attribute value Ai and belonging to class Ck.
  - Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0

How to Estimate Probabilities from Data?

- For continuous attributes:
  - Discretize the range into bins.
  - Two-way split: (A < v) or (A ≥ v); choose only one of the two splits as the new attribute.
  - Probability density estimation:
    - Assume the attribute follows a normal distribution.
    - Use the data to estimate the parameters of the distribution (i.e., mean and standard deviation).
    - Once the probability distribution is known, use it to estimate the conditional probability P(Ai | c).
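
As a sketch (my own Python code; `records` and the helper names are mine), the counting estimates above can be reproduced as follows:

    # Each record: (Refund, Marital Status, Taxable Income in K, Evade).
    records = [
        ("Yes", "Single",   125, "No"),  ("No",  "Married", 100, "No"),
        ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
        ("No",  "Divorced",  95, "Yes"), ("No",  "Married",  60, "No"),
        ("Yes", "Divorced", 220, "No"),  ("No",  "Single",   85, "Yes"),
        ("No",  "Married",   75, "No"),  ("No",  "Single",   90, "Yes"),
    ]

    def prior(cls):
        """P(C) = Nc / N."""
        return sum(1 for r in records if r[3] == cls) / len(records)

    def conditional(attr_index, value, cls):
        """P(Ai = value | Class = cls) = |Aik| / Nc."""
        n_c = sum(1 for r in records if r[3] == cls)
        n_ic = sum(1 for r in records if r[3] == cls and r[attr_index] == value)
        return n_ic / n_c

    print(prior("No"))                         # 7/10 = 0.7
    print(conditional(1, "Married", "No"))     # 4/7
    print(conditional(0, "Yes", "Yes"))        # 0.0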

How to Estimate Probabilities from Data?

- Normal distribution (one for each (Ai, cj) pair):

    P(Ai | cj) = 1 / sqrt(2π σij²) · exp( -(Ai - μij)² / (2 σij²) )

- For (Income, Class=No), using the training table above:
  - sample mean = 110
  - sample variance = 2975

    P(Income = 120 | No) = 1 / (sqrt(2π) · 54.54) · exp( -(120 - 110)² / (2 · 2975) ) = 0.0072

[Figure: estimated normal densities of Income for Class = No and Class = Yes.]
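
A matching sketch (my own) that estimates the Income parameters for Class = No and evaluates the density; it assumes the `records` list from the previous sketch:

    import math

    # Income values for Class = No, taken from the training table.
    income_no = [r[2] for r in records if r[3] == "No"]   # [125, 100, 70, 120, 60, 220, 75]

    mu = sum(income_no) / len(income_no)                                 # 110.0
    var = sum((x - mu) ** 2 for x in income_no) / (len(income_no) - 1)   # 2975.0 (sample variance)

    def normal_density(x, mu, sigma2):
        """Normal density with mean mu and variance sigma2."""
        return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

    print(normal_density(120, mu, var))   # ≈ 0.0072
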
Example of Naïve Bayes Classifier

Given a test record:

    X = (Refund = No, Married, Income = 120K)

Naïve Bayes classifier estimates:

    P(Refund=Yes | No) = 3/7                  P(Refund=Yes | Yes) = 0
    P(Refund=No | No) = 4/7                   P(Refund=No | Yes) = 1
    P(Marital Status=Single | No) = 2/7       P(Marital Status=Single | Yes) = 2/3
    P(Marital Status=Divorced | No) = 1/7     P(Marital Status=Divorced | Yes) = 1/3
    P(Marital Status=Married | No) = 4/7      P(Marital Status=Married | Yes) = 0

    For Taxable Income:
    If class = No:  sample mean = 110, sample variance = 2975
    If class = Yes: sample mean = 90,  sample variance = 25

- P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No)
                  = 4/7 × 4/7 × 0.0072 = 0.0024
- P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes)
                   = 1 × 0 × 1.2 × 10^-9 = 0
- Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X)
  => Class = No

Naïve Bayes Classifier

- If one of the conditional probabilities is zero, then the entire expression becomes zero.
- Probability estimation:

    Original:  P(Ai | C) = Nic / Nc
    Laplace:   P(Ai | C) = (Nic + 1) / (Nc + a)        (a: number of values of Ai)
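
Putting the pieces together, a hedged end-to-end sketch (my own code, reusing `records`, `prior`, `conditional`, and `normal_density` from the sketches above) that classifies the test record:

    def income_params(cls):
        """Sample mean and variance of Taxable Income within class cls."""
        values = [r[2] for r in records if r[3] == cls]
        mu = sum(values) / len(values)
        var = sum((x - mu) ** 2 for x in values) / (len(values) - 1)
        return mu, var

    # Classify X = (Refund=No, Marital Status=Married, Income=120K).
    class_scores = {}
    for cls in ("No", "Yes"):
        mu, var = income_params(cls)
        likelihood = (conditional(0, "No", cls)          # P(Refund=No | cls)
                      * conditional(1, "Married", cls)   # P(Marital Status=Married | cls)
                      * normal_density(120, mu, var))    # P(Income=120 | cls)
        class_scores[cls] = likelihood * prior(cls)      # P(X | cls) * P(cls)

    print(class_scores)                          # ≈ 0.0016 for 'No', exactly 0.0 for 'Yes'
    print(max(class_scores, key=class_scores.get))   # 'No'

With the Laplace estimate from the slide, P(Married | Yes) would become (0 + 1) / (3 + 3) = 1/6 instead of 0 (a = 3 values of Marital Status), so the score for class Yes would no longer collapse to zero.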

Example of Naïve Bayes Classifier

    Name            Give Birth   Can Fly   Live in Water   Have Legs   Class
    human           yes          no        no              yes         mammals
    python          no           no        no              no          non-mammals
    salmon          no           no        yes             no          non-mammals
    whale           yes          no        yes             no          mammals
    frog            no           no        sometimes       yes         non-mammals
    komodo          no           no        no              yes         non-mammals
    bat             yes          yes       no              yes         mammals
    pigeon          no           yes       no              yes         non-mammals
    cat             yes          no        no              yes         mammals
    leopard shark   yes          no        yes             no          non-mammals
    turtle          no           no        sometimes       yes         non-mammals
    penguin         no           no        sometimes       yes         non-mammals
    porcupine       yes          no        no              yes         mammals
    eel             no           no        yes             no          non-mammals
    salamander      no           no        sometimes       yes         non-mammals
    gila monster    no           no        no              yes         non-mammals
    platypus        no           no        no              yes         mammals
    owl             no           yes       no              yes         non-mammals
    dolphin         yes          no        yes             no          mammals
    eagle           no           yes       no              yes         non-mammals

Test record (A: attributes, M: mammals, N: non-mammals):

    Give Birth   Can Fly   Live in Water   Have Legs   Class
    yes          no        yes             no          ?

    P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
    P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

    P(A | M) P(M) = 0.06 × 7/20 = 0.021
    P(A | N) P(N) = 0.0042 × 13/20 ≈ 0.0027

    P(A | M) P(M) > P(A | N) P(N)  =>  Mammals

Naïve Bayes (Summary)

- Robust to isolated noise points.
- Handles missing values by ignoring the instance during probability estimate calculations.
- Robust to irrelevant attributes.
- The independence assumption may not hold for some attributes:
  - Use other techniques such as Bayesian Belief Networks (BBN).

Naïve Bayes (Summary)

- The independence assumption may not hold for (some) attributes, but:
  - If we evaluate on error rate, then all that matters, in the binary case, is whether the probability estimate is on the right side of 0.5.
  - With more than two classes similar reasoning applies, but the margin of error becomes smaller.
  - For the ROC curve, what matters is that we get the probabilities in the right order.

Example

- Suppose P(Yes | A1=a1, ..., An=an) = 0.7 is the true probability of class Yes for a given attribute vector.
- To minimize the error rate we should classify this attribute vector as Yes.
- As long as the estimate P̂(Yes | A1=a1, ..., An=an) > 0.5, we will assign it to the optimal class.
- The probability estimate itself may be way off!
- If we evaluate on likelihood, this doesn't fly!
Example

Joint distribution of A1 and A2 within each class:

    C = 0    A2 = 0   A2 = 1   P(A1)
    A1 = 0   0.3      0.1      0.4
    A1 = 1   0.1      0.5      0.6
    P(A2)    0.4      0.6      1

    C = 1    A2 = 0   A2 = 1   P(A1)
    A1 = 0   0.6      0.1      0.7
    A1 = 1   0.1      0.2      0.3
    P(A2)    0.7      0.3      1

P(C=0) = 1/2, P(C=1) = 1/2.

Using the true joint distribution:

    P(C=0 | A1=1, A2=1) = P(A1=1, A2=1 | C=0) P(C=0) / [ P(A1=1, A2=1 | C=0) P(C=0) + P(A1=1, A2=1 | C=1) P(C=1) ]
                        = (0.5 × 0.5) / (0.5 × 0.5 + 0.2 × 0.5) ≈ 0.71

With Naïve Bayes:

    P(C=0 | A1=1, A2=1) = P(A1=1 | C=0) P(A2=1 | C=0) P(C=0) / [ P(A1=1 | C=0) P(A2=1 | C=0) P(C=0) + P(A1=1 | C=1) P(A2=1 | C=1) P(C=1) ]
                        = (0.6 × 0.6 × 0.5) / (0.6 × 0.6 × 0.5 + 0.3 × 0.3 × 0.5) = 0.8
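
A tiny sketch (my own, with variable names of my choosing) that recomputes both posteriors from the joint tables:

    # Joint class-conditional distributions P(A1, A2 | C), keyed by (a1, a2).
    joint = {
        0: {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.5},
        1: {(0, 0): 0.6, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.2},
    }
    class_prior = {0: 0.5, 1: 0.5}

    # Posterior P(C=0 | A1=1, A2=1) from the true joint distribution.
    num = joint[0][(1, 1)] * class_prior[0]
    den = sum(joint[c][(1, 1)] * class_prior[c] for c in (0, 1))
    print(num / den)                      # ≈ 0.714

    # Naive Bayes replaces the joint term by the product of marginals P(A1=1 | C) P(A2=1 | C).
    def marginal(c, axis, value):
        return sum(p for cell, p in joint[c].items() if cell[axis] == value)

    nb_num = marginal(0, 0, 1) * marginal(0, 1, 1) * class_prior[0]
    nb_den = sum(marginal(c, 0, 1) * marginal(c, 1, 1) * class_prior[c] for c in (0, 1))
    print(nb_num / nb_den)                # = 0.8, overstating the true posterior of ≈ 0.71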
