
Bayes Classifier and Naïve Bayes
CS434
Bayes Classifiers
A formidable and sworn enemy of decision trees.
[Diagram: input attributes → classifier → prediction of categorical output; the same picture applies to both decision trees (DT) and Bayes classifiers (BC).]
Probabilistic Classification
Credit scoring:
Inputs are income and savings
Output is low risk vs. high risk
Input: x = [x1, x2]^T, Output: C ∈ {0, 1}
Prediction:

choose
$$\hat{C} = \begin{cases} 1 & \text{if } P(C = 1 \mid x_1, x_2) > 0.5 \\ 0 & \text{otherwise} \end{cases}$$

or equivalently

choose
$$\hat{C} = \begin{cases} 1 & \text{if } P(C = 1 \mid x_1, x_2) > P(C = 0 \mid x_1, x_2) \\ 0 & \text{otherwise} \end{cases}$$
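As a minimal sketch (not from the slides), the two equivalent decision rules can be written directly in code, assuming the posterior P(C = 1 | x1, x2) has already been computed:

```python
# Minimal sketch of the two equivalent decision rules above.
# p1 stands for the posterior P(C = 1 | x1, x2).
def choose_class(p1: float) -> int:
    rule_threshold = 1 if p1 > 0.5 else 0      # compare the posterior to 0.5
    rule_argmax = 1 if p1 > 1.0 - p1 else 0    # compare the two posteriors
    assert rule_threshold == rule_argmax       # equivalent for two classes
    return rule_threshold

print(choose_class(0.7))   # -> 1
```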

A side note: probabilistic inference
H = "I have a headache", F = "I am coming down with flu"
[Venn diagram: event F overlapping event H]
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
One day you wake up with a headache. You think: "Drat! 50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with flu."

Is this reasoning good?
Probabilistic Inference
H = "I have a headache", F = "I am coming down with flu"
P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2

$$P(F \wedge H) = P(F)\,P(H \mid F) = \frac{1}{40} \cdot \frac{1}{2} = \frac{1}{80}$$

$$P(F \mid H) = \frac{P(F \wedge H)}{P(H)} = \frac{1/80}{1/10} = \frac{1}{8}$$
Copyright Andrew W. Moore
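A quick numerical check of this calculation (a small sketch, not part of the original slide):

```python
# Verify the headache/flu inference numerically.
p_h = 1 / 10          # P(H)
p_f = 1 / 40          # P(F)
p_h_given_f = 1 / 2   # P(H|F)

p_f_and_h = p_f * p_h_given_f     # P(F ^ H) = 1/80
p_f_given_h = p_f_and_h / p_h     # P(F|H) = 1/8, not 1/2
print(p_f_and_h, p_f_given_h)     # 0.0125 0.125
```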
Bayes classifier
Bayes rule relates the posterior to the prior and the likelihood:

$$\underbrace{P(y \mid \mathbf{x})}_{\text{posterior}} = \frac{\overbrace{P(y)}^{\text{prior}} \; p(\mathbf{x} \mid y)}{p(\mathbf{x})}$$

Given a set of training examples, to build a Bayes classifier we need to:
1. Estimate P(y) from data
2. Estimate P(x|y) from data
(This corresponds to a simple Bayes net: y → x.)

Given a test data point x, to make a prediction:
1. Apply Bayes rule: P(y|x) ∝ P(y) P(x|y)
2. Predict argmax_y P(y|x)
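As an illustration of these two prediction steps, here is a minimal Python sketch; the `prior` dictionary and `likelihood` function are hypothetical stand-ins for the quantities estimated in steps 1 and 2 above:

```python
# Sketch: predict by applying Bayes rule and taking the argmax over class values.
# prior[v] approximates P(y = v); likelihood(x, v) approximates p(x | y = v).
# Since p(x) is the same for every class, it can be dropped from the argmax.
def bayes_predict(x, prior, likelihood):
    return max(prior, key=lambda v: prior[v] * likelihood(x, v))
```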
Maximum Likelihood Estimation (MLE)
Let y be the outcome of the credit scoring of a random loan applicant, y ∈ {0, 1}.
Let p0 = P(y = 0) and p1 = P(y = 1) = 1 − p0.
This can be compactly represented as

$$P(y) = p_0^{\,1-y}\,(1 - p_0)^{\,y}$$

If you observe n samples of y: y1, y2, ..., yn,
we can write down the likelihood function (i.e., the probability of the observed data given the parameters):

$$L(p_0) = \prod_{i=1}^{n} p_0^{\,1-y_i}\,(1 - p_0)^{\,y_i}$$

The log-likelihood function:

$$l(p_0) = \log L(p_0) = \log \prod_{i=1}^{n} p_0^{\,1-y_i}\,(1 - p_0)^{\,y_i} = \sum_{i=1}^{n} \big[(1 - y_i)\log p_0 + y_i \log(1 - p_0)\big]$$
MLE cont.
MLE maximizes the likelihood, or equivalently the log-likelihood:

$$p_0^{MLE} = \arg\max_{p_0} l(p_0)$$

For this case:

$$p_0^{MLE} = \frac{\sum_{i=1}^{n} (1 - y_i)}{n}$$

i.e., the frequency with which one observes y = 0 in the training data.
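The intermediate step (not shown on the slide) is to set the derivative of the log-likelihood to zero:

$$\frac{dl}{dp_0} = \sum_{i=1}^{n}\left[\frac{1-y_i}{p_0} - \frac{y_i}{1-p_0}\right] = 0 \;\Rightarrow\; (1-p_0)\sum_{i=1}^{n}(1-y_i) = p_0\sum_{i=1}^{n} y_i \;\Rightarrow\; p_0^{MLE} = \frac{\sum_{i=1}^{n}(1-y_i)}{n}$$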
Bayes Classifiers in a nutshell
Learning:
1. Estimate P(x1, x2, ..., xm | y = vi) for each value vi.
2. Estimate P(y = vi) as the fraction of records with y = vi.
For a new prediction:

$$y^{predict} = \arg\max_{v} P(y = v \mid x_1 = u_1 \cdots x_m = u_m) = \arg\max_{v} P(x_1 = u_1 \cdots x_m = u_m \mid y = v)\,P(y = v)$$

Estimating the joint distribution of x1, x2, ..., xm given y can be problematic!
Joint Density Estimator Overfits
Typically we don't have enough data to estimate the joint distribution accurately.
It is common to encounter the following situation:
If no training examples have the exact x = (u1, u2, ..., um), then P(x | y = vi) = 0 for all values of y.

In that case, what can we do?
We might as well guess a random y based on the prior, i.e.,

$$P(y \mid \mathbf{x}) = \frac{P(y)\,p(\mathbf{x} \mid y)}{p(\mathbf{x})} \;\rightarrow\; p(y)$$
Example: Spam Filtering
Assume that our vocabulary contains 10k commonly used words & tokens, so we have 10,000 attributes.
Let's assume these attributes are binary.
How many parameters do we need to learn?
2 × (2^10,000 − 1): for each of the 2 classes, the joint distribution p(x|y) over 10,000 binary attributes has 2^10,000 − 1 free parameters.

Clearly we don't have enough data to estimate that many parameters.
The Naïve Bayes Assumption
Assume that each attribute is independent of any other attributes given the class label:

$$P(x_1 = u_1 \wedge \cdots \wedge x_m = u_m \mid y = v_i) = P(x_1 = u_1 \mid y = v_i) \cdots P(x_m = u_m \mid y = v_i)$$
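For the spam-filtering example above, this assumption cuts the parameter count dramatically (a back-of-the-envelope count, not stated on the slide): instead of 2 × (2^10,000 − 1) joint-distribution parameters, we only need about 2 × 10,000 values P(xi = 1 | y), one Bernoulli parameter per word per class.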
A note about independence
Assume A and B are two random variables. Then A and B are independent if and only if P(A|B) = P(A).

"A and B are independent" is often notated as A ⊥ B.
Examples of independent events
Two separate coin tosses.
Consider the following four variables:
T: Toothache (I have a toothache)
C: Catch (dentist's steel probe catches in my tooth)
A: Cavity
W: Weather
P(T, C, A, W) = ? Since W is independent of {T, C, A}, the joint factors as P(T, C, A, W) = P(T, C, A) · P(W).
Conditional Independence
P(X1 | X2, y) = P(X1 | y)
X1 is independent of X2 given y:
X1 and X2 are conditionally independent given y.
If X1 and X2 are conditionally independent given y, then we have
P(X1, X2 | y) = P(X1 | y) P(X2 | y)
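The one-line justification (a standard step not spelled out above) combines the chain rule with the conditional-independence assumption:

$$P(X_1, X_2 \mid y) = P(X_1 \mid X_2, y)\,P(X_2 \mid y) = P(X_1 \mid y)\,P(X_2 \mid y)$$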
Example of conditional independence
T: Toothache (I have a toothache)
C: Catch (dentist's steel probe catches in my tooth)
A: Cavity
T and C are conditionally independent given A: P(T|C,A) = P(T|A), so
P(T, C|A) = P(T|A) · P(C|A)
Events that are not independent of each other might be conditionally independent given some fact.

It can also happen the other way around: events that are independent might become conditionally dependent given some fact.

B = Burglar in your house; A = (burglar) alarm rang in your house; E = Earthquake happened
B is independent of E (ignoring some minor possible connections between them).
However, if we know A is true, then B and E are no longer independent. Why?
P(B|A) >> P(B|A, E): knowing E is true makes it much less likely for B to be true.
Naïve Bayes Classifier
By assuming that each attribute is independent of any other attributes given the class label, we now have a Naïve Bayes classifier.
Instead of learning a joint distribution of all features, we learn p(xi|y) separately for each feature xi.
Everything else remains the same.
Naïve Bayes Classifier
Assume you want to predict output y, which has nY values v1, v2, ..., v_nY.
Assume there are m input attributes, x = (x1, x2, ..., xm).
Learn a conditional distribution p(x|y) for each possible y value, y = v1, v2, ..., v_nY. We do this by:
1. Breaking the training set into nY subsets S1, S2, ..., S_nY based on the y values, i.e., Si contains the examples in which y = vi.
2. For each Si, learning p(y = vi) = |Si| / |S|.
3. For each Si, learning the conditional distribution of each input feature, e.g.:

$$P(x_1 = u_1 \mid y = v_i), \ldots, P(x_m = u_m \mid y = v_i)$$

To predict (a code sketch follows below):

$$y^{predict} = \arg\max_{v} P(x_1 = u_1 \mid y = v) \cdots P(x_m = u_m \mid y = v)\,P(y = v)$$
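A minimal sketch of this procedure in Python, assuming binary features stored as tuples of 0/1 values (the function and variable names are illustrative, not from the slides):

```python
from collections import defaultdict

def train_naive_bayes(X, y):
    """Estimate p(y) and p(x_j | y) from binary training data.
    X: list of feature tuples (0/1 values), y: list of class labels."""
    n = len(y)
    counts = defaultdict(int)                       # |S_v| for each class value v
    for label in y:
        counts[label] += 1
    prior = {v: counts[v] / n for v in counts}      # p(y = v) = |S_v| / |S|
    cond = {v: {} for v in counts}                  # cond[v][(j, u)] = P(x_j = u | y = v)
    for v in counts:
        subset = [x for x, label in zip(X, y) if label == v]   # S_v
        for j in range(len(X[0])):
            ones = sum(x[j] for x in subset)
            cond[v][(j, 1)] = ones / len(subset)
            cond[v][(j, 0)] = 1 - cond[v][(j, 1)]
    return prior, cond

def predict_naive_bayes(x, prior, cond):
    """argmax_v  P(x_1 = u_1 | y = v) ... P(x_m = u_m | y = v) P(y = v)"""
    def score(v):
        s = prior[v]
        for j, u in enumerate(x):
            s *= cond[v][(j, u)]
        return s
    return max(prior, key=score)
```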
Example
Apply Naïve Bayes, and make a prediction for (1, 0, 1).

X1  X2  X3  |  Y
 1   1   1  |  0
 1   1   0  |  0
 0   0   0  |  0
 0   1   0  |  1
 0   0   1  |  1
 0   1   1  |  1

1. Learn the prior distribution of y:
P(y=0) = 1/2, P(y=1) = 1/2
2. Learn the conditional distribution of xi given y for each possible y value:
p(X1|y=0), p(X1|y=1)
p(X2|y=0), p(X2|y=1)
p(X3|y=0), p(X3|y=1)
For example, for p(X1|y):
P(X1=1|y=0) = 2/3, P(X1=1|y=1) = 0

To predict for (1, 0, 1):
P(y=0|(1,0,1)) = P((1,0,1)|y=0) P(y=0) / P((1,0,1))
P(y=1|(1,0,1)) = P((1,0,1)|y=1) P(y=1) / P((1,0,1))
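Carrying the numbers through (this final step is not shown above): from the three y = 0 rows, P(X1=1|y=0) = 2/3, P(X2=0|y=0) = 1/3, and P(X3=1|y=0) = 1/3, so P((1,0,1)|y=0) P(y=0) = 2/3 · 1/3 · 1/3 · 1/2 = 1/27, while for y = 1 the product is 0 because P(X1=1|y=1) = 0. The prediction is therefore y = 0. The same result can be checked with the hypothetical train_naive_bayes / predict_naive_bayes sketch above:

```python
X = [(1, 1, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 1)]
y = [0, 0, 0, 1, 1, 1]
prior, cond = train_naive_bayes(X, y)
print(predict_naive_bayes((1, 0, 1), prior, cond))   # -> 0
```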
Laplace Smoothing
With the Naïve Bayes assumption, we can still end up with zero probabilities.
E.g., if we receive an email that contains a word that has never appeared in the training emails:
P(x|y) will be 0 for all y values.
We can then only make a prediction based on p(y).
This is bad because we ignore all the other words in the email because of this single rare word.
Laplace smoothing can help:
P(X1=1|y=0) = (1 + # of examples with y=0, X1=1) / (k + # of examples with y=0)
where k = the total number of possible values of the feature X1 (k = 2 for a binary feature).
For a binary feature like the one above, p(x|y) will never be 0.
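Applied to the earlier toy example, the troublesome estimate P(X1=1|y=1) = 0/3 becomes (1 + 0) / (2 + 3) = 1/5. A minimal sketch of the smoothed estimate (an illustrative helper, not from the slides):

```python
def laplace_smoothed(count_feature_and_class: int, count_class: int, k: int = 2) -> float:
    """Laplace-smoothed estimate of P(x_j = u | y = v).

    count_feature_and_class: number of training examples with y = v and x_j = u
    count_class:             number of training examples with y = v
    k:                       number of possible values of feature x_j (2 for binary)
    """
    return (1 + count_feature_and_class) / (k + count_class)

print(laplace_smoothed(0, 3))   # P(X1=1|y=1) -> 0.2 instead of 0
```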
Final Notes about the (Naïve) Bayes Classifier
Any density estimator can be plugged in to estimate P(x1, x2, ..., xm | y), or P(xi | y) for Naïve Bayes.

Real-valued attributes can be modeled using simple distributions such as the Gaussian (Normal) distribution.

Naïve Bayes is wonderfully cheap and survives tens of thousands of attributes easily.
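For instance, a per-feature Gaussian model for a real-valued attribute could look like this (a minimal sketch, not prescribed by the slides):

```python
import math

def fit_gaussian(values):
    """Estimate the mean and variance of one real-valued feature within one class."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)   # MLE variance
    return mu, var

def gaussian_pdf(x, mu, var):
    """Density used in place of P(x_j = u | y = v) for a real-valued feature."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```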
Bayes Classifier is a Generative Approach
Generative approach:
Learn p(y) and p(x|y), and then apply Bayes rule to compute p(y|x) for making predictions.
This is equivalent to assuming that each data point is generated following a generative process governed by p(y) and p(X|y).
[Diagram — Bayes classifier: y ~ p(y), then X ~ p(X|y). Naïve Bayes classifier: y ~ p(y), then each feature X1, ..., Xm drawn independently from p(X1|y), ..., p(Xm|y).]
The generative approach is just one type of learning approach used in machine learning.
Learning a correct generative model is difficult,
and sometimes unnecessary.
KNN and DT are both what we call discriminative methods:
They are not concerned with any generative models;
they only care about finding a good discriminative function.
For KNN and DT, these functions are deterministic, not probabilistic.
One can also take a probabilistic approach to learning discriminative functions,
i.e., learn p(y|X) directly without assuming X is generated based on some particular distribution given y (i.e., p(X|y)).
Logistic regression is one such approach.
