Naïve Bayes
CS434

Bayes Classifiers
"A formidable and sworn enemy of decision trees"

[Diagram: input attributes feed a classifier (DT or BC), which predicts a categorical output.]
Probabilistic Classification

Credit scoring:
  Inputs are income and savings
  Output is low-risk vs high-risk
Input: x = [x1, x2]^T, Output: C ∈ {0, 1}

Prediction:
  choose C = 1 if P(C=1 | x1, x2) > 0.5
         C = 0 otherwise
or equivalently:
  choose C = 1 if P(C=1 | x1, x2) > P(C=0 | x1, x2)
         C = 0 otherwise
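The decision rule above can be sketched directly; `posterior` here is a placeholder for P(C=1 | x1, x2), however it is obtained:

```python
def predict(posterior: float) -> int:
    """Choose C=1 when P(C=1|x1,x2) exceeds 0.5, otherwise C=0."""
    return 1 if posterior > 0.5 else 0

print(predict(0.7))  # 1
print(predict(0.2))  # 0
```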
Aside note: probabilistic inference

H = "Have a headache"
F = "Coming down with flu"

P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2

One day you wake up with a headache. You think: "Drat!
50% of flus are associated with headaches, so I must have a
50-50 chance of coming down with flu."

The correct calculation:
  P(F ^ H) = P(F) P(H|F) = (1/40) * (1/2) = 1/80
  P(F|H) = P(F ^ H) / P(H) = (1/80) / (1/10) = 1/8

Copyright Andrew W. Moore
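The slide's arithmetic, checked exactly with Python's `fractions` module:

```python
from fractions import Fraction

p_h = Fraction(1, 10)         # P(H): headache
p_f = Fraction(1, 40)         # P(F): flu
p_h_given_f = Fraction(1, 2)  # P(H|F)

p_f_and_h = p_f * p_h_given_f  # P(F ^ H) = P(F) P(H|F)
p_f_given_h = p_f_and_h / p_h  # Bayes rule: P(F|H) = P(F ^ H) / P(H)

print(p_f_and_h)    # 1/80
print(p_f_given_h)  # 1/8, not the fallacious 1/2
```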
Bayes classifier

Bayes rule:
  P(y | x) = P(y) p(x | y) / p(x)
  (posterior = prior * likelihood / evidence)

Given a set of training examples, to build a Bayes classifier, we need to:
1. Estimate P(y) from data
2. Estimate P(x|y) from data
This is a simple Bayes net.

Given a test data point x, to make a prediction:
1. Apply Bayes rule: P(y | x) ∝ P(y) P(x | y)
2. Predict argmax_y P(y | x)
Maximum Likelihood Estimation (MLE)

Let y be the outcome of the credit scoring of a random loan
applicant, y ∈ {0, 1}
P0 = P(y=0), and P1 = P(y=1) = 1 - P0
This can be compactly represented as:
  P(y) = p0^(1-y) (1 - p0)^y

If you observe n samples of y: y1, y2, ..., yn,
we can write down the likelihood function (i.e. the probability of
the observed data given the parameters):
  L(p0) = prod_{i=1..n} p0^(1-yi) (1 - p0)^(yi)

The log-likelihood function:
  l(p0) = log L(p0) = log prod_{i=1..n} p0^(1-yi) (1 - p0)^(yi)
        = sum_{i=1..n} [(1 - yi) log p0 + yi log(1 - p0)]
MLE cont.

MLE maximizes the likelihood, or the log-likelihood:
  p0_MLE = argmax_{p0} l(p0)

For this case:
  p0_MLE = ( sum_{i=1..n} (1 - yi) ) / n

i.e., the frequency that one observes y=0 in the training data.
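The closed-form estimate can be checked numerically; the sample values below are made up for illustration:

```python
def mle_p0(samples):
    """MLE for P(y=0): the fraction of observed samples with y == 0."""
    return sum(1 - y for y in samples) / len(samples)

ys = [0, 1, 0, 0, 1]  # hypothetical observations of y
print(mle_p0(ys))     # 0.6: three of the five samples have y = 0
```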
Bayes Classifiers in a nutshell

Learning:
1. Estimate P(x1, x2, ..., xm | y=vi) for each value vi
2. Estimate P(y=vi) as the fraction of records with y=vi

For a new prediction:
  y_predict = argmax_v P(y=v | x1=u1, ..., xm=um)
            = argmax_v P(x1=u1, ..., xm=um | y=v) P(y=v)

Estimating the joint distribution of x1, x2, ..., xm given y can be problematic!
Joint Density Estimator Overfits

Typically we don't have enough data to estimate the
joint distribution accurately.
It is common to encounter the following situation:
  If no training examples have the exact x = (u1, u2, ..., um),
  then P(x | y=vi) = 0 for all values of y.
In that case, what can we do?
  We might as well guess a random y based on the prior p(y), since the
  likelihood term in P(y | x) = P(y) p(x | y) / p(x) is zero for every y.
Example: Spam Filtering

Assume that our vocabulary contains 10k commonly
used words & tokens, so we have 10,000 attributes.
Let's assume these attributes are binary.
How many parameters do we need to learn?
  2 * (2^10,000 - 1)
  (2 classes, and 2^10,000 - 1 parameters for each joint distribution p(x|y))
Clearly we don't have enough data to estimate
that many parameters.
The Naïve Bayes Assumption

Assume that each attribute is independent of
any other attributes given the class label:
  P(x1=u1, ..., xm=um | y=vi)
    = P(x1=u1 | y=vi) * ... * P(xm=um | y=vi)
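Under this assumption, the parameter count for the spam example collapses from exponential to linear in the number of attributes. A quick check in Python (big integers keep the full-joint count exact):

```python
m = 10_000  # number of binary attributes, as in the spam example

# Full joint p(x|y): 2^m - 1 free parameters per class, times 2 classes.
full_joint = 2 * (2**m - 1)

# Naive Bayes: one Bernoulli parameter P(xi=1|y) per attribute per class.
naive = 2 * m

print(naive)                  # 20000
print(full_joint > 10**3000)  # True: the full joint is astronomically larger
```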
A note about independence

Assume A and B are two random variables. Then
A and B are independent
if and only if
  P(A|B) = P(A)
"A and B are independent" is often notated as A ⊥ B.
Examples of independent events:
  Two separate coin tosses

Consider the following four variables:
  T: Toothache (I have a toothache)
  C: Catch (dentist's steel probe catches in my tooth)
  A: Cavity
  W: Weather
P(T, C, A, W) = ?
Since {T, C, A} ⊥ W:  P(T, C, A, W) = P(T, C, A) P(W)
Conditional Independence

P(x1 | x2, y) = P(x1 | y)
  x1 is independent of x2 given y
  x1 and x2 are conditionally independent given y

If x1 and x2 are conditionally independent
given y, then we have:
  P(x1, x2 | y) = P(x1 | y) P(x2 | y)
Example of conditional independence

T: Toothache (I have a toothache)
C: Catch (dentist's steel probe catches in my tooth)
A: Cavity

T and C are conditionally independent given A: P(T | C, A) = P(T | A)
  P(T, C | A) = P(T | A) * P(C | A)

Events that are not independent of each other might be conditionally
independent given some fact.
It can also happen the other way around: events that are independent might
become conditionally dependent given some fact.
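A small numeric check of the toothache example. The probabilities below are assumed for illustration, not taken from the slides; the joint is built so that T ⊥ C given A holds by construction, yet T and C come out marginally dependent:

```python
from itertools import product

p_a = {1: 0.2, 0: 0.8}    # P(Cavity = a)               (assumed numbers)
p_t = {1: 0.8, 0: 0.1}    # P(Toothache = 1 | Cavity = a)
p_c = {1: 0.9, 0: 0.05}   # P(Catch = 1 | Cavity = a)

def joint(t, c, a):
    """P(T=t, C=c, A=a), built from the factorization P(A) P(T|A) P(C|A)."""
    pt = p_t[a] if t else 1 - p_t[a]
    pc = p_c[a] if c else 1 - p_c[a]
    return p_a[a] * pt * pc

p_t1 = sum(joint(1, c, a) for c, a in product((0, 1), repeat=2))
p_c1 = sum(joint(t, 1, a) for t, a in product((0, 1), repeat=2))
p_t1_c1 = sum(joint(1, 1, a) for a in (0, 1))

# Marginally, T and C are dependent: P(T, C) != P(T) P(C).
print(abs(p_t1_c1 - p_t1 * p_c1) > 1e-9)  # True
```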
Naïve Bayes Classifier

Learning: estimate P(x1=u1 | y=vi), ..., P(xm=um | y=vi) and P(y=vi) from data.

Prediction:
  y_predict = argmax_v P(x1=u1 | y=v) * ... * P(xm=um | y=v) * P(y=v)
Example

Training data:
  X1 X2 X3 | Y
   1  1  1 | 0
   1  1  0 | 0
   0  0  0 | 0
   0  1  0 | 1
   0  0  1 | 1
   0  1  1 | 1

Apply Naïve Bayes, and make a prediction for (1, 0, 1):
1. Learn the prior distribution of y:
   P(y=0) = 1/2, P(y=1) = 1/2
2. Learn the conditional distribution of xi given y for each possible y value:
   p(X1|y=0), p(X1|y=1)
   p(X2|y=0), p(X2|y=1)
   p(X3|y=0), p(X3|y=1)
   For example, for p(X1|y): P(X1=1|y=0) = 2/3, P(X1=1|y=1) = 0

Laplace smoothing: add 1 to each count and k to each denominator,
where k = the total number of possible values of x.
For a binary feature like the above (k=2), p(x|y) will then never be 0.
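The worked example can be reproduced with a short counting-based sketch (the names, like `nb_score`, are mine, not from the slides):

```python
# The six training rows from the table: (x1, x2, x3, y).
data = [
    (1, 1, 1, 0),
    (1, 1, 0, 0),
    (0, 0, 0, 0),
    (0, 1, 0, 1),
    (0, 0, 1, 1),
    (0, 1, 1, 1),
]

def nb_score(x, y, smooth=False):
    """P(y) * prod_i P(xi|y), estimated by counting; k=2 for binary features."""
    rows = [r for r in data if r[3] == y]
    score = len(rows) / len(data)                # prior P(y)
    for i, u in enumerate(x):
        match = sum(1 for r in rows if r[i] == u)
        if smooth:                               # Laplace smoothing
            score *= (match + 1) / (len(rows) + 2)
        else:
            score *= match / len(rows)
    return score

x = (1, 0, 1)
prediction = max((0, 1), key=lambda y: nb_score(x, y))
print(prediction)  # 0: without smoothing, P(X1=1|y=1)=0 zeroes out the y=1 score
```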
Final Notes about the (Naïve) Bayes Classifier

Any density estimator can be plugged in to estimate P(x1, x2,
..., xm | y), or P(xi | y) for Naïve Bayes.
Real-valued attributes can be modeled using simple
distributions such as the Gaussian (Normal) distribution.
Naïve Bayes is wonderfully cheap and survives tens of
thousands of attributes easily.
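A sketch of the Gaussian option for real-valued attributes: estimate a mean and variance per class and use the normal density as the class-conditional P(xi | y). The income figures are invented for illustration:

```python
import math

def gaussian_pdf(x, mean, var):
    """Normal density, usable as a class-conditional P(xi | y)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical incomes observed for the low-risk class.
incomes = [60.0, 70.0, 80.0]
mean = sum(incomes) / len(incomes)
var = sum((v - mean) ** 2 for v in incomes) / len(incomes)  # MLE variance

print(gaussian_pdf(70.0, mean, var))  # class-conditional density at income 70
```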
Bayes Classifier is a Generative Approach

Generative approach:
  Learn p(y), p(x|y), and then apply Bayes rule to
  compute p(y|x) for making predictions.
This is equivalent to assuming that each data point
is generated following a generative process
governed by p(y) and p(x|y).

[Diagram: Bayes classifier: y ~ p(y), then X ~ p(X|y).
 Naïve Bayes classifier: y ~ p(y), then each Xi ~ p(Xi|y) independently, i = 1..m.]
The generative approach is just one type of learning approach
used in machine learning.
  Learning a correct generative model is difficult,
  and sometimes unnecessary.
KNN and DT are both what we call discriminative methods.
  They are not concerned with any generative models.
  They only care about finding a good discriminative function.
  For KNN and DT, these functions are deterministic, not probabilistic.
One can also take a probabilistic approach to learning
discriminative functions:
  i.e., learn p(y|X) directly without assuming X is generated based on
  some particular distribution given y (i.e., p(X|y)).
  Logistic regression is one such approach.