
Introduction to Bayesian

Learning
Ata Kaban
A.Kaban@cs.bham.ac.uk

School of Computer Science
University of Birmingham

Overview
Today we learn:
Bayesian classification
E.g. How to decide if a patient is ill or healthy,
based on
A probabilistic model of the observed data
Prior knowledge

Classification problem

Training data: examples of the form (d, h(d)),
where d are the data objects to classify (inputs)
and h(d) is the correct class label for d, h(d) ∈ {1, ..., K}
Goal: given d_new, provide h(d_new)
Why Bayesian?
Provides practical learning algorithms
E.g. Naïve Bayes
Prior knowledge and observed data can be
combined
It is a generative (model based) approach, which
offers a useful conceptual framework
E.g. sequences could also be classified, based on
a probabilistic model specification
Any kind of objects can be classified, based on a
probabilistic model specification
Bayes Rule

P(h|d) = P(d|h) P(h) / P(d)

P(h): prior belief (probability of hypothesis h before seeing any data)
P(d|h): likelihood (probability of the data if the hypothesis h is true)
P(d) = Σ_h P(d|h) P(h): data evidence (marginal probability of the data)
P(h|d): posterior (probability of hypothesis h after having seen the data d)
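To make the roles above concrete, here is a minimal sketch of Bayes' rule in code; the two hypotheses and their prior and likelihood values are made up for illustration and are not part of the slides.

```python
# Minimal sketch of Bayes' rule with made-up numbers (not from the slides).
prior = {"h1": 0.3, "h2": 0.7}          # P(h)
likelihood = {"h1": 0.8, "h2": 0.1}     # P(d | h) for one observed datum d

evidence = sum(likelihood[h] * prior[h] for h in prior)               # P(d) = sum_h P(d|h) P(h)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}   # P(h | d)

print(posterior)   # the posterior probabilities sum to 1
```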

Who is who in Bayes rule

Understanding Bayes' rule:
d = data
h = hypothesis (model)

Rearranging: P(h|d) P(d) = P(d|h) P(h)
i.e. P(d, h) = P(d, h)
the same joint probability on both sides

Probabilities: auxiliary slide for memory refreshing

Have two dice h_1 and h_2
The probability of rolling an i given die h_1 is denoted P(i|h_1). This is a conditional probability.
Pick a die at random with probability P(h_j), j = 1 or 2. The probability for picking die h_j and rolling an i with it is called the joint probability and is P(i, h_j) = P(h_j) P(i|h_j).
For any events X and Y, P(X, Y) = P(X|Y) P(Y)
If we know P(X, Y), then the so-called marginal probability P(X) can be computed as P(X) = Σ_Y P(X, Y)
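The dice example can be written out directly; in this small sketch the loaded-die probabilities are assumptions chosen only for illustration.

```python
# Joint and marginal probabilities for the two-dice example.
# The loaded-die numbers below are made up for illustration.
p_die = {"h1": 0.5, "h2": 0.5}                                  # P(h_j): pick a die at random
p_roll = {
    "h1": {i: 1 / 6 for i in range(1, 7)},                      # P(i | h1): fair die
    "h2": {i: 0.5 if i == 6 else 0.1 for i in range(1, 7)},     # P(i | h2): loaded towards 6
}

# Joint probability: P(i, h_j) = P(h_j) * P(i | h_j)
joint = {(i, h): p_die[h] * p_roll[h][i] for h in p_die for i in range(1, 7)}

# Marginal probability: P(i) = sum over h of P(i, h)
marginal = {i: sum(joint[(i, h)] for h in p_die) for i in range(1, 7)}
print(marginal[6])   # probability of rolling a 6 with a randomly picked die
```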
Does patient have cancer or not?
A patient takes a lab test and the result comes back
positive. It is known that the test returns a correct
positive result in only 98% of the cases and a correct
negative result in only 97% of the cases. Furthermore,
only 0.008 of the entire population has this disease.

1. What is the probability that this patient has cancer?
2. What is the probability that he does not have cancer?
3. What is the diagnosis?

hypothesis 1: 'cancer'
hypothesis 2: '¬cancer'        } hypothesis space H
data: '+' (the positive lab test)

P(cancer) = 0.008          P(¬cancer) = ..........
P(+|cancer) = 0.98         P(+|¬cancer) = 0.03

1. P(cancer|+) = P(+|cancer) P(cancer) / P(+) = ..........
   where P(+) = P(+|cancer) P(cancer) + P(+|¬cancer) P(¬cancer) = ..........
2. P(¬cancer|+) = ..........
3. Diagnosis: ??





Choosing Hypotheses
Generally we want the most probable hypothesis given the training data. This is the maximum a posteriori (MAP) hypothesis:

h_MAP = argmax_{h ∈ H} P(h|d)

Useful observation: it does not depend on the denominator P(d).

Maximum Likelihood (ML) hypothesis:

h_ML = argmax_{h ∈ H} P(d|h)
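In code, both choices are just an argmax over the hypotheses; the sketch below assumes P(d|h) and P(h) are given as dictionaries keyed by hypothesis (the function names are my own).

```python
# Minimal sketch: selecting the ML and MAP hypotheses from given dictionaries.
def ml_hypothesis(likelihood):
    # h_ML = argmax_h P(d | h)
    return max(likelihood, key=likelihood.get)

def map_hypothesis(likelihood, prior):
    # h_MAP = argmax_h P(d | h) P(h); the denominator P(d) is the same for
    # every h, so it can be ignored in the argmax
    return max(likelihood, key=lambda h: likelihood[h] * prior[h])
```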

Now we compute the diagnosis


To find the Maximum Likelihood hypothesis, we evaluate P(d|h) for the data d, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it:

P(+|cancer) = ..........
P(+|¬cancer) = ..........
Diagnosis: h_ML = ..........

To find the Maximum A Posteriori hypothesis, we evaluate P(d|h)P(h) for the data d, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it. This is the same as choosing the hypothesis that gives the higher posterior probability.

P(+|cancer) P(cancer) = ..........
P(+|¬cancer) P(¬cancer) = ..........
Diagnosis: h_MAP = ..........
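The blanks above can be checked numerically using only the figures stated in the problem; a minimal sketch, not part of the original slides:

```python
# Filling in the cancer example with the figures given in the problem statement:
# P(cancer) = 0.008, P(+|cancer) = 0.98, P(+|not cancer) = 0.03.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# ML compares likelihoods only: 0.98 vs 0.03, so h_ML = cancer
print(p_pos_given_cancer, p_pos_given_not)

# MAP compares likelihood * prior:
print(p_pos_given_cancer * p_cancer)        # 0.98 * 0.008 = 0.00784
print(p_pos_given_not * p_not_cancer)       # 0.03 * 0.992 = 0.02976, so h_MAP = not cancer

# Posterior P(cancer | +) via Bayes' rule comes out at only about 0.21
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_not * p_not_cancer
print(p_pos_given_cancer * p_cancer / p_pos)
```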
Naïve Bayes Classifier
What can we do if our data d has several attributes?
Naïve Bayes assumption: attributes that describe data instances are
conditionally independent given the classification hypothesis:

P(d|h) = P(a_1, ..., a_T | h) = Π_t P(a_t | h)

it is a simplifying assumption, obviously it may be violated in reality
in spite of that, it works well in practice
The Bayesian classifier that uses the Naïve Bayes assumption and
computes the MAP hypothesis is called the Naïve Bayes classifier
One of the most practical learning methods
Successful applications:
Medical diagnosis
Text classification
d
Example. Play Tennis data
Day Outlook Temperature Humidity Wind PlayTennis
Day1 Sunny Hot High Weak No
Day2 Sunny Hot High Strong No
Day3 Overcast Hot High Weak Yes
Day4 Rain Mild High Weak Yes
Day5 Rain Cool Normal Weak Yes
Day6 Rain Cool Normal Strong No
Day7 Overcast Cool Normal Strong Yes
Day8 Sunny Mild High Weak No
Day9 Sunny Cool Normal Weak Yes
Day10 Rain Mild Normal Weak Yes
Day11 Sunny Mild Normal Strong Yes
Day12 Overcast Mild High Strong Yes
Day13 Overcast Hot Normal Weak Yes
Day14 Rain Mild High Strong No
Naïve Bayes solution
Classify any new datum instance x = (a_1, ..., a_T) as:

h_NB = argmax_h P(h) P(x|h) = argmax_h P(h) Π_t P(a_t|h)

To do this based on training examples, we need to estimate the
parameters from the training examples:

For each target value (hypothesis) h: estimate P(h)
For each attribute value a_t of each datum instance: estimate P(a_t|h)
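A minimal sketch of this estimation step, done by counting over the training examples; the function and variable names are my own, and the estimates are plain relative frequencies with no smoothing.

```python
from collections import Counter, defaultdict

def estimate_nb_parameters(examples):
    """examples: list of (attribute_dict, label) pairs. Returns relative-frequency
    estimates of P(h) and a function giving estimates of P(a_t = value | h)."""
    class_counts = Counter(label for _, label in examples)
    prior = {h: n / len(examples) for h, n in class_counts.items()}   # estimate of P(h)

    cond_counts = defaultdict(Counter)        # counts of each attribute value within each class
    for attrs, label in examples:
        for attr, value in attrs.items():
            cond_counts[(attr, label)][value] += 1

    def cond_prob(attr, value, h):
        # estimate of P(attr = value | h)
        return cond_counts[(attr, h)][value] / class_counts[h]

    return prior, cond_prob
```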
Based on the examples in the table, classify the following datum x:
x=(Outl=Sunny, Temp=Cool, Hum=High, Wind=strong)
That means: Play tennis or not?



h_NB = argmax_{h ∈ {yes, no}} P(h) P(x|h) = argmax_{h ∈ {yes, no}} P(h) Π_t P(a_t|h)
     = argmax_{h ∈ {yes, no}} P(h) P(Outlook=sunny|h) P(Temp=cool|h) P(Humidity=high|h) P(Wind=strong|h)

Working:

P(PlayTennis = yes) = 9/14 = 0.64
P(PlayTennis = no) = 5/14 = 0.36
P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
etc.

P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.0206

answer: PlayTennis(x) = no
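The same working can be reproduced by counting directly over the 14 rows of the table; a rough check in code, with variable names of my own choosing:

```python
# Rough check of the Play Tennis working above, counting over the table rows.
data = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis)
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
x = ("Sunny", "Cool", "High", "Strong")              # the new instance to classify

for h in ("Yes", "No"):
    rows_h = [row for row in data if row[4] == h]
    score = len(rows_h) / len(data)                  # estimate of P(h)
    for i, value in enumerate(x):
        matches = sum(1 for row in rows_h if row[i] == value)
        score *= matches / len(rows_h)               # estimate of P(a_t = value | h)
    print(h, round(score, 4))                        # Yes: 0.0053, No: 0.0206 -> classify as No
```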
Learning to classify text
Learn from examples which articles are of
interest
The attributes are the words
Observe the Naïve Bayes assumption just
means that we have a random sequence
model within each class!
NB classifiers are one of the most effective for
this task
Resources for those interested:
Tom Mitchell: Machine Learning (book) Chapter 6.
Results on a benchmark text corpus
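As a sketch of what "the attributes are the words" means in practice, the toy scorer below counts words per class over a made-up corpus; the add-one smoothing goes slightly beyond the slide and is only there to avoid zero probabilities for unseen words.

```python
from collections import Counter

# Toy word-count Naive Bayes scorer over a tiny made-up corpus (illustration only).
train = [
    ("bayes rule combines prior and likelihood", "interesting"),
    ("the posterior is proportional to prior times likelihood", "interesting"),
    ("football results from the weekend league", "not"),
]

class_counts = Counter(label for _, label in train)      # documents per class
word_counts = {h: Counter() for h in class_counts}       # word counts per class
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def score(text, h):
    # P(h) * product over words of P(word | h), with add-one smoothing
    s = class_counts[h] / len(train)
    total = sum(word_counts[h].values())
    for w in text.split():
        s *= (word_counts[h][w] + 1) / (total + len(vocab))
    return s

print(max(class_counts, key=lambda h: score("prior and likelihood", h)))   # -> interesting
```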
Remember
Bayes rule can be turned into a classifier
Maximum A Posteriori (MAP) hypothesis estimation
incorporates prior knowledge; Maximum Likelihood doesn't
Naive Bayes Classifier is a simple but effective Bayesian
classifier for vector data (i.e. data with several attributes)
that assumes that attributes are independent given the
class.
Bayesian classification is a generative approach to
classification
Resources
Textbook reading (contains details about using Naïve
Bayes for text classification):
Tom Mitchell, Machine Learning (book), Chapter 6.
Software: NB for classifying text:
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
Useful reading for those interested to learn more about
NB classification, beyond the scope of this module:
http://www-2.cs.cmu.edu/~tom/NewChapters.html
