
Dr. Arslan Shaukat
Bayesian Estimation
In MLE, θ was assumed to have a fixed but unknown value
In BE, θ is a random variable
Training data allows us to convert a distribution on this
variable into a posterior probability density
The computation of posterior probabilities P(ωi|x) lies at
the heart of Bayesian classification
P ( x | i ).P (i )
P (i | x)  c
 P( x |  j ).P( j )
j 1
Given the training sample set D, Bayes' formula can be written as

P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j \mid D)}
The training samples D can be used to determine the
class-conditional densities and prior probabilities
Assume that the true values of the a priori
probabilities are known or obtainable from a trivial
calculation; thus we substitute P(ωi) = P(ωi|D)
We can separate the training samples by class into c
subsets D1, ...,Dc, with the samples in Di belonging to ωi
The previous expression can then be written as

P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D_j)\, P(\omega_j)}
As in MLE, each class can be treated independently, so we
can dispense with needless class distinctions and
simplify the notation from p(x|ωi, Di) to p(x|D)
We express the fact that p(x) is unknown but has a known
parametric form by saying that the function p(x|θ) is completely known.
Any information we might have about θ prior to
observing the samples is assumed to be contained
in a known prior density p(θ).
Observation of the samples converts this to a
posterior density p(θ|D)
The goal is to compute p(x|D), which we do by integrating
the joint density p(x, θ|D) over θ:

p(x \mid D) = \int p(x, \theta \mid D)\, d\theta = \int p(x \mid \theta, D)\, p(\theta \mid D)\, d\theta = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta

where p(x|θ, D) = p(x|θ) because, once θ is known, x is independent of the training samples D.
In general, if we are less certain about the exact value of θ,
this equation directs us to average p(x|θ) over the possible
values of θ.
Thus, when the unknown densities have a known parametric
form, the samples exert their influence on p(x|D) through the
posterior density p(θ|D).
The basic problem is therefore: "compute the posterior density p(θ|D)",
then "derive p(x|D)"
Using Bayes' formula, we have

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}

and the independence of the samples leads to

p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)
Bayesian Parameter Estimation:
General Theory
The computation of p(x|D) can be applied to any situation
in which the unknown density can be parameterized;
the basic assumptions are:
The form of P(x |) is assumed known, but the value of
 is not known exactly
Our knowledge about  is assumed to be contained in a
known prior density P()
The rest of our knowledge  is contained in a set D of n
random variables x1, x2, …, xn drawn independently
according to the unknown probability density P(x)
p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta, \qquad
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}, \qquad
p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)
Bayesian Parameter Estimation:
Gaussian Case
Goal: estimate μ using the a posteriori density p(μ|D)
The univariate case: p(μ|D)
μ is the only unknown parameter
p(x|μ) ~ N(μ, σ²), with σ² known
p(μ) ~ N(μ0, σ0²)
We assume that whatever prior knowledge we might have
about μ can be expressed by a known prior density p(μ)
μ0 and σ0² are known
Roughly speaking, μ0 represents our best a priori guess
for μ, and σ0² measures our uncertainty about this guess
By Bayes' formula,

p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu) \qquad (1)

where α is a normalization factor that depends on D but not on μ.
We assume that
p(x_k | μ) ~ N(μ, σ²)
p(μ) ~ N(μ0, σ0²)
Substituting into Eq. (1) and collecting the terms that depend on μ, we have

p(\mu \mid D) = \alpha' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]

so p(μ|D) is an exponential of a quadratic in μ, i.e. again a normal density.
If we write p(μ|D) ~ N(μn, σn²), then μn and σn² can be found by equating
coefficients in the previous equation with the corresponding coefficients in
the Gaussian form

p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^{2}\right]

Solving gives

\mu_n = \left(\frac{n\sigma_0^{2}}{n\sigma_0^{2}+\sigma^{2}}\right)\hat{\mu}_n + \frac{\sigma^{2}}{n\sigma_0^{2}+\sigma^{2}}\,\mu_0
\qquad \text{and} \qquad
\sigma_n^{2} = \frac{\sigma_0^{2}\,\sigma^{2}}{n\sigma_0^{2}+\sigma^{2}}

where \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k is the sample mean.
μn represents our best guess for μ after observing n
samples, and σn² measures our uncertainty about this
guess.
Since σn² decreases monotonically with n, approaching
σ²/n as n approaches infinity, each additional observation
decreases our uncertainty about the true value of μ.
As n increases, p(μ|D) becomes more and more sharply
peaked
This behavior is commonly known as Bayesian learning
The Univariate Case: p(x|D)
p(μ|D) has been computed; p(x|D) remains to be computed:

p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu

which is Gaussian. It provides

p(x \mid D) \sim N(\mu_n,\ \sigma^{2} + \sigma_n^{2})

(the desired class-conditional density p(x | Dj, ωj))


Therefore, using p(x | Dj, ωj) together with P(ωj)
in Bayes' formula, we obtain the Bayesian classification rule:
choose the class ωj that maximizes the posterior P(ωj|x, D), equivalently

\max_{j}\; P(\omega_j \mid x, D) \;\longleftrightarrow\; \max_{j}\; p(x \mid \omega_j, D_j)\, P(\omega_j)
Multivariate Case
Assume: p(x|µ) ~ N(µ, Σ), with Σ known
p(µ) ~ N(µ0, Σ0)

We get: p(µ|D) ~ N(µn, Σn)
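A sketch of one standard form of this result is given below: the multivariate analogue of the univariate updates, with matrix inverses replacing divisions. The 2-D data and prior parameters are made up for illustration.

# Sketch: posterior mean and covariance for a multivariate Gaussian mean.
import numpy as np

X = np.array([[0.9, 1.8], [1.2, 2.1], [0.7, 1.9]])   # hypothetical samples
Sigma = np.eye(2)                                      # known covariance
mu0 = np.zeros(2)                                      # prior mean
Sigma0 = 10.0 * np.eye(2)                              # prior covariance

n = len(X)
mu_hat = X.mean(axis=0)                                # sample mean vector
A = np.linalg.inv(Sigma0 + Sigma / n)

mu_n = Sigma0 @ A @ mu_hat + (Sigma / n) @ A @ mu0     # posterior mean mu_n
Sigma_n = Sigma0 @ A @ (Sigma / n)                     # posterior covariance Sigma_n
print(mu_n, Sigma_n)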


Recursive Bayes Learning

Let D^n = {x1, ..., xn} denote the first n samples, with D^0 the empty set.
Using Bayes' formula,

p(\theta \mid D^{n}) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta}

and repeated application of this relation, starting from p(θ|D^0) = p(θ), produces
the sequence of densities p(θ), p(θ|x1), p(θ|x1, x2), and so on.
This is an incremental or on-line learning method, where learning goes
on as the data is collected
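The sketch below illustrates this idea for the univariate Gaussian mean: the posterior after the first n−1 samples plays the role of the prior for the n-th sample. All numerical values are assumptions for illustration; the final result matches the batch formulas for μn and σn².

# Sketch: on-line (recursive) update of the posterior over mu.
sigma2 = 1.0                     # known variance
mu, s2 = 0.0, 2.0                # start from the prior N(mu_0, sigma_0^2)

for x in [0.8, 1.1, 0.9, 1.4]:   # samples arrive one at a time
    mu = (s2 * x + sigma2 * mu) / (s2 + sigma2)   # updated posterior mean
    s2 = (s2 * sigma2) / (s2 + sigma2)            # updated posterior variance
print(mu, s2)                    # same result as the batch formulas for mu_n, sigma_n^2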
Difference between the two methods
Computational complexity:
Maximum likelihood is simpler
Our confidence in the prior information
Form of the solution: the maximum likelihood solution p(x|θ̂) must be of the
assumed parametric form, which is not necessarily true of the Bayesian solution p(x|D).
Bayesian methods use more of the available information than
maximum likelihood and can therefore give better results (see the sketch after this list).
Bayesian methods exploit the information contained in the full, possibly
asymmetric, distribution of θ, while maximum likelihood does not.
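A tiny illustration of this point with assumed numbers: with only two samples and an informative prior, the Bayesian estimate of μ is pulled toward μ0, whereas maximum likelihood returns only the sample mean.

# Sketch: MLE vs. Bayesian posterior mean for a small sample.
import numpy as np

data = np.array([2.9, 3.4])            # only n = 2 samples
sigma2, mu0, sigma0_2 = 1.0, 0.0, 1.0  # known variance and an informative prior

n = len(data)
mu_mle = data.mean()                                            # MLE of mu
mu_bayes = (n * sigma0_2 * mu_mle + sigma2 * mu0) / (n * sigma0_2 + sigma2)
print(mu_mle, mu_bayes)                # e.g. 3.15 vs. 2.10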
Classification Error
To apply these results to multiple classes, separate the
training samples into c subsets D1, . . . , Dc, with the
samples in Di belonging to class ωi, and then estimate
each density p(x|ωi, Di) separately
Different sources of error
Bayes error: due to overlapping class-conditional densities
(related to the features used)
Model error: due to incorrect model
Estimation error: due to estimation of parameters from a
finite sample (can be reduced by increasing the amount of
training data)
Conclusion
The maximum likelihood approach estimates a point in θ
space, while the Bayesian approach estimates a full distribution over θ.
The Bayesian method has strong theoretical and
methodological arguments supporting it, though in
practice maximum likelihood is simpler.
When used to design classifiers, the two approaches mostly
give similar results.
