Arslan Shaukat
Bayesian Estimation
In MLE, θ was assumed to have a fixed value
In BE, θ is a random variable
Training data allows us to convert a prior distribution on this
variable into a posterior probability density
The computation of the posterior probabilities P(ωi|x) lies at
the heart of Bayesian classification
$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\,P(\omega_i)}{\sum_{j=1}^{c} P(x \mid \omega_j)\,P(\omega_j)}$$

When the densities and priors must be learned from the training data D, this becomes

$$P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D)\,P(\omega_i \mid D)}{\sum_{j=1}^{c} P(x \mid \omega_j, D)\,P(\omega_j \mid D)}$$
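As a minimal numeric sketch of this normalization (all likelihood and prior values below are made-up for illustration):

```python
import numpy as np

# Hypothetical values of P(x|w_j) at a single x, and priors P(w_j),
# for c = 3 classes (illustrative numbers only).
likelihoods = np.array([0.20, 0.05, 0.10])
priors      = np.array([0.50, 0.30, 0.20])

joint = likelihoods * priors          # numerator: P(x|w_j) * P(w_j)
posteriors = joint / joint.sum()      # divide by the evidence (the sum)
print(posteriors)                     # approx. [0.741 0.111 0.148]
```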
The training samples D can be used to determine the
class-conditional densities and prior probabilities
Assume that the true values of the a priori
probabilities are known or obtainable from a trivial
calculation; thus we substitute P(ωi) = P(ωi|D)
We can separate the training samples by class into c
subsets D1, ...,Dc, with the samples in Di belonging to ωi
The previous expression can be written as:
$$P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} P(x \mid \omega_j, D_j)\,P(\omega_j)}$$
The posterior density over the parameter vector θ follows from Bayes formula:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{\int P(D \mid \theta)\,P(\theta)\,d\theta}$$

and the independence assumption leads to the value of P(D|θ) as

$$P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$$
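In practice the product over samples is evaluated in log space to avoid underflow; a small sketch assuming a Gaussian form for P(xₖ|θ):

```python
import numpy as np
from scipy.stats import norm

# Illustrative sample and an assumed Gaussian form for P(x_k | theta).
x = np.array([2.1, 1.9, 2.4, 2.0, 2.2])

def log_likelihood(x, mu, sigma):
    # Independence: P(D|theta) = prod_k P(x_k|theta), summed in log space.
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

print(log_likelihood(x, mu=2.0, sigma=0.2))   # parameters near the data
print(log_likelihood(x, mu=0.0, sigma=0.2))   # far-off mean: much lower
```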
Bayesian Parameter Estimation: General Theory
The computation of P(x|D) can be applied to any situation
in which the unknown density can be parametrized;
the basic assumptions are:
The form of P(x|θ) is assumed known, but the value of
θ is not known exactly
Our knowledge about θ is assumed to be contained in a
known prior density P(θ)
The rest of our knowledge is contained in a set D of n
random variables x1, x2, …, xn drawn independently
according to the unknown probability density P(x)
$$p(x \mid D) = \int p(x \mid \theta)\,p(\theta \mid D)\,d\theta$$

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{\int P(D \mid \theta)\,P(\theta)\,d\theta}, \qquad P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$$
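These three formulas can be checked numerically with a grid approximation over a one-dimensional θ; the setup below (Gaussian likelihood with known σ, Gaussian prior, made-up data) is an illustrative assumption:

```python
import numpy as np
from scipy.stats import norm

# Grid approximation: theta is the unknown mean of a Gaussian
# with known sigma (illustrative setup).
sigma = 1.0
D = np.array([1.8, 2.3, 2.1, 1.9])                 # observed samples
theta = np.linspace(-5, 5, 2001)                   # grid over theta
prior = norm.pdf(theta, loc=0.0, scale=2.0)        # known prior P(theta)

# P(D|theta) = prod_k P(x_k|theta), evaluated on the grid in log space.
log_lik = norm.logpdf(D[:, None], loc=theta, scale=sigma).sum(axis=0)
post = np.exp(log_lik) * prior
post /= np.trapz(post, theta)                      # normalize: P(theta|D)

# Predictive density: p(x|D) = integral of p(x|theta) p(theta|D) dtheta.
x = 2.0
p_x_given_D = np.trapz(norm.pdf(x, loc=theta, scale=sigma) * post, theta)
print(p_x_given_D)
```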
Bayesian Parameter Estimation: Gaussian Case
Goal: estimate μ using the a posteriori density P(μ|D)
The univariate case: P(μ|D)
μ is the only unknown parameter

$$P(x \mid \mu) \sim N(\mu, \sigma^2), \qquad P(\mu) \sim N(\mu_0, \sigma_0^2)$$
We assume that whatever prior knowledge we might have
about μ can be expressed by a known prior density p(μ)
μ₀ and σ₀² are known
Roughly speaking, μ₀ represents our best a priori guess
for μ, and σ₀² measures our uncertainty about this guess
By Bayes formula,

$$P(\mu \mid D) = \frac{P(D \mid \mu)\,P(\mu)}{\int P(D \mid \mu)\,P(\mu)\,d\mu} = \alpha \prod_{k=1}^{n} P(x_k \mid \mu)\,P(\mu) \quad (1)$$

where α is a normalization factor that does not depend on μ.
We assume that

$$P(x_k \mid \mu) \sim N(\mu, \sigma^2), \qquad P(\mu) \sim N(\mu_0, \sigma_0^2)$$
We then have, expanding the product of Gaussians and collecting the terms that depend on μ:

$$P(\mu \mid D) = \alpha' \exp\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$$

where α′ absorbs all factors independent of μ.
If we write P(μ|D) ~ N(μₙ, σₙ²), then μₙ and σₙ² can be
found by equating coefficients in the previous equation
with the corresponding coefficients in the Gaussian form:
$$P(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]$$

This yields

$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0 \qquad \text{and} \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean of the n observations.
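A minimal sketch of these two closed-form updates (the function and variable names are my own):

```python
import numpy as np

def gaussian_posterior(x, mu0, var0, var):
    """Posterior N(mu_n, var_n) for the mean of N(mu, var),
    given prior N(mu0, var0), known variance var, and samples x."""
    n = len(x)
    mu_hat = np.mean(x)                       # sample mean, mu-hat_n
    denom = n * var0 + var
    mu_n = (n * var0 / denom) * mu_hat + (var / denom) * mu0
    var_n = (var0 * var) / denom
    return mu_n, var_n

x = np.array([1.8, 2.3, 2.1, 1.9])            # illustrative data
print(gaussian_posterior(x, mu0=0.0, var0=4.0, var=1.0))
```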
μₙ represents our best guess for μ after observing n
samples, and σₙ² measures our uncertainty about this
guess.
Since σₙ² decreases monotonically with n (approaching
σ²/n as n approaches infinity), each additional
observation decreases our uncertainty about the true
value of μ.
As n increases, p(μ|D) becomes more and more sharply
peaked
This behavior is commonly known as Bayesian learning
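A tiny numeric check of this limiting behavior, using the σₙ² formula above with assumed values for σ₀² and σ²:

```python
# sigma_n^2 = (var0 * var) / (n * var0 + var) shrinks toward var / n,
# so P(mu|D) becomes more sharply peaked as n grows.
var0, var = 4.0, 1.0                  # assumed prior and known variances
for n in (1, 10, 100, 1000):
    var_n = (var0 * var) / (n * var0 + var)
    print(n, round(var_n, 5), round(var / n, 5))
```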
The Univariate Case P(x|D)
P(μ|D) computed
P(x|D) remains to be computed:

$$P(x \mid D) = \int P(x \mid \mu)\,P(\mu \mid D)\,d\mu$$
It provides:

$$P(x \mid D) \sim N(\mu_n,\ \sigma^2 + \sigma_n^2)$$

Classification then maximizes the posterior over classes:

$$\max_{j} P(\omega_j \mid x, D) \iff \max_{j} P(x \mid \omega_j, D_j)\,P(\omega_j)$$
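Putting the pieces together, a hedged sketch of such a classifier (two classes, illustrative data and priors; gaussian_posterior repeats the closed-form update from the earlier sketch):

```python
import numpy as np
from scipy.stats import norm

def gaussian_posterior(x, mu0, var0, var):
    # Same closed-form update as in the earlier sketch.
    n = len(x)
    denom = n * var0 + var
    mu_n = (n * var0 / denom) * np.mean(x) + (var / denom) * mu0
    return mu_n, (var0 * var) / denom

# Per-class training subsets D_j and priors P(w_j) (illustrative values).
D = {"w1": np.array([0.9, 1.1, 1.0]), "w2": np.array([2.9, 3.2, 3.1])}
priors = {"w1": 0.5, "w2": 0.5}
var = 0.25                                     # known class variance

def classify(x):
    scores = {}
    for w, data in D.items():
        mu_n, var_n = gaussian_posterior(data, mu0=0.0, var0=10.0, var=var)
        # Predictive density P(x|w_j, D_j) ~ N(mu_n, var + var_n)
        scores[w] = norm.pdf(x, mu_n, np.sqrt(var + var_n)) * priors[w]
    return max(scores, key=scores.get)

print(classify(1.2), classify(2.8))            # -> w1 w2
```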
Multivariate Case
Assume:

$$p(x \mid \mu) \sim N(\mu, \Sigma), \qquad p(\mu) \sim N(\mu_0, \Sigma_0)$$

where Σ, Σ₀, and μ₀ are assumed known, and μ is the only unknown parameter.
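A sketch of the standard multivariate analogue of the univariate updates (it follows from the same equate-coefficients derivation; function and variable names are my own):

```python
import numpy as np

def multivariate_posterior(X, mu0, Sigma0, Sigma):
    """Posterior N(mu_n, Sigma_n) for the mean of N(mu, Sigma):
    the standard multivariate analogue of the univariate updates."""
    n = len(X)
    mu_hat = X.mean(axis=0)                    # sample mean vector
    S = Sigma / n
    K = Sigma0 @ np.linalg.inv(Sigma0 + S)     # "gain" toward the data
    mu_n = K @ mu_hat + (np.eye(len(mu0)) - K) @ mu0
    Sigma_n = K @ S                            # shrinks like Sigma/n
    return mu_n, Sigma_n

rng = np.random.default_rng(1)
X = rng.multivariate_normal([1.0, 2.0], 0.5 * np.eye(2), size=50)
mu_n, Sigma_n = multivariate_posterior(
    X, mu0=np.zeros(2), Sigma0=4.0 * np.eye(2), Sigma=0.5 * np.eye(2))
print(mu_n, Sigma_n, sep="\n")
```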