
The Gaussian Distribution: MLE Estimators and Introduction to Bayesian Estimation

Prof. Nicholas Zabaras

School of Engineering

University of Warwick, Coventry CV4 7AL, United Kingdom

August 7, 2014


Contents

The Gaussian Distribution, Standard Normal, Degenerate Gaussian Distribution, Multivariate Gaussian, the Gaussian and Maximum Entropy, the CLT and the Gaussian Distribution, Convolution of Gaussians, MLE for the Gaussian, MLE for the Multivariate Gaussian

Sequential MLE Estimation for the Gaussian, Robbins-Monro Algorithm

Bayesian Inference for the Gaussian with Known Variance, Bayesian Inference for the Gaussian with Known Mean, Bayesian Inference for the Gaussian with Unknown Mean and Variance

Normal-Gamma Distribution, Gaussian-Wishart Distribution

Following closely Chris Bishop's PRML book, Chapter 2

Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 2


The Gaussian Distribution

A random variable X is Gaussian or normally distributed, X ~ N(μ, σ²), if:

P(X \le t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\} dx

The following can be shown easily with direct integration:

E[X] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\} x\, dx = \mu,

E[X^2] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\} x^2\, dx = \mu^2 + \sigma^2, \qquad \operatorname{var}[X] = E[X^2] - E[X]^2 = \sigma^2

The following integrals are useful in these derivations:

\int_{-\infty}^{\infty} e^{-u^2}\, du = \sqrt{\pi}, \qquad \int_{-\infty}^{\infty} u\, e^{-u^2}\, du = 0, \qquad \int_{-\infty}^{\infty} u^2 e^{-u^2}\, du = \frac{\sqrt{\pi}}{2}

We often work with the precision of a Gaussian, λ = 1/σ². The higher λ is, the narrower the distribution.
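As a quick illustrative check (an added sketch, not part of the original slides, with arbitrary parameter values), the code below evaluates the Gaussian density and verifies the mean and variance identities above by numerical quadrature:

```python
import numpy as np
from scipy import integrate

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma2) for a scalar random variable."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

mu, sigma2 = 1.5, 0.7  # arbitrary example values

# E[X] and E[X^2] by numerical integration over a wide interval
EX, _ = integrate.quad(lambda x: x * gaussian_pdf(x, mu, sigma2), -50, 50)
EX2, _ = integrate.quad(lambda x: x ** 2 * gaussian_pdf(x, mu, sigma2), -50, 50)

print(EX, mu)                  # ~1.5
print(EX2 - EX ** 2, sigma2)   # ~0.7
```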


Standard Normal, CDF, Error Function

Plot of the standard normal PDF N(0, 1) and its CDF. Let Φ(x; 0, 1) denote the corresponding CDF.

[Figure: the standard normal PDF N(x; 0, 1) and its CDF Φ(x; 0, 1), plotted over x ∈ [-3, 3].]

\int_{-\infty}^{x} N(u \mid \mu, \sigma^2)\, du = \Phi(z; 0, 1), \qquad z = (x - \mu)/\sigma

\Phi(z; 0, 1) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2}\, dt = \frac{1}{2}\left[ 1 + \operatorname{erf}\left( z/\sqrt{2} \right) \right]

\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2}\, dt
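A small sketch (added here for concreteness, not in the original slides) showing the erf-based formula above agreeing with a library implementation of the standard normal CDF:

```python
import numpy as np
from scipy.special import erf
from scipy.stats import norm

def Phi(z):
    """Standard normal CDF via the error function: Phi(z) = (1 + erf(z/sqrt(2)))/2."""
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(Phi(z))        # erf-based formula
print(norm.cdf(z))   # should match to numerical precision
```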

Degenerate Gaussian Distribution


Note that as σ² → 0, the Gaussian becomes a delta function centered at the mean:

\lim_{\sigma^2 \to 0} N(x \mid \mu, \sigma^2) = \delta(x - \mu)

Multivariate Gaussian

A multivariate random variable x ∈ ℝ^D is Gaussian if its probability density is

N(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} (\det \boldsymbol{\Sigma})^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}

where μ ∈ ℝ^D and Σ ∈ ℝ^{D×D} is a symmetric positive definite matrix (the covariance matrix).

We often work with the precision matrix Λ = Σ^{-1}.
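For illustration (an added sketch, not in the original slides; the parameter values are assumed examples), the density above can be evaluated directly from its definition, with scipy.stats.multivariate_normal as a reference:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Evaluate N(x | mu, Sigma) from the definition above."""
    D = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])                        # example values
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([0.3, 0.8])

print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should agree
```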


2D Gaussian

Level sets of 2D Gaussians (full, diagonal and spherical covariance matrix)

[Figure: surface and contour plots of 2D Gaussian densities with full, diagonal, and spherical covariance matrices; gaussPlot2DDemo from PMTK.]
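A short sketch (an assumed example, not the PMTK code itself) that reproduces the idea of the figure: contour plots (level sets) of a 2D Gaussian with full, diagonal, and spherical covariance matrices.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Example covariance matrices for the three cases
covs = {
    "full": np.array([[2.0, 1.5], [1.5, 2.0]]),
    "diagonal": np.array([[2.0, 0.0], [0.0, 0.5]]),
    "spherical": np.eye(2),
}

x, y = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-5, 5, 200))
grid = np.dstack((x, y))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, cov) in zip(axes, covs.items()):
    pdf = multivariate_normal(mean=[0.0, 0.0], cov=cov).pdf(grid)
    ax.contour(x, y, pdf)   # level sets of the density
    ax.set_title(name)
    ax.set_aspect("equal")
plt.show()
```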

Multivariate Gaussian: Maximum Entropy

We can show that the multivariate Gaussian maximizes the entropy H subject to the constraints of normalization, a given mean μ, and a given covariance Σ:

\max_{p(\mathbf{x})} \; -\int p(\mathbf{x}) \ln p(\mathbf{x})\, d\mathbf{x}
+ \lambda \left( \int p(\mathbf{x})\, d\mathbf{x} - 1 \right)
+ \mathbf{m}^T \left( \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x} - \boldsymbol{\mu} \right)
+ \operatorname{Tr}\left\{ \mathbf{L} \left( \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T p(\mathbf{x})\, d\mathbf{x} - \boldsymbol{\Sigma} \right) \right\}

where λ, m, and L are Lagrange multipliers.

Setting the derivative wrt p(x) to zero gives:

0 = -1 - \ln p(\mathbf{x}) + \lambda + \mathbf{m}^T \mathbf{x} + \operatorname{Tr}\left\{ \mathbf{L} (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T \right\}

p(\mathbf{x}) = e^{\lambda - 1 + \mathbf{m}^T \mathbf{x} + (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{L} (\mathbf{x} - \boldsymbol{\mu})}

The coefficients can be found by satisfying the constraints.

We start by completing the square.


Multivariate Gaussian: Maximum Entropy

p(\mathbf{x}) = e^{\lambda - 1 + \mathbf{m}^T \boldsymbol{\mu} - \frac{1}{4} \mathbf{m}^T \mathbf{L}^{-1} \mathbf{m}} \; e^{\mathbf{y}^T \mathbf{L} \mathbf{y}}, \qquad \mathbf{y} = \mathbf{x} - \boldsymbol{\mu} + \frac{1}{2} \mathbf{L}^{-1} \mathbf{m}

Satisfying the mean constraint:

\int e^{\lambda - 1 + \mathbf{m}^T \boldsymbol{\mu} - \frac{1}{4} \mathbf{m}^T \mathbf{L}^{-1} \mathbf{m}} \; e^{\mathbf{y}^T \mathbf{L} \mathbf{y}} \left( \mathbf{y} + \boldsymbol{\mu} - \frac{1}{2} \mathbf{L}^{-1} \mathbf{m} \right) d\mathbf{y} = \boldsymbol{\mu}

The 1st term drops from symmetry, the 2nd gives μ from normalization, thus we need to have:

\frac{1}{2} \mathbf{L}^{-1} \mathbf{m} = \mathbf{0} \quad \Rightarrow \quad \mathbf{m} = \mathbf{0}


Multivariate Gaussian: Maximum Entropy

With m = 0 and z = x - μ:

p(\mathbf{x}) = e^{\lambda - 1}\, e^{(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{L} (\mathbf{x} - \boldsymbol{\mu})} = e^{\lambda - 1}\, e^{\mathbf{z}^T \mathbf{L} \mathbf{z}}

Satisfying the variance constraint:

\int e^{\lambda - 1}\, e^{\mathbf{z}^T \mathbf{L} \mathbf{z}}\, \mathbf{z}\mathbf{z}^T\, d\mathbf{z} = \boldsymbol{\Sigma}

Note that with L = -Σ^{-1}/2, the Gaussian integral gives:

\int e^{\mathbf{z}^T \mathbf{L} \mathbf{z}}\, \mathbf{z}\mathbf{z}^T\, d\mathbf{z} = (2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}\, \boldsymbol{\Sigma}

It remains to select λ such that:

e^{\lambda - 1} = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}}, \qquad \lambda - 1 = \ln \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}}

The optimizing p(x) is now clearly the Gaussian.

Multivariate Gaussian: Maximum Entropy

The entropy of the multivariate Gaussian is now computed as follows:

H[\mathbf{x}] = -\int N(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \ln N(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\, d\mathbf{x}

= \int N(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \left[ \frac{D}{2} \ln(2\pi) + \frac{1}{2} \ln |\boldsymbol{\Sigma}| + \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right] d\mathbf{x}

= \frac{D}{2} \ln(2\pi) + \frac{1}{2} \ln |\boldsymbol{\Sigma}| + \frac{1}{2} \operatorname{tr}\left\{ \boldsymbol{\Sigma}^{-1} \int N(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T d\mathbf{x} \right\}

= \frac{D}{2} \ln(2\pi) + \frac{1}{2} \ln |\boldsymbol{\Sigma}| + \frac{1}{2} \operatorname{tr}\left\{ \boldsymbol{\Sigma}^{-1} \boldsymbol{\Sigma} \right\}
= \frac{D}{2} \ln(2\pi) + \frac{1}{2} \ln |\boldsymbol{\Sigma}| + \frac{D}{2}
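As a quick numerical check (an added sketch, not from the original slides; the covariance is an arbitrary example), the closed-form entropy above can be compared against scipy's built-in entropy for a multivariate normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_entropy(Sigma):
    """H = (D/2) ln(2*pi) + (1/2) ln|Sigma| + D/2 (in nats)."""
    D = Sigma.shape[0]
    return 0.5 * D * np.log(2 * np.pi) + 0.5 * np.linalg.slogdet(Sigma)[1] + 0.5 * D

Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])   # example covariance
print(gaussian_entropy(Sigma))
print(multivariate_normal(mean=np.zeros(2), cov=Sigma).entropy())   # should agree
```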


Multivariate Gaussian: Maximum Entropy

Using also the KL divergence, one can show that the Gaussian has larger entropy than any other distribution satisfying the same mean and second-moment constraints. To make the presentation simple, consider p(x) = N(x | μ, Σ) and any density q(x) with the same mean μ and second moments, i.e. ∫ q(x)(x - μ)(x - μ)^T dx = Σ. Then:

0 \le KL(q \,\|\, p) = \int q(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})}\, d\mathbf{x}
= -H[q] - \int q(\mathbf{x}) \ln p(\mathbf{x})\, d\mathbf{x}
= -H[q] - \int p(\mathbf{x}) \ln p(\mathbf{x})\, d\mathbf{x}
= -H[q] + H[p]

so that H[q] ≤ H[p].

The intermediate step in the proof above accounts for the moment constraint on q and the fact that log(p) is quadratic in x!


The CLT and the Gaussian Distribution

Let X_1, X_2, …, X_N be independent and identically distributed (i.i.d.) continuous random variables, each with expectation μ and variance σ².

Define:

Z_N = \frac{1}{\sigma \sqrt{N}} \left( X_1 + X_2 + \dots + X_N - N\mu \right)

As N → ∞, the distribution of Z_N converges to the distribution of a standard normal random variable:

\lim_{N \to \infty} P(Z_N \le x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt

If \bar{X}_N = \frac{1}{N} \sum_{j=1}^{N} X_j, then for N large,

\bar{X}_N \sim N\!\left( \mu, \frac{\sigma^2}{N} \right) \quad \text{as } N \to \infty

Somewhat of a justification for assuming that Gaussian noise is common.


The CLT and the Gaussian Distribution

As an example, assume N variables X_1, X_2, …, X_N, each of which has a uniform distribution over [0, 1], and consider the distribution of the mean (X_1 + X_2 + … + X_N)/N. For large N, this distribution tends to a Gaussian. The convergence as N increases can be rapid.

[Figure: histograms of the mean of N uniformly distributed variables on [0, 1]; the shape approaches a Gaussian as N increases.]
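A brief sketch (an added illustration, not the original figure code; the values of N and the number of trials are assumptions) reproducing this experiment with histograms of the mean of N uniform [0, 1] samples:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
num_trials = 100_000

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, N in zip(axes, [1, 2, 10]):
    # Mean of N uniform [0, 1] variables, repeated num_trials times
    means = rng.uniform(0.0, 1.0, size=(num_trials, N)).mean(axis=1)
    ax.hist(means, bins=50, density=True)
    ax.set_title(f"N = {N}")
    ax.set_xlim(0, 1)
plt.show()
```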


The CLT and the Gaussian Distribution

Histogram of \frac{1}{N} \sum_{j=1}^{N} x_{ij}, \; i = 1, \dots, 10000, where x_{ij} ~ Beta(1, 5).

[Figure: histograms of the sample mean for N = 1, N = 5, and N = 10; the distribution approaches a Gaussian as N grows.]


The CLT and the Gaussian Distribution

One consequence of this result is that the binomial distribution, which is a distribution over m defined as the sum of N observations of a binary random variable x, will tend to a Gaussian as N → ∞.


Example of the Convolution of Gaussians


Consider two Gaussians x_1 ~ N(μ_1, λ_1^{-1}) and x_2 ~ N(μ_2, λ_2^{-1}). We want to compute the entropy of the distribution of x = x_1 + x_2.

p(x) can be computed from the convolution of the two Gaussians:

p(x) = \int p(x \mid x_2)\, p(x_2)\, dx_2 = \int N(x - x_2 \mid \mu_1, \lambda_1^{-1})\, N(x_2 \mid \mu_2, \lambda_2^{-1})\, dx_2

We need to complete the square in the exponent in x_2:

-\frac{\lambda_1}{2}(x - x_2 - \mu_1)^2 - \frac{\lambda_2}{2}(x_2 - \mu_2)^2
= -\frac{\lambda_1 + \lambda_2}{2} \left( x_2 - \frac{\lambda_1 (x - \mu_1) + \lambda_2 \mu_2}{\lambda_1 + \lambda_2} \right)^2
- \frac{1}{2} \frac{\lambda_1 \lambda_2}{\lambda_1 + \lambda_2} (x - \mu_1 - \mu_2)^2

The 1st term is integrated out and the precision of x is:

\lambda = \frac{\lambda_1 \lambda_2}{\lambda_1 + \lambda_2}, \qquad \text{i.e.} \quad \frac{1}{\lambda} = \frac{1}{\lambda_1} + \frac{1}{\lambda_2}
Thus the entropy of x is:

H[x] = \frac{1}{2} \ln\left( 2\pi e\, \sigma^2 \right) = \frac{1}{2} \ln\left[ 2\pi e \left( \frac{1}{\lambda_1} + \frac{1}{\lambda_2} \right) \right]

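A small sanity check (an added sketch with arbitrary precisions and means) comparing the analytic entropy of x = x_1 + x_2 with the entropy of a Gaussian whose variance is 1/λ_1 + 1/λ_2:

```python
import numpy as np
from scipy.stats import norm

lam1, lam2 = 2.0, 0.5        # example precisions
mu1, mu2 = 1.0, -0.3

var_x = 1.0 / lam1 + 1.0 / lam2                 # variance of x = x1 + x2
H_analytic = 0.5 * np.log(2 * np.pi * np.e * var_x)

# Cross-check against scipy's differential entropy of N(mu1 + mu2, var_x)
H_scipy = norm(loc=mu1 + mu2, scale=np.sqrt(var_x)).entropy()

print(H_analytic, H_scipy)   # should agree
```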

Maximum Likelihood for a Gaussian

Suppose that we have a data set of observations D = (x_1, …, x_N)^T, representing N observations of the scalar random variable X. The observations are drawn independently from a Gaussian distribution whose mean μ and variance σ² are unknown.

We would like to determine these parameters from the data set.

Data points that are drawn independently from the same distribution are said to be independent and identically distributed, which is often abbreviated to i.i.d.


Maximum Likelihood for a Gaussian

Because our data set D is i.i.d., we can write the probability of the data set, given μ and σ², in the form

Likelihood function: \quad p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{i=1}^{N} N(x_i \mid \mu, \sigma^2)

This is seen as a function of μ and σ².

Max Likelihood for a Gaussian Distribution

Likelihood function: \quad p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{i=1}^{N} N(x_i \mid \mu, \sigma^2)

One common criterion for determining the parameters in a probability distribution using an observed data set is to find the parameter values that maximize the likelihood function, i.e. maximizing the probability of the data given the parameters (contrast this with maximizing the probability of the parameters given the data).

We can equivalently maximize the log-likelihood:

\max_{\mu, \sigma^2} \ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)

\mu_{ML} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \sigma_{ML}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_{ML})^2
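A minimal sketch (added, with synthetic data and assumed true parameters) of these ML formulas in code:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma2_true = 2.0, 1.5
x = rng.normal(mu_true, np.sqrt(sigma2_true), size=1000)   # synthetic data set

# Maximum likelihood estimates
mu_ml = x.mean()                          # (1/N) sum_i x_i
sigma2_ml = np.mean((x - mu_ml) ** 2)     # (1/N) sum_i (x_i - mu_ML)^2

print(mu_ml, sigma2_ml)
```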


Maximum Likelihood for a Gaussian Distribution

\mu_{ML} = \frac{1}{N} \sum_{i=1}^{N} x_i \;\; \text{(sample mean)}, \qquad
\sigma_{ML}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_{ML})^2 \;\; \text{(sample variance wrt the ML mean, not the exact mean)}

The MLE underestimates the variance (bias due to overfitting) because μ_ML fitted some of the noise in the data.

The maximum likelihood solutions μ_ML, σ²_ML are functions of the data set values x_1, …, x_N. Consider the expectations of these quantities with respect to the data set values, which come from a Gaussian.

Using the equations above you can show that:

E[\mu_{ML}] = \mu, \qquad E[\sigma_{ML}^2] = \frac{N-1}{N} \sigma^2

In this derivation you need to use:

E[x_i x_j] = \mu^2 \;\; \text{for } i \ne j, \qquad E[x_i^2] = \mu^2 + \sigma^2

Maximum Likelihood for a Gaussian Distribution

E[\sigma_{ML}^2] = \frac{N-1}{N} \sigma^2

We use:

E[x_i x_j] = \mu^2 \;\; \text{for } i \ne j, \qquad E[x_i^2] = \mu^2 + \sigma^2

E[\sigma_{ML}^2] = E\left[ \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2 \right]
= E\left[ \frac{1}{N} \sum_{n=1}^{N} \left( x_n - \frac{1}{N} \sum_{m=1}^{N} x_m \right)^2 \right]

= \frac{1}{N} \sum_{n=1}^{N} E\left[ x_n^2 - \frac{2}{N} x_n \sum_{m=1}^{N} x_m + \frac{1}{N^2} \sum_{m=1}^{N} \sum_{l=1}^{N} x_m x_l \right]

= (\mu^2 + \sigma^2) - 2\left( \mu^2 + \frac{\sigma^2}{N} \right) + \left( \mu^2 + \frac{\sigma^2}{N} \right) = \frac{N-1}{N} \sigma^2


Maximum Likelihood for a Gaussian Distribution

E[\mu_{ML}] = \mu, \qquad E[\sigma_{ML}^2] = \frac{N-1}{N} \sigma^2

On average the MLE estimate obtains the correct mean but will underestimate the true variance by a factor (N - 1)/N.

An unbiased estimate of the variance is given as:

\tilde{\sigma}^2 = \frac{N}{N-1} \sigma_{ML}^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu_{ML})^2

For large N, the bias is not a problem.

This result can be obtained from a Bayesian treatment in which we marginalize over the unknown mean.

The N - 1 factor takes into account the fact that one degree of freedom has been used in fitting the mean and removes the bias of the MLE.
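A quick Monte Carlo sketch (added, with assumed parameter values) illustrating the (N-1)/N bias of the ML variance and the corrected estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, N, trials = 0.0, 4.0, 5, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_ml = x.mean(axis=1, keepdims=True)

sigma2_ml = np.mean((x - mu_ml) ** 2, axis=1)   # biased ML estimate
sigma2_unbiased = sigma2_ml * N / (N - 1)       # corrected estimate

print(sigma2_ml.mean(), (N - 1) / N * sigma2)   # ~ (N-1)/N * sigma^2
print(sigma2_unbiased.mean(), sigma2)           # ~ sigma^2
```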


MLE for the Multivariate Gaussian

We can easily generalize the earlier results for a multivariate Gaussian. The log-likelihood takes the form:

\ln p(D \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln |\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu})
Setting the derivatives wrt μ and Σ equal to zero gives the following:

\boldsymbol{\mu}_{ML} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n, \qquad
\boldsymbol{\Sigma}_{ML} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{ML})(\mathbf{x}_n - \boldsymbol{\mu}_{ML})^T

We provide a proof of the calculation of Σ_ML next.
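A brief sketch (added, with synthetic data and assumed true parameters) computing these multivariate ML estimates:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)   # N x D data matrix

mu_ml = X.mean(axis=0)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / X.shape[0]   # (1/N) sum_n (x_n - mu_ML)(x_n - mu_ML)^T

print(mu_ml)
print(Sigma_ml)
```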


MLE for the Multivariate Gaussian

\ln p(D \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln |\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu})

We differentiate the log-likelihood wrt Σ^{-1}. Each contributing term is:

\frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}} \left( \frac{N}{2} \ln |\boldsymbol{\Sigma}^{-1}| \right) = \frac{N}{2} \boldsymbol{\Sigma}

A useful trick: \quad -\frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}) = -\frac{N}{2} \operatorname{Tr}\left\{ \boldsymbol{\Sigma}^{-1} \mathbf{S} \right\}, \qquad \mathbf{S} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^T

\frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}} \left( -\frac{N}{2} \operatorname{Tr}\left\{ \boldsymbol{\Sigma}^{-1} \mathbf{S} \right\} \right) = -\frac{N}{2} \mathbf{S}^T = -\frac{N}{2} \mathbf{S} \quad (\mathbf{S} \text{ symmetric})

Setting the sum of these derivatives to zero, \frac{N}{2} \boldsymbol{\Sigma} - \frac{N}{2} \mathbf{S} = 0, so finally \boldsymbol{\Sigma}_{ML} = \mathbf{S}.

Here we used:

\frac{\partial |\mathbf{A}|}{\partial \mathbf{A}} = |\mathbf{A}| \left( \mathbf{A}^{-1} \right)^T, \qquad
\frac{\partial}{\partial \mathbf{A}} \ln |\mathbf{A}| = \left( \mathbf{A}^{-1} \right)^T, \qquad
\operatorname{tr}(\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{A}), \qquad
\frac{\partial}{\partial \mathbf{A}} \operatorname{Tr}(\mathbf{A}\mathbf{B}) = \mathbf{B}^T


Appendix: Some Useful Matrix Operations


Show that \frac{\partial}{\partial \mathbf{A}} \operatorname{Tr}(\mathbf{A}\mathbf{B}) = \mathbf{B}^T (and similarly \frac{\partial}{\partial \mathbf{A}} \operatorname{Tr}(\mathbf{A}^T\mathbf{B}) = \mathbf{B}). Indeed,

\frac{\partial}{\partial A_{mn}} \operatorname{Tr}(\mathbf{A}\mathbf{B}) = \frac{\partial}{\partial A_{mn}} \sum_{i,k} A_{ik} B_{ki} = B_{nm} = \left( \mathbf{B}^T \right)_{mn}

Show that \frac{\partial}{\partial \mathbf{A}} \ln |\mathbf{A}| = \left( \mathbf{A}^{-1} \right)^T.

Using the cofactor expansion of the determinant along row i, |\mathbf{A}| = \sum_{j} (-1)^{i+j} A_{ij} M_{ij}, with M_{ij} the minors:

\frac{\partial}{\partial A_{mn}} \ln |\mathbf{A}| = \frac{1}{|\mathbf{A}|} \frac{\partial |\mathbf{A}|}{\partial A_{mn}} = \frac{1}{|\mathbf{A}|} (-1)^{m+n} M_{mn} = \left( \mathbf{A}^{-1} \right)_{nm} = \left( \mathbf{A}^{-T} \right)_{mn}
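As an added sanity check (not part of the original appendix; the test matrix is an arbitrary positive definite example), the identity ∂ln|A|/∂A = (A^{-1})^T can be verified numerically with finite differences:

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(3, 3))
A = B @ B.T + np.eye(3)          # positive definite, so det(A) > 0

analytic = np.linalg.inv(A).T    # claimed gradient of ln|A| wrt A

eps = 1e-6
numeric = np.zeros_like(A)
for m in range(3):
    for n in range(3):
        Ap, Am = A.copy(), A.copy()
        Ap[m, n] += eps
        Am[m, n] -= eps
        numeric[m, n] = (np.log(np.linalg.det(Ap)) - np.log(np.linalg.det(Am))) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # ~1e-9, consistent with the identity
```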