
Lecture 19: More EM
Machine Learning, April 15, 2010

Last Time
Expectation Maximization; Gaussian Mixture Models

Today
EM Proof
Jensen's Inequality

Clustering sequential data

EM over HMMs
EM in any Graphical Model
Gibbs Sampling

Gaussian Mixture Models

How can we be sure GMM/EM works?

We've already seen that there are multiple clustering solutions for the same data.
This is a non-convex optimization problem.

Can we prove that we're approaching some maximum, even if many exist?

Bound maximization
Since we can't optimize the GMM parameters directly, maybe we can find the maximum of a lower bound. Technically: optimize a concave lower bound of the initial non-convex function.

EM as a bound maximization problem

Need to define a function $Q_t(\theta)$ such that:
$Q_t(\theta) \le \ell(\theta)$ for all $\theta$ (a lower bound on the likelihood)
$Q_t(\theta_t) = \ell(\theta_t)$ at the single point $\theta_t$ (the bound is tight at the current estimate)
$Q_t(\theta)$ is concave

Then maximizing the bound improves the likelihood:
$\ell(\theta_{t+1}) \ge Q_t(\theta_{t+1}) \ge Q_t(\theta_t) = \ell(\theta_t)$

EM as bound maximization

Claim: for the GMM log-likelihood
$\ell(\theta) = \sum_n \log \sum_z p(x_n, z \mid \theta)$
the auxiliary function
$Q(\theta \mid \theta_t) = \sum_n \sum_z \gamma_{zn} \log p(x_n, z \mid \theta)$, where $\gamma_{zn} = p(z \mid x_n, \theta_t)$,
is (up to an additive constant) a concave lower bound on $\ell(\theta)$, and maximizing it over $\theta$ gives exactly the GMM MLE updates.

EM Correctness Proof

Prove that $\ell(\theta) \ge Q(\theta \mid \theta_t)$.

$\ell(\theta) = \sum_n \log p(x_n \mid \theta)$   (likelihood function)
$= \sum_n \log \sum_z p(x_n, z \mid \theta)$   (introduce the hidden variable; the mixture components in a GMM)
$= \sum_n \log \sum_z p(z \mid x_n, \theta_t) \frac{p(x_n, z \mid \theta)}{p(z \mid x_n, \theta_t)}$   (multiply and divide by $p(z \mid x_n, \theta_t)$, a fixed value of $\theta_t$)
$\ge \sum_n \sum_z p(z \mid x_n, \theta_t) \log \frac{p(x_n, z \mid \theta)}{p(z \mid x_n, \theta_t)}$   (Jensen's inequality, coming soon)
$= \sum_n \sum_z p(z \mid x_n, \theta_t) \log p(x_n, z \mid \theta) - \sum_n \sum_z p(z \mid x_n, \theta_t) \log p(z \mid x_n, \theta_t)$

EM Correctness Proof

$\ell(\theta) \ge \sum_n \sum_z p(z \mid x_n, \theta_t) \log p(x_n, z \mid \theta) - \sum_n \sum_z p(z \mid x_n, \theta_t) \log p(z \mid x_n, \theta_t)$

The second term is constant with respect to $\theta$, so

$\theta_{t+1} = \operatorname{argmax}_\theta Q(\theta \mid \theta_t)$
$= \operatorname{argmax}_\theta \sum_n \sum_z p(z \mid x_n, \theta_t) \log p(x_n, z \mid \theta)$
$= \operatorname{argmax}_\theta \sum_n \sum_z \gamma_{zn} \log p(x_n, z \mid \theta)$   (the GMM Maximum Likelihood Estimation problem)
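As a quick numerical check of this bound (not part of the original slides): the sketch below, in Python with numpy/scipy and made-up 1-D data and parameters, evaluates l(θ) and the full lower bound from the proof (including the entropy term) and confirms that the bound equals l at θ = θt and lies below it elsewhere.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])  # toy 1-D data

    def log_lik(x, pi, mu, sigma):
        # l(theta) = sum_n log sum_z pi_z N(x_n | mu_z, sigma_z)
        joint = pi * norm.pdf(x[:, None], mu, sigma)         # p(x_n, z | theta), shape (N, K)
        return np.log(joint.sum(axis=1)).sum()

    def lower_bound(x, theta, theta_t):
        # sum_n sum_z p(z|x_n,theta_t) log [ p(x_n,z|theta) / p(z|x_n,theta_t) ]
        joint_t = theta_t[0] * norm.pdf(x[:, None], theta_t[1], theta_t[2])
        resp = joint_t / joint_t.sum(axis=1, keepdims=True)  # p(z | x_n, theta_t)
        joint = theta[0] * norm.pdf(x[:, None], theta[1], theta[2])
        return (resp * (np.log(joint) - np.log(resp))).sum()

    theta_t = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
    theta   = (np.array([0.4, 0.6]), np.array([-2.0, 3.0]), np.array([1.5, 0.8]))

    print(log_lik(x, *theta_t), lower_bound(x, theta_t, theta_t))  # equal: bound is tight at theta_t
    print(log_lik(x, *theta),   lower_bound(x, theta,   theta_t))  # bound <= l(theta) elsewhere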

The missing link: Jensen's Inequality

If f is concave: $f(E\{x\}) \ge E\{f(x)\}$. An incredibly important tool for dealing with mixture models.

If f(x) = log(x):
$\log \sum_i \pi_i\, p(x \mid \mu_i, \Sigma_i) \ge \sum_i \pi_i \log p(x \mid \mu_i, \Sigma_i)$

Two-point case:
$\log(\lambda x_1 + (1 - \lambda) x_2) \ge \lambda \log(x_1) + (1 - \lambda) \log(x_2)$
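A small numerical illustration in Python (the numbers are arbitrary, not from the lecture): for the concave function log, the log of an average is at least the average of the logs.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0.5, 5.0, size=10_000)       # any positive random variable
    w = np.full(x.size, 1.0 / x.size)            # uniform weights (a valid distribution)

    lhs = np.log(np.sum(w * x))                  # f(E{x}) = log of the mean
    rhs = np.sum(w * np.log(x))                  # E{f(x)} = mean of the logs
    print(lhs, rhs, lhs >= rhs)                  # Jensen: log E{x} >= E{log x}

    # Two-point version: log(lam*x1 + (1-lam)*x2) >= lam*log(x1) + (1-lam)*log(x2)
    lam, x1, x2 = 0.3, 2.0, 7.0
    print(np.log(lam * x1 + (1 - lam) * x2) >= lam * np.log(x1) + (1 - lam) * np.log(x2))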

Generalizing EM from GMM


Notice that the EM optimization proof never used the exact form of the GMM, only the introduction of a hidden variable, z. Thus, we can generalize the form of EM to broader types of latent variable models.

General form of EM

Given a joint distribution over observed and latent variables, p(X, Z|θ), we want to maximize p(X|θ).

1. Initialize parameters θ_old
2. E-Step: Evaluate p(Z|X, θ_old)
3. M-Step: Re-estimate the parameters based on the expectation of the complete-data log likelihood:
   $\theta_{new} = \operatorname{argmax}_\theta \sum_Z p(Z \mid X, \theta_{old}) \ln p(X, Z \mid \theta)$
4. Check for convergence of the parameters or the likelihood
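As a structural sketch only (function names, signatures, and the tolerance are illustrative, not from the lecture), the loop above looks like this in Python; the model-specific work lives entirely in e_step and m_step.

    import numpy as np

    def em(x, theta0, e_step, m_step, log_lik, tol=1e-6, max_iter=200):
        """Generic EM loop.
        e_step(x, theta)  -> posterior over latent variables, p(Z | X, theta_old)
        m_step(x, post)   -> argmax_theta sum_Z p(Z | X, theta_old) ln p(X, Z | theta)
        log_lik(x, theta) -> log p(X | theta), used to monitor convergence
        """
        theta, prev_ll = theta0, -np.inf
        for _ in range(max_iter):
            post = e_step(x, theta)            # E-step: evaluate p(Z | X, theta_old)
            theta = m_step(x, post)            # M-step: re-estimate the parameters
            ll = log_lik(x, theta)
            if ll - prev_ll < tol:             # check convergence of the likelihood
                break
            prev_ll = ll
        return theta, ll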

Applying EM to Graphical Models


Now we have a general form for learning parameters for latent variables.
Take a guess
Expectation: evaluate the likelihood
Maximization: re-estimate the parameters
Check for convergence

Clustering over sequential data


Recall HMMs

We only looked at training supervised HMMs. What if you believe the data is sequential, but you can't observe the state?

EM on HMMs
(also known as Baum-Welch)

Recall the HMM parameters, estimated from observed state and emission indicator counts:

$\pi_i = q_0^i$
$a_{ij} = \frac{\sum_{t=0}^{T-2} q_t^i q_{t+1}^j}{\sum_{k=0}^{M-1} \sum_{t=0}^{T-2} q_t^i q_{t+1}^k}$
$b_{ij} = \frac{\sum_{t=0}^{T-1} q_t^i x_t^j}{\sum_{k=0}^{N-1} \sum_{t=0}^{T-1} q_t^i x_t^k}$

Now the training counts are estimated:

$\pi_i = E\{q_0^i\}$
$a_{ij} = \frac{\sum_{t=0}^{T-2} E\{q_t^i q_{t+1}^j\}}{\sum_{k=0}^{M-1} \sum_{t=0}^{T-2} E\{q_t^i q_{t+1}^k\}}$
$b_{ij} = \frac{\sum_{t=0}^{T-1} E\{q_t^i x_t^j\}}{\sum_{k=0}^{N-1} \sum_{t=0}^{T-1} E\{q_t^i x_t^k\}}$
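For comparison, a brief Python sketch of the fully observed (supervised) case, where the counts are taken directly from one-hot indicator arrays; the array names and shapes are my own. Baum-Welch uses the same ratios with the indicators replaced by their expectations.

    import numpy as np

    def supervised_hmm_estimates(q, x):
        """q: (T, M) one-hot state indicators q_t^i; x: (T, N) one-hot emission indicators x_t^j."""
        pi = q[0]                                           # pi_i = q_0^i
        A_counts = np.einsum('ti,tj->ij', q[:-1], q[1:])    # sum_{t=0}^{T-2} q_t^i q_{t+1}^j
        a = A_counts / A_counts.sum(axis=1, keepdims=True)  # divide by the sum over k
        B_counts = np.einsum('ti,tj->ij', q, x)             # sum_{t=0}^{T-1} q_t^i x_t^j
        b = B_counts / B_counts.sum(axis=1, keepdims=True)
        return pi, a, b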

EM on HMMs
Standard EM Algorithm
Initialize
E-Step: evaluate the expected likelihood
M-Step: re-estimate parameters from the expected likelihood
Check for convergence

EM on HMMs

Guess: Initialize parameters θ = [π, a, b]^T
E-Step: Compute E{l(θ)} = E{log p(x, q|θ)}

$E\{\log p(x, q \mid \theta)\} = E\Big\{ \sum_{i=0}^{M-1} q_0^i \log \pi_i + \sum_{t=1}^{T-1} \sum_{i,j=0}^{M-1} q_t^i q_{t-1}^j \log a_{ij} + \sum_{t=0}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} q_t^i x_t^j \log b_{ij} \Big\}$

$= \sum_{i=0}^{M-1} E\{q_0^i\} \log \pi_i + \sum_{t=1}^{T-1} \sum_{i,j=0}^{M-1} E\{q_t^i q_{t-1}^j\} \log a_{ij} + \sum_{t=0}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} E\{q_t^i x_t^j\} \log b_{ij}$

The three terms come from $p(q_0)$, $\prod_t p(q_t \mid q_{t-1})$, and $\prod_t p(x_t \mid q_t)$.

EM on HMMs

But what are these E{·} quantities?

$E\{\log p(x, q \mid \theta)\} = \sum_{i=0}^{M-1} E\{q_0^i\} \log \pi_i + \sum_{t=1}^{T-1} \sum_{i,j=0}^{M-1} E\{q_t^i q_{t-1}^j\} \log a_{ij} + \sum_{t=0}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} E\{q_t^i x_t^j\} \log b_{ij}$

For an indicator variable, the expectation is just a probability:
$E\{x^i\} = \sum_x p(x)\, x^i = \sum_x p(x)\, \delta(x = x_i) = p(x_i)$

so

$E\{q_0^i\} = p(q_0^i \mid x, \theta)$,  $E\{q_t^i q_{t-1}^j\} = p(q_t^i, q_{t-1}^j \mid x, \theta)$,  $E\{q_t^i\} = p(q_t^i \mid x, \theta)$

These can be efficiently calculated from JTA potentials and separators.
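For the HMM chain, the JTA message passing reduces to the classic forward-backward recursions. A hedged numpy sketch (variable names and the scaling scheme are mine, not the lecture's code) that computes gamma[t, i] = E{q_t^i} and xi[t, i, j] = E{q_t^i q_{t+1}^j} given the current parameters:

    import numpy as np

    def forward_backward(x, pi, a, b):
        """x: (T,) observed symbol indices; pi: (M,); a: (M, M) transitions; b: (M, N) emissions."""
        T, M = len(x), len(pi)
        alpha = np.zeros((T, M)); beta = np.zeros((T, M)); c = np.zeros(T)
        alpha[0] = pi * b[:, x[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):                              # scaled forward pass
            alpha[t] = (alpha[t - 1] @ a) * b[:, x[t]]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):                     # scaled backward pass
            beta[t] = (a @ (b[:, x[t + 1]] * beta[t + 1])) / c[t + 1]
        gamma = alpha * beta                               # E{q_t^i} = p(q_t^i | x, theta)
        xi = (alpha[:-1, :, None] * a[None, :, :] *        # E{q_t^i q_{t+1}^j | x, theta}
              (b[:, x[1:]].T * beta[1:])[:, None, :] / c[1:, None, None])
        return gamma, xi, np.log(c).sum()                  # last value is log p(x | theta)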

EM on HMMs
[Figure: the junction tree for an HMM, with cliques ψ(q_0, x_0), ψ(q_{t-1}, q_t), ψ(q_t, x_t) and separators φ(q_t). The needed marginals p(q_0 | x_0 ... x_n), p(q_t | x_0 ... x_n), and p(q_t, q_{t-1} | x_0 ... x_n) are read directly off these potentials and separators.]

EM on HMMs

Standard EM Algorithm
Initialize
E-Step: evaluate the expected likelihood (via the JTA algorithm)
M-Step: re-estimate parameters from the expected likelihood, using expected values from JTA potentials and separators:

$\pi_i = E\{q_0^i\}$
$a_{ij} = \frac{\sum_{t=0}^{T-2} E\{q_t^i q_{t+1}^j\}}{\sum_{k=0}^{M-1} \sum_{t=0}^{T-2} E\{q_t^i q_{t+1}^k\}}$
$b_{ij} = \frac{\sum_{t=0}^{T-1} E\{q_t^i x_t^j\}}{\sum_{k=0}^{N-1} \sum_{t=0}^{T-1} E\{q_t^i x_t^k\}}$

Check for convergence
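Continuing the forward-backward sketch above (still illustrative naming, not the lecture's code), the M-step plugs the expected counts into the same ratios:

    import numpy as np

    def baum_welch_m_step(x, gamma, xi, n_symbols):
        """x: (T,) symbol indices; gamma: (T, M); xi: (T-1, M, M)."""
        pi = gamma[0]                                  # pi_i = E{q_0^i}
        a = xi.sum(axis=0)                             # sum_t E{q_t^i q_{t+1}^j}
        a /= a.sum(axis=1, keepdims=True)              # / sum_k sum_t E{q_t^i q_{t+1}^k}
        onehot = np.eye(n_symbols)[x]                  # x_t^j as a (T, N) indicator array
        b = gamma.T @ onehot                           # sum_t E{q_t^i} x_t^j
        b /= b.sum(axis=1, keepdims=True)              # / sum_k sum_t E{q_t^i} x_t^k
        return pi, a, b

One full Baum-Welch iteration is then: run forward_backward with the current (pi, a, b), feed the resulting gamma and xi into baum_welch_m_step, and repeat until the returned log-likelihood stops increasing.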

Training latent variables in Graphical Models


Now consider a general Graphical Model with latent variables.

EM on Latent Variable Models


Guess
Easy, just assign random values to parameters

E-Step: Evaluate likelihood.


We can use the JTA to evaluate the likelihood and to compute the marginals of the latent variables (the expected counts).

M-Step: Re-estimate parameters.


This can get trickier.

Maximization Step in Latent Variable Models

Why is this easy in HMMs, but difficult in general latent variable models?

[Figures: the HMM junction tree with cliques ψ(q_{t-1}, q_t) and ψ(q_t, x_t); a graphical model in which a node has many parents; a dense graph.]

Junction Trees
In general, we have no guarantee that we can isolate a single variable; we need to estimate each marginal separately (e.g., in dense graphs).

M-Step in Latent Variable Models

M-Step: Re-estimate Parameters.
Keep k-1 parameters fixed (to the current estimate).
Identify a better guess for the free parameter.


M-Step in Latent Variable Models

M-Step: Re-estimate Parameters.

$x^{(t+1)} \sim p(x \mid y^{(t)})$
$y^{(t+1)} \sim p(y \mid x^{(t+1)})$

Gibbs Sampling. This is helpful if it's easier to sample from a conditional than it is to integrate to get the marginal.
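A minimal runnable sketch of this two-variable sweep (the bivariate Gaussian target is purely illustrative, chosen because both conditionals are known in closed form):

    import numpy as np

    def gibbs_bivariate_normal(rho, n_samples=5000, seed=0):
        """Sample from a standard bivariate normal with correlation rho by alternating
        x ~ p(x | y) and y ~ p(y | x); each conditional is a 1-D Gaussian."""
        rng = np.random.default_rng(seed)
        x, y = 0.0, 0.0
        samples = np.empty((n_samples, 2))
        cond_sd = np.sqrt(1.0 - rho ** 2)            # conditional standard deviation
        for t in range(n_samples):
            x = rng.normal(rho * y, cond_sd)         # x^(t+1) ~ p(x | y^(t))
            y = rng.normal(rho * x, cond_sd)         # y^(t+1) ~ p(y | x^(t+1))
            samples[t] = x, y
        return samples

    s = gibbs_bivariate_normal(rho=0.8)
    print(np.corrcoef(s[1000:].T))                   # after burn-in, correlation is close to 0.8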

EM on Latent Variable Models


Guess
Easy, just assign random values to parameters

E-Step: Evaluate likelihood.


We can use the JTA to evaluate the likelihood and to compute the marginals of the latent variables (the expected counts).

M-Step: Re-estimate parameters.


Either JTA potentials and marginals, OR sampling.

Today
EM as bound maximization
EM as a general approach to learning parameters for latent variables
Sampling

Next Time
Model Adaptation
Using labeled and unlabeled data to improve performance.

Model Adaptation Application


Speaker Recognition
UBM-MAP
