
e-Appendix C

The E-M Algorithm

The E-M algorithm is a convenient heuristic for likelihood maximization. The
E-M algorithm will never decrease the likelihood. Our discussion will focus
on mixture models, the GMM being a special case, even though the E-M
algorithm applies to more general settings.

Let P_k(x; θ_k) be a density for k = 1, ..., K, where θ_k are the parameters
specifying P_k. We will refer to each P_k as a bump. In the GMM setting,
all the P_k are Gaussians, and θ_k = {µ_k, Σ_k} (the mean vector and covariance
matrix for each bump). A mixture model is a weighted sum of these K bumps,

    P(x; Θ) = Σ_{k=1}^K w_k P_k(x; θ_k),

where the weights satisfy w_k ≥ 0 and Σ_{k=1}^K w_k = 1, and we have collected all
the parameters into a single grand parameter, Θ = {w_1, ..., w_K; θ_1, ..., θ_K}.
Intuitively, to generate a random point x, you first pick a bump according to
the probabilities w_1, ..., w_K. Suppose you pick bump k. You then generate a
random point from the bump density P_k.
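To make this generative view concrete, here is a minimal sketch (in Python with numpy; the function and variable names are illustrative, and Gaussian bumps are assumed as in the GMM) of sampling from such a mixture: pick a bump according to the weights, then draw a point from that bump.

import numpy as np

def sample_mixture(N, w, mus, Sigmas, seed=0):
    """Draw N points from a Gaussian mixture: pick bump k with probability w[k],
    then draw x from the Gaussian with mean mus[k] and covariance Sigmas[k]."""
    rng = np.random.default_rng(seed)
    j = rng.choice(len(w), size=N, p=w)    # hidden bump memberships j_1, ..., j_N
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in j])
    return X, j                            # (X, j) is the complete data; X alone is what we observe

# A toy 2-bump mixture in 2 dimensions
w = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
X, j = sample_mixture(500, w, mus, Sigmas)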

Given data X = x_1, ..., x_N generated independently, we wish to estimate
the parameters of the mixture which maximize the log-likelihood,

    ln P(X | Θ) = ln ∏_{n=1}^N P(x_n | Θ)
                = ln ∏_{n=1}^N Σ_{k=1}^K w_k P_k(x_n; θ_k)
                = Σ_{n=1}^N ln Σ_{k=1}^K w_k P_k(x_n; θ_k).        (C.1)
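As a concrete check on (C.1), the incomplete-data log-likelihood is straightforward to evaluate numerically. A minimal sketch (Python with numpy and scipy, Gaussian bumps assumed; the names are illustrative):

import numpy as np
from scipy.stats import multivariate_normal

def incomplete_log_likelihood(X, w, mus, Sigmas):
    """Evaluate (C.1): sum_n ln sum_k w_k P_k(x_n; theta_k) for Gaussian bumps."""
    # densities[n, k] = P_k(x_n; theta_k)
    densities = np.column_stack(
        [multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k]) for k in range(len(w))]
    )
    return float(np.sum(np.log(densities @ np.asarray(w))))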

In the first step of (C.1), P(X | Θ) is a product because the data are independent. Note that X is known and fixed. What is not known is which particular bump


was used to generate data point x_n. Denote by j_n ∈ {1, ..., K} the bump
that generated x_n (we say x_n is a 'member' of bump j_n). Collect all the bump
memberships into a set J = {j_1, ..., j_N}. If we knew which data belonged
to which bump, we could estimate each bump density's parameters separately,
using only the data belonging to that bump. We call (X, J) the complete
data. If we know the complete data, we can easily optimize the log-likelihood.
We call X the incomplete data. Though X is all we can measure, it is still
called the 'incomplete' data because it does not contain enough information
to easily determine the optimal parameters Θ that maximize the likelihood. Let's
see mathematically how knowing the complete data helps us.

To get the likelihood of the complete data, we need the joint probability
P[x_n, j_n | Θ]. Using Bayes' theorem,

    P[x_n, j_n | Θ] = P[j_n | Θ] P[x_n | j_n, Θ]
                    = w_{j_n} P_{j_n}(x_n; θ_{j_n}).

Since the data are independent,

    P(X, J | Θ) = ∏_{n=1}^N P[x_n, j_n | Θ]
                = ∏_{n=1}^N w_{j_n} P_{j_n}(x_n; θ_{j_n}).

Let N_k be the number of occurrences of bump k in J, and let X_k be those
data points corresponding to bump k, so X_k = {x_n ∈ X : j_n = k}. We
compute the log-likelihood for the complete data as follows:

    ln P(X, J | Θ) = Σ_{n=1}^N ln w_{j_n} + Σ_{n=1}^N ln P_{j_n}(x_n; θ_{j_n})
                   = Σ_{k=1}^K N_k ln w_k + Σ_{k=1}^K Σ_{x_n ∈ X_k} ln P_k(x_n; θ_k)
                   = Σ_{k=1}^K N_k ln w_k + Σ_{k=1}^K L_k(X_k; θ_k),        (C.2)

where we have defined L_k(X_k; θ_k) = Σ_{x_n ∈ X_k} ln P_k(x_n; θ_k).

There are two simplifications which occur in (C.2) from knowing the complete
data (X, J). The w_k (in the first term) are separated from the θ_k (in the second
term); and, the second term is the sum of K non-interacting log-likelihoods
L_k(X_k; θ_k), each corresponding to the data belonging to X_k and only involving bump
k's parameters θ_k. Each log-likelihood L_k can be optimized independently of
the others. For many choices of P_k, L_k(X_k; θ_k) can be optimized analytically,
even though the log-likelihood for the incomplete data in (C.1) is intractable.
The next exercise asks you to analytically maximize (C.2) for the GMM.


Exercise C.1

(a) Maximize the first term in (C.2) subject to Σ_k w_k = 1, and show that
    the optimal weights are w_k = N_k/N. [Hint: Lagrange multipliers.]

(b) For the GMM,

        P_k(x; µ_k, Σ_k) = (1 / ((2π)^{d/2} |Σ_k|^{1/2})) exp(−½ (x − µ_k)^T Σ_k^{−1} (x − µ_k)).

    Maximize L_k(X_k; µ_k, Σ_k) to obtain the optimal parameters:

        µ_k = (1/N_k) Σ_{x_n ∈ X_k} x_n;
        Σ_k = (1/N_k) Σ_{x_n ∈ X_k} (x_n − µ_k)(x_n − µ_k)^T.

    These are exactly the parameters you would expect: µ_k is the in-sample
    mean for the data belonging to bump k; similarly, Σ_k is the in-sample
    covariance matrix.

    [Hint: Set S_k = Σ_k^{−1} and optimize with respect to S_k. Also, from the
    Linear Algebra e-appendix, you may find these derivatives useful:

        ∂/∂S (z^T S z) = zz^T    and    ∂/∂S ln|S| = S^{−1}.]
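If the memberships J were known, the complete-data estimates of Exercise C.1 can be computed directly. A minimal numpy sketch (illustrative names; X is assumed to be an N×d array and j an array of memberships in {0, ..., K−1}):

import numpy as np

def complete_data_mle(X, j, K):
    """Exercise C.1: with known memberships, fit each bump separately.
    Returns w_k = N_k/N, the in-sample mean and the in-sample covariance of each bump."""
    N, d = X.shape
    w = np.zeros(K)
    mus = np.zeros((K, d))
    Sigmas = np.zeros((K, d, d))
    for k in range(K):
        Xk = X[j == k]                  # X_k: the data belonging to bump k
        Nk = len(Xk)
        w[k] = Nk / N                   # optimal weight w_k = N_k / N
        mus[k] = Xk.mean(axis=0)        # in-sample mean
        diff = Xk - mus[k]
        Sigmas[k] = diff.T @ diff / Nk  # in-sample (maximum likelihood) covariance
    return w, mus, Sigmas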

In reality, we do not have access to J, and hence it is called a 'hidden vari-
able'. So what can we do now? We need a heuristic to maximize the likelihood
in Equation (C.1). One approach is to guess J and maximize the resulting
complete likelihood in Equation (C.2). This almost works. Instead of maxi-
mizing the complete likelihood for a single guess, we consider an average of
the complete likelihood over all possible guesses. Specifically, we treat J as
an unknown random variable and maximize the expected value (with respect
to J) of the complete log-likelihood in Equation (C.2). This expected value
is as easy to maximize as the complete likelihood. The mathematical imple-
mentation of this idea will lead us to the E-M Algorithm, which stands for
Expectation-Maximization Algorithm. Let's start with a simpler example.

Example C.1. You have two opaque bags. Bag 1 has red and green balls,
with µ_1 being the fraction of red balls. Bag 2 has red and blue balls, with µ_2
being the fraction of red. You pick four balls in independent trials as follows.
First pick one of the bags at random, each with probability ½; then, pick a ball
at random from the bag. Here is the sample of four balls you got:
one green, one red and two blue. The task is to estimate µ_1 and µ_2. It would
be much easier if we knew which bag each ball came from.

Here is one way to reason. Half the balls will come from Bag 1 and the
other half from Bag 2. The blue balls come from Bag 2 (that's already Bag 2's
budget of balls), so the other two (the green and the red) should come from Bag 1.
Using in-sample estimates, µ̂_1 = ½ and µ̂_2 = 0. We can get these same estimates

using a maximum likelihood argument. The log-likelihood of the data is

    ln(1 − µ_1) + 2 ln(1 − µ_2) + ln(µ_1 + µ_2) − 4 ln 2.        (C.3)

The reader should explicitly maximize the above expression with respect to
µ_1, µ_2 ∈ [0, 1] to obtain the estimates µ̂_1 = ½ and µ̂_2 = 0. In our data set of four
balls, there is a red one, so it seems a little counter-intuitive that we would
estimate µ̂_2 = 0, for isn't there a positive probability that the red ball came
from Bag 2? Nevertheless, µ̂_2 = 0 is the estimate that maximizes the proba-
bility of generating the data. You are uneasy with this, and rightly so, because
we put all our eggs into this single 'point' estimate; a very unnatural thing
given that any point estimate has infinitesimal probability of being correct.
Nevertheless, that is the maximum likelihood method, and we are following it.

Here is another way to reason. 'Half' of each red ball came from Bag 1
and the other 'half' from Bag 2. So, µ̂_1 = ½ / (1 + ½) = 1/3, because ½ a red ball
came from Bag 1 out of a total of 1 + ½ balls. Similarly, µ̂_2 = ½ / (2 + ½) = 1/5.
This reasoning is wrong because it does not correctly use the knowledge that
the ball is red. For example, as we just reasoned, µ̂_1 = 1/3 and µ̂_2 = 1/5. But, if
these estimates are correct, and indeed µ̂_1 > µ̂_2, then a red ball is more likely
to come from Bag 1, so more than half of it should come from Bag 1. This
contradicts the original assumption that led us to these estimates.

Now, let's see how expectation-maximization solves this problem. The rea-
soning is similar to our false start above; it just adds iteration till consistency.
We begin by considering the two cases for the red ball. Either it is from Bag 1
or Bag 2. We can compute the log-likelihood for each of these two cases:

    ln(1 − µ_1) + ln(µ_1) + 2 ln(1 − µ_2) − 4 ln 2        (Bag 1);
    ln(1 − µ_1) + ln(µ_2) + 2 ln(1 − µ_2) − 4 ln 2        (Bag 2).

Suppose we have estimates µ̂_1 and µ̂_2. Using Bayes' theorem, we can compute
p_1 = P[Bag 1 | µ̂_1, µ̂_2] and p_2 = P[Bag 2 | µ̂_1, µ̂_2], the probabilities that the red
ball came from Bag 1 or Bag 2. The reader can verify that

    p_1 = µ̂_1 / (µ̂_1 + µ̂_2),        p_2 = µ̂_2 / (µ̂_1 + µ̂_2).

Now comes the expectation step. Compute the expected log-likelihood using
p_1 and p_2:

    E[log-likelihood] = ln(1 − µ_1) + p_1 ln(µ_1) + p_2 ln(µ_2) + 2 ln(1 − µ_2) − 4 ln 2.        (C.4)

Next comes the maximization step. Treating p_1, p_2 as constants, maximize
the expected log-likelihood with respect to µ_1, µ_2 and update µ̂_1, µ̂_2 to these
optimal values. Notice that the log-likelihood in (C.3) has an interaction
term ln(µ_1 + µ_2) which complicates the maximization. In the expected log-
likelihood (C.4), µ_1 and µ_2 are decoupled, and so the maximization can be
implemented separately for each variable. The reader can verify that maxi-
mizing the expected log-likelihood gives the updates:

    µ̂_1 ← p_1 / (1 + p_1) = µ̂_1 / (2µ̂_1 + µ̂_2)        and        µ̂_2 ← p_2 / (2 + p_2) = µ̂_2 / (2µ̂_1 + 3µ̂_2).


The full algorithm just iterates this update process with the new estimates.
Let's see what happens if we start (arbitrarily) with estimates µ̂_1 = µ̂_2 = ½:

    Iteration number |  0  |  1  |  2   |  3   |  4   |  5   |  6   |  7   | ... | 1000
    µ̂_1              | 1/2 | 1/3 | 0.38 | 0.41 | 0.43 | 0.45 | 0.45 | 0.46 | ... | 0.49975
    µ̂_2              | 1/2 | 1/5 | 0.16 | 0.13 | 0.10 | 0.09 | 0.07 | 0.07 | ... | 0.0005

The result of the first iteration (µ̂_1 = 1/3, µ̂_2 = 1/5) is exactly the
estimate from our earlier faulty reasoning: when µ̂_1 = µ̂_2, our faulty reasoning
matches the E-M step. If we continued this table, it is not hard to see what
will happen: µ̂_1 → ½ and µ̂_2 → 0.
Exercise C.2

When the E-M algorithm converges, it must be that

    µ̂_1 = µ̂_1 / (2µ̂_1 + µ̂_2)        and        µ̂_2 = µ̂_2 / (2µ̂_1 + 3µ̂_2).

Solve these consistency conditions, and report your estimates for µ̂_1, µ̂_2.

It's miraculous that by maximizing an expected log-likelihood using a guess for
the parameters, we end up converging to the true maximum likelihood solution.
Why is this useful? Because the maximizations for µ_1 and µ_2 are decoupled.
We trade a maximization of a complicated likelihood of the incomplete data
for a bunch of simpler maximizations that we iterate.
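The two-bag iteration is only a few lines of code. A minimal Python sketch (illustrative) that applies the updates µ̂_1 ← p_1/(1 + p_1) and µ̂_2 ← p_2/(2 + p_2) and reproduces the table above:

def em_two_bags(mu1=0.5, mu2=0.5, iterations=1000):
    """E-M for Example C.1 (one green, one red and two blue balls)."""
    for _ in range(iterations):
        p1 = mu1 / (mu1 + mu2)   # P[the red ball came from Bag 1]
        p2 = mu2 / (mu1 + mu2)   # P[the red ball came from Bag 2]
        mu1 = p1 / (1 + p1)      # maximizers of the expected log-likelihood (C.4)
        mu2 = p2 / (2 + p2)
    return mu1, mu2

print(em_two_bags(iterations=7))     # roughly (0.46, 0.07), as in the table
print(em_two_bags(iterations=1000))  # approaches the maximum likelihood solution (1/2, 0)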

C.1 Derivation of the E-M Algorithm

We now derive the E-M strategy and show that it will always improve the
likelihood of the incomplete data.

    Maximizing an expected log-likelihood of the complete data increases
    the likelihood of the incomplete data.

What is surprising is that to compute the expected log-likelihood, we use a
guess, since we don't know the best model. So it is really a guess for the
expected log-likelihood that one maximizes, and this increases the likelihood.

Let Θ′ be any set of parameters, and define P(J | X, Θ′) as the conditional
probability distribution for J given the data and assuming that Θ′ is the actual
probability model for the data. The probability P(J | X, Θ′) is well defined,
even if Θ′ is not the probability model which generated the data. We will see
how to compute P(J | X, Θ′) soon, but for the moment assume that it is always
positive (which means that every possible assignment of x_1, ..., x_N to bumps
j_1, ..., j_N has non-zero probability under Θ′). This will always be the case


unless some of the P_k have bounded support or Θ′ is some degenerate mixture.
The following derivation establishes a connection between the log-likelihood
for the incomplete data and the expectation (over J) of the log-likelihood for
the complete data.

    L(Θ) = ln P(X | Θ)
      (a) = ln Σ_J P(X, J | Θ)
      (b) = ln Σ_J P(J | X, Θ′) · P(X, J | Θ) / P(J | X, Θ′)
      (c) = ln Σ_J P(J | X, Θ′) · P(X, J | Θ) P(X | Θ′) / P(X, J | Θ′)
      (d) ≥ Σ_J P(J | X, Θ′) ln [ P(X, J | Θ) P(X | Θ′) / P(X, J | Θ′) ]
      (e) = L(Θ′) + E_{J | X, Θ′}[ln P(X, J | Θ)] − E_{J | X, Θ′}[ln P(X, J | Θ′)]
      (f) = L(Θ′) + Q(Θ | X, Θ′) − Q(Θ′ | X, Θ′)        (C.5)

(a) follows by the law of total probability; (b) is justified because P(J | X, Θ′) is
positive; (c) follows from Bayes' theorem; (d) follows because the summation
Σ_J (·) P(J | X, Θ′) is an expectation using the probabilities P(J | X, Θ′) and,
by Jensen's inequality, ln E[·] ≥ E[ln(·)] because ln(x) is concave; (e) follows
because ln P(X | Θ′) is independent of J and so

    Σ_J P(J | X, Θ′) ln P(X | Θ′) = ln P(X | Θ′) Σ_J P(J | X, Θ′) = ln P(X | Θ′) · 1;

finally, in (f), we have defined the function

    Q(Θ | X, Θ′) = E_{J | X, Θ′}[ln P(X, J | Θ)].

The function Q is a function of Θ, though its definition depends on the distribu-
tion of J, which in turn depends on the incomplete data X and the model Θ′.
We have proved the following result.

Theorem C.2. If Q(Θ | X, Θ′) > Q(Θ′ | X, Θ′), then L(Θ) > L(Θ′).

In words, fix Θ′ and compute the 'posterior' distribution of J conditioned on
the data X with parameters Θ′. Now, for the parameters Θ, compute the
expected log-likelihood of the complete data (X, J), where the expectation is
taken with respect to this posterior distribution for J that we just obtained.
This distribution for J is fixed, depending on X, Θ′, but it does not depend
on Θ. Find the Θ that maximizes this expected log-likelihood, and you are
guaranteed to improve the actual likelihood. This theorem leads naturally to
the E-M algorithm that follows.


E-M Algorithm
1: Initialize Θ_0 at t = 0.
2: At step t, let the parameter estimate be Θ_t.
3: [Expectation] For X, Θ_t, compute the function Q_t(Θ):

       Q_t(Θ) = E_{J | X, Θ_t}[ln P(X, J | Θ)],

   which is the expected log-likelihood for the complete data.
4: [Maximization] Update Θ to maximize Q_t(Θ):

       Θ_{t+1} = argmax_Θ Q_t(Θ).

5: Increment t ← t + 1 and repeat steps 3, 4 till convergence.
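In code, the algorithm is just a loop that alternates these two steps. A schematic Python sketch (e_step and m_step are placeholders for the model-specific computations worked out below; the names are illustrative):

def em(theta0, e_step, m_step, iterations=100):
    """Generic E-M loop. e_step(theta) summarizes P(J | X, theta_t) (e.g. responsibilities);
    m_step(stats) returns argmax_Theta of the expected complete-data log-likelihood Q_t."""
    theta = theta0
    for _ in range(iterations):
        stats = e_step(theta)    # [Expectation]
        theta = m_step(stats)    # [Maximization]
    return theta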

In the algorithm, we need to compute Q_t(Θ), which amounts to computing an
expectation with respect to P(J | X, Θ_t). We illustrate this process with our
mixture model.

Recall that J is the vector of bump memberships. Since the data are
independent, to compute P(J | X, Θ_t) we can compute this 'posterior' for each
data point and then take the product. We need γ_nk = P(j_n = k | x_n, Θ_t), the
probability that data point x_n came from bump k. By Bayes' theorem,

    γ_nk = P(j_n = k | x_n, Θ_t) = P(x_n, k | Θ_t) / P(x_n | Θ_t)
         = ŵ_k P_k(x_n; θ̂_k) / Σ_{ℓ=1}^K ŵ_ℓ P_ℓ(x_n; θ̂_ℓ),

where Θ_t = {ŵ_1, ..., ŵ_K; θ̂_1, ..., θ̂_K} and P(J | X, Θ_t) = ∏_{n=1}^N γ_{n j_n}. We can
now compute Q_t(Θ),

    Q_t(Θ) = E_J[ln P(X, J | Θ)],

where the expectation is with respect to the ('fictitious') probabilities γ_nk
that determine the distribution of the random variable J. These probabilities
depend on X and Θ_t. Let N̂_k (a random variable) be the number of occurrences
of bump k in the random variable J; similarly, let X̂_k be the random set
containing the data points of bump k. From Equation (C.2),

    Q_t(Θ) = Σ_{k=1}^K E_J[N̂_k] ln w_k + Σ_{k=1}^K E_J[ Σ_{x_n ∈ X̂_k} ln P_k(x_n; θ_k) ]
         (a) = Σ_{k=1}^K E_J[ Σ_{n=1}^N z_nk ] ln w_k + Σ_{k=1}^K E_J[ Σ_{n=1}^N z_nk ln P_k(x_n; θ_k) ]
         (b) = Σ_{k=1}^K N_k ln w_k + Σ_{k=1}^K Σ_{n=1}^N γ_nk ln P_k(x_n; θ_k),


where Θ = {w_1, ..., w_K; θ_1, ..., θ_K}. In (a), we introduced an indicator ran-
dom variable z_nk = [x_n ∈ X̂_k], which is 1 if x_n is from bump k and 0 otherwise;
in (b), we defined

    N_k = E_J[N̂_k] = E_J[ Σ_{n=1}^N z_nk ] = Σ_{n=1}^N γ_nk,

where we used E[z_nk] = γ_nk. Now that we have an explicit functional form
for Q_t(Θ), we can perform the maximization step. Observe that the bump
parameters θ_k ∈ Θ occur in independent terms, and so can be opti-
mized separately. As for the first term, observe that

    (1/N) Σ_{k=1}^K N_k ln w_k = Σ_{k=1}^K (N_k/N) ln( w_k / (N_k/N) ) + Σ_{k=1}^K (N_k/N) ln(N_k/N)
                               ≤ Σ_{k=1}^K (N_k/N) ln(N_k/N),

where the last inequality follows from Jensen's inequality and the concavity of
the logarithm, which implies:

    Σ_{k=1}^K (N_k/N) ln( w_k / (N_k/N) ) ≤ ln( Σ_{k=1}^K (N_k/N) · w_k / (N_k/N) ) = ln Σ_{k=1}^K w_k = ln(1) = 0.

Equality holds when w_k = N_k/N. Maximizing Q_t(w_1, ..., w_K, θ_1, ..., θ_K),
therefore, gives the following updates:

    w_k = N_k/N = (1/N) Σ_{n=1}^N γ_nk;        (C.6)

    θ_k = argmax_{θ_k} Σ_{n=1}^N γ_nk ln P_k(x_n; θ_k).        (C.7)

The update to get w_k can be viewed as follows. A fraction γ_nk of data point x_n
belongs to bump k. Thus, N_k = Σ_n γ_nk is the total number of data points
belonging to bump k. Similarly, the update to get θ_k can be viewed as follows.
The 'likelihood' for the parameter θ_k of bump k is just the weighted sum of the
likelihoods of each point, weighted by the fraction of the point that belongs to
bump k. This intuitive interpretation of the update is another reason for the
popularity of the E-M algorithm.

The E-M algorithm for mixture density estimation is remarkably simple
once we get past the machinery used to set it up: we have an analytic update
for the weights w_k and K separate optimizations for each bump parameter θ_k.
The miracle is that these simple updates are guaranteed to improve the log-
likelihood (from Theorem C.2). There are other ways to maximize the like-
lihood, for example using gradient and Hessian based iterative optimization
techniques. However, the E-M algorithm is simpler and works well in practice.


Example. Let's derive the E-M update for the GMM with current parameter
estimate Θ_t = {ŵ_1, ..., ŵ_K; µ̂_1, ..., µ̂_K; Σ̂_1, ..., Σ̂_K}. Let Ŝ_k = Σ̂_k^{−1}.
The posterior bump probabilities γ_nk for the parameters Θ_t are:

    γ_nk = ŵ_k P(x_n | µ̂_k, Ŝ_k) / Σ_{ℓ=1}^K ŵ_ℓ P(x_n | µ̂_ℓ, Ŝ_ℓ),

where P(x | µ, S) = (2π)^{−d/2} |S|^{1/2} exp(−½ (x − µ)^T S (x − µ)). The w_k update
is immediate from (C.6), which matches Equation (6.9). Since

    ln P(x | µ, S) = −½ (x − µ)^T S (x − µ) + ½ ln|S| − (d/2) ln(2π),

to get the µ_k and S_k updates using (C.7), we need to minimize

    Σ_{n=1}^N γ_nk ( (x_n − µ_k)^T S_k (x_n − µ_k) − ln|S_k| ).

Setting the derivative with respect to µ_k to 0 gives

    −2 S_k Σ_{n=1}^N γ_nk (x_n − µ_k) = 0.

Since S_k is invertible, Σ_{n=1}^N γ_nk (x_n − µ_k) = 0, and since N_k = Σ_{n=1}^N γ_nk,
we obtain the update in (6.9),

    µ_k = (1/N_k) Σ_{n=1}^N γ_nk x_n.

To take the derivative with respect to S_k, we use the identities in the hint of
Exercise C.1(b). Setting the derivative with respect to S_k to zero gives

    Σ_{n=1}^N γ_nk (x_n − µ_k)(x_n − µ_k)^T − S_k^{−1} Σ_{n=1}^N γ_nk = 0,

or

    S_k^{−1} = (1/N_k) Σ_{n=1}^N γ_nk (x_n − µ_k)(x_n − µ_k)^T.

Since S_k^{−1} = Σ_k, we recover the update in (6.9).
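Putting the two steps together for the GMM, here is a compact numpy sketch of one full E-M iteration (illustrative names; X is assumed to be an N×d array; a practical implementation would work in log-space and add convergence checks and safeguards against collapsing bumps):

import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate Gaussian density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_step_gmm(X, w, mus, Sigmas):
    """One E-M iteration for a GMM: E-step responsibilities, then updates (C.6) and (C.7)."""
    N, K = X.shape[0], len(w)
    # E-step: gamma[n, k] = P(j_n = k | x_n, Theta_t)
    gamma = np.column_stack([w[k] * gaussian_pdf(X, mus[k], Sigmas[k]) for k in range(K)])
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step
    Nk = gamma.sum(axis=0)                    # N_k = sum_n gamma_nk
    w_new = Nk / N                            # weight update (C.6)
    mus_new = (gamma.T @ X) / Nk[:, None]     # weighted means
    Sigmas_new = np.zeros((K, X.shape[1], X.shape[1]))
    for k in range(K):
        diff = X - mus_new[k]
        Sigmas_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariances
    return w_new, mus_new, Sigmas_new

Iterating em_step_gmm and tracking the incomplete-data log-likelihood (C.1) is a useful sanity check: by Theorem C.2, it should never decrease.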

Commentary. The E-M algorithm is a remarkable example of a recurring
theme in learning. We want to learn a model Θ that explains the data. We
start with a guess Θ̂ that is wrong. We use this wrong model to estimate
some other quantities of the world (the bump memberships in our example).
We now learn a new model which is better than the old model at explain-
ing the combined data plus the inaccurate estimates of the other quantities.
Miraculously, by doing this, we bootstrap ourselves up to a better model, one
that is better at explaining the data. This theme reappears in Reinforcement
Learning as well. If we didn't know better, it seems like a free lunch.


C.2 Problems

Problem C.1  Consider the general case of Example C.1. The sample
has N balls, with N_g green, N_b blue and N_r red (N_g + N_b + N_r = N). Show
that the log-likelihood of the incomplete data is

    N_g ln(1 − µ_1) + N_b ln(1 − µ_2) + N_r ln(µ_1 + µ_2) − N ln 2.        (C.8)

What are the maximum likelihood estimates for µ_1, µ_2? (Be careful with the
cases N_g > N/2 and N_b > N/2.)

Problem C.2  Consider the general case of Example C.1 as in Problem C.1,
with N_g green, N_b blue and N_r red balls.

(a) Suppose your starting estimates are µ̂_1, µ̂_2. For a red ball, what are
    p_1 = P[Bag 1 | µ̂_1, µ̂_2] and p_2 = P[Bag 2 | µ̂_1, µ̂_2]?

(b) Let N_{r1} and N_{r2} (with N_r = N_{r1} + N_{r2}) be the number of red balls from
    Bags 1 and 2 respectively. Show that the log-likelihood of the complete
    data is

        N_g ln(1 − µ_1) + N_b ln(1 − µ_2) + N_{r1} ln(µ_1) + N_{r2} ln(µ_2) − N ln 2.

(c) Compute the function Q_t(µ_1, µ_2) by taking the expectation of the log-
    likelihood of the complete data. Show that

        Q_t(µ_1, µ_2) = N_g ln(1 − µ_1) + N_b ln(1 − µ_2) + p_1 N_r ln(µ_1) + p_2 N_r ln(µ_2) − N ln 2.

(d) Maximize Q_t(µ_1, µ_2) to obtain the E-M update.

(e) Show that repeated E-M iteration will ultimately converge to

        µ̂_1 = (N − 2N_g)/N        and        µ̂_2 = (N − 2N_b)/N.

Problem C.3  A sequence of N balls, X = x_1, ..., x_N, is drawn iid as
follows. There are 2 bags. Bag 1 contains only red balls and Bag 2 contains
red and blue balls. A fraction π in this second bag are red. A bag is picked
randomly with probability ½ and one of the balls is picked randomly from that
bag; x_n = 1 if ball n is red and 0 if it is blue. You are given N and the number
of red balls N_r = Σ_{n=1}^N x_n.

(a) (i) Show that the likelihood P[X | π, N] is

            P[X | π, N] = ∏_{n=1}^N ((1 + π)/2)^{x_n} ((1 − π)/2)^{1 − x_n}.

        Maximize to obtain an estimate for π (be careful with N_r < N/2).


    (ii) For N_r = 600 and N = 1000, what is your estimate of π?

(b) Maximizing the likelihood is tractable for this simple problem. Now de-
    velop an E-M iterative approach.

    (i) What is an appropriate hidden/unmeasured variable J = j_1, ..., j_N?

    (ii) Give a formula for the likelihood for the full data, P[X, J | π, N].

    (iii) If at step t your estimate is π_t, for the expectation step, compute
          Q_t(π) = E_{J | X, π_t}[ln P[X, J | π, N]] and show that

              Q_t(π) = (π_t / (1 + π_t)) N_r ln π + (N − N_r) ln(1 − π),

          and hence show that the E-M update is given by

              π_{t+1} = π_t N_r / (π_t N_r + (1 + π_t)(N − N_r)).

          What are the limit points when N_r ≥ N/2 and N_r < N/2?

    (iv) Plot π_t versus t, starting from π_0 = 0.9 and π_0 = 0.2, with N_r =
         600, N = 1000.

    (v) The values of the hidden variables can often be useful. After con-
        vergence of the E-M, how could you get estimates of the hidden
        variables?

Problem C.4  [E-M for Supervised Learning] We wish to learn a
function f(x) which predicts the temperature as a function of the time x of
the day, x ∈ [0, 1]. We believe that the temperature has a linear dependence
on time, so we model f with the linear hypotheses h(x) = w_0 + w_1 x.
We have N data points y_1, ..., y_N, the temperature measurements on different
(independent) days, where y_n = f(x_n) + ε_n and ε_n ~ N(0, 1) is iid zero mean
Gaussian noise with unit variance. The problem is that we do not know the time
x_n at which these measurements were made. Assume that each temperature
measurement was taken at some random time in the day, chosen uniformly on
[0, 1].

(a) Show that the log-likelihood for the weights w is

        ln P[y | w] = Σ_{n=1}^N ln ∫_0^1 dx (1/√(2π)) exp(−½ (y_n − w_0 − w_1 x)²).

(b) What is the natural hidden variable J = j_1, ..., j_N?

(c) Compute the log-likelihood for the complete data, ln P[y, J | w].

(d) Let γ_n(x | w) = P(x_n = x | y_n, w). Show that, for x ∈ [0, 1],

        γ_n(x | w) = exp(−½ (y_n − w_0 − w_1 x)²) / ∫_0^1 dx exp(−½ (y_n − w_0 − w_1 x)²).

    Hence, compute Q_t(w).


(e) Let α_n = E_{γ_n(x|w(t))}[x] and β_n = E_{γ_n(x|w(t))}[x²] (expectations taken with
    respect to the distribution γ_n(x | w(t))). Show that the E-M updates are

        w_1(t+1) = [ (1/N) Σ_{i=1}^N (y_i − ȳ)(α_i − ᾱ) ] / ( β̄ − ᾱ² );
        w_0(t+1) = ȳ − w_1(t+1) ᾱ;

    where (·̄) denotes averaging (e.g. ᾱ = (1/N) Σ_{n=1}^N α_n) and w(t) are the
    weights at iteration t.

(f) What happens if the temperature measurement is not at a uniformly
    random time, but at a time distributed according to an unknown P(x)?
    You have to maintain an estimate P_t(x). Now, show that

        γ_n(x | w_t) = P_t(x) exp(−½ (y_n − w_0 − w_1 x)²) / ∫_0^1 dx P_t(x) exp(−½ (y_n − w_0 − w_1 x)²),

    and that the updates in (e) are unchanged, except that they use this new
    γ_n(x | w_t). Show that the update to P_t is given by

        P_{t+1}(x) = (1/N) Σ_{n=1}^N γ_n(x | w_t).

    What happens if you tried to maximize the log-likelihood for the incom-
    plete data, instead of using the E-M approach?

© Yaser Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin: Jan-2015. All rights reserved. No commercial use or re-distribution in any format.