e-Appendix C
The EM Algorithm
The EM algorithm is a convenient heuristic for likelihood maximization. The
EM algorithm will never decrease the likelihood. Our discussion will focus
on mixture models, the GMM being a special case, even though the EM
algorithm applies to more general settings.
Let P_k(x; θ_k) be a density for k = 1, . . . , K, where θ_k are the parameters
specifying P_k. We will refer to each P_k as a bump. In the GMM setting,
all the P_k are Gaussians, and θ_k = {µ_k, Σ_k} (the mean vector and covariance
matrix for each bump). A mixture model is a weighted sum of these K bumps,

    P(x; Θ) = ∑_{k=1}^{K} w_k P_k(x; θ_k),

where the weights satisfy w_k ≥ 0 and ∑_{k=1}^{K} w_k = 1, and we have collected all
the parameters into a single grand parameter, Θ = {w_1, . . . , w_K; θ_1, . . . , θ_K}.
Intuitively, to generate a random point x, you first pick a bump according to
the probabilities w_1, . . . , w_K. Suppose you pick bump k. You then generate a
random point from the bump density P_k.
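To make the generative process concrete, here is a minimal sketch in Python with numpy for a one-dimensional GMM; the weights, means, and spreads are illustrative values of our choosing, not anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D GMM: weights w_k, means mu_k, standard deviations sigma_k.
w = np.array([0.5, 0.3, 0.2])      # w_k >= 0 and sum to 1
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

def sample_mixture(n):
    # First pick a bump k according to the probabilities w_1, ..., w_K ...
    k = rng.choice(len(w), size=n, p=w)
    # ... then generate a random point from the chosen bump's density P_k.
    return rng.normal(mu[k], sigma[k])

X = sample_mixture(1000)
```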
Given data X = x_1, . . . , x_N generated independently, we wish to estimate
the parameters of the mixture which maximize the log-likelihood,

    ln P(X | Θ) = ln ∏_{n=1}^{N} P(x_n | Θ)
                = ln ∏_{n=1}^{N} ∑_{k=1}^{K} w_k P_k(x_n; θ_k)
                = ∑_{n=1}^{N} ln ∑_{k=1}^{K} w_k P_k(x_n; θ_k).   (C.1)
In the first step above, P(X | Θ) is a product because the data are independent. Note that X is known and fixed. What is not known is which particular bump
was used to generate data point x_n. Denote by j_n ∈ {1, . . . , K} the bump
that generated x_n (we say x_n is a 'member' of bump j_n). Collect all bump
memberships into a set J = {j_1, . . . , j_N}. If we knew which data belonged
to which bump, we could estimate each bump density's parameters separately,
using only the data belonging to that bump. We call (X, J) the complete
data. If we know the complete data, we can easily optimize the log-likelihood.
We call X the incomplete data. Though X is all we can measure, it is still
called the 'incomplete' data because it does not contain enough information
to easily determine the optimal parameters Θ. Let's
see mathematically how knowing the complete data helps us.
To get the likelihood of the complete data, we need the joint probability
P[x_n, j_n | Θ]. Using Bayes' theorem,

    P[x_n, j_n | Θ] = P[j_n | Θ] P[x_n | j_n, Θ] = w_{j_n} P_{j_n}(x_n; θ_{j_n}).

Since the data are independent,

    P(X, J | Θ) = ∏_{n=1}^{N} P[x_n, j_n | Θ] = ∏_{n=1}^{N} w_{j_n} P_{j_n}(x_n; θ_{j_n}).
Let N_k be the number of occurrences of bump k in J, and let X_k be those
data points corresponding to bump k, so X_k = {x_n ∈ X : j_n = k}. We
compute the log-likelihood for the complete data as follows:

    ln P(X, J | Θ) = ∑_{n=1}^{N} ln w_{j_n} + ∑_{n=1}^{N} ln P_{j_n}(x_n; θ_{j_n})
                   = ∑_{k=1}^{K} N_k ln w_k + ∑_{k=1}^{K} ∑_{x_n ∈ X_k} ln P_k(x_n; θ_k)
                   = ∑_{k=1}^{K} N_k ln w_k + ∑_{k=1}^{K} L_k(X_k; θ_k),   (C.2)

where L_k(X_k; θ_k) = ∑_{x_n ∈ X_k} ln P_k(x_n; θ_k).
There are two simplifications which occur in (C.2) from knowing the complete
data (X, J). The w_k (in the first term) are separated from the θ_k (in the second
term); and, the second term is the sum of K non-interacting log-likelihoods
L_k(X_k; θ_k), each corresponding to the data belonging to X_k and only involving bump
k's parameters θ_k. Each log-likelihood L_k can be optimized independently of
the others. For many choices of P_k, L_k(X_k; θ_k) can be optimized analytically, even though the log-likelihood for the incomplete data in (C.1) is intractable. The next exercise asks you to analytically maximize (C.2) for the GMM.
Exercise C.1. Maximize (C.2) analytically for the GMM. [Hint for part (b): use ∂/∂S (zᵗ S z) = zzᵗ and ∂/∂S ln |S| = S⁻¹.]
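As a quick numerical sanity check on the decomposition in (C.2), the following sketch (Python, using scipy's Gaussian log-density with made-up one-dimensional bumps and memberships) confirms that the per-point sum and the grouped form agree:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
w = np.array([0.6, 0.4])                 # bump weights w_k (illustrative)
mu = np.array([0.0, 4.0])                # bump means
sd = np.array([1.0, 1.5])                # bump standard deviations
j = rng.choice(2, size=50, p=w)          # bump memberships J (known here)
x = rng.normal(mu[j], sd[j])             # the complete data (X, J)

# Per-point form: sum_n [ ln w_{j_n} + ln P_{j_n}(x_n; theta_{j_n}) ]
per_point = np.sum(np.log(w[j]) + norm.logpdf(x, mu[j], sd[j]))

# Grouped form (C.2): sum_k N_k ln w_k + sum_k L_k(X_k; theta_k)
grouped = sum((j == k).sum() * np.log(w[k])
              + norm.logpdf(x[j == k], mu[k], sd[k]).sum()
              for k in range(2))

assert np.isclose(per_point, grouped)    # the two forms agree
```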
In reality, we do not have access to J, and hence it is called a 'hidden variable'.
So what can we do now? We need a heuristic to maximize the likelihood
in Equation (C.1). One approach is to guess J and maximize the resulting
complete likelihood in Equation (C.2). This almost works. Instead of maximizing
the complete likelihood for a single guess, we consider an average of
the complete likelihood over all possible guesses. Specifically, we treat J as
an unknown random variable and maximize the expected value (with respect
to J) of the complete log-likelihood in Equation (C.2). This expected value
is as easy to maximize as the complete likelihood. The mathematical implementation
of this idea will lead us to the EM Algorithm, which stands for
Expectation-Maximization Algorithm. Let's start with a simpler example.
Example C.1. You have two opaque bags. Bag 1 has red and green balls,
with µ_1 being the fraction of red balls. Bag 2 has red and blue balls, with µ_2
being the fraction of red. You pick four balls in independent trials as follows.
First pick one of the bags at random, each with probability 1/2; then, pick a ball
at random from the bag. Here is the sample of four balls you got:
one green, one red and two blue. The task is to estimate µ_1 and µ_2. It would
be much easier if we knew which bag each ball came from.

Here is one way to reason. Half the balls will come from Bag 1 and the
other half from Bag 2. The blue balls come from Bag 2 (that's already Bag 2's budget of balls), so the other two, the green and the red, should come from Bag 1. Using in-sample estimates, µ̂_1 = 1/2 and µ̂_2 = 0. We can get these same estimates
using a maximum likelihood argument. The log-likelihood of the data is

    ln(1 − µ_1) + 2 ln(1 − µ_2) + ln(µ_1 + µ_2) − 4 ln 2.   (C.3)
The reader should explicitly maximize the above expression with respect to
µ_1, µ_2 ∈ [0, 1] to obtain the estimates µ̂_1 = 1/2 and µ̂_2 = 0. In our data set of four
balls, there is a red one, so it seems a little counter-intuitive that we would
estimate µ̂_2 = 0, for isn't there a positive probability that the red ball came
from Bag 2? Nevertheless, µ̂_2 = 0 is the estimate that maximizes the probability
of generating the data. You are uneasy with this, and rightly so, because
we put all our eggs into this single 'point' estimate; a very unnatural thing
given that any point estimate has infinitesimal probability of being correct.
Nevertheless, that is the maximum likelihood method, and we are following it.

Here is another way to reason. 'Half' of each red ball came from Bag 1
and the other 'half' from Bag 2. So, µ̂_1 = (1/2)/(1 + 1/2) = 1/3, because 1/2 a red ball
came from Bag 1 out of a total of 1 + 1/2 balls. Similarly, µ̂_2 = (1/2)/(2 + 1/2) = 1/5.
This reasoning is wrong because it does not correctly use the knowledge that
the ball is red. For example, as we just reasoned, µ̂_1 = 1/3 and µ̂_2 = 1/5. But, if
these estimates are correct, and indeed µ̂_1 > µ̂_2, then a red ball is more likely
to come from Bag 1, so more than half of it should come from Bag 1. This
contradicts the original assumption that led us to these estimates.

Now, let's see how expectation-maximization solves this problem. The reasoning
is similar to our false start above; it just adds iteration till consistency.
We begin by considering the two cases for the red ball. Either it is from Bag 1
or Bag 2. We can compute the log-likelihood for each of these two cases:
    ln(1 − µ_1) + ln(µ_1) + 2 ln(1 − µ_2) − 4 ln 2   (Bag 1);
    ln(1 − µ_1) + ln(µ_2) + 2 ln(1 − µ_2) − 4 ln 2   (Bag 2).
Suppose we have estimates µ̂_1 and µ̂_2. Using Bayes' theorem, we can compute
p_1 = P[Bag 1 | µ̂_1, µ̂_2] and p_2 = P[Bag 2 | µ̂_1, µ̂_2]. The reader can verify that

    p_1 = µ̂_1/(µ̂_1 + µ̂_2),   p_2 = µ̂_2/(µ̂_1 + µ̂_2).
Now comes the expectation step. Compute the expected log-likelihood using
p_1 and p_2:

    E[log-likelihood] = ln(1 − µ_1) + p_1 ln(µ_1) + p_2 ln(µ_2) + 2 ln(1 − µ_2) − 4 ln 2.   (C.4)
Next comes the maximization step. Treating p_1, p_2 as constants, maximize
the expected log-likelihood with respect to µ_1, µ_2 and update µ̂_1, µ̂_2 to these
optimal values. Notice that the log-likelihood in (C.3) has an interaction
term ln(µ_1 + µ_2) which complicates the maximization. In the expected log-likelihood
(C.4), µ_1 and µ_2 are decoupled, and so the maximization can be
implemented separately for each variable. The reader can verify that maximizing
the expected log-likelihood gives the updates:

    µ̂_1 ← p_1/(1 + p_1) = µ̂_1/(2µ̂_1 + µ̂_2)   and   µ̂_2 ← p_2/(2 + p_2) = µ̂_2/(2µ̂_1 + 3µ̂_2).
The full algorithm just iterates this update process with the new estimates.
Let's see what happens if we start (arbitrarily) with estimates µ̂_1 = µ̂_2 = 1/2.
The result of the first iteration, µ̂_1 = 1/3 and µ̂_2 = 1/5, is exactly the
estimate from our earlier faulty reasoning: when µ̂_1 = µ̂_2, our faulty reasoning
matches the EM step. If we continued iterating, the estimates would converge
to µ̂_1 = 1/2 and µ̂_2 = 0.
It's miraculous that by maximizing an expected log-likelihood using a guess for
the parameters, we end up converging to the true maximum likelihood solution.
Why is this useful? Because the maximizations for µ_1 and µ_2 are decoupled.
We trade a maximization of a complicated likelihood of the incomplete data
for a bunch of simpler maximizations that we iterate.
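A minimal sketch of this iteration in Python, using the two updates derived above, reproduces this behavior:

```python
def em_two_bags(mu1=0.5, mu2=0.5, iters=100):
    # Iterate the coupled updates derived for Example C.1.
    for _ in range(iters):
        mu1, mu2 = mu1 / (2*mu1 + mu2), mu2 / (2*mu1 + 3*mu2)
    return mu1, mu2

print(em_two_bags(iters=1))    # (1/3, 1/5): the 'faulty reasoning' estimates
print(em_two_bags(iters=100))  # slowly approaches (1/2, 0), the ML solution
```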
C.1 Derivation of the EM Algorithm
We now derive the EM strategy and show that it will always improve the
likelihood of the incomplete data.

    Maximizing an expected log-likelihood of the complete data increases
    the likelihood of the incomplete data.

What is surprising is that to compute the expected log-likelihood, we use a
guess, since we don't know the best model. So it is really a guess for the
expected log-likelihood that one maximizes, and this increases the likelihood.
Let Θ′ be any set of parameters, and define P(J | X, Θ′) as the conditional
probability distribution for J given the data and assuming that Θ′ is the actual
probability model for the data. The probability P(J | X, Θ′) is well defined,
even if Θ′ is not the probability model which generated the data. We will see
how to compute P(J | X, Θ′) soon, but for the moment assume that it is always
positive (which means that every possible assignment of x_1, . . . , x_N to bumps
j_1, . . . , j_N has non-zero probability under Θ′). This will always be the case
unless some of the P_k have bounded support or Θ′ is some degenerate mixture. The following derivation establishes a connection between the log-likelihood
for the incomplete data and the expectation (over J) of the log-likelihood for
the complete data.
    L(Θ)  =  ln P(X | Θ)
     (a)  =  ln ∑_J P(X, J | Θ)
     (b)  =  ln ∑_J P(J | X, Θ′) · P(X, J | Θ) / P(J | X, Θ′)
     (c)  =  ln ∑_J P(J | X, Θ′) · P(X, J | Θ) P(X | Θ′) / P(X, J | Θ′)
     (d)  ≥  ∑_J P(J | X, Θ′) ln [ P(X, J | Θ) P(X | Θ′) / P(X, J | Θ′) ]
     (e)  =  L(Θ′) + E_{J|X,Θ′}[ln P(X, J | Θ)] − E_{J|X,Θ′}[ln P(X, J | Θ′)]
     (f)  =  L(Θ′) + Q(Θ | X, Θ′) − Q(Θ′ | X, Θ′).   (C.5)
(a) follows by the law of total probability; (b) is justified because P(J | X, Θ′) is
positive; (c) follows from Bayes' theorem; (d) follows because the summation
∑_J (·) P(J | X, Θ′) is an expectation using the probabilities P(J | X, Θ′), and
by Jensen's inequality ln E[·] ≥ E[ln(·)] because ln(x) is concave; (e) follows
because ln P(X | Θ′) is independent of J, and so

    ∑_J P(J | X, Θ′) ln P(X | Θ′) = ln P(X | Θ′) ∑_J P(J | X, Θ′) = ln P(X | Θ′) · 1;
finally, in (f), we have defined the function

    Q(Θ | X, Θ′) = E_{J|X,Θ′}[ln P(X, J | Θ)].

The function Q is a function of Θ, though its definition depends on the distribution
of J, which in turn depends on the incomplete data X and the model Θ′.
We have proved the following result.
Theorem C.2. If Q(Θ | X, Θ′) > Q(Θ′ | X, Θ′), then L(Θ) > L(Θ′).
In words, fix Θ′ and compute the 'posterior' distribution of J conditioned on
the data X with parameters Θ′. Now, for the parameters Θ, compute the
expected log-likelihood of the complete data (X, J), where the expectation is
taken with respect to this posterior distribution for J that we just obtained.
This distribution for J is fixed, depending on X, Θ′, but it does not depend
on Θ. Find Θ* that maximizes this expected log-likelihood, and you are guaranteed to improve the actual likelihood. This theorem leads naturally to the EM algorithm that follows.
EM Algorithm
1: Initialize Θ_0 at t = 0.
2: At step t let the parameter estimate be Θ_t.
3: [Expectation] For X, Θ_t, compute the function Q_t(Θ):

       Q_t(Θ) = E_{J|X,Θ_t}[ln P(X, J | Θ)],

   which is the expected log-likelihood for the complete data.
4: [Maximization] Update Θ to maximize Q_t(Θ):

       Θ_{t+1} = argmax_Θ Q_t(Θ).

5: Increment t → t + 1 and repeat steps 3, 4 till convergence.
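In code, the algorithm is a single loop alternating the two steps. Below is a minimal generic sketch in Python; e_step and m_step are hypothetical problem-specific callables (made concrete for mixture models in what follows), and the parameters are assumed packed into a numpy array so that convergence can be tested numerically.

```python
import numpy as np

def em(X, theta0, e_step, m_step, max_iters=100, tol=1e-8):
    # Generic EM loop. e_step(X, theta) returns the posterior P(J | X, theta_t),
    # which defines Q_t; m_step(X, posterior) returns argmax_theta Q_t(theta).
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        posterior = e_step(X, theta)        # [Expectation] step 3
        theta_new = m_step(X, posterior)    # [Maximization] step 4
        if np.allclose(theta, theta_new, atol=tol):
            break                           # estimates stopped moving
        theta = theta_new
    return theta
```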
In the algorithm, we need to compute Q_t(Θ), which amounts to computing an
expectation with respect to P(J | X, Θ_t). We illustrate this process with our
mixture model.

Recall that J is the vector of bump memberships. Since the data are
independent, to compute P(J | X, Θ_t) we can compute this 'posterior' for each
data point and then take the product. We need γ_nk = P(j_n = k | x_n, Θ_t), the
probability that data point x_n came from bump k. By Bayes' theorem,

    γ_nk = P(j_n = k | x_n, Θ_t) = P(x_n, k | Θ_t)/P(x_n | Θ_t)
         = ŵ_k P_k(x_n | θ̂_k) / ∑_{ℓ=1}^{K} ŵ_ℓ P_ℓ(x_n | θ̂_ℓ),

where Θ_t = {ŵ_1, . . . , ŵ_K; θ̂_1, . . . , θ̂_K} and P(J | X, Θ_t) = ∏_{n=1}^{N} γ_{n j_n}. We can
now compute Q_t(Θ),

    Q_t(Θ) = E_J[ln P(X, J | Θ)],

where the expectation is with respect to the ('fictitious') probabilities γ_nk
that determine the distribution of the random variable J. These probabilities
depend on X and Θ_t. Let N̂_k (a random variable) be the number of occurrences
of bump k in the random variable J; similarly, let X̂_k be the random set
containing the data points of bump k. From Equation (C.2),
    Q_t(Θ) = ∑_{k=1}^{K} E_J[N̂_k] ln w_k + ∑_{k=1}^{K} E_J[ ∑_{x_n ∈ X̂_k} ln P(x_n; θ_k) ]
       (a) = ∑_{k=1}^{K} E_J[ ∑_{n=1}^{N} z_nk ] ln w_k + ∑_{k=1}^{K} E_J[ ∑_{n=1}^{N} z_nk ln P(x_n; θ_k) ]
       (b) = ∑_{k=1}^{K} N_k ln w_k + ∑_{k=1}^{K} ∑_{n=1}^{N} γ_nk ln P(x_n; θ_k),
where Θ = {w_1, . . . , w_K; θ_1, . . . , θ_K}. In (a), we introduced an indicator random
variable z_nk = [x_n ∈ X̂_k], which is 1 if x_n is from bump k and 0 otherwise;
in (b), we defined

    N_k = E_J[N̂_k] = E_J[ ∑_{n=1}^{N} z_nk ] = ∑_{n=1}^{N} γ_nk,
where we used E[z_nk] = γ_nk. Now that we have an explicit functional form
for Q_t(Θ), we can perform the maximization step. Observe that the bump
parameters θ_k ∈ Θ occur in independent terms, and so can be optimized
separately. As for the first term, observe that
    (1/N) ∑_{k=1}^{K} N_k ln w_k = ∑_{k=1}^{K} (N_k/N) ln ( w_k / (N_k/N) ) + ∑_{k=1}^{K} (N_k/N) ln (N_k/N)
                                 ≤ ∑_{k=1}^{K} (N_k/N) ln (N_k/N),
where the last inequality follows from Jensen's inequality and the concavity of
the logarithm, which implies

    ∑_{k=1}^{K} (N_k/N) ln ( w_k / (N_k/N) ) ≤ ln ( ∑_{k=1}^{K} (N_k/N) · w_k / (N_k/N) ) = ln ∑_{k=1}^{K} w_k = ln(1) = 0.
Equality holds when w_k = N_k/N. Maximizing Q_t(w_1, . . . , w_K, θ_1, . . . , θ_K),
therefore, gives the following updates:

    w*_k = N_k/N = (1/N) ∑_{n=1}^{N} γ_nk;   (C.6)

    θ*_k = argmax_{θ_k} ∑_{n=1}^{N} γ_nk ln P(x_n; θ_k).   (C.7)
The update to get w*_k can be viewed as follows. A fraction γ_nk of data point x_n
belongs to bump k. Thus, N_k = ∑_n γ_nk is the total number of data points
belonging to bump k. Similarly, the update to get θ*_k can be viewed as follows.
The 'likelihood' for the parameter θ_k of bump k is just the weighted sum of the
likelihoods of each point, weighted by the fraction of the point that belongs to
bump k. This intuitive interpretation of the update is another reason for the
popularity of the EM algorithm.
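For instance, if the responsibilities from the expectation step are stored as an N × K numpy matrix, update (C.6) is one line; a minimal sketch:

```python
import numpy as np

def update_weights(gamma):
    # gamma[n, k] = gamma_nk, the fraction of point x_n belonging to bump k.
    N_k = gamma.sum(axis=0)        # N_k = sum_n gamma_nk, per update (C.6)
    return N_k / gamma.shape[0]    # w_k = N_k / N
```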
The EM algorithm for mixture density estimation is remarkably simple
once we get past the machinery used to set it up: we have an analytic update
for the weights w_k, and K separate optimizations for each bump parameter θ_k.
The miracle is that these simple updates are guaranteed to improve the log-likelihood (from Theorem C.2). There are other ways to maximize the likelihood, for example using gradient and Hessian based iterative optimization techniques. However, the EM algorithm is simpler and works well in practice.
Example. Let's derive the EM update for the GMM with current parameter
estimate Θ_t = {ŵ_1, . . . , ŵ_K; µ̂_1, . . . , µ̂_K; Σ̂_1, . . . , Σ̂_K}. Let S_k = (Σ̂_k)⁻¹.
The posterior bump probabilities γ_nk for the parameters Θ_t are:

    γ_nk = ŵ_k P(x_n | µ̂_k, S_k) / ∑_{ℓ=1}^{K} ŵ_ℓ P(x_n | µ̂_ℓ, S_ℓ),

where P(x | µ, S) = (2π)^{−d/2} |S|^{1/2} exp( −(1/2)(x − µ)ᵗ S (x − µ) ). The w_k update
is immediate from (C.6), which matches Equation (6.9). Since

    ln P(x | µ, S) = −(1/2)(x − µ)ᵗ S (x − µ) + (1/2) ln |S| − (d/2) ln(2π),

to get the µ_k and S_k updates using (C.7), we need to minimize

    ∑_{n=1}^{N} γ_nk ( (x_n − µ_k)ᵗ S_k (x_n − µ_k) − ln |S_k| ).
Setting the derivative with respect to µ_k to 0 gives

    2 S_k ∑_{n=1}^{N} γ_nk (x_n − µ_k) = 0.

Since S_k is invertible, µ_k ∑_{n=1}^{N} γ_nk = ∑_{n=1}^{N} γ_nk x_n, and since N_k = ∑_{n=1}^{N} γ_nk,
we obtain the update in (6.9),

    µ_k = (1/N_k) ∑_{n=1}^{N} γ_nk x_n.
To take the derivative with respect to S_k, we use the identities in the hint of
Exercise C.1(b). Setting the derivative with respect to S_k to zero gives

    ∑_{n=1}^{N} γ_nk (x_n − µ_k)(x_n − µ_k)ᵗ − S_k⁻¹ ∑_{n=1}^{N} γ_nk = 0,

or

    S_k⁻¹ = (1/N_k) ∑_{n=1}^{N} γ_nk (x_n − µ_k)(x_n − µ_k)ᵗ.

Since S_k⁻¹ = Σ_k, we recover the update in (6.9).
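Collecting the pieces, here is a compact sketch of one full EM step for the GMM in Python with numpy. It is one possible implementation of the updates just derived, not the book's own code; the small ridge added to each covariance is a practical safeguard to keep Σ_k invertible rather than part of the derivation.

```python
import numpy as np

def gmm_em_step(X, w, mu, Sigma):
    # One EM step for a GMM. Shapes: X (N, d); w (K,); mu (K, d); Sigma (K, d, d).
    N, d = X.shape
    K = len(w)

    # E-step: responsibilities gamma_nk proportional to w_k P(x_n | mu_k, Sigma_k).
    gamma = np.empty((N, K))
    for k in range(K):
        diff = X - mu[k]
        quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma[k]), diff)
        logdet = np.linalg.slogdet(Sigma[k])[1]
        gamma[:, k] = w[k] * np.exp(-0.5 * (quad + logdet + d * np.log(2 * np.pi)))
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize over k (Bayes' theorem)

    # M-step: the updates derived above (Equation (6.9)).
    N_k = gamma.sum(axis=0)                     # N_k = sum_n gamma_nk
    w_new = N_k / N                             # w_k = N_k / N
    mu_new = (gamma.T @ X) / N_k[:, None]       # mu_k = (1/N_k) sum_n gamma_nk x_n
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        diff = X - mu_new[k]
        # Sigma_k = (1/N_k) sum_n gamma_nk (x_n - mu_k)(x_n - mu_k)^t
        Sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / N_k[k]
        Sigma_new[k] += 1e-6 * np.eye(d)        # ridge to keep Sigma_k invertible
    return w_new, mu_new, Sigma_new
```

Iterating gmm_em_step from any reasonable initialization never decreases the log-likelihood (by Theorem C.2, up to the small ridge perturbation).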
Commentary. The EM algorithm is a remarkable example of a recurring
theme in learning. We want to learn a model Θ that explains the data. We
start with a guess Θ̂ that is wrong. We use this wrong model to estimate
some other quantities of the world (the bump memberships in our example).
We now learn a new model which is better than the old model at explaining
the combined data plus the inaccurate estimates of the other quantities.
Miraculously, by doing this, we bootstrap ourselves up to a better model, one that is better at explaining the data. This theme reappears in Reinforcement Learning as well. If we didn't know better, it would seem like a free lunch.
C.2 Problems
Problem C.1. Consider the general case of Example C.1. The sample
has N balls, with N_g green, N_b blue and N_r red (N_g + N_b + N_r = N). Show
that the log-likelihood of the incomplete data is

    N_g ln(1 − µ_1) + N_b ln(1 − µ_2) + N_r ln(µ_1 + µ_2) − N ln 2.   (C.8)

What are the maximum likelihood estimates for µ_1, µ_2? (Be careful with the
cases N_g > N/2 and N_b > N/2.)
Problem C.2. Consider the general case of Example C.1 as in Problem C.1,
with N_g green, N_b blue and N_r red balls.
(a) Suppose your starting estimates are µ̂_1, µ̂_2. For a red ball, what are
p_1 = P[Bag 1 | µ̂_1, µ̂_2] and p_2 = P[Bag 2 | µ̂_1, µ̂_2]?
(b) Let N_r1 and N_r2 (with N_r = N_r1 + N_r2) be the number of red balls from
Bags 1 and 2 respectively. Show that the log-likelihood of the complete
data is

    N_g ln(1 − µ_1) + N_b ln(1 − µ_2) + N_r1 ln(µ_1) + N_r2 ln(µ_2) − N ln 2.

(c) Compute the function Q_t(µ_1, µ_2) by taking the expectation of the log-likelihood
of the complete data. Show that

    Q_t(µ_1, µ_2) = N_g ln(1 − µ_1) + N_b ln(1 − µ_2) + p_1 N_r ln(µ_1) + p_2 N_r ln(µ_2) − N ln 2.

(d) Maximize Q_t(µ_1, µ_2) to obtain the EM update.
(e) Show that repeated EM iteration will ultimately converge to

    µ̂_1 = (N − 2N_g)/N,   µ̂_2 = (N − 2N_b)/N.
Problem C.3. A sequence of N balls, X = x_1, . . . , x_N, is drawn iid as
follows. There are 2 bags. Bag 1 contains only red balls and Bag 2 contains
red and blue balls. A fraction π in this second bag are red. A bag is picked
randomly with probability 1/2 and one of the balls is picked randomly from that
bag; x_n = 1 if ball n is red and 0 if it is blue. You are given N and the number
of red balls N_r = ∑_{n=1}^{N} x_n.
(a) (i) Show that the likelihood P[X | π, N] is

    P[X | π, N] = ∏_{n=1}^{N} ((1 + π)/2)^{x_n} ((1 − π)/2)^{1−x_n}.

Maximize to obtain an estimate for π (be careful with N_r < N/2).
(ii) For N_r = 600 and N = 1000, what is your estimate of π?
(b) Maximizing the likelihood is tractable for this simple problem. Now develop
an EM iterative approach.
(i) What is an appropriate hidden/unmeasured variable J = j_1, . . . , j_N?
(ii) Give a formula for the likelihood of the full data, P[X, J | π, N].
(iii) If at step t your estimate is π_t, for the expectation step compute
Q_t(π) = E_{J|X,π_t}[ln P[X, J | π, N]] and show that, up to an additive constant,

    Q_t(π) = (π_t/(1 + π_t)) N_r ln π + (N − N_r) ln(1 − π),

and hence show that the EM update is given by

    π_{t+1} = π_t N_r / ( π_t N_r + (1 + π_t)(N − N_r) ).

What are the limit points when N_r ≥ N/2 and N_r < N/2?
(iv) Plot π_t versus t, starting from π_0 = 0.9 and π_0 = 0.2, with N_r = 600, N = 1000.
(v) The values of the hidden variables can often be useful. After convergence
of the EM, how could you get estimates of the hidden variables?
Problem C.4. [EM for Supervised Learning] We wish to learn a
function f(x) which predicts the temperature as a function of the time x of
the day, x ∈ [0, 1]. We believe that the temperature has a linear dependence
on time, so we model f with the linear hypotheses h(x) = w_0 + w_1 x.
We have N data points y_1, . . . , y_N, the temperature measurements on different
(independent) days, where y_n = f(x_n) + ε_n and ε_n ∼ N(0, 1) is iid zero mean
Gaussian noise with unit variance. The problem is that we do not know the time
x_n at which these measurements were made. Assume that each temperature
measurement was taken at some random time in the day, chosen uniformly on
[0, 1].
(a) Show that the log-likelihood for the weights w is

    ln P[y | w] = ∑_{n=1}^{N} ln ∫_0^1 (1/√(2π)) exp( −(y_n − w_0 − w_1 x)²/2 ) dx.

(b) What is the natural hidden variable J = j_1, . . . , j_N?
(c) Compute the log-likelihood for the complete data, ln P[y, J | w].
(d) Let γ_n(x | w) = P(x_n = x | y_n, w). Show that, for x ∈ [0, 1],

    γ_n(x | w) = exp( −(y_n − w_0 − w_1 x)²/2 ) / ∫_0^1 exp( −(y_n − w_0 − w_1 x′)²/2 ) dx′.

Hence, compute Q_t(w).
(e) Let α_n = E_{γ_n(x|w_t)}[x] and β_n = E_{γ_n(x|w_t)}[x²] (expectations taken with respect to the distribution γ_n(x | w_t)). Show that the EM updates are

    w_1(t + 1) = [ (1/N) ∑_{n=1}^{N} (y_n − ȳ)(α_n − ᾱ) ] / ( β̄ − ᾱ² );
    w_0(t + 1) = ȳ − w_1(t + 1) ᾱ;

where the bar denotes averaging (e.g. ᾱ = (1/N) ∑_{n=1}^{N} α_n) and w(t) are the
weights at iteration t.
(f) What happens if the temperature measurement is not at a uniformly
random time, but at a time distributed according to an unknown P(x)?
You have to maintain an estimate P_t(x). Now, show that

    γ_n(x | w) = P_t(x) exp( −(y_n − w_0 − w_1 x)²/2 ) / ∫_0^1 P_t(x′) exp( −(y_n − w_0 − w_1 x′)²/2 ) dx′,

and that the updates in (e) are unchanged, except that they use this new
γ_n(x | w). Show that the update for P_t is given by

    P_{t+1}(x) = (1/N) ∑_{n=1}^{N} γ_n(x | w_t).

What happens if you tried to maximize the log-likelihood for the incomplete
data, instead of using the EM approach?