e-Appendix C
The EM Algorithm
The EM algorithm is a convenient heuristic for likelihood maximization. The
EM algorithm will never decrease the likelihood. Our discussion will focus
on mixture models, the GMM being a special case, even though the EM
algorithm applies to more general settings.
Let P_k(x; θ_k) be a density for k = 1, . . . , K, where θ_k are the parameters
specifying P_k. We will refer to each P_k as a bump. In the GMM setting,
all the P_k are Gaussians, and θ_k = {µ_k, Σ_k} (the mean vector and covariance
matrix for each bump). A mixture model is a weighted sum of these K bumps,

    P(x; Θ) = ∑_{k=1}^{K} w_k P_k(x; θ_k),

where the weights satisfy w_k ≥ 0 and ∑_{k=1}^{K} w_k = 1, and we have collected all
the parameters into a single grand parameter, Θ = {w_1, . . . , w_K; θ_1, . . . , θ_K}.
Intuitively, to generate a random point x, you first pick a bump according to
the probabilities w_1, . . . , w_K. Suppose you pick bump k. You then generate a
random point from the bump density P_k.
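To make the generative process concrete, here is a minimal sketch in Python with numpy for a one-dimensional GMM; the weights, means, and spreads are illustrative values of our choosing, not anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D GMM: weights w_k, means mu_k, standard deviations sigma_k.
w = np.array([0.5, 0.3, 0.2])      # w_k >= 0 and sum to 1
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

def sample_mixture(n):
    # First pick a bump k according to the probabilities w_1, ..., w_K ...
    k = rng.choice(len(w), size=n, p=w)
    # ... then generate a random point from the chosen bump's density P_k.
    return rng.normal(mu[k], sigma[k])

X = sample_mixture(1000)
```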
Given data X = x_1, . . . , x_N generated independently, we wish to estimate
the parameters of the mixture which maximize the log-likelihood,

    ln P(X | Θ) = ln ∏_{n=1}^{N} P(x_n | Θ)
                = ln ∏_{n=1}^{N} ∑_{k=1}^{K} w_k P_k(x_n; θ_k)
                = ∑_{n=1}^{N} ln ∑_{k=1}^{K} w_k P_k(x_n; θ_k).   (C.1)
In the first step above, P(X | Θ) is a product because the data are independent. Note that X is known and fixed. What is not known is which particular bump
was used to generate data point x_n. Denote by j_n ∈ {1, . . . , K} the bump
that generated x_n (we say x_n is a 'member' of bump j_n). Collect all bump
memberships into a set J = {j_1, . . . , j_N}. If we knew which data belonged
to which bump, we could estimate each bump density's parameters separately,
using only the data belonging to that bump. We call (X, J) the complete
data. If we know the complete data, we can easily optimize the log-likelihood.
We call X the incomplete data. Though X is all we can measure, it is still
called the 'incomplete' data because it does not contain enough information
to easily determine the optimal parameters Θ. Let's
see mathematically how knowing the complete data helps us.
To get the likelihood of the complete data, we need the joint probability
P[x_n, j_n | Θ]. Using Bayes' theorem,

    P[x_n, j_n | Θ] = P[j_n | Θ] P[x_n | j_n, Θ] = w_{j_n} P_{j_n}(x_n; θ_{j_n}).

Since the data are independent,

    P(X, J | Θ) = ∏_{n=1}^{N} P[x_n, j_n | Θ] = ∏_{n=1}^{N} w_{j_n} P_{j_n}(x_n; θ_{j_n}).
Let N_k be the number of occurrences of bump k in J, and let X_k be those
data points corresponding to bump k, so X_k = {x_n ∈ X : j_n = k}. We
compute the log-likelihood for the complete data as follows:

    ln P(X, J | Θ) = ∑_{n=1}^{N} ln w_{j_n} + ∑_{n=1}^{N} ln P_{j_n}(x_n; θ_{j_n})
                   = ∑_{k=1}^{K} N_k ln w_k + ∑_{k=1}^{K} ∑_{x_n ∈ X_k} ln P_k(x_n; θ_k)
                   = ∑_{k=1}^{K} N_k ln w_k + ∑_{k=1}^{K} L_k(X_k; θ_k),   (C.2)

where L_k(X_k; θ_k) = ∑_{x_n ∈ X_k} ln P_k(x_n; θ_k).
There are two simplifications which occur in (C.2) from knowing the complete
data (X, J). The w_k (in the first term) are separated from the θ_k (in the second
term); and, the second term is the sum of K non-interacting log-likelihoods
L_k(X_k; θ_k), each corresponding to the data belonging to X_k and only involving bump
k's parameters θ_k. Each log-likelihood L_k can be optimized independently of
the others. For many choices of P_k, L_k(X_k; θ_k) can be optimized analytically, even though the log-likelihood for the incomplete data in (C.1) is intractable. The next exercise asks you to analytically maximize (C.2) for the GMM.
Exercise C.1. Maximize (C.2) analytically for the GMM. [Hint for part (b): use ∂/∂S (zᵗ S z) = zzᵗ and ∂/∂S ln |S| = S⁻¹.]
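As a quick numerical sanity check on the decomposition in (C.2), the following sketch (Python, using scipy's Gaussian log-density with made-up one-dimensional bumps and memberships) confirms that the per-point sum and the grouped form agree:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
w = np.array([0.6, 0.4])                 # bump weights w_k (illustrative)
mu = np.array([0.0, 4.0])                # bump means
sd = np.array([1.0, 1.5])                # bump standard deviations
j = rng.choice(2, size=50, p=w)          # bump memberships J (known here)
x = rng.normal(mu[j], sd[j])             # the complete data (X, J)

# Per-point form: sum_n [ ln w_{j_n} + ln P_{j_n}(x_n; theta_{j_n}) ]
per_point = np.sum(np.log(w[j]) + norm.logpdf(x, mu[j], sd[j]))

# Grouped form (C.2): sum_k N_k ln w_k + sum_k L_k(X_k; theta_k)
grouped = sum((j == k).sum() * np.log(w[k])
              + norm.logpdf(x[j == k], mu[k], sd[k]).sum()
              for k in range(2))

assert np.isclose(per_point, grouped)    # the two forms agree
```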
In reality, we do not have access to J, and hence it is called a 'hidden variable'.
So what can we do now? We need a heuristic to maximize the likelihood
in Equation (C.1). One approach is to guess J and maximize the resulting
complete likelihood in Equation (C.2). This almost works. Instead of maximizing
the complete likelihood for a single guess, we consider an average of
the complete likelihood over all possible guesses. Specifically, we treat J as
an unknown random variable and maximize the expected value (with respect
to J) of the complete log-likelihood in Equation (C.2). This expected value
is as easy to maximize as the complete likelihood. The mathematical implementation
of this idea will lead us to the EM Algorithm, which stands for
Expectation-Maximization Algorithm. Let's start with a simpler example.
Example C.1. You have two opaque bags. Bag 1 has red and green balls,
with µ_1 being the fraction of red balls. Bag 2 has red and blue balls, with µ_2
being the fraction of red. You pick four balls in independent trials as follows.
First pick one of the bags at random, each with probability 1/2; then, pick a ball
at random from the bag. Here is the sample of four balls you got:
one green, one red and two blue. The task is to estimate µ_1 and µ_2. It would
be much easier if we knew which bag each ball came from.

Here is one way to reason. Half the balls will come from Bag 1 and the
other half from Bag 2. The blue balls come from Bag 2 (that's already Bag 2's budget of balls), so the other two, the green and the red, should come from Bag 1. Using in-sample estimates, µ̂_1 = 1/2 and µ̂_2 = 0. We can get these same estimates
using a maximum likelihood argument. The log-likelihood of the data is

    ln(1 − µ_1) + 2 ln(1 − µ_2) + ln(µ_1 + µ_2) − 4 ln 2.   (C.3)
The reader should explicitly maximize the above expression with respect to
µ_1, µ_2 ∈ [0, 1] to obtain the estimates µ̂_1 = 1/2 and µ̂_2 = 0. In our data set of four
balls, there is a red one, so it seems a little counter-intuitive that we would
estimate µ̂_2 = 0, for isn't there a positive probability that the red ball came
from Bag 2? Nevertheless, µ̂_2 = 0 is the estimate that maximizes the probability
of generating the data. You are uneasy with this, and rightly so, because
we put all our eggs into this single 'point' estimate; a very unnatural thing
given that any point estimate has infinitesimal probability of being correct.
Nevertheless, that is the maximum likelihood method, and we are following it.

Here is another way to reason. 'Half' of each red ball came from Bag 1
and the other 'half' from Bag 2. So, µ̂_1 = (1/2)/(1 + 1/2) = 1/3, because 1/2 a red ball
came from Bag 1 out of a total of 1 + 1/2 balls. Similarly, µ̂_2 = (1/2)/(2 + 1/2) = 1/5.
This reasoning is wrong because it does not correctly use the knowledge that
the ball is red. For example, as we just reasoned, µ̂_1 = 1/3 and µ̂_2 = 1/5. But, if
these estimates are correct, and indeed µ̂_1 > µ̂_2, then a red ball is more likely
to come from Bag 1, so more than half of it should come from Bag 1. This
contradicts the original assumption that led us to these estimates.

Now, let's see how expectation-maximization solves this problem. The reasoning
is similar to our false start above; it just adds iteration till consistency.
We begin by considering the two cases for the red ball. Either it is from Bag 1
or Bag 2. We can compute the log-likelihood for each of these two cases:
    ln(1 − µ_1) + ln(µ_1) + 2 ln(1 − µ_2) − 4 ln 2   (Bag 1);
    ln(1 − µ_1) + ln(µ_2) + 2 ln(1 − µ_2) − 4 ln 2   (Bag 2).
Suppose we have estimates µ̂_1 and µ̂_2. Using Bayes' theorem, we can compute
p_1 = P[Bag 1 | µ̂_1, µ̂_2] and p_2 = P[Bag 2 | µ̂_1, µ̂_2]. The reader can verify that

    p_1 = µ̂_1/(µ̂_1 + µ̂_2),   p_2 = µ̂_2/(µ̂_1 + µ̂_2).
Now comes the expectation step. Compute the expected log-likelihood using
p_1 and p_2:

    E[log-likelihood] = ln(1 − µ_1) + p_1 ln(µ_1) + p_2 ln(µ_2) + 2 ln(1 − µ_2) − 4 ln 2.   (C.4)
Next comes the maximization step. Treating p_1, p_2 as constants, maximize
the expected log-likelihood with respect to µ_1, µ_2 and update µ̂_1, µ̂_2 to these
optimal values. Notice that the log-likelihood in (C.3) has an interaction
term ln(µ_1 + µ_2) which complicates the maximization. In the expected log-likelihood
(C.4), µ_1 and µ_2 are decoupled, and so the maximization can be
implemented separately for each variable. The reader can verify that maximizing
the expected log-likelihood gives the updates:

    µ̂_1 ← p_1/(1 + p_1) = µ̂_1/(2µ̂_1 + µ̂_2)   and   µ̂_2 ← p_2/(2 + p_2) = µ̂_2/(2µ̂_1 + 3µ̂_2).
The full algorithm just iterates this update process with the new estimates.
Let's see what happens if we start (arbitrarily) with estimates µ̂_1 = µ̂_2 = 1/2.
The result of the first iteration, µ̂_1 = 1/3 and µ̂_2 = 1/5, is exactly the
estimate from our earlier faulty reasoning: when µ̂_1 = µ̂_2, our faulty reasoning
matches the EM step. If we continued iterating, the estimates would converge
to µ̂_1 = 1/2 and µ̂_2 = 0.
It's miraculous that by maximizing an expected log-likelihood using a guess for
the parameters, we end up converging to the true maximum likelihood solution.
Why is this useful? Because the maximizations for µ_1 and µ_2 are decoupled.
We trade a maximization of a complicated likelihood of the incomplete data
for a bunch of simpler maximizations that we iterate.
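A minimal sketch of this iteration in Python, using the two updates derived above, reproduces this behavior:

```python
def em_two_bags(mu1=0.5, mu2=0.5, iters=100):
    # Iterate the coupled updates derived for Example C.1.
    for _ in range(iters):
        mu1, mu2 = mu1 / (2*mu1 + mu2), mu2 / (2*mu1 + 3*mu2)
    return mu1, mu2

print(em_two_bags(iters=1))    # (1/3, 1/5): the 'faulty reasoning' estimates
print(em_two_bags(iters=100))  # slowly approaches (1/2, 0), the ML solution
```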
C.1 Derivation of the EM Algorithm
We now derive the EM strategy and show that it will always improve the
likelihood of the incomplete data.

    Maximizing an expected log-likelihood of the complete data increases
    the likelihood of the incomplete data.

What is surprising is that to compute the expected log-likelihood, we use a
guess, since we don't know the best model. So it is really a guess for the
expected log-likelihood that one maximizes, and this increases the likelihood.
Let Θ′ be any set of parameters, and define P(J | X, Θ′) as the conditional
probability distribution for J given the data and assuming that Θ′ is the actual
probability model for the data. The probability P(J | X, Θ′) is well defined,
even if Θ′ is not the probability model which generated the data. We will see
how to compute P(J | X, Θ′) soon, but for the moment assume that it is always
positive (which means that every possible assignment of x_1, . . . , x_N to bumps
j_1, . . . , j_N has non-zero probability under Θ′). This will always be the case
unless some of the P_k have bounded support or Θ′ is some degenerate mixture. The following derivation establishes a connection between the log-likelihood
for the incomplete data and the expectation (over J) of the log-likelihood for
the complete data.
    L(Θ)  =  ln P(X | Θ)
     (a)  =  ln ∑_J P(X, J | Θ)
     (b)  =  ln ∑_J P(J | X, Θ′) · P(X, J | Θ) / P(J | X, Θ′)
     (c)  =  ln ∑_J P(J | X, Θ′) · P(X, J | Θ) P(X | Θ′) / P(X, J | Θ′)
     (d)  ≥  ∑_J P(J | X, Θ′) ln [ P(X, J | Θ) P(X | Θ′) / P(X, J | Θ′) ]
     (e)  =  L(Θ′) + E_{J|X,Θ′}[ln P(X, J | Θ)] − E_{J|X,Θ′}[ln P(X, J | Θ′)]
     (f)  =  L(Θ′) + Q(Θ | X, Θ′) − Q(Θ′ | X, Θ′).   (C.5)
(a) follows by the law of total probability; (b) is justified because P(J | X, Θ′) is
positive; (c) follows from Bayes' theorem; (d) follows because the summation
∑_J (·) P(J | X, Θ′) is an expectation using the probabilities P(J | X, Θ′), and
by Jensen's inequality ln E[·] ≥ E[ln(·)] because ln(x) is concave; (e) follows
because ln P(X | Θ′) is independent of J, and so

    ∑_J P(J | X, Θ′) ln P(X | Θ′) = ln P(X | Θ′) ∑_J P(J | X, Θ′) = ln P(X | Θ′) · 1;
finally, in (f), we have defined the function

    Q(Θ | X, Θ′) = E_{J|X,Θ′}[ln P(X, J | Θ)].

The function Q is a function of Θ, though its definition depends on the distribution
of J, which in turn depends on the incomplete data X and the model Θ′.
We have proved the following result.
Theorem C.2. If Q(Θ | X, Θ′) > Q(Θ′ | X, Θ′), then L(Θ) > L(Θ′).
In words, fix Θ′ and compute the 'posterior' distribution of J conditioned on
the data X with parameters Θ′. Now, for the parameters Θ, compute the
expected log-likelihood of the complete data (X, J), where the expectation is
taken with respect to this posterior distribution for J that we just obtained.
This distribution for J is fixed, depending on X, Θ′, but it does not depend
on Θ. Find Θ* that maximizes this expected log-likelihood, and you are guaranteed to improve the actual likelihood. This theorem leads naturally to the EM algorithm that follows.
EM Algorithm
1: Initialize Θ_0 at t = 0.
2: At step t let the parameter estimate be Θ_t.
3: [Expectation] For X, Θ_t, compute the function Q_t(Θ):

       Q_t(Θ) = E_{J|X,Θ_t}[ln P(X, J | Θ)],

   which is the expected log-likelihood for the complete data.
4: [Maximization] Update Θ to maximize Q_t(Θ):

       Θ_{t+1} = argmax_Θ Q_t(Θ).

5: Increment t → t + 1 and repeat steps 3, 4 till convergence.
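In code, the algorithm is a single loop alternating the two steps. Below is a minimal generic sketch in Python; e_step and m_step are hypothetical problem-specific callables (made concrete for mixture models in what follows), and the parameters are assumed packed into a numpy array so that convergence can be tested numerically.

```python
import numpy as np

def em(X, theta0, e_step, m_step, max_iters=100, tol=1e-8):
    # Generic EM loop. e_step(X, theta) returns the posterior P(J | X, theta_t),
    # which defines Q_t; m_step(X, posterior) returns argmax_theta Q_t(theta).
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        posterior = e_step(X, theta)        # [Expectation] step 3
        theta_new = m_step(X, posterior)    # [Maximization] step 4
        if np.allclose(theta, theta_new, atol=tol):
            break                           # estimates stopped moving
        theta = theta_new
    return theta
```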
In the algorithm, we need to compute Q_t(Θ), which amounts to computing an
expectation with respect to P(J | X, Θ_t). We illustrate this process with our
mixture model.

Recall that J is the vector of bump memberships. Since the data are
independent, to compute P(J | X, Θ_t) we can compute this 'posterior' for each
data point and then take the product. We need γ_nk = P(j_n = k | x_n, Θ_t), the
probability that data point x_n came from bump k. By Bayes' theorem,

    γ_nk = P(j_n = k | x_n, Θ_t) = P(x_n, k | Θ_t)/P(x_n | Θ_t)
         = ŵ_k P_k(x_n | θ̂_k) / ∑_{ℓ=1}^{K} ŵ_ℓ P_ℓ(x_n | θ̂_ℓ),

where Θ_t = {ŵ_1, . . . , ŵ_K; θ̂_1, . . . , θ̂_K} and P(J | X, Θ_t) = ∏_{n=1}^{N} γ_{n j_n}. We can
now compute Q_t(Θ),

    Q_t(Θ) = E_J[ln P(X, J | Θ)],

where the expectation is with respect to the ('fictitious') probabilities γ_nk
that determine the distribution of the random variable J. These probabilities
depend on X and Θ_t. Let N̂_k (a random variable) be the number of occurrences
of bump k in the random variable J; similarly, let X̂_k be the random set
containing the data points of bump k. From Equation (C.2),
    Q_t(Θ) = ∑_{k=1}^{K} E_J[N̂_k] ln w_k + ∑_{k=1}^{K} E_J[ ∑_{x_n ∈ X̂_k} ln P(x_n; θ_k) ]
       (a) = ∑_{k=1}^{K} E_J[ ∑_{n=1}^{N} z_nk ] ln w_k + ∑_{k=1}^{K} E_J[ ∑_{n=1}^{N} z_nk ln P(x_n; θ_k) ]
       (b) = ∑_{k=1}^{K} N_k ln w_k + ∑_{k=1}^{K} ∑_{n=1}^{N} γ_nk ln P(x_n; θ_k),
where Θ = {w_1, . . . , w_K; θ_1, . . . , θ_K}. In (a), we introduced an indicator random
variable z_nk = [x_n ∈ X̂_k], which is 1 if x_n is from bump k and 0 otherwise;
in (b), we defined

    N_k = E_J[N̂_k] = E_J[ ∑_{n=1}^{N} z_nk ] = ∑_{n=1}^{N} γ_nk,
where we used E[z_nk] = γ_nk. Now that we have an explicit functional form
for Q_t(Θ), we can perform the maximization step. Observe that the bump
parameters θ_k ∈ Θ occur in independent terms, and so can be optimized
separately. As for the first term, observe that
    (1/N) ∑_{k=1}^{K} N_k ln w_k = ∑_{k=1}^{K} (N_k/N) ln ( w_k / (N_k/N) ) + ∑_{k=1}^{K} (N_k/N) ln (N_k/N)
                                 ≤ ∑_{k=1}^{K} (N_k/N) ln (N_k/N),
where the last inequality follows from Jensen's inequality and the concavity of
the logarithm, which implies

    ∑_{k=1}^{K} (N_k/N) ln ( w_k / (N_k/N) ) ≤ ln ( ∑_{k=1}^{K} (N_k/N) · w_k / (N_k/N) ) = ln ∑_{k=1}^{K} w_k = ln(1) = 0.
Equality holds when w_k = N_k/N. Maximizing Q_t(w_1, . . . , w_K, θ_1, . . . , θ_K),
therefore, gives the following updates:

    w*_k = N_k/N = (1/N) ∑_{n=1}^{N} γ_nk;   (C.6)

    θ*_k = argmax_{θ_k} ∑_{n=1}^{N} γ_nk ln P(x_n; θ_k).   (C.7)
The update to get w*_k can be viewed as follows. A fraction γ_nk of data point x_n
belongs to bump k. Thus, N_k = ∑_n γ_nk is the total number of data points
belonging to bump k. Similarly, the update to get θ*_k can be viewed as follows.
The 'likelihood' for the parameter θ_k of bump k is just the weighted sum of the
likelihoods of each point, weighted by the fraction of the point that belongs to
bump k. This intuitive interpretation of the update is another reason for the
popularity of the EM algorithm.
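For instance, if the responsibilities from the expectation step are stored as an N × K numpy matrix, update (C.6) is one line; a minimal sketch:

```python
import numpy as np

def update_weights(gamma):
    # gamma[n, k] = gamma_nk, the fraction of point x_n belonging to bump k.
    N_k = gamma.sum(axis=0)        # N_k = sum_n gamma_nk, per update (C.6)
    return N_k / gamma.shape[0]    # w_k = N_k / N
```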
The EM algorithm for mixture density estimation is remarkably simple
once we get past the machinery used to set it up: we have an analytic update
for the weights w_k, and K separate optimizations for each bump parameter θ_k.
The miracle is that these simple updates are guaranteed to improve the log-likelihood (from Theorem C.2). There are other ways to maximize the likelihood, for example using gradient and Hessian based iterative optimization techniques. However, the EM algorithm is simpler and works well in practice.
Example. Let's derive the EM update for the GMM with current parameter
estimate Θ_t = {ŵ_1, . . . , ŵ_K; µ̂_1, . . . , µ̂_K; Σ̂_1, . . . , Σ̂_K}. Let S_k = (Σ̂_k)⁻¹.
The posterior bump probabilities γ_nk for the parameters Θ_t are:

    γ_nk = ŵ_k P(x_n | µ̂_k, S_k) / ∑_{ℓ=1}^{K} ŵ_ℓ P(x_n | µ̂_ℓ, S_ℓ),

where P(x | µ, S) = (2π)^{−d/2} |S|^{1/2} exp( −(1/2)(x − µ)ᵗ S (x − µ) ). The w_k update
is immediate from (C.6), which matches Equation (6.9). Since

    ln P(x | µ, S) = −(1/2)(x − µ)ᵗ S (x − µ) + (1/2) ln |S| − (d/2) ln(2π),

to get the µ_k and S_k updates using (C.7), we need to minimize

    ∑_{n=1}^{N} γ_nk ( (x_n − µ_k)ᵗ S_k (x_n − µ_k) − ln |S_k| ).
Setting the derivative with respect to µ_k to 0 gives

    2 S_k ∑_{n=1}^{N} γ_nk (x_n − µ_k) = 0.

Since S_k is invertible, µ_k ∑_{n=1}^{N} γ_nk = ∑_{n=1}^{N} γ_nk x_n, and since N_k = ∑_{n=1}^{N} γ_nk,
we obtain the update in (6.9),

    µ_k = (1/N_k) ∑_{n=1}^{N} γ_nk x_n.
To take the derivative with respect to S_k, we use the identities in the hint of
Exercise C.1(b). Setting the derivative with respect to S_k to zero gives

    ∑_{n=1}^{N} γ_nk (x_n − µ_k)(x_n − µ_k)ᵗ − S_k⁻¹ ∑_{n=1}^{N} γ_nk = 0,

or

    S_k⁻¹ = (1/N_k) ∑_{n=1}^{N} γ_nk (x_n − µ_k)(x_n − µ_k)ᵗ.

Since S_k⁻¹ = Σ_k, we recover the update in (6.9).
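Collecting the pieces, here is a compact sketch of one full EM step for the GMM in Python with numpy. It is one possible implementation of the updates just derived, not the book's own code; the small ridge added to each covariance is a practical safeguard to keep Σ_k invertible rather than part of the derivation.

```python
import numpy as np

def gmm_em_step(X, w, mu, Sigma):
    # One EM step for a GMM. Shapes: X (N, d); w (K,); mu (K, d); Sigma (K, d, d).
    N, d = X.shape
    K = len(w)

    # E-step: responsibilities gamma_nk proportional to w_k P(x_n | mu_k, Sigma_k).
    gamma = np.empty((N, K))
    for k in range(K):
        diff = X - mu[k]
        quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma[k]), diff)
        logdet = np.linalg.slogdet(Sigma[k])[1]
        gamma[:, k] = w[k] * np.exp(-0.5 * (quad + logdet + d * np.log(2 * np.pi)))
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize over k (Bayes' theorem)

    # M-step: the updates derived above (Equation (6.9)).
    N_k = gamma.sum(axis=0)                     # N_k = sum_n gamma_nk
    w_new = N_k / N                             # w_k = N_k / N
    mu_new = (gamma.T @ X) / N_k[:, None]       # mu_k = (1/N_k) sum_n gamma_nk x_n
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        diff = X - mu_new[k]
        # Sigma_k = (1/N_k) sum_n gamma_nk (x_n - mu_k)(x_n - mu_k)^t
        Sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / N_k[k]
        Sigma_new[k] += 1e-6 * np.eye(d)        # ridge to keep Sigma_k invertible
    return w_new, mu_new, Sigma_new
```

Iterating gmm_em_step from any reasonable initialization never decreases the log-likelihood (by Theorem C.2, up to the small ridge perturbation).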
Commentary. The EM algorithm is a remarkable example of a recurring
theme in learning. We want to learn a model Θ that explains the data. We
start with a guess Θ̂ that is wrong. We use this wrong model to estimate
some other quantities of the world (the bump memberships in our example).
We now learn a new model which is better than the old model at explaining
the combined data plus the inaccurate estimates of the other quantities.
Miraculously, by doing this, we bootstrap ourselves up to a better model, one that is better at explaining the data. This theme reappears in Reinforcement Learning as well. If we didn't know better, it would seem like a free lunch.
C.2 Problems
Problem C.1. Consider the general case of Example C.1. The sample
has N balls, with N_g green, N_b blue and N_r red (N_g + N_b + N_r = N). Show
that the log-likelihood of the incomplete data is

    N_g ln(1 − µ_1) + N_b ln(1 − µ_2) + N_r ln(µ_1 + µ_2) − N ln 2.   (C.8)

What are the maximum likelihood estimates for µ_1, µ_2? (Be careful with the
cases N_g > N/2 and N_b > N/2.)
Problem C.2. Consider the general case of Example C.1 as in Problem C.1,
with N_g green, N_b blue and N_r red balls.
(a) Suppose your starting estimates are µ̂_1, µ̂_2. For a red ball, what are
p_1 = P[Bag 1 | µ̂_1, µ̂_2] and p_2 = P[Bag 2 | µ̂_1, µ̂_2]?
(b) Let N_r1 and N_r2 (with N_r = N_r1 + N_r2) be the number of red balls from
Bags 1 and 2 respectively. Show that the log-likelihood of the complete
data is

    N_g ln(1 − µ_1) + N_b ln(1 − µ_2) + N_r1 ln(µ_1) + N_r2 ln(µ_2) − N ln 2.

(c) Compute the function Q_t(µ_1, µ_2) by taking the expectation of the log-likelihood
of the complete data. Show that

    Q_t(µ_1, µ_2) = N_g ln(1 − µ_1) + N_b ln(1 − µ_2) + p_1 N_r ln(µ_1) + p_2 N_r ln(µ_2) − N ln 2.

(d) Maximize Q_t(µ_1, µ_2) to obtain the EM update.
(e) Show that repeated EM iteration will ultimately converge to

    µ̂_1 = (N − 2N_g)/N,   µ̂_2 = (N − 2N_b)/N.
Problem C.3. A sequence of N balls, X = x_1, . . . , x_N, is drawn iid as
follows. There are 2 bags. Bag 1 contains only red balls and Bag 2 contains
red and blue balls. A fraction π in this second bag are red. A bag is picked
randomly with probability 1/2 and one of the balls is picked randomly from that
bag; x_n = 1 if ball n is red and 0 if it is blue. You are given N and the number
of red balls N_r = ∑_{n=1}^{N} x_n.
(a) (i) Show that the likelihood P[X | π, N] is

    P[X | π, N] = ∏_{n=1}^{N} ((1 + π)/2)^{x_n} ((1 − π)/2)^{1−x_n}.

Maximize to obtain an estimate for π (be careful with N_r < N/2).
(ii) For N_r = 600 and N = 1000, what is your estimate of π?
(b) Maximizing the likelihood is tractable for this simple problem. Now develop
an EM iterative approach.
(i) What is an appropriate hidden/unmeasured variable J = j_1, . . . , j_N?
(ii) Give a formula for the likelihood of the full data, P[X, J | π, N].
(iii) If at step t your estimate is π_t, for the expectation step compute
Q_t(π) = E_{J|X,π_t}[ln P[X, J | π, N]] and show that, up to an additive constant,

    Q_t(π) = (π_t/(1 + π_t)) N_r ln π + (N − N_r) ln(1 − π),

and hence show that the EM update is given by

    π_{t+1} = π_t N_r / ( π_t N_r + (1 + π_t)(N − N_r) ).

What are the limit points when N_r ≥ N/2 and N_r < N/2?
(iv) Plot π_t versus t, starting from π_0 = 0.9 and π_0 = 0.2, with N_r = 600, N = 1000.
(v) The values of the hidden variables can often be useful. After convergence
of the EM, how could you get estimates of the hidden variables?
Problem C.4. [EM for Supervised Learning] We wish to learn a
function f(x) which predicts the temperature as a function of the time x of
the day, x ∈ [0, 1]. We believe that the temperature has a linear dependence
on time, so we model f with the linear hypotheses h(x) = w_0 + w_1 x.
We have N data points y_1, . . . , y_N, the temperature measurements on different
(independent) days, where y_n = f(x_n) + ε_n and ε_n ∼ N(0, 1) is iid zero mean
Gaussian noise with unit variance. The problem is that we do not know the time
x_n at which these measurements were made. Assume that each temperature
measurement was taken at some random time in the day, chosen uniformly on
[0, 1].
(a) Show that the log-likelihood for the weights w is

    ln P[y | w] = ∑_{n=1}^{N} ln ∫_0^1 (1/√(2π)) exp( −(y_n − w_0 − w_1 x)²/2 ) dx.

(b) What is the natural hidden variable J = j_1, . . . , j_N?
(c) Compute the log-likelihood for the complete data, ln P[y, J | w].
(d) Let γ_n(x | w) = P(x_n = x | y_n, w). Show that, for x ∈ [0, 1],

    γ_n(x | w) = exp( −(y_n − w_0 − w_1 x)²/2 ) / ∫_0^1 exp( −(y_n − w_0 − w_1 x′)²/2 ) dx′.

Hence, compute Q_t(w).
(e) Let α_n = E_{γ_n(x|w_t)}[x] and β_n = E_{γ_n(x|w_t)}[x²] (expectations taken with respect to the distribution γ_n(x | w_t)). Show that the EM updates are

    w_1(t + 1) = [ (1/N) ∑_{n=1}^{N} (y_n − ȳ)(α_n − ᾱ) ] / ( β̄ − ᾱ² );
    w_0(t + 1) = ȳ − w_1(t + 1) ᾱ;

where the bar denotes averaging (e.g. ᾱ = (1/N) ∑_{n=1}^{N} α_n) and w(t) are the
weights at iteration t.
(f) What happens if the temperature measurement is not at a uniformly
random time, but at a time distributed according to an unknown P(x)?
You have to maintain an estimate P_t(x). Now, show that

    γ_n(x | w) = P_t(x) exp( −(y_n − w_0 − w_1 x)²/2 ) / ∫_0^1 P_t(x′) exp( −(y_n − w_0 − w_1 x′)²/2 ) dx′,

and that the updates in (e) are unchanged, except that they use this new
γ_n(x | w). Show that the update for P_t is given by

    P_{t+1}(x) = (1/N) ∑_{n=1}^{N} γ_n(x | w_t).

What happens if you tried to maximize the log-likelihood for the incomplete
data, instead of using the EM approach?