e-Appendix C
The EM Algorithm
The EM algorithm is a convenient heuristic for likelihood maximization. The
EM algorithm will never decrease the likelihood. Our discussion will focus
on mixture models, the GMM being a special case, even though the EM
algorithm applies to more general settings.
Let P_k(x; θ_k) be a density for k = 1, . . . , K, where θ_k are the parameters
specifying P_k. We will refer to each P_k as a bump. In the GMM setting,
all the P_k are Gaussians, and θ_k = {µ_k, Σ_k} (the mean vector and covariance
matrix for each bump). A mixture model is a weighted sum of these K bumps,

    P(x; Θ) = ∑_{k=1}^{K} w_k P_k(x; θ_k),

where the weights satisfy w_k ≥ 0 and ∑_{k=1}^{K} w_k = 1, and we have collected all
the parameters into a single grand parameter, Θ = {w_1, . . . , w_K; θ_1, . . . , θ_K}.
Intuitively, to generate a random point x, you first pick a bump according to
the probabilities w_1, . . . , w_K. Suppose you pick bump k. You then generate a
random point from the bump density P_k.
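To make the generative process concrete, here is a minimal sketch in Python with numpy for a one-dimensional GMM; the weights, means, and spreads are illustrative values of our choosing, not anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D GMM: weights w_k, means mu_k, standard deviations sigma_k.
w = np.array([0.5, 0.3, 0.2])      # w_k >= 0 and sum to 1
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

def sample_mixture(n):
    # First pick a bump k according to the probabilities w_1, ..., w_K ...
    k = rng.choice(len(w), size=n, p=w)
    # ... then generate a random point from the chosen bump's density P_k.
    return rng.normal(mu[k], sigma[k])

X = sample_mixture(1000)
```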
Given data X = x_1, . . . , x_N generated independently, we wish to estimate
the parameters of the mixture which maximize the log-likelihood,

    ln P(X | Θ) = ln ∏_{n=1}^{N} P(x_n | Θ)
                = ln ∏_{n=1}^{N} ∑_{k=1}^{K} w_k P_k(x_n; θ_k)
                = ∑_{n=1}^{N} ln ∑_{k=1}^{K} w_k P_k(x_n; θ_k).   (C.1)
In the first step above, P(X | Θ) is a product because the data are independent. Note that X is known and fixed. What is not known is which particular bump
was used to generate data point x_n. Denote by j_n ∈ {1, . . . , K} the bump
that generated x_n (we say x_n is a 'member' of bump j_n). Collect all bump
memberships into a set J = {j_1, . . . , j_N}. If we knew which data belonged
to which bump, we could estimate each bump density's parameters separately,
using only the data belonging to that bump. We call (X, J) the complete
data. If we know the complete data, we can easily optimize the log-likelihood.
We call X the incomplete data. Though X is all we can measure, it is still
called the 'incomplete' data because it does not contain enough information
to easily determine the optimal parameters Θ. Let's
see mathematically how knowing the complete data helps us.
To get the likelihood of the complete data, we need the joint probability
P[x_n, j_n | Θ]. Using Bayes' theorem,

    P[x_n, j_n | Θ] = P[j_n | Θ] P[x_n | j_n, Θ] = w_{j_n} P_{j_n}(x_n; θ_{j_n}).

Since the data are independent,

    P(X, J | Θ) = ∏_{n=1}^{N} P[x_n, j_n | Θ] = ∏_{n=1}^{N} w_{j_n} P_{j_n}(x_n; θ_{j_n}).
Let N_k be the number of occurrences of bump k in J, and let X_k be those
data points corresponding to bump k, so X_k = {x_n ∈ X : j_n = k}. We
compute the log-likelihood for the complete data as follows:

    ln P(X, J | Θ) = ∑_{n=1}^{N} ln w_{j_n} + ∑_{n=1}^{N} ln P_{j_n}(x_n; θ_{j_n})
                   = ∑_{k=1}^{K} N_k ln w_k + ∑_{k=1}^{K} ∑_{x_n ∈ X_k} ln P_k(x_n; θ_k)
                   = ∑_{k=1}^{K} N_k ln w_k + ∑_{k=1}^{K} L_k(X_k; θ_k),   (C.2)

where L_k(X_k; θ_k) = ∑_{x_n ∈ X_k} ln P_k(x_n; θ_k).
There are two simplifications which occur in (C.2) from knowing the complete
data (X, J). The w_k (in the first term) are separated from the θ_k (in the second
term); and, the second term is the sum of K non-interacting log-likelihoods
L_k(X_k; θ_k), each corresponding to the data belonging to X_k and only involving bump
k's parameters θ_k. Each log-likelihood L_k can be optimized independently of
the others. For many choices of P_k, L_k(X_k; θ_k) can be optimized analytically, even though the log-likelihood for the incomplete data in (C.1) is intractable. The next exercise asks you to analytically maximize (C.2) for the GMM.
Exercise C.1. Maximize (C.2) analytically for the GMM. [Hint for part (b): use ∂/∂S (zᵗ S z) = zzᵗ and ∂/∂S ln |S| = S⁻¹.]
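As a quick numerical sanity check on the decomposition in (C.2), the following sketch (Python, using scipy's Gaussian log-density with made-up one-dimensional bumps and memberships) confirms that the per-point sum and the grouped form agree:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
w = np.array([0.6, 0.4])                 # bump weights w_k (illustrative)
mu = np.array([0.0, 4.0])                # bump means
sd = np.array([1.0, 1.5])                # bump standard deviations
j = rng.choice(2, size=50, p=w)          # bump memberships J (known here)
x = rng.normal(mu[j], sd[j])             # the complete data (X, J)

# Per-point form: sum_n [ ln w_{j_n} + ln P_{j_n}(x_n; theta_{j_n}) ]
per_point = np.sum(np.log(w[j]) + norm.logpdf(x, mu[j], sd[j]))

# Grouped form (C.2): sum_k N_k ln w_k + sum_k L_k(X_k; theta_k)
grouped = sum((j == k).sum() * np.log(w[k])
              + norm.logpdf(x[j == k], mu[k], sd[k]).sum()
              for k in range(2))

assert np.isclose(per_point, grouped)    # the two forms agree
```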
In reality, we do not have access to J, and hence it is called a 'hidden variable'.
So what can we do now? We need a heuristic to maximize the likelihood
in Equation (C.1). One approach is to guess J and maximize the resulting
complete likelihood in Equation (C.2). This almost works. Instead of maximizing
the complete likelihood for a single guess, we consider an average of
the complete likelihood over all possible guesses. Specifically, we treat J as
an unknown random variable and maximize the expected value (with respect
to J) of the complete log-likelihood in Equation (C.2). This expected value
is as easy to maximize as the complete likelihood. The mathematical implementation
of this idea will lead us to the EM Algorithm, which stands for
Expectation-Maximization Algorithm. Let's start with a simpler example.
Example C.1. You have two opaque bags. Bag 1 has red and green balls,
with µ_1 being the fraction of red balls. Bag 2 has red and blue balls, with µ_2
being the fraction of red. You pick four balls in independent trials as follows.
First pick one of the bags at random, each with probability 1/2; then, pick a ball
at random from the bag. Here is the sample of four balls you got:
one green, one red and two blue. The task is to estimate µ_1 and µ_2. It would
be much easier if we knew which bag each ball came from.

Here is one way to reason. Half the balls will come from Bag 1 and the
other half from Bag 2. The blue balls come from Bag 2 (that's already Bag 2's budget of balls), so the other two, the green and the red, should come from Bag 1. Using in-sample estimates, µ̂_1 = 1/2 and µ̂_2 = 0. We can get these same estimates
using a maximum likelihood argument. The log-likelihood of the data is

    ln(1 − µ_1) + 2 ln(1 − µ_2) + ln(µ_1 + µ_2) − 4 ln 2.   (C.3)
The reader should explicitly maximize the above expression with respect to
µ_1, µ_2 ∈ [0, 1] to obtain the estimates µ̂_1 = 1/2 and µ̂_2 = 0. In our data set of four
balls, there is a red one, so it seems a little counter-intuitive that we would
estimate µ̂_2 = 0, for isn't there a positive probability that the red ball came
from Bag 2? Nevertheless, µ̂_2 = 0 is the estimate that maximizes the probability
of generating the data. You are uneasy with this, and rightly so, because
we put all our eggs into this single 'point' estimate; a very unnatural thing
given that any point estimate has infinitesimal probability of being correct.
Nevertheless, that is the maximum likelihood method, and we are following it.

Here is another way to reason. 'Half' of each red ball came from Bag 1
and the other 'half' from Bag 2. So, µ̂_1 = (1/2)/(1 + 1/2) = 1/3, because 1/2 a red ball
came from Bag 1 out of a total of 1 + 1/2 balls. Similarly, µ̂_2 = (1/2)/(2 + 1/2) = 1/5.
This reasoning is wrong because it does not correctly use the knowledge that
the ball is red. For example, as we just reasoned, µ̂_1 = 1/3 and µ̂_2 = 1/5. But, if
these estimates are correct, and indeed µ̂_1 > µ̂_2, then a red ball is more likely
to come from Bag 1, so more than half of it should come from Bag 1. This
contradicts the original assumption that led us to these estimates.

Now, let's see how expectation-maximization solves this problem. The reasoning
is similar to our false start above; it just adds iteration till consistency.
We begin by considering the two cases for the red ball. Either it is from Bag 1
or Bag 2. We can compute the log-likelihood for each of these two cases:
    ln(1 − µ_1) + ln(µ_1) + 2 ln(1 − µ_2) − 4 ln 2   (Bag 1);
    ln(1 − µ_1) + ln(µ_2) + 2 ln(1 − µ_2) − 4 ln 2   (Bag 2).
Suppose we have estimates µ̂_1 and µ̂_2. Using Bayes' theorem, we can compute
p_1 = P[Bag 1 | µ̂_1, µ̂_2] and p_2 = P[Bag 2 | µ̂_1, µ̂_2]. The reader can verify that

    p_1 = µ̂_1/(µ̂_1 + µ̂_2),   p_2 = µ̂_2/(µ̂_1 + µ̂_2).
Now comes the expectation step. Compute the expected log-likelihood using
p_1 and p_2:

    E[log-likelihood] = ln(1 − µ_1) + p_1 ln(µ_1) + p_2 ln(µ_2) + 2 ln(1 − µ_2) − 4 ln 2.   (C.4)
Next comes the maximization step. Treating p_1, p_2 as constants, maximize
the expected log-likelihood with respect to µ_1, µ_2 and update µ̂_1, µ̂_2 to these
optimal values. Notice that the log-likelihood in (C.3) has an interaction
term ln(µ_1 + µ_2) which complicates the maximization. In the expected log-likelihood
(C.4), µ_1 and µ_2 are decoupled, and so the maximization can be
implemented separately for each variable. The reader can verify that maximizing
the expected log-likelihood gives the updates:

    µ̂_1 ← p_1/(1 + p_1) = µ̂_1/(2µ̂_1 + µ̂_2)   and   µ̂_2 ← p_2/(2 + p_2) = µ̂_2/(2µ̂_1 + 3µ̂_2).
The full algorithm just iterates this update process with the new estimates.
Let's see what happens if we start (arbitrarily) with estimates µ̂_1 = µ̂_2 = 1/2.
The result of the first iteration, µ̂_1 = 1/3 and µ̂_2 = 1/5, is exactly the
estimate from our earlier faulty reasoning: when µ̂_1 = µ̂_2, our faulty reasoning
matches the EM step. If we continued iterating, the estimates would converge
to µ̂_1 = 1/2 and µ̂_2 = 0.
It's miraculous that by maximizing an expected log-likelihood using a guess for
the parameters, we end up converging to the true maximum likelihood solution.
Why is this useful? Because the maximizations for µ_1 and µ_2 are decoupled.
We trade a maximization of a complicated likelihood of the incomplete data
for a bunch of simpler maximizations that we iterate.
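A minimal sketch of this iteration in Python, using the two updates derived above, reproduces this behavior:

```python
def em_two_bags(mu1=0.5, mu2=0.5, iters=100):
    # Iterate the coupled updates derived for Example C.1.
    for _ in range(iters):
        mu1, mu2 = mu1 / (2*mu1 + mu2), mu2 / (2*mu1 + 3*mu2)
    return mu1, mu2

print(em_two_bags(iters=1))    # (1/3, 1/5): the 'faulty reasoning' estimates
print(em_two_bags(iters=100))  # slowly approaches (1/2, 0), the ML solution
```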
C.1 Derivation of the EM Algorithm
We now derive the EM strategy and show that it will always improve the
likelihood of the incomplete data.

    Maximizing an expected log-likelihood of the complete data increases
    the likelihood of the incomplete data.

What is surprising is that to compute the expected log-likelihood, we use a
guess, since we don't know the best model. So it is really a guess for the
expected log-likelihood that one maximizes, and this increases the likelihood.
Let Θ′ be any set of parameters, and define P(J | X, Θ′) as the conditional
probability distribution for J given the data and assuming that Θ′ is the actual
probability model for the data. The probability P(J | X, Θ′) is well defined,
even if Θ′ is not the probability model which generated the data. We will see
how to compute P(J | X, Θ′) soon, but for the moment assume that it is always
positive (which means that every possible assignment of x_1, . . . , x_N to bumps
j_1, . . . , j_N has non-zero probability under Θ′). This will always be the case
unless some of the P_k have bounded support or Θ′ is some degenerate mixture. The following derivation establishes a connection between the log-likelihood
for the incomplete data and the expectation (over J) of the log-likelihood for
the complete data.
    L(Θ)  =  ln P(X | Θ)
     (a)  =  ln ∑_J P(X, J | Θ)
     (b)  =  ln ∑_J P(J | X, Θ′) · P(X, J | Θ) / P(J | X, Θ′)
     (c)  =  ln ∑_J P(J | X, Θ′) · P(X, J | Θ) P(X | Θ′) / P(X, J | Θ′)
     (d)  ≥  ∑_J P(J | X, Θ′) ln [ P(X, J | Θ) P(X | Θ′) / P(X, J | Θ′) ]
     (e)  =  L(Θ′) + E_{J|X,Θ′}[ln P(X, J | Θ)] − E_{J|X,Θ′}[ln P(X, J | Θ′)]
     (f)  =  L(Θ′) + Q(Θ | X, Θ′) − Q(Θ′ | X, Θ′).   (C.5)
(a) follows by the law of total probability; (b) is justified because P(J | X, Θ′) is
positive; (c) follows from Bayes' theorem; (d) follows because the summation
∑_J (·) P(J | X, Θ′) is an expectation using the probabilities P(J | X, Θ′), and
by Jensen's inequality ln E[·] ≥ E[ln(·)] because ln(x) is concave; (e) follows
because ln P(X | Θ′) is independent of J, and so

    ∑_J P(J | X, Θ′) ln P(X | Θ′) = ln P(X | Θ′) ∑_J P(J | X, Θ′) = ln P(X | Θ′) · 1;
finally, in (f), we have defined the function

    Q(Θ | X, Θ′) = E_{J|X,Θ′}[ln P(X, J | Θ)].

The function Q is a function of Θ, though its definition depends on the distribution
of J, which in turn depends on the incomplete data X and the model Θ′.
We have proved the following result.
Theorem C.2. If Q(Θ | X, Θ′) > Q(Θ′ | X, Θ′), then L(Θ) > L(Θ′).
In words, fix Θ′ and compute the 'posterior' distribution of J conditioned on
the data X with parameters Θ′. Now, for the parameters Θ, compute the
expected log-likelihood of the complete data (X, J), where the expectation is
taken with respect to this posterior distribution for J that we just obtained.
This distribution for J is fixed, depending on X, Θ′, but it does not depend
on Θ. Find Θ* that maximizes this expected log-likelihood, and you are guaranteed to improve the actual likelihood. This theorem leads naturally to the EM algorithm that follows.
EM Algorithm
1: Initialize Θ_0 at t = 0.
2: At step t let the parameter estimate be Θ_t.
3: [Expectation] For X, Θ_t, compute the function Q_t(Θ):

       Q_t(Θ) = E_{J|X,Θ_t}[ln P(X, J | Θ)],

   which is the expected log-likelihood for the complete data.
4: [Maximization] Update Θ to maximize Q_t(Θ):

       Θ_{t+1} = argmax_Θ Q_t(Θ).

5: Increment t → t + 1 and repeat steps 3, 4 till convergence.
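In code, the algorithm is a single loop alternating the two steps. Below is a minimal generic sketch in Python; e_step and m_step are hypothetical problem-specific callables (made concrete for mixture models in what follows), and the parameters are assumed packed into a numpy array so that convergence can be tested numerically.

```python
import numpy as np

def em(X, theta0, e_step, m_step, max_iters=100, tol=1e-8):
    # Generic EM loop. e_step(X, theta) returns the posterior P(J | X, theta_t),
    # which defines Q_t; m_step(X, posterior) returns argmax_theta Q_t(theta).
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        posterior = e_step(X, theta)        # [Expectation] step 3
        theta_new = m_step(X, posterior)    # [Maximization] step 4
        if np.allclose(theta, theta_new, atol=tol):
            break                           # estimates stopped moving
        theta = theta_new
    return theta
```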
In the algorithm, we need to compute Q_t(Θ), which amounts to computing an
expectation with respect to P(J | X, Θ_t). We illustrate this process with our
mixture model.

Recall that J is the vector of bump memberships. Since the data are
independent, to compute P(J | X, Θ_t) we can compute this 'posterior' for each
data point and then take the product. We need γ_nk = P(j_n = k | x_n, Θ_t), the
probability that data point x_n came from bump k. By Bayes' theorem,

    γ_nk = P(j_n = k | x_n, Θ_t) = P(x_n, k | Θ_t)/P(x_n | Θ_t)
         = ŵ_k P_k(x_n | θ̂_k) / ∑_{ℓ=1}^{K} ŵ_ℓ P_ℓ(x_n | θ̂_ℓ),

where Θ_t = {ŵ_1, . . . , ŵ_K; θ̂_1, . . . , θ̂_K} and P(J | X, Θ_t) = ∏_{n=1}^{N} γ_{n j_n}. We can
now compute Q_t(Θ),

    Q_t(Θ) = E_J[ln P(X, J | Θ)],

where the expectation is with respect to the ('fictitious') probabilities γ_nk
that determine the distribution of the random variable J. These probabilities
depend on X and Θ_t. Let N̂_k (a random variable) be the number of occurrences
of bump k in the random variable J; similarly, let X̂_k be the random set
containing the data points of bump k. From Equation (C.2),
    Q_t(Θ) = ∑_{k=1}^{K} E_J[N̂_k] ln w_k + ∑_{k=1}^{K} E_J[ ∑_{x_n ∈ X̂_k} ln P(x_n; θ_k) ]
       (a) = ∑_{k=1}^{K} E_J[ ∑_{n=1}^{N} z_nk ] ln w_k + ∑_{k=1}^{K} E_J[ ∑_{n=1}^{N} z_nk ln P(x_n; θ_k) ]
       (b) = ∑_{k=1}^{K} N_k ln w_k + ∑_{k=1}^{K} ∑_{n=1}^{N} γ_nk ln P(x_n; θ_k),
where Θ = {w_1, . . . , w_K; θ_1, . . . , θ_K}. In (a), we introduced an indicator random
variable z_nk = [x_n ∈ X̂_k], which is 1 if x_n is from bump k and 0 otherwise;
in (b), we defined

    N_k = E_J[N̂_k] = E_J[ ∑_{n=1}^{N} z_nk ] = ∑_{n=1}^{N} γ_nk,
where we used E[z_nk] = γ_nk. Now that we have an explicit functional form
for Q_t(Θ), we can perform the maximization step. Observe that the bump
parameters θ_k ∈ Θ occur in independent terms, and so can be optimized
separately. As for the first term, observe that
    (1/N) ∑_{k=1}^{K} N_k ln w_k = ∑_{k=1}^{K} (N_k/N) ln ( w_k / (N_k/N) ) + ∑_{k=1}^{K} (N_k/N) ln (N_k/N)
                                 ≤ ∑_{k=1}^{K} (N_k/N) ln (N_k/N),
where the last inequality follows from Jensen's inequality and the concavity of
the logarithm, which implies

    ∑_{k=1}^{K} (N_k/N) ln ( w_k / (N_k/N) ) ≤ ln ( ∑_{k=1}^{K} (N_k/N) · w_k / (N_k/N) ) = ln ∑_{k=1}^{K} w_k = ln(1) = 0.
Equality holds when w_k = N_k/N. Maximizing Q_t(w_1, . . . , w_K, θ_1, . . . , θ_K),
therefore, gives the following updates:

    w*_k = N_k/N = (1/N) ∑_{n=1}^{N} γ_nk;   (C.6)

    θ*_k = argmax_{θ_k} ∑_{n=1}^{N} γ_nk ln P(x_n; θ_k).   (C.7)
The update to get w*_k can be viewed as follows. A fraction γ_nk of data point x_n
belongs to bump k. Thus, N_k = ∑_n γ_nk is the total number of data points
belonging to bump k. Similarly, the update to get θ*_k can be viewed as follows.
The 'likelihood' for the parameter θ_k of bump k is just the weighted sum of the
likelihoods of each point, weighted by the fraction of the point that belongs to
bump k. This intuitive interpretation of the update is another reason for the
popularity of the EM algorithm.
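For instance, if the responsibilities from the expectation step are stored as an N × K numpy matrix, update (C.6) is one line; a minimal sketch:

```python
import numpy as np

def update_weights(gamma):
    # gamma[n, k] = gamma_nk, the fraction of point x_n belonging to bump k.
    N_k = gamma.sum(axis=0)        # N_k = sum_n gamma_nk, per update (C.6)
    return N_k / gamma.shape[0]    # w_k = N_k / N
```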
The EM algorithm for mixture density estimation is remarkably simple
once we get past the machinery used to set it up: we have an analytic update
for the weights w_k, and K separate optimizations for each bump parameter θ_k.
The miracle is that these simple updates are guaranteed to improve the log-likelihood (from Theorem C.2). There are other ways to maximize the likelihood, for example using gradient and Hessian based iterative optimization techniques. However, the EM algorithm is simpler and works well in practice.
Example. Let's derive the EM update for the GMM with current parameter
estimate Θ_t = {ŵ_1, . . . , ŵ_K; µ̂_1, . . . , µ̂_K; Σ̂_1, . . . , Σ̂_K}. Let S_k = (Σ̂_k)⁻¹.
The posterior bump probabilities γ_nk for the parameters Θ_t are:

    γ_nk = ŵ_k P(x_n | µ̂_k, S_k) / ∑_{ℓ=1}^{K} ŵ_ℓ P(x_n | µ̂_ℓ, S_ℓ),

where P(x | µ, S) = (2π)^{−d/2} |S|^{1/2} exp( −(1/2)(x − µ)ᵗ S (x − µ) ). The w_k update
is immediate from (C.6), which matches Equation (6.9). Since

    ln P(x | µ, S) = −(1/2)(x − µ)ᵗ S (x − µ) + (1/2) ln |S| − (d/2) ln(2π),

to get the µ_k and S_k updates using (C.7), we need to minimize

    ∑_{n=1}^{N} γ_nk ( (x_n − µ_k)ᵗ S_k (x_n − µ_k) − ln |S_k| ).
Setting the derivative with respect to µ_k to 0 gives

    2 S_k ∑_{n=1}^{N} γ_nk (x_n − µ_k) = 0.

Since S_k is invertible, µ_k ∑_{n=1}^{N} γ_nk = ∑_{n=1}^{N} γ_nk x_n, and since N_k = ∑_{n=1}^{N} γ_nk,
we obtain the update in (6.9),

    µ_k = (1/N_k) ∑_{n=1}^{N} γ_nk x_n.
To take the derivative with respect to S_k, we use the identities in the hint of
Exercise C.1(b). Setting the derivative with respect to S_k to zero gives

    ∑_{n=1}^{N} γ_nk (x_n − µ_k)(x_n − µ_k)ᵗ − S_k⁻¹ ∑_{n=1}^{N} γ_nk = 0,

or

    S_k⁻¹ = (1/N_k) ∑_{n=1}^{N} γ_nk (x_n − µ_k)(x_n − µ_k)ᵗ.

Since S_k⁻¹ = Σ_k, we recover the update in (6.9).
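Collecting the pieces, here is a compact sketch of one full EM step for the GMM in Python with numpy. It is one possible implementation of the updates just derived, not the book's own code; the small ridge added to each covariance is a practical safeguard to keep Σ_k invertible rather than part of the derivation.

```python
import numpy as np

def gmm_em_step(X, w, mu, Sigma):
    # One EM step for a GMM. Shapes: X (N, d); w (K,); mu (K, d); Sigma (K, d, d).
    N, d = X.shape
    K = len(w)

    # E-step: responsibilities gamma_nk proportional to w_k P(x_n | mu_k, Sigma_k).
    gamma = np.empty((N, K))
    for k in range(K):
        diff = X - mu[k]
        quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma[k]), diff)
        logdet = np.linalg.slogdet(Sigma[k])[1]
        gamma[:, k] = w[k] * np.exp(-0.5 * (quad + logdet + d * np.log(2 * np.pi)))
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize over k (Bayes' theorem)

    # M-step: the updates derived above (Equation (6.9)).
    N_k = gamma.sum(axis=0)                     # N_k = sum_n gamma_nk
    w_new = N_k / N                             # w_k = N_k / N
    mu_new = (gamma.T @ X) / N_k[:, None]       # mu_k = (1/N_k) sum_n gamma_nk x_n
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        diff = X - mu_new[k]
        # Sigma_k = (1/N_k) sum_n gamma_nk (x_n - mu_k)(x_n - mu_k)^t
        Sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / N_k[k]
        Sigma_new[k] += 1e-6 * np.eye(d)        # ridge to keep Sigma_k invertible
    return w_new, mu_new, Sigma_new
```

Iterating gmm_em_step from any reasonable initialization never decreases the log-likelihood (by Theorem C.2, up to the small ridge perturbation).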
Commentary. The EM algorithm is a remarkable example of a recurring
theme in learning. We want to learn a model Θ that explains the data. We
start with a guess Θ̂ that is wrong. We use this wrong model to estimate
some other quantities of the world (the bump memberships in our example).
We now learn a new model which is better than the old model at explaining
the combined data plus the inaccurate estimates of the other quantities.
Miraculously, by doing this, we bootstrap ourselves up to a better model, one that is better at explaining the data. This theme reappears in Reinforcement Learning as well. If we didn't know better, it would seem like a free lunch.
C.2 Problems
Problem C.1. Consider the general case of Example C.1. The sample
has N balls, with N_g green, N_b blue and N_r red (N_g + N_b + N_r = N). Show
that the log-likelihood of the incomplete data is

    N_g ln(1 − µ_1) + N_b ln(1 − µ_2) + N_r ln(µ_1 + µ_2) − N ln 2.   (C.8)

What are the maximum likelihood estimates for µ_1, µ_2? (Be careful with the
cases N_g > N/2 and N_b > N/2.)
Problem C.2. Consider the general case of Example C.1 as in Problem C.1,
with N_g green, N_b blue and N_r red balls.
(a) Suppose your starting estimates are µ̂_1, µ̂_2. For a red ball, what are
p_1 = P[Bag 1 | µ̂_1, µ̂_2] and p_2 = P[Bag 2 | µ̂_1, µ̂_2]?
(b) Let N_r1 and N_r2 (with N_r = N_r1 + N_r2) be the number of red balls from
Bags 1 and 2 respectively. Show that the log-likelihood of the complete
data is

    N_g ln(1 − µ_1) + N_b ln(1 − µ_2) + N_r1 ln(µ_1) + N_r2 ln(µ_2) − N ln 2.

(c) Compute the function Q_t(µ_1, µ_2) by taking the expectation of the log-likelihood
of the complete data. Show that

    Q_t(µ_1, µ_2) = N_g ln(1 − µ_1) + N_b ln(1 − µ_2) + p_1 N_r ln(µ_1) + p_2 N_r ln(µ_2) − N ln 2.

(d) Maximize Q_t(µ_1, µ_2) to obtain the EM update.
(e) Show that repeated EM iteration will ultimately converge to

    µ̂_1 = (N − 2N_g)/N,   µ̂_2 = (N − 2N_b)/N.
Problem C.3. A sequence of N balls, X = x_1, . . . , x_N, is drawn iid as
follows. There are 2 bags. Bag 1 contains only red balls and Bag 2 contains
red and blue balls. A fraction π in this second bag are red. A bag is picked
randomly with probability 1/2 and one of the balls is picked randomly from that
bag; x_n = 1 if ball n is red and 0 if it is blue. You are given N and the number
of red balls N_r = ∑_{n=1}^{N} x_n.
(a) (i) Show that the likelihood P[X | π, N] is

    P[X | π, N] = ∏_{n=1}^{N} ((1 + π)/2)^{x_n} ((1 − π)/2)^{1−x_n}.

Maximize to obtain an estimate for π (be careful with N_r < N/2).
(ii) For N_r = 600 and N = 1000, what is your estimate of π?
(b) Maximizing the likelihood is tractable for this simple problem. Now develop
an EM iterative approach.
(i) What is an appropriate hidden/unmeasured variable J = j_1, . . . , j_N?
(ii) Give a formula for the likelihood of the full data, P[X, J | π, N].
(iii) If at step t your estimate is π_t, for the expectation step compute
Q_t(π) = E_{J|X,π_t}[ln P[X, J | π, N]] and show that, up to an additive constant,

    Q_t(π) = (π_t/(1 + π_t)) N_r ln π + (N − N_r) ln(1 − π),

and hence show that the EM update is given by

    π_{t+1} = π_t N_r / ( π_t N_r + (1 + π_t)(N − N_r) ).

What are the limit points when N_r ≥ N/2 and N_r < N/2?
(iv) Plot π_t versus t, starting from π_0 = 0.9 and π_0 = 0.2, with N_r = 600, N = 1000.
(v) The values of the hidden variables can often be useful. After convergence
of the EM, how could you get estimates of the hidden variables?
Problem C.4. [EM for Supervised Learning] We wish to learn a
function f(x) which predicts the temperature as a function of the time x of
the day, x ∈ [0, 1]. We believe that the temperature has a linear dependence
on time, so we model f with the linear hypotheses h(x) = w_0 + w_1 x.
We have N data points y_1, . . . , y_N, the temperature measurements on different
(independent) days, where y_n = f(x_n) + ε_n and ε_n ∼ N(0, 1) is iid zero mean
Gaussian noise with unit variance. The problem is that we do not know the time
x_n at which these measurements were made. Assume that each temperature
measurement was taken at some random time in the day, chosen uniformly on
[0, 1].
(a) Show that the log-likelihood for the weights w is

    ln P[y | w] = ∑_{n=1}^{N} ln ∫_0^1 (1/√(2π)) exp( −(y_n − w_0 − w_1 x)²/2 ) dx.

(b) What is the natural hidden variable J = j_1, . . . , j_N?
(c) Compute the log-likelihood for the complete data, ln P[y, J | w].
(d) Let γ_n(x | w) = P(x_n = x | y_n, w). Show that, for x ∈ [0, 1],

    γ_n(x | w) = exp( −(y_n − w_0 − w_1 x)²/2 ) / ∫_0^1 exp( −(y_n − w_0 − w_1 x′)²/2 ) dx′.

Hence, compute Q_t(w).
(e) Let α_n = E_{γ_n(x|w_t)}[x] and β_n = E_{γ_n(x|w_t)}[x²] (expectations taken with respect to the distribution γ_n(x | w_t)). Show that the EM updates are

    w_1(t + 1) = [ (1/N) ∑_{n=1}^{N} (y_n − ȳ)(α_n − ᾱ) ] / ( β̄ − ᾱ² );
    w_0(t + 1) = ȳ − w_1(t + 1) ᾱ;

where the bar denotes averaging (e.g. ᾱ = (1/N) ∑_{n=1}^{N} α_n) and w(t) are the
weights at iteration t.
(f) What happens if the temperature measurement is not at a uniformly
random time, but at a time distributed according to an unknown P(x)?
You have to maintain an estimate P_t(x). Now, show that

    γ_n(x | w) = P_t(x) exp( −(y_n − w_0 − w_1 x)²/2 ) / ∫_0^1 P_t(x′) exp( −(y_n − w_0 − w_1 x′)²/2 ) dx′,

and that the updates in (e) are unchanged, except that they use this new
γ_n(x | w). Show that the update for P_t is given by

    P_{t+1}(x) = (1/N) ∑_{n=1}^{N} γ_n(x | w_t).

What happens if you tried to maximize the log-likelihood for the incomplete
data, instead of using the EM approach?