Contents

1 Maximum likelihood Inference
  1.1 Motivating examples
  1.2 Likelihood function
  1.3 Score vector
  1.4 Information matrix
  1.5 Newton-Raphson and Fisher Scoring methods
  1.6 Expectation Maximization (EM) algorithm
      1.6.1 Basic EM algorithm
      1.6.2 Monte Carlo EM Algorithm
  1.7 Appendix for EM algorithm
1 Maximum likelihood Inference

1.1 Motivating examples

Count data
Counts $y_i$ are observed over $n = 14$ time periods $x_i = i$. A Poisson regression models $Y_i \sim \text{Poisson}(\mu_i)$ with log link, $\ln\mu_i = a + bx_i$:

no=c(0,1,2,3,1,5,10,17,23,31,20,25,37,45)
time=c(1:14)
poi=glm(no~time, family=poisson(link=log))
summary(poi)
Call:
glm(formula = no ~ time, family = poisson(link = log))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
    ...      ...      ...   0.2545   2.6731

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.37571    0.24884    1.51    0.131
x            0.25365    0.02188   11.60   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 203.549  on 13  degrees of freedom
Residual deviance:  28.169  on 12  degrees of freedom
AIC: 85.358
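The fitted means follow directly from the printed coefficients; a minimal sketch (estimates copied from the summary above):

# Fitted Poisson means mu_i = exp(a + b*x_i) from the estimates above
time=c(1:14)
mu.hat=exp(0.37571+0.25365*time)
round(mu.hat,1)   # rises from about 1.9 to about 50.7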
Mice data
Twenty-six mice were given different levels $x_i$ of a drug. Outcomes $Y_i$ are whether they responded to the drug ($Y_i = 1$) or not ($Y_i = 0$).
The logistic regression model for the response probability $\pi_i = P(Y_i = 1)$ is
$$\operatorname{logit}(\pi_i) = \ln\frac{\pi_i}{1-\pi_i} = a + bx_i
\qquad\Longleftrightarrow\qquad
\pi_i = \frac{e^{a+bx_i}}{1+e^{a+bx_i}}.$$

> y=c(0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,1,1,1,1,1,1,1,1,1,1)
> dose=c(0:25)/10
> log=glm(y~dose, family=binomial(link=logit))
> summary(log)
Call:
glm(formula = y ~ dose, family = binomial(link = logit))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5766  -0.4757   0.1376   0.4129   2.1975

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -4.111      1.638  -2.510   0.0121 *
dose           3.581      1.316   2.722   0.0065 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 35.890  on 25  degrees of freedom
Residual deviance: 17.639  on 24  degrees of freedom
AIC: 21.639
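A quantity of interest in such dose-response fits is the dose at which the response probability reaches 0.5 (the LD50, $-a/b$, a standard formula not stated in the notes); a minimal sketch using the printed estimates:

# Estimated response curve and the dose giving pi = 0.5
a=-4.111; b=3.581
dose=c(0:25)/10
pi.hat=plogis(a+b*dose)   # = exp(a+b*x)/(1+exp(a+b*x))
-a/b                      # LD50, about 1.15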
[Figure: fitted Poisson regression curve (no versus time) and fitted logistic regression curve (response probability versus dose) for the two motivating examples.]
1.2 Likelihood function

Let $Y_1,\dots,Y_n$ be $n$ independent random variables (rv) with probability density functions (pdf) $f_i(y_i;\theta)$ depending on a vector-valued parameter $\theta$. The joint density of $y = (y_1,\dots,y_n)^\top$,
$$f(y;\theta) = \prod_{i=1}^n f_i(y_i;\theta) = L(\theta; y),$$
regarded as a function of the unknown parameter $\theta$ given the data $y$, is called the likelihood function. We often work with the logarithm of $L(\theta;y)$, the log-likelihood function:
$$\ell(\theta; y) = \ln L(\theta; y) = \sum_{i=1}^n \ln f_i(y_i;\theta).$$
Example: The log-likelihood for the geometric distribution.
For $n$ observations from a geometric distribution with pdf $f(y_i;\pi) = \pi(1-\pi)^{y_i}$ (the number of failures $y_i$ before the first success), the log-likelihood is
$$\ell(\pi; y) = \sum_{i=1}^n [y_i\ln(1-\pi) + \ln\pi] = n[\bar y\ln(1-\pi) + \ln\pi],$$
where $\bar y = \frac{1}{n}\sum_{i=1}^n y_i$.
[Figure: log-likelihood $\ell(\pi)$, score $u(\pi)$ and expected information $I_e(\pi)$ for the geometric example, plotted against $\pi \in (0,1)$.]
> n=20
> ym=3
> pi=c(1:100)/100
> logl=function(pi) n*(ym*log(1-pi)+log(pi))
1.3 Score vector

The score vector is the gradient of the log-likelihood, $u(\theta) = \partial\ell(\theta;y)/\partial\theta$. If the log-likelihood function is concave, the ML estimates $\hat\theta$ can be obtained by solving the system of equations:
$$u(\hat\theta) = 0.$$
Example: The score function for the geometric distribution.
The score function for $n$ observations from a geometric distribution is
$$u(\pi) = \frac{d\ell}{d\pi} = \frac{d}{d\pi}\, n[\bar y\ln(1-\pi) + \ln\pi] = n\left(\frac{1}{\pi} - \frac{\bar y}{1-\pi}\right).$$
Setting this equation to zero and solving for $\pi$ gives the ML estimate:
$$\frac{1}{\hat\pi} = \frac{\bar y}{1-\hat\pi}
\;\Rightarrow\; 1-\hat\pi = \hat\pi\bar y
\;\Rightarrow\; \hat\pi = \frac{1}{1+\bar y}
\quad\text{and}\quad \bar y = \frac{1-\hat\pi}{\hat\pi}.$$
Note that the ML estimate of the probability of success is the reciprocal of the average number of trials: the more trials it takes to get a success, the lower is the estimated probability of success.
For a sample of $n = 20$ observations and with a sample mean of $\bar y = 3$, the ML estimate is $\hat\pi = 1/(1+3) = 0.25$.
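The closed-form estimate can be confirmed numerically with base R's optimize(); a minimal sketch (not part of the original notes):

# Maximize the geometric log-likelihood over pi in (0,1)
n=20; ym=3
logl=function(pi) n*(ym*log(1-pi)+log(pi))
optimize(logl, interval=c(1e-6,1-1e-6), maximum=TRUE)$maximum   # ~0.25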
1.4 Information matrix

Since $\int f(y;\theta)\,dy = 1$, differentiating both sides w.r.t. $\theta$ under the integral sign gives
$$\int \frac{\partial f(y;\theta)}{\partial\theta}\,dy = 0
\;\Rightarrow\;
\int \frac{\partial\ln f(y;\theta)}{\partial\theta}\, f(y;\theta)\,dy = \int \frac{\partial\ell(\theta)}{\partial\theta}\, f(y;\theta)\,dy = 0,$$
that is, $E_y\!\left[\dfrac{\partial\ell(\theta)}{\partial\theta}\right] = 0$. Differentiating once more,
$$\int \left[\frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta^\top}\, f(y;\theta) + \frac{\partial\ell(\theta)}{\partial\theta}\,\frac{\partial f(y;\theta)}{\partial\theta^\top}\right] dy = 0$$
$$\int \left[\frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta^\top} + \frac{\partial\ell(\theta)}{\partial\theta}\,\frac{\partial\ell(\theta)}{\partial\theta^\top}\right] f(y;\theta)\,dy = 0$$
$$E_y\!\left[\frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta^\top}\right] + E_y\!\left[\left(\frac{\partial\ell(\theta)}{\partial\theta}\right)\!\left(\frac{\partial\ell(\theta)}{\partial\theta}\right)^{\!\top}\right] = 0. \quad (1)$$
Hence the score function is a random vector with zero mean,
$$E_y[u(\theta)] = E_y\!\left[\frac{\partial\ell(\theta)}{\partial\theta}\right] = 0,$$
and, by (1), its variance-covariance matrix is the expected (Fisher) information matrix
$$I_e(\theta) = E_y[u(\theta)u(\theta)^\top] = -E_y\!\left[\frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta^\top}\right] \quad (2)$$
and $I_o(\theta) = -\dfrac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta^\top} = -H(\theta)$ is sometimes called the observed information matrix. $I_o(\theta)$ indicates the extent to which $\ell(\theta)$ is peaked rather than flat: if $\ell$ is more peaked, $I_o(\theta)$ is more positive. For example, when $Y_i \sim N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$,
$$I_o(\theta) = -\begin{pmatrix}
\dfrac{\partial^2\ell}{\partial\mu^2} & \dfrac{\partial^2\ell}{\partial\mu\,\partial\sigma^2}\\[6pt]
\dfrac{\partial^2\ell}{\partial\sigma^2\,\partial\mu} & \dfrac{\partial^2\ell}{\partial(\sigma^2)^2}
\end{pmatrix}
= \begin{pmatrix}
\dfrac{n}{\sigma^2} & \dfrac{1}{\sigma^4}\sum_i (y_i-\mu)\\[6pt]
\dfrac{1}{\sigma^4}\sum_i (y_i-\mu) & \dfrac{1}{\sigma^6}\sum_i (y_i-\mu)^2 - \dfrac{n}{2\sigma^4}
\end{pmatrix}.$$
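Such derivative calculations can be checked numerically: base R's optimHess() returns the Hessian of a supplied function, so applying it to the negative log-likelihood gives $I_o$. A minimal sketch on simulated normal data (the sample yy is an assumption for illustration):

# Numerical observed information for N(mu, sigma2)
set.seed(1)
yy=rnorm(50, mean=10, sd=2)                          # hypothetical sample
negll=function(th) -sum(dnorm(yy, th[1], sqrt(th[2]), log=TRUE))
optimHess(c(mean(yy), var(yy)), negll)               # ~ 2x2 matrix I_o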
Example: Information matrix for the geometric distribution.
Differentiating the score, we find the observed information to be
$$I_o(\pi) = -\frac{d^2\ell(\pi)}{d\pi^2} = -\frac{du(\pi)}{d\pi}
= -\frac{d}{d\pi}\, n\left(\frac{1}{\pi} - \frac{\bar y}{1-\pi}\right)
= n\left[\frac{1}{\pi^2} + \frac{\bar y}{(1-\pi)^2}\right].$$
Taking expectation with $E(\bar Y) = (1-\pi)/\pi$, the expected information is
$$I_e(\pi) = n\left[\frac{1}{\pi^2} + \frac{(1-\pi)/\pi}{(1-\pi)^2}\right]
= n\left[\frac{1}{\pi^2} + \frac{1}{\pi(1-\pi)}\right]
= n\,\frac{(1-\pi)+\pi}{\pi^2(1-\pi)}
= \frac{n}{\pi^2(1-\pi)}.$$
For $n = 20$ and $\bar y = 3$, at $\pi = 0.15$,
$$I_e(0.15) = \frac{20}{0.15^2(1-0.15)} = 1045.8
\qquad\text{and}\qquad
I_o(0.15) = 20\left[\frac{1}{0.15^2} + \frac{3}{(1-0.15)^2}\right] = 971.9.$$
Substituting the ML estimate $\hat\pi = 0.25$, the expected and observed information coincide, $I_o(0.25) = I_e(0.25) = 426.7$, since $\bar y = (1-\hat\pi)/\hat\pi$ at $\hat\pi$.
> score=function(pi) n*(1/pi-ym/(1-pi))
> Ie=function(pi) n/(pi^2*(1-pi))
> Io=n*(1/pi^2+ym/(1-pi)^2)
>
> logl1=n*(ym*log(1-pi)+log(pi))
> score1=n*(1/pi-ym/(1-pi))
> Ie1=n/(pi^2*(1-pi))
> c(pi[logl1==max(logl1)],pi[score1==0],max(logl1))
[1]   0.25000   0.25000 -44.98681
> c(Io[pi==0.15],Ie1[pi==0.15],Io[pi==0.25],Ie1[pi==0.25])
[1]  971.9339 1045.7516  426.6667  426.6667
>
> par(mfrow=c(2,2))
> plot(logl, col="red", xlab="pi", ylab="logl(pi)")
> points(pi[score1==0],logl1[pi==pi[score1==0]],pch=2,col="red",cex=0.6)
> title("log-likelihood function")
> plot(score, col="red", xlab="pi", ylab="score(pi)")
> abline(h=0)
> points(pi[score1==0],0,pch=2,col="red",cex=0.6)
> title("score function")
> plot(Ie, col="red", xlab="pi", ylab="Ie(pi)")
> title("Expected information function")
1.5 Newton-Raphson and Fisher Scoring methods

Expanding the score function around a trial value $\theta_0$ in a first order Taylor series gives
$$u(\hat\theta) \simeq u(\theta_0) + \frac{\partial^2\ell(\theta_0)}{\partial\theta\,\partial\theta^\top}(\hat\theta - \theta_0) + \text{higher order terms in } (\hat\theta - \theta_0). \quad (3)$$
Ignoring higher order terms, equating (3) to zero and solving for $\hat\theta$, we have
$$\hat\theta \simeq \theta_0 - \left(\frac{\partial^2\ell(\theta_0)}{\partial\theta\,\partial\theta^\top}\right)^{-1} u(\theta_0), \quad (4)$$
which leads to the Newton-Raphson (NR) iterative procedure
$$\theta^{(k+1)} = \theta^{(k)} - \left.\left(\frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta^\top}\right)^{-1}\frac{\partial\ell(\theta)}{\partial\theta}\right|_{\theta = \theta^{(k)}}. \quad (5)$$
The iterative procedure is repeated until the difference between $\theta^{(k+1)}$ and $\theta^{(k)}$ is sufficiently close to zero. Then (proof as exercise)
$$\widehat{\operatorname{var}}(\hat\theta) = I_o(\hat\theta)^{-1} = \left(-\frac{\partial^2\ell(\hat\theta)}{\partial\theta\,\partial\theta^\top}\right)^{-1}.$$
For ML estimates, $\ell$ is concave downwards at $\hat\theta$ and the second order derivative $H(\hat\theta)$ is negative. The sharper the curvature (more information) of $\ell(\theta)$, the more negative $H(\hat\theta)$ is, and hence the smaller the variance $\operatorname{var}(\hat\theta) = I_o(\hat\theta)^{-1} = -H(\hat\theta)^{-1}$ of the estimates. The NR procedure tends to converge quickly if the log-likelihood is well-behaved (close to quadratic) in a neighborhood of the ML estimate $\hat\theta$ and if the starting value $\theta_0$ is reasonably close to $\hat\theta$.
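In practice, the estimate and its standard error can also be obtained with a general-purpose optimizer; a minimal sketch for the geometric example of Section 1.3, using optim() (this call is an illustration, not the notes' own code):

# hessian=TRUE returns the Hessian of the minimized function (-logl),
# i.e. the observed information, whose inverse estimates var(pi-hat)
negl=function(p) -20*(3*log(1-p)+log(p))
fit=optim(0.1, negl, method="Brent", lower=1e-6, upper=1-1e-6, hessian=TRUE)
c(fit$par, sqrt(1/fit$hessian))   # ~0.25 and se ~0.0484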
The Fisher Scoring (FS) method replaces the observed second derivative matrix in (5) by its expectation, giving
$$\theta^{(k+1)} = \theta^{(k)} - \left[E\!\left(\frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta^\top}\right)\right]^{-1}\left.\frac{\partial\ell(\theta)}{\partial\theta}\right|_{\theta = \theta^{(k)}}. \quad (6)$$
For multimodal distributions, both methods may converge to a local (rather than the global) maximum.
Example: NR and FS methods for the geometric distribution.
Setting the score to zero leads to an explicit solution for the ML estimate $\hat\pi = \dfrac{1}{1+\bar y}$ and no iteration is needed. For illustrative purposes, the iterative procedure is performed. Using the previous results,
$$\frac{d\ell}{d\pi} = n\left(\frac{1}{\pi} - \frac{\bar y}{1-\pi}\right), \qquad
\frac{d^2\ell}{d\pi^2} = -n\left[\frac{1}{\pi^2} + \frac{\bar y}{(1-\pi)^2}\right], \qquad
E\!\left(\frac{d^2\ell}{d\pi^2}\right) = -\frac{n}{\pi^2(1-\pi)}.$$
The FS iteration (6) is
$$\pi^{(k+1)} = \pi^{(k)} - \left.\left[E\!\left(\frac{d^2\ell}{d\pi^2}\right)\right]^{-1}\frac{d\ell}{d\pi}\right|_{\pi = \pi^{(k)}}
= \pi^{(k)} + \frac{(\pi^{(k)})^2(1-\pi^{(k)})}{n}\; n\left(\frac{1}{\pi^{(k)}} - \frac{\bar y}{1-\pi^{(k)}}\right)$$
$$= \pi^{(k)} + (\pi^{(k)})^2(1-\pi^{(k)})\,\frac{1-\pi^{(k)}-\pi^{(k)}\bar y}{\pi^{(k)}(1-\pi^{(k)})}
= \pi^{(k)} + \left(1 - \pi^{(k)} - \pi^{(k)}\bar y\right)\pi^{(k)}.$$
If the sample mean is $\bar y = 3$ and we start from $\pi^{(0)} = 0.1$, say, the procedure converges to the ML estimate $\hat\pi = 0.25$ in four iterations.
> n=20
> ym=3
> pi=0.1                       #starting value
> result=matrix(0,10,7)
> for (i in 1:10) {
+   dl=n*(1/pi-ym/(1-pi))      #score
+   dl2=-n/(pi^2*(1-pi))       #expected 2nd derivative (Fisher scoring)
+   pi=pi-dl/dl2
+   #pi=pi+(1-pi-pi*ym)*pi     #equivalent closed-form update
+   se=sqrt(-1/dl2)
+   l=n*(ym*log(1-pi)+log(pi))
+   step=(1-pi-pi*ym)*pi
+   result[i,]=c(i,pi,se,l,dl,dl2,step)
+ }
> colnames(result)=c("Iter","pi","se","l","dl","dl2","step")
> result
      Iter        pi         se         l           dl        dl2         step
 [1,]    1 0.1600000 0.02121320 -47.11283 1.333333e+02 -2222.2222 5.760000e-02
 [2,]    2 0.2176000 0.03279024 -45.22528 5.357143e+01  -930.0595 2.820096e-02
 [3,]    3 0.2458010 0.04303862 -44.99060 1.522465e+01  -539.8628 4.128512e-03
 [4,]    4 0.2499295 0.04773221 -44.98681 1.812051e+00  -438.9114 7.050785e-05
 [5,]    5 0.2500000 0.04840091 -44.98681 3.009750e-02  -426.8674 1.989665e-08
 [6,]    6 0.2500000 0.04841229 -44.98681 8.489239e-06  -426.6667 1.582068e-15
 [7,]    7 0.2500000 0.04841229 -44.98681 6.661338e-13  -426.6667 0.000000e+00
 [8,]    8 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
 [9,]    9 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
[10,]   10 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
The NR iteration (5) uses the observed second derivative instead:
$$\pi^{(k+1)} = \pi^{(k)} - \left.\left(\frac{d^2\ell}{d\pi^2}\right)^{-1}\frac{d\ell}{d\pi}\right|_{\pi = \pi^{(k)}}
= \pi^{(k)} + \left[n\left(\frac{1}{(\pi^{(k)})^2} + \frac{\bar y}{(1-\pi^{(k)})^2}\right)\right]^{-1} n\left(\frac{1}{\pi^{(k)}} - \frac{\bar y}{1-\pi^{(k)}}\right)$$
$$= \pi^{(k)} + \frac{(\pi^{(k)})^2(1-\pi^{(k)})^2}{1 - 2\pi^{(k)} + (\pi^{(k)})^2 + \bar y(\pi^{(k)})^2}\cdot\frac{1-\pi^{(k)}-\pi^{(k)}\bar y}{\pi^{(k)}(1-\pi^{(k)})}
= \pi^{(k)} + \frac{\pi^{(k)}(1-\pi^{(k)})(1-\pi^{(k)}-\pi^{(k)}\bar y)}{1 - 2\pi^{(k)} + (1+\bar y)(\pi^{(k)})^2}.$$
> n=20
> ym=3
> pi=0.1                          #starting value
> result=matrix(0,10,7)
> for (i in 1:10) {
+   dl=20*(1/pi-ym/(1-pi))        #score
+   dl2=-20*(1/pi^2+3/(1-pi)^2)   #observed 2nd derivative (Newton-Raphson)
+   pi=pi-dl/dl2
+   #pi=pi+(1-pi)*(1-pi-pi*ym)*pi/(1-2*pi+4*pi^2)  #closed-form update
+   se=sqrt(-1/dl2)
+   l=n*(ym*log(1-pi)+log(pi))
+   step=(1-pi)*(1-pi-pi*ym)*pi/(1-2*pi+4*pi^2)
+   result[i,]=c(i,pi,se,l,dl,dl2,step)
+ }
> colnames(result)=c("Iter","pi","se","l","dl","dl2","step")
> result
      Iter        pi         se         l           dl        dl2         step
 [1,]    1 0.1642857 0.02195775 -46.89107 1.333333e+02 -2074.0741 6.039726e-02
 [2,]    2 0.2246830 0.03477490 -45.13029 4.994426e+01  -826.9292 2.344114e-02
 [3,]    3 0.2481241 0.04490170 -44.98756 1.162661e+01  -495.9916 1.866426e-03
 [4,]    4 0.2499905 0.04816876 -44.98681 8.044145e-01  -430.9919 9.453797e-06
 [5,]    5 0.2500000 0.04841107 -44.98681 4.033823e-03  -426.6882 2.383524e-10
 [6,]    6 0.2500000 0.04841229 -44.98681 1.016970e-07  -426.6667 0.000000e+00
 [7,]    7 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
 [8,]    8 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
 [9,]    9 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
[10,]   10 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
1.6 Expectation Maximization (EM) algorithm

1.6.1 Basic EM algorithm
The Expectation-Maximization (EM) algorithm was proposed by Dempster et al. (1977). It is an iterative approach for computing the maximum likelihood estimates (MLEs) for incomplete-data problems.
Let $y$ be the observed data, $z$ be the latent or missing data and $\theta$ be the unknown parameters to be estimated. The functions $f(y|\theta)$ and $f(y,z|\theta)$ are called the observed data and complete data likelihood functions respectively. The observed data likelihood $L_o(\theta) = f(y|\theta)$ is the expectation of $f(y|z,\theta)$ w.r.t. $f(z|\theta)$, that is,
$$f(y|\theta) = \int f(y,z|\theta)\,dz = \int f(y|z,\theta)\,f(z|\theta)\,dz = E_z[f(y|z,\theta)].$$
Each EM iteration has two steps. In the E-step, given the current estimate $\theta^{(k)}$, the unobserved quantities in the complete data log-likelihood are replaced by their conditional expectations given $y$, forming
$$Q(\theta|\theta^{(k)}) = E_{z|y,\theta^{(k)}}[\ln f(y,z|\theta)].$$
In the M-step, $Q(\theta|\theta^{(k)})$ is maximized w.r.t. $\theta$ to give $\theta^{(k+1)}$, and the two steps are iterated until convergence (see the Appendix in Section 1.7).
Example: (Darwin data) The data contain two very low outliers. We consider the mixture model:
$$y_i \sim \begin{cases} N(\mu_1,\sigma^2), & p = 0.9,\\ N(\mu_2,\sigma^2), & p = 0.1, \end{cases}
\qquad\text{or}\qquad
y_i \sim 0.9\,N(\mu_1,\sigma^2) + 0.1\,N(\mu_2,\sigma^2).$$
Let $w_{ij}$ be the indicator that observation $i$ comes from group $j$, $j = 1,2$, with $w_{i1} + w_{i2} = 1$. We do not know which normal distribution each observation $y_i$ comes from; in other words, $w_{ij}$ is unobserved.
In the M-step, writing $r_{ij} = y_i - \mu_j$, the complete data likelihood, log-likelihood and their 1st and 2nd order derivative functions are
$$L(\theta) = \prod_{i=1}^n \phi(y_i|\mu_1,\sigma^2)^{w_{i1}}\,\phi(y_i|\mu_2,\sigma^2)^{w_{i2}},
\qquad
\ell(\theta) = \sum_{i=1}^n [w_{i1}\ln\phi(y_i|\mu_1,\sigma^2) + w_{i2}\ln\phi(y_i|\mu_2,\sigma^2)],$$
where
$$\ln\phi(y_i|\mu_j,\sigma^2) = -\frac12\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i-\mu_j)^2,$$
$$\frac{\partial}{\partial\mu_j}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{\sigma^2}(y_i-\mu_j) = \frac{r_{ij}}{\sigma^2},
\qquad
\frac{\partial}{\partial\sigma^2}\ln\phi(y_i|\mu_j,\sigma^2) = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(y_i-\mu_j)^2 = \frac{1}{2\sigma^4}(r_{ij}^2 - \sigma^2),$$
so that
$$\frac{\partial\ell(\theta)}{\partial\mu_j} = \sum_{i=1}^n w_{ij}\,\frac{\partial}{\partial\mu_j}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n w_{ij} r_{ij}, \quad j = 1,2,$$
$$\frac{\partial\ell(\theta)}{\partial\sigma^2} = \sum_{i=1}^n\sum_{j=1}^2 w_{ij}\,\frac{\partial}{\partial\sigma^2}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{2\sigma^4}\sum_{i=1}^n\sum_{j=1}^2 w_{ij}(r_{ij}^2 - \sigma^2),$$
$$\frac{\partial^2\ell(\theta)}{\partial\mu_j^2} = -\frac{1}{\sigma^2}\sum_{i=1}^n w_{ij},
\qquad
\frac{\partial^2\ell(\theta)}{\partial(\sigma^2)^2} = -\frac{1}{\sigma^6}\sum_{i=1}^n\sum_{j=1}^2 w_{ij}(r_{ij}^2 - \sigma^2) - \frac{n}{2\sigma^4},$$
$$\frac{\partial^2\ell(\theta)}{\partial\mu_j\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^n w_{ij} r_{ij},
\qquad
\frac{\partial^2\ell(\theta)}{\partial\mu_1\,\partial\mu_2} = 0.$$
In the E-step, the unobserved $w_{ij}$ are replaced by their conditional expectations given $y_i$ and the current estimates,
$$\hat w_{i1} = \frac{0.9\,\phi(y_i|\mu_1,\sigma^2)}{0.9\,\phi(y_i|\mu_1,\sigma^2) + 0.1\,\phi(y_i|\mu_2,\sigma^2)},
\qquad \hat w_{i2} = 1 - \hat w_{i1}.$$
> y=c(-67,-48,6,8,14,16,23,24,28,29,41,49,56,60,75)
> n=length(y)
> p=3                  #no. of par.
> iterE=5
> iterM=10
> dim1=iterE*iterM
> dim2=2*p+3
> dl=c(rep(0,p))
> result=matrix(0,dim1,dim2)
> theta=c(30,-37,729)  #starting values
> for (k in 1:iterE) {
+   # E-step
+   ew1=0.9*exp(-0.5*(y-theta[1])^2/theta[3])
+   ew2=0.1*exp(-0.5*(y-theta[2])^2/theta[3])
+   w1=ew1/(ew1+ew2)
+   w1m=mean(w1)
+   w2=1-w1
+   sw1=sum(w1)
+   sw2=sum(w2)
+   for (i in 1:iterM) {
+     # M-step
+     r1=y-theta[1]
+     r2=y-theta[2]
+     s1=r1^2-theta[3]
+     s2=r2^2-theta[3]
+     dl[1]=sum(w1*r1)/theta[3]
+     dl[2]=sum(w2*r2)/theta[3]
+     dl[3]=0.5*(sum(w1*s1)+sum(w2*s2))/theta[3]^2
+     dl2=matrix(0,p,p)
+     dl2[1,1]=-sw1/theta[3]
+     dl2[2,2]=-sw2/theta[3]
+     dl2[3,3]=-(sum(w1*s1)+sum(w2*s2))/theta[3]^3-0.5*n/theta[3]^2
+     dl2[3,1]=dl2[1,3]=-sum(w1*r1)/theta[3]^2
+     dl2[3,2]=dl2[2,3]=-sum(w2*r2)/theta[3]^2
+     dl2i=solve(dl2)
+     theta=theta-dl2i%*%dl
+     se=sqrt(diag(-dl2i))
+     l=log(0.9)*sw1+log(0.1)*sw2-n*log(2*pi*theta[3])/2-
+       (sum(w1*r1^2)+sum(w2*r2^2))/(2*theta[3])
+     row=(k-1)*10+i
+     result[row,]=c(k,i,theta[1],se[1],theta[2],se[2],theta[3],se[3],l)
+   }
+ }
> colnames(result)=c("iE","iM","mu1","se","mu2","se","sigma2","se","logL")
> result
      iE iM      mu1       se       mu2       se   sigma2        se      logL
 [1,]  1  1 33.48204 7.556567 -61.29223 20.36936 314.5368 333.38359 -74.72050
 [2,]  1  2 32.67069 4.932458 -55.63190 12.80706 426.8732  92.08656 -75.11445
 [3,]  1  3 32.28547 5.732025 -52.94443 14.65295 488.9407 144.09522 -74.00812
 [4,]  1  4 32.22150 6.132491 -52.49817 15.64174 500.6744 176.38060 -73.88313
 [5,]  1  5 32.21993 6.205593 -52.48720 15.82744 500.9943 182.76204 -73.88041
 [6,]  1  6 32.21993 6.207575 -52.48719 15.83249 500.9945 182.93721 -73.88041
 [7,]  1  7 32.21993 6.207576 -52.48719 15.83250 500.9945 182.93734 -73.88041
 [8,]  1  8 32.21993 6.207576 -52.48719 15.83250 500.9945 182.93734 -73.88041
 [9,]  1  9 32.21993 6.207576 -52.48719 15.83250 500.9945 182.93734 -73.88041
[10,]  1 10 32.21993 6.207576 -52.48719 15.83250 500.9945 182.93734 -73.88041
[11,]  2  1 33.13424 6.215236 -58.27681 15.93154 360.3297 207.03231 -72.33713
[12,]  2  2 32.95097 5.265502 -57.11632 13.42523 391.1934 125.81277 -72.12760
[13,]  2  3 32.93394 5.485894 -57.00847 13.98082 394.3075 142.27392 -72.09124
[14,]  2  4 32.93380 5.507683 -57.00761 14.03630 394.3344 143.97586 -72.09096
[15,]  2  5 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[16,]  2  6 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[17,]  2  7 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[18,]  2  8      ...      ...       ... 14.03678 394.3344       ... -72.09096
[19,]  2  9      ...      ...       ... 14.03678 394.3344       ... -72.09096
[20,]  2 10      ...      ...       ... 14.03678 394.3344       ... -72.09096
[21,]  3  1      ...      ...       ... 14.03871 384.9492       ... -71.91673
[22,]  3  2      ...      ...       ... 13.86992 385.1901       ... -71.91425
[23,]  3  3      ...      ...       ... 13.87426 385.1903       ... -71.91425
[24,]  3  4      ...      ...       ... 13.87426 385.1903       ... -71.91425
[25,]  3  5      ...      ...       ... 13.87426 385.1903       ... -71.91425
[26,]  3  6      ...      ...       ... 13.87426 385.1903       ... -71.91425
[27,]  3  7      ...      ...       ... 13.87426 385.1903       ... -71.91425
[28,]  3  8      ...      ...       ... 13.87426 385.1903       ... -71.91425
[29,]  3  9      ...      ...       ... 13.87426 385.1903       ... -71.91425
[30,]  3 10      ...      ...       ... 13.87426 385.1903       ... -71.91425
[31,]  4  1      ...      ...       ... 13.87464 384.8527       ... -71.90744
[32,]  4  2      ...      ...       ... 13.86856 384.8531       ... -71.90744
[33,]  4  3      ...      ...       ... 13.86857 384.8531       ... -71.90744
[34,]  4  4      ...      ...       ... 13.86857 384.8531       ... -71.90744
[35,]  4  5      ...      ...       ... 13.86857 384.8531       ... -71.90744
[36,]  4  6      ...      ...       ... 13.86857 384.8531       ... -71.90744
[37,]  4  7      ...      ...       ... 13.86857 384.8531       ... -71.90744
[38,]  4  8      ...      ...       ... 13.86857 384.8531       ... -71.90744
[39,]  4  9      ...      ...       ... 13.86857 384.8531       ... -71.90744
[40,]  4 10      ...      ...       ... 13.86857 384.8531       ... -71.90744
[41,]  5  1      ...      ...       ... 13.86859 384.8414       ... -71.90720
[42,]  5  2      ...      ...       ... 13.86837 384.8414       ... -71.90720
[43,]  5  3      ...      ...       ... 13.86837 384.8414       ... -71.90720
[44,]  5  4      ...      ...       ... 13.86837 384.8414       ... -71.90720
[45,]  5  5      ...      ...       ... 13.86837 384.8414       ... -71.90720
[46,]  5  6      ...      ...       ... 13.86837 384.8414       ... -71.90720
[47,]  5  7      ...      ...       ... 13.86837 384.8414       ... -71.90720
[48,]  5  8      ...      ...       ... 13.86837 384.8414       ... -71.90720
[49,]  5  9      ...      ...       ... 13.86837 384.8414       ... -71.90720
[50,]  5 10      ...      ...       ... 13.86837 384.8414       ... -71.90720
From $(\hat w_{i1}, \hat w_{i2})$, the first two observations belong to group 2 while the others all belong to group 1. Hence the EM method enables classification, like cluster analysis, an advantage over the classical likelihood method where the missing data $w_{ij}$ are integrated out in the observed data likelihood
$$L_o(\theta) = \prod_{i=1}^n [0.9\,\phi(y_i|\mu_1,\sigma^2) + 0.1\,\phi(y_i|\mu_2,\sigma^2)]. \quad (8)$$
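The classification can be read off the converged weights; a small sketch recomputing $\hat w_{i1}$ at the final estimates (theta as left by the loop above):

# Posterior probability that each observation belongs to group 1;
# the common normalizing constants of the two normal pdfs cancel
ew1=0.9*exp(-0.5*(y-theta[1])^2/theta[3])
ew2=0.1*exp(-0.5*(y-theta[2])^2/theta[3])
round(ew1/(ew1+ew2),3)   # first two values near 0, the rest near 1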
x=rep(-0.001,n)
x1=seq(-120,100,0.1)
fx1=dnorm(x1,theta[1],sqrt(theta[3]))
fx2=dnorm(x1,theta[2],sqrt(theta[3]))
fx=0.9*fx1+0.1*fx2
plot(x1, fx1, xlab="x", ylab="f(x)", ylim=c(-0.001,0.025),
  xlim=c(-120,100), pch=20, col="red", cex=0.5)
points(x1,fx2,pch=20,col="blue",cex=0.5)
points(x1,fx,pch=20,cex=0.5)
points(y,x,pch=20,cex=0.8)
title("Mixture of normal distributions for Darwin data")
[Figure: "Mixture of normal distributions for Darwin data"; the two component densities and the mixture density f(x) plotted over x in (-120, 100), with the data points marked along the x-axis.]
Note that this is a mixture model where the two component densities are represented by the blue and red lines. The mixture density in (8) is represented by the black line.
Example: (Censored data) Suppose the first $m = 4$ observations are right-censored: for these we observe only $c_i$ and the event
$$z_i > c_i, \quad (9)$$
where $\phi$ and $\Phi$ are the pdf and cdf functions for the normal. Let $\theta^{(k)} = (\mu^{(k)}, \sigma^{2(k)})$ be the current estimates of $\theta$.

In the E-step, the conditional expectation of $z_i$, $i = 1,\dots,4$, given $y$, $\theta^{(k)}$ and $c_i$ is
$$\hat z_i^{(k)} = E(z_i|y, \theta^{(k)}, c_i) = \int_{c_i}^{\infty} z\, f(z|\theta^{(k)}, c_i)\,dz
= \mu^{(k)} + \sigma^{(k)}\,\frac{\phi(c_i^{(k)})}{1 - \Phi(c_i^{(k)})},
\qquad c_i^{(k)} = \frac{c_i - \mu^{(k)}}{\sigma^{(k)}},$$
or, by Monte Carlo approximation, $\hat z_i^{(k)} \simeq \frac{1}{S}\sum_{j=1}^S z_{ij}^{(k)}$, where the $z_{ij}^{(k)}$ are draws from $N(\mu^{(k)}, \sigma^{2(k)})$ truncated to $(c_i,\infty)$. The exact formula follows since
$$\frac{1}{\sqrt{2\pi}}\int_c^{\infty} z\exp\!\left(-\frac{z^2}{2}\right)dz
= \frac{1}{\sqrt{2\pi}}\left[-\exp\!\left(-\frac{z^2}{2}\right)\right]_c^{\infty}
= \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{c^2}{2}\right) = \phi(c).$$
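The Monte Carlo and exact E-steps can be checked against each other; a minimal sketch at the starting values (the seed and simulation size are assumptions):

# E(Z | Z > c) for the first censored point: MC vs the exact formula
library(msm)
set.seed(1)
cy=c(6,8,14,16,23,24,28,29,41,49,56,60,75)
mu=mean(cy); s2=var(cy); c1=cy[1]
mean(rtnorm(10000, mean=mu, sd=sqrt(s2), lower=c1))   # Monte Carlo
cz=(c1-mu)/sqrt(s2)
mu+sqrt(s2)*dnorm(cz)/(1-pnorm(cz))                   # exact, about 37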
> library("msm")
> cy=c(6,8,14,16,23,24,28,29,41,49,56,60,75) #first 4 obs are censored
> w=c(0,0,0,0,1,1,1,1,1,1,1,1,1)
> n=length(cy)
> S=10000              #sim 10000 z for z hat
> m=4                  #first 4 censored
> p=2                  #2 parameters
> cen=cy[1:m]          #censored obs
> y=cy[(m+1):n]        #uncensored obs
> iterE=10
> dim=p+m+1
> result=matrix(0,iterE,dim)
> simz=matrix(0,m,S)
> z=rep(0,m)
> theta=c(mean(cy),var(cy))   #starting values for mu & sigma2
> for (k in 1:iterE) {
+   # E-step
+   for (j in 1:m) {
+     simz[j,]=rtnorm(S,mean=theta[1],sd=sqrt(theta[2]),lower=cen[j],
+       upper=Inf)     #monte carlo approx. of E(Z|Z>c)
+     z[j]=mean(simz[j,])
+ #   cz=(cen[j]-theta[1])/sqrt(theta[2])
+ #   z[j]=theta[1]+dnorm(cz)*sqrt(theta[2])/(1-pnorm(cz)) #exact
+   }
+   yr=c(z,y)
+   # M-step
+   theta[1]=mean(yr)
+   theta[2]=(sum(yr^2)-sum(yr)^2/n)/n
+   result[k,]=c(k,theta[1],theta[2],z[1],z[2],z[3],z[4])
+ }
> colnames(result)=c("iE","mu","sigma2","ez1","ez2","ez3","ez4")
> print(result,digit=5)   #monte carlo approx. of E(Z|Z>c)
      iE     mu sigma2    ez1    ez2    ez3    ez4
 [1,]  1 41.630 211.67 37.262 37.842 40.054 41.036
 [2,]  2 42.647 208.03 41.943 42.259 42.455 42.755
 [3,]  3 42.921 208.09 42.540 43.007 43.614 43.810
 [4,]  4 43.017 208.11 43.281 43.385 43.655 43.900
 [5,]  5 43.040 208.17 43.154 43.441 43.733 44.188
 [6,]  6    ... 208.21 42.985 43.371 44.201    ...
 [7,]  7    ... 208.18 43.158 43.473 44.277    ...
 [8,]  8    ... 208.14 43.393 43.317 44.039    ...
 [9,]  9    ... 208.19 43.375 43.298 44.156    ...
[10,] 10    ... 208.22 43.336 43.364 44.411    ...
1.6.2 Monte Carlo EM Algorithm

Given the current guess to the posterior mode, $\theta^{(k)}$, the conditional expectation in the E-step may involve integration and can be calculated using Monte Carlo (MC) approximation. Similarly the complete data log-likelihood function $\ell_c(\theta) = \ln f(y,z|\theta)$ can also be approximated using MC approximation:
$$\hat\ell_c^{(k)}(\theta) = \ln\hat f(y,z|\theta) = \frac{1}{S}\sum_{j=1}^S \ln f(y, z_j^{(k)}|\theta), \quad (10)$$
where $z_j^{(k)}$, $j = 1,\dots,S$, are draws from $f(z|y,\theta^{(k)})$.
For the censored normal example,
$$\hat\ell_c^{(k)}(\theta) = \frac{1}{S}\sum_{j=1}^S\left[-\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left(\sum_{i=1}^4 (z_{ij}^{(k)} - \mu)^2 + \sum_{i=5}^n (y_i - \mu)^2\right)\right]$$
$$= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left[\sum_{i=1}^4\frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)^2 + \sum_{i=5}^n (y_i - \mu)^2\right],$$
and one maximizes it w.r.t. $\theta$ to obtain $\theta^{(k+1)}$ through iterations instead of a closed-form solution. Write $r_i = y_i - \mu$, $i = 5,\dots,n$, $r_{ij} = z_{ij}^{(k)} - \mu$, $\bar z_i = \frac{1}{S}\sum_{j=1}^S z_{ij}^{(k)}$ and $\bar r_i = \bar z_i - \mu$, $i = 1,\dots,4$.
The first order derivatives are
$$\frac{\partial\ell(\theta)}{\partial\mu} = \frac{1}{\sigma^2}\left[\sum_{i=1}^4\frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu) + \sum_{i=5}^n (y_i - \mu)\right] = \frac{1}{\sigma^2}\sum_{i=1}^n \bar r_i$$
(with $\bar r_i = r_i$ for the uncensored observations $i = 5,\dots,n$) and
$$\frac{\partial\ell(\theta)}{\partial\sigma^2} = \frac{1}{2\sigma^4}\left[\sum_{i=1}^4\frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)^2 + \sum_{i=5}^n (y_i - \mu)^2 - n\sigma^2\right]
= \frac{1}{2\sigma^4}\left[\sum_{i=1}^4\frac{1}{S}\sum_{j=1}^S r_{ij}^2 + \sum_{i=5}^n r_i^2 - n\sigma^2\right].$$
Since
$$\frac{1}{S}\sum_{j=1}^S r_{ij}^2 = \frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)^2 \;\neq\; \left(\frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)\right)^{\!2} = \bar r_i^2,$$
a closed-form solution using the sample mean and sample variance cannot be used.
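The point is easy to verify numerically; a tiny sketch with simulated draws (values are illustrative only):

# The MC average of squares is not the square of the MC average
set.seed(1)
zz=rnorm(1000, mean=2, sd=1)
c(mean(zz^2), mean(zz)^2)   # about 5 and 4: not equal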
The second order derivatives are
$$\frac{\partial^2\ell(\theta)}{\partial\mu^2} = -\frac{n}{\sigma^2},
\qquad
\frac{\partial^2\ell(\theta)}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\left[\sum_{i=1}^4\frac{1}{S}\sum_{j=1}^S r_{ij}^2 + \sum_{i=5}^n r_i^2\right],
\qquad
\frac{\partial^2\ell(\theta)}{\partial\mu\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^n \bar r_i.$$
> library("msm")
> cy=c(6,8,14,16,23,24,28,29,41,49,56,60,75) #first 4 obs are censor times
> w=c(0,0,0,0,1,1,1,1,1,1,1,1,1)
> mean(cy)
[1] 33
> n=length(cy)
> T=10000    #sim 10000 z for z hat
> m=4        #first 4 censored obs
> p=2        #2 pars
> cen=cy[1:m]          #censored obs
> y=cy[(m+1):n]        #uncensored obs
> iterE=5
> iterM=10
> result=matrix(0,iterE*iterM,2*p+m+3)
> simz=matrix(0,m,T)
> z=rep(0,m)
> zs=rep(0,m)
> theta=c(mean(cy),var(cy))
> for (k in 1:iterE) {
+   # E-step: simulate z and form the MC averages of z and z^2
+   for (j in 1:m) {
+     simz[j,]=rtnorm(T,mean=theta[1],sd=sqrt(theta[2]),lower=cen[j],upper=Inf)
+     z[j]=mean(simz[j,])
+     zs[j]=mean(simz[j,]^2)
+   }
+   for (i in 1:iterM) {
+     # M-step: Newton-Raphson using the derivatives dl and dl2 above
+     rbar=c(z,y)-theta[1]
+     r2=c(zs-2*theta[1]*z+theta[1]^2,(y-theta[1])^2)
+     dl=c(sum(rbar)/theta[2],(sum(r2)-n*theta[2])/(2*theta[2]^2))
+     dl2=matrix(c(-n/theta[2],-sum(rbar)/theta[2]^2,
+                  -sum(rbar)/theta[2]^2,n/(2*theta[2]^2)-sum(r2)/theta[2]^3),2,2)
+     dl2i=solve(dl2)
+     theta=theta-as.vector(dl2i%*%dl)
+     se=sqrt(diag(-dl2i))
+     l=-n*log(2*pi*theta[2])/2-sum(r2)/(2*theta[2])
+     result[(k-1)*iterM+i,]=c(k,i,theta[1],se[1],theta[2],se[2],l,z[1],z[2],
+       z[3],z[4])
+   }
+ }
> colnames(result)=c("iE","iM","mu","se","sigma2","se","logL","ez1",
+   "ez2","ez3","ez4")
> print(result,digit=5)
      iE iM     mu     se sigma2     se    logL    ez1    ez2    ez3    ez4
 [1,]  1  1 43.745 5.6699 288.50 160.37 -53.653 42.163 42.427 44.008 44.473
 [2,]  1  2 42.965 4.7218 301.25 113.42 -53.557 42.163 42.427 44.008 44.473
 [3,]  1  3 42.929 4.8139 301.86 118.16 -53.547 42.163 42.427 44.008 44.473
 [4,]  1  4 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [5,]  1  5 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [6,]  1  6 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [7,]  1  7 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [8,]  1  8 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [9,]  1  9 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[10,]  1 10 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[11,]  2  1 43.280 4.8205 287.03 118.44 -53.460 43.734 43.914 44.876 44.900
[12,]  2  2 43.263 4.6989 287.16 112.58 -53.458 43.734 43.914 44.876 44.900
[13,]  2  3 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[14,]  2  4 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[15,]  2  5 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[16,]  2  6 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[17,]  2  7 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[18,]  2  8 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[19,]  2  9 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[20,]  2 10 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[21,]  3  1 43.320 4.6999 284.71 112.63 -53.446 44.105 43.867 44.783 45.399
[22,]  3  2 43.320 4.6799 284.72 111.67 -53.446 44.105 43.867 44.783 45.399
[23,]  3  3 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[24,]  3  4 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[25,]  3  5 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[26,]  3  6 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[27,]  3  7 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[28,]  3  8 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[29,]  3  9 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[30,]  3 10 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[31,]  4  1 43.325 4.6799 284.61 111.68 -53.446 43.945 43.878 45.109 45.295
[32,]  4  2 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[33,]  4  3 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[34,]  4  4 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[35,]  4  5 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[36,]  4  6 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[37,]  4  7 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[38,]  4  8 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[39,]  4  9 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[40,]  4 10 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[41,]  5  1 43.343 4.6790 284.06 111.63 -53.443 44.072 44.284 44.957 45.149
[42,]  5  2 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[43,]  5  3 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[44,]  5  4 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[45,]  5  5 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[46,]  5  6 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[47,]  5  7 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[48,]  5  8 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[49,]  5  9 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[50,]  5 10 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
References

Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1-38.

McLachlan, G.J. & Krishnan, T. (1997) The EM Algorithm and Extensions. Wiley.
1.7 Appendix for EM algorithm

We show that each EM iteration can only increase the observed data log-likelihood $\ell_o(\theta) = \ln f(y|\theta)$. Writing $f(y|\theta^{(k+1)}) = \int f(y|z,\theta^{(k+1)})\,f(z|\theta^{(k+1)})\,dz$ and applying Jensen's inequality to the concave function $\ln$,
$$\ell_o(\theta^{(k+1)}) - \ell_o(\theta^{(k)})
= \ln\int f(z|y,\theta^{(k)})\,\frac{f(y|z,\theta^{(k+1)})\,f(z|\theta^{(k+1)})}{f(z|y,\theta^{(k)})}\,dz - \ln f(y|\theta^{(k)})$$
$$\geq \int f(z|y,\theta^{(k)})\ln\frac{f(y|z,\theta^{(k+1)})\,f(z|\theta^{(k+1)})}{f(z|y,\theta^{(k)})}\,dz - \int f(z|y,\theta^{(k)})\ln f(y|\theta^{(k)})\,dz$$
$$= \int f(z|y,\theta^{(k)})\ln\frac{f(y|z,\theta^{(k+1)})\,f(z|\theta^{(k+1)})}{f(z|y,\theta^{(k)})\,f(y|\theta^{(k)})}\,dz \;\triangleq\; \Delta(\theta^{(k+1)}|\theta^{(k)}),$$
since $\int f(z|y,\theta^{(k)})\,dz = 1$. Define $\tilde\ell_o(\theta|\theta^{(k)}) = \ell_o(\theta^{(k)}) + \Delta(\theta|\theta^{(k)})$, so that $\ell_o(\theta) \geq \tilde\ell_o(\theta|\theta^{(k)})$ for all $\theta$. At $\theta = \theta^{(k)}$, since $f(y|z,\theta)f(z|\theta) = f(y,z|\theta) = f(z|y,\theta)f(y|\theta)$,
$$\tilde\ell_o(\theta^{(k)}|\theta^{(k)}) = \ell_o(\theta^{(k)}) + \int f(z|y,\theta^{(k)})\ln\frac{f(y,z|\theta^{(k)})}{f(y,z|\theta^{(k)})}\,dz = \ell_o(\theta^{(k)}).$$
Hence any $\theta^{(k+1)}$ which increases $\tilde\ell_o(\theta^{(k+1)}|\theta^{(k)})$ above $\tilde\ell_o(\theta^{(k)}|\theta^{(k)}) = \ell_o(\theta^{(k)})$ also increases $\ell_o(\theta^{(k+1)})$. Lastly, the EM algorithm chooses $\theta^{(k+1)}$ for which $\tilde\ell_o(\theta|\theta^{(k)})$ is a maximum: since $\ell_o(\theta) \geq \tilde\ell_o(\theta|\theta^{(k)})$, increasing $\tilde\ell_o(\theta|\theta^{(k)})$ ensures that $\ell_o(\theta)$ is increased at each step.

To achieve the greatest increase in $\ell_o(\theta^{(k+1)})$, the EM algorithm selects $\theta^{(k+1)}$ which maximizes $\tilde\ell_o(\theta|\theta^{(k)})$, i.e.
$$\theta^{(k+1)} = \arg\max_\theta\left[\ell_o(\theta^{(k)}) + \int f(z|y,\theta^{(k)})\ln\frac{f(y|z,\theta)\,f(z|\theta)}{f(z|y,\theta^{(k)})\,f(y|\theta^{(k)})}\,dz\right]$$
$$= \arg\max_\theta\int f(z|y,\theta^{(k)})\ln[f(y|z,\theta)\,f(z|\theta)]\,dz
= \arg\max_\theta E_{z|y,\theta^{(k)}}\{\ln f(y,z|\theta)\},$$
which is exactly the maximization performed in the M-step.
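This monotone increase is visible in the examples; a small sketch checking the end-of-sweep log-likelihoods of the Darwin mixture run (values copied from the logL column above):

# Converged logL after each E-step sweep is strictly increasing
logL=c(-73.88041,-72.09096,-71.91425,-71.90744,-71.90720)
all(diff(logL)>0)   # TRUE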