
MSH3 Generalized Linear model (Part 1)

Jennifer S.K. CHAN


Course outline
Part I: Generalized Linear Model
1. Maximum Likelihood Inference
Newton-Raphson and Fisher Scoring methods; EM and Monte
Carlo EM algorithms
2. Exponential Family
Two parameter Exponential family; ML estimation for GLM,
Deviance; Quasi-likelihood, Random effects models.
3. Model Selection
Deviance for Likelihood Ratio Tests, Wald Tests, AIC and BIC,
Examples
4. Survival Analysis
Kaplan-Meier estimator; Proportional hazards models; Cox's
proportional hazards model.


Contents
1 Maximum likelihood Inference
  1.1 Motivating examples
  1.2 Likelihood function
  1.3 Score vector
  1.4 Information matrix
  1.5 Newton-Raphson and Fisher Scoring methods
  1.6 Expectation Maximization (EM) algorithm
      1.6.1 Basic EM algorithm
      1.6.2 Monte Carlo EM Algorithm
  1.7 Appendix for EM algorithm

1 Maximum likelihood Inference

1.1 Motivating examples
AIDS deaths (counts)


The numbers of deaths $Y_i$ from AIDS in Australia for three-month
periods from 1983 to 1986 are shown below.

The Poisson regression model

$$Y_i \sim \mathrm{Poisson}(\mu_i), \quad \text{with} \quad \mu_i = \exp(a + b t_i) > 0,$$

is fitted and the maximum likelihood (ML) estimates are $\hat a = 0.376$
and $\hat b = 0.254$. For each 3-month period, there will be a 29.3%
($\exp(0.254) = 1.293$) increase in expected AIDS deaths. Note that
the variance increases with the mean and the log link function
$g(\mu_i) = \ln(\mu_i) = x_i^\top \beta$ is used.
> no=c(0,1,2,3,1,5,10,17,23,31,20,25,37,45)
> time=c(1:14)
> poi=glm(no~time, family=poisson(link=log))
> summary(poi)

Call:
glm(formula = y ~ x, family = poisson(link = log))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2502  -0.9815  -0.6770   0.2545   2.6731

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.37571    0.24884    1.51    0.131
x            0.25365    0.02188   11.60   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 203.549  on 13  degrees of freedom
Residual deviance:  28.169  on 12  degrees of freedom
AIC: 85.358

Number of Fisher Scoring iterations: 5


> par=poi$coeff
> names(par)=NULL
> par
[1] 0.3757110 0.2536485
> beta0=par[1]
> beta1=par[2]
> par(mfrow=c(2,2))
> c1=function(time) exp(beta0+beta1*time)
> plot(time,no,pch=20,col="blue")
> curve(c1,1,14,add=TRUE)
> title("Poisson regression")
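
As a quick check of the 29.3% interpretation above, the multiplicative effect per 3-month period can be read off the fitted object directly (a minimal sketch reusing the poi object created above):

> exp(poi$coeff[2])   # about 1.293, i.e. a 29.3% increase per period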

Mice data
Twenty-six mice were given different levels $x_i$ of a drug. The outcomes $Y_i$ are
whether they responded to the drug ($Y_i = 1$) or not ($Y_i = 0$).


The logistic regression model for binary data is

$$Y_i \sim \mathrm{Bernoulli}(\pi_i), \quad \text{with} \quad \mathrm{logit}(\pi_i) = \ln\left(\frac{\pi_i}{1-\pi_i}\right) = a + b x_i.$$

Note that

$$\ln\left(\frac{\pi_i}{1-\pi_i}\right) = a + b x_i \;\Longleftrightarrow\; \pi_i = \frac{e^{a+bx_i}}{1+e^{a+bx_i}}.$$

> y=c(0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,1,1,1,1,1,1,1,1,1,1)
> dose=c(0:25)/10
> log=glm(y~dose, family=binomial(link=logit))
> summary(log)

Call:
glm(formula = y ~ dose, family = binomial(link = logit))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5766  -0.4757   0.1376   0.4129   2.1975

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -4.111      1.638  -2.510   0.0121 *
dose           3.581      1.316   2.722   0.0065 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 35.890  on 25  degrees of freedom
Residual deviance: 17.639  on 24  degrees of freedom
AIC: 21.639

Number of Fisher Scoring iterations: 6


> par=log$coeff
> names(par)=NULL
> par
[1] -4.111361 3.581176
> beta0=par[1]
> beta1=par[2]
> c1=function(dose) exp(beta0+beta1*dose)/(1+exp(beta0+beta1*dose))
> plot(dose,y, pch=20,col="red")
> curve(c1,0,2.5,add=TRUE)
> title("Logistic regression")

[Figure: fitted curves overlaid on the data. Left panel "Poisson regression": no versus time with the fitted Poisson mean curve. Right panel "Logistic regression": y versus dose with the fitted logistic curve.]

For parameter estimation, the nonparametric LSE, the parametric
maximum likelihood (ML, PML, QML, GEE, EM, MCEM, etc.) and
Bayesian methods will be discussed. Kernel smoothing and
other semi-parametric methods are not included. Model selection is
based on the Akaike information criterion (AIC), the Bayesian information
criterion (BIC) and the deviance information criterion (DIC).


For application, the two examples analyse count data with a Poisson distribution and binary data with a Bernoulli distribution respectively. Others include categorical data with a multinomial distribution
and positive continuous data with a Weibull distribution. These illustrate different data distributions within the Exponential Family. The
mean of the data distribution is linked to a linear function of covariates, with possibly random effects, but the variance is NOT modelled.
Popular time series models with heteroskedastic variance and long
memory, such as the generalized autoregressive conditional heteroskedastic (GARCH) model and the stochastic volatility (SV) model, will not be
considered.

1.2 Likelihood function

Let $Y_1,\dots,Y_n$ be $n$ independent random variables (rv) with probability density functions (pdf) $f_i(y_i, \theta)$ depending on a vector-valued
parameter $\theta$. The joint density of $y = (y_1,\dots,y_n)^\top$,

$$f(y, \theta) = \prod_{i=1}^{n} f(y_i, \theta) = L(\theta; y),$$

viewed as a function of the unknown parameter $\theta$ given $y$, is called the likelihood function. We often work with the logarithm of $f(y, \theta)$, the log-likelihood function:

$$\ell(\theta; y) = \ln L(\theta; y) = \sum_{i=1}^{n} \ln f(y_i; \theta).$$

The maximum-likelihood (ML) estimator $\hat\theta$ maximizes the log-likelihood
function given the data $y$, that is,

$$\ell(\hat\theta; y) \ge \ell(\theta; y) \quad \text{for all } \theta.$$

In other words, it makes the observed data as likely as possible under
the model.
Example: The log-likelihood function for the geometric distribution.
Consider a series of independent Bernoulli trials with a common probability of success $\pi$. The distribution of the number of failures $Y_i$
before the first success has pdf

$$\Pr(Y_i = y_i) = \pi(1-\pi)^{y_i}$$

for $y_i = 0, 1, \dots$. Direct calculation shows that $E(Y_i) = (1-\pi)/\pi$.
The log-likelihood function given $y$ is

$$\ell(\pi; y) = \ln L(\pi; y) = \sum_{i=1}^{n} [y_i \ln(1-\pi) + \ln \pi] = n[\bar y\,\ln(1-\pi) + \ln\pi],$$
where $\bar y = \frac{1}{n}\sum_{i=1}^{n} y_i$ is the sample mean. The fact that the log-likelihood
function depends on the observations only through $\bar y$ shows that $\bar y$ is
a sufficient statistic for the unknown probability $\pi$.
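
A quick numerical illustration of this sufficiency (a sketch; the two small data vectors are invented for illustration): any two samples of the same size with the same mean give identical log-likelihood values.

> l.geom=function(pi,y) sum(y*log(1-pi)+log(pi))
> y1=c(0,2,4,6); y2=c(3,3,3,3)        # same n and same mean ybar=3
> c(l.geom(0.2,y1), l.geom(0.2,y2))   # identical values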
[Figure: the log-likelihood function, score function and expected information function of $\pi$ for the geometric distribution when $n = 20$ and $\bar y = 3$.]

> n=20
> ym=3
> pi=c(1:100)/100
> logl=function(pi) n*(ym*log(1-pi)+log(pi))

1.3 Score vector

The first-order derivative of the log-likelihood function, called Fisher's
score function, is a vector of dimension $p$, where $p$ is the number of
parameters, and is denoted by

$$u(\theta) = \frac{\partial \ell(\theta; y)}{\partial \theta}.$$

For example, when $Y_i \sim N(\mu, \sigma^2)$, $u(\theta) = \left(\dfrac{\partial \ell}{\partial \mu},\ \dfrac{\partial \ell}{\partial \sigma^2}\right)^\top$.

If the log-likelihood function is concave, the ML estimates $\hat\theta$ can be
obtained by solving the system of equations:

$$u(\theta) = 0.$$
Example: The score function for the geometric distribution.
The score function for $n$ observations from a geometric distribution is

$$u(\pi) = \frac{d\ell}{d\pi} = \frac{d}{d\pi}\, n\big(\bar y\,\ln(1-\pi) + \ln\pi\big) = n\left(\frac{1}{\pi} - \frac{\bar y}{1-\pi}\right).$$

Setting this equation to zero and solving for $\pi$ gives the ML estimate:

$$\frac{1}{\hat\pi} = 1 + \bar y \quad\Longleftrightarrow\quad \hat\pi = \frac{1}{1+\bar y} \quad\text{and}\quad \bar y = \frac{1-\hat\pi}{\hat\pi}.$$

Note that the ML estimate of the probability of success is the reciprocal of the average number of trials. The more trials it takes to get
a success, the lower is the estimated probability of success.
For a sample of $n = 20$ observations and with a sample mean of $\bar y = 3$,
the ML estimate is $\hat\pi = 1/(1 + 3) = 0.25$.
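
The same value can be obtained numerically by locating the root of the score function, for example with uniroot (a sketch; the search interval is arbitrary):

> n=20; ym=3
> score=function(pi) n*(1/pi-ym/(1-pi))
> uniroot(score, c(0.01,0.99))$root   # approximately 0.25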

1.4 Information matrix

It can be shown that

$$E_y\!\left[\frac{\partial \ell(\theta)}{\partial \theta}\right] = 0 \quad\text{and}\quad E_y\!\left[\frac{\partial^2 \ell(\theta)}{\partial \theta\,\partial \theta^\top}\right] + E_y\!\left[\left(\frac{\partial \ell(\theta)}{\partial \theta}\right)\left(\frac{\partial \ell(\theta)}{\partial \theta}\right)^{\!\top}\right] = 0.$$

Proof: Since $\int f(y, \theta)\,dy = 1$,

$$\frac{\partial}{\partial\theta}\int f(y,\theta)\,dy = 0 \;\Rightarrow\; \int \frac{\partial f(y,\theta)}{\partial\theta}\,dy = 0 \;\Rightarrow\; \int \frac{\partial f(y,\theta)/\partial\theta}{f(y,\theta)}\, f(y,\theta)\,dy = 0$$
$$\Rightarrow\; \int \frac{\partial \ln f(y,\theta)}{\partial\theta}\, f(y,\theta)\,dy = 0 \;\Rightarrow\; \int \frac{\partial \ell(\theta)}{\partial\theta}\, f(y,\theta)\,dy = 0 \;\Rightarrow\; E_y\!\left[\frac{\partial \ell(\theta)}{\partial\theta}\right] = 0,$$

and, differentiating once more,

$$\frac{\partial}{\partial\theta^\top}\int \frac{\partial \ell(\theta)}{\partial\theta}\, f(y,\theta)\,dy = 0
\;\Rightarrow\; \int \left[\frac{\partial^2 \ell(\theta)}{\partial\theta\,\partial\theta^\top}\, f(y,\theta) + \frac{\partial \ell(\theta)}{\partial\theta}\, \frac{\partial f(y,\theta)}{\partial\theta^\top}\right] dy = 0$$
$$\Rightarrow\; \int \left[\frac{\partial^2 \ell(\theta)}{\partial\theta\,\partial\theta^\top} + \frac{\partial \ell(\theta)}{\partial\theta}\, \frac{\partial \ell(\theta)}{\partial\theta^\top}\right] f(y,\theta)\,dy = 0$$
$$\Rightarrow\; E_y\!\left[\frac{\partial^2 \ell(\theta)}{\partial\theta\,\partial\theta^\top}\right] + E_y\!\left[\left(\frac{\partial \ell(\theta)}{\partial\theta}\right)\left(\frac{\partial \ell(\theta)}{\partial\theta}\right)^{\!\top}\right] = 0. \qquad (1)$$

Hence the score function is a random vector with zero mean,

$$E_y[u(\theta)] = E_y\!\left[\frac{\partial \ell(\theta)}{\partial\theta}\right] = 0,$$

and a variance-covariance matrix given by the information matrix:

$$\mathrm{var}[u(\theta)] = E_y[u(\theta)\,u^\top(\theta)] = E_y\!\left[\left(\frac{\partial \ell(\theta)}{\partial\theta}\right)\left(\frac{\partial \ell(\theta)}{\partial\theta}\right)^{\!\top}\right] = I(\theta).$$

Under mild regularity conditions, the information matrix can also be
obtained as minus the expected value of the second derivatives of the
log-likelihood, from (1):

$$\mathrm{var}[u(\theta)] = I(\theta) = -E_y\!\left[\frac{\partial^2 \ell(\theta)}{\partial\theta\,\partial\theta^\top}\right].$$

Note that the Hessian matrix is

$$H(\theta) = -I_o(\theta) = \frac{\partial^2 \ell(\theta)}{\partial\theta\,\partial\theta^\top} = \frac{\partial u(\theta)}{\partial\theta^\top} \qquad (2)$$

and $I_o(\theta) = -\dfrac{\partial^2 \ell(\theta)}{\partial\theta\,\partial\theta^\top} = -H(\theta)$ is sometimes called the observed
information matrix. $I_o(\theta)$ indicates the extent to which $\ell(\theta)$ is peaked
rather than flat; the more peaked $\ell(\theta)$ is, the more positive $I_o(\theta)$. For
example, when $Y_i \sim N(\mu, \sigma^2)$,

$$\frac{\partial^2 \ell(\theta)}{\partial\theta\,\partial\theta^\top} = \begin{pmatrix} \dfrac{\partial^2 \ell}{\partial\mu^2} & \dfrac{\partial^2 \ell}{\partial\mu\,\partial\sigma^2} \\[2mm] \dfrac{\partial^2 \ell}{\partial\sigma^2\,\partial\mu} & \dfrac{\partial^2 \ell}{\partial(\sigma^2)^2} \end{pmatrix}.$$
Example: Information matrix for the geometric distribution.
Differentiating the score, we find the observed information to be

$$I_o(\pi) = -\frac{d^2\ell(\pi)}{d\pi^2} = -\frac{du(\pi)}{d\pi} = -\frac{d}{d\pi}\, n\left(\frac{1}{\pi} - \frac{\bar y}{1-\pi}\right) = n\left[\frac{1}{\pi^2} + \frac{\bar y}{(1-\pi)^2}\right].$$

To find the expected information, we substitute $\bar y$ by $E(\bar Y) = E(Y_i) = (1-\pi)/\pi$ in $I_o(\pi)$ to obtain

$$I_e(\pi) = n\left[\frac{1}{\pi^2} + \frac{(1-\pi)/\pi}{(1-\pi)^2}\right] = n\left[\frac{1}{\pi^2} + \frac{1}{\pi(1-\pi)}\right] = n\,\frac{1-\pi+\pi}{\pi^2(1-\pi)} = \frac{n}{\pi^2(1-\pi)}.$$

Note that $I_e(\pi)$, which reflects the sharpness of the peak, increases with
the sample size $n$, since a larger sample provides more information
and hence the log-likelihood function is sharper at its peak. When
$n = 20$ and $\pi = 0.15$, the expected information is

$$I_e(0.15) = \frac{n}{\pi^2(1-\pi)} = \frac{20}{0.15^2(1-0.15)} = 1045.8.$$

If the sample mean is $\bar y = 3$, the observed information is

$$I_o(0.15) = n\left[\frac{1}{\pi^2} + \frac{\bar y}{(1-\pi)^2}\right] = 20\left[\frac{1}{0.15^2} + \frac{3}{(1-0.15)^2}\right] = 971.9.$$

Substituting the ML estimate $\hat\pi = 0.25$, the expected and observed
information agree, $I_o(0.25) = I_e(0.25) = 426.7$, since $\bar y = (1-\hat\pi)/\hat\pi$.
> score=function(pi) n*(1/pi-ym/(1-pi))
> Ie=function(pi) n/(pi^2*(1-pi))
> Io=n*(1/pi^2+ym/(1-pi)^2)
>
> logl1=n*(ym*log(1-pi)+log(pi))
> score1=n*(1/pi-ym/(1-pi))
> Ie1=n/(pi^2*(1-pi))
> c(pi[logl1==max(logl1)],pi[score1==0],max(logl1))
[1]   0.25000   0.25000 -44.98681
> c(Io[pi==0.15],Ie1[pi==0.15],Io[pi==0.25],Ie1[pi==0.25])
[1] 971.9339 1045.7516 426.6667 426.6667
>
> par(mfrow=c(2,2))
> plot(logl, col="red",xlab="pi",ylab="logl(pi)")
> points(pi[score1==0],logl1[pi==pi[score1==0]],pch=2,col="red",cex=0.6)
> title("log-likelihood function")
> plot(score, col="red",xlab="pi",ylab="score(pi)")
> abline(h = 0)
> points(pi[score1==0],0,pch=2,col="red",cex=0.6)
> title("score function")
> plot(Ie, col="red",xlab="pi",ylab="Ie(pi)")
> title("Expected information function")

1.5 Newton-Raphson and Fisher Scoring methods

Calculation of the ML estimate often requires iterative procedures.
Expanding the score function $u(\hat\theta)$ evaluated at the ML estimate $\hat\theta$
around a trial value $\theta^{0}$ using a first-order Taylor series gives

$$u(\hat\theta) = u(\theta^{0}) + \frac{\partial u(\theta^{0})}{\partial\theta}\,(\hat\theta - \theta^{0}) + \text{higher order terms in } (\hat\theta - \theta^{0}). \qquad (3)$$

Ignoring the higher-order terms, equating (3) to zero and solving for $\hat\theta$,
we have

$$\hat\theta \simeq \theta^{0} - \left(\frac{\partial u(\theta^{0})}{\partial\theta}\right)^{-1} u(\theta^{0}) \qquad (4)$$

since $u(\hat\theta) = 0$. Then the Newton-Raphson (NR) procedure to obtain an improved estimate $\theta^{(k+1)}$ using the estimate $\theta^{(k)}$ at the $k$-th
iteration is

$$\theta^{(k+1)} = \theta^{(k)} - \left(\frac{\partial^2 \ell(\theta)}{\partial\theta\,\partial\theta^\top}\right)^{-1} \frac{\partial \ell(\theta)}{\partial\theta}\Bigg|_{\theta = \theta^{(k)}}. \qquad (5)$$

The iterative procedure is repeated until the difference between $\theta^{(k+1)}$
and $\theta^{(k)}$ is sufficiently close to zero. Then (proof as exercise)

$$\mathrm{var}(\hat\theta) = I_o(\hat\theta)^{-1} = \left(-\frac{\partial^2 \ell(\hat\theta)}{\partial\theta\,\partial\theta^\top}\right)^{-1}.$$
For ML estimates, the log-likelihood is concave downwards at $\hat\theta$, so the second-order derivative $H(\hat\theta)$
is negative. The sharper the curvature (more information)
of $\ell(\theta)$, the more negative $H(\hat\theta)$ is, and hence the estimates have
smaller variance $\mathrm{var}(\hat\theta) = I_o(\hat\theta)^{-1} = -H(\hat\theta)^{-1}$. The NR procedure
tends to converge quickly if the log-likelihood is well-behaved (close to
quadratic) in a neighborhood of the ML estimate $\hat\theta$ and if the starting
value $\theta^{0}$ is reasonably close to $\hat\theta$.

An alternative procedure, first suggested by Fisher, is to replace the
observed information matrix $I_o(\theta)$ by its expected value $I_e(\theta)$. The procedure,
known as Fisher Scoring (FS), is

$$\theta^{(k+1)} = \theta^{(k)} - \left(E\left[\frac{\partial^2 \ell(\theta)}{\partial\theta\,\partial\theta^\top}\right]\right)^{-1} \frac{\partial \ell(\theta)}{\partial\theta}\Bigg|_{\theta = \theta^{(k)}}. \qquad (6)$$

For multimodal distributions, both methods may converge to a local
(not global) maximum.
Example: NR and FS methods for the geometric distribution.
Setting the score to zero leads to an explicit solution for the ML estimate, $\hat\pi = \dfrac{1}{1+\bar y}$, and no iteration is needed. For illustrative purposes,
the iterative procedure is performed. Using the previous results,

$$\frac{d\ell}{d\pi} = n\left(\frac{1}{\pi} - \frac{\bar y}{1-\pi}\right), \qquad
\frac{d^2\ell}{d\pi^2} = -n\left[\frac{1}{\pi^2} + \frac{\bar y}{(1-\pi)^2}\right], \qquad
E\!\left(\frac{d^2\ell}{d\pi^2}\right) = -\frac{n}{\pi^2(1-\pi)}.$$

The Fisher scoring procedure leads to the updating formula

$$\pi^{(k+1)} = \pi^{(k)} - \left[E\!\left(\frac{d^2\ell}{d\pi^2}\right)\right]^{-1} \frac{d\ell}{d\pi}\Bigg|_{\pi=\pi^{(k)}}
= \pi^{(k)} + \frac{(\pi^{(k)})^2(1-\pi^{(k)})}{n}\; n\left(\frac{1}{\pi^{(k)}} - \frac{\bar y}{1-\pi^{(k)}}\right)$$
$$= \pi^{(k)} + (\pi^{(k)})^2(1-\pi^{(k)})\,\frac{1-\pi^{(k)}-\pi^{(k)}\bar y}{\pi^{(k)}(1-\pi^{(k)})}
= \pi^{(k)} + (1-\pi^{(k)}-\pi^{(k)}\bar y)\,\pi^{(k)}.$$

If the sample mean is $\bar y = 3$ and we start from $\pi^{0} = 0.1$, say, the
procedure converges to the ML estimate $\hat\pi = 0.25$ in four iterations.
> n=20
> ym=3
> pi=0.1
> result=matrix(0,10,7)
> for (i in 1:10) {
+   dl=n*(1/pi-ym/(1-pi))
+   dl2=-n/(pi^2*(1-pi))
+   pi=pi-dl/dl2
+   #pi=pi+(1-pi-pi*ym)*pi
+   se=sqrt(-1/dl2)
+   l=n*(ym*log(1-pi)+log(pi))
+   step=(1-pi-pi*ym)*pi
+   result[i,]=c(i,pi,se,l,dl,dl2,step)
+ }
> colnames(result)=c("Iter","pi","se","l","dl","dl2","step")
> result
      Iter        pi         se         l           dl        dl2         step
 [1,]    1 0.1600000 0.02121320 -47.11283 1.333333e+02 -2222.2222 5.760000e-02
 [2,]    2 0.2176000 0.03279024 -45.22528 5.357143e+01  -930.0595 2.820096e-02
 [3,]    3 0.2458010 0.04303862 -44.99060 1.522465e+01  -539.8628 4.128512e-03
 [4,]    4 0.2499295 0.04773221 -44.98681 1.812051e+00  -438.9114 7.050785e-05
 [5,]    5 0.2500000 0.04840091 -44.98681 3.009750e-02  -426.8674 1.989665e-08
 [6,]    6 0.2500000 0.04841229 -44.98681 8.489239e-06  -426.6667 1.582068e-15
 [7,]    7 0.2500000 0.04841229 -44.98681 6.661338e-13  -426.6667 0.000000e+00
 [8,]    8 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
 [9,]    9 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
[10,]   10 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00

Alternatively, the Newton-Raphson procedure is

$$\pi^{(k+1)} = \pi^{(k)} - \left(\frac{d^2\ell}{d\pi^2}\right)^{-1}\frac{d\ell}{d\pi}\Bigg|_{\pi=\pi^{(k)}}
= \pi^{(k)} + \left[n\left(\frac{1}{(\pi^{(k)})^2} + \frac{\bar y}{(1-\pi^{(k)})^2}\right)\right]^{-1} n\left(\frac{1}{\pi^{(k)}} - \frac{\bar y}{1-\pi^{(k)}}\right)$$
$$= \pi^{(k)} + \frac{(\pi^{(k)})^2(1-\pi^{(k)})^2}{(1-\pi^{(k)})^2 + \bar y\,(\pi^{(k)})^2}\cdot\frac{1-\pi^{(k)}-\pi^{(k)}\bar y}{\pi^{(k)}(1-\pi^{(k)})}
= \pi^{(k)} + \frac{\pi^{(k)}(1-\pi^{(k)})(1-\pi^{(k)}-\pi^{(k)}\bar y)}{1 - 2\pi^{(k)} + (1+\bar y)(\pi^{(k)})^2}.$$
> n=20
> ym=3
> pi=0.1 #starting value
> result=matrix(0,10,7)
> for (i in 1:10) {
+   dl=20*(1/pi-ym/(1-pi))
+   dl2=-20*(1/pi^2+3/(1-pi)^2)
+   pi=pi-dl/dl2
+   #pi=pi+(1-pi)*(1-pi-pi*ym)*pi/(1-2*pi+4*pi^2)
+   se=sqrt(-1/dl2)
+   l=n*(ym*log(1-pi)+log(pi))
+   step=(1-pi)*(1-pi-pi*ym)*pi/(1-2*pi+4*pi^2)
+   result[i,]=c(i,pi,se,l,dl,dl2,step)
+ }
> colnames(result)=c("Iter","pi","se","l","dl","dl2","step")
> result
      Iter        pi         se         l           dl        dl2         step
 [1,]    1 0.1642857 0.02195775 -46.89107 1.333333e+02 -2074.0741 6.039726e-02
 [2,]    2 0.2246830 0.03477490 -45.13029 4.994426e+01  -826.9292 2.344114e-02
 [3,]    3 0.2481241 0.04490170 -44.98756 1.162661e+01  -495.9916 1.866426e-03
 [4,]    4 0.2499905 0.04816876 -44.98681 8.044145e-01  -430.9919 9.453797e-06
 [5,]    5 0.2500000 0.04841107 -44.98681 4.033823e-03  -426.6882 2.383524e-10
 [6,]    6 0.2500000 0.04841229 -44.98681 1.016970e-07  -426.6667 0.000000e+00
 [7,]    7 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
 [8,]    8 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
 [9,]    9 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
[10,]   10 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00

For both algorithms, $\hat\pi$, $u(\hat\pi)$ and $u'(\hat\pi)$ converge to 0.25, 0 (slope) and
$-426.6667$ (curvature) respectively. Note that the NR method, using the
exact $I_o(\pi)$, may converge faster than the FS method.
The maximization can also be done using a maximizer:

> logl = function(pi) -20*(3*log(1-pi)+log(pi))
> pi.hat = optimize(logl, c(0, 1), tol = 0.0001)
> pi.hat
$minimum
[1] 0.2500143
$objective
[1] 44.98681
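
An alternative sketch using optim() with hessian=TRUE: since logl above is the negative log-likelihood, the returned Hessian at the minimum is the observed information, so a standard error follows directly (the optimisation method and bounds here are illustrative choices):

> fit = optim(0.1, logl, method="L-BFGS-B", lower=1e-6, upper=1-1e-6, hessian=TRUE)
> fit$par               # about 0.25
> sqrt(1/fit$hessian)   # about 0.0484, matching the se in the tables above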

1.6 Expectation Maximization (EM) algorithm

1.6.1 Basic EM algorithm
The Expectation-Maximization (EM) algorithm was proposed by Dempster et al. (1977). It is an iterative approach for computing the maximum likelihood estimates (MLEs) for incomplete-data problems.
Let $y$ be the observed data, $z$ be the latent or missing data and $\theta$ be
the unknown parameters to be estimated. The functions $f(y|\theta)$ and
$f(y, z|\theta)$ are called the observed-data and complete-data likelihood
functions respectively. The observed-data likelihood $L_o(\theta) = f(y|\theta)$
is the expectation of $f(y|z, \theta)$ w.r.t. $f(z|\theta)$, that is,

$$f(y|\theta) = \int f(y, z|\theta)\,dz = \int f(y|z,\theta)\, f(z|\theta)\,dz = E_{z|\theta}[f(y|z,\theta)].$$

To find the ML estimate, one should maximize

$$\ell_o(\theta) = \ln f(y|\theta) = \ln \int f(y|z,\theta)\, f(z|\theta)\,dz = \ln E_{z|\theta}[f(y|z,\theta)].$$

The EM algorithm maximizes $\ell_o(\theta|\theta^{(k)})$ (the proof is given in the
appendix), which is equivalent to maximizing

$$E_{z|y,\theta^{(k)}}\{\ln f(y, z|\theta)\} = \int \ln f(y, z|\theta)\, f(z|y,\theta^{(k)})\,dz \qquad (7)$$

given $\theta^{(k)}$ in an iterative procedure. Note that it takes into account
the posterior distribution of $z$, i.e. $f(z|y,\theta^{(k)})$, and so it provides a
framework for estimating $z$ in the E-step. With the estimated $\hat z$, the
M-step is simplified, whereas the classical ML method requires direct
maximization of $\ell_o(\theta)$, which may involve integration over $f(z|\theta)$, a
prior distribution for $z$.
The EM algorithm consists of two steps, the E-step and the M-step (a generic skeleton is sketched after this list).
1. E-step: Evaluate the conditional expectation of the complete-data
   log-likelihood function, $\ell_c(\theta) = \ln f(y, \hat z^{(k)}|\theta)$, by replacing
   $z$ by $\hat z^{(k)} = E(z|y, \theta^{(k)})$.
2. M-step: Maximize $\ell_c(\theta) = \ln f(y, \hat z^{(k)}|\theta)$ w.r.t. $\theta$ to obtain
   $\theta^{(k+1)}$. Return to the E-step with $\theta^{(k+1)}$.
3. Stopping rule: Iterations (expectation of $z$ given $\theta^{(k)}$) within
   iterations (maximization of $\theta$ given $\hat z^{(k)}$ for each $k$) arise, and
   they should stop when $\|\theta^{(k+1)} - \theta^{(k)}\|$ is sufficiently small.
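
The algorithm can be written as a generic skeleton (a sketch only; e_step and m_step stand for the problem-specific conditional expectation and complete-data maximisation):

> em=function(theta, y, e_step, m_step, tol=1e-6, maxit=100) {
+   for (k in 1:maxit) {
+     zhat=e_step(theta, y)        # E-step: zhat = E(z | y, theta)
+     theta.new=m_step(zhat, y)    # M-step: maximise the complete-data log-likelihood
+     if (sqrt(sum((theta.new-theta)^2)) < tol) break
+     theta=theta.new
+   }
+   theta.new
+ }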
Remarks
1. The EM algorithm makes use of the Principle of Data Augmentation, which states that:
   EM inference: Augment the observed data $y$ with latent data
   $z$ so that the likelihood of the complete data $f(y, z|\theta)$ is simple, and then obtain the MLE of $\theta$ based on this complete likelihood function.
   Bayesian inference: Augment the observed data $y$ with latent data $z$ so that the augmented posterior density $f(\theta|y, z)$
   is simple, and then use this simple posterior distribution in
   sampling the parameters $\theta$.
2. The Bayesian approach simply treats $z$ as another latent variable and
   so the distinction between the E and M steps disappears. Both
   $\theta$ and $z$ are optimized through a (Markov) chain one at a time.
3. The EM algorithm can be applied to different missing- or incomplete-data situations, such as censored observations, random effects
   models, mixture models, and models with latent classes or latent
   variables.
4. The EM algorithm has a linear rate of convergence which depends on the proportion of information about $\theta$ in $f(y|\theta)$ which
   is observed. The convergence is usually slower than for the NR
   method.

Example: (Darwin data) The data contain two very low outliers.
We consider the mixture model:

$$y_i \sim \begin{cases} N(\mu_1, \sigma^2), & p = 0.9 \\ N(\mu_2, \sigma^2), & p = 0.1 \end{cases}
\qquad\text{or}\qquad
y_i \sim 0.9\, N(\mu_1, \sigma^2) + 0.1\, N(\mu_2, \sigma^2).$$

Let $w_{ij}$ be the indicator that observation $i$ comes from group $j$, $j = 1, 2$,
with $w_{i1} + w_{i2} = 1$. We don't know which normal distribution each
observation $y_i$ comes from; in other words, $w_{ij}$ is unobserved.
In the M-step, writing $r_{ij} = y_i - \mu_j$, the complete-data likelihood,
log-likelihood and their first and second order derivative functions are

$$L(\theta) = \prod_{i=1}^{n} [0.9\,\phi(y_i|\mu_1,\sigma^2)]^{w_{i1}}\, [0.1\,\phi(y_i|\mu_2,\sigma^2)]^{w_{i2}},$$
$$\ell(\theta) = \sum_{i=1}^{n} \big[ w_{i1}\ln 0.9 + w_{i1}\ln\phi(y_i|\mu_1,\sigma^2) + w_{i2}\ln 0.1 + w_{i2}\ln\phi(y_i|\mu_2,\sigma^2) \big],$$

where

$$\ln\phi(y_i|\mu_j,\sigma^2) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i-\mu_j)^2,$$
$$\frac{\partial}{\partial\mu_j}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{\sigma^2}(y_i-\mu_j) = \frac{r_{ij}}{\sigma^2}, \qquad
\frac{\partial}{\partial\sigma^2}\ln\phi(y_i|\mu_j,\sigma^2) = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(y_i-\mu_j)^2 = \frac{1}{2\sigma^4}(r_{ij}^2 - \sigma^2).$$

Hence

$$\frac{\partial\ell(\theta)}{\partial\mu_j} = \sum_{i=1}^{n} w_{ij}\,\frac{\partial}{\partial\mu_j}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^{n} w_{ij}\, r_{ij}, \quad j = 1, 2,$$
$$\frac{\partial\ell(\theta)}{\partial\sigma^2} = \sum_{i=1}^{n}\sum_{j=1}^{2} w_{ij}\,\frac{\partial}{\partial\sigma^2}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{2\sigma^4}\sum_{i=1}^{n}\sum_{j=1}^{2} w_{ij}(r_{ij}^2 - \sigma^2),$$
$$\frac{\partial^2\ell(\theta)}{\partial\mu_j^2} = -\frac{1}{\sigma^2}\sum_{i=1}^{n} w_{ij}, \qquad
\frac{\partial^2\ell(\theta)}{\partial(\sigma^2)^2} = -\frac{1}{\sigma^6}\sum_{i=1}^{n}\sum_{j=1}^{2} w_{ij}(r_{ij}^2 - \sigma^2) - \frac{n}{2\sigma^4},$$
$$\frac{\partial^2\ell(\theta)}{\partial\mu_j\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^{n} w_{ij}\, r_{ij}, \qquad
\frac{\partial^2\ell(\theta)}{\partial\mu_1\,\partial\mu_2} = 0.$$

In the E-step, the conditional expectation of $w_{i1}$ is

$$\hat w_{i1} = 1\cdot\Pr(W_{i1}=1|y_i) + 0\cdot\Pr(W_{i1}=0|y_i) = \Pr(W_{i1}=1|y_i)
= \frac{\Pr(W_{i1}=1, y_i)}{\Pr(y_i)} = \frac{\Pr(W_{i1}=1)\Pr(y_i|W_{i1}=1)}{\Pr(y_i)}$$
$$= \frac{\Pr(W_{i1}=1)\Pr(y_i|W_{i1}=1)}{\Pr(W_{i1}=1)\Pr(y_i|W_{i1}=1) + \Pr(W_{i1}=0)\Pr(y_i|W_{i1}=0)}
= \frac{0.9\,\phi(y_i|\mu_1,\sigma^2)}{0.9\,\phi(y_i|\mu_1,\sigma^2) + 0.1\,\phi(y_i|\mu_2,\sigma^2)}.$$

> y=c(-67,-48,6,8,14,16,23,24,28,29,41,49,56,60,75)
> n=length(y)
> p=3                    #no. of par.
> iterE=5
> iterM=10
> dim1=iterE*iterM
> dim2=2*p+3
> dl=c(rep(0,p))
> result=matrix(0,dim1,dim2)
> theta=c(30,-37,729)    #starting values
> for (k in 1:iterE) {   # E-step
+   ew1=0.9*exp(-0.5*(y-theta[1])^2/theta[3])
+   ew2=0.1*exp(-0.5*(y-theta[2])^2/theta[3])
+   w1=ew1/(ew1+ew2)
+   w1m=mean(w1)
+   w2=1-w1
+   sw1=sum(w1)
+   sw2=sum(w2)
+   for (i in 1:iterM) {  # M-step
+     r1=y-theta[1]
+     r2=y-theta[2]
+     s1=r1^2-theta[3]
+     s2=r2^2-theta[3]
+     dl[1]=sum(w1*r1)/theta[3]
+     dl[2]=sum(w2*r2)/theta[3]
+     dl[3]=0.5*(sum(w1*s1)+sum(w2*s2))/theta[3]^2
+     dl2=matrix(0,p,p)
+     dl2[1,1]=-sw1/theta[3]
+     dl2[2,2]=-sw2/theta[3]
+     dl2[3,3]=-(sum(w1*s1)+sum(w2*s2))/theta[3]^3-0.5*n/theta[3]^2
+     dl2[3,1]=dl2[1,3]=-sum(w1*r1)/theta[3]^2
+     dl2[3,2]=dl2[2,3]=-sum(w2*r2)/theta[3]^2
+     dl2i=solve(dl2)
+     theta=theta-dl2i%*%dl
+     se=sqrt(diag(-dl2i))
+     l=log(0.9)*sw1+log(0.1)*sw2-n*log(2*pi*theta[3])/2-(sum(w1*r1^2)+sum(w2*r2^2))/(2*theta[3])
+     row=(k-1)*10+i
+     result[row,]=c(k,i,theta[1],se[1],theta[2],se[2],theta[3],se[3],l)
+   }
+ }
> colnames(result)=c("iE","iM","mu1","se","mu2","se","sigma2","se","logL")
> result
      iE iM      mu1       se       mu2       se   sigma2        se      logL
 [1,]  1  1 33.48204 7.556567 -61.29223 20.36936 314.5368 333.38359 -74.72050
 [2,]  1  2 32.67069 4.932458 -55.63190 12.80706 426.8732  92.08656 -75.11445
 [3,]  1  3 32.28547 5.732025 -52.94443 14.65295 488.9407 144.09522 -74.00812
 [4,]  1  4 32.22150 6.132491 -52.49817 15.64174 500.6744 176.38060 -73.88313
 [5,]  1  5 32.21993 6.205593 -52.48720 15.82744 500.9943 182.76204 -73.88041
 [6,]  1  6 32.21993 6.207575 -52.48719 15.83249 500.9945 182.93721 -73.88041
 [7,]  1  7 32.21993 6.207576 -52.48719 15.83250 500.9945 182.93734 -73.88041
 [8,]  1  8 32.21993 6.207576 -52.48719 15.83250 500.9945 182.93734 -73.88041
 [9,]  1  9 32.21993 6.207576 -52.48719 15.83250 500.9945 182.93734 -73.88041
[10,]  1 10 32.21993 6.207576 -52.48719 15.83250 500.9945 182.93734 -73.88041
[11,]  2  1 33.13424 6.215236 -58.27681 15.93154 360.3297 207.03231 -72.33713
[12,]  2  2 32.95097 5.265502 -57.11632 13.42523 391.1934 125.81277 -72.12760
[13,]  2  3 32.93394 5.485894 -57.00847 13.98082 394.3075 142.27392 -72.09124
[14,]  2  4 32.93380 5.507683 -57.00761 14.03630 394.3344 143.97586 -72.09096
[15,]  2  5 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[16,]  2  6 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[17,]  2  7 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[18,]  2  8 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[19,]  2  9 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[20,]  2 10 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[21,]  3  1 32.99249 5.507838 -57.40463 14.03871 384.9492 145.69396 -71.91673
[22,]  3  2 32.99113 5.441860 -57.39541 13.86992 385.1901 140.51959 -71.91425
[23,]  3  3 32.99113 5.443563 -57.39540 13.87426 385.1903 140.65153 -71.91425
[24,]  3  4 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425
[25,]  3  5 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425
[26,]  3  6 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425
[27,]  3  7 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425
[28,]  3  8 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425
[29,]  3  9 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425
[30,]  3 10 32.99113 5.443564 -57.39540 13.87426 385.1903 140.65162 -71.91425
[31,]  4  1 32.99290 5.443541 -57.41186 13.87464 384.8527 140.71323 -71.90744
[32,]  4  2 32.99290 5.441156 -57.41185 13.86856 384.8531 140.52829 -71.90744
[33,]  4  3 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[34,]  4  4 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[35,]  4  5 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[36,]  4  6 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[37,]  4  7 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[38,]  4  8 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[39,]  4  9 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[40,]  4 10 32.99290 5.441158 -57.41185 13.86857 384.8531 140.52848 -71.90744
[41,]  5  1 32.99296 5.441157 -57.41245 13.86859 384.8414 140.53061 -71.90720
[42,]  5  2 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[43,]  5  3 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[44,]  5  4 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[45,]  5  5 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[46,]  5  6 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[47,]  5  7 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[48,]  5  8 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[49,]  5  9 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
[50,]  5 10 32.99296 5.441074 -57.41245 13.86837 384.8414 140.52421 -71.90720
> w=cbind(w1,w2)
> w
                w1           w2
 [1,] 2.315067e-05 9.999768e-01
 [2,] 2.004754e-03 9.979952e-01
 [3,] 9.984605e-01 1.539495e-03
 [4,] 9.990371e-01 9.629220e-04
 [5,] 9.997646e-01 2.353932e-04
 [6,] 9.998528e-01 1.471616e-04
 [7,] 9.999716e-01 2.842587e-05
 [8,] 9.999775e-01 2.247488e-05
 [9,] 9.999912e-01 8.782696e-06
[10,] 9.999931e-01 6.944001e-06
[11,] 9.999996e-01 4.143677e-07
[12,] 9.999999e-01 6.327540e-08
[13,] 1.000000e+00 1.222089e-08
[14,] 1.000000e+00 4.775591e-09
[15,] 1.000000e+00 1.408457e-10
> mean(w1)
[1] 0.866605

From $(\hat w_{i1}, \hat w_{i2})$, the first two observations belong to group 2 while all the
others belong to group 1. Hence the EM method enables classification, like cluster
analysis, an advantage over the classical likelihood method, where the
missing data $w_{ij}$ are integrated out, as the observed-data likelihood

$$L_o(\theta) = \prod_{i=1}^{n} \left[0.9\,\phi(y_i|\mu_1,\sigma^2) + 0.1\,\phi(y_i|\mu_2,\sigma^2)\right] \qquad (8)$$

is a marginal mixture of two distributions and contains no missing
observations.
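
For instance, a one-line classification rule based on the fitted weights (a sketch; the 0.5 threshold is an illustrative choice, not part of the notes):

> group=ifelse(w1>0.5, 1, 2)
> group    # the first two observations are assigned to group 2, the rest to group 1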
> x=rep(-0.001,n)
> x1=seq(-120,100,0.1)
> fx1=dnorm(x1,theta[1],sqrt(theta[3]))
> fx2=dnorm(x1,theta[2],sqrt(theta[3]))
> fx=0.9*fx1+0.1*fx2
> plot(x1, fx1, xlab="x", ylab="f(x)", ylim=c(-0.001,0.025),
+      xlim=c(-120,100), pch=20, col="red",cex=0.5)
> points(x1,fx2,pch=20,col="blue",cex=0.5)
> points(x1,fx,pch=20,cex=0.5)
> points(y,x,pch=20,cex=0.8)
> title("Mixture of normal distributions for Darwin data")

[Figure: "Mixture of normal distributions for Darwin data". The two component densities and the mixture density f(x) are plotted over x from -120 to 100, with the observations marked along the horizontal axis.]

Note that this is a mixture model where the two model densities are
represented by the blue and red lines. The mixing density in (8) is
represented by the black line.

Example: (Right-censored data with the Darwin data)
Suppose that the first four observations ($c_i$, $i = 1,\dots,4$) are right-censored
($y_i > c_i$) and we assume that

$$z_i,\ i = 1,\dots,4; \qquad y_i,\ i = 5,\dots,n \ \sim\ N(\mu, \sigma^2).$$

Let $\theta = (\mu, \sigma^2)^\top$, $z = (z_1,\dots,z_4)^\top$ and $y = (y_5,\dots,y_n)^\top$. Then

$$\ell_c(\theta) = \ln f(z, y|\theta) = -\frac{n}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{4}(z_i-\mu)^2 - \frac{1}{2\sigma^2}\sum_{i=5}^{n}(y_i-\mu)^2.$$

For the censored observations $z_i > c_i$, $i = 1,\dots,4$, the conditional
distribution is a truncated normal on $(c_i, \infty)$, with density function

$$f(z|\mu,\sigma,c_i) = \frac{\phi(z|\mu,\sigma^2)}{1 - \Phi\!\left(\dfrac{c_i-\mu}{\sigma}\right)}, \qquad z > c_i, \qquad (9)$$

where $\phi$ and $\Phi$ are the pdf and cdf of the normal distribution. Let $\theta^{(k)} =
(\mu^{(k)}, \sigma^{2(k)})^\top$ be the current estimate of $\theta$.
In the E-step, the conditional expectation of $z_i$, $i = 1,\dots,4$, given $y$,
$\theta^{(k)}$ and $c_i$ is

$$\hat z_i^{(k)} = E(z_i|y, \theta^{(k)}, c_i) = \int_{c_i}^{\infty} z\, f(z|\theta^{(k)}, c_i)\,dz
= \mu^{(k)} + \sigma^{(k)}\, \frac{\phi\!\left(\dfrac{c_i-\mu^{(k)}}{\sigma^{(k)}}\right)}{1-\Phi\!\left(\dfrac{c_i-\mu^{(k)}}{\sigma^{(k)}}\right)}
\quad\text{or}\quad \hat z_i^{(k)} \approx \frac{1}{S}\sum_{j=1}^{S} z_{ij}^{(k)},$$

since

$$\frac{1}{\sqrt{2\pi}}\int_{c}^{\infty} z \exp\!\left(-\tfrac{1}{2}z^2\right) dz
= \frac{1}{\sqrt{2\pi}}\int_{c}^{\infty} \exp\!\left(-\tfrac{1}{2}z^2\right) d\!\left(\tfrac{1}{2}z^2\right)
= \frac{1}{\sqrt{2\pi}}\exp\!\left(-\tfrac{1}{2}c^2\right),$$

where $z_{ij}^{(k)}$, $j = 1,\dots,S$, is simulated from $f(z|\mu^{(k)}, \sigma^{2(k)}, c_i)$ in (9) in
the Monte Carlo approximation of the conditional expectation.
In the M-step, $\hat z_i^{(k)}$ is substituted for the censored observation $z_i$.
With the complete data $(\hat z^{(k)}, y)$, $\hat\mu$ and $\hat\sigma^2$ of the normal data distribution are given by the sample mean and sample variance. Hence
no iteration is required for the M-step.
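
The closed-form conditional mean above can be checked against a Monte Carlo estimate (a sketch; the values of mu, sigma2 and the censoring point are illustrative, and rtnorm comes from the msm package used below):

> library("msm")
> mu=40; s2=400; cen1=6
> a=(cen1-mu)/sqrt(s2)
> mu+sqrt(s2)*dnorm(a)/(1-pnorm(a))                      # exact E(Z | Z > cen1)
> mean(rtnorm(100000, mean=mu, sd=sqrt(s2), lower=cen1)) # Monte Carlo approximation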

> library("msm")
> cy=c(6,8,14,16,23,24,28,29,41,49,56,60,75) #first 4 obs are censored
> w=c(0,0,0,0,1,1,1,1,1,1,1,1,1)
> n=length(cy)
> S=10000        #sim 10000 z for z hat
> m=4            #first 4 censored
> p=2            #2 parameters
> cen=cy[1:m]    #censored obs
> y=cy[(m+1):n]  #uncensored obs
> iterE=10
> dim=p+m+1
> result=matrix(0,iterE,dim)
> simz=matrix(0,m,S)
> z=rep(0,m)
> theta=c(mean(cy),var(cy))   #starting values for mu & sigma2
> for (k in 1:iterE) {        #E-step
+   for (j in 1:m) {
+     simz[j,]=rtnorm(S,mean=theta[1],sd=sqrt(theta[2]),lower=cen[j],
+       upper=Inf)            #monte carlo approx. of E(Z|Z>c)
+     z[j]=mean(simz[j,])
+ #   cz=(cen[j]-theta[1])/sqrt(theta[2])
+ #   z[j]=theta[1]+dnorm(cz)*sqrt(theta[2])/(1-pnorm(cz)) #exact
+   }
+   yr=c(z,y)
+   theta[1]=mean(yr)         #M-step
+   theta[2]=(sum(yr^2)-sum(yr)^2/n)/n
+   result[k,]=c(k,theta[1],theta[2],z[1],z[2],z[3],z[4])
+ }
> colnames(result)=c("iE","mu","sigma2","ez1","ez2","ez3","ez4")
> print(result,digit=5) #monte carlo approx. of E(Z|Z>c)
      iE     mu sigma2    ez1    ez2    ez3    ez4
 [1,]  1 41.630 211.67 37.262 37.842 40.054 41.036
 [2,]  2 42.647 208.03 41.943 42.259 42.455 42.755
 [3,]  3 42.921 208.09 42.540 43.007 43.614 43.810
 [4,]  4 43.017 208.11 43.281 43.385 43.655 43.900
 [5,]  5 43.040 208.17 43.154 43.441 43.733 44.188
 [6,]  6 43.048 208.21 42.985 43.371 44.071 44.201
 [7,]  7 43.046 208.18 43.158 43.473 43.689 44.277
 [8,]  8 43.036 208.14 43.393 43.317 43.721 44.039
 [9,]  9 43.061 208.19 43.375 43.298 43.968 44.156
[10,] 10 43.071 208.22 43.336 43.364 43.810 44.411

> print(result,digit=5) #exact E(Z|Z>c)
      iE     mu sigma2    ez1    ez2    ez3    ez4
 [1,]  1 41.659 211.47 37.377 37.996 40.180 41.018
 [2,]  2 42.660 208.05 41.948 42.062 42.638 42.932
 [3,]  3 42.930 208.06 42.889 42.983 43.478 43.737
 [4,]  4 43.006 208.12 43.148 43.239 43.718 43.969
 [5,]  5 43.027 208.14 43.221 43.311 43.785 44.035
 [6,]  6 43.033 208.15 43.242 43.332 43.805 44.054
 [7,]  7 43.035 208.15 43.248 43.337 43.810 44.059
 [8,]  8 43.035 208.15 43.249 43.339 43.812 44.061
 [9,]  9 43.036 208.15 43.250 43.339 43.812 44.061
[10,] 10 43.036 208.15 43.250 43.340 43.812 44.061

The convergence using the Monte Carlo approximation is subject to random
error from the simulation. Parameter estimates are given by the averages
over iterations.

1.6.2 Monte Carlo EM Algorithm

Given the current guess of the posterior mode, $\theta^{(k)}$, the conditional expectation in the E-step may involve integration and can be calculated
using a Monte Carlo (MC) approximation. Similarly, the complete-data
log-likelihood function $\ell_c(\theta) = \ln f(y, z|\theta)$ can also be approximated
using the MC approximation

$$\ell_c(\theta) = \ln \hat f(y, z|\theta) = \frac{1}{S}\sum_{j=1}^{S} \ln f(y, z_j^{(k)}|\theta), \qquad (10)$$

where $z_1^{(k)},\dots,z_S^{(k)} \sim f(z|\theta^{(k)}, y)$ as required in the E-step. This
maximizes an average of the log-likelihood $\ln \hat f(y, z|\theta)$ based on simulated values, which is different from maximizing $\ln f(y, \hat z|\theta)$ where $\hat z$ is
the average of the simulated values. Then, in the M-step, we maximize $\ell_c(\theta)$
in (10) to obtain a new guess, $\theta^{(k+1)}$.
Monitoring of convergence: plot each component of $\theta^{(k)}$ against the
iteration number $k$.
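
One simple way to do this for the example below (a sketch; it assumes the result matrix with the column names created in that code):

> par(mfrow=c(1,2))
> plot(result[,"mu"], type="l", xlab="iteration", ylab="mu")
> plot(result[,"sigma2"], type="l", xlab="iteration", ylab="sigma2")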
Example: (Right-censored data) Consider the Darwin data again.
In the E-step, the conditional expectation of $z_i$ given $y$, $\theta^{(k)}$ and $c_i$ is
given by (10) and estimated by drawing samples $z_{i1}^{(k)}, z_{i2}^{(k)},\dots, z_{iS}^{(k)}$ from
the truncated normal $f(z_i|\mu^{(k)}, \sigma^{(k)}, c_i)$ in (9) at the current estimates
$\theta^{(k)} = (\mu^{(k)}, \sigma^{(k)})^\top$.
In the M-step, one obtains an MC approximation to $\ln f(y, z|\theta)$ by

$$\ell_c(\theta) = \frac{1}{S}\sum_{j=1}^{S}\left[-\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left(\sum_{i=1}^{4}(z_{ij}^{(k)}-\mu)^2 + \sum_{i=5}^{n}(y_i-\mu)^2\right)\right]$$
$$= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left[\sum_{i=1}^{4}\frac{1}{S}\sum_{j=1}^{S}(z_{ij}^{(k)}-\mu)^2 + \sum_{i=5}^{n}(y_i-\mu)^2\right]$$

and maximizes it w.r.t. $\theta$ to obtain $\theta^{(k+1)}$ through iterations instead
of a closed-form solution. Write $r_i = y_i - \mu$, $i = 5,\dots,n$, $r_{ij} = z_{ij}^{(k)} - \mu$,
$\bar z_i = \frac{1}{S}\sum_{j=1}^{S} z_{ij}^{(k)}$ and $\bar r_i = \bar z_i - \mu$, $i = 1,\dots,4$. Then

$$\frac{\partial \ell(\theta)}{\partial \mu} = \frac{1}{\sigma^2}\left[\sum_{i=1}^{4}\frac{1}{S}\sum_{j=1}^{S}(z_{ij}^{(k)}-\mu) + \sum_{i=5}^{n}(y_i-\mu)\right] = \frac{1}{\sigma^2}\sum_{i=1}^{n}\bar r_i,$$
$$\frac{\partial \ell(\theta)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\left[\sum_{i=1}^{4}\frac{1}{S}\sum_{j=1}^{S} r_{ij}^2 + \sum_{i=5}^{n} r_i^2\right]
= \frac{1}{2\sigma^4}\left[\sum_{i=1}^{4}\left(\frac{1}{S}\sum_{j=1}^{S} r_{ij}^2 - \sigma^2\right) + \sum_{i=5}^{n}(r_i^2-\sigma^2)\right].$$

Since $\frac{1}{S}\sum_{j=1}^{S} r_{ij}^2 = \frac{1}{S}\sum_{j=1}^{S}(z_{ij}^{(k)}-\mu)^2 \neq \left(\frac{1}{S}\sum_{j=1}^{S} z_{ij}^{(k)} - \mu\right)^2 = \bar r_i^2$,
a closed-form solution using the sample mean and sample variance cannot be used. The second derivatives are

$$\frac{\partial^2 \ell(\theta)}{\partial\mu^2} = -\frac{n}{\sigma^2}, \qquad
\frac{\partial^2 \ell(\theta)}{\partial(\sigma^2)^2} = -\frac{1}{\sigma^6}\left[\sum_{i=1}^{4}\left(\frac{1}{S}\sum_{j=1}^{S} r_{ij}^2 - \sigma^2\right) + \sum_{i=5}^{n}(r_i^2-\sigma^2)\right] - \frac{n}{2\sigma^4},$$
$$\frac{\partial^2 \ell(\theta)}{\partial\mu\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^{n}\bar r_i.$$

> library("msm")
> cy=c(6,8,14,16,23,24,28,29,41,49,56,60,75) #first 4 obs are censor time
> w=c(0,0,0,0,1,1,1,1,1,1,1,1,1)
> mean(cy)
[1] 33
> n=length(cy)
> T=10000   #sim 10000 z for z hat
> m=4       #first 4 censored obs
> p=2       #2 pars

> cen=cy[1:m]    #censored obs
> y=cy[(m+1):n]  #uncensored obs
> iterE=5
> iterM=10
> dim1=iterE*iterM
> dim2=2*p+7
> dl=c(rep(0,p))
> dl2=matrix(0,p,p)
> result=matrix(0,dim1,dim2)
> simz=matrix(0,m,T)
> z=matrix(0,m,1)
> rz=rep(0,m)
> r2z=rep(0,m)
> theta=c(40,400)        #starting values for mu & var
> for (k in 1:iterE) {   #E-step
+   for (j in 1:m) {
+     simz[j,]=rtnorm(T,mean=theta[1],sd=sqrt(theta[2]),lower=cen[j],
+       upper=Inf)
+     z[j]=mean(simz[j,])
+   }
+   for (i in 1:iterM) {  #M-step
+     rz=z-theta[1]
+     ry=y-theta[1]
+     r=c(rz,ry)
+     r2z=apply((simz-theta[1])^2,1,mean)
+     r2=c(r2z,ry^2)
+     s2=r2-theta[2]
+     dl[1]=sum(r)/theta[2]
+     dl[2]=0.5*sum(s2)/theta[2]^2
+     dl2[1,1]=-n/theta[2]
+     dl2[2,2]=-sum(s2)/theta[2]^3-0.5*n/theta[2]^2
+     dl2[2,1]=dl2[1,2]=-sum(r)/theta[2]^2
+     dl2i=solve(dl2)
+     theta=theta-dl2i%*%dl
+     se=sqrt(diag(-dl2i))
+     l=-n*log(2*pi*theta[2])/2-sum(r^2)/(2*theta[2]) #pi=3.141593
+     row=(k-1)*10+i
+     result[row,]=c(k,i,theta[1],se[1],theta[2],se[2],l,z[1],z[2],
+       z[3],z[4])
+   }
+ }
> colnames(result)=c("iE","iM","mu","se","sigma2","se","logL","ez1",
+   "ez2","ez3","ez4")
> print(result,digit=5)
      iE iM     mu     se sigma2     se    logL    ez1    ez2    ez3    ez4
 [1,]  1  1 43.745 5.6699 288.50 160.37 -53.653 42.163 42.427 44.008 44.473
 [2,]  1  2 42.965 4.7218 301.25 113.42 -53.557 42.163 42.427 44.008 44.473
 [3,]  1  3 42.929 4.8139 301.86 118.16 -53.547 42.163 42.427 44.008 44.473
 [4,]  1  4 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [5,]  1  5 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [6,]  1  6 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [7,]  1  7 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [8,]  1  8 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [9,]  1  9 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[10,]  1 10 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[11,]  2  1 43.280 4.8205 287.03 118.44 -53.460 43.734 43.914 44.876 44.900
[12,]  2  2 43.263 4.6989 287.16 112.58 -53.458 43.734 43.914 44.876 44.900
[13,]  2  3 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[14,]  2  4 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[15,]  2  5 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[16,]  2  6 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[17,]  2  7 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[18,]  2  8 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[19,]  2  9 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[20,]  2 10 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[21,]  3  1 43.320 4.6999 284.71 112.63 -53.446 44.105 43.867 44.783 45.399
[22,]  3  2 43.320 4.6799 284.72 111.67 -53.446 44.105 43.867 44.783 45.399
[23,]  3  3 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[24,]  3  4 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[25,]  3  5 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[26,]  3  6 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[27,]  3  7 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[28,]  3  8 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[29,]  3  9 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[30,]  3 10 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[31,]  4  1 43.325 4.6799 284.61 111.68 -53.446 43.945 43.878 45.109 45.295
[32,]  4  2 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[33,]  4  3 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[34,]  4  4 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[35,]  4  5 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[36,]  4  6 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[37,]  4  7 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[38,]  4  8 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[39,]  4  9 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[40,]  4 10 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[41,]  5  1 43.343 4.6790 284.06 111.63 -53.443 44.072 44.284 44.957 45.149
[42,]  5  2 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[43,]  5  3 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[44,]  5  4 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[45,]  5  5 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[46,]  5  6 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[47,]  5  7 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[48,]  5  8 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[49,]  5  9 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[50,]  5 10 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149

References

Dempster, A.P., Laird, N. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38 (with discussion).
McLachlan, G.J. & Krishnan, T. (1997) The EM Algorithm and Extensions. Wiley.

1.7 Appendix for EM algorithm

To maximize $\ell_o(\theta) = \ln f(y|\theta)$, we wish to compute an updated estimate $\theta^{(k+1)}$ such that

$$\ell_o(\theta^{(k+1)}) > \ell_o(\theta^{(k)}).$$

The idea is to maximize alternately the function $\ell_o(\theta|\theta^{(k)})$, which
is (i) bounded above by $\ell_o(\theta^{(k+1)})$ at $\theta^{(k+1)}$ and (ii) equal to $\ell_o(\theta^{(k)})$
at $\theta^{(k)}$. Then any $\theta^{(k+1)}$ which increases $\ell_o(\theta^{(k+1)}|\theta^{(k)})$ also increases
$\ell_o(\theta^{(k+1)})$. Lastly, the EM algorithm chooses $\theta^{(k+1)}$ as the value of
$\theta$ for which $\ell_o(\theta|\theta^{(k)})$ is a maximum.
To show (i), we first consider maximizing the difference

$$\ell_o(\theta^{(k+1)}) - \ell_o(\theta^{(k)}) = \ln f(y|\theta^{(k+1)}) - \ln f(y|\theta^{(k)})$$
$$= \ln \int f(y|z,\theta^{(k+1)})\, f(z|\theta^{(k+1)})\,dz - \ln f(y|\theta^{(k)})$$
$$= \ln \int f(y|z,\theta^{(k+1)})\, f(z|\theta^{(k+1)})\, \frac{f(z|y,\theta^{(k)})}{f(z|y,\theta^{(k)})}\,dz - \ln f(y|\theta^{(k)})$$
$$\ge \int f(z|y,\theta^{(k)}) \ln \frac{f(y|z,\theta^{(k+1)})\, f(z|\theta^{(k+1)})}{f(z|y,\theta^{(k)})}\,dz - \int f(z|y,\theta^{(k)}) \ln f(y|\theta^{(k)})\,dz$$
$$= \int f(z|y,\theta^{(k)}) \ln \frac{f(y|z,\theta^{(k+1)})\, f(z|\theta^{(k+1)})}{f(z|y,\theta^{(k)})\, f(y|\theta^{(k)})}\,dz \triangleq \Delta(\theta^{(k+1)}|\theta^{(k)}),$$

since $\int f(z|y,\theta^{(k)})\,dz = 1$ and, $\ln(\cdot)$ being concave, $\ln\big(\sum_i \lambda_i y_i\big) \ge \sum_i \lambda_i \ln(y_i)$ for weights $\lambda_i$ summing to 1 (Jensen's inequality). Then define $\ell_o(\theta|\theta^{(k)}) \triangleq \ell_o(\theta^{(k)}) + \Delta(\theta|\theta^{(k)})$, such that

$$\ell_o(\theta^{(k+1)}) \ge \ell_o(\theta^{(k)}) + \Delta(\theta^{(k+1)}|\theta^{(k)}) \triangleq \ell_o(\theta^{(k+1)}|\theta^{(k)})$$

or, writing $\theta^{(k+1)} = \theta$,

$$\ell_o(\theta) \ge \ell_o(\theta^{(k)}) + \Delta(\theta|\theta^{(k)}) \triangleq \ell_o(\theta|\theta^{(k)}).$$

Hence $\ell_o(\theta^{(k+1)}|\theta^{(k)})$ is bounded above by $\ell_o(\theta^{(k+1)})$, or $\ell_o(\theta|\theta^{(k)})$ is bounded above by $\ell_o(\theta)$ in general.

[In the accompanying diagram (not reproduced here), $\theta^{(k+1)}$ corresponds to $\theta_{n+1}$, $\ell_o(\theta^{(k+1)}|\theta^{(k)})$ to $\ell(\theta|\theta_n)$ and $\ell_o(\theta^{(k+1)})$ to $L(\theta_{n+1})$.] The function $\ell_o(\theta|\theta^{(k)})$ is bounded above by
the log-likelihood function $\ell_o(\theta)$.
Next we show (ii), that $\ell_o(\theta|\theta^{(k)})$ and $\ell_o(\theta)$ are equal at $\theta = \theta^{(k)}$:

$$\ell_o(\theta^{(k)}|\theta^{(k)}) = \ell_o(\theta^{(k)}) + \Delta(\theta^{(k)}|\theta^{(k)})
= \ell_o(\theta^{(k)}) + \int f(z|y,\theta^{(k)}) \ln \frac{f(y|z,\theta^{(k)})\, f(z|\theta^{(k)})}{f(z|y,\theta^{(k)})\, f(y|\theta^{(k)})}\,dz$$
$$= \ell_o(\theta^{(k)}) + \int f(z|y,\theta^{(k)}) \ln \frac{f(y, z|\theta^{(k)})}{f(y, z|\theta^{(k)})}\,dz = \ell_o(\theta^{(k)}).$$

Hence any $\theta^{(k+1)}$ which increases $\ell_o(\theta^{(k+1)}|\theta^{(k)})$ also increases $\ell_o(\theta^{(k+1)})$.
Lastly, we show (iii), that the EM algorithm chooses $\theta^{(k+1)}$ for which
$\ell_o(\theta|\theta^{(k)})$ is a maximum. Since $\ell_o(\theta) \ge \ell_o(\theta|\theta^{(k)})$, increasing $\ell_o(\theta|\theta^{(k)})$
ensures that $\ell_o(\theta)$ is increased at each step.
To achieve the greatest increase in $\ell_o(\theta^{(k+1)})$, the EM algorithm selects the
$\theta^{(k+1)}$ which maximizes $\ell_o(\theta|\theta^{(k)})$, i.e.

$$\theta^{(k+1)} = \arg\max_{\theta}\big[\ell_o(\theta|\theta^{(k)})\big] = \arg\max_{\theta}\big[\ell_o(\theta^{(k)}) + \Delta(\theta|\theta^{(k)})\big]$$
$$= \arg\max_{\theta}\left[\ell_o(\theta^{(k)}) + \int f(z|y,\theta^{(k)}) \ln \frac{f(y|z,\theta)\, f(z|\theta)}{f(z|y,\theta^{(k)})\, f(y|\theta^{(k)})}\,dz\right]$$
$$= \arg\max_{\theta}\left[\int f(z|y,\theta^{(k)}) \ln\big[f(y|z,\theta)\, f(z|\theta)\big]\,dz\right] \quad\text{(dropping the terms constant w.r.t. } \theta\text{)}$$
$$= \arg\max_{\theta}\left[\int \ln f(y, z|\theta)\, f(z|y,\theta^{(k)})\,dz\right] = \arg\max_{\theta}\Big[E_{z|y,\theta^{(k)}}\{\ln f(y, z|\theta)\}\Big],$$

which proves (7).
