Contents

1 Maximum likelihood Inference
  1.1 Motivating examples
  1.2 Likelihood function
  1.3 Score vector
  1.4 Information matrix
  1.5 Newton-Raphson and Fisher Scoring methods
  1.6 Expectation Maximization (EM) algorithm
      1.6.1 Basic EM algorithm
      1.6.2 Monte Carlo EM Algorithm
  1.7 Appendix for EM algorithm
1 Maximum likelihood Inference

1.1 Motivating examples

Count data
Counts $y_i$ are observed over $n = 14$ time periods $x_i = i$. A Poisson regression models $Y_i \sim \text{Poisson}(\mu_i)$ with log link, $\ln\mu_i = a + bx_i$:

no=c(0,1,2,3,1,5,10,17,23,31,20,25,37,45)
time=c(1:14)
poi=glm(no~time, family=poisson(link=log))
summary(poi)
Call:
glm(formula = no ~ time, family = poisson(link = log))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
    ...      ...      ...   0.2545   2.6731

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.37571    0.24884    1.51    0.131
x            0.25365    0.02188   11.60   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 203.549  on 13  degrees of freedom
Residual deviance:  28.169  on 12  degrees of freedom
AIC: 85.358
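The fitted means follow directly from the printed coefficients; a minimal sketch (estimates copied from the summary above):

# Fitted Poisson means mu_i = exp(a + b*x_i) from the estimates above
time=c(1:14)
mu.hat=exp(0.37571+0.25365*time)
round(mu.hat,1)   # rises from about 1.9 to about 50.7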
Mice data
Twenty-six mice were given different levels $x_i$ of a drug. Outcomes $Y_i$ are whether they responded to the drug ($Y_i = 1$) or not ($Y_i = 0$).
The logistic regression model for the response probability $\pi_i = P(Y_i = 1)$ is
$$\operatorname{logit}(\pi_i) = \ln\frac{\pi_i}{1-\pi_i} = a + bx_i
\qquad\Longleftrightarrow\qquad
\pi_i = \frac{e^{a+bx_i}}{1+e^{a+bx_i}}.$$

> y=c(0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,1,1,1,1,1,1,1,1,1,1)
> dose=c(0:25)/10
> log=glm(y~dose, family=binomial(link=logit))
> summary(log)
Call:
glm(formula = y ~ dose, family = binomial(link = logit))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5766  -0.4757   0.1376   0.4129   2.1975

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -4.111      1.638  -2.510   0.0121 *
dose           3.581      1.316   2.722   0.0065 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 35.890  on 25  degrees of freedom
Residual deviance: 17.639  on 24  degrees of freedom
AIC: 21.639
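A quantity of interest in such dose-response fits is the dose at which the response probability reaches 0.5 (the LD50, $-a/b$, a standard formula not stated in the notes); a minimal sketch using the printed estimates:

# Estimated response curve and the dose giving pi = 0.5
a=-4.111; b=3.581
dose=c(0:25)/10
pi.hat=plogis(a+b*dose)   # = exp(a+b*x)/(1+exp(a+b*x))
-a/b                      # LD50, about 1.15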
[Figure: fitted Poisson regression curve (no versus time) and fitted logistic regression curve (response probability versus dose) for the two motivating examples.]
1.2 Likelihood function

Let $Y_1,\dots,Y_n$ be $n$ independent random variables (rv) with probability density functions (pdf) $f_i(y_i;\theta)$ depending on a vector-valued parameter $\theta$. The joint density of $y = (y_1,\dots,y_n)^\top$,
$$f(y;\theta) = \prod_{i=1}^n f_i(y_i;\theta) = L(\theta; y),$$
regarded as a function of the unknown parameter $\theta$ given the data $y$, is called the likelihood function. We often work with the logarithm of $L(\theta;y)$, the log-likelihood function:
$$\ell(\theta; y) = \ln L(\theta; y) = \sum_{i=1}^n \ln f_i(y_i;\theta).$$
Example: The log-likelihood for the geometric distribution.
For $n$ observations from a geometric distribution with pdf $f(y_i;\pi) = \pi(1-\pi)^{y_i}$ (the number of failures $y_i$ before the first success), the log-likelihood is
$$\ell(\pi; y) = \sum_{i=1}^n [y_i\ln(1-\pi) + \ln\pi] = n[\bar y\ln(1-\pi) + \ln\pi],$$
where $\bar y = \frac{1}{n}\sum_{i=1}^n y_i$.
[Figure: log-likelihood $\ell(\pi)$, score $u(\pi)$ and expected information $I_e(\pi)$ for the geometric example, plotted against $\pi \in (0,1)$.]
> n=20
> ym=3
> pi=c(1:100)/100
> logl=function(pi) n*(ym*log(1-pi)+log(pi))
1.3 Score vector

The score vector is the gradient of the log-likelihood, $u(\theta) = \partial\ell(\theta;y)/\partial\theta$. If the log-likelihood function is concave, the ML estimates $\hat\theta$ can be obtained by solving the system of equations:
$$u(\hat\theta) = 0.$$
Example: The score function for the geometric distribution.
The score function for $n$ observations from a geometric distribution is
$$u(\pi) = \frac{d\ell}{d\pi} = \frac{d}{d\pi}\, n[\bar y\ln(1-\pi) + \ln\pi] = n\left(\frac{1}{\pi} - \frac{\bar y}{1-\pi}\right).$$
Setting this equation to zero and solving for $\pi$ gives the ML estimate:
$$\frac{1}{\hat\pi} = \frac{\bar y}{1-\hat\pi}
\;\Rightarrow\; 1-\hat\pi = \hat\pi\bar y
\;\Rightarrow\; \hat\pi = \frac{1}{1+\bar y}
\quad\text{and}\quad \bar y = \frac{1-\hat\pi}{\hat\pi}.$$
Note that the ML estimate of the probability of success is the reciprocal of the average number of trials: the more trials it takes to get a success, the lower is the estimated probability of success.
For a sample of $n = 20$ observations and with a sample mean of $\bar y = 3$, the ML estimate is $\hat\pi = 1/(1+3) = 0.25$.
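The closed-form estimate can be confirmed numerically with base R's optimize(); a minimal sketch (not part of the original notes):

# Maximize the geometric log-likelihood over pi in (0,1)
n=20; ym=3
logl=function(pi) n*(ym*log(1-pi)+log(pi))
optimize(logl, interval=c(1e-6,1-1e-6), maximum=TRUE)$maximum   # ~0.25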
1.4 Information matrix

Since $\int f(y;\theta)\,dy = 1$, differentiating both sides w.r.t. $\theta$ under the integral sign gives
$$\int \frac{\partial f(y;\theta)}{\partial\theta}\,dy = 0
\;\Rightarrow\;
\int \frac{\partial\ln f(y;\theta)}{\partial\theta}\, f(y;\theta)\,dy = \int \frac{\partial\ell(\theta)}{\partial\theta}\, f(y;\theta)\,dy = 0,$$
that is, $E_y\!\left[\dfrac{\partial\ell(\theta)}{\partial\theta}\right] = 0$. Differentiating once more,
$$\int \left[\frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta^\top}\, f(y;\theta) + \frac{\partial\ell(\theta)}{\partial\theta}\,\frac{\partial f(y;\theta)}{\partial\theta^\top}\right] dy = 0$$
$$\int \left[\frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta^\top} + \frac{\partial\ell(\theta)}{\partial\theta}\,\frac{\partial\ell(\theta)}{\partial\theta^\top}\right] f(y;\theta)\,dy = 0$$
$$E_y\!\left[\frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta^\top}\right] + E_y\!\left[\left(\frac{\partial\ell(\theta)}{\partial\theta}\right)\!\left(\frac{\partial\ell(\theta)}{\partial\theta}\right)^{\!\top}\right] = 0. \quad (1)$$
Hence the score function is a random vector with zero mean,
$$E_y[u(\theta)] = E_y\!\left[\frac{\partial\ell(\theta)}{\partial\theta}\right] = 0,$$
and, by (1), its variance-covariance matrix is the expected (Fisher) information matrix
$$I_e(\theta) = E_y[u(\theta)u(\theta)^\top] = -E_y\!\left[\frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta^\top}\right] \quad (2)$$
and $I_o(\theta) = -\dfrac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta^\top} = -H(\theta)$ is sometimes called the observed information matrix. $I_o(\theta)$ indicates the extent to which $\ell(\theta)$ is peaked rather than flat: if $\ell$ is more peaked, $I_o(\theta)$ is more positive. For example, when $Y_i \sim N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$,
$$I_o(\theta) = -\begin{pmatrix}
\dfrac{\partial^2\ell}{\partial\mu^2} & \dfrac{\partial^2\ell}{\partial\mu\,\partial\sigma^2}\\[6pt]
\dfrac{\partial^2\ell}{\partial\sigma^2\,\partial\mu} & \dfrac{\partial^2\ell}{\partial(\sigma^2)^2}
\end{pmatrix}
= \begin{pmatrix}
\dfrac{n}{\sigma^2} & \dfrac{1}{\sigma^4}\sum_i (y_i-\mu)\\[6pt]
\dfrac{1}{\sigma^4}\sum_i (y_i-\mu) & \dfrac{1}{\sigma^6}\sum_i (y_i-\mu)^2 - \dfrac{n}{2\sigma^4}
\end{pmatrix}.$$
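Such derivative calculations can be checked numerically: base R's optimHess() returns the Hessian of a supplied function, so applying it to the negative log-likelihood gives $I_o$. A minimal sketch on simulated normal data (the sample yy is an assumption for illustration):

# Numerical observed information for N(mu, sigma2)
set.seed(1)
yy=rnorm(50, mean=10, sd=2)                          # hypothetical sample
negll=function(th) -sum(dnorm(yy, th[1], sqrt(th[2]), log=TRUE))
optimHess(c(mean(yy), var(yy)), negll)               # ~ 2x2 matrix I_o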
Example: Information matrix for the geometric distribution.
Differentiating the score, we find the observed information to be
$$I_o(\pi) = -\frac{d^2\ell(\pi)}{d\pi^2} = -\frac{du(\pi)}{d\pi}
= -\frac{d}{d\pi}\, n\left(\frac{1}{\pi} - \frac{\bar y}{1-\pi}\right)
= n\left[\frac{1}{\pi^2} + \frac{\bar y}{(1-\pi)^2}\right].$$
Taking expectation with $E(\bar Y) = (1-\pi)/\pi$, the expected information is
$$I_e(\pi) = n\left[\frac{1}{\pi^2} + \frac{(1-\pi)/\pi}{(1-\pi)^2}\right]
= n\left[\frac{1}{\pi^2} + \frac{1}{\pi(1-\pi)}\right]
= n\,\frac{(1-\pi)+\pi}{\pi^2(1-\pi)}
= \frac{n}{\pi^2(1-\pi)}.$$
For $n = 20$ and $\bar y = 3$, at $\pi = 0.15$,
$$I_e(0.15) = \frac{20}{0.15^2(1-0.15)} = 1045.8
\qquad\text{and}\qquad
I_o(0.15) = 20\left[\frac{1}{0.15^2} + \frac{3}{(1-0.15)^2}\right] = 971.9.$$
Substituting the ML estimate $\hat\pi = 0.25$, the expected and observed information coincide, $I_o(0.25) = I_e(0.25) = 426.7$, since $\bar y = (1-\hat\pi)/\hat\pi$ at $\hat\pi$.
> score=function(pi) n*(1/pi-ym/(1-pi))
> Ie=function(pi) n/(pi^2*(1-pi))
> Io=n*(1/pi^2+ym/(1-pi)^2)
>
> logl1=n*(ym*log(1-pi)+log(pi))
> score1=n*(1/pi-ym/(1-pi))
> Ie1=n/(pi^2*(1-pi))
> c(pi[logl1==max(logl1)],pi[score1==0],max(logl1))
[1]   0.25000   0.25000 -44.98681
> c(Io[pi==0.15],Ie1[pi==0.15],Io[pi==0.25],Ie1[pi==0.25])
[1]  971.9339 1045.7516  426.6667  426.6667
>
> par(mfrow=c(2,2))
> plot(logl, col="red", xlab="pi", ylab="logl(pi)")
> points(pi[score1==0],logl1[pi==pi[score1==0]],pch=2,col="red",cex=0.6)
> title("log-likelihood function")
> plot(score, col="red", xlab="pi", ylab="score(pi)")
> abline(h=0)
> points(pi[score1==0],0,pch=2,col="red",cex=0.6)
> title("score function")
> plot(Ie, col="red", xlab="pi", ylab="Ie(pi)")
> title("Expected information function")
1.5 Newton-Raphson and Fisher Scoring methods

Expanding the score function around a trial value $\theta_0$ in a first order Taylor series gives
$$u(\hat\theta) \simeq u(\theta_0) + \frac{\partial^2\ell(\theta_0)}{\partial\theta\,\partial\theta^\top}(\hat\theta - \theta_0) + \text{higher order terms in } (\hat\theta - \theta_0). \quad (3)$$
Ignoring higher order terms, equating (3) to zero and solving for $\hat\theta$, we have
$$\hat\theta \simeq \theta_0 - \left(\frac{\partial^2\ell(\theta_0)}{\partial\theta\,\partial\theta^\top}\right)^{-1} u(\theta_0), \quad (4)$$
which leads to the Newton-Raphson (NR) iterative procedure
$$\theta^{(k+1)} = \theta^{(k)} - \left.\left(\frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta^\top}\right)^{-1}\frac{\partial\ell(\theta)}{\partial\theta}\right|_{\theta = \theta^{(k)}}. \quad (5)$$
The iterative procedure is repeated until the difference between $\theta^{(k+1)}$ and $\theta^{(k)}$ is sufficiently close to zero. Then (proof as exercise)
$$\widehat{\operatorname{var}}(\hat\theta) = I_o(\hat\theta)^{-1} = \left(-\frac{\partial^2\ell(\hat\theta)}{\partial\theta\,\partial\theta^\top}\right)^{-1}.$$
For ML estimates, $\ell$ is concave downwards at $\hat\theta$ and the second order derivative $H(\hat\theta)$ is negative. The sharper the curvature (more information) of $\ell(\theta)$, the more negative $H(\hat\theta)$ is, and hence the smaller the variance $\operatorname{var}(\hat\theta) = I_o(\hat\theta)^{-1} = -H(\hat\theta)^{-1}$ of the estimates. The NR procedure tends to converge quickly if the log-likelihood is well-behaved (close to quadratic) in a neighborhood of the ML estimate $\hat\theta$ and if the starting value $\theta_0$ is reasonably close to $\hat\theta$.
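In practice, the estimate and its standard error can also be obtained with a general-purpose optimizer; a minimal sketch for the geometric example of Section 1.3, using optim() (this call is an illustration, not the notes' own code):

# hessian=TRUE returns the Hessian of the minimized function (-logl),
# i.e. the observed information, whose inverse estimates var(pi-hat)
negl=function(p) -20*(3*log(1-p)+log(p))
fit=optim(0.1, negl, method="Brent", lower=1e-6, upper=1-1e-6, hessian=TRUE)
c(fit$par, sqrt(1/fit$hessian))   # ~0.25 and se ~0.0484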
The Fisher Scoring (FS) method replaces the observed second derivative matrix in (5) by its expectation, giving
$$\theta^{(k+1)} = \theta^{(k)} - \left[E\!\left(\frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta^\top}\right)\right]^{-1}\left.\frac{\partial\ell(\theta)}{\partial\theta}\right|_{\theta = \theta^{(k)}}. \quad (6)$$
For multimodal distributions, both methods may converge to a local (rather than the global) maximum.
Example: NR and FS methods for the geometric distribution.
Setting the score to zero leads to an explicit solution for the ML estimate $\hat\pi = \dfrac{1}{1+\bar y}$ and no iteration is needed. For illustrative purposes, the iterative procedure is performed. Using the previous results,
$$\frac{d\ell}{d\pi} = n\left(\frac{1}{\pi} - \frac{\bar y}{1-\pi}\right), \qquad
\frac{d^2\ell}{d\pi^2} = -n\left[\frac{1}{\pi^2} + \frac{\bar y}{(1-\pi)^2}\right], \qquad
E\!\left(\frac{d^2\ell}{d\pi^2}\right) = -\frac{n}{\pi^2(1-\pi)}.$$
The FS iteration (6) is
$$\pi^{(k+1)} = \pi^{(k)} - \left.\left[E\!\left(\frac{d^2\ell}{d\pi^2}\right)\right]^{-1}\frac{d\ell}{d\pi}\right|_{\pi = \pi^{(k)}}
= \pi^{(k)} + \frac{(\pi^{(k)})^2(1-\pi^{(k)})}{n}\; n\left(\frac{1}{\pi^{(k)}} - \frac{\bar y}{1-\pi^{(k)}}\right)$$
$$= \pi^{(k)} + (\pi^{(k)})^2(1-\pi^{(k)})\,\frac{1-\pi^{(k)}-\pi^{(k)}\bar y}{\pi^{(k)}(1-\pi^{(k)})}
= \pi^{(k)} + \left(1 - \pi^{(k)} - \pi^{(k)}\bar y\right)\pi^{(k)}.$$
If the sample mean is $\bar y = 3$ and we start from $\pi^{(0)} = 0.1$, say, the procedure converges to the ML estimate $\hat\pi = 0.25$ in four iterations.
> n=20
> ym=3
> pi=0.1                       #starting value
> result=matrix(0,10,7)
> for (i in 1:10) {
+   dl=n*(1/pi-ym/(1-pi))      #score
+   dl2=-n/(pi^2*(1-pi))       #expected 2nd derivative (Fisher scoring)
+   pi=pi-dl/dl2
+   #pi=pi+(1-pi-pi*ym)*pi     #equivalent closed-form update
+   se=sqrt(-1/dl2)
+   l=n*(ym*log(1-pi)+log(pi))
+   step=(1-pi-pi*ym)*pi
+   result[i,]=c(i,pi,se,l,dl,dl2,step)
+ }
> colnames(result)=c("Iter","pi","se","l","dl","dl2","step")
> result
      Iter        pi         se         l           dl        dl2         step
 [1,]    1 0.1600000 0.02121320 -47.11283 1.333333e+02 -2222.2222 5.760000e-02
 [2,]    2 0.2176000 0.03279024 -45.22528 5.357143e+01  -930.0595 2.820096e-02
 [3,]    3 0.2458010 0.04303862 -44.99060 1.522465e+01  -539.8628 4.128512e-03
 [4,]    4 0.2499295 0.04773221 -44.98681 1.812051e+00  -438.9114 7.050785e-05
 [5,]    5 0.2500000 0.04840091 -44.98681 3.009750e-02  -426.8674 1.989665e-08
 [6,]    6 0.2500000 0.04841229 -44.98681 8.489239e-06  -426.6667 1.582068e-15
 [7,]    7 0.2500000 0.04841229 -44.98681 6.661338e-13  -426.6667 0.000000e+00
 [8,]    8 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
 [9,]    9 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
[10,]   10 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
The NR iteration (5) uses the observed second derivative instead:
$$\pi^{(k+1)} = \pi^{(k)} - \left.\left(\frac{d^2\ell}{d\pi^2}\right)^{-1}\frac{d\ell}{d\pi}\right|_{\pi = \pi^{(k)}}
= \pi^{(k)} + \left[n\left(\frac{1}{(\pi^{(k)})^2} + \frac{\bar y}{(1-\pi^{(k)})^2}\right)\right]^{-1} n\left(\frac{1}{\pi^{(k)}} - \frac{\bar y}{1-\pi^{(k)}}\right)$$
$$= \pi^{(k)} + \frac{(\pi^{(k)})^2(1-\pi^{(k)})^2}{1 - 2\pi^{(k)} + (\pi^{(k)})^2 + \bar y(\pi^{(k)})^2}\cdot\frac{1-\pi^{(k)}-\pi^{(k)}\bar y}{\pi^{(k)}(1-\pi^{(k)})}
= \pi^{(k)} + \frac{\pi^{(k)}(1-\pi^{(k)})(1-\pi^{(k)}-\pi^{(k)}\bar y)}{1 - 2\pi^{(k)} + (1+\bar y)(\pi^{(k)})^2}.$$
> n=20
> ym=3
> pi=0.1                          #starting value
> result=matrix(0,10,7)
> for (i in 1:10) {
+   dl=20*(1/pi-ym/(1-pi))        #score
+   dl2=-20*(1/pi^2+3/(1-pi)^2)   #observed 2nd derivative (Newton-Raphson)
+   pi=pi-dl/dl2
+   #pi=pi+(1-pi)*(1-pi-pi*ym)*pi/(1-2*pi+4*pi^2)  #closed-form update
+   se=sqrt(-1/dl2)
+   l=n*(ym*log(1-pi)+log(pi))
+   step=(1-pi)*(1-pi-pi*ym)*pi/(1-2*pi+4*pi^2)
+   result[i,]=c(i,pi,se,l,dl,dl2,step)
+ }
> colnames(result)=c("Iter","pi","se","l","dl","dl2","step")
> result
      Iter        pi         se         l           dl        dl2         step
 [1,]    1 0.1642857 0.02195775 -46.89107 1.333333e+02 -2074.0741 6.039726e-02
 [2,]    2 0.2246830 0.03477490 -45.13029 4.994426e+01  -826.9292 2.344114e-02
 [3,]    3 0.2481241 0.04490170 -44.98756 1.162661e+01  -495.9916 1.866426e-03
 [4,]    4 0.2499905 0.04816876 -44.98681 8.044145e-01  -430.9919 9.453797e-06
 [5,]    5 0.2500000 0.04841107 -44.98681 4.033823e-03  -426.6882 2.383524e-10
 [6,]    6 0.2500000 0.04841229 -44.98681 1.016970e-07  -426.6667 0.000000e+00
 [7,]    7 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
 [8,]    8 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
 [9,]    9 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
[10,]   10 0.2500000 0.04841229 -44.98681 0.000000e+00  -426.6667 0.000000e+00
1.6 Expectation Maximization (EM) algorithm

1.6.1 Basic EM algorithm
The Expectation-Maximization (EM) algorithm was proposed by Dempster et al. (1977). It is an iterative approach for computing the maximum likelihood estimates (MLEs) for incomplete-data problems.
Let $y$ be the observed data, $z$ be the latent or missing data and $\theta$ be the unknown parameters to be estimated. The functions $f(y|\theta)$ and $f(y,z|\theta)$ are called the observed data and complete data likelihood functions respectively. The observed data likelihood $L_o(\theta) = f(y|\theta)$ is the expectation of $f(y|z,\theta)$ w.r.t. $f(z|\theta)$, that is,
$$f(y|\theta) = \int f(y,z|\theta)\,dz = \int f(y|z,\theta)\,f(z|\theta)\,dz = E_z[f(y|z,\theta)].$$
Each EM iteration has two steps. In the E-step, given the current estimate $\theta^{(k)}$, the unobserved quantities in the complete data log-likelihood are replaced by their conditional expectations given $y$, forming
$$Q(\theta|\theta^{(k)}) = E_{z|y,\theta^{(k)}}[\ln f(y,z|\theta)].$$
In the M-step, $Q(\theta|\theta^{(k)})$ is maximized w.r.t. $\theta$ to give $\theta^{(k+1)}$, and the two steps are iterated until convergence (see the Appendix in Section 1.7).
Example: (Darwin data) The data contain two very low outliers. We consider the mixture model:
$$y_i \sim \begin{cases} N(\mu_1,\sigma^2), & p = 0.9,\\ N(\mu_2,\sigma^2), & p = 0.1, \end{cases}
\qquad\text{or}\qquad
y_i \sim 0.9\,N(\mu_1,\sigma^2) + 0.1\,N(\mu_2,\sigma^2).$$
Let $w_{ij}$ be the indicator that observation $i$ comes from group $j$, $j = 1,2$, with $w_{i1} + w_{i2} = 1$. We do not know which normal distribution each observation $y_i$ comes from; in other words, $w_{ij}$ is unobserved.
In the M-step, writing $r_{ij} = y_i - \mu_j$, the complete data likelihood, log-likelihood and their 1st and 2nd order derivative functions are
$$L(\theta) = \prod_{i=1}^n \phi(y_i|\mu_1,\sigma^2)^{w_{i1}}\,\phi(y_i|\mu_2,\sigma^2)^{w_{i2}},
\qquad
\ell(\theta) = \sum_{i=1}^n [w_{i1}\ln\phi(y_i|\mu_1,\sigma^2) + w_{i2}\ln\phi(y_i|\mu_2,\sigma^2)],$$
where
$$\ln\phi(y_i|\mu_j,\sigma^2) = -\frac12\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i-\mu_j)^2,$$
$$\frac{\partial}{\partial\mu_j}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{\sigma^2}(y_i-\mu_j) = \frac{r_{ij}}{\sigma^2},
\qquad
\frac{\partial}{\partial\sigma^2}\ln\phi(y_i|\mu_j,\sigma^2) = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(y_i-\mu_j)^2 = \frac{1}{2\sigma^4}(r_{ij}^2 - \sigma^2),$$
so that
$$\frac{\partial\ell(\theta)}{\partial\mu_j} = \sum_{i=1}^n w_{ij}\,\frac{\partial}{\partial\mu_j}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n w_{ij} r_{ij}, \quad j = 1,2,$$
$$\frac{\partial\ell(\theta)}{\partial\sigma^2} = \sum_{i=1}^n\sum_{j=1}^2 w_{ij}\,\frac{\partial}{\partial\sigma^2}\ln\phi(y_i|\mu_j,\sigma^2) = \frac{1}{2\sigma^4}\sum_{i=1}^n\sum_{j=1}^2 w_{ij}(r_{ij}^2 - \sigma^2),$$
$$\frac{\partial^2\ell(\theta)}{\partial\mu_j^2} = -\frac{1}{\sigma^2}\sum_{i=1}^n w_{ij},
\qquad
\frac{\partial^2\ell(\theta)}{\partial(\sigma^2)^2} = -\frac{1}{\sigma^6}\sum_{i=1}^n\sum_{j=1}^2 w_{ij}(r_{ij}^2 - \sigma^2) - \frac{n}{2\sigma^4},$$
$$\frac{\partial^2\ell(\theta)}{\partial\mu_j\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^n w_{ij} r_{ij},
\qquad
\frac{\partial^2\ell(\theta)}{\partial\mu_1\,\partial\mu_2} = 0.$$
In the E-step, the unobserved $w_{ij}$ are replaced by their conditional expectations given $y_i$ and the current estimates,
$$\hat w_{i1} = \frac{0.9\,\phi(y_i|\mu_1,\sigma^2)}{0.9\,\phi(y_i|\mu_1,\sigma^2) + 0.1\,\phi(y_i|\mu_2,\sigma^2)},
\qquad \hat w_{i2} = 1 - \hat w_{i1}.$$
> y=c(-67,-48,6,8,14,16,23,24,28,29,41,49,56,60,75)
> n=length(y)
> p=3                  #no. of par.
> iterE=5
> iterM=10
> dim1=iterE*iterM
> dim2=2*p+3
> dl=c(rep(0,p))
> result=matrix(0,dim1,dim2)
> theta=c(30,-37,729)  #starting values
> for (k in 1:iterE) {
+   # E-step
+   ew1=0.9*exp(-0.5*(y-theta[1])^2/theta[3])
+   ew2=0.1*exp(-0.5*(y-theta[2])^2/theta[3])
+   w1=ew1/(ew1+ew2)
+   w1m=mean(w1)
+   w2=1-w1
+   sw1=sum(w1)
+   sw2=sum(w2)
+   for (i in 1:iterM) {
+     # M-step
+     r1=y-theta[1]
+     r2=y-theta[2]
+     s1=r1^2-theta[3]
+     s2=r2^2-theta[3]
+     dl[1]=sum(w1*r1)/theta[3]
+     dl[2]=sum(w2*r2)/theta[3]
+     dl[3]=0.5*(sum(w1*s1)+sum(w2*s2))/theta[3]^2
+     dl2=matrix(0,p,p)
+     dl2[1,1]=-sw1/theta[3]
+     dl2[2,2]=-sw2/theta[3]
+     dl2[3,3]=-(sum(w1*s1)+sum(w2*s2))/theta[3]^3-0.5*n/theta[3]^2
+     dl2[3,1]=dl2[1,3]=-sum(w1*r1)/theta[3]^2
+     dl2[3,2]=dl2[2,3]=-sum(w2*r2)/theta[3]^2
+     dl2i=solve(dl2)
+     theta=theta-dl2i%*%dl
+     se=sqrt(diag(-dl2i))
+     l=log(0.9)*sw1+log(0.1)*sw2-n*log(2*pi*theta[3])/2-
+       (sum(w1*r1^2)+sum(w2*r2^2))/(2*theta[3])
+     row=(k-1)*10+i
+     result[row,]=c(k,i,theta[1],se[1],theta[2],se[2],theta[3],se[3],l)
+   }
+ }
> colnames(result)=c("iE","iM","mu1","se","mu2","se","sigma2","se","logL")
> result
      iE iM      mu1       se       mu2       se   sigma2        se      logL
 [1,]  1  1 33.48204 7.556567 -61.29223 20.36936 314.5368 333.38359 -74.72050
 [2,]  1  2 32.67069 4.932458 -55.63190 12.80706 426.8732  92.08656 -75.11445
 [3,]  1  3 32.28547 5.732025 -52.94443 14.65295 488.9407 144.09522 -74.00812
 [4,]  1  4 32.22150 6.132491 -52.49817 15.64174 500.6744 176.38060 -73.88313
 [5,]  1  5 32.21993 6.205593 -52.48720 15.82744 500.9943 182.76204 -73.88041
 [6,]  1  6 32.21993 6.207575 -52.48719 15.83249 500.9945 182.93721 -73.88041
 [7,]  1  7 32.21993 6.207576 -52.48719 15.83250 500.9945 182.93734 -73.88041
 [8,]  1  8 32.21993 6.207576 -52.48719 15.83250 500.9945 182.93734 -73.88041
 [9,]  1  9 32.21993 6.207576 -52.48719 15.83250 500.9945 182.93734 -73.88041
[10,]  1 10 32.21993 6.207576 -52.48719 15.83250 500.9945 182.93734 -73.88041
[11,]  2  1 33.13424 6.215236 -58.27681 15.93154 360.3297 207.03231 -72.33713
[12,]  2  2 32.95097 5.265502 -57.11632 13.42523 391.1934 125.81277 -72.12760
[13,]  2  3 32.93394 5.485894 -57.00847 13.98082 394.3075 142.27392 -72.09124
[14,]  2  4 32.93380 5.507683 -57.00761 14.03630 394.3344 143.97586 -72.09096
[15,]  2  5 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[16,]  2  6 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[17,]  2  7 32.93380 5.507870 -57.00761 14.03678 394.3344 143.99055 -72.09096
[18,]  2  8      ...      ...       ... 14.03678 394.3344       ... -72.09096
[19,]  2  9      ...      ...       ... 14.03678 394.3344       ... -72.09096
[20,]  2 10      ...      ...       ... 14.03678 394.3344       ... -72.09096
[21,]  3  1      ...      ...       ... 14.03871 384.9492       ... -71.91673
[22,]  3  2      ...      ...       ... 13.86992 385.1901       ... -71.91425
[23,]  3  3      ...      ...       ... 13.87426 385.1903       ... -71.91425
[24,]  3  4      ...      ...       ... 13.87426 385.1903       ... -71.91425
[25,]  3  5      ...      ...       ... 13.87426 385.1903       ... -71.91425
[26,]  3  6      ...      ...       ... 13.87426 385.1903       ... -71.91425
[27,]  3  7      ...      ...       ... 13.87426 385.1903       ... -71.91425
[28,]  3  8      ...      ...       ... 13.87426 385.1903       ... -71.91425
[29,]  3  9      ...      ...       ... 13.87426 385.1903       ... -71.91425
[30,]  3 10      ...      ...       ... 13.87426 385.1903       ... -71.91425
[31,]  4  1      ...      ...       ... 13.87464 384.8527       ... -71.90744
[32,]  4  2      ...      ...       ... 13.86856 384.8531       ... -71.90744
[33,]  4  3      ...      ...       ... 13.86857 384.8531       ... -71.90744
[34,]  4  4      ...      ...       ... 13.86857 384.8531       ... -71.90744
[35,]  4  5      ...      ...       ... 13.86857 384.8531       ... -71.90744
[36,]  4  6      ...      ...       ... 13.86857 384.8531       ... -71.90744
[37,]  4  7      ...      ...       ... 13.86857 384.8531       ... -71.90744
[38,]  4  8      ...      ...       ... 13.86857 384.8531       ... -71.90744
[39,]  4  9      ...      ...       ... 13.86857 384.8531       ... -71.90744
[40,]  4 10      ...      ...       ... 13.86857 384.8531       ... -71.90744
[41,]  5  1      ...      ...       ... 13.86859 384.8414       ... -71.90720
[42,]  5  2      ...      ...       ... 13.86837 384.8414       ... -71.90720
[43,]  5  3      ...      ...       ... 13.86837 384.8414       ... -71.90720
[44,]  5  4      ...      ...       ... 13.86837 384.8414       ... -71.90720
[45,]  5  5      ...      ...       ... 13.86837 384.8414       ... -71.90720
[46,]  5  6      ...      ...       ... 13.86837 384.8414       ... -71.90720
[47,]  5  7      ...      ...       ... 13.86837 384.8414       ... -71.90720
[48,]  5  8      ...      ...       ... 13.86837 384.8414       ... -71.90720
[49,]  5  9      ...      ...       ... 13.86837 384.8414       ... -71.90720
[50,]  5 10      ...      ...       ... 13.86837 384.8414       ... -71.90720
From $(\hat w_{i1}, \hat w_{i2})$, the first two observations belong to group 2 while the others all belong to group 1. Hence the EM method enables classification, like cluster analysis, an advantage over the classical likelihood method where the missing data $w_{ij}$ are integrated out in the observed data likelihood
$$L_o(\theta) = \prod_{i=1}^n [0.9\,\phi(y_i|\mu_1,\sigma^2) + 0.1\,\phi(y_i|\mu_2,\sigma^2)]. \quad (8)$$
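The classification can be read off the converged weights; a small sketch recomputing $\hat w_{i1}$ at the final estimates (theta as left by the loop above):

# Posterior probability that each observation belongs to group 1;
# the common normalizing constants of the two normal pdfs cancel
ew1=0.9*exp(-0.5*(y-theta[1])^2/theta[3])
ew2=0.1*exp(-0.5*(y-theta[2])^2/theta[3])
round(ew1/(ew1+ew2),3)   # first two values near 0, the rest near 1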
x=rep(-0.001,n)
x1=seq(-120,100,0.1)
fx1=dnorm(x1,theta[1],sqrt(theta[3]))
fx2=dnorm(x1,theta[2],sqrt(theta[3]))
fx=0.9*fx1+0.1*fx2
plot(x1, fx1, xlab="x", ylab="f(x)", ylim=c(-0.001,0.025),
  xlim=c(-120,100), pch=20, col="red", cex=0.5)
points(x1,fx2,pch=20,col="blue",cex=0.5)
points(x1,fx,pch=20,cex=0.5)
points(y,x,pch=20,cex=0.8)
title("Mixture of normal distributions for Darwin data")
[Figure: "Mixture of normal distributions for Darwin data"; the two component densities and the mixture density f(x) plotted over x in (-120, 100), with the data points marked along the x-axis.]
Note that this is a mixture model where the two component densities are represented by the blue and red lines. The mixture density in (8) is represented by the black line.
Example: (Censored data) Suppose the first $m = 4$ observations are right-censored: for these we observe only $c_i$ and the event
$$z_i > c_i, \quad (9)$$
where $\phi$ and $\Phi$ are the pdf and cdf functions for the normal. Let $\theta^{(k)} = (\mu^{(k)}, \sigma^{2(k)})$ be the current estimates of $\theta$.

In the E-step, the conditional expectation of $z_i$, $i = 1,\dots,4$, given $y$, $\theta^{(k)}$ and $c_i$ is
$$\hat z_i^{(k)} = E(z_i|y, \theta^{(k)}, c_i) = \int_{c_i}^{\infty} z\, f(z|\theta^{(k)}, c_i)\,dz
= \mu^{(k)} + \sigma^{(k)}\,\frac{\phi(c_i^{(k)})}{1 - \Phi(c_i^{(k)})},
\qquad c_i^{(k)} = \frac{c_i - \mu^{(k)}}{\sigma^{(k)}},$$
or, by Monte Carlo approximation, $\hat z_i^{(k)} \simeq \frac{1}{S}\sum_{j=1}^S z_{ij}^{(k)}$, where the $z_{ij}^{(k)}$ are draws from $N(\mu^{(k)}, \sigma^{2(k)})$ truncated to $(c_i,\infty)$. The exact formula follows since
$$\frac{1}{\sqrt{2\pi}}\int_c^{\infty} z\exp\!\left(-\frac{z^2}{2}\right)dz
= \frac{1}{\sqrt{2\pi}}\left[-\exp\!\left(-\frac{z^2}{2}\right)\right]_c^{\infty}
= \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{c^2}{2}\right) = \phi(c).$$
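The Monte Carlo and exact E-steps can be checked against each other; a minimal sketch at the starting values (the seed and simulation size are assumptions):

# E(Z | Z > c) for the first censored point: MC vs the exact formula
library(msm)
set.seed(1)
cy=c(6,8,14,16,23,24,28,29,41,49,56,60,75)
mu=mean(cy); s2=var(cy); c1=cy[1]
mean(rtnorm(10000, mean=mu, sd=sqrt(s2), lower=c1))   # Monte Carlo
cz=(c1-mu)/sqrt(s2)
mu+sqrt(s2)*dnorm(cz)/(1-pnorm(cz))                   # exact, about 37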
> library("msm")
> cy=c(6,8,14,16,23,24,28,29,41,49,56,60,75) #first 4 obs are censored
> w=c(0,0,0,0,1,1,1,1,1,1,1,1,1)
> n=length(cy)
> S=10000              #sim 10000 z for z hat
> m=4                  #first 4 censored
> p=2                  #2 parameters
> cen=cy[1:m]          #censored obs
> y=cy[(m+1):n]        #uncensored obs
> iterE=10
> dim=p+m+1
> result=matrix(0,iterE,dim)
> simz=matrix(0,m,S)
> z=rep(0,m)
> theta=c(mean(cy),var(cy))   #starting values for mu & sigma2
> for (k in 1:iterE) {
+   # E-step
+   for (j in 1:m) {
+     simz[j,]=rtnorm(S,mean=theta[1],sd=sqrt(theta[2]),lower=cen[j],
+       upper=Inf)     #monte carlo approx. of E(Z|Z>c)
+     z[j]=mean(simz[j,])
+ #   cz=(cen[j]-theta[1])/sqrt(theta[2])
+ #   z[j]=theta[1]+dnorm(cz)*sqrt(theta[2])/(1-pnorm(cz)) #exact
+   }
+   yr=c(z,y)
+   # M-step
+   theta[1]=mean(yr)
+   theta[2]=(sum(yr^2)-sum(yr)^2/n)/n
+   result[k,]=c(k,theta[1],theta[2],z[1],z[2],z[3],z[4])
+ }
> colnames(result)=c("iE","mu","sigma2","ez1","ez2","ez3","ez4")
> print(result,digit=5)   #monte carlo approx. of E(Z|Z>c)
      iE     mu sigma2    ez1    ez2    ez3    ez4
 [1,]  1 41.630 211.67 37.262 37.842 40.054 41.036
 [2,]  2 42.647 208.03 41.943 42.259 42.455 42.755
 [3,]  3 42.921 208.09 42.540 43.007 43.614 43.810
 [4,]  4 43.017 208.11 43.281 43.385 43.655 43.900
 [5,]  5 43.040 208.17 43.154 43.441 43.733 44.188
 [6,]  6    ... 208.21 42.985 43.371 44.201    ...
 [7,]  7    ... 208.18 43.158 43.473 44.277    ...
 [8,]  8    ... 208.14 43.393 43.317 44.039    ...
 [9,]  9    ... 208.19 43.375 43.298 44.156    ...
[10,] 10    ... 208.22 43.336 43.364 44.411    ...
1.6.2 Monte Carlo EM Algorithm

Given the current guess to the posterior mode, $\theta^{(k)}$, the conditional expectation in the E-step may involve integration and can be calculated using Monte Carlo (MC) approximation. Similarly the complete data log-likelihood function $\ell_c(\theta) = \ln f(y,z|\theta)$ can also be approximated using MC approximation:
$$\hat\ell_c^{(k)}(\theta) = \ln\hat f(y,z|\theta) = \frac{1}{S}\sum_{j=1}^S \ln f(y, z_j^{(k)}|\theta), \quad (10)$$
where $z_j^{(k)}$, $j = 1,\dots,S$, are draws from $f(z|y,\theta^{(k)})$.
For the censored normal example,
$$\hat\ell_c^{(k)}(\theta) = \frac{1}{S}\sum_{j=1}^S\left[-\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left(\sum_{i=1}^4 (z_{ij}^{(k)} - \mu)^2 + \sum_{i=5}^n (y_i - \mu)^2\right)\right]$$
$$= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left[\sum_{i=1}^4\frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)^2 + \sum_{i=5}^n (y_i - \mu)^2\right],$$
and one maximizes it w.r.t. $\theta$ to obtain $\theta^{(k+1)}$ through iterations instead of a closed-form solution. Write $r_i = y_i - \mu$, $i = 5,\dots,n$, $r_{ij} = z_{ij}^{(k)} - \mu$, $\bar z_i = \frac{1}{S}\sum_{j=1}^S z_{ij}^{(k)}$ and $\bar r_i = \bar z_i - \mu$, $i = 1,\dots,4$.
The first order derivatives are
$$\frac{\partial\ell(\theta)}{\partial\mu} = \frac{1}{\sigma^2}\left[\sum_{i=1}^4\frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu) + \sum_{i=5}^n (y_i - \mu)\right] = \frac{1}{\sigma^2}\sum_{i=1}^n \bar r_i$$
(with $\bar r_i = r_i$ for the uncensored observations $i = 5,\dots,n$) and
$$\frac{\partial\ell(\theta)}{\partial\sigma^2} = \frac{1}{2\sigma^4}\left[\sum_{i=1}^4\frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)^2 + \sum_{i=5}^n (y_i - \mu)^2 - n\sigma^2\right]
= \frac{1}{2\sigma^4}\left[\sum_{i=1}^4\frac{1}{S}\sum_{j=1}^S r_{ij}^2 + \sum_{i=5}^n r_i^2 - n\sigma^2\right].$$
Since
$$\frac{1}{S}\sum_{j=1}^S r_{ij}^2 = \frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)^2 \;\neq\; \left(\frac{1}{S}\sum_{j=1}^S (z_{ij}^{(k)} - \mu)\right)^{\!2} = \bar r_i^2,$$
a closed-form solution using the sample mean and sample variance cannot be used.
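The point is easy to verify numerically; a tiny sketch with simulated draws (values are illustrative only):

# The MC average of squares is not the square of the MC average
set.seed(1)
zz=rnorm(1000, mean=2, sd=1)
c(mean(zz^2), mean(zz)^2)   # about 5 and 4: not equal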
The second order derivatives are
$$\frac{\partial^2\ell(\theta)}{\partial\mu^2} = -\frac{n}{\sigma^2},
\qquad
\frac{\partial^2\ell(\theta)}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\left[\sum_{i=1}^4\frac{1}{S}\sum_{j=1}^S r_{ij}^2 + \sum_{i=5}^n r_i^2\right],
\qquad
\frac{\partial^2\ell(\theta)}{\partial\mu\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^n \bar r_i.$$
> library("msm")
> cy=c(6,8,14,16,23,24,28,29,41,49,56,60,75) #first 4 obs are censor times
> w=c(0,0,0,0,1,1,1,1,1,1,1,1,1)
> mean(cy)
[1] 33
> n=length(cy)
> T=10000    #sim 10000 z for z hat
> m=4        #first 4 censored obs
> p=2        #2 pars
> cen=cy[1:m]          #censored obs
> y=cy[(m+1):n]        #uncensored obs
> iterE=5
> iterM=10
> result=matrix(0,iterE*iterM,2*p+m+3)
> simz=matrix(0,m,T)
> z=rep(0,m)
> zs=rep(0,m)
> theta=c(mean(cy),var(cy))
> for (k in 1:iterE) {
+   # E-step: simulate z and form the MC averages of z and z^2
+   for (j in 1:m) {
+     simz[j,]=rtnorm(T,mean=theta[1],sd=sqrt(theta[2]),lower=cen[j],upper=Inf)
+     z[j]=mean(simz[j,])
+     zs[j]=mean(simz[j,]^2)
+   }
+   for (i in 1:iterM) {
+     # M-step: Newton-Raphson using the derivatives dl and dl2 above
+     rbar=c(z,y)-theta[1]
+     r2=c(zs-2*theta[1]*z+theta[1]^2,(y-theta[1])^2)
+     dl=c(sum(rbar)/theta[2],(sum(r2)-n*theta[2])/(2*theta[2]^2))
+     dl2=matrix(c(-n/theta[2],-sum(rbar)/theta[2]^2,
+                  -sum(rbar)/theta[2]^2,n/(2*theta[2]^2)-sum(r2)/theta[2]^3),2,2)
+     dl2i=solve(dl2)
+     theta=theta-as.vector(dl2i%*%dl)
+     se=sqrt(diag(-dl2i))
+     l=-n*log(2*pi*theta[2])/2-sum(r2)/(2*theta[2])
+     result[(k-1)*iterM+i,]=c(k,i,theta[1],se[1],theta[2],se[2],l,z[1],z[2],
+       z[3],z[4])
+   }
+ }
> colnames(result)=c("iE","iM","mu","se","sigma2","se","logL","ez1",
+   "ez2","ez3","ez4")
> print(result,digit=5)
      iE iM     mu     se sigma2     se    logL    ez1    ez2    ez3    ez4
 [1,]  1  1 43.745 5.6699 288.50 160.37 -53.653 42.163 42.427 44.008 44.473
 [2,]  1  2 42.965 4.7218 301.25 113.42 -53.557 42.163 42.427 44.008 44.473
 [3,]  1  3 42.929 4.8139 301.86 118.16 -53.547 42.163 42.427 44.008 44.473
 [4,]  1  4 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [5,]  1  5 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [6,]  1  6 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [7,]  1  7 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [8,]  1  8 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
 [9,]  1  9 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[10,]  1 10 42.929 4.8187 301.86 118.40 -53.547 42.163 42.427 44.008 44.473
[11,]  2  1 43.280 4.8205 287.03 118.44 -53.460 43.734 43.914 44.876 44.900
[12,]  2  2 43.263 4.6989 287.16 112.58 -53.458 43.734 43.914 44.876 44.900
[13,]  2  3 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[14,]  2  4 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[15,]  2  5 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[16,]  2  6 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[17,]  2  7 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[18,]  2  8 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[19,]  2  9 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[20,]  2 10 43.263 4.6999 287.16 112.63 -53.458 43.734 43.914 44.876 44.900
[21,]  3  1 43.320 4.6999 284.71 112.63 -53.446 44.105 43.867 44.783 45.399
[22,]  3  2 43.320 4.6799 284.72 111.67 -53.446 44.105 43.867 44.783 45.399
[23,]  3  3 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[24,]  3  4 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[25,]  3  5 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[26,]  3  6 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[27,]  3  7 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[28,]  3  8 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[29,]  3  9 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[30,]  3 10 43.320 4.6799 284.72 111.68 -53.446 44.105 43.867 44.783 45.399
[31,]  4  1 43.325 4.6799 284.61 111.68 -53.446 43.945 43.878 45.109 45.295
[32,]  4  2 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[33,]  4  3 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[34,]  4  4 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[35,]  4  5 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[36,]  4  6 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[37,]  4  7 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[38,]  4  8 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[39,]  4  9 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[40,]  4 10 43.325 4.6790 284.61 111.63 -53.446 43.945 43.878 45.109 45.295
[41,]  5  1 43.343 4.6790 284.06 111.63 -53.443 44.072 44.284 44.957 45.149
[42,]  5  2 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[43,]  5  3 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[44,]  5  4 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[45,]  5  5 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[46,]  5  6 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[47,]  5  7 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[48,]  5  8 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[49,]  5  9 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
[50,]  5 10 43.343 4.6745 284.06 111.42 -53.443 44.072 44.284 44.957 45.149
References

Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1-38.

McLachlan, G.J. & Krishnan, T. (1997) The EM Algorithm and Extensions. Wiley.
1.7 Appendix for EM algorithm

We show that each EM iteration can only increase the observed data log-likelihood $\ell_o(\theta) = \ln f(y|\theta)$. Writing $f(y|\theta^{(k+1)}) = \int f(y|z,\theta^{(k+1)})\,f(z|\theta^{(k+1)})\,dz$ and applying Jensen's inequality to the concave function $\ln$,
$$\ell_o(\theta^{(k+1)}) - \ell_o(\theta^{(k)})
= \ln\int f(z|y,\theta^{(k)})\,\frac{f(y|z,\theta^{(k+1)})\,f(z|\theta^{(k+1)})}{f(z|y,\theta^{(k)})}\,dz - \ln f(y|\theta^{(k)})$$
$$\geq \int f(z|y,\theta^{(k)})\ln\frac{f(y|z,\theta^{(k+1)})\,f(z|\theta^{(k+1)})}{f(z|y,\theta^{(k)})}\,dz - \int f(z|y,\theta^{(k)})\ln f(y|\theta^{(k)})\,dz$$
$$= \int f(z|y,\theta^{(k)})\ln\frac{f(y|z,\theta^{(k+1)})\,f(z|\theta^{(k+1)})}{f(z|y,\theta^{(k)})\,f(y|\theta^{(k)})}\,dz \;\triangleq\; \Delta(\theta^{(k+1)}|\theta^{(k)}),$$
since $\int f(z|y,\theta^{(k)})\,dz = 1$. Define $\tilde\ell_o(\theta|\theta^{(k)}) = \ell_o(\theta^{(k)}) + \Delta(\theta|\theta^{(k)})$, so that $\ell_o(\theta) \geq \tilde\ell_o(\theta|\theta^{(k)})$ for all $\theta$. At $\theta = \theta^{(k)}$, since $f(y|z,\theta)f(z|\theta) = f(y,z|\theta) = f(z|y,\theta)f(y|\theta)$,
$$\tilde\ell_o(\theta^{(k)}|\theta^{(k)}) = \ell_o(\theta^{(k)}) + \int f(z|y,\theta^{(k)})\ln\frac{f(y,z|\theta^{(k)})}{f(y,z|\theta^{(k)})}\,dz = \ell_o(\theta^{(k)}).$$
Hence any $\theta^{(k+1)}$ which increases $\tilde\ell_o(\theta^{(k+1)}|\theta^{(k)})$ above $\tilde\ell_o(\theta^{(k)}|\theta^{(k)}) = \ell_o(\theta^{(k)})$ also increases $\ell_o(\theta^{(k+1)})$. Lastly, the EM algorithm chooses $\theta^{(k+1)}$ for which $\tilde\ell_o(\theta|\theta^{(k)})$ is a maximum: since $\ell_o(\theta) \geq \tilde\ell_o(\theta|\theta^{(k)})$, increasing $\tilde\ell_o(\theta|\theta^{(k)})$ ensures that $\ell_o(\theta)$ is increased at each step.

To achieve the greatest increase in $\ell_o(\theta^{(k+1)})$, the EM algorithm selects $\theta^{(k+1)}$ which maximizes $\tilde\ell_o(\theta|\theta^{(k)})$, i.e.
$$\theta^{(k+1)} = \arg\max_\theta\left[\ell_o(\theta^{(k)}) + \int f(z|y,\theta^{(k)})\ln\frac{f(y|z,\theta)\,f(z|\theta)}{f(z|y,\theta^{(k)})\,f(y|\theta^{(k)})}\,dz\right]$$
$$= \arg\max_\theta\int f(z|y,\theta^{(k)})\ln[f(y|z,\theta)\,f(z|\theta)]\,dz
= \arg\max_\theta E_{z|y,\theta^{(k)}}\{\ln f(y,z|\theta)\},$$
which is exactly the maximization performed in the M-step.
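This monotone increase is visible in the examples; a small sketch checking the end-of-sweep log-likelihoods of the Darwin mixture run (values copied from the logL column above):

# Converged logL after each E-step sweep is strictly increasing
logL=c(-73.88041,-72.09096,-71.91425,-71.90744,-71.90720)
all(diff(logL)>0)   # TRUE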