Let $\tilde h = h - \hat h$ denote the estimation error, where $h$ is the actual value of the latent variable, $\hat h$ is its estimated value and $E[\hat h]$ is the expected value of this estimate, and let $F$ be the covariance of the estimation error. Since $h_t = A h_{t-1} + \eta^h_t$ (where $\eta^h_t$ is the process noise with covariance $\Sigma_H$) and the a priori estimate is $\hat h_{t|t-1} = A \hat h_{t-1|t-1}$, the a priori error satisfies $\tilde h_t = A \tilde h_{t-1} + \eta^h_t$, and because $\eta^h_t$ is independent of the earlier error the cross terms vanish in expectation. Then:

\[
\begin{aligned}
F_{t|t-1} &= E\big[\tilde h_t \tilde h_t^T\big] \\
&= E\big[(A\tilde h_{t-1} + \eta^h_t)(A\tilde h_{t-1} + \eta^h_t)^T\big] \\
&= E\big[A\tilde h_{t-1}\tilde h_{t-1}^T A^T + \eta^h_t \eta^{hT}_t + \eta^h_t \tilde h_{t-1}^T A^T + A\tilde h_{t-1}\eta^{hT}_t\big] \\
&= A\,E\big[\tilde h_{t-1}\tilde h_{t-1}^T\big]A^T + E\big[\eta^h_t \eta^{hT}_t\big] \\
&= A F_{t-1|t-1} A^T + \Sigma_H
\end{aligned} \tag{3.1}
\]
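As a quick numerical illustration of (3.1), here is a minimal NumPy sketch; the matrices `A`, `Sigma_H` and `F_prev` are arbitrary stand-ins rather than values from the text:

```python
import numpy as np

# Hypothetical 2-dimensional latent state: A is the transition matrix,
# Sigma_H the process-noise covariance, F_prev the posterior covariance
# F_{t-1|t-1} carried over from the previous step.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
Sigma_H = 0.01 * np.eye(2)
F_prev = 0.5 * np.eye(2)

# Equation (3.1): a priori covariance F_{t|t-1} = A F_{t-1|t-1} A^T + Sigma_H
F_pred = A @ F_prev @ A.T + Sigma_H
print(F_pred)
```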
The subscript in $F_{t|t-1}$ denotes the fact that this is $F$'s value before an observation is made at time $t$ (i.e. its a priori value), while $F_{t|t}$ would denote a value for $F$ after an observation is made (its posterior value). This more informative notation allows the update equation in (2.1) to be expressed as follows:
\[
\hat h_{t|t-1} = A \hat h_{t-1|t-1} \tag{3.2}
\]
Once we have an observation (and are therefore dealing with posterior values), we can define $\omega_t$ as the difference between the observation we'd expect to see given our estimate of the latent state (its a priori value) and the one actually observed, i.e.:

\[
\omega_t = v_t - B \hat h_{t|t-1} \tag{3.3}
\]
Now that we have an observation, if we wish to add a correction to our a priori estimate that is proportional to the error $\omega_t$, we can use a coefficient $G$:

\[
\hat h_{t|t} = \hat h_{t|t-1} + G\,\omega_t \tag{3.4}
\]
This allows us to express $F_{t|t}$ recursively:

\[
\begin{aligned}
F_{t|t} &= \operatorname{Cov}(h_t - \hat h_{t|t}) \\
&= \operatorname{Cov}\big(h_t - (\hat h_{t|t-1} + G\omega_t)\big) \\
&= \operatorname{Cov}\big(h_t - (\hat h_{t|t-1} + G(v_t - B\hat h_{t|t-1}))\big) \\
&= \operatorname{Cov}\big(h_t - (\hat h_{t|t-1} + G(Bh_t + \eta^v_t - B\hat h_{t|t-1}))\big) \\
&= \operatorname{Cov}\big(h_t - \hat h_{t|t-1} - GBh_t - G\eta^v_t + GB\hat h_{t|t-1}\big) \\
&= \operatorname{Cov}\big((I - GB)(h_t - \hat h_{t|t-1}) - G\eta^v_t\big) \\
&= \operatorname{Cov}\big((I - GB)(h_t - \hat h_{t|t-1})\big) + \operatorname{Cov}\big(G\eta^v_t\big) \\
&= (I - GB)\operatorname{Cov}(h_t - \hat h_{t|t-1})(I - GB)^T + G\operatorname{Cov}(\eta^v_t)G^T \\
&= (I - GB)F_{t|t-1}(I - GB)^T + G\Sigma_V G^T \\
&= (F_{t|t-1} - GBF_{t|t-1})(I - GB)^T + G\Sigma_V G^T \\
&= F_{t|t-1} - GBF_{t|t-1} - F_{t|t-1}(GB)^T + GBF_{t|t-1}(GB)^T + G\Sigma_V G^T \\
&= F_{t|t-1} - GBF_{t|t-1} - F_{t|t-1}B^T G^T + G\big(BF_{t|t-1}B^T + \Sigma_V\big)G^T
\end{aligned} \tag{3.5}
\]

Here the fourth line substitutes the observation model $v_t = Bh_t + \eta^v_t$, where $\eta^v_t$ is the observation noise with covariance $\Sigma_V$, and the covariance of the difference splits into a sum because $\eta^v_t$ is independent of the a priori estimation error.
If we define the innovation variance as $S_t = BF_{t|t-1}B^T + \Sigma_V$, then (3.5) becomes:

\[
F_{t|t} = F_{t|t-1} - GBF_{t|t-1} - F_{t|t-1}B^T G^T + G S_t G^T \tag{3.6}
\]
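Since it is easy to make a sign error in the expansion above, a short NumPy check that the symmetric form in the middle of (3.5) agrees with (3.6) for an arbitrary gain can be reassuring; all matrices below are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary placeholders: any symmetric F_{t|t-1} and Sigma_V will do.
F = np.array([[0.7, 0.2],
              [0.2, 0.5]])            # F_{t|t-1}
B = rng.normal(size=(1, 2))           # observation matrix
Sigma_V = np.array([[0.3]])           # observation-noise covariance
G = rng.normal(size=(2, 1))           # an arbitrary (not yet optimal) gain
I = np.eye(2)

S = B @ F @ B.T + Sigma_V             # innovation variance S_t

# Symmetric form from the middle of (3.5) ...
joseph = (I - G @ B) @ F @ (I - G @ B).T + G @ Sigma_V @ G.T
# ... and the expanded form (3.6)
expanded = F - G @ B @ F - F @ B.T @ G.T + G @ S @ G.T

print(np.allclose(joseph, expanded))  # True
```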
4 Minimizing the state estimate variance

If we wish to minimize the variance $F_{t|t}$ of the state estimate, we can use the mean square error (MSE) measure:

\[
E\big[\|h_t - \hat h_{t|t}\|^2\big] = \operatorname{Tr}\big(\operatorname{Cov}(h_t - \hat h_{t|t})\big) = \operatorname{Tr}(F_{t|t}) \tag{4.1}
\]

The only coefficient we have control over is $G$, so we wish to find the $G$ that gives us the minimum MSE, i.e. we need to find $G$ such that $\partial \operatorname{Tr}(F_{t|t}) / \partial G = 0$. Substituting (3.6):

\[
\begin{aligned}
\frac{\partial}{\partial G}\operatorname{Tr}\big(F_{t|t-1} - GBF_{t|t-1} - F_{t|t-1}B^T G^T + G S_t G^T\big) &= -2\big(BF_{t|t-1}\big)^T + 2GS_t = 0 \\
\implies G &= F_{t|t-1}B^T S_t^{-1}
\end{aligned} \tag{4.2}
\]

where the last step uses the symmetry of $F_{t|t-1}$ and $S_t$.
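In code, (4.2) is usually evaluated with a linear solve rather than an explicit inverse. A minimal sketch with placeholder matrices (the dimensions are arbitrary):

```python
import numpy as np

B = np.array([[1.0, 0.0]])          # observation matrix (1 observed dim, 2 latent)
Sigma_V = np.array([[0.1]])         # observation-noise covariance
F_pred = np.array([[0.6, 0.1],
                   [0.1, 0.4]])     # a priori covariance F_{t|t-1}

# Innovation variance S_t = B F_{t|t-1} B^T + Sigma_V
S = B @ F_pred @ B.T + Sigma_V

# Kalman gain K = F_{t|t-1} B^T S^{-1}, computed via a solve:
# solving S^T X = (F B^T)^T and transposing gives X^T = F B^T S^{-1}.
K = np.linalg.solve(S.T, (F_pred @ B.T).T).T
print(K)
```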
This optimum value for $G$ in terms of minimizing the MSE is known as the Kalman Gain and will be denoted $K$. If we multiply both sides of (4.2) on the right by $S_t K^T$:

\[
K S_t K^T = F_{t|t-1} B^T K^T \tag{4.3}
\]
Substituting this into (3.6):

\[
\begin{aligned}
F_{t|t} &= F_{t|t-1} - KBF_{t|t-1} - F_{t|t-1}B^T K^T + F_{t|t-1}B^T K^T \\
&= (I - KB)F_{t|t-1}
\end{aligned} \tag{4.4}
\]
5 Filtered Latent State Estimation Procedure (The Kalman Filter)

The procedure for estimating the state of $h_t$, which when using the MSE-optimal gain $K$ is called Kalman Filtering, proceeds as follows:

1. Choose initial values for $\hat h$ and $F$ (i.e. $\hat h_{0|0}$ and $F_{0|0}$).

2. Advance the latent state estimate: $\hat h_{t|t-1} = A \hat h_{t-1|t-1}$

3. Advance the estimate covariance: $F_{t|t-1} = A F_{t-1|t-1} A^T + \Sigma_H$

4. Make an observation $v_t$.

5. Calculate the innovation: $\omega_t = v_t - B \hat h_{t|t-1}$

6. Calculate $S_t$: $S_t = B F_{t|t-1} B^T + \Sigma_V$

7. Calculate $K$: $K = F_{t|t-1} B^T S_t^{-1}$

8. Update the latent state estimate: $\hat h_{t|t} = \hat h_{t|t-1} + K \omega_t$

9. Update the estimate covariance (from (4.4)): $F_{t|t} = (I - KB) F_{t|t-1}$

10. Cycle through stages 2 to 9 for each time step.
Note that $\hat h_{t|t}$ and $F_{t|t}$ correspond to $\mu_t$ and $\sigma^2_t$ from (2.4).
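The whole procedure is short enough to state as code. Below is a minimal NumPy sketch of steps 1 to 10; the model parameters and observations are hypothetical values chosen only to make the example self-contained, and `np.linalg.inv` is used for clarity where a solve would be preferred in practice:

```python
import numpy as np

def kalman_filter(vs, A, B, Sigma_H, Sigma_V, h0, F0):
    """Run steps 2-9 of section 5 over a sequence of observations vs.

    Returns the filtered means h_{t|t} and covariances F_{t|t}."""
    h, F = h0, F0
    hs, Fs = [], []
    I = np.eye(len(h0))
    for v in vs:                           # step 4: each observation v_t
        # Steps 2-3: advance state estimate and covariance (a priori values)
        h_pred = A @ h
        F_pred = A @ F @ A.T + Sigma_H
        # Steps 5-7: innovation, innovation variance, Kalman gain
        omega = v - B @ h_pred
        S = B @ F_pred @ B.T + Sigma_V
        K = F_pred @ B.T @ np.linalg.inv(S)
        # Steps 8-9: posterior update
        h = h_pred + K @ omega
        F = (I - K @ B) @ F_pred
        hs.append(h)
        Fs.append(F)
    return np.array(hs), np.array(Fs)

# Hypothetical 1-D random-walk model observed with noise
A = np.array([[1.0]]); B = np.array([[1.0]])
Sigma_H = np.array([[0.01]]); Sigma_V = np.array([[0.25]])
vs = np.array([[0.2], [0.4], [0.3], [0.7], [0.9]])
hs, Fs = kalman_filter(vs, A, B, Sigma_H, Sigma_V,
                       h0=np.zeros(1), F0=np.eye(1))
print(hs.squeeze())
```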
6 Smoothed Latent State Estimation

The smoothed probability of the latent variable is the probability that it had a given value at time $t$ after a sequence of $T$ observations, i.e. $p(h_t|v_{1:T})$. Unlike the Kalman Filter, which can be updated with each observation, one has to wait until $T$ observations have been made and then retrospectively calculate the probability that the latent variable had a given value at time $t$, where $t < T$.

Commencing at the final time step in the sequence ($t = T$) and working backwards to the start ($t = 1$), $p(h_t|v_{1:T})$ can be evaluated as follows:
\[
p(h_t|v_{1:T}) = \int_{h_{t+1}} p(h_t|h_{t+1}, v_{1:T})\, p(h_{t+1}|v_{1:T})
\]

Since $h_t \perp v_{t+1:T} \mid h_{t+1}$:

\[
p(h_t|h_{t+1}, v_{1:T}) = p(h_t|h_{t+1}, v_{1:t})
\]

so that

\[
\begin{aligned}
p(h_t|v_{1:T}) &= \int_{h_{t+1}} p(h_t|h_{t+1}, v_{1:t})\, p(h_{t+1}|v_{1:T}) \\
&= \int_{h_{t+1}} \frac{p(h_{t+1}, h_t|v_{1:t})\, p(h_{t+1}|v_{1:T})}{p(h_{t+1}|v_{1:t})} \\
&= \int_{h_{t+1}} \frac{p(h_{t+1}|h_t, v_{1:t})\, p(h_t|v_{1:t})\, p(h_{t+1}|v_{1:T})}{p(h_{t+1}|v_{1:t})} \\
&= \int_{h_{t+1}} \frac{p(h_{t+1}|h_t)\, p(h_t|v_{1:t})\, p(h_{t+1}|v_{1:T})}{p(h_{t+1}|v_{1:t})} \qquad \text{(since } h_{t+1} \perp v_{1:t} \mid h_t\text{)}
\end{aligned} \tag{6.1}
\]
As before, we know that $p(h_t|v_{1:T})$ will be a Gaussian and we will need to establish its mean and variance at each $t$, i.e. in a similar manner to (2.4):

\[
p(h_t|v_{1:T}) \sim N\big(h^s_t, F^s_t\big) \tag{6.2}
\]
Using the filtered values $\hat h_{t|t}$ and $F_{t|t}$ calculated in the previous section for each time step, the procedure for estimating the smoothed parameters $h^s_t$ and $F^s_t$ works backwards from the last time step in the sequence, i.e. from $t = T$, as follows:

1. Set $h^s_T$ and $F^s_T$ to $\hat h_{T|T}$ and $F_{T|T}$ from steps 8 and 9 in section 5.

2. Calculate $A^s_t$:
\[ A^s_t = (A F_{t|t})^T \big(A F_{t|t} A^T + \Sigma_H\big)^{-1} \]

3. Calculate $S^s_t$:
\[ S^s_t = F_{t|t} - A^s_t A F_{t|t} \]

4. Calculate the smoothed latent variable estimate $h^s_t$:
\[ h^s_t = A^s_t h^s_{t+1} + \hat h_{t|t} - A^s_t A \hat h_{t|t} \]

5. Calculate the smoothed estimate covariance $F^s_t$:
\[ F^s_t = \tfrac{1}{2}\Big[\big(A^s_t F^s_{t+1} (A^s_t)^T + S^s_t\big) + \big(A^s_t F^s_{t+1} (A^s_t)^T + S^s_t\big)^T\Big] \]

6. Calculate the smoothed cross-covariance $X^s_{t+1} = \operatorname{Cov}(h_{t+1}, h_t \mid v_{1:T})$, which will be needed in section 7:
\[ X^s_{t+1} = F^s_{t+1} (A^s_t)^T \]

7. Cycle through stages 2 to 6 for each time step backwards through the sequence, from $t = T - 1$ down to $t = 1$.
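A matching sketch of the backward pass, in the same hypothetical NumPy style as the filter above; the cross-covariance $X^s_{t+1} = F^s_{t+1}(A^s_t)^T$ from step 6 is also accumulated since section 7 needs it:

```python
import numpy as np

def kalman_smoother(hs_f, Fs_f, A, Sigma_H):
    """Backward pass of section 6 (steps 1-6). hs_f and Fs_f are the
    filtered h_{t|t} and F_{t|t} for t = 1..T from the Kalman filter.
    Returns smoothed means, covariances and cross-covariances X^s."""
    T, d = hs_f.shape
    hs_s, Fs_s = hs_f.copy(), Fs_f.copy()     # step 1: initialise at t = T
    Xs = np.zeros((T, d, d))                  # Xs[t] ~ Cov(h_t, h_{t-1} | v_{1:T})
    for t in range(T - 2, -1, -1):            # t = T-1, ..., 1 (0-based here)
        F_pred = A @ Fs_f[t] @ A.T + Sigma_H             # F_{t+1|t}
        A_s = (A @ Fs_f[t]).T @ np.linalg.inv(F_pred)    # step 2
        S_s = Fs_f[t] - A_s @ A @ Fs_f[t]                # step 3
        hs_s[t] = A_s @ hs_s[t + 1] + hs_f[t] - A_s @ A @ hs_f[t]  # step 4
        M = A_s @ Fs_s[t + 1] @ A_s.T + S_s              # step 5, symmetrised
        Fs_s[t] = 0.5 * (M + M.T)
        Xs[t + 1] = Fs_s[t + 1] @ A_s.T                  # step 6
    return hs_s, Fs_s, Xs
```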
7 Expectation Maximization (Calibrating the Kalman Filter)

The procedures outlined in the previous sections are fine if we assume that we know the values in the parameter set $\theta = \{\mu_0, \sigma^2_0, A, \Sigma_H, B, \Sigma_V\}$, but in order to learn these values we will need to perform the Expectation Maximization (EM) algorithm.

The joint probability of $T$ time steps of the latent and observable variables is:

\[
p(h_{1:T}, v_{1:T}) = p(h_1) \prod_{t=2}^{T} p(h_t|h_{t-1}) \prod_{t=1}^{T} p(v_t|h_t) \tag{7.1}
\]
Making the dependence on the parameters explicit, the likelihood of the model given the parameter set $\theta$ is:

\[
p(h_{1:T}, v_{1:T}|\theta) = p(h_1|\mu_0, \sigma^2_0) \prod_{t=2}^{T} p(h_t|h_{t-1}, A, \Sigma_H) \prod_{t=1}^{T} p(v_t|h_t, B, \Sigma_V) \tag{7.2}
\]
Taking logs gives us the model's log likelihood:

\[
\ln p(h_{1:T}, v_{1:T}|\theta) = \ln p(h_1|\mu_0, \sigma^2_0) + \sum_{t=2}^{T} \ln p(h_t|h_{t-1}, A, \Sigma_H) + \sum_{t=1}^{T} \ln p(v_t|h_t, B, \Sigma_V) \tag{7.3}
\]
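To make (7.3) concrete, the sketch below samples a short trajectory from a hypothetical one-dimensional LDS and evaluates the joint log likelihood term by term; `scipy.stats.multivariate_normal` supplies the Gaussian densities:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)

# Hypothetical parameters theta = {mu_0, sigma2_0, A, Sigma_H, B, Sigma_V}
mu0, sigma2_0 = np.zeros(1), np.eye(1)
A = np.array([[0.9]]); Sigma_H = np.array([[0.05]])
B = np.array([[1.0]]); Sigma_V = np.array([[0.2]])

# Sample h_{1:T} and v_{1:T} from the model
T = 10
h = [rng.multivariate_normal(mu0, sigma2_0)]
for _ in range(T - 1):
    h.append(rng.multivariate_normal(A @ h[-1], Sigma_H))
v = [rng.multivariate_normal(B @ ht, Sigma_V) for ht in h]

# Equation (7.3): ln p(h_1) + sum ln p(h_t|h_{t-1}) + sum ln p(v_t|h_t)
ll = mvn.logpdf(h[0], mu0, sigma2_0)
ll += sum(mvn.logpdf(h[t], A @ h[t - 1], Sigma_H) for t in range(1, T))
ll += sum(mvn.logpdf(v[t], B @ h[t], Sigma_V) for t in range(T))
print(ll)
```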
We will deal with each of the three components of (7.3) in turn. We use $V$ to represent the full set of observations $v_{1:T}$, $H$ for $h_{1:T}$, $\theta^{old}$ to represent our parameter values before an iteration of the EM loop, the superscript $n$ to represent the value of a parameter after an iteration of the loop, $c$ to represent terms that are not dependent on $\mu_0$ or $\sigma^2_0$, $\Lambda$ to represent $(\sigma^2_0)^{-1}$, and $Q = E_{H|\theta^{old}}[\ln p(H, V|\theta)]$. We will first find the expected value for $\ln p(h_1|\mu_0, \sigma^2_0)$:
\[
\begin{aligned}
Q &= -\tfrac{1}{2}\ln|\sigma^2_0| - E_{H|\theta^{old}}\Big[\tfrac{1}{2}(h_1 - \mu_0)^T \Lambda (h_1 - \mu_0)\Big] + c \\
&= -\tfrac{1}{2}\ln|\sigma^2_0| - \tfrac{1}{2} E_{H|\theta^{old}}\Big[h_1^T \Lambda h_1 - h_1^T \Lambda \mu_0 - \mu_0^T \Lambda h_1 + \mu_0^T \Lambda \mu_0\Big] + c \\
&= \tfrac{1}{2}\Big[\ln|\Lambda| - \operatorname{Tr}\Big(\Lambda\, E_{H|\theta^{old}}\big[h_1 h_1^T - h_1 \mu_0^T - \mu_0 h_1^T + \mu_0 \mu_0^T\big]\Big)\Big] + c
\end{aligned} \tag{7.4}
\]
In order to find the $\mu_0$ which maximizes the expected log likelihood described in (7.4), we will differentiate it with respect to $\mu_0$ and set the differential to zero:

\[
\frac{\partial Q}{\partial \mu_0} = 2\Lambda\mu_0 - 2\Lambda E[h_1] = 0 \implies \mu^n_0 = E[h_1] \tag{7.5}
\]
Proceeding in a similar manner to establish the maximal $\Lambda$:

\[
\begin{aligned}
\frac{\partial Q}{\partial \Lambda} &= \tfrac{1}{2}\Big[\sigma^2_0 - E\big[h_1 h_1^T\big] + E[h_1]\,\mu_0^T + \mu_0\, E\big[h_1^T\big] - \mu_0\mu_0^T\Big] = 0 \\
\implies (\sigma^2_0)^n &= E\big[h_1 h_1^T\big] - E[h_1]\,E\big[h_1^T\big]
\end{aligned} \tag{7.6}
\]
In order to optimize for $A$ and $\Sigma_H$ we will substitute for $p(h_t|h_{t-1}, A, \Sigma_H)$ in (7.3), giving:

\[
Q = -\frac{T-1}{2}\ln|\Sigma_H| - E_{H|\theta^{old}}\Big[\tfrac{1}{2}\sum_{t=2}^{T}(h_t - Ah_{t-1})^T \Sigma_H^{-1} (h_t - Ah_{t-1})\Big] + c \tag{7.7}
\]
Maximizing with respect to these parameters then gives:

\[
A^n = \Big(\sum_{t=2}^{T} E\big[h_t h_{t-1}^T\big]\Big)\Big(\sum_{t=2}^{T} E\big[h_{t-1} h_{t-1}^T\big]\Big)^{-1} \tag{7.8}
\]
\[
\Sigma^n_H = \frac{1}{T-1}\sum_{t=2}^{T}\Big(E\big[h_t h_t^T\big] - A^n E\big[h_{t-1} h_t^T\big] - E\big[h_t h_{t-1}^T\big](A^n)^T + A^n E\big[h_{t-1} h_{t-1}^T\big](A^n)^T\Big) \tag{7.9}
\]
In order to determine values for $B$ and $\Sigma_V$ we substitute for $p(v_t|h_t, B, \Sigma_V)$ in (7.3) to give:

\[
Q = -\frac{T}{2}\ln|\Sigma_V| - E_{H|\theta^{old}}\Big[\tfrac{1}{2}\sum_{t=1}^{T}(v_t - Bh_t)^T \Sigma_V^{-1} (v_t - Bh_t)\Big] + c \tag{7.10}
\]
Maximizing this with respect to $B$ and $\Sigma_V$ gives:

\[
B^n = \Big(\sum_{t=1}^{T} v_t\, E\big[h_t^T\big]\Big)\Big(\sum_{t=1}^{T} E\big[h_t h_t^T\big]\Big)^{-1} \tag{7.11}
\]
\[
\Sigma^n_V = \frac{1}{T}\sum_{t=1}^{T}\Big(v_t v_t^T - B^n E[h_t]\, v_t^T - v_t\, E\big[h_t^T\big](B^n)^T + B^n E\big[h_t h_t^T\big](B^n)^T\Big) \tag{7.12}
\]
Using the values calculated from the smoothing procedure in section 6:

\[
E[h_t] = h^s_t \qquad
E\big[h_t h_t^T\big] = F^s_t + h^s_t h^{sT}_t \qquad
E\big[h_t h_{t-1}^T\big] = X^s_t + h^s_t h^{sT}_{t-1}
\]
We can now set out the procedure for parameter learning using Expectation Maximization:

1. Choose starting values for the parameters $\theta = \{\mu_0, \sigma^2_0, A, \Sigma_H, B, \Sigma_V\}$.

2. Using the parameter set $\theta$, calculate the filtered statistics $\hat h_{t|t}$ and $F_{t|t}$ for each time step as described in section 5.

3. Using the parameter set $\theta$, calculate the smoothed statistics $h^s_t$, $F^s_t$ and $X^s_t$ for each time step as described in section 6.
4. Update $A$:
\[
A^n = \Big(\sum_{t=2}^{T} h^s_t h^{sT}_{t-1} + \sum_{t=2}^{T} X^s_t\Big)\Big(\sum_{t=1}^{T-1} h^s_t h^{sT}_t + \sum_{t=1}^{T-1} F^s_t\Big)^{-1}
\]
5. Update $\Sigma_H$:
\[
\Sigma^n_H = \frac{1}{T-1}\Big[\Big(\sum_{t=2}^{T} h^s_t h^{sT}_t + \sum_{t=2}^{T} F^s_t\Big) - A^n \Big(\sum_{t=2}^{T} h^s_t h^{sT}_{t-1} + \sum_{t=2}^{T} X^s_t\Big)^T\Big]
\]
6. Update $B$:
\[
B^n = \Big(\sum_{t=1}^{T} v_t h^{sT}_t\Big)\Big(\sum_{t=1}^{T} h^s_t h^{sT}_t + \sum_{t=1}^{T} F^s_t\Big)^{-1}
\]
7. Update $\Sigma_V$:
\[
\Sigma^n_V = \frac{1}{T}\Big[\sum_{t=1}^{T} v_t v_t^T - B^n \Big(\sum_{t=1}^{T} v_t h^{sT}_t\Big)^T\Big]
\]
8. Update $\mu_0$:
\[
\mu^n_0 = h^s_1
\]

9. Update $\sigma^2_0$:
\[
(\sigma^2_0)^n = F^s_1 + h^s_1 h^{sT}_1 - \mu^n_0 (\mu^n_0)^T
\]
10. Iterate steps 2 to 9 a given number of times or until the difference between parameter values from succeeding iterations falls below a predefined threshold.
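Pulling sections 5 to 7 together, one EM iteration might be sketched as below; `kalman_filter` and `kalman_smoother` are the sketches given earlier (assumed to be in scope), and for brevity only the $B$, $\Sigma_V$ and initial-state updates (steps 6 to 9) are written out, the $A$ and $\Sigma_H$ updates following the same pattern:

```python
import numpy as np

def em_step(vs, A, B, Sigma_H, Sigma_V, mu0, sigma2_0):
    # E-step: filtered then smoothed statistics (sections 5 and 6)
    hs_f, Fs_f = kalman_filter(vs, A, B, Sigma_H, Sigma_V, mu0, sigma2_0)
    hs_s, Fs_s, Xs = kalman_smoother(hs_f, Fs_f, A, Sigma_H)

    # M-step, steps 6-7: update B and Sigma_V
    T = len(vs)
    sum_vh = sum(np.outer(v, h) for v, h in zip(vs, hs_s))
    sum_hh = sum(np.outer(h, h) + F for h, F in zip(hs_s, Fs_s))
    B_new = sum_vh @ np.linalg.inv(sum_hh)
    sum_vv = sum(np.outer(v, v) for v in vs)
    Sigma_V_new = (sum_vv - B_new @ sum_vh.T) / T

    # Steps 8-9: update the initial-state parameters
    mu0_new = hs_s[0]
    sigma2_0_new = (Fs_s[0] + np.outer(hs_s[0], hs_s[0])
                    - np.outer(mu0_new, mu0_new))
    return B_new, Sigma_V_new, mu0_new, sigma2_0_new
```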