
Introduction to Nonlinear Filtering

P. Chigansky
Contents

Preface
Instead of Introduction
  An example
  The brief history of the problem
Chapter 1. Probability preliminaries
  1. Probability spaces
  2. Random variables and random processes
  3. Expectation and its properties
  4. Convergence of random variables
  5. Conditional expectation
  6. Gaussian random variables
  Exercises
Chapter 2. Linear filtering in discrete time
  1. The Hilbert space $L^2$, orthogonal projection and linear estimation
  2. Recursive orthogonal projection
  3. The Kalman-Bucy filter in discrete time
  Exercises
Chapter 3. Nonlinear filtering in discrete time
  1. The conditional expectation: a closer look
  2. The nonlinear filter via the Bayes formula
  3. The nonlinear filter by the reference measure approach
  4. The curse of dimensionality and finite dimensional filters
  Exercises
Chapter 4. The white noise in continuous time
  1. The Wiener process
  2. The Ito stochastic integral
  3. The Ito formula
  4. The Girsanov theorem
  5. Stochastic differential equations
  6. Martingale representation theorem
  Exercises
Chapter 5. Linear filtering in continuous time
  1. The Kalman-Bucy filter: scalar case
  2. The Kalman-Bucy filter: the general case
  3. Linear filtering beyond linear diffusions
  Exercises
Chapter 6. Nonlinear filtering in continuous time
  1. The innovation approach
  2. Reference measure approach
  3. Finite dimensional filters
  Exercises
Appendix A. Auxiliary facts
  1. The main convergence theorems
  2. Changing the order of integration
Bibliography
Preface
These lecture notes were prepared for the course taught by the author at the Faculty of Mathematics and CS of the Weizmann Institute of Science. The course is intended as a first encounter with stochastic calculus, with a nice engineering application: estimation of signals from noisy data. Consequently the rigor and generality of the presented theory is often traded for intuition and motivation, leaving out many interesting and important developments, either recent or classic. Any suggestions, remarks, bug reports, etc. are very welcome and can be sent to pchiga@mscc.huji.ac.il.

Pavel Chigansky
HUJI, September 2007
Instead of Introduction
An example
Consider a simple random walk on the integers (e.g. a randomly moving particle)
$$ X_j = X_{j-1} + \varepsilon_j, \quad j \in \mathbb{Z}_+, \tag{1} $$
starting from the origin, where $\varepsilon_j$ is a sequence of independent random signs, $P(\varepsilon_j = \pm 1) = 1/2$, $j \ge 1$. Suppose the position of the particle at time $j$ is to be estimated (guessed or filtered) on the basis of the noisy observations
$$ Y_i = X_i + \xi_i, \quad i = 1, \ldots, j, \tag{2} $$
where $\xi_j$ is a sequence of independent identically distributed (i.i.d.) random variables (so called discrete time white noise) with standard Gaussian distribution, i.e.
$$ P\big(\xi_j \in [a,b]\big) = \frac{1}{\sqrt{2\pi}}\int_a^b e^{-u^2/2}\,du, \quad j \ge 1. $$
Formally an estimate is a rule which assigns a real number^1 to any outcome of the observation vector $Y_{[1,j]} = (Y_1, \ldots, Y_j)$; in other words, it is a map $\pi_j(y): \mathbb{R}^j \to \mathbb{R}$. How are different guesses compared? One possible way is to require minimal square error on average, i.e. $\pi_j$ is considered better than $\bar\pi_j$ if
$$ E\big(X_j - \pi_j(Y_{[1,j]})\big)^2 \le E\big(X_j - \bar\pi_j(Y_{[1,j]})\big)^2, \tag{3} $$
where $E(\cdot)$ denotes expectation, i.e. the average with respect to all possible outcomes of the experiment; e.g. for $j = 1$
$$ E\big(X_1 - \pi_1(Y_1)\big)^2 = \frac{1}{2}\int_{-\infty}^{\infty}\Big[\big(1 - \pi_1(1+u)\big)^2 + \big({-1} - \pi_1(-1+u)\big)^2\Big]\frac{1}{\sqrt{2\pi}}e^{-u^2/2}\,du. $$
Note that even if (3) holds,
$$ \big(X_j - \pi_j(Y_{[1,j]})\big)^2 > \big(X_j - \bar\pi_j(Y_{[1,j]})\big)^2 $$
may happen in an individual experiment. However, this is not expected^2 to happen.

Once the criterion (3) is accepted, we would like to find the best (optimal) estimate. Let's start with the simplest guess
$$ \widehat X_j := \pi_j(Y_{[1,j]}) \equiv Y_j. $$
The corresponding mean square error is
$$ \widehat P_j = E(X_j - Y_j)^2 = E(X_j - X_j - \xi_j)^2 = E\xi_j^2 = 1. $$

^1 Though $X_j$ takes only integer values, we allow a guess to take real values, i.e. soft decisions are admissible.
^2 Think of an unfair coin with probability of heads equal to 0.99: it is not expected to give tails, though it may!
This simple estimate does not take into account past observations and hence can potentially be improved by using more data. Let's try
$$ \widehat X_j = \frac{Y_j + Y_{j-1}}{2}. $$
The corresponding mean square error is
$$ \widehat P_j = E\big(X_j - \widehat X_j\big)^2 = E\Big(X_j - \frac{Y_j + Y_{j-1}}{2}\Big)^2 = E\Big(X_j - \frac{X_j + X_{j-1} + \xi_{j-1} + \xi_j}{2}\Big)^2 $$
$$ = E\Big(\frac{X_j - X_{j-1}}{2} - \frac{\xi_{j-1} + \xi_j}{2}\Big)^2 = E\Big(\frac{\varepsilon_j}{2} - \frac{\xi_{j-1} + \xi_j}{2}\Big)^2 = 1/4 + 1/2 = 0.75, $$
which is an improvement by 25%! Let's try to increase the memory of the estimate:
$$ E\Big(X_j - \frac{Y_j + Y_{j-1} + Y_{j-2}}{3}\Big)^2 = \ldots = E\Big(\frac{2}{3}\varepsilon_j + \frac{1}{3}\varepsilon_{j-1} - \frac{\xi_j + \xi_{j-1} + \xi_{j-2}}{3}\Big)^2 = \frac{4}{9} + \frac{1}{9} + \frac{3}{9} \approx 0.89, $$
i.e. the error increased! The reason is that this estimate gives the old and the new measurements the same weights; it is reasonable to rely more on the latest samples. So what is the optimal way to weigh the data?
It turns out that the optimal estimate can be generated very efficiently by the difference equation ($j \ge 1$)
$$ \widehat X_j = \widehat X_{j-1} + P_j\big(Y_j - \widehat X_{j-1}\big), \quad \widehat X_0 = 0, \tag{4} $$
where $P_j$ is a sequence of numbers generated by
$$ P_j = \frac{P_{j-1} + 1}{P_{j-1} + 2}, \quad P_0 = 0. \tag{5} $$
Let us calculate the mean square error. The sequence $\Delta_j := X_j - \widehat X_j$ satisfies
$$ \Delta_j = \Delta_{j-1} + \varepsilon_j - P_j\big(\Delta_{j-1} + \varepsilon_j + \xi_j\big) = \big(1 - P_j\big)\Delta_{j-1} + (1 - P_j)\varepsilon_j - P_j\xi_j, $$
and thus $\widehat P_j = E\Delta_j^2$ satisfies
$$ \widehat P_j = \big(1 - P_j\big)^2\widehat P_{j-1} + (1 - P_j)^2 + P_j^2, \quad \widehat P_0 = 0, $$
where the independence of $\varepsilon_j$, $\xi_j$ and $\Delta_{j-1}$ has been used. Note that the sequence $P_j$ satisfies the same identity (just expand the right hand side using (5))
$$ P_j = \big(1 - P_j\big)^2 P_{j-1} + (1 - P_j)^2 + P_j^2, \quad P_0 = 0. $$
So the difference $\widehat P_j - P_j$ obeys the linear time varying equation
$$ \widehat P_j - P_j = \big(1 - P_j\big)^2\big(\widehat P_{j-1} - P_{j-1}\big), \quad j \ge 1, $$
and since $\widehat P_0 - P_0 = 0$, it follows that $\widehat P_j \equiv P_j$ for all $j \ge 0$; in other words, $P_j$ is the mean square error corresponding to $\widehat X_j$! Numerically we get
  j     1     2     3        4        5
  P_j   0.5   0.6   0.6154   0.6176   0.618

In particular $P_j$ converges to the limit $P_\infty$, which is the unique positive root of the equation
$$ P = \frac{P + 1}{P + 2} \quad \Longrightarrow \quad P_\infty = \sqrt{5}/2 - 1/2 \approx 0.618. $$
This is nearly a 40% improvement over the accuracy of $\widehat X_j \equiv Y_j$! As was mentioned before, no further improvement is possible among linear estimates.
What about nonlinear estimates? Consider the simplest nonlinear estimate of $X_1$ from $Y_1$: guess $1$ if $Y_1 \ge 0$ and $-1$ if $Y_1 < 0$, i.e. $\widehat X_1 = \mathrm{sign}(Y_1)$. The corresponding error is
$$ \widehat P_1 = E\big(X_1 - \widehat X_1\big)^2 = \frac{1}{2}E\big(1 - \mathrm{sign}(1 + \xi_1)\big)^2 + \frac{1}{2}E\big({-1} - \mathrm{sign}(-1 + \xi_1)\big)^2 $$
$$ = \frac{1}{2}\,2^2\,P(\xi_1 \le -1) + \frac{1}{2}\,2^2\,P(\xi_1 \ge 1) = 4P(\xi_1 \ge 1) = 4\,\frac{1}{\sqrt{2\pi}}\int_1^\infty e^{-u^2/2}\,du \approx 0.6346, $$
which is even worse than the linear estimate $\widehat X_1$! Let's try the estimate $\widehat X_1 = \tanh(Y_1)$, which can be regarded as a "soft" sign. The corresponding mean square error is
$$ \widehat P_1 = E\big(X_1 - \widehat X_1\big)^2 = \frac{1}{2}\int_{-\infty}^{\infty}\Big[\big(1 - \tanh(u+1)\big)^2 + \big(1 + \tanh(u-1)\big)^2\Big]\frac{1}{\sqrt{2\pi}}e^{-u^2/2}\,du \approx 0.4496, $$
which is the best estimate so far (in fact it is the best possible!).

How can we compute the best nonlinear estimate of $X_j$ efficiently (meaning recursively)? Let $\rho_j(i)$, $i \in \mathbb{Z}$, $j \ge 0$, be generated by the nonlinear recursion
$$ \rho_j(i) = \exp\big(Y_j i - i^2/2\big)\big(\rho_{j-1}(i-1) + \rho_{j-1}(i+1)\big), \quad j \ge 1, \tag{6} $$
subject to $\rho_0(0) = 1$ and $\rho_0(i) = 0$, $i \ne 0$. Then the best estimate of $X_j$ from the observations $Y_1, \ldots, Y_j$ is given by
$$ \widehat X_j = \frac{\sum_{i=-\infty}^{\infty} i\,\rho_j(i)}{\sum_{i=-\infty}^{\infty}\rho_j(i)}. \tag{7} $$
How good is it? The exact answer is hard to calculate; e.g. the empirical mean square error $\widehat P_{100}$ is around 0.54 (note that it should be less than 0.618 and greater than 0.4496).
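The recursions above are easy to try numerically. The following Python sketch (an illustration added here, not part of the original notes) simulates the model (1)-(2) and estimates the empirical mean square errors of the linear filter (4)-(5) and of the nonlinear filter (6)-(7); the horizon, the number of Monte Carlo runs and the truncation of the state range are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, runs = 100, 2000                   # horizon and number of Monte Carlo runs (arbitrary)
K = T + 5                             # state truncation: the walk cannot leave [-T, T]
support = np.arange(-K, K + 1)        # integer states i

err_lin, err_nl = 0.0, 0.0
for _ in range(runs):
    eps = rng.choice([-1, 1], size=T)              # random signs eps_1, ..., eps_T
    X = np.concatenate(([0], np.cumsum(eps)))      # random walk (1), X_0 = 0
    Y = X[1:] + rng.standard_normal(T)             # noisy observations (2)

    x_lin, P = 0.0, 0.0                            # linear filter (4)-(5)
    rho = np.zeros(support.size); rho[K] = 1.0     # nonlinear filter: rho_0(i) = delta(i)

    for j in range(T):
        P = (P + 1.0) / (P + 2.0)                  # (5)
        x_lin += P * (Y[j] - x_lin)                # (4)
        shifted = np.roll(rho, 1) + np.roll(rho, -1)     # rho_{j-1}(i-1) + rho_{j-1}(i+1)
        logw = Y[j] * support - support**2 / 2.0         # exponent in (6)
        rho = np.exp(logw - logw.max()) * shifted        # constant shift cancels in (7)
        rho /= rho.sum()                                 # normalize to avoid overflow

    err_lin += (X[-1] - x_lin) ** 2
    err_nl += (X[-1] - float(support @ rho)) ** 2        # (7)

print("empirical linear error   :", err_lin / runs)      # should be close to 0.618
print("empirical nonlinear error:", err_nl / runs)       # roughly 0.54, as quoted above
```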
How could the same problem be formulated in continuous time, i.e. when the time parameter (denoted in this case by $t$) can be any nonnegative real number? The signal defined in (1) is a Markov^3 chain with integer values, starting from zero and making equiprobable transitions to the nearest neighbors. Intuitively, the analogous Markov chain in continuous time should satisfy
$$ P\big(X_{t+\Delta} = i \mid X_s, \ 0 \le s \le t\big) = \begin{cases} 1 - 2\Delta, & i = X_t \\ \Delta, & i = X_t \pm 1 \\ 0, & \text{otherwise} \end{cases} \tag{8} $$
for sufficiently small $\Delta > 0$. In other words, the process is not expected to jump on short time intervals and eventually jumps to one of the nearest neighbors. It turns out that (8) uniquely defines a stochastic process. For example, it can be modelled by a pair of independent Poisson processes. Let $(\tau_n)_{n\in\mathbb{Z}_+}$ be an i.i.d. sequence of positive random variables with standard exponential distribution
$$ P\big(\tau_n \le t\big) = \begin{cases} 1 - e^{-t}, & t \ge 0 \\ 0, & t < 0. \end{cases} \tag{9} $$
Then a standard Poisson process is defined as^4
$$ \pi_t = \max\Big\{n: \sum_{\ell=1}^n \tau_\ell \le t\Big\}. $$
Clearly $\pi_t$ starts at zero ($\pi_0 = 0$) and increases, jumping to the next integer at random times separated by the $\tau$'s. Let $\pi^-_t$ and $\pi^+_t$ be a pair of independent Poisson processes. Then the process
$$ X_t = \pi^+_t - \pi^-_t, \quad t \ge 0, $$
satisfies (8). Remarkably, the exponential distribution is the only one which can lead to a Markov process.

^3 Recall that a sequence is called Markov if the conditional distribution of $X_j$, given the history $X_0, \ldots, X_{j-1}$, depends only on the last entry $X_{j-1}$ and not on the whole path. Verify this property for the sequence defined by (1).
^4 With the convention $\sum_{\ell=1}^{0} = 0$.
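The construction above is easy to simulate directly: sum i.i.d. standard exponential inter-jump times to build two independent Poisson paths and take their difference. The sketch below is an illustration added here (the horizon and the sampling grid are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(1)

def poisson_path(t_grid, horizon):
    """Counting process pi_t evaluated on t_grid, built from exponential gaps (9)."""
    jumps, t = [], 0.0
    while True:
        t += rng.exponential(1.0)         # tau_n, standard exponential
        if t > horizon:
            break
        jumps.append(t)
    # pi_t = number of jump times <= t
    return np.searchsorted(np.array(jumps), t_grid, side="right")

horizon = 10.0
t_grid = np.linspace(0.0, horizon, 1001)
pi_plus = poisson_path(t_grid, horizon)       # two independent Poisson processes
pi_minus = poisson_path(t_grid, horizon)
X = pi_plus - pi_minus                        # X_t = pi^+_t - pi^-_t, satisfies (8)

print("X_T =", X[-1], "  (EX_T = 0, var X_T = 2T =", 2 * horizon, ")")
```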
To define an analogue of $Y_t$, the concept of white noise is to be introduced in continuous time. The origin of the term "white noise" stems from the fact that the spectral density of an i.i.d. sequence is flat, i.e.
$$ S_\xi(\lambda) := \sum_{j=-\infty}^{\infty} E\xi_0\xi_j e^{-i\lambda j} = \sum_{j=-\infty}^{\infty}\delta(j)e^{-i\lambda j} = 1, \quad \lambda \in (-\pi, \pi]. $$
So any random sequence with flat spectral density is called (discrete time) white noise and its variance is recovered by integration of the spectral density
$$ E\xi_t^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} 1\,d\lambda = 1. $$
The same definition leads to a paradox in continuous time: if a stochastic process has flat spectral density, then it should have infinite variance^5
$$ E\xi_t^2 = \frac{1}{2\pi}\int_{-\infty}^{\infty} d\lambda = \infty. $$
This paradox is resolved if the observation process is defined as
$$ Y_t = \int_0^t X_s\,ds + W_t, \tag{10} $$
where $W = (W_t)_{t\ge0}$ is the Wiener process or mathematical Brownian motion. The Wiener process is characterized by the following properties: $W_0 = 0$, the trajectories of $W_t$ are continuous functions and it has independent increments with
$$ E\big(W_t \mid W_u, u \le s\big) = W_s, \qquad E\big((W_t - W_s)^2 \mid W_u, u \le s\big) = t - s. $$

^5 Recall that the spectral density of continuous time processes is supported on the whole real line, rather than being condensed to $(-\pi, \pi]$ as in the case of sequences.
Why is the model (10) compatible with the white noise notion? Introduce the process
$$ \xi^\delta_t = \frac{W_t - W_{t-\delta}}{\delta}, \quad \delta > 0. $$
Then $E\xi^\delta_t = 0$ and^6
$$ E\xi^\delta_t\xi^\delta_s = \frac{1}{\delta^2}E\big(W_t - W_{t-\delta}\big)\big(W_s - W_{s-\delta}\big) = \frac{1}{\delta^2}\begin{cases}\delta - |t-s|, & |t-s| \le \delta \\ 0, & |t-s| > \delta.\end{cases} $$
So the process $\xi^\delta_t$ is stationary with the correlation function
$$ R^\delta(\tau) = \frac{1}{\delta^2}\begin{cases}\delta - |\tau|, & |\tau| \le \delta \\ 0, & |\tau| > \delta.\end{cases} $$
For small $\delta > 0$, $R^\delta(\tau)$ approximates the Dirac delta $\delta(\tau)$ in the sense that for any continuous and compactly supported test function $\varphi(\tau)$
$$ \int_{-\infty}^{\infty}\varphi(\tau)R^\delta(\tau)\,d\tau \xrightarrow{\delta\to 0}\varphi(0), $$
and if the limit process $\xi_t := \lim_{\delta\to0}\xi^\delta_t$ existed, it would have flat spectral density as required. Then the observation process (10) would contain the same information as
$$ \dot Y_t = X_t + \xi_t, $$
with $\xi_t$ being the derived white noise. Of course, this is only an intuition and $\xi_t$ does not exist as a limit in any reasonable sense (e.g. its variance at any point $t$ grows to infinity as $\delta\to0$, which is the other side of the flat spectrum paradox). It turns out that the axiomatic definition of the Wiener process leads to very unusual properties of its trajectories. For example, almost all trajectories of $W_t$, though continuous, are not differentiable at any point.

^6 Note that $EW_tW_s = \min(t, s) := t\wedge s$ for all $t, s \ge 0$.
After a proper formulation of the problem is found, what would be the analogs of the filtering equations (4)-(5) and (6)-(7)? Intuitively, instead of the difference equations in discrete time, we should obtain differential equations in continuous time, e.g.
$$ \dot{\widehat X}_t = P_t\big(\dot Y_t - \widehat X_t\big), \quad \widehat X_0 = 0. $$
However, the right hand side of this equation involves the derivative of $Y_t$ and hence also of $W_t$, which is impossible in view of the aforementioned irregularity of the latter. Then instead of differential equations we may write (and implement!) the corresponding integral equation
$$ \widehat X_t = \int_0^t P_s\,dY_s - \int_0^t P_s\widehat X_s\,ds, $$
where the first integral may be interpreted as a Stieltjes integral with respect to $Y_t$, or alternatively defined (in the spirit of the integration by parts formula) as
$$ \int_0^t P_s\,dY_s := Y_tP_t - \int_0^t Y_s\dot P_s\,ds. $$
Such a definition is correct, since the integrand is deterministic and differentiable ($Y_t$ turns out to be Riemann integrable as well). Of course, we should define precisely what the solution of such an equation is and under what assumptions it exists and is unique. The optimal linear filtering equations can then be derived:
$$ \widehat X_t = \int_0^t P_s\big(dY_s - \widehat X_s\,ds\big), \qquad \dot P_t = 2 - P_t^2, \quad P_0 = 0. \tag{11} $$
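Although the stochastic integral has not yet been defined, the equations (11) can already be implemented on a time grid by replacing $dY_s$ with the observation increments. The following sketch (an illustration added here, not part of the original notes) simulates the signal $X_t = \pi^+_t - \pi^-_t$, the observation (10) and an Euler-type discretization of (11); the step size and the horizon are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
T, dt = 10.0, 1e-3
n = int(T / dt)

# signal: difference of two unit-rate Poisson processes, on the grid
dN_plus = (rng.random(n) < dt).astype(float)     # P(jump in dt) ~ dt
dN_minus = (rng.random(n) < dt).astype(float)
X = np.cumsum(dN_plus - dN_minus)

# observations (10): dY_t = X_t dt + dW_t
dW = rng.standard_normal(n) * np.sqrt(dt)
dY = X * dt + dW

# linear filter (11): dXhat = P_t (dY - Xhat dt),  dP/dt = 2 - P^2
Xhat, P = 0.0, 0.0
for k in range(n):
    Xhat += P * (dY[k] - Xhat * dt)
    P += (2.0 - P * P) * dt

print("Riccati P_T ->", P, "(the limit of dP/dt = 2 - P^2 is sqrt(2) ~ 1.414)")
print("X_T =", X[-1], "  Xhat_T =", Xhat)
```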
Now what about the nonlinear filter? The equations should realize a nonlinear map of the data and thus their right hand side would require integration of some stochastic process with respect to $Y_t$. This is where the classical integration theory completely fails! The reason is again the irregularity of the Wiener process: it has unbounded variation! Thus a construction similar to the Stieltjes integral would not lead to a well defined limit in general. The foundations of the integration theory with respect to the Wiener process were laid by K. Ito in the 40s. The main idea is to use a Stieltjes-like construction for a specific class of integrands (non-anticipating processes). In terms of the Ito integral the nonlinear filtering formulae are^7
$$ \rho_t(i) = \delta(i) + \int_0^t\big(\rho_s(i+1) + \rho_s(i-1) - 2\rho_s(i)\big)\,ds + \int_0^t i\,\rho_s(i)\,dY_s \tag{12} $$
and
$$ \widehat X_t = \frac{\sum_{m=-\infty}^{\infty} m\,\rho_t(m)}{\sum_{\ell=-\infty}^{\infty}\rho_t(\ell)}. $$
This example is a particular case of the filtering problem, which is the main subject of these lectures:

  Given a pair of random processes $(X_t, Y_t)_{t\ge0}$ with known statistical description, find a recursive realization of the optimal in the mean square sense estimate of the signal $X_t$ on the basis of the observed trajectory $Y_s$, $s \le t$, for each $t \ge 0$.

^7 From now on $\delta(i)$ denotes the Kronecker symbol, i.e. $\delta(i) = 1$ if $i = 0$ and $\delta(i) = 0$ if $i \ne 0$.
The brief history of the problem
The estimation problem of signals from noisy observations dates back to Gauss (the beginning of the XIX century), who studied the motion of planets on the basis of celestial observations by means of his least squares method. In the modern probabilistic framework the filtering type problems were addressed independently by N. Wiener (documented in the monograph [26]) and A. Kolmogorov ([20]). Both treated linear estimation of stationary processes via the spectral representation. Wiener's work seems to have been partially motivated by radar tracking problems and gunfire control. This part of the filtering theory won't be covered in this course and the reader is referred to the classical text [28] for further exploration.

The Wiener-Kolmogorov theory in many cases had a serious practical limitation: all the processes involved are assumed to be stationary. R. Kalman and R. Bucy (1960-61) [13], [14] addressed the same problem from a different perspective: using the state space representation they relaxed the stationarity requirement and obtained closed form recursive formulae realizing the best estimator. The celebrated Kalman-Bucy filter today plays a central role in various engineering applications (communications, signal processing, automatic control, etc.). Besides being of significant practical importance, the Kalman-Bucy approach stimulated much research in the theory of stochastic processes and their applications in control and estimation. The state space approach allowed nonlinear extensions of the filtering problem. The milestone contributions in this field are due to H. Kushner [29], R. Stratonovich [37] and Fujisaki, Kallianpur and Kunita [10] (the dynamic equations for the conditional probability distribution), Kallianpur and Striebel [17] (the Bayes formula for white noise observations), and M. Zakai [41] (the reference measure approach to nonlinear filtering).

There are several excellent books and monographs on the subject, including R. Liptser and A. Shiryaev [21] (the main reference for the course), G. Kallianpur [15], S. Mitter [23], G. Kallianpur and R. L. Karandikar [16] (a different look at the problem), and R. E. Elliott, L. Aggoun and J. B. Moore [8]. Classic introductory level texts are B. Anderson and J. Moore [1] and A. Jazwinski [12].
CHAPTER 1
Probability preliminaries
"Probability theory is simply a branch of measure theory, with its own special emphasis and field of application" (J. Doob).

This chapter gives a summary of the probabilistic notions used in the course, which are assumed to be familiar (the book [34] is the main reference hereafter).
1. Probability spaces
The basic object of probability theory is the probability space $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a collection of elementary events (points), $\mathcal{F}$ is an appropriate family of all considered events (or sets) and $P$ is the probability measure on $\mathcal{F}$. While $\Omega$ can be quite arbitrary, $\mathcal{F}$ and $P$ are required to satisfy certain properties to provide sufficient applicability of the derived theory. The mainstream of probability research relies on the axioms introduced by A. Kolmogorov in the 30s (documented in [19]). $\mathcal{F}$ is required to be a $\sigma$-algebra of events, i.e. to be closed under countable intersections and complement operations^1
$$ \Omega \in \mathcal{F}, \qquad A \in \mathcal{F} \implies \Omega\setminus A \in \mathcal{F}, \qquad A_n \in \mathcal{F} \implies \bigcap_{n=1}^{\infty} A_n \in \mathcal{F}. $$
$P$ is a $\sigma$-additive nonnegative measure on $\mathcal{F}$, normalized to one; in other words, $P$ is a set function $\mathcal{F}\mapsto[0,1]$, satisfying
$$ P\Big(\bigcup_{n=1}^{\infty}A_n\Big) = \sum_{n=1}^{\infty}P(A_n), \quad \text{disjoint } A_n \in \mathcal{F} \quad (\sigma\text{-additivity}), $$
$$ P(\Omega) = 1 \quad (\text{normalization}). $$
Here are some examples of probability spaces:

1.1. A finite probability space. For example
$$ \Omega := \{1, 2, 3\}, \qquad \mathcal{F} := \big\{\emptyset, \{1\}, \{2\}, \{3\}, \{1,2\}, \{1,3\}, \{2,3\}, \Omega\big\}, \qquad P(A) = \sum_{\omega\in A} 1/3, \ A \in \mathcal{F}. $$
Note that the $\sigma$-algebra $\mathcal{F}$ coincides with the (finite) algebra generated by the points of $\Omega$, and $P$ is defined on $\mathcal{F}$ by specifying its values for each $\omega\in\Omega$, i.e. $P(\{1\}) = P(\{2\}) = P(\{3\}) = 1/3$.

^1 These imply that $\mathcal{F}$ is also closed under countable unions, i.e. $A_n \in \mathcal{F} \implies \bigcup_{n=1}^{\infty}A_n \in \mathcal{F}$.
Example 1.1. Tossing a coin $n$ times. The elementary event $\omega$ is a string of $n$ zero-one bits, i.e. the sampling space $\Omega$ consists of $2^n$ points. $\mathcal{F}$ consists of all subsets of $\Omega$ (how many are there?). The probability measure is defined (on $\mathcal{F}$) by setting $P(\omega) = 2^{-n}$ for all $\omega\in\Omega$. What is the probability of the event $A$ = "the first bit of a string is one"?
$$ P(A) = P\big(\omega: \omega(1) = 1\big) = \sum_{\omega:\,\omega(1)=1} 2^{-n} = 1/2 \quad \text{(by symmetry).} $$
1.2. The Lebesgue probability space $([0,1], \mathcal{B}, \lambda)$. Here $\mathcal{B}$ denotes the Borel $\sigma$-algebra on $[0,1]$, i.e. the minimal $\sigma$-algebra containing all open sets from $[0,1]$. It can be generated by the algebra of all intervals. The probability measure $\lambda$ is uniquely defined (by the Caratheodory extension theorem) on $\mathcal{B}$ by its restriction, e.g. to the algebra of semi-open intervals
$$ \lambda\big((a,b]\big) = b - a, \quad b \ge a. $$
Similarly a probability space is defined on $\mathbb{R}$ (or $\mathbb{R}^d$). The probability measure in this case can be defined by any nondecreasing right continuous (why?) nonnegative function $F: \mathbb{R}\mapsto[0,1]$, satisfying $\lim_{x\to\infty}F(x) = 1$ and $\lim_{x\to-\infty}F(x) = 0$:
$$ P\big((a,b]\big) = F(b) - F(a). $$
What is the analogous construction in $\mathbb{R}^d$?

Example 1.2. An infinite series of coin tosses. The elementary event is an infinite binary sequence or, equivalently^2, a point in $[0,1]$, i.e. $\Omega = [0,1]$. For the event $A$ from the previous example:
$$ \lambda(A) = \lambda\big(\omega: \omega(1) = 1\big) = \lambda\big([1/2, 1]\big) = 1/2. $$

^2 Some sequences represent the same number (e.g. 0.10000... and 0.011111...), but there are countably many of them, which can be neglected while calculating the probabilities.
1.3. The space of infinite sequences $(\mathbb{R}^\infty, \mathcal{B}(\mathbb{R}^\infty), P)$. The Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R}^\infty)$ can be generated by the cylindrical sets of the form
$$ A = \big\{x \in \mathbb{R}^\infty: x_{i_1} \in (a_1, b_1], \ldots, x_{i_n} \in (a_n, b_n]\big\}, \quad b_i \ge a_i. $$
The probability $P$ is uniquely defined on $\mathcal{B}(\mathbb{R}^\infty)$ by a consistent family of probability measures $P_n$ on $\big(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n)\big)$, $n \ge 1$ (the Kolmogorov theorem), i.e. if $P_n$ satisfies
$$ P_{n+1}(B\times\mathbb{R}) = P_n(B), \quad B \in \mathcal{B}(\mathbb{R}^n). $$
Example 1.3. Let $p(x,y)$ be a measurable^3 $\mathbb{R}\times\mathbb{R}\mapsto\mathbb{R}_+$ nonnegative function, such that
$$ \int_{\mathbb{R}}p(x,y)\,dy = 1 \quad \text{for a.e. } x, $$
and let $\nu(x)$ be a probability density (i.e. $\nu(x) \ge 0$ and $\int_{\mathbb{R}}\nu(x)\,dx = 1$). Define a family of probability measures on $\mathcal{B}(\mathbb{R}^{n+1})$ by the formula
$$ P_{n+1}\big(A_0\times\ldots\times A_n\big) = \int_{A_0}\ldots\int_{A_n}\nu(x_0)p(x_0,x_1)\ldots p(x_{n-1},x_n)\,dx_0\ldots dx_n. $$

^3 Measurability with respect to the Borel $\sigma$-algebra is meant by default.
This family is consistent:
$$ P_{n+1}\big(A_0\times\ldots\times A_{n-1}\times\mathbb{R}\big) = \int_{A_0}\ldots\int_{A_{n-1}}\int_{\mathbb{R}}\nu(x_0)p(x_0,x_1)\ldots p(x_{n-1},x_n)\,dx_0\ldots dx_n $$
$$ = \int_{A_0}\ldots\int_{A_{n-1}}\nu(x_0)p(x_0,x_1)\ldots p(x_{n-2},x_{n-1})\,dx_0\ldots dx_{n-1} = P_n\big(A_0\times\ldots\times A_{n-1}\big), $$
and hence there is a unique probability measure $P$ (on $\mathcal{B}(\mathbb{R}^\infty)$), such that
$$ P\big(A_n\times\mathbb{R}\times\mathbb{R}\times\ldots\big) = P_n(A_n), \quad A_n \in \mathcal{B}(\mathbb{R}^n), \ n = 1, 2, \ldots $$
The constructed measure is called Markov.
2. Random variables and random processes
A random variable is a measurable function on a probability space $(\Omega, \mathcal{F}, P)$ with values in a metric space (say $\mathbb{R}$ hereon), i.e. a map $X(\omega): \Omega\mapsto\mathbb{R}$, such that
$$ \{\omega: X(\omega)\in B\} \in \mathcal{F}, \quad \forall B \in \mathcal{B}(\mathbb{R}). $$
Due to the measurability requirement, $X$ (the argument $\omega$ is traditionally omitted) induces a measure on $\mathcal{B}(\mathbb{R})$:
$$ P_X(B) := P\big(\omega: X(\omega)\in B\big), \quad B \in \mathcal{B}(\mathbb{R}). $$
The function $F_X: \mathbb{R}\mapsto[0,1]$,
$$ F_X(x) = P_X\big((-\infty, x]\big) = P(X \le x), \quad x \in \mathbb{R}, $$
is called the distribution function of $X$. Note that by definition $F_X(x)$ is a right-continuous function.

A stochastic (random) process is a collection of random variables $X_n(\omega)$ on a probability space $(\Omega, \mathcal{F}, P)$, parameterized by time $n \in \mathbb{Z}_+$. Equivalently, a stochastic process can be regarded as a probability measure (or probability distribution) on the space of real valued sequences. The finite dimensional distributions $F^n_X: \mathbb{R}^n\mapsto[0,1]$ of $X$ are defined as
$$ F^n_X(x_1, \ldots, x_n) = P\big(X_1 \le x_1, \ldots, X_n \le x_n\big), \quad n \ge 1. $$
The existence of a random process with given finite dimensional distributions is guaranteed by the Kolmogorov theorem if and only if the family of probability measures on $\mathbb{R}^n$, corresponding to $F^n_X$, is consistent. Then one may realize $X$ as a coordinate process on an appropriate probability space, in which case the process is called canonical.
3. Expectation and its properties
The expectation of a real random variable $X \ge 0$, defined on $(\Omega, \mathcal{F}, P)$, is the Lebesgue integral of $X$ with respect to the measure $P$, i.e. the limit (either finite or infinite)
$$ EX = \int_\Omega X(\omega)P(d\omega) := \lim_{n\to\infty}EX_n, $$
where $X_n$ is an approximation of $X$ by simple (piecewise constant) functions, e.g.
$$ X_n(\omega) = \sum_{\ell=1}^{n2^n}\frac{\ell-1}{2^n}\,\mathbf{1}\Big(\frac{\ell-1}{2^n} \le X(\omega) < \frac{\ell}{2^n}\Big) + n\,\mathbf{1}\big(X(\omega) \ge n\big), \tag{1.1} $$
for which
$$ EX_n := \sum_{\ell=1}^{n2^n}\frac{\ell-1}{2^n}\,P\Big(\frac{\ell-1}{2^n} \le X(\omega) < \frac{\ell}{2^n}\Big) + n\,P\big(X(\omega) \ge n\big) $$
is defined. Such a limit always exists and is independent of the specific choice of the approximating sequence. For a general random variable, taking values of both signs, the expectation is defined as^4
$$ EX = E(0\vee X) - E\big(0\vee(-X)\big) := EX^+ - EX^- $$
if at least one of the terms is finite. If $EX$ exists and is finite, $X$ is said to be Lebesgue integrable with respect to $P$. Note that the expectation can also be realized on the induced probability space, e.g.
$$ EX = \int_\Omega X(\omega)P(d\omega) = \int_{\mathbb{R}}xP_X(dx) = \int_{-\infty}^{\infty}x\,dF_X(x) $$
(the latter stands for the Lebesgue-Stieltjes integral).

^4 Here $a\wedge b = \min(a, b)$ and $a\vee b = \max(a, b)$.
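The construction (1.1) is easy to test numerically; the short Python sketch below (an illustration added here, not part of the original notes) approximates $EX$ for $X(\omega) = \omega^2$ on the Lebesgue probability space, replacing the Lebesgue integral over $\Omega = [0,1]$ by an average over a fine uniform grid. Compare with Example 1.4 below, which computes the same expectation analytically.

```python
import numpy as np

# "points" of Omega = [0,1] with equal weight stand in for the uniform measure
w = np.linspace(0.0, 1.0, 1_000_001)
X = w ** 2

def simple_approximation(X, n):
    """X_n from (1.1): round X down to the grid {l/2^n} and cap at level n."""
    return np.minimum(np.floor(X * 2 ** n) / 2 ** n, n)

for n in (2, 4, 8, 16):
    print(n, simple_approximation(X, n).mean())   # increases towards EX = 1/3
```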
Example 1.4. Consider the random variable $X(\omega) = \omega^2$ on the Lebesgue probability space. Then
$$ EX = \int_{[0,1]}\omega^2\lambda(d\omega) = 1/3. $$
Another way to calculate $EX$ is to find its distribution function:
$$ F_X(x) = P(X(\omega) \le x) = P(\omega^2 \le x) = P(\omega \le \sqrt{x}) = \begin{cases}0, & x < 0 \\ \sqrt{x}, & 0 \le x < 1 \\ 1, & 1 \le x\end{cases} $$
and then to calculate the integral (using integration by parts)
$$ EX = \int_{-\infty}^{\infty}x\,dF_X(x) = \int_{[0,1]}x\,d(\sqrt{x}) = 1 - \int_{[0,1]}\sqrt{x}\,dx = 1/3. $$
The expectation has the following basic properties:
(A) if $EX$ is well defined, then $EcX = cEX$ for any $c\in\mathbb{R}$;
(B) if $X \le Y$ $P$-a.s., then $EX \le EY$;
(C) if $EX$ is well defined, then $EX \le E|X|$;
(D) if $EX$ is well defined, then $EX\mathbf{1}_A$ is well defined for all $A\in\mathcal{F}$; if $EX$ is finite, so is $EX\mathbf{1}_A$;
(E) if $E|X| < \infty$ and $E|Y| < \infty$, then $E(X+Y) = EX + EY$;
(F) if $X = 0$ $P$-a.s., then $EX = 0$;
(G) if $X = Y$ $P$-a.s. and $E|X| < \infty$, $E|Y| < \infty$, then $EX = EY$;
(H) if $X \ge 0$ and $EX = 0$, then $X = 0$ $P$-a.s.

The random variables $X_1, \ldots, X_n$ are independent if for any subset of indices $\{i_1, \ldots, i_m\}\subseteq\{1, \ldots, n\}$ and Borel sets $A_1, \ldots, A_m$,
$$ P\big(X_{i_1}\in A_1, \ldots, X_{i_m}\in A_m\big) = P\big(X_{i_1}\in A_1\big)\ldots P\big(X_{i_m}\in A_m\big). $$
For example, $X$ and $Y$ are independent if
$$ P(X\in A, Y\in B) = P(X\in A)P(Y\in B) $$
for any Borel sets $A$ and $B$. Note that pairwise independence is not enough in general for independence of, e.g., three random variables. Also note that independence is a joint property of the random variables and the measure $P$. Being dependent under $P$, the same random variables may be independent under another measure $\widetilde P$ (defined on the same probability space).

The characteristic function of $X$ is the Fourier transform of its distribution, i.e.
$$ \phi_X(\lambda) := E\exp\big(i\lambda X\big), \quad \lambda\in\mathbb{R}. $$
The independence can be alternatively formulated via distribution or characteristic functions (How?).
4. Convergence of random variables
A sequence of random variables $X_n$ converges to a random variable $X$
(1) $P$-almost surely, if $P\big(\lim_{n\to\infty}X_n = X\big) = 1$;
(2) in probability $P$, if $\lim_{n\to\infty}P\big(|X_n - X| \ge \varepsilon\big) = 0$ for all $\varepsilon > 0$;
(3) in $L^p(\Omega, \mathcal{F}, P)$, $p \ge 1$, if $\lim_{n\to\infty}E|X_n - X|^p = 0$ and $E|X|^p < \infty$;
(4) weakly or in law, if for any bounded and continuous function $f$
$$ \lim_{n\to\infty}Ef(X_n) = Ef(X). $$
Other types of convergence are possible, but these are the ones mostly used. Note that convergence in law is actually not a convergence of the random variables, but rather of their distributions: for example, an i.i.d. random sequence converges in law but does not converge in any other aforementioned sense.

The following implications can be easily verified:
$$ \left.\begin{array}{c}P\text{-a.s.}\\ L^p\end{array}\right\}\implies \text{in probability } P \implies \text{weakly}, $$
while the others are wrong in general.
Example 1.5. Let $X_n$ be a sequence of independent random variables with
$$ P(X_n = 1) = 1/n, \qquad P(X_n = 0) = 1 - 1/n. $$
Then $X_n$ converges to zero in probability: for $0 < \varepsilon < 1$
$$ P\big(X_n \ge \varepsilon\big) = P\big(X_n = 1\big) = 1/n \to 0. $$
However, it doesn't converge $P$-a.s. Let $A_n = \{X_n = 1\}$ and let
$$ A_{i.o.} = \bigcap_{n\ge0}\bigcup_{m\ge n}A_m, $$
i.e. the event of $X_n$ being equal to 1 infinitely often. Let us show that $P(A_{i.o.}) = 1$, or alternatively^5 $P(A^c_{i.o.}) = 0$:
$$ P(A^c_{i.o.}) = P\Big(\bigcup_{n\ge0}\bigcap_{m\ge n}A^c_m\Big) \le \sum_n P\Big(\bigcap_{m\ge n}A^c_m\Big). $$
For any fixed $n$ and $\ell \ge 1$, due to independence,
$$ P\Big(\bigcap_{m=n}^{n+\ell}A^c_m\Big) = \prod_{m=n}^{n+\ell}P\big(A^c_m\big) = \prod_{m=n}^{n+\ell}(1 - 1/m) = \exp\Big(\sum_{m=n}^{n+\ell}\log(1 - 1/m)\Big) \le \exp\Big(-\sum_{m=n}^{n+\ell}1/m\Big) \xrightarrow{\ell\to\infty}0, $$
so, by continuity of $P$ (which is implied by $\sigma$-additivity!),
$$ P\Big(\bigcap_{m=n}^{\infty}A^c_m\Big) = 0 $$
for any $n$, and thus $P(A_{i.o.}) = 1$, meaning that $X_n$ does not converge to zero $P$-a.s.

Is the independence crucial? Yes! For example, take the dependent (why?) random variables on the Lebesgue space, $X_n = \mathbf{1}(\omega \le 1/n)$. Then the set $\{\omega: X_n(\omega)\not\to0\}$ is just the singleton $\{0\}$, whose probability is zero, and so $P(X_n\to0) = 1$!

This example is a particular case of the Borel-Cantelli lemmas:
$$ \sum_{n=1}^{\infty}P(A_n) < \infty \implies P(A_{i.o.}) = 0 $$
and
$$ \sum_{n=1}^{\infty}P(A_n) = \infty, \ A_n \text{ independent} \implies P(A_{i.o.}) = 1. $$

^5 The superscript $c$ stands for complement, i.e. $A^c = \Omega\setminus A$.
5. Conditional expectation
The conditional expectation of a random variable $X \ge 0$ with respect to a $\sigma$-algebra $\mathcal{G}\subseteq\mathcal{F}$ (under the measure $P$) is a random variable, denoted by $E(X|\mathcal{G})(\omega)$, which satisfies the properties:
(1) $E(X|\mathcal{G})(\omega)$ is $\mathcal{G}$-measurable;
(2) $E\big(X - E(X|\mathcal{G})\big)\mathbf{1}_A = 0$ for all $A\in\mathcal{G}$.
The conditional expectation is characterized by these properties up to almost sure equivalence.

Example 1.6. Suppose $\mathcal{G}$ is generated by a finite partition $G$ of $\Omega$, i.e.
$$ G = \{G_1, \ldots, G_n\}, \quad G_i\cap G_j = \emptyset, \quad \bigcup_{j=1}^n G_j = \Omega. $$
Then (why?)
$$ E(X|\mathcal{G}) = \sum_{\ell=1}^n\frac{EX\mathbf{1}_{G_\ell}}{P(G_\ell)}\mathbf{1}_{G_\ell}(\omega), $$
where $0/0 = 0$ is understood.

For a general random variable $X$, $E(X|\mathcal{G}) = E(X^+|\mathcal{G}) - E(X^-|\mathcal{G})$ if no uncertainty of the type $\infty - \infty$ arises.

The inverse images $\{\omega: Y\in B\}$, $B\in\mathcal{B}(\mathbb{R})$, of a random variable $Y$ form a $\sigma$-algebra $\mathcal{G}^Y\subseteq\mathcal{F}$. The conditional expectation $E(X|\mathcal{G}^Y)$ is usually denoted by $E(X|Y)$ and there always exists^6 a Borel function $\phi$, such that $E(X|Y) = \phi(Y)$.

^6 If the space is not too wild, e.g. Polish spaces are OK.
The conditional expectation enjoys the same properties as the expectation and, in addition:
(A') if $\mathcal{G}_1\subseteq\mathcal{G}_2$, then $E\big(E(X|\mathcal{G}_2)\big|\mathcal{G}_1\big) = E(X|\mathcal{G}_1)$, $P$-a.s.;
(B') if $E|X|^2 < \infty$, then for any Borel function $g$
$$ E\big(X - E(X|Y)\big)^2 \le E\big(X - g(Y)\big)^2. \tag{1.2} $$
The latter property can be interpreted as optimality in the mean square sense of the conditional expectation among all estimates of $X$ given the realization of $Y$ (cf. (7) from the previous chapter). The main tool in calculation of the conditional expectation is the Bayes formula.

Example 1.7. Let $(X, Y)$ be a pair of random variables and suppose that their distribution has a density (with respect to the Lebesgue measure on the plane), i.e.
$$ P\big(X \le x, Y \le y\big) = \int_{-\infty}^x\int_{-\infty}^y f(u,v)\,du\,dv. $$
Suppose that $EX^2 < \infty$; then (why?)
$$ E(X|Y)(\omega) = \frac{\int_{\mathbb{R}}xf\big(x, Y(\omega)\big)dx}{\int_{\mathbb{R}}f\big(u, Y(\omega)\big)du}. $$

Later we will prove and use a more abstract version of this formula.
6. Gaussian random variables
A random variable $X$ is Gaussian with mean $EX = m$ and variance $E(X - EX)^2 = \sigma^2 > 0$ if
$$ F_X(x) := P(X \le x) = \int_{(-\infty, x]}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{(u-m)^2}{2\sigma^2}\Big)du. $$
The corresponding characteristic function is
$$ \phi_X(\lambda) = Ee^{i\lambda X} = \exp\Big(i\lambda m - \frac{1}{2}\lambda^2\sigma^2\Big). $$
If the latter is taken as the definition (since there is a one to one correspondence between $F_X$ and $\phi_X$), then the degenerate case $\sigma = 0$ is included as well, i.e. a constant random variable can be considered as Gaussian.

Analogously, a random vector $X$ with values in $\mathbb{R}^d$ is Gaussian with mean $EX = m\in\mathbb{R}^d$ and covariance matrix $C = E(X - EX)(X - EX)^* \ge 0$ (a nonnegative definite matrix!), if
$$ \phi_X(\lambda) = E\exp\big(i\lambda^*X\big) = \exp\Big(im^*\lambda - \frac{1}{2}\lambda^*C\lambda\Big), \quad \lambda\in\mathbb{R}^d. $$
Finally, a random process is Gaussian if its finite dimensional distributions are Gaussian. Gaussian processes have a special place in probability theory and in particular in filtering, as we will see soon.
Exercises
(1) Let $\{A_n\}_{n\ge1}$ be a sequence of events and define the events $A_{i.o.} = \bigcap_{n\ge1}\bigcup_{m\ge n}A_m$ and $A_{e} = \bigcup_{n\ge1}\bigcap_{m\ge n}A_m$.
  (a) Explain the terms i.o. (infinitely often) and e (eventually) in the notations.
  (b) Is $A_{i.o.} = A_e$ if $A_n$ is a monotonous sequence, i.e. $A_n\subseteq A_{n+1}$ or $A_n\supseteq A_{n+1}$ for all $n\ge1$?
  (c) Explain the notations $A_{i.o.} = \overline{\lim}_n A_n$ and $A_e = \underline{\lim}_n A_n$.
  (d) Show that $A_e\subseteq A_{i.o.}$
(2) Prove the Borel-Cantelli lemmas.
(3) Using the Borel-Cantelli lemmas, show that
  (a) a sequence $X_n$ converging in probability has a subsequence converging almost surely;
  (b) a sequence $X_n$, converging exponentially^7 in $L^2$, converges $P$-a.s.;
  (c) if $X_n$ is an i.i.d. sequence with $E|X_1| < \infty$, then $X_n/n$ converges to zero $P$-a.s.;
  (d) if $X_n$ is an i.i.d. sequence with $E|X_1| = \infty$, then $\overline{\lim}_n|X_n|/n = \infty$ $P$-a.s.;
  (e) if $X_n$ is a standard Gaussian i.i.d. sequence, then $\overline{\lim}_n|X_n|/\sqrt{2\ln n} = 1$, $P$-a.s.
(4) Give counterexamples to the following false implications:
  (a) convergence in probability implies $L^2$ convergence;
  (b) $P$-a.s. convergence implies $L^2$ convergence;
  (c) $L^2$ convergence implies $P$-a.s. convergence.
(5) Let $X$ be a r.v. with uniform distribution on $[0,1]$ and $\eta$ be a r.v. given by
$$ \eta = \begin{cases}X, & X \le 0.5 \\ 0.5, & X > 0.5.\end{cases} $$
Find $E(X|\eta)$.
(6) Let $\xi_1, \xi_2, \ldots$ be an i.i.d. sequence. Show that
$$ E(\xi_1|S_n, S_{n+1}, \ldots) = \frac{S_n}{n}, $$
where $S_n = \xi_1 + \ldots + \xi_n$.
(7) (a) Consider an event $A$ that does not depend on itself, i.e. $A$ and $A$ are independent. Show that $P(A) = 1$ or $P(A) = 0$.
  (b) Let $A$ be an event such that $P(A) = 1$ or $P(A) = 0$. Show that $A$ and any other event $B$ are independent.
  (c) Show that a r.v. $\xi(\omega)$ doesn't depend on itself if and only if $\xi(\omega) \equiv$ const.
(8) Consider the Lebesgue probability space and define a sequence of random variables^8
$$ X_n(\omega) = \lfloor 2^n\omega\rfloor \bmod 2. $$
Show that $X_n$ is an i.i.d. sequence.
(9) Let $Y$ be a nonnegative random variable with probability density
$$ f(y) = \frac{1}{\sqrt{2\pi}}\frac{e^{-y/2}}{\sqrt{y}}, \quad y \ge 0. $$
Define the conditional density of $X$ given fixed $Y$:
$$ f(x; Y) = \sqrt{\frac{Y}{2\pi}}\,e^{-Yx^2/2}, $$
i.e. for any bounded function $f$
$$ E\big(f(X)|Y\big) = \int_{\mathbb{R}}f(x)f(x; Y)\,dx. $$
Does the formula $E\big(E(X|Y)\big) = EX$ hold? If not, explain why.
(10) Give an example of three dependent random variables, any two of which are independent.
(11) Let $X$ and $Z$ be a pair of independent r.v. and $E|X| < \infty$. Then $E(X|Z) = EX$ with probability one. Does the formula
$$ E(X|Z, Y) = E(X|Y) $$
hold for an arbitrary $Y$?
(12) Let $X_1$ and $X_2$ be two random variables such that $EX_1 = 0$ and $EX_2 = 0$. Suppose we can find a linear combination $Y = X_1 + \lambda X_2$, which is independent of $X_2$. Show that $E(X_1|X_2) = -\lambda X_2$.
(13) Show that the coordinate (canonical) process on the space from Example 1.3 is Markov, i.e.
$$ E\big(f(X_n)|X_0, \ldots, X_{n-1}\big) = E\big(f(X_n)|X_{n-1}\big), \quad P\text{-a.s.} \tag{1.3} $$
for any bounded Borel $f$.
(14) Let $(X_n)_{n\ge0}$ be a sequence of random variables and let
$$ \mathcal{G}_{\le n} := \sigma\{X_0, \ldots, X_n\} \quad \text{and} \quad \mathcal{G}_{\ge n} := \sigma\{X_n, X_{n+1}, \ldots\}. $$
Show that the Markov property (1.3) is equivalent to the property
$$ E\big(\eta\zeta|X_n\big) = E\big(\eta|X_n\big)E\big(\zeta|X_n\big) $$
for all bounded random variables $\eta$ and $\zeta$, $\mathcal{G}_{\le n}$ and $\mathcal{G}_{\ge n}$ measurable respectively. In other words, the Markov property is equivalently stated as "the future and the past are conditionally independent, given the present".
(15) Let $X$ and $Y$ be i.i.d. random variables with finite variance and twice differentiable probability density. Show that if $X + Y$ and $X - Y$ are independent, then $X$ and $Y$ are Gaussian.
(16) Let $X_1$, $X_2$ and $X_3$ be independent standard Gaussian random variables. Show that
$$ \frac{X_1 + X_2X_3}{\sqrt{1 + X_3^2}} $$
is a standard Gaussian random variable as well.
(17) Let $(X_1, X_2, X_3, X_4)$ be a Gaussian vector with zero mean. Show that
$$ EX_1X_2X_3X_4 = EX_1X_2\,EX_3X_4 + EX_1X_3\,EX_2X_4 + EX_1X_4\,EX_2X_3. $$
Recall that the moments, if they exist, can be recovered from the derivatives of the characteristic function at $\lambda = 0$.
(18) Let $f(x)$ be the probability density function of a Gaussian variable, i.e.
$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-a)^2/(2\sigma^2)}. $$
Define the function
$$ g_n(x_1, \ldots, x_n) = \Big(\prod_{j=1}^n f(x_j)\Big)\Big(1 + \prod_{k=1}^n(x_k - a)f(x_k)\Big), \quad (x_1, \ldots, x_n)\in\mathbb{R}^n. $$
  (a) Show that $g_n(x_1, \ldots, x_n)$ is a valid probability density function of some random vector $X = (X_1, \ldots, X_n)$.
  (b) Show that any subvector of $X$ is Gaussian, while $X$ is not Gaussian.
(19) Let $f(x, y, \rho)$ be a two dimensional Gaussian probability density, such that the marginal densities have zero means and unit variances and the correlation coefficient is $\int_{\mathbb{R}}\int_{\mathbb{R}}xyf(x, y, \rho)\,dx\,dy = \rho$. Form the new density
$$ g(x, y) = c_1f(x, y, \rho_1) + c_2f(x, y, \rho_2) $$
with $c_1 > 0$, $c_2 > 0$, $c_1 + c_2 = 1$.
  (a) Show that $g(x, y)$ is a valid probability density of some vector $(X, Y)$.
  (b) Show that each of the r.v. $X$ and $Y$ is Gaussian.
  (c) Show that $c_1$, $c_2$ and $\rho_1$, $\rho_2$ can be chosen so that $EXY = 0$. Are $X$ and $Y$ independent?

^7 i.e. $E|X_n - X|^2 \le C\lambda^n$ for all $n \ge 1$ with $C \ge 0$ and $\lambda\in[0,1)$.
^8 $\lfloor x\rfloor$ is the integer part of $x$.
CHAPTER 2
Linear filtering in discrete time
Consider a pair of square integrable random variables $(X, Y)$ on a probability space $(\Omega, \mathcal{F}, P)$. Suppose that the following (second order) probabilistic description of the pair is available:
$$ EX, \quad EY, \quad \mathrm{cov}(X) := E(X - EX)^2, \quad \mathrm{cov}(Y) := E(Y - EY)^2, \quad \mathrm{cov}(X, Y) := E(X - EX)(Y - EY), $$
and it is required to find a pair of constants $a'_0$ and $a'_1$, such that
$$ E\big(X - a'_0 - a'_1Y\big)^2 \le E\big(X - a_0 - a_1Y\big)^2, \quad \forall a_0, a_1\in\mathbb{R}. $$
The corresponding estimate $\widehat X = a'_0 + a'_1Y$ is then the optimal linear estimate of $X$, given the observation (realization) of $Y$. Clearly
$$ E\big(X - a_0 - a_1Y\big)^2 = E\big(X - EX - a_1(Y - EY) + EX - a_1EY - a_0\big)^2 $$
$$ = \mathrm{cov}(X) - 2a_1\mathrm{cov}(X, Y) + a_1^2\mathrm{cov}(Y) + (EX - a_1EY - a_0)^2 \ge \mathrm{cov}(X) - \mathrm{cov}(X, Y)^2/\mathrm{cov}(Y), $$
where $\mathrm{cov}(Y) > 0$ was assumed. The minimizers are
$$ a'_1 = \frac{\mathrm{cov}(X, Y)}{\mathrm{cov}(Y)}, \qquad a'_0 = EX - \frac{\mathrm{cov}(X, Y)}{\mathrm{cov}(Y)}EY. $$
If $\mathrm{cov}(Y) = 0$ (or in other words $Y = EY$, $P$-a.s.), then the same arguments lead to
$$ a'_1 = 0, \qquad a'_0 = EX. $$
So among all linear functionals of $\{1, Y\}$ (or affine functionals of $Y$), there is the unique optimal one^1, given by
$$ \widehat X := EX + \mathrm{cov}(X, Y)\,\mathrm{cov}^{\dagger}(Y)\big(Y - EY\big) \tag{2.1} $$
with the corresponding minimal mean square error
$$ E(X - \widehat X)^2 = \mathrm{cov}(X) - \mathrm{cov}^2(X, Y)\,\mathrm{cov}^{\dagger}(Y), $$
where for any $x\in\mathbb{R}$
$$ x^{\dagger} = \begin{cases}x^{-1}, & x \ne 0 \\ 0, & x = 0.\end{cases} $$

^1 Note that the pair of optimal coefficients $(a'_0, a'_1)$ is unique, though the random variable $a'_0 + a'_1Y(\omega)$ can be modified on a $P$-null set without altering the mean square error. So the uniqueness of the estimate is understood as uniqueness among the equivalence classes of random variables (all equal with probability one within each class).
Note that the optimal estimate satisfies the orthogonality property
$$ E\big(X - \widehat X\big)\cdot1 = 0, \qquad E\big(X - \widehat X\big)Y = 0, $$
that is, the residual estimation error is orthogonal to any linear functional of the observations. It is of course not a coincidence, since (2.1) is nothing but the orthogonal projection of $X$ on the linear space spanned by the random variables 1 and $Y$. These simple formulae are the basis for the optimal linear filtering equations of Kalman and Bucy ([13], [14]), which are the subject of this chapter.
1. The Hilbert space $L^2$, orthogonal projection and linear estimation

Let $L^2(\Omega, \mathcal{F}, P)$ (or simply $L^2$) denote the space of all square integrable random variables^2. Equipped with the scalar product
$$ \langle X, Y\rangle := EXY, \quad X, Y\in L^2, $$
and the induced norm $\|X\| := \sqrt{\langle X, X\rangle}$, $L^2$ is a Hilbert space (i.e. an infinite dimensional Euclidian space). Let $L$ be a closed linear subspace of $L^2$ (either finite or infinite dimensional at this point). Then

Theorem 2.1. For any $X\in L^2$, there exists a unique^3 random variable $\widehat X\in L$, called the orthogonal projection and denoted by $\widehat E(X|L)$, such that
$$ E\big(X - \widehat X\big)^2 = \inf_{\bar X\in L}E\big(X - \bar X\big)^2 \tag{2.2} $$
and
$$ E\big(X - \widehat X\big)Z = 0 \tag{2.3} $$
for any $Z\in L$.

^2 More precisely, of the equivalence classes with respect to the relation $P(X = Y) = 1$.
^3 Actually a unique equivalence class.
Proof. Let $d^2 := \inf_{\bar X\in L}E\big(X - \bar X\big)^2$ and let $\widehat X_j$ be a sequence in $L$, such that $d_j^2 := E(X - \widehat X_j)^2 \to d^2$. Then $\widehat X_j$ is a Cauchy sequence in $L^2$:
$$ E\big(\widehat X_j - \widehat X_i\big)^2 = 2E\big(X - \widehat X_i\big)^2 + 2E\big(X - \widehat X_j\big)^2 - 4E\Big(X - \frac{\widehat X_i + \widehat X_j}{2}\Big)^2 \le 2E\big(X - \widehat X_i\big)^2 + 2E\big(X - \widehat X_j\big)^2 - 4d^2 \xrightarrow{i,j\to\infty}0, $$
where the inequality holds since $\frac{\widehat X_i + \widehat X_j}{2}\in L$. The space $L^2$ is complete and so $\widehat X_j$ converges to a random variable $X^*$ in $L^2$, and since $L$ is closed, $X^*\in L$. Then
$$ \|X - X^*\| = \sqrt{E\big(X - X^*\big)^2} \le \sqrt{E\big(X - \widehat X_j\big)^2} + \sqrt{E\big(\widehat X_j - X^*\big)^2} \xrightarrow{j\to\infty}d, $$
and so $X^*$ is a version of $\widehat X$. To verify (2.3), fix $t\in\mathbb{R}$: then for any $Z\in L$
$$ E\big(X - \widehat X\big)^2 \le E\big(X - \widehat X - tZ\big)^2 = E\big(X - \widehat X\big)^2 - 2tE(X - \widehat X)Z + t^2EZ^2. $$
The latter cannot hold for arbitrarily small $t$ of both signs unless $E(X - \widehat X)Z = 0$. Finally, $\widehat X$ is unique: suppose that $\widehat X'\in L$ satisfies (2.2) as well; then, by (2.3),
$$ E(X - \widehat X')^2 = E(X - \widehat X + \widehat X - \widehat X')^2 = E(X - \widehat X)^2 + E(\widehat X - \widehat X')^2, $$
which implies $E(\widehat X - \widehat X')^2 = 0$, or $\widehat X = \widehat X'$, $P$-a.s.
The orthogonal projection satisfies the following main properties:
(a) $E\widehat E(X|L) = EX$;
(b) $\widehat E(X|L) = X$ if $X\in L$ and $\widehat E(X|L) = 0$ if $X\perp L$;
(c) linearity: for $X_1, X_2\in L^2$ and $c_1, c_2\in\mathbb{R}$,
$$ \widehat E(c_1X_1 + c_2X_2|L) = c_1\widehat E(X_1|L) + c_2\widehat E(X_2|L); $$
(d) for two linear subspaces $L_1\subseteq L_2$,
$$ \widehat E(X|L_1) = \widehat E\big(\widehat E(X|L_2)\big|L_1\big). $$

Proof. (a)-(c) are obvious from the definition. (d) holds if
$$ E\Big(X - \widehat E\big(\widehat E(X|L_2)\big|L_1\big)\Big)Z = 0 $$
for all $Z\in L_1$, which is valid since
$$ E\Big(X - \widehat E\big(\widehat E(X|L_2)\big|L_1\big)\Big)Z = E\Big(X - \widehat E(X|L_2)\Big)Z + E\Big(\widehat E(X|L_2) - \widehat E\big(\widehat E(X|L_2)\big|L_1\big)\Big)Z = 0, \tag{2.4} $$
where the first term vanishes since $L_1\subseteq L_2$.
Theorem 2.1 suggests that the optimal in the mean square sense estimate of a random variable $X\in L^2$, from the observation (realization) of a collection of random variables $Y_j\in L^2$, $j\in\Lambda\subseteq\mathbb{Z}_+$, is given by the orthogonal projection of $X$ onto $L^Y_\Lambda := \overline{\mathrm{span}}\{Y_j, j\in\Lambda\}$. While for finite $\Lambda$ the explicit expression for $\widehat E(X|L^Y_\Lambda)$ is straightforward and is given in Proposition 2.2 below, calculation of $\widehat E(X|L^Y_\Lambda)$ in the infinite case is more involved. In this chapter the finite case is treated (still, we'll need the generality of Theorem 2.1 in the continuous time case).
Proposition 2.2. Let $X$ and $Y$ be random vectors in $\mathbb{R}^m$ and $\mathbb{R}^n$ with square integrable entries. Denote^4 by $\widehat E(X|L^Y)$ the orthogonal projection^5 of $X$ onto the linear subspace spanned by the entries of $Y$ and 1. Then^6
$$ \widehat E(X|L^Y) = EX + \mathrm{cov}(X, Y)\,\mathrm{cov}(Y)^{\dagger}\big(Y - EY\big) \tag{2.5} $$
and
$$ E\big(X - \widehat E(X|L^Y)\big)\big(X - \widehat E(X|L^Y)\big)^* = \mathrm{cov}(X) - \mathrm{cov}(X, Y)\,\mathrm{cov}(Y)^{\dagger}\,\mathrm{cov}(Y, X), \tag{2.6} $$
where $Q^{\dagger}$ stands for the generalized inverse of $Q$ (see (2.8) below).

^4 Sometimes the notation $\widehat E(X|Y) = \widehat E(X|L^Y)$ is used.
^5 Naturally, the orthogonal projection of a random vector (on some linear subspace) is the vector of the orthogonal projections of its entries.
^6 The constant random variable 1 is always added to the observations, meaning that the expectations $EX$ and $EY$ are known (available for the estimation procedure).
Proof. Let $A$ and $a$ be a matrix and a vector, such that $\widehat E(X|L^Y) = a + AY$. Then by Theorem 2.1 (applied componentwise!)
$$ 0 = E\big(X - a - AY\big) $$
and
$$ 0 = E\big(X - a - AY\big)\big(Y - EY\big)^* = E\big(X - EX - A(Y - EY) - a + EX - AEY\big)\big(Y - EY\big)^* = \mathrm{cov}(X, Y) - A\,\mathrm{cov}(Y). \tag{2.7} $$
If $\mathrm{cov}(Y) > 0$, then (2.5) follows with $\mathrm{cov}(Y)^{\dagger} = \mathrm{cov}(Y)^{-1}$. If only $\mathrm{cov}(Y) \ge 0$, there exists a unitary matrix $U$ (i.e. $UU^* = I$) and a diagonal matrix $D \ge 0$, so that $\mathrm{cov}(Y) = UDU^*$. Define^7
$$ \mathrm{cov}(Y)^{\dagger} := UD^{\dagger}U^*, \tag{2.8} $$
where $D^{\dagger}$ is a diagonal matrix with the entries
$$ D^{\dagger}_{ii} = \begin{cases}1/D_{ii}, & D_{ii} > 0 \\ 0, & D_{ii} = 0.\end{cases} \tag{2.9} $$
Then
$$ \mathrm{cov}(X, Y) - \mathrm{cov}(X, Y)\,\mathrm{cov}(Y)^{\dagger}\,\mathrm{cov}(Y) = \mathrm{cov}(X, Y)U\big(I - D^{\dagger}D\big)U^* = \sum_{\ell: D_{\ell\ell}=0}\mathrm{cov}(X, Y)u_\ell u_\ell^* \tag{2.10} $$
by the definition of $D^{\dagger}$, where $u_\ell$ is the $\ell$-th column of $U$. Clearly
$$ u_\ell^*\mathrm{cov}(Y)u_\ell = 0 \implies E\big(u_\ell^*(Y - EY)\big)^2 = 0 \implies (Y - EY)^*u_\ell = 0, \ P\text{-a.s.}, $$
and so
$$ \mathrm{cov}(X, Y)u_\ell = E(X - EX)(Y - EY)^*u_\ell = 0, $$
i.e. (2.7) holds with $A = \mathrm{cov}(X, Y)\,\mathrm{cov}(Y)^{\dagger}$. The equation (2.6) is verified directly by substitution of (2.5) and using the obvious properties of the generalized inverse.

Remark 2.3. Note that if, instead of (2.9), $D^{\dagger}$ were defined with
$$ D^{\dagger}_{ii} = \begin{cases}1/D_{ii}, & D_{ii} > 0 \\ c, & D_{ii} = 0\end{cases} $$
with $c\ne0$, the same estimate would be obtained.

^7 This is the generalized inverse of Moore and Penrose, in the special case of a nonnegative definite matrix. Note that it coincides (as it should) with the ordinary inverse if the latter exists.
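Formulae (2.5)-(2.6) translate directly into a few lines of linear algebra. The sketch below is an illustration added here (the Gaussian toy model is an arbitrary choice, and the exact moments are replaced by sample averages): it computes the optimal linear estimate with the Moore-Penrose pseudo-inverse, which handles a possibly singular $\mathrm{cov}(Y)$ exactly as in (2.8)-(2.9).

```python
import numpy as np

def linear_estimate(X, Y):
    """Optimal linear (affine) estimate of X given Y, formulae (2.5)-(2.6),
    with the exact moments replaced by sample averages.
    X: (N, m) samples of the estimated vector, Y: (N, n) observations."""
    mX, mY = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mX, Y - mY
    N = X.shape[0]
    cov_XY, cov_Y, cov_X = Xc.T @ Yc / N, Yc.T @ Yc / N, Xc.T @ Xc / N
    gain = cov_XY @ np.linalg.pinv(cov_Y)      # cov(X,Y) cov(Y)^dagger, cf. (2.8)-(2.9)
    Xhat = mX + (Y - mY) @ gain.T              # (2.5)
    P = cov_X - gain @ cov_XY.T                # (2.6)
    return Xhat, P

# toy model with a redundant observation, so that cov(Y) is singular
rng = np.random.default_rng(3)
Z = rng.standard_normal((200000, 2))
X = Z @ np.array([[1.0, 0.5]]).T                      # scalar signal to be estimated
Y12 = Z + 0.3 * rng.standard_normal(Z.shape)          # two noisy measurements
Y = np.hstack([Y12, Y12[:, :1] + Y12[:, 1:]])         # third column = sum of the first two
Xhat, P = linear_estimate(X, Y)
print("empirical MSE  :", float(np.mean((X - Xhat) ** 2)))
print("predicted (2.6):", float(P[0, 0]))
```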
2. Recursive orthogonal projection
Consider a pair of random processes $(X, Y) = (X_j, Y_j)_{j\in\mathbb{Z}_+}$ with entries in $L^2$ and let $L^Y_j = \mathrm{span}\{1, Y_0, \ldots, Y_j\}$. Calculation of the optimal estimate $\widehat E\big(X_j|L^Y_j\big)$ by the formulae of Proposition 2.2 would require inverting matrices of sizes growing linearly with $j$. The following lemma is the key to a much more efficient calculation algorithm of the orthogonal projection. Introduce the notations
$$ \widehat X_j := \widehat E\big(X_j|L^Y_j\big), \qquad \widehat X_{j|j-1} := \widehat E\big(X_j|L^Y_{j-1}\big), \qquad \widehat Y_{j|j-1} := \widehat E\big(Y_j|L^Y_{j-1}\big), $$
$$ P^X_j := E\big(X_j - \widehat X_j\big)\big(X_j - \widehat X_j\big)^*, \qquad P^X_{j|j-1} := E\big(X_j - \widehat X_{j|j-1}\big)\big(X_j - \widehat X_{j|j-1}\big)^*, $$
$$ P^{XY}_{j|j-1} := E\big(X_j - \widehat X_{j|j-1}\big)\big(Y_j - \widehat Y_{j|j-1}\big)^*, \qquad P^Y_{j|j-1} := E\big(Y_j - \widehat Y_{j|j-1}\big)\big(Y_j - \widehat Y_{j|j-1}\big)^*. $$
Then
Proposition 2.4. For $j \ge 1$
$$ \widehat X_j = \widehat X_{j|j-1} + P^{XY}_{j|j-1}\big(P^Y_{j|j-1}\big)^{\dagger}\big(Y_j - \widehat Y_{j|j-1}\big) \tag{2.11} $$
and
$$ P^X_j = P^X_{j|j-1} - P^{XY}_{j|j-1}\big(P^Y_{j|j-1}\big)^{\dagger}\big(P^{XY}_{j|j-1}\big)^*. \tag{2.12} $$

Proof. To verify (2.11), check that
$$ \Delta := X_j - \widehat X_{j|j-1} - P^{XY}_{j|j-1}\big(P^Y_{j|j-1}\big)^{\dagger}\big(Y_j - \widehat Y_{j|j-1}\big) $$
is orthogonal to $L^Y_j$. Note that $\Delta$ is orthogonal to $L^Y_{j-1}$ and so it suffices to show that $\Delta\perp Y_j$, or equivalently $\Delta\perp\big(Y_j - \widehat Y_{j|j-1}\big)$:
$$ E\Delta\big(Y_j - \widehat Y_{j|j-1}\big)^* = P^{XY}_{j|j-1} - P^{XY}_{j|j-1}\big(P^Y_{j|j-1}\big)^{\dagger}P^Y_{j|j-1} = P^{XY}_{j|j-1}\Big(I - \big(P^Y_{j|j-1}\big)^{\dagger}P^Y_{j|j-1}\Big) = 0, $$
where the last equality is verified as in (2.10). The equation (2.12) is obtained similarly to (2.6).
3. The Kalman-Bucy filter in discrete time

Consider a pair of processes $(X, Y) = (X_j, Y_j)_{j\ge0}$, generated by the linear recursive equations ($j \ge 1$)
$$ X_j = a_0(j) + a_1(j)X_{j-1} + a_2(j)Y_{j-1} + b_1(j)\varepsilon_j + b_2(j)\xi_j \tag{2.13} $$
$$ Y_j = A_0(j) + A_1(j)X_{j-1} + A_2(j)Y_{j-1} + B_1(j)\varepsilon_j + B_2(j)\xi_j \tag{2.14} $$
where
* $X_j$ and $Y_j$ have values in $\mathbb{R}^m$ and $\mathbb{R}^n$ respectively;
* $\varepsilon = (\varepsilon_j)_{j\ge1}$ and $\xi = (\xi_j)_{j\ge1}$ are orthogonal (discrete time) white noises with values in $\mathbb{R}^\ell$ and $\mathbb{R}^k$, i.e.
$$ E\varepsilon_j = 0, \quad E\varepsilon_j\varepsilon_i^* = \begin{cases}I, & i = j\\ 0, & i\ne j\end{cases}\in\mathbb{R}^{\ell\times\ell}, \qquad E\xi_j = 0, \quad E\xi_j\xi_i^* = \begin{cases}I, & i = j\\ 0, & i\ne j\end{cases}\in\mathbb{R}^{k\times k}, $$
and
$$ E\varepsilon_j\xi_i^* = 0 \quad \text{for all } i, j; $$
* the coefficients $a_0(j)$, $a_1(j)$, etc. are deterministic (known) sequences of matrices of appropriate dimensions^8; from here on we will omit the time dependence from the notation for brevity;
* the equations are solved subject to random vectors $X_0$ and $Y_0$, uncorrelated with the noises $\varepsilon$ and $\xi$, whose means and covariances are known.

^8 Note the customary abuse of notation: now the time parameter is written in parentheses instead of as a subscript.
Denote the optimal linear estimate of $X_j$, given $L^Y_j = \mathrm{span}\{1, Y_0, \ldots, Y_j\}$, by
$$ \widehat X_j = \widehat E\big(X_j|L^Y_j\big) $$
and the corresponding error covariance matrix by
$$ P_j = E\big(X_j - \widehat X_j\big)\big(X_j - \widehat X_j\big)^*. $$

Theorem 2.5. The estimate $\widehat X_j$ and the error covariance $P_j$ satisfy the equations
$$ \widehat X_j = a_0 + a_1\widehat X_{j-1} + a_2Y_{j-1} + \big(a_1P_{j-1}A_1^* + b\circ B\big)\big(A_1P_{j-1}A_1^* + B\circ B\big)^{\dagger}\big(Y_j - A_0 - A_1\widehat X_{j-1} - A_2Y_{j-1}\big) \tag{2.15} $$
and
$$ P_j = a_1P_{j-1}a_1^* + b\circ b - \big(a_1P_{j-1}A_1^* + b\circ B\big)\big(A_1P_{j-1}A_1^* + B\circ B\big)^{\dagger}\big(a_1P_{j-1}A_1^* + b\circ B\big)^*, \tag{2.16} $$
where
$$ b\circ b = b_1b_1^* + b_2b_2^*, \qquad b\circ B = b_1B_1^* + b_2B_2^*, \qquad B\circ B = B_1B_1^* + B_2B_2^*. $$
(2.15) and (2.16) are solved subject to
$$ \widehat X_0 = EX_0 + \mathrm{cov}(X_0, Y_0)\,\mathrm{cov}(Y_0)^{\dagger}\big(Y_0 - EY_0\big), $$
$$ P_0 = \mathrm{cov}(X_0) - \mathrm{cov}(X_0, Y_0)\,\mathrm{cov}(Y_0)^{\dagger}\,\mathrm{cov}(X_0, Y_0)^*. $$
Proof. Apply the formulae of Proposition 2.4 and the properties of orthogonal projections. For example,
$$ \widehat X_{j|j-1} = \widehat E\big(a_0 + a_1X_{j-1} + a_2Y_{j-1} + b_1\varepsilon_j + b_2\xi_j\big|L^Y_{j-1}\big) = a_0 + a_1\widehat E\big(X_{j-1}|L^Y_{j-1}\big) + a_2Y_{j-1} = a_0 + a_1\widehat X_{j-1} + a_2Y_{j-1}, $$
where the second equality holds since $\varepsilon_j$ and $\xi_j$ are orthogonal to $L^Y_{j-1}$.
Example 2.6. Consider an autoregressive scalar signal, generated by
$$ X_j = aX_{j-1} + \varepsilon_j, \quad X_0 = 0, $$
where $a$ is a constant and $\varepsilon$ is a white noise sequence. Suppose it is observed via a noisy linear sensor, so that the observations are given by
$$ Y_j = X_{j-1} + \xi_j, $$
where $\xi_j$ is another white noise, orthogonal to $\varepsilon$. Applying the equations from Theorem 2.5, one gets
$$ \widehat X_j = a\widehat X_{j-1} + \frac{aP_{j-1}}{P_{j-1} + 1}\big(Y_j - \widehat X_{j-1}\big), \quad \widehat X_0 = 0, $$
where
$$ P_j = a^2P_{j-1} + 1 - \frac{a^2P_{j-1}^2}{P_{j-1} + 1}, \quad P_0 = 0. \tag{2.17} $$
Many more interesting examples are given as exercises in the last section of
this chapter.
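As a quick numerical illustration of Example 2.6 (added here, not part of the original notes; the value of $a$, the horizon and the number of runs are arbitrary), the following Python sketch runs the filter together with the Riccati recursion (2.17) and compares the empirical error with $P_j$.

```python
import numpy as np

rng = np.random.default_rng(4)
a, T, runs = 0.9, 200, 2000

err = 0.0
for _ in range(runs):
    X, Xhat, P = 0.0, 0.0, 0.0                            # X_0 = Xhat_0 = 0, P_0 = 0
    for _ in range(T):
        Y = X + rng.standard_normal()                     # Y_j = X_{j-1} + xi_j
        Xhat = a * Xhat + a * P / (P + 1.0) * (Y - Xhat)  # filter of Example 2.6
        P = a * a * P + 1.0 - a * a * P * P / (P + 1.0)   # Riccati recursion (2.17)
        X = a * X + rng.standard_normal()                 # X_j = a X_{j-1} + eps_j
    err += (X - Xhat) ** 2

print("empirical error:", err / runs, "   Riccati P_T:", P)   # the two should agree closely
```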
3.1. Properties of the Kalman-Bucy filter.
1. The equation for $P_j$ is called the difference (discrete time) Riccati equation (analogously to the differential Riccati equation arising in continuous time). Note that it does not depend on the observations and so can be solved off-line (before the filter is applied to the data). Even if all the coefficients of the system (2.13) and (2.14) are constant matrices, the optimal linear filter has in general time-varying coefficients.
2. Existence, uniqueness and strict positiveness of the limit $P_\infty := \lim_{j\to\infty}P_j$ is a non-trivial question, the answer to which is known under certain conditions on the coefficients. If the limit exists and is unique, then one may use the stationary version of the filter, where all the coefficients are calculated with $P_{j-1}$ replaced by $P_\infty$. In this case, the error matrix of this suboptimal filter converges to $P_\infty$ as well, i.e. such a stationary filter is asymptotically optimal as $j\to\infty$. Note that the infinite sequence $(X, Y)$ generated by (2.13) and (2.14) may not have an $L^2$ limit (e.g. if $|a| \ge 1$ in Example 2.6), so the infinite horizon problem actually is beyond the scope of Theorem 2.1. When $(X, Y)$ is in $L^2$, the filter may be used, e.g., to realize the orthogonal projection^9 $\widehat E\big(X_0|L^Y_{(-\infty,0]}\big)$. This would coincide with the estimates obtained via the Kolmogorov-Wiener theory for stationary processes (see [28] for further exploration).
3. The propagation of $\widehat X_j$ and $P_j$ is sometimes regarded in two stages: prediction
$$ \widehat X_{j|j-1} = a_0 + a_1\widehat X_{j-1} + a_2Y_{j-1}, \qquad \widehat Y_{j|j-1} = A_0 + A_1\widehat X_{j-1} + A_2Y_{j-1}, $$
and update
$$ \widehat X_j = \widehat X_{j|j-1} + K_j\big(Y_j - \widehat Y_{j|j-1}\big), $$
where $K_j$ is the Kalman gain matrix from (2.15). A similar interpretation is possible for $P_j$.
4. The sequence
$$ \nu_j = Y_j - A_0 - A_1\widehat X_{j-1} - A_2Y_{j-1} \tag{2.18} $$
turns out to be an orthogonal sequence and is called the innovation: it is the residual information borne by $Y_j$ after its prediction on the basis of the past information is subtracted.

^9 Here $L^Y_{(-\infty,0]} = \overline{\mathrm{span}}\{\ldots, Y_{-1}, Y_0\}$.
Exercises
(1) Prove that $L^2$ is complete, i.e. any Cauchy sequence converges to a random variable in $L^2$. Hint: show first that from any Cauchy sequence in $L^2$ a $P$-a.s. convergent subsequence can be extracted (Exercise (3a) of Chapter 1).
(2) Complete the proof of Proposition 2.2 (verify (2.6)).
(3) Complete the proof of Proposition 2.4.
(4) Show that the innovation sequence $\nu_j$ from (2.18) is orthogonal. Find its covariance sequence $E\nu_j\nu_j^*$.
(5) Show that the limit $\lim_{j\to\infty}P_j$ in (2.17) exists^10 and is positive. Find the explicit expression for $P_\infty$. Does it exist when the equation (2.17) is started from an arbitrary nonnegative $P_0$?
(6) Derive the Kalman-Bucy filter equations for the model similar to Example 2.6, but with non-delayed observations
$$ X_j = aX_{j-1} + \varepsilon_j, \qquad Y_j = X_j + \xi_j. $$
(7) Derive the equations (4) and (5) from the Introduction.
(8) Consider the continuous-time AM^11 radio signal $X_t = A(s_t + 1)\cos(ft + \phi)$, $t\in\mathbb{R}_+$, with carrier frequency $f$, amplitude $A$ and phase $\phi$. The time function $s_t$ is the information message to be transmitted to the receiver, which recovers it by means of the synchronous detection algorithm: it generates a cosine wave of frequency $f'$, phase $\phi'$ and amplitude $A'$, and forms the base-band signal as follows
$$ \tilde s_t = \big[A'\cos(f't + \phi')X_t\big]_{\mathrm{LPF}}, \tag{2.19} $$
where $[\cdot]_{\mathrm{LPF}}$ is the (ideal) low pass filter operator, defined by
$$ \big[q_t + r_t\cos(c_1t + c_2)\big]_{\mathrm{LPF}} = q_t, \quad c_1, c_2\in\mathbb{R}, \ c_1\ne0, $$
for any time functions $q_t$ and $r_t$.
  (a) Show that to get $\tilde s_t = s_t$ for all $t\ge0$, the receiver has to know $f$, $A$ and $\phi$ (and choose $f'$, $\phi'$ and $A'$ appropriately).
  (b) Suppose the receiver knows $f$ (set $f = 1$), but not $A$ and $\phi$. The following strategy is agreed between the transmitter and the receiver: $s_t \equiv 0$ for all $0\le t\le T$ (the training period), i.e. the transmitter chooses some $A$ and $\phi$ and sends $X_t = A\cos(t + \phi)$ to the channel till time $T$. A digital receiver is used for processing the transmission, i.e. the received wave is sampled at times $t_j = j\Delta$, $j\in\mathbb{Z}_+$, with some fixed $\Delta > 0$, so that the following observations are available for processing:
$$ Y_{j+1} = A\cos(j\Delta + \phi) + \xi_{j+1}, \quad j = 0, 1, \ldots \tag{2.20} $$
where $\xi$ is a white noise sequence of intensity $\sigma > 0$. Define $\theta_t = \big(X_t, \dot X_t\big)^*$ and let $Z_j := \theta_{j\Delta}$, $j\in\mathbb{Z}_+$. Find the recursive equations for $Z_j$, i.e. the matrix $\Phi(\Delta)$ (depending on $\Delta$) such that
$$ Z_{j+1} = \Phi(\Delta)Z_j. \tag{2.21} $$
  (c) Using (2.21) and (2.20) and assuming that $A$ and $\phi$ are random variables with uniform distributions on $[a_1, a_2]$, $0 < a_1 < a_2$, and $[0, 2\pi]$ respectively, derive the Kalman-Bucy filter equations for the estimate $\widehat Z_j = \widehat E(Z_j|L^Y_j)$ and the corresponding error covariance $P_j$.
  (d) Find the relation between the estimates $\widehat Z_j$, $j = 0, 1, \ldots$, and the signal estimate^12 $\widehat X^\Delta_t := \widehat E\big(X_t|L^Y_{\lfloor t/\Delta\rfloor}\big)$ for all $t\in\mathbb{R}_+$.
  (e) Solve the Riccati difference equation from (c) explicitly^13.
  (f) Is exact asymptotic synchronization possible, i.e.
$$ \lim_{T\to\infty}E\big(X_T - \widehat X^\Delta_T\big)^2 = 0 \tag{2.22} $$
for any $\Delta > 0$? For those $\Delta$ for which (2.22) holds, find the decay rate of the synchronization error, i.e. find the sequence $r_j > 0$ and positive number $c$, such that
$$ \lim_{j\to\infty}E\big(X_{j\Delta} - \widehat X^\Delta_{j\Delta}\big)^2/r_j = c. $$
  (g) Relying on the asymptotic result from (e) and assuming $\sigma = 1$, what should $T$ be to attain a synchronization error of 0.001?
  (h) Simulate the results of this problem numerically (using e.g. MATLAB).
(9) (taken from R. Kalman [13]) A number of particles leaves the origin at time $j = 0$ with random velocities; after $j = 0$, each particle moves with a constant (unknown) velocity. Suppose that the position of one of these particles is measured, the data being contaminated by stationary, additive, correlated noise. What is the optimal estimate of the position and velocity of the particle at the time of the last measurement?
Let $x_1(j)$ be the position and $x_2(j)$ the velocity of the particle; $x_3(j)$ is the noise. The problem is then represented by the model
$$ x_1(j+1) = x_1(j) + x_2(j), \quad x_2(j+1) = x_2(j), \quad x_3(j+1) = \varphi x_3(j) + u(j), \quad y(j) = x_1(j) + x_3(j), \tag{2.23} $$
with the additional conditions
* $Ex_1^2(0) = Ex_2(0) = 0$, $Ex_2^2(0) = a^2 > 0$;
* $Eu(j) = 0$, $Eu^2(j) = b^2$.
  (a) Derive the Kalman-Bucy filter equations for the signal $X_j = \big(x_1(j), x_2(j), x_3(j)\big)^*$.
  (b) Derive the Kalman-Bucy filter equations for the signal $X_j = \big(x_2(j), x_3(j)\big)^*$, using the obvious relation $x_1(j) = jx_2(j) = jx_2(0)$.
  (c) Solve the Riccati equation from (b) explicitly^14.
  (d) Show that for $\varphi\ne1$ (both $|\varphi| < 1$ and $|\varphi| > 1$!), the mean square errors of the velocity and position estimates converge to 0 and $b^2$ respectively. Find the convergence rate for the velocity error.
  (e) Show that for $\varphi = 1$, the mean square error of the position estimate diverges^15!
  (f) Define the new observation sequence
$$ \tilde y(j+1) = y(j+1) - y(j), \quad j\ge0, $$
and $\tilde y(0) = y(0)$. Then (why?)
$$ \mathrm{span}\{y(j), 0\le j\le n\} = \mathrm{span}\{\tilde y(j), 0\le j\le n\}. $$
Derive the Kalman-Bucy filter for the signal $X_j := x_2(j)$ and observations $\tilde y_j$. Verify your answer in (e).
(10) Consider the linear system of algebraic equations $Ax = b$, where $A$ is an $m\times n$ matrix and $b$ is an $m\times1$ column vector. The generalized solution of these equations is the vector $x^{\dagger}$, which solves the following minimization problem (the usual Euclidian norm is used here)
$$ x^{\dagger} := \begin{cases}\mathrm{argmin}_{x\in\Pi}\|x\|^2, & \Pi\ne\emptyset\\ \mathrm{argmin}_{x\in\mathbb{R}^n}\|Ax - b\|^2, & \Pi = \emptyset\end{cases} \qquad \text{where } \Pi = \{x\in\mathbb{R}^n: \|Ax - b\| = 0\}. $$
If $A$ is square and invertible, then $x^{\dagger} = A^{-1}b$. If the equations $Ax = b$ are satisfied by more than one vector, then the vector with the least norm is chosen. If $Ax = b$ has no solutions, then the vector which minimizes the norm $\|Ax - b\|$ is chosen. This defines $x^{\dagger}$ uniquely; moreover
$$ x^{\dagger} := A^{\dagger}b = (A^*A)^{\dagger}A^*b, $$
where $A^{\dagger}$ is the Moore-Penrose generalized inverse (recall that $(A^*A)^{\dagger}$ has been defined in (2.8)).
  (a) Applying the Kalman-Bucy filter equations, show that $x^{\dagger}$ can be found by the following algorithm:
$$ x_j = x_{j-1} + \big(b_j - a_jx_{j-1}\big)\begin{cases}\dfrac{P_{j-1}a_j^*}{a_jP_{j-1}a_j^*}, & a_jP_{j-1}a_j^* > 0\\ 0, & a_jP_{j-1}a_j^* = 0\end{cases} $$
and
$$ P_j = P_{j-1} - \begin{cases}\dfrac{P_{j-1}a_j^*a_jP_{j-1}}{a_jP_{j-1}a_j^*}, & a_jP_{j-1}a_j^* > 0\\ 0, & a_jP_{j-1}a_j^* = 0,\end{cases} $$
where $a_j$ is the $j$-th row of the matrix $A$ and $b_j$ are the entries of $b$. To calculate $x^{\dagger}$, these equations are to be started from $P_0 = I$ and $x_0 = 0$ and run for $j = 1, \ldots, m$. The solution is given by $x^{\dagger} = x_m$.
  (b) Show that for each $j\le m$,
$$ a_jP_{j-1}a_j^* = \min_{c_1, \ldots, c_{j-1}}\Big\|a_j - \sum_{\ell=1}^{j-1}c_\ell a_\ell\Big\|^2, $$
so that $a_jP_{j-1}a_j^* = 0$ indicates that a row, linearly dependent on the previous ones, is encountered. So, counting the number of times zero was used to propagate the above equations, the rank of $A$ is found as a byproduct.
(11) Let $X = (X_j)_{j\in\mathbb{Z}_+}$ be a Markov chain with values in a finite set of numbers $S = \{a_1, \ldots, a_d\}$, with matrix of transition probabilities $\Lambda = (\lambda_{m\ell})$ and initial distribution^16 $\nu$, i.e.
$$ P(X_j = a_\ell|X_{j-1} = a_m) = \lambda_{m\ell}, \qquad P(X_0 = a_\ell) = \nu_\ell, \quad 1\le\ell, m\le d. $$
  (a) Let $p_j$ be the vector with entries $p_j(i) = P(X_j = a_i)$, $j\ge0$. Show that $p_j$ satisfies
$$ p_j = \Lambda^*p_{j-1}, \quad j\ge1, \quad \text{subject to } p_0 = \nu. $$
  (b) Let $I_j$ be the vector with entries $I_j(i) = \mathbf{1}(X_j = a_i)$, $j\ge0$. Show that there exists a sequence of orthogonal random vectors $\varepsilon_j$, such that
$$ I_j = \Lambda^*I_{j-1} + \varepsilon_j, \quad j\ge1. $$
Find its mean and covariance matrix.
  (c) Suppose that the Markov chain is observed via the noisy samples
$$ Y_j = h(X_j) + \sigma\xi_j, \quad j\ge1, $$
where $\xi$ is a white noise (with square integrable entries) and $\sigma > 0$ is its intensity. Let $h$ be the column vector with entries $h(a_i)$. Verify that $Y_j = h^*I_j + \sigma\xi_j$.
  (d) Derive the Kalman-Bucy filter for $\widehat I_j = \widehat E(I_j|L^Y_j)$.
  (e) What would be the estimate $\widehat E\big(g(X_j)|L^Y_j\big)$ for a function $g: S\mapsto\mathbb{R}$ in terms of $\widehat I_j$? In particular, what is $\widehat X_j = \widehat E(X_j|L^Y_j)$?
(12) Consider the ARMA(p,q) signal^17 $X = (X_j)_{j\ge0}$, generated by the recursion
$$ X_j = \sum_{k=1}^p a_kX_{j-k} + \sum_{\ell=0}^q b_\ell\varepsilon_{j-\ell}, \quad j\ge p, $$
subject to, say, $X_0 = X_1 = \ldots = X_{p-1} = 0$. Suppose that
$$ Y_j = X_{j-1} + \xi_j, \quad j\ge1. $$
Suggest a recursive estimation algorithm for $X_j$, given $L^Y_j$, based on the Kalman-Bucy filter equations.

^10 Note that the filtering error $P_j$ is finite even if the signal is unstable ($|a|\ge1$), i.e. all its trajectories diverge to $\pm\infty$ as $j\to\infty$.
^11 AM - amplitude modulation.
^12 Recall that $\lfloor x\rfloor$ is the integer part of $x$.
^13 Hint: you may need the very useful Matrix Inversion Lemma (verify it): for any matrices $A$, $B$, $C$ and $D$ (such that the required inverses exist), the following implication holds
$$ A = B^{-1} + CD^{-1}C^* \implies A^{-1} = B - BC(D + C^*BC)^{-1}C^*B. $$
^14 Hint: use the fact that the error covariance matrix is two dimensional and symmetric, i.e. there are only three parameters to find. Let the tedious calculations not scare you - the reward is coming!
^15 Note that for $|\varphi|\ge1$ the noise is unstable in the sense that its trajectories escape to $\pm\infty$. When $|\varphi| > 1$ this happens exponentially fast (in the appropriate sense) and when $\varphi = 1$ the divergence is linear. Surprisingly (for the author at least) the position estimate is worse in the latter case!
^16 Such a chain is a particular case of the Markov processes as in Example 1.3 and can be constructed in the following way: let $X_0$ be a random variable with values in $S$ and $P(X_0 = a_\ell) = \nu_\ell$, $1\le\ell\le d$, and
$$ X_j = \sum_{i=1}^d\eta^i_j\mathbf{1}_{\{X_{j-1} = a_i\}}, \quad j\ge1, $$
where $\{\eta^i_j\}$ is a table of independent random variables with the distributions $P(\eta^i_j = a_\ell) = \lambda_{i\ell}$, $j\ge1$, $1\le i, \ell\le d$.
^17 ARMA(p,q) stands for auto regressive of order p and moving average of order q. This model is very popular in voice recognition (LPC coefficients), compression, etc.
CHAPTER 3
Nonlinear filtering in discrete time
Let X and Z be a pair of independent real random variables on (Ω, F, P) and suppose that $EX^2 < \infty$. Assume for simplicity that both have probability densities $f_X(u)$ and $f_Z(u)$, i.e.
$$
P(X\le u) = \int_{-\infty}^u f_X(x)dx,\qquad P(Z\le u) = \int_{-\infty}^u f_Z(x)dx.
$$
Suppose it is required to estimate X, given the observed realization of the sum $Y = X + Z$ or, in other words, to find a function^1 $g:\mathbb R\to\mathbb R$, so that
$$
E\big(X - g(Y)\big)^2 \le E\big(X - \tilde g(Y)\big)^2 \qquad (3.1)
$$
for any other function $\tilde g:\mathbb R\to\mathbb R$. Note that such a function should be square integrable as well, since (3.1) with $\tilde g \equiv 0$ and $g^2(Y)\le 2X^2 + 2\big(X - g(Y)\big)^2$ imply
$$
E g^2(Y) \le 2EX^2 + 2E\big(X - g(Y)\big)^2 \le 4EX^2 < \infty.
$$
Moreover, if g satisfies
$$
E\big(X - g(Y)\big)\tilde g(Y) = 0 \qquad (3.2)
$$
for any $\tilde g:\mathbb R\to\mathbb R$, such that $E\tilde g^2(Y) < \infty$, then (3.1) would be satisfied too. Indeed, if $E\big(X - \tilde g(Y)\big)^2 = \infty$, the claim is trivial, and if $E\big(X - \tilde g(Y)\big)^2 < \infty$, then $E\tilde g^2(Y)\le 2EX^2 + 2E(\tilde g(Y) - X)^2 < \infty$ and
$$
E\big(X - \tilde g(Y)\big)^2 = E\big(X - g(Y) + g(Y) - \tilde g(Y)\big)^2 = E\big(X - g(Y)\big)^2 + E\big(g(Y) - \tilde g(Y)\big)^2 \ge E\big(X - g(Y)\big)^2.
$$
Moreover, the latter suggests that if another function satisfies (3.1), then it should be equal to g on any set A, such that $P(Y\in A) > 0$. Does such a function exist? Yes - we give an explicit construction using (3.2):
$$
\begin{aligned}
E\big(X - g(Y)\big)\tilde g(Y) &= \int_{\mathbb R}\int_{\mathbb R}\big(x - g(x+z)\big)\tilde g(x+z)f_X(x)f_Z(z)\,dx\,dz\\
&= \int_{\mathbb R}\int_{\mathbb R}\big(x - g(u)\big)\tilde g(u)f_X(x)f_Z(u-x)\,dx\,du\\
&= \int_{\mathbb R}\tilde g(u)\Big(\int_{\mathbb R}\big(x - g(u)\big)f_X(x)f_Z(u-x)\,dx\Big)du.
\end{aligned}
$$
The latter would vanish if
$$
\int_{\mathbb R}\big(x - g(u)\big)f_X(x)f_Z(u-x)\,dx = 0
$$
is satisfied for all u, which leads to
$$
g(u) = \frac{\int_{\mathbb R} x f_X(x)f_Z(u-x)\,dx}{\int_{\mathbb R} f_X(x)f_Z(u-x)\,dx}.
$$
So the best estimate of X given Y is the random variable
$$
E(X|Y)(\omega) = \frac{\int_{\mathbb R} x f_X(x)f_Z(Y(\omega)-x)\,dx}{\int_{\mathbb R} f_X(x)f_Z(Y(\omega)-x)\,dx}, \qquad (3.3)
$$
which is nothing but the familiar Bayes formula for the conditional expectation of X given Y.

[1] g should be a Borel function (measurable with respect to the Borel σ-algebra on R) so that all the expectations are well defined.
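To make (3.3) concrete, here is a minimal numerical sketch (not from the text; all names and parameter values are illustrative choices): with X standard Gaussian and Z Gaussian with a chosen variance, the ratio of integrals in (3.3), evaluated by simple quadrature, should reproduce the well known linear answer $y/(1+\sigma^2)$.

```python
import numpy as np

# density of X ~ N(0,1) and of Z ~ N(0, sigma^2); sigma is an illustrative choice
def f_X(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

sigma = 0.7
def f_Z(z):
    return np.exp(-z**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def g(y, grid=np.linspace(-10, 10, 4001)):
    """Evaluate the Bayes formula (3.3) by the trapezoidal rule on a finite grid."""
    num = np.trapz(grid * f_X(grid) * f_Z(y - grid), grid)
    den = np.trapz(f_X(grid) * f_Z(y - grid), grid)
    return num / den

y = 1.3
print(g(y), y / (1 + sigma**2))   # the two numbers should agree closely
```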
1. The conditional expectation: a closer look

1.1. The definition and the basic properties. Let (Ω, F, P) be a probability space, carrying a random variable $X\ge 0$ with values in R and let G be a sub-σ-algebra of F.

Definition 3.1. The conditional expectation^2 of $X\ge 0$ with respect to G is a real random variable, denoted by $E(X|G)(\omega)$, which is G-measurable, i.e.
$$
\{\omega:\ E(X|G)(\omega)\in A\}\in G,\quad \forall A\in\mathcal B(\mathbb R),
$$
and satisfies
$$
E\big(X - E(X|G)(\omega)\big)1_A(\omega) = 0,\quad \forall A\in G.
$$
Why is this definition correct, i.e. is there indeed such a random variable and is it unique? The positive answer is provided by the Radon-Nikodym theorem from analysis.

Theorem 3.2. Let $(\mathbb X, \mathcal X)$ be a measurable space^3, $\mu$ a σ-finite^4 measure and $\nu$ a signed measure^5, absolutely continuous^6 with respect to $\mu$. Then there exists an $\mathcal X$-measurable function $f = f(x)$, taking values in $\mathbb R\cup\{\pm\infty\}$, such that
$$
\nu(A) = \int_A f(x)\,\mu(dx),\quad \forall A\in\mathcal X.
$$
f is called the Radon-Nikodym derivative (or density) of $\nu$ with respect to $\mu$ and is denoted by $\frac{d\nu}{d\mu}$. It is unique up to $\mu$-null sets^7.

Now consider the measurable space (Ω, G) and define a nonnegative set function on^8 G
$$
Q(A) = \int_A X(\omega)P(d\omega) = EX1_A,\quad A\in G. \qquad (3.4)
$$
This set function is a nonnegative σ-finite measure: take for example the partition $D_j = \{j\le X < j+1\}$, $j = 0, 1, \dots$, then $Q(D_j) = EX1_{\{X\in[j,j+1)\}} < \infty$ even if $EX = \infty$. To verify $Q\ll P$, let A be such that $P(A) = 0$ and let $X_j$ be a sequence of simple random variables, such that $X_j\uparrow X$ (for example as in (1.1) on page 17), i.e.
$$
X_j = \sum_k x^j_k 1_{B^j_k},\quad B^j_k\in F,\ x^j_k\in\mathbb R.
$$
Since
$$
EX_j 1_A = \sum_k x^j_k P(B^j_k\cap A) = 0,
$$
by monotone convergence (see Theorem A.1 in the Appendix) $Q(A) = EX1_A = \lim_j EX_j1_A = 0$. Now by the Radon-Nikodym theorem there exists a random variable ζ, unique up to P-null sets and measurable with respect to G (unlike X itself!), such that
$$
Q(A) = \int_A \zeta(\omega)P(d\omega),\quad \forall A\in G.
$$
This ζ is said to be a version of the conditional expectation E(X|G), to emphasize its uniqueness only up to P-null sets:
$$
E(X|G) = \frac{dQ}{dP}(\omega).
$$
For a general random variable X, taking both positive and negative values, define $E(X|G) = E(X^+|G) - E(X^-|G)$, if no confusion of the type $\infty - \infty$ occurs with positive probability. Note that $\infty - \infty$ is allowed on the P-null sets, in which case an arbitrary value can be assigned. For this reason, the conditional expectation E(X|G) may be well defined even when EX is not. For example, let $F^X$ be the σ-algebra generated by the pre-images $\{X\in A\}$, $A\in\mathcal B(\mathbb R)$. Suppose that $EX^+ = \infty$ and $EX^- = \infty$, so that EX is not defined. Since $\{X^+ = \infty\}\cap\{X^- = \infty\}$ is a null set, the conditional expectation is well defined and equals
$$
E(X|F^X) = E(X^+|F^X) - E(X^-|F^X) = X^+ - X^- = X.
$$
[2] Note that the conditional probability is a special case of the conditional expectation: $P(B|G) = E(I_B|G)$.
[3] i.e. a collection of points $\mathbb X$ with a σ-algebra $\mathcal X$ of its subsets.
[4] i.e. $\mu(\mathbb X) = \infty$ is allowed, only if there is a countable partition $D_j\in\mathcal X$, $\cup_j D_j = \mathbb X$, so that $\mu(D_j) < \infty$ for any j. For example, the Lebesgue measure on $\mathcal B(\mathbb R)$ is not a finite measure (the length of the whole line is $\infty$). It is σ-finite, since R can be partitioned into e.g. intervals of unit Lebesgue measure.
[5] i.e. one which can be represented as $\nu = \nu_1 - \nu_2$, with at least one of the $\nu_i$ finite.
[6] A measure ν is absolutely continuous with respect to μ (denoted $\nu\ll\mu$), if for any $A\in\mathcal X$, $\mu(A) = 0 \Rightarrow \nu(A) = 0$. The measures μ and ν are said to be equivalent, if $\nu\ll\mu$ and $\mu\ll\nu$.
[7] i.e. if there is another function h, such that $\nu(A) = \int_A h(x)\mu(dx)$, then $\mu(h\ne f) = 0$.
[8] Note that the integral here is well defined for $A\in F$ as well, but we restrict it to $A\in G$ only.
Example 3.3. Let G be the (finite) σ-algebra generated by the finite partition $D_j\in F$, $j = 1, \dots, n$, $\cup_j D_j = \Omega$, $P(D_j) > 0$. Any G-measurable random variable (with real values) is necessarily constant on each set $D_j$: suppose it takes two distinct values on e.g. $D_1$, say $x' < x''$, then $\{\omega: X(\omega)\le x'\}\cap D_1$ and $\{\omega: X(\omega)\ge x''\}\cap D_1$ are disjoint nonempty proper subsets of $D_1$, and hence such events cannot belong to G. So for any random variable X,
$$
E(X|G) = \sum_{j=1}^n a_j 1_{D_j}(\omega).
$$
The constants $a_j$ are found from
$$
E\Big(X - \sum_{j=1}^n a_j 1_{D_j}\Big)1_{D_i} = 0,\quad i = 1, \dots, n,
$$
which leads to
$$
E(X|G) = \sum_{j=1}^n \frac{EX1_{D_j}}{P(D_j)}\, 1_{D_j}(\omega). \qquad\square
$$
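A quick Monte Carlo illustration of Example 3.3 (the setup below is my own choice, not from the text): take U uniform on [0,1], $X = U^2$, and let G be generated by the partition $D_j = \{U\in[j/3,(j+1)/3)\}$; on each cell $D_j$ the conditional expectation equals $EX1_{D_j}/P(D_j)$.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.uniform(size=1_000_000)
X = U**2
cells = np.floor(U * 3).astype(int)          # index j of the cell D_j containing U

cond_exp = np.array([X[cells == j].mean() for j in range(3)])   # E(X|G) on D_0, D_1, D_2
exact = np.array([((j + 1)**3 - j**3) / 27 for j in range(3)])  # 1/27, 7/27, 19/27
print(cond_exp)
print(exact)
```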
The conditioning with respect to σ-algebras generated by the pre-images of random variables (or more complex random objects), i.e. by the sets of the form
$$
F^Y = \{\omega: Y\in A\},\quad A\in\mathcal B(\mathbb R),
$$
is of special interest. Given a pair of random variables (X, Y), E(X|Y) is sometimes^9 written shortly for $E(X|F^Y)$. It can be shown that for any $F^Y$-measurable random variable Z(ω), there exists a Borel function φ, such that $Z = \varphi(Y(\omega))$. In particular, there always can be found a Borel function g, so that $E(X|Y) = g(Y)$. This function is sometimes denoted by $E(X|Y = y)$.

The main properties of the conditional expectations are^10
(A) if C is a constant and X = C, then E(X|G) = C
(B) if $X\le Y$, then $E(X|G)\le E(Y|G)$
(C) $|E(X|G)|\le E(|X|\,|G)$
(D) if $a, b\in\mathbb R$, and $aEX + bEY$ is well defined, then $E(aX + bY|G) = aE(X|G) + bE(Y|G)$
(E) if X is G-measurable, then E(X|G) = X
(F) if $G_1\subseteq G_2$, then $E\big(E(X|G_2)\,\big|\,G_1\big) = E(X|G_1)$
(G) if X and Y are independent and f(x, y) is such that $E|f(X, Y)| < \infty$, then
$$
E\big(f(X, Y)\,|\,Y\big) = \int_\Omega f\big(X(\omega'), Y(\omega)\big)P(d\omega').
$$
In particular, if X is independent of G and EX is well defined, then $E(X|G) = EX$.
(H) if Y is G-measurable and $E|Y| < \infty$ and $E|YX| < \infty$, then $E(XY|G) = Y E(X|G)$
(I) let (X, Y) be a pair of random variables and $E|X|^2 < \infty$, then
$$
E\big(X - E(X|Y)\big)^2 = \inf_{\varphi} E\big(X - \varphi(Y)\big)^2, \qquad (3.5)
$$
where the infimum is taken over all Borel functions φ.

Let $A_j$ be a sequence of disjoint events, then
$$
P\Big(\bigcup_j A_j\,\Big|\,G\Big) = \sum_j P(A_j|G). \qquad (3.6)
$$
So one is tempted to think that for any fixed ω, P(A|G)(ω) is a measure on F. This is wrong in general, since (3.6) holds only up to P-null sets. Denote by $N_i$ the set of ω at which (3.6) fails for the specific sequence $A^{(i)}_j$, $j = 1, 2, \dots$, and let N be the union of all null sets of the latter form. Since in general there can be uncountably many sequences of events, N may have positive probability! So in general, the function
$$
F_X(x;\omega) = P\big(X\le x\,|\,G\big)(\omega)
$$
may not be a proper distribution function for ω from a set of positive probability. It turns out that for any random variable X with values in a complete separable metric space $\mathbb X$, there exists a so called regular conditional measure of X, given G, i.e. a function $P_X(B;\omega)$, which is a probability measure on $\mathcal B(\mathbb X)$ for each fixed $\omega\in\Omega$ and is a version of $P(X\in B|G)(\omega)$. Obviously regular conditional expectation plays the central role in statistical problems, where typically it is required to find an explicit formula (function), which can be applied to the realizations of the observed random variables. For example, a regular conditional expectation was explicitly constructed in (3.3).

[9] throughout, these notations are freely switched
[10] as usual, any relations involving comparison of random variables are understood P-a.s.
1.2. The Bayes formula: an abstract formulation. The Bayes formula (3.3) involves explicit distribution functions of the random variables involved in the estimation problem. On the other hand, the abstract definition of the conditional expectation of the previous section allows to consider setups where the conditioning σ-algebra is not necessarily generated by random variables whose distributions have explicit formulae: think for example of $E(X|F^Y_t)$, where $F^Y_t = \sigma\{Y_s, 0\le s\le t\}$ with $Y_t$ a continuous time process.

Theorem 3.4. (the Bayes formula) Let (Ω, F, P) be a probability space, carrying a real random variable X and let G be a sub-σ-algebra of F. Assume that there exists a regular conditional probability measure^11 $P(d\omega|X = x)$ on G and it has Radon-Nikodym density $\rho(\omega; x)$ with respect to a σ-finite measure ν (on G):
$$
P(B|X = x) = \int_B \rho(\omega; x)\nu(d\omega).
$$
Then for any $\varphi:\mathbb R\to\mathbb R$, such that $E|\varphi(X)| < \infty$,
$$
E\big(\varphi(X)\,|\,G\big) = \frac{\int_{\mathbb R}\varphi(u)\rho(\omega; u)P_X(du)}{\int_{\mathbb R}\rho(\omega; u)P_X(du)}, \qquad (3.7)
$$
where $P_X$ is the probability measure induced by X (on $\mathcal B(\mathbb R)$).
Proof. Recall that
$$
E\big(\varphi(X)\,|\,G\big)(\omega) = \frac{dQ}{dP}(\omega) \qquad (3.8)
$$
where Q is a signed measure, defined by
$$
Q(B) = \int_B \varphi(X(\omega))P(d\omega),\quad B\in G.
$$
Let $F^X = \sigma\{X\}$. Then for any $B\in G$
$$
P(B) = E\,E(1_B|F^X) = \int_\Omega P(B|F^X)(\omega)P(d\omega) \stackrel{\star}{=} \int_{\mathbb R} P(B|X = u)P_X(du)
= \int_{\mathbb R}\int_B \rho(\omega; u)\nu(d\omega)P_X(du) \stackrel{\star\star}{=} \int_B\Big(\int_{\mathbb R}\rho(\omega; u)P_X(du)\Big)\nu(d\omega) \qquad (3.9)
$$
where the equality $\star$ is changing variables under the Lebesgue integral and $\star\star$ follows from the Fubini theorem (see Theorem A.5 in the Appendix for quick reference). Also for any $B\in G$
$$
Q(B) := E\varphi(X)1_B = E\varphi(X)E\big(1_B|F^X\big)(\omega) = \int_{\mathbb R}\varphi(u)P(B|X = u)P_X(du)
= \int_{\mathbb R}\varphi(u)\int_B\rho(\omega; u)\nu(d\omega)P_X(du) = \int_B\Big(\int_{\mathbb R}\varphi(u)\rho(\omega; u)P_X(du)\Big)\nu(d\omega). \qquad (3.10)
$$
Note that $Q\ll P$ and by (3.9) $P\ll\nu$ (on G!) and thus also $Q\ll\nu$. So for any $B\in G$
$$
Q(B) = \int_B\frac{dQ}{dP}(\omega)P(d\omega) = \int_B\frac{dQ}{dP}(\omega)\frac{dP}{d\nu}(\omega)\nu(d\omega),
$$
while on the other hand
$$
Q(B) = \int_B\frac{dQ}{d\nu}(\omega)\nu(d\omega),\quad B\in G.
$$
By arbitrariness of B, it follows that
$$
\frac{dQ}{d\nu}(\omega) = \frac{dQ}{dP}(\omega)\frac{dP}{d\nu}(\omega),\quad \nu\text{-a.s.}
$$
Now since
$$
P\Big(\omega:\ \frac{dP}{d\nu}(\omega) = 0\Big) = \int_\Omega 1\Big\{\frac{dP}{d\nu}(\omega) = 0\Big\}P(d\omega) = \int_\Omega 1\Big\{\frac{dP}{d\nu}(\omega) = 0\Big\}\frac{dP}{d\nu}(\omega)\nu(d\omega) = 0,
$$
it follows that
$$
\frac{dQ}{dP}(\omega) = \frac{dQ/d\nu\,(\omega)}{dP/d\nu\,(\omega)},\quad P\text{-a.s.}
$$
The latter and (3.8), (3.9), (3.10) imply (3.7). $\square$

[11] i.e. a measurable function P(B; x), which is a probability measure on F for any fixed $x\in\mathbb R$ and such that $P(B; X(\omega))$ coincides with $P(B|F^X)(\omega)$ up to P-null sets.
Corollary 3.5. Suppose that G is generated by a random variable Y and there is a σ-finite measure ν on $\mathcal B(\mathbb R)$ and a measurable function (density) $r(u; x)\ge 0$ so that
$$
P(Y\in A|X = x) = \int_A r(u; x)\nu(du).
$$
Then for $E|\varphi(X)| < \infty$,
$$
E\big(\varphi(X)\,|\,G\big) = \frac{\int_{\mathbb R}\varphi(u)\,r\big(Y(\omega), u\big)P_X(du)}{\int_{\mathbb R} r\big(Y(\omega), u\big)P_X(du)}. \qquad (3.11)
$$
Proof. By the Fubini theorem (see Appendix)
$$
P(Y\in A) = EP(Y\in A|X) = E\int_A r(u; X(\omega))\nu(du) = \int_A Er(u; X(\omega))\nu(du).
$$
Denote $r(u) := Er(u; X(\omega))$ and define
$$
\rho(\omega; x) =
\begin{cases}
\dfrac{r\big(Y(\omega), x\big)}{r\big(Y(\omega)\big)}, & r\big(Y(\omega)\big) > 0\\[4pt]
0, & r\big(Y(\omega)\big) = 0.
\end{cases}
$$
Any G-measurable set is by definition a preimage of some A under Y(ω), i.e. for any $B\in G$, there is $A\in\mathcal B(\mathbb R)$ such that $B = \{\omega: Y(\omega)\in A\}$. Then
$$
\int_B \rho(\omega; x)P(d\omega) = \int_A \frac{r(u, x)}{r(u)}\,r(u)\nu(du) = \int_A r(u; x)\nu(du) = P(Y\in A|X = x) = P(B|X = x).
$$
Now (3.11) follows from (3.7) with this specific $\rho(\omega; x)$ and $\nu(d\omega) := P(d\omega)$, where the denominators cancel. $\square$
Remark 3.6. Let $(\widetilde\Omega, \widetilde F, \widetilde P)$ be a copy of the probability space (Ω, F, P); then (3.11) reads
$$
E\big(\varphi(X)\,|\,G\big) = \frac{\widetilde E\,\varphi(X(\widetilde\omega))\,r\big(Y(\omega), X(\widetilde\omega)\big)}{\widetilde E\,r\big(Y(\omega), X(\widetilde\omega)\big)}, \qquad (3.12)
$$
where $\widetilde E$ denotes expectation on $(\widetilde\Omega, \widetilde F, \widetilde P)$ (and $X(\widetilde\omega)$ is a copy of X, defined on this auxiliary probability space).

Remark 3.7. The formula (3.11) (and its notation (3.12)) holds when X and Y are random vectors.

Remark 3.8. Often the following notation is used
$$
P\big(X\in du\,|\,Y = y\big) = \frac{r\big(y, u\big)P_X(du)}{\int_{\mathbb R} r\big(y, u\big)P_X(du)}
$$
for the regular conditional distribution of X given $F^Y$. Note that it is absolutely continuous with respect to the measure induced by X.
2. The nonlinear filter via the Bayes formula

Let $(X_j, Y_j)_{j\ge 0}$ be a pair of random sequences with the following structure:

* $X_j$ is a Markov process with the transition kernel^12 $\Lambda(x, du)$ and initial distribution $p(du)$, that is
$$
P(X_j\in B\,|\,F^X_{j-1}) = \int_B \Lambda(X_{j-1}, du),\quad P\text{-a.s.},
$$
where^13 $F^X_{j-1} = \sigma\{X_0, \dots, X_{j-1}\}$, and
$$
P(X_0\in B) = \int_B p(du),\quad B\in\mathcal B(\mathbb R).
$$
* $Y_j$ is a random sequence, such that for all^14 $j\ge 0$
$$
P(Y_j\in B\,|\,F^X_j\vee F^Y_{j-1}) = \int_B \Gamma(X_j, du),\quad P\text{-a.s.} \qquad (3.13)
$$
with a Markov kernel $\Gamma(x, du)$, which has density $\gamma(x, u)$ with respect to some σ-finite measure $\nu(du)$ on $\mathcal B(\mathbb R)$.
* $f:\mathbb R\to\mathbb R$ is a measurable function, such that $E|f(X_j)| < \infty$ for each $j\ge 0$.

[12] a function $\Lambda:\mathbb R\times\mathcal B(\mathbb R)\to[0, 1]$ is called a Markov (transition) kernel, if $\Lambda(x, B)$ is a Borel measurable function for each $B\in\mathcal B(\mathbb R)$ and is a probability measure on $\mathcal B(\mathbb R)$ for each fixed $x\in\mathbb R$.
[13] a family $F_j$ of increasing σ-algebras is called a filtration
[14] by convention $F^Y_{-1} = \{\emptyset, \Omega\}$
Theorem 3.9. Let $\pi_j(dx)$ be the solution of the recursive equation
$$
\pi_j(dx) = \frac{\gamma\big(x, Y_j(\omega)\big)\int_{\mathbb R}\Lambda(u, dx)\,\pi_{j-1}(du)}{\int_{\mathbb R}\int_{\mathbb R}\gamma\big(x, Y_j(\omega)\big)\Lambda(u, dx)\,\pi_{j-1}(du)},\quad j\ge 1, \qquad (3.14)
$$
subject to
$$
\pi_0(dx) = \frac{\gamma\big(x, Y_0(\omega)\big)\,p(dx)}{\int_{\mathbb R}\gamma\big(u, Y_0(\omega)\big)\,p(du)}. \qquad (3.15)
$$
Then
$$
E\big(f(X_j)\,|\,F^Y_j\big) = \int_{\mathbb R} f(x)\,\pi_j(dx),\quad P\text{-a.s.} \qquad (3.16)
$$
Proof. Due to the assumptions on Y, the regular conditional measure for the vector $(Y_0, \dots, Y_j)$, given the filtration $F^X_j = \sigma\{X_0, \dots, X_j\}$, is given by
$$
P\big(Y_0\in A_0, \dots, Y_j\in A_j\,|\,F^X_j\big) = \int_{A_0}\cdots\int_{A_j}\gamma(X_0, u_0)\cdots\gamma(X_j, u_j)\,\nu(du_0)\cdots\nu(du_j). \qquad (3.17)
$$
Then by Remark 3.7
$$
E\big(\varphi(X_j)\,|\,F^Y_j\big) = \frac{\widetilde E\,\varphi(X_j(\widetilde\omega))\prod_{i=0}^j\gamma\big(X_i(\widetilde\omega), Y_i\big)}{\widetilde E\prod_{i=0}^j\gamma\big(X_i(\widetilde\omega), Y_i\big)}. \qquad (3.18)
$$
Introduce the notation
$$
L_j(X(\widetilde\omega), Y) = \prod_{i=0}^j\gamma\big(X_i(\widetilde\omega), Y_i\big) \qquad (3.19)
$$
and note that
$$
\widetilde E\Big(\varphi(X_j(\widetilde\omega))L_j(X(\widetilde\omega), Y)\,\Big|\,\widetilde F^X_{j-1}\Big)
= L_{j-1}(X(\widetilde\omega), Y)\,\widetilde E\Big(\varphi(X_j(\widetilde\omega))\gamma(X_j(\widetilde\omega), Y_j)\,\Big|\,\widetilde F^X_{j-1}\Big)
= L_{j-1}(X(\widetilde\omega), Y)\int_{\mathbb R}\varphi(u)\gamma(u, Y_j)\Lambda\big(X_{j-1}(\widetilde\omega), du\big).
$$
Then
$$
\begin{aligned}
E\big(\varphi(X_j)\,|\,F^Y_j\big) &= \frac{\widetilde E\,\varphi(X_j(\widetilde\omega))L_j(X(\widetilde\omega), Y)}{\widetilde E\,L_j(X(\widetilde\omega), Y)}
= \frac{\widetilde E\,L_{j-1}(X(\widetilde\omega), Y)\int_{\mathbb R}\varphi(u)\gamma(u, Y_j)\Lambda(X_{j-1}(\widetilde\omega), du)}{\widetilde E\,L_{j-1}(X(\widetilde\omega), Y)\int_{\mathbb R}\gamma(u, Y_j)\Lambda(X_{j-1}(\widetilde\omega), du)}\\[4pt]
&= \frac{\widetilde E\,L_{j-1}(X(\widetilde\omega), Y)\int_{\mathbb R}\varphi(u)\gamma(u, Y_j)\Lambda(X_{j-1}(\widetilde\omega), du)\,\big/\,\widetilde E\,L_{j-1}(X(\widetilde\omega), Y)}{\widetilde E\,L_{j-1}(X(\widetilde\omega), Y)\int_{\mathbb R}\gamma(u, Y_j)\Lambda(X_{j-1}(\widetilde\omega), du)\,\big/\,\widetilde E\,L_{j-1}(X(\widetilde\omega), Y)}\\[4pt]
&= \frac{E\Big(\int_{\mathbb R}\varphi(u)\gamma(u, Y_j)\Lambda(X_{j-1}, du)\,\Big|\,F^Y_{j-1}\Big)}{E\Big(\int_{\mathbb R}\gamma(u, Y_j)\Lambda(X_{j-1}, du)\,\Big|\,F^Y_{j-1}\Big)}.
\end{aligned}
$$
Now let $\pi_j(dx)$ be the regular conditional distribution of $X_j$, given $F^Y_j$. Then the latter reads (again the Fubini theorem is used)
$$
\int_{\mathbb R}\varphi(x)\pi_j(dx) = E\big(\varphi(X_j)\,|\,F^Y_j\big) = \frac{\int_{\mathbb R}\varphi(u)\int_{\mathbb R}\gamma(u, Y_j)\Lambda(x, du)\,\pi_{j-1}(dx)}{\int_{\mathbb R}\int_{\mathbb R}\gamma(u, Y_j)\Lambda(x, du)\,\pi_{j-1}(dx)}
$$
and by arbitrariness of φ, (3.14) follows. The equation (3.15) is obtained similarly. $\square$
Remark 3.10. The proof may seem unnecessarily complicated at first glance: in fact, a simpler and probably more intuitive derivation is possible (see Exercise 10). This (and an additional derivation in the next section) is given for two reasons: (1) to exercise the properties and notations related to conditional expectations and (2) to demonstrate the technique, which will be very useful when working in the continuous time case.
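The recursion (3.14) can be approximated numerically by discretizing the state space. The sketch below (an illustration, not part of the text) uses an ad hoc scalar model of my own choosing, $X_j = 0.9X_{j-1} + \varepsilon_j$ observed as $Y_j = X_j + \xi_j$ with standard Gaussian noises, and propagates the conditional density on a fixed grid, normalizing at every step as in (3.14).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-6.0, 6.0, 241)                  # grid approximating the state space R

def gauss(z, s=1.0):
    return np.exp(-z**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)

# discretized transition kernel Lambda(u, dx) of the toy model X_j = 0.9 X_{j-1} + eps_j
Lam = gauss(x[None, :] - 0.9 * x[:, None])
Lam /= Lam.sum(axis=1, keepdims=True)            # each row becomes a probability vector

# simulate a trajectory and its observations Y_j = X_j + xi_j
X, Y = 0.0, []
for _ in range(50):
    X = 0.9 * X + rng.standard_normal()
    Y.append(X + rng.standard_normal())

pi = gauss(x)                                    # pi_0 taken as the N(0,1) prior on the grid
pi /= pi.sum()
for y in Y:
    pred = Lam.T @ pi                            # prediction: integrate Lambda against pi_{j-1}
    pi = gauss(y - x) * pred                     # correction by the observation density gamma(x, Y_j)
    pi /= pi.sum()                               # normalization, as in (3.14)
print("final state:", X, " filtered mean:", float(np.sum(x * pi)))
```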
3. The nonlinear filter by the reference measure approach

Before proceeding to discuss the properties of (3.14), we give another proof of it, using the so called reference measure approach. This powerful and elegant method requires stronger assumptions on (X, Y), but gives an additional insight into the structure of (3.14) and turns out to be very efficient in the continuous time setup. It is based on the following simple fact.

Lemma 3.11. Let (Ω, F) be a measurable space and let P and $\widetilde P$ be equivalent probability measures on F, i.e. $P\sim\widetilde P$. Denote by $E(\cdot|G)$ and $\widetilde E(\cdot|G)$ the conditional expectations with respect to $G\subseteq F$ under P and $\widetilde P$. Then for any X with $E|X| < \infty$
$$
E\big(X\,|\,G\big) = \frac{\widetilde E\Big(X\dfrac{dP}{d\widetilde P}(\omega)\,\Big|\,G\Big)}{\widetilde E\Big(\dfrac{dP}{d\widetilde P}(\omega)\,\Big|\,G\Big)}. \qquad (3.20)
$$
Proof. Note first that the right hand side of (3.20) is well defined (on a set of full P-probability^15), since
$$
P\Big(\widetilde E\Big(\frac{dP}{d\widetilde P}(\omega)\,\Big|\,G\Big) = 0\Big)
= \widetilde E\,1\Big\{\widetilde E\Big(\frac{dP}{d\widetilde P}(\omega)\,\Big|\,G\Big) = 0\Big\}\frac{dP}{d\widetilde P}(\omega)
= \widetilde E\,1\Big\{\widetilde E\Big(\frac{dP}{d\widetilde P}(\omega)\,\Big|\,G\Big) = 0\Big\}\widetilde E\Big(\frac{dP}{d\widetilde P}(\omega)\,\Big|\,G\Big) = 0.
$$
Clearly the right hand side of (3.20) is G-measurable and for any $A\in G$
$$
\begin{aligned}
E\Bigg(X - \frac{\widetilde E\big(X\frac{dP}{d\widetilde P}(\omega)\,|\,G\big)}{\widetilde E\big(\frac{dP}{d\widetilde P}(\omega)\,|\,G\big)}\Bigg)1_A(\omega)
&= \widetilde E\Bigg(X - \frac{\widetilde E\big(X\frac{dP}{d\widetilde P}(\omega)\,|\,G\big)}{\widetilde E\big(\frac{dP}{d\widetilde P}(\omega)\,|\,G\big)}\Bigg)1_A(\omega)\frac{dP}{d\widetilde P}(\omega)\\[4pt]
&= \widetilde E\,X\frac{dP}{d\widetilde P}(\omega)1_A - \widetilde E\,\frac{\widetilde E\big(X\frac{dP}{d\widetilde P}(\omega)\,|\,G\big)}{\widetilde E\big(\frac{dP}{d\widetilde P}(\omega)\,|\,G\big)}\,1_A\,\widetilde E\Big(\frac{dP}{d\widetilde P}(\omega)\,\Big|\,G\Big) = 0,
\end{aligned}
$$
which verifies the claim. $\square$

[15] and thus also of full $\widetilde P$-probability
This lemma suggests the following way of calculating conditional probabilities: find a reference measure $\widetilde P$, equivalent to P, under which calculation of the conditional expectation would be easier (typically, $\widetilde P$ is chosen so that X is independent of G) and use (3.20).

Assume the following structure for the observation process^16 (all the other assumptions remain the same):

* $Y_j = h(X_j) + \xi_j$, where h is a measurable function $\mathbb R\to\mathbb R$ and $\xi = (\xi_j)_{j\ge 0}$ is an i.i.d. sequence, independent of X, such that $\xi_1$ has a positive density $q(u) > 0$ with respect to the Lebesgue measure:
$$
P(\xi_1\le u) = \int_{-\infty}^u q(s)ds.
$$
Let us verify the claim of Theorem 3.9 under this assumption. For a fixed j, let $F_j = F^X_j\vee F^Y_j$ (or equivalently $F_j = F^X_j\vee F^\xi_j$). Introduce the (positive) random process
$$
\zeta_j(X, Y) := \prod_{i=0}^j \frac{q(Y_i)}{q\big(Y_i - h(X_i)\big)} \qquad (3.21)
$$
and define the probability measure $\widetilde P$ (on $F_j$) by means of the Radon-Nikodym derivative
$$
\frac{d\widetilde P}{dP}(\omega) = \zeta_j\big(X(\omega), Y(\omega)\big)
$$
with respect to the restriction of P on $F_j$. $\widetilde P$ is indeed a probability measure, since $\zeta_j$ is positive and
$$
\begin{aligned}
\widetilde P(\Omega) = E\,\zeta_j(X, Y) &= E\prod_{i=0}^j\frac{q(Y_i)}{q\big(Y_i - h(X_i)\big)} = E\prod_{i=0}^j\frac{q\big(h(X_i) + \xi_i\big)}{q(\xi_i)}\\
&= E\int_{\mathbb R}\dots\int_{\mathbb R}\prod_{i=0}^j\frac{q\big(h(X_i) + u_i\big)}{q(u_i)}\prod_{\ell=0}^j q(u_\ell)\,du_0\dots du_j\\
&= E\int_{\mathbb R}\dots\int_{\mathbb R}\prod_{i=0}^j q\big(h(X_i) + u_i\big)\,du_0\dots du_j
= E\prod_{i=0}^j\int_{\mathbb R} q\big(h(X_i) + u_i\big)\,du_i = 1.
\end{aligned}
$$
Under the measure $\widetilde P$, the random processes (X, Y) look absolutely different:

(i) the distribution of the process^17 Y under $\widetilde P$ coincides with the distribution of ξ under P;
(ii) the distribution of the process X is the same under both measures P and $\widetilde P$;
(iii) the processes X and Y are independent under $\widetilde P$.

[16] greater generality is possible with the reference measure approach, but is sacrificed here for the sake of clarity
[17] of course the restriction of Y to [0, j] is meant here
Let $\alpha(x_0, \dots, x_j)$ and $\beta(x_0, \dots, x_j)$ be measurable bounded $\mathbb R^{j+1}\to\mathbb R$ functions. Then
$$
\begin{aligned}
\widetilde E\,\alpha(X_0, \dots, X_j)\beta(Y_0, \dots, Y_j) &= E\,\alpha(X_0, \dots, X_j)\beta(Y_0, \dots, Y_j)\zeta_j(X, Y)\\
&= E\,\alpha(X_0, \dots, X_j)\beta(Y_0, \dots, Y_j)\prod_{i=0}^j\frac{q(Y_i)}{q\big(Y_i - h(X_i)\big)}\\
&= E\,\alpha(X_0, \dots, X_j)\beta\big(h(X_0) + \xi_0, \dots, h(X_j) + \xi_j\big)\prod_{i=0}^j\frac{q\big(h(X_i) + \xi_i\big)}{q(\xi_i)}\\
&= E\,\alpha(X_0, \dots, X_j)\int_{\mathbb R}\dots\int_{\mathbb R}\beta\big(h(X_0) + u_0, \dots, h(X_j) + u_j\big)\prod_{i=0}^j\frac{q\big(h(X_i) + u_i\big)}{q(u_i)}\prod_{\ell=0}^j q(u_\ell)\,du_0\dots du_j\\
&= E\,\alpha(X_0, \dots, X_j)\int_{\mathbb R}\dots\int_{\mathbb R}\beta\big(h(X_0) + u_0, \dots, h(X_j) + u_j\big)\prod_{i=0}^j q\big(h(X_i) + u_i\big)\,du_0\dots du_j\\
&= E\,\alpha(X_0, \dots, X_j)\int_{\mathbb R}\dots\int_{\mathbb R}\beta\big(u_0, \dots, u_j\big)\prod_{i=0}^j q(u_i)\,du_0\dots du_j\\
&= E\,\alpha(X_0, \dots, X_j)\,E\,\beta\big(\xi_0, \dots, \xi_j\big).
\end{aligned}
$$
Now the claim (i) holds by arbitrariness of β with $\alpha\equiv 1$. Similarly (ii) holds by arbitrariness of α with $\beta\equiv 1$. Finally, if (i) and (ii) hold, then
$$
\widetilde E\,\alpha(X_0, \dots, X_j)\beta(Y_0, \dots, Y_j) = E\,\alpha(X_0, \dots, X_j)\,E\,\beta\big(\xi_0, \dots, \xi_j\big) = \widetilde E\,\alpha(X_0, \dots, X_j)\,\widetilde E\,\beta\big(Y_0, \dots, Y_j\big),
$$
which is nothing but (iii) by arbitrariness of α and β.
Now by Lemma 3.11 for any bounded function g,
$$
E\big(g(X_j)\,|\,F^Y_j\big) = \frac{\widetilde E\big(g(X_j)\zeta_j^{-1}(X, Y)\,|\,F^Y_j\big)}{\widetilde E\big(\zeta_j^{-1}(X, Y)\,|\,F^Y_j\big)} = \frac{\widetilde E\,g(X_j(\widetilde\omega))\,\zeta_j^{-1}\big(X(\widetilde\omega), Y(\omega)\big)}{\widetilde E\,\zeta_j^{-1}\big(X(\widetilde\omega), Y(\omega)\big)} \qquad (3.22)
$$
where $\frac{dP}{d\widetilde P}(\omega) = \zeta_j^{-1}(X, Y)$. The latter equality is due to independence of X and Y under $\widetilde P$ (the notations of Remark 3.6 are used here).

Now for an arbitrary (measurable and bounded) function g
$$
\widetilde E\,g(X_j(\widetilde\omega))\,\zeta_j^{-1}\big(X(\widetilde\omega), Y(\omega)\big)
= \widetilde E\,g(X_j(\widetilde\omega))\,\widetilde E\big(\zeta_j^{-1}(X(\widetilde\omega), Y(\omega))\,\big|\,X_j\big)
= \int_{\mathbb R} g(u)\,\widetilde E\big(\zeta_j^{-1}(X(\widetilde\omega), Y(\omega))\,\big|\,X_j = u\big)P_{X_j}(du) := \int_{\mathbb R} g(u)\rho_j(du).
$$
On the other hand
$$
\begin{aligned}
\widetilde E\,g(X_j(\widetilde\omega))\,\zeta_j^{-1}\big(X(\widetilde\omega), Y(\omega)\big)
&= \widetilde E\,\zeta_{j-1}^{-1}\big(X(\widetilde\omega), Y\big)\,\widetilde E\Big(g(X_j(\widetilde\omega))\frac{q\big(Y_j - h(X_j(\widetilde\omega))\big)}{q(Y_j)}\,\Big|\,\widetilde F^X_{j-1}\Big)\\
&= \widetilde E\,\zeta_{j-1}^{-1}\big(X(\widetilde\omega), Y\big)\int_{\mathbb R} g(u)\frac{q\big(Y_j - h(u)\big)}{q(Y_j)}\Lambda\big(X_{j-1}(\widetilde\omega), du\big)\\
&= \int_{\mathbb R} g(u)\int_{\mathbb R}\frac{q\big(Y_j - h(u)\big)}{q(Y_j)}\Lambda(s, du)\,\rho_{j-1}(ds).
\end{aligned}
$$
By arbitrariness of g, the recursion
$$
\rho_j(du) = \int_{\mathbb R}\frac{q\big(Y_j - h(u)\big)}{q(Y_j)}\Lambda(s, du)\,\rho_{j-1}(ds) \qquad (3.23)
$$
is obtained. Finally by (3.22)
$$
E\big(g(X_j)\,|\,F^Y_j\big) = \frac{\int_{\mathbb R} g(u)\rho_j(du)}{\int_{\mathbb R}\rho_j(du)}
$$
and hence the conditional distribution $\pi_j(du)$ from Theorem 3.9 can be calculated by normalizing:
$$
\pi_j(du) = \frac{\rho_j(du)}{\int_{\mathbb R}\rho_j(ds)}. \qquad (3.24)
$$
Besides verifying (3.14), the latter suggests that $\pi_j(du)$ can be calculated by solving the linear (!) equation (3.23), whose solution $\rho_j(du)$ (which is called the unnormalized conditional distribution) is to be normalized at the final time j. In fact this remarkable property can be guessed directly from (3.14) (under more general assumptions on Y).
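The remark above is easy to check numerically. The sketch below (a toy finite-state example of my own choosing, not from the text) propagates the unnormalized measure by the linear recursion (3.23), dropping the constant factor $1/q(Y_j)$ since it cancels on normalization, and compares the result with the per-step normalized recursion.

```python
import numpy as np

Lam = np.array([[0.9, 0.1], [0.2, 0.8]])                 # transition probabilities Lambda
h = np.array([-1.0, 1.0])                                 # observation function h on the two states
q = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)      # density of the observation noise xi

Y = [0.3, -1.1, 0.7, 2.0, -0.4]                            # some fixed observations
rho = np.array([0.5, 0.5])                                 # rho_0 = p_0
pi = rho.copy()
for y in Y:
    like = q(y - h)                                        # q(Y_j - h(u)) for every state u
    rho = like * (Lam.T @ rho)                             # unnormalized: linear in rho, cf. (3.23)
    pi = like * (Lam.T @ pi)
    pi /= pi.sum()                                         # normalized at every step, cf. (3.14)

print(rho / rho.sum())                                     # normalize rho once, at the end
print(pi)                                                  # the two vectors coincide
```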
4. The curse of dimensionality and finite dimensional filters

The equation (3.14) (or its unnormalized counterpart (3.23)) is not a very practical solution of the estimation problem: at each step it requires at least two integrations! Clearly the following property would be very desirable.

Definition 3.12. The filter is called finite dimensional with respect to a function f, if the right hand side of (3.16) can be parameterized by a finite number of sufficient statistics, i.e. solutions of real valued difference equations, driven by Y.

The evolution of $\pi_j$ can be infinite-dimensional, while the integral of $\pi_j$ against a specific function f may admit a finite dimensional filter (see Exercise 21). Unfortunately there is no easy way to determine whether the nonlinear filter at hand is finite dimensional. Moreover, sometimes it can be proved to be infinite dimensional. In fact few finite dimensional filters are known, the most important of which are described in the following sections.
4.1. The Hidden Markov Models (HMM). Suppose that $X_j$ is a Markov chain with a finite state space $S = \{a_1, \dots, a_d\}$. Then its Markov kernel is identified^18 with the matrix of transition probabilities $\lambda_{\ell m} = P(X_j = a_m\,|\,X_{j-1} = a_\ell)$. Let $p_0$ be the initial distribution of X, i.e. $p_0(\ell) = P(X_0 = a_\ell)$. Suppose that the observation sequence $Y = (Y_j)_{j\ge 1}$ satisfies
$$
P(Y_j\in A\,|\,F^X_j\vee F^Y_{j-1}) = \int_A \Gamma_\ell(du)\quad\text{on the set } \{X_j = a_\ell\},\quad \ell = 1, \dots, d.
$$
Note that each $\Gamma_\ell(du)$ is absolutely continuous with respect to the measure $\nu(du) = \sum_{m=1}^d\Gamma_m(du)$ and so no generality is lost if $\Gamma_\ell(du) = f_\ell(u)\nu(du)$ is assumed for some fixed σ-finite measure ν on $\mathcal B(\mathbb R)$ and densities $f_\ell(u)$. This statistical model is extremely popular in various areas of engineering (see [7] for a recent survey). Clearly the conditional distribution $\pi_j(dx)$ is absolutely continuous with respect to the point measure with atoms at $a_1, \dots, a_d$ and so can be identified with its density $\pi_j$, which is just the vector of conditional probabilities $P(X_j = a_\ell\,|\,F^Y_j)$, $\ell = 1, \dots, d$. Then by the formula (3.14),
$$
\pi_j = \frac{D(Y_j)\Lambda^*\pi_{j-1}}{\big|D(Y_j)\Lambda^*\pi_{j-1}\big|}, \qquad (3.25)
$$
subject to $\pi_0 = p_0$, where $|x| = \sum_{\ell=1}^d|x_\ell|$ is the $\ell_1$ norm of a vector $x\in\mathbb R^d$ and $D(y)$ is the diagonal matrix with $f_\ell(y)$, $y\in\mathbb R$, $\ell = 1, \dots, d$ on the diagonal. Alternatively the unnormalized equation can be solved,
$$
\rho_j = D(Y_j)\Lambda^*\rho_{j-1},\quad j\ge 1,
$$
subject to $\rho_0 = p_0$, and then $\pi_j$ is recovered by normalizing $\pi_j = \rho_j/|\rho_j|$. Finite dimensional filters are known for several filtering problems related to HMM - see Exercise 21.

[18] In this case the Markov kernel is absolutely continuous w.r.t. the point measure $\sum_{i=1}^d\delta_{a_i}(du)$ and the matrix Λ is formally the density w.r.t. this measure.
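A short simulation sketch of (3.25) (an illustration with values of my own choosing; Gaussian observation noise stands in for the densities $f_\ell$): the chain is simulated, the noisy samples are generated, and the filter is run recursively with an $\ell_1$ normalization at every step.

```python
import numpy as np

rng = np.random.default_rng(3)
a = np.array([-1.0, 0.0, 2.0])                      # state space S
Lam = np.array([[0.8, 0.1, 0.1],
                [0.1, 0.8, 0.1],
                [0.1, 0.1, 0.8]])                   # lambda_{lm} = P(X_j = a_m | X_{j-1} = a_l)
p0 = np.array([1/3, 1/3, 1/3])
sigma = 0.5
f = lambda y, al: np.exp(-(y - al)**2 / (2 * sigma**2))   # f_l(y) up to a common constant

x = rng.choice(3, p=p0)
pi = p0.copy()
hits = 0
for _ in range(200):
    x = rng.choice(3, p=Lam[x])                     # next state of the chain
    y = a[x] + sigma * rng.standard_normal()        # noisy observation Y_j = h(X_j) + noise
    pi = f(y, a) * (Lam.T @ pi)                     # D(Y_j) Lambda^* pi_{j-1}
    pi /= pi.sum()                                  # l1 normalization as in (3.25)
    hits += (pi.argmax() == x)
print("MAP detector accuracy:", hits / 200)
```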
4.2. The linear Gaussian case: Kalman-Bucy filter revisited. The Kalman-Bucy filter from Chapter 2 has a very special place among the nonlinear filters due to the properties of Gaussian random vectors. Recall that

Definition 3.13. A random vector X, with values in $\mathbb R^d$, is Gaussian if
$$
E\exp\big(i\lambda^*X\big) = \exp\Big(i\lambda^*m - \tfrac 12\lambda^*K\lambda\Big),\quad \lambda\in\mathbb R^d,
$$
for a vector m and a nonnegative definite matrix K.

Remark 3.14. It is easy to check that $m = EX$ and $K = \mathrm{cov}(X)$.

It turns out that if the characteristic function of a random vector is an exponential of a quadratic form, this vector is necessarily Gaussian. Gaussian vectors (processes) play a special role in probability theory. The following properties make them special in the filtering theory in particular:

Lemma 3.15. Assume that the vectors X and Y (with values in $\mathbb R^m$ and $\mathbb R^n$ respectively) form a Gaussian vector (X, Y) in $\mathbb R^{m+n}$. Then
(1) Any random variable from the linear subspace spanned by the entries of (X, Y) is Gaussian. In particular $Z = b + AX$ with a vector b and a matrix A is a Gaussian vector with $EZ = b + AEX$ and $\mathrm{cov}(Z) = A\,\mathrm{cov}(X)A^*$.
(2) If X and Y are orthogonal, they are independent (the opposite direction is obvious).
(3) The regular conditional distribution of X, given Y, is Gaussian P-a.s.; moreover^19 $E(X|Y) = \widehat E(X|Y)$ and
$$
\mathrm{cov}(X|Y) := E\Big[\big(X - E(X|Y)\big)\big(X - E(X|Y)\big)^*\,\Big|\,Y\Big] = \mathrm{cov}(X) - \mathrm{cov}(X, Y)\,\mathrm{cov}^{\oplus}(Y)\,\mathrm{cov}(Y, X), \qquad (3.26)
$$
where $\oplus$ denotes the pseudoinverse.

Remark 3.16. Note that in the Gaussian case the conditional covariance does not depend on the condition!

Proof. For fixed b and A
$$
E\exp\big(i\lambda^*(b + AX)\big) = \exp\big(i\lambda^*(b + AEX)\big)E\exp\big(i(\lambda^*A)(X - EX)\big)
= \exp\big(i\lambda^*(b + AEX)\big)\exp\Big(-\tfrac 12\lambda^*\big(A\,\mathrm{cov}(X)A^*\big)\lambda\Big),
$$
and the claim (1) holds, since the latter is a characteristic function of a Gaussian vector.

Let $\lambda_x$ and $\lambda_y$ be vectors from $\mathbb R^m$ and $\mathbb R^n$ (so that $\lambda = (\lambda_x, \lambda_y)\in\mathbb R^{m+n}$); then due to orthogonality $\mathrm{cov}(X, Y) = 0$ and
$$
E\exp\big(i\lambda^*(X, Y)\big) = \exp\Big(i\lambda_x^*EX - \tfrac 12\lambda_x^*\mathrm{cov}(X)\lambda_x\Big)\exp\Big(i\lambda_y^*EY - \tfrac 12\lambda_y^*\mathrm{cov}(Y)\lambda_y\Big),
$$
which verifies the second claim.

Recall that $X - \widehat E(X|Y)$ is orthogonal to Y, and thus by (2), they are also independent. Then
$$
E\Big(\exp\Big(i\lambda_x^*\big(X - \widehat E(X|Y)\big)\Big)\,\Big|\,Y\Big) = E\exp\Big(i\lambda_x^*\big(X - \widehat E(X|Y)\big)\Big)
$$
and on the other hand
$$
E\Big(\exp\Big(i\lambda_x^*\big(X - \widehat E(X|Y)\big)\Big)\,\Big|\,Y\Big) = \exp\Big(-i\lambda_x^*\widehat E(X|Y)\Big)E\Big(\exp\big(i\lambda_x^*X\big)\,\Big|\,Y\Big),
$$
and so
$$
E\Big(\exp\big(i\lambda_x^*X\big)\,\Big|\,Y\Big) = \exp\Big(i\lambda_x^*\widehat E(X|Y)\Big)E\exp\Big(i\lambda_x^*\big(X - \widehat E(X|Y)\big)\Big).
$$
Since $X - \widehat E(X|Y)$ is in the linear span of (X, Y), the latter term equals
$$
\exp\Big(i\lambda_x^*E\big(X - \widehat E(X|Y)\big) - \tfrac 12\lambda_x^*\mathrm{cov}\big(X - \widehat E(X|Y)\big)\lambda_x\Big),
$$
and the third claim follows, since $E\big(X - \widehat E(X|Y)\big) = 0$ and $\mathrm{cov}\big(X - \widehat E(X|Y)\big)$ equals (3.26). $\square$
Consider now the Kalman-Bucy linear model (2.13) and (2.14) (on page 29), where the noise sequences are Gaussian, as well as the initial condition $(X_0, Y_0)$. Then the processes (X, Y) are Gaussian (i.e. any finite dimensional distribution is Gaussian) and by Lemma 3.15 the conditional distribution of $X_j$ given $F^Y_j$ is Gaussian too. Moreover its parameters, the mean and the covariance, are governed by the Kalman-Bucy filter equations from Theorem 2.5.

Remark 3.17. The recursions of Theorem 2.5 can be obtained via the nonlinear filtering equation (3.14), using certain properties of the Gaussian densities. Note however that guessing the Gaussian solution to (3.14) would not be easy!

In particular, for any measurable f such that $E|f(X_j)| < \infty$ (the scalar case is considered for simplicity),
$$
E\big(f(X_j)\,|\,F^Y_j\big) = \int_{\mathbb R} f(u)\frac{1}{\sqrt{2\pi P_j}}\exp\Big(-\frac{(u - \widehat X_j)^2}{2P_j}\Big)du,
$$
where $P_j$ and $\widehat X_j$ are generated by the Kalman-Bucy equations. In Exercise 24 an important generalization of the Kalman-Bucy filter is considered. More models for which a finite dimensional filter exists are known, but their practical applicability is usually limited.

[19] in other notations, $E(X|F^Y) = \widehat E(X|L^Y)$
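A numerical illustration of Lemma 3.15(3) (the covariance values are my own choice): for a jointly Gaussian (X, Y) with nonsingular cov(Y), the conditional mean is the affine function $EX + \mathrm{cov}(X,Y)\mathrm{cov}(Y)^{-1}(Y - EY)$, the residual is orthogonal to Y, and the conditional variance (3.26) does not depend on the observed value of Y.

```python
import numpy as np

rng = np.random.default_rng(4)
K = np.array([[2.0, 0.8, 0.3],
              [0.8, 1.0, 0.2],
              [0.3, 0.2, 1.5]])          # joint covariance of (X, Y1, Y2), zero means
L = np.linalg.cholesky(K)
Z = rng.standard_normal((500_000, 3)) @ L.T
X, Y = Z[:, 0], Z[:, 1:]

Kxy, Kyy = K[0, 1:], K[1:, 1:]
gain = Kxy @ np.linalg.inv(Kyy)
cond_var = K[0, 0] - gain @ Kxy          # formula (3.26), scalar case

resid = X - Y @ gain                     # X - E(X|Y), since EX = EY = 0
print(cond_var, resid.var())             # the two numbers should agree
print(np.corrcoef(resid, Y[:, 0])[0, 1]) # orthogonality of the residual to Y
```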
Exercises

(1) Verify the properties of the conditional expectations on page 40.
(2) Prove that the pre-images of Borel sets of R under a measurable function (random variable) form a σ-algebra.
(3) Prove (3.6) (use the monotone convergence theorem - see Appendix).
(4) Obtain the formula (3.3) by means of (3.11).
(5) Verify the claim of Remark 3.7.
(6) Explore the definition of the Markov process on page 43: argue the existence, etc. How can such a process be generated, given say a source of i.i.d. random variables with uniform distribution?
(7) Is Y, defined in (3.13), a Markov process? Is the pair $(X_j, Y_j)$ a (two dimensional) Markov process?
(8) Show that $P\big(\widetilde E L_j(X(\widetilde\omega), Y) = 0\big) = 0$ ($L_j(X, Y)$ is defined in (3.19)).
(9) Complete the proof of Theorem 3.9 (i.e. verify (3.15)).
(10) Derive (3.14) and (3.15), using the orthogonality property of the conditional expectation (similarly to the derivation of (3.3)).
(11) Show that (3.23) and (3.24) imply (3.14).
(12) Derive the nonlinear filtering equations when Y is defined with delay:
$$
P(Y_j\in B\,|\,F^X_{j-1}\vee F^Y_{j-1}) = \int_B \Gamma(X_{j-1}; du),\quad P\text{-a.s.}
$$
(13) Discuss the changes which have to be introduced into (3.14), when X and Y take values in $\mathbb R^m$ and $\mathbb R^n$ respectively (the multivariate case).
(14) Discuss the changes which have to be introduced into (3.14), when the Markov kernels Λ and Γ are allowed to depend on j (time dependent case) and on $F^Y_{j-1}$ (dependence on the past observations).
(15) Show that if the transition matrix Λ of the finite state chain X is q-primitive, i.e. the matrix $\Lambda^q$ has all positive entries for some integer $q\ge 1$, then the limits $\lim_{j\to\infty}P\big(X_j = a_\ell\big) = \mu_\ell$ exist, are positive for all $a_\ell\in S$ and independent of the initial distribution (such a chain is called ergodic).
(16) Find the filtering recursion for the signal/observation model
$$
X_j = g(X_{j-1}) + \varepsilon_j,\quad j\ge 1,\qquad
Y_j = f(X_j) + \xi_j,
$$
subject to a random initial condition $X_0$ (and $Y_0\equiv 0$), independent of ε and ξ. Assume that $g:\mathbb R\to\mathbb R$ and $f:\mathbb R\to\mathbb R$ are measurable functions, such that $E|g(X_{j-1})| < \infty$ and $E|f(X_j)| < \infty$ for any $j\ge 0$. The sequences $\varepsilon = (\varepsilon_j)_{j\ge 1}$ and $\xi = (\xi_j)_{j\ge 1}$ are independent and i.i.d., such that $\varepsilon_1$ and $\xi_1$ have densities p(u) and q(u) with respect to the Lebesgue measure on $\mathcal B(\mathbb R)$.

(17) Let X be a Markov chain as in Section 4.1 and $Y_j = h(X_j) + \xi_j$, $j\ge 1$, where $\xi = (\xi_j)_{j\ge 0}$ is an i.i.d. sequence. Assume that $\xi_1$ has probability density f(u) (with respect to the Lebesgue measure). Write down the equations (3.25) in componentwise notation. Simulate the filter with MATLAB.

(18) Show that the filtering process $\pi_j$ from the previous problem is Markov.

(19) Under the setting of Section 4.1, denote by $\mathcal D_j$ the family of $F^Y_j$-measurable random variables with values in S (detectors which guess the current symbol of $X_j$, given the observation of $Y_1, \dots, Y_j$). For a random variable $\bar X_j\in\mathcal D_j$, let $P_d$ denote the detection error:
$$
P_d = P\big(\bar X_j\ne X_j\big).
$$
Show that the optimal detector, minimizing the detection error in the class $\mathcal D_j$, is given by
$$
\widehat X_j = \mathop{\mathrm{argmax}}_{a_\ell\in S}\ \pi_j(\ell).
$$
Find an (implicit) expression for the minimal detection error.

(20) A random switch $\theta_j\in\{0, 1\}$, $j\ge 0$, is a discrete-time two-state Markov chain with transition matrix
$$
\Lambda = \begin{pmatrix}\lambda_1 & 1-\lambda_1\\ 1-\lambda_2 & \lambda_2\end{pmatrix}.
$$
Assume that $\theta_0 = 1$. A counter $N_j$ counts arrivals (of e.g. particles) from two independent sources with different intensities λ and μ. The counter is connected, according to the state of the switch $\theta_j$, to one source or the other, so that
$$
N_j = N_{j-1} + 1(\theta_j = 1)\varepsilon_j + 1(\theta_j = 0)\zeta_j,\quad j = 1, 2, \dots,
$$
subject to $N_0 = 0$. Here λ and μ are constants from the interval (0, 1) and $\varepsilon_j, \zeta_j\in\{0, 1\}$ are independent i.i.d. sequences with $P(\varepsilon_j = 1) = \lambda$ and $P(\zeta_j = 1) = \mu$.
(a) Find the optimal estimate of the switch state, given the counter data up to the current moment, i.e. derive the recursion for $\pi_j = E(\theta_j\,|\,F^N_j)$.
(b) Study the behavior of the filter in the limit cases:
(i) $\lambda = 1$ and $\mu = 0$ (simultaneously);
(ii) $\lambda_1 = 1$ and $\lambda_2 = 0$ (and vice versa);
(iii) $\lambda_1 = \lambda_2 = 1$.
(21) Let $\Theta_j$ be the number of times a finite state Markov chain X visited (occupied) the state $a_1$ (or any other fixed state) up to time j. Find the recursion for calculation of the optimal estimate of the occupation time $E(\Theta_j\,|\,F^Y_j)$, where Y is defined as in Section 4.1.
(a) Let $I_j$ be the vector of indicators $1_{\{X_j = a_i\}}$, $i = 1, \dots, d$, and define $Z_j := \Theta_j I_j$. Find the expression for $\widehat Z_{j|j-1} := E(Z_j\,|\,F^Y_{j-1})$ in terms of $\widehat Z_{j-1} = E(Z_{j-1}\,|\,F^Y_{j-1})$ and $\pi_{j|j-1} = \Lambda^*\pi_{j-1}$.
(b) Find the expression of $\widehat Z_j$ in terms of $\widehat Z_{j|j-1}$ and thus close the recursion for $\widehat Z_j$.
(c) How is $E(\Theta_j\,|\,F^Y_j)$ recovered from $\widehat Z_j$?

(22) Let $\Psi_j$ be the number of transitions from the state $a_1$ to the state $a_2$ (or any other fixed pair of states) that a finite state Markov chain X made on the time interval [1, j]. Find the finite dimensional filter for $E(\Psi_j\,|\,F^Y_j)$. Hint: use the approach suggested in the previous problem.
(23) Check the claim of Remark 3.14.

(24) Consider the signal/observation model $(X_j, Y_j)_{j\ge 0}$:
$$
\begin{aligned}
X_j &= a_0\big(Y_0^{j-1}\big) + a_1\big(Y_0^{j-1}\big)X_{j-1} + b\,\varepsilon_j,\quad j = 1, 2, \dots\\
Y_j &= A_0\big(Y_0^{j-1}\big) + A_1\big(Y_0^{j-1}\big)X_{j-1} + B\,\xi_j,
\end{aligned}
$$
where b and B are constants and $A_i(Y_0^{j-1})$ and $a_i(Y_0^{j-1})$, $i = 0, 1$, are some functionals of the vector $Y_0, Y_1, \dots, Y_{j-1}$. $\varepsilon = (\varepsilon_j)_{j\ge 1}$ and $\xi = (\xi_j)_{j\ge 1}$ are independent i.i.d. standard Gaussian random sequences. The initial condition $(X_0, Y_0)$ is a standard Gaussian vector with unit covariance matrix, independent of ε and ξ.
(a) Is the pair of processes $(X_j, Y_j)_{j\ge 0}$ necessarily Gaussian? Give a proof or a counterexample.
(b) Find the recursion for $\widehat X_j = E(X_j\,|\,F^Y_j)$ and $P_j = E\big((X_j - \widehat X_j)^2\,|\,F^Y_j\big)$. Is the obtained filter linear w.r.t. the observations? Does the error $P_j$ depend on the observations?
Hint: prove first that $X_j$ is Gaussian, conditioned on $F^Y_j$.
Remark 3.18. The filtering recursion in this case is sometimes referred to as the conditionally Gaussian filter. It plays an important role in control theory, where the coefficients usually depend on the past observations.
(c) Verify that in the case $a_i(Y_0^{j-1})\equiv a_i$ and $A_i(Y_0^{j-1})\equiv A_i$, $i = 0, 1$ ($a_i$ and $A_i$ constants), your solution coincides with the Kalman-Bucy filter.
(25) Consider the recursion
$$
X_j = aX_{j-1} + \varepsilon_j,\quad j\ge 1,
$$
subject to a standard Gaussian random variable $X_0$, where ε is a Gaussian i.i.d. sequence, independent of $X_0$. Assuming that the parameter a is a Gaussian random variable independent of ε and $X_0$, derive a recursion for $E(a\,|\,F^X_j)$ and for the square error
$$
P_j = E\Big(\big(a - E(a\,|\,F^X_j)\big)^2\,\Big|\,F^X_j\Big).
$$
Is the recursion for $E(a\,|\,F^X_j)$ linear? Does $P_j$ converge? If yes, to which limit and in which sense? Hint: use the results of the previous exercise.
(26) Consider a signal/observation pair $(\theta, \xi_j)_{j\ge 1}$, where θ is a random variable distributed uniformly on [0, 1] and $(\xi_j)$ is a sequence generated by
$$
\xi_j = \theta U_j,
$$
where $(U_j)_{j\ge 1}$ is a sequence of i.i.d. random variables with uniform distribution on [0, 1]. θ and U are independent.
(a) Derive the Kalman-Bucy filter for $\widehat\theta_j = \widehat E(\theta\,|\,L^\xi_j)$.
(b) Find the corresponding mean square error $P_j = E(\theta - \widehat\theta_j)^2$. Show that it converges to zero as $j\to\infty$ and determine the rate of convergence^20.
(c) Consider the recursive filtering estimate $(\bar\theta_j)_{j\ge 0}$,
$$
\bar\theta_j = \max(\bar\theta_{j-1}, \xi_j),\quad \bar\theta_0 = 0.
$$
Find the corresponding mean square error $Q_j = E(\theta - \bar\theta_j)^2$.
(d) Show that $Q_j$ converges to zero and find the rate of convergence. Does this filter give better accuracy, compared to the Kalman-Bucy filter, uniformly in j? Asymptotically in j?
(e) Verify whether $\bar\theta_j$ is the optimal, in the mean square sense, filtering estimate. If not, find the optimal estimate $\widetilde\theta_j = E(\theta\,|\,F^\xi_j)$.

[20] i.e. find a sequence $r_j$, such that $\lim_{j\to\infty} r_jP_j$ exists and is positive.
CHAPTER 4
The white noise in continuous time

A close look at the derivation of the nonlinear filtering recursions reveals that one of the crucial assumptions is independence of the observation noise of the past. The model (3.13) is in fact a generalization of the following additive white noise observation scenario
$$
Y_j = h(X_j) + \xi_j,\quad j\ge 0, \qquad (4.1)
$$
where h is a measurable function and ξ is an i.i.d. sequence. As was mentioned in the introduction, the term "white noise" stems from the fact that the power spectral density of the sequence ξ (when $E\xi_1^2 < \infty$), defined as the Fourier transform of the correlation sequence $R(n) = E\xi_0\xi_n$, is constant. In the continuous time case a similar definition would be meaningless both for mathematical and physical reasons: the sample paths of such a process would be extremely irregular (e.g. not even continuous at any point) and its variance is infinite. It turns out that overcoming this difficulty is not an easy mathematical task. It is accomplished in several steps.

i. Introduce a continuous time process with independent increments. The motivation is that a formal derivative of such a process is a white noise (recall the discussion on page 10). It turns out that such a process can be constructed (the Wiener process), but it is not differentiable in any reasonable sense. At this point the hope for a "real" white noise is abandoned, and instead of considering problems involving differentials (e.g. differential equations, etc.), their integral analogues are considered.

ii. This naturally leads to considering integration with respect to the Wiener process. It turns out however that the Wiener process has irregular trajectories, so that all the classical integration approaches (e.g. Stieltjes, Lebesgue, etc.) fail. However, integration can be carried out if the family of integrands is chosen in a special way. Specifically we will use the stochastic integral introduced by K. Ito.

iii. After introducing the integral, one is led to establish the rules to manipulate the new object: e.g. the change of integration variable, chain rule, etc. Surprisingly (or not!) the Ito integral has properties dramatically different from the classical integration. The particularly useful tool in, what is called by now, the stochastic calculus, is the Ito formula.

iv. Once there is a new calculus, the ultimate goal is accomplished: the stochastic differential equations are introduced. The term "differential" is in fact misleading, though customary: actually the integral equations involving usual Riemann integrals and Ito integrals are considered. It turns out that besides strong solutions (roughly speaking, analogous to the usual solutions of ODEs), one may consider weak solutions, which have no analogue in classical ODEs. We will be concerned mainly with the first kind of solutions, though weak solutions play an important role in filtering in particular.
Remark 4.1. The introductory scope of these lectures doesn't include many important concepts and details from the vast theory of random processes in continuous time. The reader may and should consult basic books in this area for deeper understanding. The author's choice was and still is: the classic J. Doob's book [5] and the modern [39] for general concepts of stochastic processes in continuous time; the book [18] is a good starting point for further study of the Brownian motion and stochastic calculus; the first volume of [21] is a confined but very accessible coverage of stochastic Ito calculus and its applications (collected in the second volume).
1. The Wiener process

The main building block of the white noise theory is the Wiener process (or mathematical Brownian motion), which is defined (on some probability space (Ω, F, P)) as a stochastic process $W = (W_t(\omega))_{t\in\mathbb R_+}$, satisfying the properties
(1) $W_0(\omega) = 0$, P-a.s.
(2) the trajectories of W are continuous functions
(3) the increments of W are independent Gaussian random variables with zero mean and $E(W_t - W_s)^2 = t - s$, $t\ge s$.

1.1. Construction. The existence of such a process is not at all clear. There are many constructions of W (see e.g. [18]) of which we choose the one due to P. Levy (Section 2.3 in [18]).

Theorem 4.2. The Wiener process $W = (W_t)_{t\in[0,1]}$ exists.
Proof. Let I(n) denote the odd integers from $\{0, 1, \dots, 2^n\}$. Define the Haar functions as $H^0_1(t)\equiv 1$, $t\in[0, 1]$ and, for $n\ge 1$, $k\in I(n)$,
$$
H^n_k(t) =
\begin{cases}
2^{(n-1)/2}, & \frac{k-1}{2^n}\le t < \frac{k}{2^n}\\[2pt]
-2^{(n-1)/2}, & \frac{k}{2^n}\le t < \frac{k+1}{2^n}\\[2pt]
0 & \text{otherwise}.
\end{cases}
$$
The Schauder functions are
$$
S^n_k(t) = \int_0^t H^n_k(s)ds,
$$
which do not overlap for different k, when n is fixed, and have a "tent" like shape. Let $\xi^n_j$, $j\in I(n)$, $n = 1, \dots$ be an array of i.i.d. standard Gaussian random variables. Introduce the sequence of random processes, $n\ge 0$,
$$
W^n_t = \sum_{m=0}^n\sum_{k\in I(m)}\xi^m_k S^m_k(t),\quad t\in[0, 1]. \qquad (4.2)
$$
Note that $W^n_t$ has continuous trajectories for all n. If the sequence $W^n_t$ converges P-a.s. uniformly in $t\in[0, 1]$, then the limit process has continuous trajectories as required in axiom (2).

Let us verify the convergence of the series:
$$
\sum_{m=1}^n\sum_{j\in I(m)}\big|\xi^m_j\big|S^m_j(t) \le \sum_{m=1}^n\max_{j\in I(m)}\big|\xi^m_j\big|\sum_{j\in I(m)}S^m_j(t) \le \sum_{m=1}^n 2^{-(m+1)/2}\max_{j\le 2^m}\big|\xi^m_j\big| \qquad (4.3)
$$
(recall that the $S^m_j(t)$ do not overlap for a fixed m and different j). Since
$$
P\big(|\xi^m_j|\ge x\big) = \frac{2}{\sqrt{2\pi}}\int_x^\infty e^{-u^2/2}du \le \frac{2}{\sqrt{2\pi}}\int_x^\infty\frac ux e^{-u^2/2}du = \sqrt{\frac 2\pi}\,\frac{e^{-x^2/2}}{x},
$$
for $m\ge 1$
$$
P\Big(\max_{j\le 2^m}\big|\xi^m_j\big|\ge m\Big) = P\Big(\bigcup_{j\le 2^m}\big\{|\xi^m_j|\ge m\big\}\Big) \le 2^m P\big(|\xi^m_j|\ge m\big) \le \sqrt{\frac 2\pi}\,\frac{2^m e^{-m^2/2}}{m}.
$$
Since $\sum_{m=1}^\infty 2^m e^{-m^2/2}m^{-1} < \infty$, by the Borel-Cantelli Lemma
$$
P\Big(\max_{j\le 2^m}\big|\xi^m_j\big|\ge m,\ \text{i.o.}\Big) = 0.
$$
In other words, there is a set Ω' of full P-measure and a random integer n(ω), such that $\max_{j\le 2^m}|\xi^m_j|\le m$ for all $m\ge n(\omega)$, for all $\omega\in\Omega'$. Then the series in (4.3) converges on Ω', since
$$
\sum_{m=n(\omega)}^\infty 2^{-(m+1)/2}\max_{j\le 2^m}\big|\xi^m_j\big| \le \sum_{m=n(\omega)}^\infty 2^{-(m+1)/2}m < \infty.
$$
So the processes $W^n_t$ converge P-a.s. uniformly in t to a continuous process $W_t$. It is left to verify axiom (3). The Haar basis forms a complete orthonormal system in the Hilbert space $L^2[0, 1]$ with the scalar product $\langle g, f\rangle = \int_{[0,1]}f(s)g(s)ds$, and so by the Parseval equality
$$
\langle g, f\rangle = \sum_{n=0}^\infty\sum_{k\in I(n)}\langle g, H^n_k\rangle\langle f, H^n_k\rangle.
$$
For $g(u) = 1(u\le t)$ and $f(u) = 1(u\le s)$, the latter implies
$$
s\wedge t = \sum_{n=0}^\infty\sum_{k\in I(n)}S^n_k(t)S^n_k(s).
$$
Now let $\lambda_j$, $j = 1, \dots, n$, be real numbers and fix n distinct times $t_1 < \dots < t_n$. Then (with $\lambda_{n+1} = 0$ and $t_0 = 0$)
$$
\begin{aligned}
E\exp\Big(i\sum_{j=1}^n(\lambda_{j+1} - \lambda_j)W_{t_j}\Big)
&= E\exp\Big(i\sum_{j=1}^n(\lambda_{j+1} - \lambda_j)\sum_{m=0}^\infty\sum_{k\in I(m)}\xi^m_kS^m_k(t_j)\Big)\\
&= E\exp\Big(\sum_{m=0}^\infty\sum_{k\in I(m)}\xi^m_k\sum_{j=1}^n i(\lambda_{j+1} - \lambda_j)S^m_k(t_j)\Big)\\
&= \prod_{m=0}^\infty\prod_{k\in I(m)}E\exp\Big(\xi^m_k\sum_{j=1}^n i(\lambda_{j+1} - \lambda_j)S^m_k(t_j)\Big)\\
&= \prod_{m=0}^\infty\prod_{k\in I(m)}\exp\Big(-\frac 12\Big(\sum_{j=1}^n(\lambda_{j+1} - \lambda_j)S^m_k(t_j)\Big)^2\Big)\\
&= \exp\Big(-\frac 12\sum_{m=0}^\infty\sum_{k\in I(m)}\Big(\sum_{j=1}^n(\lambda_{j+1} - \lambda_j)S^m_k(t_j)\Big)^2\Big)\\
&= \exp\Big(-\frac 12\sum_{j=1}^n\sum_{i=1}^n(\lambda_{j+1} - \lambda_j)(\lambda_{i+1} - \lambda_i)\sum_{m=0}^\infty\sum_{k\in I(m)}S^m_k(t_j)S^m_k(t_i)\Big)\\
&= \exp\Big(-\frac 12\sum_{j=1}^n\sum_{i=1}^n(\lambda_{j+1} - \lambda_j)(\lambda_{i+1} - \lambda_i)(t_j\wedge t_i)\Big).
\end{aligned}
$$
Then, since $\sum_{j=1}^n\lambda_j\big(W_{t_j} - W_{t_{j-1}}\big) = -\sum_{j=1}^n(\lambda_{j+1} - \lambda_j)W_{t_j}$ and the value above is unchanged when all $\lambda_j$ are replaced by $-\lambda_j$,
$$
\begin{aligned}
E\exp\Big(i\sum_{j=1}^n\lambda_j\big(W_{t_j} - W_{t_{j-1}}\big)\Big)
&= \exp\Big(-\frac 12\sum_{j=1}^n\sum_{i=1}^n(\lambda_{j+1} - \lambda_j)(\lambda_{i+1} - \lambda_i)(t_j\wedge t_i)\Big)\\
&= \exp\Big(-\sum_{j=1}^{n-1}\sum_{i=j+1}^n(\lambda_{j+1} - \lambda_j)(\lambda_{i+1} - \lambda_i)t_j - \frac 12\sum_{j=1}^n(\lambda_{j+1} - \lambda_j)^2t_j\Big)\\
&= \exp\Big(-\sum_{j=1}^{n-1}(\lambda_{j+1} - \lambda_j)t_j\sum_{i=j+1}^n(\lambda_{i+1} - \lambda_i) - \frac 12\sum_{j=1}^n(\lambda_{j+1} - \lambda_j)^2t_j\Big)\\
&= \exp\Big(\sum_{j=1}^{n-1}(\lambda_{j+1} - \lambda_j)\lambda_{j+1}t_j - \frac 12\sum_{j=1}^n(\lambda_{j+1} - \lambda_j)^2t_j\Big)\\
&= \exp\Big(\frac 12\sum_{j=1}^{n-1}t_j\big(\lambda_{j+1}^2 - \lambda_j^2\big) - \frac 12\lambda_n^2t_n\Big)\\
&= \exp\Big(-\frac 12\sum_{j=1}^n\lambda_j^2(t_j - t_{j-1})\Big) = \prod_{j=1}^n\exp\Big(-\frac 12\lambda_j^2(t_j - t_{j-1})\Big),
\end{aligned}
$$
which verifies axiom (3). $\square$
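The construction (4.2) is easy to simulate. The sketch below (illustrative; the truncation level, grid and sample sizes are my own choices) builds the partial sums over the Schauder functions with i.i.d. Gaussian coefficients and checks that the increments of the result have approximately the right variances.

```python
import numpy as np

rng = np.random.default_rng(5)
t = np.linspace(0.0, 1.0, 513)

def schauder(n, k, t):
    """S^n_k: a tent of height 2**(-(n+1)/2) centred at k/2**n (n >= 1, k odd)."""
    return np.clip(2.0**(-(n + 1) / 2) - np.abs(t - k / 2**n) * 2.0**((n - 1) / 2), 0.0, None)

def wiener_partial_sum(levels):
    W = rng.standard_normal() * t                 # level 0 term: xi * S^0_1(t) = xi * t
    for n in range(1, levels + 1):
        for k in range(1, 2**n, 2):               # k runs over the odd integers I(n)
            W = W + rng.standard_normal() * schauder(n, k, t)
    return W

paths = np.array([wiener_partial_sum(8) for _ in range(500)])
print(paths[:, -1].var())                         # Var(W_1), should be close to 1
print((paths[:, 256] - paths[:, 128]).var())      # Var(W_0.5 - W_0.25), close to 0.25
```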
Remark 4.3. The Wiener process on $[0, \infty)$ can be constructed by patching together the Wiener processes on the intervals $[j, j+1]$, $j = 0, 1, \dots$.

Remark 4.4. Though the Gaussian distribution of the i.i.d. random variables in this proof plays a crucial role, the Gaussian property of the limit W is "universal": it turns out that any continuous time process with independent increments (a martingale!), continuous trajectories and variance t is the Wiener process. Roughly speaking, this suggests that the white noise which originates from a random process with these properties is necessarily Gaussian! More exactly:

Theorem 4.5. (P. Levy) Let $B_t$ be a continuous process with $EB_t\equiv 0$, $t\ge 0$, and
$$
E\big(B_t^2 - t\,|\,F^B_s\big) = B_s^2 - s,\quad t\ge s\ge 0.
$$
Then $B_t$ is a Wiener process.

Remark 4.6. Sometimes it is convenient to relate the Wiener process to some filtration $F_t$, by extending the definition in the following way: $W_t$ is the Wiener process with respect to a filtration $F_t$, if W has continuous paths, starts from zero and for any $t\ge s\ge 0$, $W_t - W_s$ is a Gaussian random variable, independent of $F_s$, with zero mean and variance $(t - s)$. The previous definition reduces to the case $F_t\equiv F^W_t = \sigma\{W_s, s\le t\}$.
1.2. Nondifferentiability of the paths. The properties of the trajectories of W are really amazing and up to now do not cease to attract the attention of mathematicians. We will verify a few of them, which are crucial to understanding the origins of stochastic calculus.

For a function $f:[0, 1]\to\mathbb R$, denote by $D^{\pm}f(t)$ the upper right/left Dini derivatives at t:
$$
D^{\pm}f(t) = \limsup_{h\to 0\pm}\frac{f(t+h) - f(t)}{h},
$$
and by $D_{\pm}f(t)$ the lower right/left Dini derivatives at t:
$$
D_{\pm}f(t) = \liminf_{h\to 0\pm}\frac{f(t+h) - f(t)}{h}.
$$
The function is differentiable at t from the right if $D^+f(t)$ and $D_+f(t)$ are finite and coincide. Similarly, left differentiability is defined by means of $D^-f(t)$ and $D_-f(t)$. If all the Dini derivatives are equal, f is differentiable at t. Differentiability at t = 0 and t = 1 is defined as right and left differentiability respectively.

Theorem 4.7. (Paley, Wiener and Zygmund, 1933) The Wiener process has nowhere differentiable trajectories; more precisely
$$
P\big(\omega:\ \text{for each } t < 1,\ \text{either } D^+W_t = \infty\ \text{or } D_+W_t = -\infty\big) = 1.
$$
Proof. For fixed $j, k\ge 1$, define the sets
$$
A_{jk} = \bigcup_{t\in[0,1]}\bigcap_{h\in[0,1/k]}\big\{\omega:\ |W_{t+h} - W_t|\le jh\big\}.
$$
Clearly
$$
\big\{\omega:\ -\infty < D_+W_t\le D^+W_t < \infty\ \text{for some } t\big\} \subseteq \bigcup_{j\ge 1}\bigcup_{k\ge 1}A_{jk},
$$
and so to verify the claim it would be enough to show that $P(A_{jk}) = 0$ for any j, k. Fix a trajectory in the set $A_{jk}$. For this trajectory there exists a number $t\in[0, 1]$, such that $|W_{t+h} - W_t|\le jh$ for any $0\le h\le 1/k$. Fix an integer $n\ge 4k$ and let $1\le i\le n$ be such that $(i-1)/n\le t\le i/n$. Then we have
$$
\begin{aligned}
\big|W_{(i+1)/n} - W_{i/n}\big| &\le \big|W_{(i+1)/n} - W_t\big| + \big|W_t - W_{i/n}\big| \le \frac{2j}{n} + \frac{j}{n} = \frac{3j}{n},\\
\big|W_{(i+2)/n} - W_{(i+1)/n}\big| &\le \big|W_{(i+2)/n} - W_t\big| + \big|W_t - W_{(i+1)/n}\big| \le \frac{3j}{n} + \frac{2j}{n} = \frac{5j}{n},\\
\big|W_{(i+3)/n} - W_{(i+2)/n}\big| &\le \big|W_{(i+3)/n} - W_t\big| + \big|W_t - W_{(i+2)/n}\big| \le \frac{4j}{n} + \frac{3j}{n} = \frac{7j}{n}.
\end{aligned}
$$
Then $A_{jk}\subseteq\bigcup_{i=1}^n C^{(n)}_i$ with
$$
C^{(n)}_i = \bigcap_{r=1}^3\Big\{\big|W_{(i+r)/n} - W_{(i+r-1)/n}\big|\le\frac{(2r+1)j}{n}\Big\}
$$
for any $n\ge 4k$, or in other words
$$
A_{jk}\subseteq\bigcap_{n\ge 4k}\bigcup_{i=1}^n C^{(n)}_i := C.
$$
Note that since the $W_{(i+r)/n} - W_{(i+r-1)/n}$ are independent and Gaussian with zero mean and variance 1/n,
$$
P\big(C^{(n)}_i\big)\le\frac{3\cdot 5\cdot 7\,j^3}{n^{3/2}},
$$
where the inequality $P(|\zeta|\le\varepsilon)\le\varepsilon$ for a standard Gaussian r.v. ζ has been used. Then $P(A_{jk})\le P(C)\le\inf_{n\ge 4k}P\big(\bigcup_{i=1}^n C^{(n)}_i\big) = 0$, where the latter holds since
$$
P\Big(\bigcup_{i=1}^n C^{(n)}_i\Big)\le\sum_{i=1}^n P\big(C^{(n)}_i\big) = \frac{105\,j^3}{n^{1/2}}\xrightarrow[n\to\infty]{}0. \qquad\square
$$
Recall that the p-variation of the function $f:[0, 1]\to\mathbb R$ on the partition $\Delta_n = \{t_i,\ 0 = t_0 < \dots < t_{n+1} = 1\}$ is
$$
V^p_{\Delta_n}f(t) := \sum_{t_{i+1}\le t}\big|f_{t_{i+1}} - f_{t_i}\big|^p,\quad t\in[0, 1].
$$
The function f is said to be of finite p-variation on [0, 1] if the following supremum is finite:
$$
V^pf(t) := \sup_{\Delta_n,\,n\in\mathbb Z_+}\sum_{t_{i+1}\le t}\big|f_{t_{i+1}} - f_{t_i}\big|^p,\quad t\in[0, 1].
$$
Theorem 4.8. The quadratic variation of the Wiener process trajectories equals t in the sense that
$$
V^2W(t) = \lim_{|\Delta_n|\to 0}V^2_{\Delta_n}W(t) = t,
$$
where^1 the limit is understood in $L^2$.^2

Proof. Use the Gaussian properties of the Wiener process:
$$
\begin{aligned}
E\Big(\sum_{t_{i+1}\le t}\big(W_{t_{i+1}} - W_{t_i}\big)^2 - t\Big)^2
&= E\Big(\sum_{t_{i+1}\le t}\Big[\big(W_{t_{i+1}} - W_{t_i}\big)^2 - (t_{i+1} - t_i)\Big]\Big)^2\\
&= \sum_{t_{i+1}\le t}E\Big[\big(W_{t_{i+1}} - W_{t_i}\big)^2 - (t_{i+1} - t_i)\Big]^2\\
&= \sum_{t_{i+1}\le t}2(t_{i+1} - t_i)^2\le 2|\Delta_n|\,t\xrightarrow[|\Delta_n|\to 0]{}0. \qquad\square
\end{aligned}
$$
[1] $|\Delta_n| = \max_i|t_{i+1} - t_i|$ is the size of the partition.
[2] Stronger convergence is possible if the partition sizes are allowed to decrease fast enough.
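A quick numerical check of Theorem 4.8 (and a preview of the infinite variation statement that follows): on finer and finer partitions of [0, 1] the sum of squared increments of a simulated Wiener path stabilizes near t = 1, while the sum of absolute increments keeps growing like $\sqrt n$. The mesh sizes below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)
for n in (100, 10_000, 1_000_000):
    dW = rng.standard_normal(n) / np.sqrt(n)      # Wiener increments on a grid of mesh 1/n
    print(n, np.sum(dW**2), np.sum(np.abs(dW)))   # squared sum -> 1, absolute sum ~ sqrt(n)
```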
Theorem 4.9. The Wiener process has trajectories with infinite variation; in particular,
$$
P\Big(\lim_{n\to\infty}\sum_{0\le i\le n}\big|W_{i/n} - W_{(i-1)/n}\big| = \infty\Big) = 1.
$$
Proof. The random variables $\big(W_{i/n} - W_{(i-1)/n}\big)\sqrt n$ form an i.i.d. standard Gaussian sequence, so that by the law of large numbers
$$
P\Big(\lim_{n\to\infty}\frac 1n\sum_{i=1}^n\big|W_{i/n} - W_{(i-1)/n}\big|\sqrt n = E|W_1|\Big) = 1.
$$
Since $E|W_1| > 0$, this implies in particular
$$
P\Big(\sum_{i=1}^n\big|W_{i/n} - W_{(i-1)/n}\big|\ge\varepsilon n^{1/2},\ \text{eventually}\Big) = 1
$$
for any $\varepsilon > 0$. $\square$
2. The Ito Stochastic Integral

Recall the following fact from classical analysis (Vol. 3, Ch. 15, §4-5 in [9]).

Theorem 4.10. (Stieltjes integral) Let^3 $f:[0, 1]\to\mathbb R$ be a uniformly continuous function and $g_t:[0, 1]\to\mathbb R$ be a function of finite variation. Let $0 = t_0 < t_1 < \dots < t_n = 1$ be a sequence of partitions and denote $\delta_n = \max_j|t_j - t_{j-1}|$. Then the limit
$$
\int_0^1 f_s\,dg_s := \lim_{\delta_n\to 0}\sum_{j=1}^n f(t^*_{j-1})\big(g_{t_j} - g_{t_{j-1}}\big) \qquad (4.4)
$$
exists and is unique for any choice of points $t^*_{j-1}\in[t_{j-1}, t_j]$, $j = 1, \dots, n$. It is called the Stieltjes integral of $f_t$ with respect to $g_t$.

[3] For the sake of notational simplicity, the dependence of the partition $\{t_j\}$ on n is always assumed implicitly.
Proof. Assume first that g does not decrease. Define the Darboux sums
$$
s_n = \sum_{j=1}^n m_{j-1}\big(g_{t_j} - g_{t_{j-1}}\big),\qquad S_n = \sum_{j=1}^n M_{j-1}\big(g_{t_j} - g_{t_{j-1}}\big),
$$
where $m_{j-1} = \min_{s\in[t_{j-1}, t_j]}f_s$ and $M_{j-1} = \max_{s\in[t_{j-1}, t_j]}f_s$. It is easy to see that $S_n$ ($s_n$) does not increase (decrease) with n and moreover $S_n\ge s_m$ for any $m, n\ge 1$. Then the limit in (4.4) exists and is unique if $I^* := \inf_n S_n = \sup_n s_n =: I_*$. The latter holds if
$$
\lim\sum_{j=1}^n(M_{j-1} - m_{j-1})\big(g_{t_j} - g_{t_{j-1}}\big) = 0.
$$
If f is uniformly continuous, then for any $\varepsilon > 0$ one may choose $\delta_n > 0$ such that $M_j - m_j\le\varepsilon/(g_1 - g_0)$ uniformly in j. Then
$$
\sum_{j=1}^n(M_{j-1} - m_{j-1})\big(g_{t_j} - g_{t_{j-1}}\big)\le\varepsilon,
$$
and the claim of the Theorem holds for nondecreasing g. The general case follows from the fact that a g with finite variation can be decomposed into a sum of a nonincreasing and a nondecreasing function. $\square$

The Wiener process has infinite variation and hence it is not clear how a Stieltjes integral with respect to its trajectories can be constructed. This is clarified in the following example:
Example 4.11. Suppose we would like to define the integral $\int_0^t W_s\,dW_s$ as the limit, as $n\to\infty$, of the sums
$$
\sum_{i=0}^{[tn]}W_{s^*_i}\big(W_{s_{i+1}} - W_{s_i}\big),\quad t\in[0, 1],
$$
where $s_i = i/n$ and $s^*_i$ is a point from the interval $[s_i, s_{i+1}]$ for each i. Consider the two choices $s^*_i = s_i$ and $s^*_i = (s_{i+1} + s_i)/2$, which lead to
$$
I^n_t = \sum_{i=0}^{[tn]}W_{s_i}\big(W_{s_{i+1}} - W_{s_i}\big)
\qquad\text{and}\qquad
J^n_t = \sum_{i=0}^{[tn]}W_{(s_i+s_{i+1})/2}\big(W_{s_{i+1}} - W_{s_i}\big)
$$
respectively. Clearly $EI^n_t = 0$ for all t and $n\ge 1$. On the other hand
$$
EJ^n_t = \sum_{i=0}^{[tn]}EW_{(s_i+s_{i+1})/2}\big(W_{s_{i+1}} - W_{s_i}\big) = \sum_{i=0}^{[tn]}\Big(\frac{s_i+s_{i+1}}{2} - s_i\Big) = \frac 12\,\frac{[tn]}{n}\xrightarrow[n\to\infty]{}\frac t2.
$$
It is not hard to see that the limits in probability $I_t = \lim_n I^n_t$ and $J_t = \lim_n J^n_t$ exist and satisfy $EI_t = 0$ and $EJ_t = t/2$ for all $t\in[0, 1]$. So one does not obtain the same limit for different choices of sampling points, as promised in Theorem 4.10. This is a manifestation of the irregularity of the trajectories of W: if their variation were finite, the same limit would be obtained! Let us note that both examples are in fact the prototypes of the stochastic integrals in the sense of Ito and Stratonovich respectively. See Exercise 7 for further exploration.
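The two prelimit sums of Example 4.11 are easy to evaluate on simulated paths. In the sketch below (parameters are my own illustrative choices) the left-endpoint sums average to about 0 and the midpoint sums to about t/2, as computed above.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 2000, 2000
I_vals, J_vals = [], []
for _ in range(reps):
    # simulate W on a grid of mesh 1/(2n) so that the midpoints lie on the grid
    W = np.concatenate(([0.0], np.cumsum(rng.standard_normal(2 * n) / np.sqrt(2 * n))))
    left, mid, right = W[0:-1:2], W[1::2], W[2::2]     # W at s_i, at the midpoints, at s_{i+1}
    I_vals.append(np.sum(left * (right - left)))
    J_vals.append(np.sum(mid * (right - left)))
print(np.mean(I_vals), np.mean(J_vals))                # approximately 0 and t/2 = 0.5
```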
2.1. Construction. The Ito integral will be defined in this course^4 under the following setup. Let (Ω, F, P) be a complete^5 probability space with an increasing family of sub-σ-algebras (filtration) $F_t\subseteq F$. Sometimes $(\Omega, F, (F_t)_{t\ge 0}, P)$ is referred to as a filtered probability space or stochastic basis.

Definition 4.12. A random process X is said to be adapted to the filtration $F_t$ if for each fixed $t\ge 0$, the random variable $X_t$ is $F_t$-measurable.

From here on all the random processes are assumed to be adapted to $F_t$, if not stated otherwise. For example, the Wiener process $W_t$ is trivially adapted to its natural filtration $F^W_t = \sigma\{W_s, s\le t\}$, but is also assumed to be adapted to $F_t$. This allows to define the integral more generally and is of no limitation, since $F_t$ can usually be defined as the least filtration to which all the processes are adapted. For example, it allows to define integrals like $\int_0^t V_s\,dW_s$ where W and V are independent Wiener processes: V is not adapted to $F^W_t$, but both W and V are adapted to $F_t := F^V_t\vee F^W_t$.

Construction of the Ito integral is based on two main ideas: (1) to restrict the choice of the sampling points of the integrand in the prelimit sums to the beginning of the sub-intervals of the partition and (2) to consider integrands for which this restriction leads to a unique limit.

Definition 4.13. The process $X_t(\omega)$ is said to belong to the family $H^2_{[0,T]}$ if
(1) the mapping $(t, \omega)\mapsto X_t(\omega)$ is measurable with respect to $\mathcal B([0, T])\otimes F$ (as a function of both arguments)
(2) $X_t(\omega)$ is $F_t$-adapted
(3) $E\int_0^T X^2_s(\omega)ds < \infty$

Remark 4.14. The stochastic integral can be constructed for a more general class of integrands, satisfying only
$$
P\Big(\int_0^T X^2_t\,dt < \infty\Big) = 1
$$
instead of (3). In what follows the stochastic integral will be used with integrands satisfying the stronger condition, if not specified otherwise. It turns out that the properties of the stochastic integral may crucially depend on the integrand type - this point is demonstrated in Example 4.25 below.

Generally, stochastic integration can be defined with respect to processes more general than the Wiener process: the martingales. For further exploration see the introductory text [4] and [22] for a more advanced treatment.

Definition 4.15. The process $X_t$ is $H^2_{[0,T]}$-simple (or just simple) if it belongs to $H^2_{[0,T]}$ and has the form $X^n_t = \sum_{j=1}^n\xi_{j-1}1_{[t_{j-1}, t_j)}(t)$ for some fixed partition $0 = t_0\le t_1\le\dots\le t_n = T$ and random variables $\xi_j$.

[4] The text [25] is followed here.
[5] a standard technical requirement which is usually imposed on probability spaces: it means that F contains all the sets A such that $A'\subseteq A\subseteq A''$ for some measurable sets A' and A'' (on which P is defined) with $P(A'') - P(A') = 0$; for such A one sets $P(A) = P(A')$.
Assume that $F^W_t\subseteq F_t$ and define the Ito integral for a simple process $X^n_t$ as
$$
I(X^n) := \int_0^T X^n_t\,dW_t := \sum_{j=1}^n\xi_{j-1}\big(W_{t_j} - W_{t_{j-1}}\big).
$$
Then^6
$$
\begin{aligned}
EI^2(X^n) &= E\Big(\sum_{j=1}^n\xi_{j-1}\big(W_{t_j} - W_{t_{j-1}}\big)\Big)^2\\
&= \sum_{j=1}^n E\xi^2_{j-1}\big(W_{t_j} - W_{t_{j-1}}\big)^2 + 2\sum_{i=1}^n\sum_{j<i}E\xi_{i-1}\xi_{j-1}\big(W_{t_j} - W_{t_{j-1}}\big)\big(W_{t_i} - W_{t_{i-1}}\big)\\
&= \sum_{j=1}^n E\,\xi^2_{j-1}E\Big(\big(W_{t_j} - W_{t_{j-1}}\big)^2\,\Big|\,F_{t_{j-1}}\Big) + 2\sum_{i=1}^n\sum_{j<i}E\,\xi_{i-1}\xi_{j-1}\big(W_{t_j} - W_{t_{j-1}}\big)E\Big(W_{t_i} - W_{t_{i-1}}\,\Big|\,F_{t_{i-1}}\Big)\\
&= \sum_{j=1}^n E\xi^2_{j-1}(t_j - t_{j-1}) = \int_0^T E\big(X^n_t\big)^2dt. \qquad (4.5)
\end{aligned}
$$
The latter property is called the Ito isometry and is the main feature in the construction of the stochastic integral.

[6] It can be shown that the filtration $F^W_t$ is continuous, i.e. $F^W_{t+} := \bigcap_{\varepsilon>0}F^W_{t+\varepsilon}$ and $F^W_{t-} := \bigvee_{\varepsilon>0}F^W_{t-\varepsilon}$ coincide with $F^W_t$. It is customary to assume that $F_t$ is continuous (or at least right continuous) as well. This and the definition of $X^n_t$ imply that $\xi_{j-1}$ is $F_{t_{j-1}}$-measurable.
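A Monte Carlo illustration of the isometry (4.5) for a simple integrand (the choice $\xi_{j-1} = W_{t_{j-1}}$ is mine, matching the left-endpoint sums of Example 4.11): the sample mean of the integral is close to 0 and its sample variance close to $\int_0^T EW_t^2\,dt = T^2/2$.

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps, T = 500, 5000, 1.0
vals = []
for _ in range(reps):
    dW = rng.standard_normal(n) * np.sqrt(T / n)
    W = np.concatenate(([0.0], np.cumsum(dW)))
    vals.append(np.sum(W[:-1] * dW))              # sum of xi_{j-1}(W_{t_j} - W_{t_{j-1}}), xi = W
vals = np.array(vals)
print(vals.mean())                                # close to 0: the integral has zero mean
print(vals.var(), T**2 / 2)                       # the isometry: E I^2 = T^2/2
```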
Lemma 4.16. Let $\{t_j\}$ be a sequence of partitions of [0, T], such that $\delta_n = \max_j|t_j - t_{j-1}|\to 0$ as $n\to\infty$. Then

1. for any continuous^7 and bounded $H^2_{[0,T]}$ process $X^{bc}_t$, there is a sequence of simple $H^2_{[0,T]}$ processes $X^\delta_t$, $\delta\to 0$, such that
$$
\lim_{\delta\to 0}\int_0^T E\big(X^{bc}_t - X^\delta_t\big)^2dt = 0; \qquad (4.6)
$$
2. for any bounded $H^2_{[0,T]}$ process $X^b_t$ there is a sequence of continuous $H^2_{[0,T]}$ processes $X^{c,m}_t$, $m\ge 1$, such that
$$
\lim_{m\to\infty}\int_0^T E\big(X^b_t - X^{c,m}_t\big)^2dt = 0; \qquad (4.7)
$$
3. for any $H^2_{[0,T]}$ process $X_t$ there is a sequence of bounded $H^2_{[0,T]}$ processes $X^{b,n}_t$, $n\ge 1$, such that
$$
\lim_{n\to\infty}\int_0^T E\big(X_t - X^{b,n}_t\big)^2dt = 0. \qquad (4.8)
$$
[7] i.e. a process with continuous trajectories
Proof. 1. Let $X^\delta_t = \sum_{j}X^{bc}_{t_{j-1}}1_{[t_{j-1}, t_j)}(t)$. Clearly $X^\delta_t$ is a simple bounded $H^2_{[0,T]}$ process, which converges to $X^{bc}_t$ uniformly in t due to continuity. Then (4.6) follows by dominated convergence.

2. Let $\psi^m_t$, $m\ge 1$, be a sequence of continuous functions supported on $(-1/m, 0)$ and satisfying $\int_{\mathbb R}\psi^m_s\,ds = 1$. Define
$$
X^{c,m}_t = \int_0^t X^b_s\,\psi^m_{s-t}\,ds.
$$
Clearly the $X^{c,m}_t$ are continuous $H^2_{[0,T]}$ processes (since $\psi^m_t$ was chosen in a causal way) and
$$
\lim_{m\to\infty}\int_0^T\big(X^b_t - X^{c,m}_t\big)^2dt = 0,\quad P\text{-a.s.},
$$
since the convolution with $\psi^m_s$ approximates the identity operator for bounded functions. Again (4.7) follows by dominated convergence.

3. Fix an integer $n\ge 1$ and define
$$
X^{b,n}_t =
\begin{cases}
X_t, & |X_t|\le n\\
\mathrm{sign}(X_t)\,n, & |X_t| > n.
\end{cases}
$$
Clearly $|X^{b,n}_t|\le|X_t|$ and so
$$
\int_0^T E\big(X^{b,n}_t - X_t\big)^2dt\le 4\int_0^T EX^2_t\,dt < \infty,
$$
and hence (4.8) follows by dominated convergence. $\square$
Theorem 4.17. (Ito stochastic integral) For any $X_t\in H^2_{[0,T]}$, the $L^2$-limit
$$
\int_0^T X_s\,dW_s := \lim_{n\to\infty}\int_0^T X^n_s\,dW_s
$$
exists and is independent of the specific sequence $X^n$ of simple processes approximating X in the sense
$$
\int_0^T E\big(X_s - X^n_s\big)^2ds\xrightarrow[n\to\infty]{}0.
$$
Proof. By Lemma 4.16, for any $H^2_{[0,T]}$-process $X_t$ there is a sequence of simple processes $X^n_t$ for which $I(X^n)$ is well defined. Note that for any n, m, $X^n_t - X^m_t$ is a simple $H^2_{[0,T]}$-process. Then the sequence $I(X^n)$, $n\ge 1$, satisfies the Cauchy property
$$
E\big(I(X^n) - I(X^m)\big)^2 = E\Big(\int_0^T\big(X^n_t - X^m_t\big)dW_t\Big)^2 = \int_0^T E\big(X^n_t - X^m_t\big)^2dt\xrightarrow[n,m\to\infty]{}0,
$$
where the latter holds since $X^n$ is a convergent sequence in^8 $L^2$. The existence of the limit $I(X) = \lim_n I(X^n)$ follows, since any Cauchy sequence converges in $L^2$.

The uniqueness is obtained by standard arguments. Let $X^{(1)}_n$ and $X^{(2)}_n$ be two approximating sequences and let $X_n$ denote the sequence obtained by taking $X^{(1)}_n$ for odd n and $X^{(2)}_n$ for even n. Suppose that different limits $I_1(X)$ and $I_2(X)$ are obtained when using $X^{(1)}_n$ and $X^{(2)}_n$. Then the approximating sequence $X_n$ will not converge to any limit. This however contradicts the existence of a limit for $X_n$. $\square$

[8] $L^2\big([0, T]\times\Omega,\ \mathcal B([0, T])\otimes F,\ dt\times P\big)$ is meant here.
Remark 4.18. Calculation of the Ito integral is possible by applying the construction used in its definition - see Exercise 8. Another way is to apply the Ito formula, to be given below.
2.2. Properties. Let X and Y be $H^2_{[0,T]}$-processes; then (all random equalities hold P-a.s.)

(i) $\displaystyle\int_0^T X_t\,dW_t = \int_0^S X_t\,dW_t + \int_S^T X_t\,dW_t$, $S\le T$;
(ii) $\displaystyle\int_0^T(aX_t + bY_t)dW_t = a\int_0^T X_t\,dW_t + b\int_0^T Y_t\,dW_t$, for constants a and b;
(iii) $\displaystyle E\int_0^T X_t\,dW_t = 0$;
(iv) $\displaystyle E\Big(\int_0^T X_t\,dW_t\int_0^S Y_t\,dW_t\Big) = \int_0^{S\wedge T}EX_tY_t\,dt$. In particular $\displaystyle E\Big(\int_0^T X_t\,dW_t\Big)^2 = \int_0^T EX^2_t\,dt$;
(v) $\displaystyle\int_0^t X_s\,dW_s$ is $F_t$-adapted;
(vi) $\displaystyle\int_0^t X_s\,dW_s$, $t\in[0, T]$, admits a continuous version^9, i.e. there exists a random process $I_t(X)$, $t\in[0, T]$, with continuous trajectories, so that
$$
P\Big(\int_0^t X_s\,dW_s = I_t(X)\Big) = 1,\quad \forall t\in[0, T].
$$
Proof. The properties (i)-(v) are inherited from the simple function approximation. Let us verify, say, (i): take a sequence $X^n\to X$, in the sense
$$
\int_0^T E\big(X^n_t - X_t\big)^2dt\to 0.
$$
Then
$$
\int_0^T X^n_t\,dW_t = \int_0^S X^n_t\,dW_t + \int_S^T X^n_t\,dW_t
$$
and so
$$
\begin{aligned}
E\Big(\int_0^T X_t\,dW_t - \int_0^S X_t\,dW_t - \int_S^T X_t\,dW_t\Big)^2
\le\ & 4E\Big(\int_0^T X_t\,dW_t - \int_0^T X^n_t\,dW_t\Big)^2 + 4E\Big(\int_0^S X_t\,dW_t - \int_0^S X^n_t\,dW_t\Big)^2\\
& + 4E\Big(\int_S^T X_t\,dW_t - \int_S^T X^n_t\,dW_t\Big)^2\xrightarrow[n\to\infty]{}0.
\end{aligned}
$$
The property (vi) stems from continuity of W. Its proof relies on the fact that $\int_0^t X^n_s\,dW_s$ is continuous for a fixed $n\ge 1$ and that this sequence converges uniformly in t, making the limit a continuous function of t as well (the proof uses Doob's inequality for martingales).

[9] Several types of equalities between continuous time random processes are usually considered. The processes X and Y are said to be indistinguishable if
$$
P\big(\exists\,t\in[0, T]:\ X_t\ne Y_t\big) = P\Big(\sup_{t\le T}|X_t - Y_t| > 0\Big) = 0.
$$
This is the strongest kind of equality, which is sometimes hard to establish. X is said to be a version of Y if for any $t\in[0, T]$
$$
P(X_t\ne Y_t) = 0. \qquad (4.9)
$$
Clearly indistinguishable processes are versions of each other. Note that if X and Y satisfy (4.9), then their finite dimensional distributions coincide.
The property (vi) stems from continuity of W. Its proof relies on the fact that
_
t
0
X
n
t
dW
t
is continuous for a xed n 1 and that this sequence converges uniformly
in t, making the limit a continuous function of t as well (the proof uses Doobs
inequality for martingales).
Remark 4.19. If the assumption
_
T
0
EX
2
t
dt <
is replaced by
P
_
_
T
0
X
2
t
dt <
_
= 1,
the integral is still well dened (as mentioned before in Remark 4.14), however the
properties (iii) and (iv) may fail to hold (!) - see Example 4.25 below.
3. The Ito formula
Consider the scalar random process
X
t
= X
0
+
_
t
0
a
s
()ds +
_
t
0
b
s
()dW
s
, t T, (4.10)
where a
t
and b
t
are H
2
[0,T]
processes and W = (W
t
)
tT
is the Wiener process,
dened on a stochastic basis (, F, F
t
, P). A random process is an Ito process, if
it satises (4.10), which is usually written in a dierential form
dX
t
= a
t
()dt +b
t
()dW
t
. (4.11)
Note that this Ito dierential is nothing more than a brief notation in the spirit of
classical calculus.
Let f(t, x) be a R
+
R R function with one and two continuous derivatives
in t and x respectively. It turns out that the process
t
:= f(t, X
t
) admits unique
integral representation, similar to (4.10), or in other words, it is also an Ito process.
Theorem 4.20. (the Ito formula) Assume f and its partial derivatives with
respect to t and x variables f
t
t
, f
t
x
and f
tt
x
are bounded and continuous, then the
process
t
= f(t, X
t
) admits the Ito dierential
d
t
= f
t
t
(t, X
t
)dt +f
t
x
(t, X
t
)a
t
dt +
1
2
f
tt
x
(t, X
t
)b
2
t
dt +f
t
x
(t, X
t
)b
t
dW
t
, (4.12)
subject to
0
= f(0, X
0
).
Remark 4.21. Consider the similar setting in the classical nonrandom case:
let V
t
be a function of bounded variation and dX
t
= a
t
dt +b
t
dV
t
, where the latter
is the Stieltjes dierential. Then the dierential for
t
= f(t, X
t
) is obtained by the
well known chain rule
d
t
= f
t
t
(t, X
t
)dt +f
t
x
(t, X
t
)dX
t
= f
t
t
(t, X
t
)dt +f
t
x
(t, X
t
)a
t
dt +f
t
x
(t, X
t
)b
t
dV
t
.
The major dierence between the classic dierentiation and (4.12) is the extra term
1
2
f
tt
x
(t, X
t
)b
2
t
dt, which is again the manifestation of trajectories irregularity of W.
This non-classic chain rule is called Ito formula and is the central tool of sto-
chastic calculus with respect to Wiener process.
68 4. THE WHITE NOISE IN CONTINUOUS TIME
Remark 4.22. The requirements for f and its derivatives to be bounded can be
relaxed even if working under the condition, mentioned in Remark 4.14. Moreover
the second derivative in x can be discontinuous at a countable number of points.
One should be careful to make further relaxations: for example if the rst derivative
has a discontinuity point, the local time process arises - see Example 4.26.
Remark 4.23. The Ito formula remains valid under condition mentioned in
Remark 4.14 (recall that the stochastic integral itself may have dierent properties
depending on the integrability conditions of the integrand - see Remark 4.19).
Proof. (Sketch) Let a
n
t
() and b
n
t
() be simple H
2
[0,T]
processes, approximat-
ing a
t
and b
t
:
_
T
0
E

a
n
t
a
t

dt
n
0
_
T
0
E
_
b
n
t
b
t
_
2
dt
n
0,
Let X
n
t
= X
0
+
_
t
0
a
n
s
ds+
_
t
0
b
n
s
dW
s
and suppose that (4.12) holds for
n
:= f(t, X
n
t
).
Then (4.12) holds for
t
by continuity and boundedness of f and its derivatives:
E

f(t, X
t
) f(0, X
0
)
_
t
0
_
f
t
t
(s, X
s
) +f
t
x
(s, X
s
)a
s
ds +
1
2
f
tt
x
(s, X
s
)b
2
s
_
ds
_
t
0
f
t
x
(s, X
s
)b
s
dW
s

f(t, X
t
) f(t, X
n
t
)

+
_
T
0
E

f
t
t
(s, X
s
) f
t
t
(s, X
n
s
)

ds+
_
T
0
E

f
t
x
(s, X
s
) f
t
x
(s, X
n
s
)

a
s
ds +
_
T
0
1
2
E

f
tt
x
(s, X
s
) f
tt
x
(s, X
n
s
)

b
2
s
ds+
_
_
T
0
E
_
f
t
x
(s, X
s
) f
t
x
(s, X
n
s
)
_
2
b
2
s
ds
_
1/2
n0
0.
So it is enough to verify (4.12), when a
t
and b
t
are simple. Due to additivity of the
stochastic integral, it even suces to consider constant a() and b() (such that
the Ito integral is well dened), in which case X
t
= at +bW
t
. Since f(t, at +bW
t
)
is now a function of t and W
t
, the formula (4.12) holds, if
u(t, W
t
) = u(0, 0) +
_
t
0
u
t
t
(s, W
s
)ds +
_
t
0
u
t
x
(s, W
s
)dW
s
+
1
2
_
t
0
u
tt
x
(s, W
s
)ds (4.13)
for a bounded u(t, x) with two bounded continuous derivatives. Using the Taylor
expansion for u(t, x), the telescopic sum is obtained (with t
i
:= t
i
t
i1
and
W
i
= W
t
i
W
t
i1
)
u(t, W
t
) =u(0, 0) +
n

i=1
u
t
t
(t
i1
, W
t
i1
)t
i
+
n

i=1
u
t
x
(t
i1
, W
t
i1
)W
i
+
1
2
n

i=1
u
tt
x
(t
i1
, W
t
i1
)(W
i
)
2
+R
n
where R
n
is the residual term, consisting of sums over (t
i
)
2
, t
i
W
i
and (W
i
)
3
with coecients obtained by the Mean Value Theorem. Clearly the rst three terms
3. THE IT

O FORMULA 69
on the right hand side of the latter converge to the corresponding terms in (4.13).
By the same arguments, used in the proof of Theorem 4.8
E
_
n

i=1
u
tt
x
(t
i1
, W
t
i1
)(W
i
)
2

i=1
u
tt
x
(t
i1
, W
t
i1
)t
i
_
2
=
n

i=1
E
_
u
tt
x
(t
i1
, W
t
i1
)
_
2
_
(W
i
)
2
t
i
_
2

2T sup
t,x[0,T]R
[u
tt
x
(t, x)[
2
max
i
t
i
n
0
Similarly the residual term R
n
is shown to vanish as n .
Example 4.24. Apply the Ito formula to W
2
t
:
d(W
t
)
2
= 2W
t
dW
t
+dt
or in other words
W
2
t
= 2
_
t
0
W
s
dW
s
+t.

Example 4.25. (Example 8 Ch. 6.2 in [21]) Let


t
be a random process,
adapted to F
t
and satisfying
P
__
1
0

2
t
dt <
_
= 1. (4.14)
Then the process

t
= exp
__
t
0

s
dW
s

1
2
_
t
0

2
s
ds
_
is well dened and by the Ito formula, satises the integral identity (which is also
an example of stochastic dierential equation (SDE) to be introduced in Section 5)

t
= 1 +
_
t
0

s
dW
s
, t [0, 1].
If
_
1
0
E
2
s
ds < , then the stochastic integral has zero mean and thus E
1
= 1. If
however only (4.14) holds, then E
1
< 1 is possible, meaning that the stochastic
integral is no longer a martingale. Consider a specic
t

t
=
2W
t
(1 t)
2
1
t]
,
where = inft 1 : W
2
t
= 1 t, i.e. the rst time W
2
t
hits the line 1 t. The
event t is F
W
t
measurable (and a fortiori F
t
measurable), since it can be
resolved on the basis of trajectory of W up to time t and hence
t
is F
t
-adapted.
Note that P( < 1) = 1, since
P( = 1) P(W
1
= 0) = 0,
and so
_
1
0

2
t
dt =
_
1
0
4W
2
t
(1 t)
4
1
t]
dt =
_

0
4W
2
t
(1 t)
4
dt < , P a.s.
70 4. THE WHITE NOISE IN CONTINUOUS TIME
By the Ito formula
d
_
W
2
t
(1 t)
2
_
=
2W
2
t
(1 t)
3
dt +
2W
t
(1 t)
2
dW
t
+
1
(1 t)
2
dt,
which implies
_
1
0

s
dW
s

1
2
_
1
0

2
s
ds =
_

0
2W
t
(1 t)
2
dW
s

_

0
2W
2
t
(1 t)
4
dt =

W
2

(1 )
2
+
_

0
2W
2
t
(1 t)
3
dt +
_

0
1
(1 t)
2
dt
_

0
2W
2
t
(1 t)
4
dt =

1
(1 )
2
+
_

0
2W
2
t
_
1
(1 t)
3

1
(1 t)
4
_
dt +
_

0
1
(1 t)
2
dt

1
1
+
_

0
1
(1 t)
2
dt = 1.
Then E
t
1/e < 1, i.e. the stochastic integral
_
t
0

s

s
dW
s
has nonzero mean!
Example 4.26. (The Tanaka formula and the local time) Let > 0 and
f

(x) = [x[1
]x]]
+
1
2
( +
x
2

)1
]x]<]
.
Since f

(x) is twice dierentiable with the second derivative discontinuous at two


points x = , the Ito formula still applies and gives
f

(W
t
) =
_
t
0
f
t

(W
s
)dW
s
+
1
2
_
t
0
f
tt

(W
s
)ds =
_
t
0
sign(W
s
)1
]W
s
]]
dW
s
+
_
t
0

1
W
s
1
]W
s
]<]
dW
s
+
1
2
_
t
0
1
]W
s
]]
ds
Note that
E
__
t
0

1
W
s
1
]W
s
]<]
dW
s
_
2
=
_
t
0

2
EW
2
s
1
]W
s
]<]
ds
_
t
0

2
E1
]W
s
]<]
ds =
_
t
0
P([W
s
[ < )ds
0
0.
Hence the local time process corresponding to W
t
L
t
= lim
0
1
2
_
t
0
1
]W
s
]]
ds (4.15)
exists at least as L
2
limit. In fact it exists in a stronger sense and moreover the
Tanaka formula holds
[W
t
[ =
_
t
0
sign(W
t
)dW
t
+L
t
, (4.16)
as the preceding limit procedure hints (f

(x) [x[ for all x). By denition L


t
is
the rate at which the amount of time spent by the Wiener process in the vicinity of
zero decays as it shrinks. This is another manifestation of pathes irregularity of the
Wiener process: e.g. the limit (4.15) would vanish if W
t
had a countable number
of zeros on [0, T].
3. THE IT

O FORMULA 71
More examples are collected in the Exercises section. The following Theorem
gives the multivariate version of the Ito formula
Theorem 4.27. Let X
t
have the Ito dierential
dX
t
= a
t
dt +b
t
dW
t
, t [0, T],
where a
t
and b
t
are n 1 vector and n m matrix of H
2
[0,T]
-random processes and
W
t
is a vector of m independent Wiener processes. Assume f : R
+
R
n
R is
continuously dierentiable in t variable and twice continuously dierentiable in the
x variables. Then
df(t, X
t
) =

t
f(t, X
t
)dt +
d

i=1

x
i
f(t, X
t
)dX
t
+
1
2

i,j

2
x
i
x
j
f(t, X
t
)
n

k=1
b
t
(i, k)b
t
(j, k)dt. (4.17)
Remark 4.28. Denote by the (row vector) gradient operator with respect
to x and let b
t
b

be the second order dierential operator, obtained by formal


multiplication of partial derivatives. Denote by

f(t, x) the partial derivative w.r.t.
time variable t. Then (4.17) can be compactly written as
df(t, X
t
) =

f(t, X
t
)dt +f(t, X
t
)dX
t
+
1
2
(b
t
b

)f(t, X
t
)dt.
The vector Ito formula can be conveniently encoded into the mnemonic multiplica-
tion rules between dierentials, summarized in Table 4.28, used with formal Taylor
expansion of f as demonstrated in the following example.
1 dt dW
t
(1) dW
t
(2)
dt dt 0 0 0
dW
t
(1) dW
t
(1) 0 dt 0
dW
t
(2) dW
t
(2) 0 0 dt
Table 1. The formal Ito dierential multiplication rules
Example 4.29. Consider the two dimensional system
dX
t
= a
1
X
t
dt +b
11
dW
t
+b
12
dV
t
dY
t
= a
2
Y
t
dt +b
21
dW
t
+b
22
dV
t
.
and let r
t
= f(X
t
, Y
t
). Then formally
dr
t
=df(X
t
, Y
t
) = f
x
(X
t
, Y
t
)dX
t
+f
y
(X
t
, Y
t
)dY
t
+
1
2
f
xx
(X
t
, Y
t
)(dX
t
)
2
+f
xy
(X
t
, Y
t
)dX
t
dY
t
+
1
2
f
yy
(X
t
, Y
t
)(dY
t
)
2
.
and using the rules from the table.
(dX
t
)
2
=
_
a
1
X
t
dt +b
11
dW
t
+b
12
dV
t
_
2
= b
2
11
dt +b
2
12
dt.
72 4. THE WHITE NOISE IN CONTINUOUS TIME
Proceeding similarly for the rest of terms, one gets
dr
t
= f
x
(X
t
, Y
t
)dX
t
+f
y
(X
t
, Y
t
)dY
t
+
1
2
f
xx
(X
t
, Y
t
)(b
2
11
+b
2
12
)dt+
f
xy
(X
t
, Y
t
)(b
11
b
21
+b
12
b
22
)dt +
1
2
f
yy
(X
t
, Y
t
)(b
2
21
+b
2
22
)dt
Verify the answer by applying (4.17) directly.
4. The Girsanov theorem
The following theorem, proved by I.Girsanov, plays the crucial role in stochastic
analysis and in ltering particularly
Theorem 4.30. Let
t
be an F
t
-adapted process, dened on (, F, F
t
, P) and
satisfying
P
_
_
T
0

2
t
dt <
_
= 1
and let

t
= exp
__
t
0

s
dW
s

1
2
_
t
0

2
s
ds
_
.
Assume that E
T
= 1 and dene the probability measure

P by
d

P
dP
() =
T
().
Then
V
t
= W
t

_
t
0

s
ds, t [0, T]
is the Wiener process with respect to F
t
under probability

P.
Proof. (Sketch) Clearly V
t
has continuous pathes and starts at zero. Thus it
is left to verify

E
_
expi(V
t
V
s
)[F
s
_
= exp
_
0.5
2
(t s)
_
, t s. (4.18)
It turns out that the assumption E
T
= 1 implies P(inf
tT

t
= 0) = 0 and hence
also

P(inf
tT

t
= 0) = 0. Then P

P and
dP
d

P
() =
1
T
().
By Lemma 3.11

E
_
expi(V
t
V
s
)[F
s
_
=
E
_
expi(V
t
V
s
)
T
[F
s
_
E
_

T
[F
s
_ =
expiV
s

E
_
expiV
t

T
[F
s
_
E
_

T
[F
s
_
Moreover under the assumption E
T
= 1, the process
t
is a martingale, i.e. it is
F
t
-adapted and E(
t
[F
s
) =
s
. Indeed by the Ito formula
t
satises

t
=
s
+
_
t
s

t
dW
r
= E(
t
[F
s
) =
s
,
5. STOCHASTIC DIFFERENTIAL EQUATIONS 73
where the (nontrivial!) fact E
_
_
t
s

r

t
dW
r
[F
s
_
= 0 has been used. Then

E
_
expi(V
t
V
s
)[F
s
_
=
E
_
expiV
t

t
[F
s
_
expiV
s

s
. (4.19)
By the Ito formula the process
t
:= expiV
t

t
satises
d
t
= i
t
dV
t

1
2

t
dt + expiV
t
d
t
+iexpiV
t

t
dt =
i
t
dW
t
i
t

t
dt
1
2

t
dt +
t

t
dW
t
+i
t

t
dt
which implies

t
=
s

_
t
s
1
2

u
du +
_
t
s

u
(i +
u
)dW
u
and in turn
E(
t
[F
s
) =
s

1
2

2
_
t
s
E(
u
[F
s
)du,
where once again the martingale property of the stochastic integral has been used.
This linear equation is explicitly solved for
t

t
=
s
exp
_

1
2

2
(t s)
_
and the claim (4.18) holds by (4.19).
Remark 4.31. As we have seen in the Example 4.25, the verication of E
T
=
1 is not a trivial task. It holds if the process
t
satises Novikov condition (e.g.
Theorem 6.1 in [21])
Eexp
_
1
2
_
T
0

2
t
dt
_
< . (4.20)
Remark 4.32. The Girsanov theorem basically states that if W is shifted
by a suciently smooth function, then the obtained process induces a measure,
absolutely continuous with respect to the Wiener measure. Obviously this wouldnt
be possible if the shift is done by a function, say, with a jump - the obtained
process wont have continuous trajectories. Lets try to shift W by a continuous
function: an independent Wiener process W
t
. In this case V = W W
t
is again a
Wiener process with quadratic variation 2t. Since quadratic variation is measurable
with respect to natural ltration, the induced measure cannot be equivalent to the
standard Wiener measure, corresponding to quadratic variation t. This indicates
that certain degree of trajectories smoothness is required.
5. Stochastic Dierential Equations
Let (, F, F
t
, P) be a stochastic basis, carrying a Wiener process W. Let
a(t, x) and b(t, x) be a pair of functionals on the space of continuous functions
C
[0,T]
, which are non-anticipating in the sense
x
1
(s) x
2
(s), s t =
a(t, x
1
) a(t, x
2
)
b(t, x
1
) b(t, x
2
)
t [0, T].
Equivalently this property can be formulated as measurability of a(t, x) with respect
to the Borel -algebra B
t
, generated by the open sets of C
[0,t]
.
74 4. THE WHITE NOISE IN CONTINUOUS TIME
Definition 4.33. A continuous random process X is a unique strong solution
of the stochastic dierential equation (SDE)
dX
t
= a(t, X)dt +b(t, X)dW
t
(4.21)
subject to a random F
0
-measurable initial condition X
0
= , if
(1) X is F
t
-adapted
(2) X satises
10
P
_
_
T
0
[a(t, X)[dt <
_
= 1, P
_
_
T
0
b
2
(t, X)dt <
_
= 1
(3) for each t [0, T]
X
t
= +
_
t
0
a(s, X)ds +
_
T
0
b(s, X)dW
s
, P a.s.
(4) (uniqueness) any two processes, satisfying (1)-(3) are indistinguishable.
The simplest conditions to guarantee the existence and uniqueness of the strong
solutions are e.g.
Theorem 4.34. Assume that a(t, x) and b(t, x) satisfy the functional Lipschitz
condition
[a(t, x) a(t, y)[
2
+[b(t, x) b(t, y)[
2
L
1
_
t
0
[x
s
y
s
[
2
dK
s
+L
2
[x
t
y
t
[
2
(4.22)
and the linear growth condition
a
2
(t, x) +b
2
(t, x) L
1
_
t
0
(1 +x
2
s
)dK
s
+L
2
(1 +x
2
t
) (4.23)
where L
1
,L
2
are constants, K
s
is a nondecreasing right continuous function
11
, such
that 0 K
s
T. Then the equation (4.21) has a unique strong solution.
Proof. (only the main idea - see Theorem 4.6 in [21] for details) The proof is
in the spirit of classical dierential equations by the Picard iterations method. Let
X
(0)
t
X
0
and dene X
(n)
recursively
X
(n)
t
= X
0
+
_
t
0
a
_
s, X
(n1)
_
ds +
_
t
0
b
_
s, X
(n1)
_
dW
s
.
Now one shows, using the properties of Ito integral, that sup
tT
[X
(n)
t
X
(n1)
t
[
converges to zero as n P-a.s. and dene the process
X
t
:= X
(0)
t
+

n=0
_
X
(n+1)
t
X
(n)
t
_
.
Then it is veried that X
t
satises all the four properties in Denition 4.33.
10
Note that the strong solution actually employs the denition of the stochastic integral
under weaker condition than H
2
[0,T]
, usually considered in these notes
11
e.g. K
s
= s
5. STOCHASTIC DIFFERENTIAL EQUATIONS 75
Corollary 4.35. Let a(t, x) and b(t, x) be functions on R
+
R satisfying the
Lipschitz condition
[a(t, x) a(t, y)[
2
+[b(t, x) b(t, y)[
2
L[x y[
2
, x, y R
and the linear growth condition
a
2
(t, x) +b
2
(t, x) L(1 +x
2
).
Then the SDE
dX
t
= a(t, X
t
)dt +b(t, X
t
)dW
t
, X
0
=
has a unique strong solution.
Remark 4.36. Analogous denition and proofs apply in the multivariate case,
with appropriate adjustments in the conditions to be satised by the coecients a
and b.
Remark 4.37. Sometimes the existence and uniqueness can be veried under
signicantly weaker conditions: for example (rst shown in [43]) the scalar equation
with b(t, x) 1, has a unique strong solution if a(t, x) is a bounded function on
R
+
R (without Lipschitz condition). This is a remarkable fact, since it is well
known that classic ordinary dierential equation may not have a unique solution
if the drift a(t, x) is not Lipschitz (e.g.

X = 3/2
3

X, X
0
= 0 has two distinct
solutions X
t
0 and X
t
= t
3/2
). Loosely speaking the equation is regularized
if a small amount of white noise is plugged in! Even more remarkably, the strong
solution ceases to exist in general if a(t, x), being still bounded, is allowed to depend
on the past of x - a celebrated counterexample was given by B.Tsirelson in [38].
Example 4.38. As in the world of ODEs, the explicit solutions to SDEs are
rarely available. The Ito formula and a good guess are usually the main tools. For
example the strong solution of the equation
dX
t
= aX
t
dt +bX
t
dW
t
, X
0
= 1,
is
X
t
= exp
_
at b
2
/2t +bW
t
_
.
Indeed,
dX
t
= X
t
d(at b
2
/2t +bW
t
) +
1
2
b
2
X
t
dt = aX
t
dt +bX
t
dW
t
.
Sometimes it is easier to calculate various statistical parameters of the process,
directly via the corresponding SDE. Let e.g. m
t
= EX
t
and P
t
= EX
2
t
. Then
EX
t
= EX
0
+a
_
t
0
EX
s
ds, = m
t
= EX
0
e
at
.
Apply Ito formula to X
2
t
to get
X
2
t
= X
2
0
+
_
t
0
2X
s
dX
s
+
_
t
0
b
2
X
2
s
ds = X
2
0
+
_
t
0
(2a +b
2
)X
2
s
ds +
_
t
0
2X
s
bdW
s
and so
P
t
= EX
2
0
+
_
t
0
(2a +b
2
)EX
2
s
ds = P
t
= EX
2
0
exp(2a +b
2
)t.

Along with the strong solutions, weak solutions of (4.21) are dened.
76 4. THE WHITE NOISE IN CONTINUOUS TIME
Definition 4.39. The equation (4.21) has a weak solution if there exists a
probability basis (
t
, F
t
, F
t
t
, P
t
), carrying a Wiener process W and a continuous
F
t
t
-adapted process X, such that (4.21) is satised and P
t
(X
0
x) = P( x). If
all weak solutions induce the same probability distribution, the equation (4.21) is
said to have a unique weak solution.
Remark 4.40. Note that in the case of strong solutions the random process X
is dened on the original probability space and thus X is by denition adapted to
F
t
= F
W
t
, i.e. the driving Wiener process W generates X:
F
X
t
F
W
t
.
In particular any strong solution is trivially also a weak solution with the choice
(
t
, F
t
, F
t
t
, P
t
) = (, F, F
t
, P). In the case of weak solutions, one is allowed
to choose a probability space and to construct on it a process X to satisfy the
relation (4.21). Typically (as well see shortly) the opposite inclusion holds for
weak solutions
F
X
t
F
W
t

on the new probability space.
Theorem 4.41. Let b(t, x) 1 and a(t, x) satisfy

W
_
x C
[0,T]
:
_
T
0
a
2
(t, x)dt <
_
= 1,
and
_
C
[0,T]
exp
_
_
T
0
a(t, x)dW
t
(x)
1
2
_
T
0
a
2
(t, x)dt
_

W
(dx) = 1
where
W
is the Wiener measure on C
[0,T]
and W
t
(x) is the coordinate process on
the measure space (C
[0,T]
, B,
W
), i.e. W
t
(x) := x
t
, x C
[0,T]
, t [0, T]. Then
(4.21) subject to X
0
= 0 has a weak solution.
Proof. Dene

T
(x) = exp
_
_
T
0
a(t, x)dW
t
(x)
1
2
_
T
0
a
2
(t, x)dt
_
and introduce a new measure on (C
[0,T]
, B) by
d
d
W
(x) =
T
(x).
Then by Girsanov theorem the process
W
t
t
:= W
t

_
t
0
a
_
s, W
_
ds
is a Wiener process on (C
[0,T]
, B, ) and hence W is the weak solution of
dW
t
= a(t, W
t
)dt +dW
t
t
on this probability space.
5. STOCHASTIC DIFFERENTIAL EQUATIONS 77
Remark 4.42. As the notion of weak suggests, (4.21) may have a weak
solution, without having a strong one. The classical example is the Tanaka equation
(see e.g. Chapter 5.3 in [25])
dX
t
= sign(X
t
)dW
t
, X
0
= 0.
To show that X
t
is not measurable with respect to F
W
t
(and thus the equation
does not have a strong solution) use the Tanaka formula (see Example 4.26).
Since the stochastic integral
_
t
0
sign(X
s
)dW
s
is a martingale
12
and its quadratic
variation is
_
t
0
1ds = t, it is a Wiener process itself (by the Levy Theorem 4.5) and
so by Tanaka formula (applied to [X
t
[)
W
t
=
_
t
0
sign(X
t
)dX
t
= [X
t
[ L
t
,
where L
t
is the local time of (the Wiener process) X
t
. Since the local time is
measurable with respect to F
]X]
t
= X
s
, s t, W
t
is measurable with respect to
F
]X]
t
, which is strictly less than F
X
t
, hence
F
W
t
F
]X]
t
F
X
t
,
and X
t
cannot be a strong solution.
A weak solution is easily constructed by taking a Wiener process W
t
on some
probability space and letting dX
t
= sign(W
t
)dW
t
. Then sign(W
t
)dX
t
= dW
t
, which
is nothing but Tanaka equation with respect to the Wiener process W
t
on the new
probability space. Note that on the original probability space dX
t
= sign(W
t
)dW
t
does not satisfy dX
t
= sign(X
t
)dW
t
!
Another example of an SDE without strong solution (with nonzero drift with
memory!) is the already mentioned Tsirelson equation (see e.g. Example in Section
4.4.8 in [21]).
5.1. A connection to PDEs. The theory and applications of SDEs with
respect to Wiener process are vast (see e.g. [36], [33]), especially in the case of
diusions, i.e. when a(t, x) (called the drift coecient) and b(t, x) (called diusion
matrix) are pointwise functions of x. In particular there is a close relation between
various statistical properties of diusions and PDEs.
As an example
13
consider the scalar diusion
dX
t
= a(X
t
)dt +b(X
t
)dW
t
, t 0 (4.24)
subject to a random variable X
0
with distribution F(x), having density q(x) with
respect to the Lebesgue measure. Assume that the coecients are such that the
unique strong solution exists.
Dene the second order dierential (forward Kolmogorov-Focker-Planck) oper-
ator
(L

f)(x) =

x
_
a(x)f(x)
_
+
1
2

2
x
2
_
b
2
(x)f(x)
_
. (4.25)
and consider the Cauchy problem

t
p
t
(x) = (L

p
t
)(x) (4.26)
p
0
(x) = q(x). (4.27)
12
its integrand is bounded and thus satises the Novikov condition trivially
13
to be revisited in the context of ltering below
78 4. THE WHITE NOISE IN CONTINUOUS TIME
Suppose that the unique solution p
t
(x) exists, such that for each t 0 the function
p
t
(x) decays suciently fast as [x[ . The conditions for this are well known
from the theory of PDEs and can be found in textbooks.
Then p
t
(x) is the distribution density (with respect to the Lebesgue measure)
of X
t
for a xed t. Take a twice continuously dierentiable function f. Then by
the Ito formula, for any xed t 0
f(X
t
) = f(X
0
) +
_
t
0
f
t
(X
s
)a(X
s
)ds +
_
t
0
f
t
(X
s
)b(X
s
)dW
s
+
1
2
_
t
0
f
tt
(X
s
)b
2
(X
s
)ds
and so
Ef(X
t
) = Ef(X
0
) +
_
t
0
E
_
f
t
(X
s
)a(X
s
) +
1
2
f
tt
(X
s
)b
2
(X
s
)
_
ds.
Let F
X
t
(dx) be the probability distribution of X
t
, then the latter equation reads
_
R
f(x)F
X
t
(dx) =
_
R
f(x)q(x)dx+
_
t
0
_
R
_
f
t
(x)a(x) +
1
2
f
tt
(x)b
2
(x)
_
F
X
s
(dx)ds. (4.28)
Lets verify that F
X
t
(dx) = p
t
(x)dx is a solution:
_
R
_
f
t
(x)a(x) +
1
2
f
tt
(x)b
2
(x)
_
F
X
s
(dx) =
_
R
_
f
t
(x)a(x) +
1
2
f
tt
(x)b
2
(x)
_
p
s
(x)dx =

_
R
f(x)

x
_
a(x)p
s
(x)
_
dx +
1
2
_
R
f(x)

2
x
2
_
b
2
(x)p
s
(x)
_
dx =
_
R
f(x)(L

p
t
)(x)dx
where the tail decay properties of p
t
(x) are to be used to ensure proper integration
by parts. The right hand side of (4.28) becomes
_
R
f(x)q(x)dx+
_
R
f(x)
_
t
0
(L

p
t
)(x)dx =
_
R
f(x)q(x)dx+
_
R
f(x)
_
t
0

t
p
t
(x)dx
=
_
R
f(x)q(x)dx +
_
R
f(x)
_
p
t
(x) p
0
(x)
_
dx =
_
R
f(x)p
t
(x)dx
and (4.28) holds. Of course these naive arguments leave many unanswered ques-
tions: e.g. it is not clear whether (4.28) denes the distribution of X
t
uniquely, etc.
But nevertheless they give the correct intuition and the correct answer.
It can be shown that under certain conditions on the coecients (e.g. a(x)x
x
2
and b
2
(x) C > 0), the nonnegative solution p(x) of the ODE
(L

p)(x) = 0
exists and is unique and
lim
t
_
R
[p
t
(x) p(x)[dx = 0.
In other words, the unique stationary distribution of X
t
exists and has density p(x).
In the scalar case it may be even found explicitly
p(x) =
C
b
2
(x)
exp
__
x
0
2a(u)
b
2
(u)
du
_
, (4.29)
where C is the normalization constant.
6. MARTINGALE REPRESENTATION THEOREM 79
6. Martingale representation theorem
Martingales have been mentioned before on several occasions:
Definition 4.43. The process X
t
is an F
t
-martingale
14
if X
t
is F
t
-adapted
and E(X
t
[F
s
) = X
s
for any t s 0.
The Wiener process and the stochastic integral (under appropriate conditions
imposed on the integrand) are examples of martingales. It turns out that any
martingale with respect to the ltration F
W
t
generated by a Wiener process W
t
is necessarily a stochastic integral with respect to W
t
. We chose the simplied
approach of [25] to hint how this deep result emerges. The more complete treatment
of the subject can be found in Chapter 5 of [21].
Theorem 4.44. (The Ito representation theorem) Let be a square integrable
F
W
T
measurable random variable, i.e. L
2
(, F
W
T
, P). Then there is an H
2
[0,T]
process f(t, ), such that
= E +
_
T
0
f(s, )dW
s
, P a.s. (4.30)
Remark 4.45. When (, W) form a Gaussian process, deterministic f(t, )
f(t) in (4.30) always exists - see Example 4.47.
Proof. The idea is to show
15
that the linear closed subspace E of random
variables of the form
16

T
:= exp
_
_
T
0
h
s
dW
s

1
2
_
T
0
h
2
s
ds
_
, h : [0, T] R,
_
T
0
h
2
s
ds < (4.31)
is dense in L
2
(, F
W
T
, P) (all square integrable functionals of the Wiener process
on [0, T]). By the Ito formula

T
= 1 +
_
T
0
h
s

s
dW
s
,
and thus
T
admits the representation (4.30) (with f(t, ) = h
t

t
). Due to linearity
of the stochastic integral the linear combinations of random variables from E are
also of the form (4.31). If the subspace E is dense in L
2
(, F
W
T
, P), any F
W
T
-
measurable random variable can be approximated by a convergent sequence
n

E :

n
= E
n
+
_
T
0
f
n
(s, )dW
s
.
Then by the Ito isometry,
E
_

m
_
2
=
_
E
n
E
m
_
2
+
_
T
0
E
_
f
n
(s, ) f
m
(s, )
_
2
ds
14
Sometimes the pair (X
t
, F
t
) us referred as martingale
15
the proof is taken from 4.3 [25] (the same proof is used in Ch. V, 3[27]). Dierent proof
is given in 5.2 [21].
16
the functions h are deterministic
80 4. THE WHITE NOISE IN CONTINUOUS TIME
and since
n
converges in L
2
(, F
W
T
, P), f
n
(t, ) is a Cauchy sequence and hence
is also convergent, i.e. the limit f(t, ) exists in the sense
_
T
0
E
_
f
n
(s, ) f(s, )
_
2
ds
n
0.
Since f
n
are adapted, f is adapted as well and again by the Ito isometry

n
= E
n
+
_
T
0
f
n
(s, )dW
s
n

L
2
E +
_
T
0
f(s, )dW
s
.
and hence admits (4.30).
Suppose that f is non-unique, i.e. there are f
1
and f
2
, so that
= E +
_
T
0
f
1
(s, )dW
s
= E +
_
T
0
f
2
(s, )dW
s
.
This implies
_
T
0
E
_
f
1
(s, ) f
2
(s, )
_
2
ds = 0, i.e. f
1
= f
2
, ds P-a.s.
So the main issue is to verify that E is dense in L
2
(, F
W
T
, P), or equivalently
to check that if L
2
(, F
W
T
, P) satises
E = 0, E , (4.32)
then 0, P-a.s. If (4.32) holds, then in particular
Eexp
_
n

i=1

i
_
W
t
i+1
W
t
i
_

1
2
n

i=1

2
i
(t
i+1
t
i
)
_
= 0
for any nite number of 0 = t
1
< ... < t
n
= T and any constants
i
, i = 1, ..., n,
which is equivalent to
Eexp
_
n

i=1

i
W
t
i
_
= 0,
for any real numbers
i
. It is easy to verify that the function
G() = Eexp
_
n

i=1

i
W
t
i
_
, R
n
is real analytic (i.e. has derivatives of any order at any R
n
). Then the complex
function
G(z) = Eexp
_
n

i=1
z
i
W
t
i
_
, z C
n
is analytic as well (i.e. satises the Cauchy-Riemann condition or equivalently has
a complex derivative at any point of C
n
). The analytic function, which vanishes on
the real line (or on the real lines in this case), vanishes everywhere on the complex
plain and thus in particular vanishes on the complex axes
G(i) = Eexp
_
n

i=1
i
i
W
t
i
_
, R
n
.
Now for an arbitrary real analytic function : R
n
R with compact support
E(W
t1
, ..., W
t
n
) = E(2)
n/2
_
R
n
(u) exp
_
iu
1
W
t
1
+... +iu
n
W
t
n
_
=
(2)
n/2
_
R
n
(u)Eexp
_
iu
1
W
t
1
+... +iu
n
W
t
n
_
= 0.
6. MARTINGALE REPRESENTATION THEOREM 81
The claim holds, since smooth compactly supported functions approximate Borel
functions in L
2
.
Remark 4.46. The integrand in (4.30) is an adapted random process. It turns
out that functionals of the Wiener process can be expanded into multiple inte-
grals with respect to W with non-random kernels - this is so called Wiener chaos
expansion.
Example 4.47. The random variable =
_
T
0
W
s
ds is F
W
T
-measurable with
=
_
T
0
(T t)dW
t
.

Theorem 4.48. (The martingale representation theorem) Let X


t
be an square
integrable
17
F
W
t
-martingale. Then there is a unique H
2
[0,T]
process g(s, ), adapted
to F
W
t
, such that
X
t
= EX
0
+
_
t
0
g(s, )dW
s
, t [0, T], P a.s.
Proof. By Theorem 4.44, for each xed t [0, T], there is a unique F
W
t
-
measurable process f
(t)
(s, ), such that (E
t
= E
0
)

t
= E
0
+
_
t
0
f
(t)
(s, )dW
s
,
and we shall verify that f
(t)
(s, ) can be chosen independently of t. Let T t
2

t
2
0, then
E
_

t
2
[F
W
t
1
_
= E
0
+ E
_
_
t
2
0
f
(t
2
)
(s, )dW
s

F
W
t
1
_
= E
0
+
_
t
1
0
f
(t
2
)
(s, )dW
s
.
On the other hand
E
_

t
2
[F
W
t
1
_
=
t
1
= E
0
+
_
t
1
0
f
(t
1
)
(s, )dW
s
and hence by Ito isometry, f
(t
2
)
(s, ) and f
(t
1
)
(s, ) coincide on [0, t
1
], namely
_
t
1
0
E
_
f
(t
2
)
(s, ) f
(t
1
)
(s, )
_
2
ds = 0.
Then one can choose
f(s, ) = f
(T)
(s, ),
so that

t
= E
0
+
_
t
0
f
(T)
(s, )dW
s
= E
0
+
_
t
0
f
(t)
(s, )dW
s
.

17
sup
t[0,T]
EX
2
t
<
82 4. THE WHITE NOISE IN CONTINUOUS TIME
Example 4.49. Let = W
4
1
and consider the martingale X
t
= E
_
W
4
1
[F
W
t
_
,
t 1. By the Markov property of W, X
t
= E(W
4
1
[W
t
). Since (W
1
, W
t
) is a Gaussian
pair, the conditional distribution of W
1
given W
t
is Gaussian as well with the mean
W
t
and variance 1 t. Hence
E(W
4
1
[W
t
) =E
_
(W
1
W
t
+W
t
)
4
[W
t
_
=
E
_
(W
1
W
t
)
4
[W
t
_
+ 4E
_
(W
1
W
t
)
3
W
t
[W
t
_
+
6E
_
(W
1
W
t
)
2
W
2
t
[W
t
_
+ 4E
_
(W
1
W
t
)W
3
t
[W
t
_
+W
4
t
=
3(1 t)
2
+ 6(1 t)W
2
t
+W
4
t
.
Applying the Ito formula one gets
dX
t
= 6(1 t)dt 6W
2
t
dt + 12(1 t)dW
t
+ 6(1 t)dt
+ 4W
3
t
dW
t
+ 6W
2
t
dt = 12(1 t)dW
t
+ 4W
3
t
dW
t
.
and hence
= X
1
= X
0
+
_
1
0
_
12(1 t) + 4W
3
t
_
dW
t
= 3 +
_
1
0
_
12(1 t) + 4W
3
t
_
dW
t
.

Example 4.50. This representation is not always easy to nd explicitly. Here


is one amazing formula: the random variable S
1
= sup
s[0,1]
W
s
satises
S
1
= ES
1
+ 2
_
1
0
_
1
_
S
t
B
t

1 t
_
_
dW
t
where (x) =
_
x

2
e
r
2
/2
dr.
The following theorem will be extensively used in the derivation of nonlinear
ltering equations.
Theorem 4.51. Let Y = (Y
t
)
t[0,T]
be the strong solution
18
of the SDE
dY
t
= a
t
(Y )dt +dW
t
,
where a
t
() is a non-anticipating functional on C
[0,T]
, satisfying
_
T
0
Ea
2
t
(Y )dt < , and
_
T
0
Ea
2
t
(W)dt <
Then any square integrable F
Y
t
-martingale Z
t
has a continuous version satisfying
Z
t
= Z
0
+
_
t
0
g(s, )dW
s
with an H
2
[0,T]
process g(s, ), adapted to F
Y
t
.
Proof. Due to the assumptions on a
t
(), the process

T
() = exp
_

_
t
0
a
s
(Y )dW
s

1
2
_
t
0
a
2
s
(Y )ds
_
=
exp
_

_
t
0
a
s
(Y )dY
s
+
1
2
_
t
0
a
2
s
(Y )ds
_
18
in other words a is such that the strong solution exists
EXERCISES 83
is an F
Y
t
-martingale under P and thus the Radon-Nikodym density
d

P
dP
() =
T
(),
denes probability

P. Moreover by Girsanov theorem, Y
t
is a Wiener process under

P. The process z
t
:= Z
t
/
t
is an F
Y
t
-martingale under

P:

E[z
t
[ = E[z
t
[
T
= E[z
t
[E
_

T
[F
Y
t
) = E[z
t
[
t
= E[Z
t
[ <
and by Lemma 3.11

E(z
t
[F
Y
s
) =

E
_
Z
t

t
[F
Y
s
_
=
E
_
Z
t

T
[F
Y
s
_
E(
T
[F
Y
s
)
=
E(Z
t
[F
Y
s
)

s
= z
s
.
Then by Theorem 4.48, z
t
admits the representation (Y is a Wiener process
under

P)
z
t
= z
0
+
_
t
0
f(s, )dY
s
= z
0
+
_
t
0
f(s, )a
s
(Y )ds +
_
t
0
f(s, )dW
s
with an F
Y
t
-adapted process f. Applying the Ito formula to Z
t
= z
t

t
one gets
(recall that d
t
= a
t
(Y )
t
dW
t
)
dZ
t
= z
t
d
t
+
t
dz
t
a
t

t
f(t, )dt = z
t
a
t

t
dW
t
+
t
f(t, )a
t
dt+

t
f(t, )dW
t
a
t

t
f(t, )dt =
_

t
f(t, ) z
t
a
t
_
dW
t
,
and thus the required representation holds with g(s, ) :=
t
f(t, ) z
t
a
t
(Y ).
Exercises
(1) Prove that the limit of a sequence of uniformly convergent continuous
functions f
n
: [0, 1] R is continuous.
(2) Plot a typical path of W
n
t
, dened in (4.2) for n = 1, 2, 3
(3) Prove
P
_
D
+
W
t
= and D
+
W
t
=
_
= 1, t [0, T]
(4) Verify that for a standard Gaussian r.v. , P([[ ) for any > 0.
(5) Prove the law of large numbers
P
_
lim
t
W
t
/t = 0
_
= 1.
(6) Let W
t
, t [0, 1] be the Wiener process (with respect to its natural
ltration F
W
t
). Verify that each of the following processes is a Wiener
process with respect to appropriate ltration.
(a) Scaling invariance: for any constant c > 0
W
c
t
:=
1

c
W
ct
, t 1
(b) Time inversion:
Y
t
=
_
tW
1/t
, t (0, 1]
0, t = 0.
(c) Time reversal:
Z = W
1
W
1t
, t 1.
84 4. THE WHITE NOISE IN CONTINUOUS TIME
(d) Symmetry:
V
t
= W
t
, t 1.
(7) Let f : R [K, K] for some constant 0 < K < be a twice continu-
ously dierentiable function with bounded derivatives. For a xed number
q [0, 1], dene
I
q,n
t
=
[nt]

i=1
f(W
s
q
i
)
_
W
s
i
W
s
i1
_
where s
i
= i/n, i n and s
q
i
= qs
i1
+ (1 q)s
i
.
(a) Show that the L limit I
q
t
= lim
n
I
q,n
t
exists (in particular for
q = 1, the Ito integral I
t
:= I
1
t
is obtained). Calculate the expectation
of I
q
t
.
(b) Verify the Wong-Zakai correction formula
I
q
t
= I
t
+ (1 q)
_
t
0
f
t
(W
s
)ds.
(8) Prove directly from the denition of Ito integral with respect to the Brow-
nian motion B that
(a)
_
t
0
sdB
s
= tB
t

_
t
0
B
s
ds
(b)
_
t
0
B
2
s
dB
s
=
1
3
B
3
t

_
t
0
B
s
ds
(9) Use the Ito formula to verify the integration by parts rule. Let f
t
: R
+

R be a deterministic dierentiable function, then
_
t
0
f
s
dW
s
= W
t
f
t

_
t
0
W
s

f
t
dt.
Use the multivariate Ito formula to derive the analogue of integration by
parts rule, when f
t
is another Ito process with respect to the same Wiener
process: df
t
= a
t
dt +b
t
dW
t
.
(10) Let a
t
and b
t
be a pair of deterministic functions. Find the dierential of
the process
X
t
= exp
__
t
0
a
s
ds
__
x +
_
t
0
exp
_

_
s
0
a
u
du
_
b
s
dW
s
_
,
where x R. Show that the mean m
t
= EX
t
, variance V
t
= E(X
t

m
t
)
2
and covariance K(t, s) = E(X
t
m
s
)(X
s
m
s
) functions satisfy the
equations
m
t
= a
t
m
t
, m
0
= x

V
t
= 2a
t
V
t
+b
2
t
, V
0
= 0
K(t, s) = exp
__
t
s
a
u
ds
_
V
s
, t s
(11) Use the multivariate Ito formula to show that the process
R
t
=
_
(W
1
t
)
2
+... + (W
d
t
)
2
, t 0
where W
i
t
are independent Wiener processes, satises
dR
t
=
d

i=1
W
i
t
dW
i
t
R
t
+
d 1
2R
t
dt.
EXERCISES 85
This is so called d-dimensional Bessel process. For the case d = 2, show
that
R
3
E(R
4
[W
3
, V
3
)
_
2 +R
2
3
.
Hint: the upper bound can be obtained by Jensen inequality.
(12) Let
k
(t) = EW
k
t
, k = 0, 1, 2, .... Use the Ito formula to derive the recur-
sion

k
(t) =
1
2
k(k 1)
_
t
0

k2
(s)ds, k 2.
Deduce that EW
4
t
= 3t
2
and nd EW
6
t
.
(13) Explain the origins of mnemonic rules in Remark 4.28 by sketching the
proof of multivariate Ito formula
(14) Obtain the answer in Example 4.29 by applying the Ito formula directly
(avoiding the use of table).
(15) Verify the existence and uniqueness of the strong solution of the following
equations (check the conditions of Theorem 4.34). Check whether the
given processes solve the corresponding equations as claimed.
(a) X
t
= e
B
t
solves
dX
t
= 0.5X
t
dt +X
t
dB
t
, X
0
= 1
(b) X
t
= B
t
/(t + 1) solves
dX
t
=
1
1 +t
X
t
dt +
1
1 +t
dB
t
, X
0
= 0
(c) X
t
= sin(W
t
) solves
dX
t
=
1
2
X
t
dt +
_
1 X
2
t
dB
t
, B
0
(/2, /2)
(d) X
1
(t) = X
1
(0)+t+B
1
and X
2
(t) = X
2
(0)+X
1
(0)B
2
(t)+
_
t
0
sdB
2
(s)+
_
t
0
B
1
(s)dB
2
(s) solve
dX
1
= dt +dB
1
dX
2
= X
1
dB
2
(e) X
t
= e
t
X
0
+e
t
B
t
solves
dX
t
= X
t
dt +e
t
dB
t
.
(f) Y
t
= exp(aB
t
0.5a
2
t)
_
Y
0
+r
_
t
0
exp(aB
s
+ 0.5a
2
s)ds

solves
dY = rdt +aY dB
t
.
(g) The processes X
1
(t) = X
1
(0) cosh(t) + X
2
(0) sinh(t) +
_
t
0
a cosh(t
s)dB
1
+
_
t
0
b sinh(ts)dB
2
and X
2
(t) = X
1
(0) sinh(t)+X
2
(0) cosh(t)+
_
t
0
a sinh(t s)dB
1
+
_
t
0
b cosh(t s)dB
2
solve
dX
1
= X
2
dt +adB
1
dX
2
= X
1
dt +bdB
2
,
which can be seen as stochastically excited vibrating string equations.
(h) The process X
t
=
_
X
1
(t), X
2
(t)
_
=
_
cosh(B
t
), sinh(B
t
)
_
solve
dX
t
=
1
2
X
t
dt +X
t
dB
t
.
86 4. THE WHITE NOISE IN CONTINUOUS TIME
(16) Let X and Y be the strong solution of
dX
t
= 0.5X
t
dt Y
t
dB
t
dY
t
= 0.5Y
t
dt +X
t
dB
t
.
subject to X
0
= x and Y
0
= y with B
t
being a Wiener process (Brownian
motion).
(a) Show that X
2
t
+ Y
2
t
x
2
+ y
2
for all t 0, i.e. the vector (X
t
, Y
t
)
revolves on a circle.
(b) Find the SDE, satised by
t
= arctan(X
t
/Y
t
).
(17) Consider the multivariate linear SDE
dX
t
= AX
t
dt +BdW
t
, X
0
= ,
where A and B are n n and n m matrices, W is the vector of m
independent Wiener process (usually referred as vector Wiener process)
and is a random variable independent of W and E||
2
< .
(a) Find the explicit strong solution of the vector linear equation
(b) Find the explicit expressions for M
t
= EX
t
and Q
t
= cov(X
t
) =
E(X
t
m
t
)(X
t
m
t
)

(Hint: nd rst the ODEs for m


t
and Q
t
)
(c) Find the explicit expression for the correlation matrix K
t,s
= E(X
t

m
t
)(X
s
m
s
)

in terms of Q
t
(d) Give simple sucient conditions on A,B and so that the process
X
t
is stationary, i.e. m
t
m and Q
t
Q for certain (what?) m and
Q.
(e) The linear one dimensional diusion X
t
is called Ornstein-Uhlenbeck
process. Specify your answers in the previous questions in this case.
(18) Consider the equation of a harmonic oscillator, driven by the white noise
N
t

X
t
+ (1 +N
t
)X = 0, X
0
= 1,

X
0
= 1
where > 0 is a parameter.
(a) Write this equation as a two dimensional linear Ito SDE with respect
to the Wiener process
(b) Find the mean, variance and covariance functions of the oscillator
position
(c) Verify that the position satises the stochastic Volterra equation
X
t
= X
0
+

X
0
t +
_
t
0
(r t)X
r
dr +
_
t
0
(r t)X
r
dW
r
(19) Write down the KFP PDE, corresponding to the linear SDE
dX
t
= aX
t
dt +bdW
t
, X
0

where is a standard Gaussian random variable, b > 0 and a > 0 are
constants. Find the stationary density p(x) and calculate the stationary
mean and the variance. Compare to Exercise (17).
(20) Find explicit Ito representation for the following functionals of W on [0, T]:
W
T
, W
2
T
, W
3
T
, e
W
T
, sinW
T
. Hint: use the Ito formula.
CHAPTER 5
Linear ltering in continuous time
The continuous time linear ltering problem is addressed in this chapter, using
the white noise formalism, developed in the preceding one. In continuous time
setting the ltering formulae are derived by solving the Wiener-Hopf equation,
rather than using the general recursive formulae for orthogonal projection as in the
discrete time.
1. The Kalman-Bucy lter: scalar case
Consider the following system of linear SDEs:
dX
t
= a
t
X
t
dt +b
t
dW
t
(5.1)
dY
t
= A
t
X
t
dt +B
t
dV
t
(5.2)
where W and V are independent Wiener processes and the (scalar) coecients
are deterministic functions of t, such that the system has a unique strong solution.
These equations are solved subject to random variables X
0
and Y
0
with the bounded
covariance matrix, assumed independent of (W, V ). Hereafter B
2
t
C > 0 for some
constant C.
In what follows L
Y
t
denotes the closed linear subspace generated by the ran-
dom variables Y
s
, s t and

E([L
Y
t
) is the orthogonal projection
1
on L
Y
t
. As
discussed in Chapter 2,

X
t
:=

E(X
t
[L
Y
t
) is the best linear estimate of X
t
, given
the observations Y
s
, s t.
Theorem 5.1. (Kalman-Bucy lter) The optimal linear estimate

X
t
and the
corresponding mean square error P
t
= E(X
t


X
t
)
2
satisfy the equations

X
t
= a
t

X
t
dt +
P
t
A
t
B
2
t
_
dY
t
A
t

X
t
dt
_

P
t
= 2a
t
P
t
+b
2
t

A
2
t
P
2
t
B
2
t
(5.3)
subject to

X
0
= EX
0
+ cov(X
0
, Y
0
) cov

(Y
0
)
_
Y
0
EY
0
_
P
0
= cov(X
0
) cov
2
(X
0
, Y
0
) cov

(Y
0
).
(5.4)
Proof. The proof is done in several steps:
Step 1 (getting rid of

X
0
)
1
as usual a constant is added to any linear subspace
87
88 5. LINEAR FILTERING IN CONTINUOUS TIME
It would be easier to treat the case

X
0
0 and we claim that it is enough to prove
the theorem under this assumption: introduce
X
t
t
= X
t


X
0
exp
__
t
0
a
s
ds
_
, Y
t
t
= Y
t

_
t
0
A
s

X
0
exp
__
s
0
a
u
du
_
.
The process (X
t
t
, Y
t
t
) satises
dX
t
t
= a
t
X
t
t
dt +b
t
dW
t
dY
t
t
= A
t
X
t
t
dt +B
t
dV
t
,
subject to X
t
0
= X
0


X
0
and Y
t
0
= Y
0
. Clearly L
Y
t
= L
Y

t
and hence

X
t
=

E
_
X
t
[L
Y
t
_
=

E
_
X
t
[L
Y

t
_
=

E
_
X
t
t
[L
Y

t
_
+

X
0
exp
__
t
0
a
s
ds
_
.
Note that E
_
X
t
0
[Y
t
0
_
= 0 and suppose that

X
t
t
=

E(X
t
t
[L
Y

t
) and P
t
t
= E(X
t
t


X
t
t
)
2
satisfy (5.3), subject to

X
t
0
= 0 and P
t
0
= E(X
t
0


X
t
0
)
2
. Then
d

X
t
=d

X
t
t
+a
t

X
0
exp
__
t
0
a
s
ds
_
dt =
a
t

X
t
dt +
P
t
A
t
B
2
t
_
dY
t
A
t

X
0
exp
_
_
t
0
a
s
ds
_
A
t

X
t
t
dt
_
=
a
t

X
t
dt +
P
t
A
t
B
2
t
_
dY
t
A
t

X
t
dt
_
,
which means that

X
t
satises (5.3) equation as well, subject to

X =

E(X
0
[Y
0
),
given by the rst equation of (5.4). Moreover
P
t
= E
_
X
t


X
t
_
2
= E
_
X
t
t
+

X
0
exp
_
_
t
0
a
s
ds
_

X
t
t


X
0
exp
_
_
t
0
a
s
ds
__
2
= E(X
t
t


X
t
t
)
2
= P
t
t
,
i.e. P
t
satises the equation from (5.3).
Step 2 (the general form of the estimate)
From here on

E(X
0
[Y
0
) = 0 is assumed P-a.s. Let 0 = t
1
< ... < t
n
= T be a par-
tition of [0, T] and denote by L
Y
t
(n) the subspace, spanned by Y
t
1
, ..., Y
t
n
. This
subspace coincides with the one spanned by the increments Y
t
1
, Y
t
2
Y
t
1
, ..., Y
t
n

Y
t
n1
and so

E
_
X
t
[L
Y
t
(n)
_
=

E(X
t
[Y
0
) +
n1

j=1
g
j
_
Y
t
j+1
Y
t
j
_
=

E(X
t
[Y
0
) +
_
t
0
G
n
(t, s)dY
s
,
where g
j
are real numbers and G(t, s) =

jn
g
j
1
s[t
j
,t
j+1
)]
. Since L
Y
t
is a closed
subspace,
lim
n

E
_
X
t
[L
Y
t
(n)
_
=

E
_
X
t
[L
Y
t
_
,
and hence
E
__
t
0
G
n
(t, s)dY
t

_
t
0
G
m
(t, s)dY
t
_
2
n,m
0.
1. THE KALMAN-BUCY FILTER: SCALAR CASE 89
Since X and V are independent, the latter implies
__
t
0
_
G
n
(t, s) G
m
(t, s)
_
2
A
s
X
s
ds
_
2
+
_
t
0
_
G
n
(t, s) G
m
(t, s)
_
2
B
2
s
ds
n,m
0
Then due to the assumption B
2
s
C > 0, G
n
(t, s) is a Cauchy sequence and hence
converges to a limit G(t, s), so that

E(X
t
[L
Y
t
) =

E(X
t
[Y
0
) +
_
t
0
G(t, s)dY
s
.
Step 3 (using orthogonality)
Recall that

E(X
0
[Y
0
) = 0, P-a.s. is assumed, so that EX
t
= 0 and

E(X
t
[Y
0
) = 0.
The function G(t, s) satises the Wiener-Hopf equation
K(t, u)A
u
=
_
t
0
G(t, s)A
s
K(s, u)A
u
ds +G(t, u)B
2
u
, t u 0, (5.5)
where K(t, s) = cov(X
t
, X
s
). Indeed, by orthogonality property of the orthogonal
projection, for any xed t [0, T] and any measurable and bounded deterministic
function
E
_
X
t

E(X
t
[L
Y
t
)
_
_
t
0

s
dY
s
= E
_
X
t

_
t
0
G(t, s)dY
s
__
t
0

u
dY
u
= 0.
Then (5.5) holds, since
EX
t
_
t
0

u
dY
u
=
_
t
0

u
A
u
K(t, u)du
and
E
_
t
0
G(t, s)dY
s
_
t
0

u
dY
u
=
_
t
0
_
t
0
G(t, s)A
s
K(s, u)A
u

u
duds+
_
t
0

u
G(t, u)B
2
u
du
for arbitrary . Under the assumption B
2
t
C > 0, the Wiener-Hopf equation has
a unique solution: suppose it doesnt, i.e. both G
1
(t, s) and G
2
(t, s) satisfy (5.5)
and let (t, s) = G
1
(t, s) G
2
(t, s). Then (t, s) satises
_
t
0
(t, s)A
s
K(s, u)A
u
ds + (t, u)B
2
u
= 0, t u 0.
Multiply this equation by (t, u) and integrate with respect to u:
_
t
0
_
t
0
(t, u)A
u
K(s, u)(t, s)A
s
dsdu +
_
t
0

2
(t, u)B
2
u
= 0.
The rst term is nonnegative, since the covariance function K(s, u) is nonnegative
denite, and thus for t [0, T]
_
t
0

2
(t, u)B
2
u
= 0 =
2
(t, u) = 0, du a.s.
Step 4 (solving the Wiener-Hopf equation)
90 5. LINEAR FILTERING IN CONTINUOUS TIME
The uniqueness allows us to look for dierentiable G(t, s), since once found it should
be the solution. Dierentiating (5.5) with respect to t one obtains

t
K(t, u)A
u
= G(t, t)A
t
K(t, u)A
u
+
_
t
0

t
G(t, s)A
s
K(s, u)A
u
ds +

t
G(t, u)B
2
u
Recall that (Exercise 10 of the previous chapter)

t
K(t, u) = a
t
K(t, u), K(u, u) = EX
2
u
and hence the latter equation reads
K(t, u)A
u
_
a
t
G(t, t)A
t
_

_
t
0

t
G(t, s)A
s
K(s, u)A
u
ds

t
G(t, u)B
2
u
= 0.
Now using the expression for K(t, u)A
u
from (5.5), one gets
__
t
0
G(t, s)A
s
K(s, u)A
u
ds +G(t, u)B
2
u
_
_
a
t
G(t, t)A
t
_

_
t
0

t
G(t, s)A
s
K(s, u)A
u
ds

t
G(t, u)B
2
u
= 0.
or
_
t
0
_
G(t, s)
_
a
t
G(t, t)A
t
_


t
G(t, s)
_
A
s
K(s, u)A
u
ds+
_
G(t, u)
_
a
t
G(t, t)A
t
_


t
G(t, u)
_
B
2
u
= 0
Multiply the latter equality by
(t, u) := G(t, u)
_
a
t
G(t, t)A
t
_


t
G(t, u)
and integrate:
_
t
0
_
t
0
(t, s)A
s
K(s, u)(t, u)A
u
dsdu +
_
t
0
(t, u)
2
B
2
u
du = 0,
which gives the dierential equation for G(t, s):

t
G(t, s) = G(t, s)
_
a
t
G(t, t)A
t
_
. (5.6)
With u = t in (5.5), one gets
0 = K(t, t)A
t
A
t
_
t
0
G(t, s)A
s
K(s, t)ds G(t, t)B
2
t
,
1. THE KALMAN-BUCY FILTER: SCALAR CASE 91
which implies
0 = A
t
EX
t
_
X
t

_
t
0
G(t, s)A
s
X
s
ds
_
G(t, t)B
2
t
=
A
t
EX
t
_
X
t

_
t
0
G(t, s)dY
s
_
G(t, t)B
2
t

=
A
t
E
_
X
t

_
t
0
G(t, s)dY
s
_
2
G(t, t)B
2
t
= A
t
P
t
G(t, t)B
2
t
,
where the equality is due to the orthogonality property and P
t
= (X
t


X
t
)
2
.
Hence the ODE (5.6) reads

t
G(t, s) = G(t, s)
_
a
t

A
2
t
P
t
B
2
t
_
. (5.7)
Being a linear equation, the latter admits the representation G(t, s) = (s, t)G(s, s),
where (s, t) is the Cauchy
2
(or fundamental) solution corresponding to (5.7). Then

X
t
=
_
t
0
G(t, s)dY
s
=
_
t
0
(s, t)G(s, s)Y
s
= (0, t)
_
t
0

1
(0, s)G(s, s)dY
s
and applying the Ito formula one gets the rst equation in (5.3)
d

X
t
=
_
t
0

1
(0, s)G(s, s)dY
s

t
(0, t)dt + (0, t)
1
(0, t)G(t, t)dY
t
=
_
t
0

1
(0, s)G(s, s)dY
s
_
a
t

A
2
t
P
t
B
2
t
_
(0, t)dt +G(t, t)dY
t
=
a
t

X
t
dt +
A
t
P
t
B
2
t
_
dY
t
A
t

X
t
_
.
The process D
t
= X
t


X
t
satises
dD
t
= a
t
D
t
dt +b
t
dW
t

A
t
P
t
B
2
t
_
A
t
X
t
dt +B
t
dV
t
A
t

X
t
_
=
_
a
t

A
2
t
P
t
B
2
t
_
D
t
dt +b
t
dW
t

A
t
P
t
B
t
dV
t
.
Applying the Ito formula to D
2
t
one gets
dD
2
t
= 2D
t
dD
t
+b
2
t
dt +
_
A
t
P
t
B
t
_
2
dt = 2
_
a
t

A
2
t
P
t
B
2
t
_
D
2
t
dt+
b
2
t
dt +
_
A
t
P
t
B
t
_
2
dt + 2D
t
_
b
t
dW
t

A
t
P
t
B
t
dV
t
_
and taking the expectation
dP
t
= 2
_
a
t

A
2
t
P
t
B
2
t
_
P
t
dt +b
2
t
dt +
_
A
t
P
t
B
t
_
2
dt = 2a
t
dt +b
2
t
dt
A
2
t
P
2
t
B
2
t
dt,
subject to P
0
= E(X
0


X
0
)
2
(recall the construction of Step 1).
2
Since solution of linear equation depends linearly on the initial condition, it can be written
as a time dependent linear operator (just multiplication by (s, t) in this case), acting on the
initial condition. The Cauchy operator satises (0, s)(s, t) = (0, t) and is invertible.
92 5. LINEAR FILTERING IN CONTINUOUS TIME
The Kalman-Bucy lter is a linear SDE with time varying coecients, which
depend on P
t
, being the solution of the Riccati equation (5.3). The innovation
process

W
t
=
_
t
0
dY
s
A
s

X
s
ds
B
s
has uncorrelated increments and in the case of Gaussian (X
0
, Y
0
) is a Wiener process
(!), with respect to the ltration F
Y
t
(this is worked out in details in the next
chapter, dealing with nonlinear ltering).
Example 5.2. Consider the system (5.1)-(5.2) with constant coecients: a
t

a, etc. and subject to a random square integrable X
0
and Y
0
= 0. The Kalman-
Bucy lter in this case is

X
t
= a

X
t
dt +
P
t
A
B
2
_
dY
t
A

X
t
dt
_

P
t
= 2aP
t
+b
2

A
2
P
2
t
B
2
(5.8)
subject to

X
0
= EX
0
and P
0
= E(X
0
EX
0
)
2
.
Consider the quadratic equation
2aP +b
2
A
2
P
2
/B
2
= 0. (5.9)
If A ,= 0 and b ,= 0 are assumed, then it has two solutions
P

=
B
2
A
2
_
a
_
a
2
+
A
2
b
2
B
2
_
,
with P

< 0 and P
+
> 0. Consider the suboptimal lter

X
t
= a

X
t
dt +
AP
+
B
2
_
Y
t
A

X
t
dt
_
,

X
0
= 0.
The error process
t
= X
t


X
t
, satises
d
t
=
_
a
A
2
P
+
B
2
_

t
dt +bdW
t
+
AP
+
B
dV
t
,
0
= X
0
.
Since
a
A
2
P
+
B
2
= a
_
a +
_
a
2
+
A
2
b
2
B
2
_
=
_
a
2
+
A
2
b
2
B
2
< 0, (5.10)
the mean square error of this lter is bounded: sup
t0
E
2
t
< and thus by
optimality of

X
t
sup
t0
P
t
E
2
t
< .
The function R
t
:= P
t
P
+
, satises

R
t
= 2aR
t

A
2
B
2
_
P
2
t
P
2
+
_
= 2aR
t

A
2
B
2
R
t
_
P
t
+P
+
_
and hence
[R
t
[ = [R
0
[ exp
_
2at
A
2
B
2
_
t
0
_
P
s
+P
+
_
ds
_
[R
0
[ exp
_
2at
A
2
B
2
P
+
t
_
= [R
0
[ exp
_
at
_
a
2
+
A
2
b
2
B
2
t
_
t
0,
2. THE KALMAN-BUCY FILTER: THE GENERAL CASE 93
due to (5.10). In other words, if A ,= 0 and b ,= 0, the solution of the Riccati
equation stabilizes and the limit mean square error P

= lim
t
P

equals the
unique positive solution of the algebraic Riccati equation (5.9). If A = 0 and b ,= 0,
then P
t
= E(X
t
EX
t
)
2
and the limit P

exists and is nite if a < 0, otherwise


P
t
grows to innity. Finally if b = 0 and A ,= 0, then P

= 0, either if a < 0 (since


X
t
0 in L
2
) or if a > 0 (since then a/Ae
at
Y
t
X
0
in L
2
) or if a = 0 (since
A
1
Y
t
/t X
0
in L
2
).
Unlike in the discrete time case, the scalar Riccati equation in (5.8) has an
explicit solution:
P
t
=

K
2
exp
_
(
+

)A
2
t
B
2
_
1 K exp
_
(
+

)A
2
t
B
2
_ , (5.11)
where

= A
2
_
aB
2
B
_
a
2
B
2
+A
2
b
2
_
, K =
P
0

P
0

+
.

2. The Kalman-Bucy lter: the general case


In this section we give the general formulation of linear ltering problem and the
corresponding Kalman-Bucy equations. The proof uses the very same arguments as
in the scalar case and is left as an exercise. Let X = (X
t
)
t[0,T]
and Y = (Y
t
)
t[0,T]
be the process with values in R
m
and R
n
, generated by the system of linear SDEs
dX
t
=
_
a
0
(t) +a
1
(t)X
t
+a
2
(t)Y
t
_
dt +b
1
(t)dW
t
+b
2
(t)dV
t
(5.12)
dY
t
=
_
A
0
(t) +A
1
(t)X
t
+A
2
(t)Y
t
_
dt +B
1
(t)dW
t
+B
2
(t)dW
t
, (5.13)
with respect to independent vector Wiener processes W and V and subject to a
square integrable random vector (X
0
, Y
0
) independent of (W, V ). The coecients
are deterministic matrix functions of appropriate dimensions, such that the unique
strong solution of the system exists
3
and (B B)(t) := B
1
B

1
+B
2
B

2
is uniformly
nonsingular matrix.
Theorem 5.3. The the orthogonal projection

X
t
=

E(X
t
[L
Y
t
) and the corre-
sponding error covariance matrix P
t
= E
_
X
t


X
t
__
X
t


X
t
_

satisfy the Kalman-


Bucy equations
4
d

X
t
=
_
a
0
+a
1

X
t
dt +a
2

Y
t
_
dt +
_
b B +P
t
A

1
__
B B
_
1
(5.14)
_
dY
t
(A
0
A
1

X
t
A
2
Y
t
)dt
_

P
t
=a
1
P
t
+P
t
a

1
+b b
_
b B +P
t
A

1
__
B B
_
1
_
b B +P
t
A

1
_

(5.15)
subject to

X
0
= EX
0
cov(X
0
, Y
0
) cov

(Y
0
)(Y
0
EY
0
),
P
0
= cov(X
0
) cov(X
0
, Y
0
) cov

(Y
0
) cov(Y
0
, X
0
)
and where
b B = b
1
B

1
+b
2
B

2
, b b = b
1
b

1
+b
2
b

2
.
3
for example if the drift coecients are integrable and the diusion coecients are square
integrable functions of t with respect to the Lebesgue measure.
4
the time dependence of the coecients is omitted for brevity
94 5. LINEAR FILTERING IN CONTINUOUS TIME
3. Linear ltering beyond linear diusions
The Kalman-Bucy ltering formulae are applicable in somewhat more general
setting than (5.1)-(5.2) (or (5.12)-(5.13)).
Definition 5.4. w
t
is a Wiener process in wide sense, if w
0
= 0, Ew
t
= 0 and
Ew
t
w
s
= s t, t, s 0.
Example 5.5. The stochastic integral w
t
=
_
t
0
X
s
/
_
EX
2
s
dW
s
with a positive
process X
t
C > 0 is a Wiener process in the wide sense:
Ew
t
w
s
=
_
ts
E
_
X
u
_
EX
2
u
_
2
du = t s.

Since w
t
has uncorrelated increments, one may dene the stochastic integral
I
t
(f) =
_
t
0
f
s
dw
s
:= lim
n
n

i=1
f
t
i1
_
w
t
i
w
t
i1
_
,
where f is an L
2
[0,T]
deterministic function and 0 = t
0
< ... < t
n
= T, such that
max
i
[t
i
t
i1
[ 0 as n (by construction similar to the Ito integral).
Since the linear SDE
dX
t
= a
t
X
t
dt +b
t
dW
t
,
has an explicit solution
X
t
= exp
__
t
0
a
u
du
__
X
0
+
_
t
0
exp
_

_
s
0
a
u
du
_
b
s
dW
s
_
,
analogously one may dene the process
X
t
= exp
__
t
0
a
u
du
__
X
0
+
_
t
0
exp
_

_
s
0
a
u
du
_
b
s
dw
s
_
to be the solution of
dX
t
= a
t
X
t
dt +b
t
dw
t
.
With these denitions it is almost obvious that the Kalman-Bucy ltering equa-
tions generate the optimal linear estimates, if the Wiener processes are replaced by
the Wiener processes in the wide sense. Lets demonstrate the application of this
generalization in the following example:
Example 5.6. Consider the SDE system
dX
t
= X
t
dt +dW
t
dY
t
= X
3
t
dt +dV
t
(5.16)
subject to random X
0
with zero mean and EX
2
0
= 1/2, Y
0
= 0. By the Ito formula
dX
3
t
= 3X
2
t
dX
t
+ 3X
t
dt = 3X
3
t
dt + 3X
t
dt + 3X
2
t
dW
t
.
Dene Z
t
= X
3
t
and
w
t
=

2
_
t
0
X
2
s
dW
s

W
t

2
.
EXERCISES 95
Then w
t
is the Wiener process in the wide sense (t s):
Ew
t
w
s
= E
_

2
_
s
0
X
2
u
dW
u

W
s

2
_
2
=
2
_
s
0
EX
4
u
du +
s
2
2
_
s
0
EX
2
u
du = 2
3
4
s +
s
2
s = s,
where the Gaussian property of X
t
have been used (EX
2
t
= 1/2, EX
4
t
= 3(EX
2
t
)
2
=
3/4, etc.). Analogously
Ew
t
W
t
= E
_

2
_
t
0
X
2
u
dW
u

W
t

2
_
W
t
=

2tEX
2
t

t

2
= 0.
So (w
t
, W
t
, V
t
) is a three-dimensional Wiener process in wide sense. Consider now
the linear system
dX
t
= X
t
dt +dW
t
dZ
t
= 3Z
t
dt + 3X
t
dt +
3

2
dw
t
+
3
2
dW
t
dY
t
= Z
t
dt +dV
t
,
(5.17)
subject to (X
0
, Z
0
) = (X
0
, X
3
0
) (i.e. EZ
0
= 0, EZ
2
0
= EX
6
0
= 15/8, etc.). The
estimate E(X
t
[L
Y
t
) can be obtained by means of the Kalman-Bucy equations for
(5.17).

Exercises
(1) Verify that if X
0
and Y
0
are such that

E(X
0
[Y
0
) = 0, P-a.s. in the model
(5.1)-(5.2), then EX
t
= 0 and

E(X
t
[Y
0
) = 0, P-a.s.
(2) Show that the innovation process

W
t
= B
1
_
t
0
(dY
s
A

X
s
ds)
satises the following properties (t s 0)
(a)

E
_

W
t
[L
Y
s
_
=

W
s
(b) E
_

W
t


W
s
_
2
= t s
(c) Derive the Kalman-Bucy equations, assuming that

W is a Wiener
process (in the wide sense) and that

E(X
t
[L
Y
t
) =
_
t
0
(t, s)d

W
s
for
some (t, s).
(3) Let Y
t
=
_
t
0
W
s
ds+V
t
, where W and V are independent Wiener processes.
(a) Find the optimal linear lter for

W
t
=

E(W
t
[L
Y
t
)
(b) Find the explicit form for the optimal kernel G(t, s), such that

W
t
=
_
t
0
G(t, s)dY
s
.
Hint: use the explicit solution (5.11).
(c) Derive the equation for linear estimate

V
t
=

E(V
t
[L
Y
t
).
Hint: use the two dimensional formulae of Theorem (5.3)).
(4) Derive the equations (11), claimed in the Introduction (page 12).
(5) Prove that the equations (5.3) have the unique strong solution.
96 5. LINEAR FILTERING IN CONTINUOUS TIME
(6) Reformulate and solve the problem (8) (page 32) in continuous time
(7) Reformulate and solve the problem (9) (page 33) in continuous time
CHAPTER 6
Nonlinear ltering in continuous time
In this chapter the two main approaches to nonlinear ltering problem in con-
tinuous time are presented. The rst one relies on the representation of the con-
ditional expectation as a stochastic integral with respect to the innovation Wiener
process. The second one uses the abstract version of the Bayes formula, involv-
ing the Girsanov change of measure to dene a reference probability, under which
the dependence between the signal and the observations is cancelled and thus the
calculations are carried out in a particularly simple way. This approach gives an
additional insight into the structure of FKK equation: it turns out that its solution
is a normalized version of the measure valued stochastic process, generated by a
linear Zakai equation.
As in the discrete time case, both approaches lead to measure valued equations
which at best characterize the conditional law of the signal given the observation
-algebra. Remarkably for certain particular systems the ltering process turns to
be nite dimensional, i.e. can be parameterized by a nite number of computable
parameters. For example, Kalman-Bucy ltering equations turn to be the nite
dimensional parametrization in the linear Gaussian case.
1. The innovation approach

The typical filtering problem in continuous time is to find a recursive realization for the conditional expectation of the signal Markov process at the current time, given the past of its noisy trajectory. Let us consider the following general framework of this problem: let $(X,Y)=(X_t,Y_t)_{t\in[0,T]}$ be supported on a stochastic basis $(\Omega,\mathcal F,\mathcal F_t,P)$ and satisfy the following assumptions:

(a) X admits the decomposition
\[ X_t = X_0 + \int_0^t H_s\,ds + M_t, \qquad (6.1) \]
where $(M_t,\mathcal F_t)$ is a martingale (see footnote 1) and $H_t$ is an $H^2_{[0,T]}$-process.

Footnote 1: As mentioned before, the definition of the stochastic integral can be extended to martingales more general than the Wiener process. In this introductory course we don't really need this generality. In fact $M_t$ will be either a stochastic integral with respect to a Wiener process or a Poisson like jump process.

(b) Y is an Ito process, satisfying (see footnote 2)
\[ Y_t = \int_0^t A_s\,ds + BW_t, \qquad (6.2) \]
where A is an $H^2_{[0,T]}$ process, $B>0$ is a fixed constant and W is a Wiener process, independent of X.

Footnote 2: With an additional effort, the diffusion coefficient B can be allowed to depend on Y and on time t. The essential requirement is then $B^2_t(Y)\ge C>0$, which prevents the filtering problem from being singular. Also note that if B is allowed to depend on the signal X, the filtering problem becomes ill-posed. For example, if $B(x)=x$, $x\in\mathbb R$, then $X^2_t$ can be recovered from the quadratic variation of Y and thus $X^2_t$ is $\mathcal F^Y_t$-measurable, i.e. $X_t$ is known up to its sign. Such situations are customarily taboo in filtering.

The following generic notation will be used throughout: $\pi_t(\eta)=E\big(\eta_t\mid\mathcal F^Y_t\big)$ for a process $\eta=(\eta_t)_{t\in[0,T]}$, where $\mathcal F^Y_t$ is the natural filtration of Y.
1.1. The innovation Wiener process. The innovation process $\bar W$ was already encountered in the Kalman-Bucy filtering setting.

Theorem 6.1. The process Y, satisfying (b), admits the representation
\[ Y_t = Y_0 + \int_0^t \pi_s(A)\,ds + B\bar W_t, \qquad (6.3) \]
where
\[ \bar W_t = B^{-1}\Big(Y_t - \int_0^t \pi_s(A)\,ds\Big) \qquad (6.4) \]
is a Wiener process with respect to $\mathcal F^Y_t$.

Proof. Clearly $\bar W$ has continuous trajectories, starting at zero. For brevity let $B=1$; then
\[ \bar W_t = W_t + \int_0^t \big(A_s-\pi_s(A)\big)\,ds. \]
We show that
\[ E\Big(e^{i\lambda(\bar W_t-\bar W_s)}\mid\mathcal F^Y_s\Big) = e^{-\frac{1}{2}\lambda^2(t-s)}. \qquad (6.5) \]
Applying the Ito formula to $\eta_t=\exp\big(i\lambda\bar W_t\big)$ one gets
\[ d\eta_t = i\lambda\eta_t\,d\bar W_t - \tfrac{1}{2}\lambda^2\eta_t\,dt = i\lambda\eta_t\,dW_t + i\lambda\eta_t\big(A_t-\pi_t(A)\big)\,dt - \tfrac{1}{2}\lambda^2\eta_t\,dt \]
and hence
\[ e^{i\lambda\bar W_t} = e^{i\lambda\bar W_s} + i\lambda\int_s^t e^{i\lambda\bar W_u}\,dW_u + i\lambda\int_s^t e^{i\lambda\bar W_u}\big(A_u-\pi_u(A)\big)\,du - \tfrac{1}{2}\lambda^2\int_s^t e^{i\lambda\bar W_u}\,du. \]
Since W is a Wiener process with respect to the filtration $\mathcal F^W_t\vee\mathcal F^Y_t$,
\[ E\Big(\int_s^t e^{i\lambda\bar W_u}\,dW_u\,\Big|\,\mathcal F^Y_s\Big)=0. \]
Note that for $u\ge s$
\[ E\Big(e^{i\lambda\bar W_u}\pi_u(A)\mid\mathcal F^Y_s\Big) = E\Big(e^{i\lambda\bar W_u}E(A_u\mid\mathcal F^Y_u)\mid\mathcal F^Y_s\Big) = E\Big(E\big(A_u e^{i\lambda\bar W_u}\mid\mathcal F^Y_u\big)\mid\mathcal F^Y_s\Big) = E\big(A_u e^{i\lambda\bar W_u}\mid\mathcal F^Y_s\big), \]
and thus
\[ E\Big(\int_s^t e^{i\lambda\bar W_u}\big(A_u-\pi_u(A)\big)\,du\,\Big|\,\mathcal F^Y_s\Big)=0. \]
Then $\eta_t := E\big(e^{i\lambda\bar W_t}\mid\mathcal F^Y_s\big)$ satisfies
\[ \eta_t=\eta_s-\tfrac{1}{2}\lambda^2\int_s^t \eta_u\,du, \]
which verifies (6.5). □

Remark 6.2. Note that $\bar W$ need not be (and in general is not) a Wiener process with respect to other filtrations, e.g. $\mathcal F^W$.

Remark 6.3. Note that the equation (6.6) is driven not by the observation process Y itself, but rather by a Wiener process generated by Y. Loosely speaking, this Wiener process is a minimal representation of the information carried by Y, sufficient for estimation of X, which is the origin of the term innovation. Clearly $\mathcal F^{\bar W}_t\subseteq\mathcal F^Y_t$, since $\bar W_t$ is a measurable functional of Y on $[0,t]$; in other words, the information carried by $\bar W$ is less than the information carried by Y. Naturally the question arises: does $\bar W_t$ encode all the information, i.e. is $\mathcal F^Y_t\subseteq\mathcal F^{\bar W}_t$? The answer to this question is affirmative if the SDE (6.3) has a strong solution. However, in view of Tsirelson's counterexample, mentioned in Remark 4.37, the latter is not at all clear. Some positive results in this direction can be found in Section 12.2 in [21].

Remark 6.4. Recall the statement of the Girsanov theorem: given a Wiener process $(W_t,\mathcal F_t)$ on a fixed probability basis $(\Omega,\mathcal F,\mathcal F_t,P)$, there is a probability $\widetilde P$ on $(\Omega,\mathcal F)$, equivalent to P and such that the process obtained by shifting W by a random process with sufficiently smooth trajectories (absolutely continuous with respect to the Lebesgue measure) is again a Wiener process with respect to $\mathcal F_t$ under $\widetilde P$. On the other hand, the innovation (6.4)
\[ \bar W_t = W_t+\int_0^t\big(A_s-\pi_s(A)\big)\,ds \]
exhibits a different phenomenon: W shifted by a special process becomes a Wiener process under the original measure P, but with respect to another filtration $\mathcal F^Y_t$!
1.2. The Fujisaki-Kallianpur-Kunita equation. Using the innovation form of Y and the martingale representation theorem, an equation for the measure valued filtering process $\pi_t(\cdot)$ is derived below.

Theorem 6.5. Assume (a) and (b). Then $\pi_t(X)$ satisfies the Fujisaki-Kallianpur-Kunita (FKK) equation: for any $t\in[0,T]$, P-a.s.
\[ \pi_t(X)=\pi_0(X)+\int_0^t\pi_s(H)\,ds+\int_0^t\big(\pi_s(AX)-\pi_s(A)\pi_s(X)\big)B^{-1}\,d\bar W_s, \qquad (6.6) \]
where $(\bar W_t,\mathcal F^Y_t)$ is the innovation Wiener process defined in (6.4).

Remark 6.6. The FKK equation (6.6) is a measure valued equation: its (strong) solution, say $\pi_t(dx)$, can be defined as a stochastic process taking values in the space of probability measures on $\big(\mathbb R,\mathcal B(\mathbb R)\big)$, adapted to $\mathcal F^Y_t$ and satisfying (6.6) with probability one. For example, if the process $\pi_t(dx)$ has a density, (6.6) can be used to derive an equation for the conditional density process (the Kushner-Stratonovich equation (6.13)). The existence and uniqueness of the strong solution is not an easy issue.
Proof. The filtering process admits the following decomposition:
\[ \pi_t(X)=\pi_0(X)+\int_0^t\pi_s(H)\,ds+\bar M_t,\quad t\in[0,T], \qquad (6.7) \]
where
\[ \bar M_t := E(X_0\mid\mathcal F^Y_t)-\pi_0(X)+E\Big(\int_0^t H_s\,ds\,\Big|\,\mathcal F^Y_t\Big)-\int_0^t\pi_s(H)\,ds+E(M_t\mid\mathcal F^Y_t) \]
is a square integrable $\mathcal F^Y_t$-martingale. The square integrability of each component follows from the assumptions on X, and the martingale property is verified as follows: the first term is a martingale, since ($t\ge s\ge0$)
\[ E\big(E(X_0\mid\mathcal F^Y_t)-\pi_0(X)\mid\mathcal F^Y_s\big)=E(X_0\mid\mathcal F^Y_s)-\pi_0(X). \]
The second one satisfies
\[ E\Big(E\big(\textstyle\int_0^t H_u\,du\mid\mathcal F^Y_t\big)-\int_0^t\pi_u(H)\,du\,\Big|\,\mathcal F^Y_s\Big)=\int_0^t E\big(H_u\mid\mathcal F^Y_s\big)\,du-\int_0^t E\big(\pi_u(H)\mid\mathcal F^Y_s\big)\,du \]
\[ =E\Big(\int_0^s H_u\,du\,\Big|\,\mathcal F^Y_s\Big)-\int_0^s\pi_u(H)\,du+\int_s^t E\big(H_u\mid\mathcal F^Y_s\big)\,du-\int_s^t E\big(\pi_u(H)\mid\mathcal F^Y_s\big)\,du=E\Big(\int_0^s H_u\,du\,\Big|\,\mathcal F^Y_s\Big)-\int_0^s\pi_u(H)\,du, \]
and thus it is also a martingale. Finally the third term inherits the martingale property from $M_t$:
\[ E\big(E(M_t\mid\mathcal F^Y_t)\mid\mathcal F^Y_s\big)=E(M_t\mid\mathcal F^Y_s)=E\big(E(M_t\mid\mathcal F_s)\mid\mathcal F^Y_s\big)=E(M_s\mid\mathcal F^Y_s). \]

Since $Y_t$ is an Ito process, generated by (6.3), where $\bar W_t$ is a Wiener process, by Theorem 4.51 the square integrable $\mathcal F^Y_t$-martingale $\bar M_t$ has the representation
\[ \bar M_t=\int_0^t g_s(Y)\,d\bar W_s, \]
with $g_s$ an $\mathcal F^Y_t$-adapted process. To verify (6.6) one should show that
\[ g_s(Y)=\big(\pi_s(AX)-\pi_s(A)\pi_s(X)\big)/B,\quad ds\times P\text{-a.s.}, \qquad (6.8) \]
which is equivalent to
\[ \int_0^t E\,\varepsilon_s(Y)\Big(g_s(Y)-\big(\pi_s(AX)-\pi_s(A)\pi_s(X)\big)/B\Big)\,ds=0 \qquad (6.9) \]
for any bounded $\mathcal F^Y_t$-adapted $\varepsilon_s(Y)$ (see footnote 3).

Footnote 3: If $\alpha$ is $\mathcal F^Y_t$-adapted and satisfies $\int_0^t E\,\varepsilon_s\alpha_s\,ds=0$ for any bounded $\mathcal F^Y_t$-adapted $\varepsilon$, then with the particular choice $\varepsilon_t=\operatorname{sign}(\alpha_t)$ one gets $\int_0^t E|\alpha_s|\,ds=0$ and so $\alpha_s=0$ $ds\times P$-a.s. on $[0,t]$.

Let $z_t=\int_0^t\varepsilon_s(Y)\,d\bar W_s$; since $\bar M_t=\int_0^t g_s(Y)\,d\bar W_s$,
\[ \int_0^t E\,\varepsilon_s(Y)g_s(Y)\,ds=E\,z_t\bar M_t. \qquad (6.10) \]
On the other hand,
\[ E\,z_t\bar M_t=E\,z_t\Big(\pi_t(X)-\pi_0(X)-\int_0^t\pi_s(H)\,ds\Big)=E\Big(z_tX_t-\int_0^t z_sH_s\,ds\Big), \]
since $E\,z_t\pi_0(X)=E\,\pi_0(X)E(z_t\mid\mathcal F^Y_0)=0$, $E\,z_t\pi_t(X)=E\,z_tE(X_t\mid\mathcal F^Y_t)=E\,z_tX_t$ and
\[ E\,z_t\int_0^t\pi_s(H)\,ds=E\int_0^t E(z_t\mid\mathcal F^Y_s)\pi_s(H)\,ds=E\int_0^t z_s\pi_s(H)\,ds=E\int_0^t E(z_sH_s\mid\mathcal F^Y_s)\,ds=E\int_0^t z_sH_s\,ds. \]
Using the definition of $\bar W$,
\[ z_t=\int_0^t\varepsilon_s\,dW_s+\int_0^t\varepsilon_s\frac{A_s-\pi_s(A)}{B}\,ds. \]
Then
\[ E\,z_t\bar M_t=E\Big(X_t\int_0^t\varepsilon_s\,dW_s-\int_0^t\Big(\int_0^s\varepsilon_u\,dW_u\Big)H_s\,ds\Big)+E\Big(X_t\int_0^t\varepsilon_s\frac{A_s-\pi_s(A)}{B}\,ds-\int_0^t\Big(\int_0^s\varepsilon_u\frac{A_u-\pi_u(A)}{B}\,du\Big)H_s\,ds\Big). \qquad (6.11) \]
We claim that the first expectation vanishes: indeed
\[ E\,X_0\int_0^t\varepsilon_s(Y)\,dW_s=E\,X_0E\Big(\int_0^t\varepsilon_s(Y)\,dW_s\,\Big|\,\mathcal F_0\Big)=0 \]
and
\[ E\int_0^t\Big(\int_0^s\varepsilon_u\,dW_u\Big)H_s\,ds=E\int_0^t E\Big(\int_0^t\varepsilon_u\,dW_u\,\Big|\,\mathcal F_s\Big)H_s\,ds=E\int_0^t E\Big(H_s\int_0^t\varepsilon_u\,dW_u\,\Big|\,\mathcal F_s\Big)\,ds=E\int_0^t\varepsilon_u\,dW_u\int_0^t H_s\,ds, \]
and hence
\[ E\Big(X_t\int_0^t\varepsilon_s\,dW_s-\int_0^t\Big(\int_0^s\varepsilon_u\,dW_u\Big)H_s\,ds\Big)=E\int_0^t\varepsilon_s\,dW_s\Big(X_t-X_0-\int_0^t H_s\,ds\Big)=E\int_0^t\varepsilon_s\,dW_s\;M_t=0, \]
where the latter equality holds (footnote 4) since the martingale M is independent of W.

Footnote 4: Verify this claim when $M_t$ is another Wiener process, independent of W. By the way, M and W can be assumed to be correlated, and then the correlation enters the filtering formula (6.6) at this point.

Consider the first term in the second expectation on the right hand side of (6.11):
\[ E\,X_t\int_0^t\varepsilon_s\frac{A_s-\pi_s(A)}{B}\,ds=E\int_0^t\varepsilon_sX_s\frac{A_s-\pi_s(A)}{B}\,ds+E\int_0^t\varepsilon_s(X_t-X_s)\frac{A_s-\pi_s(A)}{B}\,ds \]
\[ =E\int_0^t\varepsilon_s\frac{\pi_s(XA)-\pi_s(X)\pi_s(A)}{B}\,ds+E\int_0^t\varepsilon_s(M_t-M_s)\frac{A_s-\pi_s(A)}{B}\,ds+E\int_0^t\varepsilon_s\Big(\int_s^tH_u\,du\Big)\frac{A_s-\pi_s(A)}{B}\,ds \]
\[ =E\int_0^t\varepsilon_s\frac{\pi_s(XA)-\pi_s(X)\pi_s(A)}{B}\,ds+E\int_0^tH_s\Big(\int_0^s\varepsilon_u\frac{A_u-\pi_u(A)}{B}\,du\Big)\,ds, \]
where the middle term vanishes by the martingale property of M. Assembling all the parts together we obtain
\[ E\,z_t\bar M_t=\int_0^t E\,\varepsilon_s\frac{\pi_s(XA)-\pi_s(X)\pi_s(A)}{B}\,ds, \]
which along with (6.10) implies (6.8). □
1.3. The Kushner-Stratonovich equation for the conditional density. The FKK equation (6.6) takes a somewhat more concrete form in the case when $(X_t,Y_t)$ are diffusion processes, namely the (strong) solution of the SDE (see footnote 5)
\[ dX_t=a(X_t)\,dt+b(X_t)\,dV_t,\quad X_0=\eta, \qquad\qquad dY_t=A(X_t)\,dt+B\,dW_t,\quad Y_0=0, \qquad (6.12) \]
where $\eta$ is a random variable with probability density $p_0(x)$, independent of the Wiener processes V and W.

Footnote 5: Hereon $Y_0=0$ is usually set for brevity.

Theorem 6.7. Assume that there is an $\mathcal F^Y_t$-adapted random field (see footnote 6) $q_t(x)$, satisfying the Kushner-Stratonovich stochastic partial integro-differential equation
\[ q_t(x)=p_0(x)+\int_0^t\big(L^*q_s\big)(x)\,ds+B^{-1}\int_0^t q_s(x)\big(A(x)-\pi_s(A)\big)\,d\bar W_s, \qquad (6.13) \]
where $L^*$ is defined in (4.25) and
\[ \pi_t(A)=\int_{\mathbb R}A(x)q_t(x)\,dx. \]
Then $q_t(x)$ is a version of the conditional density of $X_t$ given $\mathcal F^Y_t$, i.e. for any bounded function f
\[ E\big(f(X_t)\mid\mathcal F^Y_t\big)=\int_{\mathbb R}f(x)q_t(x)\,dx. \]

Footnote 6: By a random field we mean a random process, parameterized by the time variable t and the space variable x. All the usual properties (e.g. adaptedness) are assumed to hold uniformly in x. In our case sufficient smoothness (e.g. twice differentiability) in x is required.

Proof. We verify that $q_t(x)$ yields a solution of (6.6) and thus is a version of the required conditional expectation. For any twice continuously differentiable function f,
\[ f(X_t)=f(X_0)+\int_0^t(Lf)(X_s)\,ds+\int_0^t f'(X_s)b(X_s)\,dV_s,\quad t\in[0,T], \]
where L is the backward Kolmogorov operator
\[ \big(Lf\big)(x)=a(x)\frac{\partial}{\partial x}f(x)+\frac{b^2(x)}{2}\frac{\partial^2}{\partial x^2}f(x). \qquad (6.14) \]
Then the random measure $\pi_t(dx)=q_t(x)\,dx$ satisfies the FKK equation (6.6) for $f(X_t)$ with arbitrary f: integrating by parts,
\[ \pi_s\big((Lf)(X)\big)=\int_{\mathbb R}\Big(a(x)\frac{\partial}{\partial x}f(x)+\frac{b^2(x)}{2}\frac{\partial^2}{\partial x^2}f(x)\Big)q_s(x)\,dx=\int_{\mathbb R}\Big(-\frac{\partial}{\partial x}\big(a(x)q_s(x)\big)+\frac{1}{2}\frac{\partial^2}{\partial x^2}\big(b^2(x)q_s(x)\big)\Big)f(x)\,dx=\int_{\mathbb R}\big(L^*q_s\big)(x)f(x)\,dx \qquad (6.15) \]
and
\[ \pi_s(fA)-\pi_s(f)\pi_s(A)=\int_{\mathbb R}f(x)A(x)q_s(x)\,dx-\pi_s(A)\int_{\mathbb R}f(x)q_s(x)\,dx=\int_{\mathbb R}f(x)q_s(x)\big(A(x)-\pi_s(A)\big)\,dx. \]
Then the right hand side of (6.6) reads
\[ \pi_0(f)+\int_0^t\pi_s\big(Lf\big)\,ds+B^{-1}\int_0^t\big(\pi_s(fA)-\pi_s(f)\pi_s(A)\big)\,d\bar W_s=\int_{\mathbb R}f(x)\Big(p_0(x)+\int_0^t\big(L^*q_s\big)(x)\,ds+B^{-1}\int_0^t q_s(x)\big(A(x)-\pi_s(A)\big)\,d\bar W_s\Big)\,dx=\int_{\mathbb R}f(x)q_t(x)\,dx, \]
where (6.13) has been used. □

Remark 6.8. Due to the complicated structure of (6.13), the assumptions of Theorem 6.7 are not easy to verify.
2. The reference measure approach

The nonlinear filtering equation can also be derived by the Girsanov change of measure. For clarity of presentation, we choose a specific form of $A_s$ in (6.2):
\[ Y_t=\int_0^t g(s,X_s)\,ds+BW_t, \qquad (6.16) \]
where g is a measurable $\mathbb R_+\times\mathbb R\mapsto\mathbb R$ function.
2.1. The Kallianpur-Striebel formula.

Theorem 6.9 (Kallianpur-Striebel formula). Assume that $g(s,X_s)$ is an $H^2_{[0,T]}$ process and Y satisfies (6.16). Let $(\widetilde\Omega,\widetilde{\mathcal F},\widetilde P)$ be an auxiliary copy of $(\Omega,\mathcal F,P)$. Then for any bounded and measurable function $f:\mathbb R\mapsto\mathbb R$
\[ E\big(f(X_t)\mid\mathcal F^Y_t\big)(\omega)=\frac{\widetilde E\,f\big(X_t(\widetilde\omega)\big)\varrho_t\big(X(\widetilde\omega),Y(\omega)\big)}{\widetilde E\,\varrho_t\big(X(\widetilde\omega),Y(\omega)\big)},\quad P\text{-a.s.}, \qquad (6.17) \]
where
\[ \varrho_t(X,Y)=\exp\Big(\frac{1}{B^2}\int_0^t g(s,X_s)\,dY_s-\frac{1}{2B^2}\int_0^t g^2(s,X_s)\,ds\Big). \qquad (6.18) \]

Remark 6.10. The integral $J(\widetilde\omega,\omega):=\int_0^t g\big(s,X_s(\widetilde\omega)\big)\,dY_s(\omega)$ is a well defined random variable on the product space $\big(\widetilde\Omega\times\Omega,\widetilde{\mathcal F}\otimes\mathcal F,\widetilde P\times P\big)$. In fact the integration over $\widetilde\omega$ could have been done on the original probability space by means of an independent copy of X.

Remark 6.11. The function f need not be bounded, but should rather satisfy appropriate integrability conditions.

Remark 6.12. The expression in (6.18) is sometimes referred to as the likelihood ratio, being the Radon-Nikodym density of the law of Y under the hypothesis that Y has a drift with respect to its law under the hypothesis that it does not.

Proof. Consider B=1 for brevity ($B\ne1$ is treated completely analogously). Denote by $\mu_W$ the Wiener measure on $C_{[0,T]}$, i.e. the probability measure induced by W. Let
\[ z_t(X,W)=\exp\Big(-\int_0^t g(s,X_s)\,dW_s-\frac{1}{2}\int_0^t g^2(s,X_s)\,ds\Big),\quad t\in[0,T]. \]
Under the assumption on g, $z_t$ is a martingale and so
\[ \frac{d\widetilde P}{dP}(\omega)=z_T\big(X(\omega),W(\omega)\big) \qquad (6.19) \]
defines the probability measure $\widetilde P$.

Let $Y^x$ be given by (see footnote 7)
\[ Y^x_t=\int_0^t g(s,x_s)\,ds+W_t,\quad t\in[0,T],\ x\in D_{[0,T]}. \]

Footnote 7: X is assumed to have right continuous paths with finite left limits. Such functions are usually referred to as cadlag (the French abbreviation) or corlol (the English one). In other words, the trajectories are allowed to have a countable number of finite jumps. This space, denoted by $D_{[0,T]}$, is not complete under the usual supremum metric. The so called Skorohod metric turns it into a complete separable space.

Then by the Girsanov theorem (recall that $P\sim\widetilde P$ and $Y^x$ is a Wiener process under the corresponding change of measure)
\[ E\big(z_T(x,W)\psi(Y^x)\big)=\int_{C_{[0,T]}}\psi(y)\,\mu_W(dy),\quad \mu_X\text{-a.s.}, \]
where $\mu_X$ is the probability measure induced by X. Now by independence of X and W under P, for any bounded and measurable functionals $\psi$ and $\phi$
\[ \widetilde E\,\psi(Y)\phi(X)=E\,z_T(X,W)\psi(Y)\phi(X)=\int_{D_{[0,T]}}E\,z_T(x,W)\psi(Y^x)\phi(x)\,\mu_X(dx)=\int_{C_{[0,T]}}\psi(y)\,\mu_W(dy)\int_{D_{[0,T]}}\phi(x)\,\mu_X(dx). \]
This implies that under $\widetilde P$, Y is a Wiener process (take $\phi\equiv1$ and arbitrary $\psi$), X has the same distribution as under P (take $\psi\equiv1$ and arbitrary $\phi$) and Y and X are independent.

Since $z_t(X,W)$ is an $\mathcal F_t$-martingale and
\[ z_t(X,W)=\exp\Big(-\int_0^t g(s,X_s)\,dY_s+\frac{1}{2}\int_0^t g^2(s,X_s)\,ds\Big)=\varrho^{-1}_t(X,Y), \]
by Lemma 3.11
\[ E\big(f(X_t)\mid\mathcal F^Y_t\big)=\frac{\widetilde E\big(f(X_t)z^{-1}_T(X,W)\mid\mathcal F^Y_t\big)}{\widetilde E\big(z^{-1}_T(X,W)\mid\mathcal F^Y_t\big)}=\frac{\widetilde E\big(f(X_t)z^{-1}_t(X,W)\mid\mathcal F^Y_t\big)}{\widetilde E\big(z^{-1}_t(X,W)\mid\mathcal F^Y_t\big)}=\frac{\widetilde E\big(f(X_t)\varrho_t(X,Y)\mid\mathcal F^Y_t\big)}{\widetilde E\big(\varrho_t(X,Y)\mid\mathcal F^Y_t\big)}=\frac{\widetilde E\,f\big(X_t(\widetilde\omega)\big)\varrho_t\big(X(\widetilde\omega),Y(\omega)\big)}{\widetilde E\,\varrho_t\big(X(\widetilde\omega),Y(\omega)\big)}, \]
where the latter holds by independence of X and Y under $\widetilde P$. □

Remark 6.13. The drift term in (6.16) can be allowed to depend on Y: let
\[ Y_t=\int_0^t g(s,X_s,Y)\,ds+BW_t, \]
where g is a non-anticipating measurable $\mathbb R_+\times\mathbb R\times C_{[0,t]}\mapsto\mathbb R$ functional, such that the SDE has a unique strong solution. Let $\varrho_t(X,Y)$ be defined by (6.18) with $g(s,X_s)$ replaced by $g(s,X_s,Y)$. Then for any measurable and bounded $f:\mathbb R\mapsto\mathbb R$
\[ E\big(f(X_t)\mid\mathcal F^Y_t\big)=\frac{\widetilde E\big(f(X_t)\varrho_t(X,Y)\mid\mathcal F^Y_t\big)}{\widetilde E\big(\varrho_t(X,Y)\mid\mathcal F^Y_t\big)}, \qquad (6.20) \]
where $\widetilde E$ is the expectation with respect to the probability $\widetilde P$ (defined similarly to (6.19)), under which X and Y are independent, X is distributed as under P and Y is a Wiener process.

Remark 6.14. The Kallianpur-Striebel formula can be reformulated as
\[ E\big(f(X_t)\mid\mathcal F^Y_t\big)(\omega)=\frac{\int_{C_{[0,T]}}f(x_t)\varrho_t\big(x,Y(\omega)\big)\,\mu_X(dx)}{\int_{C_{[0,T]}}\varrho_t\big(x,Y(\omega)\big)\,\mu_X(dx)}, \qquad (6.21) \]
where $\mu_X$ is the probability measure (distribution) induced by X on $D_{[0,T]}$ under either P or $\widetilde P$.
Example 6.15. Consider the Bayesian estimation problem of a random variable $\theta$ (a constant unknown signal) from the observations
\[ Y_t=\int_0^t g(s,\theta)\,ds+W_t. \]
Then by the Kallianpur-Striebel formula
\[ E(\theta\mid\mathcal F^Y_t)=\frac{\widetilde E\,\theta(\widetilde\omega)\exp\Big(\int_0^t g\big(s,\theta(\widetilde\omega)\big)\,dY_s-\frac{1}{2}\int_0^t g^2\big(s,\theta(\widetilde\omega)\big)\,ds\Big)}{\widetilde E\exp\Big(\int_0^t g\big(s,\theta(\widetilde\omega)\big)\,dY_s-\frac{1}{2}\int_0^t g^2\big(s,\theta(\widetilde\omega)\big)\,ds\Big)}=\frac{\int_{\mathbb R}x\exp\Big(\int_0^t g(s,x)\,dY_s-\frac{1}{2}\int_0^t g^2(s,x)\,ds\Big)\,dF_\theta(x)}{\int_{\mathbb R}\exp\Big(\int_0^t g(s,x)\,dY_s-\frac{1}{2}\int_0^t g^2(s,x)\,ds\Big)\,dF_\theta(x)}, \]
where $F_\theta(x)$ is the distribution function of $\theta$. In particular, if $g(s,x)\equiv g(x)$,
\[ E(\theta\mid\mathcal F^Y_t)=\frac{\int_{\mathbb R}x\exp\big(g(x)Y_t-\frac{1}{2}g^2(x)t\big)\,dF_\theta(x)}{\int_{\mathbb R}\exp\big(g(x)Y_t-\frac{1}{2}g^2(x)t\big)\,dF_\theta(x)}. \qquad\square \]
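When $\theta$ takes finitely many values, the last display of Example 6.15 is a finite weighted sum depending on the observations only through $Y_t$ and $t$. The following short sketch (not from the notes; the prior values, probabilities and $g$ are arbitrary illustrative assumptions) evaluates it directly.

```python
import numpy as np

# Posterior mean of a constant signal theta from Y_t = int_0^t g(theta) ds + W_t
# via the last formula of Example 6.15 (case g(s,x) = g(x), discrete prior).
values = np.array([-1.0, 0.0, 2.0])       # assumed support of theta
prior  = np.array([0.25, 0.5, 0.25])      # assumed prior probabilities
g = lambda x: x                            # assumed observation function

def posterior(Y_t, t):
    # unnormalized weights  exp(g(x) Y_t - g(x)^2 t / 2) dF(x)
    w = prior * np.exp(g(values) * Y_t - 0.5 * g(values)**2 * t)
    w /= w.sum()
    return np.dot(w, values), w

mean, post = posterior(Y_t=1.7, t=1.0)     # e.g. after observing Y_1 = 1.7
print("E[theta | F^Y_t] =", mean, " posterior =", post)
```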
2.2. The Zakai equation. Note that the Kallianpur-Striebel formula does not impose much structure on X. If the signal satisfies (6.1), an SDE can be derived for the unnormalized conditional law of $X_t$ given $\mathcal F^Y_t$. Below we use the generic notation $\sigma_t(\eta)=\widetilde E\big(\eta_t\varrho_t\mid\mathcal F^Y_t\big)$, where $\eta$ is an $\mathcal F_t$-adapted random process.

Theorem 6.16. Assume that, in addition to the assumptions of Theorem 6.9, X obeys the representation (6.1). Then
\[ d\sigma_t(X)=\sigma_t(H)\,dt+B^{-2}\sigma_t(Xg)\,dY_t,\quad t\in[0,T], \qquad (6.22) \]
subject to $\sigma_0(X)=EX_0$, and
\[ \pi_t(f)=\frac{\sigma_t(f)}{\sigma_t(1)} \]
for any bounded and measurable f.

Remark 6.17. Similarly to (6.6), the Zakai equation (6.22) is a measure valued stochastic equation; see Remark 6.6.
Proof. The process $\varrho_t$ satisfies the SDE (again B=1 is set for brevity)
\[ d\varrho_t=\varrho_t g(t,X_t)\,dY_t,\quad \varrho_0=1. \qquad (6.23) \]
Then by the Ito formula (see footnote 8)
\[ X_t\varrho_t=X_0+\int_0^t\varrho_s\,dX_s+\int_0^t X_s\,d\varrho_s=X_0+\int_0^t\varrho_sH_s\,ds+\int_0^t\varrho_s\,dM_s+\int_0^t X_sg(s,X_s)\varrho_s\,dY_s. \]

Footnote 8: Here we use the extension of the Ito formula to general martingales (not necessarily Wiener processes or their stochastic integrals). When it is applied to $f(x,y)=xy$ and independent martingales, it reduces to the usual differentiation rule for a product. Verify this in the case of a pair of independent Wiener processes.

The equation (6.22) is obtained by taking the conditional expectation given $\mathcal F^Y_t$ under $\widetilde P$. First note that
\[ \widetilde E\Big(\int_0^t\varrho_sH_s\,ds\,\Big|\,\mathcal F^Y_t\Big)=\int_0^t\widetilde E\big(\varrho_sH_s\mid\mathcal F^Y_t\big)\,ds=\int_0^t\widetilde E\big(\varrho_sH_s\mid\mathcal F^Y_s\big)\,ds, \]
where the latter equality holds since $(\varrho_s,H_s)$ is $\mathcal F^X_s\vee\mathcal F^Y_s$-measurable and thus independent of $\mathcal F^Y_{[s,T]}=\sigma\{Y_u-Y_s,\ s\le u\le T\}$ under $\widetilde P$. For the same reason
\[ \widetilde E\Big(\int_0^t\varrho_s\,dM_s\,\Big|\,\mathcal F^Y_t\Big)=0 \qquad (6.24) \]
and
\[ \widetilde E\Big(\int_0^t X_sg(s,X_s)\varrho_s\,dY_s\,\Big|\,\mathcal F^Y_t\Big)=\int_0^t\widetilde E\big(X_sg(s,X_s)\varrho_s\mid\mathcal F^Y_s\big)\,dY_s. \qquad (6.25) \]
An elementary proof of these facts can be given by verifying them for simple processes and then extending to the general case by an approximation argument (see Corollaries 1 and 2 of Theorem 5.13 in [21] for a more solid reasoning). □
The FKK equation (6.6) can be recovered from (6.22).

Corollary 6.18. Under the setup of Theorem 6.16, the conditional expectation $\pi_t(X)=E\big(X_t\mid\mathcal F^Y_t\big)$ satisfies
\[ \pi_t(X)=\pi_0(X)+\int_0^t\pi_s(H)\,ds+\int_0^t\big(\pi_s(gX)-\pi_s(g)\pi_s(X)\big)B^{-1}\,d\bar W_s, \qquad (6.26) \]
where $\bar W_t=B^{-1}\big(Y_t-\int_0^t\pi_s(g)\,ds\big)$.
Proof. By the Kallianpur-Striebel formula $\pi_t(X)=\sigma_t(X)/\sigma_t(1)$. By (6.22) the process $\sigma_t(1)$ satisfies
\[ d\sigma_t(1)=B^{-2}\sigma_t(g)\,dY_t,\quad\sigma_0(1)=1, \]
and by the Ito formula
\[ d\pi_t=d\Big(\frac{\sigma_t(X)}{\sigma_t(1)}\Big)=\frac{d\sigma_t(X)}{\sigma_t(1)}-\frac{\sigma_t(X)}{\sigma^2_t(1)}\,d\sigma_t(1)+\frac{\sigma_t(X)\sigma^2_t(g)}{B^2\sigma^3_t(1)}\,dt-\frac{\sigma_t(g)\sigma_t(Xg)}{B^2\sigma^2_t(1)}\,dt \]
\[ =\pi_t(H)\,dt+\frac{\pi_t(Xg)}{B^2}\,dY_t-\frac{\pi_t(X)\pi_t(g)}{B^2}\,dY_t+\frac{\pi_t(X)\pi^2_t(g)}{B^2}\,dt-\frac{\pi_t(g)\pi_t(Xg)}{B^2}\,dt \]
\[ =\pi_t(H)\,dt+B^{-2}\big(\pi_t(Xg)-\pi_t(X)\pi_t(g)\big)\big(dY_t-\pi_t(g)\,dt\big), \]
which verifies (6.26). □
2.3. The stochastic PDE for the unnormalized conditional density. Similarly to the Kushner-Stratonovich PDE (6.13) for the conditional density in the case of diffusions, the corresponding PDE for the unnormalized conditional density can be derived using (6.22). Consider the diffusion signal, given by the SDE
\[ dX_t=a(t,X_t)\,dt+b(t,X_t)\,dV_t,\quad X_0=\eta, \qquad (6.27) \]
where V is a Wiener process, independent of W, the coefficients guarantee existence and uniqueness of the strong solution and $\eta$ is a random variable with density $p_0(x)$, such that $\int_{\mathbb R}x^2p_0(x)\,dx<\infty$.

Theorem 6.19. Assume that there is an $\mathcal F^Y_t$-adapted nonnegative random field $\rho_t(x)$, satisfying (see footnote 9) the Zakai PDE
\[ d\rho_t(x)=\big(L^*\rho_t\big)(x)\,dt+B^{-2}g(t,x)\rho_t(x)\,dY_t,\quad\rho_0(x)=p_0(x). \qquad (6.28) \]
Then $\rho_t(x)$ is a version of the unnormalized conditional density of $X_t$ given $\mathcal F^Y_t$, so that for any measurable f, such that $Ef^2(X_t)<\infty$,
\[ E\big(f(X_t)\mid\mathcal F^Y_t\big)=\frac{\int_{\mathbb R}f(x)\rho_t(x)\,dx}{\int_{\mathbb R}\rho_t(x)\,dx},\quad P\text{-a.s.} \qquad (6.29) \]

Footnote 9: The natural question arises at this point: what is the (strong) solution of a stochastic PDE? Clearly, besides the obvious property of adaptedness to $\mathcal F_t$, a solution should satisfy some integrability properties in the x variable, etc. This issue is beyond the scope of these notes.

Proof. Let f be a twice continuously differentiable function (again B=1 is treated). Then by the Ito formula
\[ f(X_t)=f(X_0)+\int_0^t(Lf)(X_s)\,ds+\int_0^t f'(X_s)b(s,X_s)\,dV_s, \]
where L is defined in (6.14). Applying (6.22) to $f(X_t)$ one obtains
\[ \sigma_t(f)=\sigma_0(f)+\int_0^t\sigma_s(Lf)\,ds+\int_0^t\sigma_s(fg)\,dY_s. \]
Let us verify that the (random) measure corresponding to the density $\rho_t(x)$ solves the latter equation:
\[ \int_0^t\sigma_s(Lf)\,ds+\int_0^t\sigma_s(fg)\,dY_s=\int_0^t\int_{\mathbb R}\Big(a(x)f'(x)+\frac{b^2(x)}{2}f''(x)\Big)\rho_s(x)\,dx\,ds+\int_0^t\int_{\mathbb R}f(x)g(s,x)\rho_s(x)\,dx\,dY_s \]
\[ =\int_{\mathbb R}f(x)\Big(\int_0^t\big(L^*\rho_s\big)(x)\,ds+\int_0^t g(s,x)\rho_s(x)\,dY_s\Big)\,dx=\int_{\mathbb R}f(x)\big(\rho_t(x)-\rho_0(x)\big)\,dx=\sigma_t(f)-\sigma_0(f). \qquad\square \]

Remark 6.20. The existence and uniqueness of the solution of (6.28) is an issue far beyond the scope of these lecture notes. The density $\rho_t(x)$, even at first glance, is not an easy mathematical object to treat: being twice differentiable in x, it is very nonsmooth in time t, as a diffusion should be. Still, (6.28) is much easier to deal with compared to (6.13).
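As a purely numerical illustration (not part of the notes), one can approximate (6.28) by an explicit Euler-Maruyama step in time and central differences in space, and then normalize as in (6.29). The model $a(x)=-x$, $b\equiv1$, $g(x)=x$, the grid and the step sizes below are assumptions chosen so that the explicit scheme stays stable; this is a sketch, not a recommended solver.

```python
import numpy as np

# Finite-difference sketch of the Zakai SPDE (6.28):
#   d rho_t(x) = (L* rho_t)(x) dt + B^{-2} g(x) rho_t(x) dY_t,
# with L* q = -(a q)_x + 0.5 (b^2 q)_xx.
rng = np.random.default_rng(1)
a = lambda x: -x
b = lambda x: np.ones_like(x)
g = lambda x: x
B = 1.0

xg = np.arange(-5, 5 + 1e-9, 0.1)                 # spatial grid
dx = xg[1] - xg[0]
dt, n = 0.002, 500                                # time step, number of steps
rho = np.exp(-xg**2 / 2) / np.sqrt(2 * np.pi)     # initial density p_0

X = 0.0                                           # synthetic signal (b = 1 here)
for k in range(n):
    dV = np.sqrt(dt) * rng.standard_normal()
    dW = np.sqrt(dt) * rng.standard_normal()
    dY = g(X) * dt + B * dW
    X += a(X) * dt + dV

    aq, bq = a(xg) * rho, b(xg)**2 * rho
    Lstar = np.zeros_like(rho)                    # zero boundary values assumed
    Lstar[1:-1] = -(aq[2:] - aq[:-2]) / (2 * dx) \
                  + 0.5 * (bq[2:] - 2 * bq[1:-1] + bq[:-2]) / dx**2
    rho = rho + Lstar * dt + g(xg) * rho * dY / B**2
    rho = np.clip(rho, 0, None)                   # pragmatic guard: keep rho >= 0

q = rho / (rho.sum() * dx)                        # normalization as in (6.29)
print("filtered mean ~", np.sum(xg * q) * dx, "  true X_t =", X)
```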
2.4. The robust filtering formulae. The stochastic PDE (6.28) involves a stochastic integral, which is defined only for observation paths in a set of full Wiener measure among the continuous functions. It turns out that it may be rewritten as a PDE without a stochastic integral, but rather with random coefficients depending on Y continuously, and thus well defined for every continuous observation path. Let, for simplicity, $g(s,x)\equiv g(x)$ and define
\[ \widetilde\rho_t(x)=R_t(x)\rho_t(x), \qquad (6.30) \]
where
\[ R_t(x)=\exp\Big(-\frac{1}{B^2}Y_tg(x)+\frac{1}{2B^2}g^2(x)t\Big). \]
Then by the Ito formula
\[ d\widetilde\rho_t(x)=-\frac{g(x)\widetilde\rho_t}{B^2}\,dY_t+\frac{g^2(x)\widetilde\rho_t}{2B^2}\,dt+\frac{g^2(x)\widetilde\rho_t}{2B^2}\,dt+R_t(x)\,d\rho_t(x)-\frac{g^2(x)\widetilde\rho_t}{B^2}\,dt=R_t(x)\big(L^*\rho_t\big)(x)\,dt, \]
which leads to
\[ d\widetilde\rho_t(x)=R_t(x)\big(L^*(R^{-1}_t\widetilde\rho_t)\big)(x)\,dt,\quad\widetilde\rho_0(x)=p_0(x),\qquad \rho_t(x)=R^{-1}_t(x)\widetilde\rho_t(x). \qquad (6.31) \]
The PDE (6.31) is sometimes referred to as the robust filtering equation, corresponding to the gauge transformation (6.30).
3. Finite dimensional filters

The nonlinear filtering equations (6.6) and (6.22), as well as the corresponding PDE versions (6.13) and (6.28), are in general infinite dimensional, meaning that their solutions may not belong to a family of stochastic fields parameterizable by a finite number of sufficient statistics. The importance of the latter is obvious in applications. This section covers some special settings in which a finite dimensional filter exists. There is no constructive way to derive, or even to verify, the existence of finite dimensional filters in general. However, there is a beautiful connection between this issue and the Lie algebras generated by the coefficients of the signal/observation equations; see the survey [31]. Some negative results about the existence of a finite dimensional realization of the filtering equation with cubic observation nonlinearity are available in [24], [11].
3.1. The Kalman-Bucy filter revisited. The Kalman-Bucy filtering formulae can be obtained from the general nonlinear filtering equations.

Theorem 6.21. The solution of (5.12) and (5.13), subject to a Gaussian vector $(X_0,Y_0)$, is a Gaussian process. In particular, the conditional distribution of $X_t$ given $\mathcal F^Y_t$ is Gaussian with mean $\widehat X_t$ and covariance $P_t$, generated by (5.14) and (5.15) respectively.

Proof. Let us verify the claim for a simple scalar example (of course the general vector case is obtained similarly, with more tedious calculations). Consider the two dimensional system of linear SDEs
\[ dX_t=aX_t\,dt+b\,dW_t,\qquad dY_t=AX_t\,dt+B\,dV_t, \qquad (6.32) \]
subject to $Y_0=0$ and a Gaussian random variable $X_0$, where W and V are independent Wiener processes, independent of $X_0$, and all the coefficients are scalars. The processes (X,Y) form a Gaussian system and hence the conditional law of $X_t$ given $\mathcal F^Y_t$ is Gaussian as well, so we are left with the problem of finding the equations for the conditional mean and variance.

Applying the equation (6.6) to $X_t$ one gets the familiar equation for $\widehat X_t:=\pi_t(X)$:
\[ \widehat X_t=EX_0+\int_0^t a\widehat X_s\,ds+\int_0^t\frac{A\big(\pi_s(X^2)-\pi^2_s(X)\big)}{B^2}\big(dY_s-A\widehat X_s\,ds\big)=EX_0+\int_0^t a\widehat X_s\,ds+\int_0^t\frac{AP_s}{B^2}\big(dY_s-A\widehat X_s\,ds\big), \qquad (6.33) \]
where
\[ P_t=\pi_t(X^2)-\pi^2_t(X)=E(X^2_t\mid\mathcal F^Y_t)-\big(E(X_t\mid\mathcal F^Y_t)\big)^2=E\Big(\big(X_t-E(X_t\mid\mathcal F^Y_t)\big)^2\mid\mathcal F^Y_t\Big). \]
By the Ito formula
\[ X^2_t=X^2_0+\int_0^t 2aX^2_s\,ds+\int_0^t b^2\,ds+\int_0^t 2X_sb\,dW_s, \]
and thus (6.6) gives
\[ \pi_t(X^2)=\pi_0(X^2_0)+\int_0^t\big(2a\pi_s(X^2)+b^2\big)\,ds+\int_0^t\frac{A\big(\pi_s(X^3)-\pi_s(X)\pi_s(X^2)\big)}{B^2}\big(dY_s-A\widehat X_s\,ds\big). \qquad (6.34) \]
Note that $\pi_t(X^2)=\widehat X^2_t+P_t$ and moreover, since the conditional law of $X_t$ is Gaussian, $E\big((X_t-\widehat X_t)^p\mid\mathcal F^Y_t\big)=0$ for any odd p and so
\[ \pi_t(X^3)=E\big(X^3_t\mid\mathcal F^Y_t\big)=E\big((X_t-\widehat X_t+\widehat X_t)^3\mid\mathcal F^Y_t\big)=3E\big((X_t-\widehat X_t)^2\mid\mathcal F^Y_t\big)\widehat X_t+\widehat X^3_t=3P_t\widehat X_t+\widehat X^3_t. \]
Then (6.34) gives
\[ \widehat X^2_t+P_t=\widehat X^2_0+P_0+\int_0^t\big(2a\widehat X^2_s+2aP_s+b^2\big)\,ds+\int_0^t\frac{2AP_s\widehat X_s}{B^2}\big(dY_s-A\widehat X_s\,ds\big). \]
Recall that $\bar W_t=\int_0^t\big(dY_s-A\widehat X_s\,ds\big)/B$ is a Wiener process, and thus by (6.33) and the Ito formula
\[ \widehat X^2_t=\widehat X^2_0+\int_0^t 2a\widehat X^2_s\,ds+\int_0^t\frac{A^2P^2_s}{B^2}\,ds+\int_0^t 2\widehat X_s\frac{AP_s}{B}\,d\bar W_s. \]
The latter two equations imply
\[ \dot P_t=2aP_t+b^2-\frac{A^2P^2_t}{B^2},\quad P_0=E(X_0-EX_0)^2, \]
which is the familiar Riccati equation for the filtering error. □

Remark 6.22. In particular, in the linear Gaussian case the conditional density equation (6.13) is solved by
\[ p_t(x)=\frac{1}{\sqrt{2\pi P_t}}\exp\Big(-\frac{(x-\widehat X_t)^2}{2P_t}\Big). \]
3.2. The conditionally Gaussian filter. In the previous section the key reason for the FKK equation to be finite (two) dimensional was the Gaussian property of the pair (X,Y). In fact, the very same arguments are applicable if only the conditional distribution of $X_t$ given $\mathcal F^Y_t$ is Gaussian. This leads to the following generalization of the Kalman-Bucy filter due to R. Liptser and A. Shiryaev (see Chapters 11, 12 in [21]).

Theorem 6.23 (Conditionally Gaussian filter). Consider the SDE system
\[ dX_t=\big(a_0(t,Y)+a_1(t,Y)X_t\big)\,dt+b(t,Y)\,dW_t \qquad (6.35) \]
\[ dY_t=\big(A_0(t,Y)+A_1(t,Y)X_t\big)\,dt+B\,dV_t \qquad (6.36) \]
subject to $Y_0=0$ and a Gaussian random variable $X_0$, where B is a positive constant and the rest of the coefficients are non-anticipating functionals of Y, satisfying the conditions under which the unique strong solution $(X,Y)=(X_t,Y_t)_{t\in[0,T]}$ exists and $EX^2_t<\infty$, $t\in[0,T]$. Then the conditional distribution of $X_t$ given $\mathcal F^Y_t$ is Gaussian with mean $\widehat X_t$ and variance $P_t$, given by
\[ d\widehat X_t=\big(a_0(t,Y)+a_1(t,Y)\widehat X_t\big)\,dt+\frac{A_1(t,Y)P_t}{B^2}\big(dY_t-A_0(t,Y)\,dt-A_1(t,Y)\widehat X_t\,dt\big), \]
\[ \dot P_t=2a_1(t,Y)P_t+b^2(t,Y)-\frac{A^2_1(t,Y)P^2_t}{B^2}, \qquad (6.37) \]
subject to $\widehat X_0=EX_0$ and $P_0=E(X_0-\widehat X_0)^2$.

Remark 6.24. Note that in general the processes (X,Y) do not form a Gaussian system anymore. The only essential constraint on the structure of (6.35) and (6.36) is the linear dependence on $X_t$. Despite the similarity, the difference between the Kalman-Bucy filter (5.3) and the equations (6.37) is significant: the latter are no longer linear and the conditional filtering error is no longer deterministic! This nonlinear generalization plays an important role in various problems of control and optimization (see e.g. the Applications volume of [21]). The multidimensional version of the filter is derived similarly.

Proof. Only the conditionally Gaussian property of (X,Y) is to be verified:
\[ E\big(e^{i\lambda X_t}\mid\mathcal F^Y_t\big)=\exp\Big(i\lambda m_t(Y)-\frac{1}{2}\lambda^2V_t(Y)\Big),\quad\lambda\in\mathbb R, \qquad (6.38) \]
where $m_t(Y)$ and $V_t(Y)$ are some non-anticipating functionals of Y. Once (6.38) is established, the very same arguments of the preceding section lead to the equations (6.37), i.e. $m_t(Y)\equiv\widehat X_t$ and $V_t(Y)\equiv P_t$.

The equation (6.35) has the closed form solution
\[ X_t=\Phi(t,Y)\Big(X_0+\int_0^t\Phi^{-1}(s,Y)\big(a_0(s,Y)\,ds+b(s,Y)\,dW_s\big)\Big):=\Psi_t(X_0,W,Y), \qquad (6.39) \]
where $\Phi(t,Y)=\exp\big(\int_0^t a_1(s,Y)\,ds\big)$.

The (6.20) version of the Kallianpur-Striebel formula implies
\[ E\big(e^{i\lambda X_t}\mid\mathcal F^Y_t\big)=\frac{\widetilde E\big(e^{i\lambda X_t}\varrho_t(X,Y)\mid\mathcal F^Y_t\big)}{\widetilde E\big(\varrho_t(X,Y)\mid\mathcal F^Y_t\big)}, \qquad (6.40) \]
where (with B=1 for brevity)
\[ \varrho_t(X,Y)=\exp\Big(\int_0^t\big(A_0(s,Y)+A_1(s,Y)X_s\big)\,dY_s-\frac{1}{2}\int_0^t\big(A_0(s,Y)+A_1(s,Y)X_s\big)^2\,ds\Big). \]
Insert the expression (6.39) into the right hand side of (6.40). Since Y and $(W,X_0)$ are independent under $\widetilde P$ (which follows from the independence of Y and X), the expectation $\widetilde E$ averages over $(X_0,W)$, keeping Y fixed. This results in a quadratic form of the type (6.38), due to the Gaussian property of the system $(X_0,W)$, which enters the exponent linearly. In fact its precise expression is identical to the one that would have been obtained in the usual Kalman-Bucy setting. □

Remark 6.25. Another (much harder!) way to verify the claim of Theorem 6.23 is to check that the Gaussian density with the mean and variance driven by (6.37) is the unique solution of the FKK equation (or of the Kushner-Stratonovich equation).
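Numerically, (6.37) is as easy to run as the Kalman-Bucy filter, the only difference being that the gain $P_t$ is now itself random, driven by the observation path. The sketch below (not from the notes) takes an assumed example with $a_0=A_0=0$, $A_1=1$ and $a_1(t,Y)=-(1+Y_t^2)$, purely to illustrate a coefficient depending on Y.

```python
import numpy as np

# Euler sketch of the conditionally Gaussian filter (6.37) for an assumed model
#   dX = a1(t,Y) X dt + b dW,   dY = X dt + B dV,   a1(t,Y) = -(1 + Y_t^2).
# P_t solves a pathwise (random) Riccati equation driven by Y.
rng = np.random.default_rng(3)
b, B = 1.0, 1.0
T, n = 5.0, 5000
dt = T / n

X = rng.standard_normal()
Y = 0.0
Xhat, P = 0.0, 1.0

for _ in range(n):
    a1 = -(1.0 + Y**2)                       # non-anticipating functional of Y
    dW = np.sqrt(dt) * rng.standard_normal()
    dV = np.sqrt(dt) * rng.standard_normal()
    dY = X * dt + B * dV

    X += a1 * X * dt + b * dW
    Xhat += a1 * Xhat * dt + (P / B**2) * (dY - Xhat * dt)
    P += (2 * a1 * P + b**2 - P**2 / B**2) * dt
    Y += dY

print("X_T =", X, "  Xhat_T =", Xhat, "  conditional variance P_T =", P)
```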
3.3. Linear systems with a non-Gaussian initial condition. If the initial condition $X_0$ is non-Gaussian, the conditional law of $X_t$ given $\mathcal F^Y_t$ is no longer Gaussian and thus the Kalman-Bucy equations do not necessarily generate the conditional mean and variance. It turns out that a finite dimensional filter exists and can even be derived in a number of ways, of which we choose the elegant approach due to A. Makowski [30].

Theorem 6.26. Consider the processes (X,Y) generated by the linear system (6.32) (with B=1), started from a random variable $X_0$ with distribution F(x), $\int_{\mathbb R}x^2\,dF(x)<\infty$. Then for any measurable f, such that $Ef^2(X_t)<\infty$, $t\in[0,T]$,
\[ E\big(f(X_t)\mid\mathcal F^Y_t\big)=\frac{\int_{\mathbb R^2}\int_{\mathbb R}f(x_1+e^{at}u)\,\alpha_t(u,x_2)\,dF(u)\,\gamma_t(x_1,x_2)\,dx_1dx_2}{\int_{\mathbb R}\int_{\mathbb R}\alpha_t(u,x_2)\,dF(u)\,\bar\gamma_t(x_2)\,dx_2}, \qquad (6.41) \]
where
\[ \alpha_t(u,x)=\exp\Big(ux-\frac{u^2}{2}\frac{A^2}{2a}\big(e^{2at}-1\big)\Big), \]
$\gamma_t(x_1,x_2)$ is the two dimensional Gaussian density with mean $(\widehat X_t,\widehat\mu_t)$ and covariance matrix with entries $P_t$, $Q_t$, $R_t$, satisfying the equations
\[ d\widehat X_t=a\widehat X_t\,dt+AP_t\big(dY_t-A\widehat X_t\,dt\big),\quad\widehat X_0=0, \qquad d\widehat\mu_t=A\big(e^{at}+Q_t\big)\big(dY_t-A\widehat X_t\,dt\big),\quad\widehat\mu_0=0, \qquad (6.42) \]
and
\[ \dot P_t=2aP_t+b^2-A^2P^2_t,\quad P_0=0,\qquad \dot Q_t=aQ_t-P_tA^2\big(Q_t+e^{at}\big),\quad Q_0=0,\qquad \dot R_t=A^2e^{2at}-A^2\big(Q_t+e^{at}\big)^2,\quad R_0=0, \qquad (6.43) \]
and $\bar\gamma_t(x_2)$ is its marginal with mean $\widehat\mu_t$ and variance $R_t$.

Proof. Let $X^\circ$ be the solution of $\dot X^\circ_t=aX^\circ_t$, subject to $X^\circ_0=X_0$, i.e.
\[ X^\circ_t=e^{at}X_0,\quad t\in[0,T], \]
and let $X'_t$ be the solution of
\[ dX'_t=aX'_t\,dt+b\,dW_t,\quad X'_0=0. \]
Then $X_t=X^\circ_t+X'_t$, $t\in[0,T]$, and
\[ Y_t=\int_0^t AX'_s\,ds+\int_0^t AX^\circ_s\,ds+V_t. \qquad (6.44) \]
Define
\[ \zeta_t=\exp\Big(-\int_0^t AX^\circ_s\,dV_s-\frac{1}{2}\int_0^t\big(AX^\circ_s\big)^2\,ds\Big). \]
Since $EX^2_0<\infty$ is assumed, $\zeta_t$ is a martingale and by the Girsanov theorem the Radon-Nikodym derivative
\[ \frac{d\widetilde P}{dP}(\omega)=\zeta_T(\omega) \]
defines the probability measure $\widetilde P$, under which
\[ V'_t:=\int_0^t AX^\circ_s\,ds+V_t \]
is a Wiener process, independent of $X^\circ$ (or equivalently of $X_0$) and of $X'$ (this is verified as in the proof of the Kallianpur-Striebel formula of Theorem 6.9), whose distributions are preserved. Moreover
\[ E\big(f(X_t)\mid\mathcal F^Y_t\big)=\frac{\widetilde E\big(f(X'_t+e^{at}X_0)\,\bar\alpha_t(X_0,\omega)\mid\mathcal F^Y_t\big)}{\widetilde E\big(\bar\alpha_t(X_0,\omega)\mid\mathcal F^Y_t\big)}, \qquad (6.45) \]
where
\[ \bar\alpha_t(X_0,\omega):=\zeta^{-1}_t=\exp\Big(\int_0^t AX^\circ_s\,dV'_s-\frac{1}{2}\int_0^t\big(AX^\circ_s\big)^2\,ds\Big)=\exp\Big(X_0\int_0^t Ae^{as}\,dV'_s-\frac{X^2_0}{2}\int_0^t\big(Ae^{as}\big)^2\,ds\Big)=\exp\Big(X_0\mu_t-\frac{X^2_0}{2}\int_0^t\big(Ae^{as}\big)^2\,ds\Big), \]
where $d\mu_t=Ae^{at}\,dV'_t$, $\mu_0=0$, was defined. Note that under $\widetilde P$, $(X',\mu,Y)$ form a Gaussian system (independent of $X_0$), and thus the conditional distribution of $(X'_t,\mu_t)$ given $\mathcal F^Y_t$ is Gaussian, whose parameters can be found by the Kalman-Bucy filter for the linear model
\[ dX'_t=aX'_t\,dt+b\,dW_t,\quad X'_0=0,\qquad d\mu_t=Ae^{at}\,dV'_t,\quad\mu_0=0,\qquad dY_t=AX'_t\,dt+dV'_t,\quad Y_0=0. \]
Applying the equations (5.14) and (5.15), one gets (6.42) and (6.43), and the formula (6.41) follows from (6.45). □
3.4. Markov chains with finite state space.

3.4.1. The Poisson process. Similarly to the role played by the Wiener process W in the theory of diffusions, the Poisson process is the main building block of purely discontinuous martingales, counting processes, etc.

Definition 6.27. A Markov process $\Pi$ with piecewise constant (right continuous) trajectories with unit positive jumps, $\Pi_0=0$, P-a.s., and stationary independent increments, such that (see footnote 10)
\[ P\big(\Pi_t-\Pi_s=k\mid\mathcal F^\Pi_s\big)=\frac{\big(\lambda(t-s)\big)^k e^{-\lambda(t-s)}}{k!},\quad k\in\mathbb Z_+, \qquad (6.46) \]
is called a Poisson process with intensity $\lambda\ge0$ (see footnote 11).

Footnote 10: Extra care should be taken when manipulating the filtrations of point processes. This delicate matter is left out (as many others); see the last chapter in [21] for a discussion.
Footnote 11: In (6.46) $0^0=1$ is understood, so $\lambda=0$ is allowed.
The existence of $\Pi$ is a relatively easy matter: let $(\xi_n)_{n\ge1}$ be an i.i.d. sequence of exponential random variables,
\[ P\big(\xi_1\ge t\big)=e^{-\lambda t},\quad t\ge0, \]
and let (see footnote 12)
\[ \Pi_t=\max_{n\ge0}\Big\{n:\ \sum_{i=1}^n\xi_i\le t\Big\},\quad t\ge0. \qquad (6.47) \]

Footnote 12: $\sum_{i=1}^0\ldots\equiv0$ is understood.

Theorem 6.28. $\Pi$ defined in (6.47) is a Poisson process.

Proof. Clearly $\Pi_0=0$ and the trajectories of (6.47) are piecewise constant, as required. Introduce $\tau_k=\sum_{i=1}^k\xi_i$. Then
\[ P\big(\Pi_t=k\mid\mathcal F^\Pi_s\big)=\sum_{\ell=0}^k P\big(\Pi_t=k\mid\xi_1,\ldots,\xi_\ell,\xi_{\ell+1}>s-\tau_\ell\big)1_{\{\Pi_s=\ell\}}, \]
and thus
\[ P\big(\Pi_t=k\mid\xi_1,\ldots,\xi_\ell,\xi_{\ell+1}>s-\tau_\ell\big)=\frac{\big(\lambda(t-s)\big)^{k-\ell}e^{-\lambda(t-s)}}{(k-\ell)!} \]
is to be verified:
\[ P\big(\Pi_t=k\mid\xi_1,\ldots,\xi_\ell,\xi_{\ell+1}>s-\tau_\ell\big)=P\big(\tau_k\le t<\tau_{k+1}\mid\xi_1,\ldots,\xi_\ell,\xi_{\ell+1}>s-\tau_\ell\big) \]
\[ =E\Big(P\big(\tau_k\le t<\tau_{k+1}\mid\xi_1,\ldots,\xi_{\ell+1}\big)\,\Big|\,\xi_1,\ldots,\xi_\ell,\xi_{\ell+1}>s-\tau_\ell\Big) \]
\[ =E\Big(P\big(\xi_{\ell+2}+\ldots+\xi_k\le t-\tau_\ell-\xi_{\ell+1}<\xi_{\ell+2}+\ldots+\xi_{k+1}\mid\tau_\ell,\xi_{\ell+1}\big)\,\Big|\,\tau_\ell,\xi_{\ell+1}>s-\tau_\ell\Big) \]
\[ =P\Big(\xi_{\ell+2}+\ldots+\xi_k\le t-\tau_\ell-\xi_{\ell+1}<\xi_{\ell+2}+\ldots+\xi_{k+1}\,\Big|\,\tau_\ell,\xi_{\ell+1}>s-\tau_\ell\Big) \]
\[ =e^{\lambda(s-\tau_\ell)}\int_{s-\tau_\ell}^\infty P\big(\xi_{\ell+2}+\ldots+\xi_k\le t-\tau_\ell-u<\xi_{\ell+2}+\ldots+\xi_{k+1}\big)\lambda e^{-\lambda u}\,du \]
\[ =\int_0^\infty P\big(\xi_{\ell+2}+\ldots+\xi_k\le t-s-u'<\xi_{\ell+2}+\ldots+\xi_{k+1}\big)\lambda e^{-\lambda u'}\,du' \]
\[ =P\big(\xi_{\ell+1}+\xi_{\ell+2}+\ldots+\xi_k\le t-s<\xi_{\ell+1}+\xi_{\ell+2}+\ldots+\xi_{k+1}\big)=P\big(\xi_1+\ldots+\xi_{k-\ell}\le t-s<\xi_1+\ldots+\xi_{k-\ell+1}\big)=P\big(\Pi_{t-s}=k-\ell\big). \]
Now (6.46) holds if
\[ P(\Pi_t=k)=\frac{(\lambda t)^ke^{-\lambda t}}{k!},\quad k\ge0. \qquad (6.48) \]
Note that
\[ P(\Pi_t=k)=P\big(\tau_k\le t<\tau_k+\xi_{k+1}\big)=E\,1_{\{\tau_k\le t\}}1_{\{\xi_{k+1}>t-\tau_k\}}=E\,1_{\{\tau_k\le t\}}e^{-\lambda(t-\tau_k)}=\int_0^te^{-\lambda(t-s)}\,dP(\tau_k\le s) \qquad (6.49) \]
and
\[ P(\tau_k\le s)=P\big(\tau_{k-1}+\xi_k\le s\big)=E\,P\big(\tau_{k-1}+\xi_k\le s\mid\tau_{k-1}\big)=E\,1_{\{s-\tau_{k-1}\ge0\}}\big(1-e^{-\lambda(s-\tau_{k-1})}\big)=\int_0^s\big(1-e^{-\lambda(s-u)}\big)\,dP(\tau_{k-1}\le u). \qquad (6.50) \]
Clearly
\[ P(\tau_1\le s)=P(\xi_1\le s)=1-e^{-\lambda s}, \]
and so by induction $P(\tau_k\le s)$ has a density, which by (6.50) satisfies
\[ \frac{dP(\tau_k\le s)}{ds}=\int_0^s\lambda e^{-\lambda(s-u)}\frac{dP(\tau_{k-1}\le u)}{du}\,du, \]
and thus (see footnote 13)
\[ \frac{dP(\tau_k\le s)}{ds}=\frac{\lambda(\lambda s)^{k-1}e^{-\lambda s}}{(k-1)!}. \]
Now the equation (6.48) follows from (6.49). □

Footnote 13: This is known as the Erlang distribution.
A simple consequence of the definition is that $\Pi_t-\lambda t$ is a martingale. Remarkably, the converse is true (compare with the Levy theorem (Theorem 4.5) for the Wiener process).

Theorem 6.29 (S. Watanabe). A process $N_t$ with piecewise constant (right continuous) trajectories with positive unit jumps is a Poisson process with intensity $\lambda$, if $N_t-\lambda t$ is a martingale.

Since the paths of $N_t$ are of bounded variation, the stochastic integral with respect to N is understood in the Stieltjes sense: for any bounded (see footnote 14) random process X
\[ \int_0^t X_{s-}\,dN_s=\sum_{s\le t}X_{s-}\Delta N_s=\sum_{s\le t}X_{s-}\big(N_s-N_{s-}\big), \qquad (6.51) \]
where $X_{s-}$ denotes the left limit of X at the point s. If X is an $\mathcal F^N_t$-adapted process, then $\int_0^t X_{s-}\big(dN_s-\lambda\,ds\big)$ is a martingale (see footnote 15).

Footnote 14: We won't need integrands more complicated than bounded ones.
Footnote 15: This is again an oversimplification, as are many things in these notes.
3.4.2. Markov chains in continuous time. Markov chains with a finite number of states are the simplest example of Markov processes in continuous time (see footnote 16). Among many possible constructions we choose the following: let $S=\{a_1,\ldots,a_d\}$ be a finite set of (distinct) real numbers and let $N_t$ be a $d\times d$ matrix, whose off-diagonal entries are independent Poisson processes with intensities $\lambda_{ij}\ge0$. The diagonal entries are chosen in a special way: $N_t(i,i)=-\sum_{j\ne i}N_t(i,j)$. Now define the vector process $I_t$ by
\[ I_t=I_0+\int_0^t dN^*_s\,I_{s-}, \qquad (6.52) \]
where $I_0$ is a random vector, equal to one of the vectors of the standard Euclidean basis (see footnote 17) $\{e_1,\ldots,e_d\}$ with probabilities $p_i\ge0$. It is easy (see footnote 18) to see that only one component of $I_t$ equals unity and all the others are zero at any time $t\ge0$, i.e. $I_t$ takes values in $\{e_1,\ldots,e_d\}$ as well. Finally define
\[ X_t=\sum_{i=1}^d a_iI_t(i),\quad t\ge0. \]

Footnote 16: For the general theory of Markov processes, the reader is referred to the classic text [6]; but don't expect easy reading!
Footnote 17: i.e. the i-th entry of $e_i$ is one and the rest are zeros.
Footnote 18: Note that the probability of the event that any two of a finite number of Poisson processes jump simultaneously is zero; this follows directly from the construction of the Poisson process, since the exponential distribution does not have atoms.

Theorem 6.30. The process X is a Markov chain with initial distribution $p_0$ (see footnote 19) and transition intensities matrix $\Lambda$ with off-diagonal entries $\lambda_{ij}$ and
\[ \lambda_{ii}:=-\sum_{j\ne i}\lambda_{ij},\quad i=1,\ldots,d, \]
meaning that
\[ p_{s,t}(j):=P\big(X_t=a_j\mid\mathcal F^X_s\big)=\sum_{i=1}^d p_{s,t}(i,j)1_{\{X_s=a_i\}},\quad t\ge s\ge0, \qquad (6.53) \]
where the matrix $p_{s,t}$ of the transition probabilities $p_{s,t}(i,j)$ solves the forward Kolmogorov equation (see footnote 20)
\[ \frac{\partial}{\partial t}p_{s,t}=p_{s,t}\Lambda,\quad p_{s,s}=E_{d\times d}. \]

Footnote 19: Distributions on S are identified with vectors of the simplex $S^{d-1}=\{x\in\mathbb R^d:\sum_{i=1}^dx_i=1,\ x_i\ge0\}$ in an obvious way.
Footnote 20: $E_{d\times d}$ is the d-dimensional identity matrix.

Proof. Since $I_t$ takes values in $\{e_1,\ldots,e_d\}$, by definition $\mathcal F^X_t=\mathcal F^I_t$ and thus $P(X_t=a_i\mid\mathcal F^X_s)=P(I_t=e_i\mid\mathcal F^I_s)=q_{s,t}(i)$, $i=1,\ldots,d$, where $q_{s,t}:=E\big(I_t\mid\mathcal F^I_s\big)$. The latter satisfies
\[ q_{s,t}=I_s+E\Big(\int_s^t dN^*_u\,I_{u-}\,\Big|\,\mathcal F^I_s\Big)=I_s+E\Big(\int_s^t\big(dN^*_u-\Lambda^*du\big)I_{u-}+\int_s^t\Lambda^*I_u\,du\,\Big|\,\mathcal F^I_s\Big)=I_s+\int_s^t\Lambda^*q_{s,u}\,du, \qquad (6.54) \]
where the martingale property of the stochastic integral has been used (see footnote 21). Reading (6.54) componentwise gives (6.53) and verifies the claim of the theorem. □

Footnote 21: Note that $\int_0^t\Lambda^*I_{s-}\,ds=\int_0^t\Lambda^*I_s\,ds$, since the integrator is continuous!

In particular, the equation (6.53) implies that the a priori distribution of $X_t$, i.e. the vector of probabilities $p_t(i)=P(X_t=a_i)$, satisfies the equation
\[ \dot p_t=\Lambda^*p_t,\quad\text{subject to }p_0, \qquad (6.55) \]
whose explicit solution is given by means of the matrix exponential $p_t=e^{\Lambda^*t}p_0$.
3.4.3. The Shiryaev-Wonham filter. Consider the problem of filtering a finite state Markov chain X (with known parameters) from the trajectory of the observation process Y, given by
\[ Y_t=\int_0^t g(X_s)\,ds+BW_t,\quad t\in[0,T], \]
where g is an $S\mapsto\mathbb R$ function, $B>0$ is a constant and W is a Wiener process, independent of X. The sufficient statistic in this problem is the vector $\pi_t$ (see footnote 22) of conditional probabilities $\pi_t(i)=P\big(X_t=a_i\mid\mathcal F^Y_t\big)$, $i=1,\ldots,d$, since
\[ E\big(f(X_t)\mid\mathcal F^Y_t\big)=E\Big(\sum_{i=1}^d f(a_i)1_{\{X_t=a_i\}}\,\Big|\,\mathcal F^Y_t\Big)=\sum_{i=1}^d f(a_i)\pi_t(i). \]

Footnote 22: A slight abuse of notation is allowed here; recall that $\pi_t(\cdot)$ stands for the conditional expectation operator in the FKK equation (6.6).

The following theorem gives the complete solution to the filtering problem.

Theorem 6.31 (Shiryaev [35], Wonham [40]). The vector $\pi_t$ satisfies the Ito SDE
\[ d\pi_t=\Lambda^*\pi_t\,dt+\big(\operatorname{diag}(\pi_t)-\pi_t\pi^*_t\big)g\,\big(dY_t-g^*\pi_t\,dt\big)/B^2,\quad\pi_0=p_0, \qquad (6.56) \]
where g stands for the d-dimensional vector with entries $g(a_i)$, $i=1,\ldots,d$. Moreover, $\pi_t=\rho_t/|\rho_t|$ (see footnote 23), where
\[ d\rho_t=\Lambda^*\rho_t\,dt+\operatorname{diag}(g)\rho_t\,dY_t/B^2,\quad\rho_0=p_0. \qquad (6.57) \]

Footnote 23: $|x|$ denotes the $\ell_1$ norm: $|x|=\sum_i|x_i|$.

Proof. The equation (6.56) follows from the FKK equation (6.6), applied to the process $I_t$ introduced in (6.52). In particular, the i-th component of $I_t$ satisfies
\[ I_t(i)=I_0(i)+\int_0^t\sum_{j=1}^d\lambda_{ji}I_s(j)\,ds+\int_0^t\sum_{j=1}^dI_{s-}(j)\big(dN_s(j,i)-\lambda_{ji}\,ds\big):=I_0(i)+\int_0^t\sum_{j=1}^d\lambda_{ji}I_s(j)\,ds+M_t(i), \]
where M(i) is a square integrable martingale. Then (6.6) implies
\[ \pi_t(i)=\pi_0(i)+\int_0^t\sum_{j=1}^d\lambda_{ji}\pi_s(j)\,ds+\int_0^t\Big(E\big(I_s(i)g^*I_s\mid\mathcal F^Y_s\big)-\pi_s(i)E\big(g^*I_s\mid\mathcal F^Y_s\big)\Big)\big(dY_s-E(g^*I_s\mid\mathcal F^Y_s)\,ds\big)/B^2 \]
\[ =\pi_0(i)+\int_0^t\sum_{j=1}^d\lambda_{ji}\pi_s(j)\,ds+\int_0^t\big(g_i\pi_s(i)-\pi_s(i)\pi^*_sg\big)\big(dY_s-g^*\pi_s\,ds\big)/B^2, \]
which is nothing but (6.56) in componentwise notation. Similarly, (6.57) follows from (6.22). □
Example 6.32. The two dimensional version of (6.56) was derived in [35] and shown to play an important role in problems of quickest change detection. Let X be a symmetric Markov chain with switching intensity $\lambda>0$ and with values in $\{0,1\}$ (often referred to as a telegraphic signal), and set $\pi_t=P\big(X_t=1\mid\mathcal F^Y_t\big)$. Suppose that the observations
\[ Y_t=\int_0^t X_s\,ds+W_t \]
are available. Then
\[ d\pi_t=\lambda(1-2\pi_t)\,dt+\pi_t(1-\pi_t)\big(dY_t-\pi_t\,dt\big),\quad\pi_0=P(X_0=1).\qquad\square \]
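The one dimensional equation of Example 6.32 is convenient for a quick numerical experiment. The following sketch (not from the notes; the switching intensity, horizon and step size are arbitrary choices) runs an Euler discretization of the filter along a simulated telegraphic signal and reports how often the maximum a posteriori guess matches the true state.

```python
import numpy as np

# Euler discretization of the Shiryaev-Wonham filter for the telegraphic signal:
#   d pi = lam (1 - 2 pi) dt + pi (1 - pi) (dY - pi dt),  Y_t = int X ds + W_t.
rng = np.random.default_rng(6)
lam = 0.5
T, n = 20.0, 20000
dt = T / n

X = int(rng.random() < 0.5)      # X_0 uniform on {0,1}
pi = 0.5                         # P(X_0 = 1)
hits = 0

for _ in range(n):
    dY = X * dt + np.sqrt(dt) * rng.standard_normal()
    pi += lam * (1 - 2 * pi) * dt + pi * (1 - pi) * (dY - pi * dt)
    pi = min(max(pi, 0.0), 1.0)              # guard against discretization overshoot
    if rng.random() < lam * dt:              # the chain switches with prob ~ lam*dt
        X = 1 - X
    hits += ((pi > 0.5) == (X == 1))

print("fraction of time the MAP estimate matches the signal:", hits / n)
```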
3.4.4. Filtering the number of transitions and the occupation times. Clearly the key to the existence of finite dimensional filters for finite state Markov chains is the fact that powers of the indicator process $I_t$ reduce to a linear function of $I_t$! This can be exploited further to get finite dimensional filters for various functionals of X: the occupation time of the state $a_i$,
\[ O_t(i)=\int_0^t1_{\{X_s=a_i\}}\,ds=\int_0^tI_s(i)\,ds, \qquad (6.58) \]
the number of transitions from $a_i$ to $a_j$,
\[ T_t(i,j)=\int_0^t1_{\{X_{s-}=a_i\}}\,d1_{\{X_s=a_j\}}=\int_0^tI_{s-}(i)\,dI_s(j), \qquad (6.59) \]
and stochastic integrals like
\[ J=\int_0^tI_{s-}\,dY_s. \qquad (6.60) \]
Being of interest in their own right, the filtering formulae for these quantities can be used to estimate the intensities matrix and other parameters of the problem by means of the so called EM (Expectation/Maximization) algorithm (see footnote 24). We derive the filter for $O_t$ (omitting the index i, since the derivation is the same for all i), leaving the rest as exercises. These problems seem to have been initially addressed in [42]; the derivation below is taken from [8].

Footnote 24: An iterative procedure for finding the maximum of certain likelihood functionals.

Theorem 6.33. The filtering estimate $\widehat O_t=E\big(O_t\mid\mathcal F^Y_t\big)=|\widehat Z_t|$, with $\widehat Z_t$ being the solution of
\[ d\widehat Z_t=\Lambda^*\widehat Z_t\,dt+e_ie^*_i\pi_t\,dt+\big(\operatorname{diag}(\widehat Z_t)-\widehat Z_t\pi^*_t\big)g\,\big(dY_t-g^*\pi_t\,dt\big)/B^2,\quad\widehat Z_0=0. \qquad (6.61) \]

Proof. The trick is to introduce the auxiliary process $Z_t=O_tI_t$ with values in $\mathbb R^d$. Once the conditional expectation $\widehat Z_t=E\big(Z_t\mid\mathcal F^Y_t\big)$ is found, the estimate of $O_t$ is recovered by
\[ \widehat O_t=E\Big(O_t\sum_{i=1}^dI_t(i)\,\Big|\,\mathcal F^Y_t\Big)=\sum_{i=1}^dE\big(O_tI_t(i)\mid\mathcal F^Y_t\big)=\sum_{i=1}^d\widehat Z_t(i)=\big|\widehat Z_t\big|. \]
By the Ito formula (see footnote 25)
\[ dZ_t=d(O_tI_t)=O_t\,dI_t+I_t\,dO_t=O_{t-}\,dN^*_tI_{t-}+I_tI_t(i)\,dt=dN^*_tZ_{t-}+e_ie^*_iI_t\,dt, \]

Footnote 25: In this case it is simply integration by parts: no continuous time martingales or mutual jumps are involved; note that $O_t$ has absolutely continuous trajectories.

and hence
\[ Z_t=\int_0^t\big(\Lambda^*Z_s+e_ie^*_iI_s\big)\,ds+\int_0^t\big(dN^*_s-\Lambda^*ds\big)Z_{s-}:=\int_0^t\big(\Lambda^*Z_s+e_ie^*_iI_s\big)\,ds+M'_t, \]
where $M'_t$ is a square integrable martingale (check it). Apply (6.6) to the component $Z_t(\ell)$:
\[ \widehat Z_t(\ell)=\int_0^t\Big(\sum_{j=1}^d\lambda_{j\ell}\widehat Z_s(j)+\delta_{i\ell}\pi_s(i)\Big)\,ds+\int_0^t\Big(E\big(Z_s(\ell)g^*I_s\mid\mathcal F^Y_s\big)-\widehat Z_s(\ell)g^*\pi_s\Big)B^{-2}\big(dY_s-g^*\pi_s\,ds\big). \]
Since $Z_s(\ell)g^*I_s=g^*O_sI_s(\ell)I_s=g_\ell O_sI_s(\ell)=g_\ell Z_s(\ell)$, the equation (6.61) is obtained. □
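Equation (6.61) is driven by the same innovation as the Wonham filter, so the two can be propagated together. The sketch below (not from the notes; the chain, the observation function and the noise level are illustrative assumptions) augments the Wonham recursion of (6.56) with the occupation time estimate and compares $|\widehat Z_t|$ with the true time spent in the tracked state.

```python
import numpy as np

# Joint Euler discretization of the Wonham filter (6.56) and the occupation-time
# filter (6.61); Ohat_t = sum_j Zhat_t(j) estimates the time spent in state a_i.
rng = np.random.default_rng(7)
Lam = np.array([[-1.0, 1.0],
                [ 2.0, -2.0]])
g = np.array([0.0, 1.0])          # g(a_j), assumed
B = 1.0
i_track = 1                       # occupation time of the second state
T, n = 10.0, 10000
dt = T / n

state = 0
pi = np.array([0.5, 0.5])
Zhat = np.zeros(2)
O_true = 0.0

for _ in range(n):
    dY = g[state] * dt + B * np.sqrt(dt) * rng.standard_normal()
    innov = dY - np.dot(g, pi) * dt

    # (6.61): dZ = Lam' Z dt + e_i e_i' pi dt + (diag(Z) - Z pi') g innov / B^2
    dZ = Lam.T @ Zhat * dt + pi[i_track] * np.eye(2)[i_track] * dt \
         + (g * Zhat - np.dot(g, pi) * Zhat) * innov / B**2
    # (6.56): dpi = Lam' pi dt + (diag(pi) - pi pi') g innov / B^2
    dpi = Lam.T @ pi * dt + (g * pi - np.dot(g, pi) * pi) * innov / B**2
    Zhat += dZ
    pi = np.clip(pi + dpi, 1e-12, None); pi /= pi.sum()

    O_true += (state == i_track) * dt
    if rng.random() < -Lam[state, state] * dt:    # jump out of the current state
        state = 1 - state

print("true occupation time:", O_true, "  filtered estimate:", Zhat.sum())
```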
3.5. The Benes filter. Unlike the preceding finite dimensional filters, the Benes filter ([2]) is mostly of academic interest: it is an example of a filtering problem for nonlinear diffusions admitting a finite dimensional realization. This filter does not seem to have an analogue in discrete time.

Theorem 6.34. Consider the two dimensional system of SDEs
\[ dX_t=h(X_t)\,dt+dW_t,\qquad dY_t=X_t\,dt+dV_t, \qquad (6.62) \]
subject to $Y_0=0$ and $X_0=0$, where W and V are independent Wiener processes. Assume that h(x) satisfies the ODE
\[ h'+h^2=ax^2+bx+c,\quad a\ge0,\ b,c\in\mathbb R, \]
and is such that (6.62) has a unique strong solution. Then the unnormalized conditional distribution of $X_t$ given $\mathcal F^Y_t$ has density
\[ \rho_t(x)=\exp\Big(H(x)+xY_t+\frac{1}{2}\sqrt{1+a}\,x^2-\frac{1}{2}\big(c+\sqrt{1+a}\big)t\Big)\int_{\mathbb R^2}e^{x_2+x_3}\,\gamma\big((x,x_2,x_3);m_t,V_t\big)\,dx_2dx_3, \qquad (6.63) \]
where $H(x)=\int_0^xh(u)\,du$ and $\gamma(\cdot;m_t,V_t)$ is the three dimensional Gaussian density with mean $m_t$ and covariance matrix $V_t$, corresponding to the Gaussian system
\[ d\beta_t=-\sqrt{1+a}\,\beta_t\,dt+dW_t,\quad\beta_0=0,\qquad d\zeta_t=-Y_t\,dW_t,\quad\zeta_0=0,\qquad d\mu_t=\big(\sqrt{1+a}\,Y_t-b/2\big)\beta_t\,dt,\quad\mu_0=0. \qquad (6.64) \]

Remark 6.35. For example, $h(x)=\tanh(x)$ satisfies the Benes nonlinearity with $a=b=0$ and $c=1$, and the Kalman-Bucy case $h(x)=x$ corresponds to $a=c=1$, $b=0$.

Proof. By the Kallianpur-Striebel formula, for any measurable and bounded function f,
\[ E\big(f(X_t)\mid\mathcal F^Y_t\big)(\omega)=\frac{\int_{C_{[0,T]}}f(x_t)\varrho_t\big(x,Y(\omega)\big)\,\mu_X(dx)}{\int_{C_{[0,T]}}\varrho_t\big(x,Y(\omega)\big)\,\mu_X(dx)}, \]
with
\[ \varrho_t(x,Y)=\exp\Big(\int_0^t x_s\,dY_s-\frac{1}{2}\int_0^t x^2_s\,ds\Big),\quad\mu_X\text{-a.s.}, \]
and where $\mu_X$ denotes the probability measure induced by X.

The integration with respect to $\mu_X$ can be replaced with integration with respect to the Wiener measure $\mu_W$: indeed, by the Girsanov theorem $\mu_X\sim\mu_W$ (checking that $h(X_t)$ satisfies e.g. the Novikov condition (4.20)) and
\[ \frac{d\mu_X}{d\mu_W}(x)=\exp\Big(\int_0^t h(x_s)\,dx_s-\frac{1}{2}\int_0^t h^2(x_s)\,ds\Big),\quad\mu_X\text{-a.s.} \]
Hence
\[ \int_{C_{[0,T]}}f(x_t)\varrho_t\big(x,Y(\omega)\big)\,\mu_X(dx)=\int_{C_{[0,T]}}f(x_t)\varrho_t\big(x,Y(\omega)\big)\frac{d\mu_X}{d\mu_W}(x)\,\mu_W(dx)=\int_{C_{[0,T]}}f(x_t)\exp\Big(\int_0^tx_s\,dY_s-\frac{1}{2}\int_0^tx^2_s\,ds+\int_0^th(x_s)\,dx_s-\frac{1}{2}\int_0^th^2(x_s)\,ds\Big)\,\mu_W(dx). \]
Let $H(x):=\int_0^xh(u)\,du$; then by the Ito formula (under $\mu_W$)
\[ H(x_t)=\int_0^th(x_s)\,dx_s+\frac{1}{2}\int_0^th'(x_s)\,ds, \]
and since $h'+h^2=ax^2+bx+c$, we have
\[ \int_{C_{[0,T]}}f(x_t)\varrho_t\big(x,Y(\omega)\big)\,\mu_X(dx)=\int_{C_{[0,T]}}f(x_t)\exp\Big(\int_0^tx_s\,dY_s-\frac{1}{2}\int_0^tx^2_s\,ds+H(x_t)-\frac{1}{2}\int_0^th'(x_s)\,ds-\frac{1}{2}\int_0^th^2(x_s)\,ds\Big)\,\mu_W(dx) \]
\[ =\int_{C_{[0,T]}}f(x_t)e^{H(x_t)}\exp\Big(\int_0^tx_s\,dY_s-\frac{1}{2}(1+a)\int_0^tx^2_s\,ds-\frac{1}{2}\int_0^t\big(bx_s+c\big)\,ds\Big)\,\mu_W(dx). \]
Now we apply the Girsanov theorem once again: introduce the Ornstein-Uhlenbeck process
\[ d\beta_t=-\sqrt{1+a}\,\beta_t\,dt+dW_t,\quad\beta_0=0. \]
The induced measure $\mu_\beta$ is equivalent to $\mu_W$ and
\[ \frac{d\mu_\beta}{d\mu_W}(x)=\exp\Big(-\sqrt{1+a}\int_0^tx_s\,dx_s-\frac{1}{2}(1+a)\int_0^tx^2_s\,ds\Big),\quad\mu_\beta\text{-a.s.} \]
Hence
\[ \int_{C_{[0,T]}}f(x_t)\varrho_t\big(x,Y(\omega)\big)\,\mu_X(dx)=\int_{C_{[0,T]}}f(x_t)e^{H(x_t)}\exp\Big(\int_0^tx_s\,dY_s-\frac{1}{2}(1+a)\int_0^tx^2_s\,ds-\frac{1}{2}\int_0^t\big(bx_s+c\big)\,ds\Big)\frac{d\mu_W}{d\mu_\beta}(x)\,\mu_\beta(dx) \]
\[ =\int_{C_{[0,T]}}f(x_t)e^{H(x_t)}\exp\Big(\int_0^tx_s\,dY_s-\frac{1}{2}\int_0^t\big(bx_s+c\big)\,ds+\sqrt{1+a}\int_0^tx_s\,dx_s\Big)\,\mu_\beta(dx) \]
\[ =\int_{C_{[0,T]}}f(x_t)e^{H(x_t)}\exp\Big(x_tY_t-\int_0^tY_s\,dx_s-\frac{1}{2}\int_0^t\big(bx_s+c\big)\,ds+\frac{\sqrt{1+a}}{2}\big(x^2_t-t\big)\Big)\,\mu_\beta(dx), \]
where the latter equality is obtained by the Ito formula (applicable under $\mu_\beta$).

Let $(\beta,\zeta,\mu)$ be the solution of the linear system (6.64); then
\[ \int_{C_{[0,T]}}f(x_t)\varrho_t\big(x,Y(\omega)\big)\,\mu_X(dx)=\int_{\mathbb R^3}f(x_1)\exp\Big(H(x_1)+x_1Y_t+\frac{1}{2}\sqrt{1+a}\,x^2_1-\frac{1}{2}\big(c+\sqrt{1+a}\big)t+x_2+x_3\Big)\gamma(x;m_t,V_t)\,dx, \]
and (6.63) follows by the arbitrariness of f. □
Exercises

(1) Let the signal process be $X_t=1_{\{\tau\le t\}}$, where $\tau$ is a nonnegative random variable with probability distribution G(dx). Suppose that the trajectory of
\[ Y_t=\int_0^tX_s\,ds+W_t \]
is observed, where W is a Wiener process, independent of $\tau$.
(a) Is $X_t$ a Markov process for general G? Give a counterexample if your answer is negative. Give an example for which $X_t$ is Markov.
(b) Apply the Kallianpur-Striebel formula to obtain a formula for $P(\tau\le t\mid\mathcal F^Y_t)$.

(2) Show that
\[ \sigma_t(1)=\exp\Big(\int_0^t\pi_s(g)\,dY_s-\frac{1}{2}\int_0^t\big(\pi_s(g)\big)^2\,ds\Big). \]

(3) (a) Verify the claim of Remark 6.22 directly.
(b) Find the solution of the Zakai equation (6.28) in the linear Gaussian case.

(4) Consider the linear diffusion
\[ dX_t=aX_t\,dt+dW_t,\quad X_0=0, \]
where W is a Wiener process and a is an unknown random parameter, to be estimated from $\mathcal F^X_t$. Below a and W are assumed independent.
(a) Assume that a takes a finite number of values $\alpha_1,\ldots,\alpha_d$ with positive probabilities $p_1,\ldots,p_d$. Find the recursive formulae (a d-dimensional system of SDEs) for $\pi_t(i)=P(a=\alpha_i\mid\mathcal F^X_t)$.
(b) Find the explicit solutions of the SDEs in (a).
(c) Does $\pi_t(i)$ converge to $1_{\{a=\alpha_i\}}$, $i=1,\ldots,d$? If yes, in what sense?
(d) Assume that $Ea^2<\infty$ and find an explicit expression for the orthogonal projection $\widehat E(a\mid\mathcal L^X_t)$ and the corresponding mean square error.
(e) Assume that a is a standard Gaussian random variable. Is the process X Gaussian? Is the pair (a,X) Gaussian? Is X conditionally Gaussian, given a?
(f) Is the optimal nonlinear filter in this case finite dimensional? If yes, find the recursive equations for the sufficient statistics.
(g) Does the mean square error $P_t=E\big(a-E(a\mid\mathcal F^X_t)\big)^2$ converge to zero as $t\to\infty$?

(5) Verify that $\mathcal F^Y_t\equiv\mathcal F^{\bar W}_t$ for the linear Gaussian setting (6.32).

(6) Derive the robust version of the Wonham filter (see (6.31) for reference). Elaborate the telegraphic (two dimensional) signal case.

(7) Calculate the mean, covariance and one dimensional characteristic function of the Poisson process.

(8) Verify the last equality (or equivalently the martingale property of the stochastic integral in this specific case) in (6.54).

(9) Let $X_t$ be a finite state Markov chain with values in $S=\{a_1,\ldots,a_d\}$, transition intensities matrix $\Lambda$ and initial distribution $p_0$. Let $I_t$ be the d-dimensional vector of indicators $1_{\{X_t=a_i\}}$.
(a) Show that the vector process $M_t=I_t-I_0-\int_0^t\Lambda^*I_s\,ds$ is an $\mathcal F^X_t$-martingale.
(b) Find its covariance $EM_tM^*_t$.

(10) For the process $I_t$, defined in the previous exercise, derive the filtering equations for the optimal linear estimate $\widehat I_t=\widehat E(I_t\mid\mathcal L^Y_t)$ and the corresponding error covariance, where $Y_t=\int_0^th(X_s)\,ds+W_t$.
Hint: use the results of Section 3 of the previous chapter.

(11) Consider a finite automaton with d states. A timer is associated with each state, which is reset upon entering the state and initiates a state transition after a random period of time elapses. The next state is chosen at random, independently of all the timers, with probabilities depending on the current state. Let $X_t$ be the state of the automaton at time t. Calibrate this model (i.e. choose the timers' parameters and the transition probabilities) so that $X_t$ is a Markov chain with a given intensities matrix $\Lambda$.

(12) (a) Derive finite dimensional filtering equations for $T_t(i,j)$ in (6.59) and J in (6.60).
(b) Derive the Zakai type equations for $O_t(i)$, $T_t(i,j)$ and J.
(c) Elaborate the structure of the optimal filters for the telegraphic signal case.
APPENDIX A

Auxiliary facts

1. The main convergence theorems

Theorem A.1 (Monotone convergence). Let $Y,X,X_1,X_2,\ldots$ be random variables. Then
(a) If $X_j\ge Y$ for each $j\ge1$, $EY>-\infty$ and $X_j\uparrow X$, then $EX_j\uparrow EX$.
(b) If $X_j\le Y$ for each $j\ge1$, $EY<\infty$ and $X_j\downarrow X$, then $EX_j\downarrow EX$.

Corollary A.2. Let $X_j$ be a sequence of nonnegative random variables; then
\[ E\sum_{j=1}^\infty X_j=\sum_{j=1}^\infty EX_j. \]

Theorem A.3 (Fatou Lemma). Let $Y,X_1,X_2,\ldots$ be random variables. Then
(a) If $X_j\ge Y$ for all $j\ge1$ and $EY>-\infty$, then
\[ E\varliminf_{j\to\infty}X_j\le\varliminf_{j\to\infty}EX_j. \]
(b) If $X_j\le Y$ for all $j\ge1$ and $EY<\infty$, then
\[ \varlimsup_{j\to\infty}EX_j\le E\varlimsup_{j\to\infty}X_j. \]
(c) If $|X_j|\le Y$ for all $j\ge1$ and $EY<\infty$, then
\[ E\varliminf_{j\to\infty}X_j\le\varliminf_{j\to\infty}EX_j\le\varlimsup_{j\to\infty}EX_j\le E\varlimsup_{j\to\infty}X_j. \]

Theorem A.4 (Lebesgue dominated convergence). Let $Y,X_1,X_2,\ldots$ be random variables, such that $|X_j|\le Y$, $EY<\infty$ and $X_j\to X$ P-a.s. as $j\to\infty$. Then $E|X|<\infty$,
\[ \lim_{j\to\infty}EX_j=EX \]
and
\[ \lim_{j\to\infty}E|X_j-X|=0. \]

2. Changing the order of integration

Consider the (product) measure space $(\Omega,\mathcal F,\mu)$ with $\Omega=\Omega_1\times\Omega_2$ and $\mathcal F=\mathcal F_1\otimes\mathcal F_2$, i.e. $\mathcal F$ is the $\sigma$-algebra generated by the sets $A_1\times A_2$, $A_1\in\mathcal F_1$, $A_2\in\mathcal F_2$, and $\mu=\mu_1\times\mu_2$, i.e.
\[ \mu_1\times\mu_2(A_1\times A_2)=\mu_1(A_1)\mu_2(A_2),\quad A_1\in\mathcal F_1,\ A_2\in\mathcal F_2. \]

Theorem A.5 (Fubini theorem). Let $X(\omega_1,\omega_2)$ be an $\mathcal F_1\otimes\mathcal F_2$-measurable function, integrable with respect to the measure $\mu_1\times\mu_2$, i.e.
\[ \int_{\Omega_1\times\Omega_2}|X(\omega_1,\omega_2)|\,d(\mu_1\times\mu_2)<\infty. \]
Then the integrals $\int_{\Omega_1}X(\omega_1,\omega_2)\,\mu_1(d\omega_1)$ and $\int_{\Omega_2}X(\omega_1,\omega_2)\,\mu_2(d\omega_2)$ are well defined for almost all $\omega_2$ and $\omega_1$ and are measurable functions with respect to $\mathcal F_2$ and $\mathcal F_1$ respectively:
\[ \mu_2\Big(\omega_2:\int_{\Omega_1}|X(\omega_1,\omega_2)|\,\mu_1(d\omega_1)=\infty\Big)=0,\qquad \mu_1\Big(\omega_1:\int_{\Omega_2}|X(\omega_1,\omega_2)|\,\mu_2(d\omega_2)=\infty\Big)=0. \]
Moreover
\[ \int_{\Omega_1\times\Omega_2}X(\omega_1,\omega_2)\,d(\mu_1\times\mu_2)=\int_{\Omega_1}\Big(\int_{\Omega_2}X(\omega_1,\omega_2)\,\mu_2(d\omega_2)\Big)\mu_1(d\omega_1)=\int_{\Omega_2}\Big(\int_{\Omega_1}X(\omega_1,\omega_2)\,\mu_1(d\omega_1)\Big)\mu_2(d\omega_2). \]
Bibliography

[1] B.D.O. Anderson and J.B. Moore, Optimal Filtering, Prentice-Hall, 1978. Available on the author's page: http://www.syseng.anu.edu.au/ftp/Publications/by_author/John_Moore/index.html
[2] V.E. Benes, Exact finite-dimensional filters for certain diffusions with nonlinear drift. Stochastics 5 (1981), no. 1-2, 65-92.
[3] P. Bremaud, Point Processes and Queues. Martingale Dynamics, Springer Series in Statistics, Springer-Verlag, New York-Berlin, 1981.
[4] K.L. Chung, R.J. Williams, Introduction to Stochastic Integration. Progress in Probability and Statistics, 4. Birkhauser Boston, Inc., Boston, MA, 1983.
[5] J.L. Doob, Stochastic Processes. John Wiley & Sons, Inc., New York; Chapman & Hall, Limited, London, 1953.
[6] E.B. Dynkin, Markov Processes. Vols. I, II. Academic Press Inc., Publishers, New York; Springer-Verlag, Berlin-Gottingen-Heidelberg, 1965.
[7] Y. Ephraim, N. Merhav, Hidden Markov processes. Special issue on Shannon theory: perspective, trends, and applications. IEEE Trans. Inform. Theory 48 (2002), no. 6, 1518-1569.
[8] R.E. Elliott, L. Aggoun and J.B. Moore, Hidden Markov Models: Estimation and Control, Springer-Verlag, 1995.
[9] G.M. Fihtengolc, Kurs differencial'nogo i integral'nogo ischislenija, Nauka, Moscow.
[10] M. Fujisaki, G. Kallianpur, H. Kunita, Stochastic differential equations for the non linear filtering problem. Osaka J. Math. 9 (1972), 19-40.
[11] M. Hazewinkel, S. Marcus, H. Sussmann, Nonexistence of finite-dimensional filters for conditional statistics of the cubic sensor problem. Systems Control Lett. 3 (1983), no. 6, 331-340.
[12] A.H. Jazwinski, Stochastic Processes and Filtering Theory, Academic Press, 1970.
[13] R.E. Kalman, A new approach to linear filtering and prediction problems, Trans. ASME Ser. D. J. Basic Engrg. 82 (1960), 35-45. (Available at http://www.cs.unc.edu/~welch/kalman/kalmanPaper.html)
[14] R.E. Kalman, R.S. Bucy, New results in linear filtering and prediction theory, Trans. ASME Ser. D. J. Basic Engrg. 83 (1961), 95-108.
[15] G. Kallianpur, Stochastic Filtering Theory (Applications of Mathematics, Vol. 13), Springer-Verlag, 1980.
[16] G. Kallianpur, R.L. Karandikar, White Noise Theory of Prediction, Filtering and Smoothing (Stochastics Monographs), Gordon & Breach Science Publishers, 1988.
[17] G. Kallianpur, C. Striebel, Estimation of stochastic systems: arbitrary system process with additive white noise observation errors. Ann. Math. Statist. 39 (1968), 785-801.
[18] I. Karatzas, S. Shreve, Brownian Motion and Stochastic Calculus. Graduate Texts in Mathematics, 113. Springer-Verlag, New York, 1988.
[19] A.N. Kolmogorov, Foundations of the Theory of Probability. Chelsea Publishing Company, New York, N.Y., 1950.
[20] A.N. Kolmogorov, Interpolation and extrapolation of stationary sequences, Izv. Akad. Nauk SSSR, Ser. Mat. 5 (1941), 3-14.
[21] R. Liptser, A. Shiryaev, Statistics of Random Processes, General Theory and Applications, 2nd ed., Applications of Mathematics (New York), 6. Stochastic Modelling and Applied Probability. Springer-Verlag, Berlin, 2001.
[22] R.Sh. Liptser, A.N. Shiryayev, Theory of Martingales, Mathematics and its Applications (Soviet Series), 49. Kluwer Academic Publishers Group, Dordrecht, 1989.
[23] S. Mitter, Nonlinear Filtering and Stochastic Control (Lecture Notes in Mathematics), Springer-Verlag, 1983.
[24] D. Ocone, Probability densities for conditional statistics in the cubic sensor problem, Math. Control Signals Systems 1 (1988), no. 2, 183-202.
[25] B. Oksendal, Stochastic Differential Equations: an Introduction with Applications, 5th ed., Springer, 1998.
[26] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series. With Engineering Applications. The Technology Press of the Massachusetts Institute of Technology, Cambridge, Mass.; John Wiley & Sons, Inc., New York, N.Y.; Chapman & Hall, Ltd., London, 1949. ix+163 pp.
[27] D. Revuz, M. Yor, Continuous Martingales and Brownian Motion. Third edition. Grundlehren der Mathematischen Wissenschaften, 293. Springer-Verlag, Berlin, 1999.
[28] Yu.A. Rozanov, Stationary Random Processes, Holden-Day, 1967.
[29] H.J. Kushner, On the differential equations satisfied by conditional probability densities of Markov processes, with applications. J. Soc. Indust. Appl. Math. Ser. A Control 2 (1964), 106-119.
[30] A. Makowski, Filtering formulae for partially observed linear systems with non-Gaussian initial conditions, Stochastics 16 (1986), no. 1-2, 1-24.
[31] S. Marcus, Algebraic and geometric methods in nonlinear filtering. SIAM J. Control Optim. 22 (1984), no. 6, 817-844.
[32] J.R. Norris, Markov Chains, Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998.
[33] L.C.G. Rogers, D. Williams, Diffusions, Markov Processes, and Martingales. Vol. 1: Foundations and Vol. 2: Ito Calculus. Reprint of the second (1994) edition. Cambridge Mathematical Library. Cambridge University Press, Cambridge, 2000.
[34] A.N. Shiryaev, Probability, 2nd ed., Graduate Texts in Mathematics, 95. Springer-Verlag, New York, 1996.
[35] A.N. Shiryaev, Optimal methods in quickest detection problems, Teor. Verojatnost. i Primenen. 8 (1963), 26-51.
[36] Z. Schuss, Theory and Applications of Stochastic Differential Equations. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, 1980.
[37] R. Stratonovich, Conditional Markov processes, Theoretical Probability and Its Applications 5 (1960), 156-178.
[38] B. Tsirelson, An example of a stochastic differential equation that has no strong solution, Teor. Verojatnost. i Primenen. 20 (1975), no. 2, 427-430.
[39] A.D. Wentzell, A Course in the Theory of Stochastic Processes, McGraw-Hill International Book Co., New York, 1981.
[40] M. Wonham, Some applications of stochastic differential equations to optimal nonlinear filtering. J. Soc. Indust. Appl. Math. Ser. A Control 2 (1965), 347-369.
[41] M. Zakai, On the optimal filtering of diffusion processes. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete 11 (1969), 230-243.
[42] O. Zeitouni, A. Dembo, Exact filters for the estimation of the number of transitions of finite-state continuous-time Markov processes, IEEE Trans. Inform. Theory 34 (1988), no. 4, 890-893.
[43] A.K. Zvonkin, A transformation of the phase space of a diffusion process that will remove the drift, Mat. Sb. (N.S.), 93 (135), 1974, 129-149.