S.R.S.Varadhan
Courant Institute of Mathematical Sciences
New York University
Contents

1 Measure Theory
1.1 Introduction
1.2 Construction of Measures
1.3 Integration
1.4 Transformations
1.5 Product Spaces
1.6 Distributions and Expectations

2 Weak Convergence
2.1 Characteristic Functions
2.2 Moment Generating Functions
2.3 Weak Convergence

3 Independent Sums
3.1 Independence and Convolution
3.2 Weak Law of Large Numbers
3.3 Strong Limit Theorems
3.4 Series of Independent Random Variables
3.5 Strong Law of Large Numbers
3.6 Central Limit Theorem
3.7 Accompanying Laws
3.8 Infinitely Divisible Distributions
3.9 Laws of the Iterated Logarithm

5 Martingales
5.1 Definitions and Properties
5.2 Martingale Convergence Theorems
5.3 Doob Decomposition Theorem
5.4 Stopping Times
5.5 Upcrossing Inequality
5.6 Martingale Transforms, Option Pricing
5.7 Martingales and Markov Chains
These notes are based on a first year graduate course on Probability and Limit
theorems given at Courant Institute of Mathematical Sciences. Originally
written during 1997-98, they have been revised during academic year 1998-
99 as well as in the Fall of 1999. I want to express my appreciation to those
who pointed out to me several typos as well as suggestions for improvement.
I want to mention in particular the detailed comments from Professor Charles
Newman and Mr Enrique Loubet. Chuck used it while teaching the course
in 98-99 and Enrique helped me as TA when I taught out of these notes
again in the Fall of 99. These notes cover about three fourths of the course, essentially the discrete time processes. Hopefully a companion volume covering continuous time processes will appear in the near future. A small amount of measure theory is included. While it is not meant to be complete, it is my hope that it will be useful.
Chapter 1
Measure Theory
1.1 Introduction.
During its early development, probability theory was based more on intuition than on mathematical axioms. In 1933, A. N. Kolmogorov [4] provided an axiomatic basis for probability theory and it is now the universally accepted model. There are certain 'noncommutative' versions, with origins in quantum mechanics (see for instance K. R. Parthasarathy [5]), that are generalizations of the Kolmogorov model. We shall however use Kolmogorov's framework exclusively.
The basic intuition in probability theory is the notion of randomness. There are experiments whose results are not predictable and can be determined only by performing the experiment and observing the outcome. The simplest familiar examples are the tossing of a fair coin and the throwing of a balanced die. In the first experiment the result could be either a head or a tail, and the throwing of a die could result in a score of any integer from 1 through 6. These are experiments with only a finite number of alternative outcomes. It is not difficult to imagine experiments that have countably or even uncountably many alternatives as possible outcomes.
Abstractly then, there is a space Ω of all possible outcomes and each
individual outcome is represented as a point ω in that space Ω. Subsets of Ω
are called events and each of them corresponds to a collection of outcomes. If
the outcome ω is in the subset A, then the event A is said to have occurred.
For example in the case of a die the set A = {1, 3, 5} ⊂ Ω corresponds to
the event ‘an odd number shows up’. With this terminology it is clear that
Exercise 1.2. Show that a finitely additive probability measure P(·) defined on a σ-field B is countably additive, i.e. satisfies equation (1.6), if and only if it satisfies either of the following two equivalent conditions.

If A_n is any nonincreasing sequence of sets in B and A = lim_n A_n = ∩_n A_n, then

P(A) = lim_{n→∞} P(A_n).

If A_n is any nondecreasing sequence of sets in B and A = lim_n A_n = ∪_n A_n, then

P(A) = lim_{n→∞} P(A_n).
0 ≤ P (A) − P (B) ≤ P (A ∩ B c ) ≤ P (B c ).
Definition 1.4. The σ-field in the above exercise is called the σ-field gener-
ated by F .
where the infimum is taken over all countable collections {A_j} of sets from F that cover A. Without loss of generality we can assume that the {A_j} are disjoint: replace A_j by (∩_{i=1}^{j−1} A_i^c) ∩ A_j.
P ∗ (A) ≥ P ∗ (A ∩ E) + P ∗(A ∩ E c )
holds for all sets A, and establish the following properties for the class M
of measurable sets. The class of measurable sets M is a σ-field and P ∗ is a
countably additive measure on it.
Exercise 1.8. The smallest monotone class generated by a field is the same
as the σ-field generated by the field.
It now follows that A must contain the σ-field generated by F and that
proves uniqueness.
The extension theorem does not quite solve the problem of constructing countably additive probability measures; it reduces it to constructing them on fields. The following theorem is important in the theory of Lebesgue integrals and is very useful for the construction of countably additive probability measures on the real line. The proof will again be only sketched. The natural σ-field on which to define a probability measure on the line is the Borel σ-field. This is defined as the smallest σ-field containing all intervals and includes in particular all open sets.
Let us consider the class of subsets of the real numbers, I = {Ia,b : −∞ ≤
a < b ≤ ∞} where Ia,b = {x : a < x ≤ b} if b < ∞, and Ia,∞ = {x : a <
x < ∞}. In other words I is the collection of intervals that are left-open and
right-closed. The class of sets that are finite disjoint unions of members of I
is a field F, if the empty set is added to the class. If we are given a function F(x) on the real line which is nondecreasing, continuous from the right and satisfies

lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1,

then we can define a finitely additive probability measure P by setting P(I_{a,b}) = F(b) − F(a) for intervals and then extending it to F by defining it as the sum for disjoint unions from I. Let us note that the Borel σ-field B on the real line is the σ-field generated by F.
that only increases by jumps, the jump at xn being pn . The points {xn }
themselves can be discrete like integers or dense like the rationals.
Example 1.2. If f(x) is a nonnegative integrable function with integral 1, i.e. ∫_{−∞}^{∞} f(y) dy = 1, then F(x) = ∫_{−∞}^{x} f(y) dy is a distribution function which is continuous. In this case f is the density of the measure P and can be calculated as f(x) = F′(x).
There are (messy) examples of F that are continuous, but do not come from
any density. More on this later.
Exercise 1.9. Let us try to construct the Lebesgue measure on the rationals
Q ⊂ [0, 1]. We would like to have
P [Ia,b ] = b − a
1.3 Integration
An important notion is that of a random variable or a measurable function.
Exercise 1.10. It is enough to check the requirement for sets B ⊂ R that are
intervals or even just sets of the form (−∞, x] for −∞ < x < ∞.
A function that is measurable and satisfies |f(ω)| ≤ M for all ω ∈ Ω, for some finite M, is called a bounded measurable function.
The following statements are the essential steps in developing an integration theory. Details can be found in any book on real variables.
1. If A ∈ Σ, the indicator function of A, defined as

1_A(ω) = 1 if ω ∈ A, and 1_A(ω) = 0 if ω ∉ A,

is bounded and measurable.
2. Sums, products, limits, compositions and reasonable elementary oper-
ations like min and max performed on measurable functions lead to
measurable functions.
3. If {A_j : 1 ≤ j ≤ n} is a finite disjoint partition of Ω into measurable sets, the function f(ω) = Σ_j c_j 1_{A_j}(ω) is a measurable function and is referred to as a 'simple' function.
4. Any bounded measurable function f is a uniform limit of simple functions. To see this, if f is bounded by M, divide [−M, M] into n subintervals I_j of length 2M/n with midpoints c_j. Let

A_j = f^{−1}(I_j) = {ω : f(ω) ∈ I_j}

and

f_n = Σ_{j=1}^{n} c_j 1_{A_j}.
6. If f_n is a sequence of simple functions converging to f uniformly, then a_n = ∫ f_n dP is a Cauchy sequence of real numbers and therefore has a limit a as n → ∞. The integral ∫ f dP of f is defined to be this limit a. One can verify that a depends only on f and not on the sequence f_n chosen to approximate f.
7. Now the integral is defined for all bounded measurable functions and enjoys the following properties.

(b) If f is a bounded measurable function so is |f|, and |∫ f dP| ≤ ∫ |f| dP ≤ sup_ω |f(ω)|.

(c) In fact a slightly stronger inequality is true. For any bounded measurable f,

∫ |f| dP ≤ P({ω : |f(ω)| > 0}) sup_ω |f(ω)|
for every ω ∈ Ω.
for every ω ∈ N^c.
and

lim inf_n A_n = ∪_n ∩_{m≥n} A_m

both coincide with A. Finally 1_{A_n}(ω) → 1_A(ω) in measure if and only if P(A_n Δ A) → 0, where for any two sets A and B the symmetric difference A Δ B is defined as A Δ B = (A ∩ B^c) ∪ (A^c ∩ B) = (A ∪ B) ∩ (A ∩ B)^c. It is the set of points that belong to either set but not to both. For instance 1_{A_n} → 0 in measure if and only if P(A_n) → 0.
Exercise 1.11. There is a difference between almost everywhere convergence and convergence in measure; the first is really stronger. Consider the interval [0, 1] and divide it successively into 2, 3, 4, · · · parts and enumerate the intervals in succession. That is, I_1 = [0, 1/2], I_2 = [1/2, 1], I_3 = [0, 1/3], I_4 = [1/3, 2/3], I_5 = [2/3, 1], and so on. If f_n(x) = 1_{I_n}(x), it is easy to check that f_n tends to 0 in measure but not almost everywhere.
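The typewriter sequence above is easy to experiment with numerically. The following sketch (our illustration, not part of the notes; the cutoff level and the test point are arbitrary choices) lists the intervals level by level and checks both halves of the claim: the lengths of the supports shrink to 0, while any fixed interior point keeps landing in some interval at every level.

```python
# The "typewriter" intervals of Exercise 1.11: divide [0,1] into
# k = 2, 3, 4, ... equal parts and enumerate the pieces in succession.
# f_n is the indicator of the n-th piece.

def intervals(max_k):
    """All intervals [j/k, (j+1)/k] for k = 2..max_k, in order of enumeration."""
    out = []
    for k in range(2, max_k + 1):
        for j in range(k):
            out.append((j / k, (j + 1) / k))
    return out

ivs = intervals(40)
measures = [b - a for a, b in ivs]          # P(|f_n| > 0) for each n

# Convergence in measure: the support of f_n has length -> 0 (last one is 1/40).
shrinks = abs(measures[-1] - 1 / 40) < 1e-12

# No a.e. convergence: a fixed interior point x is covered at every level
# k = 2..40, so f_n(x) = 1 for infinitely many n and f_n(x) cannot converge.
x = 0.37
levels_hit = sum(1 for a, b in ivs if a <= x <= b)
```

Since x = 0.37 is not an endpoint j/k for any k ≤ 40, exactly one interval per level contains it, giving 39 hits in this range.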
lim_{n→∞} P[∪_{m=n}^{∞} A_m] = 0

In particular it is sufficient that Σ_n P[A_n] < ∞. Is it necessary?
Proof. Since

|∫ f_n dP − ∫ f dP| = |∫ (f_n − f) dP| ≤ ∫ |f_n − f| dP

we need only prove that if f_n → 0 in measure and |f_n| ≤ M, then ∫ |f_n| dP → 0. To see this,

∫ |f_n| dP = ∫_{|f_n|≤ε} |f_n| dP + ∫_{|f_n|>ε} |f_n| dP ≤ ε + M P[ω : |f_n(ω)| > ε]
However if we replace xn by nxn , fn (x) still goes to 0 a.e., but the sequence
is no longer uniformly bounded and the integral does not go to 0.
We now proceed to define integrals of nonnegative measurable functions.
An important result is

h_n → h = f ∧ g = g.

Since ∫ h_n dP ≤ ∫ f_n dP for every n it follows that

∫ g dP ≤ lim inf_{n→∞} ∫ f_n dP.
Proof. Obviously ∫ f_n dP ≤ ∫ f dP and the other half follows from Fatou's lemma.
3. If f = 0 except on a set N of measure 0, then f is integrable and ∫ f dP = 0. In particular if f = g almost everywhere, then ∫ f dP = ∫ g dP.
Proof. We have seen the inequality already for φ(x) = |x|. The proof is quite simple. We note that any convex function φ can be represented as the supremum of a collection of affine linear functions.
It is clear that if (a, b) ∈ E, then a f(ω) + b ≤ φ(f(ω)), and on integration this yields am + b ≤ E[φ(f(ω))] where m = E[f(ω)]. Since this is true for every (a, b) ∈ E, in view of the representation (1.9), our theorem follows.
Exercise 1.15. Take the unit interval with the Lebesgue measure and define f_n(x) = n^α 1_{[0, 1/n]}(x). Clearly f_n(x) → 0 for x ≠ 0. On the other hand ∫ f_n(x) dx = n^{α−1} → 0 if and only if α < 1. What is g(x) = sup_n f_n(x) and when is g integrable?
If h(ω) = f(ω) + i g(ω) is a complex valued measurable function with real and imaginary parts f(ω) and g(ω) that are integrable, we define

∫ h(ω) dP = ∫ f(ω) dP + i ∫ g(ω) dP
Exercise 1.16. Show that for any complex function h(ω) = f(ω) + i g(ω) with measurable f and g, |h(ω)| is integrable if and only if |f| and |g| are integrable, and we then have

|∫ h(ω) dP| ≤ ∫ |h(ω)| dP
1.4 Transformations
A measurable space (Ω, B) is a set Ω together with a σ-field B of subsets of
Ω.
Definition 1.10. Given two measurable spaces (Ω_1, B_1) and (Ω_2, B_2), a mapping or transformation T : Ω_1 → Ω_2, i.e. a function ω_2 = T(ω_1) that assigns to each point ω_1 ∈ Ω_1 a point ω_2 = T(ω_1) ∈ Ω_2, is said to be measurable if for every measurable set A ∈ B_2, the inverse image T^{−1}(A) = {ω_1 : T(ω_1) ∈ A} belongs to B_1.
Exercise 1.17. Show that, in the above definition, it is enough to verify the
property for A ∈ A where A is any class of sets that generates the σ-field B2 .
Exercise 1.21. Show that sets that are finite disjoint unions of measurable
rectangles constitute a field F .
Definition 1.11. The product σ-field B is the σ-field generated by the field F.
Now let En ∈ F ↓ Φ, the empty set. Then it is easy to verify that En,ω2
defined by
En,ω2 = {ω1 : (ω1 , ω2 ) ∈ En }
satisfies En,ω2 ↓ Φ for each ω2 ∈ Ω2 . From the countable additivity of P1 we
conclude that P1 (En,ω2 ) → 0 for each ω2 ∈ Ω2 and since, 0 ≤ P1 (En,ω2 ) ≤ 1
for n ≥ 1, it follows from equation (1.13) and the bounded convergence theorem that

P(E_n) = ∫_{Ω_2} P_1(E_{n,ω_2}) dP_2 → 0

establishing the countable additivity of P on F.
Corollary 1.11. For any A ∈ B if we denote by Aω1 and Aω2 the respective
sections
Aω1 = {ω2 : (ω1 , ω2 ) ∈ A}
and
Aω2 = {ω1 : (ω1 , ω2 ) ∈ A}
then the functions P_1(A_{ω_2}) and P_2(A_{ω_1}) are measurable and

P(A) = ∫ P_1(A_{ω_2}) dP_2 = ∫ P_2(A_{ω_1}) dP_1.
In particular for a measurable set A, P (A) = 0 if and only if for almost all
ω1 with respect to P1 , the sections Aω1 have measure 0 or equivalently for
almost all ω2 with respect to P2 , the sections Aω2 have measure 0.
and

H(ω_2) = ∫_{Ω_1} h_{ω_2}(ω_1) dP_1

are measurable, finite almost everywhere and integrable with respect to P_1 and P_2 respectively. Finally

∫ f(ω_1, ω_2) dP = ∫ G(ω_1) dP_1 = ∫ H(ω_2) dP_2
Here we are taking advantage of the fact that on the real line x is a very special real valued function. The value of the integral in this context is referred to as the expectation or mean of α. Of course it exists if and only if

∫ |x| dα < ∞

and

|∫ x dα| ≤ ∫ |x| dα.
Similarly

E(g(X)) = ∫ g(X(ω)) dP = ∫ g(x) dα

Σ_{j=0}^{N} g(x_j)[F(a^N_{j+1}) − F(a^N_j)]

where −∞ < a^N_0 < a^N_1 < · · · < a^N_N < a^N_{N+1} < ∞ is a partition of the finite
Chapter 2
Weak Convergence
The above definition makes sense: we write the integrand e^{itx} as cos tx + i sin tx and integrate each part to see that

|φ(t)| ≤ 1
Exercise 2.1. Calculate the characteristic functions for the following distri-
butions:
for 0 ≤ k ≤ n.
The next question is how to recover the distribution function F (x) from
φ(t). If we go back to the Fourier inversion formula, see for instance [2], we
can ‘guess’, using the fundamental theorem of calculus and Fubini’s theorem,
that

F′(x) = (1/2π) ∫_{−∞}^{∞} exp[−itx] φ(t) dt

and therefore

F(b) − F(a) = (1/2π) ∫_a^b dx ∫_{−∞}^{∞} exp[−itx] φ(t) dt
            = (1/2π) ∫_{−∞}^{∞} φ(t) dt ∫_a^b exp[−itx] dx
            = (1/2π) ∫_{−∞}^{∞} φ(t) (exp[−itb] − exp[−ita])/(−it) dt
            = lim_{T→∞} (1/2π) ∫_{−T}^{T} φ(t) (exp[−itb] − exp[−ita])/(−it) dt.
We will in fact prove the final relation, which is a principal value integral,
provided a and b are points of continuity of F . We compute the right hand
side as
lim_{T→∞} (1/2π) ∫_{−T}^{T} (exp[−itb] − exp[−ita])/(−it) dt ∫ exp[itx] dα
= lim_{T→∞} (1/2π) ∫ dα ∫_{−T}^{T} (exp[it(x−b)] − exp[it(x−a)])/(−it) dt
= lim_{T→∞} (1/2π) ∫ dα ∫_{−T}^{T} (sin t(x−a) − sin t(x−b))/t dt
= (1/2) ∫ [sign(x − a) − sign(x − b)] dα
= F(b) − F(a)
provided a and b are continuity points. We have applied Fubini’s theorem
and the bounded convergence theorem to take the limit as T → ∞. Note
that the Dirichlet integral
u(T, z) = ∫_0^T (sin tz)/t dt

satisfies sup_{T,z} |u(T, z)| ≤ C and

lim_{T→∞} u(T, z) = π/2 if z > 0, −π/2 if z < 0, and 0 if z = 0.
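The principal-value inversion formula can also be checked numerically. The sketch below (our illustration; the choice of the standard normal and the grid parameters are arbitrary) compares the truncated inversion integral with F(b) − F(a) for φ(t) = exp[−t²/2].

```python
import numpy as np
from math import erf, sqrt, pi

# Truncated inversion integral (1/2pi) int_{-T}^{T} phi(t)
# (e^{-itb} - e^{-ita}) / (-it) dt, evaluated on a midpoint grid so the
# (removable) singularity at t = 0 is never sampled.
def inversion(a, b, T=40.0, n=400000):
    dt = 2 * T / n
    t = -T + (np.arange(n) + 0.5) * dt
    phi = np.exp(-t**2 / 2)                    # characteristic function of N(0,1)
    integrand = phi * (np.exp(-1j * t * b) - np.exp(-1j * t * a)) / (-1j * t)
    return (integrand.real * dt).sum() / (2 * pi)

def F(x):
    """Standard normal distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

approx = inversion(-1.0, 2.0)
exact = F(2.0) - F(-1.0)
```

The two numbers agree to several decimal places; for distributions with an atom at a or b the truncated integral would instead converge to the midpoint value.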
2. The gamma distribution with density f(x) = (c^p/Γ(p)) e^{−cx} x^{p−1}, x ≥ 0, has the characteristic function

φ(t) = (1 − it/c)^{−p}

where c > 0 is any constant. A special case of the gamma distribution is the exponential distribution, which corresponds to c = p = 1 with density f(x) = e^{−x} for x ≥ 0. Its characteristic function is given by φ(t) = [1 − it]^{−1}.
3. The two sided exponential with density f(x) = (1/2) e^{−|x|} has characteristic function

φ(t) = 1/(1 + t²).

4. The Cauchy distribution with density f(x) = (1/π) · 1/(1 + x²) has the characteristic function

φ(t) = e^{−|t|}.
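These closed forms are easy to confirm by direct quadrature of the defining integral. A small check for the two sided exponential (our illustration; the grid and the test points are arbitrary choices):

```python
import numpy as np

# phi(t) = int f(x) e^{itx} dx for f(x) = (1/2) e^{-|x|}; the claim is
# phi(t) = 1/(1 + t^2). Tails beyond |x| = 40 are negligible (~ e^{-40}).
x = np.linspace(-40.0, 40.0, 400001)
dx = x[1] - x[0]
f = 0.5 * np.exp(-np.abs(x))

def phi(t):
    return (f * np.exp(1j * t * x)).sum() * dx    # Riemann approximation

errors = [abs(phi(t).real - 1.0 / (1.0 + t**2)) for t in (0.0, 0.5, 1.0, 3.0)]
max_err = max(errors)
```

The imaginary part vanishes up to rounding because the density is symmetric, matching the general fact that symmetric distributions have real characteristic functions.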
m_k = E[X^k]    (2.3)

for every k ≥ 0. We can then replace them by {a_n/m_0}, {b_n/m_0} : n ≥ 0 so that Σ_k a_k = Σ_k b_k = 1 and the two probability distributions

P[X = e^n] = a_n,  P[Y = e^n] = b_n

will have all their moments equal. Once we can find {c_n} such that

Σ_n c_n e^{nz} = 0 for z = 0, 1, · · ·
There is in fact a positive result as well. If α is such that the moments m_k = ∫ x^k dα do not grow too fast, then α is determined by {m_k}.

Theorem 2.2. Let m_k be such that Σ_k m_{2k} a^{2k}/(2k)! < ∞ for some a > 0. Then there is at most one distribution α such that ∫ x^k dα = m_k.
Proof. We want to determine the characteristic function φ(t) of α. First we note that if α has moments m_k satisfying our assumption, then

∫ cosh(ax) dα = Σ_k m_{2k} a^{2k}/(2k)! < ∞

by the monotone convergence theorem. In particular

ψ(u + it) = ∫ e^{(u+it)x} dα
for any interval I = [a, b] such that the single point sets a and b have proba-
bility 0 under α.
1. αn ⇒ α or Fn ⇒ F
and

|∫ f(x) dα − Σ_{j=1}^{N} f(a_j)[F(a_{j+1}) − F(a_j)]| ≤ δ + 2Mε.    (2.5)
Proof.
Step 1. Let r_1, r_2, · · · be an enumeration of the rational numbers. For each j consider the sequence {F_n(r_j) : n ≥ 1} where F_n is the distribution function corresponding to φ_n(·). It is a sequence bounded by 1 and we can extract a subsequence that converges. By the diagonalization process we can choose a subsequence G_k = F_{n_k} such that

lim_{k→∞} G_k(r) = b_r
This is true for every rational r > x, and therefore taking the infimum over r > x,

lim sup_{n→∞} G_n(x) ≤ G(x).
Suppose now that we have y < x. Find a rational r such that y < r < x.
Step 5. We now complete the rest of the proof, i.e. show that αn ⇒ α. We
have Gk = Fnk ⇒ G as well as ψk = φnk → φ. Therefore G must equal F
which has φ for its characteristic function. Since the argument works for any
subsequence of Fn , every subsequence of Fn will have a further subsequence
that converges weakly to the same limit F uniquely determined as the distri-
bution function whose characteristic function is φ(·). Consequently Fn ⇒ F
or αn ⇒ α.
Exercise 2.7. How do you actually prove that if every subsequence of a se-
quence {Fn } has a further subsequence that converges to a common F then
Fn ⇒ F ?
Proof. The proof is already contained in the details of the proof of the earlier theorem. We can always choose a subsequence such that the distribution functions converge at rationals and try to reconstruct the limiting distribution function from the limits at rationals. The crucial step is to prove that the limit is a distribution function. Either of the two conditions (2.7) or (2.8) will guarantee this. If condition (2.7) is violated it is straightforward to pick a sequence from A for which the distribution functions have a limit which is
and therefore

lim sup_{n→∞} α_n(C) ≤ lim_{n→∞} ∫ f_k(x) dα_n = ∫ f_k(x) dα.

Letting k → ∞ we get
We are now ready to prove the converse of Theorem 2.1 which is the hard
part of a theorem of Bochner that characterizes the characteristic functions
of probability distributions as continuous positive definite functions on R
normalized to be 1 at 0.
Theorem 2.7. (Bochner's Theorem). If φ(t) is a positive definite function which is continuous at t = 0 and is normalized so that φ(0) = 1, then φ is the characteristic function of some probability distribution on R.
Or
|φ(s) − φ(t)|2 ≤ 1 − |φ(s − t)|2 + 2|1 − φ(t − s)|
≤ 4|1 − φ(s − t)|
f(x) = lim_{T→∞} (1/2π) ∫_{−T}^{T} (1 − |t|/T) e^{−itx} φ(t) dt    (2.10)
     = lim_{T→∞} (1/2πT) ∫_0^T ∫_0^T e^{−i(t−s)x} φ(t − s) dt ds    (2.11)
     = lim_{T→∞} (1/2πT) ∫_0^T ∫_0^T e^{−itx} e^{isx} φ(t − s) dt ds    (2.12)
     ≥ 0.
We can use the dominated convergence theorem to prove equation (2.10),
a change of variables to show equation (2.11) and finally a Riemann sum
approximation to the integral and the positive definiteness of φ to show that
the quantity in (2.12) is nonnegative. It remains to show the relation (2.9).
Let us define

f_σ(x) = f(x) exp[−σ²x²/2]
∫_{−∞}^{∞} e^{itx} f_σ(x) dx = ∫_{−∞}^{∞} e^{itx} f(x) exp[−σ²x²/2] dx
= (1/2π) ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{itx} φ(s) e^{−isx} exp[−σ²x²/2] ds dx
= ∫_{−∞}^{∞} φ(s) (1/(√(2π) σ)) exp[−(t − s)²/(2σ²)] ds.    (2.13)
If we take t = 0 in equation (2.13), we get

∫_{−∞}^{∞} f_σ(x) dx = ∫_{−∞}^{∞} φ(s) (1/(√(2π) σ)) exp[−s²/(2σ²)] ds ≤ 1.    (2.14)
Now we let σ → 0. Since f_σ ≥ 0 and tends to f as σ → 0, from Fatou's lemma and equation (2.14), it follows that f is integrable and in fact ∫_{−∞}^{∞} f(x) dx ≤ 1. Now we let σ → 0 in equation (2.13). Since f_σ(x) e^{itx} is dominated by the integrable function f, there is no problem with the left hand side. On the other hand the limit as σ → 0 is easily calculated on the right hand side of equation (2.13)
∫_{−∞}^{∞} e^{itx} f(x) dx = lim_{σ→0} ∫_{−∞}^{∞} φ(s) (1/(√(2π) σ)) exp[−(s − t)²/(2σ²)] ds
= lim_{σ→0} ∫_{−∞}^{∞} φ(t + σs) (1/√(2π)) exp[−s²/2] ds
= φ(t)
proving equation (2.9).
Step 3. If φ(t) is a positive definite function which is continuous, so is φ(t) exp[ity] for every y and for σ > 0, as well as the convex combination

φ_σ(t) = ∫_{−∞}^{∞} φ(t) exp[ity] (1/(√(2π) σ)) exp[−y²/(2σ²)] dy
       = φ(t) exp[−σ²t²/2].

The previous step is applicable to φ_σ(t), which is clearly integrable on R, and by letting σ → 0 we conclude by Theorem 2.3 that φ is a characteristic function as well.
and

sup_n ∫ g(x) dα_n ≤ C < ∞.

Then show that

lim_{n→∞} ∫ f(x) dα_n = ∫ f(x) dα

In particular if ∫ |x|^k dα_n remains bounded, then ∫ x^j dα_n → ∫ x^j dα for 1 ≤ j ≤ k − 1.
Exercise 2.10. On the other hand if α_n ⇒ α and g : R → R is a continuous function, then the distribution β_n of g under α_n defined as

β_n[A] = α_n[x : g(x) ∈ A]

converges weakly to β, the corresponding distribution of g under α.
Exercise 2.11. If g_n(x) is a sequence of continuous functions such that

sup_{n,x} |g_n(x)| ≤ C < ∞ and lim_{n→∞} g_n(x) = g(x)

Can you construct an example to show that even if g_n, g are continuous, just the pointwise convergence lim_{n→∞} g_n(x) = g(x) is not enough?
Exercise 2.12. If a sequence {f_n(ω)} of random variables on a measure space is such that f_n → f in measure, then show that the sequence of distributions α_n of f_n on R converges weakly to the distribution α of f. Give an example to show that the converse is not true in general. However, if f is equal to a constant c with probability 1, or equivalently α is degenerate at some point c, then α_n ⇒ α = δ_c implies the convergence in probability of f_n to the constant function c.
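For the counterexample requested in Exercise 2.12, one standard choice (ours, not necessarily the one the notes intend) is Ω = [0, 1] with Lebesgue measure, f(ω) = ω and f_n(ω) = 1 − ω: every f_n is uniformly distributed, so α_n ⇒ α trivially, yet |f_n − f| = |1 − 2ω| does not tend to 0 in measure. A quick Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(size=200000)        # sample points of [0,1] under Lebesgue measure
f = w
fn = 1 - w                          # same uniform distribution as f, for every n

# The distributions agree: empirical CDFs match at a few test points.
cdf_gap = max(abs((fn <= a).mean() - (f <= a).mean()) for a in (0.25, 0.5, 0.75))

# But there is no convergence in measure: P(|f_n - f| > 1/2) = 1/2 for all n.
p = (np.abs(fn - f) > 0.5).mean()
```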
Chapter 3
Independent Sums
P [A ∩ B] = P [A]P [B].
P [X ∈ A, Y ∈ B] = P [X ∈ A]P [Y ∈ B].
The important thing to note is that if X and Y are independent and one
knows their distributions α and β, then their joint distribution is automati-
cally determined as the product measure.
If X and Y are independent random variables having α and β for their
distributions, the distribution of the sum Z = X +Y is determined as follows.
First we construct the product measure α×β on R×R and then consider the
induced distribution of the function f (x, y) = x + y. This distribution, called
the convolution of α and β, is denoted by α ∗ β. An elementary calculation
using Fubini’s theorem provides the following identities.
(α ∗ β)(A) = ∫ α(A − x) dβ = ∫ β(A − x) dα    (3.1)

∫ exp[itx] d(α ∗ β) = ∫∫ exp[it(x + y)] dα dβ = ∫ exp[itx] dα ∫ exp[ity] dβ
or equivalently
φα∗β (t) = φα (t)φβ (t) (3.2)
which provides a direct way of calculating the distributions of sums of inde-
pendent random variables by the use of characteristic functions.
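Relation (3.2) lends itself to a quick simulation check (our illustration; the two distributions and the test points are arbitrary choices): the empirical characteristic function of X + Y should match the product of those of X and Y up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400000
X = rng.exponential(1.0, n)         # distribution alpha
Y = rng.uniform(0.0, 1.0, n)        # distribution beta, independent of X

def ecf(sample, t):
    """Empirical characteristic function at t."""
    return np.exp(1j * t * sample).mean()

# phi_{alpha * beta}(t) versus phi_alpha(t) phi_beta(t)
max_err = max(abs(ecf(X + Y, t) - ecf(X, t) * ecf(Y, t)) for t in (0.5, 1.0, 2.0))
```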
Exercise 3.2. If X and Y are independent show that for any two measurable
functions f and g, f (X) and g(Y ) are independent.
Exercise 3.3. Use Fubini’s theorem to show that if X and Y are independent
and if f and g are measurable functions with both E[|f (X)|] and E[|g(Y )|]
finite then
E[f (X)g(Y )] = E[f (X)]E[g(Y )].
Exercise 3.4. Show that if X and Y are any two random variables then E(X + Y) = E(X) + E(Y). If X and Y are two independent random variables then show that

Var(X + Y) = Var(X) + Var(Y)

where

Var(X) = E[(X − E[X])²] = E[X²] − [E[X]]².
If X1 , X2 , · · · , Xn are n independent random variables, then the distri-
bution of their sum Sn = X1 + X2 + · · · + Xn can be computed in terms of
the distributions of the summands. If α_j is the distribution of X_j, then the distribution µ_n of S_n is given by the convolution µ_n = α_1 ∗ α_2 ∗ · · · ∗ α_n
that can be calculated inductively by µj+1 = µj ∗ αj+1 . In terms of their
characteristic functions ψn (t) = φ1 (t)φ2 (t) · · · φn (t). The first two moments
of Sn are computed easily.
E(S_n) = E(X_1) + E(X_2) + · · · + E(X_n)

and

Var(S_n) = E[(S_n − E(S_n))²]
         = Σ_j E[(X_j − E(X_j))²] + 2 Σ_{1≤i<j≤n} E[(X_i − E(X_i))(X_j − E(X_j))].
with g(x) = (x − np)² and in (3.6) we have used the fact that S_n = X_1 + X_2 + · · · + X_n where the X_i are independent and have the simple distribution
S_n = X_1 + X_2 + · · · + X_n

we have

lim_{n→∞} P[|S_n/n − m| ≥ δ] = 0
Actually it is enough to assume that E|X_i| < ∞; the existence of the second moment is not needed. We will provide two proofs of the statement.

Theorem 3.3. If X_1, X_2, · · · , X_n are independent and identically distributed with a finite first moment and E(X_i) = m, then (X_1 + X_2 + · · · + X_n)/n converges to m in probability as n → ∞.
Proof. 1. Let C be a large constant and let us define X_i^C as the truncated random variable X_i^C = X_i if |X_i| ≤ C and X_i^C = 0 otherwise. Let Y_i^C = X_i − X_i^C so that X_i = X_i^C + Y_i^C. Then

(1/n) Σ_{1≤i≤n} X_i = (1/n) Σ_{1≤i≤n} X_i^C + (1/n) Σ_{1≤i≤n} Y_i^C = ξ_n^C + η_n^C.

If we denote by a_C = E(X_i^C) and b_C = E(Y_i^C), we always have m = a_C + b_C. Consider the quantity

δ_n = E[|(1/n) Σ_{1≤i≤n} X_i − m|]
    = E[|ξ_n^C + η_n^C − m|]
    ≤ E[|ξ_n^C − a_C|] + E[|η_n^C − b_C|]
    ≤ [E(|ξ_n^C − a_C|²)]^{1/2} + 2 E[|Y_i^C|].    (3.8)
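The weak law itself is easy to watch in a simulation (our illustration; the exponential distribution, tolerance δ = 0.1 and the two sample sizes are arbitrary choices): the probability that the sample mean misses m by δ is large for small n and essentially zero for large n.

```python
import numpy as np

rng = np.random.default_rng(2)
m, delta = 1.0, 0.1                 # exponential(1) has mean m = 1

def tail_prob(n, trials=2000):
    """Estimate P(|S_n/n - m| >= delta) for exponential(1) summands."""
    means = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
    return (np.abs(means - m) >= delta).mean()

p_small = tail_prob(10)      # n = 10: the average still fluctuates a lot
p_large = tail_prob(5000)    # n = 5000: deviations of size 0.1 are very rare
```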
Exercise 3.7. In the case of the Binomial distribution with p = 1/2, use Stirling's formula

n! ≃ √(2π) e^{−n} n^{n+1/2}

to estimate the probability

Σ_{r≥nx} \binom{n}{r} (1/2^n)

and show that it decays geometrically in n. Can you calculate the geometric ratio

ρ(x) = lim_{n→∞} [Σ_{r≥nx} \binom{n}{r} (1/2^n)]^{1/n}

explicitly as a function of x for x > 1/2?
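One can probe the exercise numerically before doing the Stirling computation. The sketch below (our illustration) estimates the n-th root of the tail and compares it with the ratio suggested by the entropy heuristic, ρ(x) = 1/(2 x^x (1−x)^{1−x}); treat that formula as a conjecture to be tested, not as a result quoted from the notes.

```python
from math import comb

def tail(n, x):
    """P(S_n >= n x) for the Binomial(n, 1/2) count of heads."""
    return sum(comb(n, r) for r in range(int(n * x), n + 1)) / 2**n

x = 0.7
roots = [tail(n, x) ** (1.0 / n) for n in (200, 400, 800)]  # n-th roots of the tail
rho_conjecture = 1.0 / (2 * x**x * (1 - x) ** (1 - x))
gap = abs(roots[-1] - rho_conjecture)
```

At n = 800 the n-th root sits within about 0.005 of the conjectured ratio, consistent with geometric decay modulated by a polynomial prefactor.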
This is called the Strong Law of Large Numbers. Strong laws are statements that hold for almost all ω.
Let us look at functions of the form f_n = χ_{A_n}. It is easy to verify that f_n → 0 in probability if and only if P(A_n) → 0. On the other hand

then

P[ω : lim_{n→∞} χ_{A_n}(ω) = 0] = 1.

is the same as ∩_{n=1}^{∞} ∪_{j=n}^{∞} A_j, or the event that infinitely many of the events {A_j} occur.
Exercise 3.8. Prove the following variant of the monotone convergence theorem. If f_n(ω) ≥ 0 are measurable functions, the set E = {ω : S(ω) = Σ_n f_n(ω) < ∞} is measurable and S(ω) is a measurable function on E. If each f_n is integrable and Σ_n E[f_n] < ∞, then P[E] = 1, S(ω) is integrable and E[S(ω)] = Σ_n E[f_n(ω)].
Proof. By the previous exercise if Σ_n P(A_n) < ∞, then Σ_n χ_{A_n}(ω) = S(ω) is finite almost everywhere and

E(S(ω)) = Σ_n P(A_n) < ∞.

If an infinite series has a finite sum then the n-th term must go to 0, thereby proving the direct part. To prove the converse we need to show that if Σ_n P(A_n) = ∞, then lim_{m→∞} P(∪_{n=m}^{∞} A_n) > 0. We can use independence and the continuity of probability under monotone limits, to calculate for every m,

P(∪_{n=m}^{∞} A_n) = 1 − P(∩_{n=m}^{∞} A_n^c)
                  = 1 − Π_{n=m}^{∞} (1 − P(A_n))    (by independence)
                  ≥ 1 − e^{−Σ_{n=m}^{∞} P(A_n)}
                  = 1

and we are done. We have used the inequality 1 − x ≤ e^{−x} familiar in the study of infinite products.
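Both halves of the Borel-Cantelli lemma can be seen in a simulation (our illustration; the horizon, the number of paths and the two sequences P(A_n) = 1/n² and P(A_n) = 1/n are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N, paths = 20000, 200
n = np.arange(1, N + 1)

def last_event_index(p):
    """For each sample path, the index of the last A_n that occurs (0 if none)."""
    hits = rng.uniform(size=(paths, N)) < p          # independent events A_n
    return np.array([row.nonzero()[0][-1] + 1 if row.any() else 0 for row in hits])

# Summable case: with high probability only finitely many, early, events occur.
early = (last_event_index(1.0 / n**2) <= 100).mean()

# Divergent case: events keep occurring; many paths see one in the last half.
late = (last_event_index(1.0 / n) > N // 2).mean()
```

In the divergent case the probability of an event in (N/2, N] is exactly 1/2 here (the product Π(1 − 1/n) telescopes), matching the converse half of the lemma within a finite horizon.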
Another digression that we want to make into measure theory at this point
is to discuss Kolmogorov’s consistency theorem. How do we know that there
are probability spaces that admit a sequence of independent identically dis-
tributed random variables with specified distributions? By the construction
of product measures that we outlined earlier we can construct a measure on
Rn for every n which is the joint distribution of the first n random variables.
Let us denote by Pn this probability measure on Rn . They are consistent
in the sense that if we project in the natural way from R^{n+1} → R^n, P_{n+1} projects to P_n. Such a family is called a consistent family of finite dimensional distributions. We look at the space Ω = R^∞ consisting of all real sequences ω = {x_n : n ≥ 1} with a natural σ-field Σ generated by the field F of finite dimensional cylinder sets of the form B = {ω : (x_1, · · · , x_n) ∈ A}, where A varies over Borel sets in R^n and n varies over positive integers.
converges with probability 1. The basic steps are the following inequalities due to Kolmogorov and Lévy that control the behaviour of sums of independent random variables. They both deal with the problem of estimating

T_n(ω) = sup_{1≤k≤n} |S_k(ω)| = sup_{1≤k≤n} |Σ_{j=1}^{k} X_j(ω)|
Proof. Clearly (iii) ⇒ (ii) ⇒ (i) are trivial. We will establish (i) ⇒ (ii) ⇒ (iii).
(i) ⇒ (ii). The characteristic functions φ_j(t) of X_j are such that

φ(t) = Π_{j=1}^{∞} φ_j(t)

lim_{n,m→∞} Π_{j=m+1}^{n} φ_j(t) = 1

lim_{n,m→∞} P{|S_n − S_m| ≥ δ} = 0

establishing (ii).
(ii) ⇒ (iii). To establish (iii), because of Exercise 3.11 below, we need only show that for every δ > 0

lim_{n,m→∞} P[sup_{m<k≤n} |S_k − S_m| ≥ δ] = 0
Exercise 3.10. Prove the inequality 1 − cos 2t ≤ 4(1 − cos t) for all real t.
Deduce the inequality 1 − Real φ(2t) ≤ 4[1 − Real φ(t)], valid for any char-
acteristic function. Conclude that if a sequence of characteristic functions
converges to 1 in an interval around 0, then it converges to 1 for all real t.
lim_{n,m→∞} P{|S_n − S_m| ≥ δ} = 0

for every δ > 0, then there is a limiting random variable S(ω) such that

P[lim_{n→∞} S_n(ω) = S(ω)] = 1.

S(ω) = Σ_{i=1}^{∞} X_i(ω)
lim_{n,m→∞} P[sup_{m<k≤n} |S_k − S_m| ≥ δ] ≤ lim_{n,m→∞} (1/δ²) Σ_{j=m+1}^{n} E(X_j²)
= lim_{n,m→∞} (1/δ²) Σ_{j=m+1}^{n} Var(X_j) = 0.
Therefore

lim_{n,m→∞} P{sup_{m<k≤n} |S_k − S_m| ≥ δ} = 0.

lim_{n,m→∞} P{|S_n − S_m| ≥ δ} = 0
Σ_{j=1}^{∞} σ_j² ≤ (1/δ²)[ℓ² + (ℓ + C)²].
Proof. We define

Y_n = X_n if |X_n| ≤ n, and Y_n = 0 if |X_n| > n

lim_{n→∞} b_n = 0
and

Σ_n c_n/n² ≤ Σ_n E[Y_n²]/n² = Σ_n (1/n²) ∫_{|x|≤n} x² dα
= ∫ x² (Σ_{n≥|x|} 1/n²) dα ≤ C ∫ |x| dα < ∞
converge almost surely. It is elementary to verify that for any series Σ_n x_n/n that converges, (x_1 + · · · + x_n)/n → 0 as n → ∞. We therefore conclude that

P[lim_{n→∞} ((X_1 + · · · + X_n)/n − (b_1 + · · · + b_n)/n) = 0] = 1

Since b_n → 0 as n → ∞, the theorem is proved.
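The elementary fact invoked above, that convergence of Σ_n x_n/n forces the averages (x_1 + · · · + x_n)/n to 0 (a special case of Kronecker's lemma), can be sanity-checked on a concrete sequence; here x_n = (−1)^n, for which Σ (−1)^n/n converges by the alternating series test (the sequence and sample sizes are our choices).

```python
# Partial averages (x_1 + ... + x_n)/n for x_n = (-1)^n: they equal
# 0 for even n and -1/n for odd n, so they tend to 0.
def average(n):
    return sum((-1) ** k for k in range(1, n + 1)) / n

vals = [abs(average(n)) for n in (11, 101, 1001)]
```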
Exercise 3.14. Let X be a nonnegative random variable. Then

E[X] − 1 ≤ Σ_{n=1}^{∞} P[X ≥ n] ≤ E[X]

In particular E[X] < ∞ if and only if Σ_n P[X ≥ n] < ∞.
Exercise 3.15. If for a sequence of i.i.d. random variables X_1, · · · , X_n, · · · , the strong law of large numbers holds with some limit, i.e.

P[lim_{n→∞} S_n/n = ξ] = 1

for some random variable ξ, which may or may not be a constant with probability 1, then show that necessarily E|X_i| < ∞. Consequently ξ = E(X_i) with probability 1.
One may ask why the limit cannot be a proper random variable. There is a general theorem that forbids it, called Kolmogorov's zero-one law. Let us look at the space Ω of real sequences $\{x_n : n \ge 1\}$. We have the σ-field B, the product σ-field on Ω. In addition we have the sub σ-fields $B_n$ generated by $\{x_j : j \ge n\}$. The $B_n$ decrease with n, and $B_\infty = \cap_n B_n$, which is also a σ-field, is called the tail σ-field. A typical set in $B_\infty$ depends only on the tail behavior of the sequence. For example the sets $\{\omega : x_n \text{ is bounded}\}$ and $\{\omega : \limsup_n x_n = 1\}$ are in $B_\infty$, whereas $\{\omega : \sup_n |x_n| = 1\}$ is not.
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\Big[ -\frac{x^2}{2\sigma^2} \Big]. \tag{3.11}$$
$$\psi_n(t) = \Big[ \phi\Big( \frac{t}{\sqrt n} \Big) \Big]^n.$$
We use the expansion
$$\phi(t) = 1 - \frac{\sigma^2 t^2}{2} + o(t^2)$$
to conclude that
$$\phi\Big( \frac{t}{\sqrt n} \Big) = 1 - \frac{\sigma^2 t^2}{2n} + o\Big( \frac{1}{n} \Big),$$
and it then follows that
$$\lim_{n\to\infty} \psi_n(t) = \psi(t) = \exp\Big[ -\frac{\sigma^2 t^2}{2} \Big].$$
Since $\psi(t)$ is the characteristic function of the normal distribution with density $p(x)$ given by equation (3.11), we are done.
Exercise 3.17. A more direct proof is possible in some special cases. For instance if each $X_i = \pm 1$ with probability $\frac12$, $S_n$ can take the values $n - 2k$ with $0 \le k \le n$,
$$P[\,S_n = 2k - n\,] = \frac{1}{2^n} \binom{n}{k}$$
and
$$P\Big[\, a \le \frac{S_n}{\sqrt n} \le b \,\Big] = \frac{1}{2^n} \sum_{k\,:\, a\sqrt n \,\le\, 2k - n \,\le\, b\sqrt n} \binom{n}{k}.$$
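The binomial probability in Exercise 3.17 can be evaluated exactly and compared with the normal approximation; a sketch (the choices $n = 1000$ and $[a,b] = [-1,1]$ are arbitrary):

```python
import math

def binom_prob(n, a, b):
    """P[a <= S_n/sqrt(n) <= b] for S_n a sum of n independent +-1 coin flips:
    sum of C(n,k)/2^n over k with a*sqrt(n) <= 2k - n <= b*sqrt(n)."""
    total = 0
    for k in range(n + 1):
        if a * math.sqrt(n) <= 2 * k - n <= b * math.sqrt(n):
            total += math.comb(n, k)
    return total / 2 ** n

def normal_prob(a, b):
    """Phi(b) - Phi(a) for the standard normal distribution, via erf."""
    phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
    return phi(b) - phi(a)
```

For $n = 1000$ the two agree to within a few percent; the residual gap is the discreteness (lattice) effect, which vanishes as $n \to \infty$.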
Actually for the proof of the central limit theorem we do not need the random variables $\{X_j\}$ to have identical distributions. Let us suppose that they all have zero means and that the variance of $X_j$ is $\sigma_j^2$. Define $s_n^2 = \sigma_1^2 + \cdots + \sigma_n^2$ and assume $s_n^2 \to \infty$ as $n \to \infty$. Then $Y_n = \frac{S_n}{s_n}$ has zero mean and unit variance. It is not unreasonable to expect that
$$\lim_{n\to\infty} P[\,Y_n \le a\,] = \int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}}\, \exp\Big[ -\frac{x^2}{2} \Big]\, dx$$
under certain mild conditions. Lindeberg's condition, which demands that
$$\lim_{n\to\infty} \frac{1}{s_n^2} \sum_{j=1}^{n} \int_{|x| \ge \epsilon s_n} x^2\, d\alpha_j = 0$$
for each $\epsilon > 0$, is sufficient for the central limit theorem to hold. Here $\alpha_j$ denotes the distribution of $X_j$.
Proof. The first step in proving this limit theorem, as well as other limit theorems that we will prove, is to rewrite
$$\psi_n(t) = \prod_{j=1}^{n} \phi_{n,j}(t), \qquad \phi_{n,j}(t) = \phi_j\Big( \frac{t}{s_n} \Big),$$
and to show that
$$\sup_n\, \sup_{|t|\le T}\, \sum_{j=1}^{n} \big| \phi_{n,j}(t) - 1 \big| < \infty.$$
Granting this, by the expansion $\log z = (z-1) - \frac{(z-1)^2}{2} + \cdots$, valid for $|z - 1|$ small,
$$\limsup_{n\to\infty} \sup_{|t|\le T} \Big| \sum_{j=1}^{n} \big( \log \phi_{n,j}(t) - [\phi_{n,j}(t) - 1] \big) \Big| \le \limsup_{n\to\infty} \sup_{|t|\le T} C \sum_{j=1}^{n} \big| \phi_{n,j}(t) - 1 \big|^2$$
$$\le C\, \limsup_{n\to\infty} \Big[ \sup_{1\le j\le n}\, \sup_{|t|\le T} \big| \phi_{n,j}(t) - 1 \big| \Big] \Big[ \sup_{|t|\le T} \sum_{j=1}^{n} \big| \phi_{n,j}(t) - 1 \big| \Big] = 0.$$
We see that
$$\sup_{|t|\le T} \big| \phi_{n,j}(t) - 1 \big| = \sup_{|t|\le T} \Big| \int \big( e^{itx} - 1 \big)\, d\alpha_{n,j} \Big| = \sup_{|t|\le T} \Big| \int \Big( e^{\frac{itx}{s_n}} - 1 \Big)\, d\alpha_j \Big| = \sup_{|t|\le T} \Big| \int \Big( e^{\frac{itx}{s_n}} - 1 - \frac{itx}{s_n} \Big)\, d\alpha_j \Big| \tag{3.12}$$
$$\le C_T \int \frac{x^2}{s_n^2}\, d\alpha_j \tag{3.13}$$
$$= C_T \int_{|x| < \epsilon s_n} \frac{x^2}{s_n^2}\, d\alpha_j + C_T \int_{|x| \ge \epsilon s_n} \frac{x^2}{s_n^2}\, d\alpha_j \le C_T\, \epsilon^2 + C_T\, \frac{1}{s_n^2} \int_{|x| \ge \epsilon s_n} x^2\, d\alpha_j. \tag{3.14}$$
We have used the mean zero condition in deriving equation (3.12) and the estimate $|e^{ix} - 1 - ix| \le c\,x^2$ to get to equation (3.13). If we let $n \to \infty$, by Lindeberg's condition, the second term of equation (3.14) goes to 0. Therefore
$$\limsup_{n\to\infty}\, \sup_{1\le j\le n}\, \sup_{|t|\le T} \big| \phi_{n,j}(t) - 1 \big| \le \epsilon^2\, C_T,$$
and since $\epsilon > 0$ is arbitrary the left hand side is 0.
Moreover
$$\sup_{|t|\le T} \sum_{j=1}^{n} \big| \phi_{n,j}(t) - 1 \big| \le C_T \sum_{j=1}^{n} \int \frac{x^2}{s_n^2}\, d\alpha_j = C_T\, \frac{1}{s_n^2} \sum_{j=1}^{n} \sigma_j^2 = C_T.$$
$$\limsup_{n\to\infty} \sup_{|t|\le T} \Big| \sum_{j=1}^{n} \big( \phi_{n,j}(t) - 1 \big) + \frac{t^2}{2} \Big| \le \limsup_{n\to\infty} \sup_{|t|\le T} \sum_{j=1}^{n} \Big| \phi_{n,j}(t) - 1 + \frac{\sigma_j^2 t^2}{2 s_n^2} \Big|$$
$$= \limsup_{n\to\infty} \sup_{|t|\le T} \sum_{j=1}^{n} \Big| \int \Big( e^{\frac{itx}{s_n}} - 1 - \frac{itx}{s_n} + \frac{t^2 x^2}{2 s_n^2} \Big)\, d\alpha_j \Big|$$
$$\le \limsup_{n\to\infty} \sup_{|t|\le T} \sum_{j=1}^{n} \int_{|x| < \epsilon s_n} \Big| e^{\frac{itx}{s_n}} - 1 - \frac{itx}{s_n} + \frac{t^2 x^2}{2 s_n^2} \Big|\, d\alpha_j + \limsup_{n\to\infty} \sup_{|t|\le T} \sum_{j=1}^{n} \int_{|x| \ge \epsilon s_n} \Big| e^{\frac{itx}{s_n}} - 1 - \frac{itx}{s_n} + \frac{t^2 x^2}{2 s_n^2} \Big|\, d\alpha_j$$
$$\le \lim_{n\to\infty} C_T \sum_{j=1}^{n} \int_{|x| < \epsilon s_n} \frac{|x|^3}{s_n^3}\, d\alpha_j + \lim_{n\to\infty} C_T \sum_{j=1}^{n} \int_{|x| \ge \epsilon s_n} \frac{x^2}{s_n^2}\, d\alpha_j$$
$$\le \epsilon\, C_T\, \limsup_{n\to\infty} \sum_{j=1}^{n} \int \frac{x^2}{s_n^2}\, d\alpha_j + \lim_{n\to\infty} C_T \sum_{j=1}^{n} \int_{|x| \ge \epsilon s_n} \frac{x^2}{s_n^2}\, d\alpha_j = \epsilon\, C_T.$$
Since $\epsilon > 0$ is arbitrary, the limit is 0.
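The theorem can be illustrated by simulation with non-identical summands. In the sketch below (all choices arbitrary, not from the text) $X_j$ is uniform on $[-j^{1/4}, j^{1/4}]$, so $\sigma_j^2 = j^{1/2}/3$; since the ranges grow like $n^{1/4}$ while $s_n$ grows like $n^{3/4}$, Lindeberg's condition holds and $S_n/s_n$ should be approximately standard normal:

```python
import math
import random

def normalized_sum(n, rng):
    """Y_n = S_n / s_n for independent X_j uniform on [-j**0.25, j**0.25]
    (mean zero, Var X_j = j**0.5 / 3)."""
    s = sum(rng.uniform(-j ** 0.25, j ** 0.25) for j in range(1, n + 1))
    s_n = math.sqrt(sum(j ** 0.5 / 3 for j in range(1, n + 1)))
    return s / s_n

rng = random.Random(1)
samples = [normalized_sum(200, rng) for _ in range(2000)]
frac_below_zero = sum(1 for y in samples if y <= 0) / len(samples)
frac_within_one = sum(1 for y in samples if abs(y) <= 1) / len(samples)
```

For a standard normal limit one expects roughly 0.5 of the samples below 0 and roughly 0.68 within one standard deviation.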
Remark 3.3. The key step in the proof of the central limit theorem under Lindeberg's condition, as well as in other limit theorems for sums of independent random variables, is the analysis of products
$$\psi_n(t) = \prod_{j=1}^{k_n} \phi_{n,j}(t).$$
The idea is to replace each $\phi_{n,j}(t)$ by $\exp[\phi_{n,j}(t) - 1]$, changing the product to the exponential of a sum. Although each $\phi_{n,j}(t)$ is close to 1, making the idea reasonable, in order for the idea to work one has to show that the sum $\sum_{j=1}^{k_n} |\phi_{n,j}(t) - 1|^2$ is negligible. This requires the boundedness of $\sum_{j=1}^{k_n} |\phi_{n,j}(t) - 1|$. One has to use the mean 0 condition or some suitable centering condition to cancel the first term in the expansion of $\phi_{n,j}(t) - 1$ and control the rest from sums of the variances.
Exercise 3.18. Lyapunov's condition is the following: for some $\delta > 0$,
$$\lim_{n\to\infty} \frac{1}{s_n^{2+\delta}} \sum_{j=1}^{n} \int |x|^{2+\delta}\, d\alpha_j = 0.$$
Show that Lyapunov's condition implies Lindeberg's condition.
As we stated in the previous section, we want to study the behavior of the sum of a large number of independent random variables. We have $k_n$ independent random variables $\{X_{n,j} : 1 \le j \le k_n\}$ with respective distributions $\{\alpha_{n,j}\}$. We are interested in the distribution $\mu_n$ of $Z_n = \sum_{j=1}^{k_n} X_{n,j}$. One important assumption that we will make on the random variables $\{X_{n,j}\}$ is that no single one is significant. More precisely, for every $\delta > 0$,
$$\lim_{n\to\infty}\, \sup_{1\le j\le k_n} \alpha_{n,j}[\,|x| \ge \delta\,] = 0.$$
With the centering constants $a_{n,j} = \int_{|x|\le 1} x\, d\alpha_{n,j}$ it therefore follows that
$$\lim_{n\to\infty}\, \sup_{1\le j\le k_n} |a_{n,j}| = 0.$$
This means that the $\alpha'_{n,j}$ are uniformly infinitesimal just as the $\alpha_{n,j}$ were. Let us suppose that $n$ is so large that $\sup_{1\le j\le k_n} |a_{n,j}| \le \frac14$. The advantage in going from $\alpha_{n,j}$ to $\alpha'_{n,j}$ is that the latter are better centered and we can calculate
$$a'_{n,j} = \int_{|x|\le 1} x\, d\alpha'_{n,j} = \int_{|x - a_{n,j}|\le 1} (x - a_{n,j})\, d\alpha_{n,j} = \int_{|x - a_{n,j}|\le 1} x\, d\alpha_{n,j} - a_{n,j}\, \alpha_{n,j}[\,|x - a_{n,j}| \le 1\,]$$
$$= \int_{|x - a_{n,j}|\le 1} x\, d\alpha_{n,j} - a_{n,j} + a_{n,j}\, \alpha_{n,j}[\,|x - a_{n,j}| > 1\,],$$
so that
$$|a'_{n,j}| \le C\, \alpha'_{n,j}\big[\,|x| \ge \tfrac34\,\big] \le C\, \alpha_{n,j}\big[\,|x| \ge \tfrac12\,\big].$$
In other words we may assume without loss of generality that the $\alpha_{n,j}$ satisfy the bound
$$|a_{n,j}| \le C\, \alpha_{n,j}\big[\,|x| \ge \tfrac12\,\big] \tag{3.16}$$
and forget all about the change from $\alpha_{n,j}$ to $\alpha'_{n,j}$. We will drop the primes and stay with just $\alpha_{n,j}$. Then, just as in the proof of the Lindeberg theorem, we proceed to estimate
$$\limsup_{n\to\infty} \sup_{|t|\le T} \big| \log \hat\lambda_n(t) - \log \hat\mu_n(t) \big| \le \limsup_{n\to\infty} \sup_{|t|\le T} \Big| \sum_{j=1}^{k_n} \big[ \log \hat\alpha_{n,j}(t) - (\hat\alpha_{n,j}(t) - 1) \big] \Big|$$
$$\le \limsup_{n\to\infty} \sup_{|t|\le T} \sum_{j=1}^{k_n} \big| \log \hat\alpha_{n,j}(t) - (\hat\alpha_{n,j}(t) - 1) \big| \le \limsup_{n\to\infty} \sup_{|t|\le T} C \sum_{j=1}^{k_n} \big| \hat\alpha_{n,j}(t) - 1 \big|^2 = 0.$$
$$\hat\lambda_n(t) = \exp\Big[ \sum_{j=1}^{k_n} \big( \hat\alpha_{n,j}(t) - 1 \big) + itA_n \Big] = \exp[f_n(t)]$$
have a limit, which is again a characteristic function. Since the limiting characteristic function is continuous and equals 1 at $t = 0$, and the convergence is uniform near 0, on some small interval $|t| \le T_0$ we have the bound
$$\sup_n\, \sup_{|t|\le T_0} \big[ -\mathrm{Re}\, f_n(t) \big] \le C$$
or equivalently
$$\sup_n\, \sup_{|t|\le T_0}\, \sum_{j=1}^{k_n} \int (1 - \cos tx)\, d\alpha_{n,j} \le C.$$
Using the inequality $1 - \cos 2t \le 4(1 - \cos t)$ of Exercise 3.10, the bound extends from $T_0$ to arbitrary $T$:
$$\sup_n\, \sup_{|t|\le T}\, \sum_{j=1}^{k_n} \int (1 - \cos tx)\, d\alpha_{n,j} \le C_T.$$
If we integrate the inequality with respect to $t$ over the interval $[-T, T]$ and divide by $2T$, we get
$$\sup_n\, \sum_{j=1}^{k_n} \int \Big( 1 - \frac{\sin Tx}{Tx} \Big)\, d\alpha_{n,j} \le C_T.$$
Since $1 - \frac{\sin Tx}{Tx}$ is bounded below by a positive constant on $|x| \ge \delta$ and behaves like a multiple of $x^2$ near 0, it follows that
$$\sup_n\, \sum_{j=1}^{k_n} \alpha_{n,j}[\,|x| \ge \delta\,] \le C_\delta < \infty$$
and
$$\sup_n\, \sum_{j=1}^{k_n} \int_{|x|\le 1} x^2\, d\alpha_{n,j} \le C < \infty.$$
$$\lim_{n\to\infty} |\hat\mu_n(t)|^2 = \lim_{n\to\infty} \prod_{j=1}^{k_n} |\hat\alpha_{n,j}(t)|^2$$
exists, and repeating the argument above for the symmetrized distributions yields
$$\sup_n\, \sup_{|t|\le T_0}\, \sum_{j=1}^{k_n} \big[ 1 - |\hat\alpha_{n,j}(t)|^2 \big] \le C_0 < \infty$$
$$\sup_n\, \sup_{|t|\le T}\, \sum_{j=1}^{k_n} \big[ 1 - |\hat\alpha_{n,j}(t)|^2 \big] \le C_T < \infty.$$
Since $|\hat\alpha_{n,j}(t)|^2$ is the characteristic function of the symmetrization of $\alpha_{n,j}$, denoted $|\alpha_{n,j}|^2$, the previous reasoning gives
$$\sup_n\, \sum_{j=1}^{k_n} |\alpha_{n,j}|^2[\,|x| \ge \delta\,] \le C_\delta < \infty \tag{3.18}$$
and
$$\sup_n\, \sum_{j=1}^{k_n} \iint_{|x-y|\le 2} (x - y)^2\, d\alpha_{n,j}(x)\, d\alpha_{n,j}(y) \le C < \infty. \tag{3.19}$$
From these one obtains
$$\sup_n\, \sum_{j=1}^{k_n} \alpha_{n,j}[\, x : |x| \ge \delta\,] \le C_\delta < \infty. \tag{3.20}$$
One can now derive (3.17) from (3.20) and (3.21) as in the earlier part.
Exercise 3.20. Let $k_n = n^2$ and $\alpha_{n,j} = \delta_{\frac1n}$ for $1 \le j \le n^2$, so that $\mu_n = \delta_n$. Show that, without centering, $\lambda_n * \delta_{-n}$ converges to a different limit.
Exercise 3.22. Show that for any $\lambda \ge 0$ the Poisson distribution with parameter $\lambda$,
$$p_\lambda(n) = \frac{e^{-\lambda}\, \lambda^n}{n!} \quad\text{for } n \ge 0,$$
is infinitely divisible.
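Infinite divisibility of the Poisson law can also be seen numerically: convolving two copies of a Poisson(λ/2) probability mass function returns Poisson(λ) exactly. A sketch (the value of λ and the truncation level are arbitrary choices):

```python
import math

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def convolve(p, q, size):
    """PMF of the sum of two independent nonnegative integer random variables."""
    return [sum(p[j] * q[k - j] for j in range(k + 1)) for k in range(size)]

lam, size = 3.0, 40
half = [poisson_pmf(lam / 2, k) for k in range(size)]
conv = convolve(half, half, size)          # Poisson(lam/2) * Poisson(lam/2)
full = [poisson_pmf(lam, k) for k in range(size)]
```

The identity is exact (the binomial theorem inside the convolution sum), so the two lists agree to floating-point precision; iterating shows Poisson(λ) is an n-fold convolution of Poisson(λ/n) for every n.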
Exercise 3.23. Show that any probability distribution supported on a finite set $\{x_1, \ldots, x_k\}$ with
$$\mu[\{x_j\}] = p_j,$$
$p_j \ge 0$, $\sum_{j=1}^{k} p_j = 1$, is infinitely divisible if and only if it is degenerate, i.e. $\mu[\{x_j\}] = 1$ for some $j$.
Exercise 3.24. Show that for any nonnegative finite measure $\alpha$ with total mass $a$, the distribution
$$e(\alpha) = e^{-a} \sum_{j=0}^{\infty} \frac{\alpha^{*j}}{j!},$$
which is a convex combination of the convolution powers $\alpha^{*j}$ with weights $e^{-a}\frac{a^j}{j!}$, with characteristic function
$$\widehat{e(\alpha)}(t) = \exp\Big[ \int (e^{itx} - 1)\, d\alpha \Big],$$
is infinitely divisible.
We can make any reasonable choice for $\theta(\cdot)$; we will need it to be a bounded continuous function with
$$|\theta(x) - x| \le C |x|^3$$
near 0. Then
$$\hat\mu(t) = \exp\Big[ \int \big( e^{itx} - 1 - it\,\theta(x) \big)\, dM + ita \Big]$$
is a characteristic function for any measure $M$ with finite total mass. In fact it is the characteristic function of an infinitely divisible probability distribution. It is not necessary that $M$ be a finite measure for $\mu$ to make sense. $M$ could be infinite, but in such a way that it is finite on $\{x : |x| \ge \delta\}$ for every $\delta > 0$, and near 0 it integrates $x^2$, i.e.,
$$\int \frac{x^2}{1 + x^2}\, dM < \infty. \tag{3.25}$$
Using the estimate
$$|e^{itx} - 1 - itx| \le C_T\, x^2$$
for $|t| \le T$, one checks that $\hat\mu_\delta(t) \to \hat\mu(t)$ uniformly on bounded intervals, where $\hat\mu(t)$ is given by the integral above.
Theorem 3.20. For every admissible Lévy measure $M$, $\sigma^2 \ge 0$ and real $a$,
$$\hat\mu(t) = \exp\Big[ \int \big( e^{itx} - 1 - it\,\theta(x) \big)\, dM + ita - \frac{\sigma^2 t^2}{2} \Big]$$
is the characteristic function of an infinitely divisible distribution, and $\mu_n \to \mu$ if and only if (3.26) holds together with the following. For some (and therefore for every) $\ell > 0$ such that $\pm\ell$ are continuity points for $M$, i.e. $M\{\pm\ell\} = 0$,
$$\lim_{n\to\infty} \Big[ \sigma_n^2 + \int_{-\ell}^{\ell} x^2\, dM_n \Big] = \sigma^2 + \int_{-\ell}^{\ell} x^2\, dM, \tag{3.27}$$
and
$$a_n \to a \quad\text{as } n \to \infty. \tag{3.28}$$
Proof. Let us prove the sufficiency first. Condition (3.26) implies that for every $\ell$ such that $\pm\ell$ are continuity points of $M$,
$$\lim_{n\to\infty} \int_{|x|\ge\ell} \big( e^{itx} - 1 - it\,\theta(x) \big)\, dM_n = \int_{|x|\ge\ell} \big( e^{itx} - 1 - it\,\theta(x) \big)\, dM$$
and
$$\lim_{\ell\to0}\, \limsup_{n\to\infty} \Big| \int_{-\ell}^{\ell} \Big( e^{itx} - 1 - it\,\theta(x) + \frac{t^2 x^2}{2} \Big)\, dM_n - \int_{-\ell}^{\ell} \Big( e^{itx} - 1 - it\,\theta(x) + \frac{t^2 x^2}{2} \Big)\, dM \Big| = 0.$$
Combining these with (3.27) we conclude that
$$\lim_{n\to\infty} \Big[ -\frac{\sigma_n^2 t^2}{2} + \int \big( e^{itx} - 1 - it\,\theta(x) \big)\, dM_n \Big] = -\frac{\sigma^2 t^2}{2} + \int \big( e^{itx} - 1 - it\,\theta(x) \big)\, dM.$$
and
$$\int_{-\ell}^{\ell} |x|^3\, dM_n \le \ell \int_{-\ell}^{\ell} |x|^2\, dM_n, \qquad \sup_n \Big[ \sigma_n^2 + \int_{-\ell}^{\ell} |x|^2\, dM_n \Big] < \infty. \tag{3.31}$$
Consequently
$$\lim_{t\to\infty} \frac{\psi(t)}{t^2} = -\frac{\sigma_1^2}{2} = -\frac{\sigma_2^2}{2},$$
so that $\sigma_1^2 = \sigma_2^2$, leaving us with
$$\psi(t) = \int \big( e^{itx} - 1 - it\,\theta(x) \big)\, dM_1 + ita_1 = \int \big( e^{itx} - 1 - it\,\theta(x) \big)\, dM_2 + ita_2.$$
Taking
$$H(s,t) = \frac{\psi(t+s) + \psi(t-s)}{2} - \psi(t)$$
we get
$$\int e^{itx}\,(1 - \cos sx)\, dM_1 = \int e^{itx}\,(1 - \cos sx)\, dM_2$$
for all $t$ and $s$. Since we can and do assume that $M\{0\} = 0$ for any admissible Lévy measure $M$, we have $M_1 = M_2$. If we know that $\sigma_1^2 = \sigma_2^2$ and $M_1 = M_2$, it is easy to see that $a_1$ must equal $a_2$.
Applications. In the notation above,
$$a = \lim_{n\to\infty} \sum_j a_{n,j}, \qquad\text{where}\quad a_{n,j} = \int_{|x|\le 1} x\, d\alpha_{n,j}.$$
By the mean zero condition and the Cauchy–Schwarz inequality,
$$|a_{n,j}|^2 = \Big| \int_{|x|\le 1} x\, d\alpha_{n,j} \Big|^2 = \Big| \int_{|x|>1} x\, d\alpha_{n,j} \Big|^2 \le \alpha_{n,j}[\,|x| > 1\,] \int |x|^2\, d\alpha_{n,j}$$
and
$$\sum_{j=1}^{k_n} |a_{n,j}|^2 \le \Big[ \sum_j \int |x|^2\, d\alpha_{n,j} \Big]\, \sup_{1\le j\le k_n} \alpha_{n,j}[\,|x| > 1\,] \le \sigma_n^2\, \sup_{1\le j\le k_n} \alpha_{n,j}[\,|x| > 1\,] \to 0.$$
Because $\sum_{j=1}^{k_n} |a_{n,j}|^2 \to 0$ as $n \to \infty$ we must have
$$\sigma^2 = \lim_{n\to\infty} \sum_j \int_{|x|\le\ell} |x|^2\, d\alpha_{n,j},$$
which can be $+\infty$. Because the normal distribution has an infinitely long tail, i.e. the probability of exceeding any given value is positive, we must have
$$P\{\,Z \ge a\,\} > 0$$
for any $a$. But $\{Z \ge a\}$ is an event that does not depend on the particular values of $X_1, \cdots, X_n$ and is therefore a set in the tail σ-field. By Kolmogorov's zero-one law $P\{Z \ge a\}$ must be either 0 or 1. Since it cannot be 0 it must be 1.
We will not prove this theorem in the most general case, which assumes only the existence of two moments. We will assume instead that $E[|X|^{2+\alpha}] < \infty$ for some $\alpha > 0$. We shall first reduce the proof to an estimate on the tail behavior of the distributions of $\frac{S_n}{\sqrt n}$ by a careful application of the Borel–Cantelli lemma. This estimate is obvious if $X_1, \cdots, X_n, \cdots$ are themselves normally distributed, and we will show how to extend it to a large class of distributions that satisfy the additional moment condition. It is clear that we are interested in showing that for $\lambda > \sqrt2$,
$$P\big[\, S_n \ge \lambda \sqrt{n \log\log n}\ \text{infinitely often} \,\big] = 0.$$
It would be sufficient, because of the Borel–Cantelli lemma, to show that for any $\lambda > \sqrt2$,
$$\sum_n P\big[\, S_n \ge \lambda \sqrt{n \log\log n} \,\big] < \infty.$$
$$\sup_{1\le j\le k_n} P\big[\,|S_j| \ge \sigma\,\phi(k_{n-1})\,\big] \le \frac{E[S_{k_n}^2]}{[\sigma\,\phi(k_{n-1})]^2} = \frac{k_n}{[\sigma\,\phi(k_{n-1})]^2} = \frac{k_n}{\sigma^2\, k_{n-1}\, \log\log k_{n-1}} = o(1) \quad\text{as } n \to \infty. \tag{3.34}$$
By choosing $\sigma$ small enough so that $\lambda - \sigma > \sqrt2$, it is sufficient to show that for any $\lambda' > \sqrt2$,
$$\sum_n P\big[\, S_{k_n} \ge \lambda'\,\phi(k_{n-1}) \,\big] < \infty.$$
By picking $\rho$ sufficiently close to 1 (so that $\lambda'\sqrt\rho > \sqrt2$), because $\frac{\phi(k_{n-1})}{\phi(k_n)} \simeq \frac{1}{\sqrt\rho}$, we can reduce this to the convergence of
$$\sum_n P\big[\, S_{k_n} \ge \lambda\,\phi(k_n) \,\big] < \infty \tag{3.35}$$
for all $\lambda > \sqrt2$.
If we use the estimate $P[X \ge a] \le \exp[-\frac{a^2}{2}]$, valid for the standard normal distribution, we can verify (3.35):
$$\sum_n \exp\Big[ -\frac{\lambda^2\,(\phi(k_n))^2}{2\,k_n} \Big] < \infty$$
for any $\lambda > \sqrt2$.
To prove the lower bound we select again a subsequence, $k_n = [\rho^n]$ with some $\rho > 1$, and look at $Y_n = S_{k_{n+1}} - S_{k_n}$, which are now independent random variables. The tail probability of the normal distribution has the lower bound
$$P[X \ge a] = \frac{1}{\sqrt{2\pi}} \int_a^{\infty} \exp\Big[ -\frac{x^2}{2} \Big]\, dx \ge \frac{1}{\sqrt{2\pi}} \int_a^{\infty} \exp\Big[ -\frac{x^2}{2} - x \Big](x+1)\, dx \ge \frac{1}{\sqrt{2\pi}}\, \exp\Big[ -\frac{(a+1)^2}{2} \Big].$$
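Both Gaussian tail estimates used in this section — the upper bound $P[X \ge a] \le e^{-a^2/2}$ and the lower bound just derived — can be checked against the exact tail computed with the complementary error function; a sketch over a grid of $a \ge 0$:

```python
import math

def normal_tail(a):
    """P[X >= a] for a standard normal X, via the complementary error function."""
    return 0.5 * math.erfc(a / math.sqrt(2))

def bounds_hold(a):
    """Verify  (2*pi)^{-1/2} e^{-(a+1)^2/2}  <=  P[X >= a]  <=  e^{-a^2/2}."""
    upper = math.exp(-a * a / 2)
    lower = math.exp(-(a + 1) ** 2 / 2) / math.sqrt(2 * math.pi)
    return lower <= normal_tail(a) <= upper
```

Both bounds hold for all $a \ge 0$, which is the range used in the Borel–Cantelli arguments here.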
If we assume normal-like tail probabilities we can conclude that
$$\sum_n P\big[\, Y_n \ge \lambda\,\phi(k_{n+1}) \,\big] \ge \sum_n \exp\Big[ -\frac12 \Big( 1 + \frac{\lambda\,\phi(k_{n+1})}{\sqrt{\rho^{n+1} - \rho^n}} \Big)^{2}\, \Big] = +\infty$$
provided $\frac{\lambda^2 \rho}{2(\rho-1)} < 1$, and conclude, by the Borel–Cantelli lemma, that $Y_n = S_{k_{n+1}} - S_{k_n}$ exceeds $\lambda\,\phi(k_{n+1})$ infinitely often for such $\lambda$. On the other hand
and therefore,
$$P\Big[\, \limsup_{n} \frac{S_n}{\phi(n)} \ge \sqrt{\frac{2(\rho-1)}{\rho}} - \frac{\sqrt2}{\sqrt\rho} \,\Big] = 1.$$
$$\sup_a \Big|\, P\Big\{ \frac{S_n}{\sqrt n} \ge a \Big\} - \frac{1}{\sqrt{2\pi}} \int_a^{\infty} \exp\Big[ -\frac{x^2}{2} \Big]\, dx \,\Big| \le C\, n^{-\delta} \tag{3.36}$$
for some $\delta > 0$ in the central limit theorem. Such an error estimate is provided in the following theorem.
Theorem 3.26 (Berry–Esseen theorem). Assume that the i.i.d. sequence $\{X_j\}$ with mean zero and variance one satisfies the additional moment condition $E|X|^{2+\alpha} < \infty$ for some $\alpha > 0$. Then the estimate (3.36) holds for some $\delta > 0$.
Lemma 3.28. Let $\lambda, \mu$ be two probability measures with zero mean, having $\hat\lambda(\cdot), \hat\mu(\cdot)$ for respective characteristic functions. Then
$$\int_{-\infty}^{\infty} f_{a,h}(x)\, d(\lambda - \mu)(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{-iay}}{iy}\, \frac{\sin hy}{hy}\, \big[ \hat\lambda(y) - \hat\mu(y) \big]\, dy$$
where $f_{a,h}(x) = f_{a,\infty,h}(x)$ is given by
$$f_{a,h}(x) = \begin{cases} 0 & \text{for } -\infty < x \le a - h\\ \dfrac{x - a + h}{2h} & \text{for } a - h \le x \le a + h\\ 1 & \text{for } a + h \le x < \infty. \end{cases}$$
Proof. We just let b → ∞ in the previous lemma. Since |λ̂(y)− µ̂(y)| = o(|y|),
there is no problem in applying the Riemann-Lebesgue Lemma. We now
proceed with the proof of the theorem.
$$\lambda[[a,\infty)] \le \int f_{a-h,h}(x)\, d\lambda(x) \le \lambda[[a-2h,\infty)]$$
and
$$\mu[[a,\infty)] \le \int f_{a-h,h}(x)\, d\mu(x) \le \mu[[a-2h,\infty)].$$
Therefore
$$\sup_a \big| \lambda[[a,\infty)] - \mu[[a,\infty)] \big| \le \sup_a \Big| \int f_{a-h,h}(x)\, d(\lambda - \mu)(x) \Big| + 2hC \le \frac{1}{2\pi} \int_{-\infty}^{\infty} \big| \hat\lambda(y) - \hat\mu(y) \big|\, \frac{|\sin hy|}{h\, y^2}\, dy + 2hC. \tag{3.37}$$
$$\Big| \hat\lambda_n(y) - \exp\Big[ -\frac{y^2}{2} \Big] \Big| \le C\, \frac{|y|^{2+\alpha}}{n^{\alpha/2}} \qquad\text{for } |y| \le n^{\theta}.$$
Therefore, with $\theta = \frac{\alpha}{2(2+\alpha)}$,
$$\int_{-\infty}^{\infty} \Big| \hat\lambda_n(y) - e^{-y^2/2} \Big|\, \frac{|\sin hy|}{h\, y^2}\, dy = \int_{|y|\le n^\theta} \Big| \hat\lambda_n(y) - e^{-y^2/2} \Big|\, \frac{|\sin hy|}{h\, y^2}\, dy + \int_{|y|\ge n^\theta} \Big| \hat\lambda_n(y) - e^{-y^2/2} \Big|\, \frac{|\sin hy|}{h\, y^2}\, dy$$
$$\le \frac{C}{h\, n^{\alpha/2}} \int_{|y|\le n^\theta} |y|^{\alpha}\, dy + \frac{C}{h} \int_{|y|\ge n^\theta} \frac{dy}{y^2} \le C\, \frac{n^{(\alpha+1)\theta - \alpha/2} + n^{-\theta}}{h} = \frac{C}{h\, n^{\frac{\alpha}{2(2+\alpha)}}}.$$
Substituting this bound in (3.37) we get
$$\sup_a \big| \lambda_n[[a,\infty)] - \mu[[a,\infty)] \big| \le C_1\, h + \frac{C}{h\, n^{\frac{\alpha}{2(2+\alpha)}}}.$$
4.1 Conditioning
One of the key concepts in probability theory is the notion of conditional
probability and conditional expectation. Suppose that we have a probability
space (Ω, F , P ) consisting of a space Ω, a σ-field F of subsets of Ω and a
probability measure on the σ-field F . If we have a set A ∈ F of positive
measure, then conditioning with respect to A means we restrict ourselves to the set A: Ω gets replaced by A, and the σ-field F by the σ-field $F_A$ of subsets of A that are in F. For B ⊂ A we define
$$P_A(B) = \frac{P(B)}{P(A)}.$$
We could achieve the same thing by defining for arbitrary B ∈ F
$$P_A(B) = \frac{P(A \cap B)}{P(A)},$$
in which case $P_A(\cdot)$ is a measure defined on F as well, but one that is concentrated on A and assigns probability 0 to $A^c$. The definition of conditional probability is
$$P(B|A) = \frac{P(A \cap B)}{P(A)}.$$
Similarly the conditional expectation of an integrable function f(ω), given a set A ∈ F of positive probability, is defined to be
$$E\{f|A\} = \frac{\int_A f(\omega)\, dP}{P(A)}.$$
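A concrete finite instance of these definitions — conditioning a fair die on the event of an even outcome (the example itself is, of course, not from the text):

```python
from fractions import Fraction

# Fair die: P({i}) = 1/6 for i = 1..6.  Condition on A = {2, 4, 6}.
P = {i: Fraction(1, 6) for i in range(1, 7)}
A = {2, 4, 6}
P_A = sum(P[i] for i in A)                          # P(A) = 1/2

def cond_prob(B):
    """P(B|A) = P(A intersect B) / P(A)."""
    return sum(P[i] for i in B & A) / P_A

def cond_exp(f):
    """E{f|A} = (integral of f over A) / P(A)."""
    return sum(f(i) * P[i] for i in A) / P_A
```

For instance P({1,2}|A) = P({2})/P(A) = 1/3, and E[X | X even] = (2+4+6)/3 = 4.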
$$P(B\,|\,\xi = a_j) = P(B\,|\,A_j)$$
or more generally
$$P(B\,|\,\xi = a) = \frac{P(B \cap \{\xi = a\})}{P(\xi = a)}$$
provided $P(\xi = a) > 0$. One of our goals is to seek a definition that makes sense when $P(\xi = a) = 0$. This involves dividing 0 by 0 and should involve differentiation of some kind. In the countable case we may think of $P(B\,|\,\xi = a_j)$ as a function $f_B(\xi)$ which is equal to $P(B\,|\,A_j)$ on $\xi = a_j$. We can rewrite our definition of
$$f_B(a_j) = P(B\,|\,\xi = a_j)$$
as
$$\int_{\{\xi = a_j\}} f_B(\xi)\, dP = P(B \cap \{\xi = a_j\}) \quad\text{for each } j.$$
Sets of the form $\{\xi \in E\}$ form a sub σ-field $\Sigma \subset F$ and we can rewrite the definition as
$$\int_A f_B(\xi)\, dP = P(B \cap A)$$
for all A ∈ Σ. We will prove existence and uniqueness of such a g and call it the conditional expectation of G given Σ, denoted g = E[G|Σ].
The way to prove the above result will take us on a detour. A signed measure on a measurable space (Ω, F) is a set function λ(·) defined for A ∈ F which is countably additive but not necessarily nonnegative. Countable additivity is again in either of the following two equivalent senses:
$$\lambda(\cup_n A_n) = \sum_n \lambda(A_n)$$
for any countable collection of disjoint sets in F, or $\lambda(A_n) \to \lambda(A)$ whenever $A_n \downarrow A$ or $A_n \uparrow A$.
Examples of such λ can be constructed by taking the difference $\mu_1 - \mu_2$ of two nonnegative measures $\mu_1$ and $\mu_2$.
Proof. The key idea in the proof is that, since λ(Ω) is a finite number, if
λ(A) is large so is λ(Ac ) with an opposite sign. In fact, it is not hard to
see that ||λ(A)| − |λ(Ac )|| ≤ |λ(Ω)| for all A ∈ F . Another fact is that if
supB⊂A |λ(B)| and supB⊂Ac |λ(B)| are finite, so is supB |λ(B)|. Now let us
complete the proof. Given a subset A ∈ F with supB⊂A |λ(B)| = ∞, and
any positive number N, there is a subset A1 ∈ F with A1 ⊂ A such that
|λ(A1 )| ≥ N and supB⊂A1 |λ(B)| = ∞. This is obvious because if we pick
a set E ⊂ A with |λ(E)| very large so will λ(E c ) be. At least one of the
two sets E, $E^c$ will have the second property and we can call it $A_1$. If we proceed by induction we obtain a decreasing sequence $A_n$ with $|\lambda(A_n)| \to \infty$, which contradicts countable additivity.
Lemma 4.2. Given a subset A ∈ F with λ(A) = ` > 0 there is a subset
Ā ⊂ A that is totally positive with λ(Ā) ≥ `.
Proof. Let us define m = inf B⊂A λ(B). Since the empty set is included,
m ≤ 0. If m = 0 then A is totally positive and we are done. So let us assume
that m < 0. By the previous lemma m > −∞.
Let us find $B_1 \subset A$ such that $\lambda(B_1) \le \frac{m}{2}$. Then for $A_1 = A - B_1$ we have $A_1 \subset A$, $\lambda(A_1) \ge \ell$ and $\inf_{B \subset A_1} \lambda(B) \ge \frac{m}{2}$. By induction we can find $A_k$ with $A \supset A_1 \supset \cdots \supset A_k \supset \cdots$, $\lambda(A_k) \ge \ell$ for every k, and $\inf_{B \subset A_k} \lambda(B) \ge \frac{m}{2^k}$. Clearly $\bar A = \cap_k A_k$, which is the decreasing limit, works.
Proof. Totally positive sets are closed under countable unions, disjoint or not. Let us define $m^+ = \sup_A \lambda(A)$. If $m^+ = 0$ then $\lambda(A) \le 0$ for all A and we can take $\Omega^+ = \emptyset$ and $\Omega^- = \Omega$, which works. Assume that $m^+ > 0$. There exist sets $A_n$ with $\lambda(A_n) \ge m^+ - \frac1n$ and therefore totally positive subsets $\bar A_n$ of $A_n$ with $\lambda(\bar A_n) \ge m^+ - \frac1n$. Clearly $\Omega^+ = \cup_n \bar A_n$ is totally positive and $\lambda(\Omega^+) = m^+$. It is easy to see that $\Omega^- = \Omega - \Omega^+$ is totally negative. $\mu^\pm$ can be taken to be the restrictions of λ to $\Omega^\pm$.
The plan is to check that the function f(ω) defined above works. Since $\lambda_a$ gets more negative as a increases, Ω(a) decreases as a increases. There is trouble with sets of measure 0 for every comparison between two rationals $a_1$ and $a_2$. Collect all such troublesome sets (there are only a countable number of them) and throw them away. In other words we may assume without loss of generality that $\Omega(a_1) \subset \Omega(a_2)$ whenever $a_1 > a_2$. Clearly the sets $\{\omega : f(\omega) \ge a\}$ can be expressed through the Ω(b) for rational b, and this makes f measurable. If $A \subset \cap_a \Omega(a)$ then $\lambda(A) - a\mu(A) \ge 0$ for all a. If $\mu(A) > 0$, λ(A) has to be infinite, which is not possible. Therefore μ(A) has to be zero and by absolute continuity λ(A) = 0 as well. On the other hand if $A \cap \Omega(a) = \emptyset$ for all a, then $\lambda(A) - a\mu(A) \le 0$ for all a, and again if $\mu(A) > 0$, $\lambda(A) = -\infty$, which is not possible either. Therefore μ(A) and, by absolute continuity, λ(A) are zero. This proves that f(ω) is finite almost everywhere with respect to both λ and μ. Let us take two real numbers $a < b$ and consider $E_{a,b} = \{\omega : a \le f(\omega) \le b\}$. It is clear that the set $E_{a,b}$ is in $\Omega(a')$ and $\Omega^c(b')$ for any $a' < a$ and $b' > b$. Therefore for any set $A \subset E_{a,b}$, by letting $a'$ and $b'$ tend to a and b, we get
$$a\,\mu(A) \le \lambda(A) \le b\,\mu(A),$$
and from this it follows that
$$\lambda(A) = \int_A f(\omega)\, d\mu \quad\text{for all } A \in F.$$
Exercise 4.4. Let F(x) be a distribution function on the line with F(0) = 0 and F(1) = 1, so that the probability measure α corresponding to it lives on the interval [0, 1]. If F(x) satisfies a Lipschitz condition
$$|F(x) - F(y)| \le A\,|x - y|,$$
then prove that $\alpha \ll m$ where m is the Lebesgue measure on [0, 1]. Show also that $0 \le \frac{d\alpha}{dm} \le A$ almost surely.
If ν, λ, μ are three nonnegative measures such that $\nu \ll \lambda$ and $\lambda \ll \mu$, then show that $\nu \ll \mu$ and
$$\frac{d\nu}{d\mu} = \frac{d\nu}{d\lambda}\, \frac{d\lambda}{d\mu} \quad\text{a.e.}$$
Exercise 4.5. If λ, μ are nonnegative measures with $\lambda \ll \mu$ and $\frac{d\lambda}{d\mu} = f$, then show that g is integrable with respect to λ if and only if g f is integrable with respect to μ, and
$$\int g(\omega)\, d\lambda = \int g(\omega)\, f(\omega)\, d\mu.$$
If we restrict λ and μ to a sub σ-field Σ, there is a Radon–Nikodym derivative
$$g(\omega) = \frac{d\lambda}{d\mu}\bigg|_{\Sigma}$$
such that
$$\lambda(A) = \int_A g(\omega)\, d\mu \quad\text{for all } A \in \Sigma$$
and g is Σ-measurable. Since the old function f(ω) was only F-measurable, in general it cannot be used as the Radon–Nikodym derivative for the sub σ-field Σ. Now if f is an integrable function on (Ω, F, μ) and Σ ⊂ F is a sub σ-field, we can define λ on F by
$$\lambda(A) = \int_A f(\omega)\, d\mu \quad\text{for all } A \in F.$$
6. If $\Sigma_2 \subset \Sigma_1 \subset F$, then
$$E\big[\, E[f\,|\,\Sigma_1] \,\big|\, \Sigma_2 \,\big] = E[f\,|\,\Sigma_2] \quad\text{a.e.}$$
7. Jensen's inequality. If φ(x) is a convex function of x and $g = E[f\,|\,\Sigma]$, then
$$E\big[\, \phi(f(\omega)) \,\big|\, \Sigma \,\big] \ge \phi(g(\omega)) \quad\text{a.e.} \tag{4.2}$$
In particular
$$E[\phi(f)] \ge E[\phi(g)].$$
Proof. (i), (ii) and (iii) are obvious. For (iv) we note that if $d\lambda = f\, d\mu$,
$$\int |f|\, d\mu = \sup_{A \in F} \lambda(A) - \inf_{A \in F} \lambda(A),$$
and if we replace F by a sub σ-field Σ the right hand side is decreased. (v)
is obvious if h is the indicator function of a set A in Σ. To go from indicator
functions to simple functions to bounded measurable functions is routine.
(vi) is an easy consequence of the definition. Finally (vii) corresponds to
Theorem 1.7 proved for ordinary expectations
and is proved analogously.
We note that if $f_1 \ge f_2$ then $E[f_1|\Sigma] \ge E[f_2|\Sigma]$ a.e., and consequently $E[\max(f_1, f_2)|\Sigma] \ge \max(g_1, g_2)$ a.e. where $g_i = E[f_i|\Sigma]$ for $i = 1, 2$. Since we can represent any convex function φ as $\phi(x) = \sup_a [ax - \psi(a)]$, limiting
4.2. CONDITIONAL EXPECTATION 111
then
$$H_0 = L^2[\Omega, \Sigma, \mu] \subset H = L^2[\Omega, F, \mu]$$
and $f \to E[f|\Sigma]$ is seen to be the same as the orthogonal projection from H onto $H_0$. Prove it.
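On a finite probability space this projection property can be verified directly: with Σ generated by a partition, E[f|Σ] averages f over each block, and the residual f − E[f|Σ] is orthogonal in L²(μ) to every Σ-measurable function. A sketch with arbitrarily chosen data:

```python
# Omega = {0,...,5} with uniform measure; Sigma generated by two blocks.
blocks = [{0, 1, 2}, {3, 4, 5}]
f = [1.0, 4.0, 7.0, 2.0, 2.0, 8.0]

def cond_exp(f, blocks):
    """E[f|Sigma]: constant on each block, equal to the block average."""
    g = [0.0] * len(f)
    for block in blocks:
        avg = sum(f[i] for i in block) / len(block)
        for i in block:
            g[i] = avg
    return g

def inner(u, v):
    """L^2(mu) inner product with uniform mu."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

g = cond_exp(f, blocks)
h = [5.0, 5.0, 5.0, -2.0, -2.0, -2.0]   # an arbitrary Sigma-measurable function
residual = [a - b for a, b in zip(f, g)]
```

Here both block averages happen to equal 4, and the inner product of the residual with any function constant on the blocks vanishes, which is exactly the orthogonal-projection characterization.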
Exercise 4.9. If F1 ⊂ F2 ⊂ F are two sub σ-fields of F and X is any in-
tegrable function, we can define Xi = E[X|Fi] for i = 1, 2. Show that
X1 = E[X2 |F1 ] a.e.
Conditional expectation is then the best nonlinear predictor if the loss
function is the expected (mean) square error.
The problem with the above theorem is that every property is valid only
almost everywhere. There are exceptional sets of measure zero for each case.
While each null set or a countable number of them can be ignored we have an
uncountable number of null sets and we would like a single null set outside
which all the properties hold. This means constructing a good version of the
conditional probability. It may not always be possible. If it is, such a
version is called a regular conditional probability. The existence of such a
regular version depends on the space (Ω, F ) and the sub σ-field Σ being nice.
If Ω is a complete separable metric space and F are its Borel sets, and if Σ
is any countably generated sub σ-field of F , then it is nice enough. We will
prove it in the special case when Ω = [0, 1] is the unit interval and F are the
Borel subsets B of [0, 1]. Σ can be any countably generated sub σ-field of F .
Remark 4.6. In fact the case is not so special. There is a theorem [6] which states that if (Ω, F) is any complete separable metric space that has an uncountable number of points, then there is a one-to-one measurable map with a measurable inverse between (Ω, F) and ([0, 1], B). There is no loss of generality in assuming that (Ω, F) is just ([0, 1], B).
Proof. The trick is not to be too ambitious in the first place, but to try to construct the conditional expectations
$$Q(\omega, B) = E\big\{ \chi_B(\omega) \,\big|\, \Sigma \big\}$$
only for sets B given by B = (−∞, x) for rational x. We call our conditional expectation, which is in fact a conditional probability, F(ω, x). By the properties of conditional expectations, for any pair of rationals $x < y$ there is a null set $E_{x,y}$ such that for $\omega \notin E_{x,y}$, $F(\omega, x) \le F(\omega, y)$. Moreover for any rational x < 0 there is a null set $N_x$ outside which F(ω, x) = 0, and similar null sets $N_x$ for x > 1, outside which F(ω, x) = 1. If we collect all these null sets, of which there are only countably many, and take their union, we get a null set N ∈ Σ such that for $\omega \notin N$ we have a family F(ω, x), defined for rational x, that satisfies these monotonicity and normalization properties.
For $\omega \notin N$ and real y we can define
$$G(\omega, y) = \inf_{\substack{x > y\\ x\ \mathrm{rational}}} F(\omega, x).$$
This determines $\hat Q(\omega, I)$ for all intervals I, and by standard arguments this will extend to finite disjoint unions of half-open intervals, which constitute a field, and finally to the σ-field F generated by that field. To verify that for all real y
$$P(A \cap [0, y]) = \int_A G(\omega, y)\, dP \quad\text{for all } A \in \Sigma,$$
we start from
$$P(A \cap [0, x]) = \int_A F(\omega, x)\, dP \quad\text{for all } A \in \Sigma,$$
valid for rational x, and let x ↓ y through rationals. From the countable additivity of P the left hand side converges to $P(A \cap [0, y])$, and by the bounded convergence theorem the right hand side converges to $\int_A G(\omega, y)\, dP$, and we are done.
Finally from the uniqueness of the conditional expectation if A ∈ Σ
Q̂(ω, A) = χA (ω)
provided ω ∈/ NA , which is a null set that depends on A. We can take a
countable set Σ0 of generators A that forms a field and get a single null set
N such that if ω ∈
/N
Q̂(ω, A) = χA (ω)
for all $A \in \Sigma_0$. Since both sides are countably additive measures in A and they agree on $\Sigma_0$, they have to agree on Σ as well.
Exercise 4.10. (Disintegration theorem.) Let μ be a probability measure on the plane $R^2$ with marginal distribution α for the first coordinate. In other words, α is such that for any bounded measurable function f of x,
$$\int_{R^2} f(x)\, d\mu = \int_R f(x)\, d\alpha.$$
Show that there exist measures $\beta_x$, depending measurably on x, with $\beta_x[\{x\} \times R] = 1$, i.e. $\beta_x$ is supported on the vertical line $\{(x, y) : y \in R\}$ through x, such that $\mu = \int \beta_x\, d\alpha$. The converse is of course easier: given α and $\beta_x$ we can construct a unique μ that disintegrates as expected.
In such a case the process is called a Markov Process with transition prob-
abilities πk−1,k (·, ·). An even smaller subclass arises when we demand that
πk−1,k (·, ·) be the same for different values of k. A single transition proba-
bility π(x, A) and the initial distribution µ0 determine the entire process i.e.
the measure P on $(X^\infty, F^\infty)$. Such processes are called time-homogeneous Markov processes or Markov processes with stationary transition probabilities.
Chapman–Kolmogorov equations. If we have the transition probabilities $\pi_{k,k+1}$ of transition from time k to k+1 of a Markov chain, it is possible to obtain directly the transition probabilities from time k to $k+\ell$ for any $\ell \ge 2$. We do it by induction on $\ell$. Define
$$\pi_{k,k+\ell+1}(x, A) = \int_X \pi_{k,k+\ell}(x, dy)\, \pi_{k+\ell,k+\ell+1}(y, A). \tag{4.5}$$
Theorem 4.8. The transition probabilities $\pi_{k,m}(\cdot,\cdot)$ satisfy the relations
$$\pi_{k,n}(x, A) = \int_X \pi_{k,m}(x, dy)\, \pi_{m,n}(y, A) \tag{4.6}$$
for any $k < m < n$, and for the Markov process defined by the one step transition probabilities $\pi_{k,k+1}(\cdot,\cdot)$, for any $n > m$,
$$P\big[\, x_n \in A \,\big|\, \Sigma_m \,\big] = \pi_{m,n}(x_m, A) \quad\text{a.e.,}$$
where $\Sigma_m$ is the σ-field of the past history up to time m, generated by the coordinates $x_0, x_1, \cdots, x_m$.
Proof. The identity is basically algebra. The multiple integral can be carried out by iteration in any order, and after enough variables are integrated we get our identity. To prove that the conditional probabilities are given by the right formula we need to establish
$$P\big[\, \{x_n \in A\} \cap B \,\big] = \int_B \pi_{m,n}(x_m, A)\, dP \quad\text{for all } B \in \Sigma_m.$$
Remark 4.8. If the chain has stationary transition probabilities then the transition probabilities $\pi_{m,n}(x, dy)$ from time m to time n depend only on the difference $k = n - m$ and are given by what are usually called the k-step transition probabilities. They are defined inductively by
$$\pi^{(k+1)}(x, A) = \int_X \pi^{(k)}(x, dy)\, \pi(y, A).$$
They look different, but they are both equivalent to the symmetric condition which says that given the present, the past and future are conditionally independent. In view of the symmetry it is sufficient to prove the following:
Theorem 4.9. For any P on (X × Y × Z) the relations (4.7) and (4.9) are
equivalent.
Proof. Let us fix f and g, and denote the common value in (4.7) by ĝ(y). Then
which is (4.9). Conversely, we assume (4.9) and denote by ḡ(x, y) and ĝ(y)
the expressions on the left and right side of (4.7). Let b(y) be a bounded
measurable function on Y .
Since f and b are arbitrary this implies that ḡ(x, y) = ĝ(y) a.e. P .
p is the loss and we have put the condition that the outflow is the
demand unless the stored amount is less than the demand in which case
xn = f (xn−1 , ξn )
where xn is the current state and ξn is some random external disturbance. ξn
are assumed to be independent and identically distributed. They could have
two components like inflow and demand. The new state is a deterministic
function of the old state and the noise.
Exercise 4.11. Verify that the first two examples can be cast in the above
form. In fact there is no loss of generality in assuming that ξj are mutually
independent random variables having as common distribution the uniform
distribution on the interval [0, 1].
we conclude
$$\mu(A) = \int \pi(y, A)\, d\mu(y) \quad\text{for all } A \in F.$$
$$\tau(\omega) : \Omega \to \{n : n \ge 0\}$$
is a random variable defined on the set $\Omega = X^\infty$ such that for every $n \ge 0$ the set $\{\omega : \tau(\omega) = n\}$ (or equivalently for each $n \ge 0$ the set $\{\omega : \tau(\omega) \le n\}$) is measurable with respect to the σ-field $F_n$ generated by $X_j : 0 \le j \le n$. It is not necessary that $\tau(\omega) < \infty$ for every ω. Such random variables τ are called stopping times. Examples of stopping times are constant times $n \ge 0$, the first visit to a state x, or the second visit to a state x. The important thing is that in order to decide if $\tau \le n$, i.e. to know if whatever is supposed to happen did happen before time n, the chain need be observed only up to time n. Examples of τ that are not stopping times are easy to find. The last time a site is visited is not a stopping time, nor is the first time such that at the next time one is in a state x. An important fact is that the Markov property extends to stopping times. Just as we have σ-fields $F_n$ associated with constant times, we have a σ-field $F_\tau$ associated to any stopping time. This is the information we have when we observe the chain up to time τ. Formally
$$F_\tau = \big\{\, A : A \in F^\infty\ \text{and}\ A \cap \{\tau \le n\} \in F_n\ \text{for each } n \,\big\}.$$
$$P_x\big\{ X_{\tau+1} \in A_1, \cdots, X_{\tau+n} \in A_n \,\big|\, F_\tau \big\} = \int_{A_1} \cdots \int_{A_n} \pi(X_\tau, dx_1) \cdots \pi(x_{n-1}, dx_n).$$
For $A \in F_\tau$,
$$P_x\big\{ A \cap \{X_{\tau+1} \in A_1, \cdots, X_{\tau+n} \in A_n\} \big\} = \sum_k P_x\big\{ A \cap \{\tau = k\} \cap \{X_{k+1} \in A_1, \cdots, X_{k+n} \in A_n\} \big\}$$
$$= \sum_k \int_{A \cap \{\tau = k\}} \int_{A_1} \cdots \int_{A_n} \pi(X_k, dx_1) \cdots \pi(x_{n-1}, dx_n)\, dP_x.$$
matrices. The n-step transition matrix is just the n-th power of the one step transition matrix, defined inductively by
$$\pi^{(n+1)}(x, y) = \sum_z \pi^{(n)}(x, z)\, \pi(z, y).$$
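For a finite state space this inductive definition is plain matrix multiplication, and the Chapman–Kolmogorov relation reads $\pi^{(m+n)} = \pi^{(m)}\pi^{(n)}$. A sketch with a hypothetical 2-state matrix (the entries are arbitrary):

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(p, n):
    """n-step transition matrix pi^(n) = p^n, built inductively as in the text."""
    result = p
    for _ in range(n - 1):
        result = mat_mul(result, p)
    return result

pi = [[0.9, 0.1], [0.4, 0.6]]                   # a hypothetical one step matrix
p5 = mat_pow(pi, 5)
ck = mat_mul(mat_pow(pi, 2), mat_pow(pi, 3))    # Chapman-Kolmogorov: pi^(2) pi^(3)
```

The two computations of the 5-step matrix agree, and each row of $\pi^{(5)}$ still sums to 1, as a transition probability must.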
Since the $f_n(x)$ are probabilities of disjoint events, $\sum_n f_n(x) \le 1$. The state x is called transient if $\sum_n f_n(x) < 1$ and recurrent if $\sum_n f_n(x) = 1$. The recurrent case is divided into two situations. If we denote by $\tau_x = \inf\{n \ge 1 : X_n = x\}$ the time of the first visit to x, then recurrence is $P_x\{\tau_x < \infty\} = 1$. A recurrent state x is called positive recurrent if
$$E^{P_x}\{\tau_x\} = \sum_{n\ge1} n\, f_n(x) < \infty$$
Lemma 4.11. If for a (not necessarily irreducible) chain starting from x, the
probability of ever visiting y is positive then so is the probability of visiting y
before returning to x.
Proof. Assume that for the chain starting from x the probability of visiting
y before returning to x is zero. But when it returns to x it starts afresh and
so will not visit y until it returns again. This reasoning can be repeated and
so the chain will have to visit x infinitely often before visiting y. But this
will use up all the time and so it cannot visit y at all.
Lemma 4.12. For an irreducible chain all states x are of the same type.
Proof. Let x be recurrent and y be given. Since the chain is irreducible, for
some k, π (k) (x, y) > 0. By the previous lemma, for the chain starting from x,
there is a positive probability of visiting y before returning to x. After each
successive return to x, the chain starts afresh and there is a fixed positive
probability of visiting y before the next return to x. Since there are infinitely
many returns to x, y will be visited infinitely many times as well, so y is also a recurrent state.
We now prove that if x is positive recurrent then so is y. We saw already
that the probability p = Px {τy < τx } of visiting y before returning to x is
positive. Clearly
E^{Px}{τx} ≥ Px{τy < τx} E^{Py}{τx}

and therefore

E^{Py}{τx} ≤ (1/p) E^{Px}{τx} < ∞.
On the other hand we can write

E^{Px}{τy} ≤ ∫_{τy<τx} τx dPx + ∫_{τx<τy} τy dPx
= ∫_{τy<τx} τx dPx + ∫_{τx<τy} {τx + E^{Px}{τy}} dPx
= ∫_{τy<τx} τx dPx + ∫_{τx<τy} τx dPx + (1 − p) E^{Px}{τy}
= ∫ τx dPx + (1 − p) E^{Px}{τy}

so that p E^{Px}{τy} ≤ E^{Px}{τx} < ∞, and y is positive recurrent as well.
and

G(x, x) = 1 / (1 − f(x, x))

where f(x, y) = Px{τy < ∞}.

Proof. Each time the chain returns to x there is a probability 1 − f(x, x) of never returning. The number of returns then has a geometric distribution, and the right-hand side comes from the calculation of the mean of a geometric distribution. Since we count the visit at time 0 as a visit to x, we add 1 to
both sides to get our formula. If we want to calculate the expected number of
visits to y when we start from x, first we have to get to y and the probability
of that is f (x, y). Then by the renewal property it is exactly the same as the
expected number of visits to y starting from y, including the visit at time 0
and that equals G(y, y).
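A toy computation can illustrate the geometric-distribution argument. The 2-state chain below is an invented example: state 1 is absorbing, so from state 0 the return probability is f(0, 0) = 1/2 and the expected number of visits should be G(0, 0) = 1/(1 − 1/2) = 2.

```python
# Hypothetical chain: pi(0,0) = 1/2, pi(0,1) = 1/2, pi(1,1) = 1.
# Starting at 0, pi^(n)(0,0) = (1/2)^n, so G(0,0) = sum_n pi^(n)(0,0).

def pi_n_00(n):
    return 0.5 ** n        # probability of being at 0 after n steps

G = sum(pi_n_00(n) for n in range(200))   # tail beyond n = 200 is negligible
f = 0.5                                   # return probability f(0,0)
print(abs(G - 1.0 / (1.0 - f)) < 1e-9)    # -> True
```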
Lemma 4.14. For any irreducible chain dx = d for all x ∈ X and for each
x, Dx contains all sufficiently large multiples of d.
since ` ≥ m > r.
Remark 4.10. For an irreducible chain the common value d is called the pe-
riod of the chain and an irreducible chain with period d = 1 is called aperi-
odic.
The simplest example of a periodic chain is one with 2 states and the chain shuttles back and forth between the two: π(x, y) = 1 if x ≠ y and 0 if x = y.
A simple calculation yields π (n) (x, x) = 1 if n is even and 0 otherwise. There
is oscillatory behavior in n that persists. The main theorem for irreducible,
aperiodic, recurrent chains is the following.
Theorem 4.15. Let π(x, y) be the one-step transition probability for a recurrent aperiodic Markov chain and let π^(n)(x, y) be the n-step transition probabilities. If the chain is null recurrent then

lim_{n→∞} π^(n)(x, y) = 0 for all x, y.

If the chain is positive recurrent then of course E^{Px}{τx} = m(x) < ∞ for all x, and in that case the limits

lim_{n→∞} π^(n)(x, y) = q(y) = 1/m(y)

exist for all x and y, are independent of the starting point x, and Σ_y q(y) = 1.
Then

lim_{n→∞} pn = 1/m

where if m = ∞ the right-hand side is taken as 0.
Step 4: Because p_{N−j} → a for every j along the subsequence N = nk, if Σ_j Tj = m < ∞, we can deduce from the dominated convergence theorem that m a = 1 and we conclude that

lim sup_{n→∞} pn = 1/m

If Σ_j Tj = ∞, by Fatou's Lemma a = 0. Exactly the same argument applies to the liminf and we conclude that

lim inf_{n→∞} pn = 1/m

This concludes the proof of the renewal theorem.
We now turn to periodic chains. Let us consider an irreducible Markov chain with one-step transition probability π(x, y) that is periodic with period d > 1. Let us choose and fix a reference point x0 ∈ X. For each x ∈ X let D_{x0,x} = {n : π^(n)(x0, x) > 0}.
Proof. Since the chain is irreducible there is an m such that π^(m)(x, x0) > 0. By the Chapman-Kolmogorov equations π^(m+ni)(x0, x0) > 0 for i = 1, 2.
Therefore m + ni ∈ Dx0 = Dx0 ,x0 for i = 1, 2. This implies that d divides
both m + n1 as well as m + n2 . Thus d divides n1 − n2 .
The residues modulo d of all the integers in D_{x0,x} are the same and equal some number r(x), satisfying 0 ≤ r(x) ≤ d − 1. By definition r(x0) = 0. Let
us define Xj = {x : r(x) = j}. Then {Xj : 0 ≤ j ≤ d − 1} is a partition of X
into disjoint sets with x0 ∈ X0 .
Suppose now we have a chain that is not irreducible. Let us collect all
the transient states and call the set Xtr . The complement consists of all the
recurrent states and will be denoted by Xre .
Proof. If x is a recurrent state, and π(x, y) > 0, the chain will return to x
infinitely often and each time there is a positive probability of visiting y. By
the renewal property these are independent events and so y will be recurrent
too.
The set of recurrent states Xre can be divided into one or more equivalence classes according to the following procedure. Two recurrent states x and y are in the same equivalence class if f(x, y) = Px{τy < ∞}, the probability of ever visiting y starting from x, is positive. Because of recurrence, if f(x, y) > 0 then f(x, y) = f(y, x) = 1. The restriction of the chain to a single equivalence class
is irreducible and possibly periodic. Different equivalence classes could have
different periods, some could be positive recurrent and others null recurrent.
We can combine all our observations into the following theorem.
Theorem 4.21. If y is transient then Σ_n π^(n)(x, y) < ∞ for all x. If y is null recurrent (belongs to an equivalence class that is null recurrent) then π^(n)(x, y) → 0 for all x, but Σ_n π^(n)(x, y) = ∞ if x is in the same equivalence class or x ∈ Xtr with f(x, y) > 0. In all other cases π^(n)(x, y) = 0 for all n ≥ 1. If y is positive recurrent and belongs to an equivalence class with period d with m(y) = E^{Py}{τy}, then for a nontransient x, π^(n)(x, y) = 0 unless x is in the same equivalence class and r(x) + n = r(y) modulo d. In such a case,

lim_{n→∞, r(x)+n=r(y) modulo d} π^(n)(x, y) = d/m(y).

If x is transient then

lim_{n→∞, n=r modulo d} π^(n)(x, y) = f(r, x, y) d/m(y)

where

f(r, x, y) = Px{X_{kd+r} = y for some k ≥ 0}.
Proof. The only statement that needs an explanation is the last one. The
chain starting from a transient state x may at some time get into a positive
recurrent equivalence class Xj with period d. If it does, it never leaves that
class and so gets absorbed in that class. The probability of this is f (x, y)
where y can be any state in Xj . However if the period d is greater than
1, there will be cyclical subclasses C1 , · · · , Cd of Xj . Depending on which
subclass the chain enters and when, the phase of its future is determined.
There are d such possible phases. For instance, if the subclasses are ordered
in the correct way, getting into C1 at time n is the same as getting into
C2 at time n + 1 and so on. f (r, x, y) is the probability of getting into the
equivalence class in a phase that visits the cyclical subclass containing y at
times n that are equal to r modulo d.
≃ C / n^{d/2}.

To see this asymptotic behavior let us first note that the integration can be restricted to the set where |p̂(ξ)| ≥ 1 − δ, i.e. near the 2 points (0, 0, · · · , 0) and (π, π, · · · , π) where |p̂(ξ)| = 1. Since the behavior is similar at both points let us concentrate near the origin. There

(1/d) Σ_{j=1}^d cos ξj ≤ 1 − c Σ_j ξj² ≤ exp[−c Σ_j ξj²]

and with a change of variables the upper bound is clear. We have a similar lower bound as well. The random walk is recurrent if d = 1 or 2 but transient if d ≥ 3.
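The dimension dependence can be seen numerically in a simplified model. The sketch below is an illustrative assumption, not the nearest-neighbor walk of the text: it uses the walk whose d coordinates are independent one-dimensional simple random walks, for which π^(2n)(0, 0) = (C(2n, n)/4^n)^d decays like (πn)^{−d/2}. The partial sums of the Green's function then keep growing for d = 1, 2 but stabilize for d ≥ 3.

```python
def one_d_probs(N):
    """pi^(2n)(0,0) for the 1-d simple walk, n = 1..N, computed stably via
    the recursion p_n = p_{n-1} * (2n - 1) / (2n), with p_0 = 1."""
    p, out = 1.0, []
    for n in range(1, N + 1):
        p *= (2 * n - 1) / (2 * n)
        out.append(p)
    return out

probs = one_d_probs(1000)

def green_partial(d, N):
    """Partial sum of sum_n pi^(2n)(0,0) for the d-coordinate product walk."""
    return sum(p ** d for p in probs[:N])

for d in (1, 2, 3):
    print(d, green_partial(d, 100), green_partial(d, 1000))
```

The d = 1 sum roughly triples between the two cutoffs, the d = 2 sum still grows (like log N), and the d = 3 sum has essentially converged, matching the recurrence/transience dichotomy.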
Exercise 4.12. If the distribution p(·) is arbitrary, determine when the chain
is irreducible and when it is irreducible and aperiodic.
Exercise 4.13. If Σ_z z p(z) = m ≠ 0, conclude that the chain is transient by an application of the strong law of large numbers.
Exercise 4.14. If Σ_z z p(z) = m = 0, and if the covariance matrix given by Σ_z z_i z_j p(z) = σ_{i,j} is nondegenerate, show that the transience or recurrence is determined by the dimension as in the case of the nearest neighbor random walk.
Exercise 4.15. Can you make sense of the formal calculation

Σ_n π^(n)(0, 0) = Σ_n (1/2π)^d ∫_{T^d} [p̂(ξ)]^n dξ
= (1/2π)^d ∫_{T^d} 1/(1 − p̂(ξ)) dξ
= (1/2π)^d ∫_{T^d} Real Part 1/(1 − p̂(ξ)) dξ

to conclude that a necessary and sufficient condition for transience or recurrence is the convergence or divergence of the integral

∫_{T^d} Real Part 1/(1 − p̂(ξ)) dξ

with an integrand

Real Part 1/(1 − p̂(ξ))

that is seen to be nonnegative?
In other words the chain admits an infinite invariant measure. Such a chain
cannot be positive recurrent. To see this we note

q(y) = Σ_x π^(n)(x, y) q(x)
Fy(ℓ) = Py{τx0 ≤ ℓ}
and U(x0) = 1. One would hope that if we solve these equations then we have our U. This requires uniqueness. Since our U is in fact bounded by 1, it is sufficient to prove uniqueness within the class of bounded solutions.
= e^{−λ} Py{E_{n+1}} + e^{−λ} ∫_{τx0 > n+1} U(X_{n+1}) dPy

where

ψ(σ) = e^σ Σ_{y≥0} e^{−σy} py.

Let us solve e^λ = ψ(σ) for σ, which is the same as solving log ψ(σ) = λ, for λ > 0 to get a solution σ = σ(λ) > 0. Then
where

β(p, q) = Γ(p)Γ(q) / Γ(p + q)

is a solution as well. The former is defined on p + q > 0 whereas the latter is defined only on p > 0, q > 0. Actually if p or q is initially 0 it remains so for
for any continuous f on [0, 1]. We will show that the ratio ξn = pn/(pn + qn), which is random, stabilizes asymptotically (i.e. has a limit) to a random variable ξ, and if we start from p, q the distribution of ξ is the Beta distribution on [0, 1] with density

f_{p,q}(x) = (1/β(p, q)) x^{p−1}(1 − x)^{q−1}
Such functions are called (bounded) harmonic functions for the chain. Consider the random variables ξn = U(Xn) for such a harmonic function. The ξn are uniformly bounded by the bound for U. If we denote ηn = ξn − ξn−1, an elementary calculation reveals

E^{Px}{ηn ηm} = 0
for m ≠ n. If we write

U(Xn) = U(X0) + η1 + η2 + · · · + ηn

this is an orthogonal sum in L²[Px] and because U is bounded

E^{Px}{|U(Xn)|²} = |U(x)|² + Σ_{i=1}^n E^{Px}{|ηi|²} ≤ C
Then

E[Xn+1 |Fn] = m Xn.

E[S|X0 = i] = i/(1 − m) < ∞

P[ lim_{n→∞} Xn = 0 ] = 1.

a contradiction.
we start with any a < 1 and iterate q_{n+1} = P(q_n) with q1 = a; then qn → q.
If the population does not become extinct, one can show that it has
to grow indefinitely. This is best done using martingales and we will
revisit this example later as Example 5.6.
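The iteration q_{n+1} = P(q_n) is easy to carry out numerically. In the sketch below the offspring distribution (p0 = 1/4, p1 = 1/4, p2 = 1/2, mean m = 5/4 > 1) is an illustrative choice; for it, q = P(q) has the root q = 1/2 below 1, and the iteration converges to it.

```python
def P(q):
    """Generating function of the illustrative offspring law
    p0 = 1/4, p1 = 1/4, p2 = 1/2."""
    return 0.25 + 0.25 * q + 0.5 * q * q

q = 0.0                  # any starting point a < 1 works
for _ in range(200):
    q = P(q)             # q_{n+1} = P(q_n)
print(q)                 # -> converges to the extinction probability 0.5
```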
Example 4.5. Let X be the set of integers. Assume that transitions from x
are possible only to x − 1, x, and x + 1. The transition matrix π(x, y) appears
as a tridiagonal matrix with π(x, y) = 0 unless |x − y| ≤ 1. For simplicity let
us assume that π(x, x), π(x, x − 1) and π(x, x + 1) are positive for all x.
The chain is then irreducible and aperiodic. Let us try to solve for
U(x) = Px {τ0 = ∞}
for x ≠ 0 with U(0) = 0. The equations decouple into a set for x > 0 and a set for x < 0. If we denote by V(x) = U(x + 1) − U(x) for x ≥ 0, then we always have

π(x, x − 1) V(x − 1) − π(x, x + 1) V(x) = 0

or

V(x)/V(x − 1) = π(x, x − 1)/π(x, x + 1)

and therefore

V(x) = V(0) Π_{i=1}^x [π(i, i − 1)/π(i, i + 1)]

and

U(x) = V(0) [ 1 + Σ_{y=1}^{x−1} Π_{i=1}^y π(i, i − 1)/π(i, i + 1) ].
Σ_{y=1}^∞ Π_{i=1}^y [π(i, i − 1)/π(i, i + 1)] < ∞

Px{τ0 = ∞} > 0

Σ_{y=1}^∞ Π_{i=1}^y [π(−i, −i + 1)/π(−i, −i − 1)] < ∞.

Transience needs at least one of the two series to converge. Actually the converse is also true. If, for instance, the series on the positive side converges, then we get a function U(x) with 0 ≤ U(x) ≤ 1 and U(0) = 0 that satisfies the required equations.
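The convergence criterion is easy to probe numerically. The special case sketched below is an illustrative assumption, not the general chain: the one-step probabilities are constant, π(x, x + 1) = p and π(x, x − 1) = q, so each product collapses to (q/p)^y and the series converges exactly when q < p.

```python
def series_partial(p, q, terms):
    """Partial sum of sum_y prod_{i<=y} pi(i,i-1)/pi(i,i+1) = sum_y (q/p)^y."""
    return sum((q / p) ** y for y in range(1, terms + 1))

up_drift = [series_partial(0.4, 0.2, N) for N in (50, 500)]    # q < p
down_drift = [series_partial(0.2, 0.4, N) for N in (50, 500)]  # q > p
print(up_drift)    # stabilizes near 1: the series converges, transience possible
print(down_drift)  # blows up: the series diverges
```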
Exercise 4.16. Determine the conditions for positive recurrence in the previ-
ous example.
Exercise 4.17. We replace the set of integers by the set of nonnegative inte-
gers and assume that π(0, y) = 0 for y ≥ 2. Such processes are called birth
and death processes. Work out the conditions in that case.
Exercise 4.18. In the special case of a birth and death process with π(0, 1) = π(0, 0) = 1/2, and for x ≥ 1, π(x, x) = 1/3, π(x, x − 1) = 1/3 + ax, π(x, x + 1) = 1/3 − ax with ax = λ/x^α for large x, find conditions on positive α and real λ for the chain to be transient, null recurrent and positive recurrent.
Exercise 4.19. The notion of a Markov Chain makes sense for a finite chain
X0 , · · · , Xn . Formulate it precisely. Show that if the chain {Xj : 0 ≤ j ≤ n}
is Markov so is the reversed chain {Yj : 0 ≤ j ≤ n} where Yj = Xn−j for 0 ≤
j ≤ n. Can the transition probabilities of the reversed chain be determined
by the transition probabilities of the forward chain? If the forward chain has
stationary transition probabilities does the same hold true for the reversed chain? What if we assume that the chain has a finite invariant probability
distribution and we initialize the chain to start with an initial distribution
which is the invariant distribution?
Exercise 4.20. Consider the simple chain on the nonnegative integers with the following transition probabilities: π(0, x) = px for x ≥ 0 with Σ_{x=0}^∞ px = 1. For x > 0, π(x, x − 1) = 1 and π(x, y) = 0 for all other y. Determine
conditions on {px } in order that the chain may be transient, null recurrent
or positive recurrent. Determine the invariant probability measure in the
positive recurrent case.
Exercise 4.21. Show that any null recurrent equivalence class must necessarily contain an infinite number of states. In particular any Markov Chain
with a finite state space has only transient and positive recurrent states and
moreover the set of positive recurrent states must be non empty.
Chapter 5
Martingales.
The second inequality in (5.2) follows from the fact that |x| is a convex func-
tion of x, and therefore |Xj | is a sub-martingale. In particular E{|Xn ||Fj } ≥
|Xj | a.e. P and Ej ∈ Fj . Summing up (5.2) over j = 1, · · · , n we obtain the
theorem.
Remark 5.6. We could have started with

P(Ej) ≤ (1/ℓ^p) ∫_{Ej} |Xj|^p dP

and obtained for p ≥ 1

P(Ej) ≤ (1/ℓ^p) ∫_{Ej} |Xn|^p dP.     (5.3)
Proof. Let us denote the tail probability by T(ℓ) = P{Y ≥ ℓ}. Then with 1/p + 1/q = 1, i.e. (p − 1)q = p,

∫ Y^p dP = −∫_0^∞ y^p dT(y)
= p ∫_0^∞ y^{p−1} T(y) dy     (integrating by parts)
≤ p ∫_0^∞ y^{p−1} (1/y) [∫_{Y≥y} X dP] dy     (by assumption)
= p ∫ X [∫_0^Y y^{p−2} dy] dP     (by Fubini's Theorem)
= (p/(p−1)) ∫ X Y^{p−1} dP
≤ (p/(p−1)) [∫ X^p dP]^{1/p} [∫ Y^{q(p−1)} dP]^{1/q}     (by Hölder's inequality)
≤ (p/(p−1)) [∫ X^p dP]^{1/p} [∫ Y^p dP]^{1−1/p}

This simplifies to

∫ Y^p dP ≤ (p/(p−1))^p ∫ X^p dP

provided ∫ Y^p dP is finite. In general given Y, we can truncate it at level N to get Y_N = min(Y, N) and for 0 < ℓ ≤ N,

P{Y_N ≥ ℓ} = P{Y ≥ ℓ} ≤ (1/ℓ) ∫_{Y≥ℓ} X dP = (1/ℓ) ∫_{Y_N≥ℓ} X dP

with P{Y_N ≥ ℓ} = 0 for ℓ > N. This gives us uniform bounds on ∫ Y_N^p dP and we can pass to the limit. So we have the strong implication that the finiteness of ∫ X^p dP implies the finiteness of ∫ Y^p dP.
E[Xn²] = E[X0²] + Σ_{j=1}^n E[Yj²].

If in addition, sup_n E[Xn²] < ∞, then show that there is a random variable X such that

lim_{n→∞} E[|Xn − X|²] = 0.
Proof. Suppose ‖Xn‖p is uniformly bounded. For p > 1, since Lp is the dual of Lq with 1/p + 1/q = 1, bounded sets are weakly compact. See [7] or [3]. We can therefore choose a subsequence Xnj that converges weakly in Lp to a limit in the weak topology. We call this limit X. Then consider A ∈ Fn for some fixed n. The function 1A(·) ∈ Lq and

∫_A X dP = ⟨1A, X⟩ = lim_{j→∞} ⟨1A, Xnj⟩ = lim_{j→∞} ∫_A Xnj dP = ∫_A Xn dP.

The last equality follows from the fact that {Xn} is a martingale, A ∈ Fn and nj > n eventually. It now follows that Xn = E[X |Fn]. We can now apply the preceding theorem.
Exercise 5.3. For p = 1 the result is false. Example 5.1 gives us at the same
time a counterexample of an L1 bounded martingale that does not converge
in L1 and so cannot be represented as Xn = E [ X |Fn ].
We can show that the convergence in the preceding theorems is also valid almost everywhere.
Theorem 5.7. Let X ∈ Lp for some p ≥ 1. Then the martingale Xn =
E [X |Fn ] converges to X for almost all ω with respect to P .
Proof. From Hölder's inequality ‖X‖1 ≤ ‖X‖p. Clearly it is sufficient to
prove the theorem for p = 1. Let us denote by M ⊂ L1 the set of functions
X ∈ L1 for which the theorem is true. Clearly M is a linear subset of L1 .
We will prove that it is closed in L1 and that it is dense in L1 . If we denote
by Mn the space of Fn measurable functions in L1 , then Mn is a closed
subspace of L1 . By standard approximation theorems ∪n Mn is dense in L1 .
Since it is obvious that M ⊃ Mn for every n, it follows that M is dense in
L1. Let Yj ∈ M ⊂ L1 and Yj → X in L1. Let us define Yn,j = E[Yj |Fn]. With Xn = E[X |Fn], by Doob's inequality (5.1) and Jensen's inequality (4.2),

P{ sup_{1≤n≤N} |Xn| ≥ ℓ } ≤ (1/ℓ) ∫_{{ω: sup_{1≤n≤N} |Xn|≥ℓ}} |XN| dP
≤ (1/ℓ) E[|XN|]
≤ (1/ℓ) E[|X|]
lim sup_n Xn − lim inf_n Xn ≤ [lim sup_n Yn,j − lim inf_n Yn,j] + 2 sup_n |Xn − Yn,j|

Here we have used the fact that Yj ∈ M for every j, and hence the bracketed term is 0. Finally

P{ lim sup_n Xn − lim inf_n Xn ≥ ε } ≤ P{ sup_n |Xn − Yn,j| ≥ ε/2 }
≤ (2/ε) E[|X − Yj|]
= 0

since the left-hand side is independent of j and the term on the right of the second line tends to 0 as j → ∞.
1. (Yn , Fn ) is a martingale.
3. A1 ≡ 0.
and
Xj = (Yj + Xj ) − Yj
does it!
We can always assume that our nonnegative martingale has its expecta-
tion equal to 1 because we can always multiply by a suitable constant. Here
is a way in which such martingales arise. Suppose we have a probability
space (Ω, F, P) and an increasing family of sub σ-fields Fn of F that
generate F . Suppose Q is another probability measure on (Ω , F ) which may
or may not be absolutely continuous with respect to P on F . Let us suppose
however that Q << P on each Fn , i.e. whenever A ∈ Fn and P (A) = 0, it
follows that Q(A) = 0. Then the sequence of Radon-Nikodym derivatives
Xn = (dQ/dP)|_{Fn}

Ω = ⊗_{j=1}^∞ R ;  Fn = σ[x1, · · · , xn] ;  Xj(ω) = xj
Fm ⊂ Fn if m<n
Exercise 5.7. Show that if τ1 , τ2 are stopping times so are max (τ1 , τ2 ) and
min (τ1 , τ2 ). In particular any stopping time τ is an increasing limit of
bounded stopping times τn (ω) = min(τ (ω), n).
Exercise 5.8. Verify that for any stopping time τ , Fτ is indeed a sub σ-field
i.e. is closed under countable unions and complementations. If τ (ω) ≡ k
then Fτ ≡ Fk . If τ1 ≤ τ2 are stopping times Fτ1 ⊂ Fτ2 . Finally if τ is a
stopping time then it is Fτ measurable.
Proof. Since Fτ1 ⊂ Fτ2 ⊂ FC , it is sufficient to show that for any martingale
{Xn }
Xn = ξ1 + ξ2 + · · · + ξn for n ≥ 1
Exercise 5.12. It does not mean that we can never consider stopping times
that are unbounded. Let τ be an unbounded stopping time. For every k,
τk = min(τ, k) is a bounded stopping time and E [Xτk ] = 0 for every k. As
k → ∞, τk ↑ τ and Xτk → Xτ . If we can establish uniform integrability of
Xτk we can pass to the limit. In particular if S(ω) = sup0≤n≤τ (ω) |Xn (ω)| is
integrable then supk |Xτk (ω)| ≤ S(ω) and therefore E [Xτ ] = 0.
Exercise 5.14. The previous exercise needs the fact that if τn ↑ τ are stopping times, then

σ{∪n Fτn} = Fτ.
Prove it.
Exercise 5.15. Let us go back to the earlier exercise (Exercise 5.11) where
we had
Xn = ξ1 + · · · + ξn
as a sum of n independent random variables taking the values ±1 with probability 1/2. Show that if τ is a stopping time with E[τ] < ∞, then S(ω) = sup_{1≤n≤τ(ω)} |Xn(ω)| is square integrable and therefore E[Xτ] = 0. [Hint: Use
the fact that Xn2 − n is a martingale.]
which could very well have lots of 0’s at the end. In any case the first few
terms correspond to upcrossings and each term is at least (b − a) and there
are U(a, b) of them. Before the 0's begin there may be at most one nonzero term which is an incomplete upcrossing, i.e. when τ_{2ℓ−1} < n = τ_{2ℓ} for some ℓ. It is then equal to (Xn − X_{τ_{2ℓ−1}}) ≥ Xn − a. If on the other hand we end in the middle of a downcrossing, i.e. τ_{2ℓ} < n = τ_{2ℓ+1}, there is no incomplete upcrossing. Therefore
X′n = X′_{n−1} + a_{n−1} Yn, for n ≥ 1

where aj is Fj-measurable, Fj being the σ-field generated by ξ1, · · · , ξj. Calculate E[(X′n)²].
VN = V0 + Σ_{j=0}^{N−1} aj (X_{j+1} − Xj)
at time N equals the claim f (XN ) under every conceivable behavior of the
price movements X1 , X2 , · · · , XN . If the claim can be exactly replicated
starting from an initial capital of V0 , then V0 becomes the price of that
option. Anyone could sell the option at that price, use the proceeds as
capital and follow the strategy dictated by the coefficients a0 , · · · , aN −1 and
have exactly enough to pay off the claim at time N. Here we are ignoring
transaction costs as well as interest rates. It is not always true that a claim
can be replicated.
Let us assume for simplicity that the stock prices are always some non-
negative integral multiples of some unit. The set of possible prices can then
be taken to be the set of nonnegative integers. Let us make a crucial assump-
tion that if the price on some day is x the price on the next day is x ± 1. It
has to move up or down a notch. It cannot jump two or more steps or even
stay the same. When the stock price hits 0 we assume that the company
goes bankrupt and the stock stays at 0 for ever. In all other cases, from day
to day, it always moves either up or down a notch.
Let us value the claim f for one period. If the price at day N − 1 is x ≠ 0 and we have assets c on hand and invest in a shares, we will end up on day N with either assets of c + a and a claim of f(x + 1), or assets of c − a and a claim of f(x − 1). In order to make sure that we break even in either case, we need

f(x + 1) = c + a ;  f(x − 1) = c − a

or

c(x) = (1/2)[f(x − 1) + f(x + 1)] ;  a(x) = (1/2)[f(x + 1) − f(x − 1)]
for j ≥ 1 till we arrive at the value V0 (x) of the claim at time 0 and price x.
The corresponding values a = a_{j−1}(x) = (1/2)[Vj(x + 1) − Vj(x − 1)] give us the number of shares to hold between day j − 1 and day j if the current price at time j − 1 equals x.
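The backward recursion V_{j−1}(x) = (V_j(x − 1) + V_j(x + 1))/2 is a few lines of code. The sketch below treats 0 as absorbing, as in the text; the grid size and the claims used for the sanity checks are illustrative assumptions.

```python
def price_claim(f, N, M):
    """Value V_0 of the claim f(X_N) on the price grid 0..M.
    M should exceed the prices of interest by at least N, so the crude
    treatment of the upper boundary (freeze V at M) cannot propagate inward."""
    V = [f(x) for x in range(M + 1)]          # V_N = f
    for _ in range(N):                        # step from V_j back to V_{j-1}
        V = ([f(0)]                           # price 0 is absorbing
             + [0.5 * (V[x - 1] + V[x + 1]) for x in range(1, M)]
             + [V[M]])
    return V

# Sanity check: for the linear claim f(x) = x the price itself is a
# martingale, so the value of the claim equals the current price.
print(price_claim(lambda x: x, N=3, M=10)[5])              # -> 5.0
# One-period claim f(x) = max(x - 3, 0) at price 3: c(3) = (f(2) + f(4))/2.
print(price_claim(lambda x: max(x - 3, 0), N=1, M=10)[3])  # -> 0.5
```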
Remark 5.13. The important fact is that the value is determined by arbi-
trage and is unaffected by the actual movement of the price so long as it is
compatible with the model.
Remark 5.14. The value does not depend on any statistical assumptions on
the various probabilities of transitions of price levels between successive days.
Remark 5.15. However the value can be interpreted as the expected value

V0(x) = E^{Px}[f(XN)]

where Px is the random walk starting at x with probability 1/2 for transitions up or down a level, which is absorbed at 0.
Remark 5.16. Px can be characterized as the unique probability distribution
of (X0 , · · · , XN ) such that Px [X0 = x] = 1, Px [|Xj −Xj−1 | = 1|Xj−1 ≥ 1] = 1
for 1 ≤ j ≤ N and Xj is a martingale with respect to (Ω, Fj , Px ) where Fj
is generated by X0 , · · · , Xj .
Exercise 5.18. It is not necessary for the argument that the set of possible
price levels be equally spaced. If we make the assumption that for each price
level x > 0, the price on the following day can take only one of two possible
values h(x) > x and l(x) < x, with a possible bankruptcy if the level 0 is reached, a similar analysis can be worked out. Carry it out.
or

Z_j^f = f(Xj) − f(X0) − Σ_{i=1}^j h_{i−1}(X0, · · · , X_{i−1})
then

h(x) = ([Π − I]f)(x)

and

h(x) = ([Π − I]f)(x) = Σ_y [f(y) − f(x)] π(x, y).
and for every bounded measurable function f defined on the state space X

f(xn) − f(x0) − Σ_{j=1}^n h(x_{j−1})
(Π − I)V = 0 on A^c ;  V = 1 on A     (5.9)

(Π − I)V = 0 on A^c ;  V = f on A     (5.10)

is equal to

V(x) = E^{Px}[f(x_{τA})].     (5.11)
is a martingale with respect to (Ω, Fn, Px), and let us use the stopping theorem with τN = min(τA, N). Since h(x_{j−1}) = 0 for j ≤ τA, we obtain

V(x) = E^{Px}[V(x_{τN})].

If we now make the assumption that UA(x) = Px{τA < ∞} = 1, let N → ∞ and use the bounded convergence theorem, it is easy to see that

V(x) = E^{Px}[f(x_{τA})]
which proves (5.11) and the rest of the theorem.
Such arguments are powerful tools for the study of qualitative proper-
ties of Markov chains. Solutions to equations of the type [Π − I]V = f are
often easily constructed. They can be used to produce martingales, sub-
martingales or supermartingales that have certain behavior and that in turn
implies certain qualitative behavior of the Markov chain. We will now see
several illustrations of this method.
Example 5.1. Consider the symmetric simple random walk in one dimension.
We know from recurrence that the random walk exits the interval (−R, R)
in a finite time. But we want to get some estimates on the exit time τR .
Consider the function u(x) = cos λx. The function f(x) = [Πu](x) can be calculated:

f(x) = (1/2)[cos λ(x − 1) + cos λ(x + 1)] = cos λ · cos λx = cos λ · u(x).
If λ < π/(2R), then cos λx ≥ cos λR > 0 in [−R, R]. Consider Zn = e^{σn} cos λxn with σ = −log cos λ. Then

E^{Px}[Zn |F_{n−1}] = e^{σn} f(x_{n−1}) = e^{σn} cos λ cos λx_{n−1} = Z_{n−1}.

If τR is the exit time from the interval (−R, R), for any N, we have

E^{Px}[Z_{τR∧N}] = E^{Px}[Z0] = cos λx.

Since σ > 0 and cos λx ≥ cos λR > 0 for x ∈ [−R, R], if R is an integer, we can claim that

E^{Px}[e^{σ(τR∧N)}] ≤ cos λx / cos λR.

Since the estimate is uniform we can let N → ∞ to get the estimate

E^{Px}[e^{στR}] ≤ cos λx / cos λR.
so that we have slightly perturbed the random walk with perhaps even a
possible bias.
Exact calculations as in Example 5.1 are of course no longer possible. Let us try to estimate again the exit time from a ball of radius R. For σ > 0
consider the function

F(x) = exp[ σ Σ_{i=1}^d |xi| ]

defined on Z^d. We can get an estimate of the form

(ΠF)(x1, · · · , xd) ≥ θ F(x1, · · · , xd)

for some choices of σ > 0 and θ > 1 that may depend on R. Now proceed as in Example 5.1.
Example 5.3. We can use these methods to show that the random walk is
transient in dimension d ≥ 3.
For 0 < α < d − 2 consider the function V(x) = 1/|x|^α for x ≠ 0 with V(0) = 1. An approximate calculation of (ΠV)(x) yields, for sufficiently large |x| (i.e. |x| ≥ L for some L), the estimate

(ΠV)(x) − V(x) ≤ 0
If we start initially from an x with |x| > L and take τL to be the first entrance time into the ball of radius L, one gets by the stopping theorem the inequality

E^{Px}[V(x_{τL∧N})] ≤ V(x).
If τL ≤ N, then |x_{τL}| ≤ L. In any case V(x_{τL∧N}) ≥ 0. Therefore,

Px{τL ≤ N} ≤ V(x) / inf_{|y|≤L} V(y)

valid uniformly in N. Letting N → ∞,

Px{τL < ∞} ≤ V(x) / inf_{|y|≤L} V(y).
If we let |x| → ∞, keeping L fixed, we see the transience. Note that recurrence implies that Px{τL < ∞} = 1 for all x. The proof of transience really only required a function V defined for large |x| that was strictly positive for each x, went to 0 as |x| → ∞, and had the property (ΠV)(x) ≤ V(x) for large values of |x|.
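The key property (ΠV)(x) ≤ V(x) can be checked numerically for d = 3. The sketch below uses α = 1/2 (any 0 < α < d − 2 = 1 should behave similarly); the sample points are arbitrary choices, not from the text.

```python
def V(x, alpha=0.5):
    """V(x) = |x|^{-alpha} on Z^3, away from the origin."""
    return sum(c * c for c in x) ** (-alpha / 2.0)

def Pi_V(x):
    """(Pi V)(x): average of V over the six nearest neighbors of x in Z^3."""
    total = 0.0
    for i in range(3):
        for step in (-1, 1):
            y = list(x)
            y[i] += step
            total += V(tuple(y))
    return total / 6.0

points = [(20, 0, 0), (20, 20, 20), (15, 7, 3)]
# Each difference is a small negative number, of order |x|^{-alpha-2}.
print([Pi_V(x) < V(x) for x in points])  # -> [True, True, True]
```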
Example 5.4. We will now show that the random walk is recurrent in d = 2.
This is harder because the recurrence of random walk in d = 2 is right
on the border. We want to construct a function V (x) → ∞ as |x| → ∞ that
satisfies (ΠV )(x) ≤ V (x) for large |x|. If we succeed, then we can estimate
by a stopping argument the probability that the chain starting from a point
x in the annulus ` < |x| < L exits at the outer circle before getting inside
the inner circle.
Px{τL < τℓ} ≤ V(x) / inf_{|y|≥L} V(y).
We also have for every L,

Px{τL < ∞} = 1.
This proves that Px{τℓ < ∞} = 1, thereby proving recurrence. The natural candidate is F(x) = log |x| for x ≠ 0. A computation yields

(ΠF)(x) − F(x) ≤ C/|x|^4

which does not quite make it. On the other hand if U(x) = |x|^{−1}, for large values of |x|,

(ΠU)(x) − U(x) ≥ c/|x|^3

for some c > 0. The choice of V(x) = F(x) − C U(x) = log |x| − C/|x| works with any C > 0.
Example 5.5. We can use these methods for proving positive recurrence as
well.
Suppose X is a countable set and we can find V ≥ 0, a finite set F and
a constant C ≥ 0 such that
(ΠV)(x) − V(x) ≤ −1 for x ∉ F  and  (ΠV)(x) − V(x) ≤ C for x ∈ F.
≤ E^{Px}[ C Σ_{j=1}^n 1_F(x_{j−1}) − Σ_{j=1}^n 1_{F^c}(x_{j−1}) ]
= −E^{Px}[ Σ_{j=1}^n (1 − (1 + C) 1_F(x_{j−1})) ]
= −n + (1 + C) Σ_{j=1}^n Σ_{y∈F} π^(j−1)(x, y)
= −n + o(n) as n → ∞.
E[q^{Xn+1} |Fn] = [ Σ_j q^j p_j ]^{Xn} = [P(q)]^{Xn} = q^{Xn}
Chapter 6

Stationary Stochastic Processes.
on the space of functions defined on Ω by the rule (Uf )(ω) = f (T ω). Because
T is measure preserving it is easy to see that
∫_Ω f(ω) dP = ∫_Ω f(Tω) dP = ∫_Ω (Uf)(ω) dP

as well as

∫_Ω |f(ω)|^p dP = ∫_Ω |f(Tω)|^p dP = ∫_Ω |(Uf)(ω)|^p dP.
I = {A : T A = A}.
Proof. First we prove the convergence in the various Lp spaces. These are called mean ergodic theorems. The easiest situation to prove is when p = 2. Let us define

H0 = {f : f ∈ H, Uf = f} = {f : f ∈ H, f(Tω) = f(ω)}.
Clearly if we let

An f = (f + Uf + · · · + U^{n−1} f)/n

then ‖An f‖2 ≤ ‖f‖2 for every f ∈ H, and An f = f for every n and f ∈ H0. Therefore for f ∈ H0, An f → f as n → ∞. On the other hand if f = (I − U)g, then An f = (g − U^n g)/n and ‖An f‖2 ≤ 2‖g‖2/n → 0 as n → ∞. Since ‖An‖ ≤ 1, it follows that An f → 0 as n → ∞ for every f ∈ H0^⊥, the closure of the range of (I − U). (See Exercise 6.1.) If we denote by π the orthogonal projection from H → H0, we see that An f → πf as n → ∞ for every f ∈ H, establishing the L² ergodic theorem.
There is an alternate characterization of H0 . Functions f in H0 are invari-
ant under T , i.e. have the property that f (T ω) = f (ω). For any invariant
function f the level sets {ω : a < f (ω) < b} are invariant under T . We
can therefore talk about invariant sets {A : A ∈ F, T^{−1}A = A}. Technically we should allow ourselves to differ by sets of measure zero, and one defines I = {A : P(A ∆ T^{−1}A) = 0} as the σ-field of almost invariant sets. Nothing is therefore lost by taking I to be the σ-field of invariant sets. We can identify the orthogonal projection π as (see Exercise 4.8)

πf = E^P{f | I}
lim_{n→∞} ‖An f − πf‖p = 0

Then

∫_{E_n^0} f(ω) dP ≥ 0
Proof. Let

h_n^+(ω) = max(0, h_n(ω)).

On E_n^0, h_n(ω) = h_n^+(ω) and therefore

f(ω) = h_n(ω) − h_{n−1}^+(Tω) = h_n^+(ω) − h_{n−1}^+(Tω).

Consequently,

∫_{E_n^0} f(ω) dP = ∫_{E_n^0} [h_n^+(ω) − h_{n−1}^+(Tω)] dP
≥ ∫_{E_n^0} [h_n^+(ω) − h_n^+(Tω)] dP     (because h_{n−1}^+(ω) ≤ h_n^+(ω))
= ∫_{E_n^0} h_n^+(ω) dP − ∫_{T E_n^0} h_n^+(ω) dP     (because of the invariance of T)
≥ 0.

The last step follows from the fact that for any integrable function h(ω), ∫_E h(ω) dP is largest when we take for E the set E = {ω : h(ω) ≥ 0}.
we have

P{En} ≤ (1/ℓ) ∫_{En} |f(ω)| dP.

In particular

P{ ω : sup_{j≥1} |(Aj f)(ω)| ≥ ℓ } ≤ (1/ℓ) ∫ |f(ω)| dP.

En = { ω : sup_{1≤j≤n} [f(ω) + f(Tω) + · · · + f(T^{j−1}ω)]/j > ℓ },

then

∫_{En} [f(ω) − ℓ] dP ≥ 0

or

P[En] ≤ (1/ℓ) ∫_{En} f(ω) dP.

We are done.
Given the lemma, the proof of the almost sure ergodic theorem follows along the same lines as the proof of the almost sure convergence in the martingale context. If f ∈ H0 it is trivial. For f = (I − U)g with g ∈ L∞ it is equally trivial because ‖An f‖∞ ≤ 2‖g‖∞/n. So the almost sure convergence is valid for f = f1 + f2 with f1 ∈ H0 and f2 = (I − U)g with g ∈ L∞. But such functions are dense in L1(P). Once we have almost sure convergence for a dense set in L1(P), the almost sure convergence for every f ∈ L1(P) follows by routine approximation using Lemma 6.3. See the proof of Theorem 5.7.
defined on the set E where the limit exists. By the ergodic theorem P1 (E) =
P2 (E) = 1 and h is I measurable. Moreover, by the stationarity of P1 , P2
and the bounded convergence theorem,

E^{Pi}[f(ω)] = ∫_E h(ω) dPi   for i = 1, 2
Lemma 6.7. For any stationary probability measure P , for almost all ω with
respect to P , the regular conditional probability distribution Pω , of P given
I, is stationary and ergodic.
or equivalently
P (E ∩ A) = P (E ∩ T A)
This is obvious because P is stationary and E is invariant.
We now turn to ergodicity. Again there is a minefield of null sets to
negotiate. It is a simple exercise to check that if, for some stationary measure
Q, the ergodic theorem is valid with an almost surely constant limit for the
indicator functions 1A with A ∈ F0 , then Q is ergodic. This needs to be
checked only for a countable collection of sets {A}. We therefore need only check that any invariant function is constant almost surely with respect to
almost all Pω . Equivalently for any invariant set E, Pω (E) must be shown
almost surely to be equal to 0 or 1. But Pω (E) = χE (ω) and is always 0 or
1. This completes the proof.
Exercise 6.4. Show that any two distinct ergodic invariant measures P1 and
P2 are orthogonal on I, i.e. there is an invariant set E such that P1 (E) = 1
and P2 (E) = 0.
Exercise 6.5. Let $(\Omega, \mathcal{F}) = ([0, 1), \mathcal{B})$ and $Tx = x + a \pmod 1$. If $a$ is irrational there is just one invariant measure $P$, namely the uniform distribution on $[0, 1)$. This is seen by Fourier analysis. See Remark 2.2.
$$\int e^{i2n\pi x}\,dP = \int e^{i2n\pi(Tx)}\,dP = \int e^{i2n\pi(x+a)}\,dP = e^{i2n\pi a}\int e^{i2n\pi x}\,dP$$
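The ergodic averages for the irrational rotation can be checked numerically. A minimal sketch, with the illustrative assumptions $a = \sqrt 2$ and test function $f(x) = \cos 2\pi x$ (whose space average over $[0,1)$ is 0):

```python
import math

# Ergodic average along the rotation T x = x + a (mod 1).
# The unique invariant measure is uniform, so time averages of
# f(x) = cos(2 pi x) should converge to its integral, which is 0.
a = math.sqrt(2.0)                      # an irrational rotation number
f = lambda x: math.cos(2.0 * math.pi * x)

def time_average(x0, n):
    s, x = 0.0, x0
    for _ in range(n):
        s += f(x)
        x = (x + a) % 1.0
    return s / n

avg = time_average(x0=0.3, n=200_000)
print(abs(avg))   # small: close to the space average 0
```

The starting point `x0` does not matter, in line with the almost sure ergodic theorem (here: for every point, by unique ergodicity).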
Remark 6.1. Given a π, it is not always true that P exists. A simple but illuminating example is to take $X = \{0, 1, \cdots, n, \cdots\}$ to be the nonnegative integers and define $\pi(x, x+1) = 1$, so that all the process does is move one step to the right at every time. Such a process, if it had started a long time back, would be found nowhere today! So it does not exist. On the other hand, if we take $X$ to be the set of all integers, then $P$ is seen to exist. In fact there are lots of them. What is true, however, is that given any initial distribution µ and initial time $m$, there exists a unique process $P$ on $(\Omega, \mathcal{F}^m)$, i.e. defined on the future σ-field from time $m$ on, that is Markov with transition probability π and satisfies $P\{x_m \in A\} = \mu(A)$ for all $A \in \mathcal{B}$.
The shift T acts naturally as a measurable invertible map on the product
space Ω into itself and the notion of a stationary process makes sense. The
following theorem connects stationarity and the Markov property.
Theorem 6.8. Let the transition probability π be given. Let P be a station-
ary Markov process with transition probability π. Then the one dimensional
marginal distribution µ, which is independent of time because of stationarity
and given by
$$\mu(A) = P\{x_n \in A\}$$
is π-invariant in the sense that
$$\mu(A) = \int \pi(x, A)\,\mu(dx).$$
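On a finite state space the invariance condition $\mu(A) = \int \pi(x,A)\,\mu(dx)$ reduces to the left fixed-point equation $\mu P = \mu$ for the transition matrix. A minimal sketch with a made-up 3-state matrix, solving it by power iteration:

```python
# Invariance mu(A) = int pi(x, A) mu(dx) on a finite state space:
# mu P = mu for the (made-up) stochastic matrix P below.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

def step(mu, P):
    # one application of the transition operator on measures: mu -> mu P
    n = len(P)
    return [sum(mu[x] * P[x][y] for x in range(n)) for y in range(n)]

mu = [1 / 3.0] * 3               # any starting probability vector
for _ in range(500):
    mu = step(mu, P)             # power iteration converges to mu P = mu

residual = max(abs(a - b) for a, b in zip(step(mu, P), mu))
print(mu, residual)
```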
conditionally independent given the present. See Theorem 4.9. This implies that
$$P[E\,|\,\mathcal{F}_0^0] = P[E \cap E\,|\,\mathcal{F}_0^0] = P[E\,|\,\mathcal{F}_0^0]\cdot P[E\,|\,\mathcal{F}_0^0]$$
and must therefore equal either 0 or 1. This in turn means that corresponding
to any invariant set E ∈ I, there exists A ⊂ X that belongs to B, such that
E = {ω : xn ∈ A for all n ∈ Z} up to a set of P measure 0. If the
Markov process starts from A or Ac , it does not ever leave it. That means
0 < µ(A) < 1 and
Remark 6.2. One way to generate Markov processes with multiple invariant measures is to start with two Markov processes with transition probabilities $\pi_i(x_i, dy_i)$ on $X_i$ and invariant measures $\mu_i$, and consider $X = X_1 \cup X_2$. Define
$$\pi(x, A) = \begin{cases} \pi_1(x, A \cap X_1) & \text{if } x \in X_1\\ \pi_2(x, A \cap X_2) & \text{if } x \in X_2.\end{cases}$$
Then any one of the two processes can be going on depending on which world
we are in. Both µ1 and µ2 are invariant measures. We have combined two
distinct possibilities into one. What we have shown is that when we have
multiple invariant measures they essentially arise in this manner.
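Remark 6.2's construction is easy to see numerically. In the sketch below the two chains and their invariant measures are made up; gluing them block-diagonally leaves $\mu_1$, $\mu_2$, and every mixture of them invariant:

```python
# Two chains that never communicate, glued into one transition matrix.
# States 0,1 form X1 (invariant measure (0.8, 0.2)); states 2,3 form
# X2 (symmetric chain, invariant measure (0.5, 0.5)).
P = [[0.9, 0.1, 0.0, 0.0],
     [0.4, 0.6, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 0.5, 0.5]]

def apply_left(mu, P):
    n = len(P)
    return [sum(mu[x] * P[x][y] for x in range(n)) for y in range(n)]

mu1 = [0.8, 0.2, 0.0, 0.0]
mu2 = [0.0, 0.0, 0.5, 0.5]
mix = [0.5 * a + 0.5 * b for a, b in zip(mu1, mu2)]

residuals = [max(abs(a - b) for a, b in zip(apply_left(mu, P), mu))
             for mu in (mu1, mu2, mix)]
print(residuals)   # all numerically zero: each measure is invariant
```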
Remark 6.3. We can therefore look at the convex set of measures µ that are π
invariant, i.e. µΠ = µ. The extremals of this convex set are precisely the ones
that correspond to ergodic stationary processes and they are called ergodic
or extremal invariant measures. If the set of invariant probability measures
is nonempty for some π, then there are enough extremals to recover an arbitrary invariant measure as an integral or weighted average of extremal ones.
Exercise 6.9. Show that any two distinct extremal invariant measures µ1 and
µ2 for the same π are orthogonal on B.
Exercise 6.10. Consider the operator Π on the $L^p(\mu)$ spaces corresponding to a given invariant measure. The dimension of the eigenspace $\{f : \Pi f = f\}$ that corresponds to the eigenvalue 1 determines the extremality of µ. Clarify this statement.
Exercise 6.11. Let $P_x$ be the Markov process with stationary transition probability $\pi(x, dy)$ starting at time 0 from $x \in X$. Let $f$ be a bounded measurable function on $X$. Then for almost all $x$ with respect to any extremal invariant measure $\nu$,
$$\lim_{n\to\infty} \frac{1}{n}\,[f(x_1) + \cdots + f(x_n)] = \int f(y)\,\nu(dy).$$
over stationary ergodic processes, show that the integral really involves only stationary Markov processes with transition probability π, so that the integral is really of the form
$$P_\mu = \int_{\mathcal{M}_e} P_\nu\, Q(d\nu)$$
or equivalently
$$\mu = \int_{\mathcal{M}_e} \nu\, Q(d\nu).$$
Exercise 6.13. If there is a reference measure α such that $\pi(x, dy)$ has a density $p(x, y)$ with respect to α for every $x$, then show that any invariant measure µ is absolutely continuous with respect to α. In this case the eigenspace $\{f : \Pi f = f\}$ in $L^2(\mu)$ gives a complete picture of all the invariant measures.
The question of when there is at most one invariant measure for the Markov process with transition probability π is a difficult one. If we have a density $p(x, y)$ with respect to a reference measure α and if, for each $x$, $p(x, y) > 0$ for almost all $y$ with respect to α, then there can be at most one invariant measure. We saw already that any invariant measure has a density with respect to α. If there are at least two invariant measures, then there are at least two ergodic ones which are orthogonal. If we denote by $f_1$ and
with $\theta = a^{1/k}$.
6.4. MIXING PROPERTIES OF MARKOV PROCESSES. 193
for every $n \ge 1$
$$\beta(A) = \lim_{n\to\infty} \int \pi^{(n)}(y, A)\,\beta(dy) = \mu(A).$$
Remark 6.7. The stationary process $P_\mu$ has the property that if $E \in \mathcal{F}^m_{-\infty}$ and $F \in \mathcal{F}^\infty_n$ with a gap of $k = n - m > 0$, then
$$P_\mu[E \cap F] = \int_E \int_X \pi^{(k)}(x_m(\omega), dx)\, P_x(T^{-n}F)\, P_\mu(d\omega)$$
$$P_\mu[E]\,P_\mu[F] = \int_E \int_X \mu(dx)\, P_x(T^{-n}F)\, P_\mu(d\omega)$$
$$P_\mu[E \cap F] - P_\mu[E]\,P_\mu[F] = \int_E \int_X P_x(T^{-n}F)\,\big[\pi^{(k)}(x_m(\omega), dx) - \mu(dx)\big]\, P_\mu(d\omega)$$
and write
$$\psi(n, n, t) - 1 = \sum_{j=1}^{n} [\psi(n, j, t) - \psi(n, j-1, t)]$$
$$\Delta(n, t) = \sum_{j=1}^{n} \big|\psi(n, j, t) - \psi(n, j-1, t)\big|.$$
$$\sup_{|t|\le T,\ 1\le j\le n} \big|\psi(n, j, t) - \psi(n, j-1, t) - \theta(n, j, t)\big| = o\Big(\frac{1}{n}\Big).$$
6.5. CENTRAL LIMIT THEOREM FOR MARTINGALES. 197
Therefore
$$\sup_{|t|\le T}\ \sum_{j=1}^{n} \big|\psi(n, j, t) - \psi(n, j-1, t) - \theta(n, j, t)\big| = n\, o\Big(\frac{1}{n}\Big) \to 0.$$
We now concentrate on estimating $|\sum_{j=1}^{n} \theta(n, j, t)|$. We pick an integer $k$ which will be large but fixed. We divide $[1, n]$ into blocks of size $k$, with perhaps an incomplete block at the end. We will now replace $\theta(n, j, t)$ by
$$\theta_k(n, j, t) = \exp\Big[\frac{\sigma^2 t^2 kr}{2n}\Big]\, E\Big[\exp\Big[it\frac{S_{kr}}{\sqrt{n}}\Big]\,\frac{(\sigma^2 - \xi_j^2)t^2}{2n}\Big]$$
$$\Big|\sum_{j=kr+1}^{k(r+1)} \theta_k(n, j, t)\Big| \le C(t)\,\frac{1}{n}\, E\Big[\Big|\sum_{j=kr+1}^{k(r+1)} (\sigma^2 - \xi_j^2)\Big|\Big] = C(t)\,\frac{k}{n}\,\delta(k)$$
$$\Big|\sum_{j=1}^{n} \theta_k(n, j, t)\Big| \le C(t)\,\delta(k).$$
One may think that the assumption that $\{\xi_n\}$ is a martingale difference is too restrictive to be useful. Let $\{X_n\}$ be any stationary process with zero mean. We can often succeed in writing $X_n = \xi_{n+1} + \eta_{n+1}$ where $\xi_n$ is a martingale difference and $\eta_n$ is negligible, in the sense that $E[(\sum_{j=1}^{n}\eta_j)^2] = o(n)$. Then the central limit theorem for $\{X_n\}$ can be deduced from that of $\{\xi_n\}$. A cheap way to prove $E[(\sum_{j=1}^{n}\eta_j)^2] = o(n)$ is to establish that $\eta_n = Z_n - Z_{n+1}$ for some stationary square integrable sequence $\{Z_n\}$. Then $\sum_{j=1}^{n}\eta_j$ telescopes and the needed estimate is obvious. Here is a way to construct $Z_n$ from $X_n$ so that $X_n + (Z_{n+1} - Z_n)$ is a martingale difference.
Let us define
$$Z_n = \sum_{j=0}^{\infty} E\big[X_{n+j}\,\big|\,\mathcal{F}_n\big].$$
There is no guarantee that the series converges, but we can always hope. After all, if the memory is weak, prediction $j$ steps ahead should be futile if $j$ is large. Therefore if $X_{n+j}$ is becoming independent of $\mathcal{F}_n$ as $j$ gets large, one would expect $E[X_{n+j}|\mathcal{F}_n]$ to approach $E[X_{n+j}]$, which is assumed to be 0. By stationarity $n$ plays no role. If $Z_0$ can be defined, the shift operator $T$ can be used to define $Z_n(\omega) = Z_0(T^n\omega)$. Let us assume that $\{Z_n\}$ exist and are square integrable. Then
$$Z_n = E[Z_{n+1}|\mathcal{F}_n] + X_n$$
or equivalently
$$X_n = Z_n - E[Z_{n+1}|\mathcal{F}_n] = [Z_n - Z_{n+1}] + \big[Z_{n+1} - E[Z_{n+1}|\mathcal{F}_n]\big] = \eta_{n+1} + \xi_{n+1}$$
where $\eta_{n+1} = Z_n - Z_{n+1}$ and $\xi_{n+1} = Z_{n+1} - E[Z_{n+1}|\mathcal{F}_n]$. It is easy to see that $E[\xi_{n+1}|\mathcal{F}_n] = 0$.
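As a concrete illustration (an assumed model, not from the text), take the AR(1) process $X_{n+1} = \rho X_n + \varepsilon_{n+1}$ with i.i.d. Gaussian $\varepsilon$. Then $E[X_{n+j}|\mathcal{F}_n] = \rho^j X_n$, so $Z_n = X_n/(1-\rho)$, $E[Z_{n+1}|\mathcal{F}_n] = \rho Z_n$, and $\xi_{n+1} = \varepsilon_{n+1}/(1-\rho)$ is a martingale difference. A quick numerical check of the decomposition $X_n = \eta_{n+1} + \xi_{n+1}$:

```python
import random

# Sketch for the (assumed) AR(1) model X_{n+1} = rho*X_n + eps_{n+1}:
# Z_n = sum_j E[X_{n+j}|F_n] = X_n/(1-rho) and E[Z_{n+1}|F_n] = rho*Z_n,
# so eta_{n+1} = Z_n - Z_{n+1} and xi_{n+1} = Z_{n+1} - rho*Z_n should
# reassemble X_n exactly.
random.seed(0)
rho, n = 0.5, 10_000
X = [0.0]
for _ in range(n):
    X.append(rho * X[-1] + random.gauss(0.0, 1.0))

Z = [x / (1.0 - rho) for x in X]
max_err = 0.0
for k in range(n):
    eta = Z[k] - Z[k + 1]
    xi = Z[k + 1] - rho * Z[k]       # Z_{k+1} - E[Z_{k+1}|F_k]
    max_err = max(max_err, abs(X[k] - (eta + xi)))
print(max_err)   # an exact algebraic identity, only rounding error
```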
For a stationary ergodic Markov process {Xn } on state space (X, B),
with transition probability π(x , dy) and invariant measure µ, we can prove
the central limit theorem by this method. Let Yj = f (Xj ). Using the Markov
property we can calculate
$$Z_0 = \sum_{j=0}^{\infty} E[f(X_j)\,|\,\mathcal{F}_0] = \sum_{j=0}^{\infty} [\Pi^j f](X_0) = \big[[I - \Pi]^{-1} f\big](X_0).$$
6.6. STATIONARY GAUSSIAN PROCESSES. 199
Exercise 6.14. Let us consider a two state Markov chain with states 1 and 2. Let the transition probabilities be given by $\pi(1,1) = \pi(2,2) = p$ and $\pi(1,2) = \pi(2,1) = q$ with $0 < p, q < 1$, $p + q = 1$. The invariant measure is given by $\mu(1) = \mu(2) = \frac{1}{2}$ for all values of $p$. Consider the random variable $S_n = A_n - B_n$, where $A_n$ and $B_n$ are respectively the number of visits to the states 1 and 2 during the first $n$ steps. Prove a central limit theorem for $\frac{S_n}{\sqrt{n}}$ and calculate the limiting variance as a function $\sigma^2(p)$ of $p$. How does $\sigma^2(p)$ behave as $p \to 0$ or 1? Can you explain it? What is the value of $\sigma^2(\frac{1}{2})$? Could you have guessed it?
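The exercise can be explored (though of course not proved) by simulation. A rough Monte Carlo sketch, with arbitrarily chosen sample sizes, estimating the variance of $S_n/\sqrt n$ at $p = \frac12$, where the visited states are i.i.d. and the variance is 1:

```python
import random

# Monte Carlo look at Exercise 6.14: simulate the two-state chain with
# pi(1,1) = pi(2,2) = p, started from the invariant measure (1/2, 1/2),
# and estimate Var(S_n / sqrt(n)).
random.seed(1)

def estimate_var(p, n=2000, trials=400):
    total = 0.0
    for _ in range(trials):
        state = random.choice([1, 2])      # start from invariant measure
        s = 0
        for _ in range(n):
            s += 1 if state == 1 else -1   # S_n = A_n - B_n
            if random.random() >= p:       # switch with probability q = 1-p
                state = 3 - state
        total += (s / n ** 0.5) ** 2
    return total / trials

v = estimate_var(p=0.5)
print(v)   # for p = 1/2 successive states are i.i.d., so the variance is 1
```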
Exercise 6.15. Consider a random walk on the nonnegative integers with
$$\pi(x, y) = \begin{cases} \frac{1}{2} & \text{for } y = x \ge 0\\[2pt] \frac{1-\delta}{4} & \text{for } y = x+1,\ x \ge 1\\[2pt] \frac{1+\delta}{4} & \text{for } y = x-1,\ x \ge 1\\[2pt] \frac{1}{2} & \text{for } x = 0,\ y = 1.\end{cases}$$
Prove that the chain is positive recurrent and find the invariant measure $\mu(x)$ explicitly. If $f(x)$ is a function on $x \ge 0$ with compact support, solve explicitly the equation $[I - \Pi]U = f$. Show that either $U$ grows exponentially at infinity or is a constant for large $x$. Show that it is a constant if and only if $\sum_x f(x)\mu(x) = 0$. What can you say about the central limit theorem for $\sum_{j=0}^{n} f(X_j)$ for such functions $f$?
then
$$C_{i,j} = \sum_{k} t_{i,k}\, t_{j,k}.$$
In fact for any $C$ we can find a $T$ which is upper or lower triangular, i.e. $t_{i,k} = 0$ for $i > k$ or for $i < k$. If the indices correspond to time, this can be interpreted as a causal representation in terms of current and future or past variables only.
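The lower-triangular factorization $C = T\,T^t$ is exactly a Cholesky decomposition; row $i$ of $T$ expresses $X_i$ causally in terms of innovations $\xi_1, \ldots, \xi_i$. A hand-rolled sketch on a made-up positive definite $C$:

```python
# Cholesky factorization C_{i,j} = sum_k t_{i,k} t_{j,k} with T lower
# triangular, computed by the standard forward recursion.
C = [[4.0, 2.0, 0.0],
     [2.0, 5.0, 1.0],
     [0.0, 1.0, 3.0]]
n = len(C)
T = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1):
        s = sum(T[i][k] * T[j][k] for k in range(j))
        if i == j:
            T[i][j] = (C[i][i] - s) ** 0.5
        else:
            T[i][j] = (C[i][j] - s) / T[j][j]

# Verify the reconstruction C_{i,j} = sum_k t_{i,k} t_{j,k}.
err = max(abs(sum(T[i][k] * T[j][k] for k in range(n)) - C[i][j])
          for i in range(n) for j in range(n))
print(err)
```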
The following questions have simple answers.
Q1. When does a Gaussian process have a moving average representation in terms of independent Gaussians, i.e. a representation of the form
$$X_n = \sum_{m=-\infty}^{\infty} a_{n-m}\,\xi_m$$
with
$$\sum_{n=-\infty}^{\infty} a_n^2 < \infty$$
and that will make $\{\rho_k\}$ the Fourier coefficients of the function
$$f = \Big|\sum_{j} a_j e^{ij\theta}\Big|^2$$
with
$$\sum_{j\ge 0} a_j^2 < \infty?$$
If we do have a causal representation then the remote past of the {Xk } process
is clearly part of the remote past of the {ξk } process. By Kolmogorov’s
zero-one law, the remote past for independent Gaussians is trivial and a
causal representation is therefore possible for {Xk } only if its remote past
is trivial. The converse is true as well. The subspace Hn is spanned by
Hn−1 and Xn . Therefore either Hn = Hn−1 , or Hn−1 has codimension 1 in
$H_n$. In the former case, by stationarity, $H_n = H_{n-1}$ for every $n$. This in turn implies $H_{-\infty} = H = H_{\infty}$. Assuming that the process is not identically zero, i.e. $\rho_0 = \mu(S) > 0$, this makes the remote past or future the whole thing and definitely nontrivial. So we may assume that $H_n = H_{n-1} \oplus e_n$ where $e_n$ is a one dimensional subspace spanned by a unit vector $\xi_n$. Since all our random variables are linear combinations of a Gaussian collection, they all have Gaussian distributions. We have the shift operator $U$ satisfying $UX_n = X_{n+1}$ and we can assume without loss of generality that $U\xi_n = \xi_{n+1}$
for every $n$. If we start with $X_0$ in our Hilbert space,
$$X_0 = a_0\,\xi_0 + R_{-1}$$
with $R_{-1} \in H_{-1}$. We can continue and write
$$R_{-1} = a_1\,\xi_{-1} + R_{-2}$$
with $R_{-(n+1)} \in H_{-(n+1)}$. Since $\cap_n H_{-n} = \{0\}$ we conclude that the expansion
$$X_0 = \sum_{j=0}^{\infty} a_j\,\xi_{-j}$$
is valid.
Q3. What are the conditions on the spectral density $f$ in order that the process may admit a causal representation? From our answer to Q1 we know that we have to solve the following analytical problem. Given the spectral measure µ with a nonnegative density $f \in L^1(S)$, when can we write $f = |g|^2$ for some $g \in L^2(S)$ that admits a Fourier representation $g = \sum_{j\ge 0} a_j e^{ij\theta}$ involving only positive frequencies? This has the following neat solution which is far from obvious.
Remark 6.8. Notice that the condition basically prevents f from vanishing
on a set of positive measure or having very flat zeros.
The proof will use methods from the theory of functions of a complex variable.
Proof. Define
$$g(\theta) = \sum_{n\ge 0} c_n \exp[in\theta].$$
Since $G(re^{i\theta})$ has a limit in $L^2(S)$, the positive part of $\log|G|$, which is dominated by $|G|$, is uniformly integrable. For the negative part we apply Fatou's lemma and derive our estimate.
Now for the converse. Let $f \in L^1(S)$. Assume $\int_S \log f(\theta)\,d\theta > -\infty$, or equivalently $\log f \in L^1(S)$. Define the Fourier coefficients
$$a_n = \frac{1}{4\pi}\int_S \log f(\theta)\,\exp[in\theta]\,d\theta.$$
Because $\log f$ is integrable, $\{a_n\}$ are uniformly bounded and the power series
$$A(z) = \sum_n a_n z^n$$
converges in the unit disc. Define
$$G(z) = \exp[A(z)].$$
$$\begin{aligned}|G(re^{i\theta})|^2 &= \exp\big[2\,\mathrm{Re}\,A(re^{i\theta})\big]\\ &= \exp\Big[2\sum_{j=0}^{\infty} a_j r^j \cos j\theta\Big]\\ &= \exp\Big[2\sum_{j=0}^{\infty} r^j \cos j\theta\,\frac{1}{4\pi}\int_S \log f(\varphi)\cos j\varphi\,d\varphi\Big]\\ &= \exp\Big[\frac{1}{2\pi}\int_S \log f(\varphi)\Big[\sum_{j=0}^{\infty} r^j \cos j\theta\,\cos j\varphi\Big]d\varphi\Big]\\ &= \exp\Big[\int_S \log f(\varphi)\,K(r,\theta,\varphi)\,d\varphi\Big]\\ &\le \int_S f(\varphi)\,K(r,\theta,\varphi)\,d\varphi\end{aligned}$$
where
$$K(r,\theta,\varphi) = \frac{1}{2\pi}\sum_{j=0}^{\infty} r^j \cos j\theta\,\cos j\varphi$$
is nonnegative and $\int_S K(r,\theta,\varphi)\,d\varphi = 1$. The last step is a consequence of Jensen's inequality. The function
$$f_r(\theta) = \int_S f(\varphi)\,K(r,\theta,\varphi)\,d\varphi$$
$X_j$ is spanned by $\{\xi_k : k \le j\}$; the converse may not be true. If the two spans were the same, then the best predictor for $X_0$ is just
$$\hat X_0 = \sum_{j\ge 1} a_j\,\xi_{-j}$$
$f = |g|^2$
$$|G(0)|^2 = |a_0|^2 \le \exp\Big[\frac{1}{2\pi}\int_S \log f(\theta)\,d\theta\Big]. \tag{6.1}$$
$$|G(0)|^2 = \exp\Big[\frac{1}{2\pi}\int_S \log f(\theta)\,d\theta\Big] \tag{6.2}$$
The prediction error $\sigma^2(f)$, which depends only on $f$ and not on the choice of $g$, also satisfies
$$\sigma^2(f) \ge |G(0)|^2 \tag{6.3}$$
for every choice of $g \in H_2$ with $f = |g|^2$. There is a choice of $g$ such that
$$\sigma^2(f) = |G(0)|^2 \tag{6.4}$$
Therefore from (6.1) and (6.4)
$$\sigma^2(f) \le \exp\Big[\frac{1}{2\pi}\int_S \log f(\theta)\,d\theta\Big] \tag{6.5}$$
is given by
$$\sigma^2(f) = \exp\Big[\frac{1}{2\pi}\int_S \log f(\theta)\,d\theta\Big]$$
there is at least one choice that will satisfy it. There is still an ambiguity, albeit a trivial one, among these, for we can always multiply $g$ by a complex number of modulus 1 and that will not change anything of consequence. We have the following theorem.
in $L^2(S)$, the functions are uniformly integrable in $r$. The positive part of the logarithm $F$ is well controlled and therefore uniformly integrable. Fatou's lemma is applicable and we should always have
$$\limsup_{r\to 1}\ \frac{1}{2\pi}\int_S F(re^{i\theta})\,d\theta \le \frac{1}{4\pi}\int_S \log f(\theta)\,d\theta.$$
Since we have equality at both ends, that implies a lot of things. In particular $F$ is harmonic and is represented via the Poisson integral in terms of its boundary value $\frac{1}{2}\log f$. In particular $G$ has no zeros in the disc. Obviously $F$ is uniquely determined by $\log f$, and by the Cauchy-Riemann equations the imaginary part of $\log G$ is determined up to an additive constant. Therefore the only ambiguity in $G$ is a multiplicative constant of modulus 1.
Given the process $\{X_n\}$ with trivial tail subspaces, we saw earlier that it has a representation
$$X_n = \sum_{j=0}^{\infty} a_j\,\xi_{n-j}$$
in terms of standard i.i.d Gaussians and from the construction we also know
that ξn ∈ Hn for each n. In particular ξ0 ∈ H0 and can be approximated by
linear combinations of {Xj : j ≤ 0}. Let us suppose that h(θ) represents ξ0 in
L2 (S; f ). We know that h(θ) is in the linear span of {ei j θ : j ≤ 0}. We want
to find the function h. If ξ0 ←→ h, then by the nature of the isomorphism
$\xi_n \longleftrightarrow e^{in\theta}h$ and
$$1 = \sum_{j=0}^{\infty} a_j\,e^{-ij\theta}\,h(\theta)$$
then the boundary function $g(\theta) = \lim_{r\to 1} G(re^{i\theta})$ has the property
$$g(-\theta)\,h(\theta) = 1$$
and so
$$h(\theta) = \frac{1}{g(-\theta)}.$$
Since the function $G$ that we constructed has the property
$$|G(0)|^2 = |a_0|^2 = \sigma^2(f) = \exp\Big[\frac{1}{2\pi}\int_S \log f(\theta)\,d\theta\Big]$$
it is the canonical choice determined earlier, to within a multiplicative constant of modulus 1. The predictor then is clearly represented by the function
$$\hat 1(\theta) = 1 - a_0\,h(\theta) = 1 - \frac{g(0)}{g(-\theta)}$$
Example 6.2. A wide class of examples is given by densities $f(\theta)$ that are rational trigonometric polynomials of the form
$$f(\theta) = \frac{\big|\sum_j A_j e^{ij\theta}\big|^2}{\big|\sum_j B_j e^{ij\theta}\big|^2}.$$
We can always multiply by $e^{ik\theta}$ inside the absolute value and assume that
$$f(\theta) = \frac{|P(e^{i\theta})|^2}{|Q(e^{i\theta})|^2}$$
where P (z) and Q(z) are polynomials in the complex variable z. The sym-
metry of f under θ → −θ means that the coefficients in the polynomial have
to be real. The integrability of f will force the polynomial Q not to have any
zeros on the circle $|z| = 1$. Given any two complex numbers $c$ and $z$, such that $|z| = 1$ and $c \ne 0$,
$$|z - c| = |\bar z - \bar c| = \Big|\frac{1}{z} - \bar c\Big| = |1 - \bar c z| = |c|\,\Big|z - \frac{1}{\bar c}\Big|.$$
This means in our representation for $f$, first we can omit terms that involve powers of $z$, which have modulus 1 on $S$. Next, any term $(z - c)$ that contributes a nonzero root $c$ with $|c| < 1$ can be replaced by $c(z - \frac{1}{\bar c})$, and thus we can move the root outside the disc without changing the value of $f$. We can therefore rewrite
therefore rewrite
f (θ) = |g(θ)|2
with
P (z)
G(z) =
Q(z)
with new polynomials P and Q that have no roots inside the unit disc and
with perhaps P alone having roots on S . Clearly
Q(ei θ )
h(θ) =
P (ei θ )
If $P$ has no roots on $S$, we have a nice convergent power series for $\frac{Q}{P}$ with a radius of convergence larger than 1, and we are in a very good situation. If $P = 1$, we are in an even better situation, with the predictor expressed as a finite sum. If $P$ has a root on $S$, then it could be a little bit of a mess, as the next exercise shows.
$$X_n = \xi_n - \xi_{n-1}$$
$$X_n = \sum_{j=1}^{k} a_j X_{n-j} + \sigma\,\xi_n$$
$$\hat X_n = \sum_{j=1}^{k} a_j X_{n-j}$$
and the prediction error σ 2 are specified for the model. Can you always
find a stationary Gaussian process {Xn } with spectral density f (θ), that is
consistent with the model?
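The prediction-error formula $\sigma^2(f) = \exp\big[\frac{1}{2\pi}\int_S \log f(\theta)\,d\theta\big]$ can be sanity-checked numerically (this check, and the normalization convention, are illustrative and not part of the exercises). For the AR(1) density $f(\theta) = \sigma^2/|1 - a e^{i\theta}|^2$ with $|a| < 1$, one has $\frac{1}{2\pi}\int_S \log|1 - ae^{i\theta}|^2\,d\theta = 0$, so the formula should return $\sigma^2$:

```python
import cmath, math

# Numerical check of sigma^2(f) = exp( (1/2pi) int log f ) for the
# AR(1) spectral density f(theta) = sigma^2 / |1 - a e^{i theta}|^2.
a, sigma2 = 0.6, 2.0

def f(theta):
    return sigma2 / abs(1.0 - a * cmath.exp(1j * theta)) ** 2

# Riemann sum of (1/2pi) int_0^{2pi} log f(theta) dtheta on a uniform grid.
N = 20_000
mean_log = sum(math.log(f(2.0 * math.pi * k / N)) for k in range(N)) / N
pred_err = math.exp(mean_log)
print(pred_err)   # recovers sigma^2 = 2.0
```

The grid sum is essentially exact here because $\prod_k (1 - a\,e^{2\pi i k/N}) = 1 - a^N$, which is numerically 1.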
Chapter 7

Dynamic Programming and Filtering

with
$$V_N(x) = f(x)$$
214 CHAPTER 7. DYNAMIC PROGRAMMING AND FILTERING.
We then have
Theorem 7.1. If the Markov chain starts from x at time 0, then V0 (x) is
the best expected value of the reward. The ‘optimal’ control is Markovian
and is provided by $\{\alpha_j^*(x_j)\}$.
Proof. It is clear that if we pick the control as $\alpha_j^*$ then we have an inhomogeneous Markov chain with transition probability
$$\pi_{j, j+1}(x, dy) = \pi_{\alpha_j^*(x)}(x, dy)$$
and if we denote by $P_x^*$ the process corresponding to it that starts from the point $x$ at time 0, we can establish by induction that
$$E^{P_x^*}\{f(x_N)\,|\,\mathcal{F}_{N-j}\} = V_{N-j}(x_{N-j})$$
for $1 \le j \le N$. Taking $j = N$, we obtain
$$E^{P_x^*}\{f(x_N)\} = V_0(x).$$
To show that $V_0(x)$ is optimal: for any admissible (not necessarily Markovian) choice of controls, if $P$ is the measure on $\mathcal{F}_N$ corresponding to a starting point $x$,
$$E^{P}\{V_{j+1}(x_{j+1})\,|\,\mathcal{F}_j\} \le V_j(x_j)$$
and now it follows that
$$E^{P}\{f(x_N)\} \le V_0(x).$$
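The backward induction behind the theorem is easy to code. A toy sketch (the states, actions, transition rows, and terminal reward below are all made up) computing $V_j$ and the Markovian policy $\alpha_j^*$:

```python
# Backward induction: V_j(x) = max over actions a of
# sum_y pi_a(x, y) V_{j+1}(y), with V_N(x) = f(x).
states, actions, N = [0, 1], ["stay", "move"], 5
f = {0: 0.0, 1: 1.0}                      # terminal reward f(x)
pi = {                                     # pi[a][x] = row of probabilities
    "stay": {0: [0.9, 0.1], 1: [0.2, 0.8]},
    "move": {0: [0.4, 0.6], 1: [0.5, 0.5]},
}

V = dict(f)                                # V_N
policy = {}
for j in range(N - 1, -1, -1):
    newV = {}
    for x in states:
        best = max(actions,
                   key=lambda a: sum(p * V[y] for y, p in enumerate(pi[a][x])))
        newV[x] = sum(p * V[y] for y, p in enumerate(pi[best][x]))
        policy[(j, x)] = best              # the optimal Markovian control
    V = newV

print(V, policy[(0, 0)])
```

Running it shows the optimal control steering toward the rewarding state 1 from every stage.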
Exercise 7.1. The problem could be modified by making the reward function
equal to
$$E^{P}\Big\{\sum_{j=0}^{N} f_j(\alpha_{j-1},\, x_j)\Big\}$$
and thereby incorporate the cost of control into the reward function. Work
out the recursion formula for the optimal reward in this case.
7.2. OPTIMAL STOPPING. 215
$$\bar\tau = \inf\{k : V(k, x_k) = f(k, x_k)\}$$
$$E_x\{f(\tau, x_\tau)\} \le V(0, x)$$
and
$$E_x\{f(\bar\tau, x_{\bar\tau})\} = V(0, x)$$
Proof. Because
$$V(k, x) \ge \int V(k+1, y)\,\pi(x, dy)$$
and this means $V(\bar\tau \wedge k,\, x_{\bar\tau \wedge k})$ is a martingale, which establishes the second claim.
If we decide to look at the first $Nx$ tickets rather than $\frac{N}{2}$, the lower bound becomes $x\log\frac{1}{x}$, and an optimization over $x$ leads to $x = \frac{1}{e}$ and the resulting lower bound
$$\liminf_{N\to\infty} p_N \ge \frac{1}{e}.$$
We will now use the method of optimal stopping to decide on the best strategy for every $N$ and show that the procedure we described is about the best. Since the only thing that matters is the ordering of the numbers, the
numbers themselves have no meaning. Consider a Markov chain with two
states 0 and 1. The player is in state 1 if he is holding the largest ticket so far.
Otherwise he is in state 0. If he is in state 1 and stops at stage k, i.e. when
k tickets have been drawn, the probability of his winning is easily calculated
to be $\frac{k}{N}$. If he is in state 0, he has to go on, and the probability of landing on 1 at the next step is calculated to be $\frac{1}{k+1}$. If he is at 1 and decides to play on, the probability is still $\frac{1}{k+1}$ of landing on 1 at the next stage. The problem reduces to optimal stopping for a sequence $X_1, X_2, \cdots, X_N$ of independent random variables with $P\{X_i = 1\} = \frac{1}{i+1}$, $P\{X_i = 0\} = \frac{i}{i+1}$ and a reward function of $f(i, 1) = \frac{i}{N}$; $f(i, 0) = 0$. Let us define recursively the optimal probabilities
$$V(i, 0) = \frac{1}{i+1}\,V(i+1, 1) + \frac{i}{i+1}\,V(i+1, 0)$$
and
$$V(i, 1) = \max\Big[\frac{i}{N},\ \frac{1}{i+1}\,V(i+1, 1) + \frac{i}{i+1}\,V(i+1, 0)\Big] = \max\Big[\frac{i}{N},\ V(i, 0)\Big]$$
It is clear what the optimal strategy is. We should always draw if we are in state 0, i.e. we are sure to lose if we stop. If we are holding a ticket that is the largest so far, we should stop provided
$$\frac{i}{N} > V(i, 0)$$
and go on if
$$\frac{i}{N} < V(i, 0).$$
Either strategy is acceptable in case of equality. Since $V(i+1, 1) \ge V(i+1, 0)$ for all $i$, it follows that $V(i, 0) \ge V(i+1, 0)$. There is therefore a critical $k(= k_N)$ such that $\frac{i}{N} \ge V(i, 0)$ if $i \ge k$ and $\frac{i}{N} \le V(i, 0)$ if $i \le k$. The best strategy is to wait till $k$ tickets have been drawn, discarding every ticket,
and then pick the first one that is the best so far. The last question is the determination of $k = k_N$. For $i \ge k$,
$$V(i, 0) = \frac{1}{i+1}\cdot\frac{i+1}{N} + \frac{i}{i+1}\,V(i+1, 0) = \frac{1}{N} + \frac{i}{i+1}\,V(i+1, 0)$$
or
$$\frac{V(i, 0)}{i} - \frac{V(i+1, 0)}{i+1} = \frac{1}{N}\cdot\frac{1}{i}$$
telling us
$$V(i, 0) = \frac{i}{N}\sum_{j=i}^{N-1}\frac{1}{j}$$
so that
$$k_N = \inf\Big\{i : \sum_{j=i}^{N-1}\frac{1}{j} < 1\Big\}.$$
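The recursion for $V(i,0)$ and the threshold $k_N$ can be computed directly; the sketch below also confirms the $\frac{1}{e}$ asymptotics derived earlier (both $k_N/N$ and the winning probability approach $1/e \approx 0.3679$):

```python
# Backward recursion for the secretary problem:
#   V(i,0) = V(i+1,1)/(i+1) + i/(i+1) * V(i+1,0),
#   V(i,1) = max(i/N, V(i,0)),  with V(N,0) = 0, V(N,1) = 1.
# k_N is the smallest i with i/N >= V(i,0).
def secretary(N):
    V0, V1 = 0.0, 1.0          # V(N,0), V(N,1)
    kN = N
    for i in range(N - 1, 0, -1):
        v0 = V1 / (i + 1) + V0 * i / (i + 1)
        v1 = max(i / N, v0)
        if i / N >= v0:
            kN = i             # threshold property: condition is monotone
        V0, V1 = v0, v1
    return kN, V1              # V(1,1): optimal winning probability

kN, p = secretary(10_000)
print(kN / 10_000, p)          # both near 1/e ~ 0.3679
```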
7.3 Filtering.
The problem in filtering is that there is an underlying stochastic process
that we cannot observe. There is a related stochastic process ‘driven’ by
the first one that we can observe and we want to use our information to
draw conclusions about the state of the unobserved process. A simple but
extreme example is when the unobserved process does not move and remains
at the same value. Then it becomes a parameter. The driven process may
be a sequence of i.i.d random variables with densities f (θ, x) where θ is
the unobserved, unchanging underlying parameter. We have a sample of n
independent observations X1 , · · · , Xn from the common distribution f (θ, x)
and our goal is then nothing other than parameter estimation. We shall take
a Bayesian approach. We have a prior distribution $\mu(d\theta)$ on the space of parameters Θ and this can be modified to an 'a posteriori' distribution after the sample is observed. We have the joint distribution
$$\prod_{i=1}^{n} f(\theta, x_i)\,dx_i\ \mu(d\theta).$$
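A minimal sketch of the Bayesian update, with a finite made-up prior and Bernoulli densities $f(\theta, x)$ standing in for the general setup: the posterior reweights the prior by the likelihood of the observed sample.

```python
# Posterior ('a posteriori') distribution for the parameter-estimation
# example: theta on a finite grid, f(theta, x) a Bernoulli density.
thetas = [0.2, 0.5, 0.8]           # support of the (made-up) prior
prior = [1 / 3.0] * 3
sample = [1, 1, 0, 1, 1, 1, 0, 1]  # observed X_1, ..., X_n

def likelihood(theta, xs):
    p = 1.0
    for x in xs:
        p *= theta if x == 1 else (1.0 - theta)
    return p

weights = [p0 * likelihood(t, sample) for t, p0 in zip(thetas, prior)]
Z = sum(weights)
posterior = [w / Z for w in weights]
print(posterior)   # mass concentrates near the empirical frequency 6/8
```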
After a long time, while the recursion for $m_n$ remains the same,
$$m_n = \frac{1}{(\rho^2\sigma_0^2 + \sigma^2 + 1)}\,m_{n-1} + \frac{(\rho^2\sigma_0^2 + \sigma^2)}{(\rho^2\sigma_0^2 + \sigma^2 + 1)}\,y_n$$
$$\sigma_\infty^2 = \frac{(\rho^2\sigma_\infty^2 + \sigma^2)}{(\rho^2\sigma_\infty^2 + \sigma^2 + 1)}.$$
Index
Doob, 151, 152, 157, 161, 164
    decomposition theorem of, 157
    inequality of, 151, 152
    stopping theorem of, 161
    upcrossing inequality of, 164
double integral, 27
dynamic programming, 213
ergodic invariant measure, 184
ergodic process, 184
    extremality of, 185
ergodic theorem, 179
    almost sure, 179, 182
    maximal, 182
    mean, 179
ergodicity, 184
Esseen, 97
exit probability, 170
expectation, 28
exponential distribution, 35
    two sided, 35
extension theorem, 11
Fatou, 20
Fatou's lemma, 20
field, 8
    σ-field generated by, 10
filter, 219
finite additivity, 9
Fubini, 27
Fubini's theorem, 27
gamma distribution, 35
Gaussian distribution, 35
Gaussian process, 200
    stationary, 200
        autoregressive schemes, 211
        causal representation of, 200
        moving average representation of, 200
        prediction error of, 205
        prediction of, 205
        predictor of, 205
        rational spectral density, 210
        spectral density of, 200
        spectral measure of, 200
generating function, 36
geometric distribution, 34
Hahn, 104
Hahn-Jordan decomposition, 104
independent events, 51
independent random variables, 51
indicator function, 15
induced probability measure, 23
infinitely divisible distributions, 83
integrable functions, 21
integral, 14, 15
invariant measures, 179
inversion theorem, 34
irrational rotations, 187
Jensen, 110
Jordan, 104
Kallman, 219
Kallman-Bucy filter, 219
Khintchine, 89
Kolmogorov, 7, 59, 62, 66, 67, 70, 117
    consistency theorem of, 59, 61
    inequality of, 62
    one series theorem of, 66
    three series theorem of, 67
    two series theorem of, 66
transformations, 22, 23
    measurable, 23
    measure preserving, 179
        isometries from, 179
transience, 124
transient states, 133
transition operator, 169
transition probability, 117
Probability/ Limit Theorems
Final Examination
Q1. For each $n$, $\{X_{n,j}\},\ j = 1, 2, \ldots, n$, are $n$ mutually independent random variables taking values 0 or 1 with probabilities $1 - p_{n,j}$ and $p_{n,j}$ respectively. If
$$\lim_{n\to\infty}\ \sup_{j}\, p_{n,j} = 0,$$
then show that any limiting distribution of $S_n = X_{n,1} + X_{n,2} + \cdots + X_{n,n}$ is Poisson and the limit exists if and only if
infinitely divisible? If it is, what is its Lévy-Khintchine representation? How about the two sided exponential $f(x) = \frac{1}{2}e^{-|x|}$?
Q3. Let $f(x)$ be an integrable function on $[0, 1]$ with respect to the Lebesgue measure. For each $n$ and $j = 0, 1, \ldots, 2^n - 1$, define for $j2^{-n} \le x \le (j+1)2^{-n}$
$$f_n(x) = 2^n \int_{j2^{-n}}^{(j+1)2^{-n}} f(y)\,dy.$$
Show that $\lim_{n\to\infty} f_n(x) = f(x)$ a.e. with respect to the Lebesgue measure.
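The dyadic averages of Q3 (a martingale with respect to the dyadic σ-fields) can be checked numerically. A sketch verifying $f_n \approx f$ for the smooth test function $f(x) = x^2$, with the integral approximated by a midpoint Riemann sum:

```python
# f_n replaces f by its average over the dyadic interval
# [j 2^-n, (j+1) 2^-n) containing x; for smooth f, f_n -> f visibly
# fast as n grows.
f = lambda x: x * x

def f_n(x, n, m=1000):
    j = int(x * 2 ** n)                  # dyadic interval containing x
    lo, hi = j / 2 ** n, (j + 1) / 2 ** n
    # midpoint Riemann-sum approximation of 2^n * int_{lo}^{hi} f(y) dy
    return sum(f(lo + (hi - lo) * (k + 0.5) / m) for k in range(m)) / m

n = 12
max_gap = max(abs(f_n(x, n) - f(x)) for x in [0.1, 0.3, 0.5, 0.77, 0.99])
print(max_gap)   # at most about 2 * 2^-12, since f is Lipschitz on [0,1]
```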
Q4. If $X_1, X_2, \ldots, X_n, \ldots$ are independent random variables that are almost surely positive (i.e. $P[X_i > 0] = 1$) with $E[X_i] = 1$, show that
$$Z_n = X_1 X_2 \cdots X_n$$
$$\lim_{n\to\infty} Z_n = Z?$$
When is $Z$ nonzero? Is it sufficient if
$$\prod_i E[X_i^{-a}] < \infty$$
$$f_n(x) = \frac{\alpha_n^{p_n}}{\Gamma(p_n)}\, e^{-\alpha_n x}\, x^{p_n - 1}$$