
Probability Theory

S.R.S.Varadhan
Courant Institute of Mathematical Sciences
New York University

August 31, 2000


Contents

1 Measure Theory
  1.1 Introduction
  1.2 Construction of Measures
  1.3 Integration
  1.4 Transformations
  1.5 Product Spaces
  1.6 Distributions and Expectations

2 Weak Convergence
  2.1 Characteristic Functions
  2.2 Moment Generating Functions
  2.3 Weak Convergence

3 Independent Sums
  3.1 Independence and Convolution
  3.2 Weak Law of Large Numbers
  3.3 Strong Limit Theorems
  3.4 Series of Independent Random Variables
  3.5 Strong Law of Large Numbers
  3.6 Central Limit Theorem
  3.7 Accompanying Laws
  3.8 Infinitely Divisible Distributions
  3.9 Laws of the Iterated Logarithm

4 Dependent Random Variables
  4.1 Conditioning
  4.2 Conditional Expectation
  4.3 Conditional Probability
  4.4 Markov Chains
  4.5 Stopping Times and Renewal Times
  4.6 Countable State Space

5 Martingales
  5.1 Definitions and Properties
  5.2 Martingale Convergence Theorems
  5.3 Doob Decomposition Theorem
  5.4 Stopping Times
  5.5 Upcrossing Inequality
  5.6 Martingale Transforms, Option Pricing
  5.7 Martingales and Markov Chains

6 Stationary Stochastic Processes
  6.1 Ergodic Theorems
  6.2 Structure of Stationary Measures
  6.3 Stationary Markov Processes
  6.4 Mixing Properties of Markov Processes
  6.5 Central Limit Theorem for Martingales
  6.6 Stationary Gaussian Processes

7 Dynamic Programming and Filtering
  7.1 Optimal Control
  7.2 Optimal Stopping
  7.3 Filtering
Preface

These notes are based on a first year graduate course on Probability and Limit Theorems given at the Courant Institute of Mathematical Sciences. Originally written during 1997-98, they have been revised during the academic year 1998-99 as well as in the Fall of 1999. I want to express my appreciation to those who pointed out several typos and offered suggestions for improvement. I want to mention in particular the detailed comments from Professor Charles Newman and Mr. Enrique Loubet. Chuck used the notes while teaching the course in 98-99, and Enrique helped me as TA when I taught out of these notes again in the Fall of 99. These notes cover about three fourths of the course, essentially discrete time processes. Hopefully a companion volume covering continuous time processes will appear in the near future. A small amount of measure theory is included. While it is not meant to be complete, it is my hope that it will be useful.

Chapter 1

Measure Theory

1.1 Introduction.

The evolution of probability theory was based more on intuition than on mathematical axioms during its early development. In 1933, A. N. Kolmogorov [4] provided an axiomatic basis for probability theory, and it is now the universally accepted model. There are certain 'non-commutative' versions that have their origins in quantum mechanics, see for instance K. R. Parthasarathy [5], that are generalizations of the Kolmogorov model. We shall however use exclusively Kolmogorov's framework.
The basic intuition in probability theory is the notion of randomness. There are experiments whose results are not predictable and can be determined only after performing the experiment and observing the outcome. The simplest familiar examples are the tossing of a fair coin and the throwing of a balanced die. In the first experiment the result could be either a head or a tail, and the throwing of a die could result in a score of any integer from 1 through 6. These are experiments with only a finite number of alternative outcomes. It is not difficult to imagine experiments that have countably or even uncountably many possible outcomes.
Abstractly then, there is a space Ω of all possible outcomes and each
individual outcome is represented as a point ω in that space Ω. Subsets of Ω
are called events and each of them corresponds to a collection of outcomes. If
the outcome ω is in the subset A, then the event A is said to have occurred.
For example in the case of a die the set A = {1, 3, 5} ⊂ Ω corresponds to
the event ‘an odd number shows up’. With this terminology it is clear that


union of sets corresponds to 'or', intersection to 'and', and complementation to 'negation'.
One would expect that probabilities should be associated with each outcome and that there should be a 'Probability Function' f(ω), the probability that ω occurs. In the case of coin tossing we may expect Ω = {H, T} and

f(T) = f(H) = 1/2.

Or in the case of a die

f(1) = f(2) = · · · = f(6) = 1/6.

Since 'Probability' is normalized so that certainty corresponds to a probability of 1, one expects

∑_{ω∈Ω} f(ω) = 1.     (1.1)

If Ω is uncountable this is a mess. There is no reasonable way of adding up an uncountable set of numbers each of which is 0. This suggests that
it may not be possible to start with probabilities associated with individual
outcomes and build a meaningful theory. The next best thing is to start with
the notion that probabilities are already defined for events. In such a case,
P (A) is defined for a class B of subsets A ⊂ Ω. The question that arises
naturally is what should B be and what properties should P (·) defined on B
have? It is natural to demand that the class B of sets for which probabilities
are to be defined satisfy the following properties:
The whole space Ω and the empty set Φ are in B. For any two sets A
and B in B, the sets A ∪ B and A ∩ B are again in B. If A ∈ B, then the
complement Ac is again in B. Any class of sets satisfying these properties is
called a field.

Definition 1.1. A 'probability' or more precisely 'a finitely additive probability measure' is a nonnegative set function P(·) defined for sets A ∈ B that satisfies the following properties:

P (A) ≥ 0 for all A ∈ B, (1.2)

P (Ω) = 1 and P (Φ) = 0. (1.3)



If A ∈ B and B ∈ B are disjoint then

P(A ∪ B) = P(A) + P(B).     (1.4)
In particular
P (Ac ) = 1 − P (A) (1.5)
for all A ∈ B.
A condition which is somewhat more technical, but important from a mathematical viewpoint, is that of countable additivity. The class B, in addition to being a field, is assumed to be closed under countable unions (or equivalently, countable intersections); i.e. if An ∈ B for every n, then A = ∪n An ∈ B. Such a class is called a σ-field. The 'probability' itself is presumed to be defined on a σ-field B.
Definition 1.2. A set function P defined on a σ-field is called a 'countably additive probability measure' if in addition to satisfying equations (1.2), (1.3) and (1.4), it satisfies the following countable additivity property: for any sequence of pairwise disjoint sets An with A = ∪n An,

P(A) = ∑_n P(An).     (1.6)

Exercise 1.1. The limit of an increasing (or decreasing) sequence An of sets is defined as its union ∪n An (or the intersection ∩n An). A monotone class is defined as a class that is closed under limits of monotone (increasing or decreasing) sequences of sets. Show that a field B is a σ-field if and only if it is a monotone class.

Exercise 1.2. Show that a finitely additive probability measure P(·) defined on a σ-field B is countably additive, i.e. satisfies equation (1.6), if and only if it satisfies either of the following two equivalent conditions.

If An is any nonincreasing sequence of sets in B and A = lim_n An = ∩n An, then

P(A) = lim_n P(An).

If An is any nondecreasing sequence of sets in B and A = lim_n An = ∪n An, then

P(A) = lim_n P(An).

Exercise 1.3. If A, B ∈ B and P is a finitely additive probability measure, show that P(A ∪ B) = P(A) + P(B) − P(A ∩ B). How does this generalize to P(∪_{j=1}^n Aj)?

Exercise 1.4. If P is a finitely additive measure on a field F and A, B ∈ F, then |P(A) − P(B)| ≤ P(A∆B), where A∆B is the symmetric difference (A ∩ B^c) ∪ (A^c ∩ B). In particular, if B ⊂ A,

0 ≤ P(A) − P(B) ≤ P(A ∩ B^c) ≤ P(B^c).
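For a finite sample space this inequality can be checked exhaustively. The following Python sketch (an illustration, not part of the notes) verifies |P(A) − P(B)| ≤ P(A∆B) for every pair of subsets of a four-point space under the uniform measure.

```python
# Brute-force check of |P(A) - P(B)| <= P(A delta B) on a four-point
# space with the uniform measure; the space is an illustrative choice.
from itertools import chain, combinations

Omega = {0, 1, 2, 3}

def P(S):
    """Uniform probability of a subset S of Omega."""
    return len(S) / len(Omega)

def subsets(s):
    """All subsets of s, as tuples."""
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

for A in map(set, subsets(Omega)):
    for B in map(set, subsets(Omega)):
        sym_diff = (A - B) | (B - A)          # A delta B
        assert abs(P(A) - P(B)) <= P(sym_diff) + 1e-12
```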

Exercise 1.5. If P is a countably additive probability measure, show that for any sequence An ∈ B,

P(∪_{n=1}^∞ An) ≤ ∑_{n=1}^∞ P(An).

Although we would like our 'probability' to be a countably additive probability measure on a σ-field B of subsets of a space Ω, it is not clear that there are plenty of such things. As a first small step, show the following.

Exercise 1.6. If {ωn : n ≥ 1} are distinct points in Ω and pn ≥ 0 are numbers with ∑_n pn = 1, then

P(A) = ∑_{n : ωn ∈ A} pn

defines a countably additive probability measure on the σ-field of all subsets of Ω. (This is still cheating because the measure P lives on a countable set.)
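To make the construction concrete, here is a small Python sketch (not part of the original notes) of the measure in Exercise 1.6; the point set and weights below are illustrative choices.

```python
# Sketch of the discrete measure in Exercise 1.6: P(A) is the sum of
# p_n over those omega_n lying in A.  Points and weights are
# illustrative, not from the text.

def prob(points, weights, A):
    """P(A) for the measure putting weight p_n at the point omega_n.

    Countable additivity holds because the series sum_n p_n converges
    absolutely, so its terms may be regrouped freely."""
    return sum(p for omega, p in zip(points, weights) if omega in A)

points = ["H", "T"]          # a fair coin: Omega = {H, T}
weights = [0.5, 0.5]

assert prob(points, weights, {"H"}) == 0.5
assert prob(points, weights, {"H", "T"}) == 1.0   # P(Omega) = 1
assert prob(points, weights, set()) == 0.0        # P(empty set) = 0
```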

Definition 1.3. A probability measure P on a field F is said to be countably additive on F if for any sequence An ∈ F with An ↓ Φ we have P(An) ↓ 0.

Exercise 1.7. Given any class F of subsets of Ω, show that there is a unique smallest σ-field B containing F.

Definition 1.4. The σ-field in the above exercise is called the σ-field gener-
ated by F .

1.2 Construction of Measures

The following theorem is important for the construction of countably additive probability measures. A detailed proof of this theorem, as well as of other results on measure and integration, can be found in [7], [3] or in any one of the many texts on real variables. In an effort to be complete we will sketch the standard proof.
Theorem 1.1. (Caratheodory Extension Theorem). Any countably additive probability measure P on a field F extends uniquely as a countably additive probability measure to the σ-field B generated by F.

Proof. The proof proceeds along the following steps:

Step 1. Define an object P*, called the outer measure, for all sets A:

P*(A) = inf_{∪_j Aj ⊃ A} ∑_j P(Aj)     (1.7)

where the infimum is taken over all countable collections {Aj} of sets from F that cover A. Without loss of generality we can assume that the {Aj} are disjoint. (Replace Aj by (∩_{i=1}^{j−1} Ai^c) ∩ Aj.)

Step 2. Show that P* has the following properties:

1. P* is countably sub-additive, i.e.

   P*(∪_j Aj) ≤ ∑_j P*(Aj).

2. For A ∈ F, P*(A) ≤ P(A). (Trivial.)

3. For A ∈ F, P*(A) ≥ P(A). (This needs the countable additivity of P on F.)

Step 3. Define a set E to be measurable if

P*(A) ≥ P*(A ∩ E) + P*(A ∩ E^c)

holds for all sets A, and establish the following properties for the class M of measurable sets: M is a σ-field and P* is a countably additive measure on it.

Step 4. Finally show that M ⊃ F. This implies that M ⊃ B and that P* is an extension of P from F to B.

Uniqueness is quite simple. Let P1 and P2 be two countably additive probability measures on a σ-field B that agree on a field F ⊂ B. Let us define A = {A : P1(A) = P2(A)}. Then A is a monotone class, i.e. if An ∈ A is increasing (decreasing) then ∪n An (∩n An) ∈ A. Uniqueness will follow from the following fact, left as an exercise.

Exercise 1.8. The smallest monotone class generated by a field is the same
as the σ-field generated by the field.
It now follows that A must contain the σ-field generated by F and that
proves uniqueness.
The extension theorem does not quite solve the problem of constructing countably additive probability measures; it reduces it to constructing them on fields. The following theorem is important in the theory of Lebesgue integrals and is very useful for the construction of countably additive probability measures on the real line. The proof will again only be sketched. The natural σ-field on which to define a probability measure on the line is the Borel σ-field, defined as the smallest σ-field containing all intervals; it includes in particular all open sets.
Let us consider the class of subsets of the real numbers I = {I_{a,b} : −∞ ≤ a < b ≤ ∞}, where I_{a,b} = {x : a < x ≤ b} if b < ∞, and I_{a,∞} = {x : a < x < ∞}. In other words, I is the collection of intervals that are left-open and right-closed. The class of sets that are finite disjoint unions of members of I is a field F, if the empty set is added to the class. If we are given a function F(x) on the real line which is nondecreasing, continuous from the right and satisfies

lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1,

we can define a finitely additive probability measure P by first defining

P(I_{a,b}) = F(b) − F(a)

for intervals and then extending it to F by defining it as the sum for disjoint unions from I. Let us note that the Borel σ-field B on the real line is the σ-field generated by F.
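The recipe P(I_{a,b}) = F(b) − F(a), extended by summation, can be sketched in a few lines of Python; this is an illustration only, with the logistic function standing in for an arbitrary distribution function F.

```python
# Sketch of the finitely additive measure built from a distribution
# function F: P((a, b]) = F(b) - F(a), extended by summing over a
# finite disjoint union of left-open right-closed intervals.
# The logistic F below is an illustrative choice of nondecreasing,
# right continuous function with limits 0 and 1.
import math

def F(x):
    return 1.0 / (1.0 + math.exp(-x))

def P(intervals):
    """Measure of a finite disjoint union of intervals (a, b]."""
    return sum(F(b) - F(a) for a, b in intervals)

# Finite additivity on adjacent intervals: P((a,b]) + P((b,c]) = P((a,c])
assert abs(P([(-1.0, 0.0), (0.0, 2.0)]) - P([(-1.0, 2.0)])) < 1e-12
```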

Theorem 1.2. (Lebesgue). P is countably additive on F if and only if F(x) is a right continuous function of x. Therefore for each right continuous nondecreasing function F(x) with F(−∞) = 0 and F(∞) = 1 there is a unique probability measure P on the Borel subsets of the line such that F(x) = P(I_{−∞,x}). Conversely, every countably additive probability measure P on the Borel subsets of the line comes from some F. The correspondence between P and F is one-to-one.

Proof. The only difficult part is to establish the countable additivity of P on F from the right continuity of F(·). Let Aj ∈ F and Aj ↓ Φ, the empty set. Let us assume P(Aj) ≥ δ > 0 for all j and then establish a contradiction.

Step 1. We take a large interval [−ℓ, ℓ] and replace Aj by Bj = Aj ∩ [−ℓ, ℓ]. Since |P(Aj) − P(Bj)| ≤ 1 − F(ℓ) + F(−ℓ), we can make the choice of ℓ large enough that P(Bj) ≥ δ/2. In other words we can assume without loss of generality that P(Aj) ≥ δ/2 and Aj ⊂ [−ℓ, ℓ] for some fixed ℓ < ∞.
Step 2. If

Aj = ∪_{i=1}^{kj} I_{a_{j,i}, b_{j,i}},

use the right continuity of F to replace Aj by Bj, which is again a union of left-open right-closed intervals with the same right end points, but with left end points moved ever so slightly to the right. Achieve this in such a way that

P(Aj − Bj) ≤ δ / (10 · 2^j)

for all j.
Step 3. Define Cj to be the closure of Bj, obtained by adding to it the left end points of the intervals making up Bj. Let Ej = ∩_{i=1}^{j} Bi and Dj = ∩_{i=1}^{j} Ci. Then (i) the sequence Dj of sets is decreasing, (ii) each Dj is a closed bounded set, and (iii) since Aj ⊃ Dj and Aj ↓ Φ, it follows that Dj ↓ Φ. Because Dj ⊃ Ej and P(Ej) ≥ δ/2 − ∑_j P(Aj − Bj) ≥ 4δ/10, each Dj is nonempty, and this violates the finite intersection property: every decreasing sequence of nonempty closed bounded sets on the real line has a nonempty intersection, i.e. has at least one common point.
The rest of the proof is left as an exercise.

The function F is called the distribution function corresponding to the probability measure P.

Example 1.1. Suppose x1, x2, · · · , xn, · · · is a sequence of points and we have probabilities pn at these points. Then for the discrete measure

P(A) = ∑_{n : xn ∈ A} pn

we have the distribution function

F(x) = ∑_{n : xn ≤ x} pn

that increases only by jumps, the jump at xn being pn. The points {xn} themselves can be discrete like the integers or dense like the rationals.
Example 1.2. If f(x) is a nonnegative integrable function with integral 1, i.e. ∫_{−∞}^{∞} f(y) dy = 1, then F(x) = ∫_{−∞}^{x} f(y) dy is a distribution function which is continuous. In this case f is the density of the measure P and can be calculated as f(x) = F′(x).

There are (messy) examples of F that are continuous but do not come from any density. More on this later.
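The relation between a density and its distribution function in Example 1.2 can be illustrated numerically; the sketch below (not part of the notes) uses the standard exponential density as an example and recovers f as the derivative of F.

```python
# Numerical illustration of Example 1.2: F(x) is the integral of the
# density f up to x, and f can be recovered as F'(x).  The standard
# exponential density f(y) = e^{-y} on [0, inf) is an illustrative choice.
import math

def f(y):
    return math.exp(-y) if y >= 0 else 0.0

def F(x):
    # closed form of the integral of f from -inf to x
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

# A central difference approximation of F'(x) should match f(x).
x, h = 1.0, 1e-6
assert abs((F(x + h) - F(x - h)) / (2 * h) - f(x)) < 1e-6
```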

Exercise 1.9. Let us try to construct the Lebesgue measure on the rationals Q ⊂ [0, 1]. We would like to have

P[I_{a,b}] = b − a

for all rationals 0 ≤ a ≤ b ≤ 1. Show that this is impossible, by showing that P[{q}] = 0 for the set {q} containing the single rational q, while P[Q] = P[∪_{q∈Q} {q}] = 1. Where does the earlier proof break down?

Once we have a countably additive probability measure P on a space (Ω, Σ), we will call the triple (Ω, Σ, P) a probability space.

1.3 Integration
An important notion is that of a random variable or a measurable function.

Definition 1.5. A random variable or measurable function is a map f : Ω → R, i.e. a real valued function f(ω) on Ω such that for every Borel set B ⊂ R, f^{-1}(B) = {ω : f(ω) ∈ B} is a measurable subset of Ω, i.e. f^{-1}(B) ∈ Σ.

Exercise 1.10. It is enough to check the requirement for sets B ⊂ R that are intervals, or even just sets of the form (−∞, x] for −∞ < x < ∞.

A function that is measurable and satisfies |f(ω)| ≤ M for all ω ∈ Ω, for some finite M, is called a bounded measurable function.

The following statements are the essential steps in developing an integration theory. Details can be found in any book on real variables.
1. If A ∈ Σ, the indicator function of A, defined as

   1_A(ω) = 1 if ω ∈ A, and 0 if ω ∉ A,

   is bounded and measurable.

2. Sums, products, limits, compositions and reasonable elementary operations like min and max performed on measurable functions lead to measurable functions.

3. If {Aj : 1 ≤ j ≤ n} is a finite disjoint partition of Ω into measurable sets, the function f(ω) = ∑_j cj 1_{Aj}(ω) is a measurable function and is referred to as a 'simple' function.

4. Any bounded measurable function f is a uniform limit of simple functions. To see this, if f is bounded by M, divide [−M, M] into n subintervals Ij of length 2M/n with midpoints cj. Let

   Aj = f^{-1}(Ij) = {ω : f(ω) ∈ Ij}

   and

   fn = ∑_{j=1}^{n} cj 1_{Aj}.

   Clearly fn is simple, sup_ω |fn(ω) − f(ω)| ≤ M/n, and we are done.
5. For simple functions f = ∑_j cj 1_{Aj}, the integral ∫ f(ω) dP is defined to be ∑_j cj P(Aj). It enjoys the following properties:

   (a) If f and g are simple, so is any linear combination af + bg for real constants a and b, and

       ∫ (af + bg) dP = a ∫ f dP + b ∫ g dP.
   (b) If f is simple, so is |f|, and |∫ f dP| ≤ ∫ |f| dP ≤ sup_ω |f(ω)|.

6. If fn is a sequence of simple functions converging to f uniformly, then an = ∫ fn dP is a Cauchy sequence of real numbers and therefore has a limit a as n → ∞. The integral ∫ f dP of f is defined to be this limit a. One can verify that a depends only on f and not on the sequence fn chosen to approximate f.

7. Now the integral is defined for all bounded measurable functions and enjoys the following properties.

   (a) If f and g are bounded measurable functions and a, b are real constants, then the linear combination af + bg is again a bounded measurable function, and

       ∫ (af + bg) dP = a ∫ f dP + b ∫ g dP.

   (b) If f is a bounded measurable function, so is |f|, and |∫ f dP| ≤ ∫ |f| dP ≤ sup_ω |f(ω)|.

   (c) In fact a slightly stronger inequality is true. For any bounded measurable f,

       ∫ |f| dP ≤ P({ω : |f(ω)| > 0}) sup_ω |f(ω)|.

   (d) If f is a bounded measurable function and A is a measurable set, one defines

       ∫_A f(ω) dP = ∫ 1_A(ω) f(ω) dP,

       and we can write, for any measurable set A,

       ∫ f dP = ∫_A f dP + ∫_{A^c} f dP.
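Step 4 above, the uniform approximation of a bounded measurable function by simple functions, can be sketched in Python; this is an illustration under the stated construction, not part of the notes, with sin as an example of a function bounded by M = 1.

```python
# Sketch of step 4: partition [-M, M] into n subintervals of length
# 2M/n and replace f(omega) by the midpoint c_j of the subinterval I_j
# containing it; the resulting simple function is within M/n of f.
import math

def simple_approx(f, M, n):
    """Return the simple function f_n built from f as in step 4."""
    width = 2 * M / n
    def fn(omega):
        j = min(int((f(omega) + M) / width), n - 1)   # index of I_j
        return -M + (j + 0.5) * width                 # midpoint c_j
    return fn

f = math.sin                        # bounded by M = 1
fn = simple_approx(f, M=1.0, n=100)
err = max(abs(fn(x) - f(x)) for x in [0.1 * k for k in range(-30, 31)])
assert err <= 1.0 / 100 + 1e-12     # sup |f_n - f| <= M/n
```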

In addition to uniform convergence there are other, weaker notions of convergence.

Definition 1.6. A sequence fn of functions is said to converge to a function f everywhere or pointwise if

lim_{n→∞} fn(ω) = f(ω)

for every ω ∈ Ω.

In dealing with sequences of functions on a space that has a measure defined on it, it often does not matter if the sequence fails to converge on a set of points that is insignificant. For example, if we are dealing with the Lebesgue measure on the interval [0, 1] and fn(x) = x^n, then fn(x) → 0 for all x except x = 1. A single point, being an interval of length 0, should be insignificant for the Lebesgue measure.

Definition 1.7. A sequence fn of measurable functions is said to converge to a measurable function f almost everywhere or almost surely (usually abbreviated as a.e.) if there exists a measurable set N with P(N) = 0 such that

lim_{n→∞} fn(ω) = f(ω)

for every ω ∈ N^c.

Note that almost everywhere convergence is always relative to a probability measure. Another notion of convergence is the following:

Definition 1.8. A sequence fn of measurable functions is said to converge to a measurable function f 'in measure' or 'in probability' if

lim_{n→∞} P[ω : |fn(ω) − f(ω)| ≥ ε] = 0

for every ε > 0.

Let us examine these notions in the context of indicator functions of sets, fn(ω) = 1_{An}(ω). As soon as A ≠ B, sup_ω |1_A(ω) − 1_B(ω)| = 1, so that uniform convergence never really takes place. On the other hand, one can verify that 1_{An}(ω) → 1_A(ω) for every ω if and only if the two sets

lim sup_n An = ∩_n ∪_{m≥n} Am

and

lim inf_n An = ∪_n ∩_{m≥n} Am

both coincide with A. Finally, 1_{An}(ω) → 1_A(ω) in measure if and only if

lim_{n→∞} P(An ∆ A) = 0,

where for any two sets A and B the symmetric difference A∆B is defined as A∆B = (A ∩ B^c) ∪ (A^c ∩ B) = (A ∪ B) ∩ (A ∩ B)^c. It is the set of points that belong to one of the two sets but not to both. For instance, 1_{An} → 0 in measure if and only if P(An) → 0.
Exercise 1.11. There is a difference between almost everywhere convergence and convergence in measure; the first is really stronger. Consider the interval [0, 1] and divide it successively into 2, 3, 4, · · · parts, and enumerate the intervals in succession. That is, I1 = [0, 1/2], I2 = [1/2, 1], I3 = [0, 1/3], I4 = [1/3, 2/3], I5 = [2/3, 1], and so on. If fn(x) = 1_{In}(x), it is easy to check that fn tends to 0 in measure but not almost everywhere.
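The enumeration in Exercise 1.11 is easy to experiment with; the Python sketch below (an illustration, not part of the notes) generates the intervals and shows both halves of the phenomenon: the interval lengths shrink to 0, while a fixed point keeps landing in infinitely many of the intervals.

```python
# The "typewriter" sequence of Exercise 1.11: enumerate the intervals
# [(j-1)/k, j/k] for k = 2, 3, 4, ...  Their lengths shrink to 0, which
# gives convergence of f_n = 1_{I_n} to 0 in measure, yet every x in
# [0, 1] lies in infinitely many of them, so f_n(x) converges nowhere.

def intervals():
    """Yield the subintervals of [0, 1] in the order of Exercise 1.11."""
    k = 2
    while True:
        for j in range(1, k + 1):
            yield ((j - 1) / k, j / k)
        k += 1

gen = intervals()
first = [next(gen) for _ in range(5)]
assert first == [(0.0, 0.5), (0.5, 1.0), (0.0, 1 / 3), (1 / 3, 2 / 3), (2 / 3, 1.0)]

# The fixed point x = 0.25 is covered again and again as k grows.
hits = 0
for _ in range(1000):
    a, b = next(gen)
    if a <= 0.25 <= b:
        hits += 1
assert hits > 10
```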

Exercise 1.12. But the following statement is true. If fn → f as n → ∞ in measure, then there is a subsequence fnj such that fnj → f almost everywhere as j → ∞.

Exercise 1.13. If {An} is a sequence of measurable sets, then in order that lim sup_{n→∞} An = Φ, it is necessary and sufficient that

lim_{n→∞} P[∪_{m=n}^∞ Am] = 0.

In particular it is sufficient that ∑_n P[An] < ∞. Is it necessary?

Lemma 1.3. If fn → f almost everywhere, then fn → f in measure.

Proof. fn → f outside N is equivalent to

∩_n ∪_{m≥n} [ω : |fm(ω) − f(ω)| ≥ ε] ⊂ N

for every ε > 0. In particular, by countable additivity,

P[ω : |fn(ω) − f(ω)| ≥ ε] ≤ P[∪_{m≥n} [ω : |fm(ω) − f(ω)| ≥ ε]] → 0

as n → ∞, and we are done.



Exercise 1.14. Countable additivity is important for this result. On a finitely additive probability space it could be that fn → f everywhere and still fn does not converge to f in measure. In fact, show that if every sequence fn → 0 that converges everywhere also converges in probability, then the measure is countably additive.

Theorem 1.4. (Bounded Convergence Theorem). If the sequence {fn} of measurable functions is uniformly bounded and if fn → f in measure as n → ∞, then lim_{n→∞} ∫ fn dP = ∫ f dP.

Proof. Since

|∫ fn dP − ∫ f dP| = |∫ (fn − f) dP| ≤ ∫ |fn − f| dP,

we need only prove that if fn → 0 in measure and |fn| ≤ M, then ∫ |fn| dP → 0. To see this,

∫ |fn| dP = ∫_{|fn|≤ε} |fn| dP + ∫_{|fn|>ε} |fn| dP ≤ ε + M P[ω : |fn(ω)| > ε],

and taking limits,

lim sup_{n→∞} ∫ |fn| dP ≤ ε;

since ε > 0 is arbitrary, we are done.

The bounded convergence theorem is the essence of countable additivity. Let us look at the example of fn(x) = x^n on 0 ≤ x ≤ 1 with Lebesgue measure. Clearly fn(x) → 0 a.e. and therefore in measure. While the convergence is not uniform, 0 ≤ x^n ≤ 1 for all n and x, and so the bounded convergence theorem applies. In fact

∫_0^1 x^n dx = 1/(n+1) → 0.

However, if we replace x^n by n x^n, fn(x) still goes to 0 a.e., but the sequence is no longer uniformly bounded and the integral does not go to 0.
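The two integrals above can be checked numerically; this Python sketch (an illustration, not from the notes) approximates them with a midpoint Riemann sum and shows the limit exchange succeeding for x^n but failing for n x^n.

```python
# Numerical check of the example above: the integral of x^n on [0, 1]
# tends to 0 (bounded convergence applies), while the integral of
# n * x^n tends to 1 even though n * x^n -> 0 a.e. -- the sequence is
# not uniformly bounded, so the limit cannot be moved inside.

def integral(g, N=100_000):
    """Midpoint Riemann sum of g over [0, 1]."""
    return sum(g((k + 0.5) / N) for k in range(N)) / N

n = 50
assert abs(integral(lambda x: x ** n) - 1 / (n + 1)) < 1e-4       # -> 0
assert abs(integral(lambda x: n * x ** n) - n / (n + 1)) < 1e-2   # -> 1
```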
We now proceed to define integrals of nonnegative measurable functions.

Definition 1.9. If f is a nonnegative measurable function, we define

∫ f dP = sup { ∫ g dP : g bounded, 0 ≤ g ≤ f }.

An important result is

Theorem 1.5. (Fatou's Lemma). If for each n ≥ 1, fn ≥ 0 is measurable and fn → f in measure as n → ∞, then

∫ f dP ≤ lim inf_{n→∞} ∫ fn dP.

Proof. Suppose g is bounded and satisfies 0 ≤ g ≤ f. Then the sequence hn = fn ∧ g is uniformly bounded and

hn → h = f ∧ g = g.

Therefore, by the bounded convergence theorem,

∫ g dP = lim_{n→∞} ∫ hn dP.

Since ∫ hn dP ≤ ∫ fn dP for every n, it follows that

∫ g dP ≤ lim inf_{n→∞} ∫ fn dP.

As g satisfying 0 ≤ g ≤ f is arbitrary, we are done.

Corollary 1.6. (Monotone Convergence Theorem). If for a sequence {fn} of nonnegative functions we have fn ↑ f monotonically, then

∫ fn dP → ∫ f dP as n → ∞.

Proof. Obviously ∫ fn dP ≤ ∫ f dP, and the other half follows from Fatou's lemma.

Now we try to define integrals of arbitrary measurable functions. A nonnegative measurable function is said to be integrable if ∫ f dP < ∞. A measurable function f is said to be integrable if |f| is integrable, and we define ∫ f dP = ∫ f^+ dP − ∫ f^− dP, where f^+ = f ∨ 0 and f^− = −(f ∧ 0) are the positive and negative parts of f. The integral has the following properties.

1. It is linear. If f and g are integrable, so is af + bg for any two real constants a and b, and ∫ (af + bg) dP = a ∫ f dP + b ∫ g dP.

2. |∫ f dP| ≤ ∫ |f| dP for every integrable f.

3. If f = 0 except on a set N of measure 0, then f is integrable and ∫ f dP = 0. In particular, if f = g almost everywhere, then ∫ f dP = ∫ g dP.

Theorem 1.7. (Jensen's Inequality). If φ(x) is a convex function of x and f(ω) and φ(f(ω)) are integrable, then

∫ φ(f(ω)) dP ≥ φ( ∫ f(ω) dP ).     (1.8)

Proof. We have seen the inequality already for φ(x) = |x|. The proof is quite simple. We note that any convex function φ can be represented as the supremum of a collection of affine linear functions:

φ(x) = sup_{(a,b)∈E} {ax + b}.     (1.9)

It is clear that if (a, b) ∈ E, then af(ω) + b ≤ φ(f(ω)), and on integration this yields am + b ≤ E[φ(f(ω))], where m = E[f(ω)]. Since this is true for every (a, b) ∈ E, in view of the representation (1.9) our theorem follows.
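Inequality (1.8) is easy to verify on a discrete distribution; the Python sketch below (an illustration, not part of the notes) checks it for the convex function φ(x) = x², with illustrative values and weights.

```python
# Sanity check of Jensen's inequality (1.8) for the convex function
# phi(x) = x^2 on a small discrete distribution; the values and
# weights are illustrative, not from the text.

values = [-1.0, 0.0, 2.0, 5.0]
probs = [0.25, 0.25, 0.25, 0.25]

mean = sum(v * p for v, p in zip(values, probs))          # integral of f dP
mean_sq = sum(v * v * p for v, p in zip(values, probs))   # integral of phi(f) dP

# phi(integral of f dP) <= integral of phi(f) dP
assert mean ** 2 <= mean_sq
```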

Another important theorem is

Theorem 1.8. (The Dominated Convergence Theorem). If for some sequence {fn} of measurable functions we have fn → f in measure and |fn(ω)| ≤ g(ω) for all n and ω, for some integrable function g, then ∫ fn dP → ∫ f dP as n → ∞.

Proof. g + fn and g − fn are nonnegative and converge in measure to g + f and g − f respectively. By Fatou's lemma,

lim inf_{n→∞} ∫ (g + fn) dP ≥ ∫ (g + f) dP.

Since ∫ g dP is finite, we can subtract it from both sides and get

lim inf_{n→∞} ∫ fn dP ≥ ∫ f dP.

Working the same way with g − fn yields

lim sup_{n→∞} ∫ fn dP ≤ ∫ f dP,

and we are done.

Exercise 1.15. Take the unit interval with the Lebesgue measure and define fn(x) = n^α 1_{[0, 1/n]}(x). Clearly fn(x) → 0 for x ≠ 0. On the other hand, ∫ fn(x) dx = n^{α−1} → 0 if and only if α < 1. What is g(x) = sup_n fn(x), and when is g integrable?

If h(ω) = f(ω) + i g(ω) is a complex valued measurable function with real and imaginary parts f(ω) and g(ω) that are integrable, we define

∫ h(ω) dP = ∫ f(ω) dP + i ∫ g(ω) dP.

Exercise 1.16. Show that for any complex function h(ω) = f(ω) + i g(ω) with measurable f and g, |h(ω)| is integrable if and only if |f| and |g| are integrable, and we then have

|∫ h(ω) dP| ≤ ∫ |h(ω)| dP.

1.4 Transformations
A measurable space (Ω, B) is a set Ω together with a σ-field B of subsets of
Ω.

Definition 1.10. Given two measurable spaces (Ω1, B1) and (Ω2, B2), a mapping or transformation T : Ω1 → Ω2, i.e. a function ω2 = T(ω1) that assigns to each point ω1 ∈ Ω1 a point ω2 = T(ω1) ∈ Ω2, is said to be measurable if for every measurable set A ∈ B2 the inverse image

T^{-1}(A) = {ω1 : T(ω1) ∈ A} ∈ B1.

Exercise 1.17. Show that, in the above definition, it is enough to verify the
property for A ∈ A where A is any class of sets that generates the σ-field B2 .

If T is a measurable map from (Ω1, B1) into (Ω2, B2) and P is a probability measure on (Ω1, B1), the induced probability measure Q on (Ω2, B2) is defined by

Q(A) = P(T^{-1}(A)) for A ∈ B2.     (1.10)

Exercise 1.18. Verify that Q indeed does define a probability measure on (Ω2, B2).

Q is called the induced measure and is denoted by Q = P T^{-1}.

Theorem 1.9. If f : Ω2 → R is a real valued measurable function on Ω2, then g(ω1) = f(T(ω1)) is a measurable real valued function on (Ω1, B1). Moreover, g is integrable with respect to P if and only if f is integrable with respect to Q, and

∫_{Ω2} f(ω2) dQ = ∫_{Ω1} g(ω1) dP.     (1.11)

Proof. If f(ω2) = 1_A(ω2) is the indicator function of a set A ∈ B2, the claim in equation (1.11) is in fact the definition of measurability and of the induced measure. We see, by linearity, that the claim extends easily from indicator functions to simple functions. By uniform limits, the claim can now be extended to bounded measurable functions. Monotone limits then extend it to nonnegative functions. By considering the positive and negative parts separately, we are done.

A measurable transformation is just a generalization of the concept of a random variable introduced in Section 1.3. We can either think of a random variable as a special case of a measurable transformation, where the target space is the real line, or think of a measurable transformation as a random variable with values in an arbitrary target space. The induced measure Q = P T^{-1} is called the distribution of the random variable T under P. In particular, if T takes real values, Q is a probability distribution on R.

Exercise 1.19. When T is real valued, show that

∫ T(ω) dP = ∫ x dQ.

When F = (f1, f2, · · · , fn) takes values in R^n, the induced distribution Q on R^n is called the joint distribution of the n random variables f1, f2, · · · , fn.
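The identity in Exercise 1.19 can be seen directly on a finite space; the Python sketch below (an illustration, not from the notes) builds the induced measure Q = P T^{-1} for a hypothetical map on a fair die and compares the two integrals.

```python
# Illustration of Exercise 1.19 on a finite space: for Q = P T^{-1},
# the integral of T under P equals the integral of x under Q.
# The space (a fair die) and the map T are illustrative choices.
from collections import Counter

Omega = [1, 2, 3, 4, 5, 6]
P = {w: 1 / 6 for w in Omega}

def T(w):
    return w % 2                           # T takes values in {0, 1}

# induced measure Q(A) = P(T^{-1}(A)) on the target space {0, 1}
Q = Counter()
for w in Omega:
    Q[T(w)] += P[w]

lhs = sum(T(w) * P[w] for w in Omega)      # integral of T dP
rhs = sum(x * q for x, q in Q.items())     # integral of x dQ
assert abs(lhs - rhs) < 1e-12
```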

Exercise 1.20. If T1 is a measurable map from (Ω1, B1) into (Ω2, B2) and T2 is a measurable map from (Ω2, B2) into (Ω3, B3), then show that T = T2 ∘ T1 is a measurable map from (Ω1, B1) into (Ω3, B3). If P is a probability measure on (Ω1, B1), then on (Ω3, B3) the two measures P T^{-1} and (P T1^{-1}) T2^{-1} are identical.

1.5 Product Spaces

Given two sets Ω1 and Ω2, the Cartesian product Ω = Ω1 × Ω2 is the set of pairs (ω1, ω2) with ω1 ∈ Ω1 and ω2 ∈ Ω2. If Ω1 and Ω2 come with σ-fields B1 and B2 respectively, we can define a natural σ-field B on Ω as the σ-field generated by sets (measurable rectangles) of the form A1 × A2 with A1 ∈ B1 and A2 ∈ B2. This σ-field will be called the product σ-field.

Exercise 1.21. Show that the sets that are finite disjoint unions of measurable rectangles constitute a field F.

Definition 1.11. The product σ-field B is the σ-field generated by the field F.

Given two probability measures P1 and P2 on (Ω1 , B1 ) and (Ω2 , B2 ) re-


spectively we try to define on the product space (Ω, B) a probability measure
P by defining for a measurable rectangle A = A1 × A2
P (A1 × A2 ) = P1 (A1 ) × P2 (A2 )
and extending it to the field F of finite disjoint unions of measurable rect-
angles as the obvious sum.
Exercise 1.22. If E ∈ F has two representations as disjoint finite unions of
measurable rectangles

    E = ∪_i (A_1^i × A_2^i) = ∪_j (B_1^j × B_2^j)

then

    Σ_i P_1(A_1^i) × P_2(A_2^i) = Σ_j P_1(B_1^j) × P_2(B_2^j),

so that P(E) is well defined. P is a finitely additive probability measure on
F.

Lemma 1.10. The measure P is countably additive on the field F .

Proof. For any set E ∈ F let us define the section E_{ω_2} as

    E_{ω_2} = {ω_1 : (ω_1, ω_2) ∈ E}.        (1.12)

Then P_1(E_{ω_2}) is a measurable function of ω_2 (it is in fact a simple function)
and

    P(E) = ∫_{Ω_2} P_1(E_{ω_2}) dP_2.        (1.13)

Now let E_n ∈ F ↓ Φ, the empty set. Then it is easy to verify that E_{n,ω_2}
defined by

    E_{n,ω_2} = {ω_1 : (ω_1, ω_2) ∈ E_n}

satisfies E_{n,ω_2} ↓ Φ for each ω_2 ∈ Ω_2. From the countable additivity of P_1 we
conclude that P_1(E_{n,ω_2}) → 0 for each ω_2 ∈ Ω_2 and since 0 ≤ P_1(E_{n,ω_2}) ≤ 1
for n ≥ 1, it follows from equation (1.13) and the bounded convergence
theorem that

    P(E_n) = ∫_{Ω_2} P_1(E_{n,ω_2}) dP_2 → 0,

establishing the countable additivity of P on F .
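The section formula (1.13) can be checked on a concrete set (a hypothetical example of ours, with both factors carrying Lebesgue measure on [0, 1]): for a disjoint union of two rectangles, integrating the sectional measure reproduces the sum of the rectangle measures.

```python
# P1 = P2 = Lebesgue measure on [0, 1]; E is the disjoint union of two
# measurable rectangles (an illustrative choice, not from the text):
#   E = [0, 0.5] x [0, 0.5]  union  (0.5, 1] x [0, 1]

def section_measure(omega2):
    # P1(E_{omega2}) = Lebesgue measure of {omega1 : (omega1, omega2) in E}.
    # It is a simple function of omega2, as asserted in the proof:
    # the section is [0, 1] when omega2 <= 0.5, and (0.5, 1] otherwise.
    return 1.0 if omega2 <= 0.5 else 0.5

# P(E) = integral of P1(E_{omega2}) dP2, here by a midpoint Riemann sum.
n = 10_000
p_e = sum(section_measure((k + 0.5) / n) for k in range(n)) / n

# Direct computation: sum of the two rectangle measures.
direct = 0.5 * 0.5 + 0.5 * 1.0
```

Both computations give 0.75.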

By an application of the Caratheodory extension theorem we conclude that


P extends uniquely as a countably additive measure to the σ-field B (product
σ-field) generated by F . We will call this the Product Measure P .

Corollary 1.11. For any A ∈ B, if we denote by A_{ω_1} and A_{ω_2} the respective
sections

    A_{ω_1} = {ω_2 : (ω_1, ω_2) ∈ A}    and    A_{ω_2} = {ω_1 : (ω_1, ω_2) ∈ A}

then the functions P_1(A_{ω_2}) and P_2(A_{ω_1}) are measurable and

    P(A) = ∫ P_1(A_{ω_2}) dP_2 = ∫ P_2(A_{ω_1}) dP_1.

In particular for a measurable set A, P(A) = 0 if and only if for almost all
ω_1 with respect to P_1 the sections A_{ω_1} have measure 0, or equivalently for
almost all ω_2 with respect to P_2 the sections A_{ω_2} have measure 0.

Proof. The assertion is clearly valid if A is a rectangle of the form A_1 × A_2
with A_1 ∈ B_1 and A_2 ∈ B_2. If A ∈ F , then it is a finite disjoint union of such
rectangles and the assertion extends to such a set by simple addition.
Clearly, by the monotone convergence theorem, the class of sets for which
the assertion is valid is a monotone class, and since it contains the field F it
also contains the σ-field B generated by the field F .

Warning. It is possible that a set A may not be measurable with respect
to the product σ-field, but nevertheless the sections A_{ω_1} and A_{ω_2} are all
measurable, P_2(A_{ω_1}) and P_1(A_{ω_2}) are measurable functions, but

    ∫ P_1(A_{ω_2}) dP_2 ≠ ∫ P_2(A_{ω_1}) dP_1.

In fact there is a rather nasty example where P_1(A_{ω_2}) is identically 1 whereas
P_2(A_{ω_1}) is identically 0.
The next result concerns the equality of the double integral, (i.e. the
integral with respect to the product measure) and the repeated integrals in
any order.

Theorem 1.12. (Fubini’s Theorem). Let f (ω) = f (ω_1, ω_2) be a measurable
function of ω on (Ω, B). Then f can be considered as a function of ω_2
for each fixed ω_1 or the other way around. The functions g_{ω_1}(·) and h_{ω_2}(·)
defined respectively on Ω_2 and Ω_1 by

    g_{ω_1}(ω_2) = h_{ω_2}(ω_1) = f (ω_1, ω_2)

are measurable for each ω_1 and ω_2. If f is integrable then the functions
g_{ω_1}(ω_2) and h_{ω_2}(ω_1) are integrable for almost all ω_1 and ω_2 respectively. Their
integrals

    G(ω_1) = ∫_{Ω_2} g_{ω_1}(ω_2) dP_2    and    H(ω_2) = ∫_{Ω_1} h_{ω_2}(ω_1) dP_1

are measurable, finite almost everywhere and integrable with respect to P_1 and
P_2 respectively. Finally

    ∫ f (ω_1, ω_2) dP = ∫ G(ω_1) dP_1 = ∫ H(ω_2) dP_2.

Conversely, for a nonnegative measurable function f , if either G or H, which
are always measurable, has a finite integral so does the other, and f is integrable
with its integral being equal to either of the repeated integrals, namely
the integrals of G and H.

Proof. The proof follows the standard pattern. It is a restatement of the
earlier corollary if f is the indicator function of a measurable set A. By
linearity it is true for simple functions, and by passing to uniform limits it is
true for bounded measurable functions f . By monotone limits it is true for
nonnegative functions, and finally, by taking the positive and negative parts
separately, it is true for any arbitrary integrable function f .

Warning. The following could happen: f is a measurable function, taking
both positive and negative values, that is not integrable. Both the
repeated integrals exist and are unequal. The example is not hard.

Exercise 1.23. Construct a measurable function f (x, y) which is not integrable,
on the product [0, 1] × [0, 1] of two copies of the unit interval with
Lebesgue measure, such that the repeated integrals make sense and are unequal,
i.e.

    ∫_0^1 dx ∫_0^1 f (x, y) dy ≠ ∫_0^1 dy ∫_0^1 f (x, y) dx.
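A standard candidate for this exercise (our choice; the text leaves it open) is f(x, y) = (x² − y²)/(x² + y²)², which is not absolutely integrable near the origin. For fixed x > 0 the inner integral in y has the closed form y/(x² + y²) evaluated between 0 and 1, and symmetrically in the other order; the two repeated integrals come out to π/4 and −π/4.

```python
import math

# f(x, y) = (x^2 - y^2)/(x^2 + y^2)^2.  Antiderivatives (checked by hand):
#   int_0^1 f(x, y) dy = [y/(x^2 + y^2)]_{y=0}^{1} =  1/(1 + x^2)  for x > 0
#   int_0^1 f(x, y) dx = [-x/(x^2 + y^2)]_{x=0}^{1} = -1/(1 + y^2) for y > 0

def inner_dy(x):
    return 1.0 / (1.0 + x * x)

def inner_dx(y):
    return -1.0 / (1.0 + y * y)

# Outer integrals by midpoint Riemann sums over [0, 1].
n = 100_000
int_dx_dy = sum(inner_dy((k + 0.5) / n) for k in range(n)) / n
int_dy_dx = sum(inner_dx((k + 0.5) / n) for k in range(n)) / n
```

The two repeated integrals differ, which is consistent with Fubini's theorem only because f is not integrable on the square.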

1.6 Distributions and Expectations


Let us recall that a triplet (Ω, B, P ) is a Probability Space, if Ω is a set, B is
a σ-field of subsets of Ω and P is a (countably additive) probability measure
on B. A random variable X is a real valued measurable function on (Ω, B).
Given such a function X it induces a probability distribution α on the Borel
subsets of the line α = P X −1 . The distribution function F (x) corresponding
to α is obviously
F (x) = α((−∞, x]) = P [ ω : X(ω) ≤ x ].
The measure α is called the distribution of X and F (x) is called the distribution
function of X. If g is a measurable function of the real variable x, then
Y (ω) = g(X(ω)) is again a random variable and its distribution β = P Y^{-1}
can be obtained as β = α g^{-1} from α. The Expectation or mean of a random
variable is defined if it is integrable, and

    E[X] = E^P[X] = ∫ X(ω) dP.

By the change of variables formula (Exercise 3.3) it can be obtained directly
from α as

    E[X] = ∫ x dα.

Here we are taking advantage of the fact that on the real line x is a very
special real valued function. The value of the integral in this context is
referred to as the expectation or mean of α. Of course it exists if and only
if

    ∫ |x| dα < ∞

and

    |∫ x dα| ≤ ∫ |x| dα.

Similarly

    E[g(X)] = ∫ g(X(ω)) dP = ∫ g(x) dα

and anything concerning X can be calculated from α. The statement "X is a
random variable with distribution α" has to be interpreted in the sense that
somewhere in the background there is a Probability Space and a random
variable X on it, which has α for its distribution. Usually only α matters
and the underlying (Ω, B, P ) never emerges from the background, and in a
pinch we can always say Ω is the real line, B are the Borel sets, P is nothing
but α and the random variable is X(x) = x.
Some other related quantities are

    Var(X) = σ²(X) = E[X²] − [E[X]]².        (1.14)

Var(X) is called the variance of X.

Exercise 1.24. Show that, if it is defined, Var(X) is always nonnegative and
Var(X) = 0 if and only if for some value a, which is necessarily equal to
E[X], P[X = a] = 1.

Somewhat more generally we can consider a measurable mapping X =
(X_1, · · · , X_n) of a probability space (Ω, B, P ) into R^n as a vector of n random
variables X_1(ω), X_2(ω), · · · , X_n(ω). These are called random vectors or vector
valued random variables and the induced distribution α = P X^{-1} on R^n is
called the distribution of X or the joint distribution of (X_1, · · · , X_n). If we
denote by π_i the coordinate maps (x_1, · · · , x_n) → x_i from R^n → R, then

    α_i = α π_i^{-1} = P X_i^{-1}

are called the marginals of α.
The covariance between two random variables X and Y is defined as
Cov (X, Y ) = E[(X − E(X))(Y − E(Y ))] = E[XY ] − E[X]E[Y ]. (1.15)

Exercise 1.25. If X_1, · · · , X_n are n random variables, the matrix

    C_{i,j} = Cov(X_i, X_j)

is called the covariance matrix. Show that it is a symmetric positive semi-definite
matrix. Is every positive semi-definite matrix the covariance matrix
of some random vector?
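The positive semi-definiteness can be seen empirically (a sketch of ours; the random vector below is an arbitrary illustrative choice): for sample data, the matrix E[XX^T] − mm^T is a Gram matrix of centered observations, so every quadratic form v^T C v is nonnegative.

```python
import random

random.seed(1)

# A hypothetical random vector (X1, X2, X3) = (Z1, Z1 + Z2, Z2) built from
# independent uniform variables, sampled n times.
n = 20_000
data = []
for _ in range(n):
    z1, z2 = random.random(), random.random()
    data.append((z1, z1 + z2, z2))

means = [sum(row[i] for row in data) / n for i in range(3)]

def cov(i, j):
    # Empirical version of Cov(Xi, Xj) = E[Xi Xj] - E[Xi] E[Xj].
    return sum(row[i] * row[j] for row in data) / n - means[i] * means[j]

C = [[cov(i, j) for j in range(3)] for i in range(3)]

def quad(v):
    # Quadratic form v^T C v; nonnegativity is the PSD property of Exercise 1.25.
    return sum(v[i] * C[i][j] * v[j] for i in range(3) for j in range(3))
```

Checking `quad` on several vectors v gives values ≥ 0 (up to floating-point rounding), and C is symmetric by construction.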

Exercise 1.26. The Riemann-Stieltjes integral uses the distribution function
directly to define ∫_{−∞}^{∞} g(x) dF(x), where g is a bounded continuous function
and F is a distribution function. It is defined as the limit as N → ∞ of the sums

    Σ_{j=0}^{N} g(x_j^N)[F(a_{j+1}^N) − F(a_j^N)],    x_j^N ∈ [a_j^N, a_{j+1}^N],

where −∞ < a_0^N < a_1^N < · · · < a_N^N < a_{N+1}^N < ∞ is a partition of the finite
interval [a_0^N, a_{N+1}^N], and the limit is taken in such a way that a_0^N → −∞,
a_{N+1}^N → +∞ and the oscillation of g in any [a_j^N, a_{j+1}^N] goes to 0. Show that
if P is the measure corresponding to F then

    ∫_{−∞}^{∞} g(x) dF(x) = ∫_R g(x) dP.
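A Riemann-Stieltjes sum of this kind is easy to compute directly (an illustrative choice of ours, not from the text): with F(x) = 1 − e^{−x}, the exponential distribution function, and g(x) = e^{−x}, the integral is ∫_0^∞ e^{−2x} dx = 1/2.

```python
import math

# Hypothetical concrete choice: F(x) = 1 - e^{-x} for x >= 0 (exponential
# distribution) and g(x) = e^{-x}; the exact Riemann-Stieltjes integral is 1/2.
def F(x):
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def g(x):
    return math.exp(-x)

# One fine uniform partition of [0, 30]; the mass of F beyond 30 (~ e^{-30})
# and the oscillation of g within each cell are both negligible.
N = 200_000
a = [30.0 * j / N for j in range(N + 1)]
rs_sum = sum(g(a[j]) * (F(a[j + 1]) - F(a[j])) for j in range(N))
```

Refining the partition and widening the interval drives the sum to the value of the integral, as the exercise asserts.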
Chapter 2

Weak Convergence

2.1 Characteristic Functions


If α is a probability distribution on the line, its characteristic function is
defined by

    φ(t) = ∫ exp[itx] dα.        (2.1)

The above definition makes sense. We write the integrand e^{itx} as cos tx +
i sin tx and integrate each part to see that

    |φ(t)| ≤ 1

for all real t.

Exercise 2.1. Calculate the characteristic functions for the following distri-
butions:

1. α is the degenerate distribution δa with probability one at the point a.

2. α is the binomial distribution with probabilities

    p_k = Prob[X = k] = \binom{n}{k} p^k (1 − p)^{n−k}

for 0 ≤ k ≤ n.


Theorem 2.1. The characteristic function φ(t) of any probability distribution
is a uniformly continuous function of t that is positive definite, i.e. for
any n complex numbers ξ_1, · · · , ξ_n and real numbers t_1, · · · , t_n

    Σ_{i,j=1}^{n} φ(t_i − t_j) ξ_i ξ̄_j ≥ 0.

Proof. Let us note that

    Σ_{i,j=1}^{n} φ(t_i − t_j) ξ_i ξ̄_j = ∫ Σ_{i,j=1}^{n} ξ_i ξ̄_j exp[i(t_i − t_j)x] dα
                                     = ∫ |Σ_{j=1}^{n} ξ_j exp[i t_j x]|² dα ≥ 0.

To prove uniform continuity we see that

    |φ(t) − φ(s)| ≤ ∫ |exp[i(t − s)x] − 1| dα,

which tends to 0 by the bounded convergence theorem as |t − s| → 0.


The characteristic function of course carries some information about the
distribution α. In particular if ∫ |x| dα < ∞, then φ(·) is continuously differentiable
and φ′(0) = i ∫ x dα.
Exercise 2.2. Prove it!
Warning: The converse need not be true. φ(·) can be continuously differentiable
but ∫ |x| dα could be ∞.
Exercise 2.3. Construct a counterexample along the following lines. Take a
discrete distribution, symmetric around 0, with

    α{n} = α{−n} = p(n) ≃ 1/(n² log n).

Then show that Σ_n (1 − cos nt)/(n² log n) is a continuously differentiable function of t.
R
Exercise 2.4. The story with higher moments m_r = ∫ x^r dα is similar. If any
of them, say m_r, exists, then φ(·) is r times continuously differentiable and
φ^{(r)}(0) = i^r m_r. The converse is false for odd r, but true for even r by an
application of Fatou’s lemma.

The next question is how to recover the distribution function F (x) from
φ(t). If we go back to the Fourier inversion formula, see for instance [2], we
can ‘guess’, using the fundamental theorem of calculus and Fubini’s theorem,
that

    F′(x) = (1/2π) ∫_{−∞}^{∞} exp[−itx] φ(t) dt

and therefore

    F(b) − F(a) = (1/2π) ∫_a^b dx ∫_{−∞}^{∞} exp[−itx] φ(t) dt
                = (1/2π) ∫_{−∞}^{∞} φ(t) dt ∫_a^b exp[−itx] dx
                = (1/2π) ∫_{−∞}^{∞} φ(t) (exp[−itb] − exp[−ita])/(−it) dt
                = lim_{T→∞} (1/2π) ∫_{−T}^{T} φ(t) (exp[−itb] − exp[−ita])/(−it) dt.
We will in fact prove the final relation, which is a principal value integral,
provided a and b are points of continuity of F . We compute the right hand
side as

    lim_{T→∞} (1/2π) ∫_{−T}^{T} (exp[−itb] − exp[−ita])/(−it) dt ∫ exp[itx] dα
        = lim_{T→∞} (1/2π) ∫ dα ∫_{−T}^{T} (exp[it(x−b)] − exp[it(x−a)])/(−it) dt
        = lim_{T→∞} (1/2π) ∫ dα ∫_{−T}^{T} (sin t(x−a) − sin t(x−b))/t dt
        = (1/2) ∫ [sign(x − a) − sign(x − b)] dα
        = F(b) − F(a)

provided a and b are continuity points. We have applied Fubini’s theorem
and the bounded convergence theorem to take the limit as T → ∞. Note
that the Dirichlet integral

    u(T, z) = ∫_0^T (sin tz)/t dt

satisfies sup_{T,z} |u(T, z)| ≤ C and

    lim_{T→∞} u(T, z) = π/2 if z > 0,  −π/2 if z < 0,  0 if z = 0.

As a consequence we conclude that the distribution function and hence α is


determined uniquely by the characteristic function.
Exercise 2.5. Prove that if two distribution functions agree on the set of
points at which they are both continuous, they agree everywhere.
Besides those in Exercise 2.1, some additional examples of probability
distributions and the corresponding characteristic functions are given below.

1. The Poisson distribution of ‘rare events’, with rate λ, has probabilities
P[X = r] = e^{−λ} λ^r/r! for r ≥ 0. Its characteristic function is

    φ(t) = exp[λ(e^{it} − 1)].

2. The geometric distribution, the distribution of the number of unsuccessful
attempts preceding a success, has P[X = r] = p q^r for r ≥ 0. Its
characteristic function is

    φ(t) = p(1 − q e^{it})^{−1}.

3. The negative binomial distribution, the probability distribution of the
number of accumulated failures before k successes, with P[X = r] =
\binom{k+r−1}{r} p^k q^r, has characteristic function

    φ(t) = p^k (1 − q e^{it})^{−k}.
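The Poisson formula above is easy to check numerically (a sketch; the parameter values λ = 2.5 and t = 0.7 are arbitrary): summing e^{itr} against the Poisson probabilities reproduces exp[λ(e^{it} − 1)].

```python
import cmath
import math

lam, t = 2.5, 0.7   # arbitrary sample parameters

# Direct sum: phi(t) = sum_r e^{itr} e^{-lam} lam^r / r!, truncated where the
# Poisson tail is utterly negligible (r < 150 keeps factorials within float range).
direct = sum(cmath.exp(1j * t * r) * math.exp(-lam) * lam ** r / math.factorial(r)
             for r in range(150))

# Closed form from the text.
closed = cmath.exp(lam * (cmath.exp(1j * t) - 1))
```

The truncated sum and the closed form agree to near machine precision, and both have modulus at most 1, as every characteristic function must.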

We now turn to some common continuous distributions, in fact given by
‘densities’ f (x), i.e. the distribution functions are given by F(x) = ∫_{−∞}^{x} f(y) dy.

1. The ‘uniform’ distribution with density f (x) = 1/(b − a), a ≤ x ≤ b, has
characteristic function

    φ(t) = (e^{itb} − e^{ita})/(it(b − a)).

In particular for the case of a symmetric interval [−a, a],

    φ(t) = (sin at)/(at).

2. The gamma distribution with density f (x) = (c^p/Γ(p)) e^{−cx} x^{p−1}, x ≥ 0,
where c > 0 is any constant, has the characteristic function

    φ(t) = (1 − it/c)^{−p}.

A special case of the gamma distribution is the exponential distribution,
which corresponds to c = p = 1, with density f (x) = e^{−x} for x ≥ 0. Its
characteristic function is given by

    φ(t) = [1 − it]^{−1}.

3. The two sided exponential with density f (x) = (1/2) e^{−|x|} has characteristic
function

    φ(t) = 1/(1 + t²).

4. The Cauchy distribution with density f (x) = (1/π) · 1/(1 + x²) has the
characteristic function

    φ(t) = e^{−|t|}.

5. The normal or Gaussian distribution with mean µ and variance σ²,
which has density (1/(√(2π) σ)) e^{−(x−µ)²/2σ²}, has the characteristic function
given by

    φ(t) = e^{itµ − σ²t²/2}.

In general if X is a random variable which has distribution α and a
characteristic function φ(t), the distribution β of aX + b can be written
as β(A) = α[x : ax + b ∈ A] and its characteristic function ψ(t) can be
expressed as ψ(t) = e^{itb} φ(at). In particular the characteristic function of −X
is φ(−t) = φ̄(t), the complex conjugate. Therefore the distribution of X is symmetric around x = 0
if and only if φ(t) is real for all t.
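The Gaussian formula can be verified by direct numerical integration of (2.1) (a sketch; the values µ = 0.5, σ = 1.3, t = 0.9 are arbitrary):

```python
import cmath
import math

mu, sigma, t = 0.5, 1.3, 0.9   # arbitrary sample values

# phi(t) = int e^{itx} density(x) dx, by a midpoint Riemann sum over
# [mu - 10 sigma, mu + 10 sigma]; the Gaussian tails beyond are negligible.
lo, hi, n = mu - 10 * sigma, mu + 10 * sigma, 200_000
h = (hi - lo) / n
phi = sum(cmath.exp(1j * t * (lo + (k + 0.5) * h))
          * math.exp(-((lo + (k + 0.5) * h) - mu) ** 2 / (2 * sigma ** 2))
          for k in range(n)) * h / (sigma * math.sqrt(2 * math.pi))

# Closed form e^{it mu - sigma^2 t^2 / 2} from item 5.
closed = cmath.exp(1j * t * mu - sigma ** 2 * t ** 2 / 2)
```

The numerical integral matches the closed form; replacing t by −t gives the complex conjugate, consistent with the symmetry discussion above.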

2.2 Moment Generating Functions


If α is a probability distribution on R, for any integer k ≥ 1, the moment
m_k of α is defined as

    m_k = ∫ x^k dα.        (2.2)

Or equivalently the k-th moment of a random variable X is

    m_k = E[X^k].        (2.3)

By convention one takes m_0 = 1 even if P[X = 0] > 0. We should note
that if k is odd, in order for m_k to be defined we must have E[|X|^k] =
∫ |x|^k dα < ∞. Given a distribution α, either all the moments exist, or they
exist only for 0 ≤ k ≤ k_0 for some k_0. It could happen that k_0 = 0, as
is the case with the Cauchy distribution. If we know all the moments of a
distribution α, we know the expectations ∫ p(x) dα for every polynomial p(·).
Since polynomials p(·) can be used to approximate (by the Stone-Weierstrass
theorem) any continuous function, one might hope that, from the moments,
one can recover the distribution α. This is not as straightforward as one
would hope. If we take a bounded continuous function, like sin x, we can find
a sequence of polynomials p_n(x) that converges to sin x. But to conclude
that

    ∫ sin x dα = lim_{n→∞} ∫ p_n(x) dα

we need to control the contribution to the integral from large values of x,
which is the role of the dominated convergence theorem. If we define p*(x) =
sup_n |p_n(x)| it would be a big help if ∫ p*(x) dα were finite. But the degrees
of the polynomials p_n have to increase indefinitely with n because sin x is a
transcendental function. Therefore p*(·) must grow faster than a polynomial
at ∞ and the condition ∫ p*(x) dα < ∞ may not hold.
In general, it is not true that moments determine the distribution. If
we look at it through characteristic functions, it is the problem of trying to
recover the function φ(t) from a knowledge of all of its derivatives at t = 0.
The Taylor series at t = 0 may not yield the function. Of course we have
more information in our hands, like positive definiteness etc. But still it is
likely that moments do not in general determine α. In fact here is how to
construct an example.

We need nonnegative numbers {a_n}, {b_n} : n ≥ 0, such that

    Σ_n a_n e^{kn} = Σ_n b_n e^{kn} = m_k

for every k ≥ 0. We can then replace them by {a_n/m_0}, {b_n/m_0} : n ≥ 0 so that
Σ_k a_k = Σ_k b_k = 1 and the two probability distributions

    P[X = e^n] = a_n,    P[X = e^n] = b_n

will have all their moments equal. Once we can find {c_n} such that

    Σ_n c_n e^{nz} = 0 for z = 0, 1, · · ·

we can take a_n = max(c_n, 0) and b_n = max(−c_n, 0) and we will have our
example. The goal then is to construct {c_n} such that Σ_n c_n z^n = 0 for
z = 1, e, e², · · · . Borrowing from ideas in the theory of functions of a complex variable
(see the Weierstrass factorization theorem, [1]) we define

    C(z) = Π_{n=1}^{∞} (1 − z/e^n)

and expand C(z) = Σ c_n z^n. Since C(z) is an entire function, the coefficients
c_n satisfy Σ_n |c_n| e^{kn} < ∞ for every k.

There is in fact a positive result as well. If α is such that the moments
m_k = ∫ x^k dα do not grow too fast, then α is determined by the m_k.

Theorem 2.2. Let m_k be such that Σ_k m_{2k} a^{2k}/(2k)! < ∞ for some a > 0. Then
there is at most one distribution α such that ∫ x^k dα = m_k.

Proof. We want to determine the characteristic function φ(t) of α. First we
note that if α has moments m_k satisfying our assumption, then

    ∫ cosh(ax) dα = Σ_k m_{2k} a^{2k}/(2k)! < ∞

by the monotone convergence theorem. In particular

    ψ(u + it) = ∫ e^{(u+it)x} dα

is well defined as an analytic function of z = u + it in the strip |u| < a. From
the theory of functions of a complex variable we know that the function ψ(·)
is uniquely determined in the strip by its derivatives at 0, i.e. {m_k}. In
particular φ(t) = ψ(0 + it) is determined as well.
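As a worked instance of the growth condition (our example; the standard normal is not singled out in the text): the even moments of the standard normal are m_{2k} = (2k)!/(2^k k!), so the series in Theorem 2.2 collapses to Σ_k (a²/2)^k/k! = e^{a²/2}, finite for every a, confirming that the normal distribution is determined by its moments.

```python
from fractions import Fraction
import math

# Standard normal even moments m_{2k} = (2k)! / (2^k k!) (odd moments vanish).
# The series of Theorem 2.2 with a = 3 should sum (exactly) to e^{a^2/2}.
a = 3
s = Fraction(0)
for k in range(120):   # terms beyond k = 120 are astronomically small
    m2k = Fraction(math.factorial(2 * k), 2 ** k * math.factorial(k))
    s += m2k * Fraction(a ** (2 * k), math.factorial(2 * k))
partial = float(s)
```

Exact rational arithmetic sidesteps float overflow in the huge factorials; the partial sum matches e^{9/2}.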

2.3 Weak Convergence


One of the basic ideas in establishing Limit Theorems is the notion of weak
convergence of a sequence of probability distributions on the line R. Since
the role of a probability measure is to assign probabilities to sets, we should
expect that if two probability measures are to be close, then they should
assign, for a given set, probabilities that are nearly equal. This suggests the
definition

    d(P_1, P_2) = sup_{A∈B} |P_1(A) − P_2(A)|

as the distance between two probability measures P_1 and P_2 on a measurable
space (Ω, B). This is too strong. If we take P_1 and P_2 to be degenerate
distributions with probability 1 concentrated at two points x_1 and x_2 on the
line, one can see that, as soon as x_1 ≠ x_2, d(P_1, P_2) = 1, and the above
metric is not sensitive to how close the two points x_1 and x_2 are. It only
cares that they are unequal. The problem is not because of the supremum.
We can take A to be an interval [a, b] that includes x_1 but omits x_2 and
|P_1(A) − P_2(A)| = 1. On the other hand if the end points of the interval
are kept away from x_1 or x_2 the situation is not that bad. This leads to the
following definition.

Definition 2.1. A sequence α_n of probability distributions on R is said to
converge weakly to a probability distribution α if

    lim_{n→∞} α_n[I] = α[I]

for any interval I = [a, b] such that the single point sets {a} and {b} have
probability 0 under α.

One can state this equivalently in terms of the distribution functions
F_n(x) and F(x) corresponding to the measures α_n and α respectively.

Definition 2.2. A sequence αn of probability measures on the real line R


with distribution functions Fn (x) is said to converge weakly to a limiting
probability measure α with distribution function F (x) (in symbols αn ⇒ α or
Fn ⇒ F ) if
lim Fn (x) = F (x)
n→∞

for every x that is a continuity point of F .
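The restriction to continuity points of F is essential, as a simple concrete case shows (our illustration, not from the text): the point masses α_n = δ_{1/n} converge weakly to δ_0, yet the distribution functions fail to converge at the single discontinuity x = 0.

```python
# alpha_n = point mass at 1/n, with F_n(x) = 1 for x >= 1/n and 0 otherwise;
# the weak limit alpha = point mass at 0 has F(x) = 1 for x >= 0.
def F_n(n, x):
    return 1.0 if x >= 1.0 / n else 0.0

def F(x):
    return 1.0 if x >= 0 else 0.0

ns = (1, 10, 100, 1000)
left = [F_n(n, -0.1) for n in ns]     # stays 0 = F(-0.1): convergence holds
right = [F_n(n, 0.1) for n in ns]     # eventually 1 = F(0.1): convergence holds
at_zero = [F_n(n, 0.0) for n in ns]   # stays 0, while F(0) = 1: no convergence,
                                      # but x = 0 is a discontinuity point of F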



Exercise 2.6. Prove the equivalence of the two definitions.


Remark 2.1. One says that a sequence X_n of random variables converges
in law or in distribution to X if the distributions α_n of X_n converge weakly
to the distribution α of X.
There are equivalent formulations in terms of expectations and charac-
teristic functions.

Theorem 2.3. (Lévy-Cramér continuity theorem) The following are
equivalent.

(a) α_n ⇒ α or F_n ⇒ F.

(b) For every bounded continuous function f (x) on R

    lim_{n→∞} ∫_R f(x) dα_n = ∫_R f(x) dα.

(c) If φ_n(t) and φ(t) are respectively the characteristic functions of α_n and
α, then for every real t,

    lim_{n→∞} φ_n(t) = φ(t).

Proof. We first prove (a ⇒ b). Let ε > 0 be arbitrary. Find continuity
points a and b of F such that a < b, F(a) ≤ ε and 1 − F(b) ≤ ε. Since F_n(a)
and F_n(b) converge to F(a) and F(b), for n large enough, F_n(a) ≤ 2ε and
1 − F_n(b) ≤ 2ε. Divide the interval [a, b] into a finite number N = N_δ of small
subintervals I_j = (a_j, a_{j+1}], 1 ≤ j ≤ N, with a = a_1 < a_2 < · · · < a_{N+1} =
b, such that all the end points {a_j} are points of continuity of F and the
oscillation of the continuous function f in each I_j is less than a preassigned
number δ. Since any continuous function f is uniformly continuous in the
closed bounded (compact) interval [a, b], this is always possible for any given
δ > 0. Let h(x) = Σ_{j=1}^{N} χ_{I_j}(x) f(a_j) be the simple function equal to f(a_j) on I_j
and 0 outside ∪_j I_j = (a, b]. We have |f(x) − h(x)| ≤ δ on (a, b]. If f(x) is
bounded by M, then

    |∫ f(x) dα_n − Σ_{j=1}^{N} f(a_j)[F_n(a_{j+1}) − F_n(a_j)]| ≤ δ + 4Mε        (2.4)

and

    |∫ f(x) dα − Σ_{j=1}^{N} f(a_j)[F(a_{j+1}) − F(a_j)]| ≤ δ + 2Mε.        (2.5)

Since lim_{n→∞} F_n(a_j) = F(a_j) for every 1 ≤ j ≤ N, we conclude from equations
(2.4), (2.5) and the triangle inequality that

    lim sup_{n→∞} |∫ f(x) dα_n − ∫ f(x) dα| ≤ 2δ + 6Mε.

Since ε and δ are arbitrarily small numbers we are done.

(b ⇒ c) is trivial because we can make the choice of f(x) = exp[itx] = cos tx + i sin tx, which
for every t is a bounded and continuous function.
(c ⇒ a) is the hardest. It is carried out in several steps. Actually we will
prove a stronger version as a separate theorem.

Theorem 2.4. For each n ≥ 1, let φn (t) be the characteristic function of a


probability distribution αn . Assume that limn→∞ φn (t) = φ(t) exists for each
t and φ(t) is continuous at t = 0. Then φ(t) is the characteristic function of
some probability distribution α and αn ⇒ α.

Proof.
Step 1. Let r_1, r_2, · · · be an enumeration of the rational numbers. For each j
consider the sequence {F_n(r_j) : n ≥ 1} where F_n is the distribution function
corresponding to φ_n(·). It is a sequence bounded by 1 and we can extract a
subsequence that converges. By the diagonalization process we can choose a
subsequence G_k = F_{n_k} such that

    lim_{k→∞} G_k(r) = b_r

exists for every rational number r. From the monotonicity of F_n in x we
conclude that if r_1 < r_2, then b_{r_1} ≤ b_{r_2}.

Step 2. From the skeleton br we reconstruct a right continuous monotone


function G(x). We define
G(x) = inf br .
r>x

Clearly if x1 < x2 , then G(x1 ) ≤ G(x2 ) and therefore G is nondecreasing. If


xn ↓ x, any r > x satisfies r > xn for sufficiently large n. This allows us to
conclude that G(x) = inf n G(xn ) for any sequence xn ↓ x, proving that G(x)
is right continuous.

Step 3. Next we show that at any continuity point x of G

    lim_{n→∞} G_n(x) = G(x).

Let r > x be a rational number. Then G_n(x) ≤ G_n(r) and G_n(r) → b_r as
n → ∞. Hence

    lim sup_{n→∞} G_n(x) ≤ b_r.

This is true for every rational r > x, and therefore taking the infimum over
r > x,

    lim sup_{n→∞} G_n(x) ≤ G(x).

Suppose now that we have y < x. Find a rational r such that y < r < x. Then

    lim inf_{n→∞} G_n(x) ≥ lim inf_{n→∞} G_n(r) = b_r ≥ G(y).

As this is true for every y < x,

    lim inf_{n→∞} G_n(x) ≥ sup_{y<x} G(y) = G(x − 0) = G(x),

the last step being a consequence of the assumption that x is a point of


continuity of G i.e. G(x − 0) = G(x).

Warning. This does not mean that G is necessarily a distribution function.


Consider Fn (x) = 0 for x < n and 1 for x ≥ n, which corresponds to
the distribution with the entire probability concentrated at n. In this case
limn→∞ Fn (x) = G(x) exists and G(x) ≡ 0, which is not a distribution
function.
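The Warning's example can be made explicit (a sketch in the same spirit): the mass escapes to +∞, so the pointwise limit of F_n is identically 0, and correspondingly the characteristic functions φ_n(t) = e^{itn} have no pointwise limit away from t = 0, so Theorem 2.4 below never comes into play.

```python
import cmath

# F_n puts its entire mass at the point n: F_n(x) = 1 for x >= n, else 0.
def F_n(n, x):
    return 1.0 if x >= n else 0.0

# Pointwise limit at any fixed x is 0: the mass escapes to +infinity.
values_at_5 = [F_n(n, 5.0) for n in (1, 2, 10, 100)]

# The characteristic functions phi_n(t) = e^{itn} all have modulus 1, but they
# oscillate without a pointwise limit for t != 0.
phis = [cmath.exp(1j * 1.0 * n) for n in (1, 2, 10, 100)]
```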

Step 4. We will use the continuity at t = 0 of φ(t) to show that G is indeed
a distribution function. If φ(t) is the characteristic function of α, then

    |(1/2T) ∫_{−T}^{T} φ(t) dt| = |∫ [(1/2T) ∫_{−T}^{T} exp[itx] dt] dα|
                               = |∫ (sin Tx)/(Tx) dα|
                               ≤ ∫ |(sin Tx)/(Tx)| dα
                               = ∫_{|x|<ℓ} |(sin Tx)/(Tx)| dα + ∫_{|x|≥ℓ} |(sin Tx)/(Tx)| dα
                               ≤ α[|x| < ℓ] + (1/Tℓ) α[|x| ≥ ℓ].

We have used Fubini’s theorem in the first line and the bounds |sin x| ≤
|x| and |sin x| ≤ 1 in the last line. We can rewrite this as

    1 − |(1/2T) ∫_{−T}^{T} φ(t) dt| ≥ 1 − α[|x| < ℓ] − (1/Tℓ) α[|x| ≥ ℓ]
                                   = α[|x| ≥ ℓ] − (1/Tℓ) α[|x| ≥ ℓ]
                                   = (1 − 1/(Tℓ)) α[|x| ≥ ℓ]
                                   ≥ (1 − 1/(Tℓ)) [1 − F(ℓ) + F(−ℓ)].

Finally, if we pick ℓ = 2/T,

    1 − F(2/T) + F(−2/T) ≤ 2 [1 − |(1/2T) ∫_{−T}^{T} φ(t) dt|].

Since this inequality is valid for any distribution function F and its characteristic
function φ, we conclude that, for every k ≥ 1,

    1 − F_{n_k}(2/T) + F_{n_k}(−2/T) ≤ 2 [1 − |(1/2T) ∫_{−T}^{T} φ_{n_k}(t) dt|].        (2.6)

We can pick T such that ±2/T are continuity points of G. If we now pass to
the limit and use the bounded convergence theorem on the right hand side
of equation (2.6), we obtain

    1 − G(2/T) + G(−2/T) ≤ 2 [1 − |(1/2T) ∫_{−T}^{T} φ(t) dt|].

Since φ(0) = 1 and φ is continuous at t = 0, by letting T → 0 in such a way
that ±2/T are continuity points of G, we conclude that

    1 − G(∞) + G(−∞) = 0,

or G is indeed a distribution function.

Step 5. We now complete the rest of the proof, i.e. show that αn ⇒ α. We
have Gk = Fnk ⇒ G as well as ψk = φnk → φ. Therefore G must equal F
which has φ for its characteristic function. Since the argument works for any
subsequence of Fn , every subsequence of Fn will have a further subsequence
that converges weakly to the same limit F uniquely determined as the distri-
bution function whose characteristic function is φ(·). Consequently Fn ⇒ F
or αn ⇒ α.

Exercise 2.7. How do you actually prove that if every subsequence of a se-
quence {Fn } has a further subsequence that converges to a common F then
Fn ⇒ F ?

Definition 2.3. A subset A of probability distributions on R is said to be
totally bounded if, given any sequence α_n from A, there is a subsequence that
converges weakly to some limiting probability distribution α.

Theorem 2.5. In order that a family A of probability distributions be totally
bounded it is necessary and sufficient that either of the following equivalent
conditions hold:

    lim_{ℓ→∞} sup_{α∈A} α[x : |x| ≥ ℓ] = 0        (2.7)

    lim_{h→0} sup_{α∈A} sup_{|t|≤h} |1 − φ_α(t)| = 0.        (2.8)

Here φ_α(t) is the characteristic function of α.

The condition in equation (2.7) is often called the uniform tightness property.

Proof. The proof is already contained in the details of the proof of the earlier
theorem. We can always choose a subsequence such that the distribution
functions converge at rationals and try to reconstruct the limiting distribution
function from the limits at rationals. The crucial step is to prove that
the limit is a distribution function. Either of the two conditions (2.7) or (2.8)
will guarantee this. If condition (2.7) is violated it is straightforward to pick
a sequence from A for which the distribution functions have a limit which is
not a distribution function. Then A cannot be totally bounded. Condition
(2.7) is therefore necessary. That (2.7) implies (2.8) is a consequence of the estimate

    |1 − φ(t)| ≤ ∫ |exp[itx] − 1| dα
              = ∫_{|x|≤ℓ} |exp[itx] − 1| dα + ∫_{|x|>ℓ} |exp[itx] − 1| dα
              ≤ |t|ℓ + 2α[x : |x| > ℓ].

It is a well known principle in Fourier analysis that the regularity of φ(t) at
t = 0 is related to the decay rate of the tail probabilities.
Exercise 2.8. Compute ∫ |x|^p dα in terms of the characteristic function φ(t)
for p in the range 0 < p < 2.
Hint: Look at the formula

    ∫_{−∞}^{∞} (1 − cos tx)/|t|^{p+1} dt = C_p |x|^p

and use Fubini’s theorem.
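The scaling in the hint can be checked numerically for p = 1 (our sketch; the truncation point T and step h are arbitrary numerical choices): the integral grows linearly in |x|, and the constant works out to C_1 = π.

```python
import math

def I(x, T=500.0, h=0.001):
    # int_{-inf}^{inf} (1 - cos(tx)) / |t|^{p+1} dt with p = 1: by symmetry it
    # is twice the integral over (0, T], computed by a midpoint rule, plus the
    # tail estimate int_T^inf dt/t^2 = 1/T (the oscillatory part of the tail
    # is of smaller order).
    n = int(T / h)
    body = sum((1.0 - math.cos((k + 0.5) * h * x)) / ((k + 0.5) * h) ** 2
               for k in range(n)) * h
    return 2.0 * (body + 1.0 / T)

i1, i2 = I(1.0), I(2.0)
# C_p |x|^p with p = 1: I(2) should be twice I(1), and in fact C_1 = pi.
```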


We have the following result on the behavior of α_n(A) for certain sets
whenever α_n ⇒ α.

Theorem 2.6. Let α_n ⇒ α on R. If C ⊂ R is a closed set then

    lim sup_{n→∞} α_n(C) ≤ α(C),

while for open sets G ⊂ R

    lim inf_{n→∞} α_n(G) ≥ α(G).

If A ⊂ R is a continuity set of α, i.e. α(∂A) = α(Ā − A°) = 0, then

    lim_{n→∞} α_n(A) = α(A).

Proof. The function d(x, C) = inf_{y∈C} |x − y| is continuous and equals 0
precisely on C. Hence

    f(x) = 1/(1 + d(x, C))

is a continuous function bounded by 1, equal to 1 precisely on C, and

    f_k(x) = [f(x)]^k ↓ χ_C(x)

as k → ∞. For every k ≥ 1, we have

    lim_{n→∞} ∫ f_k(x) dα_n = ∫ f_k(x) dα

and therefore

    lim sup_{n→∞} α_n(C) ≤ lim_{n→∞} ∫ f_k(x) dα_n = ∫ f_k(x) dα.

Letting k → ∞ we get

    lim sup_{n→∞} α_n(C) ≤ α(C).

Taking complements we conclude that for any open set G ⊂ R

    lim inf_{n→∞} α_n(G) ≥ α(G).

Combining the two parts, if A ⊂ R is a continuity set of α, i.e. α(∂A) =
α(Ā − A°) = 0, then

    lim_{n→∞} α_n(A) = α(A).

We are now ready to prove the converse of Theorem 2.1, which is the hard
part of a theorem of Bochner that characterizes the characteristic functions
of probability distributions as continuous positive definite functions on R
normalized to be 1 at 0.

Theorem 2.7. (Bochner’s Theorem). If φ(t) is a positive definite function
which is continuous at t = 0 and is normalized so that φ(0) = 1, then φ
is the characteristic function of some probability distribution on R.

Proof. The proof depends on constructing approximations φ_n(t) which are in
fact characteristic functions and satisfy φ_n(t) → φ(t) as n → ∞. Then we can
apply the preceding theorem and the probability measures corresponding to
φ_n will have a weak limit which will have φ for its characteristic function.

Step 1. Let us establish a few elementary properties of positive definite
functions.
1) If φ(t) is a positive definite function, so is φ(t) exp[ita] for any real a.
The proof is elementary and requires just direct verification.
2) If φ_j(t) are positive definite for each j then so is any linear combination
φ(t) = Σ_j w_j φ_j(t) with nonnegative weights w_j. If each φ_j(t) is normalized
with φ_j(0) = 1 and Σ_j w_j = 1, then of course φ(0) = 1 as well.
3) If φ is positive definite then φ satisfies φ(0) ≥ 0, φ(−t) = φ̄(t) and
|φ(t)| ≤ φ(0) for all t.
We use the fact that the matrix {φ(t_i − t_j) : 1 ≤ i, j ≤ n} is Hermitian
positive definite for any n real numbers t_1, · · · , t_n. The first assertion follows
from the positivity of φ(0)|ξ|², the second is a consequence of the Hermitian
property, and if we take n = 2 with t_1 = t and t_2 = 0, as a consequence
of the positive definiteness of the 2 × 2 matrix we get |φ(t)|² ≤ |φ(0)|².
4) For any s, t we have |φ(t) − φ(s)|² ≤ 4φ(0)|φ(0) − φ(t − s)|.
We use the positive definiteness of the 3 × 3 matrix

    [ 1         φ(t − s)   φ(t) ]
    [ φ(s − t)  1          φ(s) ]
    [ φ(−t)     φ(−s)      1    ]

which is {φ(t_i − t_j)} with t_1 = t, t_2 = s and t_3 = 0 (recall φ(−u) = φ̄(u)). In particular the
determinant has to be nonnegative:

    0 ≤ 1 + φ̄(t) φ(t − s) φ(s) + φ(t) φ̄(t − s) φ̄(s) − |φ(s)|²
          − |φ(t)|² − |φ(t − s)|²
      = 1 − |φ(s) − φ(t)|² − |φ(t − s)|² − φ(t) φ̄(s) (1 − φ̄(t − s))
          − φ̄(t) φ(s) (1 − φ(t − s))
      ≤ 1 − |φ(s) − φ(t)|² − |φ(t − s)|² + 2|1 − φ(t − s)|.

Or

    |φ(s) − φ(t)|² ≤ 1 − |φ(t − s)|² + 2|1 − φ(t − s)|
                  ≤ 4|1 − φ(t − s)|.

5) It now follows from 4) that if a positive definite function is continuous


at t = 0, it is continuous everywhere (in fact uniformly continuous).
Step 2. First we show that if φ(t) is a positive definite function which is
continuous on R and is absolutely integrable, then
Z ∞
1
f (x) = exp[− i t x ]φ(t) dt ≥ 0
2π −∞
is a continuous function and
∫_{−∞}^{∞} f(x) dx = 1.

Moreover the function

F(x) = ∫_{−∞}^{x} f(y) dy
defines a distribution function with characteristic function
φ(t) = ∫_{−∞}^{∞} exp[itx] f(x) dx.    (2.9)

If φ is integrable on (−∞, ∞), then f(x) is clearly bounded and continuous. To see that it is nonnegative we write

f(x) = lim_{T→∞} (1/2π) ∫_{−T}^{T} (1 − |t|/T) e^{−itx} φ(t) dt        (2.10)
     = lim_{T→∞} (1/2πT) ∫_0^T ∫_0^T e^{−i(t−s)x} φ(t − s) dt ds       (2.11)
     = lim_{T→∞} (1/2πT) ∫_0^T ∫_0^T e^{−itx} e^{isx} φ(t − s) dt ds   (2.12)
     ≥ 0.
We can use the dominated convergence theorem to prove equation (2.10),
a change of variables to show equation (2.11) and finally a Riemann sum
approximation to the integral and the positive definiteness of φ to show that
the quantity in (2.12) is nonnegative. It remains to show the relation (2.9).
Let us define
f_σ(x) = f(x) exp[−σ^2 x^2 / 2]
48 CHAPTER 2. WEAK CONVERGENCE

and calculate for t ∈ R, using Fubini’s theorem

∫_{−∞}^{∞} e^{itx} f_σ(x) dx = ∫_{−∞}^{∞} e^{itx} f(x) exp[−σ^2 x^2 / 2] dx
 = (1/2π) ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{itx} φ(s) e^{−isx} exp[−σ^2 x^2 / 2] ds dx
 = ∫_{−∞}^{∞} φ(s) (1/(√(2π) σ)) exp[−(t − s)^2 / 2σ^2] ds.    (2.13)
If we take t = 0 in equation (2.13), we get
∫_{−∞}^{∞} f_σ(x) dx = ∫_{−∞}^{∞} φ(s) (1/(√(2π) σ)) exp[−s^2 / 2σ^2] ds ≤ 1.    (2.14)
Now we let σ → 0. Since f_σ ≥ 0 and tends to f as σ → 0, from Fatou's lemma and equation (2.14) it follows that f is integrable and in fact ∫_{−∞}^{∞} f(x) dx ≤ 1. Now we let σ → 0 in equation (2.13). Since f_σ(x) e^{itx} is dominated by the integrable function f, there is no problem with the left hand side. On the other hand the limit as σ → 0 is easily calculated on the right hand side of equation (2.13):
∫_{−∞}^{∞} e^{itx} f(x) dx = lim_{σ→0} ∫_{−∞}^{∞} φ(s) (1/(√(2π) σ)) exp[−(s − t)^2 / 2σ^2] ds
 = lim_{σ→0} ∫_{−∞}^{∞} φ(t + σs) (1/√(2π)) exp[−s^2 / 2] ds
 = φ(t),
proving equation (2.9).
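The inversion formula of Step 2 is easy to test numerically. The sketch below (an illustration, not part of the text; the Gaussian example is an assumption) approximates f(x) = (1/2π) ∫ e^{−itx} φ(t) dt by a trapezoid sum for the integrable characteristic function φ(t) = exp[−t^2/2] of N(0,1), and recovers the normal density.

```python
import cmath, math

def invert_cf(phi, x, T=10.0, n=4000):
    # Trapezoid approximation to f(x) = (1/2pi) * int_{-T}^{T} e^{-itx} phi(t) dt;
    # when phi is integrable and decays fast, the tail beyond |t| = T is negligible.
    h = 2 * T / n
    total = 0.0
    for k in range(n + 1):
        t = -T + k * h
        w = 0.5 if k in (0, n) else 1.0        # trapezoid end-point weights
        total += w * (cmath.exp(-1j * t * x) * phi(t)).real
    return total * h / (2 * math.pi)

phi_gauss = lambda t: math.exp(-t * t / 2)      # characteristic function of N(0,1)
f0 = invert_cf(phi_gauss, 0.0)                  # density at 0: 1/sqrt(2*pi)
f1 = invert_cf(phi_gauss, 1.0)                  # density at 1: exp(-1/2)/sqrt(2*pi)
```

Because the integrand is smooth and essentially supported in [−10, 10], the trapezoid rule here is extremely accurate.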
Step 3. If φ(t) is a positive definite function which is continuous, so is
φ(t) exp[ i t y ] for every y and for σ > 0, as well as the convex combination

φ_σ(t) = ∫_{−∞}^{∞} φ(t) exp[ity] (1/(√(2π) σ)) exp[−y^2 / 2σ^2] dy = φ(t) exp[−σ^2 t^2 / 2].
The previous step is applicable to φ_σ(t), which is clearly integrable on R, and by letting σ → 0 we conclude by Theorem 2.3 that φ is a characteristic function as well.

Remark 2.2. There is a Fourier Series analog involving distributions on a


finite interval, say S = [0, 2π). The right end point is omitted on purpose,
because the distribution should be thought of as being on [0, 2π] with 0 and
2π identified. If α is a distribution on S the characteristic function is defined
as

φ(n) = ∫ e^{inx} dα
for integral values n ∈ Z. There is a uniqueness theorem, and a Bochner


type theorem involving an analogous definition of positive definiteness. The
proof is nearly the same.
Exercise 2.9. If α_n ⇒ α it is not always true that ∫ x dα_n → ∫ x dα, because while x is a continuous function it is not bounded. Construct a simple counterexample. On the positive side, let f(x) be a continuous function that is not necessarily bounded. Assume that there exists a positive continuous function g(x) satisfying

lim_{|x|→∞} |f(x)| / g(x) = 0

and

sup_n ∫ g(x) dα_n ≤ C < ∞.

Then show that

lim_{n→∞} ∫ f(x) dα_n = ∫ f(x) dα.

In particular if ∫ |x|^k dα_n remains bounded, then ∫ x^j dα_n → ∫ x^j dα for 1 ≤ j ≤ k − 1.
Exercise 2.10. On the other hand if α_n ⇒ α and g : R → R is a continuous
function then the distribution βn of g under αn defined as
βn [A] = αn [x : g(x) ∈ A]
converges weakly to β the corresponding distribution of g under α.
Exercise 2.11. If gn (x) is a sequence of continuous functions such that
sup_{n,x} |g_n(x)| ≤ C < ∞  and  lim_{n→∞} g_n(x) = g(x)

uniformly on every bounded interval, then whenever α_n ⇒ α it follows that

lim_{n→∞} ∫ g_n(x) dα_n = ∫ g(x) dα.

Can you construct an example to show that even if g_n, g are continuous, just the pointwise convergence lim_{n→∞} g_n(x) = g(x) is not enough?
Exercise 2.12. If a sequence {fn (ω)} of random variables on a measure space
are such that fn → f in measure, then show that the sequence of distributions
αn of fn on R converges weakly to the distribution α of f . Give an example
to show that the converse is not true in general. However, if f is equal to a
constant c with probability 1, or equivalently α is degenerate at some point c,
then α_n ⇒ α = δ_c implies the convergence in probability of f_n to the constant
function c.
Chapter 3

Independent Sums

3.1 Independence and Convolution


One of the central ideas in probability is the notion of independence. In
intuitive terms two events are independent if they have no influence on each
other. The formal definition is

Definition 3.1. Two events A and B are said to be independent if

P [A ∩ B] = P [A]P [B].

Exercise 3.1. If A and B are independent prove that so are Ac and B.

Definition 3.2. Two random variables X and Y are independent if the


events X ∈ A and Y ∈ B are independent for any two Borel sets A and
B on the line i.e.

P [X ∈ A, Y ∈ B] = P [X ∈ A]P [Y ∈ B].

for all Borel sets A and B.

There is a natural extension to a finite or even an infinite collection of


random variables.


Definition 3.3. A finite collection {Xj : 1 ≤ j ≤ n} of random


variables are said to be independent if for any n Borel sets A1 , . . . , An on the
line
 
P ∩1≤j≤n [Xj ∈ Aj ] = Π1≤j≤n P [Xj ∈ Aj ].

Definition 3.4. An infinite collection of random variables is said to be in-


dependent if every finite subcollection is independent.

Lemma 3.1. Two random variables X, Y defined on (Ω, Σ, P ) are indepen-


dent if and only if the measure induced on R2 by (X, Y ), is the product
measure α × β where α and β are the distributions on R induced by X and
Y respectively.

Proof. Left as an exercise.

The important thing to note is that if X and Y are independent and one
knows their distributions α and β, then their joint distribution is automati-
cally determined as the product measure.
If X and Y are independent random variables having α and β for their
distributions, the distribution of the sum Z = X +Y is determined as follows.
First we construct the product measure α×β on R×R and then consider the
induced distribution of the function f (x, y) = x + y. This distribution, called
the convolution of α and β, is denoted by α ∗ β. An elementary calculation
using Fubini’s theorem provides the following identities.

(α ∗ β)(A) = ∫ α(A − x) dβ = ∫ β(A − x) dα    (3.1)

In terms of characteristic function, we can express the characteristic func-


tion of the convolution as

∫ exp[itx] d(α ∗ β) = ∫∫ exp[it(x + y)] dα dβ = ∫ exp[itx] dα · ∫ exp[ity] dβ

or equivalently
φα∗β (t) = φα (t)φβ (t) (3.2)
which provides a direct way of calculating the distributions of sums of inde-
pendent random variables by the use of characteristic functions.
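For discrete distributions, both the convolution formula (3.1) and the product rule (3.2) can be checked directly. The following sketch (illustrative only; the two-dice example is an assumption, not from the text) convolves two distributions stored as dictionaries and compares characteristic functions.

```python
import cmath
from itertools import product

def convolve(alpha, beta):
    # discrete analogue of (3.1): (alpha * beta)(z) = sum over x of alpha(x) beta(z - x)
    out = {}
    for (x, p), (y, q) in product(alpha.items(), beta.items()):
        out[x + y] = out.get(x + y, 0.0) + p * q
    return out

def cf(alpha, t):
    # characteristic function of a discrete distribution alpha
    return sum(p * cmath.exp(1j * t * x) for x, p in alpha.items())

die = {k: 1 / 6 for k in range(1, 7)}    # distribution of one fair die
two = convolve(die, die)                 # distribution of the sum of two dice
t = 0.7
lhs = cf(two, t)                         # characteristic function of alpha * beta
rhs = cf(die, t) * cf(die, t)            # product of the factors, equation (3.2)
```

The probability two[7] must be 6/36, and lhs and rhs agree up to rounding, as (3.2) predicts.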
Exercise 3.2. If X and Y are independent show that for any two measurable
functions f and g, f (X) and g(Y ) are independent.
Exercise 3.3. Use Fubini’s theorem to show that if X and Y are independent
and if f and g are measurable functions with both E[|f (X)|] and E[|g(Y )|]
finite then
E[f (X)g(Y )] = E[f (X)]E[g(Y )].
Exercise 3.4. Show that if X and Y are any two random variables then
E(X + Y ) = E(X) + E(Y ). If X and Y are two independent random
variables then show that
Var(X + Y ) = Var(X) + Var(Y )
where
 
Var(X) = E [X − E[X]]2 = E[X 2 ] − [E[X]]2 .
If X1 , X2 , · · · , Xn are n independent random variables, then the distri-
bution of their sum Sn = X1 + X2 + · · · + Xn can be computed in terms of
the distributions of the summands. If α_j is the distribution of X_j , then the distribution µ_n of S_n is given by the convolution µ_n = α_1 ∗ α_2 ∗ · · · ∗ α_n ,
that can be calculated inductively by µj+1 = µj ∗ αj+1 . In terms of their
characteristic functions ψn (t) = φ1 (t)φ2 (t) · · · φn (t). The first two moments
of Sn are computed easily.
E(S_n) = E(X_1) + E(X_2) + · · · + E(X_n)
and
Var(S_n) = E[S_n − E(S_n)]^2
         = ∑_j E[X_j − E(X_j)]^2 + 2 ∑_{1≤i<j≤n} E{[X_i − E(X_i)][X_j − E(X_j)]}.

For i 6= j, because of independence

E[Xi − E(Xi )][Xj − E(Xj )] = E[Xi − E(Xi )]E[Xj − E(Xj )] = 0

and we get the formula

Var(Sn ) = Var(X1 ) + Var(X2 ) + · · · + Var(Xn ). (3.3)

3.2 Weak Law of Large Numbers


Let us look at the distribution of the number of successes in n independent
trials, with the probability of success in a single trial being equal to p.
 
n r
P{S_n = r} = \binom{n}{r} p^r (1 − p)^{n−r}
and
P{|S_n − np| ≥ nδ} = ∑_{|r−np| ≥ nδ} \binom{n}{r} p^r (1 − p)^{n−r}
 ≤ ∑_{|r−np| ≥ nδ} (1/n^2 δ^2) (r − np)^2 \binom{n}{r} p^r (1 − p)^{n−r}    (3.4)
 ≤ (1/n^2 δ^2) ∑_{0 ≤ r ≤ n} (r − np)^2 \binom{n}{r} p^r (1 − p)^{n−r}
 = (1/n^2 δ^2) E[S_n − np]^2 = (1/n^2 δ^2) Var(S_n)    (3.5)
 = (1/n^2 δ^2) np(1 − p)    (3.6)
 = p(1 − p)/(n δ^2).
In the step (3.4) we have used a discrete version of the simple inequality
∫_{x : g(x) ≥ a} g(x) dα ≥ a · α[x : g(x) ≥ a]

with g(x) = (x − np)^2 and a = n^2 δ^2 , and in (3.6) we have used the fact that S_n = X_1 + X_2 + · · · + X_n where the X_i are independent and have the simple distribution

P {Xi = 1} = p and P {Xi = 0} = 1 − p. Therefore E(Sn ) = np and


Var(S_n) = n Var(X_1) = np(1 − p).
It follows now that
lim_{n→∞} P{|S_n − np| ≥ nδ} = lim_{n→∞} P{|S_n/n − p| ≥ δ} = 0
or the average S_n/n converges to p in probability. This is easily seen to be equivalent to the statement that the distribution of S_n/n converges to the distribution degenerate at p. See Exercise 2.12.
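A quick simulation makes the estimate above concrete. The sketch below (illustrative only; the parameters and the fixed seed are assumptions) draws repeated Bernoulli samples and compares the empirical frequency of the deviation event with the bound p(1 − p)/(nδ^2) just derived.

```python
import random

random.seed(0)
p, n, delta, reps = 0.3, 2000, 0.05, 500     # illustrative parameters
exceed = 0
for _ in range(reps):
    s = sum(1 for _ in range(n) if random.random() < p)   # S_n: number of successes
    if abs(s / n - p) >= delta:
        exceed += 1
freq = exceed / reps                          # empirical P{|S_n/n - p| >= delta}
bound = p * (1 - p) / (n * delta ** 2)        # the bound p(1-p)/(n delta^2) = 0.042
```

With these parameters the deviation δ is nearly five standard deviations of S_n/n, so the empirical frequency sits far below the (rather crude) Chebychev-type bound.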
The above argument works for any sequence of independent and iden-
tically distributed random variables. If we assume that E(Xi ) = m and
Var(X_i) = σ^2 < ∞, then E(S_n/n) = m and Var(S_n/n) = σ^2/n. Chebychev's inequality states that for any random variable X


P{|X − E[X]| ≥ δ} = ∫_{|X−E[X]| ≥ δ} dP
 ≤ (1/δ^2) ∫_{|X−E[X]| ≥ δ} [X − E[X]]^2 dP
 ≤ (1/δ^2) ∫ [X − E[X]]^2 dP
 = (1/δ^2) Var(X).    (3.7)
This can be used to prove the weak law of large numbers for the gen-
eral case of independent identically distributed random variables with finite
second moments.

Theorem 3.2. If X1 , X2 . . . , Xn , . . . is a sequence of independent identically


distributed random variables with E[X_j] ≡ m and Var(X_j) ≡ σ^2 , then for

Sn = X1 + X2 + · · · + Xn

we have
 
lim_{n→∞} P{|S_n/n − m| ≥ δ} = 0

for any δ > 0.



Proof. Use Chebychev’s inequality to estimate


 
P{|S_n/n − m| ≥ δ} ≤ (1/δ^2) Var(S_n/n) = σ^2/(n δ^2).

Actually it is enough to assume that E|Xi | < ∞ and the existence of the
second moment is not needed. We will provide two proofs of the statement
Theorem 3.3. If X_1, X_2, · · · , X_n are independent and identically distributed with a finite first moment and E(X_i) = m, then (X_1 + X_2 + · · · + X_n)/n converges to m in probability as n → ∞.
Proof. 1. Let C be a large constant and let us define XiC as the truncated
random variable XiC = Xi if |Xi | ≤ C and XiC = 0 otherwise. Let YiC =
Xi − XiC so that Xi = XiC + YiC . Then
(1/n) ∑_{1≤i≤n} X_i = (1/n) ∑_{1≤i≤n} X_i^C + (1/n) ∑_{1≤i≤n} Y_i^C = ξ_n^C + η_n^C.
If we denote by aC = E(XiC ) and bC = E(YiC ) we always have m =
aC + bC . Consider the quantity
δ_n = E[|(1/n) ∑_{1≤i≤n} X_i − m|]
    = E[|ξ_n^C + η_n^C − m|]
    ≤ E[|ξ_n^C − a_C|] + E[|η_n^C − b_C|]
    ≤ (E[|ξ_n^C − a_C|^2])^{1/2} + 2 E[|Y_i^C|].    (3.8)

As n → ∞, the truncated random variables XiC are bounded and indepen-


dent. Theorem 3.2 is applicable and the first of the two terms in (3.8) tends
to 0. Therefore taking the limsup as n → ∞, for any 0 < C < ∞,

lim sup δn ≤ 2E[|YiC |].


n→∞

If we now let the cutoff level C go to infinity, by the integrability of X_i , E[|Y_i^C|] → 0 as C → ∞ and we are done. The final step of establishing that

for any sequence Yn of random variables, E[|Yn |] → 0 implies that Yn → 0 in


probability, is left as an exercise and is not very different from Chebychev’s
inequality.
Proof 2. We can use characteristic functions. If we denote the characteristic function of X_i by φ(t), then the characteristic function of (1/n) ∑_{1≤i≤n} X_i is given by ψ_n(t) = [φ(t/n)]^n . The existence of the first moment assures us that φ(t) is differentiable at t = 0 with a derivative equal to im where m = E(X_i). Therefore by Taylor expansion

φ(t/n) = 1 + imt/n + o(1/n).
Whenever n a_n → z it follows that (1 + a_n)^n → e^z . Therefore,

lim_{n→∞} ψ_n(t) = exp[imt]

which is the characteristic function of the distribution degenerate at m.


Hence the distribution of S_n/n tends to the degenerate distribution at the point m. The weak law of large numbers is thereby established.
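The convergence [φ(t/n)]^n → e^{imt} used in Proof 2 can be observed numerically. In the sketch below (an illustration under the assumption that the summands are Exp(1), so φ(t) = 1/(1 − it) and m = 1) the distance to e^{it} shrinks as n grows.

```python
import cmath

phi = lambda t: 1 / (1 - 1j * t)    # characteristic function of Exp(1); mean m = 1

def psi(n, t):
    # characteristic function of S_n/n = (X_1 + ... + X_n)/n
    return phi(t / n) ** n

t = 2.0
target = cmath.exp(1j * t)          # exp[imt] with m = 1
errs = [abs(psi(n, t) - target) for n in (10, 100, 1000)]
```

The errors decrease roughly like 1/n, reflecting the o(1/n) remainder in the Taylor expansion.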
Exercise 3.5. If the underlying distribution is a Cauchy distribution with density 1/(π(1 + x^2)) and characteristic function φ(t) = e^{−|t|} , prove that the weak law does not hold.
law does not hold.
Exercise 3.6. The weak law may hold sometimes even if the mean does not
exist. If we dampen the tails of the Cauchy ever so slightly with a density f(x) = c/((1 + x^2) log(1 + x^2)), show that the weak law of large numbers holds.

Exercise 3.7. In the case of the Binomial distribution with p = 1/2, use Stirling's formula

n! ≃ √(2π) e^{−n} n^{n + 1/2}

to estimate the probability

∑_{r ≥ nx} \binom{n}{r} (1/2^n)

and show that it decays geometrically in n. Can you calculate the geometric ratio

ρ(x) = lim_{n→∞} [ ∑_{r ≥ nx} \binom{n}{r} (1/2^n) ]^{1/n}

explicitly as a function of x for x > 1/2?

3.3 Strong Limit Theorems


The weak law of large numbers is really a result concerning the behavior of
S_n/n = (X_1 + X_2 + · · · + X_n)/n
where X1 , X2 , · · · , Xn , . . . is a sequence of independent and identically dis-
tributed random variables on some probability space (Ω, B, P ). Under the
assumption that Xi are integrable with an integral equal to m, the weak law
asserts that as n → ∞, S_n/n → m in probability. Since almost everywhere convergence is generally stronger than convergence in probability, one may
ask if
 
P{ω : lim_{n→∞} S_n(ω)/n = m} = 1.

This is called the Strong Law of Large Numbers. Strong laws are statements
that hold for almost all ω.
Let us look at functions of the form fn = χAn . It is easy to verify that
fn → 0 in probability if and only if P (An ) → 0. On the other hand

Lemma 3.4. (Borel-Cantelli lemma). If


∑_n P(A_n) < ∞

then

P{ω : lim_{n→∞} χ_{A_n}(ω) = 0} = 1.

If the events An are mutually independent the converse is also true.

Remark 3.1. Note that the complementary event

{ω : lim sup_{n→∞} χ_{A_n}(ω) = 1}

is the same as ∩_{n=1}^{∞} ∪_{j=n}^{∞} A_j , or the event that infinitely many of the events {A_j} occur.

The conclusion of the next exercise will be used in the proof.



Exercise 3.8. Prove the following variant of the monotone convergence theorem. If f_n(ω) ≥ 0 are measurable functions, the set E = {ω : S(ω) = ∑_n f_n(ω) < ∞} is measurable and S(ω) is a measurable function on E. If each f_n is integrable and ∑_n E[f_n] < ∞, then P[E] = 1, S(ω) is integrable and E[S(ω)] = ∑_n E[f_n(ω)].
P P
Proof. By the previous exercise, if ∑_n P(A_n) < ∞, then ∑_n χ_{A_n}(ω) = S(ω) is finite almost everywhere and

E[S(ω)] = ∑_n P(A_n) < ∞.

If an infinite series has a finite sum then the n-th term must go to 0, thereby proving the direct part. To prove the converse we need to show that if ∑_n P(A_n) = ∞, then lim_{m→∞} P(∪_{n=m}^{∞} A_n) > 0. We can use independence and the continuity of probability under monotone limits, to calculate for every m,

P(∪_{n=m}^{∞} A_n) = 1 − P(∩_{n=m}^{∞} A_n^c)
 = 1 − ∏_{n=m}^{∞} (1 − P(A_n))    (by independence)
 ≥ 1 − e^{−∑_{n=m}^{∞} P(A_n)}
 = 1
and we are done. We have used the inequality 1 − x ≤ e−x familiar in the
study of infinite products.
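The products appearing in both directions of the lemma can be computed exactly, because for the natural test cases they telescope. In the sketch below (illustrative; the choices P(A_n) = 1/n and P(A_n) = 1/n^2 are assumptions, not from the text) the non-summable case drives ∏(1 − P(A_n)) to 0, so the union has full probability, while the summable case keeps the product bounded away from 0.

```python
import math

def tail_product(p, m, N):
    # P(no A_n occurs for m <= n <= N) = prod_{n=m}^{N} (1 - p(n)), by independence
    out = 1.0
    for n in range(m, N + 1):
        out *= 1 - p(n)
    return out

N = 10000
# Non-summable case p(n) = 1/n: the product telescopes to (m-1)/N, which -> 0.
prod_harm = tail_product(lambda n: 1 / n, 2, N)      # equals 1/N exactly
# Summable case p(n) = 1/n^2: the product telescopes to (N+1)/(2N) -> 1/2 > 0.
prod_sq = tail_product(lambda n: 1 / n ** 2, 2, N)
# The bound 1 - x <= exp(-x) used in the proof, applied factor by factor:
exp_bound = math.exp(-sum(1 / n for n in range(2, N + 1)))
```

Since each factor satisfies 1 − x ≤ e^{−x}, the product is dominated by exp of minus the partial sum, exactly as in the proof.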
Another digression that we want to make into measure theory at this point
is to discuss Kolmogorov’s consistency theorem. How do we know that there
are probability spaces that admit a sequence of independent identically dis-
tributed random variables with specified distributions? By the construction
of product measures that we outlined earlier we can construct a measure on
Rn for every n which is the joint distribution of the first n random variables.
Let us denote by Pn this probability measure on Rn . They are consistent
in the sense that if we project in the natural way from Rn+1 → Rn , Pn+1
projects to Pn . Such a family is called a consistent family of finite dimen-
sional distributions. We look at the space Ω = R∞ consisting of all real
sequences ω = {xn : n ≥ 1} with a natural σ-field Σ generated by the field
F of finite dimensional cylinder sets of the form B = {ω : (x1 , · · · , xn ) ∈ A}
where A varies over Borel sets in R^n and n varies over positive integers.

Theorem 3.5. (Kolmogorov’s Consistency Theorem). Given a con-


sistent family of finite dimensional distributions Pn , there exists a unique
P on (Ω, Σ) such that for every n, under the natural projection πn (ω) =
(x1 , · · · , xn ), the induced measure P πn−1 = Pn on Rn .

Proof. The consistency is just what is required to be able to define P on F


by
P (B) = Pn (A).
Once we have P defined on the field F , we have to prove the countable
additivity of P on F . The rest is then routine. Let Bn ∈ F and Bn ↓ Φ,
the empty set. If possible let P (Bn ) ≥ δ for all n and for some δ > 0. Then
B_n = π_{k_n}^{−1} A_{k_n} for some k_n , and without loss of generality we assume that k_n = n, so that B_n = π_n^{−1} A_n for some Borel set A_n ⊂ R^n . According to
Exercise 3.9 below, we can find a closed bounded subset K_n ⊂ A_n such that
P_n(A_n − K_n) ≤ δ/2^{n+1}
and define C_n = π_n^{−1} K_n and D_n = ∩_{j=1}^{n} C_j = π_n^{−1} F_n for some closed bounded set F_n ⊂ K_n ⊂ R^n . Then

P(D_n) ≥ δ − ∑_{j=1}^{n} δ/2^{j+1} ≥ δ/2.

D_n ⊂ B_n , D_n ↓ Φ and each D_n is nonempty. If we take ω^{(n)} = {x_j^{(n)} : j ≥ 1} to be an arbitrary point from D_n , by our construction (x_1^{(n)}, · · · , x_m^{(n)}) ∈ F_m for
n ≥ m. We can definitely choose a subsequence (diagonalization) such that
x_j^{(n_k)} converges for each j, producing a limit ω = (x_1, · · · , x_m, · · · ) and, for
every m, we will have (x1 , · · · , xm ) ∈ Fm . This implies that ω ∈ Dm for
every m, contradicting Dn ↓ Φ. We are done.
Exercise 3.9. We have used the fact that given any Borel set A ⊂ R^n , and a probability measure α on R^n , for any ε > 0, there exists a closed bounded subset K_ε ⊂ A such that α(A − K_ε) ≤ ε. Prove it by showing that the class of sets A with the above property is a monotone class that contains finite disjoint unions of measurable rectangles and therefore contains the Borel σ-field. To prove the last fact, establish it first for n = 1: repeat the same argument starting from finite disjoint unions of right-closed left-open intervals, using countable additivity to verify the property for such unions directly.

Remark 3.2. Kolmogorov’s consistency theorem remains valid if we replace


R by an arbitrary complete separable metric space X, with its Borel σ-field.
However it is not valid in complete generality. See [8]. See Remark 4.7 in
this context.
The following is a strong version of the Law of Large Numbers.
Theorem 3.6. If X1 , · · · , Xn · · · is a sequence of independent identically
distributed random variables with E|Xi |4 = C < ∞, then
lim_{n→∞} S_n/n = lim_{n→∞} (X_1 + · · · + X_n)/n = E(X_1)
with probability 1.
Proof. We can assume without loss of generality that E[X_i] = 0. Just take
Yi = Xi − E[Xi ]. A simple calculation shows
E[(Sn )4 ] = nE[(X1 )4 ] + 3n(n − 1)E[(X1 )2 ]2 ≤ nC + 3n2 σ 4
and by applying a Chebychev type inequality using fourth moments,
P[|S_n/n| ≥ δ] = P[|S_n| ≥ nδ] ≤ (nC + 3n^2 σ^4)/(n^4 δ^4).

We see that

∑_{n=1}^{∞} P[|S_n/n| ≥ δ] < ∞
and we can now apply the Borel-Cantelli Lemma.
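The moment identity E[S_n^4] = nE[X_1^4] + 3n(n − 1)E[X_1^2]^2 used in the proof can be checked exactly for the symmetric ±1 variables, where both moments equal 1 and S_n = 2k − n with k binomial. A small sketch (illustrative only):

```python
from math import comb

def fourth_moment(n):
    # E[S_n^4] from the exact distribution of S_n = 2k - n with k ~ Binomial(n, 1/2)
    return sum((2 * k - n) ** 4 * comb(n, k) for k in range(n + 1)) / 2 ** n

# The identity predicts n*E[X^4] + 3n(n-1)*E[X^2]^2 = n + 3n(n-1) here,
# since E[X^4] = E[X^2] = 1 for X = +-1.
m5, m12 = fourth_moment(5), fourth_moment(12)
```

For n = 5 and n = 12 the exact sums give 65 and 408, matching 3n^2 − 2n.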

3.4 Series of Independent Random variables


We wish to investigate conditions under which an infinite series with inde-
pendent summands
S = ∑_{j=1}^{∞} X_j

converges with probability 1. The basic steps are the following inequalities due to Kolmogorov and Lévy that control the behavior of sums of independent random variables. They both deal with the problem of estimating

T_n(ω) = sup_{1≤k≤n} |S_k(ω)| = sup_{1≤k≤n} |∑_{j=1}^{k} X_j(ω)|

where X1 , · · · , Xn are n independent random variables.


Lemma 3.7. (Kolmogorov's Inequality). Assume that E[X_i] = 0 and Var(X_i) = σ_i^2 < ∞, and let s_n^2 = ∑_{j=1}^{n} σ_j^2 . Then

P{T_n(ω) ≥ ℓ} ≤ s_n^2/ℓ^2.    (3.9)
Proof. The important point here is that the estimate depends only on s2n and
not on the number of summands. In fact the Chebychev bound on Sn is
P{|S_n| ≥ ℓ} ≤ s_n^2/ℓ^2
and the supremum does not cost anything.
Let us define the events E_k = {|S_1| < ℓ, · · · , |S_{k−1}| < ℓ, |S_k| ≥ ℓ}; then {T_n ≥ ℓ} = ∪_{k=1}^{n} E_k is a disjoint union. Using the independence of S_n − S_k and S_k χ_{E_k} , which depends only on X_1, · · · , X_k ,
P{E_k} ≤ (1/ℓ^2) ∫_{E_k} S_k^2 dP
 ≤ (1/ℓ^2) ∫_{E_k} [S_k^2 + (S_n − S_k)^2] dP
 = (1/ℓ^2) ∫_{E_k} [S_k^2 + 2 S_k (S_n − S_k) + (S_n − S_k)^2] dP
 = (1/ℓ^2) ∫_{E_k} S_n^2 dP.
Summing over k from 1 to n
P{T_n ≥ ℓ} ≤ (1/ℓ^2) ∫_{T_n ≥ ℓ} S_n^2 dP ≤ s_n^2/ℓ^2 ,

establishing (3.9).
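A quick simulation (illustrative; the ±1 walk, the parameters, and the fixed seed are assumptions) shows the maximal probability sitting below the bound s_n^2/ℓ^2 of (3.9), with no dependence on the number of summands entering the bound.

```python
import random

random.seed(1)
n, ell, reps = 100, 20, 2000
s2 = float(n)                   # s_n^2 = sum of Var(X_i); Var(X_i) = 1 for X_i = +-1
hits = 0
for _ in range(reps):
    s, top = 0, 0
    for _ in range(n):
        s += 1 if random.random() < 0.5 else -1
        top = max(top, abs(s))  # running value of T_n = max over k of |S_k|
    if top >= ell:
        hits += 1
freq = hits / reps              # empirical P{T_n >= ell}
bound = s2 / ell ** 2           # Kolmogorov's bound (3.9): 100/400 = 0.25
```

Here s_n = 10 and ℓ = 2 s_n; the empirical frequency is typically well under the bound 0.25, which — like Chebychev — is far from sharp.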
Lemma 3.8. (Lévy's Inequality). Assume that

P{|X_i + · · · + X_n| ≥ ℓ/2} ≤ δ

for all 1 ≤ i ≤ n. Then

P{T_n ≥ ℓ} ≤ δ/(1 − δ).    (3.10)

Proof. Let Ek be as in the previous lemma.


P[(T_n ≥ ℓ) ∩ (|S_n| ≤ ℓ/2)] = ∑_{k=1}^{n} P[E_k ∩ (|S_n| ≤ ℓ/2)]
 ≤ ∑_{k=1}^{n} P[E_k ∩ (|S_n − S_k| ≥ ℓ/2)]
 = ∑_{k=1}^{n} P[|S_n − S_k| ≥ ℓ/2] P(E_k)
 ≤ δ ∑_{k=1}^{n} P(E_k)
 = δ P{T_n ≥ ℓ}.

On the other hand,


   
P[(T_n ≥ ℓ) ∩ (|S_n| > ℓ/2)] ≤ P[|S_n| > ℓ/2] ≤ δ.

Adding the two,

P{T_n ≥ ℓ} ≤ δ P{T_n ≥ ℓ} + δ

or

P{T_n ≥ ℓ} ≤ δ/(1 − δ),

proving (3.10).

We are now ready to prove


Theorem 3.9. (Lévy's Theorem). If X_1 , X_2 , . . . , X_n , . . . is a sequence of
independent random variables, then the following are equivalent.

(i) The distribution αn of Sn = X1 + · · · + Xn converges weakly to a prob-


ability distribution α on R.

(ii) The random variable Sn = X1 + · · · + Xn converges in probability to a


limit S(ω).

(iii) The random variable Sn = X1 + · · · + Xn converges with probability 1


to a limit S(ω).

Proof. Clearly (iii) ⇒ (ii) ⇒ (i) are trivial. We will establish (i) ⇒ (ii) ⇒
(iii).
(i) ⇒ (ii). The characteristic functions φj (t) of Xj are such that

φ(t) = ∏_{j=1}^{∞} φ_j(t)

is a convergent infinite product. Since the limit φ(t) is continuous at t = 0


and φ(0) = 1 it is nonzero in some interval |t| ≤ T around 0. Therefore for
|t| ≤ T ,

lim_{m,n→∞} ∏_{j=m+1}^{n} φ_j(t) = 1.

By Exercise 3.10 below, this implies that for all t,

lim_{m,n→∞} ∏_{j=m+1}^{n} φ_j(t) = 1

and consequently, the distribution of Sn − Sm converges to the distribution


degenerate at 0. This implies the convergence in probability to 0 of Sn − Sm
as m, n → ∞. Therefore for each δ > 0,

lim_{m,n→∞} P{|S_n − S_m| ≥ δ} = 0

establishing (ii).
(ii) ⇒ (iii). To establish (iii), because of Exercise 3.11 below, we need only
show that for every δ > 0

lim_{m,n→∞} P{ sup_{m<k≤n} |S_k − S_m| ≥ δ } = 0

and this follows from (ii) and Lévy’s inequality.

Exercise 3.10. Prove the inequality 1 − cos 2t ≤ 4(1 − cos t) for all real t.
Deduce the inequality 1 − Real φ(2t) ≤ 4[1 − Real φ(t)], valid for any char-
acteristic function. Conclude that if a sequence of characteristic functions
converges to 1 in an interval around 0, then it converges to 1 for all real t.

Exercise 3.11. Prove that if a sequence Sn of random variables is a Cauchy


sequence in Probability, i.e. for each δ > 0,

lim_{m,n→∞} P{|S_n − S_m| ≥ δ} = 0

then there is a random variable S such that Sn → S in probability, i.e for


each δ > 0,
lim P {|Sn − S| ≥ δ} = 0.
n→∞

Exercise 3.12. Prove that if a sequence Sn of random variables satisfies



lim_{m,n→∞} P{ sup_{m<k≤n} |S_k − S_m| ≥ δ } = 0

for every δ > 0 then there is a limiting random variable S(ω) such that
 
P{ lim_{n→∞} S_n(ω) = S(ω) } = 1.

Exercise 3.13. Prove that whenever Xn → X in probability the distribution


αn of Xn converges weakly to the distribution α of X.

Now it is straightforward to find sufficient conditions for the convergence


of an infinite series of independent random variables.
Theorem 3.10. (Kolmogorov's one series theorem). Let a sequence {X_i} of independent random variables, each of which has finite mean and variance, satisfy E(X_i) = 0 and ∑_{i=1}^{∞} Var(X_i) < ∞. Then

S(ω) = ∑_{i=1}^{∞} X_i(ω)

converges with probability 1.


Proof. By a direct application of Kolmogorov’s inequality

lim_{m,n→∞} P{ sup_{m<k≤n} |S_k − S_m| ≥ δ } ≤ lim_{m,n→∞} (1/δ^2) ∑_{j=m+1}^{n} E(X_j^2)
 = lim_{m,n→∞} (1/δ^2) ∑_{j=m+1}^{n} Var(X_j) = 0.

Therefore

lim_{m,n→∞} P{ sup_{m<k≤n} |S_k − S_m| ≥ δ } = 0.

We can also prove convergence in probability

lim_{m,n→∞} P{|S_n − S_m| ≥ δ} = 0

by a simple application of Chebychev’s inequality and then apply Lévy’s


Theorem to get almost sure convergence.

Theorem 3.11. (Kolmogorov's two series theorem). Let a_i = E[X_i] be the means and σ_i^2 = Var(X_i) the variances of a sequence of independent random variables {X_i}. Assume that ∑_i a_i and ∑_i σ_i^2 converge. Then the series ∑_i X_i converges with probability 1.

Proof. Define Yi = Xi − ai and apply the previous (one series) theorem to


Yi .

Of course in general random variables need not have finite expectations


or variances. If {Xi } is any sequence of random variables we can take a cut
off value C and define Yi = Xi if |Xi | ≤ C, and Yi = 0 otherwise. The Yi are
then independent and bounded in absolute value by C. The theorem can be
applied to Yi and if we impose the additional condition that
∑_i P{X_i ≠ Y_i} = ∑_i P{|X_i| > C} < ∞

then, by an application of the Borel-Cantelli Lemma, with probability 1, X_i = Y_i for all sufficiently large i. The convergence of ∑_i X_i and ∑_i Y_i are therefore equivalent. We then get the sufficiency part of

Theorem 3.12. (Kolmogorov's three series theorem). For the convergence of an infinite series of independent random variables ∑_i X_i it is necessary and sufficient that all three of the following infinite series converge.

(i) For some cutoff value C > 0, ∑_i P{|X_i| > C} converges.

(ii) If Y_i is defined to equal X_i if |X_i| ≤ C, and 0 otherwise, ∑_i E(Y_i) converges.

(iii) With Y_i as in (ii), ∑_i Var(Y_i) converges.
P
Proof. Let us now prove the converse. If ∑_i X_i converges for a sequence of independent random variables, we must necessarily have |X_n| ≤ C eventually with probability 1. By the Borel-Cantelli Lemma the first series must converge. This means that in order to prove the necessity we can assume without loss of generality that the |X_i| are all bounded, say by 1. We may also assume that E(X_i) = 0 for each i. Otherwise let us take independent random variables X_i′ that have the same distribution as X_i . Then ∑_i X_i as well as ∑_i X_i′ converge with probability 1 and therefore so does ∑_i (X_i − X_i′). The random variables Z_i = X_i − X_i′ are independent and bounded by 2. They have mean 0. If we can show ∑_i Var(Z_i) is convergent, then since Var(Z_i) = 2 Var(X_i) we would have proved the convergence of the third series. Now it is elementary to conclude that since both ∑_i X_i as well as ∑_i (X_i − E(X_i)) converge, the series ∑_i E(X_i) must be convergent as well. So all we need is the following lemma to complete the proof of necessity.
P
Lemma 3.13. If ∑_i X_i is convergent for a series of independent random variables with mean 0 that are individually bounded by C, then ∑_i Var(X_i) is convergent.

Proof. Let F_n = {ω : |S_1| ≤ ℓ, |S_2| ≤ ℓ, · · · , |S_n| ≤ ℓ} where S_k = X_1 + · · · + X_k . If the series converges with probability 1, we must have, for some ℓ and δ > 0, P(F_n) ≥ δ for all n. We have
∫_{F_{n−1}} S_n^2 dP = ∫_{F_{n−1}} [S_{n−1} + X_n]^2 dP
 = ∫_{F_{n−1}} [S_{n−1}^2 + 2 S_{n−1} X_n + X_n^2] dP
 = ∫_{F_{n−1}} S_{n−1}^2 dP + σ_n^2 P(F_{n−1})
 ≥ ∫_{F_{n−1}} S_{n−1}^2 dP + δ σ_n^2

and on the other hand,


∫_{F_{n−1}} S_n^2 dP = ∫_{F_n} S_n^2 dP + ∫_{F_{n−1} ∩ F_n^c} S_n^2 dP
 ≤ ∫_{F_n} S_n^2 dP + P(F_{n−1} ∩ F_n^c) (ℓ + C)^2

providing us with the estimate


δ σ_n^2 ≤ ∫_{F_n} S_n^2 dP − ∫_{F_{n−1}} S_{n−1}^2 dP + P(F_{n−1} ∩ F_n^c) (ℓ + C)^2 .

Since the sets F_{n−1} ∩ F_n^c are disjoint and |S_n| ≤ ℓ on F_n ,

∑_{j=1}^{∞} σ_j^2 ≤ (1/δ) [ℓ^2 + (ℓ + C)^2].

This concludes the proof.

3.5 Strong Law of Large Numbers


We saw earlier in Theorem 3.6 that if {X_i} is a sequence of i.i.d. (independent identically distributed) random variables with zero mean and a finite fourth moment, then (X_1 + · · · + X_n)/n → 0 with probability 1. We will now prove the same result assuming only that E|X_i| < ∞ and E(X_i) = 0.

Theorem 3.14. If {Xi } is a sequence of i.i.d random variables with mean


0,
lim_{n→∞} (X_1 + · · · + X_n)/n = 0
with probability 1.

Proof. We define

Y_n = X_n if |X_n| ≤ n, and Y_n = 0 if |X_n| > n,

and set a_n = P[X_n ≠ Y_n], b_n = E[Y_n], and c_n = Var(Y_n).


First we note that (see Exercise 3.14 below)

∑_n a_n = ∑_n P[|X_1| > n] ≤ E|X_1| < ∞,

lim_{n→∞} b_n = 0

and
∑_n c_n/n^2 ≤ ∑_n E[Y_n^2]/n^2 = ∑_n (1/n^2) ∫_{|x|≤n} x^2 dα = ∫ x^2 [ ∑_{n ≥ |x|} 1/n^2 ] dα ≤ C ∫ |x| dα < ∞

where α is the common distribution of the X_i . From the three series theorem and the Borel-Cantelli Lemma, we conclude that ∑_n (Y_n − b_n)/n as well as ∑_n (X_n − b_n)/n converge almost surely. It is elementary to verify that for any series ∑_n x_n/n that converges, (x_1 + · · · + x_n)/n → 0 as n → ∞. We therefore conclude that

P{ lim_{n→∞} [ (X_1 + · · · + X_n)/n − (b_1 + · · · + b_n)/n ] = 0 } = 1.

Since b_n → 0 as n → ∞, the theorem is proved.
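The "elementary" step used above is Kronecker's lemma: if ∑ x_n/n converges then (x_1 + · · · + x_n)/n → 0. A numerical sketch (illustrative; the choice x_n = cos n is an assumption — its partial sums stay bounded, so ∑ cos(n)/n converges by Dirichlet's test):

```python
import math

N = 100000
series, s, avgs = 0.0, 0.0, {}
for n in range(1, N + 1):
    x = math.cos(n)
    series += x / n            # partial sums of sum cos(n)/n, a convergent series
    s += x                     # partial sums of cos(n) remain bounded
    if n in (1000, 10000, 100000):
        avgs[n] = abs(s / n)   # Kronecker's lemma: these averages must tend to 0
```

Since |∑_{k≤n} cos k| is bounded by 1/sin(1/2) ≈ 2.09, the averages decay like 1/n.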
Exercise 3.14. Let X be a nonnegative random variable. Then

E[X] − 1 ≤ ∑_{n=1}^{∞} P[X ≥ n] ≤ E[X].

In particular E[X] < ∞ if and only if ∑_n P[X ≥ n] < ∞.
Exercise 3.15. If for a sequence of i.i.d. random variables X1 , · · · , Xn , · · · ,
the strong law of large numbers holds with some limit, i.e.
P[ lim_{n→∞} S_n/n = ξ ] = 1

for some random variable ξ, which may or may not be a constant with probability 1, then show that necessarily E|X_i| < ∞. Consequently ξ = E(X_i) with probability 1.
One may ask why the limit cannot be a proper random variable. There
is a general theorem that forbids it called Kolmogorov’s Zero-One law. Let
us look at the space Ω of real sequences {xn : n ≥ 1}. We have the σ-field B,
the product σ-field on Ω. In addition we have the sub σ-fields Bn generated
by {xj : j ≥ n}. Bn are ↓ with n and B∞ = ∩n Bn which is also a σ-field is
called the tail σ-field. The typical set in B∞ is a set depending only on the
tail behavior of the sequence. For example the sets {ω : xn is bounded },
{ω : lim supn xn = 1} are in B∞ whereas {ω : supn |xn | = 1} is not.

Theorem 3.15. (Kolmogorov's Zero-One Law). If A ∈ B_∞ and P is any product measure (not necessarily with identical components), then P(A) = 0 or 1.

Proof. The proof depends on showing that A is independent of itself, so that P(A) = P(A ∩ A) = P(A)P(A) = [P(A)]^2 , which equals 0 or 1. Since A ∈ B_∞ ⊂ B_{n+1} and P is a product measure, A is independent of the σ-field F_n generated by {x_j : 1 ≤ j ≤ n}. It is therefore independent of sets in the field F = ∪_n F_n . The class of sets that are independent of A is a monotone class. Since it contains the field F, it contains the σ-field B generated by F. In particular, since A ∈ B, A is independent of itself.
Corollary 3.16. Any random variable measurable with respect to the tail σ-field B_∞ is equal with probability 1 to a constant relative to any given product measure.
Proof. Left as an exercise.

Warning. For different product measures the constants can be different.


Exercise 3.16. How can that happen?

3.6 Central Limit Theorem.


We saw before that for any sequence of independent identically distributed
random variables X1 , · · · , Xn , · · · the sum Sn = X1 + · · · + Xn has the prop-
erty that
lim_{n→∞} S_n/n = 0
in probability provided the expectation exists and equals 0. If we assume
that the Variance of the random variables is finite and equals σ 2 > 0, then
we have

Theorem 3.17. The distribution of S_n/√n converges as n → ∞ to the normal distribution with density

p(x) = (1/(√(2π) σ)) exp[−x^2/(2σ^2)].    (3.11)

Proof. If we denote by φ(t) the characteristic function of any X_i then the characteristic function of S_n/√n is given by

ψ_n(t) = [φ(t/√n)]^n .

We can use the expansion

φ(t) = 1 − σ^2 t^2/2 + o(t^2)

to conclude that

φ(t/√n) = 1 − σ^2 t^2/(2n) + o(1/n)

and it then follows that

lim_{n→∞} ψ_n(t) = ψ(t) = exp[−σ^2 t^2/2].
Since ψ(t) is the characteristic function of the normal distribution with den-
sity p(x) given by equation (3.11), we are done.
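The convergence of characteristic functions used in this proof can be checked numerically for a concrete distribution. The following sketch (an illustration, not part of the text; the uniform distribution on [−1, 1], with σ² = 1/3 and φ(t) = sin t / t, is an arbitrary choice of example) compares ψ_n(t) = [φ(t/√n)]^n with the limit exp[−σ²t²/2]:

```python
import math

def phi(t):
    # characteristic function of the uniform distribution on [-1, 1]
    return math.sin(t) / t if t != 0 else 1.0

def psi_n(t, n):
    # characteristic function of S_n / sqrt(n)
    return phi(t / math.sqrt(n)) ** n

sigma2 = 1.0 / 3.0
for t in [0.5, 1.0, 2.0]:
    print(t, psi_n(t, 10**6), math.exp(-sigma2 * t * t / 2))
```

For n = 10^6 the two columns agree to several decimal places, as the expansion in the proof predicts.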

Exercise 3.17. A more direct proof is possible in some special cases. For
instance, if each Xi = ±1 with probability 1/2, S_n can take the values n − 2k
with 0 ≤ k ≤ n,

    P[S_n = 2k − n] = (1/2^n) (n choose k)

and

    P[a ≤ S_n/√n ≤ b] = (1/2^n) Σ_{k : a√n ≤ 2k−n ≤ b√n} (n choose k).

Use Stirling's formula to prove directly that

    lim_{n→∞} P[a ≤ S_n/√n ≤ b] = (1/√(2π)) ∫_a^b exp[−x²/2] dx.

Actually, for the proof of the central limit theorem we do not need the
random variables {Xj} to have identical distributions. Let us suppose that
they all have zero means and that the variance of Xj is σ_j². Define
s_n² = σ1² + · · · + σ_n². Assume s_n² → ∞ as n → ∞. Then Y_n = S_n/s_n has
zero mean and unit variance. It is not unreasonable to expect that

    lim_{n→∞} P[Y_n ≤ a] = (1/√(2π)) ∫_{−∞}^a exp[−x²/2] dx

under certain mild conditions.

Theorem 3.18. (Lindeberg's theorem). If we denote by α_i the distribution
of X_i, the condition (known as Lindeberg's condition)

    lim_{n→∞} (1/s_n²) Σ_{i=1}^{n} ∫_{|x| ≥ ε s_n} x² dα_i = 0

for each ε > 0 is sufficient for the central limit theorem to hold.

Proof. The first step in proving this limit theorem, as well as other limit
theorems that we will prove, is to rewrite

    Y_n = X_{n,1} + X_{n,2} + · · · + X_{n,k_n} + A_n

where the X_{n,j} are k_n mutually independent random variables and A_n is a
constant. In our case k_n = n, A_n = 0, and X_{n,j} = X_j/s_n for 1 ≤ j ≤ n.
We denote by

    φ_{n,j}(t) = E[e^{itX_{n,j}}] = ∫ e^{itx} dα_{n,j} = ∫ e^{itx/s_n} dα_j = φ_j(t/s_n)

where α_{n,j} is the distribution of X_{n,j}. The functions φ_j and φ_{n,j}
are the characteristic functions of α_j and α_{n,j} respectively. If we denote
by µ_n the distribution of Y_n, its characteristic function µ̂_n(t) is given by

    µ̂_n(t) = Π_{j=1}^{n} φ_{n,j}(t)

and our goal is to show that

    lim_{n→∞} µ̂_n(t) = exp[−t²/2].

This will be carried out in several steps. First, we define

    ψ_{n,j}(t) = exp[φ_{n,j}(t) − 1]

and

    ψ_n(t) = Π_{j=1}^{n} ψ_{n,j}(t).

We show that for each finite T,

    lim_{n→∞} sup_{|t| ≤ T} sup_{1 ≤ j ≤ n} |φ_{n,j}(t) − 1| = 0

and

    sup_n sup_{|t| ≤ T} Σ_{j=1}^{n} |φ_{n,j}(t) − 1| < ∞.

This would imply that

    lim_{n→∞} sup_{|t| ≤ T} |log µ̂_n(t) − log ψ_n(t)|
      ≤ lim_{n→∞} sup_{|t| ≤ T} Σ_{j=1}^{n} |log φ_{n,j}(t) − [φ_{n,j}(t) − 1]|
      ≤ lim_{n→∞} sup_{|t| ≤ T} C Σ_{j=1}^{n} |φ_{n,j}(t) − 1|²
      ≤ C lim_{n→∞} [sup_{|t| ≤ T} sup_{1 ≤ j ≤ n} |φ_{n,j}(t) − 1|] [sup_{|t| ≤ T} Σ_{j=1}^{n} |φ_{n,j}(t) − 1|]
      = 0

by the expansion

    log r = log(1 + (r − 1)) = r − 1 + O(|r − 1|²).

The proof can then be completed by showing

    lim_{n→∞} sup_{|t| ≤ T} |log ψ_n(t) + t²/2| = lim_{n→∞} sup_{|t| ≤ T} |Σ_{j=1}^{n} (φ_{n,j}(t) − 1) + t²/2| = 0.

We see that

    sup_{|t| ≤ T} |φ_{n,j}(t) − 1| = sup_{|t| ≤ T} |∫ (exp[itx] − 1) dα_{n,j}|
      = sup_{|t| ≤ T} |∫ (exp[itx/s_n] − 1) dα_j|
      = sup_{|t| ≤ T} |∫ (exp[itx/s_n] − 1 − itx/s_n) dα_j|    (3.12)
      ≤ C_T ∫ (x²/s_n²) dα_j    (3.13)
      = C_T ∫_{|x| < ε s_n} (x²/s_n²) dα_j + C_T ∫_{|x| ≥ ε s_n} (x²/s_n²) dα_j
      ≤ C_T ε² + C_T (1/s_n²) ∫_{|x| ≥ ε s_n} x² dα_j.    (3.14)

We have used the mean zero condition in deriving equation (3.12) and the
estimate |e^{ix} − 1 − ix| ≤ cx² to get to equation (3.13). If we let
n → ∞, by Lindeberg's condition the second term of equation (3.14) goes
to 0. Therefore

    lim sup_{n→∞} sup_{1 ≤ j ≤ k_n} sup_{|t| ≤ T} |φ_{n,j}(t) − 1| ≤ C_T ε².

Since ε > 0 is arbitrary, we have

    lim_{n→∞} sup_{1 ≤ j ≤ k_n} sup_{|t| ≤ T} |φ_{n,j}(t) − 1| = 0.

Next we observe that there is a bound, uniform in n:

    sup_{|t| ≤ T} Σ_{j=1}^{n} |φ_{n,j}(t) − 1| ≤ C_T Σ_{j=1}^{n} ∫ (x²/s_n²) dα_j = C_T (1/s_n²) Σ_{j=1}^{n} σ_j² = C_T.

Finally, for each ε > 0,

uniformly in n. Finally for each  > 0,


    lim_{n→∞} sup_{|t| ≤ T} |Σ_{j=1}^{n} (φ_{n,j}(t) − 1) + t²/2|
      ≤ lim_{n→∞} sup_{|t| ≤ T} Σ_{j=1}^{n} |φ_{n,j}(t) − 1 + σ_j²t²/(2s_n²)|
      = lim_{n→∞} sup_{|t| ≤ T} Σ_{j=1}^{n} |∫ (exp[itx/s_n] − 1 − itx/s_n + t²x²/(2s_n²)) dα_j|
      ≤ lim_{n→∞} sup_{|t| ≤ T} Σ_{j=1}^{n} ∫_{|x| < ε s_n} |exp[itx/s_n] − 1 − itx/s_n + t²x²/(2s_n²)| dα_j
        + lim_{n→∞} sup_{|t| ≤ T} Σ_{j=1}^{n} ∫_{|x| ≥ ε s_n} |exp[itx/s_n] − 1 − itx/s_n + t²x²/(2s_n²)| dα_j
      ≤ lim_{n→∞} C_T Σ_{j=1}^{n} ∫_{|x| < ε s_n} (|x|³/s_n³) dα_j
        + lim_{n→∞} C_T Σ_{j=1}^{n} ∫_{|x| ≥ ε s_n} (x²/s_n²) dα_j
      ≤ C_T ε lim sup_{n→∞} Σ_{j=1}^{n} ∫ (x²/s_n²) dα_j
        + lim_{n→∞} C_T Σ_{j=1}^{n} ∫_{|x| ≥ ε s_n} (x²/s_n²) dα_j
      = C_T ε

by Lindeberg's condition. Since ε > 0 is arbitrary, the result is proved.

Remark 3.3. The key step in the proof of the central limit theorem under
Lindeberg's condition, as well as in other limit theorems for sums of inde-
pendent random variables, is the analysis of products

    ψ_n(t) = Π_{j=1}^{k_n} φ_{n,j}(t).

The idea is to replace each φ_{n,j}(t) by exp[φ_{n,j}(t) − 1], changing the
product into the exponential of a sum. Although each φ_{n,j}(t) is close to 1,
making the idea reasonable, in order for the idea to work one has to show that
the sum Σ_{j=1}^{k_n} |φ_{n,j}(t) − 1|² is negligible. This requires the
boundedness of Σ_{j=1}^{k_n} |φ_{n,j}(t) − 1|. One has to use the mean 0
condition, or some suitable centering condition, to cancel the first term in
the expansion of φ_{n,j}(t) − 1 and control the rest by sums of the variances.
Exercise 3.18. Lyapunov's condition is the following: for some δ > 0,

    lim_{n→∞} (1/s_n^{2+δ}) Σ_{j=1}^{n} ∫ |x|^{2+δ} dα_j = 0.

Prove that Lyapunov's condition implies Lindeberg's condition.

Exercise 3.19. Consider the case of mutually independent random variables
{Xj}, where Xj = ±a_j with probability 1/2 each. What do Lyapunov's and
Lindeberg's conditions demand of {a_j}? Can you find a sequence {a_j} that
does not satisfy Lyapunov's condition for any δ > 0 but satisfies Lindeberg's
condition? Try to find a sequence {a_j} such that the central limit theorem
is not valid.
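For this family of examples the Lindeberg sum has a closed form: s_n² = Σ_{j ≤ n} a_j², and for each ε the sum reduces to (1/s_n²) Σ_{j : a_j ≥ ε s_n} a_j². A sketch (an illustration, not a solution to the exercise; the two sequences a_j = j and a_j = 2^j are hypothetical test cases) evaluates this ratio: for polynomially growing a_j it eventually vanishes, while for geometrically growing a_j the last few terms alone keep it near 1.

```python
import math

def lindeberg_ratio(a, eps):
    # (1/s_n^2) * sum of a_j^2 over j with a_j >= eps * s_n,
    # for X_j = +-a_j with probability 1/2 each
    s2 = sum(x * x for x in a)
    sn = math.sqrt(s2)
    return sum(x * x for x in a if x >= eps * sn) / s2

poly = [float(j) for j in range(1, 100001)]   # a_j = j,  n = 10^5
geom = [2.0 ** j for j in range(1, 51)]       # a_j = 2^j, n = 50
r_poly = lindeberg_ratio(poly, 0.1)
r_geom = lindeberg_ratio(geom, 0.1)
print(r_poly, r_geom)
```

For a_j = j no term reaches the threshold ε s_n once n is large, so the ratio is 0; for a_j = 2^j the top term a_n² already accounts for roughly 3/4 of s_n², so the ratio stays close to 1 and the central limit theorem fails.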

3.7 Accompanying Laws.

As we stated in the previous section, we want to study the behavior of the
sum of a large number of independent random variables. We have k_n independent
random variables {X_{n,j} : 1 ≤ j ≤ k_n} with respective distributions
{α_{n,j}}. We are interested in the distribution µ_n of
Z_n = Σ_{j=1}^{k_n} X_{n,j}. One important assumption that we will make on the
random variables {X_{n,j}} is that no single one is significant. More
precisely, for every δ > 0,

    lim_{n→∞} sup_{1 ≤ j ≤ k_n} P[|X_{n,j}| ≥ δ] = lim_{n→∞} sup_{1 ≤ j ≤ k_n} α_{n,j}[|x| ≥ δ] = 0.    (3.15)

The condition is referred to as uniform infinitesimality. The following
construction will play a major role. If α is a probability distribution on the
line and φ(t) is its characteristic function, then for any nonnegative real
number a ≥ 0, ψ_a(t) = exp[a(φ(t) − 1)] is again a characteristic function. In
fact, if we denote by α^{∗j} the j-fold convolution of α with itself, ψ_a is
seen to be the characteristic function of the probability distribution

    e^{−a} Σ_{j=0}^{∞} (a^j/j!) α^{∗j},

which is a convex combination of the α^{∗j} with weights e^{−a} a^j/j!. We use
the construction mostly with a = 1. If we denote the probability distribution
with characteristic function ψ_a(t) by e_a(α), one checks easily that
e_{a+b}(α) = e_a(α) ∗ e_b(α). In particular e_a(α) = [e_{a/n}(α)]^{∗n}.
Probability distributions β that can be written, for each n ≥ 1, as the n-fold
convolution β_n^{∗n} of some probability distribution β_n are called
infinitely divisible. In particular, for every a ≥ 0 and α, e_a(α) is an
infinitely divisible probability distribution. These are called compound
Poisson distributions. In the special case α = δ_1, the degenerate
distribution at 1, we get for e_a(δ_1) the usual Poisson distribution with
parameter a. We can interpret e_a(α) as the distribution of the sum of a
random number N of independent random variables with common distribution α,
where N has a Poisson distribution with parameter a and is independent of the
random variables involved in the sum.
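The identity between exp[a(φ(t) − 1)] and the characteristic function of the convex combination e^{−a} Σ (a^j/j!) α^{∗j} can be verified numerically, since the characteristic function of α^{∗j} is φ(t)^j. A minimal sketch (the Poisson-weighted series is truncated at j = 60, and α = δ_1, so φ(t) = e^{it}, is a hypothetical choice of example):

```python
import cmath, math

a = 2.0
phi = lambda t: cmath.exp(1j * t)   # characteristic function of alpha = delta_1

def psi_closed(t):
    # exp[a (phi(t) - 1)], the compound Poisson characteristic function
    return cmath.exp(a * (phi(t) - 1))

def psi_series(t, terms=60):
    # sum_j e^{-a} a^j / j! * phi(t)^j : characteristic function of the
    # Poisson(a)-weighted mixture of the convolutions alpha^{*j}
    return sum(math.exp(-a) * a ** j / math.factorial(j) * phi(t) ** j
               for j in range(terms))

for t in [0.3, 1.0, 2.5]:
    print(t, abs(psi_closed(t) - psi_series(t)))
```

With a = 2 the truncated series agrees with the closed form to machine precision, since the tail weights e^{−a} a^j/j! beyond j = 60 are negligible.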
In order to study the distribution µ_n of Z_n it will be more convenient
to replace α_{n,j} by an infinitely divisible distribution β_{n,j}. This is
done as follows. We define

    a_{n,j} = ∫_{|x| ≤ 1} x dα_{n,j},

α′_{n,j} as the translate of α_{n,j} by −a_{n,j}, i.e.

    α′_{n,j} = α_{n,j} ∗ δ_{−a_{n,j}},
    β′_{n,j} = e_1(α′_{n,j}),
    β_{n,j} = β′_{n,j} ∗ δ_{a_{n,j}},

and finally

    λ_n = β_{n,1} ∗ β_{n,2} ∗ · · · ∗ β_{n,k_n}.

A main tool in this subject is the following theorem. We assume always
that the uniform infinitesimality condition (3.15) holds. In terms of notation,
we will find it more convenient to denote by µ̂ the characteristic function of
the probability distribution µ.

Theorem 3.19. (Accompanying Laws.) In order that, for some constants A_n,
the distribution µ_n ∗ δ_{A_n} of Z_n + A_n may converge to a limit µ, it is
necessary and sufficient that, for the same constants A_n, the distribution
λ_n ∗ δ_{A_n} converges to the same limit µ.

Proof. First we note that, for any δ > 0,

    lim_{n→∞} sup_{1 ≤ j ≤ k_n} |a_{n,j}| = lim_{n→∞} sup_{1 ≤ j ≤ k_n} |∫_{|x| ≤ 1} x dα_{n,j}|
      ≤ lim_{n→∞} sup_{1 ≤ j ≤ k_n} |∫_{|x| ≤ δ} x dα_{n,j}| + lim_{n→∞} sup_{1 ≤ j ≤ k_n} |∫_{δ < |x| ≤ 1} x dα_{n,j}|
      ≤ δ + lim_{n→∞} sup_{1 ≤ j ≤ k_n} α_{n,j}[|x| ≥ δ]
      = δ.

Therefore

    lim_{n→∞} sup_{1 ≤ j ≤ k_n} |a_{n,j}| = 0.

This means that the α′_{n,j} are uniformly infinitesimal just as the α_{n,j}
were. Let us suppose that n is so large that sup_{1 ≤ j ≤ k_n} |a_{n,j}| ≤ 1/4.
The advantage in going from α_{n,j} to α′_{n,j} is that the latter are better
centered, and we can calculate

    a′_{n,j} = ∫_{|x| ≤ 1} x dα′_{n,j}
             = ∫_{|x − a_{n,j}| ≤ 1} (x − a_{n,j}) dα_{n,j}
             = ∫_{|x − a_{n,j}| ≤ 1} x dα_{n,j} − a_{n,j} α_{n,j}[|x − a_{n,j}| ≤ 1]
             = ∫_{|x − a_{n,j}| ≤ 1} x dα_{n,j} − a_{n,j} + a_{n,j} α_{n,j}[|x − a_{n,j}| > 1]

and estimate |a′_{n,j}| by

    |a′_{n,j}| ≤ C α′_{n,j}[|x| ≥ 3/4] ≤ C α_{n,j}[|x| ≥ 1/2].

In other words, we may assume without loss of generality that the α_{n,j}
satisfy the bound

    |a_{n,j}| ≤ C α_{n,j}[|x| ≥ 1/2]    (3.16)

and forget all about the change from α_{n,j} to α′_{n,j}. We will drop the
primes and stay with α_{n,j}. Then, just as in the proof of the Lindeberg
theorem, we proceed to estimate

    lim_{n→∞} sup_{|t| ≤ T} |log λ̂_n(t) − log µ̂_n(t)|
      ≤ lim_{n→∞} sup_{|t| ≤ T} |Σ_{j=1}^{k_n} [log α̂_{n,j}(t) − (α̂_{n,j}(t) − 1)]|
      ≤ lim_{n→∞} sup_{|t| ≤ T} Σ_{j=1}^{k_n} |log α̂_{n,j}(t) − (α̂_{n,j}(t) − 1)|
      ≤ lim_{n→∞} sup_{|t| ≤ T} C Σ_{j=1}^{k_n} |α̂_{n,j}(t) − 1|²
      = 0,

provided we prove that, if either λ_n or µ_n has a limit after translation by
some constants A_n, then

    sup_n sup_{|t| ≤ T} Σ_{j=1}^{k_n} |α̂_{n,j}(t) − 1| ≤ C < ∞.    (3.17)

Let us first suppose that λ_n has a weak limit as n → ∞ after translation
by A_n. The characteristic functions

    exp[Σ_{j=1}^{k_n} (α̂_{n,j}(t) − 1) + itA_n] = exp[f_n(t)]

have a limit, which is again a characteristic function. Since the limiting
characteristic function is continuous and equals 1 at t = 0, and the
convergence is uniform near 0, on some small interval |t| ≤ T_0 we have the
bound

    sup_n sup_{|t| ≤ T_0} (−Re f_n(t)) ≤ C,

or equivalently

    sup_n sup_{|t| ≤ T_0} Σ_{j=1}^{k_n} ∫ (1 − cos tx) dα_{n,j} ≤ C,

and from the subadditivity property (1 − cos 2tx) ≤ 4(1 − cos tx) this bound
extends to an arbitrary interval |t| ≤ T:

    sup_n sup_{|t| ≤ T} Σ_{j=1}^{k_n} ∫ (1 − cos tx) dα_{n,j} ≤ C_T.

If we integrate the inequality with respect to t over the interval [−T, T] and
divide by 2T, we get

    sup_n Σ_{j=1}^{k_n} ∫ (1 − sin Tx/(Tx)) dα_{n,j} ≤ C_T,

from which we can conclude that

    sup_n Σ_{j=1}^{k_n} α_{n,j}[|x| ≥ δ] ≤ C_δ < ∞

for every δ > 0, by choosing T = 2/δ. Moreover, using the inequality
(1 − cos x) ≥ cx², valid near 0 for a suitable choice of c, we get the
estimate

    sup_n Σ_{j=1}^{k_n} ∫_{|x| ≤ 1} x² dα_{n,j} ≤ C < ∞.

Now it is straightforward to estimate, for t ∈ [−T, T],

    |α̂_{n,j}(t) − 1| = |∫ (exp(itx) − 1) dα_{n,j}|
      = |∫_{|x| ≤ 1} (exp(itx) − 1) dα_{n,j} + ∫_{|x| > 1} (exp(itx) − 1) dα_{n,j}|
      ≤ |∫_{|x| ≤ 1} (exp(itx) − 1 − itx) dα_{n,j}| + |∫_{|x| > 1} (exp(itx) − 1) dα_{n,j}| + T|a_{n,j}|
      ≤ C_1 ∫_{|x| ≤ 1} x² dα_{n,j} + C_2 α_{n,j}[x : |x| ≥ 1/2],

which proves the bound of equation (3.17).

Now we need to establish the same bound under the assumption that µ_n has a
limit after suitable translations. For any probability measure α we define ᾱ
by ᾱ(A) = α(−A) for all Borel sets A. The distribution α ∗ ᾱ is denoted by
|α|². The characteristic functions of ᾱ and |α|² are respectively the complex
conjugate of α̂(t) and |α̂(t)|², where α̂(t) is the characteristic function of
α. An elementary but important fact is |α ∗ δ_a|² = |α|² for any translate by
a. If µ_n has a limit, so does |µ_n|². We conclude that the limit

    lim_{n→∞} |µ̂_n(t)|² = lim_{n→∞} Π_{j=1}^{k_n} |α̂_{n,j}(t)|²

exists and defines a characteristic function which is continuous at 0 with
value 1. Moreover, because of uniform infinitesimality,

    lim_{n→∞} inf_{|t| ≤ T} inf_{1 ≤ j ≤ k_n} |α̂_{n,j}(t)| = 1.

It is easy to conclude that there is a T_0 > 0 such that, for |t| ≤ T_0,

    sup_n sup_{|t| ≤ T_0} Σ_{j=1}^{k_n} [1 − |α̂_{n,j}(t)|²] ≤ C_0 < ∞

and, by subadditivity, for any finite T,

    sup_n sup_{|t| ≤ T} Σ_{j=1}^{k_n} [1 − |α̂_{n,j}(t)|²] ≤ C_T < ∞,

providing us with the estimates

    sup_n Σ_{j=1}^{k_n} |α_{n,j}|²[|x| ≥ δ] ≤ C_δ < ∞    (3.18)

for any δ > 0, and

    sup_n Σ_{j=1}^{k_n} ∫∫_{|x−y| ≤ 2} (x − y)² dα_{n,j}(x) dα_{n,j}(y) ≤ C < ∞.    (3.19)

We now show that estimates (3.18) and (3.19) imply (3.17). First,

    |α_{n,j}|²[x : |x| ≥ δ/2] ≥ ∫_{|y| ≤ δ/2} α_{n,j}[x : |x − y| ≥ δ/2] dα_{n,j}(y)
      ≥ α_{n,j}[x : |x| ≥ δ] α_{n,j}[x : |x| ≤ δ/2]
      ≥ (1/2) α_{n,j}[x : |x| ≥ δ]

by uniform infinitesimality. Therefore (3.18) implies that, for every δ > 0,

    sup_n Σ_{j=1}^{k_n} α_{n,j}[x : |x| ≥ δ] ≤ C_δ < ∞.    (3.20)

We now turn to exploiting (3.19). We start with the inequality

    ∫∫_{|x−y| ≤ 2} (x − y)² dα_{n,j}(x) dα_{n,j}(y)
      ≥ α_{n,j}[y : |y| ≤ 1] · inf_{|y| ≤ 1} ∫_{|x| ≤ 1} (x − y)² dα_{n,j}(x).

The first factor on the right can be assumed to be at least 1/2 by uniform
infinitesimality. For the second factor, for |y| ≤ 1,

    ∫_{|x| ≤ 1} (x − y)² dα_{n,j}(x) ≥ ∫_{|x| ≤ 1} x² dα_{n,j}(x) − 2y ∫_{|x| ≤ 1} x dα_{n,j}(x)
      ≥ ∫_{|x| ≤ 1} x² dα_{n,j}(x) − 2 |∫_{|x| ≤ 1} x dα_{n,j}(x)|
      ≥ ∫_{|x| ≤ 1} x² dα_{n,j}(x) − C α_{n,j}[x : |x| ≥ 1/2].

The last step is a consequence of the estimate (3.16), which we showed we
could always assume:

    |∫_{|x| ≤ 1} x dα_{n,j}(x)| ≤ C α_{n,j}[x : |x| ≥ 1/2].
Because of estimate (3.20) we can now assert

    sup_n Σ_{j=1}^{k_n} ∫_{|x| ≤ 1} x² dα_{n,j} ≤ C < ∞.    (3.21)

One can now derive (3.17) from (3.20) and (3.21) as in the earlier part.
Exercise 3.20. Let k_n = n² and α_{n,j} = δ_{1/n} for 1 ≤ j ≤ n², so that
µ_n = δ_n. Show that, if the centering step is omitted (i.e. one takes
β_{n,j} = e_1(α_{n,j})), then λ_n ∗ δ_{−n} converges to the standard normal
distribution, a limit different from that of µ_n ∗ δ_{−n} = δ_0.
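The phenomenon in this exercise can be seen directly on characteristic functions: without centering, λ̂_n(t) = exp[n²(e^{it/n} − 1)], and after translation by −n this converges to exp[−t²/2]. A numerical sketch (illustrative only; n = 10^4 is an arbitrary test value):

```python
import cmath

def accompanying_cf(t, n):
    # characteristic function of lambda_n * delta_{-n} when
    # beta_{n,j} = e_1(delta_{1/n}) without centering:
    # exp[n^2 (e^{it/n} - 1) - itn]
    return cmath.exp(n * n * (cmath.exp(1j * t / n) - 1) - 1j * t * n)

for t in [0.5, 1.0, 2.0]:
    print(t, abs(accompanying_cf(t, 10**4) - cmath.exp(-t * t / 2)))
```

Expanding n²(e^{it/n} − 1) = itn − t²/2 + O(1/n) shows why the distance to the normal characteristic function shrinks like 1/n.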

3.8 Infinitely Divisible Distributions.

In the study of limit theorems for sums of independent random variables,
infinitely divisible distributions play a very important role.

Definition 3.5. A distribution µ is said to be infinitely divisible if, for
every positive integer n, µ can be written as the n-fold convolution
λ_n^{∗n} of some other probability distribution λ_n.
Exercise 3.21. Show that the normal distribution with density

    p(x) = (1/√(2π)) exp[−x²/2]

is infinitely divisible.

Exercise 3.22. Show that for any λ ≥ 0, the Poisson distribution with
parameter λ,

    p_λ(n) = e^{−λ} λ^n / n!  for n ≥ 0,

is infinitely divisible.
Exercise 3.23. Show that a probability distribution supported on a finite
set {x_1, . . . , x_k}, with

    µ[{x_j}] = p_j,

p_j ≥ 0 and Σ_{j=1}^{k} p_j = 1, is infinitely divisible if and only if it is
degenerate, i.e. µ[{x_j}] = 1 for some j.
Exercise 3.24. Show that for any nonnegative finite measure α with total
mass a, the distribution

    e(α) = e^{−a} Σ_{j=0}^{∞} α^{∗j} / j!

with characteristic function

    ê(α)(t) = exp[∫ (e^{itx} − 1) dα]

is an infinitely divisible distribution.


Exercise 3.25. Show that the convolution of any two infinitely divisible
distributions is again infinitely divisible. In particular, if µ is infinitely
divisible, so is any translate µ ∗ δ_a for any real a.
We saw in the last section that the asymptotic behavior of µ_n ∗ δ_{A_n} can
be investigated by means of the asymptotic behavior of λ_n ∗ δ_{A_n}, and the
characteristic function λ̂_n of λ_n has a very special form:

    λ̂_n(t) = exp[Σ_{j=1}^{k_n} (α̂′_{n,j}(t) − 1) + it Σ_{j=1}^{k_n} a_{n,j}]
           = exp[Σ_{j=1}^{k_n} ∫ (e^{itx} − 1) dα′_{n,j} + it Σ_{j=1}^{k_n} a_{n,j}]
           = exp[∫ (e^{itx} − 1) dM_n + it a_n]
           = exp[∫ (e^{itx} − 1 − it θ(x)) dM_n + it (∫ θ(x) dM_n + a_n)]
           = exp[∫ (e^{itx} − 1 − it θ(x)) dM_n + it b_n]    (3.22)

where M_n = Σ_{j=1}^{k_n} α′_{n,j}, a_n = Σ_{j=1}^{k_n} a_{n,j} and
b_n = a_n + ∫ θ(x) dM_n.

We can make any reasonable choice for θ(·); we will need it to be a bounded
continuous function with

    |θ(x) − x| ≤ C|x|³

near 0. Possible choices are θ(x) = x/(1 + x²), or θ(x) = x for |x| ≤ 1 and
sign(x) for |x| ≥ 1. We now investigate when such things will have a weak
limit. Convoluting with δ_{A_n} only changes b_n to b_n + A_n.
First we note that

    µ̂(t) = exp[∫ (e^{itx} − 1 − it θ(x)) dM + ita]

is a characteristic function for any measure M with finite total mass. In
fact, it is the characteristic function of an infinitely divisible probability
distribution. It is not necessary that M be a finite measure for µ̂ to make
sense. M could be infinite, but in such a way that it is finite on
{x : |x| ≥ δ} for every δ > 0, and near 0 it integrates x², i.e.,

    M[x : |x| ≥ δ] < ∞ for all δ > 0,    (3.23)

    ∫_{|x| ≤ 1} x² dM < ∞.    (3.24)

To see this, we remark that

    µ̂_δ(t) = exp[∫_{|x| ≥ δ} (e^{itx} − 1 − it θ(x)) dM + ita]

is a characteristic function for each δ > 0, and because

    |e^{itx} − 1 − itx| ≤ C_T x²

for |t| ≤ T, µ̂_δ(t) → µ̂(t) uniformly on bounded intervals, where µ̂(t) is
given by the integral

    µ̂(t) = exp[∫ (e^{itx} − 1 − it θ(x)) dM + ita],

which converges absolutely and defines a characteristic function. Let us call
measures that satisfy (3.23) and (3.24), i.e. that can be expressed in the
form

    ∫ x²/(1 + x²) dM < ∞,    (3.25)

admissible Lévy measures. Since the same argument applies to M/n and a/n
instead of M and a, for any admissible Lévy measure M and real number a,
µ̂(t) is in fact an infinitely divisible characteristic function. As the
normal distribution is also an infinitely divisible probability distribution,
we arrive at the following theorem.

Theorem 3.20. For every admissible Lévy measure M, σ² ≥ 0 and real a,

    µ̂(t) = exp[∫ (e^{itx} − 1 − it θ(x)) dM + ita − σ²t²/2]

is the characteristic function of an infinitely divisible distribution µ.

We will denote this distribution µ by µ = e(M, σ², a). The main theorem of
this section is

Theorem 3.21. In order that µ_n = e(M_n, σ_n², a_n) may converge to a limit
µ, it is necessary and sufficient that µ = e(M, σ², a), where M, σ² and a
satisfy the following three conditions (3.26), (3.27) and (3.28).
For every bounded continuous function f that vanishes in some neighborhood
of 0,

    lim_{n→∞} ∫ f(x) dM_n = ∫ f(x) dM.    (3.26)

For some (and therefore for every) ℓ > 0 such that ±ℓ are continuity points
of M, i.e. M{±ℓ} = 0,

    lim_{n→∞} [σ_n² + ∫_{−ℓ}^{ℓ} x² dM_n] = σ² + ∫_{−ℓ}^{ℓ} x² dM.    (3.27)

    a_n → a as n → ∞.    (3.28)

Proof. Let us prove the sufficiency first. Condition (3.26) implies that for
every ℓ such that ±ℓ are continuity points of M,

    lim_{n→∞} ∫_{|x| ≥ ℓ} (e^{itx} − 1 − it θ(x)) dM_n = ∫_{|x| ≥ ℓ} (e^{itx} − 1 − it θ(x)) dM,

and because of condition (3.27) it is enough to show that

    lim_{ℓ→0} lim sup_{n→∞} |∫_{−ℓ}^{ℓ} (e^{itx} − 1 − it θ(x) + t²x²/2) dM_n
      − ∫_{−ℓ}^{ℓ} (e^{itx} − 1 − it θ(x) + t²x²/2) dM| = 0

in order to conclude that

    lim_{n→∞} [−σ_n²t²/2 + ∫ (e^{itx} − 1 − it θ(x)) dM_n]
      = −σ²t²/2 + ∫ (e^{itx} − 1 − it θ(x)) dM.

This follows from the estimates

    |e^{itx} − 1 − it θ(x) + t²x²/2| ≤ C_T |x|³  for |t| ≤ T, |x| ≤ 1,

and

    ∫_{−ℓ}^{ℓ} |x|³ dM_n ≤ ℓ ∫_{−ℓ}^{ℓ} |x|² dM_n.

Condition (3.28) takes care of the terms involving a_n.


We now turn to proving the necessity. If µ_n has a weak limit µ, then the
absolute values of the characteristic functions |µ̂_n(t)| are all uniformly
close to 1 near 0. Since

    |µ̂_n(t)| = exp[−∫ (1 − cos tx) dM_n − σ_n²t²/2],

taking logarithms we conclude that

    lim_{t→0} sup_n [σ_n²t²/2 + ∫ (1 − cos tx) dM_n] = 0.
This implies (3.29), (3.30) and (3.31) below.
For each ℓ > 0,

    sup_n M_n{x : |x| ≥ ℓ} < ∞,    (3.29)

    lim_{A→∞} sup_n M_n{x : |x| ≥ A} = 0.    (3.30)

For every 0 < ℓ < ∞,

    sup_n [σ_n² + ∫_{−ℓ}^{ℓ} |x|² dM_n] < ∞.    (3.31)

We can choose a subsequence of M_n (which we will denote by M_n as well)
that 'converges' in the sense that it satisfies conditions (3.26) and (3.27)
of the theorem. Then e(M_n, σ_n², 0) converges weakly to e(M, σ², 0). It is
not hard to see that, for any sequence of probability distributions α_n, if
both α_n and α_n ∗ δ_{a_n} converge to limits α and β respectively, then
necessarily β = α ∗ δ_a for some a, and a_n → a as n → ∞. In order to
complete the proof of necessity we need only establish the uniqueness of the
representation, which is done in the next lemma.
Lemma 3.22. (Uniqueness). Suppose µ = e(M_1, σ_1², a_1) = e(M_2, σ_2², a_2).
Then M_1 = M_2, σ_1² = σ_2² and a_1 = a_2.

Proof. Since µ̂(t) never vanishes, by taking logarithms we have

    ψ(t) = −σ_1²t²/2 + ∫ (e^{itx} − 1 − it θ(x)) dM_1 + ita_1
         = −σ_2²t²/2 + ∫ (e^{itx} − 1 − it θ(x)) dM_2 + ita_2.    (3.32)
We can verify that, for any admissible Lévy measure M,

    lim_{t→∞} (1/t²) ∫ (e^{itx} − 1 − it θ(x)) dM = 0.

Consequently

    lim_{t→∞} (−2ψ(t)/t²) = σ_1² = σ_2²,

leaving us with

    ψ(t) = ∫ (e^{itx} − 1 − it θ(x)) dM_1 + ita_1
         = ∫ (e^{itx} − 1 − it θ(x)) dM_2 + ita_2

for a different ψ. If we calculate

    H(s, t) = (ψ(t + s) + ψ(t − s))/2 − ψ(t)

we get

    ∫ e^{itx} (1 − cos sx) dM_1 = ∫ e^{itx} (1 − cos sx) dM_2

for all t and s. Since we can and do assume that M{0} = 0 for any admissible
Lévy measure M, we have M_1 = M_2. If we know that σ_1² = σ_2² and
M_1 = M_2, it is easy to see that a_1 must equal a_2.

Finally we have

Corollary 3.23. (Lévy–Khintchine representation). Any infinitely
divisible distribution admits a representation µ = e(M, σ², a) for some
admissible Lévy measure M, σ² ≥ 0 and real number a.

Proof. We can write µ = µ_n^{∗n} = µ_n ∗ µ_n ∗ · · · ∗ µ_n with n terms. If we
show that µ_n ⇒ δ_0, then the sequence is uniformly infinitesimal, and by the
earlier theorem on accompanying laws µ will be the limit of some
λ_n = e(M_n, 0, a_n) and therefore has to be of the form e(M, σ², a) for some
choice of admissible Lévy measure M, σ² ≥ 0 and real a. In a neighborhood
around 0, µ̂(t) is close to 1, and it is easy to check that

    µ̂_n(t) = [µ̂(t)]^{1/n} → 1

as n → ∞ in that neighborhood. As we saw before, this implies that µ_n ⇒ δ_0.

Applications.

1. Convergence to the Poisson distribution. Let {X_{n,j} : 1 ≤ j ≤ k_n} be
k_n independent random variables taking the values 0 or 1 with probabilities
1 − p_{n,j} and p_{n,j} respectively. We assume that

    lim_{n→∞} sup_{1 ≤ j ≤ k_n} p_{n,j} = 0,

which is the uniform infinitesimality condition. We are interested in the
limiting distribution of S_n = Σ_{j=1}^{k_n} X_{n,j} as n → ∞. Since we have
to center by the mean, we can pick any level, say 1/2, for truncation. Then
the truncated means are all 0. The accompanying laws are given by
e(M_n, 0, a_n) with M_n = (Σ_j p_{n,j}) δ_1 and a_n = (Σ_j p_{n,j}) θ(1). It
is clear that a limit exists if and only if λ_n = Σ_j p_{n,j} has a limit λ
as n → ∞, and the limit in such a case is the Poisson distribution with
parameter λ.
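In the simplest instance of Application 1, p_{n,j} = λ/n with k_n = n, the distribution of S_n is Binomial(n, λ/n), and the convergence to the Poisson distribution can be computed exactly. A quick sketch (illustrative only; λ = 3 and n = 5000 are arbitrary test values) compares the two probability mass functions:

```python
import math

def binom_pmf(n, p, k):
    # P[S_n = k] for S_n Binomial(n, p)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    # Poisson(lam) mass at k
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam, n = 3.0, 5000
dist = max(abs(binom_pmf(n, lam / n, k) - poisson_pmf(lam, k)) for k in range(30))
print(dist)
```

The maximal pointwise discrepancy is of order λ²/n, consistent with the theorem's assertion that only λ_n = Σ_j p_{n,j} matters in the limit.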
2. Convergence to the normal distribution. If the limit of
S_n = Σ_{j=1}^{k_n} X_{n,j} of k_n uniformly infinitesimal mutually
independent random variables exists, then the limit is normal if and only if
M ≡ 0. If a_{n,j} is the centering needed, this is equivalent to

    lim_{n→∞} Σ_j P[|X_{n,j} − a_{n,j}| ≥ ε] = 0

for all ε > 0. Since lim_{n→∞} sup_j |a_{n,j}| = 0, this is equivalent to

    lim_{n→∞} Σ_j P[|X_{n,j}| ≥ ε] = 0

for each ε > 0.


3. The limiting variance and mean are given by

    σ² = lim_{n→∞} Σ_j E[(X_{n,j} − a_{n,j})² ; |X_{n,j} − a_{n,j}| ≤ 1]

and

    a = lim_{n→∞} Σ_j a_{n,j},

where

    a_{n,j} = ∫_{|x| ≤ 1} x dα_{n,j}.

Suppose that E[X_{n,j}] = 0 for all 1 ≤ j ≤ k_n and n. Let
σ_n² = Σ_j E[X_{n,j}²] and assume that σ² = lim_{n→∞} σ_n² exists. What do we
need in order to make sure that the limiting distribution is normal with mean
0 and variance σ²? Let α_{n,j} be the distribution of X_{n,j}.

By the mean zero condition and the Cauchy–Schwarz inequality,

    |a_{n,j}|² = |∫_{|x| ≤ 1} x dα_{n,j}|² = |∫_{|x| > 1} x dα_{n,j}|²
      ≤ α_{n,j}[|x| > 1] ∫_{|x| > 1} |x|² dα_{n,j}

and

    Σ_{j=1}^{k_n} |a_{n,j}|² ≤ [Σ_{j=1}^{k_n} ∫ |x|² dα_{n,j}] · sup_{1 ≤ j ≤ k_n} α_{n,j}[|x| > 1]
      ≤ σ_n² sup_{1 ≤ j ≤ k_n} α_{n,j}[|x| > 1]
      → 0.
Because Σ_{j=1}^{k_n} |a_{n,j}|² → 0 as n → ∞, we must have

    σ² = lim_{n→∞} Σ_j ∫_{|x| ≤ ℓ} |x|² dα_{n,j}

for every ℓ > 0, or equivalently

    lim_{n→∞} Σ_j ∫_{|x| > ℓ} |x|² dα_{n,j} = 0
for every ℓ > 0, establishing the necessity as well as the sufficiency in
Lindeberg's theorem. A simple calculation,

    Σ_j |a_{n,j}| ≤ Σ_j ∫_{|x| > 1} |x| dα_{n,j} ≤ Σ_j ∫_{|x| > 1} |x|² dα_{n,j} → 0,

establishes that the limiting normal distribution has mean 0.

Exercise 3.26. What happens in the Poisson limit theorem (Application 1) if
λ_n = Σ_j p_{n,j} → ∞ as n → ∞? Can you show that the distribution of
(S_n − λ_n)/√λ_n converges to the standard normal distribution?
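The claim in this exercise can be probed numerically at the distribution-function level: for large λ, P[(S − λ)/√λ ≤ x] computed from the Poisson mass function should approach Φ(x). A sketch (illustrative only; λ = 10000 is an arbitrary test value; logarithms of the mass function are used to avoid underflow):

```python
import math

def poisson_cdf_standardized(lam, x):
    # P[(S - lam)/sqrt(lam) <= x] for S Poisson(lam), summed via the log-pmf
    kmax = int(lam + x * math.sqrt(lam))
    total = 0.0
    for k in range(kmax + 1):
        log_pmf = k * math.log(lam) - lam - math.lgamma(k + 1)
        total += math.exp(log_pmf)
    return total

Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
for x in [-1.0, 0.0, 1.0]:
    print(x, poisson_cdf_standardized(10000.0, x), Phi(x))
```

The agreement at λ = 10000 is already within about 0.01, the discreteness of the Poisson lattice accounting for most of the residual error.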

3.9 Laws of the iterated logarithm.

When we are dealing with a sequence of independent identically distributed
random variables X1, · · · , Xn, · · · with mean 0 and variance 1, we have a
strong law of large numbers asserting that

    P[lim_{n→∞} (X1 + · · · + Xn)/n = 0] = 1

and a central limit theorem asserting that

    P[(X1 + · · · + Xn)/√n ≤ a] → (1/√(2π)) ∫_{−∞}^{a} exp[−x²/2] dx.

It is a reasonable question to ask if the random variables
(X1 + · · · + Xn)/√n themselves converge to some limiting random variable Y
that is distributed according to the standard normal distribution. The answer
is no, and it is not hard to show.

Lemma 3.24. For any sequence n_j of integers increasing to ∞,

    P[lim sup_{j→∞} (X1 + · · · + X_{n_j})/√n_j = +∞] = 1.

Proof. Let us define

    Z = lim sup_{j→∞} (X1 + · · · + X_{n_j})/√n_j,
which can be +∞. Because the normal distribution has an infinitely long
tail, i.e. the probability of exceeding any given value is positive, we must
have

    P[Z ≥ a] > 0

for any a. But Z is a random variable that does not depend on the particular
values of X1, · · · , Xn and is therefore measurable with respect to the tail
σ-field. By Kolmogorov's zero-one law, P[Z ≥ a] must be either 0 or 1. Since
it cannot be 0, it must be 1.

Since we know that (X1 + · · · + Xn)/n → 0 with probability 1 as n → ∞, the
question arises as to the rate at which this happens. The law of the iterated
logarithm provides an answer.

Theorem 3.25. For any sequence X1, · · · , Xn, · · · of independent
identically distributed random variables with mean 0 and variance 1,

    P[lim sup_{n→∞} (X1 + · · · + Xn)/√(n log log n) = √2] = 1.

We will not prove this theorem in the most general case, which assumes only
the existence of two moments. We will assume instead that E[|X|^{2+α}] < ∞
for some α > 0. We shall first reduce the proof to an estimate on the tail
behavior of the distributions of S_n/√n by a careful application of the
Borel–Cantelli lemma. This estimate is obvious if X1, · · · , Xn, · · · are
themselves normally distributed, and we will show how to extend it to a large
class of distributions that satisfy the additional moment condition. It is
clear that we are interested in showing that, for λ > √2,

    P[S_n ≥ λ √(n log log n) infinitely often] = 0.

It would be sufficient, because of the Borel–Cantelli lemma, to show that for
any λ > √2,

    Σ_n P[S_n ≥ λ √(n log log n)] < ∞.

This however is too strong. The condition of the Borel–Cantelli lemma is not
necessary in this context because of the strong dependence between the
partial sums S_n. The function φ(n) = √(n log log n) is clearly well defined
and non-decreasing for n ≥ 3, and it is sufficient for our purposes to show
that for any λ > √2 we can find some sequence k_n ↑ ∞ of integers such that

    Σ_n P[sup_{k_{n−1} ≤ j ≤ k_n} S_j ≥ λ φ(k_{n−1})] < ∞.    (3.33)

This will establish that, with probability 1,

    lim sup_{n→∞} sup_{k_{n−1} ≤ j ≤ k_n} S_j / φ(k_{n−1}) ≤ λ,

or, by the monotonicity of φ,

    lim sup_{n→∞} S_n/φ(n) ≤ λ

with probability 1. Since λ > √2 is arbitrary, the upper bound in the law of
the iterated logarithm will follow. Each term in the sum of (3.33) can be
estimated as in Lévy's inequality,

    P[sup_{k_{n−1} ≤ j ≤ k_n} S_j ≥ λ φ(k_{n−1})] ≤ 2 P[S_{k_n} ≥ (λ − σ) φ(k_{n−1})]

with 0 < σ < λ, provided

    sup_{1 ≤ j ≤ k_n − k_{n−1}} P[|S_j| ≥ σ φ(k_{n−1})] ≤ 1/2.
1≤j≤kn −kn−1 2
Our choice of kn will be kn = [ρn ] for some ρ > 1 and therefore
φ(kn−1)
lim √ =∞
n→∞ kn
and by Chebychev’s inequality, for any fixed σ > 0,

 
E[Sn2 ]
sup P |Sj | ≥ σφ(kn−1) ≤
1≤j≤kn [σφ(kn−1)]2
kn
=
[σφ(kn−1)]2
kn
= 2
σ kn−1 log log kn−1
= o(1) as n → ∞. (3.34)
By choosing σ small enough so that λ − σ > √2, it is sufficient to show that
for any λ′ > √2,

    Σ_n P[S_{k_n} ≥ λ′ φ(k_{n−1})] < ∞.

By picking ρ sufficiently close to 1 (so that λ′/√ρ > √2), because
φ(k_{n−1})/φ(k_n) → 1/√ρ, we can reduce this to the convergence of

    Σ_n P[S_{k_n} ≥ λ φ(k_n)] < ∞    (3.35)

for all λ > √2.
If we use the estimate P[X ≥ a] ≤ exp[−a²/2], valid for the standard normal
distribution, we can verify (3.35):

    Σ_n exp[−λ² (φ(k_n))²/(2k_n)] < ∞

for any λ > √2.
To prove the lower bound we select again a subsequence, k_n = [ρ^n] with some
ρ > 1, and look at Y_n = S_{k_{n+1}} − S_{k_n}, which are now independent
random variables. The tail probability of the normal distribution has the
lower bound

    P[X ≥ a] = (1/√(2π)) ∫_a^∞ exp[−x²/2] dx
      ≥ (1/√(2π)) ∫_a^∞ exp[−x²/2 − x](x + 1) dx
      ≥ (1/√(2π)) exp[−(a + 1)²/2].
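Both tail estimates used in this section, the upper bound P[X ≥ a] ≤ exp[−a²/2] and the lower bound just derived, can be checked numerically against the exact normal tail computed with the complementary error function. A small sketch (illustrative only; the grid of values of a is arbitrary):

```python
import math

def normal_tail(a):
    # exact P[X >= a] for a standard normal X, via erfc
    return 0.5 * math.erfc(a / math.sqrt(2))

for a in [0.0, 0.5, 1.0, 2.0, 3.0, 5.0]:
    tail = normal_tail(a)
    upper = math.exp(-a * a / 2)
    lower = math.exp(-(a + 1) ** 2 / 2) / math.sqrt(2 * math.pi)
    print(a, lower <= tail <= upper)
```

Every line prints True: the crude exponential bounds bracket the true tail for all a ≥ 0, which is all the proof needs.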
If we assume normal-like tail probabilities, we can conclude that

    Σ_n P[Y_n ≥ λ φ(k_{n+1})] ≥ Σ_n exp[−(1/2)(1 + λ φ(k_{n+1})/√(ρ^{n+1} − ρ^n))²] = +∞

provided λ²ρ/(2(ρ − 1)) < 1, and conclude, by the Borel–Cantelli lemma, that
Y_n = S_{k_{n+1}} − S_{k_n} exceeds λ φ(k_{n+1}) infinitely often for such λ.
On the other hand,
from the upper bound we already have (replacing X_i by −X_i)

    P[lim sup_n (−S_{k_n})/φ(k_{n+1}) ≤ √2/√ρ] = 1.

Consequently

    P[lim sup_n S_{k_{n+1}}/φ(k_{n+1}) ≥ √(2(ρ − 1)/ρ) − √2/√ρ] = 1

and therefore

    P[lim sup_n S_n/φ(n) ≥ √(2(ρ − 1)/ρ) − √2/√ρ] = 1.

We now take ρ arbitrarily large and we are done.


We saw that the law of the iterated logarithm depends on two things.

(i). For any a > 0 and p < a²/2, an upper bound for the probability

    P[S_n ≥ a √(n log log n)] ≤ C_p [log n]^{−p}

with some constant C_p.

(ii). For any a > 0 and p > a²/2, a lower bound for the probability

    P[S_n ≥ a √(n log log n)] ≥ C_p [log n]^{−p}

with some, possibly different, constant C_p.

Both inequalities can be obtained from a uniform rate of convergence in the
central limit theorem:

    sup_a |P[S_n/√n ≥ a] − (1/√(2π)) ∫_a^∞ exp[−x²/2] dx| ≤ C n^{−δ}    (3.36)

for some δ > 0. Such an error estimate is provided by the following theorem.

Theorem 3.26. (Berry–Esseen theorem). Assume that the i.i.d. sequence
{X_j} with mean zero and variance one satisfies the additional moment
condition E[|X|^{2+α}] < ∞ for some α > 0. Then for some δ > 0 the estimate
(3.36) holds.
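The polynomial rate n^{−δ} in (3.36) can be observed in the symmetric ±1 case, where the error at a = 0 equals half the central binomial probability and decays like n^{−1/2}, so quadrupling n should roughly halve it. A sketch (illustrative only; the sample sizes 100 and 1600 are arbitrary; log-binomial probabilities are used to avoid overflow):

```python
import math

def clt_error_at_zero(n):
    # |P[S_n/sqrt(n) <= 0] - 1/2| for S_n a sum of n independent +-1 variables,
    # n even. By symmetry P[S_n <= 0] = 1/2 + pmf(n/2)/2, so the error is
    # pmf(n/2)/2 with pmf(n/2) = C(n, n/2) 2^{-n}.
    log_pmf = math.lgamma(n + 1) - 2 * math.lgamma(n // 2 + 1) - n * math.log(2)
    return math.exp(log_pmf) / 2

e100, e1600 = clt_error_at_zero(100), clt_error_at_zero(1600)
print(e100, e1600, e100 / e1600)
```

The ratio of the two errors is close to 4 = √(1600/100), consistent with a rate of n^{−1/2} for this distribution (bounded third moment, so Berry–Esseen applies).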

Proof. The proof will be carried out after two lemmas.

Lemma 3.27. Let −∞ < a < b < ∞ be given and let 0 < h < (b − a)/2 be a small
positive number. Consider the function f_{a,b,h}(x) defined as

    f_{a,b,h}(x) = 0                      for −∞ < x ≤ a − h,
    f_{a,b,h}(x) = (x − a + h)/(2h)       for a − h ≤ x ≤ a + h,
    f_{a,b,h}(x) = 1                      for a + h ≤ x ≤ b − h,
    f_{a,b,h}(x) = 1 − (x − b + h)/(2h)   for b − h ≤ x ≤ b + h,
    f_{a,b,h}(x) = 0                      for b + h ≤ x < ∞.

For any probability distribution µ with characteristic function µ̂(t),

    ∫_{−∞}^{∞} f_{a,b,h}(x) dµ(x) = (1/2π) ∫_{−∞}^{∞} ((e^{−iay} − e^{−iby})/(iy)) (sin hy/(hy)) µ̂(y) dy.

Proof. This is essentially the Fourier inversion formula. Note that

    f_{a,b,h}(x) = (1/2π) ∫_{−∞}^{∞} e^{ixy} ((e^{−iay} − e^{−iby})/(iy)) (sin hy/(hy)) dy.

We can start with the double integral

    (1/2π) ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{ixy} ((e^{−iay} − e^{−iby})/(iy)) (sin hy/(hy)) dy dµ(x)

and apply Fubini's theorem to obtain the lemma.

Lemma 3.28. Let λ, µ be two probability measures with zero means, having
λ̂(·), µ̂(·) for respective characteristic functions. Then

    ∫_{−∞}^{∞} f_{a,h}(x) d(λ − µ)(x) = (1/2π) ∫_{−∞}^{∞} (e^{−iay}/(iy)) (sin hy/(hy)) [λ̂(y) − µ̂(y)] dy,

where f_{a,h}(x) = f_{a,∞,h}(x) is given by

    f_{a,h}(x) = 0                    for −∞ < x ≤ a − h,
    f_{a,h}(x) = (x − a + h)/(2h)     for a − h ≤ x ≤ a + h,
    f_{a,h}(x) = 1                    for a + h ≤ x < ∞.
Proof. We just let b → ∞ in the previous lemma. Since
|λ̂(y) − µ̂(y)| = o(|y|) as y → 0, there is no problem in applying the
Riemann–Lebesgue lemma.

We now proceed with the proof of the theorem. We have

    λ[[a, ∞)] ≤ ∫ f_{a−h,h}(x) dλ(x) ≤ λ[[a − 2h, ∞)]

and

    µ[[a, ∞)] ≤ ∫ f_{a−h,h}(x) dµ(x) ≤ µ[[a − 2h, ∞)].

Therefore, if we assume that µ has a density bounded by C,

    λ[[a, ∞)] − µ[[a, ∞)] ≤ 2hC + ∫ f_{a−h,h}(x) d(λ − µ)(x).

Since we get a similar bound in the other direction as well,

    sup_a |λ[[a, ∞)] − µ[[a, ∞)]| ≤ sup_a |∫ f_{a−h,h}(x) d(λ − µ)(x)| + 2hC
      ≤ (1/2π) ∫_{−∞}^{∞} |λ̂(y) − µ̂(y)| (|sin hy|/(hy²)) dy + 2hC.    (3.37)

Now we return to the proof of the theorem. We take λ to be the distribution
of Sn/√n, having as its characteristic function λ̂n(y) = [φ(y/√n)]^n, where φ(y)
is the characteristic function of the common distribution of the {Xi} and has
the expansion

    φ(y) = 1 − y²/2 + O(|y|^{2+α})

for some α > 0. We therefore get, for some (possibly smaller) choice of α > 0,

    |λ̂n(y) − exp[−y²/2]| ≤ C |y|^{2+α} / n^α    if |y| ≤ n^{α/(2+α)}.

Therefore, for θ = α/(2 + α),

    ∫_{−∞}^{∞} |λ̂n(y) − exp[−y²/2]| [|sin(hy)| / (h y²)] dy
        = ∫_{|y|≤n^θ} |λ̂n(y) − exp[−y²/2]| [|sin(hy)| / (h y²)] dy
          + ∫_{|y|≥n^θ} |λ̂n(y) − exp[−y²/2]| [|sin(hy)| / (h y²)] dy
        ≤ (C/h) [ ∫_{|y|≤n^θ} (|y|^α / n^α) dy + ∫_{|y|≥n^θ} dy/|y|² ]
        ≤ (C/h) [ n^{(α+1)θ−α} + n^{−θ} ]
        = C / (h n^{α/(α+2)}).

Substituting this bound in (3.37) we get

    sup_a |λn[[a, ∞)] − µ[[a, ∞)]| ≤ C1 h + C / (h n^{α/(2+α)}).

By picking h = n^{−α/(2(2+α))} we get

    sup_a |λn[[a, ∞)] − µ[[a, ∞)]| ≤ C n^{−α/(2(2+α))}

and we are done.
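The conclusion of the theorem can be watched numerically in the simplest case Xj = ±1 with probability 1/2 each (mean zero, variance one, all moments finite), where the distribution of Sn/√n is an explicit binomial. The sample sizes below are our own choices, purely for illustration.

```python
import math

# Kolmogorov distance sup_a |F_n(a) - Phi(a)| between the exact distribution
# of S_n/sqrt(n) for symmetric Bernoulli steps and the standard normal.

def Phi(a):
    return 0.5 * (1 + math.erf(a / math.sqrt(2)))

def kolmogorov_distance(n):
    # S_n takes the value 2k - n with probability C(n,k)/2^n
    cdf, d = 0.0, 0.0
    for k in range(n + 1):
        a = (2 * k - n) / math.sqrt(n)
        d = max(d, abs(cdf - Phi(a)))      # just below the atom at a
        cdf += math.comb(n, k) / 2 ** n
        d = max(d, abs(cdf - Phi(a)))      # at the atom
    return d

d16, d256 = kolmogorov_distance(16), kolmogorov_distance(256)
print(d16, d256)   # the distance shrinks as n grows
```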


Chapter 4

Dependent Random Variables

4.1 Conditioning
One of the key concepts in probability theory is the notion of conditional
probability and conditional expectation. Suppose that we have a probability
space (Ω, F , P ) consisting of a space Ω, a σ-field F of subsets of Ω and a
probability measure on the σ-field F . If we have a set A ∈ F of positive
measure then conditioning with respect to A means we restrict ourselves to
the set A. Ω gets replaced by A. The σ-field F by the σ-field FA of subsets
of A that are in F . For B ⊂ A we define
P (B)
PA (B) =
P (A)
We could achieve the same thing by defining for arbitrary B ∈ F
P (A ∩ B)
PA (B) =
P (A)
in which case PA (·) is a measure defined on F as well but one that is concen-
trated on A and assigning 0 probability to Ac . The definition of conditional
probability is
P (A ∩ B)
P (B|A) = .
P (A)
Similarly the definition of conditional expectation of an integrable function
f (ω) given a set A ∈ F of positive probability is defined to be
R
f (ω)dP
E{f |A} = A .
P (A)


In particular if we take f = χB for some B ∈ F we recover the definition


of conditional probability. In general if we know P (B|A) and P (A) we can
recover P (A ∩ B) = P (A)P (B|A) but we cannot recover P (B). But if we
know P (B|A) as well as P (B|Ac ) along with P (A) and P (Ac ) = 1 − P (A)
then

P (B) = P (A ∩ B) + P (Ac ∩ B) = P (A)P (B|A) + P (Ac )P (B|Ac ).

More generally, if P is a partition of Ω into a finite or even countable number
of disjoint measurable sets A1, · · · , Aj , · · · , then

    P(B) = Σj P(Aj) P(B|Aj).
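A small worked instance of this partition formula, with invented numbers (three "urns" Aj, and B the event of drawing a red ball), using exact rational arithmetic:

```python
from fractions import Fraction as F

# P(B) = sum_j P(A_j) P(B|A_j) for an invented partition A_1, A_2, A_3.
P_A = [F(1, 2), F(1, 3), F(1, 6)]            # P(A_j); these sum to 1
P_B_given_A = [F(1, 4), F(1, 2), F(3, 4)]    # P(B | A_j)

P_B = sum(pa * pb for pa, pb in zip(P_A, P_B_given_A))

# Going back the other way (Bayes): P(A_1 | B) = P(A_1) P(B|A_1) / P(B)
P_A1_given_B = P_A[0] * P_B_given_A[0] / P_B
print(P_B, P_A1_given_B)
```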

If ξ is a random variable taking distinct values {aj } on {Aj } then

P (B|ξ = aj ) = P (B|Aj )

or more generally
    P(B|ξ = a) = P(B ∩ {ξ = a}) / P(ξ = a)
provided P (ξ = a) > 0. One of our goals is to seek a definition that makes
sense when P (ξ = a) = 0. This involves dividing 0 by 0 and should involve
differentiation of some kind. In the countable case we may think of P (B|ξ =
aj ) as a function fB (ξ) which is equal to P (B|Aj ) on ξ = aj . We can rewrite
our definition of
fB (aj ) = P (B|ξ = aj )
as

    ∫_{ξ=aj} fB(ξ) dP = P(B ∩ {ξ = aj})   for each j

or summing over any arbitrary collection of j’s


    ∫_{ξ∈E} fB(ξ) dP = P(B ∩ {ξ ∈ E}).

Sets of the form {ξ ∈ E} form a sub σ-field Σ ⊂ F, and we can rewrite the
definition as

    ∫_A fB(ξ) dP = P(B ∩ A)

for all A ∈ Σ. Of course in this case A ∈ Σ if and only if A is a union


of the atoms ξ = a of the partition over a finite or countable subcollection
of the possible values of a. Similar considerations apply to the conditional
expectation of a random variable G given ξ. The equation becomes
    ∫_A g(ξ) dP = ∫_A G(ω) dP

or we can rewrite this as


    ∫_A g(ω) dP = ∫_A G(ω) dP

for all A ∈ Σ and instead of demanding that g be a function of ξ we demand


that g be Σ measurable which is the same thing. Now the random variable
ξ is out of the picture and rightly so. What is important is the information
we have if we know ξ and that is the same if we replace ξ by a one-to-one
function of itself. The σ-field Σ abstracts that information nicely. So it turns
out that the proper notion of conditioning involves a sub σ-field Σ ⊂ F . If G
is an integrable function and Σ ⊂ F is given we will seek another integrable
function g that is Σ measurable and satisfies
    ∫_A g(ω) dP = ∫_A G(ω) dP

for all A ∈ Σ. We will prove existence and uniqueness of such a g and call it
the conditional expectation of G given Σ and denote it by g = E[G|Σ].
The way to prove the above result will take us on a detour. A signed
measure on a measurable space (Ω, F) is a set function λ(·) defined for A ∈
F which is countably additive but not necessarily nonnegative. Countable
additivity is meant in either of the following two equivalent senses:

    λ(∪n An) = Σn λ(An)
for any countable collection of disjoint sets in F, or

    lim_{n→∞} λ(An) = λ(A)

whenever An ↓ A or An ↑ A.
Examples of such λ can be constructed by taking the difference µ1 − µ2
of two nonnegative measures µ1 and µ2 .

Definition 4.1. A set A ∈ F is totally positive (totally negative) for λ if
for every subset B ∈ F with B ⊂ A we have λ(B) ≥ 0 (respectively λ(B) ≤ 0).

Remark 4.1. A measurable subset of a totally positive set is totally positive.


Any countable union of totally positive subsets is again totally positive.
Lemma 4.1. If λ is a countably additive signed measure on (Ω, F), then

    sup_{A∈F} |λ(A)| < ∞.

Proof. The key idea in the proof is that, since λ(Ω) is a finite number, if
λ(A) is large so is λ(Ac ) with an opposite sign. In fact, it is not hard to
see that ||λ(A)| − |λ(Ac )|| ≤ |λ(Ω)| for all A ∈ F . Another fact is that if
supB⊂A |λ(B)| and supB⊂Ac |λ(B)| are finite, so is supB |λ(B)|. Now let us
complete the proof. Given a subset A ∈ F with supB⊂A |λ(B)| = ∞, and
any positive number N, there is a subset A1 ∈ F with A1 ⊂ A such that
|λ(A1 )| ≥ N and supB⊂A1 |λ(B)| = ∞. This is obvious because if we pick
a set E ⊂ A with |λ(E)| very large so will λ(E c ) be. At least one of the
two sets E, E c will have the second property and we can call it A1 . If we
proceed by induction, we obtain a decreasing sequence An with |λ(An)| → ∞,
which contradicts countable additivity.
Lemma 4.2. Given a subset A ∈ F with λ(A) = ℓ > 0, there is a subset
Ā ⊂ A that is totally positive with λ(Ā) ≥ ℓ.

Proof. Let us define m = inf_{B⊂A} λ(B). Since the empty set is included,
m ≤ 0. If m = 0 then A is totally positive and we are done. So let us assume
that m < 0. By the previous lemma m > −∞.
Let us find B1 ⊂ A such that λ(B1) ≤ m/2. Then for A1 = A − B1 we have
A1 ⊂ A, λ(A1) ≥ ℓ and inf_{B⊂A1} λ(B) ≥ m/2. By induction we can find Ak
with A ⊃ A1 ⊃ · · · ⊃ Ak ⊃ · · · , λ(Ak) ≥ ℓ for every k and inf_{B⊂Ak} λ(B) ≥ m/2^k.
Clearly if we define Ā = ∩Ak, which is the decreasing limit, then Ā works.

Theorem 4.3. (Hahn-Jordan Decomposition). Given a countably ad-


ditive signed measure λ on (Ω, F), it can always be written as λ = µ+ − µ−,
the difference of two nonnegative measures. Moreover µ+ and µ− may be
chosen to be orthogonal, i.e., there are disjoint sets Ω+ , Ω− ∈ F such that
µ+ (Ω− ) = µ− (Ω+ ) = 0. In fact Ω+ and Ω− can be taken to be subsets of
Ω that are respectively totally positive and totally negative for λ. µ± then
become just the restrictions of λ to Ω± .

Proof. Totally positive sets are closed under countable unions, disjoint or
not. Let us define m+ = supA λ(A). If m+ = 0 then λ(A) ≤ 0 for all A and
we can take Ω+ = Φ and Ω− = Ω, which works. Assume that m+ > 0. There
exist sets An with λ(An) ≥ m+ − 1/n, and therefore totally positive subsets Ān
of An with λ(Ān) ≥ m+ − 1/n. Clearly Ω+ = ∪n Ān is totally positive and
λ(Ω+ ) = m+ . It is easy to see that Ω− = Ω − Ω+ is totally negative. µ± can
be taken to be the restriction of λ to Ω± .
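On a finite space the decomposition can be computed directly: a signed measure is a function on points, Ω+ is simply the set where it is positive, and µ± are the positive and negative parts. A sketch with invented point masses:

```python
from fractions import Fraction as F

# Hahn-Jordan decomposition of a signed measure on a five-point space.
lam = {'a': F(3), 'b': F(-1), 'c': F(2), 'd': F(-4), 'e': F(0)}

omega_plus = {w for w, v in lam.items() if v > 0}      # a totally positive set
omega_minus = set(lam) - omega_plus                    # its complement, totally negative
mu_plus = {w: (v if v > 0 else F(0)) for w, v in lam.items()}
mu_minus = {w: (-v if v < 0 else F(0)) for w, v in lam.items()}

# lambda = mu+ - mu-, and the two pieces live on disjoint sets
assert all(lam[w] == mu_plus[w] - mu_minus[w] for w in lam)
assert all(mu_plus[w] == 0 for w in omega_minus)
assert all(mu_minus[w] == 0 for w in omega_plus)
total_variation = sum(mu_plus.values()) + sum(mu_minus.values())
print(sorted(omega_plus), total_variation)
```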

Remark 4.2. If λ = µ+ − µ− with µ+ and µ− orthogonal to each other,


then they have to be the restrictions of λ to the totally positive and totally
negative sets for λ and such a representation for λ is unique. It is clear that
in general the representation is not unique because we can add a common µ
to both µ+ and µ− and the µ will cancel when we compute λ = µ+ − µ− .

Remark 4.3. If µ is a nonnegative measure and we define λ by

    λ(A) = ∫_A f(ω) dµ = ∫ χA(ω) f(ω) dµ

where f is an integrable function, then λ is a countably additive signed


measure and Ω+ = {ω : f (ω) > 0} and Ω− = {ω : f (ω) < 0}. If we define
f ± (ω) as the positive and negative parts of f , then
    µ±(A) = ∫_A f±(ω) dµ.

The signed measure λ that was constructed in the preceding remark


enjoys a very special relationship to µ. For any set A with µ(A) = 0, λ(A) = 0
because the integrand χA (ω)f (ω) is 0 for µ-almost all ω and for all practical
purposes is a function that vanishes identically.

Definition 4.2. A signed measure λ is said to be absolutely continuous with


respect to a nonnegative measure µ, λ << µ in symbols, if whenever µ(A) is
zero for a set A ∈ F it is also true that λ(A) = 0.

Theorem 4.4. (Radon-Nikodym Theorem). If λ << µ then there is an


integrable function f (ω) such that
    λ(A) = ∫_A f(ω) dµ                                              (4.1)

for all A ∈ F . The function f is uniquely determined almost everywhere and


is called the Radon-Nikodym derivative of λ with respect to µ. It is denoted
by

    f(ω) = dλ/dµ.

Proof. The proof depends on the decomposition theorem. We saw that if the
relation (4.1) holds, then Ω+ = {ω : f(ω) > 0}. If we define λa = λ − aµ,
then λa is a signed measure for every real number a. Let us define Ω(a) to
be the totally positive set for λa. These sets are only defined up to sets of
measure zero, and we can only handle a countable number of sets of measure
0 at one time. So it is prudent to restrict a to the set Q of rational numbers.
Roughly speaking Ω(a) will be the sets f (ω) > a and we will try to construct
f from the sets Ω(a) by the definition

    f(ω) = sup{a ∈ Q : ω ∈ Ω(a)}.

The plan is to check that the function f (ω) defined above works. Since λa
is getting more negative as a increases, Ω(a) is ↓ as a ↑. There is trouble
with sets of measure 0 for every comparison between two rationals a1 and
a2. Collect all such troublesome sets (there are only countably many) and throw
them away. In other words we may assume without loss of generality that
Ω(a1 ) ⊂ Ω(a2 ) whenever a1 > a2 . Clearly

    {ω : f(ω) > x} = {ω : ω ∈ Ω(y) for some rational y > x} = ∪_{y>x, y∈Q} Ω(y)

and this makes f measurable. If A ⊂ ∩a Ω(a), then λ(A) − aµ(A) ≥ 0 for all
a. If µ(A) > 0, λ(A) has to be infinite, which is not possible. Therefore µ(A)
has to be zero and by absolute continuity λ(A) = 0 as well. On the other
hand if A ∩ Ω(a) = Φ for all a, then λ(A) − aµ(A) ≤ 0 for all a and again
if µ(A) > 0, λ(A) = −∞ which is not possible either. Therefore µ(A), and
by absolute continuity, λ(A) are zero. This proves that f (ω) is finite almost
everywhere with respect to both λ and µ. Let us take two real numbers a < b
and consider Ea,b = {ω : a ≤ f(ω) ≤ b}. It is clear that the set Ea,b is contained
in Ω(a′) and in Ωc(b′) for any a′ < a and b′ > b. Therefore for any set A ⊂ Ea,b,
by letting a′ and b′ tend to a and b,

a µ(A) ≤ λ(A) ≤ b µ(A).



Now we are essentially done. Let us take a grid {nh} and consider En =
{ω : nh ≤ f(ω) < (n + 1)h} for −∞ < n < ∞. Then for any A ∈ F and
each n,

    λ(A ∩ En) − h µ(A ∩ En) ≤ n h µ(A ∩ En) ≤ ∫_{A∩En} f(ω) dµ
        ≤ (n + 1) h µ(A ∩ En) ≤ λ(A ∩ En) + h µ(A ∩ En).

Summing over n we have


    λ(A) − h µ(A) ≤ ∫_A f(ω) dµ ≤ λ(A) + h µ(A)

proving the integrability of f and if we let h → 0 establishing


    λ(A) = ∫_A f(ω) dµ

for all A ∈ F .

Remark 4.4. (Uniqueness). If we have two choices of f say f1 and f2 their


difference g = f1 − f2 satisfies
    ∫_A g(ω) dµ = 0

for all A ∈ F. If we take Aε = {ω : g(ω) ≥ ε}, then 0 ≥ ε µ(Aε) and this
implies µ(Aε) = 0 for all ε > 0, or g(ω) ≤ 0 almost everywhere with respect to
µ. A similar argument establishes g(ω) ≥ 0 almost everywhere with respect
to µ. Therefore g = 0 a.e. µ, proving uniqueness.
Exercise 4.1. If f and g are two integrable functions, measurable with respect
to a σ-field B, and if ∫_A f(ω) dP = ∫_A g(ω) dP for all sets A ∈ B0, a field that
generates the σ-field B, then f = g a.e. P.
Exercise 4.2. If λ(A) ≥ 0 for all A ∈ F , prove that f (ω) ≥ 0 almost every-
where.
Exercise 4.3. If Ω is a countable set and µ({ω}) > 0 for each single point
set, prove that any measure λ is absolutely continuous with respect to µ, and
calculate the Radon-Nikodym derivative.
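In the situation of Exercise 4.3 (here with a finite Ω for brevity) the Radon-Nikodym derivative is just the ratio of point masses, f(ω) = λ({ω})/µ({ω}); the weights below are invented:

```python
from fractions import Fraction as F
from itertools import combinations

# Two measures on a three-point space; mu charges every point.
mu = {1: F(1, 2), 2: F(1, 4), 3: F(1, 4)}
lam = {1: F(1, 10), 2: F(3, 10), 3: F(3, 5)}

# Radon-Nikodym derivative d(lam)/d(mu): ratio of point masses.
f = {w: lam[w] / mu[w] for w in mu}

# Check lam(A) = integral over A of f dmu for every subset A.
for r in range(len(mu) + 1):
    for A in combinations(mu, r):
        assert sum(lam[w] for w in A) == sum(f[w] * mu[w] for w in A)
print(f)
```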

Exercise 4.4. Let F (x) be a distribution function on the line with F (0) = 0
and F (1) = 1 so that the probability measure α corresponding to it lives on
the interval [0, 1]. If F (x) satisfies a Lipschitz condition
|F (x) − F (y)| ≤ A|x − y|
then prove that α << m, where m is the Lebesgue measure on [0, 1]. Show
also that 0 ≤ dα/dm ≤ A almost surely.
If ν, λ, µ are three nonnegative measures such that ν << λ and λ << µ
then show that ν << µ and
    dν/dµ = (dν/dλ)(dλ/dµ)

a.e.

Exercise 4.5. If λ, µ are nonnegative measures with λ << µ and dλ/dµ = f, then
show that g is integrable with respect to λ if and only if gf is integrable with
respect to µ, and

    ∫ g(ω) dλ = ∫ g(ω) f(ω) dµ.

Exercise 4.6. Given two nonnegative measures λ and µ, λ is said to be uni-


formly absolutely continuous with respect to µ on F if for any  > 0 there
exists a δ > 0 such that for any A ∈ F with µ(A) < δ it is true that λ(A) < .
Use the Radon-Nikodym theorem to show that absolute continuity on a σ-
field F implies uniform absolute continuity. If F0 is a field that generates the
σ-field F show by an example that absolute continuity on F0 does not imply
absolute continuity on F . Show however that uniform absolute continuity
on F0 implies uniform absolute continuity and therefore absolute continuity
on F .
Exercise 4.7. If F is a distribution function on the line show that it is abso-
lutely continuous with respect to Lebesgue measure on the line, if and only
if for any ε > 0, there exists a δ > 0 such that for an arbitrary finite collection
of disjoint intervals Ij = [aj, bj] with Σj |bj − aj| < δ, it follows that
Σj [F(bj) − F(aj)] ≤ ε.

4.2 Conditional Expectation


In the Radon-Nikodym theorem, if λ << µ are two probability distributions
on (Ω, F), we defined the Radon-Nikodym derivative f(ω) = dλ/dµ as an F
measurable function such that

measurable function such that


    λ(A) = ∫_A f(ω) dµ   for all A ∈ F.

If Σ ⊂ F is a sub σ-field, the absolute continuity of λ with respect to µ on Σ


is clearly implied by the absolute continuity of λ with respect to µ on F . We
can therefore apply the Radon-Nikodym theorem on the measurable space
(Ω, Σ), and we will obtain a new Radon-Nikodym derivative

    g(ω) = [dλ/dµ]_Σ

such that

    λ(A) = ∫_A g(ω) dµ   for all A ∈ Σ

and g is Σ measurable. Since the old function f (ω) was only F measurable,
in general, it cannot be used as the Radon-Nikodym derivative for the sub
σ-field Σ. Now if f is an integrable function on (Ω, F , µ) and Σ ⊂ F is a sub
σ-field, we can define λ on F by

    λ(A) = ∫_A f(ω) dµ   for all A ∈ F

and recalculate the Radon-Nikodym derivative g for Σ. Then g will be a Σ
measurable, integrable function such that

    λ(A) = ∫_A g(ω) dµ   for all A ∈ Σ.

In other words g is the perfect candidate for the conditional expectation

    g(ω) = E[f(·)|Σ].

We have therefore proved the existence of the conditional expectation.

Theorem 4.5. The conditional expectation map f → g has the
following properties.

1. If g = E[f|Σ], then E[g] = E[f]. E[1|Σ] = 1 a.e.

2. If f is nonnegative, then g = E[f|Σ] is almost surely nonnegative.

3. The map is linear: if a1, a2 are constants, then

       E[a1 f1 + a2 f2 |Σ] = a1 E[f1 |Σ] + a2 E[f2 |Σ]   a.e.

4. If g = E[f|Σ], then

       ∫ |g(ω)| dµ ≤ ∫ |f(ω)| dµ.

5. If h is a bounded Σ measurable function, then

       E[f h|Σ] = h E[f|Σ]   a.e.

6. If Σ2 ⊂ Σ1 ⊂ F, then

       E[E[f|Σ1]|Σ2] = E[f|Σ2]   a.e.

7. (Jensen's Inequality.) If φ(x) is a convex function of x and g = E[f|Σ],
   then

       E[φ(f(ω))|Σ] ≥ φ(g(ω))   a.e.                             (4.2)

   and if we take expectations,

       E[φ(f)] ≥ E[φ(g)].

Proof. (1), (2) and (3) are obvious. For (4) we note that if dλ = f dµ, then

    ∫ |f| dµ = sup_{A∈F} λ(A) − inf_{A∈F} λ(A)

and if we replace F by a sub σ-field Σ the right hand side is decreased. (5)
is obvious if h is the indicator function of a set A in Σ. To go from indicator
functions to simple functions to bounded measurable functions is routine.
(6) is an easy consequence of the definition. Finally (7) corresponds to
Theorem 1.7 proved for ordinary expectations and is proved analogously.
We note that if f1 ≥ f2 then E[f1|Σ] ≥ E[f2|Σ] a.e., and consequently
E[max(f1, f2)|Σ] ≥ max(g1, g2) a.e., where gi = E[fi|Σ] for i = 1, 2. Since
we can represent any convex function φ as φ(x) = supa [ax − ψ(a)], limiting

ourselves to rational a, we have only a countable set of functions to deal with,


and

    E[φ(f)|Σ] = E[sup_a (af − ψ(a)) | Σ]
              ≥ sup_a (a E[f|Σ] − ψ(a))
              = sup_a (a g − ψ(a))
              = φ(g)

a.e., and after taking expectations

    E[φ(f)] ≥ E[φ(g)].

Remark 4.5. Conditional expectation is a form of averaging, i.e. it is linear,
takes constants into constants and preserves nonnegativity. Jensen's inequal-
ity is now a consequence of convexity.
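When Σ is generated by a finite partition, E[f|Σ] is the cell-by-cell average of f, and properties such as (1) and Jensen's inequality (with φ(x) = x²) can be verified directly. The probability space, function and partition below are our own choices:

```python
from fractions import Fraction as F

# Conditional expectation on a six-point space with respect to the
# sigma-field generated by a three-cell partition.
Omega = range(6)
P = {w: F(1, 6) for w in Omega}
f = {0: F(1), 1: F(3), 2: F(-2), 3: F(0), 4: F(4), 5: F(2)}
partition = [{0, 1}, {2, 3, 4}, {5}]     # atoms generating Sigma

def cond_exp(func):
    g = {}
    for cell in partition:
        mass = sum(P[w] for w in cell)
        avg = sum(func[w] * P[w] for w in cell) / mass
        for w in cell:
            g[w] = avg                   # constant on each cell
    return g

def E(func):
    return sum(func[w] * P[w] for w in Omega)

g = cond_exp(f)
assert E(g) == E(f)                      # property (1): E[E[f|Sigma]] = E[f]
assert E({w: f[w] ** 2 for w in Omega}) >= E({w: g[w] ** 2 for w in Omega})  # Jensen
print(g)
```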
In a somewhat more familiar context, if µ = λ1 × λ2 is a product measure
on (Ω, F) = (Ω1 × Ω2, F1 × F2) and we take Σ = {A × Ω2 : A ∈ F1}, then
for any function f(ω) = f(ω1, ω2), E[f(·)|Σ] = g(ω), where g(ω) = g(ω1) is
given by

    g(ω1) = ∫_{Ω2} f(ω1, ω2) dλ2
so that the conditional expectation is just integrating the unwanted variable
ω2 . We can go one step more. If φ(x, y) is the joint density on R2 of two
random variables X, Y (with respect to the Lebesgue measure on R2 ), and
ψ(x) is the marginal density of X given by
    ψ(x) = ∫_{−∞}^{∞} φ(x, y) dy

then for any integrable function f (x, y)


    E[f(X, Y)|X] = E[f(·, ·)|Σ] = [ ∫_{−∞}^{∞} f(x, y) φ(x, y) dy ] / ψ(x)
where Σ is the σ-field of vertical strips A × (−∞, ∞) with a measurable
horizontal base A.
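The same computation in a discrete setting: a joint pmf p(x, y) plays the role of the density φ(x, y), its x-marginal plays the role of ψ(x), and the conditional expectation is a ratio of sums. The pmf below is invented:

```python
from fractions import Fraction as F

# Invented joint pmf of (X, Y) on {0,1} x {0,1}.
p = {(0, 0): F(1, 8), (0, 1): F(3, 8), (1, 0): F(1, 4), (1, 1): F(1, 4)}

def cond_exp_given_x(f, x):
    # discrete analogue of  int f(x,y) phi(x,y) dy / psi(x)
    num = sum(f(xx, y) * q for (xx, y), q in p.items() if xx == x)
    den = sum(q for (xx, y), q in p.items() if xx == x)   # the marginal psi(x)
    return num / den

e0 = cond_exp_given_x(lambda x, y: y, 0)   # E[Y | X = 0]
e1 = cond_exp_given_x(lambda x, y: y, 1)   # E[Y | X = 1]
print(e0, e1)
```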

Exercise 4.8. If f is already Σ measurable then E[f |Σ] = f . This suggests


that the map f → g = E[f |Σ] is some sort of a projection. In fact if
we consider the Hilbert space H = L2 [Ω, F , µ] of all F measurable square
integrable functions with an inner product
    <f, g>_µ = ∫ f g dµ

then
H0 = L2 [Ω, Σ, µ] ⊂ H = L2 [Ω, F , µ]
and f → E[f |Σ] is seen to be the same as the orthogonal projection from H
onto H0 . Prove it.
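A finite-dimensional check of this projection picture: with Σ generated by a partition, the difference f − E[f|Σ] is orthogonal in L²(µ) to every Σ-measurable h, i.e. to every h constant on the cells (the numbers below are ours):

```python
from fractions import Fraction as F

# Orthogonality of f - E[f|Sigma] to Sigma-measurable functions.
P = {w: F(1, 4) for w in range(4)}
partition = [{0, 1}, {2, 3}]
f = {0: F(5), 1: F(1), 2: F(2), 3: F(8)}

g = {}
for cell in partition:
    avg = sum(f[w] * P[w] for w in cell) / sum(P[w] for w in cell)
    for w in cell:
        g[w] = avg                        # g = E[f|Sigma]

# <f - g, h> vanishes for every h constant on the cells of the partition.
for c0 in range(-2, 3):
    for c1 in range(-2, 3):
        h = {0: F(c0), 1: F(c0), 2: F(c1), 3: F(c1)}
        assert sum((f[w] - g[w]) * h[w] * P[w] for w in P) == 0
print(g)
```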
Exercise 4.9. If F1 ⊂ F2 ⊂ F are two sub σ-fields of F and X is any in-
tegrable function, we can define Xi = E[X|Fi] for i = 1, 2. Show that
X1 = E[X2 |F1 ] a.e.
Conditional expectation is then the best nonlinear predictor if the loss
function is the expected (mean) square error.

4.3 Conditional Probability


We now turn our attention to conditional probability. If we take f = χB (ω)
then E[f |Σ] = P (ω, B) is called the conditional probability of B given Σ. It
is characterized by the property that it is Σ measurable as a function of ω
and for any A ∈ Σ,

    µ(A ∩ B) = ∫_A P(ω, B) dµ.

Theorem 4.6. P (·, ·) has the following properties.

1. P (ω, Ω) = 1, P (ω, Φ) = 0 a.e.

2. For any B ∈ F , 0 ≤ P (ω, B) ≤ 1 a.e.

3. For any countable collection {Bj } of disjoint sets in F ,


    P(ω, ∪j Bj) = Σj P(ω, Bj)   a.e.

4. If B ∈ Σ, P (ω, B) = χB (ω) a.e.

Proof. All are easy consequences of properties of conditional expectations.
Property (3) perhaps needs an explanation. If E[|fn − f|] → 0, then by the
properties of conditional expectation E[|E{fn|Σ} − E{f|Σ}|] → 0. Property
(3) is an easy consequence of this.

The problem with the above theorem is that every property is valid only
almost everywhere. There are exceptional sets of measure zero for each case.
While each null set or a countable number of them can be ignored we have an
uncountable number of null sets and we would like a single null set outside
which all the properties hold. This means constructing a good version of the
conditional probability. It may not be always possible. If possible, such a
version is called a regular conditional probability. The existence of such a
regular version depends on the space (Ω, F ) and the sub σ-field Σ being nice.
If Ω is a complete separable metric space and F are its Borel sets, and if Σ
is any countably generated sub σ-field of F , then it is nice enough. We will
prove it in the special case when Ω = [0, 1] is the unit interval and F are the
Borel subsets B of [0, 1]. Σ can be any countably generated sub σ-field of F .
Remark 4.6. In fact the case is not so special. There is a theorem [6] which
states that if (Ω, F) is any complete separable metric space that has an un-
countable number of points, then there is a one-to-one measurable map with
a measurable inverse between (Ω, F) and ([0, 1], B). There is no loss of gen-
erality in assuming that (Ω, F) is just ([0, 1], B).

Theorem 4.7. Let P be a probability distribution on ([0, 1], B). Let Σ ⊂ B


be a sub σ-field. There exists a family of probability distributions Qx on
([0, 1], B) such that for every A ∈ B, Qx (A) is Σ measurable and for every B
measurable f ,
    ∫ f(y) Qx(dy) = E^P[f(·)|Σ]   a.e. P.                           (4.3)

If in addition Σ is countably generated, i.e. there is a field Σ0 consisting of a


countable number of Borel subsets of [0, 1] such that the σ-field generated by
Σ0 is Σ, then

Qx (A) = 1A (x) for all A ∈ Σ. (4.4)



Proof. The trick is not to be too ambitious in the first place but try to
construct the conditional expectations

    Q(ω, B) = E{χB(ω)|Σ}

only for sets B given by B = (−∞, x) for rational x. We denote our conditional
expectation, which is in fact a conditional probability, by F(ω, x). By the
properties of conditional expectations, for any pair of rationals x < y, there
is a null set Ex,y such that for ω ∉ Ex,y

F (ω, x) ≤ F (ω, y).

Moreover for any rational x < 0, there is a null set Nx outside which
F(ω, x) = 0, and similar null sets Nx for x > 1, outside which F(ω, x) = 1.
If we collect all these null sets, of which there are only countably many, and
take their union, we get a null set N ∈ Σ such that for ω ∉ N we have
a family F (ω, x) defined for rational x that satisfies

F (ω, x) ≤ F (ω, y) if x < y are rational


F (ω, x) = 0 for rational x < 0
F (ω, x) = 1 for rational x > 1
Z
P (A ∩ [0, x]) = F (ω, x) dP for all A ∈ Σ.
A

For ω ∉ N and real y we can define

    G(ω, y) = lim_{x↓y, x rational} F(ω, x).

For ω ∉ N, G is a right continuous nondecreasing function (distribution


function) with G(ω, y) = 0 for y < 0 and G(ω, y) = 1 for y ≥ 1. There is
then a probability measure Q̂(ω, B) on the Borel subsets of [0, 1] such that
Q̂(ω, [0, y]) = G(ω, y) for all y. Q̂ is our candidate for regular conditional
probability. Clearly Q̂(ω, I) is Σ measurable for all intervals I and by stan-
dard arguments will continue to be Σ measurable for all Borel sets B ∈ F .
If we check that

    P(A ∩ [0, x]) = ∫_A G(ω, x) dP   for all A ∈ Σ

for all 0 ≤ x ≤ 1, then

    P(A ∩ I) = ∫_A Q̂(ω, I) dP   for all A ∈ Σ

for all intervals I and by standard arguments this will extend to finite disjoint
unions of half open intervals that constitute a field and finally to the σ-field
F generated by that field. To verify that for all real y,

    P(A ∩ [0, y]) = ∫_A G(ω, y) dP   for all A ∈ Σ,

we start from

    P(A ∩ [0, x]) = ∫_A F(ω, x) dP   for all A ∈ Σ,

valid for rational x, and let x ↓ y through rationals. From the countable
additivity of P the left hand side converges to P(A ∩ [0, y]), and by the
bounded convergence theorem the right hand side converges to ∫_A G(ω, y) dP,
and we are done.
Finally from the uniqueness of the conditional expectation if A ∈ Σ
Q̂(ω, A) = χA (ω)
provided ω ∈/ NA , which is a null set that depends on A. We can take a
countable set Σ0 of generators A that forms a field and get a single null set
N such that if ω ∈
/N
Q̂(ω, A) = χA (ω)
for all A ∈ Σ0. Since both sides are countably additive measures in A, and as
they agree on Σ0, they have to agree on Σ as well.
Exercise 4.10. (Disintegration Theorem.) Let µ be a probability measure on
the plane R² with a marginal distribution α for the first coordinate. In other
words, α is such that, for any f that is a bounded measurable function of x,

    ∫_{R²} f(x) dµ = ∫_R f(x) dα.

Show that there exist measures βx depending measurably on x such that
βx[{x} × R] = 1, i.e. βx is supported on the vertical line {(x, y) : y ∈ R}
through x, and µ = ∫ βx dα. The converse is of course easier: given α and the
βx we can construct a unique µ that disintegrates as expected.

4.4 Markov Chains

One of the ways of generating a sequence of dependent random variables is


to think of a system evolving in time. We have time points that are discrete
say T = 0, 1, · · · , N, · · · . The state of the system is described by a point
x in the state space X of the system. The state space X comes with a
natural σ-field of subsets F . At time 0 the system is in a random state and
its distribution is specified by a probability distribution µ0 on (X , F ). At
successive times T = 1, 2, · · · , the system changes its state, and given the past
history (x0, · · · , xk−1) of the states of the system at times T = 0, · · · , k − 1,
the probability that the system finds itself at time k in a subset A ∈ F is given
by πk(x0, · · · , xk−1; A). For each (x0, · · · , xk−1), πk defines a probability
measure on (X , F ) and for each A ∈ F , πk (x0 , · · · , xk−1 ; A ) is assumed to
be a measurable function of (x0 , · · · , xk−1 ), on the space (X k , F k ) which is
the product of k copies of the space (X , F ) with itself. We can inductively
define measures µk on (X k+1, F k+1) that describe the probability distribution
of the entire history (x0 , · · · , xk ) of the system through time k. To go from
µk−1 to µk we think of (X k+1, F k+1) as the product of (X k , F k ) with (X , F )
and construct on (X k+1, F k+1 ) a probability measure with marginal µk−1 on
(X k , F k ) and conditionals πk (x0 , · · · , xk−1 ; ·) on the fibers (x0 , · · · , xk−1 ) × X .
This will define µk and the induction can proceed. We may stop at some finite
terminal time N or go on indefinitely. If we do go on indefinitely, we will have
a consistent family of finite dimensional distributions {µk } on (X k+1 , F k+1 )
and we may try to use Kolmogorov’s theorem to construct a probability
measure P on the space (X ∞ , F ∞ ) of sequences {xj : j ≥ 0} representing
the total evolution of the system for all times.
Remark 4.7. However Kolmogorov’s theorem requires some assumptions on
(X , F ) that are satisfied if X is a complete separable metric space and F
are the Borel sets. However, in the present context, there is a result known
as Tulcea’s theorem (see [8]) that proves the existence of a P on (X ∞ , F ∞ )
for any choice of (X , F ), exploiting the fact that the consistent family of
finite dimensional distributions µk arise from well defined successive regular
conditional probability distributions.
An important subclass is generated when the transition probability depends
on the past history only through the current state. In other words
πk (x0 , · · · , xk−1 ; ·) = πk−1,k (xk−1 ; ·).

In such a case the process is called a Markov Process with transition prob-
abilities πk−1,k (·, ·). An even smaller subclass arises when we demand that
πk−1,k (·, ·) be the same for different values of k. A single transition proba-
bility π(x, A) and the initial distribution µ0 determine the entire process i.e.
the measure P on (X ∞ , F ∞ ). Such processes are called time-homogeneous
Markov Proceses or Markov Processes with stationary transition probabili-
ties.
Chapman-Kolmogorov Equations. If we have the transition probabili-
ties πk,k+1 of transition from time k to k + 1 of a Markov Chain, it is possible
to obtain directly the transition probabilities from time k to k + ℓ for any
ℓ ≥ 2. We do it by induction on ℓ. Define

    πk,k+ℓ+1(x, A) = ∫_X πk,k+ℓ(x, dy) πk+ℓ,k+ℓ+1(y, A)              (4.5)

or equivalently, in a more direct fashion,

    πk,k+ℓ+1(x, A) = ∫_X · · · ∫_X πk,k+1(x, dyk+1) · · · πk+ℓ,k+ℓ+1(yk+ℓ, A).

Theorem 4.8. The transition probabilities πk,m (·, ·) satisfy the relations
    πk,n(x, A) = ∫_X πk,m(x, dy) πm,n(y, A)                          (4.6)

for any k < m < n and for the Markov Process defined by the one step
transition probabilities πk,k+1 (·, ·), for any n > m

P [xn ∈ A|Σm ] = πm,n (xm , A) a.e.

where Σm is the σ-field of past history up to time m generated by the coordi-
nates x0 , x1 , · · · , xm .
Proof. The identity is basically algebra. The multiple integral can be carried
out by iteration in any order and after enough variables are integrated we
get our identity. To prove that the conditional probabilities are given by the
right formula we need to establish
Z
P [{xn ∈ A} ∩ B] = πm,n (xm , A) dP
B

for all B ∈ Σm and A ∈ F. We write

    P[{xn ∈ A} ∩ B] = ∫_{{xn∈A}∩B} dP
        = ∫ · · · ∫_{{xn∈A}∩B} dµ(x0) π0,1(x0, dx1) · · · πm−1,m(xm−1, dxm)
              πm,m+1(xm, dxm+1) · · · πn−1,n(xn−1, dxn)
        = ∫ · · · ∫_B dµ(x0) π0,1(x0, dx1) · · · πm−1,m(xm−1, dxm)
              πm,m+1(xm, dxm+1) · · · πn−1,n(xn−1, A)
        = ∫ · · · ∫_B dµ(x0) π0,1(x0, dx1) · · · πm−1,m(xm−1, dxm) πm,n(xm, A)
        = ∫_B πm,n(xm, A) dP

and we are done.

Remark 4.8. If the chain has stationary transition probabilities then the
transition probabilities πm,n (x, dy) from time m to time n depend only on
the difference k = n − m and are given by what are usually called the k step
transition probabilities. They are defined inductively by
    π^(k+1)(x, A) = ∫_X π^(k)(x, dy) π(y, A)

and satisfy the Chapman-Kolmogorov equations

    π^(k+ℓ)(x, A) = ∫_X π^(k)(x, dy) π^(ℓ)(y, A) = ∫_X π^(ℓ)(x, dy) π^(k)(y, A).
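For a finite state space the k-step transition probabilities are powers of the one-step transition matrix, and the stationary Chapman-Kolmogorov equations reduce to associativity of matrix multiplication; a sketch with an invented 3-state chain:

```python
from fractions import Fraction as F

# Invented 3-state transition matrix; each row is a probability measure.
pi = [[F(1, 2), F(1, 2), F(0)],
      [F(1, 4), F(1, 2), F(1, 4)],
      [F(0), F(1, 3), F(2, 3)]]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def power(A, k):
    # k-step transition probabilities pi^(k), k >= 2
    M = A
    for _ in range(k - 1):
        M = matmul(M, A)
    return M

lhs = power(pi, 5)                          # pi^(5)
rhs = matmul(power(pi, 2), power(pi, 3))    # pi^(2) pi^(3): Chapman-Kolmogorov
assert lhs == rhs
assert all(sum(row) == 1 for row in lhs)    # rows remain probability measures
print(lhs[0])
```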

Suppose we have a probability measure P on the product space X ×Y ×Z


with the product σ-field. The Markov property in this context refers to the
equality

E P [g(z)|Σx,y ] = E P [g(z)|Σy ] a.e. P (4.7)

for bounded measurable functions g on Z, where we have used Σx,y to denote


the σ-field generated by projection on to X × Y and Σy the corresponding

σ-field generated by projection on to Y . The Markov property in the reverse


direction is the similar condition for bounded measurable functions f on X.

E P [f (x)|Σy,z ] = E P [f (x)|Σy ] a.e. P (4.8)

They look different. But they are both equivalent to the symmetric condition

E P [f (x)g(z)|Σy ] = E P [f (x)|Σy ]E P [g(z)|Σy ] a.e. P (4.9)

which says that given the present, the past and future are conditionally
independent. In view of the symmetry it is sufficient to prove the following:
Theorem 4.9. For any P on (X × Y × Z) the relations (4.7) and (4.9) are
equivalent.
Proof. Let us fix f and g. Let us denote the common value in (4.7) by ĝ(y).
Then

    E^P[f(x)g(z)|Σy] = E^P[E^P[f(x)g(z)|Σx,y]|Σy]        a.e. P
                     = E^P[f(x) E^P[g(z)|Σx,y]|Σy]       a.e. P
                     = E^P[f(x) ĝ(y)|Σy]                 a.e. P   (by (4.7))
                     = E^P[f(x)|Σy] ĝ(y)                 a.e. P
                     = E^P[f(x)|Σy] E^P[g(z)|Σy]         a.e. P

which is (4.9). Conversely, we assume (4.9) and denote by ḡ(x, y) and ĝ(y)
the expressions on the left and right side of (4.7). Let b(y) be a bounded
measurable function on Y .

    E^P[f(x)b(y)ḡ(x, y)] = E^P[f(x)b(y)g(z)]
                         = E^P[b(y) E^P[f(x)g(z)|Σy]]
                         = E^P[b(y) E^P[f(x)|Σy] E^P[g(z)|Σy]]
                         = E^P[b(y) E^P[f(x)|Σy] ĝ(y)]
                         = E^P[f(x)b(y)ĝ(y)].

Since f and b are arbitrary this implies that ḡ(x, y) = ĝ(y) a.e. P .

Let us look at some examples.



1. Suppose we have an urn containg a certain number of balls (nonzero)


some red and others green. A ball is drawn at random and its color
is noted. Then it is returned to the urn along with an extra ball of
the same color. Then a new ball is drawn at random and the process
continues ad infinitum. The current state of the system can be charac-
terized by two integers r, g such that r + g ≥ 1. The initial state if the
system is some r0 , g0 with r0 + g0 ≥ 1. The system can go from (r, g)
r
to either (r + 1, g) with probability r+g or to (r, g + 1) with probability
g
r+g
. This is clearly an example of a Markov Chain with stationary
transition probabilities.

2. Consider a queue for service in a store. Suppose at each of the times 1, 2, · · · , a random number of new customers arrive and join the queue. If the queue is nonempty at some time, then exactly one customer will be served and will leave the queue at the next time point. The distribution of the number of new arrivals is specified by {pj : j ≥ 0}, where pj is the probability that exactly j new customers arrive at a given time. The numbers of new arrivals at distinct times are assumed to be independent. The queue length is a Markov Chain on the state space X = {0, 1, · · · } of nonnegative integers. The transition probabilities π(i, j) are given by π(0, j) = pj , because there is no service when nobody is in the queue to begin with and all the new arrivals join the queue. On the other hand π(i, j) = pj−i+1 if j + 1 ≥ i ≥ 1, because one person leaves the queue after being served.
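The transition rule can be checked against a direct simulation of the queue; the helper below is our own illustrative sketch:

```python
def queue_path(arrivals, x0=0):
    """Queue lengths X_0, X_1, ... given the sequence of new arrivals.

    From state i >= 1 one customer is served, so X_{n+1} = X_n - 1 + xi;
    from state 0 nobody is served, so X_{n+1} = xi."""
    path = [x0]
    for xi in arrivals:
        x = path[-1]
        path.append((x - 1 if x > 0 else 0) + xi)
    return path

# With arrivals 2, 0, 0, 1, 0 starting from an empty queue:
print(queue_path([2, 0, 0, 1, 0]))  # [0, 2, 1, 0, 1, 0]
```

Note that the queue empties twice in this short run; these empty periods are what make the boundary rule π(0, j) = pj necessary.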

3. Consider a reservoir into which water flows. The amount of additional


water flowing into the reservoir on any given day is random, and has
a distribution α on [0, ∞). The demand is also random for any given
day, with a probability distribution β on [0, ∞). We may also assume
that the inflows and demands on successive days are random variables
ξn and ηn , that have α and β for their common distributions and are all
mutually independent. We may wish to assume a percentage loss due
to evaporation. In any case the storage level at successive days have a
recurrence relation

Sn+1 = [(1 − p)Sn + ξn − ηn ]+

Here p is the loss fraction, and we have imposed the condition that the outflow equals the demand unless the stored amount is less than the demand, in which case the outflow is the available quantity. The current amount in storage is a Markov process with stationary transition probabilities.

4. Let X1 , · · · , Xn , · · · be a sequence of independent random variables


with a common distribution α. Let Sn = Y + X1 + · · · + Xn for
n ≥ 1 with S0 = Y where Y is a random variable independent of
X1 , . . . , Xn , . . . with distribution µ. Then Sn is a Markov chain on
R with one step transition probability π(x, A) = α(A − x) and initial
distribution µ. The n step transition probability is αn (A − x) where
αn is the n-fold convolution of α. This is often referred to as a random
walk.

The last two examples can be described by models of the type

xn = f (xn−1 , ξn )
where xn is the current state and ξn is some random external disturbance. ξn
are assumed to be independent and identically distributed. They could have
two components like inflow and demand. The new state is a deterministic
function of the old state and the noise.
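A minimal sketch of this xn = f (xn−1 , ξn ) representation, taking the reservoir recursion of Example 3 as the update function f (the function names and the specific noise values are ours):

```python
def reservoir_update(s, noise, p=0.1):
    """One step S_{n+1} = [(1 - p) S_n + xi_n - eta_n]^+ ,
    where noise = (xi_n, eta_n) is the (inflow, demand) pair."""
    xi, eta = noise
    return max((1 - p) * s + xi - eta, 0.0)

def run_chain(x0, noises, f):
    """Iterate x_n = f(x_{n-1}, xi_n) over a given noise sequence."""
    xs = [x0]
    for z in noises:
        xs.append(f(xs[-1], z))
    return xs

levels = run_chain(1.0, [(0.5, 2.0), (3.0, 0.5)], reservoir_update)
# First step empties the reservoir: max(0.9 + 0.5 - 2.0, 0) = 0.
```

Any of the examples above fits this template by an appropriate choice of f and of the distribution of the i.i.d. noise ξn.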

Exercise 4.11. Verify that the first two examples can be cast in the above
form. In fact there is no loss of generality in assuming that ξj are mutually
independent random variables having as common distribution the uniform
distribution on the interval [0, 1].

Given a Markov Chain with stationary transition probabilities π(x, dy) on a state space (X , F ), the behavior of π (n) (x, dy) for large n is an important and natural question. In the best situation of independent random variables, π (n) (x, A) = µ(A) is independent of x as well as n. Hopefully after a long time the Chain will ‘forget’ its origins and π (n) (x, ·) → µ(·), in some suitable sense, for some µ that does not depend on x. If that happens, then from the relation

    π (n+1) (x, A) = ∫ π (n) (x, dy) π(y, A),

we conclude

    µ(A) = ∫ π(y, A) dµ(y)    for all A ∈ F.
Measures that satisfy the above property, abbreviated as µπ = µ, are called invariant measures for the Markov Chain. If we start with an initial distribution µ which is invariant, then the probability measure P has µ as its marginal at every time. In fact P is stationary, i.e., invariant with respect to time translation, and can be extended to a stationary process where time runs from −∞ to +∞.
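For a chain with finitely many states, µπ = µ is just a linear system, and the stationarity of the marginals can be watched directly. A small sketch with a two-state chain of our own choosing:

```python
# Two-state chain; row x of pi holds the transition probabilities pi(x, .).
pi = [[0.9, 0.1],
      [0.2, 0.8]]

# mu pi = mu with mu a probability vector gives mu = (2/3, 1/3).
mu = [2/3, 1/3]

def step(dist, pi):
    """One time step of the marginal distribution: (dist pi)(y)."""
    n = len(pi)
    return [sum(dist[x] * pi[x][y] for x in range(n)) for y in range(n)]

d = mu
for _ in range(5):
    d = step(d, pi)
# Starting from the invariant mu, the marginal never moves.
```

Starting instead from any other initial distribution, repeated application of `step` drifts toward the same µ, which is the convergence question taken up for countable state spaces below.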

4.5 Stopping Times and Renewal Times


One of the important notions in the analysis of Markov Chains is the idea of
stopping times and renewal times. A function

τ (ω) : Ω → {n : n ≥ 0}

is a random variable defined on the set Ω = X ∞ such that for every n ≥ 0 the
set {ω : τ (ω) = n} (or equivalently for each n ≥ 0 the set {ω : τ (ω) ≤ n})
is measurable with respect to the σ-field Fn generated by Xj : 0 ≤ j ≤ n.
It is not necessary that τ (ω) < ∞ for every ω. Such random variables τ are
called stopping times. Examples of stopping times are constant times n ≥ 0,
the first visit to a state x, or the second visit to a state x. The important
thing is that in order to decide whether τ ≤ n, i.e. to know whether whatever
is supposed to happen did happen before time n, the chain need be observed
only up to time n. Examples of τ that are not stopping times are easy to find.
The last time a site is visited is not a stopping time, nor is the first time
such that at the next time one is in a state x. An important fact is that the
Markov property extends to stopping times. Just as we have σ-fields Fn
associated with constant times, we have a σ-field Fτ associated to any
stopping time. This is the information we have when we observe the chain up
to time τ . Formally
    Fτ = { A : A ∈ F and A ∩ {τ ≤ n} ∈ Fn for each n }

One can check from the definition that τ is Fτ -measurable and so is Xτ on the set {τ < ∞}. If τ is the time of first visit to y, then τ is a stopping time, and the event that the chain visits a state z before visiting y is Fτ -measurable.
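The defining property — that {τ ≤ n} is decided by the path observed up to time n — is easy to see computationally. A sketch contrasting a stopping time with a non-stopping time (the function names are ours):

```python
def first_visit(path, y):
    """tau = first n with X_n = y; decidable from path[:n+1] alone."""
    for n, x in enumerate(path):
        if x == y:
            return n
    return None          # tau is "infinite" on this (finite) path

def last_visit(path, y):
    """NOT a stopping time: deciding it needs the whole future path."""
    hits = [n for n, x in enumerate(path) if x == y]
    return hits[-1] if hits else None

path = [0, 1, 2, 1, 2, 0]
assert first_visit(path, 2) == 2
# Observing only path[:3] already settles {tau <= 2}, whereas
# last_visit(path, 2) == 4 would change if the path were extended.
```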

Lemma 4.10. ( Strong Markov Property.) At any stopping time τ


the Markov property holds in the sense that the conditional distribution of Xτ +1 , · · · , Xτ +n , · · · conditioned on Fτ is the same as that of the original chain starting from the state x = Xτ , on the set {τ < ∞}. In other words

    Px {Xτ +1 ∈ A1 , · · · , Xτ +n ∈ An |Fτ } = ∫_{A1} · · · ∫_{An} π(Xτ , dx1 ) · · · π(xn−1 , dxn )

a.e. on {τ < ∞}.

Proof. Let A ∈ Fτ be given with A ⊂ {τ < ∞}. Then

    Px {A ∩ {Xτ +1 ∈ A1 , · · · , Xτ +n ∈ An }}
      = Σ_k Px {A ∩ {τ = k} ∩ {Xk+1 ∈ A1 , · · · , Xk+n ∈ An }}
      = Σ_k ∫_{A∩{τ =k}} ∫_{A1} · · · ∫_{An} π(Xk , dxk+1 ) · · · π(xk+n−1 , dxk+n ) dPx
      = ∫_A ∫_{A1} · · · ∫_{An} π(Xτ , dx1 ) · · · π(xn−1 , dxn ) dPx .

We have used the fact that if A ∈ Fτ then A ∩ {τ = k} ∈ Fk for every k ≥ 0.

Remark 4.9. If Xτ = y a.e. with respect to Px on the set {τ < ∞}, then at time τ , when it is finite, the process starts afresh with no memory of the past and will have conditionally the same probabilities in the future as Py . At such times the process renews itself, and these times are called renewal times.

4.6 Countable State Space


From the point of view of analysis a particularly simple situation is when the
state space X is a countable set. It can be taken as the integers {x : x ≥ 1}.
Many applications fall in this category and an understanding of what happens
in this situation will tell us what to expect in general.
The one step transition probability is a matrix π(x, y) with nonnegative entries such that Σ_y π(x, y) = 1 for each x. Such matrices are called stochastic matrices. The n step transition matrix is just the n-th power of the matrix, defined inductively by

    π (n+1) (x, y) = Σ_z π (n) (x, z)π(z, y).

To be consistent one defines π (0) (x, y) = δx,y which is 1 if x = y and 0


otherwise. The problem is to analyse the behaviour for large n of π (n) (x, y). A
state x is said to communicate with a state y if π (n) (x, y) > 0 for some n ≥ 1.
We will assume for simplicity that every state communicates with every other
state. Such Markov Chains are called irreducible. Let us first limit ourselves
to the study of irreducible chains. Given an irreducible Markov chain with
transition probabilities π(x, y) we define fn (x) as the probability of returning
to x for the first time at the n-th step, assuming that the chain starts from the state x. Using the convention that Px refers to the measure on sequences
for the chain starting from x and {Xj } are the successive positions of the
chain
 
    fn (x) = Px { Xj ≠ x for 1 ≤ j ≤ n − 1 and Xn = x }
           = Σ_{y1 ≠x, ··· , yn−1 ≠x} π(x, y1 ) π(y1 , y2 ) · · · π(yn−1 , x)

Since the fn (x) are probabilities of disjoint events, Σ_n fn (x) ≤ 1. The state x is called transient if Σ_n fn (x) < 1 and recurrent if Σ_n fn (x) = 1. The recurrent case is divided into two situations. If we denote by τx = inf{n ≥ 1 : Xn = x} the time of first visit to x, then recurrence is the statement Px {τx < ∞} = 1. A recurrent state x is called positive recurrent if

    E Px {τx } = Σ_{n≥1} n fn (x) < ∞

and null recurrent if

    E Px {τx } = Σ_{n≥1} n fn (x) = ∞.
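The fn (x) can be computed from the n-step return probabilities by peeling off the first return, since π (n) (x, x) = Σ_k fk (x) π (n−k) (x, x). A sketch for a two-state chain (the matrix is our own example), checking both Σ fn = 1 and the mean return time Σ n fn :

```python
def first_return_probs(pi, x, N):
    """f_1(x), ..., f_N(x), recovered from p_n = pi^(n)(x,x) via
    f_n = p_n - sum_{k=1}^{n-1} f_k p_{n-k}  (with p_0 = 1)."""
    m = len(pi)
    p = [1.0]
    dist = [1.0 if j == x else 0.0 for j in range(m)]
    for _ in range(N):
        dist = [sum(dist[i] * pi[i][j] for i in range(m)) for j in range(m)]
        p.append(dist[x])
    f = [0.0]
    for n in range(1, N + 1):
        f.append(p[n] - sum(f[k] * p[n - k] for k in range(1, n)))
    return f[1:]

pi = [[0.9, 0.1], [0.2, 0.8]]
f = first_return_probs(pi, 0, 200)
total = sum(f)                                          # close to 1: recurrent
mean = sum(n * fn for n, fn in enumerate(f, start=1))   # E^{P_0}(tau_0) = 1.5
```

Here f_1 = 0.9 and f_n = 0.1 · 0.2 · 0.8^{n−2} for n ≥ 2, so the state is positive recurrent with mean return time 1.5.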

Lemma 4.11. If for a (not necessarily irreducible) chain starting from x, the
probability of ever visiting y is positive then so is the probability of visiting y
before returning to x.
Proof. Assume that for the chain starting from x the probability of visiting y before returning to x is zero. But when it returns to x it starts afresh and so will not visit y until it returns again. This reasoning can be repeated, and so the chain will have to visit x infinitely often before visiting y. But this will use up all the time, and so it cannot visit y at all.
Lemma 4.12. For an irreducible chain all states x are of the same type.
Proof. Let x be recurrent and y be given. Since the chain is irreducible, for some k, π (k) (x, y) > 0. By the previous lemma, for the chain starting from x, there is a positive probability of visiting y before returning to x. After each successive return to x, the chain starts afresh and there is a fixed positive probability of visiting y before the next return to x. Since there are infinitely many returns to x, y will be visited infinitely many times as well. In other words, y is also a recurrent state.
We now prove that if x is positive recurrent then so is y. We saw already
that the probability p = Px {τy < τx } of visiting y before returning to x is
positive. Clearly
E Px {τx } ≥ Px {τy < τx } E Py {τx }
and therefore

    E Py {τx } ≤ (1/p) E Px {τx } < ∞.
On the other hand we can write

    E Px {τy } ≤ ∫_{τy <τx } τx dPx + ∫_{τx <τy } τy dPx
              = ∫_{τy <τx } τx dPx + ∫_{τx <τy } {τx + E Px {τy }} dPx
              = ∫_{τy <τx } τx dPx + ∫_{τx <τy } τx dPx + (1 − p) E Px {τy }
              = E Px {τx } + (1 − p) E Px {τy }

by the renewal property at the stopping time τx . Therefore

    E Px {τy } ≤ (1/p) E Px {τx }.
We also have

    E Py {τy } ≤ E Py {τx } + E Px {τy } ≤ (2/p) E Px {τx }

proving that y is positive recurrent.

Transient Case: We have the following theorem regarding transience.

Theorem 4.13. An irreducible chain is transient if and only if

    G(x, y) = Σ_{n=0}^∞ π (n) (x, y) < ∞ for all x, y.

Moreover for any two states x and y,

    G(x, y) = f (x, y)G(y, y)

and

    G(x, x) = 1 / (1 − f (x, x))

where f (x, y) = Px {τy < ∞}.
Proof. Each time the chain returns to x there is a probability 1 − f (x, x) of never returning. The number of returns then has the geometric distribution

    Px { exactly n returns to x} = (1 − f (x, x))f (x, x)^n

and the expected number of returns is given by

    Σ_{k=1}^∞ π (k) (x, x) = f (x, x) / (1 − f (x, x)).

The left hand side comes from the calculation

    E Px [ Σ_{k=1}^∞ χ{x} (Xk ) ] = Σ_{k=1}^∞ π (k) (x, x)

and the right hand side from the calculation of the mean of a geometric distribution. Since we count the visit at time 0 as a visit to x, we add 1 to both sides to get our formula. If we want to calculate the expected number of visits to y when we start from x, first we have to get to y, and the probability of that is f (x, y). Then by the renewal property it is exactly the same as the expected number of visits to y starting from y, including the visit at time 0, and that equals G(y, y).
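For a concrete transient chain these quantities are computable. Take the biased random walk on Z that steps +1 with probability p = 0.7 (our example, using the classical facts that returns to 0 happen only at even times with π (2n) (0, 0) = C(2n, n)(pq)^n, and that the series sums to 1/√(1 − 4pq)):

```python
from math import comb, sqrt

def green_function(p, terms=400):
    """Truncated G(0,0) = sum_n pi^(n)(0,0) for the walk stepping
    +1 with probability p and -1 with probability q = 1 - p."""
    q = 1.0 - p
    return sum(comb(2 * n, n) * (p * q) ** n for n in range(terms))

p = 0.7
G = green_function(p)
G_exact = 1.0 / sqrt(1.0 - 4 * p * (1 - p))   # = 2.5 here
f = 1.0 - 1.0 / G                             # return probability f(0,0)
```

With p = 0.7 one finds G(0, 0) = 2.5 and hence f (0, 0) = 1 − 1/G(0, 0) = 0.6, consistent with the formula G(x, x) = 1/(1 − f (x, x)).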
4.6. COUNTABLE STATE SPACE 127

Before we study the recurrent behavior we need the notion of periodicity.


For each state x let us define Dx = {n : π (n) (x, x) > 0} to be the set of times
at which a return to x is possible if one starts from x. We define dx to be
the greatest common divisor of Dx .

Lemma 4.14. For any irreducible chain dx = d for all x ∈ X and for each
x, Dx contains all sufficiently large multiples of d.

Proof. Let us define


Dx,y = {n : π (n) (x, y) > 0}
so that Dx = Dx,x . By the Chapman-Kolmogorov equations

π (m+n) (x, y) ≥ π (m) (x, z)π (n) (z, y)

for every z, so that if m ∈ Dx,z and n ∈ Dz,y , then m + n ∈ Dx,y . In particular if m, n ∈ Dx it follows that m + n ∈ Dx . Since any pair of states communicate with each other, given x, y ∈ X , there are positive integers n1 and n2 such that n1 ∈ Dx,y and n2 ∈ Dy,x . This implies that with the choice of ℓ = n1 + n2 , n + ℓ ∈ Dx whenever n ∈ Dy ; similarly n + ℓ ∈ Dy whenever n ∈ Dx . Since ℓ itself belongs to both Dx and Dy , both dx and dy divide ℓ. Suppose n ∈ Dx . Then n + ℓ ∈ Dy and therefore dy divides n + ℓ. Since dy divides ℓ, dy must divide n. Since this is true for every n ∈ Dx and dx is the greatest common divisor of Dx , dy must divide dx . Similarly dx must divide
dy . Hence dx = dy . We now complete the proof of the lemma. Let d be the
greatest common divisor of Dx . Then it is the greatest common divisor of a
finite subset n1 , n2 , · · · , nq of Dx and there will exist integers a1 , a2 , · · · , aq
such that
a1 n1 + a2 n2 + · · · + aq nq = d
Some of the a’s will be positive and others negative. Separating them out, and remembering that all the ni are divisible by d, we find two integers md and (m + 1)d that both belong to Dx . If now n = kd with k > m², we can write k = ℓm + r with ℓ ≥ m and remainder r less than m. Then

    kd = (ℓm + r)d = ℓmd + r(m + 1)d − rmd = (ℓ − r)md + r(m + 1)d ∈ Dx

since ℓ ≥ m > r.

Remark 4.10. For an irreducible chain the common value d is called the pe-
riod of the chain and an irreducible chain with period d = 1 is called aperi-
odic.
The simplest example of a periodic chain is one with 2 states and the chain
shuttles back and forth between the two. π(x, y) = 1 if x 6= y and 0 if x = y.
A simple calculation yields π (n) (x, x) = 1 if n is even and 0 otherwise. There
is oscillatory behavior in n that persists. The main theorem for irreducible,
aperiodic, recurrent chains is the following.

Theorem 4.15. Let π(x, y) be the one step transition probability for a recurrent aperiodic Markov chain and let π (n) (x, y) be the n-step transition probabilities. If the chain is null recurrent then

    lim_{n→∞} π (n) (x, y) = 0 for all x, y.

If the chain is positive recurrent then of course E Px {τx } = m(x) < ∞ for all x, and in that case

    lim_{n→∞} π (n) (x, y) = q(y) = 1/m(y)

exists for all x and y, is independent of the starting point x, and Σ_y q(y) = 1.

The proof is based on

Theorem 4.16. (Renewal Theorem.) Let {fn : n ≥ 1} be a sequence of nonnegative numbers such that

    Σ_n fn = 1,    Σ_n n fn = m ≤ ∞

and the greatest common divisor of {n : fn > 0} is 1. Suppose that {pn : n ≥ 0} are defined by p0 = 1 and recursively

    pn = Σ_{j=1}^n fj pn−j                                    (4.10)

Then

    lim_{n→∞} pn = 1/m

where if m = ∞ the right hand side is taken as 0.
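The recursion (4.10) is easy to run numerically. A sketch with f1 = f2 = 1/2 (our choice: the gcd condition holds and m = 3/2), for which pn should converge to 1/m = 2/3:

```python
def renewal_sequence(f, N):
    """p_0, ..., p_N from p_0 = 1 and p_n = sum_{j=1}^n f_j p_{n-j};
    f is the finite list [f_1, f_2, ...]."""
    p = [1.0]
    for n in range(1, N + 1):
        p.append(sum(f[j - 1] * p[n - j]
                     for j in range(1, min(n, len(f)) + 1)))
    return p

p = renewal_sequence([0.5, 0.5], 50)
# m = 1*0.5 + 2*0.5 = 1.5, so p_n -> 1/m = 2/3.
```

In this example the error p_n − 2/3 oscillates with alternating sign and halves at each step, so the convergence asserted by the theorem is visible after only a few terms.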
4.6. COUNTABLE STATE SPACE 129

Proof. The proof is based on several steps.


Step 1: We have inductively pn ≤ 1. Let a = lim supn→∞ pn . We can choose a subsequence nk such that pnk → a. We can assume without loss of generality that pnk +j → qj as k → ∞ for all positive and negative integers j as well. Of course the limit q0 for j = 0 is a. In relation 4.10 we can pass to the limit along the subsequence and use the dominated convergence theorem to obtain

    qn = Σ_{j=1}^∞ fj qn−j                                    (4.11)

valid for −∞ < n < ∞. In particular

    q0 = Σ_{j=1}^∞ fj q−j                                     (4.12)

Step 2: Because a = lim sup pn we can conclude that qj ≤ a for all j. If we denote by S = {n : fn > 0}, then q−k = a for k ∈ S. We can then deduce from equation 4.11 that q−k = a for k = k1 + k2 with k1 , k2 ∈ S. By repeating the same reasoning, q−k = a for k = k1 + k2 + · · · + kℓ with ki ∈ S. By Lemma 3.6, because the greatest common factor of the integers in S is 1, there is a k0 such that for k ≥ k0 we have q−k = a. We now apply the relation 4.11 again to conclude that qj = a for all positive as well as negative j.
Step 3: If we add up equation 4.10 for n = 1, · · · , N we get

    p1 + p2 + · · · + pN = (f1 + f2 + · · · + fN ) + (f1 + f2 + · · · + fN −1 )p1
        + · · · + (f1 + f2 + · · · + fN −k )pk + · · · + f1 pN −1 .

If we denote by Tj = Σ_{i=j}^∞ fi , we have T1 = 1 and Σ_{j=1}^∞ Tj = m. We can now rewrite the above as

    Σ_{j=1}^N Tj pN −j+1 = Σ_{j=1}^N fj .

Step 4: Because pN −j+1 → a for every j along the subsequence N = nk , if Σ_j Tj = m < ∞ we can deduce from the dominated convergence theorem that m a = 1, and we conclude that

    lim sup_{n→∞} pn = 1/m.

If Σ_j Tj = ∞, by Fatou’s Lemma a = 0. Exactly the same argument applies to the liminf, and we conclude that

    lim inf_{n→∞} pn = 1/m.

This concludes the proof of the renewal theorem.

We now turn to

Proof (of Theorem 4.15). If we take a fixed x ∈ X and consider fn = Px {τx = n}, then fn and pn = π (n) (x, x) are related by (4.10), and m = E Px {τx }. In order to apply the renewal theorem we need to establish that the greatest common divisor of S = {n : fn > 0} is 1. In general if fn > 0 so is pn . So the greatest common divisor of S is always at least as large as that of {n : pn > 0}; that by itself does not help us, even though the greatest common divisor of {n : pn > 0} is 1 by aperiodicity. On the other hand if fn = 0 unless n = kd for some k, the relation 4.10 can be used inductively to conclude that the same is true of pn . Hence both sets have the same greatest common divisor. We can now conclude that
    lim_{n→∞} π (n) (x, x) = q(x) = 1/m(x).

On the other hand if fn (x, y) = Px {τy = n}, then

    π (n) (x, y) = Σ_{k=1}^n fk (x, y) π (n−k) (y, y)

and recurrence implies Σ_{k=1}^∞ fk (x, y) = 1 for all x and y. Therefore

    lim_{n→∞} π (n) (x, y) = q(y) = 1/m(y)

and is independent of x, the starting point. In order to complete the proof we have to establish that
X
Q= q(y) = 1
y

It is clear by Fatou’s lemma that


X
q(y) = Q ≤ 1
y
By letting n → ∞ in the Chapman-Kolmogorov equation

    π (n+1) (x, y) = Σ_z π (n) (x, z)π(z, y)

and using Fatou’s lemma we get

    q(y) ≥ Σ_z π(z, y)q(z).

Summing with respect to y we obtain

    Q ≥ Σ_{z,y} π(z, y)q(z) = Q

and equality holds in this relation. Therefore

    q(y) = Σ_z π(z, y)q(z)

for every y, i.e. q(·) is an invariant measure. By iteration

    q(y) = Σ_z π (n) (z, y)q(z)

and if we let n → ∞ again, an application of the bounded convergence theorem yields

    q(y) = Q q(y)

implying Q = 1, and we are done.
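For a finite chain the content of Theorem 4.15 can be watched numerically: every row of π (n) converges to q, and q(y) = 1/m(y). A small illustration with a two-state chain (the matrix is our own example, with q = (2/3, 1/3) and hence m(0) = 1.5, m(1) = 3):

```python
def n_step_row(pi, x, n):
    """Row x of pi^(n): the distribution of X_n starting from x."""
    m = len(pi)
    dist = [1.0 if j == x else 0.0 for j in range(m)]
    for _ in range(n):
        dist = [sum(dist[i] * pi[i][j] for i in range(m)) for j in range(m)]
    return dist

pi = [[0.9, 0.1],
      [0.2, 0.8]]
row0 = n_step_row(pi, 0, 60)
row1 = n_step_row(pi, 1, 60)
# Both rows approach q = (2/3, 1/3), independently of the starting point.
```

The rate here is governed by the second eigenvalue 0.7 of the matrix, so 60 steps already agree with q to many digits.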

Let us now consider an irreducible Markov Chain with one step transition
probability π(x, y) that is periodic with period d > 1. Let us choose and fix
a reference point x0 ∈ X . For each x ∈ X let Dx0 ,x = {n : π (n) (x0 , x) > 0}.

Lemma 4.17. If n1 , n2 ∈ Dx0 ,x then d divides n1 − n2 .

Proof. Since the chain is irreducible there is an m such that π (m) (x, x0 ) > 0.
By the Chapman-Kolmogorov equations π (m+ni ) (x0 , x0 ) > 0 for i = 1, 2.
Therefore m + ni ∈ Dx0 = Dx0 ,x0 for i = 1, 2. This implies that d divides
both m + n1 as well as m + n2 . Thus d divides n1 − n2 .
The residues modulo d of all the integers in Dx0 ,x are the same and equal some number r(x) satisfying 0 ≤ r(x) ≤ d − 1. By definition r(x0 ) = 0. Let us define Xj = {x : r(x) = j}. Then {Xj : 0 ≤ j ≤ d − 1} is a partition of X into disjoint sets with x0 ∈ X0 .

Lemma 4.18. If x ∈ X , then π (n) (x, y) = 0 unless r(x) + n = r(y) mod d.


Proof. Suppose that x ∈ X and π(x, y) > 0. Then if m ∈ Dx0 ,x , we have (m + 1) ∈ Dx0 ,y . Therefore r(x) + 1 = r(y) modulo d. The proof can be completed by induction. The chain marches through the {Xj } in a cyclical way, from a state in Xj to one in Xj+1 .

Theorem 4.19. Let X be irreducible and positive recurrent with period d. Then

    lim_{n→∞, n+r(x)=r(y) modulo d} π (n) (x, y) = d/m(y).

Of course

    π (n) (x, y) = 0

unless n + r(x) = r(y) modulo d.
Proof. If we replace π by π̃ where π̃(x, y) = π (d) (x, y), then π̃(x, y) = 0 unless both x and y are in the same Xj . The restriction of π̃ to each Xj defines an irreducible aperiodic Markov chain. Since each time step under π̃ is actually d units of time, we can apply the earlier results, and we get for x, y ∈ Xj for some j,

    lim_{k→∞} π (k d) (x, y) = d/m(y).

We note that

    π (n) (x, y) = Σ_{1≤m≤n} fm (x, y) π (n−m) (y, y),

    fm (x, y) = Px {τy = m} = 0 unless r(x) + m = r(y) modulo d,

    π (n−m) (y, y) = 0 unless n − m = 0 modulo d,

    Σ_m fm (x, y) = 1.

The theorem now follows.


Suppose now we have a chain that is not irreducible. Let us collect all the transient states and call this set Xtr . The complement consists of all the recurrent states and will be denoted by Xre .

Lemma 4.20. If x ∈ Xre and y ∈ Xtr , then π(x, y) = 0.

Proof. If x is a recurrent state and π(x, y) > 0, the chain will return to x infinitely often, and each time there is a positive probability of visiting y. By the renewal property these are independent events, and so y will be recurrent too.

The set of recurrent states Xre can be divided into one or more equivalence classes according to the following procedure. Two recurrent states x and y are in the same equivalence class if f (x, y) = Px {τy < ∞}, the probability of ever visiting y starting from x, is positive. Because of recurrence, if f (x, y) > 0 then f (x, y) = f (y, x) = 1. The restriction of the chain to a single equivalence class is irreducible and possibly periodic. Different equivalence classes could have different periods, and some could be positive recurrent and others null recurrent. We can combine all our observations into the following theorem.
Theorem 4.21. If y is transient then Σ_n π (n) (x, y) < ∞ for all x. If y is null recurrent (belongs to an equivalence class that is null recurrent) then π (n) (x, y) → 0 for all x, but Σ_n π (n) (x, y) = ∞ if x is in the same equivalence class or x ∈ Xtr with f (x, y) > 0. In all other cases π (n) (x, y) = 0 for all n ≥ 1. If y is positive recurrent and belongs to an equivalence class with period d, with m(y) = E Py {τy }, then for a nontransient x, π (n) (x, y) = 0 unless x is in the same equivalence class and r(x) + n = r(y) modulo d. In such a case,

    lim_{n→∞, r(x)+n=r(y) modulo d} π (n) (x, y) = d/m(y).

If x is transient then

    lim_{n→∞, n=r modulo d} π (n) (x, y) = f (r, x, y) · d/m(y)

where

    f (r, x, y) = Px {Xkd+r = y for some k ≥ 0}.

Proof. The only statement that needs an explanation is the last one. The
chain starting from a transient state x may at some time get into a positive
recurrent equivalence class Xj with period d. If it does, it never leaves that
class and so gets absorbed in that class. The probability of this is f (x, y)
where y can be any state in Xj . However if the period d is greater than
1, there will be cyclical subclasses C1 , · · · , Cd of Xj . Depending on which
subclass the chain enters and when, the phase of its future is determined.
There are d such possible phases. For instance, if the subclasses are ordered
in the correct way, getting into C1 at time n is the same as getting into
C2 at time n + 1 and so on. f (r, x, y) is the probability of getting into the
equivalence class in a phase that visits the cyclical subclass containing y at
times n that are equal to r modulo d.

Example 4.1. (Simple Random Walk).


If X = Z d , the integral lattice in R d , a random walk is a Markov chain with transition probability π(x, y) = p(y − x), where {p(z)} specifies the probability distribution of a single step. We will assume for simplicity that p(z) = 0 except when z ∈ F , where F consists of the 2d neighbors of 0, and p(z) = 1/(2d) for each z ∈ F . For ξ ∈ R d the characteristic function p̂(ξ) of p(·) is given by (1/d)(cos ξ1 + cos ξ2 + · · · + cos ξd ). The chain is easily seen to be irreducible, but periodic of period 2. Return to the starting point is possible only after an even number of steps.

    π (2n) (0, 0) = (1/2π)^d ∫_{T d} [p̂(ξ)]^{2n} dξ ≃ C / n^{d/2} .
To see this asymptotic behavior let us first note that the integration can be restricted to the set where |p̂(ξ)| ≥ 1 − δ, i.e. near the 2 points (0, 0, · · · , 0) and (π, π, · · · , π) where |p̂(ξ)| = 1. Since the behaviour is similar at both points let us concentrate near the origin. There

    (1/d) Σ_{j=1}^d cos ξj ≤ 1 − c Σ_j ξj ^2 ≤ exp[−c Σ_j ξj ^2 ]

for some c > 0, and hence

    [ (1/d) Σ_{j=1}^d cos ξj ]^{2n} ≤ exp[−2 n c Σ_j ξj ^2 ]
4.6. COUNTABLE STATE SPACE 135

and with a change of variables the upper bound is clear. We have a similar
lower bound as well. The random walk is recurrent if d = 1 or 2 but transient
if d ≥ 3.
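The integral formula for π (2n) (0, 0) also lends itself to exact numerical evaluation: [p̂(ξ)]^{2n} is a trigonometric polynomial of degree 2n, so averaging it over a lattice of K > 2n equally spaced frequencies per coordinate reproduces the torus integral exactly. A sketch (the parameter choices are ours):

```python
import itertools
from math import comb, cos, pi

def p2n(d, n, K=64):
    """pi^(2n)(0,0) for the simple random walk on Z^d, computed as the
    average of phat(xi)^(2n) over a K^d lattice of frequencies, with
    phat = (1/d) sum_j cos xi_j.  For K > 2n this average equals the
    integral over the torus exactly (discrete orthogonality)."""
    c = [cos(2 * pi * k / K) for k in range(K)]
    total = 0.0
    for idx in itertools.product(range(K), repeat=d):
        phat = sum(c[k] for k in idx) / d
        total += phat ** (2 * n)
    return total / K ** d

# In d = 1 there is the closed form pi^(2n)(0,0) = C(2n, n)/4^n;
# for larger d the values decay like n^(-d/2).
vals = [p2n(d, 20) for d in (1, 2, 3)]
```

Comparing vals against n^{−d/2} for a few n makes the dimension-dependent decay, and hence the recurrence/transience dichotomy, quite concrete.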

Exercise 4.12. If the distribution p(·) is arbitrary, determine when the chain
is irreducible and when it is irreducible and aperiodic.
Exercise 4.13. If Σ_z z p(z) = m ≠ 0, conclude that the chain is transient by an application of the strong law of large numbers.

Exercise 4.14. If Σ_z z p(z) = m = 0, and if the covariance matrix, given by σi,j = Σ_z zi zj p(z), is nondegenerate, show that the transience or recurrence is determined by the dimension as in the case of the nearest neighbor random walk.
Exercise 4.15. Can you make sense of the formal calculation

    Σ_n π (n) (0, 0) = Σ_n (1/2π)^d ∫_{T d} [p̂(ξ)]^n dξ
                     = (1/2π)^d ∫_{T d} 1/(1 − p̂(ξ)) dξ
                     = (1/2π)^d ∫_{T d} Real Part [ 1/(1 − p̂(ξ)) ] dξ

to conclude that a necessary and sufficient condition for transience or recurrence is the convergence or divergence of the integral

    ∫_{T d} Real Part [ 1/(1 − p̂(ξ)) ] dξ

with an integrand

    Real Part [ 1/(1 − p̂(ξ)) ]

that is seen to be nonnegative?

Hint: Consider instead the sum

    Σ_{n=0}^∞ ρ^n π (n) (0, 0) = Σ_n (1/2π)^d ∫_{T d} ρ^n [p̂(ξ)]^n dξ
                               = (1/2π)^d ∫_{T d} 1/(1 − ρ p̂(ξ)) dξ
                               = (1/2π)^d ∫_{T d} Real Part [ 1/(1 − ρ p̂(ξ)) ] dξ

for 0 < ρ < 1 and let ρ → 1.

Example 4.2. (The Queue Problem).


In the example of customers arriving, except in the trivial cases of p0 = 0 or p0 + p1 = 1, the chain is irreducible and aperiodic. Since the service rate is at most 1, if the arrival rate m = Σ_j j pj > 1 then the queue will get longer, and by an application of the law of large numbers it is seen that the queue length will become infinite as time progresses. This is the transient behavior of the queue. If m < 1, one can expect the situation to be stable and there should be an asymptotic distribution for the queue length. If m = 1, it is the borderline case and one should probably expect this to be the null recurrent case. The actual proofs are not hard. In time n the actual number of customers served is at most n, because the queue may sometimes be empty. If {ξi : i ≥ 1} are the numbers of new customers arriving at times i and X0 is the initial number in the queue, then the number Xn in the queue at time n satisfies Xn ≥ X0 + (Σ_{i=1}^n ξi ) − n, and if m > 1 it follows from the law of large numbers that limn→∞ Xn = +∞, thereby establishing transience. To prove positive recurrence when m < 1 it is sufficient to prove that the equations

    Σ_x q(x)π(x, y) = q(y)

have a nontrivial nonnegative solution such that Σ_x q(x) < ∞. We shall proceed to show that this is indeed the case. Since the equation is linear we can always normalize the solution so that Σ_x q(x) = 1. By iteration

    Σ_x q(x)π (n) (x, y) = q(y)

for every n. If limn→∞ π (n) (x, y) = 0 for every x and y, then because Σ_x q(x) = 1 < ∞, by the bounded convergence theorem the left hand side tends to 0 as n → ∞. Therefore q ≡ 0 and is trivial. This rules out the transient and the null recurrent cases. In our case π(0, y) = py , and π(x, y) = py−x+1 if y ≥ x − 1 and x ≥ 1. In all other cases π(x, y) = 0. The equations for {qx = q(x)} are then

    q0 py + Σ_{x=1}^{y+1} qx py−x+1 = qy for y ≥ 0.           (4.13)
4.6. COUNTABLE STATE SPACE 137

Multiplying equation 4.13 by z^y and summing over y ≥ 0, we get

    q0 P (z) + (1/z) P (z) [Q(z) − q0 ] = Q(z)

where P (z) and Q(z) are the generating functions

    P (z) = Σ_{x=0}^∞ px z^x ,    Q(z) = Σ_{x=0}^∞ qx z^x .

We can solve for Q to get

    Q(z)/q0 = P (z) [ 1 − (P (z) − 1)/(z − 1) ]^{−1}
            = P (z) Σ_{k=0}^∞ [ (P (z) − 1)/(z − 1) ]^k
            = P (z) Σ_{k=0}^∞ [ Σ_{j=1}^∞ pj (1 + z + · · · + z^{j−1} ) ]^k

is a power series in z with nonnegative coefficients. If m < 1, we can let z → 1 to get

    Q(1)/q0 = Σ_{k=0}^∞ [ Σ_{j=1}^∞ j pj ]^k = Σ_{k=0}^∞ m^k = 1/(1 − m) < ∞

proving positive recurrence.
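Since Q(1) = Σ_x qx = 1, the relation Q(1)/q0 = 1/(1 − m) pins down the stationary probability of an empty queue: q0 = 1 − m. This is easy to confirm numerically; the sketch below (the Poisson arrival law, the truncation level, and the iteration count are our own choices) uses arrivals with mean m = 0.5:

```python
from math import exp, factorial

def stationary_queue(m, N=40, iters=1500):
    """Approximate invariant distribution of the queue chain with
    Poisson(m) arrivals, truncated to the states 0..N."""
    p = [exp(-m) * m**j / factorial(j) for j in range(N + 1)]
    # pi(0, j) = p_j ; pi(i, j) = p_{j-i+1} for j >= i - 1, i >= 1.
    def row(i):
        if i == 0:
            return p[:]
        return [0.0] * (i - 1) + p[: N + 2 - i]
    P = [row(i) for i in range(N + 1)]
    q = [1.0] + [0.0] * N
    for _ in range(iters):
        q = [sum(q[i] * P[i][j] for i in range(N + 1)) for j in range(N + 1)]
        s = sum(q)
        q = [v / s for v in q]       # renormalize away truncation leakage
    return q

q = stationary_queue(0.5)
# q[0] should be close to 1 - m = 0.5
```

The truncation at N = 40 is harmless here because the Poisson tail and the stationary queue-length tail both decay extremely fast for m = 0.5.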


The case m = 1 is a little bit harder. The calculations carried out earlier
are still valid and we knowPin this case that there exists q(x) ≥ 0 such that
each q(x) < ∞ for each x, x q(x) = ∞, and
X
q(x) π(x, y) = q(y).
x

In other words the chain admits an infinite invariant measure. Such a chain
cannot be positive recurrent. To see this we note
X
q(y) = π (n) (x, y)q(x)
x
and if the chain were positive recurrent

    lim_{n→∞} π (n) (x, y) = q̃(y)

would exist with Σ_y q̃(y) = 1. By Fatou’s lemma

    q(y) ≥ Σ_x q̃(y)q(x) = ∞
giving us a contradiction. To decide between transience and null recurrence


a more detiled investigation is needed. We will outline a general procedure.
Suppose we have a state x0 that is fixed and would like to calculate
Fx0 (`) = Px0 {τx0 ≤ `}. If we can do this, then we can answer questions
about transience, recurrence etc. If lim`→∞ Fx0 (`) < 1 then the chain is
transient and otherwise recurrent. In the recurrent case the convergence or
divergence of X
E Px0 {τx0 } = [1 − Fx0 (`)]
`

determines if it is positive or null recurrent. If we can determine

Fy (`) = Py {τx0 ≤ `}

for y 6= x0 , then for ` ≥ 1


X
Fx0 (`) = π(x0 , x0 ) + π(x0 , y)Fy (` − 1).
y6=x0

We shall outline a procedure for determining, for λ > 0,

    U(λ, y) = Ey [ exp[−λτx0 ] ].

Clearly U(x) = U(λ, x) satisfies

    U(x) = e^{−λ} Σ_y π(x, y)U(y) for x ≠ x0                  (4.14)

and U(x0 ) = 1. One would hope that if we solve these equations then we have our U. This requires uniqueness. Since our U is bounded, in fact by 1, it is sufficient to prove uniqueness within the class of bounded solutions of equation 4.14. We will now establish that any bounded solution U of equation 4.14 with U(x0 ) = 1 is given by

    U(y) = U(λ, y) = Ey [ exp[−λτx0 ] ].

Let us define En = {X1 ≠ x0 , X2 ≠ x0 , · · · , Xn−1 ≠ x0 , Xn = x0 }. Then we will prove, by induction, that for any solution U of equation (4.14) with U(λ, x0 ) = 1,

    U(y) = Σ_{j=1}^n e^{−λ j} Py {Ej } + e^{−λ n} ∫_{τx0 >n} U(Xn ) dPy .      (4.15)

By letting n → ∞ we would obtain

    U(y) = Σ_{j=1}^∞ e^{−λ j} Py {Ej } = E Py {e^{−λτx0 }}

because U is bounded and λ > 0.


    ∫_{τx0 >n} U(Xn ) dPy
      = e^{−λ} ∫_{τx0 >n} [ Σ_y π(Xn , y) U(y) ] dPy
      = e^{−λ} Py {En+1 } + e^{−λ} ∫_{τx0 >n} [ Σ_{y≠x0 } π(Xn , y) U(y) ] dPy
      = e^{−λ} Py {En+1 } + e^{−λ} ∫_{τx0 >n+1} U(Xn+1 ) dPy
completing the induction argument. In our case, if we take x0 = 0 and try Uσ (x) = e^{−σ x} with σ > 0, then for x ≥ 1

    Σ_y π(x, y)Uσ (y) = Σ_{y≥x−1} e^{−σ y} py−x+1
                      = Σ_{y≥0} e^{−σ (x+y−1)} py
                      = e^{−σ x} e^σ Σ_{y≥0} e^{−σ y} py = ψ(σ)Uσ (x)

where

    ψ(σ) = e^σ Σ_{y≥0} e^{−σ y} py .

Let us solve e^λ = ψ(σ) for σ, which is the same as solving log ψ(σ) = λ for λ > 0, to get a solution σ = σ(λ) > 0. Then

    U(λ, x) = e^{−σ(λ) x} = E Px {e^{−λτ0 }}.

We see now that recurrence is equivalent to σ(λ) → 0 as λ ↓ 0, and positive recurrence to σ(λ) being differentiable at λ = 0. The function log ψ(σ) is convex and its slope at the origin is 1 − m. If m > 1 it dips below 0 initially for σ > 0 and then comes back up to 0 for some positive σ0 before turning positive for good. In that situation limλ↓0 σ(λ) = σ0 > 0, and that is transience. If m < 1 then log ψ(σ) has a positive slope at the origin and σ′(0) = 1/ψ′(0) = 1/(1 − m) < ∞. If m = 1, then log ψ has zero slope at the origin and σ′(0) = ∞. This concludes the discussion of this problem.
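For a concrete arrival law, σ(λ) can be found by bisection. With Poisson(m) arrivals, ψ(σ) = e^σ E[e^{−σξ}] gives log ψ(σ) = σ + m(e^{−σ} − 1), and the sketch below (the parameter values are ours) exhibits the dichotomy just described: σ(λ) → 0 as λ ↓ 0 when m < 1, but stays bounded away from 0 when m > 1.

```python
from math import exp

def sigma_of_lambda(lam, m, hi=10.0, iters=200):
    """Positive root of log psi(sigma) = lam for Poisson(m) arrivals,
    where log psi(sigma) = sigma + m*(exp(-sigma) - 1).  The function
    g below is negative at 0+ and positive at hi, crossing lam once."""
    g = lambda s: s + m * (exp(-s) - 1.0) - lam
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s_rec = sigma_of_lambda(1e-6, m=0.5)   # recurrent: roughly lam/(1-m), tiny
s_trn = sigma_of_lambda(1e-6, m=1.5)   # transient: near sigma_0 ~ 0.874 > 0
```

Repeating this for a decreasing sequence of λ values traces out σ(λ) and makes the slope 1/(1 − m) at the origin visible in the subcritical case.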

Example 4.3. (The Urn Problem.)

We now turn to a discussion of the urn problem. The transition probabilities
are

    π(p, q ; p + 1, q) = p/(p + q)    and    π(p, q ; p, q + 1) = q/(p + q)

and π is zero otherwise. In this case the equation


    F(p, q) = (p/(p + q)) F(p + 1, q) + (q/(p + q)) F(p, q + 1)    for all p, q,

which will play a role later, has lots of solutions. In particular, F(p, q) = p/(p + q)
is one, and for any 0 < x < 1,

    Fx(p, q) = (1/β(p, q)) x^{p−1} (1 − x)^{q−1},

where

    β(p, q) = Γ(p)Γ(q)/Γ(p + q),

is a solution as well. The former is defined on p + q > 0 whereas the latter is
defined only on p > 0, q > 0. Actually if p or q is initially 0 it remains so
forever, and there is nothing to study in that case. If f is a continuous function


on [0, 1] then

    Ff(p, q) = ∫_0^1 Fx(p, q) f(x) dx

is a solution, and if we want we can extend Ff by making it f(1) on q = 0
and f(0) on p = 0. It is a simple exercise to verify

    lim_{p,q→∞, p/(p+q)→x} Ff(p, q) = f(x)

for any continuous f on [0, 1]. We will show that the ratio ξn = pn/(pn + qn),
which is random, stabilizes asymptotically (i.e. has a limit) to a random
variable ξ, and if we start from p, q the distribution of ξ is the Beta distribution
on [0, 1] with density

    Fx(p, q) = (1/β(p, q)) x^{p−1} (1 − x)^{q−1}.
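A quick Monte Carlo check of this claim (an illustrative sketch; the seed and sample sizes are arbitrary choices of ours): starting from p = q = 1 the limit ξ should be Beta(1, 1), i.e. uniform on [0, 1], so the sample mean of the final fractions should be near 1/2.

```python
import random

def polya_fraction(p, q, steps, rng):
    # draw a ball, return it together with one more of the same colour
    for _ in range(steps):
        if rng.random() < p / (p + q):
            p += 1
        else:
            q += 1
    return p / (p + q)

rng = random.Random(0)
samples = [polya_fraction(1, 1, 500, rng) for _ in range(2000)]
print(sum(samples) / len(samples))   # near the Beta(1,1) mean 1/2
```

Starting from other (p, q) the same experiment produces histograms matching the Beta(p, q) density above.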

Suppose we have a Markov Chain on some state space X with transition


probability π(x, y) and U(x) is a bounded function on X that solves
    U(x) = Σ_y π(x, y) U(y).

Such functions are called (bounded) harmonic functions for the chain. Consider
the random variables ξn = U(Xn) for such a harmonic function. The ξn
are uniformly bounded by the bound for U. If we denote ηn = ξn − ξn−1,
an elementary calculation reveals

    E^{Px}{ηn+1} = E^{Px}{U(Xn+1) − U(Xn)}
                 = E^{Px}{ E^{Px}{ U(Xn+1) − U(Xn) | Fn } }

where Fn is the σ-field generated by X0, · · · , Xn. But

    E^{Px}{ U(Xn+1) − U(Xn) | Fn } = Σ_y π(Xn, y)[U(y) − U(Xn)] = 0.

A similar calculation shows that

    E^{Px}{ηn ηm} = 0

for m ≠ n. If we write

    U(Xn) = U(X0) + η1 + η2 + · · · + ηn,

this is an orthogonal sum in L2[Px], and because U is bounded,

    E^{Px}{|U(Xn)|^2} = |U(x)|^2 + Σ_{i=1}^{n} E^{Px}{|ηi|^2} ≤ C

is bounded in n. Therefore lim_{n→∞} U(Xn) = ξ exists in L2[Px] and E^{Px}{ξ} =
U(x). Actually the limit exists almost surely and we will show this when
we discuss martingales later. In our example if we take U(p, q) = p/(p + q), as
remarked earlier, this is a harmonic function bounded by 1 and therefore

    lim_{n→∞} pn/(pn + qn) = ξ

exists in L2[Px]. Moreover if we take U(p, q) = Ff(p, q) for some continuous
f on [0, 1], then because Ff(p, q) → f(x) as p, q → ∞ with p/(p + q) → x,
U(pn, qn) has a limit as n → ∞ and this limit has to be f(ξ). On the other hand

    E^{P_{p0,q0}}{U(pn, qn)} = U(p0, q0) = Ff(p0, q0)
                             = (1/β(p0, q0)) ∫_0^1 f(x) x^{p0−1} (1 − x)^{q0−1} dx,

giving us

    E^{P_{p,q}}{f(ξ)} = (1/β(p, q)) ∫_0^1 f(x) x^{p−1} (1 − x)^{q−1} dx,

thereby identifying the distribution of ξ under P_{p,q} as the Beta distribution
with the right parameters.
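The two harmonic functions used in this example are easy to verify numerically. The sketch below (an illustrative check we added, with math.gamma supplying Γ) evaluates the residual of F(p, q) = [p F(p+1, q) + q F(p, q+1)]/(p + q) for F = p/(p + q) and for Fx at several integer points.

```python
from math import gamma

def beta(p, q):
    # beta(p, q) = Gamma(p) Gamma(q) / Gamma(p + q)
    return gamma(p) * gamma(q) / gamma(p + q)

def defect(F, p, q):
    # residual of the averaging equation satisfied by harmonic F
    return F(p, q) - (p * F(p + 1, q) + q * F(p, q + 1)) / (p + q)

F_ratio = lambda p, q: p / (p + q)
x = 0.3
F_x = lambda p, q: x ** (p - 1) * (1 - x) ** (q - 1) / beta(p, q)

worst = max(max(abs(defect(F_ratio, p, q)), abs(defect(F_x, p, q)))
            for p in range(1, 6) for q in range(1, 6))
print(worst)   # both functions satisfy the equation up to rounding
```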

Example 4.4. (Branching Process). Consider a population in which each
individual member replaces itself at the beginning of each day by a random
number of offspring. Every individual has the same offspring distribution,
but the numbers of offspring for different individuals are distributed
independently of each other. The distribution of the number N of offspring is
given by P[N = i] = pi for i ≥ 0. If there are Xn individuals in the population
on a given day, then the number of individuals Xn+1 present on the next day
has the representation

    Xn+1 = N1 + N2 + · · · + N_{Xn}
as the sum of Xn independent random variables, each having the offspring
distribution {pi : i ≥ 0}. Xn is seen to be a Markov chain on the set
of nonnegative integers. Note that if Xn ever becomes zero, i.e. if every
member on a given day produces no offspring, then the population remains
extinct.
If one uses generating functions, then the transition probabilities πi,j of the
chain satisfy

    Σ_j πi,j z^j = ( Σ_j pj z^j )^i.

What is the long time behavior of the chain?


Let us denote by m the expected number of offspring of any individual, i.e.

    m = Σ_{i≥0} i pi.

Then

    E[Xn+1 | Fn] = m Xn.

1. If m < 1, then the population becomes extinct sooner or later. This is
easy to see. Consider

    E[ Σ_{n≥0} Xn | F0 ] = Σ_{n≥0} m^n X0 = X0/(1 − m) < ∞.

By an application of Fubini's theorem, if S = Σ_{n≥0} Xn, then

    E[S | X0 = i] = i/(1 − m) < ∞,

proving that P[S < ∞] = 1. In particular

    P[ lim_{n→∞} Xn = 0 ] = 1.

2. If m = 1 and p1 = 1, then Xn ≡ X0 and the population size never
changes, each individual replacing itself every time by exactly one
offspring.
3. If m = 1 and p1 < 1, then p0 > 0, and there is a positive probability
q(i) = q^i that the population becomes extinct when it starts with i
individuals. Here q is the probability of the population becoming extinct
when we start with X0 = 1. If we have initially i individuals, each of
the i family lines has to become extinct for the entire population to
become extinct. The number q must therefore be a solution of the
equation

    q = P(q)

where P(z) is the generating function

    P(z) = Σ_{i≥0} pi z^i.

If we show that the equation P(z) = z has only the solution z = 1
in 0 ≤ z ≤ 1, then the population becomes extinct with probability 1,
although E[S] = ∞ in this case. If P(1) = 1 and P(a) = a for some
0 ≤ a < 1, then by the mean value theorem applied to P(z) − z we
must have P′(z) = 1 for some 0 < z < 1. But if 0 < z < 1,

    P′(z) = Σ_{i≥1} i z^{i−1} pi < Σ_{i≥1} i pi = 1,

a contradiction.

4. If m > 1 but p0 = 0, the problem is trivial. There is no chance of the
population becoming extinct. Let us assume that p0 > 0. The equation
P(z) = z has another solution z = q besides z = 1, in the range
0 < z < 1. This is seen by considering the function g(z) = P(z) − z.
We have g(0) > 0, g(1) = 0, g′(1) > 0, which implies another root. But
g(z) is convex and therefore there can be at most one more root. If we
can rule out the possibility of the extinction probability being equal to 1,
then this root q must be the extinction probability when we start with
a single individual at time 0. Let us denote by qn the probability of
extinction within n days. Then

    qn+1 = Σ_i pi qn^i = P(qn)

and q1 < 1. A simple consequence of the monotonicity of P(z) and
the inequalities P(z) > z for z < q and P(z) < z for z > q is that if
we start with any a < 1 and iterate qn+1 = P(qn) with q1 = a, then
qn → q.

If the population does not become extinct, one can show that it has
to grow indefinitely. This is best done using martingales and we will
revisit this example later as Example 5.6.
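The iteration qn+1 = P(qn) is easy to carry out numerically. As a sketch with a hypothetical offspring law p0 = 1/4, p2 = 3/4 (so m = 3/2 > 1), the generating function is P(z) = 1/4 + (3/4)z², the equation P(z) = z has roots z = 1/3 and z = 1, and the iteration started from 0 converges to the extinction probability q = 1/3.

```python
def P(z):
    # generating function for the offspring law p0 = 1/4, p2 = 3/4
    return 0.25 + 0.75 * z * z

q = 0.0
for _ in range(200):     # q_{n+1} = P(q_n), started from q_1 = 0
    q = P(q)
print(q)                 # converges to the smaller root 1/3 of P(z) = z
```

Any starting value a < 1 yields the same limit, in line with the monotonicity argument above.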

Example 4.5. Let X be the set of integers. Assume that transitions from x
are possible only to x − 1, x, and x + 1. The transition matrix π(x, y) appears
as a tridiagonal matrix with π(x, y) = 0 unless |x − y| ≤ 1. For simplicity let
us assume that π(x, x), π(x, x − 1) and π(x, x + 1) are positive for all x.
The chain is then irreducible and aperiodic. Let us try to solve for

U(x) = Px {τ0 = ∞}

that satisfies the equation

U(x) = π(x, x − 1) U(x − 1) + π(x, x) U(x) + π(x, x + 1) U(x + 1)

for x ≠ 0 with U(0) = 0. The equations decouple into a set for x > 0 and a
set for x < 0. If we denote by V(x) = U(x + 1) − U(x) for x ≥ 0, then we
always have

    U(x) = π(x, x − 1) U(x) + π(x, x) U(x) + π(x, x + 1) U(x)

so that

    π(x, x − 1) V(x − 1) − π(x, x + 1) V(x) = 0

or

    V(x)/V(x − 1) = π(x, x − 1)/π(x, x + 1)

and therefore

    V(x) = V(0) Π_{i=1}^{x} π(i, i − 1)/π(i, i + 1)

and

    U(x) = V(0) [ 1 + Σ_{y=1}^{x−1} Π_{i=1}^{y} π(i, i − 1)/π(i, i + 1) ].
If the chain is to be transient we must have, for some choice of V(0), 0 ≤
U(x) ≤ 1 for all x > 0, and this will be possible only if

    Σ_{y=1}^{∞} Π_{i=1}^{y} π(i, i − 1)/π(i, i + 1) < ∞,

which then becomes a necessary condition for

    Px{τ0 = ∞} > 0

for x > 0. There is a similar condition on the negative side:

    Σ_{y=1}^{∞} Π_{i=1}^{y} π(−i, −i + 1)/π(−i, −i − 1) < ∞.

Transience needs at least one of the two series to converge. Actually the
converse is also true. If, for instance, the series on the positive side converges,
then we get a function U(x) with 0 ≤ U(x) ≤ 1 and U(0) = 0 that satisfies

    U(x) = π(x, x − 1) U(x − 1) + π(x, x) U(x) + π(x, x + 1) U(x + 1),

and by iteration one can prove that for each n,

    U(x) = ∫_{τ0>n} U(Xn) dPx ≤ Px{τ0 > n},

so the existence of a nontrivial U implies transience.
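As a concrete sketch (with illustrative transition probabilities chosen by us, not from the text), take π(x, x − 1) = 1/4, π(x, x) = 1/4, π(x, x + 1) = 1/2 for x ≥ 1. Then π(i, i − 1)/π(i, i + 1) = 1/2, the series sums to 1, and V(0) = 1/2 normalizes sup_x U(x) = 1; the code checks that the resulting U solves the equation.

```python
p_down, p_stay, p_up = 0.25, 0.25, 0.5   # illustrative tridiagonal chain

def U(x):
    # U(x) = V(0) [1 + sum_{y=1}^{x-1} r^y] with r = p_down/p_up = 1/2;
    # the full series sums to 1, so V(0) = 1/2 gives sup_x U(x) = 1
    if x <= 0:
        return 0.0
    r = p_down / p_up
    v0 = 1.0 / (1.0 + r / (1.0 - r))
    return v0 * (1.0 + sum(r ** y for y in range(1, x)))

for x in range(1, 11):   # check the harmonic equation away from 0
    rhs = p_down * U(x - 1) + p_stay * U(x) + p_up * U(x + 1)
    assert abs(U(x) - rhs) < 1e-12
print([U(x) for x in range(6)])   # increases from 0 toward 1
```

Here U(x) = 1 − 2^{−x}, a bounded nontrivial solution, confirming transience of this right-drifting chain.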

Exercise 4.16. Determine the conditions for positive recurrence in the previous
example.
Exercise 4.17. We replace the set of integers by the set of nonnegative integers
and assume that π(0, y) = 0 for y ≥ 2. Such processes are called birth
and death processes. Work out the conditions in that case.
Exercise 4.18. In the special case of a birth and death process with π(0, 1) =
π(0, 0) = 1/2, and for x ≥ 1, π(x, x) = 1/3, π(x, x − 1) = 1/3 + ax, π(x, x + 1) =
1/3 − ax with ax = λ/x^α for large x, find conditions on positive α and real λ for
the chain to be transient, null recurrent and positive recurrent.

Exercise 4.19. The notion of a Markov Chain makes sense for a finite chain
X0, · · · , Xn. Formulate it precisely. Show that if the chain {Xj : 0 ≤ j ≤ n}
is Markov so is the reversed chain {Yj : 0 ≤ j ≤ n} where Yj = Xn−j for 0 ≤
j ≤ n. Can the transition probabilities of the reversed chain be determined
by the transition probabilities of the forward chain? If the forward chain has
stationary transition probabilities does the same hold true for the reversed
chain? What if we assume that the chain has a finite invariant probability
distribution and we initialize the chain to start with an initial distribution
which is the invariant distribution?
Exercise 4.20. Consider the simple chain on the nonnegative integers with the
following transition probabilities: π(0, x) = px for x ≥ 0 with Σ_{x=0}^{∞} px = 1.
For x > 0, π(x, x − 1) = 1 and π(x, y) = 0 for all other y. Determine
conditions on {px} in order that the chain may be transient, null recurrent
or positive recurrent. Determine the invariant probability measure in the
positive recurrent case.
Exercise 4.21. Show that any null recurrent equivalence class must necessarily
contain an infinite number of states. In particular any Markov Chain
with a finite state space has only transient and positive recurrent states, and
moreover the set of positive recurrent states must be nonempty.
Chapter 5

Martingales.

5.1 Definitions and properties


The theory of martingales plays a very important and useful role in the study
of stochastic processes. A formal definition is given below.

Definition 5.1. Let (Ω, F , P ) be a probability space. A martingale se-


quence of length n is a chain X1 , X2 , · · · , Xn of random variables and corre-
sponding sub σ-fields F1 , F2 , · · · , Fn that satisfy the following relations

1. Each Xi is an integrable random variable which is measurable with re-


spect to the corresponding σ-field Fi .

2. The σ-fields Fi are increasing, i.e. Fi ⊂ Fi+1 for every i.

3. For every i ∈ {1, 2, · · · , n − 1}, we have the relation

Xi = E{Xi+1 |Fi} a.e. P.

Remark 5.1. We can have an infinite martingale sequence {(Xi , Fi) : i ≥ 1}


which requires only that for every n, {(Xi , Fi) : 1 ≤ i ≤ n} be a martingale
sequence of length n. This is the same as conditions (i), (ii) and (iii) above
except that they have to be true for every i ≥ 1.


Remark 5.2. From the properties of conditional expectations we see that
E{Xi} = E{Xi+1} for every i, and therefore E{Xi} = c for some constant c. We
can define F0 to be the trivial σ-field consisting of {∅, Ω} and X0 = c. Then
{(Xi, Fi) : i ≥ 0} is a martingale sequence as well.
Remark 5.3. We can define Yi+1 = Xi+1 − Xi so that Xj = c + Σ_{1≤i≤j} Yi and
property (iii) reduces to

    E{Yi+1 | Fi} = 0 a.e. P.

Such sequences are called martingale differences. If Yi is a sequence of
independent random variables, each with mean 0, we can take Fi to be the
σ-field generated by the random variables {Yj : 1 ≤ j ≤ i}, and Xj = c + Σ_{1≤i≤j} Yi
will be a martingale relative to the σ-fields Fi.
Remark 5.4. We can generate martingale sequences by the following pro-
cedure. Given any increasing family of σ-fields {Fj }, and any integrable
random variable X on (Ω, F , P ), we take Xi = E{X|Fi} and it is easy to
check that {(Xi, Fi )} is a martingale sequence. Of course every finite mar-
tingale sequence is generated this way for we can always take X to be Xn ,
the last one. For infinite sequences this raises an important question that we
will answer later.

If one participates in a ‘fair’ gambling game, the asset Xn of the player at


time n is supposed to be a martingale. One can take for Fn the σ-field of all
the results of the game through time n. The condition E[Xn+1 − Xn |Fn ] = 0
is the assertion that the game is neutral irrespective of past history.
A related notion is that of a super or sub-martingale. If, in the definition
of a martingale, we replace the equality in (iii) by an inequality we get super
or sub-martingales.
For a sub-martingale we demand the relation
(iiia) for every i,
Xi ≤ E{Xi+1 |Fi } a.e. P.
while for a super-martingale the relation is
(iiib) for every i,
Xi ≥ E{Xi+1 |Fi } a.e. P.

Lemma 5.1. If {(Xi , Fi)} is a martingale and ϕ is a convex (or concave)


function of one variable such that ϕ(Xn ) is integrable for every n, then
{(ϕ(Xn ), Fn )} is a sub (or super)-martingale.

Proof. An easy consequence of Jensen’s inequality (4.2) for conditional ex-


pectations.
Remark 5.5. A particular case is ϕ(x) = |x|^p with 1 ≤ p < ∞. For any
martingale (Xn, Fn) and 1 ≤ p < ∞, (|Xn|^p, Fn) is a sub-martingale provided
E[|Xn|^p] < ∞.
Theorem 5.2. (Doob's inequality.) Suppose {Xj} is a martingale sequence
of length n. Then

    P{ ω : sup_{1≤j≤n} |Xj| ≥ ℓ } ≤ (1/ℓ) ∫_{sup_{1≤j≤n} |Xj| ≥ ℓ} |Xn| dP ≤ (1/ℓ) ∫ |Xn| dP    (5.1)

Proof. Let us define S(ω) = sup_{1≤j≤n} |Xj(ω)|. Then

    {ω : S(ω) ≥ ℓ} = E = ∪_j Ej

is written as a disjoint union, where

    Ej = {ω : |X1(ω)| < ℓ, · · · , |Xj−1(ω)| < ℓ, |Xj(ω)| ≥ ℓ}.

We have

    P(Ej) ≤ (1/ℓ) ∫_{Ej} |Xj| dP ≤ (1/ℓ) ∫_{Ej} |Xn| dP.    (5.2)

The second inequality in (5.2) follows from the fact that |x| is a convex
function of x, and therefore |Xj| is a sub-martingale. In particular
E{|Xn| | Fj} ≥ |Xj| a.e. P and Ej ∈ Fj. Summing up (5.2) over j = 1, · · · , n
we obtain the theorem.
Remark 5.6. We could have started with

    P(Ej) ≤ (1/ℓ^p) ∫_{Ej} |Xj|^p dP

and obtained, for p ≥ 1,

    P(Ej) ≤ (1/ℓ^p) ∫_{Ej} |Xn|^p dP.    (5.3)

Compare it with (3.9) for p = 2.
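Because (5.1) is an inequality between exact expectations, it can be checked by brute force on a short martingale. The sketch below (an illustrative choice of ours: a ±1 coin-tossing martingale of length n = 10 and level ℓ = 3) enumerates all 2^10 equally likely paths and compares both sides.

```python
from itertools import product

n, ell = 10, 3
lhs = rhs = 0.0
for signs in product((-1, 1), repeat=n):   # all 2^n equally likely paths
    path, s = [], 0
    for e in signs:
        s += e
        path.append(s)
    w = 2.0 ** (-n)
    if max(abs(v) for v in path) >= ell:   # the event {sup_j |X_j| >= ell}
        lhs += w                           # P(sup_j |X_j| >= ell)
        rhs += w * abs(path[-1]) / ell     # (1/ell) E[|X_n|; sup >= ell]
print(lhs, rhs)                            # lhs <= rhs, as (5.1) asserts
```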



This simple inequality has various implications. For example

Corollary 5.3. (Doob's Inequality.) Let {Xj : 1 ≤ j ≤ n} be a martingale
and let p > 1. Then if, as before, S(ω) = sup_{1≤j≤n} |Xj(ω)|, we have

    E[S^p] ≤ ( p/(p − 1) )^p E[|Xn|^p].

The proof is a consequence of the following fairly general lemma.

Lemma 5.4. Suppose X and Y are two nonnegative random variables on a
probability space such that for every ℓ ≥ 0,

    P{Y ≥ ℓ} ≤ (1/ℓ) ∫_{Y≥ℓ} X dP.

Then for every p > 1,

    ∫ Y^p dP ≤ ( p/(p − 1) )^p ∫ X^p dP.

Proof. Let us denote the tail probability by T(ℓ) = P{Y ≥ ℓ}. Then with
1/p + 1/q = 1, i.e. (p − 1)q = p,

    ∫ Y^p dP = −∫_0^∞ y^p dT(y) = p ∫_0^∞ y^{p−1} T(y) dy    (integrating by parts)
             ≤ p ∫_0^∞ y^{p−1} (1/y) [ ∫_{Y≥y} X dP ] dy    (by assumption)
             = p ∫ X [ ∫_0^Y y^{p−2} dy ] dP    (by Fubini's theorem)
             = (p/(p − 1)) ∫ X Y^{p−1} dP
             ≤ (p/(p − 1)) ( ∫ X^p dP )^{1/p} ( ∫ Y^{q(p−1)} dP )^{1/q}    (by Hölder's inequality)
             = (p/(p − 1)) ( ∫ X^p dP )^{1/p} ( ∫ Y^p dP )^{1−1/p}.
This simplifies to

    ∫ Y^p dP ≤ ( p/(p − 1) )^p ∫ X^p dP,

provided ∫ Y^p dP is finite. In general, given Y, we can truncate it at level N
to get YN = min(Y, N), and for 0 < ℓ ≤ N,

    P{YN ≥ ℓ} = P{Y ≥ ℓ} ≤ (1/ℓ) ∫_{Y≥ℓ} X dP = (1/ℓ) ∫_{YN≥ℓ} X dP,

with P{YN ≥ ℓ} = 0 for ℓ > N. This gives us uniform bounds on ∫ YN^p dP
and we can pass to the limit. So we have the strong implication that the
finiteness of ∫ X^p dP implies the finiteness of ∫ Y^p dP.
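As with (5.1), the inequality of Corollary 5.3 can be checked exactly on small cases by enumeration. The sketch below (illustrative choice: the ±1 coin-tossing martingale with p = 2 and n = 8) compares E[S²] with (p/(p−1))^p E[|Xn|²] = 4n.

```python
from itertools import product

def both_sides(n, p=2.0):
    # exact E[S^p] and (p/(p-1))^p E[|X_n|^p] for the +-1 coin martingale
    lhs = rhs = 0.0
    for signs in product((-1, 1), repeat=n):
        path, s = [], 0
        for e in signs:
            s += e
            path.append(s)
        w = 2.0 ** (-n)
        lhs += w * max(abs(v) for v in path) ** p   # E[S^p]
        rhs += w * abs(path[-1]) ** p               # E[|X_n|^p]
    return lhs, (p / (p - 1.0)) ** p * rhs

lhs, rhs = both_sides(8)
print(lhs, rhs)   # lhs <= rhs; for p = 2, rhs = 4 E[X_n^2] = 4n = 32
```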

Exercise 5.1. The result is false for p = 1. Construct a nonnegative martingale
Xn with E[Xn] ≡ 1 such that ξ = supn Xn is not integrable. Consider
Ω = [0, 1], F the Borel σ-field and P the Lebesgue measure. Suppose we
take Fn to be the σ-field generated by intervals with end points of the form
j/2^n for some integer j. It corresponds to a partition with 2^n sets. Consider
the random variables

    Xn(x) = 2^n for 0 ≤ x ≤ 2^{−n},    and    Xn(x) = 0 for 2^{−n} < x ≤ 1.

Check that it is a martingale and calculate ∫ ξ(x) dx. This is the ‘winning’
strategy of doubling one's bets until the losses are recouped.
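A quick check of the computation behind this exercise (our sketch): E[Xn] = 2^n · 2^{−n} = 1 for every n, while ξ(x) = sup_n Xn(x) equals 2^n on (2^{−(n+1)}, 2^{−n}], so the integral of ξ over [2^{−N}, 1] is exactly N/2 and grows without bound.

```python
def expectation_Xn(n):
    # E[X_n] = 2^n * |[0, 2^{-n}]| = 1 for every n
    return 2 ** n * 2.0 ** (-n)

def partial_integral_of_sup(N):
    # integral of xi(x) = sup_n X_n(x) over [2^{-N}, 1]; xi equals 2^n on
    # (2^{-(n+1)}, 2^{-n}], each piece contributing 2^n * 2^{-(n+1)} = 1/2
    return sum(2 ** n * (2.0 ** (-n) - 2.0 ** (-(n + 1))) for n in range(N))

print([expectation_Xn(n) for n in range(5)])              # all 1.0
print([partial_integral_of_sup(N) for N in (2, 10, 50)])  # N/2, diverging
```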
Exercise 5.2. If Xn is a martingale such that the differences Yn = Xn − Xn−1
are all square integrable, show that for n ≠ m, E[Yn Ym] = 0. Therefore

    E[Xn^2] = E[X0^2] + Σ_{j=1}^{n} E[Yj^2].

If in addition supn E[Xn^2] < ∞, then show that there is a random variable
X such that

    lim_{n→∞} E[|Xn − X|^2] = 0.

5.2 Martingale Convergence Theorems.


If Fn is an increasing family of σ-fields and Xn is a martingale sequence with
respect to Fn , one can always assume without loss of generality that the
full σ-field F is the smallest σ-field generated by ∪n Fn . If for some p ≥ 1,
X ∈ Lp , and we define Xn = E [ X|Fn ] then Xn is a martingale and by
Jensen’s inequality, supn E [ |Xn |p ] ≤ E [|X|p ]. We would like to prove

Theorem 5.5. For p ≥ 1, if X ∈ Lp , then limn→∞ kXn − Xkp = 0.

Proof. Assume that X is a bounded function. Then by the properties of
conditional expectation supn supω |Xn| < ∞. In particular E[Xn^2] is uniformly
bounded. By Exercise 5.2, at the end of the last section, lim_{n→∞} Xn = Y exists
in L2. By the properties of conditional expectations, for A ∈ Fm,

    ∫_A Y dP = lim_{n→∞} ∫_A Xn dP = ∫_A X dP.

This is true for all A ∈ Fm for any m. Since F is generated by ∪m Fm, the
above relation is true for A ∈ F. As X and Y are F measurable we conclude
that X = Y a.e. P. See Exercise 4.1. For a sequence of functions that is
bounded uniformly in n and ω, convergence in the various Lp spaces is
equivalent, and therefore convergence in L2 implies convergence in Lp for any
1 ≤ p < ∞.
    If now X ∈ Lp for some 1 ≤ p < ∞, we can approximate it by X′ ∈ L∞
so that ‖X′ − X‖p < ε. Let us denote by Xn′ the conditional expectations
E[X′ |Fn]. By the properties of conditional expectations ‖Xn′ − Xn‖p ≤ ε
for all n, and as we saw earlier ‖Xn′ − X′‖p → 0 as n → ∞. It now follows
that

    lim sup_{n,m→∞} ‖Xn − Xm‖p ≤ 2ε,

and as ε > 0 is arbitrary we are done.
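Here is a concrete instance of the theorem (an illustrative sketch of ours): on Ω = [0, 1] with Lebesgue measure and X(x) = x, let Fn be generated by the dyadic intervals of length 2^{−n}. Then Xn = E[X|Fn] is the midpoint of the interval containing x, and a direct computation gives ‖Xn − X‖₁ = 2^{−n}/4 → 0.

```python
def X_n(x, n):
    # E[X | F_n](x) for X(t) = t: the average of t over the dyadic
    # interval of length 2^{-n} containing x, i.e. its midpoint
    h = 2.0 ** (-n)
    k = min(int(x / h), 2 ** n - 1)
    return (k + 0.5) * h

def l1_error(n, grid=1 << 14):
    # midpoint-rule value of ||X_n - X||_1; the exact answer is 2^{-n}/4
    return sum(abs(X_n((i + 0.5) / grid, n) - (i + 0.5) / grid)
               for i in range(grid)) / grid

for n in range(5):
    print(n, l1_error(n))   # decreases like 2^{-n}/4
```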

In general, if we have a martingale {Xn }, we wish to know when it comes


from a random variable X ∈ Lp in the sense that Xn = E [ X |Fn ].

Theorem 5.6. If for some p > 1, a martingale {Xn } is bounded in Lp , in


the sense that supn kXn kp < ∞, then there is a random variable X ∈ Lp such
that Xn = E [ X |Fn ] for n ≥ 1. In particular kXn − Xkp → 0 as n → ∞.

Proof. Suppose ‖Xn‖p is uniformly bounded. For p > 1, since Lp is the dual
of Lq with 1/p + 1/q = 1, bounded sets are weakly compact. See [7] or [3]. We
can therefore choose a subsequence Xnj that converges weakly in Lp to a
limit in the weak topology. We call this limit X. Then consider A ∈ Fn for
some fixed n. The function 1A(·) ∈ Lq, and

    ∫_A X dP = ⟨1A, X⟩ = lim_{j→∞} ⟨1A, Xnj⟩ = lim_{j→∞} ∫_A Xnj dP = ∫_A Xn dP.

The last equality follows from the fact that {Xn} is a martingale, A ∈ Fn
and nj > n eventually. It now follows that Xn = E[X |Fn]. We can now
apply the preceding theorem.
Exercise 5.3. For p = 1 the result is false. Example 5.1 gives us at the same
time a counterexample of an L1 bounded martingale that does not converge
in L1 and so cannot be represented as Xn = E [ X |Fn ].

We can show that the convergence in the preceding theorems is also valid
almost everywhere.
Theorem 5.7. Let X ∈ Lp for some p ≥ 1. Then the martingale Xn =
E [X |Fn ] converges to X for almost all ω with respect to P .
Proof. From Hölder's inequality ‖X‖1 ≤ ‖X‖p. Clearly it is sufficient to
prove the theorem for p = 1. Let us denote by M ⊂ L1 the set of functions
X ∈ L1 for which the theorem is true. Clearly M is a linear subset of L1.
We will prove that it is closed in L1 and that it is dense in L1. If we denote
by Mn the space of Fn measurable functions in L1, then Mn is a closed
subspace of L1. By standard approximation theorems ∪n Mn is dense in L1.
Since it is obvious that M ⊃ Mn for every n, it follows that M is dense in
L1. Let Yj ∈ M ⊂ L1 and Yj → X in L1. Let us define Yn,j = E[Yj |Fn].
With Xn = E[X |Fn], by Doob's inequality (5.1) and Jensen's inequality
(4.2),

    P( sup_{1≤n≤N} |Xn| ≥ ℓ ) ≤ (1/ℓ) ∫_{{ω: sup_{1≤n≤N} |Xn| ≥ ℓ}} |XN| dP
                              ≤ (1/ℓ) E[|XN|]
                              ≤ (1/ℓ) E[|X|]

and therefore Xn is almost surely a bounded sequence. Since we know that
Xn → X in L1, it suffices to show that

    lim sup_n Xn − lim inf_n Xn = 0 a.e. P.

If we write X = Yj + (X − Yj), then Xn = Yn,j + (Xn − Yn,j), and

    lim sup_n Xn − lim inf_n Xn ≤ [lim sup_n Yn,j − lim inf_n Yn,j]
                                    + [lim sup_n (Xn − Yn,j) − lim inf_n (Xn − Yn,j)]
                                = lim sup_n (Xn − Yn,j) − lim inf_n (Xn − Yn,j)
                                ≤ 2 sup_n |Xn − Yn,j|.

Here we have used the fact that Yj ∈ M for every j and hence

    lim sup_n Yn,j − lim inf_n Yn,j = 0 a.e. P.

Finally

    P( lim sup_n Xn − lim inf_n Xn ≥ ε ) ≤ P( sup_n |Xn − Yn,j| ≥ ε/2 )
                                         ≤ (2/ε) E[|X − Yj|]
                                         = 0,

since the left hand side is independent of j and the term on the right on the
second line tends to 0 as j → ∞.

The only case where the situation is unclear is when p = 1. If Xn is an


L1 bounded martingale, it is not clear that it comes from an X. If it did
arise from an X, then Xn would converge to it in L1 and in particular would
have to be uniformly integrable. The converse is also true.

Theorem 5.8. If Xn is a uniformly integrable martingale then there is a
random variable X such that Xn = E[X |Fn], and then of course Xn → X in
L1.

Proof. The uniform integrability of Xn implies weak compactness in L1,
and if X is any weak limit of Xn [see [7]], it is not difficult to show, as in
Theorem 5.6, that Xn = E[X |Fn].

Remark 5.7. Note that for p > 1, a martingale Xn that is bounded in Lp is
uniformly integrable in Lp, i.e. |Xn|^p is uniformly integrable. This is false for
p = 1. The L1 bounded martingale that we constructed earlier in Exercise
5.1 as a counterexample is not convergent in L1 and therefore cannot be
uniformly integrable. We will defer the analysis of L1 bounded martingales
to the next section.

5.3 Doob Decomposition Theorem.


The simplest example of a submartingale is a sequence of functions that is
non decreasing in n for every (almost all) ω. In some sense the simplest
example is also the most general. More precisely the decomposition theorem
of Doob asserts the following.

Theorem 5.9. (Doob decomposition theorem.) If {Xn : n ≥ 1} is a


sub-martingale on (Ω , Fn , P ), then Xn can be written as Xn = Yn + An ,
with the following properties:

1. (Yn , Fn ) is a martingale.

2. An+1 ≥ An for almost all ω and for every n ≥ 1.

3. A1 ≡ 0.

4. For every n ≥ 2, An is Fn−1 measurable.

Xn determines Yn and An uniquely.

Proof. Let Xn be any sequence of integrable functions such that Xn is Fn


measurable, and is represented as Xn = Yn + An , with Yn and An satisfying
(1), (3) and (4). Then

An − An−1 = E [Xn − Xn−1 |Fn−1 ] (5.4)



are uniquely determined. Since A1 = 0, all the An are uniquely determined as


well. Property (2) is then plainly equivalent to the submartingale property.
To establish the representation, we define An inductively by (5.4). It is
routine to verify that Yn = Xn − An is a martingale and the monotonicity of
An is a consequence of the submartingale property.
Remark 5.8. Actually, given any sequence of integrable functions {Xn : n ≥
1} such that Xn is Fn measurable, equation (5.4) along with A1 = 0 defines
Fn−1 measurable functions that are integrable, such that Xn = Yn + An and
(Yn, Fn) is a martingale. The decomposition is always unique. It is easy
to verify from (5.4) that {An} is increasing (or decreasing) if and only if
{Xn} is a sub- (or super-) martingale. Such a decomposition is called the
semi-martingale decomposition.
Remark 5.9. It is the demand that An be Fn−1 measurable that leads to
uniqueness. If we have to deal with continuous time this will become a
thorny issue.
We now return to the study of L1 bounded martingales. A nonnegative
martingale is clearly L1 bounded because E [ |Xn | ] = E [ Xn ] = E [ X1 ].
One easy way to generate L1 bounded martingales is to take the difference
of two nonnegative martingales. We have the converse as well.
Theorem 5.10. Let Xn be an L1 bounded martingale. There are two non-
negative martingales Yn and Zn relative to the same σ-fields Fn , such that
Xn = Yn − Zn .

Proof. For each j and n ≥ j, we define

    Yj,n = E[|Xn| | Fj].

By the submartingale property of |Xn|,

    Yj,n+1 − Yj,n = E[(|Xn+1| − |Xn|) |Fj] = E[ E[(|Xn+1| − |Xn|) |Fn] |Fj ] ≥ 0

almost surely. Yj,n is nonnegative and E[Yj,n] = E[|Xn|] is bounded in n.
By the monotone convergence theorem, for each j, there exists some Yj in
L1 such that Yj,n → Yj in L1 as n → ∞. Since limits of martingales are
again martingales, and Yj,n is a martingale in j for each n ≥ j, it follows
that Yj is a martingale. Moreover

    Yj + Xj = lim_{n→∞} E[|Xn| + Xn |Fj] ≥ 0

and
Xj = (Yj + Xj ) − Yj
does it!

We can always assume that our nonnegative martingale has its expectation
equal to 1 because we can always multiply by a suitable constant. Here
is a way in which such martingales arise. Suppose we have a probability
space (Ω, F, P) and an increasing family of sub σ-fields Fn of F that
generate F. Suppose Q is another probability measure on (Ω, F) which may
or may not be absolutely continuous with respect to P on F. Let us suppose
however that Q << P on each Fn, i.e. whenever A ∈ Fn and P(A) = 0, it
follows that Q(A) = 0. Then the sequence of Radon-Nikodym derivatives

    Xn = (dQ/dP)|_{Fn}

of Q with respect to P on Fn is a nonnegative martingale with expectation
1. It comes from an X if and only if Q << P on F, and this is the uniformly
integrable case. By Lebesgue decomposition we reduce our consideration to
the case when Q ⊥ P. Let us change the reference measure to P′ = (P + Q)/2.
The Radon-Nikodym derivative

    Xn′ = (dQ/dP′)|_{Fn} = 2Xn/(1 + Xn)
is uniformly integrable with respect to P′ and Xn′ → X′ a.e. P′. From
the orthogonality P ⊥ Q we know that there are disjoint sets E, E^c with
P(E) = 1 and Q(E^c) = 1. Then

    Q(A) = Q(A ∩ E) + Q(A ∩ E^c) = Q(A ∩ E^c)
         = 2P′(A ∩ E^c) = ∫_A 2·1_{E^c}(ω) dP′.

It is now seen that

    X′ = (dQ/dP′)|_F = 2 a.e. Q,    X′ = 0 a.e. P,

from which one concludes that

    P( lim_{n→∞} Xn = 0 ) = 1.
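A standard concrete case of this phenomenon (a sketch with parameters we chose for illustration): let P be fair coin tossing and Q the law of a coin with heads probability 3/4, and let Xn be the product of the factors 2·(3/4) (heads) and 2·(1/4) (tails). Then Q ⊥ P on F, E^P[Xn] = 1 for every n, yet Xn → 0 a.e. P. The code verifies the expectation exactly by enumeration and follows one sample path of the product under P.

```python
import random
from itertools import product

def EP_Xn(n, theta=0.75):
    # exact E^P[X_n]: average of the density over all 2^n fair-coin paths;
    # a head multiplies the density by 2*theta, a tail by 2*(1 - theta)
    total = 0.0
    for tosses in product((0, 1), repeat=n):
        x = 1.0
        for t in tosses:
            x *= 2 * theta if t else 2 * (1 - theta)
        total += 2.0 ** (-n) * x
    return total

print(EP_Xn(10))        # equals 1 for every n

rng = random.Random(1)
x = 1.0
for _ in range(1000):   # one sample path under P (fair coin)
    x *= 1.5 if rng.random() < 0.5 else 0.5
print(x)                # typically astronomically small: X_n -> 0 a.e. P
```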

Exercise 5.4. In order to establish that a nonnegative martingale has an
almost sure limit (which may not be an L1 limit) show that we can assume,
without loss of generality, that we are in the following situation:

    Ω = ⊗_{j=1}^{∞} R ;    Fn = σ[x1, · · · , xn] ;    Xj(ω) = xj.

The existence of a Q such that

    (dQ/dP)|_{Fn} = xn

is essentially Kolmogorov's consistency theorem (Theorem 3.5). Now complete
the proof.
Remark 5.10. We shall give a more direct proof of almost sure convergence
of an L1 bounded martingale later on by means of the upcrossing inequality.

5.4 Stopping Times.


The notion of stopping times that we studied in the context of Markov Chains
is important again in the context of martingales. In fact the concept of
stopping times is relevant whenever one has an ordered sequence of sub
σ-fields and is concerned about conditioning with respect to them.
    Let (Ω, F) be a measurable space and {Ft : t ∈ T} be a family of sub σ-
fields. T is an ordered set, usually a set of real numbers or integers of the form
T = {t : a ≤ t ≤ b} or {t : t ≥ a}. We will assume that T = {0, 1, 2, · · · },
the set of nonnegative integers. The family Fn is assumed to be increasing
with n. In other words

    Fm ⊂ Fn if m < n.

An F measurable random variable τ(ω) mapping Ω → {0, 1, · · · , ∞} is
said to be a stopping time if for every n ≥ 0 the set {ω : τ(ω) ≤ n} ∈ Fn. A
stopping time may actually take the value ∞ on a nonempty subset of Ω.
    The idea behind the definition of a stopping time, as we saw in the study
of Markov chains, is that the decision to stop at time n can be based only on
the information available up to that time.
Exercise 5.5. Show that the function τ (ω) ≡ k is a stopping time for any
admissible value of the constant k.

Exercise 5.6. Show that if τ is a stopping time and f : T → T is a nonde-


creasing function that satisfies f (t) ≥ t for all t ∈ T , then τ 0 (ω) = f (τ (ω))
is again a stopping time.

Exercise 5.7. Show that if τ1 , τ2 are stopping times so are max (τ1 , τ2 ) and
min (τ1 , τ2 ). In particular any stopping time τ is an increasing limit of
bounded stopping times τn (ω) = min(τ (ω), n).

To every stopping time τ we associate a stopped σ-field Fτ ⊂ F defined
by

    Fτ = {A : A ∈ F and A ∩ {ω : τ(ω) ≤ n} ∈ Fn for every n}.    (5.5)

This should be thought of as the information available up to the stopping
time τ. In other words, events in Fτ correspond to questions that can be
answered with a yes or no, if we stop observing the process at time τ.

Exercise 5.8. Verify that for any stopping time τ , Fτ is indeed a sub σ-field
i.e. is closed under countable unions and complementations. If τ (ω) ≡ k
then Fτ ≡ Fk . If τ1 ≤ τ2 are stopping times Fτ1 ⊂ Fτ2 . Finally if τ is a
stopping time then it is Fτ measurable.

Exercise 5.9. If Xn(ω) is a sequence of measurable functions on (Ω, F) such
that for every n ∈ T, Xn is Fn measurable, then on the set {ω : τ(ω) <
∞}, which is an Fτ measurable set, the function Xτ(ω) = X_{τ(ω)}(ω) is Fτ
measurable.

The following theorem called Doob’s optional stopping theorem is one of


the central facts in the theory of martingale sequences.

Theorem 5.11. (Optional Stopping Theorem.) Let {Xn : n ≥ 0} be a
sequence of random variables defined on a probability space (Ω, F, P) which
is a martingale sequence with respect to the filtration (Ω, Fn, P), and let 0 ≤
τ1 ≤ τ2 ≤ C be two bounded stopping times. Then

    E[Xτ2 | Fτ1] = Xτ1 a.e.



Proof. Since Fτ1 ⊂ Fτ2 ⊂ FC, it is sufficient to show that for any martingale
{Xn},

    E[Xk | Fτ] = Xτ a.e.    (5.6)

provided τ is a stopping time bounded by the integer k. To see this we note
that, in view of Exercise 4.9,

    E[Xk | Fτ1] = E[ E[Xk | Fτ2] | Fτ1 ],

and if (5.6) holds, then

    E[Xτ2 | Fτ1] = Xτ1 a.e.

Let A ∈ Fτ. If we define Ej = {ω : τ(ω) = j}, then Ω = ∪_{j=1}^{k} Ej is
a disjoint union. Moreover A ∩ Ej ∈ Fj for every j = 1, · · · , k. By the
martingale property,

    ∫_{A∩Ej} Xk dP = ∫_{A∩Ej} Xj dP = ∫_{A∩Ej} Xτ dP,

and summing over j = 1, · · · , k gives

    ∫_A Xk dP = ∫_A Xτ dP

for every A ∈ Fτ, and we are done.


Remark 5.11. In particular if Xn is a martingale sequence and τ is a bounded
stopping time then E[Xτ ] = E[X0 ]. This property, obvious for constant
times, has now been extended to bounded stopping times. In a ‘fair’ game,
a policy to quit at an ‘opportune’ time, gives no advantage to the gambler
so long as he or she cannot foresee the future.
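The theorem can be checked exactly on a small example. In the sketch below (illustrative parameters of ours), Xn is the ±1 coin-tossing martingale and τ = min(first hitting time of +1, k) is a bounded stopping time; enumerating all 2^k paths gives E[Xτ] = E[X0] = 0 exactly.

```python
from itertools import product

def E_X_tau(k):
    # tau = min(first n with X_n = +1, k); enumerate all 2^k paths exactly
    total = 0.0
    for signs in product((-1, 1), repeat=k):
        s, x_tau = 0, None
        for e in signs:
            s += e
            if x_tau is None and s == 1:
                x_tau = s            # stopped at the first visit to +1
        if x_tau is None:
            x_tau = s                # never hit +1: tau = k, X_tau = X_k
        total += 2.0 ** (-k) * x_tau
    return total

print(E_X_tau(10))   # optional stopping: equals E[X_0] = 0 exactly
```

Dropping the cap at k makes τ unbounded and E[Xτ] = 1, the failure described in Exercise 5.11 below.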

Exercise 5.10. The property extends to sub or super-martingales. For ex-


ample if Xn is a sub-martingale, then for any two bounded stopping times
τ1 ≤ τ2 , we have
E [Xτ2 |Fτ1 ] ≥ Xτ1 a.e..
One cannot use the earlier proof directly, but one can reduce it to the mar-
tingale case by applying the Doob decomposition theorem.

Exercise 5.11. Boundedness is important. Take X0 = 0 and

    Xn = ξ1 + ξ2 + · · · + ξn for n ≥ 1,

where the ξi are independent identically distributed random variables taking
the values ±1 with probability 1/2. Let τ = inf{n : Xn = 1}. Then τ is a
stopping time, P[τ < ∞] = 1, but τ is not bounded. Xτ = 1 with probability
1 and trivially E[Xτ] = 1 ≠ 0.

Exercise 5.12. It does not mean that we can never consider stopping times
that are unbounded. Let τ be an unbounded stopping time. For every k,
τk = min(τ, k) is a bounded stopping time and E[Xτk] = 0 for every k. As
k → ∞, τk ↑ τ and Xτk → Xτ. If we can establish uniform integrability of
Xτk we can pass to the limit. In particular if S(ω) = sup_{0≤n≤τ(ω)} |Xn(ω)| is
integrable then supk |Xτk(ω)| ≤ S(ω) and therefore E[Xτ] = 0.

Exercise 5.13. Use a similar argument to show that if

S(ω) = sup_{0≤k≤τ2(ω)} |Xk(ω)|

is integrable, then for any τ1 ≤ τ2

E[Xτ2 | Fτ1] = Xτ1 a.e.

Exercise 5.14. The previous exercise needs the fact that if τn ↑ τ are stopping
times, then
σ{∪n Fτn} = Fτ.
Prove it.

Exercise 5.15. Let us go back to the earlier exercise (Exercise 5.11) where
we had
Xn = ξ1 + · · · + ξn
as a sum of n independent random variables taking the values ±1 with prob-
ability 1/2. Show that if τ is a stopping time with E[τ] < ∞, then S(ω) =
sup_{1≤n≤τ(ω)} |Xn(ω)| is square integrable and therefore E[Xτ] = 0. [Hint: Use
the fact that Xn^2 − n is a martingale.]

5.5 Upcrossing Inequality.


The following inequality due to Doob, that controls the oscillations of a
martingale sequence, is very useful for proving the almost sure convergence
of L1 bounded martingales directly. Let {Xj : 0 ≤ j ≤ n} be a martingale
sequence with n+1 terms. Let us take two real numbers a < b. An upcrossing
is a pair of terms Xk and Xl , with k < l, for which Xk ≤ a < b ≤ Xl . Starting
from X0 , we locate the first term that is at most a and then the first term
following it that is at least b. This is the first upcrossing. In our martingale
sequence there will be a certain number of completed upcrossings (of course
over disjoint intervals ) and then at the end we may be in the middle of
an upcrossing or may not even have started on one because we are still on
the way down from a level above b to one below a. In any case there will
be a certain number U(a, b) of completed upcrossings. Doob’s upcrossing
inequality gives a uniform upper bound on the expected value of U(a, b) in
terms of E[|Xn |], i.e. one that does not depend otherwise on n.
Theorem 5.12. (Doob's upcrossing inequality.) For any n,

E[U(a, b)] ≤ (1/(b − a)) E[(a − Xn)^+] ≤ (1/(b − a)) [|a| + E|Xn|]     (5.7)

Proof. Let us define recursively

τ1 = n ∧ inf{k : Xk ≤ a}
τ2 = n ∧ inf{k : k ≥ τ1, Xk ≥ b}
· · · · · ·
τ_{2j} = n ∧ inf{k : k ≥ τ_{2j−1}, Xk ≥ b}
τ_{2j+1} = n ∧ inf{k : k ≥ τ_{2j}, Xk ≤ a}
· · · · · ·

Since τ_j ≥ τ_{j−1} + 1 until the times reach n, we have τ_n = n. Consider the quantity

D(ω) = Σ_{j=1}^{n} [X_{τ_{2j}} − X_{τ_{2j−1}}]

which could very well have lots of 0’s at the end. In any case the first few
terms correspond to upcrossings and each term is at least (b − a) and there

are U(a, b) of them. Before the 0’s begin there may be at most one nonzero
term which is an incomplete upcrossing, i.e. when τ_{2ℓ−1} < n = τ_{2ℓ} for some
ℓ. It is then equal to (Xn − X_{τ_{2ℓ−1}}) ≥ Xn − a. If on the other hand
we end in the middle of a downcrossing, i.e. τ_{2ℓ} < n = τ_{2ℓ+1}, there is no
incomplete upcrossing. Therefore

D(ω) ≥ (b − a)U(a, b) + Rn (ω)

with the remainder Rn (ω) satisfying

Rn(ω) = 0 if τ_{2ℓ} < n = τ_{2ℓ+1}
Rn(ω) ≥ Xn − a if τ_{2ℓ−1} < n = τ_{2ℓ}

By the optional stopping theorem E[D(ω)] = 0. This gives the bound


E[U(a, b)] ≤ (1/(b − a)) E[−Rn(ω)] ≤ (1/(b − a)) E[(a − Xn)^+]
           ≤ (1/(b − a)) E[|a − Xn|] ≤ (1/(b − a)) [|a| + E|Xn|].

Remark 5.12. In particular if Xn is an L1 bounded martingale, then the


number of upcrossings of any interval [a, b] is finite with probability 1. From
Doob’s inequality, the sequence Xn is almost surely bounded. It now follows
by taking a countable number of intervals [a, b] with rational endpoints that
Xn has a limit almost surely. If Xn is uniformly integrable then the conver-
gence is in L1 and then Xn = E [ X| Fn ]. If we have a uniform Lp bound
on Xn , then X ∈ Lp and Xn → X in Lp . All of our earlier results on the
convergence of martingales now follow.
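Upcrossings are easy to count on a concrete path, and a small Monte Carlo run for the simple random walk (itself a martingale) stays comfortably within the bound (5.7). The counting routine and the choice a = −1, b = 1 below are our own illustrative setup, not from the text.

```python
import random

def upcrossings(xs, a, b):
    """Number of completed upcrossings of [a, b] by the sequence xs:
    trips from a value <= a to a later value >= b."""
    count, below = 0, False
    for x in xs:
        if not below:
            if x <= a:
                below = True          # a potential upcrossing has started
        elif x >= b:
            count += 1                # upcrossing completed
            below = False
    return count

# Monte Carlo check of E[U(a,b)] <= (|a| + E|X_n|)/(b - a) for the walk.
random.seed(0)
n, trials, a, b = 200, 2000, -1, 1
mean_U = mean_abs = 0.0
for _ in range(trials):
    x, path = 0, [0]
    for _ in range(n):
        x += random.choice((-1, 1))
        path.append(x)
    mean_U += upcrossings(path, a, b)
    mean_abs += abs(x)
mean_U, mean_abs = mean_U / trials, mean_abs / trials
print(mean_U <= (abs(a) + mean_abs) / (b - a))
```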

Exercise 5.16. For the proof it is sufficient that we have a supermartingale.


In fact we can change signs and so a submartingale works just as well.

5.6 Martingale Transforms, Option Pricing.


If Xn is a martingale with respect to (Ω, Fn, P) and Yn = Xn − Xn−1 are the
differences, a martingale transform Xn′ of Xn is given by the formula

Xn′ = X′_{n−1} + a_{n−1} Yn , for n ≥ 1

where an−1 is Fn−1 measurable and has enough integrability assumptions to


make a_{n−1} Yn integrable. An elementary calculation shows that

E[Xn′ | Fn−1] = X′_{n−1}

making Xn′ a martingale as well. Xn′ is called a martingale transform of Xn.


The interpretation is if we have a fair game, we can choose the size and side of
our bet at each stage based on the prior history and the game will continue to
be fair. It is important to note that Xn may be sums of independent random
vaiables with mean zero. But the independence of the increments may be
destroyed and Xn0 will in general no longer have the independent increments
property.

Exercise 5.17. Suppose Xn = ξ1 + · · · + ξn, where ξj are independent random
variables taking the values ±1 with probability 1/2. Let Xn′ be the martingale
transform given by

Xn′ = Σ_{j=1}^{n} a_{j−1}(ω) ξj

where aj is Fj measurable, Fj being the σ-field generated by ξ1, · · · , ξj.
Calculate E[(Xn′)^2].
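For small n the moments in the exercise can be computed exactly by enumerating all 2^n paths; the betting rule below is an arbitrary illustrative choice of a predictable sequence, not something fixed by the text. Since a_{j−1}^2 ≡ 1 for this rule, the second moment comes out to n.

```python
from itertools import product

def transform_moments(n, bet):
    """Exact E[X'_n] and E[(X'_n)^2] for X'_n = sum_j bet(history) * xi_j,
    by enumerating all 2^n equally likely +/-1 sequences.
    `bet` maps the tuple (xi_1, ..., xi_{j-1}) to the stake a_{j-1}."""
    m1 = m2 = 0.0
    for xs in product((1, -1), repeat=n):
        t = 0.0
        for j in range(n):
            t += bet(xs[:j]) * xs[j]   # a_{j-1} depends only on the past
        m1 += t
        m2 += t * t
    return m1 / 2 ** n, m2 / 2 ** n

# Illustrative predictable rule: bet +1 after a win, -1 after a loss.
rule = lambda hist: 1.0 if (not hist or hist[-1] == 1) else -1.0
print(transform_moments(6, rule))   # (0.0, 6.0)
```

The mean is 0 exactly, because each term a_{j−1} ξ_j has conditional mean 0; the cross terms in the second moment cancel for the same reason.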

Suppose Xn is a sequence of nonnegative random variables that represent


the value of a security that is traded in the market place at a price that
is Xn for day n and changes overnight between day n and day n + 1 from
Xn to Xn+1 . We could at the end of day n, based on any information Fn
that is available to us at the end of that day be either long or short on the
security. The quantity an (ω) is the number of shares that we choose to own
overnight between day n and day n+ 1 and that could be a function of all the
information available to us up to that point. Positive values of an represent
long positions and negative values represent short positions. Our gain or loss
overnight is given by an(Xn+1 − Xn) and the cumulative gain (loss) is the
transform

Xn′ − X0′ = Σ_{j=1}^{n} a_{j−1}(Xj − Xj−1).

A contingent claim (European Option) is really a gamble or a bet based


on the value of XN at some terminal date N. The nature of the claim is that
there is function f (x) such that if the security trades on that day at a price

x then the claim pays an amount of f (x). A call is an option to buy at a


certain price a and the payoff is f (x) = (x − a)+ whereas a put is an option
to sell at a fixed price a and therefore has a payoff function f (x) = (a − x)+ .

Replicating a claim, if it is possible at all, is determining a0, a1, · · · , aN−1
and the initial value V0 such that the transform

VN = V0 + Σ_{j=0}^{N−1} aj (Xj+1 − Xj)

at time N equals the claim f (XN ) under every conceivable behavior of the
price movements X1 , X2 , · · · , XN . If the claim can be exactly replicated
starting from an initial capital of V0 , then V0 becomes the price of that
option. Anyone could sell the option at that price, use the proceeds as
capital and follow the strategy dictated by the coefficients a0 , · · · , aN −1 and
have exactly enough to pay off the claim at time N. Here we are ignoring
transaction costs as well as interest rates. It is not always true that a claim
can be replicated.
Let us assume for simplicity that the stock prices are always some non-
negative integral multiples of some unit. The set of possible prices can then
be taken to be the set of nonnegative integers. Let us make a crucial assump-
tion that if the price on some day is x the price on the next day is x ± 1. It
has to move up or down a notch. It cannot jump two or more steps or even
stay the same. When the stock price hits 0 we assume that the company
goes bankrupt and the stock stays at 0 for ever. In all other cases, from day
to day, it always moves either up or down a notch.
Let us value the claim f for one period. If the price on day N − 1 is x ≠ 0
and we have assets c on hand and invest in a shares we will end up on day
N, with either assets of c + a and a claim of f (x + 1) or assets of c − a with
a claim of f (x − 1). In order to make sure that we break even in either case,
we need
f (x + 1) = c + a ; f (x − 1) = c − a

and solving for a and c, we get

c(x) = (1/2)[f(x − 1) + f(x + 1)] ;  a(x) = (1/2)[f(x + 1) − f(x − 1)]

The value of the claim with one day left is

VN−1(x) = (1/2)[f(x − 1) + f(x + 1)] if x ≥ 1, and VN−1(0) = f(0),

and we can proceed by iteration:

Vj−1(x) = (1/2)[Vj(x − 1) + Vj(x + 1)] if x ≥ 1, and Vj−1(0) = Vj(0)

for j ≥ 1 till we arrive at the value V0 (x) of the claim at time 0 and price x.
The corresponding values a_{j−1}(x) = (1/2)[Vj(x + 1) − Vj(x − 1)] give us
the number of shares to hold between day j − 1 and day j if the current price at
time j − 1 equals x.
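The backward recursion is immediate to implement; the following sketch (the function name and grid handling are our own) prices a claim in the up/down-one-notch model and can be checked against exact values: a 3-day at-the-money call struck at 10 comes out to 3/4, and the claim f(x) = x is priced at the current price, as the martingale property demands.

```python
def price_claim(f, x0, N):
    """V_0(x0) by the backward recursion
    V_{j-1}(x) = (V_j(x-1) + V_j(x+1))/2 for x >= 1, V_{j-1}(0) = V_j(0)."""
    lo, hi = max(0, x0 - N), x0 + N
    V = {x: float(f(x)) for x in range(lo, hi + 1)}  # V_N = f on reachable prices
    for j in range(N, 0, -1):
        lo, hi = max(0, x0 - (j - 1)), x0 + (j - 1)
        # dict comprehension reads the previous day's values of V
        V = {x: V[0] if x == 0 else 0.5 * (V[x - 1] + V[x + 1])
             for x in range(lo, hi + 1)}
    return V[x0]

call = lambda x: max(x - 10, 0)           # payoff of a call struck at 10
print(price_claim(call, 10, 3))           # 0.75
print(price_claim(lambda x: x, 10, 3))    # 10.0
```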

Remark 5.13. The important fact is that the value is determined by arbi-
trage and is unaffected by the actual movement of the price so long as it is
compatible with the model.
Remark 5.14. The value does not depend on any statistical assumptions on
the various probabilities of transitions of price levels between successive days.
Remark 5.15. However the value can be interpreted as the expected value

V0(x) = E^{Px}[f(XN)]

where Px is the distribution of the random walk starting at x with probability
1/2 for transitions up or down a level, which is absorbed at 0.
Remark 5.16. Px can be characterized as the unique probability distribution
of (X0 , · · · , XN ) such that Px [X0 = x] = 1, Px [|Xj −Xj−1 | = 1|Xj−1 ≥ 1] = 1
for 1 ≤ j ≤ N and Xj is a martingale with respect to (Ω, Fj , Px ) where Fj
is generated by X0 , · · · , Xj .
Exercise 5.18. It is not necessary for the argument that the set of possible
price levels be equally spaced. If we make the assumption that for each price
level x > 0, the price on the following day can take only one of two possible
values h(x) > x and l(x) < x with a possible bankruptcy if the level 0 is
reached, a similar analysis can be worked out. Carry it out.

5.7 Martingales and Markov Chains.


One of the ways of specifying the joint distribution of a sequence X0 , · · · , Xn
of random variables is to specify the distribution of X0 and for each j ≥ 1,
specify the conditional distribution of Xj given the σ-field Fj−1 generated by
X0 , · · · , Xj−1. Equivalently instead of the conditional distributions one can
specify the conditional expectations E [f (Xj )|Fj−1] for 1 ≤ j ≤ n. Let us
write
hj−1 (X0 , · · · , Xj−1) = E [f (Xj )|Fj−1] − f (Xj−1)
so that, for 1 ≤ j ≤ n

E [{f (Xj ) − f (Xj−1) − hj−1(X0 , · · · , Xj−1)}|Fj−1] = 0

or

Zj^f = f(Xj) − f(X0) − Σ_{i=1}^{j} h_{i−1}(X0, · · · , Xi−1)

is a martingale for every f . It is not difficult to see that the specification


of {hi } for each f is enough to determine all the successive conditional ex-
pectations and therefore the conditional distributions. If in addition the
initial distribution of X0 is specified then the distribution of X0 , · · · , Xn is
completely determined.
If for each j and f , the corresponding hj−1 (X0 , · · · , Xj−1 ) is a function
hj−1 (Xj−1 ) of Xj−1 only, then the distribution of (X0 , · · · , Xn ) is Markov
and the transition probabilities are seen to be given by the relation

h_{j−1}(Xj−1) = E[[f(Xj) − f(Xj−1)] | Fj−1] = ∫ [f(y) − f(Xj−1)] π_{j−1,j}(Xj−1, dy).

In the case of a stationary Markov chain the relationship is


 
h_{j−1}(Xj−1) = h(Xj−1) = E[[f(Xj) − f(Xj−1)] | Fj−1] = ∫ [f(y) − f(Xj−1)] π(Xj−1, dy).

If we introduce the linear transformation (transition operator)


(Πf)(x) = ∫ f(y) π(x, dy)     (5.8)

then
h(x) = ([Π − I]f )(x).

Remark 5.17. In the case of a Markov chain on a countable state space


(Πf)(x) = Σ_y π(x, y) f(y)

and

h(x) = ([Π − I]f)(x) = Σ_y [f(y) − f(x)] π(x, y).

Remark 5.18. The measure Px on the space (Ω, F ) of sequences {xj : j ≥


0} from the state space X, that corresponds to the Markov Process with
transition probability π(x, dy), and initial state x, can be characterized as
the unique measure on (Ω, F) such that

Px[ω : x0 = x] = 1

and for every bounded measurable function f defined on the state space X,

f(xn) − f(x0) − Σ_{j=1}^{n} h(xj−1)

is a martingale with respect to (Ω, Fn, Px), where

h(x) = ∫_X [f(y) − f(x)] π(x, dy).

Let A ⊂ X be a measurable subset and let τA = inf{n ≥ 0 : xn ∈ A} be


the first entrance time into A. It is easy to see that τA is a stopping time. It
need not always be true that Px{τA < ∞} = 1. But UA(x) = Px{τA < ∞} is
a well defined measurable function of x that satisfies 0 ≤ UA(x) ≤ 1 for all x
and is the exit probability from the set A^c. By its very definition UA(x) ≡ 1
on A and if x ∉ A, by the Markov property,

UA(x) = π(x, A) + ∫_{A^c} UA(y) π(x, dy) = ∫_X UA(y) π(x, dy).

In other words UA satisfies 0 ≤ UA ≤ 1 and is a solution of

(Π − I)V = 0 on Ac
V = 1 on A (5.9)

Theorem 5.13. Among all nonnegative solutions V of the equation (5.9),
UA(x) = Px{τA < ∞} is the smallest. If UA(x) = 1, then any bounded
solution of the equation

(Π − I)V = 0 on A^c
V = f on A     (5.10)

is equal to

V(x) = E^{Px}[f(x_{τA})].     (5.11)

In particular if UA(x) = 1 for all x ∉ A, then any bounded solution V of
equation (5.10) is unique and is given by the formula (5.11).

Proof. First we establish that any nonnegative solution V of (5.9) dominates
UA. Let us replace V by W = min(V, 1). Then 0 ≤ W ≤ 1 everywhere,
W(x) = 1 for x ∈ A and for x ∉ A,

(ΠW)(x) = ∫_X W(y) π(x, dy) ≤ ∫_X V(y) π(x, dy) = V(x).

Since ΠW ≤ 1 as well, we conclude that ΠW ≤ W on A^c. On the other hand
it is obvious that ΠW ≤ 1 = W on A. Since we have shown that ΠW ≤ W
everywhere, it follows that {W(xn)} is a supermartingale with respect to
(Ω, Fn, Px). In particular for any bounded stopping time τ

E^{Px}[W(xτ)] ≤ E^{Px}[W(x0)] = W(x).

While we cannot take τ = τA (since τA may not be bounded), we can always


take τ = τN = min(τA, N) to conclude

E^{Px}[W(x_{τN})] ≤ E^{Px}[W(x0)] = W(x).

Let us let N → ∞. On the set {ω : τA(ω) < ∞}, τN ↑ τA and W(x_{τN}) →
W(x_{τA}) = 1. Since W is nonnegative and bounded,

W(x) ≥ limsup_{N→∞} E^{Px}[W(x_{τN})] ≥ limsup_{N→∞} ∫_{τA<∞} W(x_{τN}) dPx = Px{τA < ∞} = UA(x).

Since V(x) ≥ W(x) it follows that V(x) ≥ UA(x).
For a bounded solution V of (5.10), let us define h = (Π − I)V, which is a
function vanishing on A^c. We know that

V(xn) − V(x0) − Σ_{j=1}^{n} h(xj−1)

is a martingale with respect to (Ω, Fn, Px). Let us use the stopping theorem
with τN = min(τA, N). Since h(xj−1) = 0 for j ≤ τA, we obtain

V(x) = E^{Px}[V(x_{τN})].
If we now make the assumption that UA(x) = Px{τA < ∞} = 1, let N → ∞
and use the bounded convergence theorem, it is easy to see that

V(x) = E^{Px}[f(x_{τA})]

which proves (5.11) and the rest of the theorem.
Such arguments are powerful tools for the study of qualitative proper-
ties of Markov chains. Solutions to equations of the type [Π − I]V = f are
often easily constructed. They can be used to produce martingales, sub-martingales
or supermartingales that have certain behavior and that in turn
implies certain qualitative behavior of the Markov chain. We will now see
several illustrations of this method.
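As a first illustration, the minimal nonnegative solution of (5.9) can be computed by iterating V ← ΠV starting from 1_A: the iterates V_k(x) = P_x[τ_A ≤ k] increase to U_A(x). For the simple random walk on {0, 1, ..., N} with 0 and N absorbing and A = {N}, the limit is the classical gambler's-ruin value x/N; note that V ≡ 1 also solves (5.9), so the minimality in Theorem 5.13 is genuinely needed. (A small numerical sketch; the setup is our choice, not from the text.)

```python
def minimal_solution(N, iters=5000):
    """Iterate V <- Pi V from V = 1_A for the walk on {0,...,N} with 0 and N
    absorbing, A = {N}; the iterates increase to U_A(x) = P_x[tau_A < inf]."""
    V = [0.0] * N + [1.0]                 # V = 1 on A, 0 elsewhere
    for _ in range(iters):
        V = [V[0]] + [0.5 * (V[x - 1] + V[x + 1]) for x in range(1, N)] + [1.0]
    return V

V = minimal_solution(10)
print([round(v, 6) for v in V])   # close to [0.0, 0.1, ..., 0.9, 1.0]
```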
Example 5.1. Consider the symmetric simple random walk in one dimension.
We know from recurrence that the random walk exits the interval (−R, R)
in a finite time. But we want to get some estimates on the exit time τR .
Consider the function u(x) = cos λx. The function f(x) = [Πu](x) can be
calculated:

f(x) = (1/2)[cos λ(x − 1) + cos λ(x + 1)] = cos λ · cos λx = (cos λ) u(x).

If λ < π/(2R), then cos λx ≥ cos λR > 0 in [−R, R]. Consider Zn = e^{σn} cos λxn
with σ = − log cos λ. Then

E^{Px}[Zn | Fn−1] = e^{σn} cos λ · cos λ xn−1 = e^{σ(n−1)} cos λ xn−1 = Zn−1.

If τR is the exit time from the interval (−R, R), for any N, we have

E^{Px}[Z_{τR∧N}] = E^{Px}[Z0] = cos λx.

Since σ > 0 and cos λx ≥ cos λR > 0 for x ∈ [−R, R], if R is an integer, we
can claim that

E^{Px}[e^{σ(τR∧N)}] ≤ cos λx / cos λR.

Since the estimate is uniform in N we can let N → ∞ to get the estimate

E^{Px}[e^{στR}] ≤ cos λx / cos λR.
cos λR

Exercise 5.19. Can you prove equality above? What is the range of validity of
the equality? Is E^{Px}[e^{στR}] < ∞ for all σ > 0?
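The bound in Example 5.1 can also be explored numerically: u_k(x) = E_x[e^{σ(τ_R ∧ k)}] satisfies u_k(x) = e^σ (u_{k−1}(x−1) + u_{k−1}(x+1))/2 inside (−R, R) and u_k = 1 on |x| = R, and the iterates increase to cos λx / cos λR, which supports the equality asked about in the exercise. (A numerical sketch under our own choice of R and λ.)

```python
import math

def exp_moment(R, lam, steps):
    """u_k(x) = E_x[exp(sigma * (tau_R ∧ k))], sigma = -log cos(lam),
    by dynamic programming for the simple random walk on [-R, R]."""
    growth = 1.0 / math.cos(lam)              # e^sigma
    u = {x: 1.0 for x in range(-R, R + 1)}    # u_0 = 1 everywhere
    for _ in range(steps):
        # dict comprehension reads the previous iterate u
        u = {x: 1.0 if abs(x) == R else growth * 0.5 * (u[x - 1] + u[x + 1])
             for x in range(-R, R + 1)}
    return u

R = 5
lam = 0.9 * math.pi / (2 * R)                 # needs lam < pi/(2R)
u = exp_moment(R, lam, 4000)
# Compare with cos(lam * x) / cos(lam * R) at x = 0:
print(u[0] * math.cos(lam * R))               # close to 1.0
```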
Example 5.2. Let us make life slightly more complicated by taking a Markov
chain in Z d with transition probabilities
π(x, y) = 1/(2d) + δ(x, y) if |x − y| = 1, and π(x, y) = 0 if |x − y| ≠ 1,

so that we have slightly perturbed the random walk with perhaps even a
possible bias.
Exact calculations as in Example 5.1 are of course no longer possible.
Let us try to estimate again the exit time from a ball of radius R. For σ > 0
consider the function

F(x) = exp[σ Σ_{i=1}^{d} |xi|]

defined on Z^d. We can get an estimate of the form

(ΠF)(x1, · · · , xd) ≥ θ F(x1, · · · , xd)

for some choices of σ > 0 and θ > 1 that may depend on R. Now proceed as
in Example 5.1.

Example 5.3. We can use these methods to show that the random walk is
transient in dimension d ≥ 3.
For 0 < α < d − 2 consider the function V(x) = 1/|x|^α for x ≠ 0 with
V(0) = 1. An approximate calculation of (ΠV)(x) yields, for sufficiently
large |x| (i.e. |x| ≥ L for some L), the estimate
(ΠV )(x) − V (x) ≤ 0
If we start initially from an x with |x| > L and take τL to be the first
entrance time into the ball of radius L, one gets by the stopping theorem,
the inequality

E^{Px}[V(x_{τL∧N})] ≤ V(x).

If τL ≤ N, then |x_{τL}| ≤ L. In any case V(x_{τL∧N}) ≥ 0. Therefore,

Px[τL ≤ N] ≤ V(x) / inf_{|y|≤L} V(y)

valid uniformly in N. Letting N → ∞,

Px[τL < ∞] ≤ V(x) / inf_{|y|≤L} V(y).
If we let |x| → ∞, keeping L fixed, we see the transience. Note that recurrence
implies that Px[τL < ∞] = 1 for all x. The proof of transience really
only required a function V, defined for large |x|, that was strictly positive for
each x, went to 0 as |x| → ∞ and had the property (ΠV)(x) ≤ V(x) for
large values of |x|.
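The superharmonicity of V(x) = |x|^{−α} for 0 < α < d − 2 can be spot-checked numerically in d = 3; the sample points and the choice α = 1/2 below are ours, and only large |x| matters for the argument.

```python
def laplacian_term(V, x):
    """((Pi - I)V)(x) for the simple random walk on Z^3
    (each of the 6 nearest neighbors has probability 1/6)."""
    total = 0.0
    for i in range(3):
        for s in (1, -1):
            y = list(x)
            y[i] += s
            total += V(tuple(y))
    return total / 6 - V(x)

alpha = 0.5                                    # any 0 < alpha < d - 2 = 1
V = lambda x: sum(c * c for c in x) ** (-alpha / 2)
for x in [(5, 0, 0), (4, 3, 1), (10, 10, 10)]:
    print(x, laplacian_term(V, x) <= 0)        # expect True at large |x|
```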
Example 5.4. We will now show that the random walk is recurrent in d = 2.
This is harder because the recurrence of random walk in d = 2 is right
on the border. We want to construct a function V(x) → ∞ as |x| → ∞ that
satisfies (ΠV)(x) ≤ V(x) for large |x|. If we succeed, then we can estimate
by a stopping argument the probability that the chain starting from a point
x in the annulus ℓ < |x| < L exits at the outer circle before getting inside
the inner circle:

Px[τL < τℓ] ≤ V(x) / inf_{|y|≥L} V(y).

We also have for every L,

Px[τL < ∞] = 1.

This proves that Px[τℓ < ∞] = 1, thereby proving recurrence. The natural
candidate is F(x) = log |x| for x ≠ 0. A computation yields

(ΠF)(x) − F(x) ≤ C/|x|^4

which does not quite make it. On the other hand if U(x) = |x|^{−1}, for large
values of |x|,

(ΠU)(x) − U(x) ≥ c/|x|^3

for some c > 0. The choice of V(x) = F(x) − CU(x) = log |x| − C/|x| works with
any C > 0.

Example 5.5. We can use these methods for proving positive recurrence as
well.
Suppose X is a countable set and we can find V ≥ 0, a finite set F and
a constant C ≥ 0 such that
(ΠV)(x) − V(x) ≤ −1 for x ∉ F, and (ΠV)(x) − V(x) ≤ C for x ∈ F.

Let us let U = ΠV − V, and we have

−V(x) ≤ E^{Px}[V(xn)] − V(x)
      = E^{Px}[ Σ_{j=1}^{n} U(xj−1) ]
      ≤ E^{Px}[ C Σ_{j=1}^{n} 1_F(xj−1) − Σ_{j=1}^{n} 1_{F^c}(xj−1) ]
      = −E^{Px}[ Σ_{j=1}^{n} {1 − (1 + C) 1_F(xj−1)} ]
      = −n + (1 + C) Σ_{j=1}^{n} Σ_{y∈F} π^{j−1}(x, y)
      = −n + o(n) as n → ∞

if the process is not positive recurrent. This is a contradiction.


For instance if X = Z, the integers, and we have a little bit of bias
towards the origin in the random walk,

π(x, x + 1) − π(x, x − 1) ≥ a/|x| if x ≤ −ℓ
π(x, x − 1) − π(x, x + 1) ≥ a/|x| if x ≥ ℓ

then with V(x) = x^2, for x ≥ ℓ,

(ΠV)(x) ≤ (1/2)(x + 1)^2 (1 − a/|x|) + (1/2)(x − 1)^2 (1 + a/|x|) = x^2 + 1 − 2a.

If a > 1/2, we can multiply V by a constant and it works.


Exercise 5.20. What happens when

π(x, x + 1) − π(x, x − 1) = −1/(2x)

for |x| ≥ 10? (See Exercise 4.16)
Example 5.6. Let us return to our example of a branching process, Example
4.4. We see from the relation

E[Xn+1 | Fn] = m Xn

that Xn/m^n is a martingale. If m < 1 we saw before quite easily that the
population becomes extinct. If m = 1, Xn is itself a martingale. Since it is
nonnegative it is L1 bounded and must have an almost sure limit as n →
∞. Since the population size is an integer, this means that the size eventually
stabilizes. The limit can only be 0 because the population cannot stabilize
at any other size. If m > 1 there is a probability 0 < q < 1 such that
P[Xn → 0 | X0 = 1] = q. We can show that with probability 1 − q, Xn → ∞.
To see this consider the function u(x) = q^x. In the notation of Example 4.4,

E[q^{Xn+1} | Fn] = [Σ_j q^j pj]^{Xn} = [P(q)]^{Xn} = q^{Xn}

so that q^{Xn} is a nonnegative martingale. It then has an almost sure limit,
which can only be 0 or 1. If q is the probability that the limit is 1, i.e. that
Xn → 0, then 1 − q is the probability that it is 0, i.e. that Xn → ∞.
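For a concrete offspring law, the extinction probability q is the smallest root of q = P(q) and can be found by iterating the generating function from 0; the distribution below is an illustrative choice of ours, not from the text. With p0 = 1/4, p1 = 1/4, p2 = 1/2 one has m = 5/4 > 1, and q solves q = 1/4 + q/4 + q^2/2, giving q = 1/2.

```python
def extinction_prob(p, iters=200):
    """Smallest fixed point of the offspring generating function
    P(s) = sum_j p_j s^j, obtained by iterating q <- P(q) from q = 0."""
    P = lambda s: sum(pj * s ** j for j, pj in enumerate(p))
    q = 0.0
    for _ in range(iters):
        q = P(q)
    return q

# p0 = 1/4, p1 = 1/4, p2 = 1/2: m = 5/4 > 1 and the smallest root is 1/2.
print(extinction_prob([0.25, 0.25, 0.5]))   # ≈ 0.5
```

When m ≤ 1 the iteration converges to 1, matching the almost sure extinction discussed above.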
Chapter 6

Stationary Stochastic Processes.

6.1 Ergodic Theorems.


A stationary stochastic process is a collection {ξn : n ∈ Z} of random vari-
ables with values in some space (X, B) such that the joint distribution of
(ξn1 , · · · , ξnk ) is the same as that of (ξn1 +n , · · · , ξnk +n ) for every choice of
k ≥ 1, and n, n1 , · · · , nk ∈ Z. Assuming that the space (X, B) is reasonable
and Kolmogorov’s consistency theorem applies, we can build a measure P on
the countable product space Ω of sequences {xn : n ∈ Z} with values in X,
defined for sets in the product σ-field F . On the space Ω there is the natural
shift defined by (T ω)(n) = xn+1 for ω with ω(n) = xn . The random variables
xn (ω) = ω(n) are essentially equivalent to {ξn }. The stationarity of the pro-
cess is reflected in the invariance of P with respect to T i.e. P T −1 = P . We
can, without being specific, consider a space Ω, a σ-field F, a one-to-one
invertible measurable map T from Ω → Ω with a measurable inverse T^{−1}, and finally
a probability measure P on (Ω, F) that is T-invariant, i.e. P(T^{−1}A) = P(A)
for every A ∈ F . One says that P is an invariant measure for T or T is a
measure preserving transformation for P . If we have a measurable map from
ξ : (Ω, F ) → (X, B), then it is easily seen that ξn (ω) = ξ(T n ω) defines a sta-
tionary stochastic process. The study of stationary stochastic process is then
more or less the same as the study of measure preserving (i.e. probability
preserving) transformations.
The basic transformation T : Ω → Ω induces a linear transformation U


on the space of functions defined on Ω by the rule (Uf)(ω) = f(Tω). Because
T is measure preserving it is easy to see that

∫_Ω f(ω) dP = ∫_Ω f(Tω) dP = ∫_Ω (Uf)(ω) dP

as well as

∫_Ω |f(ω)|^p dP = ∫_Ω |f(Tω)|^p dP = ∫_Ω |(Uf)(ω)|^p dP.

In other words U acts as an isometry (i.e. a norm preserving linear transformation)
on the various Lp spaces for 1 ≤ p < ∞, and in fact it is an isometry on
L∞ as well. Moreover the transformation induced by T^{−1} is the inverse of U,
so that U is also invertible. In particular U is unitary (or orthogonal) on L2.
This means it preserves the inner product ⟨·, ·⟩:

⟨f, g⟩ = ∫ f(ω) g(ω) dP = ∫ f(Tω) g(Tω) dP = ⟨Uf, Ug⟩.

Of course our linear transformation U is very special and satisfies U1 = 1
and U(fg) = (Uf)(Ug).
A basic theorem known as the Ergodic theorem asserts that
Theorem 6.1. For any f ∈ L1(P) the limit

lim_{n→∞} [f(ω) + f(Tω) + · · · + f(T^{n−1}ω)]/n = g(ω)

exists for almost all ω with respect to P as well as in L1(P). Moreover if
f ∈ Lp for some p satisfying 1 < p < ∞ then the function g ∈ Lp and the
convergence takes place in that Lp. Moreover the limit g(ω) is given by the
conditional expectation

g(ω) = E^P[f | I]

where the σ-field I, called the invariant σ-field, is defined as

I = {A : TA = A}.
Proof. First we prove the convergence in the various Lp spaces. These are
called mean ergodic theorems. The easiest situation to prove is when p = 2.
Let us define

H0 = {f : f ∈ H, Uf = f} = {f : f ∈ H, f(Tω) = f(ω)}.

Since H0 contains the constants, it is a closed nontrivial subspace of H = L2(P)
of dimension at least one. Since U is unitary, Uf = f if and only if U^{−1}f =
U*f = f, where U* is the adjoint of U. The orthogonal complement H0^⊥ can
be described as

H0^⊥ = {g : ⟨g, f⟩ = 0 ∀ f with U*f = f} = closure of Range(I − U).

Clearly if we let

An f = (f + Uf + · · · + U^{n−1}f)/n

then ‖An f‖2 ≤ ‖f‖2 for every f ∈ H and An f = f for every n and f ∈ H0.
Therefore for f ∈ H0, An f → f as n → ∞. On the other hand if f = (I − U)g,
then An f = (g − U^n g)/n and ‖An f‖2 ≤ 2‖g‖2/n → 0 as n → ∞. Since ‖An‖ ≤ 1, it follows
that An f → 0 as n → ∞ for every f ∈ H0^⊥ = closure of Range(I − U). (See Exercise
6.1.) If we denote by π the orthogonal projection from H → H0, we see that
An f → πf as n → ∞ for every f ∈ H, establishing the L2 ergodic theorem.
There is an alternate characterization of H0. Functions f in H0 are invariant
under T, i.e. have the property that f(Tω) = f(ω). For any invariant
function f the level sets {ω : a < f(ω) < b} are invariant under T. We
can therefore talk about invariant sets {A : A ∈ F, T^{−1}A = A}. Technically
we should allow ourselves to differ by sets of measure zero, and one defines
I = {A : P(A ∆ T^{−1}A) = 0} as the σ-field of almost invariant sets. Nothing
is therefore lost by taking I to be the σ-field of invariant sets. We can
identify the orthogonal projection π as (see Exercise 4.8)

πf = E^P[f | I]

and as the conditional expectation operator, π is well defined on Lp as an
operator of norm 1, for all p in the range 1 ≤ p ≤ ∞. If f ∈ L∞, then
‖An f‖∞ ≤ ‖f‖∞ and by the bounded convergence theorem, for any p satisfying
1 ≤ p < ∞, we have ‖An f − πf‖p → 0 as n → ∞. Since L∞ is
dense in Lp and ‖An‖ ≤ 1 in all the Lp spaces, it is easily seen, by a simple
approximation argument, that for each p in 1 ≤ p < ∞ and f ∈ Lp,

lim_{n→∞} ‖An f − πf‖p = 0

proving the mean ergodic theorem in all the Lp spaces.


We now concentrate on proving almost sure convergence of An f to πf
for f ∈ L1 (P ). This part is often called the ‘individual ergodic theorem’ or

‘Birkhoff’s theorem’. This will be based on an analog of Doob’s inequality for
martingales. First we will establish an inequality called the maximal ergodic
theorem.
Theorem 6.2. (Maximal Ergodic Theorem.) Let f ∈ L1(P) and for
n ≥ 1, let

E_n^0 = {ω : sup_{1≤j≤n} [f(ω) + f(Tω) + · · · + f(T^{j−1}ω)] ≥ 0}.

Then

∫_{E_n^0} f(ω) dP ≥ 0.

Proof. Let

hn(ω) = sup_{1≤j≤n} [f(ω) + f(Tω) + · · · + f(T^{j−1}ω)] = f(ω) + max(0, hn−1(Tω)) = f(ω) + h^+_{n−1}(Tω)

where

h^+_n(ω) = max(0, hn(ω)).

On E_n^0, hn(ω) = h^+_n(ω) and therefore

f(ω) = hn(ω) − h^+_{n−1}(Tω) = h^+_n(ω) − h^+_{n−1}(Tω).

Consequently,

∫_{E_n^0} f(ω) dP = ∫_{E_n^0} [h^+_n(ω) − h^+_{n−1}(Tω)] dP
               ≥ ∫_{E_n^0} [h^+_n(ω) − h^+_n(Tω)] dP   (because h^+_{n−1}(ω) ≤ h^+_n(ω))
               = ∫_{E_n^0} h^+_n(ω) dP − ∫_{T E_n^0} h^+_n(ω) dP   (because of the invariance of P under T)
               ≥ 0.

The last step follows from the fact that for any integrable function h(ω),
∫_E h(ω) dP is largest when we take for E the set E = {ω : h(ω) ≥ 0}.

Now we establish the analog of Doob’s inequality, or maximal inequality,
sometimes referred to as the weak-type (1,1) inequality.
Lemma 6.3. For any f ∈ L1(P) and ℓ > 0, denoting by En the set

En = {ω : sup_{1≤j≤n} |(Aj f)(ω)| ≥ ℓ}

we have

P[En] ≤ (1/ℓ) ∫_{En} |f(ω)| dP.

In particular

P[ω : sup_{j≥1} |(Aj f)(ω)| ≥ ℓ] ≤ (1/ℓ) ∫ |f(ω)| dP.

Proof. We can assume without loss of generality that f ∈ L1(P) is nonnegative.
Apply Theorem 6.2 to f − ℓ. If

En = {ω : sup_{1≤j≤n} [f(ω) + f(Tω) + · · · + f(T^{j−1}ω)]/j > ℓ},

then

∫_{En} [f(ω) − ℓ] dP ≥ 0

or

P[En] ≤ (1/ℓ) ∫_{En} f(ω) dP.

We are done.

Given the lemma the proof of the almost sure ergodic theorem follows
along the same lines as the proof of the almost sure convergence in the
martingale context. If f ∈ H0 it is trivial. For f = (I − U)g with g ∈ L∞ it
is equally trivial because ‖An f‖∞ ≤ 2‖g‖∞/n. So the almost sure convergence
is valid for f = f1 + f2 with f1 ∈ H0 and f2 = (I − U)g with g ∈ L∞ . But
such functions are dense in L1 (P ). Once we have almost sure convergence
for a dense set in L1 (P ), the almost sure convergence for every f ∈ L1 (P )
follows by routine approximation using Lemma 6.3. See the proof of Theorem
5.7.

Exercise 6.1. For any bounded linear transformation A on a Hilbert space
H, show that the closure of the range of A, i.e. Range A, is the orthogonal
complement of the null space {f : A*f = 0}, where A* is the adjoint of A.
Exercise 6.2. Show that any almost invariant set differs by a set of measure
0 from an invariant set i.e. if P (A ∆T −1 A) = 0 then there is a B ∈ F with
P (A∆B) = 0 and T −1 B = B.
Although the ergodic theorem implies a strong law of large numbers for
any stationary sequence of random variables, in particular a sequence of
independent identically distributed random variables, it is not quite the end
of the story. For the law of large numbers, we need to know that the limit
πf is a constant, which will then equal ∫ f(ω) dP. To claim this, we need
to know that the invariant σ-field is trivial, i.e. essentially consists of the
whole space Ω and the empty set ∅. An invariant measure P is said to be
ergodic for the transformation T if every A ∈ I, i.e. every invariant set, has
measure 0 or 1. Then every invariant function is almost surely a constant
and πf = E^P[f | I] = ∫ f(ω) dP.

Theorem 6.4. Any product measure is ergodic for the shift.

Proof. Let A be an invariant set. Then A can be approximated by sets An in
the σ-field corresponding to the coordinates from [−n, n]. Since A is invariant,
T^{±2n}An will approximate A just as well. This proves that A actually belongs
to the tail σ-field, the remote past as well as the remote future. Now we
can use Kolmogorov’s 0 − 1 law (Theorem 3.15) to assert that P(A) = 0 or
1.
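Combined with the ergodic theorem, this gives the strong law for i.i.d. sequences: for the shift on an i.i.d. sequence, the ergodic average of f(ω) = x0 is just the sample mean, and since the product measure is ergodic it settles near the constant E[f]. A seeded simulation (our own construction) illustrates this:

```python
import random

# Shift on an i.i.d. 0/1 sequence: the product measure is ergodic, so the
# ergodic average A_n f for f(omega) = x_0 converges to E[f] = 1/2.
random.seed(1)
xs = [random.randint(0, 1) for _ in range(200000)]
n = len(xs)
# f(T^k omega) = x_k, so A_n f is the sample mean of the first n coordinates.
A_n = sum(xs) / n
print(A_n)   # close to 0.5
```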

6.2 Structure of Stationary Measures.


Given a space (Ω, F ) and a measurable transformation T with a measurable
inverse T −1 , we can consider the space M of all T -invariant probability
measures on (Ω, F ). The set M, which may be empty, is easily seen to be a
convex set.
Exercise 6.3. Let Ω = Z, the integers, and for n ∈ Z, let T n = n + 1. Show
that M is empty.

Theorem 6.5. A probability measure P ∈ M is ergodic if and only if it is


an extreme point of M.

Proof. A point of a convex set is extreme if it cannot be written as a nontrivial


convex combination of two other points from that set. Suppose P ∈ M is
not extremal. Then P can be written as nontrivial convex combination of
P1 , P2 ∈ M, i.e. for some 0 < a < 1 and P1 6= P2 , P = aP1 + (1 − a)P2 . We
claim that such a P cannot be ergodic. If it were, by definition, P (A) = 0 or
1 for every A ∈ I. Since P (A) can be 0 or 1 only when P1 (A) = P2 (A) = 0
or P1 (A) = P2 (A) = 1, it follows that for every invariant set A ∈ I, P1 (A) =
P2 (A). We now show that if two invariant measures P1 and P2 agree on I,
they agree on F . Let f (ω) be any bounded F -measurable function. Consider
the function

h(ω) = lim_{n→∞} (1/n)[f(ω) + f(Tω) + · · · + f(T^{n−1}ω)]

defined on the set E where the limit exists. By the ergodic theorem P1 (E) =
P2 (E) = 1 and h is I measurable. Moreover, by the stationarity of P1 , P2
and the bounded convergence theorem,

E^{Pi}[f(ω)] = ∫_E h(ω) dPi for i = 1, 2.

Since P1 = P2 on I, h is I-measurable, and Pi(E) = 1 for i = 1, 2, we see
that

E^{P1}[f(ω)] = E^{P2}[f(ω)].
Since f is arbitrary this implies that P1 = P2 on F .
Conversely if P is not ergodic, then there is an A ∈ I with 0 < P (A) < 1
and we define
P1(E) = P(A ∩ E)/P(A) ;  P2(E) = P(A^c ∩ E)/P(A^c).

Since A ∈ I it follows that the P_i are stationary. Moreover P = P(A)P_1 +
P(A^c)P_2 and hence P is not extremal.
One of the questions in the theory of convex sets is the existence of
sufficiently many extremal points, enough to recover the convex set by taking
convex combinations. In particular one can ask if any point in the convex
set can be obtained by taking a weighted average of the extremals. The next
theorem answers the question in our context. We will assume that our space
(Ω, F ) is nice, i.e. is a complete separable metric space with its Borel sets.

Theorem 6.6. For any invariant measure P , there is a probability measure
µ_P on the set M_e of ergodic measures such that
    P = ∫_{M_e} Q µ_P(dQ)

Proof. If we denote by P_ω the regular conditional probability distribution of
P given I, which exists (see Theorem 4.4) because (Ω, F) is nice, then
    P = ∫ P_ω P(dω)

We will complete the proof by showing that P_ω is an ergodic stationary
probability measure for almost all ω with respect to P . We can then view
Pω as a map Ω → Me and µP will be the image of P under the map. Our
integral representation in terms of ergodic measures will just be an immediate
consequence of the change of variables formula.

Lemma 6.7. For any stationary probability measure P , for almost all ω with
respect to P , the regular conditional probability distribution Pω , of P given
I, is stationary and ergodic.

Proof. Let us first prove stationarity. We need to prove that P_ω(A) =
P_ω(T A) a.e. We have to negotiate carefully through null sets. Since a mea-
sure on the Borel σ-field F of a complete separable metric space is determined
by its values on a countable generating field F0 ⊂ F , it is sufficient to prove
that for each fixed A ∈ F0 , Pω (A) = Pω (T A) a.e. P . Since Pω is I measur-
able all we need to show is that for any E ∈ I,
    ∫_E P_ω(A) P(dω) = ∫_E P_ω(T A) P(dω)

or equivalently
P (E ∩ A) = P (E ∩ T A)
This is obvious because P is stationary and E is invariant.
We now turn to ergodicity. Again there is a minefield of null sets to
negotiate. It is a simple exercise to check that if, for some stationary measure
Q, the ergodic theorem is valid with an almost surely constant limit for the
indicator functions 1A with A ∈ F0 , then Q is ergodic. This needs to be
checked only for a countable collection of sets {A}. We therefore only need to

check that any invariant function is constant almost surely with respect to
almost all Pω . Equivalently for any invariant set E, Pω (E) must be shown
almost surely to be equal to 0 or 1. But Pω (E) = χE (ω) and is always 0 or
1. This completes the proof.

Exercise 6.4. Show that any two distinct ergodic invariant measures P1 and
P2 are orthogonal on I, i.e. there is an invariant set E such that P1 (E) = 1
and P2 (E) = 0.
Exercise 6.5. Let (Ω, F) = ([0, 1), B) and T x = x + a (mod 1). If a is irra-
tional there is just one invariant measure P , namely the uniform distribution
on [0, 1). This is seen by Fourier Analysis. See Remark 2.2.
    ∫ e^{i2nπx} dP = ∫ e^{i2nπ(Tx)} dP = ∫ e^{i2nπ(x+a)} dP = e^{i2nπa} ∫ e^{i2nπx} dP

If a is irrational, e^{i2nπa} = 1 if and only if n = 0. Therefore

    ∫ e^{i2nπx} dP = 0    for n ≠ 0

which makes P uniform. Now let a = p/q be rational with (p, q) = 1, i.e. p
and q are relatively prime. Then, for any x, the discrete distribution with
probabilities 1/q at the points {x, x+a, x+2a, . . . , x+(q−1)a} is invariant and
ergodic. We can denote this distribution by Px . If we limit x to the interval
0 ≤ x < 1/q then x is uniquely determined by P_x. Complete the example by
determining all T invariant probability distributions on [0, 1) and find the
integral representation in terms of the ergodic ones.
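The irrational-rotation case can be explored numerically: the ergodic averages (1/n) Σ_{j<n} f(T^j x) should settle near the space average of f. A minimal sketch; the particular a, f, and starting point are illustrative choices, not part of the exercise:

```python
# Numerical sketch of the irrational rotation: the ergodic averages
# (1/n) sum_{j<n} f(T^j x) should settle near the space average of f.
# The choices of a, f, and the starting point are illustrative.
import math

def rotation_average(f, x, a, n):
    """Ergodic average of f along the orbit of T x = x + a (mod 1)."""
    s, y = 0.0, x
    for _ in range(n):
        s += f(y)
        y = (y + a) % 1.0
    return s / n

a = math.sqrt(2) - 1                     # an irrational rotation number
f = lambda x: math.cos(2 * math.pi * x)  # space average is 0
print(rotation_average(f, 0.3, a, 100_000))   # very close to 0
```

For a rational a = p/q the same averages would instead converge to the average over the finite orbit {x, x+a, . . . , x+(q−1)a}, in line with the discrete invariant measures described above.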

6.3 Stationary Markov Processes.


Let π(x, dy) be a transition probability function on (X, B), where X is a state
space and B is a σ-field of measurable subsets of X. A stochastic process with
values in X is a probability measure on the space (Ω, F ), where Ω is the space
of sequences {xn : −∞ < n < ∞} with values in X, and F is the product
σ-field. The space (Ω, F) has some natural sub σ-fields. For any two integers
m ≤ n, we have the sub σ-fields F_m^n = σ{x_j : m ≤ j ≤ n} corresponding
to information about the process during the time interval [m, n]. In addition
we have F_n = F_{−∞}^n = σ{x_j : j ≤ n} and F^m = F_m^∞ = σ{x_j : j ≥ m} that

correspond to the past and future. P is a Markov process on (Ω, F) with
transition probability π(·, ·) if, for every n, A ∈ B and P-almost all ω,

    P{x_{n+1} ∈ A | F_n} = π(x_n, A)

Remark 6.1. Given a π, it is not always true that P exists. A simple but
illuminating example is to take X = {0, 1, · · · , n, · · · } to be the nonnegative
integers and define π(x, x + 1) = 1 and all the process does is move one step
to the right every time. Such a process, if it had started a long time back, would
be found nowhere today! So it does not exist. On the other hand if we take
X to be the set of all integers then P is seen to exist. In fact there are lots
of them. What is true however is that given any initial distribution µ and
initial time m, there exists a unique process P on (Ω, F^m), i.e. defined on the
future σ-field from time m on, that is Markov with transition probability π
and satisfies P {xm ∈ A} = µ(A) for all A ∈ B.
The shift T acts naturally as a measurable invertible map on the product
space Ω into itself and the notion of a stationary process makes sense. The
following theorem connects stationarity and the Markov property.
Theorem 6.8. Let the transition probability π be given. Let P be a station-
ary Markov process with transition probability π. Then the one dimensional
marginal distribution µ, which is independent of time because of stationarity
and given by

    µ(A) = P{x_n ∈ A}

is π-invariant in the sense that

    µ(A) = ∫ π(x, A) µ(dx)

for every set A ∈ B. Conversely, given such a µ, there is a unique stationary
Markov process P with marginals µ and transition probability π.

Exercise 6.6. Prove the above Theorem. Use Remark 4.7.


Exercise 6.7. If P is a stationary Markov process on a countable state space
with transition probability π and invariant marginal distribution µ, show that
the time reversal map that maps {xn } to {x−n } takes P to another stationary
Markov process Q, and express the transition probability π̂ of Q, as explicitly
as you can in terms of π and µ .
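For a finite state space the time reversal is easy to experiment with numerically. A sketch: with invariant measure µ, one natural candidate for the reversed kernel is π̂(x, y) = µ(y)π(y, x)/µ(x), and the checks below are consistent with it (the 3×3 kernel is an arbitrary illustrative example, and the derivation is left to the exercise):

```python
# Finite-state sketch of the time reversal in this exercise: with invariant
# measure mu, one natural candidate for the reversed kernel is
# pi_hat(x, y) = mu(y) pi(y, x) / mu(x); the checks below support it.
# The 3x3 kernel is an arbitrary illustrative example.

def invariant_measure(pi, iters=2000):
    """Power iteration for mu = mu pi on a finite stochastic matrix."""
    n = len(pi)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[x] * pi[x][y] for x in range(n)) for y in range(n)]
    return mu

def reversed_kernel(pi, mu):
    n = len(pi)
    return [[mu[y] * pi[y][x] / mu[x] for y in range(n)] for x in range(n)]

pi = [[0.5, 0.3, 0.2],
      [0.1, 0.6, 0.3],
      [0.4, 0.4, 0.2]]
mu = invariant_measure(pi)
pi_hat = reversed_kernel(pi, mu)
print([round(sum(row), 6) for row in pi_hat])    # each row sums to 1
```

One can also check numerically that µ is again invariant for π̂, and that π̂ = π exactly when the chain satisfies the detailed balance condition µ(x)π(x, y) = µ(y)π(y, x), which connects with the reversibility discussed in the next exercise.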

Exercise 6.8. If µ is an invariant measure for π, show that the conditional
expectation map Π : f(·) → ∫ f(y) π(·, dy) induces a contraction in L_p(µ)
for any p ∈ [1, ∞]. We say that a Markov process is reversible if the time
reversed process Q of the previous example coincides with P . Show that P
corresponding to π and µ is reversible if and only if the corresponding Π in
L2 (µ) is self-adjoint or symmetric.
Since a given transition probability π may in general have several invariant
measures µ, there will be several stationary Markov processes with transition
probability π. Let M̃ be the set of invariant probability measures for the
transition probability π(x, dy), i.e.

    M̃ = { µ : µ(A) = ∫_X π(x, A) dµ(x) for all A ∈ B }

M̃ is a convex set of probability measures and we denote by M̃_e its (possibly
empty) set of extremals. For each µ ∈ M̃, we have the corresponding
stationary Markov process Pµ and the map µ → Pµ is clearly linear. If we
want Pµ to be an ergodic stationary process, then it must be an extremal in
the space of all stationary processes. The extremality of µ ∈ M̃ is therefore a
necessary condition for Pµ to be ergodic. That it is also sufficient is a little
bit of a surprise. The following theorem is the key step in the proof. The
remaining part is routine.
Theorem 6.9. Let µ be an invariant measure for π and P = Pµ the corre-
sponding stationary Markov process. Let I be the σ-field of shift invariant
subsets of Ω. To within sets of P-measure 0, I ⊂ F_0^0.
Proof. This theorem describes completely the structure of nontrivial sets in
the σ-field I of invariant sets for a stationary Markov process with transition
probability π and marginal distribution µ. Suppose that the state space can
be partitioned nontrivially i.e. with 0 < µ(A) < 1 into two sets A and Ac
that satisfy π(x, A) = 1 a.e µ on A and π(x, Ac ) = 1 a.e µ on Ac . Then the
event
E = {ω : xn ∈ A for all n ∈ Z}
provides a non trivial set in I. The theorem asserts the converse. The
proof depends on the fact that an invariant set E is in the remote past
F_{−∞} = ∩_n F_{−∞}^n as well as in the remote future F^∞ = ∩_m F_m^∞. See the
proof of Theorem 6.4. For a Markov process the past and the future are

conditionally independent given the present. See Theorem 4.9. This implies
that

    P[E | F_0^0] = P[E ∩ E | F_0^0] = P[E | F_0^0] · P[E | F_0^0]
and must therefore equal either 0 or 1. This in turn means that corresponding
to any invariant set E ∈ I, there exists A ⊂ X that belongs to B, such that
E = {ω : xn ∈ A for all n ∈ Z} up to a set of P measure 0. If the
Markov process starts from A or Ac , it does not ever leave it. That means
0 < µ(A) < 1 and

π(x, Ac ) = 0 for µ a.e x ∈ A and π(x, A) = 0 for µ a.e x ∈ Ac

Remark 6.2. One way to generate Markov processes with multiple invariant
measures is to start with two Markov processes with transition probabilities
π_i(x_i, dy_i) on X_i and invariant measures µ_i, and consider X = X_1 ∪ X_2.
Define
    π(x, A) = π_1(x, A ∩ X_1)    if x ∈ X_1
            = π_2(x, A ∩ X_2)    if x ∈ X_2

Then any one of the two processes can be going on depending on which world
we are in. Both µ1 and µ2 are invariant measures. We have combined two
distinct possibilities into one. What we have shown is that when we have
multiple invariant measures they essentially arise in this manner.
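For finite chains the gluing in this remark can be checked directly: each chain's invariant measure, extended by zero to the other block, remains invariant for the combined kernel, as does any convex combination. A sketch with two illustrative 2×2 kernels:

```python
# Finite-state sketch of this remark: glue two chains into one block kernel.
# Each invariant measure, extended by zero, remains invariant, as does any
# convex combination. The 2x2 kernels are illustrative.

p1 = [[0.9, 0.1], [0.2, 0.8]]           # invariant measure (2/3, 1/3)
p2 = [[0.5, 0.5], [0.5, 0.5]]           # invariant measure (1/2, 1/2)

def block(a, b):
    na, nb = len(a), len(b)
    pi = [[0.0] * (na + nb) for _ in range(na + nb)]
    for i in range(na):
        for j in range(na):
            pi[i][j] = a[i][j]
    for i in range(nb):
        for j in range(nb):
            pi[na + i][na + j] = b[i][j]
    return pi

def step(mu, pi):
    """One application of mu -> mu pi."""
    n = len(mu)
    return [sum(mu[x] * pi[x][y] for x in range(n)) for y in range(n)]

pi = block(p1, p2)
mu1 = [2/3, 1/3, 0.0, 0.0]
mu2 = [0.0, 0.0, 0.5, 0.5]
print(step(mu1, pi))                    # equals mu1 again
print(step(mu2, pi))                    # equals mu2 again
```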
Remark 6.3. We can therefore look at the convex set of measures µ that are π
invariant, i.e. µΠ = µ. The extremals of this convex set are precisely the ones
that correspond to ergodic stationary processes and they are called ergodic
or extremal invariant measures. If the set of invariant probability measures
is nonempty for some π, then there are enough extremals to recover an arbitrary
invariant measure as an integral or weighted average of extremal ones.
Exercise 6.9. Show that any two distinct extremal invariant measures µ1 and
µ2 for the same π are orthogonal on B.
Exercise 6.10. Consider the operator Π on the Lp (µ) spaces corresponding to
a given invariant measure. The dimension of the eigenspace {f : Πf = f} that
corresponds to the eigenvalue 1 determines the extremality of µ. Clarify this
statement.

Exercise 6.11. Let Px be the Markov process with stationary transition prob-
ability π(x, dy) starting at time 0 from x ∈ X. Let f be a bounded mea-
surable function on X. Then for almost all x with respect to any extremal
invariant measure ν,

    lim_{n→∞} (1/n) [f(x_1) + · · · + f(x_n)] = ∫ f(y) ν(dy)

for almost all ω with respect to Px .


Exercise 6.12. We saw in the earlier section that any stationary process is
an integral over stationary ergodic processes. If we represent a stationary
Markov Process Pµ as the integral
    P_µ = ∫ R Q(dR)

over stationary ergodic processes, show that the integral really involves only
stationary Markov processes with transition probability π, so that the inte-
gral is really of the form
    P_µ = ∫_{M̃_e} P_ν Q(dν)

or equivalently

    µ = ∫_{M̃_e} ν Q(dν).
Exercise 6.13. If there is a reference measure α such that π(x, dy) has a
density p(x, y) with respect to α for every x, then show that any invari-
ant measure µ is absolutely continuous with respect to α. In this case the
eigenspace {f : Πf = f} in L_2(µ) gives a complete picture of all the invariant
measures.

The question of when there is at most one invariant measure for the
Markov process with transition probability π is a difficult one. If we have
a density p(x, y) with respect to a reference measure α and if for each x,
p(x, y) > 0 for almost all y with respect to α, then there can be at most one
invariant measure. We saw already that any invariant measure has a density
with respect to α. If there are at least two invariant measures, then there
are at least two ergodic ones which are orthogonal. If we denote by f_1 and

f_2 their densities with respect to α, by orthogonality we know that they are
supported on disjoint invariant sets A_1 and A_2. In particular p(x, y) = 0 for
almost all x in A_1, the support of f_1, and almost all y in A_2, with respect
to α. By our positivity assumption we must have α(A_2) = 0, which is a
contradiction.

6.4 Mixing properties of Markov Processes.


One of the questions that is important in the theory of Markov Processes
is the rapidity with which the memory of the initial state is lost. There is
no unique way of assessing it and depending on the circumstances this could
happen in many different ways at many different rates. Let π^{(n)}(x, dy) be
the n-step transition probability. The issue is how the measures π^{(n)}(x, dy)
depend less and less on x as n → ∞. Suppose we measure this dependence
by

    ρ_n = sup_{x,y∈X} sup_{A∈B} |π^{(n)}(x, A) − π^{(n)}(y, A)|

then the following is true.


Theorem 6.10. Either ρ_n ≡ 1 for all n ≥ 1, or ρ_n ≤ Cθ^n for some
0 ≤ θ < 1.
Proof. From the Chapman-Kolmogorov equations
    π^{(n+m)}(x, A) − π^{(n+m)}(y, A) = ∫ π^{(m)}(z, A) [π^{(n)}(x, dz) − π^{(n)}(y, dz)]

If f(x) is a function with |f(x) − f(y)| ≤ C and µ = µ_1 − µ_2 is the difference of
two probability measures with ‖µ‖ = sup_A |µ(A)| ≤ δ, then it is elementary
to estimate, using ∫ c dµ = 0,

    |∫ f dµ| = inf_c |∫ (f − c) dµ| ≤ 2 inf_c { sup_x |f(x) − c| } ‖µ‖ ≤ 2 (C/2) δ = Cδ
It follows that the sequence ρn is submultiplicative, i.e.
ρm+n ≤ ρm ρn
Our theorem follows from this property. As soon as some ρk = a < 1 we
have
    ρ_n ≤ [ρ_k]^{[n/k]} ≤ Cθ^n

with θ = a^{1/k}.

Although this is an easy theorem it can be applied in some contexts.


Remark 6.4. If π(x, dy) has density p(x, y) with respect to some reference
measure α, and p(x, y) ≥ q(y) ≥ 0 for all y with ∫ q(y) dα ≥ δ > 0, then it is
elementary to show that ρ_1 ≤ (1 − δ).
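For a finite chain ρ_n is half the largest ℓ_1 distance between rows of the n-step matrix, so the submultiplicativity and geometric decay of Theorem 6.10 can be observed numerically. A sketch with an illustrative 3×3 kernel whose entries are all positive (so ρ_1 < 1):

```python
# Finite-state sketch of Theorem 6.10: for a finite chain,
# rho_n = sup_{x,y} sup_A |pi^n(x, A) - pi^n(y, A)| is half the largest l1
# distance between rows of the n-step matrix. The 3x3 kernel is illustrative;
# all its entries are positive, so rho_1 < 1 and the decay is geometric.

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rho(p):
    n = len(p)
    return max(0.5 * sum(abs(p[x][k] - p[y][k]) for k in range(n))
               for x in range(n) for y in range(n))

pi = [[0.6, 0.3, 0.1],
      [0.2, 0.5, 0.3],
      [0.3, 0.3, 0.4]]
pn, rhos = pi, []
for _ in range(6):
    rhos.append(rho(pn))                # rhos[i] = rho_{i+1}
    pn = mat_mul(pn, pi)
print(rhos)                             # decreasing, roughly geometrically
```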
Remark 6.5. If ρn → 0, we can estimate
    |π^{(n)}(x, A) − π^{(n+m)}(x, A)| = | ∫ [π^{(n)}(x, A) − π^{(n)}(y, A)] π^{(m)}(x, dy) | ≤ ρ_n

and conclude from the estimate that

    lim_{n→∞} π^{(n)}(x, A) = µ(A)

exists. µ is seen to be an invariant probability measure.


Remark 6.6. In this context the invariant measure is unique. If β is another
invariant measure, then because

    β(A) = ∫ π^{(n)}(x, A) β(dx)

for every n ≥ 1,

    β(A) = lim_{n→∞} ∫ π^{(n)}(x, A) β(dx) = µ(A).

Remark 6.7. The stationary process P_µ has the property that if E ∈ F_{−∞}^m
and F ∈ F_n^∞ with a gap of k = n − m > 0, then

    P_µ[E ∩ F] = ∫_E ∫_X π^{(k)}(x_m(ω), dx) P_x(T^{−n}F) P_µ(dω)

    P_µ[E] P_µ[F] = ∫_E ∫_X µ(dx) P_x(T^{−n}F) P_µ(dω)

    P_µ[E ∩ F] − P_µ[E] P_µ[F] = ∫_E ∫_X P_x(T^{−n}F) [π^{(k)}(x_m(ω), dx) − µ(dx)] P_µ(dω)

from which it follows that

    |P_µ[E ∩ F] − P_µ[E] P_µ[F]| ≤ ρ_k P_µ(E)

proving an asymptotic independence property for P_µ.



There are situations in which we know that an invariant probability mea-
sure µ exists for π and we wish to establish that π^{(n)}(x, A) converges to µ(A)
uniformly in A for each x ∈ X but not necessarily uniformly over the starting
points x. Uniformity in the starting point is very special. We will illustrate
this by an example.
Example 6.1. The Ornstein-Uhlenbeck process is a Markov chain on the state
space X = R, the real line, with transition probability π(x, dy) given by a
Gaussian distribution with mean ρx and variance σ². It has a density p(x, y)
with respect to the Lebesgue measure, so that π(x, A) = ∫_A p(x, y) dy, where

    p(x, y) = (1/(√(2π) σ)) exp[−(y − ρx)²/(2σ²)]
It arises from the ‘auto-regressive’ representation
xn+1 = ρxn + σξn+1
where ξ1 , · · · , ξn · · · are independent standard Gaussians. The characteristic
function of any invariant measure φ(t) satisfies, for every n ≥ 1,

    φ(t) = φ(ρt) exp[−σ²t²/2] = φ(ρ^n t) exp[−(Σ_{j=0}^{n−1} ρ^{2j}) σ²t²/2]
by induction on n. Therefore

    |φ(t)| ≤ exp[−(Σ_{j=0}^{n−1} ρ^{2j}) σ²t²/2]
and this cannot be a characteristic function unless |ρ| < 1 (otherwise by
letting n → ∞ we see that φ(t) = 0 for t ≠ 0 and therefore discontinuous at
t = 0). If |ρ| < 1, by letting n → ∞ and observing that φ(ρ^n t) → φ(0) = 1,

    φ(t) = exp[−σ²t²/(2(1 − ρ²))]
The only possible invariant measure is the Gaussian with mean 0 and variance
σ²/(1 − ρ²). One can verify that this Gaussian is in fact an invariant measure.
If |ρ| < 1, a direct computation shows that π^{(n)}(x, dy) is a Gaussian with mean
ρ^n x and variance σ_n² = Σ_{j=0}^{n−1} ρ^{2j} σ² → σ²/(1 − ρ²) as n → ∞. Clearly there
is uniform convergence only over bounded sets of starting points x. This is
typical.
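The convergence in this example is easy to observe by simulation: iterate the auto-regressive recursion and compare the long-run sample variance with σ²/(1 − ρ²). A minimal sketch; the parameter values and run length are illustrative:

```python
# Simulation sketch of the Ornstein-Uhlenbeck chain: iterate
# x_{n+1} = rho x_n + sigma xi_{n+1} and compare the long-run sample variance
# with sigma^2 / (1 - rho^2). The parameter values and run length are
# illustrative.
import random

random.seed(0)
rho, sigma = 0.8, 1.0
x, samples = 10.0, []                   # start far from equilibrium on purpose
for step in range(200_000):
    x = rho * x + sigma * random.gauss(0.0, 1.0)
    if step > 1000:                     # discard the transient
        samples.append(x)
var = sum(s * s for s in samples) / len(samples)
print(var)                              # close to sigma^2/(1 - rho^2) = 2.77...
```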

6.5 Central Limit Theorem for Martingales.


If {ξ_n} is an ergodic stationary sequence of random variables with mean zero,
then we know from the ergodic theorem that the mean (ξ_1 + · · · + ξ_n)/n converges
to zero almost surely. We want to develop some methods for proving the central
limit theorem, i.e. the convergence of the distribution of (ξ_1 + · · · + ξ_n)/√n
to some Gaussian distribution with mean 0 and variance σ². Under the best of
situations, since the covariance ρ_k = E[ξ_n ξ_{n+k}] may not be 0 for all k ≠ 0,
if we assume that Σ_{−∞<j<∞} |ρ_j| < ∞, we get
    σ² = lim_{n→∞} (1/n) E[(ξ_1 + · · · + ξ_n)²]
       = lim_{n→∞} Σ_{|j|≤n} (1 − |j|/n) ρ_j
       = Σ_{−∞<j<∞} ρ_j
       = ρ_0 + 2 Σ_{j=1}^∞ ρ_j.

The standard central limit theorem with √n scaling is not likely to work
if the covariances do not decay rapidly enough to be summable. When the
covariances {ρ_k} are all 0 for k ≠ 0, the variance calculation yields σ² = ρ_0
just as in the independent case, but there is no guarantee that the central
limit theorem is valid.
A special situation is when the {ξ_j} are square integrable martingale differ-
ences. With the usual notation for the σ-fields F_m^n for m ≤ n (remember
that m can be −∞ while n can be +∞) we assume that

    E{ξ_n | F_{−∞}^{n−1}} = 0 a.e.


and in this case by conditioning we see that ρ_k = 0 for k ≠ 0. It is a useful and
important observation that in this context the central limit theorem always
holds. The distribution of Z_n = (ξ_1 + · · · + ξ_n)/√n converges to the normal
distribution with mean 0 and variance σ² = ρ_0. The proof is a fairly simple
modification of the usual proof of the central limit theorem. Let us define

    ψ(n, j, t) = exp[σ²t²j/(2n)] E[ exp(it (ξ_1 + · · · + ξ_j)/√n) ]

and write

    ψ(n, n, t) − 1 = Σ_{j=1}^n [ψ(n, j, t) − ψ(n, j − 1, t)]

leaving us with the estimation of

    ∆(n, t) = Σ_{j=1}^n [ψ(n, j, t) − ψ(n, j − 1, t)].

Theorem 6.11. For an ergodic stationary sequence {ξ_j} of square integrable
martingale differences, the central limit theorem is always valid.

Proof. We let S_j = ξ_1 + · · · + ξ_j and calculate

    ψ(n, j, t) − ψ(n, j − 1, t)
        = exp[σ²t²j/(2n)] E[ exp(it S_{j−1}/√n) ( exp(it ξ_j/√n) − exp(−σ²t²/(2n)) ) ].

We can replace this with

    θ(n, j, t) = exp[σ²t²j/(2n)] E[ exp(it S_{j−1}/√n) (σ² − ξ_j²) t²/(2n) ]

because the error can be controlled by Taylor's expansion. In fact if we use
the martingale difference property to kill the linear term, we can bound the
difference, in an arbitrary finite interval |t| ≤ T, by

    | [ψ(n, j, t) − ψ(n, j − 1, t)] − θ(n, j, t) |
        ≤ C_T E[ | exp(it ξ_j/√n) − 1 − it ξ_j/√n + t²ξ_j²/(2n) | ]
          + C_T | exp(−σ²t²/(2n)) − 1 + σ²t²/(2n) |

where C_T is a constant that depends only on T. The right hand side is
independent of j because of stationarity. By Taylor expansions in the variable
t/√n of each of the two terms on the right, it is easily seen that

    sup_{|t|≤T, 1≤j≤n} | [ψ(n, j, t) − ψ(n, j − 1, t)] − θ(n, j, t) | = o(1/n).

Therefore

    sup_{|t|≤T} Σ_{j=1}^n | [ψ(n, j, t) − ψ(n, j − 1, t)] − θ(n, j, t) | = n · o(1/n) → 0.
We now concentrate on estimating |Σ_{j=1}^n θ(n, j, t)|. We pick an integer k
which will be large but fixed. We divide [1, n] into blocks of size k with
perhaps an incomplete block at the end. We will now replace θ(n, j, t) by

    θ_k(n, j, t) = exp[σ²t²kr/(2n)] E[ exp(it S_{kr}/√n) (σ² − ξ_j²) t²/(2n) ]

for kr + 1 ≤ j ≤ k(r + 1) and r ≥ 0.


Using stationarity it is easy to estimate, for r ≤ n/k,

    | Σ_{j=kr+1}^{k(r+1)} θ_k(n, j, t) | ≤ C(t) (1/n) E[ | Σ_{j=kr+1}^{k(r+1)} (σ² − ξ_j²) | ] = C(t) (k/n) δ(k)

where δ(k) → 0 as k → ∞ by the L_1 ergodic theorem. After all {ξ_j²} is a
stationary sequence with mean σ² and the ergodic theorem applies. Since
the above estimate is uniform in r, the left over incomplete block at the end
causes no problem, and there are approximately n/k blocks, so we conclude that

    | Σ_{j=1}^n θ_k(n, j, t) | ≤ C(t) δ(k).

On the other hand, by stationarity,

    Σ_{j=1}^n |θ_k(n, j, t) − θ(n, j, t)|
        ≤ n sup_{1≤j≤n} |θ_k(n, j, t) − θ(n, j, t)|
        ≤ C(t) sup_{1≤j≤k} E[ | exp(σ²t²j/(2n)) exp(it S_{j−1}/√n) − 1 | · |σ² − ξ_j²| ]

and it is elementary to show by the dominated convergence theorem that the
right hand side tends to 0 as n → ∞ for each finite k.
This concludes the proof of the theorem.

One may think that the assumption that {ξ_n} is a martingale difference
is too restrictive to be useful. Let {X_n} be any stationary process with zero
mean. We can often succeed in writing X_n = ξ_{n+1} + η_{n+1} where ξ_n is a mar-
tingale difference and η_n is negligible, in the sense that E[(Σ_{j=1}^n η_j)²] = o(n).
Then the central limit theorem for {X_n} can be deduced from that of {ξ_n}. A
cheap way to prove E[(Σ_{j=1}^n η_j)²] = o(n) is to establish that η_n = Z_n − Z_{n+1}
for some stationary square integrable sequence {Z_n}. Then Σ_{j=1}^n η_j tele-
scopes and the needed estimate is obvious. Here is a way to construct Z_n
from X_n so that X_n + (Z_{n+1} − Z_n) is a martingale difference.
Let us define

    Z_n = Σ_{j=0}^∞ E[X_{n+j} | F_n]

There is no guarantee that the series converges, but we can always hope.
After all, if the memory is weak, prediction j steps ahead should be futile if
j is large. Therefore if X_{n+j} is becoming independent of F_n as j gets large,
one would expect E[X_{n+j} | F_n] to approach E[X_{n+j}], which is assumed to be
0. By stationarity n plays no role. If Z_0 can be defined, the shift operator T
can be used to define Z_n(ω) = Z_0(T^n ω). Let us assume that the {Z_n} exist and
are square integrable. Then

    Z_n = E[Z_{n+1} | F_n] + X_n

or equivalently

    X_n = Z_n − E[Z_{n+1} | F_n]
        = [Z_n − Z_{n+1}] + [Z_{n+1} − E[Z_{n+1} | F_n]]
        = η_{n+1} + ξ_{n+1}

where η_{n+1} = Z_n − Z_{n+1} and ξ_{n+1} = Z_{n+1} − E[Z_{n+1} | F_n]. It is easy to see
that E[ξ_{n+1} | F_n] = 0.
For a stationary ergodic Markov process {Xn } on state space (X, B),
with transition probability π(x , dy) and invariant measure µ, we can prove
the central limit theorem by this method. Let Yj = f (Xj ). Using the Markov
property we can calculate

    Z_0 = Σ_{j=0}^∞ E[f(X_j) | F_0] = Σ_{j=0}^∞ [Π^j f](X_0) = [(I − Π)^{−1} f](X_0).

If the equation [I − Π]U = f can be solved with U ∈ L_2(µ), then

    ξ_{n+1} = U(X_{n+1}) − U(X_n) + f(X_n)

is a martingale difference and we have a central limit theorem for (Σ_{j=1}^n f(X_j))/√n
with variance given by

    σ² = E^{P_µ}[ξ_0²] = E^{P_µ}[ (U(X_1) − U(X_0) + f(X_0))² ].

Exercise 6.14. Let us consider a two state Markov chain with states [1, 2].
Let the transition probabilities be given by π(1, 1) = π(2, 2) = p and π(1, 2) =
π(2, 1) = q with 0 < p, q < 1, p + q = 1. The invariant measure is
given by µ(1) = µ(2) = 1/2 for all values of p. Consider the random variable
S_n = A_n − B_n, where A_n and B_n are respectively the number of visits to the
states 1 and 2 during the first n steps. Prove a central limit theorem for S_n/√n
and calculate the limiting variance as a function σ²(p) of p. How does σ²(p)
behave as p → 0 or 1? Can you explain it? What is the value of σ²(1/2)?
Could you have guessed it?
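A Monte Carlo sketch of this exercise: estimate Var(S_n/√n) for the chain started from its invariant measure. Solving [I − Π]U = f with f(1) = 1, f(2) = −1 suggests the limiting value p/q, and the simulation is consistent with that; the derivation itself is left to the exercise.

```python
# Monte Carlo sketch for the two-state chain: estimate Var(S_n / sqrt(n)).
# Solving [I - Pi]U = f with f(1) = 1, f(2) = -1 suggests sigma^2(p) = p/q;
# the estimate below is consistent with that (the derivation is the exercise).
import random

random.seed(1)

def sample_Sn(p, n):
    """One run of S_n = A_n - B_n, started from mu = (1/2, 1/2)."""
    state = random.choice([1, 2])
    s = 0
    for _ in range(n):
        s += 1 if state == 1 else -1
        if random.random() >= p:        # switch with probability q = 1 - p
            state = 3 - state
    return s

p, n, trials = 0.25, 1000, 3000
zs = [sample_Sn(p, n) / n ** 0.5 for _ in range(trials)]
var = sum(z * z for z in zs) / trials
print(var)                              # roughly p / q = 1/3 here
```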
Exercise 6.15. Consider a random walk on the nonnegative integers with
1

 2
for all x = y ≥ 0

 1−δ for y = x + 1, x ≥ 1
4
π(x , y) = 1+δ

 for y = x − 1, x ≥ 1

14
2
for x = 0, y = 1.

Prove that the chain is positive recurrent and find the invariant measure
µ(x) explicitly. If f (x) is a function on x ≥ 0 with compact support solve
explicitly the equation [I −Π]U = f . Show that either U grows exponentially
at infinity or is a constant for large x. Show that it is a constant if and only
if Σ_x f(x)µ(x) = 0. What can you say about the central limit theorem for
Σ_{j=0}^n f(X_j) for such functions f?

6.6 Stationary Gaussian Processes.


Considering the importance of Gaussian distributions in Probability theory,
it is only natural to study stationary Gaussian processes, i.e. stationary

processes {Xn } that have Gaussian distributions as their finite dimensional


joint distributions. Since a joint Gaussian distribution is determined by its
means and covariances we need only specify E[Xn ] and Cov (Xn , Xm ) =
E[X_n X_m] − E[X_n]E[X_m]. Recall that the joint density on R^N of N Gaussian
random variables with mean m = {m_i} and covariance C = {ρ_{i,j}} is given
by

    p(y) = (1/√(2π))^N (1/√(Det C)) exp[−(1/2) ⟨(y − m), C^{−1}(y − m)⟩]
Here m is the vector of means and C −1 is the inverse of the positive definite
covariance matrix C. If C is only positive semidefinite, the Gaussian distribu-
tion lives on a lower dimensional hyperplane and is singular. By stationarity
E[Xn ] = c is independent of n and Cov (Xn , Xm ) = ρn−m can depend only on
the difference n − m. By symmetry ρk = ρ−k . Because the covariance matrix
is always positive semidefinite the sequence ρk has the positive definiteness
property

    Σ_{j,k=1}^n ρ_{j−k} z_j z̄_k ≥ 0

for all choices of n and complex numbers z1 · · · , zn . By Bochner’s theorem


(see Theorem 2.2) there exists a nonnegative measure µ on the circle that is
thought of as S = [0, 2π] with end points identified such that
    ρ_k = ∫_0^{2π} exp[ikθ] dµ(θ)

and because of the symmetry of ρk , µ is symmetric as well with respect to


θ → 2π − θ. It is convenient to assume that c = 0. One can always add
it back. Given a Gaussian process it is natural to carry out linear
operations that will leave the Gaussian character unchanged. Rather than
working with the σ-fields F_m^n, we will work with the linear subspaces H_m^n
spanned by {X_j : m ≤ j ≤ n} and the infinite spans H_n = ∨_{m≤n} H_m^n and
H^m = ∨_{n≥m} H_m^n, considered as linear subspaces of the Hilbert space
H = ∨_{m,n} H_m^n which lies inside L_2(P). But H is a small part of L_2(P),
consisting only of linear functions of {X_j}. The analog of Kolmogorov's tail
σ-field are the subspaces ∧_m H^m and ∧_n H_n, which are denoted by H^∞ and H_{−∞}.
The analog of Kolmogorov’s zero-one law would be that these subspaces are
trivial having in them only the zero function. The symmetry in ρk implies
that the processes {Xn } and {X−n } have the same underlying distributions

so that both tails behave identically. A stationary Gaussian process {Xn }


with mean 0 is said to be purely nondeterministic if the tail subspaces are
trivial.
In finite dimensional theory a covariance matrix can be diagonalized, or
better still written in the special form TT*, which gives a linear representation
of the Gaussian random variables in terms of canonical or independent stan-
dard Gaussian random variables. The point to note is that if X is standard
Gaussian with mean zero and covariance I = {δ_{i,j}}, then for any linear
transformation T, Y = TX is again Gaussian with mean zero and covariance
C = TT*. In other words if

    Y_i = Σ_k t_{i,k} X_k

then

    C_{i,j} = Σ_k t_{i,k} t_{j,k}

In fact for any C we can find a T which is upper or lower triangular, i.e. t_{i,k} = 0
for i > k or i < k. If the indices correspond to time, this can be interpreted
as a causal representation in terms of current and future or past variables
only.
The following questions have simple answers.
Q1. When does a Gaussian process have a moving average representation in
terms of independent Gaussians, i.e. a representation of the form

    X_n = Σ_{m=−∞}^∞ a_{n−m} ξ_m

with

    Σ_{n=−∞}^∞ a_n² < ∞

in terms of i.i.d. Gaussians {ξ_k} with mean 0 and variance 1?


If we have such a representation, then the covariance ρ_k is easily calculated
as the convolution

    ρ_k = Σ_j a_j a_{j+k} = [a ∗ ā](k)

and that will make {ρ_k} the Fourier coefficients of the function

    f = | Σ_j a_j e^{ijθ} |²

which is the square of the modulus of a function in L_2(S). In other words the
spectral measure µ will be absolutely continuous with a density f with respect
to the normalized Lebesgue measure dθ/2π. Conversely, if we have a µ with a
density f, its square root will be a function in L_2 and will therefore have
Fourier coefficients a_n in l_2, and a moving average representation holds in
terms of i.i.d. random variables with these weights.
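The identity ρ_k = [a ∗ ā](k) and the identification of f as the spectral density can be checked numerically for a finite weight sequence; the weights below are illustrative:

```python
# Numerical check of the computation above: rho_k = sum_j a_j a_{j+k} should
# be the k-th Fourier coefficient of f(theta) = |sum_j a_j e^{i j theta}|^2.
# The finite weight sequence is illustrative.
import cmath, math

a = [1.0, 0.5, 0.25, 0.125]                     # a_j, zero elsewhere

def rho(k):
    return sum(a[j] * a[j + k] for j in range(len(a) - k))

def fourier_coeff(k, grid=1024):
    """(1/2 pi) int_0^{2 pi} f(theta) e^{-i k theta} d theta by the equispaced
    rule, which is exact here since f is a low-degree trigonometric polynomial."""
    total = 0.0 + 0.0j
    for m in range(grid):
        theta = 2 * math.pi * m / grid
        g = sum(aj * cmath.exp(1j * j * theta) for j, aj in enumerate(a))
        total += abs(g) ** 2 * cmath.exp(-1j * k * theta)
    return (total / grid).real

print([round(rho(k), 6) for k in range(4)])
print([round(fourier_coeff(k), 6) for k in range(4)])   # the same numbers
```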
Q2. When does a Gaussian process have a representation that is causal,
i.e. of the form

    X_n = Σ_{j≥0} a_j ξ_{n−j}

with

    Σ_{j≥0} a_j² < ∞ ?

If we do have a causal representation then the remote past of the {Xk } process
is clearly part of the remote past of the {ξk } process. By Kolmogorov’s
zero-one law, the remote past for independent Gaussians is trivial and a
causal representation is therefore possible for {Xk } only if its remote past
is trivial. The converse is true as well. The subspace Hn is spanned by
Hn−1 and Xn . Therefore either Hn = Hn−1 , or Hn−1 has codimension 1 in
H_n. In the former case by stationarity H_n = H_{n−1} for every n. This in turn
implies H−∞ = H = H∞ . Assuming that the process is not identically
zero i.e. ρ0 = µ(S) > 0 this makes the remote past or future the whole
thing and definitely nontrivial. So we may assume that Hn = Hn−1 ⊕ en
where en is a one dimensional subspace spanned by a unit vector ξn . Since
all our random variables are linear combinations of a Gaussian collection
they all have Gaussian distributions. We have the shift operator U satisfying
UX_n = X_{n+1}, and we can assume without loss of generality that Uξ_n = ξ_{n+1}
for every n. If we start with X_0 in our Hilbert space,

    X_0 = a_0 ξ_0 + R_{−1}

with R_{−1} ∈ H_{−1}. We can continue and write

    R_{−1} = a_1 ξ_{−1} + R_{−2}

and so on. We will then have for every n

X0 = a0 ξ0 + a1 ξ−1 + · · · + an ξ−n + R−(n+1)

with R_{−(n+1)} ∈ H_{−(n+1)}. Since ∧_n H_{−n} = {0} we conclude that the
expansion

    X_0 = Σ_{j=0}^∞ a_j ξ_{−j}

is valid.
Q3. What are the conditions on the spectral density f in order that the
process may admit a causal representation? From our answer to Q1 we
know that we have to solve the following analytical problem. Given the
spectral measure µ with a nonnegative density f ∈ L_1(S), when can we
write f = |g|² for some g ∈ L_2(S) that admits a Fourier representation
g = Σ_{j≥0} a_j e^{ijθ} involving only nonnegative frequencies? This has the
following neat solution, which is far from obvious.

Theorem 6.12. The process determined by the spectral density f admits a
causal representation if and only if f(θ) satisfies

    ∫_S log f(θ) dθ > −∞

Remark 6.8. Notice that the condition basically prevents f from vanishing
on a set of positive measure or having very flat zeros.

The proof will use methods from the theory of functions of a complex variable.

Proof. Define

    g(θ) = Σ_{n≥0} c_n exp[inθ]

as the Fourier series of some g ∈ L_2(S). Assume c_n ≠ 0 for some n ≥ 0.


In fact we can assume without loss of generality that c0 6= 0 by removing a
suitable factor of ei k θ which will not affect |g(θ)|. Then we will show that
    (1/2π) ∫_S log |g(θ)| dθ ≥ log |c_0|.

Consider the function

    G(z) = Σ_{n≥0} c_n z^n
n≥0

as an analytic function in the disc |z| < 1. It has boundary values

    lim_{r→1} G(re^{iθ}) = g(θ)

in L_2(S). Since G is an analytic function we know, from the theory of
functions of a complex variable, that log |G(re^{iθ})| is subharmonic and has the
mean value property

    (1/2π) ∫_S log |G(re^{iθ})| dθ ≥ log |G(0)| = log |c_0|

Since G(re^{iθ}) has a limit in L²(S), the positive part of log|G|, which is dominated by |G|, is uniformly integrable. For the negative part we apply Fatou's lemma and derive our estimate.

Now for the converse. Let f ∈ L¹(S). Assume ∫_S log f(θ) dθ > −∞, or equivalently log f ∈ L¹(S). Define the Fourier coefficients
$$a_n = \frac{1}{4\pi} \int_S \log f(\theta) \exp[i n \theta]\, d\theta.$$
Because log f is integrable, the {a_n} are uniformly bounded and the power series
$$A(z) = \sum_{n \ge 0} a_n z^n$$
is well defined for |z| < 1. We define
$$G(z) = \exp[A(z)].$$
We will show that
$$\lim_{r \to 1} G(re^{i\theta}) = g(\theta)$$
exists in L²(S) and f = |g|², g being the boundary value of an analytic function in the disc. The integral condition on log f is then the necessary and sufficient condition for writing f = |g|² with g involving only nonnegative frequencies.
We compute
$$\begin{aligned}
|G(re^{i\theta})|^2 &= \exp\big[2\,\mathrm{Re}\, A(re^{i\theta})\big]\\
&= \exp\Big[2 \sum_{j=0}^{\infty} a_j r^j \cos j\theta\Big]\\
&= \exp\Big[2 \sum_{j=0}^{\infty} r^j \cos j\theta \cdot \frac{1}{4\pi}\int_S \log f(\varphi) \cos j\varphi\, d\varphi\Big]\\
&= \exp\Big[\frac{1}{2\pi}\int_S \log f(\varphi) \sum_{j=0}^{\infty} r^j \cos j\theta \cos j\varphi\, d\varphi\Big]\\
&= \exp\Big[\int_S \log f(\varphi)\, K(r,\theta,\varphi)\, d\varphi\Big]\\
&\le \int_S f(\varphi)\, K(r,\theta,\varphi)\, d\varphi.
\end{aligned}$$

Here K is the Poisson kernel for the disc,
$$K(r,\theta,\varphi) = \frac{1}{2\pi} \sum_{j=0}^{\infty} r^j \cos j\theta \cos j\varphi,$$
which is nonnegative and satisfies ∫_S K(r,θ,φ) dφ = 1. The last step is a consequence of Jensen's inequality. The function
$$f_r(\theta) = \int_S f(\varphi)\, K(r,\theta,\varphi)\, d\varphi$$
converges to f as r → 1 in L¹(S) by the properties of the Poisson kernel. It is therefore uniformly integrable. Since |G(re^{iθ})|² is dominated by f_r, we get uniform integrability for |G|² as r → 1. It follows that G has a limit g in L²(S) as r → 1 and f = |g|².
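The construction in the proof can be tested numerically: take the Fourier coefficients of log f, keep half of the zero mode and the positive modes, and exponentiate. The sketch below does this with the FFT for a hypothetical density f(θ) = |1 + 0.5 e^{iθ}|², chosen because its outer function 1 + 0.5z is known in closed form; both f = |g|² and the recovery of the outer function are checked.

```python
import numpy as np

N = 512
theta = 2 * np.pi * np.arange(N) / N
f = np.abs(1 + 0.5 * np.exp(1j * theta)) ** 2   # hypothetical spectral density

# Fourier coefficients of log f: c[k] approximates (1/2pi) int log f(t) e^{-ikt} dt
c = np.fft.fft(np.log(f)) / N

# keep only nonnegative frequencies: log g = c_0/2 + sum_{k>=1} c_k e^{ik theta}
w = np.zeros(N)
w[0] = 0.5
w[1:N // 2] = 1.0
g = np.exp(N * np.fft.ifft(w * c))

assert np.allclose(np.abs(g) ** 2, f, atol=1e-8)                 # f = |g|^2
assert np.allclose(g, 1 + 0.5 * np.exp(1j * theta), atol=1e-8)   # recovers the outer function
```

The same recipe works for any strictly positive density; when f has zeros on S the coefficients of log f decay slowly and the numerics degrade, which reflects the integrability condition of the theorem.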

One of the issues in the theory of time series is that of prediction. We have a stochastic process {X_n} that we have observed for times n ≤ −1, and we want to predict X_0. The best predictor is E^P[X_0 | F_{-1}], or in the Gaussian linear context it is the computation of the projection of X_0 onto H_{-1}. If we have a moving average representation, even a causal one, while it is true that X_j is spanned by {ξ_k : k ≤ j}, the converse may not be true. If the two spans were the same, then the best predictor for X_0 would be just
$$\hat{X}_0 = \sum_{j \ge 1} a_j \xi_{-j},$$
obtained by dropping one term in the original representation. In fact, in answering Q2 the construction yielded a representation with this property. The quantity |a_0|² is then the prediction error. In any case it is a lower bound.
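For a causal moving average this is easy to check by simulation. The sketch below uses a hypothetical MA(1) process X_n = ξ_n + 0.5 ξ_{n−1}, for which a_0 = 1, so the one-step prediction error should be |a_0|² = 1.

```python
import numpy as np

rng = np.random.default_rng(1)
xi = rng.normal(size=200001)          # standard i.i.d. Gaussian innovations
X = xi[1:] + 0.5 * xi[:-1]            # causal MA(1): a_0 = 1, a_1 = 0.5
Xhat = 0.5 * xi[:-1]                  # predictor obtained by dropping the a_0 xi_n term
err = X - Xhat
assert abs(np.var(err) - 1.0) < 0.02  # prediction error is |a_0|^2 = 1
```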
Q4. What is the value of the prediction error and how do we actually find the predictor?

The situation is somewhat muddled. Let us assume that we have a purely nondeterministic process, i.e. a process with a spectral density satisfying ∫_S log f(θ) dθ > −∞. Then f can be represented as
$$f = |g|^2$$
with g ∈ H₂, where by H₂ we denote the subspace of functions in L²(S) that are boundary values of analytic functions in the disc |z| < 1, or equivalently functions g ∈ L²(S) with only nonnegative frequencies. For any such g, we have an analytic function
$$G(z) = G(re^{i\theta}) = \sum_{n \ge 0} a_n r^n e^{in\theta}.$$
For any choice of g ∈ H₂ with f = |g|², we have
$$|G(0)|^2 = |a_0|^2 \le \exp\Big[\frac{1}{2\pi}\int_S \log f(\theta)\, d\theta\Big]. \tag{6.1}$$
There is a choice of g, constructed in the proof of the theorem, for which
$$|G(0)|^2 = \exp\Big[\frac{1}{2\pi}\int_S \log f(\theta)\, d\theta\Big]. \tag{6.2}$$
The prediction error σ²(f), which depends only on f and not on the choice of g, also satisfies
$$\sigma^2(f) \ge |G(0)|^2 \tag{6.3}$$
for every choice of g ∈ H₂ with f = |g|². There is a choice of g such that
$$\sigma^2(f) = |G(0)|^2. \tag{6.4}$$
Therefore from (6.1) and (6.4)
$$\sigma^2(f) \le \exp\Big[\frac{1}{2\pi}\int_S \log f(\theta)\, d\theta\Big]. \tag{6.5}$$
On the other hand, from (6.2) and (6.3)
$$\sigma^2(f) \ge \exp\Big[\frac{1}{2\pi}\int_S \log f(\theta)\, d\theta\Big]. \tag{6.6}$$
We therefore have the exact formula
$$\sigma^2(f) = \exp\Big[\frac{1}{2\pi}\int_S \log f(\theta)\, d\theta\Big] \tag{6.7}$$
for the prediction error.
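Formula (6.7) says the prediction error is the geometric mean of the spectral density. A quick numerical sanity check, using the hypothetical MA(1) density f(θ) = |1 + 0.5e^{iθ}|², whose innovation variance is 1:

```python
import numpy as np

theta = 2 * np.pi * np.arange(4096) / 4096
f = np.abs(1 + 0.5 * np.exp(1j * theta)) ** 2
# Riemann-sum version of exp[(1/2pi) int_S log f(theta) dtheta]
sigma2 = np.exp(np.mean(np.log(f)))
assert abs(sigma2 - 1.0) < 1e-10
```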


As for the predictor, it is not quite that simple. In principle it is a limit of linear combinations of {X_j : j ≤ 0} and may not always have a simple concrete representation. But we can understand it a little better. Let us consider the spaces H and L²(S; µ) of square integrable functions on S with respect to the spectral measure µ. There is a natural isomorphism between the two Hilbert spaces if we map
$$\sum a_j X_j \longleftrightarrow \sum a_j e^{ij\theta}.$$
The problem then is the question of approximating e^{iθ} in L²(S; µ) by linear combinations of {e^{ijθ} : j ≤ 0}. We have already established that the error, which is nonzero in the purely nondeterministic case, i.e. when dµ = (1/2π) f(θ) dθ for some f ∈ L¹(S) satisfying
$$\int_S \log f(\theta)\, d\theta > -\infty,$$
is given by
$$\sigma^2(f) = \exp\Big[\frac{1}{2\pi}\int_S \log f(\theta)\, d\theta\Big].$$
We now want to find the best approximation.

In order to get at the predictor we have to make a very special choice of the representation f = |g|². Simply demanding g ∈ L²(S) will not even give causal representations. Demanding g ∈ H₂ will always give us a causal representation, but there are too many of these. If we multiply G(z) by an analytic function V(z) that has boundary values v(θ) satisfying |v(θ)| = |V(e^{iθ})| ≡ 1 on S, then gv is another choice. If we demand that
$$|G(0)|^2 = \exp\Big[\frac{1}{2\pi}\int_S \log f(\theta)\, d\theta\Big], \tag{6.8}$$
there is at least one choice that will satisfy it. There is still an ambiguity, albeit a trivial one, among these, for we can always multiply g by a complex number of modulus 1 and that will not change anything of consequence. We have the following theorem.
Theorem 6.13. The representation f = |g|² with g ∈ H₂, and satisfying (6.8), is unique to within a multiplicative constant of modulus 1. In other words, if f = |g₁|² = |g₂|² with both g₁ and g₂ satisfying (6.8), then g₁ = αg₂ on S, where α is a complex number of modulus 1.
Proof. Let F(re^{iθ}) = log|G(re^{iθ})|. It is a subharmonic function and
$$\lim_{r \to 1} F(re^{i\theta}) = \frac{1}{2} \log f(\theta).$$
Because
$$\lim_{r \to 1} G(re^{i\theta}) = g(\theta)$$
in L²(S), the functions are uniformly integrable in r. The positive part of the logarithm F is well controlled and therefore uniformly integrable. Fatou's lemma is applicable and we should always have
$$\limsup_{r \to 1} \frac{1}{2\pi}\int_S F(re^{i\theta})\, d\theta \le \frac{1}{4\pi}\int_S \log f(\theta)\, d\theta.$$
But because F is subharmonic, its average value on a circle of radius r around 0 is nondecreasing in r, and the lim sup is the same as the sup. Therefore
$$F(0) \le \sup_{0 \le r < 1} \frac{1}{2\pi}\int_S F(re^{i\theta})\, d\theta = \limsup_{r \to 1} \frac{1}{2\pi}\int_S F(re^{i\theta})\, d\theta \le \frac{1}{4\pi}\int_S \log f(\theta)\, d\theta.$$
Since we have equality at both ends, that implies a lot of things. In particular F is harmonic and is represented via the Poisson integral in terms of its boundary value ½ log f. In particular G has no zeros in the disc. Obviously F is uniquely determined by log f, and by the Cauchy-Riemann equations the imaginary part of log G is determined up to an additive constant. Therefore the only ambiguity in G is a multiplicative constant of modulus 1.

Given the process {X_n} with trivial tail subspaces, we saw earlier that it has a representation
$$X_n = \sum_{j=0}^{\infty} a_j \xi_{n-j}$$
in terms of standard i.i.d. Gaussians, and from the construction we also know that ξ_n ∈ H_n for each n. In particular ξ_0 ∈ H_0 and can be approximated by linear combinations of {X_j : j ≤ 0}. Let us suppose that h(θ) represents ξ_0 in L²(S; f). We know that h(θ) is in the linear span of {e^{ijθ} : j ≤ 0}. We want to find the function h. If ξ_0 ↔ h, then by the nature of the isomorphism ξ_n ↔ e^{inθ} h and
$$1 = \sum_{j=0}^{\infty} a_j e^{-ij\theta} h(\theta)$$
is an orthonormal expansion in L²(S; f). Also, if we denote
$$G(z) = \sum_{j=0}^{\infty} a_j z^j,$$
then the boundary function g(θ) = lim_{r→1} G(re^{iθ}) has the property
$$g(-\theta)\, h(\theta) = 1,$$
and so
$$h(\theta) = \frac{1}{g(-\theta)}.$$
Since the function G that we constructed has the property
$$|G(0)|^2 = |a_0|^2 = \sigma^2(f) = \exp\Big[\frac{1}{2\pi}\int_S \log f(\theta)\, d\theta\Big],$$
it is the canonical choice determined earlier, to within a multiplicative constant of modulus 1. The predictor is then clearly represented by the function
$$\hat{1}(\theta) = 1 - a_0 h(\theta) = 1 - \frac{a_0}{g(-\theta)}.$$
Example 6.2. A wide class of examples is given by densities f(θ) that are rational trigonometric polynomials of the form
$$f(\theta) = \frac{\big|\sum_j A_j e^{ij\theta}\big|^2}{\big|\sum_j B_j e^{ij\theta}\big|^2}.$$
We can always multiply by e^{ikθ} inside the absolute value and assume that
$$f(\theta) = \frac{|P(e^{i\theta})|^2}{|Q(e^{i\theta})|^2},$$
where P(z) and Q(z) are polynomials in the complex variable z. The symmetry of f under θ → −θ means that the coefficients in the polynomials have to be real. The integrability of f forces the polynomial Q not to have any zeros on the circle |z| = 1. Given any two complex numbers c and z, such that |z| = 1 and c ≠ 0,
$$|z - c| = |\bar{z} - \bar{c}| = \Big|\frac{1}{z} - \bar{c}\Big| = |1 - \bar{c}z| = |c|\,\Big|z - \frac{1}{\bar{c}}\Big|.$$
This means that in our representation for f, first we can omit factors that are powers of z, which have modulus 1 on S. Next, any factor (z − c) that contributes a nonzero root c with |c| < 1 can be replaced by c(z − 1/c̄), thus moving the root outside the disc without changing the value of f. We can therefore rewrite
$$f(\theta) = |g(\theta)|^2 \quad\text{with}\quad G(z) = \frac{P(z)}{Q(z)},$$
where the new polynomials P and Q have no roots inside the unit disc, with perhaps P alone having roots on S. Clearly
$$h(\theta) = \frac{Q(e^{i\theta})}{P(e^{i\theta})}.$$
If P has no roots on S, we have a nice convergent power series for Q/P with a radius of convergence larger than 1, and we are in a very good situation. If P = 1, we are in an even better situation, with the predictor expressed as a finite sum. If P has a root on S, then it could be a little bit of a mess, as the next exercise shows.
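The root-flipping identity used above is easy to verify numerically; the root c below is a hypothetical choice inside the unit disc.

```python
import numpy as np

# for |z| = 1 and |c| < 1: |z - c| = |c| |z - 1/conj(c)|,
# so a root inside the disc can be moved outside without changing |P| on S
c = 0.3 - 0.4j                       # hypothetical root, |c| = 0.5 < 1
theta = np.linspace(0, 2 * np.pi, 7)
z = np.exp(1j * theta)
lhs = np.abs(z - c)
rhs = np.abs(c) * np.abs(z - 1 / np.conj(c))
assert np.allclose(lhs, rhs)
```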
Exercise 6.16. Assume that we have a representation of the form
$$X_n = \xi_n - \xi_{n-1}$$
in terms of standard i.i.d. Gaussians. How will you predict X_1 based on {X_j : j ≤ 0}?
Exercise 6.17. An autoregressive scheme is a representation of the form
$$X_n = \sum_{j=1}^{k} a_j X_{n-j} + \sigma \xi_n,$$
where ξ_n is a standard Gaussian independent of {(X_j, ξ_j) : j ≤ n−1}. In other words, the predictor
$$\hat{X}_n = \sum_{j=1}^{k} a_j X_{n-j}$$
and the prediction error σ² are specified for the model. Can you always find a stationary Gaussian process {X_n} with spectral density f(θ) that is consistent with the model?
Chapter 7

Dynamic Programming and Filtering.

7.1 Optimal Control.
Optimal control or dynamic programming is a useful and important concept in the theory of Markov processes. We have a state space X and a family π_α of transition probability functions indexed by a parameter α ∈ A. The parameter α is called the control parameter and can be chosen at will from the set A. The choice is allowed to vary over time, i.e. α_j can be the parameter of choice for the transition from x_j at time j to x_{j+1} at time j+1. The choice can also depend on the information available up to that point, i.e. α_j can be an F_j-measurable function. Then the conditional probability P{x_{j+1} ∈ A | F_j} is given by π_{α_j(x_0,···,x_j)}(x_j, A). Of course, in order for things to make sense we need to assume some measurability conditions. We have a payoff function f(x_N) and the object is to maximize E{f(x_N)} by a suitable choice of the functions {α_j(x_0,···,x_j) : 0 ≤ j ≤ N−1}. The idea (Bellman's) of dynamic programming is to define recursively (by backward induction), for 0 ≤ j ≤ N−1, the sequence of functions
$$V_j(x) = \sup_{\alpha} \int V_{j+1}(y)\, \pi_{\alpha}(x, dy) \tag{1}$$
with
$$V_N(x) = f(x),$$


as well as the sequence {α_j^*(x) : 0 ≤ j ≤ N−1} of functions that provide the supremum in (1):
$$V_j(x) = \int V_{j+1}(y)\, \pi_{\alpha_j^*(x)}(x, dy) = \sup_{\alpha} \int V_{j+1}(y)\, \pi_{\alpha}(x, dy).$$
We then have

Theorem 7.1. If the Markov chain starts from x at time 0, then V_0(x) is the best expected value of the reward. The 'optimal' control is Markovian and is provided by {α_j^*(x_j)}.

Proof. It is clear that if we pick the control as α_j^* then we have an inhomogeneous Markov chain with transition probability
$$\pi_{j,j+1}(x, dy) = \pi_{\alpha_j^*(x)}(x, dy),$$
and if we denote by P_x^* the process corresponding to it that starts from the point x at time 0, we can establish by induction that
$$E^{P_x^*}\{f(x_N) | \mathcal{F}_{N-j}\} = V_{N-j}(x_{N-j})$$
for 1 ≤ j ≤ N. Taking j = N, we obtain
$$E^{P_x^*}\{f(x_N)\} = V_0(x).$$
To show that V_0(x) is optimal: for any admissible (not necessarily Markovian) choice of controls, if P is the measure on F_N corresponding to a starting point x, then
$$E^P\{V_{j+1}(x_{j+1}) | \mathcal{F}_j\} \le V_j(x_j),$$
and it now follows that
$$E^P\{f(x_N)\} \le V_0(x).$$
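The theorem can be illustrated on a toy problem: a hypothetical two-state chain with two controls, where the backward recursion (1) is compared against exhaustive enumeration of all Markov policies.

```python
import itertools
import numpy as np

# pi[a] is the transition matrix of the chain under control a (hypothetical values)
pi = {0: np.array([[0.9, 0.1], [0.4, 0.6]]),
      1: np.array([[0.2, 0.8], [0.7, 0.3]])}
f = np.array([0.0, 1.0])    # terminal payoff f(x_N)
N = 3

# Bellman backward recursion: V_j(x) = max_a sum_y pi_a(x, y) V_{j+1}(y)
V = f.copy()
for _ in range(N):
    V = np.maximum(pi[0] @ V, pi[1] @ V)

# brute force over all Markov policies alpha_j : state -> action
best = -np.inf
for policy in itertools.product([0, 1], repeat=2 * N):
    W = f.copy()
    for j in reversed(range(N)):
        a0, a1 = policy[2 * j], policy[2 * j + 1]
        W = np.array([pi[a0][0] @ W, pi[a1][1] @ W])
    best = max(best, W[0])          # chain started in state 0

assert abs(V[0] - best) < 1e-12     # dynamic programming attains the optimum
```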

Exercise 7.1. The problem could be modified by making the reward function equal to
$$E^P\Big\{\sum_{j=0}^{N} f_j(\alpha_{j-1}, x_j)\Big\},$$
thereby incorporating the cost of control into the reward function. Work out the recursion formula for the optimal reward in this case.
7.2 Optimal Stopping.


A special class of optimization problems is called optimal stopping problems. We have a Markov chain with transition probability π(x, dy) and time runs from 0 to N. We have the option to stop at any time based on the history up to that time. If we stop at time k in the state x, the reward is f(k, x). The problem then is to maximize E_x{f(τ, x_τ)} over all stopping times 0 ≤ τ ≤ N. If V(k, x) is the optimal reward when the game starts from x at time k, the best we can do starting from x at time k−1 is to earn a reward of
$$V(k-1, x) = \max\Big[f(k-1, x),\ \int V(k, y)\, \pi(x, dy)\Big].$$
Starting with V(N, x) = f(N, x), by backwards induction we can get V(j, x) for 0 ≤ j ≤ N. The optimal stopping rule is given by
$$\bar{\tau} = \inf\{k : V(k, x_k) = f(k, x_k)\}.$$
Theorem 7.2. For any stopping time τ with 0 ≤ τ ≤ N,
$$E_x\{f(\tau, x_\tau)\} \le V(0, x)$$
and
$$E_x\{f(\bar{\tau}, x_{\bar{\tau}})\} = V(0, x).$$

Proof. Because
$$V(k, x) \ge \int V(k+1, y)\, \pi(x, dy),$$
we conclude that V(k, x_k) is a supermartingale, and an application of Doob's stopping theorem proves the first claim. On the other hand, if V(k, x) > f(k, x), we have
$$V(k, x) = \int V(k+1, y)\, \pi(x, dy),$$
and this means V(τ̄ ∧ k, x_{τ̄∧k}) is a martingale, which establishes the second claim.
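As a concrete illustration (a hypothetical example, not from the text): stop an i.i.d. Uniform(0, 1) sequence X_1, ..., X_N so as to maximize E[X_τ]. Here f(k, x) = x, and the backward recursion reduces to V(k−1) = E[max(X, V(k))] = (1 + V(k)²)/2; a Monte Carlo run following the rule τ̄ recovers the computed value V(0).

```python
import numpy as np

N = 10
v = 0.0
thresholds = []                      # continuation values, built backwards
for _ in range(N):
    thresholds.append(v)             # stop at a draw iff it beats the value of continuing
    v = (1 + v * v) / 2              # E[max(X, v)] for X ~ Uniform(0, 1)
thresholds.reverse()                 # thresholds[k] = value of continuing after draw k

rng = np.random.default_rng(2)
payoff = 0.0
trials = 100_000
for _ in range(trials):
    xs = rng.random(N)
    for k in range(N):
        if xs[k] >= thresholds[k] or k == N - 1:
            payoff += xs[k]
            break

assert abs(payoff / trials - v) < 0.01   # Monte Carlo under the rule matches V(0)
```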
Example 7.1. (The Secretary Problem.) An interesting example is the following game. We have a lottery with N tickets. Each ticket has a number on it. The numbers a_1, ···, a_N are distinct, but the player has no idea of what they are. The player draws a ticket at random and looks at the number. He can either keep the ticket or reject it. If he rejects it, he can draw another ticket from the remaining ones and again decide if he wants to keep it. The information available to him is the numbers on the tickets he has so far drawn and discarded, as well as the number on the last ticket that he has drawn and is holding. If he decides to keep the ticket at any stage, then the game ends and that is his ticket. Of course, if he continues on till the end, rejecting all of them, he is forced to keep the last one. The player wins only if the ticket he keeps is the one that has the largest number written on it. He cannot go back and claim a ticket that he has already rejected, and he cannot pick a new one unless he rejects the one he is holding. Assuming that the draws are random at each stage, how can the player maximize the probability of winning? How small is this probability?
It is clear that the strategy to pick the first or the last or any fixed draw has probability 1/N of winning. It is not a priori clear that the probability p_N of winning under the optimal strategy remains bounded away from 0 for large N. It seems unlikely that any strategy can pick the winner with significant probability for large values of N. Nevertheless the following simple strategy shows that
$$\liminf_{N \to \infty} p_N \ge \frac{1}{4}.$$
Let half the draws go by, no matter what, and then pick the first one which is the highest among the tickets drawn up to the time of the draw. If the second best has already been drawn and the best is still to come, this strategy will succeed. This has probability nearly 1/4. In fact the strategy works if the k best tickets have not been seen during the first half, the (k+1)-th has been, and among the k best the highest shows up first in the second half. The probability for this is about 1/(k 2^{k+1}), and as these are disjoint events
$$\liminf_{N \to \infty} p_N \ge \sum_{k \ge 1} \frac{1}{k\, 2^{k+1}} = \frac{1}{2} \log 2.$$
If we decide to look at the first Nx tickets rather than N/2, the lower bound becomes x log(1/x), and an optimization over x leads to x = 1/e and the resulting lower bound
$$\liminf_{N \to \infty} p_N \ge \frac{1}{e}.$$
We will now use the method of optimal stopping to decide on the best strategy for every N and show that the procedure we described is about the best. Since the only thing that matters is the ordering of the numbers, the numbers themselves have no meaning. Consider a Markov chain with two states 0 and 1. The player is in state 1 if he is holding the largest ticket so far. Otherwise he is in state 0. If he is in state 1 and stops at stage k, i.e. when k tickets have been drawn, the probability of his winning is easily calculated to be k/N. If he is in state 0, he has to go on, and the probability of landing on 1 at the next step is calculated to be 1/(k+1). If he is at 1 and decides to play on, the probability is still 1/(k+1) of landing on 1 at the next stage. The problem reduces to optimal stopping for a sequence X_1, X_2, ···, X_N of independent random variables with P{X_i = 1} = 1/(i+1), P{X_i = 0} = i/(i+1), and a reward function of f(i, 1) = i/N, f(i, 0) = 0. Let us define recursively the optimal probabilities
$$V(i, 0) = \frac{1}{i+1}\, V(i+1, 1) + \frac{i}{i+1}\, V(i+1, 0)$$
and
$$V(i, 1) = \max\Big[\frac{i}{N},\ \frac{1}{i+1}\, V(i+1, 1) + \frac{i}{i+1}\, V(i+1, 0)\Big] = \max\Big[\frac{i}{N},\ V(i, 0)\Big].$$
It is clear what the optimal strategy is. We should always draw if we are in state 0, i.e. we are sure to lose if we stop. If we are holding a ticket that is the largest so far, we should stop provided
$$\frac{i}{N} > V(i, 0)$$
and go on if
$$\frac{i}{N} < V(i, 0).$$
Either strategy is acceptable in case of equality. Since V(i+1, 1) ≥ V(i+1, 0) for all i, it follows that V(i, 0) ≥ V(i+1, 0). There is therefore a critical k(= k_N) such that i/N ≥ V(i, 0) if i ≥ k and i/N ≤ V(i, 0) if i ≤ k. The best strategy is to wait till k tickets have been drawn, discarding every ticket,
and then pick the first one that is the best so far. The last question is the determination of k = k_N. For i ≥ k,
$$V(i, 0) = \frac{1}{i+1} \cdot \frac{i+1}{N} + \frac{i}{i+1}\, V(i+1, 0) = \frac{1}{N} + \frac{i}{i+1}\, V(i+1, 0),$$
or
$$\frac{V(i, 0)}{i} - \frac{V(i+1, 0)}{i+1} = \frac{1}{N} \cdot \frac{1}{i},$$
telling us
$$V(i, 0) = \frac{i}{N} \sum_{j=i}^{N-1} \frac{1}{j},$$
so that
$$k_N = \inf\Big\{ i : \sum_{j=i}^{N-1} \frac{1}{j} < 1 \Big\}.$$
Approximately, log N − log k_N = 1, or k_N = N/e.
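The recursion, the threshold k_N, and the limiting value 1/e can all be checked numerically:

```python
import math

N = 1000
V0, V1 = 0.0, 1.0       # V(N,0) = 0; V(N,1) = N/N = 1
kN = N
for i in range(N - 1, 0, -1):
    cont = V1 / (i + 1) + V0 * i / (i + 1)   # value of drawing again
    V0 = cont
    V1 = max(i / N, cont)
    if i / N >= cont:
        kN = i           # smallest i with i/N >= V(i,0) survives the loop
pN = V1                  # at i = 1 the player always holds the best so far

assert abs(kN - N / math.e) < 5          # threshold is close to N/e
assert abs(pN - 1 / math.e) < 0.002      # optimal winning probability near 1/e
```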

7.3 Filtering.
The problem in filtering is that there is an underlying stochastic process that we cannot observe. There is a related stochastic process 'driven' by the first one that we can observe, and we want to use our information to draw conclusions about the state of the unobserved process. A simple but extreme example is when the unobserved process does not move and remains at the same value. Then it becomes a parameter. The driven process may be a sequence of i.i.d. random variables with densities f(θ, x), where θ is the unobserved, unchanging underlying parameter. We have a sample of n independent observations X_1, ···, X_n from the common distribution f(θ, x), and our goal is then nothing other than parameter estimation. We shall take a Bayesian approach. We have a prior distribution µ(dθ) on the space of parameters Θ, and this can be modified to an 'a posteriori' distribution after the sample is observed. We have the joint distribution
$$\prod_{i=1}^{n} f(\theta, x_i)\, dx_i\, \mu(d\theta)$$
and we calculate the conditional distribution
$$\mu_n(d\theta\, |\, x_1, \cdots, x_n)$$
given x_1, ···, x_n. This is our best informed guess about the nature of the unknown parameter. We can use this information as we see fit. If we have an additional observation x_{n+1}, we need not recalculate everything; we can simply update by viewing µ_n as the new prior and calculating the posterior after the single observation x_{n+1}.
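The updating principle, that the posterior after n observations is the prior pushed through the observations one at a time, is easiest to see in a conjugate family. A hypothetical Beta-Bernoulli example (not from the text):

```python
# Beta(a, b) prior on theta, Bernoulli(theta) observations.
a, b = 2.0, 3.0
data = [1, 0, 1, 1, 0, 1]

# batch posterior from the full sample
a_batch = a + sum(data)
b_batch = b + len(data) - sum(data)

# sequential update: treat the current posterior as the new prior at each step
for x in data:
    a, b = a + x, b + 1 - x

assert (a, b) == (a_batch, b_batch)   # the two routes agree exactly
```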
We will just work out a single illustration of this, known as the Kalman-Bucy filter. Suppose the unobserved process {x_n} is a Gaussian Markov chain
$$x_{n+1} = \rho x_n + \sigma \xi_{n+1}$$
with 0 < ρ < 1, where the noise terms ξ_n are i.i.d. normally distributed random variables with mean 0 and variance 1. The observed process y_n is given by
$$y_n = x_n + \eta_n,$$
where the {η_j} are again independent standard Gaussians that are independent of the {ξ_j} as well. If we start with an initial distribution for x_0, say one that is Gaussian with mean m_0 and variance σ_0², we can compute the joint distribution of x_0, x_1 and y_1 and then the conditional distribution of x_1 given y_1. This becomes the new distribution of the state x_1 based on the observation y_1. This allows us to calculate recursively at every stage.
Let us do this explicitly now. The distribution of (x_1, y_1) is jointly normal with means (ρm_0, ρm_0), variances (ρ²σ_0² + σ², ρ²σ_0² + σ² + 1) and covariance ρ²σ_0² + σ². The posterior distribution of x_1 is again normal with mean
$$m_1 = \rho m_0 + \frac{\rho^2\sigma_0^2 + \sigma^2}{\rho^2\sigma_0^2 + \sigma^2 + 1}\,(y_1 - \rho m_0) = \frac{\rho m_0}{\rho^2\sigma_0^2 + \sigma^2 + 1} + \frac{\rho^2\sigma_0^2 + \sigma^2}{\rho^2\sigma_0^2 + \sigma^2 + 1}\, y_1$$
and variance
$$\sigma_1^2 = (\rho^2\sigma_0^2 + \sigma^2)\Big(1 - \frac{\rho^2\sigma_0^2 + \sigma^2}{\rho^2\sigma_0^2 + \sigma^2 + 1}\Big) = \frac{\rho^2\sigma_0^2 + \sigma^2}{\rho^2\sigma_0^2 + \sigma^2 + 1}.$$
The recursion for m_n keeps the same form at every stage,
$$m_n = \frac{\rho m_{n-1}}{\rho^2\sigma_{n-1}^2 + \sigma^2 + 1} + \frac{\rho^2\sigma_{n-1}^2 + \sigma^2}{\rho^2\sigma_{n-1}^2 + \sigma^2 + 1}\, y_n,$$
while after a long time the variance σ_n² has an asymptotic value σ_∞² given by the solution of
$$\sigma_\infty^2 = \frac{\rho^2\sigma_\infty^2 + \sigma^2}{\rho^2\sigma_\infty^2 + \sigma^2 + 1}.$$
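A short simulation of the filter; the parameter values ρ = 0.8, σ = 0.5 are hypothetical. After many steps the posterior variance settles at the fixed point of the recursion above.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma = 0.8, 0.5                        # hypothetical model parameters
m, s2 = 0.0, 1.0                             # prior mean m_0 and variance sigma_0^2
x = rng.normal(m, np.sqrt(s2))
for _ in range(200):
    x = rho * x + sigma * rng.normal()       # unobserved state
    y = x + rng.normal()                     # observation
    p = rho * rho * s2 + sigma * sigma       # predictive variance of the next state
    m = rho * m + p / (p + 1) * (y - rho * m)
    s2 = p / (p + 1)

q = rho * rho * s2 + sigma * sigma
assert abs(s2 - q / (q + 1)) < 1e-12         # s2 has reached the fixed point
```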
Bibliography

[1] Ahlfors, Lars V. Complex analysis. An introduction to the theory of


analytic functions of one complex variable. Third edition. International
Series in Pure and Applied Mathematics. McGraw-Hill Book Co., New
York, 1978. xi+331 pp.
[2] Dym, H.; McKean, H. P. Fourier series and integrals. Probability and
Mathematical Statistics, No. 14. Academic Press, New York-London,
1972. x+295 pp.
[3] Halmos, Paul R. Measure Theory. D. Van Nostrand Company, Inc., New
York, N. Y., 1950. xi+304 pp.
[4] Kolmogorov, A. N. Foundations of the theory of probability. Translation edited by Nathan Morrison, with an added bibliography by A. T. Bharucha-Reid. Chelsea Publishing Co., New York, 1956. viii+84 pp.
[5] Parthasarathy, K. R. An introduction to quantum stochastic calculus. Monographs in Mathematics, 85. Birkhäuser Verlag, Basel, 1992. xii+290 pp.
[6] Parthasarathy, K. R. Probability measures on metric spaces. Probability
and Mathematical Statistics, No. 3 Academic Press, Inc., New York-
London 1967 xi+276 pp.
[7] Royden, H. L. Real analysis. Third edition. Macmillan Publishing Com-
pany, New York, 1988. xx+444 pp.
[8] Stroock, Daniel W.; Varadhan, S. R. Srinivasa Multidimensional diffu-
sion processes. Grundlehren der Mathematischen Wissenschaften [Fun-
damental Principles of Mathematical Sciences], 233. Springer-Verlag,
Berlin-New York, 1979. xii+338 pp.

Index

σ-field, 9 uniqueness theorem, 34


Chebychev, 55
accompanying laws, 78 Chebychev’s inequality, 55
compound Poisson distribution, 77
Bellman, 213 conditional expectation, 101, 109
Berry, 97 Jensen’s inequality, 110
Berry-Esseen theorem, 97 conditional expectation, 101, 109
binomial distribution, 31 regular version, 113
Birkhoff, 179, 182 conditioning, 101
ergodic theorem of, 179, 182 continuity theorem, 39
Bochner, 32, 45, 49, 200 control, 213
theorem of, 32, 45 convergence
for the circle, 49 almost everywhere, 17
Borel, 58 in distribution, 38
Borel-Cantelli lemma, 58 in law, 38
bounded convergence theorem, 19 in probability, 17
branching process, 142 convolution, 53
Bucy, 219 countable additivity, 9
covariance, 29
Cantelli, 58
covariance matrix, 29
Caratheodory
Cramér, 39
extension theorem, 11
Cauchy, 35 degenerate distribution, 31
Cauchy distribution, 35 Dirichlet, 33
central limit theorem, 71 Dirichlet integral, 33
central limit theorem disintegration theorem, 115
under mixing , 198 distribution
change of variables, 23 joint, 24
Chapman, 117 of a random variable, 24
Chapman-Kolmogorov equations, 117 distribution function, 13
characteristic function, 31 dominated convergence theorem, 21


Doob, 151, 152, 157, 161, 164 causal representation of, 200
decomposition theorem of, 157 moving average representation
inequality of, 152 of , 200
inequality of , 151 prediction of , 205
stopping theorem of , 161 prediction error of, 205
upcrossing inequality of, 164 predictor of, 205
double integral, 27 rational spectral density, 210
dynamic programming, 213 spectral density of , 200
spectral measure of , 200
ergodic invariant measure, 184 generating function, 36
ergodic process, 184 geometric distribution, 34
extremality of, 185
ergodic theorem, 179 Hahn, 104
almost sure, 182 Hahn-Jordan decomposition, 104
almost sure , 179
maximal, 182 independent events, 51
mean, 179 independent random variables, 51
ergodicity, 184 indicator function, 15
Esseen, 97 induced probability measure, 23
exit probability, 170 infinitely divisible distributions, 83
expectation, 28 integrable functions, 21
exponential distribution, 35 integral, 14, 15
two sided, 35 invariant measures, 179
extension theorem, 11 inversion theorem, 34
irrational rotations, 187
Fatou, 20
Fatou’s lemma, 20 Jensen, 110
field, 8 Jordan, 104
σ-field generated by, 10
filter, 219 Kalman, 219
finite additivity, 9 Kalman-Bucy filter, 219
Fubini, 27 Khintchine, 89
Fubini’s theorem, 27 Kolmogorov, 7, 59, 62, 66, 67, 70,
117
gamma distribution, 35 consistency theorem of, 59, 61
Gaussian distribution, 35 inequality of, 62
Gaussian process, 200 one series theorem of, 66
stationary, 200 three series theorem of, 67
autoregressive schemes, 211 two series theorem of, 66

zero-one law of, 70 stationary, 188


Markov property, 119
Lévy, 39, 63, 86, 89 strong, 123
inequality of, 63 martingale difference, 150
theorem of, 63 martingale transform, 165
Lévy measures, 86 martingales, 149
Lévy-Khintchine representation , 89 almost sure convergence of, 155,
law of large numbers 158
strong, 61 central limit theorem for, 196
weak, 55 convergence theorem, 154
law of the iterated logarithm, 93 sub-, 151
Lebesgue, 13 super-, 151
extension theorem, 13 maximal ergodic inequality, 183
Lindeberg, 72, 76 mean, 28
condition of, 72 measurable function, 15
theorem of, 72 measurable space, 22
Lipschitz, 108 moments, 33, 36
Lipschitz condition, 108 generating function, 36
Lyapunov, 76 uniqueness from, 36
condition of, 76 monotone class, 9, 12
monotone convergence theorem, 20
mapping, 22
Markov, 117 negative binomial distribution, 34
chain, 117 Nikodym, 105
process, 117 normal distribution, 35
homogeneous , 117
optimal control, 213
Markov chain
optimal stopping, 215
aperiodic, 133
option pricing, 167
invariant distribution for, 122
optional stopping theorem, 161
irreducible , 124 Ornstein, 194
periodic behavior, 133 Ornstein-Uhlenbeck process, 194
stationary distribution for, 122 outer measure, 11
Markov process
invariant measures Poisson, 34, 77
ergodicity, 189 Poisson distribution, 34
invariant measures for, 188 positive definite function, 32, 45
mixing, 192 probability space, 14
reversible, 189 product σ-field, 26

product measure, 25 stationary, 117


product space, 24, 25 Tulcea, 116
theorem of, 116
queues, 136
Uhlenbeck, 194
Radon, 105 uniform distribution, 34
Radon-Nikodym uniform infinitesimality, 76
derivative, 105 uniform tightness, 43
theorem, 105 upcrossing inequality, 164
random variable, 15 urn problem, 140
random walk, 121
recurrence, 174 variance, 29
simple, 134
weak convergence, 38
transience, 174
Weierstrass, 37
recurrence, 124
factorization, 37
null, 124
positive, 124
recurrent states, 133
renewal theorem, 128
repeated integral, 27
Riemann-Stieltjes integral, 30

secretary problem, 216


signed measure, 104
simple function, 15
Stirling, 57
Stirling’s formula, 57, 71
stochastic matrix, 124
stopped σ-field, 161
stopping time, 122, 160

transformations, 22, 23
measurable, 23
measure preserving, 179
isometries from, 179
transience, 124
transient states, 133
transition operator, 169
transition probability, 117
Probability/ Limit Theorems

Final Examination

Due before Dec 19

Q1. For each n, {X_{n,j}}, j = 1, 2, ..., n, are n mutually independent random variables taking values 0 or 1 with probabilities 1 − p_{n,j} and p_{n,j} respectively, i.e.
$$P[X_{n,j} = 1] = p_{n,j} \quad\text{and}\quad P[X_{n,j} = 0] = 1 - p_{n,j}.$$
If
$$\lim_{n \to \infty}\, \sup_j\, p_{n,j} = 0,$$
then show that any limiting distribution of S_n = X_{n,1} + X_{n,2} + ··· + X_{n,n} is Poisson, and the limit exists if and only if
$$\lambda = \lim_{n \to \infty}\, [p_{n,1} + p_{n,2} + \cdots + p_{n,n}]$$
exists, in which case the limit is Poisson with parameter λ.
Q2. Is the exponential distribution with density
$$f(x) = e^{-x}\ \text{ if } x \ge 0,\qquad 0\ \text{ otherwise}$$
infinitely divisible? If it is, what is its Lévy-Khintchine representation? How about the two-sided exponential f(x) = ½ e^{−|x|}?
Q3. Let f(x) be an integrable function on [0, 1] with respect to the Lebesgue measure. For each n and j = 0, 1, ..., 2ⁿ − 1, define for j2^{−n} ≤ x ≤ (j+1)2^{−n}
$$f_n(x) = 2^n \int_{j2^{-n}}^{(j+1)2^{-n}} f(y)\, dy.$$
Show that lim_{n→∞} f_n(x) = f(x) a.e. with respect to the Lebesgue measure.
Q4. If X_1, X_2, ..., X_n, ... are independent random variables that are almost surely positive (i.e. P[X_i > 0] = 1) with E[X_i] = 1, show that
$$Z_n = X_1 X_2 \cdots X_n$$
is a martingale. What can you say about
$$\lim_{n \to \infty} Z_n = Z?$$
When is Z nonzero? Is it sufficient if
$$\prod_i E[X_i^{-a}] < \infty$$
for some a > 0? Why?
Q5. Let {X_n} be independent random variables where X_n is distributed according to a Gamma distribution with density f_n(x) given by
$$f_n(x) = \frac{\alpha_n^{p_n}}{\Gamma(p_n)}\, e^{-\alpha_n x}\, x^{p_n - 1}$$
for x ≥ 0, and 0 otherwise.

(a) Find necessary and sufficient conditions on α_n, p_n so that Σ_n X_n converges almost surely.

(b) For S_n = X_1 + X_2 + ··· + X_n compute E[S_n] and Var[S_n].

(c) When does
$$\frac{S_n - E[S_n]}{\sqrt{\mathrm{Var}[S_n]}}$$
have a limiting distribution that is the standard normal distribution?
