
Some useful results from probability theory

STA355H1S

Modes of Convergence

Note: In this section (and subsequently), all random variables are real-valued unless otherwise specified.
Convergence in probability. $\{X_n\}$ converges in probability to $X$ ($X_n \overset{p}{\to} X$) if, for each $\epsilon > 0$,
$$\lim_{n\to\infty} P(|X_n - X| > \epsilon) = 0.$$

Convergence in r-th mean. $\{X_n\}$ converges to $X$ in $r$-th mean ($X_n \overset{r}{\to} X$) if
$$\lim_{n\to\infty} E[|X_n - X|^r] = 0.$$

(This type of convergence is also known as $L^r$ convergence.)


Convergence in distribution. $\{X_n\}$ converges in distribution to $X$ ($X_n \overset{d}{\to} X$) if
$$\lim_{n\to\infty} P(X_n \le x) = P(X \le x)$$
for all $x$ where $P(X = x) = 0$.


Notes:
1. Convergence in probability of $\{X_n\}$ to $X$ means that when $n$ is sufficiently large, the random variable $X_n$ is well-approximated by the random variable $X$. The limiting random variable $X$ is often a constant, and so if $X_n \overset{p}{\to} X$ with $P(X = a) = 1$ then $X_n \approx a$ (with probability close to 1) for sufficiently large $n$.
2. A somewhat stronger type of convergence is almost sure convergence: $\{X_n\}$ converges almost surely to $X$ ($X_n \overset{a.s.}{\to} X$) if $X_n(\omega) \to X(\omega)$ as $n \to \infty$ for all $\omega \in A$ where $P(A) = 1$. (In other words, $P(X_n \to X) = 1$.) $X_n \overset{a.s.}{\to} X$ if, and only if, for each $\epsilon > 0$,
$$\lim_{n\to\infty} P\left(\bigcup_{k=n}^{\infty} [|X_k - X| > \epsilon]\right) = 0.$$
Almost sure convergence is important in probability theory and more advanced statistical theory, but we will not use it in this course. For completeness, we will include here a number of results that are related to almost sure convergence; some of these results turn out to be quite useful in practice.

3. It is important to note that convergence in distribution refers to convergence of probability measures (distributions) rather than of the random variables themselves. For this reason, we often write $F_n \overset{d}{\to} F$ where $F_n$ is the distribution function of $X_n$ and $F$ is the distribution function of $X$. Convergence in distribution is also sometimes referred to as weak convergence.
4. Convergence in probability, almost sure convergence, and convergence in distribution can be defined if $\{X_n\}$, $X$ take values in a metric space $(\mathcal{X}, d)$:
(a) $X_n \overset{p}{\to} X$ if $d(X_n, X) \overset{p}{\to} 0$;
(b) $X_n \overset{a.s.}{\to} X$ if $d(X_n, X) \overset{a.s.}{\to} 0$;
(c) $X_n \overset{d}{\to} X$ if $E[g(X_n)] \to E[g(X)]$ for all bounded, continuous real-valued functions $g$ defined on $\mathcal{X}$.

Proving convergence in distribution


For sequences of random variables, convergence in distribution refers to convergence of distribution functions and not of random variables per se. If {Xn } is a sequence of random
d
variables then Xn X if Fn (x) = P (Xn x) P (X x) = F (x) (as n ) for each
x at which F is continous. This condition is not always easy to check; however, there are
several sufficient (and in some cases, necessary and sufficient) conditions that can be use to
d
establish Xn X.

Characteristic functions. Define $\exp(ix) = \cos(x) + i\sin(x)$ where $i = \sqrt{-1}$. $X_n \overset{d}{\to} X$ if, and only if, $\varphi_n(t) = E[\exp(itX_n)] \to E[\exp(itX)] = \varphi(t)$ for all $t$. The functions $\{\varphi_n(t)\}$ and $\varphi(t)$ are called the characteristic functions of $\{X_n\}$ and $X$.

Moment generating functions. Define moment generating functions $m_n(t) = E[\exp(tX_n)]$ and $m(t) = E[\exp(tX)]$. If $m_n(t) \to m(t) < \infty$ for all $t \in [-\delta, \delta]$ for some $\delta > 0$ then $X_n \overset{d}{\to} X$. Similarly, if $\{X_n\}$ and $X$ are non-negative random variables then $m_n(t) \to m(t)$ for all $t \le 0$ implies that $X_n \overset{d}{\to} X$.
Density and probability functions. Suppose that $\{X_n\}$ and $X$ are continuous random variables with density functions $\{f_n(x)\}$ and $f(x)$. If $f_n(x) \to f(x)$ as $n \to \infty$ for all $x$ then $X_n \overset{d}{\to} X$. Likewise, if $\{X_n\}$ and $X$ are discrete random variables and $P(X_n = x) \to P(X = x)$ for all $x$ then $X_n \overset{d}{\to} X$.
As mentioned above, for $\{X_n\}$ and $X$ taking values in more general metric spaces, we define convergence in distribution (weak convergence) in terms of expected values of bounded continuous real-valued functions: $X_n \overset{d}{\to} X$ if (and only if) $E[g(X_n)] \to E[g(X)]$ for all bounded, continuous real-valued functions $g$. For random variables $\{X_n\}$ and $X$, it turns out to be sufficient to verify convergence of $E[g(X_n)]$ to $E[g(X)]$ for all functions $g$ belonging to a somewhat more manageable class of functions.
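As a numerical illustration of the density criterion (not part of the original notes; the choice of Student's $t$ densities is mine), the $t$ density with $n$ degrees of freedom converges pointwise to the standard normal density as $n \to \infty$, and hence $t_n \overset{d}{\to} N(0,1)$:

```python
import math

def t_density(x, n):
    """Density of Student's t distribution with n degrees of freedom."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + x * x / n) ** (-(n + 1) / 2)

def normal_density(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Largest pointwise gap |f_n(x) - f(x)| over a grid on [-4, 4];
# it shrinks as n grows, so the density criterion applies.
for n in (2, 10, 100):
    gap = max(abs(t_density(x / 10, n) - normal_density(x / 10))
              for x in range(-40, 41))
    print(n, round(gap, 5))
```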

Relationships between modes of convergence

Almost sure convergence requires that $\{X_n\}$ and $X$ be defined on a common probability space. The same is true in general for convergence in probability and convergence in r-th mean except when the limit $X$ is a constant; in this case, the definitions of $\overset{p}{\to}$ and $\overset{L^r}{\to}$ do not require the $X_n$'s to be defined on the same probability space.
The following relationships between modes of convergence hold.
(a) $X_n \overset{a.s.}{\to} X$ implies that $X_n \overset{p}{\to} X$.

(b) $X_n \overset{p}{\to} X$ implies that $X_n \overset{d}{\to} X$.

(c) $X_n \overset{r}{\to} X$ implies that $X_n \overset{p}{\to} X$.

(d) $X_n \overset{d}{\to} X$ implies that $X_n \overset{p}{\to} X$ if $X$ is a constant with probability 1.

Note that neither $X_n \overset{p}{\to} X$ nor $X_n \overset{a.s.}{\to} X$ implies that $X_n \overset{r}{\to} X$; this seems reasonable since the expectation $E[|X_n - X|^r]$ need not be finite for any $n$. (See section 3 for more details.) It is easy to see that almost sure convergence is much stronger than either convergence in probability or convergence in distribution. If $X_n \overset{p}{\to} X$, then it is always possible to find a subsequence $\{X_{n_k}\}$ such that $X_{n_k} \overset{a.s.}{\to} X$ as $n_k \to \infty$. For example, pick $n_k$ so that $P[|X_n - X| > \epsilon] \le 2^{-k}$ for $n \ge n_k$; then

$$\lim_{k\to\infty} P\left(\bigcup_{j=k}^{\infty} [|X_{n_j} - X| > \epsilon]\right) \le \lim_{k\to\infty} \sum_{j=k}^{\infty} P[|X_{n_j} - X| > \epsilon] \le \lim_{k\to\infty} 2^{-k+1} = 0$$
for each $\epsilon > 0$, and so $X_{n_k} \overset{a.s.}{\to} X$ by the criterion in note 2 above. While no similar relationship exists for convergence in distribution, the following result gives an interesting connection between convergence in distribution and almost sure convergence.
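A standard counterexample (not spelled out in the notes) showing that $\overset{p}{\to}$ does not imply $\overset{r}{\to}$: let $X_n = n$ with probability $1/n$ and $X_n = 0$ otherwise. Then $X_n \overset{p}{\to} 0$, yet $E|X_n| = 1$ for every $n$, so there is no convergence in first mean. The two quantities can be computed exactly:

```python
# X_n = n with probability 1/n, and 0 otherwise.

def p_exceeds(n, eps=0.5):
    """P(|X_n - 0| > eps) = 1/n -> 0, so X_n converges in probability to 0."""
    return 1.0 / n

def first_moment(n):
    """E|X_n - 0| = n * (1/n) = 1 for every n: no convergence in 1st mean."""
    return n * (1.0 / n)

for n in (10, 1000, 10**6):
    print(n, p_exceeds(n), first_moment(n))
```

The deviation probability vanishes while the first moment stays fixed at 1: a small amount of probability escaping to infinity is exactly what breaks moment convergence.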
Skorokhod representation theorem. Suppose that $X_n \overset{d}{\to} X$. Then there exist random variables $\{X_n^*\}$ and $X^*$ defined on a common probability space such that

(a) $X_n^* \overset{d}{=} X_n$ for all $n$ and $X^* \overset{d}{=} X$ (where $\overset{d}{=}$ denotes equality in distribution);

(b) $X_n^* \overset{a.s.}{\to} X^*$.
The Skorokhod representation theorem also applies if $\{X_n\}$ and $X$ are random vectors and can be extended to elements of more general spaces (for example, spaces of random functions). The proof of the result as stated here for sequences of random variables is very simple (at least conceptually). If $F_n$ and $F$ are the distribution functions of $X_n$ and $X$, we define $F_n^{-1}$ and $F^{-1}$ to be their (left-continuous) inverses where, for example, $F^{-1}(t) = \inf\{x : F(x) \ge t\}$ for $0 < t < 1$. Then given a random variable $U$ which has a uniform distribution on $(0, 1)$, we define $X_n^* = F_n^{-1}(U)$ and $X^* = F^{-1}(U)$. It is then easy to verify that $X_n^* \overset{d}{=} X_n$, $X^* \overset{d}{=} X$, and $X_n^* \overset{a.s.}{\to} X^*$. This simple proof cannot be generalized (even to sequences of random vectors) as it takes advantage of the natural ordering of the real line.
The Skorokhod representation theorem is used almost exclusively as an analytical tool for proving convergence in distribution results. It allows one to replace sequences of random variables by sequences of numbers and thereby makes many convergence in distribution proofs completely transparent. For example, this result allows for virtually trivial proofs of both the continuous mapping theorem and the delta method stated in the Limit theorems section below.
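The quantile-coupling construction in the proof can be sketched directly. Here is a small illustration (not from the notes; the choice $X_n \sim$ Exponential$(1 + 1/n)$ and $X \sim$ Exponential$(1)$ is mine): all variables are built from a single Uniform(0,1) draw $U$ via $X_n^* = F_n^{-1}(U)$, and for that fixed $U$ the coupled values converge as numbers.

```python
import math
import random

random.seed(2)

def finv(u, rate):
    """Quantile function of Exponential(rate): F^{-1}(u) = -log(1-u)/rate."""
    return -math.log(1 - u) / rate

# Couple X_n ~ Exp(1 + 1/n) and X ~ Exp(1) through ONE uniform draw,
# exactly as in the Skorokhod construction X_n* = F_n^{-1}(U).
u = random.random()
x_star = finv(u, 1.0)           # X* = F^{-1}(U)
for n in (1, 10, 1000):
    x_n_star = finv(u, 1 + 1 / n)
    print(n, abs(x_n_star - x_star))
```

Because $F_n^{-1}(u) \to F^{-1}(u)$ at each $u$, the gap printed above shrinks deterministically along the sample path, which is the almost sure convergence in part (b).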

Op and op notation
At this point, we will introduce some very useful notation. Suppose that $\{X_n\}$ is a sequence of random variables. We say that $\{X_n\}$ is bounded in probability (or tight) and write $X_n = O_p(1)$ if for each $\epsilon > 0$, there exists $M < \infty$ such that $P(|X_n| > M) < \epsilon$ for all $n$. If $\{Y_n\}$ is another sequence of random variables (or constants) then $X_n = O_p(Y_n)$ is equivalent to $X_n/Y_n = O_p(1)$.
A complement to the $O_p$ notation is the $o_p$ notation. We say that $X_n = o_p(1)$ (as $n \to \infty$) if $X_n \overset{p}{\to} 0$ as $n \to \infty$. Likewise $X_n = o_p(Y_n)$ means that $X_n/Y_n = o_p(1)$. This $o_p$ notation is often used to denote that two sequences $\{X_n\}$ and $\{Y_n\}$ are approximately equal for sufficiently large $n$; for example, we can write $X_n = Y_n + o_p(1)$, which is equivalent to saying that $X_n - Y_n \overset{p}{\to} 0$.

Limit theorems

Continuous Mapping Theorem. Let $g$ be a continuous function. Then

1. $X_n \overset{a.s.}{\to} X$ implies $g(X_n) \overset{a.s.}{\to} g(X)$.

2. $X_n \overset{p}{\to} X$ implies $g(X_n) \overset{p}{\to} g(X)$.

3. $X_n \overset{d}{\to} X$ implies $g(X_n) \overset{d}{\to} g(X)$.

In fact, $g$ need not be everywhere continuous provided that the set of points $A$ at which $g$ is discontinuous has $P(X \in A) = 0$. Moreover, the Continuous Mapping Theorem applies for random vectors provided that $g$ is a continuous mapping from one space to the other.
The Delta Method. Let $\{a_n\}$ be a sequence of constants with $a_n \to \infty$ and suppose that $a_n(X_n - \theta) \overset{d}{\to} X$. Then if $g$ is a function which is differentiable at $\theta$,
$$a_n(g(X_n) - g(\theta)) \overset{d}{\to} g'(\theta)X.$$
The Delta Method can be extended to, for example, real-valued functions of random vectors: if $a_n(\boldsymbol{X}_n - \boldsymbol{\theta}) \overset{d}{\to} \boldsymbol{X}$ and $g$ is a real-valued function with gradient $\nabla g$ then
$$a_n(g(\boldsymbol{X}_n) - g(\boldsymbol{\theta})) \overset{d}{\to} [\nabla g(\boldsymbol{\theta})]^T \boldsymbol{X}.$$
p

Slutskys Theorem. Suppose that Xn X and Yn . Then


d

1. Xn + Yn X + .
d

2. Yn Xn X.
A variation of part 2 of Slutskys Theorem can be given using the Op /op notation
of the previous section. Specifically, if Xn = Op (1) and Yn = op (1) then Xn Yn = op (1)
p
(or, equivalently, Xn Yn 0).
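The classical application of Slutsky's Theorem is the $t$-statistic: $\sqrt{n}(\bar{X}_n - \mu)/\sigma \overset{d}{\to} N(0,1)$ and $S_n \overset{p}{\to} \sigma$, so replacing $\sigma$ by $S_n$ preserves the limit. A quick simulation sketch (not from the notes; Uniform(0,1) data and $n = 100$ are my illustrative choices):

```python
import math
import random
import statistics

random.seed(4)

def t_stat(n):
    """sqrt(n) * (sample mean - 1/2) / S for n Uniform(0,1) draws.
    By Slutsky's Theorem (S ->p sigma), this is asymptotically N(0,1)."""
    xs = [random.random() for _ in range(n)]
    return math.sqrt(n) * (statistics.mean(xs) - 0.5) / statistics.stdev(xs)

# Fraction of |T| <= 1.96 should be close to the N(0,1) value 0.95.
reps = 3000
cover = sum(abs(t_stat(100)) <= 1.96 for _ in range(reps)) / reps
print(cover)
```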
Cramér-Wold device. Suppose that $\{\boldsymbol{X}_n\}$, $\boldsymbol{X}$ are $k$-dimensional random vectors. Then $\boldsymbol{X}_n \overset{d}{\to} \boldsymbol{X}$ if, and only if, $\boldsymbol{a}^T\boldsymbol{X}_n \overset{d}{\to} \boldsymbol{a}^T\boldsymbol{X}$ for all $k$-dimensional vectors $\boldsymbol{a}$.
The Cramér-Wold device can be used to obtain a more general version of Slutsky's Theorem. Suppose that $X_n \overset{d}{\to} X$ and $Y_n \overset{p}{\to} \theta$; then $a_1 X_n + a_2 Y_n \overset{d}{\to} a_1 X + a_2\theta$ by the vanilla version of Slutsky's Theorem for any $a_1, a_2$, and so $(X_n, Y_n) \overset{d}{\to} (X, \theta)$ by the Cramér-Wold device. Thus if $g : \mathbb{R}^2 \to \mathbb{R}^p$ is a continuous function, we have $g(X_n, Y_n) \overset{d}{\to} g(X, \theta)$.
Laws of Large Numbers. Laws of large numbers give conditions under which a sample mean of random variables will converge to some appropriate population mean. The classical law of large numbers (strong and weak) is given in (a) below.
(a) Suppose that $X_1, X_2, \ldots$ are independent, identically distributed random variables with $E[|X_i|] < \infty$ and define $\mu = E(X_i)$. Then
$$\frac{1}{n}(X_1 + \cdots + X_n) \overset{a.s.}{\to} \mu$$
as $n \to \infty$.
(b) Suppose that $X_1, X_2, \ldots$ are independent random variables with $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma_i^2$. If
$$\frac{1}{n^2}\sum_{i=1}^{n} \sigma_i^2 \to 0$$
then
$$\frac{1}{n}(X_1 + \cdots + X_n) \overset{p}{\to} \mu$$
as $n \to \infty$.
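A single simulated sample path shows the law of large numbers at work. The sketch below (not from the notes; Uniform(0,1) draws with $\mu = 1/2$ are my choice) tracks the running mean along one realization, which the strong law says converges to $\mu$:

```python
import random

random.seed(5)

# One sample path of the running mean of iid Uniform(0,1) draws;
# the strong law of large numbers says it converges to mu = 1/2.
total = 0.0
path = {}
for i in range(1, 100001):
    total += random.random()
    if i in (10, 1000, 100000):
        path[i] = total / i
print(path)
```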
Central Limit Theorems. Under fairly broad conditions, sums of random variables will be approximately normally distributed. We give three cases where this holds for sums of independent random variables.
(a) If $X_1, X_2, \ldots$ are independent, identically distributed random variables with $E(X_i) = 0$ and $\mathrm{Var}(X_i) = \sigma^2$ for $i = 1, \ldots, n$ then
$$Z_n = \frac{1}{\sigma\sqrt{n}}(X_1 + \cdots + X_n) \overset{d}{\to} Z \sim N(0,1).$$
(b) If $X_1, X_2, \ldots$ are independent random variables with $E(X_i) = 0$, $\mathrm{Var}(X_i) = \sigma_i^2$ and $E[|X_i|^3] = \gamma_i < \infty$, and if
$$\frac{\gamma_1 + \cdots + \gamma_n}{(\sigma_1^2 + \cdots + \sigma_n^2)^{3/2}} \to 0$$
as $n \to \infty$, then
$$Z_n = \frac{X_1 + \cdots + X_n}{(\sigma_1^2 + \cdots + \sigma_n^2)^{1/2}} \overset{d}{\to} Z \sim N(0,1).$$
(c) If $X_1, X_2, \ldots$ are independent, identically distributed random variables with $E(X_i) = 0$, $\mathrm{Var}(X_i) = \sigma^2$, and $a_1, a_2, \ldots$ are constants satisfying the condition
$$\max_{1 \le i \le n} \frac{a_i^2}{a_1^2 + \cdots + a_n^2} \to 0 \quad \text{as } n \to \infty,$$
then
$$Z_n = \frac{a_1 X_1 + \cdots + a_n X_n}{\sigma(a_1^2 + \cdots + a_n^2)^{1/2}} \overset{d}{\to} Z \sim N(0,1).$$
Cases (b) and (c) give conditions under which sums of independent random variables are approximately normal when the summands do not have the same distribution. These conditions essentially imply that the amount of variability that each summand contributes to the overall variability is small, so that $Z_n$ is not dominated by a small number of summands.
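Case (a) can be checked by simulation even for a markedly skewed distribution. In this sketch (not from the notes; centered Exponential(1) summands, $n = 500$, and 2000 replications are my choices), $Z_n$ should look standard normal:

```python
import math
import random
import statistics

random.seed(6)

def z_n(n):
    """Z_n = (X_1 + ... + X_n) / (sigma * sqrt(n)) for centered Exp(1)
    summands: X_i = E_i - 1 has mean 0 and variance sigma^2 = 1."""
    s = sum(random.expovariate(1.0) - 1.0 for _ in range(n))
    return s / math.sqrt(n)

draws = [z_n(500) for _ in range(2000)]
# Compare with N(0,1): mean 0, sd 1, and P(Z <= 1.6449) = 0.95.
frac = sum(z <= 1.6449 for z in draws) / len(draws)
print(round(statistics.mean(draws), 2), round(statistics.stdev(draws), 2), frac)
```

Despite the heavy right skew of the summands, the standardized sum already matches the normal benchmarks closely at $n = 500$.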

Poisson convergence theorem. Suppose that $X_{n1}, \ldots, X_{nn}$ are independent random variables with $P(X_{nk} = 1) = p_{nk} = 1 - P(X_{nk} = 0)$. If
$$\sum_{k=1}^{n} p_{nk} \to \lambda > 0 \quad\text{and}\quad \max_{1 \le k \le n} p_{nk} \to 0$$
as $n \to \infty$ then
$$X_{n1} + \cdots + X_{nn} \overset{d}{\to} Y \sim \mathrm{Poisson}(\lambda).$$
Note that the conditions on the probabilities $\{p_{nk}\}$ imply that only a small number of the $\{X_{nk}\}$ can equal 1 for large $n$.
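The simplest instance takes $p_{nk} = \lambda/n$ for all $k$, so the sum is Binomial$(n, \lambda/n)$. The following exact computation (not from the notes; $\lambda = 2$ is an arbitrary choice) compares the binomial probabilities with the Poisson$(\lambda)$ probabilities as $n$ grows:

```python
import math

def binom_pmf(n, p, j):
    """P(Binomial(n, p) = j), computed exactly."""
    return math.comb(n, j) * p ** j * (1 - p) ** (n - j)

def poisson_pmf(lam, j):
    """P(Poisson(lam) = j)."""
    return math.exp(-lam) * lam ** j / math.factorial(j)

lam = 2.0
for n in (10, 100, 10000):
    # p_nk = lam/n, so sum_k p_nk = lam and max_k p_nk = lam/n -> 0.
    gap = max(abs(binom_pmf(n, lam / n, j) - poisson_pmf(lam, j))
              for j in range(10))
    print(n, round(gap, 6))
```

The largest pointwise gap shrinks roughly like $\lambda^2/n$, consistent with the theorem (convergence of probability functions is exactly the discrete criterion from the earlier section).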

Convergence of moments

It is often tempting to say that convergence of $X_n$ to $X$ (in some sense) implies convergence of moments. Unfortunately, this is not true in general; a small amount of probability in the tail of the distribution of $X_n$ can destroy the convergence of $E(X_n)$ to $E(X)$. However, if certain bounding conditions are put on the sequence $\{X_n\}$ then convergence of moments is possible.
Fatou's Lemma. Let $\{X_n\}$ be a sequence of random variables defined on a common probability space and, for each $\omega$ in the sample space, define $Y(\omega) = \liminf_{n\to\infty} |X_n(\omega)|$. Then
$$E[Y] \le \liminf_{n\to\infty} E[|X_n|].$$
(Note that if $X_n \overset{a.s.}{\to} X$ for some $X$ then $Y = |X|$; thus if $X_n \overset{d}{\to} X$, it follows from the Skorokhod representation theorem that $E(|X|) \le \liminf_{n\to\infty} E(|X_n|)$.)
Dominated convergence theorem. Suppose that $X_n \overset{a.s.}{\to} X$ and $|X_n| \le Y$ (for all $n \ge$ some $n_0$) where $E(Y) < \infty$. Then $E(X_n) \to E(X)$.
(A similar result holds if $\overset{a.s.}{\to}$ is replaced by $\overset{d}{\to}$; see below.)
Uniform integrability. A sequence $\{X_n\}$ is uniformly integrable if
$$\lim_{x\to\infty} \limsup_{n\to\infty} E[|X_n| I(|X_n| > x)] = 0.$$
The following results involve uniform integrability:

1. If $\{X_n\}$ is uniformly integrable then $\sup_n E[|X_n|] < \infty$.

2. If $\sup_n E[|X_n|^{1+\delta}] < \infty$ for some $\delta > 0$ then $\{X_n\}$ is uniformly integrable.

3. If there exists a random variable $Y$ with $E[|Y|] < \infty$ and $P(|X_n| > x) \le P(|Y| > x)$ for all $n \ge$ some $n_0$ and all $x > 0$ then $\{X_n\}$ is uniformly integrable.

4. Suppose that $X_n \overset{d}{\to} X$. If $\{X_n\}$ is uniformly integrable then $E[|X|] < \infty$ and $E(X_n) \to E(X)$. (It also follows that $E[|X_n|] \to E[|X|]$.)

Inequalities

Hölder's inequality. $E[|XY|] \le E^{1/r}[|X|^r]\, E^{1/s}[|Y|^s]$ where $r > 1$ and $r^{-1} + s^{-1} = 1$. When $r = 2$, this inequality is called the Cauchy-Schwarz inequality.
Minkowski's inequality. $E^{1/r}[|X + Y|^r] \le E^{1/r}[|X|^r] + E^{1/r}[|Y|^r]$ for $r \ge 1$.
Chebyshev's inequality. Let $g(\cdot)$ be a positive, even function which is increasing on $[0, \infty)$ (for example, $g(x) = x^2$ or $|x|$). Then for any $\epsilon > 0$,
$$P(|X| > \epsilon) \le \frac{E[g(X)]}{g(\epsilon)}.$$
(Chebyshev's inequality usually refers to the special case where $g(x) = x^2$; the more general result is sometimes called Markov's inequality.)
Jensen's inequality. If $g(\cdot)$ is a convex function and $E(X)$ exists and is finite then $g(E(X)) \le E[g(X)]$.
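These inequalities hold for any distribution, so they also hold when expectations are replaced by sample averages. A quick numerical sketch (not from the notes; Exponential(1) data and $\epsilon = 2$ are arbitrary choices) verifies the Chebyshev/Markov bound with $g(x) = x^2$ and Jensen's inequality with the convex $g(x) = x^2$ on a simulated sample:

```python
import random
import statistics

random.seed(7)
xs = [random.expovariate(1.0) for _ in range(5000)]

# Chebyshev/Markov with g(x) = x^2: P(|X| > eps) <= E[X^2] / eps^2.
eps = 2.0
lhs = sum(x > eps for x in xs) / len(xs)
rhs = statistics.fmean(x * x for x in xs) / eps ** 2
print(lhs, "<=", rhs, lhs <= rhs)

# Jensen with the convex g(x) = x^2: g(E X) <= E g(X).
print(statistics.fmean(xs) ** 2 <= statistics.fmean(x * x for x in xs))
```

Both inequalities hold term-by-term on the sample, not just in expectation: $I(x > \epsilon) \le x^2/\epsilon^2$ pointwise, and the Jensen gap is the (nonnegative) sample variance.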
