
Chapter 1

Random Variables
1.1 Elementary Examples
We will start with elementary and intuitive examples of probability. The most
well-known example is that of a fair coin: if flipped, the probabilities of getting
a head or a tail both equal $1/2$. If we perform $n$ independent tosses, then the
probability of obtaining $n$ heads is equal to $1/2^n$. In fact, let $S_n = X_1 + X_2 + \cdots + X_n$, where
\[
X_j =
\begin{cases}
1 & \text{if the result of the $j$-th trial is a head,}\\
0 & \text{if the result of the $j$-th trial is a tail.}
\end{cases}
\]
Then the probability that we get $k$ heads out of $n$ tosses is equal to
\[
\operatorname{Prob}\{S_n = k\} = \frac{1}{2^n}\binom{n}{k}.
\]
In particular, using Stirling's formula, we can calculate the asymptotics of obtaining
heads exactly half of the time:
\[
\operatorname{Prob}\{S_{2n} = n\} = \frac{1}{2^{2n}}\binom{2n}{n}
= \frac{1}{2^{2n}}\,\frac{(2n)!}{(n!)^2}
\sim \frac{1}{\sqrt{\pi n}} \to 0,
\]
as $n \to \infty$.
On the other hand, since we have a fair coin, we do expect to obtain heads
roughly half of the time, i.e.
\[
\frac{S_{2n}}{2n} \approx \frac{1}{2},
\]
for large $n$. Such a statement is indeed true and is embodied in the law of large
numbers that we discuss in the next chapter. For the moment let us simply
observe that while the probability that $S_{2n}$ equals $n$ goes to zero as $n \to \infty$,
the probability that $S_{2n}/2n$ is close to $1/2$ goes to $1$ as $n \to \infty$. More precisely, for any
$\varepsilon > 0$,
\[
\operatorname{Prob}\left\{\left|\frac{S_{2n}}{2n} - \frac{1}{2}\right| > \varepsilon\right\} \to 0,
\]
as $n \to \infty$. This is the weak law of large numbers for this particular example.
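Both statements can be checked numerically; the following is a minimal sketch (assuming NumPy is available; the sample size and the tolerance $\varepsilon = 0.05$ are arbitrary choices of ours).

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 200_000

for n in (10, 100, 1000):
    s = rng.binomial(2 * n, 0.5, size=trials)        # samples of S_{2n} for a fair coin
    exactly_half = np.mean(s == n)                   # estimates Prob{S_{2n} = n}
    close_to_half = np.mean(np.abs(s / (2 * n) - 0.5) <= 0.05)   # eps = 0.05
    # first column decays like 1/sqrt(pi n); last column approaches 1
    print(n, exactly_half, 1 / np.sqrt(np.pi * n), close_to_half)
```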
In the example of the fair coin, the number of outcomes in an experiment is
finite. In contrast, the second class of elementary examples involves a continuous
set of possible outcomes. Consider the orientation of a unit vector $\mathbf{n}$. Denote
by $S^2$ the unit sphere in $\mathbb{R}^3$. Define $f(\mathbf{n})$, $\mathbf{n} \in S^2$, as the orientation distribution
function, i.e. for $A \subset S^2$,
\[
\operatorname{Prob}\{\mathbf{n} \in A\} = \int_A f(\mathbf{n})\,d\sigma,
\]
where $d\sigma$ is the surface area element on $S^2$. If $\mathbf{n}$ does not have a preferred
orientation, i.e. it has equal probability of pointing in any direction, then
\[
f(\mathbf{n}) = \frac{1}{4\pi}.
\]
On the other hand, if $\mathbf{n}$ does have a preferred orientation, say $\mathbf{n}_0$, then we expect
$f(\mathbf{n})$ to be peaked at $\mathbf{n}_0$.
1.2 Probability Space

It is useful to put these intuitive notions of probability on a firm mathematical
basis, as was done by Kolmogorov. For this purpose, we define the notion of a
probability space, often written as a triplet $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a set called the
sample space and $\mathcal{F}$ is a collection of subsets of $\Omega$ satisfying the requirements
that

(i) $\emptyset \in \mathcal{F}$, $\Omega \in \mathcal{F}$;

(ii) if $A \in \mathcal{F}$, then $\bar{A} \in \mathcal{F}$, where $\bar{A} = \Omega \setminus A$ is the complement of $A$ in $\Omega$;

(iii) if $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{n \in \mathbb{N}} A_n \in \mathcal{F}$.

Such a collection of sets is called a $\sigma$-algebra, and the pair $(\Omega, \mathcal{F})$ is called a
measurable space. We often use $\omega$ to denote elements in $\Omega$, which we call sample
points, and each set in $\mathcal{F}$ is called an event. $P : \mathcal{F} \to [0, 1]$ is the probability
measure or, in short, the probability, defined on $\mathcal{F}$; it satisfies

(a) $P(\emptyset) = 0$, $P(\Omega) = 1$;

(b) if $A_1, A_2, \ldots \in \mathcal{F}$ are pairwise disjoint, $A_i \cap A_j = \emptyset$ if $i \neq j$, then
\[
P\Big(\bigcup_{n \in \mathbb{N}} A_n\Big) = \sum_{n \in \mathbb{N}} P(A_n).
\]
Example 1.2.1 (Fair coin). The probability space for the outcome of one trial
can be defined as follows. $\Omega = \{\text{head}, \text{tail}\}$,
\[
\mathcal{F} = \text{all subsets of } \Omega = \{\emptyset, \{\text{head}\}, \{\text{tail}\}, \Omega\},
\]
and
\[
P(\emptyset) = 0, \quad P(\{\text{head}\}) = P(\{\text{tail}\}) = \tfrac{1}{2}, \quad P(\Omega) = 1.
\]
For $n$ tosses, we can take $\Omega = \{\text{head}, \text{tail}\}^n$, $\mathcal{F} = \text{all subsets of } \Omega$, and
\[
P(A) = \frac{1}{2^n}\,\operatorname{Card}(A),
\]
where $\operatorname{Card}(A)$ is the cardinality of the set $A$.
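For small $n$ this probability space can be realized literally on a computer by enumerating $\Omega$; here is a minimal sketch (the event "at least $n-1$ heads" is just an illustrative choice of ours).

```python
from itertools import product
from fractions import Fraction

n = 3
omega = list(product(["head", "tail"], repeat=n))   # the sample space {head, tail}^n

def prob(event):
    """P(A) = Card(A) / 2^n for an event A, i.e. a subset of omega."""
    return Fraction(len(event), len(omega))

# Example event: at least n-1 heads among the n tosses.
A = [w for w in omega if w.count("head") >= n - 1]
print(prob(A))   # 1/2 for n = 3: the outcomes HHH, HHT, HTH, THH
```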
Within this framework, the standard rules of set theory are used to answer
probability questions. For instance, if both $A, B \in \mathcal{F}$, the probability that $A$ and
$B$ occur is given by $P(A \cap B)$, the probability that $A$ or $B$ occurs by $P(A \cup B)$,
the probability that $A$ but not $B$ occurs by $P(A \setminus B)$, etc. It is also elementary
to show intuitive properties like
\[
A \subset B \;\Rightarrow\; P(A) \le P(B),
\]
since $B = A \cup (B \setminus A)$, implying by (b) above that $P(B) = P(A) + P(B \setminus A) \ge P(A)$, or
\[
P(A \cup B) \le P(A) + P(B),
\]
since $A \cup B = A \cup (B \cap A^c)$, implying by (b) above that $P(A \cup B) = P(A) + P(B \cap A^c) \le P(A) + P(B)$. This inequality is known as Boole's inequality.

Another useful notion is that of independence. Two events $A, B \in \mathcal{F}$ are
independent if
\[
P(A \cap B) = P(A)P(B).
\]
This generalizes to a sequence $\{A_n\}_{n \in \mathbb{N}}$ of independent events as
\[
P\Big(\bigcap_{n \in \mathbb{N}} A_n\Big) = \prod_{n \in \mathbb{N}} P(A_n).
\]
1.3 Conditional Probability

Let $A, B \in \mathcal{F}$ and assume that $P(B) \neq 0$. Then the conditional probability of
$A$ given $B$ is defined as
\[
P(A|B) = \frac{P(A \cap B)}{P(B)}.
\]
Clearly this corresponds to the proportion of events where both $A$ and $B$ occur,
given that $B$ has occurred. For instance, the probability of obtaining two tails in
two tosses of a fair coin is $\tfrac{1}{4}$, but the conditional probability of obtaining two tails
is $\tfrac{1}{2}$ given that the first toss is a tail, and is zero given that the first toss is a
head.

One has
\[
P(\Omega|B) = 1,
\]
and therefore if $A_1, A_2, \ldots$ are disjoint sets such that $\bigcup_n A_n = \Omega$, then
\[
\sum_n P(A_n|B) = 1.
\]
Therefore
\[
P(A_j|B) = \frac{P(A_j \cap B)}{\sum_n P(A_n \cap B)}.
\]
This is known as Bayes' rule.

Since $P(A \cap B) = P(A|B)P(B)$ by definition, we also have by iteration
\[
P(A \cap B \cap C) = P(A|B \cap C)\,P(B|C)\,P(C),
\]
and so on.
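The two-toss example above can be verified by direct counting on the finite sample space; a small sketch (the helper function is our own, not part of the text):

```python
from itertools import product

omega = list(product(["head", "tail"], repeat=2))     # four equally likely outcomes

def cond_prob(A, B):
    """P(A|B) = P(A and B) / P(B), computed by counting on a uniform finite space."""
    both = [w for w in B if w in A]
    return len(both) / len(B)

two_tails  = [w for w in omega if w == ("tail", "tail")]
first_tail = [w for w in omega if w[0] == "tail"]
first_head = [w for w in omega if w[0] == "head"]

print(cond_prob(two_tails, first_tail))   # 0.5
print(cond_prob(two_tails, first_head))   # 0.0
```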
1.4 Discrete Distributions

Making $n$ tosses of a fair coin is an example of an experiment where the total
number of outcomes is finite. More generally, if the number of elements in
$\Omega = \{A_1, A_2, \ldots\}$ is finite or enumerable, and associated with each of these
events there is a numerical value $X_j$, the function $X$ such that $X(A_j) = X_j$
is called a discrete random variable. It is usually convenient to take simply
$X_j \in \{1, 2, \ldots, N\}$ in the finite case, and $X_j \in \mathbb{N}$ or $X_j \in \mathbb{Z}$ in the numerable
case, so that $X : \Omega \to \{1, 2, \ldots, N\}$, $X : \Omega \to \mathbb{N}$, or $X : \Omega \to \mathbb{Z}$, and we will
restrict to these cases. The probability distribution of $X$ is the probability that
$X$ takes the value $j$, i.e.
\[
p(j) = \operatorname{Prob}\{X = j\} = P(A_j), \quad j = 1, \ldots,
\]
and it obviously satisfies
\[
p(j) \ge 0, \qquad \sum_j p(j) = 1.
\]
(The second condition also implies that $p(j) \le 1$.) Given a function $f$ of $X$,
its expectation is given by
\[
E f(X) = \sum_i f(i)\,p(i),
\]
if the sum is well-defined. In particular, the $p$th moment of the distribution is
defined as
\[
m_p = \sum_j j^p\,p(j),
\]
if the sum is well-defined. The first moment is also called the mean of the
random variable and is denoted by $\operatorname{mean}(X)$. Another important quantity is its
variance, defined as
\[
\operatorname{var}(X) = m_2 - m_1^2 = \sum_j (j - m_1)^2\,p(j).
\]
Exercise 1.4.1. $S_n$, the number of heads in an outcome of $n$ tosses, is an example
of a random variable whose distribution is given by
\[
p(j) = \operatorname{Prob}\{S_n = j\} = \frac{1}{2^n}\binom{n}{j}.
\]
Show that the mean and the variance of this random variable are
\[
\operatorname{mean}(S_n) = \frac{n}{2}, \qquad \operatorname{var}(S_n) = \frac{n}{4},
\]
implying that $\operatorname{var}(S_n)/(\operatorname{mean}(S_n))^2 \to 0$ as $n \to \infty$.
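A numerical sanity check of these formulas, as a sketch (the values of $n$ and the sample size are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)

for n in (10, 100, 1000):
    s = rng.binomial(n, 0.5, size=200_000)        # samples of S_n for a fair coin
    print(n, s.mean(), n / 2, s.var(), n / 4)     # empirical vs. exact mean and variance
```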
Poisson distribution. This is the simplest example of the distribution of
randomly scattered points. Consider a set of randomly scattered points in the
plane. Let $A$ be a set in $\mathbb{R}^2$ and let $X_A(\omega)$ be the number of points in $A$. $X_A$ has
a Poisson distribution if
\[
P\{X_A(\omega) = n\} = \frac{\lambda^n}{n!}\,e^{-\lambda},
\]
where $\lambda$ may depend on $A$ according to the density of the points. If the points
are uniformly distributed on the plane with unit density, then $\lambda$ is equal to the area of $A$. Notice
that in this case
\[
\lambda = \operatorname{mean}(X_A) = \operatorname{var}(X_A).
\]
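A sketch of this construction (all sizes below are arbitrary choices of ours): scatter points with unit density in a large square, count how many fall in a fixed set $A$, and compare the empirical mean and variance of the count with $\lambda = \operatorname{area}(A)$.

```python
import numpy as np

rng = np.random.default_rng(2)

L = 20.0          # side of the big square; unit density => total count ~ Poisson(L^2)
lam_A = 3.0       # A = [0, 3] x [0, 1], so area(A) = 3

counts = []
for _ in range(20_000):
    n_pts = rng.poisson(L * L)                      # total number of scattered points
    pts = rng.uniform(0.0, L, size=(n_pts, 2))      # uniform positions in the square
    in_A = (pts[:, 0] < 3.0) & (pts[:, 1] < 1.0)    # points falling in A
    counts.append(in_A.sum())

counts = np.array(counts)
print(counts.mean(), counts.var(), lam_A)           # both close to lambda = area(A) = 3
```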
1.5 Continuous Distributions

Consider now the general case when $\Omega$ is not necessarily numerable. A function
$X : \Omega \to \mathbb{R}^n$ is called a random variable on the probability space $(\Omega, \mathcal{F}, P)$ if
the set $\{\omega : X(\omega) \in U\}$ is in $\mathcal{F}$ for every open set $U \subset \mathbb{R}^n$. The distribution of
the random variable $X$ is a probability measure $\mu$ on $\mathbb{R}^n$, defined for any set
$B \subset \mathbb{R}^n$ by
\[
\mu(B) = \operatorname{Prob}\{X \in B\} = P\{\omega : X(\omega) \in B\}.
\]
If there exists an integrable function $\rho(x)$ such that
\[
\mu(B) = \int_B \rho(x)\,dx,
\]
then $\rho$ is called the probability density of $X$.

Given its distribution $\mu$, the expectation of $f(X)$ is defined as
\[
E f(X) = \int_{\mathbb{R}^n} f(x)\,\mu(dx),
\]
if $f$ is continuous and the right-hand side is well-defined (i.e. $\int_{\mathbb{R}^n} |f(x)|\,\mu(dx) < \infty$).
Two important expectations are the mean of $X$,
\[
\operatorname{mean}(X) = E X,
\]
and the variance of $X$,
\[
\operatorname{var}(X) = E\big[(X - E X)(X - E X)^T\big].
\]
The expectation of $f(X)$ can also be written as
\[
E f(X) = \int_\Omega f(X(\omega))\,dP(\omega),
\]
which allows us to make a connection with $L^p$-spaces. For $p \ge 1$ let
\[
L^p = L^p(\Omega, \mathcal{F}, P) = \{X(\omega) : E|X|^p < \infty\}.
\]
Then $L^p$ is a Banach space with norm
\[
\|X\|_p = (E|X|^p)^{1/p},
\]
and $L^2$ is a Hilbert space with scalar product
\[
\langle X, Y\rangle = E(X, Y),
\]
where $(X, Y)$ denotes the standard scalar product in $\mathbb{R}^d$. In $L^2$, we have
Schwarz's inequality
\[
E|(X, Y)| \le \|X\|_2\,\|Y\|_2.
\]
More generally, we have Hölder's inequality
\[
E|(X, Y)| \le \|X\|_p\,\|Y\|_q, \quad p > 1,\ 1/p + 1/q = 1,\ X \in L^p,\ Y \in L^q.
\]
The triangle inequality in $L^p$ is called Minkowski's inequality:
\[
\|X + Y\|_p \le \|X\|_p + \|Y\|_p, \quad p \ge 1,\ X, Y \in L^p.
\]
Exercise 1.5.1 (Jensen's inequality). Let $X$ be a one-dimensional variable such
that $M(\lambda) = E e^{\lambda X} < \infty$ for some $\lambda \in \mathbb{R}$. Show that
\[
M(\lambda) \ge e^{\lambda E X}.
\]
Two random variables $X$ and $Y$ are called independent if
\[
E f(X) g(Y) = E f(X)\,E g(Y),
\]
for all continuous functions $f$ and $g$. This means that the joint distribution of
$X$ and $Y$ is simply the product measure of the distributions of $X$ and $Y$, i.e.
\[
\mu(dx, dy) = \mu_X(dx)\,\mu_Y(dy).
\]
A weaker notion is the following: $X$ and $Y$ are uncorrelated if
\[
\operatorname{cov}(X, Y) = 0,
\]
where $\operatorname{cov}(X, Y)$ is the covariance matrix of $X$ and $Y$,
\[
\operatorname{cov}(X, Y) = E(X - E X)(Y - E Y)^T.
\]
Exercise 1.5.2 (Pairwise independence does not imply independence). Let $X$, $Y$
be two independent random variables such that
\[
P(X = \pm 1) = P(Y = \pm 1) = \tfrac{1}{2},
\]
and let $Z = XY$. Check that $X$, $Y$, $Z$ are pairwise independent but not
independent, i.e.
\[
E f(X) g(Y) h(Z) = E f(X)\,E g(Y)\,E h(Z)
\]
does not hold for all continuous functions $f$, $g$, $h$.
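A small numerical illustration of this exercise, as a sketch: the pairwise joint laws are all uniform on $\{-1, 1\}^2$, while the choice $f = g = h = \mathrm{id}$ already exhibits the failure of joint independence, since $Z = XY$ forces $XYZ = X^2 Y^2 = 1$.

```python
from itertools import product

# X, Y take values +/-1 with probability 1/2 each, independently; Z = X*Y.
outcomes = [(x, y, x * y) for x, y in product([1, -1], repeat=2)]   # each has prob 1/4

def joint(a, b):
    """Joint pmf of the pair of coordinates a, b, where 0 = X, 1 = Y, 2 = Z."""
    pmf = {}
    for w in outcomes:
        key = (w[a], w[b])
        pmf[key] = pmf.get(key, 0.0) + 0.25
    return pmf

# Pairwise independent: every pair of values has probability 1/4 = (1/2)*(1/2).
for a, b in [(0, 1), (0, 2), (1, 2)]:
    print(joint(a, b))

# Not jointly independent: E[XYZ] = 1, while E[X] E[Y] E[Z] = 0.
print(sum(0.25 * x * y * z for x, y, z in outcomes))   # 1.0
```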
1.6 Examples of Continuous Distributions

We list a few important continuous distributions.

1. Uniform distribution:
\[
\rho(x) =
\begin{cases}
\dfrac{1}{\operatorname{vol}(B)} & \text{if } x \in B,\\[2pt]
0 & \text{otherwise.}
\end{cases}
\]
In one dimension, for $B = [0, 1]$ this reduces to
\[
\rho(x) =
\begin{cases}
1 & \text{if } x \in [0, 1],\\
0 & \text{otherwise.}
\end{cases}
\]
Pseudo-random numbers generated on the computer typically have this distribution.
2. Exponential distribution:
\[
\rho(x) =
\begin{cases}
0 & \text{if } x < 0,\\
\lambda e^{-\lambda x} & \text{if } x \ge 0.
\end{cases}
\]
Special cases include the distribution of the waiting time for a continuous-time Markov
process ($\lambda$ is the rate of the process) and the Boltzmann distribution
\[
f(E) =
\begin{cases}
\beta e^{-\beta E} & \text{if } E \ge 0,\\
0 & \text{if } E < 0,
\end{cases}
\]
where $\beta = 1/k_B T$, $k_B$ is the Boltzmann constant, and $T$ is the temperature.
3. Normal distribution:
\[
\rho(x) = \frac{\exp\big(-\tfrac{1}{2}(x - \bar{x})^T A^{-1} (x - \bar{x})\big)}{(2\pi)^{n/2}(\det A)^{1/2}},
\]
where $A$ is a symmetric positive definite matrix and $\det A$ is the determinant of $A$;
$\bar{x}$ is the mean, $A$ is the covariance matrix, and the random variable with the density above
is denoted by $N(\bar{x}, A)$. Such random variables are also called Gaussian random
variables. In dimension one, the density of a normal variable reduces to
\[
\rho(x) = \frac{e^{-(x - \bar{x})^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}},
\]
where $\bar{x}$ is the mean and $\sigma^2$ is the variance.
Exercise 1.6.1. Let $X = (X_1, \ldots, X_n)$ be an $n$-dimensional Gaussian random
variable and let $Y = c_1 X_1 + c_2 X_2 + \cdots + c_n X_n$, where $c_1, \ldots, c_n$ are constants.
Show that $Y$ is also Gaussian.
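A numerical check of this exercise, as a sketch (the mean vector, the covariance matrix $A$, and the coefficients $c$ below are arbitrary choices of ours): the empirical mean and variance of $Y$ should match $c \cdot \bar{x}$ and $c^T A c$, the parameters of the predicted Gaussian.

```python
import numpy as np

rng = np.random.default_rng(3)

xbar = np.array([1.0, -2.0, 0.5])                    # mean vector (arbitrary)
A = np.array([[2.0, 0.5, 0.0],                       # symmetric positive definite
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
c = np.array([0.2, -1.0, 3.0])                       # coefficients c_1, ..., c_n

X = rng.multivariate_normal(xbar, A, size=500_000)   # samples of X ~ N(xbar, A)
Y = X @ c                                            # Y = c_1 X_1 + ... + c_n X_n

print(Y.mean(), c @ xbar)                            # mean of Y vs. c . xbar
print(Y.var(),  c @ A @ c)                           # variance of Y vs. c^T A c
```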
1.7 Conditional Expectation

The concept of conditional probability extends straightforwardly to discrete random
variables. Let $X$ and $Y$ be two discrete random variables, not necessarily
independent, with joint probability
\[
p(i, j) = P(X = i, Y = j).
\]
Since $\sum_i p(i, j) = P(Y = j)$, the conditional probability that $X = i$ given that
$Y = j$ is given by
\[
p(i|j) = \frac{p(i, j)}{\sum_i p(i, j)},
\]
if $\sum_i p(i, j) > 0$, and is conventionally taken to be zero if $\sum_i p(i, j) = 0$. From this,
the natural definition of the conditional expectation of $f(X)$ given that $Y = j$
is
\[
E(f(X)\,|\,Y = j) = \sum_i f(i)\,p(i|j).
\]
A difficulty arises when one tries to generalize this concept to continuous
random variables. Indeed, given two continuous random variables $X$ and $Y$ with
joint probability distribution $\mu(dx, dy)$, the probability that $Y = y$ is zero for most
values of $y$. Therefore, the ratio involved in the definition of the conditional
probability measure is not defined. An obvious way to try to fix this problem is
to take limits, i.e. define the conditional probability of $X \in B$ given that $Y = y$
as
\[
\mu(B|y) = \lim_{\varepsilon \to 0} \frac{\mu(B, B_\varepsilon(y))}{\int_{\mathbb{R}^n} \mu(dx, B_\varepsilon(y))},
\]
where $B_\varepsilon(y)$ is the ball of radius $\varepsilon$ centered at $y$. However, requiring that
this limit exists for every $y$ turns out to be very restrictive, and it is better to
proceed differently.
This is done as follows. One says that a measure $\nu$ is absolutely continuous
with respect to a measure $\mu$, which is denoted as $\nu \ll \mu$, if for every set $B$
such that $\mu(B) = 0$, one also has $\nu(B) = 0$. It can be shown that when $\nu \ll \mu$,
there exists a function $\rho(z)$ such that for any $B$,
\[
\nu(B) = \int_B \rho(z)\,\mu(dz).
\]
The function $\rho$ is unique up to equivalence, i.e. it can be modified at most on
a set of zero measure with respect to $\mu(dz)$, and it is called the Radon-Nikodym
derivative of $\nu$ with respect to $\mu$. This is denoted as
\[
\rho = \frac{d\nu}{d\mu}.
\]
Note that within this terminology, the fact that $\mu$ has a density $\rho(z)$ means
that $\mu$ is absolutely continuous with respect to the Lebesgue measure $dz$, with
Radon-Nikodym derivative $\rho(z)$.
The conditional probability of $X \in B$ given that $Y = y$, which we denote
as $\mu(B|y)$, can be defined as the Radon-Nikodym derivative of $\mu(dx, dy)$ with
respect to $\mu_Y(B) = \int_{\mathbb{R}^n} \mu(dx, B)$ (the marginal probability of $Y$), i.e. it satisfies,
for every pair of open sets $A$, $B$,
\[
\mu(A, B) = \int_B \mu(A|y)\,\mu_Y(dy).
\]
Thus $\mu(A|y)$ is only defined up to a set of measure zero with respect to $\mu_Y(dy)$,
which is fine in practice and is why this definition is less restrictive than the one
above in terms of limits. If the pair $(X, Y)$ has a density $\rho(x, y)$, then
\[
\mu(A|y) = \int_A \frac{\rho(x, y)}{\rho_Y(y)}\,dx,
\]
where $\rho_Y(y) = \int_{\mathbb{R}^n} \rho(x, y)\,dx$ is the marginal density of $Y$.
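As an illustration (a standard computation, included here for concreteness), suppose $(X, Y)$ is a two-dimensional Gaussian vector with means $\mu_X$, $\mu_Y$, variances $\sigma_X^2$, $\sigma_Y^2$ and correlation coefficient $r$, with $|r| < 1$. Dividing the joint density by the marginal density of $Y$ and completing the square gives
\[
\frac{\rho(x, y)}{\rho_Y(y)} = \frac{1}{\sqrt{2\pi(1 - r^2)}\,\sigma_X}
\exp\left(-\frac{\big(x - \mu_X - r\tfrac{\sigma_X}{\sigma_Y}(y - \mu_Y)\big)^2}{2(1 - r^2)\sigma_X^2}\right),
\]
i.e. conditionally on $Y = y$, $X$ is again Gaussian, with mean $\mu_X + r(\sigma_X/\sigma_Y)(y - \mu_Y)$ and variance $(1 - r^2)\sigma_X^2$.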
1.8 Borel-Cantelli Lemma

Many questions in probability, like the convergence of the normalized sum of
independent variables to their mean (i.e. the law of large numbers treated below),
involve tail events. A tail event is an event defined on a sequence of events,
say $\{A_n\}_{n \in \mathbb{N}}$, which is such that its probability does not depend on the first
$A_1, \ldots, A_k$, no matter how large $k$ is. For instance, if $S_n/n$ is the proportion of
heads in $n$ tosses of a fair coin, the event that $S_n/n \to 1/2$ as $n \to \infty$ is a tail
event, because the probability that
\[
\frac{S_n}{n} = \frac{1}{n}\sum_{j=1}^{n} X_j \to \frac{1}{2}
\]
is the same as the probability that
\[
\frac{1}{n}\sum_{j=k}^{n} X_j \to \frac{1}{2}
\]
for any $k \ge 1$.
An important example of a tail event defined on $\{A_n\}_{n \in \mathbb{N}}$ is the event that the
$A_n$ occur infinitely often, i.e.
\[
\{A_n \ \text{i.o.}\} = \{\omega : \omega \in A_n \ \text{i.o.}\}
= \lim_{n \to \infty} \bigcup_{k \ge n} A_k
= \bigcap_{n \in \mathbb{N}} \bigcup_{k \ge n} A_k.
\]
The probability of such an event is specied by:
Lemma 1.8.1 (Borel-Cantelli Lemma).

1. If $\sum_{n=1}^{\infty} P(A_n) < \infty$, then $P\{A_n \ \text{i.o.}\} = 0$.

2. If the $A_n$'s are independent and $\sum_{n=1}^{\infty} P(A_n) = \infty$, then $P\{A_n \ \text{i.o.}\} = 1$.
Proof. 1. $P\big(\bigcap_{n=1}^{\infty}\bigcup_{k=n}^{\infty} A_k\big) \le P\big(\bigcup_{k=n}^{\infty} A_k\big) \le \sum_{k=n}^{\infty} P(A_k)$ for any $n$, but the
last term goes to $0$ as $n \to \infty$ since $\sum_{k=1}^{\infty} P(A_k) < \infty$ by assumption.
2. Using independence, one has
\[
P\Big(\bigcup_{k=n}^{\infty} A_k\Big) = 1 - P\Big(\bigcap_{k=n}^{\infty} A_k^c\Big)
= 1 - \prod_{k=n}^{\infty} P(A_k^c)
= 1 - \prod_{k=n}^{\infty} \big(1 - P(A_k)\big).
\]
Using $1 - x \le e^{-x}$, this gives
\[
P\Big(\bigcup_{k=n}^{\infty} A_k\Big) \ge 1 - \prod_{k=n}^{\infty} e^{-P(A_k)}
= 1 - e^{-\sum_{k=n}^{\infty} P(A_k)} = 1.
\]
Since the left-hand side is a probability, the bound must be sharp and we are
done.
As an example of application of this result, we prove

Lemma 1.8.2. Let $\{X_n\}_{n \in \mathbb{N}}$ be a sequence of identically distributed (not necessarily
independent) random variables such that $E|X_n| < \infty$. Then
\[
\lim_{n \to \infty} \frac{X_n}{n} = 0 \quad \text{a.s.}
\]
Here and below a.s. stands for almost surely, i.e. for almost all $\omega$ except
possibly a set of zero probability (see Definition 1.9.1).

Proof. For any $\varepsilon > 0$, define
\[
A_n^{\varepsilon} = \{\omega : |X_n(\omega)/n| > \varepsilon\}.
\]
Then
\[
\begin{aligned}
\sum_{n=1}^{\infty} P(A_n^{\varepsilon})
&= \sum_{n=1}^{\infty} P\{|X_n| > n\varepsilon\}
= \sum_{n=1}^{\infty} P\{|X_1| > n\varepsilon\}\\
&= \sum_{n=1}^{\infty}\sum_{k=n}^{\infty} P\{k\varepsilon < |X_1| \le (k+1)\varepsilon\}
= \sum_{k=1}^{\infty} k\,P\{k\varepsilon < |X_1| \le (k+1)\varepsilon\}\\
&= \sum_{k=1}^{\infty} k \int_{k\varepsilon < |x| \le (k+1)\varepsilon} \mu(dx)
\le \frac{1}{\varepsilon}\sum_{k=1}^{\infty} \int_{k\varepsilon < |x| \le (k+1)\varepsilon} |x|\,\mu(dx)\\
&= \frac{1}{\varepsilon}\int_{\varepsilon < |x|} |x|\,\mu(dx)
\le \frac{E|X_1|}{\varepsilon} < \infty.
\end{aligned}
\]
Therefore if we define
\[
B_{\varepsilon} = \{\omega : \omega \in A_n^{\varepsilon} \ \text{i.o.}\},
\]
then $P(B_{\varepsilon}) = 0$. Let $B = \bigcup_{n \in \mathbb{N}} B_{1/n}$. Then $P(B) = 0$, and
\[
\lim_{n \to \infty} \frac{X_n(\omega)}{n} = 0, \quad \text{if } \omega \notin B.
\]
Note that this proof relies on a special case of a useful inequality which we
state as a lemma.

Lemma 1.8.3 (Chebyshev's Inequality). Let $X$ be a random variable such
that $E|X|^k < \infty$ for some integer $k$. Then
\[
P\{|X| > \lambda\} \le \frac{1}{\lambda^k}\,E|X|^k,
\]
for any positive constant $\lambda$.
Proof. For any $\lambda > 0$,
\[
E|X|^k = \int_{\mathbb{R}^n} |x|^k\,\mu(dx)
\ge \int_{|x| \ge \lambda} |x|^k\,\mu(dx)
\ge \lambda^k \int_{|x| \ge \lambda} \mu(dx)
= \lambda^k\,P\{|X| \ge \lambda\}.
\]
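A quick numerical illustration of the inequality for $k = 2$, as a sketch (the distribution and the thresholds $\lambda$ are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=1.0, size=1_000_000)   # E|X|^2 = 2 for a rate-1 exponential

for lam in (2.0, 4.0, 8.0):
    empirical = np.mean(np.abs(x) > lam)         # P{|X| > lambda}
    bound = np.mean(x**2) / lam**2               # (1/lambda^2) E|X|^2
    print(lam, empirical, bound)                 # the bound always dominates
```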
1.9 Notions of Convergence

Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables defined on a probability space
$(\Omega, \mathcal{F}, P)$, and let $\mu_n$ be the distribution of $X_n$. Let $X$ be another random
variable with distribution $\mu$. We will discuss four notions of convergence: almost
sure convergence, convergence in probability, convergence in distribution, and
convergence in $L^p$.
Definition 1.9.1 (Almost sure convergence). $X_n$ converges to $X$ almost
surely as $n \to \infty$ if
\[
P\{\omega : X_n(\omega) \to X(\omega)\} = 1.
\]
We express almost sure convergence as $X_n \to X$ a.s.
Definition 1.9.2 (Convergence in probability). $X_n$ converges to $X$ in probability
if for any $\varepsilon > 0$,
\[
P\{\omega : |X_n(\omega) - X(\omega)| > \varepsilon\} \to 0,
\]
as $n \to \infty$.
Definition 1.9.3 (Convergence in distribution). $X_n$ converges to $X$ in
distribution if for any bounded continuous function $f$,
\[
E f(X_n) \to E f(X).
\]
This is denoted as $X_n \xrightarrow{d} X$, or $\mu_n \xrightarrow{d} \mu$, where $\mu_n$, $\mu$ are the distributions of
$X_n$ and $X$, respectively.
Definition 1.9.4 (Convergence in $L^p$). Let $\{X_n\}_{n \in \mathbb{N}}$ be a sequence of random
variables such that $X_n \in L^p$. $X_n$ converges to $X \in L^p$ in $L^p$ (or in $p$th
mean) if
\[
E|X_n - X|^p \to 0.
\]
For $p = 1$ we speak about convergence in mean; for $p = 2$, convergence in mean
square.
From real analysis, we know that convergence in $L^p$ implies convergence in $L^q$
if $p \ge q \ge 1$, almost sure convergence implies convergence in probability, convergence
in probability implies that there is a subsequence that converges almost surely,
and convergence in probability implies convergence in distribution. Convergence
in $L^p$ implies convergence in probability.
1.10 Characteristic Functions

The characteristic function of a probability measure $\mu$ is defined as
\[
f(\xi) = E\,e^{i(\xi, X)} = \int_{\mathbb{R}^n} e^{i(x, \xi)}\,\mu(dx).
\]
It has the following obvious properties:

1. For all $\xi \in \mathbb{R}^n$, $|f(\xi)| \le 1$ and $f(-\xi) = \overline{f(\xi)}$;

2. $f$ is uniformly continuous on $\mathbb{R}^n$.
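As an example, for a one-dimensional Gaussian $N(\bar{x}, \sigma^2)$ one finds, by completing the square in the Gaussian integral,
\[
f(\xi) = \int_{\mathbb{R}} e^{i\xi x}\,\frac{e^{-(x - \bar{x})^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}}\,dx
= e^{i\xi\bar{x} - \sigma^2\xi^2/2},
\]
which indeed satisfies $|f(\xi)| \le 1$ and $f(-\xi) = \overline{f(\xi)}$.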
We also have

Theorem 1.10.1. Let $\{\mu_n\}_{n \in \mathbb{N}}$ be a sequence of probability measures, and let
$\{f_n\}_{n \in \mathbb{N}}$ be their corresponding characteristic functions. Assume that

1. $f_n$ converges everywhere on $\mathbb{R}^n$ to a limiting function $f$;

2. $f$ is continuous at $\xi = 0$.

Then there exists a probability distribution $\mu$ such that $\mu_n \xrightarrow{d} \mu$. Moreover, $f$ is
the characteristic function of $\mu$.

Conversely, if $\mu_n \xrightarrow{d} \mu$, where $\mu$ is some probability distribution, then $f_n$
converges to $f$ uniformly on every finite interval, where $f$ is the characteristic
function of $\mu$.
For a proof, see [2].
As with Fourier transforms, one can also define the inverse transform
\[
\rho(x) = \frac{1}{(2\pi)^n}\int_{\mathbb{R}^n} e^{-i(\xi, x)}\,f(\xi)\,d\xi.
\]
An interesting question arises as to when this gives the density of a probability
measure. To answer this we define:

Definition 1.10.1. A function $f$ is called positive semi-definite if for any finite
set of values $\xi_1, \ldots, \xi_n$, $n \in \mathbb{N}$, the matrix $(f(\xi_i - \xi_j))$ is positive semi-definite,
i.e.
\[
\sum_{i,j} f(\xi_i - \xi_j)\,v_i \bar{v}_j \ge 0,
\]
for any $v_1, \ldots, v_n \in \mathbb{C}$.
Theorem 1.10.2 (Khinchin). If $f$ is a positive semi-definite, uniformly continuous
function and $f(0) = 1$, then it is the characteristic function of a probability
measure.
Exercise 1.10.1 (Stable laws). A one-dimensional distribution $\mu(dx)$ is stable if,
given any two independent random variables $X$ and $Y$ following this distribution,
there exist $a$ and $b$ such that
\[
a(X + Y - b)
\]
has distribution $\mu(dx)$. Show that $f(\xi) = e^{-|\xi|^{\alpha}}$ is the characteristic function
of a stable distribution for $0 < \alpha \le 2$, and is not a characteristic function for
other values of $\alpha$.
Exercise 1.10.2 (Girko's circular law for random matrices). If the $n \times n$ matrix $A$ has i.i.d. entries
with mean zero and variance $\sigma^2$, then the eigenvalues of $A/(\sigma\sqrt{n})$ are asymptotically
uniformly distributed in the unit disk in the complex plane. Investigate
numerically.
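A minimal numerical investigation, as a sketch (assuming NumPy and Matplotlib are available; the matrix size and the Gaussian entry distribution are arbitrary choices of ours):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)

n, sigma = 1000, 1.0
A = rng.normal(0.0, sigma, size=(n, n))            # i.i.d. entries, mean 0, variance sigma^2
eig = np.linalg.eigvals(A / (sigma * np.sqrt(n)))  # rescaled eigenvalues

plt.scatter(eig.real, eig.imag, s=2)
plt.gca().add_patch(plt.Circle((0, 0), 1.0, fill=False, color="red"))  # unit circle
plt.gca().set_aspect("equal")
plt.title("Eigenvalues of A / (sigma sqrt(n)) vs. the unit disk")
plt.show()
```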