
Chapter 1

Review of Probability Theory

As we argued in the previous chapter, Pattern Recognition is founded on Probability Theory; here
we review the main results that will be needed in the book. This chapter is not intended as a
replacement for a course in Probability, but it will serve as a reference to the reader. For the sake of
precision, the language of measure theory is used in this chapter, but measure-theoretical concepts
will not be required in the remainder of the book.

1.1 Basic Concepts

We introduce in this section the basic notions of probability theory.

1.1.1 Sample Space and Events

A sample space S is the set of all outcomes of an experiment. An event E is a subset E ⊆ S. Event E is said to occur if it contains the outcome of the experiment.

Example 1.1. If the experiment consists of flipping two coins, then

S = {(H, H), (H, T ), (T, H), (T, T )} . (1.1)

The event E that the first coin lands tails is E = {(T, H), (T, T)}. □

Example 1.2. If the experiment consists in measuring the lifetime of a lightbulb, then

S = {t ∈ ℝ | t ≥ 0} .    (1.2)

The event that the lightbulb will fail at or earlier than 2 time units is the real interval E = [0, 2]. □


If E ⊆ F then the occurrence of E implies the occurrence of F. The union E ∪ F occurs iff (if and only if) E, F, or both E and F occur. On the other hand, the intersection E ∩ F occurs iff both E and F occur. If E ∩ F = ∅, then E or F may occur but not both. Finally, the complement event Eᶜ occurs iff E does not occur.

The limit of a sequence of events E1, E2, . . . can be defined in one of two cases. If E1 ⊆ E2 ⊆ . . . (an increasing sequence) then

lim_{n→∞} En = ⋃_{n=1}^∞ En .    (1.3)

Or, if E1 ⊇ E2 ⊇ . . . (a decreasing sequence) then

lim_{n→∞} En = ⋂_{n=1}^∞ En .    (1.4)

1.1.2 Definition of Probability

A σ-algebra is a collection F of events in S that is closed under complementation and countable intersection and union. The Borel algebra of ℝᵈ is generated by complements and countable unions and intersections of the open sets in ℝᵈ (in the usual topology); it is the smallest σ-algebra that contains all open sets in ℝᵈ. A measurable function between two sets S and T, furnished with σ-algebras F and G, respectively, is defined to be a mapping f : S → T such that for every E ∈ G, the pre-image

f⁻¹(E) = {x ∈ S | f(x) ∈ E}    (1.5)

belongs to F. A function f : ℝᵈ → ℝᵏ is said to be Borel-measurable¹ if it is measurable with respect to the Borel algebras of ℝᵈ and ℝᵏ.

¹ A Borel-measurable function is a very general function; for our purposes, it can be considered to be an arbitrary function.

A probability space is a triple (S, F, P) consisting of a sample space S, a σ-algebra F containing all the events of interest, and a probability measure P, i.e., a real-valued function defined on each event E ∈ F that satisfies Kolmogorov's axioms:

A1. 0 ≤ P(E) ≤ 1 ,

A2. P(S) = 1 ,

A3. For a sequence of events E1, E2, . . . such that Ei ∩ Ej = ∅ for all i ≠ j,

P(⋃_{i=1}^∞ Ei) = ∑_{i=1}^∞ P(Ei) .    (1.6)

The following properties are straightforward consequences of Kolmogorov’s axioms.

P1. P(Eᶜ) = 1 − P(E).

P2. If E ⊆ F then P(E) ≤ P(F).

P3. P(E ∪ F) = P(E) + P(F) − P(E ∩ F).

P4. (Union Bound) For any sequence of events E1, E2, . . .

P(⋃_{i=1}^∞ Ei) ≤ ∑_{i=1}^∞ P(Ei) .    (1.7)

P5. (Continuity of Probability Measure) If E1, E2, . . . is an increasing or decreasing sequence of events, then

P(lim_{n→∞} En) = lim_{n→∞} P(En) .    (1.8)

1.1.3 Borel-Cantelli Lemmas

Two important limiting events can be defined for any sequence of events E1, E2, . . .

• The lim sup:

[En i.o.] = lim sup_{n→∞} En = ⋂_{n=1}^∞ ⋃_{i=n}^∞ Ei .    (1.9)

We can see that lim sup_{n→∞} En occurs iff En occurs for an infinite number of n, that is, En occurs infinitely often.

• The lim inf:

lim inf_{n→∞} En = ⋃_{n=1}^∞ ⋂_{i=n}^∞ Ei .    (1.10)

We can see that lim inf_{n→∞} En occurs iff En occurs for all but a finite number of n, that is, En eventually occurs for all n.

An important tool for our purposes is the pair of Borel-Cantelli Lemmas, which specify the probabilities of lim sup and lim inf events.

Lemma 1.1. (First Borel-Cantelli lemma.) For any sequence of events E1, E2, . . .

∑_{n=1}^∞ P(En) < ∞  ⟹  P([En i.o.]) = 0 .    (1.11)

Proof.

P(⋂_{n=1}^∞ ⋃_{i=n}^∞ Ei) = P(lim_{n→∞} ⋃_{i=n}^∞ Ei)
                          = lim_{n→∞} P(⋃_{i=n}^∞ Ei)
                          ≤ lim_{n→∞} ∑_{i=n}^∞ P(Ei)    (1.12)
                          = 0 ,

where the second equality uses the continuity of probability (P5), the inequality is the union bound (P4), and the final step holds because the tail of a convergent series vanishes. □

The converse to the First Lemma holds if the events are independent.

Lemma 1.2. (Second Borel-Cantelli lemma.) For an independent sequence of events E1, E2, . . .,

∑_{n=1}^∞ P(En) = ∞  ⟹  P([En i.o.]) = 1 .    (1.13)

Proof. Given an independent sequence of events E1, E2, . . ., note that

[En i.o.] = ⋂_{n=1}^∞ ⋃_{i=n}^∞ Ei = lim_{n→∞} ⋃_{i=n}^∞ Ei .    (1.14)

Therefore,

P([En i.o.]) = P(lim_{n→∞} ⋃_{i=n}^∞ Ei) = lim_{n→∞} P(⋃_{i=n}^∞ Ei) = 1 − lim_{n→∞} P(⋂_{i=n}^∞ Eiᶜ) ,    (1.15)

where the last equality follows from DeMorgan's Law. Now, note that, by independence,

P(⋂_{i=n}^∞ Eiᶜ) = ∏_{i=n}^∞ P(Eiᶜ) = ∏_{i=n}^∞ (1 − P(Ei)) .    (1.16)

From the inequality 1 − x ≤ e^{−x} we obtain

P(⋂_{i=n}^∞ Eiᶜ) ≤ ∏_{i=n}^∞ exp(−P(Ei)) = exp(−∑_{i=n}^∞ P(Ei)) = 0 ,    (1.17)

since, by assumption, ∑_{i=n}^∞ P(Ei) = ∞, for all n. From (1.15) and (1.17) it follows that P([En i.o.]) = 1, as required. □

1.1.4 Conditional Probability and Independence

Conditional probability is one of the most important concepts in Statistical Signal Processing,
Pattern Recognition, and in Probability Theory in general.

Given that an event F has occurred, for E to occur, E ∩ F has to occur. In addition, the sample space gets restricted to those outcomes in F, so a normalization factor P(F) has to be introduced. Therefore,

P(E | F) = P(E ∩ F) / P(F) .    (1.18)

For simplicity, it is usual to write P(E ∩ F) = P(E, F) to indicate the joint probability of E and F. From (1.18), one then obtains

P(E, F) = P(E | F) P(F) ,    (1.19)

which is known as the multiplication rule. One can also condition on multiple events:

P(E | F1, F2, . . . , Fn) = P(E ∩ F1 ∩ F2 ∩ . . . ∩ Fn) / P(F1 ∩ F2 ∩ . . . ∩ Fn) .    (1.20)
This allows one to generalize the multiplication rule thus:

P(E1, E2, . . . , En) = P(En | E1, . . . , En−1) P(En−1 | E1, . . . , En−2) · · · P(E2 | E1) P(E1) .    (1.21)

The Law of Total Probability is a consequence of the axioms of probability and the multiplication rule:

P(E) = P(E, F) + P(E, Fᶜ) = P(E | F) P(F) + P(E | Fᶜ)(1 − P(F)) .    (1.22)

This property allows one to compute a hard unconditional probability in terms of easier conditional ones. It can be extended to multiple conditioning events via

P(E) = ∑_{i=1}^n P(E, Fi) = ∑_{i=1}^n P(E | Fi) P(Fi) ,    (1.23)

for pairwise disjoint events Fi such that ⋃_{i=1}^n Fi ⊇ E.

Arguably the most useful result of Probability Theory is Bayes Theorem:

P(E | F) = P(F | E) P(E) / P(F) = P(F | E) P(E) / [P(F | E) P(E) + P(F | Eᶜ)(1 − P(E))] .    (1.24)

Bayes Theorem can be interpreted as a way to (1) “invert” the probability P (F | E) to obtain
the probability P (E | F ); or (2) “update” the “prior” probability P (E) to obtain the “posterior”
probability P (E | F ). The former interpretation is the foundation of estimation and detection
in Statistical Signal Processing, while the latter is the foundation of Bayesian Statistics. Bayes
Theorem plays a fundamental role in Pattern Recognition as well.
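
To make the update concrete, here is a minimal numeric sketch of (1.24) in Python, using a hypothetical diagnostic-test setting; the numbers are illustrative, not from the text.

```python
# E = "patient has the disease", F = "test is positive" (hypothetical numbers).
p_E = 0.01          # prior P(E)
p_F_given_E = 0.95  # sensitivity P(F | E)
p_F_given_Ec = 0.05 # false-positive rate P(F | E^c)

# Law of Total Probability (1.22): P(F) = P(F|E)P(E) + P(F|E^c)(1 - P(E))
p_F = p_F_given_E * p_E + p_F_given_Ec * (1 - p_E)

# Bayes Theorem (1.24): posterior P(E | F)
p_E_given_F = p_F_given_E * p_E / p_F
print(f"P(E | F) = {p_E_given_F:.4f}")  # ~0.161: the 1% prior is updated by the data
```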

Events E and F are independent if the occurrence of one does not carry information as to the
occurrence of the other. That is:

P (E | F ) = P (E) and P (F | E) = P (F ). (1.25)

It is easy to see that this is equivalent to the condition

P (E, F ) = P (E)P (F ) . (1.26)

If E and F are independent, so are the pairs (E, Fᶜ), (Eᶜ, F), and (Eᶜ, Fᶜ). However, E being independent of F and G does not imply that E is independent of F ∩ G. Furthermore, three events E, F, G are independent if P(E, F, G) = P(E)P(F)P(G) and each pair of events is independent. This can be extended to independence of any number of events, by requiring that the joint probability factorize and that all subsets of events be independent.

Finally, we remark that P (·|F ) is a probability measure, so that it satisfies all theorems mentioned
previously. In particular, it is possible to define the notion of conditional independence of events.

1.2 Random Variables

Random variables are the basic units of Pattern Recognition, as discussed in the previous chapter. A random variable can be thought of roughly as a "random number." Formally, a random variable X defined on a probability space (S, F, P) is a measurable function X : S → ℝ with respect to F and the usual Borel algebra of ℝ (see Section 1.1.2 for the required definitions). Thus, a random variable X assigns to each outcome ω ∈ S a real number X(ω); see Figure 1.1 for an illustration.

Figure 1.1: A real-valued random variable.



1.2.1 Cumulative Distribution Function

Given a set of real numbers A, we define an event

{X ∈ A} = X⁻¹(A) ⊆ S .    (1.27)

It can be shown that all probability questions about a random variable X can be phrased in terms of the probabilities of a simple set of events:

{X ∈ (−∞, x]} ,  x ∈ ℝ .    (1.28)

These events can be written more simply as {X ≤ x}, for x ∈ ℝ. The cumulative distribution function (CDF) of a random variable X is the function FX : ℝ → [0, 1], which gives the probability of these events:

FX(x) = P({X ≤ x}) ,  x ∈ ℝ .    (1.29)

For simplicity, henceforth we will often remove the braces around statements involving random variables; e.g., we will write FX(x) = P(X ≤ x), for x ∈ ℝ.

Properties of a CDF:

1. FX is non-decreasing: x1 ≤ x2 ⟹ FX(x1) ≤ FX(x2).

2. lim_{x→−∞} FX(x) = 0 and lim_{x→+∞} FX(x) = 1.

3. FX is right-continuous: lim_{x→x0⁺} FX(x) = FX(x0).

It can be shown that a random variable X is uniquely specified by its CDF FX and, conversely, given a function FX satisfying items 1–3 above, there is a unique random variable X associated with it.

1.2.2 Continuous Random Variables

The notion of a probability density function (PDF) is fundamental in Probability Theory (and Pattern Recognition). However, it is a secondary notion to that of a CDF. In fact, every random variable X must have a CDF FX, but not all random variables have a PDF. They do if the CDF FX is continuous and differentiable everywhere but for a countable number of points. In this case, X is said to be a continuous random variable, with PDF given by:

pX(x) = dFX(x)/dx ,    (1.30)

Figure 1.2: The CDF and PDF of a uniform continuous random variable.

at all points x ∈ ℝ where the derivative is defined. See Figure 1.2 for an illustration of a uniform continuous random variable. Note that FX is continuous, and differentiable everywhere except for points a and b.

In this chapter, for precision, we always use the subscript X to denote quantities associated with
a random variable X, e.g., we write FX (x) and pX (x). Elsewhere in the book, we often omit the
subscript, e.g. we write F (x) and p(x), when there is no possibility of confusion.

Probability statements about X can then be made in terms of integration of pX. For example,

FX(x) = ∫_{−∞}^x pX(u) du ,  x ∈ ℝ ,
                                                            (1.31)
P(x1 ≤ X ≤ x2) = ∫_{x1}^{x2} pX(x) dx ,  x1, x2 ∈ ℝ .

Useful continuous random variables include the already mentioned uniform r.v. over the interval [a, b], with density

pX(x) = 1/(b − a) ,  a < x < b ,    (1.32)

the univariate Gaussian r.v. with parameters µ and σ > 0, such that

pX(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)) ,  x ∈ ℝ ,    (1.33)

the exponential r.v. with parameter λ > 0, such that

pX(x) = λ e^{−λx} ,  x ≥ 0 ,    (1.34)

the gamma r.v. with parameters λ, t > 0, such that

pX(x) = λ e^{−λx} (λx)^{t−1} / Γ(t) ,  x ≥ 0 ,    (1.35)

where Γ(t) = ∫_0^∞ e^{−u} u^{t−1} du, and the beta r.v. with parameters a, b > 0, such that

pX(x) = x^{a−1} (1 − x)^{b−1} / B(a, b) ,  0 < x < 1 ,    (1.36)

where B(a, b) = Γ(a)Γ(b)/Γ(a + b). Among these, the Gaussian is the only one defined over the entire real line; the exponential and gamma are defined over the nonnegative real numbers, while the uniform and beta have bounded support. In fact, the uniform r.v. over [0, 1] is just a beta with a = b = 1, while an exponential r.v. is a gamma with t = 1.
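
As a quick numeric check of these last two identities, the following sketch (assuming numpy and scipy are available) compares the densities directly.

```python
import numpy as np
from scipy import stats

lam = 2.0

# Beta(a=1, b=1) density equals the Uniform(0, 1) density
x = np.linspace(0.01, 0.99, 5)
print(np.allclose(stats.beta.pdf(x, a=1, b=1), stats.uniform.pdf(x)))      # True

# Gamma with shape t=1 equals Exponential with the same rate lambda
# (scipy parameterizes both by scale = 1/lambda)
y = np.linspace(0.1, 3.0, 5)
print(np.allclose(stats.gamma.pdf(y, a=1, scale=1/lam),
                  stats.expon.pdf(y, scale=1/lam)))                        # True
```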

1.2.3 Discrete Random Variables

If the random variable X only takes an at most countable number of values, then it is said to be a discrete random variable. For example, let X be the numerical outcome of the roll of a six-sided die. The CDF FX for this example can be seen in Figure 1.3. We can see that FX for a discrete random variable X is a "staircase" function. In particular, it is not possible to define a PDF in this case.

Figure 1.3: The CDF and PMF of a uniform discrete random variable.

The range of a discrete random variable X is defined as R(X) = {k ∈ ℝ | P(X = k) > 0}. By definition, R(X) must be either finite or a countably infinite set. In the previous example, R(X) = {1, 2, 3, 4, 5, 6}. In fact, R(X) is often a subset of the set ℤ of integer numbers, but this is not necessary. One defines the probability mass function (PMF) of a discrete random variable X as

pX(k) = P(X = k) ,  k ∈ R(X) .    (1.37)

The PMF pX corresponds to the "jumps" in the staircase CDF FX. See Figure 1.3 for the PMF in the previous die-rolling example.

Useful discrete random variables include the already mentioned uniform r.v. over a finite set of numbers K with PMF

pX(k) = 1/|K| ,  k ∈ K ,    (1.38)

the Bernoulli with parameter 0 < p < 1, with PMF

pX(0) = 1 − p ,
pX(1) = p ,    (1.39)

the Binomial with parameters n ∈ {1, 2, . . .} and 0 < p < 1, such that

pX(k) = (n choose k) p^k (1 − p)^{n−k} ,  k = 0, 1, . . . , n ,    (1.40)

the Poisson with parameter λ > 0, such that

pX(k) = e^{−λ} λ^k / k! ,  k = 0, 1, . . .    (1.41)

and the Geometric with parameter 0 < p < 1, such that

pX(k) = (1 − p)^{k−1} p ,  k = 1, 2, . . .    (1.42)

A binomial r.v. with parameters n and p has the distribution of a sum of n i.i.d. Bernoulli r.v.s with parameter p.
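
The following short simulation sketch (numpy assumed) illustrates this last remark by comparing the empirical frequency of the sum against the Binomial PMF (1.40).

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n, p, trials = 10, 0.3, 200_000

# Each row is one experiment: n Bernoulli(p) draws; X is their sum.
X = (rng.random((trials, n)) < p).sum(axis=1)

for k in (0, 3, 7):
    empirical = np.mean(X == k)
    exact = comb(n, k) * p**k * (1 - p)**(n - k)   # Binomial PMF (1.40)
    print(f"k={k}: empirical {empirical:.4f}, exact {exact:.4f}")
```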

1.2.4 Mixed Random Variables

Random variables that are neither continuous nor discrete are called mixed random variables. They
are often, but not necessarily, mixtures of continuous and discrete random variables, such as linear
combinations of these (hence, the name “mixed”). The following table summarizes the classification
of random variables.

                 CDF   PDF   PMF
Continuous R.V.  Yes   Yes   No
Discrete R.V.    Yes   No    Yes
Mixed R.V.       Yes   No    No

1.2.5 Joint and Conditional Distributions

These are crucial elements in Pattern Recognition. As in the case of the usual CDF, PDF and PMF, these concepts involve only the probabilities of certain special events. We review below only the case of two random variables; the extension to finite sets of jointly distributed random variables (i.e., random vectors) is fairly straightforward.

Two random variables X and Y are said to be jointly distributed if they are defined on the same probability space (S, F, P) (it can be shown that this is sufficient for the mapping (X, Y) : S → ℝ² to be measurable with respect to F and the Borel algebra of ℝ²). In this case, the joint CDF of X and Y is the joint probability of the events {X ≤ x} and {Y ≤ y}, for x, y ∈ ℝ. Formally, we define a function FXY : ℝ × ℝ → [0, 1] given by

FXY(x, y) = P({X ≤ x} ∩ {Y ≤ y}) = P(X ≤ x, Y ≤ y) ,  x, y ∈ ℝ .    (1.43)

This is the probability of the "lower-left quadrant" with corner at (x, y). Note that FXY(x, ∞) = FX(x) and FXY(∞, y) = FY(y). These are called the marginal CDFs.

If X and Y are jointly distributed and FXY is continuous and has continuous derivatives up to second order, then X and Y are jointly continuous random variables, with joint density

pXY(x, y) = ∂²FXY(x, y)/∂x∂y ,  x, y ∈ ℝ ,    (1.44)

where the order of differentiation does not matter. The joint density function pXY(x, y) integrates to 1 over ℝ². The marginal densities are given by

pX(x) = ∫_{−∞}^∞ pXY(x, y) dy ,  x ∈ ℝ ,
                                                            (1.45)
pY(y) = ∫_{−∞}^∞ pXY(x, y) dx ,  y ∈ ℝ .

The random variables X and Y are independent if pXY(x, y) = pX(x)pY(y), for all x, y ∈ ℝ. It can be shown that if X and Y are independent and Z = X + Y then

pZ(z) = ∫_{−∞}^∞ pX(x) pY(z − x) dx ,  z ∈ ℝ ,    (1.46)

with a similar expression in the discrete case for the corresponding PMFs. The above integral is known as the convolution integral.
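
A brief numeric sketch (numpy assumed) of the convolution integral (1.46), discretized for two independent uniform r.v.s on [0, 1]; the resulting density of Z is the triangular density on [0, 2], which we check at its peak z = 1.

```python
import numpy as np

dx = 0.001
x = np.arange(0, 1, dx)
p_X = np.ones_like(x)           # uniform density on [0, 1]
p_Y = np.ones_like(x)

# Discrete convolution approximates the integral of p_X(x) p_Y(z - x) dx
p_Z = np.convolve(p_X, p_Y) * dx
z = np.arange(len(p_Z)) * dx

print(np.interp(1.0, z, p_Z))   # ~1.0, the peak of the triangular density
```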

If pY(y) > 0, the conditional density of X given Y = y is given by

pX|Y(x | y) = pXY(x, y) / pY(y) ,  x ∈ ℝ .    (1.47)

For an event E, the conditional probability P(E | Y = y) needs care if Y is a continuous random variable, as P(Y = y) = 0. But as long as pY(y) > 0, this probability can be defined (the details are outside of the scope of this review). The "Law of Total Probability" for random variables is a generalization of (1.23):

P(E) = ∫_{−∞}^∞ P(E | Y = y) pY(y) dy .    (1.48)

The concepts of joint PMF, marginal PMFs, and conditional PMF can be defined in a similar way. For conciseness, this is omitted in this brief review.

1.3 Expectation and Variance

Expectation is a fundamental concept in Probability Theory and Pattern Recognition. It has several important interpretations regarding a random variable: 1) its "mean" value; 2) a summary of its distribution (sometimes referred to as a "location parameter"); 3) a prediction of its future value. The latter meaning is the most important one for Pattern Recognition. The variance of a random variable, on the other hand, gives 1) its "spread" around the mean; 2) a second summary of its distribution (the "scale parameter"); 3) the uncertainty in the prediction of its future value by the expectation.

1.3.1 Expectation

The expectation E[X] of a random variable X can be seen as an average of its values weighted by their probabilities:

E[X] = ∫_{−∞}^∞ x pX(x) dx .    (1.49)

Given a random variable X and a Borel-measurable function g : ℝ → ℝ, g(X) is also a random variable. One of the most useful theorems of Probability Theory states that:

E[g(X)] = ∫_{−∞}^∞ g(x) pX(x) dx .    (1.50)

As an immediate corollary, one gets E[aX + c] = aE[X] + c.

If f : ℝ → ℝ is Borel-measurable and concave (i.e., f lies at or above the line joining any two of its points) then Jensen's Inequality asserts that

E[f(X)] ≤ f(E[X]) .    (1.51)
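
A small sampling sketch (numpy assumed) of (1.50) and (1.51), using the concave function f(x) = log x and X uniform on [1, 2]; the setup is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 2.0, size=1_000_000)

lhs = np.mean(np.log(X))     # Monte Carlo estimate of E[log X], as in (1.50)
rhs = np.log(np.mean(X))     # log(E[X])
print(lhs, rhs, lhs <= rhs)  # Jensen (1.51): E[log X] <= log(E[X]) -> True
```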

For jointly distributed random variables X and Y, and a Borel-measurable function g : ℝ × ℝ → ℝ,

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) pXY(x, y) dx dy .    (1.52)

This can be extended directly to any finite number of jointly distributed random variables.

Analogous formulas concerning expectation for discrete random variables can be obtained by replacing integration with summation and PDFs by PMFs.

From (1.52) and the linearity property of integration, one obtains the well-known linearity property of expectation,

E[aX + bY] = aE[X] + bE[Y] ,    (1.53)

where no conditions on X and Y apart from the existence of the expectations are assumed. Once
again, this property can be easily extended to any finite number of jointly distributed random
variables.

It can be shown that E[f(X)g(Y)] = E[f(X)]E[g(Y)] for any Borel-measurable functions f, g : ℝ → ℝ if, and only if, X and Y are independent. If this condition is satisfied for at least f(X) = X and g(Y) = Y, that is, if E[XY] = E[X]E[Y], then X and Y are said to be uncorrelated. Of course, independence implies uncorrelatedness. The converse is only true in special cases, e.g., jointly Gaussian random variables.

Expectation preserves order, in the sense that if P (X > Y ) = 1, then E[X] > E[Y ].

Hölder's Inequality states that, for 1 < r < ∞ and 1/r + 1/s = 1,

E[|XY|] ≤ E[|X|^r]^{1/r} E[|Y|^s]^{1/s} .    (1.54)

The special case r = s = 2 results in the Cauchy-Schwarz Inequality:

E[|XY|] ≤ √(E[X²]E[Y²]) .    (1.55)

The expectation of a random variable X is affected by its probability tails, given by FX(a) = P(X ≤ a) and 1 − FX(a) = P(X > a). If the probability tails fail to vanish sufficiently fast (X has "fat tails"), then E[X] will not be finite, and the expectation is undefined. For a nonnegative random variable X (i.e., one for which P(X ≥ 0) = 1), there is only one probability tail, the upper tail P(X > a), and there is a simple formula relating E[X] to it:

E[X] = ∫_0^∞ P(X > x) dx .    (1.56)

A small E[X] constrains the upper tail to be thin. This is guaranteed by Markov's Inequality: if X is a nonnegative random variable,

P(X ≥ a) ≤ E[X]/a ,  for all a > 0 .    (1.57)
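
An empirical sketch (numpy assumed) of Markov's Inequality (1.57); the choice of an Exponential(1) variable, with E[X] = 1, is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)

for a in (1.0, 2.0, 5.0):
    tail = np.mean(X >= a)            # empirical P(X >= a)
    bound = X.mean() / a              # Markov bound E[X]/a
    print(f"a={a}: P(X>=a) ~ {tail:.4f} <= {bound:.4f}")
```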
Finally, a particular result that is of interest to our purposes relates an exponentially-vanishing upper tail of a nonnegative random variable to a bound on its expectation.

Lemma 1.3. If X is a non-negative random variable such that P(X > t) ≤ c e^{−at²} for all t > 0, where a, c > 0 are given constants, we have

E[X] ≤ √((1 + log c)/a) .    (1.58)

Proof. Note that P(X² > t) = P(X > √t) ≤ c e^{−at}. From (1.56) we get:

E[X²] = ∫_0^∞ P(X² > t) dt = ∫_0^u P(X² > t) dt + ∫_u^∞ P(X² > t) dt
      ≤ u + ∫_u^∞ c e^{−at} dt = u + (c/a) e^{−au} .    (1.59)

By direct differentiation, it is easy to verify that the upper bound on the right-hand side is minimized at u = (log c)/a. Substituting this value back into the bound leads to E[X²] ≤ (1 + log c)/a. The result then follows from the fact that E[X] ≤ √(E[X²]). □

1.3.2 Variance

The variance Var(X) of a random variable X is a nonnegative quantity related to the spread of the values of X around the mean E[X]:

Var(X) = E[(X − E[X])²] = E[X²] − (E[X])² .    (1.60)

The following property follows directly from the definition:

Var(aX + c) = a² Var(X) .    (1.61)



A small variance constrains the random variable to be close to its mean with high probability. This follows from Chebyshev's Inequality:

P(|X − E[X]| ≥ τ) ≤ Var(X)/τ² ,  for all τ > 0 .    (1.62)

Chebyshev's Inequality follows directly from the application of Markov's Inequality (1.57) to the random variable |X − E[X]|² with a = τ².

Expectation has the linearity property, so that, given any pair of jointly distributed random variables X and Y, it is always true that E[X + Y] = E[X] + E[Y] (provided that all expectations exist). However, it is not always true that Var(X + Y) = Var(X) + Var(Y). In order to investigate this issue, it is necessary to introduce the covariance between X and Y:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y] .    (1.63)

If Cov(X, Y) > 0 then X and Y are positively correlated; if Cov(X, Y) < 0, they are negatively correlated. Clearly, X and Y are uncorrelated if and only if Cov(X, Y) = 0. Note also that Cov(X, X) = Var(X), and that Cov(∑_{i=1}^n Xi, ∑_{j=1}^m Yj) = ∑_{i=1}^n ∑_{j=1}^m Cov(Xi, Yj).

Now, it follows directly from the definition of variance that

Var(X1 + X2) = Var(X1) + Var(X2) + 2 Cov(X1, X2) .    (1.64)

This can be extended to any number of random variables by induction:

Var(∑_{i=1}^n Xi) = ∑_{i=1}^n Var(Xi) + 2 ∑_{i<j} Cov(Xi, Xj) .    (1.65)

Hence, the variance is distributive over sums if all variables are pairwise uncorrelated.

It follows directly from the Cauchy-Schwarz Inequality (1.55) that |Cov(X, Y)| ≤ √(Var(X)Var(Y)). Therefore, the covariance can be normalized to be in the interval [−1, 1] thus:

ρ(X, Y) = Cov(X, Y) / √(Var(X)Var(Y)) ,    (1.66)

with −1 ≤ ρ(X, Y) ≤ 1. This is called the correlation coefficient between X and Y. The closer |ρ| is to 1, the tighter is the relationship between X and Y. The limiting case ρ(X, Y) = ±1 occurs if and only if Y = a + bX, with the sign of b matching the sign of ρ, i.e., X and Y are perfectly related to each other through a linear (affine) relationship. For this reason, ρ(X, Y) is sometimes called the linear correlation coefficient between X and Y; it does not detect nonlinear relationships.
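
A short sketch (numpy assumed) of this last point: the sample correlation detects an affine relationship but misses a perfect nonlinear one.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100_000)

Y1 = 2.0 * X + 1.0            # perfect affine relationship -> rho = 1
Y2 = X**2                     # nonlinear dependence, X symmetric -> rho ~ 0

print(np.corrcoef(X, Y1)[0, 1])   # ~1.0
print(np.corrcoef(X, Y2)[0, 1])   # ~0.0, despite Y2 being a function of X
```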

1.3.3 Conditional Expectation and Conditional Variance

Conditional expectation allows the prediction of the value of a random variable given the observed value of another, i.e., prediction given data, while conditional variance yields the uncertainty of that prediction. Conditional expectation and conditional variance are thus key probabilistic concepts in Pattern Recognition.

If X and Y are jointly continuous random variables and the conditional density pX|Y(x | y) is well defined for Y = y, then the conditional expectation of X given Y = y is:

E[X | Y = y] = ∫_{−∞}^∞ x pX|Y(x | y) dx ,    (1.67)

with a similar definition for discrete random variables using conditional PMFs.

The conditional variance of X given Y = y is defined using conditional expectation as:

Var(X | Y = y) = E[(X − E[X | Y = y])² | Y = y] = E[X² | Y = y] − (E[X | Y = y])² .    (1.68)

Notice that all expectations are conditioned on Y = y.

Most of the properties of expectation and variance apply without modification to conditional expectations and conditional variances, respectively. For example, E[∑_{i=1}^n Xi | Y = y] = ∑_{i=1}^n E[Xi | Y = y] and Var(aX + c | Y = y) = a² Var(X | Y = y).

Now, both E[X | Y = y] and Var(X | Y = y) are deterministic quantities for each value of Y = y (just as the ordinary expectation and variance are). But if the specific value Y = y is left unspecified and allowed to vary, then we can look at E[X | Y] and Var(X | Y) as functions of the random variable Y, and therefore as random variables themselves. The reasons why these are valid random variables are nontrivial and beyond the scope of this review (see [Billingsley, 1995]).

One can show that the expectation of the random variable E[X | Y] is precisely E[X]:

E[E[X | Y]] = E[X] .    (1.69)

An equivalent statement is:

E[X] = ∫_{−∞}^∞ E[X | Y = y] pY(y) dy ,    (1.70)

with a similar expression in the discrete case. Paraphrasing the Law of Total Probability (1.23), the previous equation might be called the Law of Total Expectation.

On the other hand, it is not the case that Var(X) = E[Var(X | Y)]. The answer is slightly more complicated:

Var(X) = E[Var(X | Y)] + Var(E[X | Y]) .    (1.71)

This is known as the Conditional Variance Formula. It is an "analysis of variance" formula, as it breaks down the total variance of X into a "within-rows" component and an "across-rows" component. We call this the Law of Total Variance. This formula will play a key role in Chapter 7 on Classifier Error Estimation.
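
A minimal simulation sketch (numpy assumed) of (1.69) and (1.71), using a hypothetical two-stage model: Y ∼ N(0, 1), then X | Y = y ∼ N(y, 4).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
Y = rng.normal(0.0, 1.0, size=n)
X = rng.normal(Y, 2.0)            # X | Y ~ N(Y, sigma^2 = 4)

# Here E[X | Y] = Y and Var(X | Y) = 4, so (1.71) predicts Var(X) = 4 + 1 = 5.
print(X.mean())                   # ~0 = E[E[X | Y]] = E[Y], as in (1.69)
print(X.var())                    # ~5 = E[Var(X|Y)] + Var(E[X|Y])
```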

1.3.4 Optimal Prediction

Now, suppose one is interested in predicting the value of a random variable Y using a predictor Ŷ. One would like Ŷ to be optimal according to some criterion. The criterion most widely used is the Mean Square Error:

MSE = E[(Y − Ŷ)²] .    (1.72)

It can be shown easily that the minimum MSE (MMSE) estimator is simply the mean: Ŷ* = E[Y]. This is a constant estimator, since no data are available. Clearly, the MSE of Ŷ* is simply the variance of Y. Therefore, the best one can do in the absence of any extra information is to predict the mean E[Y], with an uncertainty equal to the variance Var(Y).

If Var(Y) is very small, i.e., if there is very little uncertainty in Y to begin with, then E[Y] could actually be an acceptable estimator. In practice, this is rarely the case. Therefore, observations on an auxiliary random variable X (i.e., data) are sought to improve prediction. Naturally, it is known (or hoped) that X and Y are not independent, otherwise no improvement over the constant estimator is possible. One defines the conditional MSE of a data-dependent estimator Ŷ = h(X) to be

MSE(X) = E[(Y − h(X))² | X] .    (1.73)

By taking expectation over X, one obtains the unconditional MSE: E[(Y − h(X))²]. The conditional MSE is often the most important one in practice, since it concerns the particular data at hand, while the unconditional MSE is data-independent and used to compare the performance of different predictors. Regardless, it can be shown that the MMSE estimator in both cases is the conditional mean h*(X) = E[Y | X]. This is one of the most important results in Signal Processing and Pattern Recognition. The function η(x) = E[Y | X = x] is the optimal regression of Y on X. This is not in general the optimal estimator if Y is discrete, e.g., in the case of classification. This is because η(X) may not be in the range of values taken by Y, so it does not define a valid estimator. We will see in Chapter 3 how to modify η(·) to obtain the optimal estimator (optimal classifier) in this case.
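
A small sketch (numpy assumed) comparing the constant predictor E[Y] with a crude binned estimate of η(x) = E[Y | X = x]; the model below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.uniform(0, 1, size=n)
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.3, size=n)

# Constant predictor: its MSE equals Var(Y)
mse_const = np.mean((Y - Y.mean())**2)

# Crude estimate of eta(x) = E[Y | X = x]: average Y within each bin of X
bins = np.clip((X * 50).astype(int), 0, 49)
eta = np.array([Y[bins == b].mean() for b in range(50)])
mse_cond = np.mean((Y - eta[bins])**2)

print(mse_const, mse_cond)   # conditioning on X reduces the MSE substantially
```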

1.4 Vector Random Variables

As mentioned in the previous chapter, Pattern Recognition is inherently a multivariate subject, and therefore the notion of vector random variables is of key importance (e.g., feature vectors are vector random variables).

Vector random variables, or random vectors, are defined analogously to ordinary random variables (see Section 1.2). A random vector X defined on a probability space (S, F, P) is a measurable function X : S → ℝᵈ with respect to F and the Borel algebra of ℝᵈ (the required definitions are given in Section 1.1.2). Alternatively, if X1, . . . , Xd are jointly distributed random variables on (S, F, P), then it can be shown that X = (X1, . . . , Xd) is a proper random vector defined on the same probability space.

The distribution of a random vector X is the joint distribution of the component random variables. The expected value of X is the vector of expected values:

E[X] = (E[X1], . . . , E[Xd])ᵀ .    (1.74)

The second moments of a random vector are contained in the d × d covariance matrix:

Σ = E[(X − µ)(X − µ)ᵀ] ,    (1.75)

where Σii = Var(Xi) and Σij = Cov(Xi, Xj), for i, j = 1, . . . , d. The covariance matrix is real symmetric and thus diagonalizable:

Σ = U D Uᵀ ,    (1.76)

where U is the orthogonal matrix of eigenvectors and D is the diagonal matrix of eigenvalues. All eigenvalues are nonnegative (Σ is positive semi-definite). In fact, except for "degenerate" cases, all eigenvalues are positive, and so Σ is invertible (Σ is said to be positive definite in this case).

It is easy to check that the random vector

Y = Σ^{−1/2}(X − µ) = D^{−1/2} Uᵀ (X − µ)    (1.77)

has zero mean and covariance matrix I_d (so that all components of Y are zero-mean, unit-variance, and uncorrelated). This is called the Whitening or Mahalanobis transformation.

Given n independent and identically-distributed (i.i.d.) sample observations X1, . . . , Xn of the random vector X, the maximum-likelihood estimator of µ = E[X], known as the sample mean, is

µ̂ = (1/n) ∑_{i=1}^n Xi .    (1.78)

It can be shown [Casella and Berger, 2002] that this estimator is unbiased (that is, E[µ̂] = µ) and consistent (that is, µ̂ converges in probability to µ as n → ∞; see Section 1.5 and Theorem 1.2). On the other hand, the sample covariance estimator is given by:

Σ̂ = (1/(n − 1)) ∑_{i=1}^n (Xi − µ̂)(Xi − µ̂)ᵀ .    (1.79)

This is an unbiased and consistent estimator of Σ [Casella and Berger, 2002].
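
A brief numpy sketch of the sample mean (1.78), the sample covariance (1.79), and the whitening transformation (1.77) applied with the estimated quantities; the mean and covariance values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=n)

mu_hat = X.mean(axis=0)                      # sample mean (1.78)
Sigma_hat = np.cov(X, rowvar=False)          # sample covariance (1.79), 1/(n-1)

# Whitening via the eigendecomposition Sigma = U D U^T, as in (1.76)-(1.77)
eigvals, U = np.linalg.eigh(Sigma_hat)
Y = (X - mu_hat) @ U @ np.diag(eigvals**-0.5)

print(np.cov(Y, rowvar=False).round(3))      # ~ identity matrix
```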

1.4.1 The Multivariate Gaussian

The multivariate Gaussian is the most important probability distribution in Engineering and Science. The random vector X has a multivariate Gaussian distribution with mean µ and covariance matrix Σ (assuming Σ invertible, so that also det(Σ) > 0) if its density is given by

p(x) = (1/√((2π)ᵈ det(Σ))) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)) .    (1.80)

We write X ∼ N_d(µ, Σ).

The multivariate Gaussian has ellipsoidal contours of constant density of the form

(x − µ)ᵀ Σ⁻¹ (x − µ) = c² ,  c > 0 .    (1.81)

The axes of the ellipsoids are given by the eigenvectors of Σ, and the lengths of the axes are proportional to the square roots of the corresponding eigenvalues. In the case Σ = σ²I_d, where I_d denotes the d × d identity matrix, the contours are spherical with center at µ. This can be seen by substituting Σ = σ²I_d in (1.81), which leads to the following equation for the contours:

||x − µ||² = r² ,  r > 0 .    (1.82)

If d = 1, one gets the univariate Gaussian distribution X ∼ N(µ, σ²). With µ = 0 and σ = 1, the CDF of X is given by

P(X ≤ x) = Φ(x) = ∫_{−∞}^x (1/√(2π)) e^{−u²/2} du .    (1.83)
It is clear that the function Φ(·) satisfies the property Φ(−x) = 1 − Φ(x).

The following are useful properties of a multivariate Gaussian random vector X ∼ N_d(µ, Σ):

• The density of each component Xi is univariate Gaussian N(µi, Σii).

• The components of X are independent if and only if they are uncorrelated, i.e., Σ is a diagonal matrix.

• The whitening transformation Y = Σ^{−1/2}(X − µ) produces a multivariate Gaussian Y ∼ N_d(0, I_d) (so that all components of Y are zero-mean, unit-variance, and uncorrelated Gaussian random variables).

• In general, if A is a nonsingular d × d matrix and c is a d-vector, then Y = AX + c ∼ N_d(Aµ + c, AΣAᵀ).

• The random vectors AX and BX are independent iff AΣBᵀ = 0.

• If Y and X are jointly multivariate Gaussian, then the distribution of Y given X is again multivariate Gaussian.

• The best MMSE predictor E[Y | X] is a linear function of X.

1.5 Convergence of Random Sequences

It is often necessary in Pattern Recognition to investigate the long-term behavior of random sequences, such as the sequence of true or estimated classification error rates indexed by sample size. In this section and the next, we review basic results about convergence of random sequences. We consider only the case of real-valued random variables, but nearly all the definitions and results can be directly extended to random vectors, with the appropriate modifications.

A random sequence {Xn; n = 1, 2, . . .} is a sequence of random variables. The standard modes of convergence for random sequences are:

1. "Sure" convergence: Xn → X surely if for all outcomes ω ∈ S in the sample space one has lim_{n→∞} Xn(ω) = X(ω).

2. Almost-sure convergence or convergence with probability one: Xn → X (a.s.) if pointwise convergence fails only for an event of probability zero, i.e.:

P({ω ∈ S | lim_{n→∞} Xn(ω) = X(ω)}) = 1 .    (1.84)

3. Lp-convergence: Xn → X in Lp, for p > 0, also denoted by Xn →^{Lp} X, if E[|Xn|^p] < ∞ for n = 1, 2, . . ., E[|X|^p] < ∞, and the p-norm of the difference between Xn and X converges to zero:

lim_{n→∞} E[|Xn − X|^p] = 0 .    (1.85)

The special case of L2 convergence is also called mean-square (m.s.) convergence.



4. Convergence in Probability: Xn → X in probability, also denoted by Xn →^P X, if the "probability of error" converges to zero:

lim_{n→∞} P(|Xn − X| > τ) = 0 ,  for all τ > 0 .    (1.86)

5. Convergence in Distribution: Xn → X in distribution, also denoted by Xn →^D X, if the corresponding CDFs converge:

lim_{n→∞} FXn(a) = FX(a) ,    (1.87)

at all points a ∈ ℝ where FX is continuous.

Sure and almost-sure convergence have to do with convergence of the sequence realizations {Xn(ω)} to the corresponding limit X(ω), so many properties of ordinary convergence apply. For example, if f : ℝ → ℝ is a continuous function, then Xn → X a.s. implies that f(Xn) → f(X) a.s. as well (it is possible to show that this is also true for convergence in probability).

The following relationships between modes of convergence hold [Chung, 1974]:

Sure ⟹ Almost-sure ⟹ Probability ⟹ Distribution ,
Mean-square ⟹ Probability .    (1.88)

Therefore, sure convergence is the strongest mode of convergence and convergence in distribution is the weakest. In practice, sure convergence is rarely used, and almost-sure convergence is the strongest mode of convergence employed. On the other hand, convergence in distribution is really convergence of CDFs, and does not have the same properties one expects from convergence, which the other modes have. For example, convergence of Xn to X and of Yn to Y almost surely, in Lp, or in probability implies that Xn + Yn converges to X + Y almost surely, in Lp, or in probability, respectively, but that is not true for convergence in distribution, unless Xn and Yn are independent for all n = 1, 2, . . ..

Stronger relations between the modes of convergence can be proved for special cases. In particular, mean-square convergence and convergence in probability can be shown to be equivalent for uniformly bounded sequences. Classifier error rate sequences are uniformly bounded, so this is an important topic in Pattern Recognition.

A random sequence {Xn; n = 1, 2, . . .} is uniformly bounded if there exists a finite K > 0, which does not depend on n, such that

|Xn| ≤ K ,  with probability 1, for all n = 1, 2, . . . ,    (1.89)

meaning that P(|Xn| ≤ K) = 1, for all n = 1, 2, . . . The classification error rate sequence {εn; n = 1, 2, . . .} is an example of a uniformly bounded random sequence, with K = 1. We have the following theorem.
theorem.

Theorem 1.1. Let {Xn; n = 1, 2, . . .} be a uniformly bounded random sequence. The following statements are equivalent.

(1) Xn → X in Lp, for some p > 0.

(2) Xn → X in Lq, for all q > 0.

(3) Xn → X in probability.

Proof. First note that we can assume without loss of generality that X = 0, since Xn → X if and only if Xn − X → 0, and Xn − X is also uniformly bounded, with E[|Xn − X|^p] < ∞. Showing that (1) ⟺ (2) requires showing that Xn → 0 in Lp, for some p > 0, implies that Xn → 0 in Lq, for all q > 0. First observe that E[|Xn|^q] ≤ E[K^q] = K^q < ∞, for all q > 0. If q > p, the result is immediate. Let 0 < q < p. With X = |Xn|^q, Y = 1 and r = p/q in Hölder's Inequality (1.54), we can write

E[|Xn|^q] ≤ E[|Xn|^p]^{q/p} .    (1.90)

Hence, if E[|Xn|^p] → 0, then E[|Xn|^q] → 0, proving the assertion. To show that (2) ⟺ (3), first we show the direct implication by writing Markov's Inequality (1.57) with X = |Xn|^p and a = τ^p:

P(|Xn| ≥ τ) ≤ E[|Xn|^p]/τ^p ,  for all τ > 0 .    (1.91)

The right-hand side goes to 0 by hypothesis, and thus so does the left-hand side, which is equivalent to (1.86) with X = 0. To show the reverse implication, write

E[|Xn|^p] = E[|Xn|^p I_{|Xn|<τ}] + E[|Xn|^p I_{|Xn|≥τ}] ≤ τ^p + K^p P(|Xn| ≥ τ) .    (1.92)

By assumption, P(|Xn| ≥ τ) → 0, for all τ > 0, so that lim E[|Xn|^p] ≤ τ^p. Letting τ → 0 then yields the desired result. □

The previous theorem states that convergence in m.s. and in probability are the same for uniformly bounded sequences. The relationship between modes of convergence becomes:

Sure ⟹ Almost-sure ⟹ {Mean-square ⟺ Probability} ⟹ Distribution .    (1.93)

As a simple corollary of Theorem 1.1, we have the following result.

Corollary 1.1. If {Xn; n = 1, 2, . . .} is a uniformly bounded random sequence and Xn → X in probability, then E[Xn] → E[X].

Proof. From the previous theorem, Xn → X in L1, i.e., E[|Xn − X|] → 0. But |E[Xn − X]| ≤ E[|Xn − X|], so |E[Xn − X]| → 0 ⟹ E[Xn − X] → 0. □

Example 1.3. Consider a sequence of independent binary random variables X1, X2, . . . that take on values in {0, 1}, such that

P({Xn = 1}) = 1/n ,  n = 1, 2, . . .    (1.94)

Then Xn → 0 in probability, since P(Xn > τ) → 0, for every τ > 0. By Theorem 1.1, Xn → 0 in Lp as well. However, Xn does not converge to 0 with probability one. Indeed,

∑_{n=1}^∞ P({Xn = 1}) = ∑_{n=1}^∞ P({Xn = 0}) = ∞ ,    (1.95)

and it follows from the 2nd Borel-Cantelli lemma that

P([{Xn = 1} i.o.]) = P([{Xn = 0} i.o.]) = 1 ,    (1.96)

so that Xn does not converge with probability one. However, if convergence of the probabilities to zero is faster, e.g.

P({Xn = 1}) = 1/n² ,  n = 1, 2, . . . ,    (1.97)

then ∑_{n=1}^∞ P({Xn = 1}) < ∞ and the 1st Borel-Cantelli Lemma ensures that Xn → 0 with probability one. □
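
A simulation sketch (numpy assumed) of this example. A single run cannot prove almost-sure behavior, but it illustrates the contrast: with rate 1/n the ones keep appearing far out in the sequence, while with rate 1/n² they stop early.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
n = np.arange(1, N + 1)

for rate, label in ((1.0 / n, "1/n"), (1.0 / n**2, "1/n^2")):
    X = rng.random(N) < rate          # X[k] = 1 with probability rate[k]
    last_one = n[X][-1] if X.any() else 0
    print(f"P(Xn=1) = {label}: {X.sum()} ones, last one at n = {last_one}")
```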

In the previous example, note that, with P(Xn = 1) = 1/n, the probability of observing a 1 becomes infinitesimally small as n → ∞, so the sequence consists, for all practical purposes, of all zeros for large enough n. Convergence in probability and in Lp of Xn to 0 agrees with this fact, but the lack of convergence with probability 1 does not. This shows that convergence with probability 1 may be too stringent a criterion to be useful in practice, and convergence in probability and in Lp (assuming boundedness) may be enough. For example, this is the case in most Signal Processing applications, where L2 is the criterion of choice (more generally, Engineering applications are concerned with average performance and rates of failure).

1.5.1 Limiting Theorems and Concentration Inequalities

The following two theorems are the classical limiting theorems for random sequences, the proofs of
which can be found in any advanced text in Probability Theory, e.g. [Chung, 1974].

Theorem 1.2. (Law of Large Numbers.) Given an i.i.d. random sequence {Xn; n = 1, 2, . . .} with common finite mean µ,

(1/n) ∑_{i=1}^n Xi → µ ,  with probability 1.    (1.98)

Mainly for historical reasons, the previous theorem is sometimes called the Strong Law of Large
Numbers, with the weaker result involving only convergence in probability being called the Weak
Law of Large Numbers.

Theorem 1.3. (Central Limit Theorem.) Given an i.i.d. random sequence {Xn; n = 1, 2, . . .} with common finite mean µ and common finite variance σ²,

(1/(σ√n)) (∑_{i=1}^n Xi − nµ) →^D N(0, 1) .    (1.99)
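
A compact simulation sketch (numpy assumed) of Theorems 1.2 and 1.3 for i.i.d. Exponential(1) variables (so µ = σ = 1); the distribution choice is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Law of Large Numbers (1.98): the sample average approaches mu = 1
X = rng.exponential(1.0, size=100_000)
print(X.mean())                               # ~1.0

# Central Limit Theorem (1.99): standardized sums look like N(0, 1)
n, reps = 1_000, 5_000
sums = rng.exponential(1.0, size=(reps, n)).sum(axis=1)
Z = (sums - n * 1.0) / np.sqrt(n)             # (sum - n*mu)/(sigma*sqrt(n)); sigma = 1
print(Z.mean(), Z.std())                      # ~0 and ~1
```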

The previous limiting theorems concern the behavior of a sum of n random variables as n approaches infinity. It is also useful to have an idea of how partial sums differ from expected values for finite n. This problem is addressed by the so-called concentration inequalities, the most famous of which is Hoeffding's Inequality.

Theorem 1.4. (Hoeffding's Inequality, 1963.) Given independent (not necessarily identically-distributed) random variables X1, . . . , Xn such that P(a ≤ Xi ≤ b) = 1, for i = 1, . . . , n, the sum Sn = ∑_{i=1}^n Xi satisfies

P(|Sn − E[Sn]| ≥ τ) ≤ 2 exp(−2τ²/(n(b − a)²)) ,  for all τ > 0 .    (1.100)
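
An empirical sketch (numpy assumed) of the bound (1.100) for sums of Bernoulli(1/2) variables, where a = 0, b = 1 and E[Sn] = n/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 200_000
S = (rng.random((reps, n)) < 0.5).sum(axis=1)

for tau in (5, 10, 20):
    empirical = np.mean(np.abs(S - n / 2) >= tau)
    bound = 2 * np.exp(-2 * tau**2 / n)      # Hoeffding bound with (b - a) = 1
    print(f"tau={tau}: P ~ {empirical:.4f} <= {bound:.4f}")
```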

Hoeffding's Inequality is a special case of a more general concentration inequality due to McDiarmid.

Theorem 1.5. (McDiarmid's Inequality, 1989.) Given independent (not necessarily identically-distributed) random variables X1, . . . , Xn defined on a set A and a function f : Aⁿ → ℝ such that

|f(x1, . . . , xi−1, xi, xi+1, . . . , xn) − f(x1, . . . , xi−1, x′i, xi+1, . . . , xn)| ≤ ci ,    (1.101)

for all i = 1, . . . , n and all x1, . . . , xn, x′i ∈ A, then

P(|f(X1, . . . , Xn) − E[f(X1, . . . , Xn)]| ≥ τ) ≤ 2 exp(−2τ²/∑_{i=1}^n ci²) ,  for all τ > 0 .    (1.102)

1.6 Additional Topics

The infinite typist monkey and Borges's total library

An interesting application of the Second Borel-Cantelli Lemma is the thought experiment known as the "infinite typist monkey." Imagine a monkey that sits at a typewriter banging away randomly for an infinite amount of time. It will produce Shakespeare's complete works, and in fact, the entire Library of Congress, not just once, but an infinite number of times.

Proof. Let L be the length in characters of the desired work. Let En be the event that the n-th sequence of characters produced by the monkey matches, character by character, the desired work (we are making it even harder for the monkey, as we are ruling out overlapping frames). Clearly P(En) = 27^{−L} > 0. It is a very small number, but still positive. Now, since our monkey never gets disappointed nor tired, the events En are independent. It follows by the 2nd Borel-Cantelli lemma that En will occur, and infinitely often. Q.E.D.

The typist monkey would produce a library containing any possible work of literature, in any language (based on the Latin alphabet). This is what Argentine writer Jorge L. Borges had to say about such a library (in a 1939 essay called "The Total Library"):

Everything would be in its blind volumes. Everything: the detailed history of the future, Aeschylus' The Egyptians, the exact number of times that the waters of the Ganges have reflected the flight of a falcon, the secret and true nature of Rome, the encyclopedia Novalis would have constructed, my dreams and half-dreams at dawn on August 14, 1934, the proof of Pierre Fermat's theorem, the unwritten chapters of Edwin Drood, those same chapters translated into the language spoken by the Garamantes, the paradoxes Berkeley invented concerning Time but didn't publish, Urizen's books of iron, the premature epiphanies of Stephen Dedalus, which would be meaningless before a cycle of a thousand years, the Gnostic Gospel of Basilides, the song the sirens sang, the complete catalog of the Library, the proof of the inaccuracy of that catalog. Everything: but for every sensible line or accurate fact there would be millions of meaningless cacophonies, verbal farragoes, and babblings. Everything: but all the generations of mankind could pass before the dizzying shelves — shelves that obliterate the day and on which chaos lies — ever reward them with a tolerable page.

In practice, even if all the atoms in the universe were typist monkeys banging away billions of characters a second since the Big Bang, the probability of getting Shakespeare's Hamlet, let alone Borges' total library, within the age of the universe would still be vanishingly small. This shows that one must be careful with arguments involving infinity.
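
A back-of-the-envelope sketch of this claim, with assumed (illustrative) figures: roughly 10^80 atoms, 10^9 characters per second per "monkey", about 4.4 × 10^17 seconds since the Big Bang, and Hamlet taken as ~130,000 characters from a 27-symbol alphabet (as in Exercise 3).

```python
from math import log10

L = 130_000
log_p_single = -L * log10(27)                 # log10 of 27^(-L), one matching frame
log_attempts = 80 + 9 + log10(4.4e17)         # log10 of total frames typed (assumed)

# Expected number of successes = attempts * 27^(-L); its log10:
print(log_p_single + log_attempts)            # ~ -185,000: effectively zero
```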

Tail Events and Kolmogorov’s Zero-One Law

Given a sequence of events E1, E2, . . ., a tail event is an event whose occurrence depends on the whole sequence, but is probabilistically independent of any finite subsequence. Some examples of tail events are lim_{n→∞} En (if {En} is monotone), lim sup_{n→∞} En, and lim inf_{n→∞} En.

One of the most startling results published in Kolmogorov's 1933 monograph was the so-called Zero-One Law. It states that, given a sequence of independent events E1, E2, . . ., all its tail events have either probability 0 or probability 1. That is, tail events are either almost-surely impossible or occur almost surely. In practice, it may be extremely difficult to conclude one way or the other. The Borel-Cantelli lemmas together give a sufficient condition to decide on the 0-1 probability of the tail event lim sup_{n→∞} En, with {En} an independent sequence.

St. Petersburg Paradox

This paradox illustrates the issues with the frequentist approach to probability. Imagine a game where a fair coin is tossed repeatedly and independently, until the first tail appears. Let N be the number of tosses; the player is then rewarded 2^N dollars. According to the standard frequentist interpretation, the fair price of a game is its expected winnings. The questions are 1) what the expected winnings of the coin-flipping game are, and 2) how much most people would be willing to pay to play the game once.

Notice that the number of tosses N is a Geometric random variable with parameter 1/2. The expected winnings are therefore

E[W] = E[2^N] = ∑_{n=1}^∞ 2^n (1/2)^n = ∑_{n=1}^∞ 1 = ∞ .    (1.103)

However, this expected result is very far from being the most likely result in a single game, as P(W = ∞) = P(N = ∞) = 0, with the most likely outcome, i.e., the mode of W, being equal to 2, with P(W = 2) = P(N = 1) = 1/2. What most people would be willing to pay to play this game once would be a small multiple of that. It is only in the long run (i.e., by playing the game repeatedly many times) that the average winnings of the game are huge. In this case, however, it is a very long run, and any player, regardless of how rich they are, would be broke long before attaining the promised unbounded winnings.
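
A simulation sketch (numpy assumed) of the game: even over a million plays, the average payoff stays modest, growing only roughly like the logarithm of the number of games, nowhere near the infinite expectation (1.103).

```python
import numpy as np

rng = np.random.default_rng(0)
games = 1_000_000

# N = number of tosses up to the first tail ~ Geometric(1/2); payoff W = 2^N
N = rng.geometric(0.5, size=games)
W = 2.0 ** N

print(W.mean())                  # typically a few tens of dollars, not "infinity"
print(np.median(W))              # 2.0, the most likely outcome
```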

1.7 Bibliographical Remarks

The modern theory of probability founded on measure theory is due to A. Kolmogorov [Kolmogorov, 1933]. In his 60-page monograph, Kolmogorov introduced the notion of probability spaces, the axiomatic definition of probability, the modern definition of random variables, and more. For an excellent review of Kolmogorov's fundamental contribution to Probability Theory, see [Nualart, 2004].

There are many excellent books on the theory of probability. We mention but a few below. At the advanced undergraduate level, the books by S. Ross [Ross, 1994, Ross, 1995] offer a thorough treatment of non-measure-theoretical probability. At the graduate level, the books by P. Billingsley [Billingsley, 1995] and K. Chung [Chung, 1974] provide mathematically rigorous expositions of measure-theoretical probability theory. The book by J. Rosenthal [Rosenthal, 2006] is a surprisingly concise and accessible introduction to measure-theoretical probability. Proofs of Hoeffding's and McDiarmid's inequalities can be found in [Devroye et al., 1996].

Exercises

1. The Monty Hall Problem. This problem nicely demonstrates subtle issues regarding partial information and prediction. A certain show host has placed a case with US$1,000,000 behind one of three identical doors. Behind each of the other two doors he placed a donkey. The host asks the contestant to pick one door but not to open it. The host then opens one of the other two doors to reveal a donkey. He then asks the contestant if he wants to stay with his door or switch to the other unopened door. Assume that the host is honest and that if the contestant initially picked the correct door, the host randomly picks one of the two donkey doors to open. Which of the following strategies is rationally justifiable?

(a) The contestant should never switch to the other door.


(b) The contestant should always switch to the other door.
(c) There is not enough information or the choice between (a) and (b) is indifferent.

Argue this by computing the probabilities of success.

2. The random experiment consists of throwing two fair dice. Let us define the events:
D = {the sum of the dice equals 6}
E = {the sum of the dice equals 7}
F = {the first die lands 4}
G = {the second die lands 3}
Show the following, both by arguing and by computing probabilities:

(a) D is not independent of F and D is not independent of G.


(b) E is independent of F and E is independent of G.
(c) E is not independent of (F, G), in fact, E is completely determined by (F, G). (Here is an
example where an event is independent of each of two other events but is not independent
of the joint occurrence of these events.)

3. Suppose that a typist monkey is typing randomly, but that each time he types the "wrong character," it is discarded from the output. Assume also that the monkey types 24-7 at the rate of one character per second, and that each character can be one of 27 symbols (the alphabet without punctuation plus space). Given that Hamlet has about 130,000 characters, what is the average number of days that it would take the typist monkey to compose the famous play?

4. Suppose that 3 balls are selected without replacement from an urn containing 4 white balls,
6 red balls, and 2 black balls. Let Xi = 1 if the i-th ball selected is white, and let Xi = 0
otherwise, for i = 1, 2, 3. Give the joint PMF of

(a) X1 , X2
(b) X1 , X2 , X3

5. Consider 12 independent rolls of a 6-sided die. Let X be the number of 1's and let Y be the number of 2's obtained. Compute E[X], E[Y], Var(X), Var(Y), E[X + Y], Var(X + Y), Cov(X, Y), and ρ(X, Y). (Hint: You may want to compute these in the order given.)

6. Consider the system represented by the block diagram below, in which the input X passes through the functional S, additive noise N is added to give Y = S(X) + N, and Z = T(Y).

The functionals are given by S(X) = aX + b and T(Y) = Y². The additive noise is N ∼ N(0, σ_N²). Assuming that the input signal is X ∼ N(µ_X, σ_X²):

(a) Find the pdf of Y .


(b) Find the pdf of Z.
(c) Compute the probability that the output is bounded by a constant k > 0, i.e., find P(Z ≤ k).

7. (Bivariate Gaussian Distribution) Suppose (X, Y) are jointly Gaussian.

(a) Show that the joint pdf is given by:

p(x, y) = (1/(2π σ_X σ_Y √(1 − ρ²))) × exp{ −(1/(2(1 − ρ²))) [ ((x − µ_X)/σ_X)² + ((y − µ_Y)/σ_Y)² − 2ρ (x − µ_X)(y − µ_Y)/(σ_X σ_Y) ] }

where E[X] = µ_X, Var(X) = σ_X², E[Y] = µ_Y, Var(Y) = σ_Y², and ρ is the correlation coefficient between X and Y.

(b) Show that the conditional pdf of Y, given X = x, is a univariate Gaussian density with parameters:

µ_{Y|X} = µ_Y + ρ (σ_Y/σ_X)(x − µ_X)  and  σ²_{Y|X} = σ_Y² (1 − ρ²) .

(c) Conclude that the conditional expectation E[Y | X] (which can be shown to be the "best" predictor of Y given X) is, in the Gaussian case, a linear function of X. This is the foundation of optimal linear filtering in Signal Processing. Plot the regression line for the case σ_X = σ_Y, µ_X = 0, fixed µ_Y, and a few values of ρ. What do you observe as the correlation ρ changes? What happens in the case ρ = 0?

8. Consider the example of a random sequence X(n) of 0-1 binary r.v.’s given in class:

• Set X(0) = 1
• From the next 2 points, pick one randomly and set to 1, the other to zero.
• From the next 3 points, pick one randomly and set to 1, the rest to zero.
• From the next 4 points, pick one randomly and set to 1, the rest to zero.
• ...

Show that X(n):

(a) converges to 0 in probability,

(b) converges to 0 in the mean-square sense,

(c) does not converge to 0 with probability 1. In fact, show that

P(lim_{n→∞} X(n) = 0) = 0 .

Notice that X(n) is clearly converging slowly in some sense to zero, but not with probability one. This leads one to the realization that convergence with probability one is a very strong requirement; in many practical situations, convergence in probability and in mean-square may be more adequate.
Bibliography

[Billingsley, 1995] Billingsley, P. (1995). Probability and Measure. John Wiley, New York City, New
York, third edition.

[Casella and Berger, 2002] Casella, G. and Berger, R. (2002). Statistical Inference. Duxbury, Pacific
Grove, CA, 2nd edition.

[Chung, 1974] Chung, K. L. (1974). A Course in Probability Theory, Second Edition. Academic
Press, New York City, New York.

[Devroye et al., 1996] Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A Probabilistic Theory of
Pattern Recognition. Springer, New York.

[Kolmogorov, 1933] Kolmogorov, A. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer.

[Nualart, 2004] Nualart, D. (2004). Kolmogorov and probability theory. Arbor, 178(704):607–619.

[Rosenthal, 2006] Rosenthal, J. (2006). A First Look At Rigorous Probability Theory. World Scien-
tific Publishing, Singapore, 2nd edition.

[Ross, 1994] Ross, S. (1994). A first course in probability. Macmillan, New York, 4th edition.

[Ross, 1995] Ross, S. (1995). Stochastic Processes. Wiley, New York, 2nd edition.

