ECEN 649 Pattern Recognition
Probability Review
Ulisses Braga-Neto
ECE Department, Texas A&M University
Sample Spaces and Events
A sample space S is the collection of all outcomes of an experiment.

An event E is a subset E ⊆ S.

Event E is said to occur if it contains the outcome of the experiment.

Example 1: if the experiment consists of flipping two coins, then

S = {(H,H), (H,T), (T,H), (T,T)}

The event E that the first coin lands tails is E = {(T,H), (T,T)}.

Example 2: if the experiment consists of measuring the lifetime of a lightbulb, then

S = {t ∈ R | t ≥ 0}

The event that the lightbulb fails at or before 2 time units is the real interval E = [0, 2].
Special Events
Inclusion: E ⊆ F iff the occurrence of E implies the occurrence of F.
Union: E ∪ F occurs iff E, F, or both E and F occur.
Intersection: E ∩ F occurs iff both E and F occur. If E ∩ F = ∅, then E and F are mutually exclusive.
Complement: E^c occurs iff E does not occur.
If E_1, E_2, ... is an increasing sequence of events, that is, E_1 ⊆ E_2 ⊆ ···, then

lim_{n→∞} E_n = ⋃_{n=1}^{∞} E_n

If E_1, E_2, ... is a decreasing sequence of events, that is, E_1 ⊇ E_2 ⊇ ···, then

lim_{n→∞} E_n = ⋂_{n=1}^{∞} E_n
Special Events – II
If E_1, E_2, ... is any sequence of events, then

[E_n i.o.] = lim sup_{n→∞} E_n = ⋂_{n=1}^{∞} ⋃_{i=n}^{∞} E_i

We can see that lim sup_{n→∞} E_n occurs iff E_n occurs for an infinite number of n, that is, E_n occurs infinitely often.

If E_1, E_2, ... is any sequence of events, then

lim inf_{n→∞} E_n = ⋃_{n=1}^{∞} ⋂_{i=n}^{∞} E_i

We can see that lim inf_{n→∞} E_n occurs iff E_n occurs for all but a finite number of n, that is, E_n eventually occurs for all n.
Modern Definition of Probability
Probability assigns to each event E ⊆ S a number P(E) such that

1. 0 ≤ P(E) ≤ 1
2. P(S) = 1
3. For a sequence of mutually exclusive events E_1, E_2, ...,

P(⋃_{i=1}^{∞} E_i) = Σ_{i=1}^{∞} P(E_i)

Technically speaking, for an uncountably infinite sample space S, probability cannot be assigned to all possible subsets of S in a manner consistent with these axioms; one has to restrict the definition to a σ-algebra of events in S.
Direct Consequences
P(E^c) = 1 − P(E)

If E ⊆ F, then P(E) ≤ P(F)

P(E ∪ F) = P(E) + P(F) − P(E ∩ F)

(Boole's Inequality) For any sequence of events E_1, E_2, ...,

P(⋃_{i=1}^{∞} E_i) ≤ Σ_{i=1}^{∞} P(E_i)

(Continuity of Probability Measure) If E_1, E_2, ... is an increasing or decreasing sequence of events, then

P(lim_{n→∞} E_n) = lim_{n→∞} P(E_n)
1st Borel-Cantelli Lemma
For any sequence of events E_1, E_2, ...,

Σ_{n=1}^{∞} P(E_n) < ∞ ⇒ P([E_n i.o.]) = 0

Proof.

P(⋂_{n=1}^{∞} ⋃_{i=n}^{∞} E_i) = P(lim_{n→∞} ⋃_{i=n}^{∞} E_i) = lim_{n→∞} P(⋃_{i=n}^{∞} E_i) ≤ lim_{n→∞} Σ_{i=n}^{∞} P(E_i) = 0

Here the second equality uses continuity of the probability measure (the events ⋃_{i=n}^{∞} E_i decrease in n), the inequality is Boole's, and the final limit is zero because the tail of a convergent series vanishes.
Example
Consider a sequence of binary random variables X_1, X_2, ... that take on values in {0, 1}, such that

P({X_n = 0}) = 1/2^n,  n = 1, 2, ...

Since

Σ_{n=1}^{∞} P({X_n = 0}) = Σ_{n=1}^{∞} 2^{−n} = 1 < ∞

the 1st Borel-Cantelli lemma implies that P([{X_n = 0} i.o.]) = 0, that is, for n sufficiently large, X_n = 1 with probability one. In other words,

lim_{n→∞} X_n = 1 with probability 1.
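A quick numerical illustration of this example, as an informal NumPy sketch (the truncation at N = 60 terms, the number of runs, and the seed are arbitrary choices, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate X_1..X_N with P(X_n = 0) = 2**-n and locate the last zero
# in each run; by Borel-Cantelli, zeros should stop occurring early.
N, runs = 60, 10_000
p_zero = 2.0 ** -np.arange(1, N + 1)               # P(X_n = 0), n = 1..N
X = (rng.random((runs, N)) >= p_zero).astype(int)  # X_n = 0 w.p. 2**-n, else 1

has_zero = (X == 0).any(axis=1)
last_zero = np.zeros(runs, dtype=int)              # 0 means "no zeros at all"
rows = np.flatnonzero(has_zero)
last_zero[rows] = N - np.argmax((X[rows] == 0)[:, ::-1], axis=1)  # 1-based index

print("fraction of runs with any zero:  ", has_zero.mean())        # ≈ 0.71
print("fraction with a zero past n = 5: ", (last_zero > 5).mean()) # ≈ 0.03
```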
2nd Borel-Cantelli Lemma
For an independent sequence of events E_1, E_2, ...,

Σ_{n=1}^{∞} P(E_n) = ∞ ⇒ P([E_n i.o.]) = 1

Proof. Exercise.

This is the converse to the 1st lemma; here the stronger assumption of independence is needed.

An immediate consequence of the 2nd Borel-Cantelli lemma is that any event with positive probability, no matter how small, will occur infinitely often in independent repeated trials.
Example: "The infinite typist monkey"
Consider a monkey that sits at a typewriter banging away randomly for an infinite amount of time. It will produce Shakespeare's complete works, and in fact the entire Library of Congress, not just once, but an infinite number of times. Proof. Let L be the length in characters of the desired work. Let E_n be the event that the nth sequence of L characters produced by the monkey matches, character by character, the desired work (we are making it even harder for the monkey, as we are ruling out overlapping frames). Clearly P(E_n) = 27^{−L} > 0. It is a very small number, but still positive. Now, since our monkey never gets disappointed nor tired, the events E_n are independent. It follows by the 2nd Borel-Cantelli lemma that E_n will occur, and infinitely often. Q.E.D. (NOTE: In practice, if all the atoms in the universe were typist monkeys banging away billions of characters a second since the big bang, the probability of getting Shakespeare within the age of the universe would still be vanishingly small.)
Tail Events and Kolmogorov's 0-1 Law

Given a sequence of events E_1, E_2, ..., a tail event is an event whose occurrence depends on the whole sequence, but is probabilistically independent of any finite subsequence.

Examples of tail events: lim_{n→∞} E_n (if {E_n} is monotone), lim sup_{n→∞} E_n, lim inf_{n→∞} E_n.

Kolmogorov's zero-one law: given a sequence of independent events E_1, E_2, ..., all its tail events have either probability 0 or probability 1.

That is, tail events are either almost-surely impossible or occur almost surely.

In practice, it may be difficult to conclude one way or the other. The Borel-Cantelli lemmas together give a sufficient condition to decide on the 0-1 probability of the tail event lim sup_{n→∞} E_n, with {E_n} an independent sequence.
Conditional Probability
One of the most important concepts in Pattern Recognition, in Engineering, and in Probability Theory in general.

Given that an event F has occurred, for E to occur, E ∩ F has to occur. In addition, the sample space gets restricted to those outcomes in F, so a normalization factor P(F) has to be introduced. Therefore,

P(E | F) = P(E ∩ F) / P(F)

By the same token, one can condition on any number of events:

P(E | F_1, F_2, ..., F_n) = P(E ∩ F_1 ∩ F_2 ∩ ··· ∩ F_n) / P(F_1 ∩ F_2 ∩ ··· ∩ F_n)

Note: For simplicity, it is usual to write P(E ∩ F) = P(E, F) to indicate the joint probability of E and F.
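As a sanity check, here is a minimal Python sketch that computes a conditional probability by direct enumeration on the two-coin sample space of Example 1, assuming fair, independent coins (the event choices are mine):

```python
from itertools import product
from fractions import Fraction

# Enumerate the two-coin sample space and assign equal probabilities.
S = list(product("HT", repeat=2))           # (H,H), (H,T), (T,H), (T,T)
P = {s: Fraction(1, 4) for s in S}          # equally likely outcomes

F = {s for s in S if s[0] == "T"}           # first coin lands tails
E = {s for s in S if s[1] == "T"}           # second coin lands tails

P_F  = sum(P[s] for s in F)
P_EF = sum(P[s] for s in E & F)             # P(E ∩ F)
print("P(E|F) =", P_EF / P_F)               # 1/2, as independence predicts
```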
Very Useful Formulas
P(E, F) = P(E | F) P(F)

(Chain rule)

P(E_1, E_2, ..., E_n) = P(E_n | E_1, ..., E_{n−1}) P(E_{n−1} | E_1, ..., E_{n−2}) ··· P(E_2 | E_1) P(E_1)

(Total probability)

P(E) = P(E, F) + P(E, F^c) = P(E | F) P(F) + P(E | F^c)(1 − P(F))

P(E) = Σ_{i=1}^{n} P(E, F_i) = Σ_{i=1}^{n} P(E | F_i) P(F_i), whenever the F_i are mutually exclusive and ⋃_i F_i ⊇ E

(Bayes' Theorem):

P(E | F) = P(F | E) P(E) / P(F) = P(F | E) P(E) / [P(F | E) P(E) + P(F | E^c)(1 − P(E))]
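A small numeric sketch of Bayes' theorem, with made-up numbers (a hypothetical diagnostic-test scenario, not from the slides):

```python
# F = "test is positive", E = "condition is present".
p_E          = 0.01   # prior P(E)
p_F_given_E  = 0.95   # sensitivity P(F|E)
p_F_given_Ec = 0.05   # false-positive rate P(F|E^c)

p_F = p_F_given_E * p_E + p_F_given_Ec * (1 - p_E)  # total probability
p_E_given_F = p_F_given_E * p_E / p_F               # Bayes' theorem
print(f"P(E|F) = {p_E_given_F:.3f}")                # ≈ 0.161
```

Note how the small prior keeps the posterior far below the test's sensitivity.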
Independent Events
Events E and F are independent if the occurrence of one does not carry information as to the occurrence of the other. That is:

P(E | F) = P(E) and P(F | E) = P(F).

It is easy to see that this is equivalent to the condition

P(E, F) = P(E) P(F)

If E and F are independent, so are the pairs (E, F^c), (E^c, F), and (E^c, F^c).

Caution 1: E being independent of F and G does not imply that E is independent of F ∩ G.

Caution 2: Three events E, F, G are independent if P(E, F, G) = P(E) P(F) P(G) and, in addition, each pair of events is independent; neither condition alone suffices.
Random Variables
A random variable X is a (measurable) function X: S → R, that is, it assigns to each outcome of the experiment a real number.

Given a set of real numbers A, we define an event

{X ∈ A} = X^{−1}(A) ⊆ S.

It can be shown that all probability questions about a r.v. X can be phrased in terms of the probabilities of a simple set of events:

{X ∈ (−∞, a]}, a ∈ R.

These events can be written more simply as {X ≤ a}, for a ∈ R.
Probability Distribution Functions

The probability of the special events {X ≤ a} gets a special name.

The probability distribution function (PDF) of a r.v. X is the function F_X: R → [0, 1] given by

F_X(a) = P({X ≤ a}), a ∈ R.

Properties of a PDF:

1. F_X is nondecreasing: a ≤ b ⇒ F_X(a) ≤ F_X(b).
2. lim_{a→−∞} F_X(a) = 0 and lim_{a→+∞} F_X(a) = 1.
3. F_X is right-continuous: lim_{b→a+} F_X(b) = F_X(a).

Everything we know about events applies here, since a PDF is nothing more than the probability of certain events.
Joint and Conditional PDFs

These are crucial elements in PR and Engineering. Once again, these concepts involve only the probabilities of certain special events.

The joint PDF of two r.v.'s X and Y is the joint probability of the events {X ≤ a} and {Y ≤ b}, for a, b ∈ R. Formally, we define a function F_XY: R × R → [0, 1] given by

F_XY(a, b) = P({X ≤ a}, {Y ≤ b}) = P({X ≤ a, Y ≤ b}), a, b ∈ R

This is the probability of the "lower-left quadrant" with corner at (a, b).

Note that F_XY(a, ∞) = F_X(a) and F_XY(∞, b) = F_Y(b). These are called the marginal PDFs.
Joint and Conditional PDFs – II
The conditional PDF of X given Y is just the conditional probability of the corresponding special events. Formally, it is a function F_{X|Y}: R × R → [0, 1] given by

F_{X|Y}(a, b) = P({X ≤ a} | {Y ≤ b}) = P({X ≤ a, Y ≤ b}) / P({Y ≤ b}) = F_XY(a, b) / F_Y(b), a, b ∈ R

The r.v.'s X and Y are independent r.v.'s if F_{X|Y}(a, b) = F_X(a) and F_{Y|X}(b, a) = F_Y(b), that is, if F_XY(a, b) = F_X(a) F_Y(b), for a, b ∈ R.

Bayes' Theorem for r.v.'s:

F_{X|Y}(a, b) = F_{Y|X}(b, a) F_X(a) / F_Y(b), a, b ∈ R.
Probability Density Functions
The notion of a probability density function (pdf) is fundamental in probability theory. However, it is a secondary notion to that of a PDF. In fact, all r.v.'s must have a PDF, but not all r.v.'s have a pdf.

If F_X is everywhere continuous and differentiable, then X is said to be a continuous r.v. and the pdf f_X of X is given by:

f_X(a) = dF_X(a)/da, a ∈ R.

Probability statements about X can then be made in terms of integration of f_X. For example,

F_X(a) = ∫_{−∞}^{a} f_X(x) dx, a ∈ R

P({a ≤ X ≤ b}) = ∫_{a}^{b} f_X(x) dx, a, b ∈ R
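A quick numerical check of the pdf/PDF relationship, as an informal sketch using an Exponential density (λ and a are arbitrary choices):

```python
import numpy as np

# Integrate the Exponential(lam) pdf numerically and compare with the
# closed-form PDF F_X(a) = 1 - exp(-lam * a).
lam, a = 1.5, 2.0
x = np.linspace(0.0, a, 100_001)
pdf = lam * np.exp(-lam * x)

F_numeric = np.trapezoid(pdf, x)     # use np.trapz on NumPy < 2.0
F_closed  = 1.0 - np.exp(-lam * a)
print(F_numeric, F_closed)           # both ≈ 0.9502
```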
Useful Continuous R.V.’s
Uniform(a, b):

f_X(x) = 1/(b − a), a < x < b.

The univariate Gaussian(µ, σ > 0):

f_X(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)), x ∈ R.

Exponential(λ > 0):

f_X(x) = λ e^{−λx}, x ≥ 0.

Gamma(λ > 0, t > 0):

f_X(x) = λ e^{−λx} (λx)^{t−1} / Γ(t), x ≥ 0.

Beta(a, b):

f_X(x) = (1/B(a, b)) x^{a−1} (1 − x)^{b−1}, 0 < x < 1.
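The families above can all be sampled with NumPy; the following sketch compares sample means with the known theoretical means (all parameter values are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw from each family and check the sample mean against theory.
n = 200_000
draws = {
    "Uniform(0,2), mean 1":         rng.uniform(0.0, 2.0, n),
    "Gaussian(1, 0.5), mean 1":     rng.normal(1.0, 0.5, n),
    "Exponential(lam=2), mean 0.5": rng.exponential(1 / 2.0, n),  # NumPy's scale = 1/lam
    "Gamma(lam=2, t=3), mean 1.5":  rng.gamma(3.0, 1 / 2.0, n),   # shape t, scale 1/lam
    "Beta(2,5), mean 2/7":          rng.beta(2.0, 5.0, n),
}
for name, x in draws.items():
    print(f"{name}: sample mean = {x.mean():.3f}")
```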
Probability Mass Function

If F_X is not everywhere differentiable, then X does not have a pdf.

A useful particular case is that where X assumes countably many values and F_X is a staircase function. In this case, X is said to be a discrete r.v.

One defines the probability mass function (PMF) of a discrete r.v. X to be:

p_X(a) = P({X = a}) = F_X(a) − F_X(a^−), a ∈ R.

(Note that for any continuous r.v. X, P({X = a}) = 0, so there is no PMF to speak of.)
Useful Discrete R.V.’s
Bernoulli(p):

p_X(0) = P({X = 0}) = 1 − p
p_X(1) = P({X = 1}) = p

Binomial(n, p):

p_X(i) = P({X = i}) = (n choose i) p^i (1 − p)^{n−i}, i = 0, 1, ..., n

Poisson(λ):

p_X(i) = P({X = i}) = e^{−λ} λ^i / i!, i = 0, 1, ...
Expectation
Expectation is a fundamental concept in probability theory, which has to do with the intuitive concept of "averages."

The mean value of a r.v. X is an average of its values weighted by their probabilities. If X is a continuous r.v., this corresponds to:

E[X] = ∫_{−∞}^{∞} x f_X(x) dx

If X is discrete, then this can be written as:

E[X] = Σ_{i: p_X(x_i) > 0} x_i p_X(x_i)
Expectation of Functions of R.V.’s
Given a r.v. X and a (measurable) function g: R → R, g(X) is also a r.v. If there is a pdf, it can be shown that

E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx

If X is discrete, then this becomes:

E[g(X)] = Σ_{i: p_X(x_i) > 0} g(x_i) p_X(x_i)

Immediate corollary: E[aX + c] = a E[X] + c.

(Joint r.v.'s) If g: R × R → R is measurable, then (continuous case):

E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f_XY(x, y) dx dy
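A one-line Monte Carlo illustration of E[g(X)] (my own toy choice of g and distribution): for g(x) = x² and X ~ N(0, 1), the exact value is E[X²] = Var(X) + (E[X])² = 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo estimate of E[g(X)] for g(x) = x**2, X ~ N(0, 1).
x = rng.normal(0.0, 1.0, 1_000_000)
print(np.mean(x**2))   # ≈ 1.0
```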
Other Properties of Expectation
Linearity: E[a_1 X_1 + ··· + a_n X_n] = a_1 E[X_1] + ··· + a_n E[X_n].

If X and Y are uncorrelated, then E[XY] = E[X] E[Y] (independence always implies uncorrelatedness, but the converse is true only in special cases; e.g., jointly Gaussian or multinomial random variables).

If X ≥ Y, then E[X] ≥ E[Y].

If X is nonnegative,

E[X] = ∫_{0}^{∞} P({X > x}) dx

Markov's Inequality: If X is nonnegative, then for a > 0,

P({X ≥ a}) ≤ E[X] / a

Cauchy-Schwarz Inequality: E[XY] ≤ √(E[X²] E[Y²])
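An empirical check of Markov's inequality, as a sketch with an arbitrary nonnegative distribution and threshold:

```python
import numpy as np

rng = np.random.default_rng(3)

# X ~ Exponential(1) is nonnegative with E[X] = 1; threshold a = 3.
x = rng.exponential(1.0, 1_000_000)
a = 3.0
print("P(X >= a) ≈", np.mean(x >= a))  # exact value: e**-3 ≈ 0.050
print("E[X]/a    ≈", x.mean() / a)     # Markov bound ≈ 0.333
```

The bound holds but is loose, which is typical of Markov's inequality.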
Conditional Expectation
If X and Y are jointly continuous r.v.'s and f_Y(y) > 0, we define:

E[X | Y = y] = ∫_{−∞}^{∞} x f_{X|Y}(x, y) dx = ∫_{−∞}^{∞} x f_XY(x, y)/f_Y(y) dx

If X, Y are jointly discrete and p_Y(y_j) > 0, this can be written as:

E[X | Y = y_j] = Σ_{i: p_X(x_i) > 0} x_i p_{X|Y}(x_i, y_j) = Σ_{i: p_X(x_i) > 0} x_i p_XY(x_i, y_j)/p_Y(y_j)

Conditional expectations have all the properties of usual expectations. For example,

E[Σ_{i=1}^{n} X_i | Y = y] = Σ_{i=1}^{n} E[X_i | Y = y]
E[X|Y] is a Random Variable
Given a r.v. X, the mean E[X] is a deterministic parameter.

Now, E[X|Y] is not random w.r.t. X, but it is a function of the r.v. Y, so it is also a r.v. One can show that its mean is precisely E[X]:

E[E[X|Y]] = E[X]

What this says in the continuous case is:

E[X] = ∫_{−∞}^{∞} E[X | Y = y] f_Y(y) dy

and, in the discrete case,

E[X] = Σ_{i: p_Y(y_i) > 0} E[X | Y = y_i] P({Y = y_i})

Computing E[X | Y = y] first is often easier than finding E[X] directly.
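A Monte Carlo sketch of E[E[X|Y]] = E[X] in a toy model of my own choosing: Y ~ Uniform(0,1) and, given Y = y, X ~ N(y, 1), so E[X|Y] = Y and E[X] = E[Y] = 1/2.

```python
import numpy as np

rng = np.random.default_rng(4)

# Y ~ Uniform(0,1); given Y = y, X ~ N(y, 1), hence E[X|Y] = Y.
n = 1_000_000
y = rng.uniform(0.0, 1.0, n)
x = rng.normal(y, 1.0)          # one draw of X per draw of Y

print("E[X]      ≈", x.mean())  # ≈ 0.5
print("E[E[X|Y]] ≈", y.mean())  # ≈ 0.5 (here E[X|Y] = Y exactly)
```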
Conditional Expectation and Prediction
One is interested in predicting the value of an unknown r.v. Y. We also want the predictor Ŷ to be optimal according to some criterion.

The criterion most widely used is the Mean Square Error:

MSE(Ŷ) = E[(Y − Ŷ)²]

It can be shown that the MMSE (minimum MSE) constant (no-information) estimator of Y is just the mean E[Y].

Given now some partial information through an observed r.v. X = x, it can be shown that the overall MMSE estimator is the conditional mean E[Y | X = x]. The function η(x) = E[Y | X = x] is called the regression of Y on X.

Is this Pattern Recognition yet? Why?
Variance
The mean E[X] is a good "guess" of the value of a r.v. X, but by itself it can be misleading. The variance Var(X) says how the values of X are spread around the mean E[X]:

Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²

The MSE of the best constant estimator Ŷ = E[Y] of Y is just the variance of Y!

Property: Var(aX + c) = a² Var(X)

Chebyshev's Inequality: For any ε > 0,

P({|X − E[X]| ≥ ε}) ≤ Var(X) / ε²
Conditional Variance
If X and Y are jointly-distributed r.v.'s, we define:

Var(X | Y) = E[(X − E[X|Y])² | Y] = E[X² | Y] − (E[X|Y])²

Conditional Variance Formula:

Var(X) = E[Var(X|Y)] + Var(E[X|Y])

This breaks down the total variance into a "within-rows" component and an "across-rows" component.
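The conditional variance formula can be checked numerically in the same toy model used above (Y ~ Uniform(0,1), X|Y=y ~ N(y, 1)): there Var(X|Y) = 1 and E[X|Y] = Y, so Var(X) = 1 + Var(Y) = 1 + 1/12.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy model: Y ~ Uniform(0,1), X|Y=y ~ N(y, 1).
n = 1_000_000
y = rng.uniform(0.0, 1.0, n)
x = rng.normal(y, 1.0)

print("Var(X) directly           ≈", x.var())      # ≈ 1.083
print("E[Var(X|Y)] + Var(E[X|Y]) ≈", 1 + y.var())  # ≈ 1.083
```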
Covariance
Is Var(X_1 + X_2) = Var(X_1) + Var(X_2)?

The covariance of two r.v.'s X and Y is given by:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]

Two variables X and Y are uncorrelated if and only if Cov(X, Y) = 0. Jointly Gaussian X and Y are independent if and only if they are uncorrelated (in general, independence implies uncorrelatedness but not vice-versa).

It can be shown that

Var(X_1 + X_2) = Var(X_1) + Var(X_2) + 2 Cov(X_1, X_2)

So the variance is distributive over sums if all variables are pairwise uncorrelated.
Correlation Coefficient
The correlation coefficient ρ between two r.v.'s X and Y is given by:

ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))

Properties:

1. −1 ≤ ρ(X, Y) ≤ 1
2. X and Y are uncorrelated iff ρ(X, Y) = 0.
3. (Perfect linear correlation): ρ(X, Y) = ±1 ⇔ Y = a ± bX, where b = σ_Y / σ_X.
Vector Random Variables

A vector r.v. or random vector X = (X_1, ..., X_d) takes values in R^d.

Its distribution is the joint distribution of the component r.v.'s.

The mean of X is the vector µ = (µ_1, ..., µ_d), where µ_i = E[X_i].

The covariance matrix Σ is a d × d matrix given by:

Σ = E[(X − µ)(X − µ)^T]

where Σ_ii = Var(X_i) and Σ_ij = Cov(X_i, X_j).
Properties of the Covariance Matrix
Matrix Σ is real symmetric and thus diagonalizable:

Σ = U D U^T

where U is the matrix of eigenvectors and D is the diagonal matrix of eigenvalues.

All eigenvalues are nonnegative (Σ is positive semi-definite). In fact, except for "degenerate" cases, all eigenvalues are positive, and so Σ is invertible (Σ is said to be positive definite in this case).

(Whitening or Mahalanobis transformation) It is easy to check that the random vector

Y = Σ^{−1/2}(X − µ) = D^{−1/2} U^T (X − µ)

has zero mean and covariance matrix I_d (so that all components of Y are zero-mean, unit-variance, and uncorrelated).
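A NumPy sketch of the whitening transformation on samples from an arbitrary 2-d Gaussian (the µ and Σ values are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)

# Whiten samples using the eigendecomposition Sigma = U D U^T.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=100_000)

eigvals, U = np.linalg.eigh(Sigma)           # Sigma = U diag(eigvals) U^T
Y = (X - mu) @ U @ np.diag(eigvals ** -0.5)  # row-wise D^{-1/2} U^T (x - mu)

print(np.round(Y.mean(axis=0), 2))           # ≈ [0, 0]
print(np.round(np.cov(Y.T), 2))              # ≈ identity I_2
```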
Sample Estimates
Given n independent and identically-distributed (i.i.d.) sample observations X_1, ..., X_n of the random vector X, the sample mean estimator is given by:

µ̂ = (1/n) Σ_{i=1}^{n} X_i

It can be shown that this estimator is unbiased (that is, E[µ̂] = µ) and consistent (that is, µ̂ converges in probability to µ as n → ∞).

The sample covariance estimator is given by:

S = (1/(n−1)) Σ_{i=1}^{n} (X_i − µ̂)(X_i − µ̂)^T

This is an unbiased and consistent estimator of Σ.
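Both estimators are one-liners in NumPy; the following sketch reuses the illustrative µ and Σ from the whitening example:

```python
import numpy as np

rng = np.random.default_rng(7)

# i.i.d. draws from an example 2-d Gaussian.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=5_000)

n = len(X)
mu_hat = X.mean(axis=0)                      # (1/n) sum_i X_i
S = (X - mu_hat).T @ (X - mu_hat) / (n - 1)  # unbiased; matches np.cov(X.T)

print(np.round(mu_hat, 2))   # ≈ mu
print(np.round(S, 2))        # ≈ Sigma
```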
The Multivariate Gaussian
The multivariate Gaussian distribution with mean µ and covariance matrix Σ (assuming Σ invertible, so that also det(Σ) > 0) corresponds to the multivariate pdf

f_X(x) = (1/√((2π)^p det(Σ))) exp(−(1/2)(x − µ)^T Σ^{−1} (x − µ))

We write X ∼ N_p(µ, Σ).

The multivariate Gaussian has elliptical contours of the form

(x − µ)^T Σ^{−1} (x − µ) = c², c > 0

The axes of the ellipsoids are given by the eigenvectors of Σ, and the lengths of the axes are proportional to the square roots of its eigenvalues.
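The density formula can be evaluated directly and checked against SciPy (parameters and the evaluation point are arbitrary choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Evaluate the N_2(mu, Sigma) density both ways at one point x.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([0.5, -1.0])

p = len(mu)
d = x - mu
quad = d @ np.linalg.inv(Sigma) @ d
pdf_formula = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))
pdf_scipy = multivariate_normal(mu, Sigma).pdf(x)
print(pdf_formula, pdf_scipy)   # the two values agree
```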
Properties of The Multivariate Gaussian

The density of each component X_i is univariate Gaussian N(µ_i, Σ_ii).

The components of X ∼ N(µ, Σ) are independent if and only if they are uncorrelated, i.e., Σ is a diagonal matrix.

The whitening transformation Y = Σ^{−1/2}(X − µ) produces another multivariate Gaussian Y ∼ N(0, I_p).

In general, if A is a nonsingular p × p matrix and c is a p-vector, then Y = AX + c ∼ N_p(Aµ + c, AΣA^T).

The r.v.'s AX and BX are independent if and only if AΣB^T = 0.

If Y and X are jointly Gaussian, then the distribution of Y given X is again Gaussian.

The regression E[Y|X] is a linear function of X.
Convergence of Random Sequences
A random sequence {X[1], X[2], ...} is a countable set of r.v.'s.

"Sure" convergence: We say that X[n] → X surely if for all outcomes ξ ∈ S of the experiment we have lim_{n→∞} X[n, ξ] = X(ξ).

Almost-sure convergence or convergence with probability one: The sequence fails to converge only for an event of probability zero, i.e.:

P({ξ ∈ S | lim_{n→∞} X[n, ξ] = X(ξ)}) = 1

Mean-square (m.s.) convergence: This uses convergence of an energy norm to zero.

lim_{n→∞} E[|X[n] − X|²] = 0
Convergence of Random Sequences – II
Convergence in probability: This uses convergence of the "probability of error" to zero.

lim_{n→∞} P(|X[n] − X| > ε) = 0, for all ε > 0.

Convergence in distribution: This is just convergence of the corresponding PDFs.

lim_{n→∞} F_{X_n}(a) = F_X(a)

at all points a ∈ R where F_X is continuous.

Relationship between modes of convergence:

Sure ⇒ Almost-sure ⇒ Probability ⇒ Distribution
Mean-square ⇒ Probability

(Almost-sure and mean-square convergence do not imply each other in general.)
Uniformly Bounded Random Sequences
A random sequence {X[1], X[2], ...} is said to be uniformly bounded if there exists a finite K > 0, which does not depend on n, such that

|X[n]| < K with prob. 1, for all n = 1, 2, ...

Theorem. If a random sequence {X[1], X[2], ...} is uniformly bounded, then

X[n] → X in m.s. ⇔ X[n] → X in probability.

That is, convergence in m.s. and in probability are the same.

The relationship between modes of convergence becomes:

Sure ⇒ Almost-sure ⇒ (Mean-square ⇔ Probability) ⇒ Distribution
Example
Let X(n) consist of 0-1 binary random variables, constructed in blocks:

1. Set X(0) = 1.
2. From the next 2 points, pick one randomly and set it to 1, the other to zero.
3. From the next 3 points, pick one randomly and set it to 1, the rest to zero.
4. From the next 4 points, pick one randomly and set it to 1, the rest to zero.
5. ...

Then X(n) is clearly converging slowly in some sense to zero, but not with probability one! In fact, P([{X_n = 1} i.o.]) = 1. However, one can show that X(n) → 0 in mean-square and also in probability. (Exercise.)
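An informal simulation sketch of this construction (block count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)

# Build one realization in blocks of lengths 1, 2, 3, ...; within
# block k exactly one randomly chosen position is set to 1.
def realization(num_blocks):
    x = []
    for k in range(1, num_blocks + 1):
        block = np.zeros(k, dtype=int)
        block[rng.integers(k)] = 1
        x.extend(block)
    return np.array(x)

x = realization(200)   # 1 + 2 + ... + 200 = 20100 points
# Inside block k, P(X[n] = 1) = 1/k -> 0, giving m.s. convergence,
# yet every block contains a 1, so X[n] = 1 occurs infinitely often.
print("last index with X[n] = 1:", int(np.flatnonzero(x)[-1]))
print("mean of the final block: ", x[-200:].mean())   # = 1/200
```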
Limiting Theorems
(Weak Law of Large Numbers) Given an i.i.d. random sequence {X_1, X_2, ...} with common finite mean µ,

(1/n) Σ_{i=1}^{n} X_i → µ, in probability

(Strong Law of Large Numbers) The same convergence holds with probability 1.

(Central Limit Theorem) Given an i.i.d. random sequence {X_1, X_2, ...} with common mean µ and variance σ²,

(1/(σ√n)) (Σ_{i=1}^{n} X_i − nµ) → N(0, 1), in distribution
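A Monte Carlo illustration of the CLT, as a sketch with Exponential(1) summands (so µ = σ = 1; the sample sizes are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)

# Z = (sum_i X_i - n*mu) / (sigma*sqrt(n)) should be close to N(0,1).
n, runs = 2_000, 100_000
X = rng.exponential(1.0, (runs, n))
Z = (X.sum(axis=1) - n) / np.sqrt(n)

print("P(Z > 1.96) empirical  ≈", (Z > 1.96).mean())
print("P(Z > 1.96) for N(0,1) =", 1 - norm.cdf(1.96))  # ≈ 0.025
```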