ECEN 649 Pattern Recognition

Probability Review

Ulisses Braga-Neto
ECE Department, Texas A&M University

Sample Spaces and Events

A sample space S is the collection of all outcomes of an experiment.

An event E is a subset E ⊆ S.

Event E is said to occur if it contains the outcome of the experiment.

Example 1: if the experiment consists of flipping two coins, then

S = {(H, H), (H, T), (T, H), (T, T)}

The event E that the first coin lands tails is E = {(T, H), (T, T)}.

Example 2: if the experiment consists of measuring the lifetime of a lightbulb, then

S = {t ∈ R | t ≥ 0}

The event that the lightbulb fails at or earlier than 2 time units is the real interval E = [0, 2].

Special Events

Inclusion: E ⊆ F iff the occurrence of E implies the occurrence of F.

Union: E ∪ F occurs iff E, F, or both E and F occur.

Intersection: E ∩ F occurs iff both E and F occur. If E ∩ F = ∅, then E and F are mutually exclusive.

Complement: E^c occurs iff E does not occur.

If E_1, E_2, ... is an increasing sequence of events, that is, E_1 ⊆ E_2 ⊆ ..., then

lim_{n→∞} E_n = ∪_{n=1}^∞ E_n

If E_1, E_2, ... is a decreasing sequence of events, that is, E_1 ⊇ E_2 ⊇ ..., then

lim_{n→∞} E_n = ∩_{n=1}^∞ E_n

Special Events - II

If E_1, E_2, ... is any sequence of events, then

[E_n i.o.] = lim sup_{n→∞} E_n = ∩_{n=1}^∞ ∪_{i=n}^∞ E_i

We can see that lim sup_n E_n occurs iff E_n occurs for an infinite number of n, that is, E_n occurs infinitely often.

If E_1, E_2, ... is any sequence of events, then

lim inf_{n→∞} E_n = ∪_{n=1}^∞ ∩_{i=n}^∞ E_i

We can see that lim inf_n E_n occurs iff E_n occurs for all but a finite number of n, that is, E_n eventually occurs for all n.

Modern Definition of Probability

Probability assigns to each event E ⊆ S a number P(E) such that

1. 0 ≤ P(E) ≤ 1

2. P(S) = 1

3. For a sequence of mutually exclusive events E_1, E_2, ...

   P(∪_{i=1}^∞ E_i) = ∑_{i=1}^∞ P(E_i)

Technically speaking, for a non-countably infinite sample space S, probability cannot be assigned to all possible subsets of S in a manner consistent with these axioms; one has to restrict the definition to a σ-algebra of events in S.

Direct Consequences

P(E^c) = 1 − P(E)

If E ⊆ F then P(E) ≤ P(F)

P(E ∪ F) = P(E) + P(F) − P(E ∩ F)

(Boole's Inequality) For any sequence of events E_1, E_2, ...

P(∪_{i=1}^∞ E_i) ≤ ∑_{i=1}^∞ P(E_i)

(Continuity of Probability Measure) If E_1, E_2, ... is an increasing or decreasing sequence of events, then

P(lim_{n→∞} E_n) = lim_{n→∞} P(E_n)

1st Borel-Cantelli Lemma

For any sequence of events E_1, E_2, ...

∑_{n=1}^∞ P(E_n) < ∞  ⇒  P([E_n i.o.]) = 0

Proof. Since the events ∪_{i=n}^∞ E_i decrease in n,

P(∩_{n=1}^∞ ∪_{i=n}^∞ E_i) = P(lim_{n→∞} ∪_{i=n}^∞ E_i)
                           = lim_{n→∞} P(∪_{i=n}^∞ E_i)        (continuity of probability)
                           ≤ lim_{n→∞} ∑_{i=n}^∞ P(E_i) = 0     (Boole's inequality; the tail of a convergent series vanishes)

Example

Consider a sequence of binary random variables X_1, X_2, ... that take on values in {0, 1}, such that

P({X_n = 0}) = 2^{−n},   n = 1, 2, ...

Since

∑_{n=1}^∞ P({X_n = 0}) = ∑_{n=1}^∞ 2^{−n} = 1 < ∞

the 1st Borel-Cantelli lemma implies that P([{X_n = 0} i.o.]) = 0, that is, for n sufficiently large, X_n = 1 with probability one. In other words,

lim_{n→∞} X_n = 1 with probability 1.
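As a quick sanity check of this example, here is a short simulation sketch (the random seed, sample size, and helper function below are arbitrary choices for illustration, not part of the course material). It draws X_1, ..., X_N with P({X_n = 0}) = 2^{−n} and reports the index of the last zero observed; across runs this index is almost always very small, illustrating that the sequence is eventually all ones with probability one.

import numpy as np

rng = np.random.default_rng(0)

def last_zero_index(N=60):
    """Simulate X_1,...,X_N with P(X_n = 0) = 2**(-n); return the largest n with X_n = 0 (0 if none)."""
    u = rng.random(N)
    x = (u >= 2.0 ** -np.arange(1, N + 1)).astype(int)  # X_n = 0 with probability 2^-n
    zeros = np.nonzero(x == 0)[0]
    return int(zeros[-1] + 1) if zeros.size else 0

# Across many runs, the last zero occurs at a small index: the sequence is eventually all ones.
print([last_zero_index() for _ in range(10)])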

2nd Borel-Cantelli Lemma

For an independent sequence of events E_1, E_2, ...

∑_{n=1}^∞ P(E_n) = ∞  ⇒  P([E_n i.o.]) = 1

Proof. Exercise.

This is the converse to the 1st lemma; here the stronger assumption of independence is needed.

An immediate consequence of the 2nd Borel-Cantelli lemma is that any event with positive probability, no matter how small, will occur infinitely often in independent repeated trials.

Example: “The infinite typist monkey"

Consider a monkey that sits at a typewriter banging away randomly for an infinite amount of time. It will produce Shakespeare's complete works, and in fact the entire Library of Congress, not just once, but an infinite number of times.

Proof. Let L be the length in characters of the desired work. Let E_n be the event that the n-th sequence of L characters produced by the monkey matches, character by character, the desired work (we are making it even harder for the monkey, as we are ruling out overlapping frames). Clearly P(E_n) = 27^{−L} > 0: a very small number, but still positive. Since our monkey never gets disappointed nor tired, the events E_n are independent. It follows by the 2nd Borel-Cantelli lemma that E_n will occur, and infinitely often. Q.E.D.

(NOTE: In practice, if all the atoms in the universe were typist monkeys banging away billions of characters a second since the big bang, the probability of getting Shakespeare within the age of the universe would still be vanishingly small.)

Tail Events and Kolmogorov's 0-1 Law

Given a sequence of events E_1, E_2, ..., a tail event is an event whose occurrence depends on the whole sequence, but is probabilistically independent of any finite subsequence.

Examples of tail events: lim_{n→∞} E_n (if {E_n} is monotone), lim sup_n E_n, lim inf_n E_n.

Kolmogorov's zero-one law: given a sequence of independent events E_1, E_2, ..., all its tail events have either probability 0 or probability 1. That is, tail events are either almost-surely impossible or occur almost surely.

In practice, it may be difficult to conclude one way or the other. The Borel-Cantelli lemmas together give a sufficient condition to decide on the 0-1 probability of the tail event lim sup_{n→∞} E_n, with {E_n} an independent sequence.

Conditional Probability

One of the most important concepts in Pattern Recognition, Engineering, and in Probability Theory in general.Conditional Probability Given that an event F has occurred, for E to occur, E ∩ F

Engineering, and in Probability Theory in general. Given that an event F has occurred, for E
Engineering, and in Probability Theory in general. Given that an event F has occurred, for E

Given that an event F has occurred, for E to occur, E F has to occur. In addition, the sample space gets restricted to those outcomes in F , so a normalization factor P ( F ) has to be introduced. Therefore,

P ( E | F ) = P ( P E ( F ) F )

By the same token, one can condition on any number of events:

P ( E | F 1 , F 2 ,

, F n )

= P ( E F 1 F 2 P ( F 1 F 2

F n ) F n )

Note: For simplicity, it is usual to write P ( E F ) = P ( E, F ) to indicate the joint probability of E and F . Note: For simplicity, it is usual to write P ( E ∩ F ) = P

∩ F ) = P ( E, F ) to indicate the joint probability of E

ECEN 649 Pattern Recognition – p.12/42

Very Useful Formulas

P(E, F) = P(E | F) P(F)

P(E_1, E_2, ..., E_n) = P(E_n | E_1, ..., E_{n−1}) P(E_{n−1} | E_1, ..., E_{n−2}) ··· P(E_2 | E_1) P(E_1)

P(E) = P(E, F) + P(E, F^c) = P(E | F) P(F) + P(E | F^c)(1 − P(F))

P(E) = ∑_{i=1}^n P(E, F_i) = ∑_{i=1}^n P(E | F_i) P(F_i), whenever the F_i are mutually exclusive and E ⊆ ∪_{i=1}^n F_i

(Bayes Theorem):

P(E | F) = P(F | E) P(E) / P(F) = P(F | E) P(E) / [P(F | E) P(E) + P(F | E^c)(1 − P(E))]
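As a numerical illustration of the last two formulas, consider a hypothetical diagnostic-testing scenario (the numbers below are invented purely for illustration): E = "condition present", F = "test positive".

# Hypothetical numbers: E = "condition present", F = "test positive".
p_E = 0.01          # prior P(E)
p_F_given_E = 0.95  # sensitivity P(F | E)
p_F_given_Ec = 0.05 # false-positive rate P(F | E^c)

# Total probability: P(F) = P(F|E)P(E) + P(F|E^c)(1 - P(E))
p_F = p_F_given_E * p_E + p_F_given_Ec * (1 - p_E)

# Bayes theorem: P(E|F) = P(F|E)P(E) / P(F)
p_E_given_F = p_F_given_E * p_E / p_F
print(f"P(E|F) = {p_E_given_F:.3f}")  # about 0.161: a positive test is far from conclusive here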

Independent Events

Events E and F are independent if the occurrence of one does not carry information as to the occurrence of the other. That is:

P(E | F) = P(E)  and  P(F | E) = P(F).

It is easy to see that this is equivalent to the condition

P(E, F) = P(E) P(F)

If E and F are independent, so are the pairs (E, F^c), (E^c, F), and (E^c, F^c).

Caution 1: E being independent of F and G does not imply that E is independent of F ∩ G.

Caution 2: Three events E, F, G are independent if P(E, F, G) = P(E) P(F) P(G) and each pair of events is independent.

Random Variables

A random variable X is a (measurable) function X : S → R, that is, it assigns to each outcome of the experiment a real number.

Given a set of real numbers A, we define an event

{X ∈ A} = X^{−1}(A) ⊆ S.

It can be shown that all probability questions about a r.v. X can be phrased in terms of the probabilities of a simple set of events:

{X ∈ (−∞, a]},   a ∈ R.

These events can be written more simply as {X ≤ a}, for a ∈ R.

Probability Distribution Functions

The probability of the special events {X ≤ a} gets a special name.

The probability distribution function (PDF) of a r.v. X is the function F_X : R → [0, 1] given by

F_X(a) = P({X ≤ a}),   a ∈ R.

Properties of a PDF:

1. F_X is non-decreasing: a ≤ b ⇒ F_X(a) ≤ F_X(b).

2. lim_{a→−∞} F_X(a) = 0 and lim_{a→+∞} F_X(a) = 1.

3. F_X is right-continuous: lim_{b→a+} F_X(b) = F_X(a).

Everything we know about events applies here, since a PDF is nothing more than the probability of certain events.

Joint and Conditional PDFs

These are crucial elements in PR and Engineering. Once again , these concepts involve only
These are crucial elements in PR and Engineering. Once again , these concepts involve only

These are crucial elements in PR and Engineering. Once again , these concepts involve only the probabilities of certain sp ecial events.The joint PDF of two r.v.’s X and Y is the joint probability of the

The joint PDF of two r.v.’s X and Y is the joint probability of the events joint PDF of two r.v.’s X and Y is the joint probability of the events {X a } and {Y b }, for a, b R. Formally, we define a function F XY : R × R [0, 1] given by

 

F XY ( a, b ) = P ( {X a }, {Y b }) = P ( {X a, Y B }) ,

a, b R

This is the probability of the "lower-left quadrant" with co rner at ( a, b ) .

 

Note that F X Y ( a, ∞ ) = F X ( a ) and F F XY ( a, ) = F X ( a ) and F XY ( , b ) = F Y ( b ) . These are called the marginal PDFs.

( a, ∞ ) = F X ( a ) and F X Y ( ∞
( a, ∞ ) = F X ( a ) and F X Y ( ∞

ECEN 649 Pattern Recognition – p.17/42

Joint and Conditional PDFs - II

The conditional PDF of X given Y is just the conditional probability of the corresponding special events. Formally, it is a function F_{X|Y} : R × R → [0, 1] given by

F_{X|Y}(a, b) = P({X ≤ a} | {Y ≤ b}) = P({X ≤ a, Y ≤ b}) / P({Y ≤ b}) = F_XY(a, b) / F_Y(b),   a, b ∈ R

The r.v.'s X and Y are independent r.v.'s if F_{X|Y}(a, b) = F_X(a) and F_{Y|X}(b, a) = F_Y(b), that is, if F_XY(a, b) = F_X(a) F_Y(b), for a, b ∈ R.

Bayes' Theorem for r.v.'s:

F_{X|Y}(a, b) = F_{Y|X}(b, a) F_X(a) / F_Y(b),   a, b ∈ R.

Probability Density Functions

The notion of a probability density function (pdf) is fundamental in probability theory. However, it is a secondary notion to that of a PDF. In fact, all r.v.'s must have a PDF, but not all r.v.'s have a pdf.

If F_X is everywhere continuous and differentiable, then X is said to be a continuous r.v. and the pdf f_X of X is given by:

f_X(a) = dF_X(a)/da,   a ∈ R.

Probability statements about X can then be made in terms of integration of f_X. For example,

F_X(a) = ∫_{−∞}^a f_X(x) dx,   a ∈ R

P({a ≤ X ≤ b}) = ∫_a^b f_X(x) dx,   a, b ∈ R

Useful Continuous R.V.'s

Uniform(a, b):

f_X(x) = 1 / (b − a),   a < x < b.

The univariate Gaussian(µ, σ > 0):

f_X(x) = (1 / √(2πσ^2)) exp(−(x − µ)^2 / (2σ^2)),   x ∈ R.

Exponential(λ > 0):

f_X(x) = λ e^{−λx},   x ≥ 0.

Gamma(λ > 0, t > 0):

f_X(x) = λ e^{−λx} (λx)^{t−1} / Γ(t),   x ≥ 0.

Beta(a, b):

f_X(x) = (1 / B(a, b)) x^{a−1} (1 − x)^{b−1},   0 < x < 1.
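These densities are all available in scipy.stats; the following minimal sketch evaluates each one at a few points (the parameter values are arbitrary illustrations; note that scipy parameterizes the Exponential and Gamma by scale = 1/λ).

import numpy as np
from scipy import stats

x = np.linspace(0.01, 0.99, 5)

# Each line evaluates one of the densities above at the points in x.
print(stats.uniform(loc=0, scale=1).pdf(x))        # Uniform(0, 1): constant 1/(b - a)
print(stats.norm(loc=0, scale=1).pdf(x))           # Gaussian(mu = 0, sigma = 1)
print(stats.expon(scale=1.0).pdf(x))               # Exponential(lambda = 1), scale = 1/lambda
print(stats.gamma(a=2.0, scale=1.0).pdf(x))        # Gamma with shape t = 2, rate lambda = 1
print(stats.beta(a=2.0, b=3.0).pdf(x))             # Beta(2, 3)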

Probability Mass Function

If F X is not everywhere differentiable, then X does not have a pdf.

If F X is not everywhere differentiable, then X does not have a pdf. F X is not everywhere differentiable, then X does not have a pdf.

If F X is not everywhere differentiable, then X does not have a pdf.

A useful particular case is that where X assumes countably many values and F X is a staircase function . In this X assumes countably many values and F X is a staircase function . In this case, X is said to be a discrete r.v.

One defines the probability mass function (PMF) of a discrete r.v. X to be: probability mass function (PMF) of a discrete r.v. X to be:

p X ( a ) = P ( {X = a }) = F X ( a ) F X ( a ) ,

a R.

(Note that for any continuous r.v. X , P ( {X = a }) = 0 so there is no PMF to speak of.)

 
a ∈ R . (Note that for any continuous r.v. X , P ( { X
a ∈ R . (Note that for any continuous r.v. X , P ( { X

ECEN 649 Pattern Recognition – p.21/42

Useful Discrete R.V.'s

Bernoulli(p):

p_X(0) = P({X = 0}) = 1 − p
p_X(1) = P({X = 1}) = p

Binomial(n, p):

p_X(i) = P({X = i}) = (n choose i) p^i (1 − p)^{n−i},   i = 0, 1, ..., n

Poisson(λ):

p_X(i) = P({X = i}) = e^{−λ} λ^i / i!,   i = 0, 1, ...
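A companion sketch for the discrete case, again using scipy.stats with arbitrary parameter values:

from scipy import stats

print(stats.bernoulli(p=0.3).pmf([0, 1]))        # [0.7, 0.3]
print(stats.binom(n=5, p=0.3).pmf(range(6)))     # (5 choose i) 0.3^i 0.7^(5-i)
print(stats.poisson(mu=2.0).pmf(range(6)))       # e^-2 2^i / i!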

Expectation

Expectation is a fundamental concept in probability theory, which has to do with the intuitive concept of "averages."

The mean value of a r.v. X is an average of its values weighted by their probabilities. If X is a continuous r.v., this corresponds to:

E[X] = ∫_{−∞}^∞ x f_X(x) dx

If X is discrete, then this can be written as:

E[X] = ∑_{i : p_X(x_i) > 0} x_i p_X(x_i)

Expectation of Functions of R.V.'s

Given a r.v. X and a (measurable) function g : R → R, g(X) is also a r.v. If there is a pdf, it can be shown that

E[g(X)] = ∫_{−∞}^∞ g(x) f_X(x) dx

If X is discrete, then this becomes:

E[g(X)] = ∑_{i : p_X(x_i) > 0} g(x_i) p_X(x_i)

Immediate corollary: E[aX + c] = a E[X] + c.

(Joint r.v.'s) If g : R × R → R is measurable, then (continuous case):

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) f_XY(x, y) dx dy

Other Properties of Expectation

Linearity: E [ a 1 X 1 + E [ a 1 X 1 +

of Expectation Linearity: E [ a 1 X 1 + a n X n ] =
of Expectation Linearity: E [ a 1 X 1 + a n X n ] =

a n X n ] = a 1 E [ X 1 ] +

a n E [ X n ] .

if X and Y are uncorrelated , then E [ XY ] = E [ X ] E [ Y ] (independence always implies uncorrelatedness, but the converse is true o nly in special cases; e.g. jointly Gaussian or multinomial random variables).

If X Y then E [ X ] > E [ Y ] .

If X is non-negative,

E [ X ] = P ( {X > x}) dx

0

Markov’s Inequality: If X is non-negative, then for a > 0,

P ( {X a }) E [ X ]

a

Cauchy-Schwarz Inequality: E [ XY ] ≤ E [ X 2 ] E [ Y 2 ] E [ XY ] E [ X 2 ] E [ Y 2 ]

Cauchy-Schwarz Inequality: E [ XY ] ≤ E [ X 2 ] E [ Y 2

ECEN 649 Pattern Recognition – p.25/42
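As an illustration of Markov's inequality (a sketch, not from the slides): for an Exponential(1) r.v., E[X] = 1 and P({X ≥ a}) = e^{−a}, so the bound can be checked in closed form.

import numpy as np

a = np.array([1.0, 2.0, 5.0])
exact_tail = np.exp(-a)       # P(X >= a) for Exponential(1)
markov_bound = 1.0 / a        # E[X]/a with E[X] = 1

# The bound holds (and is rather loose) for every a.
print(np.all(exact_tail <= markov_bound))  # True
print(exact_tail, markov_bound)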

Conditional Expectation

If X and Y are jointly continuous r.v.'s and f_Y(y) > 0, we define:

E[X | Y = y] = ∫_{−∞}^∞ x f_{X|Y}(x, y) dx = ∫_{−∞}^∞ x (f_XY(x, y) / f_Y(y)) dx

If X, Y are jointly discrete and p_Y(y_j) > 0, this can be written as:

E[X | Y = y_j] = ∑_{i : p_X(x_i) > 0} x_i p_{X|Y}(x_i, y_j) = ∑_{i : p_X(x_i) > 0} x_i p_XY(x_i, y_j) / p_Y(y_j)

Conditional expectations have all the properties of usual expectations. For example,

E[∑_{i=1}^n X_i | Y = y] = ∑_{i=1}^n E[X_i | Y = y]

E[X | Y] is a Random Variable

Given a r.v. X, the mean E[X] is a deterministic parameter.

Now, E[X | Y] is not random w.r.t. X, but it is a function of the r.v. Y, so it is also a r.v. One can show that its mean is precisely E[X]:

E[E[X | Y]] = E[X]

What this says in the continuous case is:

E[X] = ∫_{−∞}^∞ E[X | Y = y] f_Y(y) dy

and, in the discrete case,

E[X] = ∑_{i : p_Y(y_i) > 0} E[X | Y = y_i] P({Y = y_i})

Computing E[X | Y = y] first is often easier than finding E[X] directly.
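A tiny numerical check of E[E[X | Y]] = E[X] on a made-up 2×2 joint PMF (the probability values are arbitrary, purely for illustration):

import numpy as np

# Joint PMF p_XY over x in {0, 1} (rows) and y in {0, 1} (columns).
p = np.array([[0.1, 0.3],
              [0.2, 0.4]])
x_vals = np.array([0.0, 1.0])

p_Y = p.sum(axis=0)                                     # marginal PMF of Y
E_X_given_Y = (x_vals[:, None] * p).sum(axis=0) / p_Y   # E[X | Y = y] for each y

E_X_direct = (x_vals[:, None] * p).sum()                # E[X] computed directly
E_X_tower = (E_X_given_Y * p_Y).sum()                   # E[E[X | Y]]

print(E_X_direct, E_X_tower)                            # both equal 0.6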

Conditional Expectation and Prediction

One is interested in predicting the value of an unknown r.v. Y. We also want the predictor Ŷ to be optimal according to some criterion.

The criterion most widely used is the Mean Square Error:

MSE(Ŷ) = E[(Y − Ŷ)^2]

It can be shown that the MMSE (minimum MSE) constant (no-information) estimator of Y is just the mean E[Y].

Given now some partial information through an observed r.v. X = x, it can be shown that the overall MMSE estimator is the conditional mean E[Y | X = x]. The function η(x) = E[Y | X = x] is called the regression of Y on X.

Is this Pattern Recognition yet? Why?

Variance

The mean E[X] is a good “guess” of the value of a r.v. X, but by itself it can be misleading. The variance Var(X) says how the values of X are spread around the mean E[X]:

Var(X) = E[(X − E[X])^2] = E[X^2] − (E[X])^2

The MSE of the best constant estimator Ŷ = E[Y] of Y is just the variance of Y!

Property: Var(aX + c) = a^2 Var(X)

Chebyshev's Inequality: For any ε > 0,

P({|X − E[X]| ≥ ε}) ≤ Var(X) / ε^2
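A quick Monte Carlo check of Chebyshev's inequality for a standard Gaussian sample (a sketch; the seed and sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100_000)   # E[X] = 0, Var(X) = 1

for eps in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(x - x.mean()) >= eps)
    bound = x.var() / eps**2       # Chebyshev bound Var(X)/eps^2
    print(f"eps={eps}: P(|X - E[X]| >= eps) ~ {empirical:.4f} <= {bound:.4f}")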

Conditional Variance

If X and Y are jointly-distributed r.v.'s, we define:

Var(X | Y) = E[(X − E[X | Y])^2 | Y] = E[X^2 | Y] − (E[X | Y])^2

Conditional Variance Formula:

Var(X) = E[Var(X | Y)] + Var(E[X | Y])

This breaks down the total variance into a "within-rows" component and an "across-rows" component.
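Reusing the small made-up joint PMF from the conditional-expectation sketch above, the conditional variance formula can be verified numerically:

import numpy as np

p = np.array([[0.1, 0.3],
              [0.2, 0.4]])           # joint PMF of (X, Y); X indexes rows
x_vals = np.array([0.0, 1.0])

p_Y = p.sum(axis=0)
E_X_given_Y = (x_vals[:, None] * p).sum(axis=0) / p_Y
E_X2_given_Y = (x_vals[:, None] ** 2 * p).sum(axis=0) / p_Y
Var_X_given_Y = E_X2_given_Y - E_X_given_Y ** 2

lhs = (x_vals[:, None] ** 2 * p).sum() - ((x_vals[:, None] * p).sum()) ** 2
rhs = (Var_X_given_Y * p_Y).sum() + ((E_X_given_Y - (E_X_given_Y * p_Y).sum()) ** 2 * p_Y).sum()
print(lhs, rhs)                      # both equal Var(X) = 0.24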

Covariance

Is Var(X_1 + X_2) = Var(X_1) + Var(X_2)?

The covariance of two r.v.'s X and Y is given by:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]

Two variables X and Y are uncorrelated if and only if Cov(X, Y) = 0. Jointly Gaussian X and Y are independent if and only if they are uncorrelated (in general, independence implies uncorrelatedness but not vice-versa).

It can be shown that

Var(X_1 + X_2) = Var(X_1) + Var(X_2) + 2 Cov(X_1, X_2)

So the variance is distributive over sums if all variables are pair-wise uncorrelated.

Correlation Coefficient

The correlation coefficient ρ between two r.v.'s X and Y is given by:

ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))

Properties:

1. −1 ≤ ρ(X, Y) ≤ 1

2. X and Y are uncorrelated iff ρ(X, Y) = 0.

3. (Perfect linear correlation): ρ(X, Y) = ±1 ⇔ Y = a ± bX, where b = σ_y / σ_x.

Vector Random Variables

A vector r.v. or random vector X = (X_1, ..., X_d) takes values in R^d.

Its distribution is the joint distribution of the component r.v.'s.

The mean of X is the vector µ = (µ_1, ..., µ_d), where µ_i = E[X_i].

The covariance matrix Σ is a d × d matrix given by:

Σ = E[(X − µ)(X − µ)^T]

where Σ_ii = Var(X_i) and Σ_ij = Cov(X_i, X_j).

Properties of the Covariance Matrix

Matrix Σ is real symmetric and thus diagonalizable:

Σ = U D U^T

where U is the matrix of eigenvectors and D is the diagonal matrix of eigenvalues.

All eigenvalues are nonnegative (Σ is positive semi-definite). In fact, except for “degenerate” cases, all eigenvalues are positive, and so Σ is invertible (Σ is said to be positive definite in this case).

(Whitening or Mahalanobis transformation) It is easy to check that the random vector

Y = Σ^{−1/2} (X − µ) = D^{−1/2} U^T (X − µ)

has zero mean and covariance matrix I_d (so that all components of Y are zero-mean, unit-variance, and uncorrelated).
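A short numpy sketch of the whitening transformation (the mean vector and covariance matrix below are arbitrary examples):

import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Draw samples from N(mu, Sigma) and whiten them: Y = D^{-1/2} U^T (X - mu)
X = rng.multivariate_normal(mu, Sigma, size=50_000)
eigvals, U = np.linalg.eigh(Sigma)               # Sigma = U diag(eigvals) U^T
Y = (X - mu) @ U @ np.diag(eigvals ** -0.5)      # row-vector form of the whitening map

print(np.cov(Y, rowvar=False).round(2))          # approximately the 2x2 identity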

Sample Estimates

Given n independent and identically-distributed (i.i.d.) sample observations X_1, ..., X_n of the random vector X, the sample mean estimator is given by:

µ̂ = (1/n) ∑_{i=1}^n X_i

It can be shown that this estimator is unbiased (that is, E[µ̂] = µ) and consistent (that is, µ̂ converges in probability to µ as n → ∞).

The sample covariance estimator is given by:

S = (1/(n − 1)) ∑_{i=1}^n (X_i − µ̂)(X_i − µ̂)^T

This is an unbiased and consistent estimator of Σ.
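Both estimators in a few lines of numpy (arbitrary Gaussian parameters for illustration; numpy's built-in np.cov also uses the 1/(n − 1) normalization by default):

import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=10_000)   # rows are i.i.d. observations

mu_hat = X.mean(axis=0)                               # sample mean (1/n) sum X_i
S = (X - mu_hat).T @ (X - mu_hat) / (len(X) - 1)      # sample covariance, 1/(n-1) normalization

print(mu_hat.round(2))                                # close to mu
print(np.allclose(S, np.cov(X, rowvar=False)))        # matches numpy's built-in estimator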

The Multivariate Gaussian

The multivariate Gaussian distribution with mean µ and covariance matrix Σ (assuming Σ invertible, so that also det(Σ) > 0) corresponds to the multivariate pdf

f_X(x) = (1 / √((2π)^p det(Σ))) exp(−(1/2) (x − µ)^T Σ^{−1} (x − µ))

We write X ∼ N_p(µ, Σ).

The multivariate Gaussian has elliptical contours of the form

(x − µ)^T Σ^{−1} (x − µ) = c^2,   c > 0

The axes of the ellipsoids are given by the eigenvectors of Σ, and the lengths of the axes are proportional to the square roots of the corresponding eigenvalues.

Properties of The Multivariate Gaussian

The density of each component X i is univariate gaussian N ( µ i ,

The density of each component X i is univariate gaussian N ( µ i , Σ i i ) . X i is univariate gaussian N ( µ i , Σ ii ) .

The density of each component X i is univariate gaussian N ( µ i , Σ

The components of X ∼ N ( µ, Σ) are independent if and only if they are uncorrelated, X ∼ N ( µ, Σ) are independent if and only if they are uncorrelated, i.e., Σ is a diagonal matrix.

The whitening transformation Y = Σ − 1 2 ( X − µ ) produces another multivariate gaussian Y = Σ 1 2 ( X µ ) produces another multivariate gaussian Y ∼ N (0, I p ) .

In general, if A is a nonsingular p × p matrix and c is a p -vector, then A is a nonsingular p × p matrix and c is a p -vector, then Y = AX + c ∼ N p ( Aµ + c , A ΣA T ) .

The r.v.’s AX and BX are independent if and only if A Σ B T = 0 AX and BX are independent if and only if A ΣB T = 0.

If Y and X are jointly Gaussian, then the distribution of Y given X is again Y and X are jointly Gaussian, then the distribution of Y given X is again Gaussian.

The regression E [ Y | X ] is a linear function of X . E [ Y | X ] is a linear function of X .

then the distribution of Y given X is again Gaussian. The regression E [ Y |
then the distribution of Y given X is again Gaussian. The regression E [ Y |

ECEN 649 Pattern Recognition – p.37/42

Convergence of Random Sequences

A random sequence {X[1], X[2], ...} is a countable set of r.v.'s.

"Sure" convergence: We say that X[n] → X surely if for all outcomes ξ ∈ S of the experiment we have lim_{n→∞} X[n, ξ] = X(ξ).

Almost-sure convergence or convergence with probability one: The sequence fails to converge only for an event of probability zero, i.e.:

P({ξ ∈ S | lim_{n→∞} X[n, ξ] = X(ξ)}) = 1

Mean-square (m.s.) convergence: This uses convergence of an energy norm to zero:

lim_{n→∞} E[|X[n] − X|^2] = 0

Convergence of Random Sequences - II

Convergence in Probability: This uses convergence of the "probability of error" to zero:

lim_{n→∞} P[|X[n] − X| > ε] = 0,   for all ε > 0.

Convergence in Distribution: This is just convergence of the corresponding PDFs:

lim_{n→∞} F_{X_n}(a) = F_X(a)

at all points a ∈ R where F_X is continuous.

Relationship between modes of convergence:

Sure ⇒ Almost-sure ⇒ Probability ⇒ Distribution
Mean-square ⇒ Probability

Uniformly-Bounded Random Sequences

A random sequence {X[1], X[2], ...} is said to be uniformly bounded if there exists a finite K > 0, which does not depend on n, such that

|X[n]| < K with prob. 1,   for all n = 1, 2, ...

Theorem. If a random sequence {X[1], X[2], ...} is uniformly bounded, then

X[n] → X in m.s.  ⇔  X[n] → X in probability.

That is, convergence in m.s. and in probability are the same.

The relationship between modes of convergence becomes:

Sure ⇒ Almost-sure ⇒ (Mean-square ⇔ Probability) ⇒ Distribution

Example

Let X(n) consist of 0-1 binary random variables.

1. Set X(0) = 1.

2. From the next 2 points, pick one randomly and set it to 1, the other to zero.

3. From the next 3 points, pick one randomly and set it to 1, the rest to zero.

4. From the next 4 points, pick one randomly and set it to 1, the rest to zero.

5. ...

Then X(n) is clearly converging slowly in some sense to zero, but not with probability one! In fact, P[{X_n = 1} i.o.] = 1. However, one can show that X(n) → 0 in mean-square and also in probability. (Exercise.)
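A simulation sketch of this construction (the block count and helper function are arbitrary choices): within the k-th block, which has length k, X(n) = 1 at exactly one uniformly chosen position, so P({X(n) = 1}) = 1/k → 0, which is what drives the mean-square and in-probability convergence, even though a 1 keeps recurring in every block.

import numpy as np

rng = np.random.default_rng(4)

def sliding_bump(num_blocks=200):
    """Build the sequence block by block: block k has length k and exactly one randomly placed 1."""
    seq = []
    for k in range(1, num_blocks + 1):
        block = np.zeros(k, dtype=int)
        block[rng.integers(k)] = 1
        seq.extend(block)
    return np.array(seq)

x = sliding_bump()
# P(X(n) = 1) = 1/k within block k, so E[|X(n)|^2] -> 0, yet a 1 appears in every block.
print("total ones:", x.sum(), "out of", x.size)                 # one 1 per block, forever
print("fraction of ones in the last block:", x[-200:].mean())   # about 1/200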

Limiting Theorems

(Weak Law of Large Numbers) Given an i.i.d. random sequence {X_1, X_2, ...} with common finite mean µ. Then

(1/n) ∑_{i=1}^n X_i → µ,   in probability

(Strong Law of Large Numbers) The same convergence holds with probability 1.

(Central Limit Theorem) Given an i.i.d. random sequence {X_1, X_2, ...} with common mean µ and variance σ^2. Then

(1/(σ√n)) (∑_{i=1}^n X_i − nµ) → N(0, 1),   in distribution
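A closing simulation sketch of the LLN and the CLT for i.i.d. Uniform(0, 1) variables, for which µ = 1/2 and σ^2 = 1/12 (the seed and sample sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(5)
n, reps = 1000, 5000
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)       # mean and std of Uniform(0, 1)

X = rng.random((reps, n))                  # reps independent samples of size n
sample_means = X.mean(axis=1)
print("LLN:", sample_means.mean().round(4))          # close to mu = 0.5

Z = (X.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))  # CLT-normalized sums
print("CLT:", Z.mean().round(3), Z.std().round(3))   # approximately 0 and 1
print("P(Z <= 1) ~", np.mean(Z <= 1.0).round(3))     # close to Phi(1), about 0.841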