For mutually exclusive (disjoint) events $A_1, \dots, A_k$: $P\bigl(\bigcup_{i=1}^{k} A_i\bigr) = \sum_{i=1}^{k} P(A_i)$.
E.g.: Two simultaneous fair coin tosses. Sample space: {(H, H), (H, T), (T, H), (T, T)}.
P(at least one head) = P({(H, H), (H, T), (T, H)}) = 0.75.
E.g.: Roll of a fair die. P(Outcome ≤ 3) = P(Outcome = 1) + P(Outcome = 2) + P(Outcome = 3) = 0.5.
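A quick sanity check of both examples by direct enumeration (a minimal Python sketch, not part of the original slides):

```python
from itertools import product
from fractions import Fraction

# Two simultaneous fair coin tosses: enumerate the sample space; each outcome has probability 1/4.
coin_space = list(product("HT", repeat=2))
p_at_least_one_head = sum(Fraction(1, 4) for outcome in coin_space if "H" in outcome)
print(p_at_least_one_head)  # 3/4

# Roll of a fair die: P(Outcome <= 3) as a sum over the three disjoint outcomes.
p_le_3 = sum(Fraction(1, 6) for outcome in range(1, 7) if outcome <= 3)
print(p_le_3)  # 1/2
```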
Conditional Probability, Independence, R.V.
Conditional Probability: Let $\Omega$ be a sample space associated with a random experiment. Let A, B be two events with P(B) > 0. Then define the conditional probability of A given B as:
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
Independence: Two events A, B are independent if $P(A \cap B) = P(A)P(B)$. Stated another way, A, B are independent if $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$. Note that if A, B are independent (and each has positive probability), then A, B are necessarily not mutually exclusive, i.e., $A \cap B \neq \emptyset$.
Independence and mutual exclusiveness
Example: Consider simultaneous flips of two fair coins. Event A: first coin was a head. Event B: second coin was a tail. Note that A and B are indeed independent events but are not mutually exclusive. On the other hand, let Event C: first coin was a tail. Then A and C are not independent. In fact they are mutually exclusive, i.e., knowing that one has occurred (first coin heads) implies the other (first coin tails) has not. A numeric check of these claims is sketched below.
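A minimal Python sketch of this example, enumerating the four equally likely outcomes (illustrative only, not from the original slides):

```python
from itertools import product
from fractions import Fraction

# Sample space of two simultaneous fair coin flips; each outcome has probability 1/4.
space = list(product("HT", repeat=2))
prob = {outcome: Fraction(1, 4) for outcome in space}
P = lambda event: sum(prob[o] for o in event)

A = {o for o in space if o[0] == "H"}   # first coin is a head
B = {o for o in space if o[1] == "T"}   # second coin is a tail
C = {o for o in space if o[0] == "T"}   # first coin is a tail

print(P(A & B) == P(A) * P(B))   # True: A, B independent (and A ∩ B non-empty)
print(A & C == set())            # True: A, C mutually exclusive
print(P(A & C) == P(A) * P(C))   # False: A, C are not independent
```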
Bayes' rule
A simple consequence of the definition of conditional probability is the very useful Bayes' rule:
$P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A)$
Let's say we are given $P(A \mid B_1)$ and we need to compute $P(B_1 \mid A)$. Let $B_1, \dots, B_k$ be disjoint events whose union has probability 1. Then,
$P(B_1 \mid A) = \frac{P(A \mid B_1)\, P(B_1)}{P(A)} = \frac{P(A \mid B_1)\, P(B_1)}{\sum_{i=1}^{k} P(A \cap B_i)} = \frac{P(A \mid B_1)\, P(B_1)}{\sum_{i=1}^{k} P(A \mid B_i)\, P(B_i)}$
With additional information on $P(B_i)$ and $P(A \mid B_i)$, the required conditional probability is easily computed.
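For instance, a small Python sketch with made-up numbers (purely hypothetical, chosen for illustration) that walks through exactly this computation for two disjoint hypotheses:

```python
# Hypothetical numbers (not from the lecture): B1, B2 are disjoint with P(B1) + P(B2) = 1;
# we know P(A | Bi) and want P(B1 | A).
p_B = [0.3, 0.7]          # P(B1), P(B2)
p_A_given_B = [0.9, 0.2]  # P(A | B1), P(A | B2)

# Total probability: P(A) = sum_i P(A | Bi) P(Bi)
p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))

# Bayes' rule: P(B1 | A) = P(A | B1) P(B1) / P(A)
p_B1_given_A = p_A_given_B[0] * p_B[0] / p_A
print(p_B1_given_A)  # 0.27 / 0.41 ≈ 0.659
```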
Random Variables
Random Variable: A random variable is a mapping from the sample space to the real line, i.e., $X : \Omega \to \mathbb{R}$. Let $x \in \mathbb{R}$; then $P(X = x) = P(\{\omega : X(\omega) = x\})$. Although a random variable is a function, in practice it is simply denoted X. It is very common to directly define probabilities over the range of the random variable.
Cumulative Distribution Function (CDF): Every random variable X has associated with it a CDF $F : \mathbb{R} \to [0, 1]$ such that $P(X \le x) = F(x)$. If X is a continuous random variable, then its probability density function is given by $f(x) = \frac{d}{dx} F(x)$.
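A small numeric illustration of the CDF/pdf relationship, using an exponential distribution as a stand-in example (a sketch; the distribution and rate are arbitrary choices, not from the slides):

```python
import math

# Exponential random variable with rate lam: F(x) = 1 - exp(-lam x), f(x) = lam exp(-lam x).
lam = 2.0
F = lambda x: 1 - math.exp(-lam * x)   # CDF
f = lambda x: lam * math.exp(-lam * x)  # pdf = dF/dx

# Finite-difference check that f(x) ≈ d/dx F(x).
x, h = 0.7, 1e-6
numeric_derivative = (F(x + h) - F(x - h)) / (2 * h)
print(abs(numeric_derivative - f(x)) < 1e-5)  # True
```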
Expectation
Probability Mass Function (p.m.f.): Every discrete random variable has associated with it a probability mass function, which outputs the probability of the random variable taking a particular value. E.g., if X denotes the R.V. corresponding to the roll of a die, $P(X = i) = \frac{1}{6}$ for $i = 1, \dots, 6$ is the p.m.f. associated with X.
Expectation: The expectation of a discrete random variable X with probability mass function p(X) is given by $E[X] = \sum_x p(x)\, x$. Thus expectation is the weighted average of the values that a random variable can take (where the weights are given by the probabilities). E.g., let X be a Bernoulli random variable with $P(X = 0) = p$. Then $E[X] = p \cdot 0 + (1 - p) \cdot 1 = 1 - p$.
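Both examples computed as probability-weighted averages in a short Python sketch (the particular value of p for the Bernoulli is an arbitrary choice):

```python
from fractions import Fraction

# Die: E[X] = sum_i (1/6) * i = 7/2.
die_pmf = {i: Fraction(1, 6) for i in range(1, 7)}
E_die = sum(p * x for x, p in die_pmf.items())
print(E_die)  # 7/2

# Bernoulli with P(X = 0) = p: E[X] = 1 - p.
p = Fraction(1, 4)
bernoulli_pmf = {0: p, 1: 1 - p}
E_bern = sum(prob * x for x, prob in bernoulli_pmf.items())
print(E_bern == 1 - p)  # True
```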
More on Expectation
Conditional Expectation: Let X, Y be two random variables. Given that Y = y, we can talk of the expectation of X conditioned on the information that Y = y:
$E[X \mid Y = y] = \sum_x P(X = x \mid Y = y)\, x$
Law of Total Expectation: Let X, Y be two random variables. The law of total expectation then states that:
$E[X] = E[\,E[X \mid Y]\,]$
The above equation can be interpreted as follows: the expectation of X can be computed by first computing the expectation over the randomness present only in X (but not in Y), and then computing the expectation with respect to the randomness in Y. Thus the total expectation w.r.t. X can be computed by evaluating a sequence of partial expectations.
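A check of $E[X] = E[E[X \mid Y]]$ on a small, made-up joint pmf (the numbers below are hypothetical):

```python
from fractions import Fraction

# Hypothetical joint pmf over (X, Y) pairs; probabilities sum to 1.
joint = {(0, 0): Fraction(1, 8), (1, 0): Fraction(3, 8),
         (0, 1): Fraction(1, 4), (1, 1): Fraction(1, 4)}

E_X = sum(p * x for (x, y), p in joint.items())

E_X_total = Fraction(0)
for y in {y for (_, y) in joint}:
    p_y = sum(p for (x, yy), p in joint.items() if yy == y)
    E_X_given_y = sum(p * x for (x, yy), p in joint.items() if yy == y) / p_y
    E_X_total += p_y * E_X_given_y       # outer expectation over Y

print(E_X == E_X_total)  # True
```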
Stochastic Process
Stochastic Process: Defined as an indexed collection of random variables. A stochastic process is denoted by $\{X_i\}_{i=1}^{N}$.
E.g., in N flips of a fair coin, each flip is a random variable. Thus the collection of these N random variables forms a stochastic process.
Independent and Identically Distributed (i.i.d.): A stochastic process is said to be i.i.d. if each of the random variables $X_i$ has the same CDF and, in addition, the random variables $(X_1, X_2, \dots, X_N)$ are independent. Independence implies that $P(X_1 = x_1, X_2 = x_2, \dots, X_N = x_N) = P(X_1 = x_1)\, P(X_2 = x_2) \cdots P(X_N = x_N)$.
Stochastic Process
Binomial Distribution. Consider the i.i.d. stochastic process $\{X_i\}_{i=1}^{n}$, where each $X_i$ is a Bernoulli random variable, i.e., $X_i \in \{0, 1\}$ and $P(X_i = 1) = p$. Let $X = \sum_{i=1}^{n} X_i$, and consider the probability $P(X = k)$ for some $0 \le k \le n$. The event $\{X = k\}$ is the union, over all index sets $1 \le i_1 < i_2 < \dots < i_k \le n$, of the events $(X_{i_1} = 1, \dots, X_{i_k} = 1$ and $X_j = 0$ for every other index $j)$. Thus the probability to be computed is a union of $\binom{n}{k}$ disjoint events and hence can be expressed as a sum of the probabilities of $\binom{n}{k}$ events. Also, since the stochastic process is i.i.d., each event in the sum has probability $p^k (1 - p)^{n-k}$. Combining the two, we have that $P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$, which is exactly the probability distribution associated with a binomial random variable.
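A quick Monte Carlo sanity check of this formula (the values of n, p, k below are arbitrary choices for illustration):

```python
import math, random

# Compare P(X = k) = C(n, k) p^k (1-p)^(n-k) against a simulation of n Bernoulli(p) flips.
n, p, k = 10, 0.3, 4
formula = math.comb(n, k) * p**k * (1 - p)**(n - k)

random.seed(0)
trials = 200_000
hits = sum(1 for _ in range(trials)
           if sum(random.random() < p for _ in range(n)) == k)
print(formula, hits / trials)  # the two numbers should agree to about two decimal places
```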
Convex Set
A set C is a convex set if for any two elements $x, y \in C$ and any $0 \le \lambda \le 1$, $\lambda x + (1 - \lambda) y \in C$.
Convex Function
A function $f : X \to Y$ is convex if X is a convex set and for any two $x, y \in X$ and any $0 \le \lambda \le 1$ it holds that $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$.
Examples: $e^x$, $-\log(x)$, $x \log x$.
Second order condition
A twice-differentiable function f is convex if and only if its Hessian satisfies $\nabla^2 f \succeq 0$.
For functions of one variable, this reduces to checking whether the second derivative is non-negative.
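As a quick numeric illustration, the defining inequality can be spot-checked for one of the examples above, $f(x) = x \log x$ (a sketch; the sampling range is an arbitrary choice):

```python
import math, random

# Spot-check f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) for f(x) = x log x on x > 0.
f = lambda x: x * math.log(x)

random.seed(1)
ok = True
for _ in range(10_000):
    x, y = random.uniform(0.01, 10), random.uniform(0.01, 10)
    lam = random.random()
    ok &= f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-12
print(ok)  # True
```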
First order condition
$f : X \to Y$ is convex if and only if it is bounded below by its first-order approximation at every point:
$f(x) \ge f(y) + \langle \nabla f(y),\, x - y \rangle \quad \forall\, x, y \in X$
Operations preserving convexity
Intersection of convex sets is convex. E.g., every polyhedron is an intersection of a finite number of half-spaces and is therefore convex.
Sum of convex functions is convex. E.g., the squared 2-norm of a vector, $\|x\|_2^2$, is a sum of quadratic functions of the form $y^2$ and hence is convex.
If h, g are convex functions and h is increasing, then the composition $f = h \circ g$ is convex. E.g., $f(x) = \exp(Ax + b)$ is convex.
Max of convex functions is convex. E.g., $f(y) = \max_x x^T y$ s.t. $x \in C$.
Negative entropy: $f(x) = \sum_i x_i \log(x_i)$ is convex.
Cauchy-Schwarz Inequality
Theorem 3.1
Let X and Y be any two random variables. Then it holds that $|E[XY]| \le \sqrt{E[X^2]}\, \sqrt{E[Y^2]}$.
Proof.
Let $U = X/\sqrt{E[X^2]}$ and $V = Y/\sqrt{E[Y^2]}$. Note that $E[U^2] = E[V^2] = 1$. We need to show that $|E[UV]| \le 1$. For any two U, V it holds that $|UV| \le \frac{|U|^2 + |V|^2}{2}$. Now, $|E[UV]| \le E[|UV|] \le \frac{1}{2} E\bigl[|U|^2 + |V|^2\bigr] = 1$.
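A quick numeric spot-check of the inequality on a randomly generated finite joint distribution (illustrative only):

```python
import random

# Check |E[XY]| <= sqrt(E[X^2]) sqrt(E[Y^2]) on a random finite distribution over (x, y) pairs.
random.seed(2)
outcomes = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(50)]
weights = [random.random() for _ in outcomes]
total = sum(weights)
probs = [w / total for w in weights]

E = lambda g: sum(p * g(x, y) for p, (x, y) in zip(probs, outcomes))
lhs = abs(E(lambda x, y: x * y))
rhs = (E(lambda x, y: x * x) ** 0.5) * (E(lambda x, y: y * y) ** 0.5)
print(lhs <= rhs + 1e-12)  # True
```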
Jensen's Inequality
Theorem 3.2 (Jensen's inequality)
Let X be any random variable and $f : \mathbb{R} \to \mathbb{R}$ be a convex function. Then it holds that $f(E[X]) \le E[f(X)]$.
Proof.
The proof uses the convexity of f. First note, from the first-order condition for convexity, that for any two x, y: $f(x) \ge f(y) + f'(y)(x - y)$. Now let $x = X$, $y = E[X]$. Then we have that $f(X) \ge f(E[X]) + f'(E[X])(X - E[X])$. Taking expectations on both sides, and noting that $E[X - E[X]] = 0$, gives $E[f(X)] \ge f(E[X])$.
Application of Jensen's: AM-GM inequality
Let $a_1, \dots, a_n > 0$ and let $p_1, \dots, p_n \ge 0$ with $\sum_i p_i = 1$. Let X be a random variable taking value $a_i$ with probability $p_i$, and let $f(x) = \log x$ (concave). Then
$\sum_i p_i \log a_i = E[f(X)] \le f(E[X]) = \log\Bigl(\sum_i p_i a_i\Bigr).$
Or in other words,
$\prod_i a_i^{p_i} \le \sum_i p_i a_i.$
Setting $p_i = 1/n$, $i = 1, 2, \dots, n$, the AM-GM inequality follows:
$\Bigl(\prod_{i=1}^{n} a_i\Bigr)^{1/n} \le \frac{\sum_i a_i}{n}.$
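A numeric spot-check of the weighted AM-GM step and the resulting AM-GM inequality (random positive values, purely illustrative):

```python
import math, random

# Weighted AM-GM: prod a_i^{p_i} <= sum p_i a_i, with p a probability vector.
random.seed(3)
n = 6
a = [random.uniform(0.1, 10) for _ in range(n)]
w = [random.random() for _ in range(n)]
p = [x / sum(w) for x in w]

geometric = math.prod(ai ** pi for ai, pi in zip(a, p))
arithmetic = sum(pi * ai for ai, pi in zip(a, p))
print(geometric <= arithmetic + 1e-12)  # True

# With p_i = 1/n this reduces to the usual AM-GM: (prod a_i)^(1/n) <= mean(a).
print(math.prod(a) ** (1 / n) <= sum(a) / n + 1e-12)  # True
```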
Application of Jensen's: GM-HM inequality
The GM-HM inequality also follows from Jensen's. I.e., for any $a_1, a_2, \dots, a_n > 0$ it holds that:
$\Bigl(\prod_{i=1}^{n} a_i\Bigr)^{1/n} \ge \frac{n}{\sum_{i=1}^{n} \frac{1}{a_i}}$
Exercise: Show this.
Markov's Inequality
Theorem 4.1 (Markov's inequality)
Let X be any non-negative random variable. Then for any t > 0 it holds that
$P(X > t) \le E[X]/t$
Proof.
$E[X] = \int_0^{\infty} x f(x)\, dx = \int_0^{t} x f(x)\, dx + \int_t^{\infty} x f(x)\, dx \ge \int_t^{\infty} x f(x)\, dx \ge t \int_t^{\infty} f(x)\, dx = t\, P(X > t)$
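A simulation-based spot-check of Markov's inequality (the exponential distribution is an arbitrary non-negative example):

```python
import random

# Check P(X > t) <= E[X]/t on samples from a non-negative distribution.
random.seed(4)
xs = [random.expovariate(1.0) for _ in range(100_000)]   # non-negative draws
mean = sum(xs) / len(xs)
for t in (0.5, 1.0, 2.0, 5.0):
    frac = sum(x > t for x in xs) / len(xs)
    print(t, frac <= mean / t)  # True for each t (up to sampling noise)
```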
Chebyshev's Inequality
Theorem 4.2 (Chebyshev's inequality)
Let X be any random variable. Then for any s > 0 it holds that
$P(|X - E[X]| > s) \le \mathrm{Var}[X]/s^2$
Proof.
This is an application of Markov's inequality. Set $Z = |X - E[X]|^2$ and apply Markov's inequality to Z with $t = s^2$. Note that $E[Z] = \mathrm{Var}[X]$.
Application to Weak Law of Large Numbers
As a consequence of Chebyshev's inequality we have the Weak Law of Large Numbers.
WLLN
Let $X_1, X_2, \dots, X_n$ be i.i.d. random variables with finite mean and variance. Then,
$\frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{\ p\ } E[X_i] \quad \text{as } n \to \infty$
Proof.
W.l.o.g. assume each $X_i$ has mean 0 and variance 1. Then the variance of $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ equals $\frac{1}{n}$. By Chebyshev's inequality, $P(|\bar{X}| > s) \le \frac{1}{n s^2} \to 0$ as $n \to \infty$, i.e., $\bar{X}$ concentrates around its mean w.h.p.
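A small simulation illustrating the concentration of the sample mean (uniform draws are an arbitrary choice for the X_i; here E[X_i] = 0.5):

```python
import random

# WLLN illustration: the sample mean of i.i.d. Uniform(0,1) draws approaches E[X] = 0.5.
random.seed(5)
for n in (10, 1_000, 100_000):
    sample_mean = sum(random.random() for _ in range(n)) / n
    print(n, sample_mean)  # the deviation from 0.5 shrinks as n grows
```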
Relative entropy or KL-divergence
Let X and Y be two discrete random variables with probability distributions p(X), q(Y). Then the relative entropy between X and Y is given by
$D(X \,\|\, Y) = E_{p(X)}\Bigl[\log \frac{p(X)}{q(X)}\Bigr]$
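Computing this expectation directly for two made-up pmfs on a three-element alphabet (hypothetical numbers, natural log):

```python
import math

# D(p || q) = sum_x p(x) log(p(x)/q(x)) for two hypothetical pmfs.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
D_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
print(D_pq)  # ≈ 0.0253 nats, and always >= 0 (see the next lemma)
```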
Non-negativity of Relative Entropy
Lemma 5.1
For any two discrete random variables X, Y with distributions p(X), q(Y) respectively, it holds that $D(X \,\|\, Y) \ge 0$.
Proof.
The proof follows from a simple application of Jensen's inequality (Theorem 3.2). Note that $f(x) = \log x$ is concave. Let $g(X) = \frac{q(X)}{p(X)}$. Then,
$-D(X \,\|\, Y) = E_{p(X)}\Bigl[\log \frac{q(X)}{p(X)}\Bigr] = E_{p(X)}[\log g(X)] \le \log E_{p(X)}[g(X)] = \log \sum_x p(x)\, \frac{q(x)}{p(x)} = \log 1 = 0,$
hence $D(X \,\|\, Y) \ge 0$.
Chain of implications: Convexity → Jensen's → KL non-negativity → Mutual Information non-negativity.
A tighter bound
Lemma 5.2
Let X, Y be binary R.V.s with probability of success p, q respectively. Then
$D(p \,\|\, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} \ge 2 (p - q)^2.$
Proof.
Note that $x(1 - x) \le \frac{1}{4}$ for all x. Hence (shown here for $p \ge q$; the case $p < q$ is symmetric),
$D(p \,\|\, q) = \int_q^p \Bigl(\frac{p}{x} - \frac{1 - p}{1 - x}\Bigr) dx = \int_q^p \frac{p(1 - x) - (1 - p)x}{x(1 - x)}\, dx \ge 4 \int_q^p (p - x)\, dx = 2 (p - q)^2$
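A numeric spot-check of the lemma over random (p, q) pairs (natural logarithm, as in the proof; the sampling range avoids the endpoints 0 and 1):

```python
import math, random

# Check D(p||q) >= 2 (p - q)^2 for binary distributions with parameters p, q.
random.seed(6)
def D(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

ok = all(D(p, q) >= 2 * (p - q) ** 2 - 1e-12
         for p, q in ((random.uniform(0.01, 0.99), random.uniform(0.01, 0.99))
                      for _ in range(10_000)))
print(ok)  # True
```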
Summary
We reviewed the basics of probability: sample space, random variables, independence, conditional probability, expectation, stochastic processes, etc.
Convexity is fundamental to some important probabilistic inequalities, e.g., Jensen's inequality and the relative entropy inequality.
Jensen's inequality has many applications, a simple one being the AM-GM-HM inequality.
Markov's inequality is fundamental to concentration inequalities.
The Weak Law of Large Numbers essentially follows from a variant of Markov's inequality (Chebyshev's inequality).