
Fundamentals of Information Theory

Lecture 2. Entropy and Mutual Information

Prof. CHEN Jie
Lab. 201, School of EIE
Beihang University
Contents

1 Self-information
2 Jensen's inequality
3 Entropy
4 Joint entropy and conditional entropy
5 Relative entropy and mutual information
6 Relationship between entropy and MI
7 Chain rules for entropy, RE and MI
8 Data processing inequality
Definitions

An ensemble X is a random variable with a set of possible outcomes $A_X = \{a_1, a_2, \ldots, a_i, \ldots, a_I\}$ having probabilities $\{p_1, p_2, \ldots, p_I\}$, with $P(x = a_i) = p_i$, $p_i \ge 0$, and $\sum_{x \in A_X} p(x) = 1$.

A joint ensemble XY is an ensemble in which each outcome is an ordered pair $(x, y)$ with $x \in A_X = \{a_1, \ldots, a_I\}$ and $y \in \{b_1, \ldots, b_J\}$. From the joint probability $P(x, y)$ we can obtain the following:

Conditional probability:
$$P(x = a_i \mid y = b_j) = \frac{P(x = a_i, y = b_j)}{P(y = b_j)}, \qquad \text{if } P(y = b_j) \neq 0$$
Some Definitions

Rather than writing down the joint probability directly, we will often define an ensemble in terms of a collection of conditional probabilities. The following rules of probability theory will be useful (H denotes background assumptions).

Product rule:
$$P(x, y \mid H) = P(x \mid y, H)\, P(y \mid H)$$

Sum rule:
$$P(x \mid H) = \sum_y P(x, y \mid H) = \sum_y P(x \mid y, H)\, P(y \mid H)$$

Bayes' theorem:
$$P(y \mid x, H) = \frac{P(x \mid y, H)\, P(y \mid H)}{\sum_{y'} P(x \mid y', H)\, P(y' \mid H)}$$
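To make these rules concrete, here is a minimal Python sketch (my own illustration, not part of the slides) that applies the sum rule and Bayes' theorem to made-up prior and likelihood values for a binary variable y:

```python
# Hypothetical numbers, chosen only to exercise the formulas above.
prior = {0: 0.7, 1: 0.3}        # P(y | H)
likelihood = {0: 0.2, 1: 0.9}   # P(x | y, H) for one particular observed x

# Sum rule: P(x | H) = sum_y P(x | y, H) P(y | H)
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Bayes' theorem: P(y | x, H) = P(x | y, H) P(y | H) / P(x | H)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}
print(posterior)  # {0: ~0.341, 1: ~0.659}; the posterior sums to 1
```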
Self-information & Conditional Self-information
2.1.1 Self-information

1. Simple events

Self-information: suppose that $p(x_i)$ is the probability of a particular event $x_i$. We define the self-information of $x_i$ as:
$$I(x_i) = -\log p(x_i)$$
2.1.1 Self-information

Annotation: $p(x_i)$ denotes the probability that event $x_i$ occurs. The minus sign is there to ensure that $I(x_i) \ge 0$.

Meaning: before $x_i$ occurs, $I(x_i)$ stands for the uncertainty of $x_i$; after $x_i$ occurs, it stands for the information that $x_i$ provides.
2.1.1 Self-information

Units, depending on the base of the logarithm:
bit (base 2): $I(x_i) = -\log_2 p(x_i)$
nat (base e): $I(x_i) = -\log_e p(x_i)$
hartley (base 10): $I(x_i) = -\log_{10} p(x_i)$

Rules: the larger $p(x_i)$ is, the smaller the uncertainty and the self-information; the smaller $p(x_i)$ is, the larger the uncertainty and the self-information.
2.1.1 Self-information (…)

Examples

1. "Martians are occupying the Earth." — a very unlikely event, so its self-information is large.

2. "Every Friday we discuss information theory." — a (nearly) certain event, so its self-information is small.
2.1.1 Self-information (…)

Joint events

We define the joint self-information as:
$$I(x_i y_j) = -\log p(x_i y_j)$$

Note that when $x_i$ and $y_j$ are statistically independent, $I(x_i y_j) = I(x_i) + I(y_j)$.
2.1.1 Self-information (…)

Example

The occurrence probability of "e" is 0.105, that of "c" is 0.023, and that of "o" is 0.001. Calculate their self-information, respectively.
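As a quick numerical check (my own sketch, not from the slides), the snippet below evaluates $I(x) = -\log_2 p(x)$ for the three probabilities in this example:

```python
import math

def self_information(p, base=2.0):
    """Self-information I(x) = -log_base p(x); bits for base 2."""
    return -math.log(p, base)

for symbol, p in [("e", 0.105), ("c", 0.023), ("o", 0.001)]:
    print(f"I({symbol}) = {self_information(p):.2f} bits")
# Roughly: I(e) = 3.25 bits, I(c) = 5.44 bits, I(o) = 9.97 bits.
# The rarer the symbol, the larger its self-information.
```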
Conditional self-information
2.1.2 Conditional self-information

The conditional self-information is defined as:
$$I(x_i \mid y_j) = -\log p(x_i \mid y_j)$$

Note that $I(x_i y_j) = I(y_j) + I(x_i \mid y_j) = I(x_i) + I(y_j \mid x_i)$.
Review of probability theory
2.1.2 Conditional self-information

Some useful formulas:
$$P(AB) = P(B)\,P(A \mid B) = P(A)\,P(B \mid A)$$
$$P(A) = \sum_{i=1}^{n} P(B_i)\,P(A \mid B_i)$$

Conditional probability:
$$P(A \mid B) = \frac{P(AB)}{P(B)}$$

Bayes' formula:
$$P(B_i \mid A) = \frac{P(B_i)\,P(A \mid B_i)}{\sum_{j=1}^{n} P(B_j)\,P(A \mid B_j)}$$
2.1.2 Conditional self-information

Example

A chessboard is divided into 64 small panes. A chessman is placed in one pane, and you are asked to guess its position.
1) Number the panes 1, 2, ..., 64, and guess the number of the pane in which the chessman is.
2) Number the panes by column and row. You already know which column the chessman is in, and you are asked to guess the row number.
Calculate the information you obtain in each case.
2.1.2 Conditional self-information

Answer:
$$p(x_i y_j) = \frac{1}{64}$$

1) $I(x_i y_j) = -\log p(x_i y_j) = 6$ bits

2) $I(x_i \mid y_j) = -\log p(x_i \mid y_j) = -\log \dfrac{p(x_i y_j)}{p(y_j)} = 3$ bits
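The same answer in code (a minimal sketch of mine, not from the slides), using base-2 logarithms:

```python
import math

p_joint = 1 / 64   # the chessman is equally likely to be in any of the 64 panes
p_col = 1 / 8      # probability of any particular column (8 columns)

# 1) Joint self-information of the exact pane (column and row together)
print(-math.log2(p_joint))           # 6.0 bits

# 2) Conditional self-information of the row, given that the column is already known
print(-math.log2(p_joint / p_col))   # 3.0 bits
```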
2.2 Jensen's inequality

Definition

If the function f has a second derivative which is non-negative (positive) everywhere, then the function is convex (strictly convex):
$$f(\alpha X_1 + (1 - \alpha) X_2) \le \alpha f(X_1) + (1 - \alpha) f(X_2), \qquad 0 \le \alpha \le 1$$
2.2 Jensen's inequality

[Figure: a convex curve y = f(x) with two points x_1 and x_2 on the x-axis; the chord value αf(x_1) + (1-α)f(x_2) lies above the function value f(αx_1 + (1-α)x_2).]
2.2 Jensen's inequality

Theorem 2.2.1 (Jensen's inequality)

If f is a convex function and X is a random variable, then:
$$E f(X) \ge f(E X)$$

For a discrete distribution $p_1, p_2, \ldots$ with $\sum_j p_j = 1$, this reads:
$$\sum_j p_j f(x_j) \ge f\Big(\sum_j p_j x_j\Big)$$
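A small Monte Carlo illustration of the theorem (my own sketch, not from the slides), using the convex function f(x) = x²: the sample average of f(X) never falls below f applied to the sample average.

```python
import random

random.seed(0)
f = lambda x: x * x  # a convex function (second derivative 2 >= 0)

xs = [random.uniform(-3.0, 3.0) for _ in range(100_000)]
E_fx = sum(f(x) for x in xs) / len(xs)   # estimate of E f(X)
f_Ex = f(sum(xs) / len(xs))              # f(E X)

print(E_fx >= f_Ex)   # True, as Jensen's inequality predicts
print(E_fx, f_Ex)     # roughly 3.0 versus roughly 0.0
```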
Entropy
2.3 Entropy

Definition
The entropy H(X) of a discrete random variable X is defined by:
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

Remark
The entropy of X can also be interpreted as the expected value of $\log \frac{1}{p(X)}$, where X is drawn according to the probability mass function p(x). Thus:
$$H(X) = E_p \log \frac{1}{p(X)}$$

Annotation
$E_p$ denotes expectation with respect to the distribution p(x).
2.3 Entropy

Definition 2

We can also use the concept of self-information to define the entropy: it is the expectation of the random variable $I(x_i)$, namely the average self-information (here X is a discrete set with q elements). The formula expressing this definition is:
$$H(X) \stackrel{\mathrm{def}}{=} E[I(x_i)] = E[-\log p(x_i)] = -\sum_{i=1}^{q} p(x_i) \log p(x_i)$$
2.3 Entropy

Annotation

H(X) can be interpreted as a probability-weighted average of the self-information, namely a statistical average:
$$H(X) = \frac{\sum_x p(x)\, I(x)}{\sum_x p(x)}$$
2.3 Entropy (…)

Lemma 2.3.1: $H(X) \ge 0$.
Proof: $0 \le p(x) \le 1$ implies $\log \frac{1}{p(x)} \ge 0$.

Lemma 2.3.2: $H_b(X) = (\log_b a)\, H_a(X)$.
Proof: $\log_b p = (\log_b a)(\log_a p)$.

Annotation
The second property enables us to change the base of the logarithm in the definition: entropy can be converted from one base to another by multiplying by the appropriate factor.
Entropy (…)

Example 2.3.1

Let
$$X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$$

Then
$$H(X) = -p \log p - (1 - p) \log(1 - p) \equiv H(p)$$

In particular, H(X) = 1 bit when p = 1/2.
Entropy (…)

Example 2.3.1 (continued)

The graph of H(p) shown in Figure 2.1 illustrates some basic properties of entropy:
- It is a concave function of the distribution.
- It equals 0 when p = 0 or p = 1, and attains its maximum of 1 bit at p = 1/2.
Entropy (…)

Example 2.3.2

Let
$$X = \begin{cases} a & \text{with probability } 1/2 \\ b & \text{with probability } 1/4 \\ c & \text{with probability } 1/8 \\ d & \text{with probability } 1/8 \end{cases}$$

The entropy of X is
$$H(X) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{4}\log\tfrac{1}{4} - \tfrac{1}{8}\log\tfrac{1}{8} - \tfrac{1}{8}\log\tfrac{1}{8} = \tfrac{7}{4}\ \text{bits}$$
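The entropy formula is straightforward to code. The sketch below (mine, not from the slides) reproduces the 7/4-bit result of Example 2.3.2 and the fair-die entropy of the next example.

```python
import math

def entropy(probs, base=2.0):
    """H(X) = -sum p log p; zero-probability outcomes contribute 0 (0 log 0 = 0)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75 bits, as in Example 2.3.2
print(entropy([1/6] * 6))             # log2(6) ≈ 2.585 bits (fair six-sided die)
```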
Entropy (…)

Example 2.3.3

Let X represent the outcome of a single roll of a fair die. Then X ∈ {1, 2, 3, 4, 5, 6} and $p_i = 1/6$ for each i.

The entropy of X is:
$$H(X) = \log 6 \approx 2.58\ \text{bits} \approx 1.79\ \text{nats}$$
Joint Entropy and Conditional Entropy
2.4 Joint entropy & conditional entropy

2.4.1 Joint entropy

Definition:
The joint entropy H(X,Y) of a pair of discrete random variables (X,Y) with a joint distribution p(x,y) is defined as:
$$H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y)$$

which can also be expressed as:
$$H(X, Y) = -E \log p(X, Y)$$
2.4 Joint entropy & conditional entropy

2.4.2 Conditional entropy

Definition:
If $(X, Y) \sim p(x, y)$, then the conditional entropy $H(Y \mid X)$ is defined as:
$$H(Y \mid X) = \sum_x p(x)\, H(Y \mid X = x)$$
$$= -\sum_x p(x) \sum_y p(y \mid x) \log p(y \mid x)$$
$$= -\sum_x \sum_y p(x, y) \log p(y \mid x)$$
2.4 Joint entropy & Conditional entropy (…)

Theorem 2.4.1 (Chain rule)
$$H(X, Y) = H(X) + H(Y \mid X)$$

Proof:
$$H(X, Y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i y_j) \log p(x_i y_j)$$
$$= -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i y_j) \log\big[p(x_i)\, p(y_j \mid x_i)\big]$$
$$= -\sum_{i=1}^{n}\Big[\sum_{j=1}^{m} p(y_j \mid x_i)\Big] p(x_i) \log p(x_i) - \sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i y_j) \log p(y_j \mid x_i)$$
$$= -\sum_{i=1}^{n} p(x_i) \log p(x_i) + H(Y \mid X)$$
$$= H(X) + H(Y \mid X)$$
2.4 Joint entropy & Conditional entropy (…)

Corollary
$$H(X, Y \mid Z) = H(X \mid Z) + H(Y \mid X, Z)$$

Additional corollaries
$$H(X, Y) \le H(X) + H(Y)$$
$$H(Y \mid X) \le H(Y)$$
$$H(X \mid Y) \le H(X)$$

Think about how to prove the corollaries above.
2.4 Joint entropy & Conditional entropy (…)

Lemma: If x > 0, then
$$\ln x \le x - 1$$

[Figure: the line y = x - 1 and the curve y = ln x; the line lies above the logarithm everywhere and touches it at x = 1.]
2.4 Joint entropy & Conditional entropy (…)

Example 2.4.1
Let (X, Y) have the following joint distribution p(x, y) (rows indexed by Y, columns by X):

         X=1     X=2     X=3     X=4     p(y)
Y=1      1/8     1/16    1/32    1/32    1/4
Y=2      1/16    1/8     1/32    1/32    1/4
Y=3      1/16    1/16    1/16    1/16    1/4
Y=4      1/4     0       0       0       1/4
p(x)     1/2     1/4     1/8     1/8

The marginals give H(X) = H(1/2, 1/4, 1/8, 1/8) = 7/4 bits and H(Y) = H(1/4, 1/4, 1/4, 1/4) = 2 bits.

Answer:
$$H(X \mid Y) = \sum_{i=1}^{4} p(Y = i)\, H(X \mid Y = i)$$
$$= \tfrac{1}{4} H\big(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}\big) + \tfrac{1}{4} H\big(\tfrac{1}{4}, \tfrac{1}{2}, \tfrac{1}{8}, \tfrac{1}{8}\big) + \tfrac{1}{4} H\big(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}\big) + \tfrac{1}{4} H(1, 0, 0, 0)$$
$$= \tfrac{1}{4} \cdot \tfrac{7}{4} + \tfrac{1}{4} \cdot \tfrac{7}{4} + \tfrac{1}{4} \cdot 2 + \tfrac{1}{4} \cdot 0 = \tfrac{11}{8}\ \text{bits}$$
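To keep the bookkeeping straight, here is a short sketch of mine (not from the slides) that recomputes H(X), H(Y), H(X,Y) and H(X|Y) directly from the joint table of Example 2.4.1, using the chain rule H(X|Y) = H(X,Y) - H(Y).

```python
import math
from collections import defaultdict

# Joint distribution p(x, y) of Example 2.4.1, keyed as (x, y)
joint = {
    (1, 1): 1/8,  (2, 1): 1/16, (3, 1): 1/32, (4, 1): 1/32,
    (1, 2): 1/16, (2, 2): 1/8,  (3, 2): 1/32, (4, 2): 1/32,
    (1, 3): 1/16, (2, 3): 1/16, (3, 3): 1/16, (4, 3): 1/16,
    (1, 4): 1/4,  (2, 4): 0,    (3, 4): 0,    (4, 4): 0,
}

def H(probs):
    """Entropy in bits; 0 log 0 is treated as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

H_X, H_Y, H_XY = H(px.values()), H(py.values()), H(joint.values())
H_X_given_Y = H_XY - H_Y   # chain rule
print(H_X, H_Y, H_XY, H_X_given_Y)   # 1.75, 2.0, 3.375, 1.375 = 11/8 bits
```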
2.4 Joint entropy & Conditional entropy (…)

Example 2.2.1
Let (X, Y) have the following joint distribution:

         X=1     X=2
Y=1      0       3/4
Y=2      1/8     1/8

Then:
$$H(X) = H\big(\tfrac{1}{8}, \tfrac{7}{8}\big) = 0.544\ \text{bits}$$
$$H(X \mid Y = 1) = H(0, 1) = 0\ \text{bits}, \qquad H(X \mid Y = 2) = H\big(\tfrac{1}{2}, \tfrac{1}{2}\big) = 1\ \text{bit}$$
$$H(X \mid Y) = \tfrac{3}{4} H(X \mid Y = 1) + \tfrac{1}{4} H(X \mid Y = 2) = 0.25\ \text{bits}$$

Remarks
The uncertainty in X is increased if Y = 2 is observed and decreased if Y = 1 is observed, but the uncertainty decreases on average.
Relative Entropy and Mutual Information
2.5 Relative Entropy & Mutual Information

2.5.1 Relative entropy

Definition:
The relative entropy, or Kullback–Leibler distance, between two probability mass functions p(x) and q(x) is defined as:
$$D(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = E_p \log \frac{p(X)}{q(X)}$$
2.5 Relative Entropy & Mutual Information

Annotation
The relative entropy is always non-negative and is zero if
and only if p=q.
However, it is not a true distance between distributions
since it is not symmetric and does not satisfy the triangle
inequality.
Nonetheless, it is often useful to think of relative entropy
as a “distance” between distributions.

2.5 Relative Entropy & Mutual Information

Theorem (Information inequality): $D(p \,\|\, q) \ge 0$, with equality if and only if $p(x) = q(x)$ for all x.

Proof:
$$-D(p \,\|\, q) = -\sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \sum_{x \in \mathcal{X}} p(x) \log \frac{q(x)}{p(x)}$$
$$\le \log \sum_{x \in \mathcal{X}} p(x) \frac{q(x)}{p(x)} = \log \sum_{x \in \mathcal{X}} q(x) \qquad \text{(by Jensen's inequality, since log is concave)}$$
$$\le \log 1 = 0.$$
2.5 Relative Entropy & Mutual Information

Example 2.5.1

Let $\mathcal{X} = \{0, 1\}$ and consider two distributions p and q on $\mathcal{X}$. Let p(0) = 1-r, p(1) = r, and let q(0) = 1-s, q(1) = s. Then
$$D(p \,\|\, q) = (1 - r) \log \frac{1 - r}{1 - s} + r \log \frac{r}{s}$$
and
$$D(q \,\|\, p) = (1 - s) \log \frac{1 - s}{1 - r} + s \log \frac{s}{r}$$
2.5 Relative Entropy & Mutual Information

Annotation
► If r = s, then D(p||q) = D(q||p) = 0.
► If r = 1/2, s = 1/4, then
$$D(p \,\|\, q) = \tfrac{1}{2} \log \frac{1/2}{3/4} + \tfrac{1}{2} \log \frac{1/2}{1/4} = 1 - \tfrac{1}{2} \log 3 = 0.2075\ \text{bits}$$
whereas
$$D(q \,\|\, p) = \tfrac{3}{4} \log \frac{3/4}{1/2} + \tfrac{1}{4} \log \frac{1/4}{1/2} = \tfrac{3}{4} \log 3 - 1 = 0.1887\ \text{bits}$$
► Note that D(p||q) ≠ D(q||p) in general.
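A sanity check on these two numbers (my own sketch, not from the slides):

```python
import math

def kl_divergence(p, q, base=2.0):
    """D(p||q) = sum_x p(x) log(p(x)/q(x)); terms with p(x) = 0 contribute 0."""
    return sum(px * math.log(px / qx, base) for px, qx in zip(p, q) if px > 0)

p = [1/2, 1/2]   # r = 1/2
q = [3/4, 1/4]   # s = 1/4
print(kl_divergence(p, q))  # ≈ 0.2075 bits
print(kl_divergence(q, p))  # ≈ 0.1887 bits — relative entropy is not symmetric
```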
2.5 Relative Entropy & Mutual Information

2.5.2 Mutual Information

Definition:
The mutual information I(X;Y) is the relative entropy between the joint distribution p(x,y) and the product distribution p(x)p(y):
$$I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = D\big(p(x, y) \,\|\, p(x)\,p(y)\big) = E_{p(x, y)} \log \frac{p(X, Y)}{p(X)\, p(Y)}$$
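The definition translates directly into code. The sketch below (mine, not from the slides) computes I(X;Y) for the 2×2 joint distribution of Example 2.2.1 above as the relative entropy between p(x,y) and p(x)p(y); the result, about 0.294 bits, matches H(X) - H(X|Y) = 0.544 - 0.25 from that example.

```python
import math
from collections import defaultdict

# Joint distribution of Example 2.2.1, keyed as (x, y)
joint = {(1, 1): 0.0, (2, 1): 3/4, (1, 2): 1/8, (2, 2): 1/8}

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

# I(X;Y) = D( p(x,y) || p(x) p(y) ), skipping zero-probability pairs
mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)
print(mi)   # ≈ 0.294 bits
```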
2.5 Relative Entropy & Mutual Information

Lemma (additional): $I(X; Y) \ge 0$.

Proof (using $\ln x \le x - 1$, i.e. $\log x \le (\log e)(x - 1)$):
$$-I(X; Y) = \sum_X \sum_Y p(x_i, y_j) \log \frac{p(x_i)}{p(x_i \mid y_j)}$$
$$\le (\log e) \sum_X \sum_Y p(x_i, y_j) \left[ \frac{p(x_i)}{p(x_i \mid y_j)} - 1 \right]$$
$$= (\log e) \sum_X \sum_Y \big[ p(x_i)\, p(y_j) - p(x_i y_j) \big]$$
$$= (\log e) \Big[ \sum_X \sum_Y p(x_i)\, p(y_j) - \sum_X \sum_Y p(x_i y_j) \Big]$$
$$= (\log e)(1 - 1) = 0$$
Relationship between Entropy & Mutual Information
2.6 Relationship between entropy & MI

Theorem 2.6.1

1. The relationship among mutual information, entropy and conditional entropy:
$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$
$$I(X; Y) = H(X) + H(Y) - H(X, Y)$$
2.6 Relationship between entropy & MI

Proof
$$I(X; Y) = \sum_X \sum_Y p(x_i y_j) \log \frac{p(x_i \mid y_j)}{p(x_i)}$$
$$= -\sum_X \sum_Y p(x_i y_j) \log p(x_i) - \Big[ -\sum_X \sum_Y p(x_i y_j) \log p(x_i \mid y_j) \Big]$$
$$= H(X) - H(X \mid Y)$$

By symmetry, we can also prove:
$$I(X; Y) = H(Y) - H(Y \mid X)$$
2.6 Relationship between entropy & MI

2. The symmetry of mutual information:
$$I(X; Y) = I(Y; X)$$

Annotation

From the relations above we can also get:
$$I(X; X) = H(X) - H(X \mid X) = H(X)$$

Thus the mutual information of a random variable with itself is the entropy of the random variable. This is the reason that entropy is sometimes referred to as self-information.
2.6 Relationship between entropy & MI

Proof
$$I(X; Y) = \sum_X \sum_Y p(x_i y_j) \log \frac{p(x_i \mid y_j)}{p(x_i)}$$
$$= \sum_X \sum_Y p(x_i y_j) \log \frac{p(x_i \mid y_j)\, p(y_j)}{p(x_i)\, p(y_j)}$$
$$= \sum_X \sum_Y p(x_i y_j) \log \frac{p(x_i y_j)}{p(x_i)\, p(y_j)}$$
$$= \sum_X \sum_Y p(x_i y_j) \log \frac{p(y_j \mid x_i)\, p(x_i)}{p(x_i)\, p(y_j)}$$
$$= \sum_X \sum_Y p(x_i y_j) \log \frac{p(y_j \mid x_i)}{p(y_j)}$$
$$= I(Y; X)$$
2.6 Relationship between entropy & MI

[Figure 2.2: relationship between entropy and mutual information, drawn as a Venn diagram. Two overlapping circles represent H(X) and H(Y); their union is H(X,Y), their intersection is I(X;Y), and the parts of each circle outside the overlap are H(X|Y) and H(Y|X).]
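A quick numerical confirmation of Theorem 2.6.1 and Figure 2.2 (my own sketch, not from the slides), reusing the joint distribution of Example 2.4.1: the three expressions for I(X;Y) coincide.

```python
import math
from collections import defaultdict

joint = {
    (1, 1): 1/8,  (2, 1): 1/16, (3, 1): 1/32, (4, 1): 1/32,
    (1, 2): 1/16, (2, 2): 1/8,  (3, 2): 1/32, (4, 2): 1/32,
    (1, 3): 1/16, (2, 3): 1/16, (3, 3): 1/16, (4, 3): 1/16,
    (1, 4): 1/4,  (2, 4): 0,    (3, 4): 0,    (4, 4): 0,
}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

H_X, H_Y, H_XY = H(px.values()), H(py.values()), H(joint.values())
print(H_X - (H_XY - H_Y))   # I(X;Y) = H(X) - H(X|Y)       ≈ 0.375 bits
print(H_Y - (H_XY - H_X))   # I(X;Y) = H(Y) - H(Y|X)       ≈ 0.375 bits
print(H_X + H_Y - H_XY)     # I(X;Y) = H(X) + H(Y) - H(X,Y) ≈ 0.375 bits
```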
Chain rules for entropy, relative entropy & mutual information
2.7 Chain rules for entropy, RE & MI

We now show that the entropy of a collection of random variables is the sum of the conditional entropies.

Theorem 2.7.1 (chain rule for entropy)

Let $X_1, X_2, \ldots, X_N$ be drawn according to $p(x_1, x_2, \ldots, x_N)$. Then:
$$H(X_1, X_2, \ldots, X_N) = H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_N \mid X_1, X_2, \ldots, X_{N-1}) = \sum_{i=1}^{N} H(X_i \mid X_1, X_2, \ldots, X_{i-1})$$
2.7 Chain rules for entropy, RE & MI

Proof
$$H(X_1, X_2) = H(X_1) + H(X_2 \mid X_1)$$
$$H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 \mid X_1) = H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_2, X_1)$$
$$\vdots$$
$$H(X_1, X_2, \ldots, X_N) = H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_N \mid X_{N-1}, \ldots, X_1) = \sum_{i=1}^{N} H(X_i \mid X_1, X_2, \ldots, X_{i-1})$$

(See the alternative proof on page 21.)
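To see the chain rule in action, the sketch below (my own, not from the slides) builds a random joint pmf over three binary variables and checks that H(X1,X2,X3) equals H(X1) + H(X2|X1) + H(X3|X1,X2), with the conditional entropies computed directly from their definitions rather than by telescoping.

```python
import itertools
import math
import random

random.seed(1)

# A random joint pmf p(x1, x2, x3) over three binary variables
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
p = {o: w / total for o, w in zip(outcomes, weights)}

def marginal(p, idx):
    """Marginal pmf of the variables at positions idx."""
    m = {}
    for o, prob in p.items():
        key = tuple(o[i] for i in idx)
        m[key] = m.get(key, 0.0) + prob
    return m

def entropy(pmf):
    return -sum(q * math.log2(q) for q in pmf.values() if q > 0)

def cond_entropy(p, target, given):
    """H(X_target | X_given) = -sum p log p(target | given), in bits."""
    joint_tg = marginal(p, given + (target,))
    joint_g = marginal(p, given)
    return -sum(prob * math.log2(prob / joint_g[key[:-1]])
                for key, prob in joint_tg.items() if prob > 0)

lhs = entropy(p)                                   # H(X1, X2, X3)
rhs = (entropy(marginal(p, (0,)))                  # H(X1)
       + cond_entropy(p, 1, (0,))                  # H(X2 | X1)
       + cond_entropy(p, 2, (0, 1)))               # H(X3 | X1, X2)
print(lhs, rhs)                                    # the two values agree (chain rule)
```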
2.7 Chain rules for Entropy, RE & MI

Definition

The conditional mutual information of random variables X and Y given Z is defined by
$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) = E_{p(x, y, z)} \log \frac{p(X, Y \mid Z)}{p(X \mid Z)\, p(Y \mid Z)}$$
2.7 Chain rules for Entropy, RE & MI

Theorem 2.7.2 (chain rule for information)
$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, X_{i-2}, \ldots, X_1)$$

Proof:
$$I(X_1, X_2, \ldots, X_n; Y) = H(X_1, X_2, \ldots, X_n) - H(X_1, X_2, \ldots, X_n \mid Y)$$
$$= \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) - \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1, Y)$$
$$= \sum_{i=1}^{n} I(X_i; Y \mid X_1, X_2, \ldots, X_{i-1})$$
2.7 Chain rules for entropy, RE & MI

Definition

The conditional relative entropy D(p(y|x) || q(y|x)) is the average of the relative entropies between the conditional probability mass functions p(y|x) and q(y|x), averaged over the probability mass function p(x). More precisely,
$$D\big(p(y \mid x) \,\|\, q(y \mid x)\big) = \sum_x p(x) \sum_y p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)} = E_{p(x, y)} \log \frac{p(Y \mid X)}{q(Y \mid X)}$$
2.7 Chain rules for entropy, RE & MI

Theorem 2.7.3 (chain rule for relative entropy)
$$D\big(p(x, y) \,\|\, q(x, y)\big) = D\big(p(x) \,\|\, q(x)\big) + D\big(p(y \mid x) \,\|\, q(y \mid x)\big)$$

Proof:
$$D\big(p(x, y) \,\|\, q(x, y)\big) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{q(x, y)}$$
$$= \sum_x \sum_y p(x, y) \log \frac{p(x)\, p(y \mid x)}{q(x)\, q(y \mid x)}$$
$$= \sum_x \sum_y p(x, y) \log \frac{p(x)}{q(x)} + \sum_x \sum_y p(x, y) \log \frac{p(y \mid x)}{q(y \mid x)}$$
$$= D\big(p(x) \,\|\, q(x)\big) + D\big(p(y \mid x) \,\|\, q(y \mid x)\big)$$
Data processing inequality
2.8 Data processing inequality

Definition

Random variables X, Y, Z are said to form a Markov chain in that order (denoted X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X. Specifically, X, Y and Z form a Markov chain X → Y → Z if the joint probability mass function can be written as
$$p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid y)$$

(See some simple consequences on page 32.)
2.8 Data processing inequality

Theorem 2.8.1 (data processing inequality)

If X → Y → Z, then $I(X; Y) \ge I(X; Z)$.

Proof: Expanding the mutual information by the chain rule in two different ways,
$$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y)$$
Since X and Z are conditionally independent given Y, we have $I(X; Z \mid Y) = 0$. Since $I(X; Y \mid Z) \ge 0$, we have:
$$I(X; Y) \ge I(X; Z)$$
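A small numerical illustration of the theorem (my own construction, not from the slides): send a uniform random bit X through a binary symmetric channel (BSC) to obtain Y, then through a second BSC to obtain Z, so that X → Y → Z forms a Markov chain. For a uniform input, I(X;Y) = 1 - H_b(p), where H_b is the binary entropy function, and the cascade of two BSCs is again a BSC, so I(X;Z) has the same closed form with the combined flip probability.

```python
import math

def binary_entropy(p):
    """H_b(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mi_uniform_bsc(flip):
    """I(X; output) for a uniform input bit X sent through a BSC with this flip probability."""
    return 1.0 - binary_entropy(flip)

# Markov chain X -> Y -> Z: BSC(0.1) followed by BSC(0.2)
p1, p2 = 0.1, 0.2
p_xz = p1 * (1 - p2) + (1 - p1) * p2   # effective flip probability from X to Z

I_XY = mi_uniform_bsc(p1)
I_XZ = mi_uniform_bsc(p_xz)
print(I_XY, I_XZ, I_XY >= I_XZ)   # ≈ 0.531, ≈ 0.173, True — processing cannot create information
```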
2.8 Data processing inequality

Corollary:
In particular, if $Z = g(Y)$, we have $I(X; Y) \ge I(X; g(Y))$, since $X \to Y \to g(Y)$ forms a Markov chain.

Corollary:
If $X \to Y \to Z$, then $I(X; Y \mid Z) \le I(X; Y)$.
Proof: From the expansion in the proof of Theorem 2.8.1, using the fact that $I(X; Z \mid Y) = 0$ by Markovity and $I(X; Z) \ge 0$, we have:
$$I(X; Y \mid Z) \le I(X; Y)$$
Thanks

