
Chapter 7

Channel Capacity
Peng-Hua Wang
Graduate Inst. of Comm. Engineering
National Taipei University

Chapter Outline
Chap. 7 Channel Capacity
7.1 Examples of Channel Capacity
7.2 Symmetric Channels
7.3 Properties of Channel Capacity
7.4 Preview of the Channel Coding Theorem
7.5 Definitions
7.6 Jointly Typical Sequences
7.7 Channel Coding Theorem
7.8 Zero-Error Codes
7.9 Fano's Inequality and the Converse to the Coding Theorem


7.1 Examples of Channel Capacity


Channel Model

Operational channel capacity: the number of bits needed to index the maximum number of distinguishable signals for n uses of a communication channel.

If we can send M distinguishable signals without error in n uses of the channel, the operational capacity is (log M)/n bits per transmission.

Information channel capacity: the maximum mutual information between the channel input and output.

The operational channel capacity is equal to the information channel capacity.

This equality is a fundamental theorem and a central success of information theory.


Channel capacity
Definition 1 (Discrete Channel) A system consisting of an input alphabet X, an output alphabet Y, and a probability transition matrix p(y|x).

Definition 2 (Channel capacity) The information channel capacity of a discrete memoryless channel is

C = max_{p(x)} I(X; Y),

where the maximum is taken over all possible input distributions p(x).

Operational definition of channel capacity: the highest rate in bits per channel use at which information can be sent with arbitrarily low probability of error.

Shannon's second theorem: the information channel capacity is equal to the operational channel capacity.
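To make the definition concrete, here is a minimal numerical sketch (added for illustration, not part of the original slides): it evaluates I(X; Y) for a given transition matrix and brute-forces the maximum over binary input distributions. The channel matrix and grid size are assumed, illustrative values; for larger input alphabets the standard tool is the Blahut-Arimoto algorithm, which these slides do not cover.

```python
import numpy as np

def mutual_information(p_x, P):
    """I(X;Y) in bits; p_x is the input pmf, P[x, y] = p(y|x) is the transition matrix."""
    p_xy = p_x[:, None] * P            # joint pmf p(x, y)
    p_y = p_xy.sum(axis=0)             # output pmf p(y)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x[:, None] * p_y)[mask])))

def capacity_binary_input(P, grid=1001):
    """Brute-force C = max_{p(x)} I(X;Y) over binary input distributions."""
    return max(mutual_information(np.array([a, 1.0 - a]), P)
               for a in np.linspace(0.0, 1.0, grid))

# Example: a binary symmetric channel with crossover probability 0.1 (assumed value).
P_bsc = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
print(capacity_binary_input(P_bsc))    # about 0.531 bits per use, i.e. 1 - H(0.1)
```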


Example 1

Noiseless binary channel

p(Y=0) = p(X=0) = π₀,   p(Y=1) = p(X=1) = π₁ = 1 − π₀

I(X;Y) = H(Y) − H(Y|X) = H(Y) ≤ 1

C = max I(X;Y) = 1 bit, achieved when π₀ = π₁ = 1/2.


Example 2

Noisy channel with non-overlapping outputs

p(X=0) = π₀,   p(X=1) = π₁ = 1 − π₀
p(Y=1) = π₀ p,   p(Y=2) = π₀ (1 − p),   p = 1/2
p(Y=3) = π₁ q,   p(Y=4) = π₁ (1 − q),   q = 1/3

I(X;Y) = H(Y) − H(Y|X) = H(Y) − π₀ H(p) − π₁ H(q)

Since the outputs of the two inputs do not overlap, X can always be determined from Y, and C = max I(X;Y) = 1 bit.

Noisy Typewriter

[Figure: noisy typewriter channel — each input letter is received either as itself or as the next letter, each with probability 1/2]

Noisy Typewriter

I(X;Y) = H(Y) − H(Y|X)
       = H(Y) − Σ_x p(x) H(Y|X=x)
       = H(Y) − Σ_x p(x) H(1/2)
       = H(Y) − 1
       ≤ log 26 − 1 = log 13

C = max I(X;Y) = log 13
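As a quick numeric check (a sketch added here, not in the original slides), the standard noisy-typewriter model — each of the 26 letters is received as itself or as the next letter, with probability 1/2 each, consistent with the H(1/2) term above — gives I(X;Y) = log 13 at the uniform input distribution:

```python
import numpy as np

A = 26
P = np.zeros((A, A))                 # P[x, y] = p(y|x)
for x in range(A):
    P[x, x] = 0.5                    # letter received unchanged
    P[x, (x + 1) % A] = 0.5          # or received as the next letter (cyclically)

p_x = np.full(A, 1.0 / A)            # uniform input distribution
p_xy = p_x[:, None] * P
p_y = p_xy.sum(axis=0)
mask = p_xy > 0
I = np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x[:, None] * p_y)[mask]))
print(I, np.log2(13))                # both about 3.7004 bits
```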


Binary Symmetric Channel (BSC)

[Figure: binary symmetric channel — each input bit is flipped with crossover probability p]



Binary Symmetric Channel (BSC)

I(X;Y) = H(Y) − H(Y|X)
       = H(Y) − Σ_x p(x) H(Y|X=x)
       = H(Y) − Σ_x p(x) H(p)
       = H(Y) − H(p)
       ≤ 1 − H(p)

C = max I(X;Y) = 1 − H(p)
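A few sample values of C = 1 − H(p) (a small sketch, not from the slides); note that p = 1/2 gives zero capacity, since the output is then independent of the input:

```python
import numpy as np

def H2(p):                            # binary entropy in bits
    return 0.0 if p in (0.0, 1.0) else float(-p*np.log2(p) - (1-p)*np.log2(1-p))

for p in (0.0, 0.1, 0.25, 0.5):
    print(f"p = {p:4.2f}   C = 1 - H(p) = {1 - H2(p):.4f} bits per use")
```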


Binary Erasure Channel


Binary Erasure Channel

Let α be the erasure probability and π₀ = Pr(X = 0).

I(X;Y) = H(Y) − H(Y|X)
       = H(Y) − Σ_x p(x) H(Y|X=x)
       = H(Y) − Σ_x p(x) H(α)
       = H(Y) − H(α)

H(Y) = (1 − α) H(π₀) + H(α), so I(X;Y) = (1 − α) H(π₀) ≤ 1 − α.

C = max I(X;Y) = 1 − α, achieved when π₀ = 1/2.
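The same conclusion can be checked numerically (an illustrative sketch with an assumed erasure probability α = 0.3): maximizing I(X;Y) over π₀ gives 1 − α, attained at π₀ = 1/2.

```python
import numpy as np

alpha = 0.3                                   # erasure probability (assumed value)
P = np.array([[1 - alpha, alpha, 0.0],        # rows: x in {0, 1}; columns: y in {0, e, 1}
              [0.0, alpha, 1 - alpha]])

def I(pi0):
    p_x = np.array([pi0, 1.0 - pi0])
    p_xy = p_x[:, None] * P
    p_y = p_xy.sum(axis=0)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x[:, None] * p_y)[mask])))

grid = np.linspace(0.001, 0.999, 999)
vals = [I(t) for t in grid]
print(max(vals), 1 - alpha)                   # both about 0.7 bits per use
print(grid[int(np.argmax(vals))])             # maximizing pi0 is 0.5
```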


7.3 Properties of Channel Capacity


Properties of Channel Capacity

C ≥ 0.

C ≤ log |X|.

C ≤ log |Y|.

I(X;Y) is a continuous function of p(x).

I(X;Y) is a concave function of p(x).


7.4 Preview of the Channel Coding Theorem


Preview of the Channel Coding Theorem

For each input n-sequence, there are approximately 2^{nH(Y|X)} possible Y sequences.

The total number of possible (typical) Y sequences is approximately 2^{nH(Y)}.

This set has to be divided into sets of size 2^{nH(Y|X)} corresponding to the different input X sequences.

The total number of disjoint sets is less than or equal to 2^{nH(Y)} / 2^{nH(Y|X)} = 2^{n(H(Y) − H(Y|X))} = 2^{nI(X;Y)}.

Hence we can send at most 2^{nI(X;Y)} distinguishable sequences of length n.


Example

6 typical sequences for X^n; 4 typical sequences for Y^n; 12 jointly typical sequences for (X^n, Y^n).

For every typical X^n, there are 2^{nH(X,Y)} / 2^{nH(X)} = 2^{nH(Y|X)} = 2 jointly typical Y^n.

e.g., for X^n = 001100, the jointly typical Y^n are 010100 and 101011.

Example

Since we have 2^{nH(Y)} = 4 typical Y^n in total, to how many typical X^n can these typical Y^n be assigned?

2^{nH(Y)} / 2^{nH(Y|X)} = 2^{n(H(Y) − H(Y|X))} = 2^{nI(X;Y)} = 2.

Can we assign more typical X^n? No. For some received Y^n we could not determine which X^n was sent. e.g., if we use 001100, 101101, and 101000 as codewords, we cannot determine which codeword was sent when we receive 101011.


7.5 Definitions


Communication Channel

Message W ∈ {1, 2, ..., M}.

Encoder: input W, output X^n(W) ∈ X^n.

n is the length of the signal. We transmit the signal by using the channel n times, sending one symbol per channel use.

Channel: input X^n, output Y^n with distribution p(y^n | x^n).

Decoder: input Y^n, output Ŵ = g(Y^n), where g is a deterministic decoding rule.

If Ŵ ≠ W, an error occurs.


Definitions
Definition 3 (Discrete Channel) A discrete channel, denoted by (X, p(y|x), Y), consists of two finite sets X and Y and a collection of probability mass functions p(y|x).

X: input, Y: output; for every input x ∈ X, Σ_y p(y|x) = 1.

Definition 4 (Discrete Memoryless Channel, DMC) The nth extension of the discrete memoryless channel is the channel (X^n, p(y^n|x^n), Y^n), where

p(y_k | x^k, y^{k−1}) = p(y_k | x_k),   k = 1, 2, . . . , n.

Without feedback: p(x_k | x^{k−1}, y^{k−1}) = p(x_k | x^{k−1}).

For the nth extension of a DMC without feedback,

p(y^n | x^n) = ∏_{i=1}^{n} p(y_i | x_i).
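The memoryless, no-feedback property is easy to exercise in code: n uses of the channel amount to drawing each y_i from p(·|x_i) independently, and log p(y^n|x^n) is a sum of per-symbol terms. Below is a small sketch (added for illustration) with an arbitrary 2-input, 3-output transition matrix whose values are assumed.

```python
import numpy as np

rng = np.random.default_rng(1)

P = np.array([[0.8, 0.1, 0.1],        # P[x, y] = p(y|x), an arbitrary example DMC
              [0.1, 0.1, 0.8]])

def use_channel(x_seq):
    """n uses of the DMC: each y_i is drawn from p(.|x_i), independently of the past."""
    return np.array([rng.choice(P.shape[1], p=P[x]) for x in x_seq])

def log2_prob(y_seq, x_seq):
    """log2 p(y^n | x^n) = sum_i log2 p(y_i | x_i) for a DMC used without feedback."""
    return float(np.sum(np.log2(P[x_seq, y_seq])))

x = np.array([0, 1, 1, 0, 1])
y = use_channel(x)
print(y, log2_prob(y, x))
```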


Definitions
Definition 5 ((M, n) code) An (M, n) code for the channel (X, p(y|x), Y) consists of the following:

1. An index set {1, 2, . . . , M}.

2. An encoding function X^n : {1, 2, . . . , M} → X^n. The codewords are x^n(1), x^n(2), . . . , x^n(M). The set of codewords is called the codebook.

3. A decoding function g : Y^n → {1, 2, . . . , M}.

Definitions
Definition 6 (Conditional probability of error)

λ_i = Pr(g(Y^n) ≠ i | X^n = x^n(i)) = Σ_{y^n : g(y^n) ≠ i} p(y^n | x^n(i)) = Σ_{y^n} p(y^n | x^n(i)) I(g(y^n) ≠ i),

where I(·) is the indicator function.

Definitions
Definition 7 (Maximal probability of error)

λ^{(n)} = max_{i ∈ {1, 2, . . . , M}} λ_i

Definition 8 (Average probability of error)

P_e^{(n)} = (1/M) Σ_{i=1}^{M} λ_i

The decoding error is

Pr(g(Y^n) ≠ W) = Σ_{i=1}^{M} Pr(W = i) Pr(g(Y^n) ≠ i | W = i).

If the index W is chosen uniformly from {1, 2, . . . , M}, then

P_e^{(n)} = Pr(g(Y^n) ≠ W).

Definitions
Definition 9 (Rate) The rate R of an (M, n) code is

R = (log M) / n   bits per transmission.

Definition 10 (Achievable rate) A rate R is said to be achievable if there exists a sequence of (2^{nR}, n) codes such that the maximal probability of error λ^{(n)} tends to 0 as n → ∞.

Definition 11 (Channel capacity) The capacity of a channel is the supremum of all achievable rates.


7.6 Jointly Typical Sequences


Definitions
Definition 12 (Jointly typical sequences) The set A_ε^{(n)} of jointly typical sequences {(x^n, y^n)} with respect to the distribution p(x, y) is defined by

A_ε^{(n)} = { (x^n, y^n) ∈ X^n × Y^n :
    | −(1/n) log p(x^n) − H(X) | < ε,
    | −(1/n) log p(y^n) − H(Y) | < ε,
    | −(1/n) log p(x^n, y^n) − H(X, Y) | < ε },

where

p(x^n, y^n) = ∏_{i=1}^{n} p(x_i, y_i).
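The definition translates directly into a membership test. The sketch below (added for illustration; the joint pmf and parameters are assumed values) checks the three conditions for a pair of sequences drawn i.i.d. from p(x, y); by part 1 of the joint AEP on the following slides, the test passes with probability approaching 1 as n grows.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def jointly_typical(xs, ys, p_xy, eps):
    """Is (x^n, y^n) in A_eps^(n) for the joint pmf p_xy[x, y]?"""
    n = len(xs)
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    HX, HY, HXY = entropy(p_x), entropy(p_y), entropy(p_xy.ravel())
    lx  = -np.sum(np.log2(p_x[xs])) / n            # -(1/n) log p(x^n)
    ly  = -np.sum(np.log2(p_y[ys])) / n            # -(1/n) log p(y^n)
    lxy = -np.sum(np.log2(p_xy[xs, ys])) / n       # -(1/n) log p(x^n, y^n)
    return abs(lx - HX) < eps and abs(ly - HY) < eps and abs(lxy - HXY) < eps

# Example joint pmf (assumed values) and one i.i.d. draw of length n.
p_xy = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
rng = np.random.default_rng(0)
n = 5000
idx = rng.choice(4, size=n, p=p_xy.ravel())        # draw (x_i, y_i) pairs i.i.d. ~ p(x, y)
xs, ys = idx // 2, idx % 2
print(jointly_typical(xs, ys, p_xy, eps=0.05))     # True with high probability
```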

Joint AEP
Theorem 1 (Joint AEP) Let (X^n, Y^n) be sequences of length n drawn i.i.d. according to p(x^n, y^n) = ∏_{i=1}^{n} p(x_i, y_i). Then:

1. Pr((X^n, Y^n) ∈ A_ε^{(n)}) → 1 as n → ∞.

2. |A_ε^{(n)}| ≤ 2^{n(H(X,Y)+ε)}.

3. If (X̃^n, Ỹ^n) ~ p(x^n) p(y^n) [i.e., X̃^n and Ỹ^n are independent with the same marginals], then

   Pr((X̃^n, Ỹ^n) ∈ A_ε^{(n)}) ≤ 2^{−n(I(X;Y) − 3ε)}.

   Also, for sufficiently large n,

   Pr((X̃^n, Ỹ^n) ∈ A_ε^{(n)}) ≥ (1 − ε) 2^{−n(I(X;Y) + 3ε)}.

Joint AEP


Theorem 2 (Joint AEP, part 1) Pr((X^n, Y^n) ∈ A_ε^{(n)}) → 1 as n → ∞.

Proof. Given ε > 0, define the events

A ≜ { X^n : | −(1/n) log p(X^n) − H(X) | ≥ ε },

B ≜ { Y^n : | −(1/n) log p(Y^n) − H(Y) | ≥ ε },

C ≜ { (X^n, Y^n) : | −(1/n) log p(X^n, Y^n) − H(X, Y) | ≥ ε }.

Joint AEP
Then, by the weak law of large numbers, there exist n₁, n₂, n₃ such that

Pr(A) < ε/3 for n > n₁,   Pr(B) < ε/3 for n > n₂,   Pr(C) < ε/3 for n > n₃.

Thus,

Pr((X^n, Y^n) ∈ A_ε^{(n)}) = Pr(A^c ∩ B^c ∩ C^c)
                           = 1 − Pr(A ∪ B ∪ C)
                           ≥ 1 − (Pr(A) + Pr(B) + Pr(C))
                           ≥ 1 − ε

for all n > max{n₁, n₂, n₃}. ∎

Joint AEP


Theorem 3 (Joint AEP, part 2) |A_ε^{(n)}| ≤ 2^{n(H(X,Y)+ε)}.

Proof.

1 = Σ_{(x^n, y^n)} p(x^n, y^n) ≥ Σ_{(x^n, y^n) ∈ A_ε^{(n)}} p(x^n, y^n) ≥ |A_ε^{(n)}| 2^{−n(H(X,Y)+ε)}.

Thus, |A_ε^{(n)}| ≤ 2^{n(H(X,Y)+ε)}. ∎

Joint AEP
Theorem 4 (Joint AEP, part 3) If (X̃^n, Ỹ^n) ~ p(x^n) p(y^n) [i.e., X̃^n and Ỹ^n are independent with the same marginals], then

Pr((X̃^n, Ỹ^n) ∈ A_ε^{(n)}) ≤ 2^{−n(I(X;Y) − 3ε)}.

Also, for sufficiently large n,

Pr((X̃^n, Ỹ^n) ∈ A_ε^{(n)}) ≥ (1 − ε) 2^{−n(I(X;Y) + 3ε)}.

Joint AEP
Proof.

Pr((X̃^n, Ỹ^n) ∈ A_ε^{(n)}) = Σ_{(x^n, y^n) ∈ A_ε^{(n)}} p(x^n) p(y^n)
                            ≤ 2^{n(H(X,Y)+ε)} · 2^{−n(H(X)−ε)} · 2^{−n(H(Y)−ε)}
                            = 2^{−n(I(X;Y) − 3ε)}.

For sufficiently large n, Pr(A_ε^{(n)}) ≥ 1 − ε, and therefore

1 − ε ≤ Σ_{(x^n, y^n) ∈ A_ε^{(n)}} p(x^n, y^n) ≤ |A_ε^{(n)}| 2^{−n(H(X,Y)−ε)},

and

|A_ε^{(n)}| ≥ (1 − ε) 2^{n(H(X,Y)−ε)}.

Joint AEP


Pr((X̃^n, Ỹ^n) ∈ A_ε^{(n)}) = Σ_{(x^n, y^n) ∈ A_ε^{(n)}} p(x^n) p(y^n)
                            ≥ (1 − ε) 2^{n(H(X,Y)−ε)} · 2^{−n(H(X)+ε)} · 2^{−n(H(Y)+ε)}
                            = (1 − ε) 2^{−n(I(X;Y) + 3ε)}. ∎

Joint AEP: Conclusion

There are about 2^{nH(X)} typical X sequences, and about 2^{nH(Y)} typical Y sequences.

There are about 2^{nH(X,Y)} jointly typical sequences.

If we choose a typical X^n and a typical Y^n independently at random, the probability that the pair is jointly typical is about 2^{−nI(X;Y)}.

7.7 Channel Coding Theorem


Channel Coding Theorem


Theorem 5 (Channel coding theorem) For every rate R < C, there exists a sequence of (2^{nR}, n) codes with maximum probability of error λ^{(n)} → 0.

Conversely, any sequence of (2^{nR}, n) codes with λ^{(n)} → 0 must have R ≤ C.

We have to prove two parts:

R < C ⇒ achievable.

Achievable ⇒ R ≤ C.

Main ideas:

Random encoding (random code)

Jointly typical decoding


Random Code

Generate a (2^{nR}, n) code at random according to a fixed distribution p(x). That is, the 2^{nR} codewords are drawn according to

p(x^n) = ∏_{i=1}^{n} p(x_i).

A particular code C is the matrix with the 2^{nR} codewords as its rows:

C = [ x_1(1)      x_2(1)      ...  x_n(1)
      x_1(2)      x_2(2)      ...  x_n(2)
      ...
      x_1(2^{nR}) x_2(2^{nR}) ...  x_n(2^{nR}) ]

The code C is revealed to both sender and receiver. Both sender and receiver are also assumed to know the channel transition matrix p(y|x).


Random Code
There are (|X|^n)^{2^{nR}} different codes.

The probability of a particular code C is

Pr(C) = ∏_{w=1}^{2^{nR}} ∏_{i=1}^{n} p(x_i(w)).

Transmission and Channel

A message W is chosen according to a uniform distribution:

Pr[W = w] = 2^{−nR},   w = 1, 2, . . . , 2^{nR}.

The wth codeword X^n(w), corresponding to the wth row of C, is sent over the channel.

The receiver receives a sequence Y^n according to the distribution

P(y^n | x^n(w)) = ∏_{i=1}^{n} p(y_i | x_i(w)).

That is, the DMC is used n times.

Jointly Typical Decoding

The receiver declares that the message Ŵ was sent if

(X^n(Ŵ), Y^n) is jointly typical, and

there is no other jointly typical pair for Y^n; that is, there is no other index W′ ≠ Ŵ such that (X^n(W′), Y^n) is jointly typical.

If no such Ŵ exists, or if there is more than one, an error is declared (set Ŵ = 0).

There is a decoding error if Ŵ ≠ W. Let E be the event {Ŵ ≠ W}.
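The random-coding argument on the following slides is a proof device, but the encoding and decoding rules themselves can be simulated. The sketch below (added for illustration, with assumed parameters) draws a random codebook for a BSC with crossover probability p = 0.1 and uniform codeword symbols, sends a uniformly chosen message, and decodes by joint typicality. The block length is short and ε deliberately loose, so the residual error is clearly visible; at a fixed rate R < C it shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

p, n, R, eps = 0.1, 60, 0.2, 0.3           # assumed parameters; C = 1 - H(0.1) ~ 0.531 > R
M = int(2 ** (n * R))                      # number of codewords

H = lambda q: -q*np.log2(q) - (1-q)*np.log2(1-q)
HXY = 1.0 + H(p)                           # H(X,Y) for a uniform input through BSC(p)

def decode(code, y):
    """Jointly typical decoding: return the unique w with (x^n(w), y^n) typical, else None.
    For uniform inputs, -(1/n) log p(x^n) and -(1/n) log p(y^n) both equal 1 = H(X) = H(Y),
    so only the joint condition can fail."""
    d = np.count_nonzero(code != y, axis=1)                     # Hamming distances to y^n
    neg_log_joint = 1.0 - ((n - d)/n)*np.log2(1 - p) - (d/n)*np.log2(p)
    hits = np.flatnonzero(np.abs(neg_log_joint - HXY) < eps)
    return int(hits[0]) if len(hits) == 1 else None

errors, trials = 0, 100
for _ in range(trials):
    code = rng.integers(0, 2, size=(M, n))                      # random codebook, Bernoulli(1/2)
    w = int(rng.integers(M))                                    # message, uniform on {0,...,M-1}
    y = code[w] ^ (rng.random(n) < p).astype(int)               # BSC output for codeword w
    errors += (decode(code, y) != w)
print("empirical error rate:", errors / trials)
```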

Proof of R < C Achievable

The average probability of error, averaged over all codewords in the codebook and over all codebooks, is

Pr(E) = Σ_C Pr(C) P_e^{(n)}(C) = Σ_C Pr(C) (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C) = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C Pr(C) λ_w(C),

where λ_w(C) is defined for jointly typical decoding.

By the symmetry of the code construction, Σ_C Pr(C) λ_w(C) does not depend on w.

Proof of R < C Achievable

Therefore,

Pr(E) = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C Pr(C) λ_w(C)
      = Σ_C Pr(C) λ_w(C)   for any w
      = Σ_C Pr(C) λ_1(C) = Pr(E | W = 1).

Define E_i = {(X^n(i), Y^n) is a jointly typical pair} for i = 1, 2, . . . , 2^{nR}, where Y^n is the channel output when the first codeword X^n(1) was sent. Then a decoding error is declared if

E_1^c: the transmitted codeword and the received sequence are not jointly typical, or

E_2 ∪ E_3 ∪ · · · ∪ E_{2^{nR}}: a wrong codeword is jointly typical with the received sequence.

Proof of R < C Achievable

Y^n is the channel output when the first codeword X^n(1) was sent.

E_1^c: the transmitted codeword and the received sequence are not jointly typical.

E_2, E_3, . . . , E_{2^{nR}}: wrong codewords that are jointly typical with the received sequence.

Proof of R < C Achievable

The average error satisfies

Pr(E | W = 1) = P(E_1^c ∪ E_2 ∪ E_3 ∪ · · · ∪ E_{2^{nR}} | W = 1)
             ≤ P(E_1^c | W = 1) + Σ_{i=2}^{2^{nR}} P(E_i | W = 1).

By the joint AEP,

P(E_1^c | W = 1) ≤ ε for n sufficiently large,

P(E_i | W = 1) ≤ 2^{−n(I(X;Y) − 3ε)} for i ≥ 2 (Y^n and X^n(i) are independent, so the probability that they are jointly typical is bounded by part 3 of the joint AEP).

Proof of R < C Achievable

We have

Pr(E | W = 1) ≤ ε + (2^{nR} − 1) 2^{−n(I(X;Y) − 3ε)}
             ≤ ε + 2^{nR} 2^{−n(I(X;Y) − 3ε)}
             = ε + 2^{−n(I(X;Y) − R − 3ε)}.

If I(X;Y) − R − 3ε > 0, then 2^{−n(I(X;Y) − R − 3ε)} < ε for n sufficiently large, and

Pr(E | W = 1) ≤ 2ε.

So far we have proved: for any ε > 0, if R < I(X;Y) − 3ε and n is sufficiently large, the average decoding error satisfies Pr(E) = Pr(E | W = 1) < 2ε.

What do we need? If R < C, the maximum error probability λ^{(n)} → 0. (We are almost there. Almost...)

Proof of R < C Achievable, final part

Choose p(x) such that I(X;Y) is maximized, i.e., such that I(X;Y) achieves the channel capacity C. Then the condition R < I(X;Y) − 3ε can be replaced by the achievability condition R < C − 3ε.

Since the average probability of error over codebooks is less than 2ε, there exists at least one codebook C* such that Pr(E | C*) < 2ε. (C* can be found by an exhaustive search over all codes.)

Since W is chosen uniformly, we have

Pr(E | C*) = (1/2^{nR}) Σ_{i=1}^{2^{nR}} λ_i(C*) < 2ε,

which implies that the maximal error probability of the better half of the codewords is less than 4ε. (Compare: if 10 students have an average score of 40, then at least half of them score no more than 80.)

Proof of R < C Achievable, final part

We throw away the worst half of the codewords in the best codebook C*. The new code has a maximal probability of error less than 4ε.

However, we now have a (2^{nR}/2, n) = (2^{n(R − 1/n)}, n) code. The rate of the new code is R − 1/n.

Summary: if R − 1/n < C − 3ε for any ε > 0, then λ^{(n)} ≤ 4ε for n sufficiently large.

7.8 Zero-Error Codes


No Error ⇒ R ≤ C

Assume that we have a (2^{nR}, n) code with zero probability of error.

Then W is determined by Y^n: p(g(Y^n) = W) = 1, so H(W | Y^n) = 0.

To obtain a strong bound, assume that W is uniformly distributed over {1, 2, . . . , 2^{nR}}.

nR = H(W) = H(W | Y^n) + I(W; Y^n) = I(W; Y^n)
   ≤ I(X^n; Y^n)   (data processing inequality: W → X^n(W) → Y^n)
   ≤ Σ_{i=1}^{n} I(X_i; Y_i)   (see next page)
   ≤ nC   (definition of channel capacity)

That is, no error ⇒ R ≤ C.

No Error ⇒ R ≤ C

Lemma 1 Let Y^n be the result of passing X^n through a discrete memoryless channel of capacity C. Then for all p(x^n),

I(X^n; Y^n) ≤ nC.

Proof.

I(X^n; Y^n) = H(Y^n) − H(Y^n | X^n)
            = H(Y^n) − Σ_{i=1}^{n} H(Y_i | Y_1, . . . , Y_{i−1}, X^n)
            = H(Y^n) − Σ_{i=1}^{n} H(Y_i | X_i)   (definition of DMC)
            ≤ Σ_{i=1}^{n} H(Y_i) − Σ_{i=1}^{n} H(Y_i | X_i) = Σ_{i=1}^{n} I(X_i; Y_i) ≤ nC. ∎

7.9 Fano's Inequality and the Converse to the Coding Theorem


Fano's Inequality

Theorem 6 (Fano's inequality) Let X and W have the same sample space X = {1, 2, . . . , M} and have the joint p.m.f. p(x, w). Let

Pe = Pr[X ≠ W] = Σ_{x ∈ X} Σ_{w ∈ X, w ≠ x} p(x, w).

Then

Pe log(M − 1) + H(Pe) ≥ H(X | W),

where

H(Pe) = −Pe log Pe − (1 − Pe) log(1 − Pe).
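A direct numerical spot-check of the inequality on a randomly generated joint pmf (a sketch with assumed values, added for illustration):

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

rng = np.random.default_rng(3)
M = 4
p_xw = rng.random((M, M))
p_xw /= p_xw.sum()                                   # a random joint pmf p(x, w)

Pe = 1.0 - np.trace(p_xw)                            # Pr[X != W]
H_X_given_W = H(p_xw.ravel()) - H(p_xw.sum(axis=0))  # H(X|W) = H(X,W) - H(W)
H_Pe = float(-Pe*np.log2(Pe) - (1-Pe)*np.log2(1-Pe))
lhs = Pe*np.log2(M - 1) + H_Pe
print(lhs, ">=", H_X_given_W, lhs >= H_X_given_W)    # Fano's inequality holds
```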

Fano's Inequality

Proof. We will prove that H(X|W) − H(Pe) − Pe log(M − 1) ≤ 0.

H(X|W) = Σ_x Σ_w p(x, w) log(1/p(x|w))
       = Σ_x Σ_{w ≠ x} p(x, w) log(1/p(x|w)) + Σ_x Σ_{w = x} p(x, w) log(1/p(x|w))

−Pe log(M − 1) = Σ_x Σ_{w ≠ x} p(x, w) log(1/(M − 1))

−H(Pe) = Pe log Pe + (1 − Pe) log(1 − Pe)
       = Σ_x Σ_{w ≠ x} p(x, w) log Pe + Σ_x Σ_{w = x} p(x, w) log(1 − Pe)

Add the above three terms together.

Fano's Inequality

Proof (cont.)

H(X|W) − Pe log(M − 1) − H(Pe)
= Σ_x Σ_{w ≠ x} p(x, w) log( Pe / ((M − 1) p(x|w)) ) + Σ_x Σ_{w = x} p(x, w) log( (1 − Pe) / p(x|w) )
≤ log[ Σ_x Σ_{w ≠ x} p(x, w) Pe / ((M − 1) p(x|w)) + Σ_x Σ_{w = x} p(x, w) (1 − Pe) / p(x|w) ]   (by concavity of the logarithm)
= log[ (Pe / (M − 1)) Σ_x Σ_{w ≠ x} p(w) + (1 − Pe) Σ_x Σ_{w = x} p(w) ]
= log[ Pe + (1 − Pe) ] = 0 ∎

Fano's Inequality

Corollary 1

1. Pe log M + H(Pe) ≥ H(X|W), where Pe = Pr[X ≠ W].

2. 1 + Pe log M ≥ H(X|W), where Pe = Pr[X ≠ W].

3. If X → Y → X̂ and Pe = Pr[X ≠ X̂], then H(Pe) + Pe log M ≥ H(X|X̂) ≥ H(X|Y).

Remark.

1. H(X|W) ≤ Pe log(M − 1) + H(Pe) ≤ Pe log M + H(Pe).

2. H(X|W) ≤ Pe log(M − 1) + H(Pe) ≤ Pe log M + 1.

3. The second inequality can be obtained by the data processing inequality.

Data Processing Inequality


Lemma 2 (Data processing inequality) If X → Y → Z, then

I(X; Z) ≤ I(X; Y).

Proof.

I(X; Z) − I(X; Y)
= H(X) − H(X|Z) − [H(X) − H(X|Y)] = H(X|Y) − H(X|Z)
= Σ_x Σ_y p(x, y) log(1/p(x|y)) − Σ_x Σ_z p(x, z) log(1/p(x|z))
= Σ_x Σ_y Σ_z p(x, y, z) log(1/p(x|y)) − Σ_x Σ_y Σ_z p(x, y, z) log(1/p(x|z))
= Σ_x Σ_y Σ_z p(x, y, z) log( p(x|z) / p(x|y) )
≤ log( Σ_x Σ_y Σ_z p(x, y, z) p(x|z) / p(x|y) )   (by concavity of the logarithm)

Data Processing Inequality


Proof (cont.) Since X → Y → Z, we have

p(x, y, z) = p(x, y) p(z|x, y) = p(x, y) p(z|y) = p(x, y) p(y, z) / p(y)

and

p(x, y, z) · p(x|z)/p(x|y) = [p(x, y) p(y, z)/p(y)] · [p(x, z) p(y)/(p(z) p(x, y))] = p(x, z) p(y, z) / p(z).

Therefore,

Σ_x Σ_y Σ_z p(x, y, z) p(x|z)/p(x|y) = Σ_x Σ_y Σ_z p(x, z) p(y, z)/p(z)
= Σ_x Σ_z [p(x, z)/p(z)] Σ_y p(y, z) = Σ_x Σ_z p(x, z) = 1,

so I(X; Z) − I(X; Y) ≤ log 1 = 0. ∎

Data Processing Inequality (Summary)


Lemma 3

1. If X → Y → Z, then I(X; Z) ≤ min{ I(X; Y), I(Y; Z) } and H(X|Y) ≤ H(X|Z).

2. If X → Y → Z → W, then I(X; Z) + I(Y; W) ≤ I(X; W) + I(Y; Z), and I(X; W) ≤ I(Y; Z).
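These relations are easy to spot-check numerically. The sketch below (added for illustration; the distributions are assumed, random values) builds a Markov chain X → Y → Z from random transition matrices and verifies I(X;Z) ≤ I(X;Y) and I(X;Z) ≤ I(Y;Z).

```python
import numpy as np

rng = np.random.default_rng(7)

def normalize_rows(A):
    return A / A.sum(axis=1, keepdims=True)

def mi(p_ab):
    """I(A;B) in bits from a joint pmf p_ab[a, b]."""
    pa, pb = p_ab.sum(axis=1), p_ab.sum(axis=0)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / np.outer(pa, pb)[mask])))

# Markov chain X -> Y -> Z with random pmfs (example values).
p_x = rng.random(3); p_x /= p_x.sum()
P_yx = normalize_rows(rng.random((3, 3)))        # p(y|x)
P_zy = normalize_rows(rng.random((3, 3)))        # p(z|y)

p_xy = p_x[:, None] * P_yx                       # p(x, y)
p_yz = p_xy.sum(axis=0)[:, None] * P_zy          # p(y, z)
p_xz = p_xy @ P_zy                               # p(x, z) = sum_y p(x, y) p(z|y)

print(mi(p_xz) <= mi(p_xy), mi(p_xz) <= mi(p_yz))   # both True (data processing inequality)
```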

Achievable ⇒ R ≤ C

Theorem 7 (Converse to the channel coding theorem) Any sequence of (2^{nR}, n) codes with λ^{(n)} → 0 must have R ≤ C.

Proof.

For a fixed encoding rule X^n(W) and a fixed decoding rule Ŵ = g(Y^n), we have W → X^n(W) → Y^n → Ŵ.

For each n, let W be drawn according to a uniform distribution over {1, 2, . . . , 2^{nR}}.

Since W has a uniform distribution,

Pr[Ŵ ≠ W] = P_e^{(n)} = (1/2^{nR}) Σ_{i=1}^{2^{nR}} λ_i.

Achievable ⇒ R ≤ C

Proof (cont.)

nR = H(W)   (W is uniformly distributed)
   = H(W | Ŵ) + I(W; Ŵ)
   ≤ 1 + P_e^{(n)} nR + I(W; Ŵ)   (Fano's inequality)
   ≤ 1 + P_e^{(n)} nR + I(X^n; Y^n)   (data processing inequality)
   ≤ 1 + P_e^{(n)} nR + nC   (Lemma 1)

Hence

P_e^{(n)} ≥ 1 − C/R − 1/(nR).

That is, if R > C, the probability of error is bounded away from zero for sufficiently large n; the error probability cannot be made arbitrarily small. ∎
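A quick numeric reading of this bound (illustrative, assumed values):

```python
C, R, n = 0.5, 0.75, 1000    # assumed values: a rate 50% above capacity, block length 1000
print(1 - C/R - 1/(n*R))     # ~0.332: the average error probability is at least about a third
```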
