
Advanced Topics

in Information Theory
Lecture Notes

Stefan M. Moser

2nd Edition 2013

© Copyright Stefan M. Moser



Signal and Information Processing Lab
ETH Zürich
Zurich, Switzerland
Department of Electrical and Computer Engineering
National Chiao Tung University (NCTU)
Hsinchu, Taiwan
You are welcome to use these lecture notes for yourself, for teaching, or for any
other noncommercial purpose. If you use extracts from these lecture notes,
please make sure that their origin is shown. The author assumes no liability
or responsibility for any errors or omissions.

2nd Edition 2013.


Version 2.5.
Compiled on 31 August 2015.
For the latest version see http://moser-isi.ethz.ch/scripts.html

Contents

Preface

1  Mathematical Preliminaries
   1.1  Review of some Definitions
   1.2  Some Important Inequalities
   1.3  Fourier–Motzkin Elimination
   1.4  Law of Large Numbers
   1.5  Additional Tools

2  Method of Types
   2.1  Types
   2.2  Properties of Types
   2.3  Joint Types
   2.4  Conditional Types
   2.5  Remarks on Notation

3  Large Deviation Theory
   3.1  Sanov's Theorem
   3.2  Pythagorean Theorem for Relative Entropy
   3.3  The Pinsker Inequality
   3.4  Conditional Limit Theorem

4  Strong Typicality
   4.1  Strongly Typical Sets
   4.2  Jointly Strongly Typical Sets
   4.3  Conditionally Strongly Typical Sets
   4.4  Accidental Typicality
   4.A  Appendix: Alternative Definition of Conditionally Strongly Typical Sets

5  Rate Distortion Theory
   5.1  Motivation: Quantization of a Continuous RV
   5.2  Definitions and Assumptions
   5.3  The Information Rate Distortion Function
   5.4  Rate Distortion Coding Theorem
        5.4.1  Converse
        5.4.2  Achievability
        5.4.3  Discussion
   5.5  Characterization of R(D)
   5.6  Further Properties of R(D)
   5.7  Joint Source and Channel Coding Scheme
   5.8  Information Transmission System: Transmitting above Capacity
   5.9  Rate Distortion for the Gaussian Source
        5.9.1  Rate Distortion Coding Theorem
        5.9.2  Parallels to Channel Coding
        5.9.3  Simultaneous Description of m Independent Gaussians

6  Error Exponents in Source Coding
   6.1  Introduction
   6.2  Strong Converse of the Rate Distortion Coding Theorem
   6.3  Rate Distortion Error Exponent
        6.3.1  Type Covering Lemma
        6.3.2  Achievability
        6.3.3  Converse

7  The Multiple Description Problem
   7.1  Problem Description
   7.2  An Example
   7.3  A Random Coding Scheme
   7.4  Performance Analysis of Our Random Coding Scheme
        7.4.1  Case 1
        7.4.2  Case 3
        7.4.3  Case 2
        7.4.4  Analysis Put Together
   7.5  An Improvement to the Achievable Region
        7.5.1  New Random Coding Scheme
        7.5.2  Analysis of Case 2
        7.5.3  Adapting Setup to Match Situation of Section 7.1
   7.6  Convexity
   7.7  Main Result

8  Rate Distortion with Side-Information: Wyner–Ziv Problem
   8.1  Introduction
   8.2  A Random Coding Scheme: Binning
        8.2.1  Case 1
        8.2.2  Case 2
        8.2.3  Case 3
        8.2.4  Case 4
        8.2.5  Case 5
        8.2.6  Analysis Put Together
   8.3  The Wyner–Ziv Rate Distortion Function
   8.4  Properties of R_WZ(·)
   8.5  Converse
   8.6  Summary

9  Distributed Lossless Data-Compression: Slepian–Wolf Problem
   9.1  Problem Statement and Main Result
   9.2  New Lossless Data Compression Scheme based on Bins
   9.3  Achievability
   9.4  Converse
   9.5  Discussion: Colors instead of Bins
   9.6  Generalizations
   9.7  Zero-Error Compression

10  The Multiple-Access Channel
    10.1  Problem Setup
    10.2  Time-Sharing: Convexity of Capacity Region
    10.3  Some Illustrative Examples for the MAC
    10.4  The MAC Capacity Region
          10.4.1  Achievability of C1
          10.4.2  Capacity Region C2 Being a Subset of C1
          10.4.3  Converse of C2
    10.5  Some Observations and Discussion
          10.5.1  C1 with Fixed Distribution QX(1) QX(2)
          10.5.2  Convex Hull of two Pentagons
          10.5.3  General Shape of the MAC Capacity Region
    10.6  Multiple-User MAC
    10.7  Gaussian MAC
          10.7.1  Capacity Region
          10.7.2  Discussion
          10.7.3  CDMA versus TDMA or FDMA
    10.8  Historical Remarks

11  Transmission of Correlated Sources over a MAC
    11.1  Problem Setup
    11.2  A Joint Source Channel Coding Scheme
    11.3  Discussion and Improved Joint Source Channel Coding Scheme

12  Channels with Noncausal Side-Information: Gelfand–Pinsker Problem
    12.1  Introduction
    12.2  A Random Coding Scheme
          12.2.1  Case 1
          12.2.2  Case 2
          12.2.3  Case 3
          12.2.4  Case 4
          12.2.5  Analysis Put Together
    12.3  The Gelfand–Pinsker Rate
    12.4  Converse
    12.5  Summary
    12.6  Writing on Dirty Paper
    12.7  Different Types of Side-Information
    12.A  Appendix: Concavity of Gelfand–Pinsker Rate in Cost Constraint

13  The Broadcast Channel
    13.1  Problem Setup
    13.2  Some Important Observations
    13.3  Some Special Classes of Broadcast Channels
          13.3.1  Degraded Broadcast Channel
          13.3.2  Broadcast Channel with Less Noisy Output
          13.3.3  The Broadcast Channel with More Capable Output
    13.4  Superposition Coding
    13.5  Nair–El Gamal Outer Bound
    13.6  Capacity Regions of Some Special Cases of BCs
    13.7  Achievability based on Binning
    13.8  Best Known Achievable Region: Marton's Region
    13.9  Some More Outer Bounds
    13.10 Gaussian BC

14  The Multiple-Access Channel with Common Message
    14.1  Problem Setup
    14.2  An Achievable Region based on Superposition Coding
    14.3  Converse
    14.4  Capacity Region

15  Discrete Memoryless Networks and the Cut-Set Bound
    15.1  Discrete Memoryless Networks
    15.2  Cut-Set Bound
    15.3  Examples
          15.3.1  Broadcast Channel
          15.3.2  Multiple-Access Channel
          15.3.3  Single-Relay Channel
          15.3.4  Double-Relay Channel

16  The Interference Channel
    16.1  Problem Setup
    16.2  Some Simple Capacity Region Outer Bounds
          16.2.1  Cut-Set Bound
          16.2.2  Sato's Outer Bound
    16.3  Some Simple Capacity Region Inner Bounds
    16.4  Strong and Very Strong Interference
    16.5  Han–Kobayashi Region
          16.5.1  Superposition Coding with Rate Splitting
          16.5.2  Fourier–Motzkin Elimination
          16.5.3  Best Known Achievable Rate Region
    16.6  Gaussian IC
          16.6.1  Channel Model
          16.6.2  Outer Bound
          16.6.3  Basic Communication Strategies
          16.6.4  Strong and Very Strong Interference
          16.6.5  Han–Kobayashi Region for Gaussian IC
          16.6.6  Symmetric Degrees of Freedom

Bibliography

List of Figures

List of Tables

Index


Preface
As its title indicates, this course covers some more advanced topics in information theory. The script can be split into roughly three parts:
• Chapters 1–4 cover results and tools that will be needed in the proofs
  later on;
• Chapters 5–9 deal with advanced topics related to data compression;
  and
• Chapters 10–16 cover advanced topics of data transmission.
More in detail, after a chapter reviewing some definitions and basic mathematical and information theoretical facts, in Chapter 2, we introduce the
foundation on which the rest of this script is built: types. We study types in
detail, but first do not give any information-theory related motivation yet. We
simply ask the reader at that stage to enjoy the beauty of these mathematical
ideas and not to worry too much about their relation to communications. As
types are the most important tool throughout the course, we only talk about
our notation in Section 2.5 after we have introduced types. More comments
about notation can be found in Remark 4.2 in Section 4.1.
Chapter 3 then makes a detour into a fancy area of probability theory:
large deviation theory. Again, the connection to communications will not be
directly visible, however, we will rely on it later on in some proofs. And once
more, the results by themselves are beautiful! In Chapter 4, we then mold the
basic ideas of Chapter 2 into a form that will serve us as main tool in most of
the proofs in the remainder of this course: We define strongly typical sets and
discuss their properties.
After these lengthy preparations, we then finally start with information
theory. In Chapters 5 and 6, we discuss lossy compression: the rate distortion
theory. Rate distortion coding can be seen as a dual to channel coding: While
in channel coding we ask at what rates information can be transmitted for a
given available power such that the error probability is arbitrarily small, rate
distortion coding deals with the minimum necessary description rate needed to
compress a source such that it can be reconstructed within a given distortion
with arbitrarily small error probability. Chapter 6 further deepens the results
of Chapter 5 by looking at how quickly the error probability tends to zero
when the blocklength tends to infinity.


Chapters 7–9 cover extensions of the idea of rate distortion. In Chapter 7,
we try to split a description into several parts in such a way that in case one
part is lost, one can still recover the data (but with slightly larger distortion).
In Chapter 8, we assume that the decoder has access to side-information about
the compressed source that should help in the reconstruction of the data. And
in Chapter 9, we look at a generalization of rate distortion theory to multiple
users: Two separate data compressors simultaneously and independently compress correlated data that need to be recovered in a lossless fashion at a joint
decoder.
Then for the last part of the script, we change to data transmission. Chapter 10 treats the first multiple-user data transmission setup: the multiple-access channel where several users try to communicate simultaneously with
a receiver. Chapter 11 then combines the distributed compression setup of
Chapter 9 with the multiple-access channel of Chapter 10. In Chapter 12, we
look at data transmission with noncausal side-information at the transmitter
side. Then, Chapter 13 discusses the opposite situation of Chapter 10: In a
broadcast channel, a single transmitter tries to communicate simultaneously
with several receivers. In Chapter 14, we come back to the multiple-access
channel and treat a more general model with an additional common message.
In Chapter 15, a generalized model for multi-terminal networks is considered
and used to demonstrate the Cut-Set Bound. Finally, in the last chapter,
Chapter 16, the interference channel is discussed.
Even though these lecture notes are a continuation of the introductory
course Information Theory [Mos14], they are actually quite independent and
self-contained. So a detailed study of the first course is not strictly required.
However, it is assumed that the student has a solid knowledge in information
theory, especially about its main quantities (entropy, differential entropy, and
mutual information) and their properties. Moreover, a saddle-fast and firm
know-how in probability theory is essential.
In contrast to the basic information theory course, where one can find a
vast amount of textbooks, there are not many textbooks covering the same
range of topics as this course. For the interested reader, I would like to give a
short (and not exclusive) list of recommended readings:

• "Topics in Multi-user Information Theory" by Gerhard Kramer [Kra07]
  covers rate distortion and multiple-user data transmission in a concise
  and detailed, but still quite compact form.

• The tool of typicality has mainly been developed and promoted by Imre
  Csiszár and János Körner. Their textbook "Information Theory: Coding
  Theorems for Discrete Memoryless Systems" [CK81], [CK11] is the
  foundation of any treatment on types and its derivatives. In the same
  textbook one can also find a detailed treatise of error exponents in rate
  distortion theory.

• The famous textbook by Tom Cover and Joy Thomas, "Elements of
  Information Theory" [CT06], also covers many of the topics of this course.
  The book's main strength is its intuitive description of the problems and
  their proofs. This intuitive approach, however, sometimes comes at the
  cost of a loss in accuracy.

• For some topics, there exist quite accessible journal papers presenting the
  newest research results alongside a good summary of the basic results up
  to that point. For example, an easy-to-read introduction to the multiple
  description problem can be found in [VKG03]. It is also worth having
  a look at the extensive list of references to other literature that is given
  in this paper.

The following works could also be included in our list of readings.

• Imre Csiszár and Paul Shields discuss the relations between information
  theory and statistics in [CS04].

• Abbas El Gamal and Young-Han Kim have recently published a book
  about network information theory [EGK11], [EGK10].

• Another book on network information theory has been written by
  Raymond Yeung [Yeu08].
I will keep working on these notes and try to improve them continually.
So if you find typos, errors, or if you have any comments about these notes, I
would be very happy to hear them! Write to

stefan.moser@alumni.ethz.ch
Thanks!
Finally and once again I must express my deepest gratitude to Yin-Tai and
Matthias who kept encouraging me during the whole project and particularly
towards the end phase when the schedule got tight.

Stefan M. Moser



Chapter 1

Mathematical Preliminaries
In this chapter, we prepare some mathematical tools that we need in our
derivations in this course. We start in Section 1.1 with a very brief review
of the main definitions in information theory. Then Section 1.2 summarizes
some important inequalities that stem partially from information theory itself
and partially are actually probability theory related. Section 1.3 reviews the
tool of Fourier–Motzkin elimination. In Section 1.4 we repeat the laws of large
numbers that will be the foundation of many of our proofs. And Section 1.5
presents some further statements and tools that are actually quite far away
from the core topics of this course, but that are still important for our analysis.

1.1  Review of some Definitions

The following definitions and results are all stated without motivation or
proofs. The details can be found in [Mos14].
Definition 1.1. The entropy of a discrete random variable (RV) X is defined
as
\begin{align}
  H(X) &\triangleq -\sum_{x \in \mathrm{supp}(P_X)} P_X(x)\,\log P_X(x)  \tag{1.1}\\
       &= \mathrm{E}\bigl[-\log P_X(X)\bigr],  \tag{1.2}
\end{align}
where P_X(·) denotes the probability mass function (PMF) of the RV X and
where we use supp(P_X) ⊆ X to denote the set of all symbols a for which
P_X(a) > 0. Note that often we omit the explicit mentioning of the support
supp(P_X), but simply write x ∈ X. In such cases we implicitly always regard
the summations to be only over those values in X for which P_X(x) > 0.
Another way of looking at this is to say that 0 log 0 is defined to be equal
to 0.
Recall the most important properties of entropy.


Proposition 1.2. Entropy is nonnegative and maximized for a uniform
distribution:
\[
  0 \le H(X) \le \log|\mathcal{X}|,  \tag{1.3}
\]
where we use |X| to denote the size of the set X. Conditioning reduces entropy
(or at least it does not increase it...):
\[
  H(X) \ge H(X \mid Y).  \tag{1.4}
\]
Finally, the chain rule is given as
\[
  H(X_1, X_2, \ldots, X_n) = \sum_{k=1}^{n} H(X_k \mid X_1, X_2, \ldots, X_{k-1}).  \tag{1.5}
\]

Definition 1.3. Let X be a continuous random variable with probability
density function (PDF) f_X(·). Then we define the differential entropy h(X)
as follows:
\begin{align}
  h(X) &\triangleq -\int f_X(x)\,\log f_X(x)\,\mathrm{d}x  \tag{1.6}\\
       &= \mathrm{E}\bigl[-\log f_X(X)\bigr].  \tag{1.7}
\end{align}
While differential entropy can be negative, the property that conditioning
reduces entropy still holds, and it also satisfies the chain rule.

reduces entropy still holds and it also follows the chain rules.
Definition 1.4. The relative entropy between two PMFs Q1(·) and Q2(·) over
the same alphabet X is defined as
\[
  D(Q_1 \,\|\, Q_2) \triangleq \sum_{x \in \mathrm{supp}(Q_1)} Q_1(x)\,\log\frac{Q_1(x)}{Q_2(x)}
  = \mathrm{E}_{Q_1}\!\left[\log\frac{Q_1(X)}{Q_2(X)}\right].  \tag{1.8}
\]
Note that if for some x ∈ supp(Q1) we have Q2(x) = 0, then D(Q1 ‖ Q2) = ∞.

Similarly, the relative entropy between two PDFs f1(·) and f2(·) is defined
as¹
\[
  D(f_1 \,\|\, f_2) \triangleq \int_{x \in \mathrm{supp}(f_1)} f_1(x)\,\log\frac{f_1(x)}{f_2(x)}\,\mathrm{d}x
  = \mathrm{E}_{f_1}\!\left[\log\frac{f_1(X)}{f_2(X)}\right].  \tag{1.9}
\]

Proposition 1.5. The relative entropy's most important property is its
nonnegativity:
\[
  D(Q_1 \,\|\, Q_2) \ge 0  \tag{1.10}
\]
or
\[
  D(f_1 \,\|\, f_2) \ge 0.  \tag{1.11}
\]

¹ Note that if the set G ≜ {x ∈ supp(f1) : f2(x) = 0} has positive Lebesgue measure,
then D(f1 ‖ f2) = ∞. Otherwise, the points in G are ignored in the integration (1.9).


Definition 1.6. The mutual information between the discrete RVs X and Y
with joint PMF P_{X,Y} is defined as
\[
  I(X;Y) \triangleq D(P_{X,Y} \,\|\, P_X P_Y) = H(X) - H(X \mid Y).  \tag{1.12}
\]
Similarly, the mutual information between two continuous RVs X and Y with
joint PDF f_{X,Y} is defined as
\[
  I(X;Y) \triangleq D(f_{X,Y} \,\|\, f_X f_Y) = h(X) - h(X \mid Y).  \tag{1.13}
\]
The further generalization of the mutual information functional to arguments
being a mixture of discrete and continuous random variables is slightly more
subtle, but basically well-behaved. We refer to [Mos14, Section 15.3] for a
brief discussion.
Proposition 1.7. Mutual information is nonnegative,
\[
  I(X;Y) \ge 0,  \tag{1.14}
\]
and satisfies the chain rule
\[
  I(X; Y_1, Y_2, \ldots, Y_n) = \sum_{k=1}^{n} I(X; Y_k \mid Y_1, Y_2, \ldots, Y_{k-1}).  \tag{1.15}
\]
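As a quick numerical companion to these definitions (not part of the original notes), the following Python sketch computes entropy, relative entropy, and mutual information for finite alphabets in nats; the function names and the example joint PMF are our own choices.

```python
import numpy as np

def entropy(p):
    """Entropy in nats; terms with p = 0 contribute 0 (convention 0 log 0 = 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def relative_entropy(q1, q2):
    """D(q1 || q2) in nats; returns inf if q2(x) = 0 while q1(x) > 0."""
    q1, q2 = np.asarray(q1, float), np.asarray(q2, float)
    nz = q1 > 0
    if np.any(q2[nz] == 0):
        return np.inf
    return np.sum(q1[nz] * np.log(q1[nz] / q2[nz]))

def mutual_information(p_xy):
    """I(X;Y) = D(P_XY || P_X P_Y) for a joint PMF given as a 2-D array."""
    p_xy = np.asarray(p_xy, float)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    return relative_entropy(p_xy.ravel(), np.outer(p_x, p_y).ravel())

# Example: a binary symmetric joint distribution (arbitrary numbers).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(entropy(p_xy.sum(axis=1)))   # H(X) = log 2 ~ 0.693 nats
print(mutual_information(p_xy))    # I(X;Y) ~ 0.193 nats
```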

Remark 1.8. Note that we usually will omit to mention that the sum (or
integral) is only over the corresponding support. Instead we use the notational
convention that
\[
  0 \log 0 \triangleq 0.  \tag{1.16}
\]


1.2  Some Important Inequalities

Some of the following inequalities are well-known, some maybe less, but all of
them are going to be important in various places of this script. It therefore
makes sense to summarize them here already.
The first inequality is taken over from [Mos14]. We named it, according to
the suggestion of Prof. James L. Massey, retired professor at ETH in Zurich,
the Information Theory Inequality or the IT Inequality.
Theorem 1.9 (IT Inequality). For any base b > 0 and any ξ > 0,
\[
  \left(1 - \frac{1}{\xi}\right)\log_b e \;\le\; \log_b \xi \;\le\; (\xi - 1)\log_b e  \tag{1.17}
\]
with equalities on both sides if, and only if, ξ = 1.

Proof: We start with the upper bound. First note that
\[
  \log_b \xi\,\big|_{\xi=1} = 0 = (\xi - 1)\log_b e\,\big|_{\xi=1}.  \tag{1.18}
\]

Then have a look at the derivatives:
\[
  \frac{\mathrm{d}}{\mathrm{d}\xi}\,(\xi - 1)\log_b e = \log_b e  \tag{1.19}
\]
and
\[
  \frac{\mathrm{d}}{\mathrm{d}\xi}\,\log_b \xi = \frac{1}{\xi}\log_b e
  \begin{cases}
    > \log_b e & \text{if } 0 < \xi < 1,\\
    < \log_b e & \text{if } \xi > 1.
  \end{cases}  \tag{1.20}
\]
Hence, the two functions coincide at ξ = 1, and the linear function is above
the logarithm for all other values.

To prove the lower bound again note that
\[
  \left(1 - \frac{1}{\xi}\right)\log_b e\,\bigg|_{\xi=1} = 0 = \log_b \xi\,\big|_{\xi=1}  \tag{1.21}
\]
and
\[
  \frac{\mathrm{d}}{\mathrm{d}\xi}\left(1 - \frac{1}{\xi}\right)\log_b e = \frac{1}{\xi^2}\log_b e
  \begin{cases}
    > \dfrac{\mathrm{d}}{\mathrm{d}\xi}\log_b \xi = \dfrac{1}{\xi}\log_b e & \text{if } 0 < \xi < 1,\\[1ex]
    < \dfrac{\mathrm{d}}{\mathrm{d}\xi}\log_b \xi = \dfrac{1}{\xi}\log_b e & \text{if } \xi > 1,
  \end{cases}  \tag{1.22}
\]
similarly to above.
Corollary 1.10 (Exponentiated IT Inequality). For any ν > 0, we have
\[
  (1 - \alpha)^{\nu} \le e^{-\alpha\nu}, \qquad \alpha \le 1,  \tag{1.23}
\]
with equality if, and only if, α = 0.

Proof: Choosing the natural logarithm in the upper bound of (1.17) and
exponentiating both sides, we have
\[
  \xi \le e^{\xi - 1}.  \tag{1.24}
\]
The result now follows by choosing ξ ≜ 1 − α and exponentiating both sides
by ν. We only have to be careful that for α > 1 we have 1 − α < 0, which will
cause problems when exponentiating. So we exclude this case.
Proposition 1.11 (Log-Sum Inequality). For any a_i ≥ 0 and b_i > 0,
i = 1, ..., n, we have
\[
  \sum_{i=1}^{n} a_i \log\frac{a_i}{b_i}
  \;\ge\; \left(\sum_{i=1}^{n} a_i\right) \log\frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}  \tag{1.25}
\]
with equality if, and only if, a_i/b_i is the same for all i.

Proof: Define the constants
\[
  A \triangleq \sum_{i=1}^{n} a_i, \qquad B \triangleq \sum_{i=1}^{n} b_i.  \tag{1.26}
\]

If A = 0, then (by Remark 1.8) both sides of the inequality are zero (thereby
achieving equality).
So, assume that A > 0 (B is positive by assumption). We define the two
PMFs
\begin{align}
  Q_a(i) &\triangleq \frac{a_i}{A}, \qquad i = 1, \ldots, n,  \tag{1.27}\\
  Q_b(i) &\triangleq \frac{b_i}{B}, \qquad i = 1, \ldots, n.  \tag{1.28}
\end{align}
By the nonnegativity of relative entropy (Proposition 1.5) it now follows that
\begin{align}
  0 &\le D(Q_a \,\|\, Q_b)  \tag{1.29}\\
    &= \sum_{i=1}^{n} \frac{a_i}{A}\,\log\frac{a_i/A}{b_i/B}  \tag{1.30}\\
    &= \log\frac{B}{A}\cdot\frac{1}{A}\sum_{i=1}^{n} a_i
       + \frac{1}{A}\sum_{i=1}^{n} a_i \log\frac{a_i}{b_i}  \tag{1.31}\\
    &= \log\frac{B}{A} + \frac{1}{A}\sum_{i=1}^{n} a_i \log\frac{a_i}{b_i}.  \tag{1.32}
\end{align}
Hence,
\[
  \sum_{i=1}^{n} a_i \log\frac{a_i}{b_i} \;\ge\; A \log\frac{A}{B}.  \tag{1.33}
\]
In this case of positive A, equality can only be reached if D(Q_a ‖ Q_b) = 0, i.e.,
if a_i/A = b_i/B for all i = 1, ..., n.
if ai = bi for all i = 1, . . . , n.
The first part of the following proposition has again been taken over from
[Mos14].
Proposition 1.12 (Data Processing Inequalities (DPI)). Assume that
\[
  X \;\multimap\; Y \;\multimap\; Z  \tag{1.34}
\]
forms a Markov chain, i.e., I(X; Z|Y) = 0. Then
\[
  I(X; Z) \le I(X; Y), \qquad I(X; Z) \le I(Y; Z).  \tag{1.35}
\]
Moreover, suppose that Y1 and Y2 are the respective outputs of a discrete²
channel Q_{Y|X}(·|·) with inputs X1 and X2. Then
\[
  D(Q_{Y_1} \,\|\, Q_{Y_2}) \le D(Q_{X_1} \,\|\, Q_{X_2})  \tag{1.36}
\]
with equality if, and only if, Q_{X_1} = Q_{X_2}.

² A similar result can be stated for the case of continuous RVs with densities.

Proof: We only prove the case of discrete RVs. Since I(X; Z|Y) = 0, we
have H(X|Y) = H(X|Y, Z). Hence,
\begin{align}
  I(X; Z) &= H(X) - H(X|Z)  \tag{1.37}\\
          &\le H(X) - H(X|Y, Z)  \tag{1.38}\\
          &= H(X) - H(X|Y)  \tag{1.39}\\
          &= I(X; Y),  \tag{1.40}
\end{align}
where the inequality follows from conditioning that reduces entropy.

To prove the second part, we rely on the Log-Sum Inequality (Proposition 1.11):
\begin{align}
  D(Q_{Y_1} \,\|\, Q_{Y_2})
  &= \sum_{y} Q_{Y_1}(y)\,\log\frac{Q_{Y_1}(y)}{Q_{Y_2}(y)}  \tag{1.41}\\
  &= \sum_{y} \left(\sum_{x} Q_{X_1}(x)\,Q_{Y|X}(y|x)\right)
     \log\frac{\sum_{x} Q_{X_1}(x)\,Q_{Y|X}(y|x)}{\sum_{x} Q_{X_2}(x)\,Q_{Y|X}(y|x)}  \tag{1.42}\\
  &\le \sum_{y}\sum_{x} Q_{X_1}(x)\,Q_{Y|X}(y|x)
     \log\frac{Q_{X_1}(x)\,Q_{Y|X}(y|x)}{Q_{X_2}(x)\,Q_{Y|X}(y|x)}  \tag{1.43}\\
  &= \sum_{y}\sum_{x} Q_{X_1}(x)\,Q_{Y|X}(y|x)\,\log\frac{Q_{X_1}(x)}{Q_{X_2}(x)}  \tag{1.44}\\
  &= \sum_{x} Q_{X_1}(x)\left(\sum_{y} Q_{Y|X}(y|x)\right)\log\frac{Q_{X_1}(x)}{Q_{X_2}(x)}  \tag{1.45}\\
  &= \sum_{x} Q_{X_1}(x)\,\log\frac{Q_{X_1}(x)}{Q_{X_2}(x)} = D(Q_{X_1} \,\|\, Q_{X_2}).  \tag{1.46}
\end{align}

The next statement is a generalization of Fano's famous inequality (see
[Mos14, Section 9.6]).

Proposition 1.13 (Fano Inequality [Fan61]). Let Û be a guess about U
made from an observation V that is related to U by a joint distribution Q_{U,V}.
Note that this setup follows the Markov structure
\[
  U \;\multimap\; V \;\multimap\; \hat{U}.  \tag{1.47}
\]
We define the error probability
\[
  P_e \triangleq \Pr\bigl[U \ne \hat{U}\bigr].  \tag{1.48}
\]
Then
\[
  H(U|V) \;\le\; H(U|\hat{U}) \;\le\; H_b(P_e) + P_e \log\bigl(|\mathcal{U}| - 1\bigr)
  \;\le\; \log 2 + P_e \log|\mathcal{U}|.  \tag{1.49}
\]

Proof: Define the indicator RV (it indicates an error!)
\[
  Z \triangleq
  \begin{cases}
    1 & \text{if } \hat{U} \ne U,\\
    0 & \text{if } \hat{U} = U,
  \end{cases}  \tag{1.50}
\]
such that
\begin{align}
  P_Z(1) &= P_e,  \tag{1.51}\\
  P_Z(0) &= 1 - P_e,  \tag{1.52}\\
  H(Z) &= H_b(P_e).  \tag{1.53}
\end{align}
Then use the chain rule to derive the following:
\[
  H(U, Z|\hat{U}) = H(U|\hat{U}) + \underbrace{H(Z|U, \hat{U})}_{=0} = H(U|\hat{U})  \tag{1.54}
\]
and
\begin{align}
  H(U, Z|\hat{U}) &= H(Z|\hat{U}) + H(U|\hat{U}, Z)  \tag{1.55}\\
  &\le H(Z) + H(U|\hat{U}, Z)  \tag{1.56}\\
  &= H_b(P_e) + P_Z(0)\,\underbrace{H(U|\hat{U}, Z=0)}_{=0 \text{ because } U=\hat{U}}
     + P_Z(1)\,\underbrace{H(U|\hat{U}, Z=1)}_{\le \log(|\mathcal{U}|-1) \text{ because } U \ne \hat{U}}  \tag{1.57}\\
  &\le H_b(P_e) + P_e \log\bigl(|\mathcal{U}| - 1\bigr),  \tag{1.58}
\end{align}
where the first inequality follows from conditioning that cannot increase
entropy. This proves the inner (second) inequality.

The first inequality follows from the data processing inequality (1.35):
\begin{align}
  I(U; \hat{U}) &\le I(U; V)  \tag{1.59}\\
  \Longrightarrow\quad H(U) - H(U|\hat{U}) &\le H(U) - H(U|V)  \tag{1.60}\\
  \Longrightarrow\quad H(U|\hat{U}) &\ge H(U|V).  \tag{1.61}
\end{align}
The third inequality holds because
\begin{align}
  \log\bigl(|\mathcal{U}| - 1\bigr) &\le \log|\mathcal{U}|,  \tag{1.62}\\
  H_b(P_e) &\le \log 2.  \tag{1.63}
\end{align}
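As an illustration (not from the original notes), the following Python sketch checks Fano's Inequality numerically for a randomly chosen joint distribution Q_{U,V} and the MAP guessing rule; the alphabet sizes and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Random joint PMF Q_{U,V} on a 4 x 3 alphabet (sizes are arbitrary).
Q = rng.random((4, 3))
Q /= Q.sum()

# MAP guess: for each observation v, guess the u maximizing Q(u|v).
u_hat = Q.argmax(axis=0)                    # guessing rule V -> U-hat
Pe = 1.0 - sum(Q[u_hat[v], v] for v in range(Q.shape[1]))

# Conditional entropy H(U|V) = sum_v Q_V(v) H(U | V = v).
Q_V = Q.sum(axis=0)
H_U_given_V = sum(Q_V[v] * entropy(Q[:, v] / Q_V[v]) for v in range(Q.shape[1]))

Hb = entropy(np.array([Pe, 1 - Pe]))
fano_bound = Hb + Pe * np.log(Q.shape[0] - 1)
print(H_U_given_V <= fano_bound + 1e-12)    # True: Fano's Inequality holds
```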

Theorem 1.14 (Union Bound on Total Expectation). Let X be a random
variable taking value in the alphabet X and let f : X → R₀⁺ be a nonnegative
function. Consider some events E_i ⊆ X that are not necessarily disjoint,
but whose union covers X:
\[
  \bigcup_{i} \mathcal{E}_i = \mathcal{X}.  \tag{1.64}
\]
Then
\[
  \mathrm{E}[f(X)] \le \sum_{i} \Pr(\mathcal{E}_i)\,\mathrm{E}\bigl[f(X) \,\big|\, \mathcal{E}_i\bigr].  \tag{1.65}
\]


Proof: We again only prove for the case of a discrete RV. We split all sets
E_i up into disjoint subsets B_j, with B_j ∩ B_{j'} = ∅ for j ≠ j', such that
\[
  \mathcal{E}_i = \bigcup_{\text{some } k} \mathcal{B}_k  \tag{1.66}
\]
for the right choice of union over k. As an example, see Figure 1.1. There,
\[
  \mathcal{E}_3 = \mathcal{B}_4 \cup \mathcal{B}_5 \cup \mathcal{B}_6 \cup \mathcal{B}_7.  \tag{1.67}
\]

[Figure 1.1: Example of overlapping sets E_i that are split up into disjoint
subsets B_j.]

Obviously, we still have
\[
  \bigcup_{j} \mathcal{B}_j = \mathcal{X}.  \tag{1.68}
\]
Hence,
\begin{align}
  \mathrm{E}[f(X)] &= \sum_{x \in \mathcal{X}} Q_X(x)\,f(x)  \tag{1.69}\\
  &= \sum_{j} \sum_{x \in \mathcal{B}_j} Q_X(x)\,f(x)  \tag{1.70}\\
  &\le \sum_{j} \sum_{x \in \mathcal{B}_j} Q_X(x)\,f(x)
     + \underbrace{\sum_{\text{some } j'} \sum_{x \in \mathcal{B}_{j'}} Q_X(x)\,f(x)}_{\ge 0 \text{ because } f(\cdot) \ge 0}  \tag{1.71}\\
  &= \sum_{i} \sum_{x \in \mathcal{E}_i} Q_X(x)\,f(x)  \tag{1.72}\\
  &= \sum_{i} \Pr(\mathcal{E}_i) \sum_{x \in \mathcal{E}_i} \frac{Q_X(x)}{\Pr(\mathcal{E}_i)}\,f(x)  \tag{1.73}\\
  &= \sum_{i} \Pr(\mathcal{E}_i)\,\mathrm{E}\bigl[f(X) \,\big|\, \mathcal{E}_i\bigr].  \tag{1.74}
\end{align}
Here, (1.70) follows because the sets B_j form a partition of X (i.e., they are
disjoint and exactly cover X); and in (1.71) we add some sets B_{j'} once again
to make sure that we can account for all E_i (note that some B_j are members of
several E_i!).
The following inequality is slightly different in quality from the previous
ones because it explicitly only holds for continuous random variables (with a
PDF!) and their differential entropy.
Theorem 1.15 (Entropy Power Inequality (EPI)). Let X and Y be two
independent random n-vectors with PDF. Then
\[
  e^{\frac{2}{n} h(\mathbf{X}+\mathbf{Y})} \;\ge\; e^{\frac{2}{n} h(\mathbf{X})} + e^{\frac{2}{n} h(\mathbf{Y})},  \tag{1.75}
\]
where the differential entropies have to be measured in nats. Equality holds if,
and only if, X and Y are Gaussian with proportional covariance matrices.
Proof: This inequality was introduced by Shannon in [Sha48]. The first
rigorous proof was given in [Sta59]. In [CT06] the proof relies on the Rényi
entropy [CK81]. In [VG06] a proof based on a relationship between mutual
information and minimum mean-square error in Gaussian channels is provided.
And a proof based on basic properties of mutual information and on Taylor
expansions is given in [Rio07].

1.3  Fourier–Motzkin Elimination³

We are all familiar with the Gaussian elimination procedure that is used to
eliminate unwanted variables in a linear equation system. In information theory,
however, we more often encounter a linear inequality system. Particularly,
in multi-terminal problems we often see some types of rate regions that are
described by a set of inequalities. In general these rate regions can be described
as
\[
  \mathsf{A}\,\mathbf{R} \le \mathbf{I}  \tag{1.76}
\]
where
\[
  \mathbf{R} = \bigl(R^{(1)}, \ldots, R^{(L)}\bigr)^{\mathsf{T}}  \tag{1.77}
\]
is a vector that contains L different rates,
\[
  \mathbf{I} = \bigl(I^{(1)}, \ldots, I^{(N)}\bigr)^{\mathsf{T}}  \tag{1.78}
\]

³ This section is strongly inspired by the teaching of Gerhard Kramer.

is a vector with N entries that consists of sums of entropies and mutual
information terms, and where A is an N × L matrix describing the coupling
between the different rates. Hence, (1.76) represents N inequalities involving
L different rates describing a rate region that is called a polytope.
Now consider the situation that we would like to eliminate one rate, say
R^{(L)}, from the vector R, i.e., we are interested in the projection
\[
  \tilde{\mathcal{R}} \triangleq \Bigl\{ \bigl(R^{(1)}, \ldots, R^{(L-1)}\bigr)^{\mathsf{T}} :\
  \exists\, R^{(L)}\ \text{s.t.}\ \mathbf{R} = \bigl(\tilde{\mathbf{R}}^{\mathsf{T}}, R^{(L)}\bigr)^{\mathsf{T}}
  \ \text{satisfies}\ \mathsf{A}\,\mathbf{R} \le \mathbf{I} \Bigr\}.  \tag{1.79}
\]
Such an elimination can be achieved using a procedure called Fourier–Motzkin
elimination. The basic idea is very similar to Gaussian elimination: The
validity of inequalities is not changed if two inequalities are added together or
if they are scaled. Note, however, two fundamental differences:

• We cannot scale an inequality by a negative number as then the inequality
  sign is reversed.

• For an equation system with N independent equations and N unknowns,
  it is sufficient to only keep N equations. Any additional equation is
  redundant and can be omitted. For a system of inequalities, dependencies
  are far more difficult to track.
The Fourier–Motzkin elimination proceeds as follows:

1. For every pair of inequalities in (1.76) with opposite sign coefficients for
   R^{(L)}, generate a new inequality that eliminates R^{(L)}.

2. Take all these new inequalities plus all those inequalities in (1.76) that
   do not depend on R^{(L)} to generate a new inequality system
   \[
     \tilde{\mathsf{A}}\,\tilde{\mathbf{R}} \le \tilde{\mathbf{I}}  \tag{1.80}
   \]
   with Ñ inequalities.

Note that in contrast to Gaussian elimination, the number Ñ of inequalities
after an elimination step might be larger than it was before.
Example 1.16. As an example we consider an inequality system that we
will encounter in the case of a broadcast channel. Consider the inequalities
(13.246)-(13.248) in Section 13.7 (together with the implicitly given constraints
that the four rates cannot be negative):
\[
  \begin{pmatrix}
     0 &  0 & -1 & -1\\
     1 &  0 &  1 &  0\\
     0 &  1 &  0 &  1\\
    -1 &  0 &  0 &  0\\
     0 & -1 &  0 &  0\\
     0 &  0 & -1 &  0\\
     0 &  0 &  0 & -1
  \end{pmatrix}
  \begin{pmatrix}
    R^{(1)}\\ R^{(2)}\\ \tilde{R}^{(1)}\\ \tilde{R}^{(2)}
  \end{pmatrix}
  \le
  \begin{pmatrix}
    -\mathrm{I}\bigl(U^{(1)};U^{(2)}\bigr)\\
    \mathrm{I}\bigl(U^{(1)};Y^{(1)}\bigr)\\
    \mathrm{I}\bigl(U^{(2)};Y^{(2)}\bigr)\\
    0\\ 0\\ 0\\ 0
  \end{pmatrix}.  \tag{1.81}
\]
We start by eliminating R̃^{(2)}. We see three inequalities where R̃^{(2)} is involved,
one with positive sign and two with negative sign. This gives us two pairing
possibilities: We add the first and the third inequality, and we add the third
and the seventh. Then we also include the four other inequalities for which
the coefficient in A in the last column is equal to zero. This yields:
\[
  \begin{pmatrix}
     0 &  1 & -1\\
     0 &  1 &  0\\
     1 &  0 &  1\\
    -1 &  0 &  0\\
     0 & -1 &  0\\
     0 &  0 & -1
  \end{pmatrix}
  \begin{pmatrix}
    R^{(1)}\\ R^{(2)}\\ \tilde{R}^{(1)}
  \end{pmatrix}
  \le
  \begin{pmatrix}
    \mathrm{I}\bigl(U^{(2)};Y^{(2)}\bigr) - \mathrm{I}\bigl(U^{(1)};U^{(2)}\bigr)\\
    \mathrm{I}\bigl(U^{(2)};Y^{(2)}\bigr)\\
    \mathrm{I}\bigl(U^{(1)};Y^{(1)}\bigr)\\
    0\\ 0\\ 0
  \end{pmatrix}.  \tag{1.82}
\]
Note that the first two inequalities are generated in Step 1 of the Fourier–
Motzkin elimination procedure, while the remaining four inequalities are taken
over without change.

We continue and eliminate R̃^{(1)}. Again we see that we have one positive
and two negative components, yielding again two pairing possibilities:
\[
  \begin{pmatrix}
     1 &  1\\
     1 &  0\\
     0 &  1\\
    -1 &  0\\
     0 & -1
  \end{pmatrix}
  \begin{pmatrix}
    R^{(1)}\\ R^{(2)}
  \end{pmatrix}
  \le
  \begin{pmatrix}
    \mathrm{I}\bigl(U^{(1)};Y^{(1)}\bigr) + \mathrm{I}\bigl(U^{(2)};Y^{(2)}\bigr) - \mathrm{I}\bigl(U^{(1)};U^{(2)}\bigr)\\
    \mathrm{I}\bigl(U^{(1)};Y^{(1)}\bigr)\\
    \mathrm{I}\bigl(U^{(2)};Y^{(2)}\bigr)\\
    0\\ 0
  \end{pmatrix}.  \tag{1.83}
\]
This yields exactly the region given in Theorem 13.33.

Further examples can be found in Section 13.8.
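For readers who want to experiment with the procedure, here is a minimal Python sketch (our own, not from the original notes) of a single Fourier–Motzkin elimination step for a system A x ≤ b. The numeric right-hand sides in the usage part merely stand in for the mutual-information terms of Example 1.16.

```python
import numpy as np
from itertools import product

def fm_eliminate(A, b, col):
    """One Fourier-Motzkin step: eliminate variable `col` from A x <= b.

    Returns (A_new, b_new) describing the projection onto the remaining variables.
    """
    A, b = np.asarray(A, float), np.asarray(b, float)
    pos = [i for i in range(len(b)) if A[i, col] > 0]
    neg = [i for i in range(len(b)) if A[i, col] < 0]
    zero = [i for i in range(len(b)) if A[i, col] == 0]

    rows, rhs = [], []
    # Pair every positive-coefficient row with every negative-coefficient row;
    # scale so the coefficients of `col` cancel, then add the two inequalities.
    for i, j in product(pos, neg):
        lam_i, lam_j = -A[j, col], A[i, col]      # both positive
        rows.append(lam_i * A[i] + lam_j * A[j])
        rhs.append(lam_i * b[i] + lam_j * b[j])
    # Inequalities not involving `col` are carried over unchanged.
    for i in zero:
        rows.append(A[i])
        rhs.append(b[i])

    A_new = np.delete(np.array(rows), col, axis=1)
    return A_new, np.array(rhs)

# Variables ordered as (R1, R2, Rt1, Rt2); the numbers stand in for
# I(U1;U2) = 1, I(U1;Y1) = 3, I(U2;Y2) = 2 (arbitrary placeholder values).
A = [[ 0,  0, -1, -1],
     [ 1,  0,  1,  0],
     [ 0,  1,  0,  1],
     [-1,  0,  0,  0],
     [ 0, -1,  0,  0],
     [ 0,  0, -1,  0],
     [ 0,  0,  0, -1]]
b = [-1, 3, 2, 0, 0, 0, 0]
A1, b1 = fm_eliminate(A, b, col=3)    # eliminate Rt2
A2, b2 = fm_eliminate(A1, b1, col=2)  # eliminate Rt1
print(A2)
print(b2)                             # inequalities on (R1, R2) only
```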

1.4  Law of Large Numbers

The core behind almost all proofs and results discussed in these lecture notes
is the law of large numbers. We therefore quickly review the two different
types of laws of large numbers.
We start with the weak law.
Theorem 1.17 (Weak Law of Large Numbers).
Let X_1, X_2, ... be a sequence of independent and identically distributed
(IID) RVs of mean μ and variance σ². Then for any ε > 0,
\[
  \lim_{n\to\infty} \Pr\!\left[\,\left|\frac{1}{n}\sum_{k=1}^{n} X_k - \mu\right| < \epsilon\right] = 1.  \tag{1.84}
\]

This means that the probability that the sample mean gets very close to the
statistical mean tends to 1. We have here a convergence in probability.

Proof: From independence, it follows
\[
  \mathrm{Var}\!\left[\frac{1}{n}\sum_{k=1}^{n} X_k\right]
  = \frac{1}{n^2}\,\mathrm{Var}\!\left[\sum_{k=1}^{n} X_k\right]
  = \frac{1}{n^2}\sum_{k=1}^{n} \sigma^2
  = \frac{n\,\sigma^2}{n^2}
  = \frac{\sigma^2}{n},  \tag{1.85}
\]
and we have
\[
  \mathrm{E}\!\left[\frac{1}{n}\sum_{k=1}^{n} X_k\right]
  = \frac{1}{n}\sum_{k=1}^{n} \mathrm{E}[X_k] = \mu.  \tag{1.86}
\]
Now, we use the Chebyshev Inequality [Mos14, Section 19.2], which says that
for any ε > 0,
\[
  \Pr\bigl[|Y - \mathrm{E}[Y]| \ge \epsilon\bigr] \le \frac{\mathrm{Var}[Y]}{\epsilon^2},  \tag{1.87}
\]
and get
\[
  \Pr\!\left[\,\left|\frac{1}{n}\sum_{k=1}^{n} X_k - \mu\right| \ge \epsilon\right]
  \le \frac{\sigma^2}{n\,\epsilon^2},  \tag{1.88}
\]
from which the result follows by letting n tend to infinity.


Note that the asymptotic equipartition property (AEP) [Mos14, Chapter 19] follows directly from this:
!
n
Y
1
1
n
Q(Xk )
(1.89)
log Q (X1 , . . . , Xn ) = log
n
n
k=1

1X
=
log Q(Xk )
n

(1.90)

k=1

E[log Q(X)] = H(X)

in prob.

(1.91)

Next we state the strong law.


Theorem 1.18 (Strong Law of Large Numbers).
Let X_1, X_2, ... be a sequence of IID RVs of mean μ and variance σ². Then
\[
  \Pr\!\left[\lim_{n\to\infty} \frac{1}{n}\sum_{k=1}^{n} X_k = \mu\right] = 1.  \tag{1.92}
\]
This means that the probability that the sample mean is equal to the statistical
mean is 1. We have here a convergence with probability 1 or almost-sure
convergence.

Proof: See exercises.


Compare the type of convergence between the two versions of the law of
large numbers: While in the former we have a probability that itself tends
towards 1, in the latter form the probability is equal to 1 (pay attention to
the place of the limit inside or outside of the probability expression!).
Remark 1.19. We have here stated the two laws of large numbers only for
the case of IID RVs, which is the way we will mainly use them. However,
these theorems can be generalized to sequences that are not IID. Basically,
what we need is ergodicity. Sloppily, we could say that a stochastic process is
ergodic if it satisfies the law of large numbers. . .

1.5  Additional Tools

The following theorem shows that in the d-dimensional Euclidean space any
convex combination can be written using at most d + 1 vectors. This will be
useful for us when we try to limit the necessary size of certain finite alphabets.
Theorem 1.20 (Carathéodory's Theorem). If v ∈ R^d is a convex
combination of n points v_1, v_2, ..., v_n ∈ R^d, then there exists a subset of
d + 1 of these points {v_{k_1}, v_{k_2}, ..., v_{k_{d+1}}} such that
\[
  \mathbf{v} = \sum_{j=1}^{d+1} \lambda_j \mathbf{v}_{k_j}  \tag{1.93}
\]
where
\[
  \lambda_j \ge 0, \qquad \sum_{j=1}^{d+1} \lambda_j = 1.  \tag{1.94}
\]

Example 1.21. Consider R2 (which is much easier to graphically depict than


Rd for d > 2) and assume we have six points v1 , v2 , . . . , v6 as shown in Figure 1.2. We add a shaded point v that can be written as a convex combination
of v1 , v2 , . . . , v6 . According to Theorem 1.20 we can now choose a subset with
only d + 1 = 3 points to describe v. Two possible such subsets are shown in
Figure 1.2 by two shaded triangles.

Proof of Theorem 1.20: Assume that
\[
  \mathbf{v} = \sum_{j=1}^{k} \lambda_j \mathbf{v}_{k_j}  \tag{1.95}
\]
for some k, λ_j ≥ 0 and ∑_{j=1}^{k} λ_j = 1. If k ≤ d + 1, then the lemma is already
proven. So assume that
\[
  d + 2 \le k \le n.  \tag{1.96}
\]

[Figure 1.2: The real plane with six points and a shaded point that is a convex
combination of the six others. This shaded point can also be described as a
convex combination of only three of the six points. Two possible choices of
such a subset are depicted with the triangles.]

Since k > d + 1, we must have that v_2 − v_1, ..., v_k − v_1 are linearly dependent,
i.e., there exist some β_2, ..., β_k not all zero such that
\[
  \sum_{j=2}^{k} \beta_j (\mathbf{v}_j - \mathbf{v}_1) = \mathbf{0}.  \tag{1.97}
\]
We define
\[
  \beta_1 \triangleq -\sum_{j=2}^{k} \beta_j  \tag{1.98}
\]
so that
\[
  \sum_{j=1}^{k} \beta_j = -\sum_{j=2}^{k} \beta_j + \sum_{j=2}^{k} \beta_j = 0  \tag{1.99}
\]
and by (1.97)
\[
  \sum_{j=1}^{k} \beta_j \mathbf{v}_j
  = -\sum_{j=2}^{k} \beta_j \mathbf{v}_1 + \sum_{j=2}^{k} \beta_j \mathbf{v}_j
  = \sum_{j=2}^{k} \beta_j (\mathbf{v}_j - \mathbf{v}_1) = \mathbf{0}.  \tag{1.100}
\]

So, for any γ ∈ R,
\begin{align}
  \mathbf{v} &= \sum_{j=1}^{k} \lambda_j \mathbf{v}_j - \gamma \cdot \mathbf{0}  \tag{1.101}\\
  &= \sum_{j=1}^{k} \lambda_j \mathbf{v}_j - \gamma \sum_{j=1}^{k} \beta_j \mathbf{v}_j  \tag{1.102}\\
  &= \sum_{j=1}^{k} (\lambda_j - \gamma\beta_j)\,\mathbf{v}_j.  \tag{1.103}
\end{align}
Note that at least one β_j > 0 because not all β_j are zero, but their sum is
equal to zero. We choose
\[
  \gamma \triangleq \min_{j\colon \beta_j > 0} \frac{\lambda_j}{\beta_j} = \frac{\lambda_i}{\beta_i}  \tag{1.104}
\]
(where the second equality should be read as a definition for i). Note that
γ > 0 and that
\[
  \lambda_j - \gamma\beta_j
  = \lambda_j - \underbrace{\frac{\lambda_i}{\beta_i}}_{\le \frac{\lambda_j}{\beta_j} \text{ if } \beta_j > 0}\,\beta_j
  \;\ge\; 0 \qquad \forall j.  \tag{1.105}
\]
In particular, we have
\[
  \lambda_i - \gamma\beta_i = 0.  \tag{1.106}
\]
Hence,
\[
  \mathbf{v} = \sum_{j=1}^{k} (\lambda_j - \gamma\beta_j)\,\mathbf{v}_j
  = \sum_{j=1}^{k} \tilde{\lambda}_j \mathbf{v}_j  \tag{1.107}
\]
where
\begin{align}
  \tilde{\lambda}_j &= \lambda_j - \gamma\beta_j \ge 0,  \tag{1.108}\\
  \tilde{\lambda}_i &= \lambda_i - \gamma\beta_i = 0,  \tag{1.109}\\
  \sum_{j=1}^{k} \tilde{\lambda}_j &= \sum_{j=1}^{k} \lambda_j - \gamma \sum_{j=1}^{k} \beta_j = 1 - \gamma \cdot 0 = 1.  \tag{1.110}
\end{align}
Therefore, we have written v as a convex combination of at most k − 1 points.


We can repeat this argument until at most d + 1 points remain.
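The elimination step used in this proof is constructive, so it can be turned directly into code. The following Python sketch (our own, not from the original notes) repeatedly removes one point from a convex combination until at most d + 1 points remain; the example points and weights are arbitrary.

```python
import numpy as np

def caratheodory_reduce(points, weights, tol=1e-12):
    """Reduce a convex combination sum_j w_j p_j of points in R^d to one that
    uses at most d+1 points, following the elimination step of the proof of
    Theorem 1.20. `points` is (n, d), `weights` is length n, nonnegative, sums to 1."""
    P = np.array(points, dtype=float)
    w = np.array(weights, dtype=float)
    d = P.shape[1]
    while np.count_nonzero(w > tol) > d + 1:
        idx = np.flatnonzero(w > tol)
        Q, lam = P[idx], w[idx]
        # Find beta (not all zero, summing to 0) with sum_j beta_j q_j = 0:
        # the differences q_2 - q_1, ..., q_k - q_1 are linearly dependent.
        M = (Q[1:] - Q[0]).T                       # d x (k-1)
        _, _, Vt = np.linalg.svd(M)
        beta_tail = Vt[-1]                         # null-space direction of M
        beta = np.concatenate(([-beta_tail.sum()], beta_tail))
        if not np.any(beta > tol):                 # make sure some beta_j > 0
            beta = -beta
        # gamma = min over beta_j > 0 of lam_j / beta_j  (kills one weight)
        ratios = np.where(beta > tol, lam / np.where(beta > tol, beta, 1), np.inf)
        gamma = ratios.min()
        w[idx] = np.clip(lam - gamma * beta, 0, None)
    return w

# Example: a point in R^2 written with 6 points; reduce to at most 3.
pts = np.array([[0, 0], [4, 0], [0, 4], [4, 4], [2, 1], [1, 3]], float)
w = np.full(6, 1 / 6)
w_new = caratheodory_reduce(pts, w)
print(np.count_nonzero(w_new > 1e-12))        # at most 3
print(np.allclose(pts.T @ w_new, pts.T @ w))  # the represented point is unchanged
```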
There exists a stronger version of Carathéodory's Theorem that we state
here, too, but without proof.

Theorem 1.22 (Fenchel–Eggleston Strengthening of Carathéodory's
Theorem [Egg58, p. 35]). Any point in the convex closure of a connected
compact set R ⊆ R^d can be represented as a convex combination of at most d
points in R.


The following theorem is only stated for completeness reasons and we omit
a proof as it is quite far away from the main topics of this course. However,
it explains why we often will be able to avoid the use of infima and suprema,
but can directly resort to the simpler minima and maxima.
Theorem 1.23 (Extreme Value Theorem (Karl Weierstrass)). Let the
function f(·) be continuous and let X be a compact set (in the Euclidean
space this is equivalent to closed and bounded with respect to the Euclidean
distance). Then
\[
  \inf_{x \in \mathcal{X}} f(x) = \min_{x \in \mathcal{X}} f(x)  \tag{1.111}
\]
and
\[
  \sup_{x \in \mathcal{X}} f(x) = \max_{x \in \mathcal{X}} f(x),  \tag{1.112}
\]
i.e., the optimizing x always exists and is part of X.



Chapter 2

Method of Types
Types and typical sets are extremely important tools of information theory.
We have seen part of this already in [Mos14, Chapter 19]. However, while the
weak typicality introduced there has the advantage of being easily extendable to
continuous-alphabet random variables, it is not really very intuitive.
In this course we are going to talk about types and strongly typical sets.
Note that while Shannon did have a fundamental understanding of the principal
concept of types, and while the theoretical foundations of types go back to
Sanov [San57] and Hoeffding [Hoe56], it was the work of Imre Csiszár [Csi98],
[CK81] jointly with János Körner and Katalin Marton that formalized it and
made it into the main tool of information theory.

2.1  Types

We start with a couple of definitions. Let X be a finite alphabet of size |X |.


Let x = (x1 , . . . , xn ) X n be a sequence of n symbols from X .
Definition 2.1. Given a sequence x ∈ X^n and some symbol a ∈ X, we denote
by N(a|x) the number of times the symbol a occurs in x.
By I(a|x) we denote the set of indices i such that x_i = a. Hence,
\[
  N(a|\mathbf{x}) = |I(a|\mathbf{x})|.  \tag{2.1}
\]
Moreover, we use the notation that y_{I(a|x)} is a vector of length N(a|x)
containing all components y_i where i ∈ I(a|x). In particular,
\[
  \mathbf{x}_{I(a|\mathbf{x})} = \underbrace{(a, a, \ldots, a)}_{N(a|\mathbf{x}) \text{ components}}.  \tag{2.2}
\]

Example 2.2. Let x = (11, 15, 11, 17) and y = (21, 22, 23, 24). Then we have
N(11|x) = 2, I(11|x) = {1, 3}, and yI(11|x) = (y1 , y3 ) = (21, 23).

Definition 2.3. The type P_x of a sequence x is the relative proportion of
occurrences of each symbol of X:
\[
  P_{\mathbf{x}}(a) \triangleq \frac{N(a|\mathbf{x})}{n}, \qquad a \in \mathcal{X}.  \tag{2.3}
\]


So the type Px is the empirical probability distribution of x.


Example 2.4. Let X = {H, T}, n = 5, x = (H, H, T, T, H). Then
\[
  P_{\mathbf{x}}(\mathrm{H}) = \frac{3}{5}, \qquad P_{\mathbf{x}}(\mathrm{T}) = \frac{2}{5}.  \tag{2.4}
\]
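Computing N(a|x) and the type P_x is a one-liner in practice. The following small Python sketch (our own illustration, not from the original notes) reproduces Example 2.4.

```python
from collections import Counter
from fractions import Fraction

def type_of(x):
    """Empirical distribution (type) P_x of a finite sequence x."""
    n = len(x)
    counts = Counter(x)                       # N(a|x) for every symbol a
    return {a: Fraction(c, n) for a, c in counts.items()}

x = ("H", "H", "T", "T", "H")
print(type_of(x))   # {'H': Fraction(3, 5), 'T': Fraction(2, 5)}, as in Example 2.4
```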

Definition 2.5. We use P(X ) to denote the set of all probability distributions
on X . Moreover, Pn (X ) denotes the set of all types with denominator n and
with respect to the alphabet X .
Obviously, Pn is a set with each member being a PMF, and
\[
  \mathcal{P}_n(\mathcal{X}) \subset \mathcal{P}(\mathcal{X}).  \tag{2.5}
\]

Example 2.6. We continue with Example 2.4:
\[
  \mathcal{P}_5(\mathcal{X}) = \left\{ (0,1),\ \left(\tfrac{1}{5},\tfrac{4}{5}\right),\ \left(\tfrac{2}{5},\tfrac{3}{5}\right),\ \left(\tfrac{3}{5},\tfrac{2}{5}\right),\ \left(\tfrac{4}{5},\tfrac{1}{5}\right),\ (1,0) \right\}.  \tag{2.6}
\]

Next we turn the way of thinking around: So far we had a given sequence
x and described its type Px . Now we would like to fix a type P Pn and ask
the question of how many sequences x have this type.
Definition 2.7. Let P ∈ P(X) be a distribution on the alphabet X (not
necessarily a type!). Then the set of all length-n sequences x having a type
P_x = P is called type class of P and is denoted by T^n(P):
\[
  \mathcal{T}^n(P) \triangleq \bigl\{ \mathbf{x} \in \mathcal{X}^n : P_{\mathbf{x}} = P \bigr\}.  \tag{2.7}
\]
Note that the type class T^n(P) is a set of sequences.

Example 2.8. Let X = {H, T}, n = 3, and P = (1/3, 2/3). Then
\[
  \mathcal{T}^3(P) = \{\mathrm{HTT}, \mathrm{THT}, \mathrm{TTH}\}.  \tag{2.8}
\]

Example 2.9. Let X = {H, T}, n = 10, and P = (1/π, 1 − 1/π). Then
\[
  \mathcal{T}^{10}(P) = \emptyset.  \tag{2.9}
\]

From Example 2.9 we see that we can redefine P_n(X) as the set of all
probability distributions P on X such that T^n(P) ≠ ∅.


Example 2.10. Let X = {1, 2, 3}, n = 5, and x = 11321. Then
\[
  P_{\mathbf{x}}(1) = \frac{3}{5}, \qquad P_{\mathbf{x}}(2) = \frac{1}{5}, \qquad P_{\mathbf{x}}(3) = \frac{1}{5};  \tag{2.10}
\]
and
\[
  \mathcal{T}^5(P_{\mathbf{x}}) = \{11123, 11132, 11213, 11231, \ldots, 32111\}.  \tag{2.11}
\]
How many sequences are members of T^5(P_x)? Well, simply count all
permutations without repetition:
\[
  \bigl|\mathcal{T}^5(P_{\mathbf{x}})\bigr| = \frac{5!}{3!\,1!\,1!} = 20.  \tag{2.12}
\]

2.2  Properties of Types

We now get to the first important result.


Theorem 2.11 (Type Theorem 1 (TT1)).
The number of types, i.e., the number of probability distributions that
can be an empirical distribution for some sequence x, grows at most polynomially in n:
\[
  |\mathcal{P}_n(\mathcal{X})| \le (n + 1)^{|\mathcal{X}|}.  \tag{2.13}
\]

Proof: Every type P_x ∈ P_n(X) can be described by a vector with |X|
components (denoting the probabilities of each possible symbol!). Each of
these components has denominator n and a numerator that can only take on
n + 1 different values: 0, 1, ..., n. So, there are at most (n + 1)^{|X|} choices
this component has denominator n and a numerator that can only take on
n + 1 different values: 0, 1, . . . , n. So, there are at most (n + 1)|X | choices
for this vector. Of course, there will actually be fewer possibilities, since the
choices for the components are not independent (the components must add to
1, i.e., for example the last component is fixed by the others), but the bound
is good enough for our purposes.
Actually, from the proof of TT1 we immediately get an easy way to improve
on the bound.
Corollary 2.12. We can improve on the bound given in TT1:
\[
  |\mathcal{P}_n(\mathcal{X})| \le (n + 1)^{|\mathcal{X}|-1}.  \tag{2.14}
\]

Proof: The additional reduction of the exponent by 1 follows from the fact
that one component of each probability vector is uniquely determined by all
the other components.
Note that we will rarely use this improved bound. It is almost only used
in a situation when the alphabet is binary. In this case the improvement is
quite significant from (n + 1)2 to n + 1.


We realize that there is a polynomial number of types, but an exponential


number of sequences. So at least one type must have exponentially many
sequences in its type class! We will see later that every type has exponentially
many sequences in its type class.
Theorem 2.13 (Type Theorem 2 (TT2)).
Let X_1, ..., X_n be drawn IID ∼ Q ∈ P(X). Fix a sequence x =
(x_1, ..., x_n). The probability that x occurs is given by
\[
  Q^n(\mathbf{x}) = e^{-n\left(H(P_{\mathbf{x}}) + D(P_{\mathbf{x}} \,\|\, Q)\right)},  \tag{2.15}
\]

where we assume that entropy and relative entropy are given in nats.
Hence, the probability that X = x only depends on the type of x and not
on x directly!

Remark 2.14. Note that we can rewrite this expression using a power of 2
instead of e, but then need to specify the entropy and relative entropy in bits.
In the remainder of this class, unless explicitly marked, we will stick to nats
and e. Also, note that log denotes the natural logarithm.
Proof: Recalling our notation N(a|x) from Definition 2.1 and using that
the sequences are generated IID, we have:
\begin{align}
  Q^n(\mathbf{x}) &= \prod_{k=1}^{n} Q(x_k)  \tag{2.16}\\
  &= \prod_{a \in \mathrm{supp}(P_{\mathbf{x}})} Q(a)^{N(a|\mathbf{x})}  \tag{2.17}\\
  &= \prod_{a \in \mathrm{supp}(P_{\mathbf{x}})} Q(a)^{n P_{\mathbf{x}}(a)}  \tag{2.18}\\
  &= \prod_{a \in \mathrm{supp}(P_{\mathbf{x}})} e^{n P_{\mathbf{x}}(a) \log Q(a)}  \tag{2.19}\\
  &= \exp\left(-n \sum_{a \in \mathrm{supp}(P_{\mathbf{x}})} P_{\mathbf{x}}(a)\,\log\frac{1}{Q(a)}\right)  \tag{2.20}\\
  &= \exp\left(-n \sum_{a \in \mathrm{supp}(P_{\mathbf{x}})} P_{\mathbf{x}}(a)\,\log\left(\frac{P_{\mathbf{x}}(a)}{Q(a)} \cdot \frac{1}{P_{\mathbf{x}}(a)}\right)\right)  \tag{2.21}\\
  &= \exp\left(-n \sum_{a \in \mathrm{supp}(P_{\mathbf{x}})} P_{\mathbf{x}}(a)\,\log\frac{P_{\mathbf{x}}(a)}{Q(a)}
     - n \sum_{a \in \mathrm{supp}(P_{\mathbf{x}})} P_{\mathbf{x}}(a)\,\log\frac{1}{P_{\mathbf{x}}(a)}\right)  \tag{2.22}\\
  &= \exp\bigl(-n\,D(P_{\mathbf{x}} \,\|\, Q) - n\,H(P_{\mathbf{x}})\bigr).  \tag{2.23}
\end{align}

Here, for (2.17) and (2.18) recall the Definition 2.3 of a type and recall that
the support supp(Px ) X denotes the set of all symbols for which Px (a) > 0.
Let's again have some examples.

Example 2.15. Let X = {0, 1}, Q(0) = 1 − Q(1) = 1/3, and n = 4. We want
to compute the probability of the sequence x = 0010 under the IID law Q:
\[
  Q^4(0010) = \left(\frac{1}{3}\right)^{\!3}\frac{2}{3} = \frac{2}{81}.  \tag{2.24}
\]
We can see already here that this probability only depends on the count of
zeros and ones in x.

Let's now compute the same result using TT2. First note that
\[
  P_{\mathbf{x}}(0) = \frac{3}{4}, \qquad P_{\mathbf{x}}(1) = \frac{1}{4},  \tag{2.25}
\]
such that
\begin{align}
  H(P_{\mathbf{x}}) &= -\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}
    = 2 - \frac{3}{4}\log_2 3,  \tag{2.26}\\
  D(P_{\mathbf{x}} \,\|\, Q) &= \frac{3}{4}\log_2\frac{3/4}{1/3} + \frac{1}{4}\log_2\frac{1/4}{2/3}
    = \frac{7}{4}\log_2 3 - \frac{9}{4}.  \tag{2.27}
\end{align}
Hence,
\[
  H(P_{\mathbf{x}}) + D(P_{\mathbf{x}} \,\|\, Q) = -\frac{1}{4} + \log_2 3  \tag{2.28}
\]
and
\[
  2^{-4\left(-\frac{1}{4} + \log_2 3\right)} = 2^{1 - 4\log_2 3} = \frac{2}{81},  \tag{2.29}
\]
as expected.

Example 2.16. Let X = {1, 2, 3, 4, 5, 6}, with Q = (1/6, ..., 1/6) describing a
fair dice. Assume that n is a multiple of 6. What is the probability that a
particular sequence x with exactly n/6 times face 1, n/6 times face 2, n/6 times
face 3, etc. occurs?
Note that
\[
  P_{\mathbf{x}} = \left(\frac{n/6}{n}, \ldots, \frac{n/6}{n}\right) = \left(\frac{1}{6}, \ldots, \frac{1}{6}\right)  \tag{2.30}
\]
such that D(P_x ‖ Q) = 0. Hence, we have from TT2:
\[
  Q^n(\mathbf{x}) = e^{-n H(P_{\mathbf{x}})} = e^{-n \log 6} = \left(\frac{1}{6}\right)^{\!n}.  \tag{2.31}
\]
This is quite obvious, as any sequence of a fair dice has probability (1/6)^n.
But what if the dice is not fair? Consider
\[
  Q = \left(\frac{1}{3}, \frac{1}{3}, \frac{1}{6}, \frac{1}{12}, \frac{1}{12}, 0\right).  \tag{2.32}
\]
Let n be a multiple of 12. What is now the probability that a particular
sequence x with exactly n/3 times face 1, n/3 times face 2, n/6 times face 3,
n/12 times face 4, n/12 times face 5 and never face 6 occurs?
Again, D(P_x ‖ Q) = 0 and therefore
\[
  Q^n(\mathbf{x}) = e^{-n H\left(\frac{1}{3},\frac{1}{3},\frac{1}{6},\frac{1}{12},\frac{1}{12},0\right)} = e^{-n H(Q)}.  \tag{2.33}
\]

This is quite amazing!
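As a quick numerical check of TT2 (our own illustration, not from the original notes, reusing the numbers of Example 2.15), the following Python sketch compares the direct computation of Q^n(x) with the type-based expression.

```python
import numpy as np
from collections import Counter

def type_of(x, alphabet):
    n = len(x)
    c = Counter(x)
    return np.array([c[a] / n for a in alphabet])

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def D(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

alphabet = [0, 1]
Q = np.array([1/3, 2/3])          # Q(0) = 1/3, Q(1) = 2/3
x = [0, 0, 1, 0]
n = len(x)

direct = np.prod([Q[s] for s in x])               # 2/81
Px = type_of(x, alphabet)
via_tt2 = np.exp(-n * (H(Px) + D(Px, Q)))         # TT2, in nats
print(direct, via_tt2)                            # both equal 2/81 ~ 0.0247
```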

We have shown the following corollary to TT2.


Corollary 2.17 (Corollary to Type Theorem 2). If x is in the type class
of Q, x T n (Q), then
Qn (x) = en H(Q) .

(2.34)

Proof: If x T n (Q), then Px = Q and therefore D(Px k Q) = 0 and


H(Px ) = H(Q). The corollary then follows from TT2.
Next we turn to the size of the type class. We suspect already that the
size must be exponential in n.
Theorem 2.18 (Type Theorem 3 (TT3)).
For any type P Pn (X ), the size of the type class T n (P ) is bounded as
follows:
1
en H(P ) |T n (P )| en H(P ) .
(n + 1)|X |

(2.35)

Proof: Note that it is actually easy to compute |T n (P )| precisely (see also


Example 2.10):
|T n (P )| = Q

aX

n!
.
nP (a) !

(2.36)

However, this value is very hard to manipulate because of the many factorials
inside. So, its bounds turn out to be more useful for our purposes.



2.2. Properties of Types

23

The upper bound can be derived as follows:1


X
1=
P n (x)

(2.37)

xX n

X
xT

P n (x)

(drop some terms)

(2.38)

en H(P )

(Corollary 2.17)

(2.39)

n (P )

X
xT n (P )

= |T n (P )| en H(P ) ,

(2.40)

i.e.,
|T n (P )| en H(P ) .

(2.41)

The lower bound is unfortunately quite a bit more complicated. We start


by proving that the type class T n (P ) has the highest probability among all
type classes under the probability distribution P , i.e., we want to show that
P n (T n (P )) P n (T n (P )),

P Pn (X ).

(2.42)

To that goal, note that any x T n (P ) satisfies


P n (x) =

n
Y
k=1

P (xk ) =

Y
asupp(P )

P (a)N(a|x) =

P (a)nP (a) .

(2.43)

asupp(P )

Note that the right-hand side (RHS) of (2.43) is independent of x, and holds
for all sequences of type P . Hence,
X
P n (T n (P )) =
P n (x)
(2.44)
xT n (P )

= |T n (P )|

P (a)nP (a) .

(2.45)

asupp(P )

Since this holds for every P , it must also hold for P , i.e., we also have
Y
P n (T n (P )) = |T n (P )|
P (a)nP (a) .
(2.46)
asupp(P )

Now if there exists some a X such that P (a) > 0, but P (a) = 0, then it
follows immediately from (2.45) that P n (T n (P )) = 0 and (2.42) is trivially
satisfied. So we assume that supp(P ) supp(P ). Note that for all a
supp(P ) but a
/ supp(P ), we have P (a) > 0 and P (a) = 0, i.e.,

P (a)nP (a) = 1,

(2.47)

1
Note that this proof is almost identical to the proof about the size of a weakly typical
(n)
(n)
set A in [Mos14, Chapter 19]. There we got |A | en(H(X)+) , while here we get a
much more precise result without . So we see that this method is stronger!



24

Method of Types

and therefore
Y

P (a)nP (a) =

asupp(P )

P (a)nP (a) ,

asupp(P )

supp(P ) supp(P ).

(2.48)

Hence, from (2.46) and (2.45) we get


P n (T n (P ))
P n (T n (P ))
=
=
=

Q
|T n (P )| asupp(P ) P (a)nP (a)
Q
|T n (P )| asupp(P ) P (a)nP (a)
Q
|T n (P )| asupp(P ) P (a)nP (a)
Q
|T n (P )| asupp(P ) P (a)nP (a)
Y
|T n (P )|

P (a)n(P (a)P (a))


n

|T (P )| asupp(P )

Q

n!
asupp(P ) nP (a) !

Q

n!
asupp(P ) nP (a) !

(2.49)
(2.50)
(2.51)
Y

P (a)n(P (a)P (a))

asupp(P )

(2.52)
Y
nP (a) !


P (a)n(P (a)P (a))
asupp(P ) nP (a) ! asupp(P )

Q

asupp(P ) nP (a) !

=Q
P (a)n(P (a)P (a))
asupp(P ) nP (a) ! asupp(P )

Y
nP (a) !

 P (a)n(P (a)P (a)) .


=
nP (a) !
Q

=Q

asupp(P )

(2.53)

(2.54)

(2.55)

asupp(P )

Here, in (2.50) we have used (2.48); (2.52) follows from the exact formula
(2.36) for the size of the type class; and in (2.54) we enlarge the range of the
first product without changing its value because for every added term a we
have (nP (a))! = 0! = 1.
Next we need a small lemma.
Lemma 2.19. For any m N0 and any n N, we have
m!
nmn .
n!

(2.56)

Proof: If m n, then
m!
= m (m 1) (m 2) (n + 1) |n n{z n} = nmn .
|
{z
}
n!
mn terms, each term n



mn terms

(2.57)

2.2. Properties of Types

25

If m < n, then
1
1
m!
1
=

= nm = nmn . (2.58)
n!
n (n 1) (n 2) (m + 1)
n nn
n
|
{z
} | {z }
nm terms

nm terms, each term n

Hence, we can continue with (2.55) as follows:


P n (T n (P ))

P n (T n (P ))

Y
asupp(P )

nn(P (a)P (a))

nP (a)nP (a)

nP (a)
P (a)n(P (a)P (a))

(2.60)

asupp(P )
P
n asupp(P ) (P (a)P (a))

=n
=n

n(11)

(2.59)

(2.61)

= n = 1.

(2.62)

Here, we have again made use of our assumption that supp(P ) supp(P ).
This proves (2.42).
Since every sequence x X n has exactly one type, summing over all type
classes is equivalent to summing over all sequences. Hence,
X
1=
P n (T n (P ))
(2.63)
P Pn (X )

P n (T n (P ))

P Pn (X )

= P n (T n (P ))

(by (2.42))

(2.64)

(2.65)

= P n (T n (P )) |Pn (X )|

(2.66)

P Pn (X )

|X |

P (T (P )) (n + 1)
X
= (n + 1)|X |
P n (x)

(by TT1)

(2.67)
(2.68)

xT n (P )

= (n + 1)|X |

X
xT

= (n + 1)

|X |

en H(P )

(by Corollary 2.17)

(2.69)

n (P )

|T n (P )| en H(P ) ,

(2.70)

i.e.,
|T n (P )|

1
en H(P ) ,
(n + 1)|X |

(2.71)

as we have set out to prove.


Remark 2.20. Note that the lower bound of TT3 could be improved by using
Corollary 2.12 instead of TT1 in its proof. The lower bound then would read
|T n (P )|

1
en H(P ) .
(n + 1)|X |1

(2.72)



26

Method of Types

This has an interesting implication in the case of a binary alphabet. Let


X = {0, 1} and choose P = nk , nk
n . According to (2.36), the size of the type
class with respect to P is then
 
n!
n
n
|T (P )| =
=
.
(2.73)
k!(n k)!
k

Moreover, we have H(P ) = Hb nk . Hence, from TT3 and from (2.72) it now
follows that
 
k
k
n
1
n Hb ( n
)

en Hb ( n ) .
(2.74)
e
n+1
k
This gives a very good estimate at the growth rate of the binomial coefficient
in n.
Finally, we arrive at a fourth type theorem.
Theorem 2.21 (Type Theorem 4 (TT4)).
For any Q P(X ) and a type P Pn (X ), we have
1
en D(P k Q) Qn (T n (P )) en D(P k Q) .
(n + 1)|X |
Proof: This follows readily from the first three type theorems:
X
Qn (T n (P )) =
Qn (x)
xT

xT

en(H(P )+D(P k Q))

=e

(by TT2)

(2.77)

n (P )

= |T n (P )| en(H(P )+D(P k Q))

(2.76)

n (P )

(2.75)

n H(P )

n(H(P )+D(P k Q))

n D(P k Q)

(2.78)
(by TT3)

(2.79)
(2.80)

The lower bound follows similarly:


Qn (T n (P )) = |T n (P )| en(H(P )+D(P k Q))
1
en H(P ) en(H(P )+D(P k Q))

(n + 1)|X |
1
=
en D(P k Q) .
(n + 1)|X |

(2.81)
(2.82)
(2.83)

Note that also TT4 can be improved using Corollary 2.12.


So we see that these four theorems help us in figuring out the exponential
growth rate of types and its related quantities. In particular, if
f (n) ' g(n)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(2.84)

2.3. Joint Types

27

denotes that
1
f (n)
log
= 0,
n n
g(n)
lim

(2.85)

i.e., f and g have the same exponential growth rate, we can write:
|Pn (X )| ' 1;

(2.86)


Q (x) = exp n H(Px ) + D(Px k Q) ;

|T n (P )| ' exp n H(P ) ;

Qn (T n (P )) ' exp n D(P k Q) .
n

2.3

(2.87)
(2.88)
(2.89)

Joint Types

We start by pointing out that from a mathematical point of view a discrete


random vector is simply another form of a discrete RV.
Example 2.22. What is the difference between Z {3, 9, 15, 19} with
1
QZ (3) = ,
3

1
QZ (9) = ,
2

QZ (15) =

1
,
12

QZ (19) =

1
12

(2.90)

and
(
W

!
!
!
!)
2
3
15
300
,
,
,
6
5
0
1

(2.91)

with
1
1
1
1
QW (2, 6) = , QW (3, 5) = , QW (15, 0) = , QW (300, 1) = ?
3
2
12
12
(2.92)
Well, they do take on different values, but the probabilities are identical, and
therefore the uncertainty of Z and W are also identical!

Hence, it is straightforward to generalize types to joint types. We first


generalize Definition 2.1.
Definition 2.23. Given two length-n sequences x X n and y Y n and
some symbols a X and b Y, we denote by N(a, b|x, y) the number of
positions where at the same time x contains the symbol a and y contains the
symbol b.
By I(a, b|x, y) we denote the set of indices i such that (xi , yi ) = (a, b).
Hence,
N(a, b|x, y) = |I(a, b|x, y)|.

(2.93)

Example 2.24. We continue with Example 2.2. We have I(11, 23|x, y) = {3}
such that N(11, 23|x, y) = 1. On the other hand, I(15, 24|x, y) = and
N(15, 24|x, y) = 0.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


28

Method of Types

Definition 2.25. The joint type of a pair of length-n sequences (x, y) is


defined as
N(a, b|x, y)
Px,y (a, b) ,
, (a, b) X Y.
(2.94)
n
For a joint distribution P P(X Y) we define the joint type class of P
to be the set of all pairs of length-n sequences (x, y) that have type Px,y = P :


T n (P ) , (x, y) X n Y n : Px,y = P .
(2.95)
All previous results directly generalize: TT1 now reads


Pn (X Y) (n + 1)|X ||Y| ,

(2.96)

where Pn (X Y) denotes the set of all joint types with denominator n.


TT2 says that for a joint distribution Q P(X Y), we have
Qn (x, y) = en(H(Px,y )+D(Px,y k Q)) ,

(x, y) X n Y n .

(2.97)

Note that Qn still denotes the product distribution: The pair of sequences
(X, Y) are assumed to be pairwise IID, i.e., while the components Xk and Yk
depend on each other, they are independent of the past and the future.
TT3 states that for any P Pn (X Y),
1
en H(P ) |T n (P )| en H(P ) .
(n + 1)|X ||Y|

(2.98)

Finally, TT4 now reads as follows: For any joint distribution Q P(X Y)
and for a joint type P Pn (X Y), we have
1
en D(P k Q) Qn (T n (P )) en D(P k Q) .
(n + 1)|X ||Y|

2.4

(2.99)

Conditional Types

We further generalize the idea of types to conditional distributions.


Definition 2.26. The conditional type of a sequence y Y n given a sequence
x X n is defined as
N(a, b|x, y)
Py|x (b|a) ,
, (a, b) X Y.
(2.100)
N(a|x)
Note that if there is some a X that does not occur in x, the conditional
type is not defined for this value.
Also note that types, conditional types, and joint types are all probability
distributions and therefore satisfy the chain rule as expected:
Px (a) Py|x (b|a) =

N(a|x) N(a, b|x, y)


N(a, b|x, y)

=
= Px,y (a, b).
n
N(a|x)
n
(2.101)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


2.4. Conditional Types

29

Example 2.27. Let X = {1, 2, 3, 4} and Y = {1, 2} and assume


y = (1, 2, 1, 1, 2, 1, 2, 2, 1, 2),

(2.102)

x = (2, 2, 2, 3, 3, 3, 4, 3, 2, 3).

(2.103)

Then we see that Py|x (1|2) = 34 because there are 4 positions where xk = 2,
three of which have a corresponding yk = 1. On the other hand, Py|x (|1) is
not defined, because the symbol 1 does not show up in the given sequence
x.

We also generalize the definitions of set of types and type class to this
conditional situation.
Definition 2.28. The set of conditional probability distributions that can be
conditional types for a length-n sequence from alphabet Y given a length-n
sequence from alphabet X is denoted by Pn (Y|X ). The set that contains all
such conditional distributions (type or not) is denoted by P(Y|X ).
For a conditional distribution PY |X P(Y|X ) and a given sequence x
n
X , the conditional type class of PY |X is defined as


T n (PY |X |x) , y Y n : Py|x = PY |X .
(2.104)
Be aware of our used notation: PY |X and Py|x are both conditional distributions, but the latter is defined via the occurrences of symbols in (x, y).
Theorem 2.29 (Conditional Type Theorem 1 (CTT1)).
The number of conditional types is bounded as follows:
|Pn (Y|X )| (n + 1)|X ||Y| .

(2.105)

Proof: For a given a X we already know that there are at most (n + 1)|Y|
ways of choosing Py|x (|a). Now we have |X | ways of choosing a, i.e., think of
Py|x (|) as a matrix with at most


(n + 1)|Y|

|X |

ways of choosing it.


Theorem 2.30 (Conditional Type Theorem 2 (CTT2)).
Fix a conditional distribution QY |X P(Y|X ) and a sequence x =
(x1 , . . . , xn ) X n . The conditional probability that a y occurs given the
sequence x is
QnY |X (y|x) = en(HPx (Py|x )+DPx (Py|x k QY |X )) ,

(2.106)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


30

Method of Types
where
HPx (Py|x ) ,

X
aX


Px (a) H Py|x (|a) ,

(2.107)



 X

Px (a) D Py|x (|a) QY |X (|a) .
DPx Py|x QY |X ,

(2.108)

aX

Hence, the probability QnY |X (y|x) is fully specified by the corresponding


types.
Proof: The proof is analogous to the unconditional version of TT2:
QnY |X (y|x)
n
Y
=
QY |X (yk |xk )

(2.109)

k=1

Y
(a,b)supp(Px,y )

Y
(a,b)supp(Px,y )

QY |X (b|a)N(a,b|x,y)

(2.110)

QY |X (b|a)nPx,y (a,b)

(2.111)

enPx,y (a,b) log QY |X (b|a)

(2.112)

(a,b)supp(Px,y )

= exp

(a,b)supp(Px,y )

nPx,y (a, b) log QY |X (b|a)

= expn


Py|x (b|a)
1
(2.114)

Px,y (a, b) log


QY |X (b|a) Py|x (b|a)


X
(a,b)supp(Px,y )

= expn

(2.113)

Px (a)

asupp(Px )

Py|x (b|a) log

bsupp(Py|x (|a))

Py|x (b|a)
QY |X (b|a)

X
asupp(Px )

Px (a)

X
bsupp(Py|x (|a))

Py|x (b|a) log

1
.
Py|x (b|a)
(2.115)

Theorem 2.31 (Conditional Type Theorem 3 (CTT3)).


For a given sequence x X n and any conditional distribution PY |X such
that the conditional type class T n (PY |X |x) is nonempty, we have


1
en HPx (PY |X ) T n (PY |X |x) en HPx (PY |X ) ,
|X
||Y|
(n + 1)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(2.116)

2.4. Conditional Types

31

where as above
HPx (PY |X ) ,

X
aX


Px (a) H PY |X (|a) .

(2.117)

Proof: The proof relies heavily on TT3. Fix a X with N(a|x) > 0, take
a sequence y T n (PY |X |x), and only consider those components of y that
have a corresponding component a in the sequence x, i.e., consider yI(a|x) .
This subsequence has length N(a|x) and its type is by definition
PyI(a|x) () = Py|x (|a) = PY |X (|a).

(2.118)

Hence, we look at the type class of all length-N(a|x) sequences with type
PY |X (|a). From TT3 we know that
1
eN(a|x) H(PY |X (|a))
|Y|
(N(a|x) + 1)


T N(a|x) PY |X (|a) eN(a|x) H(PY |X (|a)) .

(2.119)

To get the size of the total type class, we have to run through all possible a X
and generate every possible sequence y by taking all possible combinations of
components of each sub-type class, i.e., we have to compute the product of
the sizes of each sub-type class.2
Example 2.32. Let X = {0, 1, 2}, Y = {3, 4}, and
y = (3, 4, 4, 4, 3, 4),

(2.120)

x = (0, 1, 1, 0, 1, 2).

(2.121)

Then we have
Py|x

1/2

1/3

1/2

2/3

(2.122)

and therefore we have the following sub-type class sizes:



For a = 0: yI(0|x) = (3, 4), which is of type 21 , 21 (compare with first
column in (2.122)).
In total there are 2 sequences of this type.
For a = 1: yI(1|x) = (4, 4, 3), which is of type
second column in (2.122)).

1 2
3, 3

(compare with

In total there are 3 sequences of this type.


2
Note that every a chooses some subset of the n components of y, and we count how
many possible permutations exist for each subset. The total number of possibilities is then
the product of all subsets.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


32

Method of Types
For a = 2: yI(2|x) = (3), which is of type (1, 0) (compare with third
column in (2.122)).
In total there is only 1 sequence of this type.

So, in total there are 2 3 1 = 6 different choices for y having the same
conditional type.

So, we can now derive the lower bound in (2.116):


Y

T N(a|x) PY |X (|a)
|T n (PY |X |x)| =

(2.123)

aX
N(a|x)>0

(2.119)

1
eN(a|x) H(PY |X (|a))
|Y|
(N(a|x) +1)
aX
| {z }
N(a|x)>0
n

aX
N(a|x)>0

Y
aX

nP
1
e aX
(n + 1)|Y|

1
(n + 1)|Y|

N(a|x)
n

(2.124)

H(PY |X (|a))

(2.125)

!
en

aX

Px (a) H(PY |X (|a))

P
1
n aX Px (a) H(PY |X (|a))

e
.
(n + 1)|X ||Y|

(2.126)
(2.127)

The upper bound is derived analogously.


Finally, from the conditional versions of TT2 and TT3 we can easily prove
the conditional version of TT4.
Theorem 2.33 (Conditional Type Theorem 4 (CTT4)).
For any given sequence x X n , any QY |X P(Y|X ), and for a conditional type PY |X Pn (Y|X ), we have
1
en DPx (PY |X k QY |X )
(n + 1)|X ||Y|

QnY |X T n (PY |X |x) x en DPx (PY |X k QY |X ) .
Proof: This proof is fully analogous to the proof of TT4:

QnY |X T n (PY |X |x) x
X
=
QnY |X (y|x)
yT n (PY |X |x)

en(HPx (PY |X )+DPx (PY |X k QY |X ))

(2.128)

(2.129)
(2.130)

yT n (PY |X |x)

= |T n (PY |X |x)| en(HPx (PY |X )+DPx (PY |X k QY |X ))

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(2.131)

2.5. Remarks on Notation

33

en HPx (PY |X ) en(HPx (PY |X )+DPx (PY |X k QY |X ))

=e

n DPx (PY |X k QY |X )

(2.132)
(2.133)

The lower bound follows accordingly.

2.5

Remarks on Notation

We all know that notation always is messy. And since we are now trying
to bring some kind of logic into the used notation of this script, we merely
increase the chance to mess everything up even more. . . At the end, we wont
get around trying to understand what the statements actually mean!
We try to clearly distinguish between constant and random quantities.
The basic rule here is
capital letter X : random,
small letter

x : deterministic.

For vectors or sequences bold face is used:


capital bold letter X : random vector,
small bold letter

x : deterministic vector.

(In hand-writing bold is usually replaced by underline: X and x.) There are a
few exceptions to this rule. Certain deterministic quantities are very standard
in capital letters, so, to distinguish them from random variables, we use a
different font. For example, the capacity is denoted by C (in contrast to a
random variable C). Sets are denoted using a calligraphic font: F. So, if X
is a random variable (RV), then the alphabet of X is denoted by X :
X X.

(2.134)

Most importantly, in literature, it is quite common to use P as a generic


name for the probability distribution or, to be more precise, probability mass
functions (PMF) of some discrete random variable. Usually, a subscript will
be used to clearly specify the random variable
PX (x) = Pr[X = x],

(2.135)

or, if it is clear from the argument, it also might be dropped:


P (x) = PX (x).

(2.136)

I like this notation very much, however, unfortunately, it clashes with the
notation used for types. So we will try to follow some slightly adapted rules:
P and Q both can denote some specific PMF. This means that here we
think of P and Q to be fixed distributions and then we define random
variables having this distribution, e.g., X, Y Q and Z P . This
is in contrast to the P in PX and PY that is only generic and means
something different depending on the particular subscript!

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


34

Method of Types
When referring to the PMF of a RV X, we will exclusively use QX (and
not PX !).
Px denotes the type of sequence x. It is particularly important to note
that PX is the type of the random sequence X, i.e., it is a random
empirical distribution and not the PMF of the random vector X (which
would be stated as QX ).

Note that we try to use P for PMFs in the situation where the PMF
actually can be seen as a type for some sequence:
P : a PMF that could also be a type for some sequence,
Q : a general PMF.
So, since P(X ) denotes the set of all possible PMFs on a finite alphabet X and
Pn (X ) denotes the set of all possible types with denominator n, we usually
have
Q P(X ),

(2.137)

P Pn (X ).

(2.138)

If a length-n sequence of random variables X is generated IID Q, we


use a superscript n
QX (x) = Qn (x)

(2.139)

to emphasize that QX (x) actually is the product of n distributions Q.


If the argument of Q is a set, then this means the probability of all sequences in the set together. For example, taking again X IID Q, we have
the following four equivalent expressions:
Qn (F) = Pr[X F] =

Qn (x) =

xF

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


n
XY
xF k=1

Q(xk ).

(2.140)

Chapter 3

Large Deviation Theory


In this chapter we consider the following type of problem. Consider a sequence
of binary RVs X1 , X2 , . . . that is IID Q with
1
2
Q(0) = , Q(1) = .
(3.1)
3
3
From the law of large numbers (Theorem 1.17) we expect X = (X1 , . . . , Xn )
to have about 2/3 ones and about 1/3 zeros. So what is the probability that
this is not the case? For example, what is the probability that we have more
than 75% zeros? In mathematical terms, we would like to compute
" n
#
1X
3
Pr
Xk
=?
(3.2)
n
4
k=1

Now, it turns out that this question can be posed in very compact form using
types. Note that
n
X
1X
1X
xk =
aN(a|x) =
aPx (a) = EPx [X].
n
n
k=1

aX

(3.3)

aX

Hence, (3.2) can be reformulated as follows:


" n
#


1X
3
3
Pr
Xk
= Pr EPX [X]
n
4
4

(3.4)

k=1

where we remind the reader that PX is the type of a random sequence X. We


reformulate this further as


3
Pr EPX [X]
= Pr[PX F]
(3.5)
4
with


3

F , Q P(X ) : EQ [X]
.
4

(3.6)

So, the problem under consideration is to find the probability that a random sequence X has a type far away from the expected type, or in other
words, what is the probability that PX F, where F denotes a set of nontypical types.

35

c Stefan M. Moser, vers. 2.5


36

Large Deviation Theory

3.1

Sanovs Theorem

Theorem 3.1 (Sanovs Theorem [San57]).


Let Q P(X ) be an arbitrary given PMF. Let F P(X ) be a set of
probability distributions. Then

k Q)
D(Q

Qn T n (F) (n + 1)|X | en inf QF


.

(3.7)

If in addition the set F is nice in the sense that there exists a sequence
{Pn F Pn (X )} of types in F such that
k Q),
lim D(Pn k Q) = inf D(Q

(3.8)


1
k Q),
log Qn T n (F) inf D(Q

n n
QF

(3.9)


1
k Q).
log Qn T n (F) = inf D(Q
n n

QF

(3.10)

QF

then
lim

i.e., we have
lim

A graphical description of Sanovs Theorem is shown in Figure 3.1.


Remark 3.2. We have to clarify a notational problem in Theorem 3.1: What
happens if F = ? Note that we have not excluded such a choice, but an
infimum over an empty set is not properly defined. To resolve this issue and
to make sure that Sanovs Theorem is true even for this case, we introduce a
notational agreement: we define
Q) , .
inf D(Qk

(3.11)

Then, (3.7) simply states that Qn () = 0, which obviously is correct.


Proof of Theorem 3.1: Note that Pn (X ) is finite, i.e., we can always find
a P F Pn (X ) such that
D(P k Q) =

min

P F Pn (X )

D(P kQ).

(3.12)

Now, since the type class of any PMF that is not a type is empty by definition,
we have


Qn T n (F) = Qn T n F Pn (X )
(3.13)
X

n
n
=
Q T (P )
(3.14)
P F Pn (X )

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


3.1. Sanovs Theorem

37

P(X )

Figure 3.1: Sanovs Theorem. The triangle depicts the set of all PMFs, the
shaded area is the subset F, and Q is a given PMF. By Q , we
denote the PMF in F (or at least on the boundary of F) that
is closest to Q, where relative entropy is used as a distance
measure.

en D(P k Q)

X
P F Pn (X )

=e

P F Pn (X )

en D(P k Q)

n minP F Pn (X ) D(P k Q)

= en D(P
e

max

P F Pn (X )

(by TT4)

k Q)

n D(P k Q)
n D(P k Q)

(3.15)
(3.16)

(3.17)

P F Pn (X )

|F Pn (X )|

(3.18)

|Pn (X )|
(n + 1)

|X |

(enlarging set)

(3.19)

(by TT1).

(3.20)

Since
D(P kQ) =

min

P F Pn (X )

D(P k Q) inf D(QkQ),

QF

(3.21)

the upper bound (3.7) follows.


For the proof of the lower bound (3.9), suppose that we can find a sequence
{Pn F Pn (X )} such that (3.8) is satisfied. Then


Qn T n (F) = Qn T n F Pn (X )
(3.22)
X

n
n
=
Q T (P )
(3.23)
P F Pn (X )

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


38

Large Deviation Theory

X
P F Pn (X )

1
en D(P k Q)
|X
|
(n + 1)

(by TT4)

1
en D(Pn k Q) ,
(n + 1)|X |

(3.24)
(3.25)

where in the last inequality we have dropped all but one term in the sum.
Hence,



|X | log(n + 1)
1
n
n
lim log Q T (F) lim
D(Pn kQ)
(3.26)
n
n n
n


= lim D(Pn k Q)
(3.27)
n

= lim D(Pn kQ)


n

kQ),
= inf D(Q

QF

(3.28)
(3.29)

where in the last step we have used our assumption (3.8).


Finally note that from (3.7) it follows that

1
Q),
log Qn T n (F) inf D(Qk
n n

QF
lim

(3.30)

and therefore

1
log Qn T n (F)
n n

1
lim log Qn T n (F)
n n
Q),
inf D(Qk

Q) lim
inf D(Qk

QF

QF

(3.31)
(3.32)
(3.33)

where we have also made use of the fact that lim lim by definition. Hence,
Q).
the limit exists and is equal to inf QF
D(Qk

Remark 3.3. In [CT06], a slightly different version of Sanovs Theorem is


presented. There the distribution Q in Figure 3.1 is defined as

Q , argmin D(QkQ).

(3.34)

QF

However, this is strictly speaking not possible since we do not assume that F
is finite and it therefore might be that the argmin does not exist.
Moreover, the second half of the theorem is claimed to be true not for
nice sets F, but rather for sets F that are open subsets of P(X ). However,
it is nowhere mentioned how such an open subset of P(X ) is supposed to be
defined. Note that the term open only makes sense if we can define an
-environment around any member of the set. To do so, we need a distance
measure, but unfortunately D( k ) is not a measure! The situation might be
saved if one considers the normal Euclidean distance and every PMF as a
|X |-dimensional vector. However, in any case, the statement and its proof are
not clean.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


3.1. Sanovs Theorem

39

Example 3.4. Suppose we have a fair coin and want to estimate the probability of observing 700 or more heads in a series of 1000 tosses. So we want to
know the probability of the set of all sequences with 700 or more heads, i.e.,
all sequences with a type


k 1000 k
(3.35)
P =
,
1000
1000
with k {700, 701, . . . , 1000}. Let
n
o
P(X ) : Q
= (p, 1 p) with 0.7 p 1 .
F, Q
1 1
2, 2

Our coin has a distribution Q =

. Now,

p
1p
inf
p log
+ (1 p) log
0.7p1
1/2
1/2
= inf { Hb (p) + log 2}


kQ) =
inf D(Q

QF

(3.36)

0.7p1

(3.37)
(3.38)

= sup Hb (p) + log 2

(3.39)

= Hb (0.7) + log 2 0.0823 nats.

(3.40)

0.7p1

Hence, by Sanovs Theorem,


Pr[PX F] en0.0823 = e10000.0823 = e82.3 = 1.84 1036 .

(3.41)

Note that in this example F is very decent and it is easy to find a sequence
{Pn F P(X )} that achieves the infimum, because
Q = (0.7, 0.3) Pn (X ).

(3.42)

Example 3.5. Suppose we toss a fair dice n times. What is the probability
that the average of the tosses is greater
than or equal to 4?
1 Pn
We recall that the average n k=1 xk is the same as the expectation of
the type EPx [X]. For example, if x = (4, 5, 1, 6, 5, 6, 6, 5, 4, 5, 6) (n = 11), then
the average is about 4.82, the type is


1 0 0 2 4 4
Px =
, , , , ,
,
(3.43)
11 11 11 11 11 11
and hence
EPx [X] =

1
2
4
4
1+
4+
5+
6 4.82.
11
11
11
11

(3.44)

So, we define
(
F,

:
Q

6
X
i=1

)
4
iQ(i)

(3.45)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


40

Large Deviation Theory

Q) over all Q
F for the given distribution
and we need to minimize D(Qk
Q = (1/6, . . . , 1/6).
Since this problem shows up so often, we will now generalize it and solve
it once in general form so that in future we can directly refer to this solution.
We are interested in the event that the sample average of g(X) for some
function g() is greater than some value :
n

1X
g(Xk ) .
n

(3.46)

k=1

From the discussion above, we know that this event is equivalent to the event
{PX F} where
(
)
X
P(X ) :

F, Q
g(a)Q(a)
.
(3.47)
aX

Even more general, we may be interested in the event that J different such
sample averages are larger than some given thresholds:
)
( n
1X
(3.48)
gj (Xk ) j , j = 1, 2, . . . , J ,
n
k=1

which is equivalent to the event {PX F} with


(
)
X
P(X ) :

F, Q
gj (a)Q(a)
j , j = 1, 2, . . . , J .

(3.49)

aX

Again, the solution is given by Sanovs Theorem, i.e., we need to evaluate


Q). To that goal, we write down the Lagrangian:
inf QF
D(Qk

J
X
X

Q(a)

L(Q) =
Q(a) log
+
j
Q(a)g
j (a) j
Q(a)
a
a
j=1
!
X

Q(a)
1
+

(3.50)

and compute its derivatives



J

L Q(a)
Q(a)
Q(a)
1

= log
+ Q(a)

+
j gj (a) +

Q(a)
Q(a)
Q(a)
Q(a)
j=1

= log Q(a)
log Q(a) + 1 + +

J
X

j gj (a) = 0,

(3.51)

(3.52)

j=1

giving us the solution

Q(a)
= Q(a) e1

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


PJ

j=1

j gj (a)

(3.53)

3.1. Sanovs Theorem

41

The Lagrangian multiplier can be found by making sure that Q(a)


sums to
1:
PJ
X
X
!

Q(a) e j=1 j gj (a) = 1,


(3.54)
Q(a)
= e1
a

i.e.,
Q (a) = P

Q(a) e

a0

PJ

j=1

Q(a0 ) e

j gj (a)

PJ

j=1

j gj (a0 )

The remaining Lagrangian multipliers j must be chosen1 such that


X
Q (a)gj (a) = j

(3.55)

(3.56)

aX

for all j = 1, . . . , J.
In our example (3.45) this reads
1

Q (x) = P66

ex

1
a=1 6

ea

ex
= P6
a
a=1 e

(3.57)

where is chosen such that


6
X

xQ (x) = 4.

(3.58)

x=1

This can be solved numerically:


Q = (0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468)

(3.59)

D(Q k Q) = 0.0624 bits.

(3.60)

and

Hence, if n = 100 000, then the probability that the average is larger or equal
to 4 is about
2624 1.4 10188 .

(3.61)

Example 3.6. Lets reconsider Example 3.5 and only make a very tiny
change: Instead of asking what the probability is of seeing a sample average larger or equal to 4, we now ask what is the probability of seeing a sample
average of exactly 4:
(
)
6
X
:
=4 .
F , Q
iQ(i)
(3.62)
i=1

Note that if this choice is notPpossible, then some of the values j 0 are too loose in
0
0
the sense that if all constraints n1 n
k=1 gj (Xk ) j , j 6= j , are satisfied, then the j th
constraint is automatically satisfied. These superfluous constraints must then be removed
from the problem and the problem solved again without them.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


42

Large Deviation Theory

Its quite straightforward to see that the answer is completely identical to the
answer given in Example 3.5. How can that be?!?
First of all, lets be clear that the probability of {PX F} given in (3.45)
given in (3.62). It only could
is not the same as the probability of {PX F}
be the same if
"
(
)#
6
X
:
>4
Pr PX Q
iQ(i)
= 0,
(3.63)
i=1

which is obviously not the case.


However, note that Sanovs Theorem never claimed to provide the exact
probability expression! In its first half it gives an upper bound (3.7), which
Since F F, we
turns out to be the same bound for both cases F and F.
see that Sanovs upper bound (3.7) is tighter for F than for F (but it holds
for both cases!). However, note that we have not even bothered to include
the factor (n + 1)|X | in our expressions above. If we did, then obviously the
bound will become quite too big, at least for small and medium n, before the
exponential decrease really kicks in.
The second half of Sanovs Theorem provides an exact result, but this
result is only asymptotic and it does not provide the full probability expression,
but it only gives the exponent of the probability. This exponent turns out to be
Hence, we learn that the subset F dominates
identical for both cases F and F.
over all other cases F \ F in the sense that the probabilitys exponential growth

rate in n is fully determined by F.


So, to be really precise, Sanovs Theorem tells that


Pr average of 10000 tosses is 4 100015 2624 = 1.44 10168 , (3.64)
where we have used the improved bound from Corollary 2.12.

3.2

Pythagorean Theorem for Relative Entropy

We know that the relative entropy is not a distance measure. We will see
now that it actually behaves like a squared distance and thereby satisfies the
Pythagorean Theorem for triangles.
Before we state the theorem, lets quickly recall the definition of convexity.
Definition 3.7. A set F P(X ) is convex if from Q1 , Q2 F it follows that
Q , Q1 + (1 )Q2 F,

[0, 1].

(3.65)

Theorem 3.8 (Pythagorean Theorem for Relative Entropy


[Abb08]).
For a convex set F P(X ) and some distribution Q
/ F, let Q be the

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


3.2. Pythagorean Theorem for Relative Entropy

43

distribution such that


k Q).
D(Q k Q) = inf D(Q

(3.66)

QF

Note that because F is convex, Q always exists and is unique. However,


it might be2 that Q
/ F. Then
k Q ) + D(Q k Q) D(Q
k Q),
D(Q

F.
Q

(3.67)

Again consider Figure 3.1 for an illustration.


It is important to realize that this is not the Triangle Inequality. As
a matter of fact, the Triangle Inequality should be exactly the other way
and Q should be shorter than the
around: The direct connection between Q

way from Q via the detour of Q to Q. What we have here is the behavior of
squared distances.
Recall the simple example of vectors in the three-dimensional Euclidean
space. We project an arbitrary vector q R3 onto a given plane. Then the
projection point q is that point of the plane that has shortest Euclidean
distance to q:
q = argmin k
q qk.

(3.68)

plane
q

See Figure 3.2 for an illustration. In this case we have a triangle with a
90-degree angle and therefore we know that the Pythagorean identity holds:
k
q q k2 + kq qk2 = k
q qk2 ,

plane.
q

(3.69)

below the plane


If we keep q and q as shown in Figure 3.2, but we move q
(i.e., onto the side of the plane opposite to where q lies), then the 90-degree
angle of the triangle is changed to be larger than 90 degrees. In this case, the
Pythagorean identity is changed into an inequality:
k
q q k2 + kq qk2 k
q qk2 ,

half-space below plane.


q

(3.70)

See Figure 3.3.


The exactly same discussion can also be made in the space P(X ) of probability distributions, replacing the squared Euclidean distance by the relative
entropy. In P(X ) the equivalent of a plane is a linear family:3
n
o
P(X ) : E [f (X)] =
Fplane , Q
(3.71)
Q
2

Think of open versus closed sets, even though these terms are not well defined
because D( k ) is not a distance measure.
3
Compare this definition to the definition of a plane in R3 :


R3 : h
Fplane , q
q, vi =
for some vector v and some fixed value . Here, h, i denotes the inner product between
two vectors.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


44

Large Deviation Theory

Figure 3.2: Projection of a point q onto a plane.

> 90 degrees

q
Figure 3.3: A point below the projection plane.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


3.2. Pythagorean Theorem for Relative Entropy

45

for some function f () and some fixed value . It can be shown that the
following holds:
Q ) + D(Q kQ) = D(Q
kQ),
D(Qk

Fplane .
Q

(3.72)

This is the equivalent of (3.69) and Figure 3.2.


If we replace the equality in (3.71) by an inequality, we define a half-space
with the plane as boundary. In this case, (3.72) is adapted accordingly to
(3.70) and Figure 3.3 to
kQ ) + D(Q kQ) D(QkQ),

D(Q

F.
Q

(3.73)

In Theorem 3.8, we do not necessarily have a half-space, but F is defined


as a convex set and Q is the point closest to Q in (or on the boundary of)
F. This actually means that we create a tangential plane onto the set F
that touches F in Q . Hence, the situation is basically the same as shown in
Figure 3.3 and (3.73) still holds. See Figure 3.4.
Q

> 90 degrees

Figure 3.4: A convex set with a tangential plane.


After this long discussion, it only remains to actually prove the claim.
Proof: If Q F, we have no problems and define F 0 , F. If Q
/ F, we
0

0
define F , F {Q }. Note that F still is convex because F is convex and
Q must be on the boundary of F.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


46

Large Deviation Theory


F 0 and define
Fix Q
, Q
+ (1 )Q .
Q

(3.74)

F 0 for any [0, 1]. Moreover, as for = 0


Since F 0 is convex, we have Q

0 = Q , we must have that


we have Q
k Q)
D() , D(Q

(3.75)

larger equal to zero at = 0 (note that D(0) =


has a derivative D()

inf QF
0 D(QkQ)!). Hence,


D()
0
(3.76)
=0
!


X
Q(a)
+ (1 )Q (a)
(3.77)
=
Q(a) + (1 )Q (a) log


a
Q(a)
=0


(a)
X

Q(a)
+
(1

)Q

Q(a)
Q (a) log
=


Q(a)
a
=0


(a)
X

Q(a)

Q
Q(a)

+
Q(a)
+ (1 )Q (a)


Q(a)
Q(a)
+ (1 )Q (a)
a

=0

(3.78)

Q (a) Q(a)

=
Q(a)
log

Q(a) Q(a)
a
X
X

Q (a)
+
Q(a)
X

Q (a) log

Q (a)
Q(a)
(3.79)

| {z }
=1

{z

=1

Q) D(Qk
Q ) D(Q k Q).
= D(Qk

(3.80)

This proves the claim.


From the discussion above and from the proof, we realize that the assumption of F being convex is not crucial. As long as Q is unique, the tangential
plane is well-defined, and F is completely on the side of the plane opposite
of Q, the theorem holds. We call such a set one-sided with respect to Q. See
Figure 3.5 for an example of such a set.
Note that a set that is one-sided with respect to some Q might not be
one-sided with respect to some other Q0 . Again see Figure 3.5 for an example.
Actually, we could define a set F to be convex if it is one-sided for all Q
/ F.

Also note that one-sidedness makes sure that Q is unique! The reason is
that if there exists a Q 0 that also achieves the infimum in (3.66), then Q 0
must be on the wrong side of the tangential plane through Q , i.e., some part
of F will be on the same side of the tangential plane as Q. Hence, this set
cannot be one-sided. See Figure 3.6 for an illustration.
While the intuitive meaning of one-sidedness is quite clear, it turns out to
be quite difficult to describe this family of sets in a mathematical clean way.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


3.2. Pythagorean Theorem for Relative Entropy

47

> 90 degrees

F
Figure 3.5: A one-sided set with respect to Q. We also show its tangential
plane. Note that this set is not one-sided with respect to Q0 .
Q

Figure 3.6: Uniqueness of Q : Both Q , Q 0 F achieve the minimum relative


entropy to Q. However, this then means that F cannot be onesided with respect to Q, because Q 0 is on the wrong side of the
tangential plane of Q .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


48

Large Deviation Theory

A hand-waving definition could be, e.g., about as follows: A set F is one-sided


forms an angle to line Q Q
F the line QQ
with respect to Q if for all Q
that is more than 90 degrees. This, however, involves the definition of angles
in P(X ), which might be troublesome. It actually is much easier to specify it
using the Pythagorean principle. Pythagoras says that for a triangle with an
angle more than 90 degree, we have
2

+ Q Q2 QQ
.
QQ

(3.81)

Hence, we define a set F to be one-sided with respect to some Q if (3.73)


is satisfied. Unfortunately, it does now seem to be very trivial to state that
one-sided sets satisfy the Pythagorean Theorem (Theorem 3.8). . .
However, we hope that the reader will forgive us this tail-biting cheat
in light of the actual meaning as shown in Figure 3.5 and knowing that very
many such sets do exist like, e.g., all convex sets F.
Definition 3.9 (One-Sided Set [Abb08]). Let F P(X ) be a subset of the set
of distributions on the finite alphabet X , and let Q P(X ) be a distribution
not in F, Q
/ F. Let Q P(X ) be such that
Q).
D(Q kQ) = inf D(Qk

QF

(3.82)

F it is true that
We say that F is one-sided with respect to Q if for all Q

Q).
D(QkQ
) + D(Q k Q) D(Qk

3.3

(3.83)

The Pinsker Inequality

Sometimes, one needs a proper distance measure between PMFs and therefore
cannot rely on the relative entropy. One possible such measure is the variational distance. This measure was introduced in [Mos14, Section 1.3] and
some relations to entropy were discussed in [Mos14, Appendix 1.B].
Definition 3.10. The variational distance (sometimes also called L1 -distance) between any two probability distributions Q1 and Q2 is defined as
X

Q1 (a) Q2 (a) .
(3.84)
V (Q1 , Q2 ) ,
aX

We will now investigate the variational distance further and deepen our
understanding. Let


M , a X : Q1 (a) > Q2 (a)
(3.85)
such that


max Q1 (A) Q2 (A) = Q1 (M) Q2 (M)

AX

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(3.86)

3.3. The Pinsker Inequality

49

(because to maximize Q1 (A) Q2 (A) you choose all a X for which Q1 (a) >
Q2 (a)). Now, by definition we have
X

Q1 (a) Q2 (a)
V (Q1 , Q2 ) =
(3.87)
aX

aM

X


Q1 (a) Q2 (a) +
Q2 (a) Q1 (a)

(3.88)

aMc

= Q1 (M) Q2 (M) + Q2 (Mc ) Q1 (Mc )

= Q1 (M) Q2 (M) + 1 Q2 (M) 1 + Q1 (M)



= 2 Q1 (M) Q2 (M) .

(3.89)
(3.90)
(3.91)

Hence we have
Q1 (M) Q2 (M) =

1
V (Q1 , Q2 ).
2

(3.92)

Q2

Q1
I

II
III
Mc

Figure 3.7: Representation of two PMFs Q1 and Q2 : the width of the columns
are chosen to be 1 so that area corresponds to probability. The
set M collects all a X for which Q1 (a) > Q2 (a).
To understand this relationship better consider the example shown in Figure 3.7. Here all a M (i.e., those a for which Q1 (a) > Q2 (a)) are on the
left, and it holds that
area I = Q1 (M) Q2 (M).
But since the areas I and II also can be written as
Z
area I =
dQ1 area III
X

= 1 area III,
Z
area II =
dQ2 area III
X

= 1 area III,

(3.93)

(3.94)
(3.95)
(3.96)
(3.97)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


50

Large Deviation Theory

we see that area I and area II must always be equally large. In other words,
Q1 (M) Q2 (M) = Q2 (Mc ) Q1 (Mc ).

(3.98)

Since this argument does not use any fact particular to this example, (3.98)
actually holds in general.
In literature, the quantity (3.86) is known as total variation distance, even
though on first sight its definition looks quite different:
Definition 3.11. The total variation distance between any two probability
distributions Q1 and Q2 is defined as


(3.99)
Vtot (Q1 , Q2 ) , max Q1 (A) Q2 (A) .
AX

(E.g., see [LPW08, Chapter 4])


To show that (3.99) indeed is the same as (3.86), we write (3.99) in a
different form and use (3.86):






Vtot (Q1 , Q2 ) = max max Q1 (A) Q2 (A) , max Q2 (A) Q1 (A)
(3.100)
AX
AX
n
o
= max Q1 (M) Q2 (M), Q2 (Mc ) Q1 (Mc )
(3.101)
= Q1 (M) Q2 (M),

(3.102)

where the last equality follows from (3.98).


Hence, (3.92) can be rewritten as follows.
Lemma 3.12 (Relation between Variational Distance and Total Variation Distance). For any two probability distributions Q1 and Q2 ,
Vtot (Q1 , Q2 ) =

1
V (Q1 , Q2 ).
2

(3.103)

We now come to the main result of this section: a relation between the
variational distance and the relative entropy.
Theorem 3.13 (Pinsker Inequality [Pin60] [Csi84]).
For two PMFs Q1 , Q2 P(X ), we have
D(Q1 k Q2 )

1 2
V (Q1 , Q2 ) log e.
2

(3.104)

Proof: We start by proving the special case of a binary alphabet: Xb =


{0, 1}. Let P1 , P2 P(Xb ) with
P1 (1) = p,
P2 (1) = q,

P1 (0) = 1 p,
P2 (0) = 1 q,

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(3.105)
(3.106)

3.3. The Pinsker Inequality

51

and assume that q p. Then we have


p
1p
+ (1 p) log
q
1q

(3.107)

V 2 (P1 , P2 ) = (|p q| + |1 p 1 + q|)2

(3.108)

D(P1 kP2 ) = p log


and

= (p q + p q)
2

= 4(p q) .

(3.109)
(3.110)

We define
1 2
V (P1 , P2 ) log e
2
1p
p
2(p q)2 log e
= p log + (1 p) log
q
1q

g(p, q) , D(P1 kP2 )

(3.111)
(3.112)

and we note that


g(p, p) = 0

(3.113)

and that
0

14 12 12 =0

}|
z }| { z
{
(q p) 1 4q(1 q)
g(p, q)
=
log e 0,
q
q(1 q)

for q p.

(3.114)

Hence, we have
g(p, q) 0,

q p,

(3.115)

which proves the claim for the binary case.


For the general case, let Q1 , Q2 P(X ) and define


M , a X : Q1 (a) > Q2 (a) .

(3.116)

Let X be some RV and define Y , I {X M} to be an indicator RV, i.e.,


(
1 if X M,
Y =
(3.117)
0 if X
/ M.
Now, let P1 be the PMF of Y if X Q1 , and let P2 be the PMF of Y if
X Q2 . Then,
X
P1 (1) =
Q1 (a) = Q1 (M) , p,
P1 (0) = 1 p,
(3.118)
aM

P2 (1) =

X
aM

Q2 (a) = Q2 (M) , q,

P2 (0) = 1 q.

(3.119)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


52

Large Deviation Theory

Also note that by definition of M, we have p q, i.e.,




Vtot (P1 , P2 ) = max P1 (A) P2 (A)
A{0,1}

= P1 ({1}) P2 ({1})
= Q1 (M) Q2 (M)
= Vtot (Q1 , Q2 ),

(3.120)
(3.121)
(3.122)
(3.123)

where (3.121) follows because p q, i.e., the maximum is achieved if A = {1};


where (3.122) follows from (3.118)(3.119); and where (3.123) holds by (3.102).
Since Y is a function of X we can apply the Data Processing Inequality
for the relative entropy (Proposition 1.12) to get
D(Q1 k Q2 ) D(P1 k P2 )
1
V 2 (P1 , P2 ) log e
2
2
1
= 2 Vtot (P1 , P2 ) log e
2
2
1
= 2 Vtot (Q1 , Q2 ) log e
2
1 2
= V (Q1 , Q2 ) log e
2

3.4

(DPI)

(3.124)

(by binary case of (3.104))

(3.125)

(by Lemma 3.12)

(3.126)

(by (3.123))

(3.127)

(by Lemma 3.12).

(3.128)

Conditional Limit Theorem

Before we state the Conditional Limit Theorem, we make a small observation


about the type of a sequence.
Lemma 3.14. Let P Pn (X ) be some fixed type. Let X = (X1 , . . . , Xn ) be
IID Q. Then given that PX = P , the PMF of X1 is P :
Pr[X1 = a | PX = P ] = P (a),

a X.

(3.129)

Proof: We know from TT2 that all sequences of a given type have the
same probability. Hence, conditionally on PX = P , all possible sequences are
equally likely. Moreover, since PX = P , we also know that there are nP (a)
positions k where Xk = a. Hence, the probability that the first position is a
is exactly nPn(a) = P (a).

Example 3.15. Consider X = {a, b}, n = 3, and P = 13 , 23 . Note that Q
does not matter. Given that PX = P , it follows that
X =ab b
or X = b a b

(3.130)

or X = b b a
where all three possibilities are equally likely. Hence, we have a 31 -chance that
X1 = a and a 32 -chance that X1 = b.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


3.4. Conditional Limit Theorem

53

Theorem 3.16 (Conditional Limit Theorem [Csi84]).


Let F P(X ) be a convex subset of P(X ), and let Q P(X ) be a given
distribution not in F: Q
/ F. Let Q P(X ) be such that
k Q).
D(Q k Q) = inf D(Q

QF

(3.131)

Let the sequence X = (X1 , . . . , Xn ) be IID Q, and fix an arbitrary


 > 0. Then for all a X
Q (a) (, n) < Pr[X1 = a | PX F] < Q (a) + (, n)

(3.132)

where

2 
(, n) ,
+ (n + 1)2|X | en .
log e

(3.133)

Proof: For convenience, we define D , D(Q kQ). Moreover, we define




P(X ) : D(Qk
Q) r
Sr , Q

(3.134)

A , SD +2 F Pn (X ),

B , F Pn (X ) \ A,

(3.135)

and

(3.136)

i.e., A is the set of types in F that achieve a relative entropy close to D , and
B are all other types in F, see Figure 3.8.
We start by showing that with high probability the type of X is in A. We
have
Qn (T n (B)) =

Qn (T n (P ))

P F Pn (X )
D(P k Q)>D +2

<

P F Pn (X )
D(P k Q)>D +2

en D(P k Q)
+2)

en(D

P F Pn (X )
D(P k Q)>D +2
+2)

en(D

(3.137)

(by TT4)

(3.138)

(D(P kQ) > D + 2)

(3.139)

(more terms in sum)

(3.140)

(by TT1)

(3.141)

P Pn (X )

+2)

(n + 1)|X | en(D

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


54

Large Deviation Theory


F

Q
P(X )
Q

Figure 3.8: Illustration of the set A defined in (3.135). The triangle depicts
the set of all PMFs, the lightly shaded area is the subset F, and
the darkly shaded area is A.
and
Qn (T n (A))
Qn (T n (A SD + ))
n

(reduce set)

= Q (T (SD + F Pn (X )))
X
=
Qn (T n (P ))
P F Pn (X )
D(P k Q)D +

X
P F Pn (X )
D(P k Q)D +

X
P F Pn (X )
D(P k Q)D +

(3.142)
(3.143)
(3.144)

1
en D(P k Q)
(n + 1)|X |

(by TT4)

(3.145)

1
n(D +)
e
(n + 1)|X |

(D(P k Q) D + )

(3.146)

(drop all but one term).

(3.147)

en(D +)
|X
|
(n + 1)

Hence,
Qn (T n (B F))
Qn (T n (F))
n
Q (T n (B))
= n n
Q (T (F))

Pr[PX B | PX F] =

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(def. of cond. prob.) (3.148)


(B F)

(3.149)

3.4. Conditional Limit Theorem

55

Qn (T n (B))
Qn (T n (A))

(n + 1)|X | en(D +2)


<
1
en(D +)
(n+1)|X |

(A F)

(3.150)
(3.151)

= (n + 1)2|X | en .

(3.152)

Therefore, since A and B are disjoint and their union contains all types in F,
we have
Pr[PX A |PX F] = 1 Pr[PX B | PX F]
> 1 (n + 1)

2|X | n

(3.153)
(3.154)

Next we show that all types in A are close to Q . For this we need to rely
on the Pythagorean Theorem (Theorem 3.8). For any P A we have
D + 2 D(P k Q)

D(P k Q ) + D(Q k Q)

(by definition of A)

(by Pythagoras (Th. 3.8))

= D(P k Q ) + D ,

(3.155)
(3.156)
(3.157)

i.e.,
D(P k Q ) 2.

(3.158)

Further we use the Pinsker Inequality (Theorem 3.13)


1 2
V (P, Q ) log e D(P k Q )
2

(3.159)

to conclude that for every P A

2 
.
V (P, Q )
log e

(3.160)

2 
PX A = V (PX , Q )
,
log e

(3.161)

Hence we have

which combined with (3.154) implies





2 

PX F Pr[PX A |PX F]
Pr V (PX , Q )
log e
> 1 (n + 1)2|X | en .
Next we note that by definition (3.160) can be written as

X
2 

|P (a) Q (a)|
,
log e
aX

(3.162)
(3.163)

(3.164)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


56

Large Deviation Theory

which implies that

2 
,
|P (a) Q (a)|
log e

Hence, if V (P, Q )

2  ,
log e

a X.

(3.165)

then

2 
2 

Q (a)
P (a) Q (a) +
,
log e
log e

a X.

(3.166)

The lower bound in (3.132) can now be derived as follows:


Pr[X1 = a |PX F]
X
Pr[X1 = a|PX F, PX = P ] Pr[PX = P |PX F] (3.167)
=
P F Pn (X )

P F Pn (X )

X
P F Pn (X )

Pr[X1 = a|PX = P ] Pr[PX = P |PX F]


|
{z
}

(3.168)

P (a) Pr[PX = P | PX F]

(3.169)

= P (a) (by Lemma 3.14)

P F Pn (X)
V (P,Q ) 2loge

X
P F Pn (X)
V (P,Q ) 2loge

P (a) Pr[PX = P | PX F]

(3.170)



2 

Q (a)
Pr[PX = P | PX F]
log e

(3.171)



2 

= Q (a)
log e

X
P F Pn (X)
V (P,Q ) 2loge

Pr[PX = P |PX F]

(3.172)


 

2 
2 

Pr V (PX , Q )
P

F
(3.173)
X
log e
log e


2  

(3.174)
1 (n + 1)2|X | en
log e

2 
2 
Q (a) (n + 1)2|X | en +
(n + 1)2|X | en
= Q (a)
log e | {z }
log e
|
{z
}
1

= Q (a)

> Q (a)

2 

Q (a)
(n + 1)2|X | en .
log e

(3.175)
(3.176)

Here, (3.167) follows from the Total Probability Theorem; in (3.169) we use
Lemma 3.14; the inequality (3.170) follows by constraining the sum; (3.171)
follows from (3.166); and in (3.174) we use (3.163).

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


3.4. Conditional Limit Theorem

57

The derivation of the upper bound is very similar. We start with (3.169):
X
P (a) Pr[PX = P | PX F]
(3.177)
Pr[X1 = a| PX F] =
P F Pn (X )

P F Pn (X)
V (P,Q ) 2loge

P (a) Pr[PX = P |PX F]

P F Pn (X)
V (P,Q )> 2loge

P (a) Pr[PX = P |PX F], (3.178)

and bound each sum separately. Similarly to (3.170)(3.172), we have


X
P (a) Pr[PX = P |PX F]
P F Pn (X)
V (P,Q ) 2loge



2 

Q (a) +
log e

X
P F Pn (X)
V (P,Q ) 2loge

2 
,
Q (a) +
log e

Pr[PX = P |PX F]
{z

(3.179)

}
(3.180)

and similarly to (3.172)(3.174), we have


X
P (a) Pr[PX = P | PX F]
| {z }
P F Pn (X)
V (P,Q )> 2loge

X
P F Pn (X)
V (P,Q )> 2loge

Pr[PX = P |PX F]




2 
= Pr V (PX , Q ) >
P

F
X
log e



2 

PX F
= 1 Pr V (PX , Q )
log e
< (n + 1)2|X | en .

(3.181)

(3.182)
(3.183)
(3.184)

This yields the claimed result.


Remark 3.17. The way the Conditional Limit Theorem is stated in [CT06,
Theorem 11.6.2] makes no sense. Beside the problem that strictly speaking
we have not defined what a closed subset of P(X ) is, the main problem is that
the conditional probability distribution of X1 is random and converges to Q
in probability. However, this randomness of a PMF is not visible at all in the
used notation and the proof is so unclear that I cannot really follow it. In the

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


58

Large Deviation Theory

version given in Theorem 3.16, no PMF is random (but we do have types of


random sequences!).
It is interesting to question the required assumptions of Theorem 3.16: We
assume that F is convex. This we do in order to be able to use the Pythagorean
Theorem for relative entropy (Theorem 3.8). However, we have already seen
in Section 3.2 that convexity is not really needed as long as Q is unique, the
tangential plane well-defined and F completely on the opposite side of Q. We
then introduced the definition of a one-sided set F with respect to a given
Q: a set that exactly satisfies these required conditions. Obviously, we can
generalize the Conditional Limit Theorem accordingly and replace convex
by one-sided with respect to Q.
Actually, we can generalize the theorem even further: Since in the proof
of Theorem 3.16, the Pythagorean Theorem (Theorem 3.8) was only applied
to A, it is not necessary that the whole set F is one-sided, but it is sufficient
that the Pythagorean Theorem holds for A! We say that we need F to be
locally one-sided with respect to Q. An example of a set that is only locally
one-sided, but not one-sided with respect to Q is shown in Figure 3.9. Note
Q

F
Figure 3.9: An example of a locally one-sided set F.
that not every set is locally one-sided, as can be seen from the counterexample
shown in Figure 3.10.
To make sure that we do not get into troubles, we in addition also require
that Q is unique (this might not be the case anymore if F is only locally
one-sided and not properly one-sided!).
We now reformulate the Conditional Limit Theorem and get the following
more general version.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


3.4. Conditional Limit Theorem

59

F
Figure 3.10: An example of a set F that is not locally one-sided with respect
to Q. The shaded area shows the part of A that is on the wrong
side of the tangential plane.

Theorem 3.18 (Generalized Conditional Limit Theorem).


Let F P(X ) be a given set of distributions and let Q P(X ) be a given
distribution not in F: Q
/ F. Assume that F is such that there exists a
unique Q P(X ) with
k Q).
D(Q k Q) = inf D(Q

QF

For convenience, we define D , D(Q k Q).


Moreover, for


P(X ) : D(Q
k Q) r
Sr , Q

(3.185)

(3.186)

and for an arbitrary  > 0, we define the set A of types in F that achieve
a relative entropy close to D ,
A , SD +2 F Pn (X ),

(3.187)

and require that A is one-sided with respect to Q, i.e., for all P A


D(P k Q ) + D(Q k Q) D(P k Q).

(3.188)

In other words, we require F to be locally one-sided with respect to


Q.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


60

Large Deviation Theory


Let the sequence X = (X1 , . . . , Xn ) be IID Q. Then for all a X
Q (a) (, n) < Pr[X1 = a | PX F] < Q (a) + (, n)

(3.189)

where

2 
(, n) ,
+ (n + 1)2|X | en .
log e

(3.190)

Proof: The proof is identical to the proof of Theorem 3.16 with the only
difference that instead of relying on the Pythagorean Theorem (Theorem 3.8),
we invoke the assumption (3.188).
 
Example 3.19. Consider {Xk } IID Q and > EQ X 2 fixed. Then it
follows from Theorem 3.18 that

"
#
n
1X
n

Pr X1 = a
Xk2 Q (a)
(3.191)
n
k=1

 
kQ) over all Q
that satisfy E X 2 (note that
where Q minimizes D(Q
Q

 

P(X ) : E X 2 ). From (3.55) we know that
we have F = Q
Q
2

Q (x) = Q(x) e|1x


{z } .

(3.192)

Gaussian!

Hence, we see that the conditional distribution of X1 is the normalized product


of the original distribution Q and the maximum entropy distribution, which
is Gaussian in this case!
So we observe a strengthening of the maximum entropy principle here.
Corollary 3.20. The Conditional Limit Theorem can be extended to the m
first elements of X (where X is IID Q):
n

Pr[X1 = a1 , X2 = a2 , . . . , Xm = am | PX F]

m
Y

Q (ak ),

(3.193)

k=1

where m N is fixed.
We omit the proof, but hope that this result is quite intuitive. Lets just
discuss the case m = 2. Obviously, the Conditional Limit Theorem does not
depend on whether we regard X1 or X2 . So the only new part in Corollary 3.20
is the conditional independence between X1 and X2 when n tends to infinity.
This can be understood by noting that X1 and X2 are dependent due to the
given structure of the type of the sequence. However, the longer we make the
sequence, the weaker this dependence becomes.
Note that Corollary 3.20 strongly relies on the assumption that m is fixed.
In particular, m must not grow with n. Once again, this is obvious because

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


3.4. Conditional Limit Theorem

61

given the type of a sequence, the last component can be determined from the
previous components, i.e., they are not independent and can therefore not
have a product distribution.
Example 3.21. Lets continue with Example 3.5. We know that
Q (a) = Q(a) max-entropy PMF for given mean =

1 x
e .
c

(3.194)

So, given that the average of a series of dice throws is larger than 4, the first
couple of throws look like as if they were IID according to an exponential
distribution.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Chapter 4

Strong Typicality
We have introduced types in Chapter 2 and already seen in Chapter 3 how
the concept can be very useful in proofs. We will further rely on types in
this class, for example in the study of error exponents or universal source
coding, however, most of the time we will use a slightly simpler tool: typical
sets. The idea of typical sets is to merge several nearby types and their type
classes together, so that we do not need to bother about the exact form of
the distributions, i.e., we do not need to worry whether a PMF is now a type
( Pn (X )) or not. Since types are dense in the set of PMFs1 (similar to Q
being dense in R), for any PMF there is always a type close-by.
There are two types of typical sets: weakly typical sets and strongly typical sets. We have defined weakly typical sets in the first course [Mos14,
Chapter 19]. The biggest advantage of weakly typical sets is that they are
easily generalized to continuous-alphabet random variables, while this is not
possible for strongly typical sets and types, which require a finite alphabet.
However, weakly typical sets are difficult to teach because they are not really
very intuitive.
In this course we are going to talk about strongly typical sets that turn
out to be much more intuitive. They provide the main tool that we will
need to prove almost all results discussed in this course. Actually, strongly
typical sets are a more powerful tool than weakly typical sets because they
merge fewer types together (therefore the names!). Note, however, that the
strongest results are proven using types and not typical sets.
We would like to point out an unfortunate misunderstanding: Due to their
name it is tempting to think that strongly typical sets rely on the strong law
of large numbers (Theorem 1.18), while weakly typical sets only require the
weak law of large numbers (Theorem 1.17). This is not true: Both concepts,
strong and weak typicality, only require the weak law of large numbers.
As mentioned, strong typicality only works with finite discrete alphabets,
so all our results are restricted to such cases. One exception, however, shall be
1
We put dense in quotation mark here because we have not properly defined what we
mean by it. Since we will never need to rely on this concept in any proof, we will leave it
undefined, but hope that it illuminates the basic idea anyway.

63

c Stefan M. Moser, vers. 2.5


64

Strong Typicality

mentioned already now: It is basically always possible to extend any results


about a discrete memoryless channel (DMC) or a discrete memoryless source
(DMS) to the corresponding Gaussian situation (i.e., a Gaussian channel or
a Gaussian source, respectively). While we will never prove this properly, we
will still discuss the (in practice very important) Gaussian case once we have
proven the corresponding result for the DMC or DMS.

4.1

Strongly Typical Sets

Definition 4.1. Fix an  > 0 and a distribution QX P(X ). The strongly


(n)
typical set A (QX ) with respect to the distribution QX is defined as




(n)
A (QX ) , x X n : Px (a) QX (a) <
, a X , and
|X |

Px (a) = 0, a X with QX (a) = 0 , (4.1)
i.e., the strongly typical set is a set of length-n sequences that all have a type
close to a given distribution. Note that the second condition makes sure that
we dont need to specially treat symbols with zero probability: If a symbol
has zero probability, we would like to think of it as never occurring.
Remark 4.2. Some remarks about our used notation are in place. We will
refer to the strongly typical set with respect to some PMF Q by
A(n) (Q).
(n)

Unfortunately, in literature one often sees A (X) where X is a RV that is


distributed according to QX . This is similar to H(X) versus H(QX ). We will
(n)
try to stick with the more precise A (QX ).
The strongly typical set has two important parameters: the tolerance  > 0
and the sequence length n. In our proofs that rely on typical sets we will see
heaps of s and s that all depend on the original  of the strongly typical set.
To simplify our life, we introduce the following convention: s only depend
(n)
on the main  from A , but s also depend on n:
for i (like 1 , 2 , 3 , . . .) :
for i (like 1 , 2 , 3 , . . .) :

i 0 if  0,

i 0 if n and  0 (in this order).

Note that the order matters. We always let n go to infinity first and only
afterwards make  small.
Moreover, one particular  and one particular show up so often that we
name them. We define
m (Q) ,  log Qmin

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(4.2)

4.1. Strongly Typical Sets

65

where Qmin denotes the smallest positive value of Q. Here m stands for
minimum and we do not especially point out the implicit dependence of m
on . Note that m (Q) > 0 because log Qmin < 0.
We also define
t (n, , X ) , (n + 1)

|X |



2
exp n
log e
2|X |2

(4.3)

where t stands for typicality.


We now state the main properties of the strongly typical set.
Theorem 4.3 (Theorem A (TA)).
P(X ) and let x A(n)
Let Q, Q
(Q) be a strongly typical sequence.

Define m () as given in (4.2) and t (n, , X ) as given in (4.3).
1. The probability of the typical sequence x is bounded as follows:

m (Q))
n (x) < en(H(Q)+D(Q k Q)
,
a) en(H(Q)+D(Q k Q)+m (Q)) < Q

(4.4)
b)

n(H(Q)+m (Q))

n(H(Q)m (Q))

< Q (x) < e

(4.5)

2. The size of the strongly typical set is bounded as follows:





(Q) < en(H(Q)+m (Q)) . (4.6)
1 t (n, , X ) en(H(Q)m (Q)) < A(n)

be IID Q
and X be IID Q, and define
3. Let X

1 , m (Q) + m (Q).

(4.7)

of the typical set A(n) (Q),


The total probability (based on Q)



A(n) (Q) = Q
n A(n) (Q) ,
Pr X
(4.8)


is bounded as
a)

1 t (n, , X ) en(D(Q k Q)+1 )




1)
n A(n) (Q) < en(D(Q k Q)
<Q
.

(4.9)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


66

Strong Typicality
(n)

The total probability (based on Q) of the typical set A





Pr X A(n)
(Q) = Qn A(n)
(Q) ,



(Q),
(4.10)

is bounded as

1 t (n, , X ) Qn A(n) (Q) 1.

b)

(4.11)

Remark 4.4. We would like to point out that those bounds that contain a
factor (1t ) in front of the exponential term can be rewritten with the (1t )factor incorporated into the exponent and combined with the . However, since
the t needs to have n tending to infinity before we can make  small, the  in
the exponent must be changed to a . For example, in TA-2 we have

1
1 (n, , X ) en(H(Q)m (Q)) = en(H(Q)m (Q)+ n log(1t (n,,X )))
(4.12)
t

= en(H(Q))

(4.13)


1
log 1 t (n, , X ) .
n

(4.14)

with
, m (Q)
Note that m (Q) for n .
Proof of Theorem 4.3: We start with TA-1a: Let


X 0 , X \ a X : Q(a)
=0

(4.15)

and recall that by definition


x A(n)
(Q) = Px (a) > Q(a)

Hence,
n (x) =
Q

n
Y


.
|X |

k)
Q(x

(4.16)

(4.17)

k=1

N(a|x)
Q(a)

(4.18)

nPx (a)
Q(a)

(4.19)

aX 0

Y
aX 0

<

n
Q(a)

Q(a) |X |

(by (4.16))

(4.20)

aX 0

!

X 


= exp
n Q(a)
log Q(a)
|X |
0
aX
!


X
X 
Q(a)
1

= exp n
Q(a) log
log Q(a)

Q(a)
|X
|
Q(a)
aX 0
aX 0
 


0
+ H(Q) + |X |
min
exp n D(Q k Q)
log Q
|X |

+ H(Q) m (Q)
.
exp n D(Qk Q)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(4.21)
(4.22)
(4.23)
(4.24)

4.1. Strongly Typical Sets

67

Here, in the last inequality, we used that |X 0 | |X | and applied (4.2). Note
that we have to use X 0 in (4.18) because otherwise we might get to an expression like 00 , which is not defined.
The lower bound follows similarly.
= Q.
TA-1b follows from TA-1a by choosing Q
Next we turn to TA-3b and define
n
o
F , Px Pn (X ) : x
/ A(n) (Q)
(4.25)
to be the set of all types of all nontypical sequences. Then for any P F
there must exist some a X such that


P (a) Q(a) 
|X |

(4.26)

or
P (a) > 0

but Q(a) = 0,

(4.27)

because otherwise the conditions in (4.1) are satisfied and the corresponding
sequence x with this type Px = P would be typical. Hence,
1 2
V (P, Q) log e
2
!2
1 X
=
|P (a) Q(a)| log e
2

D(P kQ)

(Pinsker (Th. 3.13))

(4.28)
(4.29)

aX

1
(|P (a) Q(a)|)2 log e
2
2

log e.
2|X |2

(drop terms)

(4.30)

(by (4.26))

(4.31)

Here in (4.30) we drop all terms in the sum apart from that particular a that
satisfies (4.26). Note that if (4.27) holds instead of (4.26), then D(P kQ) =
and (4.31) is true trivially.
Now, it follows from Sanovs Theorem (Theorem 3.1) that


n
n
|X |
Q (T (F)) (n + 1) exp n min D(P k Q)
(4.32)
P F


2
(n + 1)|X | exp n
log e
(by (4.31))
(4.33)
2|X |2
= t (n, , X ).
(4.34)
Hence,

Qn A(n)
(Q) = 1 Qn (T n (F)) 1 t (n, , X ).


(4.35)

The upper bound in TA-3b is trivial.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


68

Strong Typicality
The claim TA-2 is proven as follows:
X
1=
Qn (x)

(4.36)

xX n

Qn (x) +

Qn (x)

(4.37)

(n)
xA
/  (Q)

(n)
xA (Q)

Qn (x) + Qn (T n (F))

(4.38)

en(H(Q)m (Q)) + t (n, , X )

(4.39)

(n)
xA (Q)

<

X
(n)
xA (Q)



(Q) en(H(Q)m (Q)) + t (n, , X ),
= A(n)


(4.40)

where the inequality follows from TA-1b and (4.34). This proves the lower
bound. For the upper bound we have
X
1=
Qn (x)
(4.41)
xX n

>

Qn (x)

(4.42)

(n)
xA (Q)

en(H(Q)+m (Q))

(by TA-1b)

(4.43)

(n)
xA (Q)



(Q) en(H(Q)+m (Q)) .
= A(n)


(4.44)

Finally, we prove TA-3a:


X

n A(n)
n (x)
Q
(Q) =
Q


(4.45)

(n)
xA (Q)

<

en(H(Q)+D(Q k Q)m (Q))

(by TA-1a) (4.46)

(n)
xA (Q)

= A(n)
(Q) en(H(Q)+D(Q k Q)m (Q))


< en(H(Q)+m (Q)) en(H(Q)+D(Q k Q)m (Q))


=e

n(D(Q k Q)
1)

(4.47)
(by TA-2) (4.48)
(4.49)

The lower bound is analogous.

4.2

Jointly Strongly Typical Sets

It is straightforward to generalize the definition of strongly typical sets to joint


distributions.
Definition 4.5. Fix an  > 0 and a distribution QX,Y P(X Y). The
(n)
jointly strongly typical set A (QX,Y ) with respect to the distribution QX,Y

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


4.2. Jointly Strongly Typical Sets

69

is defined as
A(n)
(QX,Y )



, (x, y) X n Y n :

, (a, b) X Y, and
|X | |Y|

Px,y (a, b) = 0, (a, b) X Y with QX,Y (a, b) = 0 .(4.50)

|Px,y (a, b) QX,Y (a, b)| <

In contrast to the weakly typical set, where the individual typicality had
to be taken as a condition into the definition of jointly weakly typical sets,
here this implicitly follows from the definitions! We have the following lemma.
Lemma 4.6 (Joint Typicality Implies Individual Typicality). Let QX
and QY be the marginal distribution of QX,Y . Then,
(QX,Y ) = x A(n)
(QX ) and y A(n) (QY ), (4.51)
(x, y) A(n)


i.e., if a pair of sequences is jointly typical, then each sequence is automatically
typical with respect to its marginal distribution.
(n)

Note that the inverse is not true, i.e., from x A (QX ) and y
(n)
(n)
A (QY ) we cannot conclude that (x, y) A (QX,Y ).
Proof: Recall from our discussion in Section 2.4 that types are probability
distributions and behave this way. Hence, if (x, y) have type Px,y , then x has
type Px where
X
Px (a) =
Px,y (a, b).
(4.52)
bY

Example 4.7. Let n = 10 and


x = 1 0 1 1 0 1 0 1 1 1,

(4.53)

y = 0 1 0 1 0 0 0 1 0 1.

(4.54)

Then
00

Px,y

01

10

11

z}|{ z}|{ z}|{ z}|{


2
1
4
3
,
,
,
=
10 10 10 10

(4.55)

and
= 1+2 = 4+3

 z}|{ z}|{ 
3
7
Px =
,
.
10
10

(4.56)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


70

Strong Typicality
Hence, if by definition we have
Px,y (a, b) QX,Y (a, b) <


,
|X | |Y

(4.57)

then
X
X
(Px,y (a, b) QX,Y (a, b)) <
bY

bY


,
|X | |Y

(4.58)

i.e.,
Px (a) QX (a) <


|X |

(4.59)

and one condition for strong typicality of x is satisfied. All other conditions
can be checked similarly.
(n)
The properties of A (Q) given in TA generalize directly to the jointly
strongly typical set.
Corollary 4.8 (Generalized Theorem A (TA)).
X,Y P(X Y) and let (x, y) A(n)
Let QX,Y , Q
(QX,Y ) be a strongly

typical pair of sequences. Define m () as given in (4.2) and t (n, , X Y)
according to (4.3) as


2
|X ||Y|
t (n, , X Y) , (n + 1)
exp n
log e .
(4.60)
2|X |2 |Y|2
1. The joint probability of the jointly typical sequences (x, y) is bounded
as follows:

a) en(H(X,Y )+D(QX,Y k QX,Y )+m (QX,Y ))


nX,Y (x, y) < en(H(X,Y )+D(QX,Y k Q X,Y )m (Q X,Y )) , (4.61)
<Q
b) en(H(X,Y )+m (QX,Y ))
< QnX,Y (x, y) < en(H(X,Y )m (QX,Y )) .

(4.62)

2. The size of the jointly strongly typical set is bounded as follows:



1 t (n, , X Y) en(H(X,Y )m (QX,Y ))


< A(n)
(QX,Y ) < en(H(X,Y )+m (QX,Y )) .
(4.63)

Y)
be IID Q
X,Y and (X, Y) be IID QX,Y , and define
3. Let (X,
X,Y ).
1 , m (QX,Y ) + m (Q

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(4.64)

4.3. Conditionally Strongly Typical Sets

71

X,Y ) of the jointly typical set


The total probability (based on Q
(n)
A (QX,Y ),



(n)
Y)
A(n) (QX,Y ) = Q
n
Pr (X,
(QX,Y ) ,
(4.65)

X,Y A
is bounded as
a)

1 t (n, , X Y) en(D(QX,Y k QX,Y )+1 )



nX,Y A(n) (QX,Y ) < en(D(QX,Y k Q X,Y )1 ) .
<Q

(4.66)

The total probability (based on QX,Y ) of the jointly typical set


(n)
A (QX,Y ),



Pr (X, Y) A(n)
(QX,Y ) = QnX,Y A(n)
(QX,Y ) ,
(4.67)


is bounded as
b)


1 t (n, , X Y) QnX,Y A(n) (QX,Y ) 1.

(4.68)

Note that we have used the notation H(X, Y ) instead of the more precise
H(QX,Y ) simply out of habit.

4.3

Conditionally Strongly Typical Sets

We can also extend our definition of strongly typical sets to conditional distributions.
Definition 4.9. For some fixed a X with QX (a) > 0 we define the strongly
typical set conditional on the letter a and with respect to QY |X as



(n)
A
QY |X (|a) , y Y n : |Py (b) QY |X (b|a)| <
, b Y, and
|Y|

Py (b) = 0, b Y with QY |X (b|a) = 0 .
(4.69)
This definition contains nothing new because it simply conditions everything on the event that X = a. A much more interesting generalization is
when we condition on a given sequence!
Definition 4.10. For some joint distribution QX,Y with marginal QX and
(n)
for some fixed strongly typical sequence x A (QX ), we define the conditionally strongly typical set with respect to QX,Y as
n
o
A(n)
(QX,Y |x) , y Y n : (x, y) A(n) (QX,Y ) ,
(4.70)

i.e., conditionally on x, y is conditionally strongly typical if the pair (x, y) is
(n)
(n)
jointly typical. Note that for x
/ A (QX ), we have A (QX,Y |x) = .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


72

Strong Typicality

Theorem 4.11 (Theorem B (TB)).


(n)
Let QX,Y P(X Y) be a joint distribution, and let x A (QX ) be
a typical sequence with respect to the marginal QX .
(n)

1. For every y A
as follows:

(QX,Y |x), the conditional probability is bounded

en(H(Y |X)+m (QX,Y )) < QnY |X (y|x) < en(H(Y |X)m (QX,Y )) .

(4.71)

2. The size of the conditionally strongly typical set is bounded as follows:



1 t (n, , X Y) en(H(Y |X)m (QX,Y ))


(QX,Y |x) < en(H(Y |X)+m (QX,Y )) .
(4.72)
< A(n)

(n)

3. Let x A (QX ) be an arbitrary typical sequence and let Y be


IID with Yk QY |X (|xk ). The total probability of the conditionally
(n)
strongly typical set A (QX,Y |x), conditionally on X = x,




Pr Y A(n)
(QX,Y |x) X = x = QnY |X A(n) (QX,Y |x) x ,

(4.73)

is bounded as

1 t (n, , X Y) QnY |X A(n) (QX,Y |x) x 1.

(4.74)

Note that Remark 4.4 also applies here.


(n)
(n)
Proof: By assumption we have x A (QX ) and y A (QX,Y |x),
and therefore by definition
Px,y (a, b) > QX,Y (a, b)


.
|X | |Y|

(4.75)

Hence,
QnY |X (y|x)
n
Y
=
QY |X (yk |xk )

(4.76)

k=1

Y
(a,b)supp(QX,Y )

Y
(a,b)supp(QX,Y )

QY |X (b|a)

N(a,b|x,y)

(4.77)

QY |X (b|a)

nPx,y (a,b)

(4.78)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


4.3. Conditionally Strongly Typical Sets

73

= expn

(a,b)supp(QX,Y )


QX,Y (a, b)

< expn

Px,y (a, b) log QY |X (b|a)

(a,b)supp(QX,Y )


|X | |Y|

(4.79)

log QY |X (b|a)

(4.80)

!

QX,Y (a, b)

= exp n H(Y |X) n
(4.81)
log
|X | |Y|
QX (a)
(a,b)supp(QX,Y )
| {z }
X

QX,Y (a,b)
(QX,Y )min


exp n H(Y |X) n log(QX,Y )min ,

(4.82)

where (4.80) follows from (4.75). The lower bound is analogous. This proves
TB-1.
Next we use the lower bound in TB-1 to prove the upper bound on the
size of the conditionally strongly typical set (TB-2):
X
1=
QnY |X (y|x)
(4.83)
yY n

>

X
(n)
yA (QX,Y

|x)

X
(n)
yA (QX,Y

QnY |X (y|x)

(drop terms)

(4.84)

en(H(Y |X)+m (QX,Y ))

(by TB-1)

(4.85)

|x)



(QX,Y |x) en(H(Y |X)+m (QX,Y )) .
= A(n)


(4.86)

The derivation of the corresponding lower bound in TB-2 is more com(n)


plicated. We first note that since x A (QX ), we have for any y
(n)
A (QX,Y |x) the following:

> Px,y (a, b) QX,Y (a, b)
|X | |Y|
= Px (a)Py|x (b|a) QX (a)QY |X (b|a)



> Px (a)Py|x (b|a) Px (a) +
QY |X (b|a)
|X |


= Px (a) Py|x (b|a) QY |X (b|a)
Q
(b|a)
|X | | Y |X
{z }

Px (a) Py|x (b|a) QY |X (b|a)
|X |


(4.87)
(4.88)
(4.89)
(4.90)

(4.91)

for all (a, b) X Y. Here the first inequality follows because x and y are
jointly typical, and the second inequality follows because x is typical.
Hence, for all a such that Px (a) > 0, we have



1
1
Py|x (b|a) QY |X (b|a) <
1+
.
(4.92)
|X |
|Y| Px (a)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


74

Strong Typicality

Similarly, one can show that




1
1

1+
Py|x (b|a) QY |X (b|a) >
.
|X |
|Y| Px (a)
(n)

Hence, any y A
bY

(4.93)

(QX,Y |x) satisfies for all a with Px (a) > 0 and for all





1
Py|x (b|a) QY |X (b|a) <  1 + 1
.
|X |
|Y| Px (a)

(4.94)

From (4.70) and the second condition in (4.50) it also follows that any
(n)
y A (QX,Y |x) satisfies for all a with Px (a) > 0 and for all b Y with
QX|Y (b|a) = 0
Py|x (b|a) = 0.

(4.95)

We now define
n
o
Fx , PY |X Pn (Y|X ) : y
/ A(n) (QX,Y |x) with Py|x = PY |X

(4.96)

to be the set of all conditional types of all conditionally nontypical sequences.


Then for any PY |X Fx there must exist some a X with Px (a) > 0 and
some b Y such that




1
PY |X (b|a) QY |X (b|a)  1 + 1
(4.97)
|X |
|Y| Px (a)
or

PY |X (b|a) > 0 but QY |X (b|a) = 0,

(4.98)

because otherwise (4.94) and (4.95) are satisfied and the corresponding sequence y with this conditional type Py|x = PY |X would be typical. Hence,
considering this pair (a, b), we have


DPx PY |X QY |X
X


=
Px (
a) D PY |X (|
a) QY |X (|
a)
(4.99)
a
X s.t.
Px (
a)>0

X
a
X s.t.
Px (
a)>0

X
a
X s.t.
Px (
a)>0

Px (
a)


1 2
V PY |X (|
a), QY |X (|
a) log e
2


1 X
Px (
a)
PY |X (b|
a) QY |X (b|
a)
2

(4.100)

!2
log e

(4.101)

bY

2
1
Px (a) PY |X (b|a) QY |X (b|a) log e
2


1 2
1
1 2
1+
2
log e
Px (a)
2
2 |X |
|Y|
Px (a)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(4.102)
(4.103)

4.3. Conditionally Strongly Typical Sets



1 2
1
2
1
+

log e
2|X |2
|Y|
Px (a)
| {z }
| {z }

75

2

1
|Y|

(4.104)

log e.

2|X |2 |Y|2

(4.105)

Here, (4.99) follows by definition (see (2.108)); the subsequent inequality


(4.100) follows from the Pinsker Inequality (Theorem 3.13) applied to the
distributions PY |X (|
a) and QY |X (|
a); in (4.102) we drop all terms in both
sums apart from that particular pair (a, b) that satisfies (4.97); and in the
subsequent lower bound (4.103) we apply (4.97). Note that if (4.98) holds
instead of (4.97), then DPx (PY |X kQY |X ) = and (4.105) is true trivially.
We are now ready to apply these results to our derivation of the lower
bound in TB-2:
1=

X
yY n

QnY |X (y|x)
X

(n)
yA (QX,Y

(n)
yA (QX,Y

(n)
yA (QX,Y

|x)

|x)

<

(n)

yA

|x)

(4.106)

QnY |X (y|x) +
QnY |X (y|x)

X
(n)
yA
/  (QX,Y

QnY |X

QnY |X (y|x) +

(n)
yA (QX,Y

(4.108)


QnY |X T n (PY |X |x) x

PY |X Fx

(4.109)

en(H(Y |X)m (QX,Y )) +

en DPx (PY |X k QY |X ) (4.110)

PY |X Fx

e
|x)

(4.107)


T (Fx |x) x

(QX,Y |x)

|x)

QnY |X (y|x)

n(H(Y |X)m (QX,Y ))

2
2|X |2 |Y|2

log e

(4.111)

PY |X Fx

2

n(H(Y |X) (Q ))
n
log e
m
X,Y

2|X |2 |Y|2

e
+
|F
|

e
(Q
|x)
= A(n)
(4.112)
x
X,Y

2



n
log e
(QX,Y |x) en(H(Y |X)m (QX,Y )) + |Pn (Y|X )| e 2|X |2 |Y|2
A(n)


(4.113)



A(n)
(QX,Y |x) en(H(Y |X)m (QX,Y )) + (n + 1)|X ||Y| e



= A(n)
(QX,Y |x) en(H(Y |X)m (QX,Y )) + t (n, , X Y).


2
n
2|X |2 |Y|2

log e

(4.114)
(4.115)

Here, the first inequality (4.110) follows from TB-1 and from CTT4; in the
subsequent inequality (4.111) we make use of (4.105); in (4.113) we upperbound the size of Fx by the number of conditional types; in (4.114) we then
apply CTT1; and the final step (4.115) follows from definition of t given in
(4.3).

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


76

Strong Typicality

It only remains TB-3. The upper bound is trivial. We again use the
definition (4.96) and write:


(4.116)
QnY |X A(n)
(QX,Y |x) x = 1 QnY |X T n (Fx |x) x

X


=1
QnY |X T n (PY |X |x) x
(4.117)
PY |X Fx

1
1

en DPx (PY |X k QY |X )

PY |X Fx

PY |X Fx

= 1 |Fx | e

2
2|X |2 |Y|2

2
2|X |2 |Y|2

1 |Pn (Y|X )| e

1 (n + 1)|X ||Y| e

log e

(4.119)

log e

2
2|X |2 |Y|2

(4.120)
log e

2
n
2|X |2 |Y|2

= 1 t (n, , X Y).

(4.118)

log e

(4.121)
(4.122)
(4.123)

Here, in (4.118) we have applied CTT4; (4.119) follows from (4.105); and
(4.122) follows from CTT1.
Note that (4.117)(4.123) basically is a proof of a conditional version of
Sanovs Theorem.

4.4

Accidental Typicality

To us, the most important circumstances with regard to typical sequences are
situations when two or more sequences are generated not according to the
joint PMF that is used to define the typical set, but rather independently (or
partially independently) based on marginal distributions of the joint PMF.
We know from our discussion of types and the large deviation theory that
in such a case these sequences are very likely to be typical, but not jointly
typical! Concretely, we will next compute bounds on the probability that an
independent pair of sequences accidentally looks like it had been generated
jointly using the joint PMF.
Theorem 4.12 (Theorem C (TC)).
Let QX,Y P(X Y) be a joint PMF with marginals QX and QY . Let
the pair of sequences (X, Y) be generated IID not according to QX,Y , but
independently according to the marginals:
{(Xk , Yk )}nk=1 IID QX QY .

(4.124)

1. The probability that (X, Y) accidentally happens to look like being

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


4.4. Accidental Typicality

77

jointly typical is bounded as follows:



1 t (n, , X Y) en(I(X;Y )+2 )


< Pr (X, Y) A(n) (QX,Y ) < en(I(X;Y )2 ) ,

(4.125)

where
2 , m (QX,Y ) + m (QX ) + m (QY ) 3m (QX,Y ).

(4.126)

(n)

2. For any x A (QX ), the probability that Y happens to be conditionally strongly typical given x,



Pr Y A(n)
(QX,Y |x) = QnY A(n) (QX,Y |x) , (4.127)

is bounded as

1 t (n, , X Y) en(I(X;Y )+3 )

< QnY A(n)
(QX,Y |x) < en(I(X;Y )3 ) ,


(4.128)

where
3 , m (QX,Y ) + m (QY ) 2m (QX,Y ).

(4.129)

Proof: To prove TC-1, note that the distribution of (X, Y) is a product


distribution:


Pr (X, Y) A(n) (QX,Y )
X
=
QnX (x) QnY (y)
(4.130)
(n)

(x,y)A

>

(QX,Y )

X
(n)
(x,y)A (QX,Y

en(H(X)+m (QX )) en(H(Y )+m (QY ))

(4.131)



= A(n)
(QX,Y ) en(H(X)+H(Y )+m (QX )+m (QY ))


> 1 t (n, , X Y) en(H(X,Y )m (QX,Y ))
en(H(X)+H(Y )+m (QX )+m (QY ))

= 1 t (n, , X Y) en(I(X;Y )+2 ) ,

(4.132)
(4.133)
(4.134)

where the first inequality (4.131) follows from TA-1b (note that because
(n)
(n)
(n)
(x, y) A (QX,Y ) we also have x A (QX ) and y A (QY )) and
the second inequality (4.133) follows from TA-2.
Note that the steps (4.130)(4.134) are exemplary for many of the proofs
that we will encounter later in this course.
Moreover, we see that because
X
QY (b) =
QX,Y (a, b) QX,Y (a, b) (QX,Y )min
(4.135)
aX

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


78

Strong Typicality

for all b supp(Y), we have


(QY )min (QX,Y )min

(4.136)

m (QY ) m (QX,Y ).

(4.137)

and therefore

The same is also true for m (QX ). Hence, the bound in (4.134) can be further
bounded by replacing 2 by 3m (QX,Y ).
The upper bound is analogous.
To prove TC-2, we only need to slightly adapt the proof of TC-1:

QnY A(n)
(QX,Y |x)

X
QnY (y)
(4.138)
=
(n)

yA

(QX,Y |x)

>

(n)

yA

en(H(Y )+m (QY ))

(4.139)

(QX,Y |x)



= A(n)
(QX,Y |x) en(H(Y )+m (QY ))


> 1 t (n, , X Y) en(H(Y |X)m (QX,Y )) en(H(Y )+m (QY ))

= 1 t (n, , X Y) en(I(X;Y )+3 ) ,

(4.140)
(4.141)
(4.142)

where the first inequality (4.139) follows from TA-1b (note that because y
(n)
(n)
A (QX,Y |x) we also have y A (QY )) and the second inequality (4.141)
follows from TB-2.
The upper bound is analogous.
We see from TC that if the sequence Y is generated completely independently of a sequence X (i.e., their joint distribution is QX QY instead of
QX,Y ), then the probability that they accidentally look jointly typical (i.e.,
the accidentally look like they have been generated jointly according to QX,Y )
is tending to zero exponentially fast in n, with the decay rate being I(X; Y ).
Hence, if X and Y are very dependent and have therefore high mutual information, the chance that independently generated versions of it happen to
look jointly typical is decreasing very fast to zero, while for small I(X; Y ) this
decay is slower. Obviously, if I(X; Y ) = 0, i.e., X
Y in the first place, then
the argument breaks down and TC only states trivial uninteresting relations.
This type of observations will be fundamental for our proofs in the following chapters.
The argumentation shown in TC can be taken a step further to a situation
of three RVs that are in a Markov relation.
Theorem 4.13 (Theorem D (TD)).
Let QU,V,W be a general joint PMF with marginals QU , QV |U and QW |U .
Let the triple of sequences (U, V, W) be generated IID not according to
QU,V,W , but according to marginal distributions forming a Markov chain

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


4.4. Accidental Typicality

79

V (
U (
W:
{(Uk , Vk , Wk )}nk=1 IID QU QV |U QW |U .

(4.143)

1. The probability that (U, V, W) accidentally happens to look like


jointly typical is bounded as follows:

1 t (n, , U V W) en(I(V ;W |U )+4 )


< Pr (U, V, W) A(n) (QU,V,W ) < en(I(V ;W |U )4 ) (4.144)
where
4 , m (QU ) + m (QU,V ) + m (QU,W ) + m (QU,V,W )
4m (QU,V,W ).

(4.145)
(4.146)

(n)

2. For any u A (QU ), the probability that (V, W) accidentally


happens to look like conditionally jointly typical given u is bounded
as follows:

1 t (n, , U V W) en(I(V ;W |U )+5 )


< Pr (V, W) A(n) (QU,V,W |u) < en(I(V ;W |U )5 ) (4.147)
where
5 , m (QU,V ) + m (QU,W ) + m (QU,V,W )

(4.148)

3m (QU,V,W ).

(4.149)

A typical case for such a Markov constellation is shown in Figure 4.1:


A common input U is transmitted over two different channels or processing
boxes who have output V and W , respectively.
V
U
W

Figure 4.1: Markov setup of three RVs that satisfies the assumptions of TD.
Proof: We start with the lower bound of TD-1:


Pr (U, V, W) A(n)
(QU,V,W )

X
=
QnU (u) QnV |U (v|u) QnW |U (w|u)
(n)

(u,v,w)A

(4.150)

(QU,V,W )

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


80

Strong Typicality
en(H(U )+m (QU )) en(H(V |U )+m (QU,V ))

>

(n)

(u,v,w)A

(QU,V,W )

en(H(W |U )+m (QU,W ))


(4.151)
n(H(U )+H(V |U )+H(W |U )+ (Q )+ (Q )+ (Q ))
(n)
m
m
m
U
U,V
U,W
= A (QU,V,W ) e

(4.152)


> 1 t (n, , U V W) en(H(U,V,W )m (QU,V,W ))
en(H(U,V )+H(W |U )+m (QU )+m (QU,V )+m (QU,W ))

= 1 t (n, , U V W)
en(H(W |U,V )H(W |U )m (QU,V,W )m (QU )m (QU,V )m (QU,W ))

= 1 t (n, , U V W) en(I(V ;W |U )+4 ) .

(4.153)
(4.154)
(4.155)

Here, in (4.151) we use once TA-1b and twice TB-1. Note that because
(n)
(n)
(u, v, w) A (QU,V,W ) we know from Lemma 4.6 that u A (QU ),
(n)
(n)
that (u, v) A (QU,V ), and that (u, w) A (QU,W ). In (4.153) we use
TA-2. Furthermore, we see that similarly to the derivation shown in (4.135)
(4.137), we can bound
4 4m (QU,V,W ).

(4.156)

The upper bound is analogous.


To prove TD-2, we only need to slightly adapt the proof of TD-1. Again, we
only show the derivation of the lower bound. The upper bound is analogous.
(n)
Let u A (QU ) be given. Then, we have the following:


Pr (V, W) A(n)
(QU,V,W |u)

X
=
QnV |U (v|u) QnW |U (w|u)
(n)
(v,w)A (QU,V,W |u)

>

(n)

(v,w)A

(QU,V,W |u)

(4.157)

en(H(V |U )+m (QU,V )) en(H(W |U )+m (QU,W )) (4.158)



= A(n)
(QU,V,W |u) en(H(V |U )+H(W |U )+m (QU,V )+m (QU,W ))


> 1 t (n, , U V W) en(H(V,W |U )m (QU,V,W ))
en(H(V |U )+H(W |U )+m (QU,V )+m (QU,W ))

= 1 t (n, , U V W)
en(H(W |U,V )H(W |U )m (QU,V,W )m (QU,V )m (QU,W ))

= 1 t (n, , U V W) en(I(V ;W |U )+5 ) .
Here, in (4.158) we use twice TB-1; and in (4.160) we use TB-2.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(4.159)
(4.160)
(4.161)
(4.162)

4.A. Appendix: Alternative Definition

4.A

81

Appendix: Alternative Definition of


Conditionally Strongly Typical Sets

In Section 4.3 we have introduced the conditionally strongly typical sets conditional on a given sequence. The definition given there actually differs from
the style of definitions given before for the typical sets (Definition 4.1), jointly
typical sets (Definition 4.5), and the conditionally typical sets conditioned on
an event (Definition 4.9): Instead of specifying boundaries on the conditional
type it simply refers to the definition of jointly typical sets. The reason for
this is that the derivations turn out to be easier with this definition.
It is, however, possible to define the conditionally strongly typical sets conditional on a given sequence using the normal approach of specifying boundaries on the conditional type.
Definition 4.14. Fix an  > 0 and a conditional distribution QY |X P(Y|X ).
(n)
The conditionally strongly typical set A (QY |X |x) conditional on a fixed
sequence x X n and with respect to the conditional distribution QY |X is
defined as




(Q
|x)
,
y Y n : Px,y (a, b) QY |X (b|a)Px (a) <
A(n)
,
Y |X

|Y|
(a, b) X Y, and
Px,y (a, b) = 0, (a, b) X Y

with Px (a) > 0 and QY |X (b|a) = 0 . (4.163)
Note that we again have the second condition to make sure that for zero
probability in QY |X we do not have any occurrences in the sequences. We do
not need to worry about Px (a) = 0 because then Px,y (a, b) = 0 for sure, but
if Px (a) > 0 and QY |X (b|a) = 0 we do require Px,y (a, b) = 0, too.
We would like to point out that Definition 4.14 actually is more general
than Definition 4.10:
While our original Definition 4.10 only works for a given sequence that
(n)
is typical, x A (QX ), the alternative Definition 4.14 works for any
sequence x X n .
Our original Definition 4.10 requires the specification of a joint distribution QX,Y , while the alternative Definition 4.14 only needs a conditional
distribution QY |X .
However, apart from that, it turns out that both versions of conditionally
strongly typical sets are equivalent. This is shown in the following proposition.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


82

Strong Typicality
(n)

Proposition 4.15. Given a typical sequence x A (QX ) and a joint


distribution QX,Y P(X Y), we have the following statements:
(n)

if y A(n)
(QX,Y |x) = y A0 (QY |X |x),


(4.164)

where
0 , 

|Y| + 1
;
|X |

(4.165)

and
(n)

if y A(n)
(QY |X |x) = y A00 (QX,Y |x),


(4.166)


00 ,  |X | + |Y| .

(4.167)

where

(n)

Proof: We start with (4.164): If y A


with QX,Y (a, b) > 0 we have
Px,y (a, b) < QX,Y (a, b) +

(QX,Y |x), then for any (a, b)


|X | |Y|

(by Def. 4.5)


= QX (a)QY |X (b|a) +
|X | |Y|




< Px (a) +
QY |X (b|a) +
|X |
|X | |Y|


+
Px (a)QY |X (b|a) +
|X | |X | |Y|
0
= Px (a)QY |X (b|a) +
|Y|

(4.168)
(4.169)

(n)

(x A

(QX )) (4.170)

(QY |X 1)

(4.171)

(by (4.165)).

(4.172)

0
.
|Y|

(4.173)

Similarly we can show that


Px,y (a, b) > Px (a)QY |X (b|a)

So we only need to check what happens if QX,Y (a, b) = 0. By Definition 4.5, we


know that in this case Px,y (a, b) = 0. Hence, all requirements of Definition 4.14
are satisfied and we have proven one direction.
(n)
Next, we prove (4.166): If y A (QY |X |x), then for any (a, b) with
QY |X (b|a) > 0 we have

Px,y (a, b) < Px (a)QY |X (b|a) +
|Y|




< QX (a) +
QY |X (b|a) +
|X |
|Y|


QX (a)QY |X (b|a) +
+
|X | |Y|
00
= QX,Y (a, b) +
|X | |Y|

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(by Def. 4.14)


(n)

(x A

(QX ))

(4.174)
(4.175)

(QY |X 1)

(4.176)

(by (4.167)).

(4.177)

4.A. Appendix: Alternative Definition

83

Again, the lower bound is analogous.


So, it only remains to check what happens if QY |X (b|a) = 0. By Definition 4.14, we know that in this case Px,y (a, b) = 0. Now, QX,Y (a, b) = 0 for
two possible reasons: either QX (a) = 0 or QY |X (b|a) = 0. In the latter case
we have just shown that Px,y (a, b) = 0. In the former, we know from our as(n)
sumption of x A (QX ) that Px (a) = 0 and therefore also Px,y (a, b) = 0.
Hence, all requirements of Definition 4.10 are satisfied and we have proven the
other direction.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Chapter 5

Rate Distortion Theory


The topic that we will discuss in this chapter is related to the following problem: If we only have a finite number of bits, how do we describe a real number?
And since obviously such a (finite-length) description cannot be perfect, an
immediate follow-up question is: How well can we perform?
For the latter question to make sense, we need to be able to judge the
quality of performance, i.e., we require a measure of goodness or, more
mathematically, a distortion measure, which describes the distance between
the real number and its finite-length representation.
This all leads to rate distortion theory, which tries to find answers to the
following basic problem, the rate distortion problem:
What is the minimum expected distortion achievable at a particular
description rate, or equivalently, what is the minimum description rate
required to achieve a particular distortion?

It is a very interesting fact that joint descriptions are more efficient than
individual descriptions. This is even true for independent random variables:
Even if X1
X2 , the description of (X1 , X2 ) is shorter than the description
for X1 and the description for X2 together!
In slightly fancy wording, it is simpler to describe an elephant and a chicken
together with one description rather than to describe each separately.
So, why have independent problems not independent solutions? The answer lies in the geometry: Rectangular grid points resulting from independent
descriptions are not space efficient! If we want to get something more packed,
we need to make the description dependent. See Figure 5.1 for a qualitative
explanation.
Rate distortion theory goes back to Shannons seminal paper [Sha48].
Shannon dealt with it more in detail in [Sha59], and at the same time also
the Russian group around Kolmogorov worked intensively on this problem
[Kol56]. Berger published a comprehensive book about the topic in 1971
[Ber71]. For the characterization of the rate distortion function (Sections 5.5

85

c Stefan M. Moser, vers. 2.5


86

Rate Distortion Theory

independent description

dependent description

49 points

45 points

Figure 5.1: Quantization of a square: Every point in the square will be represented by the closest grid point. In the left version we use a
rectangular grid generated by an independent description of the
two dimensions of the square. In the right version, we use a shifted
grid where the two dimensions are dependent. As an example a
shaded point with its nearest grid point is shown where we have
chosen this example point such that it demonstrates the maximum
distance between any point of the square and its closest grid point.
Note that even though we use more grid points in the left version,
the maximum distance is larger than in the right version.

and 5.6) Gallagers book is highly recommended [Gal68, Chapter 9].


Before we give a formal problem description, we make some more motivating observations.

5.1

Motivation: Quantization of a Continuous RV

Consider a continuous RV X. For every value X = x we would like to try to


find a representation x
(x) for x where x
can take on only 2R different values,
for a given rate R > 0 (measured in bits).
For example let X N 0, 2 and assume that our measure of goodness
is the averaged squared error distortion:


E (X x
(X))2 .
(5.1)
If we are given 1 bit to represent X (i.e., x
can only take on two different
values), it is clear that the bit should distinguish whether X > 0 or not. To
minimize the distortion, each reproduction symbol should be at the conditional
mean of its region, i.e., if X > 0 we should choose x
such that



g(
x) , E (X x
)2 X > 0
(5.2)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


5.1. Motivation: Quantization of a Continuous RV

87

is minimized. Taking the derivative and setting it to zero yields:


!

g 0 (
x) = E[2(X x
)(1)|X > 0] = 2 E[X |X > 0] + 2
x = 0,

(5.3)

x
= E[X | X > 0].

(5.4)

i.e.,

Note that the conditional PDF is


X|X>0 (x) =

x2

2 2

e 22 ,

x > 0,

(5.5)

so that
x
= E[X |X > 0] =

Z
0

2
2 2

x2
2 2

r

2
2 2
2
x 2
dx =
e 2
=
.

2 2
x=0
(5.6)

0.4

0.35

0.3

0.25

0.2

0.15

0.1

0.05

0
4

Figure 5.2: Reconstruction areas and points of X N (0, 1) according to (5.7).


A similar discussion can be made for X 0. We get the following optimal
1-bit description of X:
q
2

q
x
(x) =
2

if x > 0,
if x 0.

(5.7)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


88

Rate Distortion Theory

See Figure 5.2 for these reconstruction points in the situation when 2 = 1.
This description will cause a distortion of


E (X x
(X))2


 1 

1 
= E (X x
(X))2 X > 0 + E (X x
(X))2 X 0
(5.8)
2
2


2
(5.9)
= E (X x
(X)) X > 0


2
= 1
2.
(5.10)

But what shall we do if we have two bits available to represent X? Obviously, we want to divide the real line into 4 regions and use a point within each
region to represent all values of X in this region. However, it is not obvious
how to choose the regions and their representation points.
Luckily, we do have some knowledge: An optimal choice of region and
representation points should have the following two properties:
Given a set of reconstruction points, the regions should be chosen such
that the distortion is minimized. This is the case if every value x is
mapped to its closest representation point (in the sense of the given
distortion measure), i.e., the regions should be the nearest neighbor
regions around the reconstruction points. Such a partition is called
Voronoi or Dirichlet partition.
Given a set of regions, the reconstruction points should be chosen such
that the distortion is minimized. This is the case if for each region the
corresponding reconstruction point minimizes the conditional expected
distortion over this region.
These two properties can now be used as a basis for an iterative algorithm
that should find (if not the optimal, then at least some) good quantization
system:
Lloyds Algorithm for the Design of a Quantization System:
[Llo82]
Step 1: Start with a set of (manually chosen) reconstruction points.
Step 2: Find the Voronoi regions for the given reconstruction points.
Step 3: Find the optimal reconstruction points for the derived regions.
Step 4: Return to Step 2 until the algorithm has reached some local
minimum.

Example 5.1. As an example, let us again consider X N (0, 1) with the


squared error distortion measure d(x, x
) = (x x
)2 . Then given some reconstruction points, the optimal regions contain all points with closest Euclidean

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


5.1. Motivation: Quantization of a Continuous RV

89

distance; and given some region, the optimal reconstruction point in this region is E[X |X is in region].
Let us start with the two reconstruction points x
1 = 0 and x
2 = 1.
Given x
1 = 0 and x
2 = 1, the optimal regions are divided by the threshx
1 +
x2
1
old = 2 = 2 .
Given = 21 , the optimal reconstruction points are
 Z

2

1
1
x2


=
x
1 = E X X >
x
e
dx
1
2
Q 21 2
2

81

 1.14,
Q 12 2

 Z 1
2

2
1
1
x2


x
2 = E X X
=
x
e
dx
2
1 Q 21
2

(5.11)

e 8
 0.509.
=
2
1 Q 21

(5.12)

Given x
1 = 1.14 and x
2 = 0.509, the optimal regions are divided by
x2
0.316.
the threshold = x1 +
2
Given = 0.316, the optimal reconstruction points are
0.3162

e 2

x
1 = E[X |X > 0.316] =
Q(0.316) 2

1.009,

(5.13)

0.3162

e 2
0.608.
x
2 = E[X |X 0.316] =
(1 Q(0.316)) 2

(5.14)

The new threshold is = 0.201.


We continue in the same way:

x
1 0.930

x
1 0.881

x
1 0.850

x
2 0.675

= = 0.128.

(5.15)

= = 0.081.

(5.16)

= = 0.052.

(5.17)

x
2 0.719

x
2 0.750

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


90

Rate Distortion Theory

x
1 0.831

x
2 0.765

= = 0.033.

(5.18)

We see how we slowly approach the optimal solution given in (5.7) and Figure 5.2.

Example 5.2. Let us also consider the case of X N (0, 1) with the squared
error distortion measure and with R = 2 bits. For symmetry reason it is clear
that we must have a threshold at X = 0 with two region and symmetric
reconstruction points on either side of it. So we only concentrate on the
positive side and try to find x
1 and x
2 with a threshold in between. The
other two reconstruction points will then be x
3 =
x1 and x
4 =
x2 with
corresponding threshold .
The recursion formulas for Lloyds algorithm are then as follows:
=

x
1 + x
2
,
2

(5.19)

2
Z
1
1 x2
1 e 2
,
x
1 = 1
x e 2 dx =
2
2 21 Q()
2 Q() 0
2
Z
1
1 x2
e 2
x
2 =
.
x e 2 dx
=
Q()
2
2 Q()

(5.20)
(5.21)

2 = 1, we then get the recursive values given in


Starting with x
1 = 12 and x
Table 5.3. The corresponding plot is shown in Figure 5.4.

In the following we will restrict ourselves to discrete RVs so that we may


apply strong typicality.

5.2

Definitions and Assumptions

We are now ready to set up our system more formally.


Source: We assume that the source produces a sequence X = (X1 , . . . , Xn )
X n with a given finite alphabet X , where X1 , . . . , Xn are IID Q
P(X ).


Encoder: The encoder is a mapping n : X n 1, 2, . . . , enR that describes
every source sequence by an index w. Note that in total we have enR
indices available, i.e., we have a rate
R,

log(# of indices)
nats/symbol.
n

(5.22)

The rate describes how many nats we need on average to describe one
source letter.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


5.2. Definitions and Assumptions

91

Table 5.3: Recursion given by Lloyds algorithm for the case of a Gaussian RV
represented by four values.
x
1

x
2

0.5

0.75

0.3578

1.3288

0.8433

0.3973

1.4011

0.8992

0.4202

1.4450

0.9326

0.4336

1.4714

0.9525

0.4414

1.4872

0.9643

0.4461

1.4966

0.9714

0.4488

1.5022
..
.

0.9755

0.4528

1.5104

0.9816

0.4

0.35

0.3

0.25

0.2

0.15

0.1

0.05

0
4

Figure 5.4: Reconstruction areas and points of X N (0, 1) according to Example 5.2.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


92

Rate Distortion Theory



Decoder: The decoder is a mapping n : 1, 2, . . . , enR X n , i.e., the decoder receives an index from the encoder and represents the correspond X n .
ing X by an estimate X
Distortion measure: We first define a per-letter distortion measure:
d : X X R+
0

(5.23)

that is a measure of the cost of representing a symbol x by the symbol


we define
x
. To evaluate the distortion between two sequences x and x
the (sequence) distortion measure
d : X n X n R+
0

(5.24)

as an average per-letter distortion:


n

) ,
d(x, x

1X
d(xk , x
k ).
n

(5.25)

k=1

Note that this choice is bad for many practical applications like image
quality or sound quality. A good practical sequence distortion measure
is very likely far from being an average per-letter distortion! However,
we make this assumption anyway here in order to simplify our system
and to make its analysis tractable.
We also assume that the per-letter distortion d(, ) is bounded, i.e.,
dmax ,

max d(x, x
) < .

xX ,
xX

(5.26)

This assumption again is made for simplicity. However, in contrast to


(5.25) it is not critical, i.e., when taking proper care in the derivations
one can avoid it. Note that we only rely on this assumption in the proof
of the achievability part.
Finally, we also assume that
min d(x, x
) = 0,
x
X

x X.

(5.27)

This means that we require that for every possible value x X there
must exist (at least) one best representation with zero distortion. As we
will see in the exercises, this assumption causes no loss in generality.
Example 5.3. Two of the most important distortion measures are as follows:
The Hamming distortion is defined as
(
0 if x = x
,
d(x, x
) ,
1 if x 6= x
.

(5.28)

The Hamming distortion measure results in an error probability distortion:




= 0 Pr[X = X]
+ 1 Pr[X 6= X]

E d(X, X)
(5.29)

= Pr[X 6= X].
(5.30)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


5.2. Definitions and Assumptions

93

The squared error distortion is defined as


d(x, x
) , (x x
)2 .

(5.31)

This distortion measure is in particular popular for continuous (especially Gaussian) RVs. Unfortunately, it is a bad choice for a quality
criterion for images or speech.

Now we give the following definition.



Definition 5.4. An enR , n rate distortion coding scheme consists of
a source alphabet X and a reproduction alphabet X ,
an encoding function n ,
a decoding function n , and
a distortion measure d(, ),
n

1X
) =
d(xk , x
k ).
d(x, x
n

(5.32)

k=1

The set of n-tuples



 


enR
(5.33)
...,X
n (1), . . . , n enR = X(1),

1 nR are the associated assignment
is called the codebook, and 1
n (1), . . . , n e
regions.
often also is called vector quantization, reproducNote that the codeword X
tion, reconstruction, representation, source codeword, source code, or estimate
of X.
As usual for a coding scheme, we define its rate. Moreover, since our
coding scheme is a rate distortion coding scheme, we also define the achieved
distortion.
Definition 5.5. The rate R of a rate distortion coding scheme is defined1
R,

log(# of indices)
log enR
=
.
n
n

(5.34)

Its unit is nats (or bits) per source letter.


Moreover, every rate distortion coding scheme achieves a certain average
distortion:


D , E d X, n (n (X))
(5.35)
where the expectation is over the source product distribution Qn .
1

Note that unless we choose several identical codewords (which is a rather inefficient
thing to do!), the number of indices is equal to the number of codewords.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


94

Rate Distortion Theory

Definition 5.6. A rate distortion pair (R, D) is said to be achievable for a


source Q and a distortion measure d(, ) if there exists a sequence of enR , n
rate distortion coding schemes (n , n ) with


lim E d X, n (n (X)) D.
(5.36)
n

The rate distortion region for a source Q and a distortion measure d(, ) is
the closure of the set of all achievable rate distortion pairs (R, D).
Note that by definition we specify the rate distortion region to contain
also its boundaries. This is similar to the definition of capacity that is the
supremum of all achievable transmission rates, without worrying whether a
rate R = C is actually achievable or not.2
We next define two functions that describe the boundary of the rate distortion region.
Definition 5.7. The rate distortion function
 R(D) is the smallest rate (actually, infimum of rate) such that R(D), D is in the rate distortion region of
the source for a given average distortion D.
The distortion rate function D(R) is the smallest average
distortion (actu
ally, infimum of average distortion) such that R, D(R) is in the rate distortion
region of the source for a given rate R.
Note that the rate distortion function R(D) is comparable to the capacity
cost function C(Es ), which describes the maximum achievable transmission
rate for a certain given cost (like power).

5.3

The Information Rate Distortion Function

Similarly to the situation of achievable transmission rates and capacity, where


we first defined an information capacity and later proved that this in fact
is identical to the capacity, we will next introduce an information rate distortion function. We will then prove in the rate distortion coding theorem
in Section 5.4 that this quantity actually is identical to the rate distortion
function defined in Definition 5.7.
Definition 5.8. The information rate distortion function RI () for a source
Q with per-letter distortion measure d(, ) is defined as
RI (D) ,

q(
x|x) :

x,
x

inf

Q(x)q(
x|x)d(x,
x)D

I(X; X)

(5.37)

or, written more compactly,


RI (D) ,

inf

q(
x|x) : EQq [d(X,X)]D

I(X; X).

(5.38)

In the case of the capacity, it depends on the channel whether the capacity itself is
achievable or not.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


5.3. The Information Rate Distortion Function

95

are two RVs and not a source sequence


Note that in this definition X and X
and its corresponding representation! The RV X represents a source symbol
a reproduction symbol. They are jointly distributed according to the
and X
joint PMF
QX,X (x, x
) = QX (x) QX|X
x|x) = Q(x) q(
x|x),
(

(5.39)

where Q is the source distribution and q is a conditional distribution whose


choice is based on the optimization.
This is opposite to channel capacity: There the conditional distribution
represents the channel and is given, and we optimize mutual information over
the choice of an input distribution. Here the input distribution is given by the
source and we optimize over the choice of a channel.
Example 5.9. Consider a Bernoulli source Q(1) = p and Q(0) = 1 p and
the Hamming distortion given in Example 5.3, which in this binary case can
be written as
d(x, x
) = x x
.

(5.40)

So the question is which choice of q minimizes mutual information for the


given Bernoulli source and under the constraint that
!


= E[X X]
= Pr[X 6= X]
D.
E d(X, X)

(5.41)

We only need to analyze the case where p 12 , because for p > 12 we can
simply redefine the source letter 0 as 1 and vice versa.
We assume first that D p ( 12 ) and derive the following lower bound

on I(X; X):
= H(X) H(X|X)

I(X; X)

(5.42)

X)

= Hb (p) H(X X|

Hb (p) H(X X)

(5.43)

Hb (p) Hb (D).

(5.46)

= Hb (p) Hb Pr[X =
6 X]

(5.44)
(5.45)

Here, in (5.44) we use that conditioning reduces entropy; and (5.46) follows
because of (5.41) and because Hb () is increasing for arguments less than 12 .
We next show that this general lower bound actually can be achieved by
an appropriate choice of q. To do so, we consider the inverse test-channel
and output X. We choose it to be a binary symmetric channel
with input X
(BSC) with error probability D, see Figure 5.5. Now we need to choose the
such that the output X has the correct Bernoulli distribution: Let
input X
= 1], and compute
r , Pr[X
Pr[X = 1] = p = D(1 r) + (1 D)r.

(5.47)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


96

Rate Distortion Theory


1D
0

1
1D

Figure 5.5: Test channel chosen to be a binary symmetric channel (BSC).

Hence,
r=

pD
,
1 2D

(5.48)

which is nonnegative because we assume D p. For this choice we now have


=D
Pr[X 6= X]

(5.49)

= H(X) H(X|X)
= Hb (p) Hb (D).
I(X; X)

(5.50)

and

Hence, we have shown that we can find a q(


x|x) that achieves the lower bound
D and therefore this lower bound must be the
in (5.46) with Pr[X 6= X]
minimum.
= 0 with probability 1. Then I(X; X)
= 0 (since
For D > p, let X

X
X). This must be the minimum because the mutual information can
never become negative. This choice causes a distortion


= Pr[X 6= X]
= Pr[X 6= 0] = p < D
E d(X, X)
(5.51)
by assumption.
So we have derived the information rate distortion function:
(
Hb (p) Hb (D) 0 D min{p, 1 p},
RI (D) =
0
D > min{p, 1 p}.

(5.52)

Remark 5.10. Note that from the Extreme Value Theorem (Theorem 1.23)
we realize that the infimum in the definition of the rate distortion function
(Definition 5.7) can actually be replaced by a minimum. The reason is that
the mutual information is continuous, for any finite alphabets X and X the
set P(X |X ) is bounded, and as long as the constraints are closed, i.e., we have


D
EQq d(X, X)
(5.53)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


5.3. The Information Rate Distortion Function

97

instead of


< D,
EQq d(X, X)

(5.54)

the optimization set is compact.


Hence, from now on we will replace inf by min and make use of our knowledge that there exists a q that achieves the minimum:3
RI (D) ,

min

q(
x|x) : EQq [d(X,X)]D

I(X; X)

(5.55)

and
q ,

I(X; X).

argmin

(5.56)

q(
x|x) : EQq [d(X,X)]D

The function RI () is well-behaved and has many nice properties. At the


moment we will state three such properties, but we will learn more later on.
Lemma 5.11 (Three Properties of the Information Rate Distortion
Function). For some source Q and some distortion measure d(, ), consider
RI () defined in (5.55). We have the following:
1. RI (D) is nonincreasing in D.
2. RI (D) is convex in D.
3. RI (D) is upper-bounded by the source entropy:
RI (D) H(Q),

D 0.

(5.57)

Proof: To see why Part 1 holds, note that by definition RI (D) is a minimum
over a candidate set that is enlarged if D is increased. If the candidate set is
enlarged, the minimum can only remain unchanged or decrease. Hence, RI (D)
is nonincreasing.
To prove Part 2, take two rate distortion pairs (R1 , D1 ) and (R2 , D2 ) that
lie on the curve described by RI (D), and let q1 (
x|x) and q2 (
x|x) be the PMFs
that achieve these two points, respectively. Now define
q (
x|x) , q1 (
x|x) + (1 )q2 (
x|x).

(5.58)




D , EQq d(X, X)




+ (1 ) EQq d(X, X)

= EQq1 d(X, X)
2

(5.59)

Then

= D1 + (1 )D2 .

(5.60)
(5.61)

For the moment we ignore the question of uniqueness.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


98

Rate Distortion Theory

Now recall that the mutual information over a channel is convex in the channel
law:
= I(Q, q)
I(X; X)

(5.62)

is convex in q (and concave in Q), i.e.,



I(Q, q ) = I Q, q1 + (1 )q2 I(Q, q1 ) + (1 ) I(Q, q2 ). (5.63)
Hence, if we drop the minimization in the definition of the information rate
distortion function, we now get:

RI D1 + (1 )D2
= RI (D )

(by (5.61))

(5.64)

I(Q, q )

(drop minimization)

(5.65)

(by (5.63))

(5.66)

(by definition)

(5.67)

I(Q, q1 ) + (1 ) I(Q, q2 )
RI (D1 ) + (1 )RI (D2 )

which proves the claim.


The proof of Part 3 follows from basic properties of entropy:
RI (D) =

min

q(
x|x) : EQq [d(X,X)]D

(5.68)

H(X) H(X|X)
| {z }

EQq [d(X,X)]D
min

q(
x|x) :

I(X; X)

min

q(
x|x) : EQq [d(X,X)]D

H(X)

= H(X) = H(Q).

5.4

(5.69)
(5.70)
(5.71)

Rate Distortion Coding Theorem

We are now ready to state the main result of this chapter.


Theorem 5.12 (Rate Distortion Coding Theorem).
Consider a discrete memoryless source (DMS) Q and a per-letter distortion measure d(, ). Then the rate distortion function equals the information rate distortion function, i.e.,
R(D) = RI (D) =

min

q(
x|x) : EQq [d(X,X)]D

I(X; X)

(5.72)

is the minimum achievable rate at distortion D. In other words, any rate


distortion pair (R, D) with
R > RI (D)

(5.73)

is achievable, and any achievable rate distortion coding scheme with rate

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


5.4. Rate Distortion Coding Theorem

99

R and distortion D must satisfy


R RI (D).

5.4.1

(5.74)

Converse

We start with a proof of (5.74).


Proof of Converse: Given is Q and d(, ). So consider an arbitrary coding
scheme with rate R that achieves a distortion of at most D for a certain n:


1n D.
E d X1n , X
(5.75)
Of course, by specifying the coding scheme, we have specified n , n , and
= n (n (x)). Hence, we implicitly also have specified a
enR codewords x
conditional distribution q(
x|x).
is random; and as X
can take on at most enR
Since X is random, also X
different values, we have
log enR = nR.
H(X)

(5.76)

So, we get the following sequence of inequalities:

nR H(X)
H(X|X)

= H(X)

(5.77)
is a function of X)
(X

(5.78)

= I(X; X)

(5.79)

= H(X) H(X|X)
n
X


H(Xk ) H(Xk |X1k1 , X)


=

(5.80)

k=1
n
X
k=1
n
X
k=1
n
X


k )
H(Xk ) H(Xk |X

(chain rule, X is IID)

(conditioning reduces entropy) (5.82)

k )
I(Xk ; X
 

k )
RI E d(Xk ; X

(5.81)

(5.83)
(Def. 5.8)

(5.84)

k=1

n

1 X I 
k )
R E d(Xk ; X
n
k=1
!
n
X


I 1
k )
E d(Xk ; X
nR
n
" k=1n
#!
X
1
k )
= nRI E
d(Xk ; X
n
  k=1

I
1n
= nR E d X1n ; X

=n

nRI (D)

(5.85)
(Jensen Inequality)

(5.86)
(5.87)

(by (5.25))

(5.88)

(by (5.75)).

(5.89)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


100

Rate Distortion Theory

Here, (5.84) follows because the information rate distortion function by definition is a minimization of mutual information for a given maximum expected distortion. We hence pick this maximum expected distortion to be
k )], i.e., the value that is implicitly specified by q(
E[d(Xk ; X
x|x) of our coding
scheme. In (5.86) we rely on the convexity of RI () as shown in Lemma 5.11-2;
and in the last inequality (5.89) we use that, according to Lemma 5.11-1, RI ()
is nonincreasing. This proves the converse.
One immediate consequence of the converse is a corollary that states that
no DMS can be compressed losslessly below its entropy. We have proven this
already in [Mos14, Theorem 4.14] under the assumption that we use a proper
message set. Here, we now generalize the statement for any coding scheme.
Corollary 5.13. There exists no coding scheme that can compress a DMS
losslessly below its entropy.
Proof: Choose X = X and consider the Hamming distortion, such that


= Pr[X
6= X].
E d(X, X)
(5.90)
6= X] = 0, we have
If we require D = 0, i.e., Pr[X
R RI (0)

(5.91)

min

I(X; X)

= H(X)

=X]=0
q : Pr[X6

=X]=0
q : Pr[X6

max

{z

=0

(5.92)

H(X|X)

(5.93)

= H(X) = H(Q).

(5.94)

Note the beauty of the converse (5.89): There are no epsilons involved nor
any limits! So, what we actually have proven is that for any coding scheme
that satisfies (5.75) and for any finite n, it must hold that
R RI (D).

(5.95)

However, in our definition of achievable rate distortion pairs we only required


the existence of a sequence of coding schemes such that


1n D.
lim E d X1n , X
(5.96)
n

So, could it be that there exists a sequence of coding schemes with




n = Dn ,
E d X1n , X
1

(5.97)

where Dn is decreasing with


lim Dn = D,

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(5.98)

5.4. Rate Distortion Coding Theorem

101

but with a rate smaller than RI (D)? The answer to this question is No. The
reason is as follows. Since Dn is decreasing, for every  > 0 there exists a n0
such that for all n n0 we have


n D + .
E d X1n , X
1

(5.99)

Hence, these coding schemes cannot have a rate smaller than RI (D + ) and,
since  is arbitrary, we can approach RI (D) arbitrarily closely, as long as RI ()
is continuous. So the only question that remains is whether RI () is continuous.
This is really the case and will be proven later in Section 5.6. For a graphical
explanation of this discussion, see Figure 5.6.
rate
information rate distortion
function (with discontinuity)

RI (D)
RI (D + )
D D+

distortion

Figure 5.6: Our proof of the converse would fail if the information rate distortion function were not continuous.

5.4.2

Achievability

To prove (5.73), we need to provide a particular coding scheme (n , n ) that


achieves (RI (D), D) for every D 0. Since this is too difficult, we rely again on
the random coding argument introduced by Shannon: We generate a coding
scheme at random and analyze its performance.
1: Setup: We choose a channel distribution q(
x|x) and compute QX ()
as the marginal distribution of
QX,X (x, x
) = Q(x)q(
x|x).

(5.100)

Then we fix some rate R and some blocklength n.

2: Codebook Design: We generate enR length-n codewords X(w),


w =
nR
nR

1, . . . , e , by choosing each of the n e symbols Xk (w) independently


at random according to QX .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


102

Rate Distortion Theory

3: Encoder Design: For a given source sequence x, the encoder tries to

find a codeword X(w)


such that



x, X(w)
A(n) QX,X .
(5.101)
If it finds several possible choices of w, it picks one. If it finds none, it
chooses w = 1.
The encoder then puts out w.

4: Decoder Design: For a given index w, the decoder puts out X(w).
5: Performance Analysis: We partition the sample space into three disjoint cases:
1. The source sequence is not typical:
X
/ A(n) (Q)

(5.102)

(in which case we for sure cannot find a w such that (5.101) is
satisfied!).

2. The source sequence is typical, but there exists no codeword X(w)


that is jointly typical with the source sequence:



(Q), @ w : X, X(w)
A(n)
QX,X . (5.103)
X A(n)



3. The source sequence is typical and there exists a codeword X(w)


that is jointly typical with the source sequence:



X A(n)
(Q), w : X, X(w)
A(n)
QX,X . (5.104)


We now apply the Total Expectation Theorem to compute the expected
achieved distortion of our random system:





= E d X n, X
n Case 1 Pr(Case 1)
E d(X, X)
1
1
|
{z
}
dmax




n Case 2 Pr(Case 2)
+ E d X1n , X
1
|
{z
}


+E d

dmax

n n
X1 , X1 Case


3 Pr(Case 3)
| {z }

(5.105)

dmax Pr(Case 1) + dmax Pr(Case 2)





n Case 3 ,
+ E d X n, X
1

(5.106)

where we have upper-bounded the maximum average distortion of the


first two cases by the maximum distortion value dmax , and the probability
of the third case by 1.
By TA-3 we can bound the probability of Case 1 as follows:

Pr(Case 1) = 1 Qn A(n) (Q) t (n, , X ).

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(5.107)

5.4. Rate Distortion Coding Theorem

103

To bound the probability of Case 2, note that since each codeword X(w)
has been generated IID without considering any other codeword, the
probability that there exists no codeword that is jointly typical with the
source sequence is simply the product (over all w) of the probabilities that
the wth codeword is not jointly typical with the source sequence. Moreover, each of the probabilities in this product is the same, independent
of w. Hence, we get
Pr(Case 2)



 

= Pr {X A(n) (Q)} @ w : X, X(w)


A(n)
QX,X
(5.108)



= Pr X A(n)
(Q)

h
i



Pr @ w : X, X(w)
A(n) QX,X X A(n) (Q)
(5.109)


= Pr X A(n)
(Q)
|
{z
}
1

enR






Pr X, X(w)

/ A(n) QX,X X A(n)


(Q)


Y
w=1

(5.110)

nR

e
Y
w=1






Pr X, X(w)

/ A(n) QX,X X A(n) (Q)

(5.111)

h
i

(n)

Pr X(w)
/ A
QX,X X X A(n) (Q)

(5.112)

nR

e
Y
w=1
nR

e 
Y
w=1

i
h


1 Pr X(w)
A(n) QX,X X X A(n) (Q)

h
ienR

(n)

= 1 Pr X A
QX,X X X A(n) (Q)

h
i

A(n) Q X X A(n) (Q)
exp enR Pr X


X,X



< exp enR en(I(X;X)+)





= exp en(RI(X;X))


(5.113)
(5.114)
(5.115)
(5.116)
(5.117)

where

 1

= m QX,X + m QX log 1 t (n, , X ) .
n

(5.118)

Here, in (5.112) we use the definition of conditionally typical sets (Definition 4.10); (5.114) follows from the independence of the probability
expression on w; the inequality (5.115) is due to the Exponentiated IT
Inequality (Corollary 1.10); and (5.116) follows from TC.
So we see that as long as
+
R > I(X; X)

(5.119)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


104

Rate Distortion Theory


the probability Pr(Case 2) tends to zero double-exponentially fast in n.
To bound the expected distortion in Case
 3, we consider a pair of jointly
(n)
) A
typical sequences (x, x
QX,X :
n

1X
) =
d(x, x
d(xk , x
k )
n
k=1
1X
) d(a, b)
=
N(a, b|x, x
n

(5.120)
(5.121)

aX
bX

Px,x (a, b) d(a, b)

(5.122)

aX
bX

QX,X
(a, b) +

aX
bX


|X | |X |

QX,X
(a, b) d(a, b) +

aX
bX



+ dmax .
= E d(X, X)

X
aX
bX

d(a, b)

|X | |X |

(5.123)

dmax

(5.124)
(5.125)

Here in (5.122) we used the definition of joint types, and in the subsequent

(n)
) A
inequality (5.123) we relied on the assumption that (x, x
QX,X .
Hence,





)
< dmax t (n, , X ) + dmax exp en(RI(X;X)
E d(X, X)


+ dmax
+ E d(X, X)
(5.126)
!


+ 0 D.
(5.127)
= E d(X, X)
So we see that as long as


D
E d(X, X)

(5.128)

R > I(X; X)

(5.129)

and

the random coding scheme works. We can now optimize our choice of
q(
x|x) in Step 1 such that it minimizes the mutual information under the
constraint (5.128). Then, for a given D 0 our random coding scheme
works as long as
R>

min

q(
x|x) : E[d(X,X)]D

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


I(X; X).

(5.130)

5.4. Rate Distortion Coding Theorem

5.4.3

105

Discussion

We would like to discuss this result and highlight a few interesting points.
Firstly, note that we have not made any explicit choice for the distortion
measure d(, ). We only assumed that it is an average per-letter distortion
and that it is bounded (where the latter assumption could be relaxed if one is
careful with limits: Note that the probability of all cases where the distortion
becomes large tends to zero exponentially fast!). The explicit choice for d only
is needed once we want to evaluate the minimization in (5.130).
Secondly, we would like to point out that Theorem 5.12 does not specify
what happens on the border when
R=

min

q(
x|x) : E[d(X,X)]D

I(X; X).

(5.131)

The converse does not exclude this case, but the achievability part does not
include it. This is identical to the situation of channel capacity, where one
cannot in general state whether a rate equal to capacity can be achieved or
not. By definition, these boundary cases are included in the rate distortion
region (see Definition 5.6).
Finally, note that we have actually proven a quite strong statement: We
have shown that the probability that our randomly designed system will not
work is very small and tends to zero exponentially fast! Had we relied in our
proof on weak typicality (as defined in [Mos14, Chapter 19]), then our proof
would have become much less direct and less strong: With weak typicality
one can only show that for any  > 0 one can find an n and 1 such that the
expected distortion averaged over all codes of length n is less than D + 1 , and
that therefore there must exist at least one code with an average distortion
less than D + 1 .
As a matter of fact, if we go back to our achievability proof and think
about it, then we realize that we have shown that for our coding scheme the
probability of all sequences that are not well represented (i.e., yield a distortion
larger than D) is tending to zero exponentially fast in n! Concretely, we can
compute


Pr {X : there is no good X}



(5.132)
t (n, , X ) + exp en(RI(X;X))
=e

 2

|X |

n 2|X
n log(n+1)
|

+ e e

n(RI(X;X)
)

(5.133)


(Here the first exponent with factor 2|X
| is dominating.) Note that in the
proof that relies on weak typicality nothing is said about the probability of
bad representation. For example, it could be that 10% of all source sequences
result in a very bad representation with a distortion of 2D, however, the
remaining 90% of sequences
are so well represented that, on average, the


requirement E d(X, X)
D is satisfied. We have proven that with our
scheme such a situation cannot occur! We will come back to this exponential
growth in Chapter 6.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


106

Rate Distortion Theory

Remark 5.14. Since the information rate distortion function is identical to


the rate distortion function, we will not anymore distinguish between them
and simply write the rate distortion function R(D). The reader is asked to
forgive our sloppy notation with a double use of the letter R: By itself R
denotes the rate of a rate distortion coding scheme, but in combination with
round brackets R() it denotes the rate distortion function.
So far we have described the boundary of the rate distortion region by
the rate distortion function, i.e., the minimum rate for a given distortion.
Naturally, we can turn things around and describe the rate distortion region
by the distortion rate function instead, i.e., the minimum distortion for a given
rate. This is completely analogous, i.e., we can rephrase Theorem 5.12 and
say that



D(R) =
min
E d(X, X)
(5.134)

q(
x|x) : I(X;X)R

is the minimum achievable distortion at a given rate R, i.e., any rate distortion
pair (R, D) with
D > D(R)

(5.135)

is achievable, and any achievable rate distortion coding scheme with rate R
and distortion D must satisfy
D D(R).

5.5

(5.136)

Characterization of R(D)

Since R() is a minimization, we can use the KarushKuhnTucker (KKT)


conditions to try to characterize R(), similarly to our KKT characterization
of channel capacity (see [Mos14, Section 10.5]).
In the following we will assume that Q(x) > 0 for all x X (otherwise we
simply redefine X to exclude those x with Q(x) = 0).
In the optimization of (5.72) we have the following constraints:
X
Q(x)q(
x|x)d(x, x
) D,
(5.137)
x,
x

X
x

q(
x|x) = 1,

q(
x|x) 0,

x,

(5.138)

x, x
.

(5.139)

So we define the Lagrangian


(q, , ) ,

X
x,
x

q(
x|x)
0 )q(
Q(x
x|x0 )
x0
{z
}

Q(x)q(
x|x) log P

= I(X;X)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


5.5. Characterization of R(D)

107

!
X
X
X
+
Q(x)q(
x|x)d(x, x
) D +
(x)
q(
x|x) 1
x

x,
x

(5.140)
=

X
x,
x

(x)
q(
x|x)
+ d(x, x
) +
0
0
x|x )
Q(x)
x0 Q(x )q(


Q(x)q(
x|x) log P

D
|

(x) .

(5.141)

{z

constant

Here the last two terms are constant and not really interesting. So we drop
them and additionally replace () by () in such a way that
(x)
(x)
= log
.
Q(x)
Q(x)

(5.142)

We get
L(q, , ) ,

X
x,
x

q(
x|x)
(x)
+ d(x, x
) log
0
0
x|x )
Q(x)
x0 Q(x )q(


Q(x)q(
x|x) log P

(5.143)
or, renaming the dummy summation variables,


X
(a)
q(b|a)
.
+ d(a, b) log
L(q, , ) ,
Q(a)q(b|a) log P
0
0
Q(a)
a0 Q(a )q(b|a )
a,b

(5.144)
Now fix an x and an x
and take the derivative with respect to q(
x|x). Be
very careful because q(
x|x) shows up in three places! Note that if the main
sum is at b = x
, but a 6= x, then q(
x|x) still shows up in the sum inside the
logarithm!
L(q, , )
q(
x|x)

= Q(x) log P


q(
x|x)
(x)
+ d(x, x
) log
0
Q(x)
x|a0 )
a0 Q(a )q(
!
P
P
0
0
0
x|a )
x|a0 ) q(
x|x)Q(x)
a0 Q(a )q(
a0 Q(a )q(
+ Q(x)q(
x|x)

P
q(
x|x)
( a0 Q(a0 )q(
x|a0 ))2
!
P
0
X
x|a0 )
q(
x|a)
a0 Q(a )q(
+
Q(a)q(
x|a)
P
Q(x) (5.145)
0
0 ))2
q(
x|a)
(
x
|a
0 Q(a )q(
a
a6=x


q(
x|x)
(x)
= Q(x) log P
+ d(x, x
) log
0
x|a0 )
Q(x)
0 Q(a )q(
P a
0 )q(
0 ) q(
X
Q(a
x
|a
x
|x)Q(x)
q(
x|a)Q(x)
0
P
+ Q(x) a

Q(a) P
0 )q(
0)
0 )q(
Q(a
x
|a
Q(a
x|a0 )
0
0
a
a
a6=x

(5.146)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


108

Rate Distortion Theory



q(
x|x)
(x)
= Q(x) log
+ d(x, x
) log
p(
x)
Q(x)
X
q(
x|a)Q(x)
+ Q(x)
Q(a)
p(
x)
a
{z
}
|
=

Q(x)
p(
x)

(5.147)

Q(a)q(
x|a) = Q(x)



(x)
q(
x|x)
,
+ d(x, x
) log
= Q(x) log
p(
x)
Q(x)

(5.148)

where we have introduced


p(
x) ,

X
x

Q(x)q(
x|x).

Since Q(x) > 0 we hence now get from the KKT conditions that

!
x|x) > 0,
q(
x|x)
(x) = 0 if q(
log
+ d(x, x
) log
!
p(
x)
Q(x) 0 if q(
x|x) = 0.

(5.149)

(5.150)

Let us now first consider the case when q(


x|x) > 0. According to (5.149)
we then also have p(
x) > 0 and it follows from (5.150) that
log

(x)
q(
x|x)
= log
d(x, x
)
p(
x)
Q(x)

(5.151)

or
q(
x|x) =

p(
x)(x) d(x,x)
e
.
Q(x)

(5.152)

We plug this into (5.149) and get


p(
x) =

X
x

Q(x)q(
x|x) =

p(
x)(x) d(x,x)
e
Q(x)

(5.153)

(x) ed(x,x)

(5.154)

Q(x)

= p(
x)

X
x

which means that for all x


with p(
x) > 0,
X
(x) ed(x,x) = 1.

(5.155)

Moreover, from (5.152) it also follows that


1=

X
x

X p(
x)(x)

ed(x,x)
Q(x)
x

(x) X
=
p(
x) ed(x,x) ,
Q(x)

q(
x|x) =

(5.156)
(5.157)

i.e.,
Q(x)
.
x) ed(x,x)
x
p(

(x) = P

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(5.158)

5.5. Characterization of R(D)

109

Together with (5.155), this now yields that for p(


x) > 0,
X Q(x) ed(x,x)
P
= 1.
d(x,b)
b p(b) e
x

(5.159)

On the other hand, lets next assume that p(


x) = 0, i.e., q(
x|x) = 0 for
all x because Q(x) > 0 (see (5.149)). So if we set q(
x|x) =  for all x and let
 0, we see that
lim
0

q(
x|x)
q(
x|x)

= lim P
= lim P
= 1.
0 )q(
0)
0 )
0
0
p(
x)
Q(x
x
|x
Q(x
0
0
x
x

(5.160)

Hence, we must have in (5.150) that


log

q(
x|x)
= log 1 = 0.
p(
x)

(5.161)

This is only hand-waving, however, the claim can be shown properly using
some sophisticated -variation argument: Fiddle around a little with one component of q and check its impact on L.
Then we get from (5.150)
d(x, x
) log

(x)
0
Q(x)

(5.162)

or
(x) ed(x,x) Q(x).
From this now follows that for all x
with p(
x) = 0 we have
X
X
(x) ed(x,x)
Q(x) = 1,
x

(5.163)

(5.164)

i.e., combined4 with (5.158),


X Q(x) ed(x,x)
P
1.
d(x,b)
b p(b) e
x

(5.165)

We combine (5.165) and (5.159) to yield the KKT conditions for the rate
distortion function.
Theorem 5.15 (KarushKuhnTucker Conditions for the Rate
Distortion Function).
A PMF p(
x) is the solution to the rate distortion minimization if
(
X Q(x) ed(x,x)
= 1 if p(
x) > 0,
P
(5.166)
d(x,b)
1 if p(
x) = 0,
b p(b) e
x
4

Note that we did use the case q(


x|x) > 0 in order to derive (5.158). However, as (x)
only depends on x and not on x
, this must hold in general.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


110

Rate Distortion Theory

where is the solution to


X

Q(x)q(
x|x)d(x, x
) = D

(5.167)

p(
x) ed(x,x)
q(
x|x) = P
.
d(x,b)
b p(b) e

(5.168)

x,
x

with

is
Here for (5.168) we have plugged (5.158) into (5.152). Note that I(X; X)
given by
!
d(x,
x)
d(x,
x)
X
p(
x
)
e
e
=
I(X; X)
Q(x) P
log P
.
(5.169)
d(x,b)
d(x,b)
p(b)
e
p(b)
e
b
b
x,
x

5.6

Further Properties of R(D)

It actually turns out that the really interesting part of the Lagrangian defined
in (5.143) is the expression without the technical constraint of
X
q(
x|x) = 1.
(5.170)
x

We define


+ E d(X, X)

R0 (q, ) , I(X; X)


X
q(
x|x)
=
Q(x)q(
x|x) log P
+ d(x, x
)
0
x|x0 )
x0 Q(x )q(

(5.171)
(5.172)

x,
x

We will next show that by varying 0 we can find all values of R(D) for
every D 0.
Actually, we claim that has more meaning than simply being the Lagrangian multiplier to find the solution of the minimization in R(D). To see
this, fix > 0 and some q(|). The latter defines a certain mutual informa and a certain expected distortion E[d(X, X)].

tion I(X; X)
We draw this rate
distortion pair as a point in the distortion-rate plane and then add a line of
slope through this point. See Figure 5.7.
Recalling our definition of R0 in (5.171), we realize that the line of slope
crosses the rate-axis at R0 (q, ).
We now repeat the same game, but this time we choose q(|) to be
q , argmin R0 (q, )
q

for some fixed > 0. We now claim the following.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(5.173)

5.6. Further Properties of R(D)

111

rate
 


, I(X; X)

point E d(X, X)

E d(X, X)


line of slope


I(X; X)

distortion



E d(X, X)
Figure 5.7: Distortion-rate plane with a certain rate distortion pair and a line
of slope .

rate
R0 (q , )

R()



= D
Eq d(X, X)

= R(D)
Iq (X; X)

distortion

Figure 5.8: For every > 0, the line with slope through R0 (q , ) is a
tangent to the rate distortion function R().

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


112

Rate Distortion Theory


rate
R()

achievable by q

R0 (q , )

distortion

Figure 5.9: A contradiction: an achievable point below R().


rate

R()

R0 (q , )
R0 (q , )

achievable by q
achievable by q

distortion
Figure 5.10: A contradiction: a value R0 (q0 , ) below the minimum R0 (q , ).
Lemma 5.16. The rate distortion pair induced by q and lies on the rate
distortion curve R(), and the line with slope through this point is a tangent
to R(); see Figure 5.8.
Proof: Assume first that R() does not intersect with the line, i.e., it lies
strictly above the line, see Figure 5.9. Then
we have found a rate distortion

Iq (X; X)
that is achievable (our choice q !),
pair (R, D) = Eq [d(X, X)],
but that lies below the rate distortion function. This is a contradiction to the
definition of R() being the minimum.
So assume that the line cuts R() either in or below our point,5 see Figure 5.10. Then we can find some point on the rate distortion curve that is
induced by some other choice q0 and that is below the line. If we now draw
a second line through this new point with the same slope , then this new
line will intersect the rate-axis in R0 (q0 , ) (this can be argued the same way
as shown in Figure 5.7), which is below R0 (q , ). However, this is a contra5

Note that since R() is convex and nonnegative, any line of negative slope that cuts R()
above our point will cut R() once more below our point!

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


5.6. Further Properties of R(D)

113

diction to the fact that R0 (q , ) is the minimum among all q for the given
!
Hence, we see that R() must touch the line and that therefore the line
must be tangential.
From this lemma and its corresponding Figure 5.8 now immediately follows
that for any > 0 and any D 0,
R0 (q , ) D R(D),

(5.174)

i.e., we have the following corollary.


Corollary 5.17. The rate distortion function can be lower-bounded as follows:
R(D) min R0 (q, ) D,
q

D 0, > 0.

(5.175)

Note that lower-bounding an expression that is a minimum by definition


is usually not so easy. So, (5.175) looks very interesting as it provides a lower
bound to R(). Unfortunately, in (5.175) we still have a minimization. This,
however, can be remedied. For that purpose, we need the following lemma.
Lemma 5.18. Let R0 (q, ) be as defined in (5.171). Then for any > 0,
X
R0 (q, ) H(Q) +
Q(x) log (x)
(5.176)
x

for all q(|) and for any () > 0 that satisfies


X
(x) ed(x,x) 1, x
with p(
x) > 0,

(5.177)

where p() is given via (5.158). Moreover, we have


min R0 (q, ) = H(Q) +
q

max

>0 s.t.
(5.177) is satisfied x

Q(x) log (x).

(5.178)

Also note that the maximization on the RHS is achieved by some () only if
(5.177) is satisfied with equality for all x
with p(
x) > 0.
Proof: We start with the proof of (5.176). We assume that (5.177) holds
and recall the definition (5.143):


X
(x)
q(
x|x)
L(q, , ) =
Q(x)q(
x|x) log
+ d(x, x
) log
(5.179)
p(
x)
Q(x)
x,
x

(x) X
q(
x|x)
Q(x)
x
x

X
X
= R0 (q, )
Q(x) log (x)
Q(x) log

= R0 (q, )

= R0 (q, )

X
x

(5.180)

Q(x) log

Q(x) log (x) H(Q).

1
Q(x)

(5.181)
(5.182)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


114

Rate Distortion Theory

Applying the IT Inequality (Theorem 1.9) we next note that




X
p(
x)(x) d(x,x)
L(q, , ) =
Q(x)q(
x|x) log
e
q(
x|x)Q(x)
x,
x


X
p(
x)(x) d(x,x)

Q(x)q(
x|x)
e
1 log e
q(
x|x)Q(x)
x,
x
!
X
X
d(x,
x)
=
p(
x)(x) e

Q(x)q(
x|x) log e
x,
x

(5.183)
(5.184)
(5.185)

x,
x

!
X

p(
x)

X
x

(x) ed(x,x) 1 log e


{z

1 by (5.177)

(5.186)

p(
x) 1 log e

(5.187)

= (1 1) log e = 0,

(5.188)

which combined with (5.182) proves the first statement (5.176).


Since (5.176) holds for any choice of q and , it also must hold that
X
min R0 (q, ) H(Q) + max
Q(x) log (x),
(5.189)
q

as long as the maximization is over those that satisfy (5.177). To show that
this can be achieved with equality, we investigate the inequalities (5.184) and
(5.187): The former is achieved with equality if, and only if,
p(
x)(x) d(x,x)
e
= 1,
q(
x|x)Q(x)

q(
x|x) > 0.

(5.190)

From this follows (5.152) and therefore, analogously to the derivations shown
in (5.152)(5.158), also (5.158).
The latter inequality (5.187) is achieved with equality if, and only if,
X
(x) ed(x,x) = 1, x
with p(
x) > 0.
(5.191)
x

This condition is identical to (5.155). Hence, the situation is completely analogous to the derivations shown for the KKT conditions.
We know from the KKT conditions that such a choice of q and exists
(details omitted), and therefore equality can be achieved in (5.188), but only
if (5.191) holds.
Now we can combine Corollary 5.17 and Lemma 5.18 to find a lower for
the rate distortion function.
Theorem 5.19 (Lower Bound on R()). For a given DMS Q and per-letter
distortion measure d(, ), we have for any D 0
R(D) H(Q) +

X
x

Q(x) log (x) D

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(5.192)

5.6. Further Properties of R(D)


R(D)

115

here a discontinuity is possible

line below R(D)


D
Figure 5.11: A convex function cannot be discontinuous inside the convexity
interval. A discontinuity can only occur on the boundary.

for an arbitrary choice of > 0 and for any () satisfying


X
x

(x) ed(x,x) 1.

(5.193)

Note that this lower bound even contains free parameters that we can
choose. From the derivations of Corollary 5.17 and Lemma 5.18 we also see
that for every D there exists a particular choice of and that will make the
lower bound tight.
Another important consequence of these derivations is as follows.
Corollary 5.20 (Continuity of R()). The rate distortion function R(D) is
continuous for D 0.
Proof: The continuity of R() for D > 0 follows directly from its convexity. To see this, assume by contradiction that R() is convex, but contains a
discontinuity inside of the convexity interval. Then, we can find around this
discontinuity two points on R() such that their connecting line lies partially
below R() which contradicts the definition of convexity. See Figure 5.11
for a graphical picture of this situation.
On the boundary, on the other hand, a convex function could be discontinuous; again see Figure 5.11. Since R() is convex for D 0, we need to
check whether R(D) is continuous also for D = 0:
?

lim R(D) = R(0).


D0

(5.194)

Since R() is nonincreasing, it immediately follows that


lim R(D) R(0)
D0

(5.195)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


116

Rate Distortion Theory

and we only need to show whether


?

lim R(D) R(0).

(5.196)

D0

To that goal we define a new distortion measure


(
0
if d(x, x
) = 0,
x
d(x,
) ,
otherwise,

(5.197)

whose corresponding rate distortion function is

R(D)
= R(0),

D 0,

(5.198)

) can only be kept finite if d(X, X)


=0
because the average distortion for d(,
0 (q, )
almost surely, which is only possible at a rate R(0). Similarly, for R

belonging to d(, ), we derive


n

o

0 (q, ) = min I(X; X)


+ E d(X,

(5.199)
min R
X)
q
q
{z
}
|
=

min

q : E[d(X,X)]=0

= or 0

I(X; X)

= R(0).

(5.200)
(5.201)

Next note that


X

max
>0

P
x

x) 1
(x) ed(x,

Q(x) log (x)

x
with p(
x)>0 x

max
>0

x : d(x,
x)=0
x
with p(
x)>0

Q(x) log (x)

(5.202)

x) 1 x
(x) ed(x,
with p(
x)>0 x

max

P >0

Q(x) log (x),

(5.203)

(x)1 x

x : d(x,
x)=0
x
with p(
x)>0

where the inequality follows because on the RHS of (5.202) we only restrict
some values of () and are free to choose the others. On the other hand, also
note that
X
lim
max
Q(x) log (x)
P
x

>0
x) 1 x
(x) ed(x,
with p(
x)>0 x

=
lim

max

>0
x) 1 x
(x) ed(x,
with p(
x)>0 x

max

P >0

Q(x) log (x),

(x)1 x

x : d(x,
x)=0
x
with p(
x)>0

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Q(x) log (x)

(5.204)

(5.205)

5.6. Further Properties of R(D)

117

i.e., for any  > 0 we can find a big enough such that
X
max
Q(x) log (x)
P
x

>0
x) 1 x
(x) ed(x,
with p(
x)>0 x

max

P >0

(x)1 x

Q(x) log (x) .

(5.206)

x : d(x,
x)=0
x
with p(
x)>0

With these preparations, we can now derive the following:


n
o
lim R(D) lim min R0 (q, ) D
D0

D0

= min R0 (q, )

(5.208)

= H(Q) +

max
>0

P
x

H(Q) +

(5.207)

x) 1
(x) ed(x,

max

P >0

Q(x) log (x)

(5.209)

x
with p(
x)>0 x

(x)1 x

Q(x) log (x) 

(5.210)

x : d(x,
x)=0
x
with p(
x)>0

= H(Q) +

max
>0

P
x

x)
(x) ed(x,
1

x
with p(
x)>0

X
x

Q(x) log (x)  (5.211)

0 (q, ) 
= min R

(5.212)

= R(0) .

(5.213)

Here, the first inequality (5.207) follows from Corollary 5.17 and holds for
any > 0; in (5.209) we apply Lemma 5.18; the subsequent inequality (5.210)
then follows from (5.206); in (5.211) we reformulate the condition in the maxi ) given in (5.197); the subsequent equality
mization using our definition of d(,
);
(5.212) follows again from Corollary 5.17, this time applied to case of d(,
and in the final step we use (5.201).
Since (5.213) holds for an arbitrary , this proves the claim.
We can show even more.
Corollary 5.21. For any finite distortion measure, the slope of R(D) is continuous for 0 < D < dmax and approaches as D 0.
Proof: We have seen that minq R0 (q, ) is the rate-axis crossing point of a
tangent to R() with slope . If now R() had a point with a discontinuous
slope (or if it approaches a finite limit as D 0), then there exist several
different tangents with different slopes to that point; see Figure 5.12. This, on
the other hand means, that any q(|) achieving this point on the R()-curve
must minimize R0 (q, ) for several different values .
Hence, we complete our proof if we can show that any q that minimizes
= 0.
R0 (q, ) for two different , 1 6= 2 , will cause I(X; X)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


118

Rate Distortion Theory


R(D)
R0 (q, 1 )

slope discontinuities

R0 (q, 2 )

two different tangents


with slopes 1 and 2
D
Figure 5.12: A convex function with a slope discontinuity must have several
tangents on the point of discontinuity. The boundary point at
D = 0 also could have several tangents, unless the slope there
goes to .
Let 1 and 2 be maximizing for 1 and 2 , respectively, in (5.178). Then
from (5.152) we have that
q(x|
x) =

p(
x)1 (x) 1 d(x,x) ! p(
x)2 (x) 2 d(x,x)
e
=
e
.
Q(x)
Q(x)

(5.214)

Hence, for p(
x) > 0 we have that6
1 (x)
= e(1 2 )d(x,x) .
2 (x)

(5.215)

Note that the left-hand side (LHS) of (5.215) does not depend on x
, which
means that the RHS cannot either. Therefore we must have that d(x, x
) is
independent of x
for all x
with p(
x) > 0. This now means that the joint
distribution is a product distribution:
Q(x, x
) = Q(x)q(
x|x)

(5.214)

p(
x) 1 (x) e1 d(x,x) = p(
x) (x),
|
{z
}

(5.216)

independent of x

We have shown that


i.e., X
X.
=0
I(X; X)

(5.217)

as we have set out to prove.


6

Note that this step would not be possible if d(x, x


) = . Therefore the restriction to
finite measures here.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


5.7. Joint Source and Channel Coding Scheme

119

Note an interesting consequence of this corollary: Even though the rate


distortion function for many distortions and sources looks like it sets out from
the rate-axis with a finite negative slope, it actually always has an infinite
slope at the beginning. We try to show this behavior in Figure 5.13.

R(D)

D
Figure 5.13: A typical shape of a rate distortion function.

5.7

Joint Source and Channel Coding Scheme

Similarly to [Mos14, Chapter 14], we can now combine source and channel
coding. Consider a DMS that shall be transmitted over a DMC with an
expected distortion of at most D 0, see Figure 5.14. As before, we use Ts to
dest.

1 , . . . , U
K
U

decoder

Y1 , . . . , Y n

DMC

X1 , . . . , Xn

encoder

U1 , . . . , UK

DMS

Figure 5.14: Joint source and channel coding system: a DMS is transmitted
over a DMC with distortion at most D.
denote the source clocking and Tc to denote the channel clocking. We assume
that the encoder accepts K source symbols as inputs and then generates a
codeword of length n to be transmitted over the DMC. For synchronization
reasons, we need to have
!

KTs = nTc .

(5.218)

An obvious way to design our system is to split the joint source and channel
coding scheme into a rate distortion coding scheme and a channel coding

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


120

Rate Distortion Theory

scheme. The rate distortion coding scheme will work as long as its rate Rrd
(in bits per source symbols) satisfies Rrd > R(D), and the channel coding
scheme will work as long as its rate Rch (in bits per channel use) satisfies
Rch < C. Hence, this approach using a source-channel separation will work as
long as the rates of both systems, measured in bits per second, are in accord:
Rrd
Rch
C
R(D)
<
=
< ,
Ts
Ts
Tc
Tc

(5.219)

R(D)
C
< .
Ts
Tc

(5.220)

i.e., as long as

So the question that now remains is whether this source-channel separation


scheme is actually optimal or if there exists a joint source-channel coding
scheme that can outperform any separate scheme. To answer this question,
we assume we are given a joint scheme that works and return to the converse
of the rate distortion coding theorem in Section 5.4.1, where (5.79)(5.89) can
be rewritten as follows:
K


K = H U K H U K U
1
I U1K ; U
(5.221)
1
1
1
K 
X


K
(5.222)
H(Uk ) H Uk U1k1 , U
=
1
k=1

K
X
k=1
K
X

k )
H(Uk ) H(Uk |U

k )
I(Uk ; U

(5.223)
(5.224)

k=1
K

1X  
k )
R E d(Uk ; U
K
k=1
!
K

1X 
k )
KR
E d(Uk ; U
K

(5.225)
(5.226)

k=1

K R(D),

(5.227)

where (5.222) holds because a DMS is memoryless. Moreover, using the fact
that we have DMC used without feedback, we get

K I(X n ; Y n )
I U1K ; U
(5.228)
1
1
1
= H(Y1n ) H(Y1n |X1n )
n 
X




=
H Yk Y1k1 H Yk Y1k1 , X1n
=

k=1
n 
X
k=1




H Yk Y1k1 H(Yk |Xk )

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(5.229)

(5.230)
(5.231)

5.8. Information Transmission System: Transmitting above Capacity

n
X
k=1
n
X

121


H(Yk ) H(Yk |Xk )

(5.232)

I(Xk ; Yk )

(5.233)

(5.234)

k=1

n
X
k=1

= nC,

(5.235)

where the first inequality (5.228) follows from the data processing inequality
(Proposition 1.12); in (5.231) we use the fact that the channel is a DMC
without feedback; (5.232) follows by conditioning that reduces entropy; and
in (5.234) we apply the definition of channel capacity.
Hence, we see that any working joint source channel coding scheme must
satisfy
KR(D) nC

(5.236)

C
R(D)
.
Ts
Tc

(5.237)

or, using (5.218),

5.8

Information Transmission System:


Transmitting above Capacity

In [Mos14, Section 14.6] we have introduced a particular information transmission scheme that tries to convey a binary DMS over a discrete memoryless
channel (DMC), where unfortunately the entropy of the source is larger than
the available capacity of the channel so that no lossless transmission is possible. We have shown there that in this situation there exists an ultimate lower
bound on the bit error probability Pb :


Ts
1
Pb > Hb 1 C .
(5.238)
Tc
We have then proposed a system that includes between the DMS and the
channel encoder a lossy compression scheme that will reduce the entropy of
the source sequence H({Uk }) to H({Vk }), which is matched to the available
capacity. See Figure 5.15.
The question that we could not answer in [Mos14, Section 14.6] is whether
there exists a system that actually can achieve the lower bound (5.238) (arbitrarily closely). Using our newly acquired knowledge about rate distortion
systems, this question can now be answered.
Lets quickly repeat the setup. We implement a rate distortion system
as the lossy compressor, see Figure 5.16. As before, we use Ts to denote the
source clocking, and Tc to denote the channel clocking, where (5.218) needs to

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


122

Rate Distortion Theory


V1 , . . . , V K

encoder

destination

V1 , . . . , VK

lossy
U1 , . . . , UK
compressor

Y1 , . . . , Yn

decoder

DMC

binary
DMS

X1 , . . . , Xn

Figure 5.15: Lossy compression added to joint source channel coding: We


insert a lossy compressor between source and channel encoder to
control the introduced errors.

DMC

X1 , . . . , Xn

destination

channel
encoder

V1 , . . . , V K

RD
decoder

RD
encoder

U1 , . . . , UK

channel
decoder

binary
DMS

Y1 , . . . , Yn

Figure 5.16: Rate distortion combined with channel transmission: The rate
distortion system compresses the source sequence to make sure
that the entropy of W is below the channel capacity.

hold. Moreover, we assume that the binary DMS is uniform, i.e., H({Uk }) = 1
bit/symbol. The capacity of the DMC is too small, i.e., we have
1
C
bits/s >
bits/s.
Ts
Tc

(5.239)

The rate distortion system has a rate7 R, i.e., we will have eKR different possible
indices W . Hence,
H(W )
log eKR
R

= .
KTs
KTs
Ts

(5.240)

And in order to make sure that the channel transmission of W will be reliable,
we need that
R ! C
.
Ts
Tc
7

(5.241)

Be careful: In [Mos14, Section 14.6], R has a different meaning. There we use it to


denote R = TTcs = K
.
n

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


5.9. Rate Distortion for the Gaussian Source

123

Next we choose as distortion measure the Hamming distortion, which will


lead to a distortion criterion based on the bit error probability:
"
#
K


1X
K
K
E d U1 , V1
=E
d(Uk , Vk )
(5.242)
K
k=1

K
1X
=
E[d(Uk , Vk )]
K

1
K

k=1
K
X
k=1

Pr[Uk 6= Vk ] = Pb .

(5.243)
(5.244)

So if we restrict the maximum allowed average distortion by some value D,


we restrict the bit error probability by this value D. Since we want to know
how a system behaves that works at the lower bound (5.238), we choose


Ts
1
D , Hb 1 C .
(5.245)
Tc
In Example 5.9 we already have derived the rate distortion function for a
binary source and the Hamming distortion. For 0 D 12 , we got
R(D) = Hb (p) Hb (D) = 1 Hb (D) bits

(5.246)

(since in our case we have p = 12 ). Plugging (5.245) into (5.246) now yields a
lower bound on the necessary rate for our rate distortion coding scheme:






Ts
Ts
Ts
1
1
R R Hb 1 C
= 1 Hb Hb 1 C
= C.
(5.247)
Tc
Tc
Tc
This is exactly the maximum that is allowed in (5.241), i.e., the general lower
bound that we had derived in [Mos14, Section 14.6] coincides with the lower
bound in the rate distortion coding theorem (Theorem 5.12) and is therefore
achievable!

5.9

Rate Distortion for the Gaussian Source

We now extend our main result Theorem 5.12 to the important situation of a
Gaussian source.

5.9.1

Rate Distortion Coding Theorem

Theorem 5.22 (Rate Distortion Coding Theorem


for a Gaussian

Source). The rate distortion function for a N 0, 2 source with squared error distortion is
(
2
1
log D for 0 D 2 ,
2
R(D) =
(5.248)
0
for D > 2 .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


124

Rate Distortion Theory


For convenience, we often write

R(D) =

1
2
log
2
D

+
(5.249)

where
()+ , max{, 0}.

(5.250)

Proof: As a matter of fact, we have only proven Theorem 5.12 for finite
alphabets and since our proofs relied on strong typicality, we cannot generalize
it to continuous RVs in a straightforward manner. However, it is not too hard
to show that Theorem 5.12 also holds for Gaussian sources with the squared
error distortion, i.e., it can be shown that
R(D) =

inf

2 ]D
(
x|x) : E[(XX)

I(X; X).

(5.251)

Here, x, x
R, and (|) describes a conditional probability density function
(PDF). It remains
to evaluate the minimization in (5.251).


Since E (X X)2 D, we have


= h(X) h(X|X)

I(X; X)
1
X)

= log 2e 2 h(X X|
2
1

log 2e 2 h(X X)
2


1
2
log 2e 2 h N 0, E (X X)
2


1
1
2
= log 2e 2 log 2e E (X X)
2
2
1
1
2
log 2e log 2eD
2
2
1
2
= log ,
2
D

(5.252)
(5.253)
(5.254)
(5.255)
(5.256)
(5.257)
(5.258)

i.e.,
R(D)

1
2
log .
2
D

(5.259)

To find a test channel (


x|x) that achieves this lower bound, it is easier to look
at the inverse test channel (x|
x). We need to choose
input and channel such

2 D. Since the distortion
that the output is X and such that E (X X)
measure is based on a difference between input and
 output,
 we choose an ad 2 D corresponds
ditive channel with some additive noise. Then E (X X)
to a bound on the noise power.
If D 2 , we choose
+Z
X=X

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(5.260)

5.9. Rate Distortion for the Gaussian Source

125


N 0, 2 D , Z N (0, D), where X

with X
Z. This yields the correct

2
output X N 0, and also satisfies


 
2 = E Z 2 = D.
E (X X)
(5.261)
The mutual information achieved by this scheme is


2 D
1
2
1

= log ,
I(X; X) = log 1 +
2
D
2
D

(5.262)

which corresponds exactly to the lower bound (5.259).


= 0 with probability 1, i.e., I(X;
X) = 0 and
If D > 2 , we choose X




2 = E X 2 = 2 < D.
E (X X)
(5.263)
Since R(D) 0 trivially, this must be exact in this case.

5.9.2

Parallels to Channel Coding

In the situation of Gaussian sources and Gaussian channels, the parallels between rate distortion theory and channel coding theory are extremely pronounced.
To see this, first recall the situation of channel coding for a Gaussian
channel (see [Mos14, Section 16.3]). We have
Yk = Xk + Zk
where {Zk } IID N 0,


2

(5.264)

and we have a power constraint


n

1 X  2
E Xk E.
n

(5.265)

k=1

The received sequence Y lies with very high probability in a sphere of radius
v
v
u n
u n
X
p
u
 2  uX
 
 
2
t
rtot = E[kYk ] =
E Yk = t
E Xk2 + E Zk2
k=1

k=1

p
= n(E + 2 ),

(5.266)

and for every codeword x, the received vector Y lies with high probability in
a sphere around x with radius
v
u n
p
uX  
2
r = E[kZk ] = t
E Zk2 = n 2 .
(5.267)
k=1

Hence, the channel coding problem is to find as many codewords as possible


such that their corresponding spheres r do not overlap inside of the large
sphere rtot :
p
n

n
2)
A
n(E
+

n
n
An rtot
E + 2 2
nR

n
# of codewords M = e
=
=
,
An r n
2
A
n 2
n

(5.268)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


126

Rate Distortion Theory

i.e.,


E
1
R log 1 + 2 .
2

(5.269)

On the other hand, in the


 situation of rate distortion coding for a Gaussian
source {Xk } IID N 0, 2 , we have a distortion constraint
n


1X 
k )2 D.
E (Xk X
n

(5.270)

k=1

Here, every source sequence X lies with high probability in a sphere with
radius
v
u n
p
uX  
2
(5.271)
rtot = E[kXk ] = t
E Xk2 = n 2 ,
k=1

that with
and for every source sequence x there should exist a codeword X
high probability lies in a sphere around x with radius
v
u n
q 
 uX


2

k Xk )2 = nD.
r = E kX Xk = t
(5.272)
E (X
k=1

Hence, the rate distortion coding problem is to find as few codewords as


possible such that their spheres r cover the whole sphere rtot :

n
 2  n2
2
A
n
n
n
An rtot

nR
 n =
# of codewords M = e
=
,
(5.273)
An r n
D
A
nD
n

i.e.,
R

1
2
log .
2
D

(5.274)

Hence, we see that channel coding corresponds to sphere packing, while


rate distortion coding corresponds to sphere covering. Moreover, we also see
that a set of codewords that is good in one case, also is good in the other
(choose them Gaussian with the correct variance)! We have tried to depict
these two situations in Figure 5.17.

5.9.3

Simultaneous Description of m Independent Gaussians

We can generalize the problem to a simultaneous description of m independent


Gaussian RVs

Xi N 0, i2 ,

i = 1, . . . , m,

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(5.275)

5.9. Rate Distortion for the Gaussian Source

127

Figure 5.17: The large sphere depicts an n-dimensional sphere of radius rtot
and the small spheres all have radius r. In the case of channel
coding, the small spheres depict the codewords with some noise
around it. We try to put as many small spheres as possible into
the large sphere, but without having overlap such as to make sure
that the receiver will not confuse two codewords due to the noise.
In the case of rate distortion coding, the small spheres depict the
reconstruction vectors with the range of maximum allowed distortion around it. We try to put as few small spheres as possible
into the large sphere, but making sure that the complete large
sphere is covered so that for any source sequence we find at least
one reconstruction vector within the allowed distance.

where Xi
Xj , i 6= i. Assume we are given a certain amount D of total
allowed distortion (again assuming squared error distortion) and ask what rate
R is required to represent (X1 , . . . , Xm ) within this allowed total distortion.
Again, we actually need to derive a coding theorem. But the proof is very
similar to what we have seen so far and we omit the details. The result is as
follows:
R(D) =

min P

(
x1 ,...,
xm |x1 ,...,xm ) : E [

m
2
i=1 (Xi Xi )

]D

1, . . . , X
m ).
I(X1 , . . . , Xm ; X
(5.276)

It remains to evaluate the minimization in (5.276). We first derive a lower


bound to the mutual information:
m


1m = h X1m h X1m X
1
I X1m ; X
m
m
X
X


m
=
h(Xi )
h Xi X1i1 , X
1
i=1

(5.277)
(5.278)

i=1

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


128

Rate Distortion Theory

m
X
i=1
m
X

h(Xi )

m
X
i=1

i)
h(Xi |X

i)
I(Xi ; X

(5.279)
(5.280)

i=1
m
 
X

i )2

R E (Xi X

(5.281)

(5.282)

i=1
m
X

R(Di )

i=1
m 
X
i=1

2
1
log i
2
Di

+
.

(5.283)

Here, in (5.278) we the chain rule and the fact that the m sources Xi are independent of each other; (5.279) follows because conditioning cannot increase
entropy; in (5.281) we use the definition of the rate distortion function of a
particular source (it is the minimum for a given average distortion!); (5.282)
should be read as a definition


i )2 ,
Di , E (Xi X

(5.284)

where we know from the condition in the minimization in (5.276) that


m
X
i=1

Di D;

(5.285)

and the final equality (5.283) follows from Theorem 5.22.


Note that this lower bound can actually be achieved with equality if
we choose
(
x1 , . . . , x
m |x1 , . . . , xm ) =

m
Y
i=1

(
xi |xi )

(5.286)

such that
m
m X
1 =
i );
h X1m X
h(Xi |X

(5.287)

i=1

and if

i = 0
i N 0, 2 Di if 2 > Di , and X
we choose (
xi |xi ) such that X
i
i
otherwise, which then makes sure that
i) =
I(Xi ; X

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


2
1
log i
2
Di

+
.

(5.288)

5.9. Rate Distortion for the Gaussian Source

129

Therefore, (5.283) is the minimum of (5.276). Note that since (5.283) is


decreasing if we increase Di , we actually have to choose the parameters Di
such that
m
X
Di = D
(5.289)
i=1

with equality instead of inequality.


We have hence shown that
R(D) =
=

Di :

Pmin
m
i=1 Di =D

m 
X
1
i=1

m
X
1

min

Di s.t.
Pm
i=1
i=1 Di =D
Di i2

2
log i
2
Di

log

+

i2
.
Di

(5.290)
(5.291)

It only remains to figure out how to choose Di . Note that this minimization problem actually looks very much like the capacity problem of parallel
Gaussian channels described in [Mos14, Section 18.1]. While there we have
a concave function that is maximized with boundary constraints on the left
0 Ej , here we have a convex function that is minimized with boundary constraints on the right, Di i2 . So, we only need to adapt the KKT conditions
accordingly: we define
!
m
m
X
1
i2 X
L(D) ,
log
+
(5.292)
Di D ,
2
Di
i=1

i=1

and get the following conditions:


1 1
L

=
+
Di
2 Di

= 0 if Di i2 ,
0 if Di > i2 .

(5.293)

Note that we have a on the boundary because the minimum is achieved


at the boundary only if the slope there is negative.
we can rewrite these conditions as follows:
Using , 1/(2),
(
= if Di i2 ,
Di
(5.294)
if Di > i2 .
Hence, we see that all Di should be the same, but they are not allowed to be
larger than i2 .
We summarize this result.
Theorem 5.23 (Rate Distortion Coding Theorem for m Independent
Gaussian Sources). Therate distortion function for m independent
Pm Gaus2
sian sources Xi N 0, i with summed squared error distortion i=1 (xi
x
i )2 is given by
R(D) =

m
X
1
i=1

log

i2
Di

(5.295)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


130

Rate Distortion Theory

where
(

Di =
i2

if i2 ,
if > i2

(5.296)

and where is chosen such that


m
X

Di = D.

(5.297)

i=1

Remark 5.24. This result corresponds to reverse waterfilling! To see this


note that if Di = i2 , then
2
1
1
log i = log 1 = 0.
2
Di
2

(5.298)

So if D  1, then can be chosen very large such that Di = i2 for all


i = 1, . . . , m. This then results in R(D) = 0, i.e., we need 0 nats to describe
all RVs. In other words, we allow such a big distortion that it is sufficient to
the describe all RVs by constants.
If we now slowly reduce D, also gets smaller and at some point a level
will be reached where is smaller than the largest of the variances i2 . Now
we start using some nats to describe this RV such that the achieved distortion
is just OK. All other random variables are still only described by a constant.
If we continue to reduce D, will cross the border of a second RV. Etc.
We always will only describe those RVs with variances i2 > . So we get
a picture as shown in Figure 5.18.
Note that this idea can be generalized to reverse waterfilling on the eigenvalues for sequences of jointly Gaussian sources and on reverse water-filling
on the spectrum of a Gaussian stochastic process with memory in a similar
fashion as done for the waterfilling in [Mos14, Chapter 18].

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


5.9. Rate Distortion for the Gaussian Source

i2

131

42
12
52

72
22
32
D4

D1
D2

D7

D5

D3
62
D6

X1

X2

X3

X4

X5

X6

X7

Figure 5.18: Reverse waterfilling solution of the distortion assignment of m


parallel independent Gaussian sources. In the given situation
only X1 , X4 , and X5 are partially described, the other RVs are
described by a constant.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Chapter 6

Error Exponents in Source


Coding
6.1

Introduction

So far we have seen two versions of the rate distortion coding theorem:
The first version we only mentioned without showing a proper proof. Its
proof is based on weak typicality, and it states that as long as R > R(D),
there exists a sequence of coding schemes such that


D.
lim E d(X, X)

(6.1)

However, it does not say anything about the probability whether a particular source sequences is well represented or not, only the average
distortion is fine.
The version shown in Chapter 5 (see Theorem 5.12 and the discussion
in Section 5.4.3), on the other hand, states that as long as R > R(D),
there exists a sequence of coding schemes such that


> D = 0.
lim Pr d(X, X)

(6.2)

The proof of this theorem is based on strong typicality.


In the current chapter we would like to go a step further and ask how quickly
the probability of a badly represented source sequence tends to zero as n
tends to infinity. Actually, as we know that the probability tends to zero
exponentially fast, we want to know what the decay-exponent is, i.e., how
large the factor in front of n in the exponent is. This factor is commonly
referred to as error exponent. To answer this question, we actually cannot
exclusively rely on typicality, but need to go back to types.
Many results in this chapter go back to [Ber71]. The treatment in [CK81]
is also recommend.

133

c Stefan M. Moser, vers. 2.5


134

Error Exponents in Source Coding

For this chapter we must slightly adapt our notation from Chapter 5:
Instead of R(D) for the rate distortion function, we will write R(Q, D) to
explicitly show the dependence of the rate distortion function on the PMF of
the DMS.
Definition 6.1. For a given per-letter distortion measure d(, ), the rate
distortion function R(Q, D) of a discrete memoryless source Q and for a certain
allowed distortion D 0 is defined as
R(Q, D) ,

min

q(
x|x) : EQ [d(X,X)]D

IQ (X; X).

(6.3)

Similarly, we define the distortion rate function of a discrete memoryless source


Q and for a certain rate R 0 as


.
D(Q, R) ,
min
EQ d(X, X)
(6.4)

q(
x|x) : IQ (X;X)R

only to emphasize that


Note that we have put the subscript Q in IQ (X; X)
the mutual information is computed with respect to Q (and of course also q
that is chosen in the minimization).
Remark 6.2. In this chapter, we restrict ourselves to a per-letter distortion
measure d(, ) that satisfies three assumptions:
The distortion is bounded:
max

xX , x
X

d(x, x
) = dmax < .

(6.5)

Every source symbol has a zero-distortion representation:


min d(x, x
) = 0,
x
X

x X.

(6.6)

The zero-distortion representation is unique: For every x X , there


exists exactly one x
X such that d(x, x
) = 0.
The first two assumptions are taken over from Chapter 5 (see Section 5.2).
They are both not essential: the former can be avoided if one is careful with
limits, and the latter brings no loss in generality because it only causes a shift
of the rate distortion function.
The third assumption is new. It means that
R(Q, 0) = H(Q),

(6.7)

since whenever we restrict ourselves to D = 0, we have a one-to-one mapping


such that H(X|X)
= 0. Again this is no real restriction on
between X and X
the generality because for any distortion measure where R(Q, 0) < H(Q) we
can collapse the alphabets X and X (and thereby also Q) in such a way that
the zero-distortion representation becomes unique.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


6.2. Strong Converse of the Rate Distortion Coding Theorem

135

Next, we define the error probability.


Definition 6.3. For a given distortion measure d(, ) according to Remark 6.2
and for a given length-n coding scheme (n , n ), we denote by the probability that the source sequence X of a DMS Q is not reproduced within distortion
D:



(n , n , Q, d, D) , Qn x X n : d x, n (n (x)) > D .
(6.8)
Before we state the main result, we need to strengthen the converse of
Chapter 5.

6.2

Strong Converse of the Rate Distortion


Coding Theorem

In Chapter 5 we have proven that any rate distortion coding scheme that
satisfies the average distortion


1n D
E d X1n , X
(6.9)
must have a rate
R R(Q, D).

(6.10)

Now we will state and prove a stronger version.


Theorem 6.4 (Strong Converse to RD Coding Theorem).
Fix some  (0, 1), some 0 > 0, and some D 0. Then any coding
scheme (n , n ) that is used for some source Q and that satisfies



1n D 1 
Pr d X1n , X
(6.11)
must satisfy
Rn ,

1
log kn k R(Q, D) 0
n

(6.12)

for n n0 (, 0 , d).


Note that n () denotes the encoder mapping of the rate distortion coding scheme, i.e., it is a mapping of a source sequence xn1 to an index w
{1, 2, . . . , enRn }. Hence, by kn k we denote the number of possible values it
can take on, i.e., enRn .
Proof: We define
n ,

1
log n

(6.13)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


136

Error Exponents in Source Coding

and the following constrained strongly typical set:


n
o

(n)
Bn ,D , x X n : x A(n)
n (Q) and d x, n (n (x)) D .
From the assumption (6.11) then follows that



1  Pr x X n : d x, n (n (x)) D
n
o

= Pr x X n : x A(n)
n (Q), d x, n (n (x)) D
n
o

x Xn: x
/ A(n)
(Q),
d
x,

(
(x))

D
n n
n
n
o

= Pr x X n : x A(n)
(Q),
d
x,

(
(x))

D
n
n
n
n
o

n
+ Pr x X : x
/ A(n)
n (Q), d x, n (n (x)) D


n
o
(n)
Pr Bn ,D + Pr x X n : x
/ A(n)
(Q)
n


(n)
Pr Bn ,D + t (n, n , X ).

(6.14)

(6.15)

(6.16)

(6.17)
(6.18)
(6.19)

Here, (6.15) follows from (6.11); the subsequent equalities (6.16) and (6.17)
follow from total probability and because the two sets are disjoint; in (6.18)
(n)
we use the definition of Bn ,D and we enlarge the second set by dropping one
condition; and the final inequality (6.19) follows from the basic properties of
strongly typical sets (TA-3b).
Hence,


(n)
Pr Bn ,D 1  t (n, n , X ).
(6.20)
(n)

(n)

Moreover, because Bn ,D An (Q) we can apply TA-1b:




X
(n)
Pr Bn ,D =
Qn (x)

(6.21)

(n)
xBn ,D

en(H(Q)+n log Qmin )

(6.22)



(n)
= Bn ,D en(H(Q)+n log Qmin ) .

(6.23)

<

(n)
xBn ,D

Putting (6.19) and (6.23) together now yields






(n)
(n)
n(H(Q)+n log Qmin )
B
>
Pr
B
n ,D
n ,D e

1  t (n, n , X ) en(H(Q)+n log Qmin )
1
= en(H(Q)+n log Qmin + n log(1t (n,n ,X )))


0
n H(Q) 4

for n large enough.

(6.24)
(6.25)
(6.26)
(6.27)

Here we assume that n is large enough such that in (6.26)


t (n, n , X ) < 1 

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(6.28)

6.2. Strong Converse of the Rate Distortion Coding Theorem

137

and that in (6.27)


0
1
log(1  t (n, n , X )) .
n
4

n log Qmin

(6.29)

, i.e., the set of those


X n for
Next, let C be the set of all codewords x
x


, and let X, X be a new pair


which there exists some x with n (n (x)) = x


of RVs with X X and X X and chosen such that they maximize H X


under the constraints that
n
|QX (a) Q(a)| <
, a X,
(6.30)
|X |

 D.
X
d X,
(6.31)
EQ

X
X,

Noting that
n

1X
d(xk , x
k )
n
k=1
1X
) d(a, b)
=
N(a, b|x, x
n

) =
d(x, x

(6.32)
(6.33)

aX
bX

Px,x (a, b) d(a, b)

aX
bX



,
= EPx,x d(X, X)

(6.34)
(6.35)

we further bound as follows:




n
o
(n) X

) D
Bn ,D
x X n : x A(n)
n (Q); d(x, x

(6.36)

C
x

o
X n



(Q);
E
d(X,
X)

D
x X n : x A(n)

Px,
n
x

C
x


X
=


C
x


n
o

n
x X : Px|x = qX|X

[
qX|X

s.t. EPx
q

X|X

(6.37)
(6.38)

[d(X,X)]D

and s.t. marginal QX of Px


qX|X

n

aX
satisfies |QX (a)Q(a)|< |X
|

X
C
x

n
o


x X n : Px|x = qX|X (6.39)

X
qX|X

s.t. EPx
q

X|X

[d(X,X)]D

and s.t. marginal QX of Px


qX|X

n
satisfies |QX (a)Q(a)|< |X
aX
|

sup

C qX|X
x

Pn (X |X )

qX|X

s.t. EPx
q

X|X

n
o


x X n : Px|x = qX|X

[d(X,X)]D

and s.t. marginal QX of Px


qX|X

n
satisfies |QX (a)Q(a)|< |X

aX
|

(6.40)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


138

Error Exponents in Source Coding

x
x
x

, but
Figure 6.1: Every source sequence x is mapped to exactly one codeword x
for a source sequence there might exist more than one codeword
that is close enough to satisfy the distortion constraint (e.g., x0
0 , but could also be mapped to x
). Hence, when
is mapped to x
considering all source sequences around every codeword, we might
count some source sequences several times.

C qX|X
x

Pn (X |X )



n

T qX|X x

sup

qX|X

(6.41)

[d(X,X)]D

s.t. EPx
q

X|X

and s.t. marginal QX of Px


qX|X

n
satisfies |QX (a)Q(a)|< |X
aX
|

(n + 1)|X ||X |

sup

qX|X

C
x

s.t. EPx
q

X|X

en HPx (X|X)

(6.42)

[d(X,X)]D

and s.t. marginal QX of Px


qX|X

n
satisfies |QX (a)Q(a)|< |X

aX
|

X
C
x

(n + 1)|X ||X | en H(X|X)

= kn k (n + 1)|X ||X | en H(X|X)




= kn k en H(X|X)+ .

(6.43)
(6.44)
(6.45)

The inequality (6.36) follows because in the sum we possibly count some x
several times as it is possible that a certain source sequence x is close to several
, see Figure 6.1. In (6.37) we use (6.35). In (6.38) we rewrite the
codewords x
set as a union of sets of sequences with the same conditional type, where the
type is restricted such that the same two conditions as in (6.37) are satisfied.
Then (6.39) follows by the Union Bound. In (6.40) we upper-bound the value
by adding a supremum over all given conditional types and by enlarging the
sum to be over all possible conditional types. In (6.41) we then rename the set
. The size of this type
by its proper name: the conditional type class given x
class is then upper-bounded in (6.42) by applying CTT3, and also the number
of conditional types is upper-bounded by applying CTT1. In (6.43) we make
use of the fact that the entropy in the exponent by definition is maximized

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


6.2. Strong Converse of the Rate Distortion Coding Theorem

139

. And (6.44) follows because the size of C by definition is


X
if we use X,
equal to kn k. In the final step (6.45) we include the polynomial term into
the exponent by a corresponding .
Hence, we now have (for n large enough)





0

n H(Q) 4
(n)
e
< B,D kn k en H(X|X)+
(6.46)
which leads to
kn k > en

0

X)

H(Q)H(X|

en

where we need n large enough such that


H(Q) =

Q(a) log

aX

0
4.

20

X)

H(Q)H(X|
4

(6.47)

1
Q(a)

(6.48)

aX

aX

{z

Recalling (6.30) we see that


X
1
n
log
>
QX (a)
n
|X |
QX (a) + |X
|
aX
 

X
n
n
=
QX (a)
log QX (a) +
|X |
|X |
aX
!
n
X
QX (a) + |X
|
=
QX (a) log QX (a)
QX (a)
aX


X n
n
+
log QX (a) +
|X |
|X |
aX

X
X
QX (a) log 1 +
=
QX (a) log QX (a)
|

= H(X)

(6.49)
(6.50)

(6.51)
n
|X | QX (a)
| {z }

pmin



n X
n
+
log QX (a) +
| {z } |X |
|X |
aX
0


X
n

n X
n

H(X)
QX (a) log 1 +
+
log
|X | pmin
|X |
|X |
aX
aX


n
n
log 1 +
= H(X)
+ n log
|X | pmin
|X |
0

, for n large enough.
H(X)
4

(6.52)

(6.53)
(6.54)
(6.55)

Here we have introduced pmin to be the smallest nonzero value of QX ().


Hence, we have


30
30

kn k > en H(X)H(X|X) 4 = en I(X;X) 4 .


(6.56)
Now we can argue the following sequence of inequalities
0
0
1
1
 3 R(Q , D) 3 .
X
log kn k log kn k > I X;
X
n
n
4
4

(6.57)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


140

Error Exponents in Source Coding

The first inequality follows because the number of indices kn k cannot be


smaller than the number of codewords kn k (in general they will be equal,
however, we do not exclude the possibility that several indices are mapped
to the same codeword). The second inequality follows from (6.56). The final
inequality follows from the definition of the rate distortion function that will
minimize the mutual information under the constraint that the expected dis  by definition).
X
tortion is not larger than D (which is also satisfied for X,
Assuming for the moment that R(QX , D) is continuous in QX and D, it
follows from (6.30) that we can make n large enough such that
R(Q, D) R(QX , D) +

0
.
4

(6.58)

Together with (6.57) this will finally prove the theorem.


So it only remains to show the continuity of R(QX , D). Consider a sequence
of distributions {Qk } with limk Qk = Q and a sequence of distortion
values {Dk } with limk Dk = D . For a given  > 0 pick some qX|X
such

that


< D ,
EQ qX|X
d(X, X)
(6.59)



I(X; X)
< R(Q , D ) + .
(6.60)
Q q

X|X

Note that because of (6.59) (with strict inequality!) and because E[d(X, X)]
is continuous, we can choose k large enough such that


Dk ,
EQk qX|X
d(X, X)
(6.61)

i.e., qX|X
is among the choices q in the minimization of R(Qk , Dk ):

R(Qk , Dk ) ,

min

q: EQk
q [d(X,X)]Dk


I(X; X)

I(X; X)
Q

k qX|X

(6.62)

is continuous, i.e., for k


Now we know that the mutual information I(X; X)
sufficiently large, it follows from (6.60) that


I(X; X)
< R(Q , D ) + .
(6.63)
Q q
k

X|X

Combining (6.62) and (6.63) then proves that




lim R(Qk , Dk ) lim I(X; X)
Q q
k

X|X

< R(Q , D ) + .

(6.64)

(k)

On the other hand, let q


achieve the minimum in the definition of
X|X
R(Qk , Dk ). Consider now a subsequence of {k}, {kj }, such that
lim R(Qkj , Dkj ) = lim R(Qk , Dk ),

(k )
lim q j
j X|X

k
()
q
X|X

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


and

(6.65)
(6.66)

6.3. Rate Distortion Error Exponent


()
.
X|X

for some q
EQ

()
X|X

141

In then follows that



= lim E
d(X, X)
j

(k )
Qkj q j
X|X


= lim Dk = D
d(X, X)
j
j

(6.67)

(k )
X|X

(the second equality follows because q j achieves R(Qkj , Dkj )) and therefore


R(Q , D ) I(X; X)
Q

()
q
X|X



= lim I(X; X)

(k )

Qkj q j

(R(, ) is a minimization)

(6.68)

(continuity of I(X; X))

(6.69)

X|X

(k )
X|X

= lim R(Qkj , Dkj )

(q j achieves R(Qkj , Dkj ))

(6.70)

= lim R(Qk , Dk )

(by (6.65)).

(6.71)

Hence, from (6.71) and (6.64), we have for an arbitrary  > 0,


lim R(Qk , Dk )  < R(Q , D ) lim R(Qk , Dk ),

(6.72)

lim R(Qk , Dk ) = R(Q , D ),

(6.73)

i.e.,
k

showing that R(Q, D) indeed is continuous.

6.3

Rate Distortion Error Exponent

Recall that denotes the probability that the source sequence X is not reproduced within the required distortion D (see Definition 6.3). We now state
the main result of this chapter.
Theorem 6.5 (Rate Distortion Error Exponent).
Fix a per-letter distortion measure d(, ) with source alphabet X and reproduction alphabet X according to Remark 6.2. Then for every R 0
and every D 0, there exists a sequence of length-n rate distortion coding
schemes (n , n ) such that
the number of indices tends to at most enR :
lim

1
log kn k R;
n

(6.74)

for every distribution Q P(X ),



n inf Q
:

(n , n , Q, d, D) e

R(Q,D)>R

k Q)n
D(Q

(6.75)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


142

Error Exponents in Source Coding


where
n ,

1
|X | log(n + 1) 0
n

(6.76)

as n .
Furthermore, for every R 0, every D 0, every sequence of coding
schemes satisfying (6.74), and for every source Q P(X ),
1
k Q).
log (n , n , Q, d, D)
inf
D(Q
: R(Q,D)>R

n n
Q
lim

(6.77)

We remind the reader of our notation agreement in Remark 3.2: If we


P(X ) with
choose D and R such that there does not exist any source Q

R(Q, D) > R, then inf Q : R(Q,D)>R


D(Q kQ) is understood to be , i.e., we

have = 0.
To understand the meaning of Theorem 6.5, first note that if R < R(Q, D),
then the set of distributions


: R(Q,
D) > R
Q
(6.78)
also contains the distribution Q. This means that

Q) D(Qk
Q)
0
inf
D(Qk
= D(Qk Q) = 0,
Q=Q
: R(Q,D)>R

(6.79)

i.e.,
inf

: R(Q,D)>R

Q) = 0,
D(Qk

(6.80)

and therefore it follows from (6.77) that


1
log (n , n , Q, d, D) 0
n n

(6.81)

lim (n , n , Q, d, D) 1.

(6.82)

lim

or
n

This corresponds to the strong converse given in Theorem 6.4: If we have a


rate strictly below the rate distortion function, then the probability that a
sequence will have a distortion larger than the maximum allowed distortion
D is 1 (for n large)!
But what about R R(Q, D)? Here, the theorem guarantees that there
exists a coding scheme such that the error probability tends to zero exponentially fast in n with an exponent being (at least)
D ,

inf

: R(Q,D)>R

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Q).
D(Qk

(6.83)

6.3. Rate Distortion Error Exponent

143

The really cool thing, however, is that the theorem guarantees the existence
of one coding scheme (n , n ) that works for any Q (as long as R R(Q, D))!
So, actually, what we have here is a universal compression scheme!
The performance of this universal compression scheme is not the same for
that have a
different sources: The further away the PMF Q is from those Q

too big rate distortion function value R(Q, D) > R, the better the system will
perform, i.e., the quicker will tend to zero. See Figure 6.2 for a graphical
explanation.
that do
the sources Q
not work because for the
given R and D:
D) > R
R(Q,

kQ) = D
inf D(Q
P(X )
Q

Figure 6.2: Graphical explanation of Theorem 6.5. The triangle depicts the
set of all sources, and the shaded area is the subset of sources for
which there exists no rate distortion coding scheme that works for
the given parameters R and D. There exists a coding scheme that
can compress all sources in the white area, for example the source
Q is a source that actually works. Its performance depends on
that do not work.
D , the distance to the closest of the sources Q
In short one can say that for every system (and for n ) we have

enD , and for all Q in the white area enD . Here, D depends on
the particular source Q and the parameters R and D.
In the following sections we are going to prove Theorem 6.5. In Section 6.3.1 an important lemma is proven that shows that any source sequence
in the type class of the source will for sure be reconstructable within the given
distortion D and rate R(D). This lemma is then used to prove the achievabil-

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


144

Error Exponents in Source Coding

ity in Section 6.3.2. The converse is proven in Section 6.3.3 and is based on
the strong converse of Section 6.2.

6.3.1

Type Covering Lemma

The following lemma is the basis of our achievability proof.


Lemma 6.6 (Type Covering Lemma [Ber71]). For any distortion measure d(, ) according to Remark 6.2, for any source with PMF P Pn (X )
being a type, for any distortion D 0, and for any number 0 > 0, there exists
such that
a set B X n of codewords x
x T n (P ),

) D,
min d(x, x
B
x

(6.84)

and
0

|B| en(R(P,D)+ ) ,

(6.85)

provided that n is sufficiently large.


Remark 6.7. This lemma affirms that if we choose B as the set of our codewords, then for every x in the type class of P we can find a codeword such
that the achieved distortion is within the acceptable limit.
Note the differences to the rate distortion theorem:
While in the rate distortion theorem we only guarantee that


1n D,
lim E d X1n , X

(6.86)

) D for every x, once


the type covering lemma guarantees that d(x, x
n is big enough.
On the other hand, the type covering lemma is limited to x in the type
class T n (P ) and to sources with a PMF that happens to be a type.
Proof: Let the type P Pn (X ) be arbitrary, but fixed. We first consider
the case D = 0 and choose for every x T n (P ) a (possibly different) codeword
with d(x, x
) = 0. Hence, by TT3,
x


|B| T n (P ) en H(P ) .
(6.87)
And since we have R(P, 0) = H(P ) (see Remark 6.2), this proves the lemma
for D = 0.
So fix D > 0. For every set B X n denote by U(B) X n the set of those
x T n (P ) for which we cannot find a codeword:


n
) > D .
U(B) , x T (P ) : min d(x, x
(6.88)
B
x

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


6.3. Rate Distortion Error Exponent

145

with joint PMF


Fix some 0 < < D and consider a pair of RVs (X, X)
QX,X (x, x
) = P (x)q(
x|x)

(6.89)

for some choice of q(|) such that




D .
E d(X, X)
Moreover, fix some 0 <  <

(6.90)
(n)

D
dmax .

) A
Then for all (x, x


QX,X we have

1X
d(xk , x
k )
n
k=1
1 X
)d(a, b)
=
N(a, b|x, x
n

) =
d(x, x

aX ,bX

QX,X (a, b) +

aX ,bX


|X | |X |

(6.91)
(6.92)
!
d(a, b)

(6.93)



+ dmax
E d(X, X)

(6.94)

D.

(6.96)

D + dmax

(6.95)

) are jointly typical; in


Here, (6.93) follows from the assumption that (x, x
(6.95) we apply (6.90); and the last inequality (6.96) follows for a correct
choice of 0 < < D.
Finally, let m be a positive integer that we specify later. We will now
prove the existence of a set B X n with U(B) = and |B| m, which will
then prove the lemma. We do this by a random coding argument.
According to a uniform distribution,
 randomly and independently pick m
(n)
length-n sequences Zi from A
QX and plug them into a random matrix
Zm , (Z1 , . . . , Zm ).

(6.97)

Now consider the random set U(Zm ), i.e., the set of all those x T n (P ) for
which
d(x, Zi ) > D,

i = 1, . . . , m.

(6.98)
(n)

If we can show that E[|U(Zm )|] < 1, then there must exist a set B A
with |U(B)| < 1, i.e., |U(B)| = 0, i.e., U(B) = , where1
|B| kZm k = m.

QX

(6.99)

1
Note that in our random choice of Zm we do not prevent cases where the same vector
A(n) QX is picked several times. Hence, when regarding Zm as a set instead of a
x
matrix, the number of different vectors Zi might be less than m.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


146

Error Exponents in Source Coding

So, lets find a bound to E[|U(Zm )|]:


E[|U(Zm )|]

X
= E

xT n (P )

X
xT n (P )

X
xT n (P )

X
xT

X
xT

=
=

n (P )

n (P )

(6.100)

E[I {x U(Zm )}]

(6.101)

Pr[x U(Zm )] 1 + Pr[x


/ U(Zm )] 0

Pr[x U(Zm )]

m
Y

i=1

(6.102)
(6.103)

Pr[d(x, Zi ) > D, i = 1, . . . , m]

xT n (P ) i=1
m
X Y
xT

n (P )

I {x U(Zm )}

(by def. of U(Zm ))

(6.104)

Pr[d(x, Zi ) > D]

(6.105)

(1 Pr[d(x, Zi ) D])

(6.106)

m 
Y
1

xT n (P ) i=1


h
i h
i

(n)


x
Pr d(x, Zi ) D Zi A(n)
Q
Pr
Z

A
Q
i



X,X
X,X x
|
{z
}
= 1 by (6.96)



h
i
i h

(n)
(n)


/ A
QX,X x
Pr d(x, Zi ) D Zi
/ A
QX,X x Pr Zi
|
{z
}
0

m 
Y

xT n (P ) i=1

h
i
.
1 Pr Zi A(n)
QX,X x


(6.107)
(6.108)

Unfortunately, we cannot apply TC because Zi is not picked according to QX ,



(n)
but uniformly from A
QX . So we need to bound (6.108) indirectly. Note


(n)
(n)
A
that by definition every x
QX,X x also is element of A
QX , but

(n)
not necessarily vice versa. Since Zi is uniformly picked from A
QX , we
hence have



h
i A(n)
(Q
|x)




X,X

(6.109)
Pr Zi A(n)
Q
x
=
(n)


X,X
A (Q )
X


1 t (n, , X X ) en(H(X|X)m (QX,X ))


>
(6.110)

en(H(X)+m (QX ))

= en(H(X|X)H(X))
(6.111)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


6.3. Rate Distortion Error Exponent

147

= en(I(X;X)+) ,

(6.112)

where the inequality (6.110) follows from TA-2 and TB-2, and where

 1

, m QX,X + m QX log 1 t (n, , X X ) .
n

(6.113)

Hence, we get from (6.108),


E[|U(Zm )|] <

m 
Y

X
xT

n (P )

i=1

1 en(I(X;X)+)

m



= T n (P ) 1 en(I(X;X)+)



< exp(n H(P ) ) exp m en(I(X;X)+)


| {z }

(6.114)
(6.115)
(6.116)

log |X |




exp n log |X | m en(I(X;X)+) .

(6.117)

Here, in (6.116) we use TT3 and the Exponentiated IT Inequality (Corollary 1.10).
Now choose m as some integer satisfying

en(I(X;X)+2) m en(I(X;X)+3) .

(6.118)

For n large enough, this is always possible. Then it follows from (6.117)



E[|U(Zm )|] < exp n log |X | en(I(X;X)+2) en(I(X;X)+)


(6.119)


= exp n log |X | en
(6.120)
1

(6.121)

for n large enough. Hence, there must exist a set B X n with U(B) = .
Moreover, from (6.118) this set satisfies

|B| m en(I(X;X)+3) .

(6.122)

If we now finally specify q(|) in (6.89) such that


< R(P, D) + ,
I(X; X)

(6.123)

which is always possible because by definition R(P, D) is the minimization of


and because R(P, D) is continuous in D (such that we can compensate
I(X; X)
for the additionally required in (6.90)), then we have
|B| en(R(P,D)+4) .

(6.124)

The lemma now follows once we make n large enough and  small enough such
that 4 0 .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


148

Error Exponents in Source Coding

6.3.2

Achievability

The proof of the achievability of Theorem 6.5 is strongly based on the Type
Covering Lemma (Lemma 6.6).
By Lemma 6.6 we know that for any 0 > 0 and for every type P Pn (X ),
there exists a set BP X n that satisfies for n large enough
0

|BP | en(R+ )

(6.125)

and
) D(P, R),
min d(x, x

BP
x

x T n (P ).

(6.126)

Note that we have reformulated the Type Covering Lemma: While in Lemma 6.6 we fixed D and then computed R from D using the rate distortion
function, here we fix R and then compute D from R using the distortion rate
function:


.
(6.127)
D(P, R) =
min
EP d(X, X)

q(
x|x) : IP q (X;X)R

Set
B,

[
P Pn (X )

and compute (for n large enough)


X
|B|
|BP |
P Pn (X )

en(R+ )

P Pn (X )

=e

|X | n(R+0 )

n(R+ 0 )

(6.128)

(Union Bound)

(6.129)

(by (6.125))

(6.130)

= |Pn (X )| en(R+ )
(n + 1)

BP

(6.131)
(by TT1)

(6.132)
(6.133)

Hence, we can choose a code (n , n ) using all codewords from B such that
1
1
log kn k lim log |B| lim {R + 0 } = R + 0 .
n n
n
n n
lim

(6.134)

Since this holds for an arbitrary 0 > 0, this proves (6.74). So lets check the
distortion caused by this code. Fix some D 0 (R is already fixed!) and
define


P(X ) : R(Q,
D) > R .
F, Q
(6.135)
Note that F basically denotes the set of those sources for which the rate
distortion theorem is not satisfied, i.e., for which we get into troubles. Then
by definition of F, for all x T n (F), Px is such that R(Px , D) > R. Hence,

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


6.3. Rate Distortion Error Exponent

149

all x X n \ T n (F) have a type Px such that R(Px , D) R, which implies


that
D(Px , R) D.

(6.136)

Therefore, for every x X n \ T n (F) we have


) min d(x, x
)
min d(x, x
B
x

BPx
x

D(Px , R)
D

(weakening minimum)

(6.137)

(by (6.126))

(6.138)

(by (6.136)).

(6.139)

So we see that all x X n \ T n (F) are OK and that therefore


Pr[PX F] = Qn (T n (F)).

(6.140)

By Sanovs Theorem (Theorem 3.1) we now know that

D(Q k Q)

(n + 1)|X | en inf QF
,

(6.141)

which proves (6.75).

6.3.3

Converse

In order to prove the converse, we need the following lemma.


P(X ). Then for any 0 <  < 1 , we have
Lemma 6.8. Fix two PMFs Q, Q
2
1
k Q).
lim
min
log Qn (B) = D(Q
(6.142)
n BX n : Q
n (B)1 n
Proof: Lets first treat the special cases:

If a X such that Q(a)


= Q(a) = 0, reduce X to X \ {a}.

If a X such that Q(a)


> Q(a) = 0, choose  > 0 such that


Q(a)
>
|X |

(6.143)

(n)

(this is always possible because Q(a)


> 0). Now consider A (Q):
If
(n)
x A (Q), then


Px (a) > Q(a)

> 0,
(6.144)
|X |

(n)
where the first inequality follows from the definition of A (Q)
and the
second from our choice of . Hence, N(a|x) > 0 and therefore
n
Y
n
Q (x) =
Q(xk )
(6.145)
k=1

N(b|x)
Q(b)

(6.146)

bX
>0

z }| {
N(a|x)

= Q(a)
| {z }
=0

N(b|x)
Q(b)
= 0,

(6.147)

bX \{a}

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


150

Error Exponents in Source Coding


where in the last step we wrote the term involving b = a separately
outside the product. Hence,
(n)

Qn A


= 0.
(Q)

(6.148)

But from TA-3 we know that



n A(n) (Q)
1 t (n, , X ) 1 
Q


(6.149)

(n)
for n large enough. Hence, B , A (Q)
meets the condition in the minimization on the LHS of (6.142) and definitively achieves the minimum
because of (6.148). The LHS will be equal to in this case.

Since Q(a)
> Q(a) = 0, the RHS also will yield

Q) = ,
D(Qk

(6.150)

and the lemma is satisfied.


So, in the remaining proof, assume that Q(a) > 0 for all a X . Let the RV
and define
XQ
Y , log

Q(X)
.

Q(X)

(6.151)

Note that Y is finite with probability 1 because Q() > 0 and because the

Now,
event {Q(X)
= 0} has zero probability because X Q.

 X
Q(X)
Q(x)

Q) , D, (6.152)
E[Y ] = E log
=
Q(x)
log
= D(Qk

Q(X)
Q(x)
xX
where the last equality has to be understood as definition of D.
Now define for some 0 > 0,


1
Qn (x)
n
0
0
A0 , x X : D  log
D + 
n (x)
n
Q

(6.153)

and note that





n (X)
1

Q
0
Q (A0 ) = Pr log
+ D 
n (X)
n
Q

"
#
n
1

Y
Q(Xk )


= 1 Pr log
+ D > 0
k)
n

Q(X

" n k=1
#
1 X

Q(Xk )


log
+ D > 0
= 1 Pr
k)

n
Q(X

" k=1
#
n
1 X



= 1 Pr
Yk E[Y ] > 0
n

n

k=1

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(6.154)
(6.155)
(6.156)
(6.157)

6.3. Rate Distortion Error Exponent

151

!2
n
X
1
= 1 Pr
Yk E[Y ]
> 02
n
k=1
h P
2 i
E n1 nk=1 Yk E[Y ]
1
02
h P 
 P
2 i
n
1
E n k=1 Yk E n1 nk=1 Yk
=1
02

 1 Pn
Var n k=1 Yk
=1
02
Pn 
Var[Yk ]
= 1 k=12 02
n 
n Var[Y ]
=1
n2 02
Var[Y ]
1  for n large enough.
=1
n02

(6.158)

(6.159)
(6.160)
(6.161)
(6.162)
(6.163)
(6.164)

Here, the inequality follows from the Markov Inequality that states that for
any nonnegative Z,
E[Z]
.
t

Pr[Z t]

(6.165)

So we see that A0 is a possible candidate among all possible B in the


minimization in (6.142) and therefore, for n large enough,
min

n (B)1
BX n : Q

Qn (B) Qn (A0 )
X
=
Qn (x)

(6.166)
(6.167)

xA0

n (x)
en(D+ ) Q

(6.168)

xA0

0
n (A 0 )
= en(D+ ) Q
| {z  }

n(D+0 )

(6.169)

(6.170)

Here, the inequality (6.168) follows because of the definition of A0 that guarantees that
Qn (x)
0
en(D+ ) .
n (x)
Q

(6.171)

Hence,
lim

min

n BX n : Q
n (B)1

1
log Qn (B) D + 0 .
n

(6.172)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


152

Error Exponents in Source Coding

n (B) 1 , for n large


On the other hand, note that for any B with Q
enough,
n (A 0 ) Q
n (B A0 )
n (B A0 ) = Q
n (B) + Q
Q
{z
}
| {z } | {z  } |
1

1

1  + 1  1 = 1 2.

(6.173)
(6.174)

Note that by assumption 1 2 > 0. So, using that by definition of A0


Qn (x)
0
en(D ) ,
n

Q (x)

(6.175)

we get
Qn (B) Qn (B A0 )
X
=
Qn (x)

(6.176)
(6.177)

xBA0
0

en(D )

n (x)
Q

n (B A0 )
= en(D ) Q
e

n(D0 )

(6.178)

xBA0

(1 2),

(6.179)
(6.180)

where in the last step we have used (6.174). Hence,




1
1
n
0
min
log Q (B) lim D  + log(1 2) (6.181)
lim
n (B)1 n
n
n BX n : Q
n
= D 0 .

(6.182)

Since 0 is arbitrary, the result follows from (6.172) and (6.182).


We are now ready for the proof of the converse. So given is a rate distortion
pair (R, D) and a sequence of length n coding schemes (n , n ) that satisfy
(6.74).
P(X ) be an arbitrary PMF such that2 R(Q,
D) > R, i.e., a source
Let Q
for which we cannot find a coding scheme that works. Then choose 0 > 0
such that
D) > R + 20
R(Q,

(6.183)

1
log kn k R + 0
n

(6.184)

and for n large enough

(note that this is possible because we assume that (6.74) is satisfied). Hence,
1
D) 20 + 0 = R(Q,
D) 0 .
log kn k R + 0 < R(Q,
n

(6.185)

does not exist, then the converse is void and (6.77) trivially
Recall that if such a Q
claims that 0.
2

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


6.3. Rate Distortion Error Exponent

153

For
Now recall the strong converse (Theorem 6.4) applied to the source Q.
0
any 0 <  < 1 and  > 0, if



n d X, n (n (X)) D 1 ,
Q
(6.186)
then
1
D) 0 ,
log kn k R(Q,
n

(6.187)

for n large enough. The inverse of this statement is as follows: For any
0 <  < 1 and 0 > 0, if
1
D) 0 ,
log kn k < R(Q,
n
then, for n large enough,



n d X, n (n (X)) D < 1 ,
Q

(6.188)

(6.189)

or, equivalently,
n
Q



d X, n (n (X)) > D .

3
4

and define



B , x X n : d x, n (n (x)) > D .

Lets choose  =

(6.190)

(6.191)

Now we know from (6.185) and (6.190) that for n large enough,
n (B)  = 3 .
Q
4

(6.192)

with R(Q,
D) > R! Therefore, we have, for n
Note that this holds for any Q
large enough and an arbitrary  > 0,
1
1
log (n , n , Q, d, D) = log Qn (B)
n
n

(6.193)

log Qn (B)
:
n
B

Q : R(Q,D)>R
n
o
kQ) 

sup
D(Q

sup

min

n (B)
3
Q
4

(6.194)
(6.195)

: R(Q,D)>R

inf

: R(Q,D)>R

Q) .
D(Qk

(6.196)

Here, (6.193) follows by definition of B. The lower bound in (6.194) is caused


by the introduction of the minimum over all sets B that satisfy the constraint

(6.192), while the supremum has no influence since (6.192) holds for any Q

satisfying R(Q, D) > R. The subsequent inequality (6.195) then follows from
Lemma 6.8 for the choice of  = 41 and for n large enough such that the value
Q).
is within
 of the limiting value D(Qk
Since  is arbitrary, this proves the converse.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Chapter 7

The Multiple Description


Problem
7.1

Problem Description

So far in rate distortion theory we have thought of one single source description
(the index that is generated from the source encoder) that will then be used
to produce an estimate of the source sequence (the output sequence of the
source decoder). But what happens if there are two or more compressors that
all can provide a description (i.e., an index)?
Of course, such a scheme is basically the same as a normal rate distortion
system because the different compressors all see the same source sequence and
therefore can cooperate with each other to achieve the same compression as if
they all were united into one big compressor. Hence, what happens in such a
setup is that the original single source index w is split up into several indices
w(1) , . . . , w(L) that then are used to create a source description by the decoder.
So far this is not interesting. However, lets now ask what happens if one of
these indices somehow gets lost on its way. A standard rate distortion system
will fail to work: No proper index means no description! Here, however, we
still have L 1 other indices available. So, we still should be able to get some
kind of description! This description, of course, will be slightly less accurate,
i.e., cause a higher distortion, since the total number of possible indices is
reduced.
Such a system could be very useful in practice. Consider, for example, a
network where due to packet loss some part of the message does not arrive
at the receiver. A traditional rate distortion system will fail, while a multiple
description system still can reproduce the source, just in slightly less good
accuracy.
So, let us define our setup more formally. For simplicity we concentrate
here on the case with two indices w(1) and w(2) . As shown in Figure 7.1, we
have a source that generates an IID random sequence X1 , . . . , Xn ,
{Xk } IID Q.

155

(7.1)

c Stefan M. Moser, vers. 2.5


Sfrag

156

The Multiple Description Problem

W (1)
Dest.

1, . . . , X
n
X

Enc. (1)
X 1 , . . . , Xn

Dec. (i)
W (2)

Enc. (2)

Figure 7.1: A multiple description system.

This source sequence is then fed into two encoders, which will generate indices
W (1) = (1) (X),
W

(2)

(2)

(7.2)

(X),

(7.3)

respectively. The encoding functions are deterministic mappings that map


sequences from the source alphabet X into indices:

(i)
(i) : X n 1, 2, . . . , enR ,

i = 1, 2,

(7.4)

for two values R(1) and R(2) . Note that it is irrelevant whether the two encoders are actually physically separate entities or if they are jointly together in a
single source encoder. As mentioned above already, the reason for this is that
both encoders see the exactly same input and therefore perfectly know what
the other encoder does.
The decoder will receive either W (1) , or W (2) , or both (W (1) , W (2) ). Depending on what it receives, it will generate a description

or


= (1) W (1) ,
X

= (2) W (2) ,
X

= (12) W (1) , W (2) ,
X

(7.5)
(7.6)
(7.7)

respectively. The decoding functions are deterministic mappings from the set
of possible indices to a sequence in the corresponding reconstruction alphabet
X (i) :
n
o
n
(i)
(i) : 1, 2, . . . , enR
X (i) , i = 1, 2,
(7.8)
and
(12) :

n
o n
o
n
(1)
(2)
1, 2, . . . , enR
1, 2, . . . , enR
X (12) .

(7.9)

Note that the case when the decoder receives no index is uninteresting and
therefore ignored.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


7.1. Problem Description

157

Depending on whether the first, the second, or both indices arrive at the
decoder, we ask for a different maximum allowed distortion:
h

i
E d(1) X, (1) (1) (X)
D(1) ,
(7.10)
h

i

E d(2) X, (2) (2) (X)
D(2) ,
(7.11)
h

i

E d(12) X, (12) (1) (X), (2) (X)
D(12) ,
(7.12)
for some given values D(1) , D(2) , D(12) . To keep things as general as possible,
we even allow a different distortion measure and different reproduction alphabets for the three different cases. However, in all cases we will stick to our old
(and poor!) assumption of a distortion measure that is an average per-letter
distortion:
n
1 X (i)
(i)
) ,
d (x, x
d (xk , x
k ), i = 1, 2, 12.
(7.13)
n
k=1

Moreover, we again assume that the distortion measures are bounded by a


maximum possible value dmax .
We now give the following definitions.

(1)
(2)
Definition 7.1. An enR , enR , n multiple description rate distortion coding scheme consists of
a source alphabet X and three reproduction alphabets X (1) , X (2) , X (12) ,
two encoding functions (1) and (2) as specified in (7.4),
three decoding functions (1) , (2) , and (12) as specified in (7.8) and
(7.9), and
three distortion measures d(1) , d(2) , and d(12) as specified in (7.13).

So the question now is what parameters R(1) , R(2) , D(1) , D(2) , and D(12) can
be chosen for some given source Q and some per-letter distortion functions,
such that we can find a multiple description rate distortion coding scheme
that works.
Definition 7.2. A multiple description rate distortion quintuple

R(1) , R(2) , D(1) , D(2) , D(12)
is said to be achievable for a source Q and forsome distortion measures d(i) (, )
(1)
(2)
if there exists a sequence of enR , enR , n multiple description rate distor(1) (2)
(1)
(2)
(12)
tion coding schemes (n , n , n , n , n ) satisfying the distortion constraints
h

i
(X)
D(1) ,
(7.14)
lim E d(1) X, n(1) (1)
n
n
h

i

lim E d(2) X, n(2) (2)
D(2) ,
(7.15)
n (X)
n
h

i

(2)
lim E d(12) X, n(12) (1)
D(12) .
(7.16)
n (X), n (X)
n

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


158

The Multiple Description Problem

The multiple description rate distortion region for a source Q and some distortion measures d(i) (, ) is the closure of the set of all achievable multiple
description rate distortion quintuples.
Note the main problem that we have here. Since a good description of the
source must be similar to the source, two individual good descriptions are in
general quite similar and therefore dependent. This, however, means that the
second description cannot contribute much more new information in addition
to the first.
If, on the other hand, two descriptions are independent of each other such
that they together yield a far better description than they do alone, then they
usually will not be very good individually.

7.2

An Example

Example 7.3. Lets consider a binary symmetric source (BSS)


Q(0) = Q(1) =

1
2

(7.17)

and the Hamming distortion measure


(
0
d(1) (a, b) = d(2) (a, b) = d(12) (a, b) = d(a, b) ,
1

if a = b,
if a =
6 b.

(7.18)

Suppose we require that D(12) = 0, i.e., if both indices arrive we would like to
have perfect reconstruction. Lets consider a channel splitting approach where
the even numbered bits are transmitted over the first channel and the odd bits
over the second channel, i.e., we have rates R(1) = R(2) = 12 bits. Then we get


(1)
D(1) = E d X, X
(7.19)
n

1X 
(1)
Pr Xk 6= X
(7.20)
=
k
n
k=1
n 


i 1 h
i
1X 1 h
(1)
(1)

=
Pr Xk 6= Xk k is even + Pr Xk 6= Xk k is odd
n
2
2
k=1

(7.21)

n 
1X 1
1 1
1
=
0+
= ,
n
2
2 2
4

(7.22)

k=1

and


(2)
D(2) = E d X, X
n

1X 
(2)
Pr Xk 6= X
=
k
n
k=1

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(7.23)
(7.24)

7.3. A Random Coding Scheme

159

n 


i 1 h
i
1X 1 h
(2)
(2)

=
Pr Xk 6= Xk k is even + Pr Xk 6= Xk k is odd
n
2
2
k=1

(7.25)

n 
1
1X 1 1 1
+ 0 = .
=
n
2 2 2
4

(7.26)

k=1

Hence, we see that


R

(1)

,R

(2)

(1)

,D

,D

(2)

(12)

,D


=


1
1
1 1
bits, bits, , , 0
2
2
4 4

(7.27)

is achievable.
However, as we will see below in Section 7.7, we can do better: For R(1) =
(2)
R = 12 bits and D(12) = 0 it is possible to achieve

21
1
(1)
(2)
D =D =
0.207 < .
(7.28)
2
4

7.3

A Random Coding Scheme

We next show an achievable coding scheme based on random coding.


1: Setup: We choose a PMF QX (1) ,X (2) ,X (12) |X and compute QX (1) , QX (2) ,
and QX (12) |X (1) ,X (2) as marginal distributions of
QX,X (1) ,X (2) ,X (12) = Q QX (1) ,X (2) ,X (12) |X .

(7.29)

Then we fix some rates R(1) and R(2) and some blocklength n.
2: Codebook Design: We generate enR

(1) w(1) ,
X
by choosing each of the n enR
random according to QX (1) .
Similarly, we generate enR

(2)

(1)

length-n codewords
(1)

w(1) = 1, . . . , enR ,
(1)

(1) (w(1) ) independently at


symbols X
k

length-n codewords


(2) w(2) ,
X
by choosing each of the n enR
random according to QX (2) .

(2)

w(2) = 1, . . . , enR ,
(2)

(2) (w(2) ) independently at


symbols X
k

Finally, for each pair (w(1) , w(2) ), we generate a codeword



(12) w(1) , w(2) ,
X

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


160

The Multiple Description Problem



(12) w(1) , w(2) independently at random
by choosing its kth component X
k
according to


(1) (1)  (2) (2) 
QX (12) |X (1) ,X (2) X
w
,
X
w
,
k
k
i.e., this choice depends on the codewords generated before.

3: Encoder Design: For a given source sequence x, the encoders try to


find a pair (w(1) , w(2) ) such that

 (2) (2)  (12) (1) (2) 
(1) w(1) , X

x, X
w
,X
w ,w

A(n) QX,X (1) ,X (2) ,X (12) .
(7.30)
If they find several possible choices, they pick one. If they find none,
they choose w(1) = w(2) = 1.
The first encoder (1) puts out w(1) , and the second encoder (2) puts
out w(2) .
4: Decoder Design: The decoder ( (1) , (2) , (12) ) consists of three different decoding functions, depending on whether w(1) , w(2) , or both
(w(1) , w(2) ) are received. It puts out

(1) w(1)
X
if only w(1) is received,
(7.31)

(2)
(2)
(2)

X w
if only w is received,
(7.32)

(12)
(1)
(2)
(1)
(2)

X
w ,w
if both (w , w ) is received.
(7.33)
5: Performance Analysis: We partition the sample space into three disjoint cases:
1. The source sequence is not typical:
X
/ A(n)
(Q)


(7.34)

(in which case we for sure cannot find a pair (w(1) , w(2) ) such that
(7.30) is satisfied!).
2. The source sequence is typical, but there exists no codeword triple
that is jointly typical with the source sequence:

X A(n)
(Q), @ w(1) , w(2) : (7.30) is satisfied. (7.35)

3. The source sequence is typical and there exists a codeword triple
that is jointly typical with the source sequence:

X A(n)
(Q), w(1) , w(2) : (7.30) is satisfied. (7.36)

Then we compute the achieved expected distortion of our system, averaged both over the source and over the random code generation. If
this expected distortion is within tolerance, then our randomly generated coding scheme works.
The details of this analysis are given in the following Section 7.4.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


7.4. Performance Analysis of Our Random Coding Scheme

7.4

161

Performance Analysis of Our Random Coding


Scheme

To compute the expected distortion in all possible situations of only a received


index W (1) , only a received index W (2) , or of received indices (W (1) , W (2) ), we
split the sample space into the three disjoint cases described in (7.34)(7.36)
and apply the Total Expectation Theorem. For i = 1, 2, 12 we get





(i) = E d(i) X, X
(i) Case 1 Pr(Case 1)
E d(i) X, X
|
{z
}


(i)

(i)

+E d
|
+E d

dmax

(i)

X, X
{z


Case 2 Pr(Case 2)
}

X, X


Case 3 Pr(Case 3)
| {z }

dmax

(i)

(7.37)

dmax Pr(Case 1) + dmax Pr(Case 2)





(i) Case 3 ,
+ E d(i) X, X

(7.38)

where we have upper-bounded the maximum average distortion of the first


two cases by the maximum distortion value dmax , and the probability of the
third case by 1.

7.4.1

Case 1

By TA-3 we can bound the probability of Case 1 as follows:



Pr(Case 1) = 1 Qn A(n) (Q) t (n, , X ).

7.4.2

(7.39)

Case 3

To bound the expected distortion in Case 3, we note that if


(1) , X
(2) , X
(12)
X, X




(1) , X, X
(2) , and X, X
(12) is
is jointly typical, then also each pair X, X
jointly typical. Hence, completely analogously to (5.120)(5.125) we have for
i = 1, 2, 12:
n
X


(i) = 1
(i)
d(i) X, X
d(i) Xk , X
k
n
k=1


1 X
(i) d(i) (a, b)
=
N a, b X, X
n

(7.40)
(7.41)

aX
bX (i)

(i)
PX,X
(i) (a, b) d (a, b)

(7.42)

aX
bX (i)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


162

The Multiple Description Problem




QX,X
(i) (a, b) +
|X | X (i)

X
aX
bX (i)

(i)
QX,X
(i) (a, b) d (a, b) +

aX
bX (i)
 (i)

=E d

7.4.3

(i)
X, X



+ dmax .

X
aX
bX (i)

d(i) (a, b)

(7.43)


dmax (7.44)

|X | X (i)
(7.45)

Case 2

The by far most complicated part of this derivation is to find a bound on the
probability of Case 2. The problem is that even if (w(1) , w(2) ) 6= (v (1) , v (2) ),

and

 (2) (2)  (12) (1) (2) 


(1) w(1) , X

X
w
,X
w ,w

 (2) (2)  (12) (1) (2) 


(1) v (1) , X

X
v
,X
v ,v

are not necessarily independent because we might have that w(1) 6= v (1) , but
w(2) = v (2) , or vice versa. So we need trickery.
(n)
Given some x A (Q), let F(w(1) , w(2) ) be the event that w(1) and w(2)
give a good choice of codewords:
 n (1) (1)  (2) (2)  (12) (1) (2) 

F w(1) , w(2) ,
X
w
,X
w
,X
w ,w
o

(7.46)
A(n)
QX,X (1) ,X (2) ,X (12) x .

Note that w(1) , w(2) are fixed! The randomness comes from the random generation of the codebook.
Now, noting that Case 2 can only occur if F does not occur for all possible
choices of w(1) , w(2) , we can write

\

Pr(Case 2) = Pr
F c w(1) , w(2)
(7.47)
w(1) ,w(2)

= Pr[K = 0]

(7.48)

with
K,

X
w(1) ,w(2)

n
o

I F w(1) , w(2) occurs

and I {} being the indicator function


(
1 if statement is true,
I {statement} ,
0 if statement is wrong.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(7.49)

(7.50)

7.4. Performance Analysis of Our Random Coding Scheme

163

Note that if K = 0, then


E[K]
(7.51)
2
(the other way round this is not true!). So, using this crude bound (7.51), we
can argue as follows:
|K E[K]| = | E[K]| = E[K]

Pr(Case 2) = Pr[K = 0]


E[K]
Pr |K E[K]|
2


2
E |K E[K]|

2

E[K]
2

4 Var[K]
(E[K])2
  

4 E K 2 (E[K])2
(E[K])2

(7.52)
(7.53)
(7.54)

(7.55)

(7.56)

where the inequality (7.54) follows from the Chebyshev Inequality (1.87). So
it remains to derive some bounds on E[K] and Var[K].
Firstly, E[K]:

o
n
X

(7.57)
E[K] = E
I F w(1) , w(2) occurs
w(1) ,w(2)

h
i

E IF w(1) , w(2) occurs

(7.58)

w(1) ,w(2)

X 
w(1) ,w(2)

X
w(1) ,w(2)





1 Pr F w(1) , w(2) + 0 Pr F c w(1) , w(2)



Pr F w(1) , w(2)
X

(n)
w(1) ,w(2) (
x(1) ,
x(2) ,
x(12) )A (|x)

(7.59)
(7.60)



(2)
(1) QnX (2) x
QnX (1) x

(1) (2) 
(12) x
,x

QnX (12) |X (1) ,X (2) x

(7.61)


(n)
(n)
Here we introduce the shorthand A (|x) for A
QX,X (1) ,X (2) ,X (12) x , i.e.,
for simplicity we will drop the exact statement of the joint distribution that
is the basis of the typical set.
Using twice TA-1b and once TB-1, we bound (7.61) as follows:
 

X
X

(1) + 
E[K]
exp n H X
(n)
w(1) ,w(2) (
x(1) ,
x(2) ,
x(12) )A (|x)

 

 

(1) (2) 

(2) +  exp n H X
(12) X
,X

exp n H X
+

(7.62)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


164

The Multiple Description Problem


=

X

A(n)
QX,X (1) ,X (2) ,X (12) x


w(1) ,w(2)


 
(1) (2) 


,X

(12) X
(2) + H X
(1) + H X
+
(7.63)
exp n H X

 
X


(1) , X
(2) , X
(12) X
exp n H X

w(1) ,w(2)


 
(1) (2) 


,X

(12) X
(2) + H X
(1) + H X
+
(7.64)
exp n H X
 

= exp n R(1) + R(2)
 

(1) 

(1) , X
(2) , X
(12) X H X
(1) H X
(2) X

exp n H X

(1) 
(1) (2) 

(2)

(2) X
,X

(12) X
H X
+H X
H X
(7.65)
 



(1) , X
(2) , X
(12) I X
(1) ; X
(2) .
= exp n R(1) + R(2) I X; X
(7.66)

Here, the inequality (7.64) follows from TB-2, and in the subsequent equality
(i)
(7.65) we used the fact that there are enR choices for w(i) . Note that we
have stopped keeping track of the different s and s, but simply summarize
them all together.
Hence, we get
 

(1) , X
(2) , X
(12)
(E[K])2 exp n 2R(1) + 2R(2) 2 I X; X


(1) ; X
(2) .
(7.67)
2I X
 
Secondly, we tackle E K 2 :
 
E K2

n
o X
n
o
X


= E
I F w(1) , w(2) occurs
I F v (1) , v (2) occurs
w(1) ,w(2)

v (1) ,v (2)

(7.68)
n
o
n
oi


E I F w(1) , w(2) occurs I F v (1) , v (2) occurs
h

w(1) ,w(2) v (1) ,v (2)

(7.69)
=

w(1) ,w(2) v (1) ,v (2)




Pr F w(1) , w(2) F v (1) , v (2)

Q{1,2} w(1) ,w(2) ,v (1) ,v (2)


with overlap Q




Pr F w(1) , w(2) F v (1) , v (2) ,

(7.70)
(7.71)

where in the last step we distinguish four cases of whether w(i) = v (i) or not.
These cases are described by the four possible subsets of {1, 2}: Q = {1, 2},

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


7.4. Performance Analysis of Our Random Coding Scheme

165

Q = {1}, Q = {2}, and Q = . For i Q we have w(i) = v (i) , whereas for


the remaining indices i Qc we have w(i) 6= v (i) . Lets go through these four
cases.
Case Q = {1, 2}: In this case we have w(1) = v (1) and w(2) = v (2) , and using
a derivation that is (apart from the s and s) identical to (7.60)(7.66) we
get

X


Pr F w(1) , w(2) F v (1) , v (2)
w(1) ,w(2) ,v (1) ,v (2)
with overlap {1,2}

X
w(1) ,w(2)

X
w(1) ,w(2)



Pr F w(1) , w(2)

(7.72)

 



(1) , X
(2) , X
(12) + I X
(1) ; X
(2) 
exp n I X; X
(7.73)

 



(1) , X
(2) , X
(12) I X
(1) ; X
(2) +  .
= exp n R(1) + R(2) I X; X

(7.74)

Case Q = {1}: In this case we have w(1) = v (1) , but w(2) 6= v (2) , i.e., we
have a partial overlap that is more difficult to handle properly. We use the
following lemma.
Lemma 7.4 (Chain Rule for Typical Sets). The event


(X, Y) A(n) (QX,Y )
is equivalent to the event



X A(n) (QX ) Y A(n) (QX,Y |X) .

(7.75)

(7.76)

Proof: This lemma follows directly from the definitions of the typical and
(n)
the conditionally typical sets: If (x, y) A (QX,Y ), then we know from
(n)
Lemma 4.6 that x A (QX ), and from Definition 4.10 we know that y
(n)
A (QX,Y |x). On the other hand, it directly follows from Definition 4.10
(n)
(n)
(n)
that if x A (QX ) and y A (QX,Y |x), then (x, y) A (QX,Y ).
(n)
Using the shorthands A (|x) for


A(n)
QX,X (1) ,X (2) ,X (12) x
or A(n) QX,X (1) x ,

(n)

respectively,1 and A



(1) for
x, x



(1) ,
A(n)
QX,X (1) ,X (2) ,X (12) x, x

1
It should be clear from the context, which distribution needs to be plugged in. We
will keep using this type of shorthands for remainder of these notes whenever the context is
clear.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


166

The Multiple Description Problem

we get



Pr F w(1) , w(2) F w(1) , v (2)
n
o
 (2) (2)  (12) (1) (2) 
(1) w(1) , X

= Pr
X
w
,X
w ,w
A(n)
(|x)

n
o



(1) w(1) , X
(2) v (2) , X
(12) w(1) , v (2) A(n) (|x)
X
(7.77)

n
o

(1) w(1) A(n) (|x)
= Pr X

n

o


(1) (w(1) )
(2) w(2) , X
(12) w(1) , w(2) A(n) x, X
X

n

o


(1) (w(1) )
(2) v (2) , X
(12) w(1) , v (2) A(n) x, X
(7.78)
X

= Pr(Ea Eb Ec )

(7.79)

= Pr(Ea ) Pr(Eb | Ea ) Pr(Ec | Ea , Eb ),

(7.80)

where in (7.78) we have used Lemma 7.4; (7.79) must be understood as the
definitions of the events Ea , Eb , and Ec ; and where (7.80) follows from the
chain rule.
Now note that conditionally on Ea , the events Eb and Ec are independent
of each other, i.e., in (7.80) we have the two terms Pr(Eb | Ea ) and Pr(Ec | Ea )
that are basically the same. Lets investigate them more closely. We have
Pr(Eb |Ea )
X
=
(n)

(1) A
x


 (1) (1) 

(1) Ea
Pr X
w
=x
(|x)

h


 (12) (1) (2) 
(2) w(2) , X

(1)
Pr X
A(n)
x, x
w ,w


i
(1) (1) 
(1) (7.81)
=x
X w
X

 (1) (1) 

(1) Ea
Pr X
w
=x
(n)

(1) A
x

(|x)

max

(n)

(1) A
x

(|x)

max

(n)

(1) A
x

{z
}
h

i
 (12) (1) (2) 
(1)
(2) w(2) , X

Pr X
A(n)

x,
x
w ,w

=1

(|x)

(7.82)
h

i




(2) w(2) , X
(12) w(1) , w(2) A(n) x, x
(1) (7.83)
Pr X

X

max

(n)
(1) A (|x)
x

max

(n)

(1) A
x

(n)
A

(|x)

(n)

(
x(2) ,
x(12) )A

(2)
QnX
(2) x

x,
x(1)

(1) (2) 
(12) x
,x

QnX
(7.84)
(12) |X
(1) ,X
(2) x


(1)
QX,X (1) ,X (2) ,X (12) x, x

 

 

(1) (2) 

(2)  exp n H X
(12) X
,X

exp n H X

(7.85)
 


(2) , X
(12) |X, X
(1) + 
exp n H X

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


7.4. Performance Analysis of Our Random Coding Scheme

167

 

(1) (2) 

(2) + H X
(12) X
,X

exp n H X

(7.86)
 






(2) , X
(12) X, X
(1) H X
(2) H X
(12) X
(1) , X
(2) +  .
= exp n H X
(7.87)

(1) by the maximum


Here, (7.82)(7.83) follows by replacing the average over x
(1) ; in (7.84) we apply our knowledge about how the codebook has
over x
been generated; and the subsequent two inequalities follow again in the usual
manner from TA-1, TB-1, and TB-2.
The same bound also applies to Pr(Ec | Ea , Eb ) = Pr(Ec | Ea ).
(1) (w(1) ) is generated completely independently
For Pr(Ea ) we note that X
of the source sequence X. Hence, we can apply TC:
 


(1)  .
Pr(Ea ) exp n I X; X
(7.88)
We plug (7.87) and (7.88) into (7.80) and get



Pr F w(1) , w(2) F w(1) , v (2)
 




(2)
(1) + 2 H X
(2) , X
(12) X, X
(1) 2 H X
exp n I X; X

(1) (2) 
(12) X
,X

+ 2H X

(7.89)
 




(1) 2 H X
(2) , X
(12) X, X
(1) 2 H X
(1) X
= exp n I X; X
(1) (2) 


(1) , X
(2)
,X

(12) X
(2) + 2 H X
+ 2H X
+ 2H X



(1) X 2 H X
(1) , X
(2) 
+ 2H X
(7.90)
 


(1) 2 H X
(1) , X
(2) , X
(12) X
= exp n I X; X


(1) , X
(2) , X
(12)
(2) + 2 H X
+ 2H X



(1) X 2 H X
(1) , X
(2) 
+ 2H X
(7.91)
 



(1) + 2 I X; X
(1) , X
(2) , X
(12) + 2 H X
(1) X
= exp n I X; X
(2) 


(1)
(1) 2 H X
(1) X

(7.92)
 + 2H X
2H X
 


(1) , X
(2) , X
(12) + 2 I X
(1) ; X
(2)
= exp n 2 I X; X


(1)  .
I X; X
(7.93)
Hence, we get the following bound:

X


Pr F w(1) , w(2) F w(1) , v (2)
w(1) ,w(2)
v (2) 6=w(2)


 


(1) , X
(2) , X
(12)
exp nR(1) + nR(2) + n R(2) 1 exp n 2 I X; X



(1) ; X
(2) I X; X
(1) 
+ 2I X
(7.94)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


168

The Multiple Description Problem


 


(1) , X
(2) , X
(12) 2 I X
(1) ; X
(2)
exp n R(1) + 2R(2) 2 I X; X


(1) +  .
+ I X; X
(7.95)

Case Q = {2}: This is the same as the case Q = {1}, but with exchanged
(1) and X
(2) :
roles of X

X


Pr F w(1) , w(2) F v (1) , w(2)
w(1) ,w(2)
v (1) 6=w(1)

 


(1) ; X
(2)
(1) , X
(2) , X
(12) 2 I X
exp n 2R(1) + R(2) 2 I X; X


(2) +  .
+ I X; X
(7.96)

Case Q = : In this case


bothw(1) =
6 v (1) and w(2) 6= v (2) , i.e., the
 we have
(1)
(2)
(1)
(2)
two events F w , w
and F v , v
are independent. Hence, we have

X


Pr F w(1) , w(2) F v (1) , v (2)
w(1) ,w(2) ,v (1) ,v (2)
with no overlap

w(1) ,w(2) v (1) 6=w(1)


v (2) 6=w(1)

w(1) ,w(2)

X
w(1) ,w(2)



Pr F w(1) , w(2)

X
v (1) 6=w(1)



Pr F v (1) , v (2)

w(1) ,w(2)

(7.97)

(7.98)

v (2) 6=w(1)



 X

Pr F w(1) , w(2)
Pr F v (1) , v (2)





Pr F w(1) , w(2) Pr F v (1) , v (2)

(7.99)

v (1) ,v (2)

2



Pr F w(1) , w(2)

= (E[K])2 .

(7.100)
(7.101)

Here, in (7.99) we increase the number of terms in the second sum; and in
(7.101) we use (7.60).
Hence, plugging these four bounds (7.74), (7.95), (7.96), and (7.101) into
(7.71), we get
 
E K 2 (E[K])2
 



(1) , X
(2) , X
(12) I X
(1) ; X
(2) + 
exp n R(1) + R(2) I X; X
 


(1) , X
(2) , X
(12) 2 I X
(1) ; X
(2)
+ exp n R(1) + 2R(2) 2 I X; X


(1) + 
+ I X; X
 


(1) , X
(2) , X
(12) 2 I X
(1) ; X
(2)
+ exp n 2R(1) + R(2) 2 I X; X


(2) +  .
+ I X; X
(7.102)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


7.4. Performance Analysis of Our Random Coding Scheme


This we now divide by (7.67)
 

(1) , X
(2) , X
(12)
(E[K])2 exp n 2R(1) + 2R(2) 2 I X; X


(1) ; X
(2)
2I X

169

(7.103)

in order to get from (7.56) the final bound


Pr(Case 2)
  

4 E K 2 (E[K])2

(7.104)
(E[K])2

 


(1) ; X
(2) +
(1) , X
(2) , X
(12) + I X
4 exp n R(1) R(2) + I X; X
 


(1) +
+ 4 exp n R(1) + I X; X
 


(2) +
+ 4 exp n R(2) + I X; X
(7.105)
, 2 .

(7.106)

So we see that Pr(Case 2) 2 where 2 0 for n as long as we have


chosen R(1) , R(2) , and QX (1) ,X (2) ,X (12) |X such that


(1) + ,

R(1) > I X; X
(7.107)


(2) + ,
R(2) > I X; X
(7.108)




R(1) + R(2) > I X; X


(1) , X
(2) , X
(12) + I X
(1) ; X
(2) + .
(7.109)

7.4.4

Analysis Put Together

Putting all three cases from Sections 7.4.1, 7.4.2, and 7.4.3 back into (7.38)
now shows that




(i) dmax t (n, , X ) + dmax 2 + E d(i) X, X
(i) + dmax
E d(i) X, X

(7.110)

for all i = 1, 2, 12. Therefore,






(i) E d(i) X, X
(i) + dmax ,
lim E d(i) X, X
n

(7.111)

and the constraints (7.14)(7.16) are satisfied if QX (1) ,X (2) ,X (12) |X is such that

 (1)

(1) D(1) dmax ,

E
d
X,
X
(7.112)

 (2)

(2)
(2)

E d X, X
D dmax ,
(7.113)



(12)
(12)
E d(12) X, X

D
dmax .
(7.114)
We have shown that any multiple description rate distortion quintuple is
achievable for which a distribution QX (1) ,X (2) ,X (12) |X can be found such that
(7.107)(7.109) and (7.112)(7.114) are satisfied. Note that since  is arbitrary,
we can omit the terms in (7.107)(7.109) and the terms dmax in (7.112)
(7.114).

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


170

The Multiple Description Problem

Theorem 7.5 (Achievable Multiple Description Rate Distortion


Region [EGC82]).
Consider a DMS Q with finite alphabet X and three average per-letter
distortion measures d(i) (, ) with corresponding reconstruction alphabets
X (i) , i = 1, 2, 12. Then the following region is an achievable multiple
description rate distortion region:
[

RMD (Q) ,
R Q QX (1) ,X (2) ,X (12) |X
(7.115)
QX
(1) ,X
(2) ,X
(12) |X

where

R QX,X (1) ,X (2) ,X (12)


, R(1) , R(2) , D(1) , D(2) , D(12) :

(1) ,
R(1) I X; X

(2) ,
R(2) I X; X



(1) ; X
(2) ,
(1) , X
(2) , X
(12) + I X
R(1) + R(2) I X; X


(1) ,
D(1) E d(1) X, X


(2) ,
D(2) E d(2) X, X



(12)
D(12) E d(12) X, X
.
(7.116)

We would like to point out that similarly to Chapter 5 we have again


implicitly derived a bound on the probability of any source sequence not being
reproduced within the required quality:
Pr(X not properly reproduced) dmax t (n, , X ) + dmax 2 .

(7.117)

This error probability decays to zero exponentially fast in n.

7.5

An Improvement to the Achievable Region

There is a nice trick how we can actually enlarge the achievable rate distortion
region derived in Sections 7.3 and 7.4. Assume for the moment that there
exists a third encoder whose index always safely arrives at the decoder, i.e.,
the third encoder sees a noisefree channel.2 We assign the rate R(0) to this
third encoder and repeat the derivation of our random coding scheme.
2

Of course, we do not have such a noisefree channel, but we can simulate it, see the
discussion in Section 7.5.3.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


7.5. An Improvement to the Achievable Region

7.5.1

171

New Random Coding Scheme

1: Setup: We choose a PMF QX (0) ,X (1) ,X (2) ,X (12) |X and then compute
QX (1) |X (0) , QX (2) |X (0) , and QX (12) |X (0) ,X (1) ,X (2) as marginal distributions
of Q QX (0) ,X (1) ,X (2) ,X (12) |X .
Then we fix some rates R(0) , R(1) , and R(2) and some blocklength n.
(0)

2: Codebook Design: We independently generate enR length-n codewords



(0) w(0) Qn (0) , w(0) = 1, . . . , enR(0) .
X
(7.118)

X
For every w(0) , we independently generate enR

(1)

(0) (0) 

(1) w(0) , w(1) Qn (1) (0) X
(w ) ,
X
|X

X
and enR

(2)

length-n codewords
(1)

w(1) = 1, . . . , enR ,
(7.119)

length-n codewords

(0) (0) 

(2) w(0) , w(2) Qn (2) (0) X
(w ) ,
X

X |X
(Note that this means that we have en(R
(0)
(2)
(2) .)
en(R +R ) codewords X

(0)

(2)

w(2) = 1, . . . , enR .

+R(1) )

(7.120)
(1) and
codewords X

Finally, for each triple (w(0) , w(1) , w(2) ), we generate one length-n codeword

(12) w(0) , w(1) , w(2)
X
(0) (0)
(w ), X
(1) (w(0) , w(1) ),
QnX (12) |X (0) ,X (1) ,X (2) X

(2) (w(0) , w(2) ) .
X
(7.121)
3: Encoder Design: For a given source sequence x, the encoders try to
find a triple (w(0) , w(1) , w(2) ) such that

 (1) (0) (1)  (2) (0) (2)  (12) (0) (1) (2) 

(0) w(0) , X

x, X
w ,w
,X
w ,w
,X
w ,w ,w

A(n)
QX,X (0) ,X (1) ,X (2) ,X (12) .
(7.122)

If they find several possible choices, they pick one. If they find none,
they choose w(0) = w(1) = w(2) = 1.
The first encoder (1) puts out w(1) , the second encoder (2) puts out
w(2) , and the third encoder (0) puts out w(0) .
4: Decoder Design: The decoder still consists of only three different decoding functions ( (1) , (2) , (12) ), because we know that w(0) does arrive

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


172

The Multiple Description Problem


for sure and because we are not interested in the case when only w(0) arrives (this still counts like nothing has arrived identically to the setup
described in Section 7.1). The decoder puts out

(1) w(0) , w(1)
X

(2) w(0) , w(2)
X

(12) w(0) , w(1) , w(2)
X

if (w(0) , w(1) ) is received,


if

(w(0) , w(2) )

is received,

if (w(0) , w(1) , w(2) ) is received.

(7.123)
(7.124)
(7.125)

5: Performance Analysis: We again distinguish three different cases:


1. The source sequence is not typical:
X
/ A(n)
(Q).


(7.126)

2. The source sequence is typical, but there exists no codeword quadruple that is jointly typical with the source sequence:
X A(n)
(Q),



@ w(0) , w(1) , w(2) : (7.122) is satisfied. (7.127)

3. The source sequence is typical and there exists a codeword quadruple that is jointly typical with the source sequence:
X A(n)
(Q),



w(0) , w(1) , w(2) : (7.122) is satisfied. (7.128)

The analysis of the first and third case is identical to the analysis shown
in Section 7.4. We only need to have a closer look at Pr(Case 2).

7.5.2

Analysis of Case 2

We define
 n (0) (0)  (1) (0) (1)  (2) (0) (2) 

F w(0) , w(1) , w(2) ,


X
w
,X
w ,w
,X
w ,w
,


(12) w(0) , w(1) , w(2)
X
o
A(n)
QX,X (0) ,X (1) ,X (2) ,X (12) x
(7.129)

and use the same trick based on the indicator RV K:

Pr(Case 2)

  

4 E K 2 (E[K])2

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(E[K])2

(7.130)

7.5. An Improvement to the Achievable Region

173

with
X

E[K] =

w(0) ,w(1) ,w(2)



Pr F w(0) , w(1) , w(2)

w(0) ,w(1) ,w(2) (


x(0) ,
x(1) ,
x(2) ,
x(12) )
(n)

QnX (2) |X (0)


X

w(0) ,w(1) ,w(2)

A

(7.131)

(0) 


(1) x
(0) QnX (1) |X (0) x
QnX (0) x

(|x)

(2) (0)

(0) (1) (2) 


(12) x
,x
,x

QnX (12) |X (0) ,X (1) ,X (2) x

(7.132)

(n)

A
QX,X (0) ,X (1) ,X (2) ,X (12) x

 
(0) 
(0) 

(2) X

(1) X

(0) + H X
+H X
exp n H X

(0) (1) (2) 
(12) X
,X
,X

+H X
+
 

exp n R(0) + R(1) + R(2)
 


(0) , X
(1) , X
(2) , X
(12) X
exp n H X
 
(0) 
(0) 

(0) + H X
(1) X

(2) X

exp n H X
+H X

(0) (1) (2) 
,X
,X

(12) X
+
+H X
 

(0) , X
(1) , X
(2) , X
(12)
= exp n R(0) + R(1) + R(2) I X; X

(0) 
(1) ; X
(2) X

I X
,
and with
 
E K2 =

(7.133)

(7.134)

(7.135)




Pr F w(0) , w(1) , w(2) F v (0) , v (1) , v (2) .

Q{0,1,2} w(0) ,w(1) ,w(2) ,


v (0) ,v (1) ,v (2)
with overlap Q

(7.136)

Before we start with the case distinction according to the different values

of Q, note that if w(0) 6= v (0) then the two events F w(0) , w(1) , w(2) and

F v (0) , v (1) , v (2) are disjoint because w(0) is a counter that was used in the
generation of all codewords simultaneously. Hence, the cases Q = {1, 2},
Q = {1}, Q = {2}, and Q = can be treated jointly.
Cases Q = {1, 2}, Q = {1}, Q = {2}, and Q = :

X


Pr F w(0) , w(1) , w(2) F v (0) , v (1) , v (2)
w(0) ,w(1) ,w(2) ,
v (0) ,v (1) ,v (2)
with no overlap in w(0)

X
w(0) ,w(1) ,w(2)



Pr F w(0) , w(1) , w(2)

X
v (0) 6=w(0) ,
v (1) ,v (2)

F v (0) , v (1) , v (2)

(7.137)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


174

The Multiple Description Problem


2



Pr F w(0) , w(1) , w(2)

w(0) ,w(1) ,w(2)

(7.138)

= (E[K])2 .

(7.139)

Case Q = {0, 1, 2}: Using a derivation that is (apart from the s and s)
identical to (7.131)(7.135) we get
X
w(0) ,w(1) ,w(2) ,
v (0) ,v (1) ,v (2)
with overlap {0,1,2}




Pr F w(0) , w(1) , w(2) F v (1) , v (1) , v (2)

w(0) ,w(1) ,w(2)



Pr F w(0) , w(1) , w(2)

(7.140)

 

(12)
(2) , X
(1) , X
(0) , X
exp n R(0) + R(1) + R(2) I X; X

(0) 
(1) ; X
(2) X

I X
+ .
(7.141)

Case Q = {0, 1}: Again using Lemma 7.4, we get the following (using a
sloppy notation where we omit the arguments w(i) ):
X
w(0) ,w(1) ,w(2) ,
v (0) ,v (1) ,v (2)
with overlap {0,1}




(0)
(1)
(2)
(1) (1) (2)
Pr F w , w , w
F v ,v ,v

X
w(0) ,w(1) ,w(2) ,
v (2) 6=w(2)

h
i

(0) , X
(1) A(n) (|x)
Pr X


 h
(0) (1) 

,x
(2) , X
(12) A(n) X
,X
Pr X

i2

(0) (1)
(n)

X ,X
A (|x)
(7.142)

 

(0) , X
(1)
exp n R(0) + R(1) + 2R(2) I X; X
(0) (1) 
(0) 
(2) , X
(12) X
,X
,X 2H X
(2) X

+ 2H X

(0) (1) (2) 
(12) X
,X
,X

2H X
+
 

(0) , X
(1)
= exp n R(0) + R(1) + 2R(2) + I X; X
(0) , X
(1) , X
(2) , X
(12)
2 I X; X

(0) 
(1) ; X
(2) X

2I X
+ .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(7.143)


(7.144)

7.5. An Improvement to the Achievable Region

175

Case Q = {0, 2}: This is identical to Case Q = {0, 1} with exchanged roles
(1) and X
(2) :
of X

X


Pr F w(0) , w(1) , w(2) F v (1) , v (1) , v (2)
w(0) ,w(1) ,w(2) ,
v (0) ,v (1) ,v (2)
with overlap {0,2}

 

(0) , X
(2)
exp n R(0) + 2R(1) + R(2) + I X; X

(0) , X
(1) , X
(2) , X
(12)
2 I X; X

(0) 
(1) ; X
(2) X

2I X
+ .

Case Q = {0}:
X
w(0) ,w(1) ,w(2) ,
v (0) ,v (1) ,v (2)
with overlap {0}

(7.145)

Finally:



Pr F w(0) , w(1) , w(2) F v (1) , v (1) , v (2)

X
w(0) ,w(1) ,w(2) ,
v (1) 6=w(1) ,v (2) 6=w(2)

h
i
(0) A(n) (|x)
Pr X


h
(0) 

(1) , X
(2) , X
(12) A(n) X
,x
Pr X

i2
(0)
(n)

X A (|x)
 

(0)
exp n R(0) + 2R(1) + 2R(2) I X; X
(0) 
(1) , X
(2) , X
(12) X
,X
+ 2H X
(0) 
(0) 

(2) X

(1) X
2H X
2H X

(0) (1) (2) 
(12) X
,X
,X

2H X
+
 

(0)
= exp n R(0) + 2R(1) + 2R(2) + I X; X

(0) , X
(1) , X
(2) , X
(12)
2 I X; X

(0) 
(2) X
(1) ; X

2I X
+ .

(7.146)

(7.147)

(7.148)

Plugging all these cases back together now finally yields:


Pr(Case 2)
  

4 E K 2 (E[K])2
(7.149)

(E[K])2
 

(0) , X
(1) , X
(2) , X
(12)
4 exp n R(0) R(1) R(2) + I X; X

(0) 
(1) ; X
(2) X

+I X
+

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


176

The Multiple Description Problem


 


(0) , X
(1) +
+ 4 exp n R(0) R(1) + I X; X
 


(0) , X
(2) +
+ 4 exp n R(0) R(2) + I X; X
 


(0) +
+ 4 exp n R(0) + I X; X
, 3 ,

(7.151)

where 3 tends to zero with n as long as

(0) + ,

R(0) > I X; X

(0) , X
(1) + ,

R(0) + R(1) > I X; X


(0) , X
(2) + ,
R(0) + R(2) > I X; X

R(0) + R(1) + R(2) > I X; X


(0) , X
(1) , X
(2) , X
(12)

(0) 

(1) ; X
(2) X

+ .
+I X

7.5.3

(7.150)

(7.152)
(7.153)
(7.154)
(7.155)

Adapting Setup to Match Situation of Section 7.1

Now, according to Figure 7.1 we do not actually have access to such a third
guaranteed channel. However, we can simulate it by adding the nR(0) nats to
both of the two other channels. Then in all interesting three cases (only w(1)
arrives, only w(2) arrives, and (w(1) , w(2) ) arrives) we do have these nats and
they act like they had come over the virtual third channel! This now means
that we adapt our rates R(1) and R(2) :
(1) , R(1) + R(0) ,
R
(2) , R(2) + R(0) .
R

(7.156)
(7.157)

(0) , then
> I X; X

Plugging this into (7.152)(7.155) and using that R(0)


yields


(1) > I X; X
(0) , X
(1) ,

R
(7.158)

(2) > I X; X
(0) , X
(2) ,

R
(7.159)




(1) (2) (0)
(1)
(2)
(0)
(0) (1) (2) (12)

R + R > R + I X; X , X , X , X
+ I X ;X X




(0) + I X; X
(0) , X
(1) , X
(2) , X
(12)

> I X; X

(0) 

(1) ; X
(2) X
.
(7.160)
+I X

(0) as U and call it an auxiliary random variable. In


We usually rename X
contrast to the other RVs that are used in order to describe a rate region,
U does not have a direct physical explanation, but simply shows up from
the proof. In certain cases (like in the example here), the auxiliary RV can
be understood as part of the internal system or similar (here it describes a
common message that is transmitted over all channels), but sometimes it does
not have any nice explanation, but simply is there. We will see some more
examples later on.
Note that not only the PMF of the auxiliary random variable can be chosen
freely, but even the alphabet U is not given and can be freely selected! This is

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


7.6. Convexity

177

a bit scary, and in particular, it might prevent us from a numerical search for
an optimal choice of the PMF QU,X (1) ,X (2) ,X (12) |X . Luckily, we will be able to
prove that without loss of generality the alphabet size |U| can be restricted to
a finite value (see Lemma 7.7 ahead).
We also point out that we could have guessed the rate region (7.158)
(7.160) directly from (7.107)(7.109). Simply rewrite (7.158)(7.160) as follows:

(0) 

(1) > I X; X
(0) + I X; X
(1) X

R
,
(7.161)





(2)
(0)
(2)
(0)

> I X; X

R
+ I X; X
,
(7.162)
(0) 

(1)
(2)
(0)
(1)
(2)
(12)
+R
> 2 I X; X

,X
,X

R
+ I X; X

(1) ; X
(2) X
(0) .
+I X
(7.163)
(0) (since w(0) is known in all
We note that all terms are conditioned on X

(0) for each rate to
cases), but we have to add the additional term I X; X
make sure that the description w(0) is accurate enough.

7.6

Convexity

It is not difficult to show that the multiple description rate distortion region is
convex. The argument goes as follows. Assume that two multiple description
rate distortion quintuples


(1) , R
(2) , D
(1) , D
(2) , D
(12)
R(1) , R(2) , D(1) , D(2) , D(12) and R
are achievable. Fix some 0 1 and some blocklength n. Let n1 , bnc
and n2 , n n1 . Now use the first n1 symbols from the source sequence
and encode them with a coding scheme achieving the first quintuple and then
take the remaining n2 symbols and encode them with a coding scheme that
achieves the second quintuple. Note that we can choose n large enough such
that both n1 and n2 become big enough for this to be possible.
On average, we have now created a new coding scheme with rates
(1)
(1) ,
R(1)
+ (1 )R
= R
(2)
R(2) = R(2) + (1 )R

(7.164)
(7.165)

that achieves the following distortions


(1)
(1) ,
D(1)
+ (1 )D
= D
(2) ,
D(2) = D(2) + (1 )D

D(12)

(12)

= D

(12) .
+ (1 )D

(7.166)
(7.167)
(7.168)

This proves that any point in the convex hull of an achievable rate distortion region is achievable, too.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


178

7.7

The Multiple Description Problem

Main Result

We are now finally ready to summarize all results that we have derived so far
in this chapter.
Theorem 7.6 (Improved Achievable Multiple Description Rate
Distortion Region [VKG03]).
Consider a DMS Q with finite alphabet X and three average per-letter
distortion measures d(i) (, ) with corresponding reconstruction alphabets
X (i) , i = 1, 2, 12. Let U be an auxiliary RV on some finite alphabet
U. Then the following region is an achievable multiple description rate
distortion region:

[


RMD (Q) , convex hull


R Q QU,X (1) ,X (2) ,X (12) |X
QU,X
(1) ,X
(2) ,X
(12) |X

(7.169)
where

R QX,U,X (1) ,X (2) ,X (12)


, R(1) , R(2) , D(1) , D(2) , D(12) :

(1) , U ,
R(1) I X; X

(2) , U ,
R(2) I X; X

R(1) + R(2) I(X; U )

D(1)
D(2)

(1) , X
(2) , X
(12) , U
+ I X; X


(1) ; X
(2) U ,
+I X
 (1)

(1) ,
E d X, X


(2) ,
E d(2) X, X



(12)
D(12) E d(12) X, X


.

(7.170)

Lemma 7.7. Without loss of optimality we can restrict the size of U in Theorem 7.6 to
|U| |X | + 6.

(7.171)

Proof: The proof of this lemma is based on convexity and Caratheodorys


Theorem (Theorem 1.20) in Section 1.5.
Consider a given choice of U and QU,X (1) ,X (2) ,X (12) |X , and recall the alternative way of writing some of the terms in (7.170) as shown in (7.161)(7.163):

(1) U ,
R(1) I(X; U ) + I X; X
(7.172)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


7.7. Main Result

179


(2) U ,
R(2) I(X; U ) + I X; X
(7.173)


(1)
(2)
(1) (2) (12)
(1) (2)

R + R 2 I(X; U ) + I X; X , X , X
U + I X ;X U .
(7.174)

Now note that


I(X; U ) = H(X) H(X|U )
X

=
QU (u) H(X) H(X|U = u) ,

(7.175)
(7.176)

uU

 X


(i) U =
(i) U = u ,
I X; X
QU (u) I X; X

i = 1, 2,

(7.177)

uU

 X


(1) , X
(2) , X
(12) U =
(1) , X
(2) , X
(12) U = u , (7.178)
I X; X
QU (u) I X; X
uU

(1)

I X

(2)

;X

 X


(1) ; X
(2) U = u ,
U =
QU (u) I X

(7.179)

uU

E d(i) X, X



(i)

X
uU

QU (u)

xX x
(i) X (i)


(i) u
QX,X (i) |U x, x


d(i) x, x
(i) ,

i = 1, 2, 12, (7.180)

where QU is the marginal PMF coming from the chosen QU,X (1) ,X (2) ,X (12) |X
and from the given QX . Furthermore, note that
X
QX (x) =
QU (u)QX|U (x|u), x X .
(7.181)
uU

For simplicity of notation and without loss of generality, assume that X =


{1, 2, . . . , |X |}.
Now we define the vector v:


(1) U ,
v , I(X; U ) + I X; X

(2) U ,
I(X; U ) + I X; X


(1) ; X
(2) U ,
(1) , X
(2) , X
(12) U + I X
2 I(X; U ) + I X; X


(1) ,
E d(1) X, X


(2) ,
E d(2) X, X


(12) ,
E d(12) X, X

QX (1), . . . , QX (|X | 1) ,
(7.182)
and for each u U the vector vu :



(1) U = u ,
vu , H(X) H(X|U = u) + I X; X


(2) U = u ,
H(X) H(X|U = u) + I X; X

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


180

The Multiple Description Problem




(1) , X
(2) , X
(12) U = u
2 H(X) 2 H(X|U = u) + I X; X


(1) ; X
(2) U = u ,
+I X
X X


(1) ,
QX,X (1) |U x, x
(1) u d(1) x, x

xX x
(1) X (1)

xX x
(2) X (2)



(2) ,
QX,X (2) |U x, x
(2) u d(2) x, x

xX x
(12) X (12)



(12) ,
QX,X (12) |U x, x
(12) u d(12) x, x

QX|U (1|u), . . . , QX|U



|X | 1 u ,

(7.183)

such that by (7.175)(7.181)


v=

QU (u)vu .

(7.184)

uU

We see that v is a convex combination of |U| vectors vu . From Caratheodorys Theorem (Theorem 1.20) it now follows that we can reduce the size
of U to at most |X | + 6 values (note that v contains |X | + 5 components!)
without changing v, i.e., without changing the values of all right-hand side
terms in (7.170) and without changing the (given!) value of QX (x) for any
x
P {1, . . . , |X | 1} (and therefore also for x = |X | recall that since
x QX (x) = 1, the value of QX (|X |) is determined by the other values). This
proves the claim.
Note that we need to incorporate QX into v because we do not choose QU
directly, but QU |X . Hence, when changing the alphabet U and QU , we have
to make sure that QX still remains as given by the source.
Also note that the bound (7.171) can actually be reduced by 1 if we use
Theorem 1.22 instead of Theorem 1.20.
Example 7.8. Lets continue with Example 7.3 and apply Theorem 7.6 to
the situation of the BSS with the Hamming distortion measure. We choose
QU,X (1) ,X (2) ,X (12) |X such that
U = constant,
(12)

= X,

(7.185)
(7.186)

(1) and X
(2) have the joint conditional PMF given in Table 7.2.
and such that X
Note that from this table we can see that
(1) X
(2) = X
X

(7.187)

(1)
(2) ,
X
X

(7.188)

and that (see Table 7.3)

(1) , X
(2)
X



1
Bernoulli .
2

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(7.189)

7.7. Main Result

181

Table 7.2: Our choice of QX (1) ,X (2) |X .


(2)
X
QX (1) ,X (2) |X (, |0)
0

(1)
X

1
QX (2) |X (|0)


32 2
21

21
0

2 2
21

QX (1) |X (|0)

2 2

21

(2)
X
QX (1) ,X (2) |X (, |1)

QX (1) |X (|1)

QX (2) |X (|1)

(1)
X

Table 7.3: The PMF QX (1) ,X (2) derived from the PMF given in Table 7.2.
(2)
X
QX (1) ,X (2) (, )
0
(1)
X
1
QX (2) ()

0
3
2
2

2 1

2
2

2
1
2

2 1

2
2
1
2

2
2

QX (1) ()

2
1
2

2
2

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


182

The Multiple Description Problem


We now compute



(1) X
(1) H X
(1) , U = H X
I X; X
!

 1
2
1
= Hb
Hb 2 1 0
2
2
2
!


1
2
Hb 2 1
= Hb
2
2
1
0.383 bits bits,
2
!



1
2
(2) , U = Hb
Hb 2 1
I X; X
2
2
1
0.383 bits bits,
2


(1) (2) (12)

(1) ; X
(2) U
I(X; U ) + I X; X , X , X
,U + I X


(1) , X
(2) , X
(12) + I X
(1) ; X
(2)
= I X; X
(1) (2) (12) 
,X
,X

+0
= H(X) H X X

(7.190)
(7.191)
(7.192)
(7.193)
(7.194)
(7.195)

(7.196)
(7.197)

= H(X)

= 1 bit

(7.198)

1 1
+ bits
2 2

(7.199)

and

2 1
E d
,
2
2
 (2)



1
2
(2) = Pr X 6= X
(2) =
,
E d X, X
2
2




(12) = Pr X 6= X
(12) = 0.
E d(12) X, X


(1)

(1)
X, X





(1) =
= Pr X 6= X

(7.200)
(7.201)
(7.202)

This shows that as claimed in (7.28)



R(1) , R(2) , D(1) , D(2) , D(12) =

1
1
bits, bits,
2
2

21
,
2

21
,0
2

is achievable. In [BZ83] it has been proven that D(1) = D(2) =


are the best possible values in this case if we require D(12) = 0.

21
2

(7.203)

0.207

The problem of multiple description was first described by L. H. Ozarow


in 1979 and then formally stated by A. D. Wyner. First contributions can
be found in [Wit80], [WWZ80], [Oza80], and [WW81]. El Gamal and Cover
published their achievable region for two channels given in Theorem 7.5 in
1982 [EGC82]. It was later on proven that in certain cases this region was
optimal for the Gaussian source and the mean squared error distortion. For
some values it is also optimal for the BSS and the Hamming distortion (as
shown in Examples 7.3 and 7.8), but not always [ZB87].

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


7.7. Main Result

183

The achievable region of Theorem 7.6 has been published in [VKG03] as a


special case of the more general setup of L channels. Many other publications
exist specializing to certain special cases, for some of which the exact value
of the multiple description rate distortion region could be established. A very
recent publication [VAR11] further enlarges the region of Theorem 7.6 by allowing to have an individual common codeword for each subset of descriptions.
This is shown to strictly improve the region for more than two descriptions
L > 2.
An important special case of the multiple description problem, for which
the exact rate distortion region is known, is the situation of successive refinement. In successive refinement we only want to consider D(1) and D(12) ,
but not D(2) . (This can be achieved by setting D(2) = so that the corresponding constraint is satisfied trivially.) Moreover, we also restrict ourselves
to only one reproduction alphabet and one distortion measure. The idea is
that W (1) will yield a first description of the source, while W (2) will provide
an improvement over the first description, i.e., W (2) refines the quality of the
description. (We are not interested in the case when only W (2) arrives.) If
D() is the distortion rate function of the source and if

D(1) D R(1) ,
(7.204)

D(12) D R(1) + R(2) ,
(7.205)
then the source is said to be successively refinable.
In general, the multiple description rate distortion region is still not known.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Chapter 8

Rate Distortion with


Side-Information: WynerZiv
Problem
8.1

Introduction

We go back to the rate distortion problem described in Chapter 5, however we


add a new twist: We assume that the decoder (but not the encoder) has some
side-information about the source sequence. Then the encoder should be able
to compress the data more strongly because the decoder has the additional
help of the side-information to recover it. The crux, however, is that the
encoder has no idea of what the realization of this side-information currently
is.
So, we have a system as shown in Figure 8.1.
Dest.

1, . . . , X
n
X

Encoder

X 1 , . . . , Xn
QX,Y

Decoder
Y1 , . . . , Yn

Figure 8.1: The WynerZiv problem: a rate distortion system where the decoder has access to side-information.
At first thought, this might seem a strange problem. Why should the
decoder have side-information, but the encoder not? However, there exist
some important practical situations where we have exactly this constellation.
For example:
In a wireless relay channel, the relay passes on his noisy observation Y
about the message X to the destination. The receiver then has access
to both X (directly from the transmitter) and Y (from the relay), while
the encoder has no idea about Y at the time of transmission.

185

c Stefan M. Moser, vers. 2.5


186

Rate Distortion with Side-Information (WynerZiv)


In a sensor network, several sensors observe the same random experiment
and then transmit their measurements to a central receiver. The sensors
do not interact and therefore have no idea about the measurements of
the others. At the receiver, however, the information coming from the
other sensors can be regarded as side-information to the source of one
particular sensor.

The formal setup of the problem is basically the same as described in


Section 5.2, with the two main differences that the source creates a doublesequence (X, Y) of IID pairs1 (Xk , Yk ) according to some given joint distribution QX,Y and that the decoding function now is a mapping


n : 1, 2, . . . , enR Y n X n .

(8.1)

Hence, we have the following definition.



Definition 8.1. An enR , n WynerZiv coding scheme consists of
a source alphabet X and a reproduction alphabet X ,


an encoding function n : X n 1, 2, . . . , enR ,
a decoding function n according to (8.1), and
a distortion measure d(, ),
n

1X
) =
d(x, x
d(xk , x
k ).
n

(8.2)

k=1

Definition 8.2. A WynerZiv rate distortion pair (R, D) is said to be achievable for a DMS QX,Y and a distortion measure d(, ) if there exists a sequence
of (enR , n) WynerZiv rate distortion coding schemes (n , n ) with


lim E d X, n n (X), Y
D.

(8.3)

The WynerZiv rate distortion region for a DMS QX,Y and a distortion measure d(, ) is the closure of the set of all achievable WynerZiv rate distortion
pairs (R, D).
Definition 8.3. The WynerZiv rate distortion function RWZ (D) is the infimum of rates R such that (R, D) is in the WynerZiv rate distortion region
for a given distortion D.
1

Note that X and Y are dependent, but that (Xk , Yk ) is IID over time k.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


8.2. A Random Coding Scheme: Binning

8.2

PSfrag

187

A Random Coding Scheme: Binning

We start with the derivation of an achievable rate distortion region. The


general idea is as follows. Recall that in Chapter 5 the encoder chooses a
codeword for the given source sequence and then transmits the index of this
codeword to the decoder. Since now the decoder has the additional information of the side-information, we will group the codewords into so called bins,
see Figure 8.2.
codeword 1

codeword enR

bin 1

bin 2

bin 3

bin (enR 1)

bin enR

Figure 8.2: The idea of binning: The codewords are grouped into bins. Instead
of transmitting the codeword index, the decoder is only informed
about the bin-number.
The encoder will not transmit the index of the codeword, but instead only
the bin-number. The decoder should then be able to figure out the correct
codeword within the bin with the help of the side-information. This idea is
called binning.
1: Setup: We need an auxiliary random variable U with some alphabet U.
This RV basically represents the codeword at the encoder. So, we choose
U and a PMF QU |X , and then compute QU as marginal distribution of
QX QU |X .

We further choose a function f : U Y X that will be used in the


decoder.
Then we fix some rates R and R0 , and some blocklength n.
0

2: Codebook Design: We create enR enR length-n codewords U(w, v),


0
0
w = 1, . . . , enR and v = 1, . . . , enR , by choosing all nen(R+R ) components
Uk (w, v) independently at random according to QU (). Here w describes
the bin and v describes the index of the codeword in this bin. Hence, we
0
have enR bins and enR codewords per bin.2
3: Encoder Design: For a given source sequence x, the encoder tries to
find a pair (w, v) such that

x, U(w, v) A(n) (QX,U ).
(8.4)
2
Note that in the literature one can also find a different setup where we first generate all
codewords and then randomly distribute them among the bins using a uniform distribution.
In this case the bins contain only approximately the same number of codewords.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


188

Rate Distortion with Side-Information (WynerZiv)


If it finds several possible choices of (w, v), it picks one. If it finds none,
it chooses (w, v) = (1, 1).
The encoder then puts out the bin index w.

4: Decoder Design: For a given bin number w and a received sideinformation sequence y, the decoder tries to find an index v such that

y, U(w, v) A(n)
(QY,U ).
(8.5)

If there are several choices for v, the decoder simply picks one. If there
is no such v, it sets v , 1. Then the decoder puts out

= f n U(w, v), y
X
(8.6)
is
where we use the notation f n to denote that each component of X
created using the function f (, ), i.e.,

k = f Uk (w, v), yk , k = 1, . . . , n.
X
(8.7)
5: Performance Analysis: For the analysis we distinguish five different
cases that are not necessarily disjoint, but that together cover the entire
sample space:
1. The source and side-information sequences are not jointly typical:
(X, Y)
/ A(n) (QX,Y ).

(8.8)

2. The source and side-information sequences are jointly typical


(X, Y) A(n) (QX,Y ),

(8.9)

but the encoder cannot find a pair (w, v) such that (8.4) is satisfied.
3. The source and side-information sequences are jointly typical and
there exists a good choice (w, v) at the encoder,
(X, Y) A(n) (QX,Y ),

X, U(w, v) A(n) (QX,U ),

(8.10)
(8.11)

but there exists a v 6= v with



Y, U(w, v) A(n) (QY,U ).

(8.12)

4. The source and side-information sequences are jointly typical and


there exists a good choice (w, v) at the encoder,
(X, Y) A(n) (QX,Y ),

X, U(w, v) A(n) (QX,U ),

(8.13)


(QY,U ).
Y, U(w, v)
/ A(n)


(8.15)

(8.14)

but

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


8.2. A Random Coding Scheme: Binning

189

5. Finally, the case where things work out:



X, Y, U(w, v) A(n) (QX,Y,U ).

(8.16)

To compute the expected distortion of our system, we apply the Union


Bound on total expectation (Theorem 1.14) using these five cases.
The details of this analysis are given in the following Sections 8.2.18.2.6.

8.2.1

Case 1

This case is standard: By TA-3b,




Pr(Case 1) = Pr (X, Y)
/ A(n)
(QX,Y ) t (n, , X Y).


8.2.2

(8.17)

Case 2

This is very similar to the Case 2 of the analysis of the rate distortion theorem,
see (5.108)(5.117). We have
Pr(Case 2)




(n)
= Pr (X, Y) A(n)
(Q
)

@
(w,
v)
:
X,
U(w,
v)

A
(Q
)
X,Y
X,U



(8.18)



= Pr (X, Y) A(n)
(QX,Y )


h
i


Pr @ (w, v) : X, U(w, v) A(n) (QX,U ) (X, Y) A(n)
(Q
)
X,Y


(8.19)



= Pr (X, Y) A(n)
(QX,Y )

|
{z
}
enR

0
enR

Y Y
w=1 v=1





(QX )
Pr X, U(w, v)
/ A(n)
(QX,U ) X A(n)



(8.20)

nR
enR eY
Y

w=1 v=1
nR

w=1 v=1

(8.21)




Pr U(w, v)
/ A(n) (QX,U |X) X A(n) (QX )

(8.22)

nR0

e
eY
Y

nR





Pr X, U(w, v)
/ A(n)
(QX,U ) X A(n)
(QX )



nR0

e
eY 
Y
w=1 v=1
nR

<

(8.23)

nR0

e
eY 
Y
w=1 v=1




1 Pr U(w, v) A(n)
(QX,U |X) X A(n) (QX )

1 en(I(X;U )+)

n(I(X;U )+)

= 1e

en(R+R0 )

(8.24)
(8.25)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


190

Rate Distortion with Side-Information (WynerZiv)



0
exp en(R+R ) en(I(X;U )+)


0
= exp en(R+R I(X;U )) .

(8.26)
(8.27)

Here, in (8.22) we use the definition of conditionally typical sets (Definition 4.10); (8.24) follows from TC; and the inequality (8.26) is due to the
Exponentiated IT Inequality (Corollary 1.10).
So as long as
R + R0 > I(X; U ) + ,

(8.28)

the probability of Case 2 decays double-exponentially fast to zero.

8.2.3

Case 3

We have




Pr(Case 3) = Pr (X, Y) A(n)
(QX,Y ) X, U(w, v) A(n) (QX,U )




(8.29)
v 6= v : Y, U(w, v) A(n)
(Q
)
Y,U


[ 


Pr
Y, U(w, v) A(n)
(QY,U )
(8.30)


v,
v 6=v

v,
v 6=v




Pr Y, U(w, v) A(n)
(QY,U )


(8.31)

en(I(Y ;U ))

(8.32)

v,
v 6=v


 0
= enR 1 en(I(Y ;U ))
0

en(R I(Y ;U )+) .

(8.33)
(8.34)

Here, in (8.30) we enlarge the set; in (8.31) we apply the Union Bound; and
(8.32) follows from TC.
So as long as
R0 < I(Y ; U ) 

(8.35)

the probability of Case 3 decays exponentially fast to zero.

8.2.4

Case 4

Note that by the definition of jointly typical sets, if (X, Y) is jointly typical
and (X, U) is jointly typical, but (Y, U) is not, then (X, Y, U) cannot be

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


8.2. A Random Coding Scheme: Binning

191

jointly typical either.3 Hence,


Pr(Case 4)




= Pr (X, Y) A(n)
(QX,Y ) X, U(w, v) A(n) (QX,U )




Y, U(w, v)
/ A(n) (QY,U )




Pr (X, Y) A(n)
(QX,Y ) X, U(w, v) A(n) (QX,U )




Y, X, U(w, v)
/ A(n) (QY,X,U )



Pr
X, U(w, v) A(n) (QX,U )



Y, X, U(w, v)
/ A(n) (QY,X,U )



= Pr X, U(w, v) A(n)
(QX,U )

|
{z
}

(8.36)

(8.37)

(8.38)






Pr Y, X, U(w, v)
/ A(n) (QY,X,U ) X, U(w, v) A(n) (QX,U )

(8.39)





(n)
(n)
Pr Y, X, U(w, v)
/ A (QY,X,U ) X, U(w, v) A (QX,U ) (8.40)





= 1 Pr Y, X, U(w, v) A(n) (QY,X,U ) X, U(w, v) A(n) (QX,U )
X

=1

(x,u)
(QX,U )

(8.41)




(n)

Pr X = x, U(w, v) = u X, U(w, v) A (QX,U )

(n)

A

=1




Pr (Y, x, u) A(n) (QY,X,U ) X = x, U(w, v) = u
(8.42)
X




(n)
Pr X = x, U(w, v) = u X, U(w, v) A (QX,U )

(x,u)
(QX,U )

(n)

A




Pr Y A(n) (QY,X,U |x, u) X = x, U = u ,

(8.43)

where the first inequality (8.37) follows because we enlarge the event (the
(n)
(n)
event (X, Y, U)
/ A
follows from the event (Y, U)
/ A ); and where
in (8.38) we enlarge the event once more by dropping one intersecting event.
So we see that we need a lower bound on



Pr Y A(n) (QY,X,U |x, u) X = x, U = u

= Qn
A(n)
(QY,X,U |x, u) x
(8.44)

Y |X

(n)

where we know that (x, u) A


QnY |X,U

(QX,U ). It is interesting to note that




A(n)
(QY,X,U |x, u) x, u 1 t (n, , U X Y) (8.45)


by TB-3, but that we cannot apply this here because (U, X, Y) is not generated according to QU,X,Y , but U is independent of (X, Y). However, we do
3

However, note that the opposite direction of this argument does not necessarily hold!

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


192

Rate Distortion with Side-Information (WynerZiv)

not need to worry about how U was generated, because in (8.44) (x, u) are
already given as being jointly typical. Hence, we expect that the lower bound
in (8.45) in principle also holds for (8.44).
To prove this, we need to adapt the proof of TB-3. Similarly to (4.96) we
define
n
Fx,u , PY |X,U Pn (Y|X U) :
o
y
/ A(n)
(Q
|x,
u)
with
P
=
P
(8.46)
U,X,Y
y|x,u
Y |X,U

to be the set of all conditional types of all conditionally nontypical sequences
and argue identically to (4.96)(4.105) to show4 that for any PY |X,U Fx,u ,


DPx,u PY |X,U QY |X
X


,
Px,u (
a, u
) D PY |X,U (|
a, u
) QY |X (|
a)
(8.47)

(
a,
u)X U
s.t. Px,u (
a,
u)>0
2

2|U|2 |X |2 |Y|2

log e.

(8.48)

Using an adapted version of CTT2 where we make use of the fact that
QY |X (yk |xk ) = QY |X,U (yk |xk , uk )

(8.49)

because of the implicit Markov structure U (


X (
Y:
QnY |X (y|x) = en(HPx,u (Py|x,u )+DPx,u (Py|x,u k QY |X ))

(8.50)

we get the following version of CTT4:


1
en DPx,u (Py|x,u k QY |X )
(n + 1)|X ||Y||U |

Qn
T n (PY |X,U |x, u) x en DPx,u (Py|x,u k QY |X ) .
Y |X

We use this now to derive the following (similar to (4.116)(4.123)):



QnY |X A(n)
(QY,X,U |x, u) x


= 1 QnY |X T n (Fx,u |x, u) x
X

=1
QnY |X T n (PY |X,U |x, u) x
PY |X,U Fx,u

1
1

en DPx,u (PY |X,U k QY |X )

PY |X,U Fx,u

2
2|U |2 |X |2 |Y|2

log e

(8.51)

(8.52)
(8.53)
(8.54)
(8.55)

PY |X,U Fx,u

Basically we use the fact that PY |X,U deviates notably from QY |X because the sequence
is nontypical, and we then apply the Pinsker Inequality.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


8.2. A Random Coding Scheme: Binning

= 1 |Fx,u | e

2
2|U |2 |X |2 |Y|2

193
log e

2
n
2|U |2 |X |2 |Y|2

1 |Pn (Y|X U)| e

2
n
2|U |2 |X |2 |Y|2

1 (n + 1)|U ||X ||Y| e

(8.56)
log e
log e

= 1 t (n, , U X Y).

(8.57)
(8.58)
(8.59)

Plugged into (8.43), this then finally yields


Pr(Case 4) t (n, , U X Y).

(8.60)

Remark 8.4. Note that we have proven here the following statement: Consider a joint distribution QX,Y,Z forming a Markov chain X (
Y (
Z.
(n)
Let two sequences (x, y) A (QX,Y ) be jointly typical and assume that
the sequence Z is generated according to QnZ|Y (|y), ignoring QX or x. Then
(n)

with very high probability (x, y, Z) A (QX,Y,Z ) is jointly typical anyway


because of the Markov structure of QX,Y,Z . This statement is also known as
Markov Lemma.

8.2.5

Case 5

In this case we assume everything works out. In particular, we assume that



X, Y, U(w, v) A(n) (QX,Y,U ).

(8.61)

Then we achieve the following distortion:



= d X, f n U(w, v), Y
d(X, X)
n

1X
=
d Xk , f Uk (w, v), Yk
n
k=1
X



1
=
N a, b, c X, Y, U(w, v) d a, f (c, b)
n
(a,b,c)X YU


X



QX,Y,U (a, b, c) +
d a, f (c, b)
|X ||Y||U|
(a,b,c)X YU


EQX,Y,U d X, f (U, Y ) + dmax .

8.2.6

(8.62)
(8.63)
(8.64)
(8.65)
(8.66)

Analysis Put Together

We are now ready to combine all these cases together. Using the fact that
all five cases combined cover the entire probability space, we use the Union
Bound on total expectation (Theorem 1.14) to get
5


 X



Case i Pr(Case i)
E d(X, X)
E d(X, X)

(8.67)

i=1

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


194

Rate Distortion with Side-Information (WynerZiv)

4
X



Case i Pr(Case i)
E d(X, X)
|
{z
}
i=1
dmax



Case 5 Pr(Case 5)
+ E d(X, X)
(8.68)
| {z }
1


0
dmax t (n, , X Y) + dmax exp en(R+R I(X;U ))


+ dmax en(R I(Y ;U )+) + dmax t (n, , X Y U)




+ EQX,Y,U d X, f (U, Y ) + dmax .

Hence, as long as we choose QU |X and f (, ) such that




EQX,Y,U d X, f (U, Y ) D,
0

I(X; U ) < R + R ,
0

I(Y ; U ) > R ,

(8.69)

(8.70)
(8.71)
(8.72)

we are able to achieve the rate distortion pair (R, D). Note that we are not
interested in R0 , i.e., we can actually combine (8.71) and (8.72) to the condition
R > I(X; U ) I(Y ; U ).

(8.73)

Since we are trying to make this condition as loose as possible, we will then
decide to choose QU |X and f (, ) such that the RHS of (8.73) is minimized.

8.3

The WynerZiv Rate Distortion Function

Based on (8.70) and (8.73), we define the following rate distortion function.
Definition 8.5. The WynerZiv rate distortion function is defined as
RWZ (D) ,

min

QU |X ,f (,) : E[d(X,f (U,Y ))]D


I(X; U ) I(Y ; U ) .

(8.74)

From Section 8.2 we know that any rate distortion pair larger than the
WynerZiv rate distortion function is achievable, i.e., RWZ (D) is an upper
bound on the rate distortion function of the WynerZiv problem. In Section 8.5 we will prove that it actually also is a lower bound, i.e., that RWZ (D)
constitutes the rate distortion function of the rate distortion problem with
side-information at the decoder.
The form of RWZ (D) is interesting. First, there is the auxiliary random
variable U that is difficult to understand intuitively. Be aware that U does not
represent what is transmitted (we transmit the bin number W !). It is better
to think of U as the description of the codebook used at the encoder. In the
standard rate distortion problem we only use one codebook that is described
Here we use two: one (described by U) that does not consider the
by X.
that does take Y into account.
side-information Y and one (described by X)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


8.3. The WynerZiv Rate Distortion Function

195

Depending on the correlation between X and Y, the encoder must use a very
detailed codebook with many bins and only very few codewords per bin, or it
can use a coarse codebook with only a few bins, but many codewords per bin.
i.e., each bin
In the extreme case of X
Y , the encoder must use U = X,
contains exactly one codeword.
Note that the function f (, ) at the decoder is actually a degenerate conditional probability distribution QX|U,Y
. We could replace the choice of f by

in the usual fashion at rana choice of Q


and then create a codebook X
X|U,Y

dom. However, once we minimize over QX|U,Y


, it turns out5 that the optimal

choice always only contains probability values of 1 and 0, i.e., given a value of
is deterministic.
U and Y , the value of X
Due to the Markov nature of QU,X,Y , i.e., U depends only on X, not on
(X, Y ):
QU,X,Y = QX QY |X QU |X ,

(8.75)

the expression in the minimization of (8.74) can also be expressed as follows:


I(X; U ) I(Y ; U ) = H(U ) H(U |X) H(U ) + H(U |Y )

(8.76)

= H(U |Y ) H(U |X)

(8.77)

= I(X; U |Y ).

(8.79)

= H(U |Y ) H(U |X, Y )

(8.78)

This form is very nice from a decoders point of view if we regard U as a


representation of the source: it gives the mutual information between source
X and its representation U when Y is known. However, from an encoders
point of view this does not make sense because the encoder does not know Y !
Note that
)
I(X; U |Y ) = I(X; U, X|Y
).
I(X; X|Y

= f (U, Y ))
(because X

(8.80)
(8.81)

The latter mutual information corresponds to the situation when both encoder
and decoder know Y , i.e., it describes the rate distortion region of the rate
distortion problem with global side-information:
RX|Y (D) =

min

QX|X,Y
: E[d(X,X)]D

).
I(X; X|Y

(8.82)

In general, for most sources and distortion measures, RX|Y (D) is strictly
smaller than RWZ (D), i.e., we usually have
R(D) > RWZ (D) > RX|Y (D).

(8.83)

Before we show two examples on how one can try to evaluate the Wyner
Ziv rate distortion function (8.74), we would like to point out that similar
such that for every
This is because I(X; U ) I(Y ; U ) does not directly depend on X
in an optimal fashion to minimize E[d(X, X)].

value of U , X, and Y , we can choose X


5

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


196

Rate Distortion with Side-Information (WynerZiv)

to our discussion at the end of Section 7.5.3 also here we have the problem
that in (8.74) we not only optimize over the best choice of QU |X , but that
we even have the freedom to select an optimal alphabet U for U . Luckily,
we reduce the dimensionality of the problem by proving that without loss of
generality we can restrict the size of the freely choosable alphabet U.
Lemma 8.6. Without loss of optimality we can restrict the size of U in the
definition of the WynerZiv rate distortion function in (8.74) to
|U| |X | + 2.

(8.84)

Proof: The proof is very similar to the proof of Lemma 7.7. Consider a
given choice of U, QU |X , and f (, ), and recall from (8.79) that
I(X; U ) I(Y ; U ) = I(X; U |Y )

(8.85)

= H(X|Y ) H(X|Y, U )
X

=
QU (u) H(X|Y ) H(X|Y, U = u) ,

(8.86)
(8.87)

uU

where QU is the marginal PMF coming from the chosen QU |X and from the
given QX . Furthermore, note that

X
XX
E[d(X, f (U, Y ))] =
QU (u)
QX,Y |U (x, y|u) d(x, f (u, y))
uU

xX yY

(8.88)
and that
QX (x) =

QU (u)QX|U (x|u),

uU

x X.

(8.89)

For simplicity of notation and without loss of generality, assume that X =


{1, 2, . . . , |X |}.
Now we define the vector v:


v , I(X; U ) I(Y ; U ), E[d(X, f (U, Y ))], QX (1), . . . , QX (|X | 1) ,

(8.90)

and for each u U the vector vu :



XX
vu , H(X|Y ) H(X|Y, U = u),
QX,Y |U (x, y|u) d(x, f (u, y)),
xX yY


QX|U (1|u), . . . , QX|U (|X | 1|u) ,

(8.91)

such that by (8.87)(8.89)


v=

QU (u)vu .

uU

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(8.92)

8.3. The WynerZiv Rate Distortion Function

197

We see that v is a convex combination of |U| vectors vu . From Caratheodorys Theorem (Theorem 1.20) it now follows that we can reduce the size
of U to at most |X | + 2 values (note that v contains |X | + 1 components!)
without changing v, i.e., without changing the values of I(X; U ) I(Y ; U ) and
of E[d(X, f (U, Y ))], and without changing the (given!) value of QX (x) for any
x {1, . . . , |X | 1} (and therefore also for x = |X |). This proves the claim.
Note that we need to incorporate QX into v because we do not choose QU
directly, but QU |X . Hence, when changing the alphabet U and QU , we have
to make sure that QX still remains as given by the source.
Also note that the bound (8.84) can actually be reduced by 1 if we use
Theorem 1.22 instead of Theorem 1.20.
Example 8.7. Lets consider the example of a binary symmetric source (BSS)
with the Hamming distortion measure. Suppose that the side-information Yk
that is available at the decoder is the output of a BSC that has input Xk and
crossover probability p. See Figure 8.3. The task is to compute RWZ .
Dest.

1, . . . , X
n
X

Encoder

X 1 , . . . , Xn

BSS

Decoder
Y1 , . . . , Yn

1p
p
p
1p

Figure 8.3: Example of a WynerZiv problem with a binary symmetric source


and side-information at the decoder coming through a binary symmetric channel.
We start by dropping the minimum in the definition (8.74) and simply
pick some QU |X and f (, ). This then leads to an upper bound on RWZ .
We make two different choices. The first strategy is to choose U = 0
deterministically and to set f (u, y) = y for all u, y. This choice leads to
I(X; U ) I(Y ; U ) = 0

(8.93)

and




= E d X, f (U, Y ) = E[d(X, Y )] = Pr[X 6= Y ] = p. (8.94)
E d(X, X)
Hence, if D p, we know that RWZ (D) 0. However, since RWZ (D) 0 by
definition, we see that we have found
RWZ (D) = 0,

for D p.

(8.95)

The second strategy is to choose U to be the output of a BSC with input


X and crossover probability (note that in this case |U| = 2 |X | + 1 = 3)
and f (u, y) = u for all u, y. So, with respect to U and Y , we have two BSCs
in series, which yields again a BSC, but with crossover probability
p ? , p(1 ) + (1 p).

(8.96)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


198

Rate Distortion with Side-Information (WynerZiv)

Hence,


I(X; U ) I(Y ; U ) = 1 Hb () 1 Hb (p ? )

(8.97)



= Pr[X 6= U ] = .
E d(X, X)

(8.99)

= Hb (p ? ) Hb ()

(8.98)

and

Now we use a time-sharing approach: During a fraction of the time we


use the second strategy, during the rest 1 we use the first. Then we get

RWZ (D)
min
Hb (p ? ) Hb () .
(8.100)
, : +(1)pD

One can show that the RHS of (8.100) actually achieves the minimum in
(8.74), i.e., we have found the exact value of the WynerZiv rate distortion
function.
Once more, we would like to remind the reader that U is not what is
transmitted over the channel! As an example take D = 0. In this case the
optimal choices for and in (8.100) are = 1 and = 0, i.e., we only use
the second strategy with U = X. To transmit X we needed a rate of 1 bit,
however, (8.100) shows that we can do with 1 Hb (p) Hb (0) = Hb (p) bits.
The rest is provided by Y !
Recall that the standard rate distortion function for the setting of this
example is
+
R(D) = 1 Hb (D) .
(8.101)
One can check that this is always strictly larger than (8.100) unless D 12 or
p = 12 , i.e., unless either R(D) = 0 or X
Y.
If the encoder also has access to Y we get
+
RX|Y (D) = Hb (p) Hb (D) .
(8.102)
This in turn is always strictly smaller than RWZ (D) unless D = 0, D p, or
p = 21 .

Example 8.8. As another example, lets apply Definition 8.5 to a Gaussian


source and the squared error distortion.6 Suppose that (X, Y ) are jointly
Gaussian with zero-mean and covariance matrix
!
2 2
K=
(8.103)
2 2
for some value of 2 > 0 and correlation factor [1, 1].
6

Strictly speaking we cannot do so, as we have only proven (8.74) for the situation of
finite alphabets. But it can be generalized to the Gaussian case, too.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


8.3. The WynerZiv Rate Distortion Function

199

We again start by dropping the minimization in (8.74) and pick some PDF
U |X and some function f (, ), which leads to an upper bound to RWZ (D).
If the decoder only uses Y to make an estimate of X (i.e., U is a constant
and ignored), then the best estimator is the linear estimator E[X |Y ], i.e., we
choose
= f (U, Y ) = E[X |Y ]
X
which yields the following average distortion
h




 i
= E (X X)
2 = E X E[X |Y ] 2 = 2 ,
E d(X, X)
pred

(8.104)

(8.105)

where 2pred denotes the variance of X when knowing Y , i.e., the prediction
error of X when observing Y :


2pred = E (X E[X | Y ])2 = 2 (1 2 ).
(8.106)
(For details on how to compute the prediction error, go back to conditional
Gaussian random variables [Mos14, Appendices A & B].) Hence, if D 2 (1
2 ), we have RWZ (D) 0 and, since the rate distortion function cannot be
negative,
RWZ (D) = 0,

for D 2 (1 2 ).

(8.107)

If D < 2 (12 ), we choose U = X+Z where Z is chosen as Z N 0, Z2 ,
Z
X, and the decoder makes the best estimate of X given both Y and U :
= f (U, Y ) = E[X | Y, U ].
X

(8.108)

This then yields:


I(X; U ) I(Y ; U )
= I(X; U |Y )

(by (8.81))

= h(X|Y ) h(X|Y, U )
U)
= h(X|Y ) h(X X|Y,

(8.109)
(8.110)

= f (U, Y ))
(because X

= h(X|Y ) h(X X)

(8.111)

(because error is orthogonal


to observation)

 1

1
= log 2e2pred log 2e2MMSE
2
! 2
2pred
1
= log 2
,
2
MMSE

(8.112)
(8.113)
(8.114)

and where 2pred is again the prediction error of X when observing Y , and
where 2MMSE is the variance of the error when optimally estimating X given
Y and U , i.e., the minimum mean squared error (MMSE). Since our distortion
measure is exactly this MMSE and we therefore require that 2MMSE D, we
will choose Z2 such that we have
2MMSE = D.

(8.115)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


200

Rate Distortion with Side-Information (WynerZiv)

(Note that this is possible because D < 2 (1 2 ).) Hence, we have found
the following bound:

 2
1
(1 2 )
RWZ (D) log
, for D < 2 (1 2 ).
(8.116)
2
D
Combined with (8.107) this now gives
+

 2
1
(1 2 )
RWZ (D)
log
.
2
D

(8.117)

Now recall that the standard rate distortion function is given by



 2 +
1

R(D) =
log
.
(8.118)
2
D
From this then follows that if Y is known both at encoder and decoder, we
have

 2
+
1
(1 2 )
RX|Y (D) =
log
,
(8.119)
2
D
because given Y , the variance of X is 2 (1 2 ). Comparing (8.119) with
(8.117) and remembering that RX|Y (D) RWZ (D) by definition, we then see
that

 2
+
1
(1 2 )
log
= RX|Y (D)
(8.120)
2
D
RWZ (D)
(8.121)

 2

+
1
(1 2 )

log
,
(8.122)
2
D
i.e.,

RWZ (D) =

 2
+
1
(1 2 )
log
.
2
D

(8.123)

This shows that the Gaussian source is special: The side-information is not
required at the encoder, but it is sufficient to have it available at the decoder
only.

8.4

Properties of RWZ ()

Any rate distortion function R(D) must have the properties that it is nonincreasing and convex in D. The former follows from the fact that if we allow
a higher distortion, we definitely do not need to increase our rate. The latter
can be shown by a time-sharing argument. Assume two rate distortion pairs
that are achievable using some given two schemes. If we now use a fraction
of the time the first scheme and the remaining fraction (1 ) the second

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


8.4. Properties of RWZ ()

201

scheme, we will achieve a rate distortion pair that lies on the straight line
between the two rate distortion pairs. Hence, the rate distortion function can
only lie on this line or below, i.e., it must be convex.
Unfortunately, however, we cannot apply this insight to RWZ (D) because
we have not yet proven that it really is the correct rate-distortion function.
So we will prove directly that RWZ (D) is nonincreasing and convex in D.
Lemma 8.9. The WynerZiv rate distortion function RWZ () as specified in
Definition 8.5 is nonincreasing and convex in D.
Proof: The former is quite obvious from Definition 8.5: If we increase
the value of D, we relax the constraint in the minimization, which can only
decrease the value achieved in the minimization.
We will next prove that RWZ (D) is convex in D. Consider two points
(R0 , D0 ) and (R1 , D1 ) on the (RWZ (D), D)-curve and suppose that
f0 : U0 Y X ,
f1 : U1 Y X

QU0 |X ,

QU1 |X ,

(8.124)
(8.125)

achieve these points, respectively. Now let Z be a binary random variable,


independent of (X, Y ) with
QZ (0) = 1 QZ (1) =

(8.126)

and define the auxiliary random variable (or, rather, random vector, but for a
finite alphabet there is no mathematical difference between a random variable
and a random vector as both take on a finite number of possible values):
U , [Z, UZ ].
Note that for all (z, uz , x) {0, 1} (U0 U1 ) X ,

QU|X (u|x) = Q[Z,UZ ]|X [z, uz ] x = QZ (z) QUz |X (uz |x).

(8.127)

(8.128)

Moreover, we choose f (u, y) as


(
f0 (u0 , y) if z = 0,
f (u, y) = f ([z, uz ], y) ,
f1 (u1 , y) if z = 1.

(8.129)

Note that in this setup we basically have Z choosing between U0 , f0 and U1 ,


f1 .
With this choice of auxiliary random variable and function f , we achieve
the following average distortion:


D , E d X, f (U, Y )
(8.130)
 
 
(8.131)
= E E d X, f ([Z, UZ ], Y ) Z




= QZ (0) E d X, f ([0, U0 ], Y ) Z = 0



+ QZ (1) E d X, f ([1, U1 ], Y ) Z = 1
(8.132)




= E d X, f0 (U0 , Y ) + (1 ) E d X, f1 (U1 , Y )
(8.133)
= D0 + (1 )D1

(8.134)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


202

Rate Distortion with Side-Information (WynerZiv)

and the following rate:


I(X; U) I(Y ; U)

= H(X) H(X|U) H(Y ) + H(Y |U)

(8.135)

= H(X) H(X|Z, UZ ) H(Y ) + H(Y |Z, UZ )

= H(X) QZ (0) H(X|U0 , Z = 0) QZ (1) H(X|U1 , Z = 1)

H(Y ) + QZ (0) H(Y |U0 , Z = 0) + QZ (1) H(Y |U1 , Z = 1)

= ( + 1 ) H(X) H(X|U0 ) (1 ) H(X|U1 )

(8.136)
(8.137)

( + 1 ) H(Y ) + H(Y |U0 ) + (1 ) H(Y |U1 )



= H(X) H(X|U0 ) H(Y ) + H(Y |U0 )

+ (1 ) H(X) H(X|U1 ) H(Y ) + H(Y |U1 )


= I(X; U0 ) I(Y ; U0 ) + (1 ) I(X; U1 ) I(Y ; U1 )

(8.139)

= RWZ (D0 ) + (1 )RWZ (D1 ).

(8.142)

= R0 + (1 )R1

(8.138)

(8.140)
(8.141)

Hence, combining this all together,


RWZ D0 + (1 )D1

= RWZ (D)
=

(8.143)


min

QU|X ,f (,) : E[d(X,f (U,Y ))]D


I(X; U) I(Y ; U)

(8.144)


I(X; U) I(Y ; U)

(8.145)

= RWZ (D0 ) + (1 )RWZ (D1 ).

(8.146)

choose U and f as above

Here, in (8.143) we use (8.134); (8.144) follows from Definition 8.5; in (8.145)
we drop the minimization and choose U and f as given in (8.127) and (8.129);
and the final equality (8.146) follows from (8.142).
This concludes the proof.

8.5

Converse

We are now ready to show that there exists no rate distortion system with
side-information at the decoder side that has a rate distortion pair below the
WynerZiv rate distortion function.
Consider an arbitrary coding scheme where
the encoder
n maps a length

n source sequence x to an index w 1, . . . , enR for some given R, and
where the decoder n maps a received index w and a length-n side-information
. Assume that the scheme
sequence y into a source representation sequence x
works, i.e.,


D
E d(X, X)
(8.147)
for some given D.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


8.5. Converse

203

Using that the entropy is upper-bounded by the logarithm of the alphabet


size, we have
R=

=
=
=
=

=
=
=

1
log enR
n
1
H(W )
n
1
H(W |Y)
n
1
1
H(W |Y) H(W |X, Y)
n
n
1
I(W ; X|Y)
n
n


1X
I W ; Xk X1k1 , Y1n
n
k=1
n




1 X
H Xk X1k1 , Y1n H Xk W, X1k1 , Y1n
n
k=1
n


1 X
H(Xk |Yk ) H Xk W, X1k1 , Y1n
n
k=1
n 
X


1
n
H(Xk |Yk ) H Xk W, Y1k1 , Yk , Yk+1
n
1
n
1
n
1
n

1
n

1
n

k=1
n
X
k=1
n
X
k=1
n
X
k=1
n
X
k=1
n
X
k=1

(8.148)
(8.149)
(8.150)
(8.151)
(8.152)
(8.153)
(8.154)
(8.155)
(8.156)


H(Xk |Yk ) H(Xk |Uk , Yk )

(8.157)

I(Xk ; Uk |Yk )

(8.158)

H(Uk |Yk ) H(Uk |Xk , Yk )

(8.159)


H(Uk |Yk ) H(Uk |Xk )

(8.160)


I(Xk ; Uk ) I(Yk ; Uk ) .

(8.161)

Here, (8.150) and (8.151) follow because conditioning reduces entropy and because entropy is nonnegative; in (8.155) we use that the source is IID over
time; in the subsequent inequality (8.156) we again rely on conditioning reducing entropy (note that we cannot apply this step with equality as in (8.155)
because W depends on the past!); in (8.157) we define a new random variable
(or random vector)
n
Uk , W, Y1k1 , Yk+1

(8.162)

and the final step (8.161) follows by adding and subtracting H(Uk ).

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


204

Rate Distortion with Side-Information (WynerZiv)

Note that (8.160) actually holds with equality: A careful investigation


shows that we have here a Markov structure: Uk (
Xk (
Yk because the
source is IID and because W depends on Yk only via Xk . We do not need
this here, though, as we can simply again rely on conditioning that reduces
entropy.
The expression inside the sum in (8.161) can now be lower-bounded using
the definition of the WynerZiv
rate distortion function (8.74), once we figure

k ) for our coding scheme. Note that since
out the exact value of E d(Xk , X

X = n (W, Y), Xk simply is the kth component of this vector-valued function:



k = n,k (W, Y) = n,k W, Y k1 , Yk , Y n
X
k+1 = n,k (Uk , Yk ). (8.163)
1


So, for the value E d Xk , n,k (Uk , Yk ) , the WynerZiv rate distortion function is definitely smaller than I(Xk ; Uk ) I(Yk ; Uk ) because it minimizes it
over the choice of QUk |Xk and n,k . Hence,
n


1X
I(Xk ; Uk ) I(Yk ; Uk )
n
k=1
n
 
X

1

RWZ E d Xk , n,k (Uk , Yk )


n
k=1
!
n

1X 
RWZ
E d Xk , n,k (Uk , Yk )
n
k=1
" n
#!
1X
k )
= RWZ E
d(Xk , X
n
k=1



= RWZ E d(X, X)

RWZ (D).

(8.164)
(8.165)
(8.166)
(8.167)
(8.168)
(8.169)

Here, in (8.166) we use the convexity of RWZ () as shown in Section 8.4; in


the subsequent equality (8.167) we move the expectation outside the sum
k according to (8.163); in (8.168) we use
(linearity) and swap back again to X
the fact that our distortion measure is assumed to be an average per-letter
distortion function; and the final inequality (8.169) follows from the fact that
RWZ () is nonincreasing (see again Section 8.4) and from (8.147) that is based
on the assumption that our coding scheme actually works.
This shows that any working coding scheme indeed cannot beat RWZ ().

8.6

Summary

In this chapter we have proven the following result.


Theorem 8.10 (WynerZiv Coding Theorem [WZ76]).
Consider the rate distortion problem given in Figure 8.1. An encoder is
supposed to describe a source sequence X using a code of rate R. The

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


8.6. Summary

205

decoder must be able to recover the original sequence up to a maximum


allowed average distortion D, using the description of the encoder and
also some side-information Y about the source sequence X. Then for a
given average distortion level D, the minimum achievable rate is given by
the WynerZiv rate distortion function


RWZ (D) ,
min
I(X; U ) I(Y ; U ) ,
(8.170)
QU |X ,f (,) : E[d(X,f (U,Y ))]D

where without loss of optimality we can restrict the size of U to


|U| |X | + 2.

(8.171)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Chapter 9

Distributed Lossless
Data-Compression:
SlepianWolf Problem
The rate distortion problem and the WynerZiv problem can both be considered to be a special case of the general setup shown in Figure 9.1.
1, . . . , X
n
X
Dest.
Y1 , . . . , Yn

W (1)

Encoder (1)

X 1 , . . . , Xn
QX,Y

Decoder
W (2)

Encoder (2)

Y1 , . . . , Yn

Figure 9.1: A general source compression problem with two joint sources, two
distributed encoders and one joint decoder.
Here, a source jointly generates two IID sequences X and Y that are
then encoded in a distributed fashion from two encoders that cannot directly
cooperate. The decoder then receives both indices W (1) and W (2) (with corresponding rates R(1) and R(2) , respectively) and needs to reconstruct both
source sequences up to some given distortions D(1) and D(2) .
This general problem is not solved, i.e., the optimal four-dimensional rate
distortion region is not known. However, some special cases are known:
If Yk = constant, we are in the case of a standard rate distortion problem
as discussed in Chapter 5.
If R(2) H(Y ), the decoder can recover Y perfectly first and then use
this as side-information to gain X back within the required distortion.
This is the situation of WynerZiv as discussed in Chapter 8.
If R(2) H(Y ), if we have D(1) = 0, and if we are only interested in X
at the destination, then we are in the case of lossless source coding with
side-information.

207

c Stefan M. Moser, vers. 2.5


208

Distributed Lossless Data-Compression (SlepianWolf)


If D(1) = D(2) = 0, we have the problem of lossless distributed data
compression, which is usually known as SlepianWolf problem. This case
is the topic of this chapter. Obviously, if R(1) > H(X) and R(2) > H(Y ),
then such distributed lossless source compression will work: We simply
compress and decompress X and Y separately. On the other hand,
if the encoders could cooperate, then a rate of Rtot > H(X, Y ) would
be sufficient. Surprisingly, as we will show, this is also possible for
distributed encoding!

9.1

Problem Statement and Main Result

It is important to notice that we do not consider zero-error coding like, e.g.,


Huffman coding, where we never make any error. The problem statement is
such that we only require that the error probability tends to zero as n gets
large.

(1)
(2)
Definition 9.1. An enR , enR , n lossless distributed coding scheme for a
joint DMS QX,Y consists of

(1)
a first encoding function (1) : X n 1, . . . , enR
,

(2)
a second encoding function (2) : Y n 1, . . . , enR
, and


(1)
(2)
a decoding function : 1, . . . , enR
1, . . . , enR
X n Y n.
The error probability is defined as



(2)
Pe(n) , Pr (1)
n (X), n (Y) 6= (X, Y) .

(9.1)

Definition 9.2. A rate pair (R(1) , R(2) ) is said to be achievable for a dis
(2)
(1)
tributed source if there exists a sequence of enR , enR , n distributed cod(n)

ing schemes with Pe 0 as n . The SlepianWolf rate region is the


closure of the set of all achievable rate pairs.
Remark 9.3. Even though the two encoders cannot cooperate, we do assume
synchronization! Hence, it is possible to use time-sharing by simply predefining some time-slots where both encoders use a coding scheme 1 and some
time-slots where both encoders use a coding scheme 2. However, we will not
make use of this because it will turn out that the rate region already is convex.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


9.1. Problem Statement and Main Result

209

R(2)
separate compression

H(X, Y )

and decompression

H(Y )

H(Y |X)
joint encoding

H(X|Y )

H(X)

H(X, Y )

R(1)

Figure 9.2: SlepianWolf rate region for distributed source coding. Note that
we do lose some rate pairs in comparison to the situation of joint
source coding: The corners with R(1) < H(X|Y ) or R(2) < H(Y |X)
are not achievable in distributed fashion, even though they are
achievable for joint source compression.

Theorem 9.4 (SlepianWolf Distributed Source Compression


Coding Theorem [SW73b]).
For the distributed source coding problem for (X, Y) IID QX,Y , the
SlepianWolf rate region is given by

R(1) H(X|Y ),
(9.2)

(2)
R H(Y |X),
(9.3)

(1)
(2)
R + R H(X, Y ).
(9.4)
This rate region is depicted in Figure 9.2.
This result and [SW73b] was very important not only because it proved
the surprising fact that the joint entropy H(X, Y ) can be achieved, but also
because the technique of binning was introduced, which subsequently was
successfully applied to many other problems.
Example 9.5. Consider the weather in Hsinchu and in Taichung. Obviously,
it is correlated, i.e., if it is rainy in Hsinchu, it probably also rains in Taichung,

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


210

Distributed Lossless Data-Compression (SlepianWolf)

and vice versa. Lets assume that the weather of every day is independent and
identically distributed following the joint distribution given in Table 9.3.
Table 9.3: A joint weather PMF of Hsinchu and Taichung.
Taichung Y
QX,Y (, )

rain

sun

Hsinchu total

rain

0.445

0.055

0.5

sun

0.055

0.445

0.5

Taichung total

0.5

0.5

Hsinchu X

Suppose two weather stations in Hsinchu and Taichung need to send the
local weather data of 100 days to the Taipei National Weather Service headquarters. They could send all 100 bits data from both places, which would
mean that in total 200 bits of data are transmitted.
If we try to compress the data to reduce the necessary amount of bits,
then an individual data compression both at Hsinchu and Taichung will not
help at all because both X and Y are uniformly binary distributed and can
therefore not be compressed.
However, if we apply a SlepianWolf scheme, then we get
R(1) H(X|Y ) = Hb (0.89) = 0.5 bits,

(1)

(2)

+R

(2)

(9.5)

H(Y |X) = Hb (0.89) = 0.5 bits,

(9.6)

= Hb (0.5) + Hb (0.89) = 1.5 bits,

(9.7)

H(X, Y ) = H(X) + H(Y |X)

i.e., in an optimal scheme we only need to transmit 150 bits!


Obviously, it is not possible that one encoder transmits more than 100 bits
(and the other less than 50 bits), because each encoder only has 100 bits of
data available. So, the situation of a rate split with, e.g., one encoder sending
120 bits and the other 30 bits (the sum is still 150 bits!) can only be achieved
if the encoders can cooperate.

9.2

New Lossless Data Compression Scheme based


on Bins

Before we prove Theorem 9.4, we would like to introduce a new lossless data
compression scheme for some DMS QX (). The insights of this new scheme
and its analysis can then directly be applied to the distributed coding problem,
too.
Our coding scheme is based on bins.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


9.2. New Lossless Data Compression Scheme based on Bins

211

1: Setup: Fix a rate R and some blocklength n.


2: Codebook Design: For each possible source sequence we draw at random (using a uniform distribution) an index from {1, . . . , enR }. The set
of sequences that have the same index are said to form a bin.1
3: Encoder Design: For a given source sequence x, the encoder puts
out the bin number.
4: Decoder Design: For a given index w, the decoder checks the bin w:
If there is exactly one typical sequence in there, it declares it to be the
, otherwise it declares an error.
decision x
5: Performance Analysis: We distinguish typical and nontypical source
sequences. If the source sequence is nontypical, then the decoder will
always make an error (since whatever happens, the decoder will never
decide for a nontypical sequence!).
If the source sequence is typical, then at least one typical sequence is
present in its corresponding bin (this typical sequence itself!). Hence,
there will be an error if, and only if, there are more than one typical
sequence in this bin.
Hence,
Pr(error)


= Pr X
/ A(n)
(QX ) X
/ A(n)
(QX ), more than one



typical sequence in bin
(9.8)


(n)
Pr X
/ A (QX )

+ Pr X A(n)
(QX ), more than one typical seq. in bin
(9.9)

X
 0
n
0
(n)
t (n, , X ) +
QX (x) Pr x 6= x : x A (QX )
(n)

xA

t (n, , X ) +

X
(n)

xA

= t (n, , X ) +
t (n, , X ) +

(QX )

and (x0 ) = (x)


QnX (x)

(n)

(QX )

QnX (x)

(n)
xA (QX )

X
(n)
xA (QX )

QnX (x)

x0 A (QX )
x0 6=x

(n)
x0 A (QX )
x0 6=x

(9.10)



Pr (x0 ) = (x) (9.11)
|
{z
}
= enR

enR

(9.12)

enR

(9.13)

(n)
x0 A (QX )

This is equivalent to having enR bins and randomly throwing all possible sequences into
one of them.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


212

Distributed Lossless Data-Compression (SlepianWolf)


= t (n, , X ) +

X
(n)
xA (QX )



QnX (x) A(n) (QX ) enR

{z

(9.14)



(QX ) enR
t (n, , X ) + A(n)


(9.15)

(9.17)

t (n, , X ) + en(H(QX )+m ) enR

(9.16)

if
R > H(QX ) + m

(9.18)

and n is sufficiently large. Here, (9.11) follows from the Union Bound:
We upper-bound the event that at least some x0 . . . by a sum over all
x0 . Note that the encoder is random here: Each source sequence is
assigned a random index, i.e., the probability that two get the same
index is # of1bins = 1/ enR . And (9.16) follows from TA-2.
Remark 9.6. This shows that there are many ways of constructing coding
schemes with low error probability as long as R > H(X). The advantage of
this scheme is that we do not need the typical set at the encoder, but only at
the decoder, i.e., it will also work for a distributed source!

9.3

Achievability

We are now ready for an achievability proof of Theorem 9.4.


1: Setup: Fix rates R(1) and R(2) and some blocklength n.
(1)

2: Codebook Design: Independently assign every x X n to one of enR


bins according to a uniform distribution. Analogously, assign every y
(2)
Y n to one of enR bins according to a uniform distribution. Reveal the
first assignment to encoder 1, the second assignment to encoder 2, and
both assignments to the decoder.
3: Encoder Design: For a given source sequence x, the encoder (1) puts
out the index of the bin to which x belongs. Analogously, for a given
source sequence y, the encoder (2) puts out the index of the bin to which
y belongs.

4: Decoder Design: Given an index pair w(1) , w(2) , the decoder tries to
) such that
find a pair (
x, y
(1) (
x) = w(1) ,

(2)

(
y) = w

)
(
x, y

(2)

A(n) (QX,Y ).

(9.19)
(9.20)
(9.21)

), then the decoder puts out this


If it finds exactly one such pair (
x, y
pair. Otherwise, it declares an error.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


9.3. Achievability

213

5: Performance Analysis: Recall that {(Xk , Yk )} are IID QX,Y and


that the assignments 1 and 2 are independent and uniform. We define
the following events:


F0 , (x, y) : (x, y)
/ A(n)
(QX,Y ) ,
(9.22)


(n)
0
(1) 0
(1)
F1 , (x, y) : (x, y) A (QX,Y ); x 6= x with (x ) = (x)

and (x0 , y) A(n) (QX,Y ) ,
(9.23)

F2 , (x, y) : (x, y) A(n) (QX,Y ); y0 6= y with (2) (y0 ) = (2) (y)

and (x, y0 ) A(n) (QX,Y ) ,
(9.24)

(n)
0
0
0
0
F12 , (x, y) : (x, y) A (QX,Y ); (x , y ) with x 6= x, y 6= y,
(1) (x0 ) = (1) (x), (2) (y0 ) = (2) (y)

and (x0 , y0 ) A(n) (QX,Y ) .

(9.25)

Then,
Pr(error) = Pr(F0 F1 F2 F12 )

(9.26)

Pr(F0 ) + Pr(F1 ) + Pr(F2 ) + Pr(F12 )

(9.27)

by the Union Bound. By TA-3b we have


Pr(F0 ) < t (n, , X Y).

(9.28)

For the second term, we bound as follows:


Pr(F1 )
X


QnX,Y (x, y) Pr x0 6= x : (1) (x0 ) = (1) (x),
(n)

(x,y)A (QX,Y )
x0 A(n) (QX,Y |y)
(9.29)
X
X
 (1) 0

n
(1)

QX,Y (x, y)
Pr (x ) = (x)
{z
}
|
(n)
(n)

(x,y)A

QnX,Y (x, y)

(n)
(x,y)A (QX,Y

(n)
(x,y)A (QX,Y

en(R

(1)

enR

(1)

(1)

(9.30)
(9.31)

|y)

(9.32)

QnX,Y (x, y) en(H(X|Y )+m ) enR

(1)

(9.33)

{z

= enR



(1)
QnX,Y (x, y) A(n)
(QX,Y |y) enR


X
(n)
(x,y)A (QX,Y

(QX,Y |y)
x0 6=x

(n)
x0 A (QX,Y
x0 6=x

x0 A

(QX,Y )

H(X|Y )m )

(9.34)

(9.35)

if
R(1) > H(X|Y ) + m

(9.36)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


214

Distributed Lossless Data-Compression (SlepianWolf)


and n is sufficiently large. Here, (9.30) follows from the Union Bound; in
(9.31) we recall that the assignments to bins are uniform; in (9.32) and
we enlarge the sum by adding one term; and (9.33) follows from TB-2.
Analogously, we can show that
Pr(F2 ) 

(9.37)

R(2) > H(Y |X) + m

(9.38)

if

and n is sufficiently large.


It remains to investigate F12 :
Pr(F12 )
X

(n)

(x,y)A


QnX,Y (x, y) Pr (x0 , y0 ) : x0 6= x, y0 6= y,
(QX,Y )

(n)

(x,y)A

QnX,Y (x, y)
(QX,Y )

X
(n)

(n)

(x,y)A

QnX,Y (x, y)

(QX,Y )

X
(n)

(n)
(x,y)A (QX,Y

en(R

enR

(1)

enR

(9.41)
(2)

(9.42)



(1)
(2)
(QX,Y ) en(R +R )
QnX,Y (x, y) A(n)


(9.43)

QnX,Y (x, y) en(H(X,Y )+m ) en(R

(9.44)

(1)

+R(2) )

{z

(1)

(2)

= enR

X
(n)
(x,y)A (QX,Y

(1)

(n)
(x0 ,y0 )A (QX,Y
x0 6=x, y0 6=y

= enR

QnX,Y (x, y)

(n)
(x,y)A (QX,Y





Pr (1) (x0 ) = (1) (x) Pr (2) (y0 ) = (2) (y)
|
{z
} |
{z
}

(x0 ,y0 )A (QX,Y )


x0 6=x, y0 6=y



Pr (1) (x0 ) = (1) (x), (2) (y0 ) = (2) (y) (9.40)

(x0 ,y0 )A (QX,Y )


x0 6=x, y0 6=y

(1) (x0 ) = (1) (x), (2) (y0 ) = (2) (y)



and (x0 , y0 ) A(n) (QX,Y )
(9.39)

+R(2) H(X,Y )m )

(9.45)
(9.46)

if
R(1) + R(2) > H(X, Y ) + m

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(9.47)

9.4. Converse

215

and n is sufficiently large. Here, (9.41) follows from our assumptions that
the assignments are independent; and (9.44) follows from TA-2.
This proves the achievability of the region given in Theorem 9.4.

9.4

Converse
(n)

Consider a given (sequence of) working coding scheme with Pe 0 as


n . Recall
the Fano Inequality (Proposition 1.13) with an observation

Y):

W (1) , W (2) about (X, Y) that is used to make a guess (X,






Y)
6= (X, Y) n log(|X | |Y|) (9.48)
H X, Y W (1) , W (2) log 2 + Pr (X,



log 2
(n)
=n
(9.49)
+ Pe log |X | + log |Y|
n
, nn ,
(9.50)
where the last line has to be understood as definition of n . Note that since
(n)
we have assumed that Pe 0 as n , we have n 0 as n .
Moreover, from (9.50) it follows that






H X Y, W (1) , W (2) = H X, Y W (1) , W (2) H Y W (1) , W (2)
(9.51)
|
{z
}


H X, Y W (1) , W (2)

(9.52)

nn

(9.53)



H Y X, W (1) , W (2) nn .

(9.54)

and, analogously,

(1)

Since W (1) takes on enR different values and W (2) takes on enR
values, we now have

(1)
(2) 
n R(1) + R(2) = log enR enR

H W (1) , W (2)



= I X, Y; W (1) , W (2) + H W (1) , W (2) X, Y

= I X, Y; W (1) , W (2)


= H(X, Y) H X, Y W (1) , W (2)
H(X, Y) nn

= n H(X, Y ) nn ,

(2)

different
(9.55)
(9.56)
(9.57)
(9.58)
(9.59)
(9.60)
(9.61)

i.e.,
R(1) + R(2) H(X, Y ) n .

(9.62)

Here, (9.58) follows because W (1) = (1) (X) and W (2) = (2) (Y); in (9.60) we
have used (9.50); and the last equality (9.61) is because {(Xk , Yk )} are IID
QX,Y .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


216

Distributed Lossless Data-Compression (SlepianWolf)


Similarly, we derive:
nR(1) = log enR
H W

(1)

(1)

(9.63)

(9.64)

(1)

(9.65)


H W
Y



(1)
= I X; W
Y + H W (1) X, Y

= I X; W (1) Y

= I X; W (1) , W (2) Y


= H(X|Y) H X Y, W (1) , W (2)
H(X|Y) nn

= n H(X|Y ) nn .

(9.66)
(9.67)
(9.68)
(9.69)
(9.70)
(9.71)

Here, (9.68) follows because W (2) is a function of Y. Hence,


R(1) H(X|Y ) n ,

(9.72)

R(2) H(Y |X) n .

(9.73)

and, analogously,

This proves that no working coding scheme can be outside the region defined
in Theorem 9.4.

9.5

Discussion: Colors instead of Bins

To understand the idea of the coding scheme used in the achievability proof,
consider the corner point
R(1) = H(X),
R

(2)

= H(Y |X).

(9.74)
(9.75)

We know that using n H(X) bits we can effectively encode X in a way that
makes sure that the decoder can reconstruct it with arbitrarily small error.
But how do we encode Y using only n H(Y |X) bits?
Recall that every sequence X has a small set of Y that is jointly typical
with X. So, if the encoder knows X, it can easily send the index of Y within
this small set. But our encoder does not know X! Hence, instead of finding
(2)
this small typical set, it colors all Y sequences with enR different colors.
If the number of colors is large enough, then the Y sequences in the small
set that is jointly typical with X will all have a different color, i.e., the color
uniquely2 defines the correct Y.
2

We put uniquely in quotation marks here because strictly speaking it is not unique: We
will always have a nonzero error probability that only vanishes once n tends to infinity.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


9.6. Generalizations

9.6

217

Generalizations

The setup of Theorem 9.4 can be generalized to L encoders. Let


 (1)
(L) 
Xk , . . . , Xk
be IID QX (1) ,...,X (L) .

(9.76)


Then all rate L-tuples R(1) , . . . , R(L) are achievable if, and only if,

c 
R[L] H X [L] X [L ] ,

L {1, . . . , L}

(9.77)

R(i)

(9.78)

where
R[L] ,

X
iL

and


X [L] , X (i) : i L .

(9.79)

The theorem has also been extended to stationary and ergodic sources
[Cov75]. In that case the entropies have to be replaced by entropy rates.

9.7

Zero-Error Compression

It is important to realize that the SlepianWolf theorem (Theorem 9.4) does


not hold for zero-error coding!
Consider a joint DMS QX,Y such that QX,Y (x, y) > 0 for all (x, y). For
symmetry reasons, we can restrict ourselves to the situation where the decoder
has already managed to recover Y perfectly and now tries to recover X from
W (1) and Y.
Note that (because QX,Y (x, y) > 0) from any known source sequence y it
is impossible to predict the corresponding sequence X with zero probability
of error. So, since W (1) has been generated completely independently of Y,
we can only have zero probability of error if W (1) allows a error-free recovery
of X by itself.
This means that if we require
Pe(n) = 0,
(n)

instead of Pe

(9.80)

0 as n , then the best achievable region is


R(1) H(X),
R

(2)

H(Y )

(9.81)
(9.82)

(where we have assumed that QX,Y (x, y) > 0 for all (x, y)).

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Chapter 10

The Multiple-Access Channel


After so many chapters on source coding, we now switch to channel coding.

10.1

Problem Setup

We consider a channel coding problem as shown in Figure 10.1.


X(1)
Dest.

(1) , M
(2)
M

Dec.

Channel
QnY |X (1) ,X (2) X(2)

Enc. (1)

Enc. (2)

M (1)

Uniform
Source 1

M (2)

Uniform
Source 2

Figure 10.1: A channel coding problem with two sources that independently
try to transmit a message M (i) , i = 1, 2, to the same destination.
Such a channel model is called multiple-access channel (MAC).
Here we have a discrete memoryless channel that simultaneously accepts
two inputs from two independent transmitters and that generates an output Y
that is random with a distribution conditional on both inputs. The decoders
task is to simultaneously recover both messages based on the received channel
output sequence Y.
More formally, we have the following definitions.
Definition 10.1. A discrete memoryless multiple-access channel (DM-MAC)
consists of three alphabets X (1) , X (2) , Y and a conditional probability distribution QY |X (1) ,X (2) such that

 (1) k  (2) k 

QY |Y k1 ,{X (1) }k ,{X (2) }k yk y1k1 , x` `=1 , x` `=1
k 1
`
`=1
`
`=1
(1)
(2) 

(10.1)
= QY |X (1) ,X (2) yk xk , xk .
If a DM-MAC is used without feedback, we have
n
(1) (2)  Y
(1) (2) 

QY|X(1) ,X(2) y x , x
=
QY |X (1) ,X (2) yk xk , xk .

(10.2)

k=1

219

c Stefan M. Moser, vers. 2.5


220

The Multiple-Access Channel


(1)
(2)
Definition 10.2. An enR , enR , n coding scheme for a DM-MAC consists
of two sets of indices

(1)
M(1) = 1, 2, . . . , enR
,

(2)
M(2) = 1, 2, . . . , enR

(10.3)
(10.4)

called message sets, two encoding functions


n

(10.5)


(2) n

(10.6)

: Y n M(1) M(2) .

(10.7)

(1) : M(1) X (1)

(2)

:M

(2)

and a decoding function


(1)
(2)
The average error probability of an enR , enR , n coding scheme for a
DM-MAC is given as
Pe(n) ,

1
en(R

(1)

+R(2) )

X
(m(1) ,m(2) )
M(1) M(2)


h

Pr (Y1n ) 6= (m(1) , m(2) )
i

M (1) , M (2) = (m(1) , m(2) ) .

(10.8)

Note that we assume here that M (1)


M (2) , that M (i) is uniformly dis(i)
tributed over M , and that the encoders are distributed, i.e., work independently.1

Definition 10.3. A rate pair R(1) , R(2) is said to be achievable for the MAC

(2)
(1)
(n)
if there exists a sequence of enR , enR , n coding schemes with Pe 0
as n .
The capacity region of the MAC is defined to be the closure of the set of
all achievable rate pairs.
Remark 10.4. We include the boundary into the capacity region by definition, even though we will see that sometimes we do not know whether a pair
on the boundary is actually achievable or not. This is consistent with our definition of capacity of a DMC, that was defined as supremum of all achievable
rates, i.e., that also includes the boundary point by definition.
Note that we could think of the interval [0, C] to be the (one-dimensional)
capacity region of a DMC.
1

Note that if the encoders worked together either by a link or because they knew what
the current input message of the other encoder is, then we would have a multiple-input
single-output (MISO) channel, which in the case of discrete alphabets simply leads to a
normal DMC capacity problem.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


10.2. Time-Sharing: Convexity of Capacity Region

10.2

221

Time-Sharing: Convexity of Capacity Region

Before we start with a proper investigation of the MAC, we prove next a


fundamental result that holds for the capacity region of any channel (MAC
or other) as long as the system is synchronized in the sense that there is a
common clock. For simplicity, we state the result only for our case of a MAC.
Proposition 10.5. The capacity region is convex.
Proof: The basic idea of the proof is again based on time-sharing. The
idea is to use two different systems and switch between them. The exact time
of switching can be agreed on in advance so that no additional synchronization
is needed.

(1)
(2)
Consider two sequences of working coding schemes, an enR , enR , n

(1)
(2)
coding scheme and an enR , enR , n coding scheme, where the error proba

(n)
(1) , R
(2)
bility Pe 0 as n for both schemes. Hence, R(1) , R(2) and R

are two points of the capacity region.


Now we fix some 0 1 and choose n1 , bnc and n2 , n n1
for a given blocklength n. During the first n1 transmissions, we use the first

(1)
(2)
en1 R , en1 R , n1 coding scheme, while for the remaining n2 transmissions

(1)
(2)
we use the second en2 R , en2 R , n2 coding scheme. The decoder can split
Y up into a Y1 of length n1 and a Y2 of length n2 and decode them separately
using the corresponding decoding functions.
If n , then both n1 , n2 . Since both coding schemes are assumed
to be working, we see that the error probabilities of both subsequences tend to
zero. Hence, also this mixed coding scheme usually it is called time-sharing
coding scheme is working. Its parameters are as follows:

  0(1)

(1)
(2)
0(2)
(1)
(2)
en1 R en2 R , en1 R en2 R , n1 + n2 = enR , enR , n
(10.9)
where

1
(1) R(1) + (1 )R
(1) ,
n1 R(1) + n2 R
(10.10)
n

1
(2) R(2) + (1 )R
(2) .
R0(2) =
n1 R(2) + n2 R
(10.11)
n


(1) , R
(2) are
Since 0 1 is arbitrary, this shows that if R(1) , R(2) and R
achievable, any convex combination also is achievable, i.e., the capacity region
must be convex.
Once again note that we do not need cooperation between the encoders
apart from the fact that there is a common clock.
R0(1) =

10.3

Some Illustrative Examples for the MAC

In this section we will give a couple of examples of different scenarios that


can occur in the system shown in Figure 10.1, depending on the assumptions
about the conditional channel distribution.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


222

The Multiple-Access Channel

Example 10.6 (Independent BSCs). Assume we have two independent BSCs


as shown in Figure 10.2. We know that X (1) can transmit at a rate of 1Hb (1 )
0
1 1

X (1)

0
1

1
1 1

Y (1)
1
Y

1 2
2

0
Y (2)

0
2
1

X (2)
1 2
1

Figure 10.2: Two independent BSCs form a multiple-access channel.


and X (2) at a rate of 1 Hb (2 ) bits. There is no interference. Hence, the
rectangular rate region shown in Figure 10.3 is achievable.
R(2)
C(2)

C(1)

R(1)

Figure 10.3: The capacity region of the MAC consisting of two independent
BSCs.
Note that even if we allowed for cooperation between the two encoders,
we could not get higher rates. Hence, the region of Figure 10.3 is the capacity

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


10.3. Some Illustrative Examples for the MAC

223

region.

Example 10.7 (Binary Multiplier MAC). Consider a MAC with binary input
and output alphabets X (1) = X (2) = Y = {0, 1} where
Y = X (1) X (2) .

(10.12)

If we choose X (2) = 1 constantly, then Y = X (1) and we can transmit 1 bit


per channel use from user 1. Analogously, if we choose X (1) = 1 constantly,
then Y = X (2) and we can transmit 1 bit per channel use from user 2. Hence,
besides the trivially achievable rate pair (0, 0), we already know two points of
the capacity region: (0, 1 bit) and (1 bit, 0) are both achievable rate pairs. By
time-sharing we can now achieve any linear combination of these three points,
yielding the triangular rate region shown in Figure 10.4.
R(2)
1

R(1)

Figure 10.4: The capacity region of the binary multiplier MAC.


However, since Y is binary, we have R(1) + R(2) log 2 = 1 bit, even if we
allow joint encoding! This proves that the rate region of Figure 10.4 actually
must be the capacity region.

Example 10.8 (Binary Erasure MAC). Consider a MAC with binary input
alphabet X (1) = X (2) = {0, 1} and ternary output alphabet Y = {0, 1, 2}
where
Y = X (1) + X (2)

(10.13)

(normal addition, not modulo 2!). If Y = 0 or Y = 2, there is no ambiguity in


the channel, but Y = 1 can result from either X (1) = 0, X (2) = 1 or X (1) = 1,
X (2) = 0.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


224

The Multiple-Access Channel


1
2

0
1
2

X (2)

1
2

1
2

Figure 10.5: Binary erasure channel (BEC).

Now we argue in the same way as in Example 10.7: If we choose X (2) = 0


constantly, then Y = X (1) and we can transmit 1 bit per channel use from
user 1. Analogously, if we choose X (1) = 0 constantly, then Y = X (2) and we
can transmit 1 bit per channel use from user 2. Hence, besides the trivially
achievable rate pair (0, 0), we know two points of the capacity region: (0, 1 bit)
and (1 bit, 0) are both achievable rate pairs.
But can we do better? Assume for the moment a scheme with R(1) = 1 bit,
i.e., we transmit at full rate for user 1. This is only possible if the codewords
(1)
X(1) contain all possible binary sequences or, in other words, {Xk } is IID
Bernoulli(1/2). To decode these codewords, however, it is not necessary that
(2)
(2)
Xk = 0 always, but it is sufficient that the decoder knows the value of Xk !
Hence, we must first decode for X (2) , so that we then can use the knowledge
of X (2) to decode X (1) . This approach is called successive cancellation and is
very much loved by engineers because it also simplifies the decoding procedure:
Handle one problem at a time. . . !
So, when decoding firstly for X (2) , X (1) will act like noise, i.e., like additive
Bernoulli(1/2)-noise. This means that for user 2, the channel looks as shown
in Figure 10.5.
This is a binary erasure channel (BEC) and has capacity CBEC = 1 21 =
(1)
1
= 1 bit, user 2 can transmit
2 bits. Hence, user 1 can transmit at a rate R
(2)
1
at a rate R = 2 bits, and the decoder will decode X (2) first and then use
this knowledge to decode X (1) .
Of course the roles of user 1 and 2 can also be swapped.
Hence, weknow

two additional achievable rate pairs: 1 bit, 21 bits and 12 bits, 1 bit . By
time-sharing we can now achieve any linear combination of these five points,
yielding the pentagonal rate region shown in Figure 10.6.
We will later show that this achievable region actually is the capacity
region.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


10.4. The MAC Capacity Region

10.4

225

The MAC Capacity Region

We will derive two equivalent forms of the MAC capacity region: C1 and C2 .
The proof works as follows: We will first prove that C1 is achievable, then we
will show that C2 C1 and therefore also achievable, and finally we derive a
converse on C2 . Hence, at that stage we will have shown a situation as shown
in Figure 10.7. Here, all rate pairs in C1 are achievable, all rate pairs outside
of C2 are not achievable, and C2 C1 . This is impossible unless C2 = C1 , i.e.,
in Figure 10.7 the dark-shaded area disappears and both regions are identical.

10.4.1

Achievability of C1

Theorem 10.9 (MAC Capacity Region 1).



Let X (1) , X (2) QX (1) QX (2) and define R QX (1) , QX (2) to be the set
of all rate pairs R(1) , R(2) such that

(2) 
(1)
(1)
X

R
<
I
X
;
Y
,
(10.14)



(10.15)
R(2) < I X (2) ; Y X (1) ,

R(1) + R(2) < I X (1) , X (2) ; Y .


(10.16)
Then the MAC capacity region is given as

[

C1 = convex closure
R QX (1) , QX (2) .

(10.17)

QX (1) QX (2)

As mentioned, we will only prove the achievability of C1 . The converse will


follow by the converse for C2 .
1: Setup: Fix R(1) , R(2) , QX (1) ,X (2) = QX (1) QX (2) , and some blocklength
n.
(1)

2: Codebook Design: Generate enR length-n codewords X(1) (m(1) ),


(1)
m(1) = 1, . . . , enR , where each component is chosen IID QX (1) .
(2)
Independently thereof, generate enR length-n codewords X(2) (m(2) ),
(2)
m(2) = 1, . . . , enR , where each component is chosen IID QX (2) . Reveal both codebooks to encoders and decoder.
3: Encoder Design: To send message m(i) , encoder (i) transmits the
codeword X(i) (m(i) ), i = 1, 2.
4: Decoder Design:
Upon receiving Y, the decoder looks for a pair

(1)
(2)
m
,m

such that


 

X(1) m
(1) , X(2) m
(2) , Y A(n) QX (1) ,X (2) ,Y .
(10.18)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


226

The Multiple-Access Channel

R(2)
1

1
2

1
2

R(1)

Figure 10.6: The capacity region of the binary erasure MAC.

R(2)

C1

contradiction

C2

R(1)
Figure 10.7: Two different capacity regions C1 and C2 of a MAC. We have a
contradiction unless the dark-shaded area disappears, i.e., actually C2 = C1 .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


10.4. The MAC Capacity Region

227


If there is a unique such pair m
(1) , m
(2) , the decoder puts out


m
(1) , m
(2) , m
(1) , m
(2) .
(10.19)
Otherwise the decoder declares an error.
5: Performance Analysis: We define the following events:
n

 
o
Fm(1) ,m(2) ,
X(1) m(1) , X(2) m(2) , Y A(n) QX (1) ,X (2) ,Y .

(10.20)

Then the error probability is given as


Pr(error)
(1)

nR
eX

(2)

nR
eX

m(1) =1 m(2) =1

1
en(R

(1)

+R

(2)






Pr error M (1) , M (2) = m(1) , m(2)
(10.21)

where, using the Union Bound, we can bound as follows:







Pr error M (1) , M (2) = m(1) , m(2)


[


c
(1) (2)
= PrFm
Fm
(1) ,m(2)
(1) ,m
(2) m , m

(m
(1) ,m
(2) )6=(m(1) ,m(2) )
(10.22)




c
(1)
(2)
Pr Fm
m
,
m
(1) ,m(2)
(1)

nR
eX

m
(1) =1
m
(1) 6=m(1)





(1)
(2)
Pr Fm
m
,
m

(1)
(2)
,m

(2)

nR
eX

m
(2) =1
m
(2) 6=m(2)
(1)





(1)
(2)
Pr Fm(1) ,m
(2) m , m
(2)

nR
eX

nR
eX

m
(1) =1
m
(1) 6=m(1)

m
(2) =1
m
(2) 6=m(2)





(1)
(2)
Pr Fm
m
,
m
.

(1)
(2)
,m

(10.23)

We now consider each term individually. Since the transmitted codewords


are jointly distributed with the received sequence Y, by TA-3b,





c
Pr Fm(1) ,m(2) m(1) , m(2) t n, , X (1) X (2) Y .
(10.24)
On the other hand, all other codewords are generated independently of
each other and are not transmitted, i.e., they are also independent of Y.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


228

The Multiple-Access Channel


Therefore, for m
(1) 6= m(1) ,




(1)
(2)
Pr Fm
m
,
m

(1)
(2)
,m
X


QnX (1) x(1) QnX (2) ,Y x(2) , y
=

(10.25)

(n)

(x(1) ,x(2) ,y)A

en(H(X

(1) )

(n)
(x(1) ,x(2) ,y)A

) en(H(X (2) ,Y ))

(10.26)



(1)
(2)
QX (1) ,X (2) ,Y en(H(X )+H(X ,Y ))
= A(n)

(1)
(2)
(1)
(2)
en(H(X ,X ,Y )+) en(H(X )+H(X ,Y ))

(10.27)

))
= en(
(1)
(2)
(1)
(2)
= en(I(X ;X )+I(X ;Y |X ))

(10.30)

=e

n( H(X (1) )H(X (2) ,Y |X (1) )+H(X (1) )+H(X (2) ,Y ))
I(X (1) ;X (2) ,Y

=e

n(I(X (1) ;Y |X (2) ))

(10.28)
(10.29)
(10.31)
(10.32)

Here, in (10.26) we use TA-1b based on the fact that all sequences in the
sum are typical; in (10.28) we use TA-2; and the in the final step (10.32)
we rely on the independence between X (1) and X (2) .
Completely analogously, we derive for m
(2) 6= m(2) :



(2)
(1)

(1)
(2)
en(I(X ;Y |X )) ,
Pr Fm(1) ,m
m
,
m

(2)
and similarly, we get for m
(1) 6= m(1) , m
(2) 6= m(2) :




(1)
(2)
Pr Fm
(1) ,m
(2) m , m
X


QnX (1) x(1) QnX (2) x(2) QnY (y)
=

(10.33)

(10.34)

(n)
(x(1) ,x(2) ,y)A

en(H(X

X
(n)

(x(1) ,x(2) ,y)A

(1) )

) en(H(X (2) )) en(H(Y )) (10.35)



(1)
(2)
= A(n)
QX (1) ,X (2) ,Y en(H(X )+H(X )+H(Y ))

(1)
(2)
(1)
(2)
en(H(X ,X ,Y )+) en(H(X )+H(X )+H(Y ))
= en(

H(X (1) )H(X (2) |X (1) )H(Y

|X (1) ,X (2) )+H(X (1) )+H(X (2) )+H(Y

(10.36)
(10.37)
))
(10.38)

=e

n(I(X (1) ;X (2) )+I(X (1) ,X (2) ;Y ))

= en(

I(X (1) ,X (2) ;Y

))

(10.39)
(10.40)

Plugging these results back into (10.23) and (10.21) now yields
Pr(error)

  (1)
(1)
(2)
t n, , X (1) X (2) Y + enR 1 en(I(X ;Y |X ))

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


10.4. The MAC Capacity Region

229

 (2)

(2)
(1)
+ enR 1 en(I(X ;Y |X ))
 (1)
 (2)

(1)
(2)
+ enR 1 enR 1 en(I(X ,X ;Y ))

(1)
(1)
(2)
t n, , X (1) X (2) Y + en(R I(X ;Y |X )+)
(2)
(1)
(2)
(2)
(1)
(1)
(2)
+ en(R I(X ;Y |X )+) + en(R +R I(X ,X ;Y )+) .

(10.41)

(10.42)

Note that this error probability will tend to zero for n as long
as the three conditions (10.14)(10.16) are satisfied. This proves the
achievability for a fixed distribution QX (1) QX (2) . We can now freely
choose QX (1) QX (2) , apply time-sharing to get the convex hull, and finally
take the closure because by definition the capacity region includes its
boundaries.
This concludes the achievability proof for C1 .

10.4.2

Capacity Region C2 Being a Subset of C1

Theorem 10.10 (MAC Capacity Region 2).


The MAC capacity region C2 is the closure of the set of all rate pairs
R(1) , R(2) satisfying

(10.43)
R(1) I X (1) ; Y X (2) , T ,

(1) 
(2)
(2)
(10.44)
R I X ; Y X , T ,

(1)
(2)
(1)
(2)
R + R I X , X ; Y T ,
(10.45)
for some choice of the joint distribution
QT,X (1) ,X (2) ,Y = QT QX (1) |T QX (2) |T QY |X (1) ,X (2) .

(10.46)

Here T is an auxiliary random variable taking value in an alphabet T


where |T | = 2.
To prove that C2 C1 , note that






I X (1) ; Y X (2) = H Y X (2) H Y X (1) , X (2)




H Y X (2) , T H Y X (1) , X (2)




= H Y X (2) , T H Y X (1) , X (2) , T


= I X (1) ; Y X (2) , T .

(10.47)
(10.48)
(10.49)
(10.50)

where the inequality (10.48) follows from conditioning that reduces entropy,
and where the equality (10.49) holds because given X (1) and X (2) , we have
Y
T.
Similarly we can show




I X (2) ; Y X (1) I X (2) ; Y X (1) , T
(10.51)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


230

The Multiple-Access Channel

and



I X (1) , X (2) ; Y = H(Y ) H Y X (1) , X (2)


H(Y |T ) H Y X (1) , X (2)


= H(Y |T ) H Y X (1) , X (2) , T

= I X (1) , X (2) ; Y T .

(10.52)
(10.53)
(10.54)
(10.55)


Hence, if R(1) , R(2) C2 , then




R(1) I X (1) ; Y X (2) , T I X (1) ; Y X (2) ,




R(2) I X (2) ; Y X (1) , T I X (2) ; Y X (1) ,



(1)
R + R(2) I X (1) , X (2) ; Y T I X (1) , X (2) ; Y

(10.56)
(10.57)
(10.58)


and therefore, comparing with (10.14)(10.16), we see that R(1) , R(2) C1 ,
too. Hence, C2 C1 and all rate pairs in C2 must be achievable.
The only missing point is the bound on the alphabet size of the auxiliary random variable T . A first bound follows from Caratheodorys Theorem
(Theorem 1.20): We write the 3-dimensional tuple






I X (1) ; Y X (2) , T , I X (2) ; Y X (1) , T , I X (1) , X (2) ; Y T
as convex combination

X




QT (t) I X (1) ; Y X (2) , T = t , I X (2) ; Y X (1) , T = t ,
tT



I X (1) , X (2) ; Y T = t .

(10.59)

By Theorem 1.20 we hence see that it is sufficient if |T | 3 + 1 = 4.


However, this bound can actually be improved to |T | = 2. To see this,
(a)
(a)
(b)
(b)
consider for each choice of QX (1) , QX (2) and QX (1) , QX (2) the convex closure of




(a)
(a)
(b)
(b)
R QX (1) , QX (2) R QX (1) , QX (2)
(where R is as defined in Theorem 10.9). Indeed, the union of all such sets is
closed and convex,and therefore it equals the convex closure of the union of
all R QX (1) , QX (2) . So we see that every point in the MAC capacity region
(i)
(i) 
can be obtained by a convex combination of two points in R QX (1) , QX (2) ,
(i)

(i)

i = a, b, for some choice of QX (1) , QX (2) , i = a, b.

10.4.3

Converse of C2


(1)
(2)
We will next show that any sequence of enR , enR , n coding schemes with

(n)
Pe 0 must have a rate pair R(1) , R(2) C2 .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


10.4. The MAC Capacity Region

231

Recall the
 Fano Inequality (Proposition 1.13) with an observation Y about
M (1) , M (2) :




log 2
(1)
(2) n
(n)
(1)
(2)
H M , M Y1 n
+ Pe R + R
(10.60)
n
, nn ,
(10.61)
(n)

where n 0 as n because Pe 0.
Hence, we have

nR(1) = H M (1)


= I M (1) ; Y1n + H M (1) Y1n


I M (1) ; Y1n + H M (1) , M (2) Y1n

I M (1) ; Y1n + nn


I x(1) M (1) ; Y1n + nn

= I X(1) ; Y + nn

I X(1) ; Y, X(2) + nn



= I X(1) ; X(2) + I X(1) ; Y X(2) + nn


= I X(1) ; Y X(2) + nn




= H Y X(2) H Y X(1) , X(2) + nn
n 
X




H Yk X(2) , Y1k1 H Yk X(1) , X(2) , Y1k1 + nn
=

(10.62)
(10.63)
(10.64)
(10.65)
(10.66)
(10.67)
(10.68)
(10.69)
(10.70)
(10.71)
(10.72)

k=1

n 
X
k=1
n 
X
k=1
n
X





(1) (2)
+ nn
H Yk X(2) , Y1k1 H Yk Xk , Xk

(10.73)





(2)
(1) (2)
H Yk Xk
H Yk Xk , Xk
+ nn

(10.74)




(2)
(1)
I Xk ; Yk Xk
+ nn .

(10.75)

k=1

Here, (10.62) follows from the assumption that M (1) is uniformly distributed
(1)
over {1, . . . , enR }; (10.65) follows from (10.61); in the next step (10.66) we
apply the Data Processing Inequality (Proposition 1.12) where x(1) M (1)
denotes the codeword that is transmitted if the message is M (1) ; in (10.67)
we write X(1) for x(1) M (1) ; in (10.68) we add a random variable to the
arguments of the mutual information, thereby increasing its value; in the
subsequent (10.69) we use since M (1)
M (2) we also have X(1)
X(2) ; in
(10.73) we use the assumption that our DM-MAC is memoryless and used
without feedback; and (10.74) follows from conditioning that reduces entropy.
Hence,
R

(1)

n


1 X  (1)
(2)
I Xk ; Yk Xk
+ n .

(10.76)

k=1

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


232

The Multiple-Access Channel


Similarly, we can show
R(2)

n


1 X  (2)
(1)
I Xk ; Yk Xk
+ n
n

(10.77)

k=1

and


n R(1) + R(2) = H M (1) , M (2)
=I
I

(1)

(10.78)

(2)

M , M ; Y1n

M (1) , M (2) ; Y1n
 (2)
(1)
(1)

I x

,M

(2)

n
Y1

+ nn


M (2) ; Y1n + nn

,x

= I X , X ; Y + nn


= H(Y) H Y X(1) , X(2) + nn
n 
X




=
H Yk Y1k1 H Yk X(1) , X(2) , Y1k1 + nn

+H M

(1)

(1)

k=1
n 
X
k=1
n
X

(2)



(1) (2)
H(Yk ) H Yk Xk , Xk
+ nn



(1)
(2)
I Xk , Xk ; Yk + nn ,

(10.79)
(10.80)
(10.81)
(10.82)
(10.83)
(10.84)
(10.85)
(10.86)

k=1

i.e.,
R(1) + R(2)

n

1 X  (1) (2)
I Xk , Xk ; Yk + n .
n

(10.87)

k=1

Now let T be a RV that is uniformly distributed on {1, 2, . . . , n}, i.e.,


QT (t) = n1 . Moreover, define X (1) to be a RV that describes the T th component of the first codeword x(1) M (1) , i.e.,
(1)

X (1) , XT .

(10.88)

(2)

Similarly, define X (2) , XT and Y , YT . Now we can write the first term
on the RHS of (10.76) as
n
n


 X


1 X  (1)
(2)
(2)
(1)
=
QT (k) I Xk ; Yk Xk , T = k
(10.89)
I Xk ; Yk Xk
n
k=1
k=1
n



X
(2)
(1)
=
QT (k) I XT ; YT XT , T = k
(10.90)
k=1



(2)
=I
XT , T


= I X (1) ; Y X (2) , T .


(1)
XT ; YT

Doing the same with (10.77) and (10.87) finally yields




R(1) I X (1) ; Y X (2) , T + n ,


R(2) I X (2) ; Y X (1) , T + n ,

R(1) + R(2) I X (1) , X (2) ; Y T + n

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(10.91)
(10.92)

(10.93)
(10.94)
(10.95)

10.5. Some Observations and Discussion

233

for some distribution QT QX (1) |T QX (2) |T QY |X (1) ,X (2) . Note that this distribution is defined by our choice of T being uniform, the given coding scheme
with its set of codewords, the uniformly distributed messages M (1) , M (2) and
the given MAC.
Using the arguments shown in the discussion after (10.59), we know that
we can reduce the alphabet of T to a size |T | = 2.

10.5

Some Observations and Discussion

10.5.1

C1 with Fixed Distribution QX (1) QX (2)

Lets consider C1 . For every fixed choice of QX (1) QX (2) , we have given three
fixed numbers:


(10.96)
I1 , I X (1) ; Y X (2) ,
(1) 
(2)
,
(10.97)
I2 , I X ; Y X

I3 , I X (1) , X (2) ; Y .
(10.98)
These three numbers together with the constraints R(1) 0 and R(2) 0
specify a pentagon of achievable rate pairs:

R(1) 0
R(2) 0

(1)
(2)
(10.99)
R I1
R I2

R(1) + R(2) I3
as shown in Figure 10.8, where we have named the corner points A to E.
The coordinates of point A are obviously R(1) , R(2) = (I1 , 0). To find the
coordinates of B, note that in B simultaneously we have R(1) = I1 and R(1) +
R(2) = I3 , i.e.,
R(2) = R(1) + I3
= I1 + I3
=I X

(1)

=I X

(2)

(10.100)
(10.101)

(2)

,X ;Y I X

;Y .

(1)



; Y X (2)

Hence, the coordinates of point B are



 


R(1) , R(2) = I X (1) ; Y X (2) , I X (2) ; Y .

(10.102)
(10.103)

(10.104)

So this means that the pentagon of Figure 10.8 more precisely looks as shown
in Figure 10.9.
Let us discuss Figure 10.9 more in detail. First of all, recall that at the
moment we keep QX (1) QX (2) fixed. So, in order to walk along the borders

of this achievable region, we need to play around with our rates R(1) , R(2) .
For example, if we choose R(2) = 0, i.e., user 2 only has one codeword, then

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


234

The Multiple-Access Channel

I2

I3

R(2)
C

I1
0

B
A

R(1)
0
Figure 10.8: Pentagon of achievable rate pairs.

R(2)


I X (2) ; Y X (1)

I X (2) ; Y


E
I X (1) ; Y

B
R(1)
A


I X (1) ; Y X (2)

Figure 10.9: Pentagon of achievable rate pairs.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


10.5. Some Observations and Discussion

235

the decoder knows X(2) in advance and can use this knowledge for the decoding of X(1) . Hence, we understand (using our knowledge of single-user data
transmission) that the decoder will be able to decode reliably as long as


(10.105)
R(1) < I X (1) ; Y X (2) .
Remark 10.11. Actually, for X (2) = we could do even better:


R(1) < max I X (1) ; Y X (2) = .

(10.106)

But the codeword X(2) is generated QnX (2) , i.e., all different values of will
show up with probability QX (2) (), which then gives



(10.107)
R(1) < EQ (2) I X (1) ; Y X (2) = .
X

The maximum choice (10.106) only occurs if we choose QX (2) such that X (2)
is constant equal to . This we do not include for the moment, as we keep
the distributions fixed.
So we see that in point A the decoder knows X(2) when he decodes X(1) .
However, it is not necessary that R(2) = 0 in order to make sure that the
decoder knows X(2) ! As long as we decode X(2) first and are sure we can do
this reliably, then the system still works. So how large can we choose R(2) ?
Well, we know from standard single-user transmission that we are OK as long
as

R(2) < I X (2) ; Y .
(10.108)
This explains point B! Note that this is again the idea called successive cancellation as already introduced in Example 10.8. The principle is easy: The
decoder decodes the message of one user (with usually smaller rate) first, ignoring the other user completely, i.e., treating the other user like it were noise.
Then, using the knowledge of this first message, it cancels the influence of
this user from the received sequence and decodes the (usually high-rate) message of the second user.

10.5.2

Convex Hull of two Pentagons

The capacity region C1 is defined as the convex hull of all different pentagons
given by some QX (1) QX (2) . Lets investigate such a convex hull using the
example of two different choices of QX (1) QX (2) : QaX (1) QaX (2) with corresponding pentagon C a , and QbX (1) QbX (2) with corresponding pentagon C b . These
two pentagons are depicted in Figure 10.10 together with the convex hull of
C a C b . The reader will note that this convex hull is not anymore a pentagon,
but rather a heptagon.
On the other hand, one could define a sequence of pentagons C defined
by a convex -combination of the five bordering lines of C a and C b , see Figure 10.11. The idea of C is that the five boundaries are convex combinations

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


236

The Multiple-Access Channel

R(2)

convex hull of C a C b
Cb

Ca

R(1)
Figure 10.10: Two pentagons C a and C b and the convex hull of their union.

R(2)

R(1)
Figure 10.11: Definition of a convex -combination of C a and C b .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


10.5. Some Observations and Discussion

237

of the five boundaries of C a and C b :


R

(1)

(1)

Ia1

+ (1

)Ib1

(2)

(2)

Ia2 + (1 )Ib2

R(1) + R(2) Ia3 + (1 )Ib3

(10.109)

Note that C 1 = C a and C 0 = C b .


It seems now obvious that
[


C.
convex hull C a C b =

(10.110)

[0,1]

Unfortunately, this is not true in general as can be seen from the following
example.
Example 10.12. Consider the following two pentagons:
Ca ,


R(1) , R(2) : R(1) 0, R(2) 0, R(1) 10, R(2) 10,
o
R(1) + R(2) 100 ,
n

C b , R(1) , R(2) : R(1) 0, R(2) 0, R(1) 20, R(2) 20,
o
R(1) + R(2) 20 .
PSfrag

(10.111)

(10.112)

These two pentagons and their boundaries are depicted in Figure 10.12.
inactive R(2)
constraint
20

R(2)
20

10

10
Cb

Ca
10

20

R(1)

10

20

R(1)

Figure 10.12: Two pentagons of Example 10.12.


One realizes that these shapes actually are not true pentagons, but rather
a square and a triangle because some of the boundary constraints are inactive.
We also realize that


convex hull C a C b = C b .

(10.113)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


238

The Multiple-Access Channel


1

We now define C as in (10.109) and check the exact shape of C 2 :




1
1
1
C 2 , R(1) , R(2) : R(1) 0, R(2) 0, R(1) 10 + 20 = 15,
2
2
1
1
R(2) 10 + 20 = 15,
2
2

1
1
(1)
(2)
(10.114)
R + R 100 + 20 = 60 .
2
2
This pentagon is depicted in Figure 10.13.
inactive
constraint

R(2)
20

10
1

C2

10

R(1)

20

Figure 10.13: New pentagon derived as a convex combination of the two pentagons of Figure 10.12.
1

2
We now realize
 a that
there are points in C that are not1 element of the
b
convex hull of C C ! For example, note that (15, 15) C 2 , but (15, 15)
/
Cb.

So, Example 10.12 shows that (10.110) is not true in general. However, we
can rescue the situation: The reason why (10.110) does not hold in the above
example is because some of the constraints are not active! Luckily, it is easy
to see that in our case this cannot happen. Because




(10.115)
I X (2) ; Y X (1) = I X (2) ; Y, X (1) I X (2) ; Y ,
(where we have used that X (1)
X (2) ), we have




I1 + I2 = I X (1) ; Y X (2) + I X (2) ; Y X (1)



I X (1) ; Y X (2) + I X (2) ; Y

= I X (1) , X (2) ; Y
= I3 ,
i.e., I1 + I2 I3 . Hence, the third constraint I3 is always active!
Lets make this more formal.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(10.116)
(10.117)
(10.118)
(10.119)

10.5. Some Observations and Discussion

239

Proposition 10.13. For a fixed QX (1) QX (2) , let I1 , I2 , and I3 be defined as


above. Then define
I , (I1 , I2 , I3 )

(10.120)

and the corresponding achievable rate region


n

CI , R(1) , R(2) : R(1) 0, R(2) 0, R(1) I1 , R(2) I2 ,
o
R(1) + R(2) I3 .
(10.121)
Recall that I1 + I2 I3 , i.e., all five inequalities are active!
For two given distributions QaX (1) QaX (2) and QbX (1) QbX (2) with corresponding Ia and Ib , respectively, we then define the convex combination as
I , Ia + (1 )Ib

(10.122)

for 0 1.
Then, the rate region defined by I is given by
CI = CIa + (1 )CIb .

(10.123)

Proof: Let A be a point in CIa , i.e., it satisfies the inequalities for Ia .


Let B be a point in CIb , i.e., it satisfies the inequalities for Ib . Since the
constraints are linear, we immediately see that the convex -combination of
A and B, A + (1 )B, satisfies the convex -combination of the inequalities,
I = Ia + (1 )Ib . Hence,
A + (1 )B CI

(10.124)

CIa + (1 )CIb CI .

(10.125)

and therefore

To prove the reverse, we consider the five extreme points of the pentagonal
region CI , see Figure 10.14.
By definition, any of these extremal points can be written as a convex combination of the corresponding extremal points of CIa and CIb , respectively.
But since this holds true for these extremal points, it must also be true for
any point in CI , and hence
CI CIa + (1 )CIb .

(10.126)

Note that here we rely fundamentally on the fact that I1 + I2 I3 and that
therefore the pentagon in Figure 10.14 really is a pentagon and not a square
such as shown in Figure 10.15. If we could not rely on this fact, our argument
would break down.
The following corollary is a direct consequence of Proposition 10.13.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


240

The Multiple-Access Channel


R(2)
0, I2

I3 I2 , I2

I1 , I3 I1

B
R(1)
A

I1 , 0

E
(0, 0)

Figure 10.14: Five extremal points of CI .


R(2)
C

B
D

R(1)

Figure 10.15: The five extremal points do not actually define the corner points
of a pentagon. This situation cannot happen because I1 + I2
I3 .

Corollary 10.14. The convex hull of the union of all rate regions defined by
some I is equal to the rate region defined by the convex combination of all I
vectors.
In particular this shows once again that C1 = C2 .
Note that in (10.110) the convex hull of C a C b can be achieved by timesharing: For a certain percentage [0, 1] of the time, we use a coding scheme
achieving C a while for the rest of time we use another coding scheme achieving
Cb.
S
The RHS of (10.110), [0,1] C , corresponds to a scheme that usually is
called coded time-sharing: there we choose the input distribution as a random
mixture with a probability [0, 1] of picking QaX (i) and probability 1 of
picking QbX (i) .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


10.5. Some Observations and Discussion

241

As we have proven in Proposition 10.13, for the MAC time-sharing and


coded time-sharing are equivalent. However, in general this is not the case:
We have seen in Example 10.12 that coded time-sharing can yield a larger
region. We will see later in Chapter 16 that for the interference channel coded
time-sharing yields a large achievable rate region than normal time-sharing.

10.5.3

General Shape of the MAC Capacity Region

As we have seen from Theorem 10.9, the MAC capacity region is a convex
hull of the union of many different pentagons. In general this region will look
as shown in Figure 10.16.
R(2)

R(1)
Figure 10.16: General shape of the MAC capacity region.
However, there are some cases where the MAC region is described by a
single pentagon. As an example, we continue with Example 10.8.
Example 10.15 (Continuation of Example 10.8). Recall the binary erasure
MAC from Example 10.8 with binary inputs and a ternary output given by
Y = X (1) + X (2)

(10.127)

(normal addition!). We have already argued that the pentagon given in Figure 10.6 is achievable. We will now show that it actually is the capacity region.
To do so, we will prove the following three statements:

1. If R(1) , R(2) is achievable, then R(1) 1 bit.
Proof: From Theorem 10.10 we know that for some choice of QT QX (1) |T
QX (2) |T we have


R(1) I X (1) ; Y X (2) , T




= H Y X (2) , T H Y X (1) , X (2) , T
{z
}
|

(10.128)
(10.129)

=0

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


242

The Multiple-Access Channel




= H X (1) + X (2) X (2) , T


= H X (1) X (2) , T

(10.130)

log 2 = 1 bit,

(10.132)

(10.131)

where we have used the fact that X (1) is binary.



2. If R(1) , R(2) is achievable, then R(2) 1 bit.
Proof: The proof is analogous to above.

3. If R(1) , R(2) is achievable and R(1) = R(2) = R, then R

3
4

bits.

Proof: From Theorem 10.10 we know that for some choice of QT QX (1) |T
QX (2) |T we have

R(1) + R(2) = 2R I X (1) , X (2) ; Y T


= H(Y |T ) H Y X (1) , X (2) , T
|
{z
}

(10.133)
(10.134)

=0

= H(Y |T )

=H X

(1)

(10.135)
+X

(2)


T .

(10.136)

Now note that from symmetry we can assume that QX (1) |T = QX (2) |T .
(If QX (1) |T and QX (2) |T were not the same, we could use time-sharing
between this asymmetric choice and its flipped version and thereby making the distribution symmetric. Note that the value of the entropy in
(10.136) for any choice of QT QX (1) |T QX (2) |T is identical to the entropy
of the flipped version and therefore also the time-sharing between these
two versions will result in the same entropy.)
Then
H(Y |T ) H(Y )
2

(10.137)
2

= p log p 2p(1 p) log 2p(1 p) (1 p) log(1 p)2

(10.138)

 2
 2
1
1
1 1
1 1

log
2 log 2
2
2
2 2
2 2

2

2
1
1
log 1
1
2
2
3
= bits,
2

(10.139)
(10.140)

where (10.137) follows from conditioning that reduces entropy; where


in (10.138) we set QX (1) (1) = QX (2) (1) = p; and where (10.139) follows
from maximizing over p. Note that this upper bound is achievable if we
choose X (1) and X (2) to be uniform and independent from each other
and T , which proves that it is the maximum.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


10.6. Multiple-User MAC

243
outer bounds

R(2)
1

1
2

1
2

R(1)

Figure 10.17: An achievable rate region of the binary erasure MAC with (partial) outer bounds. Note that we have not yet proven that the
light-shaded area is not achievable.
Note that all these bounds are actually boundary points of the achievable
region given in Figure 10.6. Hence, we now have the situation shown in Figure 10.17. We have drawn arbitrarily shaped light-shaded areas which we have
not yet proven to be outside of the capacity region. However, it is straightforward to argue why these light-shaded areas cannot be achievable: Suppose
for the moment, they were achievable. Then, using the time-sharing convexity
argument, we could also achieve a rate pair




3 3
(1)
(2)
R ,R
= R, R >
,
,
(10.141)
4 4
which is a contradiction to the proven outer bound. Hence, we conclude that
the rate region given in Figure 10.6 must be the capacity region.

We will see below that also the Gaussian MAC has a capacity region in
the shape of a pentagon.

10.6

Multiple-User MAC

The generalization of our model and result to L users is straightforward. The


capacity region is given by the convex closure of all rate L-tuples satisfying

c 
R[L] I X [L] ; Y X [L ] , L {1, 2, . . . , L}
(10.142)
where
R[L] ,

R(i)

(10.143)

iL

and


X [L] , X (i) : i L .

(10.144)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


244

10.7

The Multiple-Access Channel

Gaussian MAC

Even though strictly speaking our proofs are not extendable to the Gaussian
case, we simply believe that the corresponding results hold anyway.
So assume two users independently transmitting codewords X(1) and X(2) ,
respectively, and a receiver that gets a sequence Y where
(1)

(2)

Yk = Xk + Xk + Zk
(10.145)

with {Zk } being IID N 0, 2 . We assume an average-power constraint for
each user i:
n
1 X (i) (i) 2
xk (m ) E(i) ,
n
k=1

10.7.1


(i)
m(i) 1, 2, . . . , enR .

(10.146)

Capacity Region

One can show (exercise!) that the capacity


 region of the Gaussian MAC is
the convex hull of all rate pairs R(1) , R(2) satisfying

R(1) < I X (1) ; Y X (2) ,


(10.147)

(1) 
(2)
(2)
R < I X ; Y X
,
(10.148)

R(1) + R(2) < I X (1) , X (2) ; Y


(10.149)
for some input densities X (1) X (2) satisfying
h
2 i
E X (1)
E(1) ,
h
i
2
E X (2)
E(2) .

(10.150)
(10.151)

The Gaussian MAC would not be Gaussian if we could not actually derive
this capacity region explicitly. . . ! So note that


I X (1) ; Y X (2)




(10.152)
= h Y X (2) h Y X (1) , X (2)
(1) (2) 
(2) 
(1)
(2)
(1)
(2)
(10.153)
= h X + X + Z X
h X + X + Z X , X

(1)
= h X + Z h(Z)
(10.154)

1
1
log 2e E(1) + 2 log 2e 2
(10.155)
2
2


1
E(1)
= log 1 + 2 ,
(10.156)
2

and, similarly,


(1)  1
E(2)

I X ;Y X
log 1 + 2 ,
2



 1
E(1) + E(2)
(1)
(2)
I X , X ; Y log 1 +
.
2
2
(2)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(10.157)
(10.158)

10.7. Gaussian MAC

245

Note that all three upper bounds can simultaneously be achieved if



X (1) N 0, E(1) ,
(10.159)

(2)
(2)
X N 0, E ,
(10.160)
and, of course, X (1)
X (2) . Hence, these upper bounds are the maximum
over all choices of X (1) X (2) .
Theorem 10.16 (Gaussian MAC Capacity Region).
The Gaussian MAC capacity region is given by

 (1) 

E
(1)

,
R C

 (2) 

E
(2)
,
R C

 (1)


E + E(2)
(1)
(2)

R +R C
,

(10.161)
(10.162)
(10.163)

where
C(t) ,

1
log(1 + t).
2

(10.164)

Note that we have here the excellent situation that the maximum possible
(1)
(2) 
rate if full cooperation between the users were allowed, C E +E
, is also
2
achievable without cooperation! (However, without cooperation the maximum
rate is not available at the corners, where one of the two users get the large
majority of the available sum rate.)

10.7.2

Discussion

It is interesting to note that we can write


 (1)



E + E(2)
1
E(1) + E(2)
C
=
log
1
+
2
2
2
 (1)

1
E + E(2) + 2 E(1) + 2
= log
(1)
2
2
E + 2




(2)
E
E(1)
1
1
+ log 1 + 2
= log 1 + (1)
2
2

E + 2


 (1) 
(2)
E
E
= C (1)
+C
.
2
2
E +

(10.165)
(10.166)
(10.167)
(10.168)

Hence, the capacity region looks as shown in Figure 10.18.


We see that successive cancellation works perfectly with the Gaussian
MAC. Consider Point B: The receiver decodes user 2 first, treating user 1

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


246

The Multiple-Access Channel


R(2)
 (2) 
C E2

E(2)
E + 2
(1)

 (1) (2) 
R(1) + R(2) = C E +E
2

B
C

E(1)
E(2) + 2

R(1)

 (1) 
C E2

Figure 10.18: Capacity region of the Gaussian MAC.


as noise. This can be done as long as

(2)
R C

E(2)


.

(10.169)
E(1) + 2
Then, it cancels the codeword from user 2 from the received sequence and
decodes user 1, which works fine as long as
 (1) 
E
(1)
R C
.
(10.170)
2
Note that we have assumed so far the E(1) and E(2) are fixed! (To walk
around in the capacity region of Figure 10.18 we need to play with the rates!)
If we also allow changing the power constraints with a given overall total power
E:
E(1) + E(2) E,

(10.171)

then the capacity region becomes a triangle identical to the capacity of cooperative communication, see Figure 10.19.

10.7.3

CDMA versus TDMA or FDMA

In our current approach we need the decoder to simultaneously decode both


users, i.e., we have a type of code-division multiple-access (CDMA) channel.
This behavior can also be seen if we let the number of users L grow: Assume
each user has the same power Eu . Then the maximum sum rate is


L Eu
(1)
(L)
R + + R = C
.
(10.172)
2
This tends to infinity as L . However, the rate per user tends to zero:


1
L Eu
C
0 as L .
(10.173)
L
2

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


10.7. Gaussian MAC

247

R(2)
E

R(1)
E
Figure 10.19: Capacity region of the Gaussian MAC when we allow changing
the power of the users subject to a total power constraint E.

This behavior is similar to CDMA.


However, in practice, engineers often prefer an easier scheme based on
time-division multiple-access (TDMA) or frequency-division multiple-access
(FDMA). So the question is whether we achieve the same sum rate with these
simpler schemes as with the CDMA-type approach.
To find out, lets fix E(1) and E(2) and choose some blocklength n(1) and
n(2) , nn(1) . We now will use a time-sharing approach: During n(1) channel
uses, only user 1 will transmit, while user 2 remains silent; and during the
remaining n(2) time-steps, only user 2 transmits and user 1 remains silent.
Now we have




n(1) 1
E(1)
1
E(1)
(1)
R =
log 1 + 2 = log 1 + 2 ,
(10.174)
n 2





n(2) 1
E(2)
1
E(2)
(2)
log 1 + 2 = (1 ) log 1 + 2 ,
(10.175)
R =
n 2

(1)

where 0 , nn 1. This is called naive TDMA and performs poorly. But


we can improve it if we note that since user i only uses n(i) of n time-steps,
he can use more power during this time and still satisfy the average-power
constraint. This then leads to the improved TDMA performance
R

(1)



1
E(1)
= log 1 +
,
2
2

(10.176)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


248

The Multiple-Access Channel

R(2) = (1 )



1
E(2)
log 1 +
.
2
(1 ) 2

(10.177)

To be as efficient as CDMA, we need that


R

(1)

+R

(2)





1
1
E(1)
E(2)
+ (1 ) log 1 +
(10.178)
= log 1 +
2
2
2
(1 ) 2


E(1) + E(2)
! 1
= log 1 +
,
(10.179)
2
2

i.e.,
(1 ) E(1) + E(2) + 2

E(2) + (1 ) 2

(1 ) 2 + (1 )E(1)

!
.

(1 ) 2 + E(2)

(10.180)

This is hard to solve. But considering that it is very unlikely to find solutions
unless both sides are 1, we guess that 1 = 1 . We check and see that this is
actually is possible:
(1 ) E(1) + E(2) + 2

E(2) + (1 ) 2
(1 ) 2 + (1 )E(1)
(1

) 2

(2)

+ E

E(1)

E(1) + E(2)
E(1)

= 1 = =
= 1 = =

E(1) + E(2)

(10.181)

(10.182)

It turns out that this solution really is the only solution to (10.180) apart from
the trivial solutions = 0 and = 1, which both are sum-rate suboptimal.
So we see that TDMA indeed can achieve the same maximum sum rate as
CDMA, however, only in one particular point where the time-sharing ratio is
fixed with the given power distribution. See Figure 10.20.
R(2)

=0

E(1)
E +E(2)
(1)

TDMA for
from 0 to 1
=1

R(1)
Figure 10.20: TDMA can achieve the maximum sum-rate capacity only in one
particular point that is specified by the power distribution.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


10.8. Historical Remarks

249

For FDMA a similar investigation can be made. There we need to use the
continuous-time Gaussian capacity formula:


P(1)
(1)
(1)
R = B log 1 +
,
(10.183)
N0 B(1)


P(2)
(2)
(2)
R = B log 1 +
,
(10.184)
N0 B(2)
where P(i) is the ith users power and where B(i) is the ith users available
bandwidth. Assuming that we have a total bandwidth
B(1) + B(2) = B

(10.185)



P(1) + P(2)
,
= B log 1 +
N0 B

(10.186)

and solving
R

(1)

+R

(1) !

we find that
B(1) =
B(2) =

P(1)
P(1) + P(2)
P(2)
P(1) + P(2)

B,

(10.187)

(10.188)

is the only nontrivial optimal solution.

10.8

Historical Remarks

Not surprisingly, the problem of the multiple-access channel was introduced


by Shannon [Sha61, Section 17]. There Shannon claimed that he had found a
complete and simple solution of the capacity region, but he did not publish
it. The coding region was then developed independently by Ahlswede [Ahl71]
and Liao [Lia72].
In Chapter 14 we will come back to the MAC and consider the more
general scenario where beside the two private messages we also have a common
message that is intended for both receivers. The capacity region for this case
was found by Slepian and Wolf [SW73a].

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Chapter 11

Transmission of Correlated
Sources over a MAC
In Chapter 9 we considered distributed source compression where the two
encoders were working independently from each other, and in Chapter 10
we considered two independent users transmitting information to the same
receiver. It is therefore quite natural to ask the question of what happens if
we combine these two setups.

11.1

Problem Setup

We consider the information transmission system shown in Figure 11.1 where


a distributed joint DMS is transmitted over a discrete memoryless multipleaccess channel.

Dest.

n , V n
U
1
1

Dec.

Y1n

(1) n
1

Xk

Xk

MAC
QY |X (1) ,X (2)

(2) n
1

Enc.

(1)

Enc. (2)

U1n
V1n

QU,V

Figure 11.1: A general system for transmitting a correlated source QU,V over
multiple-access channel QY |X (1) ,X (2) .
Note that we have simplified the system by assuming that both the length
of the source sequences U, V and the length of the transmitted codewords
X(1) , X(2) are equal to n. So throughout the whole system there is only one
clock.
Combining our knowledge about lossless compression from [Mos14] (i.e.,
lossless compression is possible up to the entropy of the IID source sequence)
and about the MAC channel (see Theorem 10.10), we immediately see that

251

c Stefan M. Moser, vers. 2.5


252

Transmission of Correlated Sources over a MAC

we can design a reliable system if

H(U ) < I X (1) ; Y X (2) , T ,



H(V ) < I X (2) ; Y X (1) , T ,

H(U ) + H(V ) < I X (1) , X (2) ; Y T 

(11.1)
(11.2)
(11.3)

for some choice of QT QX (1) |T QX (2) |T . In this case we simply use a lossless
compressor to compress the source sequences individually to their most efficient representation and then apply a standard MAC coding scheme according
to Chapter 10.
However, considering our discussion from Chapter 9, we can do better: If

(2) 
(1)
X , T ,

(11.4)
H(U
|V
)
<
I
X
;
Y



(11.5)
H(V |U ) < I X (2) ; Y X (1) , T ,

H(U, V ) < I X (1) , X (2) ; Y T 


(11.6)
for some choice of QT QX (1) |T QX (2) |T , then reliable transmission is possible
by a combination of SlepianWolf and MAC schemes, see Figure 11.2.
Encoder
X(1)

W (1)

MAC

U
SW-Enc. 1

MAC-Enc. 1

QU,V
QY |X (1) ,X (2)

X(2)

W (2)
MAC-Enc. 2

(1)
W
Y

MAC-Dec.

SW-Dec.
(2)
W

V
SW-Enc. 2

Dest.

Decoder
Figure 11.2: The information transmission system with source channel separation: The joint DMS is first compressed in a distributed manner
according to SlepianWolf, then a MAC coding scheme is applied
for the transmission of the data over the channel.
Note that in both (11.1)(11.3) and (11.4)(11.6) we apply a source channel separation. Is this optimal? Does a source channel separation theorem
exist in this context? Unfortunately, it does not. We can prove this by the
following counterexample.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


11.1. Problem Setup

253

Example 11.1. Let U, V be two binary random variables with the following
joint distribution:
1
QU,V (0, 0) = QU,V (1, 0) = QU,V (1, 1) = ,
3

QU,V (0, 1) = 0.

(11.7)

Hence, H(U, V ) = log 3 1.58 bits.


Further, consider the binary erasure MAC of Examples 10.8 and 10.15:
Y = X (1) + X (2)

(11.8)

(with normal addition). We already know that the capacity region of this
MAC is a pentagon with a maximum sum rate of 1.5 bits.
Hence,

H(U, V ) = 1.58 bits > I X (1) , X (2) ; Y T = 1.5 bits
(11.9)
and according to (11.4)(11.6) this source cannot be transmitted reliably over
the given channel.
However, consider the following coding scheme: Choose n = 1, two encoders
(1)

Xk = Uk

(11.10)

(2)
Xk

(11.11)

= Vk ,

and a decoder
Yk =

(1)
Xk

(2)
Xk

0
= Uk + Vk = 1

=
=
=

k = 0, Vk = 0,
U
k = 1, Vk = 0,
U
k = 1, Vk = 1.
U

(11.12)

This coding scheme (apart from being very simple) works perfectly, i.e., the
probability of error is equal to zero!
The reason why a combination of SlepianWolf and MAC coding schemes
is not optimal lies in the basic assumptions of the setup: For MAC coding we
have always assumed that the two users are transmitting completely independent messages, while in the SlepianWolf situation the clue is that the source
has strong correlation. In other words, our MAC coding scheme is no good
at dealing with information that is contained in both messages and therefore
does not need to be transmitted by both users.
This explains why the third constraint in (11.4)(11.6) is too restrictive.
Note that the first two constraints are satisfied:


2
H(U |V ) = bits < I X (1) ; Y X (2) , T = 1 bit,
(11.13)
3


2
H(V |U ) = bits < I X (2) ; Y X (1) , T = 1 bit.
(11.14)
3
This must be the case because otherwise we would violate the limitations of the
well-understood single-user information transmission: Even if we inform the
first encoder and the decoder about the values of V and X(2) , we cannot beat
constraint (11.4) (and similarly for U and X(1) and the second encoder).

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


254

Transmission of Correlated Sources over a MAC

11.2

A Joint Source Channel Coding Scheme

In a first attempt to soften the constraints (11.4)(11.6), we design a joint


source channel coding scheme and analyze it. Hence, like for (11.1)(11.3)
and (11.4)(11.6), we only have an achievability proof, but no converse.
Theorem 11.2 (Achievable Information Transmission System).
A joint DMS QU,V can be transmitted with arbitrarily
small error proba
bility over a MAC X (1) X (2) , QY |X (1) ,x(2) , Y if

(2)

(1)

(11.15)

H(U |V ) < I X ; Y X , V, T ,
(2)
(1)
H(V |U ) < I X ; Y X , U, T ,
(11.16)

(1)
(2)
H(U, V ) < I X , X ; Y T
(11.17)
for some
QT,U,V,X (1) ,X (2) ,Y = QT QU,V QX (1) |U,T QX (2) |V,T QY |X (1) ,X (2) .
|{z} | {z } | {z } | {z } |
{z
}
timesharing

source

encoder 1

encoder 2

channel

(11.18)

Be aware that in (11.18) the source and the channel are given, while we
can try to find some optimal choice of the encoders and the time-sharing
distribution.
Note that this theorem includes both Theorem 9.4 (SlepianWolf) and
Theorem 10.10 (MAC) as special cases. Too see this, first note that if we
choose a dummy channel

Y = X (1) , X (2) ,
(11.19)
set T = 0, and choose X (1) and X (2) independent of (U, V ) so that
QU,V,X (1) ,X (2) = QU,V QX (1) QX (2) ,

(11.20)


where QX (i) must be such that H X (i) = R(i) , then we get from the first
condition in (11.15)(11.17):


H(U |V ) < I X (1) ; Y X (2) , V, T
(11.21)




(1) (2)
(1)
(2)
= H X X , V, T H X Y, X , V, T
(11.22)
|
{z
}
= 0 (because of (11.19))

(1)



= H X X (2) , V, T

= H X (1)
(because of (11.20))
=R

(1)

H(V |U ) < I X

(2)

(11.23)
(11.24)
(11.25)



; Y X (1) , U, T

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(11.26)

11.2. A Joint Source Channel Coding Scheme


= H X (2)
=R

(2)

(11.27)

;
(1)

(11.28)
(2)


H(U, V ) < I X , X ; Y T



= H X (1) , X (2) T H X (1) , X (2) Y, T
{z
}
|
=0

= H X (1) , X (2)


= H X (1) + H X (2)
=R

(1)

+R

255

(2)

(11.29)
(11.30)
(11.31)
(11.32)
(11.33)

Hence, we have derived SlepianWolf from (11.15)(11.17).


For the MAC, assume that U
V with H(U ) = R(1) and H(V ) = R(2)
(i)
and choose X independent of (U, V ), i.e.,
QT,U,V,X (1) ,X (2) ,Y = QT QU QV QX (1) |T QX (2) |T QY |X (1) ,X (2) .

(11.34)

Then
H(U |V ) = H(U )
=R

(11.35)

(1)

(11.36)

(1)



< I X ; Y X (2) , V, T


= I X (1) ; Y X (2) , T ;
H(V |U ) = H(V )
=R

(11.40)

(2)



< I X ; Y X (1) , U, T


= I X (2) ; Y X (1) , T ;
H(U, V ) = H(U ) + H(V )
=R

<I X

+R
(1)

(11.38)
(11.39)

(2)

(1)

(11.37)

(2)

,X

(2)

(11.41)
(11.42)
(11.43)
(11.44)


; Y T .

(11.45)

We will prove Theorem 11.2 for T = only. The inclusion of a time-sharing


random variable is straightforward.
1: Setup: Fix QX (1) |U and QX (2) |V and some blocklength n.
n
2: Codebook Design:
n For each u Un generate an independent codeword
(1)
(1)
X (u) X
according to QX (1) |U (|u). Similarly, for each v
n
V n generate an independent codeword X(2) (v) X (2) according to
QnX (2) |V (|v).

3: Encoder Design: Upon observing u, encoder 1 transmits X(1) (u). Similarly, upon observing v, encoder 2 transmits X(2) (v).

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


256

Transmission of Correlated Sources over a MAC

4: Decoder Design: Upon observing Y, the decoder tries to find a pair


) such that
(
u, v


, v
, X(1) (
u
u), X(2) (
v), Y A(n) QU,V,X (1) ,X (2) ,Y .

(11.46)

) , (
),
If there is a unique such pair, then the decoder decides (
u, v
u, v
otherwise it declares an error.
5: Performance Analysis: We have
Pr(error)




= Pr error (U, V) A(n)
(QU,V ) Pr (U, V) A(n) (QU,V )

|
{z
}
1





+ Pr error (U, V)
/ A(n)
(QU,V ) Pr (U, V)
/ A(n) (QU,V )

|
{z
} |
{z
}
=1

t (n,,U V)

(11.47)



(QU,V ) + t (n, , U V).
Pr error (U, V) A(n)

(n)

For (u, v) A
Fu,v ,

(11.48)

(QU,V ) define the event


o

(11.49)
X(1) (u), X(2) (v), Y A(n) QU,V,X (1) ,X (2) ,Y u, v

and write


(QU,V )
Pr error (U, V) A(n)

X



=
Pr U = u, V = v (U, V) A(n) (QU,V )
(n)

(u,v)A

(QU,V )

Pr(error|U = u, V = v)
(11.50)



(n)

Pr U = u, V = v (U, V) A (QU,V )

(n)

(u,v)A

(QU,V )

c
PrFu,v

[
(n)
(
u,
v)A (QU,V

Fu ,v

(11.51)

(
u,
v)6=(u,v)

(n)

(u,v)A




Pr U = u, V = v (U, V) A(n)
(QU,V )

(QU,V )

{z

=1

c
Pr Fu,v
+

X
(n)

A
u

(QU,V |v)
6=u
u

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Pr(Fu ,v )

11.2. A Joint Source Channel Coding Scheme

257

(n)

(n)

A
v

Pr(Fu,v ) +

(QU,V |u)
6=v
v

(
u,
v)A (QU,V )
(
u,
v)6=(u,v)

Pr(Fu ,v )

(11.52)

where we have used the Union Bound. We investigate each term on the
RHS of (11.52) separately:


c
Pr Fu,v
t n, , X (1) X (2) Y .
(11.53)
Then,
X

Pr(Fu ,v )

(n)
A (QU,V
u

6=u
u

|v)

(n)

A
u

(1) (2)
(QU,V |v) (x ,x ,y)
(n)
6=u
u
A (Q|
u,v)

<

<

(11.54)

(n)

(1)
(2)
A (Q|
u, v) en(H(X |U )+H(X ,Y |V ))

(n)
A (QU,V
u

6=u
u



QnX (2) ,Y |V x(2) , y v
QnX (1) |U x(1) u

|v)

en(H(X

(1) ,X (2) ,Y

|U,V )+)

(n)

A
u

(11.55)

(QU,V |v)
6=u
u

en(H(X

(1) |U )+H(X (2) ,Y

|V ))

(11.56)



(1)
(2)
(1)
(2)
< A(n)
(QU,V |v) en(H(X |U )+H(X |V )+H(Y |X ,X )+)

(1)
(2)
(2)
en( H(X |U )H(X |V )H(Y |X ,V )+)

)+)
< en(H(U |V )+) en(
(1)
(2)
= en(H(U |V )I(X ;Y |X ,V )+) 0
I(X (1) ;Y

|X (2) ,V

(11.57)
(11.58)
(11.59)

if


H(U |V ) < I X (1) ; Y X (2) , V .

(11.60)

(1)
Here we have made use of our assumptions that
 X only depends on U ,
(2)
(1)
(2)
X only on V , and Y only on X , X
. However, note that when
(2)
Y is conditional on X and V , we cannot drop V , because V is via U
related to X (1) .

The analysis of the next term is accordingly:


X
(2)
(1)
Pr(Fu,v ) < en(H(V |U )I(X ;Y |X ,U )+) 0

(11.61)

(n)

A
v

(QU,V |u)
6=v
v

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


258

Transmission of Correlated Sources over a MAC


if


H(V |U ) < I X (2) ; Y X (1) , U .
Finally, we have
X
(n)
(
u,
v)A (QU,V

(11.62)

Pr(Fu ,v )

(
u,
v)6=(u,v)



QnX (2) |V x(2) v
QnY (y)
QnX (1) |U x(1) u

(1) (2)
(n)
(
u,
v)A (QU,V ) (x ,x ,y)
(n)
(
u,
v)6=(u,v)
u,
v)
A (Q|

<

X
(n)

(
u,
v)A (QU,V )
(
u,
v)6=(u,v)

<

(11.63)




(1)
(n)
, v
en(H(X |U ))
QU,V,X (1) ,X (2) ,Y u
A
en(H(X

(2) |V

en(

))

H(X (1) ,X (2) ,Y

en(H(Y ))

(11.64)

|U,V )+)

(n)

(
u,
v)A (QU,V )
(
u,
v)6=(u,v)

en(H(X

(1) |U )+H(X (2) |V

)+H(Y ))

(11.65)



(1)
(2)
(1)
(2)
(QU,V ) en(H(X |U )+H(X |V )+H(Y |X ,X )+)
< A(n)

en( H(X

(1) |U )H(X (2) |V

)H(Y )+)

(11.66)

)+)
< en(H(U,V )+) en(
(1)
(2)
= en(H(U,V )I(X ,X ;Y )+) 0
I(X (1) ,X (2) ;Y

(11.67)
(11.68)

if

H(U, V ) < I X (1) , X (2) ; Y .

(11.69)

Putting everything together, now finally yields



(1)
(2)
Pr(error) < t n, , X (1) X (2) Y + en(H(U |V )I(X ;Y |X ,V )+)
(2)
(1)
+ en(H(V |U )I(X ;Y |X ,U )+)
+ en(H(U,V )I(X

(1) ,X (2) ;Y

)+)

+ t (n, , U V),

(11.70)

which proves the conditions given in (11.15)(11.17).

11.3

Discussion and Improved Joint Source


Channel Coding Scheme

The region given in Theorem 11.2 is strictly suboptimal. Consider for example
the case U = V . Theorem 11.2 gives

H(U, V ) = H(U ) <
max
I X (1) , X (2) ; Y .
(11.71)
QU QX (1) |U QX (2) |U

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


11.3. Discussion and Improved Joint Source Channel Coding Scheme

259

However, since U = V , both encoders know the message of the other and
therefore they can cooperate! So, we definitely can achieve

H(U ) < max I X (1) , X (2) ; Y ,
(11.72)
QX (1) ,X (2)

which is a less restrictive constraint than (11.71).


We see that the scheme given in Section 11.2 does not deal well with
the case when U and V have a common part: some partial knowledge where
encoder 1 based on u can tell something about v and vice versa, so that both
encoders can partially cooperate.
To try to improve on this, lets arrange QU,V in a block-diagonal form with
as many nonzero blocks as possible; see Figure 11.3. The idea is that whenever
U

V
1

5
6
0
7

Figure 11.3: Block-diagonal form of QU,V demonstrating the common part of


the source.
U block i, then we know for sure that V block i, too, and vice versa. Now
let W be the counter of the blocks. Note that W is uniquely determined by
U or by V alone!
Based on this setup, one can now derive the following achievable joint
source channel coding scheme.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


260

Transmission of Correlated Sources over a MAC

Theorem 11.3 (Achievable Information Transmission System


[CEGS80]).
A joint DMS QU,V with common part QW can be transmitted with arbi-
trarily small error probability over a MAC X (1) X (2) , QY |X (1) ,x(2) , Y
if

(11.73)
H(U |V ) < I X (1) ; Y X (2) , V, S ,

(11.74)
H(V |U ) < I X (2) ; Y X (1) , U, S ,


(1)
(2)

H(U, V |W ) < I X , X ; Y W, S ,
(11.75)

H(U, V ) < I X (1) , X (2) ; Y 


(11.76)
for some
QS,U,V,W,X (1) ,X (2) ,Y
= QW QU,V |W QS QX (1) |U,S QX (2) |V,S QY |X (1) ,X (2) . (11.77)
|{z} | {z } | {z } |
{z
}
|
{z
}
source with common part

auxiliary
RV

encoder 1

encoder 2

channel

Be aware that in (11.77) the source with common part and the channel
are given, while we can try to find some optimal choice of the encoders and
the auxiliary distribution.
Proof: We omit the proof. It can be found in [CEGS80].
Note that it can be shown that this region is already convex, i.e., we do
not need a time-sharing variable T . Also it can be shown that


|S| min |X (1) | |X (2) |, |Y|
(11.78)
is sufficient.
The region given by (11.73)(11.76) is better than (11.15)(11.17) because
here we allow a dependence between X (1) and X (2) via S. (To be able to
properly compare between (11.73)(11.76) and (11.15)(11.17), it is best to
remove the time-sharing variable in (11.15)(11.17), as this only gives convexification. Without T (and conditionally on (U, V )) we see that X (1) and X (2)
in (11.15)(11.17) are indeed independent.)
Unfortunately, this region still is strictly too small. This can be seen very
easily when realizing that our counterexample in Example 11.1 still is not
included in this region: For the source given in Example 11.1 we do not have
any common part!

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Chapter 12

Channels with Noncausal


Side-Information:
GelfandPinsker Problem
This chapter deals with a problem setup that is in some sense dual to
the problem given in Chapter 8: While in Chapter 8 we considered sideinformation in the situation of source compression where the side-information
was only available at the decoder side, here we deal with side-information in
the case of data transmission where the side-information is only available at
the encoder side. We will see that the arguments and the structure of this
chapter follow Chapter 8 closely.

12.1

Introduction

Consider the single-user channel coding problem shown in Figure 12.1.


QS

S
Dest.

Decoder

QY |X,S

S
X

Encoder

Uniform
Source

Figure 12.1: The GelfandPinsker problem: a channel coding system where


the encoder has access noncausally to interference in the channel.
Here the channel suffers from interference in form of a random noise sequence S1n QnS . The main point of this problem is that the encoder knows
the value of this interference noncausally in advance before transmission begins, but the decoder has no access to S.

261

c Stefan M. Moser, vers. 2.5


262

Channels with Noncausal Side-Info (GelfandPinsker)

At first thought and similar to Chapter 8, this might again seem strange.
Why should the encoder have access to side-information, but the decoder not?
However, also here exist some important practical situations where we have
exactly this constellation:
In a broadcast channel, two messages are intended for two receivers,
where the message for receiver 1 can be regarded as unwanted interference for user 2 and vice versa. This interference is known noncausally
to the encoder in advance, but is not known to the decoders.
Consider the situation of burning a rewritable CD. If the CD has been
burned before, it already contains data that might not be completely
removable and that will cause distortion later on when the CD is read
again. Before re-burning the CD, the encoder can first read the contents
of the CD and then take the existing noise into account for the encoding
of the given data. The reader, on the other hand, will have no way of
knowing what the original, but now overwritten contents of the CD has
been. This problem is usually known as dirty paper coding. We will
discuss this more in detail in Section 12.6.
More formally, we have the following definitions.
Definition 12.1. A discrete memoryless channel (DMC) with interference
consists of an input alphabet X , an output alphabet Y, an interference alphabet S, and a conditional probability distribution QY |X,S such that for any
value of interference sk , the channel output Yk depends only on the current
channel input xk via QY |X,S (|xk , sk ).

Definition 12.2. An enR , n coding scheme for a DMC with interference
consists of a set of indices


M = 1, 2, . . . , enR ,

(12.1)

called message set, an encoding function


: M S n X n,

(12.2)

: Y n M.

(12.3)

and a decoding function


The average error probability of an enR , n coding scheme for a DMC
with interference is given as
Pe(n) ,

1 X
Pr[(Y1n ) 6= m |M = m].
enR
mM

Note that we assume here that M is uniformly distributed over M.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(12.4)

12.2. A Random Coding Scheme

263

Definition 12.3. A rate R is said to be achievable


for the DMC with inter
(n)
ference if there exists a sequence of enR , n coding schemes with Pe 0 as
n .
The capacity of the DMC with interference is defined to be the supremum
of all achievable rates.1

12.2

A Random Coding Scheme

We start with the derivation of a lower bound of capacity by specifying and


analyzing a coding scheme. The derivation follows very closely the proof of
the WynerZiv rate distortion coding scheme with side-information given in
Section 8.2. The main difference to WynerZiv is that here we will apply
the binning in reverse. Concretely, the idea is to separate all codewords
into several bins. The message then selects a bin, the channel interference
that is known as side-information at the encoder chooses the codeword in the
bin. This chosen codeword, however, is not transmitted, but instead it is used
together with the side-information to create the channel input.
The decoder only uses the received sequence to figure out which bin has
been selected at the encoder side.
1: Setup: We need an auxiliary random variable U with some alphabet U.
This RV represents the codeword at the encoder. So, we choose U and a
PMF QU |S , and then compute QU as marginal distribution of QS QU |S .
We further choose a function f : U S X that will be used in the
encoder to create the channel input sequence.
Then we fix some rates R and R0 , and some blocklength n.
0

2: Codebook Design: We create enR enR length-n codewords U(m, v),


0
0
m = 1, . . . , enR and v = 1, . . . , enR , by choosing all nen(R+R ) components
Uk (m, v) independently at random according to QU (). Here m describes
the bin and v describes the index of the codeword in this bin. Hence, we
0
have enR bins and enR codewords per bin.
3: Encoder Design: For a given message m and side-information
s, the
0 
encoder searches the bin U(m, 1), . . . , U m, enR
for a codeword that
is jointly typical with the interference s: It tries to find a v such that

U(m, v), s A(n)
(QU,S ).
(12.5)

If it finds several possible choices of v, it picks one. If it finds none, it
chooses v = 1.
The encoder then transmits

X = f n U(m, v), s

(12.6)

1
Note that again we include the boundary into the capacity region, or rather, we define
the capacity as supremum rather than maximum without bothering whether the value of
R = C actually is achievable or not.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


264

Channels with Noncausal Side-Info (GelfandPinsker)


where we use the notation f n to denote that each component of X is
created using the function f (, ), i.e.,

Xk = f Uk (m, v), sk , k = 1, . . . , n.
(12.7)

4: Decoder Design: For a given received sequence Y, the decoder tries to


find a pair (m,
v) such that

U(m,
v), Y A(n) (QU,Y ).
(12.8)
If there is a unique m,
then the decoder puts out m
, m.
If there
are several choices for m,
the decoder declares an error. Note that the
decoder does not care if there are several possible v for a unique bin m.

5: Performance Analysis: For the analysis we distinguish four different


cases that are not necessarily disjoint, but that together cover all possibilities that will lead to an error:
1. The side-information sequence is not typical:
S
/ A(n)
(QS ).


(12.9)

2. The side-information sequence is typical


S A(n)
(QS ),


(12.10)

but the encoder cannot find a v such that (12.5) is satisfied.


3. The side-information sequence is typical, the encoder finds a good
choice v and transmits X according to (12.6), i.e.,
(U(m, v), S) A(n) (QU,S ),

X = f n U(m, v), S ,
but there exists an m
6= m and some v such that

U(m,
v), Y A(n) (QU,Y ).

(12.11)
(12.12)

(12.13)

4. The side-information sequence is typical, the encoder finds a good


choice v and transmits X according to (12.6), i.e.,
(U(m, v), S) A(n) (QU,S ),

X = f n U(m, v), S ,

(12.14)


U(m, v), Y
/ A(n) (QU,Y ).

(12.16)

(12.15)

but

Note that here we ignore the possibility that there might exist another v such that

U(m, v), Y A(n) (QU,Y ),
(12.17)
i.e., our bound on the error probability is definitely too big.
The details of this analysis are given in the following Sections 12.2.1
12.2.5.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


12.2. A Random Coding Scheme

12.2.1

265

Case 1

This case is standard: By TA-3b,




Pr(Case 1) = Pr S
/ A(n) (QS ) t (n, , S).

12.2.2

(12.18)

Case 2

This case we have seen before, see (8.18)(8.27). We have


Pr(Case 2)




(n)
= Pr S A(n)
(Q
)

@
v
:
U(m,
v),
S

A
(Q
)
(12.19)
S
U,S



h
i



(n)
(n)
= Pr S A(n)
(Q
)

Pr
@
v
:
U(m,
v)

A
(Q
|S)
S

A
(Q
)

S
U,S
S



(12.20)

0
enR




 Y 
= Pr S A(n)
(QS )
Pr U(m, v)
/ A(n) (QU,S |S) S A(n) (QS )

|
{z
} v=1
1

(12.21)

0
enR

Y
v=1




(QS )
Pr U(m, v)
/ A(n) (QU,S |S) S A(n)


(12.22)

nR
eY


v=1




1 Pr U(m, v) A(n) (QU,S |S) S A(n) (QS )

(12.23)

<

nR
eY


v=1

1 en(I(U ;S)+)
n(I(U ;S)+)

(12.24)

enR0

= 1e


0
exp enR en(I(U ;S)+)


0
= exp en(R I(S;U )) .

(12.25)
(12.26)
(12.27)

Here, (12.24) follows from TC-2, and the inequality (12.26) is due to the
Exponentiated IT Inequality (Corollary 1.10).
So, as long as
R0 > I(U ; S) +

(12.28)

the probability of Case 2 decays double-exponentially fast to zero.

12.2.3

Case 3

We have




(QU,S )
Pr(Case 3) = Pr S A(n) (QS ) U(m, v), S A(n)


c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


266

Channels with Noncausal Side-Info (GelfandPinsker)





m
6= m, v : U(m,
v), Y A(n)
(Q
)
U,Y


[ 


U(m,
v), Y A(n)
(QU,Y )
Pr


(12.29)

(12.30)

m,
v
m6
=m

X
m,
v
m6
=m




Pr U(m,
v), Y A(n)
(QU,Y )


(12.31)

en(I(U ;Y ))

(12.32)

m,
v
m6
=m


0
= enR enR 1 en(I(U ;Y ))

n(R+R0 I(U ;Y )+)

(12.33)
(12.34)

Here, in (12.30) we enlarge the set; in (12.31) we apply the Union Bound; and
(12.32) follows from TC.
So, as long as
R + R0 < I(U ; Y ) 

(12.35)

the probability of Case 3 decays exponentially fast to zero.

12.2.4

Case 4

Note that by the definition of jointly typical sets, if (U, Y) is not jointly
typical, then (U, S, X, Y) cannot be jointly typical either. Hence, always
taking into account that X = f n (U, S),
Pr(Case 4)





= Pr
U(m, v), S A(n)
(QU,S ) U(w, v), Y
/ A(n)
(QU,Y )



(12.36)



Pr (U(m, v), S) A(n)
(QU,S )




U(m, v), S, X, Y
/ A(n)
(Q
)
U,S,X,Y




(n)
= Pr U(m, v), S A (QU,S )
|
{z
}

(12.37)






Pr U(m, v), S, X, Y
/ A(n)
(QU,S,X,Y ) U(m, v), S A(n) (QU,S )


(12.38)





(n)
(n)
Pr U(m, v), S, X, Y
/ A (QU,S,X,Y ) U(m, v), S A (QU,S )
(12.39)


h


= 1 Pr U(m, v), S, X, Y A(n)
(Q
)
U,S,X,Y


i

U(m, v), S A(n) (QU,S ) (12.40)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


12.2. A Random Coding Scheme


X

=1

(u,s)
(QU,S )

267





Pr U(m, v) = u, S = s U(m, v), S A(n) (QU,S )

(n)

A




Pr (u, s, x, Y) A(n) (QU,S,X,Y ) U = u, S = s, x = f n (u, s)
X

=1

(u,s)
(QU,S )

(12.41)




(n)

Pr U(m, v) = u, S = s U(m, v), S A (QU,S )

(n)

A




Pr Y A(n) (QU,S,X,Y |u, s, x) U = u, S = s, x = f n (u, s) .

(12.42)

Here the first inequality (12.37) follows because we enlarge the event (the
(n)
(n)
event (U, S, X, Y)
/ A
follows from the event (U, Y)
/ A ).
So we see that we need a lower bound on



Pr Y A(n)
(QU,S,X,Y |u, s, x) U = u, S = s, x = f n (u, s)


(12.43)
= Qn
A(n)
(QU,S,X,Y |u, s, x) x

Y |X

(n)

where we know that (u, s) A (QU,S ) and x = f n (u, s). Note that, as in
Section 8.2.4, we cannot apply TB-3 here because (U, S, X, Y) is not generated
according to QU,S,X,Y , but U is independent of the rest.
Basically, the situation here is identical to Section 8.2.4 and the Markov
Lemma that is proven there (see Remark 8.4). The only difference is that
X is not generated according to a distribution, but rather as a deterministic
function X = f n (U, S). However, since a deterministic function can be viewed
as a special distribution function (that only contains probability values 1 or
0), we can adapt the proof in a straightforward manner.
(n)
We start by noting that since (u, s) A (QU,S ) and since x = f n (u, s),
(n)
we have for any y A (QU,S,X,Y |u, s, x) the following:

|U| |S| |X | |Y|
> Pu,s,x,y (a, b, c, d) QU,S,X,Y (a, b, c, d)

= Pu,s (a, b) I {c = f (a, b)} Py|u,s,x (d|a, b, c)

QU,S (a, b) I {c = f (a, b)} QY |U,S,X (d|a, b, c)

(12.44)
(12.45)

> Pu,s (a, b) I {c = f (a, b)} Py|u,s,x (d|a, b, c)





I {c = f (a, b)} QY |U,S,X (d|a, b, c) (12.46)
Pu,s (a, b) +
|U| |S|

= Pu,s (a, b) I {c = f (a, b)} Py|u,s,x (d|a, b, c) QY |U,S,X (d|a, b, c)


Q
(d|a, b, c) I {c = f (a, b)}
(12.47)
|U| |S| | Y |U,S,X
{z
}
1

Pu,s (a, b) I {c = f (a, b)} Py|u,s,x (d|a, b, c) QY |U,S,X (d|a, b, c)


(12.48)
|U| |S|

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


268

Channels with Noncausal Side-Info (GelfandPinsker)

for all (a, b, c, d) U S X Y. Here, I {} denotes the indicator function defined in (7.50), the first inequality follows because (u, s, x, y) is jointly
typical, and the second inequality follows because (u, s) is jointly typical.
The other direction can be shown accordingly, i.e., we have that any y
(n)
A (QU,S,X,Y |u, s, x) satisfies for all (a, b, c) with Pu,s (a, b)I {c = f (a, b)} >
0 and for all d Y


Py|u,s,x (d|a, b, c) QY |U,S,X (d|a, b, c)


1
1

1+
<
.
|U| |S|
|X | |Y| Pu,s (a, b) I {c = f (a, b)}

(12.49)

This corresponds to (4.94) in the proof of TB-3. The subsequent derivation


(4.96)(4.105) of that proof can now directly be adapted to a definition
n
Fu,s,x , PY |U,S,X Pn (Y|U S X Y) :
o
y
/ A(n)
(QU,S,X,Y |u, s, x) with Py|u,s,x = PY |U,S,X
(12.50)

to show that for any PY |U,S,X Fu,s,x ,


DPu,s,x PY |U,S,X QY |U,S,X

2
log e.
2|U|2 |S|2 |X |2 |Y|2

(12.51)

This then corresponds to (8.48) in the proof of WynerZiv. Since also here
we have a Markov structure (U, S) (
X (
Y , the remainder of the proof
follows then exactly along the lines of (8.50)(8.59) (which is an adapted
version of (4.116)(4.123) of the derivation for TB-3). We hence are able to
show that
Pr(Case 4) t (n, , U S X Y).

12.2.5

(12.52)

Analysis Put Together

We are now ready to combine all these results together. Using the fact that
all four cases combined cover the entire probability space, we use the Union
Bound to get
Pr(error) Pr(Case 1) + Pr(Case 2) + Pr(Case 3) + Pr(Case 4) (12.53)


0
n(R0 I(S;U ))
t (n, , S) + exp e
+ en(R+R I(U ;Y )+)
+ t (n, , U S X Y)

if n is large enough as long as we choose QU |S and f (, ) such that


(
R0 > I(U ; S) + ,
R + R0 < I(U ; Y ) .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(12.54)

(12.55)

(12.56)
(12.57)

12.3. The GelfandPinsker Rate

269

Note that since we are not interested in R0 , we can actually combine (12.56)
and (12.57) to the condition
R < I(U ; Y ) R0 < I(U ; Y ) I(U ; S).

(12.58)

(Note that we also omitted the s and s here, as they can be chosen arbitrarily
small anyway.) Since we are trying to make this condition as loose as possible,
we will then decide to choose QU |X and f (, ) such that the RHS of (12.58)
is maximized.

12.3

The GelfandPinsker Rate

Based on (12.58), we define the GelfandPinsker rate as follows.


Definition 12.4. For some joint distribution
QU,S,X,Y = QS QU |S QX|U,S QY |X,S ,

(12.59)

we define the GelfandPinsker rate as


RGP (QU,S,X,Y ) , I(U ; Y ) I(U ; S).

(12.60)

Note that in the factoring (12.59) QS and QY |X,S are given, while we
can choose QU |S and QX|U,S . Also note the usual problem that we also need
to choose the alphabet of the auxiliary RV U . So, we start with standard
argument based on Caratheodorys Theorem (Theorem 1.20) that limits the
size of U.
Lemma 12.5. Without loss of optimality we can restrict the size of U in the
definition of the GelfandPinsker rate in (12.60) to
|U| |S| |X | + 1.

(12.61)

Proof: The proof is again very similar to the proof of Lemma 7.7. Consider
a given choice of U, QU |S , and QX|U,S , and note that
I(U ; Y ) I(U ; S) = H(Y ) H(Y |U ) H(S) + H(S|U )
(12.62)
X

=
QU (u) H(Y ) H(Y |U = u) H(S) + H(S|U = u) ,
uU

(12.63)
QS,X (s, x) =

X
uU

QU (u)QS,X|U (s, x|u),

s S, x X .

(12.64)

For simplicity of notation and without loss of generality, assume that X =


{1, 2, . . . , |X |} and that S = {1, 2, . . . , |S|}.
Now we define the vector v:


v , I(X; U ) I(Y ; U ), QS,X (1, 1), . . . , QS,X |S|, |X | 1 ,
(12.65)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


270

Channels with Noncausal Side-Info (GelfandPinsker)

and for each u U the vector vu :



vu , H(Y ) H(Y |U = u) H(S) + H(S|U = u),

QS,X|U (1, 1|u), . . . , QS,X|U |S|, |X | 1 u ,

(12.66)

such that by (12.62)(12.64)


v=

QU (u)vu .

(12.67)

uU

We see that v is a convex combination of |U| vectors vu . From Caratheodorys


Theorem (Theorem 1.20) it now follows that we can reduce the size of U to at
most |S| |X | + 1 values (note that v contains |S| |X | components!) without
changing v, i.e., without changing the values of I(U ; Y ) I(U ; S) and without
changing the value of QS,X (, ). This proves the claim.
Also note that the bound (12.61) can actually be reduced by 1 if we use
Theorem 1.22 instead of Theorem 1.20.
Recall that in the standard channel transmission problem, the mutual
information I(X; Y ) between input and output of a DMC is concave in the
input distribution. On first sight, one might think that this also is true for
RGP . In particular, it is tempting to guess that RGP is concave in QU |S .
Unfortunately, this is not the case (see Appendix 12.A). However, we do have
the following.
Lemma 12.6. For the given factorization (12.59), the GelfandPinsker rate
RGP (QU,S,X,Y ) is convex in QX|U,S .
Proof: We fix QS , QU |S , and QY |X,S and regard RGP () a function of QX|U,S
only. Then, obviously, I(U ; S) is constant and I(U ; Y ) can be considered to
describe a DMC from U to Y . Hence we realize that RGP () is convex in QY |U .
Now note that
QY |U (y|u) =

s,x

s,x

X
s,x

X
s,x

QS,X,Y |U (s, x, y|u)

(12.68)

QS|U (s|u) QX|U,S (x|u, s) QY |U,X,S (y|u, x, s)

(12.69)

QS|U (s|u) QX|U,S (x|u, s) QY |X,S (y|x, s)

(12.70)

u,s,x,y QX|U,S (x|u, s),

(12.71)

where in (12.70) we use the factoring (12.59), and in (12.71) we define the
u,s,x,y , QS|U (s|u)QY |X,S (y|x, s). Hence, QY |U is a linear function of QX|U,S ,
which means that RGP () is a convex function of a linear function of QX|U,S .
But this means that RGP () is convex in QX|U,S , as can be seen as follows.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


12.3. The GelfandPinsker Rate

271

Let  7 g1 () be convex, let 7 = g2 () be linear, and define g() ,


g1 g2 () . Then


g 1 + (1 )2 = g1 g2 1 + (1 )2
(12.72)

= g1 g2 (1 ) + (1 )g2 (2 )
(12.73)


g1 g2 (1 ) + (1 )g1 g2 (2 )
(12.74)
= g(1 ) + (1 )g(2 ),

(12.75)

where (12.73) follows from the linearity of g2 () and (12.74) from the convexity
of g1 (). So we see that a convex function of a linear function is convex.
Recall that we have shown in Section 12.2 that the GelfandPinsker rate
RGP is achievable. Since we are allowed to choose QU |S and QX|U,S it is clear
that we would like to maximize RGP with an appropriate choice of QU |S and
QX|U,S . Due to the convexity of RGP in QX|U,S (Lemma 12.6), however, such a
maximization will lead to a conditional distribution with all probability values
being either 1 or 0, i.e., QX|U,S will become a deterministic relation that maps
(U, S) to X.
Remark 12.7. To explain why the maximization over a convex function always will result in a boundary point of the function, we consider the example
of a function f (t) that is convex in t [t0 , t1 ]. From the definition of convexity
we have for any 0 1,



f t0 + (1 )t1 f (t0 ) + (1 )f (t1 ) max f (t0 ), f (t1 ) .
(12.76)
Since any point t [t0 , t1 ] can be expressed as t = t0 + (1 )t1 , we hence
have


max f (t) max f (t0 ), f (t1 ) ,
(12.77)
t[t0 ,t1 ]

where the inequality actually is equality because the upper bound can be
achieved.
This motivates the following definition.
Definition 12.8. The GelfandPinsker capacity is defined as
CGP ,

max

QU |S
f : U SX



I(U ; Y ) I(U ; S) .

(12.78)

From Section 12.2 we know that any rate below the GelfandPinsker capacity is achievable. In Section 12.4 below we will prove the corresponding
converse, i.e., no rate larger than the GelfandPinsker capacity is achievable.
We remark that from Lemma 12.5 we know that it is sufficient to choose
an alphabet U of the random variable U having size of at most
|U| |S| |X | + 1.

(12.79)

It is also interesting to compare CGP with other known setups:

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


272

Channels with Noncausal Side-Info (GelfandPinsker)


To consider a channel without side-information, we simply assume that
S is constant. Then we have


CGP = max
I(U ; Y ) I(U ; S)
(12.80)
| {z }
QU |S
=

f : U SX

max

QU |S
f : U SX

=0

I(U ; Y )

(12.81)

= max I(U ; Y )

(12.82)

= max I(X; Y ) = C,

(12.83)

QU
f : U X
QX

i.e., we are back at the normal capacity.


If the side-information is known both at transmitter and receiver, then
the capacity is
CS = max I(X; Y |S).

(12.84)

QX|S

By conditioning that does not increase entropy and by the Data Processing Inequality (Proposition 1.12) we know that
I(U ; Y ) I(U ; S) = H(U |S) H(U |Y )

(12.85)

H(U |S) H(U |Y, S)

(12.86)

I(X; Y |S).

(12.88)

= I(U ; Y |S)

(12.87)

Hence, we see that CGP is between the capacity without side-information and
the capacity with side-information both at transmitter and receiver.
Before we prove that CGP indeed is the maximum achievable rate, we give
an example of how CGP can look like.
Example 12.9. We consider a binary channel with a ternary state: X = Y =
{0, 1} and S = {0, 1, 2}. The conditional channel law is given for some given
0 p, q 1 as follows.
For S = 0, we have
QY |X,S (1|x, 0) = 1 QY |X,S (0|x, 0) = q,

x = 0, 1,

(12.89)

see Figure 12.2.


For S = 1, we have
QY |X,S (1|x, 1) = 1 QY |X,S (0|x, 1) = 1 q,
see Figure 12.3.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


x = 0, 1,

(12.90)

12.3. The GelfandPinsker Rate

273

1q
0

1q
X

1
q

Figure 12.2: Binary channel with state S = 0.

q
0

q
X

1q

1
1q

Figure 12.3: Binary channel with state S = 1.

1p
0

p
X

1
1p

Figure 12.4: Binary channel with state S = 2.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


274

Channels with Noncausal Side-Info (GelfandPinsker)


For S = 2, we have a BSC:
(
p
QY |X,S (y|x, 2) =
1p

if x 6= y,
if x = y,

(12.91)

see Figure 12.4.


Moreover, the probability distribution of S is
QS (0) = QS (1) = ,
QS (2) = 1 2,

(12.92)
(12.93)

for some given 0 21 .


We start by pointing out that if the encoder chooses to ignore the sideinformation, then the channel will become an average of the three DMCs,
which is a BSC with cross-over probability :
1  = Pr[Y = X = 0]

= q + (1 q) + (1 2)(1 p)

= + (1 2)(1 p).

(12.94)
(12.95)
(12.96)

To find a lower bound on CGP we choose some QU |S and f . We know


that |U| |S| |X | + 1 = 3 2 + 1 = 7, but in this case it turns out that
U = {0, 1} is sufficient. Because of the symmetry of the given channel where
the conditional law given S = 0 and S = 1 are flipped to each other and where
the conditional law given S = 2 is a BSC with optimal input being uniform,
we choose
QU |S (0|0) = 1 QU |S (1|0) = ,

QU |S (0|1) = 1 QU |S (1|1) = 1 ,
1
QU |S (0|2) = QU |S (1|2) = ,
2

(12.97)
(12.98)
(12.99)

for some 0 1. Moreover, since for S = 0 or S = 1 we have X


Y , but
for S = 2 we have a BSC, we can choose f (, ) to ignore S:
x = f (u, s) = u.

(12.100)

Given these choices, lets compute the corresponding RGP . We start with
I(U ; Y ): From the symmetry of the channel and our choice of distributions, we
must have that the channel from U to Y is a BSC with a crossover probability
 = Pr[Y 6= U ]:
1  = Pr[Y = U ]
(12.101)
X
=
QS (s) QU |S (u|s) QX|U,S (x|u, s) QY |X,S (y|x, s) I {y = u}
|
{z
}
u,s,x,y
= I {x=u}

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(12.102)

12.3. The GelfandPinsker Rate

275

1
2

1
2

1
S

1 2

2
1
1
2

1
2

Figure 12.5: Channel from S to U .

X
u,s

QS (s) QU |S (u|s) QY |X,S (u|u, s)

(12.103)

= (1 q) + (1 )q + (1 )q + (1 q)
|
{z
} |
{z
}
for s=0

for s=1

1
1
+ (1 2) (1 p) + (1 2) (1 p)
2
2
|
{z
}

(12.104)

for s=2

= 2(1 q) + 2(1 )q + (1 2)(1 p).

(12.105)

Hence, using that the capacity of a BSC with crossover probability  is log 2
Hb (), we get
I(U ; Y ) = log 2 Hb ()

(12.106)

= log 2 Hb (1 )

(12.107)

= log 2 Hb 2(1 q) + 2(1 )q + (1 2)(1 p) . (12.108)

To figure out I(U ; S), we think at a channel S to U . The input distribution


is given by QS , and the channel law by our choice QU |S , see Figure 12.5. Hence,
I(S; U ) = H(U ) H(U |S)
= log 2 H(U |S)

= log 2 Hb () (1 2) log 2 Hb ()

= 2 log 2 2 Hb (),

(12.109)
(12.110)
(12.111)
(12.112)

and therefore
n

log 2 Hb 2(1 q) + 2(1 )q + (1 2)(1 p)
01
o
2 log 2 + 2 Hb () .
(12.113)

CGP sup

Note that with respect to the first Hb -term we should choose to be small,
but with respect to the second Hb -term should be close to 12 . Also note

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


276

Channels with Noncausal Side-Info (GelfandPinsker)

that if we choose = 12 , then this corresponds to the situation when the


encoder completely ignores S, i.e., we
 fall back to the usual capacity expression
C = log 2 Hb + (1 2)(1 p) .
It can be shown that (12.113) actually is the GelfandPinsker capacity
for this channel model.

12.4

Converse

It remains to show that the GelfandPinsker capacity of Definition 12.8 is the


highest achievable rate. To that goal assume we have a sequence of achievable
(n)
systems with rate R such that Pe 0 as n . Again recall the Fano
Inequality (Proposition 1.13) with an observation Y1n about M :
H(M |Y1n ) log 2 + nRPe(n) = nn ,

(12.114)

where n 0 as n .
Since we assume that M is uniform we have H(M ) = log enR , i.e.,
nR = H(M )
=

I(M ; Y1n )
I(M ; Y1n )
n 
X
k=1

(12.115)
+

H(M |Y1n )

(12.116)

+ nn


k

(12.117)

k1

n
; Y0 I M, Skn ; Y0
I M, Sk+1

+ nn .

(12.118)

where we have introduced Y0 , 0. To understand the last equality (12.118),


we write the sum out:


I M, S2n ; Y01
I M, S1n ; Y0
|
{z
}
=0


+ I M, S3n ; Y02
I M, S2n ; Y01


+ I M, S4n ; Y03
I M, S3n ; Y02
+



n
+ I M, Sn ; Y0n1 I M, Sn1
; Y0n2

+ I(M ; Y0n )
I M, Sn ; Y0n1
= I(M ; Y0n ) = I(M ; Y1n ).

(12.119)

Applying the chain rule twice, once with respect to Yk and once with respect
to Sk , we continue with (12.118) as follows.
n


1 X
n
I M, Sk+1
; Y0k I M, Skn ; Y0k1 + n
n
k=1
n



1 X
n
n
=
I M, Sk+1
; Y0k1 + I M, Sk+1
; Yk Y0k1
n
k=1



n
n
I M, Sk+1
; Y0k1 I Sk ; Y0k1 M, Sk+1
+ n

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(12.120)

(12.121)

12.5. Summary

277

n
k1 


1 X
k1
n
n

=
I M, Sk+1 ; Yk Y0
I Sk ; Y0
M, Sk+1 + n
n
k=1
n




1 X
n
H Yk Y0k1 H Yk Y0k1 , M, Sk+1
=
n
|
{z
}
{z
}
|
k=1
H(Yk )



n
H Sk M, Sk+1
|
{z
}

= H(Sk ) because
M
{Sk } and {Sk } IID

(12.122)

, Uk



n
+ n (12.123)
+ H Sk Y0k1 , M, Sk+1
{z
}
|
, Uk

n

1X
H(Yk ) H(Yk |Uk ) H(Sk ) + H(Sk |Uk ) + n
n

(12.124)

1
n


I(Uk ; Yk ) I(Uk ; Sk ) + n

(12.125)



max I(U ; Y ) I(U ; S) + n

(12.126)

1
n

k=1
n
X
k=1
n
X
k=1

QU,X|S



= max I(U ; Y ) I(U ; S) + n
QU,X|S


=
max
I(U ; Y ) I(U ; S) + n
QU |S ,QX|U,S


=
max
I(U ; Y ) I(U ; S) + n
QU |S ,f : U SX

= CGP + n .

(12.127)
(12.128)
(12.129)
(12.130)

Here, in (12.123) we introduce the auxiliary random variable2 Uk , (Y0k1 ,


n ); in (12.126) we maximize over all degrees of freedom, i.e., for the
M, Sk+1
given side-information QS and the given channel QY |X,S we maximize over
the choice of QU,X|S , or, equivalently over QU |S and QX|U,S ; and in (12.129)
we note that by Lemma 12.6 the maximization over QX|U,S will result in a
deterministic function f .
This shows that any working coding scheme indeed cannot beat CGP .

12.5

Summary

The main result of this chapter therefore is as follows.


Theorem 12.10 (GelfandPinsker Coding Theorem [GP80]).
Consider the channel coding problem given in Figure 12.1: a transmitter
tries to transmit a message M over a DMC with interference, where the
realization of the interference is known noncausally at the transmitter, but
is not known at the receiver. For this problem the maximum achievable

Note that we use here a vector notation for Uk , even though there is no fundamental
difference between a random variable and a random vector for finite alphabets.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


278

Channels with Noncausal Side-Info (GelfandPinsker)

transmission rate is given by the GelfandPinsker capacity




CGP , max
I(U ; Y ) I(U ; S) ,
QU |S
f : U SX

(12.131)

where without loss of optimality we can restrict the size of U to


|U| |S| |X | + 1.

(12.132)

This result is named after its discoverers: Mark S. Pinsker (19252003)


was the fifth recipient of the Shannon Award (in 1978). Sergei I. Gelfand3
(born 1944) is a still active researcher working at the Russian Institute for
Information Transmission Problems. He is the son of one of the most famous
mathematician of the 20th century: Isral M. Gelfand, and incidentally, he
did his Ph.D. with his father (his father Isral did his Ph.D. with Andrei
Kolmogorov which closes the circle of great mathematicians and great
information theorists. . . ).

12.6

Writing on Dirty Paper

The possibly most famous application of GelfandPinskers result is its application to a Gaussian setup.
Consider a sequence {Sk } of general (not necessarily Gaussian) IID random variables of finite variance,4 and a memoryless channel where for a given
channel input xk R, the channel output Yk R at time k is given as
Yk = xk + Sk + Zk ,

(12.133)

with

{Zk } IID N 0, 2 ,

{Zk }
{Sk }.

(12.134)

The transmitter has no knowledge of the realization of {Zk } (i.e., input and
noise are independent {Xk }
{Zk }), but the realization of the interference {Sk } is known to the transmitter noncausally before transmission starts.
Moreover, the transmitter is subject to an average-power constraint, i.e., a
codeword of length n must satisfy
n

1X 2
xk E.
n

(12.135)

k=1

There exist two different spellings: Gelfand or Gelfand. We use the spelling that seems
to be more common in information theory.
4
Note that Sk does not even need to be continuous, i.e., discrete, continuous or a mixture
of both is all fine, as long as it is decent enough that the mutual information terms below
make sense. One has to be careful with the differential entropy though! So, for simplicity,
we assume here that Sk is continuous with a proper PDF.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


12.6. Writing on Dirty Paper

279

Max Costa who has introduced this system model [Cos83] called it writing
on dirty paper. The idea is that the transmitter writes its message on a piece
of paper that is pretty dirty so that the written message will be difficult to
read. However, to help transmission, the transmitter can first scan the paper
to learn about the noise on the paper and then adapt the writing to it. The
receiver, on the other hand, has no knowledge about the original dirt on the
paper before the message was written onto it. Additionally, the receiver will
introduce noise when reading the paper.
As a first thought one might think that an easy way of getting rid of the
interference Sk is to simply subtract the known Sk at the transmitter:
k Sk ,
Xk = X

(12.136)

k + Zk ,
k Sk + Sk + Zk = X
Yk = X

(12.137)

which will lead to

i.e., we have reduced the channel model to a Gaussian channel. However, this
approach does not work because of the power constraint (12.135). Note that
we have made no assumption about the interference apart from being IID and
having finite variance. This variance, however, might be far bigger than E
such that (12.136) violates (12.135).
So, we go back to Figure 12.1 and try to adapt the derivation of the
previous sections that were based on our finite alphabet assumption to this
Gaussian setup. This can be done and it is not that difficult to show that the
maximum achievable rate is given as


CGP (E) =
sup
I(U ; Y ) I(U ; S)
(12.138)
U,X|S : E[X 2 ]E


for some given random variable S, and for Y = X + S + Z with Z N 0, 2 .
In the remainder of this section we will now derive the explicit value of CGP (E).
We start with a lower bound. We choose U,X|S = U |S X|U,S as follows:
N (0, E) and for some R we set
For some independent U
+ S,
U ,U

(12.139)

+ S S = U.

X , U S = U

(12.140)

X N (0, E),

(12.141)

Hence, we see that

X
S,

U = X + S.

(12.142)
(12.143)

We remind the reader that our choice of X


S does not mean that the used
codewords X do not depend on S! Here, we only talk about the maximum
achievable rate and try to compute its value and not about the codewords
and side-information sequences. (Also compare this with the coding scheme

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


280

Channels with Noncausal Side-Info (GelfandPinsker)

in the achievability proof in Section 12.2: There it is clearly seen that X is a


function of U and S.)
We now get
CGP (E) I(U ; Y ) I(U ; S)

(12.144)

= h(U |S) h(U |Y )

(12.146)

= h(X|S) h(S + X|X + S + Z)

(12.148)

= h(U ) h(U |Y ) h(U ) + h(U |S)


= h(S + X|S) h(S + X|Y )



= h(X) h S + X (X + S + Z) X + S + Z


= h(X) h (1 )X Z X + S + Z .

(12.145)
(12.147)
(12.149)
(12.150)

Now we choose such that (1 )X Z


X + Z. Since for Gaussian
random variables being independent is identical to being uncorrelated, we need



E (1 )X Z X + Z
 
 
= (1 ) E X 2 + (1 2) E[X] E[Z] E Z 2
2 !

= (1 )E = 0,

(12.151)
(12.152)

i.e.,
=

E
.
E + 2

(12.153)

Moreover, note that since (X, Z)


S, for this choice of also
(1 )X Z
X + Z + S.

(12.154)

Hence, we continue our derivation:




CGP (E) h(X) h (1 )X Z X + S + Z

= h(X) h (1 )X Z


= h(X) h (1 )X Z X + Z


= h(X) h (1 )X Z + (X + Z) X + Z
= h(X) h(X|X + Z)

= I(X; X + Z)


1
E
= log 1 + 2 ,
2

(12.155)
(12.156)
(12.157)
(12.158)
(12.159)
(12.160)
(12.161)

where in the last step we made use of our knowledge of the optimal Gaussian
input of a Gaussian channel.
On the other hand, note that CGP is trivially upper-bounded by a situation
where the receiver also knows the realization of the side-information. In this
scenario, the receiver simply subtracts the value of {Sk } (since the receiver

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


12.7. Different Types of Side-Information

281

is not restricted by any type of power constraint, this is always possible) and
thereby reduces the problem to the standard Gaussian channel. Hence,


1
E
CGP (E) log 1 + 2 .
(12.162)
2

This gives the astonishing result that for the dirty paper channel (12.133)
the interference can be eliminated without loss of rate independently of the
type of interference and the value of its variance!
Theorem 12.11 (Dirty Paper Coding Theorem [Cos83]).
The dirty paper channel capacity (12.133) is given by


1
E
(12.163)
CGP (E) = CGaussian (E) = log 1 + 2
2

irrespectively of S.

12.7

Different Types of Side-Information

We finish this chapter by quickly summarizing different types of side-information without proofs.
Consider a DMC with interference, and distinguish where the interference
is known.
No side-information: If neither transmitter nor receiver have knowledge of the interference, they simply experience a DMC with averaged
channel law:
C = max I(X; Y )

(12.164)

QX

where
QY |X (y|x) =

X
s

QS (s)QY |X,S (y|x, s),

x, y.

(12.165)

Noncausal side-information:
Only at encoder: This is the case discussed in this chapter:


I(U ; Y ) I(U ; S) .
(12.166)
C=
max
QU |S ,f : U SX

Only at decoder: We basically have a DMC from X to (Y, S),


i.e.,
C = max I(X; Y, S).
QX

(12.167)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


282

Channels with Noncausal Side-Info (GelfandPinsker)


But since X
S, we can write this equivalently (and more commonly) as
C = max I(X; Y |S).
QX

(12.168)

At encoder & decoder: Since S is known everywhere, the mutual


information is conditioned on S. Moreover, the encoder can choose
X with respect to S:
C = max I(X; Y |S).
QX|S

(12.169)

Causal side-information:
S1k1 known at encoder: Since {Sk } is IID and the DMC is
memoryless, knowledge of past realizations of {Sk } is useless, i.e.,
C = max I(X; Y ).
QX

(12.170)

S1k known at encoder: This basically corresponds to the situation


described in this chapter with the additional constraint that we
cannot choose the codewords U depending on the side-information
S because it is not known in advance. Hence, we achieve (12.166)
with the additional constraint that U
S, i.e., we get
C=

max

QU ,f : U SX

I(U ; Y ).

(12.171)

S1k known at decoder: This is identical to the case with noncausal


side-information at the decoder, because the decoder can simply
wait with the decoding until it has received the complete sequence
Y1n and therefore also the complete sequence S1n .
S1k known at encoder and decoder: Again, the decoder will
wait with decoding until it knows the complete sequence S1n . Hence,
we get a combination of (12.168) and (12.171):
C=

max

QU ,f : U SX

I(U ; Y |S)

= max I(X; Y |S),


QX|S

(12.172)
(12.173)

where the second equality can be seen by choosing X = U . Hence,


also here we have the same result as for noncausal side-information.

12.A

Appendix: Concavity of GelfandPinsker


Rate in Cost Constraint

We have seen in Lemma 12.6 that (for fixed QS , QU |S , and QY |X,S ) the
GelfandPinsker rate RGP (QU,S,X,Y ) is convex in QX|U,S . It is tempting to

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


12.A. Appendix: Concavity of GelfandPinsker Rate

283

claim that (for fixed QS , QX|U,S , and QY |X,S ) RGP (QU,S,X,Y ) is concave in
QU |S because we know that I(U ; S) is convex in the channel law QU |S , that
I(U ; Y ) is concave in the channel input distribution QU , and that QU is linear in QU |S . Unfortunately, this argument is wrong because for it to hold we
need the channel QY |U to be fixed, which is not the case as it also depends
on QU |S . It turns out that in general RGP (QU,S,X,Y ) is not concave in QU |S !5
However, under the additional assumption of a cost constraint, we can
prove that the GelfandPinsker rate is concave in the cost. This result is
crucial for the derivation of a converse for the GelfandPinsker capacity with
a cost constraint, see (12.138). In the following we will quickly show a prove
of this latter claim.
For simplicity of notation, we will only consider the case of a DMC. So,
given some QS and QY |X,S and some E > 0, we define


CGP (E) ,
max
I(U ; Y ) I(U ; S) .
(12.174)
QU,X|S : E[X 2 ]E

(1)

(2)

For some two values E(1) and E(2) , let QU,X|S and QU,X|S be the PMFs that



achieve CGP E(1) and CGP E(2) , respectively, and let U (i) , X (i) be the
corresponding RVs, i.e.,

(i)
S, U (i) , X (i) QS QU,X|S , i = 1, 2.
(12.175)
Now let Z be a binary RV that is independent of all other random variables
and that takes the value 1 with probability and the value 2 with probability
1 , and define a new pair of RVs (U , X ) as

U , Z, U (Z) ,
(12.176)
X , X (Z) .

(12.177)

Note that
E

X2

h h
 ii
(Z) 2
=E E X
Z
h
i
h

2 i
2
= E X (1)
+ (1 ) E X (2)
= E(1) + (1 )E(2) , E .

(12.178)
(12.179)
(12.180)

Hence, we have
CGP E(1) + (1 )E(2)

= CGP (E )

max

QU,X|S : E[X 2 ]E

(12.181)


I(U ; Y ) I(U ; S)

I(U ; Y ) I(U ; S)


= I Z, U (Z) ; Y I Z, U (Z) ; S

(12.182)
(12.183)
(12.184)

Interestingly, Gelfand and Pinsker actually wrongly claim concavity in their original
paper [GP80]!

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


284

Channels with Noncausal Side-Info (GelfandPinsker)






= H(Y ) H Y Z, U (Z) H(S) + H S Z, U (Z)




H(Y |Z) H Y Z, U (Z) H(S|Z) + H S Z, U (Z)


= I U (Z) ; Y Z I U (Z) ; S|Z






= I U (1) ; Y I U (1) ; S + (1 ) I U (2) ; Y I U (2) ; S


= CGP E(1) + (1 )CGP E(2) .

(12.185)
(12.186)
(12.187)
(12.188)
(12.189)

Here, (12.183) follows by dropping the maximization and choosing one particular input of corresponding cost E , i.e., by choosing (U , X ); in (12.186) we
add conditioning on Z to H(Y ) (which reduces entropy) and to H(S) (which
remains unchanged because S
Z); and the last equality (12.189) holds
because U (i) , X (i) achieves CGP E(i) , i = 1, 2.
This proves that CGP (E) indeed is concave in E.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Chapter 13

The Broadcast Channel


13.1

Problem Setup

The general problem of broadcasting messages from one transmitter to several


receivers is shown in Figure 13.1.

Dest. 1

(0) , M
(1)
M

Dec.

(1)

M (1) Uniform
Source 1

Y(1)
Broadcast
Channel

Dest. 2

(0) , M
(2)
M

Dec. (2)

Enc.

M (0)

Uniform
Source 0

M (2)

Uniform
Source 2

Y(2) Q (1) (2)


Y
,Y
|X

Figure 13.1: A channel coding problem with three independent sources and
two independent destinations: The common message M (0) is intended for both destinations, while the messages M (i) are only
for the corresponding destination i, i = 1, 2. The three sources
are encoded by one common encoder, while each destination has
its own independent decoder. Such a channel model is called
broadcast channel (BC).
A single encoder needs to transmit three messages to two independent
destinations. The common message M (0) must arrive at both receivers, while
the private1 message M (1) is only for destination 1, and the private message
M (2) only for destination 2. The transmission takes place via a so-called
broadcast channel that produces for a single input x two outputs Y (1) and
Y (2) . Such a communication setup has been described first in [Cov72].
More formally, we have the following definitions.
1
Note that we do not consider privacy in a cryptographic context here: We do not care
if a private message can be (or even actually is) decoded by a wrong receiver as long as it
does arrive at its intended receiver!

285

c Stefan M. Moser, vers. 2.5


286

The Broadcast Channel

Definition 13.1. A discrete memoryless broadcast channel (DM-BC) consists


of three alphabets X , Y (1) , Y (2) and a conditional probability distribution
QY (1) ,Y (2) |X such that when it is used without feedback, we have
QY(1) ,Y(2) |X y

(1)

,y

n


 Y
(1) (2)
x =
QY (1) ,Y (2) |X yk , yk xk .

(2)

(13.1)

k=1


(0)
(1)
(2)
Definition 13.2. An enR , enR , enR , n coding scheme for a DM-BC
consists of three sets of indices

(0)
M(0) = 1, 2, . . . , enR
,
(13.2)

(1)
M(1) = 1, 2, . . . , enR
,
(13.3)

(2)
M(2) = 1, 2, . . . , enR
(13.4)
called message sets, an encoding function
: M(0) M(1) M(2) X n ,

(13.5)

and two decoding functions


(1) : Y (1)
(2) : Y

n


(2) n

M(0) M(1) ,
M(0) M(2) .

(13.6)
(13.7)


(0)
(1)
(2)
The error probability of an enR , enR , enR , n coding scheme for a
DM-BC is given as





Pe(n) , Pr (1) Y(1) 6= M (0) , M (1) or (2) Y(2) 6= M (0) , M (2) . (13.8)

Definition 13.3. A rate triple R(0) , R(1) , R(2) is said to be achievable for the

(0)
(1)
(2)
BC if there exists a sequence of enR , enR , enR , n coding schemes with
(n)

0 as n .
The capacity region of the BC is defined to be the closure of the set of all
achievable rate triples.
Pe

Before we start to investigate some achievable schemes, we give a couple


of examples.
Example 13.4. A TV station broadcasting its message to many users is a
degenerate BC because we only have a common message M (0) . We will see
below that the capacity region (for the simplified case of only two receivers)
in this case is



R(0) max min I X; Y (1) , I X; Y (2) .
(13.9)
QX

Note that this usually is less than the capacity of the worse of the two channels:
Since we need to convey on both channels at the same time with only one input
distribution, a distribution QX that is good for the worse channel might not
be good for the good channel.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


13.1. Problem Setup

287

Example 13.5. A lecturer in a classroom: Not every student gets all information. If the lecturer is successful, then good students get more information
than less good students, but the poorer students still should receive enough
to be able to follow. Only a bad lecturer will teach at a pace that corresponds
to the worst student!

Example 13.6. If X is a vector-alphabet with the first component only connected with Y (1) and the second component only connected with Y (2) , then we
have an orthogonal BC with two independent channels. The capacity region
obviously is
(

R(0) + R(1) C(1) ,


R(0) + R(2) C(2) ,

(13.10)
(13.11)

where C(i) are the corresponding single-user capacities of the two independent
channels.

Example 13.7. To understand some of the subtleties of BCs, consider a


speaker that is fluent both in German and Chinese and two listeners that only
understand either German or Chinese, but who can figure out when which
language is spoken. Assume each language has 212 = 4096 words (all equally
likely) and that the speakers speaks at 1 word per channel use. Hence, if the
speaker speaks only German he has a rate of 12 bits per channel use, and the
same is true for Chinese.
So, a fair situation would be to apply time-sharing where half of the time
the speaker speaks German, and the other half Chinese, resulting in a rate of
6 bits per channel use for each user. However, there are many ways of ordering
the words, e.g., the two sentences ich liebe dich and wo ai ni could be
arranged as ich wo liebe ai dich ni or ich liebe wo ai ni dich. Concretely,
we have 63 possibilities. If we now consider sentences with m instead of only
3 words, then we have
 
1
2m
22m Hb ( 2 ) = 22m
m

(13.12)

different possible arrangements, which corresponds approximately to


log2 22m
= 1 bit/channel use.
2m

(13.13)

(Another way to see this is that for every word we always have a choice between
two languages, i.e., we get an additional 1 bit.) This then yields a total of 13
bits/channel use. This is more than single time-sharing!
Note that if we do not apply a 50%50% time-sharing,
but for example

3
a 25%75%, then we only get an additional Hb 4 0.81 bits per channel
use.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


288

The Broadcast Channel

13.2

Some Important Observations

The first important observation concerns shifting of bits between common message and private messages. Since all R(0) bits are available at both receivers,
we can easily define some portion of them to become a private message of any
of the two receivers. These bits will still be decodable at the other receiver,
but once they are not common message anymore, they are simply discarded
at the wrong receiver.
We have the following theorem.

Theorem 13.8. If R(0) , R(1) , R(2) is achievable, then

0 R(0) , R(1) + 1 R(0) , R(2) + 2 R(0)
(13.14)
for 0 , 1 , 2 0 and 0 + 1 + 2 = 1 is also achievable.
The second observation is even more important. It is the foundation of
the degradedness property that will be introduced in Section 13.3.1.
Theorem 13.9. The capacity region of a BC depends only on the conditional
marginal distributions QY (1) |X and QY (2) |X and not on the joint conditional
channel law QY (1) ,Y (2) |X .
Proof: Define


 

 
Pe(n) , Pr (1) Y(1) 6= M (0) , M (1) (2) Y(2) 6= M (0) , M (2)
,

(13.15)

Pe(n),(1)
Pe(n),(2)

h

i
, Pr (1) Y(1) 6= M (0) , M (1) ,
h

i
, Pr (2) Y(2) 6= M (0) , M (2) .

(13.16)
(13.17)

Then, by the Union Bound we have


Pe(n) Pe(n)(1) + Pe(n)(2) ,
(13.18)
 (i) (i) 

and because Y
6= M (0) , M (i) implies
 (1) (1) 
 



Y
6= M (0) , M (1) (2) Y(2) 6= M (0) , M (2) ,
we have


Pe(n) max Pe(n),(1) , Pe(n),(2) .

(13.19)



max Pe(n),(1) , Pe(n),(2) Pe(n) Pe(n)(1) + Pe(n)(2)

(13.20)

Hence,

(n)

(n)(1)

(n)(2)

(n)

and Pe 0 if, and only if, Pe


0 and Pe
0. So, driving Pe to
(n)(1)
(n)(2)
zero is equivalent to driving both Pe
and Pe
to zero, where the latter
two only depend on QY (1) |X and QY (2) |X , respectively, and not on the joint
distribution QY (1) ,Y (2) |X . Hence, the capacity region depends on the marginals
only and not on the joint distribution.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


13.3. Some Special Classes of Broadcast Channels

289

Remark 13.10. Be aware that the error probability does depend on the
(n)
channel law QY (1) ,Y (2) |X , but whether Pe can be made arbitrarily small or
not does not depend on QY (1) ,Y (2) |X except through QY (1) |X and QY (2) |X .
The main consequence of Theorem 13.9 is as follows.
(1) (2) have the
Corollary 13.11. If two different BCs QY (1) ,Y (2) |X and Q
Y ,Y |X
same conditional marginal distributions QY (1) |X and QY (2) |X , then these two
BCs have the same capacity region.
Also note that by the chain rule
QY (1) ,Y (2) |X = QY (1) |X QY (2) |Y (1) ,X ,

(13.21)

i.e.,
 X



QY (2) |X y (2) x =
QY (1) |X y (1) x QY (2) |Y (1) ,X y (2) y (1) , x .

(13.22)

y (1)

Hence, if the conditional marginals are the same, then


X



(2) (1) y (2) y (1) , x
QY (1) |X y (1) x Q
Y |Y ,X
y (1)

X
y (1)




QY (1) |X y (1) x QY (2) |Y (1) ,X y (2) y (1) , x .

(13.23)

Finally, we would like to point out that by the usual time-sharing argument, i.e., the transmitter talks for a certain percentage of the time only to one
receiver and the rest of the time only to the second, it is clear again that the
capacity region must be convex. However, as we have seen in Example 13.7,
time-sharing usually is not efficient and normally the capacity region cannot
be achieved by time-sharing.

13.3

Some Special Classes of Broadcast Channels

13.3.1

Degraded Broadcast Channel

Definition 13.12. A BC is called physically degraded if


QY (1) ,Y (2) |X = QY (1) |X QY (2) |Y (1) ,

(13.24)

i.e., if we have a Markov chain X (


Y (1) (
Y (2) .
It is easiest to think of a degraded BC as a BC where user 2 sees a deteriorated version of the channel of user 1, see Figure 13.2. For example, a
base-station is transmitting to two users that are geographically located on a
straight line, one closer and one further away from the base-station.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


290

The Broadcast Channel

QY (2) |Y (1)

Y (1)

QY (1) |X
Y (1)

Y (2)

Figure 13.2: A physically degraded BC.


Z (1)

Y (1)

Y (2)

Y (1)
Figure 13.3: A physically degraded Gaussian BC.

Example 13.13. A simple example of a physically degraded BC is given in


Figure 13.3. Here, the channel suffers from two independent Gaussian noise
sources:

2
Z (1) N 0, (1)
,
(13.25)

2
V N 0, ,
(13.26)
where Z (1)
V.

Remark 13.14. For a physically degraded BC we have


 X



QY (2) |X y (2) x =
QY (1) |X y (1) x QY (2) |Y (1) y (2) y (1) . (13.27)
y (1)

Definition 13.15. A BC is called stochastically degraded if its conditional


marginal distributions QY (1) |X and QY (2) |X are identical to those of another,
physically degraded BC. We then know from Corollary 13.11 that the capacity
region of a stochastically degraded BC is identical to the capacity region of
the corresponding physically degraded BC.
Let QY (1) ,Y (2) |X be the conditional distribution of a stochastically degraded
(1) (2) be the conditional distribution of the corresponding
BC and let Q
Y ,Y |X
physically degraded BC. We then have
 X



QY (2) |X y (2) x =
QY (1) |X y (1) x QY (2) |Y (1) ,X y (2) y (1) , x
(13.28)
y (1)

(2)
Q
Y |X

 X



(1) y (1) x Q
(2) (1) y (2) y (1)
y (2) x =
Q
Y |X
Y |Y
y (1)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(13.29)

13.3. Some Special Classes of Broadcast Channels

291

(1) and Q (2) = Q


(2) ,
and since QY (1) |X = Q
Y |X
Y |X
Y |X
X




QY (1) |X y (1) x QY (2) |Y (1) ,X y (2) y (1) , x
y (1)

X
y (1)




(2) (1) y (2) y (1) .
QY (1) |X y (1) x Q
Y |Y

(13.30)

We have realized the following.


Lemma 13.16. A BC is stochastically degraded if there exists a distribution
(2) (1) such that
Q
Y

|Y



 X

(2) (1) y (2) y (1) .
QY (1) |X y (1) x Q
QY (2) |X y (2) x =
Y |Y

(13.31)

y (1)

Example 13.17. The most typical Gaussian BC is defined as follows. Given


an input X = x,
(
Y (1) = x + Z (1) ,
(13.32)
(2)
(2)
Y
=x+Z ,
(13.33)


2 , Z (2) N 0, 2 , Z (1)
where Z (1) N 0, (1)
Z (2) , and where we assume
(2)
2 2 .
without loss of generality that (1)
(2)

2 2 , V
Now we define V N 0, (2)
Z (1) , and note that
(1)


2
2
2
2
Z (1) + V N 0, (1)
+ (2)
(1)
= N 0, (2)
.
(13.34)
Hence, if we set

Y (2) , X + Z (1) + V

= X + Z (1) + V
=Y

(13.35)
(13.36)

(1)

+ V,
(13.37)
we see that conditionally on X = x, Y (2) has the same distribution as Y (2)
2 ), but Y
(2) depends
(both are conditional mean-x Gaussian with variance (2)
only on Y (1) and not on X, i.e., we have a Markov structure! This shows
that the Gaussian BC is stochastically degraded and has therefore the same
capacity region as the physically degraded Gaussian BC
(
Y (1) = x + Z (1) ,
(13.38)
(2)
(1)

Y
= x + Z + V.
(13.39)
Note that the two Gaussian BCs are not the same! In the original BC (13.32)
(13.33), we have Z (1)
Z (2) , but in (13.38)(13.39), Z (1)
6 Z (2) because
(2)
(1)
Z = Z + V , i.e.,




Cov Z (1) , Z (2) = E Z (1) Z (1) + V
(13.40)
h
i



2
= E Z (1)
+ E Z (1) E[V ]
(13.41)
2
= (1)
6= 0.

(13.42)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


292

The Broadcast Channel

Note that when we talk about a degraded BC, we refer to a BC that is


either physically or stochastically degraded.

13.3.2

Broadcast Channel with Less Noisy Output

Definition 13.18 ([KM77a]). A BC is said to have a less noisy output if for


every QU,X such that U (
X (
(Y (1) , Y (2) ) it holds that


I U ; Y (1) I U ; Y (2) .
(13.43)
We next show that the degraded BCs constitute a strict subclass of the
less noisy BCs.
Lemma 13.19. Any degraded BC is a BC with less noisy output, but not
necessarily the other way around.
Proof: For a physically degraded BC we have U (
X (
Y (1) (
Y (2)
and therefore (13.43) follows directly from the Data Processing Inequality
(Proposition 1.12). Since (13.43) only depends on the marginals QY (1) |X and
QY (2) |X , but not on the joint distribution QY (1) ,Y (2) |X , we see that (13.43) also
must hold for a stochastically degraded BC.
To see that the class of less noisy BCs is strictly larger than the class of
degraded BCs, consider the following counterexample: Assume X = Y (1) =
Y (2) = {0, 1} and define
!
p
1p
QY (1) |X =
,
(13.44)
p+ 1p
!
1 +  
QY (2) |X = QY (1) |X
(13.45)
1
1
=

1
2

2
2
p
1
2 + 2 + p
p

2 + 2 + p + 

1
2

p
1
2 2 p
p

2 2 p

!


(13.46)

Note that this BC is not degraded because the matrix in (13.45) is not a
stochastic matrix (the rows do sum to 1, but there is a negative entry!).
However, it can be shown that (13.43) is satisfied for this choice.
Exercise 13.20. Finish the details in the derivation of the counterexample in
the proof of Lemma 13.19. Hint: This is not straightforward!

13.3.3

The Broadcast Channel with More Capable Output

Definition 13.21 ([KM77a]). A BC is said to have a more capable output if


for every QX it holds that


(13.47)
I X; Y (1) I X; Y (2) .
Lemma 13.22. Any BC with less noisy output is a BC with more capable
output, but not necessarily the other way around.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


13.3. Some Special Classes of Broadcast Channels

293

Proof: Note that any distribution QX can be generated by the choice


QU = QX and X = U . Hence, if (13.43) holds for every QU , then also (13.47)
must hold for every QU = QX .
To see that the class of more capable BCs is strictly larger than the class
of less noisy BCs, consider the following counterexample: Let Y (1) = Y (2) =
{0, 1} and X = {0, 1, 2} and define

1 0
1 0

QY (1) |X = 0 1 ,
QY (2) |X = 12 21 .
(13.48)
1
2

1
2

1
2

1
2

Then an arbitrary input distribution, described as QX = (1 2p, 2, 2p 2)


for p [0, 1/2] and [0, p], yields
QY (1) = (1 p , p + ),

(13.49)

QY (2) = (1 p, p),

(13.50)

and

I X; Y (1) = Hb (p + ) 2p + 2,
(13.51)

(2)
I X; Y
= Hb (p) 2p.
(13.52)


A quick calculation now shows that I X; Y (2) I X; Y (1) is convex in , i.e.,
the maximum is achieved for = 0 or = p. In the former the difference is 0,
in the latter the difference is Hb (p) Hb (2p) 2p, which in turn is convex in
p and always nonpositive. Hence, we see that


I X; Y (2) I X; Y (1) 0
(13.53)
proving that this BC has a more capable output.
On the other hand, if we choose U to be uniform over {0, 1} and

1
!

2 if x = 0, u = 0 or x = 1, u = 0,
1
1
0
2
2
QX|U (x|u) = 1 if x = 2, u = 1,
=
, (13.54)

0 0 1

0 otherwise
then
QY (1) |U =

1
2
1
2

1
2
1
2

!
,

3
4
1
2

QY (2) |U =

1
4
1
2

!
,

(13.55)

and

QY (1) =


1 1
,
,
2 2


QY (2) =


5 3
,
.
8 8

(13.56)

Hence,


I U ; Y (1) = 0 < I X; Y (2) ,

(13.57)

proving that this BC does not have a less noisy output.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


294

13.4

The Broadcast Channel

Superposition Coding

Next, we will present an achievable coding scheme that is based on a new idea:
superposition coding. In superposition coding, the codewords are arranged in
several separate clouds, see Figure 13.4. The decoder with a bad channel will

cloud center U
codeword X

Figure 13.4: Superposition coding: Codewords are arranged in several clouds


with a cloud center U.
only be able to distinguish the different clouds, while a decoder with a good
channel can also separate the codewords within a cloud.
Theorem 13.23 ([Ber73]). For a broadcast channel with degraded
message set (i.e., a coding scheme
without private message for user 2,

(2)
(0)
(1)
R = 0), a rate triple R , R , 0 is achievable if


(0)
(2)
R
<
I
U
;
Y
,
(13.58)



R(1) < I X; Y (1) U ,
(13.59)


(0)
(1)
(1)
R + R < I X; Y
(13.60)
for some joint distribution QU,X such that U (
X (
(Y (1) , Y (2) ).
Note that as already mentioned in Section 10.2 we assume perfect time
synchronization and therefore can apply time-sharing between two different
coding schemes. Hence, the capacity region must be convex. It can be shown
that the region defined by (13.58)(13.60) is already convex, i.e., here no
additional time-sharing is necessary.
Proof: We prove Theorem 13.23 by creating a random coding scheme.
1: Setup: Fix R(0) , R(1) , QU , QX|U , and some blocklength n.
2: Codebook Design: We generate enR
U(m(0) ) QnU ,

(0)

independent length-n codewords


(0)

m(0) = 1, . . . , enR .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(13.61)

13.4. Superposition Coding

295


(1)
For each U m(0) , we generate enR independent length-n codewords



(1)
(13.62)
X m(0) , m(1) QnX|U U m(0) , m(1) = 1, . . . , enR .
We reveal both codebooks to encoder and decoders.

Note that U m(0) represents the cloud center of the m(0) th cloud, and

X m(0) , m(1) is the m(1) th codeword of the m(0) th cloud.

3: Encoder Design: To send message m(0) , m(1) , the encoder trans
mits the codeword X m(0) , m(1) .
4: Decoder Design: Upon receiving Y(2) , decoder (2) looks for an m
(0)
such that




U m
(0) , Y(2) A(n) QU,Y (2) .
(13.63)
If there is exactly one such m
(0) , the decoder (2) puts out m
(0) , m
(0) .
Otherwise it declares an error.

Upon receiving Y(1) , decoder (1) looks for a pair m
(0) , m
(1) such that





(13.64)
U m
(0) , X m
(0) , m
(1) , Y(1) A(n) QU,X,Y (1) .

If there is exactly one such pair m
(0) , m
(1) , the decoder (1) puts out


m
(0) , m
(1) , m
(0) , m
(1) . Otherwise it declares an error.
5: Performance Analysis: We start with decoder (2) :
Pe(n),(2)
(0)

nR
eX

(1)

nR
eX

m(0) =1 m(1) =1

1
en(R

(0)

+R(1) )






Pr error(2) M (0) , M (1) = m(0) , m(1) .
(13.65)

We define for each m(0)



n

o
(2)
Fm(0) ,
U m(0) , Y(2) A(n) QU,Y (2)

(13.66)

and, using the Union Bound, TB-3, and TC-1, we bound as follows:





Pr error(2) M (0) , M (1) = m(0) , m(1)



(0)
enR




[
(2) c
(2) (0)
(1)
Fm
m
,
m
(13.67)
= Pr Fm(0)

(0)


m
(0) =1

m
(0) 6=m(0)
(0)


c


(2)
Pr Fm(0) m(0) , m(1) +

nR
eX

m
(0) =1
m
(0) 6=m(0)




(2) (0)
(1)
Pr Fm
m
,
m
(13.68)

(0)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


296

The Broadcast Channel


(0)

t n, , U Y


(2)

nR
eX

en(I(U ;Y

(2) )

(13.69)

m
(0) =1
m
(0) 6=m(0)


(0)
(2)
t n, , U Y (2) + enR en(I(U ;Y ))

(13.70)
(13.71)

and thus
Pe(n),(2) 

(13.72)

as long as n is large enough and



R(0) < I U ; Y (2) .

(13.73)

For decoder (1) , we similarly have


Pe(n),(1)
(0)

nR
eX

(1)

nR
eX

m(0) =1 m(1) =1

en(R

(0)

+R(1) )






Pr error(1) M (0) , M (1) = m(0) , m(1)

(13.74)

and we define for each m(0) , m(1)

n
o


(1)
Fm(0) ,m(1) ,
U m(0) , X m(0) , m(1) , Y(1) A(n) QU,X,Y (1) .

(13.75)

Then, again using the Union Bound, we obtain







Pr error(1) M (0) , M (1) = m(0) , m(1)

c
[
(1)
(0) (1)
(1)
m , m
= Pr
Fm
Fm(0) ,m(1)

(0) ,m
(1)

(0)
(1)
,m
)
(m


(0)
(1)
(0)
(1)
,m
)6=(m ,m )
(m
(13.76)
(1)


c


(1)
Pr Fm(0) ,m(1) m(0) , m(1) +
(0)

nR
eX

(1)
nR
X

m
(0) =1
m
(0) 6=m(0)

m
(1) =1

nR
eX

m
(1) =1
m
(1) 6=m(1)




(0) (1)
(1)
Pr Fm(0) ,m
m
,
m

(1)




(0) (1)
(1)
Pr Fm
m
,
m
.

(0) ,m
(1)

(13.77)

The first term corresponds to the case of the jointly generated codewords
and received sequence are not jointly typical. This can be bounded by t

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


13.4. Superposition Coding

297

as usual. In the second term, a wrong codeword inside the correct cloud
is decoded: for m
(1) 6= m(1) ,



(0) (1)
(1)
Pr Fm(0) ,m
m
,
m

(1)
X

QnU (u) QnX|U (x|u) QnY (1) |U y(1) u
=
(13.78)
|
{z
}
(u,x,y(1) )
(n)

A

wrong codeword,
correct cloud!

(QU,X,Y (1) )

(u,x,y(1) )
(QU,X,Y (1) )

en(H(U )) en(H(X|U )) en(H(Y

) (13.79)

(1) |U )

(n)

A



(1)
= A(n)
QU,X,Y (1) en(H(U )+H(X|U )+H(Y |U ))

(1)
(1)
en(H(U,X,Y )+) en(H(U,X)+H(Y |U ))

= en(H(Y
(1)
= en(I(X;Y |U )) .

(1) |U,X)H(Y (1) |U )+

(13.80)
(13.81)
(13.82)
(13.83)

Here the most important step is (13.78), where we need to realize that
Y(1) is generated based on the transmitted X, but not on the wrong
codeword considered here. However, since we do consider the correct
cloud, the cloud center U is related to the received Y(1) .
In the third term, some codeword inside the wrong cloud is decoded: for
m
(0) 6= m(0) and any m
(1) ,



(0) (1)
(1)
Pr Fm
m
,
m

(0) ,m
(1)
X

=
QnU (u) QnX|U (x|u) QnY (1) y(1)
(13.84)
(u,x,y(1) )
(QU,X,Y (1) )

(n)

A



(1)
A(n)
QU,X,Y (1) en(H(U )) en(H(X|U )) en(H(Y ))


(13.85)

n(H(U,X,Y (1) )+)

n(H(U,X)+H(Y (1) ))

= en(H(Y
(1)
= en(I(U,X;Y ))

(1) |U,X)H(Y (1) )+

= en(I(X;Y

(13.86)
(13.87)
(13.88)

),

(1) )

(13.89)

where in the last step we used the Markov structure U (


X (
Y (1) .
Plugging this back into (13.77) and (13.74) we obtain

(1)
(1)
Pe(n),(1) t n, , U X Y (1) + enR en(I(X;Y |U ))
(0)
(1)
(1)
+ en(R +R ) en(I(X;Y ))
(13.90)


(13.91)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


298

The Broadcast Channel


for n large enough and if

R(1) < I X; Y (1) U ,

R(0) + R(1) < I X; Y (1) .

(13.92)
(13.93)

This finishes the proof.


We specialize Theorem 13.23 to a BC with less noisy output.
Corollary
13.24. For a BC with a less noisy output, a rate triple R(0) , R(1) ,

0 is achievable if
(

R(0) < I U ; Y (2) ,
(13.94)

(1)
(1)
U
(13.95)
R < I X; Y
for some QU,X such that U (
X (
(Y (1) , Y (2) ).
Proof: If (13.43) holds, then conditions (13.58) and (13.59) imply


(13.96)
R(0) + R(1) < I U ; Y (2) + I X; Y (1) U


(1)
(1)
U
(13.97)
I U; Y
+ I X; Y

(1)
= I U, X; Y
(13.98)

= I X; Y (1) .
(13.99)
Hence, the third condition (13.60) is unnecessary.
Remark 13.25. It is interesting to note that one can also analyze the performance of the superposition coding scheme of Theorem 13.23 in a different
way. To analyze decoder (1) , we define the following events:

n
o

(1)
,
(13.100)
Q
Fm(0) ,
U m(0) , Y(1) A(n)
(1)

U,Y
o
n




(1)
Fm(0) ,m(1) ,
U m(0) , X m(0) , m(1) , Y(1) A(n) QU,X,Y (1) . (13.101)
Then,
Pe(n),(1)


c
(1)
E Pr Fm(0) ,m(1)

nR(0)

e[

[
(0)
(1)
(1)
M
,
M

Fm(0) ,m


(1)

(1)

m
=1

(1)
(1)
m
6=m
(1)
enR

(1)

Fm
(0)

m
(0) =1
m
(0) 6=m(0)


c
i

(1)
E Pr Fm(0) ,m(1) M (0) , M (1)

(13.102)

+ E

(0)
enR

m
(0) =1
m
(0) 6=m(0)




(1)
(0)
(1)
Pr Fm
M
,
M

(0)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


13.4. Superposition Coding

299

nR(1)

+ E

eX

m
(1) =1
m
(1) 6=m(1)




(0)
(1)
(1)
M
,
M
Pr Fm(0) ,m


(1)

(13.103)


(0)
(1)
(1)
(1)
t n, , U X Y (1) + enR en(I(U ;Y )) + enR en(I(X;Y |U ))

(13.104)

(13.105)

for n large enough and if



R(0) < I U ; Y (1) ,

R(1) < I X; Y (1) U .

(13.106)
(13.107)

Here in (13.102) the first event corresponds to the case that the correct codeword is not recognized, the first union of events corresponds to the case where
any codeword from a wrong cloud is (wrongly) recognized, and the second
union of events corresponds to the case where a wrong codeword from the correct cloud is recognized. Note that we have an inequality in front of (13.102)
because we only check whether the cloud center of a wrong cloud happens to
be typical with the received sequence, and do not bother to check whether or
not there actually exists a codeword in that wrong cloud that is jointly typical
with the cloud center and the received sequence (which is the reason why this
analysis leads to a weaker result).

Hence, we have shown that a rate triple R(0) , R(1) , 0 is achievable if
(



(13.108)
R(0) < min I U ; Y (1) , I U ; Y (2) ,

(1)
(1)
U
(13.109)
R < I X; Y
for some joint distribution QU,X such that U (
X (
(Y (1) , Y (2) ).
For the case of less noisy BCs, this achievable region is identical to the
region given in Corollary 13.24. However, in general, (13.108)(13.109) is
smaller than (13.58)(13.60).
So far, all these results apply to the case when R(2) = 0. However, recalling the observation given in Theorem 13.8, we can generalize them: We can
convert some of the R(0) bits to R(2) bits, i.e., the original R(0) will become
R(0) + R(2) . This then gives the following first main result.
Theorem 13.26 (Achievability based on Superposition).

For a general BC, a rate triple R(0) , R(1) , R(2) is achievable if

R(0) + R(2) < I U ; Y (2) ,


(13.110)


(1)
(1)
R < I X; Y
U ,
(13.111)

R(0) + R(1) + R(2) < I X; Y (1)


(13.112)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


300

The Broadcast Channel

for some QU,X such that U (


X (
(Y (1) , Y (2) ). Without loss of
generality we can restrict the size of U to
|U| |X | + 2.

(13.113)

If the BC has a less noisy output or is degraded, these conditions


simplify to
(

R(0) + R(2) < I U ; Y (2) ,
(13.114)

(1)
(1)
R < I X; Y
U
(13.115)
for some QU,X such that U (
X (
(Y (1) , Y (2) ).
Proof: It only remains to prove the bound on the alphabet size. We invoke
our standard technique: We define




v , I U ; Y (2) , I X; Y (1) U , QX (1), . . . , QX |X | 1 ,
and for each u U






vu , H Y (2) H Y (2) U = u , I X; Y (1) U = u ,

QX|U (1|u), . . . , QX|U |X | 1 u ,

(13.116)

(13.117)

and note that v is a convex combination of the |U| vectors vu :


v=

QU (u)vu .

(13.118)

uU

From Caratheodorys Theorem (Theorem 1.20) it now follows that we can reduce the size of U to at most |X | + 2 values (note that v contains |X | + 1 com-
ponents!) without changing v, i.e., without changing the values of I U ; Y (2)

and I X; Y (1) U , and without changing the value of QX (). Note that if QX

is fixed, also I X; Y (1) remains fixed. This proves the claim.

13.5

NairEl Gamal Outer Bound

In [EG79], El Gamal derived the capacity region of BCs with a more capable
output. The main contribution was a new outer bound, as the inner bound
came from the already known superposition coding. In the following we will
now present a generalization of the main idea in [EG79] to general BCs.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


13.5. NairEl Gamal Outer Bound

301

Theorem 13.27 (NairEl Gamal Outer Bound [NEG07]).



For a general BC, the set of rate triples R(0) , R(1) , R(2) satisfying





R(0) min I W ; Y (1) , I W ; Y (2) ,


(13.119)

R(0) + R(1) I U, W ; Y (1) ,


(13.120)


(0)
(2)
(2)
R + R I V, W ; Y
,
(13.121)





(0)
(1)
(2)
(1)
(2)

R + R + R I U, W ; Y
+ I V ; Y U, W ,
(13.122)





R(0) + R(1) + R(2) I V, W ; Y (2) + I U ; Y (1) V, W


(13.123)
for some joint distribution of the form QU,V,W,X = QU QV QW |U,V
QX|U,V,W constitutes an outer bound to the capacity region.

Proof: We let Z be a RV that is uniformly distributed over {1, 2, . . . , n},


i.e., QZ (z) = n1 , and we define for each k = 1, . . . , n,


(1)
(1)
(2)
Wk , M (0) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(2) ,
(13.124)
Uk , M (1) ,
Vk , M

(2)

(13.125)

(13.126)

(i)

Then, using n to denote terms that tend to 0 as n tends to infinity, we obtain



nR(0) = H M (0)
(13.127)



(0)
(1)
(0) (1)
(13.128)
= I M ;Y
+H M Y
n


X 
(1) (1)
(1)
I M (0) ; Yk Y1 , . . . , Yk1 + nn(1)

(13.129)
k=1

n  





X
(1) (1)
(1)
(1)
(1)
(1)
=
H Yk Y1 , . . . , Yk1 H Yk M (0) , Y1 , . . . , Yk1
k=1

+ nn(1)
(13.130)
n  




X
(1)
(1)
(1)
(1)
(2)
(0)
(2)

H Yk
H Yk M , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn
k=1

+ nn(1)
n  




X
(1)
(1)
=
H Yk
H Yk Wk + nn(1)

(13.131)
(13.132)

k=1

n 

X
(1)
=
I Wk ; Yk
+ nn(1)
k=1
n
X

(13.133)



1 
(1)
I WZ ; YZ Z = k + nn(1)
n
k=1


(1)
= n I WZ ; YZ Z + nn(1)

=n

(13.134)
(13.135)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


302

The Broadcast Channel




(1)
n I Z, WZ ; YZ
+ nn(1) .

(13.136)

Here, (13.129) follows from the Fano Inequality (Proposition 1.13); (13.131)
from conditioning that reduces entropy; and in (13.136) we move Z from the
conditioning into the main argument of the mutual information functional.
In a similar fashion, we bound
nR(0) = H M (0)

(13.137)

(2)



= I M (0) ; Y
+ H M (0) Y(2)
n 


X
(2) (2)

I M (0) ; Yk Yk+1 , . . . , Yn(2) + nn(2)


=

k=1
n 
X
k=1

(13.138)
(13.139)







(2) (2)
(2)
(2)
H Yk Yk+1 , . . . , Yn(2) H Yk M (0) , Yk+1 , . . . , Yn(2)

+ nn(2)
(13.140)
n  




X
(2)
(2)
(1)
(1)
(2)
H Yk M (0) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(2)

H Yk
k=1

+ nn(2)
n  




X
(2)
(2)
H Yk
H Yk Wk + nn(2)
=
=

k=1
n
X



(2)
I W k ; Yk
+ nn(2)

(13.141)
(13.142)
(13.143)

k=1



(2)
= n I WZ ; YZ Z + nn(2)


(2)
n I Z, WZ ; YZ
+ nn(2) ,

(13.144)
(13.145)

where in (13.139) we expanded the chain rule backwards from n to k.


Next, we observe
n R(0) + R(1)

= H M (0) , M (1)

(13.146)


= I M (0) , M (1) ; Y
+ H M (0) , M (1) Y(1)
(13.147)
n  





X
(1) (1)
(1)
(1)
(1)
(1)

H Yk Y1 , . . . , Yk1 H Yk M (0) , M (1) , Y1 , . . . , Yk1



(1)

k=1

+ nn(3)
(13.148)
n  




X
(1)
(1)
(1)
(1)
(2)

H Yk
H Yk M (0) , M (1) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(2)
k=1

+ nn(3)
n  




X
(1)
(1)
=
H Yk
H Yk Wk , Uk + nn(3)
k=1

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(13.149)
(13.150)

13.5. NairEl Gamal Outer Bound


n 

X
(1)
=
I Wk , Uk ; Yk
+ nn(3)

303

(13.151)

k=1



(1)
= n I WZ , UZ ; YZ Z + nn(3)


(1)
n I Z, WZ , UZ ; YZ
+ nn(3)

(13.152)
(13.153)

and similarly,



(2)
n R(0) + R(2) n I Z, WZ , VZ ; YZ
+ nn(4) .

(13.154)

Finally, we obtain
n R(0) + R(1) + R(2)


= H M (0) , M (1) , M (2)



= H M (0) , M (1) + H M (2) M (0) , M (1)



I M (0) , M (1) ; Y(1) + I M (2) ; Y(2) M (0) , M (1) + nn(5)
n  


X
(1) (1)
(1)
=
I M (0) , M (1) ; Yk Y1 , . . . , Yk1

(13.155)
(13.156)
(13.157)

k=1




(2)
(2)
+ I M (2) ; Yk M (0) , M (1) , Yk+1 , . . . , Yn(2) + nn(5)
n  

X
(1)
(1)
(1)
I M (0) , M (1) , Y1 , . . . , Yk1 ; Yk

(13.158)

k=1




(1)
(1)
(2)
(2)
+ I M (2) , Y1 , . . . , Yk1 ; Yk M (0) , M (1) , Yk+1 , . . . , Yn(2)
+ nn(5)
n  

X
(1)
(1)
(2)
(1)
I M (0) , M (1) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(2) ; Yk
=

(13.159)

k=1



(0)
(1)
(1)
I
M , M (1) , Y1 , . . . , Yk1



(1)
(1)
(2)
(2)
+ I Y1 , . . . , Yk1 ; Yk M (0) , M (1) , Yk+1 , . . . , Yn(2)



(2)
(1)
(1)
(2)
+ I M (2) ; Yk M (0) , M (1) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(2)


(2)
(1)
Yk+1 , . . . , Yn(2) ; Yk

+ nn(5)
n  

X
(1)
(1)
(2)
(1)
=
I M (0) , M (1) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(2) ; Yk

(13.160)

k=1




(1)
(1)
(2)
(2)
+ I M (2) ; Yk M (0) , M (1) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(2)
+ nn(5)
n  




X
(1)
(2)
=
I Wk , Uk ; Yk
+ I Vk ; Yk Uk , Wk + nn(5)

(13.161)
(13.162)

k=1






(1)
(2)
= n I WZ , UZ ; YZ Z + n I VZ ; YZ UZ , WZ , Z + nn(5)





(1)
(2)
n I Z, WZ , UZ ; YZ
+ n I VZ ; YZ UZ , Z, WZ + nn(5) .

(13.163)
(13.164)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


304

The Broadcast Channel

Here, (13.161) follows from the following lemma.


Lemma 13.28 (Csisz
arK
orner Identity [CK78, Lemma 7]). For RVs
R, S1n , T1n , we have
n
X

n

n
 X

n
I S1k1 ; Tk Tk+1
,R =
I Tk+1
; Sk S1k1 , R .

k=1

(13.165)

k=1

Proof: Note that


k1
n

 X

n
I S1k1 ; Tk Tk+1
,R =
I Sj ; Tk S1j1 , Tk+1
,R ,

(13.166)

j=1
n
X
k1 


n
n

I Tk+1 ; Sk S1 , R =
I Tj ; Sk S1k1 , Tj+1
,R .

(13.167)

j=k+1

Hence,
n
X

n X
k1
n

 X

n
,R =
I S1k1 ; Tk Tk+1
I Sj ; Tk S1j1 , Tk+1
,R

k=1

k=1 j=1

(13.168)

and
n
X

n
n
X


 X

n
n
,R
; Sk S1k1 , R =
I Sk ; Tj S1k1 , Tj+1
I Tk+1

k=1

k=1 j=k+1

j1
n X
X

(13.169)



n
,R
I Sk ; Tj S1k1 , Tj+1

(13.170)



n
I Sj ; Tk S1j1 , Tk+1
,R .

(13.171)

j=1 k=1

n X
k1
X
k=1 j=1

Very similarly, we also get



n R(0) + R(1) + R(2)





(2)
(1)
n I Z, WZ , VZ ; YZ
+ n I UZ ; YZ VZ , Z, WZ + nn(6) .

(13.172)

Finally, by defining
W , (Z, WZ ),
U , UZ ,
V , VZ

(1)

Y (1) , YZ ,

(13.173)

(2)
YZ ,

(13.174)

(2)

(13.175)

and by letting n tend to infinity, we obtain the claimed inequalities (13.119)


(13.123).

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


13.6. Capacity Regions of Some Special Cases of BCs

305

Remark 13.29. If we specialize Theorem 13.27 for the case R(0) = 0, it looks
as follows:


R(1) I U, W ; Y (1) ,

R(2) I V, W ; Y (2) ,




R(1) + R(2) I U, W ; Y (1) + I V ; Y (2) U, W ,

R(1) + R(2) I V, W ; Y (2)  + I U ; Y (1) V, W .

(13.176)
(13.177)
(13.178)
(13.179)

It has been shown in [NW08] that this outer bound actually is identical to


R(1) I U ; Y (1) ,

R(2) I V ; Y (2) ,



R(1) + R(2) I U ; Y (1) + I X; Y (2) U ,

R(1) + R(2) I V ; Y (2)  + I X; Y (1) V 

(13.180)
(13.181)
(13.182)
(13.183)

without the need of the auxiliary RV W . This shows how difficult it actually
is to really understand these bounds or even to evaluate them! Note that the
latter outer bound was implicitly given in [EG79] already, although it was not
explicitly stated because the paper directly specialized it for the case of BCs
with a more capable output.

13.6

Capacity Regions of Some Special Cases of


BCs

Theorem 13.30 (Capacity Region of a BC with a More Capable


Output [EG79]).
The capacity region of a BC
 with a more capable output is given by all
(0)
(1)
(2)
rate triples R , R , R
satisfying

R(0) + R(2) I U ; Y (2) ,


(13.184)


(1)
(1)
R I X; Y
U ,
(13.185)

R(0) + R(1) + R(2) I X; Y (1)


(13.186)
for some QU,X such that U (
X (
(Y (1) , Y (2) ), where without loss
of generality |U| |X | + 2.

Proof: The achievability follows directly from superposition coding, Theorem 13.26. For the converse, consider Theorem 13.27 and choose V = 0 and

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


306

The Broadcast Channel

X = U:




R(0) min I W ; Y (1) , I W ; Y (2) ,

R(0) + R(1) I X, W ; Y (1) ,

R(0) + R(2) I W ; Y (2) ,

R(0) + R(1) + R(2) I X, W ; Y (1) ,

R(0) + R(1) + R(2) I W ; Y (2)  + I X; Y (1) W .

(13.187)
(13.188)
(13.189)
(13.190)
(13.191)

Next note that by Lemma 13.22 and Definition 13.18 we have




I W ; Y (1) I W ; Y (2) .

(13.192)

R(0) I W ; Y (2)

(13.193)

Hence, (13.187) reads




and is implied by (13.189). Moreover, (13.188) is implied by (13.190), which


can be simplified to



R(0) + R(1) + R(2) I X; Y (1) + I W ; Y (1) X = I X; Y (1)

(13.194)

because of the Markovity of W (


X (
Y (1) . Rewriting U instead of W ,
we hence have the following outer bound on the capacity region of a more
capable BC:

R(0) + R(2) I U ; Y (2) ,


(13.195)



(0)
(1)
(2)
(2)
(1)
U ,
(13.196)
R + R + R I U; Y
+ I X; Y

R(0) + R(1) + R(2) I X; Y (1)


(13.197)
for some QU,X such that U (
X (
(Y (1) , Y (2) ).
It remains to show that the set of rate triples defined by (13.184)(13.186)
is identical to the set defined
by (13.195)(13.197). One direction is obvious:

(0)
(1)
(2)
any triple R , R , R
satisfying (13.184)(13.186) also satisfies (13.195)
(13.197), because (13.196) is the sum of (13.184) and (13.185).
For the opposite direction, first note that both sets are symmetric in R(0)
and R(2) , i.e., without loss of generality we can focus on the case where R(2) =
0. The reason why the stricter inequality (13.185) can be replaced by the
weaker inequality (13.196) can be seen from Figure 13.5. The shaded area is
part of (13.195)(13.197), but not of (13.184)(13.186). However, recall from
Theorem 13.8 that if R(0) , R(1) , 0 is achievable, then also R(0) t, R(1) + t, 0
is achievable for any 0 t R(0) . Hence, there must exist some (possibly
different) choice of QU,X such that

R(0) t I U ; Y (2) ,

R(1) + t I X; Y (1) U ,

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(13.198)
(13.199)

13.6. Capacity Regions of Some Special Cases of BCs


R(1)

I X; Y (2)

+ I X; Y (1) U

307



R(1) = R(0) + I U ; Y (2) + I X; Y (1) U


I X; Y (1) U

R(0)
I U ; Y (2)

Figure 13.5: Two rate regions of the BC are identical.


or

t R(0) + I U ; Y (2) ,

R(1) t + I X; Y (1) U .
Combining these two bounds, we therefore obtain


R(1) R(0) + I U ; Y (2) + I X; Y (1) U .

(13.200)
(13.201)

(13.202)

Theorem 13.31 (Capacity Region of a BC with a Less Noisy


Output and of a Degraded BC [Ber73]).
The capacity region of a BC with less noisy output
 (including all degraded
(0)
(1)
(2)
BCs) is given by all rate triples R , R , R
satisfying
(

R(0) + R(2) I U ; Y (2) ,
(13.203)

(1)
(1)
R I X; Y
U
(13.204)

for some QU,X such that U (
X (
Y (1) , Y (2) , where we can assume
that |U| |X | + 2.
Proof: This follows from Theorem 13.30 in the same way as Corollary 13.24
follows from Theorem 13.23.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


308

The Broadcast Channel

Theorem 13.32 (Capacity Region of a BC with Degraded Message Set [KM77b]).


The capacity region of a BC with a degraded message set without private
message for receiver 2, i.e., R(2) = 0, is given by all rate tuples R(0) , R(1)
satisfying


(0)
(2)

I
U
;
Y
,
(13.205)



(13.206)
R(1) I X; Y (1) U ,

R(0) + R(1) I X; Y (1) 


(13.207)
for some QU,X such that U (
X (
(Y (1) , Y (2) ), where we can assume
that |U| |X | + 2.
Proof: The achievability follows directly from Theorem 13.26. For the
converse we make the same derivations as in the proof of Theorem 13.30, i.e.,
we take the outer bound of Theorem 13.27 and choose V = 0 and X = U :

(13.208)
R(0) I W ; Y (1) ,

(0)
(2)

,
(13.209)
R I W;Y

(0)
(1)
(1)

R + R I X; Y
,
(13.210)

R(0) + R(1) I W ; Y (2)  + I X; Y (1) W .


(13.211)
Now write U for W and note that using the same argument as in the proof of
Theorem 13.30 (see Figure 13.5) we can show that
(

R(0) I U ; Y (2) ,
(13.212)

(1)
(1)
R I X; Y
U
(13.213)
and
(


R(0) I U ; Y (2) ,


R(0) + R(1) I U ; Y (2) + I X; Y (1) U

are equivalent. Thus, we have

(13.214)
(13.215)


R(0) I U ; Y (1) ,

R(0) I U ; Y (2) ,


R(1) I X; Y (1) U ,

R(0) + R(1) I X; Y (1) .

(13.216)



I X; Y (1) = I U, X; Y (1)


= I U ; Y (1) + I X; Y (1) U ,

(13.220)

(13.217)
(13.218)
(13.219)

Moreover, noting that

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(13.221)

13.7. Achievability based on Binning


we can repeat the same argument also for the equivalence of
(

R(1) I X; Y (1) U ,

R(0) I U ; Y (1)

309

(13.222)
(13.223)

and
(


R(1) I X; Y (1) U ,


R(0) + R(1) I U ; Y (1) + I X; Y (1) U .

(13.224)
(13.225)

This shows that (13.208)(13.211) is equivalent to (13.205)(13.207).

13.7

Achievability based on Binning

Note that by its implicit construction, superposition coding works well if one
user has a much better channel than the other and we have only a common
message for the worse user (or the BC is degraded). However, if both users
have similar channels and/or we have no common message, but only private
messages, binning turns out to be better!
The following scheme is closely related to GelfandPinsker, where we have
noncausal side-information at the transmitter. We will assume that there is
no common message.
1: Setup: We need two auxiliary random variables U (1) and U (2) with
some alphabets U (1) and U (2) . Then we choose a PMF QU (1) ,U (2) and
compute its marginals QU (1) and QU (2) . We further choose a function
f : U (1) U (2) X that will be used in the encoder to create the channel
input sequence.
(1) and R
(2) , and some blocklength n.
Then we fix some rates R(1) , R(2) , R

(i)
(i)
2: Codebook Design: We generate enR enR codewords U(i) m(i) , v (i)
(i)
(i)
of length n, m(i) = 1, . . . , enR and v (i) = 1, . . . , enR , by choosing

(i) (i)
(i)
all n en(R +R ) components Uk m(i) , v (i) independently at random
according to QU (i) , for both i = 1, 2. Here m(i) describes the bin and v (i)
describes the index of the codeword in this bin.

3: Encoder Design: For
a message pair m(1) , m(2) , the encoder tries to

find a pair v (1) , v (2) such that




U(1) m(1) , v (1) , U(2) m(2) , v (2) A(n) QU (1) ,U (2) .
(13.226)
If it finds several possible choices, it picks one. If it finds none, it chooses
v (1) , v (2) = (1, 1). Note that these choices can be decided in advance,


i.e., v (1) , v (2) becomes a function of m(1) , m(2) . However, also note
that the choice which codeword is picked in bin m(1) also depends on
m(2) and vice versa.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


310

The Broadcast Channel


The encoder then transmits



X = f n U(1) m(1) , v (1) , U(2) m(2) , v (2)

(13.227)

where we again use the notation f n to denote that each component of X


(1)
(2) 
is created using the function f (, ): Xk = f Uk , Uk .
4: Decoder Design: For a given received sequence Y(i) , the decoder i tries
to find a pair m
(i) , v(i) such that




U(i) m
(i) , v(i) , Y(i) A(n)
QU (i) ,Y (i) .
(13.228)

If there is a unique m
(i) , then the decoder (i) puts out m
(i) , m
(i) . If
there are several choices for m
(i) or none, the decoder declares an error.
Note that the decoder does not care if there are several possible v(i) for
a unique bin m
(i) .
5: Performance Analysis: The analysis follows very closely the analysis
of GelfandPinsker in Section 12.2. We distinguish the following different cases that are not necessarily disjoint, but that together cover all
possibilities that will lead to an error:

1. The encoder cannot find a pair v (1) , v (2) such that (13.226) is
satisfied. This is analogous to Case 2 in Section 12.2.2:
Pr(Case 1)
n
 


= Pr @ v (1) , v (2) : U(1) m(1) , v (1) , U(2) m(2) , v (2)
o
A(n) QU (1) ,U (2)
(13.229)
(1)

R
enY

(2)

R
enY

h
 (2) (2) (2) 
(1)
(1) (1)
Pr U m , v
,U m ,v

v (1) =1 v (2) =1

/ A(n) QU (1) ,U (2)


(1)

R
enY

i

(2)

R
enY

v (1) =1 v (2) =1

h


1 Pr U(1) m(1) , v (1) , U(2) m(2) , v (2)
A(n) QU (1) ,U (2)

(1)

<

R
enY

i

(13.231)

(2)

R
enY

v (1) =1 v (2) =1

(13.230)

1 en(I(U

n(I(U (1) ;U (2) )+ )

(1) ;U (2) )+

en(R (1) +R (2) )

= 1e


(1)
(2)
(1) (2)
exp en(R +R ) en(I(U ;U )+)


(1)
(2)
(1) (2)
= exp en(R +R I(U ;U )) .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(13.232)

(13.233)
(13.234)
(13.235)

13.7. Achievability based on Binning

311

So, as long as

(1) + R
(2) > I U (1) ; U (2) +
R

(13.236)

the probability of Case 1 decays double-exponentially fast to zero.


2. The decoder i finds some m
(i) 6= m(i) and some v(i) such that




U(i) m
(i) , v(i) , Y(i) A(n) QU (i) ,Y (i) .
(13.237)
This corresponds to Case 3 in Section 12.2.3:
Pr(Case 2)

Pr

m
(i) ,
v (i)
m
(i) 6=m(i)

X
m
(i) ,
v (i)
m
(i) 6=m(i)


n

o

U(i) m
(i) , v(i) , Y(i) A(n)
Q

(i)
(i)

U ,Y

(13.238)
h
i
 (i) 
(n)
(i)
(i) (i)
A
QU (i) ,Y (i)
Pr U m
, v , Y
(13.239)
n(I(U (i) ;Y (i) ))

(13.240)

m
(i) ,
v (i)
m
(i) 6=m(i)

= enR

(i)

en(R

 (i)

(i)
(i)

enR 1 en(I(U ;Y ))

(i)

(i) I(U (i) ;Y (i) )+)


+R

(13.241)
(13.242)

So, as long as

(i) < I U (i) ; Y (i) 
R(i) + R

(13.243)

the probability of Case 2 decays exponentially fast to zero. Note


that this argument holds both for i = 1, 2.
3. The decoder i makes an error if




U(i) m(i) , v (i) , Y(i)
/ A(n)
QU (i) ,Y (i) .


(13.244)

Note that here we ignore the possibility that there might exist another v(i) such that




QU (i) ,Y (i) .
(13.245)
U(i) m(i) , v(i) , Y(i) A(n)

i.e., our bound on the error probability is too big. This derivation
is identical to the Case 4 in Section 12.2.4. There it is shown that

this probability is upper-bounded by t n, , U (1) U (2) X Y (i) .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


312

The Broadcast Channel


Hence, we have the following conditions:

(1) < I U (1) ; Y (1) ,
R(1) + R

(2) < I U (2) ; Y (2) ,
R(2) + R

(1) + R
(2) > I U (1) ; U (2) ,
R

(13.246)
(13.247)
(13.248)

(i) 0. From the first two bounds we


plus the implicitly given bounds R
get


(1) + R(2) + R
(2) < I U (1) ; Y (1) + I U (2) ; Y (2)
R(1) + R
(13.249)
from which we get, using the third condition,



(1) + R
(2)
R(1) + R(2) < I U (1) ; Y (1) + I U (2) ; Y (2) R



< I U (1) ; Y (1) + I U (2) ; Y (2) I U (1) ; U (2) .

(i) 0 we get from the first two conditions


Moreover, using R


(i) I U (i) ; Y (i) .
R(i) < I U (i) ; Y (i) R

(13.250)
(13.251)

(13.252)

Note that we can also apply FourierMotzkin elimination to achieve this


reduction, see Example 1.16.
This proves the following theorem.
Theorem 13.33 (Achievability based on Binning [Mar79]). 
On a general BC without common message, any rate pair R(1) , R(2) is
achievable that satisfies

(13.253)
R(1) < I U (1) ; Y (1) ,


(2)
(2)
(2)
R < I U ;Y
,
(13.254)

R(1) + R(2) < I U (1) ; Y (1)  + I U (2) ; Y (2)  I U (1) ; U (2)  (13.255)
for some QU (1) ,U (2) QX|U (1) ,U (2) QY (1) ,Y (2) |X .
Note that an optimal choice for QX|U (1) ,U (2) actually degenerates into
a function f : U (1) U (2) X .
If we fix a certain QU (1) ,U (2) QX|U (1) ,U (2) and a given BC QY (1) ,Y (2) |X , then
the achievable rate region of Theorem 13.33 is given by a pentagon shown in
Figure 13.6. In this pentagon, points A and B are of particular interest: For
example in point A we have


R(1) = I U (1) ; Y (1) I U (1) ; U (2)
(13.256)
which is identical to the GelfandPinsker rate RGP if we consider U(2) as
interference that is noncausally known to the transmitter (compare with Definition 12.4)! Hence, we can actually use GelfandPinsker coding here. A
corresponding encoder is shown in Figure 13.7. Note that in general the decoders will only be able to decode their own message.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


13.8. Best Known Achievable Region: Martons Region

313




R(2) = R(1) + I U (1) ; Y (1) + I U (2) ; Y (2) I U (1) ; U (2)

I U (2) ; Y

R(2)

(2)

A
R(2) = I U (2) ; Y (2)



I U (2) ; Y (2) I U (1) ; U (2)

R(1) = I U (1) ; Y (1)

R(1)
I U (1) ; Y (1)


I U (1) ; Y (1) I U (1) ; U (2)

Figure 13.6: Pentagon of achievable rate pairs.


X(1)
X

U(1)

fGP
U(2)

M (1)

GP Enc.

U(2)

U(2)
U(2)
Enc. 2

M (2)

f : U (1) U (2) X
Figure 13.7: BC encoder based on a GelfandPinsker encoder.

13.8

Best Known Achievable Region: Martons


Region

Martons region combines superposition coding with binning. We only give a


brief discussion.
1: Setup: We choose a PMF QT,U (1) ,U (2) and compute its marginal QT
and its conditional marginals QU (1) |T and QU (2) |T . We further choose a

function f : T U (1) U (2) X . Then we fix some rates R(0) , R(1) , R(2) ,
(1) and R
(2) , and some blocklength n.
R

(0)
2: Codebook Design: We generate enR codewords T m(0) QnT ,

(0)
m(0) = 1, . . . , enR (the cloud centers). For each T m(0) , we use the
(i)
code construction of Section 13.7 with binning, i.e., we generate enR


(i)
enR length-n codewords U(i) m(0) , m(i) , v (i) QnU (i) |T |T m(0) ,

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


314

The Broadcast Channel


(i)

(i)

m(i) = 1, . . . , enR (the bins) and v (i) = 1, . . . , enR


per bin), for both i = 1, 2.

(the codewords


(0) , m(1) , m(2) , the encoder
3: Encoder Design: For a message
triple
m

tries to find a pair v (1) , v (2) such that





T m(0) , U(1) m(0) , m(1) , v (1) , U(2) m(0) , m(2) , v (2)

A(n)
QT,U (1) ,U (2) .


(13.257)

If it finds several possible choices, it picks one. If it finds none, it chooses


v (1) , v (2) = (1, 1).
The encoder then transmits




X = f n T m(0) , U(1) m(0) , m(1) , v (1) , U(2) m(0) , m(2) , v (2) .
(13.258)
4: Decoder Design: For agiven received sequence Y(i) , the decoder i tries
to find a pair m
(0) , m
(i) and a v(i) such that






QT,U (i) ,Y (i) .
T m
(0) , U(i) m
(0) , m
(i) , v(i) , Y(i) A(n)


(13.259)


If there is a unique pair m
(0) , m
(i) , then the decoder (i) puts out



m
(0) , m
(i) , m
(0) , m
(i) . If there are several choices for m
(0) , m
(i)
or none, the decoder declares an error. Note that the decoder does
 not
care if there are several possible v(i) for a unique pair m
(0) , m
(i) .
5: Performance Analysis: Using our standard analysis technique, we find
the following conditions:
1. The encoder cannot find appropriate codewords (corresponds to
Case 1 in binning, Section 13.7):

(1) + R
(2) > I U (1) ; U (2) T .
R

(13.260)

2. The decoder i finds a wrong bin in the correct cloud (corresponds


to Case 2 in binning, Section 13.7):

(i) < I U (i) ; Y (i) T .
R(i) + R

(13.261)

3. The decoder i finds a wrong cloud (corresponds to the analysis of


decoder 2 (the one that only cares about the clouds) in superposition
coding, Section 13.4, but with R(0) replaced by R(0) + R(2) ):

(i) < I T, U (i) ; Y (i) .
R(0) + R(i) + R

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(13.262)

13.8. Best Known Achievable Region: Martons Region

315

Together with the nonnegativity constraints, this yields the following ten
conditions:


0
0 1 0 1
I U (1) ; U (2) T


I U (1) ; Y (1) T
1
1
0
0


0

0
0
1
1

(0) I U (2) ; Y (2) T




1
I T, U (1) ; Y (1)
1
1
0
0
(1)


R
1
(2) ; Y (2)

0
0
1
1
I
T,
U

(1)
(13.263)
R

1 0

0
0
0 (2)
0

R
0 1 0

0
0
0

(2)

0 1 0
0
0

0
0
1
0
0

0
0
0
0 1
0
We now apply FourierMotzkin elimination (see Section 1.3) to eliminate
(1) and R
(2) . We start with R
(1) :
R



0
1
0 1
I U (1) ; Y (1) T I U (1) ; U (2) T



1
I T, U (1) ; Y (1) I U (1) ; U (2) T
1
0 1


0

(1) ; Y (1) T
1
0
0
I
U

(0)

(1)
(1)
1

1
0
0
I T, U ; Y


0

(1)
(2)
(2)


0
1
1
T
I
U
;
Y
R

(2)
(2)
(2)
1

0
1
1
I T, U ; Y

1 0

(2)

0
0
0
R

0 1 0

0
0

0 1 0
0

0
0
0 1
0

(13.264)

(2) :
Next we remove R

0
1
1

1
1
1

1
1
1

2
1
1

0
R(0)
0
1

0
1 R(1)
1

(2)
0
1
0
R

1
0
1

1 0
0

0 1 0
0
0 1

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


316

The Broadcast Channel


U (1) ; Y (1) T I

U (1) ; Y (1) T I

T, U (1) ; Y (1) I

T, U (1) ; Y (1) I
I
I
I
I


U (1) ; U (2) T + I

U (1) ; U (2) T + I

U (1) ; U (2) T + I

U (1) ; U (2) T + I

U (2) ; Y (2) T

T, U (2) ; Y (2)

U (1) ; Y (1) T

T, U (1) ; Y (1)
0
0
0


U (2) ; Y (2) T

T, U (2) ; Y (2)

U (2) ; Y (2) T


T, U (2) ; Y (2)

Besides the obvious nonnegativity constraints, we obtain the


eight conditions:

R(1) I U (1) ; Y (1) T ,

R(2) I U (2) ; Y (2) T ,


R(1) + R(2) I U (1) ; Y (1) T + I U (2) ; Y (2) T

I U (1) ; U (2) T ,

R(0) + R(1) I T, U (1) ; Y (1) ,

R(0) + R(2) I T, U (2) ; Y (2) ,


R(0) + R(1) + R(2) I T, U (1) ; Y (1) + I U (2) ; Y (2) T

I U (1) ; U (2) T ,


R(0) + R(1) + R(2) I U (1) ; Y (1) T + I T, U (2) ; Y (2)

I U (1) ; U (2) T ,


2R(0) + R(1) + R(2) I T, U (1) ; Y (1) + I T, U (2) ; Y (2)

I U (1) ; U (2) T .

(13.265)

following
(13.266)
(13.267)
(13.268)
(13.269)
(13.270)
(13.271)
(13.272)
(13.273)

Note that
(13.268) + (13.273) = (13.271) + (13.272),
so one of these four constraints is redundant. We choose to ignore
(13.273). The remaining constraints can be simplified further if we take
Theorem 13.8 into account: We can replace R(0) by R(0) (1) (2) ,
R(1) by R(1) + (1) , and R(2) by R(2) + (2) , where we need to add the
constraints
(1) + (2) R(0) ,

(1)

(2)

0,
0.

(13.274)
(13.275)
(13.276)

Before we write down the new inequality system and again apply the
FourierMotzkin elimination to eliminate (1) and (2) , we introduce

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


13.8. Best Known Achievable Region: Martons Region


some temporary abbreviations to simplify our life:

IU (1) , I U (1) ; Y (1) T ,

IU (2) , I U (2) ; Y (2) T ,

IY (1) , I Y (1) ; T ,

IY (2) , I Y (2) ; T ,

I , I U (1) ; U (2) T .

317

(13.277)
(13.278)
(13.279)
(13.280)
(13.281)

From (13.266)(13.272) and (13.274)(13.276) we now get the following


inequality system:

0
1
0
1
0
IU (1)

0
1
0
1
IU (2)

1
1
1
1
IU (1) + IU (2) I

1
0
0 1
IY (1) + IU (1)

(0)
1

0
1
1
0
R
I
+
I
(2)
(2)
Y
U

(1)

1
1
1
0
0 R IY (1) + IU (1) + IU (2) I

(2)

1
1
0
0

R IY (2) + IU (1) + IU (2) I . (13.282)


0
1
1 (1)
0
1 0

(2)
1 0

0
0
0
0

0
0
0
0 1 0

0 1 0
0
0

0
0
0
1
0
0

0
0
0
0
0 1
In a first step, we

1
1
1

0
1
0

1
1
2

0
1
1

0
0
1

1 0
0

0
0
1

1
1
0

1
1
1

1
1
1

1 0
0

0 1 0

0
0 1

0
0
0

eliminate (1) :

IY (2) + IU (1) + IU (2)


0

IU (1)
0

I
+
I
+
2I

I
1

Y (2)
U (1)
U (2)

I
+
I

I
1
(1)
(2)
U
U

I
1
(2) + IU (2)

(0)

0
1 R

(1)


I
1
R
(2)

,
(2)

IY (1) + IU (1)
1 R

I
(2)
0
Y (1) + IU (1) + IU (2) I

IY (2) + IU (1) + IU (2) I


0

0
0

0
0

0
0

0
1

(13.283)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


318

The Broadcast Channel


and then we eliminate (2) :

2
2
2
IY (1) + IY (2) + 2IU (1) + 2IU (2) I

1
2
IY (2) + IU (1) + 2IU (2) I

2
1
IY (1) + 2IU (1) + IU (2) I

1
1
IU (1) + IU (2) I

1
I (1) + I (2) + I (1) + I (2)
1
1

Y
Y
U
U

0
1
IY (2) + IU (2)

1
0
IY (1) + IU (1)

(0)

1 0

0 R
0

(1)

. (13.284)

1
1
R
I
+
I
+
I

(1)
(1)
(2)
Y
U
U

(2)

0
1 R
IU (2)
0

1
1
IY (2) + IU (1) + IU (2)

1
0
IU (1)
0

1
1
IY (1) + IU (1) + IU (2) I

1
1
IY (2) + IU (1) + IU (2) I
1

1 0

0
0

0
1
0
0

0
0 1
0
Luckily, we can reduce these 17 inequalities further using the following
observations:
8 equals 15 = drop 8

6 is implied by 10 = drop 6

7 is implied by 12 = drop 7
5 is implied by 11 = drop 5
9 is implied by 13 = drop 9

11 is implied by 14 = drop 11

10 + 14 equals 2 = drop 2
12 + 13 equals 3 = drop 3

1 + 4 equals 10 + 12 + 13 + 14 = drop 1
This leaves us with five inequalities plus the obvious three nonnegativity
constraints. Writing 13 and 14 in a combined fashion yields the final
result given in Theorem 13.34.

Theorem 13.34 (Martons Achievable Region


[Mar79]).

On a general BC, any rate triple R(0) , R(1) , R(2) is achievable that satis-

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


13.9. Some More Outer Bounds

319

fies


R(1) < I U (1) ; Y (1) T ,
(13.285)

(2)
(2)
(2)
R < I U ;Y
T ,
(13.286)


(1)
(2)
(1)
(1)
(2)
(2)
R + R < I U ;Y
T + I U ;Y
T



(13.287)
I U (1) ; U (2) T ,

n
o





(0)
(1)
(2)
(1)
(2)
(1)
(1)

T
R
+
R
+
R
<
min
I
T
;
Y
,
I
T
;
Y
+
I
U
;
Y






+ I U (2) ; Y (2) T I U (1) ; U (2) T


(13.288)

for some QT,U (1) ,U (2) and some f : T U (1) U (2) X .


Martons region is the best known achievable region. But note that until
recently no bounds on the cardinality of the alphabets of its auxiliary random
variable existed [AGA12], i.e., until recently the region could not be evaluated
even numerically.

13.9

Some More Outer Bounds

There are many known outer bounds, but none has been proven to be tight
for all BCs. For all cases where the capacity region is known2 the outer bound
is taken from Theorem 13.27. In the following we quickly review a few more
outer bounds.
Note that the derivation of all outer bounds are based on the Fano Inequality, the Data Processing Inequality, additionally given information (so-called
genie-aided bounds), additionally allowed cooperations, or similar.
(0)
(1)
The
 simplest outer bound is by Cover [Cov72]: Any rate triple R , R ,
(2)
R
not satisfying

R(0) + R(1) I X; Y (1) ,


(13.289)


(0)
(2)
(2)
R + R I X; Y
,
(13.290)

(0)
(1)
(2)
(1)
(2)
R + R + R I X; Y , Y
(13.291)
cannot be achievable. The proof is quite straightforward: Every user by itself
cannot transmit more than its capacity, and the sum-rate bound assumes that
the receivers cooperate. Another proof is based on the Cut-Set Bound, see
Chapter 15.
Sato improved on the sum-rate bound (13.291):

R(0) + R(1) + R(2) min max I X; Y (1) , Y (2) ,
(13.292)
QX

where the min is over all QY (1) ,Y (2) |X having the same conditional marginals
QY (1) |X and QY (2) |X as the BC.
2

Apart from the special cases discussed in Section 13.6, there is, e.g., also the case of
the deterministic BC where y (1) and y (2) are deterministic functions of x.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


320

The Broadcast Channel

Another famous
outer bound is

pair R(1) , R(2) not satisfying

R(1) I

R(2) I

R(1) + R(2) I

by Korner and Marton [Mar79]: Any rate



X; Y (1) ,

U ; Y (2) ,


X; Y (1) U + I U ; Y (2)

(13.293)
(13.294)
(13.295)

cannot be achievable. Clearly, this is a special case of Theorem 13.27. Since


this bound is not symmetric, we can easily exchange the two receivers (swap
1 and 2 in (13.293)(13.295)) and then take the intersection between both
regions. Also this bound is tight in all cases where the capacity region is
known.
We also
 would like to mention Satos outer bound [Sat78]: Any rate pair
R(1) , R(2) not satisfying


(1)
(1)
(1)

I
U
;
Y
,
(13.296)


R(2) I U (2) ; Y (2) ,
(13.297)

R(1) + R(2) I U (1) , U (2) ; Y (1) , Y (2) 


(13.298)
for some QU (1) ,U (2) QX|U (1) ,U (2) QY (1) ,Y (2) |X cannot be achievable.
Finally, a very recent bound, which seems to be the best known outer
bound atthe moment, has been given by Nair [Nai10]: Any rate triple R(0) ,
R(1) , R(2) not satisfying

n

o

(0)
(1)
(2)

,
(13.299)
R

min
I
T
;
Y
,
I
T
;
Y




o

(0)
(1)
(1)
(1)
(1)
(2)

R
+
R

I
U
;
Y
T
+
min
I
T
;
Y
,
I
T
;
Y
,

(13.300)

o
n






(2)
(1)
(0)
(2)
(2)
(2)

,
,
I
T
;
Y
T
+
min
I
T
;
Y
R
+
R

I
U
;
Y

(13.301)




R(0) + R(1) + R(2) I U (1) ; Y (1) U (2) , T + I U (2) ; Y (2) T


o

(1)
(2)

+
min
I
T
;
Y
,
I
T
;
Y
,
(13.302)






R(0) + R(1) + R(2) I U (1) ; Y (1) T + I U (2) ; Y (2) U (1) , T


o

(13.303)
+ min I T ; Y (1) , I T ; Y (2)
for some QU (1) ,U (2) QX|U (1) ,U (2) QY (1) ,Y (2) |X cannot be achievable. Note that
this bound is contained in the outer bound of Theorem 13.27, but it is not
clear whether it is strictly smaller or not.

13.10

Gaussian BC

We have already seen the definition of the most typical Gaussian BC in Example 13.17. We now generalize this definition to the general Gaussian BC.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


13.10. Gaussian BC

321

For a given input X = x,


(

Y (1) = x + Z (1) ,
Y (2) = x + Z (2) ,

(13.304)
(13.305)

where
Z (1) , Z (2)

T

N (0, KZZ )

(13.306)

for some arbitrary covariance matrix KZZ :


KZZ =

2
(1)

(12)

(12)

2
(2)

!
(13.307)

where without loss of generality we assume that


2
2
(1)
(2)
.

Moreover, we impose an average-power constraint on the input:


 
E X 2 E.

(13.308)

(13.309)

We can easily repeat the argument of Example 13.17 to show that also
this channel is stochastically degraded. Actually, the derivation is identical
because we only need to worry about the conditional distribution of Y (i) given
X, i.e., the correlation between Z (1) and Z (2) is completely irrelevant.
We now state the capacity region of this Gaussian BC.
Theorem 13.35 (Capacity Region of Gaussian BC).
The Gaussian BC capacity region is given by




(1 )E
1

(0)
(2)

R + R log 1 +
,

2
E + (2)



1
E

(1)

R log 1 + 2

2
(1)

(13.310)
(13.311)

for some [0, 1].


This region is depicted in Figure 13.8. Once again, we see that time-sharing
is strictly suboptimal.
Proof of Theorem 13.35: Recalling that the Gaussian BC is stochastically
degraded, we only investigate the corresponding physically degraded Gaussian
BC:
(
Y (1) = x + Z (1) ,
(13.312)
(2)
(1)
(13.313)
Y
= x + |Z {z+ V},
= Z (2)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


322

The Broadcast Channel


R(0) + R(2)

1
2


log 1 +

E
2
(2)

 =0

time-sharing

=1


E
1
log
1
+
2
2

R(1)

(1)

Figure 13.8: Capacity region of a general Gaussian BC.



2 2 , V
where V N 0, (2)
Z (1) .
(1)
As usual, we cheat and simply take the achievability part over from the
corresponding DMC case. Hence, according to Theorem 13.26, a rate triple is
achievable if
(

R(0) + R(2) < I U ; Y (2)
(13.314)

(1)
(1)
U
(13.315)
R < I X; Y
 2
for some U X|U Y (1) |X Y (2) |Y (1) with E X E and where Y (1) |X =


2
2 2 .
N x, (1)
and Y (2) |Y (1) = N y (1) , (2)
(1)
We now make the following choices:
U N (0, (1 )E),
0

X N (0, E),
0

X =U +X .

(13.316)
0

X
U,

(13.317)
(13.318)

Note that by this choices we have X N (0, E) (U is the cloud center and X 0
the codeword in the cloud). Now we evaluate:


0
(2)
I U ; Y (2) = I U ; U + X
+
Z
(13.319)
| {z }
noise


(1 )E
1
= log 1 +
,
2
2
E + (2)



I X; Y (1) U = I U + X 0 ; U + X 0 + Z (1) U

= I X 0 ; X 0 + Z (1) U

= I X 0 ; X 0 + Z (1)


1
E
= log 1 + 2 .
2
(1)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(13.320)
(13.321)
(13.322)
(13.323)
(13.324)

13.10. Gaussian BC

323

To prove the converse, we do not simply try to show that the above choice
is optimal, but we go some steps further back. We start as follows:

1
R(1) = H M (1)
(13.325)
n

 1

1
= I M (1) ; Y(1) + H M (1) Y(1)
(13.326)
n
n

1
I M (1) ; Y(1) + n(1)
(13.327)
n

1
(13.328)
I M (1) ; Y(1) , M (0) , M (2) + n(1)
n



1
1
= I M (1) ; M (0) , M (2) + I M (1) ; Y(1) M (0) , M (2) + n(1) (13.329)
n|
{z
} n
=0



1
= I M (1) ; Y(1) M (0) , M (2) + n(1) ,
n

(13.330)

and

1
H M (0) , M (2)
n

 1

1
= I M (0) , M (2) ; Y(2) + H M (1) Y(1)
n
n

1
(0)
(2)
(2)
+ n(2) ,
I M ,M ;Y
n

R(0) + R(2) =

(1)

(13.331)
(13.332)
(13.333)

(2)

where n and n (by the Fano Inequality) tend to zero as n tends to infinity.
We now continue our bounding as follows:

1
I M (0) , M (2) ; Y(2)
n

 1

1
(13.334)
= h Y(2) h Y(2) M (0) , M (2)
n
n
n
 1


1 X  (2) (2)
(2)
=
h Yk Y1 , . . . , Yk1 h Y(2) M (0) , M (2)
(13.335)
n
n
k=1
n

 1
X


1
(2)
h Yk
h Y(2) M (0) , M (2)
(13.336)

n
n
k=1

n

 1

1X1
2

log 2e Ek + (2)
h Y(2) M (0) , M (2)
n
2
n
k=1
!!
n


1
1X
1
2
log 2e
Ek + (2)
h Y(2) M (0) , M (2)
2
n
n
k=1

 1

1
2
log 2e E + (2)
h Y(2) M (0) , M (2) .
2
n

(13.337)
(13.338)
(13.339)

Here, the first inequality (13.336) follows conditioning that cannot reduce
entropy; in the subsequent inequality (13.337) we define Ek , E[Xk ] and
upper-bound the entropy by the Gaussian entropy of given second moment;
then in (13.338) we use the concavity of the logarithm; and the final step
(13.339) follows from the average-power constraint (13.309).

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


324

The Broadcast Channel

Now note that by conditioning that reduces entropy and by the Markovity
of encoder(
channel(
decoder we have


n
2
log 2e(2)
= h Z(2)
(13.340)
2


= h Y(2) X
(13.341)


= h Y(2) X, M (0) , M (2)
(13.342)


(2) (0)
(2)
h Y M ,M
(13.343)

(2)
h Y
(13.344)

n
2
log 2e E + (2)
,
(13.345)
2
where the last step follows in the same way as in (13.334)(13.339). Hence,


 n

n
2
2
log 2e(2)
h Y(2) M (0) , M (2) log 2e E + (2)
,
(13.346)
2
2
and therefore there must exist some , 0 1, such that


 n
2
.
h Y(2) M (0) , M (2) = log 2e E + (2)
2

(13.347)

Plugging this into (13.339) and (13.333) then yields



 1

1
2
log 2e E + (2)
h Y(2) M (0) , M (2) + n(2)
2
n
 1

1
2
2
= log 2e E + (2) log 2e E + (2)
+ n(2)
2
2
2
(1 + )E + (2)
1
= log
+ n(2)
2
2
E + (2)


1
(1 )E
= log 1 +
+ n(2) .
2
2
E + (2)

R(0) + R(2)

(13.348)
(13.349)
(13.350)
(13.351)

Next we consider a conditional version of the Entropy Power Inequality


(Theorem 1.15):
2

e n h(Y

(1) +V|M (0) ,M (2) )

e n h(Y

(1) |M (0) ,M (2) )

+ e n h(V|M

(0) ,M (2) )

(13.352)

and bound


2
h Y(1) M (0) , M (2)
n
 2

2
(1)
(0)
(2)
(0)
(2)
log e n h(Y +V|M ,M ) e n h(V|M ,M )
 2

2
(2)
(0)
(2)
(0)
(2)
= log e n h(Y |M ,M ) e n h(V|M ,M )





2 n
2 n
2 )
2 2 )
log 2e(E+(2)
log 2e((2)
(1)
n 2
n 2
= log e
e



2
2
2
= log 2e E + (2)
2e (2)
(1)

2
= log 2e E + (1)
,

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(13.353)
(13.354)
(13.355)
(13.356)
(13.357)

13.10. Gaussian BC

325

where in (13.355) we have made use of (13.347).  Hence, using (13.357) and
the fact that X is a function of M (0) , M (1) , M (2) , we get from (13.330)
R(1)
=
=

=
=



1
(13.358)
I M (1) ; Y(1) M (0) , M (2) + n(1)
n


 1

1
h Y(1) M (0) , M (2) h Y(1) M (0) , M (1) , M (2) + n(1) (13.359)
n
n




1 2
1
h Y(1) M (0) , M (2) h Y(1) M (0) , M (1) , M (2) , X + n(1)
2 n
n
(13.360)
 1

1
2
(1)
(1)
log 2e E + (1) h Z
+ n
(13.361)
2
n
 1

1
2
2
log 2e E + (1)
log 2e(1)
+ n(1)
(13.362)
2
2


E
1
+ n(1) .
(13.363)
log 1 + 2
2
(1)

This finishes the proof of the converse.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Chapter 14

The Multiple-Access Channel


with Common Message
14.1

Problem Setup

After we have discussed superposition coding in the context of the broadcast


channel in Chapter 13, it makes sense to quickly return to the multiple-access
channel and to generalize the problem by including a common message M (0)
that is seen by both encoders, in addition to their respective private messages
M (1) and M (2) . The channel model is shown in Figure 14.1.
M (1)
X(1)
Dest.

(0) , M
(1) ,
M
(2)
M

Dec.

Channel
QnY |X (1) ,X (2)

Enc. (1)
M (0)

X(2)

Uniform
Source 1
Uniform
Source 0

Enc. (2)
M

(2)

Uniform
Source 2

Figure 14.1: The generalized multiple-access channel with two private message
sources M (i) , i = 1, 2, and a common message source M (0) .

Note that the common message might represent a common time reference
that lets the transmitters synchronize their transmissions. However, in this
case we have R(0) = 0 and we are actually back in the situation of Chapter 10.
More generally, the common message has a strictly positive rate. For example,
it could represent some information that two mobile stations are relaying from
one base station to the next.
The various definitions of Section 10.1 very easily generalize to the new
situation here. In particular note that the capacity region
now has become

three dimensional containing rate triples R(0) , R(1) , R(2) .

327

c Stefan M. Moser, vers. 2.5


328

The Multiple-Access Channel with Common Message

14.2

An Achievable Region based on


Superposition Coding

The following derivation is closely related to Section 13.4.


1: Setup: Fix R(0) , R(1) , R(2) , QU , QX (1) |U , QX (2) |U and some blocklength
n.
(0)

2: Codebook Design: We generate enR independent length-n codewords


(0)
U(m(0) ) QnU , m(0) = 1, . . . , enR . For each U(m(0) ) and for each i =

(i)
1, 2, we generate enR independent length-n codewords X(i) m(0) , m(i)


(i)
QnX (i) |U U(m(0) ) , m(1) = 1, . . . , enR . We reveal all three codebooks
to encoders and decoder.
Note that U(m(0) ) represents the cloud center of the m(0) th cloud.

3: Encoder Design: To send message m(0) , m(i) , the encoder (i) trans
mits the codeword X(i) m(0) , m(i) , i = 1, 2.
4: Decoder Design: Upon receiving Y, decoder looks for a triple m
(0) ,
(1)
(2)
m
,m

such that




 
U m
(0) , X(1) m
(0) , m
(1) , X(2) m
(0) , m
(2) , Y

A(n)
QU,X (1) ,X (2) ,Y .


(14.1)


If there is exactly one such triple m
(0) , m
(1) , m
(2) , the decoder puts


out m
(0) , m
(1) , m
(2) , m
(0) , m
(1) , m
(2) . Otherwise it declares an error.
5: Performance Analysis: By the symmetry of the random codebook
construction, the conditional error probability does not depend on which
triple of indices is sent. Without
loss of generality, we can therefore

assume that M (0) , M (1) , M (2) = (1, 1, 1).
We define the following events: for each m(0) m(1) , m(2) ,
n
o
 
Fm(0) ,
U m(0) , Y A(n)
(Q
)
,
(14.2)
U,Y

n


 
Fm(0) ,m(1) ,m(2) ,
U m(0) , X(1) m(0) , m(1) , X(2) m(0) , m(2) , Y
o
A(n)
QU,X (1) ,X (2) ,Y .
(14.3)

Then, using the Union Bound, we can bound as follows:

(1)
(2)
nR(0)
enR
enR
[
[
c e [

(n)
Pe Pr F1,1,1
Fm(0)
F1,m(1) ,1
F1,1,m(2)
m(0) =2

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


m(1) =2

m(2) =2

14.2. An Achievable Region based on Superposition Coding

329

F1,m(1) ,m(2) M (0) = 1, M (1) = 1, M (2) = 1




m(1) =2 m(2) =2
(1)

enR
[

(2)

enR
[

(14.4)


c
Pr F1,1,1 M (0) = 1, M (1) = 1, M (2) = 1
(0)

nR
eX

m(0) =2





Pr Fm(0) M (0) = 1, M (1) = 1, M (2) = 1

(1)

nR
eX

m(1) =2





Pr F1,m(1) ,1 M (0) = 1, M (1) = 1, M (2) = 1

(2)

nR
eX

m(2) =2
(1)

nR
eX





Pr F1,1,m(2) M (0) = 1, M (1) = 1, M (2) = 1
(2)

nR
eX

m(1) =2 m(2) =2





Pr F1,m(1) ,m(2) M (0) = 1, M (1) = 1, M (2) = 1
(14.5)

where in (14.4) the first event corresponds to the case that the correct
codewords are not recognized, the first union of events corresponds to the
case where some codeword from a wrong cloud is (wrongly) recognized,
and the remaining unions of events correspond to the cases where some
wrong codeword from the correct cloud is recognized. Note that we have
an inequality in front of (14.4) because we only check whether the cloud
center of a wrong cloud happens to be typical with the received sequence,
and do not bother to check whether or not there actually exist codewords
in that wrong cloud that are jointly typical with the cloud center and the
received sequence.
Of the five main terms in (14.5), the first two are standard:


c

Pr F1,1,1 (1, 1, 1) t n, , U X (1) X (2) Y ,
(0)

nR
eX

m(0) =2

(14.6)

(0)

Pr(Fm(0) |(1, 1, 1))

nR
eX

en(I(U ;Y ))

(14.7)

en(I(U ;Y )) .

(14.8)

m(0) =2

enR

(0)

Now lets have a look at the third term:






Pr F1,m(1) ,1 (1, 1, 1)
X


=
QnU (u) QnX (1) |U x(1) u QnX (2) |U x(2) u
(u,x(1) ,x(2) ,y)

(n)
A (QU,X (1) ,X (2) ,Y

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


330

The Multiple-Access Channel with Common Message




QnY |X (2) ,U y x(2) , u
|
{z
}

wrong codeword x(1) ,


correct codeword x(2) ,
correct cloud!
n(H(X (1) |U ))
n(H(U ))

(u,x(1) ,x(2) ,y)


(n)

A

(14.9)

(QU,X (1) ,X (2) ,Y )


(2)
(2)
en(H(X |U )) en(H(Y |X ,U ))
(14.10)
 n(H(U,X (1) )+H(X (2) |U )+H(Y |X (2) ,U ))
e
(14.11)
(2)


QU,X (1) ,X ,Y
= A(n)

(1)
(2)
(2)
n(H(U,X (1) ,X (2) ,Y )+)
e
en(H(U,X )+H(X |U )+H(Y |X ,U ))

(14.12)

n(H(X (2) |X (1) ,U )+H(Y |X (1) ,X (2) ,U )H(X (2) |U )H(Y |X (2) ,U )+)

(14.13)

n(I(X (1) ;X (2) |U )+I(X (1) ;Y |X (2) ,U ))

(14.14)

n(I(X (1) ;Y |X (2) ,U ))

(14.15)

=e

=e

=e

Here the most important step is (14.9) where we need to realize that Y is
generated based on the transmitted X(2) , but not on the wrong codeword
X(1) considered here. However, since we do consider the correct cloud,
the cloud center U is related to the received Y. Moreover, in (14.15) we
make use of the Markov chain X (1) (
U (
X (2) , i.e., we use that
conditionally on the cloud center U, the codewords X(1) and X(2) are
generated independently.
Hence, we get
(1)

nR
eX

m(1) =2




(1)
(1)
(2)

Pr F1,m(1) ,1 (1, 1, 1) enR en(I(X ;Y |X ,U )) ,

(14.16)

and, by symmetry,
(2)

nR
eX

m(2) =2




(2)
(2)
(1)

Pr F1,1,m(2) (1, 1, 1) enR en(I(X ;Y |X ,U )) .

(14.17)

Finally, we bound the last term as follows:






Pr F1,m(1) ,m(2) (1, 1, 1)
X


=
QnU (u) QnX (1) |U x(1) u QnX (2) |U x(2) u
(u,x(1) ,x(2) ,y)
(QU,X (1) ,X (2) ,Y )

(n)

A



QnY |U y x(2) , u
{z
}
|
X

wrong codewords,
correct cloud!
n(H(X (1) |U ))
n(H(U ))

(u,x(1) ,x(2) ,y)


(n)

A

(QU,X (1) ,X (2) ,Y )

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(14.18)

14.2. An Achievable Region based on Superposition Coding

331

(2)
en(H(X |U )) en(H(Y |U ))
(1)
(2)
(1)
(2)
en(H(U,X ,X ,Y )+) en(H(U,X )+H(X |U )+H(Y |U ))

= en(
(1)
(2)
= en(I(X ,X ;Y |U )) ,

I(X (1) ;X (2) |U )+I(X (1) ,X (2) ;Y

|U ))

(14.19)
(14.20)
(14.21)
(14.22)

i.e.,
(1)

(2)

nR
eX

nR
eX





Pr F1,m(1) ,m(2) (1, 1, 1)

m(1) =2 m(2) =2
n(R(1) +R(2) )

en(I(X

(1) ,X (2) ;Y

|U ))

(14.23)

Plugging these terms back into (14.5) now yields



(0)
Pe(n) t n, , U X (1) X (2) Y + en(R I(U ;Y )+)
(1)
(2)
(1)
(2)
(2)
(1)
+ en(R I(X ;Y |X ,U )+) + en(R I(X ;Y |X ,U )+)
+ en(R

(1)

+R(2) I(X (1) ,X (2) ;Y |U )+)

(14.24)
(14.25)

for n large enough and if

R(0) < I(U ; Y ),


(1)

R(2)

R(1) + R(2)

(1)



< I X ; Y X (2) , U ,


< I X (2) ; Y X (1) , U ,

< I X (1) , X (2) ; Y U .

(14.26)
(14.27)
(14.28)
(14.29)

Note that (14.26) and (14.29) combined with the Markov chain

U (
X (1) , X (2) (
Y
(14.30)
results in

R(0) + R(1) + R(2) < I(U ; Y ) + I X (1) , X (2) ; Y U

= I U, X (1) , X (2) ; Y



= I X (1) , X (2) ; Y + I U ; Y X (1) , X (2)

= I X (1) , X (2) ; Y .
Hence, instead of (14.26)(14.29) we can equivalently write

R(1) < I X (1) ; Y X (2) , U ,

R(2) < I X (2) ; Y X (1) , U ,




R(1) + R(2) < I X (1) , X (2) ; Y U ,

R(0) + R(1) + R(2) < I X (1) , X (2) ; Y .

(14.31)
(14.32)
(14.33)
(14.34)

(14.35)
(14.36)
(14.37)
(14.38)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


332

The Multiple-Access Channel with Common Message


R(0)

R(2)

R(1)
Figure 14.2: The shape of the achievable region (14.35)(14.38) for the MAC
with common message for a fixed choice of QU QX (1) |U QX (2) |U .

Together with the implicit nonnegativity constraints on the three rates,


these four conditions describe a region with seven faces. See Figure 14.2
as an example.

Since we can freely choose QU , QX (1) |U and QX (2) |U , we can now take
the convex hull of the union of (14.35)(14.38) for such choices. Note,
however, that it can be shown (and we will prove it in the following
section) that the union of (14.35)(14.38) already is convex, i.e., we do
not need the convex-hull operation.

Also note that the shape of the region defined in (14.26)(14.29) is not
as shown in Figure 14.2. This is not a contradiction because the true
shape of the capacity region is given by the union of these regions and
this union is the same irrespectively whether we use (14.26)(14.29) or
(14.35)(14.38).

This finishes the derivation of an achievable region.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


14.3. Converse

14.3

333

Converse

The converse again relies on the Fano Inequality (Proposition 1.13) with an
observation Y about M (0) , M (1) , M (2) :




log 2
(0)
(1)
(2) n
(n)
(0)
(1)
(2)
H M , M , M Y1 n
(14.39)
+ Pe R + R + R
n
, nn ,
(14.40)
(n)

where n 0 as n if Pe 0.
(n)
So assume a given system with Pe 0. For such a system we have

nR(1) = H M (1)
(14.41)


(1)
n
(1) n
= I M ; Y1 + H M Y1
(14.42)


(1)
n
(0)
(1)
(2) n
I M ; Y1 + H M , M , M Y1
(14.43)

(1)
n
I M ; Y1 + nn
(14.44)

(1)
n
(0)
(2)
I M ; Y1 , M , M
+ nn
(14.45)


(2)
(1)
n (0)
+ nn
(14.46)
= I M ; Y1 M , M
n
X




H Yk Y1k1 , M (0) , M (2) H Yk Y1k1 , M (0) , M (1) , M (2)
=
k=1

+ nn
n 
X


=
H Yk Y1k1 , M (0) , M (2) , x(2) (M (0) , M (2) )

(14.47)

k=1


H Yk Y1k1 , M (0) , M (1) , M (2) , x(1) (M (0) , M (1) ),

x(2) (M (0) , M (2) ) + nn
n 
X


H Yk Y1k1 , M (0) , M (2) , X(2)
=

(14.48)

k=1



H Yk Y1k1 , M (0) , M (1) , M (2) , X(1) , X(2) + nn
(14.49)
n 


X
k1

(1) (2)
(0)
(2)
(2)
(0)

=
H Yk Y1 , M , M , X
H Yk Xk , Xk , M
k=1

+ nn
(14.50)
n  




X
(2)
(1) (2)

H Yk Xk , M (0) H Yk Xk , Xk , M (0) + nn (14.51)


k=1

n 


X
(2)
(1)
=
I Xk ; Yk Xk , M (0) + nn .

(14.52)

k=1

Here, (14.41) follows from the assumption that M (1) is uniformly distributed
(1)
over {1, . . . , enR }; (14.44) follows from (14.40); in (14.45) we make use of the
independence of M (1) and (M (0) , M (2) ); (14.48) follows because the codewords
are deterministic functions of the corresponding messages; in the next step

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


334

The Multiple-Access Channel with Common Message


(14.49) we simplify our notation and write X(i) for x(i) M (0) , M (i) ; in (14.50)
we use the assumption that our DM-MAC is memoryless and used without
feedback; and (14.51) follows from conditioning that reduces entropy.
We next introduce a random variable T , which is independent of (M (0) ,
(1)
M , M (2) ) and uniformly distributed over {1, 2, . . . , n}, and a random vector
(1)
U , (M (0) , T ). Furthermore, we define the random variables X (1) , XT ,
(2)
X (2) , XT , and Y , YT , so that QU,X (1) ,X (2) ,Y factors as




QU (u) QX (1) |U x(1) u QX (2) |U x(2) u QY |X (1) ,X (2) y x(1) , x(2)
(14.53)
for all u, x(1) , x(2) , y. Hence,
n


1 X  (1)
(2)
I Xk ; Yk Xk , M (0) + n
R(1)
n
k=1
n


X
1  (1)
(2)
=
I XT ; YT XT , M (0) , T = k + n
n
k=1



(2)
(1)
= I XT ; YT XT , M (0) , T + n


= I X (1) ; Y X (2) , U + n .

(14.54)
(14.55)
(14.56)
(14.57)

By symmetry, using the same definitions, we get analogously




R(2) I X (2) ; Y X (1) , U + n .

(14.58)

Similarly, by the fact that M (1) and M (2) are independent and uniformly
distributed over their respective index set,
nR(1) + nR(2)
= H M (1) , M (2)
=

=
=
=

(1)

(2)

(14.59)

I M , M ; Y1n + H M (1) , M (2) Y1n



I M (1) , M (2) ; Y1n + nn

I M (1) , M (2) ; Y1n , M (0) + nn


I M (1) , M (2) ; Y1n M (0) + nn
n 
X




H Yk Y1k1 , M (0) H Yk Y1k1 , M (0) , M (1) , M (2)
k=1
n 
X


H Yk Y1k1 , M (0)
k=1


(14.60)
(14.61)
(14.62)
(14.63)
+ nn (14.64)



H Yk Y1k1 , M (0) , M (1) , M (2) , X(1) , X(2) + nn
n 


X


(1) (2)
=
H Yk Y1k1 , M (0) H Yk Xk , Xk , M (0) + nn

(14.65)





(1) (2)
H Yk M (0) H Yk Xk , Xk , M (0) + nn

(14.67)

k=1
n 
X

k=1
n
X





(1)
(2)
I Xk , Xk ; Yk M (0) + nn ,

k=1

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(14.66)

(14.68)

14.4. Capacity Region

335

and hence

R(1) + R(2) I X (1) , X (2) ; Y U + n .

(14.69)

Finally,
nR(0) + nR(1) + nR(2)
= H M (0) , M (1) , M (2)
=

=
=

(0)

(1)

(2)

(14.70)

I M , M , M ; Y1n + H M (0) , M (1) , M (2) Y1n



I M (0) , M (1) , M (2) ; Y1n + nn
n 
X




H Yk Y1k1 H Yk Y1k1 , M (0) , M (1) , M (2) + nn
k=1
n 
X




H Yk Y1k1 H Yk Y1k1 , M (0) , M (1) , M (2) , X(1) , X(2)
k=1

(14.71)
(14.72)
(14.73)

+ nn
n 


X


(1) (2)
=
+ nn
H Yk Y1k1 H Yk Xk , Xk

(14.74)



(1) (2)
H(Yk ) H Yk Xk , Xk
+ nn

(14.76)

k=1
n 
X
k=1
n
X



(1)
(2)
I Xk , Xk ; Yk + nn ,

(14.75)

(14.77)

k=1

and hence

R(0) + R(1) + R(2) I X (1) , X (2) ; Y + n .

(14.78)

So we see that the achievable region derived in Section 14.2 actually is the
best possible region. Note that for discrete alphabets there is no real difference
between a random vector and a random variable, i.e., we could also write U
instead of U.
Also note that since we have not excluded the possibility of time-sharing
in the proof of the converse, we see that the converse indirectly is proof that
the region derived in Section 14.2 is convex! One could of course also directly
check this, but after the proof of the converse this is not anymore necessary.

14.4

Capacity Region

We have the following result.


Theorem 14.1 (Capacity Region of MAC with Common Message
[SW73a]).
The capacity region for the multiple-access channel with common message

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


336

The Multiple-Access Channel with Common Message


is given by all rate triples R(0) , R(1) , R(2) satisfying

R(1) < I X (1) ; Y X (2) , U ,

R(2) < I X (2) ; Y X (1) , U ,




R(1) + R(2) < I X (1) , X (2) ; Y U ,

R(0) + R(1) + R(2) < I X (1) , X (2) ; Y 

(14.79)
(14.80)
(14.81)
(14.82)

for some QU QX (1) |U QX (2) |U QY |X (1) ,X (2) . The alphabet size of the
auxiliary RV U can be limited to
o
n



(14.83)
|U| min |Y| + 3, X (1) X (2) + 2 .
Proof: The achievability and converse have already been proven. It only
remains to show the bound on the cardinality of U.
Consider a given choice of U and
QU,X (1) ,X (2) ,Y = QU QX (1) |U QX (2) |U QY |X (1) ,X (2)

(14.84)

and note that




 X

I X (1) ; Y X (2) , U =
QU (u) I X (1) ; Y X (2) , U = u ,

(14.85)

uU

I X

(2)




 X
QU (u) I X (2) ; Y X (1) , U = u ,
; Y X (1) , U =

(14.86)

uU

 X


I X (1) , X (2) ; Y U =
QU (u) I X (1) , X (2) ; Y U = u ,

(14.87)

uU

and for all x(1) X (1) and x(2) X (2)



 X
QX (1) ,X (2) x(1) , x(2) =
QU (u)QX (1) ,X (2) |U x(1) , x(2) u

(14.88)

uU

X
uU



QU (u)QX (1) |U x(1) u QX (2) |U x(2) u .

(14.89)

(i)
For
of
 simplicity
notation and without loss of generality, assume that X =
(i)
1, 2, . . . , |X | . Now we define the vector v:






v , I X (1) ; Y X (2) , U , I X (2) ; Y X (1) , U , I X (1) , X (2) ; Y U ,

(1)
(2)
QX (1) ,X (2) (1, 1), . . . , QX (1) ,X (2) |X |, |X | 1 ,
(14.90)

and for each u U the vector vu :







vu , I X (1) ; Y X (2) , U = u , I X (2) ; Y X (1) , U = u ,


I X (1) , X (2) ; Y U = u , QX (1) |U (1)QX (2) |U (1|u),


. . . , QX (1) |U |X (1) | u QX (2) |U |X (2) | 1 u ,

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(14.91)

14.4. Capacity Region

337

such that by (14.85)(14.89)


v=

QU (u) vu .

(14.92)

uU

We see that v is a convex combination of |U| vectors vu . From Fenchel


Eggleston strengthening of Caratheodorys Theorem (Theorem 1.22) it now
follows that we can reduce the size of U to at most |X (1) ||X (2) |+2 values (note
that v contains |X (1) | |X (2) | + 2 components!) without changing the values of
the right-hand side of (14.79)(14.81) and without changing QX (1) ,X (2) . Note
that since QX (1) ,X (2) remains unchanged, also the right-hand side of (14.82)
remains the same.
Alternatively, we may consider a given choice of U and (14.84) and note
that beside (14.85)(14.87) we have



I X (1) , X (2) ; Y = H(Y ) H Y X (1) , X (2)
(14.93)
(1) (2) 

= H(Y ) H Y X , X , U
(14.94)

X
(1) (2)

=
QU (u) H(Y ) H Y X , X , U = u , (14.95)
uU


where we have made use of the Markov chain U (
X (1) , X (2) (
Y.
Moreover, we also have for all y Y
X
QY (y) =
QU (u)QY |U (y|u).
(14.96)
uU

For simplicity of notation and without loss of generality, assume that Y =


{1, 2, . . . , |Y|}. Now we define the vector v:






v , I X (1) ; Y X (2) , U , I X (2) ; Y X (1) , U , I X (1) , X (2) ; Y U ,


I X (1) , X (2) ; Y , QY (1), . . . , QY (|Y| 1) ,
(14.97)
and for each u U the vector vu :





vu , I X (1) ; Y X (2) , U = u , I X (2) ; Y X (1) , U = u ,




I X (1) , X (2) ; Y U = u , H(Y ) H Y X (1) , X (2) , U = u ,

QY |U (1), . . . , QY |U (|Y| 1|u) ,

(14.98)

such that by (14.85)(14.87), (14.95), and (14.96) we again see that v is a


convex combination of the vectors vu
X
v=
QU (u) vu .
(14.99)
uU

From FenchelEggleston strengthening of Caratheodorys Theorem (Theorem


1.22) it now follows that we can reduce the size of U to at most |Y| + 3 values
(note that v contains |Y| + 3 components!) without changing the values of
the right-hand side of (14.79)(14.82).
Since Caratheodorys Theorem provides a sufficient, but maybe not necessary condition, we can take the smaller of the values as an upper bound on
the needed size of U.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Chapter 15

Discrete Memoryless
Networks and the Cut-Set
Bound
15.1

Discrete Memoryless Networks

So far we have only seen multiple-user problems where we had either only
one transmitter, but several receivers, or we had several transmitters, but
only one receiver. In a much more general setting, however, we can think
of a situation with many different terminals, where each terminal potentially
can be transmitter, receiver, or even both. Moreover, a terminal might have
several messages intended for different receivers, some of them for only one
particular terminal (a private message) and some for several receivers at the
same time (a common or at least partially common message).
Such a general network can be described by a discrete memoryless network
(DMN), which is a very broad generalization of a discrete memoryless channel
(DMC).
Definition 15.1 ([CT06]). A discrete memoryless network (DMN) is a discrete-time, synchronously clocked network consisting of T different terminals
t T = {1, . . . , T}. These terminals are all connected via a channel and
potentially all act simultaneously as transmitter and receiver.
In the DMN there exist M statistically independent messages M (m) , m =

(m)
1, . . . , M, each of which is uniformly distributed over 1, . . . , enR
, i.e., ev(m)
(m)
ery message M
has a rate R . Each message originates at exactly one
terminal and is intended for one or more other terminals. We denote by
M(t) {1, . . . , M} the set of (indices of the) messages originating at terminal t, and by D(m) the set of terminals that are intended receivers of the mth
message M (m) .
(t)
At every time-step k, every terminal t T emits a channel input Xk that
is based on its messages M (m) , m M(t), and on the previously observed
(t)
(t)
channel outputs Y1 , . . . , Yk1 . Afterwards it observes a new channel output

339

c Stefan M. Moser, vers. 2.5


340

Discrete Memoryless Networks and the Cut-Set Bound

(t)

Yk , that is a noisy function of all current channel inputs:




(t)
(1)
(2)
(T)
(t)
Yk = f (t) Xk , Xk , . . . , Xk , Nk

(15.1)

(t)

for some function f (t) () and for some noise RV Nk that is independent of all
other random variables and IID over time. Hence, we can describe the channel
again by a conditional probability distribution PY (1) ,...,Y (T) |X (1) ,...,X (T) (|) that
does not change over time. Note that the setup of the model is such that we
(t)
have a causal operation of the network: The channel inputs Xk are applied
after clock tick k 1, but before clock tick k, so that they serve as current
(t)
inputs for the channel outputs Yk .
In Figure 15.1 we have depicted an example of a DMN with five terminals.
Note that it is possible that some terminal acts as receiver only, in which
case we will omit the corresponding arrow of the nonexistent channel input.
Similarly, it is possible that the conditional channel probability distribution
PY (1) ,...,Y (T) |X (1) ,...,X (T) (|) is such that the channel output of some terminal t
is completely independent of any messages and only depends on noise and is
therefore completely useless for that terminal. In this case we will omit this
particular channel output. The decisions of the terminals about their intended
(m) (t), t D(m) .
received messages are denoted by M
We remark that while this definition of a DMN covers all networks considered so far, it does not accommodate the concept of a common message
between several transmitters because in Definition 15.1 we ask for all messages
to be independent and to originate at exactly one terminal only.
Definition 15.2. The capacity
 region C of a DMN is the closure of the set
(1)
(M)
of rate tuples R , . . . , R
for which, for sufficiently large n, there are
encoders and decoders so that the error probability

M
[
[ 

(m) (t) 6= M (m)
Pe(n) = Pr
M
(15.2)
m=1 tD(m)

can be made arbitrarily close to 0.

15.2

Cut-Set Bound

Obviously, it is near impossible to find a closed-form expression for the capacity region of a DMN, seeing that we could not even solve some simple
examples of a DMN like the general BC. However, one can still say something
about the network. A particularly interesting and actually quite simple idea
is the Cut-Set Bound [CT06], [EG81]. This is an attempt to generalize the
typical proof of a converse based on the Fano Inequality (Proposition 1.13)
to the setup of a DMN. Hence, it will provide outer bounds on the capacity
region.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


15.2. Cut-Set Bound

341

M (4)
Terminal 5
(2) (1)
M

M (1)

(1) (3)
M

X(5)

Y(3)

X(1)

Terminal 1
Y(1)

DMN

(4) (3)
M

Terminal 3

(2) (3)
M

(3) (3)
M

channel
X(2)

Terminal 2
M (2)

Y(2)

M (3)

X(4)

Y(4)

Terminal 4
(1) (4)
M

Figure 15.1: A DMN with five terminals and four messages. Note that
(m) (t) denotes the decision about the mth message at terminal
M
t. In this example, Terminal 5 does not get any useful information back from the channel and we have therefore omitted Y(5) .
Also note that Terminal 3 acts as pure receiver without giving
any feedback into the network.
Definition 15.3. If the set of terminals T = {1, . . . , T} is partitioned into
two sets S and S, then the pair (S, S) is called a cut.
We say that the cut (S, S) separates a message M (m) and its decision
(m)

M (t) at terminal t, if M (m) originates at a terminal in S, but t S.


To simplify our notation we introduce the following shorthands:
Let M(S) denote the set of (indices of) messages that is separated from
at least one of its decisions by the cut (S, S):
n

M(S) , m {1, . . . , M} : t S, t S s.t.
o
m M(t) and t D(m) .
(15.3)
Let M (S) be the tuple of message RVs that is separated from at least
one of its decisions by the cut (S, S):


M (S) , M (m) : m M(S)
(15.4)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


342

Discrete Memoryless Networks and the Cut-Set Bound


and let M (S) be the tuple of the remaining message RVs:


M (S) , M (m) : m {1, . . . , M} \ M(S) .

(15.5)

Let X(S) be the tuple of channel input vectors of all terminals in S:




X(S) , X(t) : t S .
(15.6)
Similarly, we define X(S) , Y(S) , and Y(S) .
The same notation is also used for single random variables:
n
o
(S)
(t)
Xk , Xk : t S .
(S)

(S)

(15.7)

(S)

Similarly, we define Xk , Yk , and Yk .


We would now like to argue in the usual manner for the derivation of a
converse, but applied only to all messages M (S) of a certain cut (S, S). So
assume that we are given a fixed coding scheme for a DMN that works in the
(n)
sense that if n tends to infinity, then Pe in (15.2) tends to zero.
Now recall the Fano Inequality (Proposition 1.13) with an observation
Y(S) about M (S) :


H M (S) Y(S)

o
[
[ n
X

(m) t 6= M (m)
log 2 + n Pr
M
R(m)
mM(S) tD(m) S

log 2 + n Pr

M
[

(15.8)

[ 

m=1 tD(m)

log 2
=n
+ Pe(n)
n

mM(S)

M
X


(m) (t) 6= M (m)
M

M
X

!
R(m)

(15.9)

m=1

!!
R

(m)

(15.10)

m=1

, nn .

(15.11)

Here, the first inequality (15.8) follows from Proposition 1.13; in the second
inequality (15.9) we enlarge both the set in the probability expression as well
as the sum; and (15.11) should be read as definition of n . Note that n 0
(n)
as n because we assume that Pe 0.
P
(m)
So, using that M (S) is uniformly distributed over en mM(S) R
different
values, we have

X
n
R(m)
mM(S)

= H M (S)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(15.12)

15.2. Cut-Set Bound

343




= I M (S) ; Y(S) + H M (S) Y(S)

I M (S) ; Y(S) + nn

I M (S) ; Y(S) , M (S) + nn


= I M (S) ; Y(S) M (S) + nn




= H Y(S) M (S) H Y(S) M (S) , M (S) + nn
n  


X
(S) (S)
(S)
=
H Yk Y1 , . . . , Yk1 , M (S)

(15.13)
(15.14)
(15.15)
(15.16)
(15.17)

k=1

(S)
Yk



(S)
(S)
(S)
(S)
+ nn
Y1 , . . . , Yk1 , M , M

(15.18)

(S)
Yk



(S)
(S)
(S)
(S)
+ nn
Y1 , . . . , Yk1 , M , M

(15.19)

H
n  


X
(S)
(S) (S)
(S)
=
H Yk Y1 , . . . , Yk1 , M (S) , Xk
k=1

H
n  


X
(S) (S)

H Yk Xk
k=1




(S)
(S) (S)
(S)
(S)
+ nn (15.20)
H Yk Y1 , . . . , Yk1 , M (S) , M (S) , Xk , Xk
n  





X
(S) (S)
(S) (S)
(S)
=
H Yk Xk
H Yk Xk , Xk
+ nn
(15.21)
k=1

n 


X
(S)
(S) (S)
=
I Xk ; Yk Xk
+ nn
k=1
n
X


1  (S) (S) (S)
I XZ ; YZ XZ , Z = k + nn
n
k=1



(S)
(S) (S)
= n I XZ ; YZ XZ , Z + nn






(S) (S)
(S) (S)
(S)
= n H YZ XZ , Z n H YZ XZ , XZ , Z + nn






(S) (S)
(S) (S)
(S)
= n H YZ XZ , Z n H YZ XZ , XZ
+ nn






(S) (S)
(S) (S)
(S)
n H YZ XZ
n H YZ XZ , XZ
+ nn



(S)
(S) (S)
= n I XZ ; YZ XZ
+ nn .
=n

(15.22)
(15.23)
(15.24)
(15.25)
(15.26)
(15.27)
(15.28)

Here, the inequality (15.14) follows from (15.11); in the following inequality
(15.15) we add some argument to the mutual information; in (15.16) we make
use of the basic assumption of a DMN that all messages are independent of
each other; (15.18) follows from the chain rule.
Then in (15.19) we first note that by definition M (S) contains all messages that originate from terminals in S (it also contains all those messages
that originate in S and whose destinations are all in S). Further we note
that all terminals in S generate their channel inputs at time k from their

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


344

Discrete Memoryless Networks and the Cut-Set Bound

observations of the past channel outputs and their messages. Hence, from
(S)
(S)
(S)
Y1 , . . . , Yk1 , M (S) , we can generate directly Xk . The subsequent inequality (15.20) is based on conditioning that reduces entropy; in (15.21) we apply
the basic assumption about the DMN that the current channel outputs only
depend on the current channel inputs.
In (15.23) we introduce the RV Z that is independent of any other RV
and uniformly distributed over {1, . . . , n}; and (15.26) follows because of the
Markov chain




(1)
(T)
(1)
(T)
Z (
XZ , . . . , XZ (
YZ , . . . , YZ .
(15.29)
i.e., the joint distribution is as follows:
QZ,X (1) ,...,X (T) ,Y (1) ,...,Y (T) = QZ QX (1) ,...,X (T) |Z QY (1) ,...,Y (T) |X (1) ,...,X (T) ,

(15.30)

where QZ is the uniform distribution as introduced above, where the third


factor QY (1) ,...,Y (T) |X (1) ,...,X (T) is the channel law of the DMN channel, and
where QX (1) ,...,X (T) |Z is a distribution on the channel inputs that is implicitly
given by the used coding scheme. Finally, in (15.27) we again use conditioning
that cannot increase entropy.
So, let R QX (1) ,...,X (T) , S be the set of nonnegative rate tuples R(1) , . . . ,

R(M) that are permitted by (15.28):

R QX (1) ,...,X (T) , S

X



,
R(1) , . . . , R(M) :
R(m) I X (S) ; Y (S) X (S) .
(15.31)

mM(S)

Now note the important fact that the used distribution (15.30) is the same
for all cuts S. Hence, for a given QX (1) ,...,X (T) , the achievable rate tuples must
lie in the set
\


(15.32)
R QX (1) ,...,X (T) , S
R QX (1) ,...,X (T) =
ST

and therefore the capacity region C must lie within the union of all these
regions for all possible choices of the distribution QX (1) ,...,X (T) . We have shown
the following.
Theorem 15.4 (Cut-Set Bound [CT06], [EG81]).
Consider a DMN QY (1) ,...,Y (T) |X (1) ,...,X (T) with M independent messages
M (m) of rate R(m) , m = 1, . . . , M. The capacity region C must satisfy
[
\

C
R QX (1) ,...,X (T) , S ,
(15.33)
QX (1) ,...,X (T) ST

where R(, ) is defined (15.31) and where we rely on the notational con-

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


15.3. Examples

345

vention given in (15.3).


Note that (15.33) involves many different possible cuts S, however, we
need not evaluate all of them. Any subset of possible choices for S also leads
to a correct outer bound of the capacity region. However, there are some
important points to notice:
The bound involves first an intersection of regions and then a union, and
not the other way around!
We must maximize the region in (15.33) over all possible choices of
QX (1) ,...,X (T) . (Fortunately, this is a convex optimization problem!)
We cannot incorporate any further knowledge about independence of different terminals. In the derivation of the Cut-Set Bound we have already
made use of the assumption that the different messages are independent.
In the generality of this bound, however, it is also assumed that each
terminal might observe channel outputs, which implicitly makes all terminals dependent on each other during the progress of the transmission.
Hence, the optimization in (15.33) is always over the joint input distribution.
We also once more point out that the Cut-Set Bound does not cover the concept of a common message that originates at several transmitters at the same
time. However, such a common message could be approximated by giving
the DMN a connection between two terminals that is noiseless. This way,
the first (finite) few transmissions could be used to convey the message from
one terminal to another, making it common between them. The remaining
time-steps can now be used by these two terminals cooperatively to transmit
the common message to one or more other terminals.

15.3

Examples

15.3.1

Broadcast Channel

Consider the BC given in Figure 15.2.


From all possible cuts only three are interesting as only they separate
message from decision: S1 = {2, 3}, S2 = {1, 3}, and S3 = {3}. The first
(0) (1) and M
(1) , and the corresponding
cut separates M (0) and M (1) from M
expression (15.28) reads

R(0) + R(1) I X (3) ; Y (1) .
(15.34)
(0) (2)
Similarly, we find that the second cuts separates M (0) and M (2) from M
(2) , and the third cuts separates all messages from all its decisions. The
and M
corresponding bounds on the rates read:

R(0) + R(2) I X (3) ; Y (2) ,
(15.35)

(0)
(1)
(2)
(3)
(1)
(2)
R + R + R I X ;Y ,Y
.
(15.36)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


346

Discrete Memoryless Networks and the Cut-Set Bound

S1

(0) (1)
M
(1)
M

M (1)

S3
Terminal 1

Y(1)
Broadcast
Channel

M (0)

nal 3

(0) (2)
M
(2)
M

Termi-

X(3)

QY (1) ,Y (2) |X (3)


Terminal 2 Y(2)

M (2)
S2

Figure 15.2: Application of the Cut-Set Bound on the broadcast channel.

If we rename X (3) to its more usual X and maximize over QX , we see that the
Cut-Set Bound corresponds to the simplest outer bound (13.289)(13.291).

15.3.2

Multiple-Access Channel

Consider the MAC shown in Figure 15.3.


S3

S1
X(1)

(1)

MAC

(2) Terminal 3
M

M (1)
Terminal 1

(3)

QnY |X (1) ,X (2)

X(2)

M (2)
Terminal 2

S2
Figure 15.3: Application of the Cut-Set Bound on the multiple-access channel.
Again there are only three interesting cuts: S1 = {1}, S2 = {2}, and
(1) , the second separates
S3 = {1, 2}. The first cut separates M (1) from M
(2) , and the third both messages from both decisions. Hence, we
M (2) from M
get


R(1) I X (1) ; Y (3) X (2) ,


R(2) I X (2) ; Y (3) X (1) ,

R(1) + R(2) I X (1) , X (2) ; Y (3) ,

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(15.37)
(15.38)
(15.39)

15.3. Examples

347

or, when we replace Y (3) with the more usual Y ,




R(1) I X (1) ; Y X (2) ,


R(2) I X (2) ; Y X (1) ,

R(1) + R(2) I X (1) , X (2) ; Y .

(15.40)
(15.41)
(15.42)

This we now have to maximize over all joint distributions QX (1) ,X (2) . Hence,
we see that the Cut-Set Bound gives the right mutual information terms (compare with Theorem 10.9!), but it is too large because we maximize over the
joint distribution instead of the product distribution.

15.3.3

Single-Relay Channel

Consider the channel model shown in Figure 15.4. This channel is called
Terminal 2

S2

M
Terminal 3

Y(3)

Y(2)
X(2)

QnY (2) ,Y (3) |X (1) ,X (2)

S1
X(1)

M
Terminal 1

Figure 15.4: Application of the Cut-Set Bound on a single-relay channel.


relay channel because one terminal, Terminal 2, does neither have its own
message nor is it intended receiver of any other message. It only helps in
the transmission of the message from Terminal 1 to Terminal 3. Recall the
causality constraint of a DMN, i.e., the relay prepares its current channel input
(2)
(2)
(2)
Xk between time k 1 and k, based on the channel outputs Y1 , . . . , Yk1 .
Since there is only one message in the system, we can only find two cuts
that are interesting: S1 = {1} and S2 = {1, 2}. The corresponding inequalities
are


R I X (1) ; Y (2) , Y (3) X (2) ,

R I X (1) , X (2) ; Y (3) .

(15.43)
(15.44)

Since both must be satisfied, we see that the capacity is bounded as follows:
C

max

QX (1) ,X (2)

n


o
min I X (1) ; Y (2) , Y (3) X (2) , I X (1) X (2) ; Y (3) .

(15.45)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


348

Discrete Memoryless Networks and the Cut-Set Bound

15.3.4

Double-Relay Channel

Finally, we show the example of a double-relay channel as shown in Figure 15.5.


Here, there are two terminals that are helping the transmission of the only
Terminal 2
Y(2)

S2
X

Terminal 4

(2)

S1
X(1)

Y(4)
Channel

S3
S4

X(3)

Terminal 1

Y(3)

Terminal 3

Figure 15.5: Application of the Cut-Set Bound on a double-relay channel.


message to its destination.
In spite of the already quite large number of four terminals, we only find
four different cuts that yield interesting expressions: S1 = {1}, S2 = {1, 2},
S3 = {1, 3}, and S4 = {1, 2, 3}. The corresponding inequalities are


(15.46)
R I X (1) ; Y (2) , Y (3) , Y (4) X (2) , X (3) ,


(15.47)
R I X (1) , X (2) ; Y (3) , Y (4) X (3) ,


(1)
(3)
(2)
(4) (2)
X
,
(15.48)
R I X ,X ;Y ,Y

(1)
(2)
(3)
(4)
R I X ,X ,X ;Y
.
(15.49)
Hence the capacity is bounded as follows:
n


C
max
min I X (1) ; Y (2) , Y (3) , Y (4) X (2) , X (3) ,
QX (1) ,X (2) ,X (3)


I X (1) , X (2) ; Y (3) , Y (4) X (3) ,


I X (1) , X (3) ; Y (2) , Y (4) X (2) ,
o
I X (1) , X (2) , X (3) ; Y (4) .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(15.50)

Chapter 16

The Interference Channel


16.1

Problem Setup

We now turn to the simplest (and therefore very important) example of a communication setup involving at the same time several transmitters and several
receivers: the interference channel (IC). In this model, we do not have any
common messages, but only transmitter-receiver pairs with a corresponding
private message each. The channel mixes the transmitted signals such that
the transmitters unintentionally interfere with each other. We will restrict our
discussion to the case of two transmitters with their corresponding receivers
as shown in Figure 16.1.
(1)
M
Dest. 1

(2)
M
Dest. 2

Dec. (1)

Dec. (2)

Y(1)

X(1)
Channel

Y(2)

QnY (1) ,Y (2)


|X (1) ,X (2)

X(2)

Enc. (1)

Enc. (2)

M (1)

Uniform
Source 1

M (2)

Uniform
Source 2

Figure 16.1: A channel coding problem with two sources and two destinations:
The sources independently try to transmit their message M (i) ,
i = 1, 2, to their corresponding destination and by doing so interfere with each other. This channel model is called interference
channel (IC).
Encoder 1 needs to transmit the message M (1) to destination 1, and encoder 2 needs to transmit the message M (2) to destination 2. The interference
channel will produce two outputs Y (1) and Y (2) for the inputs X (1) and X (2) ,
where both Y (1) and Y (2) depend on both X (1) and X (2) .
More formally, we have the following definitions.
Definition 16.1. A discrete memoryless interference channel (DM-IC) consists of four alphabets X (1) , X (2) , Y (1) , Y (2) and a conditional probability
distribution QY (1) ,Y (2) |X (1) ,X (2) such that when it is used without feedback, we

349

c Stefan M. Moser, vers. 2.5


350

The Interference Channel

have


QY(1) ,Y(2) |X(1) ,X(2) y(1) , y(2) x(1) , x(2)
n



Y
(1) (2) (1) (2)
=
QY (1) ,Y (2) |X (1) ,X (2) yk , yk xk , xk .

(16.1)

k=1


(1)
(2)
Definition 16.2. An enR , enR , n coding scheme for a DM-IC consists of
two sets of indices
n
o
(1)
M(1) = 1, 2, . . . , enR
,
(16.2)
n
o
(2)
M(2) = 1, 2, . . . , enR
(16.3)
called message sets, two encoding functions
n

(16.4)


(2) n

(16.5)

M(1) ,

(16.6)

(1) : M(1) X (1)

(2)

:M

(2)

and two decoding functions


(1) : Y (1)
(2) : Y

n


(2) n

M(2) .

(16.7)


(1)
(2)
The error probability of an enR , enR , n coding scheme for a DM-IC
is given as




(16.8)
Pe(n) , Pr (1) Y(1) 6= M (1) or (2) Y(2) 6= M (2) .

Definition 16.3. A rate pair R(1) , R(2) is said to be achievable for the IC if

(2)
(1)
(n)
there exists a sequence of enR , enR , n coding schemes with Pe 0 as
n .
The capacity region of the IC is defined to be the closure of the set of all
achievable rate pairs.
Example 16.4 (Independent BSCs). Assume we have two independent BSCs
as shown in Figure 16.2. We know that X (1) can transmit at most at a rate
of C1 = 1 Hb (1 ) and X (2) at a rate of C2 = 1 Hb (2 ) bits. There is
no interference. Hence, the capacity region is the rectangular region shown in
Figure 16.3.

Normally, the rectangular rate region

(2)

(1)
(1)
(1)
(1)
(2)

R C , max (2) I X ; Y X = x ,
QX (1) , x


(2)
(2)

R C , max I X (2) ; Y (2) X (1) = x(1)


QX (2) , x(1)

is a (quite loose) outer bound on the capacity region.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(16.9)
(16.10)

16.1. Problem Setup

351

1 1

1
Y (1)

X (1)
1
1

1 1

1 2

2
X (2)

Y (2)
2
1

1 2

Figure 16.2: Two independent BSCs form an interference channel.

R(2)
C2

C1 R(1)
Figure 16.3: The capacity region of the IC consisting of two independent
BSCs.

R(2)
C(2)

C(1) R(1)
Figure 16.4: An achievable rate region for a DM-IC.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


352

The Interference Channel

On the other hand, C(1) and C(2) also provide


 an achievable
 inner bound:
Obviously, for every IC, the rate pairs C(1) , 0 and 0, C(2) are achievable.
Hence, by time-sharing, the triangular region in Figure 16.4 is achievable.
Again, normally this bound is very loose. However, there are cases for which
it is tight.
Example 16.5 (Binary EXOR IC). Consider the DM-IC with only binary
alphabets where
Y (1) = Y (2) = X (1) X (2) .

(16.11)

The capacity region of this IC is a triangle with corner points (0, 0), (0, 1 bit),
and (1 bit, 0). The achievability follows directly from the inner bound of
Figure 16.4, the converse from the Cut-Set Bound (see Theorem 16.7 below).

Very similarly to the BC, the capacity region of the IC does only depend
on the marginal distributions.
Theorem 16.6. The capacity region of an IC depends only on the conditional
marginal distributions QY (1) |X (1) ,X (2) and QY (2) |X (1) ,X (2) and not on the joint
conditional channel law QY (1) ,Y (2) |X (1) ,X (2) .
Proof: Define





Pe(n) , Pr (1) Y(1) 6= M (1) (2) Y(2) 6= M (2) ,



Pe(n),(1) , Pr (1) Y(1) 6= M (1) ,



Pe(n),(2) , Pr (2) Y(2) 6= M (2) .

(16.12)
(16.13)
(16.14)

Then, by the Union Bound we have


Pe(n) Pe(n)(1) + Pe(n)(2) ,



and because (i) Y(i) 6= M (i) implies
 (1) (1) 



Y
6= M (1) (2) Y(2) 6= M (2) ,

(16.15)

we have


Pe(n) max Pe(n),(1) , Pe(n),(2) .
(n)

(n)(1)

(16.16)
(n)(2)

Hence, Pe 0 if, and only if, Pe


0 and Pe
0. There(n),(1)
(n),(2)
fore, the capacity region only depends on Pe
and Pe
, i.e., only on
QY (1) |X (1) ,X (2) and QY (2) |X (1) ,X (2) .
Hence, for all rate regions, we can always choose an optimized channel
(1) (2) (1) (2) that might be different from the actual channel law
law Q
Y ,Y |X ,X
QY (1) ,Y (2) |X (1) ,X (2) as long as it has the same marginals
(1) (1) (2) = Q (1) (1) (2) ,
Q
Y |X ,X
Y |X ,X

QY (2) |X (1) ,X (2) = QY (2) |X (1) ,X (2) .

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(16.17)
(16.18)

16.2. Some Simple Capacity Region Outer Bounds

(1)
M
Dest. 1

(2)
M
Dest. 2

Dec. (1)

Dec. (2)

Y(1)

X(1)
Channel

Y(2)

QnY (1) ,Y (2)


|X (1) ,X (2)

X(2)

Enc. (1)

Enc. (2)

353

M (1)

Uniform
Source 1

M (2)

Uniform
Source 2

Figure 16.5: Seven cuts of the IC.

16.2

Some Simple Capacity Region Outer Bounds

16.2.1

Cut-Set Bound

We apply the Cut-Set Bound (Theorem 15.4) to the IC with transmitters 1


and 2 and receivers 1 and 2 labeled as terminals 1, 2, 3, and 4, respectively.
We have in total seven possible cuts (see Figure 16.5):


S {1}, {2}, {1, 2}, {1, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3}

(16.19)

yielding the following seven inequalities:




R(1) I X (1) ; Y (1) , Y (2) X (2)




= I X (1) ; Y (1) X (2) + I X (1) ; Y (2) X (2) , Y (1) ,


R(2) I X (2) ; Y (1) , Y (2) X (1)




= I X (2) ; Y (2) X (1) + I X (2) ; Y (1) X (1) , Y (2) ,

R(1) + R(2) I X (1) , X (2) ; Y (1) , Y (2) ,


R(1) I X (1) ; Y (1) X (2) ,


R(2) I X (2) ; Y (2) X (1) ,

R(1) I X (1) , X (2) ; Y (1)



= I X (2) ; Y (1) + I X (1) ; Y (1) , Y (2) X (2) ,

R(2) I X (1) , X (2) ; Y (2)



= I X (2) ; Y (1) + I X (1) ; Y (1) , Y (2) X (2) ,

(16.20)
(16.21)
(16.22)
(16.23)
(16.24)
(16.25)
(16.26)
(16.27)
(16.28)
(16.29)
(16.30)

From the expansions (16.21), (16.23), (16.28), and (16.30), we see that the
first two and the last two inequalities are implied by (16.25) and (16.26). So
they are redundant and we remain with the three inequalities (16.24)(16.26).

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


354

The Interference Channel

Theorem 16.7
 (Cut-Set Bound). On a general IC, any achievable rate
pair R(1) , R(2) must satisfy



(1)
(1)
(1) (2)

I
X
;
Y
X
,
(16.31)



(2)
(2)
(2) (1)
R I X ;Y
X
,
(16.32)

R(1) + R(2) I X (1) , X (2) ; Y (1) , Y (2) 


(16.33)
(1) (2) (1) (2) , where Q
(1) (2) (1) (2) is an arbitrary
for some QX (1) ,X (2) Q
Y ,Y |X ,X
Y ,Y |X ,X
channel law having the same marginals (16.17) and (16.18) as the given IC.
We remind the reader that due to Theorem 16.6 we can replace the channel
QY (1) ,Y (2) |X (1) ,X (2) by any other law as long as we keep the marginals correct.

16.2.2

Satos Outer Bound

It is not difficult to improve the Cut-Set Bound to have independent encoders


with coded time-sharing.
Theorem 16.8 (Satos Outer
Bound [Sat77]). On a general IC, any

(1)
(2)
achievable rate pair R , R
must satisfy

(16.34)
R(1) I X (1) ; Y (1) X (2) , T ,



(2)
(2)
(2) (1)
X ,T ,
(16.35)
R I X ;Y

R(1) + R(2) I X (1) , X (2) ; Y (1) , Y (2) T


(16.36)
(1) (2) (1) (2)
(1) (2) (1) (2) , where Q
for some QT QX (1) |T QX (2) |T Q
Y ,Y |X ,X
Y ,Y |X ,X
is an arbitrary channel law having the same marginals (16.17) and (16.18) as
the given IC. The auxiliary time-sharing random variable T can be restricted
to take value in an alphabet T with |T | = 3.

(1)
Proof: For some fixed rate pair R(1) , R(2) , consider a sequence of enR ,

(2)
(n)
enR , n coding schemes with Pe 0 as n . From Theorem 16.6 we
(n),(1)
(n),(2)
know that then also Pe
0 and Pe
0. From the Fano Inequality
(Proposition 1.13) we then know that




log 2
(1) (1)
(n),(1) (1)
+ Pe
R
, nn(1) ,
(16.37)
H M Y
n
n




log 2
H M (2) Y(2) n
+ Pe(n),(2) R(2) , nn(2) ,
(16.38)
n





log 2
+ Pe(n) R(1) + R(2)
H M (1) , M (2) Y(1) , Y(2) n
, nn , (16.39)
n
(1)

(2)

where n , n , n 0 as n . Now we can use the exact same derivation


as for the MAC, (10.62)(10.75) and (10.88)(10.92), with Y replaced by Y(1)
or Y(2) , respectively, to get the first two conditions. For
 the third condition
we use (10.78)(10.87) with Y replaced by Y(1) , Y(2) .
The bound on the alphabet size of T follows from the FenchelEggleston
strengthening of Caratheodorys Theorem (Theorem 1.22).

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


16.3. Some Simple Capacity Region Inner Bounds

16.3

355

Some Simple Capacity Region Inner Bounds

A very simple inner bound can be found by requiring that both receivers
decode both messages. This basically changes the IC into a double-MAC and
therefore will result in the following (MAC-like) achievable rate region:






(1)
(2) (2)
(1)
(1)
(1) (2)

X , T , (16.40)
X
,
T
,
I
X
;
Y
R

min
I
X
;
Y






R(2) min I X (2) ; Y (1) X (1) , T , I X (2) ; Y (2) X (1) , T , (16.41)

R(1) + R(2) minI X (1) , X (2) ; Y (1) T , I X (1) , X (2) ; Y (2) T  (16.42)
for some QT QX (1) |T QX (2) |T QY (1) |X (1) ,X (2) QY (2) |X (1) ,X (2) .
This bound can be improved if we drop the restriction that the receivers
decode the message that is not intended for them.

Theorem 16.9. On a general IC, any rate pair R(1) , R(2) is achievable that
satisfies



(1)
(1)
(1) (2)

I
X
;
Y
X ,T ,
(16.43)
R



(2)
(2)
(2) (1)
R I X ;Y
X ,T ,
(16.44)





(1)
(2)
(1)
(2)
(1)
(1)
(2)
(2)
R + R min I X , X ; Y T , I X , X ; Y T
(16.45)
for some QT QX (1) |T QX (2) |T . The auxiliary time-sharing random variable
T can be restricted to take value in an alphabet T with |T | = 4.
Proof: This random coding proof follows very closely the achievability
proof we have seen for the MAC capacity region C1 .
1: Setup: Fix R(1) , R(2) , QT , QX (1) |T , QX (2) |T , and some blocklength n.
2: Codebook Design: Generate one length-n sequence T IID QT . Then
(1)
generate enR length-n codewords X(1) (m(1) ) QnX (1) |T (|T), m(1) =
(1)

(2)

1, . . . , enR . Independently thereof, generate enR length-n codewords


(2)
X(2) (m(2) ) QnX (2) |T (|T), m(2) = 1, . . . , enR . Reveal both codebooks
and T to encoders and decoders.

3: Encoder Design: To send message m(i) , encoder (i) transmits the


codeword X(i) (m(i) ), i = 1, 2.
4: Decoder Design: Upon receiving Y(1) , the decoder (1) looks for an
m
(1) such that





T, X(1) m
(1) , X(2) m
(2) , Y(1) A(n) QT,X (1) ,X (2) ,Y (1)
(16.46)

(2)
for some m
(2) 1, . . . , enR
. If there is a unique such m
(1) , the decoder
(1)
puts out
m
(1) , m
(1) .

(16.47)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


356

The Interference Channel


Otherwise it declares an error.
Similarly, the decoder (2) looks for an m
(2) such that





T, X(1) m
(1) , X(2) m
(2) , Y(2) A(n) QT,X (1) ,X (2) ,Y (2)

(16.48)


(1)
for some m
(1) 1, . . . , enR
. If there is a unique such m
(2) , the decoder
(2)
puts out
m
(2) , m
(2) .

(16.49)

Otherwise it declares an error.


5: Performance Analysis: We first analyze decoder (1) . We define the
following events:
n
 (2) (2)  (1) 
(1)
(1)
Fm(1) ,m(2) ,
T, X m , X m , Y
o
A(n)
Q
.
(16.50)
(1)
(2)
(1)

T,X ,X ,Y
Then the error probability is given as
Pe(n),(1)
(1)

nR
eX

(2)

nR
eX

m(1) =1 m(2) =1

en(R

(1)

+R(2) )






Pr error(1) M (1) , M (2) = m(1) , m(2)
(16.51)

where, using the Union Bound, we can bound as follows:







Pr error(1) M (1) , M (2) = m(1) , m(2)

[
c
= PrFm
Fm
(1) ,m(2)
(1) ,m(2)
m
(1) 6=m(1)


(1) (2) 

Fm
(1) ,m
(2) m , m

(m
(1) ,m
(2) )6=(m(1) ,m(2) )




(1)
c
Pr Fm
, m(2)
(1) ,m(2) m
[

(16.52)

(1)

nR
eX

m
(1) =1
m
(1) 6=m(1)
(1)





(1)
(2)
Pr Fm
(1) ,m(2) m , m
(2)

nR
eX

nR
eX

m
(1) =1
m
(1) 6=m(1)

m
(2) =1
m
(2) 6=m(2)





(1)
(2)
Pr Fm
m
,
m
.

(1)
(2)
,m

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(16.53)

16.3. Some Simple Capacity Region Inner Bounds

357

We now consider each term individually. Since the transmitted codewords


are jointly distributed with the received sequence Y(1) , by TA-3b,





(1)
(2)
c
Pr Fm
m
,
m
t n, , T X (1) X (2) Y (1) . (16.54)
(1) ,m(2)
On the other hand, all other codewords are generated independently of
each other and are not transmitted, i.e., they are also independent of
Y(1) . Therefore, for m
(1) 6= m(1) ,




(1)
(2)
Pr Fm
m
,
m

(1)
(2)
,m
X


QnT (t) QnX (1) |T x(1) t QnX (2) ,Y (1) |T x(2) , y(1) t (16.55)
=
(t,x(1) ,x(2) ,
(n)
y(1) )A

X
(t,x(1) ,x(2) ,
(n)
y(1) )A

en(H(T )) en(H(X

) en(H(X (2) ,Y (1) |T )) (16.56)

(1) |T )



(1)
(2)
(1)
= A(n)
QT,X (1) ,X (2) ,Y (1) en(H(T,X )+H(X ,Y |T ))

(1)
(2)
(1)
(1)
(2)
(1)
en(H(T,X ,X ,Y )+) en(H(T,X )+H(X ,Y |T ))

(16.57)
(16.58)

=e

n( H(T,X (1) )H(X (2) ,Y (1) |T,X (1) )+H(T,X (1) )+H(X (2) ,Y (1) |T ))

(16.59)

=e

n(I(X (1) ;X (2) ,Y (1) |T ))

(16.60)

=e

n(I(X (1) ;X (2) |T )+I(X (1) ;Y (1) |X (2) ,T ))

(16.61)

=e

n(I(X (1) ;Y (1) |X (2) ,T ))

(16.62)

Here, in (16.56) we use TA-1b based on the fact that all sequences in the
sum are typical; in (16.58) we use TA-2; and the in the final step (16.62)
we rely on the conditional independence between X (1) and X (2) given T .
Similarly, we get for m
(1) 6= m(1) , m
(2) 6= m(2) :





(1)
(2)
Pr Fm
(1) ,m
(2) m , m
X



=
QnT (t) QnX (1) |T x(1) t QnX (2) |T x(2) t QnY (1) |T y(1) t
(t,x(1) ,x(2) ,
(n)
y(1) )A

(16.63)

n(H(T ))

(t,x(1) ,x(2) ,

n(H(X (1) |T ))

n(H(X (2) |T ))

(n)

y(1) )A

en(H(Y

(1) |T )

(16.64)
(n)
 n(H(T,X (1) )+H(X (2) |T )+H(Y (1) |T ))
= A
QT,X (1) ,X (2) ,Y (1) e
(16.65)
(1)
(2)
(1)
(1)
(2)
(1)
en(H(T,X ,X ,Y )+) en(H(T,X )+H(X |T )+H(Y |T )) (16.66)

= en( H(X |X ,T )H(Y |X ,X ,T )+H(X


(1)
(2)
(1)
(2)
(1)
= en(I(X ;X |T )+I(X ,X ;Y |T ))
(2)

=e

(1)

n(I(X (1) ,X (2) ;Y (1) |T ))

(1)

(1)

(2)

(2) |T )+H(Y (1) |T )

(16.67)

(16.68)
(16.69)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


358

The Interference Channel


Plugging these results back into (16.53) and (16.51) now yields

Pe(n),(1) t n, , T X (1) X (2) Y (1)
 (1)

(1)
(1)
(2)
+ enR 1 en(I(X ;Y |X ,T ))
 (1)
 (2)

(1)
(2)
(1)
+ enR 1 enR 1 en(I(X ,X ;Y |T )) (16.70)

t n, , T X (1) X (2) Y (1)
(1)
(1)
(1)
(2)
+ en(R I(X ;Y |X ,T )+)
+ en(R

(1)

+R(2) I(X (1) ,X (2) ;Y (1) |T )+)

(16.71)

A completely analogous analysis of decoder (2) yields



Pe(n),(2) t n, , T X (1) X (2) Y (2)
(2)
(2)
(2)
(1)
+ en(R I(X ;Y |X ,T )+)
+ en(R

(1)

+R(2) I(X (1) ,X (2) ;Y (2) |T )+)

(16.72)

Note that these error probabilities will tend to zero for n as long
as the three conditions (16.43)(16.45) are satisfied.
The bound on the alphabet size of T follows from the FenchelEggleston
strengthening of Caratheodorys Theorem (Theorem 1.22).
Example 16.10 (Symmetric IC). Consider an IC that is symmetric in the
sense that QY (1) |X (1) ,X (2) = QY (2) |X (1) ,X (2) . The capacity region for this IC is

R(1) I X (1) ; Y (1) X (2) , T ,


(16.73)



(2)
(2)
(2) (1)
R I X ;Y
X ,T ,
(16.74)

R(1) + R(2) I X (1) , X (2) ; Y (1) T


(16.75)
for some QT QX (1) |T QX (2) |T .
To prove this, we only need to realize that for the given IC, Satos outer
bound (Theorem 16.8) coincides with the inner bound of Theorem 16.9. Actually, strictly speaking, this is not true: in general, Satos outer bound is
larger than the achievable region (16.43)(16.45). However, in (16.34)(16.36)
(1) (2) (1) (2) as long as the marginals remain
we are allowed to choose Q
Y ,Y |X ,X
the same. If we choose it such that Y (1) = Y (2) deterministically, then obviously QY (1) |X (1) ,X (2) = QY (2) |X (1) ,X (2) . For this choice, in (16.36), Y (2) can be
dropped without change of mutual information, and in (16.45) both mutual
information terms are identical and are equal to (16.36).
Note that the capacity region of the symmetric IC is identical to the capacity region of the MAC with inputs X (1) , X (2) and output Y (1) . This means
that it is an optimal decoding strategy if both receiver decode both messages.
This is true in more cases than just this symmetric situation. In particular,
it is the case when the interference is stronger than the actual message signal
(see the following section!).

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


16.4. Strong and Very Strong Interference

16.4

359

Strong and Very Strong Interference

We say that an IC is suffering from strong interference if the cross link is


better than the direct link. More precisely, we have the following definition.
Definition 16.11 ([Car75], [Sat81]). A DM-IC is said to have strong interference if




I X (1) ; Y (1) X (2) I X (1) ; Y (2) X (2) ,




I X (2) ; Y (2) X (1) I X (2) ; Y (1) X (1)

(16.76)
(16.77)

for all QX (1) QX (2) .


If the cross link is so much better that even without the side-information
of the interfering message, the cross link is better than the direct link with
the knowledge of the interfering message, then we call the interference to be
very strong.
Definition 16.12 ([Car75], [Sat81]). A DM-IC is said to have very strong
interference if



I X (1) ; Y (1) X (2) I X (1) ; Y (2) ,



I X (2) ; Y (2) X (1) I X (2) ; Y (1)

(16.78)



I X (1) ; Y (2) I X (1) ; Y (2) , X (2)



= I X (1) ; X (2) + I X (1) ; Y (2) X (2)
|
{z
}
=0


= I X (1) ; Y (2) X (2) ,

(16.80)

(16.79)

for all QX (1) QX (2) .


Note that since

(16.81)
(16.82)

we see that any IC with very strong interference, implicitly also has strong
interference (therefore the naming!). However, the converse is not necessarily
true.
Example 16.13. Consider the IC with X (1) , X (2) {0, 1} and Y (1) , Y (2)
{0, 1, 2} where
Y (1) = Y (2) = X (1) + X (2)

(16.83)






I X (1) ; Y (1) X (2) = I X (1) ; Y (2) X (2) = H X (1) ,





I X (2) ; Y (2) X (1) = I X (2) ; Y (1) X (1) = H X (1) ,

(16.84)

(normal addition). Then

(16.85)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


360

The Interference Channel

i.e., the channel has strong interference. However, for any nontrivial input
distributions,




I X (1) ; Y (2) = H X (1) H X (1) Y (2)
(16.86)

(1)
<H X
(16.87)


(16.88)
= I X (1) ; Y (1) X (2)
and




I X (2) ; Y (1) = H X (2) H X (2) Y (1)

< H X (2)


= I X (2) ; Y (2) X (1) ,

(16.89)
(16.90)
(16.91)

i.e., the IC does not have very strong interference.

Theorem 16.14 (IC with Very Strong Interference [Ahl74],


[Sat81]).
The capacity region of an IC with very strong interference is given by all
rate pairs R(1) , R(2) satisfying



R(1) I X (1) ; Y (1) X (2) , T ,
(16.92)
(1) 
(2)
(2)
(2)
R I X ; Y X , T
(16.93)
for some QT QX (1) |T QX (2) |T .
Proof: Concerning achievability, note that the two terms in the third condition of Theorem 16.9 can be bounded as follows:




(16.94)
I X (1) , X (2) ; Y (1) T = I X (2) ; Y (1) T + I X (1) ; Y (1) X (2) , T




(1)
(1) (2)
(2)
(2) (1)
X , T , (16.95)
I X ;Y
X ,T + I X ;Y
where the inequality follows from the assumption of very strong interference
(16.79); and analogously





I X (1) , X (2) ; Y (2) T I X (1) ; Y (1) X (2) , T + I X (2) ; Y (2) X (1) , T . (16.96)
Therefore, we see that the first two conditions of Theorem 16.9 imply the
third.
Similarly, we see that the third condition of Satos bound (Theorem 16.8)
is

I X (1) , X (2) ; Y (1) , Y (2) T

I X (1) , X (2) ; Y (1) T
(16.97)




(2)
(2) (1)
(1)
(1) (2)
I X ;Y
X ,T + I X ;Y
X ,T ,
(16.98)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


16.4. Strong and Very Strong Interference

361

where the first inequality follows from dropping Y (2) and the second inequality
from (16.95). Hence, also here the first two conditions imply the third.
Note that the capacity region of an IC with very strong interference can
be achieved by successive cancellation: each decoder decodes the unwanted
message first (which it can do since the cross link is so much better than the
direct link) and then uses the knowledge of the unwanted message to decode
the wanted message.
Theorem 16.15 (IC with Strong Interference [CEG87]).
The capacity region
of an IC with strong interference is given by all rate

(1)
(2)
pairs R , R
satisfying



R(1) I X (1) ; Y (1) X (2) , T ,


(16.99)

(2)
(2)
(2) (1)
R I X ;Y
X ,T ,
(16.100)






R(1) + R(2) min I X (1) , X (2) ; Y (1) T , I X (1) , X (2) ; Y (2) T

(16.101)
for some QT QX (1) |T QX (2) |T . The auxiliary time-sharing random variable T can be restricted to take value in an alphabet T with |T | = 4.
In order to prove this, we need the following lemma.
Lemma 16.16. For an IC with strong interference, we have for an arbitrary
(X(1) , X(2) ) QX(1) QX(2) :




I X(1) ; Y(1) X(2) I X(1) ; Y(2) X(2) ,
(16.102)




(2)
(2) (1)
(2)
(1) (1)
I X ;Y X
I X ;Y X ,
(16.103)
where the vectors have an arbitrary length n 1.
Proof: We only prove (16.102). The other inequality follows accordingly.
The following proof is an adaptation1 from [CEG87].
Recall that by definition, an IC with strong interference satisfies




(16.104)
I X (1) ; Y (1) X (2) , U = u I X (1) ; Y (2) X (2) , U = u ,
for any QX (1) |U (|u) QX (2) |U (|u), where we have conditioned on U = u for
an arbitrary auxiliary RV U . Hence, by averaging over U , the IC with strong
interference also satisfies




I X (1) ; Y (1) X (2) , U I X (1) ; Y (2) X (2) , U ,
(16.105)


as long as U (
X (1) , X (2) (
Y (1) , Y (2) and X (1) (
U (
X (2)
form Markov chains.
1

In [CEG87] the authors claim to prove the lemma by induction. This is, however,
strictly speaking not true. The proof is rather a recursive derivation.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


362

The Interference Channel

In the following vectors are of length n, and k is some integer between 1


and n (where a term with an index 0 or n + 1 is understood to be omitted).
By repeatedly applying the chain rule for mutual information, we get



(1)
(1)
(2)
(2)
(1)
I X1 , . . . , Xk ; Y1 , . . . , Yk X(2) , Yk+1 , . . . , Yn(1)



(1)
(1)
(2)
(2) (2)
(1)
= I X1 , . . . , Xk ; Y1 , . . . , Yk1 X , Yk+1 , . . . , Yn(1)



(1)
(1)
(2)
(2)
(2)
(1)
+ I X1 , . . . , Xk ; Yk X(2) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(1) (16.106)



(1)
(1)
(1)
(2)
(2)
(1)
= I X1 , . . . , Xk , Yk ; Y1 , . . . , Yk1 X(2) , Yk+1 , . . . , Yn(1)



(1)
(2)
(2) (2)
(1)
(1)
(1)
(1)
I Yk ; Y1 , . . . , Yk1 X , X1 , . . . , Xk , Yk+1 , . . . , Yn



(1)
(1)
(2)
(2)
(2)
(1)
+ I X1 , . . . , Xk ; Yk X(2) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(1) (16.107)



(1)
(2)
(2)
(1)
= I Yk ; Y1 , . . . , Yk1 X(2) , Yk+1 , . . . , Yn(1)



(1)
(1)
(2)
(2)
(1)
+ I X1 , . . . , Xk1 ; Y1 , . . . , Yk1 X(2) , Yk , . . . , Yn(1)



(1)
(2)
(2)
(1)
(1)
(1)
+ I Xk ; Y1 , . . . , Yk1 X(2) , X1 , . . . , Xk1 , Yk , . . . , Yn(1)
{z
}
|
=0



(1)
(2)
(2)
(1)
(1)
(1)
I Yk ; Y1 , . . . , Yk1 X(2) , X1 , . . . , Xk , Yk+1 , . . . , Yn(1)



(1)
(2)
(2)
(2)
(1)
+ I Xk ; Yk X(2) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(1)



(1)
(1)
(2)
(1)
(2)
(2)
(1)
+ I X1 , . . . , Xk1 ; Yk X(2) , Xk , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(1)
{z
}
|
=0

(16.108)


(2) (1)
=I
X , Yk+1 , . . . , Yn(1)



(1)
(1)
(2)
(2) (2)
(1)
+ I X1 , . . . , Xk1 ; Y1 , . . . , Yk1 X , Yk , . . . , Yn(1)



(1)
(2)
(2)
(1)
(1)
(1)
I Yk ; Y1 , . . . , Yk1 X(2) , X1 , . . . , Xk , Yk+1 , . . . , Yn(1)



(1)
(2)
(2)
(2)
(1)
+ I Xk ; Yk X(2) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(1) ,
(16.109)


(1)
(2)
(2)
Yk ; Y1 , . . . , Yk1

where two terms are zero because the channel is memoryless, i.e., conditionally
(1)
(2)
(2)
(1)
on Xi and Xi (i = 1, . . . , k 1), Yi is independent of Xk .
Very similarly (but not exactly in the same way!), we also get



(1)
(1)
(1)
(1)
(1)
I X1 , . . . , Xk ; Y1 , . . . , Yk X(2) , Yk+1 , . . . , Yn(1)



(1)
(1)
(1)
(1)
= I X1 , . . . , Xk ; Yk X(2) , Yk+1 , . . . , Yn(1)



(1)
(1)
(1)
(1)
(1)
+ I X1 , . . . , Xk ; Y1 , . . . , Yk1 X(2) , Yk , . . . , Yn(1)
(16.110)



(1)
(1)
(2)
(2)
(1)
(1)
= I X1 , . . . , Xk , Y1 , . . . , Yk1 ; Yk X(2) , Yk+1 , . . . , Yn(1)



(2)
(2)
(1)
(1)
(1)
(1)
I Y1 , . . . , Yk1 ; Yk X(2) , X1 , . . . , Xk , Yk+1 , . . . , Yn(1)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


16.4. Strong and Very Strong Interference

363




(1)
(1)
(1)
(1)
(1)
+ I X1 , . . . , Xk ; Y1 , . . . , Yk1 X(2) , Yk , . . . , Yn(1)
(16.111)



(2)
(2)
(1)
(1)
= I Y1 , . . . , Yk1 ; Yk X(2) , Yk+1 , . . . , Yn(1)



(1)
(1)
(2)
(2)
(1)
+ I Xk ; Yk X(2) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(1)



(1)
(1)
(1)
(1)
(2)
(2)
(1)
+ I X1 , . . . , Xk1 ; Yk X(2) , Xk , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(1)
{z
}
|
=0



(2)
(2)
(1)
(1)
(1)
(1)
I Y1 , . . . , Yk1 ; Yk X(2) , X1 , . . . , Xk , Yk+1 , . . . , Yn(1)



(1)
(1)
(1)
(1)
(1)
+ I X1 , . . . , Xk1 ; Y1 , . . . , Yk1 X(2) , Yk , . . . , Yn(1)



(1)
(1)
(1)
(1)
(1)
(1)
+ I Xk ; Y1 , . . . , Yk1 X(2) , X1 , . . . , Xk1 , Yk , . . . , Yn(1) (16.112)
|
{z
}
=0



(2)
(2)
(1)
(1)
= I Y1 , . . . , Yk1 ; Yk X(2) , Yk+1 , . . . , Yn(1)



(1)
(1)
(2)
(2)
(1)
+ I Xk ; Yk X(2) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(1)



(2)
(2)
(1)
(1)
(1)
(1)
I Y1 , . . . , Yk1 ; Yk X(2) , X1 , . . . , Xk , Yk+1 , . . . , Yn(1)



(1)
(1)
(1)
(1)
(1)
+ I X1 , . . . , Xk1 ; Y1 , . . . , Yk1 X(2) , Yk , . . . , Yn(1) .
(16.113)
Hence, subtracting (16.113) from (16.109) and noting that the first and third
terms cancel, we get



(1)
(1)
(2)
(2)
(1)
I X1 , . . . , Xk ; Y1 , . . . , Yk X(2) , Yk+1 , . . . , Yn(1)



(1)
(1)
(1)
(1) (2)
(1)
I X1 , . . . , Xk ; Y1 , . . . , Yk X , Yk+1 , . . . , Yn(1)



(1)
(1)
(2)
(2)
(1)
= I X1 , . . . , Xk1 ; Y1 , . . . , Yk1 X(2) , Yk , . . . , Yn(1)



(1)
(1)
(1)
(1)
(1)
I X1 , . . . , Xk1 ; Y1 , . . . , Yk1 X(2) , Yk , . . . , Yn(1)



(1)
(2) (2)
(2)
(2)
(1)
(1)
+ I Xk ; Yk X , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn



(1)
(1)
(2)
(2)
(1)
I Xk ; Yk X(2) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(1)
(16.114)



(1)
(1)
(2)
(2)
(1)
= I X1 , . . . , Xk1 ; Y1 , . . . , Yk1 X(2) , Yk , . . . , Yn(1)



(1)
(1)
(1)
(1)
(1)
I X1 , . . . , Xk1 ; Y1 , . . . , Yk1 X(2) , Yk , . . . , Yn(1)






(1)
(2) (2)
(1)
(1) (2)
+ I Xk ; Yk Xk , Uk I Xk ; Yk Xk , Uk ,
(16.115)
where we have defined


(2)
(2)
(2)
(2)
(2)
(1)
Uk , X1 , . . . , Xk1 , Xk+1 , . . . , Xn(2) , Y1 , . . . , Yk1 , Yk+1 , . . . , Yn(1) .
(16.116)
By recursively using (16.115) starting with k = n and going backwards until

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


364

The Interference Channel

k = 1, we finally obtain




I X(1) ; Y(2) X(2) I X(1) ; Y(1) X(2)
n  





X
(1)
(2) (2)
(1)
(1) (2)
=
I Xk ; Yk Xk , Uk I Xk ; Yk Xk , Uk .

(16.117)

k=1

Now note that






(1)
(2)
(1)
(2)
Uk (
Xk , Xk
(
Yk , Yk

(16.118)

and
(1)

(2)

Xk (
Uk (
Xk

(16.119)

form Markov chains such that (16.105) can be applied to each summand of
the sum in (16.117). This proves the lemma.
Proof of Theorem 16.15: The achievability follows directly from Theorem 16.9. For the converse, we take over the derivation of the two first inequalities in Satos outer bound (i.e., in the converse of the MAC in Section 10.4.3).
For the third inequality we adapt Satos proof as follows:

n R(1) + R(2)


= H M (1) + H M (2)
(16.120)






(2)
(2)
(2) (2)
(1)
(1)
(1) (1)
(16.121)
+ I M ;Y
+H M Y
= I M ;Y
+H M Y


(1)
(1)
(2)
(2)
(1)
(2)
I M ;Y
+ I M ;Y
+ nn + nn
(16.122)


(1)
(1)
(2)
(2)
(1)
(2)
I X ;Y
+ I X ;Y
+ nn + nn
(16.123)


(1)
(1)
(2)
(2)
(2)
(1)
(2)
I X ;Y ,X
+ I X ;Y
+ nn + nn
(16.124)



(2)
(2)
(1)
(2)
(1)
(1) (2)
+ I X ;Y
+ nn + nn
(16.125)
= I X ;Y X



(2)
(2)
(1)
(2)
(1)
(2) (2)
+ I X ;Y
+ nn + nn
(16.126)
I X ;Y X

(1)
(2)
(2)
(1)
(2)
= I X ,X ;Y
+ nn + nn ,
(16.127)
where the last inequality (16.126) follows from Lemma 16.16. The reminder
of the proof is identical to (10.82)(10.92).

This proves R(1) + R(2) I X (1) , X (2) ; Y (2) T . The bound R(1) + R(2)

I X (1) , X (2) ; Y (1) T follows accordingly.
The bound on the alphabet size of T follows from the FenchelEggleston
strengthening of Caratheodorys Theorem (Theorem 1.22).

16.5

HanKobayashi Region

So far we have seen the coding strategy where both receivers decode both
messages. Quite intuitively, this turns out to be optimal in the situation of
strong interference. However, in general this is not optimal and there are
quite a few other natural coding strategies like, e.g., treating the interference
as noise or using some kind of orthogonal transmission like TDMA or FDMA.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


16.5. HanKobayashi Region

365

The random coding strategy that includes all mentioned strategies as special cases and that yields the best known achievable region to date is called
HanKobayashi coding scheme. It contains a fundamentally new aspect of
random coding that we have not seen so far: rate splitting. The basic idea
here is to be more flexible with respect to what part of a message is private
(i.e., it will only be decoded by the intended receiver) and what is public
in the sense that also the unintended receiver will decode it in order to help
with the decoding of the wanted message.
Related to rate splitting is the concept of nonunique decoding: so far
the receiver always tried to correctly decode all those messages that it was
interested in. However, in an IC, we might also try to decode the unwanted
message as it might help with the decoding of the wanted message. The
receiver, however, does not care whether the decoding of the unwanted message
turns out to be successful or not as long as it helps with the wanted message!
Hence, the unwanted message must not be uniquely decoded in the end.

16.5.1

Superposition Coding with Rate Splitting

1: Setup: For both i {1, 2}, we split M (i) up into two independent parts
M 0(i) (the public part) and M 00(i) (the private part) where M 0(i) and
M 00(i) have the rates R0(i) and R00(i) , respectively, and where R0(i) + R00(i) =
R(i) . The idea is that decoder 1 decodes M (1) = M 0(1) , M 00(1) and M 0(2) ,

and that decoder 2 decodes M (2) = M 0(2) , M 00(2) and M 0(1) .
Moreover, we choose some QU (1) , QU (2) , QX (1) |U (1) , QX (2) |U (2) , and some
blocklength n.
0(i)

2: Codebook Design: For both i {1, 2}, we generate enR indepen


0(i)
dent length-n codewords U(i) m0(i) QnU (i) , m0(i) = 1, . . . , enR . For

00(i)
independent length-n codewords
each U(i) m0(i) , we generate enR
(i) 0(i) 

00(i)
(i)
0(i)
00(i)
n

X m ,m
QX (i) |U (i) U (m ) , m00(i) = 1, . . . , enR . We
reveal all four codebooks to encoders and decoders.

Note that U(i) m0(i) represents the cloud center of the m0(i) th cloud,

and X(i) m0(i) , m00(i) is the m00(i) th codeword of the m0(i) th cloud.

3: Encoder Design: To send message m(i) = m0(i) , m00 (i) , the encoder

(i) transmits the codeword X(i) m0(i) , m00(i) , i {1, 2}.
4: Decoder Design: For both i {1, 2},
upon receiving Y(i) , decoder (i)

looks for a triple m
0(i) , m
00(i) , m
0(3i) such that





U(i) m
0(i) , X(i) m
0(i) , m
00(i) , U(3i) m
0(3i) , Y(i)

A(n)
QU (i) ,X (i) ,U (3i) ,Y (i) .
(16.128)

If one or more such triple is found,
the decoder (i) chooses one

 at random
0(i)
00(i)
0(3i)
0(i)
00(i)
and calls it m
,m

,m

, and puts out m


,m

. Otherwise
it declares an error.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


366

The Interference Channel


Note that here we do not require unique decoding: if the decoder finds
more than one solution it simply picks one. In the analysis below, we
will count it as an error if there are more than one possible solution for
m
0(i) , m
00(i) , however, we will not care about several possible solutions
to m
00(i) as we are not interested in this message anyway.

5: Performance Analysis: The performance analysis is quite similar to


the degraded broadcast channel. We start with the first decoder. Since
we have three messages, there are (besides the correct event) 23 1 = 7
possible error events. However, we ignore one of them: the event

 0(2)
 00(1)
 0(1)

6= M 0(2)
(16.129)
= M 00(1) M
M
= M 0(1) M
is not considered an error since we do not care if the first decoder cannot
decode the unwanted message. This leaves us with 6 possible error events.
As we have done such error bounding so many times before, we take the
liberty of using a slightly more sloppy notation. We first mention that
the probability that the correct codeword is not jointly typical with the
cloud centers and the received sequence is very small:
h

i
t .
Pr U(1) (right), X(1) (right, right), U(2) (right), Y(1) A(n)

(16.130)

Hence, we only need to check whether there exists a combination of partially wrong codewords that accidentally look jointly typical:

Pr M 0(1) in error
h
X
=
Pr U(1) (wrong), X(1) (wrong, right),
wrong m0(1)


i
(16.131)
U(2) (right), Y(1) A(n)
X


QU (1) ,X (1) u(1) , x(1) QU (2) ,Y (1) u(2) , y(1)

wrong m0(1) (u(1) ,x(1) ,u(2) ,


(n)

y(1) )A

(16.132)
nR0(1)

n(H(U (1) ,X (1) ,U (2) ,Y (1) )H(U (1) ,X (1) )H(U (2) ,Y (1) )+)

(16.133)

n(R0(1) I(U (1) ,X (1) ;U (2) ,Y (1) )+)

(16.134)

n(R0(1) I(U (1) ,X (1) ;U (2) )I(U (1) ,X (1) ;Y (1) |U (2) )+)

(16.135)

n(R0(1) I(U (1) ,X (1) ;Y (1) |U (2) )+)

(16.136)

=e

=e
=e

0(1)

= en(R
0(1)
(1)
(1)
(2)
= en(R I(X ;Y |U )+) .

I(X (1) ;Y (1) |U (2) )I(U (1) ;Y (1) |U (2) ,X (1) )+

(16.137)

(16.138)


Here, in (16.136) we have used that U (1) , X (1)
U (2) , X (2) , and in
(16.138) we use the Markov structure of U (1) (
X (1) (
Y (1) (which
(2)
holds irrespective of whether we condition on U or not).

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


16.5. HanKobayashi Region

367

We continue with the second event:



Pr M 00(1) in error
h
X
=
Pr U(1) (right), X(1) (right, wrong),
wrong m00(1)


i
U(2) (right), Y(1) A(n)

X

QU (1) ,U (2) ,Y (1) u(1) , u(2) , y(1)

(16.139)

wrong m00(1) (u(1) ,x(1) ,u(2) ,


(n)

y(1) )A



QX (1) |U (1) x(1) u(1)

(16.140)

enR
en(
00(1)
(1)
(2)
(1)
(1)
= en(R I(X ;U ,Y |U )+)
00(1)

00(1)

) (16.141)

H(U (1) ,X (1) ,U (2) ,Y (1) )H(U (1) ,U (2) ,Y (1) )H(X (1) |U (1) )+

(16.142)

= en(R
00(1)
(1)
(1)
(1)
(2)
= en(R I(X ;Y |U ,U )+) .

I(X (1) ;U (2) |U (1) )I(X (1) ;Y (1) |U (1) ,U (2) )+

(16.143)
(16.144)

The third event yields:



Pr M 0(1) and M 00(1) in error
h
X
=
Pr U(1) (wrong), X(1) (wrong, wrong),
wrong m0(1) ,m00(1)


i
U(2) (right), Y(1) A(n)
X

QU (1) ,X (1) u(1) , x(1)

(16.145)

wrong m0(1) ,m00(1) (u(1) ,x(1) ,u(2) ,


(n)

y(1) )A

en(R

0(1)

= en(R

0(1)

QU (2) ,Y (1) u(2) , y(1)

+R

00(1)

) en(

+R

00(1)

I(X (1) ;Y (1) |U (2) )+

(16.146)
)

H(U (1) ,X (1) ,U (2) ,Y (1) )H(U (1) ,X (1) )H(U (2) ,Y (1) )+

(16.147)
),

(16.148)

identically to the first event (16.138). Turning to the fourth event we


get:

Pr M 0(1) and M 0(2) in error
h
X
=
Pr U(1) (wrong), X(1) (wrong, right),
wrong m0(1) ,m0(2)


i
U(2) (wrong), Y(1) A(n)

X

(1)
QU (1) ,X (1) u , x(1)

(16.149)

wrong m0(1) ,m0(2) (u(1) ,x(1) ,u(2) ,


(n)

y(1) )A



QU (2) u(2) QY (1) y(1)

(16.150)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


368

The Interference Channel


0(1)

0(2)

en(R +R )
(1)
(1)
(2)
(1)
(1)
(1)
(2)
(1)
en(H(U ,X ,U ,Y )H(U ,X )H(U )H(Y )+)

(16.151)

n(R0(1) +R0(2) +H(U (2) ,Y (1) |U (1) ,X (1) )H(U (2) |U (1) ,X (1) )H(Y (1) )+)

(16.152)

=e

n(R0(1) +R0(2) I(U (1) ,U (2) ,X (1) ;Y (1) )+)

(16.153)

=e

n(R0(1) +R0(2) I(U (2) ,X (1) ;Y (1) )+)

=e

(16.154)

where in (16.152) we have used the independence of U (2) and U (1) , X (1) ,
and in (16.154) the Markovity of U (1) (
X (1) (
Y (1) .
The fifth event is as follows:


Pr M 00(1) and M 0(2) in error
h
X
=
Pr U(1) (right), X(1) (right, wrong),
wrong m00(1) ,m0(2)


i
U(2) (wrong), Y(1) A(n)
(16.155)
X


QU (1) ,Y (1) u(1) , y(1) QU (2) u(2)

wrong m00(1) ,m0(2) (u(1) ,x(1) ,u(2) ,


(n)

y(1) )A

00(1)



QX (1) |U (1) x(1) u(1)

0(2)

(16.156)

en(R +R )
(1)
(1)
(2)
(1)
(1)
(1)
(2)
(1)
(1)
en(H(U ,X ,U ,Y )H(U ,Y )H(U )H(X |U )+) (16.157)
00(1)
0(2)
(1)
(2)
(1)
(1)
(2)
(1)
(1)
= en(R +R +H(X ,U |U ,Y )H(U )H(X |U )+)
(16.158)
= en(R

00(1)

+R0(2) +H(X (1) ,U (2) |U (1) ,Y (1) )H(U (2) |U (1) ,X (1) )H(X (1) |U (1) )+)

(16.159)
=e

n(R00(1) +R0(2) +H(X (1) ,U (2) |U (1) ,Y (1) )H(X (1) ,U (2) |U (1) )+)

(16.160)

=e

n(R00(1) +R0(2) I(X (1) ,U (2) ;Y (1) |U (1) )+)

(16.161)

Finally, the sixth and last event gives



Pr M 0(1) , M 00(1) , M 0(2) all in error
h
X
=
Pr U(1) (wrong), X(1) (wrong, wrong),
wrong
m0(1) ,m00(1) ,m0(2)


i
U(2) (wrong), Y(1) A(n)
X

QU (1) ,X (1) u(1) , x(1)

(16.162)

wrong
(u(1) ,x(1) ,u(2) ,
m0(1) ,m00(1) ,m0(2)
(n)
y(1) )A



QU (2) u(2) QY (1) y(1)

n(R0(1) +R00(1) +R0(2) )

en(H(U ,X ,U ,Y )H(U ,X )H(U


0(1)
00(1)
0(2)
(2)
(1)
(1)
= en(R +R +R I(U ,X ;Y )+) ,
(1)

(1)

(2)

(1)

(1)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


(1)

(2) )H(Y (1) )+

(16.163)
(16.164)
(16.165)

16.5. HanKobayashi Region

369

identically to the fourth case (16.154).


Combined this gives us the following six bounds for reliable communication at decoder 1:


0(1)
(1)
(1) (2)
U
,
(16.166)
R
<
I
X
;
Y

00(1)
(1)
(1) (1)
(2)

R
< I X ;Y
U ,U
,
(16.167)

0(1)
00(1)
(1)
(1) (2)

R +R
< I X ;Y
U
,
(16.168)

0(1)
0(2)
(1)
(2)
(1)

R +R
< I X ,U ;Y
,
(16.169)

(1) 

00(1)
0(2)
(1)
(2)
(1)

,
(16.170)
R
+R
< I X , U ; Y U

0(1)


00(1)
0(2)
(1)
(2)
(1)
R +R
+R
< I X ,U ;Y
.
(16.171)
Note that (16.166) and (16.169) are redundant, which leaves us with
four bounds. We combine them with the corresponding four bounds of
decoder 2 and replace R00(i) by R(i) R0(i) :

(16.172)
R(1) R0(1) < I X (1) ; Y (1) U (1) , U (2) ,

(1) (2) 

(2)
0(2)
(2)
(2)

,
(16.173)
R R
< I X ; Y U , U

R(1) < I X (1) ; Y (1) U (2) ,


(16.174)

(2)
(2)
(2) (1)

U
,
(16.175)
R < I X ;Y
(1) 
(1)
0(1)
0(2)
(1)
(2)
(1)

U
,
(16.176)
R R +R
< I X ,U ;Y

(2) 

(2)
0(2)
0(1)
(2)
(1)
(2)

,
(16.177)
< I X , U ; Y U

R R + R

R(1) + R0(2) < I X (1) , U (2) ; Y (1) ,


(16.178)

(2)
0(1)
(2)
(1)
(2)

R +R
< I X ,U ;Y
.
(16.179)
This finishes the first stage of the proof.

16.5.2

FourierMotzkin Elimination

We next apply FourierMotzkin elimination (see Section 1.3) to get rid of the
unwanted rates R0(i) . We rewrite (16.172)(16.179) as follows:



I1
I X (1) ; Y (1) U (1) , U (2)
1
0 1 0



I X (2) ; Y (2) U (1) , U (2) I2

0
1
0
1


I X (1) ; Y (1) U (2)  I
1
0
0
0


(1)  3

(2)
(2)

0
I4
U
1
0
0

I X ;Y



1

(1)
(2)
(1)
(1)
0 1 1 R(1)
I5
I X , U ; Y U



(2)

(2)
(1)
(2) U (2)

0
1
1 1
R I X , U ; Y

, I6 (16.180)

I
0(1)
1
(1)
(2)
(1)
0
0
1
R I X , U ; Y

7


0(2)

(2)
(1)
(2)
I X ,U ;Y
0
I8
1
1
0

1 0
0

0
0
0

0 1 0
0
0

0
0
1
0
0

0
0
0 1
0
0

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


370

The Interference Channel

where we also have added four nonnegativity constraints.


Now we start by removing R0(2) :

1
0 1
I1

1
I2 + I5
1 1

I2 + I7
1
0

1
I3
0
0

0
I4
1
0

1
I5 + I6
1
0 R(1)

(2)
1

0 1

R I5 .

1
1 R0(1)
1
I6 + I7

1
I7
0
0

1
1
0
I8

1 0
0
0

0 1 0
0
0
0
0 1

(16.181)

Note that I1 I5 , i.e., the first bound implies the 7th, and similarly that
I3 I7 , i.e., the 4th bound implies the 9th. Hence, we remove the 7th and
9th bound and continue to eliminate R0(1) :

2
1
I1 + I6 + I7

1
I1 + I8

2
I + I + I + I
2

5
6
7

1
I2 + I5 + I8
2

1
I
+
I

"

2
7

(1) #

0
I3

(2)
1
R
I
4

1
I
+
I
5
6

1
I6 + I7

1
I
8

1 0

0 1
0

(16.182)

Note that the third bound is the sum of the 5th and 9th; that I4 I8 and therefore the 7th bound implies the 10th; and that the 5th implies the 9th bound
because I2 I6 . So we remove the third, the 9th, and the 10th bound. Moreover, we also omit the obvious nonnegativity constraints (last two bounds)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


16.5. HanKobayashi Region

371

and finally get:

0
1

16.5.3

1
I1 + I6 + I7

I1 + I8
1

# I2 + I5 + I8
"
2

R(1)

1 (2) I2 + I7 .

I3
0

I4
1

I5 + I6
1

(16.183)

Best Known Achievable Rate Region

This almost proves the HanKobayashi region. The only remaining part is
time-sharing. As mentioned before at the end of Section 10.5.2, the convex hull of the region defined by (16.183) might be smaller than if we perform an additional coded time-sharing operation. Hence, in the code generation we actually should first create
a random sequence T QnT () and


then create the sequences U(i) m0(i) QU (i) |T (|T) and X(i) m0(i) , m00(i)


QX (i) |U (i) ,T U(i) (m0(i) ), T . All expressions involving a typical set must be
adapted to include the T. Then we would have to go through the whole derivation again and would realize that we get the same expressions as in (16.183)
apart from the fact that all are conditioned on T .
Theorem 16.17 (HanKobayashi Achievable Rate Region [HK81],
[CMGEG08]).
For a general DM-IC,
 an achievable rate region is given by all nonnegative
(1)
(2)
rate pairs R , R
satisfying



(16.184)
R(1) I X (1) ; Y (1) U (2) , T ,

(1) 

(2)
(2)
(2)

R I X ; Y U , T ,
(16.185)






R(1) + R(2) I X (1) , U (2) ; Y (1) T + I X (2) ; Y (2) U (1) , U (2) , T ,

(16.186)


(1) (2) 

(1)
(2)
(2)
(1)
(2)
(1)
(1)

T + I X ;Y
U ,U ,T ,
R + R I X ,U ;Y

(16.187)
(1) 
(1)
(2)
(1)
(2)
(1)

R + R I X , U ; Y U , T

(2)
(1)
(2) (2)

+
I
X
,
U
;
Y
U
,
T
,
(16.188)





(1)
(2)
(1)
(2)
(1)
(1)
(1) (1)
(2)

2R + R I X , U ; Y
T + I X ;Y
U ,U ,T

+ I X (2) , U (1) ; Y (2) U (2) , T ,


(16.189)


(1) (2) 

(1)
(2)
(2)
(1)
(2)
(2)
(2)

R + 2R I X , U ; Y T + I X ; Y U , U , T


+ I X (1) , U (2) ; Y (1) U (1) , T
(16.190)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


372

The Interference Channel

for some
(1) (2) (1) (2)
QT QU (1) |T QX (1) |U (1) ,T QU (2) |T QX (2) |U (2) ,T Q
Y ,Y |X ,X

(16.191)

(1) (2) (1) (2) must have the same marginals (16.17), (16.18)
where Q
Y ,Y |X ,X
as the given IC. The auxiliary random variables
to take

(i) can be(i)restricted




X
value in alphabets of size |T | 7 and U
+ 4, i = 1, 2,
respectively.
Proof: The only part that remains to be proven are the bounds on the
alphabet sizes of the auxiliary random variables. First, note that the bound on
|T | is a straightforward consequence of the FenchelEggleston strengthening
of Caratheodorys Theorem (Theorem 1.22): we have seven bounds with all
terms in the bounds being conditional on T . Hence, the rate region can be
described by a linear combination of vectors with seven components.
So, we restrict T to size 7, fix some distribution QT and condition everything on T = t for a fixed t. We turn to U (1) : For given distributions QX (1) |U (1) ,T (|, t), QU (2) |T (|t), QX (2) |U (2) ,T (|, t), QY (1) |X (1) ,X (2) , and
QY (2) |X (1) ,X (2) , we define a vector vu(1) with the following |X (1) | + 4 components:


(1)
vu(1) , I X (2) ; Y (2) U (1) = u(1) , T = t ,
(16.192)


(2)
vu(1) , I X (2) ; Y (2) U (1) = u(1) , U (2) , T = t ,
(16.193)


(3)
vu(1) , H Y (2) U (1) = u(1) , T = t


(16.194)
+ I X (1) ; Y (1) U (1) = u(1) , U (2) , T = t ,


(4)
vu(1) , H Y (2) U (1) = u(1) , U (2) , T = t


(16.195)
+ I X (1) ; Y (1) U (1) = u(1) , U (2) , T = t ,


(5)
(2)
(1) (1)
(1)
vu(1) , I U ; Y
U = u ,T = t


+ I X (2) ; Y (2) U (1) = u(1) , U (2) , T = t ,
(16.196)
(1) 
(6)
vu(1) , QX (1) |U (1) ,T 1 u , t ,
(16.197)
..
.





(|X (1) |+4)
vu(1)
, QX (1) |U (1) ,T X (1) 1 u(1) , t .
(16.198)
It can now be checked that, for any choice of QU (1) |T , all terms on the RHS
of (16.184)(16.190) when conditioned on T = t are given by the components
of a v that is defined as a linear combination of vu(1) :
X

v,
QU (1) |T u(1) t vu(1) .
(16.199)
u(1) U (1)

Hence, since QT is fixed, also the HanKobayashi region is fixed by v. By the


FenchelEggleston strengthening of Caratheodorys Theorem (Theorem 1.22)
it now follows that we can restrict |U (1) | to at most |X (1) | + 4.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


16.6. Gaussian IC

373

The proof for |U (2) | is accordingly.

16.6

Gaussian IC

Once again we will also discuss the Gaussian case even though strictly speaking our proofs do not directly generalize to continuous channel models. The
Gaussian IC is particularly illustrative because one can very nicely demonstrate the different coding strategies.

16.6.1

Channel Model

The most general model for a Gaussian interference channel looks as follows:
(
0
0
0
0
Y (1) = c11 x(1) + c12 x(2) + Z (1) ,
(16.200)
(2)0
(1)0
(2)0
(2)0
Y
= c21 x
+ c22 x
+Z ,
(16.201)
where c11 , c12 , c21 , c22 are fixed constants, where the noise is jointly Gaussian
0

Z (1) , Z (2)

T

N (0, KZ0Z0 )

(16.202)

with some arbitrary covariance matrix KZ0Z0 :


KZ0Z0 =

2
(1)

(12)

(12)

2
(2)

!
,

(16.203)

and where we have an average-power constraint on both inputs


h
i
0 2
0
E X (i)
E(i) , i = 1, 2.

(16.204)

However, without loss of generality, we can reformulate this model as follows:


We divide (16.200) and (16.201) by (1) and (2) , respectively, and define
c11 (1)0
x ,
(1)
c22 (2)0
x ,
,
(2)

1 (1)0
Y
,
(1)
2 (2)0
,
Y
,
(2)

1 (1)0
Z ,
(1)
2 (2)0
,
Z .
(2)

x(1) ,

Y (1) ,

Z (1) ,

(16.205)

x(2)

Y (2)

Z (2)

(16.206)

Then we get

c12 (2) (2)


(1)

= x(1) +

x + Z (1) ,
Y
(1) c22

Y (2) = c21 (1) x(1) + x(2) + Z (2) .

(2) c11

(16.207)
(16.208)

Hence, we have found the so-called standard form of the Gaussian IC.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


374

The Interference Channel

Definition 16.18. The standard Gaussian interference channel is given by


(

(16.209)
Y (1) = x(1) + a12 x(2) + Z (1) ,

(2)
(1)
(2)
(2)
Y
= a21 x + x + Z ,
(16.210)
where a12 and a21 are fixed nonnegative constants, where the noise is jointly
Gaussian
!!

1

T
Z (1) , Z (2) N 0,
, [1, 1],
(16.211)
1
and where both inputs are subject to an average-power constraint
h
2 i
E X (i)
E(i) , i = 1, 2.

(16.212)

Note that due to Theorem 16.6 the capacity region of the Gaussian IC
does not depend on .

16.6.2

Outer Bound

We start by adapting Satos outer bound (Theorem 16.8) to the Gaussian IC.
To that goal note that






(16.213)
I X (1) ; Y (1) X (2) , T = h Y (1) X (2) , T h Y (1) X (1) , X (2) , T


(1)
(1)
(1)
(16.214)
=h X +Z T h Z

1
1
log 2e E(1) + 1 log 2e
(16.215)
2
2

1
(16.216)
= log 1 + E(1) ,
2
where the upper bound can be achieved if the input is chosen to be zero-mean
Gaussian with variance E(1) . The second bound is accordingly

 1

I X (2) ; Y (2) X (1) , T log 1 + E(2) .
2

(16.217)

For the third bound, we get



I X (1) , X (2) ; Y (1) , Y (2) T
= h(Y|T ) h(Z)
1
1
log(2e)2 det KYY log(2e)2 det KZZ ,
2
2

(16.218)
(16.219)

where again the inequality can be achieved with equality if the input
 is chosen
to be zero-mean Gaussian with covariance matrix diag E(1) , E(2) . Note that
!
1
det KZZ = det
= 1 2
(16.220)
1

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


16.6. Gaussian IC

375

and
() , det KYY
(1)

(2)

(1)

(16.221)
(2)

+ a12 E + 1
a21 E + a12 E

(2)
a21 E + a12 E +
a21 E(1) + E(2) + 1


= E(1) 1 + a21 2 a21 + E(2) 1 + a12 2 a12
2
+ E(1) E(2) 1 a12 a21 + 1 2 ,

= det

(1)

!
+

(16.222)

(16.223)

i.e.,
 1
()
.
I X (1) , X (2) ; Y (1) , Y (2) T log
2
1 2

(16.224)

Note that since the capacity region does not depend on , we can minimize
(16.224) over .
Theorem 16.19 (Satos Outer
Bound [Sat77]). For a Gaussian IC, any

(1)
(2)
achievable rate pair R , R
must satisfy


1

R(1) log 1 + E(1) ,


(16.225)


1
(16.226)
R(2) log 1 + E(2) ,

R(1) + R(2) min 1 log () ,


(16.227)
11 2
1 2
with () defined in (16.223).

16.6.3

Basic Communication Strategies

As we have already discussed in Section 16.5, there are three basic communication strategies. All three strategies are special cases of the HanKobayashi
achievable region (Theorem 16.17).
Treating Interference as Noise
If the interference is only very weak (i.e., a12 , a21  1), then it is quite natural
to ignore the structure of the interfering signal and treat it as equivalent to
noise. Basically, the Gaussian IC is then transformed into two parallel additive
noise channels with an achievable rate region
(
)
[
 0 R(1) I X (1) ; Y (1)
(1)
(2)
 .
R1 =
R ,R
:
(16.228)
0 R(2) I X (2) ; Y (2)
(1) (2)
X

E[(X (1) )2 ]E(1)


E[(X (2) )2 ]E(2)

We actually could include a time-sharing random variable to make sure that


the region is convex, however, if the region happens not to be convex, then it

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


376

The Interference Channel

will be a very bad bound anyway. Note that this bound can be proven using
the HanKobayashi achievable region (Theorem 16.17) with T , U (1) , and U (2)
chosen to be deterministic.
Unfortunately, it is not clear what the optimal input distribution is. Note
that choosing a Gaussian input is good for the direct link, but is hurting the
other receiver most via the interference (Gaussian noise is the worst noise!).
However, if we do choose Gaussian inputs, the region looks as follows:



(1) E(1)

(1)

0 R log 1 +

(2)
[
(2)

2
a12 E + 1
(1)
(2)
R1,G =
R ,R
:
.

(2) E(2)

(1)

0 1

0 R(2) log 1 +
(1)
(2)
(1)
2
0 1
a21 E + 1
(16.229)
Note that (i) allows us to adapt the power of each user, thereby allowing to
reduce the interference at the cost of the direct link.
Orthogonal Coding
In the situation of a moderate interference, it might be a good strategy to try to
avoid interference by means of an orthogonal coding scheme. For example, one
can use TDMA to make sure that only one user is accessing the channel at one
time or FDMA to separate the users by means of using different frequencies.
Obviously, in such a scheme, we have two independent Gaussian channels
with the optimal input being Gaussian. We use 0 1 as the time-sharing
parameter and get the following achievable region:



E(1)

(1)

0 R log 1 +

[

2

(1)
(2)
R2 =
.
(16.230)
R ,R
:



1
E(2)

(2)
0 1

0R
log 1 +
2
1
We remind the reader that by diving the available power E(1) by , we make
sure that we meet the power constraint exactly (the first user only transmits
a fraction , hence it can use more power during this transmission and still
achieve the average-power constraint). The same comment applies to E(2) that
is divided by 1 .
Decoding and Canceling Interference
Finally, if the interference is much stronger than the direct link, then it is
quite obvious that one should decode the unwanted message first and use this
knowledge to eliminate the interference from the received signal before one
decodes the wanted message. As a matter of fact, as we have seen before, for
strong and very strong interference, this strategy is optimal.
Once the interference has successfully been removed from the channel, we
end up with two Gaussian MAC channels yielding an achievable rate region

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


16.6. Gaussian IC

377

according to (16.40)(16.42). However, as we have seen in Section 16.3, we


can enlarge this region when realizing that we do not care about the error
event when a decoder wrongly decodes the unwanted message. This results in
the achievable region of Theorem 16.9, which is again maximized for Gaussian
inputs:


1

(1)
(1)

0 R log 1 + E ,

(2)
(2)

0

log
1
+
E
,
(1)
(2)
2
R3 =
R ,R
:
. (16.231)

R(1) + R(2) log 1 + min E(1) + a12 E(2) ,

(1)
(2)

a E +E
21

Comparison
In Figure 16.6, all three communication strategies and Satos outer bound are
depicted for a Gaussian IC with E(1) = E(2) = 7 and different values of a12 and
a21 . It can be clearly seen that, depending on the strength of the interference,
different schemes are best.

16.6.4

Strong and Very Strong Interference

The concepts of strong and very strong interference also works in the case of
the Gaussian IC. Indeed, they are very intuitive in this context: For example,
consider the case where a12 1 and assume we have a reliable coding scheme
for some given R(1) , R(2) . So, receiver 1 can reconstruct X(1) reliably and
can therefore compute
0

Y(1) X(1)

a12

Z(1)
= a21 X(1) + X(2) + .
a12

Y(2) ,

a21 X(1) +

(16.232)
(16.233)

This is very similar to the received word of receiver 2: Only the noise is
different and actually has a smaller variance. Hence, since receiver is able to
reliably reconstruct X(2) from Y(2) , then this means that receiver 1 is also
0
able to reconstruct X(2) from Y(2) .
So, we understand that if both a12 1 and a21 1, in a reliable system
both receiver can always reconstruct both messages.
After this short discussion and this insight, we now quickly summarize the
results for strong and very strong interference of the Gaussian IC.
Definition 16.20 ([Car75], [Sat81]). A Gaussian IC is said to have strong
interference if
a12 1,

a21 1.

(16.234)

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


378

The Interference Channel


a12 = 0.15, a21 = 0.05

1.6

1.4

1.4

a)
b)
c)

0.8

a)
b)
c)

1.2

R(2)

1.2

R(2)

a12 = 0.35, a21 = 0.25

1.6

0.8

0.6

0.6

0.4

0.4

0.2

0.2

0
0

0.2

0.4

0.6

0.8

1.2

1.4

1.6

0.2

0.4

0.6

R(1)
a12 = 0.55, a21 = 0.45

1.4

1.4

1.2

1.2

b)
c)

0.8

1.2

1.4

1.6

a)

a12 = 0.85, a21 = 0.75

1.6

R(2)

R(2)

1.6

0.8

R(1)

b) c)

0.8

0.6

0.6

0.4

0.4

0.2

0.2

a)

0
0

0.2

0.4

0.6

0.8

1.2

1.4

1.6

0.2

0.4

0.6

R(1)
a12 = 1.15, a21 = 1.15

1.2

1.4

1.6

1.4

1.4

1.2

1.2

c)

b)

0.8

a12 = 2.15, a21 = 2.15

1.6

R(2)

R(2)

1.6

0.8

R(1)

0.6

c)
b)

0.8

0.6

a)

0.4

0.4

0.2

a)

0.2

0
0

0.2

0.4

0.6

0.8

1.2

1.4

1.6

R(1)

0.2

0.4

0.6

0.8

1.2

1.4

1.6

R(1)

Figure 16.6: Achievable regions and outer bound for the Gaussian IC with
E(1) = E(2) = 7 for different values of the cross talk coefficients. The black dotted curve denotes Satos outer bound
(Theorem 16.19); in the red curve a) interference is treated as
noise (where time-sharing is neglected on purpose, and we assume Gaussian inputs), see (16.229); blue curve b) is orthogonal
coding, see (16.230); and for the black curve c) the interference
is canceled, see (16.231). The units are bits.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


16.6. Gaussian IC

379

If in addition also holds that


 1

1
log 1 + E(1) + log 1 + E(2)
2
2



1
log 1 + min a21 E(1) + E(2) , E(1) + a12 E(2) ,
2

(16.235)

then the Gaussian IC is said to have very strong interference.


We would like to point out that these definitions match with the definitions
given in Section 16.4. Indeed, assuming Gaussian inputs of maximal variance
(which turn out to be optimal for these values of a12 and a21 irrespectively of
E(1) and E(2) ), we have from (16.76)

 1

I X (1) ; Y (1) X (2) = log 1 + E(1)
2

!

I X (1) ; Y (2) X (2)

1
= log 1 + a21 E(1) ,
2

(16.236)
(16.237)
(16.238)

which is equivalent to a21 1.


Also, from (16.78), we get


 1
I X (1) ; Y (1) X (2) = log 1 + E(1)
2
!

I X (1) ; Y (2)


1
a21 E(1)
= log 1 + (2)
2
E +1
 1

1
(2)
= log E + 1 + log E(2) + 1 + a21 E(1) ,
2
2

(16.239)
(16.240)
(16.241)
(16.242)

i.e.,
 1
 1

1
log 1 + E(1) + log 1 + E(2) log 1 + a21 E(1) + E(2) .
2
2
2

(16.243)

In combination with the equivalent expression stemming from (16.79), this


corresponds to (16.235).
Corollary 16.21. A sufficient (but not necessary) condition for very strong
interference of a Gaussian IC is
a12 E(1) + 1,

a21 E(2) + 1.

(16.244)

Proof: Suppose that (16.244) holds. Then, (16.234) is satisfied trivially,


and the RHS of (16.235) can be bounded as follows:



1
log 1 + min a21 E(1) + E(2) , E(1) + a12 E(2)
2





1
log 1 + min E(2) + 1 E(1) + E(2) , E(1) + E(1) + 1 E(2)
(16.245)
2

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


380

The Interference Channel





1
log 1 + min E(1) + E(2) + E(1) E(2) , E(1) + E(2) + E(1) E(2)
2

1
= log 1 + E(1) + E(2) + E(1) E(2)
2



1
= log 1 + E(1) 1 + E(2) .
2
=

(16.246)
(16.247)
(16.248)

The capacity regions for the Gaussian IC with strong or very strong interference, respectively, then read as follows.
Theorem 16.22 ([Ahl74], [Sat81]).
with strong interference is

0 R(1)

0 R(2)

(1)
(2)
CIC, strong =
R ,R
:

R(1) + R(2)

The capacity region of a Gaussian IC


1
(1)

log 1 + E ,


1

(2)

log 1 + E ,
2
.


 (1)
1
(2)

log 1 + min E + a12 E ,

(1)
(2)

a21 E + E
(16.249)

The capacity region of a Gaussian IC with very strong interference is


1
(1)
(1)

 0 R 2 log 1 + E ,
(1)
(2)
(16.250)
CIC, very strong =
R ,R
:
 .
1

0 R(2) log 1 + E(2)


2
Note that the capacity region for the case of very strong interference is
identical to the case without interference.

16.6.5

HanKobayashi Region for Gaussian IC

We can also apply the HanKobayashi region of Theorem 16.17 to the Gaussian IC. The time-sharing random variable T can be used as representing
different transmission modes: For example, t = 1 might represent the mode
when interference will be treated as noise, t = 2 the mode when only user 1
transmits, t = 3 is the mode when only user 2 transmits, t = 4 represents the
mode when both users decode both messages, etc.
In general, T T with T being a finite alphabet. Depending on the realization of T = t, we then choose the other random variables U (i) (t) and
X (i) (t). Actually, to allow the recovery of each of the three basic communication strategies of Section 16.6.3, we recall from Section 13.10 about the
Gaussian BC that in order to implement superposition coding, we may choose
0
0
X (i) = U (i) + X (i) where U (i) and X (i) are independent
 (i)zero-mean Gaussian
(i)
(i)
(i)
random variables with variance E and 1 E , respectively. This
choice then only requires to choose (i) (t) and E(i) (t) as a function of the

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


16.6. Gaussian IC

381

mode T = t. It also ensures that all seven conditions of Theorem 16.17 are
restricted to Gaussian distributions:



 1
E(1) (t)
(1)
(1) (2)
U , T = t = log 1 +
I X ;Y
, (16.251)

2
a12 1 (2) (t) E(2) (t) + 1



 1
E(2) (t)
, (16.252)
I X (2) ; Y (2) U (1) , T = t = log 1 +
 (1)
2
a21 1 (1) (t) E (t) + 1



 1
E(1) (t) + a12 (2) (t)E(2) (t)
(1)
(2)
(1)
I X ,U ;Y
T = t = log 1 +
, (16.253)

2
a12 1 (2) (t) E(2) (t) + 1


I X (2) ; Y (2) U (1) , U (2) , T = t



1 (2) (t) E(2) (t)
1
= log 1 +
, (16.254)

2
a21 1 (1) (t) E(1) (t) + 1



 1
E(2) (t) + a21 (1) (t)E(1) (t)
(2)
(1)
(2)
I X ,U ;Y
T = t = log 1 +
, (16.255)

2
a21 1 (1) (t) E(1) (t) + 1


I X (1) ; Y (1) U (1) , U (2) , T = t



1 (1) (t) E(1) (t)
1
= log 1 +
, (16.256)

2
a12 1 (2) (t) E(2) (t) + 1


I X (1) , U (2) ; Y (1) U (1) , T = t



1 (1) (t) E(1) + a12 (2) (t)E(2) (t)
1
= log 1 +
,

2
a12 1 (2) (t) E(2) (t) + 1
(16.257)


(2)
(1)
(2) (2)
U ,T = t
I X ,U ;Y



1 (2) (t) E(2) + a21 (1) (t)E(1) (t)
1
= log 1 +
.

2
a21 1 (1) (t) E(1) (t) + 1
(16.258)
Now we can recover the region R1,G in (16.229) with (1) = (2) = 1 by
choosing T to be a constant and by setting (1) = (2) = 0 (thereby making
U (1) = U (2) = 0 with probability 1). For R2 in (16.230), we set T to be binary
with T = {1, 2} and with probability PT (1) = 1 PT (2) = , and we choose
(1) (1) = 1, (2) (1) = 0, E(1) (1) = E(1) / , E(2) (1) = 0 for mode T = 1, and
(1) (2) = 0, (2) (2) = 1, E(1) (2) = 0, E(2) (2) = E(2) / for mode T = 2.
Finally, the reader can check that R3 in (16.231) can be recovered by T
being constant, and (1) = (2) = 1.
In general, the HanKobayashi region (for fixed T ) is a heptagon, see the
example in Figure 16.7.

16.6.6

Symmetric Degrees of Freedom

An interesting way of getting a better understanding of the Gaussian IC are the


so-called symmetric degrees of freedom. The idea is to consider a symmetric
situation where a12 = a21 , a and where both transmitters have the same

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


382

The Interference Channel


a12 = a21 = 0.1

3.5

R(2) 3.262.5bits
R(2)

R(1) + 2R(2) 7.09 bits


2

1.5

R(1) + R(2) 4.19 bits


1

2R(1) + R(2) 7.09 bits


0.5

R(1) 3.26 bits


0
0

0.5

1.5

2.5

3.5

R(1)
Figure 16.7: This depicts the heptagon describing the HanKobayashi region
for the Gaussian IC with the choice (1) = (2) = 0.9. The Gaussian IC has parameter a12 = a21 = 0.1 and the power constraints
are E(1) = E(2) = 1000. The units are bits.
power E(1) = E(2) , E and then to compare the maximum sum rate at high
SNR with the situation of no interference. If there was no interference (a = 0),
then each transmitter could transmit at a rate of
1
1
log(1 + E) log E
(16.259)
2
2
(for E large), i.e., in total one can transmit at a sum rate of


1 1
+
log E = 1 log E.
2 2

(16.260)

Now we are interested in the behavior of the factor in front of the logarithm
when a is increased. Concretely, we investigate the symmetric degrees of
freedom:
dsym (a) , lim

max

E (R(1) ,R(2) )C

R(1) (E) + R(2) (E)


.
log E

(16.261)

Even though the capacity region of the Gaussian IC is not know for all values
of a, the symmetric degrees of freedom has been derived exactly and is shown
in Figure 16.8.
Note how the Gaussian IC can be split into four different regions:


For a 0, 12 we have weak interference and it is optimal with respect
to the degrees of freedom to treat the interference as noise.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


16.6. Gaussian IC
dsym

383

weak

medium

strong

very strong

1
2
3
1
2

1
2

2
3

3
2

Figure 16.8: Symmetric degrees of freedom.




For a 12 , 1 we have medium interference and using partial decoding
of the other message according to the HanKobayashi scheme is optimal
with respect to the degrees of freedom. Note the quite unexpected Wshape with the maximum 32 at a = 23 !
For a [1, 2] we have strong interference, where it is optimal (for capacity and therefore also for the degrees of freedom) to always decode the
interference and cancel it.
For a 2, we are in the very strong interference case, where it is still
optimal to decode the interference and cancel it and where the interference has no detrimental effect on the communication anymore and can
be canceled completely.
We also would like to point out that orthogonal coding is only optimal for the
special cases a = 21 and a = 1, that at the same time also correspond to the
worst case scenarios where interference hurts most. For more information we
refer to the literature, e.g., [ETW08].

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Bibliography
[Abb08]

Emmanuel A. Abbe, Local to global geometric methods in information theory, Ph.D. dissertation, Massachusetts Institute
of Technology (MIT), June 2008.

[AGA12]

Amin Aminzadeh Gohari and Venkat Anantharam, Evaluation


of Martons inner bound for the general broadcast channel,
IEEE Transactions on Information Theory, vol. 58, no. 2, pp.
608619, February 2012.

[Ahl71]

Rudolf Ahlswede, Multi-way communication channels, in Proceedings 2nd IEEE International Symposium on Information
Theory (ISIT).
Tsahkadsor, Armenia, USSR: Publishing
House of the Hungarian Academy of Sciences (published 1973),
September 28, 1971, pp. 2351.

[Ahl74]

Rudolf Ahlswede, The capacity region of a channel with two


senders and two receivers, The Annals of Probability, vol. 2,
no. 5, pp. 805814, October 1974.

[Ber71]

Toby Berger, Rate Distortion Theory: Mathematical Basis for


Data Compression, ser. in Information and System Sciences.
Englewood Cliffs, NJ, USA: Prentice-Hall, October 1971.

[Ber73]

Patrick P. Bergmans, Random coding theorem for broadcast


channels with degraded components, IEEE Transactions on Information Theory, vol. 19, no. 2, pp. 197207, March 1973.

[BZ83]

Toby Berger and Zhen Zhang, Minimum breakdown degradation in binary source encoding, IEEE Transactions on Information Theory, vol. 29, no. 6, pp. 807814, November 1983.

[Car75]

Aydano B. Carleial, A case where interference does not reduce


capacity, IEEE Transactions on Information Theory, vol. 21,
no. 5, pp. 569570, September 1975.

[CEG87]

Max H. M. Costa and Abbas A. El Gamal, The capacity region


of the discrete memoryless interference channel with strong interference, IEEE Transactions on Information Theory, vol. 33,
no. 5, pp. 710711, September 1987.

385

c Stefan M. Moser, vers. 2.5


386

Bibliography

[CEGS80]

Thomas M. Cover, Abbas A. El Gamal, and Masoud Salehi,


Multiple access channels with arbitrarily correlated sources,
IEEE Transactions on Information Theory, vol. 26, no. 6, pp.
648657, November 1980.

[CK78]

Imre Csisz
ar and J
anos Korner, Broadcast channels with confidential messages, IEEE Transactions on Information Theory,
vol. 24, no. 3, pp. 339348, May 1978.

[CK81]

Imre Csisz
ar and J
anos Korner, Information Theory: Coding
Theorems for Discrete Memoryless Systems. Budapest, Hungary: Academic Press, 1981.

[CK11]

Imre Csisz
ar and J
anos Korner, Information Theory: Coding
Theorems for Discrete Memoryless Systems, 2nd ed.
Cambridge, UK: Cambridge University Press, 2011.

[CMGEG08] Hon-Fah Chong, Mehul Motani, Hari Krishna Garg, and Hesham El Gamal, On the HanKobayashi region for the interference channel, IEEE Transactions on Information Theory,
vol. 54, no. 7, pp. 31883195, July 2008.
[Cos83]

Max H. M. Costa, Writing on dirty paper, IEEE Transactions


on Information Theory, vol. 29, no. 3, pp. 439441, May 1983.

[Cov72]

Thomas M. Cover, Broadcast channels, IEEE Transactions on


Information Theory, vol. 18, no. 1, pp. 214, January 1972.

[Cov75]

Thomas M. Cover, A proof of the data compression theorem


of Slepian and Wolf for ergodic sources, IEEE Transactions on
Information Theory, vol. 21, no. 2, pp. 226228, March 1975.

[CS04]

Imre Csisz
ar and Paul C. Shields, Information theory and
statistics: A tutorial, Foundations and Trends in Communications and Information Theory, vol. 1, no. 4, pp. 417528, 2004.

[Csi84]

Imre Csisz
ar, Sanov property, generalized I-projection and a
conditional limit theorem, The Annals of Probability, vol. 12,
no. 3, pp. 768793, August 1984.

[Csi98]

Imre Csisz
ar, The method of types, IEEE Transactions on
Information Theory, vol. 44, no. 6, pp. 25052523, October 1998.

[CT06]

Thomas M. Cover and Joy A. Thomas, Elements of Information


Theory, 2nd ed. New York, NY, USA: John Wiley & Sons, 2006.

[EG79]

Abbas A. El Gamal, The capacity of a class of broadcast channels, IEEE Transactions on Information Theory, vol. 25, no. 2,
pp. 166169, March 1979.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


387

Bibliography
[EG81]

Abbas A. El Gamal, On information flow in relay networks,


in Proceedings IEEE National Telecommunications Conference,
vol. 2, New York, USA, November 29 December 3, 1981, pp.
D4.1.1D4.1.4.

[EGC82]

Abbas A. El Gamal and Thomas M. Cover, Achievable rates


for multiple descriptions, IEEE Transactions on Information
Theory, vol. 28, no. 6, pp. 851857, November 1982.

[Egg58]

H. G. Eggleston, Convexity.
versity Press, 1958.

[EGK10]

Abbas A. El Gamal and Young-Han Kim, Lecture Notes on


Network Information Theory, Dept. of Electrical Engineering,
Standford University, June 2010. Available: http://arxiv.org/a
bs/1001.3404

[EGK11]

Abbas A. El Gamal and Young-Han Kim, Network Information


Theory. Cambridge, UK: Cambridge University Press, December 2011, ISBN: 9781107008731. Available: http://www.cambri
dge.org/9781107008731/

[ETW08]

Raul H. Etkin, David N. C. Tse, and Hua Wang, Gaussian


interference channel capacity to within one bit, IEEE Transactions on Information Theory, vol. 54, no. 12, pp. 55345562,
December 2008.

[Fan61]

Robert M. Fano, Transmission of Information: A Statistical


Theory of Communications. Cambridge, MA, USA: MIT Press,
March 1961.

[Gal68]

Robert G. Gallager, Information Theory and Reliable Communication. New York, NY, USA: John Wiley & Sons, 1968.

[GP80]

Sergei I. Gelfand and Mark S. Pinsker, Coding for channels


with random parameters, Problems of Control and Information
Theory, vol. 9, no. 1, pp. 1931, 1980.

[HK81]

Te Sun Han and Kingo Kobayashi, A new achievable rate region


for the interference channel, IEEE Transactions on Information
Theory, vol. 27, no. 1, pp. 4960, January 1981.

[Hoe56]

Wassily Hoeffding, Asymptotically optimal tests for multinominal distributions, The Annals of Mathematical Statistics,
vol. 36, pp. 19161921, 1956.

[KM77a]

J
anos K
orner and Katalin Marton, Comparison of two noisy
channels, in Topics in Information Theory (1975), Imre Csiszar
and Peter Elias, Eds. North-Holland, 1977, pp. 411423, colloquia Math. Soc. Janos Bolyai.

Cambridge, UK: Cambridge Uni-

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


388

Bibliography

[KM77b]

J
anos K
orner and Katalin Marton, General broadcast channels
with degraded message sets, IEEE Transactions on Information Theory, vol. 23, no. 1, pp. 6064, January 1977.

[Kol56]

Andrei N. Kolmogorov, On the Shannon theory of information


transmission in the case of continuous signals, IRE Transactions on Information Theory, vol. 2, no. 4, pp. 102108, December 1956.

[Kra07]

Gerhard Kramer, Topics in multi-user information theory,


Foundations and Trends in Communications and Information
Theory, vol. 4, no. 4/5, pp. 265444, 2007.

[Lia72]

Henry Herng-Jiunn Liao, Multiple access channels, Ph.D. dissertation, University of Hawaii, Honolulu, USA, September 1972.

[Llo82]

Stuart P. Lloyd, Least squares quantization in PCM, IEEE


Transactions on Information Theory, vol. 28, no. 2, pp. 129
137, March 1982.

[LPW08]

David A. Levin, Yuval Peres, and Elizabeth L. Wilmer, Markov


Chains and Mixing Times. American Mathematical Society,
2008.

[Mar79]

Katalin Marton, A coding theorem for the discrete memoryless


broadcast channel, IEEE Transactions on Information Theory,
vol. 25, no. 3, pp. 306311, May 1979.

[Mos14]

Stefan M. Moser, Information Theory (Lecture Notes), 4th ed.


Signal and Information Processing Laboratory, ETH Z
urich,
Switzerland, and Department of Electrical & Computer Engineering, National Chiao Tung University (NCTU), Hsinchu, Taiwan, 2014. Available: http://moser-isi.ethz.ch/scripts.html

[Nai10]

Chandra Nair, A note on outer bounds for broadcast channel, in Proceedings International Zurich Seminar on Broadband
Communications (IZS), Zurich, Switzerland, March 35, 2010.

[NEG07]

Chandra Nair and Abbas A. El Gamal, An outer bound to the


capacity region of the broadcast channel, IEEE Transactions
on Information Theory, vol. 53, no. 1, pp. 350355, January
2007.

[NW08]

Chandra Nair and Zizhou Vincent Wang, On the inner and


outer bounds for 2-receiver discrete memoryless broadcast channels, in Proceedings Information Theory and Applications
Workshop (ITA), University of California, San Diego, CA, USA,
January 27 February 1, 2008, note that one author is sometimes wrongly listed as V. W. Zhizhou.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Bibliography

389

[Oza80]

Lawrence H. Ozarow, On a source coding problem with two


channels and three receivers, Bell System Technical Journal,
vol. 59, no. 10, pp. 19091921, December 1980.

[Pin60]

Mark S. Pinsker, Information and Information Stability of Random Variables and Processes, ser. Problemy Peredaci Informacii.
Moscow: Akademii Nauk SSSR, 1960, vol. 7, English translation:
HoldenDay, San Francisco, 1964.

[Rio07]

Olivier Rioul, A simple proof of the entropy-power inequality via properties of mutual information, in Proceedings IEEE
International Symposium on Information Theory (ISIT), Nice,
France, June 2430, 2007, pp. 4650.

[San57]

I. N. Sanov, On the probability of large deviations of random


variables, Mat. Sbornik, vol. 42, pp. 1144, 1957, in Russian;
English translation in Select. Transl. Math. Statist. Probab., vol.
1, pp. 213244, 1961.

[Sat77]

Hiroshi Sato, Two-user communication channels, IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 295304, May
1977.

[Sat78]

Hiroshi Sato, An outer bound to the capacity region of broadcast channels, IEEE Transactions on Information Theory,
vol. 24, no. 3, pp. 374377, May 1978.

[Sat81]

Hiroshi Sato, The capacity of the Gaussian interference channel


under strong interference, IEEE Transactions on Information
Theory, vol. 27, no. 6, pp. 786788, November 1981.

[Sha48]

Claude E. Shannon, A mathematical theory of communication, Bell System Technical Journal, vol. 27, pp. 379423 and
623656, July and October 1948.

[Sha59]

Claude E. Shannon, Coding theorems for a discrete source with


a fidelity criterion, in Institute of Radio Engineers, International Convention Record, vol. 7, 1959, pp. 142163.

[Sha61]

Claude E. Shannon, Two-way communication channels, in


Proceedings of 4th Berkeley Symposium on Mathematical Statistics and Probability, J. Neyman, Ed. Berkeley, CA, USA: University California Press, 1961, pp. 611644.

[Sta59]

A. J. Stam, Some inequalities satisfied by the quantities of information of Fisher and Shannon, Information and Control,
vol. 2, pp. 101112, June 1959.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


390

Bibliography

[SW73a]

David S. Slepian and Jack K. Wolf, A coding theorem for multiple access channels with correlated sources, Bell System Technical Journal, vol. 52, no. 7, pp. 10371076, September 1973.

[SW73b]

David S. Slepian and Jack K. Wolf, Noiseless coding of correlated information sources, IEEE Transactions on Information
Theory, vol. 19, no. 4, pp. 471480, July 1973.

[VAR11]

Kumar Viswanatha, Emrah Akyol, and Kenneth Rose,


A strictly improved achievable region for multiple descriptions using combinatorial message sharing, May 2011,
arXiv:1105.6150v1 [cs.IT]. Available: http://arxiv.org/abs/110
5.6150

[VG06]

Sergio Verd
u and Dongning Guo, A simple proof of the entropypower inequality, IEEE Transactions on Information Theory,
vol. 52, no. 5, pp. 21652166, May 2006.

[VKG03]

Raman Venkataramani, Gerhard Kramer, and Vivek K. Goyal,


Multiple description coding with many channels, IEEE Transactions on Information Theory, vol. 49, no. 9, pp. 21062114,
September 2003.

[Wit80]

Hans S. Witsenhausen, On source networks with minimal


breakdown degradation, Bell System Technical Journal, vol. 59,
no. 6, pp. 10831087, JulyAugust 1980.

[WW81]

Hans S. Witsenhausen and Aaron D. Wyner, On source coding


for multiple descriptions II: A binary source, Bell System Technical Journal, vol. 60, no. 10, pp. 22812292, December 1981.

[WWZ80]

Jack K. Wolf, Aaron D. Wyner, and Jacob Ziv, Source coding for multiple descriptions, Bell System Technical Journal,
vol. 59, no. 8, pp. 14171426, October 1980.

[WZ76]

Aaron D. Wyner and Jacob Ziv, The rate-distortion function


for source coding with side information at the decoder, IEEE
Transactions on Information Theory, vol. 22, no. 1, pp. 110,
January 1976.

[Yeu08]

Raymond W. Yeung, Information Theory and Network Coding.


New York, NY, USA: Springer Verlag, August 2008, ISBN: 978
0387792330. Available: http://www.springer.com/engineeri
ng/signals/book/978-0-387-79233-0

[ZB87]

Zhen Zhang and Toby Berger, New results in binary multiple descriptions, IEEE Transactions on Information Theory,
vol. 33, no. 4, pp. 502521, July 1987.

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


List of Figures
1.1
1.2

Example of overlapping sets Ei . . . . . . . . . . . . . . . . . . . . .


Convex combination of six points . . . . . . . . . . . . . . . . . . . .

8
14

3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8
3.9
3.10

Sanovs Theorem . . . . . . . . . . . . . . . . . . . . . . . .
Projection of a point q onto a plane . . . . . . . . . . .
A point below the projection plane . . . . . . . . . . .
A convex set with a tangential plane . . . . . . . . . .
A one-sided set with respect to Q . . . . . . . . . . . .
Uniqueness of Q . . . . . . . . . . . . . . . . . . . . . . . .
Representation of two PMFs Q1 and Q2 . . . . . . . .
Illustration of the set A defined in (3.135) . . . . . .
An example of a locally one-sided set F . . . . . . . .
An example of a set F that is not locally one-sided

.
.
.
.
.
.
.
.
.
.

37
44
44
45
47
47
49
54
58
59

4.1

Markov setup of three RVs . . . . . . . . . . . . . . . . . . . . . . . .

79

5.1
5.2
5.4
5.5
5.6
5.7
5.8
5.9
5.10
5.11
5.12
5.13
5.14
5.15
5.16
5.17
5.18

Quantization of a square . . . . . . . . . . . . . . . . . . . . . . .
Reconstruction areas and points of X N (0, 1) . . . . . .
Reconstruction areas and points of X N (0, 1) . . . . . .
Test channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Proof failure for discontinuous rate distortion function . .
Distortion-rate plane . . . . . . . . . . . . . . . . . . . . . . . . .
Tangent through R0 (q , ) . . . . . . . . . . . . . . . . . . . . . .
A contradiction . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A second contradiction . . . . . . . . . . . . . . . . . . . . . . . .
A convex function is continuous . . . . . . . . . . . . . . . . . .
A convex function with a slope discontinuity . . . . . . . .
A typical rate distortion function . . . . . . . . . . . . . . . . .
Joint source channel coding . . . . . . . . . . . . . . . . . . . . .
Lossy compression added to joint source channel coding
Rate distortion combined with channel transmission . . .
The n-dimensional space of all possible received vectors .
Reverse waterfilling solution . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

86
87
91
96
101
111
111
112
112
115
118
119
119
122
122
127
131

6.1
6.2

..........
Mapping of source sequences x to codewords x
Graphical explanation of Theorem 6.5 . . . . . . . . . . . . . . . .

138
143

391

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

c Stefan M. Moser, vers. 2.5


392

List of Figures
7.1

A multiple description system . . . . . . . . . . . . . . . . . . . . . .

156

8.1
8.2
8.3

The WynerZiv problem . . . . . . . . . . . . . . . . . . . . . . . . . .


The idea of binning . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Example of a WynerZiv problem . . . . . . . . . . . . . . . . . . .

185
187
197

9.1
9.2

A general source compression problem . . . . . . . . . . . . . . . .


SlepianWolf rate region . . . . . . . . . . . . . . . . . . . . . . . . . .

207
209

10.1
10.2
10.3

The multiple-access channel (MAC) . . . . . . . . . . . . . .


Two independent BSCs form a multiple-access channel
The capacity region of the MAC consisting of two
independent BSCs. . . . . . . . . . . . . . . . . . . . . . . . . . .
The capacity region of the binary multiplier MAC . . . .
Binary erasure channel (BEC) . . . . . . . . . . . . . . . . . .
The capacity region of the binary erasure MAC . . . . .
Two different capacity regions C1 and C2 of a MAC . . .
Pentagon of achievable rate pairs . . . . . . . . . . . . . . . .
Pentagon of achievable rate pairs . . . . . . . . . . . . . . . .
Convex hull of two pentagons . . . . . . . . . . . . . . . . . .
A convex -combination . . . . . . . . . . . . . . . . . . . . . .
Two pentagons of Example 10.12 . . . . . . . . . . . . . . . .
New pentagon derived as a convex combination . . . . .
Five extremal points of CI . . . . . . . . . . . . . . . . . . . .
Five extremal points do not define the corner points of
pentagon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
General shape of the MAC capacity region . . . . . . . . .
An achievable rate region of the binary erasure MAC .
Capacity region of the Gaussian MAC . . . . . . . . . . . .
Capacity region of the Gaussian MAC when we allow
changing the power . . . . . . . . . . . . . . . . . . . . . . . . . .
TDMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

....
....

219
222

..
..
..
..
..
..
..
..
..
..
..
..
a
..
..
..
..

10.4
10.5
10.6
10.7
10.8
10.9
10.10
10.11
10.12
10.13
10.14
10.15
10.16
10.17
10.18
10.19
10.20
11.1

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

222
223
224
226
226
234
234
236
236
237
238
240

.
.
.
.

.
.
.
.

240
241
243
246

....
....

247
248

11.3

A general system for transmitting a correlated source over a


MAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The information transmission system with source channel
separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Block-diagonal form . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

252
259

12.1
12.2
12.3
12.4
12.5

The GelfandPinsker problem . .


Binary channel with state S = 0
Binary channel with state S = 1
Binary channel with state S = 2
Channel from S to U . . . . . . . .

.
.
.
.
.

261
273
273
273
275

13.1
13.2

Broadcast channel (BC) . . . . . . . . . . . . . . . . . . . . . . . . . .


A physically degraded BC . . . . . . . . . . . . . . . . . . . . . . . . .

285
290

11.2

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

251

393

List of Figures
13.3
13.4
13.5
13.6
13.7
13.8

A physically degraded Gaussian BC . . . . . . . . . .


Superposition coding . . . . . . . . . . . . . . . . . . . .
Two identical rate regions of the BC . . . . . . . . .
Pentagon of achievable rate pairs . . . . . . . . . . . .
BC encoder based on a GelfandPinsker encoder
Capacity region of a general Gaussian BC . . . . .

.
.
.
.
.
.

290
294
307
313
313
322

14.1
14.2

MAC channel with common message . . . . . . . . . . . . . . . . .


Shape of achievable MAC region for fixed distribution . . . . .

327
332

15.1
15.2
15.3
15.4
15.5

A DMN with five terminals and four messages


Cut-Set Bound on broadcast channel . . . . . . .
Cut-Set Bound on multiple-access channel . . .
Cut-Set Bound on a single-relay channel . . . . .
Cut-Set Bound on a double-relay channel . . . .

.
.
.
.
.

341
346
346
347
348

16.1
16.2
16.3

The interference channel (IC) . . . . . . . . . . . . . . . . . . . . . .


Two independent BSCs form an interference channel . . . . . .
The capacity region of the IC consisting of two independent
BSCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
An achievable rate region for a DM-IC . . . . . . . . . . . . . . . .
Seven cuts of the IC . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Comparison of bounds on Gaussian IC . . . . . . . . . . . . . . . .
HanKobayashi region for Gaussian IC . . . . . . . . . . . . . . . .
Symmetric degrees of freedom . . . . . . . . . . . . . . . . . . . . . .

349
351

16.4
16.5
16.6
16.7
16.8

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

351
351
353
378
382
383

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


List of Tables
5.3

Recursion given by Lloyds algorithm . . . . . . . . . . . . . . . . .

91

7.2
7.3

Our choice of QX (1) ,X (2) |X . . . . . . . . . . . . . . . . . . . . . . . . .


The PMF QX (1) ,X (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

181
181

9.3

A joint weather PMF of Hsinchu and Taichung . . . . . . . . . .

210

395

c Stefan M. Moser, vers. 2.5


Index
Italic entries are to names.
Symbols
|X |, 2
D, 2
i , 64
t (), 65, 70
i , 64
m (), 64
H, 1
h, 2
I(; ), 3
I(|x), 17, 27
I {}, 162
inf, 16, 36
L1 -distance, see variational
distance
log, 20
N(|x), 17, 27
P, 18
Pn , 18, 29
sup, 16
supp, 1

178, 196, 230, 260, 269,


336
B
BC, see broadcast channel
BEC, see binary erasure channel
Berger, Toby, 85, 133, 144, 182
Bergmans, Patrick P., 294, 307
binary erasure channel, 224
binary symmetric channel, 95
binary symmetric source, 158
binning, 187, 209, 210, 212, 263,
309, 313
broadcast channel, 285, 286, 345
capacity region, 286
convexity, 289
dependence on marginals,
288
coding scheme, 286
coding theorem, 294
achievability, 288, 299, 309,
312, 318
converse, 319, 320
degraded, 307
degraded message set, 308
Gaussian, 321
less noisy, 307
more capable, 305
outer bound, 301
Cut-Set Bound, 319, 345
degraded message set, 294,
308
deterministic, 319

A
Abbe, Emmanuel A., 42, 48
AEP, 12
Ahlswede, Rudolf, 249, 360, 380
Akyol, Emrah, 183
Aminzadeh Gohari, Amin, 319
Anantharam, Venkat, 319
asymptotic equipartition property,
12
auxiliary random variable, 176
bound on alphabet size, 177,

397

398
discrete memoryless, 286
Gaussian, 291, 320
GelfandPinsker coding, 312
physically degraded, 289
rate triple, 286
stochastically degraded, 290,
291
time-sharing, 287
with less noisy output, 292
with more capable output, 292
BSC, see binary symmetric
channel
BSS, see binary symmetric source
C
capacity cost function, 94
capacity region, 220, 221
BC, 286, 288
degraded, 307
degraded message set, 308
Gaussian, 321
less noisy, 307
more capable, 305
Cut-Set Bound, 344
dirty paper, 281
DMN, 340
Gaussian MAC, 245
GelfandPinsker, 263, 277
IC, 350, 352
MAC, 225, 229, 241
common message, 335
Caratheodorys Theorem, 13
FenchelEggleston
strengthening, 15
Caratheodory, Constantin, 13
Carleial, Aydano B., 359, 377
causality, 340
CDMA, 246
chain rule
for entropy, 2
for mutual information, 3
for type, 28
for typical set, 165
Chebyshev Inequality, 12
Chong, Hon-Fah, 371

c Copyright Stefan M. Moser, version 2.5, 31 Aug. 2015


Index
code-division multiple-access, 246
coding scheme
broadcast channel, 286
DMC with interference, 262
interference channel, 350
multiple description, 157
multiple-access channel, 220
rate distortion, 93
SlepianWolf, 208
WynerZiv, 186
coding theorem
auxiliary random variable, 176
broadcast channel, 312, 318
dirty paper, 281
for broadcast channel, 294,
299, 301
degraded, 307
degraded message set, 308
less noisy, 307
more capable, 305
for correlated sources over
MAC, 254, 260
for multiple description
problem, 170, 178
for multiple-access channel,
225, 229
with common message, 335
for rate distortion problem, 98
with side-information, 204
for SlepianWolf problem, 209
for WynerZiv problem, 204
GelfandPinsker, 277
common message, 327
common part, 259
Conditional Limit Theorem, 53, 59
Conditional Type Theorem, see
Type Theorem
convergence, 13
almost-sure, 12
in probability, 12
with probability 1, 12
convex combination, 13
of pentagons, 235
convexity, 42
maximization, 271
one-sided, 46

399

Index
Costa, Max, 279, 281, 361
Costa, Max H. M., 361
Cover, Thomas M., x, 9, 38, 57,
170, 182, 217, 260, 285,
319, 339, 340, 344
Csisz
ar, Imre, x, xi, 9, 17, 50, 53,
133
Csisz
ar,Imre, 304
Csisz
arK
orner identity, 304
CTT, see Type Theorem
cut, 341
Cut-Set Bound, 319, 340, 344
BC, 345
MAC, 347
relay channel, 347, 348
D
data compression
distributed, 207
distributed lossless, see Slepian-Wolf problem
lossless, 100, 210
lossy, see rate distortion
problem
universal, 143
zero-error, 217
Data Processing Inequality, 5
data transmission
with side-information, see Gelfand-Pinsker problem
differential entropy, 2
Dirichlet partition, 88
dirty paper coding, 262
capacity, 281
discrete memoryless broadcast
channel, see broadcast
channel
discrete memoryless channel, 64,
339
with interference, 262
discrete memoryless interference
channel, see interference
channel
discrete memoryless network, 339
cut, 341

discrete memoryless source, 64


distortion measure, 85
assumptions, 92
Hamming, 92
per-letter, 92
sequence, 92
squared error, 93
distortion rate function, 134
DM-BC, see broadcast channel
DM-IC, see interference channel
DM-MAC, see multiple-access
channel
DMC, see discrete memoryless
channel
DMN, see discrete memoryless
network
DMS, see discrete memoryless
source
DPI, see Data Processing
Inequality
E
Eggleston, Harold G., 15
El Gamal, Abbas A., xi, 170, 182,
260, 300, 301, 305, 340,
344, 361
El Gamal, Hesham, 371
entropy, 1
chain rule, 2
conditioning reduces entropy,
2
differential, see differential
entropy
relative, see relative entropy
Entropy Power Inequality, 9
EPI, see Entropy Power Inequality
ergodicity, 13
error exponent, 133
Etkin, Raul H., 383
Exponentiated IT Inequality, 4
Extreme Value Theorem, 16
F
Fano Inequality, 6
Fano, Robert M., 6

FDMA, 247
Fenchel, Werner, 15
Fenchel-Eggleston strengthening of Carathéodory's Theorem, 15
Fourier-Motzkin elimination, 9, 10, 315
frequency-division multiple-access,
247
G
Gallager, Robert G., 86
Garg, Hari Krishna, 371
Gelfand, Israïl M., 278
Gelfand, Sergei I., 277, 278, 283
Gelfand-Pinsker
capacity, 263
Gelfand-Pinsker problem, 261
application to broadcast
channel, 312
capacity, 271
coding scheme, 262
coding theorem, 277
achievability, 263
converse, 276
Gaussian, 281
convexity, 270
rate, 263, 269
Goyal, Vivek K., xi, 178, 183
Guo, Dongning, 9
H
Han, Te Sun, 371
Han-Kobayashi region, 365, 371, 375, 380
Hoeffding, Wassily, 17
I
IC, see interference channel
IID, see independent and
identically distributed
under random variable
indicator function, 162
indicator random variable, 51

Information Theory Inequality, see
IT Inequality
information transmission system,
121, 251
coding theorem, 254, 260
achievability, 255
interference channel, 349
capacity region, 350
dependence on marginals,
352
coding scheme, 350
discrete memoryless, 349
Gaussian, 374
degrees of freedom, 381
Han-Kobayashi region, 375, 380
inner bound, 375
outer bound, 375
strong interference, 377, 380
very strong interference,
377, 380
Han-Kobayashi region, 365, 371, 375, 380
inner bound, 355
inner region, 365
outer bound, 353, 354
rate pair, 350
Sato's outer bound, 354
strong interference, 359
Gaussian, 377, 380
symmetric, 358
very strong interference, 359
Gaussian, 377, 380
IT Inequality, 3
Exponentiated, 4
J
joint source channel coding
scheme, 119
K
Körner, János, x, 9, 17, 133, 292, 304, 308, 312, 320
Karush-Kuhn-Tucker conditions, 106

Kim, Young-Han, xi
KKT conditions, 106
Kobayashi, Kingo, 371
Kolmogorov, Andrei N., 85, 278
Kramer, Gerhard, x, xi, 178, 183
L
Lagrangian, 40, 106
law of large numbers
strong, 12
weak, 11
Levin, David A., 50
LHS, 118
Liao, Henry Herng-Jiunn, 249
Lloyd, Stuart P., 88
Log-Sum Inequality, 4
lossless data compression
coding theorem
achievability, 210
M
MAC, see multiple-access channel
Markov chain, 5, 79
Markov Inequality, 151
Markov Lemma, 193, 267
Marton, Katalin, 17, 292, 308,
312, 318, 320
Massey, James L., 3
minimum mean squared error, 199
MMSE, see minimum mean
squared error
Moser, Stefan M., x, 1, 3, 5, 6, 12,
17, 23, 48, 63, 100, 105,
106, 119, 121-123, 125,
129, 130, 199, 251
Motani, Mehul, 371
multiple description problem, 155
coding scheme, 157
coding theorem, 170, 178
achievability, 159, 170
convexity, 177
multiple description rate
distortion quintuple, 157
multiple description rate
distortion region, 158

successive refinement, 183


multiple-access
code-division, 246
frequency-division, 247
time-division, 247
multiple-access channel, 219, 327,
346
capacity region, 220, 241
coding scheme, 220
coding theorem
achievability, 225, 328
converse, 230, 333
Gaussian, 245
version 1, 225
version 2, 229
with common message, 335
Cut-Set Bound, 347
discrete memoryless, 219
Gaussian, 244
rate pair, 220
successive cancellation, 224,
235, 245
transmitting correlated
sources, 251
with common message, 327
mutual information, 3
chain rule, 3
Data Processing Inequality, 5
N
Nair, Chandra, 301, 305, 320
nonunique decoding, 365
notation, 33, 64
, , 64, 66
O
one-sided set, 46, 48
locally, 58, 59
Ozarow, Lawrence H., 182
P
PDF, see probability density
function
Peres, Yuval, 50
Pinsker Inequality, 50

Pinsker, Mark S., 50, 277, 278, 283
PMF, see probability mass
function
polytope, 10
probability density function, 2
probability distribution
empirical, 18
linear family, 43
PDF, 2
PMF, 1
probability mass function, 1
Pythagorean Theorem, 42
Q
quantization, 86
Lloyd's algorithm, 88
R
Rényi entropy, 9
random coding, 101, 145, 159, 170,
187, 210, 212, 225, 255,
263, 294, 309, 313, 328,
365
binning, 187, 210, 212, 263,
309
binning & superposition
coding, 313
coloring, 216
rate splitting, 365
superposition coding, 294,
328, 365
random variable
independent and identically
distributed, 11
indicator, 7
rate distortion function, 94, 134
continuity, 115
convexity, 97
KKT conditions, 109
lower bound, 114
properties, 97, 106, 110
Wyner-Ziv, see Wyner-Ziv problem
rate distortion problem, 85
coding scheme, 93

coding theorem, 98, 105, 133
achievability, 101
converse, 99
strong converse, 135
coding theorem for Gaussian,
123
distortion rate function, 94
error exponent, 141
error probability, 135
Gaussian source, 123
information rate distortion
function, 94
multiple description problem,
see multiple description
problem
rate, 93
rate distortion function, 94
rate distortion pair, 94
rate distortion region, 94
test channel, 96, 124
with side-information, see Wyner-Ziv problem
rate distortion region, 94, 98
multiple description, 158, 170,
178
Wyner-Ziv, 186, 204
rate region, 9, 220
MAC, 225, 229
Slepian-Wolf, 209
rate splitting, 365
relative entropy, 2
Data Processing Inequality, 5
Pythagorean Theorem, 42
relay channel, 347, 348
Cut-Set Bound, 347, 348
reverse waterfilling, 130
RHS, 23
Rioul, Olivier, 9
Rose, Kenneth, 183
RV, see random variable
S
Salehi, Masoud, 260
Sanov's Theorem, 36
Sanov, I. N., 17

Sato's outer bound, 354
Sato, Hiroshi, 320, 354, 359, 360,
375, 377, 380
Shannon, Claude E., 85, 249
Shields, Paul C., xi
side-information
data compression, 185
data transmission, 261, 281
Slepian, David S., 209, 249, 335
Slepian-Wolf problem, 208
achievable rate region, 208
coding theorem, 209
achievability, 212
converse, 215
distributed coding scheme, 208
over MAC, 251
rate pair, 208
sphere covering, 126
sphere packing, 126
Stam, A. J., 9
successive cancellation, 224, 235,
245, 361
successive refinement, 183
superposition coding, 294, 313, 365
T
TA, see Theorem A under typical
set
TB, see Theorem B under typical
set
TC, see Theorem C under typical
set
TDMA, 247
Thomas, Joy A., x, 9, 38, 57, 339,
340, 344
time-division multiple-access, 247
time-sharing, 198, 200, 208, 221,
240, 247, 255, 287, 289
coded, 240
total expectation, 7
total variation distance, 50
Tse, David N. C., 383
TT, see Type Theorem
type, 17
chain rule, 28

conditional, 28
joint, 28
type class, 18
conditional, 29
type covering lemma, 144
Type Theorem
Conditional, 29
CTT1, 29
CTT2, 29
CTT3, 30
CTT4, 32
TT1, 19, 28
TT2, 20, 28
TT3, 22, 28
TT4, 26, 28
typical set
chain rule, 165
conditionally strongly, 71
conditionally strongly,
alternative definition, 81
joint implies individual, 69
jointly strongly, 68
Markov Lemma, 193
strongly, 64
strongly conditional on letter,
71
Theorem A, 65, 70
Theorem B, 72
Theorem C, 76
weakly, 63
U
Union Bound, 7
on total expectation, 7
V
variational distance, 48
Venkataramani, Raman, xi, 178,
183
Verdú, Sergio, 9
Viswanatha, Kumar, 183
Voronoi partition, 88
W
Wang, Hua, 383

Wang, Zizhou Vincent, 305
waterfilling, 130
Weierstrass, Karl, 16
Wilmer, Elizabeth L., 50
Witsenhausen, Hans S., 182
Wolf, Jack K., 182, 209, 249, 335
Wyner, Aaron D., 182, 204
Wyner-Ziv problem, 185
coding scheme, 186
coding theorem, 204
achievability, 187
converse, 202
Gaussian source, 198
rate distortion function with global side-information, 195
Wyner-Ziv rate distortion function, 186, 194
properties, 201
Wyner-Ziv rate distortion pair, 186
Wyner-Ziv rate distortion region, 186
Y
Yeung, Raymond W., xi
Z
Zhang, Zhen, 182
Ziv, Jacob, 182, 204
