
Information Theory and Coding

EE6461
MENG Course Autumn 2012
(c) 2004, 2012
Dr Tom Conway
August 7, 2012
Syllabus EE6461
Aims & Objectives:
This module aims to guide the student through the implications and consequences of fun-
damental theories and laws of information theory and to impart a comprehensive grounding
in random and burst error protection coding theory with reference to their increasingly wide
application in present day digital communications and computer systems.
FUNDAMENTALS OF INFORMATION THEORY:
Source encoding theory and techniques.
Communication channels: m-ary discrete memoryless, binary symmetric.
Equivocation, mutual information, and channel capacity.
Shannon-Hartley theorem.
CHANNEL CODING:
Random and burst error protection on communication channels.
Interleaving principles. Types and sources of error.
Linear block coding. Standard Array and syndrome decoding.
Cyclic Codes.
Convolution codes.
Soft and hard decision detection.
Viterbi decoding.
Reference Books:
M.J. Usher and C.G. Guy, Information Theory and Communications for Engineers, MacMillan Press Ltd, ISBN 0-333-61527-1
S. Lin and D.J. Costello, Error Control Coding: Fundamentals and Applications, Englewood Cliffs: Prentice Hall, 1983
Stephen B. Wicker, Error Control Systems for Digital Communication and Storage, Englewood Cliffs, N.J.: Prentice Hall, 1995
Module Assessment
Final Exam 100%
Contents
1 Concepts in Information Theory 7
1.1 Review of Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.2 Rules for combining probabilities . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Information and its Quantification . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Average Information : Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.1 Entropy Example I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.2 Entropy Example II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.3 Maximizing H . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.6 Information in Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.6.1 Conditional Entropy Example . . . . . . . . . . . . . . . . . . . . . . . . 16
1.7 Redundancy in printed English . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.8 Information in Noisy Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.8.1 Random Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.8.2 Quantity of Information in a Noisy channel . . . . . . . . . . . . . . . . . 19
1.8.3 Example of Information transmitted . . . . . . . . . . . . . . . . . . . . 20
1.8.4 General Expression for Information Transfer . . . . . . . . . . . . . . . . 21
1.8.5 Example of Mutual Information . . . . . . . . . . . . . . . . . . . . . . 22
1.9 Equivocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.10 General Expression for Equivocation . . . . . . . . . . . . . . . . . . . . . . . 25
1.11 Summary for expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.12 Channel Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.13 Binary Symmetric Channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.14 Information in Continuous Signals . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.15 Relative Entropy of a Continuous Signal . . . . . . . . . . . . . . . . . . . . . . 31
1.15.1 Justication for Relative Entropy . . . . . . . . . . . . . . . . . . . . . . 31
1.15.2 Example: a Gaussian signal . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.15.3 Example: Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . 34
1.15.4 Example: Binary Distribution . . . . . . . . . . . . . . . . . . . . . . . . 34
1.15.5 Proof that a Gaussian distribution maximizes relative entropy . . . . . . 35
1.16 Information Capacity of a Continuous Signal . . . . . . . . . . . . . . . . . . . 36
1.16.1 Example Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.17 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2 Source Coding 41
2.0.1 Some Ideas on Source Coding . . . . . . . . . . . . . . . . . . . . . . . . 41
2.0.2 Mapping Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.0.3 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.1 Coding Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.1.1 Fano-Shannon Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.1.2 Huffman Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.2 Shannon Coding Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.2.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.3 Data Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.4 Lossless Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.4.1 Dictionary based Data Compression . . . . . . . . . . . . . . . . . . . . . 48
2.4.2 Arithmetic Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.4.3 RLE Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5 Lossy Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.5.1 DCT : Discrete Cosine Transform . . . . . . . . . . . . . . . . . . . . . 50
2.5.2 JPEG: Still Image Compression . . . . . . . . . . . . . . . . . . . . . . . 51
2.6 Still Image Compression: GIF vs. JPEG . . . . . . . . . . . . . . . . . . . . . 52
2.7 Video Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.8 Speech Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3 Galois Fields 57
3.1 Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 Rings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3 Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4 Elementary Properties of Galois Fields . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Primitive Polynomials and Galois Fields of Order p^m . . . . . . . . . . . . . . . 61
3.5.1 Irreducible Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5.2 Primitive Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.5.3 Generating GF(p^m) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.6 Hardware for operations over GF(2^m) . . . . . . . . . . . . . . . . . . . . . . . . 63
3.6.1 Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.6.2 Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.6.3 Division/Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.7 Polynomials over Galois Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.7.1 Minimal Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4 Block codes 69
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Linear Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2.1 Minimum distance of a Linear block code . . . . . . . . . . . . . . . . . . 77
4.2.2 Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2.3 Syndrome Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 Weight Distribution of Block codes . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4 Hamming Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.1 Decoding Hamming Codes . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4.2 Weight Distribution of Hamming Codes . . . . . . . . . . . . . . . . . . . 84
4.5 Non-Binary Hamming Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.6 Modified Linear Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.6.1 Shortened Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.6.2 Punctured Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.6.3 Extended Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.6.4 Other Modifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5 Cyclic Codes 89
5.1 Linear Cyclic Block Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Basic Properties of Cyclic codes . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.3 Encoding a linear cyclic code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3.1 Systematic encoding of a linear cyclic code . . . . . . . . . . . . . . . . . 93
5.3.2 Example systematic encoding . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4 Shift Register Implementation for Encoding(decoding) cyclic codes . . . . . . . . 94
5.4.1 Polynomial Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4.2 Polynomial Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.5 Error detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.5.1 Syndrome Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.5.2 Example Syndrome decoder . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.6 CRC Error Control Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.6.1 Burst Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6 BCH and Reed Solomon Codes 103
6.1 BCH Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2 Parity Check Matrix for BCH codes . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.3 Reed Solomon Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.1 Generating a Reed Solomon Code . . . . . . . . . . . . . . . . . . . . . . 107
6.3.2 Example Reed Solomon Code . . . . . . . . . . . . . . . . . . . . . . . . 108
6.4 Decoding BCH and Reed Solomon Codes . . . . . . . . . . . . . . . . . . . . . . 109
6.4.1 Peterson-Gorenstein-Zierler Algorithm . . . . . . . . . . . . . . . . . . . 109
6.4.2 Example RS operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.5 Erasures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7 Channel Coding Performance 117
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.2 AWGN Channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.3 Energy per bit per noise spectral density E_b/N_0 . . . . . . . . . . . . . . . . . . 119
7.3.1 E_b/N_0 with a rate loss R . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.4 Performance of Block Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.4.1 Error Detection Performance . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.4.2 Error Detection Performance Example . . . . . . . . . . . . . . . . . . . 122
7.4.3 Error Correction Performance . . . . . . . . . . . . . . . . . . . . . . . . 123
7.4.4 Error Correction Performance Example . . . . . . . . . . . . . . . . . . . 123
7.4.5 Non Binary Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8 Convolutional codes 127
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8.2 Linear Convolutional Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8.2.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.3 The Structural Properties of Convolutional Codes . . . . . . . . . . . . . . . . . 131
8.3.1 Catastrophic sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
8.3.2 Weight Enumerators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
8.4 Convolutional codes performance metrics . . . . . . . . . . . . . . . . . . . . . . 134
8.5 Viterbi Decoding of Convolutional Codes . . . . . . . . . . . . . . . . . . . . . . 135
8.5.1 Trellis Representation of Convolutional Codes . . . . . . . . . . . . . . . 136
8.5.2 Viterbi algorithm with rate 1/2 K = 3 code . . . . . . . . . . . . . . . . 139
8.6 Branch Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.6.1 Binary Symmetric Channel (Hard Decision) . . . . . . . . . . . . . . . . 140
8.6.2 AWGN Channel (Soft Decision) . . . . . . . . . . . . . . . . . . . . . . . 141
8.7 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
8.7.1 Binary Symmetric Channel (Hard Decision) . . . . . . . . . . . . . . . . 143
8.7.2 AWGN Channel (Soft Decision) . . . . . . . . . . . . . . . . . . . . . . . 144
8.7.3 Example Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.8 Viterbi Decoder Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.8.1 BMU Branch Metric Unit . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.8.2 ACS Add Compare Select Units . . . . . . . . . . . . . . . . . . . . . . 147
8.8.3 PSU Path Survivor Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.9 Punctured Convolutional Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Chapter 1
Concepts in Information Theory
1.1 Review of Probability
Probability is commonly used in everyday life in assessing risk/reward and in decision making. E.g. choose a card at random from a pack:
Chance of a spade = 1/4 (0.25)
Odds of a spade are 1:3 (1 to 3)
A non-spade is 3 times more likely than a spade.
...
Other expressions such as risk, likelihood, prospect, possibility, etc. also imply ideas about probability.
We will use probability as a quantitative measure. Hence we say the probability of a spade is 1/4.
What exactly does this mean?
If we choose one card, it will either be a spade or a non-spade, i.e. a binary answer.
However, if we repeat the experiment N times (or N people do the experiment at once), then approximately 1/4 of the time the chosen card will be a spade.
In more mathematical terms, the situation could be described as follows: if m of the N experiments result in spades, then

\lim_{N \to \infty} \frac{m}{N} = \frac{1}{4} = p(\text{spade})

where p(spade) is the probability of a spade being chosen.
The concept of probability is a clear and well defined concept and very useful in many applications.
Clearly for any event X,
0 \le p(X) \le 1
0 implies the event X never occurs (cannot occur). 1 implies the event X always occurs.
1.1.1 Example
If 3 coins are tossed, what is the probability that they all show the same face?
Enumerating each possible outcome,
HHH
HHT
HTH
HTT
THH
THT
TTH
TTT
Each is equally likely (assuming the coins are unbiased) and only 2 of the 8 have each coin
showing the same face. Hence
p(\text{3 coins having same face}) = \frac{2}{8} = 0.25
1.1.2 Rules for combining probabilities
The Summation rule applies to events that are mutually exclusive, i.e. events that cannot occur simultaneously. E.g. a die cannot show both 3 and 6 at the same time.
Sum Rule: if X_1, X_2, \ldots, X_N are mutually exclusive events, then

p(X_1 \text{ or } X_2 \text{ or } \ldots \text{ or } X_N) = \sum_{i=1}^{N} p(X_i)
Hence, the probability of a die showing 1 or 2 or 3 is

p(1 \text{ or } 2 \text{ or } 3) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = 0.5
The product rule applies to events that are independent, i.e. events that have no bearing on each other. E.g. if two dice are thrown, then the numbers on each are independent.
Product Rule: if X_1, X_2, \ldots, X_N are independent events, then

p(X_1 \text{ and } X_2 \text{ and } \ldots \text{ and } X_N) = \prod_{i=1}^{N} p(X_i)

Hence, the probability of two thrown dice showing the first die as a 1 AND the second die as a 3 is

p(\text{first} = 1 \text{ AND second} = 3) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36} = 0.02777\ldots
1.2 Conditional Probability
In reality many events are not independent. For example, in a set of letters from the English
language, if a Q occurs, then it is likely that a U follows it. This kind of scenario is handled with
Conditional Probabilities.
For two dependent events A and B, the conditional probabilities are defined by

p(A \text{ and } B) = p(A)\,p(B/A) = p(B)\,p(A/B)

where p(B/A) is the probability that B occurs on condition that A occurs, and p(A/B) is the probability that A occurs on condition that B occurs.
E.G. given:
p(Cold Weather) = 0.3
p(Cloudy Weather) = 0.4
p( Both Cold Weather and Cloudy Weather) = 0.2
then the probability that it is cold given that it is cloudy can be calculated as:

p(\text{Both Cold and Cloudy}) = p(\text{Cloudy})\,p(\text{Cold/Cloudy})

Hence

p(\text{Cold/Cloudy}) = \frac{p(\text{Both Cold and Cloudy})}{p(\text{Cloudy})} = \frac{0.2}{0.4} = 0.5
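As a quick check, the same calculation in a short Python sketch (the variable names here are just illustrative):

p_cloudy = 0.4
p_cold_and_cloudy = 0.2

# p(Cold/Cloudy) = p(Both Cold and Cloudy) / p(Cloudy)
p_cold_given_cloudy = p_cold_and_cloudy / p_cloudy
print(p_cold_given_cloudy)              # 0.5
# and p(Cloudy) * p(Cold/Cloudy) recovers the joint probability
print(p_cloudy * p_cold_given_cloudy)   # 0.2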
1.3 Information and its Quantification
Information Theory deals with the quantification, coding, and storage/communication of information.
Original work by Nyquist (1924) on communications (telegraph/telephone).
Hartley (1928): basic ideas on information.
But the key concepts and bounds were developed by Shannon (1949).
Most work since then concerns achieving the bounds and performance promised by Shannon's work.
We can work with a generic system model (fig 1.1), but the basic ideas extend to any real system such as:
Communications over wire and wireless links.
Digital TV/Audio/Satellite broadcast.
CD/DVD/Hard disk storage.
DSL, modems.
...
Note that storage channels are similar to communications channels: communications transmit information from one location to another, while storage transmits information from one time to another (in the future).
In considering information, a number of issues need to be considered:
[Figure 1.1: Generic Information System; block diagram with Data Source, Compression, Encryption, Error Control Coding, Channel + Noise, Error Correction/Detection, Decryption, Decompression, Data Sink, with side inputs for Dictionary/codes, Keys and REQ/ACK]
Accuracy of representation e.g music quality.
Bandwidth available for transmission.
Reliability e.g. robustness to noise.
Security.
Complexity of transmit/receive equipment
...
Hence we need quantitative measures.
Consider the following information.
"It will be raining as usual this week"
This information is well known and provides us with little real information because we know
this already.
However, the following sentence
"We will have an earthquake tomorrow"
provides us with a large amount of information as this is very unexpected and is a large
surprise.
Hence, the amount of information contained in a message should somehow relate to the surprise of the message.
I: A good information metric should take account of the probability of the message, i.e. a low probability message should have a higher information content.
We would also like to be able to add information quantities.
II: if two equally likely messages were sent, we should receive twice as much information as from a single such message.
Consider the following example:
We would like to send information identifying a location on a chess board (fig 1.2), assuming all locations are equally likely. This could be done in a number of ways.
Firstly, we could send a single message with one of 64 possible values. There are 64 locations and so the probability of a particular location i is

p_L(i) = \frac{1}{64}
[Figure 1.2: Chess Board; an 8 x 8 grid with rows and columns numbered 1 to 8]
However, we could instead send two messages X and Y to identify the x and y coordinates. Each of these messages has 8 possible values. Hence

p_X(x) = \frac{1}{8}

and

p_Y(y) = \frac{1}{8}

Both cases must contain the same amount of information. Hence if our information measure is f(p), where p is the message probability, then we require

f(p_L(i)) = f(p_X(x)) + f(p_Y(y))
and that

f(p_1) > f(p_2) \quad \text{if } p_1 < p_2

The only functional form that meets these requirements is

f(p) = -\log_b(p)

Hence the measure of information contained in a message of probability p_i is

I(\text{message}_i) = -\log_b(p_i)
As

\log_b(x) = \frac{\log_a(x)}{\log_a(b)}

the base of the logarithm is not important but only sets a scaling for the measure, or in effect defines the units. By convention, the following units are commonly employed:
Log to the base 2 implies units of bits.
Log to the base 10 implies units of hartleys.
Log to the base e implies units of nats.
Hence for the chess board example, the information content of a location message (or of the two coordinate messages) is

-\log_2 \frac{1}{64} = -\log_2 \frac{1}{8} - \log_2 \frac{1}{8} = 6 \text{ bits}

or

-\log_{10} \frac{1}{64} = -\log_{10} \frac{1}{8} - \log_{10} \frac{1}{8} = 1.806 \text{ hartleys}

or

-\log_e \frac{1}{64} = -\log_e \frac{1}{8} - \log_e \frac{1}{8} = 4.159 \text{ nats}

Note:

\log_e(x) = \frac{\log_2(x)}{\log_2(e)} \approx 0.6932 \log_2(x)

Hence 1 bit \approx 0.6932 nats.
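The unit conversions above can be checked with a small Python sketch (the function name information() is just illustrative, not from the notes):

import math

def information(p, base=2):
    # Self-information -log_base(p) of a message with probability p
    return -math.log(p) / math.log(base)

p_location = 1 / 64                      # one of 64 equally likely chess-board squares
print(information(p_location, 2))        # 6.0 bits
print(information(p_location, 10))       # ~1.806 hartleys
print(information(p_location, math.e))   # ~4.159 nats
# Additivity: two coordinate messages of probability 1/8 each
print(information(1/8) + information(1/8))   # 6.0 bits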
1.4 Average Information : Entropy
In practice, we send many messages and are interested in the Average Information.
If a source can produce N different messages with probabilities

p_i, \quad i = 1 \ldots N

then on average message i is produced a fraction p_i of the time, and the average information produced is

\text{Average } I = \sum_{i=1}^{N} p_i I(\text{message}_i)

but as

I(\text{message}_i) = -\log(p_i)

the average information produced is

\text{Average } I = -\sum_{i=1}^{N} p_i \log(p_i)

The average information is known as the Entropy of the source and is denoted by H, hence

H = -\sum_{i=1}^{N} p_i \log(p_i)
The term Entropy comes from a similar expression in Thermodynamics.
Entropy is often expressed in bits per symbol.
1.4.1 Entropy Example I
Consider a source with an alphabet having characters with the probabilities in table 1.1.
What is the entropy H of this source?

H = -\sum_{i=1}^{N} p_i \log(p_i)
  = -0.5 \log_2(0.5) - 0.25 \log_2(0.25) - 0.125 \log_2(0.125) - 0.125 \log_2(0.125)
  = 1.75 \text{ bits per symbol}
Character Probability
a 0.5
b 0.25
c 0.125
d 0.125
Table 1.1: Example alphabet
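The entropy sum is easy to compute directly; a minimal Python sketch (the entropy() helper is illustrative only) using the probabilities of table 1.1:

import math

def entropy(probs, base=2):
    # H = -sum p_i * log(p_i); terms with p == 0 contribute nothing
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits per symbol (table 1.1)
print(entropy([1/8, 7/8]))                  # ~0.544 bits per symbol (Example II below)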
1.4.2 Entropy Example II
A binary source produces a series of bits (1 or 0) with probabilities p(1) = 1/8 and p(0) = 7/8. Calculate the entropy H of this source.

H = -\sum_{i=1}^{N} p_i \log(p_i)
  = -(1/8) \log_2(1/8) - (7/8) \log_2(7/8)
  = 0.544 \text{ bits per symbol}

Note that even though 1 binary digit is produced for each symbol, the actual average information is 0.544 bits due to the probabilities.
If p(1) = p then p(0) = 1 - p and the entropy is

H = -p \log_2(p) - (1 - p) \log_2(1 - p)

This is plotted in fig 1.3. Note there is no information if p = 0 or p = 1, and maximum information when p = 1/2, i.e. bits 0 and 1 equally likely.
1.4.3 Maximizing H
Given N symbols with probabilities p_i, how can we choose each individual value of p_i to maximize the entropy subject to the constraint that \sum_i p_i = 1? The solution can be obtained by the method of Lagrange multipliers. Recall:

H = -\sum_{i=1}^{N} p_i \log(p_i)

Consider the expression \Phi = H + \lambda(\sum_i p_i - 1), with \lambda being the Lagrange multiplier. Finding the stationary point of \Phi is equivalent to maximizing H subject to the constraint. This can be done by differentiating \Phi with respect to p_j for j = 1, 2, \ldots, N and with respect to \lambda, to form N + 1 equations in N + 1 unknowns.

\frac{d\Phi}{d\lambda} = \left(\sum_i p_i - 1\right) = 0 \text{ for a max or min}

\sum_i p_i - 1 = 0

and
[Figure 1.3: Entropy of binary source; H plotted against p]
\frac{d\Phi}{dp_j} = \frac{d}{dp_j} H + \frac{d}{dp_j} \lambda\left(\sum_i p_i - 1\right)
  = -p_j \frac{d}{dp_j} \log(p_j) - \log(p_j) \frac{d}{dp_j} p_j + \lambda \frac{d}{dp_j} p_j
  = -p_j \frac{1}{p_j} - \log(p_j) + \lambda
  = -1 - \log(p_j) + \lambda
  = 0 \text{ for a max or min}

\log(p_j) = \lambda - 1 \quad \text{for all } j
p_j = e^{\lambda - 1} \quad \text{for all } j

but

\sum_i p_i = 1
\sum_i e^{\lambda - 1} = 1
N e^{\lambda - 1} = 1
\lambda - 1 = \log \frac{1}{N}
p_j = e^{\lambda - 1} = \frac{1}{N}

Hence to maximize H requires each symbol probability p_i to be equal, with each value being \frac{1}{N}.
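A quick numerical sanity check of this result, as a Python sketch (an illustration only, not a proof): randomly drawn distributions over N = 4 symbols never exceed the entropy of the uniform distribution.

import math
import random

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

N = 4
h_uniform = entropy([1 / N] * N)
print(h_uniform)                 # log2(4) = 2.0 bits, the maximum

for _ in range(1000):
    raw = [random.random() for _ in range(N)]
    total = sum(raw)
    probs = [r / total for r in raw]
    assert entropy(probs) <= h_uniform + 1e-9
print("no counterexample found")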
1.5 Redundancy
When the entropy is less than its maximum value, there is Redundancy in the source: e.g. when p(1) = 1/8 and p(0) = 7/8 and H = 0.544 bits per symbol, the source could have had an entropy of 1 bit per symbol, hence 0.456 bits per symbol were redundant.
This can often be of use.
Consider a source that transmits a binary 1 as a triplet of 111 and a binary 0 as a triplet of
000. Clearly 2/3 of the binary digits transmitted are redundant ( even assuming binary 1s and
binary 0 are equally likely ).
However, if a receiver receives the transmitted data in the presence of noise, then it can still
correctly recover the information even when a single bit error occurs.
This can be done by taking a majority decision.
In this case, two or more bit errors within a span of 3 bits must occur in order to cause a message error. This is much less likely than a single bit error (provided bit errors are independent).
Hence Redundancy can provide error correction and detection if used properly.
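A minimal Python sketch of this triplet (repetition) scheme and its majority-vote decoding; the function names are illustrative only:

def encode_repetition(bits):
    # Repeat each data bit three times: 1 -> 1 1 1, 0 -> 0 0 0
    return [b for bit in bits for b in (bit, bit, bit)]

def decode_repetition(chan_bits):
    # Majority vote over each group of 3 received bits
    out = []
    for i in range(0, len(chan_bits), 3):
        group = chan_bits[i:i + 3]
        out.append(1 if sum(group) >= 2 else 0)
    return out

tx = encode_repetition([1, 0, 1])   # [1,1,1, 0,0,0, 1,1,1]
rx = tx[:]
rx[1] ^= 1                          # a single bit error in the first triplet
print(decode_repetition(rx))        # [1, 0, 1]  (still decoded correctly)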
The English language contains about 80% redundancy.
Consider the sentence
INGORXATPON TQEORY IS UKEFUL
It can be readily understood despite several letter errors.
1.6 Information in Language
Consider the English language with 26 letters.
If each was equally likely, then the entropy would be

H = -\sum_{i=1}^{26} \frac{1}{26} \log_2\left(\frac{1}{26}\right) = 4.7 \text{ bits per letter}
In reality each letter is far from equally likely and calculating the Entropy assuming each
letter was independent results in an Entropy of 4.1 bits per symbol.
However, the real English language also has a large amount of dependency between letters.
e.g the probability of the letter u is 0.0241 but the probability of the letter u following the
letter q is much higher.
From probability theory, the joint probability of two events A and B if they are independent
is
p(A, B) = p(A)p(B)
However, if they are NOT independent then the joint probability is
p(A, B) = p(A)p(B/A)
where p(B/A) is the probability of event B on condition that A occurred i.e. the conditional
probability.
E.g. in the English language

p(\text{'T'}) = 0.0821

and

p(\text{'TH'}) = 0.0255

Hence the conditional probability

p(\text{'H'}/\text{'T'}) = \frac{p(\text{'TH'})}{p(\text{'T'})} = \frac{0.0255}{0.0821} = 0.3106

Hence, if we have received the letter 'T', then if the next letter is an 'H' this provides less information (i.e. it has a higher probability) due to the interdependence of the letters.
If we assume a simple language where the interdependence of the letters only extends over 2 letters, then consider the case of receiving the letter j given that we have just received i.
The information that we actually receive when j arrives is

-\log p(j/i)

and hence the conditional entropy of the source can be calculated as

H(j/i) = -\sum_i p(i) \sum_j p(j/i) \log p(j/i)

or

H(j/i) = -\sum_i \sum_j p(i, j) \log p(j/i)
1.6.1 Conditional Entropy Example
Consider a simple language with two symbols A and B.
Let a representative sequence from the sample be
AABBBAAAABBAAABBBAAA
and assume the next symbol was an A when calculating pairs.
Estimate the individual(single), joint and conditional probabilities of A and B?
Estimate the conditional entropy of the sequence and the redundancy of the source.
Note: the redundancy R can be calculated as

R = 1 - \frac{H}{H_{max}}
Solution
Individual probabilities:
p(A) = 12/20
p(B) = 8/20
Joint probabilities:
p(A, A) = 9/20
p(A, B) = 3/20
p(B, A) = 3/20
p(B, B) = 5/20
Conditional probabilities:
p(A/A) = 9/12
p(A/B) = 3/8
p(B/A) = 3/12
p(B/B) = 5/8
Conditional entropy:

H(j/i) = -\sum_i \sum_j p(i, j) \log_2 p(j/i)

here,

H(j/i) = -\sum_{i \in \{A,B\}} \sum_{j \in \{A,B\}} p(i, j) \log_2 p(j/i)
       = -p(A, A) \log_2 p(A/A) - p(A, B) \log_2 p(B/A) - p(B, A) \log_2 p(A/B) - p(B, B) \log_2 p(B/B)
       = -\frac{9}{20} \log_2 \frac{9}{12} - \frac{3}{20} \log_2 \frac{3}{12} - \frac{3}{20} \log_2 \frac{3}{8} - \frac{5}{20} \log_2 \frac{5}{8}
       = 0.8685 \text{ bits per symbol}
Redundancy: The maximum entropy of this two-symbol (binary) source is 1 bit per symbol. Hence

R = 1 - \frac{0.8685}{1} = 0.1315 \approx 13\%

In this example, if we ignored the conditional probabilities, the entropy would be 0.971 bits and the redundancy 0.029 or 3%; hence most of the redundancy is due to the interdependence of the symbols.
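The probability estimates and the conditional entropy can also be produced directly from the sequence; a Python sketch (assuming, as above, that the symbol after the last one is an A):

import math
from collections import Counter

seq = "AABBBAAAABBAAABBBAAA"
pairs = [(seq[i], seq[i + 1]) for i in range(len(seq) - 1)] + [(seq[-1], "A")]

n = len(seq)
p_single = {s: c / n for s, c in Counter(seq).items()}
p_joint = {ij: c / n for ij, c in Counter(pairs).items()}

# H(j/i) = -sum p(i,j) log2 p(j/i), with p(j/i) = p(i,j) / p(i)
h_cond = -sum(pij * math.log2(pij / p_single[i]) for (i, j), pij in p_joint.items())
print(round(h_cond, 4))      # ~0.8685 bits per symbol
print(round(1 - h_cond, 4))  # redundancy ~0.1315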
The interdependence of the symbols can extend over more than two symbols, and the conditional concepts can be suitably extended: e.g.

H(l/i, j, k) = -\sum_i \sum_j \sum_k \sum_l p(i, j, k, l) \log p(l/i, j, k)
and so on.
1.7 Redundancy in printed English
26 equi-probable symbols or letters would have an entropy of

H = -\sum_{i=1}^{26} \frac{1}{26} \log_2 \frac{1}{26} = 4.7 \text{ bits per letter}
However, redundancy will reduce this.
Redundancy in the English language arises in a number of ways:
Letters having different probabilities.
Interdependence between adjacent letters.
Interdependence between letters in a word.
Interdependence between words in a sentence.
E.G. consider the letters DISTRIBUT
The next letters ION or IVE provide little additional information.
Extensive study has revealed that the average information of printed English is approximately 1 bit per letter. The redundancy of printed English is thus

R = 1 - \frac{H}{H_{max}} = 1 - \frac{1}{4.7} \approx 79\%
Note that the ASCII system used in computers allocates 8 bits per character, and thus text storage in computers is quite inefficient (although very convenient). This is why compression of text files provides substantial savings in space (at the cost of time and convenience).
The redundancy in the English language provides considerable error correction and detection capability. This allows communication in the presence of noise and other interference.
In cases where even this redundancy is insufficient, the International Phonetic Alphabet (table 1.2) can be used. Example uses include military radio, aircraft control, etc.
Letter Phonetic Letter Phonetic Letter Phonetic Letter Phonetic
A Alpha B Bravo C Charlie D Delta
E Echo F Foxtrot G Golf H Hotel
I India J Juliett K Kilo L Lima
M Mike N November O Oscar P Papa
Q Quebec R Romeo S Sierra T Tango
U Uniform V Victor W Whiskey X X-ray
Y Yankee Z Zulu
Table 1.2: International Phonetic Alphabet
1.8 Information in Noisy Channels
We saw that the information provided about an event depended on its probability and was given by:

I(\text{event}) = -\log(p)

where p is the probability of the event.
When we know that the event has occurred, further messages that the event has occurred
provide no information ( as p is now 1.0).
In reality, there is usually some chance that the information we received was in error due to
transmission errors, noise and interference.
This means that at a receiver, after the message is received, the probability of the event is
slightly less than unity due to the uncertainty.
1.8.1 Random Noise
Random Noise occurs in all physical systems due to their thermal energy.
In particular, all electrical components that dissipate power will produce Johnson noise. It is fundamental to the component and depends on the absolute temperature of the component:

\overline{v_R^2} = 4kTBR

where:
\overline{v_R^2} = average of the noise voltage squared
k = Boltzmann's constant = 1.38 \times 10^{-23} J/K
T = absolute temperature (in Kelvin)
R = resistor value (in \Omega)
B = bandwidth of the measurement (in Hz).
Semiconductors such as transistors and diodes have additional noise sources such as shot
noise and 1/f Noise.
Therefore any signal processing (even digital signal processing) applied to a signal will add
noise to the signal.
Fig 1.4 shows one method of transmitting binary data over a channel. The received signal has noise added due to the channel, amplifiers, etc.
Assuming the received signal is sampled correctly, the data can be recovered but noise may
cause errors in the detected data at the receiver.
The channel can be characterized by the Mean Error Rate.
[Figure 1.4: Transmission of Binary Data; bipolar waveform (+V / 0V / -V) for the bit sequence 1 0 0 1 1 0 with clock period T, showing an error due to noise]
1.8.2 Quantity of Information in a Noisy channel
Consider the system shown in fig 1.5.
An event with probability p_0 is transmitted along a noisy channel.
After its reception, the probability of the event is p_1 due to the uncertainty in the transmission.
Then the message is sent over a noiseless (ideal) channel. This converts the probability of the event at the receiver from p_1 to unity (1.0), as there are no errors in the noiseless channel.
Thus, this second channel transmits an amount of information -\log p_1.
[Figure 1.5: Transmission of information in a noisy channel; Tx sends an event of probability p_0 over a noisy channel (probability p_1 at the output), followed by a noiseless channel that raises the probability to 1]
Let the quantity of information transmitted over the noisy channel be I. Then the total information transmitted by both channels in sending the complete information about an event with probability p_0 is -\log p_0. Hence

I - \log p_1 = -\log p_0

or the information transmitted over the noisy channel is

I = -\log p_0 + \log p_1 = \log \frac{p_1}{p_0}

p_0 is known as the a priori probability, i.e. the probability before the information is received.
p_1 is known as the a posteriori probability, i.e. the probability after the information is received.
1.8.3 Example of Information transmitted
A binary system produces symbols +1s and -1s with equal probability. On average, 1/8 of
all symbols are received in error due to noise.
Find the Information received for all combinations of inputs and outputs.
Solution
There are 4 combinations of inputs and outputs:
+1 transmitted and +1 received
-1 transmitted and -1 received
+1 transmitted and -1 received
-1 transmitted and +1 received
For +1 transmitted and +1 received, the a priori probability was 1/2 and the a posteriori probability is 7/8, so

I(+1 \text{ txed}, +1 \text{ rxed}) = \log_2 \frac{7/8}{1/2} = 0.807 \text{ bits}

and similarly

I(-1 \text{ txed}, -1 \text{ rxed}) = \log_2 \frac{7/8}{1/2} = 0.807 \text{ bits}
I(+1 \text{ txed}, -1 \text{ rxed}) = \log_2 \frac{1/8}{1/2} = -2.0 \text{ bits}
I(-1 \text{ txed}, +1 \text{ rxed}) = \log_2 \frac{1/8}{1/2} = -2.0 \text{ bits}

The average information transmitted is

\frac{1}{2} \cdot \frac{7}{8} \cdot 0.807 + \frac{1}{2} \cdot \frac{7}{8} \cdot 0.807 + \frac{1}{2} \cdot \frac{1}{8} \cdot (-2) + \frac{1}{2} \cdot \frac{1}{8} \cdot (-2) = 0.456 \text{ bits}
Notes:
Even when the correct data is received, the information received is less than 1 bit ( due
to our uncertainty in the reliability of the transmission ).
The incorrect data results in negative information.
The average information is always positive ( or zero ).
If the average information was negative, we would just invert our decisions to make it positive.
The zero information occurs when the probability of error is 1/2.
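A short Python sketch of this example, computing the information for the correct and incorrect cases and the average (variable names are illustrative):

import math

p_error = 1 / 8
p_prior = 1 / 2                                  # +1 and -1 equally likely a priori

i_correct = math.log2((1 - p_error) / p_prior)   # a posteriori 7/8: +0.807 bits
i_wrong = math.log2(p_error / p_prior)           # a posteriori 1/8: -2.0 bits

# Average over the four tx/rx combinations, weighted by their probabilities
avg = 2 * (p_prior * (1 - p_error)) * i_correct + 2 * (p_prior * p_error) * i_wrong
print(round(i_correct, 3), i_wrong, round(avg, 3))   # 0.807 -2.0 0.456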
1.8.4 General Expression for Information Transfer
For our noisy system denote the inputs x and the outputs y.
The probability of an event in the source appears at the receiver as a conditional probability
( conditional on the received symbol ).
Let the channel have a set of input symbols \{x_i\} and a corresponding set of output symbols \{y_i\} such that x_i \to y_i (in the absence of an error).
The a posteriori probability that symbol x_i was transmitted given that y_i was received is p(x_i/y_i).
The a priori probability that symbol x_i was transmitted is just p(x_i), hence the information transferred is

I(x_i, y_i) = \log \frac{p(x_i/y_i)}{p(x_i)}

E.g. if p(x_i) = 1/2 for binary data and the channel has no noise, then p(x_i/y_i) = 1 and

I(x_i, y_i) = \log_2 \frac{1}{1/2} = 1 \text{ bit}
Noting that

p(x, y) = p(x)p(y/x) = p(y)p(x/y) = p(y, x)

we can write

I(x_i, y_i) = \log \frac{p(x_i/y_i)}{p(x_i)} = \log \frac{p(y_i/x_i)}{p(y_i)} = \log \frac{p(x_i, y_i)}{p(x_i)p(y_i)}

As in the case of entropy, we would like the average information over all the available symbols, and so we sum over all the combinations of inputs and outputs, accounting for their probability. This average information transfer is called the Information Transfer or Mutual Information and can be written as

I(x, y) = \sum_{x_i} \sum_{y_i} p(x_i, y_i) I(x_i, y_i)

or

I(x, y) = \sum_{x_i} \sum_{y_i} p(x_i, y_i) \log \frac{p(x_i, y_i)}{p(x_i)p(y_i)}
Notes:
The expression is symmetric in x and y.
If x and y are independent, then p(x_i, y_i) = p(x_i)p(y_i), so I(x, y) = 0 and no information is transferred, as expected.
If x and y are totally dependent, then p(x_i, y_i) = p(x_i) and I(x, y) = -\sum_{x_i} p(x_i) \log p(x_i) = H, i.e. the information transferred is the entropy of the source, as expected.
1.8.5 Example of Mutual Information
A binary system produces symbols +1s with probability 0.7 and -1s with probability 0.3.
2/7 of the +1s are received in error and 1/3 of the -1s are received in error.
Find the Mutual Information.
Solution
The following information is given:

p(x = +1) = 0.7
p(x = -1) = 0.3
p(y = -1 | x = +1) = 2/7
p(y = +1 | x = -1) = 1/3

Hence:

p(y = +1 | x = +1) = 1 - 2/7 = 5/7
p(y = -1 | x = -1) = 1 - 1/3 = 2/3

And as p(x, y) = p(x)p(y/x) = p(y)p(x/y) = p(y, x),

p(x = +1, y = +1) = p(x = +1)\,p(y = +1 | x = +1) = 0.7 \times 5/7 = 0.5

Similarly

p(x = +1, y = -1) = 0.7 \times 2/7 = 0.2
p(x = -1, y = +1) = 0.3 \times 1/3 = 0.1
p(x = -1, y = -1) = 0.3 \times 2/3 = 0.2

Also p(y) = \sum_x p(x, y) and so

p(y = +1) = p(x = +1, y = +1) + p(x = -1, y = +1) = 0.5 + 0.1 = 0.6

and

p(y = -1) = 0.2 + 0.2 = 0.4
Now I(x, y) can be calculated as

I(x, y) = \sum_{x_i} \sum_{y_i} p(x_i, y_i) \log \frac{p(x_i, y_i)}{p(x_i)p(y_i)}
        = 0.5 \log_2 \frac{0.5}{0.7 \times 0.6}   \quad \{+1, +1\}
        + 0.2 \log_2 \frac{0.2}{0.7 \times 0.4}   \quad \{+1, -1\}
        + 0.1 \log_2 \frac{0.1}{0.3 \times 0.6}   \quad \{-1, +1\}
        + 0.2 \log_2 \frac{0.2}{0.3 \times 0.4}   \quad \{-1, -1\}
        = 0.0913 \text{ bits per symbol}

Note the low information transfer rate due to the high probability of error.
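The same result can be obtained numerically from the joint probabilities; a Python sketch (the dictionaries are just one convenient representation):

import math

# Joint probabilities p(x, y) from the example above
p_xy = {(+1, +1): 0.5, (+1, -1): 0.2, (-1, +1): 0.1, (-1, -1): 0.2}

p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x) for x in (+1, -1)}
p_y = {y: sum(p for (_, yi), p in p_xy.items() if yi == y) for y in (+1, -1)}

mutual_info = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(round(mutual_info, 4))   # ~0.0913 bits per symbol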
1.9 Equivocation
An alternative interpretation of information transfer may be obtained from the concept of Equivocation. Consider the binary system in fig 1.6.
In this case the observer looks at both the transmitted and received bits; if there is an error it transmits a 0, otherwise it transmits a 1. Hence, the observer provides information to the receiver, and the average information it provides is

-p(0) \log p(0) - p(1) \log p(1)

where p(0) is simply the channel probability of error p_e. The Equivocation is the amount of information provided by the ideal observer, or equivalently the amount of information destroyed by errors in the channel.
[Figure 1.6: Transmission of information with an observer; Tx sends over a noisy channel to Rx, while an observer sees both the transmitted and received bits and signals errors over a noiseless channel]
E.G.
A binary system produces symbols +1s and -1s with equal probability. On average, 1/8 of
all symbols are received in error due to noise. Find the Information transmitted.
Here, the probability of error is p_e = 1/8, hence an ideal observer provides average information of

-p_e \log_2 p_e - (1 - p_e) \log_2(1 - p_e)
= -\frac{1}{8} \log_2 \frac{1}{8} - \left(1 - \frac{1}{8}\right) \log_2\left(1 - \frac{1}{8}\right)
= 0.544 \text{ bits per symbol}

Therefore the channel must have transmitted

1 - 0.544 = 0.456 \text{ bits per symbol}

as the total information provided by the source is 1 bit per symbol.
The Equivocation here is the 0.544 bits per symbol.
1.10 General Expression for Equivocation
In general, the probability of an event changes from p(x) at the source to p(x/y) at the receiver when the message is received.
When the ideal observer provides the information to correct any errors, the probability at the receiver changes to unity. Hence, the information provided by the ideal observer is

\log \frac{1}{p(x/y)} = -\log p(x/y)

Averaging over all combinations of inputs and outputs as before, the average Equivocation, denoted H(x/y), is given by

H(x/y) = -\sum_{x_i} \sum_{y_i} p(x_i, y_i) \log p(x_i/y_i)

The average information (or mutual information) provided by the noisy channel is then the entropy of the source minus the average Equivocation, or:

I(x, y) = H(x) - H(x/y)
This should be equivalent to our previous expression:

I(x, y) = \sum_{x_i} \sum_{y_i} p(x_i, y_i) \log \frac{p(x_i, y_i)}{p(x_i)p(y_i)}
        = \sum_{x_i} \sum_{y_i} p(x_i, y_i) \log \frac{p(x_i, y_i)}{p(y_i)} - \sum_{x_i} \sum_{y_i} p(x_i, y_i) \log p(x_i)
        = \sum_{x_i} \sum_{y_i} p(x_i, y_i) \log p(x_i/y_i) - \sum_{x_i} \sum_{y_i} p(x_i, y_i) \log p(x_i)
        = -H(x/y) - \sum_{x_i} \left[ \log p(x_i) \sum_{y_i} p(x_i, y_i) \right]
        = -H(x/y) - \sum_{x_i} \left[ p(x_i) \log p(x_i) \right]
        = H(x) - H(x/y)

as expected.
Similarly, it can readily be shown that

I(x, y) = H(y) - H(y/x)

where H(y/x) is a backwards equivocation.
E.g. repeat example 1.8.5 using the concept of equivocation.
Solution

H(x) = -\sum_{x_i} p(x_i) \log p(x_i) = -0.7 \log_2 0.7 - 0.3 \log_2 0.3 = 0.881 \text{ bits per symbol}

H(y) = -\sum_{y_i} p(y_i) \log p(y_i) = -0.6 \log_2 0.6 - 0.4 \log_2 0.4 = 0.971 \text{ bits per symbol}

and

H(y/x) = -\sum_{x_i} \sum_{y_i} p(x_i, y_i) \log p(y_i/x_i)
       = -0.5 \log_2 (5/7)   \quad \{+1, +1\}
       - 0.2 \log_2 (2/7)    \quad \{+1, -1\}
       - 0.1 \log_2 (1/3)    \quad \{-1, +1\}
       - 0.2 \log_2 (2/3)    \quad \{-1, -1\}
       = 0.879 \text{ bits per symbol}

Hence,

I(x, y) = H(y) - H(y/x) = 0.971 - 0.879 = 0.0913 \text{ bits per symbol}

as before.
As an exercise, calculate H(x) - H(x/y).
The joint entropy can also be defined as

H(x, y) = -\sum_{x_i} \sum_{y_i} p(x_i, y_i) \log p(x_i, y_i)

and can be shown to be equal to

H(x, y) = H(x/y) + H(y/x) + I(x, y)

but it does not play a particularly useful role here.
1.11 Summary for expressions
Fig 1.7 shows a Venn Diagram Interpretation of the expressions for equivocation and mutual
information.
The expressions can be summarized as follows:
Source Entropy:

H(x) = -\sum_{x_i} p(x_i) \log p(x_i)

Receiver Entropy:

H(y) = -\sum_{y_i} p(y_i) \log p(y_i)

Equivocation Entropy:

H(x/y) = -\sum_{x_i} \sum_{y_i} p(x_i, y_i) \log p(x_i/y_i)
H(y/x) = -\sum_{x_i} \sum_{y_i} p(x_i, y_i) \log p(y_i/x_i)

Information Transfer or Mutual Information:

I(x, y) = \sum_{x_i} \sum_{y_i} p(x_i, y_i) \log \frac{p(x_i, y_i)}{p(x_i)p(y_i)}
[Figure 1.7: Venn diagram interpretation; H(x) and H(y) overlap in I(x, y), with H(x/y) and H(y/x) as the non-overlapping parts]
I(x, y) = H(x) - H(x/y) = H(y) - H(y/x)

Joint Entropy:

H(x, y) = -\sum_{x_i} \sum_{y_i} p(x_i, y_i) \log p(x_i, y_i)
        = H(x/y) + H(y/x) + I(x, y)
1.12 Channel Capacity
For a given channel, i.e. given bandwidth, noise, interference, etc., an important quantity called the CHANNEL CAPACITY (C) can be defined as:

C = \max_{x \text{ distribution}} I(x, y)

where the maximization is evaluated over the distribution of the input symbols x only.
This Channel Capacity C plays an important role in information theory.
1.13 Binary Symmetric Channel
While our calculations apply to any channel with any number of input and output symbols, we now consider a simple binary channel which allows for simplified calculations. The basic channel is illustrated in fig 1.8.
The channel is characterized by a single parameter p_e, the probability of a bit error (with errors when a 0 or a 1 is transmitted being equally likely).
The capacity of this channel can be readily calculated (while for many real channels it can be very difficult to calculate the capacity).
Start with

I(x, y) = H(y) - H(y/x)
[Figure 1.8: Binary Symmetric Channel; inputs x in {0, 1} and outputs y in {0, 1}, with crossover probability p_e and correct-transition probability 1 - p_e]
I(x, y) = H(y) + \sum_{x_i} \sum_{y_i} p(x_i, y_i) \log p(y_i/x_i)
        = H(y) + \sum_{x_i} p(x_i) \sum_{y_i} p(y_i/x_i) \log p(y_i/x_i)

But for x_i = 0:

\sum_{y_i} p(y_i/x_i) \log p(y_i/x_i) = (1 - p_e) \log(1 - p_e)   \quad \{x_i = 0, y_i = 0\}
                                      + p_e \log(p_e)             \quad \{x_i = 0, y_i = 1\}

Similarly, for x_i = 1:

\sum_{y_i} p(y_i/x_i) \log p(y_i/x_i) = (1 - p_e) \log(1 - p_e)   \quad \{x_i = 1, y_i = 1\}
                                      + p_e \log(p_e)             \quad \{x_i = 1, y_i = 0\}

Thus for both values of x_i, the term \sum_{y_i} p(y_i/x_i) \log p(y_i/x_i) is the constant

(1 - p_e) \log(1 - p_e) + p_e \log(p_e)

Thus

I(x, y) = H(y) + \sum_{x_i} p(x_i) \left[ (1 - p_e) \log(1 - p_e) + p_e \log(p_e) \right]
        = H(y) + \left[ (1 - p_e) \log(1 - p_e) + p_e \log(p_e) \right] \sum_{x_i} p(x_i)

But \sum_{x_i} p(x_i) = 1 and so:

I(x, y) = H(y) + (1 - p_e) \log(1 - p_e) + p_e \log(p_e)

Denoting the term -\left[ (1 - p_e) \log(1 - p_e) + p_e \log(p_e) \right] as F(p_e), a function of the channel characteristic only, the mutual information is

I(x, y) = H(y) - F(p_e)
To maximize I(x, y), as required for the capacity, note that F(p_e) is fixed by the channel and hence H(y) must be maximized. H(y) is the entropy of the received symbol, and we've seen that H(y) for a binary symbol is maximized when it is equally likely to be either symbol, i.e. the probability of a received 1 symbol, p(y_i = 1), equals the probability of a received 0 symbol, p(y_i = 0). Therefore p(y_i = 1) = p(y_i = 0) = 1/2 and hence H(y) = 1.
The capacity can then be calculated as

C = \max_{x \text{ distribution}} I(x, y) = 1 - F(p_e)

However, as the channel is symmetric, the input probabilities must also be equal and so p(x_i = 1) = p(x_i = 0). Thus, the capacity and input distribution of symbols for the binary symmetric channel have been determined.
E.g. find the capacity of the binary symmetric channel with probability of error p_e = 1/8.

F(p_e) = -\left[ (1 - p_e) \log(1 - p_e) + p_e \log(p_e) \right]
       = -\left[ \left(1 - \frac{1}{8}\right) \log_2\left(1 - \frac{1}{8}\right) + \frac{1}{8} \log_2 \frac{1}{8} \right]
       = 0.544

Hence the channel capacity is

C = 1 - 0.544 = 0.456 \text{ bits per symbol}
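The capacity formula C = 1 - F(p_e) is easy to evaluate for any error probability; a Python sketch (the function name is illustrative):

import math

def bsc_capacity(pe):
    # C = 1 - F(pe), in bits per symbol
    if pe in (0.0, 1.0):
        return 1.0
    f = -((1 - pe) * math.log2(1 - pe) + pe * math.log2(pe))
    return 1.0 - f

print(round(bsc_capacity(1 / 8), 3))   # 0.456 bits per symbol
print(bsc_capacity(0.5))               # 0.0 (a useless channel)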
[Figure 1.9: Mutual information I(x, y) (bits/symbol) of the binary symmetric channel versus P(x = 0)]
[Figure 1.10: Capacity C (bits/symbol) of the binary symmetric channel versus p_e]
1.14 Information in Continuous Signals
So far, we've only looked at discrete symbols, but we can also have continuous signals. We can readily extend the same ideas for discrete symbols to continuous signals.
From signal processing theory, we know that any continuous signal may be completely represented by regularly spaced samples, provided a sufficient number of samples is taken.
In particular, the Sampling Theorem provides that all the information contained in a continuous signal may be completely represented by samples provided that the sample rate f_s is greater than twice the bandwidth of the signal B, i.e.:

f_s > 2B

Conversely, if the sample rate is less than this, then the continuous signal cannot be completely represented.
E.g. a speech signal with bandwidth 3 kHz must be sampled at 6 kHz to completely represent it.
E.g. a CD audio disc provides samples at 44 kHz and hence can represent music with a bandwidth of 22 kHz.
The Sampling Theorem provides a simple way of estimating the information content of a continuous signal. We will calculate the exact value later.
Consider a signal band-limited to a bandwidth B. We can represent the signal with 2B samples per second, and if each sample contains information I_s then the information rate of the signal is

R = 2B I_s

assuming all the samples are independent. Note, the information cannot be increased by faster sampling, as this will just produce additional samples which are not independent.
The information per sample I_s should be related to the number of different levels per sample and their probability. However, the number of different levels per sample that we can distinguish is infinite, unless we consider that the noise present will limit the resolution that we can resolve to, and that power constraints will limit the amplitude of the samples.
A simple estimate can then be calculated using the empirical rule that a sample can be resolved to no better than \sigma, the rms value of the noise.
Assuming the signal amplitude is limited to a range of \pm s, then the total number of levels is n \approx 2s/2\sigma and, assuming each is equally likely with probability \frac{1}{n}, then:

I_s \approx -\sum_{1}^{n} \frac{1}{n} \log \frac{1}{n} = -\log \frac{1}{n} = \log \frac{s}{\sigma}

Approximating the signal to noise power ratio as S/N \approx s^2/\sigma^2, the rate of information transmitted is:

R = 2B I_s \approx 2B \log \sqrt{S/N} = B \log(S/N)
This is a first approximation to the information content of a continuous signal. The key points are that:
information content is directly proportional to the bandwidth.
information content is related to the logarithm of the signal to noise ratio.
A more exact derivation now follows.
1.15 Relative Entropy of a Continuous Signal
Previously we had:

H = -\sum_{i=1}^{N} p_i \log(p_i)

For continuous signals, the discrete probabilities must be replaced with probability density functions and the summations replaced with integrals.
For a continuous signal, the probability that the signal value v lies in a range v_1 < v < v_2 is:

p(v_1 < v < v_2) = \int_{v_1}^{v_2} p(v)\,dv

where p(v) is the probability density function (pdf) for the signal. Note: as the signal must have some value,

\int_{-\infty}^{\infty} p(v)\,dv = 1

The expression for the entropy of a continuous signal, known as the Relative Entropy, can thus be written as

H = -\int_{-\infty}^{\infty} p(v) \log p(v)\,dv
1.15.1 Justification for Relative Entropy
Consider a random variable X with a pdf f(x) as shown in fig 1.11.
Divide the x-axis into divisions of width \Delta, with the i-th region (shown shaded in the diagram) centred on the value x_i.
[Figure 1.11: pdf f(x) of a random variable X, with the x-axis divided into intervals of width \Delta centred on x_i]
Let X_\Delta be a discrete random variable which takes the value x_i^\Delta when

i\Delta \le x < (i + 1)\Delta

hence the probability that X_\Delta takes the value x_i^\Delta is

p_i = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx \approx f(x_i)\Delta

with the approximation becoming exact as \Delta \to 0.
The entropy of the discrete random variable X_\Delta can be calculated as

H(X_\Delta) = -\sum_{i=-\infty}^{\infty} p_i \log p_i
            \approx -\sum_{i=-\infty}^{\infty} f(x_i)\Delta \log[f(x_i)\Delta]
            \approx -\sum_{i=-\infty}^{\infty} f(x_i)\Delta [\log f(x_i) + \log \Delta]
            \approx -\sum_{i=-\infty}^{\infty} f(x_i)\Delta \log[f(x_i)] - (\log \Delta) \sum_{i=-\infty}^{\infty} f(x_i)\Delta

However, as \Delta \to 0, the summations become integrations and

H(X_\Delta) = -\int_{-\infty}^{\infty} f(x) \log f(x)\,dx - \log \Delta \int_{-\infty}^{\infty} f(x)\,dx
            = -\int_{-\infty}^{\infty} f(x) \log f(x)\,dx - \log \Delta

as \sum_i f(x_i)\Delta = 1 because f(x) is a pdf. But -\int_{-\infty}^{\infty} f(x) \log f(x)\,dx is the definition of the relative entropy of X, i.e. H(X), and so

H(X_\Delta) = H(X) - \log \Delta
or

H(X) = H(X_\Delta) + \log \Delta

Hence, the relative entropy of X is the entropy of a quantized version (with the quantization step tending toward zero) plus an additional term. As \Delta \to 0, \log \Delta \to -\infty. The value of the relative entropy H(X) is normally finite, hence H(X_\Delta) \to \infty. However, if we only consider the difference between the relative entropy values of continuous random variables X and Y, then

H(X) - H(Y) = H(X_\Delta) + \log \Delta - \left[ H(Y_\Delta) + \log \Delta \right] \quad \text{as } \Delta \to 0
            = H(X_\Delta) - H(Y_\Delta) \quad \text{as } \Delta \to 0

Hence, the difference between relative entropy values can have a useful value.
In effect, the information in a continuous signal is infinite as it can take on an infinite number of values. However, in any real system noise will ultimately define how finely a continuous signal can be resolved, and subtracting the relative entropy of the noise from the relative entropy of the signal will result in a finite value.
1.15.2 Example: a Gaussian signal
A very common distribution that is often used in information and communications theory is the Gaussian distribution

p(v) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{v^2}{2\sigma^2}}

where \sigma is the standard deviation of the distribution, which has a mean of zero.
The relative entropy of this signal can be calculated using \log_e = \ln. Note:

\log_e p(v) = \ln p(v) = -\ln(\sigma\sqrt{2\pi}) - \frac{v^2}{2\sigma^2}

H = -\int_{-\infty}^{\infty} p(v) \log p(v)\,dv
  = \int_{-\infty}^{\infty} p(v) \ln(\sigma\sqrt{2\pi})\,dv + \int_{-\infty}^{\infty} p(v) \frac{v^2}{2\sigma^2}\,dv
  = \ln(\sigma\sqrt{2\pi}) + \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{v^2}{2\sigma^2}} \frac{v^2}{2\sigma^2}\,dv

From tables \int_{0}^{\infty} x^2 e^{-x^2}\,dx = \sqrt{\pi}/4 and

H = \ln(\sigma\sqrt{2\pi}) + \frac{1}{2} = \ln(\sigma\sqrt{2\pi e}) \quad \{\text{as } \tfrac{1}{2} = \ln(e^{1/2})\}

The units can be changed to bits by using \log_2, giving the relative entropy of a Gaussian signal with variance \sigma^2 as

H = \log_2(\sigma\sqrt{2\pi e})

As the power in a Gaussian signal at 1 sample per second is simply P = \sigma^2, this can also be written as

H = \log_2(\sqrt{2\pi e P})
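The closed form can be checked by numerical integration; a Python sketch for sigma = 1 (the integration limits and step count are arbitrary choices, not from the notes):

import math

def gaussian_pdf(v, sigma=1.0):
    return math.exp(-v * v / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

# Approximate H = -integral p(v) log2 p(v) dv by the midpoint rule
lo, hi, n = -10.0, 10.0, 20000
dv = (hi - lo) / n
h_numeric = 0.0
for k in range(n):
    p = gaussian_pdf(lo + (k + 0.5) * dv)
    if p > 0:
        h_numeric -= p * math.log2(p) * dv

h_closed = math.log2(math.sqrt(2 * math.pi * math.e))   # sigma = 1
print(round(h_numeric, 4), round(h_closed, 4))          # both ~2.0471 bits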
1.15.3 Example: Uniform Distribution
Calculate the relative entropy of a signal with a uniform distribution from -A/2 to A/2.
The probability density function for this signal is

p(v) = \frac{1}{A} \quad \text{for } -A/2 < v < A/2, \quad 0 \text{ elsewhere}

Hence

H = -\int_{-\infty}^{\infty} p(v) \log p(v)\,dv
  = -\int_{-A/2}^{A/2} \frac{1}{A} \log \frac{1}{A}\,dv
  = -\frac{1}{A} \log \frac{1}{A} \int_{-A/2}^{A/2} dv
  = -\log \frac{1}{A} = \log A

Note the power in this signal at 1 sample per second is

P = \int_{-\infty}^{\infty} p(v) v^2\,dv
  = \int_{-A/2}^{A/2} \frac{1}{A} v^2\,dv
  = \frac{1}{A} \left[ \frac{1}{3} v^3 \right]_{-A/2}^{A/2}
  = \frac{A^2}{12}

And hence

H = \log \sqrt{12P}

Note this is less than the Gaussian case of H = \log(\sqrt{2\pi e P}) \approx \log(\sqrt{17.08P})
1.15.4 Example: Binary Distribution
Calculate the relative entropy of a signal with a binary distribution with values of -A/2 and A/2.
The probability density function for this signal is

p(v) = \frac{\delta(v + A/2)}{2} + \frac{\delta(v - A/2)}{2}

where \delta(x) is an impulse at x = 0. Hence:

H = -\int_{-\infty}^{\infty} p(v) \log p(v)\,dv
  = -\int_{-\infty}^{\infty} \left[ \frac{\delta(v + A/2)}{2} + \frac{\delta(v - A/2)}{2} \right] \log \left[ \frac{\delta(v + A/2)}{2} + \frac{\delta(v - A/2)}{2} \right] dv
  = -\frac{1}{2} \log \frac{1}{2} - \frac{1}{2} \log \frac{1}{2}
  = -\log \frac{1}{2} = \log 2

And in bits

H = \log_2 2 = 1 \text{ bit}

It can be shown that a Gaussian distribution maximizes the relative entropy of any signal with a given power.
1.15.5 Proof that a Gaussian distribution maximizes relative entropy
Consider the relative entropy of a signal Y with pdf p(y). Let the signal power be \sigma^2, so

\int_{-\infty}^{\infty} y^2 p(y)\,dy = \sigma^2

Let g(y) be a zero mean Gaussian distribution with variance (power) \sigma^2, so

g(y) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{y^2}{2\sigma^2}}

Consider \int p(y) \ln \frac{1}{g(y)}\,dy:

\int p(y) \ln \frac{1}{g(y)}\,dy = \int p(y) \left[ -\ln g(y) \right] dy
  = \int p(y) \left[ -\ln \left( \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{y^2}{2\sigma^2}} \right) \right] dy
  = \int p(y) \left[ -\ln \frac{1}{\sigma\sqrt{2\pi}} - \ln e^{-\frac{y^2}{2\sigma^2}} \right] dy
  = \int p(y) \left[ \ln \sigma\sqrt{2\pi} + \frac{y^2}{2\sigma^2} \right] dy
  = \int p(y) \ln \sigma\sqrt{2\pi}\,dy + \int p(y) \frac{y^2}{2\sigma^2}\,dy
  = \ln \sigma\sqrt{2\pi} + \frac{\sigma^2}{2\sigma^2}
  = \ln \sigma\sqrt{2\pi} + \frac{1}{2}
  = \ln \sigma\sqrt{2\pi} + \ln(e^{1/2})
  = \ln \sigma\sqrt{2\pi e}

Or

\int p(y) \ln \frac{1}{g(y)}\,dy = \frac{1}{2} \ln 2\pi e \sigma^2

Now consider the expression H(Y) - \frac{1}{2} \ln 2\pi e \sigma^2:

H(Y) - \frac{1}{2} \ln 2\pi e \sigma^2 = -\int p(y) \ln p(y)\,dy - \int p(y) \ln \frac{1}{g(y)}\,dy
  = \int p(y) \ln \frac{1}{p(y)} + p(y) \ln g(y)\,dy
  = \int p(y) \ln \frac{g(y)}{p(y)}\,dy
  \le \int p(y) \left[ \frac{g(y)}{p(y)} - 1 \right] dy

as \ln z \le z - 1 for all z, with equality only at z = 1, and p(y) \ge 0. But

\int p(y) \left[ \frac{g(y)}{p(y)} - 1 \right] dy = \int g(y)\,dy - \int p(y)\,dy = 1 - 1 = 0

Hence

H(Y) - \frac{1}{2} \ln 2\pi e \sigma^2 \le 0

with equality (and therefore H(Y) maximized) when \frac{g(y)}{p(y)} = 1, i.e. when p(y) = g(y).
Therefore the relative entropy of a continuous random variable Y is maximized when its pdf takes on a Gaussian distribution, and the relative entropy in that case is

H(Y) = \frac{1}{2} \ln 2\pi e \sigma^2
1.16 Information Capacity of a Continuous Signal
We will consider a general signal with power S, in the presence of a noise power N, both limited in bandwidth to B Hz. The system is shown in fig 1.12. The signal is Gaussian, to maximize its relative entropy. The noise is a Gaussian white noise source.
[Figure 1.12: Continuous channel; input x with power S, channel of bandwidth B with added noise, output y with power S + N]
The information transfer is

I(x, y) = H(y) - H(y/x)

The backwards equivocation H(y/x) is the information needed to overcome the effect of the noise, and extending from the discrete case,

H(y/x) = H(\text{Noise}) = \log(\sqrt{2\pi e N})

The relative entropy at the output, which is just a Gaussian signal with power S + N, is

H(y) = \log(\sqrt{2\pi e (S + N)})

This yields a mutual information of

I(x, y) = \log(\sqrt{2\pi e (S + N)}) - \log(\sqrt{2\pi e N})
        = \log \frac{\sqrt{2\pi e (S + N)}}{\sqrt{2\pi e N}}
        = \frac{1}{2} \log \frac{S + N}{N}

But given the 2B samples per second required to represent the signal, the maximum information transfer (mutual information) per second, or capacity, is

C = 2B \cdot \frac{1}{2} \log \frac{S + N}{N} = B \log(1 + S/N)
This expression

C = B \log(1 + S/N)

is an important result in information theory. It gives the maximum information transfer rate over a band-limited channel with a given signal to noise ratio.
This shows that, fundamentally, the bandwidth and signal to noise ratio are the limiting parameters of a continuous channel; it is usually called the Shannon Capacity of a continuous channel.
Claude Shannon also proved that this was the maximum rate at which information could be transmitted over such a channel. In particular, he showed that information could be transmitted over such a channel error free up to this rate C, but could NOT be transmitted error free above this rate.
While Shannon's work provides a bound for communications systems to aim for, it is not a constructive proof in that it doesn't show how to achieve the promised performance. People working in the field of information theory have largely spent the 50+ years since Shannon's work trying to develop methods that achieve the Shannon Capacity of various channels.
While we will look at such methods later on, we can use the basic capacity expression to analyse some basic tradeoffs in systems.
The information sent in a time T is

I = CT = BT \log(1 + S/N) \text{ bits}

Thus to transmit a given number of bits, we can trade off between the time taken T, the available bandwidth B, and the signal to noise ratio S/N.
E.g. given a fixed T, we can trade off bandwidth B against signal to noise ratio S/N: a broadcast system has a fixed B, so to increase the capacity we must use more signal power S or a more expensive, more sensitive receiver (lower N).
A space probe has limited transmit power and therefore must use a wide bandwidth B to increase its information capacity.
1.16.1 Example Calculations
Q: An analogue telephone channel has a bandwidth of 3 kHz and a S/N ratio of 30 dB.
Calculate its capacity.
A: 30 dB corresponds to 10^(30/10) = 1000, so

C = B log(1 + S/N) = 3000 log2(1 + 1000) ≈ 29,900 ≈ 29.9 kb/s

Q: A space probe uses a bandwidth of 100 kHz and a S/N ratio of 2 dB. Calculate its
capacity.
A:

C = B log(1 + S/N) = 100×10³ log2(1 + 10^(2/10)) = 1.37×10⁵ = 137 kb/s

Q: A spread spectrum system uses a bandwidth of 10 MHz and a S/N ratio of −20 dB.
Calculate its capacity.
A:

C = B log(1 + S/N) = 10×10⁶ log2(1 + 10^(−20/10)) = 1.436×10⁵ ≈ 144 kb/s
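These figures are easy to reproduce with a few lines of Python (a sketch; the function name is ours):

import math

def shannon_capacity(bandwidth_hz, snr_db):
    """Shannon capacity C = B*log2(1 + S/N) of a band-limited AWGN channel."""
    snr_linear = 10 ** (snr_db / 10)      # convert the dB figure to a power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)

print(shannon_capacity(3e3, 30))     # telephone channel, ~29.9 kb/s
print(shannon_capacity(100e3, 2))    # space probe,       ~137 kb/s
print(shannon_capacity(10e6, -20))   # spread spectrum,   ~144 kb/s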
Typically, noise in a system is related to the bandwidth, usually as N = N₀B. In this case,
in the absence of any bandwidth constraint, the capacity tends towards

lim(B→∞) C = lim(B→∞) B log_e(1 + S/(N₀B)) = S/N₀ nats
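The limit is easy to see numerically; the sketch below uses arbitrary example values S = 1 W and N₀ = 10⁻⁶ W/Hz (these numbers are ours, not from the text):

import math

S, N0 = 1.0, 1e-6
for B in (1e3, 1e6, 1e9, 1e12):
    C_nats = B * math.log(1 + S / (N0 * B))   # natural log, so C is in nats per second
    print(f"B = {B:.0e} Hz: C = {C_nats:.3e} nats/s")
# as B grows, C approaches S/N0 = 1e6 nats/s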
1.17 Exercises
1. If 3 coins are tossed in a sequence, what is the probability of the resulting pattern being
symmetric?
2. If 3 dice are thrown, what is the probability of receiving one or more 5s?
3. A source produces data with symbols from an alphabet with 5 elements {S0, S1, S2, S3, S4}.
The probabilities of the elements are 0.8, 0.1, 0.05, 0.025 and 0.025 respectively.
Calculate the Entropy of this source.
4. A source produces data with symbols from an alphabet with 3 elements {S0, S1, S2}. The
probabilities of the elements are 0.7, 0.2 and 0.1 respectively.
Calculate the Entropy of this source.
5. A noisy channel has inputs x_k with 3 possible values −1, 0, +1. These come from a source
with p(−1) = 0.25, p(0) = 0.5 and p(+1) = 0.25.
The channel output y_k has a set of transition probabilities of:
p(y_k = +1 / x_k = +1) = 0.8
p(y_k = 0 / x_k = +1) = 0.2
p(y_k = −1 / x_k = +1) = 0.0
p(y_k = +1 / x_k = 0) = 0.1
p(y_k = 0 / x_k = 0) = 0.8
p(y_k = −1 / x_k = 0) = 0.1
p(y_k = +1 / x_k = −1) = 0.0
p(y_k = 0 / x_k = −1) = 0.2
p(y_k = −1 / x_k = −1) = 0.8
Calculate the average mutual information between the input and the output, I(X; Y).
6. Given

I(x, y) = Σ_{x_i} Σ_{y_i} p(x_i, y_i) log [ p(x_i, y_i) / (p(x_i) p(y_i)) ]

H(y/x) = − Σ_{x_i} Σ_{y_i} p(x_i, y_i) log p(y_i / x_i)

and

H(y) = − Σ_{y_i} p(y_i) log p(y_i)

Prove that I(x, y) = H(y) − H(y/x).
7. A binary erasure channel is defined as a channel with binary inputs 0 and 1. The channel
output can have 3 possible values: 0 and 1, as well as an erasure E, which is output when
the receiver cannot reliably determine whether a 0 or 1 was received.
Assuming that the receiver always makes the right decision or produces an erasure (but
cannot make a wrong decision), derive an expression for the capacity of the channel,
assuming the probability of an erasure is the same value ε for both a 0 and a 1 being transmitted.
8. Prove that the relative entropy of a continuous random signal with a Uniform distribution
from −1/A to 1/A is log(2/A).
Calculate the power in this signal and express the relative entropy as a function of the signal
power.
9. By analogy with the discrete case, assume the relative conditional entropy for continuous
signals can be defined as

H(y/x) = − ∫_y ∫_x p(x, y) log p(y/x) dx dy

Prove that in the case of an AWGN continuous channel with a Gaussian distributed input
signal with power S and Gaussian noise with power N,

H(y/x) = log(√(2πeN))
10. Calculate the Shannon capacity of a telephone channel with bandwidth 4.0 kHz and
signal to noise ratio of 28 dB.
What is the maximum symbol rate that this channel can support?
How many bits per symbol are required for a coding/modulation scheme to achieve the
capacity on this channel?
11. Calculate the Shannon capacity of a CDMA system with bandwidth 10.24 MHz and signal
to noise ratio of −30 dB.
Chapter 2
Source Coding
Coding is the process of transforming information into a suitable format such that it can be
transmitted (or stored) efficiently and reliably. These two requirements can be shown to
be attainable independently of each other and are known as source coding and channel coding
respectively.
Source coding is the process of coding information that we wish to transmit (or store) into
a suitable format for transmission (or storage). It is usually desirable to encode the data
to minimize the transmission time or storage space. This is what is commonly understood as
compression. In some applications we may also want to insert some desirable information, such as
guaranteed clock recovery information or spectral characteristics. Such line or modulation
codes also come under the theory of source coding.
In considering source coding for compression purposes, we will focus on coding input symbols
into binary data for convenience, though the theory can be readily applied to the general case
of any discrete alphabet. The key objective is to encode the data from a source into the minimum
number of bits possible, i.e. to minimize the redundancy in the encoded data.
2.0.1 Some Ideas on Source Coding
Consider a source producing one of 4 possible symbols s1, s2, s3, s4 with probabilities p1, p2, p3, p4.
This can be encoded in a number of possible ways as in table 2.1.
Code   s1   s2   s3    s4
1      0    10   110   11
2      00   01   10    11
3      0    10   110   1110
4      0    01   011   0111
5      0    10   110   111
Table 2.1: Example source coding options
The first method (1) is ambiguous. The code sequence 110 could mean the symbol s3 or the
sequence s4, s1 and hence would not be a suitable code.
The second method (2) is a constant length code with a simple binary mapping. This would
be our most natural instinct and a good coding scheme. Once we know the starting point, there
is no difficulty in decoding. However, if there is a big difference in the probabilities of the different
symbols, then it may not be the most efficient scheme, as all the symbols are coded with binary
sequences of the same length.
The third method (3) is a comma code where the binary zero 0 represents a delimiter
between code words. This allows the different symbols to be isolated from each other and also
allows different length code words, such that the short code words may be assigned to the most
likely source symbols.
The fourth and fifth methods (4) and (5) provide similar properties, with (5) being the most
preferable. This is because its code words are shorter than those of (3) or (4) and it has the
desirable property of being instantaneously decodable, i.e. the source symbol can be determined
immediately on receipt of its final bit; no lookahead is needed.
Code (5) achieves the desirable properties because each codeword is chosen such that it is
not a prefix of any other codeword (see the decoding sketch below). It is also variable length,
thus allowing the possibility of assigning the shortest codes to the most likely source symbols.
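Because no codeword of code (5) is a prefix of another, a decoder can emit a symbol the moment its last bit arrives. A small Python sketch (the codeword table is code (5) from table 2.1):

code5 = {"0": "s1", "10": "s2", "110": "s3", "111": "s4"}   # prefix-free code (5)

def decode(bits):
    symbols, current = [], ""
    for b in bits:
        current += b
        if current in code5:          # a codeword is recognised as soon as its final bit arrives
            symbols.append(code5[current])
            current = ""
    return symbols

print(decode("0101100"))   # -> ['s1', 's2', 's3', 's1']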
2.0.2 Mapping Example
Consider the case of the symbol probabilities being p(s1) = 0.6, p(s2) = 0.2, p(s3) = 0.1 and
p(s4) = 0.1. Using the binary mapping of code (2), the length of a 10 symbol message would be
20 binary bits.
However, using code (5), the average length of a 10 symbol message would be

10 × [0.6×1 + 0.2×2 + 0.1×3 + 0.1×3] = 16 bits

or 1.6 bits per symbol. Hence, the average performance of code (5) is significantly better than
code (2) in this case.
Note: Sometimes the encoded 10 symbol message will be longer than 20 bits (in fact it
could be 30 bits long!). Only the average length will be 16 bits. Hence any software or hardware
must be able to deal with up to 30 bit messages.
2.0.3 Efficiency
The average bit rates L for codes (2) and (5) are L = 2 bits per symbol and L = 1.6 bits
per symbol respectively. However, we can calculate the entropy of the source as a fundamental
measure of the information in the source and compare it with how well the proposed codes work.
The Entropy of the source is

H = − Σ_{i=1}^{N} p_i log(p_i)

which is

−0.6 log2(0.6) − 0.2 log2(0.2) − 0.1 log2(0.1) − 0.1 log2(0.1) = 1.571 bits per symbol

We can calculate the efficiency of our codes by comparing them to this fundamental value as

E = H / L

Hence the efficiencies for the illustrated codes are: code (2), 78.5% and code (5), 98.2%.
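A short calculation reproduces these figures (a Python sketch; the probabilities and code lengths are those of the example above):

import math

p = [0.6, 0.2, 0.1, 0.1]          # symbol probabilities
len_code2 = [2, 2, 2, 2]          # fixed-length code (2)
len_code5 = [1, 2, 3, 3]          # prefix code (5)

H = -sum(pi * math.log2(pi) for pi in p)             # source entropy, ~1.571 bits/symbol
L2 = sum(pi * li for pi, li in zip(p, len_code2))    # average length of code (2) = 2.0
L5 = sum(pi * li for pi, li in zip(p, len_code5))    # average length of code (5) = 1.6
print(H, H / L2, H / L5)                             # efficiencies ~0.785 and ~0.982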
2.1 Coding Methods
The key point in source coding is the use of variable length codes and the assignment of the
shortest codewords to the most likely symbols. This idea has been widely used, even as far back
as Morse Code (Table 2.2).
A .- N -. 0 -----
B -... O --- 1 .----
C -.-. P .--. 2 ..---
D -.. Q --.- 3 ...--
E . R .-. 4 ....-
F ..-. S ... 5 .....
G --. T - 6 -....
H .... U ..- 7 --...
I .. V ...- 8 ---..
J .--- W .-- 9 ----.
K -.- X -..-
L .-.. Y -.--
M -- Z --..
Table 2.2: Morse Code
In the special case of the probabilities of the symbols all being powers of 1/2, i.e. each
p_i of the form (1/2)^j with j ∈ {1, 2, . . . , n}, trivial 100% efficient codes can readily be constructed.
However, it is not immediately clear if maximum efficiency can be obtained in general.
Some algorithmic methods have been developed to assign codewords effectively, an early one
being the Fano-Shannon method.
2.1.1 Fano-Shannon Coding
This method can be seen readily by example and readily implemented in software. Take the
example of a source with five symbols and probabilities p(s1) = 0.5, p(s2) = 0.2, p(s3) = 0.1,
p(s4) = 0.1 and p(s5) = 0.1. The source symbols are written in descending order, divided
into groups that have approximately equal probability and assigned the bits 1 and 0. This
is recursively carried out until all the input symbols are assigned a code. Fig 2.1 shows the
procedure. In this example the average encoding rate is

0.5×1 + 0.2×3 + 0.1×3 + 0.1×3 + 0.1×3 = 2 bits per symbol

As the entropy of the source is 1.96 bits per symbol, the code is 98% efficient.
2.1.2 Huffman Coding
A more efficient method is due to Huffman and proceeds along similar lines. The two lowest
probability symbols are added and the probabilities reordered in descending order. This is
repeated until no symbols are left. The bits are then assigned where additions have been
performed and the code sequences read off right to left. Fig 2.2 shows the previous example
repeated.
Figure 2.1: Example of Fano-Shannon Coding (recursive splitting of the ordered probabilities 0.5, 0.2, 0.1, 0.1, 0.1; s1 is assigned the single bit 0 and the remaining symbols 3-bit codewords)
In this case, the code word assignments are different, but the average encoding rate is

0.5×1 + 0.2×2 + 0.1×3 + 0.1×4 + 0.1×4 = 2 bits per symbol

with a 98% efficiency.
In more complex examples the Huffman method can outperform the Fano-Shannon method.
It can also be shown that Huffman coding is optimal, i.e. there is no other assignment of
codewords that can perform better than the ones produced by the Huffman algorithm.
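The Huffman construction is short to code up; the sketch below (plain Python using the standard heapq module) returns only the codeword lengths, which is all that is needed to evaluate the average rate. Ties in the merge order mean the lengths can differ from those in fig 2.2, but the average length is the same.

import heapq

def huffman_lengths(probs):
    """Return Huffman codeword lengths for a list of symbol probabilities."""
    # each heap entry: (probability, unique id, symbol indices contained in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    uid = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)     # the two least likely subtrees...
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                   # ...are merged; every symbol inside gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, uid, s1 + s2))
        uid += 1
    return lengths

print(huffman_lengths([0.5, 0.2, 0.1, 0.1, 0.1]))
# e.g. [1, 3, 3, 3, 3]: a different but equally optimal tree, average 2 bits/symbol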
2.2 Shannon Coding Theorem
The question of how efficient the encoding process can be has been addressed by Shannon,
who proved formally that:
Shannon's First Theorem: Given a source producing symbols from a finite alphabet X and
encoding them in groups of length n to a code alphabet with D symbols, there exists a code
such that:

lim(n→∞) L_n / n = H(X) / log D

where L_n is the average codeword length.
In our case, the code alphabet is binary, hence D = 2, and as we group more source symbols
together, the average code rate tends towards the entropy of the source.
Figure 2.2: Example of Huffman Coding (the two least likely symbols are repeatedly merged; the resulting codeword lengths for s1 . . . s5 are 1, 2, 3, 4, 4)
The theorem also indicates how we can achieve coding rates right up to the entropy of the
source. This can be achieved by grouping the source symbols into fixed length sequences and
encoding these rather than encoding the source symbols on their own.
2.2.1 Example
A source produces one of three symbols A, B and C, with probabilities 16/20, 3/20 and 1/20, at
a rate of 100 symbols per second. Design a code to transmit the information through a noiseless
channel that can transmit 100 binary bits per second.
Firstly, we must check if the problem is possible. The entropy of the source is

H = − Σ_{i=1}^{3} p_i log(p_i) = 0.884 bits per symbol

or 88.4 bits per second; hence a code must exist to encode this source within the channel rate
and it is possible to solve the problem.
Next, we try the Fano-Shannon method, resulting in table 2.3.
Next, we try the Fano-Shannon method to result in table 2.3.
Symbol i p
i
Code Seq.
A 16/20 0 0
B 3/20 1 0 10
C 1/20 1 1 11
Table 2.3: Fano-Shannon Code
The average data rate is now:

100 × [ (16/20)×1 + (3/20)×2 + (1/20)×2 ] = 120 bits per sec

hence the efficiency of the coding is 88.4/120 = 73.7%.
However, this is too large to send through our channel and so we must consider a more complex
code.
This time we will group the symbols in groups of 2. There are 9 possible groups of 2. These can
then be coded using the Fano-Shannon method as in table 2.4 and table 2.5.
Symbol i   p_i
AA         16/20 × 16/20 = 256/400
BA         3/20  × 16/20 = 48/400
CA         1/20  × 16/20 = 16/400
AB         16/20 × 3/20  = 48/400
BB         3/20  × 3/20  = 9/400
CB         1/20  × 3/20  = 3/400
AC         16/20 × 1/20  = 16/400
BC         3/20  × 1/20  = 3/400
CC         1/20  × 1/20  = 1/400
Table 2.4: Symbol Probabilities (groups of 2)
Symbol i   p_i       Code             Seq.
AA         256/400   0                0
BA         48/400    1 0              10
AB         48/400    1 1 0            110
CA         16/400    1 1 1 0 0        11100
AC         16/400    1 1 1 0 1        11101
BB         9/400     1 1 1 1 0        11110
CB         3/400     1 1 1 1 1 0      111110
BC         3/400     1 1 1 1 1 1 0    1111110
CC         1/400     1 1 1 1 1 1 1    1111111
Table 2.5: Fano-Shannon Code (groups of 2)
The average data rate is now (note: 50 pairs per second):

50 × [ (256/400)×1 + (48/400)×2 + (48/400)×3 + (16/400)×5 + (16/400)×5
     + (9/400)×5 + (3/400)×6 + (3/400)×7 + (1/400)×7 ] = 93.375 bits per sec

and hence the data can be transmitted as desired.
The average rate is 93.375 bits per sec and hence the efficiency of the coding is 88.4/93.375 = 94.7%.
If this were insufficient, we could group more symbols together and decrease the data rate
towards the limit of 88.4 bits per second. However, no code can exist that can push the average
data rate below 88.4 bits per second.
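The grouped-code figure is easy to verify with a few lines of Python (a sketch; the probabilities and code lengths are those of table 2.5):

from itertools import product

p = {"A": 16/20, "B": 3/20, "C": 1/20}
code_len = {"AA": 1, "BA": 2, "AB": 3, "CA": 5, "AC": 5,
            "BB": 5, "CB": 6, "BC": 7, "CC": 7}        # codeword lengths from table 2.5

avg_bits_per_pair = sum(p[a] * p[b] * code_len[a + b] for a, b in product("ABC", repeat=2))
print(50 * avg_bits_per_pair)      # 50 pairs per second -> 93.375 bits per sec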
Note the probabilities of 1s and 0s in the coded data. Counting the expected numbers of 0s and 1s per encoded pair,

p(0)/p(1) = [256×1/400 + 48×1/400 + 48×1/400 + 16×2/400 + 16×1/400 + 9×1/400 + 3×1/400 + 3×1/400]
          / [48×1/400 + 48×2/400 + 16×3/400 + 16×4/400 + 9×4/400 + 3×5/400 + 3×6/400 + 1×7/400]
          = 1.0375 / 0.83

But, as p(0) + p(1) = 1:

p(0) = 0.555   and   p(1) = 0.444

As the data rate tends towards the entropy of the source, we would expect p(0) and p(1) to
tend towards 1/2, which is what is required for maximum entropy (i.e. no redundancy) in the coded data.
2.3 Data Compression
The basis and key ideas for data compression have been demonstrated based on Shannon's
first theorem. In practice, two further distinctions apply. These are the concepts of lossless and
lossy compression:
Lossless Compression: This refers to compression where the original data must be re-
covered exactly, with no errors. This is the normal case with computer data and programs.
A single bit in error can completely destroy the value of an executable program or data
file.
Lossy Compression: In other applications such as voice and image compression, the ex-
act original data may not be absolutely required. Certain errors in the data may be allowed,
provided the quality of the recovered audio or images is sufficient for the application.
2.4 Lossless Compression
Lossless (or noiseless, or reversible) compression must recover the compressed data with
no errors.
Such methods can be based on the statistical properties of the data, with Huffman coding being a
common example.
However, in many cases it is difficult to evaluate the statistics of the input data, and con-
ditional probabilities cause more complexity. We have seen that in the case of English
text, the conditional probabilities were key to the large redundancy in printed English.
Fortunately, a general purpose (universal) compression method has been developed by
Lempel and Ziv. Their dictionary based compression algorithms have become the most widely
used compression methods for general purpose data and provide very good performance with fast
compression/decompression and no prior information on the statistics of the data.
Examples of such a compression technique include the gzip utility on UNIX systems, and pkzip
and winzip on Windows machines.
2.4.1 Dictionary based Data Compression
These dictionary based data compression methods work by building up a dictionary of
common phrases as the algorithm progresses through the data. Thus the statistics of the input
data are learnt as the algorithm progresses. (A static dictionary could be used in certain
applications but this is rarely done.)
Two common algorithms for dictionary compression are LZ77 and LZW. These and other
variations differ in how they store, search and encode the dictionary and its pointers. An outline
of the algorithm, sketched here in Python rather than pseudocode, would be:
s = "";
do {
c = input_char();
if( {s,c} in the dictionary ){
s = {s,c};
}else{
output pointer for s;
add {s,c} to dictionary;
s = c;
}
}
A basic overview of how these methods work can be gained by considering a
simple example sentence and how it is coded.
Consider the sentence
the fat cat sat where a fat cat should not have sat
We initialize the dictionary with the 26 letters (pointers 0 to 25) and the space character as
pointer 26. The compression process then proceeds as in table 2.6.
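Using the encoder sketched above, this is (a usage sketch, with our own dictionary initialisation):

sentence = "the fat cat sat where a fat cat should not have sat"
initial = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}   # a..z -> 0..25, space -> 26
print(lzw_encode(sentence, dict(initial)))
# first pointers emitted: 19 (t), 7 (h), 4 (e), 26 ( ), 5 (f), 0 (a), 19 (t), ... as in table 2.6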
This example shows how the dictionary is built up. Although the example shows little if any
compression, it can be seen that as time progresses, larger strings will be matched and overall
the compression can be significant, especially for text, which will tend to have a lot of patterns
matched due to repeated words.
One interesting advantage of this kind of compression is that the dictionary does not need
to be transmitted (or stored). It can be recreated as the receiver receives the data. (In practice,
aspects such as dictionary overflow, maximum string length, etc, need to be accounted for.)
On typical data found on computer systems, such algorithms can achieve a com-
pression rate of around 2:1 (though some products will boast 4:1, which should be taken with
some skepticism). Some backup hardware such as tape or optical drives may include hardware
compression internally.
NB: It is also worth noting that some of these algorithms and their implementations may be
covered by granted patents.
2.4.2 Arithmetic Compression
See the text book for a description of Arithmetic Compression. In some cases it can be superior
to Huffman encoding, but it is more computationally complex to implement.
String s        Character c        Output
s ""   c t
s t c h D 19 (t) Add to dict th 27
s h c e D 7 (h) Add to dict he 28
s e c D 4 (e) Add to dict e 29
s c f D 26 ( ) Add to dict f 30
s f c a D 5 (f) Add to dict fa 31
s a c t D 0 (a) Add to dict at 32
s t c D 19 (t) Add to dict t 33
s c c D 26 ( ) Add to dict c 34
s c c a D 2 (c) Add to dict ca 35
s a c t
s at c D 32 (at) Add to dict at 36
s c s D 26 ( ) Add to dict s 37
s s c a D 18 (s) Add to dict sa 38
s a c t
s at c
s at c w D 36 (at ) Add to dict at w 39
s w c h D 22 (w) Add to dict wh 40
s h c e
s he c r D 28 (he) Add to dict her 41
s r c e D 17 (r) Add to dict re 42
s e c
s e c a D 29 (e ) Add to dict e a 43
s a c D 0 (a) Add to dict a 44
s c f
s f c a D 30 ( f) Add to dict fa 45
s a c t
s at c
s at c c D 36 (at ) Add to dict at c 46
Table 2.6: LZW Compression
2.4.3 RLE Compression
Run Length Encoding (RLE) is a specific compression technique applicable to some specific
applications, typically graphics systems.
E.G. A scanned two-level (black and white) image will typically have long runs of similar bits such as
00000111111000000000011110000000001111110000...
Clearly, sending the run lengths rather than the bits should compress the data to be sent.
For the bit sequence shown, the run-lengths starting with 0 are
5, 6, 10, 4, 9, 6, ... This sequence could further be encoded using Huffman encoding.
The most common example of this is FAX transmission, which has a predetermined coding
for the most common patterns. Compression ratios of up to 20:1 can be obtained using this
method.
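A minimal run-length extractor for a bit string (a Python sketch) illustrates the idea:

from itertools import groupby

def run_lengths(bits):
    """Return the lengths of the successive runs in a bit string."""
    return [len(list(group)) for _, group in groupby(bits)]

print(run_lengths("00000111111000000000011110000000001111110000"))
# -> [5, 6, 10, 4, 9, 6, 4]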
s c c a
s ca c t D 35 (ca) Add to dict cat 47
s t c
s t c s D 33 (t ) Add to dict t s 48
s s c h D 18 (s) Add to dict sh 49
s h c o D 7 (h) Add to dict ho 50
s o c u D 14 (o) Add to dict ou 51
s u c l D 20 (u) Add to dict ul 52
s l c d D 11 (l) Add to dict ld 53
s d c D 3 (d) Add to dict d 54
s c n D 26 ( ) Add to dict n 55
s n c o D 13 (n) Add to dict no 56
s o c t D 14 (o) Add to dict ot 57
s t c
s t c h D 33 (t ) Add to dict t h 58
s h c a D 7 (h) Add to dict ha 59
s a c v D 0 (a) Add to dict av 60
s v c e D 21 (v) Add to dict ve 61
s e c
s e c s D 29 (e ) Add to dict e s 62
s s c a
s sa c t D 38 (sa) Add to dict sat 63
Table 2.7: LZW Compression (cntd)
2.5 Lossy Compression
In applications such as voice and image compression, the exact original data may not be
absolutely required. Certain errors in the data may be allowed, provided the quality of the
recovered audio or images is sufficient for the application.
These compression methods alter the data mathematically in some approximately reversible way.
In all of these cases there is a tradeoff between distortion and compression rate. Often, the ac-
ceptable distortion can be subjective or determined by trials rather than by using formal distortion
measures.
The distortion may be allowed in various domains such as space, time, and/or frequency.
2.5.1 DCT : Discrete Cosine Transform
The Discrete Cosine Transform (DCT) of a sequence of N samples x_i, i = 0, 1, . . . , N−1, is
defined as

z_k = √(2/N) α(k) Σ_{i=0}^{N−1} x_i cos( (2i + 1)kπ / 2N ),   k = 0, 1, . . . , N−1

with the inverse transform

x_i = √(2/N) Σ_{k=0}^{N−1} α(k) z_k cos( (2i + 1)kπ / 2N ),   i = 0, 1, . . . , N−1

where α(k) = 1/√2 for k = 0; otherwise 1.
This transform is closely related to the discrete Fourier transform (DFT) and decomposes
the data x_i into frequency components z_k. For image compression, a 2-dimensional version is
commonly used. The 2D DCT is defined on a 2D input x_{m,n}, m = 0, 1, . . . , N−1, n = 0, 1, . . . , N−1, as

z_{k,l} = (2/N) α(k)α(l) Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} x_{m,n} cos( (2m + 1)kπ / 2N ) cos( (2n + 1)lπ / 2N )
with the inverse transform

x_{m,n} = (2/N) Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} α(k)α(l) z_{k,l} cos( (2m + 1)kπ / 2N ) cos( (2n + 1)lπ / 2N )

where α(k) = 1/√2 for k = 0; otherwise 1.
For practical implementation, fast algorithms are available similar to fast Fourier transform
(FFT) algorithms. Figure 2.3(a) shows an example 64x64 grey scale image while fig 2.4(a) shows
its 2-dimensional DCT. Most of the energy in the DCT is at low frequencies, as the image has
few sharp edges. When 75% of the DCT coefficients (the high frequency ones) are zeroed (fig 2.4(b)), the
inverse transform results in the image in fig 2.3(b). Clearly, the image is still quite good despite
the 4:1 compression. Many lossy image compression algorithms use this feature to achieve their
compression.
Figure 2.3: 64 x 64 pixel Image ((a) Original, (b) Compressed with 75% of the DCT coefficients zeroed)
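The same effect can be seen in one dimension with a direct implementation of the transform pair defined above (a numpy sketch of the textbook formulas, not the JPEG transform itself): zeroing the high-frequency half of the coefficients still allows an approximate reconstruction of the samples.

import numpy as np

def dct(x):
    """Forward DCT: z_k = sqrt(2/N) * alpha(k) * sum_i x_i cos((2i+1)k pi / 2N)."""
    N = len(x)
    i = np.arange(N)
    z = np.empty(N)
    for k in range(N):
        alpha = 1 / np.sqrt(2) if k == 0 else 1.0
        z[k] = np.sqrt(2 / N) * alpha * np.sum(x * np.cos((2 * i + 1) * k * np.pi / (2 * N)))
    return z

def idct(z):
    """Inverse DCT: x_i = sqrt(2/N) * sum_k alpha(k) z_k cos((2i+1)k pi / 2N)."""
    N = len(z)
    k = np.arange(N)
    alpha = np.where(k == 0, 1 / np.sqrt(2), 1.0)
    return np.array([np.sqrt(2 / N) * np.sum(alpha * z * np.cos((2 * i + 1) * k * np.pi / (2 * N)))
                     for i in range(N)])

x = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0])
z = dct(x)
z[4:] = 0                       # zero the high-frequency half of the coefficients
print(np.round(idct(z), 2))     # an approximation of the original samples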
2.5.2 JPEG: Still Image Compression
The JPEG standard includes a set of sophisticated lossy compression options which resulted
from much experimentation by the creators of JPEG with regard to human acceptance of types
of image distortion. The JPEG standard was the result of years of effort by the Joint Photo-
graphic Experts Group, which was formed as a joint effort by two large standing standards
organizations, the CCITT (the international telecommunications standards body, now ITU-T) and the
ISO (International Standards Organization).
Figure 2.4: DCT of 64 x 64 pixel Image ((a) Original, (b) Compressed, with 75% of the coefficients zeroed)
The JPEG lossy compression algorithm consists of an image simplification stage, which re-
moves image complexity at some loss of fidelity, followed by a lossless compression step based
on predictive filtering and Huffman or Arithmetic coding.
The lossy image simplification step is based on the Discrete Cosine Transform (DCT). The
DCT is applied to 8 by 8 pixel blocks of the image, i.e. if the image is 256 by 256 pixels
in size, it is broken into 32 by 32 square blocks of 8 by 8 pixels and each one is compressed
independently. The 64 pixel values in each block are transformed by the DCT into a set of 64
spatial frequency values. These 64 numbers, the DCT coefficients, represent the 8x8 image
completely and exactly, just in another form.
The JPEG compression algorithm uses these DCT coefficients to perform a lossy reduction of
image data: in the algorithm, the smallest coefficients are set to zero. The remaining coefficients
are quantized to a number of different resolutions according to predetermined levels of observer
sensitivity to their importance, e.g. low spatial frequencies are more important to quantize finely
than higher spatial frequencies.
Lossless compression steps such as Huffman encoding are then applied to the quantized coefficients.
As most images will have slowly varying features and a lot of the quantized coefficients will be
zero, this entropy compression will provide additional gains in the overall compression rate.
Typical quoted performance for JPEG suggests that photographic quality images of natural
scenes can be preserved with compression ratios of up to about 20:1 or 25:1. Usable quality
(that is, for non-critical purposes) can result for compression ratios up to 200:1.
2.6 Still Image Compression: GIF vs. JPEG
GIF: Graphics Interchange Format:
GIF is a data stream-oriented file format used to define the transmission protocol of LZW-
encoded bitmap data. It is lossless but still achieves reasonable compression, though it sup-
ports only 8 bits worth of colors. It is a popular choice for storing lower resolution image
data and can be significantly better than JPEG on images with only a few distinct colors,
such as line drawings and simple cartoons.
JPEG: Joint Photographic Experts Group:
JPEG is a lossy compression system in which the compression rate can be varied by
adjusting compression parameters. It can easily provide 20:1 compression of full-color
data. Comparing GIF and JPEG, the size ratio is usually more like 4:1. JPEG stores
full color information: 24 bits/pixel (16 million colors) and is superior to GIF for storing
full-color or gray-scale images of realistic scenes.
2.7 Video Compression
Video is a sequence of still images which are displayed in order. Each of these images is
called a frame. As the human eye cannot notice small changes from frame to frame, like a slight
difference of color, video compression can apply lossy compression yet still retain good picture
quality. Typically 25 (30 in the US) frames are displayed on the screen every second. As many
frames will be similar, there is a considerable amount of redundancy in the consecutive images.
Hence, most frames can be defined based on previous frames and only the differences transmitted.
Frames can be compressed using only the information in that frame (intra-frame) or using
information in other frames as well (inter-frame). Intraframe coding allows random access opera-
tions like fast forward and provides fault tolerance: if part of a frame is lost, the next intraframe
and the frames after it can still be displayed because they only depend on the intraframe.
Video compression relies on spatial prediction and temporal prediction.
In spatial prediction, the prediction of a pixel is obtained from pixels of the same image, similar to
still image compression.
In temporal prediction, the prediction of a pixel is obtained from a previously transmitted
image.
Hybrid coding consists of a prediction in the temporal domain with a suitable technique to
compensate for movement in the spatial domain. This motion compensation establishes
a correspondence between elements of nearby images in the frame sequence.
Every color in an image can be represented as a combination of red, green and blue. This
RGB color space is not suitable for compression since it does not consider human perception.
In the YUV color space, the Y component alone gives the grayscale image. The human eye is
more sensitive to changes in Y, and this is used in compression algorithms by allowing more
compression in the color information than in the luminance.
2.8 Speech Compression
The compression of speech signals has many practical applications. For example, in digital
mobile radio, compression allows more users to share the limited bandwidth available.
Usually, digital speech signals are sampled at a rate of 8 kHz. Typically, each sample is
represented by up to 13 bits but coded (effectively compressed) using μ-law or A-law to 8 bits. This
corresponds to a data rate of 64 kbps (kbits/sec), which would be used by a typical telephone
system. It is possible to reduce the rate to around 8 kbps with little loss in quality.
Further compression (around 2.4 kbps) is possible but at the expense of lower quality. While the
language can still be understood at these lower bit rates, voice qualities such as recognizing
who the speaker is diminish and are less acceptable to the general public.
Most of the current low-rate speech coders are based on the principle of linear predictive
coding (LPC). This is based on modeling the vocal tract as a filter. The speech is generated by
pushing air through that filter to produce a sound (unvoiced), or by causing the vocal cords to
vibrate at some frequency (pitch) which is then filtered by the vocal tract to produce a sound
(voiced). Fig 2.5 shows the model; the filter is an IIR filter:

H(z) = 1 / (1 + a_1 z^−1 + a_2 z^−2 + . . . + a_10 z^−10)

The principle of linear predictive coding is to send information describing the filter (vocal
tract) and the loudness/pitch rather than the actual sampled signal. The data to describe
these is much more compact than the actual data samples (however, other forms of audio such
as music will not sound very good).
Figure 2.5: LPC Voice Model (pitch impulses for voiced sounds or white noise for unvoiced sounds, scaled by a gain and filtered by H(z) to produce the voice signal)
2.9 Exercises
1. A source produces data with symbols from an alphabet with 5 elements {S0, S1, S2, S3, S4}.
The probabilities of the elements are 0.8, 0.1, 0.05, 0.025 and 0.025 respectively.
Calculate the Entropy of this source.
Construct a variable length code to encode data from the source using the Shannon-Fano
algorithm.
Calculate the efficiency of your code.
2. A source produces data with symbols from an alphabet with 3 elements {S0, S1, S2}. The
probabilities of the elements are 0.7, 0.2 and 0.1 respectively.
Calculate the Entropy of this source.
Construct a variable length code to encode data from the source using the Huffman algo-
rithm.
Calculate the efficiency of your code.
Form new symbols by grouping 2 original symbols together. Construct a variable length
code to encode data from the source using the Huffman algorithm with the new symbols.
Calculate the efficiency of this code.
Chapter 3
Galois Fields
3.1 Groups
A group is a set of objects G on which a binary operation * is defined. The result of the
* operation on elements from the group is an element from the group (closure) and
the following conditions must be satisfied.
Associativity: (a * b) * c = a * (b * c) for a, b, c ∈ G.
Identity: An identity element e ∈ G exists such that a * e = e * a = a for a ∈ G.
Inverse: For all a ∈ G, there exists a unique element a⁻¹ ∈ G such that a * a⁻¹ = a⁻¹ * a = e.
A group is commutative or Abelian if it also satisfies Commutativity: a * b = b * a for all a, b ∈ G.
The order of the group is the cardinality of the group (the number of objects in the group).
E.G. The set of integers forms an infinite commutative group under the addition operation.
E.G. The set of integers {0, 1, 2, . . . , m−1} forms a finite group of order m under the operation
of addition modulo m. The identity element is 0 and the inverse element of x is m − x.
3.2 Rings
A ring is a set of objects R on which two binary operations + and * are defined and the
following conditions must be satisfied.
R forms a commutative group under + with 0 as the identity element.
The operation * is associative: (a * b) * c = a * (b * c) for a, b, c ∈ R.
The operation * is distributive over +, i.e. a * (b + c) = (a * b) + (a * c) for a, b, c ∈ R.
A ring is commutative if it also satisfies a * b = b * a for all a, b ∈ R.
A ring is described as a ring with identity if the operation * has an identity element,
labeled 1, i.e. a * 1 = 1 * a = a for all a ∈ R.
E.G. The integers {0, 1, 2, . . . , m−1} form a commutative ring with an identity under addition
and multiplication modulo m.
3.3 Fields
A field is a set of objects F on which two binary operations + and * are defined and the
following conditions must be satisfied.
F forms a commutative group under + with 0 as the identity element.
F − {0} (i.e. F without the 0 element) forms a commutative group under * with 1 as the
identity element.
The operation * is distributive over +, i.e. a * (b + c) = (a * b) + (a * c) for a, b, c ∈ F.
Finite fields (i.e. with a finite number of elements) form an important part of coding theory.
These are usually called Galois fields. A Galois field with q elements is usually denoted GF(q).
E.G 1: The simplest Galois field is GF(2) with two elements 0, 1. The operations + and
* are defined as

+ | 0 1        * | 0 1
0 | 0 1        0 | 0 0
1 | 1 0        1 | 0 1

E.G 2:
The integers {0, 1, 2, . . . , p−1} form a commutative group with identity 0 under the addition
modulo p operation. The integers {1, 2, . . . , p−1} (i.e. less 0) form a commutative group with
identity 1 under multiplication modulo p if p is prime.
If p is not prime then there exist 1 < m, n < p with mn = p, thus mn mod p = 0 and
the multiplication operation is not closed (0 is not in the set).
Theorem 3.3.1 The integers {0, 1, 2, . . . , p − 1}, where p is prime, form a finite field GF(p)
under modulo p addition and multiplication.
3.4 Elementary Properties of Galois Fields
It can be shown that any Galois field of order q must be identical, up to the labeling of its
elements, to any other of the same order.
Let α be an element of a Galois field GF(q) and let 1 be the multiplicative identity. Consider
the sequence of elements

1, α, α², α³, . . .

Since α is contained in GF(q), each successive power of α is also in GF(q). As GF(q) contains
a finite number of elements, the sequence must begin to repeat at some point.
To show that the sequence repeats back starting from 1:
Assume that α^x = α^y is the first repetition, where x > y > 0 (i.e. the sequence does not
repeat back to 1). Then

α^x = α^y  ⇒  α^(x−y) = 1

with 0 < x − y < x, so 1 has already repeated earlier in the sequence, contradicting the
assumption that the first repetition did not involve 1.
Thus 1 must be the first element to repeat.
This allows us to define the order of an element as:
Definition 3.4.1 The order of an element α is the smallest positive integer m such
that α^m = 1, and is denoted ord(α).
An element that has order q − 1 in GF(q) is called a primitive element.
It can be shown that every Galois field GF(q) must have at least 1 primitive element (and
many have more than one).
The use of the primitive element is convenient for dealing with Galois fields. Let α be a
primitive element of GF(q). Consider the sequence

1, α, α², α³, . . . , α^(q−1), α^q, . . .

As the order of α is q − 1, the sequence repeats at α^(q−1) = 1 and the first q − 1 elements are
unique.
Thus the successive powers of the primitive element starting at 1 construct all the non-zero
elements of the Galois field. Or:
All non-zero elements in a Galois field may be constructed as the q − 1 successive
powers of a primitive element α.
This is sometimes used in hardware to generate a counter over GF(q). Initializing to 1, the
counter is updated by multiplying by α each successive period.
E.G. Consider GF(7) based on addition and multiplication modulo 7 of the integers 0, 1, 2, 3, 4, 5, 6.
It can be shown (by checking) that 3 and 5 are primitive elements of this field.
Hence the field elements can be generated as successive powers of 5 modulo 7, i.e.

1, 5, |5²|₇ = 4, |5³|₇ = 6, |5⁴|₇ = 2, |5⁵|₇ = 3, |5⁶|₇ = 1
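A couple of lines of Python confirm this (a sketch that simply walks the successive powers of 5 modulo 7):

p, alpha = 7, 5
element = 1
for i in range(p - 1):
    print(f"alpha^{i} = {element}")
    element = (element * alpha) % p    # multiply by the primitive element at each step
# prints 1, 5, 4, 6, 2, 3 and then the sequence returns to 1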
An additive structure of a Galois field can also be developed.
Consider adding the multiplicative identity element 1 to itself multiple times. A sequence
can then be generated:

0, 1, 1 + 1, 1 + 1 + 1, 1 + 1 + 1 + 1, . . .

Since the field is finite, the sequence must repeat.
Let the notation m(1) represent the summation of m 1s.
If j(1) is the first repeated element and it is equal to an earlier element k(1) (where the
repetitive sequence starts), i.e.

j(1) = k(1) with 0 ≤ k < j

then k = 0, for otherwise, if k > 0, then (j − k)(1) = 0 with j − k < j, implying the sequence
repeats earlier than j and contradicting the assumptions.
Hence, we define:
The characteristic of a Galois field GF(q) is the smallest positive integer m for
which m(1) = 0.
Consider the sequence

0, 1(1), 2(1), . . . , k(1), (k + 1)(1), . . .

with k(1) = 0. If k is not prime then k can be written as k = nm with n, m positive
integers, i.e. 0 < n, m < k.
Hence m(1)·n(1) = k(1) = 0, but no multiplication of two nonzero elements of the field can
result in 0, contradicting the assumption that k is not prime. Hence we can state:
The characteristic of a Galois field is always prime.
Let GF(q) be a field of characteristic p. Then GF(q) must contain the elements Z_p =
{0, 1, 2(1), 3(1), . . . , (p − 1)(1)}.
The set Z_p is closed under addition and multiplication (as the sequence just repeats). The
additive inverse of j(1) ∈ Z_p is just (p − j)(1).
The multiplicative inverse of j(1) is i(1) with |ij|_p = 1. The rest of the field requirements for
Z_p to be a field (associativity, etc.) are satisfied by noting that all the elements in Z_p are already
in the field GF(q).
Hence Z_p forms a subfield of GF(q).
Since all fields of order p are the same field (up to a labeling), Z_p must be the field of
integers modulo p.
Given GF(q) with its prime order subfield GF(p) (which was Z_p above), let β₁ be a non-zero
element of GF(q).
For each of the elements λ_i ∈ GF(p), λ_i β₁ ∈ GF(q). These p elements must be distinct, for
otherwise λ_i β₁ = λ_j β₁ with λ_i ≠ λ_j would imply β₁ = 0.
If there is another element, say β₂, not of the form λ_i β₁, then there are another p² elements
of the form λ_i β₁ + λ_j β₂ ∈ GF(q).
That these are distinct can be readily shown.
Let β₂ ∈ GF(q) be an element that is not of the form λ_i β₁ with λ_i ∈ GF(p).
If the elements λ_i β₁ + λ_j β₂ are not all distinct, then there exist λ_i, λ_j, λ_k, λ_l such that

λ_i β₁ + λ_j β₂ = λ_k β₁ + λ_l β₂

with λ_i ≠ λ_k and λ_j ≠ λ_l, so

(λ_i − λ_k) β₁ + (λ_j − λ_l) β₂ = 0

or

λ_n β₁ − λ_m β₂ = 0

with λ_n ≠ 0 and λ_m ≠ 0. Hence

β₂ = λ_m⁻¹ λ_n β₁

but λ_m⁻¹ λ_n is still an element of the subfield GF(p), contradicting the assumption that
β₂ is not of the form λ_i β₁.
Thus there are either p or p² elements in GF(q). This argument can be extended for any
further elements β₃, . . . and it can be concluded that any Galois field GF(q) must have q = p^n,
where p is prime and n is an integer.
i.e.
Theorem 3.4.2 The order of a Galois field GF(q) must be a power of a prime.
E.G.
A Galois field with 32 = 2⁵ elements exists.
A Galois field with 256 = 2⁸ elements exists.
A Galois field with 2197 = 13³ elements exists.
No Galois field with 10 = 5 × 2 elements exists.
No Galois field with 52 = 13 × 4 elements exists.
Note also that each element β ∈ GF(q) = GF(p^m) can be written as

β = λ₁β₁ + . . . + λ_iβ_i + . . . + λ_mβ_m

with λ_i ∈ GF(p), i.e. the field GF(p^m) can be seen as a vector space over the subfield GF(p),
with the β_i forming a set of basis vectors.
3.5 Primitive Polynomials and Galois Fields of Order p^m
The Galois field GF(q) can be represented as the q − 1 consecutive powers of a primitive field
element α, together with the 0 element.
Multiplication in GF(q) can be accomplished by representing the elements as powers of α
and adding their exponents modulo q − 1:

α^a · α^b = α^(|a+b| mod (q−1))

It has also been shown that GF(q) contains a subfield of prime order p, whose additive
operation is integer addition modulo p.
We want to extend these ideas to find the additive structure of GF(q) for q non-prime (q
must always be p^m where p is prime and m is an integer).
Denote by GF(q)[x] the collection of polynomials of arbitrary degree, a₀ + a₁x + a₂x² + . . . + a_n x^n,
with a_i ∈ GF(q).
This collection forms a commutative ring with identity.
The additive operation is polynomial addition:

(a₀ + a₁x + a₂x² + . . . + a_n x^n) + (b₀ + b₁x + b₂x² + . . . + b_n x^n)
    = (a₀ + b₀) + (a₁ + b₁)x + (a₂ + b₂)x² + . . . + (a_n + b_n)x^n

The multiplicative operation is polynomial multiplication:

(a₀ + a₁x + a₂x² + . . . + a_n x^n) · (b₀ + b₁x + b₂x² + . . . + b_n x^n)
    = (a₀b₀) + (a₁b₀ + a₀b₁)x + (a₂b₀ + a₁b₁ + a₀b₂)x² + . . . + (a_n b_n)x^(2n)

The coefficient operations are from the field GF(q). E.g. over GF(2),

(1 + x + x²) + (0 + x + 0x²) = 1 + 0x + x² = 1 + x²
3.5.1 Irreducible Polynomials
Definition 3.5.1 A polynomial f(x) is irreducible in GF(q) if f(x) cannot be factored into a
product of lower degree polynomials in GF(q)[x].
E.g. consider f(x) = x³ + x + 1.
This polynomial is irreducible in GF(2).
However, it is not irreducible in GF(3) as

x³ + x + 1 = (x + 2)(x² + x + 2) in GF(3)

Note that irreducibility only has meaning once the ring of polynomials used is specified.
Note also that there will always exist some larger field in which the polynomial can be factored.
3.5.2 Primitive Polynomials
Definition 3.5.2 An irreducible polynomial f(x) over GF(p) of degree m is defined as being
primitive if the smallest positive integer n for which f(x) divides x^n − 1 is n = p^m − 1.
E.G. x⁴ + x³ + 1 and x⁴ + x + 1 are the only primitive polynomials of degree 4 in GF(2).
x² + x + 2 and x² + 2x + 2 are the only primitive polynomials of degree 2 in GF(3).
Lists of primitive polynomials can be found in most text books covering block codes. Primitive
polynomials have some useful properties:
At least one primitive polynomial of degree m over GF(p) exists for every m ≥ 1.
Note that it can be shown that any irreducible polynomial of degree m will divide x^(p^m − 1) − 1,
but only primitive polynomials will not divide x^n − 1 for any n < p^m − 1.
Theorem 3.5.3 The roots {α_j} of an m-th degree primitive polynomial p(x) over GF(p)
have order p^m − 1.
Hence, given that α has order p^m − 1, the p^m − 1 successive powers of α form a
multiplicative group of order p^m − 1, with the multiplication operation performed by adding the
powers of α modulo p^m − 1.
This can form the basis of generating the Galois fields GF(p^m).
3.5.3 Generating GF(p^m)
Let p(x) = x^m + a_{m−1}x^{m−1} + . . . + a₁x + a₀ be a primitive polynomial over GF(p) (recalling
that p is prime).
A root α of this primitive polynomial has order p^m − 1.
Hence p(α) = 0, as α is a root, and

p(α) = α^m + a_{m−1}α^{m−1} + . . . + a₁α + a₀ = 0

and

α^m = −( a_{m−1}α^{m−1} + . . . + a₁α + a₀ )

Thus powers of α higher than α^{m−1} can be replaced by polynomials in α with degree less than
m. But as α is of order p^m − 1, the elements α, α², . . . , α^(p^m − 1) must be unique and can be
represented as

α^j = b_{m−1}α^{m−1} + . . . + b₁α + b₀

where the b_i are elements from GF(p).
These p^m − 1 unique polynomials of degree m−1 or less, together with the zero element 0, form a
group under polynomial addition with the identity element 0. The p^m − 1 unique polynomials of
degree m−1 or less form a commutative group under polynomial multiplication (with reduction
using α^m = −(a_{m−1}α^{m−1} + . . . + a₁α + a₀)). The multiplicative identity element is the zero
degree polynomial 1. The addition and multiplication operations distribute. Thus, this set of
polynomials with the zero element 0 forms a finite field with p^m elements, i.e. GF(p^m).
E.G: Generate GF(8)?
GF(8) = GF(2³), hence we require a primitive polynomial of degree 3 over GF(2).
There are 2 polynomials of degree 3 over GF(2) that are primitive. These are p(x) = x³ + x² + 1
and p(x) = x³ + x + 1. We will (arbitrarily) choose p(x) = x³ + x² + 1.
The primitive element α of the field is a root of this polynomial. Hence p(α) = 0 and thus

α³ = −(α² + 1) = α² + 1 over GF(2)

The elements of the field GF(2³) can now be calculated as

0  = 0
α⁰ = 1
α¹ = α
α² = α²
α³ = α² + 1
α⁴ = α·α³ = α(α² + 1) = α³ + α = α² + α + 1
α⁵ = α³·α² = (α² + 1)α² = α⁴ + α² = α + 1
α⁶ = α·α⁵ = α(α + 1) = α² + α

Note that α⁷ = α·α⁶ = α(α² + α) = α³ + α² = 1 as expected (because the order of α is 7).
Addition and multiplication can be readily performed in the field GF(8). E.g. α³ + α⁵ can
be calculated as

α³ + α⁵ = (α² + 1) + (α + 1) = α² + α = α⁶

and α³·α⁵ can be calculated as

α³·α⁵ = α^(|3+5| mod 7) = α

Multiplication can also be done by multiplying the polynomials modulo p(α), i.e.

α³·α⁵ = (α² + 1)(α + 1) mod (α³ + α² + 1)
      = (α³ + α² + α + 1) mod (α³ + α² + 1)
      = α
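This construction is easy to mirror in software; the sketch below (plain Python, with our own bit-level representation b₂b₁b₀ for b₂α² + b₁α + b₀) generates the seven non-zero elements of GF(2³) as successive powers of α using p(x) = x³ + x² + 1.

def gf8_elements():
    """Non-zero elements of GF(2^3) as successive powers of alpha, held as 3-bit integers."""
    elems = []
    e = 1                          # alpha^0 = 1
    for _ in range(7):
        elems.append(e)
        e <<= 1                    # multiply by alpha (shift the polynomial up by one)
        if e & 0b1000:             # a degree-3 term appeared: reduce using alpha^3 = alpha^2 + 1
            e ^= 0b1000 ^ 0b101
    return elems

print([format(e, "03b") for e in gf8_elements()])
# alpha^0 .. alpha^6 -> 001, 010, 100, 101, 111, 011, 110, matching the list above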
3.6 Hardware for operations over GF(2^m)
3.6.1 Addition
The hardware implementation of operations over GF(2^m) is important for high speed en-
coding/decoding operations for many codes. This can be readily accomplished as follows.
The elements of the field GF(2^m) can be written as degree m−1 polynomials in α, the root
of a primitive polynomial p(x) over GF(2), i.e.

β = b_{m−1}α^{m−1} + . . . + b₂α² + b₁α + b₀

with p(α) = 0 and b_i ∈ GF(2), i.e. binary digits 1, 0. Hence, each element can be represented as
an m bit binary word.
Addition of two elements β_a = a_{m−1}α^{m−1} + . . . + a₂α² + a₁α + a₀ and β_b = b_{m−1}α^{m−1} + . . . +
b₂α² + b₁α + b₀ to give the result β_c = c_{m−1}α^{m−1} + . . . + c₂α² + c₁α + c₀ is

β_c = β_a + β_b
    = (a_{m−1}α^{m−1} + . . . + a₂α² + a₁α + a₀) + (b_{m−1}α^{m−1} + . . . + b₂α² + b₁α + b₀)
    = (a_{m−1} + b_{m−1})α^{m−1} + . . . + (a₂ + b₂)α² + (a₁ + b₁)α + (a₀ + b₀)
    = (a_{m−1} ⊕ b_{m−1})α^{m−1} + . . . + (a₂ ⊕ b₂)α² + (a₁ ⊕ b₁)α + (a₀ ⊕ b₀)

where ⊕ represents the exclusive OR function. Fig 3.1 shows the circuit implementation.
Figure 3.1: Hardware for addition in GF(2^m) (a bank of m exclusive OR gates, c_i = a_i ⊕ b_i)
3.6.2 Multiplication
Multiplication is somewhat more complex but readily done. Multiplication of two elements
β_a = a_{m−1}α^{m−1} + . . . + a₂α² + a₁α + a₀ and β_b = b_{m−1}α^{m−1} + . . . + b₂α² + b₁α + b₀ to give the
result β_c = c_{m−1}α^{m−1} + . . . + c₂α² + c₁α + c₀ is

β_c = β_a · β_b
    = (a_{m−1}α^{m−1} + . . . + a₂α² + a₁α + a₀)·(b_{m−1}α^{m−1} + . . . + b₂α² + b₁α + b₀) mod p(α)
    = (a_{m−1}α^{m−1} + . . . + a₂α² + a₁α + a₀) α^{m−1} b_{m−1} mod p(α)
      ⋮
    + (a_{m−1}α^{m−1} + . . . + a₂α² + a₁α + a₀) α b₁ mod p(α)
    + (a_{m−1}α^{m−1} + . . . + a₂α² + a₁α + a₀) b₀ mod p(α)
Consider

(a_{m−1}α^{m−1} + . . . + a₂α² + a₁α + a₀) α mod p(α)

Multiplication by α shifts the polynomial left one position. The modulo p(α) operation can then
be done: if a_{m−1} was 1 then p(α) = α^m + p_{m−1}α^{m−1} + . . . + p₁α + p₀ is subtracted (the same as
addition in GF(2)). Hence,

(a_{m−1}α^{m−1} + . . . + a₂α² + a₁α + a₀) α mod p(α)
    = (a_{m−1}α^m + a_{m−2}α^{m−1} + . . . + a₁α² + a₀α)
      − a_{m−1}(α^m + p_{m−1}α^{m−1} + . . . + p₁α + p₀)
    = (a_{m−2} ⊕ a_{m−1}p_{m−1})α^{m−1} + . . .
      + (a₁ ⊕ a_{m−1}p₂)α² + (a₀ ⊕ a_{m−1}p₁)α + a_{m−1}p₀
For example, consider the primitive element α as a root of the degree 3 primitive polynomial
p(x) = x³ + x² + 1:

p(α) = α³ + α² + 1 = 0

With β_a represented as the three bit binary sequence {a₂, a₁, a₀} representing the element

β_a = a₂α² + a₁α + a₀

multiplication by α can be accomplished by the circuit shown in fig 3.2.
Figure 3.2: Circuit for multiplication by α with p(x) = x³ + x² + 1
In general, multiplication by α can be accomplished by the circuit shown in fig 3.3.

Figure 3.3: Circuit for multiplication by α with p(α) = α^m + p_{m−1}α^{m−1} + . . . + p₁α + p₀ = 0
Recalling

β_c = β_a · β_b
    = β_a α^{m−1} b_{m−1} mod p(α)
      ⋮
    + β_a α b₁ mod p(α)
    + β_a b₀ mod p(α)
and noting that β_a α² = (β_a α)α, β_a α³ = (β_a α²)α and so forth, each β_a α^i can be generated by cascading
circuits as in fig 3.3. Multiplication by b_i is simply an AND gate function. The final sum can
then be generated using exclusive OR gates as in fig 3.1 previously.
With the degree 3 primitive polynomial p(x) = x³ + x² + 1, multiplication can be accomplished
by the circuit shown in fig 3.4.
Figure 3.4: Multiplier over GF(2³) with p(α) = α³ + α² + 1 = 0
A multiplier for any field GF(2^m) can be built up in a similar manner. The structure is quite
similar to an integer multiplier, with the concepts of partial product generation and a summation
tree to generate the result.
For less speed-critical applications, serial versions can be readily implemented to produce a
result in m cycles for a shift-and-add type implementation, or m² cycles for a completely bit serial
approach, with corresponding reductions in hardware.
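The same shift, AND and XOR structure is equally compact in software; the sketch below is our own bit-level Python version (not a description of any particular hardware or library), with elements held as m-bit integers.

def gf2m_multiply(a, b, prim_poly, m):
    """Multiply two GF(2^m) elements given as m-bit integers.

    prim_poly is the primitive polynomial as an integer including the x^m term,
    e.g. m = 3, prim_poly = 0b1101 for p(x) = x^3 + x^2 + 1.
    """
    result = 0
    for i in range(m):
        if (b >> i) & 1:           # partial product a * alpha^i taken when b_i = 1 (the AND gates)
            result ^= a            # summation tree of XORs
        a <<= 1                    # multiply a by alpha
        if a & (1 << m):           # a degree-m term appeared: subtract (XOR in) p(alpha)
            a ^= prim_poly
    return result

# alpha^3 * alpha^5 in GF(8) with p(x) = x^3 + x^2 + 1: (a^2 + 1)(a + 1) = alpha = 010
print(format(gf2m_multiply(0b101, 0b011, 0b1101, 3), "03b"))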
3.6.3 Division/Inversion
Division and inversion are closely related and are generally relatively difficult operations to
perform over GF(2^m), with no simple implementations known. The division operation β₁/β₂ is
normally calculated as an inversion and a multiplication, as β₁·β₂⁻¹. The inversion operation can
be implemented as a lookup table (synthesized logic to implement this is usually preferable)
with 2^m elements. This can readily be implemented in current VLSI for m ≤ 8.
Various algorithms for inversion have been proposed, including ones that operate over a
number of cycles to produce the result.
3.7 Polynomials over Galois Fields
In the design of cyclic codes (later on), it is required that polynomials have coefficients in
the subfield GF(q) and that these polynomials have minimum degree.
Consider polynomials that have a specific root β ∈ GF(q^m).
3.7.1 Minimal Polynomials
Definition 3.7.1 Let β be an element in GF(q^m). The minimal polynomial of β with respect
to GF(q) is the smallest degree polynomial p(x) in GF(q)[x] such that p(β) = 0.
Theorem 3.7.2 For each element β in GF(q^m), there exists a unique monic (highest power
has coefficient 1) polynomial p(x) of minimal degree in GF(q)[x] such that the following are true:
1. p(β) = 0
2. the degree of p(x) is less than or equal to m.
3. f(β) = 0 implies that f(x) is a multiple of p(x).
4. p(x) is irreducible in GF(q)[x].
Thus an alternative definition of a primitive polynomial is that a primitive polynomial is the
minimal polynomial of a primitive element α in a Galois field.
3.8 Exercises
1. A degree 4 primitive polynomial over GF(2) is x⁴ + x³ + 1. Use this primitive polynomial
to create the Galois field GF(2⁴) of 16 elements, i.e. write out all the elements as degree 3
or less polynomials in α.
Calculate α⁷ + α⁸.
Calculate α⁷ · α⁸.
Calculate (α⁷)⁻¹, i.e. the multiplicative inverse of α⁷.
Draw a circuit to add 2 elements from this field.
Draw a circuit to multiply 2 elements from this field.
Chapter 4
Block codes
4.1 Introduction
A block error control code maps a block of k data symbols onto a block of n code symbols
from GF(q), as shown in fig 4.1.
There is no memory between each encoding operation.
Figure 4.1: Block Encoder operation (k uncoded data symbols in, n coded symbols out)
An arbitrary block code C can be thought of as a set of M codewords {c₀, c₁, . . . , c_{M−1}}, with
each codeword c_i consisting of n q-ary symbols

c_i = {c_{i,0}, c_{i,1}, . . . , c_{i,n−1}}

The k input symbols represent q^k possible words and hence ideally M = q^k, i.e. each input
word is mapped to one codeword. If this is not the case, then more complex segmentation of
the input data stream is required (this is not difficult, but is cumbersome).
The number of q-ary symbols that can be encoded with the code is log_q M and this produces
n q-ary symbols, hence the rate of such a block code can be calculated as

R = (log_q M) / n
E.G. A particular code consists of 12 codewords, each with 11 binary symbols.
The rate of this code is

(log₂ 12) / 11 ≈ 0.326

Fig 4.2 shows a model of a channel from the block coding point of view. In effect, the received
signal can be viewed as the transmitted signal with an error signal added (over GF(q)):

r = c + e

or

{r₀, r₁, . . . , r_{n−1}} = {c₀, c₁, . . . , c_{n−1}} + {e₀, e₁, . . . , e_{n−1}}

with errors represented by non-zero values of e_i. The effect of errors introduced in the channel
is to change the values of one or more elements of the transmitted codeword.
Figure 4.2: Block Code error model (r_i = c_i + e_i)
The addition of the redundant information can provide a number of error control properties
to alleviate the effect of channel errors.
Error Detection: If an error is detected, the receiver of the codeword can
Ignore the data (e.g. in an audio system)
Tag it as incorrect.
Request retransmission of the data
Error Correction:
In this case the decoder attempts to correct the errors and may also tag information such
as how many errors were corrected or some reliability measures.
An error can be detected when the received sequence r is not one of the defined codewords,
i.e. r ∉ C. However, sometimes the errors that occur are such that the received sequence r is
actually transformed into another codeword. In this case, the error pattern is not detected.
Similarly, the error correction procedure may mis-correct the received data due to too
many errors in the channel.
A number of concepts can aid in the analysis of the error properties of block codes.
Weight of a code word
Definition 4.1.1 The weight of a code word or error pattern is defined as the number of non-
zero elements in the code word, and is denoted w(c) for the word c.
E.G. the weight of the codeword {0,0,1,0,1} is 2.
Distance between codewords
Given two codewords v = {v₀, v₁, . . . , v_{n−1}} and w = {w₀, w₁, . . . , w_{n−1}}, it is important to have
a measure of their difference, or the distance between them. If they represent signal values then
their difference could be represented as the Euclidean distance

d_Euclidean = √( Σ_i (v_i − w_i)² )

This metric is commonly used in convolutional and soft decision decoding. However, for block
codes with hard decision decoding, where the symbols are represented by symbols from
GF(q), the Hamming distance is more useful. It is defined as the number of positions
in which two codewords differ.
E.G. if v = {0,0,1,0,1} and w = {1,1,1,0,0}, then

d_Hamming(v, w) = 3

Minimum Distance of a Block Code
Definition 4.1.2 The Minimum Distance of a Block Code C is defined as the minimum
Hamming distance between any two distinct codewords from C.
Performance of Block Codes
For an error detection code to fail, the errors that occur in the channel must be sufficient to
transform one codeword into another. For this to occur with a code C that has a minimum
Hamming distance d_min, the error pattern e that occurred in the channel must have a weight
greater than or equal to d_min, or:
a code with minimum Hamming distance d_min can detect all error patterns e with
weight w(e) ≤ d_min − 1
Some error patterns with a larger weight will also be detected, but not all.
For error correction, the goal is to correct the errors. Let p_C(c) be the probability that
codeword c is transmitted.
Let the received words r have a probability p_R(r).
Two types of detection schemes could be considered when the word r_j is received. The
maximum a posteriori (MAP) decoder is defined as the decoder that chooses the codeword
c_i that maximizes

p(c_i | r_j)

The more common maximum likelihood (ML) decoder is defined as the decoder
that chooses the codeword c_i that maximizes

p(r_j | c_i)

These can be related as

p(c_i | r_j) = p(r_j | c_i) p_C(c = c_i) / p_R(r_j)

and as p_R(r_j) is constant while we maximize with respect to c_i, the difference is that the MAP
decoder takes into account the probability distribution of the transmitted codewords
p_C(c).
Hence, if the transmitted codewords are equally likely (which is usually the case), then
MAP and ML decoding are equivalent, and ML decoding is commonly used. (However, the more
recent development of Turbo codes uses the MAP decoder.)
The conditional probability p(r_j | c_i) is equal to the probability of the occurrence of the error
pattern e = r_j − c_i.
The weight of e is the number of errors that occurred.
Assuming errors occur independently, and that more errors are less likely than fewer errors,
maximizing p(r_j | c_i) is equivalent to choosing the value of c_i that minimizes the number of
errors, i.e. minimizing w(e).
Hence the ML decoder picks the codeword c_i that is closest in Hamming distance to the
received word r_j.
If the block code used has a minimum distance d_min, then all other valid codewords must be at
least this distance from the correct codeword, and a decoding error can be made only if the number
of errors is at least half this distance, or:
a code with minimum Hamming distance d_min can correct all error patterns e with
weight w(e) ≤ ⌊(d_min − 1)/2⌋
Some error patterns with a larger weight may also be corrected, but not all.
This allows two options:
Complete Decoder: given a received word r_j, choose the codeword c_i such that
d_Hamming(r_j, c_i) is minimized.
Bounded Distance Decoder: given a received word r_j, choose the codeword c_i such
that d_Hamming(r_j, c_i) ≤ t, where t = ⌊(d_min − 1)/2⌋. Otherwise a decoder failure is declared.
E.G. consider the 4 bit binary repetition code with codewords 0000 and 1111. It has
d_min = 4 and can hence detect 3 errors or correct 1 error. The decoder tables are shown in
table 4.1.
Geometric View

A geometric interpretation can be applied to block code distances. The Hamming sphere of radius t is defined as the set of points in the n dimensional vector space over GF(q) that are a Hamming distance of at most t from a particular centre point p.

For each distance i from the centre, the number of points that have i different values from p is

  C(n, i) (q − 1)^i

i.e. the number of ways of arranging i objects into n spaces times the (q − 1)^i different values that are allowed.

Hence, the total number of points in the Hamming sphere (its volume) can be calculated as

  V_q(n, t) = Σ_{i=0}^{t} C(n, i) (q − 1)^i

Of interest is calculating the minimum redundancy required for a given amount of error correction performance, as measured by the number of errors t that can be corrected. An exact solution is not known, but the problem can be bounded.
r Complete Decoder Bounded Distance Decoder
0000 0000 0000
0001 0000 0000
0010 0000 0000
0011 0000 or 1111 Failure
0100 0000 0000
0101 0000 or 1111 Failure
0110 0000 or 1111 Failure
0111 1111 1111
1000 0000 0000
1001 0000 or 1111 Failure
1010 0000 or 1111 Failure
1011 1111 1111
1100 0000 or 1111 Failure
1101 1111 1111
1110 1111 1111
1111 1111 1111
Table 4.1: Decoder tables for 4 bit binary repetition code
Upper Bound

Each of the M codewords in C is associated with a sphere of radius t. Since the code corrects t errors, these spheres must be disjoint, so the space must contain at least M V_q(n, t) points:

  M V_q(n, t) ≤ q^n   or   M ≤ q^n / V_q(n, t)

Therefore the code rate must be

  R = log_q(M)/n ≤ log_q( q^n / V_q(n, t) )/n = ( n − log_q V_q(n, t) )/n

or

  R ≤ 1 − log_q V_q(n, t) / n

This is known as the Hamming bound and provides an upper bound on the code rate achievable.
Lower Bound

Given the n dimensional vector space over GF(q), construct a code one codeword at a time: select any remaining point as a codeword, then delete it and all points within distance 2t of it from consideration. Repeat for each of the M codewords until no points remain. This code will have d_min ≥ 2t + 1 and can correct t errors. Since each codeword removes at most V_q(n, 2t) points,

  q^n ≤ M V_q(n, 2t)   or   M ≥ q^n / V_q(n, 2t)

and the code rate

  R = log_q(M)/n ≥ log_q( q^n / V_q(n, 2t) )/n = ( n − log_q V_q(n, 2t) )/n

or

  R ≥ 1 − log_q V_q(n, 2t) / n

This is known as the Gilbert bound and provides a lower bound on the code rate achievable.
E.G. calculate the upper and lower bounds for a triple bit error correcting binary code of length 12. Hence n = 12 and t = 3:

  1 − log_q V_q(n, 2t)/n ≤ R ≤ 1 − log_q V_q(n, t)/n

V_2(12, 6) evaluates to 2510 and V_2(12, 3) evaluates to 299, and

  1 − log_2(2510)/12 ≤ R ≤ 1 − log_2(299)/12

or

  0.0588 ≤ R ≤ 0.3147

Clearly, the bounds are not very tight!
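The sphere volume and both bounds are easy to evaluate numerically. The following Python sketch (not part of the original notes; the function names are my own) reproduces the n = 12, t = 3 calculation above.

```python
# A minimal sketch reproducing the n = 12, t = 3 bound calculation above.
from math import comb, log

def sphere_volume(q: int, n: int, t: int) -> int:
    """Number of points within Hamming distance t of a centre in GF(q)^n."""
    return sum(comb(n, i) * (q - 1) ** i for i in range(t + 1))

def hamming_bound(q: int, n: int, t: int) -> float:
    """Upper bound on the rate R of a t-error-correcting code of length n."""
    return 1 - log(sphere_volume(q, n, t), q) / n

def gilbert_bound(q: int, n: int, t: int) -> float:
    """Lower bound on the achievable rate R."""
    return 1 - log(sphere_volume(q, n, 2 * t), q) / n

print(sphere_volume(2, 12, 3), sphere_volume(2, 12, 6))            # 299 2510
print(f"{gilbert_bound(2, 12, 3):.4f} <= R <= {hamming_bound(2, 12, 3):.4f}")
# 0.0588 <= R <= 0.3147
```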
Fig 4.3 shows the bounds for a double bit (t = 2) binary error correcting code vs code length, for codes up to length 255. For example, from the graph it can be seen that there does not exist a binary code of length 100 that can correct 2 bit errors with a rate of 0.90. Hence a search for such a code is futile.
Figure 4.3: Rate bounds for a double bit binary error correcting code vs code length
A code that satisfies the Hamming bound with equality is known as a Perfect code.

A number of perfect codes have been developed. Hamming single error correcting codes are perfect codes. The Golay code {q = 2, n = 23, k = 12, t = 3} is a perfect code.

Perfect codes are of mathematical interest; however, from a practical point of view, many powerful codes (such as Reed Solomon codes) which are not perfect are widely used.

As the length of the code n increases, with a fixed probability of error p_e, the average number of errors is n p_e. Hence longer codes should have a greater minimum distance to be useful.

As n increases, consider the ratio d_min / n.

It can be shown that codes exist for which this ratio remains non-zero as n → ∞. These are referred to as asymptotically good codes. Most practically used codes are not asymptotically good codes.
4.2 Linear Codes

Consider mapping k bit binary words onto n bit binary codewords. There are M = 2^k source words and N = 2^n possible codewords.

The number of possible mappings is

  C(N, M)

For example, consider a k = 6, n = 8 code. There are

  C(2^8, 2^6) ≈ 1.9 × 10^61

possible mappings!

If a computer could test 1 million per second, then it would take more than 3.6 × 10^49 years to check all possible codes!

As this is clearly impractical, and n = 8 is too small anyway, some form of structured method of code design is required.

The most basic simplification is to restrict the search to Linear Codes. The mathematical properties of linear codes will allow the design of useful codes.
The definition of a linear code C is:

Definition 4.2.1 The q-ary code C consisting of n-tuples {c_0, c_1, . . . , c_{n−1}} of symbols from GF(q) is a linear code if and only if C forms a vector subspace over GF(q).

The dimension of the subspace is k, and thus there are q^k codewords of length n. Such codes are described as (n, k) codes.

The restriction to linear codes provides a number of useful properties which follow from the definition of vector spaces.

The linear combination of any two codewords is another codeword, i.e. if c and c' are codewords, then c'' = c − c' must also be a codeword. Hence, the all zero vector must be one of the codewords.

The minimum distance of the code must equal the weight of the minimum weight non-zero codeword:

  d_min = min_{c ≠ c', c, c' ∈ C} d(c, c')
        = min_{c ≠ c', c, c' ∈ C} w(c − c')
        = min w(c'')   with c'' = c − c' ≠ 0
        = min_{c'' ≠ 0, c'' ∈ C} w(c'')
The undetectable error patterns are independent of the codeword transmitted and must be one of the non-zero codewords, i.e. if codeword c is transmitted and codeword c' is received in error, then e = c − c' must be a codeword.

As the code is defined as a vector space, the code can be described by a set of k basis vectors {g_0, g_1, . . . , g_{k−1}} rather than a list of q^k codewords.
Any codeword can be written as a linear combination of the basis vectors:

  c = a_0 g_0 + a_1 g_1 + . . . + a_{k−1} g_{k−1}

Hence each of the q^k source words can be readily encoded into codewords.

This can be written in matrix form with the matrix G defined as:

  G = [ g_0     ]   [ g_{0,0}    g_{0,1}    . . .  g_{0,n−1}   ]
      [ g_1     ] = [ g_{1,0}    g_{1,1}    . . .  g_{1,n−1}   ]
      [  ...    ]   [  ...                         ...         ]
      [ g_{k−1} ]   [ g_{k−1,0}  g_{k−1,1}  . . .  g_{k−1,n−1} ]
This matrix is called the generator matrix and can be used for encoding. A given source message of k q-ary symbols can be formed into a vector m as

  m = [m_0, m_1, . . . , m_{k−1}],   m_i ∈ GF(q)

The corresponding codeword can be calculated as

  c = mG = [m_0, m_1, . . . , m_{k−1}] [ g_0; g_1; . . . ; g_{k−1} ] = m_0 g_0 + m_1 g_1 + . . . + m_{k−1} g_{k−1}
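As an illustration of encoding with a generator matrix, the following Python sketch (my own, not from the notes) computes c = mG over GF(2) for the (7,3) code whose generator matrix is used later in this chapter.

```python
# A minimal sketch of c = mG over GF(2) for the (7,3) code used later in this
# chapter (basis vectors 1000111, 0101011, 0011101).
G = [[1,0,0,0,1,1,1],
     [0,1,0,1,0,1,1],
     [0,0,1,1,1,0,1]]

def encode(m, G):
    """GF(2) linear combination of the rows of G selected by the message m."""
    c = [0] * len(G[0])
    for mi, row in zip(m, G):
        if mi:
            c = [ci ^ ri for ci, ri in zip(c, row)]
    return c

print(encode([1, 0, 1], G))   # [1, 0, 1, 1, 0, 1, 0], a codeword of the (7,3) code
```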
Because C is a vector subspace of the vector space formed by the n-tuples of elements from GF(q), the dual space of C exists; it is denoted C⊥ and has dimension n − k.

If the basis vectors of this subspace are {h_0, h_1, . . . , h_{n−k−1}}, then the matrix H defined as

  H = [ h_0       ]   [ h_{0,0}      h_{0,1}      . . .  h_{0,n−1}     ]
      [ h_1       ] = [ h_{1,0}      h_{1,1}      . . .  h_{1,n−1}     ]
      [  ...      ]   [  ...                             ...           ]
      [ h_{n−k−1} ]   [ h_{n−k−1,0}  h_{n−k−1,1}  . . .  h_{n−k−1,n−1} ]

is known as the parity check matrix.

Let h be any vector in the subspace C⊥; then by definition

  c · h = 0

where c is a codeword, i.e. c ∈ C. As H is defined as the matrix of basis vectors of C⊥, it follows that

  cH^T = 0

for every codeword c.

Conversely, if cH^T = 0 then c · h = 0 for all h ∈ C⊥, so c lies in the dual of C⊥, which is C itself. Thus, to summarize:

a vector c is a codeword in C if and only if cH^T = 0

The parity check matrix can then be used to check whether a received word r is a valid codeword by calculating the value of rH^T. If the result is non-zero then an error has been detected.
4.2.1 Minimum distance of a Linear Block Code

Let the code C have a parity check matrix H. Expressing H as a set of column vectors

  H = [ d_0  d_1  . . .  d_{n−1} ]

the matrix operation cH^T can be written as

  cH^T = c_0 d_0 + c_1 d_1 + . . . + c_{n−1} d_{n−1}

Hence, if c is a weight w codeword, then cH^T is a combination of w columns of H, and:

The minimum distance of C is the minimum (nonzero) number of columns of H for which a nontrivial linear combination sums to zero.

Based on this observation a bound on the minimum distance of an (n, k) linear code can be derived.

An (n, k) code has a parity check matrix H containing n − k linearly independent rows. Thus the column rank must equal the row rank of n − k, so any n − k + 1 columns are linearly dependent; hence there must exist a nontrivial linear combination of n − k + 1 columns that sums to zero. Thus:

The minimum distance of an (n, k) code is bounded by

  d_min ≤ n − k + 1

This is known as the Singleton bound.
The use of generator and parity check matrices for encoding and error detection allows long codes to be used without huge lookup tables, and is a direct result of the mathematical structure of a linear code.

To simplify decoding of the message data from a codeword, it is common practice to reorder the generator matrix into a systematic form:

  G = [ P | I_k ] = [ p_{0,0}    p_{0,1}    . . .  p_{0,n−k−1}    1 0 0 . . . 0 ]
                    [ p_{1,0}    p_{1,1}    . . .  p_{1,n−k−1}    0 1 0 . . . 0 ]
                    [  ...                         ...            ...           ]
                    [ p_{k−1,0}  p_{k−1,1}  . . .  p_{k−1,n−k−1}  0 0 0 . . . 1 ]

where P is a k × (n − k) matrix and I_k is the k × k identity matrix. It can be proved that this is always possible (up to a permutation of the symbol positions, giving an equivalent code) by noting that the rows are linearly independent and the column rank of a matrix equals its row rank.
The encoding process is now

  c = mG = [m_0, m_1, . . . , m_{k−1}] [ P | I_k ]
         = [c_0, c_1, . . . , c_{n−k−1}, m_0, m_1, . . . , m_{k−1}]

and the codeword is formed as n − k check symbols followed by the k message symbols. Note that this has not changed the set of valid codewords; only the mapping from messages to codewords has changed.

After decoding, the check symbols can be discarded and the message symbols passed directly out of the decoder.
The equivalent parity check matrix can be shown to be

  H = [ I_{n−k} | P^T ] = [ 1 0 . . . 0   p_{0,0}      p_{1,0}      . . .  p_{k−1,0}     ]
                          [ 0 1 . . . 0   p_{0,1}      p_{1,1}      . . .  p_{k−1,1}     ]
                          [ ...           ...                              ...           ]
                          [ 0 0 . . . 1   p_{0,n−k−1}  p_{1,n−k−1}  . . .  p_{k−1,n−k−1} ]

(over GF(2); for non-binary fields −P^T takes the place of P^T).
4.2.2 Decoding

While error detection can be accomplished by multiplying the received word by the transpose of the parity check matrix and checking for a non-zero result, error correction is more complex.

The most obvious solution is a decoding table: a lookup table with an entry for each possible received sequence, of which there are 2^n.

This array decoding method is only suitable for codes with very small n, to allow a practical implementation of the decoder hardware/software.

A standard array decoder for a linear code C can be constructed as follows:

Create a list L of all possible words of length n.
Remove each codeword (starting with 0) from the list and place it at the head of a column.
While the list is not empty, remove the lowest weight word w remaining and place it in the first column. Then fill that row with the first row (of codewords) plus w, removing each entry from the list.
E.G. Consider the (7,3) binary linear code with basis vectors 1000111, 0101011 and 0011101. One generator matrix for such a code is

  G = [ 1000111 ]
      [ 0101011 ]
      [ 0011101 ]

The standard decoding array is shown in table 4.2. The top row consists of the codewords. The left column consists of the error patterns (coset leaders). While all the weight 1 error patterns are enumerated, note that not all of the weight 2 or higher error patterns are enumerated, as the code only guarantees to correct 1 bit errors (d_min = 4, so t = 1).
0000000 0011101 0101011 0110110 1000111 1011010 1101100 1110001
0000001 0011100 0101010 0110111 1000110 1011011 1101101 1110000
0000010 0011111 0101001 0110100 1000101 1011000 1101110 1110011
0000100 0011001 0101111 0110010 1000011 1011110 1101000 1110101
0001000 0010101 0100011 0111110 1001111 1010010 1100100 1111001
0010000 0001101 0111011 0100110 1010111 1001010 1111100 1100001
0100000 0111101 0001011 0010110 1100111 1111010 1001100 1010001
1000000 1011101 1101011 1110110 0000111 0011010 0101100 0110001
0000011 0011110 0101000 0110101 1000100 1011001 1101111 1110010
0000101 0011000 0101110 0110011 1000010 1011111 1101001 1110100
0000110 0011011 0101101 0110000 1000001 1011100 1101010 1110111
0001001 0010100 0100010 0111111 1001110 1010011 1100101 1111000
0001010 0010111 0100001 0111100 1001101 1010000 1100110 1111011
0001100 0010001 0100111 0111010 1001011 1010110 1100000 1111101
0010010 0001111 0111001 0100100 1010101 1001000 1111110 1100011
0001110 0010011 0100101 0111000 1001001 1010100 1100010 1111111
Table 4.2: Standard decoding array for (7,3) binary linear code
Hence if the received pattern was 0110010 then, from the table, 0110110 would be decoded as the error corrected codeword.

A more practical decoding method is syndrome decoding.
4.2.3 Syndrome Decoding

Consider a received vector r = c + e, with e being the error pattern introduced in the channel. Compute a syndrome s as

  s = rH^T

However,

  s = rH^T = (c + e)H^T = cH^T + eH^T = 0 + eH^T = eH^T

Hence the syndrome s depends only on the error pattern e.
However, as there are 2^n error patterns and only 2^{n−k} syndromes, many distinct error patterns have the same syndrome.

Let two distinct error patterns e and e' have the same syndrome. Then

  s = eH^T = e'H^T
  0 = eH^T − e'H^T = (e − e')H^T
  (e − e') = c ∈ C   and   c ≠ 0

As the minimum distance of the code equals the weight of the smallest weight non-zero codeword, two error patterns with the same syndrome must differ by a codeword of weight at least d_min. Hence if two distinct error patterns have the same syndrome, at most one of them can have weight t = ⌊(d_min − 1)/2⌋ or less.

Thus decoding can be accomplished by creating a lookup table of the 2^{n−k} syndromes and the smallest weight error pattern for each syndrome. The 2^{n−k} entry lookup table is far smaller than a 2^n entry table.

For a perfect code, the 2^{n−k} syndromes all correspond to unique error patterns of weight t or less. Otherwise, there will be some syndromes which correspond to non-unique higher weight error patterns. In that case, these could be flagged as unreliable corrections, or a best guess attempt made.
For the example with

  G = [ 1000111 ]
      [ 0101011 ]
      [ 0011101 ]

whose standard decoding array is shown in table 4.2, the parity check matrix H is

  H = [ 0111000 ]
      [ 1010100 ]
      [ 1100010 ]
      [ 1110001 ]

Table 4.3 shows the syndromes for some error patterns. Note that all the weight 1 error patterns have unique syndromes. Error patterns with weight 2 or greater do not have unique syndromes.
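A small Python sketch (my own) of syndrome decoding with this H: it builds the weight 1 part of Table 4.3 and corrects the received word 0110010 used in the standard array example above.

```python
H = [[0,1,1,1,0,0,0],
     [1,0,1,0,1,0,0],
     [1,1,0,0,0,1,0],
     [1,1,1,0,0,0,1]]              # the parity check matrix given above

def syndrome(r):
    """s = r H^T over GF(2)."""
    return tuple(sum(hij * rj for hij, rj in zip(row, r)) % 2 for row in H)

# each weight-1 error pattern has a distinct syndrome, so single errors are correctable
table = {}
for pos in range(7):
    e = [0] * 7; e[pos] = 1
    table[syndrome(e)] = e
print(len(table))                  # 7 distinct non-zero syndromes, as in Table 4.3

r = [0,1,1,0,0,1,0]                # the received word 0110010 from the example above
c = [ri ^ ei for ri, ei in zip(r, table[syndrome(r)])]
print(c)                           # [0,1,1,0,1,1,0] -> 0110110, the decoded codeword
```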
4.3 Weight Distribution of Block Codes

One of the characteristics of a code that can be of interest (especially with recent developments in coding theory) is the weight distribution of the code.

The weight distribution of an (n, k) code C is the series of coefficients A_0, A_1, . . . , A_n, where A_i is the number of codewords in C with weight i.

The weight distribution is often written as the polynomial

  A(x) = A_0 + A_1 x + A_2 x^2 + . . . + A_n x^n

called the weight enumerator.

For the (7,3) linear block code previously described,

  A(x) = 1 + 7x^4
4.4 Hamming Codes

Hamming codes were the first major class of linear binary codes and were published in 1950. Hamming codes are defined by a single integer m and have the properties given in table 4.4.

The parity check matrix for a Hamming code can be constructed easily. For a length 2^m − 1 Hamming code, construct the parity check matrix H as an m × (2^m − 1) matrix whose columns are the 2^m − 1 non-zero binary m-tuples.

Any ordering of the columns forms a Hamming code, but a systematic arrangement is usually preferred.
Weight Error Pattern Syndrome
0 0000000 0000
1 0000001 0001
1 0000010 0010
1 0000100 0100
1 0001000 1000
1 0010000 1101
1 0100000 1011
1 1000000 0111
2 0000011 0011
2 0000101 0101
2 0000110 0110
2 0001001 1001
2 0001010 1010
2 0001100 1100
2 0010001 1100
2 0010010 1111
2 0010100 1001
2 0011000 0101
2 0100001 1010
2 0100010 1001
2 0100100 1111
2 0101000 0011
2 0110000 0110
2 1000001 0110
2 1000010 0101
2 1000100 0011
2 1001000 1111
2 1010000 1010
2 1100000 1100
. . . . . . . . .
Table 4.3: syndromes for (7,3) linear block code.
Code length                    n = 2^m − 1
Number of information symbols  k = 2^m − m − 1
Number of check symbols        n − k = m
Error correcting capability    t = 1

Table 4.4: Properties of Hamming codes
E.G. Construct an m = 4 Hamming code.

In this case n = 2^m − 1 = 15 and k = 2^m − m − 1 = 11, thus a (15, 11) code is required. One such code would be

  H = [ 100011111110000 ]
      [ 010011110001110 ]
      [ 001011001101101 ]
      [ 000110101011011 ]

In general, for a parity check matrix defined like this, as all the columns are distinct and non-zero, at least 3 columns must be summed to get 0; hence the minimum distance of a Hamming code is 3, implying a single bit error correction capability.
It can be readily shown that Hamming codes meet the Hamming bound. Recall

  V_q(n, t) = Σ_{i=0}^{t} C(n, i) (q − 1)^i

For a Hamming code,

  V_2(2^m − 1, 1) = C(2^m − 1, 0) + C(2^m − 1, 1) = 1 + 2^m − 1 = 2^m

Hence

  1 − log_q V_q(n, t)/n = 1 − log_2(2^m)/(2^m − 1) = (2^m − 1 − m)/(2^m − 1)

But the code rate is

  R = k/n = (2^m − m − 1)/(2^m − 1)

Hence

  R = 1 − log_q V_q(n, t)/n

and therefore Hamming codes meet the Hamming bound and are perfect codes.
Table 4.5 lists some of the possible binary Hamming codes that can be generated. Clearly, arbitrarily long codes can be generated. Unfortunately, the minimum distance is always 3 (t = 1), and thus the ratio d_min/n → 0 as n → ∞.
n k Rate
3 1 0.333333
7 4 0.571429
15 11 0.733333
31 26 0.838710
63 57 0.904762
127 120 0.944882
255 247 0.968627
511 502 0.982387
1023 1013 0.990225
2047 2036 0.994626
4095 4083 0.997070
. . . . . . . . .
Table 4.5: Some allowed Hamming Code rates
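The table entries follow directly from n = 2^m − 1 and k = 2^m − m − 1; a few lines of Python (my own sketch) reproduce them.

```python
# A one-liner sketch reproducing Table 4.5: Hamming code parameters and rate k/n.
for m in range(2, 13):
    n, k = 2**m - 1, 2**m - m - 1
    print(f"{n:5d} {k:5d} {k/n:.6f}")
```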
4.4.1 Decoding Hamming Codes

Hamming codes can be easily decoded using syndrome decoding, and there is a simple relation between the syndromes and the parity check matrix because only single error patterns need to be corrected.

Denote the parity check matrix as an array of columns:

  H = [ d_0  d_1  . . .  d_{n−1} ]

The syndrome of an error pattern consisting of a single error at position j can be calculated as

  s = eH^T = [0, 0, . . . , 0, 1, 0, . . . , 0] [ d_0^T; d_1^T; . . . ; d_{n−1}^T ] = d_j^T

Hence, if the received syndrome is non-zero, the decoder simply finds the column of H that equals the syndrome and complements the corresponding received bit to correct the error.
E.G. For the example (15,11) Hamming code, the corresponding generator matrix would be

  G = [ 111110000000000 ]
      [ 111001000000000 ]
      [ 110100100000000 ]
      [ 110000010000000 ]
      [ 101100001000000 ]
      [ 101000000100000 ]
      [ 100100000010000 ]
      [ 011100000001000 ]
      [ 011000000000100 ]
      [ 010100000000010 ]
      [ 001100000000001 ]

with

  H = [ 100011111110000 ]
      [ 010011110001110 ]
      [ 001011001101101 ]
      [ 000110101011011 ]
Let a sample 11 bit message be

  m = 10101010101

The corresponding codeword c = mG is

  c = 010110101010101

Let a transmission error occur in bit 14, resulting in a received word

  r = 010110101010111

At the receiver, the syndrome s = rH^T is calculated as

  s = eH^T = 0101

This is equal to column 14 of H (transposed); hence there was an error in position 14. The corrected word is thus c = 010110101010101, and the recovered message is

  m = 10101010101

which is correct. Thus the (15, 11) Hamming code corrected the single bit error.

Consider the case of a double bit error. Let transmission errors occur in bit 14 and bit 2, resulting in a received word

  r_1 = 000110101010111

At the receiver, the syndrome s_1 = r_1 H^T is calculated as

  s_1 = eH^T = 0001

and the wrong bit (position 4) is inverted, resulting now in 3 errors (i.e. a miscorrection).
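The worked example can be reproduced with a short Python sketch (my own construction of G = [P | I_11] and H = [I_4 | P^T] for the matrices above).

```python
# A sketch of (15,11) Hamming encoding and syndrome decoding reproducing the
# single-error example above (message 10101010101, error in bit position 14).
import numpy as np

P = np.array([[1,1,1,1],[1,1,1,0],[1,1,0,1],[1,1,0,0],[1,0,1,1],[1,0,1,0],
              [1,0,0,1],[0,1,1,1],[0,1,1,0],[0,1,0,1],[0,0,1,1]])
G = np.hstack([P, np.eye(11, dtype=int)])      # G = [P | I11]
H = np.hstack([np.eye(4, dtype=int), P.T])     # H = [I4 | P^T]

m = np.array([1,0,1,0,1,0,1,0,1,0,1])
c = (m @ G) % 2                                # -> 010110101010101
r = c.copy(); r[13] ^= 1                       # flip bit position 14 (index 13)
s = (r @ H.T) % 2                              # -> [0 1 0 1]

# single-error correction: find the column of H equal to the syndrome, flip that bit
for j in range(15):
    if np.array_equal(H[:, j], s):
        r[j] ^= 1
print(np.array_equal(r, c))                    # True: the error has been corrected
```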
4.4.2 Weight Distribution of Hamming Codes

The weight enumerator of an (n, k) binary Hamming code can be shown to be

  A(x) = (1/(n + 1)) [ (1 + x)^n + n(1 − x)(1 − x^2)^{(n−1)/2} ]
4.5 Non-Binary Hamming Codes

Non-binary Hamming codes can also be constructed over GF(q).

For a given m there are q^m possible m-tuples, or q^m − 1 non-zero m-tuples.

For each m-tuple d = [a_0, . . . , a_{m−1}], a_i ∈ GF(q), there are q − 1 m-tuples that are non-zero multiples of d, i.e. γd for each non-zero γ ∈ GF(q).

Hence a Hamming code can be constructed by choosing exactly one m-tuple from each such set of multiples as a column of H. For example, H can be constructed using all q-ary m-tuples whose first non-zero element is 1. There are therefore (q^m − 1)/(q − 1) possible columns and the code has the properties:

  n = (q^m − 1)/(q − 1)
  k = (q^m − 1)/(q − 1) − m

For example, consider a Hamming code based on GF(5) with m = 2. (This field is just {0, 1, 2, 3, 4} with addition and multiplication modulo 5.)

With m = 2, this implies a (6, 4) code is required.
The parity check matrix H can be constructed as

  H = [ 1 0 1 1 1 1 ]
      [ 0 1 1 2 3 4 ]

and the corresponding generator matrix would be

  G = [ 4 4 1 0 0 0 ]
      [ 4 3 0 1 0 0 ]
      [ 4 2 0 0 1 0 ]
      [ 4 1 0 0 0 1 ]
As an example operation, consider the message

  m = 0234

The corresponding codeword c = mG is

  c = 110234

Let a transmission error occur in position 5, changing the 3 to a 1 and resulting in a received word

  r = 110214

At the receiver, the syndrome s = rH^T is calculated as

  s = eH^T = 34

Here syndrome decoding is required, as an error value is needed as well as an error location. Table 4.6 tabulates the syndromes and corresponding error patterns for this code.

From the table, it can be seen that the error pattern corresponding to the syndrome s = 34 is

  e = 000030

Thus the corrected codeword is

  c = r − e = 110214 − 000030 = 110234

and the recovered message is thus

  m = 0234

which is correct. Thus the (6, 4) Hamming code over GF(5) corrected the single symbol error.
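The same example in a short Python sketch (my own); over GF(5) the arithmetic is simply modulo 5.

```python
# A sketch of the GF(5) (6,4) Hamming example above.
import numpy as np

q = 5
H = np.array([[1,0,1,1,1,1],
              [0,1,1,2,3,4]])
G = np.array([[4,4,1,0,0,0],
              [4,3,0,1,0,0],
              [4,2,0,0,1,0],
              [4,1,0,0,0,1]])

m = np.array([0,2,3,4])
c = (m @ G) % q                     # -> [1 1 0 2 3 4]
r = c.copy(); r[4] = 1              # error in position 5: symbol 3 received as 1
s = tuple((r @ H.T) % q)            # -> (3, 4), as in Table 4.6

# build the syndrome table for all single-symbol errors (Table 4.6)
table = {}
for pos in range(6):
    for val in range(1, q):
        e = np.zeros(6, dtype=int); e[pos] = val
        table[tuple((e @ H.T) % q)] = e
corrected = (r - table[s]) % q
print(corrected.tolist() == c.tolist())   # True
```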
4.6 Modified Linear Codes

While the code parameters n and k for Hamming (and other) codes are very specific, it is often desirable to modify them.

For example, if we required a single bit error correcting code for a 64 bit memory error correction scheme, there is no binary Hamming code with k = 64. We could however use a shortened code.

4.6.1 Shortened Code

A shortened code is obtained by deleting a message symbol from the codeword to yield an (n − 1, k − 1) code.

This can readily be done by assuming the deleted message symbol was a 0. Encode the k − 1 message symbols by appending a 0 and calculating the n − k check symbols; however, store or transmit only the n − 1 remaining symbols (omitting the 0).

On receipt of the n − 1 received symbols, insert the 0 and decode as normal.

Alternatively, the generator and parity check matrices can be modified to reflect the reduced number of message symbols.
Error Pattern Syndrome
1 0 0 0 0 0 1 0
2 0 0 0 0 0 2 0
3 0 0 0 0 0 3 0
4 0 0 0 0 0 4 0
0 1 0 0 0 0 0 1
0 2 0 0 0 0 0 2
0 3 0 0 0 0 0 3
0 4 0 0 0 0 0 4
0 0 1 0 0 0 1 1
0 0 2 0 0 0 2 2
0 0 3 0 0 0 3 3
0 0 4 0 0 0 4 4
0 0 0 1 0 0 1 2
0 0 0 2 0 0 2 4
0 0 0 3 0 0 3 1
0 0 0 4 0 0 4 3
0 0 0 0 1 0 1 3
0 0 0 0 2 0 2 1
0 0 0 0 3 0 3 4
0 0 0 0 4 0 4 2
0 0 0 0 0 1 1 4
0 0 0 0 0 2 2 3
0 0 0 0 0 3 3 2
0 0 0 0 0 4 4 1
Table 4.6: Syndrome table for a (6,4) Hamming code over GF(5)
For the 64 bit memory example, a (127, 120) Hamming code could be shortened to a (71, 64) code.

Note that in the decoding process a syndrome can occur that indicates a bit position higher than 71 is in error. This means that an uncorrectable error has occurred, and it can be flagged as such.
4.6.2 Punctured Code

A punctured code is obtained by deleting a check symbol from the codeword to yield an (n − 1, k) code.

The error correction and detection properties will of course be reduced (unless the original code was poorly designed!). This can be used to create a higher rate code.

4.6.3 Extended Code

An extended code is obtained by adding a check symbol to the codeword to yield an (n + 1, k) code.

The error correction and detection properties will be improved (if this is done properly!).

For example, an extra overall parity check bit can be added to each codeword of any Hamming code (an extra column in the generator matrix) so that all extended codewords have even weight. The minimum distance of the code then increases to 4, and the extended Hamming code is a single error correcting, double error detecting code.
For example, the (15, 11) code used previously can be extended to a (16, 11) code by setting

  G = [ 1111110000000000 ]
      [ 1111001000000000 ]
      [ 1110100100000000 ]
      [ 1110000010000000 ]
      [ 1101100001000000 ]
      [ 1101000000100000 ]
      [ 1100100000010000 ]
      [ 1011100000001000 ]
      [ 1011000000000100 ]
      [ 1010000000000010 ]
      [ 1001100000000001 ]

and

  H = [ 1111111111111111 ]
      [ 0100011111110000 ]
      [ 0010011110001110 ]
      [ 0001011001101101 ]
      [ 0000110101011011 ]

The (8,4) extended Hamming code is used in Teletext encoding for particular control sequences.
4.6.4 Other Modifications

A Lengthened code is obtained by adding an extra message symbol to the codeword to yield an (n + 1, k + 1) code.

An Expurgated code is obtained by deleting some of the codewords.

An Augmented code is obtained by adding some codewords.
4.7 Exercises
1. Calculate the upper bound (Hamming bound) and the lower bound (Gilbert bound) on the rate of a 3 bit error correcting binary code of length 100.
2. Calculate the upper bound (Hamming bound) and the lower bound (Gilbert bound) on the rate of a 2 symbol error correcting code of length 50 symbols over GF(16).
3. Write out a parity check matrix for a systematic (7, 4) Hamming code.
Write out the corresponding generator matrix.
Using this code, encode the message 0101
Assuming the codeword is transmitted and received with bit position 3 in error, calculate
the resulting syndrome.
Use the syndrome to correct the received codeword.
4. [Computer based Assignment] Write a computer program to implement encoding and
decoding with error correction of a (63, 57) binary Hamming code.
Use the program to create a simulation of the error correction code performance by en-
coding blocks of data, adding random errors with a probability p and decoding the data
with error correction. Plot the probability of error in the decoded data vs p on a log-log
scale.
Chapter 5
Cyclic Codes

Cyclic codes are an important area in code construction, with linear cyclic codes being the most important and widely used.

5.1 Linear Cyclic Block Codes

An (n, k) linear block code is said to be cyclic if for every codeword c = {c_0, c_1, . . . , c_{n−1}} ∈ C, there is also a codeword c' = {c_{n−1}, c_0, c_1, . . . , c_{n−2}} ∈ C.

This states that a cyclic rotation right of any codeword is also a codeword. Hence each codeword has n cyclic rotations which are also codewords.

For analysis of the properties of cyclic codes, it has been found convenient to associate a polynomial with each codeword. Hence each codeword is associated with a code polynomial c(x) = c_0 + c_1 x + c_2 x^2 + . . . + c_{n−1} x^{n−1}.

If C is a q-ary cyclic linear code, the collection of codewords in C forms a vector subspace of dimension k in the space of all n-tuples over GF(q). Similarly, the code polynomials associated with C form a vector subspace within GF(q)[x]/(x^n − 1).
Consider

  x·c(x) = x(c_0 + c_1 x + c_2 x^2 + . . . + c_{n−1}x^{n−1})  mod (x^n − 1)
         = (c_0 x + c_1 x^2 + c_2 x^3 + . . . + c_{n−1}x^n)   mod (x^n − 1)
         = c_{n−1} + c_0 x + c_1 x^2 + c_2 x^3 + . . . + c_{n−2}x^{n−1}
         = c'(x)

Hence, a cyclic rotation right of the codeword is equivalent to multiplying the polynomial by x modulo x^n − 1.

Similarly, multiplication by x^2 modulo x^n − 1 is equivalent to a double shift, multiplication by x^3 modulo x^n − 1 is equivalent to a triple shift, and so on.

Let a(x) = a_0 + a_1 x + a_2 x^2 + . . . + a_{n−1}x^{n−1} be an arbitrary polynomial in GF(q)[x]/(x^n − 1). The product a(x)c(x) is a linear combination of cyclic shifts of c. But, since C forms a vector space, a(x)c(x) must be a valid code polynomial. Hence:

  a(x)c(x) ∈ C   for all a(x) ∈ GF(q)[x]/(x^n − 1), c(x) ∈ C

A cyclic code is an ideal within GF(q)[x]/(x^n − 1).
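A two line Python sketch (my own) illustrating that multiplication by x modulo x^n − 1 is a single right cyclic shift, using a length 7 example.

```python
def mulx_mod(c):
    """Multiply c(x) by x modulo x^n - 1: coefficients are stored low degree first,
    so this is exactly the right cyclic rotation (c_{n-1}, c_0, ..., c_{n-2})."""
    return [c[-1]] + c[:-1]

c = [0, 1, 0, 0, 0, 1, 1]        # c(x) = x + x^5 + x^6, n = 7
print(mulx_mod(c))               # [1, 0, 1, 0, 0, 0, 1], i.e. 1 + x^2 + x^6
```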
5.2 Basic Properties of Cyclic Codes

Theorem 5.2.1 Let C be a q-ary linear cyclic code.

1. Within the set of code polynomials in C, there is a unique monic polynomial g(x) of minimal degree r < n. g(x) is called the generator polynomial.

2. Every code polynomial c(x) in C can be expressed uniquely as c(x) = m(x)g(x), with m(x) a polynomial of degree less than n − r in GF(q)[x].

3. The generator polynomial g(x) of C is a factor of x^n − 1 in GF(q)[x].

Since g(x) is monic, it takes the form

  g(x) = g_0 + g_1 x + g_2 x^2 + . . . + g_{r−1}x^{r−1} + x^r

It can be seen that g_0 ≠ 0, as otherwise the generator could be shifted down by one place and a lower degree code polynomial obtained. Property 3 limits the selection of generator polynomials to factors of x^n − 1. Factorization of x^n − 1 into irreducible polynomials in GF(q)[x] can be done; however, the degrees of the allowable generator polynomials depend on n and q and not all possible values exist. Hence cyclic codes cannot have arbitrary values of n and k.

E.G. x^15 − 1 factors over GF(2) into a degree 1, a degree 2, and three degree 4 polynomials; hence generator polynomials of any degree from 0 to 15 can be created. (Whether they give useful codes is a different question!)

However, x^25 − 1 factors over GF(2) into a degree 1, a degree 4, and a degree 20 polynomial; hence generator polynomials of degree 1, 4, 5, 20, 21 and 24 can be generated, but no others.
Property 2 can be used for the purpose of encoding. Let g(x) be the degree r generator polynomial for an (n, k) q-ary linear cyclic code C.

A message sequence m = {m_0, m_1, . . . , m_{n−r−1}} may be encoded by associating the message with the degree n − r − 1 polynomial

  m(x) = m_0 + m_1 x + . . . + m_{n−r−1}x^{n−r−1}

and multiplying by the generator polynomial g(x):

  c(x) = m(x)g(x)

where the codeword c = {c_0, c_1, . . . , c_{n−1}} is associated with the polynomial c(x) = c_0 + c_1 x + . . . + c_{n−1}x^{n−1}.

C has dimension n − r and contains q^{n−r} codewords.
The multiplication may be written as

  c(x) = m(x)g(x)
       = (m_0 + m_1 x + . . . + m_{n−r−1}x^{n−r−1}) g(x)
       = m_0 g(x) + m_1 x g(x) + . . . + m_{n−r−1}x^{n−r−1}g(x)
       = [m_0, m_1, . . . , m_{n−r−1}] [ g(x); x g(x); x^2 g(x); . . . ; x^{n−r−1}g(x) ]

This can be cast in the form of a generator matrix as

  c = m [ g_0  g_1  . . .  g_r  0    . . .       0   ]
        [ 0    g_0  g_1   . . . g_r  . . .       0   ]  = mG
        [ ...                                        ]
        [ 0    . . .       0    g_0  g_1  . . .  g_r ]
For every generator polynomial g(x), there exists a parity polynomial h(x) of degree k = n − r, such that g(x)h(x) = x^n − 1.

As a polynomial c(x) is a codeword if and only if it is a multiple of the generator polynomial g(x), c(x) is a codeword only if c(x)h(x) modulo (x^n − 1) = 0, since c(x)h(x) modulo (x^n − 1) = m(x)g(x)h(x) modulo (x^n − 1) = m(x)(x^n − 1) modulo (x^n − 1) = 0.

The product c(x)h(x) modulo (x^n − 1) is a polynomial of degree at most n − 1 and can be written as

  s(x) = s_0 + s_1 x + . . . + s_{n−1}x^{n−1},   s_i ∈ GF(q)

If s(x) is identically 0, then each coefficient s_j must be 0, and hence n parity check equations may be written. Writing c(x) = Σ_{i=0}^{n−1} c_i x^i and h(x) = Σ_{j=0}^{k} h_j x^j, then
  s(x) = Σ_{t=0}^{n−1} s_t x^t = c(x)h(x) = ( Σ_{i=0}^{n−1} c_i x^i ) ( Σ_{j=0}^{k} h_j x^j )  modulo (x^n − 1)

so that

  s_t = Σ_{i=0}^{n−1} c_i h_{|t−i|_n}

where |·|_n denotes reduction modulo n. The last n − k of these equations may be written as

  s_k     = Σ_{i=0}^{n−1} c_i h_{|k−i|_n}
  s_{k+1} = Σ_{i=0}^{n−1} c_i h_{|k+1−i|_n}
  ...
  s_{n−1} = Σ_{i=0}^{n−1} c_i h_{|n−1−i|_n}

But as i = 0, 1, 2, . . . , n − 1,

  |k − i|_n = k, k − 1, . . . , 1, 0, |−1|_n = n − 1, |−2|_n = n − 2, . . . , |k − (n − 1)|_n = k + 1

and (since h_j = 0 for j > k)

  h_{|k−i|_n} = h_k, h_{k−1}, . . . , h_1, h_0, 0, 0, . . . , 0
So the n − k check equations may be written compactly as

  s = cH^T

where s = [s_k, s_{k+1}, . . . , s_{n−1}] and

  H = [ h_k  h_{k−1}  h_{k−2}  . . .  h_1  h_0  0    . . .  0   ]
      [ 0    h_k      h_{k−1}  h_{k−2}  . . .   h_1  h_0   ... 0 ]
      [ ...                                                      ]
      [ 0    . . .    0        h_k  h_{k−1}  h_{k−2}  . . . h_1 h_0 ]

Hence, if c is a codeword, s = cH^T = 0. Thus the rows of H are vectors in C⊥. Also, the n − k rows of H are linearly independent, since h(x) is monic, i.e. h_k = 1. Thus H has dimension n − k, spans C⊥, and is a valid parity check matrix.

As G and H have the same banded form, they can be used to generate another code by swapping them over.

Theorem 5.2.2 If C is an (n, k) linear cyclic code with generator g(x), then C⊥ is an (n, n − k) linear cyclic code with generator h*(x), where h*(x) is the reciprocal of h(x), the parity check polynomial.

Note: the reciprocal of f(x) = f_0 + f_1 x + . . . + f_n x^n, a degree n polynomial, is defined as

  f*(x) = x^n f(x^{−1}) = f_n + f_{n−1}x + . . . + f_0 x^n

Proof: The generator and parity check matrices have the required form, as shown previously.
Example:

Consider a cyclic code of length 7. The irreducible factors of x^7 − 1 over GF(2) are

  x^7 − 1 = (x + 1)(x^3 + x + 1)(x^3 + x^2 + 1)

Choose g(x) = x^3 + x + 1, forming a (7, 4) linear cyclic code. Then

  h(x) = (x + 1)(x^3 + x^2 + 1) = x^4 + x^2 + x + 1

The corresponding generator and parity check matrices are

  G = [ 1101000 ]
      [ 0110100 ]
      [ 0011010 ]
      [ 0001101 ]

and

  H = [ 1011100 ]
      [ 0101110 ]
      [ 0010111 ]
5.3 Encoding a Linear Cyclic Code

The (n, k) linear cyclic code C formed by the generator g(x) can encode the message

  m(x) = m_0 + m_1 x + . . . + m_{k−1}x^{k−1}

as

  c(x) = m(x)g(x)

However, in general this is not a systematic encoding.

5.3.1 Systematic encoding of a linear cyclic code

Consider an (n, k) linear cyclic code C formed by the generator g(x), and let the desired message be

  m(x) = m_0 + m_1 x + . . . + m_{k−1}x^{k−1}

Let q(x) be the quotient resulting from division of x^{n−k} m(x) by g(x); then

  x^{n−k} m(x) = q(x)g(x) + d(x)

where d(x) is the remainder. As g(x) has degree r = n − k, d(x) has degree < r. Hence

  x^{n−k} m(x) − d(x) = q(x)g(x)

But since q(x)g(x) is a multiple of g(x), it must be a valid codeword c(x). Hence the message m(x) may be encoded as

  c(x) = x^{n−k} m(x) − d(x)

As m(x) has degree at most k − 1 and d(x) has degree < r = n − k, the coefficients of c(x) of degree 0, 1, . . . , n − k − 1 come from d(x), while the coefficients of degree n − k, n − k + 1, . . . , n − 1 come from m(x) only, and hence a systematic encoding is obtained.
5.3.2 Example systematic encoding

With g(x) = x^3 + x + 1 forming a (7, 4) linear cyclic code, systematically encode the message [1100].

  m(x) = x^3 + x^2
  m(x) x^{n−k} = x^6 + x^5

The quotient q(x) and remainder d(x) after division by g(x) are

  q(x) = x^3 + x^2 + x   and   d(x) = x

and hence the encoded codeword is

  c(x) = m(x)x^{n−k} − d(x) = x^6 + x^5 + x

or c = 1100010.
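The same calculation in a short Python sketch (my own): divide x^{n−k} m(x) by g(x) over GF(2) and append the remainder.

```python
# A sketch of systematic encoding c(x) = x^(n-k) m(x) - d(x) over GF(2),
# reproducing the example above: g(x) = x^3 + x + 1, message [1100].
def poly_mod(dividend, divisor):
    """Remainder of GF(2) polynomial division; lists are high degree first."""
    r = list(dividend)
    for i in range(len(r) - len(divisor) + 1):
        if r[i]:
            for j, d in enumerate(divisor):
                r[i + j] ^= d
    return r[-(len(divisor) - 1):]

g = [1, 0, 1, 1]                  # x^3 + x + 1
m = [1, 1, 0, 0]                  # m(x) = x^3 + x^2 (high degree first, as in the notes)
shifted = m + [0, 0, 0]           # x^(n-k) m(x) = x^6 + x^5
d = poly_mod(shifted, g)          # remainder d(x) = x  ->  [0, 1, 0]
c = m + d                         # message followed by check symbols (- is + over GF(2))
print(c)                          # [1, 1, 0, 0, 0, 1, 0] = 1100010
```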
5.4 Shift Register Implementations for Encoding and Decoding Cyclic Codes

5.4.1 Polynomial Multiplication

Consider multiplying g(x) = g_0 + g_1 x + g_2 x^2 by m(x) = m_0 + m_1 x + m_2 x^2 + m_3 x^3. This can be accomplished by forming the table of partial products

        x^0      x^1      x^2      x^3      x^4      x^5
                                   m_3 g_0  m_3 g_1  m_3 g_2
                          m_2 g_0  m_2 g_1  m_2 g_2
                 m_1 g_0  m_1 g_1  m_1 g_2
        m_0 g_0  m_0 g_1  m_0 g_2

and summing each column. At any time only three elements need to be stored.

Fig 5.1 shows the direct form (FIR like) implementation; Fig 5.2 shows an equivalent transpose form. Note that the multipliers and adders are over GF(q).

Figure 5.1: Direct form multiplication
Figure 5.2: Transpose form multiplication
E.G. With g(x) = x^3 + x + 1 forming a (7, 4) linear cyclic code, the multiplication can be implemented with the circuit in Fig 5.3. The multiplication and addition operators over GF(2) reduce to AND and EXOR gates respectively.

Figure 5.3: Encoder for g(x) = x^3 + x + 1, non-systematic code.
5.4.2 Polynomial Division

For systematic encoding, division is required. Consider the division

  q(x) = ( b_5 x^5 + b_4 x^4 + b_3 x^3 + b_2 x^2 + b_1 x + b_0 ) / ( a_2 x^2 + a_1 x + a_0 )

Fig 5.4 shows an implementation of this division operation.

Figure 5.4: Division using a feedback shift register.

Initially the registers are loaded with b_0, b_1, . . . , b_5. The initial quotient q_3 = b_5 a_2^{−1} is calculated. As in polynomial long division, the value q_3 x^3 a(x) should be subtracted from b(x) to leave the partial remainder; in Fig 5.4 this is achieved by feedback and multiplication by the coefficients of a(x). The leading term need not be calculated, as

  b_5 x^5 − q_3 x^3 · a_2 x^2 = b_5 x^5 − (b_5 a_2^{−1}) a_2 x^5 = 0

The shift register will then contain

  0, b_0, b_1, b_2, b_3 − q_3 a_0, b_4 − q_3 a_1   ≡   0, b'_0, b'_1, b'_2, b'_3, b'_4

with the first quotient value q_3 output. The problem then reduces to
  q(x) = ( b'_4 x^4 + b'_3 x^3 + b'_2 x^2 + b'_1 x + b'_0 ) / ( a_2 x^2 + a_1 x + a_0 )

This is repeated 3 times to evaluate q_2, q_1, q_0. The contents of the last 2 shift register elements then contain the degree 1 remainder d_1 x + d_0.

Fig 5.5 shows an equivalent implementation, with the registers initialized to 0 and the polynomial b(x) applied serially as required rather than being stored. The remainder ends up in the storage registers as before.

Figure 5.5: Basic division feedback shift register to divide by a_2 x^2 + a_1 x + a_0.
This implementation can form the basis for systematic encoding of linear cyclic codes. The systematic encoding process requires multiplication of the message polynomial m(x) by x^{n−k}, which is just a shift in implementation terms. The division is by g(x), with the quotient discarded; the remainder d(x) is simply appended to the message. Fig 5.6 shows a possible implementation. The shift register is initialized to zero at the beginning of the encoding operation and configured as a dividing circuit with the CNTL line low. The message polynomial m(x) is shifted in and divided by g(x) as described previously; it is also fed to the output as the first k code symbols. After k symbol periods, CNTL is switched high. The shift register now shifts its contents (the remainder of the division operation) out to form the check symbols of the codeword, thus completing the encoding operation.

Figure 5.6: Systematic encoder with g(x) generator.

For binary codes, i.e. GF(2), the circuit can operate very fast and is easy to implement.
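A Python sketch (my own) of the behaviour of the encoder in Fig 5.6 over GF(2): the message is shifted through the dividing register and the register contents are then shifted out as the check bits. It reproduces the encoding of example 5.3.2.

```python
def systematic_encode(msg_bits, g):
    """msg_bits and g are high degree first over GF(2); g[0] (the x^r coefficient) is 1."""
    r = len(g) - 1
    reg = [0] * r                         # reg[0] holds the x^(r-1) stage
    for b in msg_bits:                    # CNTL low: divide x^(n-k) m(x) by g(x)
        fb = b ^ reg[0]                   # feedback = input + high stage
        reg = reg[1:] + [0]               # shift one place towards the high-degree end
        if fb:
            reg = [ri ^ gi for ri, gi in zip(reg, g[1:])]
    return list(msg_bits) + reg           # CNTL high: shift out the remainder (check bits)

print(systematic_encode([1, 1, 0, 0], [1, 0, 1, 1]))
# [1, 1, 0, 0, 0, 1, 0], i.e. 1100010 as in example 5.3.2
```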
5.5 Error Detection

For error detection with the systematic encoding, the encoder can be reused. Consider a received word

  r = (d'_0, d'_1, . . . , d'_{n−k−1}, m'_0, m'_1, . . . , m'_{k−1})

The received message symbols m'_0, m'_1, . . . , m'_{k−1} can be encoded as before to calculate a new set of check symbols d''_0, d''_1, . . . , d''_{n−k−1}.

Any difference between the received check symbols d'_0, d'_1, . . . , d'_{n−k−1} and the newly calculated ones d''_0, d''_1, . . . , d''_{n−k−1} indicates one or more errors in the data transmission.
5.5.1 Syndrome Decoding

Consider a systematic code, with received word r = [d'_0, d'_1, . . . , d'_{n−k−1}, m'_0, m'_1, . . . , m'_{k−1}].

Let the received message symbols be encoded to calculate a new set of check symbols d''_0, d''_1, . . . , d''_{n−k−1}. Hence c = [d''_0, d''_1, . . . , d''_{n−k−1}, m'_0, m'_1, . . . , m'_{k−1}] is a valid codeword.

Recall that the parity check matrix for a systematic linear code has the form

  H = [ I_{n−k} | P^T ]

The syndrome for such a code was defined as

  s = rH^T = rH^T − 0 = rH^T − cH^T = (r − c)H^T
    = [d'_0 + d''_0, d'_1 + d''_1, . . . , d'_{n−k−1} + d''_{n−k−1}, 0, . . . , 0] [ I_{n−k} | P^T ]^T
    = [d'_0 + d''_0, d'_1 + d''_1, . . . , d'_{n−k−1} + d''_{n−k−1}]
    = d' + d''
In polynomial terms, s(x) = d'(x) + d''(x), and the process of calculating d''(x) can be written as

  r(x) + d'(x) = a(x)g(x) + d''(x)

as r(x) + d'(x) is the received sequence with the received check symbols removed (the message symbols are already shifted to account for the x^{n−k} multiplication). Hence

  r(x) = a(x)g(x) − d'(x) + d''(x)

or

  r(x) = a(x)g(x) + s(x)

Hence, the remainder after dividing the received sequence r(x) by g(x) is the syndrome s(x), which must have degree less than n − k. This operation can easily be performed using the division implementations previously described.

Error correction can then be based on syndrome lookup, as was the case for linear block codes. For a q-ary (n, k) code, this still requires q^{n−k} error patterns to be stored.

However, it is possible to reduce the size of the lookup table required by employing some of the properties of cyclic linear block codes.

Theorem 5.5.1 Let s(x) be the syndrome polynomial for a received polynomial r(x). Let r^{(1)}(x) be a single symbol rotation of r(x) to the right. Then the remainder when dividing x s(x) by g(x) is the syndrome s^{(1)}(x) of r^{(1)}(x).
Proof: The polynomial r^{(1)}(x) may be written as

  r^{(1)}(x) = x r(x) − r_{n−1}(x^n − 1)

but the syndrome of r(x) is defined by

  r(x) = a(x)g(x) + s(x)

and

  x^n − 1 = g(x)h(x)

Hence

  r^{(1)}(x) = x( a(x)g(x) + s(x) ) − r_{n−1} g(x)h(x) = b(x)g(x) + d(x)

where d(x) is the syndrome of r^{(1)}(x), i.e. s^{(1)}(x). Rearranging,

  x s(x) = b(x)g(x) − g(x)[ x a(x) − r_{n−1}h(x) ] + d(x)
         = g(x)[ b(x) − x a(x) + r_{n−1}h(x) ] + d(x)

Hence d(x) is the remainder after division of x s(x) by g(x), and so s^{(1)}(x) is the remainder after division of x s(x) by g(x).

This theorem can be used to reduce the lookup table required as follows. On receipt of a received word r(x), the syndrome is calculated by dividing r(x) by g(x), the generator polynomial. Shifting in a zero after the last element of r(x) is then equivalent to calculating the remainder of x s(x) after division by g(x).

Recall that with the syndrome decoder, each syndrome corresponds to an error pattern (independent of the codeword). Hence in this case the error pattern e(x) and all its n − 1 rotations correspond to s(x) and its n − 1 versions formed by repeatedly calculating the remainder of x s(x) after division by g(x).

Thus for each error pattern stored with its corresponding syndrome, the n − 1 rotations of the error pattern do not have to be stored.
5.5.2 Example Syndrome Decoder

Consider the (7, 4) linear cyclic code with generator g(x) = x^3 + x + 1. This can be shown to be the (7, 4) Hamming code, which can correct a single bit error.

Consider the single bit error pattern

  e(x) = x^6

The corresponding syndrome is s(x) = x^2 + 1. Table 5.1 shows the rotated versions modulo g(x), which correspond to the rotated error patterns.

Let the received data polynomial be

  r(x) = x^6 + x^5 + x^3 + x

The corresponding syndrome (the remainder after dividing by g(x)) is s(x) = x + 1. Hence 4 right rotations of the error pattern are required, i.e.

  x^6 → x^0 → x^1 → x^2 → x^3

Hence the coefficient of x^3 is complemented to give the corrected codeword

  c(x) = x^6 + x^5 + x

which is a valid codeword, as generated in example 5.3.2.
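The table reduction can be sketched in Python (my own code): compute the syndrome of the error pattern x^6 once, then generate Table 5.1 by repeatedly reducing x·s(x) modulo g(x).

```python
def mod_g(poly, g):
    """Remainder of poly(x) divided by g(x) over GF(2); coefficients low degree first."""
    p = list(poly) + [0] * max(0, len(g) - 1 - len(poly))
    for i in range(len(p) - 1, len(g) - 2, -1):       # reduce from the highest degree down
        if p[i]:
            for j in range(len(g)):
                p[i - j] ^= g[len(g) - 1 - j]
    return p[:len(g) - 1]

g = [1, 1, 0, 1]                      # g(x) = 1 + x + x^3
s = mod_g([0, 0, 0, 0, 0, 0, 1], g)   # syndrome of e(x) = x^6
print(0, s)                           # [1, 0, 1], i.e. x^2 + 1
for rot in range(1, 8):               # the rotated syndromes of Table 5.1
    s = mod_g([0] + s, g)             # multiply by x, then reduce modulo g(x)
    print(rot, s)
```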
Rotation   s(x)
0          x^2 + 1
1          1
2          x
3          x^2
4          x + 1
5          x^2 + x
6          x^2 + x + 1
7          x^2 + 1

Table 5.1: Syndrome and shifted versions mod g(x)
5.6 CRC Error Control Codes

The most common application of cyclic linear block codes is the cyclic redundancy check (CRC) code. This is mainly due to the ease of implementation with shift register circuits for codes over GF(2).

Most CRC codes are used as shortened codes, where some of the message symbols are deleted. Normally the j highest degree message symbols are deleted to give an (n − j, k − j) code with the same error control properties as the original code (though with a lower rate). The deleted symbols can be implicitly assumed to be 0 for the purpose of analysis.

Variable length codewords can sometimes be employed (e.g. Ethernet), provided the codeword length is not greater than n.

In general, the shortened (n − j, k − j) code is not itself cyclic, but the same encoding and decoding circuits may still be used.

CRC codes are mainly used for their error detection capabilities, with a retransmission request sent if an error is detected.

CRC codes are defined by their generator polynomials g(x). Some rules of thumb have been used in their selection, along with computer searches. Often g(x) = (1 + x)a(x), where a(x) is a primitive polynomial, is employed; the (1 + x) factor ensures that all odd weight error patterns are detected.

A number of CRC generator polynomials have been adopted as international standards (though subsequently better ones have been found). Some examples are given in table 5.2. The value of n for these should be 2^16 − 1 for the degree 16 polynomials and 2^32 − 1 for the degree 32 polynomial, though this is only the case if n is the smallest value for which g(x) is a factor of x^n − 1.
CRC Code     g(x)
CRC-ANSI     x^16 + x^15 + x^2 + 1 = (x^15 + x + 1)(x + 1)
CRC-CCITT    x^16 + x^12 + x^5 + 1 = (x^15 + x^14 + x^13 + x^12 + x^4 + x^3 + x^2 + x + 1)(x + 1)
CRC-32       x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1

Table 5.2: Some CRC standards
While the weight distributions and minimum distances for all CRC generator polynomials are not known, some general error control properties can be derived.

CRC codes are often used in burst noise environments, which do not have well characterized error properties anyway.

The coverage of a CRC code is defined as the fraction of all possible received words that are not valid codewords:

  coverage = (q^n − q^k)/q^n = 1 − 1/q^{n−k}

Hence the coverage depends only on the redundancy n − k.

E.g. for the CRC-32 code, the coverage is 1 − 1/2^32 = 0.99999999976717.
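A Python sketch (my own) of CRC generation by polynomial division over GF(2), here using the CRC-CCITT generator from Table 5.2 on an arbitrary short message.

```python
# The check bits are the remainder of x^r * m(x) divided by g(x).
def crc_remainder(bits, g_bits):
    """bits and g_bits high degree first over GF(2); returns the r check bits."""
    r = len(g_bits) - 1
    reg = list(bits) + [0] * r                # append r zeros = multiply by x^r
    for i in range(len(bits)):
        if reg[i]:
            for j, gj in enumerate(g_bits):
                reg[i + j] ^= gj
    return reg[-r:]

# g(x) = x^16 + x^12 + x^5 + 1 (CRC-CCITT)
g = [1 if d in (16, 12, 5, 0) else 0 for d in range(16, -1, -1)]
msg = [1, 0, 1, 1, 0, 0, 1]                   # an arbitrary example message
check = crc_remainder(msg, g)
# appending the check bits gives a word divisible by g(x): the remainder is all zero
print(crc_remainder(msg + check, g) == [0] * 16)   # True
```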
5.6.1 Burst Errors

Burst errors are described by their length. A burst error of length b starts with an error at a given position and ends at a later position with an error; there may be zero, one or more errors in between.

Hence for a b symbol burst there are (q − 1)^2 q^{b−2} possible burst patterns.

Let C be an (n, k) cyclic code with generator g(x) of degree r. Its burst error detection capability can be expressed as Λ_b, the ratio of the number of detectable burst patterns of length b to the total number of burst patterns of length b.

Consider b ≤ r: A burst error e(x) is undetectable only when e(x) is a valid codeword. However, as a valid codeword must be a multiple of g(x), the generator polynomial, e(x) must then be of degree r or greater. Since any cyclic shift of a codeword is also a codeword, any cyclic shift of such an e(x) is likewise a codeword. A degree r polynomial has r + 1 coefficients and so spans a burst of length at least r + 1; thus any burst error of length r or less is detected, i.e. Λ_r = 1, and:

Theorem 5.6.1 A q-ary cyclic or shortened cyclic code with generator polynomial g(x) of degree r can detect all bursts of length r or less.

The only valid code polynomials of degree exactly r are the scalar multiples of g(x). Thus the only undetectable burst errors of length r + 1 are the q − 1 non-zero scalar multiples of g(x), each in its original position or one of its n − r − 1 non-cyclic shifts right.

Thus (noting that there are likewise n − r positions for any length r + 1 burst),

  Λ_{r+1} = (detectable patterns of length r + 1) / (all burst patterns of length r + 1)
          = 1 − (undetectable patterns of length r + 1)/(all burst patterns of length r + 1)
          = 1 − (n − r)(q − 1) / [ (n − r)(q − 1)^2 q^{r−1} ]
          = 1 − 1/[ (q − 1)q^{r−1} ]
          = 1 − q^{1−r}/(q − 1)

i.e.

Theorem 5.6.2 A q-ary cyclic or shortened cyclic code with generator polynomial g(x) of degree r can detect a fraction 1 − q^{1−r}/(q − 1) of all bursts of length r + 1.
By the definition of cyclic codes, all valid code polynomials can be expressed as m(x)g(x), with g(x) the generator polynomial and m(x) of degree n − r or less.

If such a code polynomial is an error burst of length b > r + 1, then its degree b − 1 and degree zero coefficients must be non-zero. The degree r and degree zero coefficients of g(x) are non-zero. Hence the degree b − 1 − r and degree zero coefficients of m(x) must be non-zero, to generate the terms m_{b−1−r}x^{b−1−r} g_r x^r = e_{b−1}x^{b−1} and m_0 g_0 = e_0 (see footnote 1).

Thus the number of undetectable patterns of length b equals the number of polynomials of degree b − 1 − r with both the degree b − 1 − r and the degree zero coefficients non-zero, which is (q − 1)^2 q^{b−r−2}.

The total number of burst error patterns of length b equals the number of polynomials of degree b − 1 with both the degree b − 1 and the degree zero coefficients non-zero, which is (q − 1)^2 q^{b−2}.

Thus (noting that the n − b possible positions appear in both numerator and denominator),

  Λ_{b>r+1} = 1 − (undetectable patterns of length b)/(all burst patterns of length b)
            = 1 − (n − b)(q − 1)^2 q^{b−r−2} / [ (n − b)(q − 1)^2 q^{b−2} ]
            = 1 − q^{−r}

i.e.

Theorem 5.6.3 A q-ary cyclic or shortened cyclic code with generator polynomial g(x) of degree r can detect a fraction 1 − 1/q^r of all bursts of length b > r + 1.
For example, the CRC-ANSI code with r = 16 can:

detect all burst errors of length 16 or less;
detect 99.996948% of all burst errors of length 17;
detect 99.998474% of all burst errors of length 18 or greater.

Clearly CRC codes have good burst error detection performance, and this is one of the reasons for their use on communication channels with retransmission capability. They have also been used on storage channels (disk drives) to ensure that miscorrected sectors are not flagged as correct data.

1 Here c(x) = c_{b−1}x^{b−1} + . . . + c_0, m(x) = m_{b−r−1}x^{b−r−1} + . . . + m_0 and g(x) = g_r x^r + . . . + g_0.
5.7 Exercises

1. Given a cyclic code based on the generator polynomial g(x) = x^3 + x + 1 forming a (7,4) code, calculate the systematic encoding of the bits [1001].

2. Draw the schematic of a circuit to implement a systematic encoding of a cyclic code based on the generator polynomial g(x) = x^5 + x^4 + x^3 + 1.

3. The received polynomial r(x) = x^6 + x^4 is received at the output of a noisy channel when the input is encoded as a (7,4) cyclic code based on the generator polynomial g(x) = x^3 + x + 1.
Calculate the syndrome for the received polynomial.
Calculate the syndrome for each of the 7 single bit errors, and state whether the received polynomial is a valid codeword or, if not, which bit is most likely to be in error.
Chapter 6
BCH and Reed Solomon Codes

So far, cyclic codes have been generated without any guarantee of minimum distance and error correction properties. For a given generator polynomial g(x), computer searches would have to be undertaken to check the minimum distance.

BCH codes are linear cyclic codes that guarantee their minimum distance by construction. Reed Solomon codes are a particular class of BCH codes. BCH codes are named after the people who invented them: Bose and Ray-Chaudhuri (1960) and (independently) Hocquenghem (1959).

6.1 BCH Codes

Theorem 6.1.1 (BCH bound) Consider an (n, k) linear cyclic code C over GF(q). Let GF(q^m) be an extension field of GF(q).^1

Let α be an element of GF(q^m) with order n. The generator polynomial g(x) of a BCH code is constructed such that the elements

  α^b, α^{b+1}, . . . , α^{b+d−2}

are among its roots, for some integer b, i.e. g(α^b) = g(α^{b+1}) = . . . = g(α^{b+d−2}) = 0. In this case the code C will have a minimum distance d_min ≥ d.

Note that g(x) (and the codewords) are over GF(q), but the roots of the generator are in GF(q^m). The generator must then be constructed by finding the minimal polynomial p_i(x) of each root α^{b+i} with respect to GF(q). The product of these polynomials (ignoring duplicates) forms g(x), thus determining the achievable value of k.

It can be shown that each p_i(x) divides x^n − 1 and that p_i(x) is irreducible over GF(q), thus making g(x) a valid generator polynomial for an (n, k) linear cyclic code over GF(q).
6.1.1 Examples

Construct BCH error control codes over GF(2) with n = 15 to correct 1, 2 and 3 errors.

First an extension field of GF(2) with an element of order 15 is required. GF(2^4) is the smallest field with an element of order 15, i.e. a primitive element of the field.

1 e.g. GF(2^8) is an extension field of GF(2); GF(2^8) is also an extension field of GF(2^2).

p(x) = x^4 + x + 1 is a degree 4 primitive polynomial over GF(2) and hence can be used to create the field GF(2^4). Let α be a root of p(x); thus α is a primitive element of GF(2^4). The resulting field is tabulated in table 6.1.
element   polynomial rep.         binary rep.
0         0                       0 0 0 0
1         1                       0 0 0 1
α         α                       0 0 1 0
α^2       α^2                     0 1 0 0
α^3       α^3                     1 0 0 0
α^4       α + 1                   0 0 1 1
α^5       α^2 + α                 0 1 1 0
α^6       α^3 + α^2               1 1 0 0
α^7       α^3 + α + 1             1 0 1 1
α^8       α^2 + 1                 0 1 0 1
α^9       α^3 + α                 1 0 1 0
α^10      α^2 + α + 1             0 1 1 1
α^11      α^3 + α^2 + α           1 1 1 0
α^12      α^3 + α^2 + α + 1       1 1 1 1
α^13      α^3 + α^2 + 1           1 1 0 1
α^14      α^3 + 1                 1 0 0 1

Table 6.1: GF(2^4) based on the primitive polynomial p(x) = x^4 + x + 1.
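Table 6.1 can be regenerated with a few lines of Python (my own sketch): repeatedly multiply by α (a shift) and reduce modulo p(x) = x^4 + x + 1, with each element stored as a 4 bit integer.

```python
def gf16_powers():
    elems, a = [], 1                 # alpha^0 = 1
    for _ in range(15):
        elems.append(a)
        a <<= 1                      # multiply by alpha (i.e. by x)
        if a & 0b10000:              # reduce modulo x^4 + x + 1
            a ^= 0b10011
    return elems

for i, e in enumerate(gf16_powers()):
    print(f"alpha^{i:<2d} = {e:04b}")
# e.g. alpha^4 = 0011 (alpha + 1), alpha^14 = 1001 (alpha^3 + 1); alpha^15 wraps to 1
```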
α is an element of order 15 in GF(2^4), and is used as the element of order n required by theorem 6.1.1.

To correct one error, t = 1, we need d_min ≥ 3, and 2 successive powers of α, i.e. α^b, α^{b+1}, are required to be roots of g(x).

b may be arbitrarily chosen and is usually chosen as 1 (this is called a narrow sense code). Hence, it is required to find g(x) with roots α, α^2.

As α is a root of the primitive polynomial p(x), p(x) is the minimal polynomial of α.

The minimal polynomial of α^2 needs to be found. In this case it can be seen that α^2 is also a root of the primitive polynomial p(x), as

  (α^2)^4 + α^2 + 1 = α^8 + α^2 + 1 = (α^2 + 1) + α^2 + 1 = 0

Hence, for t = 1, g(x) = x^4 + x + 1. The degree of g(x) is 4, thus k = 15 − 4 = 11, creating a (15, 11) single error correcting code. It can be shown that d_min for this code is indeed 3.

To correct two errors, t = 2, we need d_min ≥ 5, and 4 successive powers of α, i.e. α^b, α^{b+1}, α^{b+2}, α^{b+3}, are required to be roots of g(x).

Choosing b = 1, it is required to find g(x) with roots α, α^2, α^3, α^4.
As α is a root of the primitive polynomial p(x), p(x) is the minimal polynomial of α. As before, α^2 is also a root of p(x) = x^4 + x + 1, and so is α^4. For α^3 it can be found by searching that p_1(x) = x^4 + x^3 + x^2 + x + 1 has α^3 as a root:

  p_1(α^3) = (α^3)^4 + (α^3)^3 + (α^3)^2 + α^3 + 1
           = α^12 + α^9 + α^6 + α^3 + 1
           = (α^3 + α^2 + α + 1) + (α^3 + α) + (α^3 + α^2) + α^3 + 1
           = 0

Hence, for t = 2, g(x) = p(x)p_1(x), or

  g(x) = x^8 + x^7 + x^6 + x^4 + 1

The degree of g(x) is 8, thus k = 15 − 8 = 7, creating a (15, 7) double error correcting code. It can be shown that d_min for this code is indeed 5.
To correct three errors, t = 3, we need d_min ≥ 7, and 6 successive powers of α, i.e. α^b, α^{b+1}, α^{b+2}, α^{b+3}, α^{b+4}, α^{b+5}, are required to be roots of g(x).

Choosing b = 1, it is required to find g(x) with roots α, α^2, α^3, α^4, α^5, α^6.

As α is a root of the primitive polynomial p(x), p(x) is the minimal polynomial of α. As before, α^2 and α^4 are also roots of p(x). For α^3 (and α^6), p_1(x) = x^4 + x^3 + x^2 + x + 1 has α^3 as a root. For α^5, the minimal polynomial is p_2(x) = x^2 + x + 1.

  elements        minimal polynomial
  α, α^2, α^4     p(x) = x^4 + x + 1
  α^3, α^6        p_1(x) = x^4 + x^3 + x^2 + x + 1
  α^5             p_2(x) = x^2 + x + 1

  Table 6.2: Minimal polynomials of the required roots

Hence, for t = 3, g(x) = p(x)p_1(x)p_2(x), or

  g(x) = x^10 + x^8 + x^5 + x^4 + x^2 + x + 1

The degree of g(x) is 10, thus k = 15 − 10 = 5, creating a (15, 5) triple error correcting code. It can be shown that d_min for this code is indeed 7.
Points to note:

BCH codes are usually over GF(2) but can be over any field.

In the example the minimal polynomials were found by searching. In fact, for any element β ∈ GF(q^m), the minimal polynomial of β with respect to GF(q) has the roots β, β^q, β^{q^2}, β^{q^3}, . . ., the sequence being taken until it repeats. E.G. in the above example the minimal polynomial of β = α^5 with respect to GF(2) can be calculated with roots α^5 and (α^5)^2 = α^10, since (α^5)^4 = α^20 = α^5, which is a repeat. Thus

  p_2(x) = (x − α^5)(x − α^10)
         = (x − (α^2 + α))(x − (α^2 + α + 1))
         = . . .
         = x^2 + x + 1

If b = 1, as chosen above, then the code is called narrow sense.

If n = q^m − 1, as above, then the code is called primitive.

Narrow sense primitive BCH codes can be found tabulated in textbooks on the subject.

While theorem 6.1.1 places a lower bound on the d_min of the code, in many cases the actual d_min of the code may be higher than this value.

There is no limit to the size or correction ability of the codes other than computational complexity. E.G. consider n = 255 with the degree 8 primitive polynomial p(x) = x^8 + x^4 + x^3 + x^2 + 1. The narrow sense primitive code with t = 3 results in a (255, 231) code with degree 24 generator

  g(x) = x^24 + x^23 + x^21 + x^20 + x^19 + x^17 + x^16 + x^15 + x^13 + x^8 + x^7 + x^5 + x^4 + x^2 + 1
6.2 Parity Check Matrix for BCH Codes

Consider a BCH code based on the d − 1 successive powers α^b, α^{b+1}, . . . , α^{b+d−2}, and consider the matrix operation on the received word r = r_0, r_1, . . . , r_{n−1}:

  s^T = Hr^T = [ 1  α^b        α^{2b}        . . .  α^{(n−1)b}       ] [ r_0     ]
               [ 1  α^{b+1}    α^{2(b+1)}    . . .  α^{(n−1)(b+1)}   ] [ r_1     ]
               [ ...                                ...              ] [ ...     ]
               [ 1  α^{b+d−2}  α^{2(b+d−2)}  . . .  α^{(n−1)(b+d−2)} ] [ r_{n−1} ]

The first row times the column vector r is simply the received polynomial r(x) evaluated at x = α^b, i.e. r(α^b). Similarly, the second row times the column vector r is the received polynomial r(x) evaluated at x = α^{b+1}, i.e. r(α^{b+1}).

Thus the length d − 1 vector s holds the values of the received polynomial r(x) evaluated at α^b, α^{b+1}, . . . , α^{b+d−2}.

However, by construction, any code polynomial has α^b, α^{b+1}, . . . , α^{b+d−2} as roots, and conversely. Thus

  s^T = Hr^T = 0

if and only if r is a valid codeword. Hence H is a valid parity check matrix.

Theorem 6.1.1 can be proven by showing that, for any w < d, no nontrivial sum of w columns of H can sum to zero, guaranteeing that d_min ≥ d. (The proof is quite involved and will be omitted here.)

The weight distributions of some BCH codes are known (e.g. double and triple error correcting codes), but in general their weight distributions are unknown.
6.3 Reed Solomon Codes

Definition 6.3.1 A Reed Solomon code is a q^m-ary BCH code of length q^m − 1.

By this definition of Reed Solomon codes, it can be seen that an element of order q^m − 1 is required; hence a primitive element of GF(q^m) is required.

However, the code symbols are themselves from GF(q^m), so there is no need for an extension field outside the codeword symbol field.

Thus, to construct a minimal polynomial with a root β ∈ GF(q^m) with respect to GF(q^m)[x], the polynomial

  p(x) = (x − β)

is the solution.

6.3.1 Generating a Reed Solomon Code

Consider the design of a t error correcting RS code over GF(q^m). Let α be a primitive element of the field GF(q^m). We require d_min ≥ 2t + 1.

The generator is then required to have the d − 1 = 2t roots

  α^b, α^{b+1}, . . . , α^{b+2t−1}

The generator is thus

  g(x) = (x − α^b)(x − α^{b+1}) . . . (x − α^{b+2t−1})

The degree of the generator is thus 2t, and hence the code is an (n, n − 2t) code with n = q^m − 1.

Hence, if we consider an (n, k) code, then k = n − 2t, or 2t = n − k. But d = 2t + 1 = n − k + 1 and d_min ≥ d. Hence for a Reed Solomon code

  d_min ≥ n − k + 1

However, recall the Singleton bound: for any (n, k) code

  d_min ≤ n − k + 1

Thus for a Reed Solomon code

  d_min = n − k + 1

Theorem 6.3.2 (Minimum distance of Reed Solomon codes) The minimum distance of an (n, k) Reed Solomon code is

  d_min = n − k + 1

Any code that meets the Singleton bound is called a Maximum Distance Separable (MDS) code. For such a code, one additional error can be corrected for each 2 additional check symbols. (It is not possible to do better than this, and often we do worse.)
6.3.2 Example Reed Solomon Code

Example over GF(16)

Generate a double error correcting code over GF(16) = GF(2^4). (Use table 6.1.)

The code will be of length 2^4 - 1 = 15. For double error correction t = 2, hence k = n - 2t = 11.

α is a primitive element of GF(2^4). Take b = 1 for a narrow sense code. Hence 2t = 4 consecutive powers of α are required and the generator polynomial is

    g(x) = (x - α)(x - α^2)(x - α^3)(x - α^4)

This can be evaluated as

    g(x) = (x - α)(x - α^2)(x - α^3)(x - α^4)
         = (x - α)(x - α^2)(x - α^3)(x - (α + 1))
         = ...
         = x^4 + (α^3 + α^2 + 1)x^3 + (α^3 + α^2)x^2 + α^3 x + (α^2 + α + 1)
         = x^4 + α^13 x^3 + α^6 x^2 + α^3 x + α^10

This code word has a length of 15 symbols (60 bits) and can correct any 2 symbol errors.
Example over GF(37)

Generate a triple error correcting code over GF(37) = GF(37^1).

The code will be of length 37 - 1 = 36. For triple error correction t = 3, hence k = n - 2t = 30.

Note that 2, 5, 13, 15, 17, 18, 19, 20, 22, 24, 32 and 35 are the primitive elements in GF(37).

Taking 2 as the primitive element and b = 1 for a narrow sense code, 2t = 6 consecutive powers of 2 are required and the generator polynomial is

    g(x) = (x - |2|_37)(x - |2^2|_37)(x - |2^3|_37)(x - |2^4|_37)(x - |2^5|_37)(x - |2^6|_37)

This can be evaluated as

    g(x) = (x - |2|_37)(x - |2^2|_37)(x - |2^3|_37)(x - |2^4|_37)(x - |2^5|_37)(x - |2^6|_37)   over GF(37)[x]
         = (x - 2)(x - 4)(x - 8)(x - 16)(x - 32)(x - 27)                                        over GF(37)[x]
         = x^6 + 22x^5 + 28x^4 + x^3 + 32x^2 + 31x + 29                                         over GF(37)[x]

This code word has a length of 36 symbols (36 log_2(37) ≈ 187.55 bits) and can correct any 3 symbol errors.
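Since GF(37) is a prime field, the expansion of this generator can be checked with ordinary integer arithmetic modulo 37. The short Python sketch below (illustrative only) multiplies out the six linear factors and prints the coefficients, which should match x^6 + 22x^5 + 28x^4 + x^3 + 32x^2 + 31x + 29.

    # Sketch: expand g(x) = (x - 2)(x - 4)(x - 8)(x - 16)(x - 32)(x - 27) over GF(37).
    P = 37
    roots = [pow(2, i, P) for i in range(1, 7)]   # 2^1 .. 2^6 mod 37 = 2, 4, 8, 16, 32, 27

    g = [1]                                       # coefficients, highest degree first
    for r in roots:                               # multiply by (x - r) each time
        g = [(c - r * cc) % P for c, cc in zip(g + [0], [0] + g)]

    print(g)   # expected: [1, 22, 28, 1, 32, 31, 29]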
Example over GF(256)

Most Reed Solomon codes used in practice are over GF(256) = GF(2^8) as they fit in well with byte oriented systems. For example digital video broadcasting uses a (shortened) (208, 192) code over GF(2^8). This code can correct any 8 byte errors in a 208 symbol (1664 bit) codeword. If p(x) = x^8 + x^4 + x^3 + x^2 + 1, then a narrow sense code with these parameters can be generated with the polynomial:

    g(x) = x^16 + α^121 x^15 + α^106 x^14 + α^110 x^13 + α^113 x^12 + α^107 x^11 + α^167 x^10 + α^83 x^9
           + α^11 x^8 + α^100 x^7 + α^201 x^6 + α^158 x^5 + α^181 x^4 + α^195 x^3 + α^208 x^2 + α^240 x + α^136
6.4 Decoding BCH and Reed Solomon Codes

The main reason for the widespread use of Reed Solomon (and BCH) codes is the existence of practical error correction algorithms. In general an error correction algorithm has to determine both the location of each error (i.e. which symbol) and its value (except in the case of binary BCH codes, where only the locations are needed).

6.4.1 Peterson-Gorenstein-Zierler Algorithm

Consider a transmitted codeword c(x) being received as r(x) due to errors in the communications (storage) channel. The errors can be represented as an error polynomial

    e(x) = e_0 + e_1 x + ... + e_{n-1} x^{n-1}

The received codeword is

    r(x) = c(x) + e(x)

Recalling the parity check matrix description for BCH codes, the syndrome vector consists of the elements S_1, S_2, ..., S_s where s = d - 1:
    [ S_1     ]            [ 1  α^b        α^{2b}        ...  α^{(n-1)b}       ] [ r_0     ]
    [ S_2     ]  = H r^T = [ 1  α^{b+1}    α^{2(b+1)}    ...  α^{(n-1)(b+1)}   ] [ r_1     ]
    [  ...    ]            [ .     .           .         ...        .          ] [  ...    ]
    [ S_{d-1} ]            [ 1  α^{b+d-2}  α^{2(b+d-2)}  ...  α^{(n-1)(b+d-2)} ] [ r_{n-1} ]
The s = d - 1 length vector s is the value of the received polynomial r(x) evaluated at α^b, α^{b+1}, ..., α^{b+d-2}, or

    S_1 = r(α^b)
    S_2 = r(α^{b+1})
      ...
    S_s = r(α^{b+s-1})
For simplicity let b = 1; however, the decoding algorithm can also be used with b ≠ 1.

Valid code words have roots α^1, α^2, ..., α^s, so c(α^j) = 0 for j = 1, 2, ..., s. But

    r(α^j) = c(α^j) + e(α^j) = 0 + e(α^j) = Σ_{k=0}^{n-1} e_k (α^j)^k

Hence

    S_1 = r(α^1) = Σ_{k=0}^{n-1} e_k (α)^k
    S_2 = r(α^2) = Σ_{k=0}^{n-1} e_k (α^2)^k
      ...
    S_s = r(α^s) = Σ_{k=0}^{n-1} e_k (α^s)^k
Assume that v errors occur at positions i_1, i_2, ..., i_v. Then

    e(x) = Σ_{l=1}^{v} e_{i_l} x^{i_l}

Denote

    X_l = α^{i_l}

Thus if we know X_l then we can find the error location i_l.
Hence the syndromes can be written as

    S_1 = r(α^1) = Σ_{l=1}^{v} e_{i_l} X_l
    S_2 = r(α^2) = Σ_{l=1}^{v} e_{i_l} X_l^2
      ...
    S_s = r(α^s) = Σ_{l=1}^{v} e_{i_l} X_l^s
or more explicitly

    S_1 = e_{i_1} X_1   + e_{i_2} X_2   + ... + e_{i_v} X_v
    S_2 = e_{i_1} X_1^2 + e_{i_2} X_2^2 + ... + e_{i_v} X_v^2
      ...
    S_s = e_{i_1} X_1^s + e_{i_2} X_2^s + ... + e_{i_v} X_v^s
This is a system of s equations in 2v unknowns (v error locations and v error values).

A solution to this system of equations can be developed by considering a polynomial Λ(x) whose roots are the inverses of the error location values X_l. This error locator polynomial can be written as

    Λ(x) = ∏_{l=1}^{v} (1 - x X_l)

Thus Λ(X_l^{-1}) = 0 for l = 1, 2, ..., v. Writing

    Λ(x) = Λ_v x^v + Λ_{v-1} x^{v-1} + ... + Λ_1 x + 1
we thus have

    Λ(X_l^{-1}) = 0
    Λ_v X_l^{-v} + Λ_{v-1} X_l^{-(v-1)} + ... + Λ_1 X_l^{-1} + 1 = 0
Multiplying through by e_{i_l} X_l^j gives

    e_{i_l} X_l^j ( Λ_v X_l^{-v} + Λ_{v-1} X_l^{-(v-1)} + ... + Λ_1 X_l^{-1} + 1 ) = 0
    e_{i_l} ( Λ_v X_l^{j-v} + Λ_{v-1} X_l^{j-v+1} + ... + Λ_1 X_l^{j-1} + X_l^j ) = 0
This expression can be summed over all values of l = 1, 2, ..., v. Hence

    Σ_{l=1}^{v} e_{i_l} ( Λ_v X_l^{j-v} + Λ_{v-1} X_l^{j-v+1} + ... + Λ_1 X_l^{j-1} + X_l^j ) = 0

    Λ_v Σ_{l=1}^{v} e_{i_l} X_l^{j-v} + Λ_{v-1} Σ_{l=1}^{v} e_{i_l} X_l^{j-v+1} + ... + Λ_1 Σ_{l=1}^{v} e_{i_l} X_l^{j-1} + Σ_{l=1}^{v} e_{i_l} X_l^j = 0
Hence substituting Σ_{l=1}^{v} e_{i_l} X_l^j = S_j,

    Λ_v S_{j-v} + Λ_{v-1} S_{j-v+1} + ... + Λ_1 S_{j-1} + S_j = 0

or

    Λ_v S_{j-v} + Λ_{v-1} S_{j-v+1} + ... + Λ_1 S_{j-1} = -S_j
Assuming s = 2v, this recursive expression can be written as

    Λ_v S_1     + Λ_{v-1} S_2       + ... + Λ_1 S_v     = -S_{v+1}   i.e. j = v + 1
    Λ_v S_2     + Λ_{v-1} S_3       + ... + Λ_1 S_{v+1} = -S_{v+2}   i.e. j = v + 2
      ...
    Λ_v S_{s-v} + Λ_{v-1} S_{s-v+1} + ... + Λ_1 S_{s-1} = -S_s       i.e. j = s
Or in matrix form

    A Λ = [ S_1      S_2        ...  S_v     ] [ Λ_v     ]     [ S_{v+1} ]
          [ S_2      S_3        ...  S_{v+1} ] [ Λ_{v-1} ] = - [ S_{v+2} ]
          [  ...       ...      ...    ...   ] [  ...    ]     [   ...   ]
          [ S_{s-v}  S_{s-v+1}  ...  S_{s-1} ] [ Λ_1     ]     [ S_s     ]
It can be shown that if exactly v errors have occurred (i.e. v = s/2) then the matrix A is non singular and the values of Λ_1, Λ_2, ..., Λ_v can be calculated. If fewer errors have occurred (say v - 1) then the matrix will be singular. However, the leftmost column and the top row can be deleted (eliminating Λ_v) and the smaller system solved to find the error locator polynomial Λ(x). If the matrix A is still singular this process can be repeated until the matrix is non singular and a smaller error locator polynomial Λ(x) is determined.

If no solution of any size is found, then more errors than the code can correct must have occurred and an uncorrectable error can be declared.
When a valid Λ(x) is determined, its roots can be solved for. As these operations are all in a finite field GF(q^m), solving for the roots of a polynomial can be achieved by trying the q^m - 1 non zero field elements. The roots are X_l^{-1} and the inverses of these elements are X_l = α^{i_l}; thus the locations of the errors i_l are known.

In the case of a binary BCH code, the bits at the error locations can simply be complemented. In the case of a non binary BCH code or Reed Solomon code, the error values e_{i_l} can be solved from
the system of equations:

    S_1 = e_{i_1} X_1   + e_{i_2} X_2   + ... + e_{i_v} X_v
    S_2 = e_{i_1} X_1^2 + e_{i_2} X_2^2 + ... + e_{i_v} X_v^2
      ...
    S_s = e_{i_1} X_1^s + e_{i_2} X_2^s + ... + e_{i_v} X_v^s
In particular, the first v equations allow the solution of the v error values, as the X_l values are known. The received symbols are then corrected as

    r'_{i_l} = r_{i_l} - e_{i_l}
6.4.2 Example RS operation

Consider a Reed Solomon code based on GF(13) for convenience (note most codes in practice are based on GF(2^8)) that can correct 3 symbol errors, i.e. t = 3.

Creating the Code:

- A (12, 6) code is required.
- 2 is a primitive element in GF(13).
- The generator is formed as a polynomial with 2t = 6 successive powers of 2 in GF(13) (narrow sense code), i.e. {2, 4, 8, 3, 6, 12}.
- Hence

      g(x) = (x - 2)(x - 4)(x - 8)(x - 3)(x - 6)(x - 12)   over GF(13)[x]

  or

      g(x) = x^6 + 4x^5 + 8x^4 + 4x^3 + 10x^2 + 3x + 5
Encoding:

- Consider a random data sequence {12, 3, 8, 6, 12, 10}.
- This can be systematically encoded (as for any cyclic code) to create a valid codeword c(x):

      c(x) = 12x^11 + 3x^10 + 8x^9 + 6x^8 + 12x^7 + 10x^6 + 6x^5 + 6x^4 + 4x^3 + 2x^2 + 1x + 3

- Let the received codeword be

      r(x) = 12x^11 + 3x^10 + 8x^9 + 1x^8 + 12x^7 + 10x^6 + 6x^5 + 6x^4 + 12x^3 + 2x^2 + 1x + 3

- Note that there are 2 errors, i.e. the coefficients of x^8 and x^3.
Decoding:

- First calculate the syndromes by evaluating the received codeword polynomial at each of the elements that were used to create the code, {2, 4, 8, 3, 6, 12}. Hence

      {S_1, S_2, ..., S_6} = {6, 3, 9, 2, 10, 0}

- If these were all zero, the codeword would be accepted as having no errors.
- With the syndromes being nonzero, there must be errors.
- First assume there are 3 errors (the maximum the code can correct). Form the matrix operation

      [ S_1  S_2  S_3 ] [ Λ_3 ]     [ S_4 ]
      [ S_2  S_3  S_4 ] [ Λ_2 ] = - [ S_5 ]
      [ S_3  S_4  S_5 ] [ Λ_1 ]     [ S_6 ]

  i.e.

      [ 6  3   9 ] [ Λ_3 ]   [ 11 ]
      [ 3  9   2 ] [ Λ_2 ] = [  3 ]
      [ 9  2  10 ] [ Λ_1 ]   [  0 ]

- Attempting to solve this over GF(13) results in the matrix being singular, thus having no solution. This means that there must be fewer than 3 errors (or the errors are uncorrectable).
- Hence, assume that there are 2 errors:

      [ S_1  S_2 ] [ Λ_2 ]     [ S_3 ]
      [ S_2  S_3 ] [ Λ_1 ] = - [ S_4 ]

  i.e.

      [ 6  3 ] [ Λ_2 ]   [  4 ]
      [ 3  9 ] [ Λ_1 ] = [ 11 ]

- This has a solution (over GF(13)) of

      [ Λ_2 ]   [ 7 ]
      [ Λ_1 ] = [ 9 ]

- Hence, the error locator polynomial Λ_2 x^2 + Λ_1 x + 1 is

      Λ(x) = 7x^2 + 9x + 1

- The roots of this polynomial over GF(13) can be solved for by evaluating it at all the possible nonzero elements of GF(13), i.e. 1, 2, 3, ..., 12. This yields the roots x = 3 and x = 5.
- Recall the error locators X_l = α^{i_l}. The inverses of these elements were the roots of the error locator polynomial Λ(x), hence (noting α = 2)

      X_1^{-1} = 3  =>  X_1 = 9 = |2^8|_13

  and

      X_2^{-1} = 5  =>  X_2 = 8 = |2^3|_13

- So the errors are located at positions 8 and 3 in the received polynomial.
- As the code is non binary, the error values are also required. These can be obtained from the system of equations:

      S_1 = e_{i_1} X_1   + e_{i_2} X_2   + ... + e_{i_v} X_v
      S_2 = e_{i_1} X_1^2 + e_{i_2} X_2^2 + ... + e_{i_v} X_v^2
        ...
      S_s = e_{i_1} X_1^s + e_{i_2} X_2^s + ... + e_{i_v} X_v^s

- In this case, it is known that there are two errors, so

      S_1 = e_{i_1} X_1   + e_{i_2} X_2
      S_2 = e_{i_1} X_1^2 + e_{i_2} X_2^2

  or

      [ 9   8 ] [ e_{i_1} ]   [ 6 ]
      [ 3  12 ] [ e_{i_2} ] = [ 3 ]

- Solving gives e_{i_1} = 8 and e_{i_2} = 8.
- The received symbols at locations 8 and 3 can be corrected as

      r'_8 = r_8 - e_{i_1} = |1 - 8|_13 = 6

  and

      r'_3 = r_3 - e_{i_2} = |12 - 8|_13 = 4

  and thus the corrected word is

      r(x) = 12x^11 + 3x^10 + 8x^9 + 6x^8 + 12x^7 + 10x^6 + 6x^5 + 6x^4 + 4x^3 + 2x^2 + 1x + 3
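The GF(13) example is small enough to replay end to end in a few lines of Python. The sketch below (an illustrative check, not from the notes; it hard-codes the two-error case rather than running the full Peterson-Gorenstein-Zierler size-reduction loop) recomputes the syndromes, solves the 2 x 2 locator system, finds the roots of Λ(x) by trial, computes the error values and applies the correction.

    P = 13                      # the field GF(13)
    ALPHA = 2                   # primitive element used to build the code

    def inv(a):                 # multiplicative inverse in GF(13)
        return pow(a, P - 2, P)

    def poly_eval(coeffs, x):   # coefficients given highest degree first (Horner)
        acc = 0
        for c in coeffs:
            acc = (acc * x + c) % P
        return acc

    # received word r(x), coefficients of x^11 ... x^0 (2 symbol errors present)
    r = [12, 3, 8, 1, 12, 10, 6, 6, 12, 2, 1, 3]

    # syndromes S_1..S_6 = r(alpha^1) .. r(alpha^6)
    S = [poly_eval(r, pow(ALPHA, j, P)) for j in range(1, 7)]
    print("syndromes:", S)                      # expect [6, 3, 9, 2, 10, 0]

    # assume v = 2 errors: [S1 S2; S2 S3] [L2, L1]^T = [-S3, -S4]^T (Cramer's rule)
    a, b, c, d = S[0], S[1], S[1], S[2]
    det = (a * d - b * c) % P
    L2 = ((d * -S[2] - b * -S[3]) * inv(det)) % P
    L1 = ((-c * -S[2] + a * -S[3]) * inv(det)) % P
    print("Lambda(x) = %d x^2 + %d x + 1" % (L2, L1))   # expect 7 x^2 + 9 x + 1

    # roots of Lambda by trying every nonzero field element
    roots = [x for x in range(1, P) if (L2 * x * x + L1 * x + 1) % P == 0]
    X = [inv(x) for x in roots]                 # error locators X_l = alpha^{i_l}
    locs = [next(i for i in range(P - 1) if pow(ALPHA, i, P) == Xl) for Xl in X]
    print("error positions:", locs)             # expect positions 8 and 3

    # error values from S1 = e1*X1 + e2*X2, S2 = e1*X1^2 + e2*X2^2
    det2 = (X[0] * X[1] * X[1] - X[1] * X[0] * X[0]) % P
    e1 = ((S[0] * X[1] * X[1] - X[1] * S[1]) * inv(det2)) % P
    e2 = ((X[0] * S[1] - S[0] * X[0] * X[0]) * inv(det2)) % P

    # correct r: the coefficient of x^i sits at list index (11 - i)
    for loc, e in zip(locs, (e1, e2)):
        r[11 - loc] = (r[11 - loc] - e) % P
    print("corrected:", r)   # expect [12, 3, 8, 6, 12, 10, 6, 6, 4, 2, 1, 3]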
6.5 Erasures

While the decoding operations previously shown concentrated on error correction, Reed Solomon (and BCH) codes can also correct erasures, which are errors where the error location (but not its value) is suspected.

Clearly for binary codes, if the error location is known, then the bits are just flipped. However, erasures usually imply that the bit is uncertain, hence its value needs to be determined. In the non binary case, the complete erased symbol needs to be determined.

For a (non binary) Reed Solomon code with n - k check symbols it is possible to correct f erasures and e errors provided n - k ≥ f + 2e.
6.6 Exercises

1. Using table 6.1 of the lecture notes, which tabulated the elements of GF(2^4), find the minimal polynomial with respect to GF(2) for the element α^7.
   Given that x^4 + x^3 + x^2 + x + 1 is the minimal polynomial with respect to GF(2) for the element α^6, form the generator of the BCH code such that it has roots α^6 and α^7.
   What is the minimum distance of this code and how many errors can it correct?
   What is the value of n and k for this code?

2. Using table 6.1 of the lecture notes, which tabulated the elements of GF(2^4), construct a (15, 9) Reed Solomon code over GF(2^4).
   How many symbol errors can this code detect?
   How many symbol errors can this code correct?
   Calculate the longest burst, in terms of bits, that this code guarantees to correct.

3. Outline the sequence of operations required to correct errors in a Reed Solomon decoder.
Chapter 7

Channel Coding Performance

7.1 Introduction

Channel coding is based on the transmission (or storage) of information over a communications channel (or storage medium).

Any real channel will have noise or other impairments such as interference, fading, etc. One way of characterizing these impairments is the probability of error of data transmitted through the channel.

Channel coding is concerned with coding the data such that it can be transmitted more reliably despite the presence of noise in the channel. The only way to do this is to add redundancy to the data.
7.2 AWGN Channel

Fig 7.1 shows a sample channel with binary data being represented by the levels +A and -A. A noise source with 2 sided power spectral density N_0/2 is added to the signal x(t). The sum is then passed through a low pass filter with bandwidth B and sampled at rate 1/T to give the output y_k.

Figure 7.1: AWGN channel model (±A signalling, additive noise of PSD N_0/2, low pass filter of bandwidth B, sampler at rate 1/T).
Data is transmitted at a rate 1/T = 2B. This system models a binary baseband digital transmission system. The output y_k is a discrete sample with noiseless amplitude ±A.

The data is detected as a 1 if y_k > 0, otherwise as a 0.

However, noise is present, impairing the system. The variance of the noise at the output is

    σ^2 = ∫_{-B}^{B} (N_0/2) df = N_0 B

Thus the output can be represented as

    y_k = x_k + n_k
with x_k being the input and n_k being a noise sample from a Gaussian noise source with variance σ^2.

A decision error is made when a 1, corresponding to the value +A, is transmitted and the noise corrupting that sample is less than -A, i.e. n_k < -A. Similarly a decision error is made when a 0, corresponding to the value -A, is transmitted and the noise corrupting that sample exceeds A, i.e. n_k > A.

Consider the probability that n_k > A and thus an error is made.
The pdf of a Gaussian random variable is

    g(x) = (1 / (σ√(2π))) e^{-x^2 / (2σ^2)}

thus the probability that n_k > A is

    p(n_k > A) = ∫_A^∞ g(x) dx = ∫_A^∞ (1 / (σ√(2π))) e^{-x^2 / (2σ^2)} dx

Let z = x/σ, thus dz = dx/σ and x = A gives z = A/σ, so

    p(n_k > A) = (1/√(2π)) ∫_{A/σ}^∞ e^{-z^2/2} dz
The complementary error function or Q function is defined as

    Q(β) = (1/√(2π)) ∫_β^∞ e^{-x^2/2} dx

The value of Q(x) can be obtained from tables or with software tools such as MATLAB. Fig 7.2 shows a plot of Q(x). Hence

    p(n_k > A) = Q(A/σ)
Thus the probability that an error is made on the AWGN channel is

    p_e = p(1 transmitted) p(n_k < -A) + p(0 transmitted) p(n_k > A)
        = p(1) Q(A/σ) + p(0) Q(A/σ)
        = Q(A/σ)

noting that p(n_k < -A) = p(n_k > A) as the Gaussian distribution is symmetric.
E.G. 1: An AWGN channel transmits values ±10V over a channel with noise variance σ^2 = 6.25V^2. Calculate the error rate for the channel.

    Q(A/σ) = Q(10/√6.25) = Q(4) = 3.17 × 10^-5
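The Q function values used in these examples are easy to reproduce: Q(x) can be written in terms of the complementary error function as Q(x) = (1/2) erfc(x/√2). A minimal Python check (illustrative only, using the numbers of E.G. 1):

    from math import erfc, sqrt

    def Q(x):
        """Gaussian tail probability Q(x) = P(N(0,1) > x)."""
        return 0.5 * erfc(x / sqrt(2.0))

    A, sigma = 10.0, sqrt(6.25)       # values from E.G. 1
    print(Q(A / sigma))               # Q(4) ~= 3.17e-5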
E.G. 2: An AWGN channel transmits values ±1V over a channel with two sided noise power spectral density N_0/2 = 0.25 µV^2/Hz and a bandwidth of 1MHz. Calculate the error rate for the channel.
Figure 7.2: The Q function, Q(x) versus x (log scale).
    σ^2 = (N_0/2) 2B = 0.25 V^2

    Q(A/σ) = Q(1/√0.25) = 0.023 = 2.3 × 10^-2
7.3 Energy per bit per noise spectral density E_b/N_0

In order to compare various coding schemes, a useful metric to consider is the value of E_b/N_0, where E_b is the energy per (user) bit of data.
In the baseband channel described, the signal power P_x is

    P_x = A^2

The energy per bit E_b is

    E_b = P_x T_b = P_x / (2B) = A^2 / (2B)

where T_b is the bit period. The noise power is

    σ^2 = ∫_{-B}^{B} (N_0/2) df = N_0 B
Hence, the channel error rate is

    p_e = Q(A/σ) = Q( √( E_b 2B / (N_0 B) ) ) = Q( √( 2E_b / N_0 ) )

This is the error rate for a BPSK channel as well as for the binary baseband channel considered here. Fig 7.3 plots the error rate vs E_b/N_0 for this channel.

The ratio E_b/N_0 is independent of bandwidth and is usually quoted in dB, i.e. 10 log_10(E_b/N_0).
Figure 7.3: Error rate P_e vs E_b/N_0 (dB) for the uncoded binary channel.
The signal to noise ratio on this channel is

    S/N = P_x / σ^2 = A^2 / (N_0 B) = 2 E_b / N_0

Of interest in comparing systems is the E_b/N_0 required to achieve a given probability of error. The lower the value, the better (more power efficient) the system.
7.3.1 E_b/N_0 with a rate loss R

In the previous channel, the probability of error was

    p_e = Q( √( 2 E_b / N_0 ) ) = Q( √( S/N ) )

Now consider the use of a code with rate R. If the bandwidth is unchanged, then the number of user bits being transmitted is reduced by a factor R. The total energy being transmitted is unchanged, thus the energy per bit is increased by 1/R.

If the code is ignored in the receiver, the probability of error remains at

    p_e = Q( √( S/N ) )

but E'_b/N_0 is increased by a factor 1/R. Hence

    E'_b/N_0 = (1/R) E_b/N_0

where E'_b/N_0 is the energy per bit per noise spectral density for the channel with rate loss R and coding ignored in the receiver, and E_b/N_0 is the energy per bit per noise spectral density for the uncoded channel with no rate loss.
Example

The uncoded binary channel requires E_b/N_0 = 10.3dB to achieve a bit error rate of 10^-6. If a rate 1/2 code is used but ignored in the receiver, then the rate 1/2 channel requires E'_b/N_0 = 13.3dB to achieve a bit error rate of 10^-6, i.e. 3dB worse.
In effect a rate loss R increases the required E_b/N_0 by (in dB) 10 log_10(1/R). Hence, in order to be useful, a code must improve the probability of error such that E'_b/N_0 is reduced by this amount just to break even. The real gain of a rate R code is the amount by which E'_b/N_0 is improved in excess of 10 log_10(1/R).

Example

If the rate 1/2 code is properly decoded and achieves a probability of error of 10^-6 with an S/N ratio 4.8dB lower than the uncoded (no rate loss) channel, then the E'_b/N_0 for this channel to achieve a probability of error of 10^-6 is

    E'_b/N_0 = 10.3 + 10 log_10(1/R) - 4.8 = 8.5dB

Thus this code provides a coding gain of 10.3 - 8.5 = 1.8dB over the uncoded channel.
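The rate-loss bookkeeping above is just a few dB additions. A small sketch (illustrative only, using the numbers from the two examples) makes the break-even point explicit:

    from math import log10

    EbN0_uncoded = 10.3                   # dB needed for 10^-6 on the uncoded channel
    R = 0.5                               # code rate
    rate_loss_dB = 10 * log10(1 / R)      # 3.0 dB penalty if the code is ignored
    snr_improvement_dB = 4.8              # S/N advantage when the code is decoded

    EbN0_coded = EbN0_uncoded + rate_loss_dB - snr_improvement_dB
    print(EbN0_coded)                     # 8.5 dB
    print(EbN0_uncoded - EbN0_coded)      # real coding gain: 1.8 dB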
7.4 Performance of Block Codes

Consider a q-ary linear (n, k) code with distance d_min. This code can be used to detect d_min - 1 symbol errors. Alternatively the code can be used to correct ⌊(d_min - 1)/2⌋ errors. (Combinations of t error correction and d error detection can be accomplished with 2t + d < d_min.) Let the probability of a symbol error be p_e.
7.4.1 Error Detection Performance

The probability that errors occur that cannot be detected can be upper bounded by summing the probability of error patterns with weight greater than or equal to d_min, i.e.

    P_undetectable ≤ Σ_{j=d_min}^{n} C(n, j) p_e^j (1 - p_e)^{n-j} = 1 - Σ_{j=0}^{d_min - 1} C(n, j) p_e^j (1 - p_e)^{n-j}

where C(n, j) (the binomial coefficient) is the number of ways of arranging j objects in n slots and p_e^j (1 - p_e)^{n-j} is the probability of j error symbols and n - j correct symbols. This is an upper bound, as many error patterns with weight greater than d_min may often be detected.

When the weight distribution A(x) of a code is known, the error detection performance can be accurately calculated. As the ith coefficient A_i represents the number of codewords with weight i, and for a linear code any undetectable error pattern must be a codeword, then

    P_undetect = Σ_{j=d_min}^{n} A_j p_e^j (1 - p_e)^{n-j}

Note that if k symbols are sent uncoded, then the probability of one or more errors in the k symbols is

    P_uncoded undetect = 1 - (1 - p_e)^k

i.e. 1 minus the probability of no errors.
7.4.2 Error Detection Performance Example

Consider the following systems on the AWGN channel:

- Uncoded binary channel with E_b/N_0 and 26 bit words.
- Single bit parity code (27, 26) on the binary channel (d_min = 2). The code rate here is 26/27, thus increasing the energy per bit to (E_b/N_0)/(26/27) when the coding is ignored.
- Hamming code (31, 26) on the binary channel (d_min = 3). The code rate here is 26/31, thus increasing the energy per bit to (E_b/N_0)/(26/31) when the coding is ignored.

Fig 7.4 shows the calculated performance bounds. The uncoded¹ lines show the probability of error when the codes are not used. Hence, the uncoded rate 26/27 case is worse than normal, as each bit requires 27/26 of the energy that it needed previously (though the extra energy is wasted). The uncoded 26/31 case is worse again, as each bit requires 31/26 of the energy that it needed previously.

However, with the codes being used to detect errors, the performance is greatly improved. Consider the requirement of a codeword reliability of 10^-6: the required E_b/N_0 for the uncoded, parity coded, and Hamming coded cases are 11.7dB, 8.9dB and 7.9dB respectively.

Hence, the Hamming code could save 3.8dB of power (i.e. a reduction to 0.42 of the original value).

¹ i.e. using the rate loss of the code but not taking advantage of the correction/detection ability of the code.
Figure 7.4: Error detection performance: P_undetect vs E_b/N_0 (dB) for the uncoded, parity and Hamming cases.
7.4.3 Error Correction Performance

The probability that errors occur that cannot be corrected can be upper bounded by summing the probability of error patterns with weight greater than t, the number of errors that can be corrected, t = ⌊(d_min - 1)/2⌋:

    P_uncorrect ≤ Σ_{j=t+1}^{n} C(n, j) p_e^j (1 - p_e)^{n-j} = 1 - Σ_{j=0}^{t} C(n, j) p_e^j (1 - p_e)^{n-j}

This is an upper bound, as many error patterns with weight greater than t can sometimes be corrected².
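This bound is straightforward to evaluate numerically, which is essentially what exercise 2 of section 7.5 asks for. The sketch below (illustrative only; the symbol error probability p_e = 1e-3 is just an assumed example value) computes the upper bound on P_uncorrect for an (n, k) code correcting t errors:

    from math import comb

    def p_uncorrect_bound(n, t, pe):
        """Upper bound: probability of more than t symbol errors in n symbols."""
        return 1.0 - sum(comb(n, j) * pe**j * (1 - pe)**(n - j) for j in range(t + 1))

    # Example: the (31, 26) Hamming code (t = 1) at an assumed p_e = 1e-3,
    # and the length-128, t = 3 code of exercise 2 at p_e = 1e-3.
    print(p_uncorrect_bound(31, 1, 1e-3))
    print(p_uncorrect_bound(128, 3, 1e-3))
    print(1 - (1 - 1e-3)**128)     # uncoded: one or more errors in 128 bits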
7.4.4 Error Correction Performance Example

Consider the following systems on the AWGN channel:

- Uncoded binary channel with E_b/N_0 and 26 bit words.
- Hamming code (31, 26) on the binary channel (d_min = 3, t = 1). The code rate here is 26/31, thus increasing the energy per bit to (E_b/N_0)/(26/31) when the coding is ignored.

Fig 7.5 shows the calculated performance bounds.

² Decoders can detect error patterns with weight greater than d_min much more easily than they can correct errors with weight greater than t.
Figure 7.5: Error correction performance: P(uncorrect) vs E_b/N_0 (dB) for the uncoded, uncoded (26/31 rate loss) and Hamming cases.
With the Hamming code being used to correct errors, the performance is greatly improved. Consider the requirement of a codeword error rate of 10^-6: the required E_b/N_0 for the uncoded and Hamming coded cases are 11.7dB and 9.6dB respectively. Hence, the Hamming code could save 2.1dB of power (i.e. a reduction to 0.62 of the original value).
7.4.5 Non Binary Codes

For codes over GF(2^m), the performance over a binary channel can be readily analyzed. If the probability of a bit error is p_b, then the probability of a symbol error is

    p_s = 1 - (1 - p_b)^m

Consider a (n, k), t symbol correcting code over GF(2^m). The probability of not correcting a received codeword is

    P_uncorrect ≤ 1 - Σ_{j=0}^{t} C(n, j) p_s^j (1 - p_s)^{n-j}

The number of user bits in each codeword is km, thus the probability of an uncoded word being in error is

    1 - (1 - p_b)^{km}

E.G. Consider a (208, 192) Reed Solomon code over GF(2^8). This code can correct 8 symbol errors. Fig 7.6 shows the performance of this code. The plot shows the uncoded word error performance, the uncoded word error performance with a 192/208 rate loss, and the (208, 192) Reed Solomon code performance. The individual bit error rate is also shown. Note that codewords in this code are 208 × 8 = 1664 bits long.
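Combining the symbol error conversion with the bound above gives the RS(208, 192) curve directly. A short sketch (illustrative only; the bit error probability p_b is an assumed example value):

    from math import comb

    def rs_uncorrect_bound(n, t, m, pb):
        """Bound on codeword failure for a t-symbol-correcting code over GF(2^m)."""
        ps = 1 - (1 - pb)**m                       # symbol error probability
        return 1.0 - sum(comb(n, j) * ps**j * (1 - ps)**(n - j) for j in range(t + 1))

    pb = 1e-3                                      # assumed channel bit error rate
    print(rs_uncorrect_bound(208, 8, 8, pb))       # RS(208,192), t = 8, 8-bit symbols
    print(1 - (1 - pb)**(192 * 8))                 # uncoded 192-byte word in error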
Figure 7.6: RS (208, 192) code error correction performance: P_uncorrect vs E_b/N_0 (dB) for the RS(208,192), uncoded, uncoded 192/208 and P(bit error) curves.
Consider the requirement of a codeword error rate of 10^-10: the required E_b/N_0 for the uncoded and RS coded cases are 13.6dB and 8.4dB respectively. Hence, the RS code in this case could save 5.2dB of power (i.e. a reduction to 0.30 of the original value).
7.5 Exercises

1. An AWGN channel with a bandwidth of 35MHz and a single sided power spectral density N_0 = 0.1 × 10^-6 V^2/Hz is to be used. A signal with levels ±2.5V is transmitted at a data rate of 70Mbps with no coding employed. Calculate the bit error rate for this channel using the graph of fig 7.2 in the lecture notes.
   Calculate E_b/N_0 for this system.
   Code 1 with rate 3/4 can operate with 3dB lower SNR for the same target BER as the uncoded channel.
   Code 2 with rate 1/4 can operate with 6dB lower SNR for the same target BER as the uncoded channel.
   Calculate the real gain for both of these codes by accounting for the rate loss and say which is better.

2. A block code of length 128 bits is available that can correct 3 bit errors. Calculate the probability of the code being unable to correct, or miscorrecting, errors if the probability of a bit error is 10^-3.
   If no code were employed, calculate the probability of one or more errors in groups of 128 bits if the probability of a bit error is 10^-3.
   [Computer based Assignment] Write a computer program to generate a graph of the above coded and uncoded performance versus E_b/N_0, assuming a code rate of 100/128.
Chapter 8

Convolutional codes

8.1 Introduction

Convolutional codes operate on a continuous stream of data as opposed to the isolated blocks used in block codes. Convolutional codes were first developed in the 1950s. They are somewhat heuristic in nature, and to date do not have a good algebraic theory associated with them. However, good codes have been found by computer search and convolutional codes are widely used.

Initial methods of decoding were based on sequential algorithms which were suboptimal. In 1967, A. Viterbi discovered an approach to decoding convolutional codes that is optimal, and this has become known as the Viterbi algorithm.
8.2 Linear Convolutional Encoders

Fig 8.1 shows an example of a linear convolutional encoder. This is a rate 1/2 encoder: for each input bit, 2 code bits are produced. In general, an encoder can have k inputs and n outputs, resulting in a rate k/n code¹ as in fig 8.2.

Figure 8.1: Example of a convolutional encoder (input x, 3 bit shift register, outputs y^(1) and y^(2) multiplexed to coded data at twice the input rate).

¹ Rate 1/n and punctured versions of these codes are most widely used.

In the convolutional encoder of fig 8.1, data is shifted
into a 3 bit serial register one bit at a time. At each time instance, 2 different output sequences are calculated as linear combinations over GF(2) of the shift register contents including its input.

Figure 8.2: Example of a rate 2/3 convolutional encoder.
Considering the encoder in fig 8.1, let an example input sequence be

    x = {1, 0, 0, 1, 1, 1, 1, ...}

The corresponding outputs (assuming the registers are initialized to zero) are

    y^(1) = {1, 0, 1, 0, 1, 0, 1, ...}
    y^(2) = {1, 1, 0, 0, 0, 0, 1, ...}

and the multiplexed sequence is

    y = {11, 01, 10, 00, 10, 00, 11, ...}

The encoder can be viewed as two parallel FIR filters or convolutions (over GF(2)). The encoder can then be characterized by its response to a single 1 input (i.e. its impulse response). For the encoder in fig 8.1 these responses are

    g^(1) = {1011}
    g^(2) = {1101}

These are known as the generator sequences.

In fig 8.1, each input bit can affect the output for 4 bit periods (i.e. the input bit and the 3 memory elements), hence each generator sequence extends over 4 bits.

In general, a convolutional code has a shift register with m elements.

Definition 8.2.1 (Constraint Length) The constraint length K of a convolutional code is the maximum number of bits in a single output stream that can be affected by any input bit.

With this definition², the encoder of fig 8.1 has a constraint length K = 4, as does fig 8.2.

The total memory of an encoder is the total number of memory elements in the encoder and is 3 for the encoder of fig 8.1 and 6 for fig 8.2. The total number of memory elements in the encoder has a large effect on decoding complexity.

² Sometimes a different definition is used.
Under normal operation, the encoder operates continuously on data. However, at the end of transmission the code is normally terminated by clocking in m zeros. This is known as terminating the code.

Terminating the code effectively turns the code into a block code with a rate

    kL / (m + nL) ≈ k/n   for large L

The output of the encoder can be evaluated using a convolution summation as

    y_i^(j) = Σ_{l=0}^{m} x_{i-l} g_l^(j)

or

    y^(j) = x * g^(j)

In general, for a k/n encoder

    y_i^(j) = Σ_{t=0}^{k-1} ( Σ_{l=0}^{m} x_{i-l}^(t) g_{t,l}^(j) )

or

    y^(j) = Σ_{t=0}^{k-1} x^(t) * g_t^(j)
Consider a rate 1/2 convolutional code with generator sequences g^(0) and g^(1). A generator matrix could be constructed as a semi-infinite matrix:

    G = [ g_0^(0) g_0^(1)   g_1^(0) g_1^(1)   ...   g_m^(0) g_m^(1)                                          ]
        [                   g_0^(0) g_0^(1)   g_1^(0) g_1^(1)   ...   g_m^(0) g_m^(1)                        ]
        [                                     g_0^(0) g_0^(1)   g_1^(0) g_1^(1)   ...   g_m^(0) g_m^(1)      ]
        [                                                          ...                                       ]

with the encoding operation defined as

    y = x G
A more compact expression can be developed by considering the D transform. The D transform represents a sequence a_0, a_1, a_2, ... as a polynomial in D, an indeterminate which can be thought of as a unit delay, with D^2 being a two unit delay and so forth, i.e.

    a  ->  Σ_{i=0}^{∞} a_i D^i = A(D)
    a_0, a_1, a_2, ...  ->  a_0 + a_1 D + a_2 D^2 + ...

For a single input encoder (noting that the convolution is equivalent to polynomial multiplication)

    Y^(i)(D) = X(D) G^(i)(D)

Or for a multiple input system

    Y^(i)(D) = Σ_{j=0}^{k-1} X^(j)(D) G_j^(i)(D)
which is equivalent to the matrix multiplication

    Y^(i)(D) = [ X^(0)(D)  X^(1)(D)  ...  X^(k-1)(D) ] [ G_0^(i)(D)     ]
                                                       [ G_1^(i)(D)     ]
                                                       [    ...         ]
                                                       [ G_{k-1}^(i)(D) ]

or, defining

    Y(D) = [ Y^(0)(D), Y^(1)(D), ..., Y^(n-1)(D) ]

the complete encoding operation can be described as

    Y(D) = X(D) G(D)
         = [ X^(0)(D)  X^(1)(D)  ...  X^(k-1)(D) ] [ G_0^(0)(D)      G_0^(1)(D)      ...  G_0^(n-1)(D)     ]
                                                   [ G_1^(0)(D)      G_1^(1)(D)      ...  G_1^(n-1)(D)     ]
                                                   [    ...              ...         ...      ...          ]
                                                   [ G_{k-1}^(0)(D)  G_{k-1}^(1)(D)  ...  G_{k-1}^(n-1)(D) ]
8.2.1 Example

Let a rate 2/3 convolutional code be defined with g_0^(0) = {1001}, g_0^(1) = {0111}, g_0^(2) = {1100} and g_1^(0) = {011}, g_1^(1) = {101}, g_1^(2) = {010}. If the input sequence is x = {11, 00, 01}, calculate the output.

The input D sequences are

    X^(0)(D) = 1
    X^(1)(D) = 1 + D^2

The output sequence is

    Y(D) = [ 1   1 + D^2 ] [ 1 + D^3    D + D^2 + D^3    1 + D ]
                           [ D + D^2    1 + D^2          D     ]

         = [ 1 + D^3 + (1 + D^2)(D + D^2)    D + D^2 + D^3 + (1 + D^2)(1 + D^2)    1 + D + (1 + D^2)D ]

         = [ 1 + D + D^2 + D^4    1 + D + D^2 + D^3 + D^4    1 + D^3 ]

giving

    y^(0) = {11101}
    y^(1) = {11111}
    y^(2) = {10010}

resulting in an output y = {111, 110, 110, 011, 110}.
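The D-transform arithmetic in this example is just GF(2) polynomial multiplication, so it can be checked mechanically. The sketch below (illustrative only) represents each polynomial in D as a Python integer bit mask (bit i = coefficient of D^i), performs carry-less multiplication and reproduces Y^(0), Y^(1) and Y^(2):

    # Sketch: verify the rate 2/3 encoding example using GF(2) polynomial arithmetic.
    # Polynomials in D are stored as integers: bit i holds the coefficient of D^i.

    def gf2_mul(a, b):
        """Carry-less (GF(2)) polynomial multiplication."""
        result = 0
        while b:
            if b & 1:
                result ^= a
            a <<= 1
            b >>= 1
        return result

    def bits(poly, length):
        return [(poly >> i) & 1 for i in range(length)]

    # G_j^(i)(D): row j = input, column i = output (from the generator sequences)
    G = [[0b1001, 0b1110, 0b0011],     # 1+D^3,  D+D^2+D^3,  1+D
         [0b0110, 0b0101, 0b0010]]     # D+D^2,  1+D^2,      D

    X = [0b1, 0b101]                   # X^(0)(D) = 1,  X^(1)(D) = 1 + D^2

    Y = [gf2_mul(X[0], G[0][i]) ^ gf2_mul(X[1], G[1][i]) for i in range(3)]
    for i, y in enumerate(Y):
        print("y^(%d) =" % i, bits(y, 5))
    # expected: y^(0) = [1,1,1,0,1], y^(1) = [1,1,1,1,1], y^(2) = [1,0,0,1,0]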
Convolutional codes can also be systematic, where some of the outputs are actually the input bits themselves. Fig 8.3 shows an example.

However, (unlike block codes) systematic convolutional codes usually have poorer performance than non systematic codes. Hence, non systematic codes are usually employed unless the systematic property is required for some particular reason.
It is also possible to implement convolutional encoders based on recursive structures. In this case the encoder can be described as a rational function in D and the encoder implemented using polynomial division over GF(2) as described previously.
8.3 The Structural Properties of Convolutional Codes

Convolutional codes are based on a structure with memory where the output depends on the contents of that memory and the current input. This type of structure can be represented by a state machine. The number of possible states is equal to 2^m, with m the total memory of the encoder.

Consider the encoder in fig 8.1. Let the memory elements be denoted s_0, s_1, s_2, where at each bit period these values are updated as

    s_0 <- x,   s_1 <- s_0,   s_2 <- s_1

The 8 possible states can then be labeled as in table 8.1.
    State   Register Contents
    S_0     {0, 0, 0}
    S_1     {1, 0, 0}
    S_2     {0, 1, 0}
    S_3     {1, 1, 0}
    S_4     {0, 0, 1}
    S_5     {1, 0, 1}
    S_6     {0, 1, 1}
    S_7     {1, 1, 1}

    Table 8.1: State Labels
At each input bit period, the state changes from {s_0, s_1, s_2} to {0, s_0, s_1} or {1, s_0, s_1} depending on the input bit. The allowed sequence of state transitions and the corresponding outputs are shown in fig 8.4. Each directed connection is labeled with the input that caused that transition and the corresponding (2 bit) output. Any valid output sequence can be generated by traversing an allowed path through the state machine.

Figure 8.3: Example of a systematic convolutional encoder.
Figure 8.4: Finite state transition diagram for the encoder of fig 8.1 (each branch labeled input/output).
8.3.1 Catastrophic sequences

One potential problem with a convolutional encoder is the possibility of catastrophic sequences. This is a concern when recovery of the original input sequence is required based on observation of the coded sequence.

Definition 8.3.1 (Catastrophic Encoder) The encoder is described as catastrophic when the presence of a finite number of errors in the observed coded sequence can result in an infinite number of errors in the decoded data sequence.

Consider the rate 1/2 encoder formed by the generators G^(0)(D) = D + D^2 + D^3 and G^(1)(D) = 1 + D + D^2. The corresponding finite state transition diagram for this encoder is shown in fig 8.5. Note the two highlighted paths. Assume the original input sequence was all zeros, i.e.
Figure 8.5: Finite state transition diagram for the encoder with G^(0)(D) = D + D^2 + D^3 and G^(1)(D) = 1 + D + D^2.
0, 0, 0, 0, 0, 0, ...; then the output coded sequence is also all zeros, i.e. 00, 00, 00, 00, 00, 00, ..., and the state machine is always in the state S_0.

However, if at some point 2 bit errors in the encoded sequence caused the pattern ..., 00, 01, 10, 00, 00, ... to be observed, then the observer would believe that the state machine had moved into state S_3. The subsequent pattern of ..., 00, 00, 00, ... would be assumed to be traversing the states S_3 -> S_6 -> S_5 -> S_3 -> S_6 -> S_5 -> S_3 ... with the corresponding decoded sequence 0, 1, 1, 0, 1, 1, 0, 1, 1, ....

Hence, the presence of 2 errors in the code sequence leads to an infinite number of errors in the decoded sequence. This is an undesirable situation, thus leading to the term Catastrophic Encoder, and it is desirable (and easy) to avoid.

Note that it is the mapping from input data to code sequences that is the problem and not the code sequences themselves.
It can be shown that:

Theorem 8.3.2 (Catastrophic Codes)

1/n Code: A rate 1/n convolutional encoder with encoding matrix G(D) = [G^(0)(D), G^(1)(D), ..., G^(n-1)(D)] is not catastrophic if and only if

    GCD( G^(0)(D), G^(1)(D), ..., G^(n-1)(D) ) = D^l

for some integer l ≥ 0.

k/n Code: A rate k/n convolutional encoder with encoding matrix G(D) is not catastrophic if and only if

    GCD( Δ_i(D), i = 1, 2, 3, ..., C(n, k) ) = D^l

for some integer l ≥ 0, where Δ_i(D) is the determinant of the i-th k × k submatrix of G(D).
For example, for the code with generators G^(0)(D) = D + D^2 + D^3 and G^(1)(D) = 1 + D + D^2,

    GCD( D + D^2 + D^3, 1 + D + D^2 ) = 1 + D + D^2

as (1 + D + D^2)D = D + D^2 + D^3; hence this encoder is catastrophic.

The original encoder of fig 8.1 has generators G^(0)(D) = 1 + D^2 + D^3 and G^(1)(D) = 1 + D + D^3, and

    GCD( 1 + D^2 + D^3, 1 + D + D^3 ) = 1 = D^0

hence this encoder is not catastrophic.

In general most encoders are not catastrophic.
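The GCD test is simple to run by hand for small generators, and also easy to code: with polynomials over GF(2) stored as bit masks, a Euclidean algorithm with carry-less arithmetic does the job. A sketch (illustrative only) that checks the two rate 1/2 examples above:

    # Sketch: test the catastrophic-encoder condition for rate 1/n codes
    # by computing the GCD of the generator polynomials over GF(2).
    # Polynomials are integers: bit i is the coefficient of D^i.

    def gf2_mod(a, b):
        """Remainder of a divided by b over GF(2)."""
        while a.bit_length() >= b.bit_length():
            a ^= b << (a.bit_length() - b.bit_length())
        return a

    def gf2_gcd(a, b):
        while b:
            a, b = b, gf2_mod(a, b)
        return a

    def is_catastrophic(generators):
        g = generators[0]
        for p in generators[1:]:
            g = gf2_gcd(g, p)
        # not catastrophic iff the GCD is D^l, i.e. a single set bit (a power of D)
        return not (g & (g - 1) == 0)

    print(is_catastrophic([0b1110, 0b0111]))  # D+D^2+D^3 and 1+D+D^2  -> True
    print(is_catastrophic([0b1101, 0b1011]))  # 1+D^2+D^3 and 1+D+D^3  -> False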
8.3.2 Weight Enumerators

Weight enumerators can also be constructed for convolutional codes and are written as

    T(X, Y) = Σ_{i=1}^{∞} Σ_{j=1}^{∞} a_{i,j} X^i Y^j

where a_{i,j} is the number of codewords of weight i that correspond to input sequences of weight j.
8.4 Convolutional code performance metrics

The performance of convolutional codes can be assessed using a distance measure.

Consider any two distinct input sequences x' and x''. The corresponding outputs are y' and y''.

Definition 8.4.1 (Column Distance Function) The Column Distance Function d_i of a convolutional code C is the minimum Hamming distance between all pairs of output sequences of length i corresponding to any two input sequences with x'_0 ≠ x''_0.

In this definition the input sequences differ in at least 1 position (the first) and therefore are distinct. Hence

    d_i = min { d( [y'_0, y'_1, ..., y'_{i-1}], [y''_0, y''_1, ..., y''_{i-1}] ) | x'_0 ≠ x''_0 }

If the code is linear, then the difference between any two code sequences is also a valid code sequence, and so

    d_i = min { w( [y'_0, y'_1, ..., y'_{i-1}] ) | x'_0 ≠ 0 }

Fig 8.6 shows an example column distance function.
Figure 8.6: Example column distance function d_i versus i.
(Note that the minimum distance to position i may not correspond to the input sequence 1, 0, 0, 0, ....)

The column distance function must be a non decreasing function of i, as

    d_{i+1} = d( [y'_0, y'_1, ..., y'_i], [y''_0, y''_1, ..., y''_i] )
            = d( [y'_0, y'_1, ..., y'_{i-1}], [y''_0, y''_1, ..., y''_{i-1}] ) + d( y'_i, y''_i )
            ≥ d( [y'_0, y'_1, ..., y'_{i-1}], [y''_0, y''_1, ..., y''_{i-1}] ) = d_i

since d(y'_i, y''_i) ≥ 0.

The key measure of a convolutional code's performance is the minimum free distance.

Definition 8.4.2 (Minimum Free Distance) The Minimum Free Distance d_free of a convolutional code C is the minimum Hamming distance between all pairs of output sequences of any length.

Optimum decoding strategies for convolutional codes take account of the full (semi-infinite) coded sequence in making their decision. Hence performance will be governed by the distance between all pairs of output sequences of any length.

Provided the code is non catastrophic, then

    d_free = lim_{i -> ∞} d_i

(Catastrophic codes can have paths of infinite length that add no weight, thus preventing d_free being attained.)

While d_free governs the performance of a convolutional code, for two codes with the same d_free, the one that reaches it faster is normally superior.
No structured (e.g. algebraic) method exists to generate convolutional codes with a predefined d_free. However, good codes have been found by computer search. These can be found tabulated in most books on the subject for a range of code rates. Table 8.2 lists some examples of rate 1/2 codes; the generators are listed as octal numbers.

    K     g^(0)   g^(1)   d_free
    3     5       7       5
    4     64      74      6
    5     46      72      7
    6     65      57      8
    7     554     744     10
    8     712     476     10
    9     561     753     12
    10    4734    6624    12
    ...   ...     ...     ...

    Table 8.2: Best codes of rate 1/2

Consider the code with constraint length K = 7. Here g^(0) = 554_O and g^(1) = 744_O. Hence in binary g^(0) = 101101100_B and g^(1) = 111100100_B, or

    g^(0) = 1 + D^2 + D^3 + D^5 + D^6
    g^(1) = 1 + D + D^2 + D^3 + D^6

The encoder is shown in fig 8.7. This particular encoder is used in the 802.11b wireless LAN standard.
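The octal entries in Table 8.2 pack the generator taps three bits at a time, with the first (leftmost) bit corresponding to D^0. A couple of lines of Python (illustrative only) unpack them into D-polynomial exponents, reproducing the K = 7 generators quoted above:

    def octal_taps(octal_str, K):
        """Return the exponents of D present in a generator given in octal."""
        bits = bin(int(octal_str, 8))[2:].zfill(3 * len(octal_str))
        return [i for i, b in enumerate(bits[:K]) if b == '1']

    print(octal_taps('554', 7))   # [0, 2, 3, 5, 6] -> 1 + D^2 + D^3 + D^5 + D^6
    print(octal_taps('744', 7))   # [0, 1, 2, 3, 6] -> 1 + D + D^2 + D^3 + D^6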
8.5 Viterbi Decoding of Convolutional Codes

Originally convolutional codes were decoded using searching type algorithms that performed well but were not optimum. In 1967, A.J. Viterbi proposed an algorithm that achieves maximum likelihood (ML) decoding of convolutional codes. This algorithm is now widely known as the Viterbi algorithm and can be used in a number of applications in addition to the decoding of convolutional codes.

Figure 8.7: K = 7 convolutional encoder with d_free = 10.
8.5.1 Trellis Representation of Convolutional Codes

Previously a convolutional code was represented as a state machine showing what patterns were allowed. Consider a rate 1/2 code with constraint length 3. From table 8.2, g^(0) = 5_O and g^(1) = 7_O with d_free = 5. Hence in binary g^(0) = 101_B and g^(1) = 111_B, or

    g^(0) = 1 + D^2
    g^(1) = 1 + D + D^2

Fig 8.8 shows the encoder for this code.

Figure 8.8: K = 3 convolutional encoder with d_free = 5.

A trellis diagram consists of the N allowed states drawn in a column, with adjacent columns showing the allowed states at subsequent times. Assuming the encoder starts initialized to 0, the encoder starts in state 00. The allowed state progression is shown in fig 8.9.

Any valid code sequence is a path along this trellis. Consider a code sequence corresponding to an information sequence of length l. There are 2^l possible paths through the trellis.

The objective of a decoder is to choose the most likely path through the trellis that corresponds to the received sequence.
    Input   Current State         Next State              Output
    x_k     S_k = {s_0, s_1}      S_{k+1} = {x_k, s_0}    y_k
    0       00                    00                      00
    1       00                    10                      11
    0       10                    01                      01
    1       10                    11                      10
    0       01                    00                      11
    1       01                    10                      00
    0       11                    01                      10
    1       11                    11                      01

    Table 8.3: State machine description of the K = 3 convolutional encoder.
Figure 8.9: Trellis diagram for the K = 3 convolutional encoder (states 00, 10, 01, 11; branches labeled input/output).
Let an information sequence x be encoded to a code sequence y which is transmitted over a noisy channel. Let the received code sequence be r.

The maximum likelihood decoder chooses the value ŷ such that

    p(r | ŷ) ≥ p(r | y_i)   for all y_i ≠ ŷ

i.e. the value of y that maximizes p(r | y). The information sequence that corresponds to ŷ is then selected as the maximum likelihood decoded sequence.

In principle, this could be done by evaluating p(r | y_i) for all possible allowed code sequences. Hence for an information sequence of length l there would be 2^l possible allowed code sequences. In practice, l can be of the order of hundreds or thousands of bits and thus such an exhaustive search is not feasible for any real system. The Viterbi algorithm overcomes this difficulty.
First, consider the evaluation of p(r | y). Let the coded sequence be a length L sequence of n bit vectors. Assuming the trellis is terminated, then L = l + m,

    r = ( r_0^(0), r_0^(1), ..., r_0^(n-1), r_1^(0), r_1^(1), ..., r_1^(n-1), ..., r_{L-1}^(0), r_{L-1}^(1), ..., r_{L-1}^(n-1) )

and

    y = ( y_0^(0), y_0^(1), ..., y_0^(n-1), y_1^(0), y_1^(1), ..., y_1^(n-1), ..., y_{L-1}^(0), y_{L-1}^(1), ..., y_{L-1}^(n-1) )

It is assumed that the channel is memoryless. Hence r_j^(i) only depends on y_j^(i) and

    p(r | y) = p(r_0^(0) | y_0^(0)) p(r_0^(1) | y_0^(1)) ... p(r_{L-1}^(n-1) | y_{L-1}^(n-1))
             = ∏_{i=0}^{L-1} ∏_{j=0}^{n-1} p(r_i^(j) | y_i^(j))
As this is a probability, it has a value between 0 and 1. Thus the logarithm of the probability has a value between -∞ and 0. Choosing the value of y that maximizes p(r | y) is equivalent to choosing the value of y that maximizes log p(r | y); equivalently, it is the value of y that minimizes -log p(r | y). Hence

    -log p(r | y) = -log ∏_{i=0}^{L-1} ∏_{j=0}^{n-1} p(r_i^(j) | y_i^(j))
                  = -Σ_{i=0}^{L-1} Σ_{j=0}^{n-1} log p(r_i^(j) | y_i^(j))
                  = Σ_{i=0}^{L-1} ( -Σ_{j=0}^{n-1} log p(r_i^(j) | y_i^(j)) )
Now consider the trellis diagram at an arbitrary state S_x at time k, which is denoted S_x^(k) in fig 8.10. There are i possible previous states, denoted S_{z_1}^(k-1), S_{z_2}^(k-1), ..., S_{z_i}^(k-1). For the branch from each of these states to state S_x^(k) there is a corresponding information sequence x_j, j = 1, 2, ..., i, and coded sequence y_j = (y_j^(0), y_j^(1), ..., y_j^(n-1)). Denote the branch metric associated with the branch from S_{z_j}^(k-1) to S_x^(k) as

    γ_{z_j,x}^k = -Σ_{i=0}^{n-1} log p( r_k^(i) | y_j^(i) )

Figure 8.10: An arbitrary state S_x^(k) in the trellis and its predecessor states S_{z_1}^(k-1), ..., S_{z_i}^(k-1) (branches labeled x_j / y_j).
Assume that the sequence y passes through state S_x^(k). To calculate -log p(r | y) the branch metrics can be summed up as

    -log p(r | y) = Σ_{i=0}^{k-1} γ_{a_i,b_i}^i + γ_{z_j,x}^k + Σ_{p=k+1}^{L-1} γ_{c_p,d_p}^p

It is required to choose the path y that minimizes -log p(r | y). Assume that, somehow, the path from the start to time k - 1 that minimizes the summation up to that point and ends in state S_{z_j}^(k-1) is known; denote this path as P_{z_j}^{k-1} and the summation as the path metric Γ_{z_j}^{k-1}. These values are known for each state at time k - 1.
Given this information, the path metric for state S_x^(k) at time k can be calculated as

    Γ_x^k = Γ_{z_j}^{k-1} + γ_{z_j,x}^k

However, this value can be calculated for each of the possible states leading to S_x^(k), i.e. S_{z_1}^(k-1), S_{z_2}^(k-1), ..., S_{z_i}^(k-1).

If the minimum one is chosen, then this must correspond to the path with the minimum path metric that ends in S_x^(k), as nothing that happens later on can change the calculation.

Thus choose

    Γ_x^k = min { Γ_{z_1}^{k-1} + γ_{z_1,x}^k,  Γ_{z_2}^{k-1} + γ_{z_2,x}^k,  ...,  Γ_{z_i}^{k-1} + γ_{z_i,x}^k }

The corresponding path P_x^k can then be updated as the path P_{z_j}^{k-1} corresponding to the chosen state at time k - 1 with the transition S_{z_j} -> S_x appended.
This procedure can be repeated for each of the N states that exist in the trellis.

Hence, given Γ_i^{k-1}, i = 1, 2, ..., N, and P_i^{k-1}, then Γ_i^k and P_i^k can be calculated. Initially, the system starts in state S_0 so the decoding algorithm can begin with

    Γ_0^0 = 0
    Γ_i^0 = ∞,  i ≠ 0

The N path metrics and corresponding paths are calculated recursively at each index until the end of the code sequence, resulting in N final path metrics and corresponding paths.

As the maximum likelihood path must pass through one of these states, one of the paths must be the maximum likelihood sequence. If the code is terminated, then the path corresponding to the terminating state is chosen; otherwise the one with the minimum path metric is chosen.

This procedure is known as the Viterbi algorithm. At any point in time, only N path metrics and paths need to be stored. For a data sequence of length l the number of operations is O(lN) as opposed to O(2^l) for an exhaustive search.
8.5.2 Viterbi algorithm with the rate 1/2, K = 3 code

Considering the code generated by the encoder of fig 8.8 (trellis shown in fig 8.9), the Viterbi operation proceeds as follows.

Initialize the path metrics

    Γ_00^0 = 0,  Γ_10^0 = ∞,  Γ_01^0 = ∞,  Γ_11^0 = ∞

and the survivor paths

    P_00^0 = {},  P_10^0 = {},  P_01^0 = {},  P_11^0 = {}

At each information bit time evaluate the branch metrics

    γ_00^k,  γ_01^k,  γ_10^k,  γ_11^k
Update the path metrics and path memories as

    Γ_00^{k+1} = min( Γ_00^k + γ_00^k,  Γ_01^k + γ_11^k )
    P_00^{k+1} = {P_00^k, 0} or {P_01^k, 0} depending on the above decision

    Γ_10^{k+1} = min( Γ_00^k + γ_11^k,  Γ_01^k + γ_00^k )
    P_10^{k+1} = {P_00^k, 1} or {P_01^k, 1} depending on the above decision

    Γ_01^{k+1} = min( Γ_10^k + γ_01^k,  Γ_11^k + γ_10^k )
    P_01^{k+1} = {P_10^k, 0} or {P_11^k, 0} depending on the above decision

    Γ_11^{k+1} = min( Γ_10^k + γ_10^k,  Γ_11^k + γ_01^k )
    P_11^{k+1} = {P_10^k, 1} or {P_11^k, 1} depending on the above decision

The survivor associated with the minimum final path metric is chosen as the maximum likelihood decision.
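The recursion above translates almost line for line into code. The following Python sketch (an illustration, not from the notes) implements a hard decision Viterbi decoder for the K = 3, rate 1/2 code of table 8.3, using Hamming-distance branch metrics; it encodes a short test sequence, flips one code bit and recovers the original data.

    # Sketch: hard decision Viterbi decoding of the rate 1/2, K = 3 code
    # (g0 = 1 + D^2, g1 = 1 + D + D^2), states labeled by {s0, s1}.

    def step(state, bit):
        """One encoder step: (state, input bit) -> (next_state, output pair)."""
        s0, s1 = state
        out = (bit ^ s1, bit ^ s0 ^ s1)              # g0 = 1+D^2, g1 = 1+D+D^2
        return (bit, s0), out

    STATES = [(0, 0), (1, 0), (0, 1), (1, 1)]

    def encode(bits):
        state, out = (0, 0), []
        for b in bits + [0, 0]:                      # terminate with m = 2 zeros
            state, y = step(state, b)
            out.extend(y)
        return out

    def viterbi_decode(received):
        INF = float('inf')
        metric = {s: (0 if s == (0, 0) else INF) for s in STATES}
        path = {s: [] for s in STATES}
        for k in range(0, len(received), 2):
            r = tuple(received[k:k + 2])
            new_metric = {s: INF for s in STATES}
            new_path = {s: [] for s in STATES}
            for s in STATES:                         # extend every survivor (ACS)
                for b in (0, 1):
                    nxt, y = step(s, b)
                    m = metric[s] + (r[0] ^ y[0]) + (r[1] ^ y[1])   # Hamming metric
                    if m < new_metric[nxt]:
                        new_metric[nxt] = m
                        new_path[nxt] = path[s] + [b]
            metric, path = new_metric, new_path
        return path[(0, 0)][:-2]                     # drop the termination bits

    data = [1, 0, 1, 1, 0, 0, 1]
    code = encode(data)
    code[5] ^= 1                                     # introduce a channel error
    print(viterbi_decode(code))                      # recovers [1, 0, 1, 1, 0, 0, 1]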
8.6 Branch Metrics

Recall the definition of the branch metrics

    γ_{z_j,x}^k = -Σ_{i=0}^{n-1} log p( r_k^(i) | y_j^(i) )

In effect, the branch metric for a transition from one state to another at time index k depends only on the received data at that time index, r_k^(i), and the coded bits expected for that transition, y^(i). Hence the branch metric can be denoted by the coded bits expected rather than the states the transition corresponds to. For example, in the previous section the branch metric γ_10^k was used for a number of state transitions. Writing

    γ_y^k = -Σ_{i=0}^{n-1} log p( r_k^(i) | y^(i) )

the actual calculation of this value depends on the channel under consideration.

Note that we can scale and offset γ_y^k in any convenient manner, as all it is used for is to be summed up and the minimum case selected.
8.6.1 Binary Symmetric Channel (Hard Decision)

Recall the binary channel with a symmetric crossover probability p_e.

Consider transmitting the logic value y^(i) = 1 and receiving r^(i) = 1, then

    p( r^(i) = 1 | y^(i) = 1 ) = 1 - p_e

and

    -log( p( r^(i) = 1 | y^(i) = 1 ) ) = -log( 1 - p_e )

Table 8.4 shows the other cases.
    Table 8.4: Conditional probabilities for the Binary Symmetric Channel

    -log(p(r^(i)|y^(i)))    r^(i) = 1        r^(i) = 0
    y^(i) = 1               -log(1 - p_e)    -log(p_e)
    y^(i) = 0               -log(p_e)        -log(1 - p_e)

    Table 8.5: Conditional probabilities for the Binary Symmetric Channel (offset and scaled)

    ( -log(p(r^(i)|y^(i))) + log(1 - p_e) ) / ( log(1 - p_e) - log(p_e) )    r^(i) = 1    r^(i) = 0
    y^(i) = 1                                                                0            1
    y^(i) = 0                                                                1            0

For convenience, subtract the constant -log(1 - p_e) from each entry and divide by the constant log(1 - p_e) - log(p_e). Table 8.5 shows the updated values.
Hence each value -log(p(r^(i)|y^(i))) can be represented by the Hamming distance between the bits r^(i) and y^(i), after being offset and scaled. Hence the full branch metric

    γ_y^k = -Σ_{i=0}^{n-1} log p( r_k^(i) | y^(i) )

may be written as

    γ_y^k = d_Hamming( r_k, y )

This is often termed hard decision decoding, as the channel output is decided to be a 1 or a 0 before being given to the decoder.
8.6.2 AWGN Channel (Soft Decision)

While the hard decision case results in a simple branch metric calculation, the decoder could do better by taking more information into account. For example, if it knew that one particular 1 bit was due to a voltage of 1.0V and that another was due to a voltage of 0.001V, then it could treat the first one as more reliable than the second!

Consider the additive white Gaussian noise channel where the discrete output at time index k, r_k, with input y_k is

    r_k = y_k + n_k

where n_k is a Gaussian random variable with variance σ^2. The pdf of the noise n_k is

    p(n_k = n) = (1 / (σ√(2π))) e^{-n^2 / (2σ^2)}

Hence, the probability of r_k conditioned on y_k is

    p(r_k | y_k) = p(n_k = r_k - y_k)
or

    p(r_k | y_k) = (1 / (σ√(2π))) e^{-(r_k - y_k)^2 / (2σ^2)}

thus

    -log( p(r_k | y_k) ) = -log( (1 / (σ√(2π))) e^{-(r_k - y_k)^2 / (2σ^2)} )
                         = -log( 1 / (σ√(2π)) ) - log( e^{-(r_k - y_k)^2 / (2σ^2)} )
                         = -log( 1 / (σ√(2π)) ) + ( (r_k - y_k)^2 / (2σ^2) )

Thus subtracting the constant term -log(1/(σ√(2π))) and scaling by the constant 2σ^2 results in the metric

    (r_k - y_k)^2

Thus the branch metric can be calculated as

    γ_y^k = Σ_{i=0}^{n-1} ( r_k^(i) - y_k^(i) )^2

This is termed the Euclidean metric and is appropriate for AWGN channels. The Euclidean metric is the most often used metric as it is optimum for AWGN channels.
8.7 Performance Analysis

The performance analysis of convolutional codes with Viterbi decoding can be based on the distance between allowed codewords. The performance of convolutional codes is usually measured as an average bit error rate, as the codewords may be very long.

Consider how the Viterbi decoder makes an error. In effect the Viterbi decoder estimates the sequence of states that corresponds to the most likely path. An error is made when the sequence of states estimated to be the most likely is incorrect. Fig 8.11 shows this case, where the chosen path is in error, i.e. different to the correct path. In this case one or more bits in the decoded sequence will be in error. This is called an error event.

Figure 8.11: Error event for a convolutional code: the chosen (error) path P_i diverges from and remerges with the correct path P_0.

Let the i-th error event E_i be the event that path P_i is chosen instead of the correct path P_0. The probability of this error event is the probability that the metric accumulated by the error path is smaller than that accumulated by the correct path.
As the paths are only different over the duration of the event, only the metric values where the paths differ contribute to the probability of error.

Consider an error event of length l. If the correct path has the coded bits y_k, y_{k+1}, ..., y_{k+l-1} and the error path has the coded bits ŷ_k, ŷ_{k+1}, ..., ŷ_{k+l-1}, this error event occurs if

    Σ_{j=k}^{k+l-1} γ_{y_j}^j  >  Σ_{j=k}^{k+l-1} γ_{ŷ_j}^j

However, many valid erroneous paths exist and hence

    p(Error Event) = p( ∪_i E_i )

An upper bound on this probability can be calculated by using the union bound.

Definition 8.7.1 (Union Bound) If {E_1, E_2, ..., E_n} are events in a probability space, then

    p(E_1) + p(E_2) + ... + p(E_n) ≥ p( E_1 ∪ E_2 ∪ ... ∪ E_n )

Hence

    p(Error Event) ≤ Σ_i p(E_i)
8.7.1 Binary Symmetric Channel (Hard Decision)

Recall that the hard decision metric was the Hamming distance between the expected and received sequences. If the correct path has the coded bits y = {y_k, y_{k+1}, ..., y_{k+l-1}} and the error path has the coded bits ŷ = {ŷ_k, ŷ_{k+1}, ..., ŷ_{k+l-1}}, then, given a received sequence r = {r_k, r_{k+1}, ..., r_{k+l-1}}, an error event E_i is made (summing over all the bits) when

    d_H(y, r) > d_H(ŷ, r)

Let d be the Hamming distance between the two paths,

    d = d_H(ŷ, y)

Then the error event occurs when more than d/2 bit errors occur. Hence the probability of the error event is E_i = P_d where

    P_d = Σ_{k=(d+1)/2}^{d} C(d, k) p_e^k (1 - p_e)^{d-k}                                             d odd

    P_d = (1/2) C(d, d/2) p_e^{d/2} (1 - p_e)^{d/2} + Σ_{k=d/2+1}^{d} C(d, k) p_e^k (1 - p_e)^{d-k}   d even

where p_e is the channel crossover probability and the even case assumes that half of the ties (exactly d/2 errors) cause an error event.
If p_e < 1/2 then these expressions can be bounded as

    P_d < p_e^{d/2} (1 - p_e)^{d/2} Σ_{k=(d+1)/2}^{d} C(d, k)
        < p_e^{d/2} (1 - p_e)^{d/2} Σ_{k=0}^{d} C(d, k)
        < p_e^{d/2} (1 - p_e)^{d/2} 2^d
        = ( 2 √( p_e (1 - p_e) ) )^d
noting that 2^d = (1 + 1)^d = Σ_{k=0}^{d} C(d, k). The total error event probability is

    p(Error Event) ≤ Σ_i p(E_i)

and each individual event probability satisfies

    p(E_i) = P_d < ( 2 √( p_e (1 - p_e) ) )^d

where d is the Hamming distance between the correct and error path for error event E_i. But (2√(p_e(1 - p_e)))^d is a rapidly decreasing function of d for small values of p_e, and hence the summation in the union bound may be replaced by only the events with the smallest Hamming distance between the correct and error path, which was the value d_free for the code. Hence

    p(Error Event) ≈ Σ_i p(E_i)   for the E_i having d = d_free

There will be a number of error paths with d = d_free and each error event will result in a number of bit errors. Hence the bit error rate after decoding can be approximated as

    p_BSC ≈ K ( 2 √( p_e (1 - p_e) ) )^{d_free}

where K is a small constant depending on the details of the actual code.
8.7.2 AWGN Channel (Soft Decision)

The branch metric for the AWGN channel can be calculated as

    γ_y^k = Σ_{i=0}^{n-1} ( r_k^(i) - y_k^(i) )^2

If the correct path has the values (with each y = ±1) y = {y_k, y_{k+1}, ..., y_{k+l-1}} and the error path has the values ŷ = {ŷ_k, ŷ_{k+1}, ..., ŷ_{k+l-1}}, then, given a received sequence r = {r_k, r_{k+1}, ..., r_{k+l-1}}, an error event E_i is made when

    Σ_j ( y_j - r_j )^2 > Σ_j ( ŷ_j - r_j )^2
Noting that

    r_j = y_j + n_j

where n_j is a sample from a Gaussian distribution with variance σ^2, then

    Σ_j (n_j)^2 > Σ_j ( ŷ_j - y_j - n_j )^2
    Σ_j (n_j)^2 > Σ_j [ ( ŷ_j - y_j )^2 - 2( ŷ_j - y_j )n_j + n_j^2 ]
    0 > Σ_j [ ( ŷ_j - y_j )^2 - 2( ŷ_j - y_j )n_j ]
    Σ_j 2( ŷ_j - y_j )n_j > Σ_j ( ŷ_j - y_j )^2
    Σ_j ( ŷ_j - y_j )n_j > (1/2) Σ_j ( ŷ_j - y_j )^2
The left hand side Σ_j ( ŷ_j - y_j ) n_j is the sum of a number of independent Gaussian random variables. Hence it is also a Gaussian random variable, with variance

    Var( Σ_j ( ŷ_j - y_j ) n_j ) = Σ_j ( ŷ_j - y_j )^2 Var(n_j) = Σ_j ( ŷ_j - y_j )^2 σ^2

Thus an error event is made when a Gaussian random variable with variance Σ_j(ŷ_j - y_j)^2 σ^2 exceeds the value (1/2)Σ_j(ŷ_j - y_j)^2. This probability can be calculated using the Q() function discussed previously. The probability of the error event E_i is thus

    P(E_i) = Q( (1/2) Σ_j ( ŷ_j - y_j )^2 / √( Σ_j ( ŷ_j - y_j )^2 σ^2 ) )

or

    P(E_i) = Q( √( Σ_j ( ŷ_j - y_j )^2 ) / (2σ) )
The term
_

j
( y
j
y
j
)
2
is the Euclidean distance between the two sequences. As the Q()
function is a rapidly decreasing function in is argument, only the minimum distance event will
be considered. As before, this corresponds to d
free
.
However, as each bit is 1, the minimum Euclidean distance between any two valid code
sequences d
min
is
_
d
free
(1 (1))
2
=
_
4d
free
Hence, the approximate bound for the AWGN channel is
$$p_{AWGN} \approx K\,Q\!\left( \frac{d_{min}}{2\sigma} \right) = K\,Q\!\left( \frac{\sqrt{4\,d_{free}}}{2\sigma} \right)$$
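A minimal sketch of evaluating this bound, again assuming $K = 1$ and an arbitrary noise standard deviation $\sigma$, and writing $Q()$ in terms of the complementary error function:

```python
# Sketch: approximate post-decoding BER for soft decision Viterbi decoding on AWGN,
# p_AWGN ~ K * Q(sqrt(4*dfree) / (2*sigma)); K is a code-dependent constant (assumed).
from math import sqrt, erfc

def Q(x: float) -> float:
    return 0.5 * erfc(x / sqrt(2.0))

def awgn_ber_estimate(dfree: int, sigma: float, K: float = 1.0) -> float:
    dmin = sqrt(4.0 * dfree)          # minimum Euclidean distance for +/-1 signalling
    return K * Q(dmin / (2.0 * sigma))

print(awgn_ber_estimate(dfree=5, sigma=0.5))
```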
8.7.3 Example Code
Consider the performance of the rate 1/2 code with constraint length 3 and $d_{free} = 5$ shown previously in fig 8.8. Fig 8.12 shows the performance of this code, both simulated (discrete points) and calculated bounds (continuous lines). The calculated performance of the hard decision decoder is quite poor but the calculated performance of the soft decision decoder is quite good (for low error rates).
At error rates of $10^{-7}$, hard decision decoding provides a coding gain of 2dB while soft decision decoding provides a coding gain of 4dB.
In general, soft decision decoding typically provides 2dB of additional coding gain compared to hard decision systems. The ease with which the Viterbi decoder can implement soft decision decoding (just by using the Euclidean metric for the branch metrics) accounts for the popularity of convolutional codes with Viterbi decoding.
(Note: Soft decision decoding can also be done with block codes in principle, but efficient decoders for achieving this have not been developed, especially for algebraic codes like RS codes.)
8.8 Viterbi Decoder Implementation
Practical Viterbi decoding can be readily implemented in current VLSI for codes with constraint lengths up to about K = 7, i.e. with 64 states.
The basic Viterbi decoder contains the 3 blocks shown in fig 8.13. The three blocks are:
[Figure 8.12: Simulated and calculated performance of rate 1/2 code, constraint length 3 and $d_{free} = 5$. BER ($10^{-8}$ to $10^{0}$) versus $E_b/N_0$ (dB) for Uncoded, Hard Decision and Soft Decision.]
8.8.1 BMU - Branch Metric Unit
The BMU (Branch Metric Unit) takes the received input data and calculates the branch metric values. For hard decision decoding, the inputs are single bits. For soft decision decoding the inputs represent continuous values. In practice most Viterbi decoders are implemented as digital circuits with the input signal being quantized with an ADC. For binary modulation systems, quantization to 3 bits (8 levels) is typically sufficient to achieve close to optimum performance (for some specialized applications analogue circuits have been used).
As the inputs are normally received serially, the BMU must also group bits into the correct groups. E.g. a rate 1/3 code generates 3 bits for each data bit, so the BMU must be properly synchronized to group the correct 3 bits together. If this is not done, the decoder will not function correctly.
For high speed decoding, all the required branch metrics are calculated in parallel.
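The following sketch (with assumed function names and conventions) shows the two branch metric computations a BMU typically performs: Hamming distance for hard decisions and squared Euclidean distance for soft decisions.

```python
# Sketch: branch metric computations for one trellis step.
def hamming_branch_metric(received_bits, expected_bits):
    """Hard decision: count positions where the received group differs from the branch label."""
    return sum(r != e for r, e in zip(received_bits, expected_bits))

def euclidean_branch_metric(received_samples, expected_symbols):
    """Soft decision: sum of squared differences against the +/-1 branch labels."""
    return sum((r - e)**2 for r, e in zip(received_samples, expected_symbols))

# For a rate 1/3 code the BMU must first group three received values per trellis step.
soft_group = [0.9, -1.1, 0.2]                     # one (assumed) group of quantized samples
print(euclidean_branch_metric(soft_group, [+1, -1, +1]))
print(hamming_branch_metric([1, 0, 1], [1, 1, 1]))
```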
[Figure 8.13: Block diagram of the basic Viterbi decoder: the received samples $r_k$ feed the BMU, which supplies branch metrics $\lambda^{0..00}, \lambda^{0..01}, \ldots, \lambda^{1..11}$ to the ACS units for states $S_0, S_1, \ldots, S_{N-1}$; the ACS decisions feed the PSU, which outputs the decoded bits $\tilde{x}_{k-DEL}$.]
8.8.2 ACS - Add Compare Select Units
The ACS (Add Compare Select) units must complete a single add compare select operation for each state in the code. For high speed decoding a separate add compare select circuit for each state may be required (for lower speeds they may be shared).
Recall that the key operation in the Viterbi decoder was updating the path metrics:
$$\Gamma_k^{x} = \min\left\{ \Gamma_{k-1}^{z_0} + \lambda_k^{z_0,x},\; \Gamma_{k-1}^{z_1} + \lambda_k^{z_1,x},\; \ldots,\; \Gamma_{k-1}^{z_i} + \lambda_k^{z_i,x} \right\}$$
Hence, this requires branch metrics to be added to path metrics and the results compared to figure out which is the minimum. The minimum one is selected as the new path metric value and the decision is passed on to be used in the management of the survivor paths.
Consider the case of a rate 1/n code. At each encoder cycle, there are only 2 input possibilities as only a single input bit is applied. Hence only 2 states in the trellis connect to each state, i.e. each state has only 2 possible previous states.
$$\Gamma_k^{x} = \min\left\{ \Gamma_{k-1}^{z_0} + \lambda_k^{z_0,x},\; \Gamma_{k-1}^{z_1} + \lambda_k^{z_1,x} \right\}$$
Fig 8.14 shows the hardware implementation of the ACS unit for one of the states. Note that 2 adders, 1 comparator and a selector (multiplexer) are required. One such ACS unit is required for each state (or they may be shared for low speed operation).
However, consider a rate k/n code ($k \neq 1$). In this case there are $2^k$ possible inputs and thus each state has $2^k$ possible previous states. This makes the ACS units considerably more complex. For example, with a rate 3/4 code, there are $2^3 = 8$ inputs for each state and fig 8.15 shows the required ACS unit per state. 8 adders are required to add the branch metrics to the 8 path metrics, and 7 comparators and selectors are required which operate in series. Clearly much more circuitry is required, resulting in a slower and more complex circuit. For this reason, most convolutional codes are rate 1/n codes and higher rates are usually achieved by puncturing, as described in section 8.9.
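A minimal sketch of the add compare select operation for one state of a rate 1/n code, where each state has exactly two predecessors $z_0$ and $z_1$; the variable names and example numbers are assumptions for illustration.

```python
# Sketch: one ACS operation for a single state x of a rate 1/n code.
def acs(gamma_prev_z0, gamma_prev_z1, lam_z0_x, lam_z1_x):
    """Return (new path metric for state x, decision bit selecting the survivor)."""
    cand0 = gamma_prev_z0 + lam_z0_x      # add
    cand1 = gamma_prev_z1 + lam_z1_x      # add
    if cand0 <= cand1:                    # compare
        return cand0, 0                   # select: survivor comes from predecessor z0
    return cand1, 1                       # select: survivor comes from predecessor z1

new_metric, decision = acs(3.2, 2.7, 1.0, 2.0)
print(new_metric, decision)               # 4.2 0 for these illustrative numbers
```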
8.8.3 PSU - Path Survivor Unit
The PSU (Path Survivor Unit) takes the decisions from the ACS units and uses them to update the survivor paths. This unit should keep a path for each state.
[Figure 8.14: One ACS unit for a rate 1/n convolutional code: two adders form $\Gamma_{k-1}^{z_0} + \lambda_k^{z_0,x}$ and $\Gamma_{k-1}^{z_1} + \lambda_k^{z_1,x}$, and a compare/select stage picks the minimum as $\Gamma_k^{x}$ and outputs the decision bit.]
[Figure 8.15: One ACS unit for a rate 3/4 convolutional code: eight adders combine the branch metrics $\lambda_k^{z_0,x}, \ldots, \lambda_k^{z_7,x}$ with the corresponding path metrics, followed by a series of compare/select stages producing $\Gamma_k^{x}$.]
In theory, the path memory should extend from the start of operation, and only when the final bits (maybe thousands of bits later) are received should it release any decisions.
In practice, the stored paths only differ in the most recent bits and it is only necessary to keep paths for a length known as the truncation depth. This is typically 6 to 8 times the constraint length. Hence decisions on bits can be produced in step with the received data, but with a latency (delay) of 6 to 8 times the constraint length.
The PSU may be implemented using two different methods. One, called register exchange, is
fast with low latency and stores the path memories in flip-flops, thus dissipating a large amount of power for codes with many states and long constraint lengths. An alternative method, called the trace back method, stores the path memories in a RAM and periodically traces back through the RAM to generate decisions. This is more suitable for applications where many states and long constraint lengths are required and the power/area of a register exchange solution would be prohibitive.
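The sketch below outlines the trace back idea for a 4-state (K = 3) rate 1/2 code; the state numbering and predecessor mapping follow one common shift register convention and are assumptions for illustration only, not the specific encoder used in these notes.

```python
# Sketch: trace back through stored ACS decisions to release a decoded bit.
NUM_STATES = 4                                    # K = 3 gives 2^(K-1) = 4 states (assumed)

def prev_state(state: int, decision: int) -> int:
    # Shift register view: the survivor decision bit identifies which of the two
    # possible predecessor states (differing in the oldest bit) was selected.
    return ((state << 1) | decision) & (NUM_STATES - 1)

def trace_back(decisions, start_state: int, depth: int) -> int:
    """Follow the survivor decisions backwards 'depth' steps (the truncation depth,
    typically 6 to 8 constraint lengths) and return the state reached; the released
    data bit is then read from this oldest state."""
    state = start_state
    for t in range(len(decisions) - 1, len(decisions) - 1 - depth, -1):
        state = prev_state(state, decisions[t][state])
    return state
```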
8.9 Punctured Convolutional Codes
It is desirable to use rate 1/n convolutional codes because the implementation of the ACS
units is easier. However, in order to achieve higher code rates with convolutional codes, punc-
turing can be applied. Consider a rate 1/2 code.
Let the input be the sequence
$$x_0, x_1, x_2, x_3, x_4, \ldots$$
The output sequence is
$$y_0^{(0)}, y_0^{(1)}, y_1^{(0)}, y_1^{(1)}, y_2^{(0)}, y_2^{(1)}, y_3^{(0)}, y_3^{(1)}, y_4^{(0)}, y_4^{(1)}, \ldots$$
Consider deleting every 3rd bit in the output sequence (i.e. not transmitting it). The transmitted sequence is then
$$y_0^{(0)}, y_0^{(1)}, y_1^{(1)}, y_2^{(0)}, y_3^{(0)}, y_3^{(1)}, y_4^{(1)}, \ldots$$
Hence for every 3 input bits, 4 coded bits are transmitted, resulting in a rate 3/4 code.
At the receiver, a 0.0V value (soft decision) or an erasure (hard decision) is placed in the received sequence at the appropriate locations and the data is decoded as a rate 1/2 code.
Of course, this punctured code will have a lower coding gain than the un-punctured code, but it will have a higher rate.
In practice, it is common to have a system based on a 1/n code from which a number of higher rate codes can be generated by puncturing. The code rate (and thus coding gain) can then be selected based on channel conditions. When the channel is reliable, a high rate (e.g. 7/8) code can be used. If the channel degrades, lower rates with higher coding gain can be employed. In each case the same decoder circuit can be used. In such a system the code is called a rate compatible convolutional code.
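A minimal sketch of the puncturing and depuncturing steps described above (deleting every 3rd coded bit of a rate 1/2 output to obtain rate 3/4, then re-inserting neutral values at the receiver); the list based representation and names are assumptions for illustration.

```python
# Sketch: puncture a rate 1/2 output to rate 3/4 and depuncture at the receiver.
def puncture(coded):
    """Delete every 3rd coded bit (1-based positions 3, 6, 9, ...)."""
    return [b for i, b in enumerate(coded) if (i + 1) % 3 != 0]

def depuncture(received, erasure=0.0):
    """Re-insert a neutral value (0.0 for soft decision, an erasure marker for hard
    decision) at every deleted position before normal rate 1/2 Viterbi decoding."""
    out, j = [], 0
    total = len(received) * 3 // 2        # every 4 received values expand to 6 positions
    for pos in range(1, total + 1):
        if pos % 3 == 0:
            out.append(erasure)
        else:
            out.append(received[j])
            j += 1
    return out

coded = ['y0_0', 'y0_1', 'y1_0', 'y1_1', 'y2_0', 'y2_1']   # 3 input bits -> 6 coded bits
tx = puncture(coded)                      # 4 transmitted bits -> overall rate 3/4
print(tx)                                 # ['y0_0', 'y0_1', 'y1_1', 'y2_0']
print(depuncture(tx))                     # ['y0_0', 'y0_1', 0.0, 'y1_1', 'y2_0', 0.0]
```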
8.10 Exercises
1. Draw the circuit of a rate 1/2 convolutional encoder of constraint length K = 6 with $g^{(0)} = 65_{OCT}$ and $g^{(1)} = 57_{OCT}$.
2. Draw the circuit of a rate 1/2 convolutional encoder of constraint length K = 4 with $g^{(0)} = 64_{OCT}$ and $g^{(1)} = 74_{OCT}$.
Write out a state machine description for this convolutional code.
Draw the corresponding trellis diagram for this encoder.
Sketch the interconnection of ACS units required to implement a Viterbi detector to decode this code (labelling each with the corresponding path and branch metrics).
3. [Computer based Assignment] Write a software program to implement the Viterbi decoder in the previous question and use it to generate a BER vs $E_b/N_0$ plot for both hard decision and soft decision decoding.