
Class Notes: Digital Communications

Prof. J.C. Olivier
Department of Electrical, Electronic and Computer Engineering
University of Pretoria, Pretoria
Revision 3
September 8, 2008

Contents

0.1 Preface

1 Introduction
  1.1 Overview of Wireless Communications
  1.2 The transmitter data burst structure
  1.3 The dispersive radio channel
  1.4 The model of the entire communication system

2 Introduction to Probability theory and Detection
  2.1 Introduction
  2.2 Probability theory, Detection and some odd experiments
    2.2.1 Background
    2.2.2 Applications of Bayes's theorem
  2.3 Conclusion

3 The modulator and demodulator
  3.1 Modulation continued
    3.1.1 The concept of base band signal processing and detection
    3.1.2 Types of modulation
    3.1.3 Binary phase shift keying (BPSK)
    3.1.4 Four level pulse amplitude modulation (4PAM)
    3.1.5 Quadrature phase shift keying (QPSK)
    3.1.6 Eight phase shift keying (8PSK)
  3.2 De-modulation
    3.2.1 What if there is multipath?

4 Detection
  4.1 Introduction
  4.2 The static Gaussian channel
    4.2.1 Computing more than just the most likely symbol: probabilities of all constellation points, and the corresponding coded bit probabilities computed by the receiver
  4.3 MLSE - the most likely sequence estimate
    4.3.1 Finding the sequence x via the MLSE
    4.3.2 3 tap detector
    4.3.3 Discussion
  4.4 Probabilistic Detection via Bayesian Inference for Multipath channels
    4.4.1 Sub optimal detected bit probability calculation
    4.4.2 Optimal symbol probability calculation using Bayesian Detection
  4.5 Forward-Backward MAP detection
    4.5.1 An example
  4.6 Assignments

5 Frequency Domain Modulation and Detection: OFDM
  5.1 Introduction
  5.2 Circulant matrix theory
  5.3 The Transmitter for OFDM systems
    5.3.1 Cyclic time domain multipath propagation
  5.4 OFDM receiver, i.e. MAP detection
    5.4.1 MAP detection with trivial complexity
    5.4.2 Matlab demo
  5.5 Assignment

6 Channel Estimation
  6.1 Introduction
  6.2 Optimum receiver filter and sufficient statistics
  6.3 The linear model
  6.4 Least Squares Estimation
  6.5 A representative example
  6.6 Generalized Least Squares Estimation
    6.6.1 The generalized least squares procedure
  6.7 Conclusion
  6.8 Assignment

7 Minimum Mean Square Error (MMSE) Estimation, Prefilter and Prediction
  7.1 Introduction
  7.2 Minimum mean square error (MMSE) estimation
    7.2.1 The principle of orthogonality
    7.2.2 Geometrical interpretation of the principle of orthogonality
  7.3 Applying minimum mean square error (MMSE) estimation: Let us design a linear prefilter
    7.3.1 Matched filter, minimum phase filter and spectral factorization
    7.3.2 MMSE prefilter design
    7.3.3 Evaluating matrix E{yy*} and vector E{s[n] y}
  7.4 A representative example
  7.5 Stochastic processes and MMSE estimation
    7.5.1 Prediction
  7.6 Assignments

8 Information Theory and Error Correction Coding
  8.1 Linear block codes
    8.1.1 Repetition codes
    8.1.2 General linear block codes
    8.1.3 Decoding linear block codes using the Parity Check matrix H
  8.2 Convolutional codes and Min-Sum (Viterbi) decoding
    8.2.1 Decoding the convolutional codes
  8.3 Assignments

List of Figures

1.1 The data burst using pilot or training symbols.
1.2 The normalised autocorrelation function for a training sequence.
1.3 The multi-path channel and time domain representation at the receiver.
1.4 The transmitter and receiver flow in a wireless communication system.
3.1 The modulation of 1 bit in amplitude modulation.
3.2 The modulation of 4 coded bits x via BPSK modulation.
3.3 The modulation of 8 coded bits from x via 4 PAM modulation.
3.4 The modulation of 8 coded bits from x via QPSK modulation.
3.5 The modulation of 3 coded bits from x via 8 PSK modulation.
3.6 The first stages of the receiver hardware, indicating where the detector (an AI device) comes into play.
3.7 The de-modulation using a matched filter and optimum sampling.
4.1 MAP detection on a static Gaussian channel is selecting the modulation constellation point closest to the noise corrupted received samples. Two cases are shown, one where the channel quality is good (high SNR) and one where the channel quality is poor (low SNR).
4.2 The road-map between town A and B - infer the shortest route with least cost or distance?
4.3 The trellis - infer the shortest route with least cost - that will be the MLSE sequence!
4.4 The trellis - infer the shortest route with least cost - that will be the MLSE sequence!
4.5 The forward-backward MAP trellis for BPSK.
5.1 The OFDM transmitter frame format making the multipath propagation cyclic.
6.1 The layout of a typical receiver.
6.2 The Gaussian pulse shaping filter used in GSM.
6.3 The estimated impulse response and its z-plane representation.
7.1 The principle of orthogonality.
7.2 The representation of the matched filter, the feed-forward filter and the feedback filter. The prefilter is the combination of the matched filter and the feed-forward filter. The feedback filter is the post prefilter CIR.
7.3 The representation of the MMSE-DF prefilter.
7.4 The overall impulse response c before the prefilter.
7.5 The overall impulse response b after the prefilter.
7.6 The interpolation problem.
7.7 GSM channel fading with and without frequency hop.
8.1 The convolutional encoder and state diagram.
8.2 The convolutional decoder trellis based on the state diagram.

0.1 Preface

These notes deal with a number of different topics needed in Digital Communications theory. On our open source website http://opensource.ee.up.ac.za a complete GSM simulator is available, and it contains most of the blocks needed in a modern communication system. Each chapter in these notes deals with some of the techniques used in the simulator, and in the assignments after each chapter the student will use the simulator to complete the assignments. The idea is that the student will re-create the material for herself/himself by completing these assignments.

The notes reference only two other texts, namely the book by MacKay [1] and the book by Proakis [2], as well as a few key papers from the open literature. Most of the material in these notes can be found in these references; however, it is presented here in a style that is easy to understand, and in a logical order that facilitates the student's appreciation of the big picture of Digital Communications systems.

Feedback on these notes would be appreciated and can be sent to the email address below.

J.C. Olivier
Pretoria
September 2005
corne.olivier@eng.up.ac.za


Chapter 1

Introduction
1.1 Overview of Wireless Communications

Wireless communications is a term applicable to any system that transmits information from one location to another by means of radio wave propagation. For example, a system transmitting information over a fiber optic cable is a communication system, but it is not wireless. Some of the problems that plague wireless communications and lead to errors in detected bits, such as thermal noise in the receiver and channel dispersion (multi-path propagation), may also apply where the electromagnetic waves are guided, as in fiber optic or coaxial cable systems, but the use of radio wave propagation presents unique challenges.

First of all, the use of radio wave propagation in wireless communications systems requires receiver antennae, which will pick up any radio wave source inside the frequency band of interest. This may for example include human-made noise or interference sources, such as other transmitters operating in frequency bands close to the band of interest, leading to cross talk when the transmitter and receiver filters are unable to completely suppress the transmission in adjacent channels. Or it may be radio waves from the solar system, which are omnipresent.

Secondly, radio waves are susceptible to any relative movement between the transmitter and the receiver, as such movement causes Doppler shift. Doppler shift causes radio wave fading: the radio wave constantly undergoes a multiplicative distortion that varies the wave amplitude and phase during transmission from the transmitter to the receiver.

Channel dispersion is caused by multi-path propagation. By that we mean that multiple copies of the modulated radio wave, delayed in time, arrive at the receiver. Multi-path may be caused by reflections from mountains, buildings, or any large object able to reflect a significant amount of the transmitted wave. Hence, depending on the environment where the system is deployed, the type of multi-path present may differ.

In rural areas where mountains or hills are absent, multi-path may essentially be absent; at the other extreme, in built-up urban areas, buildings may cause significant multi-path. This is also the case in hilly terrain, where hills or mountains may cause multi-path with large delay because of the distances involved. Thus we may summarise by saying that the wireless communications system has to operate in an environment where the transmitter and/or the receiver is mobile (moving), causing Doppler shift that leads to fading; where multi-path propagation is present, causing channel dispersion; and where a multitude of interference sources, both human-made and extraterrestrial, impair the receiver's

ability to detect transmitted information reliably. It is difficult to say in general which of these impairments is dominant, as the conditions prevalent in different locations differ. In designing the optimal receiver, based on the theory of estimation and detection, we have to design a receiver that is able to mitigate all of these impairments simultaneously. Later it will become evident that the selection of signal processing methods used in the receiver is based on statistical models and assumptions made about the operating conditions and impairments. These impairments may be artificially generated or modelled on a computer, and the receiver may thus be simulated and its performance determined to a large extent before deployment in the field. Experience has shown that actual performance obtained in practice closely matches the performance predicted by computer modelling and simulation.

1.2 The transmitter data burst structure

In modern digital communication systems the objective is to transmit discrete data or source bits reliably over the channel. The bits will in general be represented, or modulated, as complex symbols chosen from a discrete modulation alphabet. These symbols (complex numbers) are unknown at the receiver, where they will be estimated or detected. In later chapters it will become clear that the detection process requires information about the transmit filters, the RF channel and the receiver filters. Since the transmitter or receiver or both may be mobile, the RF channel is time variant and is in general unknown at the receiver. The receiver will thus need to perform an estimation of the channel properties.

In order to perform the channel estimation, we will in general require known pilot or training symbols to be intermittently transmitted. These will be used by the receiver for channel estimation. Hence, it has become standard procedure to include a short sequence of pilot (training) symbols in between data symbols to form a radio burst, as shown in Figure 1.1. The pilot symbols enable the unknown channel to be estimated. The choice of pilot symbols is based on the need to have the autocorrelation function of the pilot sequence as close to a Kronecker delta as possible. In the receiver we will use the transmitted training (pilot) symbols, which are known a priori, to estimate the overall impulse response valid over a short period of time, which we denote by the vector c(t). The period of time over which we assume the CIR remains valid is determined by the length in time of the data burst as shown in Figure 1.1.

[3 tails | 58 data symbols | 26 training | 58 data symbols | 5 tails]

Figure 1.1: The data burst using pilot or training symbols.

For example, a 26 symbol sequence used in GSM is given by

  -1, 1, -1, 1, 1, 1, 1, -1, -1, -1, 1, -1, -1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1, -1, -1, -1

and its normalised autocorrelation function is shown in Figure 1.2. In later chapters it will become evident that the autocorrelation properties are important in designing the optimal channel impulse response estimator.
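The autocorrelation computation itself can be sketched in a few lines. (Python is used here rather than the Matlab of the course simulator; the sequence is the one quoted above, and the aperiodic normalisation by the sequence length is one reasonable convention.)

```python
# Normalised autocorrelation of the 26-symbol GSM training sequence.
seq = [-1, 1, -1, 1, 1, 1, 1, -1, -1, -1, 1, -1, -1, -1,
       1, -1, 1, 1, -1, 1, 1, 1, 1, -1, -1, -1]

def autocorr(s, k):
    """Normalised aperiodic autocorrelation at lag k."""
    n = len(s)
    return sum(s[i] * s[i + k] for i in range(n - k)) / n

rho = [autocorr(seq, k) for k in range(len(seq))]
# rho[0] is 1 by construction; a good training sequence keeps
# |rho[k]| small at the lags used by the channel estimator, which
# is what "close to a Kronecker delta" means in practice.
```

Plotting rho against the lag k reproduces the shape of Figure 1.2: a single dominant peak at zero lag with small sidelobes.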

The pilot symbols are used to estimate the effective overall Channel Impulse Response (CIR) valid for one burst, under the assumption that the channel changes slowly compared to the duration of a single burst.

Figure 1.2: The normalised autocorrelation function for a training sequence.

1.3 The dispersive radio channel

The radio channel impulse response in fact decays with time. Later copies of the wave carry less and less energy, since they have propagated over larger distances, and the radio channel impulse response therefore appears to be finite in time. Thus it has become standard procedure to model the radio channel in discrete time as a Finite Impulse Response (FIR) filter. The sampling rate of the receiver determines to a large extent how fast the FIR taps decay with time, i.e. how many there are, as a high sample rate implies short time differences between taps and thus less decay. Thus in general, high sample rate systems, i.e. large bandwidth systems, will experience much longer FIR channels than narrow-band or low sampling rate systems.

Movement between the transmitter and receiver will cause fading due to Doppler phenomena, a complicated topic not considered in these notes [1]. The important point is that generally each tap of the channel impulse response fades (i.e. varies) over time independently from the other taps. So for any one burst, the channel impulse response is arbitrary, except for its length, which is fixed and related to the receiver sample rate and the multi-path environment the receiver operates in.

Let us consider a situation where the transmitter transmits a series of modulated symbols denoted A_{n-2}, A_{n-1}, A_n, A_{n+1}, and the receiver uses a sample rate of 1/T Hz. The multipath environment the receiver is operating in is depicted in Figure 1.3. There is a direct path between the two antennas, which takes tau_1 seconds to travel between the antennas. There are two secondary paths that are each delayed a fraction of T seconds longer than the direct path, and then there are two delayed paths reflecting

[1] The reader is referred to the work by Zheng and Xiao, available on IEEE Xplore, for a detailed exposition of the simulation of these processes.

[Figure: five propagation paths between transmitter and receiver - the direct path with delay tau_1, two secondary paths delayed fractions of T, and two reflected paths with delays tau_4 = tau_1 + T and tau_5 = tau_1 + 2T. Below, the received pulse train f(t): contributions due to A_{n-2}, A_{n-1} and A_n, weighted by the taps c_0, c_1, c_2 and spaced T seconds apart.]

Figure 1.3: The multi-path channel and time domain representation at the receiver.

off mountains, delayed T and 2T seconds respectively. So due to symbol A_{n-2} there arrive five copies of that symbol at the receiver, each with a different amplitude, phase and time delay; see Figure 1.3 for a pictorial presentation. The transmitter, however, is completely unaware of the channel. Hence multiple copies of all the symbols arrive at the receiver. The receiver samples the output of the demodulator at multiples of T seconds. Let us consider what the receiver finds at sample nT, denoted r[n]. It finds present at that point in time contributions from symbols A_{n-2}, A_{n-1} and A_n, as is clear from Figure 1.3. Mathematically we can write this superposition of symbols present at time n as
    r[n] = Σ_{k=0}^{L} c_k A_{n-k} + n_s[n]    (1.1)

where n_s[n] represents thermal noise, and each c_k is a tap of the channel impulse response. The term Σ_{k=0}^{L} c_k A_{n-k} is a discrete convolution, and that is the reason why we refer to c as the channel's impulse response.
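Equation (1.1) is simply an FIR convolution plus noise, and can be sketched directly. (A Python illustration; the BPSK symbol values and the three channel taps below are made up for the example and do not come from any real system.)

```python
import random

def channel_output(symbols, c, sigma=0.0):
    """Apply the FIR multipath model of Eq. (1.1):
    r[n] = sum_k c[k] * A[n-k] + n_s[n]."""
    L = len(c) - 1
    r = []
    for n in range(len(symbols)):
        acc = sum(c[k] * symbols[n - k]
                  for k in range(L + 1) if n - k >= 0)
        acc += random.gauss(0.0, sigma)  # thermal noise n_s[n]
        r.append(acc)
    return r

# BPSK symbols and an illustrative 3-tap channel:
# c0 is the direct path, c1 and c2 the delayed, weaker copies.
A = [1, -1, 1, 1, -1]
c = [1.0, 0.5, 0.2]
r = channel_output(A, c)  # sigma=0: a pure discrete convolution
```

With sigma set to zero each output sample is exactly the weighted sum of the current and two previous symbols, which is what makes detection on a dispersive channel harder than on a single-tap channel.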

1.4 The model of the entire communication system

Let us now present the overall wireless communications system, as depicted in Figure 1.4. Each block represents a key part of the communication system's ability to transmit and receive information. To start off with, we have a data source, or a voice, or an image, that we wish to send over a horrible channel where the transmitted signal will fade, undergo multipath propagation, and where noise and interference signals are added to it. At the output of the receiver we wish to have a reliable copy of the source, with few or no mistakes if possible. How is this achieved in practice? It is achieved in several steps, each contributing a key part of the overall Digital Communications system.

[Figure: the transmitter chain - source (voice/data), source compression, encoder turning data bits into coded bits, modulator (bits to symbols followed by the pulse shaping filter g(t)) producing s(t); the RF channel model with additive noise; and the receiver chain - matched filter demodulator producing y, overall channel estimator, symbol/soft bit detector, decoder, and source decompression yielding the estimated source.]

Figure 1.4: The transmitter and receiver flow in a wireless communication system.

The first key part is the source compression block in the transmitter. Here, theoretically, all or at least most of the redundancy of the source is removed. This process is complicated and an ongoing research venture; it is not a solved topic at this point in time, and the technology for achieving compression is constantly changing.

Next the data without any redundancy, denoted x in Figure 1.4, are passed to the error correction encoder. Its job is to add some redundancy to x to produce z. Now why would we add redundancy when we did all the work to remove it in the previous block? The reason is that the redundancy we add here is controlled redundancy, added in very smart and ingenious ways. This redundancy will be exploited in the decoder to correct errors - a topic covered in detail in these notes.

The data with redundancy, denoted z, is binary (ones and zeros), and cannot be transmitted in that form over a channel, regardless of whether the channel is a coaxial cable, a wireless channel or deep space between Pluto and Earth. To be transmitted via an antenna over a channel, the data z must be modulated: the binary z is transformed into a series of complex data symbols denoted A_i. This is done in the modulator. The complex symbols A_i can be used to modulate a carrier RF electromagnetic wave over the channel: the real part of A_i modulates the in-phase part of the carrier wave, and the imaginary part the quadrature part of the carrier wave. The more valid points we permit on the complex plane in the modulator, the more bits we can grab from z per symbol, and the higher the throughput over the channel will be. But the more points we have, the more vulnerable we will be to noise in the receiver. This will all become clear later on in these notes.

The series of modulated symbols in analogue form, denoted s(t), are transmitted one by one over the channel with a time duration of T seconds each (1/T is known as the symbol rate).
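The bits-to-symbols step can be sketched for QPSK. (A Python illustration; the Gray-coded mapping below is one common assignment chosen for this example - actual systems may assign bits to constellation points differently.)

```python
import math

# Illustrative Gray-coded QPSK mapping: two coded bits select one of
# four unit-energy constellation points. The exact bit assignment is
# an assumption for this sketch and varies between systems.
QPSK = {
    (0, 0): complex( 1,  1) / math.sqrt(2),
    (0, 1): complex(-1,  1) / math.sqrt(2),
    (1, 1): complex(-1, -1) / math.sqrt(2),
    (1, 0): complex( 1, -1) / math.sqrt(2),
}

def modulate(bits):
    """Group the coded bits in pairs; each pair selects a symbol A_i.
    Re{A_i} drives the in-phase carrier, Im{A_i} the quadrature one."""
    return [QPSK[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

symbols = modulate([0, 0, 1, 1, 0, 1])  # three complex symbols
```

Doubling the constellation to 8PSK would carry three bits per symbol instead of two, at the cost of smaller distances between points and thus more vulnerability to noise - exactly the trade-off described above.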
The smaller T is, the faster data can be transmitted over the channel; but since the bandwidth used is proportional to 1/T, the price we pay is that we consume more bandwidth, a scarce and expensive resource. The dispersive channel causes multiple copies of s(t) to arrive at the receiver input port, where thermal noise is also added, before the receiver filters it. The receiver filter used to filter the distorted signal that arrives at the receiver is matched to

the transmission pulse shaping function g(t) used in the modulator. This causes the Signal to Noise Ratio (SNR) presented to the rest of the receiver to be maximised, which is a desirable situation.

The filtered and sampled signal that is the output of the de-modulator is denoted y[n], and is presented to the channel impulse response estimator. Here the overall channel impulse response, denoted c, is estimated by the receiver using known pilot symbols in the transmitted burst. The received vector y[n] and the channel impulse response c are passed to the first of two detection devices in the receiver, the detector, where the symbols formed in the modulator are transformed back into bits. These are denoted ẑ, because there can be errors in this estimated version of z, and the probability of each bit in ẑ is provided as well by the detector for use by the decoder. The character of the detector is dictated by the length of the vector c, i.e. the number of taps in it. If there is a single tap, with the rest zero, the detector is very simple and the methods developed using elementary probability theory can be used as is. However, even if there is just one additional tap, so there are two taps in c, then we need to resort to graphical methods called a trellis. There exists a very efficient and elegant algorithm to find the optimal solution on these graphs, known as the Viterbi algorithm.

The decoder has the job of fixing errors that are present in ẑ, and forms the key part of the modern marvel of Digital Communications systems. Without the decoder (coding theory) we would not be able to produce error-free estimates x̂ at the receiver, and data communications would be virtually impossible. The methods used to form the decoder are also based on graphs or trellises, and several options exist with various advantages and disadvantages. Finally, the decompression device uses x̂ to reconstruct the original source.
In practice Cyclic Redundancy Check (CRC) codes are used in the frame, and if an error is detected in the reconstructed source in spite of the best efforts of the decoder, a Repeat Frame Transmission request is sent back to the transmitter and the entire frame is sent again. This function is performed by a higher layer in the protocol stack used in the communication system.

Chapter 2

Introduction to Probability theory and Detection

2.1 Introduction

Here we study inference, a science where we are given information via observations and are required to infer the value of a parameter or some property of a random variable [1]. This is a situation commonly found in Digital Communications systems, where the observed data in the receiver are corrupted by noise that is unknown (stochastic). The only knowledge we do have is a statistical description of the noise probability density function (PDF); given the noisy observed data (which were also corrupted by multipath propagation) and knowledge of the noise PDF, our job is to figure out what was transmitted, and to quantify those estimates probabilistically. Clearly, therefore, the process of inference performed in the receiver is a statistical one, and proficiency in applying the concepts of statistics is needed. In these notes it is assumed that the reader has a basic understanding of statistics and its applications, but the concepts behind inference are explained here using several experiments taken from [1], since these contain the essential elements needed in the chapters to follow.

2.2 Probability theory, Detection and some odd experiments

2.2.1 Background

Binomial Distribution

Define the binomial coefficient

    C(N, r) ≡ N! / ((N - r)! r!)    (2.1)

A bent coin, i.e. an unfair coin, has a probability f of coming up heads. We perform the experiment N times. What is the probability distribution of the number of heads r? It has a binomial distribution given by

    P(r | f, N) = C(N, r) f^r (1 - f)^(N - r)    (2.2)
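Equation (2.2) can be evaluated directly. (A Python sketch; the values f = 0.2 and N = 10 are arbitrary illustrations, not taken from the text.)

```python
from math import comb

def binom_pmf(r, N, f):
    """P(r | f, N) of Eq. (2.2): probability of r heads in N
    flips of a bent coin with P(heads) = f."""
    return comb(N, r) * f**r * (1 - f)**(N - r)

# A bent coin with f = 0.2 flipped N = 10 times: the probabilities
# of r = 0, 1, ..., 10 heads form a distribution summing to one.
probs = [binom_pmf(r, 10, 0.2) for r in range(11)]
```

Because each term is non-negative and the terms sum to one over r = 0..N, the list `probs` is a valid probability distribution, which is the point the text makes about the parameter r.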

Probability

An ensemble X is a triple (x, A_x, P_x): x is the outcome, or the value of the random variable. It may take on one of the possible values defined by the set A_x, usually called the alphabet. The probability that x = a_i is P(x = a_i) = p_i, with p_i ≥ 0 and

    Σ_{a_i ∈ A_x} P(x = a_i) = 1

We don't always want to write in such a formal way, so we will use informal notation: we will simply write P(a_i).

Joint ensembles

In the joint ensemble XY the outcome is an ordered pair (x, y). P(x, y) is the joint probability of x and y. For a joint probability, the two variables are not necessarily independent. By that we mean that the joint probability cannot in general be written as a product P(x) P(y). If it can, they are independent by definition.

Marginal probability

    P(a_i) = Σ_{y ∈ A_y} P(x = a_i, y)    (2.3)

We will use the marginal probability extensively in what will follow.

Conditional probability

We can compute a probability conditioned on other given information. This will become a cornerstone of inference, so spend time on it. Formally,

    P(x = a_i | y = b_j) ≡ P(x = a_i, y = b_j) / P(y = b_j)    (2.4)

but P(y = b_j) must be greater than 0, else it is undefined.

Product rule

Let's assume we have some joint probability, say P(x, y|H), where this probability is based on a hypothesis H. Later on we will see how we have to make a hypothesis if we want to infer. We may write

    P(x, y|H) = P(x|y, H) P(y|H) = P(y|x, H) P(x|H)    (2.5)

known as the chain rule or product rule.

Sum rule

We may expand P(x|H) as

    P(x|H) = Σ_y P(x, y|H) = Σ_y P(x|y, H) P(y|H)    (2.6)

and this is the sum rule.
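The product and sum rules can be checked numerically on any small joint distribution. (A Python sketch; the four joint probabilities are made-up numbers that sum to one.)

```python
# A small joint distribution P(x, y) over x in {0,1}, y in {0,1}.
P = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

# Sum rule: the marginal P(x) = sum_y P(x, y), as in Eq. (2.6).
Px = {x: sum(P[(x, y)] for y in (0, 1)) for x in (0, 1)}

# Conditional via Eq. (2.4): P(y|x) = P(x, y) / P(x).
Py_given_x = {(y, x): P[(x, y)] / Px[x] for x in (0, 1) for y in (0, 1)}

# Product rule, Eq. (2.5): P(x, y) = P(y|x) P(x) recovers the joint.
recovered = {(x, y): Py_given_x[(y, x)] * Px[x]
             for x in (0, 1) for y in (0, 1)}
```

Marginalising, conditioning and then multiplying back reproduces the original joint distribution exactly, which is all the product rule asserts.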

Bayes's theorem

From the product rule we may find that

    P(y|x, H) = P(x|y, H) P(y|H) / P(x|H)    (2.7)
              = P(x|y, H) P(y|H) / Σ_{y'} P(x|y', H) P(y'|H)    (2.8)

2.2.2 Applications of Bayes's theorem

Gottlieb's nasty disease [1]

Gottlieb has a test done for a disease he suspects that he has. The doctor tells Gottlieb that this test is 95% reliable: in 95% of the cases where people really have the disease, the result is positive, and in 95% of the cases where people don't have the disease the test is negative. The doctor found from past experience (prior information) that for Gottlieb, being male and of a certain age and background, the chances are 1% that he has it, without the test being done. So Gottlieb's results come back, and the test is positive. The doctor tells him it is probable that he has the nasty disease, because the test is so reliable. The question is, do we as Bayesians agree with the doctor's assessment? A Bayesian believes in using probabilities to infer - i.e. we think that in order to believe something (like the statement from the doctor) probability theory must say it is probable, since our belief is founded in probability theory. How do we approach the problem? First, let us define Gottlieb's state of health by the variable a, and the test result by the variable b. a = 1 implies Gottlieb definitely has the disease; a = 0 thus means he definitely does not have it. In the language of Bayesian inference, we need to infer the probability that Gottlieb has the disease, given the test result: P(a = 1|b = 1). The doctor seems to think that the probability that Gottlieb has the disease is high, but let us compute its exact value, exploiting all knowledge we have available (that was given).

P(b = 1|a = 1) = 0.95
P(b = 1|a = 0) = 0.05
P(b = 0|a = 1) = 0.05
P(b = 0|a = 0) = 0.95

Do we agree? Also we know P(a = 1) = 0.01 and P(a = 0) = 0.99, the prior information. So what does a Bayesian think P(a = 1|b = 1) is?

P(a = 1|b = 1) = P(b = 1|a = 1) P(a = 1) / [P(b = 1|a = 1) P(a = 1) + P(b = 1|a = 0) P(a = 0)] = 0.16    (2.9)

A Bayesian thinks the probability that Gottlieb really has the disease is rather small, only 16%. So who is right, the doctor or the Bayesian?
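Equation (2.9) is easy to check numerically. A minimal Python sketch of the computation (the variable names are mine, the numbers are those given in the text):

```python
# Bayes's theorem for Gottlieb's test
p_b1_given_a1 = 0.95     # positive test when the disease is present
p_b1_given_a0 = 0.05     # positive test when the disease is absent (false positive)
p_a1, p_a0 = 0.01, 0.99  # prior: 1% chance of having the disease

# eq. (2.9): posterior = likelihood * prior / evidence
evidence = p_b1_given_a1 * p_a1 + p_b1_given_a0 * p_a0
posterior = p_b1_given_a1 * p_a1 / evidence
print(round(posterior, 2))   # prints 0.16
```

The strong prior (only 1% of people like Gottlieb have the disease) is what pulls the answer so far below the 95% test reliability.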

Experiments with black and white balls [1]

Take a long hard look at the previous example, and make sure that you really understand the essence of the Bayesian approach. It involves so-called inverse probability. Bayes's theorem turns the probabilities around: where we have a given b, we end up with the inverse, b given a, which enables us to make the inference. In most inverse problems (the interesting ones) we need to infer the conditional probability of one or more unobserved variables, given some observed variables. This is a theme that will keep on repeating throughout this course. So let us do some experiments with white and black balls. An urn contains K balls; B are black, and K − B are white. We draw a ball at random, then replace it, N times. What is the probability distribution for the number of times, say n, a black ball is drawn? First of all, why do we say probability distribution? It is because there is a finite probability that a black ball is drawn either 0, 1, 2, ... or N times. It may be more improbable that a black ball is drawn x times than say y times, but the point is that the parameter n has some probability distribution. Define fB = B/K; then the distribution is binomial, and is given by

P(n|fB, N) = [N! / (n! (N − n)!)] fB^n (1 − fB)^(N−n)    (2.10)

Now let us continue and do an inverse experiment with the balls. We have 11 urns, all identical. These we denote u ∈ {0, 1, 2, 3, ..., 10}. Each urn contains 10 balls. Urn u contains u black balls, and 10 − u white balls. Candy, our experimental lady, chooses an urn u at random, but we as onlookers don't know the urn number she selects. From this urn she draws N times, each time replacing the ball. It so happens that Candy obtained n = 3 black balls after N = 10 draws. Candy asks us to guess the number of the urn she is using. The key idea here is to realize that we need to compute the probability for each urn, all 11 of them, and then choose the most probable one - the most probable choice. We thus use probability to make a choice. This will form the core of our approach to detection in Communications theory. There we will compute the probability of each valid symbol in the receiver, and choose the most likely one. To answer the question posed above, we need to compute the probability distribution of the urn identification label u - we will then choose the urn with maximum probability. Thus we need to compute P(u|n, N), since both n (which was in this case 3) and N (which was 10) are given:

P(u|n, N) = P(u) P(n|u, N) / P(n|N)    (2.11)

Now we just go ahead and compute the needed quantities:

P(u) - in fact P(u) = 1/11 for all u, because Candy chose the urn randomly. They are all equally likely.

P(n|u, N) - this we know from theory is the binomial distribution [N! / (n! (N − n)!)] fu^n (1 − fu)^(N−n), with fu = u/10, because each urn contains 10 balls, and u of them are black.

What about P(n|N)? This is the marginal probability of n, given by P(n|N) = Σ_u P(u, n|N) = Σ_u P(u) P(n|u, N). For this case, given n = 3 and N = 10, P(n = 3|N = 10) = 0.083.

Below we have the probability distribution in tabular form:

P(u = 0|n = 3, N = 10) = 0
P(u = 1|n = 3, N = 10) = 0.063
P(u = 2|n = 3, N = 10) = 0.22

P(u = 3|n = 3, N = 10) = 0.29
P(u = 4|n = 3, N = 10) = 0.24
P(u = 5|n = 3, N = 10) = 0.13
P(u = 6|n = 3, N = 10) = 0.047
P(u = 7|n = 3, N = 10) = 0.0099
P(u = 8|n = 3, N = 10) = 0.00086
P(u = 9|n = 3, N = 10) = 0.00000096
P(u = 10|n = 3, N = 10) = 0

So what is the most likely urn that Candy is using, given the evidence? It appears to be urn 3. However, the probabilities for some of the other urns are not far off - so we are uncertain, but based on probability theory and the idea that probabilities can be used to infer, we select urn 3 as the most likely candidate. (Explain why the probability calculation says that it is definitely not urn 0 or 10.) Secondly, if Candy drew a ball 20 times, replacing it each time, what do we think would happen to the distribution and our uncertainty? So how does more evidence influence uncertainty?

Notation and naming conventions

P(u) is called the Prior probability of u
P(n|u, N) is called the likelihood of u
P(u|n, N) is called the Posterior probability of u
P(n|N) is called the evidence or marginal likelihood.

We continue. We ask Candy to draw another ball from the same urn. What do we think is the probability that she will draw a black ball? Standard statistical analysis will solve this problem as follows. It will say: well, we know the most probable urn is urn 3, so under the Hypothesis that she is drawing from urn 3, the probability of the next black ball is 0.3. This is an incorrect solution according to Bayesian inference, unless the probabilities for the other urns were negligible, which they were not in this case. We cannot make a Hypothesis about which urn she is drawing from in order to predict. We must include the uncertainty explicitly in our prediction, by summing (integrating) over all the urns and incorporating the probability distribution we computed above. The Bayesian approach is to say:

P(next ball is black|n, N) = Σ_u P(ball N + 1 is black|u, n, N) P(u|n, N).

Here P(ball N + 1 is black|u, n, N) = fu = u/10, regardless of n and N. This is because the urn is GIVEN in P(ball N + 1 is black|u, n, N).
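Both the posterior table and the Bayesian prediction can be checked numerically. A minimal Python sketch (variable names are mine):

```python
from math import comb

K, N, n = 10, 10, 3                       # balls per urn, draws, black balls observed
urns = range(11)                          # urn u contains u black balls
prior = 1 / 11                            # Candy chose the urn at random

def likelihood(u):                        # binomial P(n | u, N), eq. (2.10)
    f = u / K
    return comb(N, n) * f**n * (1 - f)**(N - n)

evidence = sum(prior * likelihood(u) for u in urns)           # P(n | N) ≈ 0.083
posterior = [prior * likelihood(u) / evidence for u in urns]  # P(u | n, N)

best = max(urns, key=lambda u: posterior[u])                  # MAP choice: urn 3
predictive = sum((u / K) * posterior[u] for u in urns)        # P(next ball is black)
```

Note that the prediction averages fu = u/10 over the whole posterior rather than conditioning on the single most probable urn.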

What about P(u|n, N)? It is the probability distribution we computed in the first part of this experiment. It contains the uncertainty that we have about which urn Candy is drawing from. Substituting numerical values, we find P(next ball is black|n = 3, N = 10) = 0.333. So the correct probability computation yields a slightly higher value.

The unfair coin [1]

This is a classic problem, first studied by Thomas Bayes in 1763. The essential ideas are simple - but the consequences far reaching. We are given a coin and asked: what is the probability that the next toss, in this case the first toss, is heads? The answer is that if it is a fair coin, the probability is 0.5. Now, let us be more

general and say we don't know yet if the coin is fair, i.e. it may have a bias and tend to come up heads more than tails (or vice versa). Now what would we say if asked the same question? If it is the first toss, a good guess would again be to say 0.5, because we do not have any observed information regarding the behaviour of the coin. Now we are asked the same question, but we are provided with a number of previous tosses and their results as evidence. The probability of coming up heads on the F + 1 th toss, say a, is therefore a parameter we wish to infer, given the results of the previous F tosses. So here the parameter to be inferred is itself a probability! As usual, we seek to write down the probability for the parameter a, given the observed data. So a useful question to ask is: what is the probability of the coin coming up heads on the F + 1 th toss, given a sequence of observations of the previous F tosses? So we wish to infer P(a|s, F), where a is the probability of the F + 1 th toss coming up heads, given the observed sequence s that contains F entries. By the sum rule, we may predict a as

P(a|s, F) = ∫ P(a|pa) P(pa|s, F) dpa    (2.12)

where pa is the probability of coming up heads, and the probability distribution P(pa|s, F) itself has to be inferred from the data as well. The prediction thus has the effect of incorporating the uncertainty we have about pa. Since P(a|pa) = pa, we focus on the second term. We may infer P(pa|s, F) using Bayes's theorem as follows:

P(pa|s, F) = P(s|pa, F) P(pa) / P(s|F)    (2.13)

Here P(s|pa, F) = pa^Fa (1 − pa)^Fb, where Fa indicates the number of heads, Fb the number of tails, and F = Fa + Fb. P(pa) - this is a more difficult question to answer without ambiguity. Since we have no idea of the extent of the coin's bias, a good assumption is that it can be anything, i.e. P(pa) = 1. However, this is by no means a unique choice, and we may use other priors too.
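With the uniform prior P(pa) = 1, the prediction in (2.12) can be evaluated numerically. A sketch, with assumed counts of Fa = 7 heads and Fb = 3 tails (for a uniform prior the integral works out to (Fa + 1)/(F + 2), known as Laplace's rule of succession):

```python
import numpy as np

Fa, Fb = 7, 3                      # assumed data: 7 heads, 3 tails
F = Fa + Fb
pa = np.linspace(0.0, 1.0, 200001)
dp = pa[1] - pa[0]

prior = np.ones_like(pa)           # uniform prior P(pa) = 1
lik = pa**Fa * (1 - pa)**Fb        # P(s | pa, F)
post = prior * lik / np.sum(prior * lik * dp)   # P(pa | s, F), eq. (2.13)

p_heads = np.sum(pa * post * dp)   # eq. (2.12): P(a|s,F) = integral of pa * P(pa|s,F)
# for the uniform prior this equals (Fa + 1) / (F + 2) = 2/3
```

Again the prediction averages over the full posterior for pa rather than plugging in a single best estimate.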

2.3

Conclusion

This chapter introduced probability theory and a central concept that will be used repeatedly in these notes: when one is faced with choosing between M possible explanations for experimental data, where the evidence is not enough to make a definitive choice, one has no choice but to compute the probability of each possibility (or Hypothesis), and choose the one with the highest probability. Given the observed information and the stated assumptions of the experiment, the best choice among the M possible explanations is the most probable one. In the rest of the notes this theme will repeat over and over in the design of Digital Communication systems.

Chapter 3

The modulator and demodulator


In general terms the job of the modulator is to take a few bits from the encoder at a time, say Q bits, and produce a complex symbol that is an analogue function of time, say s(t), representing those Q bits, which may be used to modulate the in-phase and quadrature phase components of a carrier wave that is transmitted over the channel. The symbol duration is T seconds, so the transmitter is able to send 1 symbol over the channel in T seconds, i.e. Q coded bits are sent over the channel in T seconds. The more bits the modulated symbol s(t) represents, the higher the throughput of the Communication system, but the more vulnerable the receiver will be to noise, i.e. the Bit Error Rate for z in the receiver will increase if Q increases. Refer to Figure 3.1, where we use the simplest form of modulation, namely amplitude modulation. If the bit to be sent over the channel is a logical 1, we multiply the RF carrier by 1. If it is a logical 0 we multiply the carrier wave by -1. The time it takes for the symbol to be transmitted is T seconds, i.e. we transmit 1 symbol per T seconds. In this case we also transmit 1 bit per T seconds, because each symbol represents just 1 bit in this simple modulation scheme.

[Figure: the radio frequency carrier wave RF(t) = cos(2π fc t) is multiplied by the data symbol wave g(t) = A, where A = 1 if the info bit is "1" and A = −1 if the info bit is "0", giving s(t) = g(t) RF(t) = Re{A exp(j2π fc t)} for nT < t < (n+1)T (the nth symbol); s(t) is sent to the power amplifier and antenna.]

Figure 3.1: The modulation of 1 bit in amplitude modulation.


In the literature the data symbol wave g(t) is mostly referred to as the pulse shaping function. The reason is that we may shape the signal sent to the power amplifier and antenna by choosing g(t) wisely. There are many reasons why we would want to do that. For example, the government may restrict the spectrum where you are allowed to transmit (they mostly do), and hence you want the RF transmitted signal not to spill over outside the frequency band that you rented from the authorities (since you will be fined severely if you do that). In the simple example in Figure 3.1 the pulse shaping function was in fact

g(t) = 1 for nT < t < (n + 1)T.    (3.1)

So without knowing it, we chose the simplest possible shaping function, namely a square pulse. Its frequency spectrum has a sin(x)/x shape, which you can draw to see what it looks like. It is not very optimal, since it does not utilise the frequency spectrum wisely when forced to fit into the finite frequency band available. Figure out for yourself why this is so. Secondly, since g(t) could take on only one of 2 possible values, its alphabet size was 2. In practice we may choose M values in general. In fact, we need not always choose a real number like we did above. We can choose complex levels and a complex pulse shaping function, since the operator that selects the real part (Re{·}), as shown in Figure 3.1, guarantees that the final waveform sent to the power amplifier and transmitter antenna is real, as only real signals exist in the real world. Nothing however stops us from using complex signals in the modulator mathematics, and it is generally done like that in the literature.

3.1
3.1.1

Modulation continued
The concept of base band signal processing and detection

In the literature the function g(t) is considered to be in the base band, because it has not yet been frequency translated (multiplied by a carrier wave) to the RF carrier frequency where the government rented you some bandwidth to operate in. For example, the GSM cellular system in SA is on the 800 to 900 MHz band. All the signal processing mathematics can be done in the base band, since the translation up to RF frequency before transmission can be reversed again at the receiver by translating back down to the base band. Remember the receiver knows the RF carrier frequencies it is supposed to operate on.
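The up- and down-translation between base band and carrier can be illustrated with a toy simulation. The carrier frequency, sample rate and lowpass filter below are assumed toy values (not GSM parameters):

```python
import numpy as np

fs, fc = 1_000_000, 100_000          # sample rate and carrier frequency (assumed)
t = np.arange(0, 1e-3, 1/fs)         # one millisecond of signal
g = 1.0                              # baseband square pulse value, A = +1 (bit "1")
rf = g * np.cos(2*np.pi*fc*t)        # passband signal sent over the air

# Receiver: mix with the local oscillator, then filter out the 2*fc component
mixed = rf * 2*np.cos(2*np.pi*fc*t)              # = g + g*cos(2*pi*(2*fc)*t)
lp = np.ones(50) / 50                            # crude moving-average lowpass filter
baseband = np.convolve(mixed, lp, mode='same')   # recovered baseband signal, about g
```

Away from the burst edges the filtered output sits at the baseband value g, showing why the translation is reversible when the receiver knows fc.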

3.1.2

Types of modulation

There are many alphabets that we may use for modulation. If the alphabet contains M entries, then we can map Q = log2(M) bits to each symbol from that alphabet. Let us consider a number of modulation schemes for different alphabet sizes.

3.1.3

Binary phase shift keying (BPSK)

This is the one you are familiar with, which we used above in Figure 3.1. Here the alphabet A contains two entries, i.e. the ith component of A must be one of two values, or Ai ∈ {−1, 1}. We are at liberty to choose our own pulse shaping function to make the analogue symbol s(t), and we denote that function as g(t). So for BPSK the modulator for a part of the binary string z operates as shown in Figure 3.2. What becomes clear for the case of BPSK modulation is that 1 bit from x maps to 1

[Figure: for the bit string x = 1 0 1 0, each bit selects an alphabet point on the real axis (A1 = −1 for info bit "0", A2 = +1 for info bit "1"), and the square pulse g(t) of height 1 and duration T forms each symbol s(t) = A g(t).]

Figure 3.2: The modulation of 4 coded bits x via BPSK modulation.

symbol sn(t) that is T seconds long. So every T seconds we are able to transmit 1 coded bit. Also, in this case the modulated symbols sn(t) are real valued. They do not have an imaginary part, because the alphabet contains only real elements [−1, 1]. Later it will become clear that even though BPSK is only able to transmit a single bit per symbol, it is very immune to noise. Also, we may view BPSK as modulating the phase of the carrier wave, because the alphabet elements both have a magnitude of 1. Thus the amplitude is not modulated; only the phase is modulated.
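The BPSK mapping just described can be sketched in a few lines of Python (the samples-per-symbol value sps is an assumed simulation parameter):

```python
import numpy as np

def bpsk_modulate(bits, sps=8):
    """Map bits to alphabet points A in {-1, +1} ("1" -> +1, "0" -> -1)
    and shape each one with a square pulse g(t) of duration T = sps samples."""
    A = np.array([1.0 if b else -1.0 for b in bits])
    return np.repeat(A, sps)              # s_n(t) = A_n * g(t - nT)

x = [1, 0, 1, 0]                          # 4 coded bits, as in Figure 3.2
s = bpsk_modulate(x)                      # 4 real symbols, T seconds each
```

The output is purely real, matching the observation that BPSK only modulates the sign (phase) of the carrier.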

3.1.4

Four level pulse amplitude modulation (4PAM)

Here we have a case of amplitude modulation when we consider 4PAM. Since there are 4 points in the alphabet, or Ai ∈ {−1, −0.5, 0.5, 1}, we are able to map 2 bits from x per symbol. We choose exactly the same pulse shaping strategy as in the previous example; it is only the components of the alphabet A that change. 4PAM is shown in Figure 3.3.
[Figure: for the bit string x = 10110001, each pair of bits selects one of the four real alphabet points A1 = −1, A2 = −0.5, A3 = 0.5, A4 = 1, and the square pulse g(t) of duration T forms each symbol s(t) = A g(t).]

Figure 3.3: The modulation of 8 coded bits from x via 4 PAM modulation.


3.1.5

Quadrature phase shift keying (QPSK)

QPSK can also be viewed as 4 PSK, and is a phase modulation technique. The amplitude is not modulated as was the case for 4PAM, but QPSK is also able to map 2 bits to each alphabet point. Once again we choose the same pulse shaping function as we did in the previous cases, but here the alphabet symbols are complex, and thus the analogue symbol sn(t) is also complex, so that both the in-phase and quadrature components of the carrier wave will be modulated. Refer to Figure 3.4 for an explanation of the modulation scheme. Clearly 8 coded bits from x were mapped to 4 QPSK symbols.

[Figure: for the bit string x = 10010011, each pair of bits selects one of the four complex alphabet points A1 = −1, A2 = −j, A3 = 1, A4 = j, and the square pulse g(t) of duration T forms each symbol s(t) = A g(t).]

Figure 3.4: The modulation of 8 coded bits from x via QPSK modulation.

3.1.6

Eight phase shift keying (8 PSK)

In this case we are able to map 3 coded bits from x to each symbol, which is complex. We choose again the same pulse shaping function to produce the analogue symbols used to modulate the carrier wave. Figure 3.5 shows the 8 PSK alphabet and the bit mapping used. Note that between any two adjacent constellation points only 1 bit changes its value, a strategy known as Gray mapping.

3.2

De-modulation

What we dealt with in the previous sections were the transmitter operations, i.e. getting binary data onto an RF carrier wave, known as modulation. In modulation, information bits are grouped Q at a time and allocated to a symbol able to represent those Q bits. For example, 8 PSK can represent 3 bits per symbol of duration T seconds. So every T seconds one symbol is transmitted over the air via the transmitter antenna, and it suffers multipath distortion and attenuation over the channel. Over a fixed period of time, a series of complex symbols s = [s1(t), s2(t), s3(t), s4(t), ..., sN(t)] were used to modulate the carrier wave in the transmitter. All in all, QN bits were transmitted in NT seconds if each symbol represents Q bits. The receiver, on the other hand, has to perform de-modulation, the opposite of the modulation performed in the transmitter. Because the receiver does not know, to begin with, what data the transmitter transmitted, plus the fact that the channel distorts the transmitted data through multipath propagation, and finally that the receiver is bombarded with thermal noise in its own electronics plus other interference sources (both human made and non-human made), it has a very difficult job, summarized in a single word: de-modulation.

[Figure: the 8 PSK constellation with Gray bit labels. Reading around the circle: M=1: 111 at (0+j1); M=8: 110 at (−1+j1)/√2; M=7: 100 at (−1+j0); M=6: 101 at (−1−j1)/√2; M=5: 001 at (0−j1); M=4: 000 at (1−j1)/√2; M=3: 010 at (1+j0); M=2: 011 at (1+j1)/√2. As an example, the bits x = 011 select s(t) = A2 g(t).]

Figure 3.5: The modulation of 3 coded bits from x via 8 PSK modulation.

The first step in the receiver is to move (translate) the signal, which is located at the carrier frequency, back into baseband. That is indicated in Figure 3.6, and is performed using a local oscillator, a multiplier and a baseband filter. The local oscillator is not perfect, i.e. it drifts somewhat over time and this causes a so-called frequency offset error, but we will not deal with that complication now. Let us assume that the local oscillator is perfect. After translation to the base band as shown in Figure 3.6, the receiver has received a corrupted version of the series of transmitted symbols s (complex again, now in base band). The corruption due to the multipath propagation can be modelled as a discrete convolution process, i.e. the multipath channel is seen as a system (black box) that has an impulse response denoted c that is either known or, if not (mostly the case), can be estimated somehow 1. So, since we now assume to know or at least have an estimate of c, we may model the effect of the channel as a convolution with the transmitted data. Let r(t) denote the received samples after the translation and bandpass filter operations. The sum of the receiver internal thermal noise plus all other interference sources received by the receiver antenna is denoted ns(t) and is assumed additive. Then in mathematical terms the baseband received signal r(t) can be expanded in terms of the transmitted data s(t) over a symbol duration of T seconds as
r(t) = Σ_{k=0}^{L} c_k s_{n−k}(t) + ns(t)    (3.2)

At this point in Figure 3.6 we have an analogue baseband signal r(t) that has still not been demodulated. Demodulation is complete when the binary data that the transmitter transmitted has been recovered by the receiver. To do that, modern receivers apply digital techniques based on Detection 2. But before a digital detection operation can be performed, we must
1 This channel estimation problem has been solved in very innovative ways in cellular systems; we will get to it in chapters to follow.
2 Detection methods are regarded by some as Artificial Intelligence agents, using probabilistic methods, a topic dealt with in the next chapter.

[Figure: the first receiver stages. The receiver antenna feeds the RF antenna electronics and a bandpass filter at fc; a multiplier (mixer) driven by the local oscillator at fc translates the signal down, and a bandpass filter yields r(t); a matched filter and digital sampler produce y[n], which the detector (an AI rational agent) turns into estimated data bits.]

Figure 3.6: The first stages of the receiver hardware, indicating where the detector (an AI device) comes into play.

convert the analogue signal r(t) to an equivalent digital one, denoted y[n], where n now indicates sampled (digital) time. The question that now arises is what series of steps the de-modulator must follow to produce a digital signal that we can pass to the detector. Each increment of n will imply that T seconds of physical time has elapsed, i.e. a symbol time. Using the concept of relaxation we relax the conditions to make the analysis simpler, and assume that c has only 1 tap (one entry, i.e. it is not a vector under this assumption). So here we assume c = [c0], and hence for this special case the convolution summation disappears and r(t) = c0 sn(t) + ns(t). With reference to Figure 3.7 we see that the receiver simply convolves the baseband signal r(t) with a filter with impulse response h(t). However, the filter h(t) is not just any old filter; it is chosen to complement the pulse shaping function g(t), for reasons that will become clear below. Specifically it is chosen as

h(t) = g(T − t).    (3.3)

This choice is called the matched filter, since the output of the matched filter will achieve a maximum over the symbol time T. At this maximum it is sampled to convert it to a digital sample. This sample will have the highest possible Signal to Noise Ratio (SNR). There is no other filter able to produce a SNR higher than that of the matched filter; among all linear filters it is the optimum one. Thus the output of the sampler at the peak of the matched filter output when c = [c0] (the

[Figure: the matched filter-sampler pair. r(t) passes through the matched filter h(t) = g(T − t); a sampler at t = nT converts the output to the digital sequence y[n].]

Figure 3.7: The de-modulation using a matched filter and optimum sampling.

relaxation assumption) is given by

y[n] = (r(t) ∗ h(t))|_{t=T}    (3.4)
     = (c0 sn(t) ∗ h(t) + ns(t) ∗ h(t))|_{t=T}    (3.5)
     = (c0 An g(t) ∗ h(t))|_{t=T} + ns[n]    (3.6)
     = (c0 An g(t) ∗ g(T − t))|_{t=T} + ns[n]    (3.7)
     = c0 An + ns[n]    (3.8)

if g(t) is chosen so that (g(t) ∗ g(T − t))|_{t=T} = 1. An, you may recall, is the complex symbol that the transmitter created from the binary data. So now we have a mathematical relation relating the digital sample y[n] produced by the demodulator, the channel impulse response c0, and the thermal noise sample ns[n], which is unknown. It is the detector's (or AI agent's) job to figure out what An is, given y[n] and a priori knowledge of the probability distribution function of ns[n] - does that sound familiar (EAI 310!)? With a good estimate for An from the detector, the data bits are recovered - albeit full of errors due to the noise. However, as you will later see, we will use error correction coding (another large field of research in AI) in the transmitter to be able to correct those errors in the receiver.
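The chain of equations (3.4)-(3.8) can be verified with a discrete-time sketch. The pulse, channel tap and noise level below are assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
sps = 8                                   # samples per symbol period T (assumed)
g = np.ones(sps) / np.sqrt(sps)           # square pulse, scaled so (g * g(T-t))|_{t=T} = 1
h = g[::-1]                               # matched filter, eq. (3.3): h(t) = g(T - t)

A = np.array([1.0, -1.0, -1.0, 1.0])      # BPSK symbols A_n from the coded bits
c0 = 0.9                                  # single-tap channel (the relaxation assumption)
r = c0 * np.concatenate([a * g for a in A])      # received baseband signal
r = r + 0.01 * rng.standard_normal(r.size)       # additive noise n_s(t)

mf = np.convolve(r, h)                    # matched filtering
y = mf[sps - 1::sps][:A.size]             # sample at the output peaks, t = nT
# y[n] is approximately c0 * A_n + n_s[n], as in eq. (3.8)
```

Sampling exactly at the end of each symbol period picks up the correlation peak, which is why the square pulse and its time-reversed copy combine to give unity gain.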

3.2.1

What if there is multipath?

In the previous section we made a relaxation assumption that there were no multipath components in the channel. In general there clearly is multipath, so how does the above analysis change? For the case where there are L + 1 taps (multipath components) in the channel impulse response vector c, the output of the de-modulator will be

y[n] = (r(t) ∗ h(t))|_{t=T}    (3.9)
     = Σ_{k=0}^{L} ck A_{n−k} + ns[n]    (3.10)

and a vector with P entries denoted y = [y[1], y[2], ..., y[P]] is passed from the demodulator to the

detector. However, in this case the use of the matched filter is insufficient to yield an output SNR that is maximised. A later chapter dealing with prefilter design will address this case in detail.
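A numeric sketch of eq. (3.10), with assumed tap values, shows how each demodulator output now mixes several symbols (intersymbol interference):

```python
import numpy as np

c = np.array([0.8, 0.5, 0.2])               # channel taps c_0, c_1, c_2 (assumed values)
A = np.array([1.0, -1.0, 1.0, 1.0, -1.0])   # transmitted BPSK symbols A_n

# eq. (3.10) without noise: y[n] = sum over k of c_k * A_{n-k}, a discrete convolution
y = np.convolve(A, c)[:A.size]              # the P = 5 demodulator outputs
# y[0] = c0*A_0, y[1] = c0*A_1 + c1*A_0, and so on: each y[n] depends on up to
# L+1 symbols, so symbol-by-symbol detection is no longer adequate
```

This is exactly why the memoryless MAP rule of the next chapter has to be extended to sequence detection when the channel has memory.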

Chapter 4

Detection
4.1 Introduction

In this chapter we study detection, using Bayesian Inference. Specifically, we study the concepts behind the inference of unobserved parameters given observed data. This is an important class of problems often found in practice, and the interpretation of methods based on the Maximum Likelihood (ML), Maximum Likelihood Sequence Estimation (MLSE) and Maximum a Posteriori Probability (MAP) criteria needs to be scrutinised in detail. Communications systems offer a very nice environment in which to study these topics, which are rather complicated, in a practical way that is easy to simulate (and eventually to understand) on a computer. There can be a lot of confusion between these concepts - one has to be careful when making statements about them. The examples in this chapter have been chosen to make the subtle differences clear. Most communication systems can be broken down into a few blocks as shown in Figure 1.4. In this course we will look at two of the blocks: the symbol detector (also referred to as the equaliser in the literature) and the decoder. Most of the concepts that are important to understand in Detection can be studied by applying the ideas to these two blocks.

4.2

The static Gaussian channel

The first channel, and a classic example, is the symmetric and static (time invariant) Gaussian channel. There is no fading of the transmitted signal such as will typically occur in radio channels, and the only impairment is additive Gaussian noise. By that we mean the noise has a Gaussian distributed pdf. Denote the input to the channel as x and the observed output as y. The symbol x can be one of N possibilities, as allowed by the alphabet in use. For example in 8 PSK, there will be 8 discrete possibilities for x. The input x is passed to the channel, where it is corrupted with the Gaussian noise, and we observe the corrupted output y. This process applies equally well to, say, a magnetic or optical recording device. The storage medium is noisy, and can be viewed as the channel. Data networks can also be modelled in this way; the channel is the cables and the transmitter and receiver hardware where the data symbols are corrupted. The noise energy (σ²) can be estimated in the receiver - a topic for a chapter to follow - here we

assume we know it. And we know the noise is white and its distribution is Gaussian. How can we design a device (or intelligent agent) capable of inferring what was transmitted, given the observed data (a symbol detector)? There are a few assumptions we can make that are applicable for this simple channel. Here the channel is memoryless, i.e. the outcome of observed symbol i is independent of previously transmitted symbols 1. Hence on the basis of the observed symbol y we have the posterior probability that the transmitted symbol was Ak (one of the possible symbols transmitted - for example if we have BPSK 2 then Ak could be one of two possibilities, namely -1 or 1, corresponding to a logical 0 or 1):

P(x = Ak|y) = P(y|x) P(x) / P(y)    (4.1)

where P(y|x) is proportional to the PDF, which is Gaussian (the noise is Gaussian):

p(y|x) = (1/√(2πσ²)) exp(−‖y − x‖² / 2σ²)    (4.2)

So we can write

P(x = Ak|y) ∝ (1/√(2πσ²)) exp(−‖y − Ak‖² / 2σ²) · P(x)

We must now address the issues of the evidence term P(y) and the prior term P(Ak). The key observation is that the evidence term is not affected by which symbol Ak is being considered when deciding which one maximises the posterior probability. It is the same regardless of the choice of Ak, and hence can be moved outside the brackets. The prior we deal with by saying that all symbols are equally likely, an assumption which is valid if we are transmitting random data such as compressed voice. So it can also be moved outside the brackets and neglected, as it does not influence the maximisation process. So we have to choose Ak so that

max_{Ak} {P(x = Ak|y)} = max_{Ak} { (1/√(2πσ²)) exp(−‖y − Ak‖² / 2σ²) / P(y) }    (4.3)

Recall what we did when we had to guess the urn Candy was using to draw balls from. We calculated the probability of all the urns, then chose the one with the maximum probability, i.e. the Maximum a Posteriori (MAP) choice. That was the best we could do. Nobody has yet come up with a better approach. So let us apply this same technique to this static Gaussian channel, where we have to guess what was transmitted, given noisy observed data. To apply MAP, we must choose

max_{Ak} {P(x = Ak|y)} = max_{Ak} { (1/√(2πσ²)) exp(−‖y − Ak‖² / 2σ²) · P(Ak) / P(y) }.    (4.4)

Dropping the evidence and the uniform prior, this reduces to

max_{Ak} {P(x = Ak|y)} = max_{Ak} { (1/√(2πσ²)) exp(−‖y − Ak‖² / 2σ²) }.    (4.5)

Now notice what determines the maximisation of P(x = Ak|y): it is the minimisation of the Euclidean distance between y and Ak, i.e. ‖y − Ak‖². Increasing noise energy causes the differentiation between different Ak to become more blurred, and hence inference becomes more difficult. So we can yet again simplify the MAP choice by writing it as

max_{Ak} {P(x = Ak|y)} = max_{Ak} { (1/√(2πσ²)) exp(−‖y − Ak‖² / 2σ²) } = min_{Ak} ‖y − Ak‖²    (4.6)

1 Would this assumption hold if the channel introduced multipath propagation?
2 Binary Phase Shift Keying

This proves that the MAP choice is the one that minimises the Euclidean distance between the observed noisy output y and the alphabet points on the complex constellation, as shown in Figure 4.1. Two cases are shown, one where the channel quality is good (high SNR) and one where the channel quality is poor (low SNR). In both cases the transmitter sent 10 symbols that were all 1, and a QPSK modulation scheme was used. This of course is not known by the receiver - it will select the closest of the 4 constellation points, since that is what we proved MAP detection tells us to do under these conditions (Gaussian static channel). In the next section we will see how the MAP choice is complicated when the channel also introduces multipath signals.
[Figure: two QPSK constellation plots of the observations y(n). In the low-SNR case (large noise power) the received samples scatter widely around the transmitted point 1; in the high-SNR case (small noise power) they cluster tightly around it.]

Figure 4.1: MAP detection on a static Gaussian channel is selecting the modulation constellation point closest to the noise corrupted received samples. Two cases are shown, one where the channel quality is good (high SNR) and one where the channel quality is poor (low SNR).
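The minimum-distance rule of eq. (4.6) is a one-liner. Below is a sketch reproducing the Figure 4.1 setup (all-ones transmitted over QPSK); the noise level is an assumed value for the high-SNR case:

```python
import numpy as np

rng = np.random.default_rng(1)
qpsk = np.array([1+0j, 1j, -1+0j, -1j])        # QPSK alphabet
sent = np.full(10, 1+0j)                        # transmitter sends ten '1' symbols
sigma = 0.1                                     # noise std per axis (high SNR, assumed)
y = sent + sigma * (rng.standard_normal(10) + 1j * rng.standard_normal(10))

# MAP with equal priors = choose the constellation point closest to y[n], eq. (4.6)
d2 = np.abs(y[:, None] - qpsk[None, :])**2      # Euclidean metrics to all 4 points
detected = qpsk[np.argmin(d2, axis=1)]
```

At this noise level all ten decisions come out as 1+0j; raising sigma blurs the clusters and symbol errors start to appear, exactly as the low-SNR panel of Figure 4.1 suggests.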

4.2.1

Computing more than just the most likely symbol: probabilities of all constellation points, and the corresponding coded bit probabilities computed by the receiver

In the previous section we computed just the best (MAP) constellation point. However, the decoder that follows the detector will be able to do much with the probability information for each encoded bit, as we will see in a later chapter 3. So let us compute the probabilities of all the constellation points, and of the bits used to make up the modulated symbols. Imagine we have a transmitter transmitting 8 PSK symbols, i.e. each symbol represents 3 bits from the encoded vector. The constellation used is shown in Figure 3.5. The transmitter transmitted a symbol, denoted Ai, where i was one of 8 possibilities. The receiver must try to determine what the transmitted symbol (i.e. i) was, using a device called a detector. Now, you (the receiver) are given an observed complex number that came out of the de-modulator; this is y[1] = −1.1 + j1. We know the noise is Gaussian; what was transmitted? Our strategy is based on what we learned from Candy's example with the white and black balls. There we learned that the optimal strategy is to compute the probability of each possibility, and then choose the maximum one. We will follow that same strategy here. We thus compute the posterior probability of each of the 8 possible symbols it could have been, then choose the one with the maximum probability (the most likely one). Thus for the kth symbol in the alphabet we need to compute P(x = Ak|y), which is given by

P(x = Ak|y) = P(y|x) P(x) / P(y)    (4.7)

3 So called Soft Decision Decoding.

We assume all symbols are equally likely, so P(x) is 1/8 regardless of k. P(y) is common to all values of k. The probability P(y|x) is thus proportional to the noise pdf, which has a Gaussian distribution. Hence we can write

P(x = A_k | y) = β exp( −|y − x|² / 2σ² )    (4.8)

where P(x) = 1/8 was absorbed into the constant β along with all other constants, including P(y). The value of β has to be determined - it is very important to realize that P(x = A_k | y) ≠ exp(−|y − x|² / 2σ²), since a probability cannot be equal to a probability density function. There are 8 possibilities for A_k, so for each term P(x = A_k | y) we have to compute the 8 Euclidean distance metrics

D(k) = |y[1] − A_k|²    (4.9)

so that we have 8 values D(k), k = {1, 2, 3, 4, 5, 6, 7, 8}. These we may substitute into equation (4.8), but the value of β is still undetermined. Since P(x = A_k | y) is a probability (not a pdf), it has to comply with the axioms of probability theory. One of them says that if a probability is summed over all its possible outcomes, it must yield one. Hence we may demand

Σ_{k=1}^{8} P(x = A_k | y) = 1.    (4.10)

The value of β may be determined by combining equations (4.10), (4.9) and (4.8). This is left to the reader as an exercise. The most probable symbol turns out to be symbol A_{k=8}. So the bits that the transmitter sent were most likely 1, 1, 0. The next question is: what is the probability of the 3 bits a, b, c being 1, 1, 0? We know the symbol probabilities, so we may compute the bit probabilities 4. The probability for the first bit a to be a one is

P(a = 1) = P(A_{k=8} | y) + P(A_{k=7} | y) + P(A_{k=6} | y) + P(A_{k=1} | y).    (4.11)

For bit b we have

P(b = 1) = P(A_{k=8} | y) + P(A_{k=1} | y) + P(A_{k=2} | y) + P(A_{k=3} | y).    (4.12)

For bit c to be a zero we have

P(c = 0) = P(A_{k=8} | y) + P(A_{k=7} | y) + P(A_{k=4} | y) + P(A_{k=3} | y).    (4.13)

4 As an exercise, go and compute the 3 bit probabilities. Which bit was most reliably detected? Intuitively, why is this so?
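The computation of the normalisation constant and the bit probabilities can be sketched numerically. Below is a minimal Python sketch; since Figure 3.5 is not reproduced here, the unit-circle indexing of the constellation (with A_8 at 45 degrees, nearest to y[1]) and the bit mapping are assumptions chosen to be consistent with equations (4.11)-(4.13), and sigma² is an arbitrary assumed noise variance.

```python
import cmath, math

y = 1.1 + 1j           # observed demodulator output y[1]
sigma2 = 0.2           # assumed noise variance

# Assumed 8 PSK constellation: A_k on the unit circle, indexed so A_8 = e^{j*pi/4}
A = {k: cmath.exp(1j * 2 * cmath.pi * (9 - k) / 8) for k in range(1, 9)}

# Euclidean distance metrics D(k) = |y - A_k|^2, equation (4.9)
D = {k: abs(y - A[k]) ** 2 for k in A}

# Unnormalised terms exp(-D(k)/2sigma^2) from (4.8); beta then follows from (4.10)
q = {k: math.exp(-D[k] / (2 * sigma2)) for k in D}
beta = 1.0 / sum(q.values())
P = {k: beta * q[k] for k in q}            # probabilities, now summing to one

best = max(P, key=P.get)                   # MAP symbol index

# Assumed bit mapping, chosen to match equations (4.11)-(4.13)
P_a1 = P[8] + P[7] + P[6] + P[1]           # P(a = 1)
P_b1 = P[8] + P[1] + P[2] + P[3]           # P(b = 1)
P_c0 = P[8] + P[7] + P[4] + P[3]           # P(c = 0)
print(best, round(P_a1, 3), round(P_b1, 3), round(P_c0, 3))
```

Normalising by beta is exactly the step demanded by equation (4.10): the 8 exponentials are scaled so that they sum to one.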


4.3 MLSE - the most likely sequence estimate

In the previous section we dealt with channels that were impaired by Gaussian noise but had no memory (no ISI). In that case it was easy to show how the optimal inference is arrived at from knowledge of the noise distribution (pdf) and Bayes' theorem. However, such channels are hard to come by in practice. Most storage media, communications channels, and waveguides such as cables and fibers have memory, for a variety of reasons that we won't go into here. In most cases in practice, we can model these channels as linear time-invariant convolutional channels, i.e. we can write the relationship between the transmitted symbols A_k and received symbols y as

y[k] = Σ_{i=0}^{L} c_i A_{k−i} + n_s[k]    (4.14)

where c is a vector containing the channel impulse response and n_s is additive white Gaussian noise. We are of course assuming that the channel is sampled at or above the Nyquist rate, indicated by k. The assumption that the channel c is time invariant can be satisfied if we consider the detection of small enough blocks of data symbols. In general we don't know the IR and it needs to be estimated, but we will not get into the topic of channel estimation in this chapter - that comes in a later chapter. For now, accept the fact that we may accurately estimate c using some known symbols in between unknown data symbols, and simply assume that the channel estimate is available from a channel estimator module in the receiver. Now, we pose a simple question, but with profound implications: given a block of observed symbols y = {y[1], y[2], …, y[N]}, what was the entire block of symbols x[1], x[2], …, x[N] that was transmitted (not only just one of the symbols)? Notice that we cannot solve this problem as before, because of the channel memory (the IR has multiple taps). The reader may invest some time in figuring out for herself why multiple taps in the IR can be viewed as modelling the channel memory. To refine the question posed above, we identify two aspects of it: we need to estimate the most probable block of data, referred to as a sequence, i.e. we need to infer the most probable sequence; and we need to infer the probability of each symbol in that sequence being correct. Later in these notes it will be shown that the most probable sequence does not necessarily contain the most probable symbols. In this section we consider the inference of the most probable sequence; inferring the most probable symbols is treated in the next two sections. So, using Bayes' theorem, we write the posterior probability of the sequence x as

P(x|y) = P(y|x) P(x) / P(y)    (4.15)

The noise is white (uncorrelated), so we can use the separability of the noise and write

p(y|x) = 1/(2πσ²)^N exp( −Σ_{k=1}^{N} | y_k − Σ_{i=0}^{L} h_i x_{k−i} |² / 2σ² )    (4.16)

So what about the prior P(x)? Since we may have an interleaver after the encoder, and in the absence of other information, the detector (the device we are now designing) may assume that all symbols are equally likely. The probability of the data P(y) does not influence the choice of x. Hence, we come to the conclusion that we may in fact just find the sequence x that maximises exp(−Σ_{k=1}^{N} |y_k − Σ_{i=0}^{L} h_i x_{k−i}|² / 2σ²), the likelihood function. For that reason, this type of MAP sequence detector, given that all transmitted symbols are equally probable, is also called the Maximum Likelihood Sequence Estimator (MLSE).

4.3.1 Finding the sequence x via the MLSE

It is one thing to write down the expressions for the Bayesian inference of the sequence x, quite another to compute it in a computationally efficient manner. One foolproof option is called complete enumeration: assuming we have BPSK modulation, we simply go through all 2^N combinations of the sequence of length N and choose the best one. As the length of the block N increases, the complexity grows as 2^N and we require exponentially more computations on the computer 5. An algorithm exists that can solve this optimisation problem exactly with significantly less complexity. It is called the Min-Sum algorithm, which has been invented in more than one field of science; in communications it is known as Viterbi's algorithm. First we recognise that in maximising the likelihood function (Maximum Likelihood), we just need to minimise −log P(y|x), i.e. minimise

F = Σ_{k=1}^{N} | y_k − Σ_{i=0}^{L} h_i x_{k−i} |²    (4.17)
Before we solve the problem of minimising this function with the Min-Sum algorithm, let us look at a simple example of the application of this algorithm.

The Min-Sum algorithm
[Figure 4.2 shows the road map: towns A, H, I, J, K, L, M, N, B connected by roads, with the direction of travel from A to B and the distance of each road segment marked; the individual edge costs are restated in the walkthrough that follows.]
Figure 4.2: The road-map between towns A and B - infer the shortest route, with least cost or distance.

Consider a map of a province, where 2 cities are connected via several towns, as indicated in Figure 4.2. We ask: how do we choose the shortest path from A to B? We do not want to compute the total distance of all possible paths (complete enumeration) because it is too expensive. We make use of the concept of message passing: information gained at one node of the map is passed to its neighbours, thereby eliminating the need for complete enumeration. So we proceed according to the Min-Sum algorithm as follows:
5 This is a so-called NP complete problem.

We start at A, where the distance travelled is 0. From A to H there is a path with cost 40 miles. There is an alternative path to I with cost 10 miles. The cost associated with node H is thus 40, and with node I, 10. This information is passed to H and I.

We examine the paths to the next set of towns, J, K, L. The path A-H-J has cost 60 miles, simply adding the cost of H-J to the cost known at H. There is also a path A-I-L, with cost 20 miles at node L. However, to K there are two competing paths, A-H-K and A-I-K. We compute both, and prune away the worse one, in this case A-H-K. We now associate the cost 30 miles with node K.

There are two competing paths to M, via J and via K. We select the lesser one, via K, giving A-I-K-M with cost 50 at M. We prune away the losing path. There are also two paths to N; we select A-I-K-N with cost 40 miles at N, and prune away the other path.

There are two remaining paths to B, A-I-K-N-B and A-I-K-M-B. We select the one of the two with least cost, A-I-K-M-B, with total cost 60 miles. The survivor or winner path A-I-K-M-B is the path that gave the least cost.

Now we return to the MLSE, or min-sum, detection. The trick in solving the minimisation of F is to realize that we may draw a map, graph or trellis that represents all the paths that are possible for the sequence x, and we may identify a cost at each node of the trellis, incorporating the history of the path that was taken to get to that node. For example, let us draw a trellis for the case of BPSK modulation, so that there are 2 possibilities for each data symbol, 1 or −1. There is one known pilot symbol at the start and one at the end of the trellis: each a 1. We show only part of the trellis in Figure 4.3. Time flows from left to right. The Euclidean metrics indicated are computed as −log(P(y|x)). The impulse response we have in this case is c = [c(0), c(1)], i.e. it has L + 1 = 2 taps. Thus we need to compute 4 metrics per time n, and we need to delay the first pruning by L nodes, in this case 1, because of the memory due to c. This can be seen by noting that a contest between 2 paths only develops at n = 2. So at each node for n = 2, 3, 4, … we compute the accumulated metrics that contest that node, and choose the winner. This in essence is the power of the min-sum algorithm: it cuts out all the redundant calculations, and retains the minimum possible calculations needed to get exactly the same answer as complete enumeration. The redundancies removed cause no degradation of the final result, i.e. MLSE with min-sum detection does not cause any noise enhancement. It is optimal in the sequence sense. At the end of the trellis we find the overall winner. The path it took resolves the most probable sequence taken by the transmitter. Thus we may find the most probable sequence of symbols transmitted for the case in Figure 4.3 as 1, −1, 1, 1, 1, −1. In this case each symbol also represents a bit, i.e. the most probable transmitted bits from the encoded vector z are 1, 0, 1, 1, 1, 0.
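The town-map walkthrough above can be reproduced with a few lines of forward message passing. The edge costs stated in the walkthrough are used directly; the remaining costs (marked as assumed) are hypothetical values chosen only to be consistent with the pruning decisions described, since Figure 4.2 is not fully legible here.

```python
# Min-sum (forward) message passing on the town map of Figure 4.2.
# Costs A-H, A-I, H-J, I-K, I-L, K-M, K-N, M-B follow from the walkthrough text;
# H-K, J-M, L-N, N-B are assumed values consistent with the pruning described.
edges = {
    ('A', 'H'): 40, ('A', 'I'): 10,
    ('H', 'J'): 20, ('H', 'K'): 30,   # H-K assumed
    ('I', 'K'): 20, ('I', 'L'): 10,
    ('J', 'M'): 20,                   # J-M assumed
    ('K', 'M'): 20, ('K', 'N'): 10,
    ('L', 'N'): 30,                   # L-N assumed
    ('M', 'B'): 10, ('N', 'B'): 30,   # N-B assumed
}
stages = [['A'], ['H', 'I'], ['J', 'K', 'L'], ['M', 'N'], ['B']]

cost = {'A': 0}      # accumulated cost message at each node
parent = {}          # surviving (winning) predecessor at each node
for level in stages[1:]:
    for node in level:
        # keep only the cheapest incoming path; the rest are pruned
        cands = [(cost[p] + w, p) for (p, q), w in edges.items()
                 if q == node and p in cost]
        cost[node], parent[node] = min(cands)

# backtrack the survivor path from B to A
path, node = ['B'], 'B'
while node != 'A':
    node = parent[node]
    path.append(node)
print(''.join(reversed(path)), cost['B'])   # -> AIKMB 60
```

Each node keeps exactly one surviving predecessor, which is precisely the pruning step described in the walkthrough.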

4.3.2 3 tap detector

In the previous example, the channel memory was given by the impulse response vector as c = [c0 c1 ] and thus had 2 taps. This enabled us to associate one transmitted symbol value per state (node), and led to the simple trellis of Figure 4.3. However, for the 3 tap case the impulse response vector

[Figure 4.3 shows the trellis over the observed data y[0], …, y[6] from the demodulator (n = 0, …, 6). The transmitter has 2 possible states per time n, 1 or −1, since BPSK was used. The branch metrics are λ_k = |y[n] − c_0 A_n − c_1 A_{n−1}|², where A_n is either 1 or −1 depending on the position in the trellis; the winner path has the least total accumulated metric.]

Figure 4.3: The trellis - infer the shortest route with least cost - that will be the MLSE sequence!

is c = [c0, c1, c2], and we have to associate 2 transmitted symbol values with each state in the trellis. Let us denote that pair by A_n A_{n−1}. With this notation we may set up a trellis for BPSK and 3 taps, as shown in Figure 4.4. The min-sum algorithm is executed on this trellis in the same way as before. Note that certain states are not connected here, as they are illegal, i.e. the transmitter is unable to move from certain states to certain others. In general, if we have an (L+1) tap channel impulse response c with a modulator that has M symbols in its alphabet, then the trellis will have M^L states per time node. On such a trellis there will be M^L surviving paths per time node, one per state (since M paths contest at each state), but only one overall winner that will resolve the transmitter symbols as the most likely sequence that the transmitter transmitted.
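The trellis procedure can be sketched in a few dozen lines. The sketch below runs min-sum (Viterbi) detection for BPSK over a 2 tap channel and checks the answer against complete enumeration; the channel taps, "noise" values and pilot convention are illustrative assumptions, not values taken from these notes.

```python
import itertools

c = [1.0, 0.5]                 # assumed 2 tap channel impulse response
tx = [1, -1, 1, 1, -1, 1]      # transmitted BPSK symbols (first acts as known pilot)
noise = [0.1, -0.2, 0.05, 0.1, -0.1, 0.15]   # fixed "noise" for repeatability
y = [c[0]*tx[n] + (c[1]*tx[n-1] if n > 0 else 0) + noise[n]
     for n in range(len(tx))]

def metric(seq):
    # F = sum_k |y_k - sum_i c_i x_{k-i}|^2, equation (4.17)
    return sum((y[n] - c[0]*seq[n] - (c[1]*seq[n-1] if n > 0 else 0)) ** 2
               for n in range(len(seq)))

# --- min-sum (Viterbi): state = previous symbol value ---
N = len(y)
# survivors: state -> (accumulated cost, path so far); the pilot fixes the start
surv = {tx[0]: ((y[0] - c[0]*tx[0]) ** 2, [tx[0]])}
for n in range(1, N):
    new = {}
    for s in (1, -1):                       # s = candidate symbol x[n] = new state
        cands = [(cost + (y[n] - c[0]*s - c[1]*prev) ** 2, path + [s])
                 for prev, (cost, path) in surv.items()]
        new[s] = min(cands)                 # prune: keep the cheapest path into s
    surv = new
best_cost, best_path = min(surv.values())

# --- complete enumeration over all 2^(N-1) sequences (pilot fixed) ---
brute = min(metric([tx[0]] + list(tail))
            for tail in itertools.product((1, -1), repeat=N - 1))

print(best_path, round(best_cost, 4), round(brute, 4))
```

Because the pruning keeps the cheapest path into every state, the trellis search returns exactly the complete-enumeration minimum, at a cost linear rather than exponential in N.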

4.3.3 Discussion

Note that the min-sum algorithm was able to produce accurate estimates of the data symbols in the receiver when the channel has memory, but was unable to produce any estimate of the probability of the individual symbols being correct. This is a fundamental limitation of the min-sum (Viterbi) algorithm, as the decoder that follows the detector may use probability information efficiently to decode, as we shall see in later chapters. Many authors have extended the min-sum algorithm so that it produces probability information as well, most notably the Soft Output Viterbi Algorithm (SOVA) by Hagenauer in Germany. However, the probabilities are in fact suboptimal, and direct approaches based on Bayesian inference or MAP criteria are able to produce better estimates for the probabilities when there is multipath. A method based on Bayesian detection was devised by Abend and Fritchman in the early 70s, and a MAP method using forward and backward iterations on the trellis was devised by Bahl et al. [3], known as the BCJR algorithm, at about the same time. However, the BCJR algorithm was known in the Artificial


[Figure 4.4 shows the 3 tap trellis over observed data y[1], …, y[7], with state pairs A_n A_{n−1} labelling the nodes and branch metrics λ = |y[n] − c_0 A_n − c_1 A_{n−1} − c_2 A_{n−2}|²; the winner path has the least cost.]

Figure 4.4: The trellis - infer the shortest route with least cost - that will be the MLSE sequence!

intelligence community already at that time as the Pearl Belief Propagation algorithm.

In the next section we present an approximate method based on the Viterbi trellis, and the method by Abend and Fritchman. The section after that introduces the forward-backward MAP algorithm 6.

4.4 Probabilistic Detection via Bayesian Inference for Multipath channels

4.4.1 Suboptimal detected bit probability calculation

For the case where L = 0 we had no memory in the channel (a 1 tap channel) and the probabilistic detection dealt with in a previous section was in fact trivial. For the case where L > 0 we would need to compute the probabilities in a similar fashion as follows:

P(x = A_k | y) = β exp( −| y − c_0 A_k − Σ_{m=1}^{L} c_m A_{k−m} |² / 2σ² )    (4.18)

The problem is that we don't know A_{k−m}, m ≥ 1 before executing the min-sum algorithm. Hence we can use a multi-step procedure:

1. Run a min-sum detector; this produces an estimate of all the symbol values without any probability info.
2. Use the output from the Viterbi (min-sum) detector to find the values for A_{k−m}, m ≥ 1.
3. For each k, compute P(x = A_k | y) using (4.18) and the A_{k−m}, m ≥ 1 from step 2.
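As a sketch of step 3, suppose the min-sum pass has already produced hard decisions Â for every symbol. The per-symbol terms of equation (4.18) can then be normalised symbol by symbol; the channel, noise variance, observations and hard decisions below are all illustrative assumptions.

```python
import math

c = [1.0, 0.5]                      # assumed channel taps c_0, c_1 (so L = 1)
sigma2 = 0.25                       # assumed noise variance
alphabet = (1, -1)                  # BPSK
A_hat = [1, -1, 1, 1]               # hard decisions from a (hypothetical) min-sum pass
y = [1.05, -0.45, 0.6, 1.4]         # observed samples

probs = []
for k in range(len(y)):
    # past symbols A_{k-m}, m >= 1 are taken from the min-sum output (step 2)
    past = c[1] * A_hat[k - 1] if k > 0 else 0.0
    q = {a: math.exp(-(y[k] - c[0] * a - past) ** 2 / (2 * sigma2))
         for a in alphabet}
    beta = 1.0 / sum(q.values())    # normalise so the two outcomes sum to one
    probs.append({a: beta * q[a] for a in q})

for k, p in enumerate(probs):
    print(k, round(p[1], 3), round(p[-1], 3))
```

The probabilities are suboptimal exactly because the past symbols entering (4.18) are hard decisions rather than being marginalised over.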
6 The latter algorithm can be most efficiently executed on a so-called Frey graph, a more sophisticated representation than a trellis [1], and involves a forward and backward iteration on the graph. See http://opensource.ee.up.ac.za/ for a demo on the decoding of repetition and convolutional codes using the Pearl Belief Propagation algorithm on a Frey graph.


4.4.2 Optimal symbol probability calculation using Bayesian Detection

Assuming that data symbols are chosen from a set of discrete complex values, referred to as the modulation alphabet, the estimate of the transmitted symbols made at the receiver is called detection (or sometimes equalisation), as we need to pick one of the M possible symbols in the alphabet as the estimated data symbol. For example, we may have an 8 PSK modulation alphabet. The receiver has access to a burst of received symbols z, of length Q. It also has access to an estimate of the channel impulse response (CIR) b. The question thus is how we will choose or infer the estimated data symbols, given the CIR and the entire received sequence. The approach we will use in this section is referred to as Maximum A Posteriori Probability (MAP) symbol detection based on Bayesian inference. The idea is that we will detect each symbol independently, given the entire received sequence z and the CIR b, such that the probability of making an incorrect decision is minimised. It will become evident that we will exploit some assumptions about the channel noise statistics, and the fact that the channel memory, i.e. the number of taps in the CIR, is finite. For example, a transmitted symbol d_n can only influence received symbols z_i at times i ∈ {n, n + 1, …, n + L}. Secondly, by assuming we know the probability density function (pdf) of the additive noise process, we may write the a posteriori probability that the symbol transmitted at time k was symbol i in the alphabet, i.e. d_k = i, using Bayes' theorem as

P(d_k = i | z_{k+L}, …, z_1) = p(z_{k+L}, …, z_1 | d_k = i) P(d_k = i) / p(z_{k+L}, …, z_1)    (4.19)

where P(d_k = i) is the probability that the ith point in the symbol alphabet was transmitted. Notice that in the Pearl Belief Propagation algorithm we needed no assumptions on the prior. However, since we use Bayesian inference here, where we need to consider the prior, we note that in many practical cases we may make the assumption that the M symbols in the alphabet are equally likely, as the data bits themselves are random after a sufficiently long interleaver has removed any correlation in the encoded data bits. Also, in iterative detectors 7, where a priori information about the symbols d_k = i is shared between the decoder and symbol detector, P(d_k = i) may vary. Thus, in this presentation we will keep the term P(d_k = i) as an unknown. The denominator is in fact identical regardless of the choice i for symbol d_k, and thus maximising P(d_k = i | z_{k+L}, …, z_1) implies maximising the numerator. Thus, the MAP criterion for detecting d_k is

d̂_k = arg max_{d_i} p(z_{k+L}, …, z_1 | d_k = i) P(d_k = i).    (4.20)

We assume we have known tail symbols, i.e. we know d_k for k ∈ {0, −1, …}. Thus we may start by detecting d_{k=1} given z_{L+1}, …, z_1, as d_{k=1} cannot influence the value of z_k for k > L + 1, since the channel memory is L + 1 (the CIR has L + 1 taps). We have

d̂_1 = arg max_{d_1} p(z_{1+L}, …, z_1 | d_1) P(d_1)    (4.21)
    = arg max_{d_1} Σ_{d_{L+1}} … Σ_{d_2} p(z_{1+L}, …, z_1 | d_{L+1}, …, d_1) P(d_{L+1}, …, d_1)

7 also referred to as turbo detectors or turbo equalizers

where d̂_1 denotes the decision on transmitted symbol d_1, a process we will call detection 8. We assume the noise is statistically independent, since it has been whitened by the prefilter (see Chapter 5). The reader may now appreciate the importance of noise whitening to the detection process, since the statistical independence makes it possible to simplify the equations above by rewriting p(z_{1+L}, …, z_1 | d_{L+1}, …, d_1) as

p(z_{1+L}, …, z_1 | d_{L+1}, …, d_1) = p(z_{1+L} | d_{L+1}, …, d_1) p(z_L | d_L, …, d_0) … p(z_1 | d_1, …, d_{1−L})    (4.22)

In addition, assuming the noise is Gaussian enables us to write the pdf used above as

p(z_k | d_k, …, d_{k−L}) = (1/(πN_0)) exp( −| z_k − Σ_{j=0}^{L} b[j] d[k−j] |² / N_0 )    (4.23)

Assuming no a priori information about the a priori probability P(d_{L+1}, …, d_1) is available, we may make the assumption that all symbols are equally likely, and hence P(d_{L+1}, …, d_1) becomes a constant that does not influence the detection of symbol d_k. We now move to k = 2. We have

d̂_2 = arg max_{d_2} p(z_{2+L}, …, z_1 | d_2) P(d_2)    (4.24)
    = arg max_{d_2} Σ_{d_{L+2}} … Σ_{d_3} p(z_{2+L}, …, z_1 | d_{L+2}, …, d_2) P(d_{L+2}, …, d_2)
    = arg max_{d_2} Σ_{d_{L+2}} … Σ_{d_3} p(z_{2+L} | d_{L+2}, …, d_2) P(d_{L+2}) p(z_{1+L}, …, z_1 | d_{L+1}, …, d_2) P(d_{L+1}, …, d_2)
    = arg max_{d_2} Σ_{d_{L+2}} … Σ_{d_3} p(z_{2+L} | d_{L+2}, …, d_2) P(d_{L+2}) Σ_{d_1} p(z_{1+L}, …, z_1 | d_{L+1}, …, d_1) P(d_{L+1}, …, d_1)

The term in the last line of equation (4.24) that is summed over d_1 may be determined from information gathered when detecting d_1. In other words, the MAP detector is recursive: it does not require the re-computation of information obtained from detected symbols for times prior to k. This leads to a huge saving in computational requirements. We will not present the detection of d_3, as it should be clear that it follows a similar route to the detection of d_2, with the recursions continuing as time k increases for detecting d_k in general.
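The detection of d_1 in equation (4.21) can be sketched for a 2 tap channel (L = 1), where the marginalisation runs over the single unknown future symbol d_2. The channel b, noise density N_0 and observations below are illustrative assumptions, with a known tail symbol d_0 = 1 and a flat prior.

```python
import math

b = [1.0, 0.4]          # assumed CIR b[0], b[1]  (L = 1)
N0 = 0.5                # assumed noise spectral density
d0 = 1                  # known tail symbol
z = [-0.65, -1.35]      # observations z_1 and z_2 (z_2 = z_{1+L})
alphabet = (1, -1)      # BPSK

def pdf(zk, dk, dk1):
    # p(z_k | d_k, d_{k-1}) for Gaussian noise, equation (4.23) up to a constant
    return math.exp(-abs(zk - b[0] * dk - b[1] * dk1) ** 2 / N0)

# Equation (4.21): marginalise the likelihood over the unknown d_2,
# assuming a flat prior P(d_2, d_1)
score = {d1: sum(pdf(z[1], d2, d1) * pdf(z[0], d1, d0) for d2 in alphabet)
         for d1 in alphabet}
d1_hat = max(score, key=score.get)
total = sum(score.values())
print(d1_hat, round(score[d1_hat] / total, 3))   # decision and its posterior
```

For longer channels the inner sum simply runs over all L unknown future symbols d_2, …, d_{L+1}, and the recursion of equation (4.24) reuses these partial sums at the next time step.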

4.5 Forward-Backward MAP detection

Sequence detection produces optimal hard symbol values but, unfortunately, sub-optimal probabilistic information regarding the reliability of those decisions. In a later chapter dealing with the decoding of error correction codes, we will find that the reliability information is very important to the decoder, and hence one would like to have handy a detection algorithm that can provide optimal probabilistic
8 Historically this is called equalisation, because the process of detection implies that the disturbing effects of the channel were equalised to obtain the detected estimate d̂_1.

information about each detected symbol. Using Bayesian inference we presented an algorithm capable of precisely that (the previous section); however, we had to either know or guess the prior probabilities of the symbols. A different algorithm exists that is able to provide optimal probabilistic information without the need to know or guess the prior probabilities of the symbols. It is known as the Pearl Belief Propagation algorithm in the artificial intelligence community [1], but in Digital Communications it is known as the BCJR or Forward-Backward MAP algorithm [3]. Formally the symbol probability is given by marginalization as

P(t_n | y) = Σ_{t_{n'}, n' ≠ n} P(t | y)    (4.25)
Here y is the received or observed vector from the demodulator, and t_n is the nth symbol we wish to infer at the receiver; c is the channel impulse response, assumed known (or at least an estimate of it is known). The symbol may take on one of M possible values dictated by the modulation alphabet used. Let us consider the BPSK alphabet as an example, with a 2 tap trellis. It has two possible symbols, −1 or +1. The modulator in the transmitter followed a specific path through the trellis shown in Figure 4.5, and it is the job of the receiver, where the MAP algorithm resides, to estimate the probabilities that each symbol was transmitted. In the trellis used for the MAP detection, the edges of the trellis have associated with them not the Euclidean distance metric, as was the case for the min-sum algorithm, but rather the likelihood itself. The likelihood has the form

P(y_n | t_n) = A exp( −| y_n − Σ_{k=0}^{1} c_k t_{n−k} |² / 2σ² )    (4.26)

where t_n is given, while y_n is the observation from the demodulator. With each edge in the trellis we associate a symbol value that was transmitted; in the case of BPSK there are two possibilities, a −1 or a +1. If we number each node in the trellis sequentially from left to right as 0, …, I, then the edge that connects node j to node i (assuming they are connected, and denoting by j ∈ P(i) that j is a parent node of i) has a value t_{i,j}.
[Figure 4.5 shows the trellis over the observed data y[0], …, y[6] from the demodulator, with nodes numbered 0, 1, 2, … from left to right and edge values t_{i,j} = ±1 on the connections; the transmitter has 2 possible states per time n, 1 or −1, since BPSK was used.]

Figure 4.5: The forward-backward MAP trellis for BPSK.

Let i run from 0 to I, from left to right on the trellis, and let w_{i,j} be the likelihood itself (given above) associated with the edge from node j to node i with value t_n = t_{i,j}, while the set of parent states considered for node i is P(i), defined above. Compute the forward pass messages, each associated with a node, say i, as

α_i = Σ_{j ∈ P(i)} w_{i,j} α_j    (4.27)

with α_0 = 1. The second set of messages, from right to left, is similarly computed as

β_j = Σ_{i : j ∈ P(i)} w_{i,j} β_i    (4.28)

with β_I = 1. Now let i run over nodes at time n and j over nodes at time n − 1, and let t_{i,j} be the value of t_n associated with the trellis edge from node j to node i. Compute terms proportional to the probability as

r_n^{t=1} = Σ_{i,j : j ∈ P(i), t_{i,j} = 1} α_j w_{i,j} β_i    (4.29)

for the symbol to be a +1, and

r_n^{t=−1} = Σ_{i,j : j ∈ P(i), t_{i,j} = −1} α_j w_{i,j} β_i    (4.30)

for the symbol to be a −1. The term proportional to the probability contains a yet to be determined constant of proportionality, which is a function of A. Since summing a probability over all its outcomes must yield 1, we may demand that

r_n^{t=1} + r_n^{t=−1} = 1    (4.31)

which yields the constant of proportionality for time instant n, and thus correctly normalises both terms, so that the probability is

P(t_n = t | y) = r_n^{(t)}.    (4.32)

4.5.1 An example

Let us assume all the forward and backward messages on the trellis have been computed, and we want to determine the probability that at time n = 1 the transmitted symbol was a +1 or a −1. There are only two nodes at each time slice since we use BPSK, so that

r_{n=1}^{t=1} = α_2 w_{3,2} β_3 + α_1 w_{3,1} β_3    (4.33)

and

r_{n=1}^{t=−1} = α_1 w_{4,1} β_4 + α_2 w_{4,2} β_4.    (4.34)

We now normalise for n = 1 by demanding r_{n=1}^{t=1} + r_{n=1}^{t=−1} = 1, and then find after normalisation that

P(t_1 = ±1 | y) = r_1^{(t=±1)}    (4.35)

Remember that in computer calculations there may be underflow or overflow in calculating α and β, but that may be avoided by re-normalising when needed; in the end, the normalisation constants disappear anyway when the outcomes are summed to one to produce a probability.


4.6 Assignments

1. Using the GSM simulator, identify the symbol detector (equaliser) function. It is based on the so-called Max Log MAP algorithm, a sub-optimal implementation of the forward-backward MAP algorithm. Develop your own detector based on the Min-Sum algorithm, and compute the symbol probabilities using the suboptimal procedure explained in this chapter. Form the soft bit outputs as z_soft = (2ẑ − 1) |ln( P(z = 1) / P(z = 0) )|. The idea is to use the min-sum hard bits, since they are optimal, and to scale the decisions based on the sub-optimal probability calculations.
1) Plot BLER (block error rate) versus Eb/N0, for MCS (Modulation and Coding Scheme) 1 and MCS 4. Choose values for Eb/N0 that yield sensible BLER values, typically between 0.3 and 0.01. Rather simulate fewer points with more frames/blocks per point, otherwise the curves are not reliable.
2) On the same graphs, plot the Max Log MAP BLER values. Comment on what you find, especially the behaviour of the 2 coding schemes, where the one is at a low code rate and the other at a high code rate.

2. Using the GSM simulator, identify the symbol detector (equaliser) function. It is based on the so-called Max Log MAP algorithm, a sub-optimal implementation of the forward-backward MAP algorithm. Develop your own detector based on the Abend and Fritchman detector. Form the soft bits using the same procedure as in the given Max Log MAP algorithm, which is also based on probabilities.
1) Plot BLER (block error rate) versus Eb/N0, for MCS (Modulation and Coding Scheme) 1 and MCS 4. Choose values for Eb/N0 that yield sensible BLER values, typically between 0.3 and 0.01. Rather simulate fewer points with more frames/blocks per point, otherwise the curves are not reliable.
2) On the same graphs, plot the Max Log MAP BLER values. Comment on what you find, especially the behaviour of the 2 coding schemes, where the one is at a low code rate and the other at a high code rate.

3. Using the GSM simulator, identify the symbol detector (equaliser) function. It is based on the so-called Max Log MAP algorithm, a sub-optimal implementation of the forward-backward MAP algorithm. Develop your own detector based on the forward-backward MAP algorithm. Form the soft bits using the same procedure as in the given Max Log MAP algorithm, which is also based on probabilities.
1) Plot BLER (block error rate) versus Eb/N0, for MCS (Modulation and Coding Scheme) 1 and MCS 4. Choose values for Eb/N0 that yield sensible BLER values, typically between 0.3 and 0.01. Rather simulate fewer points with more frames/blocks per point, otherwise the curves are not reliable.
2) On the same graphs, plot the Max Log MAP BLER values. Comment on what you find, especially the behaviour of the 2 coding schemes, where the one is at a low code rate and the other at a high code rate.

4. Compare the BLER results for MCS 1 and 4 for all 3 methods and Max Log MAP on the same graphs. Discuss the advantages and disadvantages of each in terms of complexity and

performance.


Chapter 5

Frequency Domain Modulation and Detection: OFDM

5.1 Introduction

In the previous chapters we studied time domain modulation and time domain detection. These led to the development of trellis based detection methods to achieve both Maximum Likelihood and Maximum A-posteriori Probability detection. These are generally complex, especially if the time domain channel impulse response contains many taps and/or the modulation constellation is large.

Engineers have a long tradition of mitigating complex time domain operations in the frequency domain. We are comfortable with Laplace and Fourier transformations to render differential operators and/or convolution operators into a form that uses only algebraic equations. To jog your memory, think of how simple it is to find circuit transfer functions by performing a Laplace transformation and then factoring pure algebraic equations in the s domain. Generations of engineers have done this, even up to the present day.

It was thus a natural question to ask ourselves whether it is possible to modulate and detect in the frequency domain, and then somehow render the detection process trivial, to the extent that regardless of how many taps there are in the time domain, the frequency domain detection remains trivial. It turns out that the answer to this question is affirmative. The solution has become known as Orthogonal Frequency Division Multiplexing (OFDM) modulation and detection, and it is the modulation of choice in many emerging wireless communications standards at the time of writing. This chapter will present and analyse OFDM.


5.2 Circulant matrix theory

OFDM is based on the properties of circulant matrices. A circulant matrix C is a matrix built up from only n elements c_0, c_1, …, c_{n−1}. It has the special structure

C = [ c_0      c_{n−1}  c_{n−2}  …  c_1
      c_1      c_0      c_{n−1}  …  c_2
      c_2      c_1      c_0      …  c_3
      …        …        …        …  …
      c_{n−1}  c_{n−2}  c_{n−3}  …  c_0 ]    (5.1)

Circulant matrices form a commutative algebra: for any two given circulant matrices A and B, the sum A + B is circulant, the product AB is circulant, and AB = BA. A key property of any circulant matrix is that the eigenvectors of a circulant matrix of given size are merely the columns of the discrete Fourier transform matrix of the same size. Consequently, the eigenvalues of a circulant matrix can be readily calculated by a Fast Fourier Transform (FFT) of c. The discrete Fourier transform matrix F is the N × N matrix with entries

F_{m,n} = (1/√N) e^{−j 2π m n / N},  m, n = 0, 1, …, N − 1.    (5.2)

Since the eigenvectors of any circulant matrix are simply the columns of the matrix F, any circulant matrix can be written or factorized as

C = F^H Λ F    (5.3)

where Λ is a diagonal matrix with the diagonal vector containing the eigenvalues; these are just equal to the FFT of c. H indicates the Hermitian transpose. Finally, note that

F F^H = I    (5.4)

where I is the identity matrix. F is thus perfectly orthogonal to itself, i.e. its Hermitian transpose is also its inverse. This is a property unique to the FFT matrix.
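The factorization C = F^H Λ F of equation (5.3) can be checked numerically. The sketch below builds the DFT matrix and a small circulant matrix from an arbitrary vector c, takes Λ = diag(DFT(c)), and verifies that F^H Λ F reproduces C and that F F^H = I (pure-Python matrix helpers are used, so no toolbox is assumed; the vector c is arbitrary).

```python
import cmath

N = 4
c = [1.0, 0.6, 0.3, 0.1]               # arbitrary first column of C

# DFT matrix F with entries (1/sqrt(N)) e^{-j 2 pi m n / N}, equation (5.2)
F = [[cmath.exp(-2j * cmath.pi * m * n / N) / N ** 0.5 for n in range(N)]
     for m in range(N)]
FH = [[F[n][m].conjugate() for n in range(N)] for m in range(N)]  # Hermitian transpose

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

# circulant matrix: entry (i, j) is c[(i - j) mod N], equation (5.1)
C = [[c[(i - j) % N] for j in range(N)] for i in range(N)]

# eigenvalues = DFT of c; Lambda is the corresponding diagonal matrix
lam = [sum(c[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
       for k in range(N)]
Lam = [[lam[i] if i == j else 0.0 for j in range(N)] for i in range(N)]

recon = matmul(matmul(FH, Lam), F)       # F^H Lambda F, equation (5.3)
err = max(abs(recon[i][j] - C[i][j]) for i in range(N) for j in range(N))
I_err = max(abs(matmul(F, FH)[i][j] - (1.0 if i == j else 0.0))
            for i in range(N) for j in range(N))
print(err < 1e-9, I_err < 1e-9)          # -> True True
```

This decomposition is exactly what the OFDM receiver exploits later in the chapter to diagonalise the channel.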

5.3 The Transmitter for OFDM systems

5.3.1 Cyclic time domain multipath propagation

In OFDM we want to exploit the nice properties of circulant matrix theory. To do that, we need to make the multipath propagation cyclic, which we achieve by changing the transmission format to a cyclic one as indicated in Figure 5.1. The baseband model of multipath propagation as we are used to it is given by

r[n] = Σ_{i=0}^{L} h[i] d[n − i] + n_s[n],    (5.5)


[Figure 5.1 shows the transmitted frame d_1, d_2, d_3, …, d_{n−P−1}, …, d_n, with the last P symbols copied to the front of the frame, where the copy is known as the "cyclic prefix".]

Figure 5.1: The OFDM transmitter frame format making the multipath propagation cyclic.

where h is the time domain channel impulse response as estimated by the receiver in the normal manner (see next chapter), n_s the thermal noise sample at time n, and d the transmitted symbols. If the transmission burst is cyclic as in Figure 5.1, then the received frame can be written in circulant matrix form as

[ r[1] ]   [ h_0  0    0    …  h_L  …  h_2  h_1 ] [ d[1] ]   [ n_s[1] ]
[ r[2] ] = [ h_1  h_0  0    …  0    h_L …  h_2  ] [ d[2] ] + [ n_s[2] ]    (5.6)
[  ⋮   ]   [ h_2  h_1  h_0  …  0    …  h_L h_3  ] [  ⋮   ]   [   ⋮    ]
[ r[n] ]   [ 0    0    0    …  h_L  …  …   h_0  ] [ d[n] ]   [ n_s[n] ]

Clearly, this baseband model is cyclic, since the matrix H is circulant. OFDM modulation views the vector d as the inverse FFT of the modulated symbols from the modulator, denoted D. Using the inverse fast Fourier transform matrix F^H we may write d as

d = F^H D.    (5.7)

So in an OFDM system the transmitted data d is formed by taking an inverse FFT of the symbols from the modulator - a key difference from other methods, and the essential idea that removes the ISI with trivial complexity in the receiver. Of course the other key idea was to make the propagation cyclic by prepending the frame with the cyclic prefix. The cyclic prefix must be longer than the length of the CIR vector h.
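The effect of the cyclic prefix can be checked directly: with a prefix at least as long as the channel memory, linear convolution over the frame body equals the circular convolution H d of equation (5.6). A minimal sketch with assumed channel taps and data follows.

```python
# Check that a cyclic prefix turns linear convolution into circular convolution.
h = [1.0, 0.5, 0.2]                 # assumed CIR (3 taps -> channel memory 2)
d = [1, -1, 1, 1, -1, 1, -1, -1]    # frame body of n symbols
P = len(h) - 1                      # cyclic prefix length >= channel memory
tx = d[-P:] + d                     # prepend the last P symbols (Figure 5.1)

# linear (dispersive) channel: r[k] = sum_i h[i] tx[k-i]
r = [sum(h[i] * tx[k - i] for i in range(len(h)) if 0 <= k - i < len(tx))
     for k in range(len(tx) + len(h) - 1)]
body = r[P:P + len(d)]              # receiver discards the prefix samples

# circular convolution H d, equation (5.6)
circ = [sum(h[i] * d[(k - i) % len(d)] for i in range(len(h)))
        for k in range(len(d))]

print(max(abs(a - b) for a, b in zip(body, circ)) < 1e-12)   # -> True
```

This is why the prefix must be at least as long as the channel memory: a shorter prefix leaves the first body samples contaminated by the previous frame, and the circulant model no longer holds.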

5.4

OFDM receiver, i.e. MAP detection

The received vector in the receiver from the matched filter and sampler pair is denoted r; r[1] corresponds to d[1] in Figure 5.1 and so on, and is corrupted by ISI and thermal noise. Hence we may write

r = H d + n_s,   (5.8)

where the matrix H is a circulant matrix constructed using h on the rows, as defined above in the transmitter. Our job is to estimate the most probable D given r, a theme that should be familiar by now.


5.4.1

MAP detection with trivial complexity

MAP detection with trivial complexity, in spite of the channel impulse response having L taps, was our objective in OFDM. Let us now see how that is possible. Recall that

d = F^H D,   (5.9)

and by substituting this into equation (5.8) we find

r = H F^H D + n_s.   (5.10)

Now taking the FFT of both sides we may write

F r = F H F^H D + F n_s,   (5.11)

which may be written (using the decomposition theorem for cyclic matrices, H = F^H \Lambda F) as

F r = F F^H \Lambda F F^H D + F n_s.   (5.12)

But we know that F F^H = I, so we may simplify this to

F r = \Lambda D + F n_s.   (5.13)

So we end up with the new equation given above to solve. The observed symbols in this equation are the FFT of the received symbols, the noise is still Gaussian because the FFT does not change the statistics, and, most importantly, the matrix \Lambda is diagonal, i.e. it contains no memory.1 In other words, it may be solved by applying straightforward symbol-by-symbol MAP detection without memory, which has a trivial complexity. The ISI was perfectly removed because of the cyclic matrix properties introduced by the modified Tx frame and the fact that the inverse FFT of the data was transmitted instead of the modulation symbols themselves.

5.4.2

Matlab demo

The reader may convince herself that the MAP detection is trivial by executing the Matlab code below, where there is no noise. Add your own noise to calculate the BER and see for yourself that it is the same as MLSE with Viterbi, but that the complexity is trivial.

clear all
L = 8;
z = sign(randn(1,L));            % random data
Z = ifft(z);                     % inverse FFT
Z = [1*Z(L-1:L) Z];              % add cyclic redundancy
ch = [1 0.6];                    % the multitap IR, put in what you want
R = conv(Z,ch);                  % the dispersive channel - no noise - add your own!
H = fft([ch zeros(1,L-2)]);      % the FD Ch est
z_h = sign(real(conj(H).*fft(R(3:L+2))./(conj(H).*H))); % MAP estimate at receiver is trivial
error = z - z_h;                 % the error - there is none!
std(error)
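For readers without Matlab, here is a NumPy translation of the same experiment (my addition; variable names follow the Matlab demo, and the fixed seed is added so the run is reproducible):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 8
z = np.sign(rng.standard_normal(L))          # random BPSK data
Z = np.fft.ifft(z)                           # inverse FFT
Zc = np.concatenate([Z[-2:], Z])             # add cyclic prefix (last 2 samples)
ch = np.array([1.0, 0.6])                    # the multitap IR, put in what you want
R = np.convolve(Zc, ch)                      # the dispersive channel - no noise
H = np.fft.fft(np.concatenate([ch, np.zeros(L - 2)]))   # frequency-domain channel
# MAP estimate at the receiver is trivial: one FFT and a per-bin division
z_hat = np.sign(np.real(np.conj(H) * np.fft.fft(R[2:2 + L]) / (np.conj(H) * H)))
print(np.max(np.abs(z - z_hat)))             # the error - there is none
```

Note the indexing: R[2:2+L] is the Python (0-based) equivalent of Matlab's R(3:L+2), i.e. the cyclic portion of the received burst after the prefix transient.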
1 \Lambda is a diagonal matrix that contains N values, given by the DFT of h, where h with L taps is zero-padded to length N.


5.5

Assignment

Add noise at the correct Eb/N0 to the demo code and verify that the BER is the same as what Viterbi attains for any channel with L taps. Be careful with the noise energy in OFDM; it needs some normalisation because of the FFT operators.


Chapter 6

Channel Estimation
6.1 Introduction

Channel estimation is the first task in the receiver shown in Figure 6.1, as the RF channel impulse response at the symbol rate is unknown at the receiver. Thus even though the pulse shaping filter and anti-aliasing filters are known, the overall impulse response is not. This is an important realization, since it implies that we cannot design a receiver filter that is matched to the overall impulse response before estimation. The reader will recall that a matched filter is needed to achieve a maximum output Signal to Noise Ratio (SNR). Our approach will therefore be to estimate the overall impulse response, and then to design a prefilter (Figure 6.1) with the objective of maximising the SNR, a topic addressed in the next chapter.
[Figure: receiver signal chain - received samples r[n] → channel estimation (yielding c[n]) → prefilter → z[n] → soft bit detector → b[n] → de-interleaver → soft decision decoder → hard bits.]

Figure 6.1: The layout of a typical receiver.

In general the noise present at the output of the channel is coloured, and hence Maximum Likelihood estimation of the overall channel impulse response will require knowledge of the noise covariance function (we will prove this in later chapters). Since the noise covariance function is in general not known, it is convenient to deploy Least Squares (LS) estimation. LS estimation requires no statistical description of the overall impulse response, nor does it require statistical knowledge of the noise. The LS approach simply chooses the channel impulse response in such a way that the weighted errors between the given measurements and a linear model are minimised.

6.2

Optimum receiver filter and sufficient statistics

After suitable RF electronics have been utilised in the receiver front end, we receive a baseband analogue signal r(t). Since transmission was performed in the form of data bursts as discussed above, we will have a finite number of received samples available after sampling r(t), corresponding to a burst. These we denote r[n], where n denotes discrete time, and they may be used for digital processing in a digital signal processor. We now study the form of the optimal receiver, specifically the receiver filter and sampling rate. Given an overall impulse response c(t) that is time invariant over the duration of the burst, we will limit this study to a short period of time where the impulse response c(t) remains unchanged. Imagine Z discrete symbols s[n] are transmitted with symbol period Ts during this time; then the received signal will be
r(t) = \sum_{n=1}^{Z} s[n]\, c(t - nT_s) + n(t).   (6.1)

Now let us expand the signal r(t) in terms of a complete orthonormal basis with basis functions denoted \varphi_k(t). Hence we have

r(t) = \lim_{N \to \infty} \sum_{k=1}^{N} r[k]\, \varphi_k(t).   (6.2)

Since \langle \varphi_k(t), \varphi_{k+i}(t) \rangle = \delta(i), we may write

r[k] = \langle r(t), \varphi_k(t) \rangle   (6.3)

via the projection theorem. Hence r[k] can be written as

r[k] = \sum_{n=1}^{Z} s[n] \langle c(t - nT_s), \varphi_k(t) \rangle + \langle n(t), \varphi_k(t) \rangle = \sum_{n} s[n]\, c[k-n] + n_s[k].   (6.4)

Assuming the noise sequence n_s[k] is Gaussian and white, the joint probability density function of the variables r = \{r[1], r[2], \cdots, r[N]\} conditioned on the transmitted symbols s is

p(r|s) = \left( \frac{1}{2\pi N_0} \right)^{N/2} \exp\left( -\frac{1}{2N_0} \sum_{k=1}^{N} \Big| r[k] - \sum_{n} s[n]\, c[k-n] \Big|^2 \right).   (6.5)

We may now take the limit as N approaches infinity, and write the logarithm of p(r|s) (the log likelihood) as

M(s) = -\int \Big| r(t) - \sum_{n}^{Z} s[n]\, c(t - nT_s) \Big|^2 dt.   (6.6)

Expanding and integrating we find that

M(s) \propto 2\,\mathrm{Re}\Big\{ \sum_{n} s^*[n]\, z[n] \Big\} - \sum_{n}\sum_{m} s^*[n]\, s[m]\, x[n-m],   (6.7)

where

z[n] = \int r(t)\, c^*(t - nT_s)\, dt   (6.8)

and

x[n] = \int c^*(t)\, c(t + nT_s)\, dt.   (6.9)

We therefore conclude that passing r(t) through a filter matched to c(t) and then sampling every Ts yields samples z[n] that form a set of sufficient statistics for detecting s. Hence in theory an optimal receiver filter exists, namely the matched filter. However, there is a problem constructing this filter in practice. Although the transmission and receiver filters are known, the overall CIR c[n] is not known a priori at the receiver, because the RF channel itself causes fading which is unpredictable in most cases. We thus propose to use a fixed receiver filter with a bandwidth chosen according to the transmitted bandwidth or other system constraints; it is not chosen to be a matched filter - that will be done in the prefilter. What is important is that the receiver filter causes as small an increase in the length of the overall CIR as possible. In later chapters it will become evident that the length of the CIR determines the complexity of the optimal detector, and we may therefore choose the receiver filter accordingly. One such filter is the raised-cosine filter, which will not add to the length of the overall CIR. The task of matched filtering is given to the prefilter: after the overall CIR has been estimated, a suitable matched digital filter may be designed based on this estimate. This is indicated in Figure 1.4. However, the prefilter also has the task of changing the phase response of the overall CIR so that the leading taps become dominant, for reasons that will become apparent in later chapters. We will also whiten the additive noise with the aid of the prefilter, as this will simplify the detection process. The channel is called dispersive or frequency selective if the sampled CIR c[i] is non-zero for i > 0. Each such term represents interference of the transmitted signal s[n] with itself, because the overall channel has memory. This is frequently called inter-symbol interference (ISI), and is quite common in practical communication systems.
The output of the prefilter thus yields samples z[n] that form sufficient statistics for detection. Chapter 7 will address the design of such a filter. This chapter now addresses the channel estimation problem, the first stage of the receiver.
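The discrete-time consequence of (6.8) and (6.9) can be checked numerically. In the Python sketch below (my addition; the pulse shape, symbols and oversampling factor are invented for illustration), an oversampled transmitted signal is passed through the matched filter c*(-t), and the symbol-rate samples z[n] are verified to be exactly the symbols weighted by the pulse autocorrelation x[n-m], i.e. ISI enters only through x:

```python
import numpy as np

os = 4                                          # samples per symbol (assumed)
c = np.array([0.2, 0.6, 1.0, 0.6, 0.2, 0.1])    # hypothetical sampled overall pulse c(t)
s = np.array([1.0, -1.0, 1.0, 1.0, -1.0])       # transmitted symbols

up = np.zeros(len(s) * os)                      # upsample: one symbol every os samples
up[::os] = s
r = np.convolve(up, c)                          # received signal (noise-free)

mf = np.convolve(r, np.conj(c[::-1]))           # matched filter c*(-t)
x = np.convolve(c, np.conj(c[::-1]))            # pulse autocorrelation; peak at len(c)-1

peak = len(c) - 1
z = mf[peak::os][:len(s)]                       # sample at the symbol rate

# z[n] = sum_m s[m] * x[peak + (n-m)*os]: the discrete form of (6.7)-(6.9)
expected = [sum(s[m] * x[peak + (n - m) * os]
                for m in range(len(s)) if 0 <= peak + (n - m) * os < len(x))
            for n in range(len(s))]
print(np.allclose(z, expected))                 # True
```

The check holds by linearity; adding white noise to r simply adds a coloured noise term to z, which is why the detector metric (6.7) involves x[n-m].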

6.3

The linear model

Before we proceed to the formulation of the channel estimation problem, we need to lay down the foundations of least squares estimation, in the form of the linear model. The linear model says that observations r = \{r[0], r[1], \cdots, r[Q]\}^T consist of a signal component d = \{d[0], d[1], \cdots, d[Q]\}^T plus an error component n = \{n[0], n[1], \cdots, n[Q]\}^T:

r = d + n.   (6.10)

We now postulate a model, in fact a linear model, that obeys the equation

d = Hc,   (6.11)

where H is a matrix and c is a parameter vector c = \{c[0], c[1], \cdots, c[P]\}^T, with P possibly more or less than N. Given the observations r, under the linear model, we need to estimate c. Thus we have an equation-error model

r = Hc + n.   (6.12)

The matrix H is composed of columns h_n and we may write H = \{h_1, \cdots, h_P\}. Each column vector is a mode of the signal d, and the signal d consists of a linear combination of these modes:
d = \sum_{n=0}^{P} c_n h_n.   (6.13)

It is precisely these combiner weights c_n that are the parameters we wish to estimate. In general we may have an under-determined case (P > N), a determined case (P = N) or an over-determined case (P < N). It is the last case we are particularly interested in here. It leads naturally to least squares fitting, or estimation, presented next.

6.4

Least Squares Estimation

We receive a vector of K+1 measurements from the channel, denoted r = \{r[n], r[n+1], \cdots, r[n+K]\}^T, and using (6.4), with the overall channel impulse response denoted by the vector c = \{c[0], c[1], \cdots, c[L]\}^T, we may set up a linear model as

r = Q c + n,   (6.14)

where n represents the noise. The matrix Q is fully populated by the transmitted training symbols. The length L+1 of the overall impulse response depends on the sampling rate, the RF channel model length, the pulse shaping filter and the anti-aliasing filters. Consider the matrix Q shown below:

Q = \begin{bmatrix}
t[n]   & t[n-1] & t[n-2] & \cdots & t[n-L]   \\
t[n+1] & t[n]   & t[n-1] & \cdots & t[n-L+1] \\
t[n+2] & t[n+1] & t[n]   & \cdots & t[n-L+2] \\
\vdots & \vdots & \vdots &        & \vdots
\end{bmatrix}.   (6.15)

t[n] represents the transmitted training symbol at sample n, and these are known at the receiver. For the matrix to have full rank, the columns of Q need to be linearly independent. However, these are just time-shifted versions of the training sequence. We thus conclude that we require the training sequence to have an autocorrelation function that approximates a Kronecker delta function, so that time-shifted versions of the training sequence are at least linearly independent. The reader may verify that this is the case for the training sequence given in Chapter 1. For a given estimate \hat{c} of c, the squared error between r and the linear model Q\hat{c} is

\epsilon^2 = \mathrm{tr}[(r - Q\hat{c})(r - Q\hat{c})^*] = \hat{n}^* \hat{n},   (6.16)

which is to be minimised to obtain the least squares estimate. Thus

\frac{\partial \epsilon^2}{\partial \hat{c}} = -2 Q^* (r - Q\hat{c}),   (6.17)

and equating the gradient to zero produces the estimate

\hat{c} = (Q^* Q)^{-1} Q^* r.   (6.18)

The matrix Q^* Q is called the Grammian matrix. It contains the correlations between time-shifted versions of the transmitted training sequence, and since that sequence has an autocorrelation function which approximates a Kronecker delta, the Grammian is highly diagonally dominant and invertible. The optimal training sequence will make Q^* Q = qI, with q a constant, and achieve the minimum mean squared error. However, using a modulation alphabet of fixed discrete size and short sequences, such a sequence does not exist, and a computer search for the best suboptimal sequence may be performed instead.
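A minimal numerical sketch of (6.14)-(6.18) in Python (my addition; the training sequence, channel taps and noise level below are invented for illustration and are not GSM values):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.sign(rng.standard_normal(26))        # BPSK training sequence (random here)
c = np.array([0.8, 0.5, 0.3])               # "true" overall CIR, L+1 = 3 taps
L = len(c) - 1

# steady-state received samples r[n] = sum_j c[j] t[n-j] + noise, n = L..25
r = np.convolve(t, c)[L:len(t)] + 0.05 * rng.standard_normal(len(t) - L)

# Q[i, j] = t[L+i-j]: columns are time-shifted versions of the training sequence
Q = np.column_stack([t[L - j:len(t) - j] for j in range(L + 1)])

# normal equations (6.18): c_hat = (Q* Q)^-1 Q* r
c_hat = np.linalg.solve(Q.conj().T @ Q, Q.conj().T @ r)
print(np.round(c_hat, 2))
```

With a random BPSK sequence the Grammian Q^* Q is already close to 24 I, which is why the estimate lands close to the true taps even at this noise level; a sequence optimised for its autocorrelation would do slightly better.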

6.5

A representative example

We focus on the GSM system, where the pulse shaping filter used in the transmitter is Gaussian, as shown in Figure 6.2, with four samples per symbol.
Figure 6.2: The Gaussian pulse shaping filter used in GSM.

The Gaussian pulse shaping filter causes inter-symbol interference across three consecutive transmitted symbols, apart from that introduced by the RF channel. For simplicity, we assume here that the RF channel has 3 taps at the symbol rate with settings \frac{1}{3}[1, 1, 1]. We have 26 measurements for r, taken at an SNR of 15 dB. We apply equation (6.18) to estimate \hat{c}, and its magnitude is shown in Figure 6.3 along with the z-plane representation. Here we clearly see that the overall impulse response is not minimum phase, as some nulls are located outside the unit circle. Moreover, the receiver anti-aliasing filter was not matched to the transmission pulse shaping filter, and even if we had selected to do that, the channel realization for this burst was unknown, so we would not have accomplished a maximum output SNR. This problem is addressed in a later chapter, where it is shown that a suitable prefilter is needed to achieve both a minimum phase impulse response and a maximum SNR.


Figure 6.3: The estimated impulse response \hat{c} and its z-plane representation.

Another important observation is that we used 26 observations to estimate only 6 taps (L + 1 = 6). This choice satisfies the need for an LS estimation rich in measurements while the parameters are few. The system is thus over-determined, which makes the estimation relatively immune to noise.

6.6

Generalised Least Squares Estimation

In previous sections we assumed that the noise covariance matrix V, with elements V_{ij} = \mathrm{Cov}[n_i, n_j] and n_i the noise sample at time i, was unknown. Hence we applied LS estimation in the form of the normal equations given by (6.18), since it does not require knowledge of V. However, after the CIR has been estimated via the LS method, we may take a second look at the baseband received model given by

r = Q c + n,   (6.19)

since, with r, Q and \hat{c} now available after the estimation, we may form an estimate of n using the training symbols. Thus we may in turn estimate V. The question now arises how we may improve the estimate of c by exploiting this further knowledge of the noise covariance.

6.6.1

The generalised least squares procedure

Imagine we have a model for an experiment containing Z realizations, in the form

Y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + n_i, \quad i \in [1, 2, \cdots, Z].   (6.20)

Denoting by \beta the vector of parameters to be estimated, and by y and N the vectors containing the observations and noise samples respectively, we have

y = X\beta + N,   (6.21)


where

X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1k} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2k} \\
\vdots & \vdots & \vdots &        & \vdots \\
x_{Z1} & x_{Z2} & x_{Z3} & \cdots & x_{Zk}
\end{bmatrix}   (6.22)

is the matrix of set points of the k input variables x_1, x_2, x_3, \cdots, x_k during the Z experiments. Let us assume the errors (noise) N have zero mean and covariance V; otherwise we do not assume or specify the noise pdf. Given the actual observed responses y from the Z experiments, the generalised least squares estimate (GLSE) is the \beta which minimises the quadratic form

(y - X\beta)^* V^{-1} (y - X\beta)   (6.23)

with respect to \beta. Differentiating and equating to zero produces the estimate

\hat{\beta} = (X^* V^{-1} X)^{-1} X^* V^{-1} y.   (6.24)

How do we justify using the quadratic form given above? We argue that if the errors (noise) in the model follow a multivariate Normal pdf with covariance V, then the log likelihood function of the parameters is given by the quadratic form.1 Thus if the errors were multivariate Normal, the GLSE would be an MLE, which is encouraging, since this was not the case for the LSE in the previous sections. Secondly, let us replace the estimates by the corresponding estimators

q = (X^* V^{-1} X)^{-1} X^* V^{-1} y.   (6.25)

Then we may prove that the estimator q is such that the mean square error between

L = \lambda_1 \beta_1 + \lambda_2 \beta_2 + \cdots + \lambda_k \beta_k \quad \text{and} \quad \hat{L} = \lambda_1 q_1 + \lambda_2 q_2 + \cdots + \lambda_k q_k   (6.26)

is minimised. Hence an arbitrary linear function of the parameters is estimated with minimum mean square error. These two properties serve as justification for using the GLSE rather than the LSE, and in practice a small improvement in the Bit Error Rate performance of the receiver is so obtained, especially at high SNR where the noise covariance is better estimated.
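The procedure (6.21)-(6.24) can be sketched in a few lines of Python. Everything below (the parameter values, the AR(1) noise model and its covariance) is invented for illustration; the point is only the mechanics of weighting by V^{-1}:

```python
import numpy as np

rng = np.random.default_rng(2)
Z, k, rho = 400, 3, 0.8
beta = np.array([1.0, -2.0, 0.5])           # "true" parameters (assumed)
X = rng.standard_normal((Z, k))             # set points of the k input variables

# coloured (AR(1)) noise: n[i] = rho*n[i-1] + e[i]
e = rng.standard_normal(Z)
n = np.zeros(Z)
for i in range(Z):
    n[i] = rho * n[i - 1] + e[i] if i else e[i]
y = X @ beta + n

# covariance of stationary AR(1) noise: V[i,j] = rho^|i-j| / (1 - rho^2)
idx = np.arange(Z)
V = rho ** np.abs(idx[:, None] - idx[None, :]) / (1 - rho ** 2)
Vi = np.linalg.inv(V)

b_gls = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)   # GLSE, cf. (6.24)
b_ls = np.linalg.lstsq(X, y, rcond=None)[0]           # ordinary LSE for comparison
print(b_gls, b_ls)
```

Both estimators are unbiased here; the GLSE merely has smaller variance because it de-weights the strongly correlated noise samples.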

6.7

Conclusion

This chapter presented the key ideas behind channel estimation. It was shown that the unknown RF channel impulse response prevents us from knowing the channel perfectly, and that estimation of the overall impulse response is inevitable. Moreover, the noise at the output of the channel is coloured, so that maximum likelihood estimation is not possible without knowledge of the noise covariance. We thus concluded that least squares estimation is a viable alternative, as it needs no statistical description of the noise.
1 Apart from an additive constant.

We showed why the training sequence transmitted at the transmitter must have desirable autocorrelation properties, and concluded that a prefilter must follow the channel estimator to achieve maximum output SNR.


6.8

Assignment

1) In the module Main.m these lines of code appear:

% Channel Estimates
[ir,noise_s0] = ch_est_1s(transmitted,rx_4s(1:4:length(rx_4s)-3),26,7);
rx = rx_4s(1:4:length(rx_4s)-3);

Remove this estimator based on LS theory, and create an estimator using the generalised LS estimator, using only the 26 training symbols located in the transmitted burst transmitted, that will estimate the overall channel impulse response (a 7-tap FIR filter) ir and feed it to the prefilter. You have as knowns the received sequence and the known pilot/training sequence contained in the transmitted burst (26 symbols). Then plot the raw BER vs. Eb/N0 for the TU channel model at 50 km/h with fading, and explain how you created the estimator. In your report, explain why the generalised LS estimator does not appear to produce better results than the standard LS estimator.


Chapter 7

Minimum Mean Square Error (MMSE) Estimation, Prefilter and Prediction


7.1 Introduction

In previous chapters we saw that for any given data burst we do not know the impulse response of the RF channel at the receiver. Hence the overall channel impulse response needs to be estimated, and consequently, before this estimation, a maximum output signal-to-noise ratio (SNR) cannot be achieved. We then indicated that an additional filter, referred to as a prefilter, is required to maximise the output SNR. We also showed that the estimated overall impulse response, denoted \hat{c}, is typically not in minimum phase form. In practical terms this means that the energy in the leading taps, say c[0], c[1], is not maximised. Later it will become evident that this requirement plays a key role in reducing the complexity of the detector. The last consideration is the fact that the noise present at the input of the prefilter is coloured. Although the noise covariance matrix can be estimated from the training sequence and can be used to modify the maximum likelihood metric in the detector, it is convenient to perform noise whitening in the prefilter as well. Thus the detector will be presented with a signal corrupted by white additive noise, which will be shown to simplify the optimal detector.

7.2

Minimum mean square error (MMSE) estimation

We studied LS estimators for channel estimation as we did not have available the noise covariance function after the receiver filter. However, given that the overall channel estimate is now available, as well as a set of training symbols during each data burst, we may estimate the noise covariance function and exploit that knowledge in an estimator. In general, given a set of measurements y and a vector x that we need to estimate, where y contains information about x, we know that when y and x are jointly normal the conditional mean of x is a linear function of the measurement y. This fundamental result has many consequences. For example, we may set up a linear minimum mean square error estimator of x where the estimate is forced to be a linear function of y, whether x and y are jointly normal or not. This step leads to the Wiener-Hopf equations, which are useful in designing MMSE based estimators.

7.2.1

The principle of orthogonality

We are given n random variables x_1, x_2, x_3, \cdots, x_n. The objective is to find n constants a_1, a_2, a_3, \cdots, a_n so that if we form a linear combination of these to estimate another random variable, say s, the estimation error we make is minimised. The question is how we can set up a general approach to make sure that we choose the best set of constants. One way would be to apply the LS method of the previous chapter. Using the least squares method we will not need the noise covariance function; but if we assume that we do have at least an estimate of the noise covariance function, can we exploit that information to make even better estimates? The answer is affirmative, and here is how we do it. First we set up the estimation formulation:

\tilde{s} = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n,   (7.1)

where \tilde{s} is the estimate of s. Let us denote the mean square of the error \epsilon = s - \tilde{s} by P, so P is given by

P = E\{ |s - (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n)|^2 \} = E\{ |\epsilon|^2 \}.   (7.2)

Note that this is now no longer merely a LS error: it involves the expectation operator E\{\cdot\}. The solution to choosing the best set of constants a is to invoke the orthogonality principle:

Theorem 1 The MS error P is a minimum if the constants a are chosen so that the error \epsilon is orthogonal to the data, i.e. E\{\epsilon\, x_i^*\} = 0 \;\forall i.

Application of the orthogonality principle then usually leads to a set of linear equations to be solved, yielding the optimal choice for a in the MS sense, hence the term Minimum Mean Squared Error (MMSE) criterion. The mathematical form of these linear equations is thus

E\{ [s - (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n)]\, x_i^* \} = 0, \quad i = 1, \cdots, n.   (7.3)

To develop the theory further, we now move on to matrix notation. We constrain the estimator of s to be a linear function of x, i.e. \tilde{s} = K x, and we invoke the orthogonality principle

E\{ (s - Kx)\, x^* \} = 0.   (7.4)

Hence we find that

R_{sx} - K R_{xx} = 0,   (7.5)

and the Wiener-Hopf solution for the linear estimator follows as

K = R_{sx} R_{xx}^{-1}.   (7.6)

This choice for K minimises the mean squared error among all linear functions of x; there is no better linear function we can choose in the mean squared error sense. Some texts call this the Yule-Walker solution (see Papoulis), some call it the Wiener-Hopf equations.
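A quick sanity check of (7.4)-(7.6) in Python: when K is built from the (sample) correlations, the residual error is numerically orthogonal to every component of the data, exactly as the orthogonality principle demands. The two-measurement setup below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
s = rng.standard_normal(N)                         # random variable to estimate
x = np.vstack([s + 0.5 * rng.standard_normal(N),   # two noisy measurements of s
               s + 1.0 * rng.standard_normal(N)])

Rxx = x @ x.T / N                                  # sample autocorrelation of the data
Rsx = (s[None, :] @ x.T) / N                       # sample cross-correlation
K = Rsx @ np.linalg.inv(Rxx)                       # Wiener-Hopf solution (7.6)

s_hat = (K @ x).ravel()
err = s - s_hat
print(err @ x.T / N)                               # E{err * x_i} = 0: orthogonality
```

Note also that the combined estimate beats either single measurement on its own, which is the whole point of weighting by the (co)variances.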

There are assumptions that need to be made regarding the invertibility of R_{xx}. What constraints must be put on the vector x to guarantee that R_{xx} is invertible? What does it imply for x in practical terms? The reader may now well think that since we don't know s, the analysis led us nowhere. Actually, it will become evident that it led us to an excellent estimation strategy; the trick is just to recognise that even though we don't know s, we do know the statistical properties of s. For example, R_{sx} is the cross-correlation matrix between the parameter to be estimated and the data, and this we may know a priori or be able to compute without knowing explicitly what s looks like. As always, the best way to understand theory is to use it, as the student needs to be able to recreate the material for themselves. Thus in the next section we will apply the MMSE estimation theory to the design of a linear filter known as a prefilter, and in the following section we will study channel tracking in GSM, used when mobiles move very fast. A prefilter transforms the impulse response of a system into its minimum phase form, meaning the impulse response after the prefilter is dominated by its leading taps. This makes detection far more efficient. The student must be able to explain this statement clearly.

[Figure: the error vector \epsilon drawn orthogonal to the subspace spanned by a_1 x_1 and a_2 x_2, in which the estimate \tilde{s} lies.]
Figure 7.1: The principle of orthogonality.

7.2.2

Geometrical interpretation of the principle of orthogonality

Assume an abstract space where the random variables span a subspace S_n. Then any linear combination of the random variables is a vector \tilde{s} in the subspace S_n. However, the random variable to be estimated, s, does not necessarily lie in this subspace. The projection of s onto the subspace S_n yields \tilde{s}, and the difference between s and \tilde{s} is the error vector \epsilon. To minimise the length of this error vector, we must make it orthogonal to the subspace, as depicted in Figure 7.1. This is the geometrical manifestation of the principle of orthogonality.


7.3

Applying minimum mean square error (MMSE) estimation: let us design a linear prefilter

7.3.1

Matched filter, minimum phase filter and spectral factorisation

Consider the deployment of a matched filter, a feed-forward filter, a feedback filter and a decision device as shown in Figure 7.2. We are interested in deriving the optimal form of the feed-forward filter. Later we may absorb the matched filter into the feed-forward filter, but for now, in order to develop the needed theory, let us keep the matched filter and the feed-forward filter separate.1 By feeding back known pilot symbols, ISI due to channel memory is perfectly cancelled. Thus an instantaneous decision on the current symbol may be made, and by minimising the MSE between the instantaneous decision and the known training (pilot) symbols, the energy in the leading taps after the prefilter is maximised, which is what we are aiming for.
[Figure: r[n] → channel estimation (c[n]) → matched filter c*[-n] → y[n] → FIR feed-forward filter f[n] → summing junction → z[n] → decision ŝ[n]; a feedback FIR filter with taps b[1], b[2], ... driven by the known pilot symbols s[n] is subtracted at the summing junction.]
Figure 7.2: The representation of the matched filter, the feed-forward filter and the feedback filter. The prefilter is the combination of the matched filter and the feed-forward filter. The feedback filter is the post-prefilter CIR.

These are qualitative statements, and we would like to formulate them in precise mathematical language. Denoting the post-prefilter CIR by the vector b, and the CIR prior to the prefilter by c, the following relationship will hold if the prefilter performs a minimum phase transformation:

\sum_{n=1}^{q} |b[n]|^2 \;\ge\; \sum_{n=1}^{q} |c[n]|^2 \qquad \forall\, q.   (7.7)

Hence the cumulative tap energy of a non-minimum phase CIR never increases faster than that of the corresponding transformed minimum phase CIR. In this way the detector may be based on the leading taps, where most of the energy is located, and it will become clear in the next chapter how this reduces detector complexity. To develop the exact form of the minimum phase transformation filter, we transform the problem to the z-domain. The sampled matched filter output in the z-domain is

Y(z) = C^*(z)\, C(z)\, D(z) + C^*(z)\, N(z),   (7.8)

1 It should be emphasised at this point in our discussion that using decision feedback in no way implies that our detector will need to deploy decision feedback. Decision feedback is used only in the synthesis of the optimal prefilter coefficients.

where D(z) represents the transmitted symbols in the z-domain, C(z) the estimated CIR in the z-domain, and N(z) the noise in the z-domain. The z-transform of the autocorrelation of c is non-negative, and hence there are no zeros in the power spectrum; we may therefore factorise

C^*(z)\, C(z) = G(z)\, G^*(1/z^*).   (7.9)

G(z) is canonical, i.e. it is causal, monic and minimum-phase.2 Hence G^*(1/z^*) is anti-causal, monic and maximum-phase. In practice we may find G by assigning the roots of C^*(z)C(z) with magnitude greater than 1 (outside the unit circle) to the feed-forward filter (which is anti-causal), and the rest to the feedback filter (which is causal; roots inside the unit circle imply stability). The feed-forward filter should be chosen to cancel the precursors of the impulse response after the matched filter3 with respect to the current time instant [n]. Therefore, theoretically, the feed-forward filter should have the z-domain form F(z) = 1/G^*(1/z^*). Hence after the feed-forward filter we have the z-domain representation

Z(z) = F(z)\, G(z)\, G^*(1/z^*)\, D(z) + F(z)\, C^*(z)\, N(z) = G(z)\, D(z) + \frac{C^*(z)}{G^*(1/z^*)}\, N(z),   (7.10)

so that the post-prefilter CIR is G(z), which is causal and minimum phase. However, F(z) = 1/G^*(1/z^*) is an Infinite Impulse Response (IIR) filter in the time domain. Thus if we wanted to approximate the IIR filter by an FIR filter, it would theoretically have to be infinitely long; in practice we truncate the FIR feed-forward filter to finite length, assuming the IIR response decays in time.

Now that we have determined that the feed-forward filter needs to be anti-causal and the feedback filter causal, we take a look at the noise after the application of the matched and feed-forward filters. In the z-domain the post-prefilter noise is given by \frac{C^*(z)}{G^*(1/z^*)} N(z), which is clearly non-white even if the noise fed to the prefilter was white to begin with. We thus identify the need to employ a post-prefilter noise whitening filter, since white noise will simplify the operation of the detector [2].4 We will now focus our attention on methods to find the coefficients of approximating FIR filters of finite length with the appropriate noise whitening properties.
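The root-assignment argument above can be illustrated directly: reflecting the zeros of C(z) that lie outside the unit circle to their conjugate-reciprocal positions yields a minimum phase response with the same magnitude spectrum, and with the cumulative tap energy property of (7.7). A Python sketch (my addition; the 3-tap CIR is an arbitrary non-minimum-phase example):

```python
import numpy as np

c = np.array([0.5, 1.0, 0.8])                # hypothetical non-minimum-phase CIR
r = np.roots(c)                              # zeros of C(z)
outside = np.abs(r) > 1

mp_roots = np.where(outside, 1.0 / np.conj(r), r)   # reflect outside zeros inward
# |e^{jw} - z0| = |z0| * |e^{jw} - 1/conj(z0)|, so rescale to keep |B| = |C|
b = c[0] * np.prod(np.abs(r[outside])) * np.real(np.poly(mp_roots))

print(np.round(b, 3))                        # minimum-phase equivalent of c
print(np.allclose(np.abs(np.fft.fft(b, 64)),
                  np.abs(np.fft.fft(c, 64))))        # same magnitude spectrum
print(np.all(np.cumsum(b**2) >= np.cumsum(c**2) - 1e-12))  # energy leads, cf. (7.7)
```

For this particular c both zeros lie outside the unit circle, so the minimum phase equivalent turns out to be the time-reversed tap vector.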

7.3.2

MMSE prefilter design

Let us apply the MMSE estimation theory to the design of a linear anti-causal filter known as a prefilter. The reasons for the filter being anti-causal were given in the previous section. A prefilter transforms the impulse response of a system into its minimum phase form, meaning the impulse response after the prefilter is dominated by its leading taps. If we plot the poles and nulls of an impulse response in the z-plane and the impulse response is in minimum phase form, there will be no nulls outside the unit circle.5
2 See the work by Forney 1973, reproduced in Proakis [2].
3 After the matched filter the impulse response has both pre- and post-cursor components, since c^*[-n] * c[n] is symmetric with respect to [n].
4 Actually, the noise whitening filter will also be absorbed into the prefilter, as explained in sections to follow.
5 This follows from the formal definition of minimum phase; see for example Papoulis.

[Figure: r[n] → anti-causal prefilter f → z[n] → summing junction → decision ŝ[n]; the feedback ISI, formed from past symbols through taps b[1], b[2], ..., is subtracted at the summing junction.]
Figure 7.3: The representation of the MMSE-DF prefilter.

One way of designing a minimum phase prefilter is to use the filter-detector combination shown in Figure 7.3, where we have now absorbed the matched filter, the feed-forward filter and the post-prefilter noise whitening filter into a single feed-forward filter. The filter-detector architecture has been carefully chosen to guarantee that decisions based on \tilde{s}[n] are made without delay, so that minimisation of the MSE of decisions tends to maximise the energy in the leading feedback tap b[0]. This tends to yield a post-prefilter impulse response that is leading-tap dominant and consequently minimum phase,6 a property that has been shown in earlier chapters to lead to reduced complexity soft bit detectors. Secondly, we have an anti-causal feed-forward filter, which is unconditionally stable; realisability of an anti-causal feed-forward filter is possible by a suitable delay in the receiver. Thirdly, after channel estimation we have available an estimate of the noise samples n_s[n], which we will assume to be a Gaussian distributed sequence with autocorrelation function

E\{ n_s[k]\, n_s^*[j] \} = \begin{cases} N_0\, x[j-k] & |k-j| \le L \\ 0 & \text{otherwise} \end{cases}   (7.11)

where E\{\cdot\} denotes the expectation operator. We select an anti-causal prefilter with coefficients f, and we may represent the filter operation on the received sequence r = \{r[n], r[n+1], \cdots\}^T as

z = f^T r.   (7.12)

The sequence z has an impulse response denoted by b, so that we may model the post-prefilter baseband signal as

z[n] = \sum_{i=0}^{L} b[i]\, s[n-i] + n_w[n],   (7.13)

where nw [n] is a whitened Gaussian noise process and s = {s[n], s[n1], }T the transmitted symbol sequence. A procedure for choosing f and b based on MMSE criteria now follows. We feedback past symbols to eliminate ISI using the impulse response b valid after the prelter as given in (7.13). Since we are
6 We cannot prove this assertion, since the FIR filters we are using in fact only approximate the theoretical IIR filters needed to guarantee minimum phase properties; however, in practice we find that the CIR b after the prefilter is in fact minimum phase if the lengths of the FIR prefilters are correctly chosen.

operating in the presence of noise, and we assumed b[0] = 1, ŝ[n] must be at least an estimate of s[n]. Additionally, s[n] can only take on a finite set of values (defined by the modulation alphabet). The decision device can then make a hard decision, since the correct past symbols s[n−1], s[n−2], ... are fed back. We argue that the best choice for f and b, assuming b[0] = 1, is the unique one that minimises the difference between the estimate ŝ[n] and s[n] in the minimum mean squared error (MMSE) sense. This is the best we can do to enable the decision device to make correct decisions in the presence of noise, as it yields estimates as close to s as is possible with a linear filter f. Mathematically, this choice is given by

min E{|ε[n]|²} = min E{|ŝ[n] − s[n]|²}   (7.14)

where ε[n] is the instantaneous error. From Figure 7.3 we may derive an expression for ε[n] as

ε[n] = w^T y − s[n]   (7.15)

where w and y are given by

w = {f[0], f[1], ..., f[P], b[1], ..., b[L]}^T   (7.16)

y = {r[n], ..., r[n+P], s[n−1], ..., s[n−L]}^T   (7.17)

The MMSE criterion can therefore be written as

min E{|w^T y − s[n]|²}   (7.18)

and the solution for w is given by the Wiener-Hopf equation as

E{yy^H} w = E{s*[n] y}   (7.19)

where * indicates the complex conjugate and ^H the Hermitian transpose. The solution w yields both the feedforward filter coefficients f and the feedback coefficients b jointly, with b[0] = 1 by definition. The feedback coefficients b are the desired impulse response to be used with the final prefiltered signal z in the soft bit detector. We now turn to the output SNR after the prefilter. As was stated in Chapter 1, one of the objectives of the prefilter is to maximise the output SNR (a property of a matched filter). We may define the output SNR as

SNR₀ = ||b||² / E{|ε|²}   (7.20)

Since we assumed b[0] = 1 in our synthesis procedure, the numerator energy is fixed, while the denominator energy is minimised. Thus the SNR is maximised. This implies that the prefilter acts as a matched filter, while also transforming the impulse response to have dominant leading taps.

7.3.3 Evaluating the matrix E{yy^H} and the vector E{s*[n]y}

The matrix E{yy^H} may be written in block form as

E{yy^H} = E [ Φ11  Φ12
              Φ21  Φ22 ]   (7.21)

We shall derive expressions for each Φ. Starting with Φ11 we have

Φ11 = [ r[k]r*[k]      r[k]r*[k+1]    ...  r[k]r*[k+P]
        r[k+1]r*[k]    r[k+1]r*[k+1]  ...  r[k+1]r*[k+P]
        ...
        r[k+P]r*[k]    r[k+P]r*[k+1]  ...  r[k+P]r*[k+P] ]   (7.22)

Φ12 is given by

Φ12 = [ r[k]s*[k−1]    r[k]s*[k−2]    ...  r[k]s*[k−L]
        r[k+1]s*[k−1]  r[k+1]s*[k−2]  ...  r[k+1]s*[k−L]
        ...
        r[k+P]s*[k−1]  r[k+P]s*[k−2]  ...  r[k+P]s*[k−L] ]   (7.23)

and Φ21 by

Φ21 = [ s[k−1]r*[k]    s[k−1]r*[k+1]  ...  s[k−1]r*[k+P]
        s[k−2]r*[k]    s[k−2]r*[k+1]  ...  s[k−2]r*[k+P]
        ...
        s[k−L]r*[k]    s[k−L]r*[k+1]  ...  s[k−L]r*[k+P] ]   (7.24)

Φ22 is given by

Φ22 = [ s[k−1]s*[k−1]  s[k−1]s*[k−2]  ...  s[k−1]s*[k−L]
        s[k−2]s*[k−1]  s[k−2]s*[k−2]  ...  s[k−2]s*[k−L]
        ...
        s[k−L]s*[k−1]  s[k−L]s*[k−2]  ...  s[k−L]s*[k−L] ]   (7.25)

The vector E{s*[n]y} is given by

E{s*[n]y} = E { s*[k]r[k], s*[k]r[k+1], ..., s*[k]r[k+P], s*[k]s[k−1], s*[k]s[k−2], ..., s*[k]s[k−L] }^T   (7.26)

We now turn to the individual terms of these matrices and this vector. We assume that the noise and data sequences are uncorrelated. First of all we require the term E{r[k]r*[k]}. Using (6.4) we may write
E{r[k]r*[k]} = E{ (Σ_{i=0}^{L} c[i]s[k−i] + ns[k]) (Σ_{i=0}^{L} c*[i]s*[k−i] + ns*[k]) }   (7.27)

hence

E{r[k]r*[k]} = Σ_{i=0}^{L} |c[i]|² + E{n[k]n*[k]}.   (7.28)

E{n[k]n*[k]} is the noise energy N0. The term E{r[k]r*[k+τ]} is given by

E{r[k]r*[k+τ]} = Σ_{i=0}^{L} c[i]c*[i+τ] + E{n[k]n*[k+τ]}   (7.29)

E{n[k]n*[k+τ]} is not zero, as the noise is not white; it is given by (7.11). The inclusion of the noise covariance enables noise whitening, i.e. the output noise sequence nw[n] will be white, with an autocorrelation function that approximates a Kronecker delta function. There are two more terms we need to evaluate, namely E{s*[k]r[k+τ]} and E{s*[k]s[k+τ]}. These are given by

E{s*[k]r[k+τ]} = c[τ]   (7.30)

and

E{s*[k]s[k+τ]} = δ[τ]   (7.31)

where we assumed that the variance of the training sequence is unity.
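Pulling (7.19)-(7.31) together, the synthesis procedure can be sketched numerically. The sketch below is ours, not code from the simulator: it assumes white noise (so the covariance of (7.11) reduces to N0 times a Kronecker delta), a perfectly known overall impulse response c, and the sign convention b = [1, −w_b] for reading the post-prefilter CIR out of the solved w; the toy channel is an arbitrary illustrative choice.

```python
import numpy as np

def mmse_dfe_prefilter(c, N0, P, L):
    c = np.asarray(c, dtype=complex)
    # Channel autocorrelation R_cc(tau) = sum_i c[i] c*[i+tau], cf. (7.29)
    def Rcc(tau):  # tau >= 0
        return np.sum(c[:len(c)-tau] * np.conj(c[tau:])) if tau < len(c) else 0.0
    # Phi11[i,j] = E{r[k+i] r*[k+j]} = R_cc(j-i) + N0*delta(j-i)  (white noise)
    Phi11 = np.array([[(Rcc(j - i) if j >= i else np.conj(Rcc(i - j)))
                       + (N0 if i == j else 0.0)
                       for j in range(P + 1)] for i in range(P + 1)])
    # Phi12[i,j] = E{r[k+i] s*[k-1-j]} = c[i+1+j], cf. (7.30)
    Phi12 = np.array([[c[i + 1 + j] if i + 1 + j < len(c) else 0.0
                       for j in range(L)] for i in range(P + 1)])
    Phi22 = np.eye(L, dtype=complex)            # E{s s*} = delta, cf. (7.31)
    A = np.block([[Phi11, Phi12],
                  [Phi12.conj().T, Phi22]])     # E{y y^H}, cf. (7.21)
    rhs = np.concatenate([np.pad(c, (0, max(0, P + 1 - len(c))))[:P + 1],
                          np.zeros(L, dtype=complex)])   # E{s*[k] y}
    w = np.linalg.solve(A, rhs)                 # Wiener-Hopf, cf. (7.19)
    f = w[:P + 1]                               # anti-causal feedforward taps
    b = np.concatenate([[1.0], -w[P + 1:]])     # post-prefilter CIR, b[0] = 1
    return f, b

c = np.array([0.5, 1.0, 0.4])                   # toy mixed-phase channel
f, b = mmse_dfe_prefilter(c, N0=0.01, P=8, L=2)
print(np.abs(np.roots(b)))                      # root magnitudes of b
```

With a sufficiently long feedforward filter and small N0, the magnitudes printed come out below one, i.e. the feedback polynomial is minimum phase, consistent with the discussion above.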

7.4 A representative example

We select the Typical Urban RF channel model from GSM, and we examine a single burst in which each tap undergoes Rayleigh fading. We select an SNR of 10 dB, with a Gaussian transmit pulse shaping filter as presented in Chapter 6. In the receiver we use a raised cosine filter with unity bandwidth and roll-off factor 0.5. The overall impulse response c is shown in Figure 7.4, and the impulse response after the prefilter in Figure 7.5. In the latter figure the MMSE-DF synthesized filter b is shown along with an LS estimate of b performed after the prefilter. It is clear that the prefiltered impulse response is minimum phase, and the leading taps are dominant. We cannot prove that the MMSE-DF synthesis procedure will always yield a minimum phase impulse response, but in practice it is frequently observed, because the synthesis procedure forces decisions without any delay, causing the leading taps to be dominant. A second interesting observation is that the estimated impulse response b indicates that b[0] is in fact smaller than unity. One may show that b[0] = f^T c, so the synthesis procedure is biased. Moreover, c itself is only an estimate of the actual overall impulse response.

[Figure: stem plot of tap setting [Volt] versus tap number, with real and imaginary parts shown.]

Figure 7.4: The overall impulse response c before the prefilter.

7.5 Stochastic processes and MMSE estimation

Many problems encountered in practice are of a stochastic nature. For example, in any receiver the local oscillator will not be perfectly tuned to the carrier frequency, and this causes a frequency offset. The frequency offset can be modelled in baseband by modifying the baseband signal representation as
r[n] = e^{j(ω₀ nT + φ)} Σ_{k=0}^{L} h[k] d[n−k] + ns[n].   (7.32)

where ω₀ and φ represent the offset and are stochastic processes: they vary with time, are unpredictable, and can assume a continuum of values. The question is, can we estimate such an offset, for example in a GSM burst, and once it has been estimated, can we track it over time by predicting it into the future? These questions were studied in a general framework by Norbert Wiener.⁷ In this section we treat the MMSE problem for the continuous case; generalisation to the discrete case is straightforward. Formally, we wish to estimate the present value of a stochastic process s(t) in terms of the values of another process x(t), specified over an interval a ≤ t ≤ b. The desired linear estimate ŝ(t) of s(t) is an integral

ŝ(t) = E{s(t) | x(ξ), a ≤ ξ ≤ b} = ∫_a^b h(α) x(α) dα   (7.33)

where h(α) is a function to be estimated. The orthogonality principle can be applied to the estimation error s(t) − ŝ(t), and we find

E{ [s(t) − ∫_a^b h(α) x(α) dα] x(ξ) } = 0,   a ≤ ξ ≤ b,   (7.34)

7 N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series, MIT Press, 1950.

[Figure: stem plot of tap setting [Volt] versus tap number, comparing the estimated IR b with the computed IR b; real and imaginary parts shown.]

Figure 7.5: The overall impulse response b after the prefilter.

which leads to

R_sx(t, ξ) = ∫_a^b h(α) R_xx(α, ξ) dα.   (7.35)

This integral equation is usually solved numerically. In the rest of this chapter we assume that all processes investigated are WSS and real; if they are complex, the results still hold, except that complex conjugates must be applied where appropriate.

7.5.1 Prediction

1) We wish to estimate the future value s(t + λ) of a stationary process s(t) simply in terms of its present value:

ŝ(t + λ) = E{s(t + λ) | s(t)} = a s(t)   (7.36)

What is the optimal choice for a, given the stated assumptions? Applying the orthogonality principle we find that

E{ [s(t + λ) − a s(t)] s(t) } = 0   (7.37)

yields

a = R(λ) / R(0).   (7.38)
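As a quick numerical sanity check of (7.38), the sketch below uses a discrete-time first-order autoregressive process (our own illustrative choice, not a model from the notes), for which R(1)/R(0) equals the AR coefficient ρ:

```python
import numpy as np

rng = np.random.default_rng(0)

# AR(1) process s[k] = rho*s[k-1] + w[k] has R(tau) = sigma_s^2 * rho^|tau|,
# so the one-step predictor of (7.38) is a = R(1)/R(0) = rho.
rho, n = 0.9, 100_000
w = rng.standard_normal(n)
s = np.zeros(n)
for k in range(1, n):
    s[k] = rho * s[k - 1] + w[k]

a_opt = rho                                        # a = R(1)/R(0)
mse_opt = np.mean((s[1:] - a_opt * s[:-1]) ** 2)   # approx. E{w^2} = 1
mse_naive = np.mean((s[1:] - s[:-1]) ** 2)         # "tomorrow = today"
print(mse_opt < mse_naive)                         # True
```

Note that for a process with a white spectrum, R(τ) is a delta and (7.38) gives a = 0: the correlation properties are the only thing the predictor has to work with.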

So it turns out that the optimal choice for a is based on the correlation properties of the process s(t). Isn't that illuminating? If s(t) were completely random, in the sense that it has a white spectrum, how would R look, and how would the predictor be able to predict the future value? Can we thus predict the future of such a process? 2) Let us now assume we want to improve on this prediction by using not only the present value but also the first derivative s′(t). How would we go about reformulating the prediction equation? We now have two unknowns, and set up the predictor linearly as

ŝ(t + λ) = E{s(t + λ) | s(t), s′(t)} = a₁ s(t) + a₂ s′(t).   (7.39)

Again we ask: what are the optimal choices for a₁ and a₂? Again applying the orthogonality principle, s(t + λ) − ŝ(t + λ) ⊥ s(t), s′(t), and given the identities R′(0) = 0, R_ss′(τ) = −R′(τ) and R_s′s′(τ) = −R″(τ), we find

a₁ = R(λ) / R(0),   (7.40)

and

a₂ = R′(λ) / R″(0).   (7.41)

Now isn't that illuminating? The optimal choices in this case, where the derivative is used as well as the current value of the process, involve the derivatives of the correlation functions. For most processes s(t), R_ss(τ) has a decaying form, i.e. it is dominated by the values at τ = 0. So if λ becomes large, which of the two terms will dominate? Study an example of your choice. 3) Next we study interpolation. This is an important topic, and we will apply it to the GSM simulator and a real world problem to make the point; in the process the student will re-create the material for herself. Formally, we wish to estimate the value s(t + λ) of a process s(t) in terms of its 2N + 1 samples that are nearest to t, as shown in Figure 7.6.

[Figure: a process s(t) sampled at times t − NT, ..., t, ..., t + NT with sample spacing T; the value s(t + λ) is to be estimated between samples.]

Figure 7.6: The interpolation problem.

The key point one needs to get here is that all of the 2N + 1 nearest points known/given to us may contribute to the optimal estimate of s(t + λ). So the most general linear interpolator we can set up is

ŝ(t + λ) = Σ_{k=−N}^{N} a_k s(t + kT),   0 < λ < T.   (7.42)

Are we in agreement? The reader must be convinced that there is no better linear estimator we can come up with. Now that we have formulated the optimal estimator, how do we decide what the values of a_k must be? To answer this question, think about what we would do using a LS approach. Is that the best we can do? What if we knew the correlation properties of the process s(t) - can we do better using that information? That is exactly what MMSE estimation will do for us: it will exploit the correlation properties of the process s(t). Let us develop the best values of a_k in terms of the MMSE approach.

First, using the orthogonality principle we find

E{ [s(t + λ) − Σ_{k=−N}^{N} a_k s(t + kT)] s(t + nT) } = 0,   |n| ≤ N   (7.43)

from which it follows that

Σ_{k=−N}^{N} a_k R(kT − nT) = R(λ − nT),   −N ≤ n ≤ N.   (7.44)

This constitutes a set of linear equations that can be solved for the a_k, and these will be optimal in the MMSE sense. Again, the key ingredient of MMSE estimation over LS estimation is that the former uses the correlation properties of the process s(t).
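The normal equations (7.44) form a small linear system. The sketch below assumes a Gaussian-shaped correlation model purely for illustration; the true fading correlation is exactly what assignment 2 asks you to measure from the simulator.

```python
import numpy as np

def mmse_interpolator(N, T, lam, R):
    """Solve sum_k a_k R(kT - nT) = R(lam - nT) for |n| <= N, cf. (7.44)."""
    ks = np.arange(-N, N + 1)
    A = np.array([[R((k - n) * T) for k in ks] for n in ks])  # R(kT - nT)
    rhs = np.array([R(lam - n * T) for n in ks])              # R(lam - nT)
    return np.linalg.solve(A, rhs)                            # the a_k

R = lambda tau: np.exp(-(tau / 2.0) ** 2)    # assumed correlation model
a = mmse_interpolator(N=3, T=1.0, lam=0.4, R=R)

# Sanity check: interpolating exactly at a sample point (lam = 0)
# must simply return that sample, i.e. a = [0,...,0,1,0,...,0].
a0 = mmse_interpolator(N=3, T=1.0, lam=0.0, R=R)
print(np.allclose(a0, np.eye(7)[3], atol=1e-6))   # True
```

The same routine, fed the measured fading correlation, gives the CIR tracker needed for assignment 2(g).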

7.6 Assignments

1. This is a first class major assignment - be prepared to spend significant time on it.
a. Download and read the work by J. Cioffi and N. Al-Dhahir on MMSE prefilter design (from IEEE Xplore), in addition to the material presented in these notes. Then create your own prefilter function for the simulator, and show some results that indicate yours produces the same results as mine that came with the simulator.
b. Next, perform a LS optimization of the minimum phase prefilter in the GSM simulator (http://opensource.ee.up.ac.za/): write new code based on LS and replace the built-in prefilter function with your LS version.⁸
c. Plot BER vs. thermal noise for both the MMSE and LS prefilters.
d. Plot BER vs. adjacent channel interference for both the MMSE and LS prefilters with thermal noise insignificant (Eb/No = 100 dB).
e. Plot BER vs. co-channel interference for both the MMSE and LS prefilters with thermal noise insignificant (Eb/No = 100 dB). You need to explain the results you will find.⁹

2. This is a first class major assignment - be prepared to spend significant time on it.
a. In the GSM simulator, you will notice that the GSM burst (see earlier chapters in these notes for an explanation) has been set up so that the pilot symbols (26 of them) are placed in the middle of the burst - and that is where the channel impulse response (CIR) is estimated. The estimated CIR is then used anywhere in the burst for detection (equalization).
b. The simulator uses GSM in the frequency hop mode, so that the CIR for each burst is totally different. Modify the simulator to disable frequency hop; for that you will need to save the state of the fading simulator and feed it as an additional input to subsequent calls to the fading simulator, so that the fading becomes continuous over time (multiple bursts). Verify this by plotting the fading over many bursts, and make sure it is continuous. Figure 7.7 depicts the difference between GSM with and without frequency hop.
c. Set the channel selection to flat fading, i.e. make the dispersion profile have a single tap. If the simulator doesn't support it, add it.
d. Numerically determine the correlation properties of the fading for each tap after the prefilter. Tip: we call the fading Rayleigh fading.
e. Now set the simulator input file to a very high Doppler speed, equivalent to 250 km/h. This is the sort of speed trains in Europe regularly achieve, and GSM networks are supposed to cover moving trains.
f. Simulate the raw BER for flat fading at 250 km/h over a range of SNR values. Confirm for yourself that the CIR is now varying over the burst, and that the assumption made by the equalizer that the CIR is static is invalid.
g. Now use MMSE estimation theory to estimate/predict the CIR over the burst (i.e. track it

8 If you want to, you may use the work by Olivier and Xiao (J.C. Olivier and C. Xiao, Joint Optimization of FIR Pre-Filter and Channel Estimate for Sequence Estimation, IEEE Trans. Communications, Volume 50, Issue 9, Sept. 2002, pages 1401-1404) on LS prefilter design, available from IEEE Xplore. The results in that paper you will not be able to reproduce, since it used oversampling, while the simulator for this assignment is based on 1 sample per symbol (S/S).
9 Hint - to explain the results for co-channel interference, you may need to read up on cyclostationary noise (such as co-channel interference) and think about the impact that has on the LS prefilter, which whitens the noise/interference without knowledge of the noise covariance matrix. Conversely, for adjacent channel interference the noise covariance matrix used in the MMSE design is only estimated - what does that imply?

[Figure: without frequency hop the fading process s(t) is continuous over time, with values estimated from the pilots in each burst (one block of 4T per burst); with frequency hop the fading is independent from burst to burst.]

Figure 7.7: GSM channel fading with and without frequency hop.

over the burst) using enough samples from previous bursts; you decide how many samples you need to use.
h. Now use this predicted CIR, which is a function of time, to equalize - you need to change the detector code to incorporate this time-varying CIR.
i. Re-simulate the raw BER - verify that you achieve significant gain at 250 km/h.
j. Simulate at 3 km/h, and verify that you do not lose performance relative to the static assumption case.

3. This is a first class major assignment - be prepared to spend significant time on it. Repeat Assignment 2, but use a LS approach. Compare the results using LS estimators to those using MMSE estimators. Explain any differences in results. Tip: LS estimators can be made very sophisticated - for example see the work by Olivier and Xiao.¹⁰

10 On IEEE Xplore, or the journal version: C. Xiao and J.C. Olivier, Nonselective fading channel estimation with non-uniformly spaced pilot symbols, International Journal of Wireless Information Networks, vol.7, no.3, pp.177-185, 2000.

Chapter 8

Information Theory and Error Correction Coding

The modern marvel of digital communication systems would not be possible without the concepts developed in information theory and the theory of error correction coding. Information theory was developed by Claude Shannon in the late 1940s, and spawned an entire new discipline, with its own industry, which is known today as digital communication systems. Information theory promised us the possibility of error free communications, given that certain conditions are met. The fundamental property of a channel that limits the rate and quality of communication is called the capacity, denoted C. Shannon proposed that a channel's capacity to carry information must be measured in bits, since the bit is the essence of information. So C has units of bits/second/Hz. Shannon proposed his famous noisy channel theorem as follows.

Theorem 2 (Shannon's noisy channel theorem). If the desired rate of communication is R bits/second/Hz, then an encoding and decoding device exist in principle such that, if the encoding device is used in the transmitter and the decoding device in the receiver, the communication system will be able to achieve an arbitrarily small bit error rate, if and only if the rate R < C, the capacity of the channel.

Shannon did not tell us how to design a suitable encoder and decoder, hence the science of error correction coding developed after the seminal work of Shannon, and is still continuing today. Recently, Low Density Parity Check (LDPC) codes were rediscovered, having been forgotten after their invention in 1962 by R. Gallager (at which time they were un-implementable). The LDPC codes are in the class of linear block codes, and have been shown to reach the limits set by Shannon within fractions of a dB on the static Gaussian channel. Many wireless standards still employ so-called convolutional codes for their ease of operation and decoding, and remarkable resistance to fading. GSM is an example of such a standard, where a convolutional code is used with a constraint length of 7. This chapter presents an introduction to the theory of linear block codes, and a relatively detailed account of convolutional codes with a state of the art decoder, based on the method of Viterbi (min-sum).
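Shannon's theorem turns link design into a budget check: compute C, compare R. The sketch below uses the AWGN capacity formula C = log2(1 + SNR), a standard result that is not derived in these notes:

```python
import math

# AWGN channel capacity C = log2(1 + SNR) in bits/second/Hz.
# Shannon's theorem: rate R is achievable with arbitrarily small
# BER if and only if R < C.
def capacity_awgn(snr_db):
    snr = 10.0 ** (snr_db / 10.0)      # dB -> linear
    return math.log2(1.0 + snr)

C = capacity_awgn(10.0)                # about 3.46 bits/s/Hz at 10 dB SNR
print(2.0 < C)                         # True: R = 2 bits/s/Hz is achievable
print(4.0 < C)                         # False: R = 4 exceeds capacity
```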

8.1 Linear block codes

All the work in this section assumes that the mathematics takes place over a finite field called GF(2), named after the French mathematician Galois.

8.1.1 Repetition codes

The most intuitive way of encoding information at the transmitter is simply to send the same information more than once - a repetition code. The receiver's decoding function then simply has the job of using the multiple copies to figure out what the transmitter intended to communicate. Imagine we have a source of information that we wish to send over a channel, say x = [101]. Let us denote the repetition encoder by a matrix G. The encoded message sent by the transmitter, according to 3 times repetition, is given by z as

z = [101 101 101].   (8.1)

How can we transform a vector x to a vector z that is 3 times as long? With a non-square matrix denoted G - hence mathematically

z = x G   (8.2)

How would G look in this particular case, where we have a 3 times repetition code? The answer is in fact trivial and given by

G = [ 1 0 0 1 0 0 1 0 0
      0 1 0 0 1 0 0 1 0
      0 0 1 0 0 1 0 0 1 ]   (8.3)

Verify that the matrix G given above does the job it proclaims to do.

8.1.2 General linear block codes

Obviously, we can make weird codes, different from the repetition G given above, by changing the generator matrix G. Do they achieve better BER than the repetition code? Generally yes - repetition codes are not good codes. Imagine a random code for the same information sequence x, say

Gr = [ 1 1 0 1 1 1 0 0 1
       0 1 0 0 0 1 1 0 1
       1 1 1 1 1 0 1 1 1 ]   (8.4)

Using elementary operations (i.e. multiplying rows by scalars, adding rows and permuting columns), we may reduce this matrix to the systematic form

Gc = [ 1 0 0 p11 p12 p13 p14 p15 p16
       0 1 0 p21 p22 p23 p24 p25 p26
       0 0 1 p31 p32 p33 p34 p35 p36 ]   (8.5)

It is called systematic because the uncoded bits appear as-is in the coded bit string. Hence we may write G = [I | P]. P contains the parity check bits, and I is an identity matrix. The mathematical properties of the systematic form are identical to those of the original matrix, i.e. we did not add nor remove any information when reducing the matrix to systematic form. The columns of the parity check part must be linearly independent, otherwise the matrix is rank deficient. Finding the code that will produce the best BER at the receiver comes down to choosing the best parity check matrix. Many block codes have been invented over several decades, such as the Hamming, Reed-Solomon and BCH codes [2], but it turns out that the best block codes, able to achieve BER performance close to the Shannon limits, are simply systematic codes with randomly chosen parity check bits. However, to approach the Shannon limits the size of the matrix G must become very large (i.e. x must become very long), so that many bits are coded together in a large frame. In such a case the BER performance of the code approaches the Shannon limits, but the decoding complexity becomes prohibitively large, since the decoding problem for this case is NP complete. Generally the situation was thought to be hopeless, until Robert Gallager was able to show in 1962 that the decoding problem may be practically solvable if and only if the matrix P is sparse - for example, the columns of P contain only 3 ones (the rest are zeros) regardless of the size of P. These codes, called Low Density Parity Check codes, have in fact achieved the Shannon limit within a fraction of a dB using frames that contain 10⁵ bits, with decoding performed by the Pearl belief propagation algorithm.
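The reduction to systematic form described above is ordinary Gauss-Jordan elimination over GF(2), with a column swap whenever no pivot is available; a sketch (run here on a random 3x9 generator like the one in (8.4), with rows as we have transcribed them):

```python
import numpy as np

def to_systematic(G):
    """Reduce a full-rank binary generator matrix to systematic form
    [I | P] over GF(2), using row reduction and, when needed, column
    swaps.  Returns the systematic matrix and the column permutation."""
    G = np.array(G) % 2
    k, n = G.shape
    perm = list(range(n))
    for col in range(k):
        pivot = next((r for r in range(col, k) if G[r, col]), None)
        if pivot is None:                         # all-zero column: swap one in
            for c in range(col + 1, n):
                if any(G[r, c] for r in range(col, k)):
                    G[:, [col, c]] = G[:, [c, col]]
                    perm[col], perm[c] = perm[c], perm[col]
                    pivot = next(r for r in range(col, k) if G[r, col])
                    break
        G[[col, pivot]] = G[[pivot, col]]         # move the pivot row up
        for r in range(k):                        # clear the column (mod 2)
            if r != col and G[r, col]:
                G[r] = (G[r] + G[col]) % 2
    return G, perm

Gr = np.array([[1,1,0,1,1,1,0,0,1],
               [0,1,0,0,0,1,1,0,1],
               [1,1,1,1,1,0,1,1,1]])
Gc, perm = to_systematic(Gr)
print(Gc[:, :3])        # the 3x3 identity block of [I | P]
```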

82

8.1.3 Decoding linear block codes using the Parity Check matrix H

Decoding of linear block codes uses the parity check matrix H, which has the property that

G H^T = 0   (8.6)

where ^T denotes the transpose. For the systematic matrix G = [I | P], H is given by

H = [P^T | I].   (8.7)

Any codeword z that the transmitter transmits, computed as xG, is orthogonal to every row of the matrix H. Mathematically,

z H^T = 0.   (8.8)

We can use this fact in many ways. One way is to use it in the Pearl belief propagation algorithm, which produces near optimal results. However, to gain insight into the decoding process we will consider a sub-optimum decoding procedure here. First we recognize that the detector sends the decoder not only estimates ẑ of the detected bits, but also probability information about the reliability of those bit estimates. So we may use the hard bit estimates from the detector to test whether ẑ is a valid codeword, by computing (8.8). If we find that ẑH^T = 0, the detected codeword contains no errors, and the decoding job is completed, since the code is systematic (why is that so?). If we do not find zero, we look at the probability information also provided by the detector, and change the values of the bits with probabilities closest to 0.5, since those are most likely the incorrect bits. Then we test the modified vector ẑ by checking whether now ẑH^T = 0. If it is, decoding is done; else we change more bits with probability close to 0.5.

Decoding the block code

Let us now look at an example of sub-optimal decoding for the block code. The transmitter transmitted x, which contains 4 bits. The encoder uses a generator matrix given by

G = [ 1 0 0 0 1 0 1
      0 1 0 0 1 1 1
      0 0 1 0 1 1 0
      0 0 0 1 0 1 1 ]   (8.9)

What is the rate of the encoder? It produces 7 coded bits for z when the uncoded source vector x has 4 bits, so it is a rate R = 4/7 code. We may compute H as

H = [ 1 1 1 0 1 0 0
      0 1 1 1 0 1 0
      1 1 0 1 0 0 1 ]   (8.10)

The modulator used was a BPSK modulator, while the channel had 3 taps. The detector, based on a Viterbi algorithm with sub-optimal probability estimates, produced an estimate ẑ as

ẑ = [ 1    0    0    1    0    0    1
      0.9  0.9  0.9  0.51 0.58 0.8  0.7 ].   (8.11)

The first row contains the bit values, and the second row the probability of each decision being correct. The decoder in the receiver is given this matrix for ẑ and asked to determine what x was. First of all, we know that the vector x has to have 4 bits. To make sure that the hard decisions are not already correct (because if they were, then decoding is not needed) we compute

[1, 0, 0, 1, 0, 0, 1] H^T = [1 1 1] ≠ 0.   (8.12)

Clearly the detector produced hard bits that were wrong - if ẑ were without errors, the test above would have produced 0. To decode the received vector, we now determine the bit most likely to be in error, which we can see is bit number 4 of ẑ, since it has a probability of 0.51, the closest to 0.5 of all the probabilities. Hence we compute

[1, 0, 0, 0, 0, 0, 1] H^T = [1 0 0] ≠ 0   (8.13)

where bit 4 has been flipped. Again we did not find 0, so there is still an error; however, generally speaking there was some improvement, since fewer bits are in error. Let us change the next most likely bit to be wrong, that is bit 5 of ẑ, with a probability of 0.58. So, with bit 5 also flipped, we compute

[1, 0, 0, 0, 1, 0, 1] H^T = [0 0 0] = 0   (8.14)

and we conclude that the transmitter must have transmitted

z = [1, 0, 0, 0, 1, 0, 1],   (8.15)

so that

x = [1, 0, 0, 0]   (8.16)

because the code is systematic, and thus the first 4 bits are the source bits. Sub-optimal decoding is now completed. The reader may now appreciate what the probability information provided by the Viterbi (min-sum) detector is useful for - without it we would have had to try every possible combination of the 7 bits, a task that becomes impossible even for moderate sizes of G.
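The sub-optimal procedure of this example can be scripted directly; a sketch using the matrices and detector output of (8.9)-(8.11):

```python
import numpy as np

# (7,4) systematic code of (8.9)/(8.10)
G = np.array([[1,0,0,0,1,0,1],
              [0,1,0,0,1,1,1],
              [0,0,1,0,1,1,0],
              [0,0,0,1,0,1,1]])
H = np.array([[1,1,1,0,1,0,0],
              [0,1,1,1,0,1,0],
              [1,1,0,1,0,0,1]])

def decode(z_hat, p_correct):
    """Sub-optimal decoding: flip the least reliable bits (probability
    closest to 0.5) one at a time until the syndrome z H^T vanishes."""
    z = z_hat.copy()
    order = np.argsort(np.abs(np.asarray(p_correct) - 0.5))
    for idx in order:
        if not (z @ H.T % 2).any():      # syndrome = 0: valid codeword
            break
        z[idx] ^= 1                      # flip the next most suspect bit
    return z[:4] if not (z @ H.T % 2).any() else None

z_hat = np.array([1, 0, 0, 1, 0, 0, 1])
p     = [0.9, 0.9, 0.9, 0.51, 0.58, 0.8, 0.7]
print(decode(z_hat, p))                  # [1 0 0 0]
```

The script flips bit 4, then bit 5, and stops with syndrome zero, reproducing the walkthrough above.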

8.2 Convolutional codes and Min-Sum (Viterbi) decoding

Convolutional codes are probably among the most widely used codes today. GSM and its derivatives for data communications, such as EDGE, use convolutional codes. The reason is that convolutional codes are easy to implement, and can be efficiently and optimally decoded by the min-sum (Viterbi) algorithm. Moreover, in recent times (the late 1980s) it has been shown that near optimal codes can be derived from convolutional codes if two such codes are combined and iteratively decoded (so-called turbo codes [2]).

Figure 8.1 shows a typical rate R = 1/3 convolutional code (the encoder) with 3 taps, and its state diagram. The state diagram has states corresponding to the two memory elements of the 3 taps in the encoder, so in general we associate 2^(L−1) states with a convolutional code with L taps. The state diagram shows with dotted lines the transitions that the encoder will undergo if fed a 1, and with solid lines the transitions if fed a 0 as input. Each time it is fed an uncoded bit, the encoder produces 3 output encoded bits (associated with the edges in the trellis), because it is a rate R = 1/3 code. Those bits are also indicated on the state diagram.

84

Output bit 1 Input uncoded bit Output bit 2

Output bit 3 All operations performed modulo 2 uncoded bit = 0 uncoded bit = 1 000 state 00 001 state 01 010

110 011 111 100 state 10

101

state 11

Figure 8.1: The convolutional encoder and state diagram.
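A shift-register encoder of this kind takes only a few lines. The generator tap vectors below are our own illustrative choice - the actual connections are defined by Figure 8.1, which specifies which taps feed each of the three output bits:

```python
# Rate R = 1/3 convolutional encoder with L = 3 taps (2 memory
# elements, hence 2^(L-1) = 4 states).
GENS = [(1, 1, 1), (1, 0, 1), (1, 1, 0)]    # taps on [current, prev, prev-prev]

def conv_encode(bits):
    mem = [0, 0]                            # the 2-bit encoder state
    out = []
    for u in bits:
        window = [u, mem[0], mem[1]]
        for g in GENS:                      # one output bit per generator
            out.append(sum(gi * wi for gi, wi in zip(g, window)) % 2)
        mem = [u, mem[0]]                   # shift the register
    return out

z = conv_encode([1, 0, 1, 1])
print(len(z))                               # 12 coded bits for 4 uncoded bits
```

Tracing `mem` through the loop reproduces exactly the state transitions of the state diagram.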

8.2.1 Decoding the convolutional codes

The encoded bits are fed to the modulator, where they are modulated into analogue symbols according to some chosen modulation scheme (QPSK, 8PSK, BPSK etc.). These are transmitted over a noisy multipath channel and, after matched filtering, fed to the detector. The detector has the job of reversing the multipath channel and estimating the encoded bits ẑ. These estimates comprise not only the bits themselves, but also their probabilities. These are now fed to the decoder (see Figure 1.4). For hard decision decoding, the decoder just needs the hard bit values, not the probabilities. The decoder then bases its estimates of x on the Hamming distance between the bit estimates from the detector and the candidate transmitted bits in the decoder trellis; but hard decision decoding is very sub-optimal, as it does not use the probability information supplied by the detector. Hence we will rather look at optimal soft bit decoding, where all information available to the decoder is exploited. First of all, since the encoder has 3 taps, we may set up a trellis as we did for the 3 tap detector, as shown in Figure 8.2. The trellis has states (1, 1), (1, −1), (−1, 1) and (−1, −1), corresponding to the possible bits in its 2 tap memory: (1, 1), (1, 0), (0, 1) and (0, 0). For each edge connecting two states (if it is a legal connection, i.e. the encoder is able to move/transition between the 2 states) a cost, say δᵢ, has to be computed. For the detector this was computed as the Euclidean distance between the observed complex number coming from the demodulator and matched filter, and the candidate symbols convolved with the impulse response. For the decoder we also compute a Euclidean distance metric, but in this case simply the accumulated Euclidean distance between the 3 observed soft bits (defined below) and the candidate bits given by the edge in question in the trellis.

85
zs[0] zs [3] zs [1] zs [4] zs [5] zs [2] 0 11 A 1 1 2 111

11 11 11

z s[6] zs[7] zs[8] 2 1 1 0 AA AA 3

3 2 AA

3 A 1

11

4 111

111

5 winner path least cost 7

111 6 111

11 i A = bit i

Figure 8.2: The convolutional decoder trellis based on the state diagram.

The soft bit is formed by the decoder itself, using information from the detector, as

z_soft = (2ẑ − 1) |ln(P_{z=1} / P_{z=0})|   (8.17)

where P_{z=1/0} is the probability that the estimated bit was a 1/0, as provided by the detector, and ẑ is the hard bit value (i.e. 0/1) from the Viterbi detector. Hence the cost δᵢ for an edge in this trellis (for a rate R = 1/3 code) is given by

δᵢ = |z_soft,i − αᵢ|² + |z_soft,i+1 − βᵢ|² + |z_soft,i+2 − γᵢ|²   (8.18)

where α, β, γ are the 3 candidate bits (in antipodal form) associated with the edge in question. Note that each edge in the trellis consumes 3 bits from the detector in the metric calculation. That is because it is a rate R = 1/3 code, i.e. 3 coded bits are delivered by the encoder for each uncoded bit.

The power of soft decision decoding becomes obvious if we recognize that, of the three terms in the δ calculation, the ones that conflict with a bit of high certainty from the detector are independently punished. Also, when a bit from the detector is uncertain, the relevant term in the expression for δ becomes essentially constant regardless of the trellis edge, i.e. that term becomes benign and does not influence the optimal path that the Viterbi algorithm will choose.

How would the symbol rate 1/T and the uncoded bit rate relate? It depends on the modulation scheme used. For example, if we use the rate R = 1/3 code with an 8PSK modulation scheme, then the symbol rate and the uncoded bit rate are the same. If the modulator uses BPSK, then the symbol rate is 3 times the uncoded bit rate.

After the trellis is populated with all the δs, we apply the min-sum (Viterbi) algorithm to find the path through the trellis with the least cost. Tracing it back yields the states the encoder went through over the entire frame that was encoded. How does that yield x? Simply by recognizing that the uncoded bit required to make the encoder move from one state to the next is unique; with the estimated states known, the estimated uncoded bits x̂ can be read off from the state diagram.

The reader may realize that the decoder based on Viterbi decoding has much in common with the detector studied previously. For example, here too we only know the hard bits x̂ (unless we do extra work). These estimated hard bits are fed to the de-compression device to form the original source, be it voice, an image or just a data file. The de-compression device is assumed to require only hard bits. If it did require soft bits, the decoder could be modified to also provide a probability estimate, or a different decoder algorithm could be used that provides optimal soft bit information, such as the BCJR [3] algorithm.
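The soft bit mapping of (8.17) and the branch metric of (8.18) can be sketched as follows (assuming, as is conventional, that P_{z=0} = 1 − P_{z=1}):

```python
import math

def soft_bit(z_hard, p1):
    """Soft bit of (8.17): sign from the hard decision, magnitude from
    the log ratio of the detector's probabilities (p1 = P_{z=1})."""
    p1 = min(max(p1, 1e-12), 1 - 1e-12)      # guard against log(0)
    return (2 * z_hard - 1) * abs(math.log(p1 / (1 - p1)))

def edge_cost(zs, cand):
    """Branch metric of (8.18): squared distance between the 3 soft
    bits and the edge's candidate bits mapped to +/-1."""
    return sum((z - (2 * c - 1)) ** 2 for z, c in zip(zs, cand))

# A confident 1, a confident 0, and a very uncertain third bit:
zs = [soft_bit(1, 0.95), soft_bit(0, 0.05), soft_bit(1, 0.51)]
# The confident bits dominate the metric; the uncertain one is benign:
print(edge_cost(zs, (1, 0, 1)) < edge_cost(zs, (0, 1, 1)))   # True
```

The uncertain bit contributes almost the same term to every edge, exactly the "benign" behaviour described above.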

8.3 Assignments

1. In the GSM simulator the provided decoder is based on the min-sum strategy. Develop your own decoder using the forward-backward MAP decoder, replace the given one, and simulate the BLER for MCS 1 and 4. It may help to redraw the trellis to show each transmitted bit explicitly, one per edge. See [1], Chapter 25, for a treatment of this type of trellis, well suited to the forward-backward MAP decoder. On such a trellis the likelihood for each edge is easily defined - see the chapter on detection, where the forward-backward MAP algorithm was explained.

Bibliography

[1] D. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003. (http://www.inference.phy.cam.ac.uk/mackay/itila/)
[2] J. Proakis, Digital Communications, Fourth Edition, McGraw-Hill, 2001.
[3] L.R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Information Theory, vol. IT-20, pp. 284-287, Mar. 1974.
