Author:
Tutor:
TELECOMUNICACION
UNIVERSIDAD POLITECNICA DE MADRID
Madrid, April 2007
The tribunal appointed to judge the Project indicated above, composed of the following members:
PRESIDENT:
MEMBER: Dña. María Luisa López Vallejo
SECRETARY:
SUBSTITUTE:
Madrid,       of       2007
To my parents
Acknowledgements
First of all, I would like to thank Marisa for assigning this project and the scholarship to
me. I have enjoyed working on it all along.
I would like to give special thanks to my mentor and friend Pablo for his advice and
support. I had a great time working with him.
Thanks to my friends at the Lab for the fantastic environment.
Finally, I would like to thank Sandra for all her support and patience, and for being there
all the time.
Abstract
Today, the most common architectures for implementing the SOVA algorithm are governed by
two parameters: the traceback depth and the reliability updating depth. These parameters
play an important role in the BER performance, power consumption, area and system
throughput trade-offs. In this work, we present a new approach to SOVA decoding that is
not limited by these parameters and leads to an optimal execution of the SOVA algorithm.
Moreover, the architecture is built from recursive units which consume less power, since
the number of registers employed is reduced. We also present a new scheme to improve
the SOVA BER performance, based on an approximation to the BR-SOVA algorithm. With
this scheme, the BER achieved is within 0.1 dB of that obtained with a Max-Log-MAP
algorithm.
Contents

1 Introduction
2 Turbo Codes
   2.3 Convolutional Encoders
   2.4 Trellis Diagrams
   2.6 Trellis Termination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Hardware Implementation of a Turbo Decoder based on SOVA . . . . . . . . . . . 25
   4.6.4 Other Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . 38
   4.8 Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
   4.9 Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
   6.1 Quantization Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
   6.2 Synthesis Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
   6.4 Throughput Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
   6.5 Power Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
List of Figures

3.4  Soft Output extension example for the Viterbi Algorithm. Code given by
     Pfb = [111], Pg = [101] . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3  Data-in RAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4  Data-out RAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.6  Interleaving/Deinterleaving Unit . . . . . . . . . . . . . . . . . . . . . 29
4.10 Modular representation of the path metrics. Each path metric register has
     a width of nb bits. . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.11 Merging of paths in the traceback. . . . . . . . . . . . . . . . . . . . . 33
5.2  Hardware-in-the-loop approach . . . . . . . . . . . . . . . . . . . . . . 57
6.1  Quantization effect on the system BER performance. BR-SOVA approximation
     scheme. Simulation with quantization. MCF. Pfb = [111], Pg = [101] . . . 60
Chapter 1
Introduction
The goal of any communication system is to achieve highly reliable communications with
reduced transmitted power and data rates as high as possible. All these parameters usually
represent a trade-off that designers have to deal with. Bandwidth is also a limited
resource in communication systems. Error-detecting and error-correcting techniques are
used in digital communication systems in order to achieve higher spectral and power
efficiencies. This is based on the fact that with these techniques more channel errors
can be tolerated, so the communication system can operate with a lower transmitted
power, transmit over longer distances, tolerate more interference, use smaller antennas,
and transmit at higher data rates.
One of the most widespread of these techniques is Forward Error Correction (FEC). On
the transmitter side, an FEC encoder adds redundancy to the data in the form of parity
information. Then at the receiver, an FEC decoder is able to exploit the redundancy in
such a way that a reasonable number of channel errors can be corrected. Claude Shannon,
the father of Information Theory, showed that if long random codes are used, reliable
communications can take place at the minimum required Signal-to-Noise Ratio (SNR).
However, truly random codes are not practical to implement. Codes must possess some
structure in order to have computationally tractable encoding and decoding algorithms.
Turbo Codes were introduced by Berrou, Glavieux and Thitimajshima in 1993 [3].
These codes exhibit an astonishing performance close to the theoretical Shannon limit,
in addition to a good feasibility of VLSI (Very Large Scale Integration) implementation.
Turbo Codes are used in the two most widely adopted third-generation cellular standards
(UMTS and CDMA2000). They are also incorporated into standards used by NASA for
deep space communications (CCSDS) and digital video broadcasting (DVB-T).
Decoding in Turbo Codes is carried out by a soft-output decoding algorithm: an
algorithm that provides a measure of reliability for each bit that it decodes. Specifically,
two of the component decoding algorithms used in Turbo Codes are known as
MAP (Maximum a Posteriori) and SOVA (Soft Output Viterbi Algorithm). The high
computational complexity of the MAP algorithm makes its implementation expensive
and power-hungry. This is why most implementations perform a simplified version of
the algorithm. The most common simplifications are the Log-MAP and Max-Log-MAP
algorithms, which work in the logarithmic domain. Nevertheless, these algorithms are still
more complex and power-hungry than the SOVA algorithm, which presents the
implementation to verification. Finally, the sixth chapter presents the results and
measurements carried out on the real system, while the seventh chapter gives the
conclusions and establishes the basis for future work.
Chapter 2
Turbo Codes
Turbo Codes were presented by Berrou, Glavieux and Thitimajshima [3] in 1993. They had
a tremendous impact on the discipline of channel coding. They are, along with LDPC
(Low Density Parity Check) codes, the closest approximation ever to the code that
Claude Shannon, in the mid twentieth century, proved to exist and which is able to
achieve error-free communications. Since their introduction, they have been intensively
studied. The first commercial application was presented in 1997 [1] and today they are
already part of the UMTS (Universal Mobile Telecommunication System) standards. They
have become the first choice when working with low SNRs (Signal-to-Noise Ratios), such
as in wireless applications and deep space communications.
In this chapter we first introduce the communication system model which has been
employed in this work as the scenario for channel coding tests. Next we introduce the
concept of soft information, which is the key to Turbo Codes. We then describe Turbo
Code encoders and finally discuss trellis termination. The decoding process is
left to the next chapter.
2.1
In order to explain the soft information concept and the log-likelihood ratio, we will
develop a simplified communication model that will be the base example for the concepts
that follow. This communication model is shown in Figure 2.1. On the transmitter side
there is a source of information that we assume to provide equally likely symbols. There
is a block for channel coding, which is the main subject of this work and is carried out
by a Turbo Code. The modulation scheme is BPSK (Binary Phase-Shift Keying) and the
channel is assumed to be AWGN (Additive White Gaussian Noise). On the receiver
side, the complementary blocks to those in the transmitter are found. Also, there is a
matched filter which maximizes the SNR before sampling the received data. Note that we
have omitted the synchronization recovery subsystem, which will be assumed to be ideal.
As a starting point, the source provides message bits mi at a rate of 1/T bits/s, which
are fed into the channel coding block. In a Turbo Code context, these bits are grouped
to form a frame of size L bits. The channel coding block outputs a coded frame of size
2L. So, for each message bit mi there is a symbol made of two bits xi = {xsi , xpi }. Then
[Figure 2.1: Simplified communication system model: source, channel coding, BPSK modulation, AWGN channel, matched filter (implies sampling and quantization), channel decoding and sink.]

[Figure 2.2: Discrete AWGN channel: xi , BPSK modulation, AWGN channel, matched filter, yi .]

yi = a √Es (2xi − 1) + nG    (2.1)

σ² = N0 / (2Es )    (2.2)
(2.2)
2.2
Whenever a symbol yi is received at the decoder, the following test rule helps us to
determine what the transmitted symbol was, based only on the observation yi and without
the help of the code.
P (xi = 1 | yi ) > P (xi = 0 | yi ) ⇒ x̂i = 1
P (xi = 1 | yi ) < P (xi = 0 | yi ) ⇒ x̂i = 0
This rule is known as MAP (Maximum a Posteriori) since P (xi = 1 | yi ) and P (xi = 0 | yi )
are the a posteriori probabilities. Using Bayes' theorem, the previous rule can be
rewritten as:
P (yi | xi = 1) P (xi = 1) / P (yi ) > P (yi | xi = 0) P (xi = 0) / P (yi ) ⇒ x̂i = 1
P (yi | xi = 1) P (xi = 1) / P (yi ) < P (yi | xi = 0) P (xi = 0) / P (yi ) ⇒ x̂i = 0
P (yi | xi = 1) P (xi = 1) / [ P (yi | xi = 0) P (xi = 0) ] > 1 ⇒ x̂i = 1
P (yi | xi = 1) P (xi = 1) / [ P (yi | xi = 0) P (xi = 0) ] < 1 ⇒ x̂i = 0
If we apply the natural logarithm to the previous equations, the test result is not
altered, and we obtain:
ln [ P (yi | xi = 1) / P (yi | xi = 0) ] + ln [ P (xi = 1) / P (xi = 0) ] > 0 ⇒ x̂i = 1
ln [ P (yi | xi = 1) / P (yi | xi = 0) ] + ln [ P (xi = 1) / P (xi = 0) ] < 0 ⇒ x̂i = 0
The previous ratios in the log domain are the LLR (Log-Likelihood Ratio) metrics, which
are a useful way to represent the soft decision of receivers or decoders. We can summarize
the previous steps with only one equation as follows:

L (xi | yi ) = L (yi | xi ) + L (xi )

where L (xi | yi ) = ln [ P (xi = 1 | yi ) / P (xi = 0 | yi ) ], L (yi | xi ) = ln [ P (yi | xi = 1) / P (yi | xi = 0) ]
and L (xi ) = ln [ P (xi = 1) / P (xi = 0) ]. The notation of the previous equation is usually
rewritten as:

Λ′i = Lc (yi ) + Lai

where Lai is the LLR of the a priori information and Lc (yi ) is related to a measure of the
channel reliability. Note that the sign of Λ′i indicates the hard decision.
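The decision rule and the LLR sum above can be sketched numerically. The sketch below is an illustration only, not part of the original text; it assumes BPSK symbols u = 2x − 1 over an AWGN channel of variance σ², for which the two Gaussian likelihoods cancel down to a channel term of 2y/σ².

```python
def channel_llr(y, sigma2):
    """L(yi | xi) for BPSK over AWGN: ln of the likelihood ratio of the
    two Gaussian densities reduces to 2*y/sigma^2."""
    return 2.0 * y / sigma2

def soft_decision(y, sigma2, La=0.0):
    """A posteriori LLR L(xi | yi) = L(yi | xi) + L(xi); its sign gives
    the hard decision of the test rule above."""
    L = channel_llr(y, sigma2) + La
    return L, (1 if L > 0 else 0)

# a received sample of +0.8 with unit noise variance decodes as bit 1
llr, bit = soft_decision(0.8, sigma2=1.0)
```

Note how a priori information La can tip the hard decision when the channel observation is weak, which is exactly the mechanism iterative decoding exploits.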
Figure 2.3: NSC encoder of rate 1/2. Pg1 = [101], Pg2 = [111].
So far we have introduced the equations of soft information based on the received
symbol at the input of the decoder, without the aid of the underlying code. Using
channel coding in the communication system lets us improve the LLR of the a
posteriori probability. This is shown in [3]. The LLR of the a posteriori information at
the output of the decoder is:

Λi = Λ′i + Lei = Lc (yi ) + Lai + Lei    (2.3)

The term Lei is known as the extrinsic information, which is actually the improvement
achieved by the decoder and the decoding process on the soft information. The extrinsic
information will be the data fed as a priori information to the other decoder in a
concatenated decoding scheme. It is important to remark that all terms in equation 2.3
can be added because they are statistically independent [3]. Statistical independence of
the terms is essential to allow iterative decoding, and this is the reason for the
interleavers in the concatenation schemes of Turbo Encoders and Turbo Decoders.
2.3 Convolutional Encoders
Turbo Code encoders are mainly based on convolutional encoders. In these encoders,
the output signals are typically generated by convolving the input signal with the encoder
response in several different configurations, consequently adding redundancy to the code.
Convolutional codes can be either Non-Systematic Convolutional codes (NSC), when the
input word is not among the outputs, or Recursive Systematic Convolutional codes (RSC),
when the input word is one of the outputs [8]. Figure 2.3 illustrates an example of an
NSC encoder while figure 2.4 shows an RSC encoder. A set of registers and modulo-two
adders can be seen in the figures. The connections among those registers and the
modulo-two adders determine the output sequence of the encoder. Dividing the number of
inputs I by the number of outputs O results in the code rate I/O. The examples cited
throughout this work will always use an RSC encoder with rate 1/2.
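As an illustrative sketch (not taken from the original text), the rate-1/2 RSC encoder with Pfb = [111], Pg = [101] used throughout this work can be written in a few lines of Python; the exact register layout below is an assumption read off figures 2.3 and 2.4.

```python
def rsc_encode(bits):
    """Rate-1/2 RSC encoder, Pfb = [111], Pg = [101], constraint length K = 3.

    K - 1 = 2 memory elements, so the encoder moves through 2**(K-1) = 4
    states. Returns the systematic and parity output streams."""
    r0 = r1 = 0                    # the two shift-register cells
    xs, xp = [], []
    for m in bits:
        fb = m ^ r0 ^ r1           # feedback taps: Pfb = [111]
        xs.append(m)               # systematic output equals the input
        xp.append(fb ^ r1)         # generator taps: Pg = [101]
        r0, r1 = fb, r0            # shift the register
    return xs, xp

xs, xp = rsc_encode([1, 1, 0, 1])  # xs is always the message itself
```

Because the encoder is systematic, the message is recoverable directly from xs; the parity stream xp carries the added redundancy.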
To define a convolutional encoder we need a set of polynomials which represent the
connections among the registers and the modulo-two adders. For an NSC encoder, two code
generator polynomials define the encoder of rate 1/2 (see figure 2.3). On the other hand,
an RSC encoder is defined by both feedback and generator polynomials (see figure 2.4).

[Figure 2.4: RSC encoder of rate 1/2. Pfb = [111], Pg1 = [101].]

Figure 2.5: RSC encoder used in the UMTS standard. Pfb = [1011], Pg = [1101].
The status of the set of registers represents the state of the encoder. Input bits mi
make the encoder memory elements change and move into another state while producing
the output bits xsi , xpi in the case of the RSC encoder. Convolutional encoders are
characterized by the constraint length K. An encoder with constraint length K has K − 1
memory elements, which allows the encoder to move through 2^(K−1) states.
RSC encoders are used in Turbo Code schemes more often than NSC encoders,
since better BER performance has been achieved with them. For instance, the encoder
used in UMTS is the one depicted in Figure 2.5.
2.4 Trellis Diagrams
A trellis diagram is a graphical representation of the states of the encoder. It is a
powerful tool since it not only allows us to see state transitions, but also their time
evolution. The MAP (Maximum a Posteriori Probability) and the SOVA (Soft Output Viterbi
Algorithm) algorithms are used to decode Turbo Codes. They base their calculations on
the trellis branches in order to reduce computation, and this is the reason why we
explain trellis diagrams.
[Figure 2.6: Trellis diagram example. m = <110...>, x = <11 10 00 ...>; states s0 –s3 at times i = 0 . . . 3; branch labels {xsi , xpi }; branches are drawn differently for mi = 0 and mi = 1.]
[Figure 2.7: Serial concatenated Turbo encoder: two RSC encoders separated by an interleaver.]

Figure 2.8: Parallel concatenated Turbo encoder. RSC encoder with Pfb = [111], Pg = [101].
2.5
As we mentioned in 2.3, Turbo Code encoders are mainly based on convolutional encoders.
However, Turbo encoders also include one or more interleavers for shuffling data.
Figure 2.7 shows a serial concatenated Turbo encoder, while figure 2.8 shows a parallel
concatenated Turbo encoder of rate 1/2, which is the one used in our communication
system model. Many combinations can be achieved by concatenating different convolutional
encoders with interleavers. The purpose of the interleavers is to decorrelate the data
streams, so that iterative decoding can take place at the decoder. In figure 2.8 there is
a block known as the puncturer, which basically composes the parity stream of the
resulting encoder by selecting one parity bit from each convolutional encoder at a time.
If no puncturing were done, the rate of the entire Turbo encoder would be 1/3; thus the
rate of the resulting Turbo encoder can differ from the rate of the convolutional encoders.
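A minimal sketch of the structure in figure 2.8 (an illustration only; the permutation used below is an arbitrary example, and the constituent encoder is the one described in section 2.3):

```python
def rsc(bits):
    """Constituent rate-1/2 RSC encoder, Pfb = [111], Pg = [101]."""
    r0 = r1 = 0
    xs, xp = [], []
    for m in bits:
        fb = m ^ r0 ^ r1
        xs.append(m)
        xp.append(fb ^ r1)
        r0, r1 = fb, r0
    return xs, xp

def turbo_encode(bits, perm):
    """Parallel concatenated turbo encoder of rate 1/2.

    The second encoder sees the interleaved frame; the puncturer takes
    the parity bit alternately from each constituent encoder, raising
    the overall rate from 1/3 to 1/2."""
    xs, xp1 = rsc(bits)
    _, xp2 = rsc([bits[j] for j in perm])     # interleaved copy
    xp = [xp1[i] if i % 2 == 0 else xp2[i] for i in range(len(bits))]
    return xs, xp

xs, xp = turbo_encode([1, 0, 1, 1], perm=[2, 0, 3, 1])
```

Dropping the puncturing line and returning both xp1 and xp2 instead would yield the rate-1/3 variant.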
2.6 Trellis Termination
Before getting into the decoding process, it is important to mention the trellis termination
of the convolutional encoders since it affects the BER performance of the code. The trellis
Figure 2.9: Turbo Encoder with trellis termination in one encoder. Pfb = [111], Pg = [101].
termination is basically the final state the memory elements of the convolutional encoders
adopt when the end of the frame, being encoded, is reached. Since there is an interleaver
between both convolution encoders, the trellis termination of them is not a trivial task [16].
We will choose, for the purpose of this work, to terminate the first encoder and left the
second encoder open. Figure 2.9 shows the resulting Turbo encoder. The system works
as follows: At the beginning, switch s1 is closed and switch s2 is opened. A data frame
of size L 2 is encoded and then switch s1 is opened and s2 is closed, the remaining two
bits are encoded , this leads the first convolutional encoder to the state 0. Note that the
data frame, for this case L 2 bits long, and the remaining two bits are used to terminate
the trellis.
Chapter 3
3.1
In the previous chapter we presented Turbo Codes and the encoding process. Now it is
time to talk about the decoding process. Turbo Codes are asymmetrical codes. That is,
while the encoding process is relatively easy and straightforward, the decoding process
is complex and time-consuming.
The power of Turbo Codes resides in the decoding process, which unlike other techniques
is done iteratively. Figure 3.1 shows a general scheme of a turbo decoder. As
we can see, the decoding process is done by two SISO decoders. Signals arriving at the
receiver are sampled and processed with the aid of the channel reliability before becoming
the soft information (parity info 1, parity info 2 and systematic info) shown in figure 3.1. We can
[Figure 3.1: General scheme of a turbo decoder: two SISO decoders exchange a priori information La through an interleaver and deinterleavers; parity info 1, parity info 2 and systematic info feed the decoders, and a decision unit produces the decoded bits.]
see the output of one SISO decoder become the input of the other decoder and vice
versa, forming a feedback loop. The name turbo code is due to this feedback loop and
its resemblance to a turbo engine.
Final decoding is achieved by an iterative process. Soft input information is processed
and, as a result, soft output information is obtained. The second decoder takes this soft
information as input and produces new soft output information that the first decoder
will use as input. This process continues until the system makes a hard decision. The
BER obtained improves drastically over the first iterations until it begins to converge
asymptotically [3]. A trade-off exists between the decoding delay and the bit error rate
achieved. Even though eight iterations are enough to obtain a reasonable BER, decoders
do not always perform them all; instead, they check the parity of the message header and
then decide whether to keep iterating or not.
Note that between the decoders there is an interleaver or deinterleaver, depending on
the data flow. As we mentioned in chapter 2, the interleaver/deinterleaver unit is a key
element in turbo coding. This unit reorders soft information so that a priori data,
parity data and systematic data are all time-coherent at the moment of processing.
Figure 3.1 also shows how soft input information is subtracted from the output, in order
to avoid the positive feedback which would degrade the BER performance of the system.
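The iterative exchange just described can be sketched structurally. This is an illustration only: `siso` is a hypothetical stand-in for a full SISO decoder, and the bookkeeping Le = L − La − systematic is the extrinsic-information subtraction assumed from figure 3.1.

```python
def turbo_decode(sys_llr, par1, par2, perm, siso, iterations=8):
    """Skeleton of the iterative loop of figure 3.1 (structure only).

    `siso` is a hypothetical SISO decoder with signature
    siso(systematic, parity, La) -> a posteriori LLRs. The extrinsic
    information Le = L - La - systematic crosses to the other decoder,
    avoiding positive feedback. Assumes iterations >= 1."""
    n = len(sys_llr)
    inv = [0] * n
    for i, p in enumerate(perm):
        inv[p] = i                                  # deinterleaver map
    La1 = [0.0] * n                                 # first pass: bits equally likely
    sys2 = [sys_llr[perm[i]] for i in range(n)]     # interleaved systematic info
    for _ in range(iterations):
        L1 = siso(sys_llr, par1, La1)
        Le1 = [L1[i] - La1[i] - sys_llr[i] for i in range(n)]
        La2 = [Le1[perm[i]] for i in range(n)]      # interleave
        L2 = siso(sys2, par2, La2)
        Le2 = [L2[i] - La2[i] - sys2[i] for i in range(n)]
        La1 = [Le2[inv[i]] for i in range(n)]       # deinterleave
    return [1 if l > 0 else 0 for l in L1]          # final hard decision
```

The fixed iteration count stands in for the early-stopping check on the message-header parity mentioned above.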
3.2
Even though the SOVA algorithm and the MAP algorithm are both trellis-based (they take
advantage of the trellis diagram to reduce computations), they differ in the final
estimate they obtain. MAP performs better when working with low SNR, and both of them
are about the same when working with high SNR. MAP finds the most probable set of
symbols in a message sequence, while SOVA finds the most probable sequence of states
associated with a path within a trellis. Nevertheless, MAP is computationally much
heavier than SOVA.
SOVA stands for Soft Output Viterbi Algorithm. It is actually a modification of the
Viterbi Algorithm [7]. We will introduce the Viterbi Algorithm based on the explanation
given in [16] and then we will add the soft output extension. The VA is widely used
because it finds the most probable sequence within a trellis, and a trellis diagram can
represent any finite-state Markov process.
Recalling our communication model, let s = (s0 , s1 , . . . sL ) be the sequence we want to
estimate and let y be the received sequence of symbols. VA finds:
ŝ = arg max_s { P [s | y] }    (3.1)
where y is the noisy set of symbols we have at the decoder after sampling. To be more
precise, y is the observation. From Bayes' theorem we have:
ŝ = arg max_s { P [y | s] P [s] / P [y] }    (3.2)

Since P [y] does not change with s, we can rewrite equation 3.2 as:
ŝ = arg max_s { P [y | s] P [s] }    (3.3)
In order to compute equation 3.3, we could try all sequences s and find the one that
maximizes the expression. However, this approach does not scale when the frame size is
large.
Since there is a first order Markov process involved, we can take advantage of two of
its properties to simplify the search for s. These properties are:
P [si+1 | s0 . . . si ] = P [si+1 | si ]    (3.4)

P [yi | s0 . . . si+1 ] = P [yi | si → si+1 ]    (3.5)
Equation 3.4 establishes that the probability of the next state does not depend on the
entire past sequence; it only depends on the last state. Equation 3.5 states that the
observation symbol yi , received through white noise, depends only on the state
transition during which it was produced.
Using these properties we can work on 3.3:

P [y | s] = ∏_{i=0}^{L−1} P [yi | si → si+1 ] ,

P [s] = ∏_{i=0}^{L−1} P [si+1 | si ] ,

ŝ = arg max_s { ∏_{i=0}^{L−1} P [yi | si → si+1 ] ∏_{i=0}^{L−1} P [si+1 | si ] }    (3.6)
A hardware implementation of an adder requires fewer resources than a hardware implementation of a multiplier. So if we apply the natural logarithm to 3.6, we can replace multiplications with additions without altering the final result. This yields:
ŝ = arg max_s { Σ_{i=0}^{L−1} ln P [yi | si → si+1 ] + ln P [si+1 | si ] }    (3.7)

ŝ = arg max_s { Σ_{i=0}^{L−1} γ (si → si+1 ) }    (3.8)

γ (si → si+1 ) is known as the branch metric associated with the transition si → si+1 .
The observation yi during state transition si → si+1 is actually the output of the
encoder observed through white noise during the state transition. For our BPSK model this
[Figure 3.2: Encoder outputs and BPSK modulation: usi = 2xsi − 1, upi = 2xpi − 1; pairs such as (usi , upi ) = (−1, −1) and (1, 1) label the trellis branches between states s0 –s3 .]
observation is related to the systematic and parity bit pair (Figure 3.2). Thus, assuming
noise independence, we can express the conditional probability of yi during a state
transition as follows:

P [yi | si → si+1 ] = P [ysi | usi ] P [ypi | upi ]    (3.9)
where usi and upi are the systematic and parity bits respectively after BPSK modulation
and

P [ysi | usi ] = (1 / √(2π) σ) exp( −(ysi − usi )² / (2σ²) ) dysi ;
P [ypi | upi ] = (1 / √(2π) σ) exp( −(ypi − upi )² / (2σ²) ) dypi ,
since we are dealing with white Gaussian noise of variance σ². In addition, it is more
convenient to express P [si+1 | si ] in terms of the message bit mi , since state
transitions are due to this bit. Then,

P [si+1 | si ] = P [mi ]    (3.10)
This is our a priori probability. For turbo decoding it is easier to work with log-likelihood
ratios, then:
Lai = ln ( P [mi = 1] / P [mi = 0] )

P [mi ] = e^{Lai} / (1 + e^{Lai})   if mi = 1
P [mi ] = 1 / (1 + e^{Lai})          if mi = 0

ln P [mi ] = Lai mi − ln (1 + e^{Lai})
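The closed forms for P[mi] and ln P[mi] can be checked numerically; a quick sketch (illustrative, not part of the original text):

```python
import math

def prior_prob(La, m):
    """P[mi] from the a priori LLR La:
    e^La / (1 + e^La) for mi = 1, and 1 / (1 + e^La) for mi = 0."""
    return math.exp(La * m) / (1.0 + math.exp(La))

def log_prior(La, m):
    """ln P[mi] = La*mi - ln(1 + e^La), the closed form derived above."""
    return La * m - math.log(1.0 + math.exp(La))
```

With La = 0 both values of `prior_prob` are 0.5, matching the equally-likely assumption of the first iteration; for any La the two probabilities sum to one and the two forms agree.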
It is important to remark that for the first iteration all message bits are assumed to be
equally likely, so P [mi = 1] = P [mi = 0] = 0.5 and Lai = 0. For successive iterations,
Lai is the extrinsic information provided by the other decoder through the interleaver.
Replacing equation 3.9 and the above expression in the branch metric equation, we have:
γ (si → si+1 ) = ln ( dysi dypi / (2πσ²) ) − (ysi − usi )² / (2σ²) − (ypi − upi )² / (2σ²) + Lai mi − ln (1 + e^{Lai})
              = − (1 / (2σ²)) [ (ysi − usi )² + (ypi − upi )² ] + Lai mi
              = − (1 / (2σ²)) [ ysi² − 2 ysi usi + usi² + ypi² − 2 ypi upi + upi² ] + Lai mi
              = (1 / σ²) [ ysi usi + ypi upi ] + Lai mi
Note that in order to simplify the equations we have neglected terms that do not change
when varying the sequence s. From chapter 2 we know that σ² = N0 / (2Es ) and Es = rEb ,
where r = 1/2 is the code rate. So finally we obtain:

γ (si → si+1 ) = (Eb / N0 ) [ ysi usi + ypi upi ] + Lai mi    (3.11)
It is more common to express equation 3.11 as shown below, using the channel reliability

Lc = 4a Es / N0   (a = 1 for our model),    (3.12)

ŝ = arg max_s { Σ_{i=0}^{L−1} Lc ysi xsi + Lc ypi xpi + Lai mi }    (3.13)
where xsi , xpi are the raw bits at the output of the channel encoder before BPSK
modulation. Also, mi = xsi for our RSC encoder.
It is important to remark that, according to [11], for the SOVA algorithm Lc can be
assumed to be equal to 1. This means that there is no need to estimate the SNR of
the channel. This is possible because at the beginning of the decoding process, at the
first iteration, Lai = 0, which leads the resulting extrinsic information to be weighted
by Lc . This extrinsic information becomes Lai for the next SISO decoder, making all
terms in equation 3.13 weighted by Lc . Therefore Lc has no influence on the decoding
process. The fact that the SOVA does not need the channel estimation avoids many
difficulties and represents a big advantage over the MAP algorithm.
Summarizing, table 3.1 shows the relevant equations for applying the SOVA algorithm.
Table 3.1: Relevant equations for applying the SOVA algorithm.

Element              Equation

Branch Metric        γ (si → si+1 ) = ysi xsi + ypi xpi + Lai mi    (3.14)

Sequence Estimator   ŝ = arg max_s { Σ_{i=0}^{L−1} ysi xsi + ypi xpi + Lai mi }    (3.15)
where {xsi , xpi } is the encoder output symbol when the input message bit is mi ;
{ysi , ypi } is the received symbol when the encoder output symbol is BPSK-modulated and
transmitted through an AWGN channel. Finally, Lai represents the LLR of the message bit mi .
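Equation 3.14, with Lc = 1 and mi = xsi, is essentially a one-liner; a sketch (illustrative, not taken from the text):

```python
def branch_metric(ys, yp, xs, xp, La):
    """Branch metric of equation 3.14 with Lc = 1:
    gamma = ys*xs + yp*xp + La*mi, where mi = xs for the RSC encoder."""
    return ys * xs + yp * xp + La * xs

# branch emitting (xs, xp) = (1, 0) against received samples (0.9, -1.1)
g = branch_metric(ys=0.9, yp=-1.1, xs=1, xp=0, La=0.0)
```

Note that because xs and xp are 0/1, each branch metric is just a selective sum of the received samples plus the a priori term, which is what makes the hardware cheap.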
In the next subsection we will develop an example in order to show how expression
3.15 and the trellis diagram are applied in the decoding process.
3.2.1
Figure 3.3 shows a trellis diagram example for a code with Pfb = [111], Pg = [101], and
tries to clarify the decoding process.
As shown in figure 3.3.a, the process begins at time i = 0 from state 0, because that is
the state the encoder takes when initialized. Thus, the probability of being at state 0
is one, and the probability of being at any other state is zero. We assign these
probabilities, as path metrics in the log domain, to each state:

pm0,0 = 0 ;   pm0,k = −∞ for k ≠ 0
Then, the branch metrics are computed at each state for message bits 0 and 1 and the corresponding parity bits.
[Figure 3.3: Trellis diagram for the VA. Code given by Pfb = [111], Pg = [101]. Panels (a)–(d) show states s0 –s3 at times i = 0 . . . L; branch metrics such as γ(s0,0 → s1,0 ) label the transitions, and in panel (d) the survival path and the decoded message are highlighted.]
where si,k denotes state k at time i:

γ (s0,0 → s1,0 ) = (ysi + Lai ) · 0 + ypi · 0
γ (s0,0 → s1,2 ) = (ysi + Lai ) · 1 + ypi · 1
γ (s0,1 → s1,2 ) = (ysi + Lai ) · 0 + ypi · 0
γ (s0,1 → s1,0 ) = (ysi + Lai ) · 1 + ypi · 1
γ (s0,2 → s1,3 ) = (ysi + Lai ) · 0 + ypi · 1
γ (s0,2 → s1,1 ) = (ysi + Lai ) · 1 + ypi · 0
γ (s0,3 → s1,1 ) = (ysi + Lai ) · 0 + ypi · 1
γ (s0,3 → s1,3 ) = (ysi + Lai ) · 1 + ypi · 0
The incoming path metrics for each state at time i = 1 are calculated by adding the
incoming branch metrics to the corresponding path metrics of the states at time i = 0
(figure 3.3.b).
For each state at time i = 1, the incoming branch with the greater incoming path
metric is kept. The new path metrics of these states are the surviving incoming path
metrics:

pm1,0 = max (pm0,0 + γ (s0,0 → s1,0 ) , pm0,1 + γ (s0,1 → s1,0 ))
pm1,1 = max (pm0,3 + γ (s0,3 → s1,1 ) , pm0,2 + γ (s0,2 → s1,1 ))
pm1,2 = max (pm0,1 + γ (s0,1 → s1,2 ) , pm0,0 + γ (s0,0 → s1,2 ))
pm1,3 = max (pm0,2 + γ (s0,2 → s1,3 ) , pm0,3 + γ (s0,3 → s1,3 ))
In figures 3.3.b and 3.3.c the survival branches are drawn thicker.
This algorithm is repeated from item 2 until time i = L − 1. Note that the final
states will be at i = L.
In order to find ŝ at this point, there are two possibilities: if the encoder was
terminated, the system should trace back from the state at which the encoder was
terminated (usually state 0) through all linked survival branches. If the encoder
was not terminated, the system should choose the state with the greatest path metric
and trace back from there. Each branch within the trellis has an associated message
bit m̂i . The set of those bits is the most probable message. This step is shown
in figure 3.3.d, where the survival path is colored in green.
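The steps above can be sketched end to end for the 4-state code of figure 3.3. This is an illustration only; the state encoding (r0, r1) and the transition function are assumptions consistent with the RSC encoder Pfb = [111], Pg = [101] described in chapter 2, and the branch metric uses Lc = 1 as in table 3.1.

```python
def next_state(state, m):
    """Transition of the RSC encoder Pfb = [111], Pg = [101].
    state = (r0, r1); returns (next_state, xs, xp)."""
    r0, r1 = state
    fb = m ^ r0 ^ r1
    return (fb, r0), m, fb ^ r1

def viterbi(ys, yp, La, terminated=True):
    """Forward pass and traceback of the VA over the 4-state trellis,
    using the branch metric ys*xs + yp*xp + La*mi (Lc = 1)."""
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]
    pm = {s: (0.0 if s == (0, 0) else float("-inf")) for s in states}
    history = []                       # per step: surviving (prev_state, bit)
    for i in range(len(ys)):
        new_pm = {s: float("-inf") for s in states}
        surv = {}
        for s in states:
            if pm[s] == float("-inf"):
                continue               # unreachable state, no branches out
            for m in (0, 1):
                ns, xs, xp = next_state(s, m)
                metric = pm[s] + ys[i] * xs + yp[i] * xp + La[i] * m
                if metric > new_pm[ns]:
                    new_pm[ns], surv[ns] = metric, (s, m)
        pm = new_pm
        history.append(surv)
    # trace back from state 0 if the trellis was terminated, otherwise
    # from the state with the greatest path metric
    state = (0, 0) if terminated else max(pm, key=pm.get)
    bits = []
    for surv in reversed(history):
        state, m = surv[state]
        bits.append(m)
    return bits[::-1]
```

Feeding the sketch a noiseless BPSK-mapped codeword recovers the message exactly, since the true path then has the strictly greatest metric.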
3.2.2
The Viterbi Algorithm is able to find the most probable sequence within the trellis and
hence its associated bits. Turbo coding techniques also demand that the SISO unit supply
soft output information. There are two well-known extensions of the Viterbi Algorithm
that produce soft output [11]. One was proposed by Battail [2] and is known as BR-SOVA.
The other one was proposed by Hagenauer [7] and is known as HR-SOVA. The latter is used
more often than the former, even though the BR-SOVA performs better in terms of BER,
because the HR-SOVA allows an easier hardware implementation. We will explain the
HR-SOVA extension and remark on its main idea.
Δi,k = | pm (k′ → k) − pm (k″ → k) |    (3.16)

where k is the next state of k′ and k″ , for a message bit mi ∈ {0, 1} respectively.
See Figure 3.4.a for reference.
Let j be a new time index in the range im < j ≤ i. At every time instant j, the
system compares the message bit of the survival path with the message bit of the
competing path. If they differ, then the reliability Δj has to be updated according to

Δj ← min (Δj , Δi,k )    (3.17)
In figure 3.4 a red square is placed on the branches that differ in the message bit.
The BR-SOVA also has an updating rule for the case where the message bit of the
survival path does not differ from the message bit of the competing path:

Δj ← min (Δj , Δi,k + Δcj )    (3.18)

This is the main difference between HR-SOVA and BR-SOVA. Nevertheless, this updating
rule implies knowledge of the bit reliabilities Δcj of the competing paths [11].
Once the system reaches the state where the survival path and the competing path
merge, it moves one time instant back, from i to i − 1, along the survival path
and traces back the competing path at that state once again. This process is shown
in figure 3.4.b. In the example, the system now starts at time i = L − 1 and the
corresponding state k = 0. In this case, the competing path and the survival path
now merge at time im = L − 5.
This algorithm continues from step 2 until time i = 1, thus allowing all the bit
reliabilities to be updated. Figure 3.4.c shows one more iteration with the aim of
clarifying this process.
Finally, soft output information is obtained in terms of LLR (Log-Likelihood Ratio)
as follows:

Λi = (2m̂i − 1) Δi ,   0 ≤ i ≤ L − 1    (3.19)
[Figure 3.4: Soft Output extension example for the Viterbi Algorithm. Code given by Pfb = [111], Pg = [101]. (a) Survival path and competing path at time i = L, state k = 0. (b) Survival path and competing path at time i = L − 1, state k = 0. (c) Survival path and competing path at time i = L − 2, state k = 1.]
where m̂i is the estimated message bit, m̂i ∈ {0, 1}. Note that (2m̂i − 1) only gives
the sign of Λi ; its magnitude is provided by Δi .
After explaining the previous algorithm, it is important to remark on the main idea of
the process. At a given time 0 ≤ i ≤ L − 1, the question to ask is: how reliable is the
message bit m̂i ? The extension for soft output indicates that the correctness of bit m̂i
can only be as good as the decision to choose the most likely path over the closest
competing path.
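The update rule 3.17 for a single traceback can be sketched as follows. This is an illustration only, with hypothetical inputs: in the real algorithm the bit streams and Δi,k would come from the traceback just described.

```python
def hr_sova_update(deltas, survivor_bits, competing_bits, delta_ik, im):
    """HR-SOVA reliability update (equation 3.17) for one competing path.

    For every j with im < j <= i (here i is the last index of `deltas`),
    the bit reliability shrinks to min(delta_j, delta_ik) wherever the
    competing path disagrees with the survivor on the message bit."""
    for j in range(im + 1, len(deltas)):
        if survivor_bits[j] != competing_bits[j]:
            deltas[j] = min(deltas[j], delta_ik)
    return deltas

out = hr_sova_update([5.0, 4.0, 3.0, 2.0], [1, 0, 1, 1], [1, 1, 1, 0],
                     delta_ik=1.5, im=0)
```

Positions where the two paths agree keep their reliability; the final soft output would then be Λj = (2m̂j − 1) Δj as in equation 3.19.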
3.2.3
The soft output generated by the HR-SOVA turned out to be overoptimistic [12]: the HR-SOVA algorithm produces an LLR that is greater in magnitude than the LLR produced by the BR-SOVA or by the MAP algorithm. These overoptimistic values of the LLR lead the HR-SOVA to a worse performance in terms of BER.
In [12] two problems associated with the output of the HR-SOVA are described. One
is due to the correlation between extrinsic and intrinsic information when the HR-SOVA
is used in a turbo code scheme. The other problem is due to the fact that the output of
the HR-SOVA is biased. The first problem is not easy to solve, and most of the hardware
implementations do not deal with it. In contrast, for the second problem there have
been several proposals that are based on a normalization method. The idea behind a
normalization method can be shown by assuming that the output of the HR-SOVA, given a message bit mi, is a random variable with a Gaussian distribution; then:

P[Λi | mi = 1] = 1/(σ√(2π)) · exp(−(Λi − μ)² / (2σ²)) dΛi,    (3.20)

P[Λi | mi = 0] = 1/(σ√(2π)) · exp(−(Λi + μ)² / (2σ²)) dΛi,    (3.21)

where μ is the expectation of Λi and σ = √(E[Λi²] − μ²) is the standard deviation. In order to find the LLR of the message bit mi, given the output of the HR-SOVA, we can define:
Λ′i = ln( P[mi = 1 | Λi] / P[mi = 0 | Λi] ),    (3.22)
using Bayes' theorem, assuming P[mi = 1] = P[mi = 0], and working on the previous expression with 3.20 and 3.21, yields:

Λ′i = (2μ / σ²) · Λi,    (3.23)

which indicates that the HR-SOVA output should be multiplied by the factor c = 2μ/σ² to obtain the LLR.
The factor c, according to [12], depends on the BER at the decoder output. Some schemes try to estimate the factor c, while others set a fixed value for it. In our hardware implementation we will use a fixed scaling factor, since it has been reported in [10] that the BER performance achieved with a fixed scaling factor is better than with a variable one.
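The fixed-scaling normalization amounts to a single multiplication per output. The sketch below is only illustrative; the value c = 0.7 is a hypothetical choice for the example, not a figure taken from this work.

```python
def normalize_llr(llrs, c=0.7):
    """Scale the overoptimistic HR-SOVA outputs by a fixed factor c,
    approximating c = 2*mu/sigma**2 (the value 0.7 is illustrative)."""
    return [c * llr for llr in llrs]

print(normalize_llr([4.0, -2.0], c=0.5))  # [2.0, -1.0]
```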
Chapter 4
Hardware Implementation of a
Turbo Decoder based on SOVA
In the previous chapter we introduced the general ideas of a turbo decoder and presented the HR-SOVA algorithm (from now on we will refer to it just as SOVA) as the active part of the SISO unit. In this chapter we will deal with the implementation issues and analyze today's most commonly used hardware architectures. Next we will introduce a new algorithm for finding points of the survival path and present the architecture that implements it. We will describe the unit that updates bit reliabilities and finally we will present the improvements that allow the decoder to boost the BER performance.
Figure 4.1 presents the general scheme. There are two blocks of RAM used as input and output buffers. There are also two more blocks of RAM used to store temporary data, such as a priori and extrinsic information. Then there is a unit that deals with the interleaving process, a unit to control the system and interact with the user, and finally the SISO unit that implements the SOVA algorithm. Note that we only use one SISO unit. This is possible because the interleaver/deinterleaver does not allow concurrent processing, so a frame has to be completed by one decoder before it can be processed by the other. In the proposed architecture, this processing is always done by the same decoder.
Data arriving at the receiver is processed and fed into the data-in RAM buffer; then a starting command is delivered to the control unit. The states the system goes through are shown in figure 4.2. The system starts by processing the interleaved data and, at the last iteration, it ends up with the deinterleaved data. This is done in order to save an access through the interleaver at the end of the decoding process, which also saves power and allows a simpler control unit. However, the system has to wait until the entire frame is received before decoding can take place.
Even though the same unit is used as decoder 1 and decoder 0, its behavior changes
slightly, depending on the role the unit is playing. We can summarize the following tasks
for each role:
SOVA unit is acting as decoder 1:
- When the SOVA unit addresses the data-in RAM buffer, its addresses belong to the interleaved domain. Since its addresses belong to the interleaved domain, in order to get systematic data it has to go through the deinterleaver.
- It can address parity data 2 directly.
- If the first iteration is running, then a priori information is assumed to be 0. Otherwise, it fetches a priori information through the deinterleaver from RAM La/Le.
- It writes extrinsic information directly to the RAM Le/La. This entails that, when acting as decoder 0, it has to access a priori information through the interleaver.
SOVA unit is acting as decoder 0:
- Its addresses belong to the deinterleaved domain, i.e. the domain where information bits are in order.
- It can access systematic data and parity data 1 directly from the data-in RAM buffer.
- The a priori information is accessed through the interleaver, since each word was written to an address in RAM Le/La that belongs to the interleaved domain.
- It writes extrinsic information directly to the RAM La/Le.
- It writes hard output directly to the data-out RAM buffer. This can be done at each iteration, allowing the user to check for a frame header, or only when running the last iteration, with the aim of saving power.
Figure 4.1: General scheme of the decoder: data-in and data-out RAM buffers, RAM La/Le and RAM Le/La, control unit, interleaving/deinterleaving unit and SOVA unit.
Figure 4.2: State diagram of the control unit: Idle, Deco 1, Deco 0 (repeated until the last iteration) and Done.
4.1
All the RAM buffers are based on double port RAMs. Figure 4.3 shows the scheme of the data-in RAM. Since the systematic data and parity data 2 belong to different time domains, two double port RAMs are used to store each kind of data. In figure 4.4 the scheme of the data-out RAM is shown. Finally, figure 4.5 presents the RAM La/Le and the RAM Le/La, which are equivalent.
Figure 4.3: Scheme of the data-in RAM (systematic/parity 1 RAM and parity 2 RAM).
Figure 4.4: Scheme of the data-out RAM.
Figure 4.5: Scheme of the RAM La/Le and the RAM Le/La.
4.2
There have been several proposals to design an area efficient interleaver. In [14], contention free interleavers that allow concurrent processing are studied. In our case, for the sake of simplicity and versatility, a ROM is used to carry out the interleaving/deinterleaving functions as look-up tables. Figure 4.6 shows the interleaving/deinterleaving unit. The figure also shows some control signals. The signal named deco indicates the role the SOVA unit is playing. Note that when working with deco=1, the address of the parity data 2 is
Figure 4.6: Interleaving/deinterleaving unit: interleaver and deinterleaver ROMs, RAM interfaces and control signals.
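The ROM-based look-up table approach can be modeled in a few lines. The sketch below is our own software model; the random permutation is a toy stand-in for the real interleaver contents, which depend on the turbo code design.

```python
import random

def make_tables(frame_size, seed=0):
    """Build interleaver/deinterleaver look-up tables, modeling the two
    ROMs. The permutation here is a toy example, not the thesis one."""
    rng = random.Random(seed)
    interleaver = list(range(frame_size))
    rng.shuffle(interleaver)                  # interleaver ROM contents
    deinterleaver = [0] * frame_size
    for addr, iaddr in enumerate(interleaver):
        deinterleaver[iaddr] = addr           # inverse permutation ROM
    return interleaver, deinterleaver

il, dil = make_tables(8)
# Looking up one ROM and then the other recovers the original address:
assert all(dil[il[a]] == a for a in range(8))
```

This mirrors how decoder 1 and decoder 0 reach each other's data: one reads addresses through the interleaver ROM, the other through the deinterleaver ROM, and composing the two lookups is the identity.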
4.3
Before getting into our hardware implementation of the SOVA algorithm, it is important to comment on some of today's most commonly used hardware architectures.
Since the SOVA algorithm is an extension of the Viterbi Algorithm, most of the main units have been based on the implementations developed for the Viterbi Algorithm. These architectures are complemented with reliability updating units to produce the soft output. Figure 4.7 shows a comparison between Viterbi decoders and SOVA decoders. Both decoders have a BMU (Branch Metric Unit), an ACSU (Add Compare Select Unit), and an SMU (Survival Memory Unit). However, the SOVA ACSU also has to provide the difference between path metrics, and the SMU includes an RUU (Reliability Updating Unit) that provides the soft output information. In the next subsections we will discuss the issues related to the SOVA components.
Figure 4.7: Comparison between a Viterbi decoder (BMU, ACSU, SMU) and a SOVA decoder (BMU, ACSU, SMU with RUU).
4.4
As its name suggests, this unit calculates the branch metrics. According to equation 3.14, the possible branch metrics depend on the xsi, xpi and mi bits. When working with an RSC encoder of rate 1/2, xsi = mi and there is only one parity bit xpi, which means that there are four possible branch metrics at each time instant i:

(xsi, xpi) = (0, 0) → γ0 = 0
(xsi, xpi) = (0, 1) → γ1 = ypi
(xsi, xpi) = (1, 0) → γ2 = ysi + Lai
(xsi, xpi) = (1, 1) → γ3 = ysi + Lai + ypi

The BMU for an RSC encoder of rate 1/2 is shown in figure 4.8.
Figure 4.8: BMU for the RSC encoder.
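The four metrics tabulated above can be sketched as a small function (a software model with hypothetical argument names, mirroring the table rather than the hardware adders):

```python
def branch_metrics(ys, yp, la):
    """Branch metrics of the rate-1/2 RSC code, one per (xs, xp) pair,
    as tabulated above.

    ys: received systematic sample, yp: received parity sample,
    la: a priori information for the bit."""
    return {
        (0, 0): 0,             # gamma_0
        (0, 1): yp,            # gamma_1
        (1, 0): ys + la,       # gamma_2
        (1, 1): ys + la + yp,  # gamma_3
    }

print(branch_metrics(ys=2, yp=3, la=1))
# {(0, 0): 0, (0, 1): 3, (1, 0): 3, (1, 1): 6}
```

In hardware only two adders are needed, since γ2 and γ3 share the ysi + Lai term.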
4.5
Applying equation 3.15 to the trellis diagram yields the following expression:

pmi,k = pmi−1,k′ + γ(si−1,k′ → si,k)

where k is the next state of k′ that produces the higher incoming path metric. The previous expression suggests that the path metric pmi,k can be obtained by recursion. In figure 4.9 an ACSU for the SOVA unit is presented.
The set of registers holds the previous path metrics. The branch metrics are mapped to the corresponding adders, according to the outputs during state transitions, to produce the incoming path metrics. These incoming path metrics are then connected to the selectors, which choose the higher incoming path metric and produce the decision vector along with the difference between incoming path metrics. The connections between adders and selectors represent the trellis butterfly.
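The Add-Compare-Select recursion can be sketched as follows. This is a simplified software model (our own function and argument names); the predecessor table and branch metrics are supplied by the caller according to the trellis.

```python
def acs_step(prev_pm, gamma, predecessors):
    """One Add-Compare-Select recursion step.

    prev_pm: path metrics pm[i-1][k'] for every state k'
    gamma[(k_prev, k)]: branch metric of the transition k_prev -> k
    predecessors[k]: the two states that reach state k
    Returns new path metrics, decision bits and metric differences."""
    pm, decisions, deltas = [], [], []
    for k, (k0, k1) in enumerate(predecessors):
        m0 = prev_pm[k0] + gamma[(k0, k)]   # incoming path metric 0
        m1 = prev_pm[k1] + gamma[(k1, k)]   # incoming path metric 1
        decisions.append(1 if m1 > m0 else 0)  # which branch survives
        pm.append(max(m0, m1))                 # pm_{i,k}
        deltas.append(abs(m0 - m1))            # difference used by the RUU
    return pm, decisions, deltas
```

The decision bits and the differences are exactly the two extra outputs the SOVA ACSU produces beyond a plain Viterbi ACSU.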
One problem that might arise is the overflow of the path metrics after a certain amount of time. Since the relevant information is the difference between path metrics, a normalization method can be adopted. Many normalization methods have been proposed since the introduction of Viterbi decoders. We find the modulo technique reported in [13] to be a good solution, since it actually allows the overflow.
The idea behind the modulo technique is that the maximum difference B between path metrics at all states is bounded. Figure 4.10 shows the mapping of all the numbers representable by the path metric register of nb bits onto a circumference.
Let ipm′i,k and ipmi,k be two incoming path metrics at a given time i and state k; it is shown in [13] that ipm′i,k > ipmi,k if ipm′i,k − ipmi,k > 0 in a two's complement representation context. The number of bits nb relates to the bound as follows:

C = 2^nb ≥ 2B
Figure 4.9: Add Compare Select Unit for the SOVA. Pfb = [111], Pg = [101].
This means that, even though the path metrics may grow in different ways, they all remain within half of the representation space provided by C. An appropriate bound is B = 2nBγ, where n is the minimum number of stages that ensures complete trellis connectivity among all trellis states and Bγ is the upper bound on the branch metrics [13].
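The modulo comparison can be sketched in software (a model of the technique in [13], with hypothetical names; hardware gets the same effect for free from an nb-bit subtractor):

```python
def mod_greater(a, b, nb=8):
    """Compare path metrics stored modulo 2**nb: a > b iff
    (a - b) mod 2**nb is positive when read as a two's complement
    number. Valid while the true difference stays below 2**(nb-1)."""
    diff = (a - b) % (1 << nb)
    if diff >= (1 << (nb - 1)):   # negative in two's complement
        diff -= (1 << nb)
    return diff > 0

# Metric 259 has wrapped to 3 in an 8-bit register, yet it still
# compares correctly against 250:
print(mod_greater(3, 250))   # True
print(mod_greater(250, 3))   # False
```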
Figure 4.10: Modular representation of the path metrics. Each path metric register has a width of nb bits.
Figure 4.11: Trellis diagram from a decoding process; all paths traced back from a given time i merge at time iFP.
4.6
The remaining SOVA units must obtain the soft output information for every bit in the frame, along with the most likely path. One way to do so is to store all the data the ACSU provides. Then, when the last time instant is reached, the data is traced back and the bit reliabilities are updated according to the SOVA algorithm. However, most hardware architectures do not do it that way, because the latency is high and the amount of memory grows considerably with the frame size, the number of states of the encoder and the quantization width of Δi,k.
Most of the SMUs take advantage of a trellis property to solve this problem. This property is illustrated in figure 4.11, where a trellis diagram from a decoding process is shown. If all the paths are traced back from all the states at a given time i, it is found that they merge at time instant iFP. Therefore, from time instant iFP down to i = 1 the only path remaining in the trace
Figure 4.12: Register Exchange SMU for the SOVA. Pfb = [111], Pg = [101].
back, started at time i, is the survival path. We define the time instant, along with the state where the paths merge, as a FP (Fusion Point). Then, looking at the example of figure 4.11, for time instant i there is a FP at (iFP, s3). Simulations have shown that the distance between the time instant i and the FP at iFP is a random variable. It is also observed that the probability of the paths merging increases with the depth of the trace back and is proportional to the constraint length of the code. Thus, a trace back depth of 10 times the constraint length of the code might allow the paths to merge with high probability. Below we describe the most commonly used architectures based on this property.
4.6.1
The RE (Register Exchange) SMU for an RSC encoder of rate 1/2 is shown in figure 4.12. This scheme is reported in [9]. It is an array of PEs (Processing Elements) of n rows and D columns, where n is the number of states of the encoder and D is the trace back depth. The connection topology between PEs is given by the trellis of the encoder. In figure 4.12, two types of PEs can be distinguished. The first U PEs (red outline), besides tracing back the paths, update the bit reliabilities. In figure 4.13.a a PE with updating capability is shown; in figure 4.13.b a normal PE is shown. The system allows the trace back of all the paths from the states at time instant i. The ACSU provides the data that enters the RE from the left. The first U units update the bit reliabilities of each path according to the SOVA algorithm. Each row of the array holds the information of one path. For example, the first row holds the information corresponding to the path traced back from state 0 at time i, the second row holds the information corresponding to the path traced back from state 1 at time i, and so on. After D clock cycles, if D is large enough to allow the paths to merge, the message bit and its reliability are obtained. Note that if the paths merge before D, then the data coming out of the rows, at all states, is the same, since the tails of all the paths belong to the survival path. Therefore only data from one row is selected.
Parameters U and D represent a trade-off. Some architectures set U in a range from two to five times the constraint length of the code, while D is set between five and ten times the constraint length of the code. If U and D are large, the BER performance increases, but so do power consumption and area. The area increase is also due to the resources spent on the connections, which becomes a serious problem as the number of states of the encoder grows. If U is large and D is not, then resources are spent worthlessly, since the BER performance is not increased. The same happens if D is large while U is short, or when
Figure 4.13: (a) PE with updating capability; (b) normal PE.
4.6.2
The RE scheme presents one major problem that leads to high power consumption: all the paths are traced back D steps. The idea behind the SA (Systolic Array) is to trace back only one path; after D steps, however, this path will merge with the survival path and become the path we are looking for. The SA is presented in [15].
Figure 4.14.a introduces the scheme of the SA for an RSC encoder of rate 1/2 with four states. The figure only shows the SMU for the VA. It is composed of an array of elements arranged in n rows and 2D columns, where n is the number of states of the encoder.
Figure 4.14: Scheme of the Systolic Array SMU for the VA (trace back elements and selection unit).
Figure 4.16: Two Step idea. First tracing back, and then reliability updating.
each state must have access to all the information about the path metric differences and decision vectors for that particular time.
These issues make the SA not a good choice for a complete SOVA based decoder. However, the SA has been used in [17] as a reliability updating unit in a Two Step configuration.
4.6.3
This scheme was proposed in [9] with the intention of discarding all the operations that do not affect the output. The idea is to postpone the updating process until the survival path is found. Figure 4.16 shows this concept. The first D steps intend to find the survival path, while the remaining U steps update the bit reliabilities. A FIFO (First In, First Out) memory is usually employed to delay the path metric differences, along with the decision vectors, until the updating process begins. The SMU we propose in this document is actually a Two Step configuration. However, we introduce a new scheme for finding the survival path.
Figure 4.17: General scheme of the proposed SMU: dual port RAM, FPU and RUU.
4.6.4
Other Architectures.
Many architectures and schemes have been proposed in recent years. In [4], different SMUs for the VA are studied and compared. In [6], a trace back architecture based on an orthogonal memory is presented. However, all these schemes deal with a finite trace back depth D and with a finite updating length U, which leads to a non-optimum algorithm execution. In the next subsection we will introduce a new architecture for the SOVA algorithm that does not depend on the D-U trade-off.
4.6.5
So far, two of the most common schemes have been studied: the RE and the SA. Both of them carry out a trace back with the aid of a pipeline architecture. The size of this pipeline has an impact on the area, power consumption and BER performance. One of the contributions of this work is a new type of architecture, based on a new algorithm, and the development of the architecture that implements it. The major advantage of this new scheme is that it is independent of the D-U trade-offs and allows recursive processing, which lessens the register activity.
The new architecture we propose to implement the SOVA algorithm deals, as its name suggests, with the Fusion Points. Figure 4.17 shows the general scheme. It consists of a FPU (Fusion Point Unit), which finds the time instant and the state where the survival paths merge; it is inside this unit that the new algorithm is implemented. There is a dual port RAM to store the data the ACSU provides, and finally there is a RUU that updates the bit reliabilities based on the information provided by the FPU.
The unit works as follows: the data the ACSU provides is stored in the dual port RAM, and the decision bits vi,k are also used by the FPU to implement the FP search algorithm.
Figure 4.18: Merging paths as possible Fusion Points.
This unit finds the Fusion Points along the trellis for a code with rate 1/2 by means of a new algorithm. The algorithm is based on the idea that a fusion point for a code of rate 1/2 will always reside at the merging point of two paths. Figure 4.18 shows these possible fusion points. The following thought explains the previous idea: whenever a trace back operation takes place, the system traces back from a given time instant i; while tracing back, paths merge, at different time instants, in groups of two. The last of these two-path merging points is a Fusion Point. Therefore a FP will always reside at the merging point of two paths.
The following steps, along with the example of figure 4.19, introduce part of the algorithm:
- Decision vectors coming from the ACSU are used to identify the merging paths, or possible fusion points (figure 4.19.a).
- Each possible fusion point is marked. Whenever a mark is set, the mark time and state are held in registers (figure 4.19.a).
- The mark is propagated along the branches to the next states (figure 4.19.b), one propagation at every clock cycle.
- If a mark propagates to all the states at a given time, then the origin of that mark is a fusion point. The fusion point coordinate is held by the register and can be recalled immediately (figure 4.19.c).
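The mark-propagation search can be sketched as a simplified software model. Unlike the hardware, which handles all marks jointly in shared registers, this sketch (our own function names and a four-state successor table assumed for illustration) propagates each mark independently:

```python
def find_fusion_points(merge_points, successors, n_states, t_max):
    """Simplified software model of the mark-propagation FP search.

    merge_points: (time, state) pairs where two paths merge, taken
    from the decision vectors; successors[k]: states reachable from
    state k. A mark set at a merging point is propagated one trellis
    stage per step; if it covers all states, its origin is a FP."""
    found = []
    for (t0, k0) in merge_points:
        covered = {k0}
        for t in range(t0, t_max):
            covered = {s for k in covered for s in successors[k]}
            if len(covered) == n_states:
                found.append(((t0, k0), t + 1))  # origin, detection line
                break
    return found

# Four-state trellis: a mark set at (0, s0) covers all states two
# stages later, so its origin is a FP with detection line at i = 2:
succ = {0: [0, 2], 1: [0, 2], 2: [1, 3], 3: [1, 3]}
print(find_fusion_points([(0, 0)], succ, 4, 5))  # [((0, 0), 2)]
```

The printed result matches the walkthrough below, where the green mark set at (0, s0) becomes a fusion point detected at i = 2.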
After introducing the mark movements, figure 4.20 shows a sequence example where more
Figure 4.19: Mark setting and propagation example on a four-state trellis (i = 0 to i = 8).
than one mark is handled at the same time. In the figure, two columns can be appreciated. The left column indicates the time instant the system is processing, along with the system status. The status is composed of three pointers able to hold the times and states of FPs. The first two pointers hold the possible FPs detected, while the third pointer
Figure 4.20: Sequence of the FP search algorithm from i = 0 to i = 5. At each time instant the system status shows pointer 0, pointer 1 and the FP register.
indicates a FP. The right column shows the sequence from time i = 0 to time i = 5. The algorithm proceeds as follows:
- i = 0: a possible fusion point is detected at (0, s0). A green mark is set and propagated to the states (1, s0), (1, s3). Its coordinate (0, s0) is held in the pointer 0 register.
- i = 1: a possible fusion point is detected at (1, s0). A blue mark is set and propagated to the states (2, s0), (2, s3). Another possible fusion point is detected at (1, s2); a fuchsia mark is set and propagated to the states (2, s1), (2, s3). Since the green mark propagates to all the states at i = 2, its origin becomes a fusion point. The fusion point register is set with the data of pointer 0, which holds the coordinate of the green mark, and pointer 0 is freed. A green straight line across all the states at time i = 2 indicates the time the FP is detected. Note that even though the actual time instant is i = 1, the detection line of the FP is at i = 2. Before moving to the next time instant, the coordinates of the blue and fuchsia marks are stored in the pointer 0 and pointer 1 registers respectively. Note that the pointer 0 register was free when the fusion point was detected.
- i = 2: a possible fusion point is detected at (2, s0). A red mark is set and propagated to the states (3, s1), (3, s3). The fuchsia mark is propagated to the state (3, s2) and its pointer is freed; the reason will be explained later. The blue mark propagates to the states (3, s0), (3, s1), (3, s3). Before moving to the next time instant, the coordinate of the red mark is stored in pointer 1, since it is the only free pointer available.
- i = 3: a possible fusion point is detected at (3, s0). A yellow mark is set and propagated to the states (4, s0), (4, s2). The red mark is propagated to the state (4, s3) and its pointer is freed, for the same reason as the fuchsia mark pointer at the previous instant. The blue mark is propagated to (4, s0), (4, s2), (4, s3). Before moving to the next time instant, the coordinate of the yellow mark is stored in pointer 1, since it is the only free pointer available.
- i = 4: a possible fusion point is detected at (4, s0). A turquoise mark is set and propagated to the states (5, s0), (5, s2). Another possible fusion point is detected at (4, s2); a brown mark is set and propagated to the states (5, s1), (5, s3). Both the blue mark and the yellow mark propagate to all the states at i = 5. This means that the origin of the blue mark and the origin of the yellow mark are both fusion points. However, the point we are looking for is the closest FP to the time being processed, which in this case corresponds to the origin of the yellow mark at (3, s0). The reason is the definition of a FP: if the system traces back from time i = 5, it finds that all paths merge at (3, s0), so (3, s0) represents the point where all paths merge in a trace back operation from i = 5. The point (1, s0), corresponding to the origin of the blue mark, belongs to the survival path, but it does not represent a merging point for a trace back operation that starts at time i = 5. We can now extend the previous thought in the following way: suppose that two marks propagate to the same states. This means that, in the future, their propagations will always be
Figure 4.21: Scheme of the FPU: mark detection, mark propagation and mark processing blocks, with mark, state and address registers.
to detect possible FPs according to the trellis butterfly. There is a Mark Propagation block, which propagates the new marks and the stored marks along the trellis. There is a processing unit, which compares all the marks at its input and proceeds as follows:
- If there are two equal marks, then the one with the latest address is kept.
- If there is a mark with only one bit set, then its corresponding register is freed, since it has no chance of becoming a FP in the future.
- If there is a mark with all bits set, then a FP is indicated with its address and state.
Finally, there is a set of registers used to hold marks, addresses and state codes.
It is important to point out some major concerns:
- The algorithm can be computed by recursion.
- There are at most n/2 new possible FPs at each time instant, where n is the number of states of the encoder.
- Simulations have shown that, for an RSC encoder of rate 1/2, the amount of registers the FPU needs is: n − 2 registers of n + 1 bits to hold marks (the remaining bit is used to indicate whether the register is empty); n − 2 registers of K − 1 bits to hold state codes; and n − 2 registers of A bits to hold addresses, where A is the number of bits used to code the frame size.
- Since the processing unit compares all marks at the same time to see if there are equal marks, the number of XOR gates increases drastically with the constraint length of the code. However, it has been observed that Turbo Code schemes with encoders of short constraint length have better BER performance than those with large constraint length [18].
Comparing our approach with the previous implementations, we obtain the results of table 4.1 for an RSC code of rate 1/2, K = 3 and a message frame size of 1024. For a code with constraint length K = 3, a frame size of 2^A = 2^10 = 1024 bits, and a trace back depth of D = 5K, the RE SMU needs (5 · 3) · 4 = 60 one-bit registers, while the FPU needs (4 − 2)(4 + 1) + (4 − 2) · 2 + (4 − 2) · 10 = 34 one-bit registers. Also, the FPU will always find the correct FP, while the RE SMU might produce wrong results if the paths do not merge within the trace back pipeline. Another difference is that the RE outputs the symbol sequence of the survival path, while the FPU outputs the sequence of FPs that are spread along the trellis. However, in a turbo code scheme context, the RUU may take advantage of these FPs, as we will show in the next subsection.
                      REU                               FPU
One-bit registers     60                                34
Reliability           Depends on the trace back depth   Optimum
Output rate           One state per clock cycle         Random

Table 4.1: Comparison between the REU and FPU for a code with rate 1/2, K = 3 and a frame size 2^A = 2^10 = 1024.
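The register counts in the comparison above follow directly from the formulas in the previous list; the sketch below (our own helper names) reproduces the arithmetic:

```python
def re_registers(n, K, d_factor=5):
    """One-bit registers in the RE SMU: (trace back depth D = d_factor*K)
    times the number of states n."""
    return (d_factor * K) * n

def fpu_registers(n, K, A):
    """One-bit registers in the FPU: n-2 marks of n+1 bits, n-2 state
    codes of K-1 bits and n-2 addresses of A bits."""
    return (n - 2) * (n + 1) + (n - 2) * (K - 1) + (n - 2) * A

print(re_registers(4, 3))        # 60
print(fpu_registers(4, 3, 10))   # 34
```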
Figure 4.22: Example of the bit reliability releasing problem. The reliability of the bits at i = 2, 3 could depend on Δ4,0 or Δ4,2 through possible future competing paths.
4.7
Before getting into the hardware issues, it is important to highlight the main problem we face when updating bit reliabilities. Figure 4.22 illustrates one example. While processing data at time instant i = 4, a FP is found at (3, s0), colored green in the figure. The example shows the survival path and the competing path traced back from the FP until they merge. The blue branches indicate possible future branches of the survival path, while the red paths indicate possible future competing paths. The RUU could start to update bit reliabilities as soon as a FP is detected. However, figure 4.22 shows how the reliability of the bits at i = 2, i = 3 might depend on Δ4,0 or Δ4,2. The early release of those bit reliabilities leads to a non-optimum SOVA algorithm execution.
One solution to this problem is illustrated in figure 4.23. The idea is to trace back U steps to allow all the competing paths that start after time i to merge. After U steps, the remaining bit reliabilities can be released. However, this solution introduces the U factor, which is a trade-off between BER performance and power consumption. It has no impact on the area since, as we will show later, bit reliabilities are updated recursively. Even so, the introduction of the U factor leads to a non-optimum SOVA algorithm execution.
The solution we adopted is introduced by the example of figure 4.24. By time i, two FPs have been detected. Since the second FP resides after the detection line of the first one, the updating process takes place starting from the second FP. Once the first FP is reached, the system continues updating and releasing the bit reliabilities. The fact that
Figure 4.23: One possible solution to the problem of bit reliabilities releasing.
Figure 4.24: Solution adopted for the bit reliabilities releasing problem. Bit reliabilities before iFP1 will not be affected by future competing paths and can therefore be released; paths traced back from any instant after iDFP1 will merge at iFP1.
the second FP needs to reside after the detection line of the first one is due to the concept that any path traced back after the detection line will merge at the FP of that detection line. Therefore, any future competing path of the survival path will, at most, merge at the first FP, and will not affect the bit reliabilities before the first FP.
We can generalize this solution in an algorithm as follows:
1. Wait for the first FP provided by the FPU.
2. Wait for the second FP.
3. If the second FP is detected after the detection line of the first one, then proceed with the updating process.
4. If the second FP is detected before the detection line of the first FP, then wait for one more FP:
   - If the third FP resides after the detection line of the second FP, then proceed with the updating process using the information of the second and third FPs.
   - If the third FP does not reside after the detection line of the second FP, but does reside after the detection line of the first one, then the updating process proceeds with the information of the first and third FPs.
   - If the third FP does not reside after the detection line of either of the other two FPs, then the third FP is discarded and the RUU continues from step 4.
5. When the updating process finishes, the last FP becomes the first FP, and the process is repeated from step 2.
6. If the end of the frame is reached by the ACSU, then the RUU is interrupted and begins to update the bit reliabilities from the end.
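The decision logic of steps 2-4 can be sketched as follows. This is a simplified software model (our own function name; each FP is represented as a hypothetical (time, detection_line) tuple):

```python
def select_fp_pair(first, second, third=None):
    """Steps 2-4 of the algorithm above. Returns the FP pair used for
    the updating process, or None when one more FP is still needed.
    Each FP is a tuple (time, detection_line)."""
    if second[0] > first[1]:
        return (first, second)           # step 3
    if third is None:
        return None                      # step 4: wait for one more FP
    if third[0] > second[1]:
        return (second, third)
    if third[0] > first[1]:
        return (first, third)
    return None                          # third FP discarded; keep waiting

print(select_fp_pair((3, 5), (7, 9)))          # ((3, 5), (7, 9))
print(select_fp_pair((3, 5), (4, 6), (7, 8)))  # ((4, 6), (7, 8))
```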
Figure 4.25.a presents the general scheme of the RUU. There is a state machine which controls the unit and carries out the previous algorithm. The registers on the left of the figure hold FP state codes, FP addresses and FP detection lines, which are used to address the RAM block and control the updating process. The lastState unit calculates the previous state in the trellis based on the current state and the decision bit for that state; this unit actually performs, at each clock cycle, the trace back of the survival path. The current state is used to drive the multiplexers that select the message bit associated with the survival path and the difference between the metric of the survival path and a competing path. These elements are fed into the recursive updating unit, which calculates the reliability magnitudes of the bits, Δi.
The term Lep_i is stored in the RAM block together with the decision bits v_{i,k} and Δ_{i,k}. This term is equivalent to:

Lep_i = y_i^s + La_i

which is used to calculate the final extrinsic information Le_i:

Le_i = Λ_i - L_c y_i^s - La_i = Λ_i - (y_i^s + La_i) = Λ_i - Lep_i
The term Lep_i is calculated when y_i^s and La_i are available, at branch metric time, because it saves clock cycles when computing Le_i. Not doing it at that time would require accessing the data-in RAM buffer and the RAM La/Le-Le/La again. Besides, the access would have to be done through the interleaving/deinterleaving unit, which might be in use. The calculation of Le_i is done in the following way: the recursive unit outputs |Λ_i|, the magnitude of Λ_i. The bit m_i gives the sign to Λ_i. Since a two's complement representation is used, the bit m_i indicates whether to complement |Λ_i| or not. Then we have:
Le_i = |Λ_i| + (0 - Lep_i)         if m_i = 1
Le_i = not(|Λ_i|) + (1 - Lep_i)    if m_i = 0
The operation in parentheses is done first and its result is delayed until |Λ_i| comes out of the recursive unit. This distributes the combinational delays among the registers. The resulting Le_i is stored in the RAM La/Le-Le/La, depending on the decoder.
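This sign-and-subtract step can be sketched in Python, assuming an 8-bit two's complement datapath; the bit width and the function names are illustrative, not taken from the design.

```python
MASK = 0xFF  # illustrative 8-bit two's complement datapath


def to_signed(x):
    """Interpret an 8-bit pattern as a signed two's complement value."""
    return x - 0x100 if x & 0x80 else x


def extrinsic(mag, m_i, lep_i):
    """Apply the sign given by m_i to the magnitude |Lambda_i| and
    subtract Lep_i, using the two-step trick described above: the
    parenthesised term is computed first, then added to the (possibly
    complemented) magnitude.
    """
    if m_i == 1:
        # Le_i = |Lambda_i| + (0 - Lep_i)
        return to_signed((mag + ((0 - lep_i) & MASK)) & MASK)
    # Le_i = not(|Lambda_i|) + (1 - Lep_i)  ==  -|Lambda_i| - Lep_i,
    # since in two's complement -x = not(x) + 1
    return to_signed(((mag ^ MASK) + ((1 - lep_i) & MASK)) & MASK)
```

For example, with |Λ_i| = 5, m_i = 0 and Lep_i = 3, the result is -5 - 3 = -8, matching the direct computation Λ_i - Lep_i.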
The recursive updating unit is shown in figure 4.26. This unit updates bit reliabilities by managing all competing paths at once. In the scheme there is a set of registers that holds the different Δs, one for each state; these are propagated to the corresponding states as the trace back progresses.
[Figure 4.25: RUU general scheme: state machine, FP registers, lastState unit, recursive updating unit and RAM SMU.]
[Figure 4.26: Recursive updating unit: MIN/MAX network that propagates the Δ values among the trellis states.]
[Trellis example figure: survival path and competing paths for time instants i = 1 to i = 10; the propagated value is the minimum of the participating Δs.]
The unit begins at time instant i = 10. The orange Δ is fed into the system through the multiplexer from state 1. At the same time, a minimizing process is started with this orange Δ and the remaining registers. The orange Δ is then sent to state 3.
At time i = 9, the blue Δ is fed into the system through the multiplexer from state 2. The orange Δ from state 3 and the blue Δ from state 2 participate in the minimizing process. The blue Δ is sent to state 1, while the orange Δ is sent to state 3 again.
At time i = 8, the fuchsia Δ is fed into the system through the multiplexer from state 0. Now there are three Δs participating in the minimizing process. Finally, the orange, blue and fuchsia Δs are sent to state 2, state 3 and state 0 respectively.
The remaining steps proceed in the same way. Note that, at time i = 6, two competing paths merge at state 2. For this example the blue Δ is assumed to be smaller than the turquoise one, and that is why it is kept.
Before moving on to the next section it is important to discuss some throughput issues. In figure 4.24 we can see that the RUU updates some distance before it can release the final bit reliabilities. Therefore, if we think of the time distance between fusion points as a random variable with mean D, the RUU processes about 2D time instants for each FP detected by the FPU. This means that the FIFO input data rate will be higher than the FIFO output data rate and the FIFO will eventually get full. If the FIFO gets full, the RUU misses some FPs; however, this is not as bad as it seems, since the algorithm that manages the FPs remains valid.
Let D_R denote the number of bits remaining to be updated when the ACSU reaches the end of the frame. Then the throughput of the SOVA SISO can be estimated by

TH_SISO = (L / (L + D_R)) f [bps]    (4.1)
where L is the frame size and f is the system frequency. It is straightforward that, to increase the system throughput, D_R should be reduced. This can be achieved by increasing the working frequency of the RUU, so that it processes more FPs per time unit and, at the end of the frame, fewer bits remain to be updated.
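Equation (4.1) can be sketched directly; the function name is illustrative.

```python
def siso_throughput(L, D_R, f):
    """Estimated SOVA SISO throughput, eq. (4.1):
    TH = L / (L + D_R) * f  [bps], where L is the frame size, D_R the
    number of bits still to be updated at frame end, and f the system
    clock frequency. Reducing D_R (e.g. by clocking the RUU faster)
    moves the throughput toward f."""
    return L / (L + D_R) * f
```

For instance, with L = 1024 and D_R = 0 the SISO sustains exactly f bps, and the throughput decreases monotonically as D_R grows.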
4.8 Control Unit
We finally present the design of the control unit, which is basically a finite state machine that delays and synchronizes modules. Figure 4.28 shows the scheme. There are two counters: one is responsible for the frame address count and the other for the iteration count. The iteration counter is first loaded with the number of iterations indicated by the user. Figure 4.29 shows the state diagram that the entire system goes through. Once the user drives the go signal high, the system begins to work. It first initializes the units and progressively activates the corresponding modules before settling into the decoding state.
[Figure 4.28: Control unit scheme: state machine, iteration counter and bit counter.]
[Figure 4.29: System state diagram: Idle, Initializing Modules, Decoding and Finishing.]
4.9 Improvements
The most common implementations of the SOVA decoder only update bit reliabilities by the HR-SOVA rule described in 3.2.2. A BR-SOVA updating rule would be desirable, since it has been proved in [5] that the Max-Log-MAP algorithm and the BR-SOVA are equivalent, and the Max-Log-MAP algorithm performs better in terms of BER than the HR-SOVA. However, the BR-SOVA updating rule requires knowledge of the bit reliabilities of the competing paths, which implies a higher decoder complexity. This is the reason why we do not implement a strict BR-SOVA; instead, we approximate its behavior by introducing a bound for the bit reliability of the competing path, as shown below.
The BR-SOVA and HR-SOVA updating rules are the same when the estimated bit and the competing bit differ. In contrast, the following equations recall the updating rule of each algorithm when the estimated bit and the competing bit coincide.
BR:  Λ_j ← min(Λ_j, Δ_{i,k} + Λ_j^c)
HR:  Λ_j ← Λ_j
(4.2)
That is why we can think of the HR-SOVA as a BR-SOVA with an unbounded Λ_j^c. The improvement proposed in this work is to bound Λ_j^c by a known value. When working with a binary RSC code, the two incoming branches at any state of the trellis diagram are associated with different message bits. Therefore, the path metric difference is actually a bound for the reliability of those message bits. The resulting updating rule becomes:
Λ_j ← min(Λ_j, Δ_{i,k})            if m̂_i ≠ ĉ_i
Λ_j ← min(Λ_j, Δ_{i,k} + Δ_j^c)    if m̂_i = ĉ_i

where Λ_j is the reliability of bit j; Δ_{i,k} is the path metric difference between the competing path and the survival path; m̂_i is the estimated message bit; ĉ_i is the estimated message bit associated with the competing path; and Δ_j^c is the path metric difference, at the state at time j, that belongs to the competing path.
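The two rules can be sketched side by side in Python; the function and argument names are illustrative, and floating point reliabilities are assumed.

```python
def hr_update(lam_j, delta_ik, m_hat, c_hat):
    """HR-SOVA rule: the reliability of bit j is updated only when the
    estimated bit m_hat and the competing bit c_hat differ."""
    return min(lam_j, delta_ik) if m_hat != c_hat else lam_j


def brap_update(lam_j, delta_ik, m_hat, c_hat, delta_c_j):
    """BR-SOVA approximation: when the bits coincide, the unknown
    competing-path bit reliability is replaced by its bound delta_c_j,
    the path metric difference along the competing path at time j."""
    if m_hat != c_hat:
        return min(lam_j, delta_ik)
    return min(lam_j, delta_ik + delta_c_j)
```

Note that with an unbounded delta_c_j the second branch never lowers lam_j, and brap_update degenerates into hr_update, as argued above.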
Figures 4.30 and 4.31 show the modified RUU and the modified recursive updating unit, respectively, which allow the previous rule to be executed. Note that the main difference is the handling of all the Δ values, since they represent the bounds for the competing bit reliabilities.
[Figure 4.30: Modified RUU for the BR-SOVA approximation.]
[Figure 4.31: Modified recursive updating unit handling the Δ bounds.]
Chapter 5
Methodology
The whole practical design process was carried out with the aid of powerful software tools. Three tools were mainly employed in this thesis:
Matlab 7.1. The mathematics software package Matlab was extensively used in the simulation and verification of the design. It was employed to model the whole communication system: encoder, channel, receiver and decoder. We also used Matlab for the HIL (Hardware In the Loop) verification of the design, carried out by establishing a serial port communication with an interface circuit specifically developed for testing purposes.
Xilinx ISE 8.2. The Xilinx synthesis software package ISE 8.2 was used in all the implementation tasks, specifically mapping, translation, placement and routing, along with back annotation and static timing analysis. The FPGA programmer iMPACT, also included in this package, was used to download our design into the Xilinx Spartan III FPGA.
ModelSim 6.1. VHDL code and Post-Place and Route models were simulated with
this tool.
Figure 5.1 summarizes the work flow. Five steps took place, with some feedback between them. On the rightmost part we have the fundamental stages of this process, whereas on the leftmost part the verification tasks associated with each stage are displayed. The blue boxes show the main tool employed in the related task. We now describe the stages of the process:
Information gathering. A considerable number of papers and journal articles were collected. They allowed us to understand the main problem and to focus on some specific aspects of the subject.
Specification. The specification of this work consisted of the design and implementation of a SOVA based Turbo Decoder.
High Level Design. A high level model was programmed using MATLAB 7.1. This model allowed us to test the system in different environments and also to fine tune the design specifications cited in step two.
[Figure 5.1: Work flow: information gathering, design specifications, high level design (Matlab), VHDL implementation and VHDL synthesis, together with the associated verification tasks: behavioral verification and post-place & route model verification (ModelSim) and in-circuit verification (Matlab).]
[Figure 5.2: HIL verification setup: Matlab implements the source, channel coding, BPSK modulation, AWGN discrete channel, BER calculation and sink, while the channel decoding runs on the FPGA, connected through an interface unit.]
Chapter 6
6.1 Quantization Scheme
The quantization scheme is presented in table 6.5. The same quantization scheme is used in all the tests; it has been adopted from [18].

[Table 6.5: Quantization scheme for the received symbols y_i, the extrinsic information, the path metrics and the Δs.]
[Figure 6.1: Quantization effect on the system BER performance for the Δ schemes 4:2, 6:2 and 8:2. BR-SOVA approximation scheme, simulation with quantization, MCF, P_fb = [111], P_g = [101].]
The only quantization study carried out concerns the path metric difference Δ, which has a significant impact on the system BER performance. Figure 6.1 shows the BER curve against the received signal SNR. It can be observed that, for the current example, the 4:2 scheme performs better than the 6:2 and 8:2 schemes. This behavior was reported in [11] as a method of improving the system BER performance: since quantization saturates the Δs, the overoptimistic values of the bit reliabilities are lessened and consequently the BER performance improves. Note that adopting the reduced quantization scheme yields further benefits: the RAM that stores the data from the ACSU is reduced, and the logic related to the RUU is reduced as well.
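The saturation effect can be sketched with a simple uniform quantizer. We assume here that an n:f scheme denotes n total bits with f of them fractional, in two's complement; the function name and this reading of the notation are our own.

```python
def quantize(x, total_bits, frac_bits):
    """Saturating uniform quantizer for an n:f scheme (assumed: n total
    bits, f fractional, two's complement). Values outside the
    representable range saturate, which is the effect that clips the
    overoptimistic Delta values discussed above."""
    step = 2.0 ** -frac_bits
    q = round(x / step)
    qmax = 2 ** (total_bits - 1) - 1   # largest representable code
    qmin = -2 ** (total_bits - 1)      # most negative code
    q = max(qmin, min(qmax, q))        # saturation
    return q * step
```

With the 4:2 scheme, for example, any Δ above 1.75 saturates to 1.75, so large Δs can no longer push the bit reliabilities to overoptimistic values.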
6.2 Synthesis Results
Tables 6.2 and 6.3 present the synthesis results for the short pair of polynomials and the
UMTS polynomials respectively. Both pairs of polynomials were synthesized with the
quantization scheme given in table 6.5.
Observation                                          HR            BRap          Resources
Logic Utilization
  Number of Slice Flip Flops                         720 (18%)     752 (19%)     3840
  Number of 4 input LUTs                             776 (20%)     803 (20%)     3840
Logic Distribution
  Number of occupied Slices                          677 (35%)     674 (35%)     1920
  Number of Slices containing only related logic     677 (100%)    674 (100%)
  Number of Slices containing unrelated logic        0             0
  Total Number of 4 input LUTs                       789 (20%)     816 (21%)     3840
    Number used as logic                             776           803
    Number used as a route-thru                      13            13
Number of Block RAMs                                 10 (83%)      10 (83%)      12
Number of MULT18X18s                                 1 (8%)        1 (8%)        12
Number of GCLKs                                      4 (50%)       4 (50%)       8
Total equivalent gate count for design               671207        671658

Table 6.2: Synthesis results for the short pair of polynomials.
Observation                                          HR            BRap          Resources
Logic Utilization
  Number of Slice Flip Flops                         1045 (27%)    1108 (28%)    3840
  Number of 4 input LUTs                             2067 (53%)    2096 (54%)    3840
Logic Distribution
  Number of occupied Slices                          1256 (65%)    1329 (69%)    1920
  Number of Slices containing only related logic     1256 (100%)   1329 (100%)
  Number of Slices containing unrelated logic        0             0
  Total Number of 4 input LUTs                       2082 (54%)    2111 (54%)    3840
    Number used as logic                             2067          2096
    Number used as a route-thru                      15            15
Number of Block RAMs                                 11 (91%)      11 (91%)      12
Number of MULT18X18s                                 1 (8%)        1 (8%)        12
Number of GCLKs                                      4 (50%)       4 (50%)       8
Total equivalent gate count for design               748769        749432

Table 6.3: Synthesis results for the UMTS pair of polynomials.
[Figure 6.2: BER performance of the HR-SOVA and the BR-SOVA approximation at iterations 1, 3, 5 and 8, together with the Max-Log-MAP at iteration 8. Floating point simulation.]
[Critical path: ACSU and FPU.]
6.3
Before getting into the HIL results, we discuss the BER performance of the BR-SOVA approximation, which is shown in figure 6.2. These results were obtained by simulation with a floating point numeric representation. We observe that, for an error probability of 10^-4, the BR-SOVA approximation gains 0.3 dB over the HR-SOVA at the eighth iteration. For an error probability of 10^-5 the gain is only 0.23 dB at the eighth iteration. We also observe that, for higher SNRs, the curves begin to converge and the distance between them gets shorter.
Figure 6.3 exhibits the real system BER performance when implementing the HR-SOVA for the short pair of polynomials. The figure illustrates the comparison between the hardware implemented HR-SOVA and the floating point simulations. Note that the real HR-SOVA performs better; this is due to the quantization effect explained in 6.1.
[Figure 6.3: HIL BER performance of the HR-SOVA versus the floating point simulations (Inf.Pre) at iterations 1, 5 and 8, with the Max-Log-MAP at iteration 8 as reference.]
[Figure: HIL BER comparison at iteration 8: HR-SOVA, BR-SOVA approximation, and Max-Log-MAP with quantization and with floating point precision.]
[Figure: HIL BER performance of the HR-SOVA and the BR-SOVA approximation at iterations 1, 3, 5 and 8.]
6.4 Throughput Results
In this section we investigate the effect of running the RUU at higher frequencies and its impact on the system throughput. A DCM (Digital Clock Manager) was used to generate the corresponding frequencies.
Figures 6.8, 6.9 and 6.10 show the throughput histogram statistics for the frequency ratios f_RUU = f, f_RUU = 2f and f_RUU = 3f respectively, for the short pair of polynomials. The statistics were generated with 50000 samples. We observe that the throughput increases with the RUU working frequency, as expected.
In a real application context, the system has to guarantee a constant throughput, so it could be set to one of the minimum intervals observed in the histograms. These values are summarized in table 6.6. We can also think of a power saving benefit: according to the figures, the system will usually work faster than the guaranteed throughput, so when it finishes execution it goes to an idle state until a new set of data arrives. During this idle state no activity is performed in the circuit, which supposes an important reduction in the power consumption.
Figures 6.11, 6.12 and 6.13 show the same throughput histogram statistics, this time for the UMTS pair of polynomials. We observe the same effect as with the short pair. However, we notice a slight difference in the statistics between them. This is due to the rate at which FPs appear, which is higher for higher constraint lengths.
                       f_RUU = f        f_RUU = 2f       f_RUU = 3f
Short Polynomials      0.5259f [bps]    0.8258f [bps]    0.9543f [bps]
UMTS Polynomials       0.5270f [bps]    0.8308f [bps]    0.9399f [bps]

Table 6.6: Guaranteed SISO throughput.
[Figure 6.8: Throughput statistics. f = 25 MHz, f_RUU = 25 MHz. P_fb = [111], P_g = [101].]
[Figure 6.9: Throughput statistics. f = 25 MHz, f_RUU = 50 MHz. P_fb = [111], P_g = [101].]
[Figure 6.10: Throughput statistics. f = 16.66 MHz, f_RUU = 50 MHz. P_fb = [111], P_g = [101].]
[Figure 6.11: Throughput statistics. f = 25 MHz, f_RUU = 25 MHz. P_fb = [1011], P_g = [1101].]
[Figure 6.12: Throughput statistics. f = 25 MHz, f_RUU = 50 MHz. P_fb = [1011], P_g = [1101].]
[Figure 6.13: Throughput statistics. f = 16.66 MHz, f_RUU = 50 MHz. P_fb = [1011], P_g = [1101].]
6.5 Power Results
The power consumption has been estimated by simulation. Table 6.7 summarizes the results. The system frequencies were set to f = 25 MHz and f_RUU = 50 MHz. The simulation test bench was carefully designed to guarantee a SISO throughput of 0.8f = 20 Mbps, which is feasible according to figures 6.9 and 6.12. We observe a dynamic power consumption of (22 - 12) = 10 mW for the short pair of polynomials. The dynamic power consumption rises to (29 - 12) = 17 mW when working with the UMTS polynomials. This effect was expected, since the area increase is about 50% when jumping from four states to eight. Table 6.7 only shows the power consumption of the BR-SOVA approximation, since the difference between the BR-SOVA approximation scheme and the HR-SOVA scheme is negligible.
Observation                                 Short Polynomials    UMTS Polynomials
Total estimated power consumption [mW]      47                   54
  Vccint 1.20V                              22                   29
  Vccaux 2.50V                              25                   25
  Vcco25 2.50V                              0                    0
Clocks                                      6                    6
Inputs                                      1                    1
Outputs                                     2                    4
  Vcco25                                    0                    0
Signals                                     2                    5
Quiescent Vccint 1.20V                      12                   12
Quiescent Vccaux 2.50V                      25                   25

Table 6.7: Estimated power consumption. BR approximation. f = 25 MHz, f_RUU = 50 MHz.
Chapter 7
Bibliography
[1] Sorin Adrian Barbulescu. What a wonderful turbo world... E-book, 2004.
[2] G. Battail. Pondération des symboles décodés par l'algorithme de Viterbi. Ann. Télécommun., 42:31-38, January 1987.
[3] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon Limit Error-Correcting Coding and Decoding: Turbo-Codes. Proceedings of the IEEE International Conference on Communications, Geneva, Switzerland, May 1993.
[4] Gennady Feygin and P. G. Gulak. Architectural Tradeoffs for Survivor Sequence Memory Management in Viterbi Decoders. IEEE Transactions on Communications, 41:425-429, March 1993.
[5] Marc P. C. Fossorier, Frank Burkert, Shu Lin, and Joachim Hagenauer. On the Equivalence Between SOVA and Max-Log-MAP Decoding. IEEE Communications Letters, 2(5), May 1998.
[6] David Garret and Mircea Stan. Low Power Architecture of the Soft-Output Viterbi Algorithm. Proceedings of the 1998 International Symposium on Low Power Electronics and Design, pages 262-267, August 1998.
[7] Joachim Hagenauer and Peter Hoeher. A Viterbi Algorithm with Soft-Decision Outputs and its Applications. Proc. IEEE GLOBECOM, 3:1680-1686, November 1989.
[8] Pablo Ituero Herrero. Implementation of an ASIP for Turbo Decoding. Master's thesis, KTH, May 2005.
[9] Olaf Joeressen, Martin Vaupel, and Heinrich Meyr. High-Speed VLSI Architectures for Soft-Output Viterbi Decoding. Proc. IEEE ICASAP'92, Oakland, California, pages 373-384, August 1992.
[10] D. W. Kim, T. W. Kwon, J. R. Choi, and J. J. Kong. A Modified Two-Step SOVA-Based Turbo Decoder with a Fixed Scaling Factor. Proceedings of the 2000 IEEE International Symposium on Circuits and Systems (ISCAS 2000), Geneva, 4:37-40, May 2000.
[11] Lang Lin and Roger S. Cheng. Improvements in SOVA-Based Decoding for Turbo Codes. Proceedings of the 1997 IEEE International Conference on Communications (ICC '97), Montreal, 3:1473-1478, June 1997.
[12] Lutz Papke and Patrick Robertson. Improved Decoding with the SOVA in a Parallel Concatenated (Turbo-code) Scheme. IEEE International Conference on Communications, Conference Record, 1:102-106, June 1996.
[13] C. B. Shung, P. H. Siegel, G. Ungerboeck, and H. K. Thapar. VLSI Architectures for Metric Normalization in the Viterbi Algorithm. IEEE International Conference on Communications (ICC '90), Conference Record, 4:1723-1728, April 1990.
[14] Oscar Y. Takeshita. On Maximum Contention-Free Interleavers and Permutation Polynomials over Integer Rings. Submitted as a correspondence to the IEEE Transactions on Information Theory, April 2005.
[15] T. K. Truong, Ming-Tang Shih, Irving S. Reed, and E. H. Satorius. A VLSI Design for a Trace-Back Viterbi Decoder. IEEE Transactions on Communications, 40:616-624, March 1992.
[16] Matthew C. Valenti. Iterative Detection and Decoding for Wireless Communications. PhD thesis, Virginia Polytechnic Institute and State University, July 1999.
[17] Yan Wang, Chi-Ying Tsui, and Roger S. Cheng. A Low Power VLSI Architecture of SOVA-Based Turbo-Code Decoder Using Scarce State Transition Scheme. IEEE International Symposium on Circuits and Systems, Geneva, Switzerland, May 2000.
[18] Zhongfeng Wang. High Performance, Low Complexity VLSI Design of Turbo Decoders. PhD thesis, University of Minnesota, September 2000.