Diss. ETH No. 18582
Application-Specific Processor for MIMO-OFDM Software-Defined Radio
A dissertation submitted to
ETH ZURICH
presented by
STEFAN EBERLI
Dipl. El.-Ing. ETH
born 15.4.1978
citizen of Zürich (ZH) and Hüttwilen (TG)
2009
Acknowledgments
Zusammenfassung
Acknowledgments v
Abstract vii
Zusammenfassung ix
1 Introduction 1
1.1 Motivation – Mobility and Wireless Communications 1
1.2 This Thesis 4
1.3 Outline 7
B Datasheet 193
Chapter 1
Introduction
1.3 Outline
The remainder of this thesis is organized as follows:
Chapter 2 reviews related work. FAs implementing OFDM-related baseband processing tasks are presented. Relevant figures of merit are gathered and/or extrapolated from the open literature. The final discussion allows a comparison of the different FAs.
Chapter 3 is dedicated to the algorithms. After an introduction to the domain of MIMO-OFDM wireless communications, two crucial MIMO-OFDM design decisions are addressed. First, the computational complexities of several MIMO detectors are compared, which leads to the selection of linear MMSE detection as an appropriate one. Second, algorithms to compute linear MMSE detectors are assessed with respect to their computational complexity and BER performance. Finally, the complete baseband processing for the MIMO-OFDM receiver considered in this thesis is described and the associated computational complexity is derived.
Chapter 4 evaluates three different FAs by mapping computationally hard OFDM baseband processing kernels onto each of them. The evaluation suggests considering the design-time customizable ASIP for the case study described in the subsequent chapter.
Chapter 5 details the implementation of the relevant baseband processing kernels of the 2 × 2 MIMO-OFDM receiver. The receiver is split onto two properly configured ASIPs. The chapter concludes with a comparison to the only known related work [12] and gives a reference for the silicon complexity.
Chapter 6 summarizes and discusses the achieved results and draws the appropriate conclusions.
Chapter 2
State of the Art
Figure 2.1: Left: Flexible architecture (FA) block diagram; CP: control path, DP: datapath, FU: functional unit. Right: Software-programmable architectures (SPAs), reconfigurable architectures (RAs), and a combination of these (SPA+RA) are FA subclasses.
be addressed in parallel during one clock cycle, with the operating clock
frequency. Conventionally, to compute the processing performance, all
operations that can be executed in parallel are taken into account (e.g.,
load/store, data processing operations, address generation operations).
The so-obtained processing performance is a qualitative measure that allows only a rough comparison between differing architectures. Because the operations on the various architectures are not necessarily the same, comparisons are error-prone.
In this thesis, for the comparison with the computational load inherent to the algorithms presented in Chapter 3, the processing performance (PP) is defined as the millions of data operations per second (MdOp/s) an FA can execute. This unit considers only the data-processing operations that can be executed in parallel on the FA, multiplied by the operating clock frequency. The relevant data processing operations are those typical of digital signal processors: additions, subtractions, multiplications, and combinations of these.
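The PP definition above can be sketched as a one-line computation. The function name and parameter names below are illustrative assumptions; the two example configurations (four PSEs with two data operations each at 150 MHz, and a 16-way SIMD datapath executing one multiplication and one ALU operation in parallel at 300 MHz) are the ones quoted for AVISPA and the EVP elsewhere in this chapter.

```python
def processing_performance_mdops(parallel_units: int,
                                 dops_per_unit: int,
                                 clock_mhz: float) -> float:
    """Peak processing performance in MdOp/s.

    Counts only data operations (add, subtract, multiply and combinations)
    that can be issued in parallel, multiplied by the clock frequency in MHz.
    """
    return parallel_units * dops_per_unit * clock_mhz


# AVISPA: 4 PSEs x 2 dOp/PSE x 150 MHz
avispa_pp = processing_performance_mdops(4, 2, 150)

# EVP: 16-way SIMD x (1 multiplication + 1 ALU operation) x 300 MHz
evp_pp = processing_performance_mdops(16, 2, 300)

print(f"AVISPA: {avispa_pp} MdOp/s, EVP: {evp_pp} MdOp/s")
```

Because load/store and address-generation operations are deliberately excluded, this figure is typically lower than the peak operation counts quoted in vendor datasheets.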
where A, f, and P stand for the area, clock frequency, and power consumption in the original technology, and the normalized quantities are indicated by the 0.18 subscript. The technology scaling factor αD = x/0.18 is derived from the half-minimum-feature-size x in the original technology. Although not standardized, the name of a CMOS technology commonly indicates the approximate associated half-minimum-feature-size x: for instance, x = 0.13 in a 0.13 µm CMOS technology. The correction factor reflects the voltage scaling from the original to the 0.18 µm reference technology, V0.18 = V · (correction factor)/αD, and is typically ≈ 1.
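A minimal sketch of the normalization follows. The scaling rules used here (area divided by αD², frequency multiplied by αD, power multiplied by the squared voltage ratio) are an assumption inferred from the values listed in Tables 2.1 and 2.2, not a quotation of (2.1) and (2.2); the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Figures:
    area_mm2: float
    freq_mhz: float
    power_w: float


def scale_to_reference(orig: Figures, feature_um: float,
                       correction: float = 1.0,
                       reference_um: float = 0.18) -> Figures:
    """Normalize figures of merit to the reference technology.

    Assumed rules (inferred from the tabulated values):
      alpha   = feature_um / reference_um
      area    -> area / alpha**2
      freq    -> freq * alpha
      power   -> power * (V_ref / V)**2, with V_ref / V = correction / alpha
    """
    alpha = feature_um / reference_um
    voltage_ratio = correction / alpha
    return Figures(
        area_mm2=orig.area_mm2 / alpha ** 2,
        freq_mhz=orig.freq_mhz * alpha,
        power_w=orig.power_w * voltage_ratio ** 2,
    )


# Montium in 0.13 um CMOS: 6 mm^2, 100 MHz, 0.150 W, correction factor 1.08
montium = scale_to_reference(Figures(6.0, 100.0, 0.150),
                             feature_um=0.13, correction=1.08)
print(montium)
```

Small deviations from the tabulated values are to be expected, since the table entries are rounded.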
Figure 2.2 illustrates the area-time product curves (AT-plot, [15]) for a 16 bit adder and a 16 bit multiplier, synthesized for 0.25 µm, 0.18 µm, and 0.13 µm UMC CMOS technologies. The curves have been generated by synthesizing the corresponding circuit with Synopsys Design Compiler (Z-2007.03-SP3) and applying the timing constraints indicated by the curve’s markers. The blue circles show the results obtained when taking the 0.25 µm CMOS technology as the starting point and scaling the two designs to the 0.18 µm (αD = 0.25/0.18) and 0.13 µm (αD = 0.25/0.13) technologies, according to (2.1) and (2.2). The scaled results match the synthesis results well: the distance between each scaled result point and the synthesized result curve is minimal.
Figure 2.2: Area-time product curves (AT-plot) for 0.25 µm, 0.18 µm, and 0.13 µm CMOS technology (umc250, umc180, umc130); axes: longest path [ns] vs. area [µm²]. Top: 16 bit adder. Bottom: 16 bit multiplier.
for the same task.¹ This difference is visible in the AT-plot, where the isomorphic, complex-valued unit can attain the higher throughput (at the cost of an area increase), whereas the decomposed real-valued unit occupies the smallest area (at the penalty of a lower throughput). As shown by the black dashed curve, the AT-efficiency of the two units is nearly the same, with the complex-valued unit being slightly more efficient. Operating with FUs that perform only real-valued arithmetic results in a lower throughput than operating with complex-valued FUs, since the circuit cannot be synthesized for timing constraints below 12 ns per data item. Thus, for algorithms that require high throughput and mostly operate on complex-valued numbers, it is convenient to incorporate complex-valued FUs into the FA’s datapath.
[Figure: AT-plot for UMC180 comparing the CMUL and RMULADD units; axes: time per data item (T) [ns] vs. area [µm²] (×10⁴); dashed curve: AT = 410700 µm² ns.]
The same receiver tasks are also mapped onto an ASIC, an FPGA, and a DSP to compare the achieved performance-over-cost. One of the conclusions in [17] is that RAs fill the performance-over-cost gap between ASICs and DSPs. It is found that there is a six-fold increase in complexity compared to an ASIC implementation, while the cost is reduced by a factor of six compared to an implementation on conventional DSPs. Thus, compared to ASICs, production at a lower NRE cost is possible, while a higher performance-over-cost than on DSPs is realized.
The RaPiD architecture is shown in Figure 2.6. Its datapath consists of a heterogeneous, linear array of FUs. The FUs are ALUs, multipliers (MULT), registers (REG), and storage units (RAM) that are connected through a configurable interconnect network. The number and type of FUs are scalable and determined at design time. The interconnect is built from multiplexers that select the input to the functional units, from tri-state buffers that drive the output of functional units onto the wanted bus, and from bus connectors that split long bus segments into smaller ones, enabling the concurrent utilization of two bus segments belonging to the same long bus.
The configuration bits required to operate the RaPiD architecture are divided into hard and soft configuration bits. Hard configuration
tiles, is 6 mm², the power consumption scales to 150 mW, and the data processing performance to PP = 1’500 MdOp/s.
The EVP’s datapath includes one scalar unit as well as a set of 16-way SIMD units. The datapath is controlled by very long instruction words (VLIWs). The data memory feeds the 16 vector registers from which the execution units retrieve their operands. The EVP is programmed in EVP-C, an extension of the C language that supports the SIMD units. The EVP falls into the category of SPAs.
The EVP described in [5] is synthesized for a 90 nm CMOS technology. It runs at a frequency of 300 MHz and occupies an area of 2 mm². The power efficiency is 1 mW/MHz, leading to a power consumption of 300 mW. The EVP’s processing performance is derived by observing that one multiplication and one ALU operation can be executed in parallel on the 16-way SIMD datapath, which results in a peak data processing performance of PP = 9’600 MdOp/s.⁷
Figure 2.12: SB3010 architecture (source: [32, 33]). Top: entire platform. Bottom: one DSP slice.
5’760 MdOp/s.
¹⁰ http://www.siliconhive.com
¹¹ PP = 4 PSE × 2 dOp/PSE × 150 MHz = 1’200 MdOp/s.
Figure 2.16: ADRES block diagram (source: [12, 40]). Top: Generic architecture template. Middle: 4 × 4 CGA realization for the 2 × 2 MIMO-OFDM receiver. Bottom: FU template.
2.4 Summary and Discussion
Table 2.1: Figures of merit for the reviewed FAs, in the original technology.
Flexible architecture^a    CMOS [µm]   Area [mm²]   Freq. [MHz]   Power [W]   Proc. perf. [MdOp/s]
RD [48], 2003              0.35        2.86         100           n.a.        2’200
RaPiD [18, 17], 1996       0.5         81           100           30.4        6’400
MS1 [22], 1999             0.35        180          100           n.a.        6’400
SODA [23], 2006            0.18        26.6         400           3           51’200
Montium [49, 24], 2003     0.13        6            100           0.150       1’500
DSP1 [28, 5], 2002         0.13        1.5          160           0.128       2’560
EVP [5], 2005              0.09        2            300           0.300       9’600
SB3010 [32, 31], 2002      0.09        n.a.         600           0.150       9’600
BBP1 [34], 2005            0.18        2.9          240           189^b       n.a.
BBP2 [35], 2007            0.13        11           240           n.a.        5’760
AVISPA [37], 2003          0.13        6.5          150           0.127       1’200
ADRES [40, 12], 2003       0.09        5.79         400           0.22        25’600
a The first reference points to the description of the architecture, the second to the SDR/OFDM baseband processing related work (if different from the first). The year refers to the first publication of the architecture.
b Linearly scaled from 126 mW @ 160 MHz to 189 mW @ 240 MHz.
Table 2.2: Figures of merit for the reviewed FAs, scaled to 0.18 µm CMOS technology.
Flexible architecture   Scaling αD   Correction factor   Area [mm²]   Freq. [MHz]   Power [W]   Proc. perf. [MdOp/s]
RD                      1.94         1.06                0.76         194           n.a.        4’278
RaPiD                   2.78         1.52                10.5         278           9           17’778
MS1                     1.94         1.06                47.61        194           n.a.        12’444
SODA                    1            1                   26.6         400           3           51’200
Montium                 0.72         1.08                11.5         72            0.338       1’083
Figure 2.17: Performance/area, normalized to 0.18 µm CMOS technology. Vertical axis: performance over area [MdOp/s/mm²]; marked levels: 1’500 MdOp/s/mm² and 500 MdOp/s/mm².
Figure 2.18: Performance/area vs. energy efficiency, normalized to 0.18 µm CMOS technology.
Figure 2.19: Processing performance vs. power consumption, normalized to 0.18 µm CMOS technology. The marker size is proportional to the corresponding circuit area. Axes: power consumption [mW] vs. processing performance [MdOp/s].
Chapter 3
Algorithms and Computational Complexity
MMSE in this case). When the computations are performed in infinite precision, all methods using the same type of MIMO detector lead to the same result. However, when considering finite-precision computations, the picture changes. Rounding differences among the methods used to implement the MIMO detector may lead to a better or worse received signal quality. For this reason, Section 3.4, which describes linear MMSE detection, also considers finite-precision effects by implementing the methods in fixed point.
[Figure 3.1: MIMO-OFDM transceiver block diagram with transmitter (serial-to-parallel conversion, mapping, OFDM modulation), channel (with additive noise), and receiver (OFDM demodulation, MIMO detection, bit-metric computation, parallel-to-serial conversion); signals: b, x, s_k, H_k, n_k, y_k, ŝ_k, x̂, b̂.]
[Figure 3.2: Constellation alphabets for M = 2, 4, 16, and 64, with points at ±1 (M = 2), ±1/√2 (M = 4), {±1, ±3}/√10 (M = 16), and {±1, ±3, ±5, ±7}/√42 (M = 64) on the real and imaginary axes.]
the preprocessing phase while the payload during the data processing phase.
The transceiver’s building blocks illustrated in Figure 3.1 are now described one after the other, from the transmitter to the receiver. In the description, the subscript $k = 1, \ldots, N$ indicates the OFDM-subchannel a variable belongs to. The superscript $n = 1, \ldots, M_T$ associates a variable with one of the $M_T$ lower-rate data streams. $\mathcal{A}$ is an alphabet containing $M$-ary QAM constellation points that have modulation order $M = 2^Q$, mean zero, and average energy $1/M_T$. $Q$ is the number of bits encoded by one constellation point. Figure 3.2 depicts the alphabets $\mathcal{A}$ for $M_T = 1$ and $M = 2, 4, 16$, and $64$.
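The normalization of the alphabet can be illustrated with a short sketch. The function name is hypothetical, only BPSK and square QAM are covered, and the Gray labelling of the points is deliberately left out; the sketch only shows how the scaling factors 1/√2, 1/√10, and 1/√42 follow from the requirement of zero mean and average energy 1/M_T.

```python
import math


def qam_alphabet(bits_per_symbol: int, num_streams: int = 1) -> list[complex]:
    """Return BPSK / square-QAM points with zero mean and
    average energy 1/num_streams (no bit labelling is attached)."""
    if bits_per_symbol == 1:
        points = [complex(-1, 0), complex(1, 0)]
    else:
        if bits_per_symbol % 2:
            raise ValueError("this sketch covers only BPSK and square QAM")
        side = 2 ** (bits_per_symbol // 2)
        # Odd integer levels: -(side-1), ..., -1, +1, ..., +(side-1)
        levels = [2 * i - (side - 1) for i in range(side)]
        points = [complex(re, im) for re in levels for im in levels]
    mean_energy = sum(abs(p) ** 2 for p in points) / len(points)
    scale = 1.0 / math.sqrt(mean_energy * num_streams)
    return [p * scale for p in points]


for q in (1, 2, 4, 6):
    alphabet = qam_alphabet(q)
    print(2 ** q, "points, largest real part:",
          round(max(p.real for p in alphabet), 4))
```

For more than one transmit stream, passing `num_streams=M_T` scales every point by an additional factor 1/√M_T, so that the total transmit energy per subchannel stays at one.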
These binary labels $x_k^{(n)}$ are Gray-mapped, by the Mapping block in Figure 3.1, into the complex-valued constellation points $s_k^{(n)} \in \mathcal{A}$ according to $G : x_k^{(n)} \mapsto s_k^{(n)}$.
2. Next, by using an N-point inverse Fourier transform, groups of N constellation points are mapped into time-domain OFDM-symbols. Each time-domain OFDM-symbol is prepended by
$$\mathbf{y}_k = \mathbf{H}_k \mathbf{s}_k + \mathbf{n}_k, \qquad (3.1)$$
$\hat{x}_k^{(n)}$ contains $Q$ entries. Each entry $L_i^{(n)}$ [$i = (k-1)Q, (k-1)Q+1, \ldots, kQ-1$] delivers decision information employed to detect the corresponding bit of the binary label of the constellation point $s_k^{(n)}$.
MIMO detectors (and detectors in general) can be classified into two categories according to the type of decision information they deliver. Hard detectors (or hard-out detectors) output only two values, usually −1 and +1, according to whether the bit that has to be detected is estimated to be 0 or 1, respectively. Soft detectors (or soft-out detectors), instead, deliver an entire range of values that usually lie in the interval [−1, +1]. The sign indicates whether the bit is estimated to be 0 or 1, and larger absolute values indicate more reliable estimates than smaller ones. Soft detection yields a better signal quality at the receiver than hard detection. On the other hand, depending on the MIMO detection algorithm, the generation of soft information is not always possible, or it may entail an excessive increase in computational complexity.
Another distinction is made according to whether the MIMO detector performs coherent or non-coherent detection. In this thesis, receivers that perform coherent detection are considered: for coherent detection, the receiver needs to take the effect of the wireless channel into account for correct operation, whereas non-coherent detection is performed without channel knowledge. The coherent receiver has to estimate the wireless channel $\mathbf{H}_k$ during an appropriate training phase of the transmission (typically at the beginning of an OFDM-frame). The channel estimate is then expressed as the matrix $\hat{\mathbf{H}}_k$.
Finally, several algorithms that vary in computational complexity and in the receiver’s signal quality are known for performing MIMO detection (e.g., [50]). Before an appropriate MIMO detector for the implementation on an FA is chosen in Section 3.3, the evaluation metrics required to make this choice are introduced in Section 3.2.
equipped with the atomic operations ADD, MAC, DIV, ANGLE, and SQRT. At the end of the section, the CC of the evaluated detectors is reported for one OFDM-subchannel, split into two parts: the receiver’s preprocessing phase (in Table 3.1) and the data processing phase (in Table 3.2), since the algorithms involved in the two phases differ. The subsequent discussion and comparison of the CCs and the BER performances allow the selection of a MIMO detector that is appropriate for an FA.
For the implementation as dedicated VLSI components, a good reference that compares the CCs and the BER performances of various MIMO detection algorithms is [51].
assumption that one test can be completed in one clock cycle, for an OFDM system with 52 subchannels and an OFDM-symbol duration of 4 µs, would lead to a required processing performance of 4’096 × 52/4 µs = 53’248 MdOp/s (!). Soft detection would lead to an even higher CC [53].
These results show that brute-force ML is not a viable path to find the solution of (3.2). For this reason, brute-force ML is dropped from the candidate list.
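The arithmetic behind this estimate can be reproduced with a small sketch. The function names are hypothetical. The candidate count of an exhaustive search, M to the power of the number of streams, is a general property of brute-force ML detection; that the 4096 tests correspond to, e.g., 64-QAM on two streams is an assumption made here for illustration only.

```python
def ml_candidates(modulation_order: int, num_streams: int) -> int:
    """Number of candidate symbol vectors an exhaustive ML search tests."""
    return modulation_order ** num_streams


def required_mdops(tests_per_subchannel: int,
                   subchannels: int,
                   symbol_duration_us: float) -> float:
    """Required processing performance in MdOp/s when one test costs
    one operation; operations per microsecond equal millions per second."""
    return tests_per_subchannel * subchannels / symbol_duration_us


tests = ml_candidates(64, 2)          # assumed configuration
load = required_mdops(tests, subchannels=52, symbol_duration_us=4.0)
print(f"{tests} tests per subchannel -> {load:.0f} MdOp/s")
```

The exponential dependence on the number of streams is what rules the exhaustive search out: every additional 64-QAM stream multiplies the load by another factor of 64.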
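$$\mathbf{F} = \mathbf{H}^H \mathbf{H} + M_T \sigma^2 \mathbf{I} \qquad (3.5)$$
$$\mathbf{G} = \mathbf{F}^{-1} \mathbf{H}^H \qquad (3.6)$$
$$\hat{\mathbf{y}} = \mathbf{G} \mathbf{y}. \qquad (3.7)$$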
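Equations (3.5) to (3.7) can be sketched for the 2 × 2 case in plain Python. All function names are illustrative assumptions, and the 2 × 2 inverse is computed directly from the adjugate and the determinant, which is only one of several possible ways to obtain G.

```python
def hermitian(m):
    """Conjugate transpose of a matrix given as a list of rows."""
    return [[m[j][i].conjugate() for j in range(len(m))]
            for i in range(len(m[0]))]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse_2x2(f):
    """Inverse via adjugate divided by determinant."""
    det = f[0][0] * f[1][1] - f[0][1] * f[1][0]
    return [[f[1][1] / det, -f[0][1] / det],
            [-f[1][0] / det, f[0][0] / det]]


def mmse_detector_2x2(h, sigma2, num_streams=2):
    """G = (H^H H + M_T sigma^2 I)^-1 H^H for two transmit streams."""
    hh = hermitian(h)
    f = matmul(hh, h)                       # (3.5), Gram matrix part
    for i in range(2):
        f[i][i] += num_streams * sigma2     # (3.5), regularization part
    return matmul(inverse_2x2(f), hh)       # (3.6)


def apply_detector(g, y):
    """y_hat = G y, equation (3.7)."""
    return [sum(g[i][j] * y[j] for j in range(len(y)))
            for i in range(len(g))]


h = [[1 + 1j, 0.5 + 0j], [0.2j, 1 - 0.5j]]        # example channel
s = [(1 + 1j) / 2, (-1 + 1j) / 2]                 # transmitted points
y = [sum(h[i][j] * s[j] for j in range(2)) for i in range(2)]  # noise-free
y_hat = apply_detector(mmse_detector_2x2(h, sigma2=0.01), y)
print([complex(round(v.real, 3), round(v.imag, 3)) for v in y_hat])
```

With a non-zero noise variance the estimates are slightly biased towards zero, which is the usual behaviour of the linear MMSE filter; with `sigma2=0` and an invertible channel the transmitted points are recovered exactly.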
As these figures testify, linear MMSE and SIC detection may possibly fit on a conventional high-end digital signal processor (DSP), e.g., TI’s C6455 (with a peak data processing performance of 4’000 MdOp/s) or ADI’s TigerSHARC (with a peak data processing performance of 2’400 MdOp/s).⁸ SD and KB, however, are far beyond that possibility.
An FA with execution units performing complex-valued arithmetic would reduce the CC [see Figure 3.4(b)], and the processing requirements would become: 800 MCdOp/s (millions of complex-valued operations per second) for the preprocessing with QR-decomposition, and 8’400 MCdOp/s (SD), 7’400 MCdOp/s (KB), 400 MCdOp/s (SIC), and 50 MCdOp/s (MMSE) for the data payload processing.
2 dOp/SIMD × 1 GHz = 4’000 MdOp/s for the C6455. For the TigerSHARC ADSP-TS201S: PP = 2-way SIMD × 2 dOp/SIMD × 600 MHz = 2’400 MdOp/s.
[Figure: Computational complexity (CC) vs. antenna configuration (M×M) for the preprocessing and the symbol processing, for real-valued and complex-valued arithmetic; curves: SD (Nav = 2.5, 3.75, 5), KB (K = 5), SIC, LMMSE.]
[Figure: BER vs. SNR [dB] for the SD, KB (K = 5), SIC, and MMSE detectors.]
[Figure: BER vs. SNR [dB] for the hard-out SD, KB (K = 5), SIC, and MMSE detectors and the soft-out SD and MMSE detectors.]
$$\mathbf{F}^{-1} = \frac{\operatorname{adj}(\mathbf{F})}{\det(\mathbf{F})}. \qquad (3.8)$$
3.4.2 LR-decomposition
LR-decomposition [66] decomposes $\mathbf{F}$ into a left-triangular $M_T \times M_T$ matrix $\mathbf{L}$ and a right-triangular $M_T \times M_T$ matrix $\mathbf{R}$, such that $\mathbf{L}\mathbf{R} = \mathbf{F}$. Then, two successive back-substitution steps lead to the matrix $\mathbf{G}$: 1) $\mathbf{A} = \mathbf{L}^{-1}\mathbf{H}^H$, and 2) $\mathbf{G} = \mathbf{R}^{-1}\mathbf{A}$. No divisions are required during the first back-substitution, since the diagonal entries of $\mathbf{L}$ are all 1. However, the second back-substitution requires the inversion of the diagonal elements of $\mathbf{R}$. The atomic operations for LR-decomposition are ADD, MAC, and DIV.
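The two substitution steps can be sketched as follows. This is a minimal floating-point illustration without pivoting, under the assumption that F is well conditioned (as is the case for the regularized Gram matrix of linear MMSE detection); the function names are hypothetical and the sketch is not the fixed-point implementation evaluated in this chapter.

```python
def lr_decompose(f):
    """Factor F = L R with unit-diagonal left-triangular L (no pivoting)."""
    n = len(f)
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    r = [row[:] for row in f]
    for c in range(n):
        for i in range(c + 1, n):
            l[i][c] = r[i][c] / r[c][c]
            for j in range(c, n):
                r[i][j] -= l[i][c] * r[c][j]
    return l, r


def solve_unit_left(l, b):
    """A = L^-1 B by substitution; no divisions since diag(L) = 1."""
    a = [row[:] for row in b]
    for i in range(len(l)):
        for k in range(i):
            for j in range(len(a[0])):
                a[i][j] -= l[i][k] * a[k][j]
    return a


def solve_right(r, a):
    """G = R^-1 A by substitution; needs the inverted diagonal of R."""
    n = len(r)
    inv_diag = [1.0 / r[i][i] for i in range(n)]   # the DIV operations
    g = [row[:] for row in a]
    for i in range(n - 1, -1, -1):
        for k in range(i + 1, n):
            for j in range(len(g[0])):
                g[i][j] -= r[i][k] * g[k][j]
        for j in range(len(g[0])):
            g[i][j] *= inv_diag[i]
    return g


f = [[2 + 0j, 0.5 - 0.5j], [0.5 + 0.5j, 3 + 0j]]     # example F
h_herm = [[1 + 0j, 2j], [0.5 + 0j, 1 - 1j]]          # example H^H
l, r = lr_decompose(f)
g = solve_right(r, solve_unit_left(l, h_herm))
print(g)
```

Inverting the diagonal of R once and multiplying afterwards keeps the number of divisions at one per stream, which matters on architectures where DIV is much more expensive than MAC.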
3.4.3 LDL-decomposition
With LDL-decomposition [66], $\mathbf{F}$ is decomposed into the left-triangular $M_T \times M_T$ matrix $\mathbf{L}$ and the diagonal $M_T \times M_T$ matrix $\mathbf{D}$, with the
3.4.4 GS-decomposition
When using GS-decomposition⁹ [66] the augmented channel matrix
3.4.5 QR-decomposition
The classical QR-decomposition [66] involves the same steps as the GS-decomposition. The only difference is the method used to obtain the matrices $\bar{\mathbf{Q}}$ and $\mathbf{R}$. The QR-decomposition analyzed in Appendix A.4.5 relies on Givens rotations, which require the computation of the arctangent as a fundamental operation. Therefore, compared to GS-decomposition, instead of the SQRT atomic operation, the ANGLE
⁹ Named after its two independent discoverers, Jørgen Pedersen Gram and Erhard Schmidt. Gram published in 1883 [67], Schmidt in 1907 [68].
¹⁰ Recall that, for deriving the CC, the weights of all atomic operations have been set to 1 (in Section 3.2), which is, of course, an approximation. If the difference among the CCs of the evaluated linear MMSE detection methods becomes too small for a clear choice, these weights may be refined. However, this will not be necessary, as the final discussion will show.
$$\mathbf{F} = \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{B}^H & \mathbf{C} \end{bmatrix}, \qquad (3.10)$$
[Figure: Computational complexity (CC) vs. antenna configuration (M×M) of the linear MMSE detection methods D&C, LR-decomp., LDL-decomp., GS-decomp., Rank-1, and QR-decomp. (two panels).]
[Figure: BER vs. SNR [dB] of the fixed-point methods LR, GS, LDL, QR, R1, and DC, compared with the floating-point reference.]
[Figure 3.9: Frame structure for two transmit streams. Top: STF (short training symbols t1 … t10, 8 µs), LTF 1 (GI2, T1A, T1B, 8 µs), LTF 2 (GI, T2, 4 µs), followed by the MIMO-OFDM symbols S1 … SNd (GI and data D, 4 µs each); 1 µs corresponds to 20 samples. Bottom: receiver states (frame detection, preprocessing, data processing).]
Table 3.4: OFDM modulation parameters for the system under consideration.
Parameter   N    Nc   NSP   NLP   NGI2   NGI   Tsym   Ns   fs
Value       64   52   16    64    64     16    4 µs   80   20 MHz
two identical long training symbols (T1A and T1B), each with a length of NLP samples. LTF1 is exploited to refine the frequency offset estimation and participates in the channel estimation together with the remaining long training fields (LTFn, n = 2, 3, …, MT). Each of the remaining LTFs is composed of a guard interval (GI) of NGI samples, followed by a training symbol Tn of NLP samples. The MIMO-OFDM data symbols Sm have a guard interval GIm of NGI samples and carry the data Dm (m = 1, 2, …, Nd). The number of data-carrying OFDM-subchannels is Nc, and the remaining N − Nc subchannels are either unused or carry pilot symbols. One OFDM-symbol has a duration Tsym and a length of Ns = NGI + N samples, at a sample rate fs. The OFDM parameters for the system under consideration are reported in Table 3.4.
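The relations between these parameters can be checked with a few lines. The function name is an illustrative assumption; the numeric values are those of Table 3.4.

```python
def ofdm_symbol(n_fft: int, n_gi: int, fs_mhz: float):
    """Return (samples per OFDM symbol, symbol duration in microseconds,
    subchannel spacing in kHz) for a given FFT size, guard interval
    length and sample rate."""
    n_s = n_gi + n_fft                   # Ns = NGI + N
    t_sym_us = n_s / fs_mhz              # samples / (samples per microsecond)
    spacing_khz = fs_mhz * 1000.0 / n_fft
    return n_s, t_sym_us, spacing_khz


n_s, t_sym_us, spacing_khz = ofdm_symbol(n_fft=64, n_gi=16, fs_mhz=20.0)
print(f"Ns = {n_s} samples, Tsym = {t_sym_us} us, "
      f"spacing = {spacing_khz} kHz")
```

The same helper also reproduces the guard-interval duration: 16 samples at 20 MHz correspond to 0.8 µs.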
Based on the above-described frame structure, proper reception of an OFDM-frame can be divided into five states: frame-start detection, STF processing, LTF processing, MIMO channel processing, and data payload processing. The bottom section of Figure 3.9 illustrates how these five receiver states are traversed during the reception of an OFDM-frame. Note that the exact point in time for switching from one receiver state to the next varies depending on the quality of the received signal and, consequently, on when the frame start is detected. Typically, the first 4 to 6 short training sequences are corrupted by the AGC.
As a last step, $\bar{p}_L[d]$ and $\bar{m}_L[d]$ are compared. A frame start is detected for the first discrete sample-time index $d = \hat{d}_{SP}$ that satisfies the threshold detection inequality
$$\bar{p}_L[d]^2 > \bar{m}_L[d]^2 / 2. \qquad (3.16)$$
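The threshold test (3.16) can be sketched as a simple search over the sample index. The two input sequences are taken as given here, since their computation according to (3.14) and (3.15) is not reproduced in this excerpt; the function name and the example values are hypothetical.

```python
from typing import Optional, Sequence


def detect_frame_start(p_bar: Sequence[complex],
                       m_bar: Sequence[complex]) -> Optional[int]:
    """Return the first index d with |p_bar[d]|^2 > |m_bar[d]|^2 / 2,
    or None if the threshold is never exceeded."""
    for d, (p, m) in enumerate(zip(p_bar, m_bar)):
        if abs(p) ** 2 > abs(m) ** 2 / 2:
            return d
    return None


# Illustrative values: the correlation metric rises when the frame arrives.
p_example = [0.1, 0.2, 0.9, 1.0]
m_example = [1.0, 1.0, 1.0, 1.0]
print(detect_frame_start(p_example, m_example))
```

Comparing squared magnitudes avoids a square root, and writing the test as a multiplication-free comparison against half the reference value avoids a division.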
staggered, potentially exploiting pipelining and thus slightly reducing the FFT’s
processing latency.
[Figure: Computational complexity (CC) of the receiver states LTF, MIMO, and data payload vs. antenna configuration (M×M); marked level: 3653 MdOp/s.]
Chapter 4
Design Space Exploration
4.1 C6455
The C6455 [76] is a commercial high-performance fixed-point VLIW DSP. Its core is depicted in Figure 4.1. The CPU consists of fetch and
[Figure 4.1: C6455 core block diagram (L1P cache/SRAM, A and B register files, L1 data memory controller, L1D cache/SRAM, DMA slave interface, master port, interrupt and exception controller, memory protection, bandwidth management, power control).]
Figure 4.2: Die photograph of the C6455. The black dots are solder
balls remaining after etching the package away.
processor [77]. The lower data rates of the acoustic domain permit a real-time and real-life implementation, with data streaming across the C6455’s input and output ports, while the programming does not require a thorough optimization of the employed algorithms. In addition, the acoustic physical front end is more economical than its RF counterpart.¹
The OFDM parameters used for the acoustic communication system are summarized in Table 4.2 at the end of this section. Out of the 64 allocated OFDM-subchannels, 54 carry data. To avoid digital up- and down-conversion, the real-valued acoustic passband signal is generated using a 128-point FFT. The first 64 tones are determined by the transmit constellation points, and the remaining 64 correspond to the symmetric complex conjugate of the first half (i.e., the constellation points on the subchannels k = 64, 65, …, 127 are determined by $s_k = s_{127-k}^{H}$). Eventually, one time-domain OFDM-symbol consists of 208
impairments are recognizable in the acoustic domain as well, enabling the study of appropriate countermeasures (e.g., sample rate estimation and tracking, carrier frequency offset estimation and tracking, etc.)
Figure 4.3: Acoustic transceiver setup with two C6455 DSK boards, one connected to the Tx PC and one to the Rx PC.
The receive C6455 double-buffers the data samples received over the audio codec’s input line. Each input buffer has a size of NIbuf = NObuf received samples. These samples are processed block-wise to detect the OFDM-frame start, in accordance with the method described in Section 3.5.1 [(3.14)–(3.16)]. Once a frame has been detected and the timing reconstructed, the acoustic channel is estimated using the LTF. Next, the time-domain OFDM-symbols are transformed by the FFT, demapped taking the channel estimates into account, and deinterleaved. Note that neither FOE nor FOC is performed on the receiver, which reduces the processing power requirements (and, of course, also the BER performance). Finally, the on-chip Viterbi coprocessor is set up to decode the received data and to convey them to the receive PC. The BER performance of the acoustic communication system is computed on the receive PC by comparing the transmitted with the received data.
For the receiver to operate in real time, the processing of the input data buffer has to be finished in Tp < TIbuf (= TObuf). Or, equivalently,
³ The length of one OFDM-symbol is NGI + N = 80 + 128 samples. The output buffer can hold at most 41’600/208 = 200 OFDM-symbols, cf. Table 4.2.
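The buffer arithmetic and the real-time condition can be sketched as follows. The function names are hypothetical. The buffer size of 41’600 samples is read off the footnote above, the sample rate of 48 kS/s is the acoustic baseband rate of Table 4.2, and the processing time passed to the check is a placeholder value, not a measurement.

```python
def symbols_per_buffer(buffer_samples: int, n_gi: int, n_fft: int) -> int:
    """Number of complete OFDM symbols that fit into one buffer."""
    return buffer_samples // (n_gi + n_fft)


def buffer_duration_ms(buffer_samples: int, sample_rate_khz: float) -> float:
    """Time needed to fill (or drain) one buffer, in milliseconds."""
    return buffer_samples / sample_rate_khz


def meets_real_time(processing_ms: float, buffer_ms: float) -> bool:
    """Real-time condition Tp < TIbuf for double-buffered operation."""
    return processing_ms < buffer_ms


n_symbols = symbols_per_buffer(41_600, n_gi=80, n_fft=128)
t_buf_ms = buffer_duration_ms(41_600, sample_rate_khz=48.0)
print(n_symbols, "symbols per buffer,", round(t_buf_ms, 1), "ms per buffer")
print("real-time:", meets_real_time(processing_ms=5.0, buffer_ms=t_buf_ms))
```

With double buffering, one buffer is filled by the codec while the other is processed, so the condition only has to hold per buffer and not per sample.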
Table 4.1: Processing times resulting from the profiling of the transmission (Tx) and reception (Rx) of one 64-QAM OFDM-symbol on the C6455. The "Acoustic" columns refer to the acoustic system; the "RF" columns (set in italics in the original) refer to the RF system discussed in Section 4.1.2.
Phase                   Tx task      Time [µs]          Rx task            Time [µs]
                                     Acoustic   RF                         Acoustic   RF
Frame-start detection                                   Frame start det.   1600       1600
Preprocessing                                           Channel est.       3.3        2.9
Data proc.              Encode       10.7       9.5     Decode             12.6       11.2
                        Interleave   2          1.8     Deinterleave       1.2        1.1
                        Map          8.8        7.8     Demap              10.2       9.0
                        IFFT         1.3        0.5     FFT                1.3        0.5
                        Total        22.8       19.6    Total              12.7       10.6
4.2 MSEC4
Table 4.2: OFDM system parameters and performance for the acoustic and RF systems considered in this thesis, and comparison to related work. The reported data rates refer to the raw over-the-air rate and do not consider coding.
                                       This work, C6455          [79]^a      [80]        [81]^a
OFDM parameter                         Acoustic      RF          C64x        C62x        C64x
Channel bandwidth [MHz]                0.024         20          20          20          n.a.
# OFDM-subchannels (N)                 64            64          64          64          64
# OFDM data carriers (Nc)              54            48          48          48          64
Subchannel spacing [kHz]               0.375         312.5       312.5       312.5       n.a.
GI [# samples] / [µs]                  80 / 1’666    16 / 0.8    16 / 0.8    16 / 0.8    n.a.
OFDM-symbol [# samples] / [µs]         208 / 4’300   80 / 4      80 / 4      80 / 4      n.a.
BB sample rate [MS/s]                  0.048         20          20          20          n.a.
Modulation                             64-QAM        64-QAM      BPSK        QPSK        16-QAM
Coding                                 yes           yes         yes         no          yes
Design data rate^b [Mbit/s]            0.075         72          12          24          n.a.
Achieved Rx data rate^c [Mbit/s]       25.5          25.7        n.a.^d      1.7         4.9
a Implements the receiver on a C62x platform and scales the results to a C64x platform.
b Data rate specified by the OFDM parameters, i.e., bits per OFDM-symbol divided by the OFDM-symbol duration.
c Effective data rate achieved on the DSP, i.e., bits per OFDM-symbol divided by the OFDM-symbol processing time.
d The CC is estimated to be 2977 MOPS for the processing of one OFDM-symbol.
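The design data rate defined in footnote b can be recomputed from the other table entries. The function name is an illustrative assumption; for the acoustic system, the symbol duration is derived here from 208 samples at 48 kS/s rather than taken from the rounded table entry.

```python
def design_data_rate_mbps(data_carriers: int,
                          bits_per_point: int,
                          symbol_duration_us: float) -> float:
    """Raw over-the-air rate: bits per OFDM symbol divided by the
    symbol duration (bits per microsecond equal Mbit/s)."""
    return data_carriers * bits_per_point / symbol_duration_us


rf_64qam = design_data_rate_mbps(48, 6, 4.0)      # RF system, 64-QAM
ref_bpsk = design_data_rate_mbps(48, 1, 4.0)      # BPSK reference system
ref_qpsk = design_data_rate_mbps(48, 2, 4.0)      # QPSK reference system

acoustic_symbol_us = 208 / 0.048                  # 208 samples at 48 kS/s
acoustic = design_data_rate_mbps(54, 6, acoustic_symbol_us)

print(rf_64qam, ref_bpsk, ref_qpsk, round(acoustic, 3))
```

The achieved receive data rate of footnote c follows from the same expression when the symbol duration is replaced by the measured processing time per OFDM-symbol.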
6 Depending on the underlying CMOS technology and design library, there might
be a critical memory size below which the instantiation of a register file is more
area-efficient.
4.3 ASPE
The adaptive stream processing engine (ASPE), developed at the
IIS, ETH Zurich [9], is a modular coarse-grained ASIP architecture
optimized for multimedia stream processing, which mainly consists of
regular and repetitive tasks.
4.3.1 Architecture
Figure 4.6 shows the ASPE architecture. The ASPE is tightly coupled
with a general purpose processor (GPP) responsible for controlling and
setting up the ASPE, as well as for executing performance-uncritical
tasks. In addition, the ASPE has access to the system bus, which provides
the capability of autonomously handling data streams.
The ASPE consists of a datapath and a controlpath. The datapath
employs two types of building blocks: functional units (FUs) and
storage units (SUs). FUs perform the arithmetic operations and SUs
[Figure: block diagram of the ASPE architecture with the GPP, the controlpath (sequencers, VLIW program memory, control network) and the datapath (FUs, SUs, register file, data network).]
eight registers.
All FUs implement a two-stage pipeline, resulting in an equal
execution time for all FUs independent of their complexity. The
potential advantage of exploiting the FUs' different execution times for
higher hardware efficiency is traded off against the advantage of regular
assembler programming and thus a shorter development time.
The FUs and SUs have been enhanced to operate in a 2-way SIMD
manner to better exploit the data-level parallelism inherent to
many signal processing algorithms, OFDM BB processing included.
Finally, a datapath wordwidth of 16 bit guarantees sufficient precision
for all the required computations.
Careful scheduling is required to efficiently share all FUs and SUs.
Table 4.4 summarizes the assembler cycle counts for the SISO-OFDM
BB implementation and the corresponding processing times, whereas
Figure 4.8 depicts the ASPE's task schedule for the reception of a
BPSK-modulated OFDM frame during the data processing state. A
clock frequency of 160 MHz together with the duty cycle of Tdc = 4 µs
leads to a total of 640 clock cycles available for performing all
data processing related tasks. This has proven to be sufficient for
real-time reception of OFDM frames modulated up to 64-QAM.
Figure 4.8: Data processing task schedule for BPSK modulated OFDM
symbols.
Table 4.4: Assembler cycle counts and processing times for SISO-OFDM
BB processing on ASPE running at 160 MHz.
State / Task                       Assembler cycle counts   #     Time [µs]
Frame-start detection
  Correlation                      2Ns + NSP + 20           196   1.22
  Mean energy and th.              Ns + 20                  100   0.63
  TOTAL                            3Ns + NSP + 40           296   1.85
Short preamble processing (init)
  coarse FOE                       75                       75    0.47
  coarse FOC                       Ns + 10                  90    0.56
  TOTAL                            Ns + 85                  165   1.03
Short preamble processing
  coarse FOC                       Ns + 10                  90    0.56
  LTF1 start detect                6Ns + 40                 520   3.25
  TOTAL                            7Ns + 50                 610   3.81
LTF1 processing
  fine FOE                         75                       75    0.47
  fine FOC                         Ns + 10                  90    0.56
  mean LTF1                        NLP/2 + 10               42    0.26
  FFT on LTF1                      160                      160   1.00
  Channel estimation               NLP/2 + 10               42    0.26
  TOTAL                            Ns + NLP + 265           409   2.55
Data processing
  fine FOC                         Ns + 10                  90    0.56
  FFT                              160                      160   1.00
  Channel compensation             114                      114   0.71
  Demap 64-QAM                     270                      270   1.69
  TOTAL                            Ns + 554                 634   3.96
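The cycle budget of the data processing state can be checked with a few lines of Python. This is only an illustrative sketch: the value Ns = 80 is inferred here from the numeric column of Table 4.4 (Ns + 10 = 90) and is not stated explicitly in the table.

```python
NS = 80             # samples per block, inferred from Ns + 10 = 90 in Table 4.4
F_CLK_MHZ = 160.0   # ASPE clock frequency
T_DC_US = 4.0       # duty cycle Tdc of one OFDM symbol

def cycles_to_us(cycles, f_clk_mhz=F_CLK_MHZ):
    """Convert a cycle count into microseconds at the given clock frequency."""
    return cycles / f_clk_mhz

# Per-task cycle counts of the data processing state (64-QAM demapping).
data_processing = {
    "fine FOC": NS + 10,
    "FFT": 160,
    "channel compensation": 114,
    "demap 64-QAM": 270,
}

budget = int(F_CLK_MHZ * T_DC_US)      # cycles available per OFDM symbol
used = sum(data_processing.values())   # cycles consumed per OFDM symbol
slack = budget - used

print(budget, used, slack)
```

With these numbers, only 6 of the 640 available cycles remain unused, which shows how tightly the 64-QAM case fills the duty cycle.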
4.4. SUMMARY AND CONCLUSION 105
implementation.
Facts The C6455 high performance DSP is extremely flexible and
allows for rapid code development thanks to its powerful SDK. However,
despite the twofold datapath and the high clock frequency, its
processing performance is hardly sufficient to sustain real-time SISO-OFDM
BB processing with BPSK modulation. The MSEC4 special purpose
BB processor incorporates many efficient mechanisms that support
OFDM BB processing (e.g., radix-4 butterfly structure, flexible AGUs
for intra-kernel data sorting, zero overhead loop support). However,
mainly due to its long critical timing path, but also because of its
difficulty in performing the per-sample processing required during the
initial reception phase, the processor cannot be employed for efficient
MIMO-OFDM BB processing without a substantial redesign. Finally,
the properly configured ASPE streaming processor comes with enough
flexibility to sustain both per-sample and per-symbol computations,
and delivers enough performance to sustain real-time SISO-OFDM
BB processing.
Conclusion Table 4.5 reports the area and the corresponding
operating frequency of the evaluated DSPs for their original target
technology, as well as normalized to a 0.18 µm CMOS technology. As
reinforced graphically by Figure 4.9, the best area efficiency is attained
by the ASPE, followed by the MSEC4. The C6455's efficiency is by far
the lowest, which can be attributed, on the one hand, to the DSP's large
Table 4.5: Areas and clock frequencies for the original designs and
their normalized versions for a 0.18 µm CMOS technology.
                Original                            Normalized to 0.18 µm
Architecture    CMOS [µm]   f [MHz]   A [mm2]       f0.18 [MHz]   A0.18 [mm2]
C6455 [76]      0.09        1 000     91            500           360
MSEC [82]       0.25        65        8.14          90            4.2
ASPE [10]       0.13        160       1.9           115           3.6
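The normalized columns of Table 4.5 are consistent with a simple scaling rule in which the clock frequency scales inversely with the feature size and the area scales with its square. The sketch below assumes this rule; it is an interpretation that reproduces the table entries to within rounding, not a statement of the exact procedure used in the thesis.

```python
def normalize(f_mhz, area_mm2, tech_um, target_um=0.18):
    """Scale frequency and area from tech_um to target_um.

    Assumed rule: f ~ 1/feature size, A ~ (feature size)^2.
    """
    s = target_um / tech_um
    return f_mhz / s, area_mm2 * s * s

designs = {
    "C6455": (1000.0, 91.0, 0.09),
    "MSEC": (65.0, 8.14, 0.25),
    "ASPE": (160.0, 1.9, 0.13),
}

for name, (f, a, tech) in designs.items():
    f_n, a_n = normalize(f, a, tech)
    print(f"{name}: f0.18 = {f_n:.1f} MHz, A0.18 = {a_n:.2f} mm2")
```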
Figure 4.9: Processing performance per area [MdOPS/mm2], for the evaluated SPAs (ASPE, MSEC4, C6455).
Chapter 5
MIMO-OFDM SDR Receiver
where OpA and OpB select the operand sources (i.e., SUs or FUs),
Shamt defines a possible shift amount, and Instr codes the instruction
to be executed on the FU.
one OBUF, and the register file can deliver the necessary processing
performance.
The three FUs have been selected principally by observing that the
hardest computational kernel in the MIMO-OFDM processing part
is the computation of the 64-point FFT, and that this kernel is
best served by butterfly operators, which significantly
speed up its computation (see Table 3.5). Consequently, the three
FUs can be interconnected to form a radix-2 butterfly. The FUs are:
one complex-valued multiply and accumulate (CMAC) unit, and two
complex-valued arithmetic logic units (CALU0 and CALU1). The
CMAC unit is implemented with three pipeline stages, while the two
CALUs require only two pipeline stages to attain the same critical
path length on all FUs.
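How one multiplier and two adder units combine into a radix-2 butterfly can be sketched in Python. The functions `cmac`, `calu_add` and `calu_sub` are hypothetical stand-ins for the three FUs and do not model their pipelines, word widths or control words; the recursive FFT merely shows that the 64-point transform can be built from this single operator.

```python
import cmath

def cmac(a, b):
    """Complex multiplication, standing in for the CMAC unit."""
    return a * b

def calu_add(a, b):
    """Complex addition, standing in for CALU0."""
    return a + b

def calu_sub(a, b):
    """Complex subtraction, standing in for CALU1."""
    return a - b

def butterfly(x0, x1, w):
    """Radix-2 butterfly: one multiplication feeding an add and a subtract."""
    t = cmac(x1, w)
    return calu_add(x0, t), calu_sub(x0, t)

def fft_radix2(x):
    """Recursive decimation-in-time FFT built only from butterflies."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out

def dft(x):
    """Direct DFT used as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

samples = [complex(m % 5, -(m % 3)) for m in range(64)]
max_err = max(abs(a - b) for a, b in zip(fft_radix2(samples), dft(samples)))
print(max_err < 1e-9)
```

A 64-point radix-2 FFT consists of (64/2) · log2(64) = 192 such butterflies.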
5.2. ASPE A – MIMOOFDM PROCESSING 115
values into a lookup table that resides inside the CALU has been
discarded and traded in for more numerical flexibility at run-time. The
FU's datapath wordwidth of 16 bit permits running up to NCOR = 16
iterations: arctan(2^-15) = 3.0518 · 10^-5, which, quantized for a
fixed-point representation [16 15], corresponds to 1 and is the smallest
representable quantized number.1 Listings 5.1 and 5.2 show two
assembler code snippets that compute the CORDIC algorithm on 16
complex-valued data samples. The former code snippet is compact and,
at first sight, seems efficient in both program memory and computation.
However, a closer look at the CordicIterLoop at line 20 reveals that only
two of the five VLIWs that compose the loop perform effective data
operations (i.e., one of those at lines 21 or 22, depending on the flag
of CALU0, and that at line 23), while two further instructions
are required to fill the two branch delay slots. Thus, the code is not
computationally efficient.
The second code snippet (Listing 5.2) is computationally more
efficient. At every second clock cycle an effective data operation takes
place (again, depending on the flag of CALU0). The apparent program
code inefficiency caused by the several repetitions of the same VLIWs
turns out not to be an issue. Thanks to the dictionary-based program
code compression, only the unique VLIWs need to be stored inside the
dictionary memory, which results in compact code. Eventually,
this code snippet is used for the MIMO-OFDM BB implementation.
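The structure of the CORDIC iterations can be illustrated with a short Python sketch. It is not a transcription of Listings 5.1 or 5.2: it uses floating-point arithmetic instead of the 16 bit fixed-point datapath, assumes the rotation mode, and the names `cordic_rotate` and `quantize` are chosen here for illustration only.

```python
import math

N_COR = 16        # number of iterations permitted by the 16 bit wordwidth
FRAC_BITS = 15    # fractional bits of the [16 15] fixed-point format

# Elementary rotation angles arctan(2^-i), i = 0 .. N_COR - 1.
ANGLES = [math.atan(2.0 ** -i) for i in range(N_COR)]
# Accumulated CORDIC gain, compensated once after the last iteration.
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(N_COR))

def cordic_rotate(x, y, phi, n_iter=N_COR):
    """Rotate (x, y) by phi using shift-and-add steps (rotation mode)."""
    z = phi
    for i in range(n_iter):
        d = 1.0 if z >= 0.0 else -1.0   # decision d(i), i.e. the flag
        shift = 2.0 ** -i               # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * ANGLES[i]
    return x / GAIN, y / GAIN

def quantize(value, frac_bits=FRAC_BITS):
    """Express a value in LSBs of a format with frac_bits fractional bits."""
    return round(value * (1 << frac_bits))

re, im = cordic_rotate(1.0, 0.0, 0.5)
print(abs(re - math.cos(0.5)) < 1e-4, abs(im - math.sin(0.5)) < 1e-4)
print(quantize(ANGLES[-1]))   # smallest elementary angle in LSBs
```

The last line confirms the statement above: the sixteenth elementary angle arctan(2^-15) quantizes to a single LSB, so further iterations would not contribute.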
Figure 5.4: One of the two identical datapaths that compose the 2-way SIMD CALU FUs of ASPE A.
Figure 5.5: Configuration and combination of the two CALUs used to compute one CORDIC iteration.
sequence of CORDIC instructions has to reset the internal iteration counter. This
is done by setting the corresponding CW's Shamt field to ’0000’.
Table 5.2.
Figure 5.6: One of the two identical datapaths that compose the 2-way
SIMD CMAC FU of ASPE A.
The terms ’clock cycle’ and ’cycle’ are used interchangeably in this chapter.
employed for the comparison. Once the frame start is detected, the
receiver proceeds into the STF processing state.
Summarizing, frame start detection requires 7Ndc + 90 clock cycles
to process blocks of Ndc received data samples.
MIMO channel processing First, the coarse and fine phases are
added together to generate a single, fine frequency offset estimate (in
15 cycles). Fine FOC takes place next (Ndc + 10 cycles). Then, the
channel estimate is computed by the matrix-matrix multiplication
Ĥ = Z T^-1 described in Section 3.5.4, for which ASPE A requires
4 Nc + 10 cycles. The resulting channel estimates are transferred
to ASPE B via the OBUF→IBUF link between the two processors.
Once on ASPE B, they are further processed to obtain the linear
MMSE estimator matrix G. Then, the processor switches to the data
processing state.
Figure 5.12: One of the two identical datapaths that compose the
2-way SIMD CALU FU of ASPE B.
5.3. ASPE B – MIMO DETECTION 133
Figure 5.13: One of the two identical datapaths that compose the
2-way SIMD CMAC FU of ASPE B.
Figure 5.14: One of the two identical datapaths that compose the
2-way SIMD DIV FU of ASPE B.
Section 5.2.2.
d Implementation of MIMO detection, as used in Section 5.2.2.
Figure 5.16: Area breakdown of the ASPE placed and routed for
0.18 µm CMOS technology. The total area amounts to 4.2 mm2; the
SEQ containing the program memory occupies 1 mm2. Labeled shares:
SEQ 24%, CMAC 13%, CALU0 7%, CALU1 7%, ST1 5%, ST2 5%, ST3 5%,
ST4 5%, ST5 4%, IBUF 4%, Regfile 2%.
1. If the CWs of both VLIWs are NOPs, then the chance of
condensing the two VLIWs is still intact.
4. If none of the above three tests is successful, then the two VLIWs
cannot be condensed.
Once all Nu CW pairs are tested and it is found that the two VLIWs
can be merged, the resulting condensed VLIW is stored inside the
compressed dictionary memory.
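Leaving the NOP-masking step aside, the underlying dictionary-based compression can be sketched as follows. The sketch is an illustration under stated assumptions: VLIWs are modeled as tuples of 16 bit CWs, the NOP encoding 0x0000 and the example CW values are hypothetical, and the condensing tests are not modeled because only two of the four are listed here.

```python
NOP = 0x0000   # hypothetical NOP encoding, for illustration only

def build_dictionary(program):
    """Split a program into a dictionary of unique VLIWs and an index list."""
    dictionary, index, seen = [], [], {}
    for vliw in program:
        if vliw not in seen:
            seen[vliw] = len(dictionary)   # first occurrence gets a new slot
            dictionary.append(vliw)
        index.append(seen[vliw])           # program order is kept in the index
    return dictionary, index

# A toy program with Nu = 3 units per VLIW and repeated instruction words.
program = [
    (0x4110, NOP, 0x8100),
    (0x4113, NOP, 0x8100),
    (0x4110, NOP, 0x8100),
    (0x4110, NOP, 0x8100),
    (NOP, 0xE1D5, NOP),
    (0x4113, NOP, 0x8100),
]

dictionary, index = build_dictionary(program)
restored = [dictionary[i] for i in index]   # what the sequencer would fetch
print(len(program), len(dictionary), index, restored == program)
```

Six program words are represented here by three dictionary entries plus six short pointers; the condensing step would further reduce the number of dictionary entries.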
The algorithm used to generate the compressed dictionary is
illustrated by the pseudo-code fragment in Algorithm 1. The uncompressed
dictionary memory is considered as an L × Nu matrix M, and the
resulting compressed dictionary as an L′ × Nu matrix N. The entries
of both matrices are CWs of the datapath units (FUs and SUs) and
the SEQ. The notation M(m, :) means that the entire mth row of M is
accessed, and similarly, M(:, n) means that the entire nth column of M
is accessed. At line 1, the sortcols(.) function permutes the columns of
the uncompressed dictionary matrix M, such that the first column of the
output matrix T1 contains the column of M with the most non-NOP
CWs, and the last column of T1 the column of M with the fewest
non-NOP CWs. The vector t1i stores the column-permutation indexes
of T1. The sortrows(.) function at line 3, in contrast, sorts the rows of
its argument vector according to the recurrence of the CWs inside the
original, uncompressed program. The most frequent CW is permuted
to the first row of t2, whereas the least frequent CW becomes the
last entry of t2. The function genpattern(.) (line 15) generates an
appropriate pattern pat that is used to test if the four above mentioned
Table 5.8: DBCC with NOP masking for the benchmark programs.
Benchmark           P     L′    ρ     ρ̄ a   bcmp    Rbit b
SISO BB proc.       762   142   53%   47%    42504   29%
16 Tap FIR          44    8     33%   67%    2196    25%
64-point FFT        149   74    59%   41%    17039   59%
Two 64-point FFT    253   97    59%   41%    23431   48%
MIMO BB kernel      506   116   63%   37%    31886   33%
a ρ̄ = 1 − ρ.
b The lower the better.
(a) ASPE A for MIMO-OFDM processing. (b) ASPE B for MIMO detection.
Figure 5.20: Floorplan of the two fabricated chips. The main building
blocks are highlighted and the corresponding areas (in kGE) are
reported. The non-labeled area is occupied by the DNet, by the
control logic of the SUs, and mainly by filler cells.
Summary and Conclusions
Summary
As the domain of mobile wireless communications becomes increasingly
populated with differing communication protocols, the importance
of mobile software defined radio (SDR) terminals grows. The high
data rates prescribed, however, render an implementation on the limited
processing resources of a flexible architecture extremely challenging.
This trend is especially perceptible in the wireless local area network
(WLAN) domain, where the data rates are already high compared, for
instance, to those of mobile phone standards. Moreover, the
tight power consumption constraint, necessary to ensure long operation
times from battery, further aggravates this challenge.
Nevertheless, the implementation of the 2 × 2 MIMO-OFDM baseband
receiver algorithms on an SDR platform composed of two application
specific processors (ASIPs), as proposed in this thesis, indicates
a viable solution to overcome these tough constraints. To this end,
the joint analysis of the computational complexity and the BER
performance of MIMO detection algorithms was necessary to identify
suitable, low-complexity algorithms (Chapter 3). The subsequent
mapping of computationally hard OFDM baseband processing kernels
Conclusions
The implementation of the 2 × 2 MIMO-OFDM baseband receiver
algorithms on the two ASPEs proved to be extremely challenging. On
the one hand, among the vast number of MIMO-OFDM receiver algorithms,
appropriate ones had to be selected. To this end, extensive Monte
Carlo simulations were run to assess the achievable (bit-true) BER
performance, and the atomic operations involved were counted to
derive the algorithms' computational complexity. On the other hand,
the algorithms had to be implemented on the ASPE architecture.
To this end, the ASPE's datapath first had to be configured with
appropriate units, followed by the mapping of the assembly coded, hand
1 The area of one gate equivalent (GE) corresponds to the silicon area occupied
An efficient ASIP SDR platform thus requires units that respect and
support the granularity of these elementary kernels.
DLP and ILP The degree of data level parallelism (DLP) inherent
to many signal processing algorithms should be exploited to increase
the processing performance while reducing the datapath control
overhead. In this thesis, the DLP inherent to the two receive streams of
the 2 × 2 MIMO-OFDM receiver was efficiently exploited by extending
the ASPE's datapath to operate in a 2-way SIMD manner. The
MIMO Detection Methods
$$T_i(s^{(i)}) = T_{i+1}(s^{(i+1)}) + \Big\| \tilde{y}_i - \sum_{j=i}^{M_T} r_{ij} s_j \Big\|^2, \tag{A.2}$$
A.2 K-Best
The K-best algorithm is a breadth-first search algorithm that operates
on the same tree structure as the SD algorithm. The K-best algorithm
employed here and a corresponding ASIC implementation are presented
in [59]. In the following, the K-best algorithm is only briefly
described. The aim is to highlight the differences
to the SD algorithm, and to compute its CC.
Two preprocessing methods are considered in [59]. With
QR-decomposition the resulting problem is complex-valued, whereas with
real-valued decomposition (RVD) and subsequent QR-decomposition,
the problem becomes real-valued only. With RVD each tree level
has √M nodes and the tree depth doubles to 2MT (without RVD,
instead, each of the MT tree levels has M nodes). The BER
Algorithm 2 KB.
In: $R$, $\tilde{y}$
Out: $\hat{s}$
1: for $i = 2M_T, 2M_T - 1, \ldots, 1$ do
2:   for $n = 1, \ldots, \sqrt{M}$ do
3:     $d_i(s_n) = r_{ii} s_n - \tilde{y}_i$   // Mult: 1, Add: 1
4:   end for
5:   for $k = 1, \ldots, K$ do
6:     $b_{i+1}(s^{(i+1)}(k)) = \sum_{j=i+1}^{2M_T} r_{ij} s_j(k)$   // Mult: $2M_T - i$
7:     for $n = 1, \ldots, \sqrt{M}$ do
8:       $D(k, n) = T_{i+1}(k) + \| b_{i+1}(s^{(i+1)}(k)) + d_i(s_n) \|^2$   // Mult: 1, Add: 2
9:     end for
10:  end for
11:  $T_i[1:K] = \mathrm{sort}(D)[1:K]$   // sort in ascending order and take the K smallest distances
12:  Store the K candidate vector-symbols $s^{(i)}(k)$ that lead to $T_i[1:K]$
13: end for
14: $\hat{s} = \min(s^{(1)}[1:K])$
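Algorithm 3 SIC.
In: $R$, $\tilde{y}$
Out: $\hat{s}$
1: for $i = M_T, \ldots, 1$ do
2:   $\hat{y}_i = \tilde{y}_i - \sum_{j=i+1}^{M_T} r_{i,j} \hat{s}_j$   // Mult: $M_T - i$, Add: 1
3:   $\hat{s}_i = Q(\hat{y}_i, r_{i,i})$   // Mult: $\lfloor \sqrt{M} \rfloor$, Comp: $\log_2 M$
4: end for
$$N_{MULT} = \sum_{i=1}^{M_T} \big( M_T - i + \lfloor \sqrt{M} \rfloor \big) = (-1/2 + \sqrt{M}) M_T + M_T^2/2$$
$$N_{ADD} = (1 + \log_2 M) M_T.$$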
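The loop of Algorithm 3 and its multiplication count can be reproduced with a few lines of Python. The sketch makes two assumptions that are not taken from the text: the slicer Q(.) is modeled as a nearest-point search over the scaled constellation, and the example matrix R and symbols are arbitrary test values.

```python
import math

def sic_detect(R, y_tilde, constellation):
    """Successive interference cancellation on an upper triangular R."""
    mt = len(y_tilde)
    s_hat = [0j] * mt
    for i in range(mt - 1, -1, -1):
        # Cancel the interference of the symbols that are already detected.
        y_i = y_tilde[i] - sum(R[i][j] * s_hat[j] for j in range(i + 1, mt))
        # Slicer Q(.), assumed here to pick the nearest scaled constellation point.
        s_hat[i] = min(constellation, key=lambda s: abs(y_i - R[i][i] * s))
    return s_hat

def n_mult_sum(mt, m):
    """Multiplications summed over the loop iterations of Algorithm 3."""
    return sum(mt - i + math.isqrt(m) for i in range(1, mt + 1))

def n_mult_closed(mt, m):
    """Closed-form expression for the same count."""
    return (-0.5 + math.sqrt(m)) * mt + mt * mt / 2

qpsk = [complex(a, b) for a in (-1, 1) for b in (-1, 1)]
R = [[2.0, 0.5 - 0.25j], [0.0, 1.5]]
s = [1 - 1j, -1 + 1j]
y = [sum(R[i][j] * s[j] for j in range(2)) for i in range(2)]   # noise-free

print(sic_detect(R, y, qpsk) == s)
print(n_mult_sum(2, 64), n_mult_closed(2, 64))
```

For perfect-square constellation sizes M the summed count and the closed form coincide, e.g. 17 multiplications for MT = 2 and M = 64.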
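$$y = Hs + n \in \mathbb{C}^{M_R} \tag{A.6}$$
$$F = H^H H + M_T \sigma^2 I \in \mathbb{C}^{M_T \times M_T} \tag{A.7}$$
$$G = F^{-1} H^H \in \mathbb{C}^{M_T \times M_R} \tag{A.8}$$
$$\hat{y} = G y \in \mathbb{C}^{M_T} \tag{A.9}$$
$$F^{-1} = \frac{\operatorname{adj}(F)}{\det(F)}. \tag{A.10}$$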
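For the 2 × 2 case, the chain (A.7) to (A.10) can be written out directly. The following Python sketch uses plain nested lists and helper names chosen here for illustration; the channel matrix and the symbols are arbitrary test values, and the noise variance is set to zero so that the detector output can be compared with the transmitted vector.

```python
def herm(a):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[a[r][c].conjugate() for r in range(len(a))]
            for c in range(len(a[0]))]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inv2x2(f):
    """Direct inverse adj(F)/det(F) of a 2 x 2 matrix, cf. (A.10)."""
    det = f[0][0] * f[1][1] - f[0][1] * f[1][0]
    adj = [[f[1][1], -f[0][1]], [-f[1][0], f[0][0]]]
    return [[v / det for v in row] for row in adj]

def mmse_matrix(h, sigma2):
    """G = (H^H H + MT sigma^2 I)^-1 H^H, cf. (A.7) and (A.8)."""
    mt = len(h[0])
    hh = herm(h)
    f = matmul(hh, h)
    for i in range(mt):
        f[i][i] += mt * sigma2
    return matmul(inv2x2(f), hh)

H = [[1 + 0.5j, 0.2 - 0.1j], [-0.3j, 0.8 + 0.2j]]
s = [[1 - 1j], [-1 - 1j]]
y = matmul(H, s)                       # noise-free received vector
y_hat = matmul(mmse_matrix(H, sigma2=0.0), y)
print(all(abs(y_hat[i][0] - s[i][0]) < 1e-9 for i in range(2)))
```

With a non-zero noise variance the same function returns the regularized estimator, whose output is a scaled and slightly biased version of the transmitted vector.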
A.4.2 LR-decomposition
The LR-decomposition (or LU-decomposition) in Algorithm 5 is stable
for strictly diagonally dominant matrices, which is the case for the F
matrix in the low-SNR regime. The steps leading to ŷ are:
$$m = A y \in \mathbb{C}^{M_T}, \tag{A.14}$$
$$\hat{y} = \mathrm{BS}(R, m) = R^{-1} m \in \mathbb{C}^{M_T}. \tag{A.15}$$
A.4.3 LDL-decomposition
Three versions of the LDL-decomposition are reported. Algorithm 6 is
the implementation reported in [66]; Algorithm 7 and Algorithm 8 are
modified versions with slightly lower CC.
Table A.4: LR-decomposition's CC, BS-based detection.
Preprocessing
Step          CMAC                    DIV     CADD                  Total
(A.7)         $M_T M_R (M_T+1)/2$     0       $M_T$                 $(1 + M_R/2) M_T + M_R M_T^2/2$
(A.12)        $M_T (M_T^2 - 1)/3$     $M_T$   $1 - 2M_T + M_T^2$    $1 - 4M_T/3 + M_T^2 + M_T^3/3$
(A.13)        $M_T M_R (M_T-1)/2$     0       $M_R (M_T - 1)$       $-M_R + M_R M_T/2 + M_R M_T^2/2$
Total COps.   $T_1$ a                 $M_T$   $T_2$ b               $1 - M_R - M_T/3 + M_R M_T + (1 + M_R) M_T^2 + M_T^3/3$
a $T_1 = -M_T/3 + M_R M_T^2 + M_T^3/3$
b $T_2 = 1 - M_R + (M_R - 1) M_T + M_T^2$
c $T_R = 2 - 2M_R + (-7/3 + 2M_R) M_T + (2 + 4M_R) M_T^2 + 4M_T^3/3$
d Division is avoided: $1/r_{i,i}$ is computed during preprocessing.
$$N_{DIV} = M_T$$
$$N_{ADD} = -M_T/2 + M_T^2/2.$$
The estimated number of operations to complete Algorithm 7 is:
$$N_{MULT} = \sum_{i=1}^{M_T} \Big( i - 1 + \sum_{j=1}^{i-1} (j - 1 + 1) \Big) = -2M_T/3 + M_T^2/2 + M_T^3/6$$
$$N_{DIV} = M_T$$
$$N_{ADD} = -M_T/2 + M_T^2/2.$$
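The closed form of the multiplication count for Algorithm 7 can be verified by evaluating the outer and the nested sum separately. The worked step below assumes the nested-sum reading of the count, with the inner sum running over j = 1, ..., i − 1.

```latex
\begin{align*}
\sum_{i=1}^{M_T} (i-1) &= \frac{M_T (M_T - 1)}{2}, \\
\sum_{i=1}^{M_T} \sum_{j=1}^{i-1} j
  &= \sum_{i=1}^{M_T} \frac{(i-1)\, i}{2} = \frac{M_T^3 - M_T}{6}, \\
N_{MULT} &= \frac{M_T^2}{2} - \frac{M_T}{2} + \frac{M_T^3}{6} - \frac{M_T}{6}
          = -\frac{2 M_T}{3} + \frac{M_T^2}{2} + \frac{M_T^3}{6}.
\end{align*}
```

For $M_T = 2$ both the sums and the closed form give $N_{MULT} = 2$.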
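A.4.4 GS-decomposition
To perform Gram-Schmidt based QR-decomposition, we start by
observing that with $\bar{H} = [H^H \;\; \sqrt{M_T}\,\sigma I_{M_T}]^H \in \mathbb{C}^{(M_R+M_T) \times M_T}$ we
obtain
Thus, according to (A.20), we have $\hat{y} = R^{-1} \bar{Q}^H \bar{y}$. Since the last
$M_T$ entries of $\bar{y}$ are all zero and with
$$\bar{Q} = \begin{bmatrix} Q \\ \tilde{Q} \end{bmatrix},$$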
Symbol-Vector Detection
Step          CMAC                            DIV    CADD         Total
(A.22)        $M_R M_T$                       0      0            $M_R M_T$
(A.23)        $M_T (M_T + 1)/2$               0 e    $M_T - 1$    $-1 + 3M_T/2 + M_T^2/2$
Total COps.   $(1/2 + M_R) M_T + M_T^2/2$     0      $M_T - 1$    $-1 + (3/2 + M_R) M_T + M_T^2/2$
Total ROps.   $4(1/2 + M_R) M_T + 2M_T^2$     0      $2M_T - 2$   $-2 + (2 + 4(1/2 + M_R)) M_T + 2M_T^2$
a $T_1 = (7/6 + M_R) M_T + (1/2 + M_R) M_T^2 + M_T^3/3$
b $T_2 = (-1 - 3M_R) M_T/6 + M_R M_T^2/2 + M_T^3/6$
c $T_C = (6 + M_R) M_T/2 + (1 + 3M_R) M_T^2/2 + M_T^3/2$
d $T_R = (19/3 + 3M_R) M_T + (2 + 5M_R) M_T^2 + 5M_T^3/3$
e Division is avoided since $1/r_{i,i}$ is computed during preprocessing and stored.
A.4.5 QR-decomposition
In order to detect ŷ = Gy using the classical Givens-rotations based
QR-decomposition, we proceed in an analogous way as for
GS-decomposition. With $\bar{H} = [H^H \;\; \sqrt{M_T}\,\sigma I_{M_T}]^H$ and (A.19), we obtain (A.20).
By taking the QR-decomposition of $\bar{H} = \bar{Q}\bar{R}$ we obtain a unitary
matrix $\bar{Q} \in \mathbb{C}^{(M_R+M_T) \times (M_R+M_T)}$ and an upper triangular matrix
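$$\begin{aligned}
\bar{G} &= \big( (\bar{\bar{Q}} R)^H (\bar{\bar{Q}} R) \big)^{-1} (\bar{\bar{Q}} R)^H \\
&= \big( R^H \bar{\bar{Q}}^H \bar{\bar{Q}} R \big)^{-1} (R^H \bar{\bar{Q}}^H) \\
&= (R^H R)^{-1} (R^H \bar{\bar{Q}}^H) \\
&= R^{-1} \bar{\bar{Q}}^H.
\end{aligned}$$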
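$$E_m(\alpha) = \begin{bmatrix}
1 & & & & \\
& \ddots & & & \\
& & e^{j\alpha} & & \\
& & & \ddots & \\
& & & & 1
\end{bmatrix}$$
where $e^{j\alpha}$ is located in row $m$ and column $m$, and
$$U_{m,n}(\beta) = \begin{bmatrix}
1 & & & & \\
& \cos\beta & \cdots & \sin\beta & \\
& \vdots & \ddots & \vdots & \\
& -\sin\beta & \cdots & \cos\beta & \\
& & & & 1
\end{bmatrix}$$
where $\cos\beta$ occupies the entries $(m, m)$ and $(n, n)$, $\sin\beta$ the entry $(m, n)$, and $-\sin\beta$ the entry $(n, m)$.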
According to the detection method we proceed as follows.
• For BS-based detection:
$$m = Q^H y \in \mathbb{C}^{M_T}$$
$$\hat{y} = \mathrm{BS}(R, m) = R^{-1} m \in \mathbb{C}^{M_T}.$$
Table A.10 summarizes the CC of detecting ŷ by the above described
procedure.2
• Whereas for MM-based detection the steps are:
$$G = \mathrm{BS}(R, Q^H) = R^{-1} Q^H \in \mathbb{C}^{M_T \times M_R}$$
$$\hat{y} = G y \in \mathbb{C}^{M_T}.$$
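The operator BS(R, m) used in both detection variants denotes back-substitution on the upper triangular matrix R. A minimal Python sketch, with a function name and test values chosen here for illustration, shows that R^-1 never has to be formed explicitly:

```python
def back_substitute(R, m):
    """Solve R x = m for an upper triangular R, last row first."""
    n = len(m)
    x = [0j] * n
    for i in range(n - 1, -1, -1):
        # Remove the contribution of the entries solved so far.
        acc = m[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = acc / R[i][i]
    return x

R = [[2.0, 1.0], [0.0, 4.0]]
m = [4.0, 8.0]
x = back_substitute(R, m)
print(x == [1, 2])
```

For MM-based detection the same routine would be applied column by column to Q^H in order to obtain G.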
Step          CMAC                                       DIV                CADD         Total
(A.25)        $(13/2 + 2M_R) M_R M_T + 3M_R M_T^2/2$     $2M_R M_T$         0            $(17/2 + 2M_R) M_R M_T + 3M_R M_T^2/2$
(A.24)        $M_R M_T (M_T + 1)/2$                      $M_T$              $M_T - 1$    $-1 + (2 + M_R/2) M_T + M_R M_T^2/2$
Total COps.   $(7 + 2M_R) M_R M_T + 2M_R M_T^2$          $(1 + 2M_R) M_T$   $M_T - 1$    $-1 + (2 + 9M_R + 2M_R^2) M_T + 2M_R M_T^2$
Total ROps.   $4(7 + 2M_R) M_R M_T + 8M_R M_T^2$         $(1 + 2M_R) M_T$   $2M_T - 2$   $-2 + (3 + 30M_R + 8M_R^2) M_T + 8M_R M_T^2$
Symbol-Vector Detection
Step          CMAC          DIV   CADD   Total
(A.9)         $M_R M_T$     0     0      $M_R M_T$
Total COps.   $M_R M_T$     0     0      $M_R M_T$
Total ROps.   $4M_R M_T$    0     0      $4M_R M_T$
$$F = H^H H + M_T \sigma^2 I \in \mathbb{C}^{M_T \times M_T}$$
$$\hat{y} = G y \in \mathbb{C}^{M_T}$$
Datasheet
Pinout
Figure B.1 illustrates the pinout of ASPE A and ASPE B in their
PGA120 package. The two ASPEs are pin-compatible. Table B.1
describes the functionality of the I/O signal pads.
[Figure B.1: Pinout of ASPE A and ASPE B in the PGA120 package. The legend distinguishes core power, pad power, and GND pads.]
Power For ASPE A, the node toggling activity was extracted while
simulating the computation of a 64-point FFT. For ASPE B, the
node toggling activity was extracted while simulating the computation
of the inverse of a 2 × 2 matrix. The resulting power consumption is
reported in Table B.4. At 250 MHz the resulting power consumption is
around 700 mW for each ASPE.
Operating Modes
Configuration through SPI
The serial peripheral interface (SPI) provides access to the ASPE's
memories. This access is essential for loading software into the
sequencer's program memory. Also, while debugging, the access through
SPI to SUs and FUs facilitates error localization. ScanEnablexTI,
ScanModexTI, and BistEnablexTI shall be tied to ground during
configuration.
Figure B.2 shows the generic timing diagram of one SPI transaction.
One transaction begins with SSxSBI being deasserted, while SCKxCI
is low. Next, an SPI command byte is transmitted, bit by bit and
from MSB to LSB, over the SlaveInxDI pin. Concurrently, one SPI
response byte is output from MSB to LSB over the SlaveOutxDO pin.
Three SPI commands are available: READ from and WRITE to an ASPE
memory, and NOP.
4. Receive data, from the least significant byte (D0) to the most
significant one (D3). Requires four SPI transactions.
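[Timing diagram of a READ access: the READ command is followed by the address bytes A0, A1, A2 on SlaveInxDI; the data bytes follow.]
[Timing diagram of a WRITE access: the WRITE command is followed by the address bytes A0, A1, A2 and the data bytes D0, D1, D2, D3 on SlaveInxDI.]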
SPI commands Table B.5 summarizes the SPI commands and SPI
responses.
Memory Map Figure B.5 depicts the physical design of the ASPE's
index and dictionary memory. The index memory is a 1024 × 20 bit
SRAM and the dictionary memory is composed of three 256 × 64 bit
SRAMs (DICT0 to DICT2). The 16 bit control words (CWs) used to
control the SEQ, the FUs, and the SUs are assigned to the dictionary
memory slots as in Table B.6 (cf. Figure B.5). Finally, the memory
map in Table B.7 allows the index and dictionary memories to be
accessed over SPI.
Access to the SUs and FUs is also possible over SPI. For the FU
and SU access, the 24 bit SPI address A (bytes A0 to A2) has the bit
structure 11uu uuaa aaaa aaaa aaaa aa00. The 4 bits uuuu encode
the unit's internal slot number and the 16 bits aaaa aaaa aaaa aaaa
the unit's CW. The internal slot number (uuuu) assignment is reported
in Table B.8. The memory map required to access the ASPE's storage
units is reported in Table B.9. The IBUF and OBUF are accessed in
an analogous way.
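The bit structure of the FU and SU addresses can be made explicit with a small packing routine. The Python sketch below is illustrative only; the function names are chosen here, and the 16 bit field value 0x6800 is not stated in the text but obtained by decoding the first address listed for SU0 in Table B.9.

```python
def unit_address(slot, cw):
    """Pack slot number uuuu and the 16 bit field into 11uu uuaa ... aa00."""
    if not 0 <= slot <= 0xF:
        raise ValueError("slot number must fit into 4 bits")
    if not 0 <= cw <= 0xFFFF:
        raise ValueError("CW field must fit into 16 bits")
    return (0b11 << 22) | (slot << 18) | (cw << 2)

def split_address(addr):
    """Inverse of unit_address: return (slot, cw) of a 24 bit address."""
    return (addr >> 18) & 0xF, (addr >> 2) & 0xFFFF

# SU0 occupies internal slot 0x8 (Table B.8).
addr = unit_address(0x8, 0x6800)
print(hex(addr))
print(split_address(0xF15000))
```

Decoding the SU4 address 0xF15000 in this way yields slot 0xC, in agreement with the slot assignment of SU4 on ASPE A in Table B.8.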
Normal operation
The ASPE starts its autonomous operation once the stall register
residing inside the SEQ is cleared. To this end, the SPI command ’WRITE
0x780060 0x00000000’ is issued (write the data word 0x00000000 to
address 0x780060, which corresponds to the SEQ stall register). The
SEQ then starts fetching the first dictionary pointer located at address
0x000 of the index memory.
ScanEnablexTI, ScanModexTI, and BistEnablexTI shall be tied to
ground during normal operation.
BIST
The memory BIST writes a chessboard pattern into the memories.
Thereafter, it reads out the memories and checks whether the pattern
matches the expected one. If the check for one memory passes,
the BIST signal of that memory is raised; otherwise it remains zero.
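The principle of such a write-then-verify test can be sketched in Python. The sketch is a behavioral illustration only: the `Memory` class is a toy model with an optional stuck-at-0 fault, and the exact pattern (alternating bits, inverted on every second address) is an assumption, since the pattern layout is not specified here.

```python
class Memory:
    """Toy memory model; stuck_at optionally forces one bit of one word to 0."""

    def __init__(self, n_words, width, stuck_at=None):
        self.words = [0] * n_words
        self.mask = (1 << width) - 1
        self.stuck_at = stuck_at              # (address, bit) or None

    def write(self, addr, value):
        value &= self.mask
        if self.stuck_at and self.stuck_at[0] == addr:
            value &= ~(1 << self.stuck_at[1])  # model the defective cell
        self.words[addr] = value

    def read(self, addr):
        return self.words[addr]

def chessboard_word(addr, width):
    """Alternating bit pattern, inverted on every odd address (assumed layout)."""
    mask = (1 << width) - 1
    base = int("01" * width, 2) & mask
    return base if addr % 2 == 0 else base ^ mask

def run_bist(mem, n_words, width):
    """Write the pattern to all words, then read back and compare."""
    for addr in range(n_words):
        mem.write(addr, chessboard_word(addr, width))
    return all(mem.read(addr) == chessboard_word(addr, width)
               for addr in range(n_words))

print(run_bist(Memory(1024, 20), 1024, 20))
print(run_bist(Memory(1024, 20, stuck_at=(0, 0)), 1024, 20))
```

A single pass with one pattern only detects cells stuck at the value opposite to the written bit; this limitation applies to the sketch and says nothing about the coverage of the actual BIST.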
Figure B.6 shows the signaling scheme used to enable the memory
built-in self-test (BIST) mode. To enter the BIST mode, the
BistEnablexTI signal has to be set to 1 before RstxRBI is released.
BistModexTI selects whether the "Bist DONE" or "Bist OK" status is
[Figure B.5: Physical design of the ASPE's index memory (IDX) and dictionary memories (DICT0 to DICT2).]
Table B.7: Index and dictionary memory maps for ASPE A and
ASPE B (write only).
Unit    Address range
DICT0   0x720000 to 0x720FFC
DICT1   0x760000 to 0x760FFC
DICT2   0x7A0000 to 0x7A0FFC
IDX     0x7B0000 to 0x7B0FFC
Table B.8: Internal slot number uuuu assignment for FUs and SUs.
Internal slot uuuu   ASPE A   ASPE B
0x0                  REG      REG
0x1                  nil      nil
0x2                  nil      DIV
0x3                  nil      CMAC0
0x4                  CMAC     CMAC1
0x5                  CALU0    CALU
0x6                  CALU1    nil
0x7                  nil      nil
0x8                  SU0      SU0
0x9                  SU1      SU1
0xA                  SU2      SU2
0xB                  SU3      SU3
0xC                  SU4      OBUF
0xD                  OBUF     IBUF
0xE                  IBUF     nil
SCAN
The scan chain is activated through ScanEnablexTI (active if 1).
ScanModexTI isolates the memories from the scan chain and needs to be
set to 1 during test mode.
Table B.9: Memory maps for addressing SUs of ASPE A and ASPE B.
Unit   Address range          Meaning
SU0    0xE1A000 to 0xE1A3FF   Write to both SIMD mems
       0xE18000 to 0xE183FF   Write to R mem
       0xE19000 to 0xE193FF   Write to L mem
       0xE14000 to 0xE143FF   Read from R mem
       0xE15000 to 0xE153FF   Read from L mem
SU1    0xE5A000 to 0xE5A3FF   Write to both SIMD mems
       0xE58000 to 0xE583FF   Write to R mem
       0xE59000 to 0xE593FF   Write to L mem
       0xE54000 to 0xE543FF   Read from R mem
       0xE55000 to 0xE553FF   Read from L mem
SU2    0xE9A000 to 0xE9A3FF   Write to both SIMD mems
       0xE98000 to 0xE983FF   Write to R mem
       0xE99000 to 0xE993FF   Write to L mem
       0xE94000 to 0xE943FF   Read from R mem
       0xE95000 to 0xE953FF   Read from L mem
Table B.10: Memory maps for addressing SUs of ASPE A and ASPE B.
Note that SU4 is only available on ASPE A.
Unit   Address range          Meaning
SU3    0xEDA000 to 0xEDA3FF   Write to both SIMD mems
       0xED8000 to 0xED83FF   Write to R mem
       0xED9000 to 0xED93FF   Write to L mem
       0xED4000 to 0xED43FF   Read from R mem
       0xED5000 to 0xED53FF   Read from L mem
SU4    0xF1A000 to 0xF1A3FF   Write to both SIMD mems
       0xF18000 to 0xF183FF   Write to R mem
       0xF19000 to 0xF193FF   Write to L mem
       0xF14000 to 0xF143FF   Read from R mem
       0xF15000 to 0xF153FF   Read from L mem
[Figure B.6: Signaling scheme used to enter the BIST mode (signals ClkxCI, BistEnablexTI, RstxRBI); BistEnablexTI is set before RstxRBI goes high.]
Bibliography
Proceedings of the IEEE, vol. 92, no. 2, pp. 198–218, Feb. 2004.
[8] IEEE, Draft Standard for Information Technology – Telecommunications and information exchange between systems – Local and metropolitan area networks – Specific requirements – Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) specifications: Amendment 4: Enhancements for Higher Throughput, 2007.
[9] T. Boesch, “Adaptive stream processor for network multimedia consumer electronic devices,” Ph.D. dissertation, ETH Zurich, 2004.
[10] S. Eberli, A. Burg, T. Boesch, and W. Fichtner, “An IEEE 802.11a baseband receiver implementation on an application specific processor,” in Circuits and Systems, 2007. MWSCAS 2007. 50th Midwest Symposium on, Montreal, Que., Aug. 5–8, 2007, pp. 1324–1327.
[11] S. Eberli, A. Burg, and W. Fichtner, “Implementation of a 2 × 2 MIMO-OFDM receiver on an application specific processor,” Microelectronics Journal, In Press, Corrected Proof, 2009 (invited). [Online]. Available: http://www.sciencedirect.com/science/article/B6V444VWJ1YP1/2/bbcb3d4c513f25e650913b83fef4c11d
[12] B. Bougard, B. De Sutter, S. Rabou, D. Novo, O. Allam, S. Dupont, and L. Van der Perre, “A coarse-grained array based baseband processor for 100Mbps+ software defined radio,” in Design, Automation and Test in Europe, 2008. DATE ’08, Munich, Germany, Mar. 2008, pp. 716–721.
[13] S. Eberli, D. Cescato, and W. Fichtner, “Divide-and-conquer matrix inversion for linear MMSE detection in SDR MIMO receivers,” in NORCHIP, 2008., Tallinn, Nov. 2008, pp. 162–167.
[14] R. H. Dennard, J. Cai, and A. Kumar, “A perspective on today’s scaling challenges and possible future directions,” Solid-State Electronics, vol. 51, no. 4, pp. 518–525, April 2007.
[86] IEEE Std., Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) specifications, High-speed Physical Layer in the 5 GHz Band, 1999.