
IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 26, NO. 6, AUGUST 2008, p. 877

A Real-Time 4-Stream MIMO-OFDM Transceiver: System Design, FPGA Implementation, and Characterization
Simon Haene, Member, IEEE, David Perels, Member, IEEE, and Andreas Burg, Member, IEEE

Abstract: When designing complex communication systems, such as MIMO-OFDM transceivers, prototypes have become an important tool for understanding the implementation trade-offs and the system behavior. This paper presents a real-time FPGA prototype for a 4-stream MIMO-OFDM transceiver capable of transmitting 216 Mbit/s in 20 MHz bandwidth. The paper covers all parts of the system from RF to channel decoding and considers both algorithm and implementation aspects. In particular, we discuss the initial parameter estimation, channel estimation, MIMO detection, parameter tracking, and channel decoding. FPGA implementation results are reported along with measurements that demonstrate the throughput of spatial multiplexing with four spatial streams.

Index Terms: MIMO, OFDM, prototype, FPGA implementation, measurements, 802.11n, VLSI.

Manuscript received July 15, 2007; revised December 12, 2007. This paper summarizes and extends our results, published in parts in [1]–[10], and puts them into the context of a real-time 4 × 4 MIMO-OFDM prototype system.
The authors are with the Integrated Systems Laboratory and the Communication Technology Laboratory of the Swiss Federal Institute of Technology (ETH) Zurich, 8092 Zurich, Switzerland (e-mail: haene@iis.ee.ethz.ch; perels@iis.ee.ethz.ch; apburg@iis.ee.ethz.ch).
Digital Object Identifier 10.1109/JSAC.2008.080805.
0733-8716/08/$25.00 © 2008 IEEE

I. INTRODUCTION

Multiple-input multiple-output (MIMO) technology combined with orthogonal frequency-division multiplexing (OFDM) and channel coding, for example based on bit interleaved coded modulation (BICM), has recently attracted significant attention. MIMO offers high spectral efficiency through spatial multiplexing, OFDM provides resilience against interference caused by multipath propagation, and BICM combines the ability to exploit the diversity in a frequency-selective wideband MIMO channel with a straightforward approach to rate adaptation. Due to these significant advantages, the combination of these three technologies constitutes the basis for many next generation wireless communication systems. The list of emerging standards already employing these technologies [11] includes IEEE 802.11n for wireless local area networks (WLAN), IEEE 802.16e for metropolitan area networks, and the evolution of third generation cellular systems (3GPP LTE).

Unfortunately, leveraging the tremendous gains achievable with sophisticated signaling schemes also incurs the need to employ elaborate transceiver architectures which build on highly complex digital signal processing algorithms. Due to their computational complexity, the economic implementation of such systems remains challenging [12]: On one hand, the hardware-efficient implementation of the transceiver algorithms is an important task. On the other hand, understanding the interaction of various algorithms and system components and assessing the performance of the overall system under real-world conditions is a crucial step in the development and verification process.

Prototype implementations of MIMO systems are an excellent vehicle to study these system-level and performance aspects. The real-time prototyping approach also offers opportunities to identify and work around complexity-bottlenecks, to consider complexity/performance trade-offs, and to assess the impact of hardware implementation on system performance. Hence, the engineering of fully functional systems based on programmable devices has frequently been adopted in the past to support the development of MIMO wireless systems [13]: The first prototype in this area has been a DSP-based, non real-time, narrow-band system built by Bell-Labs to demonstrate the gains available from using MIMO technology [14]. Later prototypes, focusing on the application of MIMO to wideband systems, include the real-time MIMO-CDMA demonstrator described in [15], the first MIMO-OFDM system developed at IOSpan [16], and a variety of other real-time and non real-time prototypes, developed at universities and research labs [17]–[21]. However, most of the existing MIMO testbed implementations focus on algorithm and performance aspects and the complexity and efficient implementation has often not been the main research focus. Hence, corresponding publications provide little insight into the hardware architecture and implementation aspects.

Contributions: This work is primarily concerned with the implementation aspects of a real-time 4 × 4 MIMO-OFDM transceiver with a view towards hardware-efficient (i.e., low-complexity) algorithms. To this end, the present article covers the complete physical layer, discusses the system architecture, presents the algorithm choices for the transceiver implementation, and outlines corresponding efficient hardware architectures. Our implementation results provide reference for the hardware complexity of the complete physical layer transceiver and the referenced companion papers provide an in-depth, comprehensive view of its individual components. Reproducible measurement results are obtained from a well defined measurement setup including an RF channel emulator.

Outline: The remainder of this paper is structured as follows: An overview of the MIMO-OFDM system under consideration is presented in Section II. Corresponding receiver
algorithms are discussed in Section III. Section IV describes the hardware platform used for the field programmable gate array (FPGA) implementation. A detailed view on the architecture of the real-time 4 × 4 MIMO-OFDM physical layer is provided in Section V. Implementation and measurement results are reported in Sections VI and VII and conclusions are drawn in Section VIII.

Notation: Matrices are denoted by bold uppercase letters, the $r$th row of $\mathbf{A}$ by $\mathbf{A}_r$. Lowercase bold letters represent column vectors. The $r$th entry of vector $\mathbf{a}$ is written as $a_r$. $\mathbf{A}^H$ and $\mathbf{A}^{-1}$ denote the Hermitian transpose and the inverse of $\mathbf{A}$, respectively. The operator $\angle(\cdot)$ extracts the angle of its complex-valued argument and $\exp[a]$ denotes the exponential function $e^a$. Upper case calligraphic letters denote sets of integer numbers and $|\mathcal{B}|$ corresponds to the cardinality of the set $\mathcal{B}$. The operator $\setminus$ denotes the set-theoretic difference.

II. SYSTEM DESIGN

The MIMO-OFDM system presented in the following is based on the OFDM modulation parameters defined in the IEEE 802.11a single-antenna WLAN standard [22]. These parameters define a bandwidth of 20 MHz, $N = 64$ subcarriers, and a cyclic-prefix length of 16 samples. For tracking purposes, 4 subcarriers are reserved for the transmission of known BPSK training sequences. In order to spectrally shape the transmitted signal, 12 tones are turned off. The frequency-domain indices of all non-zero tones are subsumed in the set $\mathcal{P} = \{-26, \ldots, -1, 1, \ldots, 26\}$ and the frequency indices of the pilot tones are given by $\mathcal{K} = \{\pm 7, \pm 21\}$. Modulation schemes for the data tones range from BPSK to 64-QAM and channel coding is based on BICM and a rate-1/2 convolutional code with constraint length 7. Coding rates of 2/3 and 3/4 are achieved through puncturing. Compared to the IEEE 802.11a standard specifications, the system under consideration was extended to support spatial multiplexing with antenna configurations up to 4 × 4, thus achieving a four-fold increase in the peak data rate from 54 Mbit/s to 216 Mbit/s.

A. System Model

The transmitter follows the data flow shown in Fig. 1(a) for the case of $M_T = 4$ transmit antennas. During data transmission, a high-rate data stream $b[l]$ (where $l$ indexes the payload bits to be transmitted) is parsed into $M_T$ lower-rate data streams. Each of these lower-rate streams is scrambled, binary convolutionally encoded, and punctured independently. The punctured data is interleaved bitwise across OFDM tones and across transmit antennas to leverage both frequency and spatial diversity. The resulting bit-streams (each associated with one of the transmit antennas) are then mapped to $2^Q$-QAM constellations by translating groups of $Q = 1, 2, 4,$ or $6$ bits to scalar constellation points, depending on the selected modulation scheme. The corresponding scalar constellation points of the individual streams are grouped into vectors $\mathbf{s}[k, t]$ of dimension $M_T$, where $k \in (\mathcal{P} \setminus \mathcal{K})$ and $t$ indicate the OFDM tone index and the OFDM-symbol time index, respectively. The $QM_T$ bits $b_\nu$, with $\nu = 1, 2, \ldots, QM_T$, corresponding to $\mathbf{s}[k, t]$ are referred to as the label of $\mathbf{s}[k, t]$. After OFDM modulation, the resulting signal is transmitted over the wireless channel.

[Figure 1, block diagrams. (a) Transmitter: parser, followed per stream by scrambler, convolutional encoder, and puncturer, a common interleaver, and per stream a mapper and an OFDM modulator. (b) Receiver: AGC and synchronization, per-antenna OFDM demodulation, pilot tracker, MIMO processing (spatial separation and soft-metric extraction), deinterleaver, followed per stream by depuncturer, Viterbi decoder, and descrambler, and a deparser.]
Fig. 1. Transmitter and receiver architecture for a 4 × 4 MIMO-OFDM system employing spatial multiplexing and BICM.

The corresponding digital baseband receiver is shown in Fig. 1(b). The power of the received signal is first adjusted by an automatic gain control (AGC) and frames detected by the synchronizer are OFDM demodulated independently for each of the $M_R$ receive antennas to yield the vectors $\tilde{\mathbf{r}}[k, t]$. A pilot-tracking unit, operating on $\tilde{\mathbf{r}}[k, t]$, is responsible for fine-grained synchronization tasks and produces vectors $\mathbf{r}[k, t]$. Assuming perfect timing and frequency synchronization after the pilot tracking operation and a propagation channel with a delay-spread smaller than the cyclic prefix inserted during OFDM modulation, the OFDM demodulated and pilot-tracked signal can be expressed as

$$\mathbf{r}[k, t] = \mathbf{H}[k]\,\mathbf{s}[k, t] + \mathbf{n}[k, t],$$

where $\mathbf{H}[k]$ is an $M_R \times M_T$-dimensional matrix describing the channel experienced by the $k$th tone and where $\mathbf{n}[k, t]$ is a complex vector representing the additive thermal noise. The time index $t$ is omitted for the channel matrices since they are assumed to remain constant during an entire OFDM frame (block-fading assumption). A MIMO processing unit is in charge of extracting soft-information for Viterbi decoding. This operation corresponds to computing log-likelihood ratios (LLRs) $L(b_\nu \mid \mathbf{r}[k, t], \hat{\mathbf{H}}[k])$ for the $QM_T$ bits associated with each of the vector symbols $\mathbf{s}[k, t]$. After deinterleaving and depuncturing, the LLRs are fed into separate Viterbi decoders for channel decoding. The estimate $\hat{b}[l]$ of the original high-rate data stream $b[l]$ is finally reconstructed by recombining the decoded and descrambled lower-rate streams.

B. Frame Structure

The WLAN system under consideration is a packet-based communication system. Hence, the receiver must synchronize to each OFDM frame anew. For this reason, each burst starts with dedicated preamble and training symbols as shown in Fig. 2. At the receiver, the preamble is used for AGC convergence and for timing and frequency synchronization. The preamble OFDM symbols are constructed based on the short preamble defined in IEEE 802.11a [22], but are radiated employing all transmit antennas. To avoid unintended
beamforming, each of the OFDM tones excited by the legacy preamble is assigned to one transmit antenna exclusively [6].

[Figure 2, frame timeline across the transmit antennas with a time axis in µs: preamble (AGC setting, timing and frequency synchronization), block-type channel training, header, payload data.]
Fig. 2. Structure of a MIMO-OFDM frame.

To enable coherent detection of the transmitted data symbols the complex-valued channel matrices $\mathbf{H}[k]$ must be estimated at the receiver. To this end, the preamble is followed by $M_T$ OFDM training symbols which are based on the BPSK training sequence (long preamble) used in IEEE 802.11a [22]. To support MIMO channel estimation, a frequency-orthogonal training sequence is constructed. The basic idea behind this sequence is to distribute the tones that comprise a training symbol equally across the transmit antennas to stimulate each tone by only one transmit antenna at a time. To sound all entries of $\mathbf{H}[k]$, the assignment of tones to transmit antennas is altered for each training symbol. After $M_T$ training symbols, all tones in all spatial channels have been stimulated, one at a time.

The first OFDM data symbol, immediately following the channel training, is reserved as a physical layer header that contains information on the length and transmission mode of the subsequent payload data. This field is always transmitted in the most robust modulation and coding scheme.

III. TRANSCEIVER ALGORITHMS

The focus of this section is on the discussion of the individual algorithms employed for the most critical receiver components. To this end, the synchronization, the channel estimation, the separation of the spatially multiplexed data streams, the tracking, and the extraction of LLRs shall be discussed.

A. Synchronization

The first stage of the physical layer signal processing is an AGC that estimates the signal-power level to be expected during the training and data portion of the frame based on the first few samples of the preamble [23]. Different gain stages at the input of the receiver are set according to this estimate in order to optimally exploit the dynamic range of the hardware. The AGC is followed by the frame-start detection for which three main classes of algorithms can be identified:

1) With matched-filter-based algorithms, the received signal is correlated with the known preamble sequence. This approach works well for specially designed synchronization signals as they are common in CDMA systems [24]. The IEEE 802.11a preamble, instead, is less suitable for matched-filter approaches due to the signal's periodicity and the considerable hardware complexity required for a corresponding matched-filter implementation.
2) Autocorrelation-based approaches rely on the detection of repeating portions of a received signal. The basic idea was first described in [25] and was further refined to allow frame-timing extraction in the case of initial signal-power fluctuations incurred by an AGC in [26]. The hardware complexity of these approaches is significantly lower compared to matched-filtering. Unfortunately, multi-path channels lead to inaccurate frame-timing recovery.
3) Power-based frame-start detection [9] compares the instantaneous received signal power to a long-term received-signal power estimate. A frame start is detected if the ratio between the two power measurements exceeds a threshold. In the MIMO case, the aggregate received power across all antennas can be considered, which leads to surprisingly precise detection. Among the three classes of synchronization algorithms, this approach is the most hardware-efficient solution. However, it suffers from a high false-alarm rate in the presence of strong interference.

In order to combine the advantages of the last two approaches (lower false-alarm rate and detection precision), the system under consideration performs initial synchronization using a power-based approach and improves the detection reliability by additionally taking into account an autocorrelation metric that confirms the characteristic periodicity of an IEEE 802.11a-based preamble. The corresponding metric is obtained as a side product of the frequency-offset estimation.

Upon detection of a new frame, a potential carrier frequency offset between transmitter and receiver must be estimated and compensated to preserve the orthogonality between different tones. Because this estimate must be obtained both fast and accurately, only training assisted methods are considered since blind approaches [27], [28] do not provide reliable results on time. In our system, the initial carrier frequency offset estimation is carried out on the preamble using the autocorrelation algorithm presented in [25]. Compared to the single-input single-output (SISO) case, a lower estimation variance is achieved by combining the correlations obtained on different antennas¹.

¹ It is reasonable to assume that the frequency offset is the same across all antennas since the local oscillators are usually shared.

B. Channel Estimation

With the frequency-orthogonal training described in Section II-B, tone-by-tone channel estimation can be easily performed for each $\mathbf{H}[k]$, one column at a time. To this end, the receiver simply divides the received vectors $\mathbf{r}[k, t]$ by the scalar (BPSK) training symbol transmitted on tone $k$ at time $t$ from the $j$th transmit antenna² to obtain the $j$th column of the estimated channel matrices $\hat{\mathbf{H}}[k]$. We shall refer to this estimate as the frequency-domain maximum likelihood (FDML) estimate.

² With $k \in \mathcal{P}$ and $t = 0, 1, \ldots, 3$, $j = (k - t)$ modulo $M_T$.

The number of degrees of freedom in the estimated frequency-domain channel coefficients depends on the length of the time-domain channel impulse response.
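The FDML estimate of Section III-B can be illustrated with a short Python sketch. It is a floating-point illustration under stated assumptions, not the authors' hardware implementation: the function names `tx_antenna`, `simulate_training`, and `fdml_estimate` are hypothetical, the antennas are indexed from 0, the tone-to-antenna rule is assumed to be $j = (k - t)$ modulo $M_T$, and the channel and pilot values are random placeholders.

```python
import random

M_T, M_R = 4, 4                                   # transmit / receive antennas
TONES = [k for k in range(-26, 27) if k != 0]     # non-zero tones (set P)


def tx_antenna(k, t):
    """Antenna (0-based) assumed to excite tone k in training symbol t."""
    return (k - t) % M_T


def simulate_training(channel, pilots, noise_std, rng):
    """Received vectors r[k, t] when a single antenna excites each tone."""
    received = {}
    for k in TONES:
        for t in range(M_T):
            j = tx_antenna(k, t)
            received[(k, t)] = [
                channel[k][r][j] * pilots[k]
                + complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
                for r in range(M_R)
            ]
    return received


def fdml_estimate(received, pilots):
    """Divide r[k, t] by the known pilot to obtain column j of H_hat[k]."""
    h_hat = {k: [[0j] * M_T for _ in range(M_R)] for k in TONES}
    for (k, t), r_vec in received.items():
        j = tx_antenna(k, t)
        for r in range(M_R):
            h_hat[k][r][j] = r_vec[r] / pilots[k]
    return h_hat


rng = random.Random(0)
channel = {k: [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(M_T)]
               for _ in range(M_R)] for k in TONES}
pilots = {k: rng.choice([1, -1]) for k in TONES}

# Without noise, the estimate matches the channel up to rounding.
h_hat = fdml_estimate(simulate_training(channel, pilots, 0.0, rng), pilots)
max_error = max(abs(h_hat[k][r][j] - channel[k][r][j])
                for k in TONES for r in range(M_R) for j in range(M_T))
```

Because each tone is excited by exactly one antenna per training symbol, the estimate needs only one complex division per receive antenna, tone, and training symbol, and no matrix operations.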
Assuming a channel that is shorter than the cyclic prefix and a number of subcarriers that is larger than the cyclic prefix length, the estimated frequency-domain channel coefficients are correlated. Since the simple FDML estimation algorithm does not exploit this correlation to improve the estimation accuracy, low-pass filtering of channel coefficients (for all the $M_T M_R$ spatial subchannels independently) has been recognized as a simple technique to improve the quality of an initial FDML channel estimate. Unfortunately, the presence of untrained zero tones, that spectrally shape the transmitted signal, introduces an irregularity in the initial FDML channel estimate which increases the complexity of straightforward low-complexity channel-estimation refinement algorithms [29]. The technique employed in the system under consideration follows the approach initially introduced in [5] and also summarized in [8]. The basic idea of this algorithm is illustrated in Fig. 3 and outlined as follows: The procedure starts with the straightforward FDML estimation for all trained tones, whose indices are subsumed in the set $\mathcal{P}$. In a second step, an equidistant set of tones $\mathcal{X}$ with distance four³ is chosen and an initial estimate is computed for all untrained zero tones in $\mathcal{X}$ (the set of zero tones in $\mathcal{X}$ is denoted as $\mathcal{Q}$, as shown in Fig. 3). In a third step, an estimate for all remaining untrained zero tones, subsumed in $\mathcal{S}$, is obtained based on the tones in the equidistant set $\mathcal{X}$. In a last step, the initial estimate (now available for all $N$ tones in the union of the sets $\mathcal{P}$, $\mathcal{Q}$, and $\mathcal{S}$) is refined through low-pass filtering of the frequency-domain channel estimate. The analytical mean-squared error (MSE) of this estimation scheme has been derived and shown to be close to the MSE of the optimum maximum-likelihood estimator in [5]. A more detailed discussion and analysis of this topic is provided in [30].

³ With a distance of four, the channel impulse response is assumed to be limited to $N/4$ sample-spaced taps. If this condition is not fulfilled (e.g., because real-world channels are not sample-spaced and the unavoidable resampling operation will produce longer channels) the performance of the estimation algorithm will degrade gradually, depending on the amount of energy leaking beyond the first $N/4$ taps.

[Figure 3, tone axis around DC showing the trained subcarriers $\mathcal{P}$, the equidistant set $\mathcal{X}$ (distance = 4), and the untrained zero tones $\mathcal{Q}$ and $\mathcal{S}$.]
Fig. 3. Graphical representation of the sets $\mathcal{P}$, $\mathcal{Q}$, $\mathcal{X}$ and $\mathcal{S}$ required for the refinement of the FDML channel estimate with the tone allocation of IEEE 802.11a.

C. MIMO Detection

Once the channel coefficients are known, a wide range of algorithms is available for the separation of the spatially multiplexed data streams. This separation can be performed on each OFDM tone independently. Almost all MIMO detection algorithms can be partitioned into a preprocessing and a detection phase. The preprocessing comprises all those operations which must be carried out only when the channel estimate changes. The detection comprises the remaining operations which must be performed for each received vector $\mathbf{r}[k, t]$.

1) Choice of MIMO Detection Algorithm: Optimum error rate performance is achieved with soft-output maximum likelihood detection which can be implemented economically on ASICs using the sphere decoding or other tree-search algorithms [31]–[34]. Unfortunately, on FPGAs (which are significantly slower compared to ASICs), real-time hard-decision or soft-output sphere decoding for the system under consideration would require a prohibitive amount of FPGA resources for parallel processing.

Linear and successive interference cancellation (SIC) algorithms are viable alternative solutions since the complexity of their detection stage is lower than the complexity of sphere decoding, especially for high rates (e.g., $M_T = 4$, 64-QAM). While in uncoded systems SIC yields better error rate performance than linear detection, the error-propagation associated with SIC hinders the generation of high-quality soft-information [35]. This drawback of SIC leads to a performance loss compared to linear detectors in systems with BICM [35]. Hence, for the prototype under consideration, a linear detector is employed. The corresponding algorithm performs the preprocessing for all OFDM tones independently according to

$$\mathbf{G}[k] = \left(\hat{\mathbf{H}}^H[k]\,\hat{\mathbf{H}}[k] + M_T \sigma^2 \mathbf{I}\right)^{-1} \hat{\mathbf{H}}^H[k] \tag{1}$$

where $\sigma^2$ is either an estimate of the noise variance (MMSE detection) or a regularization parameter that facilitates the fixed-point hardware implementation by limiting the dynamic range of the entries of $\mathbf{G}[k]$.

The subsequent detection at the symbol-rate applies $\mathbf{G}[k]$ to the received vectors $\mathbf{r}[k, t]$ according to

$$\hat{\mathbf{s}}[k, t] = \mathbf{G}[k]\,\mathbf{r}[k, t]. \tag{2}$$

For hard-decision decoding the entries of $\hat{\mathbf{s}}[k, t]$ are quantized to the nearest constellation point. For soft-decision decoding, LLRs are calculated as described in Section III-E.

2) Algorithm for Linear Preprocessing: The main challenge in the implementation of linear MIMO detection is the preprocessing which requires the inversion of an unstructured, complex-valued matrix for each data-bearing OFDM tone⁴. The three most common approaches to this problem are triangularization using Givens rotations followed by back-substitution [37], the square-root algorithm employed in [38] to simplify the V-BLAST algorithm, and direct matrix inversion (DMI) algorithms [37]. The main arguments for DMI are the lower number of operations compared to the other two methods and the natural mapping to highly efficient arithmetic components available as macros in modern FPGAs. The main disadvantage of DMI is its stringent fixed-point requirements.

⁴ Note that, in the system under consideration, interpolation-based channel-matrix preprocessing algorithms [36] are not advantageous in terms of complexity since the number of OFDM tones is low.

The preprocessing algorithm employed in the reported system relies on DMI since the implementation targets FPGAs. The specific algorithm [4], [31] is similar to the procedure for updating the gain in Kalman filters. The basic idea is to note that $\hat{\mathbf{H}}^H \hat{\mathbf{H}} + M_T \sigma^2 \mathbf{I}$ can be constructed from $M_R$ rank-one updates to $M_T \sigma^2 \mathbf{I}$.
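The rank-one construction used for the linear preprocessing in Section III-C can be checked numerically. The Python sketch below is an illustration under stated assumptions, not the fixed-point hardware algorithm of the paper: it uses floating-point arithmetic, the helper names `rank_one_inverse`, `regularized_gram`, and `matmul` are hypothetical, and the channel matrix is a random placeholder. It starts from the inverse of $M_T \sigma^2 \mathbf{I}$, applies one Sherman-Morrison step (a special case of the matrix inversion lemma) per row of the channel estimate, and compares the result with the directly assembled matrix $\hat{\mathbf{H}}^H \hat{\mathbf{H}} + M_T \sigma^2 \mathbf{I}$.

```python
import random


def rank_one_inverse(h, reg):
    """Inverse of H^H H + reg*I, built with one rank-one update per row of H."""
    m_t = len(h[0])
    # Start from the inverse of reg*I.
    p = [[(1.0 / reg if a == b else 0.0) + 0j for b in range(m_t)]
         for a in range(m_t)]
    for row in h:
        # u = P * row^H. P stays Hermitian, so row * P equals u^H.
        u = [sum(p[a][b] * row[b].conjugate() for b in range(m_t))
             for a in range(m_t)]
        denom = 1.0 + sum(row[a] * u[a] for a in range(m_t)).real
        for a in range(m_t):
            for b in range(m_t):
                p[a][b] -= u[a] * u[b].conjugate() / denom
    return p


def regularized_gram(h, reg):
    """Directly assembled H^H H + reg*I, used only as a reference."""
    m_r, m_t = len(h), len(h[0])
    return [[sum(h[r][a].conjugate() * h[r][b] for r in range(m_r))
             + (reg if a == b else 0.0) for b in range(m_t)]
            for a in range(m_t)]


def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(x[a][c] * y[c][b] for c in range(len(y)))
             for b in range(len(y[0]))] for a in range(len(x))]


rng = random.Random(1)
M_T = M_R = 4
sigma2 = 0.1                                     # placeholder regularization
H = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(M_T)]
     for _ in range(M_R)]

P = rank_one_inverse(H, M_T * sigma2)
H_herm = [[H[r][a].conjugate() for r in range(M_R)] for a in range(M_T)]
G = matmul(P, H_herm)                            # equalizer matrix as in (1)
residual = matmul(P, regularized_gram(H, M_T * sigma2))   # close to identity
```

Each update needs only multiplications, additions, and a single real division, which is the property that makes this kind of recursion attractive when a general matrix inversion is to be avoided.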
By using the matrix inversion lemma, the corresponding inverse can be updated on-the-fly. To this end, the corresponding iteration is initialized by setting

$$\mathbf{P}^{(1)} = \frac{1}{M_T \sigma^2}\,\mathbf{I}$$

and proceeds by computing

$$\mathbf{P}^{(n+1)} = \mathbf{P}^{(n)} \left( \mathbf{I} - \frac{\hat{\mathbf{H}}_n^H \hat{\mathbf{H}}_n \mathbf{P}^{(n)}}{1 + \hat{\mathbf{H}}_n \mathbf{P}^{(n)} \hat{\mathbf{H}}_n^H} \right), \tag{3}$$

where $\hat{\mathbf{H}}_n$ denotes the $n$th row of $\hat{\mathbf{H}}$. After $M_R$ iterations, $\mathbf{P}^{(M_R+1)} = (\hat{\mathbf{H}}^H \hat{\mathbf{H}} + M_T \sigma^2 \mathbf{I})^{-1}$ and $\mathbf{G} = \mathbf{P}^{(M_R+1)} \hat{\mathbf{H}}^H$. The dependence on the OFDM tone index $k$ has been omitted for brevity in this paragraph.

D. Algorithms for Pilot Tracking

Initial carrier frequency offset estimates suffer from estimation uncertainties which induce a residual frequency offset (RFO). Additionally, the sampling rate offset (SRO) between transmitter and receiver introduces a tone dependent frequency shift that cannot be estimated efficiently during the preamble and causes a tone-dependent phase deviation⁵ [39]. Hence, both RFO and SRO inhibit coherent detection if not compensated. To perform this compensation, the impact of these impairments must be estimated and eliminated during the data section of the frame using a-priori known pilot constellations transmitted on a set of pilot tones (those in $\mathcal{K}$).

⁵ In the system under consideration the impact of RFO and SRO are considered to be independent of each other.

In the receiver architecture outlined in Fig. 1(b), pilot-tracking is performed immediately after OFDM demodulation. In receivers based on linear MIMO separation, however, this task can be postponed until after the MIMO detection. Both alternatives are considered in the following.

RFO and SRO Estimation after OFDM Demodulation: During the data section of the frame, the expected received pilot constellations $\bar{\mathbf{r}}[k, t]$ are calculated for $k \in \mathcal{K}$ based on the channel estimates $\hat{\mathbf{H}}[k]$ and the known pilot constellations $\mathbf{s}[k, t]$ according to

$$\bar{\mathbf{r}}[k, t] = \hat{\mathbf{H}}[k]\,\mathbf{s}[k, t]. \tag{4}$$

Now, the phase offset due to RFO can be estimated as

$$\varphi_{\mathrm{rfo}}[t] = \angle\left( \sum_{k \in \mathcal{K}} \exp\left[ j\,\angle\left( \bar{\mathbf{r}}[k, t]^H\,\tilde{\mathbf{r}}[k, t] \right) \right] \right), \tag{5}$$

where the $\angle(\cdot)$ and $\exp[\cdot]$ functions are needed to normalize the contributions of the individual pilot tones to $\varphi_{\mathrm{rfo}}$. The slope of the phase deviation is calculated as

$$\theta[t] = \frac{\sum_{k \in \mathcal{K}} k \left( \angle\left( \bar{\mathbf{r}}[k, t]^H\,\tilde{\mathbf{r}}[k, t] \right) - \varphi_{\mathrm{rfo}}[t] \right)}{\sum_{k \in \mathcal{K}} k^2}.$$

Note that for this algorithm to function properly, samples must be inserted or removed in the time-domain sample stream if the phase slope exceeds a threshold corresponding to the offset of one sample, i.e., $|\theta[t]| > 2\pi/N$. This correction is necessary to avoid phase wrapping due to SRO.

RFO and SRO Estimation after Spatial Separation: The computational complexity of this algorithm is lower compared to the previous algorithm since there is no need to estimate the received pilot symbols according to (4). Instead, the a-priori known BPSK pilot symbols $\mathbf{s}[k, t]$ can be used as reference. Hence, the calculation of $\varphi_{\mathrm{rfo}}$ is simplified to

$$\varphi_{\mathrm{rfo}}[t] = \angle\left( \sum_{k \in \mathcal{K}} \mathbf{s}[k, t]^H\,\hat{\mathbf{s}}[k, t] \right), \tag{6}$$

and the phase offset slope induced by the SRO is obtained from

$$\theta[t] = \frac{\sum_{k \in \mathcal{K}} k \left( \angle\left( \mathbf{s}[k, t]^H\,\hat{\mathbf{s}}[k, t] \right) - \varphi_{\mathrm{rfo}}[t] \right)}{\sum_{k \in \mathcal{K}} k^2}. \tag{7}$$

For the compensation of the impact of RFO and SRO the received signals on the data-bearing tones are multiplied with $\exp\left[ -j\left( \varphi_{\mathrm{rfo}}[t] + \theta[t]\,k \right) \right]$. This compensation can be performed before or after linear detection, independently of the chosen estimation algorithm.

An in-depth discussion of the estimation and compensation of analog and RF impairments in MIMO-OFDM systems can be found in [23].

E. Generating Soft-Information

The computation of LLRs $L(b_\nu \mid \mathbf{r}[k, t], \hat{\mathbf{H}}[k])$ for channel decoding follows the algorithm described in [40]. The basic idea is to start from the input-output relation between the transmitted vector symbol $\mathbf{s}$ and the output $\hat{\mathbf{s}}$ of the spatial separation in (2) which can be written as

$$\hat{\mathbf{s}} = \mathbf{s} + (\mathbf{G}\mathbf{H} - \mathbf{I})\,\mathbf{s} + \mathbf{G}\mathbf{n},$$

where the time and tone indices have been omitted for brevity. Modeling the residual interference $(\mathbf{G}\mathbf{H} - \mathbf{I})\,\mathbf{s}$ as i.i.d. Gaussian noise, the MIMO system is partitioned into $M_T$ parallel SISO systems. For each of these SISO systems, the effective signal to interference plus noise ratio (SINR) that accounts for the thermal noise and for the residual interference from other spatial streams is given by [41]

$$\rho_i = \frac{1}{M_T \sigma_n^2 \left[ \left( \hat{\mathbf{H}}^H[k]\,\hat{\mathbf{H}}[k] + M_T \sigma_n^2 \mathbf{I} \right)^{-1} \right]_{i,i}} - 1. \tag{8}$$

With this per-stream additive Gaussian noise model and by using the well known log-sum approximation, LLRs can be approximated efficiently for the $q$th bit in the $i$th stream independently of all other streams according to

$$L(b_\nu \mid \hat{\mathbf{s}}, \hat{\mathbf{H}}) \approx \rho_i\,\lambda(b_\nu, \hat{s}_i) \tag{9}$$

with

$$\lambda(b_\nu, \hat{s}_i) = \min_{s \in \mathcal{A}_q^0} |\hat{s}_i - s|^2 - \min_{s \in \mathcal{A}_q^1} |\hat{s}_i - s|^2, \tag{10}$$

where the bit-index $\nu = (i - 1)Q + q$. In (10) the sets $\mathcal{A}_q^0$ and $\mathcal{A}_q^1$ contain the scalar constellation points for the zero and one hypothesis, respectively, for the $q$th bit associated with $\hat{s}_i$. Note that due to the decoupling of the data streams, the evaluation of (9) for all bits in a vector symbol merely requires the computation of $QM_T$ Euclidean distances in a 1-dimensional, complex-valued vector space.
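The per-stream approximation in (9) and (10) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the bit-metric unit of the prototype: the function name `llr_max_log`, the Gray-labelled QPSK mapping `QPSK`, and the numeric values of the equalizer output and the SINR are hypothetical placeholders, and the sketch searches the constellation exhaustively rather than using a simplified closed form.

```python
def llr_max_log(s_hat, rho, constellation, q_bits):
    """Approximate LLRs of one stream following (9) and (10).

    constellation maps each bit label (a tuple of 0/1) to a complex point.
    A positive value means the nearest point with bit value 1 is closer
    to s_hat than the nearest point with bit value 0.
    """
    llrs = []
    for q in range(q_bits):
        d0 = min(abs(s_hat - s) ** 2
                 for bits, s in constellation.items() if bits[q] == 0)
        d1 = min(abs(s_hat - s) ** 2
                 for bits, s in constellation.items() if bits[q] == 1)
        llrs.append(rho * (d0 - d1))
    return llrs


# Hypothetical Gray-labelled QPSK (Q = 2) with unit average energy:
# the first bit selects the sign of the real part, the second bit the
# sign of the imaginary part.
A = 2 ** -0.5
QPSK = {(0, 0): complex(-A, -A), (0, 1): complex(-A, A),
        (1, 0): complex(A, -A), (1, 1): complex(A, A)}

s_hat = complex(0.5, -0.2)      # placeholder equalizer output for stream i
rho = 10.0                      # placeholder per-stream SINR from (8)
llrs = llr_max_log(s_hat, rho, QPSK, 2)
```

For this mapping the difference of the two minima reduces to $4A\,\mathrm{Re}(\hat{s}_i)$ for the first bit and $4A\,\mathrm{Im}(\hat{s}_i)$ for the second, so each LLR depends on only one component of $\hat{s}_i$.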
By contrast, the evaluation of exact LLRs would require the computation of $2^{QM_T}$ Euclidean distances in an $M_T$-dimensional, complex-valued vector space, which would be prohibitively complex.

Further complexity reduction can be achieved when Gray labeling is chosen for the mapping to QAM constellations. For this case it has been shown in [40] that (10) is a piecewise linear function that depends only on the real or imaginary part of $\hat{s}_i$, depending on the value of $q$.

IV. MIMO TESTBED PLATFORM

The receiver algorithms presented in the previous section were prototyped on a real-time MIMO testbed. As shown in Fig. 4, each terminal of this testbed consists of a host PC, an FPGA-based platform with peripheral component interconnect (PCI) interface, and a multi-antenna RF transceiver.

[Figure 4, block diagram of a MIMO-OFDM terminal: host computer (Matlab applications, MEX function, transceiver API, driver) connected over the PCI bus to the FPGA module and the converter modules, which connect through analog 20 MHz IF signals and AGC control signals to the multi-antenna 2.4 GHz RF transceiver.]
Fig. 4. Hardware components of a MIMO-testbed terminal.

A. PCI-Based FPGA Platform

The PCI platform hosts three hardware modules: a large FPGA (Xilinx XC2V6000-6) module for baseband signal processing and two converter modules for digital-to-analog conversion (DAC) and analog-to-digital signal conversion (ADC). On an FPGA (Xilinx XC2V1000-4) provided on each of the converter modules, the baseband signals are digitally converted to an intermediate carrier frequency (IF) of 20 MHz. Hence, the total number of converters is reduced by a factor of two compared to zero-IF transceivers with separate converters for the I and Q components. Incidentally, the digital IF also avoids any I/Q imbalance, at the cost of higher sampling rate and resolution. Both the 14 bit DACs and the 12 bit ADCs run at a sampling rate of 80 MS/s. Note that for our testbed implementation the digital-IF architecture has been preferred over the zero-IF architecture that is commonly used in commercial products. The reasons for this choice were the lack, at that time, of commercially available integrated zero-IF transceivers and the fact that the advantages outlined above were instrumental in greatly facilitating the hardware design and maintenance of the testbed.

B. Analog RF Frontend

The RF frontend converts analog signals from the 20 MHz IF to the RF frequency band, and vice versa. Each terminal is equipped with four SISO super-heterodyne RF chains with an analog IF of 475 MHz. The chains are built from discrete

C. Software Stack

Each terminal is controlled from a host PC through multiple layers of software that allow communication with the hardware. This software stack includes a hardware driver for the PCI board, a transceiver-specific application programming interface (API) that provides several functions (for the digital baseband configuration and the transmission or reception of MIMO-OFDM frames), and a Matlab external interface (MEX) function to call these API functions. Demo applications, configuration tools, and measurement sequences are programmed as Matlab scripts.

V. REAL-TIME FPGA IMPLEMENTATION

For the FPGA implementation, the MIMO-OFDM physical layer is partitioned into synchronization, OFDM de/modulation, channel estimation, tracking, MIMO preprocessing and detection, a first-in first-out (FIFO) buffer, and channel de/coding. The block diagram of the complete system is shown in Fig. 5.

Since the data flow in transmit mode is trivial, the explanations of this section focus on the receive mode. The corresponding data flow and timing are illustrated in Fig. 6 and are summarized as follows: The synchronization unit is active only when the channel is idle and during reception of the preamble. After successful synchronization, OFDM demodulation starts with the beginning of the training sequence. During this training, the output of the OFDM demodulator is fed to the channel estimation unit. The resulting estimates $\hat{\mathbf{H}}[k]$ for the channel matrices for all data and pilot tones are stored in a memory. The MIMO processing unit accesses this memory during the subsequent preprocessing phase to compute the spatial equalizer coefficients $\mathbf{G}[k]$ according to (1) and the associated per-stream SINR estimates $\rho_i[k]$ according to (8) for the computation of the LLRs. The matrices $\mathbf{G}[k]$ can be stored without the need for additional memory by replacing the corresponding channel estimates $\hat{\mathbf{H}}[k]$, which are no longer needed. A separate memory is required to store $\rho_i[k]$. The detection of the data symbols in the MIMO processing unit starts after the preprocessing is complete. Since in the system under consideration the training is immediately followed by the data, a FIFO buffer is needed to store the received samples until the MIMO processing unit and the bit-metric unit can start with the data detection according to (2) and with the generation of LLRs for the Viterbi decoder according to (9).

A. Synchronization

The synchronization block contains the AGC, the frame-start detection, the digital down-conversion (DDC), and the frequency offset estimation and compensation.

The AGC for each antenna is comprised of an analog attenuator in the RF chain and of a digital gain stage at the digital
RF components and support 20 MHz channel bandwidth with IF. This two-step approach allows to control the signal power
center frequencies in the 2.4 GHz industrial, scientific, and in finer steps which leads to faster convergence of the power
medical (ISM) band. The received signal power at the ADCs control. Moreover, the digital gain stage allows to further
is regulated by digitally controllable analog attenuators. These amplify weak signals to improve the utilization of the dynamic
attenuators extend the dynamic range of the receive chains by range available for the subsequent digital signal processing.
31 dB. In conjunction with the 12 bit resolution of the ADCs, The corresponding circuit employs two multipliers per receive
the useful dynamic range amounts to approximately 70 dB. antenna, one for power estimation and one for the application

[Block diagram omitted. Interface: host PCI bus with Tx and Rx buffers. Channel de/coding: scrambling, convolutional encoding, puncturing, and interleaving in the transmit path; bit-metric unit, deinterleaving, depuncturing, Viterbi decoding, and descrambling in the receive path. MIMO processing and channel estimation: MIMO processing, H estimation, H/G memory, pilot tracking, and FIFO buffer. OFDM de/modulation: symbol mapping, FFT processor, and insertion of preambles and cyclic prefix. Synchronization: AGC, DDC, frame timing, and frequency offset estimation and compensation. Converters: DUC with 4xDAC and DDC with 4xADC.]

Fig. 5. Overview of the digital signal processing architecture of the MIMO-OFDM testbed transceiver.
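The DDC block in Fig. 5 benefits from the choice of IF in Section IV-A: 20 MHz is exactly one quarter of the 80 MS/s converter rate, so the IQ demodulation reduces to sign changes and I/Q swaps. The following Python sketch illustrates this property only. The function names are hypothetical, and the four-sample averaging filter is an assumption that stands in for the polyphase FIR filters of the testbed, whose coefficients are not given in this paper.

```python
import cmath
import math

FS = 80e6    # converter sampling rate (Section IV-A)
F_IF = 20e6  # digital IF (Section IV-A), exactly FS / 4


def mix_fs4(samples):
    """Multiply real IF samples by exp(-j*pi*n/2) = (-j)**n without multipliers."""
    out = []
    for n, x in enumerate(samples):
        phase = n % 4
        if phase == 0:
            out.append(complex(x, 0.0))    # times  1
        elif phase == 1:
            out.append(complex(0.0, -x))   # times -j
        elif phase == 2:
            out.append(complex(-x, 0.0))   # times -1
        else:
            out.append(complex(0.0, x))    # times  j
    return out


def decimate_by_4(samples):
    """Stand-in low-pass (4-sample average) followed by 4x down-sampling."""
    return [sum(samples[i:i + 4]) / 4 for i in range(0, len(samples) - 3, 4)]


if __name__ == "__main__":
    # Real test tone 1 MHz above the IF; after mixing it sits at +1 MHz.
    tone = [math.cos(2 * math.pi * (F_IF + 1e6) * n / FS) for n in range(64)]
    baseband = decimate_by_4(mix_fs4(tone))
    print(len(baseband), "complex samples at", FS / 4 / 1e6, "MS/s")
```

In a polyphase realization the mixing and the decimation merge, so that only the filter taps contributing to retained output samples are evaluated; this is one plausible reading of "avoiding unnecessary multiplications" in Section V-A.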

[Timing diagram omitted. Frame phases: idle, preamble, training, header and payload data, idle. Receiver activities: synchronization, tracking, FFT, FIFO buffering, channel estimation, preprocessing, MIMO detection, and channel decoding. Annotated delays: FFT latency, estimation latency, preprocessing latency, and receiver latency.]

Fig. 6. Timing diagram illustrating the MIMO-OFDM receiver data flow.
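The preprocessing latency and the FIFO buffering shown in Fig. 6 can be cross-checked against the figures reported in Section V-D (40 MHz clock, 2.2 µs of preprocessing per tone, 40 detected tones per OFDM symbol and processing unit, 52 tones to detect). The sketch below is a back-of-the-envelope model, not the authors' sizing procedure: the 4 µs symbol duration is an assumption based on 802.11a-style timing, the function name is hypothetical, and the catch-up time uses a simple rate-balance argument.

```python
import math

CLOCK_HZ = 40e6                 # MIMO processing clock (Section V-D)
T_SYM = 4e-6                    # ASSUMED OFDM symbol duration (802.11a-style)
TONES = 52                      # 48 data tones + 4 pilot tones
PREPROC_PER_TONE = 2.2e-6       # preprocessing delay per tone, one unit
DETECT_TONES_PER_SYMBOL = 40    # detection rate of one unit

# 2.2 us at 40 MHz corresponds to 88 clock cycles per tone.
PREPROC_CYCLES = round(PREPROC_PER_TONE * CLOCK_HZ)


def budget(units):
    """Latency and buffering figures for `units` parallel MIMO processing units."""
    preproc_latency = TONES * PREPROC_PER_TONE / units
    backlog = preproc_latency / T_SYM             # symbols queued when detection starts
    fifo_symbols = math.ceil(backlog - 1e-9)      # whole symbols to buffer
    detect_rate = units * DETECT_TONES_PER_SYMBOL
    surplus = detect_rate / TONES - 1.0           # symbols gained per symbol interval
    catch_up = math.ceil(backlog / surplus - 1e-9) if surplus > 0 else None
    return preproc_latency, fifo_symbols, detect_rate, catch_up


if __name__ == "__main__":
    for units in (1, 2):
        latency, fifo, rate, catch_up = budget(units)
        print(units, round(latency * 1e6, 1), fifo, rate, catch_up)
```

Under these assumptions a single unit yields the 114.4 µs latency and the 29-symbol FIFO quoted in Section V-D and never catches up, because 40 tones per symbol is below the required 52. Two units yield 57.2 µs and 80 tones per symbol, and the model reproduces the catch-up time of 27 OFDM symbols stated there.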

of the digital AGC gain. The signal-power estimate for each antenna is obtained by accumulating 64 successive squared samples, which proves to be more accurate than an infinite-impulse-response (IIR) filter when operating on the periodic preamble signal. Costly mathematical functions, such as square root and division, which are necessary to obtain an accurate power correction factor, are implemented in an iteratively decomposed manner in order to reduce their impact on hardware complexity.

The DDC in the receive path and the digital up-conversion (DUC) in the transmit path are realized as polyphase finite-impulse-response filters with incorporated IQ-de/modulation and down-/up-sampling in order to minimize hardware cost, avoiding unnecessary multiplications.

Power-based frame-timing recovery is realized based on the architecture described in [9]. Two multipliers are used to obtain the power estimate for all four receive chains at baseband.

Initial carrier frequency offset estimation and compensation is active during the preamble and realizes the operations described in [25]. However, instead of a running mean, a low-pass filter in the form of an IIR filter is used, which reduces the memory requirements considerably. A coordinate rotation digital computer (CORDIC) architecture [42] is used to obtain the per-sample phase offset required for compensation of the frequency offset. One built-in 18 kbit random-access memory (RAM) per receive antenna stores the incoming complex-valued samples required to compute the auto-correlation. The size of the RAM makes it possible to correlate over a lag of 16, 32, or 64 samples in order to enhance the accuracy of the frequency-offset estimate if necessary. A look-up table provides the rotating phasor and four real-valued multipliers are used to compensate the carrier frequency offset for all four receive antennas.

B. OFDM De/modulation

In transmit mode, the OFDM de/modulation unit maps binary data to complex-valued constellation points, computes the superposition of all modulated tones with an IFFT, and inserts the cyclic prefix. At the beginning of each frame, the de/modulation unit also outputs the preamble, whose time-domain representation is stored in RAMs instead of being generated at run-time to reduce the transmit latency to a minimum. In receive mode, the same unit demodulates the received OFDM symbols by means of FFTs. For the de/modulation of OFDM symbols, a 64-point I/FFT processor with a single radix-4 processing element [8] is shared among the transmit and receive data paths. The different spatial streams are processed in a time-interleaved fashion by the same hardware.

The architecture of the I/FFT processor is shown in Fig. 7. The memory unit stores the complex-valued vector to be processed. In order to provide sufficient memory bandwidth, the storage is divided into four separate, dual-ported memory banks, each holding up to 16 complex-valued data words. The bus barrel shifters multiplex the data words to the appropriate bank and are required to support an addressing scheme [43] that avoids access conflicts during the computation of I/FFTs. The processing unit performs the arithmetic operations and is based on a conventional decimation-in-time radix-4 butterfly [44], consisting of three complex-valued multipliers and eight complex-valued adders. A dedicated look-up table

(LUT) generates the twiddle coefficients for the I/FFTs [44]. This twiddle-LUT is modest in size and is thus realized with random logic. The computations are orchestrated by a central control unit which accepts instructions and generates all control signals and addresses for the RAMs and LUTs.

[Block diagram omitted. Blocks shown: control unit with control input; coefficient generation for channel estimation (constant matrix LUT) and FFT twiddle LUT; processing unit extended for channel estimation (butterfly, multiplication, or multiply-accumulate operation); memory unit with four SRAM banks (bank 0 to bank 3) between an input and an output bus barrel shifter; input data and output data ports.]

Fig. 7. Top-level architecture of the radix-4 I/FFT processor.

C. Channel Estimation

The straightforward FDML channel estimation algorithm has a negligible hardware complexity since it involves only trivial multiplications with the known BPSK training sequence. Hence, the corresponding operations are implemented on dedicated hardware directly in the MIMO processing unit. The interpolation-based channel estimation refinement algorithm described in Section III-B, in contrast, has a much higher complexity. However, the interpolation algorithm can be implemented efficiently based on I/FFT operations that can be outsourced to the I/FFT processor in the OFDM de/modulation unit.

The implementation of the interpolation-based channel estimation requires the following steps for each spatial subchannel: First, the tones in $\mathcal{P}$, which have been computed on-the-fly during the demodulation of the training OFDM symbols, are multiplied with a constant $2 \times 52$ matrix to obtain estimates for the two untrained tones in $\mathcal{Q}$. Next, the two results of the matrix-vector multiplication, together with the remaining 14 tones that are part of $\mathcal{X}$, are transformed into the time domain by a 16-point IFFT. The result of this transformation is multiplied element-wise with a phase-correcting vector. After (virtually) zero-padding the result to a length of 64, it is transformed back into the frequency domain, yielding an estimate for the remaining untrained subcarriers (those in $\mathcal{S}$). These estimates are concatenated with the initial FDML estimates for the trained subcarriers and with the estimates for those in $\mathcal{Q}$. At this stage, an estimate for all subcarriers is available and the correlation between channel coefficients can be exploited to reduce the estimation error. This requires a 64-point IFFT, a brick-wall filter that sets to zero all elements of the result exceeding the length of the channel impulse response, and a 64-point FFT. On the I/FFT processor, the multiplication with the brick-wall mask is performed only virtually and does not require any clock cycles. Additionally, the last FFT, which operates on a vector containing many zeros, is optimized to skip a third of the butterfly operations.

To perform the above described algorithm on the I/FFT processor, its radix-4 processing unit was extended with additional multiplexers, two complex-valued adders, and two accumulation registers as reported in [30] and [8] to enable the execution of plain complex-valued multiplications and multiply-accumulate operations. Moreover, an additional LUT was added to the I/FFT processor shown in Fig. 7. This LUT stores the $2 \times 52$ matrix required to compute the initial estimates for the subcarriers in $\mathcal{Q}$. With these extensions, the I/FFT processor can be shared between channel estimation and OFDM de/modulation to save hardware resources.

For the MIMO case, the described sequence of operations is repeated for all $M_T M_R$ spatial subchannels sequentially, incurring an overall latency of 42 µs for channel-estimate refinement in a $4 \times 4$ system.

D. MIMO Processing

The MIMO processing unit performs both preprocessing and detection. The main reason for combining these two operations is that in the present architecture they are never performed at the same time. The hardware required for the preprocessing can thus be reused for the detection of data symbols once preprocessing is complete. To ease hardware reuse, the moderately parallel architecture detailed in [4] has been chosen for the implementation of the MIMO processing unit. The corresponding circuit consists of a circular array of $M_T$ identical processing elements (PEs) and a common divider for the division in (3). The PEs are connected only to their neighbors and each PE contains a complex-valued multiplier, an adder, and some local registers. The arithmetic is pipelined with one stage, which provides a reasonable compromise between the potential for higher clock speeds and the need for additional clock cycles due to data dependencies. This architectural choice is suitable for both preprocessing and detection, requires almost no control overhead to switch between these modes, and provides a balance between speed and resource utilization.

In this configuration the proposed circuit runs at a clock frequency of 40 MHz on the targeted Xilinx XC2V6000-6 FPGA. For the preprocessing, this clock frequency entails a delay of 2.2 µs per OFDM tone and, for the detection, 40 tones can be processed in the duration of one OFDM symbol. Unfortunately, maintaining real-time performance for the system under consideration requires a detection throughput of 52 tones per OFDM symbol (see footnote 6). In addition to that, a FIFO buffer for 29 MIMO-OFDM symbols is required to store the data symbols arriving during the preprocessing latency of 114.4 µs to avoid a loss of data. The chosen straightforward solution to achieve real-time detection performance for the proposed system is to instantiate two identical MIMO processing units to process two tones in parallel. In this configuration, the preprocessing of 52 tones incurs a latency of 57.2 µs, which reduces the size of the required FIFO buffer. Moreover, the MIMO detection can now process up to 80 instead of the required 52 received vectors $r[k, t]$ per OFDM-symbol interval

Footnote 6: Detection is performed for all 48 data-bearing tones and for the 4 pilot tones used for tracking as described in Section III-D.

so that the receiver can catch up with the incoming data stream after 27 OFDM symbols.

E. Pilot Tracking

The pilot tracking implements (6) and (7) to estimate the impact of RFO and SRO using the post-linear-detection algorithm. To avoid the storage of an entire OFDM symbol, a prediction of the correcting phasor based on the estimates of the last two OFDM symbols is applied to the current OFDM symbol. A CORDIC circuit is employed to implement the angle (argument) function. For the compensation, the correcting phasor is retrieved from a LUT and a complex-valued multiplier is used to process all four streams in a time-interleaved fashion.

F. Bit-Metric Unit

The SINR-independent part of the bit-metric, which is a function of $b$ and $s_i$, is a piecewise linear function that can be scaled so that the slope of the different segments is one of 1, 2, 3, 4. Hence, this function can be implemented with comparators and adders only. The product of the per-stream SINR with this function in (9) is computed using one multiplier for each received stream. This multiplication produces values with a dynamic range well in excess of the 5-bit input word width of the Viterbi decoder used in the testbed. Fortunately, the bit error rate performance is insensitive to the quantization of the bit-metrics.

G. Channel Decoding

Per-stream convolutional coding with subsequent cross-antenna interleaving, as shown in Fig. 1, allows the processing of the computed soft-metrics on $M_T$ parallel Viterbi decoders. This enables real-time decoding on Virtex-II FPGAs. The parallel decoders are implemented following the approach introduced in [30] and [2], where a single hardware unit processes multiple data streams in an interleaved fashion. This makes it possible to pipeline the add-compare-select recursion, which is the throughput bottleneck in conventional single-stream Viterbi decoders. With this architecture, real-time performance is achieved with reduced FPGA resources compared to $M_T$ conventional decoders. The decoders process 5-bit-wide soft-metrics. Traceback is implemented with the register-exchange technique and the traceback length was selected to be 54 trellis steps.

Interleaver and deinterleaver are based on dual-ported RAMs that allow the concurrent storage and retrieval, according to the interleaving pattern, of bits (or LLRs) pertaining to two different OFDM symbols. Puncturing and depuncturing are uncritical in terms of hardware and require only a corresponding finite state machine. Scrambling and descrambling perform a bit-wise exclusive-OR combination of the payload data with the output of a linear feedback shift register.

VI. IMPLEMENTATION RESULTS

The digital signal processing blocks shown in Fig. 5 are integrated on a single XC2V6000-6 FPGA, with the exception of DUC, DDC, and digital AGC gain stages. For each antenna pair, the corresponding instances of these three blocks are mapped onto the XC2V1000-4 FPGA located on the associated data converter module. The required FPGA resources for the two FPGA designs are detailed in Table I, where "Mult" and "Ram" refer to the built-in signed 18 bit × 18 bit multipliers and 18 kbit RAMs, respectively. The resources reported under "Others" refer to PCI interface, debugging and analysis circuitry, intermodule-communication interfaces, and other functional units that are not strictly related to the physical layer signal processing.

TABLE I
REQUIRED FPGA RESOURCES

XC2V6000-6 FPGA
Block                                     Slice   %Slice   Mult   Ram
Synchronization and tracking               5038     17.6     17     5
FIFO memories                                 0      0        0    32
OFDM de/modulation                         2879     10.1     12    12
MIMO processing and channel estimation     8847     31.0     32    10
Channel decoding and bit-metric unit       9082     31.8      4     5
Others                                     2725      9.5      0    70
Total                                     28571    100       65   134

XC2V1000-4 FPGAs
Block                                     Slice   %Slice   Mult   Ram
DUC                                         980     19.5     20     0
DDC                                         997     19.8      6     0
AGC                                         940     18.7      4     0
Others                                     2113     42.0      0    39
Total                                      5030    100       30    39

In summary, the circuits for the signal processing need a total of 64 RAMs, of which 32 are used for the FIFO buffer required to bridge the latency incurred by the MIMO preprocessing and channel estimation. This buffer was intentionally oversized (by at least a factor of two) to enable special debug modes that give access to the received data directly after the FFT or after the pilot tracking. Hence, a significant amount of memory could be saved in a real-world system by eliminating these debug modes.

In terms of throughput, all blocks in the testbed are designed to meet the real-time requirements imposed by the 20 MHz communication bandwidth and by the 216 Mbit/s throughput supported by the system. In terms of latency, the delay between the start of the frame and the time when the first bits are available at the output of the receiver is dominated by the 57.2 µs preprocessing latency discussed in Sec. V-D. The use of the channel estimation refinement algorithm described in Sec. III-B increases this latency by an additional 42 µs. In comparison, the latency of all other blocks is insignificant.

VII. MEASUREMENTS AND CHARACTERIZATION

A. Measurement Setup

Measurements were taken with two real-time terminals communicating over a wideband multipath channel emulator, as shown in Fig. 8. In place of antennas, the multi-antenna RF transceivers are connected directly to a MIMO channel emulator, which supports RF inputs and outputs. Time-varying

[Diagram omitted. Blocks shown: Tx MIMO-OFDM terminal (slave) and Rx MIMO-OFDM terminal (master), each consisting of a host computer and an RF unit, connected through the multipath MIMO channel emulator; the host computers communicate over a wired TCP/IP network.]

Fig. 8. The measurement setup includes two MIMO-OFDM terminals and a real-time wideband multipath fading channel emulator.

TABLE II
CONSIDERED $4 \times 4$ DATA RATES $T_m$ IN [MBIT/S]

Coding         Subcarrier modulation
rate $R_c$     BPSK   QPSK   16-QAM   64-QAM
1/2              24     48       96      144
2/3               -      -        -      192
3/4              36     72      144      216
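The entries of Table II are consistent with four spatial streams, the 48 data-bearing tones mentioned in Section V-D, and a symbol duration of 4 µs. The short Python sketch below reproduces them under these assumptions; the 4 µs symbol duration (802.11a-style timing) is not stated explicitly in this part of the paper, and the function name is hypothetical.

```python
from fractions import Fraction

N_STREAMS = 4        # spatial streams
N_DATA_TONES = 48    # data-bearing tones per OFDM symbol
T_SYM_US = 4         # ASSUMED OFDM symbol duration in microseconds
BITS_PER_SYMBOL = {"BPSK": 1, "QPSK": 2, "16-QAM": 4, "64-QAM": 6}


def data_rate_mbps(modulation, code_rate):
    """Physical layer data rate T_m in Mbit/s for one transmission mode."""
    coded_bits = N_STREAMS * N_DATA_TONES * BITS_PER_SYMBOL[modulation]
    # Information bits per microsecond equal Mbit/s.
    return coded_bits * Fraction(code_rate) / T_SYM_US


if __name__ == "__main__":
    for rate in ("1/2", "3/4"):
        row = [int(data_rate_mbps(mod, rate)) for mod in BITS_PER_SYMBOL]
        print(rate, row)
    print("2/3", int(data_rate_mbps("64-QAM", "2/3")))
```

For example, 64-QAM with $R_c = 3/4$ gives $4 \cdot 48 \cdot 6 \cdot 3/4 = 864$ information bits per OFDM symbol, that is, 216 Mbit/s.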

MIMO channels with antenna configurations up to $4 \times 4$ and channel impulse responses with up to 12 sample-spaced taps per SISO subchannel are supported. The measurement operations are coordinated over a wired TCP/IP network by the receiver terminal. The transmit terminal acts as a server and generates MIMO-OFDM frames upon request. Channel emulator settings, such as the wireless-channel model and output power of the channel emulator, are also controlled over TCP/IP. MIMO-OFDM frames at the channel emulator input, generated by the transmitting terminal, exhibit a constant power level. After applying the desired channel model, the average power at the RF inputs of the receiver is controlled by adjusting the output gain of the emulator.

B. Performance of Coded MIMO-OFDM

The performance of coded MIMO-OFDM is assessed by enabling all physical layer receiver functions. Payload data is transferred to the transmitter over the wired TCP/IP network, which enables the receiver to check the decoded frames for bit errors.

1) Performance Metrics: By averaging measurement results over $N_{ch}$ channel realizations, two different performance metrics are computed as described in the following. A channel realization labeled $l$ is drawn according to the desired statistics and uploaded to the emulator. Multiple OFDM frames carrying 500 bytes of payload data each are transmitted over the selected channel for each of the supported transmission modes $m$. An empirical frame error rate (FER) $P_{FER}[m, l]$ is computed for each transmission mode and channel realization by counting the number of incorrectly received frames. For each transmission mode, the per-mode average throughput $\bar{T}_m$ is then computed as

$$\bar{T}_m = \frac{1}{N_{ch}} \sum_{l=1}^{N_{ch}} T_m \left(1 - P_{FER}[m, l]\right),$$

where $T_m$ is the physical layer data rate of the transmission mode $m$, as reported in Table II.

Assuming genie-aided adaptive modulation, for each channel realization the transmission mode providing the best throughput is selected. Based on this assumption, the overall average throughput $\bar{T}$ is computed as

$$\bar{T} = \frac{1}{N_{ch}} \sum_{l=1}^{N_{ch}} \max_m \, T_m \left(1 - P_{FER}[m, l]\right).$$

Optimal per-channel-realization transmission-mode selection is realistic only for scenarios with slowly varying channels, such as indoor WLAN channels, where the transmitter has sufficient time to determine the highest mode supported by the current channel conditions. In mobile scenarios, where the wireless channel is changing faster, this assumption is definitely optimistic.

Performance charts for coded MIMO-OFDM were obtained by computing $\bar{T}_m$ and $\bar{T}$ for different average receive powers. All measurements were taken with 100 kHz frequency offset and 1.6 kHz sample-rate offset, which corresponds to a 20 ppm frequency reference precision. The number of channel realizations was $N_{ch} = 500$ and the empirical FER $P_{FER}[m, l]$ was computed based on the transmission of 20 frames per channel realization. Block fading is assumed and the emulated channel is not changed during the transmission of MIMO-OFDM frames. Unless otherwise specified, channels were generated based on the TGn models [45], assuming an antenna spacing of two wavelengths both at the transmitter and at the receiver. For all measurements, only the four-stream transmission modes listed in Table II were taken into consideration.

2) Performance for Different Channel Models: The measured average throughput $\bar{T}$ under different block-fading channel scenarios is shown in Fig. 9. TGn-A is a flat-fading channel, while TGn-B and TGn-C have a delay spread of 80 ns and 200 ns, respectively (see footnote 7). For comparison, a spatially uncorrelated Rayleigh fading channel with uniform power delay profile and two taps spaced 50 ns apart was also considered. From the chart in Fig. 9 it can be observed that the system suffers from increasing channel lengths. However, the impact of the different channel models is moderate. The irregular behavior of the curves in the region around -33 dBm average input power is due to spurious interference signals generated by the channel emulator. The power of these spurs was observed not to increase proportionally to the selected channel emulator output power.

Footnote 7: Channel models TGn-B and TGn-C are defined based on a tap spacing of 10 ns and hence need to be resampled. This operation potentially generates more than the 12 taps supported by the channel emulator, so that the effective channels are only a best-effort approximation to the TGn channel models.

The per-mode average throughput $\bar{T}_m$ over TGn-A is shown in Fig. 10 for the transmission modes with code rate $R_c = 1/2$ (see Table II). While BPSK transmission is supported over the entire range, with increasing emulator output power the higher modes (QPSK, 16-QAM, and 64-QAM) also become reliable. However, the curves for 16-QAM and 64-QAM saturate before reaching their corresponding $T_m$ (96 Mbit/s and 144 Mbit/s, respectively). This observation can be attributed to the channel emulator and to the discrete multi-antenna RF transceivers, which limit the receive SNR. Additionally, for high SNRs the numerical precision of the linear MIMO detector units becomes a limiting factor. According to the simulation results in [4], the word widths of the corresponding

[Plots omitted. In Figs. 9 to 12 the horizontal axis is the average receive power [dBm], from -63 to -28, and the vertical axis is the average throughput [Mbps].]

Fig. 9. Average throughput $\bar{T}$ for different channel models. Curves: TGn-A (0 ns), TGn-B (80 ns), TGn-C (200 ns), and 2 tap, flat PDP (50 ns).

Fig. 10. Per-mode average throughput $\bar{T}_m$ over TGn-A for the transmission modes in Table II with coding rate $R_c = 1/2$. Curves: BPSK, QPSK, 16-QAM, and 64-QAM, all with R = 1/2; the 96 Mbps and 144 Mbps levels are marked.

Fig. 11. Impact of different antenna spacings on the average throughput $\bar{T}$ achieved over a TGn-A channel. Curves: flat Rayleigh fading and TGn-A with antenna spacings of 100, 2, 1, and 0.5 wavelengths.

Fig. 12. Impact of channel estimation refinement and soft-information extraction on the average throughput $\bar{T}$ over a TGn-B channel. Curves: all receiver algorithms on; enhanced channel estimation off; soft information extraction off.
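The two metrics plotted in Figs. 9 to 12 follow directly from the per-realization frame error rates defined in Section VII-B.1. The following Python sketch evaluates both definitions; the function names are hypothetical and the FER values in the usage example are made up for illustration and are not measurements.

```python
def per_mode_throughput(rate, fer_row):
    """Average of T_m * (1 - P_FER[m, l]) over all channel realizations l."""
    return sum(rate * (1.0 - p) for p in fer_row) / len(fer_row)


def overall_throughput(rates, fer):
    """Genie-aided selection: best mode per realization, then average over l.

    rates[m] is the data rate T_m, fer[m][l] is the empirical FER of mode m
    on channel realization l.
    """
    n_ch = len(fer[0])
    best = [max(r * (1.0 - fer[m][l]) for m, r in enumerate(rates))
            for l in range(n_ch)]
    return sum(best) / n_ch


if __name__ == "__main__":
    rates = [24, 48, 96, 144]          # rate-1/2 modes of Table II in Mbit/s
    fer = [[0.0, 0.0, 0.0],            # BPSK   (illustrative values only)
           [0.0, 0.1, 0.0],            # QPSK
           [0.0, 0.5, 1.0],            # 16-QAM
           [0.05, 1.0, 1.0]]           # 64-QAM
    print([per_mode_throughput(r, row) for r, row in zip(rates, fer)])
    print(overall_throughput(rates, fer))
```

Because the maximum is taken inside the sum, the overall average throughput is never smaller than the best per-mode average throughput.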

multipliers must be increased beyond 18 bit (which is the word width of the FPGA built-in multipliers) for SNRs in excess of 30 dB, when operating in an uncorrelated Rayleigh-fading environment. These limitations explain why the measured average throughputs $\bar{T}$ in Fig. 9 saturate between 120 Mbit/s and 140 Mbit/s, even though the highest physical layer data rate in Table II is much higher (216 Mbit/s).

3) Impact of Antenna Correlation: When reducing the antenna spacing, the correlation between the received signals increases and linear MIMO detection becomes less reliable. The impact of this correlation is shown in Fig. 11 for the particular case of TGn-A, where the antenna spacing is set to 100, 2, 1, and 0.5 wavelengths. For comparison, a flat Rayleigh-fading channel, corresponding to the special case of TGn-A without any spatial correlation, is considered. As expected, the system suffers significantly when the antenna spacing is reduced below one wavelength.

4) Impact of Channel Estimation Refinement and Soft-Output MIMO Detection: The impact of channel estimation refinement and soft-information extraction on the average throughput is investigated in the following. The FPGA resources required to support these algorithms are reported in Table III, where the slice percentage refers to the total number of used slices on the XC2V6000-6 FPGA (given in Table I).

Footnote 8 (to Table III): Only the overhead related to the bit-metric unit is considered. The difference in silicon area between hard-input and soft-input Viterbi decoding is not considered.

In order to assess the impact of these algorithms on performance, channel estimation refinement can be skipped, and the extraction of soft-information can be replaced with simpler hard-decision MIMO detection. The measured average throughput $\bar{T}$ achieved over a TGn-B channel is shown in Fig. 12. The performance with all receiver algorithms enabled corresponds to the topmost curve. For comparison, the performance was measured after disabling channel estimation refinement and once again after additionally disabling the extraction of LLRs. The average throughput $\bar{T}_m$ achieved with 16-QAM and $R_c = 1/2$ is shown in Fig. 13. According to this figure, channel estimation refinement yields an SNR gain of about 2 dB. The impact of soft-information extraction, which is slightly harder to determine because the curves saturate at different levels due to the SNR limitation, amounts to about 2.5 dB to 3.7 dB.

In summary, the curves in Fig. 13 and the numbers in Table III show that with a combination of the two receiver algorithms, an overall SNR gain of about 4.5 dB to 5.7 dB

TABLE III 100


I MPACT ON REQUIRED FPGA RESOURCES FOR CHANNEL ESTIMATION

Permode average throughput [Mbps]


REFINEMENT AND SOFT- INFORMATION EXTRACTION
80
Algorithm Slice %Slice Mult Ram
Refined channel estimation 710 2.5 - - 60
Soft-information extraction8 625 2.2 4 2 3.7 dB
Total 1335 4.7 4 2 2 dB
40

can be achieved with roughly 5% increase in FPGA-resource 20 2.5 dB All receiver algorithms on
utilization. Enhanced channel estimation off
Soft information extraction off
0
VIII. C ONCLUSION -63 -58 -53 -48 -43 -38 -33 -28
A real-time MIMO-OFDM physical layer transmitting at Average receive power [dBm]

a peak data rate of 216 Mbit/s over 20 MHz bandwidth was


Fig. 13. Impact of channel estimation refinement and soft-information
prototyped and characterized through measurements. Real- extraction on Tm over TGn-B with 16-QAM, Rc = 1/2.
time operation of the system on an FPGA was achieved by
diligent selection and optimization of the employed transceiver
algorithms for the FPGA implementation and by careful R EFERENCES
design of the corresponding transceiver hardware architecture.
With respect to the signal processing, it is found that many [1] S. Haene, D. Perels, D. S. Baum, M. Borgmann, A. Burg,
functional units are identical for all antennas. In the FPGA im- N. Felber, W. Fichtner, and H. Bolcskei, Implementation aspects
of a real-time multi-terminal MIMO-OFDM testbed, in IEEE
plementation, this property can often be exploited to perform Radio and Wireless Conference, Sep. 2004. [Online]. Available:
the associated signal processing on a single entity in a time- http://www.nari.ee.ethz.ch/commth/pubs/p/RAWCON2004
interleaved fashion. This architectural optimization allows to [2] S. Haene, A. Burg, D. Perels, P. Luethi, N. Felber, and W. Fichtner,
FPGA implementation of Viterbi decoders for MIMO-BICM, in Proc.
reduce the resource utilization compared to straightforward Asilomar Conf. on Signals, Systems Computers, Oct. 2005.
replication of the corresponding hardware. [3] D. Perels, S. Haene, P. Luethi, A. Burg, N. Felber, W. Fichtner, and

One of the most critical parts in the system is the MIMO detector. For
this prototype a linear detector was chosen to enable a real-time FPGA
implementation of soft-output MIMO detection with reasonable resource
utilization. Nevertheless, the separation of the spatial streams accounts
for 30% of the FPGA slices and 50% of the multipliers used by the physical
layer design. Moreover, it was found that the preprocessing of the channel
matrices in the MIMO detector is one of the main complexity bottlenecks
which can incur considerable detection latency and thus requires large
FIFO buffers in the receiver.
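
To make the notion of a linear detector concrete, the following sketch separates the spatial streams of one subcarrier by linear equalization. It assumes an MMSE criterion, which this section does not specify, and it produces symbol estimates only, without the soft-output stage. All function names are hypothetical, and the floating-point Gauss-Jordan solve stands in for whatever fixed-point matrix preprocessing the hardware performs.

```python
def hermitian(m):
    """Conjugate transpose of a matrix given as a list of rows."""
    return [[m[r][c].conjugate() for r in range(len(m))]
            for c in range(len(m[0]))]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def solve(a, b):
    """Solve a x = b (square complex a, vector b) by Gauss-Jordan
    elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def mmse_detect(h, y, noise_var):
    """Estimate s from y = H s + n via (H^H H + noise_var I)^-1 H^H y.

    Building and solving the regularized Gram matrix corresponds to the
    per-subcarrier channel-matrix preprocessing; noise_var = 0 reduces
    the estimator to zero forcing.
    """
    hh = hermitian(h)
    gram = matmul(hh, h)
    for i in range(len(gram)):
        gram[i][i] += noise_var
    rhs = [row[0] for row in matmul(hh, [[v] for v in y])]
    return solve(gram, rhs)
```

The channel-dependent part (forming and inverting the regularized Gram matrix) has to be repeated for every subcarrier before any data symbol on that subcarrier can be detected, which is consistent with the latency and buffering concern raised above.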

Compared to legacy single-antenna IEEE 802.11a WLAN systems, which
transmit over the same bandwidth but are limited to a peak data rate of
54 Mbit/s, the measured average data rates are clearly higher. It was
observed that the system performance is affected by channels with
increasing delay spread and by high antenna correlation. As expected, the
MIMO gain is reduced when the antenna spacing is not sufficient.
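
The relation between the 54 Mbit/s legacy peak rate and the 216 Mbit/s quoted in the abstract can be checked with a short calculation. The sketch below uses the standard IEEE 802.11a top-rate parameters (48 data subcarriers, 64-QAM, rate-3/4 coding, 4 µs OFDM symbols) and assumes, as a simplification, that each spatial stream of the prototype carries the same payload as one 802.11a link. The function name is hypothetical.

```python
from fractions import Fraction


def peak_rate_mbps(num_streams: int,
                   data_subcarriers: int = 48,
                   bits_per_symbol: int = 6,            # 64-QAM
                   code_rate: Fraction = Fraction(3, 4),
                   ofdm_symbol_us: int = 4) -> Fraction:
    """Peak PHY rate in Mbit/s for num_streams parallel spatial streams.

    Bits per microsecond equal Mbit/s, so no unit conversion is needed.
    """
    info_bits = data_subcarriers * bits_per_symbol * code_rate
    return num_streams * info_bits / ofdm_symbol_us


print(peak_rate_mbps(1))  # single-antenna 802.11a peak rate
print(peak_rate_mbps(4))  # four spatial streams in the same bandwidth
```

Under these assumptions the peak rate scales linearly with the number of streams; the measured average rates depend on the channel, as noted above.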

By considering the specific examples of channel estimation refinement and
soft information extraction, it is found that careful algorithm and
architecture optimization achieves a considerable performance improvement
at the cost of only a marginal complexity increase.

ACKNOWLEDGMENT

The authors wish to thank Prof. Bölcskei for his guidance, Prof. Fichtner
for his support, and their colleagues D. Baum, P. Luethi, and M. Borgmann
for their contributions and invaluable discussions. The ETH Zurich and the
FP6 STREP project MASCOT are acknowledged for partial funding of this work
under the project numbers TH-602-02 and IST-026905, respectively.

for MIMO BLAST over third-generation wireless system, IEEE J. Sel. [39] M. Speth, S. A. Fechtel, G. Fock, and H. Meyr, Optimum receiver
Areas Commun., vol. 21, pp. 440451, 2003. design for wireless broad-band systems using OFDM, IEEE Trans.
[16] H. Sampath, S. Talwar, J. Tellado, V. Erceg, and A. Paulraj, A Commun., vol. 47, no. 11, pp. 16681677, Nov. 1999.
fourth-generation MIMO-OFDM broadband wireless system: design,
[40] I. B. Collings, M. R. G. Butler, and M. McKay, Low complexity
performance, and field trial results, IEEE Commun. Mag., vol. 40, no. 9,
receiver design for MIMO bit-interleaved coded modulation, in IEEE
pp. 143149, Sep. 2002.
Int. Symp. on Spread Spectrum Techniques and Applications, 2004,
[17] C. Dubuc, D. Starks, T. Creasy, and H. Yong, A MIMO-OFDM
pp. 1216.
prototype for next-generation wireless WANs, IEEE Commun. Mag.,
[41] A. Paulraj, R. Nabar, and D. Gore, Introduction to Space-Time Wireless
vol. 42, pp. 8287, Dec. 2004.
Communications. Cambridge Univ. Press, 2003.
[18] A. van Zelst and T. C. W. Schenk, Implementation of a MIMO OFDM-
[42] B. Parhami, Computer Arithmetic, Algorithms and Hardware Design.
based wireless LAN system, IEEE Trans. Signal Processing, vol. 52,
Oxford University Press, 2000.
no. 2, pp. 483494, Feb. 2004.
[43] L. G. Johnson, Conflict free memory addressing for dedicated FFT
[19] C. Mehlfuhrer, M. Rupp, F. Kaltenberger, and G. Humer, A scalable
hardware, IEEE Trans. Circuits Syst. II, vol. 39, pp. 312316, 1992.
rapid prototyping system for real-time MIMO OFDM transmissions,
[44] E. O. Brigham, The fast Fourier transform and its applications. Prentice
in Proc. of the 2nd IEE/EURASIP Conference on DSP enabled Radio,
Hall, 1988.
Sep. 2005, pp. 714.
[45] V. Erceg et al., TGn Channel Models, IEEE 802.11 document 03/940r4.
[20] Y. Heejung, K. Myung-Soon, C. Eun-young, J. Taehyun, and L. Sok-
kyu, Design and prototype development of MIMO-OFDM for next
generation wireless lan, IEEE Trans. Consumer Electron., vol. 51, pp.
11341142, Nov. 2005.
Simon Haene (S'03–M'08) was born in Basel, Switzerland, in 1978. He
received the Diploma degree in electrical engineering from ETH Zurich,
Switzerland, in 2002. He then joined the Integrated Systems Laboratory of
ETH Zurich, and graduated with a Dr. sc. ETH degree in 2007. In the same
year, he co-founded Celestrius, an ETH-spinoff in the field of MIMO
wireless communication. In 2000, he held a summer researcher position at
British Telecom, England.
His research interests include the design of VLSI circuits, digital signal
processing for wireless communication systems, and FPGA-based prototyping.

David Perels (S'94–M'97) was born on April 15th 1972 in Heidelberg,
Germany. He studied electrical engineering from 1992 to 1997 at the ETH
Zurich, Switzerland, and received his Diploma degree in 1997.
From 1997 to 2001 he worked at Swisscom Mobile in the field of mobile
communications and mobile data services. In 2001 he joined the Integrated
Systems Laboratory at ETH Zurich where he graduated with a Dr. sc. techn.
degree in 2007 in the field of VLSI design for wireless communications.
Mr. Perels is currently working at Phonak, a hearing instrument
manufacturer, in the research and development department.

Andreas Burg (S'97–M'05) was born in Munich, Germany, in 1975. He received
his Dipl.-Ing. degree in 2000 from the Swiss Federal Institute of
Technology (ETH) Zurich, Zurich, Switzerland. He then joined the
Integrated Systems Laboratory of ETH Zurich, from where he graduated with
the Dr. sc. techn. degree in 2006.
In 1998, he worked at Siemens Semiconductors, San Jose, CA. During his
doctoral studies, he was a visiting researcher with Bell Labs Wireless
Research for a total of one year. From 2006 to 2007, he held positions as
postdoctoral researcher at the Integrated Systems Laboratory and at the
Communication Technology Laboratory of the ETH Zurich. In 2007 he
co-founded Celestrius, an ETH-spinoff in the field of MIMO wireless
communication, where he is responsible for the VLSI development. His
research interests include the design of digital VLSI circuits and
systems, signal processing for wireless communications, and deep submicron
VLSI design.
In 2000, Mr. Burg received the Willi Studer Award and the ETH Medal for
his diploma and his diploma thesis, respectively. Mr. Burg was also
awarded an ETH Medal for his Ph.D. dissertation in 2006. In 2008, Dr. Burg
was awarded a 4-year grant from the Swiss National Science Foundation
(SNF) on which he will join the ETH Zurich as an SNF Professor in 2008.
