
A Complete Pipelined MMSE Detection Architecture in a 4x4 MIMO-OFDM Receiver

Shingo Yoshizawa, Yasushi Yamauchi, and Yoshikazu Miyanaga

Graduate School of Information Science and Technology, Hokkaido University, Sapporo, 060-0814 Japan

Abstract: This paper presents a VLSI architecture for MMSE detection in a 4x4 MIMO-OFDM receiver. Packet-based MIMO-OFDM imposes a considerable throughput requirement on the matrix inversion because of the strict timing of the frame structure and the subcarrier-by-subcarrier processing. Algorithms oriented to pipeline processing are preferable for tackling this issue. We adopt Strassen's algorithms for matrix inversion and multiplication in the circuit design of the MMSE detection. The complete pipelined architecture achieves real-time operation that does not depend on the number of subcarriers. The designed circuit has been implemented in a 90-nm CMOS process and shows the potential to provide a 2.6-Gbps transmission speed in a 160-MHz signal bandwidth.

I. INTRODUCTION

A MIMO-OFDM scheme, the combination of multi-input multi-output (MIMO) antennas and orthogonal frequency division multiplexing (OFDM), has attracted a great deal of attention in recent wireless communications. IEEE 802.11n provides data rates of up to 600 Mbps by using four spatial streams. The wireless chip presented by Atheros Communications supports up to two spatial streams and delivers a transmission speed of 300 Mbps in a 40-MHz bandwidth [1]. A chip implementation of four spatial streams is the forthcoming next step.

Hardware implementation of MIMO detection in MIMO-OFDM is challenging because a large number of complex operations must be completed within a short period. A packet-based MIMO-OFDM system performs the MIMO detection subcarrier by subcarrier and updates the channel matrix for each packet. This imposes a considerable throughput requirement, since the complexity is proportional to the number of data subcarriers. Moreover, we consider extending the signal bandwidth to realize a transmission speed of over 1 Gbps with a 4x4 antenna configuration. Dynamic spectrum access using cognitive radio technology is a reasonable solution for extending the bandwidth [2].

A linear MMSE detector is assumed for this case. Other detectors, such as sphere decoding (SD), K-best (KB), and successive interference cancellation (SIC), still have difficulty meeting the throughput requirement of over 1 Gbps. Implementations of the MMSE detection and of matrix inversion (used as a part of the MMSE detection) have been presented in [3]-[6]. The work in [4] achieved a processing latency of 38.4 µs for up to 64 subcarriers in the MMSE detection using a 0.25-µm technology. However, the latency of these conventional implementations grows with the number of subcarriers, and they are insufficient for our targeted throughput.

To solve this issue, this paper focuses on a complete pipelined architecture for the MMSE detection. The pipelined architecture offers high throughput independent of the number of subcarriers, but it requires careful algorithm selection to avoid iterative processing and excessive data buffering. The key idea is the use of Strassen's matrix inversion [7], whose simplicity makes it suitable for concurrent and pipeline processing. We have adopted Strassen's algorithms for matrix inversion and multiplication in the MMSE detection to reduce the complexity of the VLSI implementation. This paper presents the algorithm and the circuit design of the MMSE detection in a 4x4 MIMO-OFDM receiver.

II. MIMO DETECTION AND SYSTEM REQUIREMENTS

For a MIMO-OFDM system with $M_T$ transmit antennas, $M_R$ receive antennas, and $K$ data subcarriers, the channel for the $k$-th data subcarrier is given by an $M_R \times M_T$ matrix $H_k$. In the MIMO detection, the received vector $y_k(t)$ for the $t$-th data symbol is expressed as

\[ y_k(t) = H_k s_k(t) + n_k(t), \tag{1} \]

where $s_k(t)$ is the transmitted vector and $n_k(t)$ models the channel noise. The purpose of the MIMO detection is to extract $s_k(t)$ from the received vector. Linear detection methods invert the channel matrix using a zero-forcing (ZF) or minimum mean squared error (MMSE) criterion. The inverse channel matrix $G_k$ of the MMSE detection is given by

\[ G_k = (H_k^H H_k + \sigma^2 I)^{-1} H_k^H, \tag{2} \]

where $\sigma^2$ is the noise variance and $(\cdot)^H$ denotes the Hermitian transpose. The matrix inversion of Eq. (2) is referred to as preprocessing. The final step is to decode the estimated transmit vectors by using

\[ \hat{s}_k(t) = G_k y_k(t). \tag{3} \]
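As a minimal numerical sketch of Eqs. (2) and (3), the following NumPy code computes the MMSE matrix for one subcarrier and applies it to a received vector. It works in double-precision floating point rather than in the fixed-point arithmetic of the circuit, and the function names `mmse_preprocess` and `mmse_detect`, the toy channel, and the noise level are illustrative assumptions that do not come from the paper.

```python
import numpy as np

def mmse_preprocess(H, sigma2):
    """Preprocessing of Eq. (2): G = (H^H H + sigma2*I)^-1 H^H for one subcarrier."""
    Q = H.conj().T                            # Q = H^H
    P = Q @ H + sigma2 * np.eye(H.shape[1])   # P = H^H H + sigma^2 I
    R = np.linalg.inv(P)                      # R = P^-1 (library inverse, for illustration)
    return R @ Q                              # G = R Q

def mmse_detect(G, y):
    """Detection of Eq. (3): estimate of the transmitted vector."""
    return G @ y

# Toy example (assumed values): one subcarrier of a 4x4 system, QPSK-like symbols.
rng = np.random.default_rng(0)
H = (rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))) / np.sqrt(2)
s = rng.choice([-1.0, 1.0], size=4) + 1j * rng.choice([-1.0, 1.0], size=4)
sigma2 = 0.01
noise = np.sqrt(sigma2 / 2) * (rng.standard_normal(4) + 1j * rng.standard_normal(4))
y = H @ s + noise

G = mmse_preprocess(H, sigma2)   # done once per packet for each subcarrier
s_hat = mmse_detect(G, y)        # repeated for every data symbol
print(np.round(s_hat, 2))
```

The split into two functions mirrors the text: the preprocessing result is reused for all data symbols of a packet, while the detection is one matrix-vector product per symbol.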

The frame structure, assuming the IEEE 802.11n PHY frame with four spatial streams (i.e., $M_T = 4$, $M_R = 4$, and $K = 108$), is illustrated in Fig. 1. The channel information $H_k$ is obtained from the four training symbols in the four spatial streams. The preprocessing starts when the channel matrix of the first data subcarrier, $H_1$, is received and finishes when the inverse matrix of the last subcarrier, $G_K$, has been computed. We consider real-time operation, in which the detection of Eq. (3) is immediately

978-1-4244-1684-4/08/$25.00 © 2008 IEEE


Fig. 1. Frame structure of a packet-based MIMO-OFDM system and timing chart of the MIMO detection. [Figure not reproduced. Recoverable labels: training symbols T(3), T(4) with duration T_HT-LTFs; guard interval T_GI; data symbols y(1), y(2), y(3); timing rows for channel estimation (H_1 ... H_K), preprocessing (G_1 ...), and detection.]
[Fig. 2 not reproduced. Recoverable labels: Scale In; block floating-point arithmetic scaling; 4x4 matrix multiplication; 4x4 matrix inversion; 4x4 matrix multiplication; Scale Out; signals H11(k), H12(k), ...; P11(k) ... P44(k); Q11(k) ... Q44(k); R11(k) ... R44(k); G11(k) ... G44(k); Sb(k); stage labels: 4 stages, 22 stages, 4 stages.]

Fig. 3. Circuit structure of the matrix multiplication. [Figure not reproduced. Recoverable labels: 2x2 matrix arithmetic units, adders, pipeline delay.]

executed when the received data symbols $y_k$ arrive. This imposes strict timing: the preprocessing must finish within the period given by the sum of $T_{HT\text{-}LTFs}$ and $T_{GI}$. The IEEE 802.11n standard sets $T_{HT\text{-}LTFs}$ and $T_{GI}$ to 3.2 µs and 0.4 µs (using the optional mode), respectively. If the preprocessing cannot finish within this period, the delay affects all the data symbols, which requires a large FIFO memory for data buffering and lengthens the packet reception delay. A large number of subcarriers (e.g., more than five hundred) must also be considered to obtain higher transmission speeds of over 1 Gbps.

III. PROPOSED PIPELINED ARCHITECTURE

A. Block Diagram

Our solution is to design a complete pipelined architecture whose processing time does not depend on the number of subcarriers. It also has the advantage that the preprocessing block can be connected with other blocks (e.g., a pipelined FFT processor and a Viterbi decoder) by a pipeline chaining mechanism. The block diagram of the proposed architecture for the preprocessing is illustrated in Fig. 2. It has four parts with the intermediate values $P_k = H_k^H H_k + \sigma^2 I$, $Q_k = H_k^H$, $R_k = P_k^{-1}$, and $G_k = R_k Q_k$. The pipeline latency of 30 cycles is small enough to finish the preprocessing within the guard interval (see Section IV-C).

The scaling factors $S_a(k)$, $S_b(k)$, and $S_c(k)$ are used for a block floating-point format. The block floating-point scaling adjusts the maximum value of a 4x4 matrix so as to satisfy $1/4 \le \max |x_{ij}| < 1/2$, where $x_{ij}$ denotes a matrix element. The numbers of bit shifts are output as the scaling factors.

Fig. 2. Block diagram of the proposed architecture.

B. Matrix Multiplication

Strassen's algorithm for matrix multiplication has lower complexity than direct matrix multiplication and is used to reduce the number of real multipliers in the circuit. We calculate a 4x4 matrix product by partitioning the operands into 2x2 block matrices $\alpha_{ij}$ and $\beta_{ij}$ and the product into 2x2 block matrices $\gamma_{ij}$, and computing

\[ \begin{bmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{bmatrix} = \begin{bmatrix} \alpha_{11} & \alpha_{12} \\ \alpha_{21} & \alpha_{22} \end{bmatrix} \begin{bmatrix} \beta_{11} & \beta_{12} \\ \beta_{21} & \beta_{22} \end{bmatrix}. \tag{4} \]

Strassen's algorithm uses the following intermediate matrices $\delta_1, \ldots, \delta_7$:

\[
\begin{aligned}
\delta_1 &= (\alpha_{21} - \alpha_{22})\,\beta_{11} \\
\delta_2 &= \alpha_{22}\,(\beta_{11} + \beta_{21}) \\
\delta_3 &= \alpha_{11}\,(\beta_{12} + \beta_{22}) \\
\delta_4 &= (\alpha_{12} - \alpha_{11})\,\beta_{22} \\
\delta_5 &= (\alpha_{11} + \alpha_{22})(\beta_{11} + \beta_{22}) \\
\delta_6 &= (\alpha_{12} + \alpha_{22})(\beta_{21} - \beta_{22}) \\
\delta_7 &= (\alpha_{11} + \alpha_{21})(\beta_{12} - \beta_{11}) \\
\gamma_{11} &= \delta_5 + \delta_6 - \delta_2 + \delta_4 \\
\gamma_{12} &= \delta_3 + \delta_4 \\
\gamma_{21} &= \delta_1 + \delta_2 \\
\gamma_{22} &= \delta_5 + \delta_7 + \delta_1 - \delta_3.
\end{aligned}
\tag{5}
\]

These equations reduce the number of real multiplications from 256 to 224. Applying Strassen's algorithm to the 2x2 matrix multiplications in the same way results in 168 real multiplications. The circuit structure of the matrix multiplication, which has four pipeline stages, is illustrated in Fig. 3. The multiplication, addition, and subtraction units deal with 2x2 complex matrix operations.
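The one-level Strassen product of Eqs. (4) and (5) can be checked numerically with a short NumPy sketch. The function name `strassen_4x4` and the variable names are illustrative assumptions, and the multiplication counts assume four real multiplications per complex multiplication, which is consistent with the figure of 256 for the direct product.

```python
import numpy as np

def strassen_4x4(A, B):
    """One level of Strassen's algorithm on 2x2 blocks of 4x4 matrices (Eq. (5))."""
    a11, a12, a21, a22 = A[:2, :2], A[:2, 2:], A[2:, :2], A[2:, 2:]
    b11, b12, b21, b22 = B[:2, :2], B[:2, 2:], B[2:, :2], B[2:, 2:]

    # Seven 2x2 block products instead of eight.
    d1 = (a21 - a22) @ b11
    d2 = a22 @ (b11 + b21)
    d3 = a11 @ (b12 + b22)
    d4 = (a12 - a11) @ b22
    d5 = (a11 + a22) @ (b11 + b22)
    d6 = (a12 + a22) @ (b21 - b22)
    d7 = (a11 + a21) @ (b12 - b11)

    c11 = d5 + d6 - d2 + d4
    c12 = d3 + d4
    c21 = d1 + d2
    c22 = d5 + d7 + d1 - d3
    return np.block([[c11, c12], [c21, c22]])

# Real-multiplication counts, assuming 4 real multiplications per complex one.
REAL_PER_COMPLEX = 4
direct_real_mults = 4 * 4 * 4 * REAL_PER_COMPLEX           # 64 complex products
strassen_real_mults = 7 * (2 * 2 * 2) * REAL_PER_COMPLEX    # 7 block products, 8 complex each

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
C = strassen_4x4(A, B)
print(np.allclose(C, A @ B), direct_real_mults, strassen_real_mults)
```

The sketch applies Strassen's scheme only at the block level; the further reduction to 168 real multiplications reported in the text is not modeled here.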


TABLE I
NUMBERS OF ARITHMETIC OPERATIONS IN THE 4x4 MATRIX INVERSION.

Method                        Mul     Add/Sub   Div/Rec
Direct computation            1126    768       1
Cholesky decomposition [6]    264     108       4
Strassen matrix inversion     120     150       2
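The Strassen matrix inversion counted in Table I follows the block formula of Eq. (6) in Section III-C, with the 2x2 inverses computed directly. The NumPy sketch below illustrates that data flow in floating point; it does not model the block floating-point scaling, the Newton-Raphson reciprocal, or the pipeline stages of the circuit, and the names `inv2x2` and `strassen_inverse_4x4` are assumptions made for illustration.

```python
import numpy as np

def inv2x2(M):
    """Direct 2x2 inverse: (1 / (ad - bc)) * [[d, -b], [-c, a]]."""
    a, b, c, d = M[0, 0], M[0, 1], M[1, 0], M[1, 1]
    det = a * d - b * c
    return np.array([[d, -b], [-c, a]]) / det

def strassen_inverse_4x4(P):
    """Block inversion of a 4x4 matrix via Eq. (6), using 2x2 submatrices."""
    A, B, C, D = P[:2, :2], P[:2, 2:], P[2:, :2], P[2:, 2:]
    A_inv = inv2x2(A)
    CA_inv = C @ A_inv                       # shared term C A^-1
    E = D - CA_inv @ B                       # E = D - C A^-1 B
    E_inv = inv2x2(E)
    lower_left = -E_inv @ CA_inv             # -E^-1 C A^-1
    upper_right = -A_inv @ B @ E_inv         # -A^-1 B E^-1
    upper_left = A_inv - upper_right @ CA_inv  # A^-1 + A^-1 B E^-1 C A^-1
    return np.block([[upper_left, upper_right], [lower_left, E_inv]])

# Example with a Hermitian matrix of the MMSE form P = H^H H + sigma^2 I (assumed values).
rng = np.random.default_rng(2)
H = (rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))) / np.sqrt(2)
P = H.conj().T @ H + 0.1 * np.eye(4)
R = strassen_inverse_4x4(P)
print(np.allclose(R @ P, np.eye(4)))
```

For a Hermitian input, the upper-right block of the result equals the Hermitian transpose of the lower-left block, so a hardware implementation needs to compute only one of the two; the sketch computes both for clarity.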



[Fig. 4 not reproduced. Recoverable labels: input submatrices A, B, C, and D with Scale In; block floating-point arithmetic scaling with scaling factors s1, s2, s3, and s4; intermediate terms A^-1, CA^-1, and -E^-1 C A^-1; output blocks with Scale Out. Legend of the 2x2 matrix arithmetic units: number of pipeline stages, arithmetic operation, scaling factor; D: delay; x: multiplication; +/-: addition/subtraction; [ ]^-1: matrix inversion; [ ]^H: Hermitian transpose.]

Fig. 5. Fixed-point and floating-point simulation for the wordlength determination. [Figure not reproduced: bit error rate versus SNR [dB].]

C. Matrix Inversion

Strassen's algorithm for matrix inversion divides a square matrix into equal-sized smaller matrices [7]. A 4x4 matrix is divided into four 2x2 submatrices $A$, $B$, $C$, and $D$ and inverted as

\[ \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} + A^{-1} B E^{-1} C A^{-1} & -A^{-1} B E^{-1} \\ -E^{-1} C A^{-1} & E^{-1} \end{bmatrix}, \quad \text{where } E = D - C A^{-1} B. \tag{6} \]

The common terms $A^{-1}$, $CA^{-1}$, and $E^{-1}$ contribute to the complexity reduction. The numbers of real arithmetic operations for the 4x4 matrix inversion (multiplication, addition, subtraction, division, and reciprocal calculation) are shown in Table I. We make use of the Hermitian transpose in the MMSE detection for further complexity reduction. The product $H_k^H H_k$ in Eq. (2) makes the matrix $P_k$ Hermitian, so its off-diagonal elements are complex conjugates of each other. This causes similar symmetry in the submatrices of Eq. (6) and enables the replacements $B = C^H$ and $B' = C'^H$, where $B'$ and $C' = -E^{-1} C A^{-1}$ denote the off-diagonal blocks of the inverse. This is why the Strassen matrix inversion has lower complexity than the Cholesky decomposition with triangular matrix inversion [6]. An additional advantage of Strassen's algorithm is its simplicity: each submatrix can be computed independently without iterative operations.

The circuit structure of the matrix inversion is shown in Fig. 4. The 2x2 matrix arithmetic units, each with its own pipeline stages, are implemented in the circuit. The block floating-point scaling factors $s_1$, $s_2$, $s_3$, and $s_4$ are set to decrease the finite-wordlength errors in the 2x2 matrix multiplication and inversion units. For a 2x2 matrix inversion, we apply direct computation using the matrix inversion formula

\[ \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}, \]

where the term $(ad - bc)$ is real-valued in the MMSE detection. A Newton-Raphson method is adopted for the real reciprocal calculation.

Fig. 4. Circuit structure of the matrix inversion.

IV. IMPLEMENTATION

A. Wordlength Determination

Wordlength determination is an indispensable task for finding an appropriate trade-off between circuit area and arithmetic precision. Fixed-point and floating-point (64-bit double precision) simulations were performed to find this trade-off. The simulation results, measured as bit error rates (BERs), are shown in Fig. 5, where the wordlength of the fixed-point representation is $w$ bits. The wordlengths of the matrix multiplication and inversion are $w$ and $w+8$ bits, respectively. A 4x4 MIMO-OFDM system based on IEEE 802.11n with 64-QAM modulation and a rate-3/4 convolutional code was evaluated under a multipath fading environment. In the fixed-point implementation, we used a fractional representation (i.e., a sign bit and fractional bits) and block floating-point scaling with a 5-bit exponent. The 18-bit wordlength slightly degrades the BER performance and is considered the optimum wordlength. The 20-bit wordlength performs almost the same as the floating-point implementation.

B. Circuit Area and Power Dissipation

The 4x4 MIMO detection circuit was implemented using a 90-nm CMOS standard library from the Semiconductor Technology Academic Research Center (STARC). We used Verilog source code for the RTL design and performed logic synthesis to evaluate the circuit area and power dissipation


TABLE II
CIRCUIT AREA AND POWER DISSIPATION.

Wordlength    Area (mm^2)    No. of logic gates    Power (mW)
16 bits       6.23           1,559,400             496.2
18 bits       7.45           1,862,400             593.4
20 bits       8.81           2,203,300             701.2


TABLE III
OFDM PARAMETERS AND CIRCUIT PERFORMANCE.

Bandwidth (MHz)                      40      80      120     160
Circuit Operating Frequency (MHz)    80      80      120     160
FFT Size                             128     128     256     512
IFFT/FFT Period (µs)                 3.2     1.6     2.1     3.2
GI Length (ns)                       400     400     400     400
Pipeline Latency (ns)                375     375     250     187.5
Max. Transmission Speed (Mbps)       600     1160    1860    2630
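The pipeline latencies in Table III are consistent with the 30-cycle pipeline latency stated in Section III-A, and the two real-time conditions of Section IV-C can be checked row by row. The following Python sketch does this under the assumption that the sampling rate in MHz equals the signal bandwidth in MHz; the function and constant names are illustrative and do not come from the paper.

```python
import math

PIPELINE_CYCLES = 30  # pipeline latency in clock cycles (Section III-A)

# Rows of Table III: (bandwidth MHz, clock MHz, GI length ns, reported latency ns)
TABLE_III = [
    (40, 80, 400, 375.0),
    (80, 80, 400, 375.0),
    (120, 120, 400, 250.0),
    (160, 160, 400, 187.5),
]

def pipeline_latency_ns(clock_mhz, cycles=PIPELINE_CYCLES):
    """Latency of a fully pipelined datapath: cycles divided by clock frequency."""
    return cycles / clock_mhz * 1e3

def is_real_time(bandwidth_mhz, clock_mhz, gi_ns):
    """Both conditions of Section IV-C, assuming sampling rate == bandwidth."""
    clock_ok = clock_mhz >= bandwidth_mhz
    latency_ok = pipeline_latency_ns(clock_mhz) < gi_ns
    return clock_ok and latency_ok

for bw, clk, gi, reported in TABLE_III:
    lat = pipeline_latency_ns(clk)
    print(bw, clk, lat, math.isclose(lat, reported), is_real_time(bw, clk, gi))
```

Because the latency depends only on the clock frequency and the pipeline depth, the check does not involve the FFT size or the number of subcarriers.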

operations for all the bandwidths. As a result, the implemented circuit has the potential to provide up to 2.6 Gbps with a 4x4 MIMO-OFDM configuration and a 160-MHz signal bandwidth.

V. CONCLUSION
Fig. 6. Estimated transmission speeds for signal bandwidths of 40 to 160 MHz. [Figure not reproduced: maximum transmission speed (Mbps) versus bandwidth (MHz).]
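The estimation method behind Fig. 6 is given in [8] and is not reproduced in this paper. As a rough sketch only, the following Python function combines the parameters stated in Section IV-C (400-ns guard interval, 64-QAM, 5/6 coding rate, 4-MHz guard band, 5% pilot subcarriers, four spatial streams) in a simple way that is our own assumption: the subcarrier spacing is taken as bandwidth divided by FFT size, and fractional subcarrier counts are not rounded. Under these assumptions the result lands within about 2% of the speeds listed in Table III.

```python
def estimate_rate_mbps(bandwidth_mhz, fft_size, streams=4, bits_per_symbol=6,
                       coding_rate=5 / 6, gi_us=0.4, guard_band_mhz=4.0,
                       pilot_ratio=0.05):
    """Rough data-rate estimate; an assumed formula, not the method of [8]."""
    spacing_mhz = bandwidth_mhz / fft_size                    # subcarrier spacing
    valid = (bandwidth_mhz - guard_band_mhz) / spacing_mhz    # valid subcarriers
    data = valid * (1.0 - pilot_ratio)                        # data subcarriers
    bits_per_ofdm_symbol = streams * data * bits_per_symbol * coding_rate
    symbol_us = fft_size / bandwidth_mhz + gi_us              # FFT period plus GI
    return bits_per_ofdm_symbol / symbol_us                   # bits per µs == Mbps

# (bandwidth MHz, FFT size, speed in Mbps as listed in Table III)
TABLE_III_SPEEDS = [(40, 128, 600), (80, 128, 1160), (120, 256, 1860), (160, 512, 2630)]

for bw, fft, listed in TABLE_III_SPEEDS:
    est = estimate_rate_mbps(bw, fft)
    print(bw, fft, round(est, 1), listed)
```

The small differences from the listed values presumably come from rounding of subcarrier counts and other details of [8] that this sketch does not model.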

at the gate level. The circuit performance for the three wordlengths is listed in Table II. The maximum clock frequency is 174 MHz. The power dissipation was measured with a 1.0-V supply voltage and a 160-MHz clock frequency. The implemented circuit requires 1.6 to 2.2 million logic gates and dissipates 500 to 700 mW. The matrix inversion block occupies about 60% of the total circuit area. The pipeline latency is 187.5 ns at a clock frequency of 160 MHz (30 cycles of 6.25 ns each).

We have presented a real-time MMSE detection architecture for a 4x4 MIMO-OFDM receiver. The key concept is a complete pipelined architecture whose processing time does not depend on the number of data subcarriers, so that high transmission speeds of over 1 Gbps can be obtained. The Strassen matrix inversion has low complexity and is suitable for concurrent and pipeline processing. We implemented this algorithm in a pipelined architecture for the MMSE detection. The circuit was implemented in a 90-nm CMOS process and evaluated in terms of circuit area, power dissipation, and available transmission speed.

REFERENCES
[1] Petrus Paul, et al., "An integrated draft 802.11n compliant MIMO baseband and MAC processor," IEEE International Solid-State Circuits Conference (ISSCC), pp. 266-267, 602, Feb. 2007.
[2] Jui-Ping Lien, Po-An Chen, Tzi-Dar Chiueh, "Design of a MIMO-OFDM baseband transceiver for cognitive radio system," IEEE International Symposium on Circuits and Systems (ISCAS), pp. 4098-4101, May 2006.
[3] Johan Eilert, Di Wu, Dake Liu, "Efficient complex matrix inversion for MIMO software defined radio," IEEE International Symposium on Circuits and Systems (ISCAS), pp. 2610-2613, May 2007.
[4] A. Burg, S. Haene, D. Perels, P. Luethi, N. Felber, W. Fichtner, "Algorithm and VLSI architecture for linear MMSE detection in MIMO-OFDM systems," IEEE International Symposium on Circuits and Systems (ISCAS), pp. 4102-4105, May 2006.
[5] Isabelle LaRoche and Sebastien Roy, "An efficient regular matrix inversion circuit architecture for MIMO processing," IEEE International Symposium on Circuits and Systems (ISCAS), pp. 4819-4822, May 2006.
[6] Adrian Burian, Jarmo Takala, and Mikko Ylinen, "A fixed-point implementation of matrix inversion using Cholesky decomposition," IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), Vol. 3, pp. 1431-1434, Dec. 2003.
[7] V. Strassen, "Gaussian elimination is not optimal," Numer. Math., Vol. 13, No. 3, pp. 354-356, 1969.
[8] S. Yoshizawa, Y. Miyanaga, H. Ochi, Y. Itoh, N. Hataoka, B. Sai, N. Takayama, M. Hirata, "300-Mbps OFDM baseband transceiver for wireless LAN systems," IEEE International Symposium on Circuits and Systems (ISCAS), pp. 5455-5458, May 2006.

C. Available Transmission Speed

The proposed architecture assigns pipeline stages to all the arithmetic operations, so the processing time is not affected by the number of data subcarriers. Whether real-time operation of the MMSE detection is possible depends on two conditions. One is that the operating clock frequency is not less than the sampling rate corresponding to the signal bandwidth. The other is that the pipeline latency is shorter than the guard interval length. We estimate the available transmission speeds that satisfy these conditions when the signal bandwidth is extended. Figure 6 shows the transmission speeds of 4x4 MIMO-OFDM systems for signal bandwidths of up to 160 MHz. The estimation method for the transmission speeds was presented in our previous work [8]. We use a 400-ns guard interval, 64-QAM, a 5/6 coding rate, and a 4-MHz guard band, and set the pilot subcarriers to 5% of the valid subcarriers. The marked points indicate the FFT sizes selected as appropriate trade-offs between transmission speed and circuit scale. Table III lists the OFDM parameters and circuit performance for signal bandwidths of 40, 80, 120, and 160 MHz. The implemented MMSE detection can work at a 160-MHz operating frequency with a pipeline latency of 187.5 ns. The circuit performance is sufficient to meet the real-time