

A MIMO Decoder Accelerator for Next Generation Wireless Communications


Karim Mohammed, Member, IEEE, and Babak Daneshrad, Member, IEEE
Abstract—In this paper, we present a multi-input–multi-output (MIMO) decoder accelerator architecture that offers versatility and reprogrammability while maintaining a very high performance-cost metric. The accelerator is meant to address the MIMO decoding bottlenecks associated with the convergence of multiple high-speed wireless standards onto a single device. It is scalable in the number of antennas, bandwidth, modulation format, and most importantly, present and emerging decoder algorithms. It features a Harvard-like architecture with complex vector operands and a deeply pipelined fixed-point complex arithmetic processing unit. When implemented on a Xilinx Virtex-4 LX200FF1513 field-programmable gate array (FPGA), the design occupied 43% of overall FPGA resources. The accelerator shows an advantage of up to three orders of magnitude (1000 times) in power-delay product for typical MIMO decoding operations relative to a general purpose DSP. When compared to dedicated application-specific IC (ASIC) implementations of MMSE MIMO decoders, the accelerator showed a degradation of 340%–17%, depending on the actual ASIC being considered. In order to optimize the design for both speed and area, specific challenges had to be overcome. These include: definition of the processing units and their interconnection; proper dynamic scaling of the signal; and memory partitioning and parallelism.

Index Terms—Application-specific processor, multi-input–multi-output (MIMO), orthogonal frequency-division multiplexing (OFDM), software-defined radio.

I. INTRODUCTION

TODAY, two prominent trends in wireless communication systems are the use of multi-input–multi-output (MIMO) processing and orthogonal frequency-division multiplexing (OFDM). MIMO-OFDM techniques improve data rate and reliability [1]. As a result, many current and future wireless standards such as 802.11n WiFi, WiMax, and 3G long-term evolution (LTE) leverage MIMO-OFDM to deliver on the user requirements. Additionally, all trends point to the convergence of all such wireless standards on a single platform such as a personal digital assistant (PDA) or a smartphone. This motivates an accelerator-like approach to efficiently deliver on the computation-intensive elements of the system. The MIMO decoder is one such component.

Manuscript received October 23, 2008; revised March 09, 2009. First published September 01, 2009; current version published October 27, 2010. K. Mohammed was with the University of California, Los Angeles, CA 90095 USA. He is now with Cairo University, Giza 12613, Egypt (e-mail: kabbas@ee.ucla.edu). B. Daneshrad is with the University of California, Los Angeles, CA 90095 USA (e-mail: babak@ee.ucla.edu). Digital Object Identifier 10.1109/TVLSI.2009.2025590

MIMO processing is computationally intensive due to the need to invert a channel matrix with very low latency. Moreover, over time, systems are expected to incorporate a higher number of antennas and more advanced algorithms. Analogous to the use of the Viterbi accelerator engines [2], [3] in today's cellular systems, a MIMO decoder accelerator that is programmable in bandwidth, number of antennas, decoder algorithm, and modulation format will greatly facilitate the adoption of multistandard, MIMO-based solutions. Such an accelerator engine could also greatly accelerate the adoption of MIMO communications on software-defined radio (SDR) and cognitive radio (CR) based platforms that are mostly found in the research and advanced development environments today [4]–[6].

MIMO decoding is essentially an inversion of a complex matrix channel. This can be achieved by using a variety of algorithms with a range of complexity and performance. The choice of algorithm and antenna configuration depends on the expected channel conditions, power budget, available resources, and throughput requirements. Even for relatively simple MIMO decoding algorithms, the MIMO decoder is among the most complicated blocks in the transceiver. For example, in a 4 × 4 802.11n transceiver, an MMSE decoder can easily occupy as much area as the rest of the digital baseband [7]. In addition to the high resource requirements, MIMO decoders also require a long design cycle if they are to be optimized to the target platform.

Traditionally, matrix inversion is simplified by using one of a number of matrix decompositions to transform the channel matrix into a more invertible form [8]. The decomposition usually involves regular arrays (systolic arrays) of processing elements, often coordinate rotation digital computer (CORDIC) processors [9]. QR decomposition leading to an MMSE solution is the traditional approach, but singular value decomposition (SVD) is also efficiently implemented on systolic arrays [10]. Recursive algorithms have also been implemented using high-efficiency arrays; in [11], an LMS-based solution is implemented using a systolic array. Systolic arrays deliver a quick, efficient implementation of simple algorithms such as MMSE, but they do not offer an easy tradeoff in cost/performance. A Gram–Schmidt-based QR decomposition on a Xilinx Virtex-4 delivers a high-throughput, low-slice-count 4 × 4 802.11n-compliant MMSE solution in [7] by paying acute attention to the dynamic-range requirements and by aggressive utilization and time sharing of system resources. SVD is implemented in [12] using a power-optimized data path delivering superior performance relative to most SVD application-specific IC (ASIC) solutions [8]–[10]. Programmable solutions have been proposed for matrix decompositions.


A programmable solution that uses a compact time-shared processing unit with fixed-point optimizations is presented in [13]. The architecture supports 4 × 4 QR decomposition and SVD.

In this paper, we present a MIMO decoder accelerator architecture. The accelerator allows the programmer to easily define and implement MIMO decoders at will. It is intended to replace or accelerate the performance of MIMO decoders on programmable devices or system-on-chip (SoC) solutions. The accelerator has a processor-like architecture with most of the controls derived from a memory-stored program. The processing core is designed to support a range of complex operations necessary to enable the realization of major MIMO decoding algorithms. This architecture does not benefit from the regular, application-specific flow of regular arrays, nor can it rely on platform- or technology-specific optimizations as a main driver of high performance. The MIMO accelerator departs radically from a conventional processor in several areas, which deliver an improvement in performance over general purpose processors reaching three orders of magnitude. The accelerator core accepts very wide complex matrix operands and produces complex matrix results. The high access rate required to support this is made possible by a memory map that exploits the matrix/vector nature of the operands in MIMO decoding. The memory map is augmented by sorting circuits at the inputs and outputs of memory that allow the programmer to redefine input and output orders without using extra processing cycles. The processing cycle uses properties of OFDM decoding to optimize its flow, and through the use of predecoded instructions and proper compiler positioning of critical control signals, the accelerator ensures that the processing pipeline is continuously engaged. A programmable dynamic scaling circuit automatically handles intermediate wordlength issues for high dynamic range operations. This allows us to use fixed-point processing units, which substantially increases the performance of the processing pipeline over a floating-point implementation. With these optimizations in place, the accelerator penalty (in terms of the ratio of resource-delay or PDP) relative to optimized ASIC and field-programmable gate array (FPGA) implementations is less than an order of magnitude for most implementations. On the other hand, when compared to TI DSP 6416 processors, the PDP ratio for a number of complex arithmetic test benches was always over two orders of magnitude, and often above three orders of magnitude.

In Section II, we briefly introduce major MIMO decoding algorithms and derive a set of primitive processing operations. These primitives form the least common set of processing operations needed to realize all MIMO decoding algorithms. Section III takes the primitives derived in Section II and describes the architectures used to implement them, showing optimizations in arrangement and interconnection, as well as introducing a multiple-cycle dynamic scaling circuit used to maintain bus widths in the overall architecture within reasonable bounds. Section IV discusses memory access, detailing the memory map and associated sorting circuits that allow the programmer to support the high rate of the architecture in Section III and to efficiently define and access data in terms of complex matrix operands.

Finally, in Section V, we compare the accelerator to a number of FPGA, ASIC, and DSP implementations using slice count, cycle count, and PDP as comparison metrics.

II. MINIMUM OPERATION SET

The literature is rich with alternative algorithms for MIMO decoding [7]–[12]. Our approach to support these algorithms is to try and identify the set of primitive processing elements that form the basis of major MIMO decoding algorithms. With such a set in hand, the realization of a specific decoder algorithm will translate into the proper sequencing of data among these primitive elements.

A MIMO system with $N_t$ transmitters and $N_r$ receivers can be modeled as

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n} \qquad (1)$$

where $\mathbf{y}$ is the observation vector, $\mathbf{x}$ is the transmitted vector, and $\mathbf{n}$ is the additive Gaussian noise. $\mathbf{H}$ is the channel matrix, where each element of the matrix is Rayleigh faded. MIMO decoding involves extracting an estimated vector $\hat{\mathbf{x}}$ from the observation vector.

The optimal MIMO decoding algorithm is exhaustive-search maximum-likelihood (ML) decoding. In this algorithm, the geometric distance between the observation vector and a distorted version of every candidate transmit vector is measured, and the nearest candidate is chosen. The high complexity of an exhaustive search ML can be managed through the sphere decoding algorithm (SD). SD can potentially reduce the complexity of exhaustive search by arranging the search so that more likely points are examined first and by pruning large subsets of points early in the search process [14]. With added constraints, SD can be suboptimal relative to an exhaustive search.

Linear decoding algorithms are much simpler than ML, but are also suboptimal. These algorithms calculate a weight matrix, often an implicit or explicit inverse of some expression of the channel matrix. This weight is multiplied by the observation to obtain an estimate. The most common linear decoding algorithm is MMSE, where the estimate is calculated as

$$\hat{\mathbf{x}} = \left(\mathbf{H}^H\mathbf{H} + \mathrm{SNR}^{-1}\mathbf{I}\right)^{-1}\mathbf{H}^H\mathbf{y}. \qquad (2)$$
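As a concrete reference for (1) and (2), the following NumPy sketch (an illustration only, not the accelerator's implementation; the matrix size and SNR value are arbitrary assumptions) simulates a 4 × 4 Rayleigh channel and recovers the transmitted vector with an MMSE weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
Nt = Nr = 4                      # 4 x 4 antenna configuration (assumed)
snr = 100.0                      # linear SNR, arbitrary for illustration

# Rayleigh-faded channel: i.i.d. complex Gaussian entries
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
x = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], Nt)   # QPSK symbols
n = (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr)) * np.sqrt(0.5 / snr)
y = H @ x + n                    # observation vector, (1)

# MMSE estimate, (2): W = (H^H H + SNR^-1 I)^-1 H^H
W = np.linalg.inv(H.conj().T @ H + np.eye(Nt) / snr) @ H.conj().T
x_hat = W @ y
print(np.round(x_hat, 2))        # close to x at high SNR
```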

MMSE can also be implemented in a variant form, called square-root MMSE, where inversion is performed implicitly on a factor of the MMSE expression [7]. Inverting the MMSE expression directly is untenable in hardware; therefore, a matrix decomposition is used to reduce it to a more manageable form. QR decomposition is a typical choice in this case because of its well-behaved dynamic performance, the usefulness of the resultant matrices, and its amenability to parallel processing [8]. This method of explicit matrix inversion is also useful for algorithms other than MMSE; for example, it is used for a recursive least-squares algorithm in [11].
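To illustrate why the decomposition makes the inversion manageable, the sketch below (an illustration under the same assumptions as before, using a common augmented-matrix formulation rather than the specific factorization of [7]) solves the MMSE expression through a QR decomposition and two triangular solves; no explicit inverse is ever formed:

```python
import numpy as np

rng = np.random.default_rng(1)
H = (rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))) / np.sqrt(2)
y = rng.standard_normal(4) + 1j * rng.standard_normal(4)
snr = 100.0

# Augmented-matrix form: QR of [H; SNR^-1/2 I] yields R with
# R^H R = H^H H + SNR^-1 I, so the MMSE matrix is factored, not inverted.
A = np.vstack([H, np.eye(4) / np.sqrt(snr)])
Q, R = np.linalg.qr(A)

# Two triangular solves replace the inverse (hardware would back-substitute)
z = np.linalg.solve(R.conj().T, H.conj().T @ y)   # forward substitution
x_hat = np.linalg.solve(R, z)                     # back substitution
```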


SVD decomposes a matrix into two unitary matrices and a diagonal matrix of singular values. The decomposition is typically performed using the Jacobi algorithm [8]. Jacobi uses a series of unitary transformations to reduce the matrix to its singular values. Two-sided complex Jacobi on a 2 × 2 matrix starts by transforming the matrix to pure real. The first step performs a complex Givens rotation to null element (2,1):

$$\begin{bmatrix} \cos\theta & \sin\theta\, e^{-j\phi} \\ -\sin\theta\, e^{j\phi} & \cos\theta \end{bmatrix} \begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix} = \begin{bmatrix} m'_{11} & m'_{12} \\ 0 & m'_{22} \end{bmatrix}. \qquad (3)$$
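A minimal NumPy sketch of such a complex Givens rotation (a generic textbook construction, not the paper's CORDIC realization) shows the nulling of the (2,1) element; the leading global phase plays the role of the simple rotation that makes the leading element real:

```python
import numpy as np

def complex_givens(a, b):
    """Unitary G with G @ [a, b] = [r, 0]; the global phase factor makes the
    leading element real first, as the super CORDIC assumes."""
    r = np.hypot(abs(a), abs(b))
    c, s = abs(a) / r, abs(b) / r
    return np.exp(-1j * np.angle(a)) * np.array(
        [[c, s * np.exp(1j * (np.angle(a) - np.angle(b)))],
         [-s * np.exp(1j * (np.angle(b) - np.angle(a))), c]])

M = np.array([[1 + 2j, 3 - 1j], [2 - 1j, 1 + 1j]])
G = complex_givens(M[0, 0], M[1, 0])
print(np.round(G @ M, 10))        # element (2,1) is now 0, (1,1) is real
```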

TABLE I PRIMITIVE OPERATIONS

A unitary transformation can remove the phase of element (2,2). Another transformation that exchanges the phases of the off-diagonal elements yields a pure real matrix:

$$\begin{bmatrix} e^{j\theta_1} & 0 \\ 0 & e^{j\theta_2} \end{bmatrix} \begin{bmatrix} m'_{11} & m'_{12} \\ 0 & m'_{22} \end{bmatrix} \begin{bmatrix} e^{j\theta_3} & 0 \\ 0 & e^{j\theta_4} \end{bmatrix} = \mathbf{M}_{\mathbb{R}} \qquad (4)$$

with the phases chosen so that all elements of $\mathbf{M}_{\mathbb{R}}$ are real.

Real-valued Jacobi can then be used to decompose the matrix by applying a two-sided unitary transformation, leaving only the singular values:

$$\begin{bmatrix} \cos\theta_l & \sin\theta_l \\ -\sin\theta_l & \cos\theta_l \end{bmatrix}^T \mathbf{M}_{\mathbb{R}} \begin{bmatrix} \cos\theta_r & \sin\theta_r \\ -\sin\theta_r & \cos\theta_r \end{bmatrix} = \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix}. \qquad (5)$$

This can be extended to larger matrices, but multiple iterations are required to null all off-diagonal elements. Although SVD has multiple potential implementations, Jacobi diagonalization is highly suited for hardware implementation.

The algorithms discussed above share a number of common operations; the order of operations and the nature of the operands set them apart. For example, complex matrix multiplication is used in square-root MMSE, while matrix-vector multiplication and dot products of various sizes are used in all algorithms, either as part of the inversion (decomposition) or in the decoding steps. ML and SD are simplified by a matrix decomposition in the initial stage to facilitate the search phase. The search phase of both algorithms requires dedicated hardware, especially in high-throughput systems [15], but the calculation of metrics can be stated in matrix form and can thus benefit from a matrix processing accelerator.

Matrix decomposition is critical to all the decoding algorithms discussed above. Although there are alternative decompositions for some algorithms, QR decomposition is the most practical method for hardware application. In some decoding algorithms, the target matrix may possess special properties that allow simplification [7], but in the accelerator, we have to provide support for QR decomposition of a general target matrix. QR decomposition can be done using several methods; for example, in [7], Gram–Schmidt orthogonalization is used efficiently to such an end. However, Givens rotations are commonly used because they are well suited for hardware implementation using CORDIC [8]. The diagonal exchange in (4) and the Jacobi diagonalization in (5) reinforce the utility of Givens rotations in algorithms that do not use QR decomposition.
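Combining (3)–(5), a compact NumPy sketch (again a generic illustration; the accelerator realizes these steps with CORDIC hardware, and the angle formulas below are one standard choice) diagonalizes a 2 × 2 complex matrix:

```python
import numpy as np

def jacobi_svd_2x2(M):
    """Singular values of a 2x2 complex matrix via the steps in (3)-(5)."""
    # (3) complex Givens rotation nulling element (2,1)
    a, b = M[0, 0], M[1, 0]
    r = np.hypot(abs(a), abs(b))
    c, s = abs(a) / r, abs(b) / r
    G = np.exp(-1j * np.angle(a)) * np.array(
        [[c, s * np.exp(1j * (np.angle(a) - np.angle(b)))],
         [-s * np.exp(1j * (np.angle(b) - np.angle(a))), c]])
    T = G @ M                                   # T[1,0] = 0, T[0,0] real
    # (4) diagonal phase transformations making the matrix pure real
    Dl = np.diag([1, np.exp(1j * (np.angle(T[0, 1]) - np.angle(T[1, 1])))])
    Dr = np.diag([1, np.exp(-1j * np.angle(T[0, 1]))])
    T = (Dl @ T @ Dr).real
    # (5) real two-sided Jacobi rotation leaving only the singular values
    f, g, h = T[0, 0], T[0, 1], T[1, 1]
    theta_r = 0.5 * np.arctan2(2 * f * g, f**2 - g**2 - h**2)
    Jr = np.array([[np.cos(theta_r), -np.sin(theta_r)],
                   [np.sin(theta_r),  np.cos(theta_r)]])
    BV = T @ Jr                                 # columns are now orthogonal
    theta_l = np.arctan2(BV[1, 0], BV[0, 0])
    Jl = np.array([[np.cos(theta_l), -np.sin(theta_l)],
                   [np.sin(theta_l),  np.cos(theta_l)]])
    return Jl.T @ BV                            # diagonal up to sign/order

M = np.array([[1 + 2j, 3 - 1j], [2 - 1j, 1 + 1j]])
print(np.round(jacobi_svd_2x2(M), 6))
print(np.linalg.svd(M, compute_uv=False))       # magnitudes should match
```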

Table I lists the primitive operations common to the decoding algorithms discussed above. These primitives are derived from particular realizations of the decompositions, but inspection of the accelerator will show that it can support alternatives. The accelerator has to give full flexibility to the operations in Table I, for example, allowing multiplication of any combination of matrix and vector operands of any size, or supporting left and right unitary transformations without any loss in performance. Table II details the operations necessary for four MIMO decoding algorithms. Although the operations fall neatly into a few arithmetic categories (listed in Table I and detailed in Section III), the flexibility of the operands discussed earlier introduces many possible suboperations. The challenge this introduces to memory access is discussed in detail in Section IV.

III. ACCELERATOR PROCESSING UNIT

The MIMO-accelerator processing unit (Fig. 1) consists of four cores: a vector multiplication core, a scalar division core, a CORDIC (coordinate rotation) core, and a vector addition core. Although the design is scalable, this paper discusses results for a realization optimized for 4 × 4 or smaller matrices. Column operations on matrices with more than four rows, or row operations on matrices with more than four columns, can still be performed but require multiple instructions per operation, whereas all smaller matrices require a single instruction. The cores are designed for full coverage of the operations in Table I and all their realizations in Table II. Efficient realization of the operations is contingent on a memory access scheme that allows single-cycle access to multiple elements of stored matrices, rearranged at will into rows, columns, diagonals, submatrices, etc. The cores are designed to produce at least one 4 × 1 output vector per processor cycle.


TABLE II COMPONENT OPERATIONS OF MIMO DECODING ALGORITHMS

Fig. 1. Overall architecture of the accelerator: processing core, data memory, and input and output sorting circuits.

M = matrix, CV = column vector, RV = row vector, LL = low latency, LS = left side, RS = right side, R = row, Comp = complex, re = real, D = diagonal, Xph = externally provided phase, SubM = submatrix, Giv = Givens rotation, recip = reciprocation, vect = vectoring, and rot = rotation.
The rotation core by necessity generates a vector pair; therefore, the processor output is expressed as a pair of 4 × 1 vectors even for cores that produce a single vector output. The addition core is a set of eight complex adders capable of performing two 4 × 1 vector additions per cycle. The division core consists of four dividers, supporting one 4 × 1 vector scaling per cycle. The multiplication core consists of four complex dot product units that can perform four 4 × 1 dot products per cycle, producing one 4 × 1 vector output. A full 4 × 4 matrix multiplication can be performed in four cycles. Every clock cycle, the processor accepts 32 complex inputs (N1–N32 in Fig. 1); this is the number of inputs necessary for the multiplication core to realize four distinct dot products. The inputs are divided into two sets/operands.
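The four-cycle count follows directly from the dot-product arrangement; the sketch below (a behavioral model, not the hardware) shows a 4 × 4 matrix multiplication scheduled as four cycles of four parallel 4 × 1 dot products:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

C = np.zeros((4, 4), dtype=complex)
for cycle in range(4):                 # one 4 x 1 output column per cycle
    for unit in range(4):              # four dot-product units in parallel,
        C[unit, cycle] = np.dot(A[unit, :], B[:, cycle])  # 32 operands consumed

assert np.allclose(C, A @ B)
```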

A. Coordinate Rotation Core

The coordinate rotation core supports (6)–(8) in Table I. Normally, CORDIC processors [8] are used in unitary transformations due to their good dynamic-range properties and the fact that they can be implemented without any multipliers. An exception to this rule is when the target hardware platform is an FPGA with dedicated multiplier resources; in this case, some authors [7] have used a Gram–Schmidt-based approach that does not leverage CORDICs. Given that the proposed accelerator will most likely be implemented in an ASIC flow, we will use the CORDIC for implementing the rotation engine. CORDIC performs Cartesian-to-polar or polar-to-Cartesian transformation using adders and shifters. CORDIC units can be combined to perform two angle transformations on complex coordinate pairs. We use a compact realization of complex CORDIC, sometimes referred to as a super CORDIC. The super CORDIC assumes a pure real leading element in the vectoring pair, but since a simple rotation can make any element of the matrix real, this causes no loss in generality. The traditional super CORDIC units [11] realize (6) directly. The circuits in Fig. 2 are identical to this realization when the multiplexers are set to mode 0. The realization in Fig. 2 is modified to readily support both (7) and (8). Equation (7) is two independent real rotations. The multiplexers in Fig. 2 can realize (7) by rerouting the real and imaginary components of the input vectors as inputs to individual CORDIC units. In fact, since the two component CORDICs of the vectoring unit and two of the three in the rotation units are completely decoupled, (7) can be extended to cases where the two rotation terms in the rotation matrix have different phases, or to process multiple real vector inputs simultaneously. Equation (8) can be realized by recognizing that it is identical to (6) if one of the rotation phases is bypassed and the sense of left-side and right-side rotations is reversed. Again, the multiplexers can be used to reroute the inputs to achieve this. Thus, the modified super CORDIC architecture in Fig. 2 supports the rotation operations listed in Table I while providing additional flexibility in phase distinctions with the negligible cost of the added multiplexers.

The coordinate rotation core is derived from the traditional systolic array architecture [8]–[11]. Systolic arrays provide very high performance, and past implementations have realized a full parallel array of systolic units.


Fig. 2. Building blocks of the rotation core: (left) multimode super vectoring (translation) CORDIC and (right) multimode super rotation CORDIC. Mode 0 realizes complex Givens rotation, mode 1 real Givens rotation, and mode 2 single-phase rotation of complex vectors (e.g., Jacobi diagonalization).

This would be overkill for our application, as it would deliver throughput far beyond that required by current or future wireless systems. It also has the additional drawback of reducing the flexibility and programmability of the overall accelerator. The rotation core is therefore arranged in a collapsed or linear array (see Fig. 3). The collapsed array delivers vector-pair outputs; since decompositions can easily be broken down into vector-pair operations, this greatly simplifies programming. The rotation core in Fig. 3 consists of four super rotation (SR) CORDIC processors and one super vectoring (SV) CORDIC. The SV CORDICs generate two phases based on their complex inputs. Each phase is generated from a bit vector derived from the inputs, indicating the direction of microrotation at every step of the process. Conventionally, every rotation CORDIC has a phase interpretation unit that accepts the input phase in radians from a vectoring CORDIC and translates it back into a series of directions for microrotations [8], [11]. The phase interpretation components of a rotation CORDIC contain relatively costly trigonometric lookup tables and a set of adders. In the accelerator, all SR units and therefore all rotation CORDICs operate on the same vector pair operand per cycle, thus using identical phases (one of the two phases generated by the vectoring CORDIC). We designed rotation CORDICs without phase interpretation components, instead using a common pair of phase processing units (PP in Fig. 3) that do the job for all 12 rotation CORDICs. This reduces the total resources of the rotation core by 8.7%. The phase processing units contain the phase-to-direction interpretation units traditionally found in rotation CORDICs, thus allowing them to convert phase values in radians to microrotation direction vectors. They can also serialize the phase encoded as microrotations directly from the vectoring CORDIC. Basic arithmetic operations can be performed on the phases in the phase processing unit without feeding phase values back to memory. As discussed earlier, this realization of the accelerator can natively handle inputs with dimensions up to 4.
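To make the microrotation-direction encoding concrete, the following sketch (a generic fixed-iteration CORDIC model, assuming 16 iterations and ignoring the hardware's fixed-point details) shows a vectoring pass producing direction bits and a rotation pass consuming them directly, with no phase in radians ever materialized:

```python
import numpy as np

ITER = 16                        # number of microrotations (an assumption)
K = np.prod(1 / np.sqrt(1 + 2.0 ** (-2 * np.arange(ITER))))  # CORDIC gain

def vectoring(x, y):
    """Rotate (x, y) onto the x-axis; emit direction bits instead of a phase.
    Assumes x > 0, matching the pure-real leading element of the SV unit."""
    dirs = []
    for i in range(ITER):
        d = 1 if y < 0 else -1                 # microrotation direction
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
        dirs.append(d)
    return x * K, dirs

def rotation(x, y, dirs):
    """Apply the same microrotation sequence to another coordinate pair."""
    for i, d in enumerate(dirs):
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
    return x * K, y * K

r, bits = vectoring(3.0, 4.0)    # r ~ 5; bits encode the angle -atan2(4, 3)
print(rotation(0.0, 1.0, bits))  # ~ (0.8, 0.6): rotated by the same angle
```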

Fig. 3. Rotation core with phase processing unit expanded.

The rotation core accepts two vectors of up to 5 × 1 size. However, only four of the input and output pairs are significant in any clock cycle. In most cases, the rotation core rotates a pair of leading elements in the SV unit, reducing one to null and calculating the phases used to achieve this result (a process known as vectoring), and identically rotates the three remaining pairs from the 4 × 1 vector input pair in the three SR units. In (3), for example, this involves calculating θ and φ and performing the rotation in a single cycle. In other cases, however, the core is required to rotate all four input pairs not by a phase derived from a leading element, but by a phase stored in the phase processing unit or externally generated in the accelerator. This is true, for example, when rotating the unitary factor in the QR decomposition or in the real diagonalization step of the complex two-sided Jacobi. In this case, multiplexers in the phase processor bypass the phases of the SV; thus, only the outputs of the SRs are significant.


Fig. 5. SQNR for a series of matrix multiplications. S = static truncation, D = dynamic scaling, 1M = 1 multiplication, 2M = 2 multiplications, etc.

Fig. 4. Dynamic scaling unit.

As shown in Fig. 3, the rotation core outputs are multiplexed between these two cases, resulting in a four-element coordinate rotation output. Although this is relatively simple, it has a significant impact in light of the memory access scheme and the outputs of the remaining processing units.

B. High Dynamic Range Processing

The accelerator must support multiple algorithms and allow modifications and manipulations at will. Part of this is to provide the programmer with reasonable freedom in the number of operations that can be performed before the processor overflows. Traditionally, a fixed-point simulation is needed before a hardware implementation is considered. The simulation helps quantify the impact of fixed-point effects and the choice of intermediate wordlength values. The accelerator cannot benefit from this approach. In the accelerator, an algorithm is implemented by repeatedly passing operands through the processing unit (an arbitrary number of times); therefore, no useful predictions can be made about appropriate intermediate precision requirements. Use of a very wide wordlength would cause the size of the accelerator to grow rapidly. This is exacerbated by the matrix/vector nature of the processing units and the fact that all operands are complex numbers. Coordinate rotation, as performed by CORDIC, has a relatively stable wordlength. The addition operation also has very slow growth. Multiplication and division, however, can result in very fast growth in the wordlength requirement: complex vector multiplication effectively doubles the wordlength of the inputs. We use a dynamic scaling circuit to manage the precision of the multiplier and divider outputs. The circuit efficiently handles vector and matrix inputs of variable lengths over a variable number of cycles (corresponding to variable matrix sizes), as shown in Fig. 4. It accepts a vector input of size 4 (the size of the output of the multiplication and division cores). Each vector element is a complex number of W + E bits per rail, where W is the native precision of the processor.

An initial most-significant-bit (MSB) mask is estimated from the inputs by passing them through the four-input OR gate bank. The bank performs an OR operation between corresponding bits of each element of the 4 × 1 input vector. This is performed for all E most significant bits in excess of the native precision. The result contains information on the active bits in the largest member of the output vector. This mask is then held for one cycle in register R1 and further ORed through the two-input OR bank. The inputs to the two-input OR bank are the old (accumulative) mask in R2 and the corresponding bits from the mask in R1 resulting from a new input vector. R2 thus holds information on active bits from multiple cycles. The final result in R2 is then passed to a scaling value circuit that extracts a shift value from the result. Signals are routed to a programmable shifter, where the shift value is held for a period appropriate for the matrix size while results are scaled back to W significant bits per rail. dyn_clear and dyn_hold provide primary control over the size of the matrix being scaled. dyn_clear resets the value of R1 to zero and R2 to R1, thus starting a new multiple-cycle sequence. dyn_hold disables registering new values into the scaling value circuit from R2, thus controlling the number of cycles over which the scaling value is considered valid. The dynamic scaling circuit provides programmable scaling while maintaining the throughput of the processor. The additional latency is absorbed since the latency of the CORDIC core is higher than the combined latencies of the multipliers and the dynamic scaling circuit.

Fig. 5 shows the signal-to-quantization-noise ratio (SQNR) for a matrix multiplier whose outputs are dynamically scaled using the circuit in Fig. 4. The outputs are always truncated to W bits, and the independent axis shows the pretruncation wordlength at the output of the multipliers. Another set of curves shows results for a circuit that always stores the highest W MSBs. The quantization noise is measured relative to a noiseless result. The curves show that for one matrix multiplication, the difference in performance between the two is not dramatic. However, when more matrix multiplications are run successively, the dynamically scaled circuit preserves SQNR above 50 dB, while the constantly scaled outputs quickly deteriorate.
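A behavioral model of this mask-and-shift scheme (an illustration of the mechanism only, assuming W = 16 native bits; the R1/R2 register pipeline of Fig. 4 is collapsed into one loop) might look as follows:

```python
import numpy as np

W = 16                           # native bits per rail (assumed)

def dynamic_scale(vectors):
    """Scale a multi-cycle stream of 4-element integer vectors back to W bits,
    using one shift value derived from the largest magnitude seen (cf. Fig. 4)."""
    mask = 0
    for v in vectors:            # OR the excess MSBs across elements and cycles
        for x in v:              # R1/R2 accumulation collapsed into one loop
            mask |= abs(int(x)) >> (W - 1)
    shift = mask.bit_length()    # scaling value: highest active excess bit
    return [[int(x) >> shift for x in v] for v in vectors], shift

# Two cycles' worth of multiplier outputs (magnitudes exceed 2^(W-1))
outs = [np.array([9_000_000, -1_200_000, 40_000, 7]),
        np.array([-3_500_000, 800_000, 65_000, -2])]
scaled, shift = dynamic_scale(outs)
print(shift, scaled[0])          # all results now fit in W-bit rails
```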


Fig. 6. BER curves for a ZF solution realized with constant-scaled and dynamic-scaled matrix multiplications.

Fig. 7. Data memory map.

Fig. 6 shows simulation results for a zero-forcing decoder, where fixed-point modeling is limited to the effect of the multipliers. To operate within 2 dB SNR of floating-point performance, a circuit with static scaling needs 27-bit multipliers, while a circuit using dynamic scaling needs only 16 bits per rail.

IV. MEMORY ACCESS

A. Memory Partitioning, Addressing, and Vector Operands

The processing core is designed to accept complex matrix or vector inputs. The efficiency of the processor is contingent on a memory access scheme that allows access to any combination of matrix elements simultaneously. The algorithms discussed in Section II require access to row vectors, column vectors, whole matrices, submatrices, and diagonals in a single cycle at will. To distinguish between different antenna configurations, the programmer needs to be able to define how, where, and which results are stored back to memory. Memory read/write operations must be performed as fast as possible, since the processing cores are deeply pipelined and provide a new output every processor cycle. Read and write operations can be simultaneously supported through the use of dual-port memories, but the randomness of access to matrix elements and the large number of operands needed every cycle require more than multiple-port memories. As shown in Fig. 1, the processing core has 32 complex operand inputs to accept vector or matrix inputs from matrices in data memory. By inspection of the algorithms discussed in Section II, the programmer may need up to four 4 × 4 matrices to store intermediate results or observation vectors while processing the 4 × 4 channel matrix. This corresponds to three matrices needed for the factors of SVD and one matrix for observation vectors and results. If all four matrices are stored in a single block of memory and we rely on the memory address to access elements, a serious bottleneck is created at the data bus. For example, for a matrix-vector multiplication of a 4 × 4 matrix and four 4 × 1 vectors, the processor has to wait 32 cycles for all inputs to be registered. The processor also provides a new output vector (of up to eight elements) every cycle, which means that the processor actually has to wait 40 cycles. Our tests with Virtex-4 block RAM and TSMC 65-nm register-based memories show that the memory can be clocked twice as fast as the processor.

So, if a flat single-port memory is used, the processor will have to be underclocked by a factor of 20, reducing performance significantly. MIMO-OFDM systems employ OFDM, in which a wideband channel is divided into a number of narrowband subchannels (subcarriers), each of which can be treated as a flat channel. In OFDM, all subcarriers are processed identically and independently, so data for all subcarriers must be stored and decoded. Thus, data memory does not contain only the four matrices discussed above, but a multiple of the number of subcarriers (64 or 128 in 802.11n, and higher for other standards). The size of memory in this case justifies splitting it into multiple blocks. With multiple ports and decoders, the processor can be clocked at a faster rate. If splitting is taken to the extent that each of the 64 elements of the four 4 × 4 matrices occupies an independent block, the processor can be clocked at its full potential. This memory map is shown in Fig. 7. The four conceptual 4 × 4 matrices are labeled A, B, C, and D, respectively, and the 64 independent memory blocks (not all shown) are each as deep as the number of subcarriers.

Exchanging a single memory block for 64 elements with independent address decoders introduces some challenges. Each memory location now needs a pair of indexes to locate it: one to indicate its memory block (e.g., A23) and one to index its depth, namely the subcarrier. The latter in particular can be prohibitive, needing either a very long instruction (nearly 900 additional instruction bits to support 128 subcarriers) or a very complicated address decoder. However, although the processor accepts a large number of complex inputs every cycle, all elements of all operands come from the same depth (subcarrier), regardless of which of the 64 memory blocks they come from. This is because in any cycle, a single subcarrier is being processed. So, even though the elements are stored in independent memories, they do not need independent memory addresses; they all share the same subcarrier address in any given cycle. As shown in Fig. 8, the address is provided directly by the controller as derived from relatively simple matrix index logic. Addresses are multiplexed between read and write indexes, usually offset by the latency of the processing units.


Fig. 8. Section of data memory showing control and data paths.

Fig. 9. Single-port RAM equivalent circuit.

Write enables are multiplexed between a null word and values provided from the controller (and in turn from the instruction). Since the memory can easily be clocked twice as fast as the processing unit critical path, this allows multiple one-port memories to read and write a whole 4 × 4 or smaller matrix every processor cycle. With the aforementioned in mind, the memory map shown in Fig. 7 needs to be reconsidered. The identical nature of the addresses to all memory blocks means that independent address decoders are not necessary. The only control port distinguishing memory blocks is the write enable during the write phase of the processor cycle. Fig. 9 shows how the memory can be conceptually viewed: one very wide single-port memory (64 × 32 bits wide in a 16-bit engine), where each memory location contains all elements of all four matrices for a certain subcarrier. Memory addresses do not need to be distinguished, but write enables need to be independently managed for different segments of the word representing different matrix elements. The true additional cost of this memory map reduces to the additional logic used to activate writing; this cost is negligible relative to the memory itself.
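A behavioral model of this shared-address, segmented-write-enable memory (a sketch of the concept in Fig. 9, with 64 subcarriers assumed; the slot layout chosen for matrix B is an invented example) is shown below. One subcarrier address serves all 64 element slots, and per-slot write enables select which matrix elements are updated:

```python
import numpy as np

N_SUBCARRIERS = 64                       # assumed memory depth
# 64 element slots per subcarrier: matrices A, B, C, D of 16 elements each
mem = np.zeros((N_SUBCARRIERS, 64), dtype=complex)

def mem_cycle(subcarrier, write_enable, write_data):
    """One processor cycle: read all 64 slots at one shared subcarrier
    address, then write back only the slots whose write enable is set."""
    read_word = mem[subcarrier].copy()   # read phase (memory runs at 2x clock)
    for slot in range(64):               # write phase with per-slot enables
        if write_enable[slot]:
            mem[subcarrier, slot] = write_data[slot]
    return read_word

# Example: write a 4 x 1 result into one column of matrix B
# (slots 16-31 assigned to B, stride 4 per row -- an assumed layout)
we = np.zeros(64, dtype=bool)
data = np.zeros(64, dtype=complex)
we[16:32:4] = True
data[16:32:4] = [1 + 1j, 2, 3 - 1j, 4j]
mem_cycle(subcarrier=5, write_enable=we, write_data=data)
```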

B. Sorting Circuits and Addressing

Data memory provides access to all elements of a matrix in a fixed, predetermined manner. The processing unit inputs are also fixed; for example, the multiplication core multiplies all elements in input vector 1 with the exact corresponding elements of input vector 2, and the coordinate rotation core always considers the first element to be the vectoring element. For the accelerator to support all operations in Table II and begin to extend beyond them, significant flexibility needs to be introduced. To define which matrices or vectors are multiplied, and the direction and target of coordinate rotations, the programmer has to be able to map the outputs of the data memory freely to the inputs of the processor. Many operations also require freedom in mapping the outputs of the processor to the inputs of memory. Generally speaking, we need to be able to map any of the 64 memory output ports (Fig. 7) to any of the 32 input ports of the processor (N1–N32 in Fig. 1), and any of the eight output ports of the processor to any of the 64 input ports of the memory (M1–M32 in Fig. 1). This is the function of the memory input and the processor input sorting units.

The sorting circuits proved to be two of the most resource-intensive components of the MIMO accelerator. Essentially, each sorting circuit consists of a collection of multiplexers equal to the number of target ports (32 for the processor input sorter, 64 for the memory input sorter), with a number of inputs equal to the number of source ports (8 for the memory input sorter, 64 for the processor input sorter). Alternatively, sorting can be performed at the data memory ports by using memory addresses to access specific elements. Since the memory address is already reserved for indexing the subcarrier, additional addressing for sorting involves some redundancy. Due to the critical nature of the sorting circuits, we considered three alternatives that trade off multiplexer use and redundancy in data memory. In the first alternative for the processor input sorter, shown in Fig. 10, we consider replicating the entire data memory while simultaneously reducing the number of independent memory blocks in Fig. 7 by the same order of replication. This consolidates multiple matrix elements in single blocks, allowing a compound memory address to both distinguish the subcarrier and provide some level of matrix element distinction. The remaining reordering can be supported by a smaller number of multiplexers. However, the size of memory, even for a small number of subcarriers, grows much faster than the saving in multiplexers, both on CMOS ASIC and FPGA targets. Table III shows the trend on a V4 LX200 target for 64 subcarriers. In the second strategy, we consider multiple-port memories as an alternative. Table IV shows that the memory grows less dramatically with port replication than with full memory replication. However, the trend is still not favorable; although having two read ports seems to minimize resources, the saving is not significant, especially considering the additional complexity in memory addressing and the additional circuitry needed to address the problem at the write port. Thus, the alternative using only multiplexers and a memory map unchanged from Fig. 7 is optimal. For the processor input sorting circuit, this translates into 32 64:1 multiplexers, equivalent to 16 970 slices on a V4 LX200. This is roughly 150% of the total area of the coordinate rotation core, or 57% of all resources used in the processor excluding data memory.

The 32 input ports of the processor are not independent. They are divided into at most two vector/matrix arithmetic operands, each of 16 complex elements.


Fig. 11. Processing unit input sorter.

TABLE V PROCESSOR INPUT SORTER RESOURCES BY MULTIPLEXING STRATEGY

Fig. 10. Memory replication and remainder multiplexing of orders 32, 16, and 8.

TABLE III BRAM REPLICATION AND OPERAND SHARING

TABLE IV MULTIPLE READ PORT MEMORY AND OPERAND SORTING

Each operand comes from a single matrix in memory. The four areas of memory (A, B, C, and D in Fig. 7) can be associated with actual matrices through the compiler without loss of generality. This means that each set of 16 processor input ports is associated with 16 memory ports (as opposed to 64), defined on a per-instruction basis. Fig. 11 shows how this matrix–operand relation can be leveraged in the sorting circuit. A first level of two 16-element-wide 4:1 multiplexers is used to link each of the operands to a matrix, allowing the main sorting multiplexers to be reduced from 64:1 to 16:1. This results in a resource saving of nearly 70% over a direct multiplexing approach.
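The two-level selection can be modeled in a few lines (a behavioral sketch of Fig. 11; the port counts follow the text, while function and signal names are invented for illustration):

```python
import numpy as np

def processor_input_sort(mem_out, matrix_sel, elem_sel):
    """mem_out: 64 memory ports (4 matrices x 16 elements).
    matrix_sel[op]: which matrix feeds operand op (first-level 4:1 mux).
    elem_sel[op][i]: which of the 16 matrix elements feeds port i of
    operand op (second-level 16:1 mux, replacing a full 64:1 mux)."""
    blocks = np.asarray(mem_out).reshape(4, 16)
    operands = []
    for op in range(2):                   # two 16-element operands
        block = blocks[matrix_sel[op]]    # first level: pick the matrix
        operands.append([block[e] for e in elem_sel[op]])
    return operands                       # the 32 processor inputs N1-N32

mem_out = np.arange(64)                   # stand-in memory word
ops = processor_input_sort(
    mem_out, matrix_sel=[0, 1],
    elem_sel=[list(range(16)), [3, 2, 1, 0] + list(range(4, 16))])
```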

Table V shows alternative memory-area-to-input-group premultiplexing strategies and their equivalent resources. Access flexibility is defined as the percentage of the memory accessible to either input after the first level of multiplexing. The critical access flexibility is 25%, since it allows each input group to access a full matrix (16 blocks) of memory. The approach used in Fig. 11 is optimal in this context, using minimum resources while not restricting the processor.

The memory input sorting circuit accepts 3 × 8 input buses (the three bus inputs to MC in Fig. 1) and redistributes them over 64 memory input ports. The inputs to this circuit are results from matrix or vector operations. Similar to the processor input sorting circuit, processor outputs are divided into at most two vector/matrix outputs with four elements each. Each processor output is assigned in its entirety to a matrix in memory. So, it is only necessary to distribute the processor outputs over 32 ports (M1–M32 in Fig. 1) corresponding to at most two memory matrices, and these ports can then be mirrored on the rest of memory without loss of generality. Additionally, since only one core per cycle is producing a result, a first level of multiplexing (MC in Fig. 1) allows the cores to timeshare the circuit. Another optimization for the memory input sorter is multiplexer MO in the CORDIC core (Fig. 3). Although the rotation core has ten outputs, only four output pairs at a time are meaningful; additionally, all other processing cores have eight or fewer outputs. By performing premultiplexing in MO, multiplexers M1–M32 in Fig. 1 are reduced from 10:1 to 8:1. This saves 36% of sorting circuit resources in a V4 LX200.


TABLE VI RESOURCES FOR ACCELERATOR BUILDING BLOCKS ON V4-LX200

TABLE VII SYNTHESIS RESULTS FOR A TSMC 65-nm PROCESS

Fig. 12. 4 × 4 QRD program. The original matrix is A; by the end of processing, A contains the R factor and matrix B contains the Q factor.

The accelerator controller uses an open instruction with control signals mostly corresponding to controls in the circuit. This allows the programmer to define new memory access and processor-mode combinations at will. Addressing for the accelerator is nonconventional, since it involves defining operand and result orientation, write enables, and conjugation, as opposed to a traditional addressing scheme [16]. Thus, a high-level syntax is provided to support a large number of common matrix/vector operations. The syntax is very similar to MATLAB, with additional operators to cover unitary transformations. High-level programs written in this syntax are converted through a custom compiler to machine-level instructions. Fig. 12 shows the syntax of a program that performs a 4 × 4 QRD on matrices stored in memory area A, leaving the R component in A and the Q component in B. Operators ! and @ correspond to real rotation (mode 1 in CORDIC) and complex Givens (mode 0), respectively. Operands and results are written as matrix ranges, and the matrices correspond to the areas of memory in Fig. 7. The compiler recognizes most matrix expressions and common mathematical operations as long as they conform to the following memory access limitation: each 4 × 1 vector operand and result must originate from, or be stored in, a single matrix in memory.
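For reference, the vector-pair decomposition that such a program expresses can be emulated in NumPy as below (a generic Givens-based QRD restricted to real data for brevity; the actual program of Fig. 12 mixes the ! and @ operators over complex data):

```python
import numpy as np

def givens_qrd(A):
    """QR decomposition as a sequence of vector-pair Givens rotations,
    mirroring how the rotation core consumes one row pair per instruction."""
    R = A.astype(float).copy()
    Q = np.eye(4)
    for col in range(4):
        for row in range(3, col, -1):        # null R[row, col] bottom-up
            a, b = R[row - 1, col], R[row, col]
            r = np.hypot(a, b)
            c, s = a / r, b / r
            G = np.array([[c, s], [-s, c]])
            R[[row - 1, row], :] = G @ R[[row - 1, row], :]  # rotate row pair
            Q[[row - 1, row], :] = G @ Q[[row - 1, row], :]  # accumulate Q
    return Q.T, R                            # A = Q R

A = np.random.default_rng(3).standard_normal((4, 4))
Q, R = givens_qrd(A)
assert np.allclose(Q @ R, A) and np.allclose(np.tril(R, -1), 0, atol=1e-12)
```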

V. RESULTS

Table VI shows synthesis results of the accelerator and its main building blocks for a Virtex-4 LX200 target. Timing results show a critical path in the fixed-point divider, or in the multipliers if they are realized using logic slices. The maximum clock speed on a Virtex-4 LX200 of speed grade -11 is 208 MHz. When two numbers are given in a field, they represent results with logic-slice-based and DSP-slice-based multipliers, respectively. Other than the rotation and multiplication cores, the main nonmemory components are the sorting circuits, accounting for nearly 20% of total logic slices. Section IV, however, shows that this is a substantial improvement over a flat multiplexer solution. Table VII shows synthesis results for a 65-nm CMOS ASIC technology. Results are listed for the entire circuit and for the circuit excluding data memory (128 kbits for 64 subcarriers).

TABLE VIII CYCLE COUNT ESTIMATES BY ALGORITHM AND ANTENNA COMBINATION

Table VIII lists cycle count results obtained from cycle-accurate fixed-point simulations for different antenna/algorithm configurations with 64 subcarriers. The accelerator is optimized for the matrix operations used in MIMO decoding. By accepting complex vector operands, and by virtue of an optimized processing core, it should show a significant advantage in matrix operations over a general purpose processor. Additionally, a custom processor cycle (the subject of a future paper) allows the accelerator processor pipeline to remain full at all times, thereby reducing the processor overhead significantly. To quantify the accelerator advantage, we carried out a series of tests to measure the energy required to carry out a set of typical complex matrix operations. To quantify delay, we measure the number of cycles required to run these tests on a general purpose DSP and on the MIMO accelerator. Latency could be a disadvantage to the DSP in terms of power-delay product, so to isolate any trends, we repeat the tests for different subcarrier counts, and repeat the tests in isolation and in series, allowing a range of DSP compiler optimizations to become visible. We also compare a number of decompositions and decoding algorithms on both platforms. The algorithms running on the DSP are not identical to those running on the accelerator; for example, in MMSE, an inversion is carried out without performing a QRD, since the decomposition does not aid the software implementation. For the 4 × 4 QRD, the DSP implementation uses Gram–Schmidt orthogonalization instead of Givens rotations, since the former is more suited for software.


TABLE IX PDP AND THROUGHPUT COMPARISON WITH TMS320C641T-600 MHz. ACCELERATOR CLOCK 234 MHz

TABLE XI COMPARISON OF MMSE CIRCUITS TO ACCELERATOR

Subc = subcarriers, Mult = vector–matrix multiplication, Real rot = real Givens rotation, Comp rot = complex rotation, MMSE is 2 × 2, and ML is exhaustive metric calculation. Throughput for DSP in kilosamples per second and for accelerator in megasamples per second.

Time is the time to invert and decode 64 samples including latency; penalty is the percentage by which the figure of merit (slices × time to invert and decode) for the accelerator is off from that of the dedicated design; cycles is the number of cycles.

Throup = throughput, Systolic array = systolic array implementation of QRD MMSE.

TABLE XII PDP COMPARISON SUMMARY

TABLE X PDP AND THROUGHPUT COMPARISON WITH TMS320C641T-720 MHz ACCELERATOR CLOCK 234 MHz

Subc = subcarriers, Mult = vector–matrix multiplication, Real rot = real Givens rotation, Comp rot = complex rotation, MMSE is 2 × 2, and ML is exhaustive metric calculation. Throughput for DSP in kilosamples per second and for accelerator in megasamples per second.

The DSP compiler is set to optimize for speed above code size, and a flat memory map is used. The DSP used is a fixed-point TI DSP6416 at 600 MHz; a power number assuming 60% CPU utilization is used, and cycle counts are obtained from TI Code Composer Studio simulations. Table IX lists the ratio of accelerator PDP to DSP PDP and the throughputs for different test scenarios. The accelerator has a significant advantage in all operations and in all scenarios, almost always above two orders of magnitude. The advantage is most notable in rotation (unitary transformation), where the ratio is consistently above three orders of magnitude. Unitary transformations dominate in most decoding algorithms (MMSE, ZF, and SVD). Table X shows results from the tests repeated for a faster DSP from the high-performance 6416 family.

The trends observed before are still valid: the ratio of PDP remains above three orders of magnitude on average, staying above 2000 for Givens rotations.

Compared to single-purpose ASIC and ASIC-like MIMO decoders that do not offer the same level of versatility and programmability, the accelerator is bound to have some performance penalty. Table XI quantifies this penalty against a number of MMSE MIMO decoder implementations on a Virtex-2 platform, while Table XII compares against a CMOS ASIC implementation. For FPGA platforms, the penalty is defined based on the following figure of merit: (total slices) × (time to invert and decode 64 samples). The accelerator disadvantage is mostly within an order of magnitude, and it is comparable to ASICs in the case of the systolic array architecture, where the disadvantage in PDP is around 20% (Table XII). These penalties are considerably low compared to the significant PDP advantage of the accelerator over general purpose processors.

REFERENCES
[1] G. J. Foschini and M. J. Gans, "On limits of wireless communications in a fading environment when using multiple antennas," Wireless Pers. Commun., vol. 6, no. 3, pp. 311–335, Mar. 1998.
[2] M. Anders, S. Mathew, R. Krishnamurthy, and S. Borkar, "A 64-state 2 GHz 500 Mbps 40 mW Viterbi accelerator in 90 nm CMOS," in Symp. VLSI Circuits, Dig. Tech. Papers, Jun. 2004, pp. 174–175.
[3] M. Anders, S. Mathew, S. Hsu, R. Krishnamurthy, and S. Borkar, "A 1.9 Gb/s 358 mW 16–256 state reconfigurable Viterbi accelerator in 90 nm CMOS," IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 214–222, Jan. 2008.
[4] J. Mitola III, "The software radio architecture," IEEE Commun. Mag., vol. 33, no. 5, pp. 26–38, May 1995.
[5] S. Haykin, "Cognitive radio: Brain-empowered wireless communications," IEEE J. Sel. Areas Commun., vol. 23, no. 2, pp. 201–220, Feb. 2005.


[6] C. Chang, J. Wawrzynek, and R. W. Brodersen, "BEE2: A high-end reconfigurable computing system," IEEE Des. Test Comput., vol. 22, no. 2, pp. 114–125, Mar. 2005.
[7] H. S. Kim, W. Zhu, J. Bhatia, K. Mohammed, A. Shah, and B. Daneshrad, "A practical, hardware friendly MMSE detector for MIMO-OFDM based systems," EURASIP J. Adv. Signal Process., 2008, Article ID 267460.
[8] N. D. Hemkumar, "Efficient VLSI architectures for matrix factorizations," Ph.D. dissertation, George R. Brown School of Engineering, Dept. Elect. Comput. Eng., Rice Univ., Houston, TX, 1994.
[9] R. P. Brent and F. T. Luk, "The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays," SIAM J. Sci. Stat. Comput., vol. 6, no. 1, pp. 69–84, Jan. 1985.
[10] R. P. Brent, F. T. Luk, and C. Van Loan, "Computation of the singular value decomposition using mesh connected processors," J. Very Large Scale Integr. Comput. Syst., vol. 1, no. 3, pp. 242–267, 1985.
[11] J. Wang, "A recursive least-squares ASIC for broadband 8 × 8 multiple-input multiple-output wireless communications," Ph.D. dissertation, Henry Samueli School Eng. Appl. Sci., Univ. California, Los Angeles, CA, 2005.
[12] D. Markovic, R. W. Brodersen, and B. Nikolic, "A 70 GOPS, 34 mW multi-carrier MIMO chip in 3.5 mm²," in Symp. VLSI Circuits Dig. Tech. Papers, 2006, pp. 196–197.
[13] C. Studer, P. Blösch, P. Friedli, and A. Burg, "Matrix decomposition architecture for MIMO systems: Design and implementation trade-offs," in Proc. Conf. Rec. Forty-First Asilomar Conf. Signals, Syst., Comput. (ACSSC), Nov. 4–7, 2007, pp. 1986–1990.
[14] B. Hassibi and H. Vikalo, "On the sphere-decoding algorithm I: Expected complexity," IEEE Trans. Signal Process., vol. 53, no. 8, pp. 2806–2818, Aug. 2005.
[15] A. Burg, M. Borgmann, M. Wenk, and M. Zellweger, "VLSI implementation of MIMO detection using the sphere decoding algorithm," IEEE J. Solid-State Circuits, vol. 40, no. 7, pp. 1566–1577, Jul. 2005.
[16] F. M. Cady, Microcontrollers and Microprocessors: Principles of Software and Hardware Engineering. New York: Oxford Univ. Press, 1997.
[17] M. Myllyla, J. Hintikka, J. R. Cavallaro, and M. Juntti, "Complexity analysis of MMSE detector architectures for MIMO OFDM systems," in Conf. Rec. 39th Asilomar Conf. Signals, Syst., Comput., 2005, pp. 75–81.
[18] I. LaRoche and S. Roy, "An efficient regular matrix inversion circuit architecture for MIMO processing," presented at the IEEE Int. Symp. Circuits Syst., Kos, Greece, 2006.
[19] F. Edman and V. Öwall, "A scalable pipelined complex valued matrix inversion architecture," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2005, pp. 4489–4492.

[20] M. Karkooti and J. R. Cavallaro, "FPGA implementation of matrix inversion using QRD-RLS algorithm," in Proc. Conf. Rec. 39th Asilomar Conf. Signals, Syst., Comput., 2005, pp. 1625–1629.

Karim Mohammed (S'99–M'09) received the B.Sc. degree in electronics and electrical communications and the M.S. degree with emphasis on microelectronics from Cairo University, Cairo, Egypt, in 2002 and 2004, respectively, and the Ph.D. degree from the University of California, Los Angeles (UCLA), in 2009. He is currently a Lecturer at Cairo University. His research interests include architectural approaches toward the realization of complex digital signal processing for wireless communication.

Babak Daneshrad (S'84–M'94) received the B.Eng. and M.Eng. degrees with emphasis in communications from McGill University, Montreal, QC, Canada, in 1986 and 1988, respectively, and the Ph.D. degree with emphasis in ICs and systems from the University of California, Los Angeles (UCLA), in 1993. He is currently a Professor in the Electrical Engineering Department, UCLA. His research interests include wireless communication system design, experimental wireless systems, and VLSI for communications; his work is cross-disciplinary in nature and deals with practical issues associated with the realization of advanced wireless systems. He is a coauthor of a paper that received the Best Paper Award at Parallel and Distributed Simulation 2004. Prof. Daneshrad was a recipient of the 2005 Okawa Foundation Award and the First Prize in the Design Automation Conference (DAC) 2003 design contest. He is the beneficiary of the endowment for the UCLA-Industry Partnership for Wireless Communications and Integrated Systems. In January 2001, he cofounded Innovics Wireless, a company focused on developing 3G-cellular mobile terminal antenna diversity solutions, and in 2004, he cofounded Silvus Technologies. From 1993 to 1996, he was a member of technical staff with the Wireless Communications Systems Research Department, AT&T Bell Laboratories, where he was involved in the design and implementation of systems for high-speed wireless packet communications.
