
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.


IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

A 675 Mbps, 4 × 4 64-QAM K-Best MIMO Detector in 0.13 μm CMOS


Mahdi Shabany, Associate Member, IEEE, and P. Glenn Gulak, Senior Member, IEEE

Abstract: This paper introduces a novel scalable pipelined VLSI architecture for a 4 × 4 64-QAM hard-output multiple-input multiple-output (MIMO) detector based on K-best lattice decoders. The key contribution is a means of expanding the intermediate nodes of the search tree on-demand, rather than exhaustively, along with three types of distributed sorters operating in a pipelined structure. The proposed architecture has a fixed critical path independent of the constellation size, an on-demand expansion scheme, efficient distributed sorters, and is scalable to a higher number of antennas. Fabricated in 0.13 μm CMOS, it occupies a 0.95 mm² core area. Operating at a 282 MHz clock frequency, it dissipates 135 mW at a 1.3 V supply with no BER performance loss. It achieves an SNR-independent throughput of 675 Mbps, satisfying the requirements of IEEE 802.16m and long term evolution (LTE) systems. The measurements confirm that this design consumes 3.0× less energy/bit and operates at a significantly higher throughput compared to the best previously published design.

Index Terms: K-best detectors, long term evolution (LTE) systems, multiple-input multiple-output (MIMO) detection, WiMAX systems.

I. INTRODUCTION

Due to their high spectral efficiency, multiple-input multiple-output (MIMO) systems have attracted significant attention as the technology of choice in many standards such as IEEE 802.11n, IEEE 802.16e, IEEE 802.16m, and the long term evolution (LTE) project. One of the main challenges in exploiting the potential of MIMO systems is to design low-complexity, high-throughput detection schemes with near maximum-likelihood (ML) performance that are suitable for efficient very large scale integration (VLSI) realization. Unfortunately, the complexity of the optimal ML detection scheme grows exponentially with the number of transmit antennas and the constellation size. Lower-complexity detectors such as zero-forcing (ZF), minimum mean-square error (MMSE), or successive interference cancelation (SIC) detectors can greatly reduce the computational complexity. However, they suffer from significant performance loss. The other alternative is to use near-optimal non-linear detectors [2].

Manuscript received March 13, 2010; revised July 03, 2010 and September 12, 2010; accepted October 19, 2010. This work was presented in part at the International Solid-State Circuits Conference (ISSCC) 2009 [1]. M. Shabany is with the Electrical Engineering Department, Sharif University of Technology, Tehran 11556-74513, Iran (e-mail: mahdi@sharif.edu). P. G. Gulak is with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 2E4, Canada (e-mail: gulak@eecg.toronto.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2010.2090367

Depending on how they carry out the non-exhaustive search, near-optimal non-linear detection methods generally fall into a few main categories, namely depth-first search, breadth-first search, and best-first search. Depth-first sphere decoding (SD) [3] is one of the most attractive depth-first approaches, whose performance is optimal under the assumption of unlimited execution time [2]. However, the actual runtime of the algorithm depends not only on the channel realization, but also on the operating signal-to-noise ratio (SNR) [4], leading to a variable sustained throughput, which results in extra hardware overhead due to the required I/O buffers and lower hardware utilization. Among the breadth-first search methods, the most well-known approach is the K-best algorithm [5]. The K-best detector guarantees an SNR-independent fixed throughput with a performance close to ML. Being fixed-throughput in nature, along with the fact that breadth-first approaches are feed-forward detection schemes, makes them especially attractive for VLSI implementation. There has been some effort in the literature directed towards their VLSI implementation [6], [7]. However, the child expansion and sorting schemes in those architectures are not efficient/scalable for higher-order constellation schemes such as 64-QAM and 256-QAM. In most of these architectures, the delay of the critical path increases for higher modulation orders, which ultimately limits the maximum achieved throughput. Moreover, in spite of various published architectures for the implementation of 4 × 4 16-QAM systems, an efficient high-throughput application specific integrated circuit (ASIC) implementation for 64-QAM systems at high data rate is still a major challenge and has not been fully addressed in the literature. In this paper, an efficient VLSI architecture, its chip implementation, and test results for a 4 × 4 64-QAM K-best MIMO detector are reported, which alleviates the above problems and operates at a significantly higher throughput than currently reported schemes. The promising features of the proposed ASIC are as follows. Simulation results indicate sub-linear scaling of K with the constellation size: for 16-QAM, K is chosen to be 5, while for 64-QAM, K = 10, meaning that the constellation quadruples but the K value only doubles, thus the sub-linear increase. It also has a fixed critical path delay independent of the constellation order, the K value, and the number of antennas. Moreover, it efficiently expands only a very small fraction of all possible children in the K-best algorithm and can be applied to infinite lattices. Finally, it provides the exact K-best solution, i.e., the solution that implements the original K-best algorithm with all needed expansions.

II. K-BEST ALGORITHM

Let us consider a spatial multiplexing MIMO system with N_T transmit and N_R receive antennas whose equivalent baseband



model of the Rayleigh fading channel is described by a complex-valued N_R × N_T channel matrix H. The complex baseband equivalent model can be expressed as y = H s + n, where s denotes the N_T-dimensional complex transmit signal vector, in which each element is independently drawn from a complex constellation O (a symmetric Q-QAM scheme with log₂Q bits per symbol, i.e., |O| = Q), y is the N_R-dimensional received symbol vector, and n represents the N_R-dimensional independent identically distributed (i.i.d.) complex zero-mean Gaussian noise vector with variance σ² per entry. The real-valued equivalent of this system can also be derived using a real-valued decomposition (RVD) model [6] as follows:

  ỹ = H̃ s̃ + ñ    (1)

where ỹ, s̃, and ñ are the equivalent real-valued vectors obtained by separating the real and imaginary parts of y, s, and n (interleaved per complex entry), H̃ is decomposed accordingly, and Re(·) and Im(·) denote the real and imaginary parts of the variables. Note that each entry of s̃ belongs to Ω, where Ω is the set of possible real entries in the constellation for the in-phase and quadrature parts, with |Ω| = √Q. The objective of the MIMO ML detection method is to find the closest transmitted vector based on the observation, i.e.,

  ŝ = arg min over s̃ ∈ Ω^{2N_T} of ‖ỹ − H̃ s̃‖²    (2)

The exhaustive-search ML detection is infeasible to implement for large constellation sizes (i.e., 64-QAM and larger) because of its exponential complexity. The K-best algorithm, a.k.a. the M-algorithm, is a near-ML technique that solves the above problem with much lower complexity. The problem in (2) can be considered as a tree-search problem with 2N_T levels. In fact, the K-best algorithm explores the tree from the root to the leaves by expanding each level and selecting the K best candidates, i.e., those with the lowest path metrics in each level, which are the surviving nodes of that level. Consider the problem in (2), and let us denote the QR-decomposition of the real channel matrix as H̃ = QR, where Q is a unitary matrix of size 2N_R × 2N_T and R is an upper triangular matrix. Applying Q^T to (1) results in

  ŷ = Q^T ỹ = R s̃ + Q^T ñ    (3)

Since the nulling matrix Q^T is unitary, the noise Q^T ñ remains spatially white, and the norm in (2), which represents the ML detection rule, can be rewritten as ‖ŷ − R s̃‖². Exploiting the upper triangular nature of R, this norm can be further expanded as

  ‖ŷ − R s̃‖² = Σ_{i=1}^{2N_T} | ŷ_i − Σ_{j=i}^{2N_T} R_{i,j} s̃_j |²    (4)

which is a tree-search problem with 2N_T levels.
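As a concrete illustration of the real-valued decomposition in (1), the short NumPy sketch below builds the real model from a random complex channel and checks that it reproduces y = Hs + n. The interleaved real/imaginary ordering used here is an assumption chosen to be consistent with the R-matrix structure discussed later in Section IV-B; it is an illustrative sketch, not code from the paper.

```python
import numpy as np

def rvd(H, s, n):
    """Real-valued decomposition of y = H s + n, interleaving the real and
    imaginary parts of every complex entry ([Re, Im] per element)."""
    nr, nt = H.shape
    H_r = np.zeros((2 * nr, 2 * nt))
    for i in range(nr):
        for j in range(nt):
            a, b = H[i, j].real, H[i, j].imag
            H_r[2*i:2*i+2, 2*j:2*j+2] = [[a, -b], [b, a]]
    interleave = lambda v: np.column_stack([v.real, v.imag]).ravel()
    return H_r, interleave(s), interleave(n)

rng = np.random.default_rng(0)
NT = NR = 4
pam = np.array([-7, -5, -3, -1, 1, 3, 5, 7])            # 64-QAM per real dimension
H = (rng.standard_normal((NR, NT)) + 1j * rng.standard_normal((NR, NT))) / np.sqrt(2)
s = rng.choice(pam, NT) + 1j * rng.choice(pam, NT)      # random 64-QAM symbols
n = 0.1 * (rng.standard_normal(NR) + 1j * rng.standard_normal(NR))
y = H @ s + n

H_r, s_r, n_r = rvd(H, s, n)
# The 2*NT-dimensional real model reproduces the complex one exactly.
assert np.allclose(np.column_stack([y.real, y.imag]).ravel(), H_r @ s_r + n_r)
```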

Starting from i = 2N_T, (4) can be evaluated recursively as follows:

  T_i(s^{(i)}) = T_{i+1}(s^{(i+1)}) + |e_i(s^{(i)})|²    (5)

  |e_i(s^{(i)})|² = | ŷ_i − Σ_{j=i+1}^{2N_T} R_{i,j} s̃_j − R_{i,i} s̃_i |²    (6)

for i = 2N_T, ..., 1, where s^{(i)} = [s̃_i, s̃_{i+1}, ..., s̃_{2N_T}]^T, T_i(s^{(i)}) is the accumulated partial Euclidean distance (PED) with T_{2N_T+1} = 0, and |e_i(s^{(i)})|² denotes the distance increment between two successive nodes/levels in the tree. Factoring out the diagonal entry, the increment can be written as

  |e_i(s^{(i)})|² = R_{i,i}² · | ŷ'_i − Σ_{j=i+1}^{2N_T} R'_{i,j} s̃_j − s̃_i |²    (7)

where ŷ'_i and R'_{i,j} denote ŷ_i and R_{i,j} scaled by 1/R_{i,i}, respectively, i.e., ŷ'_i = ŷ_i / R_{i,i} and R'_{i,j} = R_{i,j} / R_{i,i}. Based on the above formulation,¹ the K-best algorithm can be described as in Table I. The path with the lowest PED at the last level of the tree is the hard-decision output of the detector, whereas, for a soft-decision output, all of the existing paths at the last level are considered to calculate the log-likelihood ratios (LLRs).

Let us consider the 2N_R × 2N_T real-model MIMO system with the channel matrix in (1). As mentioned, the system can be thought of as a detection problem in a tree with 2N_T levels and q = √Q children per node. Because of the upper triangular structure of the matrix R, the algorithm starts from the last row of the matrix (the 2N_T-th row, which is the 2N_T-th level of the detection tree) and goes all the way up to the first row of the matrix, which is the first level of the detection tree. Note that in this scheme, all the possible children of a level are expanded exhaustively. The size of this exhaustive expansion grows significantly when the constellation size is scaled upward. Therefore, better ways are needed to calculate the K best candidates of each level without performing an exhaustive search.

Regardless of whether dealing with the hard-decision or soft-decision output, there are two main computations that play critical roles in the total computational complexity of the algorithm, namely, 1) the expansion of the surviving paths, and 2) the sorting.² Therefore, the important part of any VLSI realization of the K-best algorithm is an efficient architecture to implement these two computational cores. The approach used in previously published work and that used in this paper are described in the following.

1) Expansion: The K-best algorithm enumerates all the possible children of a parent node in each level. Since there are K parent nodes at each level and q children per parent, the path metrics of Kq children need to be computed in each level, which incurs a large computational complexity.³
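For reference, a compact software model of the exhaustive K-best procedure of Table I (expand all q children of every survivor, then keep the K lowest-PED paths) might look as follows. It is an algorithmic sketch on the real model of (3)-(6), not the hardware data flow, and the function and variable names are mine.

```python
OMEGA = [-7, -5, -3, -1, 1, 3, 5, 7]          # real 64-QAM alphabet (q = 8)

def kbest_exhaustive(R, y_hat, K, omega=OMEGA):
    """Breadth-first K-best detection for y_hat = R s + noise, R upper
    triangular (2*Nt x 2*Nt).  Levels are processed from the last row of R
    (the root of the tree) up to the first row (the leaves)."""
    n = R.shape[0]
    survivors = [(0.0, [])]                   # (PED, symbols for rows i..n-1)
    for i in range(n - 1, -1, -1):
        children = []
        for ped, path in survivors:
            # interference of the already-decided symbols s_{i+1}..s_{2Nt}
            interf = sum(R[i, i + 1 + k] * s for k, s in enumerate(path))
            for s_i in omega:                 # exhaustive child expansion
                e = y_hat[i] - interf - R[i, i] * s_i
                children.append((ped + e * e, [s_i] + path))
        children.sort(key=lambda c: c[0])     # sort all K*q children ...
        survivors = children[:K]              # ... and keep the K best
    return survivors[0]                       # lowest-PED leaf = hard decision

# usage sketch (with H_r and the received vector from the RVD example above):
#   Q, R = np.linalg.qr(H_r)
#   ped, s_hat = kbest_exhaustive(R, Q.T @ (H_r @ s_r + n_r), K=10)
```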

¹A typical detection core consists of a preprocessing core whose output is the R matrix. In this paper, we assume that the sorted QR-decomposition algorithm is applied to the channel matrix to generate the R matrix using the scaled and decoupled architectures [8].

²If these bottlenecks are resolved, the extension of the hard-decision scheme to the soft version is shown to be straightforward in [6].


TABLE I K-BEST ALGORITHM

Fig. 1. Order of the SE row-enumeration for four consecutive enumerations in 16-QAM.

The phase shift keying (PSK) enumeration scheme [9], which is based on a search over multiple base-centric circles, or its simplified version for Q-QAM systems [10], has been proposed to simplify the enumeration process. Moreover, in [11], a different variation of the base-centric search methodology is used, in which the SD algorithm and successive interference cancelation are jointly employed. A relaxed K-best enumeration scheme is also proposed in [12], based on the PSK enumeration idea with local sorters. Although these methods are simpler to implement, they do not scale linearly with the constellation size (such as [9]) and/or have a performance loss compared to the exact K-best implementation (such as [12] and [11]). In this paper, we propose an efficient expansion method called the on-demand expansion scheme, which avoids the exhaustive enumeration of the children while providing all the information required for the exact K-best implementation with no performance degradation. To the best of our knowledge, this is the only expansion scheme to date with a computational complexity proportional to the K value and independent of the constellation size.

2) Sorting: Based on the algorithm in Table I, in each level of the tree there are Kq children to be sorted. Among all the sorting algorithms addressed in [13], bubble sorting is the most effective one, since it distributes the sorting over multiple cycles [5]. Using bubble sorting, however, obtaining the sorted list in each level takes a number of cycles that grows with the list size, which is time-intensive for large values of K and q and ultimately limits the throughput. In [7], a distributed sorting method is proposed based on the Schnorr-Euchner (SE) ordered search technique [2], [14]. However, it requires all the children of a parent node to be calculated by a metric computation unit and is applicable only to specific small configurations of K and q, and thus cannot be applied to the cases of interest here. Moreover, for higher values of K, the proposed single-cycle merge core in [7] becomes increasingly complex, resulting in a long critical path. Therefore, [7] is not a suitable platform to achieve high throughput for higher-order modulations like 64-QAM and 256-QAM, where the value of q is large (e.g., q = 8 for 64-QAM and q = 16 for 256-QAM). Finally, in [12], a relaxed approach to implementing the K-best algorithm using a distributed sorting scheme is proposed. This approach is simpler to implement but results in a performance loss compared to the exact K-best solution. Moreover, the implemented ASIC occupies a large silicon area while having only moderate throughput. In this paper, we propose a distributed sorter, working in a pipelined structure with the on-demand expansion scheme, which finds the K best candidates in K clock cycles. It works for any value of K and q, and its complexity is proportional to the K value and independent of the constellation size. It also does not compromise the BER performance, provides the exact K-best solution, and can be easily extended to the complex domain [15].
³In some implementations, a metric such as a radius constraint is used to limit the number of expanded children [6]. In this paper, we consider a general K-best implementation without a metric restriction.

III. PROPOSED K-BEST DETECTION SCHEME

Consider level l of the tree and assume that the set of K best candidates in the previous level, l+1 (denoted by S_{l+1}), is known. Each node in S_{l+1} has q possible children, so there are Kq possible children in level l. One of the main elements of our proposed scheme is to find the children of each node on-demand, and in the order of increasing PED, rather than calculating the PEDs of all the children exhaustively. In other words, the key idea of the proposed distributed K-best scheme is to find only the first child⁴ (FC) of each parent node in S_{l+1}. Among these K first children, the one with the lowest PED is definitely one of the K best candidates in level l. That child is selected and replaced by its next best sibling.⁵ This process repeats K times to find the K best candidates in level l (the set S_l). The same procedure is performed for each level of the tree.
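A software analogue of this per-level loop is sketched below; it uses a heap where the hardware uses a register-based sorter, and assumes each parent exposes its children through an iterator that yields them in increasing PED order (the SE order of Section III-A). Only 2K − 1 of the qK children are ever touched. The helper names and the toy PED values are illustrative, not taken from the paper.

```python
import heapq

def kbest_level(parents, K):
    """One level of on-demand expansion.  `parents` maps a parent id to an
    iterator yielding (accumulated_ped, symbol) pairs in increasing PED
    order.  K first children plus K-1 next siblings are drawn in total."""
    heap, visited = [], 0
    for pid, it in parents.items():                 # first child of each parent
        ped, sym = next(it)
        visited += 1
        heapq.heappush(heap, (ped, pid, sym, it))
    selected = []
    for _ in range(K):
        ped, pid, sym, it = heapq.heappop(heap)     # global minimum so far
        selected.append((ped, pid, sym))
        if len(selected) < K:                       # replace by next best sibling
            nped, nsym = next(it)
            visited += 1
            heapq.heappush(heap, (nped, pid, nsym, it))
    return selected, visited

# Toy level loosely mirroring Fig. 2: K = 3 parents, 4 children each.
parents = {
    0: iter([(0.2, 'a1'), (0.5, 'a2'), (1.1, 'a3'), (1.9, 'a4')]),
    1: iter([(0.7, 'b1'), (0.9, 'b2'), (1.4, 'b3'), (2.2, 'b4')]),
    2: iter([(0.8, 'c1'), (1.0, 'c2'), (1.6, 'c3'), (2.5, 'c4')]),
}
best, visited = kbest_level(parents, K=3)
print(best)       # the three lowest-PED children (PEDs 0.2, 0.5, 0.7)
print(visited)    # 5 children visited out of the 12 possible (2K - 1)
```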

A. First/Next Child Calculation

In the on-demand scheme described above, the first and the next child of a node must be determined. Based on the system model in (5)-(7), the first child (at level l) of a parent node in S_{l+1} is the candidate minimizing the distance increment, i.e.,

  s_l^{FC} = arg min over s ∈ Ω of | b_l − s |    (8)

where b_l = ŷ'_l − Σ_{j=l+1}^{2N_T} R'_{l,j} s̃_j. This is because the accumulated PED of the parent is common to all children of that parent, so only the increment matters. Therefore, s_l^{FC} can be found by rounding b_l to the nearest value in Ω (represented by ⌊b_l⌉_Ω in this paper). In order to find the next children (NC), the Schnorr-Euchner technique [14] is employed, which implies a zig-zag movement around ⌊b_l⌉_Ω to select the consecutive elements of Ω. Fig. 1 shows such an enumeration for 16-QAM. In fact, the SE enumeration finds the closest points in the real domain one-by-one by changing the search direction. The procedure for selecting the first/next child of node j in level l is described in Table II, which keeps track of the number of moves and the direction of the next move: the direction alternates between positive and negative unless a boundary of Ω is reached, and the number of moves increases by 2 at every step and is reset to 2 once a boundary of Ω is reached. The proposed scheme is pictorially depicted in Fig. 2 for a level with M = 4 children per parent and K = 3; it shows the way S_l is derived from S_{l+1}.

⁴The first child refers to the child with the lowest local PED among all children of a parent.
⁵The next sibling refers to the child with the next lowest local PED.



TABLE II FIRST/NEXT CHILD SELECTION PROCEDURE FOR NODE j
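A behavioral Python model of the first/next-child rule summarized in Table II is given below. It reproduces the increasing-distance (zig-zag) order and the boundary handling described in the text, but it is written around running low/high markers rather than the move-counter and direction registers of the hardware, so treat it as an equivalent sketch rather than a transcription of Table II.

```python
OMEGA = [-7, -5, -3, -1, 1, 3, 5, 7]          # real 64-QAM alphabet

def first_child(b):
    """Mapper + Limiter: slice b to the nearest element of OMEGA."""
    nearest_odd = 2 * round((b - 1) / 2) + 1
    return min(max(nearest_odd, OMEGA[0]), OMEGA[-1])

def se_children(b, omega=OMEGA):
    """Yield the elements of omega in increasing distance from b
    (Schnorr-Euchner zig-zag enumeration)."""
    lo = hi = first_child(b)
    yield lo
    for _ in range(len(omega) - 1):
        down, up = lo - 2, hi + 2
        # Step to whichever side is closer to b, unless that side has
        # already hit the boundary of the alphabet.
        take_down = (up > omega[-1]) or (down >= omega[0] and b - down <= up - b)
        if take_down:
            lo = down
            yield down
        else:
            hi = up
            yield up

print(list(se_children(2.3)))   # [3, 1, 5, -1, 7, -3, -5, -7]
```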

The input to the algorithm is the K = 3 best nodes selected in level l+1, which are the current parents, with corresponding PEDs of 0.1, 0.4, and 0.6. Each parent can be further expanded into four offspring, resulting in 12 children whose PEDs are shown in Fig. 2. Using the technique of Table II, each parent can find its own first child without visiting all of its children.⁶ Let F represent the set consisting of the current best children of all parents, and let P_F represent their corresponding PEDs (in Fig. 2, C_{i,j} denotes the j-th child of the i-th parent in level l). It is easy to see that the child with the lowest PED in F is definitely one of the K best candidates in level l (1st cycle in Fig. 2), because the child with the lowest PED in F has globally the lowest PED. Thus this child should be added to S_l. To find the next best child in the K-best list, this child and its corresponding PED are removed from F and P_F, respectively (2nd cycle in Fig. 2), and the next best sibling of this child is added to F. Taking the same approach, the child in F with the lowest PED is again definitely among the final K-best list (the child with PED 0.5 in Fig. 2). This child should be added to S_l, be removed from F, and be replaced by its next best sibling (3rd cycle). This procedure repeats K times to find all the K best candidates (see Fig. 2). The final children in the K-best list have PEDs 0.2, 0.5, and 0.7, respectively. Note that, using the proposed scheme, only 5 of the 12 possible children are visited in Fig. 2. This savings becomes increasingly significant for large q values.

⁶Visiting a child means calculating its PED to the received vector.

Fig. 2. The proposed distributed K-best algorithm for M = 4, K = 3, and example PED values.

The proposed scheme has the following features:
1) Its hardware complexity scales only weakly with the constellation size.
2) The required K value scales sub-linearly with the constellation size.
3) It finds the K best candidates in K clock cycles.
4) It has a fixed critical path delay independent of the constellation size and the K value.
5) It expands only 2K − 1 children per level, out of the qK possible children of the exhaustive K-best approach.
6) It can be applied to infinite lattices and be jointly applied with lattice reduction⁷ ([17]).
7) It provides the exact K-best solution with no approximation.
8) It can be easily extended to the complex domain ([18]).

⁷Since children are expanded one-by-one on an on-demand basis, the number of children per parent is not required in advance and can be infinite. It is clear that supporting infinite lattices (i.e., infinitely many children) in hardware is impossible, but the value of the required K is always a finite number, which makes the application realizable.

IV. VLSI IMPLEMENTATION

In this paper, we solely focus on the implementation of the K-best algorithm, where it is assumed that the i-th row of the matrix R and the corresponding entry of ŷ are normalized by R_{i,i}. Therefore, the distance increment in (7) can be computed directly from the normalized quantities ŷ'_i and R'_{i,j} defined earlier.

A. General Description

One challenge that needs to be addressed to achieve a high-throughput architecture is the implementation of the K minimizations in each level, which is still computationally complex even with state-of-the-art bubble sorting [5]. In order to resolve this problem, a pipelined structure is used, which performs the child expansion and minimization jointly in a pipelined fashion and implements the sorting in a distributed way without sacrificing the throughput. The proposed architecture, with all intermediate parameters for a 4 × 4, 64-QAM MIMO system with K = 10, is shown in Fig. 3. There are 2N_T = 8 levels in the tree. The 8th level of the tree, corresponding to the last row of (3), opens up all the possible values of s̃_8 in Ω and calculates their corresponding PEDs. The output of this stage is S_8, consisting of eight PED values; this is performed by Level I in Fig. 3. For each of the eight nodes in S_8, the first child is found and its PED is updated using the FC-Block in Level II (additional detail is provided in Section IV-B-7). The inputs to the FC-Block are the corresponding entries of R and ŷ along with the parent PEDs. Then the FC with the lowest PED should be determined, which requires all the FCs to be sorted. This is done using the Sorter block in Fig. 3, which sorts all eight resulting PEDs in four cycles.


Fig. 3. Proposed pipelined VLSI architecture of the K-best algorithm for the detection of a 4 × 4, 64-QAM system with K = 10.

Using this sorter, the number of clock cycles required for sorting is half that of classic bubble sorting. The key idea that makes this sorter faster is the implementation of two tasks (max/min and the data exchange) in one clock cycle through the introduction of intermediate registers. The detailed architecture of this block is discussed in Section IV-B-3. The output of the Sorter block is the sorted FCs of level 7 (FC-L7 in Fig. 3), which are all loaded simultaneously into the next stage (i.e., PE I)⁸ to form S_7. Generally speaking, in each level, one PE II block is used to generate and sort the list of all FCs of the current level, and one PE I block is used to generate the K-best list of the current level; these are denoted by FC-L and NC-L in Fig. 3, respectively. The task of the PE I block is to take the FCs of each level as input and generate the K-best list of that level one-by-one. The node in S_7 with the lowest PED is definitely one of the K-best candidates in level 7. This value is passed to the PE II block in FC-L6. Upon the removal of this FC, its next sibling needs to be calculated, which is done by the core called the NC-Block in the feedback loop of the PE I block (Sections IV-B-4 and IV-B-5); the new sibling substitutes the removed FC in the list. The PED of this sibling is compared with the other FCs already present in the NC-L7 stage, and the next K-best candidate has the lowest PED among this new set. This process is repeated 10 times (taking 10 cycles) until all the K-best values of the second level of the tree are generated and passed to the PE II block in FC-L6. The PE II block receives the K-best candidates of level 7 one after the other, generates the FC of each received K-best candidate one-by-one, and sorts them as they arrive. It finally transfers them to the following PE I block. This process repeats for all the levels down to the first level. Since at the first level only the FC with the lowest PED is of concern, only one PE II block is used for the first level (FC-L1), whose output is the hard-decision detected symbol ŝ.⁹
⁸The data transfers happen between blocks every 10 clock cycles. The dashed grey arrows in Fig. 3 imply that the data is loaded only once every 10 clock cycles after the completion of the previous stage, and the number on the arrow shows how many cycles after the completion of the previous stage the data is loaded. Note also that the utilization factor for all the blocks except the first three is 100%. This means the PE I/II blocks require 10 cycles to produce an output, while the Sorter block is active for only 4 cycles out of every 10 clock cycles. The output of the Sorter is loaded into the following PE I block once every 10 cycles.

B. Detailed VLSI Architecture

The inputs to the architecture are the entries of the R matrix as well as the ŷ vector in (3). The R matrix resulting from the QR-decomposition in (1) has some nice features, which are explained by the following example. Let us consider a 4 × 4, 64-QAM MIMO system (8 × 8 with RVD). The R matrix then has the following structure:¹⁰

  [ r1,1   0      r1,3  −r2,3   r1,5  −r2,5   r1,7  −r2,7 ]
  [ 0      r1,1   r2,3   r1,3   r2,5   r1,5   r2,7   r1,7 ]
  [ 0      0      r3,3   0      r3,5  −r4,5   r3,7  −r4,7 ]
  [ 0      0      0      r3,3   r4,5   r3,5   r4,7   r3,7 ]
  [ 0      0      0      0      r5,5   0      r5,7  −r6,7 ]    (9)
  [ 0      0      0      0      0      r5,5   r6,7   r5,7 ]
  [ 0      0      0      0      0      0      r7,7   0    ]
  [ 0      0      0      0      0      0      0      r7,7 ]

This implies that two consecutive rows of the R matrix share the same entries up to a possible sign flip. Therefore, in the VLSI architecture, the input values of two consecutive levels share the same values; the above RVD thus reduces both the number of input pads and the memory required to buffer the values. The other implication of this structure is that the first children of the odd rows do not depend on the K-best list of the preceding even row. This is because R_{1,2}, R_{3,4}, R_{5,6}, and R_{7,8} are all zero.
⁹The goal of the PE II block is to generate the sorted list of all FCs, not just the minimum FC. This is because this sorted list is fed to the following PE I block to find the K best nodes of the current level. This works as follows. In the first clock cycle, the minimum FC is announced as the next child, and the next sibling of this announced node is calculated. The PED of this new sibling has to be compared with the PEDs of the 9 previous FCs provided by the preceding PE II. Since these have already been sorted in the previous stage, only one clock cycle is needed to compare the PED of the new sibling with the 2nd lowest FC and announce the winner. Had they not been sorted in PE II, the minimum of the PED of the new sibling and all 9 remaining FCs of the last stage would have to be found, which would incur a long critical path.
¹⁰The R matrix in (9) is derived as follows: first, the columns of the complex-valued channel matrix are sorted. Then, this sorted version is transformed into the real-valued domain, and finally the QRD is applied. Note that the proposed algorithmic and architectural ideas in this paper can be easily reconfigured/modified, with the aid of a small control circuitry, to accommodate any input R matrix with any arbitrary sorted inputs. The only things that have to be modified are the way the input values are stored and the multiplexing scheme at the input.
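The structure in (9) can be reproduced numerically: take the complex QR decomposition, force a real positive diagonal, and apply the interleaved RVD to the complex R. The sketch below does exactly that and prints an 8 × 8 matrix with the paired rows and the zeros at positions (1,2), (3,4), (5,6), and (7,8) discussed above; the ordering convention is my assumption, chosen to be consistent with the stated properties.

```python
import numpy as np

rng = np.random.default_rng(1)
NT = 4
H = (rng.standard_normal((NT, NT)) + 1j * rng.standard_normal((NT, NT))) / np.sqrt(2)

Q, R = np.linalg.qr(H)                         # complex QR decomposition
phases = R.diagonal() / np.abs(R.diagonal())   # make diag(R) real and positive
R = np.diag(phases.conj()) @ R

R_real = np.zeros((2 * NT, 2 * NT))            # RVD of the complex R
for i in range(NT):
    for j in range(NT):
        a, b = R[i, j].real, R[i, j].imag
        R_real[2*i:2*i+2, 2*j:2*j+2] = [[a, -b], [b, a]]

# Consecutive row pairs share the same entry magnitudes (with sign flips),
# and the entry just right of each odd-row diagonal element is zero, so the
# odd rows do not depend on the symbol decided in the preceding even row.
np.set_printoptions(precision=2, suppress=True)
print(R_real)
```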


Fig. 5. Architecture for the Limiter block.

Fig. 4. Alternative architecture for multiplication (MU).

For instance, the first child of the fifth row is independent of s̃_6, due to the fact that R_{5,6} = 0. Note that in the proposed architecture, it is assumed that the channel is quasi-static and is updated after every four channel uses. This implies that the QR-decomposition and the input entries need to be updated at the corresponding rate. In total, there are 16 distinct entries in the R matrix as well as 32 entries corresponding to the input ŷ vectors.¹¹ Assuming that each entry requires 16 bits on average, the number of input pads for the chip would be 768, which is not feasible in a cost-effective ASIC implementation. In order to avoid this number of pads in our ASIC, input values are received one-by-one, buffered, and consumed later at the proper time. Moreover, we use a multiplexed scheme where each 16-bit word is received over two consecutive clock cycles, i.e., 8 bits per clock cycle. The received entries are buffered at the input and are used at the proper time in the architecture. Using this scheme, the number of pads is significantly reduced; the drawback is a longer initial latency.

There are a few sub-blocks used throughout the architecture that are explained first, after which the major functional blocks are discussed.

Multiplication (MU): There are two types of multiplications involved in the overall architecture. The first type is realized using a 13-bit × 13-bit multiplier. The second type, multiplication by a constellation value s ∈ Ω, can be implemented using a faster architecture that also takes less area. This architecture, called MU, is shown in Fig. 4 using seven multiplexers (MUX) and adders (SUM), where the numbers on the right represent the bit locations in s (i.e., s can be represented with 4 bits), the shift symbols represent shifts to the left, and the small bubble denotes the negation operation. This implementation of the multiplication operation is possible because the values of s are drawn from a finite, pre-determined odd-integer set Ω. As will be discussed in Section IV-B-5, the motivation for designing the MU block is that this multiplication lies in the critical path of the architecture, which is on a feedback path.
¹¹There are four received ŷ vectors per channel matrix, while each ŷ vector has eight entries corresponding to the eight levels of the tree.
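A bit-level software model of the MU idea, multiplication by one of the odd constellation values using only shifts, a single add/subtract, and a negation, is sketched below for the 64-QAM set {±1, ±3, ±5, ±7}. The exact multiplexer arrangement of Fig. 4 may differ; this only demonstrates the arithmetic identity being exploited.

```python
def mu_multiply(r: int, s: int) -> int:
    """Multiply r by s in {+/-1, +/-3, +/-5, +/-7} without a general
    multiplier: shifts, one add/subtract, and an optional negation."""
    assert s in (-7, -5, -3, -1, 1, 3, 5, 7)
    mag = abs(s)
    if mag == 1:
        p = r
    elif mag == 3:
        p = (r << 1) + r        # 3r = 2r + r
    elif mag == 5:
        p = (r << 2) + r        # 5r = 4r + r
    else:
        p = (r << 3) - r        # 7r = 8r - r
    return -p if s < 0 else p

# exhaustive self-check against ordinary multiplication
assert all(mu_multiply(r, s) == r * s
           for r in range(-512, 512)
           for s in (-7, -5, -3, -1, 1, 3, 5, 7))
```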

Fig. 6. Architecture for Level I with the critical path highlighted.

Mapper: Once b (see Table II) is calculated, the first child needs to be determined by mapping this value to the nearest point in Ω (slicing). This is done in two consecutive stages. First, b is mapped to its nearest odd integer using the Mapper block, and then, if the result is outside the allowed boundary of Ω, it is bounded by the Limiter block to generate the first child. The process of mapping to the nearest odd integer is implemented by the Mapper block shown.

Limiter: The process of limiting the value to the predefined range is done through the Limiter block. In other words, if the Mapper output is outside the boundaries of Ω, the Limiter block guarantees that the upper/lower bound (e.g., ±7 in 64-QAM) is chosen as the selected point in Ω. The Limiter block is shown in Fig. 5, with examples of the action taken to determine the first child for three different cases.

1) Level I Block: The input to Level I is the last entry of ŷ together with the corresponding diagonal entry of R, and its output is the PEDs of all the elements of Ω, which are the nodes in the 8th level of the tree. The detail of the architecture is shown in Fig. 6. It employs a 13-bit × 13-bit multiplier, a few adders, and an absolute-value block. Note that the absolute-value block, representing the l1-norm, can be replaced by a squaring operation block (l2-norm), which can be easily implemented using a carry-save-adder technique [6]; however, simulation results show that the difference in BER performance is negligible [7]. Since Level I is on the feed-forward path of the architecture, a fine-grained pipelining technique can be employed inside the block in order to increase the system throughput. A two-stage pipeline is employed in Level I, realized by the 2 and 5 positive-edge-triggered flip-flops (FFs)/registers added in stages 1 and 2, respectively (see Fig. 6). In fact, by using the FFs the block is broken down into two consecutive stages, which avoids a long critical path and implies that the critical path of the architecture contains only one multiplication. It is assumed that all the FFs used in this paper are triggered by the positive edge of the clock.
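Functionally, the Level I block amounts to evaluating the distance of the received value to every candidate in Ω. A minimal sketch is shown below, ignoring the fixed-point word lengths and the exact normalization used on chip; the l1/l2 switch mirrors the remark above that the absolute value can stand in for the squaring operation with negligible BER impact.

```python
OMEGA = (-7, -5, -3, -1, 1, 3, 5, 7)

def level1_peds(y8: float, r88: float, use_l1: bool = True):
    """Root-level expansion: PED of every candidate symbol of the 8th tree
    level, i.e. the distance |y8 - r88 * s| (or its square) for s in OMEGA."""
    out = []
    for s in OMEGA:
        e = y8 - r88 * s
        out.append((s, abs(e) if use_l1 else e * e))
    return out

for s, ped in level1_peds(2.7, 1.0):
    print(s, round(ped, 2))        # eight PEDs handed to the Level II block
```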


Fig. 7. Architecture for Level II with the critical path highlighted.

2) Level II Block: The input to Level II is the PED values of the 8th level and its output is the PED values of the first children in the 7th level of the tree. In fact, in the Level II block, the first children of the eight nodes in the 8th level are determined. Note that, due to the structure of the R matrix in (9), the first children in the 7th level of the tree are all the same and independent of their parents in level 8 (because R_{7,8} = 0). This child is determined and is used to calculate the updated PED values of the nodes in the 7th level. Since R_{7,8} = 0, no extra input is required for the calculations in Level II: in order to find the first child in the 7th level, the corresponding normalized received value is applied to the input of the Mapper/Limiter block, whose output is the first child. The architecture of the Level II block is shown in Fig. 7. Once the first child is determined, it is multiplied by the corresponding R entry using the MU block; the input normalized value is scaled accordingly, after which the distance between the first child and the received value is calculated and the result is added to the 8th-level PEDs to derive the eight updated PEDs of the 7th level. The fine-grained pipelining technique has also been employed in this block to break it into four stages in order to limit the length of the critical path.

3) Sorter Block: The input to the Sorter block is the set of eight PED values of the 7th-level FCs, and the main task of this block is to generate the sorted list of these PED values. The architecture of the Sorter is shown in Fig. 8. The eight inputs and outputs are stored in eight registers, shown by the grey flip-flops labeled with the letter N. The Ctrl signal is used to load the data in one clock cycle. Using this architecture, it takes four clock cycles to sort all eight PED values. This architecture can be used as a general sorter, which sorts N numbers in N/2 clock cycles, because it implements two tasks (max/min and the data exchange) in one clock cycle through the introduction of intermediate registers. One such set of consecutive minimizations is highlighted in Fig. 8, which is also the critical path of the Sorter block. Note that the factor N on the FFs shown in Fig. 8 represents a register bank of length N bits, used to store the child list (path history) as well as the updated PED values.¹²
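One plausible behavioral reading of the Sorter block, sorting N values in roughly N/2 clock cycles by folding a compare (max/min) and the data exchange into the same cycle, is an odd-even transposition network evaluated two stages per clock. The sketch below models that reading; the actual wiring of Fig. 8 may differ, so this is only a functional stand-in.

```python
def sorter(values):
    """Sort N values in about N/2 'clock cycles', two compare-and-exchange
    transposition stages per cycle (behavioral model, not the Fig. 8 netlist)."""
    v = list(values)
    n = len(v)
    cycles = 0
    for _ in range((n + 1) // 2):             # one clock cycle ...
        for start in (0, 1):                  # ... performs two stages
            for i in range(start, n - 1, 2):
                if v[i] > v[i + 1]:
                    v[i], v[i + 1] = v[i + 1], v[i]
        cycles += 1
    return v, cycles

print(sorter([0.9, 0.2, 1.4, 0.5, 1.1, 0.3, 0.8, 0.7]))
# -> the eight first-child PEDs sorted after 4 cycles
```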

4) PE I Block: PE I is a general block used for all the levels from level 7 down to level 2. It receives the sorted list of the first children of each level and generates the K best candidates of that level. For instance, the output of the PE I block in level 7, called NC-L7 in Fig. 3, is the K = 10 consecutive best candidates with the lowest PED values in level 7, generated one-by-one in series at the output. The architecture of PE I is shown in Fig. 9. It consists of a sorter and a block called the NC-Block on the feedback path. In fact, PE I receives the sorted list of PEDs from the preceding stage, finds the entry with the lowest PED and announces it as the next K-best candidate at the output, then calculates the next best sibling of the announced child through the NC-Block and feeds it back to the sorter to locate the correct position of the new sibling in the already sorted list in PE I. The following points clarify the details of this architecture. The main task of the sorter in this block is to receive a sorted list and to find the correct position of a new entry in the sorted list, while announcing the entry with the lowest PED every clock cycle. Before the sorted PED values of the preceding stage are loaded into the PE I block, a reset signal, Rst in Fig. 9, initializes all the register banks (except the one attached to the output) to the maximum possible number. This is necessary to avoid any interference from the values stored previously and makes the registers ready to process the new list. The Rst signal also initializes the control signal, Ctrl in Fig. 9, to zero; Ctrl is used to load the sorted list from the preceding stage into the PE I block. Note that the data in the sorted list is loaded one pair at a time, which guarantees the proper functionality of the architecture when PE I and PE II operate together. A snapshot of the timing relationship between the Clk, Rst, and Ctrl signals is also shown by an example in Fig. 9. The critical path of the PE I block is highlighted in Fig. 9; it contains a MUX, a comparator, and the NC-Block. The main task of the NC-Block is to determine the next best sibling of an already announced best child. It also finds the PED value of this sibling and sends the information to the sorter in the PE I block. Since the NC-Block is on the feedback portion of the architecture, pipelining cannot help to increase the throughput. Since this path is the critical path of the whole MIMO architecture, an efficient architecture needs to be proposed for the NC-Block to ensure an overall high-throughput design.

5) NC-Block: The detail of the NC-Block architecture is shown in Fig. 10(a). The NC-Block in the i-th level needs to calculate the number of jumps (in Table II) and the direction of the next move (SignBit), and finally calculates the PED value of the new sibling.¹³ These three tasks are implemented by the architecture shown in Fig. 10(a).
¹²The path list grows from one level to another and therefore so does the value of N, i.e., the value of N is different for each stage.

¹³The signal SB in Fig. 10(a) represents the sign bit of the result of the adder.


Fig. 8. Architecture for the Sorter block with the critical path highlighted.

Fig. 9. Architecture for the PE I block with the critical path highlighted.

In Fig. 10(a), SignBit determines the direction of the SE enumeration for the next child, and Uout (Lout) determines whether the SE enumeration has reached the upper (lower) boundary of the set Ω. In fact, according to (5) and (6), the PED value of the new sibling can be determined as

  T_i^{new} = T_{i+1} + | b_i − s̃_i^{new} |²    (10)

where b_i refers to the quantity defined in (7), s̃_i^{new} denotes the new sibling, and the scaling by R_{i,i} has been omitted for brevity of discussion. As mentioned, any effort to simplify this block and/or reduce its critical path has a direct and significant effect on the total achievable data rate. To optimize this block, the following two techniques were utilized in our VLSI architecture:
1) Avoid multiplication: Since the value of b_i in (7) depends only on the selected symbols up to level i+1 and is independent of the current sibling, the quantities needed for (10) can be calculated by the FC-Block in the preceding block¹⁴ (Section IV-B-7) and forwarded to the NC-Block as inputs [see Fig. 10(a)]. This is the preferred approach, as the required multiplication is rescheduled, removed from the critical path, and shifted to a block that is pipelineable. Moreover, the second multiplication, i.e., the multiplication by the new sibling value, is realized using the MU block.
¹⁴For PE I of NC-L7, the preceding block is Level II, while for all other PE I blocks it is the PE II block in the preceding stage. For instance, for PE I in NC-L3, this is done in PE II in FC-L3.

2) Broken critical path: As can be seen from the NC-Block architecture [Fig. 10(a)], the critical path contains three adders (one 4-bit and two 16-bit adders) as well as the MU block. The critical path associated with this architecture is 4.8 ns in 0.13 μm CMOS technology using a commercial standard cell library. The first part of the critical path [the 1st section in Fig. 10(a)] calculates the next sibling, which can be transferred to the FC-Block in the preceding block. This means that the FC-Block calculates both the first and the second best child of each parent and sends them to the NC-Block. The NC-Block then calculates the PED value of the second best child while determining the third best child, and so on. This implies that the NC-Block always calculates one child ahead. Using this approach in our ASIC implementation yields a critical path of 3.65 ns, and thus a higher overall throughput. This scheduling technique effectively breaks the critical path of the NC-Block into two smaller parts [1st and 2nd sections in Fig. 10(a)], as shown in Fig. 10(b) with the improved critical path. The first section of the NC-Block, on the right-hand side of Fig. 10(b), is denoted NCSub; it is the block that is added to the preceding FC-Block in order to calculate the second best child. The second section consists of two adders, whose complexity is independent of the constellation size and the K value.

6) PE II Block: The output of PE I is the serial list of the K best candidates of the current level, generated one-by-one at the output. As each of the K-best candidates is generated, it is sent to the PE II block to calculate the first children of the next level and to sort them as they arrive.


Fig. 10. Architecture for the NC-Block inside the PE I block with the critical path highlighted. (a) Original. (b) Improved.

Fig. 11. Architecture for the PE II block with the critical path highlighted.

The architecture of the PE II block is shown in Fig. 11, with its input port and output ports indicated. At the beginning, the first child of each K-best candidate of the previous stage and its updated PED value are calculated by the FC-Block, and then, using the sequential sorter, the calculated PED values are sorted as they arrive. Note that this process is performed on a per-cycle basis, since the PE II block is connected to the output of the PE I block in a pipelined fashion. In the proposed architecture for PE II, the sorted PEDs are stored in the register banks, depicted by N-bit registers in Fig. 11. At every clock cycle, two register banks are updated at the same time, because the registers on the upper part of the sorter are located in every other stage. The functionality of the sorter is such that larger values are shifted to the right while smaller values are shifted to the left. Once the last element (the 10th element in 64-QAM) enters the sorter, it updates the first two register banks, so the first two are guaranteed to hold the two smallest PED values. Therefore, at the next clock cycle, they can be transferred to the following PE I block. After the second clock cycle, the next two register banks are updated. Therefore, the PED values are transferred to the next level on a pair-by-pair basis. This fact is shown in Fig. 3 with the

grey lines between the PE II block and the PE I block; the numbers on them represent the number of clock cycles, after the arrival of the last K-best candidate at the PE II block, at which the values are transferred. Note also that once the last element comes in and the first two register banks are sent to the next stage, the internal min/max functions should be initialized to the highest positive number, to avoid a comparison between the first element of the next iteration and the last element of the current iteration (done using the Rst signal in Fig. 11). This makes the core utilization 100%, as PE I and PE II are fully pipelined with zero latency with respect to one another.¹⁵

7) FC-Block: The main task of the FC-Block is to calculate the b value in (7), the first child of the current parent based on the calculated b, and its PED. It also determines the second best child and its corresponding PED value, as well as the values required by its following NC-Block, as mentioned in Section IV-B-5. The proposed architecture for the FC-Block is shown in Fig. 12.
¹⁵Note the difference between the functionality of the PE I and PE II blocks. The PE I block gets a sorted list as an input and finds the location of the newly generated nodes in this sorted list; it also announces the smallest node at the same time. The PE II block, however, is a sorter that takes as input nodes arriving one by one and announces the sorted list two entries at a time.
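The contrast drawn in the footnote above can be made concrete with a toy model of the PE II behavior: PED values arrive one per clock cycle, are kept in sorted order as they arrive, and, once the last (K-th) one is in, are handed to the next PE I block two at a time. The class below is purely illustrative (it tracks only the PEDs, not the path history stored in the real register banks).

```python
import bisect

class PE2Sorter:
    """Toy sequential sorter modelling the PE II data flow."""

    def __init__(self, K: int):
        self.K = K
        self.sorted_peds: list[float] = []

    def push(self, ped: float) -> None:
        bisect.insort(self.sorted_peds, ped)      # insert as it arrives

    def drain_pairs(self):
        for i in range(0, self.K, 2):             # pair-by-pair hand-off
            yield self.sorted_peds[i:i + 2]

pe2 = PE2Sorter(K=10)
for ped in [0.9, 0.2, 1.4, 0.5, 1.1, 0.3, 0.8, 0.7, 1.0, 0.6]:
    pe2.push(ped)
print(list(pe2.drain_pairs()))
```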


TABLE III FIXED-POINT WORD-LENGTH (bits) OF PARAMETERS

[n : m]: an n-bit number with m bits for the fractional part.


TABLE IV COMPARISON BETWEEN DIFFERENT K-BEST IMPLEMENTATIONS

The number in parenthesis represents the latency after pipelining.

In order to increase the total throughput of the architecture, pipelining has been used through the introduction of FFs on all the forward paths. The proposed architecture for the FC-Block consists of five pipeline levels. Each FC-Block is used inside a PE II block. In the first pipeline level (see Fig. 12), there are six MU blocks; however, depending on the PE II block in which it is used, only part of these MU blocks are implemented. For instance, for the PE II blocks of stages FC-L6 and FC-L5, only the first two MUs are implemented, whereas for the PE II blocks of stages FC-L2 and FC-L1, all six MUs are implemented. The first two levels of the architecture calculate the interference term needed for b in (7), and in the third level the value of b is calculated, which is used to obtain the first child using the Mapper and Limiter blocks. The number of moves and the direction of the move for the SE enumeration to determine the next best child, required by the NC-Block, are determined in level 4 through the SignBit, Uout, and Lout signals. Finally, the blocks in level 5 calculate the PED value of the announced first child. The FC-Block also calculates the second best child through the introduction of the NCSub block, which was described in Section IV-B-5. The updated values of SignBit, Uout, Lout, the second best child, and its scaled value are sent to the output. All of the above blocks are interconnected in a pipelined fashion, and at every clock cycle a data exchange occurs between adjacent blocks. This means all the data are calculated and transferred sequentially, operand-by-operand, between the blocks. A proper scheduling scheme at the input of the chip guarantees the delivery of the correct R and ŷ values to the blocks.

Fig. 12. Architecture for the FC-Block inside the PE II block with the critical path highlighted.

C. Latency and Bit-True Simulation

The fine-grained pipelining used inside the blocks improves the throughput at the cost of a larger latency. Starting from the first block, the latency of Level I is 2 cycles, Level II has a 3-cycle latency, the Sorter block's latency is 4 cycles, each PE I block's latency is K = 10 cycles, and finally each PE II's latency is 10 cycles plus an additional 6 cycles for the pipelined FC-Block. Therefore, according to the architecture in Fig. 3, the total latency of the architecture is the sum of these contributions over all the pipeline stages, corresponding to the measured overall latency of 0.6 μs reported in Section VI. Table III shows the number of bits associated with different variables in the algorithm for the fixed-point simulation of a 4 × 4, 64-QAM system in the form [n : m], where n denotes the total number of bits and m the number of bits for the fractional part. The fixed-point simulations are performed using the 2's complement number representation. Note that the word lengths in Table III have been derived based on extensive simulation results to minimize the BER loss relative to the floating-point result (i.e., less than 0.5 dB loss at the BER of interest).

D. Complexity Analysis

Table IV shows the complexity comparison between different schemes. For the sake of comparison, the number of visited children (expand), the required number of clock cycles for the sorting (sort), and the total latency are considered. The values listed in the expand row refer to the number of PED values that must be calculated, which directly translates into area and power. A key feature of our architecture is that only 2K − 1 children need to be calculated in each level, whereas in other approaches, e.g., [7], the PEDs of all the children of all parent nodes must be calculated. The last row of the table indicates whether the length of the critical path of the architecture grows with the constellation order Q. The number in parentheses refers to the latency of the pipelined architecture.
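As a back-of-the-envelope consistency check on the numbers above and in Section VI: one detected 4 × 4 64-QAM vector (24 bits) is completed every K = 10 clock cycles, which at the measured 282 MHz clock reproduces the reported throughput. The three constants below are taken from the paper; the arithmetic is mine.

```python
f_clk = 282e6             # measured maximum clock frequency (Hz)
bits_per_vector = 4 * 6   # 4 spatial streams x 6 bits per 64-QAM symbol
cycles_per_vector = 10    # one new vector every K = 10 clock cycles

throughput = f_clk * bits_per_vector / cycles_per_vector
print(f"{throughput / 1e6:.0f} Mbps")   # ~677 Mbps, i.e. the ~675 Mbps reported
```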


TABLE V COMPARISON OF CURRENT ASIC IMPLEMENTATIONS OF 4 × 4 MIMO DETECTORS

Although a 64-QAM system has not been implemented in [7], the complexity results for K = 10 have been reported, which is equivalent to the K used in our 64-QAM design. We use this as a set of optimistic values for comparison purposes. The energy/bit of the designs in [6], [16], and [19] has been scaled to a 0.13 μm equivalent CMOS process.

V. DESIGN COMPARISON

The proposed VLSI architecture was modeled in Verilog HDL, synthesized using Synopsys Design Compiler, and placed and routed using Cadence SoC Encounter/Silicon Ensemble. The RTL and gate-level netlists were verified against the golden model generated from the fixed-point MATLAB model. The final ASIC was fabricated in a 0.13 μm IBM CMOS technology using ARM standard library cells. Table V gives an overview and comparison of the reported ASIC implementations in the literature for 4 × 4 16-QAM and 64-QAM MIMO systems. From this table, the following points can be inferred:
1) Most of the ASIC implementations are for 16-QAM, as both the expansion and the sorting scheme for this constellation are simpler, with a much lower hardware complexity than for 64-QAM.
2) Scalability refers to how the expansion scheme, the sorting scheme, and/or the critical path of the proposed ASIC scale with the constellation size. Most of the approaches have a complexity that grows with the constellation size, which significantly reduces the maximum clock frequency and hence the throughput. This design, however, scales sub-linearly, since the number of expanded and sorted children increases only with the value of K and is independent of the constellation size.
3) Comparing this paper to [12], which are both fabricated in 0.13 μm CMOS, reveals a significant reduction in area in kilo-gates (kG) achieved using our proposed scheme.¹⁶
4) Since for 64-QAM both the value of K and the critical path of the sorting operation scale up [7],¹⁷ the fastest 64-QAM implementation reported to date, [16], operates at 115 Mbps. Since the problem of scaling is alleviated in this paper, our ASIC implementation achieves a significantly higher throughput of 675 Mbps with a maximum operating clock frequency of 282 MHz.
¹⁶The scaling of the input values is not considered in the calculation of the total area. ¹⁷Although a 64-QAM system has not been implemented in [7], the complexity results for K = 10 have been reported, which is equivalent to the K used in our 64-QAM design. We use this as a set of optimistic values for comparison purposes.

Therefore, the achieved throughput in this paper is 2.6× higher than the best implementation to date when scaled to the same 0.13 μm CMOS technology. This data rate satisfies the peak data rates required for next-generation WiMAX¹⁸ and LTE¹⁹ systems.
5) Near-ML refers to a performance close to the optimal ML performance. Most schemes, such as [11], [12], and [16], have a performance loss compared to the exact K-best algorithm due to their approximated/relaxed schemes, whereas this design has a higher throughput while providing the exact K-best results.
6) The proposed architecture has the significant advantage of an SNR-independent throughput, which is not the case for some of the designs in the literature (see Table V). For instance, in [16], changing the SNR value from 17.7 dB to 16.2 dB decreases the throughput from 115 Mbps to 42 Mbps.

VI. TEST RESULTS

The K-best MIMO detector was fabricated in a 0.13 μm CMOS-8LM process (see the die micrograph in Fig. 13) and was tested using an Agilent (Verigy) 93000 high-speed digital tester and a Temptronic TP04300 thermal forcing unit. The nominal core supply voltage is 1.2 V, whereas the I/O ring voltage is 2.5 V. With a 1.3 V supply, the measured maximum clock frequency is 282 MHz. The functionality of the detector was verified by passing input vectors at different SNR values (9 dB to 36 dB) to the chip through the tester and comparing the detector outputs with the expected values from the bit-true simulations in both MATLAB and NC-Verilog.
¹⁸One proposal for peak data rates for IEEE 802.16m is as follows [17]: (i) very low rate data: 16 Kbps; (ii) low rate data and low multimedia: 144 Kbps; (iii) medium multimedia: 2 Mbps; (iv) high multimedia: 30 Mbps; (v) super high multimedia: 30 Mbps to 100 Mbps/1 Gbps. In fact, 1 Gbps throughput is easily achieved by our design in 65 nm CMOS, as FO4 gate delays scale from 20 ps in the 0.13 μm CMOS technology of our prototype to 10 ps typical of 65 nm CMOS technology. ¹⁹The peak data rates in LTE are up to 326.4 Mbps for the downlink and 50 Mbps for the uplink.



Fig. 13. Micrograph of the implemented ASIC.
Fig. 15. Measured throughput vs. energy/bit. Results of the designs in [6] and [19] have been scaled to a 0.13 μm equivalent CMOS process.

Fig. 16. Measured BER at a sustained throughput of 675 Mbps (282 MHz clock frequency), dissipating 135 mW at a 1.3 V supply and 25 °C.
Fig. 14. Measurement plots for maximum clock frequency and power dissipation vs. supply voltage at 25 °C.

Fig. 14 shows a Shmoo plot depicting the maximum operating frequency and the total power dissipation of the design versus the supply voltage. All fabricated chips were tested, and the average and max/min values of the achieved frequency are shown in Fig. 14. Operating at a clock rate of 282 MHz with an overall latency of 0.6 μs results in a measured sustained throughput of 675 Mbps, dissipating 135 mW at a 1.3 V supply and 25 °C. (This translates into a sustained throughput of about 170 Mbps per transmit antenna in a 4 × 4 MIMO receiver.) The temperature was forced to 25 °C using the Temptronic TP04300 thermal forcing unit (TFU). Using the TFU, test results at 80 °C yield a clock rate of 250 MHz while dissipating 104 mW at a 1.2 V supply, producing a sustained data rate of 600 Mbps. Fig. 15 shows a comparison between the reported ASIC implementations of 4 × 4 64-QAM as well as 16-QAM MIMO detectors; previous publications with measured or estimated power dissipations are shown in the figure. The values for the designs in [6] and [19] have been scaled to a 0.13 μm equivalent CMOS process.

The comparison graph confirms that measurements from this design achieve 2.6× better throughput per area compared to the best reported design and, at the same time, consume 3.0× less energy per bit compared to the previous best design. This advantage is due to the efficient expansion and sorting operations used in this design. The measured BER results are shown in Fig. 16 for a single-carrier 4 × 4 64-QAM MIMO system. The test vectors used to exercise the chip and generate the BER curve represent a total of 100,000 packets, where each packet consists of 96 bits²⁰ (9.6 Mbits in total). Test vectors are created using: 1) pseudo-random data; 2) a complex-valued random Gaussian channel matrix with statistically independent elements, updated every four channel uses; and 3) additive white Gaussian (circularly symmetric) complex random noise. Test results agree with the expected golden vector set, confirming correct operation of the test chip. A partial-scan methodology was employed to provide a sufficient level of observability and controllability
²⁰This is because 96 = 4 (N_T) × 6 (number of bits per constellation symbol) × 4 (channel updates every four channel uses).


TABLE VI CHARACTERISTICS SUMMARY OF DETECTOR AND MEASURED RESULTS

during the testing procedure. Table VI summarizes the design and performance characteristics.

VII. CONCLUSIONS

A high-throughput silicon implementation of the K-best algorithm suitable for high-order constellation schemes has been presented, which has the smallest number of visited nodes as well as the highest throughput reported to date. The key innovations are the introduction of an on-demand expansion scheme and distributed sorting schemes operating in a pipelined fashion. In 0.13 μm CMOS, the design achieves a throughput of 675 Mbps at a clock frequency of 282 MHz, satisfying the throughput requirements of next-generation WiMAX and LTE systems.

REFERENCES
[1] M. Shabany and P. G. Gulak, "A 0.13 μm CMOS, 655 Mb/s, 64-QAM, K-best 4 × 4 MIMO detector," in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2009, pp. 256-257.
[2] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, "Closest point search in lattices," IEEE Trans. Inf. Theory, vol. 48, no. 8, pp. 2201-2214, Aug. 2002.
[3] U. Fincke and M. Pohst, "Improved methods for calculating vectors of short length in a lattice, including a complexity analysis," Math. Comput., vol. 44, pp. 463-471, Apr. 1985.
[4] J. Jaldén and B. Ottersten, "On the complexity of sphere decoding in digital communications," IEEE Trans. Signal Process., vol. 53, no. 4, pp. 1474-1484, Apr. 2005.
[5] K. W. Wong, C. Y. Tsui, R. S. K. Cheng, and W. H. Mow, "A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels," in Proc. IEEE Int. Symp. Circuits Syst., May 2002, vol. 3, pp. 273-276.
[6] Z. Guo and P. Nilsson, "Algorithm and implementation of the K-best sphere decoding for MIMO detection," IEEE J. Sel. Areas Commun., vol. 24, no. 3, pp. 491-503, Mar. 2006.
[7] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, "K-best MIMO detection VLSI architectures achieving up to 424 Mbps," in Proc. IEEE Int. Symp. Circuits Syst., 2006, pp. 1151-1154.
[8] L. Davis, "Scaled and decoupled Cholesky and QR decompositions with application to spherical MIMO detection," in Proc. Wireless Commun. Netw. Conf., Mar. 2003, vol. 1, pp. 326-331.
[9] B. M. Hochwald and S. ten Brink, "Achieving near-capacity on a multiple-antenna channel," IEEE Trans. Commun., vol. 51, no. 3, pp. 389-399, Mar. 2003.
[10] A. Burg et al., "VLSI implementation of MIMO detection using the sphere decoding algorithm," IEEE J. Solid-State Circuits, vol. 40, no. 7, pp. 1566-1577, Jul. 2005.
[11] H.-L. Lin, R. C. Chang, and H. Chan, "A high-speed SDM-MIMO decoder using efficient candidate searching for wireless communication," IEEE Trans. Circuits Syst. II, vol. 55, no. 3, pp. 289-293, Mar. 2008.

[12] S. Chen, T. Zhang, and Y. Xin, "Relaxed K-best MIMO signal detector design and VLSI implementation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 3, pp. 328-337, Mar. 2007.
[13] P. A. Bengough and S. J. Simmons, "Sorting-based VLSI architecture for the M-algorithm and T-algorithm trellis decoders," IEEE Trans. Commun., vol. 43, no. 3, pp. 514-522, Mar. 1995.
[14] C. P. Schnorr and M. Euchner, "Lattice basis reduction: Improved practical algorithms and solving subset sum problems," Math. Program., vol. 66, pp. 181-191, 1994.
[15] M. Shabany, K. Su, and P. G. Gulak, "A pipelined high-throughput implementation of near-optimal complex K-best lattice decoders," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2008, pp. 3173-3176.
[16] S. Chen and T. Zhang, "Low power soft-output signal detector design for wireless MIMO communication systems," in Proc. Int. Symp. Low Power Electron. Design, 2007, pp. 232-237.
[17] WiMAX Forum, "WiMAX Forum mobile system profile release 1.0 approved specification (revision 1.4.0: 2007-05-02)," WiMAX Forum, May 2005.
[18] Q. Li and Z. Wang, "An improved K-best sphere decoding architecture for MIMO systems," in Proc. 40th Asilomar Conf. Signals, Syst., Comput., Nov. 2006, pp. 2190-2194.
[19] C.-Y. Yang and D. Markovic, "A flexible DSP architecture for MIMO sphere decoding," IEEE J. Solid-State Circuits, vol. 56, no. 10, pp. 2301-2314, Oct. 2009.

Mahdi Shabany (S'04-A'08) received the B.Sc. degree in electrical engineering from Sharif University of Technology, Tehran, Iran, in 2002, and the M.A.Sc. and Ph.D. degrees, both in electrical engineering, from the University of Toronto, Toronto, Canada, in 2004 and 2008, respectively. From 2007 to 2008, he was with Redline Communications, Toronto, Canada, where he developed and patented designs for WiMAX systems. He also served as a post-doctoral fellow at the University of Toronto in 2009. He is currently an Assistant Professor in the Electrical Engineering Department at Sharif University of Technology, Tehran, Iran. His main research interests include digital electronics and VLSI architecture/algorithm design for broadband communication systems.

P. Glenn Gulak (S'82-M'83-SM'96) received the Ph.D. degree from the University of Manitoba, Winnipeg, MB, Canada. While at the University of Manitoba, he held a Natural Sciences and Engineering Research Council of Canada Postgraduate Scholarship. He is a Professor with the Department of Electrical and Computer Engineering, University of Toronto, ON, Canada, as well as a Senior Member of the IEEE and a registered Professional Engineer in the Province of Ontario. His present research interests are focused on algorithms, circuits, and system-on-chip architectures for digital communication systems and for biological lab-on-chip microsystems. He has authored or co-authored more than 100 publications in refereed journals and refereed conference proceedings. In addition, he has received numerous teaching awards for undergraduate courses taught in both the Department of Computer Science and the Department of Electrical and Computer Engineering at the University of Toronto. He held the L. Lau Chair in Electrical and Computer Engineering for the five-year period from 1999 to 2004. He currently holds the Canada Research Chair in Signal Processing Microsystems and the Edward S. Rogers Sr. Chair in Engineering. From January 1985 to January 1988, he was a Research Associate in the Information Systems Laboratory and the Computer Systems Laboratory at Stanford University, Stanford, CA. From March 2001 to March 2003, he was the Chief Technical Officer and Senior Vice President of LSI Engineering, a fabless semiconductor startup headquartered in Irvine, CA, with $70M USD of financing, focused on wireline and wireless communication ICs. Dr. Gulak served on the ISSCC Signal Processing Technical Subcommittee from 1990 to 1999, was ISSCC Technical Vice-Chair in 2000, and served as the Technical Program Chair for ISSCC 2001. He currently serves on the Technology Directions Subcommittee for ISSCC. He was the recipient of the IEEE Millennium Medal in 2001.
