

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 8, AUGUST 2009

TABLE I = VERSUS SYSTEM P MEASUREMENT OF SYSTEM MAXIMUM P WITH OPTIMAL POWER TRACKING



Design and Implementation of a Field Programmable CRC Circuit Architecture


Fig. 10. (a) Measurement of $V_{VCO}$ voltage versus light intensity change. (b) Transient response of $V_{VCO}$ to the light intensity change.

Ciaran Toal, Kieran McLaughlin, Sakir Sezer, and Xin Yang

under the same light intensity, $V_{VCO}$ oscillates around the optimal value. When the light intensity changes, the power tracking unit tracks the change and adjusts the $V_{VCO}$ voltage level to the new optimal point. Fig. 10(b) shows the detailed changes of the $V_{VCO}$ value during the transient period.

VI. CONCLUSION

A novel power management system was proposed for solar energy harvesting applications. An inductor-less charge pump was used to step up the PV voltage for battery charging in dim lighting environments. The system operating behavior was discussed. The maximum power tracking algorithm and its circuit implementation were presented to obtain the maximum system output power. The design was fabricated, and measurement results demonstrated the system operation.

Abstract: The design and implementation of a programmable cyclic redundancy check (CRC) computation circuit architecture, suitable for deployment in network related system-on-chips (SoCs), is presented. The architecture has been designed to be field reprogrammable so that it is fully flexible in terms of the polynomial deployed and the input port width. The circuit includes an embedded configuration controller that has a low reconfiguration time and hardware cost. The circuit has been synthesised and mapped to 130-nm UMC standard cell [application-specific integrated circuit (ASIC)] technology and is capable of supporting line speeds of 5 Gb/s.

Index Terms: Cyclic redundancy check (CRC), error detection, field programmable, network processing, reconfigurable.

I. INTRODUCTION

Cyclic redundancy check (CRC) is an error detecting code that is widely used to detect corruption in blocks of data that have been transmitted or stored. A standalone intellectual property (IP) core is ideal for accelerating CRC computation in many network and server applications. Hardware configurability that allows unrestricted CRC sizes and polynomials to be deployed enables a wide range of network transmission, storage and security applications to be supported at a low cost. The cost of chip design continues to increase due to factors such as high mask and respin costs. Next generation system-on-chip (SoC) designs are highly expensive and therefore must be configurable to a range of applications and future proof where either product updates or protocol migration can occur. Adding flexibility through in-field hardware configurability is a key method that enables the cost of designs to be reduced. In this paper, we derive a fully field programmable, parallel architecture for a CRC computation circuit. The objective was to explore a domain specific programmable architecture capable of supporting 5 Gb/s line rates at a minimal area cost. The resulting architecture is able to support all types and sizes of CRC polynomial, for all types of protocols and data encryption. Furthermore, the circuit can handle a variable number of input octets at runtime for byte orientated variable sized protocols. An embedded self-reconfiguration controller allows


Manuscript received December 17, 2007; revised March 11, 2008. First published June 16, 2009; current version published July 22, 2009. The authors are with ECIT, Queen's University Belfast, Northern Ireland Science Park, Queen's Road, Queen's Island, Belfast BT3 9DT, U.K. (e-mail: kieran.mclaughlin@ee.qub.ac.uk). Digital Object Identifier 10.1109/TVLSI.2008.2008741

1063-8210/$26.00 © 2009 IEEE


any CRC function to be configured, while minimizing programming time and complexity. This paper explores the architecture and functions of the field programmable CRC computation circuit and analyses its performance when implemented using standard cell UMC 130-nm technology.

II. CYCLIC REDUNDANCY CHECK

Data integrity is imperative for many network protocols, especially data-link layer protocols. Techniques using parity codes and Hamming codes can be used for data verification, but CRC is the preferred and most efficient method used for detecting bit errors produced from medium related noise. For example, Ethernet uses a 32-bit CRC polynomial for error detection. Data storage is another area where CRC error detection is becoming increasingly important. iSCSI [1] implementations that utilise the TCP/IP protocol to implement Storage Area Networks (SANs) require error detection to be deployed. These operate using multi-gigabit connection speeds and thus require CRC checks to be executed at high speed as well. In such systems, it is becoming common to offload TCP/IP operations to hardware. CRC computation faces similar computational overhead constraints in software and is thus an ideal candidate for offloading to specialized hardware. As this is an evolving area of research, new standards and protocols are likely to emerge in the short and medium term future. The ability of CRC implemented in hardware to be reconfigurable to handle new protocols will offer a key advantage in this fast developing area. The fact that the reconfigurable CRC circuit that has been implemented can quickly switch between any polynomial gives it a key advantage over the other circuits referenced, in terms of flexibility and ease of upgrade for new and emerging applications and standards. Advanced NPUs have to support most communications protocols, while maintaining a throughput performance of 1–10 Gb/s. Such performance cannot be achieved without a highly flexible CRC offload engine capable of parallel computation. iSCSI in particular is a good illustration of an application where new standards or more suitable polynomials can emerge. None of the publications referenced here offer the ability to reconfigure hardware in the field to operate with an entirely new CRC polynomial.

A. CRC Related Background

A large number of CRC polynomials of various lengths are available to use over a range of applications. Reference [2] investigates a total of 48 polynomials, ranging in length from 3 to 16 bits, that are suitable for embedded network applications utilizing CRC error detection. The paper shows how the various polynomials have been assessed for their ability to detect error patterns in messages. It shows that for different data word lengths, different CRC polynomials can be more suitable than others. This assessment is carried out based on maximum Hamming distances. Similarly, [3] investigates a number of 32-bit CRC polynomials, all suitable for network applications such as Ethernet and iSCSI. CRC functions have been widely implemented in software using methods such as lookup tables [4] and shift and addition [5]. Further research has investigated hardware architectures that can better exploit parallelism. The fundamental work on parallel CRC computation was introduced by Pei in 1992 [6]. Braun [7] addressed the hardware mapping problem of the parallel CRC algorithm by introducing a slightly different matrix computation technique than Pei. Braun incorporated pre- and post-CRC computation circuits to achieve a 32-bit checksum word at 450 Mbps using FPGA technology in 1996. Reference [8] addresses a technique that allows pipelining to increase the circuit speed, independent of the underlying technology.

Reference [9] derives a VLSI implementation of a 32-bit CRC generator circuit based on Galois field arithmetic and look-ahead blocks. With an eighth-order look-ahead function this circuit can operate at 100 MHz despite the dated 0.6-micron technology. The circuits are flexible in terms of the number of input bits processed at a time, up to 32 bits, but they are restricted to using one CRC polynomial. Reference [10] addresses the problem of processing variable sized packets in parallel by simply duplicating circuits and multiplexing between multiple custom implementations as required, i.e., if processing 32 bits and the last cycle of data is only 8 bits wide then this implementation multiplexes the data from a 32-bit circuit to an 8-bit circuit. The research details the VLSI implementation of a CRC-32 circuit for Ethernet. A standard cell and full custom implementation are presented using 180- and 350-nm technology respectively, operating at 1.09 GHz and 625 MHz. The circuits presented are highly customized and targeted for the CRC-32 polynomial selected. Although they operate very fast, the designs are not flexible or adaptable as they are intended for a single polynomial. Reference [11] describes a pipelined and parallel implementation for an FPGA based CRC function. The level of parallelism can be varied between 8 and 32 bits, and the paper claims performance results of 1 to 4 Gb/s (depending on the level of parallelism selected). Any polynomial can be selected before synthesis, but not after. Reference [12] describes the derivation of VHDL code with a generic construct that allows a designer to synthesise CRC circuits for any desired polynomial of length up to 32 bits. Word widths of 8, 12, 16, and 32 bits have been analyzed. The research concentrates on generating code in a generic style that includes parallelism in its structure, which is based on the linear feedback shift register (LFSR) presented by Pei. While this generic description is useful in terms of design reusability, it is only configurable presynthesis, after which the hardware is fixed and the CRC function is not configurable. Reference [13] uses a recursive mathematical formula to derive parallel CRC circuits that can be generated automatically. The examples use MATLAB code to generate the VHDL code for the circuit. The polynomial and number of bits to be processed in parallel can be specified separately. The method is flexible and is likely to save both time and cost in the design phase, yet like the other circuits, this one will be fixed to a single polynomial as the circuit itself is inflexible post-synthesis. Reference [14] is a commercially available core that operates on FPGA. Again, this uses a fixed CRC polynomial that cannot be reconfigured after deployment. The CRC-32 core is able to support 10/40 Gb/s line speeds by utilizing 64-/256-bit data buses, respectively. It is the wide data buses that allow this performance to be achieved. However, using wide input buses adds complexity to the CRC calculation where the end of a word does not fully fill the input bus. If the end of a word is 16 bits wide then the CRC must be computed for 16 bits; this cannot be done using a 32-bit input configuration. Reference [15] presents a software implementation of the iSCSI protocol that includes implementing CRC error detection, which is recognized as the key bottleneck in the system. The overall implementation operates on a 1.7 GHz Pentium M processor, which supports 3.6 Gb/s. None of the aforementioned state-of-the-art options support full in-field configuration flexibility at high speed specifically on hardware. Some allow flexibility in the design phase and others offer very high line-speed performance; however, none offer high line-speed with full flexibility, such as the support of different data-path widths and CRC generator polynomials. Although the software option [15] is likely to be very flexible, it comes at the expense of a Pentium processor. The next section outlines the derivation of a CRC circuit implementation that fulfils the outlined flexibility criteria.


III. DERIVATION AND IMPLEMENTATION OF THE FIELD PROGRAMMABLE CRC COMPUTATION CIRCUIT

CRC is a polynomial-based block coding method for detecting errors in blocks or frames of data. A set of check digits is computed for each frame scheduled for transmission over a medium that may introduce errors, and is appended to the end of the frame. The computed check digits are known as the frame check sequence (FCS). A CRC value is calculated as the remainder of the modulo-2 division of the original transmitted data by a specific CRC generator polynomial. For example, Ethernet uses the 32-bit polynomial value

$$G(x) = 1 + x + x^{2} + x^{4} + x^{5} + x^{7} + x^{8} + x^{10} + x^{11} + x^{12} + x^{16} + x^{22} + x^{23} + x^{26} + x^{32}.$$

To find the FCS, first a number of zeroes equal to the number of FCS digits to be generated are appended to the message $M(x)$. This is equivalent to multiplying $M(x)$ by $2^n$, where $n$ is the number of FCS digits. This value is then divided by the generator polynomial $G(x)$, which contains one more digit than the FCS. The division uses modulo-2 arithmetic, where each digit is independent of its neighbor and numbers are not carried or borrowed; thus addition and subtraction are performed via an exclusive-OR (XOR) function. The remainder $R(x)$ is appended to the end of the message before transmission. At the receiver, the message plus the FCS is divided by the same polynomial. If the remainder is zero then it can be assumed that no error has occurred. The field programmable CRC design is based on the fundamentals established in [6], which derives the $D$ matrix (1) that effectively forms the logic array of the CRC calculator circuit. Due to space restrictions the full derivation is not included here, but can be found in the reference.
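The division procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the circuit derived in this paper: the function names (`mod2_remainder`, `to_bits`, `int_to_bits`) are hypothetical, and the sketch deliberately omits the register preset, bit reflection and final complement that deployed Ethernet framing adds on top of the plain polynomial division.

```python
def mod2_remainder(bits, poly, k):
    """Remainder of (message * x^k) divided by G(x) in modulo-2 arithmetic.

    bits: message bits, first-transmitted bit first.
    poly: the low k coefficients of G(x) as an integer (the x^k term is implicit).
    k:    number of FCS digits.
    """
    mask = (1 << k) - 1
    rem = 0
    for b in bits:
        # The bit leaving the register, XORed with the incoming bit,
        # decides whether G(x) is subtracted (XORed) at this step.
        feedback = ((rem >> (k - 1)) & 1) ^ b
        rem = (rem << 1) & mask
        if feedback:
            rem ^= poly
    return rem


def to_bits(data):
    """Bytes to a list of bits, most significant bit of each byte first."""
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]


def int_to_bits(value, width):
    """Integer to a fixed-width list of bits, most significant bit first."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


# Low 32 coefficients of the Ethernet generator polynomial.
ETHERNET_POLY = 0x04C11DB7

message = to_bits(b"123456789")
fcs = mod2_remainder(message, ETHERNET_POLY, 32)

# Receiver side: message plus FCS leaves a zero remainder when error free.
received = message + int_to_bits(fcs, 32)
assert mod2_remainder(received, ETHERNET_POLY, 32) == 0
```

Processing one bit per loop iteration mirrors a serial shift register; the parallel architecture discussed in this section processes a whole input word per clock cycle instead.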

Fig. 1. Field programmable CRC architecture.

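$$D = \begin{bmatrix} G \\ GT \\ GT^{2} \\ \vdots \\ GT^{j-1} \end{bmatrix} \tag{1}$$

where

$$T = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ g(0) & g(1) & g(2) & \cdots & g(k-1) \end{bmatrix} \tag{2}$$

and $G$ is the generator polynomial (an example would be the polynomial used for Ethernet)

$$G = \begin{bmatrix} g(0) & g(1) & g(2) & \cdots & g(k-1) \end{bmatrix}. \tag{3}$$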

Each value in $G$ is multiplied (ANDed) with the corresponding value in $T$. The results are XORed together, producing a $D$ matrix of 0s or 1s. The position of the 1s in $D$ determines the position of XOR gates within the logic array, while $j$ is the width of the input port in bits. Enabling programmability for parallel CRC computation requires the $D$ matrix to be configurable for all known generator polynomials $G(x)$. Furthermore, configuration logic for the configurable XOR array is required to adjust the input and CRC sizes. Fig. 1 shows a diagram illustrating the architecture of the fully programmable, reconfigurable CRC circuit. The circuit is composed of six main components: the programmable input and feedback multiplexers, the configurable XOR array, the array configuration circuit and the CRC configuration processor. The top of the diagram shows the logic associated with the CRC cell array. The input data enters the array down the columns and the outputs are formed along the rows. The current CRC value is held in a register at the array output, which is fed back and XORed with the input data of the next clock cycle as part of the CRC computation process. The outputs are then stored in the registers for the next clock cycle. The CRC configuration parameters are passed via a microprocessor interface. The desired CRC polynomial $G(x)$ and the input port size are stored in registers. By selecting a signal Generate Matrix, a process is initiated that computes the configuration data and configures the CRC cell array with the required data. The configuration circuitry computes and configures one row of data every clock cycle. This reduces the memory required since the calculation of each row of the $D$ matrix is based on the result of the calculation of the previous row. The configuration data is broadcast to every column of the array, with one column enabled for configuration at a time, via one-hot encoding using a counter. When the CRC cells are fully configured, the configuration processor is not used. The Port Size Configure and CRC Size Configure signals control a set of multiplexers that enable/disable input and CRC feedback data to cater for the size of the CRC polynomial and the size of the input port. The Port Size Configure signal is also responsible for reconfiguring the circuit to process various input word sizes, e.g., if the port size is configured as 32-bit and the last cycle of the payload data contains only 16 bits to be processed, the Port Size Configure signal switches input bits 16 to 31 to the 0 input of each multiplexer, switches the bottom 16 multiplexers to the previous CRC data input at the left-hand side of the array, and finally routes bits 16 to 31 of the previous CRC data to rows 16 to 31 of the array.
The configurable XOR array is comprised of interconnected cells, corresponding to the $D$ matrix. Each cell can be configured as an XOR gate (a 1 in $D$), or as a basic input to output connection (a 0 in $D$).

TABLE I POST-LAYOUT SYNTHESIS RESULTS FOR UMC 130 nm

Fig. 2. Programmable CRC array cell.

Fig. 3. $D$ matrix row calculation, first loop.

Fig. 2 shows a diagram of one of the field programmable elements that facilitate this function in the array. The data-path can be configured so that the output will XOR the two inputs, or simply output Input 1. The control-path contains a configuration register which selects the data-path function and is programmed via the Config Data input when the Config Enable input is set high. The computation of the matrix is an iterative process, where the computation of each row is based on the result from the previous row. Therefore, one row is computed every clock cycle and configuring the whole matrix requires 33 clock cycles. This is the minimum time possible due to the feedback required. Fig. 3 shows the logic used to compute a matrix row in one clock cycle for an example 4-bit CRC polynomial, using the matrix data and the current matrix row signal. The diagram shows how a single bit of the matrix is computed on the first loop using the rows and columns shaded grey in the matrices. Subsequent loops progress through the remaining rows and columns. For a CRC-64 or higher, two or more of the data-path circuits (top of Fig. 1) are combined in a diagonal cascade with additional feedback wiring, so that column and row 0 of the second circuit become column and row 32 of the CRC-64. Similarly for circuits smaller than 32-bit, e.g., 8-bit, the XOR array is configured so that four CRC-8s run diagonally from the top left. The XOR array and wiring are used to connect the four blocks together with the routing necessary to compute the CRC-8s. This means 32 bits can be calculated in parallel for all CRC sizes and all CRC sizes support the same line-speed. Both the programmable architecture and an optimised implementation of the Ethernet CRC-32 polynomial were synthesized for the Altera Stratix II FPGA, to allow the cost of full programmability to be established.
The results showed that the programmable circuit operated at 117 MHz, while the Ethernet polynomial circuit operated at 233 MHz, nearly twice the frequency. The increase in area cost was approximately 700%, which is due to the introduction of feedback logic, the configurable array and the reconfiguration controller. Overall this is a significant increase in hardware, but the circuit now has significantly improved capabilities to operate an extensive range of CRC polynomials and widths instead of just one.
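The row-by-row construction of the $D$ matrix, and its use for a word-parallel CRC update, can be illustrated with the following sketch. It is a software model under stated assumptions rather than a description of the authors' hardware: it assumes that $D$ stacks the rows $G, GT, \ldots, GT^{j-1}$ as in (1), that the input port width equals the CRC width ($j = k$), and that each row is held as an integer whose bit $c$ is the entry in column $c$. The function names are hypothetical.

```python
def build_d_rows(poly, k, j):
    """Rows G, GT, GT^2, ..., GT^(j-1), each stored as a k-bit integer.

    Each new row is derived only from the previous one, which mirrors the
    one-row-per-iteration configuration process described in the text.
    """
    mask = (1 << k) - 1
    rows = [poly & mask]                  # row 0 is G itself
    for _ in range(j - 1):
        prev = rows[-1]
        nxt = (prev << 1) & mask          # multiplying by T shifts the row ...
        if (prev >> (k - 1)) & 1:         # ... and adds G when the top entry is 1
            nxt ^= poly & mask
        rows.append(nxt)
    return rows


def parallel_step(crc, word, rows):
    """One modelled clock cycle: XOR the fed-back CRC with the input word,
    then XOR together the D rows selected by the bits that are set."""
    selected = crc ^ word
    out = 0
    for i, row in enumerate(rows):
        if (selected >> i) & 1:
            out ^= row
    return out


def serial_crc(bits, poly, k):
    """Bit-serial reference: remainder of (message * x^k) divided by G(x)."""
    mask = (1 << k) - 1
    crc = 0
    for b in bits:
        feedback = ((crc >> (k - 1)) & 1) ^ b
        crc = (crc << 1) & mask
        if feedback:
            crc ^= poly
    return crc


POLY32 = 0x04C11DB7                       # low 32 coefficients of the Ethernet G(x)
data = b"12345678"                        # two 32-bit words
rows = build_d_rows(POLY32, 32, 32)

crc = 0
for offset in range(0, len(data), 4):
    word = int.from_bytes(data[offset:offset + 4], "big")
    crc = parallel_step(crc, word, rows)

bits = [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]
assert crc == serial_crc(bits, POLY32, 32)
```

In this model a set bit in a row plays the role of a cell configured as an XOR gate and a clear bit the role of a pass-through cell; changing `poly` and rebuilding the rows is the software analogue of reprogramming the array for a different generator polynomial.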

Fig. 4. Field programmable CRC circuit: silicon layout (UMC 130 nm).

IV. SYNTHESIS RESULTS AND PERFORMANCE ANALYSIS

The circuit has been described in VHDL and synthesised for a 32 × 32 cell array using Synopsys Physical Compiler and Cadence SoC Encounter EDA tools. The post layout synthesis results generated using UMC 130-nm technology libraries are shown in Table I. Fig. 4 shows the circuit layout. Table II compares the circuit with other designs included in Section II. The flexibility of [12]–[14] is limited to configuration options in the design phase but the circuits themselves are not reprogrammable. Reference [10] allows flexibility by multiplexing between predefined CRC circuits, but does not allow full reconfigurability. Reference [15] utilises software, which is likely to be fully reconfigurable in-field; however, this comes at the rather high expense of operating on a Pentium processor. On their 1.7 GHz test-bench, the CRC computation used 40%–50% of the processor overhead, which also represents a high cost in terms of power consumption. Given that CRC computation can account for 29% of the total computational cost in storage area networks [16], the power dissipation of the field programmable CRC circuit, which is less than 6 mW, can be considered very low. Unfortunately, power figures for the other references are not available for comparison. Although it is difficult to directly compare performance and area parameters for different technologies, some valid comparisons can be made. Comparing [13] with our initial Ethernet polynomial implementation on FPGA, the number of LUTs used is of the same order, so it can be expected that the same trade-offs that apply in our test implementations can also be directly applied here. Furthermore, to add flexibility to [13] would require taking the FPGA offline for reconfiguration, whereas the field programmable CRC circuit can be reconfigured in less than 1 µs, even on FPGA. In terms of area, the circuit of [10] is approximately 20 times larger than the programmable CRC circuit, with a similar throughput speed.


TABLE II CRC CIRCUIT PERFORMANCE COMPARISON

V. CONCLUSION

This paper presents the design and implementation of a novel field programmable CRC computation circuit that is an ideal IP core for VLSI deployment. The aim was to explore an architecture where all the CRC parameters were fully programmable. This was achieved by deriving an array of processing cells to implement a matrix based computation technique. The circuit uses an embedded configuration controller designed for both standalone and runtime programmability of the CRC circuit. This enables different CRC polynomials and I/O port and processing data-path widths to be deployed. The architecture is also generic in its design and can be scaled to 64, 128, or 256 bits in the datapath, enabling support of throughput rates up to 40 Gb/s at 256 bits. The tradeoff between flexibility, performance and cost has been taken further than that enabled by traditional heterogeneous architectures based on microprocessor, DSP and FPGA technology. Domain specific and field programmable processing cores, such as the presented CRC circuit, provide tailored flexibility while allowing high performance and low hardware cost. In conjunction with an embedded custom specific configuration controller, the programming task of the logic array is reduced to specific high-level instructions, executed by setting parameter registers. Complex logic synthesis and place and route functions are not required to programme the circuit, as is the case with traditional FPGA technologies; consequently, real runtime programmability is possible with a much reduced reconfiguration time and cost. Embedded domain specific programmable architectures present an opportunity for enhancing SoC designs that face performance and flexibility issues, as they strive to meet emerging design challenges imposed by ICT convergence. It is further anticipated that a physical oriented design methodology, such as a data-path compiler [17], can be used to optimize the regular structure of the programmable cell array, which could significantly increase the operational frequency while maintaining a low hardware cost. Such an optimized circuit represents an attractive hard macro for environments requiring low cost hardware flexibility, and in emerging areas such as iSCSI-based SANs, where the flexibility to adopt emerging protocols offers a key advantage to vendors.

Notes:
[10] Normally 32-bit parallelism; selects 8, 16 or 24 bits if the last word of data is not 32 bits.
[13] 350 nm platform example implementation: 4.38 Gb/s when utilising 32 bits in parallel.
[14] 64-bit/256-bit parallelism.
[15] Specifically supports iSCSI at this line speed, 1.7 GHz Pentium M processor.

The difference in area cost could be expected to be around 7:1 purely on account of the different process technologies (350/130 nm). The difference in area cost is therefore comparatively insignificant, given that the programmable CRC has significantly more flexibility. The total area of 0.15 mm² is quite small and allows the CRC circuit to be deployed as an add-on instruction in an NPU data-path or as a standalone CRC accelerator IP core for dedicated network processing SoCs or ASSPs. For these in particular, flexibility, area cost, performance and power dissipation are the primary design considerations, making the proposed CRC architecture an ideal offload engine. In terms of throughput performance, the field programmable CRC circuit, at 4.92 Gb/s, outperforms the only other truly reconfigurable option [15], which operates at 3.6 Gb/s. The field programmable CRC circuit also has the advantage of a much less costly technology platform, 0.15 mm² on 130-nm standard cell technology, compared to the Pentium used by [15]. The circuit we have implemented does not support the high line speeds of [14] because this custom circuit (which is not reprogrammable) uses much higher levels of parallelism (64/256-bit). However, there is a correlation in results, since a simple comparison shows that their 64-bit implementation is both double the bus width and double the line rate of the reconfigurable CRC circuit: assuming that doubling our level of parallelism from 32 to 64 bits would also double the supported line speed, the result is 9.84 Gb/s compared to 10 Gb/s. Given all these considerations, the throughput of the field programmable CRC circuit compares favorably with the state of the art. Apart from the software/processor design, none of the other architectures offer a solution that is flexible in terms of supporting all possible combinations of polynomials with a wide variety of port sizes. Objectives such as high speed or low cost have been considered in these implementations, but flexibility has not been fully addressed. Some present a degree of flexibility, but only in terms of preselecting variables in the design phase. Considering that versatile network processors have to address a wide range of applications, including frame packet processing, security processing and storage related functions, the alternative options presented give limited flexibility to support all these applications.
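The throughput figures quoted in this comparison can be checked with simple arithmetic. The sketch below assumes, as the discussion above implies, that one full input word is consumed per clock cycle, so that line rate equals port width times clock frequency; the clock frequency it derives for the ASIC is an implied value computed from the quoted 4.92 Gb/s, not a figure stated in the text, and the function names are hypothetical.

```python
def line_rate_gbps(width_bits, clock_mhz):
    """Line rate when one word of width_bits is processed per clock cycle."""
    return width_bits * clock_mhz / 1000.0


def implied_clock_mhz(rate_gbps, width_bits):
    """Clock frequency implied by a line rate and a port width."""
    return rate_gbps * 1000.0 / width_bits


# Implied ASIC clock for 4.92 Gb/s on a 32-bit port (assumption: 1 word/cycle).
asic_clock_mhz = implied_clock_mhz(4.92, 32)

# Doubling the port width at the same clock doubles the line rate.
doubled_rate = line_rate_gbps(64, asic_clock_mhz)

# FPGA figures quoted earlier: 117 MHz programmable, 233 MHz fixed polynomial.
fpga_programmable = line_rate_gbps(32, 117)
fpga_fixed = line_rate_gbps(32, 233)

# Reconfiguration time for 33 clock cycles, in microseconds (cycles / MHz).
reconfig_us_asic = 33 / asic_clock_mhz
reconfig_us_fpga = 33 / 117

print(f"implied ASIC clock: {asic_clock_mhz:.2f} MHz")
print(f"64-bit line rate:   {doubled_rate:.2f} Gb/s")
print(f"FPGA line rates:    {fpga_programmable:.3f} / {fpga_fixed:.3f} Gb/s")
print(f"reconfiguration:    {reconfig_us_asic:.3f} us (ASIC), {reconfig_us_fpga:.3f} us (FPGA)")
```

Under these assumptions the 33-cycle reconfiguration completes well within one microsecond on both platforms, which is consistent with the reconfiguration time claimed in Section IV.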

REFERENCES
[1] J. Satran, K. Meth, C. Sapuntzakis, M. Chadalapaka, and E. Zeidner, "RFC 3720: Internet Small Computer Systems Interface (iSCSI)," RFC 3720, Apr. 2004.
[2] P. Koopman and T. Chakravarty, "Cyclic Redundancy Code (CRC) polynomial selection for embedded networks," in Proc. DSN, pp. 145–154.
[3] P. Koopman, "32-bit cyclic redundancy codes for internet applications," in Proc. DSN, pp. 459–472.
[4] D. Sarwate, "Computation of cyclic redundancy checks via table lookup," Commun. ACM, vol. 31, no. 8, pp. 1008–1013, Aug. 1988.
[5] D. Feldmeier, "Fast software implementation of error correcting codes," IEEE Trans. Network., vol. 3, no. 6, pp. 640–651, Dec. 1995.
[6] T. Bi-Pei and C. Zukowski, "High-speed parallel CRC circuits in VLSI," IEEE Trans. Commun., vol. 40, no. 4, pp. 653–657, Apr. 1992.
[7] M. Braun, J. Freidich, T. Grun, and J. Lembert, "Parallel CRC computation in FPGAs," in Proc. Workshop Field Program. Logic Appl., 1996, pp. 156–165.
[8] J. H. Derby, "High-speed CRC computation using state-space transformations," in Proc. Globecom, Nov. 2001, pp. 166–170.
[9] M.-D. Shieh, M.-H. Sheu, C.-H. Chen, and H.-F. Lo, "A systematic approach for parallel CRC computations," J. Inf. Sci. Eng., vol. 17, pp. 445–461, 2001.
[10] T. Henriksson and D. Liu, "Implementation of fast CRC calculation," in Proc. ASP-DAC, 2003, pp. 563–564.
[11] F. Monteiro, A. Dandache, A. Msir, and B. Lepley, "A fast CRC implementation on FPGA using a pipelined architecture for the polynomial division," in Proc. ICECS, 2001, vol. 3, pp. 1231–1234.


[12] M. Sprachmann, "Automatic generation of parallel CRC circuits," IEEE Des. Test Comput., vol. 18, no. 3, pp. 108–114, May/Jun. 2001.
[13] G. Campobello, M. Russo, and G. Patanè, "Parallel CRC realization," IEEE Trans. Comput., vol. 52, no. 10, pp. 1312–1319, Oct. 2003.
[14] Sarance Technologies, Ottawa, ON, Canada, "CRC-32 for 10 Gbps/OC192 and 40 Gbps/OC768 Systems," 2006. [Online]. Available: www.sarance.com/brochures/Sarance_CRC_Product_Brief_2.0.pdf
[15] A. Joglekar, M. Kounavis, and F. Berry, "A scalable and high performance software iSCSI implementation," in Proc. USENIX FAST, 2005, pp. 267–280.
[16] A. Crouch, "Technology developments favor IP storage growth," Communications Technology Lab, Intel, Apr. 2005. [Online]. Available: http://www.snwonline.com/tech_edge/technology_04-11-05.asp?article_id=535
[17] O. Weiss, M. Gansen, and T. Noll, "A flexible datapath generator for physical oriented design," in Proc. ESSCIRC, Villach, Sep. 2001, pp. 408–411.

A Parallel Pruned Bit-Reversal Interleaver


Mohammad M. Mansour

Abstract: A parallel algorithm and architecture for pruned bit-reversal interleaving (PBRI) are proposed. For a pruned interleaver of size $N$ with mother interleaver size $M = 2^n$, the proposed algorithm interleaves any number $x \in [0, N-1]$ in at most $n - 1$ steps, as opposed to $x$ steps using existing PBRI algorithms. A parallel architecture of the proposed algorithm employing simple logic gates and having a short critical path delay is presented. The proposed architecture is valuable in reducing (de-)interleaving latency in emerging wireless standards that employ PBRI channel (de-)interleaving in their PHY layer, such as the 3GPP2 Ultra Mobile Broadband standard.

Index Terms: Bit-reversal maps, channel interleavers, pruned interleavers.

I. INTRODUCTION

Channel interleaving is employed in most modern wireless communications systems to protect against burst errors [1]. A channel interleaver reshuffles encoded symbols so that consecutive symbols are spread as far apart as possible, breaking the temporal correlation between successive symbols involved in a burst of errors. The reverse de-interleaving operation is performed at the receiver before the symbols are fed to the channel decoder. Typically, these interleavers employ some form of bit-reversal operation in generating the interleaved addresses, and have a programmable size to accommodate various encoded packet lengths. For example, the emerging Ultra Mobile Broadband (UMB) standard within the 3rd Generation Partnership Project 2 (3GPP2) [2] employs a pruned bit-reversal channel interleaver in its PHY layer to interleave any packet whose length is a multiple of 8. In pruned bit-reversal interleaving, a packet of size $N$ is interleaved by mapping $n$-bit linear addresses into $n$-bit bit-reversed addresses, where $n$ is the smallest integer such that $N \le 2^n$. Linear addresses that map to addresses outside $[0, N-1]$ are invalid addresses and get pruned out (see [3] and [4] for other pruning techniques).

The emphasis in the literature on interleavers and their architectures has been largely in the context of interleavers employed in turbo codes. Not much work has been done on architectures for PBRI channel interleavers. Bit-reversal mapping has been mainly applied to reduce row conflicts and improve hit rates in SDRAM applications [5], and to improve the shuffle permutation stages of the FFT algorithm [6], [7]. In turbo interleavers, the emphasis has been on reducing interleaving latency by avoiding memory collisions of read/write operations by the constituent MAP decoders [8]–[11]. Software programmable turbo interleavers for multiple 3G wireless standards have been addressed in [12].

A major disadvantage of a PBRI interleaver is that, despite its simplicity, interleaved addresses must be generated sequentially. That is, in order to generate the interleaved address of a linear address $x$, the interleaved addresses of all linear addresses less than $x$ must first be generated. This follows from the fact that the number of pruned addresses that have occurred before $x$ must be known in order to know where $x$ gets mapped to. This requirement introduces a latency bottleneck, especially when (de-)interleaving long packets (e.g., 16 K in UMB [2]).

In this paper, we present an algorithm that eliminates this dependency and determines any interleaved address in at most $n - 1$ steps. Moreover, the algorithm has a very simple architecture that can be constructed using basic logic gates and has a short critical path delay.

II. SEQUENTIAL PBRI ALGORITHM

A bit-reversal interleaver (BRI) maps an $n$-bit number $x$ into another $n$-bit number $y$ according to a simple bit-reversal rule such that the bits of $y$ appear in the reverse order with respect to $x$. We designate the BRI mapping on $n$ bits by the function $y = \pi_n(x)$. The values taken by $x$ and $y$ range from $0$ to $2^n - 1$, where $M = 2^n$ is the size of the interleaver.

A pruned BRI maps an $n$-bit number $x$ less than $N$, where $N \le M$, into another $n$-bit number $y$ less than $N$ according to the bit-reversal rule. The size of the pruned interleaver is $N$, while the size of the mother interleaver is $M$. Note that the numbers from $N$ to $M - 1$ are pruned out of the interleaver mappings and are not considered valid mappings. We designate the PBRI mapping on $n$ bits with parameter $N$ by the function $y = \pi_{n,N}(x)$.

The mapping $\pi_{n,N}(x)$ for a given $x$ is computed sequentially by starting from $y = 0$ and maintaining the number of invalid mappings $\phi(x)$ skipped along the way. If $y + \phi(x)$ maps to a valid number (i.e., $\pi_n(y + \phi(x)) < N$), then $y$ is incremented by 1. If $y + \phi(x)$ maps to an invalid number, $\phi(x)$ is incremented by 1. These operations are repeated until $y$ reaches $x$ and $\pi_n(x + \phi(x))$ is valid. Algorithm 1 shows the pseudo-code of the sequential PBRI algorithm.
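The bit-reversal rule and the pruning condition can be illustrated with a minimal Python sketch. The function names `bit_reverse` and `mother_bits` and the example size $N = 6$ are illustrative assumptions, not taken from the paper; the sketch only shows which linear addresses of the mother interleaver map to invalid (pruned) addresses.

```python
def bit_reverse(x: int, n: int) -> int:
    """Return the n-bit bit-reversal of x (the BRI mapping on n bits)."""
    y = 0
    for _ in range(n):
        y = (y << 1) | (x & 1)  # move the lowest bit of x into y
        x >>= 1
    return y


def mother_bits(N: int) -> int:
    """Return the smallest n such that N <= 2**n (hypothetical helper)."""
    return max(N - 1, 0).bit_length()


N = 6                    # pruned interleaver size (example value)
n = mother_bits(N)       # 3 address bits
M = 1 << n               # mother interleaver size, 8

bri = [bit_reverse(x, n) for x in range(M)]
pruned = [x for x in range(M) if bri[x] >= N]

print(bri)     # [0, 4, 2, 6, 1, 5, 3, 7]
print(pruned)  # [3, 7]
```

For this example, the linear addresses 3 and 7 map to 6 and 7, which lie outside $[0, N-1]$ and are therefore skipped by the pruned interleaver.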

Manuscript received November 05, 2007; revised April 08, 2008. First published June 16, 2009; current version published July 22, 2009. The author was with Qualcomm Flarion Technologies, Bridgewater, NJ 08807 USA. He is currently with the Electrical and Computer Engineering Department, the American University of Beirut, Beirut 1107 2020, Lebanon (e-mail: mmansour@aub.edu.lb). Digital Object Identifier 10.1109/TVLSI.2008.2008831

Algorithm 1 Sequential PBRI algorithm.

procedure PBRI-Seq($n, N, x$)
    $y \leftarrow 0$
    $\phi(x) \leftarrow 0$
    while $y \le x$ do
        if $\pi_n(y + \phi(x)) \ge N$ then
            $\phi(x) \leftarrow \phi(x) + 1$
        else
            $\pi_{n,N}(y) \leftarrow \pi_n(y + \phi(x))$
            $y \leftarrow y + 1$
        end if
    end while
end procedure
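The sequential procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the author's implementation: the names `bit_reverse` and `pbri_seq`, the argument check, and the returned pair are choices made here for clarity.

```python
from typing import Tuple


def bit_reverse(x: int, n: int) -> int:
    """Return the n-bit bit-reversal of x (the BRI mapping on n bits)."""
    y = 0
    for _ in range(n):
        y = (y << 1) | (x & 1)
        x >>= 1
    return y


def pbri_seq(n: int, N: int, x: int) -> Tuple[int, int]:
    """Sequentially compute the pruned bit-reversal address of x.

    Returns (interleaved address of x, number of invalid mappings skipped).
    """
    if not (0 < N <= (1 << n)) or not (0 <= x < N):
        raise ValueError("require 0 <= x < N <= 2**n")
    y = 0        # linear address currently being mapped
    phi = 0      # invalid mappings skipped so far
    result = 0
    while y <= x:
        candidate = bit_reverse(y + phi, n)
        if candidate >= N:
            phi += 1            # invalid mapping: skip it
        else:
            result = candidate  # valid mapping for linear address y
            y += 1
    return result, phi


print([pbri_seq(3, 6, x)[0] for x in range(6)])  # [0, 4, 2, 1, 5, 3]
print(pbri_seq(3, 6, 3))                         # (1, 1)
```

In this sketch the loop body runs once for every valid and every skipped address up to $x$, so the work grows linearly with $x$. This is the sequential dependency that the parallel algorithm is intended to remove.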

1063-8210/$26.00 © 2009 IEEE
