
SLICE REDUCTION ALGORITHM FOR LOW POWER AND AREA EFFICIENT FIR FILTER USING VLSI DESIGN

A. Hemalatha 1, A. Shanmugam 2
1 HOD, Dept of ECE, Periyar Centenary Polytechnic College, Vallam.
2 Principal, Bannari Amman Institute of Technology, Sathyamangalam

Abstract: Software-defined radio (SDR) is fast becoming a crucial element of wireless technology. The use of SDR technology is predicted to replace many of the traditional methods of implementing transmitters and receivers while offering a wide range of advantages, including adaptability, reconfigurability and multifunctionality, encompassing modes of operation, radio frequency bands, air interfaces and waveforms. Research in this field is mainly directed towards improving the architecture and the computational efficiency of SDR systems. SDR refers to wireless communication in which the transmitter modulation and the receiver demodulation are both performed in software. The main advantage of this approach is flexibility, as the software runs on one common hardware platform for any type of receiver configuration. The most computationally intensive part of the wideband receiver of an SDR is the intermediate frequency (IF) processing block, and digital filtering is the main task in IF processing. The computational complexity of the finite impulse response (FIR) filters used in the IF processing block is dominated by the number of adders (subtractors). The proposed reconfigurable, synthesised multiplier blocks offer significant area savings over traditional multiplier blocks for high-speed digital signal processor (DSP) systems implemented on field programmable gate array (FPGA) hardware platforms. In addition, software radio has recently gained much attention due to the need for integrated and reconfigurable communication systems. To this end, reconfigurability has become an important issue for future filter design. Previous research in this field has concentrated on minimising multiplier block adder cost, but the results presented here demonstrate that this optimisation goal does not minimise FPGA hardware. Minimising multiplier block logic depth and pipeline registers is shown to have the greatest influence in reducing FPGA area cost. Fully pipelined, full-parallel transposed-form FIR filters with reconfigurable multiplier blocks were generated using the new and previous algorithms and implemented on an FPGA target, and the results were compared. The proposed method offers average reductions in the number of adders and full adders needed for the coefficient multipliers over conventional FIR filter implementation methods.

1 Introduction
High-speed digital signal processor (DSP) systems are increasingly being implemented on field programmable gate array (FPGA) hardware platforms. This trend is being fuelled by the insurmountable costs of application-specific integrated circuit (ASIC) projects and by the flexibility and reconfigurability advantages of FPGAs over traditional DSPs and ASICs, respectively. More recently, Structured ASIC technology has yielded lower cost solutions than full custom ASIC by predefining several layers of silicon functionality, so that only a few fabrication layers need to be defined to implement the required design. However, the FPGA platform provides high performance and flexibility with the option to reconfigure, and is the technology focused on for the remainder of this paper. There is a constant requirement for efficient use of FPGA resources, where occupying less hardware for a given system can yield significant cost-related benefits:
(i) reduced power consumption;
(ii) area for additional application functionality;
(iii) potential to use a smaller, cheaper FPGA.

Finite impulse response (FIR) digital filters are common DSP functions and are widely used in FPGA implementations. If very high sampling rates are required, full-parallel hardware must be used, where every clock edge feeds a new input sample and produces a new output sample. Such filters can be implemented on FPGAs using combinations of the general purpose logic fabric, on-board RAM and embedded arithmetic hardware. Full-parallel filters cannot share hardware over multiple clock cycles and so tend to occupy large amounts of resource. Hence, efficient implementation of such filters is important to minimise hardware requirement. When implementing a DSP system on a platform containing dedicated arithmetic blocks, it is normal practice to utilise such blocks as far as possible in preference to any general purpose logic fabric. However, in some cases, there may not be enough blocks for the target application, or layout/routing constraints may prohibit their use. In such cases, the general purpose logic fabric must be utilised. In this paper, implementation of full-parallel filters using only the general purpose logic fabric is considered. Hence, the techniques presented here are also applicable to ASIC and Structured ASIC platforms.

In Section 2, the filter type in use is described, followed by a discussion of multiplier block synthesis for low FPGA area in Section 3. A new multiplier block synthesis algorithm is presented in Section 4, followed by the generation and presentation of results (Sections 5 and 6). Conclusions are given in Section 7.

2 Transposed FIR filter with multiplier block
Figure 1 shows three full-parallel, fixed-coefficient FIR filter structures that are mathematically identical but differ in architecture. Derived from the standard FIR structure using cut-set retiming, the transposed FIR (Fig. 1b) yields an identical mathematical response but with several advantages for FPGA implementation:
(i) no input sample shift registers are required since each sample is fed to each tap simultaneously;
(ii) the pipelined addition chain maps efficiently;
(iii) filter latency is reduced;
(iv) identical tap coefficient magnitudes can share multiplication hardware because taps receive the input sample simultaneously.

Note that for maximum sampling rates, all multiplication hardware can be pipelined. In Fig. 1c, the coefficient multipliers of the transposed FIR have been replaced with a multiplier block (detailed in Section 3) that generates all required multiples of the filter input using cascaded adds, subtracts and shifts. This filter architecture is known to be a highly area efficient method of implementing fixed-coefficient, full-parallel FIR filters [1]. It is the multiplier block that determines filter implementation efficiency regardless of hardware platform. Effective synthesis of multiplier blocks for low FPGA area is the focus of this work.

Fig. 1 Mathematically identical full-parallel FIR filter structures: a Standard; b Transposed; c Transposed FIR with multiplier block

3 Multiplier block synthesis for low FPGA area
3.1 Multiplication hardware operation and area estimation
Figure 2 shows an example multiplier block (also referred to as a graph) that multiplies the input by 3, 13, 21 and 37 in two clock cycles (the logic depth of the block is also 2). Multiplication is achieved using only adds, subtracts and shifts, which map very efficiently to FPGA architectures. As an example, the input is fed to the '3' adder both untouched and after being left-shifted once (multiplied by two). Hence, the output of the '3' adder is 2x + x = 3x as required. This product can then be used as a graph output to be fed to the filter summation chain (refer to Fig. 1c) and, if required, routed internally to generate further multiples of the input. For efficiency, multiplier blocks need only generate positive, odd integers, since negative filter coefficient weightings can be restored at the summation chain by subtracting the positive equivalent generated by the multiplier block. Odd-valued block outputs can be left-shifted en route to the summation chain to generate even-valued coefficient multiplications. Pipelining multiplier blocks ensures high clock rates are achieved when implemented on FPGA hardware. Note that multiplier blocks usually contain a mixture of adders and subtractors, but the adder cost of a block refers to the total number of adders and subtractors. Hence, the adder cost of Fig. 2 is 5. Note that adders may also be referred to as the graph vertices. In this paper, we use the Xilinx Virtex-II FPGA family [2] for implementation analysis and hence area will be measured in Virtex-II slices. The multiplier block in Fig. 2 is quoted as costing 25 slices. This is calculated by counting the number of flip-flops inferred by the multi-bit signals crossing pipeline boundaries and dividing by 2, since there are two flip-flops per slice. Equation (1) uses the set S, which contains the bit-widths S_i of all N multi-bit signal pipeline boundary crossings, to obtain a slice estimate e:

$$e = \frac{1}{2}\sum_{i=1}^{N} S_i \qquad (1)$$
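As a rough illustration of the ideas in Section 3.1, the sketch below models a two-level shift-and-add multiplier block, the restoration of sign and even factors on the way to the summation chain, and the slice estimate obtained by halving the total flip-flop count. The graph used here (the intermediate value 5 and the particular shifts) is a hypothetical arrangement that produces 3, 13, 21 and 37 with five adders at logic depth 2; it is not claimed to be the graph of Fig. 2. The function names and the example bit-widths are likewise assumptions made for illustration, and the model is purely behavioural (pipeline registers and timing are ignored).

```python
def multiplier_block(x: int) -> dict[int, int]:
    """Hypothetical two-level shift-and-add graph producing 3x, 13x, 21x, 37x."""
    # Logic level 1: two adders fed directly by the input.
    x3 = (x << 1) + x      # 2x + x   = 3x
    x5 = (x << 2) + x      # 4x + x   = 5x (internal value, not a block output)
    # Logic level 2: three adders/subtractors fed by level-1 outputs.
    x13 = (x << 4) - x3    # 16x - 3x = 13x
    x21 = (x << 4) + x5    # 16x + 5x = 21x
    x37 = (x << 5) + x5    # 32x + 5x = 37x
    return {3: x3, 13: x13, 21: x21, 37: x37}


def coefficient_product(x: int, coeff: int) -> int:
    """Form coeff * x from a positive, odd block output plus a shift and a sign."""
    if coeff == 0:
        return 0
    magnitude = abs(coeff)
    # Number of trailing zero bits: the even factor is applied as a left shift.
    shift = (magnitude & -magnitude).bit_length() - 1
    odd = magnitude >> shift
    # The odd part must be 1 (the input itself) or one of the block outputs.
    product = x if odd == 1 else multiplier_block(x)[odd]
    product <<= shift
    # A negative coefficient is restored by subtracting at the summation chain.
    return -product if coeff < 0 else product


def slice_estimate(crossing_widths: list[int]) -> float:
    """Slice estimate: total flip-flops at pipeline crossings divided by two."""
    return sum(crossing_widths) / 2


if __name__ == "__main__":
    print(multiplier_block(1))            # multiples of a unit input
    print(coefficient_product(7, -26))    # -26 = -(13 << 1)
    print(slice_estimate([8, 10, 11]))    # made-up crossing bit-widths
```

Only positive, odd multiples are generated inside the block; everything else is a shift or a sign change applied afterwards, which is why identical coefficient magnitudes can share hardware in the transposed structure.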

Fig. 2 Multiplier block: five adders, two pipeline stages costing 25 slices

3.2 Multiplier block synthesis optimisation goals
Multiplier block synthesis has received a great deal of attention in recent decades. The majority of research has concentrated on producing algorithms to synthesise multiplier blocks with the optimisation goal of minimum adder cost. Bull and Horrocks [3] introduced the concept of representing multiplier blocks with graphs, showed the problem to be NP-complete and defined several minimal adder synthesis algorithms. Dempster and Macleod [4] identified limitations in this work and defined the n-dimensional reduced adder graph (RAG-n) algorithm, which is generally regarded as the primary reference for minimal adder multiplier block synthesis [5, p. 96]. Additional techniques using canonic signed digit (CSD) and subexpression sharing have also been proposed to minimise adder cost [6, 7]. Although the majority of research in this area has focused on full-parallel DSP, recent work by Demirsoy et al. [8, 9] incorporates multiplexers to allow efficient FPGA implementation of time-multiplexed filters and discrete cosine transform (DCT) processors. In [10], Dempster et al. defined the C1 synthesis algorithm with the optimisation goal of minimising multiplier block logic depth to reduce power consumption. C1 aims to minimise the power consumed by reducing the number of logic transitions caused by long glitch paths through cascaded arithmetic logic. Note that logic toggling caused by glitch paths through more than one adder does not occur in fully pipelined multiplier blocks. In general, from a hardware perspective, the motivation for minimum adders has been to reduce filter complexity for very large scale integration (VLSI) implementation, where adder cost dominates the area requirement. However, FPGAs have a fixed architecture for implementing digital logic, not the blank canvas of VLSI/ASIC design.
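To make the link between number representation and adder cost concrete, the following minimal sketch recodes a constant into CSD form and compares the adder count of a naive digit-by-digit shift-and-add multiplier (non-zero digits minus one) for binary and CSD. The function names are illustrative and do not come from the cited works, and the counts ignore subexpression sharing and graph-based methods such as RAG-n, so they are only an upper bound for a single constant.

```python
def to_csd(value: int) -> list[int]:
    """Return CSD digits (least significant first), each in {-1, 0, +1}."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = []
    while value != 0:
        if value & 1:
            # value mod 4 == 1 -> digit +1, value mod 4 == 3 -> digit -1.
            # After removing the digit the remainder is divisible by 4,
            # so the next digit is 0 and no two non-zero digits are adjacent.
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def csd_value(digits: list[int]) -> int:
    """Evaluate a little-endian signed-digit list back to an integer."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def binary_adders(value: int) -> int:
    """Adders for a naive shift-and-add multiplier using the binary digits."""
    return max(bin(value).count("1") - 1, 0)


def csd_adders(value: int) -> int:
    """Adders/subtractors for a naive multiplier using the CSD digits."""
    return max(sum(1 for d in to_csd(value) if d != 0) - 1, 0)


if __name__ == "__main__":
    for constant in (3, 13, 21, 23, 37):
        print(constant, to_csd(constant),
              binary_adders(constant), csd_adders(constant))
```

For a constant such as 23 (binary 10111), the CSD form 32 - 8 - 1 needs two adder/subtractor stages where the binary form needs three; for 3, 13, 21 and 37 the two representations happen to tie.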
Hence, algorithms synthesising multiplier blocks for low FPGA area must operate with regard to FPGA architectures for best results. In this paper, we show that the classic multiplier block synthesis goal of minimising adders does not minimise FPGA hardware cost; we will, however, define a new algorithm that does.

3.3 Comparing adder cost and logic depth for low FPGA area
Figure 3 shows a multiplier block that generates the same multiples of the input as the block shown in Fig. 2. However, the Fig. 3 block uses only four adders, whereas Fig. 2 uses one more (five). Conversely, four logic levels are required for the Fig. 3 block and only two are required in Fig. 2. Most importantly, using (1), the Fig. 3 block requires 44 slices compared to the 25 slices of Fig. 2. This is due to the increased number of pipeline boundary crossings in Fig. 3 caused by the extra logic levels of the block. Hence, in this case, fewer adders does not mean less FPGA area. It should be noted that architecture-specific features may also influence area consumption. For example, a slight area reduction may be gained using the Virtex-II SRL16 shift registers to implement signal delays instead of individual flip-flops. However, such area reductions would not reduce the slice cost of Fig. 3 significantly. Also, there may be slices where a look-up table (LUT) is unused for multiplier block arithmetic logic but the corresponding flip-flop is in use to pipeline a signal. In such cases, synthesis/implementation tools may map logic for other functional blocks into the unused LUT. However, in a fully pipelined DSP data-path system, such functionality sharing of a single slice is less likely and would not significantly affect multiplier block area consumption. To summarise, the main research community goal of minimising adder cost is not applicable for minimising FPGA area cost. Instead, an algorithm performing synthesis for low FPGA area should aim to reduce signal pipeline crossings. This can be achieved by synthesising low-logic-depth multiplier blocks and minimising signal bit-widths as far as possible. Note that these low FPGA area optimisation principles apply to any FPGA architecture containing a LUT/flip-flop pair structure (including the Xilinx Virtex-II family selected in this instance).

Fig. 3 Multiplier block: four adders, four pipeline stages costing 44 slices

4 The new reduced slice graph multiplier block synthesis algorithm
4.1 Minimised adder graph algorithm
Before discussing the new reduced slice graph (RSG) synthesis algorithm, the minimised adder graph (MAG) algorithm must be introduced. MAG was defined by Dempster and Macleod [11] and generates minimum adder graphs consisting of shifts, adds and subtracts for implementing constant integer multiplication of individual values up to 12 bits (MAG was further extended to 19 bits by Gustafsson et al. [12]). Multiplication by a constant is common when implementing DSP on FPGAs, and efficient implementation has received attention from the research community. For example, Wirthlin and McMurtrey [13] utilise features of recent FPGA architectures to further reduce the area requirement of conventional constant coefficient multipliers. The implementation techniques described also allow for multiplication constants to be changed, since it is not the structure of the logic that dictates multiplication values (as is the case with MAG) but rather look-up table contents. Using the MAG algorithm, Dempster and Macleod demonstrate a 16% average reduction in adder cost over CSD representation and show CSD to reduce average adder cost by 33% over binary representation. From an FPGA perspective, for single coefficients, these adder reductions also translate into hardware savings. In general, for each integer value in range, MAG finds numerous graphs that all implement the required multiplication with the minimum number of adders/vertices. Dempster and Macleod suggest differentiating between these graphs by calculating the number of single-bit full adders required to implement each graph and selecting the graph requiring the least. This differentiation is motivated by attempting to minimise the VLSI area consumed when implementing the multiplication. MAG also generates a table containing the adder cost of each integer value in range.

4.2 MAG extensions and modifications

Our MAG implementation was extended beyond the 12-bit range (imposed by Dempster and Macleod for physical RAM constraints) using the generic graph extension cases illustrated in [11]. Also, we noted the potential for greater differentiation between the multiple graphs found by MAG for each integer value in range. Instead of using one level of differentiation with the single-bit full adder count metric described in [11], we implemented two metric levels (primary and secondary) to allow one best graph to be selected. For each integer, the primary metric selects the subset of graphs with minimum logic depth and, from that subset, the secondary metric selects the graph with the lowest vertex sum to minimise bit-widths. These new metrics reflect the low FPGA area observations described previously and, in general, leave only one best graph per integer, whereas the sole single-bit full adder metric selects multiple graphs from which an arbitrary choice must be made. Note that up to and including adder cost 3, for each integer in range, the modified MAG implementation stores and allows extension/branching from every graph found. This allows full searching of the graph space up to and including cost 4. From cost 4 graphs upwards (i.e. beyond 12 bits), only one best graph (as selected by the new primary/secondary metric system described previously) is stored and used per integer for extending to higher cost graphs. This restriction is for practical implementation concerns of available physical RAM and algorithm run-time.

4.3 New RSG algorithm design
Dempster and Macleod's RAG-n algorithm attempts to synthesise all required multiplications by initially placing all coefficients of adder cost 1 (determined using MAG) into the multiplier block and then building higher cost values using combinations of shifts, adds and subtracts of other adder outputs within the block. For example, in Fig. 3 (generated using RAG-n), the '3' adder is synthesised first and all other coefficients are built from it. This approach leads to high logic depth blocks and hence pushes up the FPGA area requirement. Note that RAG-n synthesises multiplier blocks in two stages: (i) optimal stage; (ii) suboptimal heuristic stage. If the multiplier block is fully synthesised after stage (i), Dempster and Macleod show that their algorithm ensures the absolute minimum number of adders to implement a given block. If stage (ii) is required, a suboptimal multiplier block may result.
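The primary/secondary graph selection used in the modified MAG implementation can be sketched as a lexicographic minimum over candidate graphs. The `Graph` record, its field names and the three candidate graphs for the constant 43 below are hypothetical and chosen only for illustration; the sketch shows the selection rule (minimum logic depth first, then lowest sum of vertex values) and does not attempt MAG's graph enumeration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """A candidate adder graph for one constant (hypothetical representation)."""
    vertices: tuple[int, ...]   # value produced at each adder output
    logic_depth: int            # longest adder chain from input to output


def select_best(graphs: list[Graph]) -> Graph:
    """Primary metric: minimum logic depth. Secondary: lowest vertex sum."""
    return min(graphs, key=lambda g: (g.logic_depth, sum(g.vertices)))


# Three illustrative three-adder graphs that all produce 43x:
#   3 = 2 + 1,  11 = 3*4 - 1,  43 = 11*4 - 1   (adders chained: depth 3)
#   7 = 8 - 1,   9 = 8 + 1,    43 = 9*4 + 7    (two levels: depth 2)
#   3 = 2 + 1,   5 = 4 + 1,    43 = 3*16 - 5   (two levels: depth 2)
candidates = [
    Graph(vertices=(3, 11, 43), logic_depth=3),
    Graph(vertices=(7, 9, 43), logic_depth=2),
    Graph(vertices=(3, 5, 43), logic_depth=2),
]

if __name__ == "__main__":
    best = select_best(candidates)
    print(best)
```

Using a tuple as the sort key keeps the two metric levels explicit: depth decides first, and the vertex sum only breaks ties among graphs of equal depth, which favours smaller intermediate values and hence narrower signals.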

Synthesis algorithm comparison varying coefficient bit-width
VHDL filters were generated to compare RSG, RAG-n and C1 using the specifications of Section 5.2, with coefficient bit-width varied from 1 to 20 and filter length fixed at 51 taps. Note that the multiplier block solution space expands exponentially as coefficient width increases, meaning algorithm output is similar at lower widths. Also, as stated in Section 4.2, the MAG algorithm implementation only uses the full search space for integers up to adder cost 4. From cost 5 onwards, branching is only performed from the single best graph of each value for reasons of available RAM. Hence, from around 15-bit coefficients upwards (cost 5 upwards), synthesis algorithm results are likely to converge. Were the MAG algorithm allowed to store and branch from all graphs from cost 4 upwards, the difference in results between the algorithms shown around 12 bits would be expected to continue instead of converging at 15 bits. Figure 4a demonstrates that RSG requires less FPGA area in general, although Fig. 4b shows that RSG uses more adders in all cases. Figure 4c confirms that FPGA area is correlated with flip-flop usage. LUT data (Fig. 4e) is correlated with adder usage, with RSG using fractionally more LUTs in general. For logic depth, see Fig. 4d.

Fig. 4 Results for VHDL filter generation varying coefficient bit-width (length: 51 taps): a FPGA hardware area; b Multiplier block adders; c Flip-flop usage; d Multiplier block logic depth; e LUT usage

7 Conclusions
The classic research community optimisation metric of minimising multiplier block adder cost has been demonstrated not to minimise FPGA hardware for full-parallel pipelined FIR filters. Reducing flip-flop count through minimising multiplier logic depth has instead been shown to yield the lowest area solutions. The new RSG algorithm has been defined to embody this design principle. The results presented establish a clear area advantage of RSG over prior algorithms for typical filter parameters with comparable maximum clock rates. In addition, the industrial relevance of the transposed FIR with multiplier block architecture and the RSG algorithm has been established through comparison with filters implemented using the distributed arithmetic (DA) technique.

References
1 Macpherson, K., Stirling, I., Rice, G., Garcia-Alis, D., and Stewart, R.: Arithmetic implementation techniques and methodologies for 3G uplink reception in Xilinx FPGAs. Third Int. Conf. on 3G Mobile Communication Technologies, 2002, (IEE Conf. Publ. no. 489), May 2002, pp. 191-195
2 Xilinx Inc., http://www.xilinx.com
3 Bull, D.R., and Horrocks, D.H.: Primitive operator digital filters, IEE Proc. G, Circuits Devices Syst., 1991, 138, (3), pp. 401-412
4 Dempster, A.G., and Macleod, M.D.: Use of minimum-adder multiplier blocks in FIR digital filters, IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., 1995, 42, (9), pp. 569-577
5 Meyer-Baese, U.: Digital signal processing with field programmable gate arrays (Springer-Verlag, Berlin, Heidelberg, 2001)
6 Gustafsson, O., and Wanhammar, L.: ILP modelling of the common subexpression sharing problem. 9th Int. Conf. on Electronics, Circuits and Systems, 2002, vol. 3, pp. 1171-1174
7 Jang, Y., and Yang, S.: Low-power CSD linear phase FIR filter structure using vertical common subexpression, Electronics Letters, July 2002, 38, (15), pp. 777-779
8 Dempster, A.G., Demirsoy, S.S., and Kale, I.: Designing multiplier blocks with low logic depth. IEEE Int. Symp. on Circuits and Systems, 2002, vol. 5, pp. V-773-V-776
9 Dempster, A.G., and Macleod, M.D.: Constant integer multiplication using minimum adders, IEE Proc., Circuits Devices Syst., October 1994, 141, (5), pp. 407-413
10 Macpherson, K.: Low hardware cost, high speed full-parallel FIR digital filters on field programmable gate arrays. PhD thesis, University of Strathclyde, 2004
11 Xilinx Inc.: Distributed Arithmetic FIR Filter v8.0, http://www.xilinx.com
12 Synplicity Inc., http://www.synplicity.com

A. Hemalatha received the Bachelor of Engineering degree in Electronics and Communication Engineering from P.S.G. College of Technology, Coimbatore, India, in 1986. She received the Master of Engineering degree in Satellite Communication from Regional Engineering College (now National Institute of Technology), Tiruchirappalli, India, in 1992, and the M.B.A. degree from Bharathidasan University, Tiruchirappalli, India, in 1999. She is currently pursuing the Ph.D. degree in VLSI Design at Anna University, Chennai, and is Head of the Department of Electronics and Communication Engineering at Periyar Centenary Polytechnic College, Vallam, Thanjavur, India. Her research interests are in the area of dynamic power management schemes, reliability modelling and performance analysis of SoCs.

Dr. A. Shanmugam received his Bachelor of Engineering degree from PSG College of Technology, Coimbatore, in 1972, his Master of Engineering degree from College of Engineering, Guindy, Chennai, in 1978 and his Doctor of Philosophy in Electrical and Electronics Engineering from Bharathiar University, Coimbatore, in 1994. From 1972 to 1976, he worked as a Testing Engineer at the Testing and Development Centre, Chennai. He joined Annamalai University as a Lecturer in 1978 and worked there for one year. He then joined PSG College of Technology, Coimbatore, in 1979 and served in various capacities. He was Professor and Head of the Electronics and Communication Engineering Department at the time of relieving (April 2004). He is currently the Principal of Bannari Amman Institute of Technology, Sathyamangalam. He has published and presented more than 120 papers in international and national journals. He is a reviewer for the following journals:
1. International Journal on Information Technology: Applications and Management (IJITAM), Vellore Institute of Technology, Vellore, Tamil Nadu.
2. International Journal on Systemics, Cybernetics and Informatics (IJSCI), Pentagram Research Centre, Hyderabad, India.
3. ICTACT Indian Journal on Communication Technology, ICT Academy of Tamil Nadu, Chennai.
