
Optimization Techniques for Efficient Implementation of DSP in FPGAs

Douang Phanthavong, Mentor Graphics, 8005 SW Boeckman Rd, Wilsonville, OR 97070, USA. Tel: +1-503-6851977, douang_phanthavong@mentor.com
Manish Bansal, Mentor Graphics Pvt. Ltd., Logix Techno Park, Bldg A, Plot #5, Sector-127, Noida 201301, UP, India. Tel: +91-120-5304016, manish_bansal@mentor.com
Mandar Chitnis & D.J. Wang, Mentor Graphics, 1001 Ridder Park Dr., San Jose, CA 95131, USA. Tel: +1-408-4877261, 4877425, {mandar_chitnis, dj_wang}@mentor.com

Abstract -- This paper describes techniques to automatically capture designers' intent and achieve optimal solutions for DSP-based FPGA architectures. Although FFTs and FIR filters appear complex, they are built from simple operations such as addition, subtraction and multiplication. These arithmetic modules, along with the shift and pipeline registers in modern FPGAs, can be configured in different modes to provide greater flexibility and control at desirable levels of performance. The proposed techniques support DSP applications by optimally mapping all the basic building blocks, such as multiply-accumulate, multiply-add, multipliers and adders. Various applications, such as fully parallel FIR filters, are illustrated.

Complex FPGA-DSP Architectures

DSP algorithms are being integrated into an increasing range of products, such as camera-equipped cellular phones, HDTV and MP3 players. The need for manufacturers to differentiate and compete via value-added features, combined with the short life cycle of typical consumer products, makes FPGAs an attractive platform for these applications. FPGA architectures have made considerable performance improvements by introducing new features (table 1), especially dedicated DSP blocks for building compute-intensive applications.

Table 1: Advanced FPGA Architectures with DSP Resources

Features         | Virtex-4                  | Stratix II                    | ECP-DSP
Clock Management | DCM - up to 20            | PLL - up to 12                | sysCLOCK PLL - up to 4
Embedded Memory  | BlockRAM, up to 10 Mb     | TriMatrix memory, up to 9 Mb  | sysMEM blocks, up to 498 Kb
Data Processing  | Up to 200K configurable logic blocks (CLBs) & 512 XtremeDSP slices | Up to 179K logic elements (LEs), 384 embedded multipliers & 96 DSP blocks | Up to 4096 programmable functional units (PFUs), 32 multiplier blocks & 8 DSP blocks
Clock Speed      | Up to 500 MHz             | Up to 500 MHz                 | Up to 250 MHz

FPGAs now incorporate embedded features that enable multiplication, accumulation, addition/subtraction and summation, all of which are commonly used in DSP functions. With these basic arithmetic functionalities, designing the overall DSP-based application becomes fast, flexible and efficient. At the core of a typical DSP block is a multiplier feeding an adder. DSP blocks also have additional features that can be exploited for improved resource utilization and performance:

1) Pipeline registers between the multiplier and the adder.

2) Built-in registers at the inputs and output of the DSP block.

3) A dedicated input, the multiplier output, or a combination of the two can synchronously load the output.

4) DSP blocks can be cascaded, so that the output of one stage feeds the next block (this is especially suitable for FIR filter implementation, described in a later section).

High-end DSP resources are versatile in unique ways. Several different kinds of functions can be mapped onto each of these DSP blocks, including multiply-accumulate (MAC), multiply-add (MADD), counters, shifters, etc. But this versatility brings added complexity in instantiating these DSP blocks and connecting them into the final design [1]. The dedicated DSP resources support several different configurations, which imposes a tremendous challenge on designers to understand all these components and use them efficiently in real designs. Figures 1, 2 & 3 show the DSP blocks available in the Xilinx Virtex-4 (XtremeDSP or DSP48 slice), Altera Stratix II and Lattice Semiconductor ECP-DSP devices, respectively [2], [3], [4].

It takes a certain level of understanding to instantiate each of these blocks and to stitch them together in the design directly as individual technology cells. From the designer's perspective, techniques that automatically take the HDL code and target it efficiently to any of these architectures can be very beneficial. Efficient implementation of DSP-based algorithms therefore requires a vendor-neutral methodology that breaks the design into basic building blocks, which can then be flexibly mapped to the DSP resources of whichever FPGA architecture or device best suits the target application. Instantiating cells and generating netlists specific to each vendor is not recommended, because the resulting technology dependence further complicates the process. Using the synthesis tool to infer all of the DSP functions whenever possible noticeably reduces runtimes. Generic RTL coding styles are an added benefit, because they enable efficient reuse of designs across different FPGA vendors and device architectures. And because the RTL code for many of today's DSP-based designs is increasingly generated by high-level algorithmic synthesis tools, it is essential that FPGA synthesis tools provide the same level of advanced inferencing for the HDL netlists generated by these tools as they do for hand-coded HDL.

Figure 1: Xilinx XtremeDSP (DSP48) slice.

Figure 2: Altera Stratix-II DSP block.

Figure 3: LatticeECP DSP block.

Optimizations Using MAC and MADD

FPGA architectures provide a great deal of flexibility in their DSP blocks. The MAC and MADD operators are important building blocks for any DSP application, and overall design efficiency increases when these basic operators are correctly and optimally mapped to the target FPGA-based DSP architecture. It is therefore important to use optimization techniques that take into account the nuances of a specific architecture. The main advantage of the DSP48 is its dynamic mode configuration, which is useful for implementing loadable accumulators and for combining more logic into a single DSP48 slice. Its dedicated cascading connections are also extremely fast and make the design very efficient. MADD functions can be efficiently cascaded, which is useful when designing FIR filters that demand high performance in the smallest overall area. Leveraging these translations in the synthesis tool, and utilizing the DSP blocks in their fullest configurations, helps achieve the highest possible quality of results.

Consider the expression (a*b + c*d) + (e*f + g*h) + i. Traditional DSP synthesis produces two MADDs, (a*b + x) and (e*f + y), where x = c*d and y = g*h. This takes up six DSP blocks: two for the MADDs, two for the multipliers x and y, and two for the adders that combine the partial results with i -- not an effective utilization of DSP blocks. Alternatively, the automatic inferencing techniques described in this paper re-organize the expression as a*b + (g*h + (e*f + (c*d + i))), leading to the inference of four MADD functions, so the whole expression maps onto just four DSP48 slices.

Altera and Lattice devices each present a different architecture, with more multiplication and addition/subtraction capability in a single DSP block. They also provide the ability to build two-level MADD functionality into a single DSP block, and a dedicated chain of such blocks can be formed as the application requires. This reduces the overall number of DSP blocks required and maximizes efficiency. Consider the expression (a*b + c*d) - (e*f + g*h). In this case, the optimization techniques re-organize the expression as (a*b - e*f) + (c*d - g*h), which can be directly inferred as a single two-level MADD and mapped to a single DSP block. Similarly, take the expression a*b + c, which cannot be inferred as a MADD on certain FPGA architectures and therefore takes two DSP blocks: one for the multiplier and one for the adder. An ideal alternative is to infer a single MADD for this expression by treating it as a*b + c*1. Extending this idea, if c is an input of the form {k, 000000}, it is converted to a*b + k*1000000 (a binary constant) for MADD inference.

Figure 4: Verilog coding example for MADD operator.

Figure 5: DSP implementation for the coding example in figure 4.
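Because these transformations are pure re-associations of integer arithmetic, their correctness is easy to check with a bit-accurate software model. The sketch below (Python rather than RTL; the madd helper name is ours) verifies both rewrites over random operands:

```python
# Bit-accurate check that the DSP-friendly re-associations preserve values.
import random

def madd(a, b, c):
    # Models one inferred multiply-add (MADD) primitive: a*b + c.
    return a * b + c

random.seed(1)
for _ in range(1000):
    a, b, c, d, e, f, g, h, i = (random.randrange(-2**17, 2**17) for _ in range(9))
    # (a*b + c*d) + (e*f + g*h) + i rewritten as a chain of four MADDs,
    # so the whole expression maps onto four DSP48 slices.
    assert madd(a, b, madd(g, h, madd(e, f, madd(c, d, i)))) == \
           (a*b + c*d) + (e*f + g*h) + i
    # (a*b + c*d) - (e*f + g*h) rewritten as one two-level MADD pattern.
    assert (a*b + c*d) - (e*f + g*h) == (a*b - e*f) + (c*d - g*h)
```

The same model also covers the a*b + c case, since madd(a, b, c) equals a*b + c*1 by construction.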

In some cases, even after applying the above transformations, an adder or multiplier remains isolated. During mapping, preference is then given to multipliers and adders that can be chained to already-inferred DSP blocks, since this keeps DSP logic together and reduces routing delays. Figure 4 shows a MADD coding example with pipeline registers and add/sub logic, which can be optimized into a single DSP block. Figure 5 shows a simplified, high-level DSP block diagram view, omitting the control signals.

Optimizations for Arithmetic Operators

A multiplier is the critical element in most DSP functions. It is important to achieve an optimal result on this portion of the design to avoid critical-datapath issues. The DSP resources available in today's FPGA architectures provide flexible solutions for efficiently building multipliers of any size.

Xilinx DSP48 slices can produce high-frequency multipliers if cascaded and implemented correctly. These blocks support pipelining, which increases the efficiency of the design while reducing overall device utilization. Altera and Lattice also provide capabilities in their DSP architectures for building large multipliers: a multiplier of up to 36x36 bits can be built in a single DSP block, which maximizes DSP utilization for the design.

Figure 6: Coding example for a fully pipelined multiplier.

Figure 7: A simple fully pipelined 35x35 multiplier block diagram (A[34:0] x B[34:0] -> P).

Figure 8: Mapping algorithm example for a fully pipelined 35x35 multiplier. [Block diagram: the operands are split at bit 17 into partial products -- {0,A[16:0]} x {0,B[16:0]} produces P[16:0], the cross products {0,A[16:0]} x B[34:17] and A[34:17] x {0,B[16:0]} are combined through dedicated 17-bit shifts, and A[34:17] x B[34:17] produces P[69:34].]
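The decomposition behind figures 7 and 8 can be expressed as a bit-accurate model. The sketch below (Python rather than RTL, unsigned operands for simplicity, and the function name is ours) splits the 35-bit operands at bit 17 and recombines the four 17x17-style partial products with 17-bit and 34-bit shifts:

```python
# Bit-accurate model of mapping a 35x35 multiply onto 17x17-style partial
# products across cascaded DSP48 slices. Unsigned sketch only; the real
# DSP48 multiplier is 18x18 signed with pipeline registers.
import random

def mul35x35(a, b):
    a_lo, a_hi = a & 0x1FFFF, a >> 17   # A[16:0], A[34:17]
    b_lo, b_hi = b & 0x1FFFF, b >> 17   # B[16:0], B[34:17]
    p0 = a_lo * b_lo                    # contributes P[33:0]
    p1 = a_hi * b_lo + a_lo * b_hi      # aligned via the 17-bit shift path
    p2 = a_hi * b_hi                    # aligned at bit 34, P[69:34]
    return p0 + (p1 << 17) + (p2 << 34)

random.seed(7)
for _ in range(1000):
    a = random.randrange(2**35)
    b = random.randrange(2**35)
    assert mul35x35(a, b) == a * b
```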

Figure 9: Adder driving a MAC operator.

The use of automatic optimizations on wide multipliers is very powerful, including cascading, pipelining and sign extension. To illustrate this behavior, a 35x35 multiplier coding example (figure 6) and block diagrams (figures 7 & 8) are shown, targeting a Virtex-4 device. In contrast, when regular arithmetic operators such as adders and subtractors are implemented in the FPGA's general logic fabric, they can become expensive and inefficient.

Optimization via Cascading Connections

In DSP applications where high frequency is a requirement, it is critical to build fast arithmetic operators. Consider the example in figure 9, where a 48-bit adder drives a MAC operator. This can be implemented using two DSP blocks, as illustrated in figure 10. The dedicated chaining from the output of the first DSP block to the input of the second reduces routing delay and increases the speed of the design. When mapped onto DSP blocks, a frequency of over 400 MHz is easily achieved, along with a significant area reduction.

Optimizations Using Scan Chains

Most FPGA technologies support scan chains for applications in which inputs arrive in a delayed manner, such as FIR filters. Scan chains are used to cascade DSP stages when a stage requires the input of the previous stage delayed by one clock cycle. The output of the previous stage is routed through a scanout port to the input of the next stage. A further advantage is that the routing delay associated with exiting the DSP block into the FPGA fabric and returning from the fabric to the next DSP block is eliminated.
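One plausible reading of the adder-plus-MAC structure of figures 9 and 10 can be modeled as follows (a Python sketch under our own naming; the pipeline registers and the exact dataflow of the 48-bit dedicated cascade are simplified away): the first DSP block forms a wide sum, which then seeds the accumulation performed by the second block.

```python
# Hypothetical model of an adder driving a MAC across two chained DSP
# blocks. Names and dataflow are our assumptions, not the tool's netlist.
def adder_driving_mac(x, y, ab_pairs):
    acc = x + y              # DSP block 1: the wide (48-bit) adder
    for a, b in ab_pairs:
        acc = a * b + acc    # DSP block 2: multiply-accumulate via cascade
    return acc

print(adder_driving_mac(10, 5, [(2, 3), (4, 5)]))  # 15 + 6 + 20 -> 41
```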

Figure 10: DSP block mapping utilizing dedicated cascading.

It is possible to use several different coding approaches. Figure 11 shows a complete Verilog coding example for a 4-tap FIR filter, with its block diagram in figure 12.

Results

Table 2 presents results for various DSP applications, obtained by implementing five typical real-world designs on the DSP resources available from three different FPGA vendors. The Mentor Graphics Precision Synthesis tool was used to generate the results.

Conclusion

Regardless of the particular FPGA architecture and its available DSP resources, synthesis tools must provide optimal inference and mapping capability to DSP designers. The ability to generically infer dedicated DSP blocks and map them uniquely to the different target technologies proves to be a significant advantage in FPGA DSP design. In addition, generic RTL coding styles make it possible to benchmark different technologies without spending excessive time understanding the in-depth specifications of all the competing FPGA architectures, thus freeing up design time. The techniques described are intended to automatically capture design intent and achieve optimal implementations for different DSP-based FPGA architectures. They support DSP applications by optimally mapping all the basic building blocks, such as multiply-accumulate, multiply-add, multipliers and adders. The results demonstrated for various DSP applications, including fully parallel FIR filters, are very promising.

Figure 11: A complete coding example for a 4-tap FIR filter design.

[Block diagram: four cascaded modgen_multadd stages. The coefficients C0(7:0)-C3(7:0) feed the multiplier a inputs, the data input data(7:0) is delayed from stage to stage through the scanout_b ports, and the partial sums cascade through the c inputs to produce result(15:0).]
Figure 12: FIR filter block diagram with 4 taps.
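The behavior of the 4-tap filter in figures 11 and 12 can be captured in a short bit-accurate model (a Python sketch, not the generated netlist; names are ours). The shift of the hist list plays the role of the scan chain, and the inner loop is the cascaded multiply-add chain:

```python
# Behavioral model of a 4-tap FIR mapped onto cascaded multiply-add DSP
# stages: stage k multiplies the input delayed by k cycles by coefficient k
# and adds the previous stage's partial sum. DSP register latency ignored.
def fir4(samples, coeffs):
    hist = [0] * len(coeffs)            # delay line / scan chain state
    out = []
    for x in samples:
        hist = [x] + hist[:-1]          # shift one tap per clock
        acc = 0
        for h, c in zip(hist, coeffs):  # cascaded MADD chain
            acc = h * c + acc
        out.append(acc)
    return out

# An impulse recovers the coefficients, as expected of an FIR filter.
print(fir4([1, 0, 0, 0, 0], [3, 5, 7, 9]))  # -> [3, 5, 7, 9, 0]
```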

Table 2: Mapping results for DSP applications in different FPGA architectures

                                   |   Architecture A    |   Architecture B    |   Architecture C
Designs                            | LUTs  Flops  DSP    | LUTs  Flops  DSP    | LUTs  Flops  DSP
                                   |              Blocks |              Blocks |              Blocks
Fully Pipelined 35x35 Multiplier   |     0    157      4 |     0    220      8 |     0    210      8
16-tap Transposed FIR Filter       |     0      0      4 |     0      0      8 |     0      0      4
Floating-point Complex Multiplier  | 2,064  2,006     16 | 1,724  2,518     32 | 2,194  2,694     32
1,024-point FFT with 8 Butterflies | 9,268 11,328     32 | 8,265 25,392     64 | 7,806 25,685     64
3-point FFT                        |   360    258      2 |   402    296      4 |   116    332      4

Acknowledgements

The authors thank Tom Dillon, President of Dillon Engineering, for providing the DSP reference designs used in this study, and David Pinto for his input on the paper.
References:

[1] Using Precision Synthesis to Design with the XtremeDSP Slice in Virtex-4, http://www.mentor.com/products/fpga_pld/techpubs/index.cfm
[2] XtremeDSP Design User Guide, http://www.xilinx.com/bvdocs/userguides/ug073.pdf
[3] Altera Stratix II DSP Blocks, http://www.altera.com/products/devices/stratix2/features/dsp/st2-dsp_block.html
[4] Lattice Semiconductor sysDSP Block Brings High DSP Performance to FPGAs, http://www.latticesemiconductor.com/products/fpga/ecp/sysdsp.cfm