Datapath is present in any RTL code that contains +, -, >, = or shift operators. With clock frequencies now reaching the GHz scale, the burden of efficient datapath design is increasing. Synopsys has a variety of adder architectures embedded into its datapath tools.
Copyright:
Attribution Non-Commercial (BY-NC)
By Aamir Farooqui
Sr. R&D Engineer
Synopsys Inc.
E-mail: aamirf@synopsys.com

Datapath is often considered to be a specialized part of an arithmetic logic unit design. However, the scope of datapath components is more than that: it is present in any RTL code that contains the *, +, -, >, <, <= and >= operators or shift operators. Even simple finite state machines require datapath components in the form of incrementors, adders or multipliers. With clock frequencies now reaching the GHz scale, the burden of efficient datapath design is rapidly increasing.

In many high-performance VLSI designs, including all high-performance microprocessors, datapath is implemented with custom design techniques. The circuit and layout of such structures are largely handcrafted to achieve maximum performance and better layout density. Most designers are not aware that incrementors, adders, subtractors or multipliers represent the presence of datapath components in finite state machines, memory address generators, FIFOs and stacks. In general, whenever a +, -, *, <, >, or = sign is used in HDL, a datapath component is built in silicon.

With process technologies shrinking below 90nm, issues with signal integrity and leakage current are becoming much more significant. Beyond the already complex timing, reliability and functional correctness tasks, sophisticated analyses of noise and power are needed to ensure that the final results will operate within specifications. Performing custom datapath design across all these disciplines would require a large amount of time. In today's fast-changing market, it is no longer possible to handcraft all the RTL and get the best performance.

Synopsys offers fully automated high-performance datapath generators. These tools offer datapath performance comparable to handcrafted designs and provide more efficient signal integrity, power, verification and layout tools for multidimensional specification. This systemic solution increases the overall quality-of-results (QoR) while reducing the time to create these results.

Smart datapath generators integrated into Design Compiler (DC) operate on VHDL and Verilog-based RTL designs and offer high-performance datapath synthesis. Meanwhile, the Module Compiler Language (MCL) was developed specifically to describe high-performance datapath designs and provide more flexibility and features: structured datapath design, carry-save operands, bit manipulation and pipelining. The Module Compiler (MC) tool takes the MCL description and, given a certain set of constraints, creates high-performance datapath implementations. A powerful feature of MCL is that it preserves datapath regularity during synthesis and automatically generates relative placement information used by place-and-route tools.

Building blocks

Adders are probably the most widely used building blocks in digital circuits. Every multiplier, divider, incrementor, decrementor, comparator and subtractor requires some sort of add operation. Synopsys has a variety of adder architectures embedded into its datapath tools. These adders range from slow but small ripple-carry adders to very fast, large "fastcla" adders for integer and floating-point operations. Two new adder architectures in DC and MC are "pprefix" and "aofcla," respectively. These adders offer the best area-time QoR for given design constraints.

Multipliers are the most critical element of any datapath, as the speed of the multiplier often determines the cycle time of the digital design. Designing high-performance multipliers is always a challenge. Moreover, the physical design of the multiplier is rather difficult because of the complex interconnections. There are three basic parts of a parallel multiplier: the partial product generator, the partial product adder and the final adder.

Datapath generators offer a wide variety of multiplier architectures with the ability to mix and match their components for best QoR. Module Compiler also generates the relative placement information of the multiplier structures. Datapath generators also offer the flexibility to choose between 3-2 adders or 4-2 compressors for better physical layout of the multiplier architectures.

Shifters are expensive (large-area) logic components in arithmetic datapaths. Shifters are sometimes used to generate fixed dividers or multipliers. Datapath tools offer shifters that may have a combination of the following features:
• Arithmetic or logical;
• Shifts only to the right, only to the left or bidirectional;
• Regular or barrel shifter.

Multiplexers are not truly datapath elements, but they are extensively used in adders, shifters, dividers, comparators and datapath I/O selection. The area and speed of the multiplexers are mostly governed by the technology's library cells. However, a judicious selection of multiplexers and their control signals also makes a big difference in datapath design. Some simple optimizations for multiplexers are merging two levels into one larger multiplexer, using a pair of inverting and non-inverting output multiplexers, or using AND-OR logic.

Datapath capabilities

Any integer datapath component and architecture combination can be used either through direct component instantiation using the DesignWare (DW) Foundation libraries, which gives optimal user control but is not the preferred method, or through arithmetic operators in the RTL. Since datapath tools are context-driven, they automatically select, given a certain set of constraints, the best architecture through RTL extraction, also called adaptive datapath extraction.

Module Compiler covers designs that cannot be expressed using standard RTL or that require features such as structured datapath design, carry-save operands, bit manipulation or relative placement information.

The floating-point (FP) datapath components can only be instantiated through DW Foundation and/or Module Compiler as a function call. The Synopsys FP library offers a set of parameterizable FP components and graphic architectures. FP functions conform to the numerical representation model and accuracy requirements of the IEEE 754 FP standard. All of the functions provide either one or two optional 8bit FP operational status flags to the interface logic. All of the functions, with the exception of FP comparison (DW_cmp_fp), have a 3bit RND input port that enables dynamic programming of the rounding modes on intermediate FP operation results. The formats of the RND rounding input and STATUS output are the same for all functions. The static parameters e (exponent) and f (fraction) enable a user to implement not only the standard IEEE 754 FP functions, but also virtually any FP format, and allow the selection of the right amount of precision and range for a particular application.

DesignWare component   Description
DW_add_fp              Floating-point addition
DW_mult_fp             Floating-point multiplication
DW_flt2i_fp            Conversion from floating point to integer
DW_i2flt_fp            Conversion from integer to floating point
DW_cmp_fp              Floating-point comparison

Table 1: All functions provide either one or two optional 8bit FP operational status flags to the interface logic.

There are various techniques to improve the performance of a datapath, such as avoiding expensive carry propagations and using carry-save operations wherever possible. Other techniques include high-level arithmetic optimizations (in-context operator synthesis, operator merging and operator sharing). These techniques are most effective when the biggest and most complex possible datapath blocks are extracted from the RTL code.

Carry-save operations: One major speed-area enhancement technique used in modern digital circuit design is the ability to add operands with minimal carry propagation. The basic idea is that three or more operands can be reduced to two using carry-save adders (CSAs) that perform the addition of multiple operands while keeping the sum and the carries separate. This means that all of the columns can be added in parallel without relying on the result of the previous column, creating a two-output "adder" with a constant delay that is independent of the input size.

In Figure 1, A+B+C is computed with one single carry-propagate adder using carry-save addition. Without carry-save addition, A+B+C would require two carry-propagate adders. Only the final recombination of the final carry and sum vectors requires a carry-propagate addition. CSAs are useful when there are many addends. This is the case in multiplication, for example, where many partial products must be added together.

Figure 1: Without carry-save addition, A+B+C would require two carry-propagate adders; with it, a carry-propagate adder (CPA) is needed only in the final recombination.

The schematic in Figure 2 shows how a generic tool would synthesize a vector addition of six inputs. This would use a carry-propagate adder for each 2-to-1 reduction. If this were a performance design, a carry-look-ahead type architecture would be obtained for each adder, with the corresponding area penalty.

Figure 2: The vector addition of six inputs would use a carry-propagate adder for each 2-to-1 reduction.

The schematic in Figure 3 shows how the same function would be implemented in CSA arithmetic using Synopsys datapath synthesis tools. This technique rewires a ripple adder to do a 3-to-2 reduction. Then, at the final stage, the last two intermediate sums need to be reduced to a single binary sum. A carry-propagate adder is used for this final 2-to-1 reduction.

Figure 3: The vector addition of six inputs (a to f) using CSA rewires a ripple adder to do a 3-to-2 reduction; four CSA stages feed one final CPA.

Complex operator merging: Complex arithmetic operator merging is a special case of in-context synthesis. Operator merging allows the removal of carry-propagate adders and results in faster and smaller designs. Such optimizations are possible only in an environment of automated module generation using in-context synthesis. Datapath generators are capable of operator merging for *, +, -, >, <, <= and >=. Shift, multiplexer and truncation operators are also merged.

In sum-of-product (SOP) operator merging, multiple products and summands are added together in one datapath block with only one carry-propagate final adder. Internal results are kept in redundant number representation (carry-save) wherever possible:

z = a * 2 * b * d - 3 * (c * d * e)

In product-of-sum (POS) operator merging, multiple sums and a product are mapped to a datapath block with only one carry-propagate final adder:

z = (a + b - c + d) * c

In-context operator synthesis: This can improve the QoR in a real design containing a mix of datapath operators. Three simple examples will be used to explain the concept: constant multiply, square and shift. In each case, building an operator with respect to how it is being used delivers better results than building the operator out of context.

The first case, multiplication by a constant, shows what happens if you look just beyond the boundary of the operator. If you create a 32 x 32 multiplier and then tie one of the inputs to a constant pattern, the results are predictably worse than creating a multiplier structure that takes advantage of the constant input. In-context operator synthesis techniques allow the description of the multiplier as X * Y, automatic detection that X or Y is a constant, and generation of the appropriate structure.

The second case, in Figure 4, is more interesting. One can create a 32 x 32 multiplier and then tie both inputs together to generate a squaring circuit. Alternatively, a special multiplier structure can be created for this purpose. Such a 32bit square is 40 percent smaller and 15 percent faster than the 32bit multiply. In-context synthesis automates the design process so that the hardware is described simply as X * X.

Figure 4: Multiply vs. square. A 32bit square is 40 percent smaller and 15 percent faster than the 32bit multiply. In-context synthesis describes the hardware as X * X.

Circuit timing is one of the most important design criteria to be optimized in several phases of the synthesis process. In-context timing-driven synthesis and optimization allows tree delay minimization of the circuits based on the input arrival profile. Consider the case in Figure 5. The "a" input arrives late, therefore it can be applied to the last CSA, hence minimizing the worst path delay of the circuit.

Figure 5: The "a" input arrives late, thus it can be applied to the last CSA to minimize the worst path delay of the circuit.

Mutually exclusive operator sharing: Mutually exclusive operator sharing offers the maximum area benefit through sharing a single operator for different operations. For example, if there is an add operation within an if-else statement, then a single adder with the required control signals is built by the datapath tools.

    if (COND) then
        Z <= A + B;
    else
        Y <= C + D;
    end if;

Register retiming, automatic pipelining: Pipelining achieves faster throughput clock rates while possibly sacrificing latency. Register retiming performs optimization of sequential logic by moving registers through combinational logic and across hierarchical boundaries to optimize timing with minimum area impact. The same functionality is preserved at I/O boundaries. Register retiming also supports clock gating and will work on any design into which gated clocks have been inserted through the use of Power Compiler. On purely combinational designs, DC Ultra can be used to insert pipelines to achieve timing requirements and support multiple clocks. The formal verification tool Formality has a feature that allows retimed designs to be verified as long as the design does not contain an internal feedback loop.

Third-generation techniques allow the operators to be created on the fly. They can insert pipelines in the middle of the operators if required. This allows pipelining to be automated smoothly across the entire datapath. Designers can specify the desired operating speed. If the compiler is not able to achieve that speed, it will insert pipelines where needed. Designers can guide the placement of the pipelines, such as at the end of a Wallace tree in a multiplier. Such control over pipelining is difficult in a manual design environment.

DC Ultra synthesis flow

DC Ultra has a powerful new datapath engine that can extract arithmetic components and optimize them to provide improved results in datapath-intensive designs. This capability is a key component of the high-level optimization (HLO) phase of compile.

The first step in the DC Ultra datapath flow is macro-architecture selection. A macro-architecture is a group of arithmetic components configured to implement an arithmetic expression. In this phase, the RTL is examined to identify macro-architectures that contain datapath elements.

Figure 6 shows an example of two different macro-architectures derived from a segment of Verilog code. The new datapath flow can explore different architectures corresponding to a section of the design containing the datapath and pick the best macro-architecture, given a certain set of constraints.

Figure 6: The new datapath flow can explore different architectures corresponding to a section of the design containing the datapath and pick the best macro-architecture, given a certain set of constraints. Both architectures take inputs A to F and produce output O. One has an area cost of 2 multipliers, 1 adder tree, 1 shifter and 3 final adders, and SOP optimizations cannot be applied to it. The other has an area cost of 2 multipliers, 1 adder tree, 1 shifter and 2 final adders, a saving of one final adder.

This step is followed by datapath extraction from the macro-architecture (a macro-architecture could contain several arithmetic operators). Successful datapath extraction requires that arithmetic operators be "directly" connected, that is, that there be no random logic between the datapath elements defined by *, +, -, >, <, <= and >=. Shift and truncation operators can be extracted, as well as multiplexers. Datapath extraction is timing-driven and occurs during compile. Any extracted operators can be either shared or unshared, depending on which strategy yields the greatest benefit, again given a certain set of constraints.

Finally, the extracted datapath is implemented using the Smart Datapath Generators. The new datapath flow provides a quick way to identify the smallest and fastest architectures for the arithmetic components. Smart Generators provide support across operators and apply all the optimizations mentioned above to the datapath designs.

Most datapath designs are larger and more complex than the designs shown here. Even for designs consisting of tens of thousands of gates, Synopsys datapath tools deliver high performance and enhancements in productivity. These tools supplement design techniques with intelligent, high-performance synthesis and optimization algorithms.
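
As a rough illustration of the carry-save operations discussed in the article, the following Python sketch models a 3-to-2 carry-save adder on plain integers and reduces six operands with a single carry-propagate addition at the end. It is a behavioural model only, not Synopsys tool output; the function names `csa` and `add_carry_save` are hypothetical and chosen for this example.

```python
def csa(a, b, c):
    """Model a 3-to-2 carry-save adder on non-negative integers.

    Each bit column is compressed independently, so no carry ripples
    across columns. The returned pair satisfies
    a + b + c == sum_vec + carry_vec.
    """
    sum_vec = a ^ b ^ c                               # per-column sum bit
    carry_vec = ((a & b) | (a & c) | (b & c)) << 1    # per-column carry, weight x2
    return sum_vec, carry_vec


def add_carry_save(operands):
    """Reduce operands with CSAs, then do one carry-propagate add (CPA).

    Returns (total, csa_count, cpa_count). The reduction order here is
    arbitrary (an assumption of this sketch); a timing-driven tool would
    choose it from the input arrival times.
    """
    pending = list(operands)
    csa_count = 0
    while len(pending) > 2:
        a, b, c = pending[:3]
        s, cy = csa(a, b, c)
        pending = pending[3:] + [s, cy]   # three vectors in, two vectors out
        csa_count += 1
    cpa_count = 1 if len(pending) == 2 else 0
    total = sum(pending)                  # the only carry-propagate addition
    return total, csa_count, cpa_count


if __name__ == "__main__":
    print(add_carry_save([3, 5, 7, 11, 13, 17]))  # (56, 4, 1)
```

For six inputs the model uses four CSA stages and one CPA, which matches the structure described for Figure 3, whereas a chain of ordinary adders would need five carry-propagate additions.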
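
One way to see why a dedicated squarer can be smaller than a general multiplier with both inputs tied together is to count partial-product bits. The Python sketch below is an assumption-laden illustration, not the structure the Synopsys generators actually build: it uses the identity x_i·x_i = x_i and the symmetry x_i·x_j = x_j·x_i to fold the partial-product array. The function names are hypothetical, and the bit counts are not meant to reproduce the 40 percent and 15 percent figures quoted in the article.

```python
def multiplier_pp_bits(n):
    """Partial-product bits in a plain n x n array multiplier."""
    return n * n


def squarer_pp_bits(n):
    """Partial-product bits after folding the array for X * X.

    n diagonal terms (x_i AND x_i is just x_i) plus one term for each
    pair i < j, whose two symmetric copies merge into a single bit at
    the next higher weight.
    """
    return n + n * (n - 1) // 2


def square_folded(x, n):
    """Compute x*x for an n-bit x using only the folded partial products."""
    total = 0
    for i in range(n):
        xi = (x >> i) & 1
        total += xi << (2 * i)                  # diagonal term, weight 2^(2i)
        for j in range(i + 1, n):
            xj = (x >> j) & 1
            total += (xi & xj) << (i + j + 1)   # merged pair, weight 2^(i+j+1)
    return total


if __name__ == "__main__":
    print(multiplier_pp_bits(32), squarer_pp_bits(32))  # 1024 528
```

The folded array needs roughly half as many partial-product bits, which leaves fewer bits for the partial product adder to compress before the final adder.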
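
The mutually exclusive operator sharing described in the article can be sketched behaviourally as follows. This Python model is only an illustration under stated assumptions: the function names are hypothetical, `None` stands for "output not driven in this branch", and the real datapath tools decide between shared and unshared operators from the design constraints.

```python
def unshared(cond, a, b, c, d):
    """Two adders: both sums are computed, the condition selects one."""
    z_sum = a + b          # adder 1
    y_sum = c + d          # adder 2
    return (z_sum, None) if cond else (None, y_sum)


def shared(cond, a, b, c, d):
    """One adder: the condition steers operands through two 2-to-1 muxes."""
    left = a if cond else c     # operand mux 1
    right = b if cond else d    # operand mux 2
    result = left + right       # the single shared adder
    return (result, None) if cond else (None, result)


if __name__ == "__main__":
    for cond in (True, False):
        print(unshared(cond, 1, 2, 10, 20) == shared(cond, 1, 2, 10, 20))
```

Because only one branch of the if-else is active at a time, both versions produce the same (Z, Y) result, and the shared version trades one adder for two multiplexers and their control signal.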