2005oct03 DSP Eda Ta

Build efficient datapath designs

By Aamir Farooqui
Sr. R&D Engineer
Synopsys Inc.
E-mail: aamirf@synopsys.com

Datapath is often considered to be a specialized part of an arithmetic logic unit design. However, the scope of datapath components is more than that: it is present in any RTL code that contains *, +, -, >, <, <=, >= and shift operators. Even simple finite state machines require datapath components in the form of incrementors, adders or multipliers. With increasing clock frequencies now reaching the GHz scale, the burden of efficient datapath design is rapidly increasing.

In many high-performance VLSI designs, including all high-performance microprocessors, datapath is implemented with custom design techniques. The circuit and layout of such structures are largely handcrafted to achieve maximum performance and better layout density. Most designers are not aware that incrementors, adders, subtractors or multipliers represent the presence of datapath components in finite state machines, memory address generators, FIFO and stack. In general, whenever a +, -, *, <, >, or = sign is used in HDL, a datapath component is built in silicon.

With process technologies shrinking below 90nm, issues with signal integrity and leakage current are becoming much more significant. Beyond the already complex timing, reliability and functional correction tasks, sophisticated analyses on noise and power are needed to ensure that the final results will operate within specifications. To perform custom design datapath for all these disciplines would require a large amount of time. In today’s fast-changing market, it is no longer possible to handcraft all the RTL and get the best performance.

Synopsys offers fully automated high-performance datapath generators. These tools offer datapath performance comparable to handcrafted designs and provide more efficient signal integrity, power, verification and layout tools for multidimensional specification. This systemic solution increases the overall quality-of-results (QoR) while reducing the time to create these results.

Smart datapath generators integrated into the design compiler operate on VHDL and Verilog-based RTL designs and offer high-performance datapath synthesis. Meanwhile, its module compiler language (MCL) was developed specifically to describe high-performance datapath designs and provide more flexibility and features, structured datapath design, carry-save operands bit manipulation and pipelining. The module compiler tool takes the MCL description and, given a certain set of constraints, creates high-performance datapath implementations. A powerful feature of MCL is to preserve datapath regularity during synthesis and generate automatic relative placement used by place-and-route tools.

Figure 1: With carry-save addition, a carry-propagate adder is needed only in the final recombination; without it, A+B+C would require two carry-propagate adders.

Figure 2: The vector addition of six inputs would use a carry-propagate adder for each 2-to-1 reduction.

Building blocks
Adders are probably the most widely used building blocks in digital circuits. Every multiplier, divider, incrementor, decrementor, comparator and subtractor requires some sort of add operation. Synopsys has a variety of adder architectures embedded into its datapath tools. These adders range from slow, but small ripple carry adders to very fast large fastcla adders for integer and floating-point operations. Two new adder architectures in DC and MC are “pprefix” and “aofcla,” respectively. These adders offer the best area-time QoR for given design constraints.

Multipliers are the most critical element of any datapath, as the speed of the multiplier often determines the speed of the cycle time of the digital design. Designing high-performance multipliers is always a challenge. Moreover, the physical design of the multiplier is rather difficult because of the complex interconnections. There are three basic parts of a parallel multiplier: the partial product generator, partial product adder and final adder.

Datapath generators offer a wide variety of multiplier architectures with the ability to mix and match their components for best QoR. Module compiler also generates the relative placement information of the multiplier structures. Datapath generators also offer the flexibility to choose between 3-2 adders or 4 x 2 compressors for better physical layout of the multiplier architectures.

Shifters are expensive (large-area) logic components in arithmetic datapaths. Shifters are sometimes used to generate fixed dividers or multipliers. Datapath tools offer shifters that may have a combination of the following features:
• Arithmetic or logical;
• Shifts only to the right, only to the left or bidirectional;
• Regular or barrel shifter.

Multiplexers are not truly datapath elements, but they are extensively used in adders, shifters, dividers, comparators and datapath I/O selection. The area and speed of the multiplexers are mostly governed by the technology’s library cells. However, a judicious selection of multiplexers and their control signals also makes a big difference in datapath design. Some simple optimizations for multiplexers are merging two levels into one larger multiplexer, using a pair of inverting and non-inverting output multiplexers or using AND-OR logic.

Figure 3: The vector addition of six inputs using CSA rewires a ripple adder to do a 3-to-2 reduction.

Datapath capabilities
Any integer datapath component and architecture combination can be used either through direct component instantiation for optimal user control (not a preferred method) using DW foundation libraries or through arithmetic operators in the RTL. Since datapath tools are context-driven, they automatically select, given a certain set of constraints, the best architecture through RTL extraction,
also called adaptive datapath extraction.

Module compiler covers designs that cannot be expressed using standard RTL or require features such as structured datapath design, carry-save operands bit manipulation or relative placement information.

The floating-point (FP) datapath components can only be instantiated through DW Foundation and/or module compiler as a function call. The Synopsys FP library offers a set of parameterizable FP components and graphic architectures. FP functions conform to the numerical representation model and accuracy requirements of the IEEE 754 FP standard. All of the functions provide either one or two optional 8bit FP operational status flags to the interface logic. All of the functions, with the exception of FP comparison (DW_cmp_fp), have a 3bit RND input port that enables dynamic programming of the rounding modes on intermediate FP operation results. The formats of RND rounding input and STATUS output are the same for all functions. The static parameters e (exponent) and f (fraction) enable a user to implement not only the standard IEEE 754 FP functions, but also virtually any FP format, and allow the selection of the right amount of precision and range for a particular application.

There are various techniques to improve the performance of a datapath, such as avoiding expensive carry-propagations and using carry-save operations wherever possible. Other techniques include high-level arithmetic optimizations (in-context operator synthesis, operator merging and operator sharing). These techniques are most effective when the biggest and most complex possible datapath blocks are extracted from the RTL code.

Carry-save operations: One major speed-area enhancement technique used in modern digital circuit design is the ability to add operands with minimal carry propagation. The basic idea is that three or more operands can be reduced to two using carry-save adders (CSAs) that perform the addition of multiple operands while keeping the sum and the carries separate. This means that all of the columns can be added in parallel without relying on the result of the previous column, creating a two-output “adder” with a constant delay that is independent of the input size.

In Figure 1, A+B+C is computed with one single carry-propagate adder using carry-save addition. Without carry-save addition, A+B+C would require two carry-propagate adders. Only the final recombination of the final carry and sum vectors requires a carry-propagate addition. CSAs are useful when there are many addends. This is the case in multiplication, for example, where many partial products must be added together.

The schematic in Figure 2 shows how a generic tool would synthesize a vector addition of six inputs. This would use a carry-propagate adder for each 2-to-1 reduction. If this were a performance design, a carry-look-ahead type architecture would be obtained for each adder and suffer the area penalty.

The schematic in Figure 3 shows how the same function would be implemented in CSA arithmetic using Synopsys datapath synthesis tools. This technique rewires a ripple adder to do a 3-to-2 reduction. Then at the final stage, the last two intermediate sums need to be reduced to a single binary sum. A carry-propagate adder will be used for this final 2-to-1 reduction.

Complex operator merging: Complex arithmetic operator merging is a special case of in-context synthesis. Operator merging allows the removal of carry-propagate adders and results in faster and smaller designs. Such optimizations are possible only in an environment of automated module generation using in-context synthesis.

Datapath generators are capable of operator merging for *, +, -, >, <, <= and >=. Shift, multiplexer and truncation operators are also merged.

In sum-of-product (SOP) operator merging, multiple products and summands are added together in one datapath block with only one carry-propagate final adder. Internal results are kept in redundant number representation (carry-save) wherever possible:

    z = a * 2 * b * d - 3 * (c * d * e)

In product-of-sum (POS) operator merging, multiple sums and a product are mapped to a datapath block with only one carry-propagate final adder:

    z = (a + b - c + d) * c

In-context operator synthesis: This can improve the QoR in a real design containing a mix of datapath operators. Three simple examples will be used to explain the concept: constant multiply, square and shift. In each case, building an operator with respect to how it is being used delivers better results than building the operator out of context.

The first case, multiplication by a constant, shows what happens if you look just beyond the boundary of the operator. If you create a 32 x 32 multiplier and then tie one of the inputs to a constant pattern, the results are predictably worse than creating a multiplier structure that takes advantage of the constant input. In-context operator synthesis techniques allow the description of the multiplier X * Y, automatic detection that X or Y is a constant, and generation of the appropriate structure.

The second case in Figure 4 is more interesting. A 32 x 32 multiplier can be created and both of its inputs tied together to generate a squaring circuit. Alternatively, a special multiplier structure can be created for this purpose. Such a 32bit square is 40 percent smaller and 15 percent faster than the 32bit multiply. In-context synthesis automates the design process to describe the hardware simply as X * X.

Figure 4: A 32bit square is 40 percent smaller and 15 percent faster than the 32bit multiply. In-context synthesis describes the hardware as X * X.

Circuit timing is one of the most important design criteria to be optimized in several phases of the synthesis process. In-context timing-driven synthesis and optimization allows tree delay minimization of the circuits based on the input arrival profile. Consider the case in Figure 5. The “a” input arrives late, therefore it can be applied to the last CSA, hence minimizing the worst path delay of the circuit.

Figure 5: The ‘a’ input arrives late, thus it can be applied to the last CSA and minimize the worst path delay of the circuit.
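The carry-save reduction described above can be modeled in a few lines of software. The following is a minimal sketch in Python, not a description of how the Synopsys tools implement it: the function names `carry_save_add` and `add_many` are hypothetical, operands are assumed to be non-negative integers, and Python's built-in addition stands in for the single final carry-propagate adder.

```python
def carry_save_add(a, b, c):
    """3-to-2 reduction: three operands in, separate sum and carry vectors out."""
    # Each column is handled independently, so no carry ripples between columns.
    partial_sum = a ^ b ^ c
    # The majority function gives each column's carry, weighted one column up.
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carry


def add_many(operands):
    """Reduce operands with CSAs until two remain, then add those two once.

    Returns (total, csa_count) so the number of 3-to-2 stages is visible.
    """
    values = list(operands)
    csa_count = 0
    while len(values) > 2:
        s, c = carry_save_add(values[0], values[1], values[2])
        values = values[3:] + [s, c]
        csa_count += 1
    # Stand-in for the one carry-propagate adder (CPA) at the end.
    total = sum(values)
    return total, csa_count
```

For six inputs, `add_many([1, 2, 3, 4, 5, 6])` uses four 3-to-2 reductions followed by one ordinary addition, which matches the four CSA blocks and single CPA in the Figure 3 structure. The model says nothing about delay or area; it only shows that the sum and carry vectors can stay separate until the final step.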
DesignWare component    Description
DW_add_fp               Floating-point addition
DW_mult_fp              Floating-point multiplication
DW_flt2i_fp             Conversion from floating point to integer
DW_i2flt_fp             Conversion from integer to floating point
DW_cmp_fp               Floating-point comparison

Table 1: All functions provide either one or two optional 8bit FP operational status flags to the interface logic.

Mutually exclusive operator sharing: Mutually exclusive operator sharing offers the maximum area benefit through sharing a single operator for different operations. For example, if there is an add operation within an if-else statement, then a single adder with the required control signals is built by the datapath tools.

    if(COND) then
        Z = A + B;
    else
        Y = C + D;
    end if;

Register retiming, automatic pipelining: Pipelining achieves faster throughput clock rates while possibly sacrificing latency. Register retiming performs optimization of sequential logic by moving registers through combinational logic and across hierarchical boundaries to optimize timing with minimum area impact. The same functionality is preserved at I/O boundaries. Register retiming also supports clock gating and will work on any design into which gated clocks have been inserted through the use of a power compiler. On purely combinational designs, DC Ultra can be used to insert pipelines to achieve timing requirements and support multiple clocks. The formal verification tool Formality has a feature that allows retimed designs to be verified as long as the design does not contain an internal feedback loop.

Third-generation techniques allow the operators to be created on the fly. They can insert pipelines in the middle of the operators if required. This allows pipelining to be automated smoothly across the entire datapath. Designers can specify the desired operating speed. If the compiler is not able to achieve that speed, it will insert pipelines where needed. Designers can guide the placement of the pipelines, such as at the end of a Wallace Tree in a multiplier. Such control over pipelining is difficult in a manual design environment.

DC Ultra synthesis flow
DC Ultra has a powerful new datapath engine that can extract arithmetic components and optimize them to provide improved results in datapath-intensive designs. This capability is a key component of the high-level optimization (HLO) phase of compile.

The first step in the DC Ultra datapath flow is macro-architecture selection. A macro-architecture is a group of arithmetic components configured to implement an arithmetic expression. In this phase, the RTL is examined to identify macro-architectures that contain datapath elements.

Figure 6 shows an example of two different macro-architectures derived from a segment of Verilog code. The new datapath flow can explore different architectures corresponding to a section of the design containing the datapath and pick the best macro-architecture, given a certain set of constraints.

Figure 6: The new datapath flow can explore different architectures corresponding to a section of the design containing the datapath and pick the best macro-architecture, given a certain set of constraints. Area cost of the first macro-architecture: 2 multipliers, 1 adder tree, 1 shifter, 3 final adders (SOP optimizations cannot be applied). Area cost of the second: 2 multipliers, 1 adder tree, 1 shifter, 2 final adders (saving of one final adder).

This step is followed by datapath extraction from the macro-architecture (a macro-architecture could contain several arithmetic operators). Successful datapath extraction requires that arithmetic operators be “directly” connected, or have no random logic in between datapath elements defined by the *, +, -, >, <, <= and >=. Shift and truncation operators can be extracted, as well as multiplexers. Datapath extraction is timing-driven and occurs during compile. Any extracted operators can either be shared or unshared, depending on which strategy yields the greatest benefit, again given a certain set of constraints.

Finally, the extracted datapath is implemented using the Smart Datapath Generators. The new datapath flow provides a quick way to identify the smallest and fastest architectures for the arithmetic components. Smart Generators provide support across operators and apply all the optimizations mentioned above to the datapath designs.

Most datapath designs are larger and more complex than the designs shown here. Even for designs consisting of tens of thousands of gates, Synopsys datapath tools deliver high performance and enhancements in productivity. These tools supplement design techniques with intelligent, high-performance synthesis and optimization algorithms.
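Operator sharing, which appears both in the if-else example and in the extraction step of the flow, can be illustrated with a small behavioral model. This is only a sketch in Python under stated assumptions: the 8bit width, the names `adder`, `unshared` and `shared`, and the choice to return the selected sum (rather than route it to separate Z and Y destinations) are all illustrative and are not taken from the article or from any Synopsys tool.

```python
WIDTH = 8                      # assumed operand width for this model
MASK = (1 << WIDTH) - 1


def adder(x, y):
    """Model of one WIDTH-bit adder; the result wraps around on overflow."""
    return (x + y) & MASK


def unshared(cond, a, b, c, d):
    """Two adders compute both sums; a multiplexer then picks one result."""
    sum_ab = adder(a, b)       # first adder instance
    sum_cd = adder(c, d)       # second adder instance
    return sum_ab if cond else sum_cd


def shared(cond, a, b, c, d):
    """Multiplexers pick the operands first, so a single adder is enough."""
    left = a if cond else c    # operand multiplexer controlled by cond
    right = b if cond else d   # operand multiplexer controlled by cond
    return adder(left, right)  # the only adder instance
```

Because the two additions are mutually exclusive, `shared` and `unshared` return the same value for every input, while `shared` calls `adder` once instead of twice. The trade-off is that the multiplexers move in front of the adder, which is why sharing is decided against the timing constraints rather than applied unconditionally.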
