Abstract: This paper presents improved architectures for a fused floating-point three-term adder. The fused floating-point three-term adder performs two additions in a single unit to achieve better performance and better accuracy compared to a network of traditional floating-point two-term adders, which is referred to as a discrete design. In order to further improve the performance of the three-term adder, several optimization techniques are applied including a new exponent compare and significand alignment, dual-reduction, early normalization, three-input leading zero anticipation, compound addition/rounding and pipelining. The proposed design is implemented for both single and double precision and synthesized with a 45 nm CMOS standard-cell library. The improved fused floating-point three-term adder reduces the area and power consumption by about 20% and reduces the latency by about 35% compared to a discrete floating-point three-term adder. Based on the data flow analysis, the proposed three-term adder can be split into three pipeline stages. Since the latencies of three pipeline stages are fairly well balanced, the throughput is increased to 2.7 times that of the non-pipelined design.
Index Terms: Floating-point arithmetic, fused floating-point operations, high speed computer arithmetic, three-term adder.
Manuscript received November 23, 2013; revised March 01, 2014; accepted March 17, 2014. Date of publication July 22, 2014; date of current version September 25, 2014. This paper was recommended by Associate Editor F. Clermidy.
J. Sohn is with Intel Corp., Austin, TX 78746 USA (e-mail: jongwook.sohn@intel.com).
E. E. Swartzlander, Jr. is with the Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78712 USA (e-mail: e.swartzlander@IEEE.org).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2014.2333680
I. INTRODUCTION
Most general purpose processors and application specific processors use floating-point arithmetic which is specified in the IEEE-754 Standard for floating-point arithmetic [1]. The benefits of floating-point arithmetic over fixed-point arithmetic come from its constant relative precision over a wide dynamic range. However, floating-point operations require complex processes such as alignment, normalization and rounding, which increases the area, power consumption and latency. In order to reduce the overhead, fused floating-point units have been proposed, which execute several operations in a single unit to reduce the area, power consumption and latency. Several fused floating-point units have been introduced: Fused Multiply-Add (FMA) [2]–[4], fused add-subtract [5], [6], fused dot product [7], [8] and fused three-term adder [9], [10].
Addition is the most frequently used operation in many algorithms and applications. Traditional floating-point two-term adders are extensively discussed in the previous work [11]–[13]. In case of the additions in series, however, a network of the two-term adders loses accuracy due to the multiple roundings, one after each addition. Many recent floating-point units can accommodate operations that have three inputs (e.g., FMA). As a result, adding a three-term adder does not require a special data path or register file. Several issues for the design of the fused floating-point three-term adder are discussed in the previous work [9], [10]: 1) Complex exponent processing and significand alignment, 2) Complementation after the significand addition, 3) Large precision significand addition, 4) Massive cancellation management, and 5) Complex round processing. In this paper, those issues are addressed by investigating several optimization techniques. The algorithms and optimizations described in this paper can be also extended to fused floating-point multi-term adders with more than three operands. Therefore, the improved fused floating-point three-term adder will contribute to the next generation of floating-point arithmetic unit design.
The proposed fused floating-point three-term adder takes three normalized operands and executes two additions (or subtractions) as
(1)
It supports all five of the rounding modes specified in the IEEE-754 Standard [1]. Several algorithms and optimization techniques are applied not only to resolve the design issues but also to improve the performance:
1) A new exponent compare and significand alignment scheme is proposed. The three exponent differences are computed in parallel and the differences are used for the significand alignment. By shifting the significands with the partial difference results, the exponent difference computation and the significand alignment can be overlapped. The control logic determines the largest exponent and three aligned significands. This approach reduces the latency by performing the exponent processing and significand alignment simultaneously.
2) Dual-reduction is used to handle both cases that the result of the significand addition is positive and negative. Two reduction trees generate both the positive and negative significand pairs and the positive significand pair is selected based on the significand comparison. The selected significand pair produces a positive sum so that the complementation after the significand addition can be skipped, which reduces the latency.
3) Early normalization is applied to reduce the significand addition size. By performing the normalization prior to the significand addition, the adder size is reduced by half and the rest of bits are covered by the rounding, which significantly reduces the latency.
4) Since the normalization is performed prior to the significand addition, the Leading Zero Anticipation (LZA) and normalization shift are on the critical path. In order to reduce the latency, a three-input LZA is proposed, which hides the delay of the 3:2 reduction trees.
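The accuracy argument made in the Introduction (a network of two-term adders rounds after each addition, whereas a fused unit rounds once) can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the hardware described in this paper: the helper names `round_to_single`, `discrete_add3` and `fused_add3` are hypothetical, single precision is emulated by packing a binary64 value into binary32 with Python's `struct` module, only round-to-nearest-even is modeled, and the fused result is obtained from an exact rational sum rather than from a significand datapath.

```python
import struct
from fractions import Fraction


def round_to_single(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value (ties to even)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def discrete_add3(a: float, b: float, c: float) -> float:
    """Model of two chained two-term adders: one rounding after each addition."""
    partial = round_to_single(a + b)      # first rounding
    return round_to_single(partial + c)   # second rounding


def fused_add3(a: float, b: float, c: float) -> float:
    """Model of a fused three-term add: exact sum, then a single rounding.

    Assumption: converting the exact sum to binary64 before rounding to
    binary32 is harmless for the operands used here, because their exact
    sum is representable in binary64.
    """
    exact = Fraction(a) + Fraction(b) + Fraction(c)
    return round_to_single(float(exact))


# Operands chosen so that the first rounding of the discrete design hits a tie:
# 1 + 2^-24 lies exactly halfway between 1 and 1 + 2^-23 in single precision.
a, b, c = 1.0, 2.0 ** -24, 2.0 ** -24
discrete = discrete_add3(a, b, c)
fused = fused_add3(a, b, c)
print("discrete:", discrete)
print("fused:   ", fused)
```

With these operands the discrete model drops both small terms, one at each rounding, and returns 1, while the fused model returns 1 + 2^-23, which is the exact sum.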
1549-8328 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
SOHN AND SWARTZLANDER: A FUSED FLOATING-POINT THREE-TERM ADDER 2843
TABLE I
EXPONENT COMPARE CONTROL LOGIC
TABLE II
2 BIT EXTENDED LSBS FOR COMPLEMENTATION
(3)
where the inputs are the $i$th bits of the two significands and the level index refers to the level of the prefix tree adder. A Kogge-Stone adder is used in this paper, but any type of adder can be used. The number of levels implemented in the first part of the addition depends on the delay optimization needed to balance it against the delay of the LZA; in this paper, level 0 (the bitwise propagate and generate signals) and level 1 are implemented.
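As an illustration of the prefix-tree levels mentioned above, the following is a minimal behavioral sketch of a Kogge-Stone adder. It uses the textbook propagate/generate recurrence and is not a reproduction of (3); the function name `kogge_stone_add`, the use of Python integers as bit vectors, and the omission of a carry-in are assumptions made for this example. Level 0 forms the bitwise propagate and generate signals, and each later level doubles the span of the group signals.

```python
def kogge_stone_add(a: int, b: int, width: int) -> tuple:
    """Add two width-bit unsigned integers with a Kogge-Stone prefix tree.

    Returns (sum modulo 2**width, carry_out). Integers are used as bit
    vectors, so each bitwise operation acts on all bit positions at once.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    # Level 0: bitwise propagate and generate signals.
    p = a ^ b
    g = a & b
    p0 = p  # the level-0 propagate bits are needed again for the final sum

    # Levels 1, 2, ...: combine each position with the group `dist` bits below.
    dist = 1
    while dist < width:
        g = g | (p & (g << dist))
        p = p & (p << dist)
        g &= mask
        p &= mask
        dist *= 2

    # After the last level, bit i of g is the carry out of bit position i.
    carries_in = (g << 1) & mask
    carry_out = (g >> (width - 1)) & 1
    return (p0 ^ carries_in), carry_out


total, carry = kogge_stone_add(0b0101, 0b0011, 4)
print(bin(total), carry)
```

In a split implementation such as the one described above, the first levels of this loop would be evaluated before normalization and the remaining levels afterwards; the sketch does not model that split.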
Fig. 6. Significand process: alignment, invert, reduction, normalization, addition and rounding.
where and are the first and second op codes, respectively.
The aligned significands are inverted based on the effective sign bits for the subtraction. The three-operand subtraction requires that up to two significands are complemented. If all three operands are negative, they are added and the sign becomes negative. In order to avoid the increments after the inverters, 2 bits are extended to the LSB of the significands that are propagated to the significand addition
C. Early Normalization
One of the design issues for the fused floating-point three-term adder is the high precision significand addition. The traditional fused floating-point three-term adder aligns the significands to $2f+6$ bits, where $f$ is the number of significand bits [9], [10]. Such large significands require a large significand addition and normalization, which is the biggest bottleneck of the fused floating-point three-term adder. To reduce the overhead, early normalization is applied, which was previously proposed for the fused floating-point multiply-add unit [4]. As shown in Fig. 6, the normalization is performed prior to the significand addition so that the significand adder size is reduced by about half. The remaining lower bits are passed to rounding. By normalizing the significand pair prior to the significand addition, the round position is fixed so that the significand addition and
2846 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, VOL. 61, NO. 10, OCTOBER 2014
rounding can be performed in parallel, which significantly reduces the latency of the critical path. More details of the significand addition and rounding are described in the next section.
D. Three-Input LZA and Significand Comparison
Since the normalization is performed prior to the significand addition, the LZA and normalization are on the critical path. To use a traditional two-input LZA, the three significands need to be reduced to two using a 3:2 CSA, which increases the delay. The three-input LZA encodes the three inputs at once to skip the delay of the 3:2 CSA. The three-input LZA can be implemented by extending the traditional two-input LZA [14]. Like most LZAs, the three-input LZA consists of two parts: 1) Pre-encoding indicator vectors and 2) Leading Zero Detection (LZD) logic for generating the leading zero count. The pre-encoder performs the bitwise operations to generate the W vector as
(4)
where the inputs are the $i$th bits of the three significands. Since the input significands are inverted based on the effective sign bits, the W vector is always positive. Each digit of the W vector can be represented by one of four elements, indicating that $w_i$ equals 0, 1, 2 and 3, respectively.
The F vector is computed with the three symbols as
(7)
The F vector is passed to the LZD logic. The LZD produces the leading zero count, which becomes the shift amount of the normalization. For the LZD, any type of tree logic can be used, as discussed in the previous work [16], [17]. In this paper, for fast normalization, an LZD logic that produces the MSBs of the shift amount first is selected so that the LZD logic and the normalization shifter are overlapped [4]. Fig. 7 shows the 64 bit LZD tree, which can be used for the single precision. The higher shift bits are generated first, so the delay of the upper shifter levels is overlapped with the lower levels of the LZD logic. As a result, only the last level of the shifter is in the critical path.
Most of the two-input LZAs are inaccurate due to a possible 1 bit error. Similarly, the proposed three-input LZA also requires correction logic. For fast error detection and correction, concurrent error correction logic can be used, which was previously proposed in [18]–[20].
In order to determine the sign of the significand sum, which is used for the selection of the dual-reduction, significand comparison is required. In order to reduce the overhead, LZA pre-encoded bits can be used for the comparison tree [21].
(8)
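The overlap between the LZD logic and the normalization shifter described above can be sketched behaviorally. The following is an illustrative model only, not the LZD tree of Fig. 7: the function name `normalize_msb_first` is hypothetical, the input is an arbitrary bit vector rather than the pre-encoded F vector, and the possible 1 bit LZA error and its correction are not modeled. The point of the sketch is the ordering: each bit of the leading zero count is resolved from the most significant bit downward, and the corresponding shifter stage is applied as soon as that bit is known.

```python
def normalize_msb_first(vec: int, width: int = 64) -> tuple:
    """Return (leading_zero_count, normalized_vector) for a width-bit vector.

    The count is resolved one bit at a time, most significant bit first,
    and each resolved bit immediately drives one stage of the left shifter.
    """
    assert width > 0 and (width & (width - 1)) == 0, "width must be a power of two"
    mask = (1 << width) - 1
    vec &= mask
    if vec == 0:
        return width, 0  # all-zero input: nothing to normalize

    count = 0
    step = width // 2
    while step:
        # If the top `step` bits are all zero, this bit of the count is 1
        # and the matching shifter stage shifts left by `step`.
        if (vec >> (width - step)) == 0:
            count += step
            vec = (vec << step) & mask
        step //= 2
    return count, vec


count, normalized = normalize_msb_first(0x0000_0000_00F0_0000, 64)
print(count, hex(normalized))
```

In hardware the count bits would come from the LZD tree operating on the unshifted F vector, while the shifter stages operate on the significands; folding both into one loop here is a simplification that yields the same count and the same normalized value.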
Fig. 12. Data flow of a pipelined improved fused floating-point three-term adder.
analysis, the proposed improved fused floating-point three-term adder can be split into three pipeline stages so that results are produced on every cycle. Fig. 12 shows the data flow and critical path of the improved fused floating-point three-term adder. The critical paths of the three pipeline stages are
First stage: Unpack → Exponent compare → Significand alignment
Second stage: Invert → LZA/LZD → Normalization
Third stage: Significand addition → Round select.
Since the second stage has the largest latency among the three pipeline stages, its latency determines the throughput. Due to the latches and control signals between the pipeline stages, the
Fig. 13. Delay-area curve for the three-term adders (single precision).
V. RESULTS
Previous sections introduced several optimization techniques for a fused floating-point three-term adder. The proposed design is implemented for both single and double precision in Verilog-HDL and synthesized with the Nangate 45 nm CMOS technology standard cell library [14]. In order to evaluate the improvement of the proposed design, the area and latency are compared with the traditional designs [9], [10]. Fig. 13 shows the delay-area curve for the three single precision fused floating-point three-term adders. Depending on the target frequency, the implementations are synthesized with different area and delay. For a fair comparison, the most efficient point of delay-area product for each design is used. Table III compares the area of the three single precision fused floating-point three-term adders. The proposed design has a smaller significand adder compared to the traditional designs. Also, the proposed design does not use incremental adders for the complementation and rounding, while the traditional designs require the incremental adders for the exponent and significand computations and rounding, which requires additional adder area. Although the proposed design has twice as many shifters and CSAs, the area reduction in the main adders achieves a smaller area for the entire logic compared to the traditional designs.
Table IV compares the latency of the three designs. The main difference between the proposed design and the traditional designs is the significand addition and rounding. The proposed design performs a smaller significand addition compared to the traditional designs (for more delay-optimization, the first level pg generator is performed prior to the normalization) without complementation by using the dual-reduction. In contrast, the
TABLE III
AREA COMPARISON OF FUSED FLOATING-POINT THREE-TERM ADDERS (SINGLE PRECISION)
TABLE IV
LATENCY COMPARISON OF FUSED FLOATING-POINT THREE-TERM ADDERS (SINGLE PRECISION)
traditional designs require a large significand adder and the inverters followed by incremental adders for the complementation. Also, the proposed design performs the significand addition and rounding simultaneously so that the latency is significantly reduced. Finally, the shifters for the alignment and normalization are overlapped with the exponent difference computation and LZD logic, respectively, so that only the last level of the shifter is in the critical path.
TABLE V
RESULT COMPARISON
Table V summarizes the results for both single and double precision three-term adders. For the discrete design, the delay-optimized floating-point adder [13] is used, which is well known as a high-performance floating-point adder. All the percentages in the table are ratios compared to the discrete design. The traditional fused floating-point three-term adders have achieved a better accuracy with reduced area, power consumption (5–8%) and latency (3–14%) compared to the discrete design. The proposed fused floating-point three-term adder applies several techniques discussed above so that the area and power consumption are reduced by about 15–20% and the latency is reduced by about 35% compared to the discrete design, which is much better than that of the traditional fused designs.
The double precision implementation requires about twice as much area and power consumption as the single precision implementation due to the larger logic components. Since the major components such as significand alignment, significand addition, LZA and normalization are implemented using the tree structures that logarithmically increase the latency, the latency of the double precision implementation is increased by only 20%. The benefits of the proposed optimization techniques are shown in both single and double precision.
The pipelined fused floating-point three-term adder is split into three stages. Table VI shows the area, latency and power consumption of the three pipeline stages. Each pipeline stage requires latches to maintain the data and control signals between the stages, which increases the area, latency and power consumption. However, the latencies of the three pipeline stages