Abstract—Variable latency adders have been recently proposed in the literature. A variable latency adder employs speculation: the exact arithmetic function is replaced with an approximated one that is faster and gives the correct result most of the time, but not always. The approximated adder is augmented with an error detection network that asserts an error signal when speculation fails. Speculative variable latency adders have attracted strong interest thanks to their capability to reduce average delay compared to traditional architectures. This paper proposes a novel variable latency speculative adder based on the Han-Carlson parallel-prefix topology that proves more effective than the variable latency Kogge-Stone topology. The paper describes the stages into which variable latency speculative prefix adders can be subdivided and presents a novel error detection network that reduces the error probability compared to previous approaches. Several variable latency speculative adders, for various operand lengths and using both the Han-Carlson and Kogge-Stone topologies, have been synthesized using the UMC 65 nm library. The results show that the proposed variable latency Han-Carlson adder outperforms both previously proposed speculative Kogge-Stone architectures and non-speculative adders when high speed is required. It is also shown that non-speculative adders remain the best choice when the speed constraint is relaxed.

Index Terms—Addition, digital arithmetic, parallel-prefix adders, speculative adders, speculative functional units, variable latency adders.

Manuscript received July 23, 2014; revised December 12, 2014; accepted January 29, 2015. This paper was recommended by Associate Editor C. P. Ravikumar.
The authors are with the Department of Electrical Engineering and Information Technology, University of Napoli "Federico II", I80125 Naples, Italy (e-mail: dadecaro@unina.it).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2015.2403036
1549-8328 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION
Adders are basic functional units in computer arithmetic. Binary adders are used in microprocessors for addition and subtraction operations as well as for floating-point multiplication and division. Therefore, adders are fundamental components, and improving their performance is one of the major challenges in digital design. Theoretical research [1] has established lower bounds on the area and delay of $n$-bit adders: the former grows linearly with the adder size, while the latter has an $O(\log n)$ behavior.

High-speed adders are based on well-established parallel-prefix architectures [1], [2], including Brent-Kung [3], Kogge-Stone [4], Sklansky [5], Han-Carlson [6], Ladner-Fischer [7], and Knowles [8]. These standard architectures operate with fixed latency. Better average performance can be achieved by using variable latency adders, which have been recently proposed in the literature [9]. A variable latency adder employs speculation: the exact arithmetic function is replaced with an approximated one that is faster and gives the correct result most of the time, but not always. The approximated adder is augmented with an error detection network that asserts an output signal when speculation fails. In this case (misprediction), another clock cycle is needed to obtain the correct result with the help of a correction stage. Since the addition time is one clock cycle when no error occurs and two clock cycles when the speculation fails, the average addition time can be computed as

$$T_{\mathrm{AVE}} = T_{\mathrm{CLK}}\,(1 - P_{\mathrm{err}}) + 2\,T_{\mathrm{CLK}}\,P_{\mathrm{err}} = T_{\mathrm{CLK}}\,(1 + P_{\mathrm{err}}) \qquad (1)$$

where $T_{\mathrm{CLK}}$ is the clock period and $P_{\mathrm{err}}$ is the error probability of the speculative adder.

Speculative adders are built upon the observation that the critical path is rarely activated in traditional adders [9]–[13]. In particular, in traditional adders each output depends on all previous bits, so the most significant output depends on all the input bits. Instead, in speculative adders each output depends only on the previous $K$ bits, where $K$ grows as $O(\log n)$ [12]–[15]. This reflects the fact that a propagate chain longer than $K$ is a very rare event.

A first speculative approach to addition was proposed by Nowick [12] in an asynchronous context, implementing a variable latency adder by cutting the lowest levels of a Kogge-Stone adder. In a synchronous context, Verma et al. [13] propose a variable latency speculative adder; here the speculative addition is realized in the same way as in [12], by cutting the lower levels of a Kogge-Stone adder. A similar approach is employed in [14]. In [15], a variable latency carry-select adder is introduced, where the adder is fragmented into several windows, each one containing a Kogge-Stone adder.

The Kogge-Stone adder is often used when speed is the primary concern, since it uses the minimum number of logic levels and each cell in the adder tree has a fanout of 2. This comes at the cost of using many propagate-generate cells and many wires that must be routed between stages.

In this paper we propose a novel variable latency speculative adder based on the Han-Carlson [6] parallel-prefix topology. The Han-Carlson topology uses one more stage than the Kogge-Stone adder, while requiring fewer cells and simpler wiring. Thus, it can achieve speed similar to the Kogge-Stone adder at lower power consumption and area [16]. We show that a speculative carry tree can be obtained by pruning some intermediate levels of the classical Han-Carlson topology. The paper presents a rigorous derivation of the error detection network and shows that the error detection network required in speculative Han-Carlson adders is significantly faster than the one used by the speculative Kogge-Stone architecture. An extensive set of implementation results for 65 nm CMOS technology shows that the proposed Han-Carlson variable latency adders outperform previously developed variable latency Kogge-Stone architectures. Compared with traditional, non-speculative adders, our analysis demonstrates that variable latency Han-Carlson adders show appreciable improvements when the highest speed is required; otherwise, the burden imposed by the error detection and error correction stages overwhelms any advantage.
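Equation (1) can be evaluated directly. The following Python sketch computes the average addition time and, by inverting (1), the largest error probability compatible with a target average time. The function names are illustrative and do not come from the paper.

```python
def average_addition_time(t_clk: float, p_err: float) -> float:
    """Eq. (1): one clock cycle without error, two cycles on misprediction."""
    if not 0.0 <= p_err <= 1.0:
        raise ValueError("p_err must be a probability")
    return t_clk * (1.0 - p_err) + 2.0 * t_clk * p_err  # equals t_clk * (1 + p_err)


def max_error_probability(t_clk: float, t_target: float) -> float:
    """Invert (1): largest P_err for which T_AVE stays within t_target."""
    if t_target < t_clk:
        raise ValueError("T_AVE can never be smaller than one clock period")
    return min(t_target / t_clk - 1.0, 1.0)
```

For example, `average_addition_time(200.0, 0.01)` gives about 202 ps. With a 1% error probability, the speculative adder therefore pays only about 1% in average time for the shorter clock period it allows.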
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
A. Prefix Addition
The binary addition problem can be formulated as follows: given an $n$-bit augend $A = a_{n-1}\dots a_1 a_0$ and an $n$-bit addend $B = b_{n-1}\dots b_1 b_0$, generate the $(n+1)$-bit sum $S = s_n s_{n-1}\dots s_1 s_0 = A + B$. Let us indicate as $c_i$ the carry out of the $i$-th bit. The sum bit $s_i$ and the carry $c_i$ can be computed as follows:

$$s_i = a_i \oplus b_i \oplus c_{i-1} \qquad (2)$$
$$c_i = a_i\, b_i + (a_i \oplus b_i)\, c_{i-1} \qquad (3)$$

In prefix addition we use three stages to compute the sum: pre-processing, prefix-processing and post-processing.
In the pre-processing stage the generate and propagate signals are computed as:

$$g_i = a_i \cdot b_i \qquad (4)$$
$$p_i = a_i \oplus b_i \qquad (5)$$

The condition $g_i = 1$ means that a carry is generated at bit $i$, while the condition $p_i = 1$ means that a carry is propagated through bit $i$.
The concept of generate and propagate can be extended to a block of contiguous bits, from bit $k$ to bit $j$ (with $j \ge k$), as follows:

$$G_{j:k} = \begin{cases} g_j & \text{if } j = k \\ g_j + p_j\, G_{j-1:k} & \text{otherwise} \end{cases} \qquad (6)$$
$$P_{j:k} = \begin{cases} p_j & \text{if } j = k \\ p_j\, P_{j-1:k} & \text{otherwise} \end{cases} \qquad (7)$$

where $0 \le k \le j \le n-1$. The condition $G_{j:k} = 1$ means that a carry is generated in the block $j:k$, while the condition $P_{j:k} = 1$ means that a carry is propagated through the block. Thus, for any bit $i$, the carry can be expressed as:

$$c_i = G_{i:0} + P_{i:0}\, c_{in} \qquad (8)$$

where $c_{in}$ is the input carry of the $n$-bit adder. In the following, for the sake of simplicity, we assume that $c_{in} = 0$, so that (8) simplifies as:

$$c_i = G_{i:0} \qquad (9)$$

The block generate and propagate terms are computed in the prefix-processing stage of the adder. To that purpose, the $(G, P)$ couples are expressed with the help of the prefix operator $\bullet$ defined as follows:

$$(G_{j:k}, P_{j:k}) = (G_{j:m}, P_{j:m}) \bullet (G_{m-1:k}, P_{m-1:k}) = (G_{j:m} + P_{j:m}\, G_{m-1:k},\; P_{j:m}\, P_{m-1:k}) \qquad (10)$$

where $k < m \le j$. The prefix operator has two important properties: it is associative and it is idempotent. These properties are exploited in the prefix-processing stage to speed up the computation.
Finally, in the post-processing stage, the sum bits are computed using (8) and:

$$s_i = p_i \oplus c_{i-1} \qquad (11)$$

B. Han-Carlson and Kogge-Stone Parallel-Prefix Adder Topologies
The pre-processing and post-processing stages of a prefix adder involve only simple operations on signals local to each bit position. Therefore, adder performance mainly depends on the prefix-processing stage.
Fig. 1 shows the Han-Carlson and Kogge-Stone prefix adder topologies. Here black dots represent the prefix operator (10), while white dots are simple placeholders.

Fig. 1. Han-Carlson and Kogge-Stone parallel-prefix topologies.

The Kogge-Stone adder is composed of $\log_2 n$ levels and presents a fanout of two at each level, using a large number of black cells and many wire tracks. A good trade-off between fanout, number of logic levels and number of black cells is given by Han-Carlson. The outer rows of the Han-Carlson topology are Brent-Kung [3] graphs, while the inner rows are Kogge-Stone graphs. The Han-Carlson adder in Fig. 1 uses a single Brent-Kung level at the beginning and at the end of the graph, and the number of levels is $\log_2 n + 1$.

III. VARIABLE LATENCY SPECULATIVE PREFIX ADDERS
Variable latency speculative prefix adders can be subdivided into five stages: pre-processing, speculative prefix-processing, post-processing, error detection and error correction. The error correction stage is off the critical path, since it has two clock cycles to obtain the exact sum when speculation fails.
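The sketch below models the three-stage prefix addition of Section II at the behavioral level, together with the two carry trees of Fig. 1. It covers pre-processing (4)–(5), the prefix operator (10), the carries (9) and post-processing (11). It is an illustrative model under our own assumptions: bits are plain Python integers, all names are hypothetical, and no gate-level netlist is implied. The Han-Carlson function follows the description in the text. Its first level pairs each odd bit with its even neighbour (a Brent-Kung row), the inner levels are Kogge-Stone rows restricted to odd positions, and one final level completes the even positions, for $\log_2 n + 1$ levels in total.

```python
import random
from typing import Callable, List, Tuple

GP = Tuple[int, int]  # (G, P) pair describing a block of contiguous bits


def prefix_op(hi: GP, lo: GP) -> GP:
    """Prefix operator (10): (G_hi, P_hi) . (G_lo, P_lo)."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def pre_process(a: int, b: int, n: int) -> List[GP]:
    """Pre-processing stage, (4)-(5): g_i = a_i AND b_i, p_i = a_i XOR b_i."""
    return [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]


def kogge_stone(gp: List[GP]) -> List[GP]:
    """log2(n) levels: at distance d, every node i >= d absorbs node i - d."""
    cur, d = list(gp), 1
    while d < len(cur):
        cur = [prefix_op(cur[i], cur[i - d]) if i >= d else cur[i]
               for i in range(len(cur))]
        d *= 2
    return cur  # cur[i] == (G_{i:0}, P_{i:0})


def han_carlson(gp: List[GP]) -> List[GP]:
    """Levels with d = 1, 2, 4, ... update odd positions only; one extra
    level then lets each even position absorb its odd neighbour."""
    cur, d = list(gp), 1
    while d < len(cur):
        cur = [prefix_op(cur[i], cur[i - d]) if i % 2 == 1 and i >= d else cur[i]
               for i in range(len(cur))]
        d *= 2
    # final Brent-Kung level: even bit i combines with the finished prefix at i - 1
    return [prefix_op(cur[i], cur[i - 1]) if i % 2 == 0 and i > 0 else cur[i]
            for i in range(len(cur))]


def prefix_add(a: int, b: int, n: int, tree: Callable[[List[GP]], List[GP]]) -> int:
    """Pre-processing, prefix tree and post-processing (11), with c_in = 0."""
    gp = pre_process(a, b, n)
    carries = [g for g, _ in tree(gp)]  # c_i = G_{i:0}, eq. (9)
    s = 0
    for i in range(n):
        c_prev = carries[i - 1] if i > 0 else 0
        s |= (gp[i][1] ^ c_prev) << i   # s_i = p_i XOR c_{i-1}
    return s | (carries[n - 1] << n)    # s_n = c_{n-1}


if __name__ == "__main__":
    rng = random.Random(0)
    for tree in (kogge_stone, han_carlson):
        for _ in range(1000):
            a, b = rng.getrandbits(16), rng.getrandbits(16)
            assert prefix_add(a, b, 16, tree) == a + b
    print("both topologies match integer addition")
```

Both trees compute the same carries (9). They differ only in how many prefix operators they use and how those operators are wired, which is exactly the area and wiring trade-off discussed above.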
Fig. 4. The nodes of the prefix-processing stage whose outputs are needed to compute the error signal are named "checking nodes" and are highlighted as big hatched dots for the topologies in Figs. 2–3.

(18)

where the symbol represents the logical OR.
It is important to note that (18) is a necessary and sufficient error condition that requires the calculation of . Unfortunately, these terms are not actually computed by the speculative prefix-processing stage (avoiding the computation of these terms is the key idea of speculative adders). Thus, in previous papers, (18) is replaced by the following looser relation:

(19)

The last equation is a necessary-only error condition. By using (19), the error signal can be triggered even in the absence of an actual misprediction. While this does not harm the correct operation of the speculative adder, a high rate of such "false positive" errors degrades the average addition time (1).
In this paper, instead, we rewrite the necessary and sufficient condition (18) in a form that does not require the terms. To that purpose, let us consider the last two terms of the OR in (18), with index and :

(20)
(21)

Substituting (21) in (20), the terms with index and of (18) can be simplified as:

(22)

Similar simplifications can be realized by considering in (18) the terms and and so on. Finally one obtains:

(23)

Let us consider, as an example, the prefix-processing stage in Fig. 2. The error signal (23) is given by:

(24)

(26)

It can easily be seen that in (26) the terms in the second OR are implied by the terms in the first OR. Let us consider, for instance, the first two terms of the OR (assuming that is even). We have:

(27)

Thus, we can write:

(28)

The last equation can be simplified following an approach similar to the previous subsection. Let us consider the last two terms of the OR in (28), with index and (assuming that is even):

(29)

One has:

(30)

Substituting (30) in (29), the terms with index and of (28) can be simplified as:

(31)

Similar simplifications can be realized by considering in (28) the terms and and so on. Finally one obtains:

(32)

Let us consider, as an example, the prefix-processing stage in Fig. 3. The error signal (32) is given by:

By comparing (23) and (32), it can easily be seen that the number of terms to be OR-ed to obtain the error signal is halved in the Han-Carlson topology compared to Kogge-Stone.
We name "checking nodes" the nodes of the prefix-processing stage whose outputs are needed to compute the error signal. The checking nodes for both the Kogge-Stone example of Fig. 2 and the Han-Carlson example of Fig. 3 are highlighted as big hatched dots in Fig. 4.
As can be observed, in Kogge-Stone some of the checking cells are at the last level of the graph; their output signals are available after three black-cell delays. In Han-Carlson the critical checking cells are in the second-to-last level of the graph and are also available after three black-cell delays, in spite of the larger number of levels of the Han-Carlson prefix-processing stage.
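The difference between a necessary and sufficient error condition and a necessary-only one can be made concrete with a behavioral simulation. This sketch rests on assumptions of our own, which the surviving text does not fully specify. It takes the speculative carry to be the block generate over the $K$ most recent bits, $c^*_i = G_{i:\max(i-K+1,0)}$. The exact error is then "some speculative carry differs from (9)". The looser detector is a hypothetical check in the spirit of (19) that fires on any run of $K$ propagate signals, and its precise form in the paper may differ. The Monte Carlo loop uses uniformly distributed inputs, as in the error rate analysis of Section IV.

```python
import random
from typing import List, Tuple


def speculative_add(a: int, b: int, n: int, k: int) -> Tuple[int, bool, bool]:
    """Behavioral model of an n-bit speculative adder with a K-bit window.

    ASSUMPTION (illustrative): the speculative carry is c*_i = G_{i:max(i-K+1,0)}.
    Returns (sum after correction, exact_error, loose_flag).
    """
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]  # eq. (4)
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # eq. (5)

    # Exact carries c_i = G_{i:0} (eq. (9), c_in = 0), evaluated as a ripple.
    exact: List[int] = []
    c = 0
    for i in range(n):
        c = g[i] | (p[i] & c)
        exact.append(c)

    # Speculative carries: recursion (6) restricted to the window [max(i-K+1,0), i].
    spec: List[int] = []
    for i in range(n):
        G = 0
        for t in range(max(i - k + 1, 0), i + 1):
            G = g[t] | (p[t] & G)
        spec.append(G)

    # Necessary and sufficient error: at least one speculative carry is wrong.
    exact_error = spec != exact
    # HYPOTHETICAL necessary-only check: a run of K propagates ending at some i >= K.
    loose_flag = any(all(p[t] for t in range(i - k + 1, i + 1)) for i in range(k, n))

    # Error correction: on misprediction the exact carries replace the speculative ones.
    carries = exact if exact_error else spec
    s = 0
    for i in range(n):
        s |= (p[i] ^ (carries[i - 1] if i > 0 else 0)) << i  # eq. (11)
    return s | (carries[n - 1] << n), exact_error, loose_flag


def error_rates(n: int, k: int, trials: int, seed: int = 0) -> Tuple[float, float]:
    """Monte Carlo estimate of (true misprediction rate, loose-detector rate)."""
    rng = random.Random(seed)
    true_err = flagged = 0
    for _ in range(trials):
        a, b = rng.getrandbits(n), rng.getrandbits(n)
        s, err, loose = speculative_add(a, b, n, k)
        assert s == a + b        # correction always restores the exact sum
        assert loose or not err  # every real misprediction is also flagged
        true_err += err
        flagged += loose
    return true_err / trials, flagged / trials


if __name__ == "__main__":
    p_exact, p_loose = error_rates(n=32, k=8, trials=5000)
    # Average addition time (1), expressed in units of T_CLK, for each detector.
    print(f"exact detector: P_err = {p_exact:.4f}, T_AVE = {1 + p_exact:.4f} T_CLK")
    print(f"loose detector: P_err = {p_loose:.4f}, T_AVE = {1 + p_loose:.4f} T_CLK")
```

Under these assumptions the loose rate is never below the exact one. Every trial it flags beyond a real misprediction is a "false positive" that costs an extra clock cycle in (1).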
Fig. 5. Error correction and detection stages for the proposed speculative Han-Carlson adder of Fig. 3.
From the above observations, it can be concluded that error detection is significantly simplified and potentially faster in Han-Carlson compared to Kogge-Stone.
As an additional note, the need to drive the gates of the error detection stage increases the fanout of the checking cells, slowing the speculative prefix-processing stage.

E. Error Correction
The error correction stage computes the exact carry signals (9), to be used in case of misprediction. The error correction stage is composed of the levels of the prefix-processing stage that were pruned to obtain the speculative adder. Fig. 5 shows the error correction stage of the proposed speculative Han-Carlson adder; the error correction for the Kogge-Stone topology can be obtained similarly.
It can be observed that the inclusion of the error correction stage increases the fanout of some of the cells of the speculative prefix-processing stage, with an adverse effect on adder speed.

F. Post-Processing
The approximate carries are already available at the output of the prefix-processing stage. The post-processing, according to (14), is the same as in a non-speculative adder and consists of XOR gates.

IV. ADDERS CHARACTERIZATION
In this section we characterize the spatial and timing complexity of the investigated variable latency speculative adders, using either the Han-Carlson or the Kogge-Stone topology. Results for non-speculative adders are also reported for comparison. The characterization relies on simplifying assumptions about the area and speed of the employed gates, with the aim of obtaining an analytic (albeit approximate) comparison between the various topologies. Accurate values of area, speed and power for 65 nm technology will be presented in the next section for a quantitative assessment of variable latency speculative adders. Results of the error rate analysis are reported at the end of this section.

A. Spatial and Timing Complexity
In order to estimate adder complexity, we make some simplifying assumptions.
We assume that the spatial complexity of the speculative prefix-processing and error correction stages is proportional to the number of employed black cells (AO gates). The spatial complexity of error detection is estimated by assuming that it consists of a set of AND gates (to compute the terms in (23) or in (32)) followed by a tree of two-input OR gates to compute the error signal (see Fig. 5 for an example). Following the model proposed in [18], we take as unit gate a basic 2-input gate, such as an AND or OR gate, while we count black cells (AO gates) as two unit gates.
Regarding delay, we assume that the speculative sum delay is proportional to the number of levels of the speculative parallel-prefix stage, plus two additional levels to account for pre-processing and post-processing. The error detection delay is estimated as the number of OR-tree levels, plus one additional level to account for the AND gates computing the terms in (23) or in (32). Following the unit-gate delay model of [18], we count each basic 2-input gate, such as AND and OR, as one gate delay, with the exception of XOR gates, which we count as two gate delays.
The results are shown in Table I. For speculative adders, spatial complexity is reported as the sum of two contributions. The first one (curly brackets) is the contribution of the speculative prefix stage and the error correction stage; the second one (square brackets) is the contribution of the error detection stage. As can be observed, both area contributions are lower in the proposed Han-Carlson speculative adder than in Kogge-Stone. It is also worth noting that the spatial complexity of speculative adders is higher than that of non-speculative ones, because of the error detection and correction stages.
Regarding timing complexity, Table I reports the delays of both the speculative sum and error detection. The Kogge-Stone adder saves two gate levels in computing the speculative sum. However, the critical path traverses the error detection stage; hence the proposed Han-Carlson architecture appears faster than the Kogge-Stone speculative adder, owing to the halved number of terms to be OR-ed to obtain the error signal (see the corresponding column in Table I).

B. Error Rate Analysis
The value of the error probability is fundamental to understanding the degradation of the average addition time (1) caused by misprediction. In order to evaluate error probabilities, the proposed speculative Han-Carlson and the Kogge-Stone topologies have been simulated using a Monte Carlo approach with a 1% relative error and a 99% confidence level. Input vectors have been chosen uniformly distributed [12], [13]. Table II reports
TABLE I
SPATIAL AND TIMING COMPLEXITY
Fig. 6. Area and power of Han-Carlson speculative and non-speculative adders, as a function of the timing constraint. (a) 32 bit, (b) 64 bit, (c) 128 bit. Different values are used for speculative adders.
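The unit-gate model of Section IV-A and the Monte Carlo accuracy target of Section IV-B both reduce to short calculations. The sketch below is hedged in two ways. First, it assumes that each error term is a single 2-input AND gate feeding a tree of 2-input OR gates. Second, it uses the standard normal-approximation sample-size rule for a binomial proportion. The paper does not state which rule or how many samples were used, and the term counts $m$ below are illustrative, not the values of Table I.

```python
import math
from typing import Tuple


def error_detection_cost(m: int) -> Tuple[int, int]:
    """Unit-gate estimate for an error detector that ORs m product terms.

    ASSUMPTION: each term is one 2-input AND gate; the OR tree uses m - 1
    two-input OR gates arranged in ceil(log2 m) levels.
    Returns (area in unit gates, delay in unit gate delays).
    """
    if m < 1:
        raise ValueError("need at least one term")
    area = m + (m - 1)                # m AND gates + (m - 1) OR gates
    or_levels = (m - 1).bit_length()  # equals ceil(log2 m) for m >= 1
    return area, or_levels + 1        # +1 level for the AND gates


def monte_carlo_trials(p_est: float, rel_err: float = 0.01, z: float = 2.576) -> int:
    """Samples needed so that a binomial estimate of p has the given relative
    half-width: N >= z^2 (1 - p) / (rel_err^2 p); z = 2.576 for 99% confidence."""
    if not 0.0 < p_est < 1.0:
        raise ValueError("p_est must lie strictly between 0 and 1")
    return math.ceil(z * z * (1.0 - p_est) / (rel_err * rel_err * p_est))
```

With this model, halving the number of OR-ed terms from 16 to 8 reduces the detector from (31, 5) to (15, 4): roughly half the area and one gate level less. This matches the qualitative trend of Table I. The sample-size rule also shows why rare errors are costly to characterize: an error probability near 1% already calls for several million random input vectors at 1% relative error and 99% confidence.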
The analysis of area occupation and power dissipation shows that speculative adders are not effective for large average delay. As the timing constraint imposed during synthesis is made tighter, speculative adders become advantageous. For instance, in the 64-bit case, the speculative Han-Carlson adder results in a lower area for lower than 385 ps and also in a lower power dissipation for . For , the non-speculative adder presents an area of and a power of , while the variable latency adder exhibits an area of (20% reduction) and a power of about (9% reduction).
Fig. 7. Comparison of Han-Carlson and Kogge-Stone speculative and non-speculative adders, as a function of the timing constraint. (a) 32 bit, (b) 64 bit, (c) 128 bit.
B. Comparison with Kogge-Stone Variable Latency Speculative Adder
Fig. 7 shows the comparison between the proposed speculative adder and the Kogge-Stone one. Also in this case, we report the performance of non-speculative adders in order to identify the region where the speculative approach is effective (the optimum value for the variable latency speculative Kogge-Stone adder is: for , for and ).
The proposed variable latency Han-Carlson adder outperforms the speculative Kogge-Stone architecture in all the considered cases, confirming the trend highlighted in Table I. As an example, focusing on 64-bit adders, for lower than 350 ps, the proposed Han-Carlson speculative adder is the best choice in terms of silicon area and power consumption. Moreover, it reduces the minimum achievable to 225 ps, an 18% improvement with respect to the Kogge-Stone non-speculative adder and an 11% improvement with respect to the Kogge-Stone speculative adder. For , the proposed speculative adders offer a 45% area reduction and a 35% power saving compared to the Kogge-Stone non-speculative adder.

VI. CONCLUSION
In this paper, a novel variable latency Han-Carlson parallel-prefix speculative adder for high-speed applications is proposed. A new, more accurate error detection network is introduced, which reduces the error probability compared to previous approaches.
An extensive set of implementation results for 65 nm CMOS technology shows that the proposed Han-Carlson variable latency adders outperform previously developed variable latency Kogge-Stone architectures. Compared with traditional, non-speculative adders, our analysis demonstrates that variable latency Han-Carlson adders show appreciable improvements when the highest speed is required; otherwise, the burden imposed by the error detection and error correction stages overwhelms any advantage. Additional work is required to extend the speculative approach to other parallel-prefix architectures, such as Brent-Kung, Ladner-Fischer, and Knowles.

[16] S. K. Mathew, R. K. Krishnamurthy, M. A. Anders, R. Rios, K. R. Mistry, and K. Soumyanath, "Sub-500-ps 64-b ALUs in 0.18-μm SOI/bulk CMOS: Design and scaling trends," IEEE J. Solid-State Circuits, vol. 36, no. 11, pp. 1636–1646, Nov. 2001.
[17] B. Parhami, Computer Arithmetic: Algorithms and Hardware Design. New York: Oxford Univ. Press, 2000.
[18] A. Tyagi, "A reduced-area scheme for carry-select adders," IEEE Trans. Comput., vol. 42, no. 10, pp. 1163–1170, Oct. 1993.

Darjn Esposito was born in 1989 in Naples, Italy. He received the M.S. degree (with honors) in electronic engineering from the University of Naples "Federico II," in 2013, where he is currently working toward the Ph.D. degree. His research interests include the design of digital VLSI circuits, with particular emphasis on speculative functional units.

Davide De Caro (M'05–SM'09) received the M.S. degree in electronic engineering with honors in July 1999, and the Ph.D. degree in electronic engineering and computer science in February 2003, both from the University of Naples "Federico II", Italy.
He has worked in the area of digital integrated VLSI circuit design for the last fourteen years. Since March 2003 he has been a Researcher at the Department of Electrical Engineering and Information Technology. Dr. De Caro is the author of more than 50 technical papers in international journals and refereed international conferences.