10 1 1 100

Copyright 2005 American Scientic Publishers
All rights reserved

Printed in the United States of America
Journal of
Low Power Electronics
Vol. 1, 111, 2005
Implementation of Low Power Digital Multipliers
Using 10 Transistor Adder Blocks
Dhireesha Kudithipudi and Eugene John
Laboratory for Low Power Design, Department of Electrical and Computer Engineering,
University of Texas at San Antonio, San Antonio, TX 78249, USA
(Received: 11 April 2005; Accepted: 29 November 2005)
The increasing demand for the high delity portable devices has laid emphasis on the development
of low power and high performance systems. In the next generation processors, the low power
design has to be incorporated into fundamental computation units, such as multipliers. The char-
acterization and optimization of such low power multipliers will aid in comparison and choice of
multiplier modules in system design. In this paper we performed a comparative analysis of the
power, delay, and power delay product (PDP) optimization characteristics of four parallel digital
multipliers implemented using low power 10 transistor (10T) adders and conventional CMOS adder
cells. In order to achieve optimal power savings at smaller geometry sizes, we proposed a heuris-
tic approach known as hybrid adder models. Multipliers realized using the Static Energy Recovery
Full adder (SERF) circuit consumed considerably less power compared to 10T and static CMOS
based multipliers for all the congurations studied. Furthermore, the difference between the power
consumption of the 10 transistor based multipliers and 28T multipliers is signicant at 180 nm, but
not at 70 nm. For smaller geometry sizes down to 70 nm, the propagation delay of the multipliers
implemented with 10 transistors translates to a better performance measure. Carry-Save Multipliers
had better PDP range than the other multipliers for all the three adder sub-module designs. The
PDP measure for optimal scaled gate width resulted in a best-case scenario for SERF Wallace
tree multiplier as compared to the other three SERF based multipliers. This can be attributed to
the fast computational capability of the Wallace Tree multiplier and SERF adders recovery energy
logic saving more power at deep sub-micron sizes. The proposed SERF-10T Hybrid adder model
multipliers consumed the least power of all the Hybrid and regular models with no deterioration in
performance. Taken together, these results suggest that SERF-10T Hybrid model based multipliers
are suited for ultra low power design and fast computation at smaller geometry sizes.
Keywords: Multipliers, SERF Adders, 10 Transistor Adders, Low Power Design.
1. INTRODUCTION
The prolic growth in semiconductor device industry has
led to the development of high performance portable sys-
tems with enhanced reliability in data transmission. In
order to maintain portability of high performance delity
applications, emphasis will be on incorporation of low-
power modules in future system design.
15
The design of
such modules will have to partially rely on reduced power
consumption and/or dissipation in fundamental arith-
metic computation units such as adders and multipliers.
This underscores a need to design low power multi-
pliers towards the development of power-efcient high-
performance systems.
Author to whom correspondence should be addressed.

Email: eugene.john@utsa.edu
The selection of the most efcient architecture to imple-
ment multiplication has continually challenged DSP sys-
tem designers.
68
The options currently available offer a
wide range of tradeoffs in terms of speed, complexity
and power consumption. Input sequences to the multi-
plier can be fed in parallel, serial or a hybrid (parallel-
serial) approach. To achieve higher processing speeds,
parallel multipliers are usually adopted at the expense
of high area complexity. Multiple parallel multiplica-
tion algorithms (architectures) (e.g., Refs. [1113]) have
been proposed to reduce the chip area and increase
the speed of the multipliers. Various techniques have
been developed to reduce the power dissipation of paral-
lel multipliers. While several of these techniques reduce
power dissipation by eliminating spurious transitions,
1315
others have focused on developing novel multiplier
J. Low Power Electronics 2005, Vol. 1, No. 3 1546-1998/2005/1/001/011 doi:10.1166/jolpe.2005.046 1
Implementation of Low Power Digital Multipliers Using 10 Transistor Adder Blocks Kudithipudi and John
architectures and sign-extension techniques to reduce
power dissipation and improve performance.
1619
Yet
another approach is to develop low-power 32 coun-
ters and 42 compressors, which are key components
in parallel multipliers.
2022
Although each of these tech-
niques helps reduce power dissipation, further reduc-
tions will be needed for future digital signal processing
systems.
This research uses an approach to signicantly reduce
the power consumption and the chip area of the parallel
multipliers, without sacricing performance. The approach
is based on using low power, minimal transistor count
adders that are the determining blocks (second stage of
algorithm) in the performance of the multiplier. The oper-
ation of a parallel multiplier can be divided into two parts:
(a) formation of the partial products, and (b) summation
of these products to form the nal product of the multi-
plication. We realized the digital-parallel multipliers using
three different adders: SERF adder,
23
and 10T adder,
24
and conventional CMOS 28T
33
adder. The Static Energy
Recovery Full adder cell (SERF)
23
was developed using
only 10 transistors, and is based on the reuse of charge
stored in the load capacitance during the high output to
drive the control logic. In addition, elimination of direct
path to ground also reduces power consumption. The 10T
adder also uses 10 transistors, whereas the logic family
used is pass transistor logic (for XOR and XNOR cir-
cuit). Pass transistor logic will partially degrade the output
signal, which can be overcome by using a driver at the
output. To analyze the power savings obtained by these
two 10-transistor adder-based multipliers, we compared
our results with that of a conventional CMOS 28T based
multiplier.
In this study, we investigated the power and delay
performance characteristics of parallel digital multipliers
using two different 10-transistor full adder circuits and a
CMOS 28T adder. For comparative study, we realized mul-
tipliers using four different algorithms: Bit Array, Carry-
Save, Wallace Tree and Baugh Wooley. The tradeoffs
between speed and power of these multipliers were com-
pared to similar multipliers realized using regular static
CMOS adder circuits. In Section 2, we describe the CMOS
28T adder, SERF, and 10T adder circuits used in our
design. Section 3 describes the multiplier architectures.
Section 4 describes the simulation methodology used.
In Section 5 the results of simulation study are discussed
and Section 6 presents a summary of the paper and the
concluding remarks.
2. ADDER MODULES
Adders are the fundamental building blocks in all the mul-
tiplier modules. Hence employing fast and efcient full
adders plays a key role in the performance of the entire
system. In the following section we briey describe the
adder modules used in our design.
Fig. 1. Conventional CMOS Adder with 28 Transistors. Reprinted with
permission from [10], J. M. Rabaey et al., Digital Integrated Circuits,
Prentice Hall Publications (2003).
2.1. Conventional CMOS 28 Transistor (28T)
Full Adder
The 28 Transistor full adder is the pioneer CMOS
traditional adder circuit.
33, 34
The schematic of this adder
is shown in Figure 1. This adder cell is built using
equal number of N-fet and P-fet transistors. The logic
for the Complimentary MOS logic was realized using the
Eqs. (1) and (2)
C
out
=AB+BC
in
+AC
in
(1)
Sum =ABC
in
+(A+B+C
in
)C
out
(2)
The rst 12 transistors of the circuit produce the C
out
and
the remaining transistors produce the Sum outputs. There-
fore the delay for computing C
out
is added to the total prop-
agation delay of the Sum output. The structure of this adder
circuit is huge and thereby consumes large on-chip area.
2.2. SERF Adder
The Static Energy Recovery Full Adder (SERF adder) cir-
cuit was developed implementing energy recovery logic
and reduced number of transistors. The schematic of the
10 transistor SERF adder is shown in Figure 2.
23
The basic
idea in the SERF adder is the reuse of charge stored in the
load capacitance during the high output to drive the con-
trol logic. In regular non-energy recovery adder designs
the input charge applied at logic high will be drained off
during logic low mode. This is achieved by using only one
voltage source (VDD) in the circuit. As an added advan-
tage there will be no path from one voltage level (VDD)
to the other (GND). The elimination of the direct path to
the ground removes the short circuit power component for
the adder module. This reduces the total energy consumed
in the circuit and making it an energy efcient design. The
SERF adder is not only energy efcient but also area ef-
cient due to its low transistor count. The main drawback
of the SERF adder is the threshold voltage drop at the
output voltage for certain input combinations. A detailed
comparative study of SERF adder with other low power
adders can be found in Ref. [23].
2 J. Low Power Electronics 1, 111, 2005
Kudithipudi and John Implementation of Low Power Digital Multipliers Using 10 Transistor Adder Blocks
Fig. 2. Static Energy-Recovery Full (SERF) Adder. Reprinted with per-
mission from [23], R. Shalem et al., A novel low power energy recovery
full adder cell Proceedings of the Great Lakes Symposium on VLSI
(1999), pp. 380383.
2.3. 10T Adder
In the 10T adder cell, the implementation of XOR and
XNOR of A and B is done using pass transistor logic
and an inverter is to complement the input signal A. This
implementation results in faster XOR and XNOR outputs
and also ensures that there is a balance of delays at the
output of these gates. This leads to less spurious SUM and
C
out
signals. The capacitance at the outputs of XOR and
XNOR gates is also reduced as they are not loaded with
inverter. If the signal degradation at the SUM and C
out
is signicant for deep sub-micron circuits, drivers can be
used to reduce the degradation. The driver will help in gen-
erating outputs with equal rise and fall times. This results
in better performance regarding speed, low power dissi-
pation and driving capabilities. The output voltage swing
will be equal to the VDD, if a driver is used at the output.
Figure 3 gives the circuit level diagram of 10T adder. A
Fig. 3. 10 transistor (10T) Adder. Reprinted with permission from [24],
L. Junming et al., A novel 10-transistor low-power high-speed full adder
cell. Proceedings of 6th International Conference on Solid-State and
Integrated-Circuit Technology (2001), pp. 11551158.
detailed comparative study of SERF adder with other low
power adders can be found in Ref. [24].
3. MULTIPLIER ARCHITECTURES
Multipliers are in fact complex adder arrays. This is an
operation common to a large number of applications,
and the complexity of this function has lead to a large
amount of research directed at speeding up its execu-
tion. Multipliers can be implemented using different algo-
rithms. Depending on the algorithm used, the performance
characteristics of the multipliers vary. In the implemen-
tation of digital multipliers binary adders are an essen-
tial component. With the emergence of power as a design
consideration, speed is not the only criterion by which
various implementations are judged. Designing multipliers
with low power, energy efcient adders reduce the power
consumption and efciency of multipliers. In this paper
we have concentrated on the design and characterization
of four popular multipliers, viz. the Carry-Save Multi-
plier, the Bit-Array Multiplier, Wallace tree Multiplier and
Baugh-Wooley Multiplier. To study the performance eval-
uation of these four parallel digital multipliers we imple-
mented them using three adder cells (SERF adder, 10T
adder and the static CMOS 28T adder). We further imple-
mented each of these multipliers for operands sizes 2,
4, and 8.
3.1. Carry-Save Multiplier
Carry Save Array Multipliers
10
have a very regular struc-
ture, which makes it amenable to automation. The algo-
rithm is based on the fact that the multiplication result
does not change when the output carry bits are passed
diagonally downwards instead of only to the right.
10
An
extra adder, known as vector-merging adder, is added in
each stage of the multiplication such that the nal result is
obtained. This is called the carry-save multiplier because
the carry bits are not immediately added but are rather
saved for the next addition stage. In the nal stage the
carries and the sums are merged in a fast-carry propa-
gate adder stage, usually by using a carry-lookahead adder.
Due to the additional adder in each stage there is a slight
increase in the area cost. However, it uses only short
wires to the nearest neighboring cells. It can also be easily
pipelined. Another added advantage is that there is only
one critical path rather than the several identical critical
paths found in the generic array multiplier. The general
structure of a Carry-Save Multiplier is shown in Figure 4.
10
The delay of this multiplier can be expressed
10
as,
AT =T
and
+T
nal
+(X1)T
carry
(3)
where T
and
is the delay of the pre-product generating AND
gates, T
nal
is the delay of the nal stage carry-lookahead
adder, X is the number of partial product stages, and T
carry
J. Low Power Electronics 1, 111, 2005 3
Fig. 4. 4 4 Carry-Save Multiplier. Reprinted with permission from
[10], J. M. Rabaey et al., Digital Integrated Circuits, Prentice Hall Pub-
lications (2003).
is the propagation delay between input and output carry.
This equation is based on the assumption that the delay
for sum generation is equal to that of the carry generation.
3.2. Bit-Array Multiplier
Bit Array Multipliers
10
are essentially regular structures
and are simple to expand. The structure is similar to the
previously discussed Carry-Save multiplier but propagates
the carry bits from the full adders in a different fashion.
A simple diagram of a 4 4 multiplier is shown in Fig-
ure 5.
10
Each partial product is generated by the multi-
plication of the multiplicand with one multiplier bit. The
partial products are shifted according to their bit orders
and then added. In array multiplication we need to add
as many partial products as there are multiplier bits. In
order to perform signed multiplication, 2s complement
Fig. 5. 44 Bit-Array Multiplier. Reprinted with permission from [10],
J. M. Rabaey et al., Digital Integrated Circuits, Prentice Hall Publications
(2003).
number system is used to represent the multiplicand and
the multiplier.
1
This implies that all the adders in a partic-
ular stage should be of equal bitlength. To achieve this, the
sign bits of the partial products in the initial row and the
sum and carry signals of each adder stage are extended.
The extension is carried out until the signals width matches
the width of the largest absolute value signal in that stage.
Also, the generation of X partial products requires X
Y two-bit AND gates. Large area of the multiplier is
devoted to perform addition of N partial products, which
require (N 1) M-bit adders.
1, 10
The shifting of the par-
tial products for proper alignment is performed by simple
routing and does not require any logic. The array struc-
ture makes it a difcult task to measure the propagation
delay. There are more than one identical length critical
timing paths available in the circuit. An approximate equa-
tion as shown in Eq. (4)
10
for the propagation delay can
be obtained by a detailed study of these paths.
AT =T
and
+T
sum
+|(Y 1) +(X2)]T
carry
(4)
where T
and
is the delay of the pre-product generating AND
gates, T
sum
is the delay between the input carry and the sum
bit of the full adder, Y is the width of the multiplicand, X
is the width of the multiplier, and T
carry
is the propagation
delay between input and output carry.
3.3. Wallace Tree Multiplier
Wallace trees
10
were rst introduced in 1964 in order to
design the multipliers whose completion time grows as the
logarithm of the number of bits to be multiplied increases.
Wallace tree multiplier is based on tree structure. In Fig-
ure 6,
10
a 4 bit Wallace tree multiplier is shown. Wal-
lace method uses three-steps to process the multiplication
operation
1. Formation of bit products
2. The bit product matrix is reduced to a 2-row matrix
by using a carry-save adder (Wallace tree).
Fig. 6. 4 4 Wallace Tree Multiplier. Reprinted with permission from
lications (2003).
Fig. 7. Transformation of (a) partial products in to (b), (c), (d) Wallace
Tree. Reprinted with permission from [10], J. M. Rabaey et al., Digital
Integrated Circuits, Prentice Hall Publications (2003).
3. The remaining two rows are summed using a fast
carry-propagate adder to produce the product.
To better understand the procedure, we will show the trans-
formation process with an example in Figure 7. For a 4-bit
operand multiplication the partial products of each stage
are 4-bits wide.
10
These partial products are arranged in
the form of a tree that is shown in Figure 7(a). From the
Fig. 7(a) we can clearly interpret that only column 3 in
the array has to add 4-bits. Therefore the partial products
are rearranged into a tree structure to visually illustrate the
depth of the tree. To realize this tree with minimum num-
ber of adders, we need Full Adders (FA) and Half Adders
(HA). FA is also known as 3:2 compressor, because it takes
3 inputs and produces two outputs, sum (located in the
same column) and carry (located in the adjacent column).
10
FA is denoted by covering 3-bit and HA is denoted by cov-
ering 2-bits. To obtain minimal implementation, we start
at the most dense part, by introducing HAs in columns
3 and 4 as shown in Figure 7(b). The reduced tree is
shown in Figure 7(c). Another round of reductions creates
a tree of depth 2, which is shown in Figure 7(d). This
nal stage adder can be realized using any simple two-
input adder. In this circuit, a total of three FAs and three
HAs are used for reduction process. The maximum delay
for this multiplier is only six adder delays with four of
these being half adders. The propagation delay of the tree
is of the order O(log
3,2
(N))
10
. However, the superior per-
formance of Wallace tree multiplier comes with an added
price: highly irregular structure and complex structure. Its
highly irregular structure makes it difcult to layout the
multiplier in rectangular shape, thereby leading to wastage
of chip-area and power. The structure of Wallace trees is
not unique, i.e., there are several ways of building a par-
ticular Wallace tree. For example, a carry generated in one
column may be introduced in the next most signicant col-
umn at different places, e.g., close to the tree root or close
to the output. The Module Generator builds the Wallace
trees dynamically with the aim of reducing the number of
carries generated in each column, therefore reducing the
total tree height.
3.4. Baugh Wooley Multiplier
Baugh Wooley Multiplier
35
is used for 2s complement
multiplication. It adjusts the partial products to maximize
regularity of the multiplication array. It moves the partial
products with negative signs to the last steps and also adds
the negation of partial products rather than subtracts. This
technique has been developed in order to design regular
multipliers, suited for 2s complement numbers. Gate-level
diagram of a 4-bit Baugh Wooley multiplier is shown in
Figure 8.
10, 35
The equation of Baugh-Wooley algorithm for
an N N multiplication is given by Eq. (5),
35
XY = 2
2N1
+(x
N1
+,
N1
+x
N1
,
N1
) 2
2N2
+
N2
0
N2
0
x
i
,
]
2
i+]
+(x
N1
+,
N1
) 2
N1
+
n2
0
,
N1
x
i
2
i+N1
+
N2
0
x
N1
,
i
2
i+N1
(5)
where X and Y are N-bit operands, so their product is a
2N bits number. Consequently, the most signicant weight
Fig. 8. 44 Baugh Wooley Multiplier. Reprinted with permission from
lications (2003).
is 2N 1, and the rst term 2
2N1
is taken into account
by adding a 1 in the most signicant cell of the multiplier.
Each of the partial products is formed with AND gates
and they are all added together. The outcome is to allow
identical stages of logic in the early steps of multiplication
process and push all the irregularities to the nal stage.
The delay equation for the Baugh Wooley multiplier is
similar to that of the Array Multiplier.
4. SIMULATION SETUP
The functionality of each of the circuits designed was ver-
ied using simulation. The schematics were implemented
as layouts using MAGIC layout editor and the post layout
parasitics were extracted for SPICE simulations. We used
Berkeleys Bsim3v3
28
SPICE model parameters, for all the
device models. The simulations were run on a Red Hat
Linux 9.0 host machine. Each of the readings were taken
for 10,000 pseudo-randomly generated inputs. This covers
all the possible transitions of input combinations. Also the
delay between input pulses was given around 12 ns, for
the output voltages to stabilize. All the multipliers were
analyzed for power consumption, delay and area. To yield
appropriate results, we have added CMOS inverters at the
input and output. For pass transistor logic design the power
consumption values also consider the buffers used to retain
the logic levels. However, while measuring the power and
delays, we have taken only the input driver into consider-
ation and ignored the output buffer. The power dissipated
in CMOS digital circuits is given by Eq. (6)
10, 34
P =oCV
2
dd
] +I
sc
V
dd
+I
leak
V
dd
(6)
where C is the load capacitance, o is the switching activ-
ity, ] is the clock frequency, V
dd
is the supply voltage of
the system, I
sc
is the short circuit current and I
leak
is the
leakage current of the circuit. Delays of the circuit are
measured for worst-case scenario, averaging the low (T
pl
)
and high (T
ph
) transition delays. The delay measurements
for each of the multipliers were averaged for 50 simula-
tion runs and always the worst case delay was taken into
consideration. Monte-Carlo analysis was used for all the
simulations. This also depicts the process variations that
occur due to each technology size.
4.1. BPTM Models
Berkeley Predictive Technology Model [BPTM] is a
new paradigm of predictive MOSFET and interconnect
modeling.
28
These models are specically developed to
address SPICE compatible parameters for deep sub-micron
technology nodes. For each technology node, BPTM
accommodates a set of typical BSIM3v3 parameters for
CMOS and an analytical approach for interconnect capaci-
tance and inductance. BPTM is capable of covering a wide
range of process parameter variations. The following four
parameters are important for generating BPTM param-
eters. They are Effective gate length (L
eff
), Gate oxide
thickness (T
ox
), Threshold voltage (V
sat
t
), Drain Source
parasitic resistance, R
dsw
. The technology models we have
used for our simulations, (70 nm, 100 nm, 130 nm, and
180 nm) are available for download from Ref. [28]. Mod-
ications have been performed on some of the model
parameters such that they are tailored to simulate on
HSPICE.
5. SIMULATION RESULTS
In this section, performance measurement of all the four
multipliers with varying bit-sizes (2-bit, 4-bit, and 8-
bit) using the SERF, 10T and CMOS adders has been
compared. These results were obtained from spice simu-
lations with one common index for all comparisons, i.e.,
the design constraints were the same for all the multipli-
ers. Though low power is the objective of our design, we
wanted to measure the delay and area of these circuits, as
they are indicators of good performance.
5.1. Power
The energy consumption for all the multipliers investi-
gated is presented in Table I for a 180 nm technology
size. For all the operand sizes, the SERF adder based
multipliers consumed considerably less energy compared
to the CMOS adder based multipliers. In fact, the SERF
based multiplier performed at least thirty-two percent bet-
ter than any CMOS based version. The SERF based 88
Bit-Array multiplier proved to have the greatest advantage
over its CMOS counterpart with a sixty percent improve-
ment. The power gain of 10T is less as compared to
SERF based multipliers and hence can be used where pass
transistor logic is used. The power consumed for array
multiplier is higher than Baugh Wooley and Wallace tree
multipliers in 4-bit, whereas the power consumption of
Baugh Wooley is high in 8-bit multipliers, due to added
carry select adder levels. Since greater numbers of adder
Table I. Energy Consumption (nJ) of all the multiplier circuits with
varying bit lengths and adders for a 180 nm BPTM Mosfet Model.
Carry Save Bit Array Wallace Tree Baugh Wooley
SERF
22 0.55 0.55 0.38 0.57
44 2.74 3.23 2.20 2.88
88 12.06 7.56 7.34 13.97
10T
22 0.62 0.62 1.24 1.11
44 2.89 3.67 2.42 2.68
88 12.98 10.11 9.89 12.99
CMOS28T
22 0.81 0.81 1.86 1.92
44 4.41 4.57 3.85 3.64
88 20.86 20.87 21.56 22.29
Fig. 9. Power consumption for 4-bit multipliers implemented using
SERF, 10T, and CMOS 28T adders. Simulations were performed for a 70
nm, 100 nm, 130 nm, and 180 nm Berkeley BPTM MOSFET Models.
cells are used for larger multipliers, the power savings
for smaller operand sizes can be directly extrapolated to
higher operand multiplier modules.
An understanding of the power consumption of all the
multipliers in sub-100 nm would help to prove that these
multipliers are truly designed for low power. This is shown
in Figure 9, where all the four multipliers were imple-
mented in 70 nm, 100 nm, 130 nm, and 180 nm Berkeley
BPTM
28
technology models with different adder modules.
The graph indicates the outputs for each of the 4 4
wide multipliers and the multipliers built using 10 tran-
sistor adders consumed less power than the conventional
28-transistor adder at all technology sizes. Interestingly,
the difference between power consumption of 10 transistor
based multipliers and 28T multiplier is very insignicant
at 70 nm as compared to 180 nm where the difference is
prominent. This could be due to the high static leakage
current at smaller technology nodes dominating the total
power consumption of a circuit. One probable explanation
could be that the SERF and 10T based multipliers con-
sumed more leakage current than the 28T based multiplier
at 70 nm technology node size. Developing hybrid mod-
els that take advantage of both the adder modules might
alleviate this problem to a large extent, which is further
discussed in subsection 5.5.
5.2. Delay
Propagation delay is a measure of the speed performance
of a circuit, even while consuming low power. In Table II,
the delay performance characteristics of various multipli-
ers used for our study at 180 nm technology size are given.
For all the multipliers, the delay for 2 2 cell is almost
identical because of the simplicity of the design at such
a small size. At 4 4 and 8 8 bit widths however, the
differences between the adder cells is signicant. For 8-bit
operands, the delay of SERF adder based multipliers is
Table II. Multiplier Delay Performance (ns) of all the multiplier circuits
with varying bit lengths and adders for 180 nm BPTM Mosfet Model.
Carry Save Bit Array Wallace Tree Baugh Wooley
SERF
22 0.23 0.25 0.21 0.29
44 0.42 0.67 0.59 0.79
88 1.48 2.19 1.20 2.21
10T
22 0.28 0.28 0.23 0.31
44 0.51 0.73 0.62 1.84
88 1.50 2.25 1.21 2.34
CMOS28T
22 0.45 0.45 0.40 0.48
44 0.67 0.86 0.75 0.92
88 1.70 2.59 1.48 2.98
almost 1520% less, and for 10T adder based multipliers,
the delay is approximately 25% less compared to CMOS
28T adder based multipliers.
To further analyze the propagation delay of these cir-
cuits at smaller technology nodes, we performed simula-
tions for 70 nm, 100 nm, 130 nm, and 180 nm technology
nodes. The results from Figure 10 indicate that the propa-
gation delay of the multipliers implemented with 10 tran-
sistors translates to a better performance even at smaller
technology node sizes. Even though the timing delay for
Wallace Tree multipliers is substantially less than other
multipliers at 180 nm technology nodes, the differences
diminish at 70 nm technology node.
5.3. Area
As most of the portable applications demand smaller sili-
con area, designing circuits with optimal area is an impor-
tant performance criterion. Area comparison results for the
simulated multipliers are presented in Table III. Area con-
sumed by array multiplier is greatest among all the mul-
tipliers studied; the probable reason being its increasing
Fig. 10. Propagation Delay for 4-bit multipliers implemented using
SERF, 10T, and CMOS 28T adders. Simulations were performed for a
70 nm, 100 nm, 130 nm, and 180 nm Berkeley BPTM MOSFET Models.
Table III. Energy Consumption (nJ) of all the multiplier circuits with
varying bit lengths and adders for a 180 nm BPTM Mosfet Model.
4 BIT 8 BIT
(100 jm100 jm) SERF 10T CMOS28T SERF 10T CMOS
Array 208 216 420 944 944 1952
Baugh Wooley 202 214 376 916 929 1600
Wallace Tree 148 156 340 624 628 1264
Carry Save 142 151 350 556 563 1365
structure (number of FA modules) at higher bit levels.
Among all the multiplier congurations, the SERF adder
based multiplier consumes the least area. For array multi-
pliers, there is a 50% increase in area when CMOS 28T
adders are used as opposed to SERF adders and a 46%
increase in area as opposed to 10T adders.
5.4. PDP Product
To implement low power dissipation systems, we can
either reduce the power consumed by the circuits or
increase the computations/unit energy. These two opti-
mizations can be realized only when the design tradeoffs
between power and delay are well understood. The opti-
mal setting for power delay product (PDP) of a particular
technology node can be obtained by varying the size of the
gates (W/L ratios), and the operating voltage. To under-
stand the best PDP zone for the four multipliers tested,
we simulated for seven different W/L ratios for the 70 nm
technology MOSFETs. For small geometries the effects
of parasitic wiring capacitance must be considered in the
PDP models.
32
In this research, the following expressions
32
were used for the optimal device sizing
o
oWn(P.t)
=0
W
Wn
=
jn
4j
+
jn
4j
1+8j,jn(1+C
wire
,Wn) (7)
o
oWn(P.t)
=0
W
Wn
=
j
4jn
+
j
4jn
1+8jn,j(1+C
wire
,W)
1
(8)
where Wp and Wn are the widths of the PMOS and NMOS
transistors, j
n
and j
are the mobility of the electrons

and the mobility of the holes respectively, and C
wire
is the
output load capacitance for the multiplier. The width of
the pull up transistors was double that of the width of
the pull down transistors based on the fact that the mobil-
ity of electrons is higher than the mobility of the holes.
Detailed explanation of the derivation of these expressions
is beyond the scope of this paper and can be obtained from
Ref. [32]. In Figure 11 the PDP products for four multipli-
ers were presented with three different [SERF, 10T, CMOS
28T] adder modules. The points on the bottom leftmost
(a)
(b)
(c)
Fig. 11. Power Delay Product (PDP) of (a) SERF based multipliers,
(b) 10T based multipliers and (c) 28T based multipliers, for a 70 nm
technology node size.
corner of the graph indicate good PDP zones. For all the
four SERF adder based multipliers, shown in Figure 11(a),
the gate size variation drops down the PDP as a linear
curve. The ideal zone point for all the multipliers in this
setup are the minimal possible device size settings, which
is indicated as a circle in the Figure. The graph depicts that
Carry-Save multiplier exploits the scaled gate widths better
than the other three multipliers. However, at the nominal
gate width, Wallace tree multiplier has the best PDP value
as the area and power dissipation are reduced drastically
by scaling. This gives an added advantage to the already
fast computational Wallace tree algorithm. For the 10T
adder based multipliers, shown in Figure 11(b), the optimal
PDP zone shifts slightly to the right, an indicator of poorer
performance, as compared to the SERF adder based mul-
tipliers with the same X-axis scaling. Another point to be
observed is that the PDPs of Carry-Save and the Bit-Array
multipliers overlap in the 10T adder based modules sug-
gesting that both the multipliers are an ideal choice. The
28T adder based multipliers graph, shown in Figure 11(c),
has a different X and Y axis and yet follows a similar pat-
tern as the 10T adder based multipliers except that the PDP
zone range is poorer than the prior two. Another difference
is that the PDP range of Bit-Array multiplier has deviated
further away from the Carry-Save Multiplier. Hence it can
be interpreted that the low power adder modules chosen
for building the multipliers form the key factor in the per-
formance of the multiplier module. Also the nominal PDP
point for all the multipliers falls in the same power range
for the CMOS 28T based multipliers.
5.5. Hybrid Models
We observed that the power consumed for the 10 Tran-
sistor multipliers at 70 nm technology is only slightly
lower than the CMOS 28T based multipliers, as com-
pared to 180 nm node sizes. As already mentioned, this
could be due to the dominant leakage current at smaller
technology nodes. This performance deterioration can be
overcome by implementing a hybrid model, in which the
10 transistor adders are used in conjunction with CMOS
28 transistor adders in the multiplier design. Appropri-
ate design of such a model would gain both performance
and power enhancement. This being more of a heuristic
approach, we iteratively performed numerous simulations
to see the best setup to insert the CMOS 28 Transistor
adder in a 10 Transistor adder based model. Further insight
into the propagation delay equations shows that the nal
adder stage is the most critical path in the multiplier oper-
ation. Insertion of CMOS 28T adders and 10 Transistor
adders alternatively in the nal stage would reduce the
performance deterioration but not impact the power con-
sumption. The results from Figure 12 prove that this is
indeed plausible. For the hybrid models, we incorporated
Fig. 12. Power Consumption of 8-bit multipliers implemented using
SERF, SERF-28T Hybrid, 10T, 10T-28T Hybrid, CMOS 28T, and SERF-
10T Hybrid adders. Simulations were performed for a 70 nm Berkeley
BPTM MOSFET Model.
Fig. 13. Timing Delay of 8-bit multipliers implemented using SERF,
SERF-28T Hybrid, 10T, 10T-28T Hybrid, CMOS 28T, and SERF-10T
Hybrid adders. Simulations were performed for a 70 nm Berkeley BPTM
MOSFET Model.
three different combinations of adder modules, viz. SERF-
28T, 10T-28T, and SERF-10T sets. The simulation results,
shown in Figure 12, clearly indicates that the SERF-10T
Hybrid model has lower power consumption as compared
to the SERF adder based model. These results are shown
for 8-bit operands because the difference is clearly notice-
able at higher bit widths. The SERF-10T hybrid is an ideal
model for smaller geometries because the SERF adder has
better delay characteristics and the 10T adder consumes
less static leakage current. The power savings obtained for
this hybrid model range from 1520% for each of the mul-
tipliers. Baugh Wooley Multiplier exploited this model the
best as compared to other multipliers. Delay characteristics
of these modules are presented in Figure 13. As expected,
the delay of the SERF-10T hybrid model is almost identi-
cal to that of the SERF based module. Hence this hybrid
model is a very good choice when designing multipliers
for low power.
6. CONCLUSION
In this paper, we have presented the power and speed
performance characteristics of four different multipliers
realized using 10T, SERF, and CMOS 28T static adders.
For comparative analysis, we realized 2 2, 4 4, and
8 8 Carry-Save, Bit-Array, Wallace Tree, and Baugh
Wooley multipliers. In all the multiplier congurations
investigated, the SERF adder based multipliers exhibited
better power performance compared to 10T and CMOS
28T adder based multipliers. The difference between the
power consumed in 10 transistor adder based multipliers
and CMOS 28T based multipliers decreases at 70 nm as
compared to 180 nm technology size. Propagation delay
of 10T and SERF based multipliers is better compared to
the CMOS28T adder upto 70 nm technology size. Opti-
mizing for PDP by scaling the gate sizes indicated a low
PDP value for all the multipliers at 70 nm technology
size. Scaling of gate widths resulted in better performance
for SERF adder based multipliers. In general, Carry-Save
multipliers displayed optimal PDP range as compared to
the other three multipliers. We further proposed a heuristic
approach to incorporate hybrid adder modules in the nal
stage of multiplication that forms the critical delay path
and also a high static leakage current zone. Placing SERF
adder and 10T adders alternatively in the nal stage will
yield in better performance and low power dissipation.
References
1. S. Shah, A. J. Al-Khalili, and D. Al-Khalili, Comparison of
32-bit multipliers for various performance measures. Proceedings
of the 12th International Conference on Microelectronics (2000),
pp. 7580.
2. I. S. Abu-Khater, A. Bellaouar, and M. I. Elmasry, Circuit techniques
for CMOS low power high-performance multipliers. IEEE Journal
of Solid State Circuits (1996), Vol. 31, pp. 15351546.
3. G.-K. Ma and F. J. Taylor, Multiplier policies for digital signal pro-
cessing. IEEE ASSP Magazine (1990), pp. 619.
4. T. K. Callaway and E. E. Swartzlander, Jr., Power delay character-
istics of CMOS multipliers. Proceedings of the 13th International
Symposium on Computer Arithmetic (1997), pp. 2632.
5. M. J. Liao, C. F. Su, C. Y. Chang, and A. C. H. Wu, A carry-select-
adder optimization technique for high-performance booth-encoded
wallace-tree multipliers. IEEE International Symposium on Circuits
and Systems ISCAS (2002), pp. I-81I-84.
6. G. Goto, A. Inoue, R. Ohe, S. Kashiwahra, S. Mitarai, T. Tsuru, and
T. Izawa, A 4.1-ns compact 54 54 multiplier utilizing sign-select
booth encoders. IEEE Journal of Solid-State Circuits (1997), Vol.
32, pp. 167682.
7. G.-K. Ma and F. J. Taylor, Multiplier policies for digital signal pro-
cessing. IEEE ASSP Magazine (1990), Vol. 7, pp. 619.
8. K. Z. Pekmestzi, Multiplexer-based array multipliers. IEEE Trans.
on Computers (1999), Vol. 48, pp. 1523.
9. D. Radhakrishnan, Low Voltage CMOS Full Adder Cells. Electron-
ics letters (1999), Vol. 35, pp. 17921794.
10. J. M. Rabaey, A.Chandrakasan, and B. Nikolic, (Eds.), Digital Inte-
grated Circuits, Prentice Hall Publications (2003).
11. B. Millar, P. E. Madrid, and E. E. Swartzlander, Jr., A fast hybrid
multiplier combining Booth and Wallace/Dadda algorithms. Pro-
ceedings of the 35th Midwest Symposium on Circuits and Systems
(1992), pp. 158165.
12. Kei-Yong Khoo Zhan Yu and A. N. Willson, Jr., Improved-Booth
encoding for low-power multipliers. Proceedings of the 1999 IEEE
International Symposium on Circuits and Systems (1999), pp. 6265.
13. C. Lemonds and S. S. Shetti, A low power 16 by 16 multiplier
using transition reduction circuitry. Proceedings of the International
Workshop on Low Power Design (1994), pp. 139144.
14. G. E. Sobelman and D. L. Raatz, Low-Power multiplier design
using delayed evaluation. Proceedings of the International Sympo-
sium on Circuits and Systems (1995), pp. 15641567.
15. T. Sakuta, W. Lee, and P. T. Balsara, Delay balanced multipliers for
low power/low voltage DSP core. Proceedings of IEEE Symposium
on Low Power Electronics (1995), pp. 3637.
16. J. Iwamura, A high speed and low power CMOS/SOS multiplier-
accumulator. Microelectronics Journal (1983), Vol. 14, pp. 4957.
17. E. de Angel and E. E. Swartzlander, Jr., Low power parallel multipli-
ers. IX Proceedings of Workshop on VLSI Signal Processing (1997),
pp. 199208.
18. M.-C.Wen, S.-J. Wang, and Y.-N. Lin, Low power parallel multiplier
with column bypassing. Proceedings of IEEE International Sympo-
sium on Circuits and Systems (2005), pp. 16381641.
19. E. Abu-Shama, M. B. Maaz, and M. A. Bayoumi, A fast and low
power multiplier architecture. Proceedings of the 39th Midwest Sym-
posium on Circuits and Systems (1997), pp. 2632.
20. S.-F. Hsiao, M.-R. Jiang, and J.-S. Yeh, Design of High-Speed Low-
Power 3-2 Counter and 4-2 Compressor for Fast Multipliers. Elec-
tronics letters (1998), Vol. 34, pp. 341343.
21. I. S. Abu-Khater, A. Bellaouar, and M. I. Elmasry, Circuit techniques
for CMOS low-power high-performance multipliers IEEE Journal
of solid-state circuits (1996), Vol. 31, pp. 15351546.
22. S. J. Jou, C. Y. C. E. C. Yang, and C. C. Su, A pipelined multiplier-
accumulator using a high-speed, low-power static and dynamic full
adder design. IEEE Journal of solid-state circuits (1997), Vol. 32,
pp. 114118.
23. R. Shalem, E. John, and L. K. John, A novel low power energy
recovery full adder cell Proceedings of the Great Lakes Symposium
on VLSI (1999), pp. 380383.
24. L. Junming, S. Yan, L. Zhenghui, and W. Ling, A novel 10-transistor
low-power high-speed full adder cell. Proceedings of 6th Interna-
tional Conference on Solid-State and Integrated-Circuit Technology
(2001), pp. 11551158.
25. http://public.itrs.net, International technology roadmap for semicon-
ductors
26. H. T. Bui, Y. Wang, and Y. Jiang, Design and analysis of low-power
10-transistor full adders using novel XOR XNOR gates. IEEE Trans.
on Circuits and Systems-II: Analog and Digital Signal Processing
(2002), Vol. 49, pp. 2530.
27. A. M. Shams, T. K. Darwish, and M. A. Bayoumi, Performance
analysis of low-power 1-Bit CMOS full adder cells. IEEE Trans. on
VLSI Systems (2002), Vol. 10, pp. 2029.
28. http://www-device.eecs.berkeley.edu/ptm/mosfet.html.
29. Y. Cao, T. Sato, D. Sylvester, M. Orshansky, and C. Hu, New
paradigm of predictive MOSFET and interconnect modeling for
early circuit design. Proceedings of IEEE Custom Integrated Circuits
Conference (2000), pp. 201204.
30. Y. Wang, Y. Jiang, and E. Sha, On area-efcient low power array
multipliers. Proceedings of the 8th IEEE International Confer-
ence on Electronics, Circuits and Systems, ICECS (2001), Vol. 3,
pp. 14291432.
31. http://www.vlsi.wpi.edu/webcourse/ch06/
32. C. Tretz and C. Zukowski, CMOS transistor sizing for minimization
of energy-delay product. Proceedings of Sixth Great Lakes Sympo-
sium on VLSI (1996), pp. 168173.
33. A. Bellaouar and M. I. Elmasry, (Eds.), Low-Power Digital VLSI
Design: Circuits and Systems. Kluwer Academic Publishers (1995).
34. A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS
Design. Kluwer Academic Publishers (1995).
35. J. Di, J. S. Yuan, and R. F. DeMara, Improving power-awareness
of pipelined array multipliers using 2-dimensional pipeline gating
and its application to r design. Integration, the VLSI J. (2005),
Vol. 38.
Dhireesha Kudithipudi
Dhireesha Kudithipudi received her M.S. degree in computer engineering from Wright State University, Dayton, OH, in 2001. She has
worked for a year as a software analyst III with the U.S. Department of Labor. Since 2002, she has been a fellowship recipient and
research assistant with the low power design laboratory at the University of Texas at San Antonio, where she is currently working
towards her Ph.D. degree in Electrical Engineering. Her current research interests include Circuit and system level design, low-power
nanoscale system design, microprocessor design, Performance Evaluation, Modeling and Simulation of Multimedia and Embedded
Systems, along with formulating beyond CMOS. She is a recipient of Whos Who in American Colleges and Universities and a member
of the Eta Kappa Nu, SWE, and IEEE.
Eugene John
Eugene John is an Associate Professor of electrical and computer engineering at the University of Texas at San Antonio. His research
interest include low power techniques, power efcient circuits and systems, multimedia and embedded systems and performance
evaluation. He has a Ph.D. in electrical engineering from the Pennsylvania State University. He is a senior member of the IEEE.

10 1 1 100

Hochgeladen von

Dokumentinformationen

Originalbeschreibung:

Originaltitel

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

10 1 1 100

Hochgeladen von

Copyright:

Verfügbare Formate

Copyright 2005 American Scientic Publishers

All rights reserved

Author to whom correspondence should be addressed.

are the mobility of the electrons

Das könnte Ihnen auch gefallen