Microelectronics Journal
journal homepage: www.elsevier.com/locate/mejo
Low power and high speed multiplier design with row bypassing
and parallel architecture
Ko-Chi Kuo *, Chi-Wen Chou
Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan
ARTICLE INFO

Article history:
Received 10 October 2009
Received in revised form 16 June 2010
Accepted 21 June 2010

Keywords:
Low power
Bypassing multiplier
Parallel architecture
Ripple carry array

ABSTRACT

This paper presents a low power and high speed row bypassing multiplier. The primary power reductions are obtained by turning off MOS components through multiplexers when the operands of the multiplier are zero. Analysis of conventional DSP applications shows that, on average, 73.8 percent of the operand inputs of the multiplier are zero. Therefore, power consumption can be reduced significantly by the proposed bypassing multiplier. The proposed multiplier adopts the ripple-carry adder with fewer additional hardware components. In addition, the proposed bypassing architecture enhances operating speed through an additional parallel architecture that shortens the delay time of the multiplier. Both unsigned and signed multipliers are developed. Post-layout simulations are performed with standard TSMC 0.18 µm CMOS technology and 1.8 V supply voltage using Cadence Spectre simulation tools. Simulation results show that the proposed design reduces power consumption and delay compared with its counterparts. For a 16 × 16 multiplier, the proposed design achieves 17 and 36 percent reduction in power consumption and delay, respectively, at the cost of a 20 percent increase in chip area in comparison with conventional array multipliers. In addition, the proposed design achieves averages of 11 and 38 percent reduction in power consumption and delay with 46 percent less chip area in comparison with its counterparts for both unsigned and signed multipliers. The proposed design is suitable for low power and high speed arithmetic applications.
© 2010 Elsevier Ltd. All rights reserved.
0026-2692/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.mejo.2010.06.009
Please cite this article as: K.-C. Kuo, C.-W. Chou, Low power and high speed multiplier design with row bypassing and parallel
architecture, Microelectron. J (2010), doi:10.1016/j.mejo.2010.06.009
2 K.-C. Kuo, C.-W. Chou / Microelectronics Journal ] (]]]]) ]]]–]]]
consumes potentially more power than tree-based architecture [6]. The reason is that additional adders embedded in the tree-based multiplier absorb spurious switching and hence reduce power consumption [7]. However, the layout of a tree-based multiplier tends to be complicated and induces more parasitic capacitance. In addition, the tree-based multiplier is limited to shorter operand lengths (<16 bits) [8]. Multipliers with longer operand lengths can be implemented by a modified Booth encoding Wallace multiplier. Therefore, array-based architecture is implemented in the proposed design.

Many researchers have been focusing on reducing the power consumption of multipliers [9–20]. A modified binary tree multiplier is presented in [9]. All partial products can be generated in one step. Power reduction is achieved by minimizing the power consumption of the full adder. A multiplier with array and tree architecture is proposed in [10] to enhance performance with low power consumption and a smaller time-delay product. A leapfrog multiplier [11] is modified such that the sum and carry signals to different rows of adders arrive at the same time. Therefore, power consumption can be reduced. The multiplier of [12] divides all partial products into four clusters. It uses latches to disable a cluster when that cluster is in the zero condition. In [13], low power multiplication is achieved by operand decomposition. Decomposition is performed on both multiplicand and multiplier to achieve low power consumption by reducing logic transitions. The work in [14] uses a pre-computation based method to reduce power in a sequential multiplier. In [15], low power is accomplished by reducing the complexity of the multiplication architecture and the switching activities. In [16–18], power consumption is reduced significantly by developing new adder cells in different multiplier designs. In [19,20], a low-power multiplier design is proposed with a bypassing method that turns off devices when the inputs of the multiplier are zero.

These power reduction techniques have been verified and implemented in many DSP or other related applications at certain additional expense. Among these techniques, the most effective way is reducing dynamic power consumption, which dominates total power consumption. Hence, average power consumption can be reduced significantly by adopting this method. Consequently, this paper intends to develop a new design that achieves a low power and high speed multiplier. A novel low power multiplier is proposed that minimizes the switching activities of the multiplier while maintaining its speed by adopting a parallel architecture. The bypassing method achieves significant power saving if more than half of the multiplicand bits are zero. However, the additional hardware required by the bypassing method lies in the critical path and reduces the operation speed of the multiplier. Hence, a parallel architecture is adopted to enhance the speed of the multiplier.

This paper is organized as follows. The concept of the multiplier and power consumption issues are described in Section 2. The estimated probability of zeros in the multiplier is also presented to illustrate the power saving advantage of the bypassing multiplier. In Section 3, a novel multiplier design based on a bypassing scheme with parallel architecture is proposed for both unsigned and signed operands. The simulation results of the proposed design and performance comparisons with counterpart circuits are shown in Section 4. Finally, the conclusion is given in Section 5.

2. Multiplier concepts

2.1. Conventional array multiplier

Conventional multipliers can be classified into iterative and array multipliers. An iterative multiplier accomplishes multiplication through a series of shift and addition operations. Since it can reuse the same hardware to perform multiplication, it occupies less area than other multipliers. However, it needs more clock cycles to accomplish multiplication and cannot be realized in a pipeline structure. On the other hand, the array multiplier is common in multiplier design due to its regular and compact structure. The array multiplier is organized as several stages of adders and AND gates. It generates all the partial products after only one AND-gate delay. Then, it sums up all partial products sequentially. The advantage of this structure is that the arrangement of its adders is very regular, which is favorable for layout. It can also be realized with a parallel structure. However, it occupies more area and hardware than the iterative multiplier.

The conventional array multiplier is primarily used for computing the multiplication of two input data. For example, two unsigned n-bit binary numbers A = a_{n-1}a_{n-2}a_{n-3}...a_0 and B = b_{n-1}b_{n-2}b_{n-3}...b_0 generate a 2n-bit product P, which is defined as follows:

\[
P = A \times B = \left(\sum_{i=0}^{n-1} a_i 2^{i}\right)\left(\sum_{j=0}^{n-1} b_j 2^{j}\right) = \sum_{j=0}^{n-1}\sum_{i=0}^{n-1} (a_i b_j)\,2^{i+j} \tag{1}
\]

where i and j index the bits of the multiplier and the multiplicand, respectively.

                        a3    a2    a1    a0
                  ×     b3    b2    b1    b0
                        a3b0  a2b0  a1b0  a0b0
                  a3b1  a2b1  a1b1  a0b1
            a3b2  a2b2  a1b2  a0b2
      a3b3  a2b3  a1b3  a0b3
P7    P6    P5    P4    P3    P2    P1    P0

Fig. 1. A 4 × 4 basic multiplication.

Fig. 2. Ripple carry array multiplier.

An example of 4-bit multiplication is shown in Fig. 1. Two conventional array multipliers can be used to generate partial products. According to the way the carry propagates, they are classified into two structures: ripple-carry array (RCA) (Fig. 2) and carry-save array (CSA). In the RCA multiplier all adder cells are composed of RCA adders. For example, it needs 3N adders to accomplish multiplication in an N × N multiplier. However, the delay time needed in the worst case is (2N + 1) full adder delays. In CSA, the main adder cells consist of CSA adders, and RCA adders are used in the final row of adders. This array also needs 3N adders to accomplish multiplication. However, the delay time needed in the worst case is (N + 2) full adder delays. In order to achieve low power and high speed performance at the same time, the proposed multiplier is based on the RCA adder.

where α is the switching probability, f is the average number of transitions, C_L is the output capacitance, V_DD is the supply voltage, I_SC is the short circuit current, and I_leakage is the leakage current. In submicron technology, leakage current also consumes a significant portion of power [23,24]. Some leakage reduction methods can be found in [23,24]. This paper mainly focuses on reducing the dynamic power consumption of the multiplier by minimizing switching activity.

Fig. 3. Modified carry save full adder.

2.4. Analysis of switching probability

The power saving obtained by adopting the bypassing method relies primarily on the number of zeros in the input data of the multiplier.
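To make the dependence on zero bits concrete, the following minimal sketch evaluates Eq. (1) row by row and counts the rows whose partial products are all zero. It is a behavioral illustration only, not the circuit of the paper; the function names `to_bits` and `array_multiply` are hypothetical and chosen for this example.

```python
def to_bits(value, n):
    """Return the n least-significant bits of value, LSB first."""
    return [(value >> k) & 1 for k in range(n)]


def array_multiply(a, b, n):
    """Evaluate Eq. (1) row by row for two unsigned n-bit operands.

    Row j holds the partial products a_i * b_j for i = 0..n-1. When b_j is
    zero the whole row is zero, so a bypassing multiplier could skip it.
    Returns (product, number_of_zero_rows).
    """
    a_bits = to_bits(a, n)
    b_bits = to_bits(b, n)
    product = 0
    zero_rows = 0
    for j, bj in enumerate(b_bits):
        if bj == 0:
            zero_rows += 1  # this row contributes nothing to the sum
            continue
        for i, ai in enumerate(a_bits):
            product += (ai & bj) << (i + j)  # (a_i b_j) * 2^(i+j)
    return product, zero_rows


if __name__ == "__main__":
    # 15 x 9: the rows controlled by b1 and b2 are all zero.
    print(array_multiply(0b1111, 0b1001, 4))
```

For 15 × 9 the sketch returns the product 135 together with two zero rows, which is the kind of row a bypassing scheme is meant to leave inactive.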
Modulation (ADPCM), G.723.1 speech coder, and wavelet-based image coders, input data of these applications can be analyzed to illustrate the power reduction efficiency of the bypassing multiplier. The input data of these applications are extracted by analyzing the effective dynamic range presented in [25]. The ADPCM data is a 0.125 s audio signal that is further used for multiplication in low and high pass band splitting. For the G.723.1 speech coder, 0.05 s of speech data sampled at 8 kHz is used for multiplication in the autocorrelation of linear prediction coding. In the last application, one fortieth of the multiplications for a 512 × 512 pixel image is performed for low and high pass filtering.

The original data is fed into a 16 × 16 multiplier. The histograms of the effective dynamic range of the input data show the probability of each input vector distribution in terms of effective bit number. The zero bits can be used in the bypassing multiplier to disable devices. The effective bit number and non-effective bit number of these three applications are shown in Table 1 for multiplicand X and multiplier Y.

Fig. 5. Probability of effective dynamic data from 1 to 16 bit. (Vertical axis: probability; horizontal axis: effective dynamic data range, 1 to 16.)

Table 2
Probability of zeros in three different applications: (A) ADPCM audio coder; (B) G.723.1 speech code; and (C) wavelet-based image coder.

Operand                      Probability of zeros (%)
Case (A) Multiplicand X      65.28
Case (A) Multiplier Y        77.41
Case (B) Multiplicand X      66.91
Case (B) Multiplier Y        75.72
Case (C) Multiplicand X      72.13
Case (C) Multiplier Y        88.22

The probability of each dynamic range is calculated from the actual input vectors. To estimate the number of zeros within the effective data, it is assumed that 50 percent of the effective data bits are zeros. Since the effective data range runs from 1 to 16 bits, the remaining non-effective dynamic data bits number from 15 to 0. The non-effective dynamic data bits are all zeros since they do not represent any value. Based on these assumptions, the probability of zeros can be estimated by the following equation:

\[
\sum_{i=-n}^{-1} \mathrm{prob}(D_i)\times\left(50\%\times\frac{|i|}{n}+\frac{n-|i|}{n}\right)
+\sum_{i=1}^{n} \mathrm{prob}(D_i)\times\left(50\%\times\frac{i}{n}+\frac{n-i}{n}\right) \tag{3}
\]
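Eq. (3) can be evaluated directly from the Case (A) columns of Table 1. The sketch below is an illustration under two assumptions: the two halves of Table 1 are each read as covering 1 to 16 effective bits (only the magnitude of the range enters the estimate), and the names `CASE_A_X`, `CASE_A_Y` and `zero_probability` are hypothetical.

```python
N_BITS = 16  # operand width n of the 16 x 16 multiplier

# Table 1, Case (A): probability (%) per effective dynamic range.
# Each operand has two lists; index k holds the entry for k + 1 effective
# bits. The first list is the first half of the table, the second list is
# the second half.
CASE_A_X = (
    [0, 0, 0, 0, 2, 3, 3, 3, 2, 2, 5, 6, 9, 10, 2, 0],
    [2, 0, 0, 0, 2, 4, 2, 2, 2, 2, 5, 6, 11, 13, 2, 0],
)
CASE_A_Y = (
    [0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 10, 0, 5, 0, 0, 0],
    [37, 0, 0, 0, 5, 0, 5, 5, 5, 0, 0, 5, 5, 0, 5, 0],
)


def zero_probability(halves, n=N_BITS):
    """Estimate the probability of zeros (%) following Eq. (3).

    Half of the effective bits are assumed to be zero and every
    non-effective bit is zero.
    """
    total = 0.0
    for half in halves:
        for k, prob_percent in enumerate(half):
            effective = k + 1
            zero_fraction = 0.5 * effective / n + (n - effective) / n
            total += (prob_percent / 100.0) * zero_fraction
    return 100.0 * total


if __name__ == "__main__":
    print(round(zero_probability(CASE_A_X), 2))
    print(round(zero_probability(CASE_A_Y), 2))
```

Under these assumptions the two estimates agree with the Case (A) entries of Table 2 (65.28 and 77.41 percent) to two decimal places.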
In Eq. (3), n is the number of bits in multiplicand X and multiplier Y, D_i is the effective data, and prob is the probability of the specified effective data from Table 1.

Table 1
Probability of effective bit number of three applications: (A) ADPCM audio coder; (B) G.723.1 speech code; and (C) wavelet-based image coder.

Effective dynamic    Case (A)          Case (B)          Case (C)
range (X, Y)         X (%)   Y (%)     X (%)   Y (%)     X (%)   Y (%)
-16                  0       0         0       0         0       0
-15                  2       0         0       0         0       0
-14                  10      0         1       1         2       0
-13                  9       5         2       3         5       0
-12                  6       0         18      5         9       0
-11                  5       10        12      4         5       0
-10                  2       0         6       4         4       0
-9                   2       0         4       3         3       0
-8                   3       0         2       2         3       0
-7                   3       10        1       1         2       6
-6                   3       0         1       0         1       0
-5                   2       0         0       0         1       0
-4                   0       0         0       0         1       11
-3                   0       0         0       0         0       11
-2                   0       0         0       0         1       0
-1                   0       0         0       0         1       11
1                    2       37        0       38        12      11
2                    0       0         0       0         1       0
3                    0       0         0       0         1       23
4                    0       0         0       0         2       0
5                    2       5         1       0         1       0
6                    4       0         2       0         1       22
7                    2       5         2       0         2       5
8                    2       5         2       1         3       0
9                    2       5         4       3         4       0
10                   2       0         6       4         8       0
11                   5       0         11      6         9       0
12                   6       5         10      5         11      0
13                   11      5         9       2         6       0
14                   13      0         3       5         1       0
15                   2       5         3       13        0       0
16                   0       0         0       0         0       0

The probability of zeros as a function of effective data range is shown in Fig. 5. From this figure, we can conclude that the fewer the effective data bits, the higher the probability of zeros, which is reflected in most input vectors of the different applications. Most data fall in the lower and middle effective data ranges. Based on Eq. (3), the estimated probabilities of zeros in these three applications are calculated and summarized in Table 2; the probability of zeros for both multiplicand X and multiplier Y is at least 65 percent, which is greater than the 50 percent of a normal distribution. In the wavelet-based image coder, the probability of zeros of multiplier Y even reaches 88 percent. Therefore, power consumption can be reduced significantly by adopting the bypassing method. In addition, it is observed that multiplier Y has a larger probability of zeros than multiplicand X. Therefore, the operand with the larger probability of zeros can be used to control the bypassing multiplier.

3. The proposed low power and high speed multiplier with row bypassing and parallel architecture

3.1. Unsigned bypassing multiplier design

The array multiplier is composed of rows of adders as shown in Fig. 2. The sum and carry signals are generated by previous rows and fed into two of the three inputs of the current row. The power consumption is lower if the transitions of these input signals are less frequent. As shown in Table 2, the average zero probability of input signals on different DSP applications is over
73.8 percent. Therefore, the most effective way to reduce the power of an array based multiplier is to disable the transitions of the adders. The operational principle of the bypassing multiplier is discussed in Section 2.3. The CSA based bypassing multiplier can save a certain amount of power. However, the circuit implementation of the CSA based multiplier shown in Figs. 3 and 4 is complicated. The additional circuits required by the bypassing method can degrade the operation speed of the multiplier. As mentioned in Section 2.1, the CSA based multiplier achieves faster operation speed than the RCA based multiplier. However, its hardware cost is 50 percent more than that of the conventional array multiplier [20]. The proposed multiplier adopts the ripple-carry adder with fewer hardware components and a parallel architecture. The new bypassing architecture is proposed to enhance operating speed and reduce power consumption of the ripple-carry adder at the same time.

An RCA adder with bypassing ability is adopted in each row of adders. The reason for adopting the RCA adder instead of the CSA adder is to achieve a parallel architecture. Two tri-state buffers are placed at two inputs of the full adder to disable its operation when b_j is 0. The tri-state buffer is implemented with a transmission gate (TG). The multiplexer is placed at the sum output of the full adder. The sum value is selected from either the bypassing value or the sum output of the full adder according to the value of b_j. The proposed design does not need a multiplexer for the carry output or a tri-state buffer for the carry input of the full adder. The reason is that two inputs of each full adder in the jth row are disabled while b_j is 0. Thus the carry outputs of the full adders in the same row cannot change, since two of the three inputs of each full adder are disabled. Thereby, each full adder only needs two tri-state buffers and one multiplexer. Moreover, an AND gate is inserted at the last carry output in each row of full adders to correct the output when b_j is 0. Therefore, a significant portion of extra hardware can be saved without degrading speed performance. In addition, power consumption is also reduced as a result of reduced hardware activity. Fig. 6 shows the proposed full adder. Fig. 7 shows the proposed 4 × 4 multiplier based on the modified RCA full adder. The proposed RCA full adder only needs two tri-state buffers and one multiplexer. On the other hand, the full adder design in [20] needs three tri-state buffers and two multiplexers. It is evident that the proposed design reduces hardware area.

Fig. 6. Proposed RCA with row bypassing technique.

A multiplication test vector of 1111 × 1001 is set up for the proposed design, as shown in Fig. 8. The values beside the arrows indicate the value of the sum bit or carry bit. In this example, the partial products to be summed in the first and second rows of adders are all zero because b1 = b2 = 0. The sum outputs then equal the results from the previous row of adders. It is noteworthy that the output carry bit of each full adder in the same row is zero and
the carry signal propagates in the same direction. Thus, we can discern that all carry signals propagate as zero to the next full adder in the jth row when the value of b_j is 0.

Fig. 8. An example for 4 × 4 multiplier with RCA.

Besides the above-mentioned method, the proposed multiplier also adopts a parallel architecture to shorten delay time. For the example of 8 × 8 multiplication, two 8 × 4 bypassing multipliers based on RCA are shown in Fig. 9. The partial sums and carry outputs from these two 8 × 4 multipliers can be computed simultaneously. Note that the final stage adders consist of RCA adders on both sides and CSA adders in the middle. In this configuration, the parallelism of the proposed multiplier is established. Furthermore, the delay time of the RCA multiplier is shortened through this method. The final proposed multiplier is shown in Fig. 10. Less extra hardware is used compared to that of [20]. The proposed multiplier needs ((5/2)N − 3) full adder delays in the worst case for an N × N multiplier design. The proposed parallel architecture is not suitable for the CSA based Braun multiplier [20]. A CSA based multiplier cannot be decomposed into two
parallel 8 × 4 multipliers because the inputs of the current row of CSA adders come from the upper row; the 16 × 16 signed multiplier can be designed by a similar procedure.

3.2. Signed bypassing multiplier design

The multiplier introduced in the previous section is used to compute unsigned numbers. However, it is essential to design signed multipliers because computer systems usually manipulate signed numbers. With regard to signed multiplier design, some signed multiplication algorithms are proposed in [26]. In conventional array multipliers such as the Braun multiplier, signed multipliers can be realized through the Baugh–Wooley multiplication algorithm [26], which is often used to deal with signed multiplication. The algorithm uses 2's complement to represent signed numbers and uses the same framework as the array multiplier. The advantage of this algorithm is that it accomplishes signed multiplication without extending sign bits. Consequently, no additional hardware cost is incurred and no extra power is dissipated. Only the AND gates for the corresponding operands are changed to NAND gates, and an inverter is inserted at the final carry output. Fig. 11 shows the architecture of a 4 × 4 signed Braun multiplier [26].
For example, two signed 4-bit binary numbers A = a3a2a1a0 and B = b3b2b1b0 generate a product P, which can be defined as follows:

\[
P = 1\cdot 2^{7} + a_3 b_3\,2^{6} + \left(\overline{a_3 b_2} + \overline{b_3 a_2}\right)2^{5} + \left(\overline{a_3 b_1} + \overline{b_3 a_1} + 1\right)2^{4} + \left(\overline{a_3 b_0} + \overline{b_3 a_0}\right)2^{3} + \left(b_2 2^{2} + b_1 2^{1} + b_0\right)\left(a_2 2^{2} + a_1 2^{1} + a_0\right) \tag{4}
\]

Next, the same algorithm is utilized to design the proposed bypassing multiplier with signed operands. The bypassing multiplier of [20] could also utilize the Baugh–Wooley multiplication algorithm [26] to realize a signed bypassing multiplier. According to the Baugh–Wooley multiplication algorithm [26], some AND gates of the original design in [20] must be changed to NAND gates for the corresponding operands. However, general CSA adders would be used instead of the modified full adders shown in Fig. 3 for the computation of the last row of operands in the multiplication. The reason is as follows. First, disabling adders is performed only when the operand is zero. The probability of the output (AB)' of a 2-input NAND gate being zero is only 25 percent. If the adder shown in Fig. 3 is used for this row, additional logic must be added. The power consumed by this additional logic may be large. In other words, the adders in this row may dissipate more power most of the time. Consequently, general CSA adders are used for this row because they do not dissipate power on the additional logic. Since one has to be added in the final step of the Baugh–Wooley multiplication algorithm [26], one additional row of adders for carry propagation is placed in the last row of the signed bypassing multiplier. The whole circuit architecture for a 4 × 4 signed bypassing multiplier is shown in Fig. 12.

The proposed multiplier also adopts the Baugh–Wooley algorithm [26] for signed number multiplication. Considering an 8 × 8 signed multiplication [26], all operands are separated into two parts. Full adders are used to compute the last row of operands according to the analysis in the previous paragraph. Therefore, two different 8 × 4 bit signed ripple-carry array multipliers need to be designed. Blocks 1 and 2 are these two 8 × 4 bit signed ripple-carry array multipliers. Block 1, shown in Fig. 13, deals with the upper part of the operands and is similar to the multiplier shown in Fig. 9 except for some changes to the gates of the circuit. Similarly, Block 2, shown in Fig. 14, deals with the lower part of the operands. Block 2 differs from Block 1 in hardware design: Block 1 uses the proposed full adder shown in Fig. 6 to compute all operands, whereas Block 2 differs only in the last row of operands, which is computed by ripple-carry adders. Since the proposed multiplier does not need an additional full adder to correct the operation of multiplication, the addition in the final step can be computed without adding other full adders. Thus, the hardware requirement of the proposed signed multiplier is less than that of the signed bypassing multiplier. Finally, these two blocks are combined and an inverter is placed at the carry output. The proposed signed multiplier is shown in Fig. 15. The 16 × 16 signed multiplier can be designed by a similar procedure.
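The 4-bit Baugh–Wooley expression of Eq. (4) can be checked numerically. The sketch below is an illustration under stated assumptions: the cross terms that involve exactly one sign bit are taken as NAND terms (as the text describes), and the equality is checked on the 8-bit product word, that is, modulo 2^8. The helper names are hypothetical.

```python
def bit(value, k):
    """Return bit k of value."""
    return (value >> k) & 1


def nand(x, y):
    """Single-bit NAND, used for cross terms with exactly one sign bit."""
    return 1 - (x & y)


def to_signed(value, width):
    """Interpret the low `width` bits of value as two's complement."""
    value &= (1 << width) - 1
    if value >> (width - 1):
        return value - (1 << width)
    return value


def baugh_wooley_4x4(a, b):
    """Evaluate the right-hand side of Eq. (4) for 4-bit patterns a and b.

    Returns the 8-bit product word (the sum taken modulo 2^8).
    """
    a3, b3 = bit(a, 3), bit(b, 3)
    total = 1 << 7
    total += (a3 & b3) << 6
    total += (nand(a3, bit(b, 2)) + nand(b3, bit(a, 2))) << 5
    total += (nand(a3, bit(b, 1)) + nand(b3, bit(a, 1)) + 1) << 4
    total += (nand(a3, bit(b, 0)) + nand(b3, bit(a, 0))) << 3
    total += (b & 0b111) * (a & 0b111)
    return total & 0xFF


if __name__ == "__main__":
    # Exhaustive check against ordinary signed multiplication.
    ok = all(
        to_signed(baugh_wooley_4x4(a, b), 8) == to_signed(a, 4) * to_signed(b, 4)
        for a in range(16)
        for b in range(16)
    )
    print(ok)
```

The exhaustive loop covers all 256 operand pairs, so it also exercises the corner case (−8) × (−8) = 64.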
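As a behavioral illustration of the parallel architecture described for the unsigned 8 × 8 case in Section 3.1, the sketch below splits the multiplier operand into two halves, accumulates each half independently with row bypassing, and combines the two partial results. It models function only: the final-stage RCA and CSA adders, the tri-state buffers and all timing are not represented, and the names `bypass_rows` and `parallel_multiply` are hypothetical.

```python
def bypass_rows(a, b_part, n_rows):
    """Sum the partial-product rows of a * b_part.

    Rows whose control bit b_j is zero are skipped, mimicking row
    bypassing. Returns (partial_sum, rows_bypassed).
    """
    acc = 0
    bypassed = 0
    for j in range(n_rows):
        if (b_part >> j) & 1:
            acc += a << j  # row j adds a * 2^j
        else:
            bypassed += 1  # previous sum is passed through unchanged
    return acc, bypassed


def parallel_multiply(a, b, n=8):
    """Multiply two unsigned n-bit numbers as two independent n x n/2 blocks."""
    half = n // 2
    mask = (1 << half) - 1
    low, skipped_low = bypass_rows(a, b & mask, half)
    high, skipped_high = bypass_rows(a, (b >> half) & mask, half)
    # The upper block is weighted by 2^(n/2) when the halves are merged.
    product = low + (high << half)
    return product, skipped_low + skipped_high


if __name__ == "__main__":
    print(parallel_multiply(0b11111111, 0b10010001))
```

Because the two calls to `bypass_rows` share no state, they correspond to blocks that can be evaluated at the same time; only the final merge depends on both.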
4. Simulation results and performance comparisons

In this section, the performance evaluation of the proposed multiplier along with a comparison to the conventional Braun multiplier is presented. The performance measures include power consumption, delay, power-delay product, and layout area. These circuits are designed at transistor level without using any standard cell from the technology library. Post-layout simulations are performed with standard TSMC 0.18 µm CMOS technology and 1.8 V supply voltage using Cadence Spectre simulation tools. The design and simulation flow is shown in Fig. 16. In the design process, multiplier design was constructed at circuit level in the Cadence
Table 3
Power consumption (in mW) and power saving.

Design                 Multiplier size and normalized ratio
                       8 × 8      Ratio    16 × 16    Ratio
Braun (unsigned)       2.413      1.00     11.050     1.00
[10] (unsigned)        2.238      0.93     9.561      0.87
Proposed (unsigned)    2.144      0.89     9.111      0.82
Braun (signed)         2.867      1.00     11.532     1.00
[10] (signed)          2.991      1.04     10.671     0.93
Proposed (signed)      2.445      0.85     9.619      0.83

Table 4
Delay (in ns) and improvement.

Design                 Multiplier size and normalized ratio
                       8 × 8      Ratio    16 × 16    Ratio
Braun (unsigned)       3.504      1.00     7.584      1.00
[10] (unsigned)        3.188      0.91     6.537      0.86
Proposed (unsigned)    2.243      0.64     4.713      0.62
Braun (signed)         3.713      1.00     8.104      1.00
[10] (signed)          3.472      0.93     7.238      0.89
Proposed (signed)      2.543      0.69     5.334      0.66

Table 6
Total area (in µm²) and area overhead.

Design                 Multiplier size and normalized ratio
                       8 × 8      Ratio    16 × 16    Ratio
Braun (unsigned)       73524      1.00     307585     1.00
[10] (unsigned)        132342     1.80     538879     1.75
Proposed (unsigned)    92449      1.26     367908     1.20
Braun (signed)         73524      1.00     307585     1.00
[10] (signed)          139604     1.90     553681     1.80
Proposed (signed)      93592      1.27     372177     1.21

Table 7
Performance comparison of recent published papers.

Design                                      Performance and normalized ratio
                                            Power (mW)   Ratio   Delay (ns)   Ratio   Power delay product (pJ)   Ratio
[27] ROM based 16 × 16 multiplier           13.50        1.48    5.55         1.18    74.9                       1.75
[28] Low power bypassing 8 × 8 multiplier   16.30        1.79    14.28        3.03    232.7                      5.42
Proposed bypassing 16 × 16 multiplier       9.11         1.00    4.71         1.00    42.9                       1.00
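The power-delay products and the headline reductions can be recomputed from the power and delay entries of Tables 3 and 4. The sketch below does this for the Braun and proposed designs only. Reading the 17 and 36 percent figures as the mean of the unsigned and signed 16 × 16 results is an assumption of this example, and the names `RESULTS`, `power_delay_product` and `average_reduction` are hypothetical.

```python
# (power in mW, delay in ns) copied from Tables 3 and 4.
RESULTS = {
    ("8x8", "unsigned"): {"Braun": (2.413, 3.504), "Proposed": (2.144, 2.243)},
    ("16x16", "unsigned"): {"Braun": (11.050, 7.584), "Proposed": (9.111, 4.713)},
    ("8x8", "signed"): {"Braun": (2.867, 3.713), "Proposed": (2.445, 2.543)},
    ("16x16", "signed"): {"Braun": (11.532, 8.104), "Proposed": (9.619, 5.334)},
}

POWER, DELAY = 0, 1  # column indices inside each (power, delay) pair


def power_delay_product(size, kind, design):
    """Return power * delay in pJ (mW times ns equals pJ)."""
    power_mw, delay_ns = RESULTS[(size, kind)][design]
    return power_mw * delay_ns


def average_reduction(size, column):
    """Mean percentage reduction of Proposed versus Braun.

    The mean is taken over the unsigned and signed designs of one size.
    """
    ratios = []
    for kind in ("unsigned", "signed"):
        entry = RESULTS[(size, kind)]
        ratios.append(entry["Proposed"][column] / entry["Braun"][column])
    return 100.0 * (1.0 - sum(ratios) / len(ratios))


if __name__ == "__main__":
    for key in RESULTS:
        for design in ("Braun", "Proposed"):
            print(key, design, round(power_delay_product(*key, design), 3))
    print(round(average_reduction("16x16", POWER)))
    print(round(average_reduction("16x16", DELAY)))
```

The products can be compared against the power-delay entries of Table 5, which agree to within rounding of the last digit.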
Table 5
Power-delay product (10^-12 J) and improvement.

Design                 Multiplier size and normalized ratio
                       8 × 8      Ratio    16 × 16    Ratio
Braun (unsigned)       8.455      1.00     83.803     1.00
[10] (unsigned)        7.135      0.84     62.844     0.75
Proposed (unsigned)    4.808      0.57     42.940     0.51
Braun (signed)         10.645     1.00     93.455     1.00
[10] (signed)          10.384     0.97     77.236     0.82
Proposed (signed)      6.217      0.58     51.307     0.55

zeros. The delay time of the multiplier is also shortened by adopting the parallel architecture. In order to validate the effectiveness of the proposed design, power consumption and delay are evaluated by Cadence Spectre post-layout simulation with standard TSMC 0.18 µm CMOS technology. Simulation results show that the proposed design achieves greater power efficiency with less extra hardware and a lower power-delay product than its counterparts. For a 16 × 16 multiplier, the proposed design achieves 17 and 36 percent reduction in power consumption and delay, respectively, at the cost of a 20 percent increase in chip area in comparison with the conventional array multiplier. In addition, the proposed design achieves 11 and 38 percent reduction in power consumption and delay, respectively, with 46 percent less chip area in comparison with those in [20]. The test patterns are randomly generated with uniformly distributed probability. As mentioned in Section 2.4, the average zero input of the operands for typical DSP applications is 73.8 percent. Therefore, the proposed multiplier can achieve even greater power saving if the probability of zeros in the inputs of the multiplier is larger than 0.5. Compared to other recently published papers [27,28], the proposed bypassing multiplier achieves the lowest power, delay, and power-delay product. Hence,
the proposed design achieves the goal of low power and high speed performance at the same time.

Acknowledgements

The authors would like to acknowledge the financial support of the National Science Council, Taiwan, Republic of China, under grant number NSC96-2220-E-110-008. The authors would also like to express their greatest thanks to CIC (Chip Implementation Center) of NARL (National Applied Research Laboratories), Taiwan, for their thoughtful chip fabrication service.

References

[1] W.C. Yeh, C.W. Jen, High-speed and low-power split-radix FFT, IEEE Transactions on Signal Processing 51 (2003) 864–874.
[2] C.S. Wallace, A suggestion for a fast multiplier, IEEE Transactions on Computer 13 (1964) 14–17.
[3] V.G. Oklobdzija, D. Villeger, S.S. Liu, A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach, IEEE Transactions on Computers 45 (1996) 294–306.
[4] B. Parhami, in: Computer Arithmetic, Algorithms, and Hardware Design, Oxford University Press, New York, 2000.
[5] K.Z. Pekmestzi, Multiplexer-based array multiplier, IEEE Transactions on Computers 48 (1999) 15–23.
[6] P.C.H. Meier, R.A. Rutenbar, L.R. Carley, Exploring multiplier architecture and layout for low power, in: Proceedings of the IEEE Custom Integrated Circuits Conference, 1996, pp. 513–516.
[7] K.S. Chong, B.H. Gwee, J.S. Chang, A micropower low-voltage multiplier with reduced spurious switching, IEEE Transactions on Very Large Scale Integrated Systems 13 (2005) 255–265.
[8] C.H. Han, H.J. Park, L.S. Kim, A low-power array multiplier using separated multiplication technique, IEEE Transactions on Circuits and Systems-II, Analog and Digital Signal Processing 48 (2001) 866–871.
[9] E. Abu-Shama, M.B. Maaz, M.A. Bayoumi, A fast and low power multiplier architecture, in: Proceedings of the IEEE Midwest Symposium on Circuits and Systems, 1996, pp. 53–56.
[10] R. Mudassir, H. El-Razouk, Z. Abid, New designs of signed multiplier, in: Proceedings of the IEEE Northeast Workshop on Circuits and Systems, 2005, pp. 259–262.
[11] S. Mahant-Shetti, P. Balsara, C. Lemonds, High performance low power array multiplier using temporal tiling, IEEE Transactions on Very Large Scale Integrated Systems 7 (1999) 121–124.
[12] A.A. Fayed, M.A. Bayoumi, A novel architecture for low-power design of parallel multipliers, in: Proceedings of the IEEE Computer Society Workshop on VLSI, 2001, pp. 149–154.
[13] M. Ito, D. Chinnery, K. Keutzer, Low power multiplication algorithm for switching activity reduction through operand decomposition, in: Proceedings of the 21st International Conference on Computer Design, 2003, pp. 21–26.
[14] N. Honarmand, M.R. Javaheri, N. Sedaghati-Mokhtari, A. Afzali-Kusha, Power efficient sequential multiplication using pre-computation, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 2006, pp. 2709–2712.
[15] L.H. Chen, O.T.-C. Chen, T.Y. Wang, Y.C. Ma, Multiplication-accumulation computation unit with optimized compressors and minimized switching activities, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 2005, pp. 6118–6121.
[16] C. Senthilpari, A.K. Singh, K. Diwakar, Design of a low-power, high performance, 8 × 8 bit multiplier using a Shannon-based adder cell, Microelectronics Journal 39 (2008) 812–821.
[17] Z. Abid, H. El-Razouk, D.A. El-Dib, Low power multipliers based on new hybrid full adders, Microelectronics Journal 39 (2008) 1509–1515.
[18] K. Navi, V. Foroutan, M. Rahimi Azghadi, M. Maeen, M. Ebrahimpour, M. Kaveh, O. Kavehei, A novel low-power full-adder cell with new technique in designing logical gates based on static CMOS inverter, Microelectronics Journal 40 (2009) 1441–1448.
[19] S. Hong, S. Kim, M.C. Papaefthymiou, W.E. Stark, Low power parallel multiplier design for DSP applications through coefficient optimization, in: Proceedings of the IEEE International ASIC/SOC Conference, 1999, pp. 286–290.
[20] J. Ohban, V.G. Moshnyaga, K. Inoue, Multiplier energy reduction through bypassing of partial products, in: Proceedings of the IEEE Asia-Pacific Conference on Circuits and Systems, 2002, pp. 13–17.
[21] A.P. Chandrakasan, S. Sheng, R. Brodersen, Low-power CMOS digital design, IEEE Journal of Solid-State Circuits 27 (1992) 473–484.
[22] M. Psilogeorgopoulos, M. Munteanu, T.-S. Chuang, P.A. Ivey, L. Seed, Contemporary techniques for lower power circuit design, PREST Deliverable D2.1, The Department of Electronic and Electrical Engineering, The University of Sheffield, Mappin Street, Sheffield S1 3JD, UK, 1998, pp. 1–91 <http://www.engr.newpaltz.edu/~damu/spring_2008/resource/cont_tech.pdf>.
[23] N.S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J.S. Hu, M.J. Irwin, M. Kandemir, V. Narayanan, Leakage current: Moore’s law meets static power, IEEE Computer 36 (2003) 68–75.
[24] J. Kao, S. Narendra, A. Chandrakasan, Subthreshold leakage modeling and reduction techniques, in: Proceedings of the IEEE/ACM International Conference on Computer Aided Design, 2002, pp. 141–148.
[25] O.T.C. Chen, S. Wang, Y.W. Wu, Minimization of switching activities of partial product for designing low-power multipliers, IEEE Transactions on Very Large Scale Integrated Systems 11 (2003) 418–433.
[26] R. Mudassir, H. El-Razouk, Z. Abid, New designs of signed multiplier, in: Proceedings of the IEEE International NEWCAS Conference, 2005, pp. 259–262.
[27] B.C. Paul, S.F. Fujita, M. Okajima, ROM-based logic (RBL) design: a low-power 16 bit multiplier, IEEE Journal of Solid-State Circuits 44 (2009) 2935–2942.
[28] C.C. Wang, G.N. Sung, Low-power multiplier design using a bypassing technique, Journal of Signal Processing Systems 57 (2009) 331–338.
Please cite this article as: K.-C. Kuo, C.-W. Chou, Low power and high speed multiplier design with row bypassing and parallel
architecture, Microelectron. J (2010), doi:10.1016/j.mejo.2010.06.009