

Microelectronics Journal
journal homepage: www.elsevier.com/locate/mejo

Low power and high speed multiplier design with row bypassing and parallel architecture
Ko-Chi Kuo*, Chi-Wen Chou
Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan

Article info

Article history:
Received 10 October 2009
Received in revised form 16 June 2010
Accepted 21 June 2010

Keywords:
Low power
Bypassing multiplier
Parallel architecture
Ripple carry array

Abstract

This paper presents a low power and high speed row bypassing multiplier. The primary power reductions are obtained by turning off MOS components through multiplexers when the operands of the multiplier are zero. Analysis of conventional DSP applications shows that, on average, 73.8 percent of the multiplier operand bits are zero. Therefore, the proposed bypassing multiplier can reduce power consumption significantly. The proposed multiplier adopts a ripple-carry adder with fewer additional hardware components. In addition, the proposed bypassing architecture enhances operating speed through an additional parallel architecture that shortens the delay of the multiplier. Multipliers for both unsigned and signed operands are developed. Post-layout simulations are performed with standard TSMC 0.18 μm CMOS technology and a 1.8 V supply voltage using Cadence Spectre simulation tools. Simulation results show that the proposed design reduces power consumption and delay compared with its counterparts. For a 16 × 16 multiplier, the proposed design achieves 17 and 36 percent reductions in power consumption and delay, respectively, at the cost of a 20 percent increase in chip area in comparison with conventional array multipliers. In addition, the proposed design achieves average reductions of 11 and 38 percent in power consumption and delay with 46 percent less chip area in comparison with those counterparts for both unsigned and signed multipliers. The proposed design is suitable for low power and high speed arithmetic applications.
© 2010 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +886 75252000 x4322; fax: +886 75254301.
E-mail address: kckuo@cse.nsysu.edu.tw (K.-C. Kuo).

1. Introduction

With the growing demand for portable electronic devices, low power design has received increasing attention in recent years. The primary concern for a portable device is to extend its operating time without changing the battery. Although advances in technology have lengthened battery life, the complicated operations in high-end portable devices are still power hungry, which makes low power design critical. Low power design can be pursued at the system, logic, technology, architecture, and circuit levels. Power savings can be significant if low power design is planned early, at the system level. Optimizing the logic level of a circuit is also critical for low power design; to reach this goal, dedicated software needs to be developed. As technology continues to shrink, power consumption can be scaled down at the same time. Many efforts to meet low power requirements at the circuit level have been reported in the literature. These efforts include voltage scaling, threshold voltage scaling, power-down strategies, and logic style. These options can be chosen at the circuit and topology levels to implement different arithmetic functions. For example, to implement a specific function at the architecture level, a ripple-carry, carry-save, or carry look-ahead adder can be adopted. By choosing one of these architectures, low power consumption can be achieved by trading off other specifications such as speed or chip area.

Digital signal processing (DSP) is one of the most important functions in electronic devices. DSP performs fundamental operations that include video processing for displaying streaming images and baseband processing for communication. These applications consume significant power. For example, the fast Fourier transform (FFT) is one of the essential building blocks in DSP. Because orthogonal frequency division multiplexing (OFDM) is widely used in portable communication devices, a low power FFT is a critical requirement. As stated in [1], multiplier modules occupy 46 percent of the chip area in the 64-point Split-Radix FFT. Therefore, significant power savings can be achieved by reducing the power consumption of the multiplier.

Different parallel multipliers have been proposed in the literature and are classified into tree-based multipliers [2–4] and array-based multipliers [4,5]. The advantage of the tree-based multiplier is that its delay grows only with the logarithm of the operand length [3]. On the other hand, array-based architectures are more popular because of their regular layout. However, the array-based multiplier


Please cite this article as: K.-C. Kuo, C.-W. Chou, Low power and high speed multiplier design with row bypassing and parallel
architecture, Microelectron. J (2010), doi:10.1016/j.mejo.2010.06.009

consumes potentially more power than the tree-based architecture [6]. The reason is that the additional adders embedded in the tree-based multiplier absorb spurious switching and hence reduce power consumption [7]. However, the layout of a tree-based multiplier tends to be complicated and induces more parasitic capacitance. In addition, the tree-based multiplier is limited to shorter operand lengths (<16 bits) [8]. Multipliers with longer operand lengths can be implemented with a modified Booth-encoded Wallace multiplier. Therefore, an array-based architecture is adopted in the proposed design.

Many researchers have focused on reducing the power consumption of multipliers [9–20]. A modified binary tree multiplier is presented in [9]; all partial products are generated in one step, and power is reduced by minimizing the power consumption of the full adder. A multiplier with array and tree architecture is proposed in [10] to enhance performance with low power consumption and a smaller time-delay product. A leapfrog multiplier [11] is modified so that the sum and carry signals to different rows of adders arrive at the same time, which reduces power consumption. The multiplier of [12] divides all partial products into four clusters and uses latches to disable a cluster when it is in the zero condition. Low power multiplication is achieved by operand decomposition in [13], where decomposition is performed on both the multiplicand and the multiplier to reduce logic transitions. The work in [14] uses a pre-computation based method to reduce power in a sequential multiplier. In [15], low power is accomplished by reducing the complexity of the multiplication architecture and its switching activity. In [16–18], power consumption is reduced significantly by developing new adder cells for different multiplier designs. In [19,20], a low-power multiplier design is proposed that uses a bypassing method to turn off devices when the inputs of the multiplier are zero.

These power reduction techniques have been verified and implemented in many DSP and related applications at some additional expense. Among these techniques, the most effective is to reduce dynamic power consumption, which dominates total power consumption; average power consumption can be reduced significantly in this way. Consequently, this paper develops a new design for a low power and high speed multiplier. A novel low power multiplier is proposed that minimizes the switching activity of the multiplier while maintaining its speed through a parallel architecture. The bypassing method achieves significant power savings when more than half of the operand bits are zero. However, the additional hardware required by the bypassing method lies on the critical path and slows the multiplier. Hence, a parallel architecture is adopted to enhance the speed of the multiplier.

This paper is organized as follows. The multiplier concepts and power consumption issues are described in Section 2. The estimated probability of zeros in the multiplier is also presented to illustrate the power saving advantage of the bypassing multiplier. In Section 3, a novel multiplier design based on the bypassing scheme with parallel architecture is proposed for both unsigned and signed operands. The simulation results of the proposed design and performance comparisons with counterpart circuits are shown in Section 4. Finally, the conclusion is given in Section 5.

2. Multiplier concepts

2.1. Conventional array multiplier

Conventional multipliers can be classified into iterative and array multipliers. An iterative multiplier accomplishes multiplication through a series of shift and addition operations. Since it reuses the same hardware to perform multiplication, it occupies less area than other multipliers. However, it needs more clock cycles to accomplish multiplication and cannot be realized in a pipeline structure. On the other hand, the array multiplier is common in multiplier design because of its regular and compact structure. The array multiplier is organized as several stages of adders and AND gates. It generates all the partial products after only one AND-gate delay and then sums up all partial products sequentially. The advantage of this structure is that the arrangement of its adders is very regular, which is favorable for layout. It can also be realized with a parallel structure. However, it occupies more area and hardware than the iterative multiplier.

The conventional array multiplier is primarily used for computing the product of two input data. For example, two unsigned n-bit binary numbers $A = a_{n-1}a_{n-2}a_{n-3}\ldots a_0$ and $B = b_{n-1}b_{n-2}b_{n-3}\ldots b_0$ generate a 2n-bit product P (P7 to P0 for n = 4 in Fig. 1), which can be defined as follows:

$$P = A \times B = \left(\sum_{i=0}^{n-1} a_i 2^i\right)\left(\sum_{j=0}^{n-1} b_j 2^j\right) = \sum_{j=0}^{n-1}\sum_{i=0}^{n-1} (a_i b_j)\,2^{i+j} \qquad (1)$$

where i and j index the bits of the multiplicand A and the multiplier B, respectively.

                            a3     a2     a1     a0
                     ×      b3     b2     b1     b0
                            a3b0   a2b0   a1b0   a0b0
                     a3b1   a2b1   a1b1   a0b1
              a3b2   a2b2   a1b2   a0b2
       a3b3   a2b3   a1b3   a0b3
P7     P6     P5     P4     P3     P2     P1     P0

Fig. 1. A 4 × 4 basic multiplication.

Fig. 2. Ripple carry array multiplier.

An example of 4-bit multiplication is shown in Fig. 1. Conventional array multipliers fall into two structures according to how carries propagate: the ripple-carry array (RCA) (Fig. 2) and the carry-save array (CSA). In the RCA multiplier, all adder cells are composed of RCA adders. For example, it needs 3N adders to accomplish multiplication in an N × N multiplier. However, the delay


time needed in the worst case is (2N + 1) full-adder delays. In the CSA multiplier, the main adder cells consist of CSA adders, and RCA adders are used in the final row of adders. This array also needs 3N adders to accomplish multiplication. However, the delay time needed in the worst case is (N + 2) full-adder delays. In order to achieve low power and high speed at the same time, the proposed multiplier is based on the RCA adder.

2.2. Power consumption

Power consumption is a critical parameter in designing electronic circuits, especially in portable electronic and communication devices. CMOS technology has been widely used for VLSI circuit design because of its low power consumption. The power consumption of CMOS circuits can be divided into static and dynamic power consumption [21]. Eq. (2) shows the power consumption of digital CMOS circuits [22]:

$$P_s = \alpha f C_L V_{DD}^2 + I_{SC} V_{DD} + I_{leakage} V_{DD} \qquad (2)$$

where $\alpha$ is the switching probability, $f$ is the average number of transitions, $C_L$ is the output capacitance, $V_{DD}$ is the supply voltage, $I_{SC}$ is the short circuit current, and $I_{leakage}$ is the leakage current. In submicron technology, leakage current also consumes a significant portion of the power [23,24]. Some leakage reduction methods can be found in [23,24]. This paper mainly focuses on reducing the dynamic power consumption of the multiplier by minimizing switching activity.

2.3. Bypassing multiplier based on CSA

The bypassing multiplier [20] disables adders based on the multiplier bit $b_j$ ($0 \le j \le n-1$); hence, power consumption can be reduced. To disable the adders, the conventional full adder needs to be modified as shown in Fig. 3. The bypassing multiplier based on the modified full adder is shown in Fig. 4. The modified full adder contains three tri-state buffers and two multiplexers to perform the bypassing technique. The tri-state buffers decide whether to disable the adder based on the value of the multiplier bit $b_j$, and the two multiplexers select the correct outputs. For instance, if the bit $b_j$ controlling the third row is 0, the adders in the third row of the multiplier can be disabled, and the outputs of the adders in the second row are passed directly to the adders in the fourth row. Note that the array then cannot generate the correct output, because the rightmost full adder in the third row of the multiplier is disabled. Therefore, extra hardware has to be added to perform the correct addition.

Fig. 3. Modified carry save full adder.

Fig. 4. A 4 × 4 bypassing multiplier based on carry save array.

2.4. Analysis of switching probability

The power saved by the bypassing method depends primarily on the number of zeros in the input data of the multiplier.
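As a rough illustration of why the zero count matters, the following behavioural sketch in Python multiplies two unsigned operands row by row and skips every row whose multiplier bit b_j is 0. It is not the gate-level circuit of Fig. 4; the function name `row_bypass_multiply` and the returned count of skipped rows are assumptions introduced here for illustration only.

```python
def row_bypass_multiply(a: int, b: int, n: int = 4):
    """Multiply two unsigned n-bit integers row by row.

    Rows whose multiplier bit b_j is 0 are skipped, mimicking row bypassing.
    Returns (product, number_of_skipped_rows). Behavioural model only.
    """
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must be unsigned n-bit values")
    product = 0
    active_rows = 0
    for j in range(n):
        if (b >> j) & 1:
            # Row j contributes the partial product a * 2**j.
            product += a << j
            active_rows += 1
        # If b_j is 0 the partial product is zero, so the running sum
        # passes through this row unchanged (the row is bypassed).
    return product, n - active_rows


product, skipped = row_bypass_multiply(0b1111, 0b1001)
print(product, skipped)  # prints "135 2"
```

For 1111 × 1001 the sketch returns the product 135 and reports two skipped rows, because b1 = b2 = 0. The more zeros the multiplier operand contains, the more rows stay idle.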


For uniformly distributed data, 50 percent of the multiplier bits are zero. However, the input data of actual multiplier applications, such as Adaptive Differential Pulse Code Modulation (ADPCM), the G.723.1 speech coder, and wavelet-based image coders, can be analyzed to illustrate the power reduction efficiency of the bypassing multiplier. The input data of these applications are taken from the analysis of effective dynamic range presented in [25]. The ADPCM data are recorded from a 0.125 s audio signal that is used in the multiplications for low and high pass band splitting. For the G.723.1 speech coder, 0.05 s of speech data sampled at 8 kHz is used in the multiplications for the autocorrelation of linear prediction coding. In the last application, one fortieth of the multiplications for a 512 × 512 pixel image are performed for low and high pass filtering.

The original data are fed into a 16 × 16 multiplier. The histograms of the effective dynamic range of the input data show the probability of each input vector distribution in terms of the effective bit number. The zero bits can be used in the bypassing multiplier to disable devices. The effective and non-effective bit numbers of these three applications are shown in Table 1 for multiplicand X and multiplier Y.

Table 1
Probability of effective bit number of three applications: (A) ADPCM audio coder; (B) G.723.1 speech coder; and (C) wavelet-based image coder.

Effective dynamic   Case (A)          Case (B)          Case (C)
range (X × Y)       X (%)    Y (%)    X (%)    Y (%)    X (%)    Y (%)
−16                 0        0        0        0        0        0
−15                 2        0        0        0        0        0
−14                 10       0        1        1        2        0
−13                 9        5        2        3        5        0
−12                 6        0        18       5        9        0
−11                 5        10       12       4        5        0
−10                 2        0        6        4        4        0
−9                  2        0        4        3        3        0
−8                  3        0        2        2        3        0
−7                  3        10       1        1        2        6
−6                  3        0        1        0        1        0
−5                  2        0        0        0        1        0
−4                  0        0        0        0        1        11
−3                  0        0        0        0        0        11
−2                  0        0        0        0        1        0
−1                  0        0        0        0        1        11
1                   2        37       0        38       12       11
2                   0        0        0        0        1        0
3                   0        0        0        0        1        23
4                   0        0        0        0        2        0
5                   2        5        1        0        1        0
6                   4        0        2        0        1        22
7                   2        5        2        0        2        5
8                   2        5        2        1        3        0
9                   2        5        4        3        4        0
10                  2        0        6        4        8        0
11                  5        0        11       6        9        0
12                  6        5        10       5        11       0
13                  11       5        9        2        6        0
14                  13       0        3        5        1        0
15                  2        5        3        13       0        0
16                  0        0        0        0        0        0

The probability of each dynamic range is calculated from the actual input vectors. To estimate the number of zeros within the effective data bits, it is assumed that 50 percent of the effective bits are zero. Since the effective range runs from 1 to 16 bits, the remaining non-effective bits number from 15 down to 0. These non-effective bits are all zeros, since they do not represent any value. Based on these assumptions, the probability of zeros can be estimated by the following equation:

$$\sum_{i=1}^{n} \mathrm{prob}(D_i) \times \left(\frac{i}{n} \times 50\% + \frac{n-i}{n}\right) + \sum_{i=1}^{n} \mathrm{prob}(D_{-i}) \times \left(\frac{i}{n} \times 50\% + \frac{n-i}{n}\right) \qquad (3)$$

where n is the number of bits in multiplicand X and multiplier Y, $D_i$ is the effective data with dynamic range i, and prob is the probability of the specified effective data from Table 1.

Fig. 5. Probability of effective dynamic data from 1 to 16 bit. (Plot of probability versus effective dynamic data range.)

Table 2
Probability of zeros in three different applications: (A) ADPCM audio coder; (B) G.723.1 speech coder; and (C) wavelet-based image coder.

Operand                     Probability of zeros (%)
Case (A) Multiplicand X     65.28
Case (A) Multiplier Y       77.41
Case (B) Multiplicand X     66.91
Case (B) Multiplier Y       75.72
Case (C) Multiplicand X     72.13
Case (C) Multiplier Y       88.22

The probability of zeros as a function of the effective data range is shown in Fig. 5. From this figure, we can conclude that the fewer the effective data bits, the higher the probability of zeros, which is reflected in most input vectors of the different applications. Most data fall in the lower and middle effective data ranges. Based on Eq. (3), the estimated probability of zeros in these three applications is calculated and summarized in Table 2. The probability of zeros for both multiplicand X and multiplier Y is over 65 percent, well above the 50 percent of a uniform distribution. In the wavelet-based image coder, the zero probability of multiplier Y even reaches 88 percent. Therefore, power consumption can be reduced significantly by adopting the bypassing method. In addition, multiplier Y has a larger probability of zeros than multiplicand X. Therefore, the operand with the larger probability of zeros can be used to control bypassing.

3. The proposed low power and high speed multiplier with row bypassing and parallel architecture

3.1. Unsigned bypassing multiplier design

The array multiplier is composed of rows of adders as shown in Fig. 2. The sum and carry signals generated by the previous row are fed into two of the three inputs of the current row. Power consumption is lower if the transitions of these input signals are less frequent. As shown in Table 2, the average zero probability of input signals on different DSP applications is over


73.8 percent. Therefore, the most effective way to reduce the power of an array-based multiplier is to suppress transitions in the adders. The operating principle of the bypassing multiplier is discussed in Section 2.3. The CSA-based bypassing multiplier saves some power. However, the circuit implementation of the CSA-based multiplier shown in Figs. 3 and 4 is complicated, and the additional circuits required by the bypassing method can degrade the operating speed of the multiplier. As mentioned in Section 2.1, a CSA-based multiplier achieves faster operation than an RCA-based multiplier. However, its hardware cost is 50 percent higher than that of a conventional array multiplier [20]. The proposed multiplier adopts the ripple-carry adder with fewer hardware components and a parallel architecture. The new bypassing architecture is proposed to enhance the operating speed and reduce the power consumption of the ripple-carry adder at the same time.

An RCA adder with bypassing ability is adopted in each row of adders. The reason for adopting the RCA adder instead of the CSA adder is to achieve a parallel architecture. Two tri-state buffers are placed at two inputs of the full adder to disable the full adder when b_j is 0. The tri-state buffer is implemented with a transmission gate (TG). The multiplexer is placed at the sum output of the full adder, so the sum is selected from either the bypassed value or the sum output of the full adder according to the value of b_j. The proposed design needs neither a multiplexer for the carry output nor a tri-state buffer for the carry input of the full adder. The reason is that two inputs of each full adder in the jth row are disabled while b_j is 0, so the carry outputs of the full adders in that row cannot change, since two of the three inputs of each full adder are disabled. The full adder therefore needs only two tri-state buffers and one multiplexer. Moreover, an AND gate is inserted at the last carry output in each row of full adders to correct the output when b_j is 0. Therefore, a significant portion of the extra hardware is saved without degrading speed. In addition, power consumption is reduced as a result of reduced hardware activity. Fig. 6 shows the proposed full adder, and Fig. 7 shows the proposed 4 × 4 multiplier based on the modified RCA full adder. The proposed RCA full adder needs only two tri-state buffers and one multiplexer, whereas the full adder design in [20] needs three tri-state buffers and two multiplexers. The proposed design therefore reduces hardware area.

Fig. 6. Proposed RCA with row bypassing technique.

Fig. 7. A 4 × 4 row bypassing multiplier based on RCA.

A multiplication test vector of 1111 × 1001 is applied to the proposed design, as shown in Fig. 8. The values beside the arrows indicate the sum or carry bits. In this example, the partial products to be summed in the first and second rows of adders are all zero because b1 = b2 = 0. The sum outputs of those rows therefore equal the results from the previous row of adders. It is noteworthy that the output carry bit of each full adder is zero in the same row and


carry signal propagates in the same direction. Thus, all carry signals passed to the next full adder in the jth row are zero when b_j is 0.

Fig. 8. An example for 4 × 4 multiplier with RCA (1111 × 1001 = 10000111).

Fig. 9. An 8 × 4 row bypassing multiplier based on RCA.

Besides the above-mentioned method, the proposed multiplier also adopts a parallel architecture to shorten the delay time. For example, an 8 × 8 multiplication can be built from two 8 × 4 bypassing multipliers based on RCA, one of which is shown in Fig. 9. The partial sums and carry outputs from these two 8 × 4 multipliers are computed simultaneously. Note that the final stage adders consist of RCA adders on both sides and CSA adders in the middle. In this configuration, the parallelism of the proposed multiplier is established, and the delay time of the RCA multiplier is shortened. The final proposed multiplier is shown in Fig. 10. Less extra hardware is used than in [20]. The proposed multiplier needs ((5/2)N − 3) full-adder delays in the worst case for an N × N multiplier design. The proposed parallel architecture is not suitable for the CSA-based Braun multiplier and [20]. A CSA-based multiplier cannot be decomposed into two


parallel 8 × 4 multipliers because the inputs of the current row of CSA adders come from the upper row; the 16 × 16 signed multiplier can be designed by a similar procedure.

Fig. 10. An 8 × 8 row bypassing multiplier based on RCA.

3.2. Signed bypassing multiplier design

The multiplier introduced in the previous section computes unsigned numbers. However, it is essential to design signed multipliers, because computer systems usually manipulate signed numbers. Several signed multiplication algorithms are proposed in [26]. In conventional array multipliers such as the Braun multiplier, signed multipliers can be realized through the Baugh–Wooley multiplication algorithm [26], which is often used for signed multiplication. The algorithm uses 2's complement to represent signed numbers and uses the same framework as the array multiplier. Its advantage is that it accomplishes signed multiplication without extending sign bits. Consequently, the hardware cost does not increase and no extra power is dissipated. Only the AND gates for the corresponding operands are changed to NAND gates, and an inverter is inserted at the final carry output. Fig. 11 shows the architecture of a 4 × 4 signed Braun multiplier [26].

Fig. 11. A 4 × 4 signed Braun multiplier.
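The NAND substitution can be checked numerically. The sketch below is a behavioural Python model, assuming 4-bit two's-complement operands and the standard Baugh–Wooley arrangement (NAND for the partial products that involve exactly one sign bit, plus the constants 2^4 and 2^7). The helper names `to_signed` and `baugh_wooley_4x4` are hypothetical and are not taken from [26].

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def baugh_wooley_4x4(a: int, b: int) -> int:
    """Multiply two 4-bit two's-complement bit patterns with AND/NAND terms.

    a and b are raw bit patterns in 0..15; the result is a signed integer.
    """
    abit = [(a >> i) & 1 for i in range(4)]
    bbit = [(b >> i) & 1 for i in range(4)]
    total = 0
    # Magnitude part: ordinary AND partial products of the three low bits.
    for i in range(3):
        for j in range(3):
            total += (abit[i] & bbit[j]) << (i + j)
    # Partial products with exactly one sign bit use NAND instead of AND.
    for k in range(3):
        total += (1 - (abit[3] & bbit[k])) << (k + 3)
        total += (1 - (bbit[3] & abit[k])) << (k + 3)
    # The sign-by-sign term stays an AND gate.
    total += (abit[3] & bbit[3]) << 6
    # Correction constants; +2**7 equals -2**7 modulo 2**8.
    total += (1 << 4) + (1 << 7)
    return to_signed(total, 8)


# Exhaustive check against ordinary signed multiplication.
assert all(
    baugh_wooley_4x4(a, b) == to_signed(a, 4) * to_signed(b, 4)
    for a in range(16)
    for b in range(16)
)
```

The exhaustive check passes for all 256 operand pairs, which suggests that no sign extension is needed once the NAND terms and the two constants are in place.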


For example, two signed 4-bit binary numbers $A = a_3a_2a_1a_0$ and $B = b_3b_2b_1b_0$ generate a product P, which can be defined as follows:

$$P = -1 \times 2^7 + a_3 b_3 2^6 + (\overline{a_3 b_2} + \overline{b_3 a_2})2^5 + (\overline{a_3 b_1} + \overline{b_3 a_1} + 1)2^4 + (\overline{a_3 b_0} + \overline{b_3 a_0})2^3 + (b_2 2^2 + b_1 2^1 + b_0)(a_2 2^2 + a_1 2^1 + a_0) \qquad (4)$$

where the overbars denote the NAND terms introduced by the algorithm.

Next, the same algorithm is used to design the proposed bypassing multiplier with signed operands. The bypassing multiplier of [20] could also use the Baugh–Wooley multiplication algorithm [26] to realize a signed bypassing multiplier. According to the algorithm, some AND gates of the original design in [20] must be changed to NAND gates for the corresponding operands. However, general CSA adders would be used instead of the modified full adders shown in Fig. 3 for the last row of operands. The reason is as follows. Adders are disabled only when the operand is zero, and the probability that the output $\overline{AB}$ of a 2-input NAND gate is zero is only 25 percent. If the adder shown in Fig. 3 were used for this row, additional logic would have to be added, and the power consumed by this additional logic may be large. In other words, adders in this row may dissipate more power most of the time. Consequently, general CSA adders are used for this row because they do not dissipate power in additional logic. Since the Baugh–Wooley multiplication algorithm [26] has to add one in the final step, one additional row of adders for carry propagation is placed in the last row of the signed bypassing multiplier. The whole circuit architecture of a 4 × 4 signed bypassing multiplier is shown in Fig. 12.

Fig. 12. A 4 × 4 signed bypassing multiplier.

The proposed multiplier also adopts the Baugh–Wooley algorithm [26] for signed multiplication. In an 8 × 8 signed multiplication [26], all operands are separated into two parts. Full adders are used to compute the last row of operands according to the analysis in the previous paragraph. Therefore, two different 8 × 4 bit signed ripple-carry array multipliers, Blocks 1 and 2, need to be designed. Block 1, shown in Fig. 13, deals with the upper part of the operands and is similar to the multiplier shown in Fig. 9 except for some changes to the gates. Similarly, Block 2, shown in Fig. 14, deals with the lower part of the operands. The two blocks differ in hardware: Block 1 uses the proposed full adder shown in Fig. 6 to compute all operands, whereas Block 2 differs only in the last row of operands, which is computed with ripple-carry adders. Since the proposed multiplier needs no additional full adder to correct the multiplication, the addition in the final step is computed without adding other full adders. Thus, the hardware requirement of the proposed signed multiplier is lower than that of the signed bypassing multiplier. Finally, the two blocks are combined and an inverter is placed at the carry output. The proposed signed multiplier is shown in Fig. 15. The 16 × 16 signed multiplier can be designed by a similar procedure.
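The two-block organization can be sketched behaviourally. The Python model below assumes the 8-bit multiplier operand is split into an unsigned lower nibble and a signed upper nibble, forms the two 8 × 4 products independently, and combines them with a 4-bit shift. The function names and the way the halves are mapped onto code are assumptions for illustration; the sketch does not model the gate-level Blocks 1 and 2.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def split_multiply_signed_8x8(a: int, b: int) -> int:
    """Signed 8 x 8 multiply built from two independent 8 x 4 products.

    a and b are ordinary Python integers in the range -128..127.
    """
    if not (-128 <= a <= 127 and -128 <= b <= 127):
        raise ValueError("operands must fit in 8-bit two's complement")
    b_pattern = b & 0xFF
    b_low = b_pattern & 0xF                # lower nibble, unsigned weight
    b_high = to_signed(b_pattern >> 4, 4)  # upper nibble carries the sign
    # The two products do not depend on each other, so hardware could
    # evaluate them at the same time.
    low_product = a * b_low
    high_product = a * b_high
    # Final stage: align the upper product by four bit positions and add.
    return low_product + (high_product << 4)


print(split_multiply_signed_8x8(-100, 77))  # prints "-7700"
```

Because b = 16 · b_high + b_low, the sum of the shifted upper product and the lower product equals a · b for every operand pair, which is why the two halves can be evaluated in parallel before a single final addition.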


Fig. 13. An 8 × 4 signed RCA multiplier with row bypassing, block 1.

Fig. 14. An 8 × 4 signed RCA multiplier with row bypassing, block 2.

4. Simulation results and performance comparisons

In this section, the performance of the proposed multiplier is evaluated and compared with that of the conventional Braun multiplier. The performance measures include power consumption, delay, power-delay product, and layout area. These circuits are designed at the transistor level without using any standard cell from the technology library. Post-layout simulations are performed with standard TSMC 0.18 μm CMOS technology and a 1.8 V supply voltage using Cadence Spectre simulation tools. The design and simulation flow is shown in Fig. 16. In the design process, the multiplier design was constructed at the circuit level in the Cadence

Please cite this article as: K.-C. Kuo, C.-W. Chou, Low power and high speed multiplier design with row bypassing and parallel
architecture, Microelectron. J (2010), doi:10.1016/j.mejo.2010.06.009
10 K.-C. Kuo, C.-W. Chou / Microelectronics Journal ] (]]]]) ]]]–]]]

Fig. 15. The proposed 8  8 signed RCA multiplier.

power delay product. Table 3 shows the power consumption of


the proposed and above-mentioned multipliers. The simulation
results of delay and power-delay product for Braun, [20], and for
the proposed multiplier are shown in Tables 4 and 5, respectively.
Table 6 shows the layout area of proposed multiplier, Braun
multiplier, and [20]. For a 16  16 multiplier, the proposed design
achieves 17 and 36 percent reduction in power consumption
and delay, respectively, at the cost of 20 percent increase in chip
area in comparison with those of conventional array multiplier. In
addition, the proposed design achieves averages of 11 and 38
percent reduction in power consumption and delay, respectively,
with 46 percent less chip area in comparison with that of
counterpart [20] for both unsigned and signed multipliers. From
these simulation results, it is evident that the proposed design
outperforms the other counterparts in terms of power, delay, and
power delay product at the cost of an average of 24 percent area
overhead. In the proposed multiplier, it can achieve more power
Fig. 16. Simulation flow.
savings if the probability of zero is greater than the probability of
one in the operand of multiplier and can be confirmed in Section 2.4.
Table 7 shows the performance comparison of the proposed
design environments. The power consumption and speed of multiplier design with results from other recent published papers.
the proposed design are obtained by simulation. After verifying The designs of these multipliers are ROM based and low power
the circuit level, the proposed unsigned and signed multipliers are bypassing. The ROM based multiplier achieves low power by using
converted to layouts with the Cadence Virtuoso Layout Editor. The single transistor ROM cell that eliminates identical rows and
layouts are verified through Cadence DRC and LVS tool and finally columns [27]. The other bypassing method use additional logic
the layouts are extracted with the Cadence LPE tool. An example implemented in the adder to skip the redundant signal transitions
layout of the proposed 16  16 signed multiplier is shown in [28]. Both of these designs adopt the principle of reducing switching
Fig. 17. To evaluate the proposed method, two different sizes, activity to lower the power consumption. The comparison is based
8  8 and 16  16, of multiplier are simulated and 20 test patterns on the performance of the multiplier provided in [27,28]. Power
are generated randomly for both 8  8 and 16  16 multipliers to consumption, delay, and power-delay product of the proposed
evaluate the performance. The test patterns are randomly design are the best among these designs.
generated with uniformly distributed probability, and the post
layout simulations are performed for both unsigned and signed
multipliers in order to verify the feasibility of the proposed 5. Conclusions
design.
The performance comparisons of the proposed design and A low power and high speed CMOS array multiplier is
other counterparts for both unsigned and signed multipliers are presented. The proposed multiplier reduces power consumption
listed in Tables 3–5 in terms of power consumption, delay, and by disabling adders resided in the multiplier when inputs are at


Fig. 17. The layout of the proposed 16 × 16 signed multiplier.
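The figures reported in Tables 3 to 5 can be cross-checked with a few lines of arithmetic. The sketch below copies the 16 × 16 power and delay values for the Braun and proposed designs from Tables 3 and 4, recomputes the power-delay product (mW × ns = pJ), and derives the percentage reductions. Averaging the unsigned and signed reductions is an assumption about how the quoted 17 and 36 percent figures were obtained; the helper names are hypothetical.

```python
# Power (mW) and delay (ns) for the 16 x 16 designs, copied from Tables 3 and 4.
POWER_MW = {
    ("Braun", "unsigned"): 11.050,
    ("Proposed", "unsigned"): 9.111,
    ("Braun", "signed"): 11.532,
    ("Proposed", "signed"): 9.619,
}
DELAY_NS = {
    ("Braun", "unsigned"): 7.584,
    ("Proposed", "unsigned"): 4.713,
    ("Braun", "signed"): 8.104,
    ("Proposed", "signed"): 5.334,
}


def pdp_pj(design, kind):
    """Power-delay product in pJ (1 mW * 1 ns = 1 pJ)."""
    return POWER_MW[(design, kind)] * DELAY_NS[(design, kind)]


def reduction_percent(table, kind):
    """Percentage by which the proposed design improves on the Braun multiplier."""
    return 100.0 * (1.0 - table[("Proposed", kind)] / table[("Braun", kind)])


def mean_reduction_percent(table):
    """Average of the unsigned and signed reductions (an assumed aggregation)."""
    return (reduction_percent(table, "unsigned") + reduction_percent(table, "signed")) / 2


if __name__ == "__main__":
    for kind in ("unsigned", "signed"):
        print(kind, round(pdp_pj("Braun", kind), 3), round(pdp_pj("Proposed", kind), 3))
    print("power reduction (%):", round(mean_reduction_percent(POWER_MW), 1))
    print("delay reduction (%):", round(mean_reduction_percent(DELAY_NS), 1))
```

Under this aggregation the recomputed values agree with the corresponding 16 × 16 entries of Table 5 to within rounding.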

Table 3
Power consumption (in mW) and power saving.

Design                 8 × 8     Ratio   16 × 16   Ratio
Braun (unsigned)       2.413     1.00    11.050    1.00
[10] (unsigned)        2.238     0.93    9.561     0.87
Proposed (unsigned)    2.144     0.89    9.111     0.82
Braun (signed)         2.867     1.00    11.532    1.00
[10] (signed)          2.991     1.04    10.671    0.93
Proposed (signed)      2.445     0.85    9.619     0.83

Table 4
Delay (in ns) and improvement.

Design                 8 × 8     Ratio   16 × 16   Ratio
Braun (unsigned)       3.504     1.00    7.584     1.00
[10] (unsigned)        3.188     0.91    6.537     0.86
Proposed (unsigned)    2.243     0.64    4.713     0.62
Braun (signed)         3.713     1.00    8.104     1.00
[10] (signed)          3.472     0.93    7.238     0.89
Proposed (signed)      2.543     0.69    5.334     0.66

Table 5
Power-delay product (10⁻¹² J) and improvement.

Design                 8 × 8     Ratio   16 × 16   Ratio
Braun (unsigned)       8.455     1.00    83.803    1.00
[10] (unsigned)        7.135     0.84    62.844    0.75
Proposed (unsigned)    4.808     0.57    42.940    0.51
Braun (signed)         10.645    1.00    93.455    1.00
[10] (signed)          10.384    0.97    77.236    0.82
Proposed (signed)      6.217     0.58    51.307    0.55

Table 6
Total area (in μm²) and area overhead.

Design                 8 × 8     Ratio   16 × 16   Ratio
Braun (unsigned)       73524     1.00    307585    1.00
[10] (unsigned)        132342    1.80    538879    1.75
Proposed (unsigned)    92449     1.26    367908    1.20
Braun (signed)         73524     1.00    307585    1.00
[10] (signed)          139604    1.90    553681    1.80
Proposed (signed)      93592     1.27    372177    1.21

Table 7
Performance comparison of recent published papers.

Design                                       Power (mW)   Ratio   Delay (ns)   Ratio   Power-delay product (pJ)   Ratio
[27] ROM based 16 × 16 multiplier            13.50        1.48    5.55         1.18    74.9                       1.75
[28] low power bypassing 8 × 8 multiplier    16.30        1.79    14.28        3.03    232.7                      5.42
Proposed bypassing 16 × 16 multiplier        9.11         1.00    4.71         1.00    42.9                       1.00

zeros. The delay time of the multiplier is also shortened by adopting a parallel architecture. In order to validate the effectiveness of the proposed design, power consumption and delay are evaluated by Cadence Spectre post-layout simulation with standard TSMC 0.18 μm CMOS technology. Simulation results show that the proposed design can achieve greater power efficiency with less extra hardware and a lower power-delay product than the different counterparts. For a 16 × 16 multiplier, the proposed design achieves 17 and 36 percent reduction in power consumption and delay, respectively, at the cost of a 20 percent increase in chip area in comparison with the conventional array multiplier. In addition, the proposed design achieves 11 and 38 percent reduction in power consumption and delay, respectively, with 46 percent less chip area in comparison with those in [20]. The test patterns are randomly generated with uniformly distributed probability. As mentioned in Section 2.4, the average zero input of operand in the multiplier for typical DSP applications is 73.8 percent. Therefore, the proposed multiplier can achieve even greater power saving if the probability of zero in the inputs of the multiplier is larger than 0.5. Compared to other recently published papers [27,28], the proposed bypassing multiplier achieves the lowest values of power, delay, and power-delay product. Hence,


the proposed design achieves the goal of low power and high speed performance at the same time.

Acknowledgements

The authors would like to acknowledge the financial support of the National Science Council, Taiwan, Republic of China, under grant number NSC96-2220-E-110-008. The authors would like to express their greatest thanks to CIC (Chip Implementation Center) of NARL (National Applied Research Laboratories), Taiwan, for their thoughtful chip fabrication service.

References

[1] W.C. Yeh, C.W. Jen, High-speed and low-power split-radix FFT, IEEE Transactions on Signal Processing 51 (2003) 864–874.
[2] C.S. Wallace, A suggestion for a fast multiplier, IEEE Transactions on Computer 13 (1964) 14–17.
[3] V.G. Oklobdzija, D. Villeger, S.S. Liu, A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach, IEEE Transaction on Computer 45 (1996) 294–306.
[4] B. Parhami, in: Computer Arithmetic, Algorithms, and Hardware Design, Oxford University Press, New York, 2000.
[5] K.Z. Pekmestzi, Multiplexer-based array multiplier, IEEE Transactions on Computers 48 (1999) 15–23.
[6] P.C.H. Meier, R.A. Rutenbar, L.R. Carley, Exploring multiplier architecture and layout for low power, in: Proceedings of the IEEE Custom Integrated Circuits Conference, 1996, pp. 513–516.
[7] K.S. Chong, B.H. Gwee, J.S. Chang, A micropower low-voltage multiplier with reduced spurious switching, IEEE Transactions on Very Large Scale Integrated Systems 13 (2005) 255–265.
[8] C.H. Han, H.J. Park, L.S. Kim, A low-power array multiplier using separated multiplication technique, IEEE Transactions on Circuits and Systems-II, Analog and Digital Signal Processing 48 (2001) 866–871.
[9] E. Abu-Shama, M.B. Maaz, M.A. Bayoumi, A fast and low power multiplier architecture, in: Proceedings of the IEEE Midwest Symposium on Circuits and Systems, 1996, pp. 53–56.
[10] R. Mudassir, H. El-Razouk, Z. Abid, New designs of signed multiplier, in: Proceedings of the IEEE Northeast Workshop on Circuits and Systems, 2005, pp. 259–262.
[11] S. Mahant-Shetti, P. Balsara, C. Lemonds, High performance low power array multiplier using temporal tiling, IEEE Transactions on Very Large Scale Integrated Systems 7 (1999) 121–124.
[12] A.A. Fayed, M.A. Bayoumi, A novel architecture for low-power design of parallel multipliers, in: Proceedings of the IEEE Computer Society Workshop on VLSI, 2001, pp. 149–154.
[13] M. Ito, D. Chinnery, K. Keutzer, Low power multiplication algorithm for switching activity reduction through operand decomposition, in: Proceedings of the 21st International Conference on Computer Design, 2003, pp. 21–26.
[14] N. Honarmand, M.R. Javaheri, N. Sedaghati-Mokhtari, A. Afzali-Kusha, Power efficient sequential multiplication using pre-computation, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 2006, pp. 2709–2712.
[15] L.H. Chen, O.T.-C. Chen, T.Y. Wang, Y.C. Ma, Multiplication-accumulation computation unit with optimized compressors and minimized switching activities, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 2005, pp. 6118–6121.
[16] C. Senthilpari, A.K. Singh, K. Diwakar, Design of a low-power, high performance, 8 × 8 bit multiplier using a Shannon-based adder cell, Microelectronics Journal 39 (2008) 812–821.
[17] Z. Abid, H. El-Razouk, D.A. El-Dib, Low power multipliers based on new hybrid full adders, Microelectronics Journal 39 (2008) 1509–1515.
[18] K. Navi, V. Foroutan, M. Rahimi Azghadi, M. Maeen, M. Ebrahimpour, M. Kaveh, O. Kavehei, A novel low-power full-adder cell with new technique in designing logical gates based on static CMOS inverter, Microelectronics Journal 40 (2009) 1441–1448.
[19] S. Hong, S. Kim, M.C. Papaefthymiou, W.E. Stark, Low power parallel multiplier design for DSP applications through coefficient optimization, in: Proceedings of the IEEE International ASIC/SOC Conference, 1999, pp. 286–290.
[20] J. Ohban, V.G. Moshnyaga, K. Inoue, Multiplier energy reduction through bypassing of partial products, in: Proceedings of the IEEE Asia-Pacific Conference on Circuits and Systems, 2002, pp. 13–17.
[21] A.P. Chandrakasan, S. Sheng, R. Brodersen, Low-power CMOS digital design, IEEE Journal of Solid-State Circuits 27 (1992) 473–484.
[22] M. Psilogeorgopoulos, M. Munteanu, T.-S. Chuang, P.A. Ivey, L. Seed, Contemporary techniques for lower power circuit design, PREST Deliverable D2.1, The Department of Electronic and Electrical Engineering, The University of Sheffield, Mappin Street, Sheffield S1 3JD, UK, 1998, pp. 1–91 <http://www.engr.newpaltz.edu/~damu/spring_2008/resource/cont_tech.pdf>.
[23] N.S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J.S. Hu, M.J. Irwin, M. Kandemir, V. Narayanan, Leakage current: Moore's law meets static power, IEEE Computer 36 (2003) 68–75.
[24] J. Kao, S. Narendra, A. Chandrakasan, Subthreshold leakage modeling and reduction techniques, in: Proceedings of the IEEE/ACM International Conference on Computer Aided Design, 2002, pp. 141–148.
[25] O.T.C. Chen, S. Wang, Y.W. Wu, Minimization of switching activities of partial product for designing low-power multipliers, IEEE Transaction on Very Large Scale Integrated Systems 11 (2003) 418–433.
[26] R. Mudassir, H. El-Razouk, Z. Abid, New designs of signed multiplier, in: Proceedings of the IEEE International NEWCAS Conference, 2005, pp. 259–262.
[27] B.C. Paul, S.F. Fujita, M. Okajima, ROM-based logic (RBL) design: a low-power 16 bit multiplier, IEEE Journal of Solid-State Circuits 44 (2009) 2935–2942.
[28] C.C. Wang, G.N. Sung, Low-power multiplier design using a bypassing technique, Journal of Signal Processing Systems 57 (2009) 331–338.
