
A 64x64-bit Modified Booth Multiplier Utilizing Multiplexer-Select Booth Encoder


Xinyu Wu, Chi Huang, Jinmei Lai and Chenshou Sun
State Key Lab of ASIC and Systems
Fudan University, Shanghai 200433, P.R.China

Abstract-In this paper, we describe a 64x64-bit high-performance
multiplier based on multiplexer cells implemented with
pass-transistor logic. A multiplexer-select Booth encoder was
developed to increase speed and reduce hardware cost. Moreover,
a partitioned method was introduced in the design to shorten
the propagation time of the final adder. Realistic simulation
using timing parameters extracted from the layout shows that
the propagation time of the critical path is 2.82 ns at 1.8 V
in 0.18 µm CMOS technology.
I. INTRODUCTION

High-performance digital multipliers are key components of modern
microprocessors and digital signal processors. Over the past ten
years, much effort has been devoted to techniques for their
construction: the Booth algorithm, pass-transistor logic (PTL),
the Wallace tree, and the carry look-ahead adder have been
proposed. The Booth algorithm halves the number of partial
products, while PTL contributes high-speed, low-power macro-cells.
To implement these structures, both full-custom and standard
cell-based design methods have been applied [1]-[6]. In particular,
Booth selectors take up about one-third of the area of a multiplier.
Inoue's multiplier saves transistors by utilizing a sign-select
Booth encoding algorithm [3]; however, it incurs more delay than
some other multipliers, such as Cho's work [6].

This paper describes a high-speed multiplier based on PTL
macro-cells, in which a new multiplexer-select Booth encoder is
proposed. We perform the Booth encoding by using the relationship
among the three input signals of a multiplexer (MUX) as the encoded
outputs, which we call MUX-select Booth encoding in this paper.
With this scheme, we simplify the Booth encoder and reduce the
delay while keeping the hardware cost of the selector at two
multiplexers. As a result, only 2 multiplexers are needed per bit,
compared with 3 multiplexers in Cho's work [6], and only 3 gate
delays occur on the critical path, compared with 4 in Inoue's work
[3]. Moreover, we implemented unbalanced buffering, compression
look-ahead and a partitioned final adder to speed up the
multiplication further. Finally, based on pass-transistor logic and
dual-rail multiplexers, the layout area of the 0.18 µm 64-bit CMOS
parallel multiplier is 1.02 x 1.02 mm^2. Simulation using timing
parameters extracted from the layout shows that the propagation
time of the critical path is 2.82 ns at 1.8 V.
The algorithm of the MUX-select Booth encoder is introduced in
detail in Section II. The circuit design of the Wallace tree and of
the partitioned final adder is described in Sections III and IV.
Simulation results and comparisons between the proposed and
conventional design methods are given in Section V, followed by the
conclusion.

0-7803-9210-8/05/$20.00 2005 IEEE

II. MUX-SELECT BOOTH ENCODER ALGORITHM

The proposed multiplier employs a Booth encoder block, compression
blocks, and an adder block; this architecture is the same as that
of conventional multipliers.
The Booth algorithm significantly reduces the number of partial
products and of carry-save adders; the modified Booth encoding
algorithm reduces them by half. Booth recoding is performed in two
steps: encoding and selection.

In a classic Booth encoder, the three adjacent multiplier bits
y_{i-1}, y_i and y_{i+1} are used to generate the signals that
select a partial product, that is, one of 0, ±X and ±2X. Here X is
the multiplicand, m bits wide. The sht and dir signals indicate
whether the partial product is doubled and the sign of the partial
product, respectively. Thus, the multiple, either X or 2X, is
selected depending on whether the encoded signal sht is high, and
the direction signal dir determines which of X and -X is taken. In
this algorithm, an XOR circuit is used to determine whether the
partial product is inverted or not. In a static CMOS implementation,
a Booth encoder and a Booth selector require 32 and 18 transistors
respectively.

The multiplier published by Inoue uses a different design [3]. In
Inoue's work, two bits are used to generate the sign of the partial
product: Mj (for negative) and PLj (for positive). The modified
Booth selector requires 10 transistors per bit, compared with 18
transistors per bit in the conventional Booth selector. Booth
selectors account for 35% of the transistors in a 54x54-bit
multiplier, compared with only 1.2% for Booth encoders. The
reduction in the number of transistors in the Booth selectors thus
has a greater impact on the total transistor count.
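As a reference point for the encoders discussed in this section, the radix-4 (modified) Booth recoding that they all implement can be sketched behaviorally in Python. This is only an illustration: the names booth_digits and booth_multiply are hypothetical and do not come from the paper, the operands are assumed to be n-bit two's-complement integers with n even, and the sht and dir signals are not modeled at gate level.

```python
def booth_digits(y, n):
    """Radix-4 Booth digits of an n-bit two's-complement multiplier y.

    Digit i is derived from bits y[2i-1], y[2i], y[2i+1] (with y[-1] = 0)
    and takes a value in {-2, -1, 0, +1, +2}.
    """
    assert n % 2 == 0, "sketch assumes an even word length"
    bit = lambda k: (y >> k) & 1 if k >= 0 else 0
    digits = []
    for i in range(0, n, 2):
        digits.append(bit(i - 1) + bit(i) - 2 * bit(i + 1))
    return digits


def booth_multiply(x, y, n):
    """Multiply two n-bit signed integers by summing n/2 partial products."""
    product = 0
    for i, d in enumerate(booth_digits(y, n)):
        product += (d * x) << (2 * i)   # partial product d*X, weighted by 4**i
    return product
```

Under these assumptions a signed 64-bit multiplier yields 32 digits; the additional row that brings the array to the 33 rows quoted in the text is not modeled in this sketch.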
In fact, Inoue's work uses the multiplicand as the select signal in
the selector, which is very different from the conventional method,
which uses the encoded signals as the select signals. However, the
encoded signals run through the two multiplexers in series, so the
delays accumulate completely and five gates are needed on the
critical path. Here, we add extra control signals to represent the
MUX relationship among the three adjacent multiplier bits y_{i-1},
y_i and y_{i+1}. The proposed MUX-select Booth encoder is shown in
Table 1 and Fig. 1. Two signals, a0 and a1, are added for this
purpose, while the sht signal, as in the usual encoder, selects
between X and 2X. a0 and a1 can then be configured to select among
±0, ±X and ±2X. One configuration of sht, a0 and a1 is

\[
\begin{aligned}
\mathit{sht}_i &= \overline{y_{i-1} \oplus y_i} \\
a_{0,i} &= y_{i+1} \\
a_{1,i} &= (y_{i-1} \oplus y_i)\,\overline{y_{i+1}} + \overline{(y_{i-1} \oplus y_i)}\,y_i
\end{aligned}
\]

where a0 indicates the sign produced by the Booth encoder, and a1
becomes available when the partial product is negative but not 0.
As shown in Fig. 1, only two multiplexers are needed in the
selector, while an XOR gate and a multiplexer are employed in the
Booth encoder.
Using these encoded signals, we simplify the partial product
generation logic. For example, when +2X is required for the partial
product, the sht signal is active and the first multiplexer chooses
2X; a0 = "0" and a1 = "1" keep the sign of the partial product
positive. When -X is required, sht again makes the first multiplexer
choose the correct multiple, but this time a0 = "1" and a1 = "0"
make the sign of the partial product negative, and so on.
Furthermore, for a "0" multiple both inputs of the second
multiplexer are the same, so sht has no effect.

TABLE 1
TRUTH TABLE FOR MUX-SELECT BOOTH ENCODING

y_{i+1}  y_i  y_{i-1}  pp_i  sht  a0  a1
   0      0      0     +0     x    0   0
   0      0      1     +X     0    0   1
   0      1      0     +X     0    0   1
   0      1      1     +2X    1    0   1
   1      0      0     -2X    1    1   0
   1      0      1     -X     0    1   0
   1      1      0     -X     0    1   0
   1      1      1     -0     x    1   1
(x: sht has no effect, since both inputs of the second multiplexer are equal)
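The behavior described above can be checked with a small bit-level model in Python. It is a sketch under stated assumptions rather than the circuit of Fig. 1: all function names are hypothetical; sht is assumed to be the complement of y_{i-1} XOR y_i; the first multiplexer is assumed to choose between bit k of X and bit k-1 of X (that is, 2X) under sht; its output is assumed to drive the select input of the second multiplexer, which passes a1 or a0; and a0 is assumed to be added at the least significant bit to complete the two's complement of negative multiples. These choices are consistent with the examples in the text (+2X with a0 = 0, a1 = 1; -X with a0 = 1, a1 = 0).

```python
def mux_select_encode(y_next, y_cur, y_prev):
    """Encode bits (y_{i+1}, y_i, y_{i-1}) into the signals (sht, a0, a1)."""
    x = y_prev ^ y_cur                  # the encoder's single XOR gate
    sht = 1 - x                         # assumed: 1 selects 2X, 0 selects X
    a0 = y_next                         # sign of the Booth digit
    a1 = (1 - y_next) if x else y_cur   # the encoder's multiplexer
    return sht, a0, a1


def select_bit(x_k, x_k_minus_1, sht, a0, a1):
    """Two cascaded 2:1 multiplexers producing one partial-product bit."""
    m = x_k_minus_1 if sht else x_k     # first MUX: bit of X or of 2X
    return a1 if m else a0              # second MUX: m selects a1 or a0


def partial_product(x, n, sht, a0, a1):
    """Signed value of one partial product for an n-bit signed multiplicand."""
    width = n + 2                       # enough bits to hold +/-2X
    bit = lambda k: (x >> k) & 1 if k >= 0 else 0
    raw = 0
    for k in range(width):
        raw |= select_bit(bit(k), bit(k - 1), sht, a0, a1) << k
    raw = (raw + a0) & ((1 << width) - 1)   # assumed +1 correction for negatives
    return raw - (1 << width) if raw >> (width - 1) else raw


def mux_select_multiply(x, y, n):
    """Multiply two n-bit signed integers (n even) via MUX-select encoding."""
    ybit = lambda k: (y >> k) & 1 if k >= 0 else 0
    product = 0
    for i in range(0, n, 2):
        sht, a0, a1 = mux_select_encode(ybit(i + 1), ybit(i), ybit(i - 1))
        product += partial_product(x, n, sht, a0, a1) << i
    return product
```

In this model the "-0" case produces an all-ones row whose +1 correction wraps to zero, which is why a0 = a1 makes the multiplicand bits irrelevant for zero multiples.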
Thus, in our new Booth selector, encoding and sign selection work
at the same time; only 2 multiplexers are needed per bit, compared
with 3 in Cho's work [6] and the same as in Inoue's work [3]. At the
same time, only 3 gate delays occur on the critical path, compared
with 4 in Inoue's design. The Booth encoder and selector circuits
can be implemented with pass-transistor selectors, which requires
fewer transistors and suits a macro-cell based system. In the
circuit diagram shown in Fig. 1, the selector and the XOR are
implemented with transmission gates. The Booth selector takes 12
transistors, or 16 transistors in its dual-rail version. All input
signals used in the Booth selectors are buffered globally, and the
buffer circuits do not affect the total transistor count
significantly.
FIG. 1. Proposed MUX-select Booth encoder and selector.

III. COMPRESSION MODULE
Generally, a Wallace tree with 4-2 or 9-2 compressors is used to
compress the partial products. In the case of the 64x64-bit
modified Booth multiplier, 33 rows and 128 columns of partial
products form a trapezoid. The number of full adders, the critical
path, and the complexity of sign extension are three important
issues in partial product compression.

In particular, 33 rows are quite expensive for the Wallace tree of
a 64x64-bit modified Booth multiplier, because 32 rows need 4
stages of compression while 33 rows need 5 stages. However, we
noticed that, in the modified Booth method, only 5 bit positions of
the tree have heights exceeding 4. Hence, we can compress one level
within the propagation time of one full adder by using very few 3-2
counters. As mentioned above, all the input signals used in the
Booth selectors must be buffered. Because very few 3-2 counters are
needed in the first level of compression and considerable buffering
is needed in partial product generation, we buffer the 3-2 counters
separately, as shown in Fig. 2. In this scheme, the 60th to 68th
bits are buffered with one inverter, while the other 58 bits are
buffered with a three-stage buffer tree. We thus save two buffer
delays and fit a 3-2 counter into this gap. This is the unbalanced
buffering and compression look-ahead.

As a result, unbalanced buffering means that only a conventional
four-stage Wallace tree with efficient 4-2 compressors [1] is
required.

FIG. 2. Unbalanced-buffering and compression look ahead.
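The stage counts quoted in this section (4 stages for 32 rows, 5 stages for 33 rows) follow from a simple halving argument, sketched below in Python. The function name compressor_stages is hypothetical, and the model assumes that every 4-2 compressor stage halves the number of rows, rounding up; real trees depend on the exact column heights.

```python
import math


def compressor_stages(rows):
    """Number of 4-2 compressor stages needed to reduce `rows` rows to 2.

    Simplifying assumption: each stage halves the row count, rounding up
    (a 4-2 compressor turns 4 rows into 2; leftover rows pass through).
    """
    stages = 0
    while rows > 2:
        rows = math.ceil(rows / 2)
        stages += 1
    return stages
```

With this model, removing a single row ahead of the tree (33 to 32) is enough to drop one full compressor stage, which is the gain that the look-ahead level of 3-2 counters targets.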
IV. ADDITION MODULE

As is well known, the final adder, which sums two rows of 128 bits,
is very important in a multiplier. We propose an improved
partitioned adder [7][8] based on multiplexers to minimize the gate
count and the critical path. Following the compression levels, we
divide the final adder into 4 parts. In each part, several
carry-select parallel prefix adders (CS-PPA) [1][2], as shown in
Fig. 3(c), make up a group of the proposed adder, such as a 4-bit
CS-PPA in the third level, an 8-bit CS-PPA in the fourth level and
four 4-bit CS-PPAs in the fifth level (the first level has been
combined with partial product generation). Each modular CS-PPA
consists of a 4-bit or 8-bit carry-select-based sum generator and a
carry generator, as shown in Fig. 3(a)-(b). The carries of each
CS-PPA are transmitted to the block carry generation block,
including CPSK4 and CSL4 as shown in Fig. 3(d), to generate the
conditional group carries. Finally, all the sums are selected by
the carries to form the final result, as shown in Fig. 3(e).
Because the lowest bit has already been obtained during
compression, the whole proposed adder is 127 bits wide, as shown in
Fig. 3(f).
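The carry-select principle behind each CS-PPA can be illustrated with a short behavioral sketch in Python. It is not the adder of Fig. 3: the function name carry_select_add and the block sizes are hypothetical, the carries ripple from block to block here, and the parallel prefix carry network and the block carry generation logic are not modeled.

```python
def carry_select_add(a, b, block_sizes):
    """Add two unsigned integers block by block, carry-select style.

    For every block both candidate sums (carry-in 0 and carry-in 1) are
    formed first; the incoming carry then only drives a 2:1 selection.
    """
    result, carry, shift = 0, 0, 0
    for width in block_sizes:
        mask = (1 << width) - 1
        a_blk = (a >> shift) & mask
        b_blk = (b >> shift) & mask
        sum0 = a_blk + b_blk               # candidate with carry-in = 0
        sum1 = a_blk + b_blk + 1           # candidate with carry-in = 1
        chosen = sum1 if carry else sum0   # multiplexer controlled by the carry
        result |= (chosen & mask) << shift
        carry = chosen >> width            # carry-out of this block
        shift += width
    return result | (carry << shift)
```

Because both candidate sums are available before the carry arrives, the carry only has to pass through one multiplexer per block, which is the property a multiplexer-based partitioned adder relies on.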
V. IMPLEMENTATION AND RESULTS
Full-custom design has low flexibility and long design and
verification times, while standard cell-based ASIC design
methodologies struggle to achieve high performance. The proposed
multiplier is therefore implemented with macro-cells, comprising
three types of custom macro-cell: multiplexer, inverter and 4-2
compressor. In addition, the multiplexer and the inverter each come
in three cell sizes. The multiplexers and the 4-2 compressor are
implemented with dual-rail transmission gates, while the inverter
uses static CMOS logic. Fig. 4 shows the final layout from Apollo,
based on SMIC 0.18 µm 1.8 V CMOS technology. Simulations use timing
parameters extracted from the layout. Fig. 5 shows the propagation
time of the proposed multiplier.
The comparison of the proposed multiplier with conventional ones in
terms of Booth encoder and selector is summarized in Table 2. The
proposed design uses the same number of multiplexers as Inoue's
work, which is the lowest. At the same time, both the number of
transistors and the number of gates on the critical path of the
Booth encoder are reduced compared with conventional designs, with
the critical path about 50% shorter than in the conventional
design. The simulated delay of this part is 0.64 ns. This part
accounts for 41% of the transistor count of the proposed
multiplier.

The compression part takes 1.16 ns of the delay, with 0.19 ns
eliminated by unbalanced buffering and compression look-ahead.
Moreover, the compression tree accounts for 59% of the transistor
count. The 127-bit partitioned adder moves the critical path from
the least significant bit to the 64th bit. It saves the addition
time of the low 29 bits directly, and the grouped CS-PPAs and
grouped carry propagation hide the propagation time of the first
group (32nd bit to 63rd bit). Overall, the proposed scheme saves
0.28 ns for the multiplier.
For the whole multiplier, the propagation time of the critical path
is 2.82 ns at 1.8 V, the layout area is 1.02 x 1.02 mm^2, and the
design contains 119520 transistors. Table 3 summarizes the
comparison between conventional multipliers and the proposed
multiplier. The transistor count is not small, mainly because we
employ dual-rail logic in the design, which costs about twice as
many transistors as a conventional single-rail design; however,
dual-rail logic saves much time.
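The delays quoted in this section can be combined into a rough budget for the critical path. The Python sketch below assumes that the Booth, compression and addition stages lie strictly in series on the critical path; the constant and function names are hypothetical, and the adder figure it returns is derived by subtraction rather than reported in the text.

```python
# Delays reported in the text, in picoseconds to avoid rounding issues.
BOOTH_PS = 640          # Booth encoder and selector
COMPRESSION_PS = 1160   # compression tree
TOTAL_PS = 2820         # whole critical path


def adder_delay_ps(total=TOTAL_PS, booth=BOOTH_PS, compression=COMPRESSION_PS):
    """Remaining delay attributed to the final adder (derived, not reported)."""
    return total - booth - compression
```

Under this assumption the Booth and compression stages together take 1.80 ns, leaving roughly 1.02 ns for the final adder.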

FIG. 3. Proposed 127-bit carry-selected-based partitioned adder.


VI. CONCLUSIONS
In this paper, we proposed a MUX-select Booth encoder, which
achieves both a low multiplexer cost and a short critical path, and
a multiplexer-based partitioned adder, to create high-performance
multipliers. Implemented with dual-rail transmission-gate
multiplexers, the design achieves both high speed and small size in
simulation.


ACKNOWLEDGEMENTS

The authors would like to thank Prof. Qianlin Zhang for her
continuous encouragement, and Dr. Xia Li, Dr. Bo Shen, Yunzhi Dong,
Jianyu Zhang and Fei Wang for their kind help.

FIG. 4. Layout of proposed multiplier from Apollo.

FIG. 5. Propagation time of proposed multiplier. (Timeline of the multiplication process; recoverable labels: 0.19 ns, 0.64 ns, 1.80 ns and 2.82 ns.)

TABLE 2
COMPARISON OF BOOTH ENCODER AND SELECTOR

                        Ohkubo[2]  Inoue[3]  Cho[6]  Proposed
Transistors in Encoder  30         26        20      18
MUXs in Selector        -3         2         3       2
Critical Path (gate)    6          5         3       3
Delay (ns)              1.06       0.71      0.68    0.64

TABLE 3
COMPARISON OF MULTIPLIER

                    Ohkubo[2]  Inoue[3]   Cho[6]  Proposed
Width (bit)         54         54         54      64
Gate Length (µm)    0.25       0.25       0.18    0.18
Supply Voltage (V)  2.5        2.5        2.5     1.8
Area (mm^2)         3.77x3.41  1.04x1.27  --      1.02x1.02
Delay (ns)          4.4        4.1        3.25    2.82

REFERENCES

[1] M. Suzuki, N. Ohkubo, T. Shinbo, T. Yamanaka, A. Shimizu,
    K. Sasaki and Y. Nakagome, "A 1.5-ns 32-b CMOS ALU in double
    pass-transistor logic", IEEE Journal of Solid-State Circuits,
    vol.28, Issue: 11, pp.1145-1151, Nov. 1993.
[2] N. Ohkubo, M. Suzuki, T. Shinbo, T. Yamanaka, A. Shimizu,
    K. Sasaki and Y. Nakagome, "A 4.4ns CMOS 54x54-b Multiplier
    Using Pass-Transistor Multiplexer", IEEE Journal of Solid-State
    Circuits, vol.30, no.3, pp.251-257, Mar. 1995.
[3] Inoue, R. Ohe, S. Kashiwakura, S. Mitarai, T. Tsuru, T. Izawa
    and G. Inoue, "A 4.1 ns compact 54x54 b multiplier utilizing
    sign select Booth encoders", IEEE International Conference on
    Solid-State Circuits 1997, pp.416-417, 497, 6-8 Feb. 1997.
[4] Y. Hagihara, S. Inui, A. Yoshikawa, S. Nakazato, S. Iriki,
    R. Ikeda, Y. Shibue, T. Inaba, M. Kagamihara and M. Yamashina,
    "A 2.7 ns 0.25 µm CMOS 54x54 b multiplier", IEEE International
    Conference on Solid-State Circuits 1998, pp.296-297,
    5-7 Feb. 1998.
[5] Carlson, A. Jain, P. Bannon, T. Benninghoff, M. Bertone,
    R. Blake-Campos, et al., "A 667 MHz RISC microprocessor
    containing a 6.0 ns 64 b integer multiplier", IEEE International
    Conference on Solid-State Circuits 1998, pp.294-295,
    5-7 Feb. 1998.
[6] Ki-seon Cho, Jong-on Park, Jin-seok Hong and Goang-seog Choi,
    "54x54-bit Radix-4 Multiplier based on Modified Booth
    Algorithm", GLSVLSI 2003, pp.233-237.
[7] V. G. Oklobdzija, D. Villeger and S. S. Liu, "A method for
    speed optimized partial product reduction and generation of
    fast parallel multipliers using an algorithmic approach", IEEE
    Transactions on Computers, vol.45, Issue: 3, pp.294-306,
    March 1996.
[8] Tso-Bing Juang, Jeng-Hsiun Jan, Ming-Yu Tsai and Shen-Fu Hsiao,
    "Partition methodology for the final adder in a tree-structure
    parallel multiplier generator", Asia-Pacific Conference on
    Circuits and Systems, 2002, vol.1, pp.471-474, 28-31 Oct. 2002.
