
8 x 8 bit pipelined Dadda multiplier in CMOS

D.G. Crawley
G.A.J. Amaratunga

Indexing terms: Very large scale integration

Abstract: Parallel multiplication schemes for VLSI have traditionally been chosen for their regular layout. Unfortunately, this has meant using algorithms which are not time-optimal. In the paper, we present an 8 x 8 bit time-optimal multiplier using the Dadda scheme implemented as a 7-stage linear pipeline. The design uses automated layout techniques to avoid the problems associated with the irregularity of the scheme, and a 3 µm n-well CMOS process with two layers of metal. The use of multiple levels of metal reduces the delay associated with the interconnection between cells and also permits the over-routing of active circuitry. A new pipelined carry look-ahead adder is used for the final summation, and this provides a significant contribution to the performance of the multiplier. A set of cells was designed for the multiplier and some aspects of their design are discussed. In particular, a previously unreported V_dd overshoot problem in an existing exclusive-OR gate circuit is described and explained. The multiplier is expected to operate at a maximum clock frequency of at least 50 MHz.

Paper 6 3 3 8 6 (E10, C2, C3), first received 10th November 1987 and in revised form 13th May 1988
The authors are with Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, United Kingdom

1 Introduction

Multiplication schemes for implementation in VLSI have traditionally been chosen for their regular structure and consequent ease of layout. Such schemes are not time-optimal, however, in that they do not achieve the lower bound on the time to perform a multiplication (O(log₂ n)). Schemes which are time-optimal have been available for several years [1, 2], and have indeed been used in non-VLSI implementations for multipliers in large mainframe computers.

It is hardly surprising that regular layout has been an extremely important criterion in the selection of a multiplication scheme for VLSI. Manual VLSI layout is a time-consuming process and is minimised in a layout which has identical cells with only 'nearest neighbour' interconnections. In addition, until very recently, designers have had only one layer of metal for use as interconnect. The use of other process layers for interconnect, such as polysilicon or diffusion, is costly both in terms of speed and area. The speed penalty is introduced by the high resistance and capacitance of these layers, and the area is increased because these layers are used in the construction of active devices and so cannot be used to pass over such devices in the same way as, for example, an additional layer of metal. It is therefore apparent that the minimisation of interconnection is of primary importance for a manually laid-out design in which only a single level of low resistance, low capacitance interconnect is available.

The situation has changed recently in that multiple levels of low resistance, low capacitance interconnect are now available for VLSI design. This permits the over-routing of active circuitry and permits a somewhat more flexible topology for the design. It must be said that wiring is never desirable for a VLSI design, because a long wire will always introduce more delay than a short one, and some additional area will always be consumed (e.g. for vias). However, for certain applications, the additional layers of interconnect may be used to considerable advantage, particularly where the layout topology is irregular. The other development which has permitted a change of approach is the development of CAD tools for automated layout. These enable the layout of designs that were previously too irregular to be implemented by manual layout, because manual layout of complex wiring is extremely time consuming and prone to error.

In this paper we describe the implementation of a pipelined 8 x 8 bit Dadda multiplier in 3 µm CMOS using an automated layout package. The scheme has not previously been implemented in VLSI, primarily due to its irregularity.

2 Dadda multiplication scheme

The product, p, of two n-bit unsigned binary numbers x and y may be expressed as follows:

\[ (p_{2n-1} \ldots p_0) = \sum_{i=0}^{n-1} \{ y_i \wedge (x_{n-1} \ldots x_0) \} \times 2^i \tag{1} \]

In a parallel multiplier, the terms $y_i \wedge (x_{n-1} \ldots x_0)$ are known as the partial products and are generated using an array of AND gates. For a parallel multiplier, the shifting term $2^i$ is inherent in the wiring and does not require any explicit hardware. Thus the main problem is the summation of the partial products, and it is the time taken to perform this summation which determines the maximum speed at which a multiplier may operate. The summation scheme for an 8 x 8 bit Dadda multiplier is shown in Fig. 1 (the notation is taken from Reference 1 in which the outputs from a full adder are joined by a solid line, and those from half adders are joined by a line with a dash through the centre). The Dadda scheme essentially minimises the number of adder stages required to perform the summation of the partial products. This is achieved by using full and half adders to reduce the number of rows in the matrix of bits at each summation stage by a factor of 3/2. This results in a final matrix consisting of two
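As an illustration of eqn. 1, the following Python sketch forms the partial products of two unsigned n-bit operands and checks that their sum equals the ordinary product. The function name `partial_products` and the operand values are illustrative assumptions and do not come from the paper; the sketch models the arithmetic only, not the gate-level circuit.

```python
def partial_products(x, y, n):
    """Return the n partial products of eqn. 1 as integers.

    Row i is y_i AND (x_{n-1} .. x_0), shifted left by i places.
    In hardware the shift is only wiring; here it is an explicit <<.
    """
    mask = (1 << n) - 1
    rows = []
    for i in range(n):
        y_i = (y >> i) & 1
        row = (x & mask) if y_i else 0  # AND of y_i with every bit of x
        rows.append(row << i)           # weight 2**i
    return rows


# Hypothetical 8-bit operands, chosen only for illustration.
rows = partial_products(173, 219, 8)
product = sum(rows)
```

Summing the rows with an ordinary integer addition stands in for the adder tree; the remainder of the scheme is concerned only with how quickly that summation can be performed.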
IEE PROCEEDINGS, Vol. 135, Pt. G , No. 6, DECEMBER 1988 23 1

Authorized licensed use limited to: Sri Vasavi Engineering College. Downloaded on April 30,2010 at 05:31:56 UTC from IEEE Xplore. Restrictions apply.
rows of bits which must be summed using a multiple-bit adder (e.g. a ripple-carry or carry look-ahead adder). The corresponding circuit for a multiplier using this scheme is shown in Fig. 2.

Fig. 1 Dadda summation scheme
a Partial products
b First summation stage
c Second summation stage
d Third summation stage
e Fourth summation stage

By way of contrast, in a popular multiplication scheme [3-5], the carry-save array, the summation proceeds in a more regular, but slower manner, as may be seen from the summation diagram in Fig. 3. Using this scheme only one row of bits in the matrix is eliminated at each stage of the summation. The circuit for an 8 x 8 bit multiplier using this scheme is shown in Fig. 4.

The obvious differences between the two schemes are the number of stages used in the summation and the regularity of the interconnection between them. In the Dadda multiplier, the number of stages is smaller, so the summation of the partial products is faster. In fact the Dadda multiplier is time-optimal, having T = O(log₂ n) [6], and this limit is reached very quickly, so the scheme is suitable for small word lengths. The carry-save array has T = O(n) but has a more regular structure. Both the carry-save array multiplier and the Dadda multiplier may be pipelined by inserting registers between each stage of processing elements. However, if the system is pipelined at every stage, then the Dadda multiplier will require fewer register stages since it has fewer processing stages. The carry-save array requires fewer registers at each stage compared to the Dadda scheme; this is offset, however, by the fact that more register stages are required for the array to operate at the same clock frequency. Overall, the latency in a carry-save array will be longer than that of the Dadda multiplier if the systems are both pipelined so that each has the same delay at each stage. Fig. 5 illustrates the number of gate delays in the carry-save scheme compared with the Dadda scheme, assuming that both multipliers are constructed using only NAND gates and that both schemes use a carry look-ahead adder at the final stage. Such a graph is, of necessity, somewhat artificial since it is unlikely that any VLSI implementation would use only simple NAND gates in the design. The graph does give a clear indication of the trends involved, however. Both schemes are assumed to use a carry look-ahead adder at the final summation stage, partitioned so as to give a maximum look-ahead block size of four bits. Gates are assumed to have the same delay regardless of the number of inputs, although no gates with more than five inputs are assumed.

3 Pipelining of the Dadda multiplication scheme

Pipelining [7] is a technique which allows the computation rate of a system to be increased. The overall function of the system must be such that it can be divided into a number of discrete subprocesses that are executed in sequence. Registers are then inserted between each process step so that each stage processes a new set of data each time the registers are clocked. All the registers are loaded on the same clock transition and so results are issued from the system at the clock frequency. The concept is made clear in Fig. 6. The terms which are commonly used to describe the performance of a pipelined system are bandwidth, that is the number of processes which are performed simultaneously, and latency, that is the number of clock cycles taken before the first result becomes available. It is obvious that the latency should be minimised in a high performance system. In particular, a large latency may render the multiplier unsuitable for certain applications. The multiplier was implemented as a seven stage pipeline and particular care was taken to equalise the delay of each stage. This was important because the stage with the longest delay determines the minimum clock period at which the system may operate.

The technique of pipelining increases the utilisation of the processing elements at the expense of the cost incurred by adding the required register stages. For long streams of data, however, such as may be encountered in signal processing or certain types of computer arithmetic (some types of matrix operations, for example), pipelining provides a simple means of achieving a highly advantageous increase in the throughput of the system. Although the concept of pipelining is quite straightforward, it may require some reworking of existing schemes so that they may be pipelined; this applies particularly to the carry look-ahead adder described here.

The final adder was also required to be pipelined because it would have introduced an unacceptable delay if implemented as a single processing stage. As has been noted in Reference 8 'the final adder used to form the product is the main speed limiting element', a point which is also discussed in Reference 9. It was also desirable that it should consist of as few stages as possible so as to reduce the latency of the pipeline. The final adder has been pipelined in previous designs [10], however this has generally involved the use of what is effectively a pipelined ripple-carry adder. It was considered that, although the stage delay of such an adder is low, the number of stages would introduce an unnecessary increase in the latency of the multiplier. For these reasons, a new form of pipelined carry look-ahead adder was developed [11]; it is somewhat similar to the Brent/Kung adder [12], however it uses look-ahead blocks of four, as opposed to two, bits. This provides a considerable advantage in reducing the number of stages required for the adder at the expense of a slightly less regular layout. (This latter criterion is somewhat irrelevant for the multiplier presented here).

The operation of the adder is essentially as follows:

(i) If A and B are the two words to be summed in the adder, then a partial sum word, SP, and a generate word,
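The difference in stage count between the two summation schemes can be made concrete with a short Python sketch. It assumes the usual Dadda height sequence 2, 3, 4, 6, 9, 13, ... (each term at most 3/2 times the previous one) and counts one carry-save stage per eliminated row; the function names are illustrative, the final multiple-bit adder is excluded from both counts, and the sketch is valid for n of 3 or more.

```python
def dadda_heights(n):
    """Matrix heights reached after each Dadda stage for n rows (n >= 3).

    Builds 2, 3, 4, 6, 9, ... and keeps the terms below n, then
    returns them largest first, i.e. in the order they are reached.
    """
    heights = [2]
    while heights[-1] * 3 // 2 < n:
        heights.append(heights[-1] * 3 // 2)
    return heights[::-1]


def dadda_stages(n):
    """Number of adder stages before the final two-row addition."""
    return len(dadda_heights(n))


def carry_save_stages(n):
    """Carry-save array: one row is eliminated per stage, n rows down to 2."""
    return n - 2


for n in (8, 16, 32):
    print(n, dadda_heights(n), dadda_stages(n), carry_save_stages(n))
```

For 8 rows the sketch gives the heights 6, 4, 3, 2, which corresponds to the four summation stages listed in the caption of Fig. 1, against six carry-save stages plus a final stage in Fig. 3.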
G, are produced according to the following equations:

\[ SP = A \oplus B \tag{2} \]
\[ G = A \wedge B \tag{3} \]

Fig. 2 8 x 8 bit Dadda multiplier

(ii) Bits from these words are used to form generate (GM) and propagate (PM) signals for each 4-bit group of input bits as follows:

\[ GM_i = G_{i+3} \vee SP_{i+3} \wedge (G_{i+2} \vee SP_{i+2} \wedge (G_{i+1} \vee SP_{i+1} \wedge G_i)) \tag{4} \]
\[ PM_i = SP_i \wedge SP_{i+1} \wedge SP_{i+2} \wedge SP_{i+3} \]

where i represents the 4-bit 'module' with which the signal is associated, so $GM_0$, for example, would be the output from the module associated with the least significant 4 input bits. The division into groups of 4 bits (as with the normal, nonpipelined form of the adder [13]) avoids gates with more than five inputs and excessive fan-out from the driving logic.

(iii) The GM and PM signals are used with the carry input $C_{in}$ to produce intermodule carries (CM). (The signal $C_{in}$ is produced by an additional half adder in the same pipeline stage as the exclusive-OR gates which produce the partial sums. The inputs to the half adder are the two least significant bits of the two words to be summed, the sum output is the next least significant bit of the product, and the carry output forms $C_{in}$.)

\[ CM_i = GM_i \vee PM_i \wedge (CM_{i-1})^* \tag{5} \]

where $CM_{i-1}^*$ represents the expanded form of the module carry signal from the next least significant module and $CM_{-1} = C_{in}$. So, for example, $CM_0 = GM_0 \vee PM_0 \wedge C_{in}$.

(iv) The generate (G) and partial sum (SP) are used together with the intermodule carries (CM) to produce the final carries (C) as follows:

\[ C_{i+1} = G_i \vee SP_i \wedge C_i^* \tag{6} \]
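The steps of the adder can be checked with a bit-level Python model. This is a minimal sketch under stated assumptions, not the circuit of the paper: the function name `cla_add`, the default 16-bit width and the absolute bit numbering (module m covers bits 4m to 4m+3) are choices made here, the final sum is taken as the exclusive-OR of each carry with the corresponding partial sum, and the module carries are evaluated one after another in a loop rather than as the fully expanded expressions a pipelined implementation would use.

```python
def cla_add(a, b, cin=0, width=16):
    """Add two unsigned words using 4-bit look-ahead modules.

    Returns the (width + 1)-bit result including the carry out.
    width must be a multiple of 4 (an assumption of this sketch).
    """
    assert width % 4 == 0

    def bit(v, i):
        return (v >> i) & 1

    sp = [bit(a, i) ^ bit(b, i) for i in range(width)]  # partial sums, eqn. 2
    g = [bit(a, i) & bit(b, i) for i in range(width)]   # generates, eqn. 3

    carry = [0] * (width + 1)
    carry[0] = cin
    cm = cin  # carry into the least significant module
    for base in range(0, width, 4):
        s0, s1, s2, s3 = sp[base:base + 4]
        g0, g1, g2, g3 = g[base:base + 4]
        gm = g3 | (s3 & (g2 | (s2 & (g1 | (s1 & g0)))))  # module generate, eqn. 4
        pm = s0 & s1 & s2 & s3                           # module propagate
        # Carries inside the module, eqn. 6, starting from the module carry-in.
        for i in range(base, base + 3):
            carry[i + 1] = g[i] | (sp[i] & carry[i])
        cm = gm | (pm & cm)                              # intermodule carry, eqn. 5
        carry[base + 4] = cm

    # Final sum bits: each carry combined with its partial sum.
    total = sum((carry[i] ^ sp[i]) << i for i in range(width))
    return total | (carry[width] << width)
```

In a pipelined version, registers would sit between the formation of SP and G, the module signals, the module carries, the bit carries and the final sum; the model above evaluates them in that order but in a single call.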
where $C_i^*$ represents the expanded form of the next least significant carry and $C_0 = C_{in}$, $C_4 = CM_0$, and so forth. So, for example, $C_1 = G_0 \vee SP_0 \wedge C_0$.

Fig. 3 Carry-save array multiplier summation scheme
a Partial products
b First summation stage
c Second summation stage
d Third summation stage
e Fourth summation stage
f Fifth summation stage
g Sixth summation stage
h Final summation stage

(v) Finally, the partial sums (SP) are combined with the carries (C) to produce the final result using a set of exclusive-OR gates:

\[ P_i = C_i \oplus SP_i \tag{7} \]

By using this scheme, the adder may easily be pipelined by the introduction of registers between the processing stages, as illustrated in Fig. 7. Notice, however, that the final multiplier design uses a different pipeline division to limit the latency and the number of register stages required.

4 Layout of the multiplier

The multiplier was implemented in a 3 µm n-well CMOS process using two layers of metal. Cell placement and routing was performed by a software package (CAL-MP*). Before beginning the layout process, a logic description of the design was written using approximate cell timings to determine how the pipeline should be divided to equalise the delay at each processing stage. This was because the delay of the slowest stage determines the minimum clock period with which the system may operate. An optimisation of the pipeline design was made by moving some of the full adders in the Dadda summation tree to earlier stages in the pipeline to reduce the number of registers required. For example, if a full adder in the second stage of the pipeline had operands which had passed unprocessed through the first stage, then the full adder was moved to the first stage. This meant that only two, as opposed to three, pipeline registers were required to preserve the intermediate results. The pipeline division is shown in Fig. 8.

* Trade mark

4.1 Cell design

The cells were designed especially for the project because no suitable cell library was available. The common features of all the logic cells were as follows:

(i) A centre guard ring was used on all the cells. This consisted of substrate and n-well taps connected to the appropriate rail (Gnd or V_dd). The power rails passed through the centre of the cell, separating the p- and n-channel transistors. This was considered necessary because of the high speed at which the system was to operate.

(ii) Tracks on the second level of metal passed over the cell at all possible pin positions. This enabled pins which are not connected to the cell to be used to allow other signals to pass across the row of cells without any requirement for the insertion of an explicit 'feed cell'.

(iii) All cells, except the input and output pads, were designed so that they had the same drive capability (an increase in delay of 0.5 ns when driving 5 inverters). This ensured that all cells were capable of operating at the required speed under all possible fanout conditions.

The features described in (i) and (ii) are illustrated in Fig. 9. In addition, the cell placement program was instructed to invert alternate rows of cells so that p- and n-channel transistors were always separated by a guard ring.

Each cell was simulated using the circuit simulation program SPICE 2G.5 [14], and the results were used to determine the timings of each cell (as well as to verify correct operation). These timings were used in the logic descriptions for each cell. The marginal delays (the increase in propagation delay for a given capacitive load) were also extracted from the circuit simulations and used together with the cell input capacitances to produce a logic simulation which included the increase in propagation delays due to fanout.

Since it is very important to make any design testable, it was decided to incorporate a scan path test facility [15] into the design. This effectively consists of a two input multiplexer on the input to each latch so that its data input may come either from the normal source, or from the Q output of the previous latch. The control input to all the multiplexers is known as the scan select input, and applying a signal to this line so that each latch is fed from the Q output of the previous one enables all the latches in the system to be configured as a single large shift register. By having the data input of the first register in the chain fed from another input pin, it is possible to test all the registers in the system by clocking a known data sequence through the scan path and checking that the same sequence emerges from the final register in the chain. Notice that no separate output for the scan path is required because the final latch in the chain is in the output register. Likewise, it is possible to test each processing stage by allowing the multiplier to operate normally for a given number of clock cycles, then asserting the scan select input and clocking out the intermediate results from each stage in the pipeline. It is thus possible to gain access to all the latched results, considerably enhancing the testability of the design.

Fig. 4 Carry-save array multiplier

Fig. 5 Graph illustrating the variation in delay with increasing input word length
x carry-save array multiplier
o Dadda multiplier
(horizontal axis: word length, bits, 4 to 32)

4.2 Design of the logic elements

The circuits for CMOS inverters and NAND gates are well known [16], and the standard designs were used for these cells with the dimensions calculated to provide the appropriate drive capability. The AND gate and half-adder cells effectively contained two gates in one cell. This was because these types of cell were used often in the design, and incorporating two gates in one cell consumed less area than two separate cells.

The exclusive-OR gate proved a somewhat more complicated design task. Since the aims of the design were to achieve high speed and compactness, the traditional CMOS exclusive-OR [lS] gate was rather large and slow, so a variation on the traditional nMOS exclusive-OR [lS] was used. A similar scheme also appears in a very brief note by Hiltebeitel [17], as shown in Fig. 10. However, since this design was not restoring for a '1' output, the transistor arrangement was changed so that the output could be buffered by an inverter, as shown in Fig. 11. The operation of the circuit is such that the internal node driving the output inverter could overshoot V_dd by a considerable margin. This overshoot may be explained by observing the voltages across the gate-drain capacitances (C_gd) of the n-channel transistors n1 and n2, as shown in Fig. 12. When both the inputs A and B are low, both the p-channel transistors p1 and p2 are on. The capacitances C_gd thus charge to the potential of V_dd applied across them. When both the inputs change from low to high, the p-channel transistors p1 and p2 turn off, isolating the node V_int. The voltage across the gate-drain capacitances remains at V_dd, but the gate side is now
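The overshoot mechanism can be approximated by a simple capacitive-divider estimate. The sketch below is an idealised model assumed here for illustration, not a result from the paper: it treats the internal node as perfectly isolated, both inputs as switching simultaneously by the full supply voltage, and lumps all capacitance from the node to ground into one value. The function name and the component values (5 V supply, capacitances in arbitrary units) are hypothetical.

```python
def overshoot_peak(vdd, c_gd_total, c_ground):
    """Ideal peak voltage of the isolated internal node.

    The gate side of the gate-drain capacitances steps by vdd; the
    node follows by the divider ratio c_gd_total / (c_gd_total + c_ground).
    """
    coupling = c_gd_total / (c_gd_total + c_ground)
    return vdd + vdd * coupling


# With no capacitance to ground the node reaches twice the supply.
ideal = overshoot_peak(5.0, 1.0, 0.0)

# A larger capacitance to ground reduces the overshoot.
loaded = overshoot_peak(5.0, 1.0, 9.0)
```

The model ignores the finite turn-off time of the p-channel transistors and any skew between the two inputs, both of which reduce the overshoot further.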
close to V_dd rather than zero so the node V_int rises above V_dd. In theory, the voltage at V_int should rise to 2V_dd, however, in practice it is limited to about 10% V_dd by the combination of the capacitance from V_int to ground and the fact that the p-channel transistors cannot turn off infinitely fast. The maximum overshoot occurs if the two inputs change simultaneously and decreases as the time between the two rising edges is increased. Similarly, the magnitude of the overshoot decreases as the rise time of the inputs is increased. The results of a SPICE simulation of the exclusive-OR cell are shown in Fig. 13. V_int is clearly seen to exceed V_dd; it is therefore essential to incorporate a guard ring into this cell in order to avoid the possibility of latch-up problems occurring. As stated previously, all the cells used in this design incorporated guard rings in any case, so no problems would have arisen.

Fig. 6 Generalised linear pipeline
(blocks shown: operands, input registers, first processing stage, pipeline registers, second processing stage, pipeline registers, nth processing stage, output registers, results)

Fig. 7 Pipelined carry look-ahead adder
(blocks shown: input A, partial sum (SP) and generate (G) formation, pipeline registers, module propagate (PM) and generate (GM) formation, pipeline registers, module carry (CM) formation, pipeline registers, carry formation, pipeline registers, final sum formation, sum)

The full adder design was based on that of Reference 18, modified for CMOS by the inclusion of some p-channel transistors in the pass transistor array. This allowed the transmission of a '1' without the threshold voltage drop that would occur if n-channel transistors were used. As mentioned above, all the latches used in the design were required to have a multiplexer on the data input to accommodate the scan-path test facility. This was integrated into the latch cell and was implemented using a pair of transmission gates with an inverter to provide the appropriate complemented drive signal.

The design is such that the latches operate from a single phase clock. This is because the auto router would have been unlikely to route two clock phases so that the wires were sufficiently similar in length to avoid any skew between the clock phases at the highest clock frequency at which the combinational logic was able to operate. A master-slave latch was also an obvious requirement to permit reliable operation of the pipeline.

The latch design itself was originally taken from Reference 19, as shown in Fig. 14. When this design was simulated, it was found that the speed of operation was far below what was required, due to charge storage at the node A in the figure. The latch was therefore modified to the form shown in Fig. 15; this now bears a great resemblance to the weak feedback inverter latch [18], except that the lower transistor of the pair has its source connected to the clock line rather than to a supply rail. This gives a slight advantage in terms of the size of the transistors that may be used in the feedback inverter. It also gives the slight disadvantage that the clock line has a higher capacitance than in the case of the leaky feedback inverter latch; for this particular design, however, the clock signal is generated by an off-chip source, so the disadvantage is of little consequence.

Fig. 8 Multiplier pipeline division
(blocks shown: input register, partial product array (AND gates) and first stage of Dadda summation (adders), second and third stages of Dadda summation, second pipeline register, first stage of carry look-ahead adder, third pipeline register, second stage of carry look-ahead adder, fourth pipeline register, final stage of carry look-ahead adder, output register)

Fig. 9 Design features common to all the logic cells
A n-well (containing all p-channel transistors)
B V_dd rail (in metal 1), with taps to n-well
C Gnd rail, with taps to substrate
D Substrate area (containing all n-channel transistors)
E Metal 2 overpass with no connection to cell
F Metal 2 overpass with connection to cell

Fig. 10 Exclusive-OR gate of Reference 17

Fig. 11 Restoring exclusive-OR gate as used in the multiplier

The output pad driver cell was designed according to the method in Reference 20. This allowed a ratio of 6 between the sizes of inverters in the driving chain as opposed to the normal ratio of e. The increase in propagation delay was negligible, but a considerable saving in area was made because of the smaller number of inverters required. The output pad was designed to drive a load of 20 pF at the required speed, which was aimed to be less than 10 ns. This was achieved using the design methodology described above, with output transistor widths of
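The effect of the size ratio on the length of the driver chain can be illustrated with a short Python sketch. It simply counts how many inverters, each larger than the previous by a fixed ratio, are needed before the cumulative gain reaches the ratio of load capacitance to input capacitance. The function name is an assumption of this sketch, and so is the 0.05 pF input capacitance; only the 20 pF load and the ratios 6 and e are taken from the text. No delay or area figures are modelled.

```python
import math


def stages_needed(load_ratio, taper):
    """Smallest number of inverters whose cumulative gain taper**N
    is at least load_ratio (load capacitance / input capacitance)."""
    count = 1
    gain = taper
    while gain < load_ratio:
        gain *= taper
        count += 1
    return count


# 20 pF load from the text; 0.05 pF input capacitance is hypothetical.
load_ratio = 20.0 / 0.05

stages_ratio_e = stages_needed(load_ratio, math.e)
stages_ratio_6 = stages_needed(load_ratio, 6.0)
```

Under these assumed values the chain with a ratio of 6 needs fewer inverters than the chain with a ratio of e, which is the direction of the saving described above.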

1000 µm for the p-channel device and 300 µm for the n-channel. Transistor lengths were kept to the minimum drawn dimension of 3 µm.

Fig. 12 Exclusive-OR V_dd overshoot model

Fig. 13 SPICE simulation of the exclusive-OR cell
(traces: input A, input B, output; horizontal axis: time, s)

The input protection pad was designed using a similar method, with the protection diodes being placed either side of the bond pad, each surrounded by a guard ring. The only additional constraint on the design of the input and output pads was that they should be compatible with the requirements of the automated layout software, and that they should operate at the required speed.

4.3 Circuit design and performance

Although the exact placement for each cell was determined by the automated layout software, it was possible to influence the positioning of groups of cells. This was done to achieve efficient distribution of the clock signal and to attempt to minimise the distances between cells in the same stage of the pipeline. In general, the pipeline stages were positioned so that the general signal flow was from the 'top' to the 'bottom' of the layout, with the clock line being routed down a central channel. The general placement scheme is illustrated in Fig. 16, and the final layout is shown in Fig. 17. The complete layout (including a frame with test structures) occupies an area of 5.5 x 5.5 mm. The design has been submitted for fabrication by the Microelectronics Group at Southampton University.

Fig. 14 Basic latch cell of Reference 19

Fig. 15 Final latch design

The capacitances of the tracks connecting the cells were extracted, by using another part of the layout software, and inserted back into the logic description. It was found that there was no change in the maximum clock frequency at which the multiplier could operate. In practice, the presence of additional capacitance, no matter how small, would increase the delay, but the simulation was performed with timings rounded up to the nearest nanosecond to ensure a worst-case analysis.

The multiplier is expected to operate at a clock frequency of 50 MHz, with a latency of 7 clock cycles. The internal maximum clock frequency could be increased by reducing the complexity of each stage and thus using more pipeline registers. This would increase both the latency and the area of silicon used, because more latches would be required. The present design is thought to be a good compromise in that the area consumed is acceptably small, the latency is acceptable, and the operating frequency is extremely good for a 3 µm CMOS process.

5 Conclusions

In this paper, we have presented a VLSI implementation of a pipelined parallel multiplier using the Dadda scheme. The use of this scheme in a practical VLSI implementation has not previously been reported, primarily because of its irregularity. However, the use of automated layout
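The quoted performance figures translate into times as follows. The Python sketch below takes the 50 MHz clock and the latency of 7 clock cycles from the text and computes the clock period, the latency in nanoseconds and the time to produce a stream of results; the function name and the stream length of 1000 are illustrative assumptions, and the model counts one result per clock once the pipeline is full.

```python
def pipeline_timing(clock_mhz, latency_cycles, n_items):
    """Return (clock period, latency, time for n_items results), all in ns.

    The first result appears after latency_cycles clocks; each
    further result appears one clock later.
    """
    period_ns = 1000.0 / clock_mhz
    latency_ns = latency_cycles * period_ns
    total_ns = (latency_cycles + n_items - 1) * period_ns
    return period_ns, latency_ns, total_ns


period_ns, latency_ns, total_ns = pipeline_timing(50.0, 7, 1000)
```

Adding pipeline registers would shorten the period but raise the cycle count in the first term, which is the trade-off between clock frequency and latency discussed above.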
software and multiple levels of metal interconnection has enabled the scheme to be integrated without a significant degradation in performance.

Fig. 16 Floor plan of the multiplier
(areas shown: test structures, input protection pads, output pad drivers)

Fig. 17 Layout of the multiplier
Note: for clarity, only 3 layers are shown

A new pipelined carry look-ahead adder has been used for the final summation and this makes a significant contribution to reducing both the maximum delay of each pipeline stage and the latency of the entire multiplier.

Although the concept of using an irregular layout for VLSI is not desirable, the method we have described here makes it possible to implement a time-optimal scheme without problems of excessive effort used in designing a suitable layout or of introducing large delays due to wiring. The method may also be applicable to other functions that have sufficient advantages to justify their VLSI implementation but are also irregular.

The area utilisation could be significantly improved if it were possible to dispense with the routing channels between the rows of cells. In this case, all the wiring could pass over the top of the active devices and thus not consume any additional area. This would require an additional layer of metal and new automated layout software.

To increase its versatility, it would be desirable for the multiplier to be capable of dealing with signed integers of more than 8 bits. It would be possible to accomplish both of these requirements by using a recursive scheme in which several small multipliers could be combined together with an additional adder. This would permit the construction of a large, regular multiplier from smaller multiplier blocks that were themselves irregular. This would avoid the irregularity at the macro level, where it becomes more important to reduce the wire lengths.

6 Acknowledgments

The authors would like to thank the Microelectronics Group at Southampton University Department of Electronics and Computer Science for providing the fabrication service for this design.

7 References

1 DADDA, L.: 'Some schemes for parallel multipliers', Alta Freq., 1965, 34, pp. 349-356
2 WALLACE, C.S.: 'Suggestion for a fast multiplier', IEEE Trans., 1964, EC-13, pp. 14-17
3 HATAMIAN, M., and CASH, G.L.: 'A 70-MHz 8 x 8-bit parallel pipelined multiplier in 2.5 µm CMOS', IEEE J. Solid-State Circuits, 1986, SC-21, pp. 505-513
4 LEE, F.S., KAELIN, G.R., WELCH, B.M., ZUCCA, R., SHEN, E., ASBECK, P., LEE, C.P., KIRKPATRICK, C.G., LONG, S.I., and EDEN, R.C.: 'High speed GaAs 8 x 8 bit parallel multiplier', IEEE J. Solid-State Circuits, 1982, SC-17, pp. 638-645
5 HENLIN, D.A., FERTSCH, M.T., MAZIN, M., and LEWIS, E.T.: 'A 16 x 16 bit pipelined multiplier macrocell', IEEE J. Solid-State Circuits, 1985, SC-20, pp. 542-547
6 CAPPELLO, P.R., and STEIGLITZ, K.: 'A VLSI layout for a pipelined Dadda multiplier', ACM Trans. Comp. Syst., 1983, 1, pp. 157-174
7 JUMP, J.R., and AHUJA, S.R.: 'Effective pipelining of digital systems', IEEE Trans., 1978, C-27, pp. 855-865
8 SHARMA, R.: 'Area-time efficient arithmetic elements for VLSI systems'. Proc. 8th IEEE Symposium on Computer Arithmetic, May 1987, pp. 57-62
9 YUNG, H.C., and ALLEN, C.R.: 'Part 1: VLSI implementation of an optimised hierarchical multiplier', IEE Proc. G, Electron. Circuits & Syst., 1984, 131, (2), pp. 5-
10 SCHMITT-LANDSIEDEL, D., NOLL, T.G., KLAR, H., and ENDERS, G.: 'A pipelined 330 MHz multiplier'. ESSCIRC '85, 11th European Solid State Circuits Conf., 16-18 September 1985

11 CRAWLEY, D.G., and AMARATUNGA, G.A.J.: 'Pipelined carry look-ahead adder', Electron. Lett., 1986, 22, pp. 661-662
12 BRENT, R.P., and KUNG, H.T.: 'Regular layout for parallel adders', IEEE Trans., 1982, C-31, pp. 260-264
13 HWANG, K.: 'Computer arithmetic' (John Wiley & Sons, New York, 1979)
14 VLADIMIRESCU, A., ZHANG, K., NEWTON, A.R., PEDERSON, D.O., and SANGIOVANNI-VINCENTELLI, A.: 'SPICE version 2G user's guide', Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Ca., 10 August 1981
15 MAVOR, J., JACK, M.A., and DENYER, P.B.: 'Introduction to MOS LSI design' (Addison Wesley, London, 1983)
16 WESTE, N., and ESHRAGHIAN, K.: 'Principles of CMOS VLSI design' (Addison Wesley, Reading, Mass., 1985)
17 HILTEBEITEL, J.S.: 'CMOS XOR', IBM Tech. Disclosure Bull., 1984, 27, p. 2639
18 GLASSER, L.A., and DOBBERPUHL, D.W.: 'The design and analysis of VLSI circuits' (Addison Wesley, 1985)
19 SPAANENBURG, L., POLLOK, W., and VERMEULEN, W.: 'Novel switched logic CMOS latch building block', Electron. Lett., 1985, 21, pp. 398-399
20 VEENDRICK, H.J.M.: 'Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits', IEEE J., 1984, SC-19, pp. 468-473
