D.G. Crawley
G.A.J. Amaratunga
Authorized licensed use limited to: Sri Vasavi Engineering College. Downloaded on April 30,2010 at 05:31:56 UTC from IEEE Xplore. Restrictions apply.
rows of bits which must be summed using a multiple-bit adder (e.g. a ripple-carry or carry look-ahead adder). The corresponding circuit for a multiplier using this scheme is shown in Fig. 2.

Fig. 1 Dadda summation scheme
a Partial products
b First summation stage
c Second summation stage
d Third summation stage
e Fourth summation stage

By way of contrast, in a popular multiplication scheme [3-5], the carry-save array, the summation proceeds in a more regular, but slower manner, as may be seen from the summation diagram in Fig. 3. Using this scheme only one row of bits in the matrix is eliminated at each stage of the summation. The circuit for an 8 x 8 bit multiplier using this scheme is shown in Fig. 4.

The obvious differences between the two schemes are the number of stages used in the summation and the regularity of the interconnection between them. In the Dadda multiplier, the number of stages is smaller, so the summation of the partial products is faster. In fact the Dadda multiplier is time-optimal, having $T = O(\log n)$ [6], and this limit is reached very quickly, so the scheme is suitable for small word lengths. The carry-save array has $T = O(n)$ but has a more regular structure. Both the carry-save array multiplier and the Dadda multiplier may be pipelined by inserting registers between each stage of processing elements. However, if the system is pipelined at every stage, then the Dadda multiplier will require fewer register stages since it has fewer processing stages. The carry-save array requires fewer registers at each stage compared to the Dadda scheme; this is offset, however, by the fact that more register stages are required for the array to operate at the same clock frequency. Overall, the latency in a carry-save array will be longer than that of the Dadda multiplier if the systems are both pipelined so that each has the same delay at each stage. Fig. 5 illustrates the number of gate delays in the carry-save scheme compared with the Dadda scheme, assuming that both multipliers are constructed using only NAND gates and that both schemes use a carry look-ahead adder at the final stage. Such a graph is, of necessity, somewhat artificial since it is unlikely that any VLSI implementation would use only simple NAND gates in the design. The graph does give a clear indication of the trends involved, however. Both schemes are assumed to use a carry look-ahead adder at the final summation stage, partitioned so as to give a maximum look-ahead block size of four bits. Gates are assumed to have the same delay regardless of the number of inputs, although no gates with more than five inputs are assumed.

3 Pipelining of the Dadda multiplication scheme

... process step so that each stage processes a new set of data each time the registers are clocked. All the registers are loaded on the same clock transition and so results are issued from the system at the clock frequency. The concept is made clear in Fig. 6. The terms which are commonly used to describe the performance of a pipelined system are bandwidth, that is the number of processes which are performed simultaneously, and latency, that is the number of clock cycles taken before the first result becomes available. It is obvious that the latency should be minimised in a high performance system. In particular, a large latency may render the multiplier unsuitable for certain applications. The multiplier was implemented as a seven stage pipeline and particular care was taken to equalise the delay of each stage. This was important because the stage with the longest delay determines the minimum clock period at which the system may operate.

The technique of pipelining increases the utilisation of the processing elements at the expense of the cost incurred by adding the required register stages. For long streams of data, however, such as may be encountered in signal processing or certain types of computer arithmetic (some types of matrix operations, for example), pipelining provides a simple means of achieving a highly advantageous increase in the throughput of the system. Although the concept of pipelining is quite straightforward, it may require some reworking of existing schemes so that they may be pipelined; this applies particularly to the carry look-ahead adder described here.

The final adder was also required to be pipelined because it would have introduced an unacceptable delay if implemented as a single processing stage. As has been noted in Reference 8 'the final adder used to form the product is the main speed limiting element', a point which is also discussed in Reference 9. It was also desirable that it should consist of as few stages as possible so as to reduce the latency of the pipeline. The final adder has been pipelined in previous designs [10]; however, this has generally involved the use of what is effectively a pipelined ripple-carry adder. It was considered that, although the stage delay of such an adder is low, the number of stages would introduce an unnecessary increase in the latency of the multiplier. For these reasons, a new form of pipelined carry look-ahead adder was developed [11]; it is somewhat similar to the Brent/Kung adder [12], however it uses look-ahead blocks of four, as opposed to two, bits. This provides a considerable advantage in reducing the number of stages required for the adder at the expense of a slightly less regular layout. (This latter criterion is somewhat irrelevant for the multiplier presented here).

The operation of the adder is essentially as follows:
(i) If A and B are the two words to be summed in the adder, then a partial sum word, SP, and a generate word,
232 IEE PROCEEDINGS, Vol. 135, Pt. G , N o . 6, DECEMBER 1988
G, are produced according to the following equations:

$SP = A \oplus B$   (2)
$G = A \wedge B$   (3)

(ii) Bits from these words are used to form generate (GM) and propagate (PM) signals for each 4-bit group of input bits as follows:

$GM_i = G_{i+3} \vee (SP_{i+3} \wedge G_{i+2}) \vee (SP_{i+3} \wedge SP_{i+2} \wedge G_{i+1}) \vee (SP_{i+3} \wedge SP_{i+2} \wedge SP_{i+1} \wedge G_i)$   (4)
$PM_i = SP_i \wedge SP_{i+1} \wedge SP_{i+2} \wedge SP_{i+3}$

where i represents the 4-bit 'module' with which the signal is associated, so $GM_0$, for example, would be the output from the module associated with the least significant 4 input bits. The division into groups of 4 bits (as with the normal, nonpipelined form of the adder [13]) avoids gates with more than five inputs and excessive fan-out from the driving logic.

(iii) The GM and PM signals are used with the carry input $C_{in}$ to produce intermodule carries (CM). (The signal $C_{in}$ is produced by an additional half adder in the same pipeline stage as the exclusive-OR gates which produce the partial sums. The inputs to the half adder are the two least significant bits of the two words to be summed, the sum output is the next least significant bit of the product, and the carry output forms $C_{in}$.)

$CM_i = GM_i \vee PM_i \wedge CM^*_{i-1}$   (5)

where $CM^*_{i-1}$ represents the expanded form of the module carry signal from the next least significant module and $CM_{-1} = C_{in}$. So, for example, $CM_0 = GM_0 \vee PM_0 \wedge C_{in}$.

(iv) The generate (G) and partial sum (SP) are used together with the intermodule carries (CM) to produce the final carries (C) as follows:

$C_{i+1} = G_i \vee SP_i \wedge C^*_i$   (6)
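The carry logic of equations (2) to (6) can be checked in software. The following is a minimal sketch, not the authors' implementation: the function name `cla_carries`, the 0-based numbering of bits and modules, and the list layout are assumptions made for illustration. The recurrences are evaluated sequentially rather than in the expanded (flattened) form used in hardware, and no pipeline registers are modelled.

```python
from typing import List, Tuple


def cla_carries(a: int, b: int, width: int = 8, c_in: int = 0) -> Tuple[List[int], List[int]]:
    """Return (SP, C): partial-sum bits and carries C[0..width] for two width-bit words."""
    if width % 4 != 0:
        raise ValueError("width must be a multiple of 4 (4-bit look-ahead modules)")
    a_bits = [(a >> k) & 1 for k in range(width)]
    b_bits = [(b >> k) & 1 for k in range(width)]

    sp = [x ^ y for x, y in zip(a_bits, b_bits)]  # eqn (2): partial sums
    g = [x & y for x, y in zip(a_bits, b_bits)]   # eqn (3): generate bits

    # Eqn (4): group generate and propagate; i is the lowest bit of each module.
    gm, pm = [], []
    for i in range(0, width, 4):
        gm.append(g[i + 3]
                  | (sp[i + 3] & g[i + 2])
                  | (sp[i + 3] & sp[i + 2] & g[i + 1])
                  | (sp[i + 3] & sp[i + 2] & sp[i + 1] & g[i]))
        pm.append(sp[i] & sp[i + 1] & sp[i + 2] & sp[i + 3])

    # Eqn (5): module carries, starting from the carry input.
    cm = []
    prev = c_in
    for gm_m, pm_m in zip(gm, pm):
        prev = gm_m | (pm_m & prev)
        cm.append(prev)

    # Eqn (6): bit carries; the carry into each module is c_in or the
    # carry out of the next least significant module.
    c = [0] * (width + 1)
    for m in range(width // 4):
        base = 4 * m
        c[base] = c_in if m == 0 else cm[m - 1]
        for k in range(base, base + 3):
            c[k + 1] = g[k] | (sp[k] & c[k])
    c[width] = cm[-1]
    return sp, c
```

Combining each partial-sum bit with the carry into the same bit position by exclusive-OR then gives the sum bits, which allows the carries to be compared against ordinary integer addition.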
where $C^*_i$ represents the expanded form of the next least significant carry and $C_0 = C_{in}$, $C_4 = CM_0$, and so forth. So, for example, $C_1 = G_0 \vee SP_0 \wedge C_0$.

(v) Finally, the partial sums (SP) are combined with the carries (C) to produce the final result using a set of exclusive-OR gates:

$P_i = C_i \oplus SP_i$   (7)

By using this scheme, the adder may easily be pipelined by the introduction of registers between the processing stages, as illustrated in Fig. 7. Notice, however, that the final multiplier design uses a different pipeline division to limit the latency and the number of register stages required.

Fig. 3 Carry-save array multiplier summation scheme
a Partial products
b First summation stage
c Second summation stage
d Third summation stage
e Fourth summation stage
f Fifth summation stage
g Sixth summation stage
h Final summation stage

4 Layout of the multiplier

The multiplier was implemented in a 3 µm n-well CMOS process using two layers of metal. Cell placement and routing was performed by a software package (CAL-MP*). Before beginning the layout process, a logic description of the design was written using approximate cell timings to determine how the pipeline should be divided to equalise the delay at each processing stage. This was because the delay of the slowest stage determines the minimum clock period with which the system may operate. An optimisation of the pipeline design was made by moving some of the full adders in the Dadda summation tree to earlier stages in the pipeline to reduce the number of registers required. For example, if a full [...] then the full adder was moved to the first stage. This meant that only two, as opposed to three, pipeline registers were required to preserve the intermediate results. The pipeline division is shown in Fig. 8.

* Trade mark

4.1 Cell design

The cells were designed especially for the project because no suitable cell library was available. The common features of all the logic cells were as follows:
(i) A centre guard ring was used on all the cells. This consisted of substrate and n-well taps connected to the appropriate rail (Gnd or $V_{dd}$). The power rails passed through the centre of cell, separating the p- and n-channel transistors. This was considered necessary because of the high speed at which the system was to operate.
(ii) Tracks on the second level of metal passed over the cell at all possible pin positions. This enabled pins which are not connected to the cell to be used to allow other signals to pass across the row of cells without any requirement for the insertion of an explicit 'feed cell'.
(iii) All cells, except the input and output pads, were designed so that they had the same drive capability (an increase in delay of 0.5 ns when driving 5 inverters). This ensured that all cells were capable of operating at the required speed under all possible fanout conditions.

The features described in (i) and (ii) are illustrated in Fig. 9. In addition, the cell placement program was instructed to invert alternate rows of cells so that p- and n-channel transistors were always separated by a guard ring.

Each cell was simulated using the circuit simulation program SPICE 2G.5 [14], and the results were used to determine the timings of each cell (as well as to verify correct operation). These timings were used in the logic descriptions for each cell. The marginal delays (the increase in propagation delay for a given capacitive load) were also extracted from the circuit simulations and used together with the cell input capacitances to produce a logic simulation which included the increase in propagation delays due to fanout.

Since it is very important to make any design testable, it was decided to incorporate a scan path test facility [15] into the design. This effectively consists of a two input multiplexer on the input to each latch so that its data input may come either from the normal source, or from the Q output of the previous latch. The control input to all the multiplexers is known as the scan select input, and applying a signal to this line so that each latch is fed from the Q output of the previous one enables all the latches in the system to be configured as a single large shift register. By having the data input of the first register in the chain fed from another input pin, it is possible to test all the registers in the system by clocking a known data sequence through the scan path and checking that the same sequence emerges from the final register in the chain. Notice that no separate output for the scan path is required because the final latch in the chain is in the output register. Likewise, it is possible to test each processing stage by allowing the multiplier to operate normally for a given number of clock cycles, then asserting the scan select input and clocking out the intermediate results from each stage in the pipeline. It is thus possible [...]

[...] these cells with the dimensions calculated to provide the appropriate drive capability. The AND gate and half-adder cells effectively contained two gates in one cell.
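The scan path test facility described in Section 4.1 can be illustrated with a small behavioural model. This is a sketch under stated assumptions: the class name `ScanChain`, its methods and the helper functions are hypothetical, each latch is treated as an ideal one-bit store, and the master-slave timing of the real latch is ignored.

```python
from typing import List, Sequence


class ScanChain:
    """Latches with a two-input multiplexer on each data input."""

    def __init__(self, n_latches: int) -> None:
        self.q: List[int] = [0] * n_latches  # Q outputs, first latch at index 0

    def clock(self, data_in: Sequence[int], scan_select: bool, scan_in: int = 0) -> int:
        """Apply one clock edge and return the Q output of the final latch."""
        if scan_select:
            # Each latch is fed from the Q output of the previous one;
            # the first latch is fed from the scan input pin.
            self.q = [scan_in] + self.q[:-1]
        else:
            # Normal operation: each latch takes its own data input.
            self.q = list(data_in)
        return self.q[-1]


def shift_pattern_through(chain: ScanChain, pattern: Sequence[int]) -> List[int]:
    """Clock a known sequence through the scan path and return what emerges."""
    n = len(chain.q)
    idle = [0] * n
    stream = list(pattern) + [0] * (n - 1)  # extra clocks flush the chain
    out = [chain.clock(idle, scan_select=True, scan_in=bit) for bit in stream]
    return out[n - 1:]  # the first n - 1 outputs are the initial contents


def unload(chain: ScanChain) -> List[int]:
    """Read out the captured latch contents, final latch first."""
    n = len(chain.q)
    captured = []
    for _ in range(n):
        captured.append(chain.q[-1])
        chain.clock([0] * n, scan_select=True, scan_in=0)
    return captured
```

In this model the register test corresponds to `shift_pattern_through`, and the test of a processing stage corresponds to one or more calls of `clock` with `scan_select=False` followed by `unload`.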
[...] close to $V_{dd}$ rather than zero so the node $V_{int}$ rises above $V_{dd}$. In theory, the voltage at $V_{int}$ should rise to $2V_{dd}$; however, in practice it is limited to about 10% $V_{dd}$ by the [...]

[...] incorporate a guard ring into this cell in order to avoid the possibility of latch-up problems occurring. As stated previously, all the cells used in this design incorporated guard rings in any case, so no problems would have arisen.

Fig. 11 Restoring exclusive-OR gate as used in the multiplier
Fig. 12 Exclusive-OR $V_{dd}$ overshoot model
Fig. 13 SPICE simulation of the exclusive-OR cell (horizontal axis: time, s)

The full adder design was based on that of Reference 18, modified for CMOS by the inclusion of some p-channel transistors in the pass transistor array. This allowed the transmission of a '1' without the threshold voltage drop that would occur if n-channel transistors were used. As mentioned above, all the latches used in the design were required to have a multiplexer on the data input to accommodate the scan-path test facility. This was integrated into the latch cell and was implemented using a pair of transmission gates with an inverter to provide the appropriate complemented drive signal.

The design is such that the latches operate from a single phase clock. This is because the auto router would have been unlikely to route two clock phases so that the wires were sufficiently similar in length to avoid any skew between the clock phases at the highest clock frequency at which the combinational logic was able to operate. A master-slave latch was also an obvious requirement to permit reliable operation of the pipeline.

Fig. 8 Generalised linear pipeline

The latch design itself was originally taken from Reference 19, as shown in Fig. 14. When this design was simulated, it was found that the speed of operation was far below what was required, due to charge storage at the node A in the figure. The latch was therefore modified to the form shown in Fig. 15; this now bears a great resemblance to the weak feedback inverter latch [18], except that the lower transistor of the pair has its source connected to the clock line rather than $V_{ss}$. This gives a slight advantage in terms of the size of the transistors that may be used in the feedback inverter. It also gives the slight disadvantage that the clock line has a higher capacitance than in the case of the leaky feedback inverter latch; for this particular design, however, the clock signal is generated by an off-chip source, so the disadvantage is of little consequence.

Fig. 15 Final latch design

The output pad driver cell was designed according to the method in Reference 20. This allowed a ratio of 6 between the sizes of inverters in the driving chain as opposed to the normal ratio of e. The increase in propagation delay was negligible, but a considerable saving in area was made because of the smaller number of inverters required. The output pad was designed to drive a load of 20 pF at the required speed, which was aimed to be less than 10 ns. This was achieved using the design methodology described above, with output transistor widths of 1000 µm for the p-channel device and 300 µm for the n-channel. Transistor lengths were kept to the minimum drawn dimension of 3 µm.

[...] done to achieve efficient distribution of the clock signal and to attempt to minimise the distances between cells in the same stage of the pipeline. In general, the pipeline stages were positioned so that the general signal flow was from the 'top' to the 'bottom' of the layout, with the clock line being routed down a central channel. The general placement scheme is illustrated in Fig. 16, and the final layout is shown in Fig. 17. The complete layout (including a frame with test structures) occupies an area of 5.5 x 5.5 mm. The design has been submitted for fabrication by the Microelectronics Group at Southampton University.

The capacitances of the tracks connecting the cells were extracted, by using another part of the layout software, and inserted back into the logic description. It was found that there was no change in the maximum clock frequency at which the multiplier could operate. In practice, the presence of additional capacitance, no matter how small, would increase the delay, but the simulation was performed with timings rounded up to the nearest nanosecond to ensure a worst-case analysis.

The multiplier is expected to operate at a clock frequency of 50 MHz, with a latency of 7 clock cycles.
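The relationship between stage delay, clock frequency and latency quoted in the text can be written out as a short calculation. In this sketch only the 50 MHz clock frequency and the seven-stage latency come from the paper; the function name `pipeline_timing` and the individual stage delays are hypothetical placeholders, chosen so that the slowest stage takes 20 ns.

```python
from typing import Dict, Sequence


def pipeline_timing(stage_delays_ns: Sequence[float], n_results: int = 1) -> Dict[str, float]:
    """Timing of a linear pipeline whose clock is set by its slowest stage."""
    period_ns = max(stage_delays_ns)  # slowest stage fixes the minimum clock period
    n_stages = len(stage_delays_ns)
    return {
        "period_ns": period_ns,
        "f_max_mhz": 1000.0 / period_ns,
        # Time before the first result becomes available.
        "latency_ns": n_stages * period_ns,
        # Once the pipeline is full, one result is issued per clock cycle.
        "total_ns": (n_stages + n_results - 1) * period_ns,
    }


# Hypothetical delays for a seven-stage pipeline, in nanoseconds.
stage_delays = [18.0, 20.0, 19.0, 20.0, 17.5, 19.5, 20.0]
timing = pipeline_timing(stage_delays, n_results=1000)
```

With these placeholder delays the model gives a 20 ns clock period, which corresponds to the expected 50 MHz, and the first product appears after seven clock periods. The calculation also shows why equalising the stage delays matters: reducing any stage other than the slowest leaves the clock period unchanged.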
[...] software and multiple levels of metal interconnection has enabled the scheme to be integrated without a significant degradation in performance.

Fig. 17 Layout of the multiplier
Note: for clarity, only 3 layers are shown

A new pipelined carry look-ahead adder has been used for the final summation and this makes a significant contribution to reducing both the maximum delay of each pipeline stage and the latency of the entire multiplier.

Although the concept of using an irregular layout for VLSI is not desirable, the method we have described here makes it possible to implement a time-optimal scheme without problems of excessive effort used in designing a suitable layout or of introducing large delays due to wiring. The method may also be applicable to other functions that have sufficient advantages to justify their VLSI implementation but are also irregular.

The area utilisation could be significantly improved if it were possible to dispense with the routing channels between the rows of cells. In this case, all the wiring could pass over the top of the active devices and thus not consume any additional area. This would require an additional layer of metal and new automated layout software.

6 Acknowledgments

7 References

1 DADDA, L.: ‘Some schemes for parallel multipliers’, Alta Freq., 1965, 34, pp. 349-356
2 WALLACE, C.S.: ‘Suggestion for a fast multiplier’, IEEE Trans., 1964, EC-13, pp. 14-17
3 HATAMIAN, M., and CASH, G.L.: ‘A 70-MHz 8 x 8-bit parallel pipelined multiplier in 2.5 µm CMOS’, IEEE J. Solid-State Circuits, 1986, SC-21, pp. 505-513
4 LEE, F.S., KAELIN, G.R., WELCH, B.M., ZUCCA, R., SHEN, E., ASBECK, P., LEE, C.P., KIRKPATRICK, C.G., LONG, S.I., and EDEN, R.C.: ‘High speed GaAs 8 x 8 bit parallel multiplier’, IEEE J. Solid-State Circuits, 1982, SC-17, pp. 638-645
5 HENLIN, D.A., FERTSCH, M.T., MAZIN, M., and LEWIS, E.T.: ‘A 16 x 16 bit pipelined multiplier macrocell’, IEEE J. Solid-State Circuits, 1985, SC-20, pp. 542-547
6 CAPPELLO, P.R., and STEIGLITZ, K.: ‘A VLSI layout for a pipelined Dadda multiplier’, ACM Trans. Comp. Syst., 1983, 1, pp. 157-174
7 JUMP, J.R., and AHUJA, S.R.: ‘Effective pipelining of digital systems’, IEEE Trans., 1978, C-27, pp. 855-865
8 SHARMA, R.: ‘Area-time efficient arithmetic elements for VLSI systems’. Proc. 8th IEEE Symposium on Computer Arithmetic, May 1987, pp. 57-62
9 YUNG, H.C., and ALLEN, C.R.: ‘Part 1: VLSI implementation of an optimised hierarchical multiplier’, IEE Proc. G, Electron. Circuits & Syst., 1984, 131, (2), pp. 5-
10 SCHMITT-LANDSIEDEL, D., NOLL, T.G., KLAR, H., and ENDERS, G.: ‘A pipelined 330 MHz multiplier’. ESSCIRC ’85, 11th European Solid State Circuits Conf., 16-18 September 1985
11 CRAWLEY, D.G., and AMARATUNGA, G.A.J.: ‘Pipelined carry look-ahead adder’, Electron. Lett., 1986, 22, pp. 661-662
12 BRENT, R.P., and KUNG, H.T.: ‘Regular layout for parallel adders’, IEEE Trans., 1982, C-31, pp. 260-264
13 HWANG, K.: ‘Computer arithmetic’ (John Wiley & Sons, New York, 1979)
14 VLADIMIRESCU, A., ZHANG, K., NEWTON, A.R., PEDERSON, D.O., and SANGIOVANNI-VINCENTELLI, A.: ‘SPICE version 2G user’s guide’, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Ca., 10 August 1981
15 MAVOR, J., JACK, M.A., and DENYER, P.B.: ‘Introduction to MOS LSI design’ (Addison Wesley, London, 1983)
16 WESTE, N., and ESHRAGHIAN, K.: ‘Principles of CMOS VLSI design’ (Addison Wesley, Reading, Mass., 1985)
17 HILTEBEITEL, J.S.: ‘CMOS XOR’, IBM Tech. Disclosure Bull., 1984, 27, p. 2639
18 GLASSER, L.A., and DOBBERPUHL, D.W.: ‘The design and analysis of VLSI circuits’ (Addison Wesley, 1985)
19 SPAANENBURG, L., POLLOK, W., and VERMEULEN, W.: ‘Novel switched logic CMOS latch building block’, Electron. Lett., 1985, 21, pp. 398-399
20 VEENDRICK, H.J.M.: ‘Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits’, IEEE J., 1984, SC-19, pp. 468-473