
690 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. SC-19, NO. 5, OCTOBER 1984

A High Performance Floating Point Coprocessor

GIL WOLRICH, EDWARD McLELLAN, LARRY HARADA, JAMES MONTANARO,
AND ROBERT A. J. YODLOWSKI

Abstract: A 34000 transistor single-chip floating point coprocessor
fabricated in 3 µm double metal NMOS technology is described. The
fraction data path, including a shifter and 60 bit carry propagate ALU, is
cycled in 100 ns for all operations requiring less than 19 bits of consecutive
carry. A versatile carry length detection scheme, which requires minimal
additional logic, is used to extend the microcycle for the small percentage
of operations in which a long carry exists. Three bit per cycle multiplication
and one and one half bit per cycle division algorithms were used to
achieve excellent overall performance.

TABLE I
FLOATING POINT DATA TYPES

F (floating)    1 bit sign, 8 bit exponent, 24 bit fraction
D (double)      1 bit sign, 8 bit exponent, 56 bit fraction
G (VAX only)    1 bit sign, 11 bit exponent, 53 bit fraction
Integers        32 bit integers, 16 bit integers

INTRODUCTION

THIS paper describes a single-chip floating point accelerator
(FPA) for the J-11, 16/32 bit microprocessor chip set. Fabricated
in a 3 µm drawn double-metal NMOS process, the FPA implements a
floating point instruction set of 46 instructions. The FPA supports
four data types: single and double precision floating point and
16 and 32 bit integer (Table I).
The FPA measures 7.6 × 6.3 mm, contains 34000 transistors,
and dissipates 2 W. It is packaged in a 40 pin DIP.
A photomicrograph is shown in Fig. 1. The chip contains
six externally addressable floating point data registers, a
floating exception code register, and a floating point status
and mode control register. The principal functional blocks
in the FPA are the EU (execution unit), consisting of
fraction, exponent, and sign datapaths, and the BIU (bus
interface unit).

MICROARCHITECTURE

The fraction processor is a 60 bit wide data path (Fig. 2)
consisting of 7 single-ported RAM registers, 8 ROM constants,
an ALU, an 8-position argument shifter (left 2 to right 5),
2 operand registers, a Q register (multiplier/quotient/output),
a 4-position Q shifter (left 2, left 1, no shift, or right 3),
and an input register. The fraction shifters provide sufficient
range for 3 bit/cycle multiplication, 1.5 bit/cycle variable shift
division, 5 bit/cycle alignment, and 3 bit/cycle normalization
algorithms.
The EU sequencer contains a two-bank folded PLA with
13 inputs, 160 total AND terms, and 36 outputs. The
exponent processor is a 10 bit data path used to process the
exponent and sign of floating point operands in parallel
with the fraction. The BIU controls all I/O for the FPA
and contains a second state sequencer allowing operation
independent of the EU.

Fig. 1. Photomicrograph of FPA.

Manuscript received March 22, 1984; revised May 16, 1984.
The authors are with the Digital Equipment Corporation, Hudson, MA
01749.
0018-9200/84/1000-0690$01.00 ©1984 IEEE

CARRY LENGTH DETECTION

Hardware floating point processing is characterized by
very wide data paths. A great deal of hardware in the form
of carry lookahead schemes or carry save adders is often
used in computing systems to reduce the penalty associated



[Fig. 2: block diagram. Recoverable labels: ROM (constants); B register (multiplicand/divisor); register for partial product/remainder; shifter (L2 to R5); Q shifter (L1, L2, R3, 0); 60 bit paths; data bus.]
Fig. 2. Block diagram of fraction data path.

[Fig. 3: circuit diagram. Propagates P0 to P59 in 5 bit groups; group propagates also used for carry lookahead; ENABLE DOUBLE PRECISION and ALLOW STUTTER gating; output STUTTER to clock circuit; MINIMUM STUTTER = 10, MAX NOT STUTTER = 18.]
Fig. 3. FPA 60 bit ALU "stutter circuit."

with long carry propagation delays. The FPA, by comparison,
achieves a fast (100 ns) microcycle time, including a 60
bit ALU operation, using a new technique well suited to
VLSI applications, as well as designs using standard parts.
A simple carry length detection scheme is used to produce
a stutter signal that stretches the final phase of the
EU clock if a long carry propagation path exists. The
method takes advantage of the fact that most ALU operations
have a maximum carry length which is
much less than the width of the ALU. By detecting a long
carry and providing additional time for the ALU to complete
for a small percentage of operations, a data path can
be run at a fast rate for most (>95 percent) ALU cycles.
Fig. 3 shows the stutter circuit for the 60 bit wide ALU
used in the J-11 floating point accelerator chip. In Fig. 3
the propagates produced in bit positions 54 to 5 of the
ALU's PG (propagate generate) logic are gated with a
minimum of logic to produce a detection signal, stutter, for
all operations which have a carry length of 19 or greater.
Two levels of AND gating are used because the first level of
gating is already present for the minimal 5 bit group carry
lookahead logic. The ANDing of two group propagate signals
indicates whether or not all of the propagates in that
group of 10 bits are asserted.
Single precision processing uses only the upper half of
the fraction data path. In order to avoid unnecessary
stutter cycles caused by data in the lower half of the data
path, an additional enable signal is included in the detection
gates covering bits 34 to 5. The stutter signal may set
for as few as 10 consecutive propagates, but might not set
for as many as 18. A propagate is a necessary but not
sufficient condition to imply an actual carry. For this
reason, an allow stutter signal is used to gate the stutter
signal to the clock circuitry. The allow stutter signal is not
set for ALU operations in which all bit positions will
produce a generate. Unnecessary stutter cycles are therefore
prevented for ALU operations in which carry propagation
is not a factor (for example, A + A generates a carry at all
bit positions).
The optimal width of the AND function used to produce
the group propagate signals will depend on the technology
in which the ALU is to be implemented. Fig. 4 shows an
example of the stutter circuit where the width of the group
propagate is 4 bits. If the individual propagate bits are not
available, as in standard part ALU slices, then the ANDing
of the group propagates can be used directly. The stutter
circuit is very inexpensive to implement, due to the fact
that the group propagates are already required for any
ALU with lookahead carry.
Propagate terms become valid after a fixed combinational
delay, allowing enough time to control the length of a
system clock phase. The current design uses only one
stutter signal to force a 100 ns clock phase extension
sufficient for the longest possible carry. A second stutter
signal detecting wider carry lengths could also be used to
more finely tune the clock phase extension (Fig. 5).
The difference between the minimum carry length which
always stutters and the minimum carry length which might
stutter can be reduced by adding additional sets of AND
functions which produce overlapped groupings of consecutive
propagates (Fig. 6).

[Fig. 4: circuit diagram. Propagates P0 to P31 in 4 bit groups; ALLOW STUTTER gating; output STUTTER; MIN STALL = 8, MAX NOT STALL = 14.]
Fig. 4. Stutter detect for 32 bit ALU with 4 bit lookahead groupings.

For the circuit described in Fig. 3 the probability of
requiring a stutter on random data is
$$P = \frac{w - m}{n} \cdot \frac{2^{\,w-n}}{2^{\,w}}$$

where

w = width of the ALU
m = total bits not included in any detection gate
n = width of the detection gates

(this equation is an upper bound since data which sets more than 1
detection gate are counted more than once).

double precision: w = 60, m = 10, n = 10, P = 5/1024
single precision: w = 30, m = 10, n = 10, P = 2/1024.

The 60 bit ALU/shifter cycle of the J-11 FPA can be
completed in 100 ns for all operations in which the maximum
consecutive carry is less than 19 bits.

ALGORITHMS

The FPA executes four data path assisted microinstructions
which serve as the basis for executing the multiplication,
division, alignment, and normalization algorithms.
The FPA uses a fixed 3 bit shift algorithm to perform
multiplication. The algorithm requires the generation of
multiples 0, 2, 4, 6, and 8 times the multiplicand. The
multiples are added to or subtracted from the partial product
and the result is shifted to account for the three multiplier
bits retired. The required shift and ALU function for each
cycle is determined by examining the multiplier bits from
the LSB's of the Q register (bits 38:35 for F, bits 5:2 for
D). Since the main data path has only one shifter, the total
shift required for each cycle combines the shift necessary to
align the binary point for the present multiple, and the
post shift required to complete the 3 bit retirement from
the previous cycle. The previous group of multiplier bits
is held in a delay register which is initially cleared, allowing
the algorithm to begin without any shifts being owed
from a previous cycle.
Prior to the start of the multiplication, 3/4 times the
multiplicand is calculated and placed in the scratch register
for use in generating the multiple of 6. If the multiples 2, 4,
or 8 are required, the normal multiplicand register is
accessed and the partial product is appropriately shifted.
When a multiple of 6 times the multiplicand is required,
the contents of the scratch register are used instead of the
multiplicand and added to the partial product shifted right
3 times. A special microinstruction is used to establish an
initial partial product of either zero if the LSB of the
multiplier is zero, or minus one times the multiplicand if
the multiplier LSB is a one. Table II details the single
shifter 3 bit multiplication algorithm implemented in the
FPA.
The FPA uses a normalizing nonrestoring division algorithm
which produces a quotient at a rate of 1.5
bits/cycle. If the partial remainder will be normalized for
[Fig. 5: circuit diagram. Propagates P0 to P59 in 5 bit groups; ENABLE DOUBLE PRECISION and ALLOW STUTTER gating; outputs STUTTER 1 (MIN STALL = 10, MAX NOT STALL = 18) and STUTTER 2 (MIN STALL = 20, MAX NOT STALL = 28).]
Fig. 5. Stutter circuit with second tier.

division (−1 < R < −1/2, 1/2 < R < 1) by a left shift of
one, then one new quotient bit is determined; when the
partial remainder requires more than a single left shift to
become normalized, two quotient bits are determined. Table III
describes the next shift, ALU operation, and quotient
bits derived as a function of the MSB's of the partial
remainder. If the 4 MSB's of the partial remainder equal
all ones or all zeros then a left shift of 2 will not normalize
the present partial remainder and the next cycle ALU
operation is A ← A. This ensures that the next partial
remainder will remain in the range −1/2 < R < 1/2. If the
present partial remainder can be normalized by a left shift
of one or two, the next ALU operation adds or subtracts
the divisor depending on the sign of the remainder in order
to drive it toward zero. The quotient bits are inserted at the
guard bit and LSB positions of the Q register (bits 35:34
for F, bits 3:2 for D) and the Q register shifts either 1 or 2
bits left as required. When Q57 = 1, a Q shift of left one is
forced and only one more quotient bit is accepted. When
Q58 = 1, the division is completed and the normalized
quotient is in the Q register. The normalized quotient can
be in the range 1/2 < Q < 1 or 1 < Q < 2 depending on the
ratio of the initial dividend and divisor. If the initial
subtraction of the divisor from the dividend is positive,
then 1 < quotient < 2 and the final exponent is incremented.
The important feature of the FPA alignment and normalization
algorithms is that although the main shifter has
limited range, the shift probability data for floating point
addition and subtraction (Table IV) show this range to be
all that is required for most operations. The FPA performs
78 percent (up to 5 bits of exponent difference) of the
[Fig. 6: circuit diagram. Propagates P0 to P59 in 5 bit groups feeding overlapped 10 bit detection gates; output STUTTER; MIN STALL = 10, MAX NOT STALL = 13.]
Fig. 6. Stutter circuit with overlapped detection gates.
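
The detection logic of Fig. 3 and the probability bound given for it can be sketched in software to check the quoted run lengths. This is a minimal model under stated assumptions, not the circuit itself: the pairing of adjacent 5 bit group propagates into five 10 bit gates over bits 5 to 54 is inferred from the text, and the names `stutter`, `run`, and `stutter_bound` are illustrative.

```python
WIDTH = 60   # fraction ALU width in bits
GROUP = 5    # width of one carry-lookahead group

# Assumed gate layout: each detection gate ANDs two adjacent group
# propagates, giving 10 bit gates over bits 5-14, 15-24, ..., 45-54.
GATES = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
# Gates lying in bits 34..5; these carry the extra enable and are
# switched off for single precision.
LOWER_GATES = {(1, 2), (3, 4), (5, 6)}


def group_propagates(p):
    """p[i] is the propagate of bit i; returns one AND per 5 bit group."""
    return [all(p[g * GROUP:(g + 1) * GROUP]) for g in range(WIDTH // GROUP)]


def stutter(p, allow_stutter=True, double_precision=True):
    """True when a long propagate run is detected and stutter is allowed."""
    gp = group_propagates(p)
    for lo, hi in GATES:
        if (lo, hi) in LOWER_GATES and not double_precision:
            continue
        if gp[lo] and gp[hi]:
            return allow_stutter
    return False


def run(start, length):
    """Propagate vector with `length` consecutive ones starting at bit `start`."""
    return [start <= i < start + length for i in range(WIDTH)]


def stutter_bound(w, m, n):
    """Upper bound ((w - m) / n) * 2**(w - n) / 2**w on the stutter probability."""
    return ((w - m) / n) * 2 ** (w - n) / 2 ** w


if __name__ == "__main__":
    print(stutter(run(5, 10)))   # 10 bit run aligned to a gate
    print(stutter(run(6, 18)))   # 18 bit run straddling gate boundaries
    print(stutter_bound(60, 10, 10), stutter_bound(30, 10, 10))
```

Under this layout a 10 bit run aligned to a gate sets the signal, an 18 bit run can slip between gates, and every run of 19 or more is caught, which matches the figures quoted for Fig. 3.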

TABLE II
FPA 3 BIT/CYCLE MULTIPLICATION ALGORITHM

                  Present      Alignment      Previous     Shift Owed
                  Multiplier   Shift for      Multiplier   from
ALU/REG           Group        Present        Group        Previous
A←A               0000         3              0000         0
A←A+B; B=AC       0001         1              0001         2
A←A+B; B=AC       0010         1              0010         2
A←A+B; B=AC       0011         2              0011         1
A←A+B; B=AC       0100         2              0100         1
A←A+B; B=SCR      0101         3              0101         0
A←A+B; B=SCR      0110         3              0110         0
A←A+B; B=AC       0111         3              0111         0
A←A−B; B=AC       1000         3              1000         0
A←A−B; B=SCR      1001         3              1001         0
A←A−B; B=SCR      1010         3              1010         0
A←A−B; B=AC       1011         2              1011         1
A←A−B; B=AC       1100         2              1100         1
A←A−B; B=AC       1101         1              1101         2
A←A−B; B=AC       1110         1              1110         2
A←A               1111         3              1111         0

AC ← Multiplicand
SCR ← 3/4 Multiplicand
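
The multiples selected in Table II are consistent with the recoding below, shown as a purely arithmetic sketch. It is an inference from the table rather than a derivation given by the authors: each 4 bit group overlaps its neighbor by one bit and selects an even multiple between −8 and +8, and the initial partial product of minus the multiplicand compensates for an odd multiplier. The function names are illustrative, and the model ignores the single-shifter bookkeeping (alignment shift, shift owed, and the 3/4 multiplicand held in the scratch register).

```python
def recode_group(group):
    """Map a 4 bit multiplier group (b3 b2 b1 b0) to a multiple in -8..8."""
    b0 = group & 1
    b1 = (group >> 1) & 1
    b2 = (group >> 2) & 1
    b3 = (group >> 3) & 1
    # b3 is also the next group's b0: it counts as -8 here and as +2
    # (scaled by 8) there, for a net weight of +8.
    return 4 * b2 + 2 * b1 + 2 * b0 - 8 * b3


def multiply_3bit(multiplier, multiplicand, width=24):
    """Multiply by adding one even multiple per 3 retired multiplier bits."""
    if not 0 <= multiplier < (1 << width):
        raise ValueError("multiplier must fit in `width` unsigned bits")
    # Initial partial product: 0 for an even multiplier,
    # minus the multiplicand for an odd one.
    partial = -multiplicand if multiplier & 1 else 0
    cycles = (width + 2) // 3          # ceil(width / 3)
    for k in range(cycles):
        group = (multiplier >> (3 * k)) & 0xF
        partial += (recode_group(group) * multiplicand) << (3 * k)
    return partial


if __name__ == "__main__":
    print([recode_group(g) for g in range(16)])
    print(multiply_3bit(0xABCDEF, 0x123456) == 0xABCDEF * 0x123456)
```

With `width=24` the loop runs 8 times, so in this model a 24 bit fraction is retired in 8 add or subtract steps.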

TABLE III
FPA 1.5 BIT/CYCLE DIVISION ALGORITHM

                                       Quotient Bit(s) Formed If Partial
                                       Remainder Is Derived by
                   Next     Next       ADD/SUB        Shift Left 2
F59 F58 F57 F56    Shift    ALU        Q3  Q2         Q3  Q2
 0   0   0   0     SHL2     A←A        1   0          0   0
 0   0   0   1     SHL2     SUB        1   0          0   0
 0   0   1   0     SHL1     SUB        1   0
 0   0   1   1     SHL1     SUB        1   0
 1   1   0   0     SHL1     ADD        0   1
 1   1   0   1     SHL1     ADD        0   1
 1   1   1   0     SHL2     ADD        0   1          1   1
 1   1   1   1     SHL2     A←A        0   1          1   1

F59 = Fraction MSB
Q3 = Quotient Register LSB
Q2 = Quotient Register Guard Bit

Note: Quotient bit(s) are inserted at bit positions Q35 and Q34 for single
precision operation instead of Q3 and Q2.
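
The cycle columns of Table IV follow a simple pattern for shifts of 0 to 26 bits, and the single-cycle percentages quoted in the text can be cross-checked against the table. The closed-form cycle rules below are inferred from the table entries, not stated by the authors, and the function names are illustrative. The alignment total also assumes that exponent differences of 54 to 255 complete in one cycle, which the table does not state.

```python
import math


def align_cycles(shift):
    """Alignment cycles for an exponent difference of 0..26 bits (5 bits/cycle)."""
    return max(1, math.ceil(shift / 5))


def normalize_cycles(left_shift):
    """Normalization cycles for a left shift of 0..26 bits."""
    if left_shift <= 2:
        return 1
    # Assumed reading of the text: a first cycle shifting up to 2 bits,
    # one no-shift cycle to examine more fraction bits, then 3 bits/cycle.
    return 2 + math.ceil((left_shift - 2) / 3)


# Percentages transcribed from Table IV.
ALIGN_PCT_0_TO_5 = [26.28, 13.29, 8.77, 6.53, 7.77, 8.07]
ALIGN_PCT_54_TO_255 = 7.30
# Result 0, overflow (right one), and left shifts of 0, 1, 2.
NORMALIZE_PCT_ONE_CYCLE = [0.82, 17.43, 60.25, 8.01, 3.17]

single_cycle_aligns = sum(ALIGN_PCT_0_TO_5) + ALIGN_PCT_54_TO_255
single_cycle_normalizes = sum(NORMALIZE_PCT_ONE_CYCLE)

if __name__ == "__main__":
    print(round(single_cycle_aligns, 2), round(single_cycle_normalizes, 2))
```

Under these assumptions the totals come to roughly 78 percent of alignments and 90 percent of normalizations, in line with the figures quoted for the FPA.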

TABLE IV
WEIGHTED DATA ON ALIGNMENT AND NORMALIZATION*

             Alignments   FPA Align   Normalizations   FPA Normalize
Shift        (percent)    Cycles      (percent)        Cycles
Result 0                               0.82            1
Overflow                              17.43            1
0            26.28        1           60.25            1
1            13.29        1            8.01            1
2             8.77        1            3.17            1
3             6.53        1            1.54            3
4             7.77        1            1.33            3
5             8.07        1            1.03            3
6             5.13        2            0.63            4
7             3.78        2            1.02            4
8             1.10        2            0.93            4
9             1.23        2            0.56            5
10            1.84        2            0.66            5
11            1.35        3            0.16            5
12            1.54        3            0.22            6
13            0.81        3            0.31            6
14            0.48        3            0.16            6
15            0.58        3            0.03            7
16            0.29        4            0.06            7
17            0.31        4            0.09            7
18            0.50        4            0.17            8
19            0.32        4            0.15            8
20            0.26        4            0.08            8
21            0.40        5            0.07            9
22            0.30        5            0.07            9
23            0.24        5            0.19            9
24            0.25        5            0.49            10
25            0.26        5            0.09            10
26            0.16        6            0.28            10
27-53         0.86
54-255        7.30

*Reference: D. W. Sweeney [1].

TABLE V
J-11 FPA TYPICAL REGISTER-TO-REGISTER EXECUTION TIMES

ADDF/SUBF    1.1 µs
MULF         1.6 µs
DIVF         2.7 µs
ADDD/SUBD    1.1 µs
MULD         2.8 µs
DIVD         4.7 µs

F (floating) = 1 bit sign, 8 bit exponent, 24 bit fraction
D (double)   = 1 bit sign, 8 bit exponent, 56 bit fraction

aligns and 90 percent of the normalizations with a single
shift cycle. The average number of shift cycles is 1.5 for
alignment and 1.2 for normalization. Alignment proceeds
at a 5 bit/cycle rate if the exponent difference is between 6
and the length of the data type. Only one cycle is needed
for normalizations requiring shifts in the range of right one
to left two. If more than 2 bits of left shifting is required, a
no shift cycle occurs to examine additional fraction bits.
Normalization then proceeds at 3 bits/cycle. The left shift
function of the fraction ALU is combined with the 2 bit
left shift capability of the fraction shifter to accomplish the
3 bit shift.

PERFORMANCE

The combination of a 100 ns internal cycle and optimized
arithmetic algorithms provides excellent performance.
Table V shows typical execution times for register-to-register
operations.
The J-11 FPA interfaces as a true coprocessor. The BIU
inputs all instruction stream data and decodes instructions
in parallel with the base processor. Support microcode in
the CPU initiates all I/O cycles required by the FPA. Because
the FPA is a coprocessor, floating point instruction execution
can occur simultaneously with integer code. This overlap can
effectively be used to reduce the execution time of mixed code
by interleaving floating point and nonfloating point instructions.
A second and more frequent type of instruction overlap
which provides a substantial performance gain in floating
point intensive code is also achieved by the FPA. The BIU
supports the overlap of operand data loading for the next
floating point instruction, while the EU completes the
processing of the current instruction. The FPA asserts a
stall signal to prevent the CPU from initiating more than
one new floating point instruction while the EU is still
busy. Only the portion of FPA execution time, if any, that
the CPU is stalled actually affects system performance.
Two additional floating point processor chips have been
developed by extending the J-11 FPA design to the VAX
architecture. The G floating point format and the extended
multiply and integerize instructions are supported by increasing
the fraction and exponent data paths to 67 and 13
bits, respectively. The three designs achieve similar performance.
The carry length detection method and the 3 bit
multiplication are especially beneficial in applications requiring
wide data paths, as evidenced by the performance
of these three floating point processor chips.

REFERENCES

[1] D. W. Sweeney, "An analysis of floating-point addition," IBM Syst.
J., vol. 4, pp. 31-42, 1965.
[2] O. L. MacSorley, "High-speed arithmetic in binary computers,"
Proc. IRE, vol. 49, pp. 67-91, 1961.
[3] E. Swartzlander, Ed., Computer Arithmetic. New York: Dowden,
Hutchinson, and Ross, 1980.

Gil Wolrich received the B.S. degree in electrical
engineering from Rensselaer Polytechnic Institute,
Troy, NY, in 1971, and the M.S. in
electrical engineering from Northeastern University,
Boston, MA, in 1978.
He joined the Digital Equipment Corporation,
Hudson, MA, in April 1979. He is currently a
Principal Engineer with the Semiconductor Engineering
Group in Hudson, MA.

Edward McLellan received the B.S. degree in
computer and systems engineering from Rensselaer
Polytechnic Institute, Troy, NY, in 1980.
He is currently a Senior Engineer with the
Semiconductor Engineering Group of the Digital
Equipment Corporation, Hudson, MA.

Larry Harada received the B.S. degree in electrical
engineering in 1980, and the M.E. degree in
1981, both from Cornell University, Ithaca, NY.
He joined Digital Equipment Corporation,
Hudson, MA, in July 1981. He is currently working
for the Digital LSI Manufacturing Group in
Hudson.

James Montanaro received the B.S. and M.S.
degrees in electrical engineering from Massachusetts
Institute of Technology, Cambridge, MA, in
1980.
Prior to joining Digital Equipment Corporation,
Hudson, MA, in 1982, he was with Integrated
Circuit Systems, Incorporated, Westboro, MA.

Robert A. J. Yodlowski received the B.S. degree
in engineering physics from Cornell University,
Ithaca, NY, in 1968, and the M.S. degree in
electrical engineering from Syracuse University,
Syracuse, NY, in 1970.
He has been employed with the Digital
Equipment Corporation, Hudson, MA, since
1977. He is currently a Principal Engineer with
the Semiconductor Engineering Group in Hudson,
MA. His interests include MOS circuit and
logic design.