
Cover Page

Subject: BAA 99-02


Technical Topic Area:
Adaptive Computing Systems: Dynamic Reconfiguration
Proposal Title: Adaptive Arithmetic
Technical and Administrative Points of Contact:
Professor Michael J. Flynn (P.I.)
Professor Martin Morf
Susan M. Gere (Admin)
Phone: 650.723.1450 (Flynn)
650.723.0140 (Morf)
650.723.1559 (Gere)
650.725.6949 (Fax)
E-mail: flynn@ee.Stanford.edu
morf@umunhum.Stanford.edu
gere@csl.Stanford.edu
Mail: Electrical Engineering Dept.
Gates Computer Science Building 3A
353 Serra Mall
Stanford, CA 94305-9030
Summary of costs: Total base cost $1,974,610
Year-1 $353,845, Year-2 $532,790, Year-3 $548,425, Year-4 $539,550.
Contractor's type of business: Other Educational


A Innovation Claim
This proposal is aimed at advancing both the reconfiguration performance and the execution performance of signal processing applications, by at least 100 times, using a coarse-grain array of adaptive arithmetic cells (AACs).
Arithmetic operations are fundamental to many of the most important commercial and military applications of signal processors, wireless and wired communications processors, and graphics- and video-oriented processors. Efficient execution of arithmetic operations is fundamental to signal processing and communications. The objective of this proposal is to develop a reconfigurable array implementation technology in which each cell consists of a small reconfigurable multiplier-type array called an adaptive arithmetic cell (AAC). The AAC can be configured to perform various basic and higher-level (logic, trig, etc.) arithmetic operations in the execution time of a processor cycle. Underlying this study is the efficient implementation of a multiplier-adder cell with the space-time efficiency of a custom design. We propose a hybrid approach based on adaptive arithmetic cells.
Our adaptive arithmetic (AAC) approach uses a conventional reconfigurable array of cells. However, each cell contains a partial product array (multiplier-array type) with a small number of FPGA-like gates as inputs. By properly configuring the input gates, the partial product array can implement any of the basic math functions (reciprocal, divide, square root, log, exponential, the trig functions, etc.) and execute such operations in the same time as a multiplication.
The purpose of this research is to show the feasibility of building highly adaptable coarse-grain reconfigurable units, each of which can execute even the most difficult basic math functions (trig functions, log functions, etc.) in the same time as a short multiplication. The execution rate of many signal processing arithmetic applications can be enhanced by more than three orders of magnitude, even over traditional custom designs.
Since the reconfiguration of a cell affects a relatively small number of gates (c. 100), the reconfiguration information is small and is contained in one large register. This allows multiple reconfiguration registers within a cell, so that reconfiguration is accomplished simply by switching registers, improving reconfiguration time by a factor of at least 1000.


B Technical Rationale
We propose to develop robust adaptive functional cells that consist of efficient custom arithmetic units but are dynamically configured to execute many mathematical functions. The cell ensemble is capable of supporting concurrent operation execution on a broad range of arithmetic-intensive applications.
Arithmetic operations, including both integer and floating-point add, subtract, multiply, and divide, form the basic building blocks of scientific computation, signal processing, and communications. Yet the optimization of their performance is usually relegated to design exercises in which the building blocks (the circuits and package technology) have been predefined and only the algorithm itself and its implementation can be altered to achieve performance.
The Stanford subnanosecond arithmetic processor (SNAP) research effort has targeted the full spectrum of tradeoffs, from ASIC devices, materials, interconnections, and circuits to algorithms and organization.
There are many ways to incorporate flexibility into functional units. At one extreme is a baseline implementation using conventional FPGAs to realize the various arithmetic functions; at the other extreme are ASICs.
Our hybrid approach combines these to form a coarse-grain cell consisting of a function approximation unit that flexibly uses a multiplier array to realize rapid approximations to a broad variety of elementary functions (square root, log, trig functions, polynomials, etc.). These cells are interconnected with conventional FPGA technology.
The use of a smaller number of arithmetically robust cells has the following advantages:
1. A small number of bits defines a cell function and state. This enables us to use multiple configuration registers, each of which defines a complete cell function.
2. Each cell executes complete multiply-add functions at processor speeds.
3. Fewer cells imply fewer interconnect channels (interconnect overhead). This promotes the ability to increase intracell signal path width and decreases the interconnect status bit requirements.
The outline of our approach is:
1. The cell, when properly configured, can realize an approximation to any basic math function or any polynomial in the time of a multiplication. Precision of

Figure 1: Adaptive Arithmetic Cell (AAC) in a high-speed ACS cellular array. The cell comprises switchable configuration registers, data-input reconfigurable logic gates, the partial product array (PPA), and a result register, all attached to the interconnection network.


the result is 10-20 bits, depending on the size of the partial product array and the function realized. A discussion of the theory follows.
2. AAC cells can be combined to extend the precision of a function or to realize a composite function, as basic as an add or multiply or as complex as multiple polynomial evaluations.
3. Each cell operates synchronously at typical processor cycle time rates.
4. A small number of registers in each cell hold switchable reconfiguration information, so a cell can be functionally reconfigured in one cycle, greatly improving reconfiguration time.

Function Approximation Theory


Traditionally, look-up tables and rational approximations have been used to accelerate division and other high-order arithmetic operations. In our study [9], a non-traditional approach of back-solving logic equations is used. We extend prior research in this
Figure 2: An example: finding the reciprocal Q = 1/B given the bit values of B = 0.1 b2 b3 b4 b5. The problem is expressed in the form of a multiplication, B × Q = 0.1111..., where the product is known and one of the operands (Q) is unknown.

area [10, 6, 5, 3] by creating a generic method for describing approximations, and we have developed a unique implementation theory using the multiplier array.
These approximations are described in terms that use the partial product array of a multiplier. In binary multiplication, a partial product consists of the bits of one operand logically ANDed with a single bit of the other; each row in the partial product array is formed by 2-input logical ANDs of a multiplier bit with the multiplicand bits. The partial products are summed by a large counter tree and a carry-propagate adder to form the product. The partial product array can be generalized to create an optimal pp array which describes an approximation to any elementary or polynomial function: the array elements that are 2-way ANDs are replaced with a type of generalized Boolean logic gate. By using an array similar to the multiplier partial product array (or pp array), the counter tree and adder can be reused. The use of a pp array to describe an approximation of a function while reusing an existing multiplier was pioneered by our SNAP project.

Finding the Reciprocal Using a Multiplier


We illustrate function approximation theory using the reciprocal function. Any function expressible as A × B = C, where any two of the terms (say, A and C) are known, can be solved for the third. In the case of the reciprocal Q = 1/B, the multiplication B × Q = 0.111... is back-solved in terms of the multiplier Q.

Figure 3: Rewriting the columns of Figure 2 as equations; each column sum is set equal to 1, and the bits q0, q1, q2, ... are the unknowns.

Figure 4: Solving for q0, q1, q2, ... in terms of the known bits b2, ..., b5.

By choosing the quotient digits appropriately in a redundant notation, each column of the pp array forms an independent equation. These Boolean equations are solved to yield algebraic equations for five digits of the quotient (Figure 4), and these equations are used to form the pp array of the approximation.
This form then undergoes several transformations:
1. Reduction of the Boolean elements, applying reduction rules such as b_i' = 1 - b_i (complementation) and 2 b_i in column i = b_i in column i-1 (carrying a doubled bit to the next column).
2. Adding additional terms to reduce the maximum error.
3. Eliminating negative elements by applying -b_i = b_i' - 1, producing all-positive terms and negative constants.



Figure 5: The equations of Figure 4 are now put in summable partial product form; e.g., q1 = 1 - b2 is now expressed as a sum of single-bit terms and constants.

Figure 6: A multiplier PP array can only sum positive binary {0, 1} bits, so the column entries of Figure 5 are transformed by our AAC function theory into all-positive Boolean terms (entries such as complemented bits and gated combinations like b2|b3).

4. Since the highest-order column has q0 = 1, it represents the most significant bit (1.000...0). We can subtract all negative constants from it, producing positive column constants.

In the case of the reciprocal, this procedure yields the final all-positive pp array for the function.
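As a sanity check on what is being solved for, the following Python sketch (ours, illustrative only) back-solves B × Q ≈ 1 for the bits of Q by a greedy numeric search. It shows the quantity that the Boolean column equations deliver in closed form; it is not the pp-array method itself.

    def back_solve_reciprocal(b_bits, n_out=8):
        # B = 0.1 b2 b3 b4 b5 ... in binary, so B lies in [0.5, 1) and Q = 1/B in (1, 2].
        B = 0.5 + sum(bit * 2.0 ** -(i + 2) for i, bit in enumerate(b_bits))
        q_bits, Q = [], 0.0
        for k in range(n_out):            # choose Q's bits from the most significant down
            trial = Q + 2.0 ** -k         # q0 has weight 1, q1 weight 1/2, ...
            keep = trial * B <= 1.0       # keep the bit only if B * Q stays <= 1
            q_bits.append(1 if keep else 0)
            if keep:
                Q = trial
        return q_bits, Q

    bits, approx = back_solve_reciprocal([1, 0, 1, 0])   # B = 0.8125
    print(bits, approx, 1 / 0.8125)                      # approx converges to 1.2307...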


If a large multiplier (IEEE standard floating point) is available, such as one having 53 rows, a PPA can be developed which has a minimum of 12 correct bits for the reciprocal function. This high-precision approximation can be used as a starting approximation to accelerate a multiplicative division algorithm. A smaller tree produces almost as good results: 25 rows give 9-bit accuracy.
This technique has been developed for many operations, such as the square root, logarithm, exponential, and several trigonometric functions. For the square root operation, the pp array gives a minimum of 16 bits of accuracy.
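For context on how such a seed would be used, here is a small Python sketch (ours; the proposal text does not prescribe a particular iteration) of the standard Newton-Raphson reciprocal refinement that a 9-12 bit PPA approximation could seed; each iteration roughly doubles the number of correct bits.

    def refine_reciprocal(b, seed, iterations=2):
        # Newton-Raphson for 1/b: x <- x * (2 - b * x); quadratic convergence,
        # so each iteration roughly doubles the number of correct bits.
        x = seed
        for _ in range(iterations):
            x = x * (2.0 - b * x)
        return x

    b = 0.8125
    seed = round((1.0 / b) * 2 ** 12) / 2 ** 12          # emulate a seed correct to ~12 bits
    print(abs(refine_reciprocal(b, seed) - 1.0 / b))     # error near double-precision rounding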


Figure 7: The PP array of a 53-bit direct multiplier: 53 partial products, 53 elements per row, 105 columns. Up to 53 bits are summed in each column.
Further research is aimed at overcoming the irregularity of pp arrays in physical design. Figures 7, 8, and 9 show the inputs of the original multiplier. Figure 8 shows the modifications that must be made to the multiplier to incorporate the required Boolean bit functions as inputs to the PPA in order to perform, for example, the reciprocal or divide function. Figure 9 shows the columns that are affected in the case of the square root.

Other Functions
In principle, any equation of the form A × B = C can be solved for A, B, or C given the other two parameters. Clearly, C is found immediately by the multiplier; A or B can be found by using the multiplier to back-solve the equation. In this manner, we can find the reciprocal 1/B, since B × reciprocal = 1; or the quotient, since quotient × denominator = numerator; or the square root, since sqrt(x) × sqrt(x) = x. Other more complex functions can also be solved with this technique. In extending the technique to functions such as log, exponential, and the trig functions, the rearrangement of the function into multiplicative form is not obvious. It usually involves two substeps:


Figure 8: Implementation using a direct multiplier. (A) Multiplier before adaptation: 2809 AND elements feed the counter tree, carry-propagate adder, and normalizer. (B) Adapted multiplier: fewer than 500 Boolean elements (less than 20 percent of the array) are multiplexed between the ordinary multiply inputs and the auxiliary-operation inputs (AND, OR, or any Boolean function of the operand bits). The AAC multiplier is adapted by setting input configuration gates.
1. Expressing the operands as polynomials or other suitable analytic functions.
2. Differentiating the function.
For example, suppose W(x) = ln V(x), so that W'(x) = (1/V(x)) V'(x), i.e.,

W'(x) · V(x) = V'(x).

Since W(x) is simply the number representation w_{n-1} 2^{n-1} + w_{n-2} 2^{n-2} + ... + w_0, where each w_i is 0 or 1, W'(x) is simply another polynomial with constant-term multipliers. Knowing ln in this multiplicative form, we can back-solve for the unknown bits.


Figure 9: A 17-digit reciprocal PPA (53 rows, 18 columns) superimposed on a direct multiplier's PPA, providing a minimum error of 2^-13. Notice that, using a large uninverted multiplier, rather little of the PPA hardware is actually used.
The trig functions are determined by differentiating sin^-1, so that

W(x) = arcsin V(x),
dW(x)/dx = (1 / sqrt(1 - V(x)^2)) · dV(x)/dx,

and hence

sqrt(1 - V(x)^2) · W'(x) = V'(x),
(1 - V(x)^2) · W'(x)^2 = V'(x)^2.

The polynomial forms of the arguments are

V'(x) = v1 + 2 v2 x + 3 v3 x^2 + ... = sum_{i=1}^{N} i v_i x^{i-1},

and similarly W'(x) = sum_i i w_i x^{i-1}.

This result has several important consequences:
1. Almost all of the known elementary functions can be computed, usually to within 12-20 bits of precision, in the same time it takes to do one multiplication.
2. The hardware cost of this adaptation consists merely of a few multiplex and logic gates which are input to certain of the partial product array rows.
3. Each resulting adaptive multiplier function unit can be dynamically reconfigured in less than one cycle to perform any of the basic math functions.
In principle, it ought to be possible to combine cells either to extend precision or to provide parallel, concurrent execution of various signal processing functions.

The Cell Granularity


Most of our past work on the AAC has assumed the reuse of an existing (usually double-precision IEEE floating point) multiplier. Indeed, almost all modern microprocessors have such a fully articulated multiplier. The objective was to enable one or more higher-level functions to be approximated by adding a small number of gates to an existing multiplier.
The representations in Figure ?? are still unsuited for simple bit summation as provided by a pp array. Some terms are negative, and other terms may involve multiplication by a constant (although not shown, terms such as b4(b6|b7) can arise). In our SNAP work [?], we have developed transformations which take any back-solution term and put it in a single positive-bit form suitable for summation by a pp array.
Our work to date would enable such "single unit" coarse-grain adaptive arithmetic. It would be a useful complement to configurable processor efforts, enabling further efficient reconfiguration to support fast function evaluation.


While we recognize the value of multiplier reuse, the work proposed here is directed beyond it. It assumes an array of AACs, where each cell has a multiplier partial product array optimized for function evaluation. The AAC must support efficient multiplication and function approximation, and the AACs must be configurable to extend the precision of operations as well as to support concurrent execution of independent operations.
We expect the AAC to include a multiplier array of about 16 × 24 bits. Our past research, which assumed multiplier reuse, has shown that a 53b multiplier reaches a point of diminishing returns, so a smaller cell is more area-efficient. Also, a pp array which favors depth (say, 24b rather than 16b) seems to offer important precision advantages.
While a 16 × 24b multiplier coupled with supporting registers and input gating is not a small cell, it should still be possible to realize on the order of 100-1000 cells per die. While we propose this as an initial study point, clearly a great deal of additional research is required to determine optimum cell parameters.

The Multiplier
Multiplier optimization is an important factor in AAC optimization, and it has also been an area of important SNAP research. In earlier work, we developed an improved multiplier encoding technique called redundant Booth 3, enabling a reduction in pp array height and thus speeding up multiply execution [2].
In research directly related to AAC optimization, we completed an exhaustive study of pp array topology, considering both planar array structures (double arrays, higher-order arrays) and tree structures (binary, balanced tree, overturned staircase tree). We developed a counter placement and routing program to enable optimum placement of counters in a pp array tree, enabling the fastest possible multiplier execution.
Al-Twaijry [1, 8] of our group has completed an important study of the possible ways to implement floating-point multiplication. By doing a complete design and layout for almost 1,000 separate multiplier implementations, he examined the effect of various algorithmic and implementation approaches on multiplier performance. Among the most interesting findings is that the algorithmic selection is a function of certain technology parameters such as the feature size. At large feature sizes, such as 1 micron, the overall multiplier delay is determined primarily by gate delays through the circuits themselves. As feature sizes shrink, the wires determine the overall delay of the multiplier implementation; thus, techniques that shorten the wires produce the fastest multiplier. In a typical multiplier implementation, most of the circuitry and the wires are in the partial product reduction step. In order to optimize delay:


1. The designer can use a Booth encoding of the multiplier. By reducing the number of partial products, we shrink the size of the partial product reduction tree, even though we increase the circuitry necessary to generate the partial products. At small feature sizes, the wires in the partial product reduction tree dominate the delay, and hence Booth encoding is preferable.
2. In implementing the partial product reduction tree, Al-Twaijry shows that an algorithmically laid out tree of counters (using CAD tools) in a so-called Wallace tree is better than the more regular 4:2 compressor layout, which constructs a balanced binary tree of counters.

Based on his analysis, we expect our initial pp array design to use a non-Booth-encoded multiplier with our customized tree layout.
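For reference, a short Python sketch of ordinary radix-4 Booth recoding, which halves the number of partial products as described in item 1 above. This is the textbook recoding, not the redundant Booth 3 technique of [2], and it is shown only to illustrate the trade-off; the proposed design itself uses a non-Booth array.

    def booth4_digits(n, width):
        # Radix-4 Booth recoding: one digit in {-2,-1,0,1,2} per bit pair,
        # giving roughly width/2 partial products instead of width.
        digits, prev = [], 0                   # prev is bit b_{i-1}, with b_{-1} = 0
        for i in range(0, width, 2):
            b0, b1 = (n >> i) & 1, (n >> (i + 1)) & 1
            digits.append(b0 + prev - 2 * b1)  # value of the overlapping bit triplet
            prev = b1
        return digits

    def from_digits(digits):
        return sum(d * 4 ** k for k, d in enumerate(digits))

    assert from_digits(booth4_digits(45, 8)) == 45    # recoding preserves the value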

The Configuration Registers


One of the most important advantages of the AAC compared with table-based cells is the relatively small number of bits (estimated at 100) required to specify a configuration. This enables us to have multiple configuration registers; switching between configuration registers reconfigures the AAC function. This can be accomplished in 1-2 processor cycles, on the order of 10 ns, rather than the milliseconds now required for conventional FPGA reconfiguration. Of course, a longer AAC reconfiguration time is required if the AAC cell-to-cell interconnection paths must also be reconfigured. Minimizing path reconfiguration is an important goal of our research, but remains an open problem.
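A toy behavioral model (ours, purely illustrative; the operation names and register count are placeholders) of the intended mechanism: configurations are preloaded into a small register file, and reconfiguration is simply the selection of another register, which is why it fits within 1-2 processor cycles.

    class AACModel:
        # Behavioral sketch of an AAC with switchable configuration registers.
        # The operation set and the ~100-bit configuration word are placeholders.
        def __init__(self, n_regs=4):
            self.config = [None] * n_regs     # preloaded configuration words
            self.active = 0                   # currently selected register

        def preload(self, slot, op):
            self.config[slot] = op            # slow path, done ahead of time

        def switch(self, slot):
            self.active = slot                # fast path: a 1-2 cycle register select

        def execute(self, x, y=1.0):
            ops = {"mul": lambda: x * y,
                   "recip": lambda: 1.0 / x,
                   "sqrt": lambda: x ** 0.5}
            return ops[self.config[self.active]]()

    cell = AACModel()
    cell.preload(0, "mul")
    cell.preload(1, "recip")
    cell.switch(1)                            # "reconfigure" by selecting another register
    print(cell.execute(0.8125))               # 1.2307...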


Figure 10: WP Vector Unit die photo (300 MHz, 43 sq. mm, 1-micron CMOS).


C Deliverables
The deliverables consist of a series of design reports and evaluations, specifically including the following:
1. Completion of the theory and development of efficient AACs, with the objective of increasing the precision of the results of function approximation.
2. Evaluation of FPGA, custom, and semi-custom designs and their area-time effectiveness for the execution of arithmetic operations.
3. The design and analysis of a reconfigurable adaptive arithmetic unit. This is our basic AAC building block.
4. A report on the evaluation of the performance of the prototype AAC unit on at least three signal processing and communications applications of interest to the DARPA ACS and GLOMO community. This will involve both SPICE simulation and cycle-by-cycle simulation, together with a comparison of these results with more conventional approaches to the execution of these applications.
5. A report on extending the design of the AAC unit to include subword parallelism and a determination of its effectiveness on various signal processing and communication applications.
6. A report on extending the AAC unit by using wave pipelining and determining the effectiveness and usefulness of this in signal processing and communication applications.
There are no proprietary claims to the results or the reports.

Cell Interconnection and Conjunction


The interconnection of AAC blocks has some special requirements based on AAC conjunction. The AAC block has a pp array of a fixed size (e.g., 16b × 24b); oftentimes this will have to be extended to accommodate a higher-precision multiply-add operation (e.g., 48b × 8b). So, the AAC blocks require special neighbor interconnects to support pp array extension. Note that these neighbor paths apply to adjacent AAC blocks: left and right, above and below. Since the pp array extension paths are well defined, it is possible to create very fast paths for these neighbor connections.
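As a simple illustration of what pp-array extension over neighbor paths accomplishes, the following Python sketch (ours; the chunk width and operand shapes are only examples) assembles a wider multiply from two narrower partial multiplies and a shift-and-add, which is the role played by the fast neighbor connections.

    def conjoined_multiply(a, b, chunk_bits=16):
        # A 32b x 16b product assembled from two 16b x 16b partial products,
        # standing in for two AAC pp arrays chained over fast neighbor paths.
        mask = (1 << chunk_bits) - 1
        a_lo, a_hi = a & mask, a >> chunk_bits
        low = a_lo * b                        # one cell's pp array
        high = a_hi * b                       # the neighboring cell's pp array
        return low + (high << chunk_bits)     # neighbor path: align and sum

    assert conjoined_multiply(0x12345678, 0xBEEF) == 0x12345678 * 0xBEEF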


Other data intercell paths could use more conventional FPGA routing, although we believe that some optimization is possible.
Finally, the cell configuration data entry paths must be designed. Since the number of cells is relatively small compared with a conventional FPGA, a parallel cell configuration path is proposed. Ideally, such paths would have the same width as the configuration register, but it may be necessary to scale this down. The above results determine the size and structure of the configuration registers. This completes the AAC definition.

References
[1] Hesham Al-Twaijry. Area and Performance Optimized CMOS Multipliers. PhD thesis, Stanford University, August 1997.
[2] Gary W. Bewick. Fast Multiplication: Algorithms and Implementation. PhD thesis, Stanford University, March 1994.
[3] Jean-Marc Delosme. "VLSI Implementation of Rotations in Pseudo-Euclidean Spaces," IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 2, pp. 927-930, 1983.
[4] E. F. Klass. Wave Pipelining: Theoretical and Practical Issues in CMOS. PhD thesis, Dept. of Elect. Engr., Delft Univ. of Technology, 1994.
[5] D. T. L. Lee and M. Morf. "Generalized CORDIC for Digital Signal Processing," Proc. ICASSP, Paris, May 1982, pp. 1748-1751.
[6] D. M. Mandelbaum. "A Method for Calculation of the Square Root Using Combinatorial Logic," Journal of VLSI Signal Processing, December 1993, pp. 233-242.
[7] K. Nowka and M. Flynn. Environmental Limits on the Performance of CMOS Wave-Pipelined Circuits. Technical Report CSL-TR-94-600, Stanford University, Jan. 1994.
[8] S. F. Oberman, H. Al-Twaijry, and M. J. Flynn. "The SNAP Project: Design of Floating Point Arithmetic Units," in Proceedings of the 13th IEEE Symposium on Computer Arithmetic, pages 156-165, July 1997.
[9] E. M. Schwarz. High Radix Algorithms for High-Order Arithmetic Operations. PhD thesis, Stanford University, January 1993.


[10] R. Stefanelli. "A suggestion for a high-speed parallel binary divider," IEEE Transactions on Computers, C-21(1):42-55, Jan. 1972.
[11] D. Wong. Techniques for Designing High-Performance Digital Circuits Using Wave Pipelining. PhD thesis, Stanford University, August 1991.
[12] D. Wong, G. De Micheli, and M. Flynn. "Algorithms for Designing High-Performance Digital Circuits Using Wave Pipelining," IEEE Transactions on CAD of Integrated Circuits and Systems, pp. 25-46, January 1993.


D Statement of Work
Generalized Theory
The proposed AAC and the corresponding function approximation theory are based upon the back solution of equations of the form w(x) · v(x) = z(x), where w(x), v(x), and z(x) are polynomials in x. Generalizations of this process are important research opportunities directed at a better understanding of function approximation and at improving function precision. Several generalization studies are anticipated.

1. Study of dynamic pp arrays. In our pp arrays, all partial products are strictly summed, as is the nature of multiplication. It is possible to design a pp array where the summation itself is reconfigured as either an add or a subtract, depending on a previous indicator operation. This generalization would include rapid CORDIC function implementations.
2. Study of approximation representation. Continued fraction expansion of functions represents a powerful approach to function approximation. The problem is that the reciprocal (the basis of continued fractions) is not a multiplication. However, we have already shown a close relationship, and this relationship needs to be elaborated. It might be possible to extend the pp array to support rapid continued fraction evaluation (see the sketch following this list).
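To indicate why fast reciprocal and divide support matters for this generalization, here is a small Python sketch (ours, illustrative only) of bottom-up continued-fraction evaluation: each level costs one add and one divide, precisely the operations the pp array is meant to accelerate.

    def continued_fraction(a0, terms):
        # Evaluate a0 + b1/(a1 + b2/(a2 + ...)) bottom-up; each level costs
        # one add and one divide (a reciprocal followed by a multiply).
        value = 0.0
        for a, b in reversed(terms):
            value = b / (a + value)
        return a0 + value

    # The continued fraction 1 + 1/(2 + 1/(2 + ...)) converges to sqrt(2).
    print(continued_fraction(1.0, [(2.0, 1.0)] * 12))    # ~1.4142135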
The primary objective of this proposal is to develop, design, and analyze a new approach to defining an adaptive arithmetic unit. This unit contains a custom-designed pp array as the basic adaptive arithmetic unit. It would support full-precision (64-bit) floating point operations and is intended as a building-block demonstration unit. The unit includes special input circuitry to allow it to perform the following functions: addition/subtraction, multiplication, division, reciprocal, square root, log, exponentiation, and the trig functions. Each of these operations is performed in the same time as a high-speed multiply. The full precision of a result is achieved only with the multiply and add operations; other operations achieve an approximation of varying precision depending upon the operation. Through further theory studies, e.g., using Lie algebraic concepts [5], we plan to extend this precision.
As a precursor to this task, we propose to establish a baseline for adaptability in arithmetic units by doing an area-time analysis of the tradeoffs between custom-designed arithmetic units and FPGA-implemented arithmetic units.


Task 1: Our first and primary task is to determine an optimal AAC configuration. The primary determinant of the cell structure is the realizable function precision in terms of the multiplier pp array dimensions. Our initial indication is that a pp array of 16 × 24b would offer maximum precision per unit area (cell size), but this must be validated by extensive function simulation. As part of Task 1, the configurable gate structure that acts as input to the pp array must be determined. There are a number of tradeoffs here, as our program that transforms terms in the back-solution equation into positive bit entries in the pp array has significant flexibility; e.g., it can be programmed to limit the "fan-in" of terms or to give preference to certain logic gate structures.

Task 2: Intercell Array Definition. AAC blocks must be designed to be interconnected to realize higher precision operations when required (e.g., a 64b multiply). The intercell connection paths must accommodate rapid intercell communication. Since the pp array extensions are known, it ought to be possible to design fast pp array extension paths.

Task 3: AAC Array. The resultant AAC array must be physically designed (circuits, routing, and placement) and then simulated. This is necessary to determine the area of a cell and hence the number of cells per die, and ultimately the cost of the AAC approach. Similarly, circuit timing analysis (SPICE simulation) determines the cell performance.

Task 4: Application Analysis. Given the AAC die parameters determined above, it is important to understand exactly how this AAC array compares with (1) conventional FPGAs and (2) conventional microprocessors. We propose to take several signal processing applications and perform an in-depth performance and area analysis. We propose to start this task early, as it may provide important insight into the AAC design task.

Task 5: Extensions. Using advanced clocking technology, we propose to consider accelerated AACs. We would develop an extended AAC array design that uses wave pipelining to enhance the computational bandwidth of the AAC. Again, we would determine the effectiveness and usefulness of this in various signal processing and communication applications.
We also propose to investigate tools to enable concurrent task execution on single AAC arrays.


E Schedule of Milestones
All milestones are expressed in months following the project start.
1. Continue the development of AAC theory, especially to determine:
(a) Optimum multiplier structure (width, depth, and topology of the pp array).
(b) A methodology for AAC conjunction to extend the apparent width and/or depth of the pp array. This is to enable enhanced precision.
(c) Based on 1a and 1b, the optimum pp array input configurable gate structure.
(d) The number of bits required to specify a configuration. This also specifies the configuration register set.
(12-18 months)
2. Based on the optimized AAC design, determine a size and interconnect design for the reconfigurable AAC array. (18 months)
3. Complete SPICE and register-transfer simulation of the AAC array. Determine physical design and performance parameters. (24 months)
4. Complete an application study based on several numerically intensive signal processing applications. (36 months)
5. Integrate high-performance clocking techniques such as wave pipelining into the design determined in 1b and 1d. Analyze performance effects. (36 months)


F Technology Transfer
Bahram Ahanin... Altera


G Comparison with Ongoing Research


There are two basic alternatives to the adaptive arithmetic cellular approach: (1) custom design and (2) FPGA implementations. We see the AAC approach as complementary to both.
The advantages of custom design are clear: custom units are fast and area-efficient, although the development time may be long and the cost per part in small runs can be very high. The FPGA approach has exactly the complementary advantages and disadvantages: the design time is rapid, but the area usage and performance are limited. The proposed approach is based on a specific hybrid method targeted at numerically intensive applications. It is not intended to compete with FPGAs in non-arithmetic applications such as bit manipulation, but only in the rapid execution of basic arithmetic and mathematical functions. However, some of our methodology may apply to bit manipulations for coding applications, say over Galois fields.
The adaptive arithmetic unit concept has only been simulated in the past. It is an experimental approach that may offer some significant advantages in computationally intensive signal processing applications.


H List of Key Personnel


Curriculum Vitae of M. Flynn
Michael J. Flynn received his BSEE degree from Manhattan College, the M.S. degree
from Syracuse University, and his Ph.D. from Purdue University.
He began his career in computer engineering at IBM in 1955, and for ten years
worked in the areas of computer organization and design. He was design manager for
the System 360 Model 90 series of computers, which was IBM's largest and highest
performance computer system. Professor Flynn was a faculty member of Northwestern University and the Johns Hopkins University. He served as co-founder and Vice President of Palyn Associates, Inc., a computer design firm in San Jose, California, where he is now a senior consultant. In January 1975 he became Professor of Electrical Engineering at Stanford University. Dr. Flynn was the 1992 recipient of the ACM/IEEE Eckert-Mauchly Award "for important and seminal contributions to processor organization and classification, computer arithmetic, and performance evaluation." He received the IEEE CS Harry Goode Memorial Award in 1995 for "pivotal seminal contributions to the design and classification of computer architecture."
Professor Flynn founded Stanford's Computer Emulation Laboratory, which has
been a leading facility for the analysis of computer architecture. His current research
projects include programs on ultra-high-speed arithmetic performance, rapid evaluation of computer architectures, and parallel machines.
Dr. Flynn has served on the IEEE Computer Society Board of Governors and
as Associate Editor of the Transactions on Computers. He was founding chairman
of both the ACM Special Interest Group on Computer Architecture and the IEEE
Computer Society's Technical Committee on Computer Architecture.

Publications
M. J. Flynn. Computer Architecture: Pipelined and Parallel Processor Design. Jones
& Bartlett, Boston, 1995.
S. Waser and M. Flynn. Introduction to Arithmetic for Digital Systems Designers.
Holt, Rinehart, and Winston, New York, 1982.
Selected journal papers:
M. J. Flynn. "Parallel processors were the future... and they yet may be." Invited paper, Computer, 29(7):151-152, December 1996.

S. Oberman and M. Flynn. "Division Algorithms and Implementations." IEEE Transactions on Computers, 46(8):833-854, August 1997.
S. Oberman, H. Al-Twaijry, and M. Flynn. "The SNAP Project: Design of Floating Point Arithmetic Units." In Proceedings of the 13th IEEE Symposium on Computer Arithmetic, July 1997.
S. Oberman and M. Flynn. "Design Issues in Division and Other Floating-Point Operations." IEEE Transactions on Computers, 46(2):154-161, February 1997.
M. J. Flynn, S. Oberman, S. Fu, H. Al-Twaijry, K. Nowka, G. Bewick, E. Schwarz, and N. Quach. "The SNAP Project: Towards Sub-Nanosecond Arithmetic." In Proceedings of the NSF MIPS Conference on Experimental Research on Computer Systems, June 1996.
F. Klass, M. J. Flynn, and A. J. van de Goor. "Fast Multiplication in VLSI Using Wave Pipelining." Journal of VLSI Signal Processing, 7(3):233-248, May 1994.
M. Flynn, K. Nowka, G. Bewick, E. Schwarz, and N. Quach. "The SNAP Project: Towards Sub-Nanosecond Arithmetic." In Proceedings of the 12th Symposium on Computer Arithmetic, Bath, England, pages 75-82, July 1995.
M. J. Flynn and K. Rudd. "Parallel Architectures." Invited paper. ACM Computing Surveys, special issue celebrating the 50th anniversary of electronic computing, 29(1):67-70, March 1996.

Other Support
Reconfigurable Multimode, Multiband Information Transfer, DARPA. 20% AY, 30% summer, 9/1/96-9/30/98.
Sub-nanosecond Arithmetic II, NSF. 20% AY, 33% summer, 3/1/94-8/31/98.
Sub-nanosecond Arithmetic III (proposed), NSF. 20% AY, 33% summer.

Curriculum Vitae of Martin Morf


Born 1944 in Winterthur, Switzerland. U.S. citizen '76, married, two children.
Federal Diploma in Electrical Engineering, at the ETH-Zurich in 1968, M.S. in
Electrical Engineering from Stanford University in 1970, Ph.D. in E.E. from Stanford


in 1974, Honorary Master of Arts Degree from Yale University in 1983.

Academic Appointments
Visiting Professor in the Computer Systems Lab at Stanford University.
Visiting Professor, NASA Ames Research Center, 1990 and 1991.
Visiting Professor, Center for Integrated Systems, Stanford University, 1987-90.
Full Professor, Informatics (Computer Science), ETH Zurich, since 1986.
Visiting Professor, Xerox Palo Alto Research Center (PARC), 1984.
Full Professor, Electrical Eng. and Computer Science, Yale University, 1982-84.
Associate Professor, Electrical Engineering, Stanford University, 1980-82.
Visiting Professor, IBM T.J. Watson Research Center, 1977.

Affiliations
Sigma Xi; Associate Editor for Circuits, Systems, and Signal Processing; Association for Computing Machinery (ACM); Society for Industrial and Applied Mathematics (SIAM); IEEE Computer, IT, AC, and ASSP Societies; International Federation of Information Processing (IFIP) Working Group 10.5 on VLSI.
Principal advisor of over 30 Ph.D. theses, and second reader of over 30 theses.

Patents
Patent applications submitted for Multiple Quantum Well devices, X-Modulators, Quantum Dots, Spectral Hole Burning, and system applications (logic, memory). Two patent applications submitted for VLSI-based CORDIC processors. Patent applications submitted for VLSI BiCMOS Sea-of-Gates technology.

Contracts
Instigator and Co-Investigator on a contract with BMDO on High-Performance Photonic Networks, Computers, Security and Image Transmission (up to 5 years, $2.5M). Principal Investigator on many U.S. Government contracts from DARPA (over $2.2M) on Distributed Sensor Networks, and from the ARO, Air Force, NSF, JSEP, VA, ONR, and DCA.


Investigator on active NSF contract on Sub-Nanosecond Arithmetic.


PI on contract with AF on Flexible High-Performance Lattice Gas Architectures.
Publications: over 240 publications.
Co-author of book: VLSI Tools and Applications.
Contributions to recent books: VLSI Algorithms and Architectures, N. Ranganathan.

Selected Publications
J.A. Trezza, M. Morf, and J.S. Harris, Jr. "Creation and Optimization of Vertical-Cavity X-Modulators," IEEE Journal of Quantum Electronics, vol. 32, no. 1, January 1996, pp. 53-60.
J.S. Powell, J.A. Trezza, M. Morf, and J.S. Harris, Jr. "Vertical Cavity X-Modulators for Reconfigurable Optical Interconnection and Routing," International Conference on Massively Parallel Processing Using Optical Interconnections 1996, Maui, Hawaii, October 1996.
J.S. Powell, J.A. Trezza, M. Morf, and J.S. Harris, Jr. "Vertical Cavity X-Modulators for WDM," Photonics West 1996, San Jose, California, January 1996, pp. 207-216.
M. Ho and M. Morf. "Fast Algorithms for Multi-Channel Decision Feedback Equalization," Globecom, January 1996.
F.F. Lee, M.J. Flynn, and M. Morf. "Design of Compact High Performance Processing Elements for the FCHC Lattice Gas Models," Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, Philadelphia, Pennsylvania, 1992, pp. 616-622.
F.F. Lee, M.J. Flynn, and M. Morf. "High Performance Multiprocessor Architecture for a 3-D Lattice Gas Model," Third Annual NASA Symposium on VLSI Design, Boise, Idaho, October 1991, pp. 7.2.1-7.2.14.
K.M. Tao and M. Morf. "A Lattice Filter Type of Neuron Model for Faster Nonlinear Processing," 23rd IEEE Asilomar Conference on Signals, Systems and Computers, Pacific Grove, California, IEEE Catalogue no. 89-CH2836-5, Maple Press, vol. 1, October 1989, pp. 123-127.
A. El Gamal, J.L. Kouloheris, D. How, and M. Morf. "BiNMOS: a Basic Cell for BiCMOS Sea-Of-Gates," Proceedings of the IEEE 1989 Custom Integrated Circuits Conference, San Diego, California, May 1989, pp. 8.3.1-8.3.4.


J.-M. Delosme and M. Morf, Covariance Decompositions via Elementary Transformations, Applications to Filtering," Circuits, Systems and Signal Processing,
Birkhauser, vol. 7, no. 1, January 1988, pp. 21-55, invited.
W. Fichtner and M. Morf, co-editors, VLSI Tools and Applications, Kluwer, Boston,
1985.

Other Support
Reconfigurable Multimode, Multiband Information Transfer, DARPA. 50% FTE, 9/1/96-9/30/98.


I Description of Facilities
The Computer Systems Laboratory is located in the Gates Computer Science Building on the Stanford campus, which is very well equipped, from TV classrooms to offices and computer labs, all interconnected with state-of-the-art wire, fiber, and wireless communication systems. Researchers have access to the Stanford library system and other informational and research support resources.
In the area of wave pipelining, we expect to work with former members of our group now at Sun and IBM and to take advantage of some of their resources through our interactions.
In the area of FPGAs, one of our current members is working with Digital Equipment Corporation's Western Research Lab, and we are able to continue to take advantage of some of their resources as a result. Another member of this lab is working with us on applications of wave pipelining, reconfigurable fine-grain architectures, and photonic technology. He has indicated an interest in providing us with some of the necessary resources for our proposed research.


J Cost by Task


Budget Justification


Salaries and Wages
Salary calculations are based on approved salaries for UFY 98. Increases are estimated at 4% per year. Increases for faculty and students are applied in October of each year; increases for staff and all others are applied in September. The summer months are July, August, and September.

Staff Benefits and Indirect Cost Rates


The staff benefit rates shown are applied to total salaries, excluding research assistant salaries. The rates shown for University Indirect Costs are applied to a Modified Total Direct Cost (MTDC) base consisting of salaries and wages, fringe benefits, materials and supplies, services, travel, and subcontracts and subgrants up to $25,000 each. Materials and supplies include purchases of items of equipment costing less than $500 or with less than a two-year life, and university-fabricated equipment with purchased components having a total value of less than $1000. A copy of the negotiation agreement between Stanford University and the Office of Naval Research will be provided.

Tuition
Pursuant to OMB Circular A-21, tuition is direct charged. Research assistant salaries are not subject to staff benefits, and tuition is not subject to indirect costs. Tuition for a 50% research assistant is 62% of the full tuition rate. Proposals include 40% of the full rate as a direct charge to the source supporting the research assistant's salary. The remaining 22% will be cost shared by the University.

Project Representative/Administrative Associate


Susan Gere, at 35% effort, is responsible for ensuring that all charges expensed to this project are allowable, allocable, and reasonable, within the parameters of University and Government regulations. This includes monthly financial reconciliation of the project account and providing the Principal Investigator with detailed reports of monthly and project-to-date expenditures and budget projections for the entire grant period. In addition, she will be responsible for individualized graphics and manuscript preparation for technical journal submissions and project reports. She will process


project-related paperwork for project-specific existing personnel and new hires, and maintain accurate, up-to-date Other Sources of Support information related to this project. The effort put forth for this project, as described here, can be specifically identified through an effort reporting system such as that reflected on a Lab Time Card, and will be confined to the specific needs of this research project; it will not include any support for the general academic activities of the faculty or Department.

Capital Equipment
We will require workstations (Sun or equivalent) with FPGA design software for the three full-time graduate students and the post-doc expected to work on this project. Our aging current facilities are heavily used by our research groups for a number of different projects. To carry out FPGA design, simulation, and trade-off studies, we require a dedicated system capable of handling a large load. Furthermore, the addition of more resources will allow larger design optimization tasks to be split among several computers, which would analyze a structure in parallel.
We expect to purchase the equivalent of three (3) SPARCstation ELC workstations (19-inch monitors, 32MB memory, 4GB SCSI disk) at $5,000, and a color X terminal or Sony multiscan monitor at $2,000.

Cost Sharing
Interactions with our industrial partners result in various resources that can be shared for this proposed research. The savings accruing to DARPA would be very significant.


It is very difficult to estimate the relative cost of the deliverables, because they are highly interrelated and have intrinsic synergy. If DARPA chooses to eliminate a deliverable, doing so will lower the cost of the program, but not necessarily the total amount attributed to each step. If DARPA desires to explore such modifications of this proposal, we would be happy to work on cost-specific scenarios to achieve the desired results in the context of this project.

Contents

A Innovation Claim
B Technical Rationale
    Function Approximation Theory
    Finding the Reciprocal Using a Multiplier
    Other Functions
    The Cell Granularity
    The Multiplier
    The Configuration Registers
C Deliverables
    Cell Interconnection and Conjunction
D Statement of Work
    Generalized Theory
E Schedule of Milestones
F Technology Transfer
G Comparison with Ongoing Research
H List of Key Personnel
    Curriculum Vitae of M. Flynn
        Publications
        Other Support
    Curriculum Vitae of Martin Morf
        Academic Appointments
        Affiliations
        Patents
        Contracts
        Selected Publications
        Other Support
I Description of Facilities
J Cost by Task
    Budget Justification
        Salaries and Wages
        Staff Benefits and Indirect Cost Rates
        Tuition
        Project Representative/Administrative Associate
        Capital Equipment
    Cost Sharing