
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 45, NO. 3, MARCH 1998, p. 303

One-Hot Residue Coding for Low Delay-Power Product CMOS Design
William A. Chren, Jr., Member, IEEE

Abstract: CMOS implementations of arithmetic circuits for
One-Hot Residue (OHR) encoded operands are presented. They
are shown to possess SPICE-simulated delay-power (DP) products
that are significantly lower than those of their binary
number system counterparts (ripple-carry adder and Wallace
tree multiplier). The reduction is attributable to the one-hot
representation, which decreases the number of critical path
transistors and signal activity factors. An OHR-based direct digital
frequency synthesizer for frequency-hopped communication systems
is presented, and analytical estimates of its DP-product are
derived. The design exhibits a DP-product reduction in excess of
90% below that of a binary-encoded residue synthesizer.

I. INTRODUCTION

DECREASING feature sizes and consumer demand for
battery-operated electronic products have made power
consumption one of the foremost design issues in integrated
circuit design. Many approaches to lowering chip power
consumption have been developed, focused at most levels
of the design process (e.g., system, technology, algorithm,
physical, and circuit). System level approaches include power
supply voltage scaling [1], clock gating methods [2], and the
use of subsystem sleep (power down) modes [3]. Physical
level methods include the use of transistor reordering [4].
Algorithmic methods include the use of alternate number systems [5] and state encodings [6]. Technology level techniques
include the use of dynamic threshold MOSFETs (DTMOSs)
[7]. Circuit level techniques include self-timed (asynchronous)
approaches [8] and glitch reduction [9]. Ultra-low power
designs of the future will employ several methods since no
single one can achieve the power reduction goals that are
predicted for the next decade.
A central concern in the evaluation of any power reduction
technique is its effect on system speed. Many techniques
cause unavoidable lengthening of the system clock period.
Supply voltage scaling, in particular, is well known to reduce
significantly the system clock speed. This is the major reason
for the development of the DTMOS. Accordingly, many
researchers use delay-power (DP) product as the central figure
of merit. They define delay in the intuitive manner as the
difference between the application time of an input operand
and the time at which the output response is stable. Power
is defined as the time average of the product of the system voltage and
current waveforms during the input/output transition. In some
cases, energy is used synonymously with the DP-product, in
recognition of the fact that the units of both are identical.

Manuscript received December 6, 1995; revised August 28, 1997. This
paper was recommended by Associate Editor W. D. Grover.
The author is with the School of Engineering, Grand Valley State University,
Grand Rapids, MI 49504 USA.
Publisher Item Identifier S 1057-7130(98)00771-X.

In this paper we present an alternative algorithmic power
reduction technique. We propose the use of one-hot encoding
for the digits in a Residue Number System (RNS). We present
CMOS adders and multipliers using this novel One-Hot
Residue (OHR) number system. Their regular and simplified
structures allow short critical paths and high speeds. The one-hot
encoding of the operands allows near-minimal activity
factors and very small power dissipations. Consequently, these
circuits yield superior DP-products. We present the results of
SPICE simulations which show DP-products that are as much
as 70% (95%) below those of ripple-carry adders (Wallace
tree multipliers). Using these results, we give expressions for
the DP-product as a function of modulus (arithmetic unit) size.
We then present an example design of an OHR-based
direct digital frequency synthesizer for a frequency-hopped
communication system. We derive analytical estimates of its
DP-product and show that it is significantly reduced (at least
90%) below that of a recently developed, binary-encoded design called the High-Agility Direct Synthesizer (HADS) [10].
This example serves to indicate the performance advantages
of the OHR number system.
This paper is organized as follows. Section II is a presentation of background information on the RNS, OHR encoding,
and direct digital frequency synthesis (DDFS). Section III
presents our OHR arithmetic circuits, DP-product simulation
results, and analytical expressions relating the DP-product to
modulus size. Section IV presents the OHR-based synthesizer
design, estimates its DP-product, and compares it to the performance of the HADS. Section V contains our conclusions.
II. BACKGROUND
A. The Residue Number System (RNS)
RNS is an integer number system in which the operations of
addition, subtraction, and multiplication (henceforth referred
to as the basic operations) can be performed with a single
combinational step, digit-on-digit, using small arithmetic units
operating in parallel. There are no carries, borrows, or partial
products and, consequently, these operations are fast and
simple to implement. Operations other than the basic ones,
such as magnitude comparison, scaling, and base extension
(the RNS equivalents of right shifting and increasing the
bit width, respectively) and division require several basic
operations [11]. They are slower and are more complicated
to implement. As a result, the RNS has enjoyed its widest
usage in applications such as DSP, where the basic operations
predominate.

1057-7130/98$10.00 © 1998 IEEE

The RNS represents an integer $X$ as the vector of its
residues modulo a fixed set of specially chosen integers
called moduli. The moduli are usually chosen to be pairwise
relatively prime because in this case the representation
is unique for all nonnegative values less than the product $M$
of the moduli. If the moduli are not pairwise relatively prime,
uniqueness holds only up to the least common multiple of the moduli.
Letting $m_i$ denote the $i$th modulus, the residue representation
of $X$ is depicted as $X = (x_1, x_2, \ldots, x_n)$, where the
$x_i = X$ modulo $m_i$ are called the digits of the residue
representation. The operations of addition, subtraction and
multiplication are performed in digit-parallel fashion, modulo $m_i$.
If operands $X$ and $Y$ have residue representations
$(x_1, x_2, \ldots, x_n)$ and $(y_1, y_2, \ldots, y_n)$, and
where $\circ$ represents any of the operations, then we have
$X \circ Y = (x_1 \circ y_1, x_2 \circ y_2, \ldots, x_n \circ y_n)$,
with the $i$th digit taken modulo $m_i$. Because these
operations are done in parallel, modulo small integers $m_i$,
they can be performed quickly. As an example, let [...] and [...].
Then [...] and [...]. Therefore, [...] and [...].
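The digit-parallel rule can be illustrated with a minimal Python sketch. The moduli (3, 5, 7), the operands 6 and 4, and all function names are assumptions chosen for illustration; they are not the values of the example above. The brute-force decoder is used only to check the result and is not a practical conversion method.

```python
from math import prod

MODULI = (3, 5, 7)  # hypothetical pairwise relatively prime moduli, M = 105


def to_rns(x, moduli=MODULI):
    """Return the residue digits (x mod m_i) of a nonnegative integer."""
    return tuple(x % m for m in moduli)


def rns_op(a, b, op, moduli=MODULI):
    """Apply op digit-on-digit; no information crosses between moduli."""
    return tuple(op(x, y) % m for x, y, m in zip(a, b, moduli))


def from_rns(digits, moduli=MODULI):
    """Brute-force decode (illustration only): search 0..M-1 for a match."""
    M = prod(moduli)
    return next(x for x in range(M) if to_rns(x, moduli) == tuple(digits))


X, Y = to_rns(6), to_rns(4)                     # (0, 1, 6) and (1, 4, 4)
rns_sum = rns_op(X, Y, lambda x, y: x + y)      # (1, 0, 3), the digits of 10
rns_product = rns_op(X, Y, lambda x, y: x * y)  # (0, 4, 3), the digits of 24
```

Each digit is computed from the corresponding operand digits alone, which is why the per-modulus units can operate in parallel without carries.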
A useful property of any prime modulus $m$ is that it has at
least one primitive root. A primitive root is an integer of order
$m-1$ under multiplication. That is, it is an integer $g$ whose
successive powers, taken modulo $m$, generate the nonzero
integers modulo $m$. For any nonzero $x$, $x = g^k$ modulo $m$
for some $k$. It is then said that $x$ has index $k$.
A primitive root allows multiplication to be
performed by adding indices modulo $m-1$, akin to the use of
logarithms in the binary number system. As an example, let
$m = 13$. Then $g = 2$ is a primitive root because the quantities
$2^0, 2^1, \ldots,$ and $2^{11}$ are equal to 1,
2, 4, 8, 3, 6, 12, 11, 9, 5, 10, and 7 modulo 13. The latter
is the set of nonzero integers modulo 13. Furthermore, letting
$x = 2^a$ modulo 13 and $y = 2^b$ modulo 13, we
have that $xy = 2^a \cdot 2^b = 2^{a+b}$ modulo 13. Thus the
index of the product can be determined by adding the operand
indices modulo $m-1$, i.e., $a+b$ modulo 12.
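A short Python sketch of index-based multiplication for the modulus 13 and primitive root 2 used above. The table and function names, and the operands 6 and 11, are assumptions made for illustration.

```python
M = 13  # prime modulus from the example above
G = 2   # primitive root of 13

# ANTI_INDEX[k] = G**k mod M; INDEX maps each nonzero residue back to k.
ANTI_INDEX = [pow(G, k, M) for k in range(M - 1)]
INDEX = {x: k for k, x in enumerate(ANTI_INDEX)}


def mul_by_index(x, y):
    """Multiply residues modulo 13 by adding their indices modulo 12."""
    if x % M == 0 or y % M == 0:
        return 0  # zero has no index and must be handled separately
    k = (INDEX[x % M] + INDEX[y % M]) % (M - 1)
    return ANTI_INDEX[k]


# Hypothetical operands: 6 has index 5 and 11 has index 7, so the product
# has index (5 + 7) mod 12 = 0, which is the residue 1 (66 = 5 * 13 + 1).
product = mul_by_index(6, 11)
```

The index sum wraps modulo 12 rather than modulo 13, because the nonzero residues form a cycle of length $m-1$.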
From the preceding paragraph, it can also be seen that every
nonzero integer modulo $m$ has a multiplicative inverse. Indeed,
let $x \neq 0$ and $x = g^k$ modulo $m$. Then $x^{-1} = g^{-k}$
modulo $m$, where the additive inverse of $k$ is taken modulo
$m-1$. That is, the index of the multiplicative inverse of $x$
is the additive inverse, modulo $m-1$, of the index of $x$. If $X$ is
encoded with $n$ moduli, then the inverse of $X$, taken modulo $M$,
is the vector of its inverted digits.
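The inverse-by-index rule can be checked with a few lines of Python, again assuming the modulus 13 and primitive root 2; the names are hypothetical.

```python
M, G = 13, 2  # prime modulus and primitive root (assumed, as in the example)

ANTI_INDEX = [pow(G, k, M) for k in range(M - 1)]   # index k -> residue
INDEX = {x: k for k, x in enumerate(ANTI_INDEX)}    # residue -> index k


def inverse(x):
    """Multiplicative inverse modulo 13 by negating the index modulo 12."""
    k = INDEX[x % M]                   # raises KeyError for zero (no index)
    return ANTI_INDEX[(-k) % (M - 1)]


inverses = {x: inverse(x) for x in range(1, M)}
```

In hardware terms the two table lookups and the negation collapse into one fixed reordering of the digit lines, which is what makes the operation free in a one-hot encoding.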
A quotient $X/Y$ can be found with the same simplicity
and speed as the three basic operations if it is known a priori
that it is an integer. In such a case, $X/Y = X \cdot Y^{-1}$, where
$Y^{-1}$ is the multiplicative inverse of the denominator. This fact is
used in the scaling operation, which will now be discussed.
Scaling is the residue operation that corresponds to radix
division (also known as right-shifting with truncation, or
integer division) in the binary number system. It is used in
frequency synthesis to reduce the size of a ROM look-up table
holding output sample values. The radices are the moduli, and
the scaling operation can be performed on any single modulus
or a combination of moduli (by repeating the single modulus
operations). Scaling by more than one modulus corresponds
to shifting by more than one bit position in the binary number
system.
The fundamental idea is to perform the radix division
by inverse multiplication. To guarantee that the result will
be integral, an initial subtraction is performed to round the
operand to the next smallest multiple of the radix. Letting
$X = (x_1, x_2, \ldots, x_n)$ and denoting by
$Y = (y_1, y_2, \ldots, y_n)$ the result of
scaling $X$ by modulus $m_i$, we have that

$$y_j = (x_j - x_i)\, m_i^{-1} \bmod m_j, \qquad j \neq i. \tag{1}$$
That is, the scaling operation consists of a subtraction of $x_i$
from $x_j$ and multiplication by $m_i^{-1}$ in each modulus except the
$i$th. Note that $m_i^{-1}$ does not exist in the $i$th modulus.
Consequently, the subtraction and the multiplication are performed
independently and concurrently in each modulus except the $i$th.
This means that the result is defined only in digits other than
the $i$th. An operation called Base Extension [12] can be used
to restore the lost digit, if it is needed. It consists of scalings by
the remaining moduli (with the lost digit initialized to zero),
followed by a final multiplication by the additive inverse of
the product of these moduli. Base Extension is not needed for
frequency synthesis, and will not be discussed further. Note
that scaling requires some method of converting a residue digit
from one modulus to another. It will be seen in Section III that
in the OHR number system these conversions are simple and
fast.
As an example of the scaling operation, let [...] and [...]. Then
[...],
which is the residue representation of 4, the correct result.
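The scaling operation of (1) can be sketched in Python as follows. The moduli (3, 5, 7) and the operand 14 are assumptions chosen so that the scaled value is also 4; they are not claimed to be the values of the example above. The built-in `pow(a, -1, m)` (Python 3.8 and later) supplies the multiplicative inverse.

```python
MODULI = (3, 5, 7)  # hypothetical pairwise relatively prime moduli


def scale(digits, i, moduli=MODULI):
    """Scale by moduli[i]; return {j: digit} for every modulus except the ith."""
    m_i, x_i = moduli[i], digits[i]
    result = {}
    for j, (x_j, m_j) in enumerate(zip(digits, moduli)):
        if j == i:
            continue  # m_i has no inverse modulo itself, so this digit is lost
        inv = pow(m_i, -1, m_j)                 # inverse of m_i modulo m_j
        result[j] = ((x_j - x_i) * inv) % m_j   # subtract, then multiply
    return result


X = tuple(14 % m for m in MODULI)  # digits (2, 4, 0) of the integer 14
scaled = scale(X, 0)               # {1: 4, 2: 4}, the digits of 14 // 3 = 4
```

The reduction `% m_j` applied to `x_j - x_i` plays the role of the modulus conversion mentioned in the text: the digit $x_i$ is reinterpreted in modulus $m_j$ before the subtraction.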


B. One-Hot Residue (OHR) Representation
The advantages obtained when the one-hot code is used
to represent the digits of an RNS are so compelling that
we have given the resulting number system the new name
one-hot residue (OHR) number system, even though, strictly
speaking, it is still the RNS with the same arithmetic. The
advantages include the use of barrel shifters for the basic
operations (which possess superior DP-products and operand-independent
delays compared to binary implementations), simple and regular
layout of arithmetic circuits, and zero-cost
implementation (by signal transposition) of inverse and index
calculation and modulus conversion. Lower DP-products result
from the fact that signal activity factors are near-minimal and
fewer critical path transistors are present.
The one-hot representation for the $i$th residue digit $x_i$ is
depicted in Fig. 1. Only the single line corresponding to the
digit value is asserted (driven high) at any time. Furthermore,
during a change in digit value, at most two lines change their
value. This is the minimal possible activity factor and means
that the power dissipation is small.
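A small Python sketch of the encoding and of its activity, assuming a modulus of 5 and hypothetical helper names. It counts how many lines toggle when a digit changes from 3 to 4 and compares that with the number of bits that toggle in a 3-bit binary code.

```python
def one_hot(value, modulus):
    """Return `modulus` lines; only the line for `value` is asserted."""
    return [1 if k == value % modulus else 0 for k in range(modulus)]


def lines_changed(old, new):
    """Count the signal lines whose level differs between two codes."""
    return sum(a != b for a, b in zip(old, new))


before, after = one_hot(3, 5), one_hot(4, 5)
ohr_toggles = lines_changed(before, after)   # one line falls, one line rises
binary_toggles = bin(3 ^ 4).count("1")       # 011 -> 100 toggles three bits
```

The one-hot count is 2 for any pair of distinct digit values, whereas the binary count depends on the operands.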

CHREN: ONE-HOT RESIDUE CODING

Fig. 1. OHR representation for digit $x_i$.


With this one-hot representation of the residue digits, addition can be performed by cyclic shifts (rotations). One of the
operands (the data operand) is rotated by an amount equal
to the value of the other (the shift operand). The rotation can be
performed by one of several types of circuits; in our work
we have chosen to use barrel shifters. These circuits compute
all possible rotations in parallel and pass the appropriate one
to the output when required. They can be used to perform
multiplication also, as will be discussed in Section III.
Calculation of inverses and indices is simple and fast in
the OHR. The process is merely an appropriate permutation,
in each modulus simultaneously, of the signals that comprise
the residue digit. The permutation requires no hardware and
causes little delay. Modulus conversion, which is the process
of converting a residue digit to its value in another modulus,
is also very efficient in the OHR and consists of using OR
gates to collect digit values which are congruent modulo the
target modulus.
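Addition by rotation can be modeled behaviorally in Python. This is a sketch with assumed names and an assumed modulus of 5; it mirrors the description of a barrel shifter (form every rotation, pass one) but says nothing about the transistor-level circuit.

```python
def one_hot(value, modulus):
    """Return `modulus` lines; only the line for `value` is asserted."""
    return [1 if k == value % modulus else 0 for k in range(modulus)]


def rotate(lines, shift):
    """Cyclic rotation: input line k drives output line (k + shift) mod m."""
    m = len(lines)
    return [lines[(k - shift) % m] for k in range(m)]


def ohr_add(data, shift):
    """Barrel-shifter model: form all rotations, pass the selected one."""
    m = len(data)
    rotations = [rotate(data, s) for s in range(m)]  # conceptually in parallel
    selected = shift.index(1)                        # the asserted shift line
    return rotations[selected]


total = ohr_add(one_hot(3, 5), one_hot(4, 5))  # code of (3 + 4) mod 5 = 2
```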
C. Direct Digital Frequency Synthesis (DDFS)
DDFS is a method of sinusoidal signal generation that
yields frequencies of high precision and resolution. It is a
purely digital technique and therefore is more reliable and
precise than analog methods. Additionally, it allows fast
frequency switching that is phase continuous, something that
is costly and complicated with analog techniques. It is widely
used in communications and instrumentation systems where
exceptionally pure and stable signals must be generated.
Most DDFS systems use the Sine table lookup method [13],
[14] wherein the output is generated by periodically accessing
a ROM containing contiguous samples of a single period (or
quadrant) of a sine wave. The samples are converted to analog
by a digital-to-analog converter (DAC). The ROM addresses
are computed using a phase accumulator, which generates
successive multiples of an externally supplied frequency-setting
word. The value of this word establishes the output
frequency. If it is large, the output frequency is high because
the ROM addresses and sine samples are widely spaced.
Conversely, small frequency-setting words produce low output
frequencies because the ROM samples are closely spaced.
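The table lookup method can be sketched in Python as below. This is a conventional binary phase accumulator, not the OHR design of this paper, and the accumulator width, table size, and names are assumptions made for illustration.

```python
import math

ACC_BITS = 16    # phase accumulator width (assumed)
TABLE_BITS = 8   # log2 of the number of stored sine samples (assumed)
SINE_ROM = [math.sin(2 * math.pi * k / 2**TABLE_BITS)
            for k in range(2**TABLE_BITS)]


def ddfs(freq_word, count):
    """Return `count` output samples for a given frequency-setting word."""
    phase, samples = 0, []
    for _ in range(count):
        address = phase >> (ACC_BITS - TABLE_BITS)  # truncate to a ROM address
        samples.append(SINE_ROM[address])
        phase = (phase + freq_word) % 2**ACC_BITS   # successive multiples
    return samples


slow = ddfs(256, 4)    # closely spaced ROM addresses 0, 1, 2, 3
fast = ddfs(1024, 4)   # widely spaced ROM addresses 0, 4, 8, 12
```

With these assumed widths, the output frequency is the clock frequency multiplied by `freq_word / 2**ACC_BITS`, so a larger word steps through the stored period in fewer clock cycles.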
The architecture of our OHR-based synthesizer is shown in
Fig. 2. It employs a pipelined architecture and barrel shifter-based
RNS processing to achieve low-latency sample generation
and fast frequency switching. It is designed for use in
a high-performance frequency-hopped spread spectrum communication system.
Fig. 2. OHR-based frequency synthesizer.

All signal processing is performed in an $n$-modulus RNS
in which one of the moduli must be a multiple of 4. This
condition allows the size of the sample ROM to be reduced
by the same factor. The phase accumulator consists of small
OHR accumulators, one per modulus, operating in parallel.
The residue-encoded frequency-setting word is
the external input to the accumulator. The phase accumulator
computes multiples of this word (called phase arguments), modulo the
product $M$ of the moduli. Typical values of $M$ exceed [...].
Because these arguments are to be used to address the sample
ROM, they must be scaled down and truncated. As discussed
above, this can be done using the RNS scaling operation, and
is performed by the Scaler subsystem (see Fig. 2), which uses
a pipelined architecture to implement (1). Note that its output
includes two signals, MSB and next-MSB, which are the most
and next-most significant bits of the even modulus digit. It can
be shown [15] that these signals encode quadrant information
about the Scaler output. The next-MSB indicates location in
the second or fourth quadrants, and the MSB indicates third
and fourth. Their use will now be discussed.
The AI (address invert) units and the SI (sign invert)
input of the DAC allow exploitation of the quarter-wave
symmetry of the sine waveform to reduce the size of the
sample ROM. The SI input of the DAC complements the sign
of the analog output when the MSB and next-MSB signals are
unequal, corresponding to second or third quadrant location of
the Scaler output. The AI units compute the additive inverse
of the Scaler output when the next-MSB is asserted (i.e., when
the Scaler output is located in the second or fourth quadrants).
More details of our design will be given in Section IV where
we derive an estimate of its DP-product.
The remainder of this paper is as follows. Section III
presents our OHR arithmetic circuits, DP-product simulation
results, and analytical expressions relating the DP-product to
modulus size. Section IV presents the details of our OHR-based
synthesizer design, estimates its DP-product, and compares it to
the performance of a previously designed synthesizer, called HADS,
which uses a binary-encoded RNS.
Section V contains our conclusions.

III. OHR ARITHMETIC CIRCUITS AND THEIR DELAY-POWER (DP)-PRODUCTS
In this section, we present OHR adders and multipliers and
also discuss other useful circuits such as modulus converters
and inverse and index calculators. We then present DP-product
simulation results for the adders and multipliers, and present
analytical expressions which indicate the dependence of the
DP-product on the circuit (modulus) size.

Fig. 3. OHR adder: (a) symbol and (b) architecture.

Fig. 4. Architecture of modulo $m$ multiplier.

Fig. 5. Modulus 5 to 3 conversion example.

A. OHR Arithmetic Circuits


OHR adders and multipliers employ barrel shifters as their
basic computational elements. This is possible because, for
one-hot encoded operands, modulo $m$ addition is a cyclic
permutation of one operand by the other. Multiplication, on
the other hand, involves addition of indices, and as discussed
in the previous section, index conversion is free in the OHR.
A modulo $m$ OHR adder is shown in Fig. 3. In Fig. 3(a),
the two inputs are denoted as "shift" and "data" to make
the internal operation [Fig. 3(b)] more easily understood. The
barrel shifter generates, in parallel, all possible rotations of the
data input, and selects one of them for output. Which one is
selected is determined by the shift input.
The barrel shifters can be built using pass transistors (as
shown in the figure) or transmission gates. If circuit size
is a concern, then pass transistors are preferred. However,
power consumption is larger because of voltage degradation
of the high signal at the output, causing static dissipation
in downstream circuitry. Furthermore, pass transistors have
a larger delay because of slow rise times. Alternatively,
transmission gates offer near-zero static power dissipation,
higher speed, and the absence of signal degradation. They
consume more chip area, however. These issues will be
discussed further later in this section when we discuss our
simulation methodology.
A subtracter is implemented in the same way, except that
the subtrahend input bus is permuted to generate the additive
inverse (modulo $m$) of its operand. This permutation can
be hardwired as a reordering of the wires in the bus. The
result is a dedicated subtracter. An alternative is a combined
adder/subtracter unit, which employs a multiplexer to selectively
apply the additive inverse to the barrel shifter. The
multiplexer can be implemented in several ways. In Section
IV, we will use transmission gate logic to implement muxes
in the AI units of Fig. 2.
Multiplication by a constant (modulo $m$) can also be
done by wire permutation provided the constant is relatively
prime to $m$. This condition ensures that no two input values
generate the same output. Of course, other methods besides
wire permutation, with almost the same degree of simplicity,
can be used in case the condition does not hold. For example,
an OR gate can perform the required many-to-one mapping.
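A behavioral Python sketch of constant multiplication as a fixed wiring pattern. The function names and the example constants are assumptions; the OR in `apply_wiring` degenerates to a plain wire whenever the destination line has a single source.

```python
from math import gcd


def one_hot(value, modulus):
    """Return `modulus` lines; only the line for `value` is asserted."""
    return [1 if k == value % modulus else 0 for k in range(modulus)]


def constant_multiplier(c, m):
    """Return wiring[k]: the output line driven by input line k."""
    return [(c * k) % m for k in range(m)]


def is_pure_permutation(c, m):
    """True when no two input lines share an output line."""
    return gcd(c, m) == 1


def apply_wiring(lines, wiring):
    """Route each input line to its destination, ORing any shared lines."""
    out = [0] * len(lines)
    for k, dest in enumerate(wiring):
        out[dest] |= lines[k]
    return out


times3_mod7 = constant_multiplier(3, 7)  # a permutation, since gcd(3, 7) = 1
times2_mod6 = constant_multiplier(2, 6)  # many-to-one, so OR gates are needed
```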
The architecture of a nonconstant multiplier (modulo $m$)
is shown in Fig. 4. The modulus must be of the form
2, 4, $p^k$, or $2p^k$ ($p$ an odd prime, $k$ an integer) because only
these possess primitive roots [16]. The multiplier consists of
a barrel shifter core with wire permutations on all ports. The
permutations perform index and anti-index computation. Note
that, as explained in Section II, the index sum is computed
modulo $m-1$ rather than modulo $m$. The pull-up FETs
are needed to guarantee a zero output when an input operand
is zero valued, i.e., when it has no index.
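The multiplier structure can be modeled behaviorally in Python for a prime modulus. This sketch assumes the modulus 13 with primitive root 2 and uses hypothetical names; the input and output permutations are modeled as list reorderings, the core as a rotation of length $m-1$, and the zero line as a separate OR that stands in for the pull-up behavior described above.

```python
def one_hot(value, modulus):
    """Return `modulus` lines; only the line for `value` is asserted."""
    return [1 if k == value % modulus else 0 for k in range(modulus)]


def ohr_multiply(x_lines, y_lines, g):
    """Multiply one-hot digits modulo a prime m with primitive root g."""
    m = len(x_lines)
    anti = [pow(g, k, m) for k in range(m - 1)]  # index k -> residue g**k
    # Input permutations: index line k is wired to residue line anti[k].
    x_idx = [x_lines[anti[k]] for k in range(m - 1)]
    y_idx = [y_lines[anti[k]] for k in range(m - 1)]
    # Barrel shifter core: rotate x_idx by the index of y, modulo m - 1.
    sum_idx = [0] * (m - 1)
    for s in range(m - 1):
        if y_idx[s]:
            sum_idx = [x_idx[(k - s) % (m - 1)] for k in range(m - 1)]
    # Output permutation (anti-index): index line k drives residue line anti[k].
    out = [0] * m
    for k in range(m - 1):
        out[anti[k]] = sum_idx[k]
    out[0] = x_lines[0] | y_lines[0]  # a zero operand forces a zero output
    return out


product = ohr_multiply(one_hot(6, 13), one_hot(11, 13), 2)  # code of 66 mod 13 = 1
```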
Crossbar switches can be used to implement multipliers for
moduli that have no primitive root. They compute an arbitrary
one-to-one mapping of inputs to outputs. Barrel shifters cannot
be used because modular multiplication cannot be done by
rotations alone. A disadvantage of these switches is that they
lack a regular structure and therefore are more difficult to
route. In this research we did not study these multipliers; we
use prime moduli exclusively. However, our conclusions apply
to these moduli also, as will be clear when we discuss our
results and methodology.
Modulus conversion of an operand from modulus $m_i$ to
modulus $m_j$ is performed simply and easily. If $m_i < m_j$, the
operation is trivial, consisting of wire padding with extra
(grounded) lines. If $m_i > m_j$, OR gates are used to map input
values to their congruence classes modulo $m_j$. An example of
the conversion from modulo 5 to modulo 3 is given in Fig. 5.
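A Python sketch of modulus conversion on one-hot lines, with assumed function names. For the 5-to-3 case it ORs input lines 0 and 3 onto output line 0, input lines 1 and 4 onto output line 1, and passes input line 2 straight through.

```python
def one_hot(value, modulus):
    """Return `modulus` lines; only the line for `value` is asserted."""
    return [1 if k == value % modulus else 0 for k in range(modulus)]


def convert_modulus(lines, target):
    """Convert a one-hot digit to a one-hot digit in modulus `target`."""
    source = len(lines)
    if source <= target:
        return lines + [0] * (target - source)  # pad with grounded lines
    out = [0] * target
    for k, line in enumerate(lines):
        out[k % target] |= line                 # OR the congruent input lines
    return out


down = convert_modulus(one_hot(4, 5), 3)  # 4 mod 3 = 1
up = convert_modulus(one_hot(2, 3), 5)    # 2 stays 2, with two grounded lines
```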

Fig. 6. Index and inverse calculation.

Fig. 7. Level restoration methods: (a) output buffered and (b) active pull up.

Index and multiplicative inverse calculation are performed
by a wire permutation as indicated in Fig. 6. The zero-input
line is treated as a special case because zero has no inverse or
index. Note that wire permutation need not involve a physical
relocation of chip traces. Instead, a signal permutation could
be used, which is merely a renaming of the input lines. The
latter avoids the extra chip area consumed by the rewiring,
although such area is less significant with modern fabrication
processes which use many metal layers.
B. DP-Product Simulation Results
We now present DP-product simulation results for OHR
arithmetic circuits. We begin with a discussion of the barrel
shifter and transistor simulation models we used, and then
discuss our methodology and results.
C. Models
We investigated three different models of barrel shifters in
this research. They differ in their approach to signal level
restoration. As mentioned in the discussion of adders, barrel
shifters implemented with pass transistors are slower and
use more power than those implemented with transmission
gates. This is due to signal level degradation at the outputs.
Accordingly, we investigated three different approaches to
output level restoration: output buffered (OB), active pull-up
(APU) [17], and transmission gate (TG).
The OB method, illustrated in Fig. 7(a), employs two inverter
stages on each barrel shifter output to restore the signal
level. The APU method, shown in Fig. 7(b), employs positive
feedback and a weak pFET pull-up transistor on each output to
restore the signal voltage. The pFET is made weak in order to
decrease the drive strength required at the data input during
a high-to-low transition. The TG method (not shown) uses
transmission gates instead of pass transistors in the barrel
shifter (see Fig. 3). With this scheme, the output levels are
not degraded. However, inverters are required on each shift
input in order to generate the proper control signals for the
transmission gates.

Fig. 8. Modulo 5 OHR adder model.

SPICE version 3f4 LEVEL 2 1.2-μm transistor models
[18] were used in building simulation models of adders and
multipliers in both the OHR and the binary number systems.
Minimum-sized transistors were used
throughout, including as output loads. The OHR architectures
employed the OB, TG, and APU methods of level restoration
discussed above.
As an example, the modulo 5 OHR adder model using
the OB level restoration method is shown in Fig. 8. The
TG version differs in that it uses transmission gates in place
of the pass transistors, and lacks the dual output buffers.
Furthermore, the complements of the shift inputs are generated
by inverters, one per line, in order to drive the control inputs
of the transmission gates. The pMOS transistors used in the
TG and APU circuits use a threshold voltage of [...] V
because it reduces static power dissipation due to leakage
currents in the off state.
The multiplier simulation models are identical to the adders
because negligible power is consumed in the wire permutations.
For the binary number system architectures, a ripple-carry
adder was chosen because it has the lowest power among
the adder alternatives (e.g., carry-lookahead, carry-save,
carry-select, etc.). Its full adder model is shown in Fig. 9 [19].
A Wallace tree architecture was chosen for the multiplier
(Fig. 10) [20].
D. Methodology
OHR adders and multipliers were simulated with modulus
sizes 3, 5, 7, 11, 13, and 17 because larger moduli yielded
little marginal information to justify the long simulation times.
Binary circuits were simulated with sizes of 2, 3, 4, and 6 bits.

Fig. 9. Binary full adder model.

Fig. 10. Binary multiplier model (4 bits).

Fig. 11. Delay-power product function of modulus size (TG).

For the OHR circuits, DP-product was estimated for simultaneous single transitions on both shift and data inputs. The
combined average power on both the rails and output lines was
computed by using MS EXCEL utilities on the SPICE output
data files. The output delay was measured to the half-rail point
on the output waveform. The product of these two quantities
is the DP-product (per transition).
For the binary circuits, the average rail and output powers
were simulated in the same way but for single operand
changes on both inputs. These changes were chosen so that
approximately half of the output and input bits changed. The
delay times were found for worst case operands and were
measured from the time of the input change (simultaneous on
all inputs) to the 50% point on the last output bit to settle.
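The two measurements can be expressed compactly in Python. The original post-processing was done with MS EXCEL utilities on SPICE output files; the sketch below is an equivalent calculation under assumed names, and the waveform is synthetic (a 5 V rail, a constant 1 mA supply current, and a linear output ramp), not simulation data from this work.

```python
def average_power(t, v, i):
    """Trapezoidal time average of v(t) * i(t) over the sampled interval."""
    energy = 0.0
    for k in range(len(t) - 1):
        p0, p1 = v[k] * i[k], v[k + 1] * i[k + 1]
        energy += 0.5 * (p0 + p1) * (t[k + 1] - t[k])
    return energy / (t[-1] - t[0])


def half_rail_delay(t, v_out, vdd, t_input=0.0):
    """Time from the input change to the first half-rail crossing of v_out."""
    half = vdd / 2.0
    for k in range(len(t) - 1):
        a, b = v_out[k], v_out[k + 1]
        if a != b and min(a, b) <= half <= max(a, b):
            # Linear interpolation between the two bracketing samples.
            return t[k] + (half - a) * (t[k + 1] - t[k]) / (b - a) - t_input
    raise ValueError("output never crosses the half-rail point")


# Synthetic waveform: time in ns, voltage in V, current in mA.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
v_rail = [5.0] * 5
i_rail = [1.0] * 5
v_out = [0.0, 1.25, 2.5, 3.75, 5.0]

power_mw = average_power(t, v_rail, i_rail)     # average power in mW
delay_ns = half_rail_delay(t, v_out, vdd=5.0)   # delay in ns
dp_product_pj = power_mw * delay_ns             # mW * ns = pJ
```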
The DP-product as a function of modulus is plotted in
Fig. 11 for TG OHR circuits. Analytical expressions, found
using an MS EXCEL curve fitting tool, are also given. It can
be seen that the DP-product is reduced below that of binary
adders and multipliers by at least 35% and 90%, respectively,
for a modulus value of 64, and for smaller, more practical
moduli the reduction is larger (70% for the adder and 95% for
the multiplier with a modulus of 17). Figs. 12 and 13, which
use the same legend as Fig. 11, indicate that this advantage
is primarily due to smaller delay rather than smaller power.
As can be seen from Fig. 12, the power consumption of the
OHR adder increases linearly with modulus, while those for
the ripple-carry adder and multiplier increase logarithmically.
The break-even point occurs at approximately a modulus of
15. The corresponding point for the multipliers could not be
ascertained due to run-time limitations, but based on Fig. 12
it is certainly larger than 64. In Fig. 13 one sees that the OHR
delay time grows linearly in the modulus size and increases
at a smaller rate. For example, for [...] the delay is 25%
of that of the binary adder and 20% of that of the multiplier.
As an example of the use of the OHR number system
we will now present, in Section IV, the design of an OHR-based
frequency synthesizer. We will also present an analytical
estimate of its DP-product and compare it with that of a
binary-encoded RNS-based synthesizer called the HADS. Section V
contains our conclusions.
IV. THE OHR-BASED FREQUENCY SYNTHESIZER AND ITS DELAY-POWER PRODUCT
In this section we present the design of an OHR-based
frequency synthesizer. We also present an analytical estimate
of its DP-product and compare it to that of a previously
designed, binary-encoded RNS-based synthesizer called the
HADS [10]. Because significant portions of the OHR design
employ nonarithmetic circuitry (e.g., Sample ROM, pipeline
registers), the estimate is derived without the use of the results
of Section III. Instead, delay is estimated by the number of
critical path transistors, and power is estimated by the number
of transistors needed for implementation of each subsystem,
multiplied by that subsystem's activity factor. This factor
expresses the average fraction of transistors that switch per
clock transition. DAC delays and powers will not be estimated
because of their difficulty and because they are the same for
both designs.
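The estimation method amounts to a weighted sum followed by a product, as the Python sketch below shows. The subsystem names, transistor counts, activity factors, and critical path length are made-up placeholders used only to exercise the arithmetic; they are not figures from this design.

```python
def power_estimate(subsystems):
    """Sum of (transistor count * activity factor) over all subsystems."""
    return sum(count * activity for count, activity in subsystems.values())


def dp_estimate(subsystems, critical_path_transistors):
    """Delay (critical path transistors) times the weighted power estimate."""
    return critical_path_transistors * power_estimate(subsystems)


# Hypothetical placeholder values: name -> (transistors, activity factor).
example = {
    "adder": (100, 0.25),
    "register": (200, 0.5),
}
power = power_estimate(example)     # 100 * 0.25 + 200 * 0.5 = 125.0
dp = dp_estimate(example, 10)       # 10 * 125.0 = 1250.0
```

Both quantities are dimensionless transistor counts, so the result is meaningful only as a ratio between two designs estimated the same way.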

Fig. 12. Power per operation versus modulus size.

Fig. 13. Delay as a function of modulus size.

Fig. 14. OHR-based frequency synthesizer.

Fig. 15. (a) PA adder and (b) register element.

A. OHR-Based Frequency Synthesizer


Our OHR synthesizer is shown in Fig. 2 and duplicated here
for convenience as Fig. 14. Each of its components will now
be discussed.
A modulo $m_i$ adder in the Phase Accumulator (PA) is
shown in Fig. 15(a), and consists of a single edge-triggered
register with OHR adder feedback. Each register element
is implemented with two level-sensitive latches [21], each
consisting of a two-input inverting mux with an output inverter
[see Fig. 15(b)].
The Scaler performs truncation by successive application
of (1), thereby reducing the number of residue digits from
$n$ to the number of output moduli. Its architecture is depicted
in Fig. 16(a). It consists
of an upper triangular array of scaling units (SUs). Each scaling
unit, whose template and architecture are depicted
in Fig. 16(b) and (c), scales by one modulus $m_i$. It comprises
a modulo $m_j$ OHR adder, modulus conversion, and wire
permutation on the subtrahend input, a wire permutation on
the output port, and an output pipeline register. The output
transposition performs the post multiplication by $m_i^{-1}$. The
input modulus conversion/wire transposition computes the
value $-x_i$ modulo $m_j$, while the adder performs the subtraction
as discussed in Section III. At the Scaler output, the digit
result is passed to the AI units (discussed below). In principle,
the number of output moduli can have any value. However,
two output moduli make Sample ROM
addressing efficient, and two is therefore the value we used in the
design.

Fig. 16. (a) Scaler architecture. (b) Scaling unit (SU). (c) SU architecture.

An AI Unit selectively computes the additive inverse based
on the value of the next-msb (generated by the Encoder unit).
It is implemented as shown in Fig. 17, using a 2-to-1 mux and
wire transposition. The mux is implemented using transmission
gates. In the figure, the modulus shown is one of the two Scaler
output moduli.
Fig. 18(a) shows a bit plane of the Sample ROM.
A number of these share the word and bit addressing circuitry
and are connected in parallel to form the Sample
ROM; the number of bit planes equals the DAC width. The two
output moduli comprise
the word and bit select lines, respectively. Corresponding bit
lines from each word are tied together and provide the data
input for the Sense Amplifier/Output Buffers (SA/OBs). Bit
selection is accomplished by using the bit select lines to control
the output enables of the SA/OBs, whose outputs are tied
to the output bus. The SA/OBs are high gain, single-ended
inverters with transmission gate output control.
Fig. 18(b) shows the word architecture. Pull-down transistors are located only at the 0-bit positions. The bit lines are
precharged during high system clock. During low clock, the
word line is asserted and the bit lines evaluate. Output data are
active low and are converted to active high by the inverting
action of the SA/OBs. Note that each word line drives only
two transistor gates (due to the OHR encoding).

Fig. 17. AI unit architecture.

Fig. 18. (a) Sample ROM architecture. (b) Word architecture.

TABLE I
CRITICAL PATH DELAY (TRANSISTORS)

The Encoder unit generates the msb and next-msb from
the even modulus of the Scaler output. It does this with NOR
combinational logic which classifies the input magnitude as
being in the upper half range (msb) or in either the second
or the fourth quartile (next-msb). The DAC is a standard
high-speed two's-complement type with a sign invert (SI) input
which toggles the sign of the output.
There are two primary limitations of the OHR design. First,
the size of the adders and multipliers grows as the square of the
modulus. For large frequency resolutions (i.e., wide PAs) the
chip area could become large enough to prohibit cost-effective
fabrication. We are presently researching decomposition methods
whereby large moduli can be partitioned into smaller ones
which consume much less area.
The second limitation is routing area. Operand widths in
the OHR number system increase linearly with the modulus.
Consequently, chip routing area is large if few metal layers are
available in the fab process. This problem will become less
severe in the future as the cost of fabrication in many-layer
processes continues to fall.
B. OHR Synthesizer Delay Estimate
The critical path of the OHR synthesizer consists of the PA,
Scaler, AI Units, and Sample ROM. Its length is estimated by
adding the worst-case delays through these units. The results
are presented in the last column of Table I. The PA delay is
estimated as 1 transistor for the barrel shifter and 4 (2 each
for the master and slave) for the pipeline register element. The
Scaler delay estimate is 5 (1 for the barrel shifter and 4 for
the output register element) for each SU. The total Scaler
delay given in Table I assumes a particular number of output
moduli.
The AI Units have 1 transistor of delay through the mux
transmission gate, and 4 through the pipeline register element
(see Fig. 17). The Sample ROM delay is estimated as 5 for
the word architecture blocks (4 for the input gate and one for
the pull-down transistor) and 3 for the SA/OB (Fig. 18).
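The tally above can be reproduced with a few lines of arithmetic. The sketch below uses only the per-unit figures quoted in the text; the number of SUs traversed by the critical path is left as a parameter (`su_on_path`, a hypothetical name), since it depends on the Scaler configuration.

```python
# Per-unit worst-case delays, in transistors, as itemised in the text.
PA_DELAY = 1 + 4         # barrel shifter + pipeline register (2 master + 2 slave)
SU_DELAY = 1 + 4         # barrel shifter + output register element, per SU
AI_DELAY = 1 + 4         # mux transmission gate + pipeline register element
ROM_DELAY = (4 + 1) + 3  # word block (input gate + pull-down) + SA/OB


def ohr_critical_path(su_on_path):
    """Total critical-path delay in transistors.

    `su_on_path` (hypothetical parameter) is the number of SUs that the
    critical path traverses inside the Scaler.
    """
    if su_on_path < 1:
        raise ValueError("at least one SU is expected on the path")
    return PA_DELAY + su_on_path * SU_DELAY + AI_DELAY + ROM_DELAY


for n in (1, 2, 3):
    print(n, ohr_critical_path(n))  # 1 23, 2 28, 3 33
```

Every term is a constant number of transistors, which reflects the operand-independent delay of the one-hot datapath.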
C. OHR Synthesizer Power Estimate
An estimate of the power consumption of each subsystem
is found as a weighted sum of the number of transistors in
the PA, Scaler, AI Units, Encoder, and Sample ROM. The
weighting is by the activity factor of each subsystem. We use
two output moduli for simplicity of
Sample ROM addressing (as explained above). Note that it is
desirable to make these moduli approximately equal for ease
of chip floorplanning, because it gives the Sample ROM a
unity aspect ratio.
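As a compact restatement of the weighting described above, the estimate can be written as follows. The symbols $N_i$ (transistor count of subsystem $i$) and $a_i$ (its activity factor) are notation assumed here for illustration; Table II may use different symbols.

```latex
P_{\mathrm{est}} = \sum_{i} a_i \, N_i ,
\qquad i \in \{\text{PA},\ \text{Scaler},\ \text{AI Units},\ \text{Encoder},\ \text{Sample ROM}\}
```

Because the estimate is expressed in transistors, $P_{\mathrm{est}}$ serves as a relative measure for comparing designs rather than as an absolute power figure.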
The estimates are presented in the bottom half of Table II.
Each component of the estimate carries its own activity factor.
The PA is implemented with modulo adders and register
elements (see Fig. 15). Transmission gates are used for the
barrel shifters in the adders; the transistor count covers the
barrel shifter array plus the control inverters. The register
elements have 16 transistors (8 each for master and slave).
The Scaler power assumes that the number of transistors
used for modulus conversion in each SU is negligible. This
is reasonable because in most applications the PA moduli are
approximately equal. The expression accounts for the levels
of SUs, deployed as in Fig. 16(a).
Each SA/OB in the Sample ROM requires four transistors.
The Sample ROM expression involves the output width in bits
and the activity factor of the binary-encoded words output
to the DAC. The Encoder power is estimated assuming that a
multi-input NOR function is implemented in binary-tree fashion,
without internal pipelining, using two-input NOR gates (four
transistors each). The expression accounts for single pipeline
register elements on each output, and assumes a maximum
activity factor.

TABLE II
POWER ESTIMATES (TRANSISTORS)

D. HADS Delay Estimate
The critical path length of the HADS is estimated in the
same manner as for the OHR synthesizer. The results are given
in Table I [22], expressed in terms of the maximum PA
modulus and the value of the (equal) output moduli.
The PA employs Shanbhag adders [23]. The Scaler consists of
ROMs. The AI Units consist of constant-operand
subtracters controllable by the next-msb of the Scaler output. The
size of the Sample ROM depends on the resolution
of the DAC.
E. HADS Power Estimate
The power is estimated in the same way as for the OHR
synthesizer, that is, by adding the number of transistors in the
PA, Scaler, AI Units, and ROM, and scaling by the activity
factor. These estimates are given in the top half of Table II
[24]. The XOR gate power is negligible. We assume, as before,
that the first PA modulus is a power of two and that
both Scaler output moduli are equal.
The number of pipeline registers is estimated using a
pipeline intensity factor with units of stages/bit of ripple
carry addition [15]. Note that this factor was not needed for the
OHR case because the number of pipeline registers is fixed.

Fig. 19. OHR DP-product reduction.

The percent reduction of the DP-product for the OHR
synthesizer below that of the HADS is plotted in Fig. 19 for
three modulus sets. The activity factors used reflect the fact
that half (at most two) of the bits change, on
average, every clock cycle for binary-encoded (OHR-encoded)
operands. The other parameters used for the figure are the
pipeline intensity factor and the output width, chosen to
correspond, respectively, to 2 bits of ripple
carry addition per pipeline stage, and 10 bits of amplitude
resolution on the output.
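The bit-change counts behind these activity factors can be checked by brute force. The sketch below enumerates every possible transition between consecutive operands; treating all transitions as equally likely is an assumption of this sketch, and the function names are illustrative only.

```python
from itertools import product


def one_hot(residue, modulus):
    """One-hot encoding of a residue: a single 1 among `modulus` lines."""
    return [1 if i == residue else 0 for i in range(modulus)]


def binary_word(value, width):
    """Little-endian list of the `width` low-order bits of `value`."""
    return [(value >> i) & 1 for i in range(width)]


def toggles(old, new):
    """Number of signal lines that change between two consecutive words."""
    return sum(a != b for a, b in zip(old, new))


def max_toggles_one_hot(modulus):
    """Worst-case number of lines that change for one-hot operands."""
    return max(
        toggles(one_hot(a, modulus), one_hot(b, modulus))
        for a, b in product(range(modulus), repeat=2)
    )


def mean_toggles_binary(width):
    """Average number of bits that change over all binary transitions."""
    words = range(1 << width)
    total = sum(
        toggles(binary_word(a, width), binary_word(b, width))
        for a, b in product(words, repeat=2)
    )
    return total / (1 << (2 * width))


print(max_toggles_one_hot(17))  # 2
print(mean_toggles_binary(4))   # 2.0
```

For one-hot operands the count never exceeds two (one line falls, one rises) regardless of the modulus, whereas for binary operands the average grows as half the word width.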
It can be seen from the figure that the OHR synthesizer has
a DP-product which is reduced by at least 90% below that of
the HADS. Furthermore, the sizes of the moduli only modestly
affect the reduction, because the chosen sets include moduli
which vary greatly. Other data (not shown) indicate that the
reduction in delay is relatively independent of modulus size,
and that the DP-product reduction is due to both delay and
power reduction (at least 60% for each, for all modulus sets
and parameter values considered). Furthermore, changes in
these parameters have
modest effects on the DP-product.

V. CONCLUSIONS
The OHR number system appears to offer a significant reduction of the DP-product of CMOS arithmetic circuits below
that of binary number system or binary-encoded RNS circuits.
OHR-based arithmetic circuits offer several other advantages,
including regular layout, operand-independent delay, gate-free
operations such as inversion and modulus conversion, and
simplicity. We have presented SPICE simulation results which
indicate that the ripple carry adder (Wallace tree multiplier)
DP-product is reduced by as much as 70% (95%) for smaller
moduli (e.g., 17), and that this improvement is due primarily
to delay, rather than power, reduction.
Use of the OHR is exemplified in the design of an
OHR-based direct digital frequency synthesizer for frequency-hopped spread-spectrum communication systems. Estimates of
its DP-product indicate a reduction by at least 90% below that
of a recently proposed binary-encoded RNS-based synthesizer.
ACKNOWLEDGMENT
The author is grateful to W. Ivancic of the NASA Lewis Research Center, Space Communications/Electronics Division,
for supporting this research. The simulation assistance of C.
Brogdon and D. Andrevska is much appreciated.

REFERENCES
[1] R. Krishnamurthy, I. Lys, and L. Carley, "Static power driven voltage
scaling and delay driven buffer sizing in mixed swing QuadRail for
Sub-1V I/O swings," in Proc. 1996 Int. Symp. Low Power Electronics
Design.
[2] S. Rajgopal, "Challenges in low power microprocessor design," in Proc.
9th Int. Conf. VLSI Design: VLSI Mobile Commun., 1996.
[3] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design,
A Systems Perspective, 2nd ed. Reading, MA: Addison Wesley, 1993,
p. 370.
[4] S. Prasad and K. Roy, "Circuit optimization for transistor reordering
for minimization of power consumption under delay constraints," ACM
Trans. Des. Automat. Electron. Syst., vol. 1, no. 2, Apr. 1996.
[5] C. Nagendra, R. M. Owens, and M. J. Irwin, "Unifying carry-sum and
signed-digit number representations for low power," in Proc. 1995 Int.
Symp. Low Power Design.
[6] K. Roy and S. Prasad, "Syclop: Synthesis of CMOS logic for low power
application," in Proc. 1992 Int. Conf. Computer Design.
[7] F. Assaderaghi et al., "A dynamic threshold voltage MOSFET (DTMOS)
for ultra-low voltage operation," in Int. Electron Devices Meet. Tech.
Dig., 1994.
[8] L. S. Nielsen, C. Niessen, J. Sparsø, and C. H. van Berkel, "Low-power
operation using self-timed circuits and adaptive scaling of the supply
voltage," IEEE Trans. VLSI Syst., vol. 2, pp. 391-397, Dec. 1994.
[9] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective.
Englewood Cliffs, NJ: Prentice-Hall, 1996, pp. 240-242.
[10] W. A. Chren, Jr., "Low delay-power product CMOS design using
one-hot residue coding," in Proc. 1995 Int. Symp. Low Power Design, Dana
Point, CA.
[11] W. A. Chren, Jr., "A new residue number system division algorithm,"
Computers Math. Appl., vol. 19, no. 7, pp. 13-29, 1990.
[12] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and Its Applications to
Computer Technology. New York: McGraw-Hill, 1967, pp. 147-151.
[13] J. Tierney, C. M. Rader, and B. Gold, "A digital frequency synthesizer,"
IEEE Trans. Audio Electroacoust., vol. AU-19, pp. 48-56, Mar. 1971.
[14] V. Manassewitsch, Frequency Synthesizers: Theory and Design, 3rd ed.
New York: Wiley, 1987, pp. 37-43.
[15] W. A. Chren, Jr., "RNS-based enhancements for direct digital frequency
synthesis," IEEE Trans. Circuits Syst. II, vol. 42, pp. 516-524, Aug.
1995.
[16] I. Niven and H. S. Zuckerman, An Introduction to the Theory of Numbers,
3rd ed. New York: Wiley, 1972.
[17] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective.
Englewood Cliffs, NJ: Prentice-Hall, 1996, p. 217.
[18] W. A. Chren, Jr., C. H. Brogdon, and D. Andrevska, "Delay-power
product simulation results for one-hot residue number system arithmetic
circuits," in Proc. 39th Midwest Symp. Circuits Syst., Ames, IA, 1996.
[19] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design,
A Systems Perspective, 2nd ed. Reading, MA: Addison Wesley, 1993,
p. 516.
[20] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design,
A Systems Perspective, 2nd ed. Reading, MA: Addison Wesley, 1993,
p. 557.
[21] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design,
A Systems Perspective, 2nd ed. Reading, MA: Addison Wesley, 1993,
pp. 318-322.
[22] W. A. Chren, Jr., "Low delay-power product CMOS design using
one-hot residue coding," in Proc. 1995 Int. Symp. Low Power Design, Dana
Point, CA.
[23] N. R. Shanbhag and R. E. Siferd, "A single-chip pipelined 2-D FIR
filter using residue arithmetic," IEEE J. Solid-State Circuits, vol. 26, pp.
796-805, May 1991.
[24] W. A. Chren, Jr., "Low delay-power product CMOS design using
one-hot residue coding," in Proc. 1995 Int. Symp. Low Power Design, Dana
Point, CA.

William A. Chren, Jr. received the Ph.D. degree from The Ohio State University, Columbus,
in 1987.
He is presently an Associate Professor of electrical engineering at Grand Valley State University,
Grand Rapids, MI. His current research interests
include low area/delay/power CMOS, ASIC, and
FPGA architectures for DSP and telecommunications, alternative number systems (e.g., RNS, Galois fields) for DSP, quantum-effect devices and
their use in A/D and D/A conversion, and high-performance chip architectures for encryption/decryption and testing of ATM
network packet switches. He has taught various undergraduate and graduate
courses at Ohio State, Penn State, and the University of Kentucky. He
currently heads the Laboratory for VLSI Development at GVSU and teaches
digital VLSI, electronics, microcontroller, and communications courses.
