
Design tradeoffs using truncated multipliers in FIR filter implementations


E. George Walters III and Michael J. Schulte
Computer Architecture and Arithmetic Laboratory
Computer Science and Engineering Department
Lehigh University
Bethlehem, PA 18015, USA
ABSTRACT
This paper presents a general FIR filter architecture utilizing truncated tree multipliers for computation. The
average error, maximum error, and variance of error due to truncation are derived for the proposed architecture.
A novel technique that reduces the average error of the filter and is independent of the number of unformed
columns is presented, as well as equations describing the signal-to-noise ratio of the truncation error. A software
tool written in Java is described that automatically generates structural VHDL models for specific filters based
on this architecture, given parameters such as the number of taps, operand lengths, number of multipliers, and
the number of truncated columns. We show that a 22.5% reduction in area can be achieved for a 24-tap filter
with 16-bit coefficients. The ratio of the average error to the full scale value is only $1.4 \times 10^{-9}$, with only an
8.4 dB reduction in SNR for this implementation.
Keywords: FIR filters, truncated multipliers, automatic VHDL generation, computer arithmetic,
application-specific designs
1. INTRODUCTION
Finite Impulse Response (FIR) filters are an important component in many digital signal processing systems.
For fast FIR filter implementations, much of the area, delay, and power consumption is due to multiplication
circuits. Truncated multipliers offer improvements in each of these areas [1], at the expense of computational
accuracy. The error introduced by using truncated multipliers can be quantified in terms of maximum absolute
error, average error, and variance of error. When implemented as described in this paper, the average error
of the filter due to truncation is independent of the number of unformed columns and approaches zero as the
number of taps increases. This indicates that many applications can take advantage of the area, delay, and
power improvements offered by the use of truncated multiplication with a minimal impact on the accuracy of
the output.
A software tool written in Java has been developed to generate structural VHDL models for specific filters
based on this architecture, given parameters such as the number of taps, operand lengths, number of multipliers,
and the number of truncated columns. The tool is based on a package of Java classes that model the building
blocks of computational systems, such as adders and multipliers. These classes generate VHDL descriptions,
and are used by other classes in hierarchical fashion to generate VHDL descriptions of more complex systems.
Filter models with various parameters were generated using this tool, and then synthesized to get area and
delay estimates.
Sections 1.1 through 1.3 provide background necessary for understanding the twos complement truncated
multipliers used in the proposed FIR filter architecture, which is described in Section 2. Section 3 describes the
tool for automatically generating VHDL models of those filters. Synthesis results of specific filter
implementations are presented in Section 4, with concluding remarks given in Section 5.
[Figure 1 shows the rows of partial product bits $a_i b_j$ formed from the operands $A = a_7 \ldots a_0$ and
$B = b_7 \ldots b_0$, together with the added constant 1s, summing to the product bits $p_{15} \ldots p_0$.]
Figure 1. Partial product bit matrix (twos complement).
1.1. Twos Complement Multipliers
Parallel tree multipliers form a matrix of partial product bits which are then added to produce a product.
Consider an $m$-bit multiplicand, $A$, and an $n$-bit multiplier, $B$. If $A$ and $B$ are integers in twos complement
form, then
$$A = -a_{m-1} 2^{m-1} + \sum_{i=0}^{m-2} a_i 2^i, \quad \text{and} \quad B = -b_{n-1} 2^{n-1} + \sum_{j=0}^{n-2} b_j 2^j. \tag{1}$$
Multiplying $A$ and $B$ together yields the following expression:
$$A \cdot B = a_{m-1} b_{n-1} 2^{m+n-2} + \sum_{i=0}^{m-2} \sum_{j=0}^{n-2} a_i b_j 2^{i+j} - \sum_{i=0}^{m-2} b_{n-1} a_i 2^{i+n-1} - \sum_{j=0}^{n-2} a_{m-1} b_j 2^{j+m-1}. \tag{2}$$
The first two terms in (2) are positive. The third term is either zero (if $b_{n-1} = 0$) or negative with a
magnitude of $\sum_{i=0}^{m-2} a_i 2^{i+n-1}$ (if $b_{n-1} = 1$). Similarly, the fourth term is either zero or a negative number.
To produce the product $A \cdot B$, the first two terms are added as is. Since the third and fourth terms are
negative (or zero), they are added by complementing each bit, adding 1 to the LSB column, and sign extending
with a leading 1. With these substitutions, the product is computed without any subtractions as
$$P = a_{m-1} b_{n-1} 2^{m+n-2} + \sum_{i=0}^{m-2} \sum_{j=0}^{n-2} a_i b_j 2^{i+j} + \sum_{i=0}^{m-2} \overline{b_{n-1} a_i}\, 2^{i+n-1} + \sum_{j=0}^{n-2} \overline{a_{m-1} b_j}\, 2^{j+m-1} + 2^{m+n-1} + 2^{n-1} + 2^{m-1}. \tag{3}$$
Figure 1 shows the multiplication of two 8-bit integers in twos complement form. The partial product bit
matrix is described by (3), and is implemented using an array of AND and NAND gates. The matrix is then
reduced using techniques such as Wallace [2], Dadda [3], or Reduced Area reduction [4].
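As a sanity check on (3), the following minimal Python sketch evaluates the subtraction-free sum bit by bit and compares it with ordinary signed multiplication. It assumes that only the low $m + n$ bits of the sum are kept, as in an $(m+n)$-bit product register; the helper names `bits_of`, `to_signed`, and `product_eq3` are illustrative and do not come from the paper.

```python
def bits_of(value, width):
    """Return the twos complement bits of value, least significant first."""
    return [(value >> i) & 1 for i in range(width)]


def to_signed(value, width):
    """Interpret the low `width` bits of value as a twos complement integer."""
    value %= 2 ** width
    return value - 2 ** width if value >= 2 ** (width - 1) else value


def product_eq3(a, b, m, n):
    """Evaluate the right-hand side of (3) for an m-bit a and an n-bit b."""
    a_bit = bits_of(a, m)
    b_bit = bits_of(b, n)
    # First term: product of the two sign bits.
    total = (a_bit[m - 1] & b_bit[n - 1]) * 2 ** (m + n - 2)
    # Second term: AND gates for the magnitude bits.
    for i in range(m - 1):
        for j in range(n - 1):
            total += (a_bit[i] & b_bit[j]) * 2 ** (i + j)
    # Third and fourth terms: complemented partial products (NAND gates).
    for i in range(m - 1):
        total += (1 - (b_bit[n - 1] & a_bit[i])) * 2 ** (i + n - 1)
    for j in range(n - 1):
        total += (1 - (a_bit[m - 1] & b_bit[j])) * 2 ** (j + m - 1)
    # The three constants at the end of (3).
    total += 2 ** (m + n - 1) + 2 ** (n - 1) + 2 ** (m - 1)
    # Assumption: the product register keeps only the low m + n bits.
    return to_signed(total, m + n)


if __name__ == "__main__":
    mismatches = [(a, b) for a in range(-8, 8) for b in range(-8, 8)
                  if product_eq3(a, b, 4, 4) != a * b]
    print("mismatches for 4 x 4 operands:", mismatches)
```

The exhaustive loop covers every pair of 4-bit operands; the same function can be called with unequal operand lengths.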
1.2. Reduced Area (RA) Reduction
Reduced Area (RA) reduction, presented by Bickerstaff et al. [4], is a modified reduction scheme that uses a smaller
final carry propagate adder than Wallace or Dadda, and reduces the overall multiplier area in comparison to
Wallace and Dadda multipliers. The rules for RA reduction are as follows:
1. For each stage, the number of full adders used in each column is $\lfloor b_i/3 \rfloor$, where $b_i$ is the number of bits in
column $i$.
2. Half adders are used only (a) when required to reduce the number of bits in a column to the number of
bits specified by the Dadda series, or (b) to reduce the rightmost column containing exactly two bits.
Figure 2. Reduced Area (RA) reduction for an 8 × 8 multiplier.
The first rule minimizes the number of bits entering the next stage, a benefit especially useful in pipelined
multipliers. The second rule reduces the length of the final carry propagate adder by one in each stage. Figure 2
shows the dot diagram for RA reduction of an 8 × 8 multiplier. In such diagrams, dots represent bits. Columns
of three bits that are inputs to a full adder are circled. The outputs of the full adder, a sum bit and a carry
bit, are depicted in the next stage connected by a diagonal line. Columns of two bits that are input to a half
adder are also circled, but the sum and carry outputs are connected by a diagonal line with a slash through it.
1.3. Truncated Multipliers
Truncated $m \times n$ multipliers that produce results less than $m + n$ bits long are described by Schulte and
Swartzlander [5]. Benefits of truncated multipliers include reduced area, delay, and power consumption [1]. An
overview of truncated multipliers, which discusses several methods for correcting the error introduced due to
unformed partial product bits, is given by Swartzlander [6]. The method used in this paper is constant correction,
as described by Schulte and Swartzlander [5].
Figure 3 shows an 8 × 8 truncated parallel multiplier with a correction constant added. The final result is
$l$ bits long. We define $k$ as the number of truncated columns that are formed, and $r$ as the number of columns
that are not formed. In this example, the five least significant columns of partial product bits are not formed
($l = 8$, $k = 3$, $r = 5$).
Truncation saves an AND gate for each bit not formed and eliminates the full adders and half adders that
would otherwise be required to reduce them to two rows. The delay due to reducing the partial product matrix
is not improved because the height of the matrix is unchanged. However, a shorter carry propagate adder is
required, which improves the overall delay of the multiplier.
The correction constant, $C_r$, and the 1 added for rounding are normally included in the reduction matrix.
In Figure 3 they are explicitly shown to make the concept clearer.
Figure 3. Truncated twos complement multiplier with constant correction.
A consequence of truncation is that a reduction error is introduced due to the discarded bits. For simplicity,
the operands are assumed to be integers, but the technique can also be applied to fractional or mixed number
systems. With $r$ unformed columns, the reduction error is
$$E_r = -\sum_{i=0}^{r-1} \sum_{j=0}^{i} a_{i-j} b_j 2^i. \tag{4}$$
The maximum reduction error occurs when each of the truncated partial product bits is a 1, and is
$$E_{r\,\mathrm{max}} = -\sum_{q=0}^{r-1} (q+1) 2^q = -\left((r-1) 2^r + 1\right). \tag{5}$$
The reduction error could be zero, so the range of error is
$$-(r-1) 2^r - 1 \le E \le 0. \tag{6}$$
If $A$ and $B$ are random with a uniform probability density, then the average value of each partial product bit
is $\frac{1}{4}$, so the average reduction error is
$$E_{r\,\mathrm{avg}} = -(r-1) 2^{r-2} - 2^{-2}. \tag{7}$$
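The variance of the reduction error, which is too complex to derive in this paper, is [7]
$$\sigma_r^2 = \frac{3}{16} \sum_{q=0}^{r-1} (q+1) 2^{2q} + \frac{1}{8} \sum_{i=0}^{r-1} \sum_{j=0}^{r-i-1} \left( 2^{i+j} \left( \sum_{k=0}^{i-1} 2^{k+j} + \sum_{l=0}^{j-1} 2^{l+i} \right) \right). \tag{8}$$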
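The closed forms in (5), (7), and (8) can be cross-checked for small $r$ by enumerating every combination of the low-order operand bits, which is all that (4) depends on. The sketch below is an illustrative check, not part of the original derivation; it assumes independent, uniformly distributed bits, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def reduction_error(a_bits, b_bits, r):
    """E_r from (4): the negated sum of the unformed partial product bits."""
    return -sum(a_bits[i - j] * b_bits[j] * 2 ** i
                for i in range(r) for j in range(i + 1))


def exhaustive_stats(r):
    """Most negative error, mean, and variance over all low-order bit patterns."""
    errors = [reduction_error(a, b, r)
              for a in product((0, 1), repeat=r)
              for b in product((0, 1), repeat=r)]
    count = len(errors)
    mean = Fraction(sum(errors), count)
    variance = Fraction(sum(e * e for e in errors), count) - mean ** 2
    return min(errors), mean, variance


def formula_stats(r):
    """The same three quantities from the closed forms (5), (7), and (8)."""
    e_max = -((r - 1) * 2 ** r + 1)                          # (5)
    e_avg = -Fraction((r - 1) * 2 ** r + 1, 4)               # (7)
    variance = Fraction(3, 16) * sum((q + 1) * 4 ** q for q in range(r))
    variance += Fraction(1, 8) * sum(
        2 ** (i + j) * (sum(2 ** (k + j) for k in range(i))
                        + sum(2 ** (l + i) for l in range(j)))
        for i in range(r) for j in range(r - i))             # (8)
    return e_max, e_avg, variance


if __name__ == "__main__":
    for r in range(1, 5):
        print(r, exhaustive_stats(r) == formula_stats(r))
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the comparison is not affected by floating-point rounding.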
In order to minimize the average error, a constant is added to the partial product matrix [5]. The correction
constant, $C_r$, is chosen to offset $E_{r\,\mathrm{avg}}$. After rounding,
$$C_r = \mathrm{round}\left(-2^{-r} E_{r\,\mathrm{avg}}\right) \cdot 2^r = \mathrm{round}\left((r-1) 2^{-2} + 2^{-(r+2)}\right) \cdot 2^r, \tag{9}$$
where $\mathrm{round}(x)$ indicates $x$ is rounded to the nearest integer. Using this correction constant, the range of error
becomes
$$C_r - (r-1) 2^r - 1 \le E \le C_r. \tag{10}$$
The maximum error, in terms of absolute magnitude, becomes
$$E_{\mathrm{max}} = \max\left(C_r,\; (r-1) 2^r + 1 - C_r\right). \tag{11}$$
The average error of the $(l + k)$-bit product becomes
$$E_{\mathrm{avg}} = C_r - (r-1) 2^{r-2} - 2^{-2}, \tag{12}$$
and the variance remains the same as given by Equation (8).
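To make (9) through (12) concrete, the sketch below computes the correction constant and the resulting error figures for a single multiplier. It is only an illustration: the function names are hypothetical, and rounding is implemented as floor(x + 1/2), which is an assumption about tie handling (ties do not occur in (9), since the $2^{-(r+2)}$ term keeps the argument off the half-integers).

```python
import math
from fractions import Fraction


def correction_constant(r):
    """C_r from (9), using round-half-up (assumed; ties cannot occur here)."""
    scaled = Fraction(r - 1, 4) + Fraction(1, 2 ** (r + 2))
    return math.floor(scaled + Fraction(1, 2)) * 2 ** r


def corrected_error_summary(r):
    """Return (C_r, error range, maximum absolute error, average error)."""
    c_r = correction_constant(r)
    worst_uncorrected = (r - 1) * 2 ** r + 1            # magnitude of (5)
    error_range = (c_r - worst_uncorrected, c_r)         # (10)
    e_max = max(c_r, worst_uncorrected - c_r)            # (11)
    e_avg = c_r - Fraction(worst_uncorrected, 4)         # (12)
    return c_r, error_range, e_max, e_avg


if __name__ == "__main__":
    # r = 5 matches the Figure 3 example (l = 8, k = 3, r = 5).
    print(corrected_error_summary(5))
    print(corrected_error_summary(12))
```

For the Figure 3 parameters the correction constant cancels the average error down to the $-2^{-2}$ term, whereas for $r = 12$ the rounding in (9) leaves a much larger residual. The filter-level correction in Section 2.2.3 addresses this.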
Modeling the variance of reduction error as noise, which is commonly done for quantization error [8], the
reduction error SNR for a single truncated multiplier using constant correction is
$$\mathrm{SNR}_r = 10 \log_{10}\left(\frac{\sigma_x^2}{\sigma_r^2}\right) \text{dB}, \tag{13}$$
where $\sigma_x^2$ is the signal variance (power).
2. PROPOSED FIR FILTER ARCHITECTURE
This section describes the architecture used in this work. The architecture is parameterized, allowing the choice
of the number of taps, the number of multipliers, the amount of truncation, etc. Section 2.1 gives an overview
of the architecture, Section 2.2 describes components within the architecture, and Section 2.3 presents an error
analysis.
2.1. Architecture Overview
An FIR filter with $T$ taps computes the following difference equation,
$$y[n] = \sum_{k=0}^{T-1} b_k\, x[n-k], \tag{14}$$
where $x[\,]$ is the input data stream, $b_k$ is the $k$th tap coefficient, and $y[\,]$ is the output data stream. This can be
recognized as the discrete convolution of the input stream, $x[n]$, with the impulse response, $h[n]$ [8].
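For reference, (14) can be evaluated directly in a few lines. The sketch below is a plain software model with exact integer arithmetic, assuming $x[n-k] = 0$ for samples before the start of the stream; the function name `fir_filter` is illustrative.

```python
def fir_filter(coeffs, samples):
    """Evaluate y[n] = sum over k of b_k * x[n - k], per (14).

    Samples before the start of the stream are taken as zero (an assumption
    about the initial register contents).
    """
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for k, b_k in enumerate(coeffs):
            if n - k >= 0:
                acc += b_k * samples[n - k]
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    # Feeding a unit impulse returns the tap coefficients (the impulse response).
    print(fir_filter([3, -1, 2], [1, 0, 0, 0]))
```

A model like this can serve as the ideal baseline against which a truncated implementation is compared.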
Figure 4 shows the block diagram of the proposed FIR filter architecture. This architecture has two data
inputs, x_in and coeff, and one data output, y_out. There are two control inputs which are not shown, clk
and loadtap.
The input data stream enters at the x_in port. When the filter is ready to process a new sample, the data at
x_in is clocked into the register labeled $x[n]$ in the block diagram. This register stores $x[n]$ of (14), the current
input. The $x[n]$ register is one of $T$ shift registers, where $T$ is the number of taps in the filter. When x_in is
clocked into the $x[n]$ register, the values in the other registers are shifted right in the diagram, with the oldest
value, $x[n - T + 1]$, being discarded.
The tap coefficients are stored in another set of shift registers, labeled $b_0$ through $b_{T-1}$ on the block diagram.
Coefficients are loaded into the registers by applying the coefficient values to the coeff port in sequence and
cycling the loadtap signal to load each one.
The filter is pipelined with four stages: operand selection, multiplication, summation, and final addition.
Operand Selection: The number of multipliers in the architecture is configurable. For a filter with $T$ taps
and $M$ multipliers, each multiplier performs $T/M$ multiplications per input sample. The operands for
each multiplier are selected each clock cycle by an operand bus and clocked into registers.
Multiplication: Each multiplier has two input operand registers, which are loaded by an operand bus in the
previous stage. Each pair of operands is multiplied, and the final two rows of the reduction tree (the
product in carry-save form) are clocked into a register where they become an input to the multi-operand
adder in the next stage. Keeping the product in carry-save form rather than using a carry propagate
adder reduces the overall area and delay.
Summation: The multi-operand adder has carry-save inputs from each multiplier, as well as a carry-save input
from the accumulator. After each of the $T/M$ multiplications has been performed, the output of the
multi-operand adder (in carry-save form) is clocked into the final CPA register where it is added in the
next pipeline stage.
Final Addition: In the last stage, the carry-save vectors from the multi-operand adder and a correction
constant are added by a specialized carry-save adder and a carry propagate adder to produce a single
result vector. The result is then clocked into an output register, which is connected to the y_out output
port of the filter.
Figure 4. Proposed FIR filter architecture.
The clk signal clocks the system. The clock period is set so that the multipliers and the multi-operand adder
can complete their operation within one clock cycle. Therefore, $T/M$ clock cycles are required to process each
input sample. The final addition stage only needs to operate once per input sample, so it has $T/M$ clock
cycles to complete its calculation and is generally not on the critical path.
2.2. Architecture Components
This section discusses the components of the FIR filter architecture.
Figure 5. Multi-operand adder example.
2.2.1. Multipliers
In this paper, twos complement parallel tree multipliers using Reduced Area reduction are used. When
performing truncated multiplication, the constant correction method [5] is used. The output of each multiplier is the
final two rows remaining after reduction of the partial product bits, which is the result in carry-save form [9].
As described in Section 1.1, the last three terms in (3) are constants. In this architecture, these constants
are not included in the partial product matrix. Likewise, if using truncated multipliers, the correction constant
is not included either. Instead, the constants for each multiplication are added in a single operation in the final
addition stage of the filter. This is described in more detail in Section 2.2.3.
2.2.2. Multi-operand adder and accumulator
As (14) shows, the output of an FIR filter is a sum of products. In this architecture, $M$ of these products
are computed per clock cycle. In each clock cycle, the outputs of each multiplier are added and saved in the
accumulator register in carry-save form. The accumulator is included in the sum, except with the first group of
multiplies for a new input sample. This is done by clearing the accumulator when the first group of products
arrives at the input to the multi-operand adder.
Figure 2 shows the Reduced Area reduction tree for an 8 × 8 multiplier. Figure 5 shows the dot diagram
for a multi-operand adder that adds the outputs of three such multipliers and a 16-bit accumulator register,
which is in carry-save form. The multi-operand adder is simply a counter reduction tree using Reduced Area
reduction, similar to a counter reduction tree for a multiplier, except that it begins with operand bits from each
input instead of a partial product bit matrix. The output of the multi-operand adder is the final two rows of
bits remaining after reduction, which is the sum in carry-save form. This output is clocked into the accumulator
register every clock cycle, and clocked into the CPA operand register every $T/M$ cycles.
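The time-multiplexed accumulation described above can be sketched behaviorally as follows. This is a simplified model with several assumptions that are not taken from the paper: ordinary integers stand in for the carry-save vectors, multipliers are exact (no truncation), and in each cycle the operand buses are assumed to select $M$ consecutive taps. The names `process_sample` and `run_filter` are hypothetical.

```python
def process_sample(coeffs, window, num_multipliers):
    """Compute one output using M products per clock cycle.

    window[k] holds x[n - k]. Returns (output, clock cycles used).
    """
    taps = len(coeffs)
    if taps % num_multipliers != 0:
        raise ValueError("this sketch assumes T is evenly divisible by M")
    cycles = taps // num_multipliers
    accumulator = 0
    for cycle in range(cycles):
        if cycle == 0:
            accumulator = 0  # clear the accumulator for the first group
        first = cycle * num_multipliers
        # Assumed tap assignment: M consecutive taps per cycle.
        products = [coeffs[k] * window[k]
                    for k in range(first, first + num_multipliers)]
        # Multi-operand adder: M products plus the accumulator.
        accumulator += sum(products)
    return accumulator, cycles


def run_filter(coeffs, samples, num_multipliers):
    """Shift each sample into the data registers and compute y[n]."""
    window = [0] * len(coeffs)  # registers x[n] ... x[n - T + 1]
    outputs = []
    for sample in samples:
        window = [sample] + window[:-1]  # oldest value is discarded
        result, _ = process_sample(coeffs, window, num_multipliers)
        outputs.append(result)
    return outputs


if __name__ == "__main__":
    print(run_filter([1, 2, 3, 4], [1, 0, 0, 0, 0], num_multipliers=2))
```

With 4 taps and 2 multipliers the model uses two cycles per sample, matching the $T/M$ cycle count given for the architecture.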
2.2.3. Correction constant adder
As stated above in Section 2.2.1, the constants required for twos complement multipliers and the correction
constant for unformed bits in truncated multipliers are not included in the reduction tree, but are added during
the final addition stage. The 1 that is added to round the filter output is also added in this stage. All of these
constants for each multiplier are added as a single constant, $C_{TOTAL}$.
All multipliers used in this paper operate on twos complement operands. From (3), the constant which must
be added for an $m \times n$ multiplier is $2^{m+n-1} + 2^{n-1} + 2^{m-1}$. With $T$ taps there are $T$ multiply operations,
assuming $T$ is evenly divisible by $M$. Thus, a total value of
$$C_M = T \left(2^{m+n-1} + 2^{n-1} + 2^{m-1}\right) \tag{15}$$
must be added in the final addition stage.
The multipliers may be truncated, with unformed columns of partial product bits. If so, the total average
reduction error of the filter is $T \cdot E_{r\,\mathrm{avg}}$. The correction for this is
$$C_R = \mathrm{round}\left(T (r-1) 2^{-2} + T \cdot 2^{-(r+2)}\right) \cdot 2^r. \tag{16}$$
To round the filter output to $l$ bits, the rounding constant which must be used is
$$C_{RND} = 2^{r+k-1}. \tag{17}$$
Combining these constants, the total correction constant for the filter is
$$C_{TOTAL} = C_M + C_R + C_{RND}. \tag{18}$$
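The constants in (15) through (18) are straightforward to compute for a given configuration. The sketch below is illustrative only: the function name is hypothetical, rounding is implemented as floor(x + 1/2) (an assumption about tie handling), and the value $k = 4$ in the usage line is an arbitrary example rather than a parameter reported in the paper.

```python
import math
from fractions import Fraction


def filter_constants(taps, m, n, r, k):
    """Return (C_M, C_R, C_RND, C_TOTAL) following (15) through (18)."""
    c_m = taps * (2 ** (m + n - 1) + 2 ** (n - 1) + 2 ** (m - 1))   # (15)
    scaled = Fraction(taps * (r - 1), 4) + Fraction(taps, 2 ** (r + 2))
    c_r = math.floor(scaled + Fraction(1, 2)) * 2 ** r               # (16)
    c_rnd = 2 ** (r + k - 1)                                         # (17)
    return c_m, c_r, c_rnd, c_m + c_r + c_rnd                        # (18)


if __name__ == "__main__":
    # 24 taps, 16-bit operands, 12 unformed columns; k = 4 is an assumed value.
    print(filter_constants(taps=24, m=16, n=16, r=12, k=4))
```

Because the factor $T$ is applied before rounding in (16), the rounded constant can track the total average reduction error more closely than $T$ separately rounded per-multiplier constants.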
Adding $C_{TOTAL}$ to the summer output is done using a specialized carry-save adder (SCSA), which is simply
a carry-save adder optimized for adding a constant bit vector. A carry-save adder uses full adders to reduce
three bit vectors to two. SCSAs differ in that half adders are used in columns where the constant is a 0 and
special half adders are used in columns where the constant is a 1. A special half adder computes the sum and
carry-out of two bits plus a 1, the logic equations being
$$s_i = \overline{a_i \oplus b_i}, \quad \text{and} \quad c_{i+1} = a_i + b_i, \tag{19}$$
where $+$ denotes logical OR.
The output of the SCSA is then input to the final carry propagate adder.
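The behavior of the special half adder and of an SCSA built from it can be checked with a short bit-level model. The sketch below assumes unsigned bit vectors of a fixed width and uses illustrative function names; it is a behavioral check rather than the structure generated by the tool.

```python
def half_adder(a, b):
    """Sum and carry of a + b."""
    return a ^ b, a & b


def special_half_adder(a, b):
    """Sum and carry of a + b + 1: sum is XNOR, carry is OR, as in (19)."""
    return 1 - (a ^ b), a | b


def scsa(x, y, constant, width):
    """Add a constant to the bit vectors x and y, returning two new vectors.

    Columns where the constant bit is 1 use a special half adder; all other
    columns use an ordinary half adder.
    """
    sum_vector = 0
    carry_vector = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        if (constant >> i) & 1:
            s, c = special_half_adder(a, b)
        else:
            s, c = half_adder(a, b)
        sum_vector += s * 2 ** i
        carry_vector += c * 2 ** (i + 1)  # carries move one column left
    return sum_vector, carry_vector


if __name__ == "__main__":
    s, c = scsa(0b1011, 0b0110, 0b0101, width=4)
    print(s + c == 0b1011 + 0b0110 + 0b0101)
```

The two output vectors always sum to the two inputs plus the constant, which is the property the final carry propagate adder relies on.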
There are two benefits to adding one large correction constant at the end. First, if the constants were added
to the multiplier partial product matrix, it might increase the matrix height enough to require an additional
reduction stage, thereby increasing the delay of the multiplier. Second, it improves the average error due to
truncation. When adding the correction constant to a single multiplier, it must be rounded and truncated. The
portion that is lost drives the average error away from zero. When the correction constant is added at the end
of the filter, it is first multiplied by $T$ before rounding, so fewer bits are lost.
2.2.4. Final carry propagate adder
The output of the special carry-save adder is the output for the current sample, $y[n]$, in carry-save form. A
final carry propagate adder (CPA) is required to compute the final result. The final addition stage has $T/M$
clock cycles to complete, so for many applications a simple ripple-carry adder will be fast enough. If additional
performance is required, a carry-lookahead adder may be used. Using a faster CPA does not increase throughput,
but does improve latency.
2.2.5. Control
A filter with $T$ taps and $M$ multipliers requires $T/M$ clock cycles to process each input sample. The control
circuit is a state machine with $T/M$ states, implemented using a modulo-$T/M$ counter. The present state is
the output of the counter and is used to control which operands are selected by each operand bus. In addition to
the present state, the control circuit generates four other signals: 1) shiftData, used to shift the input samples,
2) clearAccum, which clears the accumulator, 3) loadCpaReg, which loads the summer output into the CPA
operand register, and 4) loadOutput, which loads the final sum into the output register.
2.3. Error Analysis
Using truncated multipliers in the FIR filter introduces calculation errors due to the unformed partial product
bits. These errors can be described by their average value, maximum value, variance, and signal-to-noise ratio.
The average error due to the unformed partial product bits of a single truncated multiplier, $E_{r\,\mathrm{avg}}$, is given
in (7). As discussed in Section 2.2.3, $T$ multiplications are performed with variable operands. Therefore, the
total average error due to unformed partial product bits is $T \cdot E_{r\,\mathrm{avg}}$. This is compensated for by the final
correction constant, yielding an average error for the filter of
$$E_{AVG} = C_R - T (r-1) 2^{r-2} - T \cdot 2^{-2}. \tag{20}$$
If the correction constants were added at the multipliers rather than in the final CPA, the average error of
the filter would be larger. By adding the correction constant at the end, the combined average error of each
multiplier is multiplied by $T$ before rounding, and less precision is lost.
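The benefit of correcting once at the filter output can be illustrated numerically. The sketch below compares $T$ times the single-multiplier average error from (12) with the filter-level average error, computed as $C_R$ from (16) minus $T$ times the magnitude of the uncorrected average error in (7). The function names are hypothetical and rounding is again assumed to be floor(x + 1/2).

```python
import math
from fractions import Fraction


def round_half_up(x):
    """Round to the nearest integer, ties upward (an assumed tie rule)."""
    return math.floor(x + Fraction(1, 2))


def avg_error_corrected_per_multiplier(taps, r):
    """T times the single-multiplier average error from (9) and (12)."""
    c_r = round_half_up(Fraction(r - 1, 4) + Fraction(1, 2 ** (r + 2))) * 2 ** r
    return taps * (c_r - Fraction((r - 1) * 2 ** r + 1, 4))


def avg_error_corrected_at_output(taps, r):
    """Filter average error with one correction constant C_R from (16)."""
    scaled = Fraction(taps * (r - 1), 4) + Fraction(taps, 2 ** (r + 2))
    c_big_r = round_half_up(scaled) * 2 ** r
    return c_big_r - taps * Fraction((r - 1) * 2 ** r + 1, 4)


if __name__ == "__main__":
    for r in (12, 16):
        per_multiplier = avg_error_corrected_per_multiplier(24, r)
        at_output = avg_error_corrected_at_output(24, r)
        print(r, float(per_multiplier), float(at_output),
              float(at_output) / 2 ** 32)
```

For $T = 24$ the filter-level figure divided by $2^{2B} = 2^{32}$ agrees with the $-1.40 \times 10^{-9}$ entries reported for $r = 12$ and $r = 16$ in Table 1.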
The maximum error for an uncorrected truncated multiplier is given in (5) and the range of error is given
in (6). With $T$ multiplications per input sample, the maximum error for the filter without correction becomes
$-T (r-1) 2^r - T$. By adding a correction constant to the filter, the range of the filter output error is
$$C_R - T (r-1) 2^r - T \le E \le C_R, \tag{21}$$
and the maximum error of the filter output is
$$E_{MAX} = \max\left(C_R,\; T (r-1) 2^r + T - C_R\right). \tag{22}$$
The variance of error for a single truncated multiplier is given in (8). Since the operands of each multiplication
are independent for a given input sample, the total variance of error for the filter is
$$\sigma_R^2 = T \sigma_r^2, \tag{23}$$
and the reduction error signal-to-noise ratio for the filter is
$$\mathrm{SNR}_R = 10 \log_{10}\left(\frac{\sigma_x^2}{\sigma_R^2}\right) \text{dB}. \tag{24}$$
To see the effect that the number of taps has on the SNR of the filter, (24) can be rewritten as
$$\mathrm{SNR}_R = 10 \log_{10}\left(\frac{\sigma_x^2}{T \sigma_r^2}\right) = 10 \log_{10}\left(\frac{\sigma_x^2}{\sigma_r^2}\right) - 10 \log_{10}(T), \tag{25}$$
which shows that the SNR of a $T$-tap filter is $10 \log_{10}(T)$ dB less than the SNR for a single multiplier.
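The relationship in (25) is easy to confirm numerically. In the sketch below the signal and error variances are arbitrary placeholder values chosen for illustration, not figures from the paper; only the difference between the two SNR values matters.

```python
import math


def snr_db(signal_variance, error_variance):
    """10 log10 of the variance ratio, as in (13) and (24)."""
    return 10.0 * math.log10(signal_variance / error_variance)


def filter_snr_db(signal_variance, multiplier_error_variance, taps):
    """SNR of a T-tap filter, using the total variance from (23)."""
    return snr_db(signal_variance, taps * multiplier_error_variance)


if __name__ == "__main__":
    sigma_x_sq = 1.0e18   # placeholder signal variance
    sigma_r_sq = 2.5e7    # placeholder single-multiplier error variance
    single = snr_db(sigma_x_sq, sigma_r_sq)
    for taps in (12, 24):
        loss = single - filter_snr_db(sigma_x_sq, sigma_r_sq, taps)
        print(taps, round(loss, 2), round(10.0 * math.log10(taps), 2))
```

Doubling the number of taps costs $10 \log_{10}(2) \approx 3.01$ dB, which matches the difference between the 12-tap and 24-tap entries in Table 1 (89.70 dB versus 86.69 dB for $r = 12$).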
3. FILTER GENERATION SOFTWARE (FGS)
The architecture described in Section 2 provides a great deal of flexibility in terms of operand size, the number
of taps, and the type of multipliers used. This implies that the design space is quite large. In order to facilitate
the development of a large number of specific implementations, a tool was designed that automatically generates
compiler-ready structural VHDL models given a set of parameters. The tool, which is named FGS, also generates
test benches and files of test vectors to verify the filter models.
FGS is written in Java and consists of two main packages. The arithmetic package, discussed in Section 3.1,
is suitable for general usage and is the foundation of FGS. The fgs package, discussed in Section 3.2, is
specifically for generating the filters described previously. It uses the arithmetic package to generate the necessary
components.
3.1. The arithmetic Package
The arithmetic package includes classes for modeling and simulating digital components. The simplest
components include D flip-flops, half adders, and full adders. Larger components such as ripple-carry adders and
parallel multipliers use the smaller components as building blocks. These components in turn are used to model
complex systems such as FIR filters.
3.1.1. Common classes and interfaces
The arithmetic package has several common classes and interfaces which are used by arithmetic subpackages.
The most significant of these are VHDLGenerator, Parameterized, and Simulator.
VHDLGenerator is an abstract class. Any class that represents a digital component and can generate a
VHDL model of itself is derived from this class. It defines three abstract methods which must be
implemented by all subclasses. genCompleteVHDL() generates a complete VHDL file describing the
component. This file includes synthesizable entity-architecture descriptions of all subcomponents used.
genComponentDeclaration() generates the component declaration which must be included in the
entity-architecture descriptions of other components which use this component. genEntityArchitecture() generates
the entity-architecture description of this component.
Parameterized is an interface implemented by classes whose instances can be defined by a set of parameters.
The interface includes the methods getParameters() and setParameters() to access those parameters.
Simulator is an interface implemented by classes that can simulate their operation. The interface has only one
method, simulate, which accepts a vector of inputs and returns a vector of outputs. These inputs and
outputs are vectors of IEEE VHDL std_logic_vectors [10].
3.1.2. arithmetic subpackages
The arithmetic package contains several subpackages:
smallcomponents provides fundamental components including D flip-flops and full adders which are used as
building blocks for larger components such as registers, adders, and multipliers.
matrixreduction provides classes that reduce the bit matrix formed by multi-operand adders and parallel tree
multipliers. Classes are included to perform Wallace [2], Dadda [3], and Reduced Area [4] reduction. Each of
these classes is derived from the abstract class ReductionTree.
adders provides classes that model various types of adders including carry propagate adders, carry-save adders,
and multi-operand adders.
multipliers provides the ParallelMultiplier class for modeling parallel tree multipliers. Parameters can be
set to configure the multiplier for unsigned, twos complement, or combined operation. The number of
unformed columns, if any, and the type of reduction, Wallace, Dadda, or Reduced Area, may also be
specified. Through polymorphism (dynamic binding), the appropriate subclass of ReductionTree reduces
the bit matrix to two rows. These two rows can then be passed to a CarryPropagateAdder object for final
addition, or in the case of the FIR filter, to a multi-operand adder. The architecture of FGS makes it easy
to change the reduction scheme and final addition method. New computer arithmetic techniques can be
incorporated seamlessly by subclassing appropriate abstract classes.
misccomponents provides support classes that provide essential functionality. This includes classes for
modeling the operand busses and registers used in the FIR filter.
firfilters includes classes for modeling ideal FIR filters as well as FIR filters based on the truncated architecture
described in Section 2. Ideal FIR filters provide a baseline for comparison with practical FIR filters and
allow measurement of computational errors.
testing provides classes for testing components generated by other classes, including parallel multipliers and
FIR filters. The FIR filter test class generates a test bench and an input file of test vectors. It also
generates a .vec file for simulation using Altera Max+Plus II.
gui provides graphical user interface (GUI) components for setting parameters and generating VHDL models
for all of the larger components such as parallel multipliers and FIR filters. The GUI for each component
is a Java Swing JPanel, which can be used in any Swing application. These panels make setting component
parameters and generating VHDL files simple and convenient.

Filter     Synthesis Results            Improvement            Reduction Error
                      Total         A×D          Total      A×D
 T  M   r     Area    Delay     Product   Area  Delay  Product  SNR_R  σ_R/2^(2B)  E_AVG/2^(2B)
           (gates)     (ns)  (gates·ns)                          (dB)
12  2   0    16241    40.80      662633      -      -        -      -           0             0
12  2  12    12437    40.68      505937  23.4%   0.3%    23.6%  89.70     4.09E-6     -6.98E-10
12  2  16    10211    40.08      409257  37.1%   1.8%    38.2%  64.22     7.69E-5     -6.98E-10
16  2   0    17369    54.40      944874      -      -        -      -           0             0
16  2  12    13529    54.24      733813  22.1%   0.3%    22.3%  88.45     4.73E-6     -9.31E-10
16  2  16    11303    53.44      604032  34.9%   1.8%    36.1%  62.97     8.88E-5     -9.31E-10
20  2   0    19278    68.00     1310904      -      -        -      -           0             0
20  2  12    15475    67.80     1049205  19.7%   0.3%    20.0%  87.48     5.28E-6      -1.16E-9
20  2  16    13249    66.80      885033  31.3%   1.8%    32.5%  62.00     9.93E-5      -1.16E-9
24  2   0    20828    81.60     1699565      -      -        -      -           0             0
24  2  12    17007    81.36     1383690  18.3%   0.3%    18.6%  86.69     5.79E-6      -1.40E-9
24  2  16    14781    80.16     1184845  29.0%   1.8%    30.3%  61.21     1.09E-4      -1.40E-9
12  4   0    25355    20.40      517242      -      -        -      -           0             0
12  4  12    18671    20.34      379768  26.4%   0.3%    26.6%  89.70     4.09E-6     -6.98E-10
12  4  16    14521    20.04      291001  42.7%   1.8%    43.7%  64.22     7.69E-5     -6.98E-10
16  4   0    26133    27.20      710818      -      -        -      -           0             0
16  4  12    19413    27.12      526481  25.7%   0.3%    25.9%  88.45     4.73E-6     -9.31E-10
16  4  16    15264    26.72      407854  41.6%   1.8%    42.6%  62.97     8.88E-5     -9.31E-10
20  4   0    28468    34.00      967912      -      -        -      -           0             0
20  4  12    21786    33.90      738545  23.5%   0.3%    23.7%  87.48     5.28E-6      -1.16E-9
20  4  16    17636    33.40      589042  38.0%   1.8%    39.1%  62.00     9.93E-5      -1.16E-9
24  4   0    29802    40.80     1215922      -      -        -      -           0             0
24  4  12    23101    40.68      939749  22.5%   0.3%    22.7%  86.69     5.79E-6      -1.40E-9
24  4  16    18950    40.08      759516  36.4%   1.8%    37.5%  61.21     1.09E-4      -1.40E-9
Table 1. Synthesis results, B = m = n = 16 (optimized for area).
3.2. The fgs Package
Whereas the arithmetic package is suitable for general use, the fgs package is specific to the FIR filter architecture
described in Section 2. fgs includes classes for automating much of the work done to analyze the use of truncated
multipliers in FIR filters. For example, this package includes a driver class that automatically generates a large
number of different FIR filter configurations for synthesis and testing. Complete VHDL models are then
generated as well as Tcl scripts to drive the synthesis tool. The Tcl script commands the synthesis program
to write area and delay reports to disk files that are then parsed by another class in the fgs package which
summarizes the data and writes it to a CSV file for analysis by a spreadsheet application.
4. RESULTS
Table 1 presents some representative synthesis results that were obtained from the synthesis tool, Leonardo, and
the LCA300K 0.6 micron CMOS standard cell library. While this is only a small sample of the data collected,
it illustrates the main findings, which are:
1. Using truncated multipliers in FIR filters results in significant improvements in area. For example, the
area of a 16-bit filter with 4 multipliers and 24 taps improves by 22.5% with 12 unformed columns and
by 36.4% with 16 unformed columns. We estimate substantial power savings would be realized as well.
Truncation has little impact on the overall delay of the filter.
2. The computational error introduced by truncation is tolerable for many applications. For example, the
reduction error SNR for a 16-bit filter with 24 taps is 86.7 dB with 12 unformed columns and 61.2 dB
with 16 unformed columns. In comparison, the SNR due to quantization error for a 16-bit quantizer is 95.1 dB [8].
3. The average error of a filter is independent of $r$ (for $T > 4$), and much less than that of a single truncated
multiplier. For a 16-bit filter with 24 taps, the ratio of the average error to the full range of the output is
$1.40 \times 10^{-9}$. In comparison, the average error of a single 16-bit multiplier with $r = 12$ is $2.38 \times 10^{-5}$.
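The improvement percentages quoted in the first finding follow directly from the area, delay, and area-delay figures in Table 1. The sketch below recomputes them for the 24-tap, 4-multiplier rows; the helper name `improvement_percent` is illustrative.

```python
def improvement_percent(baseline, truncated):
    """Relative reduction with respect to the untruncated (r = 0) design."""
    return 100.0 * (baseline - truncated) / baseline


if __name__ == "__main__":
    # Table 1 rows for T = 24, M = 4: (area in gates, delay in ns, A x D product).
    baseline = (29802, 40.80, 1215922)
    truncated = {12: (23101, 40.68, 939749), 16: (18950, 40.08, 759516)}
    for r, row in truncated.items():
        figures = [round(improvement_percent(b, t), 1)
                   for b, t in zip(baseline, row)]
        print("r =", r, "area, delay, A x D improvement (%):", figures)
```

The recomputed values reproduce the table entries, including the 22.5% and 36.4% area reductions cited above.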
5. CONCLUSIONS
This paper presented a parameterized FIR filter architecture used to study the effects of truncated multipliers
in FIR filters. A tool for generating structural VHDL models of specific instances of these filters was discussed.
The computation errors introduced by using truncated multipliers were analyzed and quantified in terms of
average error, maximum error, variance of error, and reduction error signal-to-noise ratio. Synthesis results of
several specific implementations of the proposed architecture were presented. These results show that significant
area and power improvements can be achieved with a moderate decrease in error SNR and a very small average
error.
REFERENCES
1. M.J. Schulte, J.E. Stine, and J.G. Jansen, "Reduced Power Dissipation Through Truncated Multiplication,"
IEEE Alessandro Volta Memorial Workshop on Low Power Design, Como, Italy, 1999, pp. 61-69.
2. C.S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Transactions on Electronic Computers, Vol.
EC-13, pp. 14-17, 1964.
3. L. Dadda, "Some Schemes for Parallel Multipliers," Alta Frequenza, Vol. 34, pp. 349-356, 1965.
4. K.C. Bickerstaff, M.J. Schulte, and E.E. Swartzlander, Jr., "Parallel Reduced Area Multipliers," IEEE
Journal of VLSI Signal Processing, Vol. 9, pp. 181-191, 1995.
5. M.J. Schulte and E.E. Swartzlander, Jr., "Truncated Multiplication with Correction Constant," VLSI Signal
Processing, VI, IEEE Press, New York, NY, 1993, pp. 388-396.
6. E.E. Swartzlander, Jr., "Truncated Multiplication with Approximate Rounding," 33rd Asilomar Conference
on Signals, Circuits, and Systems, 1999, pp. 1480-1483.
7. E.G. Walters III, "Design Tradeoffs Using Truncated Multipliers in FIR Filter Implementations," Master's
Thesis, Lehigh University, May 2002.
8. A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Processing, 2nd edition, Prentice Hall, Upper
Saddle River, NJ, 1999.
9. I. Koren, Computer Arithmetic Algorithms, Prentice Hall, Englewood Cliffs, NJ, 1993.
10. IEEE Standard Multivalue Logic System for VHDL Model Interoperability (Std_logic_1164): IEEE Std
1164-1993, IEEE, 26 May 1993.
