UWE MEYER-BÄSE
Department of Electrical and Computer Engineering, FAMU-FSU College of Engineering, Tallahassee, FL 32310-6046
ANTONIO GARCÍA
Dpto. Ingeniería Informática, Universidad Autónoma de Madrid
FRED TAYLOR
High Speed Digital Architecture Laboratory, University of Florida, Gainesville, FL 32611-6130
Received July 1999; Revised December 1999
Abstract. Field-programmable logic (FPL), often grouped under the popular name field-programmable gate arrays (FPGA), is on the verge of revolutionizing sectors of the digital signal processing (DSP) industry as the programmable DSP microprocessor did nearly two decades ago. Historically, FPGAs were considered to be only a rapid prototyping and low-volume production technology. FPGAs are now attempting to move into the mainstream DSP as their density and performance envelope steadily improve. While evidence now supports the claim that FPGAs can accelerate selected low-end DSP applications (e.g., FIR filter), the technology remains limited in its ability to realize high-end DSP solutions. This is due primarily to systemic weaknesses in FPGA-facilitated arithmetic processing. It will be shown that in such cases, the residue number system (RNS) can become an enabling technology for realizing embedded high-end FPGA-centric DSP solutions. This thesis is developed in the context of a demonstrated RNS/FPGA synergy and the application of the new technology to communication signal processing.
Keywords: field-programmable logic (FPL), field-programmable gate array (FPGA), complex programmable logic devices (CPLD), digital signal processing (DSP), residue number system (RNS), channelizer, zero-IF filter
1. Introduction
Experts generally agree that future signal processing systems will contain deeply embedded DSP elements having a performance envelope at least 10× greater than that possessed by the existing DSP µp art. These designs will normally manifest themselves as an application-specific integrated circuit (ASIC). Market forces require that ASIC solutions be rapidly developed in order to ensure early market entry. This market reality
[...] $M = \prod_{i=1}^{L} m_i$. RNS arithmetic is defined with respect to the ring isomorphism:
$$ Z_M \cong Z_{m_1} \times Z_{m_2} \times \cdots \times Z_{m_L} \tag{1} $$
Specifically, $Z_M = Z/M$ corresponds to the ring of integers modulo $M$. The mapping of an integer $X$ into the RNS is defined to be the $L$-tuple $X = (x_1, x_2, \ldots, x_L)$ where $x_i = X \bmod m_i$, for $i = 1, 2, \ldots, L$. This is generally assumed to be a straightforward process that can be directly implemented in hardware using small lookup tables.
Defining $\circ$ to be any of the algebraic operations $+$, $-$, or $\times$, it follows that if $0 \le Z < M$, then:
$$ Z = X \circ Y \bmod M \tag{2} $$
is isomorphic to $Z = (z_1, z_2, \ldots, z_L)$ where:
$$ z_i = (x_i \circ y_i) \bmod m_i, \qquad i = 1, 2, \ldots, L \tag{3} $$
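The channel-wise behavior of Eqs. (2) and (3) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the helper names `to_rns` and `rns_op` are hypothetical, and the moduli are borrowed from the CIC design example given later in the paper.

```python
from math import prod
import operator

# Moduli taken from the paper's later CIC design example; any pairwise
# relatively prime set would serve for this illustration.
MODULI = (256, 63, 61, 59)
M = prod(MODULI)  # dynamic range of the RNS


def to_rns(x, moduli=MODULI):
    """Map an integer X to its L-tuple of residues x_i = X mod m_i."""
    return tuple(x % m for m in moduli)


def rns_op(xs, ys, op, moduli=MODULI):
    """Apply +, - or * independently (carry-free) in every residue channel."""
    return tuple(op(x, y) % m for x, y, m in zip(xs, ys, moduli))


X, Y = 1234, 5678
for op in (operator.add, operator.sub, operator.mul):
    channelwise = rns_op(to_rns(X), to_rns(Y), op)
    direct = to_rns(op(X, Y) % M)
    print(op.__name__, channelwise, channelwise == direct)
```

Each channel only ever sees operands smaller than its own modulus, which is the property that lets the hardware use short, independent word widths.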
It should be self-evident that RNS arithmetic is performed in parallel within small non-communicating (i.e., carry-free) channels whose word width is bounded by $n_i = \lceil \log_2(m_i) \rceil$, where $n_i \le 8$ bits (typically). In practice, most RNS arithmetic systems use small RAM or ROM tables to implement the modular mappings $z_i = (x_i \circ y_i) \bmod m_i$ as LUT calls. Using direct LUT operations can, however, create a technological problem. If the address of the LUT is formed by concatenating the arguments $(x_i, y_i)$, then a $2^{2n_i} \times n_i$-bit table would be required. A 7-bit modulus, for example, would require a 114K-bit table ($2^{14} \times 7$ bits), which is beyond the current capabilities of a modern FPGA. Specifically, consider again an $n_i$-bit modulus and two residues, say $x_i$ and $y_i$, used to create a product $z_i = (x_i y_i) \bmod m_i$, which is $n_i$ bits wide. If the desired modulus size is on the order of 6 to 8 bits, an unreasonable 12- to 16-bit TLU address space results. The
118 Meyer-Bäse, García and Taylor
address space requirement can, however, be reduced by nearly half by using the quarter-square algorithm:
$$ \begin{align} z_i &= (x_i y_i) \bmod m_i \tag{4} \\ &= \left[ \frac{(x_i + y_i)^2}{4} - \frac{(x_i - y_i)^2}{4} \right] \bmod m_i \tag{5} \\ &= \left( \phi(x_i, y_i) - \theta(x_i, y_i) \right) \bmod m_i \tag{6} \end{align} $$
where $\phi(x_i, y_i)$ and $\theta(x_i, y_i)$ are obtained from LUT calls using the sum and difference of the residues as an $(n_i + 1)$-bit wide address. Compared to a direct implementation of a standard RNS multiplier, the table requirements are reduced from $2^{2n_i} \times n_i$ bits to $2 \times 2^{n_i+1} \times n_i = 2^{n_i+2} \times n_i$ bits. The savings for a 7-bit modulus is a factor of 32. What is more important is that the multiplication LUTs can now be contained within an 8-bit FPGA channel. As a result, 7-bit moduli could be used in conjunction with 8-bit FPGA tables to implement a standard RNS multiplier.
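The quarter-square identity can be exercised in a few lines. This is a sketch under stated assumptions rather than the paper's circuit: it stores $\lfloor s^2/4 \rfloor \bmod m$ (the floor is harmless because $x + y$ and $x - y$ always have the same parity), it shares one table between the sum and difference lookups by addressing the latter with $|x - y|$, and the function names are hypothetical.

```python
def build_quarter_square_table(m):
    """table[s] = floor(s^2 / 4) mod m for every sum s = x + y, 0 <= s <= 2(m - 1)."""
    return [(s * s // 4) % m for s in range(2 * m - 1)]


def quarter_square_multiply(x, y, m, table):
    """Modular product from two table lookups and one modular subtraction."""
    return (table[x + y] - table[abs(x - y)]) % m


m = 127  # a 7-bit prime modulus, chosen only for illustration
table = build_quarter_square_table(m)
print(len(table), "entries, addressable with", (2 * m - 2).bit_length(), "bits")
print(quarter_square_multiply(100, 99, m, table), (100 * 99) % m)
```

The table has fewer than $2^{n_i+1}$ entries, which is what allows a 7-bit modulus to fit an 8-bit-address table.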
Conversion from the RNS to integers is performed using either the Chinese Remainder Theorem (CRT) or the mixed-radix conversion (MRC) algorithm. The direct implementation of either form can be awkward, but efficient forms of these algorithms can be found in the literature.
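For reference, the textbook form of the CRT can be sketched as follows. This is the direct (awkward) form mentioned above, not one of the efficient hardware variants from the literature; the function name `crt` is hypothetical and the moduli are again those of the paper's later design example. The three-argument `pow` with exponent `-1` requires Python 3.8 or later.

```python
from math import prod


def crt(residues, moduli):
    """Reconstruct X in [0, M) from its residues using the direct CRT sum."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        M_i = M // m                   # product of all other moduli
        inv = pow(M_i, -1, m)          # multiplicative inverse of M_i modulo m
        total += r * M_i * inv
    return total % M


moduli = (256, 63, 61, 59)
x = 12345678
residues = tuple(x % m for m in moduli)
print(residues, crt(residues, moduli))
```

The final reduction modulo $M$ is the step that makes a direct hardware realization costly, since it operates on the full word width.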
Demonstration RNS systems have been built as custom VLSI [7] (see Fig. 1), GaAs, and LSI systems [1]. These studies have demonstrated the speed-area advantage of the RNS in implementing MAC-intensive algorithms. The 0.8 µm system shown in Fig. 1 contains twenty-four 32-bit MACs. Running at the speed of a TMS320C5x MAC, the RNS MAC's footprint is 1/14th the C5x's. For small wordlengths, RNS can provide significant speed-ups [8] using the $2^4 \times 2$-bit tables found in Xilinx XC4000 FPGAs. For larger moduli, the $2^8 \times 8$-bit tables belonging to the Altera FLEX CPLDs are beneficial in designing RNS arithmetic and RNS-to-integer converters. With the ability to support larger moduli, the design of high-precision FPL systems becomes a practical reality.
Figure 1. RNS systolic array chip [7].
There are several variations of the RNS theme which apply to DSP. One of the popular variants is based on the use of index arithmetic [9]. It is similar, in some respects, to the form taken by the logarithmic number system (LNS). Computation in the index domain is based on the fact that if all the moduli are chosen to be primes $p_i$, then it is known from number theory that there exists a primitive element (i.e., generator $g$) such that:
$$ a \equiv g^{\alpha} \bmod p \tag{7} $$
The element $g$ generates all elements in the field $Z_p$, excluding zero (denoted $Z_p/\{0\}$). There is, in fact, a one-to-one correspondence between the integers $a$ in $Z_p/\{0\}$ and the exponents $\alpha$, which are defined in $Z_{p-1}$. As a point of terminology, the index $\alpha$ with respect to the generator $g$ and integer $a$ is denoted $\alpha = \mathrm{ind}_g(a)$.
For notational purposes, the element $a = 0$ is denoted $g^{-\infty}$. [...]
If the data being processed is in index form, then multiplication can be performed using only exponent addition mod $(p - 1)$. The advantage gained by index processing is found in the fact that the multiplicative table size, when compared to the standard RNS of comparable moduli size, is reduced from $2^{2n} \times n$ to $2^n \times n$ based on $n$-bit moduli. If the modulo adder in step two is replaced by a binary adder, then the multiplier correction
Channelizer using FPGAs and RNS 119
table is $2^{(n+1)} \times n$, or twice as large as that requiring modulo adders. In either case, this can be beneficial in FPGA designs where only small tables are generally available.
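Index multiplication can be illustrated with two small tables and one modulo $(p - 1)$ addition. The sketch below is an assumption-laden illustration: the generator is found by brute force, zero is handled by an explicit test rather than a special index code, and all function names are hypothetical.

```python
def find_generator(p):
    """Return the smallest primitive element of the prime field Z_p (brute force)."""
    for g in range(2, p):
        if len({pow(g, k, p) for k in range(p - 1)}) == p - 1:
            return g
    raise ValueError("no generator found; p must be an odd prime")


def build_index_tables(p):
    """Build the index (integer -> exponent) and inverse (exponent -> integer) tables."""
    g = find_generator(p)
    antilog = [pow(g, k, p) for k in range(p - 1)]
    index = {a: k for k, a in enumerate(antilog)}
    return g, index, antilog


def index_multiply(x, y, p, index, antilog):
    """Multiply two residues by adding their indices modulo p - 1."""
    if x == 0 or y == 0:          # zero has no finite index
        return 0
    return antilog[(index[x] + index[y]) % (p - 1)]


p = 127  # 7-bit prime modulus, chosen for illustration
g, index, antilog = build_index_tables(p)
print("generator:", g, " 100 * 99 mod 127 =", index_multiply(100, 99, p, index, antilog))
```

Each table has $p - 1$ entries, in line with the $2^n \times n$ size quoted above, as opposed to the $2^{2n} \times n$ size of a direct product table.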
The advantage gained in index multiplication, however, is somewhat mitigated when index addition is encountered. Addition can technically be performed by converting index-coded RNS data back into the RNS domain where the summands can be added. Once the sum is formed, the result can be mapped back into the index domain. Another approach is based on Zech logarithms [10], where a Zech logarithm (with respect to the generator $g$) is defined as:
$$ Z(k) = \mathrm{ind}_g(1 + g^k) \;\Longleftrightarrow\; g^{Z(k)} = 1 + g^k \tag{8} $$
The sum of index-coded numbers, say $X = g^x$ and $Y = g^y$, with result $g^z$, is expressed as:
$$ X + Y = g^z = g^x + g^y = g^x\left(1 + g^{y-x}\right) = g^y\left(1 + g^{x-y}\right) \tag{9} $$
or, in terms of a Zech logarithm:
$$ g^z = g^{\,y + Z(x-y)} \;\Longleftrightarrow\; z = y + Z(x - y). \tag{10} $$
Adding numbers in the index domain requires one addition, one subtraction, and a Zech LUT. The special case $a + b \equiv 0$ corresponds to the case where [11]:
$$ -X \equiv Y \bmod p \;\Longleftrightarrow\; g^{x + (p-1)/2} \equiv g^{y} \bmod p. $$
That is, the sum is zero if, in the index domain, $y = x + (p - 1)/2 \bmod (p - 1)$.
Therefore, implementing a basic DSP object with Zech logarithms will reduce the number of necessary LUTs for FPLs to the minimum of one per MAC cell.
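Equation (10) and the zero-sum special case can be traced with a small prime. The following sketch assumes $p = 13$ with generator $g = 2$ and uses Python's `None` as a stand-in for the index of zero; these choices and the helper names are illustrative assumptions, not part of the paper.

```python
P = 13
G = 2                                   # 2 is a primitive element of Z_13
ANTILOG = [pow(G, k, P) for k in range(P - 1)]       # index -> integer
INDEX = {a: k for k, a in enumerate(ANTILOG)}        # integer -> index
ZERO = None                             # placeholder for the index of 0

# Zech table Z(k) = ind_g(1 + g^k); no entry exists where 1 + g^k = 0,
# which happens exactly at k = (p - 1) / 2.
ZECH = [INDEX.get((1 + ANTILOG[k]) % P, ZERO) for k in range(P - 1)]


def encode(a):
    """Integer in 0..P-1 to index code."""
    return ZERO if a == 0 else INDEX[a]


def decode(k):
    """Index code back to an integer in 0..P-1."""
    return 0 if k is ZERO else ANTILOG[k]


def index_add(x, y):
    """Add two index-coded numbers: z = y + Z(x - y) mod (p - 1)."""
    if x is ZERO:
        return y
    if y is ZERO:
        return x
    zech = ZECH[(x - y) % (P - 1)]      # one subtraction, one table lookup
    if zech is ZERO:                    # x - y = (p - 1)/2, so the sum is zero
        return ZERO
    return (y + zech) % (P - 1)         # one addition


print(decode(index_add(encode(7), encode(9))), (7 + 9) % P)
```

The only table needed per addition is `ZECH`, which matches the one-LUT-per-MAC observation above.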
Another RNS variant applies to cases where complex arithmetic is required (e.g., DFT and communications applications). Traditional logic states that the roots of the quadratic equation $x^2 = -1$ are defined over the complex field. That is, in the complex RNS (CRNS) the roots of $x^2 = -1$ are complex and defined in terms of the imaginary operator $j = \sqrt{-1}$. As a consequence, complex RNS numbers are defined by the two-tuple $Z = X + jY$; complex addition requires two real adds, and complex multiplication is defined by four real products, an addition, and a subtraction (albeit of short wordlength). This condition is radically altered in the quadratic RNS, or QRNS. The QRNS is based on known properties of Gaussian primes of the form $p = 4k + 1$, where $k$ is a positive integer. The importance of this choice of moduli is found in the factorization of the polynomial $x^2 + 1$ given by Gauss. For Gaussian primes, the roots of $x^2 = -1$ are no longer imaginary but rather two real roots, denoted $\hat{\jmath}$ and $-\hat{\jmath}$. Specifically, $\hat{\jmath}$ and $-\hat{\jmath}$ are real integers belonging to the residue class $Z_p$. Converting a CRNS number $a + jb$ into the QRNS is accomplished by applying the transform $f: Z_p^2 \to Z_p^2$ as follows:
$$ f(a + jb) = \left( (a + \hat{\jmath}\, b) \bmod p,\; (a - \hat{\jmath}\, b) \bmod p \right) = (A, B) \tag{11} $$
In the QRNS, addition and multiplication are component-wise, and are defined to be:
$$ (a + jb) + (c + jd) \;\leftrightarrow\; (A + C,\; B + D) \bmod p \tag{12} $$
$$ (a + jb)(c + jd) \;\leftrightarrow\; (AC,\; BD) \bmod p \tag{13} $$
In the QRNS domain, complex multiplication requires only two real multiplications, while two's complement multiplication requires four real multipliers, a real add, and a real subtraction to complete. Finally, the conversion of a QRNS digit back into the RNS is defined by:
$$ f^{-1}(A, B) = \left[ 2^{-1}(A + B) + j\,(2\hat{\jmath})^{-1}(A - B) \right] \bmod p \tag{14} $$
Figure 2 graphically interprets the mappings between the CRNS and QRNS.
Figure 2. CRNS ↔ QRNS conversion.
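The mappings of Eqs. (11), (13), and (14) can be verified exhaustively for a small prime. The sketch below assumes $p = 13$ (so $k = 3$) and finds $\hat{\jmath}$ by search; the function names are hypothetical.

```python
P = 13  # prime of the form 4k + 1 with k = 3

# Smallest integer root of x^2 = -1 mod p (5 for p = 13, since 25 mod 13 = 12).
J_HAT = next(r for r in range(2, P) if (r * r) % P == P - 1)


def to_qrns(a, b):
    """Eq. (11): map a + jb to the pair (A, B)."""
    return ((a + J_HAT * b) % P, (a - J_HAT * b) % P)


def from_qrns(A, B):
    """Eq. (14): recover (a, b) from (A, B)."""
    inv_2 = pow(2, -1, P)
    inv_2j = pow(2 * J_HAT, -1, P)
    return ((inv_2 * (A + B)) % P, (inv_2j * (A - B)) % P)


def qrns_multiply(u, v):
    """Eq. (13): complex product as two independent real products."""
    return ((u[0] * v[0]) % P, (u[1] * v[1]) % P)


a, b, c, d = 3, 7, 11, 2
product = from_qrns(*qrns_multiply(to_qrns(a, b), to_qrns(c, d)))
print(product, ((a * c - b * d) % P, (a * d + b * c) % P))
```

The two component products have no cross terms, so they can run in separate, non-communicating channels.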
4. FPL RNS Implementation
In order to facilitate efficient RNS-centric FPGA designs, a collection of RNS macros was developed for a target Altera technology. For an Altera CPLD design, the VHDL description was used because it provided a flexible design environment that could be precisely controlled and optimized. For the VHDL approach, structural (i.e., component instantiation) and behavioral descriptions yielded similar results. The structural designs, however, produced synthesized results that were easier to post-optimize.
For both standard and index RNS arithmetic, the core element is a modular adder. Several modular adder designs are shown in Fig. 3 [12]. Using only LCs, the design of Fig. 3(a) is realized. The Altera FLEX CPLD contains a number of 2K-bit ROMs and/or RAMs (EABs) which can be configured as $2^8 \times 8$, $2^9 \times 4$, $2^{10} \times 2$, or $2^{11} \times 1$ tables and used for modulo $m_i$ correction, as shown in Fig. 3(b). Table 3 summarizes the re-designed 6-, 7-, and 8-bit modulo adders [13].
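A behavioral view of the two adder styles may help when reading Fig. 3. The sketch below is an assumption about their function only (select between $x + y$ and $x + y - m_i$, or look the corrected sum up in a table); it does not model pipelining or the actual Altera mapping, and the names are hypothetical.

```python
def mpx_mod_add(x, y, m):
    """Multiplexed adder: form x + y and x + y - m, then select the valid one."""
    s = x + y
    t = s - m
    return t if t >= 0 else s


def build_correction_rom(m):
    """ROM contents: corrected value for every raw sum 0 .. 2(m - 1)."""
    return [s % m for s in range(2 * m - 1)]


def rom_mod_add(x, y, rom):
    """ROM adder: a binary add forms the address, the table applies the correction."""
    return rom[x + y]


m = 61  # one of the 6-bit moduli used later in the paper
rom = build_correction_rom(m)
print(mpx_mod_add(40, 35, m), rom_mod_add(40, 35, rom), (40 + 35) % m)
```

For a 6-bit modulus the raw sum needs 7 address bits, so the correction table fits comfortably in a $2^8 \times 8$ configuration.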
Although the ROMs shown in Fig. 3 support high-speed LUTs, the ROM itself produces a four-cycle pipeline delay. Furthermore, the number of on-chip ROMs is limited. ROMs, however, are required for the scaling schemes. Compared to the pipelined design shown in Fig. 3(b), the multiplexed adder (MPX-Add) shown in Fig. 3(a) runs at a reduced speed even if a carry chain is added to each column. The pipelined design requires the same number of LCs as the unpipelined version, but is expected to run about twice as fast. Maximum throughput occurs when the adders are implemented in two blocks (where each block has eight LCs for Altera FLEX 10K devices) within 6-bit pipelined channels.

Figure 3. Modular addition with CPLD. (a) MPX-Add and MPX-Add-Pipe. (b) ROM-Pipe.

Table 3. Modulo adder complexity: FLEX10K 3 ns device.

Design   Pipeline stages   6 bits        7 bits        8 bits
MPX      0                 41.3 MSPS     46.5 MSPS     33.7 MSPS
                           27 LC         31 LC         35 LC
MPX      2                 76.3 MSPS     62.5 MSPS     60.9 MSPS
                           16 LC         18 LC         20 LC
MPX      3                 151.5 MSPS    138.9 MSPS    123.5 MSPS
                           27 LC         31 LC         35 LC
ROM      2                 86.2 MSPS     86.2 MSPS     86.2 MSPS
                           7 LC          8 LC          9 LC
                           1 EAB         1 EAB         2 EAB
Several other RNS basic building blocks are required to support RNS designs. The list includes a modulo adder for the index domain (i.e., modulo multiplier), Zech MAC cells, and code converters (BIN→RNS and RNS→BIN) based on a CRT algorithm. Altera VHDL software does not allow generic clauses. Therefore, gawk and C programs have been developed for automatic generation of the basic building blocks by specifying the desired blocks. With these library elements, standard and index RNS arithmetic systems can be designed.
5. FPGA Channelizer Implementation
A typical modern communication receiver is shown in Fig. 4. The received analog signal is mixed with a locally generated signal and bandpass filtered. In the process the received wideband signal is split into quadrature channels (I and Q) that are digitized. The digital section of the receiver is called a channelizer or zero-IF demodulator. The channelizer maps RF (or near RF) directly to baseband. The commercial imperative is to reduce the complexity of the digital portion of the receiver to, ideally, a single chip. For mobile applications, a premium is also placed on power dissipation (active and standby). The interface between the analog section and the channelizer is therefore based on a maximum data conversion rate and a power/complexity decision.
Figure 4. IF incoherent receiver with sin/cos mixer.
Figure 5. Harris HSP43220 Hogenauer decimating filter.
Converting signals from, or near, RF rates to baseband is a non-trivial problem. For many typical wireless communications problems, signal decimation rates on the order of $10^3$ or higher need to be achieved. The preferred design methodology is called a Hogenauer architecture [4]. An example of a Hogenauer-enabled channelizer is shown in Fig. 5 as the lowpass filters (LPF). The advantages of the Hogenauer architecture are:
• the preprocessor (called a Hogenauer filter) is a MAC-free multirate lowpass filter and, being MAC-free, is capable of running at a high real-time rate;
• the postprocessor is a basic FIR "housekeeping" filter running at a low data rate.
The theoretical foundations of a Hogenauer channelizer are well understood but represent a significant FPGA design challenge. Arithmetic in a Hogenauer filter section must be exact and can often exceed 50-bit wordwidths. Large arithmetic wordwidths immediately create a barrier to FPGA implementation. The RNS, however, provides a mechanism for achieving exact high-precision MAC operations within independent small-wordlength channels. To appreciate the need for the RNS in this case, the mechanics of a Hogenauer channelizer will be briefly reviewed.
6. Hogenauer Filter
A Hogenauer filter, or as it is sometimes called, a cascade integrator comb (CIC) filter, has been proven to be capable of performing high decimation-rate channelization at high input data rates. Figure 6(a) illustrates a three-stage CIC filter consisting of a three-stage integrator, a sampling rate reduction by $R$ (decimation), and a three-stage comb filter. Notice that the only logic elements in the design are registers and adders (i.e., MAC-free).
Figure 6. CIC filter. (a) Each stage 26 bit. (b) Detail design with base removal scaling (BRS).
The transfer function of an $S$-stage CIC system is given by:
$$ H(z) = \left( \frac{1 - z^{-RD}}{1 - z^{-1}} \right)^{S} \tag{15} $$
The $S$ poles of the CIC filter are located at $z = 1$ (i.e., DC) and the zeros are distributed along the periphery of the unit circle, appearing with multiplicity $S$ on $2\pi/(RD)$ centers. The $S$ zeros at $z = 1$ are annihilated by the $S$ poles residing at the same location. The result is that the transfer function behaves as a classic $S$-stage moving average filter. The CIC filter maximum gain occurs at DC (i.e., $z = 1$) and has a value of $B_{\mathrm{grow}} = (RD)^S$, or $b = \log_2 B_{\mathrm{grow}}$ in bits. This value can be substantial, as evidenced by the need for a 56-bit dynamic range in the Harris HSP43220 [15] channelizer shown in Fig. 5. Furthermore, it is fundamentally important that the CIC arithmetic be performed exactly, since the integrator section, during run-time, will constantly be incurring modulo ($N$) overflows ($N$ is the CIC dynamic range). The comb filter section must compensate for the integrators' modulo ($N$) overflows by unwrapping the result modulo ($N$) an equal number of times. Any rounding or approximation in this process would be fatal. The Harris HSP43220, for example, uses an exact 2's-complement 56-bit code to satisfy this requirement. To illustrate, assume that the input wordwidth to the 3-stage RNS CIC filter, shown in Fig. 6(a), is 8 bits. For $D = 2$, $R = 32$, or $DR = 2 \times 32 = 64$, an internal wordwidth of $W = 8 + 3 \log_2(64) = 26$ bits is needed to ensure that no run-time overflow will occur. The output wordwidth would normally be a value significantly less than $W$, say 10 bits. Hogenauer [14] noted it is possible to design each stage of the CIC section to have just enough dynamic range to ensure an arithmetically correct outcome. Figure 7 shows a pruning architecture as suggested by Hogenauer. If the ratio of signal bandwidth to sampling frequency is, for instance, 1/32, then the aliasing suppression is 89.6 dB and the maximum passband attenuation is 0.17 dB [14, Tables 1 and 2]. These facts, along with a high-bandwidth requirement, motivate the use of RNS to implement CIC filters with FPGAs.
Figure 7. CIC transfer function ($f_s$ is sampling frequency at input).
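The claim that wrap-around overflow in the integrators is undone by the combs can be checked with a behavioral model. The following sketch uses the example parameters from the text ($S = 3$, $R = 32$, $D = 2$, 26-bit words); the function names and the test signal are assumptions made for illustration, and the reference is simply $S$ cascaded length-$RD$ moving sums computed in unbounded integers.

```python
import random


def cic_wrapped(x, S=3, R=32, D=2, bits=26):
    """S-stage CIC decimator using wrap-around (modulo 2**bits) arithmetic."""
    mask = (1 << bits) - 1
    integ = [0] * S
    delays = [[0] * D for _ in range(S)]
    out = []
    for n, sample in enumerate(x):
        v = sample & mask                       # two's complement encoding
        for s in range(S):                      # integrators at the input rate
            integ[s] = (integ[s] + v) & mask    # overflows simply wrap
            v = integ[s]
        if n % R == R - 1:                      # keep every R-th sample
            for s in range(S):                  # combs at the output rate
                old = delays[s].pop(0)
                delays[s].append(v)
                v = (v - old) & mask
            out.append(v - (1 << bits) if v >> (bits - 1) else v)
    return out


def cic_reference(x, S=3, R=32, D=2):
    """Same filter as S cascaded length-RD moving sums, then decimation by R."""
    y = list(x)
    for _ in range(S):
        acc, nxt = 0, []
        for n, v in enumerate(y):
            acc += v - (y[n - R * D] if n >= R * D else 0)
            nxt.append(acc)
        y = nxt
    return y[R - 1::R]


rng = random.Random(0)
signal = [127] * 640 + [rng.randint(-128, 127) for _ in range(640)] + [-128] * 640
print(cic_wrapped(signal) == cic_reference(signal))
```

With the constant full-scale segments the third integrator wraps many times, yet the decimated outputs agree, because every operation is exact modulo $2^{26}$ and the true output fits in 26 bits.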
A pipelined FPGA integrator section needs the same number of LCs as an unpipelined version, and would run about twice as fast. Maximum throughput occurs when the adders are implemented in two blocks (where each block contains 8 LCs for Altera FLEX 10K devices), within six-bit pipelined channels. One additional pipeline delay, for the modulo adder, corresponds to a non-recursive transfer function $A(z) = z^{-2}$, which introduces no significant processing problem. The accumulator, however, is recursive, and an additional delay introduces a second pole at one half the sampling frequency (i.e., [16, Fig. 1]). Because the transfer function of the pipelined accumulator satisfies $F(z) = z^{-2}/(1 - z^{-2})$, the pole at $\pi$ can be compensated for by a (modulo $m_i$) comb filter with a delay of one (i.e., $G(z) = (1 + z^{-1})\, z^{-2}$). The integrator section, with pole compensation, then becomes $F(z)\, G(z) = z^{-4}/(1 - z^{-1})$ as desired. In a high-decimation CIC application, it can be assumed that an anti-aliasing filter provides sufficient suppression of signal components near $\pi$. A second passband located at $\pi$ is introduced by the recursive pipelined accumulator but introduces no additional aliasing. The six-bit wide pipelined accumulators can then be developed without pole compensation.
Figure 8. BRS and ε-CRT conversion steps.
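The pole compensation can be confirmed from the impulse response: $F(z) = z^{-2}/(1 - z^{-2})$ followed by $G(z) = (1 + z^{-1})\, z^{-2}$ should behave as a plain accumulator delayed by four samples. The sketch below is a direct difference-equation model with a hypothetical function name.

```python
def compensated_impulse_response(length):
    """Impulse response of F(z) = z^-2 / (1 - z^-2) followed by G(z) = (1 + z^-1) z^-2."""
    x = [1] + [0] * (length - 1)
    f = [0] * length
    for n in range(length):
        if n >= 2:
            f[n] = x[n - 2] + f[n - 2]          # pipelined (two-delay) accumulator
    w = []
    for n in range(length):
        a = f[n - 2] if n >= 2 else 0
        b = f[n - 3] if n >= 3 else 0
        w.append(a + b)                          # comb that cancels the pole at pi
    return w


print(compensated_impulse_response(10))
```

The response is zero for four samples and one afterwards, which is the impulse response of $z^{-4}/(1 - z^{-1})$.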
As a design example, consider a three-stage CIC filter having 8-bit input, 10-bit output, $D = 2$, and $R = 32$. The required maximal dynamic range is 26 bits. For the RNS implementation, a 4-modulus system is chosen consisting of the relatively prime moduli (256, 63, 61, 59) (i.e., one 8-bit two's complement (TC) and three 6-bit moduli). The output scaling of the RNS system is implemented using the ε-CRT at a cost of 8 tables and 3 TC adders [17, Fig. 1], or (as shown in Fig. 8) with a base removal scaling (BRS) algorithm based on two 6-bit moduli (which occur in the same fashion in the mixed-radix conversion scheme [18]) and an ε-CRT for the remaining 2 moduli. This approach uses a total of 5 modulo adders and 9 ROM tables, or 7 tables if the multiplicative inverse ROM and the ε-CRT are combined. The following table shows the speed in MSPS and the LCs and EABs used for the three scaling schemes.
Type            ε-CRT    BRS-ε-CRT (speed data for BRS m_4 only)    BRS-ε-CRT combined ROM
MSPS            58.8     70.4                                       58.8
#LC             34       87                                         87
#Table (EAB)    8        9                                          7
The decrease in speed to 58.8 MSPS for scaling schemes #1 and #3 is caused by the fact that a 10-bit ε-CRT table address must be placed in different FPGA rows (each row has only one EAB). This, however, introduces no system speed decrease because the scaling is applied at the lower (output) sampling rate. For the BRS-ε-CRT, it is assumed that only the BRS $m_4$ part (see Fig. 8) must run at the input sampling rate, while BRS $m_3$ and the ε-CRT run at the output sampling rate. Some additional resources can be saved based on the architecture presented in Fig. 6(b). Here the BRS-ε-CRT is used to reduce the bit-width found in earlier filter sections. The early use of ROMs decreases the possible throughput from 76.3 to 70.4 MSPS, which is the maximum speed of the BRS with $m_4$. At the output, the efficient ε-CRT scheme was employed. The following table summarizes the three implemented filter realizations without including the scaling data.
Type    TC 26 bit    RNS 8, 6, 6, 6 bit    Detailed bit-width RNS design
MSPS    49.3         76.3                  70.4
#LC     343          559                   355
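A behavioral model of the four-channel RNS CIC from the design example is sketched below. It is an illustration under several assumptions: the output is recovered with a plain CRT rather than the ε-CRT or BRS scaling used in the paper, the test input is restricted to ±100 so that every output stays inside the symmetric range of $M = 256 \cdot 63 \cdot 61 \cdot 59$, and all function names are hypothetical.

```python
import random
from math import prod

MODULI = (256, 63, 61, 59)          # moduli of the design example
M = prod(MODULI)


def cic_channel(x, m, S=3, R=32, D=2):
    """One residue channel of the CIC: every add and subtract is modulo m."""
    integ = [0] * S
    delays = [[0] * D for _ in range(S)]
    out = []
    for n, sample in enumerate(x):
        v = sample % m
        for s in range(S):
            integ[s] = (integ[s] + v) % m
            v = integ[s]
        if n % R == R - 1:
            for s in range(S):
                old = delays[s].pop(0)
                delays[s].append(v)
                v = (v - old) % m
            out.append(v)
    return out


def crt_signed(residues):
    """Plain CRT followed by mapping [0, M) onto a signed range."""
    total = sum(r * (M // m) * pow(M // m, -1, m) for r, m in zip(residues, MODULI)) % M
    return total - M if total > M // 2 else total


def rns_cic(x):
    """Run the four channels independently, then reconstruct each output sample."""
    channels = [cic_channel(x, m) for m in MODULI]
    return [crt_signed(rs) for rs in zip(*channels)]


def moving_sum_reference(x, S=3, R=32, D=2):
    """S cascaded length-RD moving sums in unbounded integers, decimated by R."""
    y = list(x)
    for _ in range(S):
        acc, nxt = 0, []
        for n, v in enumerate(y):
            acc += v - (y[n - R * D] if n >= R * D else 0)
            nxt.append(acc)
        y = nxt
    return y[R - 1::R]


rng = random.Random(1)
signal = [100] * 640 + [rng.randint(-100, 100) for _ in range(640)]
print(rns_cic(signal) == moving_sum_reference(signal))
```

No channel is wider than 8 bits, yet the reconstructed outputs match the wide-word reference sample for sample.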
6.1. Modulation and Postprocessing
Referring to Fig. 4, it can be seen that digital modulators exist to the left of the channelizer. A high-speed ADC unit, operating at or near RF frequencies, resides at the analog-digital-domain boundary. For sampling rates ≥ 100 MHz, precision is practically limited to 12 bits or less. The output of the ADC can be either binary, or directly mapped into standard or index RNS $L$-tuples. The product modulators can be implemented using a standard or indexed RNS multiplier. The difference would be that the standard RNS multiplier would require a multiplicative LUT, whereas the index multiplier is simply a modulo $p_i$ adder. All these options can be implemented with an FPGA to varying degrees of acceptability. A fast 2's complement 12 × 12-bit multiplier can be built using 9 EABs or 328 LCs, and would run at 69 MHz. A comparable index RNS multiplier, based on three 7-bit moduli, would come in at 2 EABs or 260 LCs per modulus for a total of 6 EABs or 780 LCs, and would run at an 86 MHz rate. This points to an important observation, supported by numerous design studies, which states that for low-resolution cases the RNS benefit is marginal. An RNS advantage is rapidly gained for high-end, high-precision applications (e.g., CIC filter). In the case under study, due to the assumed short wordlength of the digitized data, it may be preferred to use a traditional 2's complement digital mixer and then map the output into the standard RNS for CIC processing. If index RNS is used, data would need to be converted to standard RNS before being CIC processed in the manner developed in the previous section. A standard RNS mixer design is also possible using the quarter-square algorithm.
The channelizer output is a baseband (low sample rate) signal sampled at a highly decimated rate. The channelizer output can be taken directly from the CIC section or from a post-processing FIR. The magnitude frequency response of the CIC section is that of an $S$-stage moving average filter (i.e., $\sin(x)/x$). A low data rate FIR can be used to shape the CIC baseband, which resides between DC and the first null of the Hogenauer filter (i.e., $f_{\mathrm{sample}}/RD$). The implementation of a FIR in the RNS is well understood. If implemented using the standard RNS, data can be accepted directly from the CIC section, filtered, and presented to a back-end communications processor. The implementation of an index RNS FIR is discussed, for instance, in [19]. This model assumes that the CIC section is implemented using the index RNS. The advantage of standard or indexed RNS arithmetic over a 2's complement implementation of an FIR is well established. Again, this advantage geometrically increases with arithmetic precision for comparable real-time bandwidths.
Figure 9. Cascading of frequency sampling filter to save a factor of R delays for multirate signal processing [20, Sec. 3.4].
Finally, it is noted that the entire system can be implemented using the QRNS as developed in Section 4. The QRNS implements complex arithmetic using a minimal amount of real arithmetic. The channelizer presented in Fig. 4 divides the received signal into I and Q channels, using separate sine and cosine modulators. This operation can be replaced with a complex exponential that can, in turn, be directly implemented in a minimally complex QRNS system, with the individual modular operations defined as standard or index RNS calls. The channelizer following the I and Q modulators can also be implemented in the QRNS, resulting in an end-to-end QRNS solution.
7. Frequency Sampling Filter
The CIC filters discussed in the last section belong to a larger class of systems called frequency sampling filters. These filters can be used with channelizers to decompose the information spectrum into discrete bands. This is essential in many multi-user communication system applications. A classical frequency sampling filter (FSF) consists of a comb filter cascaded with a bank of frequency-selective resonators [20, 21]. The resonators independently produce a collection of poles that annihilate the zeros produced by the comb prefilter. Gain adjustments are applied to the output of the resonators so as to approximately profile the magnitude frequency response of a desired filter. An FSF can also be created by cascading all-pole filter sections with all-zero filter (comb) sections as suggested in Fig. 9. The delay of the comb section $1 - z^{-D}$ is chosen so that its zeros cancel the poles of the all-pole prefilter as shown in Fig. 10. It can be observed that wherever there is a complex pole, there also exists an annihilating complex zero, which results in an all-zero FIR, with the usual linear phase and constant group delay properties.
Figure 10. Example of pole/zero compensation for a pole angle of 60° and comb delay D = 6.
Frequency sampling filters are of interest to designers of multi-rate filter banks due, in part, to their intrinsic low complexity and linear phase behavior. FSF designs rely on exact pole-zero annihilation and are often found in embedded applications. Exact FSF pole-zero annihilation can be guaranteed by using polynomial filters defined over an integer ring in the residue number system (RNS).
The poles of the FSF filter developed in this paper reside on the periphery of the unit circle. This is in contrast with the customary practice of forcing the poles and zeros to reside at interior locations to guard against possible inexact pole-zero cancellation. It will be shown that stability is not an issue if the FSF is implemented using RNS. In addition, by allowing the FSF poles and zeros to reside on the unit circle, a multiplier-less FSF can be realized with an attendant reduction in complexity and an increase in data bandwidth.
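Exact pole-zero annihilation on the unit circle can be demonstrated with the 60° case of Fig. 10. The sketch below is an illustrative assumption, not the paper's design: it cascades the all-pole section $1/(1 - z^{-1} + z^{-2})$ with the comb $1 - z^{-6}$, performs every operation modulo a single 6-bit modulus (61), and compares the result with the equivalent FIR $1 + z^{-1} - z^{-3} - z^{-4}$ obtained by polynomial division.

```python
import random

M_I = 61  # a single residue channel, chosen for illustration


def fsf_mod(x, m=M_I):
    """All-pole section 1/(1 - z^-1 + z^-2), then comb 1 - z^-6, all modulo m."""
    v = []
    for n, sample in enumerate(x):
        v1 = v[n - 1] if n >= 1 else 0
        v2 = v[n - 2] if n >= 2 else 0
        v.append((sample + v1 - v2) % m)        # recursive part, poles at +/-60 degrees
    return [(v[n] - (v[n - 6] if n >= 6 else 0)) % m for n in range(len(x))]


# (1 - z^-6) / (1 - z^-1 + z^-2) = 1 + z^-1 - z^-3 - z^-4
EQUIVALENT_FIR = [1, 1, 0, -1, -1]


def fir_mod(x, m=M_I):
    """Direct evaluation of the equivalent all-zero filter, modulo m."""
    return [
        sum(c * x[n - k] for k, c in enumerate(EQUIVALENT_FIR) if n >= k) % m
        for n in range(len(x))
    ]


rng = random.Random(2)
residues = [rng.randrange(M_I) for _ in range(500)]
print(fsf_mod(residues) == fir_mod(residues))
```

Because the recursive section never rounds, its marginally stable poles cause no drift: the comb cancels them exactly in every residue channel.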
To motivate this discussion, consider the filter shown in Fig. 9. It can be argued that first-order filter sections produce poles at angles 0° and 180°. Second-order sections with integer coefficients can produce poles at angles 60°, 90°, 120°
Polynomial   Order   Coefficients              Pole angles
C_2(z)       1       1  1                      180°
C_6(z)       2       1 -1  1                   60°
C_4(z)       2       1  0  1                   90°
C_3(z)       2       1  1  1                   120°
C_12(z)      4       1  0 -1  0  1             30°, 150°
C_10(z)      4       1 -1  1 -1  1             36°, 108°
C_8(z)       4       1  0  0  0  1             45°, 135°
C_5(z)       4       1  1  1  1  1             72°, 144°
C_18(z)      6       1  0  0 -1  0  0  1       20.00°, 100.00°, 140.00°
C_14(z)      6       1 -1  1 -1  1 -1  1       25.71°, 77.14°, 128.57°
C_7(z)       6       1  1  1  1  1  1  1       51.42°, 102.86°, 154.29°
C_9(z)       6       1  0  0  1  0  0  1       40.00°, 80.00°, 160.00°
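The rows above have the form of cyclotomic polynomials, whose roots are the primitive $k$-th roots of unity; that reading is an inference from the $C_k(z)$ notation and the listed angles. Under that assumption, the coefficients and pole angles can be regenerated as follows (function names are hypothetical).

```python
from math import gcd


def divide_exact(num, den):
    """Divide integer polynomials (highest power first, den monic, zero remainder)."""
    num = list(num)
    quotient = []
    for i in range(len(num) - len(den) + 1):
        c = num[i]
        quotient.append(c)
        for j, d in enumerate(den):
            num[i + j] -= c * d
    return quotient


def cyclotomic(n):
    """Coefficients of the n-th cyclotomic polynomial, highest power first."""
    poly = [1] + [0] * (n - 1) + [-1]           # z^n - 1
    for d in range(1, n):
        if n % d == 0:                          # remove factors of proper divisors
            poly = divide_exact(poly, cyclotomic(d))
    return poly


def pole_angles(n):
    """Angles (degrees, upper half plane) of the roots of the n-th cyclotomic polynomial."""
    if n == 1:
        return [0.0]
    return [round(360.0 * k / n, 2) for k in range(1, n // 2 + 1) if gcd(k, n) == 1]


for n in (2, 6, 4, 3, 12, 10, 8, 5, 18, 14, 7, 9):
    print(n, len(cyclotomic(n)) - 1, cyclotomic(n), pole_angles(n))
```

All coefficients for these orders are 0 or ±1, which is what makes the corresponding filter sections multiplier-free.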
Table 5. Number of used CLBs of Xilinx XC4000 FPGAs (Notation: F20D90 means filter pole-angle 20.00°