I. INTRODUCTION
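[Fig. 1. Block diagrams of conventional 2-D FIR filter realizations, panels (a) and (b). Each panel shows Input, Combinational Unit, Memory and Output blocks; one combinational unit is labelled [h(l,k)].]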
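$$H(z_1, z_2) = \sum_{l=0}^{N-1} \sum_{k=0}^{N-1} h(l,k)\, z_1^{-l}\, z_2^{-k} \tag{1}$$

$$H(z_1, z_2) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} h_1(i)\, h_2(j)\, z_1^{-i}\, z_2^{-j} \tag{2}$$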
where [h(l, k)] is the impulse response matrix of the non-separable 2-D FIR filter of size (N × N), while {h1(i)} and {h2(j)} are the impulse responses of the 1-D FIR filters used for row-wise and column-wise processing of the 2-D input.
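As a minimal sketch of how the two transfer functions relate, the snippet below evaluates the double sum of the non-separable form and the factored separable form at one point. It assumes that the separable filter corresponds to h(l,k) = h1(l) h2(k); the coefficient values and the evaluation point are hypothetical and chosen only for illustration.

```python
def transfer_nonseparable(h, z1, z2):
    """Evaluate (1): sum over l, k of h(l,k) * z1^-l * z2^-k."""
    N = len(h)
    return sum(h[l][k] * z1 ** (-l) * z2 ** (-k)
               for l in range(N) for k in range(N))


def transfer_separable(h1, h2, z1, z2):
    """Evaluate (2), which factors into two 1-D transfer functions."""
    H1 = sum(c * z1 ** (-i) for i, c in enumerate(h1))
    H2 = sum(c * z2 ** (-j) for j, c in enumerate(h2))
    return H1 * H2


# Hypothetical 3-tap coefficients, for illustration only.
h1 = [1, 2, 1]
h2 = [1, 0, -1]
# Assumption: the separable impulse-response matrix is the outer product.
h = [[a * b for b in h2] for a in h1]

z1, z2 = 2.0, 4.0
print(transfer_nonseparable(h, z1, z2), transfer_separable(h1, h2, z1, z2))
```

Under this assumption both calls return the same value, which is why the separable form needs only 2N coefficients instead of N^2.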
Block diagrams of the conventional realizations of non-separable and separable 2-D FIR filters are shown in Fig. 1. As shown in this figure, both filter structures consist of two types of hardware components: (i) the combinational component and (ii) the memory or storage component. The combinational component consists mainly of arithmetic circuits along with some steering logic such as multiplexers and demultiplexers, while the storage component consists of transposition buffers and/or shift-registers that provide the appropriate data to the combinational units. The non-separable structure uses shift-registers to introduce the necessary row-delays for the processing of intermediate data, while the separable structure uses shift-registers for transposition of intermediate data. We can find from (1) that a non-separable 2-D FIR filter of size (N × N) involves (N - 1) shift-registers (SRs) of size M each, (N - 1)^2 registers (for row-column processing), N^2 multipliers and N^2 adders to compute one filter output per cycle. Similarly, we can find from (2) that the separable 2-D filter of size (N × N) involves (N - 1) SRs of M words each, 2(N - 1) registers, 2N multipliers and 2N adders to compute one filter output per cycle. Combinational and memory (register) complexities
TABLE I
COMBINATIONAL AND MEMORY COMPLEXITY OF FULL-PARALLEL SEPARABLE AND NON-SEPARABLE 2-D FIR FILTERS

Filter           Multiplier    Adder       Memory (words)
Non-separable    N^2           N^2 - 1     (M + N)(N - 1)
Separable        2N            2(N - 1)    (M + 2)(N - 1)
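The Table I expressions can be tabulated for concrete sizes with a few lines of code. This is only an illustrative sketch: the function name and the example sizes (N = 8, M = 512) are assumptions, not values taken from the table.

```python
def table1_complexity(N, M, separable):
    """Multipliers, adders and memory words of a full-parallel N x N filter (Table I)."""
    if separable:
        return {"multipliers": 2 * N, "adders": 2 * (N - 1),
                "memory_words": (M + 2) * (N - 1)}
    return {"multipliers": N * N, "adders": N * N - 1,
            "memory_words": (M + N) * (N - 1)}


# Assumed example sizes: 8 x 8 filter, M = 512.
ns = table1_complexity(8, 512, separable=False)
sep = table1_complexity(8, 512, separable=True)
print(ns)
print(sep)
```

For these sizes the arithmetic cost differs by a factor of four, while the memory words differ by less than 2%, which illustrates why memory dominates the footprint of both structures.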
$$H(z_1, z_2) = \sum_{i=0}^{N-1} z_1^{-i}\, H_i(z_2) \tag{3a}$$

$$H_i(z_2) = \sum_{j=0}^{N-1} h(i,j)\, z_2^{-j} \tag{3b}$$
TABLE II
MEMORY COMPLEXITY OF FULLY-DIRECT, FULLY-TRANSPOSE, HYBRID-1 AND HYBRID-2 STRUCTURES. N: FILTER-SIZE, M: INPUT IMAGE-SIZE, w: INPUT SIGNAL BIT-WIDTH AND w': INTERMEDIATE SIGNAL BIT-WIDTH.

                   Shift-Register/Register words                                        Memory (bits)
Structure          Input signal      Intermediate signal    Memory (bits)              N = 4,      N = 8,
                   storage           storage                                           w' = 16     w' = 20
Fully-Direct       (M + N)(N - 1)    -                      (M + N)(N - 1)w            12384       29120
Hybrid-1           N(N - 1)          M(N - 1)               (N - 1)(Nw + Mw')          24672       72128
Hybrid-2           M(N - 1)          N(N - 1)               (N - 1)(Mw + Nw')          12480       29792
Fully-Transpose    -                 (M + N)(N - 1)         (M + N)(N - 1)w'           24768       72800
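The bit counts in Table II follow directly from the listed expressions. The sketch below evaluates them; it assumes M = 512 and w = 8, which are not stated in the table itself but reproduce its numeric columns. The function and argument names are illustrative.

```python
def memory_bits(structure, N, M, w, w_int):
    """Memory size in bits, using the Table II expressions.

    w is the input bit-width and w_int the intermediate bit-width (w' in the text).
    """
    n1 = N - 1
    bits = {
        "fully-direct": (M + N) * n1 * w,
        "hybrid-1": n1 * (N * w + M * w_int),
        "hybrid-2": n1 * (M * w + N * w_int),
        "fully-transpose": (M + N) * n1 * w_int,
    }
    return bits[structure]


# Assumed parameters: M = 512 and w = 8.
for name in ("fully-direct", "hybrid-1", "hybrid-2", "fully-transpose"):
    print(name, memory_bits(name, 4, 512, 8, 16), memory_bits(name, 8, 512, 8, 20))
```

Structures that keep the long M-word shift-registers on the input path (fully-direct and Hybrid-2) store w-bit words rather than w'-bit words, which accounts for the roughly two-fold difference in bits.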
Fig. 2. Four different configurations for realization of 2-D FIR filter for N = 4. (a) Fully-direct structure, (b) Hybrid-1 structure, (c) Hybrid-2 structure, and (d) Fully-transposed structure, where z_1^{-1} represents a shift-register of M words and z_2^{-1} represents a single register.
this structure are placed on the input path only. This is a very
useful feature to be exploited for memory footprint reduction
in 2-D FIR filter structure.
A. Exploration of Memory-Reuse Possibilities
To explore the memory reuse possibilities, let us consider
the input data-flow of fully-direct structure for the computation
of m-th row of outputs {y(m, n), y(m, n + 1), y(m, n +
Fig. 3. Memory-unit data-flow of fully-direct structure for outputs y(m, n), y(m, n + 1), y(m, n + 2), and y(m, n + 3). Redundant memory-values are shown by blue boxes.
TABLE III
MEMORY REUSE EFFICIENCY (MRE) AND MEMORY BAND-WIDTH SAVING (MBS) FOR DIFFERENT FILTER SIZES (N) AND INPUT-BLOCK SIZES (L)

Block-size    N = 4           N = 8           N = 16          N = 32
(L)           MRE     MBS     MRE     MBS     MRE     MBS     MRE      MBS
2             12      40      56      44.4    240     47      992      48.4
4             36      60      168     66.7    720     70.5    2976     72.6
8             84      70      392     77.8    1680    82.4    6944     84.8
16            -       -       -       -       3600    88.2    14880    90.8

MRE is given in times and MBS in %.
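The MBS percentages can be related to the memory band-width per output (MBWPO) expression (L + N)(N - 1)/L given for the non-separable block-based structure in Table IV. The sketch below assumes that MBS is the relative saving in MBWPO with respect to the same structure at L = 1; this definition is an assumption, and it matches the tabulated MBS values to within about 0.15 percentage points.

```python
def mbwpo(N, L):
    """Memory band-width per output of the non-separable block-based structure."""
    return (L + N) * (N - 1) / L


def mbs_percent(N, L):
    """Assumed definition: saving relative to the same structure with L = 1."""
    return 100.0 * (1.0 - mbwpo(N, L) / mbwpo(N, 1))


for N in (4, 8, 16, 32):
    print(N, [round(mbs_percent(N, L), 1) for L in (2, 4, 8, 16)])
```

Under this reading the saving simplifies to N(L - 1)/(L(N + 1)), so it grows with the block-size L and saturates near N/(N + 1).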
$$y(m,n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} h_1(i)\, h_2(j)\, x(m-i,\, n-j) \tag{4}$$

$$y(m,n) = \sum_{i=0}^{N-1} h_2(i)\, v(m,\, n-i) \tag{5a}$$

$$v(m,n) = \sum_{i=0}^{N-1} h_1(i)\, x(m-i,\, n) \tag{5b}$$

$$u(m,n) = \mathbf{b}_{m,n}\, \mathbf{h}_1 \tag{6a}$$

$$y(m,n) = \mathbf{u}_{m,n}\, \mathbf{h}_2 \tag{6b}$$

$$\mathbf{b}_{m,n} = [\, x(m,n) \;\; x(m-1,n) \;\; \cdots \;\; x(m-N+1,n) \,] \tag{7a}$$

$$\mathbf{u}_{m,n} = [\, u(m,n) \;\; u(m,n-1) \;\; \cdots \;\; u(m,n-N+1) \,] \tag{7b}$$

$$\mathbf{h}_1 = [\, h_1(0) \;\; h_1(1) \;\; \cdots \;\; h_1(N-1) \,]^T \tag{7c}$$

$$\mathbf{h}_2 = [\, h_2(0) \;\; h_2(1) \;\; \cdots \;\; h_2(N-1) \,]^T \tag{7d}$$

[Figure: block labels Shift-Register Unit (SR-1 to SR-3), SIPO, Interconnect Network, Row-Filter Arithmetic Unit (h1), Column-Filter Arithmetic Unit (h2), "To the Arithmetic-Unit of Separable Structure".]
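The two-stage form of the separable computation can be illustrated as follows: one 1-D filter with h1 runs along the row index m, and a second 1-D filter with h2 runs along the column index n of the intermediate signal. This is a minimal software sketch, not the hardware data-flow; treating samples outside the image as zero is an assumed boundary rule, and the function names are illustrative.

```python
def px(x, m, n):
    """Sample x(m, n), taken as zero outside the array (assumed boundary rule)."""
    return x[m][n] if 0 <= m < len(x) and 0 <= n < len(x[0]) else 0


def separable_direct(x, h1, h2):
    """Double sum over i and j with the product h1(i) * h2(j)."""
    N = len(h1)
    return [[sum(h1[i] * h2[j] * px(x, m - i, n - j)
                 for i in range(N) for j in range(N))
             for n in range(len(x[0]))] for m in range(len(x))]


def separable_two_stage(x, h1, h2):
    """First v(m,n) = sum_i h1(i) x(m-i,n), then y(m,n) = sum_i h2(i) v(m,n-i)."""
    N = len(h1)
    v = [[sum(h1[i] * px(x, m - i, n) for i in range(N))
          for n in range(len(x[0]))] for m in range(len(x))]
    return [[sum(h2[i] * px(v, m, n - i) for i in range(N))
             for n in range(len(x[0]))] for m in range(len(x))]


# Hypothetical test data, for illustration only.
x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
h1, h2 = [1, 2], [3, -1]
print(separable_direct(x, h1, h2) == separable_two_stage(x, h1, h2))
```

The two-stage version uses 2N multiplications per output instead of N^2, at the cost of storing the intermediate signal.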
The intermediate vector v_{i,k} is computed by the product of the input-matrix A_k^{m-i} and the impulse-response vector h_i, and is given by

$$\mathbf{v}_{i,k} = \mathbf{A}_k^{m-i}\, \mathbf{h}_i \tag{10}$$

where

$$\mathbf{A}_k^{m-i} = \begin{bmatrix} x(m',kL) & x(m',kL-1) & \cdots & x(m',k'+1) \\ x(m',kL-1) & x(m',kL-2) & \cdots & x(m',k') \\ \vdots & \vdots & & \vdots \\ x(m',kL-L+1) & x(m',kL-L) & \cdots & x(m',k'-L+2) \end{bmatrix} \tag{11}$$

for m' = m - i and k' = kL - N, and h_i is given by:

$$\mathbf{h}_i = [\, h(i,0) \;\; h(i,1) \;\; \cdots \;\; h(i,N-1) \,]^T \tag{12a}$$

The computations of (8) and (10) are mapped into a fully-direct L-parallel structure to derive the proposed block-based structure for the non-separable 2-D FIR filter. The proposed structure is shown in Fig. 6 for filter-size N = 8 and block-size L = 4. It consists of one memory-module and one arithmetic module.

For the separable filter, the input-matrix B_k^m and the intermediate-signal matrix U_k^m are given by

$$\mathbf{B}_k^m = \begin{bmatrix} x(m,kL) & x(m-1,kL) & \cdots & x(m',kL) \\ x(m,kL-1) & x(m-1,kL-1) & \cdots & x(m',kL-1) \\ \vdots & \vdots & & \vdots \\ x(m,k') & x(m-1,k') & \cdots & x(m',k') \end{bmatrix} \tag{15}$$

for m' = m - N + 1 and k' = kL - L + 1, and

$$\mathbf{U}_k^m = \begin{bmatrix} u(m,kL) & u(m,kL-1) & \cdots & u(m,k') \\ u(m,kL-1) & u(m,kL-2) & \cdots & u(m,k'-1) \\ \vdots & \vdots & & \vdots \\ u(m,kL-L+1) & u(m,kL-L) & \cdots & u(m,k'-L+1) \end{bmatrix} \tag{16}$$

for k' = kL - N + 1.
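The block formulation for the non-separable filter can be sketched in software as follows. Each row index i contributes a matrix-vector product between an L × N input-matrix and the i-th coefficient row, and the contributions are accumulated to give L outputs of row m at once. The sketch assumes that the block output is the sum of the intermediate vectors over i (the role the text assigns to (8), which is not reproduced here), that row r and column c of the input-matrix hold x(m - i, kL - r - c), and that samples outside the image are zero. All function names are illustrative.

```python
def px(x, m, n):
    """Sample x(m, n), taken as zero outside the array (assumed boundary rule)."""
    return x[m][n] if 0 <= m < len(x) and 0 <= n < len(x[0]) else 0


def input_matrix(x, m_prime, k, L, N):
    """L x N matrix whose entry (r, c) is x(m', kL - r - c)."""
    return [[px(x, m_prime, k * L - r - c) for c in range(N)] for r in range(L)]


def block_output(x, h, m, k, L):
    """Return [y(m, kL), y(m, kL - 1), ..., y(m, kL - L + 1)]."""
    N = len(h)
    y = [0] * L
    for i in range(N):
        A = input_matrix(x, m - i, k, L, N)
        # Intermediate vector: product of the input-matrix and coefficient row i.
        v = [sum(A[r][c] * h[i][c] for c in range(N)) for r in range(L)]
        # Assumed accumulation over i to form the block of outputs.
        y = [a + b for a, b in zip(y, v)]
    return y


def direct_output(x, h, m, n):
    """Reference: y(m, n) = sum_i sum_j h(i, j) x(m - i, n - j)."""
    N = len(h)
    return sum(h[i][j] * px(x, m - i, n - j) for i in range(N) for j in range(N))


# Hypothetical test data, for illustration only.
x = [[(3 * m + 5 * n) % 7 for n in range(8)] for m in range(4)]
h = [[1, 2, 3], [0, -1, 1], [2, 0, 1]]
print(block_output(x, h, 3, 1, 4))
```

Adjacent rows of each input-matrix overlap in all but one sample, which is the redundancy that block processing exploits to reuse stored values.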
Fig. 6. Proposed block-based structure for non-separable 2-D FIR filter for block-size L = 4 and filter-size N = 8.
[Figure: (a) inner-product cells IPC-1 to IPC-4 (h_i) producing v_{i,k}; (b) adder block with adder-trees AT-1 to AT-4 producing y_k. Inner-product cell: N-point input-vector, coefficients h(i,0) to h(i,7), adder-tree (N = 8).]
Fig. 10. Proposed block-based structure for separable 2-D FIR filter for N = 8 and L = 4.
V. GENERIC STRUCTURES
In this section, we derive two separate generic structures for non-separable and separable filter banks. We also propose a unified structure for the realization of a 2-D FIR filter-bank comprising non-separable and separable filters.
A. Generic Block-Based Structure for Non-separable Filters
The coefficient matrices of non-separable filters can have various kinds of symmetry, e.g., diagonal, centro, four-fold rotational, quadrant, etc. These symmetries can be exploited to realize the transfer functions with fewer multiplications. The arithmetic module of the proposed structure for non-separable filters could be optimized to take advantage of these symmetry properties. The storage-module of the structure can be interfaced as a common unit with the arithmetic modules of filters with and without symmetry. This results in the generic structure shown in Fig. 11. Each sub-module of the arithmetic-module of the generic structure represents the arithmetic module of a constituent filter. The arithmetic-module of each filter is enabled with a select signal (EN_i, for 1 ≤ i ≤ 4), so that it can be switched off to save power if the output of a particular filter is not required. The proposed generic structure can be used to realize any of the four types of filters, or a parallel combination of filters, by selecting the arithmetic modules of the respective filters through the select signals. The proposed generic structure has higher MRE and MBS than the non-separable structure for a given block-size and filter-size due to the common storage-unit. The area-delay-product (ADP) and power consumed per output (PPO) of the proposed generic structure in parallel configuration are expected to be significantly less than those of separate implementations of the individual filters.
Fig. 11. Proposed generic block-based structure for non-separable 2-D FIR filters, N is the filter size and L is the input block-size.
shown for the k-th and (k + 1)-th cycle. The contents of the first two locations of each SR of the SR-block are shown for the k-th and (k + 1)-th input cycles of the m-th input row. The rectangular dotted boxes show the content of one cell of the SR-block, comprised of 4 SRs corresponding to L = 4. A direct flow of data from the shift-register unit meets the data-flow requirement of the non-separable generic structure, while the shift-register output data are rearranged (shown by the data-distribution block in Fig. 13) for the separable generic structure.
The proposed unified structure for 2-D FIR filter is shown in Fig. 14. It has a common storage unit for both the separable and non-separable sections, and can be used for the realization of any one of four types of non-separable or two types of separable filters. It can also be used for parallel realization of any combination of separable and non-separable filters. It involves (M + N + 2)(N - 1) memory-words and computes L outputs of each of the six filters in every cycle in its full-parallel configuration. Although we have shown the unified structure for 6 parallel filters, it can be used for the realization of more than 6 filters without any additional storage. Only the processing-module complexity (arithmetic-modules and PCs of each filter) increases, in proportion to the number of parallel filters. This is a major advantage for area-delay-power efficient realization of large filter banks consisting of separable and non-separable filters.
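The effect of sharing one storage unit among the six filters can be put in numbers with a short sketch. The storage expression (M + N + 2)(N - 1) and the 6L outputs per cycle are taken from the paragraph above; the example sizes (M = 512, N = 4, L = 4) and the "words per output" ratio are illustrative assumptions rather than figures quoted at this point in the text.

```python
def unified_storage_words(M, N):
    """Memory words of the unified structure."""
    return (M + N + 2) * (N - 1)


def words_per_output(M, N, L, filters=6):
    """Stored words divided by the outputs produced per cycle (filters * L)."""
    return unified_storage_words(M, N) / (filters * L)


# Assumed example sizes.
print(unified_storage_words(512, 4))
print(words_per_output(512, 4, 4))
```

Because the storage term does not depend on L or on the number of filters, the ratio falls in inverse proportion to the number of outputs computed per cycle.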
Fig. 12. Proposed generic block-based structure for separable 2-D FIR filters, N is the filter size and L is the input block-size.
The proposed generic structure for the realization of separable filters with and without symmetry is shown in Fig. 12. The structure is similar to the proposed generic non-separable structure, except that only the shift-register array is common to the PCs of the processing unit. The proposed generic structure can be used to realize separable filters with and without symmetry in parallel.
Fig. 13. Data-flow of common shift-register array and data distribution block for non-separable and separable generic structures.
Fig. 14. Proposed unified structure for 2-D FIR filter, with a non-separable section and a separable section sharing a common storage unit.
TABLE IV
GENERAL COMPARISON OF MRE AND MBWPO OF PROPOSED STRUCTURES. M: INPUT-IMAGE WIDTH/HEIGHT, N: FILTER-SIZE, L: INPUT-BLOCK LENGTH

Structure            MRE                                                     MBWPO
NS-block-based       L - 1                                                   (L + N)(N - 1)/L
S-block-based        L - 1                                                   (L + 1)(N - 1)/L
NS-Generic           4L - 1                                                  (L + N)(N - 1)/4L
S-Generic            2L - 1                                                  (L + 2)(N - 1)/2L
Unified Structure    [M(6L - 1) + N(4L - 1) + 2(L - 1)]/(M + N + 2)          (L + N + 2)(N - 1)/6L
The arithmetic module of the non-separable generic structure comprises four sub-modules corresponding to the four types of filters, where each sub-module represents the arithmetic-module of the corresponding filter. The sub-modules of the symmetric filters involve the same number of adders as the sub-module of the general filter, which is [L(N^2 - 1)], and differ only in the number of multipliers. The sub-modules of the diagonal, four-fold and quadrant symmetry filters, respectively, involve [LN(N + 1)/2], (LN^2/4) and (LN^2/2) multipliers. The arithmetic-module of the non-separable generic structure, therefore, involves [LN(9N + 2)/4] multipliers and [4L(N^2 - 1)] adders. Processing unit of proposed separable generic structure is comprised of two
Fig. 15. (a) Memory reuse efficiency (MRE). (b) Memory band-width per output (MBWPO). Both are plotted against block-size (L) for filter-size N = 8. NS, S, NSG, SG and UNI, respectively, stand for non-separable, separable, non-separable generic, separable generic and unified.
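The multiplier and adder totals stated for the arithmetic module of the non-separable generic structure can be cross-checked by adding up the four sub-modules. The sketch below takes the per-filter multiplier counts from the text, uses LN^2 for the general filter (as listed for the proposed non-separable structure in Table V), and assumes an even N so that the divisions are exact. The function name is illustrative.

```python
def ns_generic_arithmetic(N, L):
    """Per-sub-module multipliers and total adders; N is assumed even."""
    multipliers = {
        "general": L * N * N,
        "diagonal": L * N * (N + 1) // 2,
        "four-fold": L * N * N // 4,
        "quadrant": L * N * N // 2,
    }
    adders = 4 * L * (N * N - 1)  # each sub-module uses L(N^2 - 1) adders
    return multipliers, adders


mults, adders = ns_generic_arithmetic(4, 4)
print(mults, sum(mults.values()), adders)
```

The four multiplier counts sum to LN(9N + 2)/4, in agreement with the closed-form total quoted in the text.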
TABLE V
COMPARISON OF HARDWARE- AND TIME-COMPLEXITY OF THE PROPOSED STRUCTURES AND EXISTING STRUCTURES

Filter type      Structures              Multiplier    Adder          Register                                  Filter-output/cc    MBWPO
non-separable    Van [11]                N^2           N^2            3N(⌊(N - 1)/3⌋ + 1) + M(N - 1)            1/[T_M + 2T_A]      N^2 - 1
non-separable    -                       N^2           N^2            (M + N)(N - 1)                            1/[T_M + 2T_A]      N^2 - 1
non-separable    Proposed                LN^2          L(N^2 - 1)     (M + N)(N - 1)                            L/[T_M + T_1]       (L + N)(N - 1)/L
separable        Mohanty et al. [29]     2N            2(N - 1)       (M + 1)(N - 1)                            1/[T_M + T_A]       2N - 1
separable        Proposed                2LN           2L(N - 1)      (M + 1)(N - 1)                            L/[T_M + T_2]       (L + 1)(N - 1)/L
separable        Proposed Generic        3LN           4L(N - 1)      (M + 2)(N - 1)                            2L/[T_M + T_2]      (L + 2)(N - 1)/2L

Structures                  Multiplier         Adder                   Register               cycle period    output/cycle    MBWPO
Chen [17]                   N(5N + 2)/8        N(21N + 8)/16           M(N - 1) + 2N^2        T_M + 3T_A      1               N^2 - 1
Proposed Generic (NS)       LN(9N + 2)/4       4L(N^2 - 1)             (M + N)(N - 1)         T_M + T_1       4L              (L + N)(N - 1)/4L
Proposed Unified (NS+S)     LN(9N + 14)/4      4L(N(N + 1) - 2)        (M + N + 2)(N - 1)     T_M + T_1       6L              (L + N + 2)(N - 1)/6L
Filter                      output/cc    MULT    ADD    REG     MBWPO
Chen [17]                   1            11      23     1568    15
Proposed Generic (NS)       16           152     240    1548    1.5
                            32           304     480    1548    1.125
Proposed Unified (NS+S)     24           200     288    1554    1.25
                            48           400     576    1554    0.875
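The numeric entries for the structure of Chen [17] and for the proposed unified structure can be regenerated from the closed-form expressions of Table V. The sketch below assumes N = 4 and M = 512, with L = 4 and L = 8 for the two unified rows; these parameter values are inferred because they reproduce the tabulated numbers. The register expression M(N - 1) + 2N^2 for [17] is the reading that matches 1568. Function names are illustrative.

```python
def chen_complexity(N, M):
    """Existing unified structure of [17], using the Table V expressions."""
    return {"out": 1, "mult": N * (5 * N + 2) / 8, "add": N * (21 * N + 8) / 16,
            "reg": M * (N - 1) + 2 * N * N, "mbwpo": N * N - 1}


def unified_complexity(N, M, L):
    """Proposed unified (NS+S) structure, using the Table V expressions."""
    return {"out": 6 * L, "mult": L * N * (9 * N + 14) / 4,
            "add": 4 * L * (N * (N + 1) - 2),
            "reg": (M + N + 2) * (N - 1),
            "mbwpo": (L + N + 2) * (N - 1) / (6 * L)}


# Assumed parameters: N = 4, M = 512, L = 4 and 8.
print(chen_complexity(4, 512))
print(unified_complexity(4, 512, 4))
print(unified_complexity(4, 512, 8))
```

The register difference between the two structures, 1568 - 1554 = 14, equals N^2 - N + 2 for N = 4.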
TABLE VIII
COMPARISON OF POST-LAYOUT SYNTHESIS RESULTS OF THE PROPOSED STRUCTURES AND THE STRUCTURES OF [16], [17] AND [29] FOR L = 4, N = 4, M = 512, w = 8 AND w' = 16, USING TSMC 90 NM CMOS TECHNOLOGY LIBRARY, POWER ESTIMATED AT 20 MHZ FREQUENCY

Designs                 Type                                               Output/    DAT      Area          Power (mW)            ADP          EPO
                                                                           cycle      (ns)     (um^2)        Leakage    Dynamic    (um^2.ms)    (pJ)
Structure of [16]       Non-separable                                      1          14.17    1009878.67    3.9107     4.9441     14.30        442.74
Proposed block-based    Non-separable                                      4          12.16    791361.47     2.8016     3.2918     2.40         76.16
Proposed generic        Non-separable                                      16         17.82    1142564.15    3.6646     5.3996     1.27         28.32
Structure of [29]       Separable                                          1          22.22    472219.56     2.0034     2.5533     10.49        227.83
Proposed block-based    Separable                                          4          21.58    676537.05     2.4002     3.2511     3.64         70.64
Proposed generic        Separable                                          8          22.08    785332.33     2.7136     3.8730     2.16         41.16
Structure of [17]       Non-separable (General + Symmetric) + Separable    1          14.07    488883.93     2.0327     2.6070     6.87         231.98
Proposed unified        Non-separable (General + Symmetric) + Separable    24         24.75    1437527.67    4.4976     6.8792     1.48         23.70

DAT: data-arrival time, area-delay-product (ADP) = Area × DAT/(output per cycle), energy per output (EPO) = power × clock-period/number of output per cycle.
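The two derived metrics defined in the table footnote can be computed as shown in this sketch. It assumes that "power" in the EPO definition is the sum of leakage and dynamic power, and that the clock period is the reciprocal of the 20 MHz estimation frequency; both assumptions reproduce the tabulated values for the row with 24 outputs per cycle. Function names are illustrative.

```python
def adp_um2_ms(area_um2, dat_ns, outputs_per_cycle):
    """Area-delay product in um^2.ms: Area x DAT / (outputs per cycle)."""
    return area_um2 * dat_ns * 1e-6 / outputs_per_cycle  # 1 ns = 1e-6 ms


def epo_pj(total_power_mw, clock_mhz, outputs_per_cycle):
    """Energy per output in pJ: power x clock period / (outputs per cycle)."""
    period_ns = 1e3 / clock_mhz  # 20 MHz gives 50 ns
    return total_power_mw * period_ns / outputs_per_cycle  # mW x ns = pJ


# Values of the table row with 24 outputs per cycle.
adp = adp_um2_ms(1437527.67, 24.75, 24)
epo = epo_pj(4.4976 + 6.8792, 20, 24)
print(round(adp, 2), round(epo, 2))
```

Dividing by the outputs per cycle is what allows a larger and slower block-based design to show a lower ADP and EPO than a single-output design.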
throughput.
We have estimated the normalized storage per filter output (SPO) of the proposed structures and the existing structures of [11], [16], [17], [29] for N = 4, L = 4 and M = 512, and the results are shown in the graph of Fig. 16. We find that the proposed structures involve less normalized SPO than the existing structures. This is mainly due to the memory-reuse efficiency and memory-sharing of the proposed structures.
[Fig. 16. Normalized SPO (vertical axis, 0 to 1800) of the structures Van [10], Khoo [14], Chen [15], Mohanty [29], Prop. NS and Prop. S.]
times less ADP and 5.53 times less EPO than those of [29]. Compared with [17], the proposed unified structure involves 4.64 times less ADP and 9.78 times less EPO.
VII. CONCLUSION
We have analyzed the memory footprint and combinational complexity of 2-D FIR structures to arrive at a systematic design strategy for deriving area-delay-power-efficient architectures. Based on that, we have presented novel block-based separable and non-separable structures, generic structures for separable and non-separable filter-banks, and a unified structure for concurrent realization of both separable and non-separable filter-banks. It is shown that the storage requirement of the proposed structures does not change with the input block-size (L). Similarly, the storage size of the generic non-separable structure is independent of the number of parallel filters (P) in a filter-bank. It increases marginally, by P(N - 1) words, in the case of the generic separable and unified structures, where N is the filter size. The proposed structures, therefore, offer higher memory reuse efficiency and MBW reduction for higher values of L and P. This reduces the SPO and PPO of the proposed structures.
Compared with the existing structures, the proposed block-based separable and non-separable structures involve the same number of storage words and proportionately the same or fewer arithmetic resources than the corresponding best of the existing structures, and compute PL times more filter outputs per cycle with PL times less MBW. The proposed unified structure with 4 non-separable filters and 2 separable filters involves nearly 3.6L times more multipliers, 3L times more adders and (N^2 - N + 2) fewer registers than the existing unified structure, and computes 6L times more filter outputs per cycle with 6L times less MBW. ASIC synthesis results for filter-size (4 × 4), input-block size L = 4 and image-size (512 × 512) show that the proposed block-based non-separable and generic non-separable structures, respectively, involve 5.95 times and 11.25 times less ADP, and 5.81 times and 15.63 times less EPO than the corresponding existing structures. The proposed unified structure involves 4.64 times less ADP and 9.78 times less EPO than the corresponding existing structure. When the inner-product computation involved in FIR filtering is realized by multiplier-less designs to reduce the combinational complexity, additional memory may be required, leading to an overall increase in memory complexity. However, the advantage gained by the proposed memory footprint reduction technique is not affected by such a change in the method of implementation of the inner-products.
Acknowledgements: This publication was made possible by
an internal grant from Qatar University (QUUG-CENG-DCS11/12 7). The statements made herein are solely the responsibility of the authors.
REFERENCES
[1] D. E. Dudgeon and R. M. Mersereau, Multidimensional Digital Signal
Processing. Englewood Cliffs, NJ: Prentice-Hall, 1984.
[2] M. A. Sid-Ahmed, Image Processing: Theory, Algorithms, and Architectures. New York: McGraw-Hill, 1995.