Memory Footprint Reduction For Power-Efficient Realization of 2-D Finite Impulse Response Filters

IEEE TRANSACTIONS ON CIRCUIT AND SYSTEM-I, REGULAR PAPERS
Basant K. Mohanty, Senior Member, IEEE, Pramod K. Meher, Senior Member, IEEE,
Somaya Al-Maadeed, Senior Member, IEEE and Abbes Amira, Senior Member, IEEE
Abstract: We have analyzed memory footprint and combinational complexity to arrive at a systematic design strategy for deriving area-delay-power-efficient architectures for two-dimensional (2-D) finite impulse response (FIR) filters. We present novel block-based structures for separable and non-separable filters with reduced memory footprint, obtained by memory sharing and memory reuse along with appropriate scheduling of computations and design of the storage architecture. The proposed structures involve L times less storage per output (SPO) and nearly L times less energy consumption per output (EPO) than the existing structures, where L is the input block-size. They involve L times more arithmetic resources than the best of the corresponding existing structures, and provide L times higher throughput with less memory band-width (MBW) than the others. We also propose separate generic structures for separable and non-separable filter-banks, and a unified structure for a filter-bank comprising symmetric and general filters. The proposed unified structure for 6 parallel filters involves nearly 3.6L times more multipliers, 3L times more adders and (N² − N + 2) fewer registers than the similar existing unified structure, and computes 6L times more filter outputs per cycle with 6L times less MBW than the existing design, where N is the FIR filter size in each dimension. ASIC synthesis results show that for filter size (4 × 4), input block-size L = 4, and image size (512 × 512), the proposed block-based non-separable and generic non-separable structures involve, respectively, 5.95 times and 11.25 times less area-delay-product (ADP), and 5.81 times and 15.63 times less EPO than the corresponding existing structures. The proposed unified structure involves 4.64 times less ADP and 9.78 times less EPO than the corresponding existing structure.
Index Terms: Block processing, 2-dimensional (2-D) finite impulse response (FIR), digital filters, VLSI architecture.
I. INTRODUCTION
Two-dimensional (2-D) digital filters are frequently used in 2-D signal processing as well as in image and video processing applications such as image enhancement, image restoration [1], template matching [2], face recognition, feature extraction for bio-metric systems [3]-[5], and video communication. The 2-D FIR filters are more popularly used
Manuscript submitted on Dec 08, 2012, revised on Feb 17, 2013 and Mar
15, 2013. This paper was recommended by Associate Editor Fabien Clermidy.
Copyright (c) 2012 IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be
obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
B. K. Mohanty is with the Dept. of Electronics and Communication Engineering, Jaypee University of Engineering and Technology, Raghogarh, Guna, Madhya Pradesh, India-473226, Email: bk.mohanti@juet.ac.in.
P. K. Meher is with the Institute for Infocomm Research, 1 Fusionopolis Way, Singapore-138632, Email: pkmeher@i2r.a-star.edu.sg, URL: http://www.ntu.edu.sg/home/aspkmeher/.
Somaya Al-Maadeed is with the Dept. of Computer Science and Engineering, Qatar University, Doha, Qatar, Email:s alali@qu.edu.qa.
Abbes Amira is with the School of Computing, University of West Scotland, Paisley, Scotland, UK, Email: abbes.amira@uws.ac.uk.
Fig. 1. Conventional structure of 2-D FIR filter. (a) Non-separable method, (b) Separable method. Blocks shown in (a): memory and combinational unit [h(l, k)]; blocks shown in (b): 1-D filter [h1(k)], memory and 1-D filter [h2(k)].
compared to their infinite impulse response (IIR) counterparts due to their numerical stability and simplicity of design. The system function of a 2-D FIR filter is often non-separable, while in a few cases it is separable, namely when the impulse response [h(i, j)] can be expressed as [h(i, j)] = [h1(i) · h2(j)]. The non-separable and separable system functions are, respectively, given as:
$$H(z_1, z_2) = \sum_{l=0}^{N-1} \sum_{k=0}^{N-1} h(l, k)\, z_1^{-l} z_2^{-k} \tag{1}$$

$$H(z_1, z_2) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} h_1(i)\, h_2(j)\, z_1^{-i} z_2^{-j} \tag{2}$$
where [h(l, k)] is the impulse response matrix of the non-separable 2-D FIR filter of size (N × N), while {h1(i)} and {h2(j)} are the impulse responses of the 1-D FIR filters used for row-wise and column-wise processing of the 2-D input.
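As an illustration of the distinction between (1) and (2), the following minimal Python sketch builds a separable impulse-response matrix as the outer product of two 1-D responses and tests whether a given matrix admits such a factorization. The function names `outer` and `is_separable` and the two example kernels are assumptions introduced here for illustration; they do not appear in the paper.

```python
# Sketch (not from the paper): separable vs. non-separable impulse responses.

def outer(h1, h2):
    """Build the impulse-response matrix h(i, j) = h1(i) * h2(j)."""
    return [[a * b for b in h2] for a in h1]

def is_separable(h, tol=1e-12):
    """Return True if h(i, j) = h1(i) * h2(j) for some h1, h2 (rank <= 1).

    A matrix has rank <= 1 exactly when all of its 2 x 2 minors vanish.
    """
    n_rows, n_cols = len(h), len(h[0])
    for i in range(n_rows):
        for k in range(i + 1, n_rows):
            for j in range(n_cols):
                for l in range(j + 1, n_cols):
                    minor = h[i][j] * h[k][l] - h[i][l] * h[k][j]
                    if abs(minor) > tol:
                        return False
    return True

# Illustrative kernels (assumed examples).
h_sep = outer([1, 2, 1], [1, 0, -1])            # outer product, hence separable
h_nonsep = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]   # Laplacian-like, not separable

print(is_separable(h_sep), is_separable(h_nonsep))  # True False
```

A separable (N × N) response is fully described by 2N coefficients instead of N², which is the origin of the 2N versus N² multiplier counts quoted in this section.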
Block diagrams of conventional realization of non-separable and separable 2-D FIR filters are shown in Fig. 1. As shown in this figure, both these filter structures consist of two types of hardware components: (i) the combinational component and (ii) the memory or storage component. The combinational component consists mainly of the arithmetic circuits along with some steering logic like multiplexers and demultiplexers, while the storage component consists of transposition buffers and/or shift-registers to provide appropriate data to the combinational units. The non-separable structure uses shift-registers to introduce the necessary row-delays for the processing of intermediate data, while the separable structure uses shift-registers for transposition of intermediate data. We can find from (1) that a non-separable 2-D FIR filter of size (N × N) involves (N − 1) shift-registers (SRs) of size M each, (N − 1)² registers (for row-column processing), and N² multipliers and N² adders to compute one filter output per cycle. Similarly, we can find from (2) that the separable 2-D filter of size (N × N) involves (N − 1) SRs of M words each, 2(N − 1) registers, 2N multipliers and 2N adders to compute one filter output per cycle. Combinational and memory (register) complexities
TABLE I
COMBINATIONAL AND MEMORY COMPLEXITY OF FULL-PARALLEL SEPARABLE AND NON-SEPARABLE 2-D FIR FILTERS

| Filter | Multiplier | Adder | Memory (words) |
|---|---|---|---|
| Non-separable | N² | N² − 1 | (M + N)(N − 1) |
| Separable | 2N | 2(N − 1) | (M + 2)(N − 1) |

N: filter size, M: image width/height
of full-parallel non-separable and separable filters are given in Table I. Since the image size (M) is higher than the filter-size (N) by more than an order of magnitude in most image-processing applications, memory becomes the dominant component of the hardware complexity of 2-D FIR structures, and accounts for a major share of the chip area and the total power consumption.
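To make the dominance of memory concrete, the expressions of Table I can be evaluated for a typical case. The sketch below is only an illustration; the function name `full_parallel_complexity` and the dictionary layout are assumptions made here, while the formulas are those listed in Table I.

```python
# Sketch: evaluate the Table I expressions for a given filter size N and image size M.

def full_parallel_complexity(N, M):
    """Return multiplier, adder and memory-word counts from Table I."""
    return {
        "non_separable": {
            "multipliers": N * N,
            "adders": N * N - 1,
            "memory_words": (M + N) * (N - 1),
        },
        "separable": {
            "multipliers": 2 * N,
            "adders": 2 * (N - 1),
            "memory_words": (M + 2) * (N - 1),
        },
    }

c = full_parallel_complexity(N=4, M=512)
for name, counts in c.items():
    print(name, counts)
# For N = 4 and M = 512 the non-separable filter needs 16 multipliers
# but 1548 words of memory; the separable one needs 8 multipliers and 1542 words.
```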
Some systolic architectures have been suggested for VLSI implementation of 2-D FIR filters to achieve high-throughput and low-latency implementation [6]-[11]. Recently, some efficient structures have been proposed for 2-D IIR filters [13]-[17], and it has been shown that non-separable 2-D FIR filters can also be realized efficiently using those structures. In all these existing designs, systolization [18] of the structure is considered the major issue, and a substantially large number of delay elements are placed in the data-path to avoid global communication. Similarly, the symmetry of the impulse response matrix has been exploited to reduce the hardware and time complexities of the structures [12]-[17]. In the last four decades, several design approaches have been suggested for reducing the arithmetic complexity of one-dimensional (1-D) FIR filters [19]-[28]. Those methods are now quite mature, and can be applied to reduce the complexity of the 1-D filters used in the implementation of 2-D FIR filters as well. On the other hand, memory complexity constitutes the dominant part of the overall area complexity of these structures, and plays a significant role in power-efficient realization of 2-D FIR filters. Keeping that in view, in this paper we present memory-centric designs of 2-D FIR filters based on reduction of total memory usage, memory reuse, reduction of memory band-width and memory sharing, which could lead to area-delay-power-efficient structures.
Many image processing applications use 2-D filter banks comprised of both separable and non-separable filters. One such application is the biometric system, where a Gabor filter bank is used for feature extraction in face recognition and fingerprint matching [3]-[5]. The Gabor filter bank is generated for different center frequencies and orientations. Consequently, the constituent filters are of both separable and non-separable types, with as well as without symmetry. Recently, a unified structure has been proposed in [17] for the realization of filters with diagonal, four-fold rotational, quadrant and octal symmetries. However, this structure does not favour the realization of a filter-bank, since only one filter can be realized at a time. Moreover, separable filters cannot be realized using this structure. The main objective of the unified structure of [17] is to save arithmetic resources, whereas in this paper we aim at presenting a unified memory-efficient structure for a filter bank with separable and non-separable 2-D FIR filters.

The main contributions of this paper are:
• Analysis of memory footprint and exploration of possible storage optimization to arrive at a memory-efficient design strategy for the implementation of 2-D FIR filters.
• Shared memory design for separable and non-separable structures.
• Scheduling of computations and architecture design for memory reuse and memory band-width reduction in separable filters.
• Unified structure of filter bank comprised of separable and non-separable filters with low memory footprint per output.
The rest of the paper is organized as follows: The proposed
design strategy is discussed in Section II and block formulation
of 2-D FIR filter is given in Section III. Proposed structures
are presented in Section IV for separable and non-separable
filters. Generic structure of filter-bank consisting of different
types of filters is presented in Section V. Hardware complexity
and performance of the proposed structures are discussed in
Section VI. Conclusion is presented in Section VII.
II. PROPOSED DESIGN STRATEGY
To arrive at the proposed design strategy we analyze here
memory complexities of possible configurations of 2-D FIR
filter. The system function of non-separable 2-D FIR filter
(given by (1)) can be written in a split form as:
$$H(z_1, z_2) = \sum_{i=0}^{N-1} z_1^{-i} H_i(z_2) \tag{3a}$$

$$H_i(z_2) = \sum_{j=0}^{N-1} h(i, j)\, z_2^{-j} \tag{3b}$$
Computations of (3a) and (3b) can each be performed by a direct-form or a transposed-form structure, which gives four possible configurations for the realization of the non-separable H(z1, z2): fully-direct (direct-direct), fully-transposed (transpose-transpose), hybrid-1 (direct-transpose), and hybrid-2 (transpose-direct), as shown in Fig. 2 for N = 4. All four structures require the same number of arithmetic components (multipliers and adders) and delay elements ($z_1^{-1}$ and $z_2^{-1}$, corresponding to shift-registers and fixed registers, respectively); they differ only in the locations of these elements in the data-path. Since the bit-widths of arithmetic units, buses and registers in the data-path are different for input signals and intermediate signals, the overall memory requirements of the configurations differ in terms of the number of storage bits, although all of them have the same number of delay elements¹. We have estimated the memory complexity of all four types of structures for an input image size 512 × 512 (i.e., M = 512) and filter lengths N = 4 and N = 8, and listed the results in Table II for comparison. We find that the fully-direct structure has the lowest memory requirement of the four. Interestingly, the memory complexity of the fully-direct structure is independent of the word-length of intermediate signals since all the delay elements of
¹ The input image to be processed by the 2-D FIR filter consists of 8-bit pixels, while a higher bit-width is required for intermediate signals.
TABLE II
MEMORY COMPLEXITY OF FULLY-DIRECT, FULLY-TRANSPOSE, HYBRID-1 AND HYBRID-2 STRUCTURES. N: FILTER-SIZE, M: INPUT IMAGE-SIZE, w: INPUT SIGNAL BIT-WIDTH AND w′: INTERMEDIATE SIGNAL BIT-WIDTH.

| Structure | Input signal storage (shift-register/register words) | Intermediate signal storage (shift-register/register words) | Memory (bits) | Memory (bits), M = 512, w = 8, N = 4, w′ = 16 | Memory (bits), M = 512, w = 8, N = 8, w′ = 20 |
|---|---|---|---|---|---|
| Fully-Direct | (M + N)(N − 1) | none | (M + N)(N − 1)w | 12384 | 29120 |
| Hybrid-1 | N(N − 1) | M(N − 1) | (N − 1)(Nw + Mw′) | 24672 | 72128 |
| Hybrid-2 | M(N − 1) | N(N − 1) | (N − 1)(Mw + Nw′) | 12480 | 29792 |
| Fully-Transpose | none | (M + N)(N − 1) | (M + N)(N − 1)w′ | 24768 | 72800 |
Fig. 2. Four different configurations for realization of 2-D FIR filter for N = 4. (a) Fully-direct structure, (b) Hybrid-1 structure, (c) Hybrid-2 structure, and (d) Fully-transposed structure, where $z_1^{-1}$ represents a shift-register of M words and $z_2^{-1}$ represents a single register.
this structure are placed on the input path only. This is a very
useful feature to be exploited for memory footprint reduction
in 2-D FIR filter structure.
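The bit counts of Table II follow directly from the word counts and the two bit-widths. The short sketch below re-evaluates them; the function name `memory_bits` and the argument name `w_int` (standing for the intermediate bit-width w′) are assumptions made for this illustration.

```python
# Sketch: memory (in bits) of the four configurations, using the Table II expressions.

def memory_bits(N, M, w, w_int):
    """N: filter size, M: image size, w: input bit-width, w_int: intermediate bit-width."""
    return {
        "fully_direct": (M + N) * (N - 1) * w,            # all delays on the input path
        "hybrid_1": (N - 1) * (N * w + M * w_int),        # shift-registers hold intermediate data
        "hybrid_2": (N - 1) * (M * w + N * w_int),        # shift-registers hold input data
        "fully_transpose": (M + N) * (N - 1) * w_int,     # all delays hold intermediate data
    }

print(memory_bits(N=4, M=512, w=8, w_int=16))
# {'fully_direct': 12384, 'hybrid_1': 24672, 'hybrid_2': 12480, 'fully_transpose': 24768}
print(memory_bits(N=8, M=512, w=8, w_int=20))
# {'fully_direct': 29120, 'hybrid_1': 72128, 'hybrid_2': 29792, 'fully_transpose': 72800}
```

Only the fully-direct expression is free of the intermediate bit-width, which is why its memory does not grow when the intermediate word-length is increased.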
A. Exploration of Memory-Reuse Possibilities
To explore the memory reuse possibilities, let us consider the input data-flow of the fully-direct structure for the computation of the m-th row of outputs {y(m, n), y(m, n + 1), y(m, n + 2), y(m, n + 3)}, as shown in Fig. 3 for N = 4. The samples required to compute a given filter output are shown in a pair of curly braces and the corresponding filter output is shown at its right. Each arrow shows the source (shift-register/register) of samples. To compute each output of the (4 × 4) filter, 16 input samples corresponding to 4 rows and 4 columns of the 2-D input are required. Out of the 4 rows (numbered as m, m − 1, m − 2, m − 3), the m-th row is the current input-row
Fig. 3. Memory-unit data-flow of fully-direct structure for outputs y(m, n), y(m, n + 1), y(m, n + 2), and y(m, n + 3). Redundant memory-values are shown by blue color boxes.
and the others are the immediate past rows. The memory-unit uses three shift-registers (SR-1, SR-2, SR-3) to buffer the 3 past rows, (m − 1), (m − 2) and (m − 3), of input samples. The required (n − 1), (n − 2) and (n − 3)-th columns of samples of a particular row are provided by a serial-in parallel-out (SIPO) shift-register of (N − 1) words. As shown in Fig. 3, all the 16 input samples required to compute the filter output are obtained from the memory-unit and the current input. Therefore, the memory used by the fully-direct structure to compute each output is (3M + 12) words, where the SR-unit provides 3M words and the fixed register-unit provides 12 words. The memory band-width (MBW) of the structure is 15 (3 read operations on the SR-unit, and 12 values obtained from the register-unit to compute an output). This can be generalized to find the number of memory words used by the fully-direct structure to be [(N − 1)(M + N)] and the MBW to be (N² − 1).
As shown in Fig. 3, a total of 64 input samples are required to compute the outputs {y(m, n), y(m, n + 1), y(m, n + 2), y(m, n + 3)}. These 64 samples belong to 4 rows {m, m − 1, m − 2, m − 3} and 7 columns {n + 3, n + 2, n + 1, n, n − 1, n − 2, n − 3} of the input image. It can be noticed that out of the 64 samples only 28 are distinct, while the remaining 36 are redundant. The redundant values corresponding to the filter outputs {y(m, n), y(m, n + 1), y(m, n + 2), y(m, n + 3)} are shown in the blue boxes in Fig. 3. These redundant memory accesses could be avoided by parallel computation of the filter outputs {y(m, n), y(m, n + 1), y(m, n + 2), y(m, n + 3)}. In that case, the four current input samples {x(m, n), x(m, n + 1), x(m, n + 2), x(m, n + 3)} corresponding to the four outputs need to be available in each cycle. Four input values of each of the past rows (m − 1, m − 2, m − 3) are retrieved from the respective shift-registers in every cycle. To realize this, each of the shift-registers (of M words) is required to be split into four equal parts of M/4 words. Therefore, 12 input values are obtained from the SR-unit in every cycle. Out of the 28 unique samples, 4 are obtained as input samples of the current cycle and 12 samples are obtained from the SR-unit. The remaining 12 samples are obtained from four SIPO shift-registers. The memory usage for parallel computation of 4 filter outputs is [(3 × 4 × M/4) + 12] words, which is the same as the memory usage of the fully-direct structure for one output. The MBW for parallel computation of four filter outputs is (4 × 3 + 12 = 24). The memory words required to compute a block of L filter outputs per cycle of the 2-D FIR filter of size (N × N) can therefore be found to be [(N − 1)(M + N)], and the MBW can be found to be [(N − 1)(L + N)].
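The sample counts quoted above (64 accesses, 28 distinct samples, 36 redundant) can be reproduced by enumerating the input indices that each output needs. The sketch below does this for an arbitrary filter size and block size; the helper names are assumptions introduced for illustration.

```python
# Sketch: count total and distinct input samples needed by L adjacent outputs.

def samples_for_output(m, n, N):
    """Index pairs (row, column) of the N*N inputs used to compute y(m, n)."""
    return [(m - i, n - j) for i in range(N) for j in range(N)]

def block_sample_stats(N, L, m=0, n=0):
    """Return (total accesses, distinct samples) for y(m, n), ..., y(m, n + L - 1)."""
    accesses = []
    for l in range(L):
        accesses.extend(samples_for_output(m, n + l, N))
    return len(accesses), len(set(accesses))

def mbw_siso(N):
    """Memory band-width of the fully-direct structure for one output."""
    return N * N - 1

def mbw_block(N, L):
    """Memory band-width for L outputs computed in parallel."""
    return (N - 1) * (L + N)

total, distinct = block_sample_stats(N=4, L=4)
print(total, distinct, total - distinct)   # 64 28 36
print(mbw_siso(4), mbw_block(4, 4))        # 15 24
```

The distinct samples minus the L current inputs equals the block MBW (28 − 4 = 24 here), since the current inputs are not read from memory.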

Interestingly, the storage space of the fully-direct structure of a 2-D FIR filter of size (N × N) is the same as that needed for parallel computation of L filter outputs, due to memory reuse during parallel computation. Only the arithmetic resources increase proportionately with the throughput rate. This is an important feature which can be utilized for area and power saving. The memory reuse efficiency (MRE) of a block-based structure can be estimated as MRE = [(total input words − actual memory usage)/actual memory usage], where total input words is L times the memory usage of the single-input single-output (SISO) structure. For a block-based structure with block size L, MRE = [L(N − 1)(M + N) − (N − 1)(M + N)]/[(N − 1)(M + N)] = L − 1. Therefore, the MRE of the block-based parallel structure increases proportionately with the block-size (L). The higher the MRE, the lower the memory requirement per output. Consequently, the structure is more area-delay efficient than the SISO structure. The MBW of the block-based structure increases by a factor of [(L + N)/(N + 1)] instead of L times, where L is the input block-size. This is mainly due to the number of redundant samples, N(N − 1)(L − 1). The higher the number of redundant samples, the better the MBW reduction. We have estimated the memory band-width saving (MBS) using the formula MBS = [(L × MBW of SISO structure) − MBW of block-based structure of block-size L]/(L × MBW of SISO structure). Therefore, MBS = N(N − 1)(L − 1)/[L(N² − 1)]. Since the number of redundant samples varies with L and N, MBS is higher for larger filters with higher input block-size. We have estimated the number of redundant samples, MBS and MRE of the parallel 2-D FIR structure for different filter sizes (N) and input block-sizes (L). The estimated values are listed in Table III to quantify the scope of memory reuse. We can find from Table III that a block-based realization enhances memory reuse and reduces the memory band-width per output. Therefore, we present the block formulation of 2-D FIR filters and its subsequent implementation.
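The quantities tabulated in Table III follow from the memory and band-width expressions of this sub-section. The following sketch evaluates them from the SISO and block MBW values, (N² − 1) and (N − 1)(L + N); the function name `reuse_metrics` and the tuple layout are assumptions made here for illustration.

```python
# Sketch: redundant samples, memory band-width saving (MBS) and memory reuse
# efficiency (MRE) of a block-based structure with filter size N and block size L.

def reuse_metrics(N, L):
    mbw_siso = N * N - 1                 # MBW of the SISO fully-direct structure
    mbw_block = (N - 1) * (L + N)        # MBW of the block-based structure
    redundant = L * mbw_siso - mbw_block # equals N * (N - 1) * (L - 1)
    mbs_percent = 100.0 * redundant / (L * mbw_siso)
    mre = L - 1                          # (L * memory - memory) / memory
    return redundant, mbs_percent, mre

for N, L in [(4, 2), (4, 4), (4, 8), (16, 16)]:
    redundant, mbs, mre = reuse_metrics(N, L)
    print(N, L, redundant, round(mbs, 1), mre)
# 4 2 12 40.0 1
# 4 4 36 60.0 3
# 4 8 84 70.0 7
# 16 16 3600 88.2 15
```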
B. Memory-Sharing in Generalized 2-D FIR Filter Structures
The fully-direct non-separable structure as well as the conventional separable structure use shift-registers to store M words each, but the shift-registers of the fully-direct non-separable structure store input pixel values while those of the separable structure store intermediate values. Due to the difference in bit-width, a common shift-register unit cannot be shared by these two structures. A different design approach for the separable filter is required, where the shift-register unit stores the input signal only. For this purpose, we use an efficient decomposition scheme for the 2-D separable filter, as described in the following.

TABLE III
MEMORY REUSE EFFICIENCY AND MEMORY BAND-WIDTH SAVING FOR DIFFERENT SIZE FILTERS (N) AND INPUT-BLOCK SIZE (L)

| Filter size (N) | Block-size (L) | Redundant samples | MBS (in %) | MRE (in times) |
|---|---|---|---|---|
| 4 | 2 | 12 | 40 | 1 |
| 4 | 4 | 36 | 60 | 3 |
| 4 | 8 | 84 | 70 | 7 |
| 8 | 2 | 56 | 44.4 | 1 |
| 8 | 4 | 168 | 66.7 | 3 |
| 8 | 8 | 392 | 77.8 | 7 |
| 16 | 2 | 240 | 47 | 1 |
| 16 | 4 | 720 | 70.5 | 3 |
| 16 | 8 | 1680 | 82.4 | 7 |
| 16 | 16 | 3600 | 88.2 | 15 |
| 32 | 2 | 992 | 48.4 | 1 |
| 32 | 4 | 2976 | 72.6 | 3 |
| 32 | 8 | 6944 | 84.8 | 7 |
| 32 | 16 | 14880 | 90.8 | 15 |

Redundant samples = N(N − 1)(L − 1), MBS = (redundant samples)/[L(N² − 1)], MRE (in times) = L − 1.

The input-output relation of the separable 2-D FIR filter, given by (2), can be written as:

$$y(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} h_1(i)\, h_2(j)\, x(m-i, n-j) \tag{4}$$

Computation of (4) can be expressed in split form as:

$$y(m, n) = \sum_{i=0}^{N-1} h_2(i)\, v(m, n-i) \tag{5a}$$

$$v(m, n) = \sum_{i=0}^{N-1} h_1(i)\, x(m-i, n) \tag{5b}$$

and can be expressed as inner-products of a pair of N-point vectors {$\mathbf{b}_{m,n}$, $\mathbf{h}_1$} and {$\mathbf{u}_{m,n}$, $\mathbf{h}_2$}, with u(m, n) denoting the intermediate value v(m, n) of (5b), as

$$u(m, n) = \mathbf{b}_{m,n} \cdot \mathbf{h}_1 \tag{6a}$$

$$y(m, n) = \mathbf{u}_{m,n} \cdot \mathbf{h}_2 \tag{6b}$$

and $\mathbf{b}_{m,n}$, $\mathbf{u}_{m,n}$, $\mathbf{h}_1$ and $\mathbf{h}_2$ are given by

$$\mathbf{b}_{m,n} = \big[\, x(m, n) \;\; x(m-1, n) \;\; \cdots \;\; x(m-N+1, n) \,\big] \tag{7a}$$

$$\mathbf{u}_{m,n} = \big[\, u(m, n) \;\; u(m, n-1) \;\; \cdots \;\; u(m, n-N+1) \,\big] \tag{7b}$$

$$\mathbf{h}_1 = \big[\, h_1(0) \;\; h_1(1) \;\; \cdots \;\; h_1(N-1) \,\big]^T \tag{7c}$$

$$\mathbf{h}_2 = \big[\, h_2(0) \;\; h_2(1) \;\; \cdots \;\; h_2(N-1) \,\big]^T \tag{7d}$$

According to (5), input-vectors are fed to the row-filter in column-overlapped order, and the row-filter generates intermediate values column-wise in exactly the same order as the column-filter consumes them. Consequently, the transposition-unit in this case is comprised of fixed registers instead of shift-registers.

The separable filter structure based on the decomposition scheme of (5) is shown in Fig. 4 for N = 4. The input samples are fed as overlapped blocks ($\mathbf{b}_{m,n}$) for 0 ≤ m ≤ M − 1 and 0 ≤ n ≤ N − 1, where successive blocks of a column are overlapped by N − 1 samples. The input-blocks are fed in row-serial order from a shift-register unit comprising (N − 1) shift-registers of M words each. We can find from Fig. 3 and Fig. 4 that the shift-register units of the fully-direct non-separable structure and of the separable structure based on this decomposition scheme have the same number of shift-registers, and these are of the same size. Interestingly, the data-input and data-output formats of both these shift-register units are identical. This favours sharing of the shift-register unit between the fully-direct non-separable structure and the separable structure. This leads to a generalized structure for both non-separable and separable filters, individually or in parallel configuration.

Fig. 4. Separable 2-D FIR filter structure (for N = 4) based on proposed decomposition scheme.

Data-flow of a shared shift-register unit is shown in Fig. 5 for N = 4, where the input-data requirement of both the fully-direct non-separable and the separable structures is taken care of by the shared shift-register unit. A shared shift-register unit not only offers memory saving, but also allows parallel realization of both non-separable and separable filters. It is shown in the later Sections that the parallel implementation of the generalized structure offers higher area-delay-power efficiency than the sequential structure. Keeping these facts in view, we outline here a systematic memory-centric design strategy to derive an area-delay-power-efficient structure for 2-D FIR filter:

Fig. 5. Data-flow in a shared shift-register unit for the fully-direct structure and the separable structure.
• A fully-direct form structure should be used for the non-separable filter to have less storage complexity.
• A block implementation of the fully-direct structure should be used for MBW reduction.
• A separable structure based on the proposed decomposition algorithm could be derived for shift-register sharing with the non-separable filter structure.
• An appropriate algorithm partitioning and scheduling scheme needs to be used for the separable filter to minimize memory band-width and increase register sharing.
The register-sharing property of the proposed non-separable and separable filter structures is exploited further to derive generic structures. The proposed generic structures can be configured for realization of a single filter of any type (separable, non-separable, or symmetric (diagonal, four-fold rotational, quadrant)) or for parallel realization of any combination of these filters.
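The decomposition of the separable filter used in this Section (a filter along the first index on the raw input, as in (5b), followed by a filter along the second index on the intermediate values, as in (5a)) can be checked against the direct double sum of (4) with a few lines of Python. This is a behavioural sketch only: the helper names, the zero-padding of samples outside the image, and the test data are assumptions introduced here and say nothing about the hardware organization.

```python
# Sketch: two-stage evaluation of a separable 2-D FIR filter versus the direct double sum.

def at(a, m, n):
    """Sample a(m, n), taken as zero outside the array (assumed boundary handling)."""
    if 0 <= m < len(a) and 0 <= n < len(a[0]):
        return a[m][n]
    return 0

def direct_form(x, h1, h2):
    """y(m, n) = sum_i sum_j h1(i) h2(j) x(m - i, n - j)."""
    N = len(h1)
    return [[sum(h1[i] * h2[j] * at(x, m - i, n - j)
                 for i in range(N) for j in range(N))
             for n in range(len(x[0]))] for m in range(len(x))]

def two_stage(x, h1, h2):
    """v(m, n) = sum_i h1(i) x(m - i, n), then y(m, n) = sum_i h2(i) v(m, n - i)."""
    N = len(h1)
    rows, cols = len(x), len(x[0])
    v = [[sum(h1[i] * at(x, m - i, n) for i in range(N))
          for n in range(cols)] for m in range(rows)]
    return [[sum(h2[i] * at(v, m, n - i) for i in range(N))
             for n in range(cols)] for m in range(rows)]

x = [[(3 * m + 5 * n) % 7 for n in range(6)] for m in range(6)]  # small test image
h1, h2 = [1, 2, 3, 4], [4, -1, 0, 2]
print(direct_form(x, h1, h2) == two_stage(x, h1, h2))  # True
```

Because the first stage operates on raw input samples, the row buffers in this ordering hold input-width words, which is the property that permits sharing the shift-register unit with the fully-direct non-separable structure.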
III. BLOCK FORMULATION OF 2-D FIR FILTERS
A. For Non-separable Filter
Let us consider a non-separable filter which processes a block of L input samples and generates a block of L outputs in every cycle. The k-th block of filter output of the m-th row, $\mathbf{y}_{m,k}$, is computed by the relation:

$$\mathbf{y}_{m,k} = \sum_{i=0}^{N-1} \mathbf{v}_{i,k} \tag{8}$$

where $\mathbf{y}_{m,k}$ and $\mathbf{v}_{i,k}$ are defined as

$$\mathbf{y}_{m,k} = \big[\, y(m, kL) \;\; y(m, kL-1) \;\; \cdots \;\; y(m, kL-L+1) \,\big]^T \tag{9a}$$

$$\mathbf{v}_{i,k} = \big[\, v(i, kL) \;\; v(i, kL-1) \;\; \cdots \;\; v(i, kL-L+1) \,\big]^T \tag{9b}$$

The intermediate vector $\mathbf{v}_{i,k}$ is computed as the product of the input-matrix $\mathbf{A}_k^{m-i}$ and the impulse-response vector $\mathbf{h}_i$, and is given by

$$\mathbf{v}_{i,k} = \mathbf{A}_k^{m-i}\, \mathbf{h}_i \tag{10}$$

The input matrix $\mathbf{A}_k^{m-i}$ is derived from the (m − i)-th row of the input matrix [X] of size (M × M) and is given by

$$\mathbf{A}_k^{m-i} = \begin{bmatrix} x(m', kL) & x(m', kL-1) & \cdots & x(m', k'+1) \\ x(m', kL-1) & x(m', kL-2) & \cdots & x(m', k') \\ \vdots & \vdots & & \vdots \\ x(m', kL-L+1) & x(m', kL-L) & \cdots & x(m', k'-L+2) \end{bmatrix} \tag{11}$$

for m′ = m − i and k′ = kL − N, and $\mathbf{h}_i$ is given by:

$$\mathbf{h}_i = \big[\, h(i, 0) \;\; h(i, 1) \;\; \cdots \;\; h(i, N-1) \,\big]^T \tag{12a}$$

From (9) and (11), we find that v(i, kL − l) is the inner-product of $\mathbf{s}_{k,l}^{m-i}$ (the l-th row of matrix $\mathbf{A}_k^{m-i}$) and $\mathbf{h}_i$, given by

$$v(i, kL-l) = \mathbf{s}_{k,l}^{m-i} \cdot \mathbf{h}_i \tag{13}$$

B. For Separable 2-D FIR Filter

Let us consider a separable filter which processes a block of L input samples and generates a block of L outputs in every cycle. The k-th block of filter output of the m-th row, $\mathbf{y}_{m,k}$, is computed in this case by two successive matrix-vector products given by

$$\mathbf{u}_{m,k} = \mathbf{B}_k^{m}\, \mathbf{h}_1 \tag{14a}$$

$$\mathbf{y}_{m,k} = \mathbf{U}_k^{m}\, \mathbf{h}_2 \tag{14b}$$

and the input-matrix $\mathbf{B}_k^{m}$ and intermediate-matrix $\mathbf{U}_k^{m}$ are given by

$$\mathbf{B}_k^{m} = \begin{bmatrix} x(m, kL) & x(m-1, kL) & \cdots & x(m', kL) \\ x(m, kL-1) & x(m-1, kL-1) & \cdots & x(m', kL-1) \\ \vdots & \vdots & & \vdots \\ x(m, k') & x(m-1, k') & \cdots & x(m', k') \end{bmatrix} \tag{15}$$

for m′ = m − N + 1 and k′ = kL − L + 1, and

$$\mathbf{U}_k^{m} = \begin{bmatrix} u(m, kL) & u(m, kL-1) & \cdots & u(m, k') \\ u(m, kL-1) & u(m, kL-2) & \cdots & u(m, k'-1) \\ \vdots & \vdots & & \vdots \\ u(m, kL-L+1) & u(m, kL-L) & \cdots & u(m, k'-L+1) \end{bmatrix} \tag{16}$$

for k′ = kL − N + 1.

IV. PROPOSED STRUCTURES
In this Section, we derive two separate structures for block implementation of non-separable and separable 2-D FIR filters.

A. Block-based Structure for Non-separable 2-D FIR Filter
The computations of (8) and (10) are mapped onto a fully-direct L-parallel structure to derive the proposed block-based structure for non-separable 2-D FIR filter. The proposed structure is shown in Fig. 6 for filter-size N = 8 and block-size L = 4. It consists of one memory-module and one arithmetic module.
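The block formulation of the non-separable filter, in which each output block is the sum over i of an (L × N) input matrix built from row (m − i) multiplied by the i-th row of the impulse-response matrix, can be verified against the sample-by-sample definition. The sketch below is a behavioural check only; the helper names, zero-padding outside the image and the test data are assumptions made for illustration.

```python
# Sketch: block-based evaluation of a non-separable 2-D FIR filter for one output block.

def at(x, m, n):
    """Sample x(m, n), taken as zero outside the image (assumed boundary handling)."""
    if 0 <= m < len(x) and 0 <= n < len(x[0]):
        return x[m][n]
    return 0

def input_matrix(x, row, k, L, N):
    """L x N matrix whose l-th row is [x(row, kL - l), ..., x(row, kL - l - N + 1)]."""
    return [[at(x, row, k * L - l - j) for j in range(N)] for l in range(L)]

def block_output(x, h, m, k, L):
    """Return [y(m, kL), y(m, kL - 1), ..., y(m, kL - L + 1)] as a sum of N partial vectors."""
    N = len(h)
    y = [0] * L
    for i in range(N):                        # one partial vector per filter row
        A = input_matrix(x, m - i, k, L, N)
        for l in range(L):                    # one inner product per matrix row
            y[l] += sum(A[l][j] * h[i][j] for j in range(N))
    return y

def direct_output(x, h, m, n):
    """y(m, n) = sum_i sum_j h(i, j) x(m - i, n - j)."""
    N = len(h)
    return sum(h[i][j] * at(x, m - i, n - j) for i in range(N) for j in range(N))

N, L = 4, 4
x = [[(7 * m + 3 * n) % 11 for n in range(12)] for m in range(12)]
h = [[i - 2 * j + 1 for j in range(N)] for i in range(N)]
m, k = 7, 2
print(block_output(x, h, m, k, L) == [direct_output(x, h, m, k * L - l) for l in range(L)])  # True
```

Adjacent rows of each input matrix overlap in all but one sample, which is the redundancy the block-based structure exploits to avoid repeated memory reads.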
Fig. 6. Proposed block-based structure for non-separable 2-D FIR filter for block-size L = 4 and filter-size N = 8.
1) Memory-Module Design: The memory-module of Fig. 6 is comprised of one shift-register array and N input-register units (IRUs). The shift-register array further consists of [L(N − 1) = 28] shift-registers of P words each, where P = M/L. The proposed structure receives a block of L input samples and computes a block of L outputs in each cycle. All the samples of each input-block belong to the same row, and the inputs are fed to the structure block-by-block and then row-by-row in serial order. The input-block ($\mathbf{x}_k^m$) corresponding to the m-th row of input image [X] is fed to the structure during the (k + 1)-th cycle of the m-th set of P cycles, and the entire image is fed in MP cycles for 0 ≤ k ≤ P − 1, 0 ≤ m ≤ M − 1. The [L(N − 1)] shift-registers of the shift-register array are arranged in groups referred to as SR-blocks, and each SR-block has L shift-registers. As shown in Fig. 6, {SR-1, SR-2, SR-3, SR-4} constitute SR-block-1 and {SR-25, SR-26, SR-27, SR-28} constitute SR-block-7. One SR-block stores one input row and the shift-register array stores (N − 1) rows of input. The SRs of an SR-block are connected such that a block of samples is transferred from one SR-block to the adjacent SR-block to its right after every cycle. Therefore, a block of L = 4 inputs of a particular row is obtained from each SR-block in every cycle, and (N − 1) input-blocks corresponding to (N − 1) consecutive input rows are obtained from the shift-register unit in every cycle.

The current input-block and the (N − 1) past input-blocks available in the shift-register array are sent to the N IRUs to generate the input-matrices [$\mathbf{A}_k$] of size (L × N). During the k-th cycle of each set of P cycles, the first IRU receives the current block of input {$\mathbf{x}_k^m$}, and the (i + 1)-th IRU receives an input-block from the i-th SR-block. The (i + 1)-th IRU generates the input-matrix [$\mathbf{A}_k^{m-i}$], for 1 ≤ i ≤ N − 1. According to (11), a block of (N + L − 1) consecutive samples of the (m − i)-th row of input is required to construct the matrix [$\mathbf{A}_k^{m-i}$]. Each IRU receives L samples from the corresponding SR-block during the k-th cycle and uses (N − 1) past samples belonging to ⌈N/L⌉ past input-blocks. The internal structure of the (i + 1)-th IRU is shown in Fig. 7 for N = 8 and L = 4. It consists of (N − 1) registers, and produces L N-point input-vectors ($\mathbf{s}_{k,l}^{m-i}$, for 0 ≤ l ≤ L − 1) corresponding to the L rows of [$\mathbf{A}_k^{m-i}$].

Fig. 7. Internal structure of (i + 1)-th input register unit (IRU) for L = 4 and N = 8.

Fig. 8. (a) Structure of (i + 1)-th functional unit (FU) for L = 4 and N = 8. (b) Structure of adder-block for L = 4 and N = 8.

Fig. 9. Internal structure of inner-product cell (IPC) for N = 8.
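The behaviour of an input-register unit, which combines L new samples with (N − 1) retained past samples to form L overlapping N-point vectors per cycle, can be modelled in a few lines. This is a behavioural sketch under stated assumptions, not the register-level wiring of Fig. 7: the class name, the newest-first ordering of samples and the zero initial register contents are choices made here for illustration.

```python
# Sketch: behavioural model of an input-register unit (IRU) with N - 1 registers.

class InputRegisterUnit:
    def __init__(self, N, L):
        self.N, self.L = N, L
        self.regs = [0] * (N - 1)   # the N - 1 most recent past samples, newest first

    def step(self, block):
        """Accept L samples (newest first) and return L overlapping N-point vectors."""
        assert len(block) == self.L
        window = list(block) + self.regs            # N + L - 1 consecutive samples
        vectors = [window[l:l + self.N] for l in range(self.L)]
        self.regs = window[:self.N - 1]             # retain N - 1 newest samples
        return vectors

iru = InputRegisterUnit(N=8, L=4)
first = iru.step([4, 3, 2, 1])     # samples 1..4 of a row, newest first
second = iru.step([8, 7, 6, 5])    # samples 5..8 of the same row
print(second[0])   # [8, 7, 6, 5, 4, 3, 2, 1]
print(second[3])   # [5, 4, 3, 2, 1, 0, 0, 0]
```

Each call models one cycle: only L new words enter the unit, while the remaining entries of the L vectors are reused from the registers.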

2) Arithmetic-Module Design: The arithmetic module is comprised of N functional-units (FUs) and one adder tree (AT). In each cycle, the N FUs of the arithmetic module receive N sets of L input-vectors from the N IRUs of the storage-module, such that the (i + 1)-th FU receives the L input-vectors generated by the (i + 1)-th IRU and performs L separate inner-product computations with the (i + 1)-th row of the impulse-response matrix [$\mathbf{h}_i$] to obtain the L-point partial output-vector [$\mathbf{v}_i$] according to (10).
Fig. 10. Proposed block-based structure for separable 2-D FIR filter for N = 8 and L = 4.
The internal structure of the FU is shown in Fig. 8(a). It consists of L inner-product cells (IPCs). Each IPC (shown in Fig. 9) performs an N-point inner product of an input-vector and a weight-vector. Finally, the N partial-output vectors of the N FUs are added together in an adder-block (shown in Fig. 8(b)) according to (8) to obtain one block of complete output $[y^m_k]$ corresponding to the m-th block of input in one cycle, and successive output blocks after every cycle thereafter, where one clock period is $T = T_M + T_A + T_{FA}(2\log_2 N - 1)$, and $T_M$, $T_A$, and $T_{FA}$ are, respectively, one multiplication time, one addition time and one full-adder delay. One complete row of output is obtained in P cycles and the entire output matrix of size $(M \times M)$ in $MP$ cycles.
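The data-flow described above can be illustrated with a small behavioural model. The following is a minimal sketch, not the authors' implementation: the function name `block_fir2d_nonseparable`, the zero-valued boundary samples and the list-based data layout are assumptions made for illustration. Each pass of the `k` loop models one cycle, the `i` loop models the N IRU/FU pairs, and the `l` loop models the L IPCs of one FU.

```python
def block_fir2d_nonseparable(x, h, L):
    """Behavioural model: y(m, n) = sum_i sum_j h[i][j] * x(m - i, n - j).

    x is an M x M image (list of rows), h is an N x N impulse-response
    matrix and L is the input block-size. Samples outside the image are
    taken as zero (an assumption of this sketch).
    """
    M, N = len(x), len(h)
    assert M % L == 0, "image width must be a multiple of the block-size"

    def sample(r, c):
        return x[r][c] if r >= 0 and c >= 0 else 0.0

    y = [[0.0] * M for _ in range(M)]
    for m in range(M):
        for k in range(M // L):                 # one iteration = one cycle
            block = [0.0] * L                   # output block of L samples
            for i in range(N):                  # (i + 1)-th IRU and FU
                # IRU contents: N + L - 1 consecutive samples of row (m - i)
                first = k * L - (N - 1)
                window = [sample(m - i, first + c) for c in range(N + L - 1)]
                for l in range(L):              # one IPC per output sample
                    s = window[l:l + N][::-1]   # s[j] = x(m - i, kL + l - j)
                    block[l] += sum(hc * sc for hc, sc in zip(h[i], s))
            y[m][k * L:(k + 1) * L] = block     # adder-block result
    return y
```

The accumulation into `block` plays the role of the adder-block; the model produces L outputs per loop pass and needs M/L passes per image row.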
B. Block-based Structure for Separable Filter
The proposed block-based structure for the separable 2-D FIR filter is shown in Fig. 10. It consists of one processing cell (PC) and one transposition-unit (TU), where a PC consists of two FUs. The structure of each FU is the same as the one shown in Fig. 8(a), except that a constant vector is stored in each FU. In this case, FU-1 and FU-2 store the constant-vectors ($h_1$) and ($h_2$), respectively. The structure processes the (k + 1)-th block of input ($x^m_k$) of the m-th row of [X] during the k-th cycle of a set of P cycles, and produces an output block ($y^m_k$), where P = M/L and L is the input block-size. One complete row of [X] is processed in P cycles and the complete image in $MP$ cycles. The proposed structure receives a block of L input samples through L N-point input-vectors ($b_{m,Lk}$), where each of the input-vectors is overlapped by $(N - 1)$ samples. The input-vectors of the k-th and (k + 1)-th cycles of the m-th set of input cycles are shown in Fig. 10 for N = 8 and L = 4. The components of each input-vector are shown in the rectangular box adjacent to its left. The input-vectors of the k-th cycle form the matrix ($B^m_k$) as given in (15), where $b_{m,Lk-l}$ is the (l + 1)-th row of $B^m_k$, for $0 \le l \le L - 1$, $0 \le k \le P - 1$, and $0 \le m \le M - 1$. The L rows of $B^m_k$ are fed to the IPCs of FU-1 in parallel to perform one matrix-vector multiplication with the constant vector ($h_1$) to calculate an L-point intermediate-vector ($u_{m,k}$) according to
V. GENERIC STRUCTURES
In this Section, we derive two separate generic structures for non-separable and separable filter banks. We also propose a unified structure for the realization of a 2-D FIR filter-bank comprised of non-separable and separable filters.
A. Generic Block-Based Structure for Non-separable Filters
The coefficient matrices of non-separable filters can have various symmetries, e.g., diagonal, centro, four-fold rotational, and quadrant. These symmetries can be exploited to realize the transfer functions with fewer multiplications. The arithmetic module of the proposed structure for non-separable filters can be optimized to take advantage of these symmetry properties. The storage-module of the structure can be interfaced as a common unit with the arithmetic modules of filters with and without symmetry. This results in the generic structure shown in Fig. 11. Each sub-module of the arithmetic-module of the generic structure represents the arithmetic module of a constituent filter. The arithmetic-module of each filter is controlled by a select signal ($EN_i$, for $1 \le i \le 4$), so that it can be switched off to save power when the output of that filter is not required. The proposed generic structure can be used to realize any of the four types of filters or a parallel combination of filters by selecting the arithmetic modules of the respective filters through the select signals. The proposed generic structure has higher MRE and MBS than the non-separable structure for a given block-size and filter-size due to the common storage-unit. The area-delay-product (ADP) and power consumed per output (PPO) of the proposed generic structure in parallel configuration are expected to be significantly less than those of separate implementations of the individual filters.
(14a). From each intermediate-vector, one intermediate-matrix (as given in (15)) is generated, such that $U^m_k$ is generated from ($u_{m,k}$). The TU generates the required rows of $U^m_k$ from ($u_{m,k}$). We can find from (11) and (16) that the elements of $U^m_k$ and $A^i_k$ satisfy a similar property. Therefore, the TU performs the same function as the IRUs and its structure is identical to that of an IRU (as shown in Fig. 7). The TU of the separable structure generates L rows ($s^m_{k,l}$, for $0 \le l \le L - 1$) of $U^m_k$ in parallel and feeds those to FU-2 in parallel to perform one matrix-vector product with the constant-vector ($h_2$) to compute a block of filter output ($y_{m,k}$) according to (16b).
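The two-stage computation of the separable structure can be sketched behaviourally as follows. This is an illustrative model under stated assumptions, not the authors' design: the function name `block_fir2d_separable` is hypothetical, boundary samples are taken as zero, and $h_1$ is applied along the column direction because the input-vector labels of Fig. 10 list samples of rows (m - 7) to m at a fixed column. The list `u_past` stands in for the $(N - 1)$ registers of the TU.

```python
def block_fir2d_separable(x, h1, h2, L):
    """Behavioural model of the separable block structure.

    y(m, n) = sum_j h2[j] * sum_i h1[i] * x(m - i, n - j), with samples
    outside the image taken as zero (an assumption of this sketch).
    """
    M, N = len(x), len(h1)
    assert M % L == 0, "image width must be a multiple of the block-size"

    def sample(r, c):
        return x[r][c] if r >= 0 and c >= 0 else 0.0

    y = [[0.0] * M for _ in range(M)]
    for m in range(M):
        u_past = [0.0] * (N - 1)            # TU registers, cleared per row
        for k in range(M // L):             # one iteration = one cycle
            # FU-1: L IPCs, each taking N samples of one image column
            u = []
            for l in range(L):
                b = [sample(m - i, k * L + l) for i in range(N)]
                u.append(sum(a * c for a, c in zip(h1, b)))
            # TU: N + L - 1 consecutive intermediate values of row m
            line = u_past + u
            # FU-2: L IPCs on overlapped N-point windows of the TU output
            for l in range(L):
                s = line[l:l + N][::-1]     # s[j] = u(m, kL + l - j)
                y[m][k * L + l] = sum(a * c for a, c in zip(h2, s))
            u_past = line[L:]               # keep the last N - 1 values
    return y
```

Only $(N - 1)$ intermediate values are carried from one cycle to the next, which is why the TU can share the register structure of an IRU.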
Fig. 11. Proposed generic block-based structure for non-separable 2-D FIR filters, N is the filter size and L is the input block-size.
B. Generic Block-Based Structure for Separable Filters
shown for the k-th and (k + 1)-th cycle. The contents of the
first two locations of each SR of the SR-block are shown
for the k-th and (k + 1)-th input cycles of the m-th input
row. The rectangular dotted boxes show the content of one
cell of the SR-block comprised of 4 SRs corresponding to
L = 4. A direct flow of data from the shift-register unit meets the data-flow requirement of the non-separable generic structure, while the shift-register output data are rearranged (shown by the data-distribution block in Fig. 13) for the separable generic structure.
The proposed unified structure for the 2-D FIR filter is shown in Fig. 14. It has a common storage unit for both separable and non-separable sections, which can be used for the realization of any one of four types of non-separable or two types of separable filters. It can also be used for parallel realization of any combination of separable and non-separable filters. It involves $(M + N + 2)(N - 1)$ memory-words and computes L outputs of each of the six filters in every cycle in its full-parallel configuration. Although we have shown the unified structure for 6 parallel filters, it can be used for the realization of more than 6 filters without any additional storage. Only the complexity of the processing module (arithmetic-modules and PCs of each filter) increases proportionately with the number of parallel filters. This is a major advantage for area-delay-power-efficient realization of large filter banks consisting of separable and non-separable filters.
Fig. 12. Proposed generic block-based structure for separable 2-D FIR filters, N is the filter size and L is the input block-size.
The proposed generic structure for the realization of separable filters with and without symmetry is shown in Fig. 12. The structure is similar to the proposed generic non-separable structure, except that only the shift-register array is common to the PCs of the processing unit. The proposed generic structure can be used to realize separable filters with and without symmetry in parallel.
Fig. 13. Data-flow of common shift-register array and data distribution block
for non-separable and separable generic structures.
C. Unified Structure for 2-D FIR Filter-Bank

The shift-register size and the input to the shift-register array are the same in the proposed separable and non-separable generic structures. $NL$ input samples are obtained from the shift-register array and fed to the IRU-array of the generic non-separable structure as N blocks of L samples each, while the processing unit of the generic separable structure is fed with L blocks of N samples each. Therefore, the input-blocks of both non-separable and separable filters can be obtained from the same shift-register array. The outputs of the common shift-register array need to be rearranged appropriately for the non-separable and separable generic structures.
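The two views of the same $NL$ samples can be sketched as follows. This is an illustrative model only: the function names `sr_window`, `for_nonseparable` and `for_separable` are hypothetical, and rows above the image are assumed to hold zeros. In this model the data-distribution step amounts to reading the N x L window column-wise instead of row-wise.

```python
def sr_window(x, m, k, N, L):
    """N x L window read from the shift-register array in cycle k of row m.

    Row i of the window holds image row (m - i), columns kL .. kL + L - 1.
    Rows above the image are taken as zero (an assumption of this sketch).
    """
    cols = range(k * L, (k + 1) * L)
    return [[x[m - i][c] if m - i >= 0 else 0 for c in cols]
            for i in range(N)]


def for_nonseparable(window):
    """N blocks of L samples each: one block per IRU (one per image row)."""
    return [list(row) for row in window]


def for_separable(window):
    """L blocks of N samples each: one block per IPC of FU-1 (per column)."""
    return [list(col) for col in zip(*window)]
```

Both views contain exactly the same samples, so no additional storage is implied by the rearrangement in this model; only the grouping of wires changes.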
Data rearrangement of a common shift-register array is
shown in Fig.13. The data-flow of an SR-block is shown in
blue color. The input-blocks shifting through the SR-block are
Fig. 14. Proposed block-based unified structure for 2-D FIR filter.
VI. COMPLEXITIES AND PERFORMANCE CONSIDERATIONS
The proposed structure for the non-separable filter consists of a storage-module and an arithmetic module, while the proposed separable structure consists of one shift-register array and one processing cell, where the processing cell in turn consists of both arithmetic and storage components. The storage module of the non-separable structure consists of one shift-register array and N IRUs. The shift-register array of both non-separable and separable structures consists of $(N - 1)$ SRs of M words each. Each IRU consists of $(N - 1)$ registers. The arithmetic module of the non-separable structure consists of N FUs and one adder-block, while the processing cell of the separable structure consists of two FUs and one TU (same as the IRU). Each FU consists of L IPCs, and each IPC consists of N multipliers and one adder-tree (AT) to add N words. The adder-block consists of L such ATs.
A. Complexity of Block-based Structures for Separable and
Non-separable Filters
The arithmetic-module of the block-based non-separable structure involves $LN^2$ multipliers and $L(N^2 - 1)$ adders, while each processing cell of the separable structure involves $2LN$ multipliers, $2L(N - 1)$ adders and $(N - 1)$ registers. Apart from these, the non-separable structure involves $(M + N)(N - 1)$ registers and the separable structure involves $(M + 1)(N - 1)$ registers. Both these structures compute L outputs per cycle, where one cycle period is $T = T_M + T_1$ and $T = T_M + T_2$, respectively, for the non-separable and separable structures, with $T_1 = T_A + 2T_{FA}(\log_2 N - 1)$ and $T_2 = T_A + T_{FA}(\log_2 N - 1)$; $T_M$, $T_A$ and $T_{FA}$ are, respectively, one multiplier delay, adder delay and full-adder delay.²
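The timing expressions above can be evaluated with a short helper. This is a sketch for exploration only: the function names `cycle_period` and `frame_cycles` are hypothetical, and any delay values passed in are placeholders chosen by the reader, not figures from the synthesis results.

```python
from math import log2


def cycle_period(n, t_m, t_a, t_fa, separable):
    """Cycle period T = T_M + T_1 (non-separable) or T_M + T_2 (separable).

    T_1 = T_A + 2*T_FA*(log2(N) - 1) and T_2 = T_A + T_FA*(log2(N) - 1),
    as stated in the text. All delays share one arbitrary time unit.
    """
    extra_levels = log2(n) - 1
    factor = 1 if separable else 2
    return t_m + t_a + factor * t_fa * extra_levels


def frame_cycles(m, l):
    """Cycles to process an M x M image: P = M/L cycles per row, M rows."""
    return m * (m // l)
```

For example, `frame_cycles(512, 4)` gives the cycle count for a (512 × 512) image with block-size 4, and multiplying it by `cycle_period(...)` gives a rough frame time for a chosen set of delays.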
processing cells (one general and one symmetric filter). It involves $3LN$ multipliers, $4L(N - 1)$ adders and $2(N - 1)$ registers. The proposed non-separable generic structure, therefore, involves $[LN(9N + 2)/4]$ multipliers, $[4L(N^2 - 1)]$ adders and $[(M + N)(N - 1)]$ registers. It computes L outputs of each of the four filters in every cycle. Similarly, the proposed separable generic structure involves $3LN$ multipliers, $4L(N - 1)$ adders and $[(M + 2)(N - 1)]$ registers, and computes L outputs of each of the pair of filters in every cycle. The proposed unified structure has one storage unit and one processing module, which is comprised of one non-separable section and one separable section. The non-separable section represents the arithmetic-module of the proposed non-separable generic structure, whereas the separable section represents one processing unit of the separable generic structure. The complexity of the storage unit is the same as that of the proposed block-based non-separable structure. The proposed unified structure, therefore, involves $LN(9N + 14)/4$ multipliers, $4L(N(N + 1) - 2)$ adders and $[(M + N + 2)(N - 1)]$ registers, and computes L filter outputs of each of six filters (four non-separable and two separable) in every cycle.
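The resource counts of the generic and unified structures can be cross-checked numerically. The sketch below uses hypothetical function names (`generic_nonseparable`, `unified`) and assumes an even filter size N so that the integer divisions are exact; it sums the per-filter multiplier counts quoted in the text rather than using the closed forms directly.

```python
def generic_nonseparable(M, N, L):
    """Resource counts of the non-separable generic structure (N even)."""
    general = L * N * N                 # general filter
    diagonal = L * N * (N + 1) // 2     # diagonal symmetry filter
    four_fold = L * N * N // 4          # four-fold symmetry filter
    quadrant = L * N * N // 2           # quadrant symmetry filter
    return {
        "mult": general + diagonal + four_fold + quadrant,
        "add": 4 * L * (N * N - 1),
        "reg": (M + N) * (N - 1),
        "outputs": 4 * L,
    }


def unified(M, N, L):
    """Unified structure: non-separable section plus separable section."""
    ns = generic_nonseparable(M, N, L)
    return {
        "mult": ns["mult"] + 3 * L * N,       # adds the separable section
        "add": ns["add"] + 4 * L * (N - 1),
        "reg": (M + N + 2) * (N - 1),
        "outputs": 6 * L,
    }
```

With M = 512 and N = 4, these functions reproduce the multiplier, adder and register entries listed for the proposed structures in Table VII for L = 4 and L = 8.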
TABLE IV
GENERAL COMPARISON OF MRE AND MBWPO OF PROPOSED STRUCTURES. M: INPUT-IMAGE WIDTH/HEIGHT, N: FILTER-SIZE, L: INPUT-BLOCK LENGTH

| Structure | MRE | MBWPO |
|---|---|---|
| NS-block-based | $L - 1$ | $(L + N)(N - 1)/L$ |
| S-block-based | $L - 1$ | $(L + 1)(N - 1)/L$ |
| NS-Generic | $4L - 1$ | $(L + N)(N - 1)/4L$ |
| S-Generic | $2L - 1$ | $(L + 2)(N - 1)/2L$ |
| Unified | $[M(6L - 1) + N(4L - 1) + 2(L - 1)]/(M + N + 2)$ | $(L + N + 2)(N - 1)/6L$ |

LEGEND: NS: non-separable, S: separable.
B. Complexity of Generic Structures
² The delay of the AT increases by $T_{FA}$ after each level of the tree. Since we have considered a (4 × 4) 2-D FIR filter to synthesize the proposed design, it requires an AT of two levels, so the delay introduced by the adder tree becomes $T_A + T_{FA}$. Therefore, the critical path is not affected much. For large-order filters one can use a pipelined adder-tree (PAT) instead. For simplicity, if we consider the multiplier to consist of ripple-carry adders, the multiplication time ($T_M$) is twice the addition time ($T_A$). To maintain a critical path of $T_M + T_A$, we then need to introduce a pipeline stage after the first level of the adder tree using N/2 registers. Up to a filter of order 511, we do not require any extra pipeline stage for a word length of 8 bits or more.
The arithmetic module of the non-separable generic structure comprises four sub-modules corresponding to four types of filters, where each sub-module represents the arithmetic-module of the corresponding filter. The sub-modules of the symmetric filters involve the same number of adders as the sub-module of the general filter, which is $[L(N^2 - 1)]$, and differ only in the number of multipliers. The sub-modules of the diagonal, four-fold and quadrant symmetry filters, respectively, involve $[LN(N + 1)/2]$, $(LN^2/4)$ and $(LN^2/2)$ multipliers. The arithmetic-module of the non-separable generic structure, therefore, involves $[LN(9N + 2)/4]$ multipliers and $[4L(N^2 - 1)]$ adders. The processing unit of the proposed separable generic structure is comprised of two
Fig. 15. (a) Memory reuse efficiency (MRE). (b) Memory band-width per output (MBWPO). Both are plotted against block-size (L) for filter-size N = 8. NS, S, NSG, SG and UNI, respectively, stand for non-separable, separable, non-separable generic, separable generic and unified.
TABLE V
COMPARISON OF HARDWARE- AND TIME-COMPLEXITY OF THE PROPOSED STRUCTURES AND EXISTING STRUCTURES

| Structures | Filter type | Multiplier | Adder | Register | Filter-output/cc | MBWPO |
|---|---|---|---|---|---|---|
| Van [11] | non-separable | $N^2$ | $N^2$ | $3N(\lfloor (N - 1)/3 \rfloor + 1) + M(N - 1)$ | $1/[T_M + 2T_A]$ | $N^2 - 1$ |
| Khoo et al. [16] | non-separable | $N^2$ | $N^2$ | $(M + N)(N - 1)$ | $1/[T_M + 2T_A]$ | $N^2 - 1$ |
| Proposed | non-separable | $LN^2$ | $L(N^2 - 1)$ | $(M + N)(N - 1)$ | $L/[T_M + T_1]$ | $(L + N)(N - 1)/L$ |
| Mohanty et al. [29] | separable | $2N$ | $2(N - 1)$ | $(M + 1)(N - 1)$ | $1/[T_M + T_A]$ | $2N - 1$ |
| Proposed | separable | $2LN$ | $2L(N - 1)$ | $(M + 1)(N - 1)$ | $L/[T_M + T_2]$ | $(L + 1)(N - 1)/L$ |
| Proposed Generic | separable | $3LN$ | $4L(N - 1)$ | $(M + 2)(N - 1)$ | $2L/[T_M + T_2]$ | $(L + 2)(N - 1)/2L$ |

*$(MN - M)$ registers required to feed overlapped input-blocks.
$T_1 = T_A + 2T_{FA}(\log_2 N - 1)$, $T_2 = T_A + T_{FA}(\log_2 N - 1)$, $T_{FA}$: full-adder delay.
TABLE VI
COMPARISON OF HARDWARE- AND TIME-COMPLEXITY OF PROPOSED NON-SEPARABLE GENERIC AND UNIFIED STRUCTURES AND EXISTING UNIFIED STRUCTURE OF [17]

| Structures | Multiplier | Adder | Register | cycle period | output/cycle | MBWPO |
|---|---|---|---|---|---|---|
| Chen [17] | $N(5N + 2)/8$ | $N(21N + 8)/16$ | $M(N - 1) + 2N^2$ | $T_M + 3T_A$ | 1 | $N^2 - 1$ |
| Proposed Generic (NS) | $LN(9N + 2)/4$ | $4L(N^2 - 1)$ | $(M + N)(N - 1)$ | $T_M + T_1$ | $4L$ | $(L + N)(N - 1)/4L$ |
| Proposed Unified (NS+S) | $LN(9N + 14)/4$ | $4L(N(N + 1) - 2)$ | $(M + N + 2)(N - 1)$ | $T_M + T_1$ | $6L$ | $(L + N + 2)(N - 1)/6L$ |

$T_1 = T_A + 2T_{FA}(\log_2 N - 1)$, $T_{FA}$: full-adder delay.
C. Memory Reuse Efficiency and Bandwidth Requirement

Using the definitions of MRE and MBWPO given in Section II, we have calculated the MRE and MBWPO of the proposed structures according to the expressions given in Table IV. Using these expressions, we have estimated the MRE and MBWPO of the proposed structures for filter-size N = 8 and image-size M = 512. The estimated values are shown in the graphs of Fig. 15. The MRE of the proposed structures increases with block-size. The MRE of the unified structure is higher than that of the others for a given block-size. This is mainly due to memory-sharing by more filters in the unified structure.
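The expressions of Table IV can be evaluated directly, for instance to regenerate the trends of Fig. 15. The following is a minimal sketch with a hypothetical function name `mre_mbwpo`; it reads the operators lost in the table as subtractions, an interpretation that reproduces the MBWPO values listed in Table VII for N = 4.

```python
def mre_mbwpo(M, N, L):
    """Return {structure: (MRE, MBWPO)} from the Table IV expressions."""
    return {
        "NS": (L - 1, (L + N) * (N - 1) / L),
        "S": (L - 1, (L + 1) * (N - 1) / L),
        "NS-Generic": (4 * L - 1, (L + N) * (N - 1) / (4 * L)),
        "S-Generic": (2 * L - 1, (L + 2) * (N - 1) / (2 * L)),
        "Unified": (
            (M * (6 * L - 1) + N * (4 * L - 1) + 2 * (L - 1)) / (M + N + 2),
            (L + N + 2) * (N - 1) / (6 * L),
        ),
    }


if __name__ == "__main__":
    # Sweep the block-size for the setting used in Fig. 15 (N = 8, M = 512).
    for L in (2, 4, 8, 16):
        print(L, mre_mbwpo(512, 8, L))
```

The sweep shows MRE growing with L for every structure while MBWPO falls, which is the behaviour the text attributes to memory reuse and memory sharing.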
D. Performance Comparison
In [16], an efficient 2-D IIR filter (pole-zero) structure is presented which can be modified for realization of a 2-D FIR filter by removing the all-pole structure from the pole-zero structure. Accordingly, we have extracted an FIR filter structure from [16] and synthesized it for comparison. The hardware complexity of this extracted structure, along with those of the proposed structures and the structures of [11] and [29], is listed in Table V. We find from Table V that the proposed non-separable structure involves L times more multipliers and adders than the existing structures and an equal number of registers, but it offers L times higher throughput and N times less MBWPO than the others. Similarly, the proposed separable structure involves L times more multipliers and adders than that of [29] and the same number of registers, but offers L times higher throughput and nearly 2 times less MBWPO than that of [29]. The proposed separable generic structure involves (3L/2) times more multipliers, 2L times more adders and $(N - 1)$ more registers than that of [29], and offers L times higher throughput with nearly 4 times less MBWPO than that of [29].
TABLE VII
HARDWARE AND TIME COMPLEXITIES OF PROPOSED GENERIC AND UNIFIED STRUCTURES AND EXISTING UNIFIED STRUCTURE OF [17] FOR INPUT-IMAGE SIZE M = 512, FILTER-SIZE N = 4

| Structure | Filter output/cc | MULT | ADD | REG | MBWPO |
|---|---|---|---|---|---|
| Chen [17] | 1 | 11 | 23 | 1568 | 15 |
| Proposed Generic (NS) | 16 | 152 | 240 | 1548 | 1.5 |
| Proposed Generic (NS) | 32 | 304 | 480 | 1548 | 1.125 |
| Proposed Unified (NS+S) | 24 | 200 | 288 | 1554 | 1.25 |
| Proposed Unified (NS+S) | 48 | 400 | 576 | 1554 | 0.875 |

LEGEND: NS: non-separable, S: separable.
We have extracted a unified 2-D FIR filter structure from the multimodal structure of [17] for comparison purposes. The hardware and time complexities of this structure and the proposed non-separable generic and unified structures are listed in Table VI. Compared with the structure of [17], the proposed non-separable generic structure and unified structure involve nearly 3.6L times more multipliers and 3L times more adders, and compute 4L and 6L times more outputs, respectively. The proposed generic structure involves $N(N + 1)$ fewer registers than that of [17] and it has nearly 4L times less MBWPO than that of [17]. Similarly, the proposed unified structure involves $(N^2 - N + 2)$ fewer registers and nearly 6L times less MBWPO.
We have estimated the hardware complexity of the proposed non-separable generic and unified structures and the unified structure of [17] for N = 4, block-size L = 4, 8 and for image-size M = 512. The estimated values are listed in Table VII. We can find from Table VII that the proposed non-separable
TABLE VIII
COMPARISON OF POST-LAYOUT SYNTHESIS RESULTS OF THE PROPOSED STRUCTURES AND THE STRUCTURES OF [16], [17] AND [29] FOR L = 4, N = 4, M = 512, w = 8 AND w' = 16, USING TSMC 90 NM CMOS TECHNOLOGY LIBRARY, POWER ESTIMATED AT 20 MHZ FREQUENCY

| Designs | Type | Output/cycle | DAT (ns) | Area (um²) | Leakage power (mW) | Dynamic power (mW) | ADP (um².ms) | EPO (pJ) |
|---|---|---|---|---|---|---|---|---|
| Structure of [16] | Non-separable | 1 | 14.17 | 1009878.67 | 3.9107 | 4.9441 | 14.30 | 442.74 |
| Proposed Structure | Non-separable | 4 | 12.16 | 791361.47 | 2.8016 | 3.2918 | 2.40 | 76.16 |
| Proposed Generic Structure | Non-separable | 16 | 17.82 | 1142564.15 | 3.6646 | 5.3996 | 1.27 | 28.32 |
| Structure of [29] | Separable | 1 | 22.22 | 472219.56 | 2.0034 | 2.5533 | 10.49 | 227.83 |
| Proposed Structure | Separable | 4 | 21.58 | 676537.05 | 2.4002 | 3.2511 | 3.64 | 70.64 |
| Proposed Generic Structure | Separable | 8 | 22.08 | 785332.33 | 2.7136 | 3.8730 | 2.16 | 41.16 |
| Unified Structure of [17] | Non-separable (Symmetric) | 1 | 14.07 | 488883.93 | 2.0327 | 2.6070 | 6.87 | 231.98 |
| Proposed Unified Structure | Non-separable (General + Symmetric) + Separable | 24 | 24.75 | 1437527.67 | 4.4976 | 6.8792 | 1.48 | 23.70 |

DAT: data-arrival time, area-delay-product (ADP) = Area × DAT/(output per cycle), energy per output (EPO) = power × clock-period/number of outputs per cycle.
throughput.
We have estimated the normalized storage per filter output (SPO) of the proposed structures and the existing structures of [11], [16], [17], [29] for N = 4, L = 4 and M = 512, and the results are shown in the graph of Fig. 16. We find that the proposed structures involve less normalized SPO than the existing structures. This is mainly due to the memory-reuse efficiency and memory-sharing of the proposed structures.
E. ASIC Synthesis Result
Fig. 16. Normalized storage per filter output (calculated for N = 4, L = 4 and M = 512). NS, S, NSG, SG and UNI, respectively, stand for non-separable, separable, non-separable generic, separable generic and unified.
generic structure involves 13.6% less normalized multipliers, 34.7% less normalized adders and 90% less MBWPO than those of [17]. The proposed unified structure involves 24.27% less normalized multipliers, 47.82% less normalized adders and 94% less MBWPO than those of [17]. Unlike the existing unified structure, the proposed one does not require any steering logic for signal switching. Note that the proposed non-separable generic structure is comprised of one general filter and three symmetric filters, while the proposed unified structure is comprised of one general non-separable filter, one general separable filter, and four symmetric filters. The unified structure of [17], however, is comprised of only four symmetric filters.
The proposed structures have several advantages over the existing structures: they (i) provide filter-banks for non-separable and/or separable filters with and/or without symmetry, which could be used in many image processing applications, (ii) can be configured for implementation of any one filter of the filter-bank or parallel configuration of any of the filters
of the filter-bank, and (iii) can be easily scaled for high-
We have coded the proposed designs in VHDL for filter size N = 4, block-size L = 4 and image-size (512 × 512), and synthesized them using the Synopsys tool. We have also synthesized the non-separable 2-D FIR filter structure extracted from the IIR structure of [16], the unified structure of [17] and the separable structure of [29]. We have used the Wallace-tree based generic Booth-multiplier of the Synopsys DesignWare building blocks library for all the designs. Shift-registers and registers of all designs are synthesized using D flip-flops (D-FF). We have considered an input signal width of w = 8 bits, and a width of w' = 16 bits for all intermediate signals and the output signal. We have set the switching-activity toggle rate to 0.25 and the static probability to 0.5. The netlist file obtained from the Synopsys Design Compiler is processed in IC Compiler. After placement, routing and clock synthesis, the area, time and power (leakage and dynamic) reported by the IC Compiler are listed in Table VIII for comparison. The power consumption of all the synthesized designs is estimated for a 20 MHz clock. Due to memory footprint reduction, the area, leakage power and dynamic power consumption of the proposed structures do not increase proportionately with the number of outputs. Consequently, the proposed structures involve significantly less area-delay-product (ADP, defined as area × data-arrival time (DAT)/outputs per cycle) and consume less energy per output (EPO, defined as power × clock-period/number of outputs per cycle). Compared with [16], the proposed generic non-separable structure involves 11.25 times less ADP and 15.63 times less EPO. The proposed generic separable structure involves 4.85 times less ADP and 5.53 times less EPO than those of [29]. Compared with [17], the proposed unified structure involves 4.64 times less ADP and 9.78 times less EPO.
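The two figures of merit can be recomputed from the raw synthesis numbers as a sanity check. The sketch below is an illustration with hypothetical function names (`adp_um2_ms`, `epo_pj`); it assumes that the total power is the sum of leakage and dynamic power, and that the clock period is the reciprocal of the 20 MHz estimation frequency.

```python
def adp_um2_ms(area_um2, dat_ns, outputs_per_cycle):
    """Area-delay-product in um^2.ms: area x DAT / outputs per cycle."""
    dat_ms = dat_ns * 1e-6                      # 1 ns = 1e-6 ms
    return area_um2 * dat_ms / outputs_per_cycle


def epo_pj(leakage_mw, dynamic_mw, clock_mhz, outputs_per_cycle):
    """Energy per output in pJ: power x clock-period / outputs per cycle."""
    period_ns = 1e3 / clock_mhz                 # 20 MHz -> 50 ns
    total_mw = leakage_mw + dynamic_mw          # assumed: total = sum
    return total_mw * period_ns / outputs_per_cycle   # mW x ns = pJ


if __name__ == "__main__":
    # Proposed unified structure, values taken from Table VIII.
    print(adp_um2_ms(1437527.67, 24.75, 24))
    print(epo_pj(4.4976, 6.8792, 20, 24))
```

Applying the same functions to each row of Table VIII and taking ratios gives the improvement factors quoted in the text, up to rounding of the tabulated values.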
VII. CONCLUSION
We have analyzed the memory footprint and combinational complexity of 2-D FIR structures to arrive at a systematic design strategy to derive area-delay-power-efficient architectures. Based on that, we have presented novel block-based separable and non-separable structures, generic structures for separable and non-separable filter-banks, and a unified structure for concurrent realization of both separable and non-separable filter-banks. It is shown that the storage requirement of the proposed structures does not change with the input block-size (L). Similarly, the storage size of the generic non-separable structure is independent of the number of parallel filters (P) in a filter-bank. It increases marginally by $P(N - 1)$ words in the case of the generic separable and unified structures, where N is the filter size. The proposed structures, therefore, offer higher memory reuse efficiency and MBW reduction for higher values of L and P. This reduces the SPO and PPO of the proposed structures.
Compared with the existing structures, the proposed block-based separable and non-separable structures involve the same number of storage words, and proportionately the same or fewer arithmetic resources than the corresponding best of the existing structures, and compute $PL$ times more filter outputs per cycle with $PL$ times less MBW. The proposed unified structure with 4 non-separable filters and 2 separable filters involves nearly 3.6L times more multipliers, 3L times more adders and $(N^2 - N + 2)$ fewer registers than the existing unified structure, and computes 6L times more filter outputs per cycle with 6L times less MBW. The ASIC synthesis result for filter-size (4 × 4), input-block size L = 4 and image-size (512 × 512) shows that the proposed block-based non-separable and generic non-separable structures, respectively, involve 5.95 times and 11.25 times less ADP, and 5.81 times and 15.63 times less EPO than the corresponding existing structures. The proposed unified structure involves 4.64 times less ADP and 9.78 times less EPO than the corresponding existing structure. When the inner-product computation involved in FIR filtering is realized by multiplier-less designs to reduce the combinational complexity, additional memory may be required, leading to an overall increase in memory complexity. However, the advantage gained by the proposed memory footprint reduction technique is not affected by such a change in the method of implementing the inner-products.
Acknowledgements: This publication was made possible by
an internal grant from Qatar University (QUUG-CENG-DCS11/12 7). The statements made herein are solely the responsibility of the authors.
R EFERENCES
[1] D. E. Dudgeon and R. M. Mersereau, Multidimensional Digital Signal
Processing. Englewood Cliffs, NJ: Prentice-Hall, 1984.
[2] M. A. Sid-Ahmed, Image Processing: Theory, Algorithms, and Architectures. New York: McGraw-Hill, 1995.
13
[3] T. Barbu, Gabor filter based face recognition technique, in Proceeding

of the Rmanian Academy, Series A, vol. 11, no.3/2010, pp. 277-283,
2010.
[4] S. E. Grigorescu, N. Petkov, and P. Kruizinga, Comparison of texture
features based on Gabor filters, IEEE Transactions on Image Processing,
vol. 11, no. 10, pp 1160-1167, OCT. 2002.
[5] W. Li, K. Mao, H. Zhang, and T. Chai, Designing compact Gabor filter
banks for efficient texture feature extraction, In Proc. 11th International
Conference on Control, Automation, Robotics and Vision, Singapore,
7-10th December 2010, , pp.1193-1197.
[6] M. A. Sid-Ahemed, A systolic realization for 2-D digital filters, IEEE
Transaction on Acoustic Speech and Signal Processing, vol. 37, pp.560565, Apr. 1989.
[7] N. R. Sanbhag, An improved systolic architecture for 2-D digital filters,
IEEE Transaction on Signal Processing, vol. 39, pp. 1195-1202, May
1991.
[8] B. K.Mohanty and P. K. Meher, Cost-effective novel flexible cell-vlevel
systolic architecture for high-throughput implementation of 2-D FIR
filters, IET Proc. Comput. Digit. Tech., vol. 143. No. 6, Nov 1996.
[9] B. K. Mohanty and P. K. Meher, High-throughput and low-latency
implementation of bit-level systolic architecture for 1-D and 2-D digital
filters, IET Proc. Computer and Digital Technique, vol. 146, No. 2,
pp.91-99, March 1999.
[10] L.D.Van, C.C.Tang, S.Tenqchen, and W.S.Feng, A new VLSI architecture without global broadcast for 2-D systolic digital filters, in Proc.
IEEE Int. Symp. Circuis Syst. ISCAS, vol.1 Geneva, Switzerland, May
2000, pp. 547-550.
[11] L. D. Van, A new 2-D systolic digital filters architecture without global
broadcast, IEEE Transaction on Very Large Sclae Integration Systems,
vol.10, no.4, pp. 477-486, Aug. 2002.
[12] H. C. Reddy, I. H. Khoo, and P. K. Rajan, 2-D symmetry: Theory and
filter design applications, IEEE Circuits and System Mag., vol. 3, no. 3,
pp. 433, 3rd Q., 2003.
[13] P. Y. Chen, L. D. Van, H. C. Reddy, and C. T. Lin, A new VLSI 2-D
diagonal-symmetry filter architecture design, in Proc. IEEE APCCAS,
Macao, China, Nov. 2008, pp.320-323.
[14] I. H. Khoo, H. C. Reddy, L. D. Van, and C. T. Lin, 2-D digital filter
architectures without global broadcast and some symmetry applications,
in Proc. IEEE Int. Symp. Circuis Syst. ISCAS, May 2009, pp.952-955.
[15] P. Y. Chen, L. D. Van, H. C. Reddy, and C. T. Lin, A new VLSI 2D fourfold-rotational-symmetry filter architecture design, in Proc. IEEE
Int. Symp. Circuis Syst. (ISCAS), May 2009, pp.93-96.
[16] I. H. Khoo, H. C. Reddy, L. D. Van, and C. T. Lin, Generalized
formulation of 2-D filter structures without global broadcast for VLSI
implementation, in Proc., IEEE MWSCAS, Seattle, WA, Aug. 2010,
pp.426-529.
[17] P. Y. Chen, L. D. Van, I. H. Khoo, H. C. Reddy, and C. T. Lin,
Power-efficient cost-effective 2-D symmetry filter architecture, IEEE
Transaction on Circuit ans Systems-I, Regular Papers, vol. 58, no.1,
pp.112-125, Jan. 2011.
[18] K. K. Parhi, VLSI Digital Signal Processing. New York: Wiley, 1998
[19] C. Y. Yao, H. H. Chen, C. J. Chien, and C. T. Hsu, A novel commonsubexpression elimination method for synthesizing fixed-point FIR filters, IEEE Transaction on Circuits and System-I, Regular Papers, vol.
51, no. 11, pp.2215-2221, Nov. 2004.
[20] Y. Voronenko and M. Puschel, Multiplierless multiple constant multiplication, ACM Transaction on Algorithms, vol. 3, no. 2, article 11, May
2007.
[21] A. P. Vinod and E. M.-K. Lai, An efficient coefficient-partitioning
algorithm for realizing low complexity digital filters, IEEE Transaction
on Computer-Aided Design Integr. Circuits Syst., vol. 24, no. 12,
pp.1936-1946, Dec. 2005.
[22] R. Mahesh and A. P. Vinod, A new common subexpression elimination
algorithm for realizing low complexity higher order digital filters, IEEE
Transaction on Computer-Aided Design Integr. Circuits Syst., vol. 27,
no. 2, pp.217-219, Feb. 2008.
[23] J. Luis, T.-Xihuitl, R. M.A.-Ponce and M. Bayoumi, Hybrid multiplierless FIR filter architecture based on NEDA, in Proc. IFIP International
Conference on Very Large Scale Integration (VLSI-SoC 2007), pp.316319, 2007.
[24] P. K. Meher, S. Chandrasekaran, and A. Amira, FPGA realization of FIR
filters by efficient and flexible systolization using distributed arithmetic,
IEEE Transaction on Signal Processing, vol. 56, no. 7, pp.3009-3017,
July 2008.
[25] P. K. Meher, New approach to look-up table design and memorybased realization of FIR digital filters, IEEE Transaction on Circuit and
Systems-I, Regular Papers, vol. 57, no. 3, pp.592-603, Mar. 2010.
[26] A. Croisier, D. J. Esteban, M. E. Levilion, and V. Rizo, Digital filter for

PCM encoded signals, U.S. Patent 3777130, Apr., 1973.
[27] S. A. White, Applications of the distributed arithmetic to digital signal
processing: A tutorial review, IEEE ASSP Magazine, vol. 6, no. 3, pp.519, July 1989.
[28] R. I. Hartley, "Subexpression sharing in filters using canonic signed digit
multipliers," IEEE Transactions on Circuits and Systems II: Analog and
Digital Signal Processing, vol. 43, no. 10, pp. 677-688, Oct. 1996.
[29] B. K. Mohanty and P. K. Meher, "New scan method and pipeline
architecture for VLSI implementation of separable 2-D FIR filters
without using transposition," in Proc. IEEE Region 10 TENCON 2008
Conference, Hyderabad, India, Nov. 2008.
Basant K. Mohanty (M'06, SM'11) received the M.Sc.
degree in physics from Sambalpur University, India,
in 1989 and the Ph.D. degree in the field of
VLSI for digital signal processing from Berhampur
University, Orissa, in 2000. In 2001, he joined the EEE Department, BITS Pilani, Rajasthan, as a Lecturer.
He then joined the Department of ECE, Mody Institute of Education
Research (Deemed University), Rajasthan, as an Assistant Professor. In 2003,
he joined Jaypee University of Engineering and
Technology, Guna, Madhya Pradesh, where he became Associate Professor in 2005 and full Professor in 2007. His research interests include the design and implementation of low-power and high-performance
systems for adaptive filters, image and video-processing applications, secured
communication and reconfigurable architectures. He has published nearly 50
technical papers.
Dr. Mohanty is serving as Associate Editor for the Journal of Circuits,
Systems, and Signal Processing. He is a lifetime member of the Institution
of Electronics and Telecommunication Engineers, New Delhi, India, and
was the recipient of the Rashtriya Gaurav Award conferred by the India
International Friendship Society, New Delhi, for the year 2012.
Pramod Kumar Meher (SM'03) received the M.Sc.
degree in physics and the Ph.D. degree in science from Sambalpur University, India, in 1978
and 1996, respectively. Currently, he is a Senior
Scientist with the Institute for Infocomm Research,
Singapore, and Adjunct Professor with the School of
Electrical Sciences, Indian Institute of Technology
Bhubaneswar, India. Previously, he was a Professor
of Computer Applications with Utkal University,
India, from 1997 to 2002, and a Reader in electronics
with Berhampur University, India, from 1993 to
1997. His research interests include the design of dedicated and reconfigurable
architectures for computation-intensive algorithms pertaining to signal, image
and video processing, communication, bio-informatics and intelligent computing. He has contributed more than 200 technical papers to various reputed
journals and conference proceedings.
Dr. Meher served as a speaker for the Distinguished Lecturer Program
(DLP) of the IEEE Circuits and Systems Society during 2011 and 2012 and as Associate
Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II:
EXPRESS BRIEFS from 2008 to 2011. Currently, he is serving as Associate
Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I:
REGULAR PAPERS, the IEEE TRANSACTIONS ON VERY LARGE SCALE
INTEGRATION (VLSI) SYSTEMS, and Journal of Circuits, Systems, and
Signal Processing. Dr. Meher is a Fellow of the Institution of Electronics and
Telecommunication Engineers, India. He was the recipient of the Samanta
Chandrasekhar Award for excellence in research in engineering and technology for 1999.
Somaya Al-Maadeed (SM'12) received the B.Sc. degree in computer science from
Qatar University, Qatar, in 1994, the M.Sc. degree in mathematics and computer
science from Alexandria University, Egypt, in 1999, and the Ph.D. degree in computer
science from Nottingham University in 2004. Following her Ph.D., she worked
as an assistant professor at Qatar University, where she did research in the
areas of biometrics, digital filters, speech recognition, image processing, and
document management. She has published around forty papers in the above
general areas. Dr. Somaya is a Senior Member of the IEEE.
Abbes Amira (M'01, SM'07) is a full professor
in visual communication at the University of the
West of Scotland (UWS), Scotland. Prior to joining
UWS, he held academic positions as Reader in embedded systems in
the Nanotechnology and Integrated Bio-Engineering
Centre (NIBEC) at the University of Ulster, UK; Associate Professor in embedded
systems in the Department of Electrical Engineering
at Qatar University, Qatar; Senior Lecturer in the Division of Electronic and
Computer Engineering at Brunel University, UK; and Lecturer in computer engineering at Queen's University. He received his Ph.D. in the area
of electronic and computer engineering from Queen's University Belfast
in 2001, developing a coprocessor for matrix algorithms using FPGAs.
He has been awarded a number of grants from government and industry
and has authored around 200 publications in the area of reconfigurable
computing and image and signal processing during his career to date. Twelve
Ph.D. students have successfully completed their degrees under his supervision.
He has been invited to give talks, short courses and tutorials at universities and
international conferences, and has served as chair or program committee member for a number of
conferences. He was one of the tutorial presenters at ICIP 2009, Conference
Chair of ECVW 2011, Program Chair of ECVW 2010, and Program Co-Chair of
ICM12, DELTA 2008, and IMVIP 2005. He is also one of the 2008 VARIAN
prize recipients. He has been an external examiner for many universities in
the UK, Hong Kong, Australia and Malaysia. Prof. Amira was one of the guest
editors for the Special Issue of the Pattern Recognition Journal titled "Feature
Generation and Machine Learning for Robust Multimodal Biometrics," March
2008.
He has held consultancy positions with many companies in the UK, and he holds
two visiting professor positions, at the University of Nancy, Henri Poincare,
France, and the University of Tunn Hussein Onn, Malaysia. He is a Fellow of IET,
Fellow of the Higher Education Academy, Senior member of the IEEE, and
Senior member of ACM. His research interests include: embedded systems,
high performance reconfigurable computing, image and video processing,
multi-resolution analysis, biometrics and connected health applications.
