An Efficient Line Based VLSI Architecture For 2D Lifting DWT

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Micha Filipek, Grzegorz Mrugalski, Senior Member, IEEE, Nilanjan Mukherjee, Senior Member, IEEE,
Benoit Nadeau-Dostie, Senior Member, IEEE, Janusz Rajski, Fellow, IEEE, Jedrzej Solecki, and Jerzy Tyszer, Fellow, IEEE
Abstract
This paper proposes an efficient architecture for the 2-D DWT. The proposed architecture includes a transform module, a RAM module, and a multiplexer. In the transform module, the polyphase decomposition and coefficient folding techniques are applied to the decimation filters of stages 1 and 2, respectively. The advantages of the proposed architecture are 100% hardware utilization, fast computing time, regular data flow, and low control complexity. Because of its regular structure, the proposed architecture can easily be scaled with the filter length and the 2-D DWT level. The VLSI architecture for the 2-D DWT is implemented on an FPGA using Verilog HDL.
Keywords: Discrete Wavelet Transform, VLSI architectures, image compression

1. Introduction

With the rapid progress of VLSI design technologies, many processors for audio and image signal processing have been developed recently. The two-dimensional discrete wavelet transform (2-D DWT) plays a major role in the JPEG-2000 image compression standard.

At present, many VLSI architectures for the 2-D DWT have been proposed to meet the requirements of real-time processing. Implementing the DWT in a practical system raises several issues. First, the complexity of the wavelet transform is several times higher than that of the DCT. Second, the DWT needs extra memory for storing intermediate computational results. Moreover, for real-time image compression, the DWT has to process massive amounts of data at high speed. A software implementation of DWT image compression provides flexibility, but it may not meet the timing constraints of certain applications. Hardware implementation of the DWT, however, also has problems.

The first obstacle is the high cost of implementing multipliers in hardware. Approximately 256 transistors are required to build a delay element, 415 transistors for an adder, and 6800 transistors for a multiplier. Several VLSI architectures have been proposed for the DWT [5]-[8].

The best-known architecture for the 2-D DWT is the parallel filter architecture [1], [2]. The design of the parallel filter architecture is based on the MRPA [3]. The MRPA intersperses the computation of the second and following levels among the computation of the first level. Because of the decimation operation, the quantity of data processed in each level is half of that in the previous level, which leads to efficient utilization of hardware. Hence, the MRPA is feasible for a 1-D DWT architecture, but it is not suitable for a 2-D DWT architecture.

This paper is organized as follows. Section 2 introduces the 2-D DWT algorithm. Section 3 describes the proposed architecture for the 2-D DWT. Section 4 presents the performance of the proposed architecture. Finally, the conclusion is given in Section 5.
2. Two Level DWT

2.1 Discrete Wavelet Transform

The two-dimensional discrete wavelet transform (2-D DWT) is defined as:

$$x_{LL}^{J}(n_1, n_2) = \sum_{i_1=0}^{K-1} \sum_{i_2=0}^{K-1} g(i_1)\, g(i_2)\, x_{LL}^{J-1}(2n_1 - i_1,\, 2n_2 - i_2) \tag{1}$$

$$x_{LH}^{J}(n_1, n_2) = \sum_{i_1=0}^{K-1} \sum_{i_2=0}^{K-1} g(i_1)\, h(i_2)\, x_{LL}^{J-1}(2n_1 - i_1,\, 2n_2 - i_2) \tag{2}$$

$$x_{HL}^{J}(n_1, n_2) = \sum_{i_1=0}^{K-1} \sum_{i_2=0}^{K-1} h(i_1)\, g(i_2)\, x_{LL}^{J-1}(2n_1 - i_1,\, 2n_2 - i_2) \tag{3}$$

$$x_{HH}^{J}(n_1, n_2) = \sum_{i_1=0}^{K-1} \sum_{i_2=0}^{K-1} h(i_1)\, h(i_2)\, x_{LL}^{J-1}(2n_1 - i_1,\, 2n_2 - i_2) \tag{4}$$

where
x_{LL}^{0}(n_1, n_2) is the input image,
J is the 2-D DWT level,
K is the filter length,
g(n) is the impulse response of the low-pass filter, and
h(n) is the impulse response of the high-pass filter.

Each decomposition level comprises two stages: stage 1 performs horizontal filtering, and stage 2 performs vertical filtering. In the first-level decomposition, the size of the input image is N*N, and the outputs are the three subbands LH, HL, and HH, of size N/2*N/2. In the second-level decomposition, the input is the LL band and the outputs are the three subbands LLLH, LLHL, and LLHH, of size N/4*N/4. The multi-level 2-D DWT can be extended in an analogous manner.

Figure 1. Two level DWT

3. Proposed System

The proposed system implements an efficient architecture for the two-dimensional discrete wavelet transform (2-D DWT). The advantages of the proposed architecture are 100% hardware utilization, faster computing time than the parallel filter architecture, regular data flow, and low control complexity, making this architecture suitable for JPEG-2000.

Figure 2. Block diagram of the proposed work (blocks shown: input image, ROM, multiplexer (MUX), transform module, and an N/2*N/2 RAM; output labels: DWT coefficients LL/LLLL, LH/LLLH, HL/LLHL, and HH/LLHH)

3.1 Read Only Memory (ROM)

The ROM is the input section, which stores the image data. The input image is taken from MATLAB. The corresponding coefficients of the image are converted into their binary equivalents, and these equivalents are stored in the ROM.

In general, a ROM holds programs and data permanently, even when the computer is switched off. Data can be read by the CPU in any order, so a ROM is also a direct-access memory. The contents of a ROM are fixed at the time of manufacture.

3.2 Transform Module

This section deals with the transform module, which is the heart of the block diagram. It performs the DWT in two stages, namely polyphase decomposition (stage 1) and coefficient folding (stage 2).

Figure 3. Transform module (stage 1: polyphase decomposition technique; stage 2: coefficient folding technique)

The transform module decomposes the input image into the four subbands LL, LH, HL, and HH, and saves the LL band to the RAM module. After the first-level decomposition is finished, the multiplexer selects data from the RAM module. The LL band is then sent into the transform module to perform the second-level decomposition. The transform module decomposes the LL band into the four subbands LLLL, LLLH, LLHL, and LLHH, and saves the LLLL band to the RAM module. After the second-level decomposition is finished, the multiplexer again selects data from the RAM module. This procedure repeats until the desired level (i.e., the last level) is finished.

The decimation filter can be implemented directly by a filter followed by a twofold decimator. However, the decimator discards one sample out of every two samples at the filter output, causing poor hardware utilization. Hence, we employ two different design techniques to enhance its performance.
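As an illustration only, equations (1)-(4) and the level-by-level feedback through the RAM module can be sketched in software. This is a behavioural sketch in Python, not the hardware design: the function names `dwt2d_level` and `dwt2d` are hypothetical, and samples with negative indices are taken as zero, which is an assumption because the paper does not state how image borders are handled.

```python
def dwt2d_level(x, g, h):
    """One decomposition level following equations (1)-(4).

    x is a square image (list of N rows of N values, N even);
    g and h are the low-pass and high-pass taps (same length K).
    ASSUMPTION: samples with a negative index are treated as zero.
    """
    half = len(x) // 2
    k = len(g)

    def sample(r, c):
        return x[r][c] if r >= 0 and c >= 0 else 0.0

    def band(f1, f2):
        # f1 filters along the first index, f2 along the second index
        return [[sum(f1[i1] * f2[i2] * sample(2 * n1 - i1, 2 * n2 - i2)
                     for i1 in range(k) for i2 in range(k))
                 for n2 in range(half)]
                for n1 in range(half)]

    # LL, LH, HL, HH as in equations (1), (2), (3), (4)
    return band(g, g), band(g, h), band(h, g), band(h, h)


def dwt2d(image, g, h, levels):
    """Multi-level transform: the LL band is stored and fed back each level."""
    ram = image            # the multiplexer first selects the input image
    details = []
    for _ in range(levels):
        ll, lh, hl, hh = dwt2d_level(ram, g, h)
        details.append((lh, hl, hh))
        ram = ll           # LL is saved to RAM and selected for the next level
    return ram, details
```

For example, `dwt2d(image, g, h, 2)` on a 4*4 image returns a 1*1 LLLL band together with the detail subbands of both levels, mirroring the sizes N/2*N/2 and N/4*N/4 given in Section 2.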
3.3 Polyphase Decomposition Technique

A compact way of designing filter banks is polyphase-domain analysis. This methodology decomposes signals and filters into their polyphase components.

The standard decimation method:

x(n) -> H(z) -> y(m)

Figure 4. A Basic Decimator

Sampling-rate conversion is the process of converting data sampled at one rate (fs1) to data sampled at another rate (fs2). If fs1 > fs2, the process is called decimation. The function of a decimator is to take data that was sampled at one rate and change it to new data sampled at a lower rate. The data must be modified in such a way that, when it is sampled at the lower rate, the original signal is preserved.

Block Diagram of Polyphase Decomposition Technique

The first technique is the polyphase decomposition technique, illustrated in Figure 5, which decomposes the filter coefficients into even-ordered and odd-ordered parts. In the even clock cycles, the input data are fed to the odd part and multiplied by the odd-ordered coefficients. In the odd clock cycles, the input data are fed to the even part and multiplied by the even-ordered coefficients. The output data are the sum of the odd and even parts. After employing the polyphase decomposition technique, the internal clock rate is half the input clock rate. Therefore, we can double the input clock rate to increase the throughput. When the quantity of data to be processed is the same, the computing time is reduced to half. We use the symbol T/2 to represent the polyphase decomposition technique.

Figure 5. Block diagram of polyphase decomposition

3.4 Data Flow of Decimation Filter Employing Polyphase Decomposition

Table 1. Data flow of decimation filter in polyphase decomposition

CLK  SW  IN    ODD             EVEN            OUT
0    0   x(0)  a1x(0)          -               -
1    1   x(1)  -               a0x(1)          a0x(1)+a1x(0)
2    0   x(2)  a1x(2)+a3x(0)   -               -
3    1   x(3)  -               a0x(3)+a2x(1)   a0x(3)+a1x(2)+a2x(1)+a3x(0)
4    0   x(4)  a1x(4)+a3x(2)   -               -
5    1   x(5)  -               a0x(5)+a2x(3)   a0x(5)+a1x(4)+a2x(3)+a3x(2)
6    0   x(6)  a1x(6)+a3x(4)   -               -
7    1   x(7)  -               a0x(7)+a2x(5)   a0x(7)+a1x(6)+a2x(5)+a3x(4)
8    0   x(8)  a1x(8)+a3x(6)   -               -
9    1   x(9)  -               a0x(9)+a2x(7)   a0x(9)+a1x(8)+a2x(7)+a3x(6)
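The data flow of Table 1 can be checked with a short behavioural model. The sketch below is a software illustration under stated assumptions, not the Verilog design: it assumes a 4-tap filter with coefficients a0..a3, zero initial state, and hypothetical function names. `direct_decimator` filters at the full rate and then discards every second output, while `polyphase_decimator` follows the even/odd switching of Table 1 clock by clock.

```python
def direct_decimator(x, a):
    """Filter every sample, then keep only the outputs at odd n.

    Half of the computed filter outputs are thrown away, which is the
    poor hardware utilization that polyphase decomposition avoids.
    """
    full = [sum(a[i] * x[n - i] for i in range(len(a)) if n - i >= 0)
            for n in range(len(x))]
    return full[1::2]


def polyphase_decimator(x, a):
    """Clock-by-clock model of Table 1 for a 4-tap filter (a0, a1, a2, a3)."""
    a0, a1, a2, a3 = a
    last_even_clk_input = 0   # previous sample seen by the odd part
    last_odd_clk_input = 0    # previous sample seen by the even part
    odd_part = 0
    out = []
    for clk, sample in enumerate(x):
        if clk % 2 == 0:
            # SW = 0: even clock, input goes to the odd part (a1, a3)
            odd_part = a1 * sample + a3 * last_even_clk_input
            last_even_clk_input = sample
        else:
            # SW = 1: odd clock, input goes to the even part (a0, a2)
            even_part = a0 * sample + a2 * last_odd_clk_input
            last_odd_clk_input = sample
            out.append(even_part + odd_part)
    return out
```

In this model each input sample is multiplied by only two coefficients, and every product contributes to an output, whereas the direct form multiplies each sample by all four coefficients and discards half of the results.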
3.5 Coefficient Folding Technique

In synthesizing architectures, it is important to minimize the silicon area of the integrated circuits, which is achieved by reducing the number of functional units (such as adders and multipliers), registers, multiplexers, and interconnection wires. The folding transformation is used to systematically determine the control circuits in an architecture where multiple algorithm operations (such as addition operations) are time-multiplexed onto a single functional unit (such as a pipelined adder). By executing multiple algorithm operations on a single functional unit, the number of functional units in the implementation is reduced, resulting in an integrated circuit with low silicon area. Figure 6 shows the schematic diagram of the coefficient folding technique.
The second technique is the coefficient folding technique. As illustrated in Figure 6, every two coefficients share one set of a multiplier, adder, and register. The switches control the data path. Consider PE0 first. In clock cycle 0, the input data x(0) is multiplied by the coefficient a1 and added to the content of R1 (initially zero). The result a1x(0) is then stored in R0. In clock cycle 1, the input data x(1) is multiplied by the coefficient a0 and added to the content of R0. The result a0x(1)+a1x(0) is then output. In clock cycle 2, the input data x(2) is multiplied by the coefficient a1 and added to the content of R1. The result a1x(2)+a2x(1)+a3x(0) is then stored in R0. In clock cycle 3, the input data x(3) is multiplied by the coefficient a0 and added to the content of R0. The result a0x(3)+a1x(2)+a2x(1)+a3x(0) is then output. The following clock cycles are arranged in an analogous manner. The operation of PE1 is similar to that of PE0. Because every two coefficients share one set of a multiplier, adder, and register, this technique reduces the area cost to approximately half. We use the symbol A/2 to represent the coefficient folding technique.

Figure 6. Schematic diagram of coefficient folding technique

Design strategy of the transform module (T/2: polyphase decomposition; A/2: coefficient folding)

We now apply these two design techniques to the decimation filters of stages 1 and 2. We find that if we apply the polyphase decomposition technique to stage 1 and the coefficient folding technique to stage 2, the area and time costs are the same, a and t/2, in both stages. Thus, the total area cost is 2a and the total time cost is t/2. The AT product is reduced from 3at to at, and no hardware is idle in stage 2.

In contrast, the other design methods listed in Table 2 cause hardware to be idle in stage 2. Hence, they are not efficient design schemes. In stage 2 of the transform module, because the image data are fed in raster-scan mode, each coefficient requires a line delay to store the row data for vertical filtering.

Table 2. Design strategy for transform module

Method              Stage 1      Stage 2      Total   Total   Stage 2
Stage 1   Stage 2   A     T      A     T      area    T       idle
Original design     a     t      2a    t/2    3a      t       at
T/2       T/2       a     t/2    2a    t/4    3a      t/2     at/2
A/2       A/2       a/2   t      a     t/2    3a/2    t       at/2
T/2       A/2       a     t/2    a     t/2    2a      t/2     0
A/2       T/2       a/2   t      2a    t/4    5a/2    t       3at/2

A: area; T: time

Data flow of decimation filter employing coefficient folding technique

Table 3. Data flow of decimation filter employing coefficient folding technique

CLK  SW  IN    PE1             PE0                           OUT
0    0   X(0)  a3X(0)          a1X(0)                        -
1    1   X(1)  a2X(1)+a3X(0)   a0X(1)+a1X(0)                 a0X(1)+a1X(0)
2    0   X(2)  a3X(2)          a1X(2)+a2X(1)+a3X(0)          -
3    1   X(3)  a2X(3)+a3X(2)   a0X(3)+a1X(2)+a2X(1)+a3X(0)   a0X(3)+a1X(2)+a2X(1)+a3X(0)
4    0   X(4)  a3X(4)          a1X(4)+a2X(3)+a3X(2)          -
5    1   X(5)  a2X(5)+a3X(4)   a0X(5)+a1X(4)+a2X(3)+a3X(2)   a0X(5)+a1X(4)+a2X(3)+a3X(2)
6    0   X(6)  a3X(6)          a1X(6)+a2X(5)+a3X(4)          -
7    1   X(7)  a2X(7)+a3X(6)   a0X(7)+a1X(6)+a2X(5)+a3X(4)   a0X(7)+a1X(6)+a2X(5)+a3X(4)
8    0   X(8)  a3X(8)          a1X(8)+a2X(7)+a3X(6)          -
9    1   X(9)  a2X(9)+a3X(8)   a0X(9)+a1X(8)+a2X(7)+a3X(6)   a0X(9)+a1X(8)+a2X(7)+a3X(6)
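The register transfers of the coefficient folding technique can be reproduced with a small simulation. This is a sketch under assumptions rather than the actual circuit: it assumes a 4-tap filter, PE0 holding a0 and a1 with register R0, PE1 holding a2 and a3 with register R1, and both registers initially zero. The function name `folded_decimator` is hypothetical, and scalar inputs stand in for the row data handled in stage 2.

```python
def folded_decimator(x, a):
    """Behavioural model of two PEs, each sharing one multiplier and adder.

    Even clock (SW = 0): PE0 computes a1*x + R1 and stores it in R0,
                         PE1 computes a3*x and stores it in R1.
    Odd clock  (SW = 1): PE0 computes a0*x + R0 and outputs it,
                         PE1 computes a2*x + R1 and stores it in R1.
    """
    a0, a1, a2, a3 = a
    r0 = 0
    r1 = 0
    out = []
    for clk, sample in enumerate(x):
        if clk % 2 == 0:
            # both PEs read the old register contents before updating
            r0, r1 = a1 * sample + r1, a3 * sample
        else:
            out.append(a0 * sample + r0)
            r1 = a2 * sample + r1
    return out
```

Each PE performs one multiplication per clock cycle, so two multipliers serve four coefficients, which is the halving of area denoted by A/2.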
The data flow is shown in Table 3, where x(n) represents the nth row of data. Every two input rows generate one output row. In Figure 6, the low-pass and high-pass filters are assumed to have four taps. The transform module performs both the polyphase decomposition and the coefficient folding techniques. The output frequency of each stage is one quarter of its input frequency. Consider a decomposition that starts with an 8*8 block at the first level and ends with four 1*1 pixels at the third level. The first, second, and third levels of decomposition then take 64, 16, and 4 clock cycles, respectively. Because of its regular structure, the proposed architecture can be scaled with the filter length and the 2-D DWT level.
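The cycle counts quoted for the 8*8 example follow from the input size of each level. The short sketch below assumes one input sample per clock cycle, which is an interpretation of the 64, 16, and 4 figures rather than something stated explicitly; the function name is hypothetical.

```python
def cycles_per_level(n, levels):
    """Clock cycles per level for an n x n image, one input sample per cycle.

    Level j (counting from 0) processes an (n / 2**j) x (n / 2**j) block.
    """
    return [(n >> j) ** 2 for j in range(levels)]


counts = cycles_per_level(8, 3)   # [64, 16, 4]
total = sum(counts)               # 84 cycles for three levels
```

Each level costs one quarter of the previous one, so under this assumption the total for any number of levels stays below 4/3 of the first-level cost.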
Figure 7. Block diagram of coefficient folding

4. Performance of the Architecture

4.1 Hardware Implementation

The performance of the proposed architecture is compared in terms of the number of multipliers, number of adders, storage size, computing time, control complexity, and hardware utilization. The computing time has been normalized to the internal clock rate. The input clock rate for the input pixels can be doubled. Every two input rows generate one output row. The coefficient folding technique has a lower access time than the polyphase decomposition technique.

Table 4. Hardware implementation output.

                              Stage 1:                  Stage 2:
                              polyphase decomposition   coefficient folding
Device utilization summary
  IOBs                        13%                       20%
  LUT                         2%                        1%
  Load                        20%                       28%
Timing report
  (min period, delay, offset;
  values as listed)           2.926 ns, 2.926 ns        2.725 ns, 2.725 ns
Memory usage
  Total                       150752 KB                 153376 KB

5. Conclusion

Many 2-D DWT architectures have been proposed to meet the requirements of real-time processing. However, the hardware utilization of these architectures needs to be further improved. Therefore, this paper proposes an efficient architecture for the 2-D DWT. The proposed architecture has been verified using the Verilog Hardware Description Language (Verilog HDL). The advantages of the proposed architecture are 100% hardware utilization, fast computing time, regular data flow, and low control complexity. These advantages make the design suitable for image compression systems based on JPEG-2000. Future work includes the implementation of an efficient architecture for the DWT using lifting schemes.

References

[1] Chakrabarti, C. and Mumford, C. 1996. Efficient realizations of analysis and synthesis filters based on the 2D-Discrete wavelet transform, in Proc. IEEE ICASSP, pp. 3256-3255.
[2] Chakrabarti, C. and Vishwanath, M. 1996. Architectures for wavelet transforms: A survey, VLSI Signal Processing, Vol. 14, pp. 171-192.
[3] Christopoulos, C., Askelof, J. and Larsson, M. 2000. Efficient methods for encoding regions of interest in the upcoming JPEG 2000 still image coding standard, IEEE Signal Processing Letters, pp. 247-249.
[4] Gab Cheon Jung, Duk Young Jin, and Seong Mo Park, 2004. An efficient line based VLSI architecture for 2D lifting DWT.
[5] Parhi, K.K. and Nishitani, T. 1993. VLSI architectures for discrete wavelet transforms, IEEE Trans. VLSI Systems, Vol. 1, pp. 191-202.
[6] Philip P. Dang and Paul M. Chau, 2001. A high performance, low power VLSI design of DWT for lossless compression in JPEG 2000 standard.
[7] Pingping Yu, Suying Yao, and Jiangtao Xu, 2009. An efficient architecture for 2D lifting based discrete wavelet transform.
[8] Po-Cheng Wu and Liang-Gee Chen. An efficient architecture for two-dimensional discrete wavelet transform, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11.
[9] Powell, S.R. and Chau, P.M. 1992. Reduced complexity programmable FIR filters, IEEE International Symposium on Circuits and Systems (ISCAS), pp. 561-564.
[10] Qionghqi Dai, Xinjian Chen, and Chuang Lin, 2004. A novel VLSI architecture for multidimensional DWT, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 14, No. 8, pp. 1105-1110.
[11] Vishwanath, M., Owens, R.M. and Irwin, M.J. 1995. VLSI architectures for discrete wavelet transform, IEEE Trans. Circuits and Systems, Vol. 42.
[12] Parhi, K.K. VLSI Signal Processing Systems, Wiley Publication.
