
Design and Implementation of a 2D-DCT Architecture using Coefficient Distributed Arithmetic

Soumik Ghosh, Soujanya Venigalla, Magdy Bayoumi


Center for Advanced Computer Studies, University of Louisiana at Lafayette
P.O. Box 44330, Lafayette, LA 70504
E-mail : {sxg5317, sxv0476, mab}@cacs.louisiana.edu

Abstract

This paper describes the design and implementation of an 8×8 2D DCT chip for use in low-power applications. The design exploits a coefficient distributed arithmetic (CoDA) scheme, as opposed to the prevalent data distributed arithmetic (DDA) schemes, to achieve low power consumption. The architecture uses no ROMs and uses a minimum number of additions by exploiting the redundancy in the adder arrays. The described architecture for the CoDA scheme is implemented on an FPGA and has been fabricated on silicon. The fabricated chip computes the 8×8 2D DCT at 50 MHz while consuming around 137 mW of power.

1. Introduction

Mobile multimedia devices are becoming ubiquitous with the increased use of portable, wireless, battery-operated appliances that support digital video (DV) and imaging applications. At the heart of most DV encoders is the DCT, which performs a large volume of computation. Hence a low-power hardware solution for computing the DCT is crucial for extending the battery life of appliances using these DV encoders. Most hardware implementations of the DCT use distributed arithmetic architectures, but they inherently end up using ROMs of various sizes and performing a certain minimum number of multiplications. The CoDA scheme removes the use of ROMs and multiplications in DCT computation and exploits the redundancy in the binary DCT coefficients to use a minimum number of adders.

Section 2 of the paper discusses distributed arithmetic for the DCT and section 3 explains coefficient distributed arithmetic. Section 4 discusses the hardware architecture and section 5 reports the various simulation and implementation results. We conclude the paper in section 6.

2. Distributed Arithmetic

Distributed arithmetic is an efficient method, especially for mapping an inner product to hardware:

$$F = c^{T} x = \sum_{i=0}^{N-1} c_i x_i \tag{1}$$

where the $c_i$ are fixed coefficients and the $x_i$ are the input data words. For an 8x1 forward DCT, this can be written in 'sum of products' form as:

$$F_x = \sum_{i=0}^{7} C_{i,x} f_i \tag{2}$$

where $C_{i,x} = \frac{c(x)}{2} \cos\left(\frac{(2i+1)x\pi}{16}\right)$, $c(0) = 1/\sqrt{2}$, and $c(x) = 1$ for $x = 1, \dots, 7$.

Let $x_i$ now be represented as a B-bit 2's complement binary number with sign bit at position M and least significant bit at position N (so that B = M − N + 1):

$$x_i = -x_i^{(M)} 2^{M} + \sum_{j=N}^{M-1} x_i^{(j)} 2^{j} \tag{3}$$

Substituting the expansion (3) for the data words in (2), with $x_i = f_i$, we see that a data distributed arithmetic solution is possible with the coefficients pre-computed and stored in ROMs. Various techniques have been proposed to reduce the number of resulting additive and multiplicative (shifting) operations, but all of them assume a distributed data input. In the following section we show that a distributed coefficient approach lends itself to accurate DCT computation with fewer operations, resulting in lower power consumption.

3. Coefficient Distributed Arithmetic

Proceedings of the IEEE Computer Society Annual Symposium on VLSI


New Frontiers in VLSI Design
0-7695-2365-X/05 $20.00 © 2005 IEEE
Authorized licensed use limited to: MOTILAL NEHRU NATIONAL INSTITUTE OF TECHNOLOGY. Downloaded on July 24,2010 at 16:00:15 UTC from IEEE Xplore. Restrictions apply.
To analyze coefficient distributed arithmetic, we assume that the coefficients are represented with (M−N+1)-bit precision, where M is the position of the sign bit and N is the position of the least significant bit. Then (3) can be used to represent the binary coefficients as well as the data. Thus (2) can be rewritten as:

$$F_x = \sum_{i=0}^{7} \left( -a_i^{(M)} 2^{M} + \sum_{j=N}^{M-1} a_i^{(j)} 2^{j} \right) x_i \tag{4}$$

where $a_i^{(j)}$ is bit j of the coefficient $a_i = C_{i,x}$ and $x_i$ is the input data word, as in (3). Putting all this together and separating the variables, the whole equation can be written in matrix form as shown in (5) and (6). The matrix A is the adder matrix and contains the 12-bit coefficients, including one sign bit. The adder matrix maps to hardware as an adder tree that produces each of the $F_x^{(j)}$. In order to obtain one row of $F_x$ ($x = 0, \dots, 7$), corresponding to a 1 x 8 row from the input block, eight matrices ($A_0, \dots, A_7$) are required.

$$F_x = \begin{bmatrix} 2^{N} & 2^{N+1} & \cdots & 2^{M} \end{bmatrix}
\begin{bmatrix}
a_0^{(N)} & a_1^{(N)} & \cdots & a_7^{(N)} \\
a_0^{(N+1)} & a_1^{(N+1)} & \cdots & a_7^{(N+1)} \\
\vdots & \vdots & \cdots & \vdots \\
-a_0^{(M)} & -a_1^{(M)} & \cdots & -a_7^{(M)}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_7 \end{bmatrix} \tag{5}$$

$$F_x = \begin{bmatrix} 2^{N} & 2^{N+1} & \cdots & 2^{M} \end{bmatrix}
\begin{bmatrix} F_x^{(N)} \\ F_x^{(N+1)} \\ F_x^{(N+2)} \\ \vdots \\ \operatorname{inv}\left(F_x^{(M)}\right) \end{bmatrix} \tag{6}$$

The matrices can be obtained by evaluating the constant term in (2) and then representing it in binary format as in equation (3). If the precision of the coefficients is 12 bits, then the matrix $A_0$, which corresponds to the first constant term $c_0 = 0.3535534$ (sign bit 0, fractional bits 01011010100), can be represented as:

$$A_0 = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} \tag{7}$$

with the first row corresponding to the least significant bit and the last row to the sign bit. Thus for $F_0$ we have:

$$F_0^{(0)} = F_0^{(-1)} = F_0^{(-3)} = F_0^{(-6)} = F_0^{(-8)} = F_0^{(-10)} = F_0^{(-11)} = 0$$

which require no further calculation, while we have:

$$F_0^{(-2)} = F_0^{(-4)} = F_0^{(-5)} = F_0^{(-7)} = F_0^{(-9)} = x_0 + x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7$$

In this case all eight inputs are summed. For the rest of the cases we have only four inputs to add, and that can be achieved by a 2-stage adder tree. For example, for $(x_1 + x_3 + x_5 + x_7)$, the stage 1 adders form $(x_3 + x_5)$ and $(x_1 + x_7)$, and a stage 2 adder sums the two partial results. For each of the eight coefficients, DCT butterflies can be formed based on these 4-input, 2-stage adder computations. Table 1 shows the additions required in the 2-level butterfly structure. The illustrated 8-input addition for the first term is computed separately outside the adder tree.

Table 1. Addition combinations for the DCT butterfly structures.

Addition type       Index of inputs
2-input additions   01, 23, 45, 67, 06, 35, 24, 17, 25, 16, 34
4-input additions   0123, 4567, 0124, 0145, 0356, 0135, 0246, 1247,
                    2435, 1357, 0257, 0167, 1346, 1237, 3567, 1457,
                    0236, 1256, 0347, 2467, 2367, 0456

Figure 1 shows the eight butterfly structures for the DCT.
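
The bit-plane view of (4) to (7) can be sketched in software as follows. This is a minimal Python model under stated assumptions, not the authors' Verilog or the fabricated datapath: the function names (`dct_coeff`, `quantize`, `adder_matrix`, `coda_dct_1d`) are illustrative, and rounding each coefficient to 11 fractional bits is an assumption, since the text does not say how the 12-bit coefficients were quantized. The model builds the adder matrix for one output index, sums the inputs selected by each non-zero row, and combines the row sums with power-of-two weights only.

```python
import math

FRAC_BITS = 11              # assumed format: sign bit (weight 2^0) + 11 fractional bits
WORD_BITS = FRAC_BITS + 1   # 12-bit coefficients, as in the text


def dct_coeff(i, x):
    """C_{i,x} from equation (2)."""
    c = 1 / math.sqrt(2) if x == 0 else 1.0
    return 0.5 * c * math.cos((2 * i + 1) * x * math.pi / 16)


def quantize(coeff):
    """Signed fixed-point integer for a coefficient (rounding is an assumption)."""
    return round(coeff * (1 << FRAC_BITS))


def adder_matrix(x):
    """A_x as in (5) and (7): row j holds bit j of all eight coefficients, LSB row first."""
    mask = (1 << WORD_BITS) - 1
    words = [quantize(dct_coeff(i, x)) & mask for i in range(8)]  # two's complement
    return [[(w >> j) & 1 for w in words] for j in range(WORD_BITS)]


def coda_dct_1d(f, x):
    """F_x for integer inputs f[0..7], using only additions and power-of-two weights."""
    acc = 0
    for j, row in enumerate(adder_matrix(x)):
        if not any(row):
            continue                      # all-zero row: nothing to add
        partial = sum(fi for fi, bit in zip(f, row) if bit)  # row sum F_x^(j)
        if j == WORD_BITS - 1:
            partial = -partial            # sign-bit row carries negative weight
        acc += partial * (1 << j)         # a left shift by j in hardware
    return acc / (1 << FRAC_BITS)


if __name__ == "__main__":
    sample = [52, 55, 61, 66, 70, 61, 64, 73]   # hypothetical 8-pixel row
    for x in range(8):
        print(x, coda_dct_1d(sample, x))
```

For x = 0 the model skips seven of the twelve rows, which corresponds to the zero-valued $F_0^{(j)}$ terms listed above; the remaining five rows each reduce to the same 8-input sum, so that sum only needs to be computed once.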

Figure 1. Eight butterflies for the DCT (inputs x(0) to x(7), outputs F0 to F7).

4. Hardware implementation

This section discusses the hardware implementation of the CoDA architecture. A block diagram of the architecture is shown in figure 2.

Figure 2. Core processor block diagram

The inputs for the first 1D DCT module are eight 9-bit vectors for one row of the macroblock. After the 1D DCT, the output word is 14 bits wide and is passed on to the 8 x 8 word transposition RAM, where the output vector from the 1D DCT is transposed. After the second stage of applying the DCT (which completes the 2D DCT) we get the 12-bit outputs for each of the eight input vectors.

Figure 3 shows the architecture of the first stage 1D DCT module. The input stage consists of 9-bit adders, with the first stage latches being 10-bit latches. The next stage adders and latches are 10 and 11 bits wide, respectively.

In stage 1, which consists of the 9-bit adders, 12 adders are required to accommodate all the 2-input combinations for the different butterflies. In stage 2, which consists of the 10-bit adders, 22 adders are required to complete the 4-input additions. The next level consists of eight 4:2 compressor trees that compute the various $F_x^{(j)}$. In the final stage we use eight 13-bit adders to compute the final $F_x$.

Figure 3. Implemented 8 x 1 DCT architecture. In the second stage the datapath remains the same with changed bit sizes for the adders and latches

The output is moved to the transpose buffer, where we have to wait for eight clock cycles for one row to fill, after which the transposition begins. In the second part of the DCT (the second 8 x 1 DCT, which completes the column transform), the datapath structure is the same as in figure 3. However, the adders and latches are different. In stage one there are twelve 14-bit adders followed by 15-bit latches. Stage two has twenty-two 15-bit adders followed by 16-bit latches. The eight 4:2 compressor trees produce 12-bit outputs, which are finally added together to produce the final 8 x 8 transform.

5. Simulation and Results

The CoDA 8x8 2D DCT architecture was modeled in Verilog and simulated in ModelSim. Code for the architecture was also written in MATLAB. Binary images were used in MATLAB as test input vectors. The same test vectors were used to test the synthesized design and the fabricated chip. The first
implementation was done and tested on a Xilinx Spartan 3 XC3S200 FPGA on a Digilent board with 2 MB of image memory. The results are summarized in Table 2.

Table 2. Performance of the FPGA prototype of the architecture.

Parameter                 Value
Operating voltage         2.5 V
Clock frequency           50 MHz
Critical path delay       5.86 ns
Number of CLB slices      1221
Number of gates           18,590
Number of 4-input LUTs    2271
Number of flip-flops      616
Power for 1st 1D DCT      126 mW
Power for 2nd 1D DCT      155 mW

The small critical path delay implies that an HDTV frame of 720 x 1280 pixels can be encoded in about 84 µs. Taking the critical path delay as the time required per 8 x 8 block, this can be computed as follows:

Time required / frame = (Time required / block) × (Number of blocks / frame)
                      = 5.86 ns × ([720 x 1280] / [8 x 8]) = 84.38 µs

This value is far smaller than the 33.3 ms frame period required for 30 frames/sec video.

A standard-cell based synthesis and layout was done from the Verilog HDL code using the STMicroelectronics HCMOS9 0.12 µm technology. Their low-power, low-leakage library was used in the synthesis. The synthesis was done in Synopsys Design Vision and the place and route was done on the Cadence Encounter platform. As expected, the power consumption at this node was much lower. The circuit-level simulations were done with Nanosim and the results are reported in Table 3.

Table 3. Performance of the 0.12 µm standard-cell based implementation of the architecture.

Parameter                   Component              Value
Operating voltage                                  1.5 V
Clock frequency                                    50 MHz
Power for 1st 1D DCT        Internal cell power    7 mW
                            Switching power        5.45 mW
                            Total dynamic power    12.45 mW
                            Leakage power          3.44 µW
Power for 2nd 1D DCT        Internal cell power    10.1 mW
                            Switching power        8.2 mW
                            Total dynamic power    18.3 mW
                            Leakage power          18.3 µW
Total power (with memory)                          39.8 mW

A full-custom chip was fabricated on a 0.5 µm AMI process. This DCT core is to be integrated in a MEMS System-in-Package solution at a 0.5 µm CMOS-MEMS technology node. That was the rationale for choosing this technology to test the architecture and measure its performance parameters at this node. The performance values for the fabricated chip are in Table 4.

Table 4. Performance of the ASIC prototype of the architecture.

Parameter                 Value
Operating voltage         3.3 V
Clock frequency           50 MHz
Power for 1st 1D DCT      42 mW
Power for 2nd 1D DCT      68 mW
Memory power              24 mW
Total measured power      137 mW

Figure 4. DCT chip die microphotograph.

The fabricated chip is packaged in a 108-pin PGA ceramic package. Besides electrical verification, the chip was tested by interfacing the PGA test socket with the FPGA board and feeding the chip with binary image data from the on-board memory. The control logic for the test was implemented on the on-board FPGA. The layout work for the ASIC was done using the Magic tools and all simulations were done in HSPICE. Due to the large number of pins required, especially for the output, some of the outputs were serialized in the actual implementation. All circuits in the implementation are static CMOS. Further improvements in
performance can be obtained through enhanced circuit techniques and different logic styles.

6. Conclusions

A coefficient distributed arithmetic scheme is presented to compute a 2D 8 x 8 DCT. A hardware architecture to realize the proposed computation scheme is presented. The architecture requires no ROM or multipliers and is based on mapping the DCT coefficients into adder trees. By mapping the coefficients into adder trees whose computations result in non-zero values, we do away with operations whose results are zero. Consequently we have fewer computations for the same output and hence reduced power consumption.

The architecture has been prototyped on a Xilinx Spartan 3 XC3S200 FPGA and an average power consumption of 300 mW / HDTV frame was found. The system has been synthesized with a commercial 0.12 µm standard-cell library. An average power consumption of 40 mW / HDTV frame was reported by Nanosim.

A full-custom ASIC chip for the architecture was fabricated in 0.5 micron technology to verify the merits of the architecture. An average power consumption of 137 mW / HDTV frame was found for the fabricated chip.

Various circuit techniques can be used to further reduce power consumption in the chip.

7. Acknowledgements

The authors acknowledge the support of the United States Department of Energy (DoE) EETAPP program (DE-97ER12220) and the Governor's Information Technology Initiative. We would like to thank MOSIS for fabricating our chip.
