3, APRIL 2010
Abstract: This paper presents a high-throughput, cost-effective implementation of six different integer transforms in
H.264/AVC high-profile coders, i.e., the 4×4 forward, 4×4 inverse,
forward Hadamard, inverse Hadamard, 8×8 forward, and 8×8
inverse transforms, all integrated as shared hardware. The 4×4
transform matrices are regularized by using permutation, partitioned into 2×2 blocks, and factored for maximal hardware
sharing. By using the two types of 4×4 transform matrices included
in an 8×8 transform matrix, the two different 8×8 transforms are
both described as three steps and unified with minor modification.
To improve the throughput of the transform, two independent 4×4
transform blocks within the 8×8 transform block operate in
parallel in the 4×4 transform mode, while a two-stage pipelined
architecture is used in the 8×8 transform mode. Using 0.18-μm CMOS technology, the maximum operating frequency of the
proposed multitransform architecture is 200 MHz, which achieves
a throughput of 4.1 Gpixels/sec at a hardware cost of 63618
gates. Compared with existing designs, the proposed design delivers
at least 54% higher throughput and a 38% higher throughput/area
ratio in adaptive block-size transform (ABT) mode.
TABLE I
THROUGHPUT REQUIREMENT FOR VARIOUS VIDEO SIZES. A TOTAL OF 50 frames WITH 4:2:0 YUV FORMAT, QP = 24, AND A FRAME RATE OF 50 frames/sec ARE USED
I. INTRODUCTION
H.264/AVC is the state-of-the-art video coding standard, achieving
significant improvement in video compression performance [1]. To quickly compress video data in the spatial domain, H.264/AVC employs 4×4 integer transforms, which use
only integer arithmetic without any multiplications, with coefficients that allow 16-bit arithmetic computation [2]. A small
block-size transform tends to reduce the computational complexity and ringing artifacts. However, for high-quality video,
a large block-size transform must be used not only to preserve
fine details of the image but also to obtain better energy compaction [3]. The High profile in the H.264/AVC Fidelity Range Extensions (FRExt) [4], an amendment to the H.264
standard, includes an 8×8 integer transform and allows the encoder to adaptively choose between the 4×4 and 8×8 transforms
for luma samples on a macroblock (MB) level, which is called adaptive
block-size transform (ABT).
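As an illustration of the multiplication-free integer arithmetic mentioned above, the following minimal Python sketch applies the well-known 4×4 forward core transform of H.264/AVC [2], W = Cf X Cf^T, to a residual block. The names `CF`, `matmul`, `transpose`, and `forward_4x4` are illustrative assumptions and are not taken from the proposed architecture; the post-scaling that H.264/AVC folds into quantization is omitted.

```python
# Core matrix of the H.264/AVC 4x4 forward integer transform (see [2]).
# All entries are +-1 or +-2, so hardware needs only adders and 1-bit shifts.
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matmul(a, b):
    """Multiply two integer matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def forward_4x4(x):
    """Unscaled 2-D forward core transform: CF * x * CF^T (assumed helper name)."""
    return matmul(matmul(CF, x), transpose(CF))


if __name__ == "__main__":
    flat_block = [[1] * 4 for _ in range(4)]
    # A constant block produces a single nonzero (dc) coefficient.
    print(forward_4x4(flat_block))
```

Because every intermediate value is an integer, the result is bit-exact on any platform, which is the property that lets the encoder and decoder stay in lockstep.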
Manuscript received February 22, 2009; revised November 05, 2009. First
published January 26, 2010; current version published March 17, 2010. This
work was supported by the National Research Foundation of Korea (NRF) grant
funded by the Korea government (MEST) (No.2009-0080188). The associate
editor coordinating the review of this manuscript and approving it for publication was Dr. Ketan Mayer-Patel.
The authors are with the Department of Electrical Engineering, KAIST, Daejeon 305-701, Korea (e-mail: woonghb@vslab.kaist.ac.kr; kyung@ee.kaist.ac.kr).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2010.2041099
[Fig. 1. (a) 4×4 transforms. (b) 8×8 transforms.]
flow in Fig. 1(b). For chroma residual input, the H.264/AVC encoder performs the 4×4 transform flow only. There are four types
of 4×4 transform, i.e., the forward, inverse, forward Hadamard,
and inverse Hadamard transforms, and two types of 8×8 transform,
i.e., the forward and inverse transforms. This paper describes how
the 4×4 and 8×8 transforms of the H.264/AVC encoder can
be modified so that they are implemented as one hardware
block by maximally sharing common operations, while satisfying the throughput requirement of real-time processing and
reducing hardware cost.
For early-stage H.264/AVC profiles such as the baseline or main profile, researchers mainly focused on developing fast algorithms
for the 4×4 transforms [5] and on implementations that improve performance with minimal area overhead [6]–[11]. With the advent
of the H.264/AVC high profile, implementing 8×8 transforms and
unifying the 8×8 and 4×4 transforms have become very important.
A fast 8×8 transform algorithm using the Kronecker product and
direct sum is described in [12]. Hardware architectures shared
between the 8×8 and 4×4 transforms are described in [13]–[15].
In [15], a transform architecture to support RDO mode decision
is also proposed. A unified architecture of the forward and inverse transforms is presented in [16]. Moreover, some architectures to support multistandard video applications with adaptive
block-size transform (8×8 and 4×4) are proposed in [17] and
[18]. However, the throughput of these architectures is
not sufficient to satisfy the real-time requirement of the unified
transform in an HD 2160p system. Only the proposed architecture satisfies the requirement of the HD 2160p system, as will be
shown in Section VI.
The rest of this paper is organized as follows. In Section II, we
briefly review the equations of the four different 4×4 integer transforms and the 8×8 integer
transforms. The proposed 4×4 transform algorithm
and implementation are described in Section III. In Section IV,
we present the 8×8 transform algorithm including 4×4 transforms. The unified multitransform architecture (MTA) supporting
all six kinds of integer transforms is described in Section V.
Section VI discusses the synthesis results and evaluation
in comparison with previous works, followed by conclusions in
Section VII.
II. INTEGER TRANSFORM ALGORITHMS
A. 4×4 Integer Transforms
(2)
(4)
B. 8×8 Integer Transforms
(6)
A. 4×4 Inverse Transform
Authorized licensd use limted to: IE Xplore. Downlade on May 13,20 at 1:509 UTC from IE Xplore. Restricon aply.
(7)
by
and , respectively, and
Pre- and post-multiplying
into 2×2 blocks, it follows that
partitioning
(8)
(9)
Fig. 2. Block diagram of the proposed inverse transform consisting of six steps.
where
(10)
It is to be noted that
and
satisfies
(
is the 4×4 identity matrix). If we pre-multiply
by
and
post-multiply it by , the result becomes
intuitively, i.e.,
(11)
can then be factored as follows:
(12)
where
(19)
(13)
is the 2
Matrix
block multiplication
3) Step4:
Partitioning
into 2×2 blocks
, we
is obtained through block multiplication as
compute
follows:
(14)
Because
product of
, the matrix
and :
(20)
(15)
Substituting (12) and (15) into (11), we obtain
(16)
B. 4×4 Forward Transform
In (2),
can be expressed by
as follows:
as are
(21)
(17)
Since
is the symmetric matrix satisfying
, and
, it follows that
where
(22)
(18)
D. 4
(24)
Fig. 3 shows the sequence of the proposed forward transform.
Similar to the inverse transform, the forward transform is carried
out in six steps. As Step2, 3, 4, and 5 in Fig. 3
are the same as Step4, 3, 2, and 5 in Fig. 2, respectively, we
can reuse them as common blocks when integrating the 4×4
forward and inverse transforms. Like the other permutations, Step1
is implemented as mere hard-wired interconnection. In Step6,
the matrix in (22) is the same as the 4×4 identity matrix
except for a scaling factor of 2, which is simply a left-shift operation.
Thus, Step6 is also implemented as hard-wired interconnection,
which will be shown in the next subsection.
C. 4×4 Hadamard Transform
(26)
(27)
Since (26) and (27) have the same equation form as the inverse transform (18) except that
is used instead of , the
Hadamard transforms can be carried out by the same procedure
as the inverse transform with a minor modification.
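To make the shared equation form concrete, the sketch below applies the standard 4×4 Hadamard matrix of H.264/AVC in the same two-sided form as the other 4×4 transforms. It only illustrates the matrix form under stated assumptions: the names `H4`, `matmul`, and `hadamard_4x4` are hypothetical, the scaling and rounding that the standard applies to the luma dc coefficients are omitted, and the proposed six-step factorization is not reproduced.

```python
# 4x4 Hadamard matrix used by H.264/AVC for luma dc coefficients.
H4 = [
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
    [1, -1,  1, -1],
]


def matmul(a, b):
    """Multiply two integer matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def hadamard_4x4(x):
    """Unscaled two-sided Hadamard transform H4 * x * H4 (H4 is symmetric)."""
    return matmul(matmul(H4, x), H4)


if __name__ == "__main__":
    block = [[4 * i + j for j in range(4)] for i in range(4)]
    # Applying the unscaled transform twice returns the input scaled by 16.
    print(hadamard_4x4(hadamard_4x4(block)))
```

Since H4 is symmetric and H4·H4 equals four times the identity, the same routine serves the forward and inverse directions up to a scale factor, which is one reason a single datapath can cover both Hadamard transforms.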
D. 4×4 Multitransform Architecture
Fig. 4(a) shows the sequences of the four different 4×4 transforms based on the proposed algorithm. There is a common
sequence among the four transforms, i.e., from Step2 to Step5 in
Fig. 4(a), which is merged into a 4×4 MTA core as shown in
Fig. 4(b). The 4×4 MTA core is designed to process a 4×4
block within two clock cycles. Execution in the odd and even clock
cycles is named Phase1 and Phase2, respectively. In Phase1,
Step2 and 3 are performed, followed by Step4 and 5 in Phase2.
A feedback path for the two-phase implementation is enclosed
within the 4×4 MTA core.
Two different block multiplications in Step2 and Step4 in Fig. 4(a) can be merged into one block
(the block multiplication block in the 4×4 MTA core) as
they do not occur simultaneously. Likewise, the P permutation
processes [Step3 and Step5 in Fig. 4(a)] are merged into one
permutation block (the P permutation block in the 4×4 MTA
core). The remaining blocks [Step1 and Step6 in Fig. 4(a)] are
merged into the input and output interconnection blocks as
shown in Fig. 4(b).
The proposed 4×4 MTA core is shown in Fig. 5.
This architecture is composed of four processing elements
(PE), 16 multiplexers, a P permutation block, and four register
blocks.
1) Sixteen multiplexers between the input ports
and the four PEs determine the input to the PEs according to the
phase. In Phase1, the multiplexer controller (MC) selects
the input ports as the input of the PEs. In Phase2,
the MC selects the output ports as the input
of the PEs through the feedback path.
2) Four processing elements are used
to calculate the block multiplications, which are Step2 and Step4 in Fig. 4(a). Each
PE is composed of two-stage butterfly adders with shift
operation as illustrated in Fig. 6. The PEs operate differently
according to the phase and transform type. In Phase1, the
multiplexer controller (MC) in Fig. 6 selects input 0 for
the forward transform and input 1 for the inverse transform. In Phase2, the MC selects input 1 for the forward
transform and input 0 for the inverse transform. On the
other hand, the MC always selects input 0 when the transform type is the forward Hadamard or inverse Hadamard
transform, regardless of the phase. Thus, the PEs compute one
of four Step2 operations in Fig. 6(a) in Phase1, and one of
four Step4 operations in Phase2. It is to be noted that, as
Fig. 6(a) is also the 2×2 Hadamard transform for chroma dc
components, it can be implemented as a part of the 4×4
transform.
3) The P permutation block uses a wiring network to implement Step3 and 5.
4) Four register blocks temporarily store the result of the P permutation.
In Phase1, the stored data enters the PEs again along
the feedback path, while the data enters the output interconnection block in Phase2.
To perform all six steps for a transform, appropriate input and
output (I/O) interconnection needs to be done depending on the
type of transform. Fig. 7(a) shows the complete 4×4 multitransform architecture including the I/O interconnection blocks.
The input interconnection block is composed of four permutation blocks and one multiplexer to choose an appropriate input to
be processed. The output interconnection block is composed of
three permutation blocks and a multiplication
block. As the matrix in (22) is a scaling matrix without permutation,
the implementation of the multiplication is like
Fig. 7(b).
Fig. 5. First-level details of the proposed MTA core for performing input multiplexing, block multiplication, and P permutation. Step2 and 4 are merged, as
are Step3 and 5. MC denotes the multiplexer controller. Because 16 output coefficients are output every two cycles, the processing rate is eight pixels/cycle.
and
(30)
IV. 8
8 Inverse Transform
, we obtain
(31)
Then, we can rewrite the 8
transform is a
Extended transform [19] means that the
transform. Taking 4 4 and 8 8 integer
part of the
transform in H.264/AVC as an example, the relation between
them can be described as
(28)
where
(33)
Fig. 8 shows the sequence of the proposed 8×8 inverse transform. The 8×8 inverse transform is carried out by the following
three steps:
1) Permutation:
As permutation means reordering elements in an 8×8
block, this step is implemented as hard-wired interconnection, i.e., without any arithmetic logic.
(29)
2) Transform:
Partitioning into 4×4 blocks, we compute four
different kinds of 4×4 transforms:
(34)
The first component
is exactly the same as the 4×4 inverse transform. Therefore,
we can reuse the 4×4 MTA to compute it.
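The reuse of a 4×4 transform inside the 8×8 transform can be checked numerically on the standard matrices. The sketch below assumes the usual H.264/AVC 8×8 integer transform matrix (`T8`) and the 4×4 inverse transform matrix scaled by 8 to keep everything integer (`CI8`); both names, and the helper functions, are illustrative. It verifies only the even/odd symmetry property, i.e., that the even-indexed outputs of the 1-D 8-point transform come from a 4-point transform of folded inputs. It does not reproduce the permutation and block partitioning of the proposed algorithm.

```python
# Standard H.264/AVC 8x8 integer transform matrix (1/8 scaling omitted).
T8 = [
    [8,   8,   8,   8,   8,   8,   8,   8],
    [12, 10,   6,   3,  -3,  -6, -10, -12],
    [8,   4,  -4,  -8,  -8,  -4,   4,   8],
    [10, -3, -12,  -6,   6,  12,   3, -10],
    [8,  -8,  -8,   8,   8,  -8,  -8,   8],
    [6, -12,   3,  10, -10,  -3,  12,  -6],
    [4,  -8,   8,  -4,  -4,   8,  -8,   4],
    [3,  -6,  10, -12,  12, -10,   6,  -3],
]

# 4x4 inverse transform matrix of H.264/AVC multiplied by 8
# (rows [1,1,1,1], [1,1/2,-1/2,-1], [1,-1,-1,1], [1/2,-1,1,-1/2]).
CI8 = [
    [8,  8,  8,  8],
    [8,  4, -4, -8],
    [8, -8, -8,  8],
    [4, -8,  8, -4],
]


def transform_8(x):
    """Direct 1-D 8-point transform: T8 times the vector x."""
    return [sum(c * v for c, v in zip(row, x)) for row in T8]


def even_part_via_4x4(x):
    """Even-indexed outputs computed with the 4x4 matrix on folded inputs."""
    folded = [x[i] + x[7 - i] for i in range(4)]
    return [sum(c * v for c, v in zip(row, folded)) for row in CI8]


if __name__ == "__main__":
    sample = [3, -1, 4, 1, -5, 9, 2, -6]
    print(transform_8(sample)[0::2] == even_part_via_4x4(sample))
```

The check works because the even rows of `T8` are mirror-symmetric, so each pair of inputs `x[i]` and `x[7 - i]` shares one coefficient.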
2 4 inverse transform.
butterfly unit.
(35)
(36)
Fig. 11. (a) Signal flow of B block multiplication. (b) Two-cycle implementation.
to further improve the throughput of 4×4 transforms.
Moreover, the 4×4 forward transform is also merged into
Fig. 10(a), which will be described in the next subsection.
3) Block Multiplication:
Partitioning into 4×4 blocks yields
(39)
(37)
Each of the three 4×4 transforms can be computed
by applying the one-dimensional (1-D) transform twice.
Taking one of these transforms as an example, Fig. 9(a) shows its direct implementation, and
Fig. 9(b) shows the corresponding butterfly unit. As a butterfly
unit can process four pixels at a time, four butterfly
units are needed to process a 4×4 block. By sharing
the 1-D transform unit and the transpose register, we obtain
the two-cycle implementation of the transform shown
in Fig. 9(c). Likewise, the other two
transforms are implemented as shown in Fig. 10(a). As
they use both kinds of butterfly unit, a cross-feedback
path is enclosed in the transform block.
In Fig. 10(a), the shaded box with the dotted-line feedback
path indicates an additional 4×4 inverse transform block. If
, it follows that
we apply 4 4 inverse transform to
is the 4
(40)
Partitioning
also into 4×4 blocks, we obtain
through block multiplication as follows:
(38)
(41)
Fig. 11(a) shows the signal flow diagram and Fig. 11(b) shows its
two-cycle implementation. Input multiplexers, registers,
and feedback paths are used to share adders as shown in
Fig. 11(b).
2 4 forward transform.
C. 8×8 Forward Transform
The 8×8 forward transform can be expanded as follows,
using a process similar to that of the 8×8 inverse transform:
(42)
Fig. 12 shows the sequence of the proposed 8×8 forward transform. As Step1 in Fig. 12 is the same as Step3 in Fig. 8 except for the position of the transpose, Step1 can be implemented
by reusing the B multiplication block in Fig. 11(b). Step3 is
the permutation, which can be implemented as hard-wired interconnection. In Step2, we obtain the following four different kinds
of 4×4 transforms by applying 4×4 block partitioning:
(43)
As
is equal to
, the first component
can be
expanded by the same procedure as the 4×4 transforms:
(44)
Equation (44) is the same as (24) in the 4×4 forward transform
except that
in (24) is removed in (44). Thus, we can reuse the
4×4 MTA to compute
by bypassing the
multiplication
block in Fig. 7.
transform block for computing
Fig. 13(a) shows the
and
. The
butterfly unit in Fig. 10(b) is replaced
butterfly unit as shown in Fig. 13(b). As
by
, the 4×4 forward transform matrix
is
and
equal to
, which is implemented by selecting multiplexer terminal 0 in Fig. 13(b). Thus, the 4×4 forward transform can also be implemented along the dotted-line feedback
path in Fig. 13(a).
To compute the 8×8 forward and inverse transforms in one
transform architecture, the two butterfly units are unified as
shown in Fig. 14. As it also includes the 4×4 forward transform
matrix, the transform block in Fig. 13(a) can process the 8×8
forward, 8×8 inverse, 4×4 forward, and 4×4 inverse
transforms by using the unified butterfly unit in Fig. 14.
V. MULTITRANSFORM ARCHITECTURE UNIFYING
8×8 AND 4×4 INTEGER TRANSFORMS
Fig. 15 shows the proposed MTA supporting six different
kinds of transforms for the H.264/AVC high profile encoder. The
MTA is composed of a B block multiplication block, four 4×4
transform blocks, two permutation blocks, and multiplexers.
The block multiplication, transform, and permutation
blocks are used only for the 8×8 transforms. The 4×4 MTA
and transform blocks are used for both 4×4 and 8×8
transforms.
TABLE II
SYNTHESIS RESULTS AND HARDWARE RESOURCE COMPARISON BETWEEN THE
SINGLE TRANSFORM AND MULTITRANSFORM DESIGN. EACH TRANSFORM HAS
THE SAME OPERATING FREQUENCY OF 200 MHz
FT, IT, FHT, and IHT denote the forward, inverse, forward Hadamard, and
inverse Hadamard transform, respectively. ABT denotes adaptive block-size
transform with 4
Fig. 17. Block diagram for functional verification of the proposed multitransform hardware using testbench from the JM reference software.
For performing the four 4×4 transforms (4×4 forward, 4×4
inverse, forward Hadamard, and inverse Hadamard), two 4×4
transform blocks, the 4×4 MTA and part of the transform
block, are used to double the throughput compared to using
only the 4×4 MTA. Such throughput allows the proposed MTA to
process the transforms of HD 2160p video (3840×2160 at 50
frames/sec) in real time; the throughput requirement is described in Table I and is further discussed in Section VI.
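For a sense of scale, the raw sample rate of the HD 2160p format quoted above can be worked out directly. This is only a lower-bound sketch: the function name and parameters are illustrative, and the throughput requirement in Table I may be higher, because an encoder may transform the same samples more than once (for example, during mode decision).

```python
# Luma-plus-chroma samples per luma sample, as (numerator, denominator).
SAMPLES_PER_PIXEL = {
    "4:2:0": (3, 2),  # two chroma planes, each subsampled 2x2
    "4:2:2": (2, 1),
    "4:4:4": (3, 1),
}


def raw_samples_per_second(width, height, fps, chroma_format="4:2:0"):
    """Raw video samples per second (hypothetical helper, integer arithmetic)."""
    num, den = SAMPLES_PER_PIXEL[chroma_format]
    return width * height * fps * num // den


if __name__ == "__main__":
    # HD 2160p at 50 frames/sec in 4:2:0 format.
    print(raw_samples_per_second(3840, 2160, 50))
```

For 3840×2160 at 50 frames/sec in 4:2:0 format this gives 622,080,000 samples/sec, i.e., about 0.62 Gsamples/sec.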
Unifying the 8×8 forward and inverse transforms is simple
because the three functional blocks in each transform are almost the
same, while only their sequences are reversed, as shown in Fig. 8
and Fig. 12. Multiplexers and feedback paths are used to unify
the 8×8 forward and inverse transforms as shown in Fig. 15, in
which the dotted-line paths are used when performing the
8×8 inverse transform.
Processing an 8×8 block using the MTA takes four clock
cycles because the B block multiplication takes two clock cycles
and the 4×4 transform block takes two clock cycles. However, by
applying two-stage pipelining to the 8×8 transforms as shown in
Fig. 16, the throughput can be doubled, i.e., one 8×8 block
every two clock cycles.
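The cycle counts above can be turned into a small timing model. The sketch assumes an ideal two-stage pipeline with two clock cycles per stage and no stalls; the constant and function names are illustrative and are not taken from the design.

```python
STAGE_CYCLES = 2  # block multiplication and 4x4 transform each take two cycles
NUM_STAGES = 2


def cycles_unpipelined(num_blocks):
    """Each 8x8 block occupies both stages before the next block starts."""
    return NUM_STAGES * STAGE_CYCLES * num_blocks


def cycles_pipelined(num_blocks):
    """Ideal two-stage pipeline: fill once, then one block per stage time."""
    if num_blocks == 0:
        return 0
    return STAGE_CYCLES * (num_blocks + NUM_STAGES - 1)


def steady_state_pixels_per_cycle():
    """One 8x8 block (64 pixels) completes every STAGE_CYCLES cycles."""
    return 64 // STAGE_CYCLES


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, cycles_unpipelined(n), cycles_pipelined(n))
```

For a single block both variants take four cycles, so pipelining does not reduce latency; the gain appears only for a stream of blocks, where the cycle count approaches half of the unpipelined figure.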
VI. IMPLEMENTATION AND RESULTS
A. Implementation and Verification
We have implemented the proposed multitransform design and verified its behavior using Verilog RTL simulation,
logic synthesis, and gate-level simulation. Fig. 17 shows the
simulation environment used to verify the functional behavior of
the proposed architecture. Test vectors are obtained by using
the H.264/AVC reference software, version JM 14.0. After extracting input and output data from the reference software, we
applied the input data to the proposed design and compared its
result with the output data from the reference software.
We synthesized the proposed multitransform design by using
Synopsys Design Compiler and UMC 0.18-μm Faraday stan
TABLE III
SYNTHESIS RESULTS AND COMPARISON OF THE PROPOSED MTA WITH OTHER REPORTED DESIGNS. ALL ARCHITECTURES ARE DESIGNED AS 2-D TRANSFORM.
DPR DENOTES DATA PROCESSING RATE AND MEANS THE NUMBER OF PIXELS TO BE PROCESSED EVERY CLOCK CYCLE. FT, IT, FHT, AND IHT DENOTE THE
FORWARD, INVERSE, FORWARD HADAMARD, AND INVERSE HADAMARD TRANSFORM, RESPECTIVELY
VII. CONCLUSION
We proposed a fast and cost-effective algorithm and implementation of the multitransform architecture in H.264/AVC encoders. Four different 4×4 transforms and two 8×8 transforms are integrated on shared hardware by using the extended
transform and block multiplication. Comparing the proposed
multitransform design with the best previous work, we obtained
54% higher throughput and 38% higher throughput/area.
REFERENCES
[1] N. Kamaci and Y. Altunbasak, "Performance comparison of the
emerging H.264 video coding standard with the existing standards," in
Proc. IEEE Int. Conf. Multimedia and Expo, Jul. 2003, pp. 345–348.
[2] H. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, "Low-complexity transform and quantization in H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 598–603, 2003.
[3] M. Wien, "Variable block-size transform for H.264/AVC," IEEE Trans.
Circuits Syst. Video Technol., vol. 13, no. 7, pp. 604–613, Jul. 2003.
[4] D. Marpe, T. Wiegand, and S. Gordon, "H.264/MPEG4-AVC fidelity range extensions: Tools, profiles, performance, and application
areas," in Proc. IEEE Int. Conf. Image Processing, Sep. 2005, pp.
I-593–I-596.
[5] C. P. Fan, "Fast 2-dimensional 4×4 forward integer transform implementation for H.264/AVC," IEEE Trans. Circuits Syst. II, vol. 53, no.
3, pp. 174–177, Mar. 2006.
[20] Advanced Video Coding for Generic Audiovisual Services, ITU-T Recommendation H.264, Std., 2007.
[21] Faraday UMC Standard Library. [Online]. Available: http://www.faraday-tech.com.
Woong Hwangbo received the B.S. degree in electrical engineering from Pusan National University,
Busan, Korea, and the M.S. degree in electrical
engineering from Korea Advanced Institute of
Science and Technology (KAIST), Daejeon, Korea.
He is currently pursuing the Ph.D. degree in the
Department of Electrical Engineering and Computer
Science at KAIST.
His research interests include VLSI design and
multimedia applications with high performance and
low power consumption.