
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 3, APRIL 2010
A Multitransform Architecture for H.264/AVC High-Profile Coders
Woong Hwangbo and Chong-Min Kyung, Fellow, IEEE
Abstract: This paper presents a high-throughput, cost-effective implementation of six different integer transforms in H.264/AVC high-profile coders, i.e., the 4×4 forward, 4×4 inverse, forward Hadamard, inverse Hadamard, 8×8 forward, and 8×8 inverse transforms, all integrated as shared hardware. The 4×4 transform matrices are regularized by using permutation, partitioned into 2×2 blocks, and factored for maximal hardware sharing. By using two types of 4×4 transform matrices included in an 8×8 transform matrix, the two different 8×8 transforms are both described as three steps and unified with minor modification. To improve the throughput of the transform, two independent 4×4 transform blocks within the 8×8 transform block operate in parallel in the 4×4 transform mode, while a two-stage pipelined architecture is used in the 8×8 transform mode. Using 0.18-µm CMOS technology, the maximum operating frequency of the proposed multitransform architecture is 200 MHz, which achieves a throughput of 4.1 Gpixels/sec at a hardware cost of 63618 gates. Compared with existing designs, the proposed design delivers at least 54% higher throughput and 38% higher throughput/area ratio in adaptive block-size transform (ABT) mode.
TABLE I
THROUGHPUT REQUIREMENT FOR VARIOUS VIDEO SIZES. A TOTAL OF 50 FRAMES IN 4:2:0 YUV FORMAT, QP = 24, AND A FRAME RATE OF 50 frames/sec ARE USED
Index Terms: DCT, H.264/AVC, Hadamard transform, IDCT, integer transform, VLSI design.
I. INTRODUCTION
H.264/AVC is the state-of-the-art video coding standard, achieving a significant improvement in video compression performance [1]. To compress video data quickly in the spatial domain, H.264/AVC employs 4×4 integer transforms, which use only integer arithmetic without any multiplications, with coefficients that allow 16-bit arithmetic computation [2]. A small block-size transform tends to reduce computational complexity and ringing artifacts. However, for high-quality video, a large block-size transform must be used not only to preserve fine details of the image but also to obtain better energy compaction [3]. The high profile in the H.264/AVC Fidelity Range Extension (FRExt) [4], an amendment to the H.264 standard, includes an 8×8 integer transform and allows the encoder to choose adaptively between the 4×4 and 8×8 transforms for luma samples at the MB level, which is called adaptive block-size transform (ABT).
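To illustrate what "integer arithmetic without any multiplications" means in practice, the following sketch computes the 1-D 4-point forward core transform with additions, subtractions, and one-bit left shifts only, and checks it against the matrix form. The matrix is the standard H.264/AVC forward core matrix; the function and variable names are assumptions made for this illustration and do not describe the proposed hardware.

```python
# Standard H.264/AVC 4-point forward core transform matrix
# (the name CF is chosen here for illustration).
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def forward_1d_matrix(x):
    """Reference: plain matrix-vector product with CF."""
    return [sum(c * v for c, v in zip(row, x)) for row in CF]


def forward_1d_butterfly(x):
    """Same result using only add, subtract, and one-bit left shift."""
    s0, s1 = x[0] + x[3], x[1] + x[2]   # first butterfly stage: sums
    d0, d1 = x[0] - x[3], x[1] - x[2]   # first butterfly stage: differences
    return [
        s0 + s1,            # row [1,  1,  1,  1]
        (d0 << 1) + d1,     # row [2,  1, -1, -2]
        s0 - s1,            # row [1, -1, -1,  1]
        d0 - (d1 << 1),     # row [1, -2,  2, -1]
    ]


sample = [5, 11, 8, 10]
assert forward_1d_butterfly(sample) == forward_1d_matrix(sample)
```

The second stage reuses the sums and differences of the first, which is why the 4-point transform needs only eight additions and two shifts.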
Manuscript received February 22, 2009; revised November 05, 2009. First
published January 26, 2010; current version published March 17, 2010. This
work was supported by the National Research Foundation of Korea (NRF) grant
funded by the Korea government (MEST) (No.2009-0080188). The associate
editor coordinating the review of this manuscript and approving it for publication was Dr. Ketan Mayer-Patel.
The authors are with the Department of Electrical Engineering, KAIST, Daejeon 305-701, Korea (e-mail: woonghb@vslab.kaist.ac.kr; kyung@ee.kaist.ac.kr).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2010.2041099
Fig. 1. (a) 4×4 transform flow with four different 4×4 transforms. (b) 8×8 transform flow only for luma samples.
The transforms in H.264/AVC require a high data throughput rate for real-time processing of high-resolution video formats such as HD 1080p (1920×1080). Moreover, the mode decision block in the H.264 encoder uses ABT iteratively, which further increases the required data throughput. Table I shows the throughput requirements for some example frame sizes, obtained from the H.264/AVC reference software JM14.0. The test video is Crowd Run. In the JM14.0 reference software, we set QP = 24, high profile, level 5.1, an IPPP... GOP structure, fast full search motion estimation, a single reference frame, and SAD as the mode decision metric without rate-distortion optimization (RDO). The number of tested frames is 50 and the frame rate is 50 frames/sec.
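As a rough point of reference for such requirements, the sketch below computes only the raw sample rate of a 4:2:0 sequence (luma plus two quarter-size chroma planes). It is a lower bound under the assumption that each sample is transformed exactly once; the Table I figures additionally reflect the repeated transforms performed during mode decision and are not reproduced here. The function name is hypothetical.

```python
def raw_sample_rate(width, height, fps):
    """Samples per second for 4:2:0 video: luma plus two quarter-size chroma planes."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return (luma + chroma) * fps


for name, (w, h) in {"HD 1080p": (1920, 1080), "HD 2160p": (3840, 2160)}.items():
    rate = raw_sample_rate(w, h, 50)
    print(f"{name}: {rate / 1e6:.1f} Msamples/sec")
```

At 50 frames/sec this gives about 155.5 Msamples/sec for 1920×1080 and about 622.1 Msamples/sec for 3840×2160, before any iterative mode decision is taken into account.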
Fig. 1 shows the various transforms in the H.264/AVC encoding system. For luma residual input, the H.264/AVC encoder selects the transform flow between the 4×4 flow in Fig. 1(a) and the 8×8
flow in Fig. 1(b). For chroma residual input, the H.264/AVC encoder performs the 4×4 transform flow only. There are four types of 4×4 transform, i.e., the forward, inverse, forward Hadamard, and inverse Hadamard transforms, and two types of 8×8 transform, i.e., the forward and inverse transforms. This paper describes how the 4×4 and 8×8 transforms of the H.264/AVC encoder can be modified such that they are implemented as one hardware block by maximally sharing common operations while satisfying the throughput requirement of real-time processing and reducing hardware cost.
For early-stage H.264/AVC profiles such as the baseline or main profile, researchers mainly focused on developing fast algorithms for the 4×4 transforms [5] and on implementations that improve performance with minimal area overhead [6]–[11]. With the advent of the H.264/AVC high profile, implementing 8×8 transforms and unifying the 8×8 and 4×4 transforms have become very important. A fast 8×8 transform algorithm using the Kronecker product and direct sum is described in [12]. Hardware architectures shared between the 8×8 and 4×4 transforms are described in [13]–[15]. In [15], a transform architecture to support RDO mode decision is also proposed. A unified architecture for the forward and inverse transforms is presented in [16]. Moreover, some architectures that support multistandard video applications with adaptive block-size transform (8×8 and 4×4) are proposed in [17] and [18]. However, the throughput of these architectures is not sufficient to satisfy the real-time requirement of the unified transform in an HD 2160p system. Only the proposed architecture satisfies the requirement of an HD 2160p system, as will be shown in Section VI.
The rest of this paper is organized as follows. In Section II, we briefly review the equations of the four different 4×4 integer transforms and the 8×8 integer transforms. The proposed 4×4 transform algorithm and implementation are described in Section III. In Section IV, we present the 8×8 transform algorithm including the 4×4 transforms. The unified multitransform architecture (MTA) supporting all six kinds of integer transforms is described in Section V. Section VI discusses the synthesis results and an evaluation in comparison with previous works, followed by conclusions in Section VII.
II. INTEGER TRANSFORM ALGORITHMS
A. 4×4 Integer Transforms
The 4×4 forward and inverse transforms are defined as
(1)
where the input to the forward transform is a 4×4 residual block and the input to the inverse transform is an inversely quantized 4×4 block. The forward and inverse transform matrices are given as
(2)
The 4×4 forward and inverse transforms are applied to all 4×4 input blocks regardless of the type of blocks, i.e., luma or chroma (Cb or Cr), and prediction modes, i.e., intra or inter mode.
The forward and inverse Hadamard transforms are defined as
(3)
where the input to the forward Hadamard transform is the 4×4 block comprised of the dc components from each of the 16 4×4 submacroblocks and the input to the inverse Hadamard transform is a quantized 4×4 DC block. The transform matrix is given as
(4)
The Hadamard transforms are applied only when a macroblock is encoded in 16×16 intra prediction mode.
B. 8×8 Integer Transforms
The 8×8 forward and inverse transforms are defined as
(5)
where the input to the forward transform is an 8×8 residual block and the input to the inverse transform is an inversely quantized 8×8 block. The transform matrix is given as
(6)
The 8×8 transforms are applied to only luma blocks.
III. 4×4 INTEGER TRANSFORM CODING
In this section, we describe the 4×4 inverse transform coding based on permutation and matrix factorization so that the 4×4 forward and (forward and inverse) Hadamard transforms are derived from the 4×4 inverse transform with a minor modification. The integration of the four 4×4 transforms is also addressed in this section.
A. 4×4 Inverse Transform
The 4×4 inverse transform matrix can be regularized by two permutation matrices [5]:
(7)
Pre- and post-multiplying the inverse transform matrix by the two permutation matrices, respectively, and partitioning the result into 2×2 blocks, it follows that
(8)
(9)
where
(10)
It is to be noted that each permutation matrix, multiplied by its inverse permutation, gives the 4×4 identity matrix. If we pre- and post-multiply the regularized matrix by the inverse permutations, the result becomes the inverse transform matrix again, i.e.,
(11)
The regularized matrix can then be factored as follows:
(12)
where
(13)
and the blocks include the 2×2 identity matrix and the 2×2 null matrix. A further matrix is defined by a pre- and post-multiplication:
(14)
With this definition, the matrix can be expressed as a product of two factors:
(15)
By using (12) and (15) in (11), we obtain
(16)
Then, we can rewrite the inverse transform (1) using (16) and an additional matrix as
(17)
Since one of the factors is a symmetric matrix, it follows that
(18)
Fig. 2. Block diagram of the proposed inverse transform consisting of six steps.
Fig. 2 shows the sequence of the proposed inverse transform. The inverse transform can now be carried out by the following six steps, among which four steps (Step1, 3, 5, and 6) are simple permutations:
1) Step1, 3, 5, and 6: Permutation
All four steps are implemented as pure hard-wired interconnection, i.e., without any arithmetic logic.
2) Step2: Block multiplication
Partitioning the input into 2×2 blocks, we compute the result through block multiplication as follows:
(19)
3) Step4: Block multiplication
Partitioning the input into 2×2 blocks, the result is obtained through block multiplication as follows:
(20)
Equation (20) has the same form as (19) in Step2 except that a different matrix is used in (20). Thus, we can reuse Step2 (block multiplication) to calculate Step4 by substituting the matrix in (19).
B. 4×4 Forward Transform
In (2), the forward transform matrix can be expressed by the inverse transform matrix as follows:
(21)
where
(22)
Fig. 3. Block diagram of the proposed forward transform consisting of six steps.
Substituting (16) into (21), we obtain
(23)
Then, the forward transform can be rewritten as
(24)
Fig. 3 shows the sequence of the proposed forward transform. Similar to the inverse transform, the forward transform is carried out in six steps. As Step2, 3, 4, and 5 in Fig. 3 are the same as Step4, 3, 2, and 5 in Fig. 2, respectively, we can reuse them as common blocks when integrating the 4×4 forward and inverse transforms. Like the other permutations, Step1 is implemented as mere hard-wired interconnection. In Step6, the matrix in (22) is the same as the 4×4 identity matrix except for a scaling factor of 2, which is simply a left-shift operation. Thus, Step6 is also implemented as hard-wired interconnection, which will be shown in the next subsection.
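The relation between the forward and inverse matrices can be checked numerically. The sketch below assumes the standard H.264/AVC 4×4 core matrices and uses the names `CF`, `CI`, and `S_DIAG`, which are chosen here for illustration. Under that assumption, scaling rows 1 and 3 of the inverse matrix by 2 yields the forward matrix, and on integer data the same scaling is a one-bit left shift.

```python
from fractions import Fraction as F

# Standard H.264/AVC 4x4 inverse core matrix (name assumed for illustration).
CI = [
    [F(1),    F(1),    F(1),     F(1)],
    [F(1),    F(1, 2), F(-1, 2), F(-1)],
    [F(1),    F(-1),   F(-1),    F(1)],
    [F(1, 2), F(-1),   F(1),     F(-1, 2)],
]

# Standard H.264/AVC 4x4 forward core matrix.
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]

# Assumed diagonal of the scaling matrix: identity except for factors of 2.
S_DIAG = [1, 2, 1, 2]


def scale_rows(diag, matrix):
    """Multiply row i of the matrix by diag[i] (a diagonal pre-multiplication)."""
    return [[d * v for v in row] for d, row in zip(diag, matrix)]


def shift_rows(block):
    """Hardware view on integer data: odd rows are left-shifted by one bit."""
    return [[v << 1 if i % 2 else v for v in row] for i, row in enumerate(block)]


assert scale_rows(S_DIAG, CI) == CF
```

Because the scaling only doubles selected values, it needs no adder: a left shift by one bit is a rewiring of the bit lines, which is consistent with implementing Step6 as hard-wired interconnection.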
C. 4×4 Hadamard Transform
Applying the same process as for the 4×4 inverse transform, the Hadamard transform matrix can be expanded as follows:
(25)
Then, the forward and inverse Hadamard transforms can be rewritten as
(26)
(27)
Since (26) and (27) have the same form as the inverse transform (18) except that a different matrix is used, the Hadamard transforms can be carried out by the same procedure as the inverse transform with a minor modification.
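As an illustration of why the Hadamard transforms can follow the same data path, the sketch below uses the standard 4×4 Hadamard matrix and the standard inverse core matrix (helper names are assumed). It shows that the Hadamard matrix has the same sign pattern as the inverse matrix, with every ±1/2 coefficient replaced by ±1, so the same butterfly structure applies once the halving is left out. The normalization factor of the Hadamard transform is omitted here.

```python
from fractions import Fraction as F

# Standard H.264/AVC 4x4 inverse core matrix and 4x4 Hadamard matrix
# (the names CI and H are chosen for illustration).
CI = [
    [F(1),    F(1),    F(1),     F(1)],
    [F(1),    F(1, 2), F(-1, 2), F(-1)],
    [F(1),    F(-1),   F(-1),    F(1)],
    [F(1, 2), F(-1),   F(1),     F(-1, 2)],
]
H = [
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
    [1, -1,  1, -1],
]


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


# The Hadamard matrix is the sign pattern of the inverse core matrix.
assert [[sign(v) for v in row] for row in CI] == H


def hadamard_1d(x):
    """4-point Hadamard butterfly: the same two stages as the core transform, no shifts."""
    s0, s1 = x[0] + x[3], x[1] + x[2]
    d0, d1 = x[0] - x[3], x[1] - x[2]
    return [s0 + s1, d0 + d1, s0 - s1, d0 - d1]


def hadamard_2d(block):
    """Unnormalized 2-D Hadamard transform: rows first, then columns."""
    rows = [hadamard_1d(r) for r in block]
    cols = [hadamard_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

For a block of identical DC values, `hadamard_2d` concentrates all energy in the top-left coefficient, as expected of a DC-only input.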
D. 4×4 Multitransform Architecture
Fig. 4(a) shows the sequences of the four different 4×4 transforms based on the proposed algorithm. There is a common sequence among the four transforms, i.e., from Step2 to Step5 in Fig. 4(a), which is merged into a 4×4 MTA core as shown in Fig. 4(b). The 4×4 MTA core is designed to process a 4×4 block within two clock cycles. The odd and even clock cycles are named Phase1 and Phase2, respectively. In Phase1, Step2 and 3 are performed, followed by Step4 and 5 in Phase2. A feedback path for the two-phase implementation is enclosed within the 4×4 MTA core.
Two different block multiplications, i.e., those in Step2 and Step4 in Fig. 4(a), can be merged into one block (the block multiplication block in the 4×4 MTA core) as they do not occur simultaneously. Likewise, the P permutation processes [Step3 and Step5 in Fig. 4(a)] are merged into one permutation block (the P permutation block in the 4×4 MTA core). The remaining blocks [Step1 and Step6 in Fig. 4(a)] are merged into the input and output interconnection blocks as shown in Fig. 4(b).
The proposed 4×4 MTA core is shown in Fig. 5. This architecture is composed of four processing elements (PE), 16 multiplexers, a P permutation block, and four register blocks.
1) Sixteen multiplexers between the input ports and the four PEs determine the input to the PEs according to the phase. In Phase1, the multiplexer controller (MC) selects the input ports as the input of the PEs. In Phase2, the MC selects the output ports as the input of the PEs through the feedback path.
2) Four processing elements are used to calculate the block multiplications of Step2 and Step4 in Fig. 4(a). Each PE is composed of two-stage butterfly adders with shift operation as illustrated in Fig. 6. The PEs operate differently according to the phase and transform type. In Phase1, the multiplexer controller (MC) in Fig. 6 selects input 0 for the forward transform and input 1 for the inverse transform. In Phase2, the MC selects input 1 for the forward transform and input 0 for the inverse transform. On the other hand, the MC always selects input 0 when the transform type is the forward or inverse Hadamard transform, regardless of the phase. Thus, the PEs compute one of the four Step2 operations in Fig. 6(a) in Phase1, and one of the four Step4 operations in Phase2. It is to be noted that, as Fig. 6(a) is also the 2×2 Hadamard transform for chroma dc components, it can be implemented as a part of the 4×4 transform.
3) The P permutation block uses a wiring network to implement Step3 and 5.
4) Four register blocks temporarily store the result of the P permutation. In Phase1, the stored data enters the PEs again along the feedback path, while the data enters the output interconnection block in Phase2.
To perform all six steps for a transform, appropriate input and output (I/O) interconnection needs to be provided depending on the type of transform. Fig. 7(a) shows the complete 4×4 multitransform architecture including the I/O interconnection blocks. The input interconnection block is composed of four permutation blocks and one multiplexer to choose an appropriate input to be processed. The output interconnection block is composed of three permutation blocks and an S multiplication block. As the matrix in (22) is a scaling matrix without permutation, the S multiplication is implemented as shown in Fig. 7(b).
Fig. 4. (a) Block diagram of the sequence of operations for four 4×4 transforms. (b) Proposed 4×4 MTA to implement the four transforms on a common hardware platform.
Fig. 5. First-level details of the proposed MTA core for performing input multiplexing, block multiplication, and P permutation. Step2 and 4 are merged, as are Step3 and 5. MC denotes the multiplexer controller. Because 16 output coefficients are output every two cycles, the processing rate is eight pixels/cycle.
Fig. 6. Second-level details of the 2×2 components in the MTA core. (a)-(d) The four PEs. Each PE corresponds to one of the 2×2 elements of block multiplication in (24), (26), (37), and (43). MC denotes the multiplexer controller. (a) is also the 2×2 Hadamard transform for chroma dc components.
Fig. 7. (a) Complete 4×4 multitransform architecture including I/O interconnection blocks. (b) S multiplication block in the output interconnection of the 4×4 forward transform. The processing rate of the 4×4 MTA core is eight pixels/cycle. All 16 coefficients of the selected input among the four inputs must be prepared simultaneously.
IV. 8×8 INTEGER TRANSFORM CODING
In this section, we describe the 8×8 inverse transform coding based on the extended transform and block multiplication so that the 8×8 forward transform is derived from the 8×8 inverse transform with a minor modification and the 4×4 transforms are included in the 8×8 transform.
A. Extended Transform
Extended transform [19] means that a smaller transform is a part of a larger transform. Taking the 4×4 and 8×8 integer transforms in H.264/AVC as an example, the relation between them can be described as
(28)
where
(29)
and
(30)
The relation involves an 8×8 permutation matrix and a butterfly matrix. The two 4×4 transform matrices Ce and Co are the integer forms of the II-type and IV-type DCT (discrete cosine transform) [19], respectively. It is to be noted that Ce corresponds to the 4×4 inverse transform matrix in (2).
B. 8×8 Inverse Transform
Defining a new matrix, we obtain
(31)
Then, we can rewrite the 8×8 transform matrix using (28):
(32)
Applying (32) to the 8×8 inverse transform (5), we obtain
(33)
Fig. 8. Block diagram of the proposed 8×8 inverse transform consisting of three steps. IQ denotes inverse quantization.
Fig. 8 shows the sequence of the proposed 8×8 inverse transform. The 8×8 inverse transform is carried out by the following three steps:
1) Permutation: As permutation means reordering the elements of an 8×8 block, this step is implemented as hard-wired interconnection, i.e., without any arithmetic logic.
2) CeCo Transform: Partitioning the permuted block into 4×4 blocks, we compute the result as four different kinds of 4×4 transforms:
(34)
As Ce is equal to the 4×4 inverse transform matrix, the first component is exactly the same as the 4×4 inverse transform. Therefore, we can reuse the 4×4 MTA to compute it.
The other three components can be processed by the conventional row-column approach with 1-D transform and transposition presented in the H.264/AVC standard [20]. By using the algebraic rules for the transpose together with a matrix identity, they can be rewritten as follows:
(35)
(36)
(37)
Each of the three 4×4 transforms can be computed by applying the one-dimensional (1-D) transform twice. Taking one of them, named the Co transform, as an example, Fig. 9(a) shows the direct implementation of the Co transform, and Fig. 9(b) shows the Co butterfly unit. As a Co butterfly unit can process four pixels at a time, four Co butterfly units are needed to process a 4×4 block. By sharing the 1-D transform unit and the transpose register, we obtain the two-cycle implementation of the Co transform shown in Fig. 9(c). Likewise, the other two, named the CeCo transform, are implemented as shown in Fig. 10(a). As they use both the Ce and Co butterfly units, a cross-feedback path is enclosed in the CeCo transform block.
In Fig. 10(a), the shaded box with the dotted-line feedback path indicates an additional 4×4 inverse transform block. If the 4×4 inverse transform is applied to its input, it follows that
(38)
Equation (38) means that the 4×4 inverse transform can be implemented by applying the 1-D Ce transform twice with transposition, which corresponds to the shaded box in Fig. 10(a). It can be used with the 4×4 MTA in parallel to further improve the throughput of the 4×4 transforms. Moreover, the 4×4 forward transform is also merged into Fig. 10(a), which will be described in the next subsection.
3) B Block Multiplication: Partitioning the matrix into 4×4 blocks yields
(39)
where the blocks are the 4×4 identity matrix and a 4×4 permutation matrix:
(40)
Partitioning the input also into 4×4 blocks, we obtain the result through block multiplication as follows:
(41)
Fig. 11(a) shows the signal flow diagram and (b) shows its two-cycle implementation. Input multiplexers, registers, and feedback paths are used to share adders as shown in Fig. 11(b).
Fig. 9. (a) Direct implementation of Co transform. (b) Co butterfly unit. (c) Two-cycle implementation of Co transform.
Fig. 10. (a) CeCo transform block including the 4×4 inverse transform. (b) Ce butterfly unit.
Fig. 11. (a) Signal flow of B block multiplication. (b) Two-cycle implementation.
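The row-column idea used in Step 2, i.e., applying a 1-D transform, transposing, and applying the same 1-D transform again, can be sketched as follows. The example uses the standard 4-point forward core matrix purely as a stand-in 1-D kernel (the Ce and Co kernels of the paper are not reproduced here), and all function names are assumptions made for this illustration.

```python
# Stand-in 1-D kernel: the standard H.264/AVC 4-point forward core matrix.
K = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    """Swap rows and columns (the role of the transpose register)."""
    return [list(r) for r in zip(*m)]


def pass_1d(block):
    """One use of the shared 1-D unit: K applied to every column of the block."""
    return matmul(K, block)


def two_pass_2d(block):
    """Two uses of the same 1-D unit with transposition in between and after."""
    first = transpose(pass_1d(block))    # (K X)^T = X^T K^T
    return transpose(pass_1d(first))     # (K X^T K^T)^T = K X K^T


def direct_2d(block):
    """Reference: K X K^T computed in one go."""
    return matmul(matmul(K, block), transpose(K))


x = [[3, 1, 4, 1], [5, 9, 2, 6], [5, 3, 5, 8], [9, 7, 9, 3]]
assert two_pass_2d(x) == direct_2d(x)
```

Since both passes call the same `pass_1d`, a single 1-D unit with a transpose buffer suffices when the two passes are executed in consecutive cycles, which is the trade-off behind a two-cycle implementation.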
C. 8×8 Forward Transform
The 8×8 forward transform can be expanded as follows, using a process similar to that for the 8×8 inverse transform:
(42)
Fig. 12 shows the sequence of the proposed 8×8 forward transform. As Step1 in Fig. 12 is the same as Step3 in Fig. 8 except for the position of the transpose, Step1 can be implemented by reusing the B multiplication block in Fig. 11(b). Step3 is the permutation, which can be implemented as hard-wired interconnection. In Step2, we obtain the following four different kinds of 4×4 transforms by applying 4×4 block partitioning:
(43)
As Ce corresponds to the 4×4 inverse transform matrix, the first component can be expanded by the same procedure as the 4×4 transforms:
(44)
Equation (44) is the same as (24) in the 4×4 forward transform except that the S matrix in (24) is removed in (44). Thus, we can reuse the 4×4 MTA to compute the first component by bypassing the S multiplication block in Fig. 7.
Fig. 13(a) shows the CeCo transform block for computing the remaining components. The Ce butterfly unit in Fig. 10(b) is replaced by the butterfly unit shown in Fig. 13(b). As Ce corresponds to the 4×4 inverse transform matrix, the 4×4 forward transform matrix is equal to Ce scaled by S, which is implemented by selecting the multiplexer terminal 0 in Fig. 13(b). Thus, the 4×4 forward transform can also be implemented along the dotted-line feedback path in Fig. 13(a).
To compute the 8×8 forward and inverse transforms in one transform architecture, the two Ce butterfly units are unified as shown in Fig. 14. As it also includes the 4×4 forward transform matrix, the CeCo transform block in Fig. 13(a) can process the 8×8 forward, 8×8 inverse, 4×4 forward, and 4×4 inverse transforms by using the unified butterfly unit in Fig. 14.
Fig. 12. Block diagram of the proposed 8×8 forward transform consisting of three steps. Q denotes quantization.
Fig. 13. (a) CeCo transform block including the 4×4 forward transform. (b) Ce butterfly unit including the 4×4 forward transform matrix (the inverse transform matrix scaled by S).
Fig. 14. Butterfly unit unifying the two Ce butterfly units.
Fig. 15. Proposed MTA supporting six different kinds of transforms.
V. MULTITRANSFORM ARCHITECTURE UNIFYING 8×8 AND 4×4 INTEGER TRANSFORMS
Fig. 15 shows the proposed MTA supporting six different kinds of transforms for the H.264/AVC high profile encoder. The MTA is composed of a B block multiplication block, four 4×4 transform blocks, two permutation blocks, and multiplexers. The B block multiplication, Co transform, and permutation blocks are used only for the 8×8 transforms. The 4×4 MTA and CeCo transform blocks are used for both the 4×4 and 8×8 transforms.
For performing the four 4×4 transforms (4×4 forward, 4×4 inverse, forward Hadamard, and inverse Hadamard), two 4×4 transform blocks, the 4×4 MTA and part of the CeCo transform block, are used to double the throughput compared to using only the 4×4 MTA. Such throughput allows the proposed MTA to process the transforms of HD 2160p video (3840×2160 at 50 frames/sec) in real time, whose throughput requirement is described in Table I; this is further discussed in Section VI.
Unifying the 8×8 forward and inverse transforms is simple because the three functional blocks in each transform are almost the same while only their sequences are reversed, as shown in Fig. 8 and Fig. 12. Multiplexers and feedback paths are used to unify the 8×8 forward and inverse transforms as shown in Fig. 15, in which the dotted-line paths are used when performing the 8×8 inverse transform.
TABLE II
SYNTHESIS RESULTS AND HARDWARE RESOURCE COMPARISON BETWEEN THE SINGLE TRANSFORM AND MULTITRANSFORM DESIGN. EACH TRANSFORM HAS THE SAME OPERATING FREQUENCY OF 200 MHz
FT, IT, FHT, and IHT denote the forward, inverse, forward Hadamard, and inverse Hadamard transform, respectively. ABT denotes adaptive block-size transform with 4×4 and 8×8 block sizes. DPR denotes data processing rate.
Fig. 16. Temporal diagram of two-stage pipelined transform. (a) 8×8 forward transform. (b) 8×8 inverse transform. Each stage takes two clock cycles.
Fig. 17. Block diagram for functional verification of the proposed multitransform hardware using testbench from the JM reference software.
Processing an 8×8 block using the MTA takes four clock cycles because the B block multiplication takes two clock cycles and the 4×4 transform block takes two clock cycles. However, by applying two-stage pipelining to the 8×8 transforms as shown in Fig. 16, the throughput can be doubled, i.e., one 8×8 block every two clock cycles.
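The effect of the two-stage pipeline on block timing can be sketched with a small cycle-count model. It assumes, as stated above, two stages of two cycles each; the function names and the model itself are illustrative and do not describe the actual control logic.

```python
STAGE_CYCLES = 2   # each stage takes two clock cycles
NUM_STAGES = 2     # block multiplication stage and 4x4 transform stage


def finish_cycle(block_index, pipelined):
    """Cycle count at which the block with the given 0-based index is complete."""
    latency = STAGE_CYCLES * NUM_STAGES
    if pipelined:
        # A new block enters as soon as the first stage is free.
        return latency + block_index * STAGE_CYCLES
    # Without pipelining, each block waits for the previous one to finish.
    return latency * (block_index + 1)


def pixels_per_cycle(num_blocks, pipelined):
    """Average data processing rate over a run of 8x8 blocks."""
    return 64 * num_blocks / finish_cycle(num_blocks - 1, pipelined)


for n in (1, 10, 1000):
    print(n, pixels_per_cycle(n, False), round(pixels_per_cycle(n, True), 2))
```

Without pipelining the rate stays at 16 pixels/cycle, whereas the pipelined rate approaches 32 pixels/cycle as the number of blocks grows; the four-cycle latency of a single block is unchanged.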
VI. IMPLEMENTATION AND RESULTS
A. Implementation and Verification
We have implemented the proposed multitransform design and verified its behavior using Verilog RTL simulation, logic synthesis, and gate-level simulation. Fig. 17 shows the simulation environment used to verify the functional behavior of the proposed architecture. Test vectors are obtained by using the H.264/AVC reference software JM14.0. After extracting input and output data from the reference software, we applied the input data to the proposed design and compared its results with the output data from the reference software.
We synthesized the proposed multitransform design by using Synopsys Design Compiler and the UMC 0.18-µm Faraday standard cell library [21]. In the logic synthesis, a wireload model was used, and the skew, jitter, and transition time of the clock and the I/O external delay were separately taken into account. Table II shows the performance and hardware cost of the proposed multitransform design compared with the separate implementation of the six transforms. Timing constraints are identical so that each transform has the same operating frequency of 200 MHz. The single transform design, which is a separate implementation of the four 4×4 transform paths in Fig. 4(a) and the two 8×8 transform paths in Figs. 8 and 12, has the same behavior as the multitransform design and is used as the target for comparison.
According to Table II, the proposed MTA has about 51% less area than the single transform design. Table II shows that the proposed MTA can process 3.2 Gpixels/sec when it processes only 4×4 transforms. Because the MTA includes two 4×4 transform blocks, i.e., the 4×4 MTA and the CeCo transform block, each of which can process a 4×4 block within two clock cycles, the MTA has a data processing rate of 16 pixels/cycle. If the MTA processes only 8×8 transforms, the throughput becomes 6.4 Gpixels/sec.
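These throughput figures follow directly from the data processing rate and the clock frequency. A short check, with function and variable names chosen for illustration:

```python
CLOCK_HZ = 200e6   # operating frequency of 200 MHz


def throughput(pixels_per_block, cycles_per_block, parallel_blocks=1):
    """Pixels per second for units that each finish one block every cycles_per_block cycles."""
    return pixels_per_block * parallel_blocks / cycles_per_block * CLOCK_HZ


# 4x4 mode: two transform units, each one 16-pixel block every two cycles.
t_4x4 = throughput(16, 2, parallel_blocks=2)
# 8x8 mode: one 64-pixel block every two cycles thanks to the two-stage pipeline.
t_8x8 = throughput(64, 2)

print(t_4x4 / 1e9, "Gpixels/sec in 4x4 mode")
print(t_8x8 / 1e9, "Gpixels/sec in 8x8 mode")
```

The two modes correspond to 16 and 32 pixels/cycle, i.e., 3.2 and 6.4 Gpixels/sec at 200 MHz.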
B. Performance Comparison
When adaptive block-size transform (ABT), which uses the 4×4 and 8×8 transforms jointly, is used, we obtain a throughput of 4.1 Gpixels/sec. It is based on the observation that the ratio of clock cycles spent in the 4×4 mode to those spent in the 8×8 mode is 2.5. This was obtained from Table I considering that one cycle is required to process a 4×4 block and two cycles are required to process an 8×8 block. Thus, the proposed design allows real-time processing of HD 2160p video (3840×2160 at 50 frames/sec), whose throughput requirement is described in Table I. Table III shows the comparison among various methods in terms of operating frequency, data processing rate, throughput, gate count, and throughput per area. There are three different transform modes, i.e., 4×4, 8×8, and ABT. The results for the 4×4 and 8×8 modes are based on the assumption that each transform hardware performs either the 4×4 or the 8×8 mode, while the result for the ABT mode indicates that the 4×4 and 8×8 transform modes are jointly used.
TABLE III
SYNTHESIS RESULTS AND COMPARISON OF THE PROPOSED MTA WITH OTHER REPORTED DESIGNS. ALL ARCHITECTURES ARE DESIGNED AS 2-D TRANSFORM. DPR DENOTES DATA PROCESSING RATE AND MEANS THE NUMBER OF PIXELS TO BE PROCESSED EVERY CLOCK CYCLE. FT, IT, FHT, AND IHT DENOTE THE FORWARD, INVERSE, FORWARD HADAMARD, AND INVERSE HADAMARD TRANSFORM, RESPECTIVELY
Assume 2-D transform design by the architecture in Wang [6].
Gate count of the transpose register estimated by Design Compiler is 8821.
Gate count of the transpose register estimated by Design Compiler is 11416.
Gate count of the on-chip memory estimated by UMC MEMMAKER is 5496.
Power consumption of the transpose register estimated by Prime Power is 4.102 mW.
Power consumption of the transpose register estimated by Prime Power is 5.374 mW.
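The 4.1 Gpixels/sec ABT figure can be reproduced as a cycle-weighted average of the two modes. The sketch assumes the stated cycle ratio of 2.5 (cycles in 4×4 mode to cycles in 8×8 mode) and per-mode rates of 16 and 32 pixels/cycle; the names are illustrative.

```python
CLOCK_HZ = 200e6     # operating frequency
RATE_4X4 = 16        # pixels/cycle in 4x4 mode
RATE_8X8 = 32        # pixels/cycle in 8x8 mode
CYCLE_RATIO = 2.5    # cycles spent in 4x4 mode per cycle spent in 8x8 mode


def abt_throughput(ratio=CYCLE_RATIO):
    """Average pixels per second when both modes share the clock cycles."""
    pixels = ratio * RATE_4X4 + 1.0 * RATE_8X8   # pixels processed in (ratio + 1) cycles
    cycles = ratio + 1.0
    return pixels / cycles * CLOCK_HZ


print(f"{abt_throughput() / 1e9:.2f} Gpixels/sec")
```

With these inputs the average is 72 pixels per 3.5 cycles, about 20.6 pixels/cycle, which gives roughly 4.11 Gpixels/sec at 200 MHz.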
Table III shows that the proposed MTA in the 4×4 transform mode is the most efficient in terms of throughput/area ratio among designs supporting all six kinds of transforms, which results from the high operating frequency and the two independent 4×4 transform blocks operating in parallel. In the 8×8 transform mode, the proposed design has the highest throughput and throughput/area ratio. This comes from the high data processing rate and the two-stage pipelined architecture, as well as efficient sharing of sub-blocks when unifying the 8×8 forward and inverse transforms. When the designs are operated in ABT mode, which is the practical operating condition of the transforms, the proposed design has at least 54% higher throughput and 38% higher throughput/area ratio than the other designs.
After the logic synthesis, we used Synopsys PrimePower to
estimate power consumption. When supplied with 1.8 V and
operated at 200 MHz, the proposed design consumes about
83.8 mW. Compared to other designs [13], [18], the proposed
design has the largest throughput/power ratio. Moreover, as
power consumption increases in proportion to operating frequency, power consumption of the proposed design can be
lowered with lower frame rate or smaller frame size.
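Under the stated assumption that power scales in proportion to operating frequency, the trade-off can be sketched as below. The purely linear model, the idea of scaling the clock with the required pixel rate, and the function name are simplifying assumptions for illustration; static power and voltage scaling are ignored.

```python
REF_POWER_MW = 83.8     # reported power at the reference clock
REF_CLOCK_MHZ = 200.0   # reference operating frequency


def scaled_power_mw(clock_mhz):
    """Power estimate assuming strict proportionality to clock frequency."""
    return REF_POWER_MW * clock_mhz / REF_CLOCK_MHZ


# Example: a quarter of the pixel rate, e.g. 1920x1080 instead of 3840x2160
# at the same frame rate, would allow a quarter of the clock in this model.
quarter_rate_power = scaled_power_mw(REF_CLOCK_MHZ / 4)
print(f"{quarter_rate_power:.2f} mW at {REF_CLOCK_MHZ / 4:.0f} MHz")
```

In this simplified model, the quarter-rate case needs 50 MHz and about 21 mW.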
VII. CONCLUSION
We proposed a fast and cost-effective algorithm and implementation of the multitransform architecture for H.264/AVC encoders. Four different 4×4 transforms and two 8×8 transforms are integrated on shared hardware by using the extended transform and block multiplication. Comparing the proposed multitransform design with the best previous work, we obtained 54% higher throughput and 38% higher throughput/area.
REFERENCES
[1] N. Kamaci and Y. Altunbasak, Performance comparison of the
emerging H.264 video coding standard with the existing standards, in
Proc. IEEE Int. Conf. Multimedia and Expo, Jul. 2003, pp. 345348.
[2] H. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, Low-complexity transform and quantization in H.264/AVC, IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 598603, 2003.
[3] M. Wien, Variable block-size transform for H.264/AVC, IEEE Trans.
Circuits Syst. Video Technol., vol. 13, no. 7, pp. 604613, Jul. 2003.
[4] D. Marpe, T. Wiegand, and S. Gordon, H.264/MPEG4-AVC fidelity range extensions: Tools, profiles, performance, and application
areas, in Proc. IEEE Int. Conf. Image Processing, Sep. 2005, pp.
I-593I-596.
[5] C. P. Fan, Fast 2-dimensional 4 4 forward integer transform implementation for H.264/AVC, IEEE Trans. Circuits Syst. II, vol. 53, no.
3, pp. 174177, Mar. 2006.
[6] T. C. Wang, Y. W. Huang, H. C. Fang, and L. G. Chen, Parallel 4 4

2D transform and inverse transform architecture for MPEG-4 AVC/H.
264, in Proc. IEEE Int. Symp. Circuits and Systems, May 2003, pp.
800803.
[7] Z. Y. Cheng, C. Chen, B. D. Liu, and J. F. Yang, High throughput
2-D transform architectures for H.264 advanced video coders, in
Proc. IEEE Asia-Pacific Conf. Circuits and Systems, Dec. 2004, pp.
11411144.
[8] K. H. Chen, J. I. Guo, and J. S. Wang, A high-performance direct
2-D transform coding IP design for MPEG-4 AVC/H.264, IEEE Trans.
Circuits Syst. Video Technol., vol. 16, no. 4, pp. 472483, Apr. 2006.
[9] W. Hwangbo, J. Kim, and C. M. Kyung, A high-performance 2-D
inverse transform architecture for the H.264/AVC decoder, in Proc.
IEEE Int. Symp. Circuits and Systems, May 2006, pp. 16131616.
[10] P. Chungan, Y. Dunshan, C. Xixin, and S. Shimin, A new high
throughput VLSI architecture for H.264 transform and quantization,
in Proc. Int. Conf. ASIC, Oct. 2007, pp. 950953.
[11] C. Wei, H. Hui, L. Jinmei, T. Jiarong, and M. Hao, A high-performance reconfigurable 2-D transform architecture for H.264, in Proc.
IEEE Int. Conf. Electronics, Circuits and Systems, Aug. 2008, pp.
606609.
[12] C. P. Fan, Fast 2-dimensional 8 8 integer transform algorithm design for H.264/AVC fidelity range extensions, IEICE Trans. Inf. Syst.,
vol. E89-D, pp. 30063011, Dec. 2006.
[13] C. P. Fan, Cost-effective hardware sharing architectures of fast 8
8 and 4 4 integer transforms for H.264/AVC, in Proc. IEEE Asia
Pacific Conf. Circuits and Systems, Dec. 2006, pp. 776779.
[14] Y. C. Chao, H. H. Tsai, Y. H. Lin, J. F. Yang, and B. D. Liu, "A novel
design for computation of all transforms in H.264/AVC decoders," in
Proc. IEEE Int. Conf. Multimedia and Expo, Jul. 2007, pp. 1914–1917.
[15] G. Pastuszak, "Transforms and quantization in the high-throughput
H.264/AVC encoder based on advanced mode selection," in Proc.
IEEE Comput. Soc. Annu. Symp. VLSI, Apr. 2008, pp. 203–208.
[16] Y. Li, Y. He, and S. Mei, "A highly parallel joint VLSI architecture for
transforms in H.264/AVC," J. Signal Process. Syst., vol. 50, pp. 19–32,
Oct. 2007.
[17] B. Li, D. Zhang, J. Fang, L. Wang, and M. Zhang, "A unified IDCT
architecture for multi-standard video codecs," in Proc. Int. Conf. ASIC,
Oct. 2007, pp. 962–965.
[18] C. Y. Huang, L. F. Chen, and Y. K. Lai, "A high-speed 2-D transform
architecture with unique kernel for multi-standard video applications,"
in Proc. IEEE Int. Symp. Circuits and Systems, May 2008, pp. 21–24.
[19] W. Chen, C. Smith, and S. Fralick, "A fast computational algorithm for
the discrete cosine transform," IEEE Trans. Commun., vol. 25, no. 9,
pp. 1004–1009, Sep. 1977.
[20] Advanced Video Coding for Generic Audiovisual Services, ITU-T Recommendation H.264, Std., 2007.
[21] Faraday UMC Standard Library. [Online]. Available: http://www.faraday-tech.com
Woong Hwangbo received the B.S. degree in electrical engineering from Pusan National University,
Busan, Korea, and the M.S. degree in electrical
engineering from Korea Advanced Institute of
Science and Technology (KAIST), Daejeon, Korea.
He is currently pursuing the Ph.D. degree in the
Department of Electrical Engineering and Computer
Science at KAIST.
His research interests include VLSI design and
multimedia application with high performance and
low power consumption.
Chong-Min Kyung (S'76–M'81–SM'99–F'08)
received the B.S. degree in electronics engineering
from Seoul National University, Seoul, Korea, in
1975 and the M.S. and Ph.D. degrees in electrical
engineering from Korea Advanced Institute of
Science and Technology (KAIST), Daejeon, Korea,
in 1977 and 1981, respectively.
From April 1981 to January 1983, he worked
at Bell Telephone Laboratories, Murray Hill, NJ,
as a postdoc. Since he joined KAIST in 1983, he
has been working on System-on-a-Chip design and
verification methodology as well as processor and graphics architectures for
high-speed and/or low-power applications, including mobile video codec. He
is Hynix Chair Professor at KAIST.
Dr. Kyung received the Most Excellent Design Award, and Special Feature
Award in the University Design Contest in the ASP-DAC 1997 and 1998, respectively. He received the Best Paper Awards in the 36th DAC held in New Orleans, LA; the 10th International Conference on Signal Processing Application
and Technology (ICSPAT), Orlando, FL, in September 1999; and the 1999 International Conference on Computer Design (ICCD), Austin, TX. He was General
Chair of Asian Solid-State Circuits Conference (A-SSCC) 2007, and ASP-DAC
2008. In 2000, he received a National Medal from the Korean government for
his contribution to research and education in IC design. He is a member of the
National Academy of Engineering Korea (NAEK) and the Korean Academy of
Science and Technology (KAST).
