
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 15, NO. 5, MAY 2005

Accelerate Video Decoding With Generic GPU


Guobin Shen, Member, IEEE, Guang-Ping Gao, Shipeng Li, Heung-Yeung Shum, and Ya-Qin Zhang

Abstract—Most modern computers and game consoles are equipped with powerful yet cost-effective graphics processing units (GPUs) to accelerate graphics operations. Though the graphics engines in these GPUs are specially designed for graphics operations, can we harness their computing power for more general, nongraphics operations? The answer is positive. In this paper, we present our study on leveraging the GPU's graphics engine to accelerate video decoding. Specifically, a video decoding framework that involves both the central processing unit (CPU) and the GPU is proposed. By moving the whole motion compensation feedback loop of the decoder to the GPU, the CPU and GPU have been made to work in parallel in a pipelining fashion. Several techniques are also proposed to overcome the GPU's constraints or to optimize the GPU computation. Initial experimental results show that significant speed-up can be achieved by utilizing the GPU power. We have achieved real-time playback of high-definition video on a PC with an Intel Pentium III 667-MHz CPU and an nVidia GeForce3 GPU.

Index Terms—General-purpose computing, graphics processing unit (GPU), video decoding acceleration.

I. INTRODUCTION

MULTIMEDIA content is the core of digital entertainment. However, multimedia content processing usually requires very high computational power due to the intrinsically huge volume of data. As the key to most multimedia applications, video encoding and decoding technologies are evolving along a path that trades complexity for coding efficiency. On the other hand, people are becoming more critical of visual quality. The consumer electronics market has revealed the trend that high-resolution video equipment is occupying more and more market share. All these factors inevitably pose an ever-increasing demand on processing power. In the last decade, some multimedia-oriented SIMD processor extensions, such as Intel's MMX/SSE instructions, were introduced into CPU designs. Although these instructions can improve performance significantly and are used extensively throughout multimedia applications, the CPU is still usually heavily loaded, leaving aside those cases where the CPU's processing power cannot meet the requirement at all. For example, the CPUs in most household PCs are currently not powerful enough to decode high-definition (HD) video in real time, even with highly optimized code.

With the development of silicon technologies, more and more inexpensive yet powerful graphics processing units (GPUs) can be found in mainstream commodity PCs and game consoles. As the name suggests, GPUs are equipped with specialized processors for two-dimensional (2-D) and three-dimensional (3-D) graphics operations, and they indeed do a good job in graphics-oriented applications.1 These GPUs are very powerful. For example, the nVidia GeForce3 chip contains more transistors than the Intel Pentium IV, and its successor, the GeForce4, is advertised as being able to perform more than 1.2 trillion internal operations per second. The internal pipelined processing fashion also makes the GPU suitable for stream processing. Furthermore, GPU speed grows much faster than the famous Moore's law for CPUs: 2.4 times per year versus 2 times per 18 months. Moreover, a GPU usually contains multiple (typically 4 or 8) parallel pipelines and is in effect a SIMD processor. Another important feature of most contemporary GPUs is the programmability of the partial or full graphics pipeline, thanks to the introduction of vertex shaders and pixel shaders in DirectX-8. The power, SIMD operation, and programmability of GPUs have motivated an active research area of using the GPU for nongraphics-oriented operations, such as numerical computations like basic linear algebra subprograms (BLAS) [6] and image/volume processing [7]-[12]. In [7], the authors presented a technique for multiplying large matrices quickly using the graphics hardware of a PC. Strzodka and Rumpf have implemented complicated numerical schemes solving parabolic differential equations fully in graphics hardware [8]. Thompson et al. introduced a programming framework for using modern graphics architectures for general-purpose computing [9]. In [10], the authors tested five color image processing algorithms using the latest programmability features available in DirectX-9-compatible GPUs. Performing the FFT on the GPU was reported by Doggett et al. in [11]. Wavelet decomposition and reconstruction were implemented on modern OpenGL-capable graphics hardware [12].

In this paper, we present our study on accelerating digital video decoding using the programmable graphics pipeline of a commodity GPU. It is natural to leverage the GPU to off-load some of the CPU's tasks when the CPU is heavily loaded while the GPU is idle, which is often the case for nongraphics-oriented applications such as video decoding. As a matter of fact, most of today's GPUs have a special hardware unit that can perform the video decoding process provided that the video is encoded with a certain specific standard, thanks to the well-established international video coding standards such as MPEG-1/2/4 [1]-[3] and the widely accepted DirectX video accelerator (DXVA) specification.2 However, the applicability of such a hardware video decoding unit is very limited. For

Manuscript received September 30, 2003; revised March 31, 2004.
G. Shen, G.-P. Gao, S. Li, and H.-Y. Shum are with Microsoft Research Asia, Beijing 100080, China (e-mail: jackysh@microsoft.com; ggao@microsoft.com; spli@microsoft.com; hshum@microsoft.com).
Y.-Q. Zhang is with the Microsoft Corporation, Redmond, WA 98052 USA (e-mail: yzhang@microsoft.com).
Digital Object Identifier 10.1109/TCSVT.2005.846440

1[Online] Available: http://www.siggraph.org/s2002/conference/papers/papers8.html
2[Online] Available: http://www.microsoft.com/whdc/hwdev/tech/stream/DirectX_VA/default.mspx

1051-8215/$20.00 © 2005 IEEE


Fig. 1. Architectural overview for the graphics engine (Direct3D) of the GPU.

example, they cannot handle video content coded with proprietary yet very popular video coding formats such as Windows Media Video (WMV)3 and RealVideo.4 Moreover, even though almost all GPUs can help with video rendering (through overlay), they provide only very limited flexibility in manipulating the decoded video. In contrast, the solution provided in this paper is a purely software-based solution that automatically avoids all the above-mentioned disadvantages of hardware-based solutions (including on-chip video accelerating modules conforming to DXVA and external hardware video decoding cards).

Targeting wider application scenarios and more flexibility, we study how common DirectX-8-compatible graphics engines can be exploited to assist the CPU in speeding up video decoding. We chose DirectX-8 because of its power, programmability, predominance, and rich application program interfaces (APIs).5 Our study proves that GPU power can indeed be utilized for applications other than graphics, such as video decoding. Furthermore, since the major task of video decoding is already handled by the GPU's graphics engine, our approach provides a more efficient way to incorporate video into computer graphics. This is of high interest in today's gaming industry.

3[Online] Available: http://www.windowsmedia.com
4[Online] Available: http://www.realnetworks.com
5In legacy GPUs, prior to DirectX-8, the internal graphics pipeline is fixed. It can hardly be used to accelerate video decoding, since the fixed pipeline cannot be manipulated to perform video decoding operations and there is no way to guarantee the precision, whereas precision has to be ensured in all the operations involved in video decoding.

This work differs from other pioneering work that exploits the GPU for numeric computation or image processing [6]-[12] in the sense that it is a systematic work. In previous works, the focus was solely on how to efficiently map nongraphics operations entirely onto the graphics pipeline of the GPU; the impact of the CPU was neglected. In this work, the CPU and GPU have to be considered together, since the GPU alone cannot fulfill the requirements of video decoding. This work is also the first attempt to use the graphics pipeline of the GPU to handle video-related applications. Specifically, a video decoder has strict timing constraints: the bottom line is to decode and render the video in real time, without which the decoder would be useless. Other works, in contrast, mainly aimed to prove the feasibility of handling specific nongraphics operations with the GPU. Our system is also quite different from a dual-processor system because of the huge differences between the CPU and GPU. To utilize the CPU and GPU power most efficiently, we exploit the parallelism between the CPU and GPU by pipelining them, whereas for a dual-processor system the most efficient way is to assign one GOP to each CPU, or to divide a frame into halves with overlapping boundaries for motion compensation. Moreover, we believe a CPU-plus-GPU configuration is far more popular in consumer commodity PCs than a dual-processor configuration. As a first step, the approaches and results presented in this paper may not be perfect, but they shed some light on how people may use the GPU in the future.

The rest of the paper is organized as follows. In Section II, we briefly review the architecture of a modern GPU. In Section III, we highlight the general procedure for video decoding and analyze the complexity of its building blocks. We then present the challenges and our solution that utilizes the GPU to assist the CPU in Section IV. Experimental results and analysis are shown in Section V, and Section VI concludes the paper.

II. GRAPHICS ENGINE ARCHITECTURE

Recent dramatic increases in the computational power of GPUs have been fueled by design innovation and the continuing improvement in semiconductor technologies. A most significant step forward is the introduction of the user-programmable geometry engine [5] and the pixel pipeline. The principal 3-D APIs (DirectX and OpenGL) have evolved alongside graphics hardware. One of the most important new features in DirectX-8 graphics is the addition of a programmable pipeline that provides an assembly-language interface to the transformation and lighting hardware (vertex shader) and the pixel pipeline (pixel shader). In this section, we briefly overview the graphics pipeline of the GPU to give a high-level understanding of how the hardware renders scenes, with emphasis on the vertex shader and pixel shader that are used extensively in mapping video decoding technology to the graphics pipeline.

Fig. 1 depicts the graphics pipeline of a DirectX-8-compatible GPU (see the Microsoft DirectX-8 SDK documentation). Initially, a user application supplies the graphics hardware with raw geometry data (vertex or primitive data) specified in some local coordinate system. The hardware transforms this geometry into world space, then clips away parts of the transformed
geometry not contained within the user's view port. Next, the hardware performs lighting and color calculations and converts the vector-based geometry to a pixel-based raster representation. Textures are then applied. Finally, the resultant pixels are composed into the screen buffer. For legacy performance, most GPUs still keep the old fixed-function pipeline (the standard transform and lighting pipeline, where the functionality is essentially fixed).

In the figure, we highlighted two modules. The first module, the programmable pipeline, is also called the vertex shader because all the operations it performs are on vertex data. It operates on a vertex-by-vertex basis, i.e., per-vertex operation. All the data of a vertex is completely private and cannot be accessed by other vertices. Its basic function is to compute the position, color, and lighting of the polygon vertices. The second module is called the pixel shader because it operates on the pixel data in the frame buffer. The pixel shader operates on a pixel-by-pixel basis, i.e., per-pixel operation. Note that per-pixel operation is a conventional term in graphics engines; in this paper, it does not necessarily imply manipulation of pixels in the spatial domain. For example, a pixel could be a DCT coefficient in the DCT domain. The functionality of the pixel shader is similar to the vertex shader's, except that it manipulates pixel colors and textures rather than geometry. A typical application of the pixel shader is to perform texture blending. Both shaders are programmable through shader languages, which are basically custom assembly languages. It is the programmability of the shaders that enables the GPU to assist the CPU in software-based video decoding applications.

Modern mainstream GPUs are indeed very powerful, thanks to the complete fine-grained single-instruction multiple-data (SIMD) parallelism of the vertex and pixel shaders and the pipelined processing fashion. A GPU usually has on-chip memory (commonly called video memory). The memory bandwidth is tremendous due to the wide memory bus. In Table I, we list some performance metrics of the nVidia GeForce3 GPU.6

TABLE I. PERFORMANCE METRIC OF nVIDIA GEFORCE3 GPU

6[Online] Available: http://www.nvidia.com/view.asp?PAGE=geforce3

In summary, the programmable pipeline of GPUs gives the developer much more freedom to achieve special effects. The power of the GPU ensures that these special effects can be made into real-time graphics applications, as has been proven by the increasingly realistic scenes in computer games.

III. TYPICAL MOTION COMPENSATED VIDEO DECODER ARCHITECTURE

A typical motion compensated video decoder consists of several building blocks, namely variable length decoding (VLD), inverse quantization (IQ), inverse DCT (IDCT), motion compensation (MC), reconstruction, and color space conversion (CSC), as shown in Fig. 2. Motion compensation is the process that retrieves the prediction from the reference picture using the motion vectors. Reconstruction refers to the process that adds the residual signal (from the IDCT) to the motion compensated prediction to form a decoded picture. The color space conversion module converts the decoded picture from the YUV color space to the RGB color space for display purposes.7 This process is mathematically a 3 x 3 matrix multiplication.

Fig. 2. Block diagram of a typical decoder.

7Modern graphics cards may also support other display formats, such as YV12. In this case, the color space conversion is not needed, but the data still need to be manipulated, since all the supported display formats are generally interleaved while the YUV output of a decoder is planar.

Note that there is a feedback loop in a video decoder, which is exactly the same as that inside the encoder. The feedback loop consists of motion compensation and reconstruction. Note that it is this feedback loop that synchronizes the encoder and the decoder in the decoding results. Consequently, the motion compensated signal in the decoder has to precisely match that in the encoder, and the reconstruction process (including the clipping) must be accurate. Any error introduced in the feedback loop, even if it is small, will be accumulated and propagated to future frames. This is called drifting error. Drifting error usually leads to quick video quality degradation and, therefore, must be prevented.

Fig. 3. Relative positions of half-pel motion precision and corresponding filtering formula.

Motion compensated prediction is an efficient way to exploit the temporal correlation between neighboring frames of a video sequence. For every macroblock in a picture, the most similar (according to a certain criterion) macroblock in the reference picture is found by a motion estimation process and is signaled by a 2-D motion vector. Realizing the fact that a better prediction always leads to better coding efficiency, people have proposed many techniques to generate better predictions. Such techniques include bidirectional prediction (B-frames) [2], subpixel (half-pixel and even quarter-pixel) motion precision [3], unrestricted
Fig. 4. Profiling of building blocks. (a) All modules. (b) All modules except color space conversion.

MV [4], etc. In fact, some techniques like half-pixel and quarter-pixel motion precision have become common practice in advanced video codecs [3], [4]. To achieve subpixel motion precision, a 2-D filtering process is generally applied to produce the prediction. Specific rounding techniques are also introduced to control the rounding error during the filtering process. Fig. 3 illustrates the relative positions of half-pel motion precision and their corresponding filtering formulas using bilinear filtering with rounding control. When the filtering process becomes more complex or the motion precision becomes higher, the 2-D filtering process is often achieved through two one-dimensional (1-D) filtering processes. In these cases, the order in which the two 1-D filtering processes are performed must be kept the same for both the encoder and the decoder. Clearly, better prediction comes at the cost of higher computational complexity.

Video decoding is of high computational complexity due to the huge amount of video data and the complex transform and filtering processes involved. The most computationally expensive parts, in decreasing order, are CSC, MC, IDCT, IQ, and VLD. Fig. 4(a) shows the profiling result for WMV (Windows Media Video version 8) on a Pentium III 667-MHz CPU when decoding an HD (1280 x 720) video sequence. Evidently, the CSC and MC consume most of the overall computational power (more than 60%). Since the CSC module can usually be handled by the GPU, we also performed profiling that excludes the CSC module, as shown in Fig. 4(b). Clearly, the MC still occupies a significant portion of the whole computation. According to the profiling results, it is very desirable that the color space conversion and motion compensation be handled more efficiently. The best would be not doing them at all. This can be achieved by moving these two modules to the GPU, as will be discussed in the next section.

IV. GPU-ASSISTED VIDEO DECODING

Since the GPU is specially designed for faster graphics operations and better graphics effects, rather than for assisting video decoding, there is no direct mapping of video decoding algorithms to the 2-D or 3-D graphics engines. In this section, we first explore the feasibility of GPU acceleration for video decoding and the constraints of the GPU. We then present our proposed architecture, the working flow, and the key method to achieve drifting-free motion compensation. Finally, we discuss a few optimization techniques used in this study.

A. Feasibility

Even though the GPU is not designed for accelerating video decoding, the per-vertex and per-pixel operations of the GPU may still be utilized to partially handle the video decoding task. That is, we can use the GPU to off-load some video decoding stages that involve only per-vertex and per-pixel operations in nature.

TABLE II. NATURE OF EACH MODULE OF A VIDEO DECODER

We analyze the nature of each module of a typical video decoder in Table II. In the table, block-wise means that the operations are performed on a rectangular block-by-block basis. A block is a regular shape whose vertices can be handled by the vertex shader efficiently. Per-pixel means that all the pixels in a block go through the same processing procedure. For example, in the CSC process, every pixel is translated from YUV space to RGB space using the same equation, while in the IDCT every pixel is transformed using different DCT bases determined by its position. Clearly, the most computationally complex MC and CSC modules are intrinsically suitable for the GPU to process, since they are both block-wise and per-pixel operations. Note that although inverse quantization (IQ) is a block-wise and per-pixel operation, we decided not to handle it on the GPU. Otherwise, it would cause too much memory traffic between the CPU and GPU, because the subsequent operation, the IDCT, is not a per-pixel operation and has to be handled by
Fig. 5. GPU-assisted video decoding architecture.

the CPU. The VLD is a purely sequential operation and has to be handled by the host CPU.

The conclusion is that it is feasible for the GPU to off-load the two most expensive modules from the CPU.

B. Constraints of GPU

Even though the GPU is very powerful, it still has many constraints. These constraints are more visible when exploiting the GPU for nongraphics-oriented applications. Some main constraints are as follows.

- The memory bandwidth between the CPU and GPU is limited. Memory bandwidth is often the bottleneck for both graphics and nongraphics applications [9]. Moreover, due to the common asymmetric implementation of the memory access path, as a rule of thumb, read-backs from GPU memory to main memory should generally be avoided in a practical design.

- The internal precision of the pixel shader is limited. For example, the nVidia GeForce3 GPU's internal precision is only up to 9 b: the input and output are 8-b precision, while the internal precision of the ALU is 9 b. The rounding behavior is not specified in DirectX-8 either. Note that the limited-precision constraint is not so severe a drawback for computer graphics as it is for video decoding, because all the frames in graphics applications are freshly generated. In other words, there is no error accumulation and propagation, whereas exactly the contrary is the case in video decoding. This constraint is also well recognized in other research works [8], [13].

- The instruction set is small. Most instructions are specially designed for graphics operations. For example, there are no flow-control instructions (or even logical operations such as AND and OR) in either the vertex shader or the pixel shader in DirectX-8. It also lacks bit-wise operations such as shifts.

- The code line count for pixel shader programming is limited. Only up to four texture instructions and up to eight arithmetic instructions are allowed in any rendering pass. The immediate result of this constraint is that a relatively complex process may have to be divided into multiple passes. This inevitably increases the overhead, since some extra textures must be read and written in each pass.

- Only a few texture formats can be used as render targets. Since the computation of the pixel shader cannot be performed without rendering a scene, this may lead to a waste of memory and computation power. This is also partially due to the constraint that not all the channels of the GPU can be manipulated separately. For example, the RGB channels can only be manipulated together, even though the alpha channel can be manipulated independently.

C. Proposed System Architecture

According to the feasibility analysis above, our solution is to move to the GPU the whole feedback loop that consists of motion compensation, color space conversion, and display, as shown in Fig. 5. For the sake of clearer explanation, we present our GPU-accelerated video decoder in a top-down fashion and elaborate on the techniques that overcome the GPU constraints along the way.

First of all, it is evident from Fig. 5 that reading data back from the GPU is completely avoided by moving the whole feedback loop to the GPU. The CPU does not access the data any more after submitting them to the GPU.

Secondly, as mentioned in the introduction, the CPU and GPU are pipelined to exploit the parallelism between them. Specifically, the CPU reads in the bit stream (using the asynchronous read method of MS Visual C++ 6.0), performs VLD, IQ, and IDCT, and does some other preparation work for the GPU, such as geometry data generation. The CPU then passes the data to the GPU, and the GPU starts to perform the MC, CSC, and display tasks. While the GPU is fulfilling its assigned tasks, the CPU proceeds to process the next frame. The procedure is repeated until the end of the bit stream. Obviously, the CPU and GPU indeed work in a pipelining manner. Note that the throughput of a pipeline is determined by its most time-consuming component. Therefore, to get the maximum efficiency, the load between the CPU and GPU should be balanced. Since the GPU can handle MC, CSC, and display very effectively, the proposed task assignment for the CPU and GPU results in a good load balance between them, as observed in our experiments. However, since the complexity may fluctuate from frame to frame, it is impossible for the load of the CPU and GPU to be balanced on every frame. To maximize the pipeline efficiency, we adopt a large buffer (typically between
4-8 frames) between the CPU and GPU to store the intermediate data generated by the CPU. The intermediate buffer effectively absorbs most of the decoding jitter of both the CPU and GPU and contributes significantly to the overall speed-up, as will be shown in Section V.
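The CPU/GPU pipelining with an intermediate buffer can be sketched as a bounded producer/consumer queue. The following Python sketch is purely illustrative (thread-based, with stand-in stage bodies); the function and variable names are ours, not the paper's:

```python
import queue
import threading

def run_pipeline(frames, buffer_size=6):
    # CPU stage (VLD, IQ, IDCT, geometry preparation) and GPU stage
    # (MC, CSC, display) run concurrently; a bounded buffer of
    # intermediate frame data absorbs per-frame complexity jitter.
    buf = queue.Queue(maxsize=buffer_size)   # the 4-8 frame buffer
    displayed = []

    def cpu_stage():
        for frame in frames:
            intermediate = ("residuals+mv", frame)  # stand-in for IDCT output
            buf.put(intermediate)                   # blocks when the buffer is full
        buf.put(None)                               # end of bit stream

    def gpu_stage():
        while True:
            item = buf.get()
            if item is None:
                break
            _, frame = item
            displayed.append(frame)                 # stand-in for MC + CSC + display

    producer = threading.Thread(target=cpu_stage)
    producer.start()
    gpu_stage()
    producer.join()
    return displayed
```

Because the queue is bounded, a temporarily slow consumer stalls the producer instead of growing memory without limit, mirroring how the intermediate buffer bounds how far the CPU may run ahead of the GPU.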

D. GPU Working Flow


Fig. 6. Divide-and-conquer method for MV handling.

Now let us briefly go through the working flow of MC and CSC inside the GPU. Motion vectors are first transferred from the CPU to the GPU and processed by the vertex shader. The vertex shader generates the target block positions for triangle setup and the texture addresses for sampling the textures. The pixel shader then uses the texture addresses generated by the vertex shader to sample the texture, which is stored in the frame buffer of the GPU, performs the necessary arithmetic operations to obtain the motion compensated prediction, and renders to the desired positions of the target texture. By the time the motion compensation is done, the CPU will have transferred the difference data to the GPU. The GPU then reconstructs a new picture by adding the difference to the motion compensated prediction. The picture goes through color space conversion and is sent for display, or is stored in the display buffer if there are B-frames in the bit stream. For I-frames and P-frames, this reconstructed picture will also be used as the reference for the next frame.

Fig. 7. Detailed MC working flow in GPU.
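As noted in Section III, the color space conversion step is mathematically a 3 x 3 matrix multiplication per pixel. A minimal sketch with one common BT.601-style integer approximation (the exact coefficients are codec- and display-dependent and are not specified in this paper):

```python
def yuv_to_rgb(y, u, v):
    # One 3x3 matrix multiply (plus chroma offsets) per pixel;
    # integer BT.601-style coefficients, illustrative only.
    c, d, e = y - 16, u - 128, v - 128
    clip = lambda x: max(0, min(255, x))
    r = clip((298 * c + 409 * e + 128) >> 8)
    g = clip((298 * c - 100 * d - 208 * e + 128) >> 8)
    b = clip((298 * c + 516 * d + 128) >> 8)
    return r, g, b
```

Every pixel undergoes the same equation, which is exactly the kind of uniform per-pixel work the fourth pixel shader performs in the proposed decoder.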
1) Vertex Shader Operations: The vertex shader is used to handle the motion vectors in the MC process, because the position of a macroblock (or an 8 x 8 block) is solely determined by its four vertices and the operation for each vertex is the same. Since the vertex data is private, the MV information is replicated to all four vertices of a block, which is also due to the translational motion model. The vertex shader computes the target block positions and the source (reference) texture addresses for the macroblocks to be motion-compensated.

MC can be performed at different precisions, such as integer-pixel level and subpixel level (e.g., half-pel and quarter-pel). For subpixel-level MC, some interpolation process is generally involved. The vertex shader needs to ensure that the correct pixels are sampled for the interpolation. Although it is possible to treat the integer-pel MC process as a special case of the subpel MC process, it is not wise to do so, because subpel MC generally involves more rendering passes. Applying the same reasoning, the half-pel and quarter-pel MC processes are differentiated. Furthermore, there is a special INTRA block type for which no motion compensation is needed. There are two possible ways to handle INTRA blocks. One way is to handle them via a separate rendering pass that zeros out their prediction; the other way is to treat them as INTER blocks with pseudo motion vectors pointing to some zero area, e.g., an area outside the texture. We adopted the second method because it does not require any extra rendering pass.

In summary, we adopted a divide-and-conquer method to handle the different MC types and resolutions. Blocks belonging to the same category are batched and processed together, as shown in Fig. 6, where PS stands for pixel shader.

2) Pixel Shader Operations: There are generally four main steps that involve pixel shaders, as shown in Fig. 7. In the figure, a rectangle represents a texture and a round-cornered rectangle stands for a pixel shader procedure. The first pixel shader (PS1) prepares a padded and decomposed reference for the subsequent motion compensation process, which is handled by the second pixel shader (PS2). The third pixel shader (PS3) then adds the difference data transferred from the CPU to the motion compensated prediction to form the final reconstructed picture, which will in turn serve as the reference for the next frame. The fourth pixel shader (PS4) converts the reconstructed picture from YUV format to RGB format and outputs it to the display frame buffer.

In order for the GPU to fulfill the reconstruction, the CPU must prepare and transfer the difference data to the GPU. On one hand, the memory bandwidth between the CPU and GPU is a scarce resource; on the other hand, there are many zero blocks in the difference data. To reduce the memory bandwidth requirement, only nonzero blocks are transferred. Note that the zero blocks can be easily identified during the VLD process, since they are either skipped blocks or their coded block patterns (CBPs) are zero. This also helps to speed up the rendering process, because fewer blocks need to be rendered.

The padding process is necessary to handle the unrestricted motion vector (UMV) coding option, in which a motion vector is allowed to point outside the reference picture. Note that, for some graphics card driver versions, only swizzled textures, of which both the width and the height must be a power of two, are supported. In this case, the padding process is absolutely indispensable. However, some other driver versions support linear textures, that is, the texture can be exactly the same size as a picture. If this is the case, the padding process can be implicitly achieved by specifying a texture address that points outside the texture and configuring the texture addressing mode to be D3DTADDRESS_CLAMP.
E. Drifting-Free MC
All the steps mentioned above are relatively easy since MC is
basically a translation operation of texture blocks. The difficulty
lies in the drifting prevention given the limited internal precision
of a GPU. It is just this drifting effect that significantly differ-
entiates the video decoding from normal graphics operations.
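A toy 1-D illustration of drifting (ours, not the paper's): with zero residuals, an encoder that rounds its half-pel averages and a decoder that truncates them diverge further with every frame, because the mismatch is fed back through the loop.

```python
def half_pel_row(row, r):
    # 1-D bilinear half-pel filter; r = 1 rounds to nearest,
    # r = 0 truncates.
    return [(a + b + r) >> 1 for a, b in zip(row, row[1:])]

def drift_after(n_frames, width=64):
    # Encoder and decoder each motion-compensate their own
    # reconstruction; a 1-LSB rounding mismatch accumulates frame
    # by frame instead of staying bounded.
    enc = list(range(width))   # a ramp signal makes the effect exact
    dec = list(range(width))
    for _ in range(n_frames):
        enc = half_pel_row(enc, 1)   # encoder's reconstruction
        dec = half_pel_row(dec, 0)   # decoder's reconstruction
    return max(abs(e - d) for e, d in zip(enc, dec))
```

On the ramp signal, the drift equals the frame count: a single low-order-bit mismatch per frame grows into gross, visible error, which is why the feedback loop on the GPU must be bit-exact.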
Fig. 8. Pixel shader code segment for decomposition with regard to 4.
There are mainly two restrictions: overflow and rounding. We use half-pel MC as an example, assuming the motion vector is (0.5, 0.5). The filtering process can be expressed as

p = (a + b + c + d + r) / 4 (1)

where a, b, c, and d are the four reference pixels surrounding the half-pel position, the division truncates, and r is the rounding control parameter, which takes the value 0 (rounding control is off) or 2 (rounding control is on). Since the data is only 8 b, to avoid overflow, one would hope to calculate p using either (2) or (3); neither causes overflow, thanks to the "_d2" (divide-by-two) instruction modifier:

p = ((a + b)/2 + (c + d)/2 + r/2) / 2 (2)

p = ((a/2 + b/2) + (c/2 + d/2) + r/2) / 2. (3)

Unfortunately, due to the limited internal precision, none of the three expressions (1) through (3) are equivalent.

Fig. 9. Pixel shader code segment for half-pel MC where MV = (0.5, 0.5).
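The decomposition approach described in the following paragraphs can be checked numerically. A sketch (ours), with the channel assignments of Figs. 8 and 9 reduced to plain integers:

```python
def half_pel_direct(a, b, c, d, r):
    # The half-pel average at full precision; r is 0 or 2.
    return (a + b + c + d + r) >> 2

def half_pel_two_pass(a, b, c, d, r):
    # Pass 1 (cf. Fig. 8): decompose each 8-b pixel with regard to 4;
    # quotients go to the RGB channels, residuals to alpha.
    q = [p >> 2 for p in (a, b, c, d)]   # quotients, at most 63 each
    s = [p & 3 for p in (a, b, c, d)]    # residuals, at most 3 each
    # Pass 2 (cf. Fig. 9): only the residuals and the rounding control
    # can lose precision, so they are summed (at most 3*4 + 2 = 14)
    # and divided by 4 last; the final sum is at most 63*4 + 3 = 255,
    # so every intermediate value fits in 8 bits.
    return sum(q) + ((sum(s) + r) >> 2)
```

Since a + b + c + d + r = 4(sum of quotients) + (sum of residuals) + r, the two functions agree exactly for all 8-b inputs, while the two-pass form never produces an intermediate value above 255.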

We solve the problem by using a multipass rendering technique. In the first pass, the pixels a, b, c, and d are decomposed with regard to 4, and the resulting quotients and residuals are stored in the RGB channels and the alpha channel, respectively. In the second pass, the residuals and the rounding control parameter are summed up and divided by 4, and the result is then added to the sum of the quotients. The correctness of this process is easy to verify. The key point is that only the residuals and the rounding control parameter may lead to precision loss. Realizing this, it is natural to aggregate all the residuals and the rounding control parameter and do the division in the last step. The sum of the residuals and the rounding control parameter is guaranteed not to overflow, thanks to the decomposition. The idea can be easily generalized to other filtering coefficients.

The pixel shader code fragment of the decomposition pass is listed in Fig. 8. Note that we used two multiply-by-0.5 instructions to achieve the division by 4. One cannot simply multiply by 0.25, nor can one use the "mov_d2" form, because of the internal rounding effect.

The pixel shader code segment of the second pass is listed in Fig. 9. The quotients are summed up in the RGB channels while the residuals are aggregated in the alpha channel. Note that the rounding control parameter becomes invisible because it is absorbed in counteracting the rounding effects of the GPU.

F. GPU Optimization Techniques

In a real implementation, a few techniques may be applied to achieve higher GPU computation efficiency.

1) Pixel-Packing: The time the GPU consumes to render a scene is approximately proportional to the number of pixels rendered, provided that other factors such as the geometry and lighting remain the same. According to the DirectX-8 specifications, the only render target format that suits the video decoding application is the 32-b D3DFMT_A8R8G8B8 format. All four channels are always running concurrently, even if only one channel is intended to be used. Therefore, it is very beneficial to pack more pixels into one 32-b pixel and process them together, simultaneously. Note that there may exist trade-offs for pixel packing because it causes an extra load on the CPU.

2) Look-Up Table (LUT) Via Dependent Texture Read: Table look-up is a useful optimization technique because many expensive computations can be performed beforehand. Table look-up can be achieved through a dependent texture read, by using the current pixel value as the texture coordinate and retrieving the desired result from the corresponding pre-computed texture sample. Table look-up also helps to reduce the number of rendering passes because texture addressing instructions are not counted toward the code line count, on which DirectX-8 imposes a constraint. In fact, it is this technique that enables us to achieve padding and decomposition in a single rendering pass (refer to PS1 in Fig. 7).

In summary, we have built an efficient architecture wherein the CPU and GPU are pipelined to harness the GPU power to accelerate video decoding. Several techniques were developed to overcome the various constraints of the GPU. Drifting-free subpel motion compensation was achieved. We also proposed the pixel-packing and table look-up techniques to improve the effectiveness of the GPU computation power.

V. EXPERIMENTAL RESULTS

We have implemented our solution for the Windows Media Video (version 8) decoder. We performed extensive tests on a PC with an Intel Pentium III 667-MHz CPU, 256-MB memory, and an nVidia GeForce3 Ti200 GPU. Some of the initial experimental results are reported in Table III. The test sequences are Football, Total, and Trap. The Football sequence is a standard MPEG test sequence in SIF format (320 × 240) with very high motion. The Total sequence is a concatenation of several

standard MPEG test sequences (such as Carphone, Stephan, Silence, Akiyo, Mobile-Calendar, etc.) in CIF format (352 × 288). The Trap sequence is a high-definition version (1280 × 720) of the movie trailer of "The Parent Trap" (Disney, 1998). The original frame rate of Trap is 23.98 f/s. We encoded the SIF and CIF sequences at 2 Mb/s and the HD sequence at 5 Mb/s, respectively (with the WMV technique, the HD quality is very good at this rate).

TABLE III
EXPERIMENTAL RESULTS OF GPU-ASSISTED VIDEO DECODING ON A PC WITH AN INTEL PENTIUM III 667-MHZ CPU, 256-MB MEMORY, AND AN nVIDIA GEFORCE3 Ti200 GPU

We compare the video decoding speed achieved using the CPU only (with MMX technology) against that achieved with GPU acceleration, as shown in Table III. The intermediate buffer size is four frames (i.e., 10.54 MB for the HD 720p sequence). Clearly, the speed is significantly improved by leveraging the power of the GPU's graphics engine. It is interesting to observe that the speed-up of the Total sequence is much higher than that of the Football sequence, while the speed-up of Trap is by far the most significant. The reasons are multifold. First of all, the speed-up depends on the motion. The speed-up may also rely on the memory bandwidth requirement. For the same bit rate, due to its high-motion nature, the Football sequence produces many more nonzero difference blocks than the Total sequence does. On the other hand, the Trap sequence at 5 Mb/s leads to a high percentage of zero difference blocks. As a result, the memory bandwidth requirement is greatly reduced, since we do not transfer any zero block.

Careful readers may find that the speed-up for the Trap sequence actually exceeds what one would expect from the profiling results. The reason, again, is the memory bandwidth requirement. For the CPU-only case, what is transferred to the GPU for display is the raw RGB24 format. However, in our proposed architecture, what is transferred to the GPU are only the nonzero IDCT blocks and some vertex data. Statistically, the memory bandwidth requirement is only about 50%.

To study how the intermediate buffer size may affect the overall speed-up, we performed experiments with different intermediate buffer sizes, expressed in terms of the number of frames. The sequences we used are Dinosaur and Liquid, both HD 720p, and the bit rate is 5 Mb/s. The experimental results are shown in Fig. 10. From the figure, we can clearly observe that the intermediate buffer size indeed has a significant impact on the overall decoding speed. For example, when increasing the intermediate buffer from 3 to 4 frames, the decoding speed increased by about 4.5 f/s, which is around a 15% improvement. However, further increases in the intermediate buffer size do not bring significant decoding speed gains. The decoding speed tends to saturate when the intermediate buffer size exceeds 6 frames. The reason for this phenomenon is that a 4-frame intermediate buffer is large enough to achieve a good match between the CPU and GPU processing speeds, or in other words, to effectively absorb the jitters of the CPU and GPU processing times. We empirically set the intermediate buffer size to 4 frames. However, if memory allows, it is advisable to use 6 frames or more to ensure a fast decoding speed.

Fig. 10. Impact of the intermediate buffer size on the overall decoding speed. The intermediate buffer size is expressed in terms of the number of frames.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have demonstrated that the GPU can help the CPU to accelerate video decoding. We proposed an efficient architecture wherein the CPU and the GPU are pipelined. The whole feedback loop of the decoder is moved into the GPU to avoid any read-back from the GPU. Drifting-free motion compensation was achieved even though the internal precision of the GPU is quite limited. Several GPU optimization techniques were also presented. We successfully implemented our solution for the proprietary WMV (version 8) decoder. We achieved real-time decoding of high-definition video on a PC with a slow CPU and a powerful DirectX-8-capable GPU, which is definitely impossible otherwise.

The significance of the proposed technology is twofold. Firstly, it can help to accomplish tasks that would otherwise be impossible; the configuration we used in the experiments is just an extreme example. Secondly, the GPU can be exploited to off-load some of the CPU's tasks, and the CPU, in return, will have more capacity to handle other tasks. In fact, even when there are multiple CPUs, the GPU can still be helpful. As GPU performance grows faster than CPU performance, we strongly believe there is a trend in both academia and industry toward using the GPU for more computational tasks.

Our future work is to further investigate the power of the GPU and to optimize the CPU code to achieve real-time decoding of HD video at even higher bit rates. We will also apply the proposed solution to other decoders such as MPEG-2. At the time of this writing, DirectX-9-capable GPUs have already emerged. How to utilize DirectX-9-capable GPUs to accelerate video decoding will be one of our focuses as well.

Guobin Shen (S'99–M'02) received the B.S. degree from Harbin University of Engineering, Harbin, China, in 1994, the M.S. degree from Southeast University, Nanjing, China, in 1997, and the Ph.D. degree from the Hong Kong University of Science and Technology (HKUST), Hong Kong, in 2001, all in electrical engineering.
He was a Research Assistant at HKUST from 1997 to 2001. Since then, he has been with Microsoft Research Asia, Beijing, China. His research interests include digital image and video signal processing, video coding and streaming, peer-to-peer networking, and parallel computing.

Guang-Ping Gao received the B.S. degree in computer science from Tsinghua University, Beijing, China.
He joined Microsoft Research Asia, Beijing, China, as a Research Software Development Engineer in August 2002.

Shipeng Li, photograph and biography not available at the time of publication.

Heung-Yeung Shum, photograph and biography not available at the time of publication.

Ya-Qin Zhang (S'87–M'90–SM'93–F'98) received the B.S. and M.S. degrees from the University of Science and Technology of China (USTC) in 1983 and 1985, respectively, and the Ph.D. degree from George Washington University, Washington, DC, in 1989, all in electrical engineering. He had executive business training from Harvard University.
He joined Microsoft Research China, Beijing, in January 1999, leaving his post as the Director of the Multimedia Technology Laboratory at Sarnoff Corporation, Princeton, NJ (formerly David Sarnoff Research Center, and RCA Laboratories). He has been engaged in research and commercialization of MPEG2/DTV, MPEG4/VLBR, and multimedia information technologies. He was with GTE Laboratories Inc., Waltham, MA, and the Contel Technology Center in Virginia from 1989 to 1994. He has authored and co-authored over 200 refereed papers in leading international conferences and journals. He has been granted over 40 U.S. patents in digital video, Internet, multimedia, wireless, and satellite communications. Many of the technologies he and his team developed have become the basis for start-up ventures, commercial products, and international standards. He serves on the Board of Directors of five high-tech IT companies. He has been a key contributor to the ISO/MPEG and ITU standardization efforts in digital video and multimedia.
Dr. Zhang served as the Editor-in-Chief of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY from July 1997 to July 1999. He was the Chairman of the Visual Signal Processing and Communications Technical Committee of the IEEE Circuits and Systems Society. He serves on the editorial boards of seven other professional journals and on over a dozen conference committees. He has been the recipient of numerous awards, including several industry technical achievement awards and IEEE awards such as the CAS Jubilee Golden Medal. He was named "Research Engineer of the Year" in 1998 by the Central Jersey Engineering Council for his "leadership and invention in communications technology, which has enabled dramatic advances in digital video compression and manipulation for broadcast and interactive television and networking applications." He recently received the prestigious national award as "The Outstanding Young Electrical Engineer of 1998," given annually to one electrical engineer in the United States.
