Abstract—Most modern computers or game consoles are equipped with powerful yet cost-effective graphics processing units (GPUs) to accelerate graphics operations. Though the graphics engines in these GPUs are specially designed for graphics operations, can we harness their computing power for more general nongraphics operations? The answer is positive. In this paper, we present our study on leveraging the GPU's graphics engine to accelerate video decoding. Specifically, a video decoding framework that involves both the central processing unit (CPU) and the GPU is proposed. By moving the whole motion compensation feedback loop of the decoder to the GPU, the CPU and GPU have been made to work in parallel in a pipelining fashion. Several techniques are also proposed to overcome the GPU's constraints or to optimize the GPU computation. Initial experimental results show that significant speed-up can be achieved by utilizing the GPU power. We have achieved real-time playback of high-definition video on a PC with an Intel Pentium III 667-MHz CPU and an nVidia GeForce3 GPU.

Index Terms—General-purpose computing, graphics processing unit (GPU), video decoding acceleration.

I. INTRODUCTION

MULTIMEDIA content is the core of digital entertainment. However, multimedia content processing usually requires very high computational power due to its intrinsically huge volume of data. As the key to most multimedia applications, video encoding and decoding technologies are evolving along a path that trades complexity for coding efficiency. On the other hand, people are becoming more critical of visual quality. The consumer electronics market has revealed the trend that high-resolution video equipment is occupying more and more market share. All these factors inevitably pose an ever-increasing demand on processing power. In the last decade, some multimedia-oriented SIMD processor extensions such as Intel's MMX/SSE instructions were introduced into CPU designs. Although these instructions can improve performance significantly and are used extensively throughout multimedia applications, the CPU is still usually heavily loaded, putting aside those cases where the CPU's processing power cannot meet the requirement at all. For example, the CPUs in most household PCs are currently not powerful enough to decode high-definition (HD) video in real time even with highly optimized code.

With the development of silicon technologies, more and more inexpensive yet powerful graphics processing units (GPUs) can be found in mainstream commodity PC machines and game consoles. As the name suggests, GPUs are equipped with specialized processors for two-dimensional (2-D) and three-dimensional (3-D) graphics operations, and they indeed do a good job in graphics-oriented applications.1 These GPUs are very powerful. For example, the nVidia GeForce3 chip contains more transistors than the Intel Pentium IV, and its successor, the GeForce4, is advertised as being able to perform more than 1.2 trillion internal operations per second. The internal pipelined processing fashion also makes the GPU suitable for stream processing. Furthermore, GPU speed grows much faster than the famous Moore's law for CPUs, that is, 2.4 times/year versus 2 times per 18 months. Moreover, a GPU usually contains multiple (typically 4 or 8) parallel pipelines and is indeed a SIMD processor. Another important feature of most contemporary GPUs is the programmability of the partial or full graphics pipeline, thanks to the introduction of vertex shaders and pixel shaders in DirectX-8. The powerfulness, SIMD operation, and programmability of GPUs have motivated an active research area of using the GPU for nongraphics-oriented operations such as numerical computations like basic linear algebra subprograms (BLAS) [6] and image/volume processing [7]–[12]. In [7], the authors presented a technique for multiplying large matrices quickly using the graphics hardware of a PC. Strzodka and Rumpf have implemented complicated numerical schemes solving parabolic differential equations fully in graphics hardware [8]. Thompson et al. introduced a programming framework for using modern graphics architectures for general-purpose computing [9]. In [10], the authors tested five color image processing algorithms using the latest programmability features available in DirectX-9 compatible GPUs. Performing the FFT on a GPU was reported in [11]. Wavelet decomposition and reconstruction was implemented on modern OpenGL-capable graphics hardware [12].

In this paper, we present our study on accelerating digital video decoding using the programmable graphics pipeline of a commodity GPU. It is natural to leverage the GPU to off-load some of the CPU's tasks when the CPU is heavily loaded while the GPU is idle, which is often the case for nongraphics-oriented applications such as video decoding. As a matter of fact, most of today's GPUs have a special hardware unit that can perform the video decoding process provided that the video is encoded with a certain specific standard, thanks to the well-established international video coding standards such as MPEG-1/2/4 [1]–[3] and the widely accepted DirectX video accelerator (DXVA) specification.2 However, the application of such a hardware video decoding unit is very limited. For

Manuscript received September 30, 2003; revised March 31, 2004.
G. Shen, G.-P. Gao, S. Li, and H.-Y. Shum are with Microsoft Research Asia, Beijing 100080, China (e-mail: jackysh@microsoft.com; ggao@microsoft.com; spli@microsoft.com; hshum@microsoft.com).
Y.-Q. Zhang is with the Microsoft Corporation, Redmond, WA 98052 USA (e-mail: yzhang@microsoft.com).
Digital Object Identifier 10.1109/TCSVT.2005.846440
1[Online] Available: http://www.siggraph.org/s2002/conference/papers/papers8.html
2[Online] Available: http://www.microsoft.com/whdc/hwdev/tech/stream/DirectX_VA/default.mspx
example, they cannot handle video content that is coded with proprietary yet very popular video coding formats such as Windows Media Video (WMV)3 and RealVideo.4 Moreover, even though almost all GPUs can help with video rendering (through overlay), they provide only very limited flexibility in manipulating the decoded video. On the contrary, the solution provided in this paper is a purely software-based solution that automatically avoids all the above-mentioned disadvantages of hardware-based solutions (including on-chip video accelerating modules conforming to DXVA and external hardware video decoding cards).

Targeting wider application scenarios and more flexibility, we study how the common DirectX-8 compatible graphics engines can be exploited to assist the CPU in speeding up video decoding. We chose DirectX-8 because of its powerfulness, programmability, predominance, and rich application program interfaces (APIs).5 Our study proves that GPU power can indeed be utilized for applications other than graphics, such as video decoding. Furthermore, since the major task of video decoding is already handled by the GPU's graphics engine, it provides a more efficient way to incorporate video into computer graphics. This is of high interest in today's gaming industry.

This work differs from other pioneering work that exploits the GPU for numeric computation or image processing [6]–[12] in the sense that it is a systematic work. In previous works, the focus was solely on how to efficiently map nongraphics operations entirely onto the graphics pipeline of the GPU. The impact of the CPU was neglected. In this work, the CPU and GPU have to be considered together, since the GPU alone cannot fulfill the requirements of video decoding. This work is also the first attempt to use the graphics pipeline in the GPU to handle video-related applications. Specifically, a video decoder has strict timing constraints. The bottom line is to decode and render the video in real time, without which it would be useless. However, other works were mainly meant to prove the feasibility of handling specific nongraphics operations with the GPU. The system is also quite different from a dual-processor system because of the huge differences between CPU and GPU. To most efficiently utilize the CPU and GPU power, we exploit the parallelism between CPU and GPU by pipelining them, whereas for a dual-processor system the most efficient way is to assign one GOP to each CPU or to divide a frame into halves with overlapping boundaries for motion compensation. Moreover, we believe a CPU-plus-GPU configuration is far more popular than a dual-processor configuration in consumer commodity PCs. As a first step, the approaches and results presented in this paper may not be perfect, but they will shed some light on how people will use the GPU in the future.

The rest of the paper is organized as follows. In Section II, we briefly review the architecture of the modern GPU. In Section III, we highlight the general procedure for video decoding and analyze the complexity of its building blocks. We then present the challenges and our solution that utilizes the GPU to assist the CPU in Section IV. Some experimental results and analysis are shown in Section V, and Section VI concludes the paper.

II. GRAPHICS ENGINE ARCHITECTURE

Recent dramatic increases in the computational power of GPUs have been fueled by design innovation and the continuing improvement in semiconductor technologies. A most significant step forward is the introduction of the user-programmable geometry engine [5] and the pixel pipeline. The principal 3-D APIs (DirectX and OpenGL) have evolved alongside graphics hardware. One of the most important new features in DirectX-8 graphics is the addition of a programmable pipeline that provides an assembly language interface to the transformation and lighting hardware (vertex shader) and the pixel pipeline (pixel shader). In this section, we briefly overview the graphics pipeline of the GPU to give a high-level understanding of how the hardware renders scenes, with emphasis on the vertex shader and pixel shader, which are used extensively in mapping video decoding technology to the graphics pipeline.

Fig. 1 depicts the graphics pipeline of a DirectX-8 compatible GPU (see Microsoft DirectX-8 SDK documentation). Initially, a user application supplies the graphics hardware with raw geometry data (vertex or primitive data) specified in some local coordinate system. The hardware transforms this geometry into world space, then clips away parts of the transformed

3[Online] Available: http://www.windowsmedia.com
4[Online] Available: http://www.realnetworks.com
5In legacy GPUs, prior to DirectX-8, the internal graphics pipeline is fixed. It can hardly be used to accelerate video decoding, since the fixed pipeline cannot be manipulated to perform video decoding operations and there is no way to guarantee the precision, whereas the precision has to be ensured in all the operations involved in video decoding.
SHEN et al.: ACCELERATE VIDEO DECODING WITH GENERIC GPU 687
geometry not contained within the user's viewport. Next, the hardware performs lighting and color calculations and converts the vector-based geometry to a pixel-based raster representation. Textures are then applied. Finally, the resultant pixels are composed into the screen buffer. For legacy performance, most GPUs still keep the old fixed-function pipeline (the standard transform and lighting pipeline where the functionality is essentially fixed).

TABLE I
PERFORMANCE METRIC OF nVIDIA GEFORCE3 GPU

In the figure, we highlighted two modules. The first module of the programmable pipeline is also called the vertex shader, because all the operations it performs are on vertex data. It operates on a vertex-by-vertex basis, i.e., per-vertex operation. All the data of a vertex are completely private and cannot be accessed by other vertices. Its basic function is to compute the position, color, and lighting of the polygon vertices. The second module is called the pixel shader, because it operates on the pixel data in the frame buffer. The pixel shader operates on a pixel-by-pixel basis, i.e., per-pixel operation. Note that per-pixel operation is a conventional term for graphics engines. In this paper, it does not necessarily imply manipulation of pixels in the spatial domain. For example, a pixel could be a DCT coefficient in the DCT domain. The functionality of the pixel shader is similar to the vertex shader's, except that it manipulates pixel colors and textures rather than geometry. A typical application of the pixel shader is to perform texture blending. Both shaders are programmable through shader languages, which are basically custom assembly languages. It is the programmability of the shaders that enables the GPU to assist the CPU in software-based video decoding applications.

Fig. 2. Block diagram of a typical decoder.
Fig. 3. Relative positions of half-pel motion precision and corresponding filtering formula.
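The per-pixel model described above — one small program applied independently to every pixel, with no access to neighboring pixels' results — can be sketched in ordinary Python as an analogy (the shader function and framebuffer layout here are invented for illustration; a real DirectX-8 pixel shader is written in shader assembly and executed by the GPU in SIMD fashion):

```python
def brighten(pixel):
    """A toy 'pixel shader': one small program, run independently per pixel,
    with no access to other pixels' results (per-pixel operation)."""
    r, g, b, a = pixel
    scale = lambda v: min(255, int(v * 1.5))  # saturate, as GPU arithmetic does
    return (scale(r), scale(g), scale(b), a)  # alpha channel left untouched

def run_pixel_shader(shader, framebuffer):
    """Apply the shader to every pixel; a GPU would do this in SIMD parallel."""
    return [[shader(px) for px in row] for row in framebuffer]

framebuffer = [[(100, 100, 100, 255), (200, 0, 0, 255)]]
output = run_pixel_shader(brighten, framebuffer)
```

Because each invocation is independent, the hardware is free to evaluate all pixels concurrently — the property the rest of the paper relies on.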
Modern mainstream GPUs are indeed very powerful, thanks to the complete fine-grained single-instruction multiple-data (SIMD) parallelism of the vertex and pixel shaders and the pipelined processing fashion. A GPU usually has on-chip memory (commonly called video memory). The memory bandwidth is tremendous due to the wide memory bus. In Table I, we list some performance metrics of the nVidia GeForce3 GPU.6

In summary, the programmable pipeline of GPUs gives the developer a lot more freedom to achieve special effects. The power of the GPU ensures that these special effects can be made into real-time graphics applications, as has been proven by the ever more realistic scenes in computer games.

III. TYPICAL MOTION COMPENSATED VIDEO DECODER ARCHITECTURE

A typical motion compensated video decoder consists of several building blocks, namely variable length decoding (VLD), inverse quantization (IQ), inverse DCT (IDCT), motion compensation (MC), reconstruction, and color space conversion (CSC), as shown in Fig. 2. Motion compensation is the process that retrieves the prediction from the reference picture using the motion vectors. Reconstruction refers to the process that adds the residual signal (from the IDCT) to the motion compensated prediction to form a decoded picture. The color space conversion module converts the decoded picture from YUV color space to RGB color space for display purposes.7 This process is mathematically a 3 × 3 matrix multiplication.

Note that there is a feedback loop in a video decoder, which is exactly the same as that inside the encoder. The feedback loop consists of motion compensation and reconstruction. It is this feedback loop that synchronizes the encoder and the decoder in the decoding results. Consequently, the motion compensated signal in the decoder has to precisely match that in the encoder, and the reconstruction process (including the clipping) must be accurate. Any error introduced in the feedback loop, even if it is small, will be accumulated and propagated to future frames. This is called drifting error. Drifting error usually leads to quick video quality degradation and, therefore, must be prevented.

Motion compensated prediction is an efficient way to exploit the temporal correlation between neighboring frames of a video sequence. For every macroblock in a picture, the most similar (according to a certain criterion) macroblock in the reference picture is found by a motion estimation process and is signaled by a 2-D motion vector. Realizing that a better prediction always leads to better coding efficiency, people have proposed many techniques to generate better predictions. Such techniques include bidirectional prediction (B-frame) [2], subpixel (half-pixel and even quarter-pixel) motion precision [3], unrestricted

6[Online] Available: http://www.nvidia.com/view.asp?PAGE=geforce3
7Modern graphics cards may also support other display formats such as YV12. In this case, the color space conversion is not needed, but the data still need to be manipulated, since all the supported display formats are generally interleaved while the YUV output of a decoder is planar.
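As a concrete illustration of the CSC step, the sketch below applies a 3 × 3 conversion matrix per pixel. The BT.601 studio-range coefficients used here are a common choice and an assumption for the example; the paper does not state which matrix its decoder uses:

```python
def yuv_to_rgb(y, u, v):
    """Convert one YUV pixel to RGB via a 3x3 matrix (BT.601, studio range).
    Mathematically: [R G B]^T = M * [Y-16, U-128, V-128]^T, clipped to [0,255]."""
    c, d, e = y - 16, u - 128, v - 128
    m = [(1.164,  0.000,  1.596),   # R row
         (1.164, -0.392, -0.813),   # G row
         (1.164,  2.017,  0.000)]   # B row
    rgb = []
    for row in m:
        val = row[0] * c + row[1] * d + row[2] * e
        rgb.append(max(0, min(255, int(round(val)))))  # clip, as the decoder must
    return tuple(rgb)
```

Since every output pixel depends on exactly one input pixel, this matrix multiply maps naturally onto the per-pixel operation of the pixel shader.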
688 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 15, NO. 5, MAY 2005
Fig. 4. Profiling of building blocks. (a) All modules. (b) All modules except color space conversion.
CPU. The VLD is a purely sequential operation and has to be handled by the host CPU.

The conclusion is that it is feasible for the GPU to off-load the two most expensive modules from the CPU.

B. Constraints of GPU

Even though the GPU is very powerful, it still has many constraints. These constraints become more visible when exploiting the GPU for nongraphics-oriented applications. Some main constraints are as follows.
• The memory bandwidth between CPU and GPU is limited. The memory bandwidth is often the bottleneck for many applications, both graphics and nongraphics [9]. Moreover, due to the common asymmetric implementation of the memory access path, as a rule of thumb, read-backs from GPU memory to main memory should generally be avoided in a practical design.
• The internal precision of the pixel shader is limited. For example, the nVidia GeForce3 GPU's internal precision is only up to 8.1 b. That is, the input and output are 8-b precision, while the internal precision of the ALU is 9 b. The rounding effect is not specified in DirectX-8 either. Note that the limited precision constraint is not so severe a drawback for computer graphics as for video decoding, because all the frames in graphics applications are freshly generated. In other words, there is no error accumulation and propagation, while exactly the contrary is the case in video decoding. This constraint is also well recognized in other research works [8], [13].
• The instruction set is small. Most instructions are specially designed for graphics operations. For example, there are no flow control instructions (or even logical operations such as AND and OR) in either the vertex shader or the pixel shader in DirectX-8. It also lacks bit-wise operations such as shifts.
• The code line count for pixel shader programming is limited. Only up to four texture instructions and up to eight arithmetic instructions are allowed in any rendering pass. The immediate result of this constraint is that a relatively complex process may have to be divided into multiple passes. This inevitably increases the overhead, since some extra textures must be read and written in each pass.
• Only a few texture formats can be used as render targets. Since the computation of the pixel shader cannot be performed without rendering a scene, this may lead to a waste of memory and computation power. This is also partially due to the constraint that not all the channels of the GPU can be manipulated separately. For example, the RGB channels can only be manipulated together, even though the alpha channel can be manipulated independently.

C. Proposed System Architecture

According to the feasibility analysis above, our solution is to move to the GPU the whole feedback loop that consists of motion compensation, color space conversion, and display, as shown in Fig. 5. For the sake of clearer explanation, we present our GPU accelerated video decoder in a top-down fashion and, along the way, elaborate the techniques that overcome the GPU constraints.

First of all, it is evident from Fig. 5 that reading back data from the GPU is completely avoided by moving the whole feedback loop to the GPU. The CPU will not access the data any more after submitting them to the GPU.

Secondly, as mentioned in the introduction, the CPU and GPU are pipelined to exploit the parallelism between them. Specifically, the CPU reads in the bit stream (using the asynchronous read method of MS Visual C++ 6.0), performs VLD, IQ, and IDCT and some other preparation work for the GPU such as geometry data generation. The CPU then passes the data to the GPU, and the GPU starts to perform the MC, CSC, and display tasks. While the GPU is fulfilling its assigned tasks, the CPU proceeds to process the next frame. The procedure is repeated again and again until the end of the bit stream. Obviously, the CPU and GPU indeed work in a pipelining manner. Note that the throughput of a pipeline is determined by its most time-consuming component. Therefore, to get the maximum efficiency, the load between CPU and GPU should be balanced. Since the GPU can handle MC, CSC, and display very effectively, the proposed task assignment for CPU and GPU results in a good load balance between them, as observed in our experiments. However, since the complexity may fluctuate from frame to frame, it is impossible for the load of the CPU and GPU to be balanced for every frame. To maximize the pipeline efficiency, we adopt a large buffer (typically between
4–8 frames) between the CPU and GPU to store the interme-
diate data generated by CPU. The intermediate buffer effectively
absorbed most decoding jitters of both CPU and GPU and con-
tributed significantly to the overall speed-up, as will be shown
in Section V.
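The CPU–GPU pipeline with its intermediate buffer behaves like a bounded producer–consumer queue. The following is a minimal thread-based sketch of that arrangement (the stage bodies are stand-ins, not the decoder's actual code; the buffer depth of 6 frames is one point in the 4–8 range mentioned above):

```python
import threading
import queue

FRAMES = 20
buf = queue.Queue(maxsize=6)  # intermediate buffer between CPU and GPU stages
decoded = []

def cpu_stage():
    """CPU side: VLD, IQ, IDCT, and geometry preparation for each frame."""
    for n in range(FRAMES):
        frame = {"idx": n, "residuals": f"idct#{n}"}  # stand-in for real work
        buf.put(frame)          # blocks only if the GPU falls 6+ frames behind
    buf.put(None)               # end-of-stream marker

def gpu_stage():
    """GPU side: MC, CSC, and display; runs concurrently with the CPU stage."""
    while True:
        frame = buf.get()
        if frame is None:
            break
        decoded.append(frame["idx"])  # stand-in for MC + CSC + display

t1 = threading.Thread(target=cpu_stage)
t2 = threading.Thread(target=gpu_stage)
t1.start(); t2.start()
t1.join(); t2.join()
```

The bounded queue is what absorbs per-frame complexity jitter: either side can run ahead by a few frames before blocking, so the pipeline throughput approaches that of the slower stage rather than the slower frame.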
E. Drifting-Free MC

All the steps mentioned above are relatively easy, since MC is basically a translation operation on texture blocks. The difficulty lies in drifting prevention given the limited internal precision of a GPU. It is just this drifting effect that significantly differentiates video decoding from normal graphics operations. There are mainly two restrictions: overflow and rounding. We use half-pel MC as an example, assuming the motion vector is (0.5, 0.5). The filtering process can be expressed as

(1)
(2)
(3)

Fig. 8. Pixel shader code segment for decomposition with regard to 4.
Fig. 9. Pixel shader code segment for half-pel MC where MV = (0.5, 0.5).
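To see why the decomposition with regard to 4 is exact and overflow-safe, the scalar sketch below assumes the half-pel filter has the common form ⌊(a + b + c + d + r)/4⌋ with rounding-control parameter r (an assumed form for illustration; the GPU implementation uses the pixel shader passes of Figs. 8 and 9):

```python
def halfpel_naive(a, b, c, d, r):
    """Reference: direct half-pel average with rounding-control parameter r."""
    return (a + b + c + d + r) // 4

def halfpel_decomposed(a, b, c, d, r):
    """Overflow-safe two-pass form mirroring the shader passes:
    pass 1 splits each 8-bit pixel p into quotient p // 4 and residual p % 4;
    pass 2 divides only the bounded residual sum (at most 4*3 + r), then adds
    the exact quotient sum. No intermediate value exceeds 255."""
    quotients = [p // 4 for p in (a, b, c, d)]   # stored in the RGB channels
    residuals = [p % 4 for p in (a, b, c, d)]    # stored in the alpha channel
    return sum(quotients) + (sum(residuals) + r) // 4
```

Since a + b + c + d + r = 4·Σq + (Σs + r), integer division by 4 yields Σq + ⌊(Σs + r)/4⌋, so the two forms agree exactly while every intermediate quantity (Σq ≤ 252, Σs + r ≤ 15) stays within 8 bits.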
We solve the problem by using a multipass rendering technique. In the first pass, the pixels are decomposed with regard to 4, and the resulting quotients and residuals are stored into the RGB channels and the alpha channel, respectively. In the second pass, the residuals and the rounding control parameter are summed up and divided by 4, and the result is then added to the sum of quotients. The correctness of this process is obvious. The key point is that only the residuals and the rounding control parameter may lead to precision loss. Realizing this, it is natural to aggregate all the residuals and the rounding control parameter and do the division at the last step. The sum of the residuals and the rounding control parameter is guaranteed not to overflow, thanks to the decomposition. The idea can be easily generalized to other filtering coefficients.

The pixel shader sample code fragment of the decomposition pass is listed in Fig. 8. Note that we used two multiply-by-0.5 instructions to achieve the division by 4. One cannot simply multiply by 0.25, nor can one use mov d2, because of the internal rounding effect.

The pixel shader sample code segment of the second pass is listed in Fig. 9. The quotients are summed up in the RGB channels while the residuals are aggregated in the alpha channel. Note that the rounding control parameter becomes invisible, because it is absorbed in counteracting the rounding effects of the GPU.

F. GPU Optimization Techniques

In a real implementation, a few techniques may be applied to achieve higher GPU computation efficiency.
1) Pixel-Packing: Note that the time the GPU consumes to render a scene is approximately in direct proportion to the number of pixels rendered, provided that other factors such as geometry and lighting remain the same. According to the DirectX-8 specifications, the only render target format that suits the video decoding application is the 32-b D3DFMT_A8R8G8B8 format. All four channels are always running concurrently, even if only one channel is intended to be used. Therefore, it is very beneficial to pack more pixels into one 32-b pixel and process them together and simultaneously. Note that there may exist trade-offs for pixel-packing, because it causes extra load on the CPU.
2) Look-Up Table (LUT) Via Dependent Texture Read: Table look-up is a useful optimization technique because many expensive computations can be performed beforehand. Table look-up can be achieved through dependent texture reads by using the current pixel value as the texture coordinate and retrieving the desired results from the corresponding pre-computed texture sample. Table look-up also helps to reduce the number of rendering passes, because texture addressing instructions are not counted toward the code line count on which DirectX-8 has a constraint. In fact, it is this technique that enables us to achieve padding and decomposition in a single rendering pass (refer to PS1 in Fig. 7).

In summary, we have built an efficient architecture wherein the CPU and GPU are pipelined to harness the GPU power to accelerate video decoding. Several techniques were developed to overcome the various constraints of the GPU. Drifting-free subpel motion compensation was achieved. We also proposed pixel-packing and table look-up techniques to improve the effectiveness of the GPU computation power.

V. EXPERIMENTAL RESULTS

We have implemented our solution for the Windows Media Video (version 8) decoder. We performed extensive tests on a PC with an Intel Pentium III 667-MHz CPU, 256-MB memory, and an nVidia GeForce3 Ti200 GPU. Some of the initial experimental results are reported below in Table III. The test sequences are Football, Total, and Trap. The Football sequence is a standard MPEG test sequence in SIF format (320 × 240) with very high motion. The Total sequence is a concatenation of several
TABLE III
EXPERIMENTAL RESULTS OF GPU ASSISTED VIDEO DECODING ON
PC WITH AN INTEL PENTIUM III 667-MHZ CPU, 256-MB MEMORY
AND AN nVIDIA GEFORCE3 Ti200 GPU
[3] Information Technology—Generic Coding of Audio-Visual Objects—Part 2: Visual, ISO/IEC JTC 1/SC29/WG11, WG11 N2688, Mar. 1999.
[4] Video Coding of Narrow Telecommunication Channels at < 64 kbits/s, ITU-T Recommendation H.263, 1995.
[5] E. Lindholm, M. J. Kilgard, and H. Moreton, "A user programmable vertex engine," in Proc. ACM SIGGRAPH, 2001, pp. 149–158.
[6] M. Harris, "GPGPU: General-purpose computation using graphics hardware," http://www.cs.unc.edu/~harrism/gpgpu, 2003.
[7] E. S. Larsen and D. K. McAllister, "Fast matrix multiplies using graphics hardware," in Proc. IEEE Supercomputing, Nov. 2001, p. 55.
[8] M. Rumpf and R. Strzodka, "Level set segmentation in graphics hardware," in Proc. ICIP, vol. 3, 2001, pp. 1103–1106.
[9] C. J. Thompson, S. Hahn, and M. Oskin, "Using modern graphics architectures for general-purpose computing: A framework and analysis," in Proc. ACM/IEEE MICRO-35, Nov. 2002, pp. 306–317.
[10] P. Colantoni, N. Boukala, and J. D. Rugna, "Fast and accurate color image processing using 3-D graphics cards," presented at the 8th Int. Fall Workshop: Vision Modeling and Visualization, Munich, Germany, Nov. 2003.
[11] K. Moreland and E. Angel, "The FFT on a GPU," in Proc. SIGGRAPH/Eurographics Workshop Graphics Hardware, July 2003, pp. 112–119.
[12] M. Hopf and T. Ertl, "Hardware accelerated wavelet transformations," in Proc. EG/IEEE TCVG Symp. Visualization, 2000, pp. 93–103.
[13] R. Strzodka, "Virtual 16 bit precise operations on RGBA8 textures," in Proc. Vision Modeling and Visualization, Erlangen, Germany, Nov. 2002, pp. 171–178.

Guobin Shen (S'99–M'02) received the B.S. degree from Harbin University of Engineering, Harbin, China, in 1994, the M.S. degree from Southeast University, Nanjing, China, in 1997, and the Ph.D. degree from Hong Kong University of Science and Technology (HKUST), Hong Kong, in 2001, all in electrical engineering.
He was a Research Assistant at HKUST from 1997 to 2001. Since then, he has been with Microsoft Research Asia, Beijing, China. His research interests include digital image and video signal processing, video coding and streaming, peer-to-peer networking, and parallel computing.

Guang-Ping Gao received the B.S. degree in computer science from Tsinghua University, Beijing, China.
He joined Microsoft Research Asia, Beijing, China, as a Research Software Development Engineer in August 2002.

Shipeng Li, photograph and biography not available at the time of publication.

Heung-Yeung Shum, photograph and biography not available at the time of publication.

Ya-Qin Zhang (S'87–M'90–SM'93–F'98) received the B.S. and M.S. degrees from the University of Science and Technology of China (USTC) in 1983 and 1985, respectively, and the Ph.D. degree from George Washington University, Washington, DC, in 1989, all in electrical engineering. He had executive business training from Harvard University.
He joined Microsoft Research China, Beijing, in January 1999, leaving his post as the Director of the Multimedia Technology Laboratory at Sarnoff Corporation, Princeton, NJ (formerly David Sarnoff Research Center, and RCA Laboratories). He has been engaged in research and commercialization of MPEG2/DTV, MPEG4/VLBR, and multimedia information technologies. He was with GTE Laboratories Inc., Waltham, MA, and the Contel Technology Center in Virginia from 1989 to 1994. He has authored and co-authored over 200 refereed papers in leading international conferences and journals. He has been granted over 40 U.S. patents in digital video, Internet, multimedia, wireless, and satellite communications. Many of the technologies he and his team developed have become the basis for start-up ventures, commercial products, and international standards. He serves on the Board of Directors of five high-tech IT companies. He has been a key contributor to the ISO/MPEG and ITU standardization efforts in digital video and multimedia.
Dr. Zhang served as the Editor-In-Chief for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY from July 1997 to July 1999. He was the Chairman of the Visual Signal Processing and Communications Technical Committee of the IEEE Circuits and Systems Society. He serves on the editorial boards of seven other professional journals and over a dozen conference committees. He has been the recipient of numerous awards, including several industry technical achievement awards and IEEE awards such as the CAS Jubilee Golden Medal. He was awarded "Research Engineer of the Year" in 1998 by the Central Jersey Engineering Council for his "leadership and invention in communications technology, which has enabled dramatic advances in digital video compression and manipulation for broadcast and interactive television and networking applications." He recently received the prestigious national award as "The Outstanding Young Electrical Engineer of 1998," given annually to one electrical engineer in the United States.