
Evaluation of Parallelization Strategies for the Emerging HEVC Standard

Mauricio Alvarez Mesa1, Chi Ching Chi2, Thomas Schierl3 and Ben Juurlink2
1 Universitat Politècnica de Catalunya, Barcelona, Spain
2 Technische Universität Berlin, Berlin, Germany
3 Fraunhofer HHI - Heinrich Hertz Institute, Berlin, Germany

1 Introduction
Parallel architectures: multicore, manycore, GPUs, etc.

What is HEVC? High Efficiency Video Coding:
- New standardization initiative by ISO/MPEG and ITU-T/VCEG, undertaken by the Joint Collaborative Team on Video Coding (JCT-VC)
- Target compression performance: 2X compared to H.264/AVC
- Resolution: up to 4k x 2k and up to 60 fps (or more)
- Color depth: 8-bit and 10-bit (up to 14-bit)
- Application use cases: intra, random access, low delay; high efficiency, low complexity
- Timeline: started in January 2010; first version expected to be completed in 2012-2013
- Evolving reference software: HEVC Test Model (HM)

2 Overview of HEVC
HEVC is based on the same structure as prior hybrid video codecs like H.264/AVC, but with enhancements in each stage. It has a prediction stage composed of motion compensation (with variable block size and fractional-pel motion vectors) and spatial intra-prediction; integer transformation and scalar quantization are applied to prediction residuals; and quantized coefficients are entropy encoded using either arithmetic coding or variable length coding. Also, as in H.264, an in-loop deblocking filter is applied to the reconstructed signal.

Figure 1: General diagram of HEVC decoder


(Figure 2 shows a 64x64 LCU recursively divided by split flags: split flag = 1 descends one depth level, split flag = 0 stops, with CU sizes from 64 down to 32 and depths from 0 to 4.)

Figure 2: Coding structure: quad-tree segmentation

HEVC has two main features that differentiate it from H.264 and its predecessors. First, a new coding structure that replaces the macroblock structure of H.264; and second, the inclusion of two new filters that are applied after the deblocking filter: Adaptive Loop Filter (ALF) and Sample Adaptive Offset (SAO). Figure 1 shows a general diagram of the main stages of the HEVC decoder.

2.1 Coding Structure


The new block structure is based on coding units (CUs) that contain one or several prediction units (PUs) and transform units (TUs) [8]. Each frame is divided into a collection of Large Coding Units (LCUs) (with a maximum size of 64x64 samples in the current status of HEVC). Each LCU can be recursively split into smaller CUs using a generic quad-tree segmentation structure. The PU is the basic unit for prediction, and each PU can contain several partitions of variable size. Finally, the TU is the basic unit of transform, which can also have its own partitions. Figure 2 illustrates the concept of quad-tree segmentation, and Figure 3 shows an example of partitions in different types of units.

(Figure 3 shows example partitionings of a frame into coding units, prediction units and transform units.)

Figure 3: Coding, Transform and Prediction Units

2.2 New filters


2.2.1 Adaptive Loop Filter: ALF

ALF is a filter designed to minimize the distortion of the decoded frame compared to the original one using a Wiener filter. The filter can be activated at the CU level and its coefficients are encoded at the slice level. The filter is applied after the deblocking filter, or after SAO if the latter is enabled.

2.2.2 Sample Adaptive Offset Filter: SAO

The sample adaptive offset (SAO) filter is applied between the deblocking filter and the ALF [6]. In the SAO filter the entire picture is considered as a hierarchical quadtree. For each subquadrant in the quadtree the SAO filter can be activated by transmitting offset values for the pixels in the quadrant. These offsets can either correspond to the intensity band of pixel values (band offset) or to the difference compared to neighboring pixels (edge offset). In HM-3.0 only the luma samples are considered. The same pixel offsets are used for all the LCUs in the quadrant.

3 Parallelization Opportunities
The techniques used to parallelize previous video codecs can also be used for the parallelization of the HEVC decoder. In addition, some new features have been included in the current draft of the new standard to allow parallel execution at different levels of granularity. In this section we review some of these strategies and present in more detail a technique called entropy slices, which is the focus of our parallel implementation.

3.1 Parallelism among decode stages


Function-level parallelism consists of processing different stages of the video decoder in parallel, for example using a frame-level pipelining approach. A 4-stage pipeline can be implemented with stages such as Parsing, Entropy Decoding, LCU Reconstruction and Filtering [4]. Although this mechanism is applicable to HEVC decoding, its main disadvantages are the high latency and the memory bandwidth required by the multiple frame buffers.

3.2 Parallelism Within Decode Stages


This is the finest level of data-level parallelism and consists of finding independent data partitions within a decode stage or kernel. Due to its fine granularity this approach is well suited for hardware architectures [10].

3.2.1 Parallel Entropy Decoding

Some techniques currently under consideration for the HEVC standard include Probability Interval Partitioning (PIPE) and Syntax Element Partitioning (SEP). PIPE uses an entropy coding algorithm similar to CABAC in H.264/AVC. The main difference is in the binary arithmetic coder: instead of coding the bins with a single arithmetic coding engine, a set of encoders is used, each one associated with a partition of the probability interval. In the original design 12 different probability intervals are used, allowing 12 different bin encoders to operate in parallel [11]. SEP consists of grouping the bins in a slice by the type of syntax element rather than by macroblock (or LCU) as in H.264/AVC. Bin groups can be processed in parallel, but they need to maintain some data dependencies. A maximum throughput gain of 2.7X has been reported using 5 different partitions [?]. Another proposal for parallel entropy encoding/decoding is related to the parallelization of the context processing stage of the CABAC algorithm. In this case, some of the internal loops for processing context state (significance map, coefficient sign and coefficient level) are rearranged to expose fine-grain data-level parallelism [15, 3].

3.2.2 Parallel Intra Prediction

Intra prediction uses reconstructed data from neighboring blocks to create the prediction of the current block. This creates strong data dependencies that inhibit parallel processing at the block level. A proposal for partially removing these dependencies is known as Parallel Prediction Unit for Parallel Intra Coding. In this approach the blocks inside an LCU are grouped into two sets using a checkerboard pattern. The first set of blocks can be predicted and reconstructed in parallel without referencing the second set. After this, the second set can be processed in parallel without referring to other blocks in the second set [20].

3.3 Data-level parallelism


In data-level parallelism the same program (instruction or task) is applied to different portions of the data set. In a video codec, data-level parallelism can be applied at different data granularities such as frame level, macroblock (or LCU) level, block level and sample level. A detailed analysis of different data-level parallelization strategies for H.264/AVC can be found in the literature [12, 14].

3.3.1 LCU-level parallelism

LCU- (or macroblock-) level parallelism can be exploited inside or between frames if the data dependencies of the different kernels are satisfied. In HEVC the dependencies vary from stage to stage. For kernels that reference neighboring data at the LCU level, like intra-prediction, processing LCUs in a diagonal wavefront allows parallelism to be exploited between them [5].

For motion compensation there are no intra-frame dependencies (if motion vector prediction is performed at entropy decoding), but there are inter-frame dependencies due to accesses to reference areas. By detecting (statically or dynamically) inter-frame dependencies it is possible to increase the number of independent LCUs compared to wavefront processing. This has been reported for H.264/AVC encoding and decoding [12, 21]. In H.264/AVC the deblocking filter uses filtered samples as input for filtering MBs, creating wavefront-style dependencies. In HEVC a new approach called the parallel deblocking filter [9] is introduced. In this scheme the deblocking filter is divided into two separate frame-level stages: horizontal and vertical filtering. Horizontal filtering takes the reconstructed frame as input and produces a filtered frame. After that the vertical filtering is applied, taking the horizontally filtered frame as input and producing the final filtered frame. With this approach all the LCU dependencies of the deblocking filter are removed, allowing parallel processing of all the LCUs in a frame.

3.4 Slice-level parallelism


As in previous video codecs, in HEVC each frame can be partitioned into one or more slices. Traditionally, slices have been included in order to add robustness to the encoded bitstream in the presence of network transmission errors. To accomplish this, slices in a frame are completely independent from each other: no content of a slice is used to predict elements of other slices in the same frame, and the search area of a dependent frame cannot cross the slice boundary [19]. Although not originally designed for it, slices can be used for exploiting parallelism because they don't have data dependencies. Parallel processing with slices has several advantages: coarse-grain parallel processing, data locality, low delay and low memory bandwidth. The main disadvantage of slices is the reduction in coding efficiency. This is due to three main reasons. First, a reduction in the efficiency of entropy coding, due to the reduced training of probability contexts and the inability to cross the slice boundary for context selection in the CABAC entropy coder. Second, an efficiency reduction of the prediction stage due to the inability to cross slice boundaries for LCU prediction. And third, the increase in bitstream size due to slice headers and the start code prefixes used to signal the presence of slices in the bitstream [17]. Figure 4 shows the contribution of each of these factors to the total loss of coding efficiency. Other disadvantages of traditional slices are load balancing and scalability. Load imbalance appears because slices are created with the same number of MBs, which can result in some slices being decoded faster than others depending on the input content. Scalability is limited at the decoder because the number of slices per frame is determined by the encoder. If there is no control over what the encoder does, it is possible to receive sequences with one (or few) slice(s) per frame, with a corresponding reduction in the parallelization opportunities.

3.5 Entropy Slices


As a way to overcome the limitations of traditional slices, a new approach for creating slices called entropy slices has been included in the current proposal of HEVC [13].

Figure 4: Coding loss due to slices in H.264/AVC. (BigShips, QP=27)[17]


                                 Slices        Entropy Slices   Interleaved Entropy Slices
Context model initialization     slice         slice            interleaved
Context model selection          intra-slice   intra-slice      inter-slice
LCU reconstruction neighborhood  intra-slice   inter-slice      inter-slice
Slice header overhead            high          low              low

Table 1: Comparison of Slices (S), Entropy Slices (ES) and Interleaved Entropy Slices (IES)

The first difference with respect to traditional slices is that entropy slices are proposed for parallelism, not for error resilience. The main differences between slices and entropy slices are presented in Table 1. As with regular slices, in entropy slices the CABAC context models are initialized at the beginning of each slice. Also, both in regular and entropy slices the entropy coding of an LCU cannot cross slice boundaries. A difference appears in the reconstruction phase, where entropy slices allow access to LCUs in other slices for prediction. Finally, the headers of entropy slices only contain start codes and entropy coding signaling, and thus have a reduced size compared to traditional slices. Entropy slices allow entropy decoding to be performed in parallel without data dependencies. In the original design it is assumed that entropy decoding is decoupled from LCU reconstruction with a frame buffer. After parallel entropy decoding, LCU reconstruction can be performed in parallel using a wavefront parallel pattern. A similar approach to entropy slices, called Interleaved Entropy Slices (IES), has been proposed (but not included in HEVC) [16]. In IES, slices are interleaved across LCU lines. Context model states are maintained for a longer period compared to regular and entropy slices, and context model selection can cross slice boundaries, resulting in a minimal coding efficiency impact.

4 Parallel HEVC Decoder with Entropy Slices


The parallelization opportunities discussed in the previous section allow for roughly two approaches: a decoupled and a combined approach. In the decoupled approach parallelism is exploited in each stage of the HEVC pipeline. Entropy decoding can be performed in parallel using entropy slices. LCU reconstruction can be executed in parallel using wavefront parallelism. The parallel deblocking filter (presented in Section 3.3.1) allows deblocking the edges of the entire picture in two parallel stages, one for the vertical edges followed by one for the horizontal edges. The SAO filter can be performed in one pass in an LCU-independent fashion. Finally, the ALF can also be performed in an LCU-independent fashion. The parallelism in the decoupled approach can be exploited using a task pool approach in which the work units of each stage are distributed dynamically among the available cores. In most of the decoder stages this can be implemented efficiently using a single atomic counter for both synchronization and work distribution. The drawback of this approach is that it requires large buffers to store the data between the stages. The entropy decoded data is particularly large at . . . per picture in HM-3.0. Additionally, cache locality is reduced because several passes over the picture buffer are required in the reconstruction and filtering stages. Especially with higher resolution sequences, for which the picture cannot be contained in the on-chip caches, performance and scalability are reduced because additional off-chip memory traffic is generated. In the combined approach as many stages as possible are combined into a single pass to increase cache locality and to reduce off-chip memory bandwidth requirements. Recent work on parallel H.264 decoders [] has also opted for this approach.
In H.264, however, the entropy decoding stage cannot be combined with MB reconstruction and filtering in parallel approaches without resorting to regular slices, which impact both objective and subjective quality significantly []. The more efficient HEVC entropy slices allow for parallel entropy decoding but still maintain the intra dependencies of the LCU reconstruction and deblocking filter stages. Combining the entropy decoding and reconstruction stages requires, therefore, the ability to perform LCU wavefront execution. This can only be achieved by enforcing a one-entropy-slice-per-row encoding approach. When using one entropy slice per row, it is not only possible to combine the entropy decode and reconstruct stages, but also the deblocking and SAO filter stages, when using the Ring-Line approach []. In the Ring-Line strategy an arbitrary number of line decoders is used to decode the picture in a line-interleaved manner. The line decoders can maintain the wavefront dependencies efficiently using a ring synchronization approach, in which the line decoders only need to synchronize with their neighbors. In each line decoder the entropy decode, LCU reconstruction, and deblocking filter can be performed for the same LCU. The SAO filter, however, operates on the deblocked output image and, therefore, cannot process the same LCU, as its lower and right edges are not deblocked yet. Instead the SAO filter is performed on the upper-left LCU, for which all the deblocked image data is available. The decoding order of the stages and the corresponding modified pixels for one LCU are illustrated in Figure 5. Figure 6 shows the wavefront progression when using four line decoder threads.

(Figure 5 panels: Reconstruction, Deblock vertical edges, Deblock horizontal edges, SAO filtering.)

Figure 5: The decoder stages are applied on different adjacent LCUs to maintain the kernel dependencies. Each square represents a 2x2 pixel block.

Figure 6: Wavefront progression of the combined stages. The colors show the decoding progress of each stage before starting the decode of the hatched blocks. The SAO filter is delayed until all the pixels in the LCU are deblocked.

The SAO filter is designed to be LCU independent by ignoring some pixel classification methods that would require pixels from neighboring LCUs in the edge offset mode. This allows the SAO filter to be applied LCU-parallel for each level of the hierarchical quadtree. For our implementation it is necessary that the SAO filter is applied in a single pass over all the quadtree levels for each LCU. Since the SAO filter is LCU independent, a depth-first method is equivalent to the breadth-first method. A mapping table is used to efficiently link each LCU to the corresponding quadrant of each level. Unfortunately, the ALF stage cannot be combined with the other stages in HM-3.0. To perform the ALF on a CU, its absolute CU index is required to index the ALF on/off flag array. The absolute CU index, however, is only known after all previous CUs are entropy decoded. This condition is not met when processing the ALF in the line decoders using the Ring-Line strategy. The ALF, therefore, is performed after reconstructing the entire picture in a separate pass. Because the ALF is a relatively compute-intensive stage, the impact of an additional picture buffer pass is not significant. In our implementation, to reduce cache line conflicts and synchronization overhead, eight consecutive LCUs are grouped into a work unit and processed by a single core.

5 Experimental Results
In this section we present the experimental results for the proposed parallelization methodology. We present coding efficiency results to analyse the impact of entropy slices on compression and, after that, we analyse the performance of our implementation on two different parallel machines.

Options                          Value
Max. CU size                     64x64
Max. partition depth             4
Period of I-frames               32
Number of B-frames (GOP size)    8
Number of reference frames       4
Motion estimation algorithm      EPZS [18]
Search range                     64
Entropy coding                   CABAC
Internal bit depth               10
Adaptive Loop Filter (ALF)       enabled
Sample Adaptive Offset (SAO)     enabled
Quantization Parameter (QP)      22, 27, 32, and 37

Table 2: Coding Options


Class  Sequence              Resolution  Frame count  Frame rate  Bit depth
S      CrowdRun              3840x2160   500          50p         8
S      InToTree              3840x2160   500          50p         8
S      ParkJoy               3840x2160   500          50p         8
A      NebutaFestival        2560x1600   300          60p         10
A      PeopleOnStreet        2560x1600   150          30p         8
A      SteamLocomotiveTrain  2560x1600   300          60p         10
A      Traffic               2560x1600   150          30p         8
B      BasketballDrive       1920x1080   500          50p         8
B      BQTerrace             1920x1080   600          60p         8
B      Cactus                1920x1080   500          50p         8
B      Kimono1               1920x1080   240          24p         8
B      ParkScene             1920x1080   240          24p         8

Table 3: Input sequences

5.1 Experimental Environment


5.1.1 HEVC software and Input Videos

We have implemented a parallel HEVC decoder on top of the HM-3.0 reference decoder. From all the available test conditions, described in Section 1, we selected Random Access High Efficiency (RA-HE), which we consider the most demanding application scenario of the current HEVC proposal. RA-HE includes the most computationally demanding tools of HEVC, such as CABAC and ALF, and it makes extensive use of B-frames. It should be noted, however, that the same parallel code can be used without modifications with the other test conditions. Encoding options are based on the common conditions described in JCTVC-E700 [2]. Table 2 shows the main parameters used for the encodings. We encoded all the videos from the HEVC test sequences using the HM-3.0 reference encoder. For space reasons, and because we are mainly interested in high definition applications, we only present results for class A (2560x1600 pixels) and class B (1920x1080 pixels) sequences. We also included 4K videos (3840x2160) from the SVT High Definition Multi Format Test Set [7]. We will refer to videos with this resolution as class S. Test sequence information is presented in Table 3.

System                Fujitsu RX600 S5         Dell Precision T5500
Processor             Intel Xeon X7550         Intel Xeon X5680
ISA                   X86-64                   X86-64
Microarchitecture     Nehalem-EX               Westmere
Num. sockets          4                        2
Num. cores/socket     8                        6
Num. threads/socket   16                       12
Technology            45 nm                    32 nm
Clock frequency       2.0 GHz                  3.33 GHz
Power                 130 W                    130 W
Level 1 D-cache       32 KB / core             32 KB / core
Level 2 D-cache       256 KB / core            256 KB / core
Level 3 cache         18 MB / socket           12 MB / socket
Memory                DDR3-1066                DDR3-1333
Interconnection       QuickPath                QuickPath
Boost                 1.46.1                   1.42.1
Compiler              GCC-4.4.5 -O3            GCC-4.5.2 -O3
Operating system      Linux kernel 2.6.32-5    Linux kernel 2.6.38-8

Table 4: Experimentation platform

5.1.2 Platform

Multiple threads were created using the Boost thread C++ library; this library allows the use of multithreading in shared memory systems for C++ programs. For our parallel decoding experiments we used two parallel systems. One is a cache-coherent Non-Uniform Memory Access (cc-NUMA) machine. It is based on the Intel Xeon X7550 processor, which has 8 cores per chip; the whole system contains 4 sockets, for a total of 32 cores, connected with the Intel QuickPath Interconnect (QPI). The second one is a dual-socket machine based on the Intel Xeon X5680 processor, which has 6 cores per chip, for a total of 12 cores. The main parameters of these architectures are listed in Table 4. The machines were configured with the TurboBoost feature disabled to avoid dynamic changes in frequency. Although Simultaneous MultiThreading (SMT) is enabled by default, we do not schedule more than one thread per core. Also, we used thread pinning to manually assign threads to cores, avoiding thread migrations, which have even more negative effects on cc-NUMA architectures.

5.2 Coding Efficiency


In this section we quantify the effect of entropy slices on coding efficiency and compare it with a baseline system and a system with regular slices. First, we used a configuration with one regular slice per frame, which represents the baseline with the highest quality and the minimum bitrate. Our parallelization approach is based on entropy slices, for which we encoded the videos with one entropy slice per row. As a comparison we also encoded all videos with one regular slice per row. A comparison of the two slice approaches with the baseline is summarized in Table 5 using the Bjontegaard metric [1]. Regular slices result in an average bitrate increase of 6.8%, 14% and 9.5% for the Y, U and V components, respectively. Entropy slices result in an average bitrate increase of 5.2%, 5.9% and 5.5% for the three components.


         1 regular slice per row             1 entropy slice per row
Class    Y BD-rate  U BD-rate  V BD-rate    Y BD-rate  U BD-rate  V BD-rate
S        5.037      13.689     7.093        3.808      4.585      4.865
A        6.261      16.854     12.429       5.472      6.381      5.724
B        9.216      11.518     8.964        6.3        6.802      5.929

Table 5: Coding efficiency of regular slices and entropy slices compared to HM-3.0 with one slice per frame

5.3 Performance
We ran the parallel HEVC decoder on the two parallel machines. We used all the videos described in Table 3 and we decoded each one five times for each number of threads.

5.3.1 Speedup and Frames-per-Second

The sub-figures on the left side of Figures 7 and 8 show the performance in terms of average speedup and frame rate for the three input video classes and the two machines under study. Speedup is computed against the original sequential code (thread 0) and is presented along with the parallel code using one core (thread 1). The main difference between the two machines is that the T5500 has a higher clock rate than the RX600. This allows the former to process an average of 50 class B frames per second when using 10 processors. The speedup curves are very similar: the parallelization efficiency is relatively high for small core counts (more than 80% for 4 cores) and the speedup saturates at some number of processors, for example 12 processors for class B, at which a maximum speedup of 5 is reached with an efficiency of 50%. For the most demanding sequences (class S), even with a high number of processors and using the high-frequency configuration it was not possible to reach real-time operation (a maximum of 15 fps is reached with 11 cores). In general the low absolute performance is due to the low performance of the original single-threaded reference code. Additional optimizations (like SIMD vectorization) can be applied to increase the performance of the single-threaded code. Table 6 shows the corresponding number of entropy slices, the maximum number of processors used and the obtained speedup for the different code sections. The ALF section exhibits an almost linear speedup (with an efficiency close to 90%). The other sections (Entropy Decoding, LCU Reconstruction, Deblocking Filter and SAO) have a lower efficiency (around 53%) due to data dependencies and load imbalance. The total speedup and performance are limited by the ED + LCU + DF + SAO stages.

5.3.2 Profiling of Execution Time

We performed a profiling analysis in order to identify the relative contribution of different parts of the application to the final performance. The sub-figures on the right side of Figures 7 and 8 show the average execution time for each video class. It has been divided into sequential and parallel portions. The parallel part has been further divided into two sections, one with the ED + LCU + DF + SAO stages and the other with the ALF kernel. Due to its massively parallel nature, the ALF filter execution



(Figure 7 sub-figures: (a) Class S: Speedup, (b) Class S: Exec. time, (c) Class A: Speedup, (d) Class A: Exec. time, (e) Class B: Speedup, (f) Class B: Exec. time. The execution time plots break the time down into Sequential, ALF and ED+REC+DF+SAO parts.)

Figure 7: Speedup and average execution time for the rx600s51t machine. Speedup is measured against the original sequential code (referred to as thread 0 in the figure). Thread 1 is the parallel code with one thread. Average execution time is divided into sequential and parallel parts.

time reduces almost linearly with the number of threads. The ED + LCU + DF + SAO stages also reduce their time but reach a saturation point; and the sequential stage increases its fraction of the total execution time according to Amdahl's law. Table 7 shows the contribution to the total execution time of the different stages

(Figure 8 sub-figures: (a) Class S: Speedup, (b) Class S: Exec. time, (c) Class A: Speedup, (d) Class A: Exec. time, (e) Class B: Speedup, (f) Class B: Exec. time. The execution time plots break the time down into Sequential, ALF and ED+REC+DF+SAO parts.)

Figure 8: Speedup and average execution time for the x5680 machine. Speedup is measured against the original sequential code (referred to as thread 0 in the figure). Thread 1 is the parallel code with one thread. Average execution time is divided into sequential and parallel parts.

using the number of processors that generates the maximum speedup. The contribution of ALF at the maximum number of processors is below 15%. The sequential part of the application becomes important, with a contribution between 15% and 22%. But the main limitation to scalability is the ED + LCU + DF + SAO stage, which


                           rx600s51t machine       x5680 machine
Class                      S      A      B         S      A      B
Num. entropy slices        34     25     17        34     25     17
Max. processors            24     16     14        12     12     12
ED+LCU+DF+SAO speedup      11.5   8.6    6.03      7.94   7.24   5.35
ALF speedup                21.5   14.5   12.05     11.15  10.62  9.98
Total speedup              10.3   7.6    5.67      7.35   6.62   5.20
FPS                        11.7   18.5   31.9      15.38  29.54  53.15

Table 6: Maximum speedup


                   rx600s51t machine           x5680 machine
Class              S       A       B           S       A       B
Max. processors    24      16      14          12      12      12
ED+LCU+DF+SAO      64.11%  62.70%  71.24%      64.95%  63.26%  70.69%
ALF                13.81%  16.52%  13.81%      19.08%  18.62%  14.66%
Sequential part    22.07%  20.78%  14.95%      15.97%  18.12%  14.65%

Table 7: Contribution of different stages to total execution time with the maximum number of processors
(Figure 9 plots the number of parallel LCUs over time slots for classes S, A and B, with parallelism ramping up to about 35 LCUs over roughly 140 time slots.)

Figure 9: Maximum parallelism in the wavefront kernels

dominates the execution time of the parallel decoder. It takes more than 60% of the execution time with the maximum number of processors. This limitation is due to the wavefront dependencies in some kernels. This type of dependency generates a variable number of parallel tasks with a ramp pattern. Figure 9 shows the number of independent tasks for the kernels with wavefront dependencies. Assuming constant task time and no synchronization overhead, the maximum theoretical speedups are 16.19, 11.36 and 8.22 for classes S, A and B, respectively. The efficiency of our parallelization compared to this maximum is between 71% and 76%.


6 Limitations and Solutions


The proposed parallelization strategy has several advantageous properties from a parallel implementation perspective. First, it achieves good scaling eciency at lower core counts. Second, the number of line decoders used in the implementation is independent from the bitstream. The actual number of line decoders is exible and can be chosen to match the processing capabilities of the computing hardware and the performance requirements. Third, no addtional frame buers are required, which keeps the memory size requirements similar to that of the single threaded approach. Fourth, as reected by the performance of the parallel implementation runnning on one core, the paralellization overhead is low because the stages are combined as much as possible. Finally, all single threaded optimization opportunities to increase cache locality and reduce o-chip memory trac remain exploitable. A limitation is the scaling eciency at higher core counts. As shown in the execution time breakdown in Section ??, this is caused by the sequential part and the wavefront parallel part. The sequential part does not decrease when increasing the number of cores and takes as much as 22% of the total execution time. This, however, can be solved by pipelining the sequential part, which consist mainly of scanning the bitstreams for startcodes and decoding the SAO and ALF parameters. By parsing the bitstream one frame ahead in a separate thread will hide the time spend in this sequential part. The wavefront parallel part is not scaling linear due to its inherent parallelism ramp-up and ramp-down. This could be mitigated by overlapping the decoding of consecutive frames []. Currently, this cannot be performed because the ALF cannot be combined with the other stages, but nevertheless is inside the prediction loop. The ALF could have been combined with the other stages if the ALF CU on/o ags were not signaled in the slice header, but as syntax elements of the CUs in the raw byte sequence payload. 
By doing so, the absolute CU index would no longer be necessary. It is expected that this contribution will be made in future developments of the HEVC standardization, as theoretically there will be no impact on the coding efficiency since the ALF CU flags are only moved to a different location.

From a coding efficiency perspective, entropy slices have a reduced impact on the objective quality and no impact on the subjective quality compared to using a single slice per frame. Using one entropy slice per row and wavefront execution, however, allows for further improvement of the objective quality. For example, context selection over entropy slice boundaries (D243) and propagation of context tables (E196) can be applied straightforwardly without complicating the proposed parallelization strategy. Furthermore, the training losses can be further reduced by using multiple context initialization tables for each entropy slice or picture. Compared to the regular one-slice-per-picture approach, the only losses would originate from the additionally coded start codes or bitstream offsets and bitstream padding.
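The ramp-up and ramp-down behavior of the wavefront parallel part discussed above can be illustrated with a small scheduling model. This is a sketch under simplifying assumptions (every LCU takes one time unit, and an LCU depends only on its left and top-right neighbors); it is not derived from the actual implementation.

```python
def wavefront_makespan(rows, cols):
    # With unlimited threads, LCU (r, c) can start 2 columns after (r-1, c),
    # so the critical path of one frame is cols + 2*(rows - 1) steps.
    return cols + 2 * (rows - 1)

def wavefront_speedup(rows, cols, threads):
    # Greedy step-by-step schedule: at each step, decode at most `threads`
    # of the LCUs whose left and top-right dependencies are satisfied.
    done = [0] * rows               # finished LCUs per row (prefix lengths)
    steps, finished, total = 0, 0, rows * cols
    while finished < total:
        ready = [r for r in range(rows)
                 if done[r] < cols and
                 (r == 0 or done[r - 1] >= done[r] + 2 or done[r - 1] == cols)]
        for r in ready[:threads]:
            done[r] += 1
            finished += 1
        steps += 1
    return total / steps            # achieved speedup over 1 LCU per step

# A 1920x1080 frame with 64x64 LCUs has 30 columns and 17 rows: even with
# one thread per row, the per-frame speedup is bounded by
# 510 / wavefront_makespan(17, 30) = 510 / 62, i.e. about 8.2x.
```

In this model the per-frame speedup saturates well below the thread count, which is precisely the motivation for overlapping the decoding of consecutive frames, so that the ramp-down of one frame coincides with the ramp-up of the next.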

7 Conclusions
In this paper we have proposed and evaluated a parallelization strategy for the emerging HEVC video codec. The proposed strategy requires that each LCU row constitutes an entropy slice. The LCU rows are processed in a wavefront parallel fashion by several line decoder threads using a ring synchronization. The presented implementation achieves real-time performance for 1920×1080 (53.1 fps) and 2560×1600 (29.5 fps) resolutions on a 12-core Xeon machine.

The proposed parallelization strategy has several desirable properties. First, it achieves good scaling efficiency at moderate core counts. Second, the number of line decoders can be chosen to match the processing capabilities of the computing hardware and the performance requirements. Third, using more cores increases the throughput and at the same time reduces the frame latency, making the implementation suitable for both low delay and high throughput use scenarios.

A limitation is the scaling efficiency at higher core counts. This is caused by the sequential part and the ramp-up and ramp-down efficiency losses of the wavefront parallel part. In future work this can be solved by pipelining the sequential part and overlapping the execution of consecutive frames.
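The pipelining of the sequential part proposed as future work can be sketched as a simple producer-consumer arrangement, in which a parser thread runs one frame ahead of the decoding threads. This is an illustrative sketch only; the function names `find_startcodes` and `decode_sao_alf_params` are hypothetical placeholders, not the actual decoder API.

```python
import queue
import threading

def find_startcodes(bitstream):
    # Hypothetical placeholder: scan the bitstream for startcodes and
    # return one payload per frame.
    return [f"frame-{i}-slices" for i in range(len(bitstream))]

def decode_sao_alf_params(frame_payload):
    # Hypothetical placeholder for the sequential SAO/ALF parameter decoding.
    return {"payload": frame_payload, "sao": None, "alf": None}

def parser_thread(bitstream, out_q):
    # Producer: performs the sequential work (startcode scanning and
    # SAO/ALF parameter decoding) while the previous frame is still
    # being decoded, hiding its execution time.
    for payload in find_startcodes(bitstream):
        out_q.put(decode_sao_alf_params(payload))
    out_q.put(None)  # end-of-stream marker

def decode_stream(bitstream):
    q = queue.Queue(maxsize=1)  # one frame of look-ahead
    t = threading.Thread(target=parser_thread, args=(bitstream, q))
    t.start()
    decoded = []
    while True:
        frame = q.get()
        if frame is None:
            break
        decoded.append(frame["payload"])  # stand-in for wavefront decoding
    t.join()
    return decoded
```

The bounded queue (`maxsize=1`) keeps the parser exactly one frame ahead, so the memory overhead is limited to one frame's worth of parsed parameters.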

References
[1] Gisle Bjontegaard. Calculation of average PSNR differences between RD-curves. Technical Report VCEG-M33, ITU-T Video Coding Experts Group (VCEG), 2001.

[2] Frank Bossen. Common test conditions and software reference configurations. Technical Report JCTVC-E700, Jan. 2011.

[3] Madhukar Budagavi and Mehmet Umut Demircin. Parallel Context Processing techniques for high coding efficiency entropy coding in HEVC. Technical Report JCTVC-B088, July 2010.

[4] Chi Ching Chi and Ben Juurlink. A QHD-capable parallel H.264 decoder. In Proc. of the Int. Conf. on Supercomputing, pages 317-326, 2011.

[5] E. B. van der Tol, E. G. T. Jaspers, and R. H. Gelderblom. Mapping of H.264 decoding on a multiprocessor architecture. In Proceedings of SPIE, 2003.

[6] Chih-Ming Fu, Ching-Yeh Chen, Chia-Yang Tsai, Yu-Wen Huang, and Shawmin Lei. CE13: Sample Adaptive Offset with LCU-Independent Decoding. Technical Report JCTVC-E409, March 2011.

[7] Lars Haglund. The SVT High Definition Multi Format Test Set. Technical report, Sveriges Television, Feb. 2006.

[8] Woo-Jin Han, Junghye Min, Il-Koo Kim, Elena Alshina, Alexander Alshin, Tammy Lee, Jianle Chen, Vadim Seregin, Sunil Lee, Yoon Mi Hong, Min-Su Cheon, Nikolay Shlyakhov, Ken McCann, Thomas Davies, and Jeong-Hoon Park. Improved Video Compression Efficiency Through Flexible Unit Representation and Corresponding Extension of Coding Tools. IEEE Transactions on Circuits and Systems for Video Technology, 20(12):1709-1720, Dec. 2010.

[9] Masaru Ikeda, Junichi Tanaka, and Teruhiko Suzuki. Parallel deblocking filter. Technical Report JCTVC-E181, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T and ISO/IEC, March 2011.

[10] Young-Long Steve Lin, Chao-Yang Kao, Hung-Chih Kuo, and Jian-Wen Chen. VLSI Design for Video Coding. Springer, 2010.

16

[11] Detlev Marpe, Heiko Schwarz, and Thomas Wiegand. Entropy Coding in Video Compression using Probability Interval Partitioning. In Picture Coding Symposium (PCS 2010), pages 66-69, Dec. 2010.

[12] Cor Meenderinck, Arnaldo Azevedo, Mauricio Alvarez, Ben Juurlink, and Alex Ramírez. Parallel Scalability of Video Decoders. Journal of Signal Processing Systems, 57:173-194, November 2009.

[13] Kiran Misra, Jie Zhao, and Andrew Segall. Entropy slices for parallel entropy coding. Technical Report JCTVC-B111, July 2010.

[14] Florian H. Seitner, Ralf M. Schreier, Michael Bleyer, and Margrit Gelautz. Evaluation of data-parallel splitting approaches for H.264 decoding. In Proceedings of the 6th International Conference on Advances in Mobile Computing and Multimedia, pages 40-49, 2008.

[15] J. Sole, R. Joshi, I. S. Chong, M. Coban, and M. Karczewicz. Parallel context processing for the significance map in high coding efficiency. Technical Report JCTVC-D262, Jan. 2011.

[16] Vivienne Sze, Madhukar Budagavi, and Anantha P. Chandrakasan. Massively parallel CABAC. Technical Report VCEG-AL21, Video Coding Experts Group (VCEG), July 2009.

[17] Vivienne Sze and Anantha P. Chandrakasan. A high throughput CABAC algorithm using syntax element partitioning. In Proceedings of the 16th IEEE International Conference on Image Processing, pages 773-776, 2009.

[18] Alexis M. Tourapis. Enhanced Predictive Zonal Search for Single and Multiple Frame Motion Estimation. In Proceedings of SPIE Visual Communications and Image Processing 2002, pages 1069-1079, Jan. 2002.

[19] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560-576, July 2003.

[20] Jie Zhao and Andrew Segall. Parallel prediction unit for parallel intra coding. Technical Report JCTVC-B112, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T and ISO/IEC, July 2010.

[21] Zhuo Zhao and Ping Liang. Data partition for wavefront parallelization of H.264 video encoder. In IEEE International Symposium on Circuits and Systems, 2006.

