Mauricio Alvarez Mesa¹, Chi Ching Chi², Thomas Schierl³ and Ben Juurlink²
¹ Universitat Politècnica de Catalunya, Barcelona, Spain
² Technische Universität Berlin, Berlin, Germany
³ Fraunhofer HHI - Heinrich Hertz Institute, Berlin, Germany
1 Introduction
Parallel architectures: multicore, manycore, GPUs, etc.

What is HEVC? High Efficiency Video Coding:
- A new standardization initiative by ISO/MPEG and ITU-T/VCEG, undertaken by the Joint Collaborative Team on Video Coding (JCT-VC).
- Target compression performance: 2X compared to H.264/AVC; resolutions up to 4k x 2k and up to 60 fps (or more); color depth: 8-bit and 10-bit (up to 14-bit).
- Application use cases: intra, random access, low delay; high efficiency and low complexity settings.
- Timeline: started in January 2010; first version expected to be completed in 2012-2013.
- Evolving reference software: the HEVC test Model (HM).
2 Overview of HEVC
HEVC is based on the same structure as prior hybrid video codecs like H.264/AVC, but with enhancements in each stage. It has a prediction stage composed of motion compensation (with variable block size and fractional-pel motion vectors) and spatial intra-prediction; integer transformation and scalar quantization are applied to the prediction residuals; and the quantized coefficients are entropy encoded using either arithmetic coding or variable length coding. Also, as in H.264, an in-loop deblocking filter is applied to the reconstructed signal.
Figure 2: Coding structure: quad-tree segmentation

HEVC has two main features that differentiate it from H.264 and its predecessors. First, a new coding structure that replaces the macroblock structure of H.264; and second, the inclusion of two new filters that are applied after the deblocking filter: the Adaptive Loop Filter (ALF) and the Sample Adaptive Offset (SAO). Figure 1 shows a general diagram of the main stages of the HEVC decoder.
3 Parallelization Opportunities
The techniques used to parallelize previous video codecs can also be used for parallelization of the HEVC decoder. In addition, some new features have been included in the current draft of the new standard to allow parallel execution at different levels of granularity. In this section we review some of these strategies and present in more detail a technique called entropy slices, which is the focus of our parallel implementation.
For motion compensation there are no intra-frame dependencies (if motion vector prediction is performed at entropy decoding), but there are inter-frame dependencies due to accesses to reference areas. By detecting (statically or dynamically) inter-frame dependencies it is possible to increase the number of independent LCUs compared to wavefront processing. This has been reported for H.264/AVC encoding and decoding [12, 21].

In H.264/AVC the deblocking filter uses filtered samples as input for filtering MBs, creating wavefront-style dependencies. In HEVC a new approach called the parallel deblocking filter [9] is introduced. In this scheme the deblocking filter is divided into two separate frame stages: horizontal and vertical filtering. Horizontal filtering takes the reconstructed frame as input and produces a filtered frame. After that the vertical filtering is applied, taking the horizontally filtered frame as input and producing the final filtered frame. With this approach all the LCU dependencies of the deblocking filter are removed, allowing parallel processing of all the LCUs in a frame.
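The dependency structure of this two-pass scheme can be sketched as follows. This is a minimal stand-in, not the HM code: the real HEVC edge filters are replaced by a hypothetical 1-2-1 smoothing kernel, and one thread per row (or column) is spawned purely to illustrate that each pass has no intra-pass dependencies.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

using Frame = std::vector<std::vector<int>>;

// Pass 1: horizontal filtering. Every row reads only the reconstructed
// frame, so all rows can be filtered concurrently.
Frame horizontal_pass(const Frame& in) {
    Frame out = in;  // borders are left unfiltered in this sketch
    std::vector<std::thread> workers;
    for (std::size_t y = 0; y < in.size(); ++y)
        workers.emplace_back([&in, &out, y] {
            for (std::size_t x = 1; x + 1 < in[y].size(); ++x)
                out[y][x] = (in[y][x - 1] + 2 * in[y][x] + in[y][x + 1]) / 4;
        });
    for (auto& t : workers) t.join();
    return out;
}

// Pass 2: vertical filtering, reading only the horizontally filtered frame.
Frame vertical_pass(const Frame& in) {
    Frame out = in;
    std::vector<std::thread> workers;
    for (std::size_t x = 0; x < in[0].size(); ++x)
        workers.emplace_back([&in, &out, x] {
            for (std::size_t y = 1; y + 1 < in.size(); ++y)
                out[y][x] = (in[y - 1][x] + 2 * in[y][x] + in[y + 1][x]) / 4;
        });
    for (auto& t : workers) t.join();
    return out;
}

Frame parallel_deblock(const Frame& reconstructed) {
    return vertical_pass(horizontal_pass(reconstructed));
}
```

Because each pass is a pure function of the previous pass's output, the scheduler is free to process all LCUs of a frame in parallel within a pass, which is exactly the property the parallel deblocking filter provides.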
Table 1: Comparison of Slices (S), Entropy Slices (ES) and Interleaved Entropy Slices (IES)

The first difference with traditional slices is that entropy slices are proposed for parallelism, not for error resilience. The main differences between slices and entropy slices are presented in Table 1. As with regular slices, in entropy slices the CABAC context models are initialized at the beginning of each slice. Also, both in regular and entropy slices the entropy coding of an LCU cannot cross slice boundaries. A difference appears in the reconstruction phase, where entropy slices allow access to LCUs in other slices for prediction. Finally, the headers of entropy slices only contain start codes and entropy coding signaling, thus having a reduced size compared to traditional slices.

Entropy slices allow entropy decoding to be performed in parallel without data dependencies. In the original design it is assumed that entropy decoding is decoupled from LCU reconstruction with a frame buffer. After parallel entropy decoding, LCU reconstruction can be performed in parallel using a wavefront parallel pattern. A similar approach to entropy slices called Interleaved Entropy Slices (IES) has been proposed (but not included in HEVC) [16]. In IES, slices are interleaved across LCU lines. Context model states are maintained for a longer period compared to regular and entropy slices, and context model selection can cross slice boundaries, resulting in minimal coding efficiency impact.
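Why entropy slices decode in parallel can be illustrated with a toy model. The `Contexts` struct and the integer "bins" below are stand-ins for the CABAC context models and the bin stream, not the real HM implementation: the only property being demonstrated is that context state is re-initialized at every slice start and never crosses a slice boundary, so each slice is a fully independent task.

```cpp
#include <future>
#include <vector>

struct Contexts { int state = 0; };  // stand-in for CABAC context models

std::vector<int> decode_entropy_slice(const std::vector<int>& bins) {
    Contexts ctx;                          // initialized at the slice start
    std::vector<int> symbols;
    for (int b : bins) {
        symbols.push_back(b + ctx.state);  // toy context-dependent "decode"
        ctx.state = b;                     // adaptation stays inside the slice
    }
    return symbols;
}

// One asynchronous task per entropy slice; the results would land in a frame
// buffer that a later wavefront reconstruction stage consumes.
std::vector<std::vector<int>>
decode_all_slices(const std::vector<std::vector<int>>& slices) {
    std::vector<std::future<std::vector<int>>> tasks;
    for (const auto& s : slices)
        tasks.push_back(std::async(std::launch::async, decode_entropy_slice, s));
    std::vector<std::vector<int>> out;
    for (auto& t : tasks) out.push_back(t.get());
    return out;
}
```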
Figure 5: The decoder stages are applied on different adjacent LCUs to maintain the kernel dependencies. Each square represents a 2x2 pixel block.
Figure 6: Wavefront progression of the combined stages. The colors show the decoding progress of each stage, before starting the decode of the hatched blocks. The SAO filter is delayed until all the pixels in the LCU are deblocked.

The LCU-independent SAO design [6] avoids a classification method that would require pixels from neighboring LCUs in the edge offset mode. This allows the SAO filter to be applied LCU-parallel for each level of the hierarchical quadtree. For our implementation it is necessary that the SAO filter is applied in a single pass over all the quadtree levels for each LCU. Since the SAO filter is LCU-independent, a depth-first method is equivalent to the breadth-first method. A mapping table is used to efficiently link each LCU to the corresponding quadrant of each level.

Unfortunately, the ALF stage cannot be combined with the other stages in HM-3.0. To perform the ALF on a CU, its absolute CU index is required to index the ALF on/off flag array. The absolute CU index, however, is only known after all previous CUs are entropy decoded. This condition is not met when processing the ALF in the line decoders using the Ring-Line strategy. The ALF, therefore, is performed after reconstructing the entire picture in a separate pass. Because the ALF is a relatively compute-intensive stage, the impact of an additional picture buffer pass is not significant.

In our implementation, to reduce cache line conflicts and synchronization overhead, eight consecutive LCUs are grouped in a work unit and processed by a single core.
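The work-unit grouping can be sketched as a half-open range partition. The function name and the parameterized group size are illustrative, not taken from HM; the point is that a core claims eight consecutive LCUs with a single synchronization instead of one per LCU.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Partition the LCUs of one line into work units of `group_size` consecutive
// LCUs. Each pair is a half-open range [first, second) claimed by one core.
std::vector<std::pair<int, int>> make_work_units(int lcus_per_line,
                                                 int group_size = 8) {
    std::vector<std::pair<int, int>> units;
    for (int start = 0; start < lcus_per_line; start += group_size)
        units.emplace_back(start, std::min(start + group_size, lcus_per_line));
    return units;
}
```

For a 1920-pixel-wide frame with 64x64 LCUs (30 LCUs per line), this yields four work units, reducing synchronization operations per line from 30 to 4.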
5 Experimental Results
In this section we present the experimental results for the proposed parallelization methodology. We present coding efficiency results for analysing the impact of entropy slices on compression and, after that, we analyse the performance of our implementation on two different parallel machines.
Options                         Value
Max. CU Size                    64x64
Max. Partition Depth            4
Period of I-frames              32
Number of B-frames (GOPSize)    8
Number of reference frames      4
Motion Estimation Algorithm     EPZS [18]
Search range                    64
Entropy Coding                  CABAC
Internal Bit Depth              10
Adaptive Loop Filter (ALF)      enabled
Sample Adaptive Offset (SAO)    enabled
Quantization Parameter (QP)     22, 27, 32, and 37
System                Fujitsu RX600 S5         Dell Precision T5500
Processor             Intel Xeon X7550         Intel Xeon X5680
ISA                   X86-64                   X86-64
Architecture          Nehalem-EX               Westmere
Num. sockets          4                        2
Num. cores/socket     8                        6
Num. threads/socket   16                       12
Technology            45 nm                    32 nm
Clock frequency       2.0 GHz                  3.33 GHz
Power                 130 W                    130 W
Level 1 D-cache       32 KB / core             32 KB / core
Level 2 D-cache       256 KB / core            256 KB / core
Level 3 cache         18 MB / socket           12 MB / socket
Memory                DDR3-1066                DDR3-1333
Interconnection       QuickPath                QuickPath
Boost                 1.46.1                   1.42.1
Compiler              GCC-4.4.5 -O3            GCC-4.5.2 -O3
Operating system      Linux kernel 2.6.32-5    Linux kernel 2.6.38-8
Table 4: Experimentation platform

5.1.2 Platform

Multiple threads were created using the Boost C++ thread library, which enables multithreading in shared memory systems for C++ programs. For our parallel decoding experiments we used two parallel systems. One is a cache-coherent Non-Uniform Memory Access (cc-NUMA) machine. It is based on the Intel Xeon X7550 processor, which has 8 cores per chip; the whole system contains 4 sockets, for a total of 32 cores, connected with the Intel QuickPath Interconnect (QPI). The second one is a dual-socket machine based on the Intel Xeon X5680 processor, which has 6 cores per chip, for a total of 12 cores. The main parameters of these architectures are listed in Table 4. The machines were configured with the TurboBoost feature disabled to avoid dynamic changes in frequency. Although Simultaneous MultiThreading (SMT) is enabled by default, we do not schedule more than one thread per core. Also, we used thread pinning to manually assign threads to cores, avoiding thread migrations, which have even more negative effects on cc-NUMA architectures.
           1 regular slice per row              1 entropy slice per row
Class      Y BD-rate  U BD-rate  V BD-rate      Y BD-rate  U BD-rate  V BD-rate
Class S    5.037      13.689     7.093          3.808      4.585      4.865
Class A    6.261      16.854     12.429         5.472      6.381      5.724
Class B    9.216      11.518     8.964          6.3        6.802      5.929

Table 5: Coding efficiency of regular slices and entropy slices compared to HM-3.0 with one slice per frame
5.3 Performance
We ran the parallel HEVC decoder on the two parallel machines. We used all the videos described in Table 3 and we decoded each one five times for each number of threads.

5.3.1 Speedup and Frames-per-Second

The sub-figures on the left side of figures 7 and 8 show the performance in terms of average speedup and frame rate for the three input video classes and the two machines under study. Speedup is computed against the original sequential code (thread 0) and is presented along with the parallel code using one core (thread 1). The main difference between the two machines is that the T5500 has a higher clock rate than the RX600. This allows the former to process an average of 50 class B frames per second when using 10 processors. The speedup curves are very similar: the parallelization efficiency is relatively high for small core counts (more than 80% for 4 cores) and the speedup saturates at some number of processors, for example 12 processors for class B, at which a maximum speedup of 5 is reached with an efficiency of 50%. For the most demanding sequences (class S), even with a high number of processors and using the high frequency configuration, it was not possible to reach real-time operation (a maximum of 15 fps is reached with 11 cores). In general the low absolute performance is due to the low performance of the original single-threaded reference code. Additional optimizations (like SIMD vectorization) can be applied to increase the performance of the single-threaded code.

Table 6 shows the corresponding number of entropy slices, the maximum number of processors used and the obtained speedup for the different code sections. The ALF section exhibits an almost linear speedup (with an efficiency close to 90%). The other sections (Entropy Decoding, LCU Reconstruction, Deblocking Filter and SAO) have a lower efficiency (around 53%) due to data dependencies and load imbalance. The total speedup and performance are limited by the ED + LCU + DF + SAO stages.
5.3.2 Profiling of Execution Time

We performed a profiling analysis in order to identify the relative contribution of different parts of the application to the final performance. The sub-figures on the right side of figures 7 and 8 show the average execution time for each video class. It has been divided into sequential and parallel portions. The parallel part has been further divided into two sections, one with the ED + LCU + DF + SAO stages and the other one with the ALF kernel. Due to its massively parallel nature, the ALF filter execution
Figure 7: Speedup and average execution time for the rx600s51t machine. Speedup is measured against the original sequential code (referred to as thread 0 in the figure). Thread 1 is the parallel code with one thread. Average execution time is divided into sequential and parallel parts.

time reduces almost linearly with the number of threads. The ED + LCU + DF + SAO stages also reduce their execution time but reach a saturation point; and the sequential stage increases its fraction of the total execution time according to Amdahl's law. Table 7 shows the contribution to the total execution time of the different stages
Figure 8: Speedup and average execution time for the x5680 machine. Speedup is measured against the original sequential code (referred to as thread 0 in the figure). Thread 1 is the parallel code with one thread. Average execution time is divided into sequential and parallel parts.

using the number of processors that generates the maximum speedup. The contribution of ALF at the maximum number of processors is below 15%. The sequential part of the application becomes important, with a contribution between 15% and 22%. But the main limitation in scalability is the ED + LCU + DF + SAO stage, which
Table 6: Per-class number of entropy slices, maximum number of processors, ED+LCU+DF+SAO speedup, ALF speedup, total speedup and FPS, for the rx600s51t and x5680 machines.

Table 7: Contribution of different stages to total execution time with maximum number of processors
(Axes: number of parallel LCUs vs. time slot; curves for Class S, Class A and Class B.)
Figure 9: Maximum parallelism in the wavefront kernels

dominates the execution time of the parallel decoder. It takes more than 60% of the execution time with the maximum number of processors. This limitation is due to the wavefront dependencies in some kernels. This type of dependency generates a variable number of parallel tasks with a ramp pattern. Figure 9 shows the number of independent tasks for kernels with wavefront dependencies. Assuming constant task time and no synchronisation overhead, the maximum theoretical speedups are 16.19, 11.36 and 8.22 for classes S, A and B respectively. The efficiency of our parallelization compared to this maximum is between 71% and 76%.
7 Conclusions
In this paper we have proposed and evaluated a parallelization strategy for the emerging HEVC video codec. The proposed strategy requires that each LCU row constitutes an entropy slice. The LCU rows are processed in a wavefront parallel fashion by several line decoder threads using a ring synchronization. The presented implementation achieves real-time performance for 1920x1080 (53.1 fps) and 2560x1600 (29.5 fps) resolutions on a 12-core Xeon machine.

The proposed parallelization strategy has several desirable properties. First, it achieves good scaling efficiency at moderate core counts. Second, the number of line decoders can be chosen to match the processing capabilities of the computing hardware and the performance requirements. Third, using more cores increases the throughput and at the same time reduces the frame latency, making the implementation suitable for both low-delay and high-throughput use scenarios.

A limitation is the scaling efficiency at higher core counts. This is caused by the sequential part and the ramp-up and ramp-down efficiency losses of the wavefront parallel part. In future work this can be solved by pipelining the sequential part and overlapping the execution of consecutive frames.
References
[1] Gisle Bjontegaard. Calculation of average PSNR differences between RD-curves. Technical Report VCEG-M33, ITU-T Video Coding Experts Group (VCEG), 2001.
[2] Frank Bossen. Common test conditions and software reference configurations. Technical Report JCTVC-E700, Jan. 2011.
[3] Madhukar Budagavi and Mehmet Umut Demircin. Parallel Context Processing techniques for high coding efficiency entropy coding in HEVC. Technical Report JCTVC-B088, July 2010.
[4] Chi Ching Chi and Ben Juurlink. A QHD-capable parallel H.264 decoder. In Proc. of the Int. Conf. on Supercomputing, pages 317-326, 2011.
[5] E. B. Van der Tol, E. G. T. Jaspers, and R. H. Gelderblom. Mapping of H.264 decoding on a multiprocessor architecture. In Proceedings of SPIE, 2003.
[6] Chih-Ming Fu, Ching-Yeh Chen, Chia-Yang Tsai, Yu-Wen Huang, and Shawmin Lei. CE13: Sample Adaptive Offset with LCU-Independent Decoding. Technical Report JCTVC-E409, March 2011.
[7] Lars Haglund. The SVT High Definition Multi Format Test Set. Technical report, Sveriges Television, Feb. 2006.
[8] Woo-Jin Han, Junghye Min, Il-Koo Kim, Elena Alshina, Alexander Alshin, Tammy Lee, Jianle Chen, Vadim Seregin, Sunil Lee, Yoon Mi Hong, Min-Su Cheon, Nikolay Shlyakhov, Ken McCann, Thomas Davies, and Jeong-Hoon Park. Improved Video Compression Efficiency Through Flexible Unit Representation and Corresponding Extension of Coding Tools. IEEE Transactions on Circuits and Systems for Video Technology, 20(12):1709-1720, Dec 2010.
[9] Masaru Ikeda, Junichi Tanaka, and Teruhiko Suzuki. Parallel deblocking filter. Technical Report JCTVC-E181, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T and ISO/IEC, March 2011.
[10] Young-Long Steve Lin, Chao-Yang Kao, Hung-Chih Kuo, and Jian-Wen Chen. VLSI Design for Video Coding. Springer, 2010.
[11] Detlev Marpe, Heiko Schwarz, and Thomas Wiegand. Entropy Coding in Video Compression using Probability Interval Partitioning. In Picture Coding Symposium (PCS 2010), pages 66-69, Dec. 2010.
[12] Cor Meenderinck, Arnaldo Azevedo, Mauricio Alvarez, Ben Juurlink, and Alex Ramirez. Parallel Scalability of Video Decoders. Journal of Signal Processing Systems, 57:173-194, November 2009.
[13] Kiran Misra, Jie Zhao, and Andrew Segall. Entropy slices for parallel entropy coding. Technical Report JCTVC-B111, July 2010.
[14] Florian H. Seitner, Ralf M. Schreier, Michael Bleyer, and Margrit Gelautz. Evaluation of data-parallel splitting approaches for H.264 decoding. In Proceedings of the 6th International Conference on Advances in Mobile Computing and Multimedia, pages 40-49, 2008.
[15] J. Sole, R. Joshi, I. S. Chong, M. Coban, and M. Karczewicz. Parallel context processing for the significance map in high coding efficiency. Technical Report JCTVC-D262, Jan 2011.
[16] Vivienne Sze, Madhukar Budagavi, and Anantha P. Chandrakasan. Massively parallel CABAC. Technical Report VCEG-AL21, Video Coding Experts Group (VCEG), July 2009.
[17] Vivienne Sze and Anantha P. Chandrakasan. A high throughput CABAC algorithm using syntax element partitioning. In Proceedings of the 16th IEEE International Conference on Image Processing, pages 773-776, 2009.
[18] Alexis M. Tourapis. Enhanced Predictive Zonal Search for Single and Multiple Frame Motion Estimation. In Proceedings of SPIE Visual Communications and Image Processing 2002, pages 1069-1079, Jan. 2002.
[19] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560-576, July 2003.
[20] Jie Zhao and Andrew Segall. Parallel prediction unit for parallel intra coding. Technical Report JCTVC-B112, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T and ISO/IEC, July 2010.
[21] Zhuo Zhao and Ping Liang. Data partition for wavefront parallelization of H.264 video encoder. In IEEE International Symposium on Circuits and Systems, 2006.