Detailed HEVC/H265 Overview

Detailed Overview of
HEVC/H.265
Prepared by Shevach Riabtsev
The author is thankful to every individual who has reviewed the

presentation and provided comments/suggestions
High Level Syntax
HEVC Encoder
+ Residual Bit-Stream
Ref
Ref
Ref T&Q CABAC
Motion -
Est. MVs
Input Video
Inter
Motion
Comp.
Reference samples Intra/Inter MVs/Intra modes
Mode Intra Decision
Pred.
Intra
Quantized
Intra
Est. residuals
+
Ref.
Ref
Ref SAO Deblk.
Reconstructed
Q-1& T-1
Ref +
DPB SAO Params
Filter Control Est. SAO params
Notes:
In addition to AVC/H.264, SAO and SAO Params Estimation added.
The block SAO Params Est. can be executed right after deblocking or
right after the reconstruction (with negligible penalty) as shown in the figure.
HEVC similarity with AVC/H.264 allows quick upgrading of existing AVC/H.264 solutions
to HEVC ones.
Bitstream Structure
Slice
VPS SPS PPS Slice Data * * * *
Header
Picture #1
Slice Slice
Slice Data Slice Data * * * *
Header Header
Picture #k
As in H.264/AVC the byte stream format is also specified in HEVC, where each
NAL unit is delimited by start-code (0x000001).
Notice that each stream must commences with the 4-bytes start code (0x00000001) at least.
The 4-bytes start code at the very beginning of a stream enables a decoder to achieve
byte boundary and not skip over the first NAL (provided that the decoder enter to the stream
in bit-aligned position and not byte-aligned one).
High-Level Syntax ( VPS/SPS)
VPS dedicated to convey information that is common for multiple layers, i.e.
each layer refers same VPS
SPS contain information which applies to all pictures of a video sequence

and is fixed within this sequence:
Profile, level, picture size, number sub-layers
Enabling flags
Restrictions:
log2_min_luma_coding_block_size_minus3 - minimal CU size
log2_diff_max_min_luma_coding_block_size together with the minimal CU size specifies
maximal CU size
log2_min_transform_block_size minimal transform block size
log2_diff_max_min_transform_block_size together with the minimal TB size specifies the
maximal TB size
Temporal scalability control
Visual usability information (VUI)
Notes:
There is a duplication of some information between SPS and VPS (e.g. profile_idc).
Potential Usage of Some SPS Parameters
log2_min_luma_coding_block_size specify the minimal CU size.

Potential usage: if a priory known that a video sequence is flat or smooth then its worth to
consider setting log2_min_luma_coding_block_size = 4 (16x16). Otherwise split-flags at the
depth 16x16 are redundantly signaled.
log2_diff_max_min_luma_coding_block_size together with

log2_min_luma_coding_block_size specify CTU size. There is no reason (excepting maybe
a legacy to H.264/AVC) to set CTU size smaller than 64x64.
Moreover, according to [8], 64 64-sized CTU brings nearly 12% bitrate reduction on the
average compared with 1616-sized CTU.
log2_min_transform_block_size specify the minimal transform block size. Potential

usage, in case of flat video sequence its worth to consider setting
log2_min_transform_block_size to 8x8.
log2_diff_max_min_transform_block_size - together with the minimal TB size specifies

the maximal TB size. Large transform sizes can cause performance peaks therefore its
worth consider to avoid 32x32 transforms by setting maximal transform size to 16x16.
High-Level Syntax ( PPS/Slice Header)
PPS conveys information which could change from picture to picture

Reference list size
Initial QP, by the way do not confuse QP with the quantizer step size. QP is a control
parameter that controls what the step size is.
Enabling flags
Tiles/Wavefronts
Slice Header - conveys information that can change from slice to slice
POC, Slice type
Prediction weights
Deblocking parameters
Tiles Entry points
Reference picture lists: the list of reference pictures in DPB is explicitly signaled in
the slice header (unlike to AVC/H.264 where MMCO or sliding window mode is used).
Not mentioned pictures in the list are marked as unused for reference and should be
removed from DPB respectively. Its worth mentioning that the explicit signaling of the
reference pictures enhances error resilience. Indeed, if a decoder detects that one of
the mentioned pictures is not exist in DPB then the decoder derives that this picture got
lost. Maximal number of reference indexes is 15 (unlike 16 in AVC/H.264).
Selected Picture Types (IDR, CRA)
IDR - pictures following the IDR in decoding order cannot use

pictures decoded prior to the IDR as reference:
CRA pictures following the CRA in both decoding and presentation

order cannot use pictures decoded prior the CRA as reference :
Leading pictures
CRA
0 1 2 3 4
CRA
0 2 3 1 4
Selected Picture Types (RADL, RASL)
Leading pictures - following in decoding order but preceding in presentation

order. Leading pictures are divided into two types:
RADL (random access decodable leading) can be correctly decoded if

decoding starts with the current CRA
RASL (random access skipped leading) - cant be correctly decoding if

decoding starts with the current CRA and therefore this picture should be skipped.
Decoding order: Leading pictures

CRA
0 1 2 3 4
RASL RADL CRA

Presentation order:
0 2 3 1 4
Picture Syntax (2)
Coding Tree Block (CTB):

Picture is partitioned into square coding tree blocks (CTBs). The size N of the CTBs is
chosen by the encoder (16x16, 32x32, 64x64). Luma CTB covers a square picture area of
N N samples and the corresponding chroma CTBs cover each (N/2) (N/2) samples (in
4:2:0 format).
Coding Tree Units (CTU):

The luma CTB and the two chroma CTBs, together with the associated syntax, form a
coding tree unit (CTU). The CTU is the basic processing unit similar to MB in prior standards.
Coding Block (CB):

Each CTB can be further partitioned into multiple coding blocks (CBs). The size of the CB
can range from the same size as the CTB to a minimum size (88).
Coding Unit (CU)

The luma CB and the chroma CBs, together with the associated syntax, form a coding unit
(CU). Each CU can be either Intra or Inter predicted. Actually CU is the basic unit for
compression.
CTU Syntax
64x64 CTU
CTU Syntax (2)
All CUs in a CTU are encoded (traversed) in ZScan (depth-first) order, this order
makes top and left samples to be available (casual) in most cases :
64x64 CTU
The figure taken:
Benjamin Bross: Relax it's only HEVC, WBU-ISOG Forum, European Broadcast Union, Geneva,
Switzerland, November 28, 2012,
CTU Syntax (3)
Formally CTU specifies quad-tree traversed in depth-first order
Note: unlike to prior standards where MB consists of MB header followed by MB data, in

HEVC headers are dispersed or interleaved with data (complicate CTU pipeline, we cant
separate CTU headers parsing and CTU data decoding):
CTU
CU Hdr CU Data * * * * CU Hdr CU Data
Header
CTU
CU Syntax (1)
Prediction Block (PB):
Each CB is partitioned in 1, 2 or 4 prediction blocks (PBs).
Prediction Unit (PU):

The luma PB and the chroma PBs, together with the associated syntax, form a prediction
unit (PU).
Intra:
2Nx2N NxN (only if CB size is smallest CB size)
Inter:
2Nx2N NxN 2NxN Nx2N

CU Syntax (2)
Inter Assymetric Partitions (conditioned by amp_enabled_flag in SPS and disabled for

the minimal CB size):
nLx2N nRx2N 2NxnU 2NxnD
Examples when assymetric partitions are beneficial:
2NxnU 2NxnD nLx2N nRx2N
Notice that if CU size is 8x8 assymetric partitions are disabled (in order to reduce
complexity). I think that assymetric partitions could be disabled for 16x16 sizes too.
CU Syntax (3)
Notes:
The smallest luma PB size is 4 8 or 8 4 samples (where 4x8 and 8x4 are
permitted only for uni-directional predictions, no bi-prediction < 8x8 allowed).
Chroma PBs mimic corresponding luma partition with the scaling factor 1/2 for 4:2:0.
Assymetric splitting is also applied to chroma CBs.
Preprocessing (basing on texture and/or block complexity metric) can be used to

speed up PB size decision process. If the complexity of CTU is high (detailed, textured
region) then large PUs are filtered out, if the complexity is low (flat region) then small
PUs are filtered out.
Example method for the fast PU size detection (for intra case) is described in the paper
Content Adaptive Prediction Unit Size Decision Algorithm for HEVC Intra Coding,
2012 Picture Coding Symposium.
CU Syntax (4)
Transform Block (TB) :
Each luma CB can be quadtree partitioned into one, four or larger number of TBs.
The number of transform levels is controlled by max_transform_hierarchy_depth_inter
and max_transform_hierarchy_depth_intra.
Example.
CB divided into two TB levels (the block #1 is split into four blocks):
1,0 1,1
0
1,2 1,3
0 2 3
2 3
1,0 1,1 1,2 1,3

CU Syntax (5) [Shevach]
Computational complexity to find best TU partition:
For the range transform block sizes from 8x8 to 32x32 we evaluate RD cost 21
times:
1 {32x32} + 4 {16x16} + 16 {8x8} = 21
For the range transform block sizes from 4x4 to 32x32 (intra CU) we evaluate RD
cost 53 times:
1 {32x32} + 4 {16x16} + 16 {8x8} + 32 {4x4} = 53
CU Syntax (6)
Notes
Unlike to H.264/AVC where TB PB, prediction and transform partitioning are

almost independent (i.e. TB can contain several PBs and vice versa). However,
TB>PB is allowable only for Inter and not for Intra (i.e. intra TB PB ).
Reported by some experts that prediction discontinuities on PB boundaries

within TB are smoothed by transform and quantization. If PB and TB boundaries
coincide then the discontinuities are observed increased.
2x2 TBs are disabled (minimal TB size is 4x4). How handle chroma blocks in
4:2:0 format if luma TB is 4x4?
Luma 8x8 Chroma 4x4
0 1 Cb
4x4 4x4 4x4
2 3 Cr
4x4 4x4 4x4
Restrictions/Constraints
a) HEVC disallows 16x16 CTBs for level 5 and above (4K TV).
Motivation:
16x16 CTBs add overheads for decoders to target 4K TV:
Up to 10% increase in worst-case decode time
Add storage for SAO params.
b) Maximal CTU size shall be less than or equal to 5*RawCtuBits/3.
The variable RawCtuBits is derived as
RawCtuBits=CtbSizeY * CtbSizeY * BitDepthY +2 * ( CtbWidthC * CtbHeightC ) * BitDepth C
Numeric Example:
Lets take CtbSizeY=16 (as in AVC/H.264). Then RawCtuBits = 16*16*8+2*8*8*8 = 3072,
the maximal CTB bit-size is 5*3072/3 = 5120 bits ( much more than the corresponding
3200 bits threshold in AVC/H.264).
Note on maximal CTU bit-size and worst-case CABAC performance
CABAC decoding (as well as encoding) contains the renormalization stage (due to finite
arithmetic). The renormalization procedure is time consuming since it contains a while-loop
and several if-else statements inside the loop.
The number of calls the renormalization routine for a CTU is less or equal than the CTU bit-
size (because during the renormalization at least one bit is read from bit-stream).
Therefore if the worst case CTU bit-size is 5120 bits then the decoder has to invoke the
renormalization at most 5120 times, i.e. 5120 times in the worst case.
From point of CABAC HW design the execution of renormalization the 5120 times is a
serious performance bottleneck.
Note on Interlace Coding
Unlike to H.264/AVC, support of interlace coding in HEVC is not exist:
No mixed frame-field interaction (like PAFF in H.264/AVC)
No interlace scanning of transform coefficients
No correction MVX[1] (or y-component of MV) if current and reference pictures are in
different polarity (top-bottom or bottom-top).
Field pictures are signaled by an SEI message (pic_timing) for every picture in the sequence.
If progressive and interlace streams are spliced together then its required to insert a new
sequence start to switch from progressive coding to interlaced one (or vice versa).
In addition a particular flag general_interlaced_source_flag is signaled in VPS/SPS

(within profile_tier_level section)
In H.264/AVC PAFF mode can be used to diminish I-frame bitrate peaks: I-frame is divided
into two field pictures where the top field picture is coded as I-picture while the bottom picture
is coded as P-picture. Consequently total bits produced by two I-P field pictures is expected
to be smaller than the bits generated by single I-frame.
Because H.265/HEVC does not support PAFF the above trick cant be applied to cope with I-
frame bitrate peaks.
Note on Picture Boundaries
As per the standard the picture boundaries are defined in units of the minimum luma CB size
(MinCbSizeY):
pic_width_in_luma_samples shall not be equal to 0 and shall be an integer multiple of

MinCbSizeY.
pic_height_in_luma_samples shall not be equal to 0 and shall be an integer multiple of

MinCbSizeY
As a result, at the right and bottom edges of the picture CTBs may exceed the picture
boundaries. Data outside of the picture is not coded, therefore quadtree on the right and
bottom edges are pruned respectively.
Pls. see the following slide (granted by John Funnel from Parabola) for illustration:
Note on Reference Picture Signaling
Reference picture set (RPS) is signaled within the slice header of non-IDR pictures. Each
reference picture in RPS is identified by its POC.
Unlike to AVC/H.264, no information from previous pictures is needed to parse RPS and to
populate the reference list.
If a picture is declared in RPS and not present in DPB (Decoded Picture Buffer) then a
decoder should derive that the picture is lost.
To minimize the slice header overhead, up to 64 different RPS can be signaled in SPS and
the slice header can contain a reference to one of these RPS.
Inter Prediction/ Motion
Compensation
Overview
Motion Compensation consists of three steps:
Fetch - reference data, padding is applied if reference block outside picture boundaries.
Interpolation for fractional motion vectors (MV)
Weighted Prediction
Weighted
Fetch Interpolation Prediction
(optional)
padding
Luma Interpolation Details (1)
Fractional interpolation for luma samples uses 8-tap filter for both half-pels and quarter-pels
(although some positions actually reduced to 7-tap filter).
Notice that in AVC/H.264 the motion compensation is executed in two serial stages for each
direction (horizontal and vertical):
6-tap filter for half-pels
bilinear filter for quarter-pels
So, AVC/H.264 gives the same complexity as HEVC 8-tap filter but two-stage latency (HEVC
motion compensation filter can be executed in one stage).
In HEVC luma interpolation consists of two stages: horizontal and vertical filtering.
Intermediate results after the horizontal stage are within16-bits accuracy (even if bitDepth>8
then corresponding right shift is applied to keep 16-bits dynamic range).
After the second stage, the results are right-shifted by 6 for bitDepth=8 (unlike to AVC/H.264
no rounding is applied) to reduce the dynamic range to 16 bits.
If bitDepth is 8 (i.e. 8-bits samples) then the order of interpolation is irrelevant. One can
execute firstly the vertical filtering and then horizontal one and vice versa.
Interpolation Flow Chart (w/o weighted prediction)
Horizontal 8 bits per pixel Vertical

Fwd Ref Filter Filter
>>6
>>2
10 bits per pix
Merge
Horizontal 8 bits per pixel Vertical

Bwd Ref Filter Filter
>>6
>>2
10 bits per pix
In the following slides we illustrate as positions a0,0 through r0,0 are specified.
A -1, -1 A 0, -1 a 0, -1 b 0, -1 c 0, -1 A 1, -1 A 2, -1
A -1,0 A 0,0 a 0,0 b 0,0 c 0,0 A 1,0 A 2,0
d -1,0 d 0,0 e 0,0 f 0,0 g 0,0 d 1,0 d 2,0
h -1,0 h 0,0 i 0,0 j 0,0 k 0,0 h 1,0 h 2,0
n -1,0 n 0,0 p 0,0 q 0,0 r 0,0 n 1,0 n 2,0
A -1,1 A 0,1 a 0,1 b 0,1 c 0,1 A 1,1 A 2,1
A - 1,2 A 0,2 a 0,2 b 0,2 c 0,2 A 1,2 A 2,2

Quarter-pels a0,0, c0,0, d0,0, n0,0 and half-pels b0,0, h0,0 are derived from nearest integer
positions.
The quarter-pels a0,0, c0,0, d0,0, n0,0 are derived by the 7-tap filter and the half-pels b0,0, h0,0
by 8-tap filter.
a0,0 , b0,0 and c0,0 are computed by horizontal filtering, while d0,0 , h0,0 and n0,0 by vertical
filtering.
A -1, -1 A 0, -1 a 0, -1 b 0, -1 c 0, -1 A 1, -1 A 2, -1
A -1,0 A 0,0 a 0,0 b 0,0 c 0,0 A 1,0 A 2,0
d -1,0 d 0,0 e 0,0 f 0,0 g 0,0 d 1,0 d 2,0
h -1,0 h 0,0 i 0,0 j 0,0 k 0,0 h 1,0 h 2,0
n -1,0 n 0,0 p 0,0 q 0,0 r 0,0 n 1,0 n 2,0
A -1,1 A 0,1 a 0,1 b 0,1 c 0,1 A 1,1 A 2,1
A - 1,2 A 0,2 a 0,2 b 0,2 c 0,2 A 1,2 A 2,2

Half-pel j0,0 is derived by applying the 8-tap filter vertically to nearest half-pels: b0,3 , b0,2 ,
b0,1 , b0,0 , b0,1 , b0,2 , b0,3 , b0,4 . Notice that j0,0 can be determined only after b0,0 has been
computed (see the previous slide).
Quarter-pels e0,0 and p0,0 are derived by applying the 7-tap filter vertically to nearest quarter-
pels. Notice that e0,0 and p0,0 can be determined only after a0,0 has been computed (see the
previous slide).
A - 1, -1 A 0, -1 a 0, - 1 b 0, - 1 c 0, - 1 A 1, -1 A 2, - 1
A -1,0 A 0,0 a 0,0 b 0,0 c 0,0 A 1,0 A 2,0
d -1,0 d 0,0 e 0,0 f 0,0 g 0,0 d 1,0 d 2,0
h -1,0 h 0,0 i 0,0 j00 k 0,0 h 1,0 h 2,0
n -1,0 n 0,0 p 0,0 q 0,0 r 0,0 n 1,0 n 2,0
A -1,1 A 0,1 a 0,1 b 0,1 c 0,1 A 1,1 A 2,1
A -1,2 A 0,2 a 0,2 b 0,2 c 0,2 A 1,2 A 2,2

Quarter-pel i0,0 is derived by applying the 8-tap filter vertically to nearest quarter-pels: a 0,3,
a0,2, a0,1, a0,0, a0,1, a0,2, a0,3, a0,4
Quarter-pel k0,0 is derived by applying the 8-tap filter vertically to nearest quarter-pels: c 0,3,
c0,2, c0,1, c0,0, c0,1, c0,2, c0,3, c0,4
A -1, - 1 A 0, - 1 a 0, - 1 b 0, - 1 c 0, - 1 A 1, - 1 A 2, -1
A - 1,0 A 0,0 a 0,0 b 0,0 c 0,0 A 1,0 A 2,0
d - 1,0 d 0,0 e 0,0 f 0,0 g 0,0 d 1,0 d 2,0
h - 1,0 h 0,0 i 0,0 j 0,0 k 0,0 h 1,0 h 2,0
n - 1,0 n 0,0 p 0,0 q 0,0 r 0,0 n 1,0 n 2,0
A - 1,1 A 0,1 a 0,1 b 0,1 c 0,1 A 1,1 A 2,1
A - 1,2 A 0,2 a 0,2 b 0,2 c 0,2 A 1,2 A 2,2

Quarter-pels f0,0 , g0,0 , q0,0 , r0,0 are derived by applying the 7-tap filter vertically to the
nearest quarter-pels.
A -1, - 1 A 0, - 1 a 0, - 1 b 0, - 1 c 0, - 1 A 1, - 1 A 2, -1
A - 1,0 A 0,0 a 0,0 b 0,0 c 0,0 A 1,0 A 2,0
d - 1,0 d 0,0 e 0,0 f 0,0 g 0,0 d 1,0 d 2,0
h - 1,0 h 0,0 i 0,0 j 0,0 k 0,0 h 1,0 h 2,0
n - 1,0 n 0,0 p 0,0 q 0,0 r 0,0 n 1,0 n 2,0
A - 1,1 A 0,1 a 0,1 b 0,1 c 0,1 A 1,1 A 2,1
A - 1,2 A 0,2 a 0,2 b 0,2 c 0,2 A 1,2 A 2,2

Chroma Interpolation
The fractional interpolation for the chroma is similar to the luma with pel and 4-tap filters.
The filter coefficients depends on position, e.g. for ab00 the coefficients are [ -2,58,10,-2].
Notes/Conclusions
1. Luma interpolation can be performed in two serial stages: half-pel and quarter-
pel.
2. For motion compensation of NxM block its required to load (N+7)x(M+7)

reference block (3 left/above, 4 right and below).
3. 8 tap filter coefficients: { -1, 4, -11, 40, 40, -11, 4, -1 }
4. 7-tap filter coefficients: { -1, 4, -10, 58, 17, -5, 1 }, non-symmetric.
5. Luma interpolation expands the dynamic range (DR) to 22 bits.

Dynamic Range Estimation
Worst case for 1/2-pel filter :
b0,0 = A3,0 + 4*A2,0 11*A1,0 + 40*A0,0 + 40*A1,0 11*A2,0 + 4*A3,0 A4,0
Maximal value of b0,0 is 88*255 = 22440, the minimal value is -24*255 = -6120. The same
limits are also correct for h0,0 .
Worst case for 1/4-pel filter:
a0,0 = A3,0 + 4*A2,0 10*A1,0 + 58*A0,0 + 17*A1,0 5*A2,0 + A3,0
Maximal value of a0,0 is 80*255 = 20400, the minimal value is -16*255 = -4080. The same
limits are also correct for c0,0, d0,0, n0,0 .
-4080 a0,0, c0,0, d0,0, n0,0 20400
-6120 b0,0, h0,0 22440
So, the values of a0,0, c0,0, d0,0, n0,0 , b0,0, h0,0 are within 16 bits.
Dynamic Range Estimation (2)
Second step of interpolation (when neighboring half-pel and quarter-pel samples are uses):
e0,0 = ( a0,3 + 4*a0,2 10*a0,1 + 58*a0,0 + 17*a0,1 5*a0,2 + a0,3 ) >> 6
Taking into account that a0,k is in the range [-4080 .. 20400] the expression in the parenthesis
gives the following limits:
-80*4080= - 326400 a0,3 + 4*a0,2 10*a0,1 + 58*a0,0 + 17*a0,1 5*a0,2 + a0,3 80*20400=1632000
So, the dynamic range increased to 22 bits (incl. sign bit).
After shifting by 6 the dynamic range reduced to 16 bits: -5100 e0,0 25500
j0,0 = ( b0,3 + 4*b0,2 11*b0,1 + 40*b0,0 + 40*b0,1 11*b0,2 + 4*b0,3 b0,4 ) >> 6
Taking into account that b0,k is in the range [-6120 .. 22440] the expression in the parenthesis
gives the following limits:
-88*6120 = - 538560
b0,3 + 4*b0,2 11*b0,1 + 40*b0,0 + 40*b0,1 11*b0,2 + 4*b0,3 b0,4
88*22440=1974720
As in the case with e0,0 the dynamic range in calculation of j0,0 is increased to 22 bits. After
the shift by 6, the dynamic range is reduced to 16 bits.
Intra Prediction
Overview
33 angular predictions for both luma and chroma and two non-directional
predictions (DC, Planar).
PB sizes from 44 up to 6464.

Like in AVC/H.264 Luma intra prediction mode is coded predictivly and
chroma intra mode is derived from the luma one.
For more details the paper [7] recommended

Coding & Derivation Luma Intra Prediction Mode (1)
Unlike to AVC/H.264 three most probable modes: MPM0, MPM1 and MPM2
are considered. The following figure reveals the logic for derivation of MPMs:
Note on Most Probable Mode (MPM)
As reported in A novel fast intra-prediction algorithm for high-efficiency

videocoding based on structural similarity, Dongyu Zhaoa et al., 2015 :
in average of 70% blocks MPM is selected as the best intra mode:

Encoder side:
If the current luma intra prediction mode is one of three MPMs,

prev_intra_luma_pred_flag set to 1 and the MPM index (mpm_idx) is
signaled.
Otherwise, the index of the current luma intra prediction mode excluding
the three MPMs is transmitted to the decoder by using a 5-bit fixed length
code (rem_intra_luma_pred_mode).
Note:
If region is smooth (flat) then each of 35 intra modes provides similar result and can be
selected as best.
Decoder side:
Upon derivation of MPMs, collect them in candModeList[3]= [MPM0, MPM1,

MPM2]. The following code outlines how luma intra mode is derived:
If prev_intra_pred_flag = true then

IntraPredMode = CandList [mpm_idx]
Else
{ sort CandList in ascending order }
if candModeList[0] is greater than candModeList[1], swap two values.
Read 5 bits to rem_intra_luma_pred_mode

IntraPredMode = rem_intra_luma_pred_mode
if IntraPredMode >= candList[ 0 ] IntraPredMode ++

EndElse
Coding & Derivation Chroma Intra Prediction Mode
Unlike AVC/H.264, chroma intra prediction mode is derived from luma and from the
syntax element intra_chroma_pred_mode ( signaled each PB ) according to the
following table:
Luma IntraPredMode
intra_chroma_pred
0 26 10 1 X ( 0 <= X <= 34 )
_mode
0 34 0 0 0 0
1 26 34 26 26 26
2 10 10 34 10 10
3 1 1 1 34 1
4 0 26 10 1 X
Because allowable chroma modes are constrained by the corresponding luma mode, its
challenging to execute in parallel searching for best luma and chroma modes (like in
AVC/H.264). In other words one firstly should look for the best luma mode and then the
best chroma mode.
Note:
If intra_chroma_pred_mode=0 then chroma prediction mode is equal to
luma intra prediction mode of the top-left luma PB within a luma CB
Implementation Angular Intra Prediction (1)
At most 4N+1 neighbor pixels are required. In contrast to H.264/AVC, below-left
samples are exploited in HEVC. Wide range of prediction block sizes (from 4x4 to
64x64) makes availability of bottom-left samples to be more frequent event than in
H.264/AVC. Missing reference samples are generated by repetition of the closest
available sample in the reference line.
Top CTU
Top-left CTU
16x16
8x8 8x8
Top-left predictor Top predict. Top-right predictors
Left CTU Current TB*

Left predictors Current CTU
8x8
16x16
Below Left predictors
* See the next slide

The prediction intra mode is specified per prediction block (PB) but applied to
transform blocks (TB).
Example Case TB < PB : let 8x8 prediction block (PB) comprised from four 4x4
transform blocks (TB):
TB0 TB1
4x4 4x4
TB2 TB3
4x4 4x4
Because for intra TB PB the case TB>PB is excluded .

The same intra prediction mode is applied to each TB. The decoding process is
performed on TB basis (this adds performance penalty due to serialization):
Step 1: Predict samples for TB0 (the predictors are outside the current PB), inverse
transform, reconstruct.
Step 2: Predict samples for TB1 where the left predictors are reconstructed samples of
TB0, inverse transform, reconstruct.
Step 3: Predict samples for TB2 where the top and top-right predictors are
reconstructed samples of TB0 and TB1, inverse transform, reconstruct.
Step 4: Predict samples for TB3 where all predictors are reconstructed samples of
TB0-TB2, inverse transform, reconstruct.
To improve the intra prediction accuracy, the projected reference sample

location is computed with 1/32 sample accuracy, bi-linear interpolation is
used.
In angular mode predicted sample PredSample(x,y) is derived as follow:
predSample[ x ][ y ] = ( ( 32 iFact )*ref[ x+iIdx+1 ] + iFact*ref[ x+iIdx+2] + 16 ) >> 5

predSample[ x ][ y ] = ( ( 32 iFact )*ref[ y+iIdx+1 ] + iFact*ref[ y+iIdx+2] + 16 ) >> 5
The parameter iIdx and iFact denote the index and the multiplication factor
determined by the intra prediction mode (can be extracted via LUTs).
The weighting factor iFact remains constant across predicted row or column
that facilitates SIMD implementations of angular intra predictions.
Planar Mode
In AVC/H.264, the plane intra mode requires two multiplications per sample
predL[ x, y ] = Clip1Y( ( a + b * ( x 7 ) + c * ( y 7 ) + 16 ) >> 5 )
plus overhead calculation per 16x16 block in determination of the parameters

a,b and c.
Totally the plane mode takes at most three multiplications per sample.
In HEVC, intra planar mode requires four multiplications per sample:
predSamples[ x ][ y ] = ( ( nT 1 x ) * p[ 1 ][ y ] + ( x + 1 ) * p[ nT ][ 1 ] +
( nT 1 y ) * p[ x ][ 1 ] + ( y + 1 ) * p[ 1 ][ nT ] + nT ) >> ( Log2( nT ) + 1 )
So, HEVC planar mode is more complex than in AVC/H.264. Actually, the
planar mode is an average of two linear predictions:
( nT 1 x ) * p[ 1 ][ y ] + ( x + 1 ) * p[ nT ][ 1 ]
( nT 1 y ) * p[ x ][ 1 ] + ( y + 1 ) * p[ 1 ][ nT ]
Note: According to RESEARCH ON H.265/HEVC INTRA PREDICTION MODES

SELECTION FREQUENCIES, Sharabayko M.P - planar prediction is performed for
20% of PUs, while about 10% are predicted by DC. Thus, Planar mode is more popular
than DC.
Motion Vector Prediction
Overview
Effective and complex motion data prediction techniques have been adopted in HEVC in
order to reduce motion data portion in the stream. HEVC supports two modes:
1)Merge mode motion data is completely inferred, similar to Direct/Skip mode of

H.264/AVC.
2) AMVP (Advanced Motion Prediction) inferred only motion vector (MV) predictors and
MV difference is signaled.
Unlike other standards (e.g. AVC/H.264), the HEVC adopted the competitive motion vector
prediction for both AMVP and merge modes, i.e. several candidates are competing for the
prediction and the best candidate is signaled in the stream.
In both prediction modes, the set of candidates can include a temporal candidate (or co-
located candidate) from a pre-defined reference picture. Unlike to H.264/AVC, in HEVC
enables more flexibility in selection of the co-located reference, it is not necessarily the first
reference picture in L0 or L1. The co-located reference is signaled in slice header by
collocated_ref_idx.
Temporal MV prediction in both prediction modes improves error resilience. On the other hand
an additional storage of co-located MVs of reference frames is required.
Spatial and temporal MVP candidates can be derived independently (facilitating

parallelization).
Merge Mode: Removal of Same Motion Candidates (Prunning)
In earlier versions of HEVC, a candidate removed if any of previous candidate has the same
motion. When NumMergeCands=5 the detection of all redundant candidates requires 10
comparisons per PU.
In the final HEVC version in order to reduce complexity of the merge list generation, only 5
comparisons (arcs in the figure below) are executed (instead of 10) for removing duplications.
B2 B1 B0
comparison
A1
A0
Note:
Due to limiting of comparisons and exempting the temporal candidate from prunning process,
redundant candidates (i.e. B0=B2) can appear in the merge list.
Merge Mode: Additional Candidates
If merge list is not full (i.e. #candidates < NumMergeCands) then additional virtual
candidates appended. So, the merge list is never empty.
Merge Mode: List Construction in Encoding
Derive candidate list

(spatial & temporal )
Prunning Process
remove duplicates (restricted)
If merge
list full?
No
Yes Add virtual candidates
Select candidate
for encoding
Merge Mode: List Construction in Decoding
Derive candidate list

(spatial & temporal )
CABAC
Prunning Process Decode merge_idx

remove duplicates (restricted)
index
If merge_idx<
current list size?
No
Yes Add virtual candidates
Use candidate pointed by merge_idx

for decoding
Advanced Motion Vector Prediction (AMVP)
Motion vector is predicted from five spatial neighbors: B0, B1, B2, A0, A1 (see the figure
below) and one co-located temporal MV. Only two motion candidates are chosen among six
neighbors and the selected predictor is explicitly signaled (mvp_lx_flag).
TCT Central colocated candidate
TBR
Bottom-right colocated candidate

Advanced Motion Vector Prediction (AMVP) cont.
Left candidate: the first available from A0, A1
Top candidate: the first available from B0, B1, B2
If both candidates are available and have the same motion data, one is
excluded
If one of the above candidates is not available or excluded then the temporal
MV (colocated) is used unless the temporal prediction mode disabled: the first
available from TBR and TCT . Notice that if TBR is out of CTU boundary its
considered as unavailable.
If the number of available candidates is still less than 2, zero MV is added

Transform
Overview
The standard supports 32x32, 16x16, 8x8 and 4x4 DCT-like transforms and 4x4 DST-like
transform. Notice that DST-like 4x4 transform is allowed only for intra mode.
Each transform is specified by 8-bits signed digital transform matrix T. To perform all
transform operations its required 32-bits precision.
The inverse transform can be executed as follows:
a) Z = X T, where X is the input matrix of quantized coefficients (16-bits per coefficient,

for more detailed analysis refer to JCTVC-L0332)
b) Scaling and clipping of Z to guarantee that the output values are within 16-bits.
c) Y = Z T
Notice that in encoder architecture the step (c) can be coupled with quantization: once first
row of Y completed the quantization of the first row is started.
Transform Implementation
Notice AVC/H.264 where transform coefficients are dyadic in 4x4 case and near dyadic (i.e.
from the form 2^n, 2^n-1, 2^n+1) in 8x8 case and hence AVC/H.264 transform can be
multiplication-free.
In HEVC transform operations are not multiplication-free. Indeed, let the multiplication takes
3 cycles, shift or addition 1 cycle . Therefore if all coeffs are near dyadic we can use only
shifts and additions, otherwise we need a multiplier (because the alternative of shifts and
adds hurts performance).
As well as in AVC/H.264 the transforms in HEVC are separable and can be performed as
sequence of two 1D transforms (vertical and horizontal):
for (i=0, i<N, i++ )

{
1D Transform on column i // vertical
}
Scaling (right shift by 7) & Saturation
for (j=0, j<N, j++ )

{
1D Transform on row j // horizontal
}
Comments
As well as in previous standards HEVC DCT works well on flat areas, but fails on areas with
noise, contours and other peculiarities of the signal.
HEVC DCT is efficient for big size of blocks but it looses efficiency on smaller blocks.
Beginning from 16x16 transforms visual artifacts are noticeable. The more the transform size
the more artifacts are observable. Deblocking can reduce artifacts on TB boundaries, while
artifacts inside a TB can be reduced only by SAO. Therefore its recommended to apply SAO
when large transform sizes (32x32) are used.
HW Aspects of Transform Implementation 1D 8x8 case
P0 butterfly block (free of multiplications), Z = P0 X
Pc permutation matrix
HW Aspects of Transform 1D 8x8 case (2)
A1 butterfly block (no multiplications), Q = A1 * Z{1..4}
B1 , B2 4x4 matrices (multiplication by these matrices can be implemented in fast

form)
4x4 DST (Discrete Sine Transform)
4x4 DST is applied only for Intra prediction.
The 4x4 DST matrix from HEVC spec:
29 55 74 84
74 74 0 74
84 29 74 55
55 84 74 29
Motivation:
Intra prediction is based on the top and left neighbors. Prediction accuracy is more
for the pixels located near to top/left neighbors than those away from it. In other
words, residual of pixels which are away from the top/left neighbor usually be larger
then pixels near to neighbors. Therefore DST transform is more suitable to code
such kind of residuals, since DST basis function start with low and increase further
which is different from conventional DCT basis function.
Reported 4x4 DST provides some performance gain, about 1%, against DCT. For
bigger sizes the gain is negligible.
As well as DCT, DST can be implemented using "fast" form

Performance Result
According to JCTVC-G757:
For 8x8 transform required 2.47 cycles per sample
The above results are obtained on x86 and ARM with SIMD operations (MMX/SSE on
x86, and NEON on ARM).
4.59 cycles per sample for 32x32 can be a bottleneck on some platforms.
If you wish to avoid performance issues on decoder's side it's would be better not use 32x32
transform and always split 32x32 CU into four 16x16 TBs or even into eight 8x8 TBs.
Notice that if 32x32 TBs is not used, its recommended to consider to disable SAO since
ringing artifacts and mosquito noise are mainly present in large TB sizes.
Entropy Coding
Overview
HEVC specifies only one entropy coding method CABAC, comparing to two
CABAC and CAVLC as in H.264/AVC.
HEVC CABAC encoding involves three main functions:
a) Binarization - maps the syntax elements to binary symbols (bins).
b) Context modeling - estimates the probability of the bins
c) Arithmetic coding - compresses the bins to bits based on the estimated

probability.
Note: In HEVC/h.265, the arithmetic coding engine remained the same as

H.264/AVC while Binarization and Context Modelling modified.
Optimization Points
Reduction in Memory Usage:

In AVC/H.264 context of some syntax elements (e.g. mvd) depends on top and left values.
This dependence on top values requires large line buffers, may be issue for 4K resolutions.
In HEVC the dependence on top neighbors (outside of current CTU) is significantly reduced.
For example, unlike to AVC/H.264, in HEVC mvd is coded without the need of knowing
neighboring mvd values.
Grouping of Bypass Bins:

Successive bypass bins can be processed simultaneously (i.e. in the same cycle) and
improve coding throughput. On the other hand, a simultaneous processing of bypass bins
requires higher precision of Arithmetic Engine.
Grouping Bins with Same Context:

If consecutive bins share the same context model then this context model can be stored in
registers, fewer memory accesses.
Reduction Parsing Back-Dependencies:

Facilitate CABAC pipelining.
Residual Coding: Overview
Each Transform Block (TB) is divided into 4x4 sub-blocks (coefficient groups).
Processing starts with the last significant coefficient and proceeds to the DC coefficient in the
reverse scanning order.
Coefficient groups are processed sequentially in the reverse order (from bottom-right to top-
left) as illustrated in the following figure:
Coefficient groups for 8x8 TB, for different scans

Residual Coding: comparison between AVC/H.264 and HEVC
HEVC H.264/AVC
split_transform_flag
transform_size_8x8_flag
cbf_luma coded_block_pattern
cbf_cb
cbf_cr
transform_skip_flag
last_significant_coeff_x_prefix last_significant_coeff_flag
last_significant_coeff_y_prefix
last_significant_coeff_x_suffix
last_significant_coeff_y_suffix
coded_sub_block_flag coded_block_flag
significant_coeff_flag significant_coeff_flag
coeff_abs_level_greater1_flag
coeff_abs_level_greater2_flag
coeff_abs_level_remaining coeff_abs_level_minus1
coeff_sign_flag coeff_sign_flag
Residual Coding: Scanning Order
Notes:
Experiments show that including horizontal and vertical scans for large TBs offers
little compression efficiency, so vertical and horizontal scans are limited to 4x4 and
8x8 sizes.
Adaptive scanning for Intra blocks is not a new idea (e.g. see the paper Adaptive
Scanning for H.264/AVC Intra Coding, ETRI Journal, 2006).
Residual Coding: Multi-Level Significance
Unlike to AVC/H.264, three level significance is used, intermediate level is added

to exploit transform block sparsity:
Level0: coded_block_flag is signaled for each TB (transform block) to specify the

significance of the entire TB.
Level1: (intermediate level): if coded_block_flag =1 then each TB divided into 4x4

coefficient groups (CG) where the significance of the entire CG is signaled (by
coded_sub_block_flag).
a) The coded_sub_block_flag syntax elements are signaled in reverse order (from

bottom-right towards top-left) according to selected scan.
b) The coded_sub_block_flag is not signaled for the last CG (i.e. the CG which
contains the last level). Motivation: a decoder can infer significance since the last
level is present.
c) The coded_sub_block_flag is not signaled for the group including the DC position
Residual Coding: Multi-Level Significance (cont.)
Level2: If coded_sub_block_flag=1 then significant_coeff_flag are signaled to

specify the significance of individual coefficients.
a)The significant_coeff_flag are signaled in the reverse order (from bottom-right

towards top-left) according to selected scan.
Notes:
The coded_sub_block_flag of CG containing of DC coefficient (i.e. (0,0)

position) is not coded and inferred to 1.
Motivation to improve coding efficiency, since the probability of this CG to be
totally zero is low.
If current CG contains last coefficient in a TB then the coded_sub_block_flag is

not coded and inferred to 1.
Residual Coding: Multi-Level Significance (Example)
16x16 TU with 4x4 coefficient groups:

---------------------------------------------------------
| 5 2 0 1 | 1 0 0 0 | 1 1 1 0 | 0 0 0 0 |
| 5 0 2 1 | 2 0 1 1 | 0 2 1 0 | 0 0 0 0 |
| 0 1 0 0 | 1 0 1 1 | 0 1 1 1 | 0 0 0 0 |
| 0 0 0 1 | 1 1 1 1 | 0 0 1 0 | 0 0 0 0 |
---------------------------------------------------------
| 1 1 0 0 | 1 0 1 0 | 0 1 0 1 | 0 0 0 0 |
| 2 1 0 1 | 0 0 1 0 | 1 0 0 1 | 0 0 0 0 |
| 0 0 1 0 | 0 0 1 0 | 1 0 0 0 | 0 0 0 0 |
| 1 0 0 1 | 1 0 1 0 | 0 0 0 0 | 0 0 0 0 |
---------------------------------------------------------
| 0 1 1 1 | 0 1 0 1 | 1 1 1 0 | 0 1 0 0 |
| 0 0 1 0 | 0 1 1 1 | 0 0 0 1 | 0 0 0 0 |
| 1 0 0 1 | 0 0 0 0 | 1 1 0 0 | 1 0 0 0 |
| 0 0 1 0 | 0 0 0 0 | 1 1 0 0 | 0 0 0 0 |
---------------------------------------------------------
| 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | Last coeff
| 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 |
| 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 |
| 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 |
---------------------------------------------------------
For this CG coded_sub_block_flag is not

coded
Residual Coding: Levels
At the start of residual block, coordinates of last significant coefficient are

signaled (last_significant_coeff_x, last_significant_coeff_y)
Coding starts backward from last significant coefficient toward (0,0).
The coding process of each CG generally consists of five separate loops

(passes), which provides some benefits for parallelization :
1. significant_coeff_flag loop
2. coeff_abs_level_greater1_flag loop
3. coeff_abs_level_greater2_flag (at most one flag is coded)
4. coeff_sign_flag loop
5. coeff_abs_level_remaining loop
Hint for parallelization:

Grouping of syntax elements of same type enables parallel processing.
For example, while coeff_abs_level_greater1_flag are proceeding, the significance map
contexts for the next CG can be pre-calculated.
Residual Coding: significant_coeff_flag
significant_coeff_flag - indicates whether the transform coefficient is non-zero.
1 bin regular coding, 3 context models.
Context model derivation for 8x8 and higher TBs - context depends on the
significant_coeff_group_flag of the neighboring right CG and lower (sl) CGs and on
the coefficient position in the current CG. Motivation: to avoid data dependencies
within a CG and to benefit parallezation with negligible coding loss if contexts
depend on significance of immediately preceeding coefficints (around 0.1% as
reported in JCTVC-I0296).
Coding direction
Current Right
CG CG
Bottom
CG
Here sr denotes significant_coeff_group_flag of the right CG and sl denotes

significant_coeff_group_flag of the lower CG.
Not coded
inferred
Context model derivation for 4x4 TB is completely position based:
Residual Coding: significant_coeff_flag (cont.)
Notes:
significant_coeff_flag is not coded and inferred to 1 when the last significant

coefficient points at it.
significant_coeff_flag is not coded and inferred to 1 when

coded_sub_block_flag is true and all the other coefficients are zero (i.e. all
coefficients are zero except the (0,0), specified by inferSbDcSigCoeffFlag).
In case of 4x4 TB the significant_coeff_flag is not coded. The

significant_coeff_flag is inferred to 1 if the last significant coefficient points at it,
otherwise inferred to 0.
Residual Coding: coeff_abs_level_greater1_flag
coeff_abs_level_greater1_flag - indicates (if signaled) whether the transform
coefficient has absolute value > 1.
Only first eight coeff_abs_level_greater1_flags in a CG are coded, the rest is inferred to 0.

Motivation to reduce regularly (context) coded bins, especially at the high bit-rate, and
improve CABAC throughput. Indeed, at most 8 coeff_abs_level_greater1_flags are coded in
one CG instead of 16.
There are 4 context model sets for luma ( denoted as 0, 1, 2 and 3) and 2 for chroma
(denoted as 4 and 5), the number of context models in each set is 4.
The derivation of context mode consists of two steps: the inference of context set and the
derivation of the model inside the selected set, the following table reveals the context set
derivation:
Luma Chroma
# coeff_abs_level_greater1_flags 0 >0 0 >0
in previous CG
CG with DC 0 1 4 5
CG without DC 2 3 4 5
Residual Coding: coeff_abs_level_greater1_flag (cont.)
Context model derivation within the selected context set depending on the
number of trailing ones and the number of coefficient levels larger than 1 in the
current CG:
If coeff_abs_level_greater1_flag is the first in the current CG then the context model

equals to 1.
Else If the revious coefficient in current CG is more than 1 (i.e. the previous
coeff_abs_level_greater1_flag=1) then the context model equal to 0.
Else If the number of trailing ones is 1 (i.e. the previous coeff_abs_level_greater1_flag=0

and the pre-previous coeff_abs_level_greater1_flag=1 or not exist) then the model is 2
Else (if trailing ones is 2 or more) then the model is 3.

Residual Coding: coeff_abs_level_greater2_flag
coeff_abs_level_greater2_flag - indicates (if signaled) whether the transform

coefficient has absolute value > 2. Unlike coeff_abs_level_greater1_flag, this flag
is signaled once.
If all coeff_abs_level_greater1_flag are 0, the coeff_abs_level_greater2_flag is not

signaled and inferred to 0.
Context model derivation:
Luma Chroma
# coeff_abs_level_greater1_flags 0 >0 0 >0
in previous CG
CG with DC 0 1 4 5
CG without DC 2 3 4 5
Notice that the derivation of context model for coeff_abs_level_greater2_flag is identical to

the derivation of context set of coeff_abs_level_greater1_flag.
Residual Coding: coeff_sign_flag
coeff_sign_flag : substantial proportion in compressed stream 15-20%.
coeff_sign_flags are bypass coded
Sign Data Hiding (SDH): optional mode, for each CG the sign of the last nonzero coefficient
(in the reverse scan) is omitted. Instead, the sign is embedded in the parity of the sum of the
levels, if the sum is even then the hidden sign is +, otherwise .
Encoder can be required to modify coefficients to embed the sign (potentially

quantization noise can be increased).
If the distance in scan order between the first and the last nonzero coefficient is less
than 4 then SDH is not used. Notice that the fixed value 4 was experimentally chosen
(see JCTVC-I0156). Probably that value can be a bad choice on some streams.
If only one nonzero coefficient is present in CG, then SDH is not activated.
When SDH is beneficial?

When the percentage of sign bits is substantial (it is expected to happen when the bit-rate is
low).
Disadvantages of SDH:
More complexity and Increase of quantization noise (potentially)
Residual Coding: example implementation of SDH in encoder
If the parity does not match the omitted sign, the encoder has to change the value of one of
the nonzero coefficients in the current CG.
Low complexity implementation (details can be found in JCTVC-H0481):
If #nonzero coeffs > 1 and

distance (in scan) between the first and the last nozero coefficient >= 4
{
For each nonzero and non-last (in reverse scan) quantized coefficient qCoef Do
{
Calculate delta = abs(qCoef) * q_scale tCoef
where tCoef is transformed coefficient (prior to quantization).
}
If there is nonzero delta values, find the minima minNzDelta among abs( delta )
{
If minNzDelta >0
adjust qCoef = qCoef +1
Else [minNzDelta <0]
adjust qCoef = qCoef -1
}
Else [ all delta values are zero ]
Take most high frequence coeff and adjust it.
}
Residual Coding: coeff_abs_level_remaining
coeff_abs_level_remaining remaining absolute value of the coefficient level. All

bins are bypass coded (to increase throughput).
The total level is derived as:
Level = 1 + coeff_abs_level_greater1_flag+ coeff_abs_level_greater2_flag+ coeff_abs_level_remaining
Binarization - HEVC employs adaptive GolombRice coding for small values and switches to
Exp-Golomb code for larger values.
The Golomb-Rice parameter cRiceParam depends on previous levels.
The transition point to Exp-Golomb is when the unary code length equals 4.
The maximal codeword for coeff_abs_level_remaining is kept within 32 bits.

Residual Coding - Example
Scan_pos 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Coefficients 0 1 -1 0 2 4 -1 -4 4 2 -6 4 7 6 -12 18
significantFlag 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
gr1Flag 0 0 1 1 0 1 1 1 not coded
gr2Flag 0 not coded
signFlag 0 1 0 0 1 1 0 0 1 0 0 0 1 0
levelRem 2 2 2 0 5 3 6 5 11 17
Residual Coding - Notes
On SW decoder: the processing of residuals take ~8% of the computation for

4K video.
It is a challenge to speedup or parallelize the residual coding loop by SW
There are multiple branches within the loops.
There are data dependency among adjacent data.
The loop counts in the four loops can be different (challenge for loop
unrolling)
For coeff_abs_level_remaining the binarization type depends on the

previous coefficient level.
Deblocking
Overview
The deblocking filter is applied to all samples adjacent to PB or TB boundaries,

with the following exceptions:
picture boundaries
slice boundaries (if deblocking is explicitly disabled in sequence level)
tile boundaries (if deblocking is explicitly disabled in sequence level)
Granularity is 8x8 grid or higher (unlike AVC/H.264 where the granularity is 4x4)
32
Deblocked
8X8 8X8
Non-deblocked
16x16
4x4 4x4
8X8
4x4 4x4
32
16X16 16X16
Overview (cont.)
Notes:
1.Because of 8x8 granularity of HEVC/H.265 deblocking, PB/TB boundaries not aligned to

8x8 grid are not filtered (e.g. some boundaries of 4x8 PBs are not filtered).
2.Chroma is deblocked only if one of the adjacent blocks is intra (since at intra-block
boundaries blocking artefacts are strongest). By the way if one knows ahead that in a
picture no intra CU is present then one can turn off chroma deblocking at all.
3.Is it necessary to signal deblocking parameters slice_beta_offset and slice_tc_offset if the

current slice is not used for reference?
A decoder can turn off deblocking for a non-reference picture without
impacting on subsequent pictures (e.g. if the decoder is in tight timing conditions).
The decoder can change beta and tc offsets.
I think the one should be careful in turning off the deblocking on non-references pictures
(in order to reduce say power consumptions) while the deblocking is activated on the
rest of frames. This act might cause fluctuations in perceptual quality.
Deblocking Algorithm
1. Vertical edges are filtered first, then horizontal edges are filtered.
2. For each edge of 8x8 grid determine the filter strength (Bs)
3. According to the filter strength and the average quantization parameter (QP) determine
two thresholds: tC and
4. According to the values of edge pixels and tC , modify (if needed) the pixels
Note: in HEVC deblocking the decision process requires much more logic than the filtering
itself.
The parameter is signaled in slice header (slice_beta_offset_div2) and specifies a

threshold whether deblocking may be applied and not applied. If one increases then
some natural edges may be filtered (smoothed), if one decreases then some artefact
edges can pass to display.
Deblocking Drawback for Pipeline Architecture
In AVC/H.264 spec. the following statement specifies deblocking operations:
"For each macroblock and each component, vertical edges are filtered first, starting with
the edge on the left-hand side of the MACROBLOCK proceeding through the edges
towards the right-hand side of the MACROBLOCK in their geometrical order, and then
horizontal edges are filtered, starting with the edge on the top of the MACROBLOCK
proceeding through the edges towards the bottom of the MACROBLOCK in their
geometrical order."
In HEVC spec., the word MACROBLOCK is replaced with the word PICTURE. This subtle
difference complicates pipelining of Deblock. Indeed, the horizontal filtering of (N-1) th CTB
has to be delayed until the vertical filtering of Nth CTB (i.e. the next CTB) completed (at
least the leftmost vertical edge of the N-th CTB).
Top Line Buffer
For deblocking top reference storage is necessary:
4 luma top lines
2 chroma top lines
p0 .. p2 luma pixels are modifiable while p0 .. p3 are taken into consideration
p0 chroma is modifiable, while p0 , p1 are taken into consideration

Page 91
Determination of Filter Strength [Decision Process]
The strength of the deblocking filter is represented by three values:

0 - no deblocking
1 - weak
2 strong
Notice that in AVC/H.264 there are five strengths and more complicated derivation of
boundary strengths.
P and Q are two adjacent TB or PB blocks then the filter strength Bs is specified as:
If one of the blocks (P or Q) is intra

Bs = 2 { blocking artefacts are strongest at intra-block boundary}
Else if P and Q belong to different TBs and P or Q has at least one non-zero
transform coefficient then
Bs = 1
Else if the reference pictures of P and Q are not equal then
Bs = 1 { splicing from different references may be non-seamless}
Else if P and Q has difference number of MVs then
Bs = 1
Else if the difference between x or y motion vector component of
P and Q is equal or greater than one integer sample then
Bs = 1 { splicing may cause blockiness }
Else
Bs = 0
Determination thresholds tC and [Decision Process]
Thresholds tC and are used in deblocking process and are derived by the following table:
Where Q = [ (QPQ + QPP +1)>>1 + (beta_offset_div2<<1) ]
Notice that if QP in a picture is constant then tC and can be specified once at the start of
the picture. E.g. x265 has adaptive quantization mode and if this mode is switched off then
QP is constant within a picture, although QP can vary between pictures.
Blockiness ad Discrepancy from Ramp
Blockiness artefacts appear as small discrepancies from ramp:
q2
q1 ramp
p0 q3
p2
q0
p1
p3 TB/PB boundary
Vertical Edge Filtering (Luma) derivation d [Decision Process]
on/off decision for all 4 lines/columns based on two lines/columns:
p3,0 p2,0 p1,0 p0,0 q0,0 q1,0 q2,0 q3,0

p3,1 p2,1 p1,1 p0,1 q0,1 q1,1 q2,1 q3,1
p3,2 p2,2 p1,2 p0,2 q0,2 q1,2 q2,2 q3,2
p3,3 p2,3 p1,3 p0,3 q0,3 q1,3 q2,3 q3,3
Derivation of the parameter d (or estimation of discrepancy from ramp):
dp0 = Abs( p2,0 2*p1,0 + p0,0 )

dp0 estimate discrepancy from ramp: p2,0 2*p1,0 + p0,0 = (p2,0 p1,0) (p1,0 - p0,0 )
dp3 = Abs( p2,3 2*p1,3 + p0,3 )

dq0 = Abs( q2,0 2*q1,0 + q0,0 )
dq3 = Abs( q2,3 2*q1,3 + q0,3 )
dpq0 = dp0 + dq0 { sum of discrepancies from the ramp }
dpq3 = dp3 + dq3
dp = dp0 + dp3
dq = dq0 + dq3
d = dpq0 + dpq3
Vertical Edge Filtering (Luma) - dSam0 and dSam1 [Decision Process]
Necessary condition for luma filtering is d<, because small discrepancies from the
ramp apparently are result of blockiness, while strong discrepancies means presence of a
natural edge.
dSam0 = 0
dSam3 = 0
If d< Then
If 2*dpq0 < ( >> 2 ) And

Abs( p3,0 p0,0 ) + Abs( q0,0 q3,0 ) < ( >> 3 ) And
Abs( p0,0 q0,0 ) < ( 5*tC + 1 ) >> 1) Then
{
dSam0 =1
}
If 2*dpq3 < ( >> 2 ) And

Abs( p3,3 p0,3 ) + Abs( q0,3 q3,3 ) < ( >> 3 ) And
Abs( p0,3 q0,3 ) < ( 5*tC + 1 ) >> 1 Then
{
dSam3 =1
}
EndIf
Vertical Edge Filtering (Luma) dE, dEp and dEq [Decision Process]
dE takes values 0 (no filter),1 (weak filter) ,2 (strong filter)

dEp takes values 0 (no filter),1 (filter)
dEq takes values 0,1
If d < Then
dE = 1 { weak filter by default, p0,p1 and q0,q1 are modified}

if dSam0 = 1 and dSam3 = 1 then
dE = 2 // strong filter, modify p0 .. p2, q0 .. q2
If dp <( + ( >> 1 ) ) >> 3 then

dEp = 1
If dq <( + ( >> 1 ) ) >> 3 then

dEq = 1
EndIf
Vertical Edge Filtering (Luma)
If dE = 2 // strong filtering, p0 .. p2, q0 .. q2 are modied

{
for each k=0..3
{
p0,k = Clip3( p0,k2*tC, p0,k+2*tC, ( p2,k + 2*p1,k + 2*p0,k + 2*q0,k + q1,k + 4 ) >> 3 )
p1,k = Clip3( p1,k2*tC, p1,k+2*tC, ( p2,k + p1,k + p0,k + q0,k + 2 ) >> 2 )
p2,k = Clip3( p2,k2*tC, p2,k+2*tC, ( 2*p3,k + 3*p2,k + p1,k + p0,k + q0,k + 4 ) >> 3 )
q0,k = Clip3( q0,k2*tC, q0,k+2*tC, ( p1,k + 2*p0,k + 2*q0,k + 2*q1,k + q2,k + 4 ) >> 3 )
q1,k = Clip3( q1,k2*tC, q1,k+2*tC, ( p0,k + q0,k + q1,k + q2,k + 2 ) >> 2 )
q2,k= Clip3( q2,k2*tC, q2,k+2*tC, ( p0,k + q0,k + q1,k + 3*q2,k + 2*q3,k + 4 ) >> 3 )
}
}
Vertical Edge Filtering (Luma)
Else if dE = 1 // weak filtering, p0,p1 and q0,q1 are modified

{
for each k=0..3
{
= ( 9 * ( q0,k p0,k ) 3 * ( q1,k p1,k ) + 8 ) >> 4
if (Abs() < tC*10)

{
= Clip3( tC, tC, )
p0,k = Clip1Y( p0,k + )
q0,k = Clip1Y( q0,k )
}
if dEp = 1 // modify p1
{
p = Clip3( (tC >> 1), tC >> 1, ( ( ( p2,k + p0,k + 1 ) >> 1 ) p1,k + ) >>1 )
p1,k = Clip1Y( p1,k + p )
}
if dEq = 1 // modify q1
{
q = Clip3( (tC >> 1), tC >> 1, ( ( ( q2,k + q0,k + 1 ) >> 1 ) q1,k - ) >>1 )
q1,k = Clip1Y( q1,k + q )
}
}
}
Sample Adaptive Offset (SAO)
Page 100
Background
Quantization makes reconstructed and original blocks differ. The quantization error is not
uniformly distributed among pixels. There is a bias in distortion around edges (due to Gibbs
effect).
Reported (e.g. in JCTVC-G680) that in local minima reconstructed pixel tend to be lower than
the neighboring pixels. Therefore, the offset in local minima tend to have positive sign.
Background (cont.)
In addition to correction at local extremes, HEVC allows alternative correction to specific

ranges of pixel values.
Reported that SAO reduces ringing and mosquitos artifacts (which in turn expected to
become more annoying with large transforms). Consequently SAO improves subjective
quality for low compression ratio video.
Heuristics:
If a CTB contains a strong edge then its recommended to apply Edge Type SAO for this CTB,
where the direction pattern (sao_eo_class) is determined from the edges direction, e.g. if the
edge is vertical then the direction pattern is horizontal respectively.
Overview
SAO is the second post-processing tool (after deblocking) accepted in HEVC/H.265.
SAO is applied after deblocking. For efficient HW implementation SAO can be coupled with
deblocking in MB-loop. From point of HW design execution of SAO prior to deblocking
facilitates coupling.
SAO can be optionally turned off or applied only on luma samples or only on chroma samples
(regulated by slice_sao_luma_flag and slice_sao_chroma_flag ).
SAO parameters can be either explicitly signalled in CTU header or inherited from left or
above CTUs.
As well as deblocking SAO is adaptively applied on pixels. Unlike to deblocking SAO is

adaptively applied to all pixels.
There are two types of SAO:

Edge Type - offset depends on edge mode (signaled by SaoTypeIdx = 2)
Band Type - offset depends on the sample amplitude (SaoTypeIdx = 1)
Note:
chroma CTBs share the same SaoTypeIdx.
Edge Type SAO
In case of Edge type, the edge is searched across one of following directions ( the
direction is signaled by sao_eo_class parameter, once per CTU):
Notes: Sample labeled p indicates a current sample. Two samples labeled

n0 and n1 specify two neighboring samples along the chosen direction.
The edge detection is applied to each sample. According to the results the
sample is classified into five categories (EdgeIdx) :
Edge Type SAO (cont.)
According to EdgeIdx the corresponding sample offset (signaled by sao_offset_abs and
sao_offset_sign) is added to the current sample.
Up to 12 edge offsets (4 luma, 4 Cb chroma and 4 Cr chroma) are signaled per CTU. To
reduce the bit overhead there is a particular merge mode (signaled by sao_merge_up_flag
and sao_merge_left_flag flag) which enables a direct inheritance of SAO parameters from top
or left CTU.
Band Type SAO
The pixel range from 0..255 (8-bits per pixel) is uniformly split into 32 bands.
A fixed offset added to all samples of the same band. Only 4 consecutive bands are selected
by an encoder (trade-off between CTU header overhead and coding efficiency), a separate
offset is signaled for each band. In other words only 4 successive bands are affected in Band
Type SAO.
Notes:
4 band offsets are signaled in CTU header.
Start band index is signaled in CTU header (5 bits required).
An offset can be computed on encoders side as the average difference between the
original samples and reconstructed samples in a certain band. Then encoder selects 4
successive bands with the maximal difference in averages.
Experiments reveal that Band Type SAO is beneficial in noisy sequences or in sequences
with large gradients (e.g. PeopleOnStreet which is abundant in black-white switches).
SAO Design Points
For SAO left and top lines of pixels need to keep in a memory.
Pipeline chain:
a) QTR Deblock+ SAO decisions SAO
b) QTR + SAO decisions Deblock SAO
According to the schema (a), during the deblocking process, statistical information is
processed and the decision on SAO parameters is made. Due to the fact that SAO
parameters are determined in Deblock stage we cant apply CABAC in parallel to Deblocking.
The schema (b) enables to parallel Deblock and CABAC with negligible coding efficiency
loss.
The method (b) enable coupling SAO and DeblockFilter in CTU-loop. Indeed, after the top-
left 2x2 block in a CB has been completely deblocked, SAO can start with the pixel (0,0).
SAO Impact on Quality
SAO is enabled (QP = 32 ) SAO is disabled (QP = 32 )

Paralleling Tools
HEVC adopted three in-built parallel processing
tools:
Slices
Tiles
Wavefronts (WPP)
Slices
As in H.264/AVC, slices are groups of CTUs in scan order, separated by start

code.
Each slice is divided in a number of slice segments
There are two types of slice segments:

Regular (independent) carry the whole slice header
Dependent carry a shortened slice segment header. Notice that
dependent slices are not a good choice for slice-parallelism
The first slice segment of each slice must be regular, i.e. regular segment is
a leading segment of each the slice. Slice #0
(contains 4 segments)
regular segment #0
dependent segment #1 Slice #1
(contains one segment)
dependent segment #2 #3
#3 Regular segment #0
Slices - Slice Segments
Dependencies among slice segments in a slice are not broken, although CABAC
engine must be flushed and reset at the end of each segment but states are not
reset.
Slice Header Dependency: Short slice header is used where the
missing parameters are taken from the regular segment.
Context Models Dependency: CABAC context models are not

initialized to defaults at the beginning of a dependent slice, only CABAC
engine is reset. Its challenging to use dependent segments for
resynchronization in case of bitstream error or packet loss.
Spatial Prediction Dependency: No breaking intra and motion vector

prediction.
Note: Due to dependencies across segment boundaries slice segments do not

suit for slice-parallelism.
Slices - Dependent Segments
Restriction:
Each slice always starts with a regular segment (carrying the slice header),
followed by zero or more dependent segments.
Rationale for using of dependent slices:
Allow data associated with a particular wavefront thread or tile to be carried in a

separate NAL unit, and thus make that data available to a system for fragmented
packetization with lower latency than if it were all coded together in one slice.
MTU-matching
Slices: Pros and Cons
Pros: Slices are effective for:
Network packetization (MTU size matching)
Low-delay applications. Indeed, to start transmission of the encoded data

earlier, a current slice may be already transmitted, while encoding the next
slice in the picture.
Parallel processing (regular slices are self-contained, excepting deblocking)

but decoder has to perform some look-ahead reading to identify entry points
(actually start codes).
Fast resynchronization in case of bitstream errors or packet loss.

Slices: Pros and Cons
Cons:
Penalty on rate distortion performance is incurred due to the breaking

of dependencies at regular slice boundaries.
At the paper Implementation of fast HEVC encoder based on SIMD and

data-level parallelism by Yong-Jo Ahn et al. reported the following figures
on bitrate penalties:
For 32 slices:
On 4K videos the ratio of bitrate increment is 105:100
On HD videos the ratio of bitrate increment is 110:100
On SD videos the ratio of bitrate increment is 117:100
On 416x240 videos the ratio of bitrate increment is 130:100
Overhead is added since each slice/segment is preceded by headers.

Tiles
Tiles are rectangular groups of CTUs.
Tiles are transmitted in raster scan

order, and the CTUs inside each tile
are also processed in raster scan order.
All dependencies are broken at tile boundaries.
The entropy coding engine is reset at the start of each tile and flushed at the end
of the tile.
Only the deblocking filter can be optionally applied across tiles, in order to reduce
visual artifacts.
Tiles (cont.)
At the end of each tile CABAC is flushed and consequently the tile ends at byte
boundary.
The tile entry points (actually offsets) are signaled at the start of picture in order
to enable to a decoder to process tiles in parallel. Signaling tile offsets in the
start of picture increases encoding latency (it might be issue for ultra-low latency
applications). Indeed, encoder should wait until all tiles completed and then
update the picture header (actually the slice header) and then transmit data. We
can avoid this delay if compose tiles in the mode single slice single tile. In
such case only slice overhead is added.
Due to high area/perimeter ratio square tiles are more beneficial than
rectangular ones (since the perimeter represents the boundaries where the
dependencies are broken).
Tiles: Pros & Cons
Pros:
Friendly to multi-core implementations, can be built by simply replicating the single

core designs.
Composition of a picture (4K TV) from multiple rectangular sources which are
encoded independently. With slices we can compose only horizontal stripes.
Cons:
Pre-defined tile structure makes MTU size matching challenging.
Breaking intra and motion vector prediction across tile boundaries deteriorates
coding efficiency.
Tiles vs. Slices
Tale: If we divide a WxH frame into 4 uniform slices then the total length of internal
boundaries is 3xW. On the other hand, if we divide this frame into 4 uniform tiles
then the total length of internal boundaries is (W+H).
In most cases W+H < 3xW therefore division into tiles looks more beneficial than
splitting into slices (since total length of boundaries where predictions broken is
minimized for tiles). Moreover slice headers adds additional overhead.
Resume from the Tale:

It sounds that for parallel processing Tiles are better than Slices.
However
According my experiments on selected 4K streams the gain in bit-size of tiles
versus slices does not exceed 0.5% . The gain up to 0.5% in coding efficiency is
commonly considered as negligible (within a noise level).
So, tiles and slices can be considered as comparable tools for paralleling
processing.
Potential Usage of Tiles:

The only usage of tiles i can imagine if the video frame is composed from sub-
pictures (tiles) where each sub-picture is taken from separate video source.
Wavefronts (WPP)
Picture is divided into rows of CTUs.
The first row is processed in a regular way.
The second row is delayed until two first CTUs of the first row completed.
The third row is processed after two first CTUs of the second row have been
made, etc.
Wavefronts (cont.)
The context models of the entropy coder in each row are inferred from those in the
preceding row with a small fixed processing lag. Actually the context models are
inherited from the second CTU of the previous row.
CABAC is flushed after the last CTU of each row, making each row to end at byte
boundary.
No breaking of dependencies across rows of CTUs.
The figure taken from:

Frank Bossen, Benjamin Bross, Karsten Shring, and David Flynn:
HEVC Complexity and Implementation Analysis,
IEEE Transactions on Circuits and Systems for Video Technology, December 2012.
Wavefronts (Cont.)
CABAC is flushed at the end of each CTU row in order to make each row to end at
byte boundary and to facilitate parallel processing.
CABAC is reset at the end of each CTU row in order to enable parallel processing.
Entry points of each CTU row are explicitly signaled in picture/slice header.
Calculation of speed-up ratio (inspired by https://sites.google.com/site/hevcwppp/):
Let T is an average coding time of single CTU then the first CTU of the second row starts at
2T, the first CTU of the third CTU row starts at 4T, the first CTU of the last row starts at
2(H-1)T , where H is picture height in CTUs and W is width in CTUs.
Consequently in WPP mode the last CTU in the picture is coded at 2(H-1)T+WT, without WPP
mode the last CTU is coded at H*W*T
Thus the speed-up ratio of WPP: W*H / (W+2H 2)
For 3840x1728 resolution CTU size = 64x64 the speed-up ratio is 14. For CTU size 32x32
the speed-up ratio is 28. Probably it might be efficient to use CTU size 32x32 to exploit better
parallelism?
Wavefronts (WPP) Pros & Cons
Pros:
Good for architectures with shared cache, e.g. overlapping of search areas:
Unlike tiles intra and motion
vector prediction across
CTU rows enabled.
Cons:
Cons
MTU size matching challenging with wavefronts.
Frequent cross-core data communication, inter-processor synchronization for
WPP is complex.
Wavefronts (WPP) Pros & Cons
Pros:
Good for architectures with shared cache due to overlapping of search areas
Unlike tiles, intra and motion

vector prediction across Search area of core #(n-1)
CTU rows enabled. Currently processed
already processed Core #(n-1)
Core #n
Search area of core #n

Notes
Wavefront parallel encoding is reported to give BD-rate degradation around 1.0% compared
to a non-parallel mode.
Bitrate savings from 1% to 2.5% are observed at same QP for Wavefront against Tiles (each
row is encompassed by single tile).
Wavefronts and Tiles cant co-exist in single frame.

Appendix A: HM Test Configuration
1. All Intra
Main configuration - encoder_intra_main.cfg
High efficiency (10 bits per pixel) - encoder_intra_he10.cfg
2. Random Access
Main configuration - encoder_randomaccess_main.cfg
High efficiency (10 bits per pixel) - encoder_randomaccess_he10.cfg
HM Test Configuration (cont.)
3. Low Delay (DPB buffer contains two or more reference frames, each inter frame can
utilize bi-prediction from previous references):
Main configuration - encoder_lowdelay_main.cfg
High efficiency (10 bits per pixel) - encoder_lowdelay_he10.cfg
4. Low Delay (DPB buffer contains single reference frame) : I P P P P

Main configuration - encoder_lowdelay_P_main.cfg
High efficiency (10 bits per pixel) - encoder_lowdelay_P_he10.cfg
Appendix B: Drawbacks of HEVC Features for H.266
Replacement of grid-like tiling with more general tiling:
Tiles are targeted for multi-core platforms therefore the tiles are desired to have equal areas
and minimal joint boundaries (since dependencies are broken across tile boundaries and to
minimize penalty in coding efficiency it's desirable the total length of tile boundaries is
minimal).
Example, let's consider a four-cores platform. How to divide a picture into tiles
by keeping both equal areas and minimal borders?
Solution #1: Partitioning of WxH picture into four vertical (or horizontal) stripes meets the
first condition but the total tile boundary length is 3xH (or 3xW).
T0
T1
T0 T1 T2 T3
T2
T3
Solution #2: Partitioning of the WxH picture into four equivalent quadrants makes the total
tile boundary equal to (W+H) and this value is less than 3xH or 3xW.
T0 T1
T2 T3
So, Solution #2 is the best division (equal areas and minimal tile boundaries).
However, the solution #2 is not suited for triple-cores case (since HEVC enables only grid-
tiling).
In general tiling the best solution (minimal boundaries and equal areas) for triple-cores is:
1/3H T0
Area of T0 = 1/3H x W
Area of T1 = 2/3H x 1/2W

2/3H T1 T2
Area of T2 = 2/3H x 1/2W
1/2W 1/2W
Unfortunately, such general tiling is not supported by HEVC.

Appendix C: Recommended Literature
1. Pieter3d (nickname) gave a detailed explanation of basic HEVC functions in Doom9

forum: http://forum.doom9.org/showthread.php?t=167081
2. Overview of the High Efficiency Video Coding (HEVC) Standard, IEEE
3. Transform coefficient coding in HEVC, 2012 December, IEEE Transactions on

Circuits and Systems for Video Coding.
4. Sample Adaptive Offset in the HEVC Standard, 2012 December, IEEE Transactions
on Circuits and Systems for Video Coding.
5. Analysis of HEVC/H265 Parallel Coding Tools, by Praveen GB and Ramakrishna

Adireddy
6. RESEARCH ON H.265/HEVC INTRA PREDICTION MODES SELECTION

FREQUENCIES, Sharabayko M.P
7. Intra Prediction Header Bits Estimation Algorithm for RDO in H.265/HEVC, Maxim
P. Sharabayko, Oleg G. Ponomarev
8. Entropy-Based Fast Largest Coding Unit Partition Algorithm in High-Efficiency

Video Coding, Mengmeng Zhang, Jianfeng Qu and Huihui Bai, 2013

Detailed HEVC/H265 Overview

Hochgeladen von

Dokumentinformationen

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Detailed HEVC/H265 Overview

Hochgeladen von

Copyright:

Verfügbare Formate

Detailed Overview of

The author is thankful to every individual who has reviewed the

SPS contain information which applies to all pictures of a video sequence

log2_min_luma_coding_block_size specify the minimal CU size.

log2_diff_max_min_luma_coding_block_size together with

log2_min_transform_block_size specify the minimal transform block size. Potential

log2_diff_max_min_transform_block_size - together with the minimal TB size specifies

PPS conveys information which could change from picture to picture

IDR - pictures following the IDR in decoding order cannot use

CRA pictures following the CRA in both decoding and presentation

Leading pictures - following in decoding order but preceding in presentation

RADL (random access decodable leading) can be correctly decoded if

RASL (random access skipped leading) - cant be correctly decoding if

Decoding order: Leading pictures

RASL RADL CRA

Coding Tree Block (CTB):

Coding Tree Units (CTU):

Coding Block (CB):

Coding Unit (CU)

Note: unlike to prior standards where MB consists of MB header followed by MB data, in

Prediction Unit (PU):

2Nx2N NxN (only if CB size is smallest CB size)

2Nx2N NxN 2NxN Nx2N

Inter Assymetric Partitions (conditioned by amp_enabled_flag in SPS and disabled for

nLx2N nRx2N 2NxnU 2NxnD

Examples when assymetric partitions are beneficial:

2NxnU 2NxnD nLx2N nRx2N

Assymetric splitting is also applied to chroma CBs.

Preprocessing (basing on texture and/or block complexity metric) can be used to

Transform Block (TB) :

1,0 1,1 1,2 1,3

Computational complexity to find best TU partition:

Unlike to H.264/AVC where TB PB, prediction and transform partitioning are

Reported by some experts that prediction discontinuities on PB boundaries

Luma 8x8 Chroma 4x4

Up to 10% increase in worst-case decode time

Add storage for SAO params.

b) Maximal CTU size shall be less than or equal to 5*RawCtuBits/3.

The variable RawCtuBits is derived as

RawCtuBits=CtbSizeY * CtbSizeY * BitDepthY +2 * ( CtbWidthC * CtbHeightC ) * BitDepth C

Unlike to H.264/AVC, support of interlace coding in HEVC is not exist:

No mixed frame-field interaction (like PAFF in H.264/AVC)

No interlace scanning of transform coefficients

In addition a particular flag general_interlaced_source_flag is signaled in VPS/SPS

pic_width_in_luma_samples shall not be equal to 0 and shall be an integer multiple of

pic_height_in_luma_samples shall not be equal to 0 and shall be an integer multiple of

Interpolation for fractional motion vectors (MV)

Horizontal 8 bits per pixel Vertical

Horizontal 8 bits per pixel Vertical

A -1,0 A 0,0 a 0,0 b 0,0 c 0,0 A 1,0 A 2,0

d -1,0 d 0,0 e 0,0 f 0,0 g 0,0 d 1,0 d 2,0

h -1,0 h 0,0 i 0,0 j 0,0 k 0,0 h 1,0 h 2,0

n -1,0 n 0,0 p 0,0 q 0,0 r 0,0 n 1,0 n 2,0

A -1,1 A 0,1 a 0,1 b 0,1 c 0,1 A 1,1 A 2,1

A - 1,2 A 0,2 a 0,2 b 0,2 c 0,2 A 1,2 A 2,2

A -1,0 A 0,0 a 0,0 b 0,0 c 0,0 A 1,0 A 2,0

d -1,0 d 0,0 e 0,0 f 0,0 g 0,0 d 1,0 d 2,0

h -1,0 h 0,0 i 0,0 j 0,0 k 0,0 h 1,0 h 2,0

n -1,0 n 0,0 p 0,0 q 0,0 r 0,0 n 1,0 n 2,0

A -1,1 A 0,1 a 0,1 b 0,1 c 0,1 A 1,1 A 2,1

A - 1,2 A 0,2 a 0,2 b 0,2 c 0,2 A 1,2 A 2,2

A -1,0 A 0,0 a 0,0 b 0,0 c 0,0 A 1,0 A 2,0

d -1,0 d 0,0 e 0,0 f 0,0 g 0,0 d 1,0 d 2,0

h -1,0 h 0,0 i 0,0 j00 k 0,0 h 1,0 h 2,0

b0,0 = A3,0 + 4A2,0 11A1,0 + 40A0,0 + 40A1,0 11A2,0 + 4A3,0 A4,0

a0,0 = A3,0 + 4A2,0 10A1,0 + 58A0,0 + 17A1,0 5*A2,0 + A3,0

e0,0 = ( a0,3 + 4a0,2 10a0,1 + 58a0,0 + 17a0,1 5*a0,2 + a0,3 ) >> 6

predSample[ x ][ y ] = ( ( 32 iFact )ref[ x+iIdx+1 ] + iFactref[ x+iIdx+2] + 16 ) >> 5