
Analysis and Parallelism of HEVC Encoder
Kwangwoon University (KWU)
Donggyu Sim (dgsim@kw.ac.kr)
Contents
  Overview of HEVC
  Encoding issues for HEVC test model (HM)
  Complexity analysis of HEVC encoder
  Fast encoding algorithms and performances
  Issues of parallel processing
  Conclusion

OVERVIEW OF HEVC
Introduction of HEVC
High Efficiency Video Coding (HEVC)
  JCT-VC
    Joint Collaborative Team on Video Coding
    ITU-T SG16 Q.6 (VCEG) and ISO/IEC JTC1/SC29/WG11 (MPEG)

  Requirements
    Development of a standard for video coding technology more advanced than the current AVC standard
    Development of the associated conformance testing and reference software specifications
    Informal goal: about 50% bit-rate reduction relative to H.264/AVC at comparable quality

  Timeline
    Delayed HEVC timeline
      Test model selection process begins 2010/04 (1st JCT-VC meeting at Dresden, DE)
      Test model selection by 2010/10 (3rd JCT-VC meeting at Guangzhou, CN)
      Committee Draft (CD) approval by 2012/02 (8th JCT-VC meeting at San Jose, US)
      Final Committee Draft (FCD)
      Draft International Standard (DIS) approval by 2012/07 (10th JCT-VC meeting at Stockholm, SE)
      Final Draft International Standard (FDIS) approval by 2013/01 (12th JCT-VC meeting at Geneva, CH)
Block diagram of HEVC standard
Typical block-based hybrid codec structure + enhanced and additional tools
FIGURE. HEVC codec block diagram. Blocks and their tools:
  Block structure: CU 8x8 to 64x64, diverse PU types
  Intra prediction: angular intra prediction
  Inter prediction: AMVP, merge, DCT-IF
  Transform: residual quadtree transform, TU 4x4 to 32x32
  Quantization: delta QP, RDOQ
  In-loop filter: deblocking filter, SAO
  Entropy coding: CABAC
  Picture storage: 8 bit/sample
  Other blocks: picture (input), inv. quantization, inv. transform, bitstream (output)
Block structure in HEVC
Three block structures are defined in HEVC
  Coding unit (CU)
  Prediction unit (PU)
  Transform unit (TU)

FIGURE. A picture divided into 64x64 CTUs, with one CTU split by a quadtree into 32x32, 16x16, and 8x8 CUs; PU partition types 2Nx2N, Nx2N, 2NxN, NxN, 2NxnU, 2NxnD, nLx2N, nRx2N; TU quadtree at depths 0, 1, and 2
Profiles for HEVC
The HEVC standard supports 3 profiles: Main, High Efficiency 10 (HE10), and Main Still Picture
TABLE. Tools per profile (an entry applies to all three profiles unless a profile is named)
High-level structure
  High-level support for frame rate temporal nesting and random access
  Clean random access (CRA) support (Main and HE10; "-" for Main Still Picture)
  Rectangular tile-structured scanning
  Wavefront-structured processing dependencies for parallelism
  Slices with spatial granularity equal to coding tree unit
Coding units, prediction units, and transform units
  Coding unit quadtree structure (square coding unit block sizes 2Nx2N, for N = 4, 8, 16, 32; i.e., up to 64x64 luma samples in size)
  Prediction units, Main and HE10 (for coding unit size 2Nx2N: for Inter, 2Nx2N, 2NxN, Nx2N, and, for N > 4, also 2Nx(N/2+3N/2) and (N/2+3N/2)x2N; for Intra, only 2Nx2N and, for N = 4, also NxN)
  Prediction units, Main Still Picture (2Nx2N and, for N = 4, also NxN)
  Transform unit tree structure within coding unit (maximum of 3 levels)
  Transform block size of 4x4 to 32x32 samples (always square)
Spatial signal transformation and PCM representation
  DCT-like integer block transform; for Intra also a DST-based integer block transform (only for luma 4x4)
  Transforms can cross prediction unit boundaries for Inter, not for Intra (Main and HE10; "-" for Main Still Picture)
  PCM coding with worst-case bit usage limit
Intra-picture prediction
  Angular intra prediction (35 modes in total: 33 angular directions plus DC and planar)
  Planar intra prediction
Inter-picture prediction (Main and HE10; "-" for Main Still Picture)
  Luma motion compensation interpolation: 1/4 sample precision, 8x8 separable with 6 bit tap values
  Chroma motion compensation interpolation: 1/8 sample precision, 4x4 separable with 6 bit tap values
  Advanced motion vector prediction with motion vector competition and merging
Entropy coding
  Context adaptive binary arithmetic entropy coding
  RDOQ on
Picture storage and output precision
  Main: 8 bit-per-sample storage and output
  HE10: 10 bit-per-sample storage and output
  Main Still Picture: 8 bit-per-sample storage and output
In-loop filtering
  Deblocking filter
  Sample-adaptive offset filter
Level
TABLE. Level limits (MaxCPB and MaxBR are listed per tier; "-" means no High tier value at that level)
Level  MaxLumaPS    MaxCPB (1000 bits)    MaxSliceSegments  MaxTile  MaxTile  MaxLumaSR        MaxBR (1000 bits/s)   MinCR
       (samples)    Main tier  High tier  PerPicture        Rows     Cols     (samples/sec)    Main tier  High tier
1          36 864       350        -       16                1        1          552 960           128        -        2
2         122 880     1 500        -       16                1        1        3 686 400         1 500        -        2
2.1       245 760     3 000        -       20                1        1        7 372 800         3 000        -        2
3         552 960     6 000        -       30                2        2       16 588 800         6 000        -        2
3.1       983 040    10 000        -       40                3        3       33 177 600        10 000        -        2
4       2 228 224    12 000    30 000      75                5        5       66 846 720        12 000    30 000       4
4.1     2 228 224    20 000    50 000      75                5        5      133 693 440        20 000    50 000       4
5       8 912 896    25 000   100 000     200               11       10      267 386 880        25 000   100 000       6
5.1     8 912 896    40 000   160 000     200               11       10      534 773 760        40 000   160 000       8
5.2     8 912 896    60 000   240 000     200               11       10    1 069 547 520        60 000   240 000       8
6      35 651 584    60 000   240 000     600               22       20    1 069 547 520        60 000   240 000       8
6.1    35 651 584   120 000   480 000     600               22       20    2 139 095 040       120 000   480 000       8
6.2    35 651 584   240 000   800 000     600               22       20    4 278 190 080       240 000   800 000       6
HM ENCODER
Decision level for HEVC encoder
Sequence level
  Coding structure (all intra, low delay, random access)
  Profile, tier, level
  Max/min CTU size, CU depth
  Max/min TU size, TU depth
  Tool on/off (SAO, deblocking, WPP, tile)
Picture level
  # ref frames, rate control
  Tile, slice
Slice or tile level
  Ref frames
  Deblocking filter parameters
CTU level
  CU partitioning
  Sample adaptive offset parameters
CU level
  PU and TU partitioning
PU & TU level
  Prediction modes, motion vectors
  cbf, coefficients
FIGURE. Nested decision levels: sequence, picture, slice or tile, CTU, CU, PU & TU
Coding structure (1/3)
All intra (AI)
  Every picture is coded as an instantaneous decoding refresh (IDR) picture
  No temporal prediction is allowed
  No QP variation is allowed
FIGURE. All-intra structure: IDR pictures 0 to 7 along the time axis, coding order equal to POC, every picture coded with QPI
Coding structure (2/3)
Low delay (LD)
  The first picture shall be coded as an IDR picture
  Generalized P and B (GPB) pictures shall be used for the other successive pictures
  A GPB picture shall use only reference pictures whose POC is smaller than that of the current picture (all reference pictures in List_0 and List_1 shall be temporally previous in display order relative to the current picture)
  The QP of each inter-coded picture shall be derived by adding an offset to the QP of the intra-coded picture, depending on the temporal layer
FIGURE. Low-delay structure: an IDR or intra picture (0) followed by GPB pictures 1 to 8 along the time axis, coding order equal to POC. QP per temporal layer: QPB_L1 = QPI + 1, QPB_L2 = QPI + 2, QPB_L3 = QPI + 3. Legend: depth 0, depth 1, depth 2
Coding structure (3/3)
Random access (RA)
  A hierarchical B structure shall be used for coding
  An IDR intra picture or clean random access (CRA) picture shall be inserted cyclically, about once per second, as a random access point
  The QP of each inter-coded picture shall be derived by adding an offset to the QP of the intra-coded picture, depending on the temporal layer
FIGURE. Random-access structure: an IDR or intra picture, GPB pictures, referenced B pictures, and non-referenced B pictures, shown with POC 0 to 8 and a coding order that differs from POC. QP per temporal layer: QPB_L1 = QPI + 1, QPB_L2 = QPI + 2, QPB_L3 = QPI + 3, QPB_L4 = QPI + 4. Legend: depth 0, depth 1, depth 2, depth 3
Picture of HEVC
Picture: a picture contains an array of luma samples in monochrome format, or an array of luma samples and two corresponding arrays of chroma samples in 4:2:0, 4:2:2, or 4:4:4 color format.
The coding order of coding tree units (CTUs) is raster scan order

Example) Class A (2560x1600), Nebuta Festival
  CTU size: 64x64
  40x25 CTU partition
* CTU & CTB: the CTU consists of a luma coding tree block (CTB), the corresponding chroma CTBs, and syntax elements
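The 40x25 figure in the example above follows from dividing the picture size by the CTU size and rounding up. A minimal sketch, assuming a hypothetical helper name `ctu_grid` (not taken from HM):

```python
def ctu_grid(width, height, ctu_size=64):
    """Return (columns, rows) of CTUs covering a picture.

    Rounds up, so a partial CTU at the right or bottom edge still counts.
    """
    cols = (width + ctu_size - 1) // ctu_size
    rows = (height + ctu_size - 1) // ctu_size
    return cols, rows


# Class A example from the slide: 2560x1600 with 64x64 CTUs
print(ctu_grid(2560, 1600))
```

For 1920x1080 the height is not a multiple of 64, so the last CTU row is only partially covered by picture samples.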
Picture partitioning
A slice is a sequence of coding tree units (CTUs)

Unlike slices, tiles are always rectangular and always contain an integer number of coding tree units in coding tree unit raster scan

At least one of the following conditions should be true for each slice and tile in a picture
  All CTBs in a slice belong to the same tile, or all CTBs in a tile belong to the same slice
FIGURE. A picture with 40x25 coding tree units that is partitioned into two slices
FIGURE. A picture with 40x25 coding tree units that is partitioned into three tiles
Overall HM encoding process
FIGURE. Nested encoding loops: sequence, picture, CTU decisions in a slice or a tile, CU partitioning decision, PU & TU partitioning decision (RDO process); deblocking filter and SAO are applied at the picture level
Maximum CTU size & CU depth
The size of the CTU is specified in the sequence parameter set (SPS)
  16x16~64x64

TABLE. Syntax for size of CTU in SPS
seq_parameter_set_rbsp() {                          Descriptor

  log2_min_coding_block_size_minus3                 ue(v)
  log2_diff_max_min_coding_block_size_minus2        ue(v)

}

FIGURE. Example of CU quadtree structure (one 64x64 CTU split into 32x32, 16x16, and 8x8 CUs), with split flags in coding order:
  64 CU: split_coding_unit_flag(1)
    32 CU: split_coding_unit_flag(0)
    32 CU: split_coding_unit_flag(1)
      16 CU: split_coding_unit_flag(0)
      16 CU: split_coding_unit_flag(0)
      16 CU: split_coding_unit_flag(1)
Coding unit (CU) decision
Coding unit quadtree structure
  Starting from the CTU, each CU can be split into 4 smaller CUs
  Best CU RD-cost calculation for each CU level
  Competition of the best CU and its sub-partitioned CUs
FIGURE. CU quadtree from 64x64 through 32x32 and 16x16 down to 8x8 (CU size on the vertical axis), with the numbered order in which the CUs are evaluated
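The competition between a CU and its four sub-CUs can be sketched as a recursion. This is a simplified illustration, not HM's CompressCU: the `rd_cost` callback is a hypothetical stand-in for the best RD cost over all PU/TU modes of one CU, and the bits for the split flag itself are ignored.

```python
def decide_cu(x, y, size, min_size, rd_cost):
    """Return (best_cost, tree) for the CU at (x, y).

    tree is the CU size if the CU is kept whole, or a list of four
    sub-trees (in z-scan order) if splitting is cheaper.
    """
    cost_whole = rd_cost(x, y, size)
    if size <= min_size:          # smallest CU: cannot split further
        return cost_whole, size
    half = size // 2
    cost_split = 0
    children = []
    for dy in (0, half):
        for dx in (0, half):
            cost, tree = decide_cu(x + dx, y + dy, half, min_size, rd_cost)
            cost_split += cost
            children.append(tree)
    # Keep whichever alternative has the lower total RD cost
    if cost_split < cost_whole:
        return cost_split, children
    return cost_whole, size
```

Because every level is evaluated before the comparison, the cost of this exhaustive search grows with the CU depth, which is what the fast algorithms later in the deck try to prune.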


Prediction unit (PU) types
2 PU types for intra prediction
  2Nx2N (smallest CU: additionally NxN)
8 PU types for inter prediction
  Smallest CU:
    8x8: 2Nx2N, Nx2N, 2NxN
    Others: 2Nx2N, Nx2N, 2NxN, NxN
  Others: 2Nx2N, Nx2N, 2NxN, nLx2N, nRx2N, 2NxnU, 2NxnD

FIGURE. PU partitions in HEVC: 2Nx2N, Nx2N, 2NxN, NxN, 2NxnD, 2NxnU, nRx2N, nLx2N
Prediction unit (PU) types
FIGURE. Candidate PU types selected from the current CU size, the SCU size, and the AMP enable flag:
  Cur. CU size != SCU size, AMP enabled: Intra 2Nx2N, Inter 2Nx2N, Inter 2NxN, Inter Nx2N, Inter AMP
  Cur. CU size != SCU size, AMP disabled: Intra 2Nx2N, Inter 2Nx2N, Inter 2NxN, Inter Nx2N
  Cur. CU size == SCU size, CU size == 8x8: Intra 2Nx2N, Inter 2Nx2N, Inter 2NxN, Inter Nx2N, Intra NxN
  Cur. CU size == SCU size, CU size != 8x8: Intra 2Nx2N, Inter 2Nx2N, Inter 2NxN, Inter Nx2N, Intra NxN, Inter NxN
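The decision tree above can be written as a small function. This is a sketch of the selection logic as read from the slide; the function name and the mode strings are illustrative, not HM identifiers.

```python
def candidate_pu_modes(cu_size, scu_size, amp_enabled):
    """List the PU types tested for a CU, following the slide's flowchart."""
    modes = ["Intra 2Nx2N", "Inter 2Nx2N", "Inter 2NxN", "Inter Nx2N"]
    if cu_size != scu_size:
        # Larger than the smallest CU: asymmetric partitions are optional
        if amp_enabled:
            modes.append("Inter AMP")
        return modes
    # Smallest CU: NxN intra is allowed, NxN inter only above 8x8
    modes.append("Intra NxN")
    if cu_size != 8:
        modes.append("Inter NxN")
    return modes
```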
Maximum TU size & TU depth
The root of the TU quadtree is the CU to which the TU belongs
Available transform block sizes and max transform hierarchy depth are specified in the SPS
  Transform block sizes: 4x4~32x32
  Maximum 3 levels (depth 0, 1, or 2)

FIGURE. TU quadtree structure in HEVC (32x32 example)
TABLE. Syntax for size of TU in SPS
seq_parameter_set_rbsp() {                          Descriptor

  log2_min_transform_block_size_minus_2             ue(v)
  log2_diff_max_min_transform_block_size            ue(v)

  max_transform_hierarchy_depth_inter               ue(v)
  max_transform_hierarchy_depth_intra               ue(v)

}
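RDO process to decide PU & TU
FIGURE. CompressCU flow: evaluate 2Nx2N merge/skip, Inter 2Nx2N, Inter Nx2N, Inter 2NxN, Inter AMP, Intra 2Nx2N, Intra NxN, and Intra PCM; then compare the CU size with the SCU size: if the CU is the SCU (Yes), finish; otherwise (No), call CompressCU on each of the four sub-CUs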
Intra prediction flow
Prediction modes
  Luma
    DC, planar, angular prediction (33 directions)
  Chroma
    DC, planar, ver, hor, DM
Filtering
  MDIS (mode dependent intra smoothing)
  DC filtering, ver/hor filtering
3 MPM
Cost functions
  $J_{pred,SATD} = SATD + \lambda_{pred} \cdot B_{pred}$
  $J_{mode} = SSE + \lambda_{mode} \cdot B_{mode}$
FIGURE. Flowchart of intra prediction: 2Nx2N PU, reference sample padding, MDIS, intra prediction, RD cost and intra mode; repeat while mode < 35 (Y), then best mode decision (N)
FIGURE. Directions and modes of HEVC intra prediction
Fast intra prediction & TU decision in HM
Intra prediction steps in HM
1) Rough prediction mode decision
   35 prediction modes
   Select N prediction modes
   Cost: distortion (SATD) + lambda * mode bits
   # of candidate prediction modes: N modes + MPM (3)
2) Best intra prediction mode decision with transform
   Transform (RQT depth = 1)
   1 best intra mode decision
3) Best RQT decision with RD costs
   RQT depth = 3

FIGURE. 35 modes, then N modes + MPM, then 1 best mode, then RD cost of the best mode
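Step 1 (rough mode decision) can be sketched as follows. The function name, the dictionary inputs, and the tie-breaking by mode index are assumptions made for illustration; HM computes SATD on the actual prediction residual and derives N from the PU size.

```python
def rough_mode_candidates(satd, bits, lam, n, mpm):
    """Pick the N cheapest modes by SATD + lambda * bits, then add the MPMs.

    satd, bits: dicts mapping mode index -> SATD distortion / mode bits
    mpm: list of most probable modes (3 in HEVC)
    """
    cost = {m: satd[m] + lam * bits[m] for m in satd}
    best = sorted(cost, key=lambda m: (cost[m], m))[:n]
    # MPMs are cheap to signal, so they are always kept as candidates
    return best + [m for m in mpm if m not in best]
```

Only the returned candidates go on to step 2, where the more expensive transform-based RD cost is computed.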
Inter prediction
Skip: merge skip

Non-skip
  Unidirectional prediction
  Bidirectional prediction
  Half-pel / quarter-pel motion refinement
  DCT-IF (8-tap / 4-tap)
  Merge
FIGURE. Flowchart of inter prediction: for the current CU, merge skip, Inter 2Nx2N, Inter 2NxN, Inter Nx2N, and AMP (nLx2N, nRx2N, 2NxnU, 2NxnD) are evaluated. Each non-skip mode runs unidirectional prediction, bidirectional prediction, and merge (spatial candidates derivation, temporal candidate derivation, additional candidates derivation, RD cost calculation), followed by the best mode decision (RD cost, best mode)
Inter prediction
Inter coding mode
  Merge skip mode (CU level)
    skip_flag = 1 and merge_idx
    No reference index
    No motion vector
    No residual
  Merge mode (PU level)
    skip_flag = 0, pred_mode_flag, and part_mode
    merge_flag = 1 and merge_idx
    No reference index and motion vector
    no_residual_syntax_flag: residual is encoded or not
  General PU modes
    skip_flag = 0, pred_mode_flag, and part_mode
    merge_flag = 0
    ref_idx_lX and mvp_lX_flag based on AMVP (X = 0 or 1)
    MVD is encoded
    no_residual_syntax_flag: residual is encoded or not

Inter prediction flow
BEGIN  input: current PU part mode for a CU

  FOR PU partition

    FOR List = 0 to 1 DO
      FOR 0 to refidx DO
        Motion estimation (diamond search, SR: 64)
        Decide best RD-cost for uni-prediction
      ENDFOR
    ENDFOR

    IF bi-directional prediction THEN
      FOR iteration = 0 to 3 DO
        FOR 0 to refidx DO
          Motion estimation (full search, SR: 4)
          Decide best RD-cost for bi-prediction
        ENDFOR
      ENDFOR
    ENDIF
  ENDFOR

  Merge

  RD-cost competition among uni/bi-prediction and merge

END  output: inter prediction syntax
FIGURE. Pseudo code - Inter prediction flow
Fast encoder decision (FEN)
  Subsampled SAD for integer ME
    Use subsampled SAD when rows > 8 for integer ME
  Only 1 iteration for bi-predictive motion search
    Default number: 4
Fast decision for merge RD cost (FDM)
  After merge with merge idx X, if all cbf are zero, the merge process is terminated
FIGURE. Uni-prediction and bi-prediction of the current PU from LIST_0 and LIST_1 reference pictures along the time axis
Bi-prediction
  Search P0 and P1 which produce minimum error with O
  $R = O - P$, where $P = (P_0 + P_1)/2$

Practical bi-predictive search
  1) Search P1 which produces minimum 2R with (2O - P0)
     $R = O - (P_0 + P_1)/2 \Rightarrow 2R = (2O - P_0) - P_1$
  2) Search P0 which produces minimum error with (2O - P1)
     $R = O - (P_0 + P_1)/2 \Rightarrow 2R = (2O - P_1) - P_0$

  BipredSearchRange: 4
  FEN: 1 (iteration: 1)
FIGURE. Blocks P0 (List 0 reference), O (current frame), and P1 (List 1 reference)
Example) Bi-prediction
FIGURE. Unidirectional prediction (search range: 64) followed by bidirectional prediction iterations 1 to 4 (BipredSearchRange: 4). In each iteration P0 and P1 are refined in turn against 2O minus the other predictor, giving residuals R0 and R1
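The alternating refinement described above can be sketched on toy data. Blocks are flat lists of samples and the "search" is a minimum over explicit candidate lists, which stands in for the small-range motion search; both simplifications, and all names, are assumptions for illustration.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))


def bipred_search(orig, cands0, cands1, iterations=4):
    """Alternately refine P1 against 2O - P0 and P0 against 2O - P1."""
    # Start from the best uni-directional predictor in list 0
    p0 = min(cands0, key=lambda c: sad(orig, c))
    p1 = None
    for _ in range(iterations):
        target1 = [2 * o - a for o, a in zip(orig, p0)]   # 2O - P0
        p1 = min(cands1, key=lambda c: sad(target1, c))
        target0 = [2 * o - b for o, b in zip(orig, p1)]   # 2O - P1
        p0 = min(cands0, key=lambda c: sad(target0, c))
    return p0, p1
```

With `iterations=1` this corresponds to the FEN setting on the slide; the default of 4 matches the default iteration count.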
Motion estimation (integer-pel)
Practical motion estimation (diamond search)
  First search & early termination
    Max 3 (default) more rounds after a recent best match

  Raster refinement search
    If the integer-pel distance is bigger than 5, conduct the raster refinement search

  Star refinement search & early termination
    Diamond search centered on the best match from the earlier two steps
    Max 2 rounds after the best match

FIGURE. Raster refinement search
FIGURE. First search & star refinement (diamond pattern around the start point 0; the numbers 1, 2, 3 mark the search round of each position)
Motion estimation (sub-pel refinement)
Integer-pel motion search
  Cost function: SAD

Sub-pel motion refinement
  Cost function: SATD
  Half-pel refinement
  Quarter-pel refinement

FIGURE. Integer-pel motion search
FIGURE. Half-pel motion search
FIGURE. Quarter-pel motion search
(Legend: search range; integer-pel, half-pel, and quarter-pel positions)
Interpolation
DCT-IF in HEVC
  Fixed 8-tap (7-tap for the quarter-sample positions) and 4-tap interpolation filters based on DCT
  2D separable filter
    8 * horizontal 1D filter + 1 * vertical 1D filter
TABLE. Interpolation filter coefficients
Component  Position  Filter coefficients
Luma       1/4       {-1, 4, -10, 58, 17, -5, 1, 0}
           1/2       {-1, 4, -11, 40, 40, -11, 4, -1}
Chroma     1/8       {-2, 58, 10, -2}
           3/8       {-6, 46, 28, -4}
           1/4       {-4, 54, 16, -2}
           1/2       {-4, 36, 36, -4}
FIGURE. Integer and fractional sample positions for luma and chroma interpolation (integer samples A and B, fractional luma positions a to r, fractional chroma positions ab to hh)
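A one-dimensional sketch of applying the luma filters from the table. It normalizes by 64 in a single stage with rounding and clipping; the real codec keeps higher intermediate precision between the horizontal and vertical passes, so this is an illustration of the tap arithmetic only. Function and constant names are hypothetical.

```python
LUMA_QUARTER = (-1, 4, -10, 58, 17, -5, 1, 0)
LUMA_HALF = (-1, 4, -11, 40, 40, -11, 4, -1)


def interpolate_row(samples, taps, bit_depth=8):
    """Fractional samples between samples[i] and samples[i+1].

    Only positions with full filter support are produced, so the
    output is shorter than the input (no border padding here).
    """
    n = len(taps)
    left = n // 2 - 1                      # taps to the left of position i
    max_val = (1 << bit_depth) - 1
    out = []
    for i in range(left, len(samples) - (n - left) + 1):
        acc = sum(t * samples[i - left + k] for k, t in enumerate(taps))
        val = (acc + 32) >> 6              # taps sum to 64
        out.append(min(max(val, 0), max_val))
    return out


ramp = list(range(0, 100, 10))
print(interpolate_row(ramp, LUMA_HALF))
```

On a linear ramp the symmetric half-sample filter reproduces the midpoints between neighbouring samples, which is a quick sanity check on the coefficient signs.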
Example of PU decision
FIGURE. CompressCU flow: 2Nx2N merge/skip, Inter 2Nx2N, Inter Nx2N, Inter 2NxN, Inter AMP, Intra 2Nx2N, Intra NxN, Intra PCM; if the CU is the SCU (Yes), finish; otherwise (No), CompressCU is called on the four sub-CUs
Example
  Motion search stage (no TU decision, no reconstruction), RD cost = SAD/SATD + λ * B_mode
    Uni-prediction: 12000
    Merge: 11000
    Bi-prediction: 9000
  Mode decision stage (TU decision, reconstruction), RD cost = SSE + λ * B_mode
    Bi-prediction: 8500
TU decision flow (Inter)
Residual quadtree
FIGURE. For each PU partition type (2Nx2N, Nx2N, 2NxN, NxN, 2NxnD, 2NxnU, nRx2N, nLx2N), the residual (original minus predictor) is coded at TU depth 0, 1, and 2: T/Q, IT/IQ (recon), then RD cost (SSE + λ * B_mode)
TU decision flow (Intra)
Example) intra_pred_mode = 10 (vertical mode)
FIGURE. Intra prediction using reference samples along the prediction direction at TU depth N, followed by residual, T/Q, IT/IQ, and RD cost (SSE + λ * B_mode); at TU depth N+1 the reference samples of a block become available only after the above block is reconstructed
Transform
Implementation of transform in HEVC
  Matrix multiplication
    Straightforward, few code lines
    Huge number of operations, but SIMD friendly
  Partial butterfly implementation
    Utilizes symmetry/anti-symmetry properties of the basis vectors
    Fewer multiplications/additions
    More code lines
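The two implementation styles can be compared on the 4-point core transform. The matrix below is the HEVC 4-point integer DCT matrix; the sketch covers one 1-D stage only and omits the rounding shifts between stages, and the function names are illustrative.

```python
DCT4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def dct4_matrix(x):
    """Direct matrix multiplication: 16 multiplications."""
    return [sum(c * v for c, v in zip(row, x)) for row in DCT4]


def dct4_butterfly(x):
    """Partial butterfly: even/odd split uses 6 multiplications."""
    e0, e1 = x[0] + x[3], x[1] + x[2]      # even part (symmetric rows)
    o0, o1 = x[0] - x[3], x[1] - x[2]      # odd part (anti-symmetric rows)
    return [64 * (e0 + e1), 83 * o0 + 36 * o1,
            64 * (e0 - e1), 36 * o0 - 83 * o1]


print(dct4_matrix([1, 2, 3, 4]), dct4_butterfly([1, 2, 3, 4]))
```

The same even/odd decomposition applies recursively to the 8-, 16-, and 32-point transforms, which is where the saving in operations becomes significant.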
Partitioning syntax for a CTU
Syntax
FIGURE. Example of CU quadtree structure (64x64 CTU split into 32x32, 16x16, and 8x8 CUs)
  64 CU: split_coding_unit_flag(1)
    32 CU: split_coding_unit_flag(0)

    32 CU: split_coding_unit_flag(1)
      16 CU: split_coding_unit_flag(0)

      16 CU: split_coding_unit_flag(0)

      16 CU: split_coding_unit_flag(1)

  Each CU that is not split further carries:
    PU partition & pred_mode info
      SKIP flag (merge idx)
      Prediction mode flag (intra or inter)
      PU part size (2Nx2N, 2NxN, Nx2N, NxN, AMP)
      Prediction info. (intra mode or mv and ref. idx., merge idx, AMVP idx)
    TU split flags & coefficients

FIGURE. Example of TU quadtree structure
  32x32 TU: split flag(1)

    16x16 TU: split flag(0) <16x16 cbf, coefficients>
    coefficients>
    4x4 TU: split flag(0) <4x4 cbf, coefficients>

  32x32 TU: split flag(1)

    16x16 TU: split flag(0) <16x16 cbf, coefficients>

    16x16 TU: split flag(1)
      8x8 TU: split flag(0) <8x8 cbf, coefficients>
      8x8 TU: split flag(0) <8x8 cbf, coefficients>
      8x8 TU: split flag(0) <8x8 cbf, coefficients>
      8x8 TU: split flag(0) <8x8 cbf, coefficients>

    16x16 TU: split flag(1)
      8x8 TU: split flag(0) <8x8 cbf, coefficients>
      8x8 TU: split flag(1)
        4x4 TU: split flag(0) <4x4 cbf, coefficients>
        4x4 TU: split flag(0) <4x4 cbf, coefficients>
        4x4 TU: split flag(0) <4x4 cbf, coefficients>
        4x4 TU: split flag(0) <4x4 cbf, coefficients>
In-loop filter
In HEVC, two processing steps, a deblocking filter (DBF) and a sample adaptive offset (SAO) operation, are applied
  DBF: similar to the DBF of the H.264/AVC standard
  SAO: applied adaptively to all samples satisfying certain conditions (while the DBF is only applied to the samples located at block boundaries)

On/off syntaxes for in-loop filters
  1. slice_disable_deblocking_filter_flag: slice-level on/off
  2. sample_adaptive_offset_enabled_flag: slice-level on/off
Deblocking filter (DBF)
Basically, the deblocking filter of HEVC is similar to that of H.264/AVC
  In-loop filtering
    Coding performance for inter frames
  Frame-based filtering
    On/off control is provided
  Adaptive filtering
    Boundary strength
  Filtering on the block boundaries
    Transform and prediction boundaries
  Sequential filtering for vertical and horizontal edges
    Sample values modified during filtering of vertical edges are used as input for the filtering of the horizontal edges
Deblocking filter (DBF)
Features of the HEVC deblocking filter compared to H.264/AVC
  For the TUs and PUs with edges less than 8 samples in either vertical or horizontal direction, only the edges lying on the 8x8 sample grid are filtered
FIGURE. Filtered edges in a 16x16 coding unit for (a) H.264/AVC and (b) HEVC; in both, step 1 filters vertical edges (horizontal filtering) and step 2 filters horizontal edges (vertical filtering)
FIGURE. Derivation process for the boundary filter strength in AVC and HEVC
Processing flow of DBF
Boundary decision
  Three kinds of boundaries are involved in the filtering
    CU, TU, PU boundary
  CU boundaries are always involved in the filtering
  TU boundaries on the 8x8 block grid and PU boundaries between the PUs inside a CU are involved in the filtering
  Exception: when a PU boundary lies inside a TU, the boundary shall not be filtered

Bs calculation
  Bs is calculated on a 4x4 block basis, then remapped to the 8x8 grid
  Two Bs values belong to the 8 pixels forming a line in the 4x4 grid; the maximum of the two is selected as the Bs for the boundary in the 8x8 grid

FIGURE. Overall processing flow of the deblocking filter process: boundary decision, Bs calculation (4x4 to 8x8), β and tC decision, filter on/off decision, strong/weak filter selection, strong filtering or weak filtering
Filter on/off decision for 4 lines
  With $p_{i,k}$ and $q_{i,k}$ the i-th sample on either side of the edge in line k:
    $dp_k = |p_{2,k} - 2 p_{1,k} + p_{0,k}|$, $dq_k = |q_{2,k} - 2 q_{1,k} + q_{0,k}|$ for k = 0, 3, 4, 7
  If $dp_0 + dq_0 + dp_3 + dq_3 < \beta$, filtering for the first 4 lines is turned on

  Derive values for the weak filtering process
    dEp1 = dp0 + dp3 < (β + (β >> 1)) >> 3 ? 1 : 0
    dEq1 = dq0 + dq3 < (β + (β >> 1)) >> 3 ? 1 : 0

  For the second 4 lines, the decision is made in the same fashion using lines 4 and 7
FIGURE. Samples p3, p2, p1, p0 | q0, q1, q2, q3 across the edge for lines 0 to 7 (first 4 lines: 0 to 3, second 4 lines: 4 to 7)
Processing flow of DBF
β and tC decision
  The threshold values β and tC are derived based on the luma QPs QP_P and QP_Q
    Q = ((QP_P + QP_Q + 1) >> 1)
  β and tC are specified in the table below with Q as input
  (If Bs > 1, tC is specified in the table below with Clip3(0, 55, Q + 2) as input)
TABLE. β and tC as a function of Q
Q    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
β    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  6  7  8
tC   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1

Q   19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
β    9 10 11 12 13 14 15 16 17 18 20 22 24 26 28 30 32 34 36
tC   1  1  1  1  1  1  1  1  2  2  2  2  3  3  3  3  4  4  4

Q   38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
β   38 40 42 44 46 48 50 52 54 56 58 60 62 64 64 64 64 64
tC   5  5  6  6  7  8  9  9 10 10 11 11 12 12 13 13 14 14
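The on/off decision for one group of 4 lines can be sketched directly from the formulas on the slide. The data layout (one tuple `(s0, s1, s2, s3)` per line and side) and the function names are assumptions for illustration; β is taken as an input rather than looked up from the table.

```python
def second_derivative(s2, s1, s0):
    """|s2 - 2*s1 + s0|: local activity on one side of the edge."""
    return abs(s2 - 2 * s1 + s0)


def deblock_on_off(p, q, beta):
    """Decide filtering for a 4-line segment.

    p[k], q[k]: tuples (s0, s1, s2, s3) for line k = 0..3, where s0 is
    the sample next to the edge. Returns (filter_on, dEp1, dEq1).
    """
    dp0 = second_derivative(p[0][2], p[0][1], p[0][0])
    dp3 = second_derivative(p[3][2], p[3][1], p[3][0])
    dq0 = second_derivative(q[0][2], q[0][1], q[0][0])
    dq3 = second_derivative(q[3][2], q[3][1], q[3][0])
    filter_on = (dp0 + dq0 + dp3 + dq3) < beta
    side_threshold = (beta + (beta >> 1)) >> 3
    dEp1 = 1 if (dp0 + dp3) < side_threshold else 0
    dEq1 = 1 if (dq0 + dq3) < side_threshold else 0
    return filter_on, dEp1, dEq1
```

Only lines 0 and 3 are examined, so the decision for 4 lines costs four second-derivative evaluations; a flat region on both sides turns the filter on, while strong texture next to the edge turns it off.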
Processing flow of DBF
Strong/weak filter selection for 4 lines
  If the following 2 conditions are met, the strong filter is used
  Otherwise, the weak filter is used
  Here $p_{i,k}$ and $q_{i,k}$ are the i-th sample on either side of the edge in line k, $dp_k = |p_{2,k} - 2 p_{1,k} + p_{0,k}|$ and $dq_k = |q_{2,k} - 2 q_{1,k} + q_{0,k}|$
  Conditions for the first 4 lines
    1) $2(dp_0 + dq_0) < (\beta \gg 2)$, $|p_{3,0} - p_{0,0}| + |q_{0,0} - q_{3,0}| < (\beta \gg 3)$ and $|p_{0,0} - q_{0,0}| < (5 t_C + 1) \gg 1$
    2) $2(dp_3 + dq_3) < (\beta \gg 2)$, $|p_{3,3} - p_{0,3}| + |q_{0,3} - q_{3,3}| < (\beta \gg 3)$ and $|p_{0,3} - q_{0,3}| < (5 t_C + 1) \gg 1$
  Conditions for the second 4 lines
    1) $2(dp_4 + dq_4) < (\beta \gg 2)$, $|p_{3,4} - p_{0,4}| + |q_{0,4} - q_{3,4}| < (\beta \gg 3)$ and $|p_{0,4} - q_{0,4}| < (5 t_C + 1) \gg 1$
    2) $2(dp_7 + dq_7) < (\beta \gg 2)$, $|p_{3,7} - p_{0,7}| + |q_{0,7} - q_{3,7}| < (\beta \gg 3)$ and $|p_{0,7} - q_{0,7}| < (5 t_C + 1) \gg 1$

Strong/weak filtering
Strong filtering
  $p_0' = (p_2 + 2 p_1 + 2 p_0 + 2 q_0 + q_1 + 4) \gg 3$
  $q_0' = (p_1 + 2 p_0 + 2 q_0 + 2 q_1 + q_2 + 4) \gg 3$
  $p_1' = (p_2 + p_1 + p_0 + q_0 + 2) \gg 2$
  $q_1' = (p_0 + q_0 + q_1 + q_2 + 2) \gg 2$
  $p_2' = (2 p_3 + 3 p_2 + p_1 + p_0 + q_0 + 4) \gg 3$
  $q_2' = (p_0 + q_0 + q_1 + 3 q_2 + 2 q_3 + 4) \gg 3$

Weak filtering
  $\Delta = (9 (q_0 - p_0) - 3 (q_1 - p_1) + 8) \gg 4$
  When $|\Delta|$ is less than $10 t_C$,
    $\Delta = \mathrm{Clip3}(-t_C, t_C, \Delta)$
    $p_0' = \mathrm{Clip1}_Y(p_0 + \Delta)$
    $q_0' = \mathrm{Clip1}_Y(q_0 - \Delta)$
    If dEp1 is equal to 1,
      $\Delta p = \mathrm{Clip3}(-(t_C \gg 1), t_C \gg 1, (((p_2 + p_0 + 1) \gg 1) - p_1 + \Delta) \gg 1)$
      $p_1' = \mathrm{Clip1}_Y(p_1 + \Delta p)$
    If dEq1 is equal to 1,
      $\Delta q = \mathrm{Clip3}(-(t_C \gg 1), t_C \gg 1, (((q_2 + q_0 + 1) \gg 1) - q_1 - \Delta) \gg 1)$
      $q_1' = \mathrm{Clip1}_Y(q_1 + \Delta q)$
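The per-line selection test and the two filters can be sketched for a single line of samples. Samples are tuples `(s0, s1, s2, s3)` with `s0` next to the edge; the function names are illustrative, and the strong filter follows the slide's equations without any additional clipping.

```python
def clip3(lo, hi, v):
    return max(lo, min(hi, v))


def strong_condition(p, q, dpq, beta, tc):
    """One of the two per-line conditions; dpq = dp + dq of that line."""
    return (2 * dpq < (beta >> 2)
            and abs(p[3] - p[0]) + abs(q[0] - q[3]) < (beta >> 3)
            and abs(p[0] - q[0]) < ((5 * tc + 1) >> 1))


def strong_filter(p, q):
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q
    new_p = ((p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3,
             (p2 + p1 + p0 + q0 + 2) >> 2,
             (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3,
             p3)
    new_q = ((p1 + 2 * p0 + 2 * q0 + 2 * q1 + q2 + 4) >> 3,
             (p0 + q0 + q1 + q2 + 2) >> 2,
             (p0 + q0 + q1 + 3 * q2 + 2 * q3 + 4) >> 3,
             q3)
    return new_p, new_q


def weak_filter(p, q, tc, dEp1, dEq1, max_val=255):
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q
    delta = (9 * (q0 - p0) - 3 * (q1 - p1) + 8) >> 4
    if abs(delta) >= tc * 10:
        return p, q                       # likely a real edge: leave as is
    delta = clip3(-tc, tc, delta)
    np0 = clip3(0, max_val, p0 + delta)
    nq0 = clip3(0, max_val, q0 - delta)
    np1, nq1 = p1, q1
    if dEp1:
        dp = clip3(-(tc >> 1), tc >> 1, (((p2 + p0 + 1) >> 1) - p1 + delta) >> 1)
        np1 = clip3(0, max_val, p1 + dp)
    if dEq1:
        dq = clip3(-(tc >> 1), tc >> 1, (((q2 + q0 + 1) >> 1) - q1 - delta) >> 1)
        nq1 = clip3(0, max_val, q1 + dq)
    return (np0, np1, p2, p3), (nq0, nq1, q2, q3)
```

The strong filter modifies three samples on each side, the weak filter at most two, which is why the weak path needs the extra dEp1/dEq1 flags.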
Overview of sample adaptive offset (1/2)
Artifacts
  Blocking artifacts, ringing artifacts, color biases, and blurring artifacts
  A larger transform could introduce more artifacts
    HEVC: 4x4~32x32 transform
    Artifacts exist at medium and low bit rates
  A large number of interpolation taps can also lead to more serious ringing artifacts
    HEVC: 8-tap (luma), 4-tap (chroma)

Sample adaptive offset
  Reduces the sample distortion (reconstructed pixels - original pixels)
  Average 3.5% BD-rate reduction (with 1% encoding time increase, 2.5% decoding time increase)

SAO is located after the deblocking filter (DF) and also belongs to in-loop filtering
Overview of sample adaptive offset (2/2)
SAO features
  Each color component may have its own SAO parameters
  Two SAO types
    Edge offset (EO; 4 EO classes)
    Band offset (BO; 1 BO class)
  SAO merging (left CTU or above CTU)
    SAO merge information is shared by the three color components

SAO objective and subjective results
FIGURE. Subjective comparison: SAO enabled (QP = 32) versus SAO disabled (QP = 32)
Test conditions: anchor: SAO disabled; test: SAO enabled; CTU size in luma: 64x64; CTU boundary: option 1
TABLE. Y BD-rate per class and coding structure ("-" where the class is not listed for that structure)
                 All intra  Random access  Low delay B  Low delay P
                 (AI)       (RA)           (LB)         (LP)
Class summary
  Class A        0.6%       2.3%           -            -
  Class B        0.5%       2.1%           2.0%         11.1%
  Class C        0.5%       1.1%           1.8%         7.1%
  Class D        0.4%       0.3%           0.7%         4.4%
  Class E        0.6%       -              2.3%         11.0%
  Class F        1.5%       2.6%           5.7%         12.3%
Overall summary
  All            0.7%       1.7%           2.5%         9.2%
  Enc. time (%)  101%       100%           100%         100%
  Dec. time (%)  103%       103%           102%         102%
Edge offset of SAO
Four 1D directional patterns
  Horizontal, vertical, 135° diagonal, 45° diagonal

Only one EO class can be selected for each CTB for which EO is enabled
  Each sample inside the CTB is classified into one of five categories
  One edge offset is encoded for each category (4 offsets are transmitted in the case of EO)
  No information is sent for the classification into the five categories (encoder and decoder use the same rules)
FIGURE. Four 1D directional patterns for EO sample classification (current sample c with neighbors a and b along the pattern)
TABLE. Sample classification rules for edge offset
Category  Condition
1         c < a && c < b
2         (c < a && c == b) || (c == a && c < b)
3         (c > a && c == b) || (c == a && c > b)
4         c > a && c > b
0         None of the above (SAO is not applied)
FIGURE. Pixel level versus pixel index (x-1, x, x+1) for categories 1 to 4; categories 1 and 2 take a positive edge offset, categories 3 and 4 a negative edge offset
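The classification table translates directly into code. The sketch below applies it along the horizontal pattern of one row; the function names and the offsets dictionary are illustrative, and border samples without two neighbors are simply left untouched.

```python
def eo_category(a, c, b):
    """Edge-offset category of sample c with neighbors a and b."""
    if c < a and c < b:
        return 1                                   # local minimum
    if (c < a and c == b) or (c == a and c < b):
        return 2                                   # concave corner
    if (c > a and c == b) or (c == a and c > b):
        return 3                                   # convex corner
    if c > a and c > b:
        return 4                                   # local maximum
    return 0                                       # SAO not applied


def apply_edge_offset_row(row, offsets):
    """Apply per-category offsets along the horizontal EO pattern.

    offsets: dict category -> offset for categories 1..4.
    Neighbors are always read from the unmodified input row.
    """
    out = list(row)
    for x in range(1, len(row) - 1):
        cat = eo_category(row[x - 1], row[x], row[x + 1])
        if cat:
            out[x] = row[x] + offsets[cat]
    return out
```

With positive offsets for categories 1 and 2 and negative offsets for categories 3 and 4, the operation smooths local minima and maxima, which is why the signs of EO offsets need not be transmitted.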
Band offset of SAO
BO implies that one offset is added to all samples of the same band
  The sample value range is equally divided into 32 bands
  For 8-bit samples ranging from 0 to 255, the width of a band is 8

Only the offsets of four consecutive bands and the starting band position are signaled to the decoder
  The average difference between the original samples and the reconstructed samples in a band is signaled to the decoder
  Four offsets are transmitted in the case of BO
FIGURE. Sample value range from 0 to max divided into bands; the first band for which an offset is transmitted is marked, and four offsets are transmitted for four consecutive bands
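A minimal sketch of band offset as described above: the band index is the top 5 bits of the sample, and only the four bands starting at the signaled position receive an offset. The function name and argument layout are assumptions for illustration.

```python
def apply_band_offset(samples, band_position, offsets, bit_depth=8):
    """Add offsets[i] to samples falling in band band_position + i."""
    shift = bit_depth - 5              # 32 bands -> band width 8 for 8 bit
    max_val = (1 << bit_depth) - 1
    out = []
    for s in samples:
        idx = (s >> shift) - band_position
        if 0 <= idx < len(offsets):
            s = min(max(s + offsets[idx], 0), max_val)
        out.append(s)
    return out


print(apply_band_offset([0, 64, 70, 72, 95, 96, 255], 8, [1, 2, 3, 4]))
```

Here band 8 covers the values 64 to 71, so the four offsets affect samples from 64 to 95 only.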
SAO syntaxes
Merging information
  sao_merge_left_flag
  sao_merge_up_flag
  No additional data is transferred in the merge case

sao_type_idx_X: type of SAO
  0: not applied
  1: band offset
  2: edge offset

sao_offset_abs, sao_offset_sign
  The sign is signaled only for band offset
  Signs of EO are implicitly derived

sao_band_position (if BO)
  Start band for SAO

sao_eo_class (if EO)
  Class of edge offset (1D direction)
TABLE. Syntax for sample adaptive offset
sao (rx, ry) { Descriptor

sao_merge_left_flag ae(v)
sao_merge_up_flag ae(v)
if(!sao_merge_up_flag && !sao_merge_left_flag) {

sao_type_idx_luma ae(v)
sao_type_idx_chroma ae(v)
if( sao_type_idx != 0) {
for(i=0; i<4; i++)
sao_offset_abs[cIdx][rx][ry][i] ae(v)
if(sao_type_idx == BO) {
for(i=0; i<4; i++) {
if( sao_offset_abs[cIdx][rx][ry][i] != 0)
sao_offset_sign[cIdx][rx][ry][i] ae(v)
}
sao_band_position[cIdx][rx][ry] ae(v)
} else {
sao_eo_class_luma ae(v)
}
}

}
A fast distortion estimation for SAO
Distortions have to be calculated many times
  Let k, s(k), and x(k) be the sample positions, original samples, and pre-SAO samples, respectively, and let C be the set of samples that share one offset
  Distortion between original samples and pre-SAO samples
    $D_{pre} = \sum_{k \in C} (s(k) - x(k))^2$
  Distortion between original samples and post-SAO samples
    $D_{post} = \sum_{k \in C} (s(k) - (x(k) + h))^2$
  h is the offset for the sample set and N is the number of samples in the set; the delta distortion is defined as (N and E need to be calculated only once)
    $\Delta D = D_{post} - D_{pre} = \sum_{k \in C} (h^2 - 2h(s(k) - x(k))) = N h^2 - 2hE$
    $E = \sum_{k \in C} (s(k) - x(k))$
  Delta RD cost
    $\Delta J = \Delta D + \lambda R$
Offset refinement
  The initial offset value h is E/N
  All the numbers between zero and the initial offset are tested in the offset refinement process
FIGURE. Candidate offsets 0, 1, 2, 3, 4, 5, 6, with 6 as the initial offset
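The point of the formula is that N and E are gathered once, after which every candidate offset costs only a few arithmetic operations. The sketch below shows this together with the refinement loop; the `rate_bits` callback is a hypothetical stand-in for the bit cost of coding an offset, and rounding of E/N uses Python's `round`.

```python
def sao_stats(orig, recon):
    """Return (N, E): sample count and sum of (original - pre-SAO)."""
    return len(orig), sum(s - x for s, x in zip(orig, recon))


def delta_distortion(n, e, h):
    """Delta D = N*h^2 - 2*h*E for offset h."""
    return n * h * h - 2 * h * e


def refine_offset(n, e, lam, rate_bits):
    """Test every integer between 0 and the initial offset E/N."""
    h_init = int(round(e / n)) if n else 0
    step = 1 if h_init > 0 else -1
    candidates = range(0, h_init + step, step)
    return min(candidates,
               key=lambda h: delta_distortion(n, e, h) + lam * rate_bits(h))


orig = [12, 13, 11, 13]
recon = [10, 10, 10, 10]
n, e = sao_stats(orig, recon)
print(delta_distortion(n, e, 2), refine_offset(n, e, 10, lambda h: abs(h) + 1))
```

With a non-zero lambda the refinement can settle on an offset smaller than E/N, because the distortion gain of the larger offset no longer pays for its extra bits.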
Encoding flow of SAO in HM
FIGURE. Encoding picture, deblocking filtering, SAO (frame based), then CTB-based processing:
  1) Statistics per CTB (sum of differences E and pixel count N)
     BO: 32 bands
     EO class 0, EO class 1, EO class 2, EO class 3: per category
  2) RD cost per type, rdcost = distortion + λ * rate (using the fast distortion estimation and offset refinement)
     EO class 0 (rdcost_0), EO class 1 (rdcost_1), EO class 2 (rdcost_2), EO class 3 (rdcost_3)
     BO: band position decision, then rdcost_BO
  3) RD cost of the best type (BO, EO class 0, EO class 1, EO class 2, EO class 3)
  4) RD cost of left merge and up merge
Slice-level on/off control of SAO
Hierarchical quantization parameter (QP) settings for each group of pictures

A slice-level on/off decision algorithm
  For depth = 0 pictures, SAO is always enabled in the slice header
  Other depths
    If the previous picture (the last picture of depth N-1 in decoding order) disables SAO for more than 75% of CTBs, the current picture terminates the SAO encoding process early and disables SAO in all slice headers
FIGURE. Hierarchical group of 8 pictures: depth 0: 8k; depth 1: (8k+4); depth 2: (8k+2), (8k+6); depth 3: (8k+1), (8k+3), (8k+5), (8k+7); a larger depth uses a higher QP
CTB-based encoding issues about SAO
Since SAO is applied after the DF, the SAO parameters cannot be precisely estimated until the deblocked samples are available
  In a CTU-based encoder, the deblocked samples of the right columns and the bottom rows in the current CTB may be unavailable

Two practical CTU-based SAO decisions
  Case 1. Avoid using the bottom rows and right columns (current HM)
  Case 2. Use non-deblock-filtered pixels for the bottom rows and right columns (JCTVC-J0139)
FIGURE. CTU with deblock-filtered pixels and non-deblock-filtered pixels (right columns and bottom rows)
TABLE. Average BD-rates of enabling SAO versus disabling SAO for different CTU sizes
  Option 1: skip right and bottom samples in the CTU during parameter estimation
  Option 2: use pre-deblocked samples near right and bottom boundaries in the CTU during parameter estimation
CTU size   Option 1               Option 2
in luma    Y      Cb     Cr       Y      Cb     Cr
64x64      3.5%   4.8%   5.8%     3.3%   5.3%   6.6%
32x32      2.0%   1.1%   1.5%     2.5%   2.0%   2.7%
16x16      0.0%   0.3%   0.3%     0.8%   0.4%   0.1%
COMPLEXITY ANALYSIS OF HEVC ENCODER
Complexity analysis of HM encoder
Test sequences
  Sequences: Class B (1920x1080), Class C (832x480)
    Class B: Kimono, ParkScene, Cactus, BasketballDrive, BQTerrace
    Class C: BasketballDrill, BQMall, PartyScene, RaceHorses
  QP: 22, 27, 32, 37
  Main profile
  Random access, low delay

Test environment
  HM 7.0 software
  Intel Core i7 CPU 860 @ 2.8 GHz
  4 GB memory
  Windows 7 (64 bit)
  Analysis tool: Intel VTune Amplifier XE

FIGURE. Class B BasketballDrive
FIGURE. Class C BQMall
Profiling result of HEVC encoder
TABLE. Complexity ratio of HM 7.0 encoder (RA), in % per QP
Class  Module       QP 22  QP 27  QP 32  QP 37
B      Entropy        6.6    3.4    1.0    0.9
       Intra          3.3    2.2    2.1    1.4
       Inter         68.4   78.1   83.9   85.7
       TR+Q          20.4   15.2   11.7   10.6
       Loop filter    0.2    0.2    0.2    0.1
       etc            1.2    1.1    1.3    1.5
C      Entropy        6.5    3.9    2.8    1.3
       Intra          2.9    2.7    2.2    1.8
       Inter         68.8   74.9   79.8   83.3
       TR+Q          20.7   17.0   13.9   12.4
       Loop filter    0.2    0.2    0.2    0.1
       etc            1.0    1.5    1.4    1.2
TABLE. Complexity ratio of HM 7.0 encoder (LD), in % per QP
Class  Module       QP 22  QP 27  QP 32  QP 37
B      Entropy        6.1    2.8    0.4    0.3
       Intra          3.4    2.0    1.2    1.2
       Inter         71.3   81.2   87.3   89.1
       TR+Q          18.6   13.0    9.9    8.5
       Loop filter    0.2    0.2    0.2    0.1
       etc            0.8    1.2    0.8    0.9
C      Entropy        5.3    3.1    1.1    0.4
       Intra          3.0    2.5    1.8    1.5
       Inter         72.6   79.1   83.5   87.2
       TR+Q          18.2   14.9   12.1   10.1
       Loop filter    0.2    0.2    0.2    0.1
       etc            1.1    0.6    1.6    1.0
Complexity portions of HM encoder
  Inter prediction: 77-81%
  Tr+Q: 14-16%
  Entropy coding: 2-4%
  Intra prediction: 1-2%
  Loop filter: 0.1-0.2%
FIGURE. HEVC encoder block diagram and profiling result (blocks: block structure, intra prediction, inter prediction, transform, quantization, inv. quantization, inv. transform, in-loop filter, entropy coding, picture storage; chart categories: inter prediction, transform + Q, intra prediction, loop filter, entropy coding, etc)
Complexity ratio of CU size and mode

FIGURE. Example of CU quadtree structure (CUs of 32x32, 16x16, and 8x8)

TABLE. Complexity ratio of CU size and mode
Size    Mode    RA (%)  LD (%)  Average (%)
64x64   Intra      2.1     1.0          1.6
64x64   Inter     19.0    31.9         25.5
64x64   Skip       3.9     3.4          3.7
32x32   Intra      1.9     0.7          1.3
32x32   Inter     25.0    27.4         26.2
32x32   Skip       4.5     3.2          3.9
16x16   Intra      2.3     0.2          1.3
16x16   Inter     17.0    12.5         14.8
16x16   Skip       3.2     1.7          2.5
8x8     Intra      2.4     0.4          1.4
8x8     Inter      8.7     4.9          6.8
8x8     Skip       1.7     0.6          1.2
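To give a feel for how much work the CU quadtree implies, the sketch below counts the candidate CUs per size inside one CTU, assuming a 64x64 CTU and an 8x8 minimum CU as in the table above. The function name `cu_candidates` is illustrative and not taken from HM.

```python
def cu_candidates(ctu_size=64, min_cu_size=8):
    """Return {cu_size: number of CU positions of that size inside one CTU}."""
    counts = {}
    size = ctu_size
    while size >= min_cu_size:
        per_side = ctu_size // size          # CUs of this size along one CTU edge
        counts[size] = per_side * per_side   # quadtree level is a per_side x per_side grid
        size //= 2                           # next depth halves the CU edge
    return counts


counts = cu_candidates()
print(counts)                 # {64: 1, 32: 4, 16: 16, 8: 64}
print(sum(counts.values()))   # 85 candidate CUs per CTU
```

An exhaustive rate-distortion search visits all 85 candidates, each with several PU modes, which helps explain why mode decision dominates the encoder profile.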
Selected ratio of CU, PU and TU

TABLE. Selected ratio of CU size and PU mode (%)
                          Class B                     Class C
CU size  PU mode          QP 22  QP 27  QP 32  QP 37  QP 22  QP 27  QP 32  QP 37
64x64    Merge skip        10.6   26.6   43.3   55.2   11.7   20.6   30.6   39.5
64x64    Inter 2Nx2N        4.5    7.1    7.2    6.0    5.8    7.5    6.7    5.5
64x64    Inter Nx2N         1.4    2.2    1.8    1.3    1.6    1.8    1.7    1.7
64x64    Inter 2NxN         1.5    1.9    1.3    0.9    1.2    1.0    0.8    0.7
64x64    Inter AMP          1.2    1.4    1.0    0.7    1.0    1.1    1.0    1.1
64x64    Intra 2Nx2N        0.3    0.4    0.6    1.0    0.0    0.0    0.0    0.1
32x32    Merge skip         9.9   12.4   19.9    8.4   12.2   13.5   15.2   16.8
32x32    Inter 2Nx2N        8.1    6.9    4.6    3.1    9.1    7.2    5.4    4.3
32x32    Inter Nx2N         1.8    1.4    0.9    0.4    2.2    1.9    1.9    1.7
32x32    Inter 2NxN         1.7    1.3    0.7    1.0    1.4    1.0    0.9    0.8
32x32    Inter AMP          4.4    2.9    1.6    0.6    4.2    3.5    3.1    2.6
32x32    Intra 2Nx2N        2.3    2.3    2.6    2.6    0.2    0.4    0.7    1.1
16x16    Merge skip         6.8    5.6    3.9    2.9    8.0    7.7    7.3    6.1
16x16    Inter 2Nx2N        9.1    3.7    1.7    0.8    6.9    4.8    3.1    2.0
16x16    Inter Nx2N         1.6    0.7    0.3    0.1    2.0    1.4    1.0    0.6
16x16    Inter 2NxN         1.7    0.6    0.2    0.1    1.2    0.8    0.5    0.3
16x16    Inter AMP          4.1    1.4    0.5    0.2    4.1    2.7    1.7    0.9
16x16    Intra 2Nx2N        2.6    2.1    1.7    1.4    1.2    1.6    1.8    1.7
8x8      Merge skip         2.8    1.9    1.2    0.9    3.9    3.3    2.3    1.4
8x8      Inter 2Nx2N        5.8    1.3    0.4    0.1    4.9    2.5    1.1    0.4
8x8      Inter Nx2N         0.3    0.2    0.1    0.0    1.2    0.7    0.3    0.1
8x8      Inter 2NxN         0.4    0.2    0.1    0.0    0.7    0.4    0.2    0.1
8x8      Intra 2Nx2N        2.9    1.2    0.1    0.5    2.1    1.7    1.2    0.8
8x8      Intra NxN          0.8    0.6    0.7    0.2    1.9    1.1    0.6    0.3

TABLE. Selected ratio of TU (%)
Class  Size    QP 22  QP 27  QP 32  QP 37
B      32x32    33.5   55.0   63.0   65.7
B      16x16    19.8   20.9   20.1   19.7
B      8x8      36.2   15.5   10.7   10.0
B      4x4      10.5    8.5    6.2    4.5
C      32x32    35.7   43.4   49.2   52.2
C      16x16    27.7   27.7   27.5   29.0
C      8x8      21.7   18.1   15.8   13.9
C      4x4      14.8   10.8    7.5    4.9
BD-BR vs. encoding time depending on CTU size
CTU size: 32x32
  3.3~3.4% BD-bitrate
  78~79% encoding time
CTU size: 16x16
  15.4~17.5% BD-bitrate
  50~54% encoding time

FIGURE. BD-bitrate vs. encoding time (EncT) for each CTU size, with CTU size 64x64 as the reference. Two data series are plotted:
  Series 1: CTU size 16x16: EncT 50.8%, BD-bitrate 17.53%; CTU size 32x32: EncT 79.22%, BD-bitrate 3.31%
  Series 2: CTU size 16x16: EncT 54.7%, BD-bitrate 15.43%; CTU size 32x32: EncT 78.92%, BD-bitrate 3.43%
SW: HM 7.1
Seq: Class B
cfg: Random access & low delay
BD-BR vs. encoding time depending on TU size
Transform size
16x16 to 4x4 on case
  3.2~3.5% BD-bitrate
  96% encoding time
8x8 to 4x4 on case
  10.2~11.2% BD-bitrate
  91~92% encoding time

FIGURE. BD-bitrate vs. encoding time (EncT) for each maximum TU size, with max TU size 32x32 (quadtree max depth 3) as the reference. Two data points are plotted per setting:
  Max TU size 8x8, quadtree max depth 1: EncT 92.4%, BD-bitrate 11.2%; EncT 91.4%, BD-bitrate 10.24%
  Max TU size 16x16, quadtree max depth 2: EncT 96.8%, BD-bitrate 3.2%; EncT 96.5%, BD-bitrate 3.5%
SW: HM 7.1
Seq: Class B
cfg: Random access & low delay
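One way to compare these restrictions is to ask how much BD-bitrate is paid for each percentage point of encoding time saved. The sketch below computes that ratio from the CTU-size and TU-size data points reported in this section. The metric and the names `POINTS` and `loss_per_point_saved` are assumptions introduced here for illustration, not part of the original analysis.

```python
# (encoding time %, BD-bitrate %) relative to the unrestricted reference,
# two measurements per setting, copied from the CTU-size and TU-size results.
POINTS = {
    "CTU 32x32": [(79.22, 3.31), (78.92, 3.43)],
    "CTU 16x16": [(50.8, 17.53), (54.7, 15.43)],
    "Max TU 16x16": [(96.8, 3.2), (96.5, 3.5)],
    "Max TU 8x8": [(92.4, 11.2), (91.4, 10.24)],
}


def loss_per_point_saved(enc_time, bd_rate):
    """BD-bitrate increase (%) per percentage point of encoding time saved."""
    return bd_rate / (100.0 - enc_time)


ratio = {}
for setting, measurements in POINTS.items():
    values = [loss_per_point_saved(t, bd) for t, bd in measurements]
    ratio[setting] = sum(values) / len(values)
    print("%-13s %.2f" % (setting, ratio[setting]))
```

Under this measure, shrinking the CTU costs far less bitrate per point of time saved than shrinking the maximum TU size, since the TU restrictions save only a few percent of encoding time.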
Tool on/off test
Fast encoding algorithms in HM software

TABLE. Fast encoding algorithms in HM software
Fast Encoding Setting (FEN, JCTVC-A0124)
  Contents: early CU termination; subsampled SAD operation; simple bi-prediction (the number of iterations 4 -> 1)
Fast Decision for Merge RD Cost (FDM, JCTVC-H178)
  Contents: 2Nx2N merge CBF early termination
  Note: PU level
Rough Mode Decision (for intra) (RMD, JCTVC-C311/D283)
  Contents: 35 intra mode SATD RD, RD RD, full RQT
  Note: PU level
AMP Speedup (AMPS, JCTVC-E316)
  Contents: AMP ME or merge
  Note: PU level
CBF Fast Mode Setting (CFM, JCTVC-F045)
  Contents: PU CBF 0, PU ME
  Note: PU level
Early CU Setting (ECU, JCTVC-F092)
  Contents: CU skip, CU
  Note: CU level
Early Skip Detection Setting (ESD, JCTVC-G543)
  Contents: inter 2Nx2N early skip detection
  Note: CU level
PARALLEL TOOLS IN HEVC
Tile (JCTVC-E408, JCTVC-F335)
Motivation and goal of tile
  Picture partitioning introduces coding loss
  Why partition a picture?
    High-level parallel processing
    Maximum transmission unit (MTU of the network) size matching
    Motion estimation with constrained on-chip memory
  Goal: reduce coding loss due to partitioning

Vertical and horizontal boundaries partition a picture into tiles
  Boundary locations may be specified individually or uniformly spaced
  Always rectangular with an integer number of CTUs
FIGURE. Picture split into four tiles (Tile #1 to Tile #4), each processed by its own core (Core 1 to Core 4)
Tile & slice
Tiles are always rectangular and always contain an integer number of coding tree blocks in coding tree block raster scan

A parallel implementation with the tile tool has higher compression efficiency than one with slices.

FIGURE. Slice mode (slice 1 to slice 4 on Core 1 to Core 4) and tile mode (Tile #1 to Tile #4 on Core 1 to Core 4)

TABLE. Slice mode and tile mode vs. anchor
         Slice mode                  Tile mode
Seq.     BDR   BDR low  BDR high     BDR   BDR low  BDR high
class D   3.9      6.1       2.4      2.7      3.6       2.0
class C   4.5      6.6       2.8      3.3      4.4       2.4
class B   5.3      7.6       3.5      4.0      5.1       3.2
class E  13.4     21.6       6.1      6.3      9.1       3.6
Avg.      6.2      9.6       3.6      3.9      5.3       2.8

TABLE. Slice mode vs. tile mode
Seq.     BDR   BDR low  BDR high
class D   1.2      2.3       0.4
class C   1.1      2.1       0.4
class B   1.2      2.3       0.4
class E   6.3     10.3       2.4
Avg.      2.1      3.8       0.8
Relationship between slices and tiles
Constraints are set on the relationship between slices and tiles.
  All CTBs in a slice belong to the same tile.
  or
  All CTBs in a tile belong to the same slice.
FIGURE. Example 1 and Example 2: pictures divided into Slice #0 and Slice #1, with two tiles (Tile #0, Tile #1) and with four tiles (Tile #0 to Tile #3)
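The constraint can be checked mechanically once every CTB is labeled with its slice and its tile. The sketch below is one reading of the rule stated above: for every slice and tile that overlap, one must lie entirely inside the other. The function name and the per-CTB list representation are assumptions made for illustration.

```python
from collections import defaultdict


def slices_and_tiles_compatible(slice_of_ctb, tile_of_ctb):
    """Check the slice/tile constraint for one picture.

    slice_of_ctb[i] and tile_of_ctb[i] are the slice and tile index of CTB i.
    For every slice and tile sharing at least one CTB, either the slice lies
    entirely inside the tile or the tile lies entirely inside the slice.
    """
    ctbs_in_slice = defaultdict(set)
    ctbs_in_tile = defaultdict(set)
    for ctb, (s, t) in enumerate(zip(slice_of_ctb, tile_of_ctb)):
        ctbs_in_slice[s].add(ctb)
        ctbs_in_tile[t].add(ctb)
    for s_ctbs in ctbs_in_slice.values():
        for t_ctbs in ctbs_in_tile.values():
            overlap = s_ctbs & t_ctbs
            if overlap and not (s_ctbs <= t_ctbs or t_ctbs <= s_ctbs):
                return False
    return True


# Picture of 4x2 CTBs (raster order) with two side-by-side tiles.
tiles = [0, 0, 1, 1,
         0, 0, 1, 1]
print(slices_and_tiles_compatible([0, 0, 1, 1, 0, 0, 1, 1], tiles))  # one slice per tile: True
print(slices_and_tiles_compatible([0, 0, 0, 0, 1, 1, 1, 1], tiles))  # raster-row slices cut both tiles: False
```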
Syntax for tile
Syntax elements for tile

TABLE. Syntax for tile in PPS
pic_parameter_set_rbsp( ) {                                  Descriptor
  ...
  tiles_enabled_flag                                         u(1)
  ...
  if( tiles_enabled_flag ) {
    num_tile_columns_minus1                                  ue(v)
    num_tile_rows_minus1                                     ue(v)
    uniform_spacing_flag                                     u(1)
    if( !uniform_spacing_flag ) {
      for( i = 0; i < num_tile_columns_minus1; i++ )
        column_width_minus1[ i ]                             ue(v)
      for( i = 0; i < num_tile_rows_minus1; i++ )
        row_height_minus1[ i ]                               ue(v)
    }
    loop_filter_across_tiles_enabled_flag                    u(1)
  }
  ...
}

uniform_spacing_flag equal to 1 specifies that column boundaries and likewise row boundaries are distributed uniformly across the picture.

FIGURE. Picture split into six tiles (Tile #0 to Tile #5), annotated with num_tile_columns_minus1, num_tile_rows_minus1, column_width_minus1[0], and row_height_minus1[2]
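When uniform_spacing_flag is 1, the tile sizes are derived rather than signaled. The sketch below uses the integer-division rule commonly given for uniform spacing in HEVC, where each boundary sits at floor(i * size / count) CTBs. Treat the helper name `uniform_tile_sizes` and the example picture dimensions as assumptions made for illustration.

```python
def uniform_tile_sizes(pic_size_in_ctbs, num_tiles_minus1):
    """Split one picture dimension (in CTBs) into uniformly spaced tiles.

    Boundaries are placed at floor(i * size / n), so tile sizes differ by at
    most one CTB and always sum to the picture size.
    """
    n = num_tiles_minus1 + 1
    return [((i + 1) * pic_size_in_ctbs) // n - (i * pic_size_in_ctbs) // n
            for i in range(n)]


# A 1920x1080 picture with 64x64 CTBs is 30 CTBs wide and 17 CTBs high.
print(uniform_tile_sizes(30, 2))  # three tile columns: [10, 10, 10]
print(uniform_tile_sizes(17, 1))  # two tile rows: [8, 9]
```

The same function serves both for column widths (called with the picture width in CTBs and num_tile_columns_minus1) and for row heights (called with the picture height in CTBs and num_tile_rows_minus1).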

Wavefront parallel processing (WPP)
Method to perform high-level parallel video encoding and decoding
Advantages
  Coding loss from parallelization is relatively small compared to other parallelization methods (WPP vs. slice or tile: 0.8~1% gain)
  Fast decoding and very low delay with the dependent slice
  For large sequences (Classes A, B, E), the acceleration factor is 1.8x with two threads and 3x with four threads

Disadvantages
  Because of the parallelization steps (prologue, kernel, epilogue), the number of active cores is limited

FIGURE. Active cores over time across the prologue, kernel, and epilogue phases, with the average number of active cores marked
FIGURE. Single-thread processing of HEVC encoding/decoding
FIGURE. Parallel processing without breaking pixel/MV dependencies (ideal case)
Wavefront parallel processing (WPP)
In HEVC, CABAC is the only entropy coder
  Probabilities are not available at the first CTU of a line
  If we reinitialize at the beginning of each CTU line, performance degrades

Synchronize CABAC probabilities with the upper-right CTU (JCTVC-E196, JCTVC-F274)
  The upper-right CTU is available, given the spatial/MV dependency
  Efficiently carries over quick vertical learning of probabilities

FIGURE. Wavefront processing of HEVC CTUs (problem): probabilities not available!
FIGURE. Wavefront processing of HEVC CTUs (WPP)
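The prologue, kernel, and epilogue phases follow directly from the upper-right dependency. The sketch below assumes an idealized model in which every CTU takes one time step, one core is available per CTU row, and the picture is at least two CTUs wide. The function names are illustrative.

```python
def wpp_schedule(width_in_ctus, height_in_ctus):
    """Earliest start step of every CTU, one time step per CTU (idealized).

    CTU (row, col) waits for its left neighbour and for the upper-right CTU
    of the row above, so each row starts two steps after the previous one.
    """
    return [[col + 2 * row for col in range(width_in_ctus)]
            for row in range(height_in_ctus)]


def active_cores_per_step(width_in_ctus, height_in_ctus):
    """Number of CTUs (one per row, hence cores) being processed at each step."""
    total_steps = width_in_ctus + 2 * (height_in_ctus - 1)
    active = [0] * total_steps
    for row in wpp_schedule(width_in_ctus, height_in_ctus):
        for step in row:
            active[step] += 1
    return active


active = active_cores_per_step(4, 3)
print(active)                     # [1, 1, 2, 2, 2, 2, 1, 1]
print(sum(active) / len(active))  # 1.5 active cores on average
```

For this 4x3 example the ramp-up (prologue) and ramp-down (epilogue) steps keep the average at 1.5 active cores even though three rows are available, which is the limitation noted under the disadvantages of WPP.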
Bitstream structure for WPP & tiles
The bitstream contains multiple sub-bitstreams (subsets)

FIGURE. Bitstream structure: SPS, PPS, slice layer RBSP (slice header + slice data); the slice data is divided into Subset 0, Subset 1, Subset 2
  num_entry_point_offsets = 3
  offset_len_minus1 (0 to 31) = 8
  entry_point_offset[ i ] is therefore represented by 9 bits

TABLE. Syntax for subsets in slice header
slice_header( ) {                                            Descriptor
  ...
  if( tiles_enabled_flag || entropy_coding_sync_enabled_flag ) {
    num_entry_point_offsets                                  ue(v)
    if( num_entry_point_offsets > 0 ) {
      offset_len_minus1                                      ue(v)
      for( i = 0; i < num_entry_point_offsets; i++ )
        entry_point_offset[ i ]                              u(v)
    }
  }
  ...
}

Only one of the WPP or tile tools can be used for the Main profile.
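The entry-point fields can be derived from the sizes of the coded subsets. The sketch below rests on an assumption that the slide does not spell out: each entry_point_offset[ i ] is taken as the byte length of one subset, and the bytes after the last offset form a final subset. The function names and the example sizes are hypothetical.

```python
def entry_point_fields(subset_sizes):
    """Return (num_entry_point_offsets, offset_len_minus1, offsets) for one slice.

    Assumption for this sketch: one offset per subset except the last one,
    each offset being that subset's size in bytes.
    """
    offsets = list(subset_sizes[:-1])
    if not offsets:
        return 0, None, []
    # Every offset is coded with offset_len_minus1 + 1 bits, so size the
    # field for the largest offset.
    offset_len_minus1 = max(max(offsets).bit_length(), 1) - 1
    return len(offsets), offset_len_minus1, offsets


def subset_ranges(offsets, slice_data_len):
    """Byte range [start, end) of every subset inside the slice data."""
    ranges, start = [], 0
    for size in offsets:
        ranges.append((start, start + size))
        start += size
    ranges.append((start, slice_data_len))
    return ranges


sizes = [300, 412, 275, 190]      # hypothetical subset sizes in bytes
num, len_minus1, offsets = entry_point_fields(sizes)
print(num, len_minus1)            # 3 offsets, offset_len_minus1 = 8 (9 bits each)
print(subset_ranges(offsets, sum(sizes)))
```

With these ranges a decoder can hand each subset to a separate thread without first parsing the preceding ones.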
Case 1: multiple tiles in a slice
  Slice layer RBSP bitstream consists of sub-bitstreams for decoding
  According to the tile partition, the CTU decoding order can be decided
FIGURE. Relationship between slices and tiles (example): Slice #0 and Slice #1 over Tile #0 to Tile #3
FIGURE. Bitstream structure: SPS, PPS, Slice 0 (slice header + slice data for Tile 0 and Tile 1) + Slice 1 (slice header + slice data for Tile 2 and Tile 3)
Case 2: multiple slices in a tile
  According to the tile partition, the CTU decoding order can be decided
  Slice data includes start CTU address information
FIGURE. Relationship between slices and tiles (example): Slice #0, Slice #1, and Slice #2 over Tile #0 and Tile #1
FIGURE. Bitstream structure: SPS, PPS, Slice 0 (slice header + slice data (Tile 0)) + Slice 1 (slice header + slice data (Tile 0)) + Slice 2 (slice header + slice data (Tile 1))
Conventional encoding procedure
Conventional encoding process
  Pixel encoding can be processed as a 2D wavefront
  But entropy encoding is a sequential process
  It requires a large amount of memory for syntax elements
  Generally, the encoding process follows raster scan order because of the entropy coder
FIGURE. Pixel encoding produces syntax elements per CTB (CTB 0, CTB 1, CTB 3, ..., CTB k1, CTB k2, CTB k3); entropy encoding then writes them to the bitstream
Tile encoding & decoding procedure
Tile encoding/decoding process
  Each tile has no dependency on other tiles for pixel coding and entropy coding
  All tiles can be encoded/decoded at the same time
FIGURE. Tile encoding: pixel encoding and entropy encoding of Subset 0 to Subset 3, tile boundary processing (DB, SAO), then bitstream transmission
FIGURE. Tile decoding: entropy decoding and pixel decoding of Subset 0 to Subset 3, then tile boundary processing (DB, SAO)
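Because tiles are independent, the per-tile work can be dispatched to a pool of workers and the resulting subsets concatenated in tile order. The sketch below only illustrates that control flow: `encode_tile` is a placeholder rather than a real encoder, the tile data are arbitrary sample lists, and the tile boundary processing step (DB, SAO) is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_tile(tile_samples):
    """Placeholder for per-tile pixel + entropy encoding (not a real codec).

    It reads only its own tile, mirroring the lack of inter-tile dependency.
    """
    return bytes(sample & 0xFF for sample in tile_samples)


def encode_picture(tiles, num_workers=4):
    """Encode all tiles concurrently; return (slice data, subset sizes)."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() yields results in input order, so subsets stay in tile order
        # even if the workers finish out of order.
        subsets = list(pool.map(encode_tile, tiles))
    sizes = [len(subset) for subset in subsets]
    return b"".join(subsets), sizes


tiles = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]   # four hypothetical tiles
slice_data, sizes = encode_picture(tiles)
print(sizes)             # [3, 2, 1, 4]
print(list(slice_data))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

In CPython, threads do not speed up CPU-bound work, so a real implementation would use processes or native threads. The subset sizes are what an encoder would use to signal entry points in the slice header.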
WPP encoding & decoding procedure
WPP encoding/decoding process
  Pixel coding can be a 2D wavefront process
  Entropy coding can also be a 2D wavefront process by using a context synchronization scheme
FIGURE. WPP encoding: pixel encoding and entropy encoding of Subset 0 to Subset 3, with a sync point between consecutive subsets, then bitstream transmission
FIGURE. WPP decoding: entropy decoding and pixel decoding of Subset 0 to Subset 3, with a sync point between consecutive subsets
Conclusion
Overview of HEVC
Encoding parameters for HEVC test model (HM)
Complexity analysis of HEVC encoder
Fast encoding algorithms and performances
Issues of parallel processing
