
HEVC Encoder

Kwangwoon University (KWU)
Donggyu Sim (dgsim@kw.ac.kr)

Contents

- Overview of HEVC
- Encoding issues for HEVC test model (HM)
- Complexity analysis of HEVC encoder
- Fast encoding algorithms and performances
- Issues of parallel processing
- Conclusion

OVERVIEW OF HEVC

Block diagram of HEVC standard

Typical block-based hybrid codec structure plus additional enhanced tools:
- Transform: TU sizes 32×32 to 4×4, residual quadtree
- Quantization: delta QP, RDOQ
- Entropy coding: CABAC
- Inter prediction: ME, MC, DCTIF, AMVP, merge, picture buffer (reference frames F(n-1), F(n-2))
- Intra prediction: reference sample padding, MDIS, planar, DC, 33 angular modes
- Inverse transform and inverse quantization
- Loop filter: deblocking filter, sample adaptive offset

FIGURE. Block diagram of HEVC encoder

Block structure in HEVC

Three block structures are defined in HEVC:
- Coding unit (CU)
- Prediction unit (PU)
- Transform unit (TU)

FIGURE. An example of CU, PU, and TU partition in HEVC (a 64×64 CTU split into 32×32, 16×16, and 8×8 CUs; PU partitions 2N×2N, 2N×N, N×2N, N×N, 2N×nU, 2N×nD, nL×2N, nR×2N; TU depths 0 to 2 starting from the 64×64 CTU)

ENCODING STRUCTURES OF HEVC

Decision level for HEVC encoder

Hierarchy: Sequence > Picture > Slice or Tile > CTU > CU > PU & TU

- Sequence level
  - Coding structure (all intra, low delay, random access)
  - Profile, tier, level
  - Max/min CTU size, CU depth
  - Max/min TU size, TU depth
  - Tool on/off (SAO, deblocking, WPP, tile)
- Picture level
  - Number of reference frames, rate control
  - Tile, slice
- Slice or tile level
  - Reference frames
  - Deblocking filter parameters
- CTU level
  - CU partitioning
  - Sample adaptive offset parameters
- CU level
  - PU and TU partitioning
- PU & TU level
  - Prediction modes, motion vectors
  - cbf, coefficients

Temporal prediction structure (1/3)

All intra (AI)

- Every picture is coded as an instantaneous decoding refresh (IDR) picture
- No temporal prediction is allowed

FIGURE. All-intra structure over time: coding order equals POC, and every picture is an IDR picture coded with QP_I

Temporal prediction structure (2/3)

Low delay (LD)

- The first picture shall be coded as an IDR picture
- Generalized P and B (GPB) pictures shall be used for the other successive pictures
  - A GPB picture shall use only reference pictures whose POC is smaller than that of the current picture (all reference pictures in List_0 and List_1 shall be temporally previous in display order relative to the current picture)
- The QP of each inter-coded picture shall be derived by adding an offset to the QP of the intra-coded picture, depending on the temporal layer
  - QP_BL1 = QP_I + 1
  - QP_BL2 = QP_I + 2
  - QP_BL3 = QP_I + 3

FIGURE. Low-delay structure over time: coding order equals POC; an IDR or intra picture coded with QP_I is followed by GPB pictures at depths 0 to 2 coded with QP_BL1, QP_BL2, and QP_BL3

Temporal prediction structure (3/3)

Random access (RA)

- A hierarchical B structure shall be used for coding
- An IDR intra picture or a clean random access (CRA) picture shall be inserted cyclically, about once per second, as a random access point
- The QP of each inter-coded picture shall be derived by adding an offset to the QP of the intra-coded picture, depending on the temporal layer
  - QP_BL1 = QP_I + 1
  - QP_BL2 = QP_I + 2
  - QP_BL3 = QP_I + 3
  - QP_BL4 = QP_I + 4

FIGURE. Random-access structure over time: coding order differs from POC; an IDR or intra picture coded with QP_I, a GPB picture, referenced B pictures, and non-referenced B pictures at depths 0 to 3

Picture partitioning

- Picture: a picture contains an array of luma samples in monochrome format, or an array of luma samples and two corresponding arrays of chroma samples in 4:2:0, 4:2:2, or 4:4:4 colour format
  - The coding order of coding tree units (CTUs) is raster scan order
- CTU: an N×N block of luma samples together with the two corresponding blocks of chroma samples
  - Analogous to the macroblock in previous standards
  - The maximum allowed size of the luma block in a CTU is specified to be 64×64 in the Main profile
- CTU and CTB: the CTU consists of a luma coding tree block (CTB), the corresponding chroma CTBs, and syntax elements

FIGURE. Example of a picture divided into CTUs
Example) Class B (1920×1080) BQTerrace
CTU size: 64×64
30×17 CTU partition
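
The 30×17 figure in the example above follows from rounding the picture dimensions up to whole CTUs. A minimal sketch of that arithmetic (the function name `ctu_grid` is only illustrative, not an HM function):

```python
import math

def ctu_grid(width, height, ctu_size=64):
    """Return (columns, rows) of CTUs covering a picture.

    Partial CTUs at the right and bottom borders each count as one CTU.
    """
    return math.ceil(width / ctu_size), math.ceil(height / ctu_size)

cols, rows = ctu_grid(1920, 1080, 64)
print(cols, rows, cols * rows)  # 30 17 510
```

The bottom CTU row covers only 1080 - 16 × 64 = 56 luma rows, which is why the height rounds up to 17.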

Picture partitioning

- A slice is a sequence of coding tree units (CTUs)
- Unlike slices, tiles are always rectangular and always contain an integer number of coding tree units in coding tree unit raster scan
- At least one of the following conditions should be true for each slice and tile in a picture
  - All CTBs in a slice belong to the same tile, or all CTBs in a tile belong to the same slice

FIGURE. A picture with 30×17 coding tree units that is partitioned into three slices

FIGURE. A picture with 30×17 coding tree units that is partitioned into three tiles

Coding unit (CU) and coding tree structure

- Coding unit (CU): the leaf node of a quadtree structure
  - Square blocks
  - Size: from 8×8 up to the size of the CTU (8×8 to 64×64)
  - The size of the CTU is specified in the sequence parameter set (SPS)
- The quadtree partitioning structure allows recursive splitting into four equally sized nodes

TABLE. Syntax for size of CTU in SPS
seq_parameter_set_rbsp() {                            Descriptor
  log2_min_coding_block_size_minus3                   ue(v)
  log2_diff_max_min_coding_block_size_minus2          ue(v)

FIGURE. Example of coding tree structure (a 64×64 CTU split into 32×32, 16×16, and 8×8 CUs)

Example of CU quadtree structure

- Coding unit quadtree structure
  - Starting from the CTU, each CU can be split into 4 smaller CUs

TABLE. Syntax for CU split flag in coding tree
coding_tree(x0, y0, log2CbSize, ctDepth) {                                        Descriptor
  if (x0 + (1 << log2CbSize) <= pic_width_in_luma_samples &&
      y0 + (1 << log2CbSize) <= pic_height_in_luma_samples &&
      MinCbAddrZS[x0 >> Log2MinCbSize][y0 >> Log2MinCbSize] >= SliceCbAddrZS &&
      log2CbSize > log2MinCbSize && NumPCMBlock == 0)
    split_coding_unit_flag[x0][y0]                                                ae(v)

Signalling example:
64 CU: split_coding_unit_flag (1)
  32 CU: split_coding_unit_flag (0)
  32 CU: split_coding_unit_flag (1)
    16 CU: split_coding_unit_flag (0)
    16 CU: split_coding_unit_flag (0)
    16 CU: split_coding_unit_flag (1)

FIGURE. Example of CU quadtree structure

Coding unit (CU) decision

- Coding unit quadtree structure
  - Starting from the CTU, each CU can be split into 4 smaller CUs
- The best CU RD cost is calculated for each CU level (CU sizes 64×64, 32×32, 16×16, 8×8)
- The best CU at a level competes with its sub-partitioned CUs

FIGURE. Evaluation order of the CUs in a 64×64 CTU, from the 64×64 CU down to the 8×8 CUs
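
The competition between a CU and its four sub-partitioned CUs can be sketched as a recursion. This is only an illustration under stated assumptions: `rd_cost` is a hypothetical callback standing in for the best unsplit RD cost of a CU over all its modes, and the bits for the split flag are ignored. It is not the HM implementation.

```python
def best_cu_cost(x, y, size, min_size, rd_cost):
    """Return (cost, tree) for the CU at (x, y).

    tree is the CU size when the CU is kept whole, or a list of four
    sub-trees when splitting gives a lower total RD cost.
    rd_cost(x, y, size) is a hypothetical callback (an assumption).
    """
    unsplit = rd_cost(x, y, size)
    if size <= min_size:              # smallest CU: no further split
        return unsplit, size
    half = size // 2
    split_cost, children = 0, []
    for dy in (0, half):              # visit the four sub-CUs
        for dx in (0, half):
            cost, tree = best_cu_cost(x + dx, y + dy, half, min_size, rd_cost)
            split_cost += cost
            children.append(tree)
    if split_cost < unsplit:          # sub-partitioned CUs win
        return split_cost, children
    return unsplit, size

# Toy cost that grows faster than the block area, so splitting pays off.
cost, tree = best_cu_cost(0, 0, 64, 32, lambda x, y, s: s ** 3)
print(cost, tree)  # 131072 [32, 32, 32, 32]
```

Every level of the tree is evaluated before the comparison, which is why the exhaustive search is expensive and why the early-termination tools listed later in the deck matter.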

Prediction unit (PU) types

- Prediction unit (PU): a region used for carrying the information related to the prediction processes
- 2 PU types for intra prediction
  - 2N×2N (smallest CU: additionally N×N)
- 8 PU types for inter prediction
  - Smallest CU:
    - 8×8: 2N×2N, N×2N, 2N×N
    - Others: 2N×2N, N×2N, 2N×N, N×N
  - Others: 2N×2N, N×2N, 2N×N, nL×2N, nR×2N, 2N×nU, 2N×nD

FIGURE. PU partitions in HEVC (2N×2N, N×2N, 2N×N, N×N, nL×2N, nR×2N, 2N×nU, 2N×nD)

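Prediction unit (PU) types

Inputs: current CU size, SCU size, AMP enable flag

- Current CU size == SCU size? No:
  - AMP enable flag? No: Intra 2N×2N, Inter 2N×2N, Inter 2N×N, Inter N×2N
  - AMP enable flag? Yes: Intra 2N×2N, Inter 2N×2N, Inter 2N×N, Inter N×2N, Inter AMP
- Current CU size == SCU size? Yes:
  - Current CU size == 8×8? Yes: Intra 2N×2N, Intra N×N, Inter 2N×2N, Inter 2N×N, Inter N×2N
  - Current CU size == 8×8? No: Intra 2N×2N, Intra N×N, Inter 2N×2N, Inter 2N×N, Inter N×2N, Inter N×N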
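
The decision flow above maps directly onto a small function. The sketch below is a reading of the flowchart, not HM code; the function name and the mode strings are illustrative.

```python
def pu_candidates(cu_size, scu_size, amp_enabled):
    """List the PU modes tested for a CU, following the flowchart above."""
    modes = ["Intra 2Nx2N", "Inter 2Nx2N", "Inter 2NxN", "Inter Nx2N"]
    if cu_size == scu_size:
        modes.append("Intra NxN")        # only the smallest CU may use NxN
        if cu_size != 8:
            modes.append("Inter NxN")    # not tested for an 8x8 CU
    elif amp_enabled:
        modes.append("Inter AMP")        # nLx2N, nRx2N, 2NxnU, 2NxnD
    return modes

print(pu_candidates(8, 8, True))
print(pu_candidates(32, 8, True))
```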

Transform unit (TU) and transform tree structure

- Transform unit (TU): a region sharing the transform and quantization processes
  - Square shape
  - Size: from 4×4 up to 32×32
  - Available transform block sizes and the maximum transform hierarchy depth are specified in the SPS
- The root of the TU quadtree is the CU to which the TU belongs

TABLE. Syntax for size of TU in SPS
seq_parameter_set_rbsp() {                            Descriptor
  log2_min_transform_block_size_minus2                ue(v)
  log2_diff_max_min_transform_block_size              ue(v)
  max_transform_hierarchy_depth_inter                 ue(v)
  max_transform_hierarchy_depth_intra                 ue(v)

FIGURE. TU quadtree structure in HEVC (TU sizes from 32×32 down to 4×4)

INTER/INTRA PREDICTION AND PU/TU DECISION

Overall HM encoding process

- Sequence
  - Picture
    - CTU decisions in a slice or a tile (RDO process)
      - CU partitioning decision
      - PU & TU partitioning decision
    - Deblocking filter
    - SAO
    - Entropy coding

RDO process to decide PU & TU

- CompressCU is called for the 64×64 CTU and then recursively for its 32×32, 16×16, and 8×8 sub-CUs
- Modes evaluated for each CU: merge skip, Inter 2N×2N, Inter N×N, Inter AMP, Inter 2N×N, Inter N×2N, Intra 2N×2N, Intra N×N, Intra PCM
- Has the CU size reached the SCU size? Yes: finish. No: call CompressCU for each of the four sub-CUs

Intra prediction flow

- Prediction modes
  - Luma (35 modes): planar, DC, angular prediction (33 directions)
  - Chroma (5 modes): planar, DC, vertical, horizontal, DM
- Filtering
  - MDIS (mode-dependent intra smoothing)
  - DC filtering, vertical/horizontal filtering
- Flow for a 2N×2N PU: reference sample padding, MDIS, intra prediction of each mode (with 3 MPMs) while mode < 35, best mode decision, output of RD cost and intra mode
- Cost functions
  - $J_{pred,SATD} = SATD + \lambda_{pred} \cdot B_{pred}$
  - $J_{mode} = SSE + \lambda_{mode} \cdot B_{mode}$

FIGURE. Directions and modes of HEVC intra prediction (0: Intra_Planar, 1: Intra_DC, 2 to 34: angular)

FIGURE. Flowchart: intra prediction

Fast intra prediction & TU decision in HM

Intra prediction steps in HM

1) Rough prediction mode decision
   - Prediction with all 35 modes; cost: distortion (SATD) + lambda × mode bits
   - Select N prediction modes
   - Number of candidate prediction modes: N modes + MPM (3)
2) Best intra prediction mode decision with transform
   - Transform (RQT depth = 1)
   - 1 best intra mode decision
3) Best RQT decision with RD costs
   - RQT depth = 3

Flow: 35 modes, then N modes + MPM, then 1 best mode, then best mode RD cost

Inter prediction

- Skip: merge skip
  - Spatial candidates derivation
  - Temporal candidate derivation
  - Additional candidates derivation
- Non-skip: Inter 2N×2N, Inter 2N×N, Inter N×2N, AMP (nL×2N, nR×2N, 2N×nU, 2N×nD)
  - Unidirectional prediction
  - Bidirectional prediction
  - Half-pel/quarter-pel motion refinement with DCTIF (8-tap/4-tap)
  - Merge
- RD cost calculation and best mode decision, giving the RD cost and the best mode

FIGURE. Flowchart: inter prediction

Inter prediction

Inter coding mode

- Merge skip mode (CU level)
  - skip_flag = 1 and merge_idx
  - No reference index
  - No motion vector
  - No residual
- Merge mode (PU level)
  - skip_flag = 0, pred_mode_flag, and part_mode
  - merge_flag = 1 and merge_idx
  - No reference index and motion vector
  - no_residual_syntax_flag: whether the residual is encoded or not
- General PU modes
  - skip_flag = 0, pred_mode_flag, and part_mode
  - merge_flag = 0
  - ref_idx_lx and mvp_lx_flag based on AMVP (x = 0 or 1)
  - MVD is encoded
  - no_residual_syntax_flag: whether the residual is encoded or not

Inter prediction flow

BEGIN input: current PU part mode for a CU
  FOR PU partition
    FOR List = 0 to 1 DO
      FOR 0 to refidx DO
        Motion estimation (diamond search, SR: 64)
        Decide best RD-cost for uni-prediction
      ENDFOR
    ENDFOR
    IF bi-directional prediction THEN
      FOR iteration = 0 to 3 DO
        FOR 0 to refidx DO
          Motion estimation (full search, SR: 4)
          Decide best RD-cost for bi-prediction
        ENDFOR
      ENDFOR
    ENDIF
  ENDFOR
  Merge
  RD-cost competition among uni/bi-prediction and merge
END output: inter prediction syntax

FIGURE. Pseudo code: inter prediction flow (uni-prediction from LIST_0 or LIST_1, bi-prediction, and merge for the current PU)

Fast encoder decision (FEN)
- Subsampled SAD for integer ME
  - Use subsampled SAD when rows > 8 for integer ME
- Only 1 iteration for the bi-predictive motion search
  - Default number: 4

Fast decision for merge RD cost (FDM)
- After merge with merge index X, if all cbf are zero, the merge process is terminated

Bi-prediction

- Search for P0 and P1 that produce the minimum error with respect to O
  - $R = O - P$, where $P = (P_0 + P_1)/2$
- Practical bi-predictive search
  1) Search for P1 that produces the minimum 2R with (2O - P0)
     - $R = O - (P_0 + P_1)/2 \Rightarrow 2R = (2O - P_0) - P_1$
  2) Search for P0 that produces the minimum error with (2O - P1)
     - $R = O - (P_0 + P_1)/2 \Rightarrow 2R = (2O - P_1) - P_0$
- Settings: BipredSearchRange: 4; FEN: 1 (iteration: 1)

FIGURE. Bi-prediction of the current frame block O from P0 in the List0 reference and P1 in the List1 reference

Example) Bi-prediction
- Unidirectional prediction (search range: 64) gives the initial P0 in the List0 reference and P1 in the List1 reference
- Bidirectional prediction, iterations 1 to 4 (BipredSearchRange: 4 in each): P1 is refined against 2O - P0 (residual R1) and P0 is refined against 2O - P1 (residual R0)

Motion estimation (integer-pel)

Practical motion estimation (diamond search)

- First search & early termination
  - At most 3 (default) more rounds after the most recent best match
- Raster refinement search
  - If the integer-pel distance is bigger than 5, the raster refinement search is conducted
- Star refinement search & early termination
  - Diamond search centred on the best match from the two earlier steps
  - At most 2 rounds after the best match

FIGURE. First search & star refinement

FIGURE. Raster refinement search

Motion estimation (sub-pel refinement)

- Integer-pel motion search
  - Cost function: SAD
- Sub-pel motion refinement
  - Cost function: SATD
  - Half-pel refinement
  - Quarter-pel refinement

FIGURE. Integer-pel motion search (within the search range)

FIGURE. Half-pel motion search

FIGURE. Quarter-pel motion search (integer-pel, half-pel, and quarter-pel positions)
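
The two cost functions named above can be illustrated on 4×4 blocks. This is a minimal sketch, assuming an unnormalized 4×4 Hadamard transform for SATD; HM's block sizes and normalization are not reproduced here, and the helper names are illustrative.

```python
H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]          # symmetric 4x4 Hadamard matrix

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def matmul4(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def satd4x4(a, b):
    """Sum of absolute Hadamard-transformed differences (no normalization)."""
    diff = [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]
    t = matmul4(matmul4(H4, diff), H4)   # H4 equals its own transpose
    return sum(abs(v) for row in t for v in row)

flat_a = [[10] * 4 for _ in range(4)]
flat_b = [[7] * 4 for _ in range(4)]
print(sad(flat_a, flat_b), satd4x4(flat_a, flat_b))  # 48 48
```

A flat difference concentrates in the DC coefficient, while a single-sample difference spreads over all 16 coefficients, so SATD reacts to the structure of the residual in a way SAD does not.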

Interpolation

DCTIF in HEVC

- Fixed 8-tap (7-tap) and 4-tap interpolation filters based on the DCT
- 2-D separable filter
  - 8 × horizontal 1-D filter + 1 × vertical 1-D filter

TABLE. Interpolation filter coefficients
Component   Position   Filter coefficients
Luma        1/4        {-1, 4, -10, 58, 17, -5, 1, 0}
Luma        1/2        {-1, 4, -11, 40, 40, -11, 4, -1}
Chroma      1/8        {-2, 58, 10, -2}
Chroma      1/4        {-4, 54, 16, -2}
Chroma      3/8        {-6, 46, 28, -4}
Chroma      1/2        {-4, 36, 36, -4}

FIGURE. Integer and fractional sample positions for luma and chroma interpolation
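
As an illustration of how one 1-D pass of the luma filter is applied, the sketch below interpolates a half-pel sample between two integer samples of a row. It assumes the standard half-pel taps {-1, 4, -11, 40, 40, -11, 4, -1}, which sum to 64, and uses a single rounding shift by 6 with 8-bit clipping. The intermediate precision of the real two-pass (horizontal then vertical) process is not modelled.

```python
LUMA_HALF = [-1, 4, -11, 40, 40, -11, 4, -1]   # taps sum to 64

def half_pel(row, i):
    """Half-pel sample between row[i] and row[i + 1].

    Uses the 8 integer samples row[i - 3] .. row[i + 4]; the caller must
    make sure they exist (border padding is not handled in this sketch).
    """
    acc = sum(c * row[i - 3 + j] for j, c in enumerate(LUMA_HALF))
    value = (acc + 32) >> 6                    # round and divide by 64
    return min(max(value, 0), 255)             # clip to the 8-bit range

ramp = [10 * k for k in range(12)]
print(half_pel(ramp, 5))  # 55, midway between 50 and 60
```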

Example of PU decision

Example) CompressCU

- Candidate modes: merge skip, Inter 2N×2N, Inter N×N, Inter AMP, Inter 2N×N, Inter N×2N, Intra 2N×2N, Intra N×N, Intra PCM
- Comparison without TU decision and without reconstruction, RD cost = SAD/SATD + λ · B_mode
  - Uni-prediction: 12000
  - Bi-prediction: 9000
  - Merge: 11000
- With TU decision and reconstruction, RD cost = SSE + λ · B_mode
  - Bi-prediction: 8500
- Has the CU size reached the SCU size? Yes: finish. No: call CompressCU for each of the four sub-CUs

TU decision flow (inter)

- Residual = original - predictor, for any PU partition (2N×2N, N×2N, 2N×N, N×N, nL×2N, nR×2N, 2N×nU, 2N×nD)
- Residual quadtree, evaluated at TU depth 0, 1, and 2
  - T/Q
  - IT/IQ (reconstruction)
  - RD cost (SSE + λ · B_mode)

TU decision flow (intra)

Example) intra_pred_mode = 26 (vertical mode)

- At each TU depth (N, N+1):
  - Intra prediction using reference samples along the prediction direction
  - T/Q
  - IT/IQ
  - RD cost (SSE + λ · B_mode)
- At TU depth N+1, the reference samples of a lower TU become available after the block above it is reconstructed

Transform

Implementation of transform in HEVC

- Matrix multiplication
  - Straightforward, few code lines
  - Huge number of operations, but SIMD friendly
- Partial butterfly implementation
  - Utilizes symmetry/anti-symmetry properties of the basis vectors
  - Fewer multiplications/additions
  - Increased number of code lines
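
The trade-off between the two implementations can be seen on the 4-point core transform. The sketch below assumes the HEVC 4-point integer DCT matrix (coefficients 64, 83, 36) and leaves out the scaling shifts between stages; it compares a plain matrix product with a partial butterfly that exploits the symmetry of the even rows and the anti-symmetry of the odd rows.

```python
DCT4 = [[64, 64, 64, 64],
        [83, 36, -36, -83],
        [64, -64, -64, 64],
        [36, -83, 83, -36]]

def dct4_matrix(x):
    """Direct matrix multiplication: 16 multiplications."""
    return [sum(c * v for c, v in zip(row, x)) for row in DCT4]

def dct4_butterfly(x):
    """Partial butterfly: 6 multiplications after 4 additions/subtractions."""
    e0, e1 = x[0] + x[3], x[1] + x[2]      # even part (symmetric rows)
    o0, o1 = x[0] - x[3], x[1] - x[2]      # odd part (anti-symmetric rows)
    return [64 * (e0 + e1),
            83 * o0 + 36 * o1,
            64 * (e0 - e1),
            36 * o0 - 83 * o1]

x = [1, 2, 3, 4]
print(dct4_matrix(x), dct4_butterfly(x))  # both [640, -285, 0, -25]
```

The same idea applies recursively to the larger transform sizes, where the even half of an N-point transform is itself an N/2-point transform.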

Partitioning syntax for a CTU

Syntax
64 CU: split_coding_unit_flag (1)
  32 CU: split_coding_unit_flag (0)
    PU partition & Pred_mode info
    TU split flags & coefficients
  32 CU: split_coding_unit_flag (1)
    16 CU: split_coding_unit_flag (0)
      PU partition & Pred_mode info
      TU split flags & coefficients
    16 CU: split_coding_unit_flag (0)
      PU partition & Pred_mode info
      TU split flags & coefficients
    16 CU: split_coding_unit_flag (1)

PU partition & Pred_mode info
- SKIP flag (merge idx)
- Prediction mode flag (intra or inter)
- PU part size (2N×2N, 2N×N, N×2N, N×N, AMP)
- Prediction info (intra mode, or mv and ref. idx, merge idx, AMVP idx)

TU split flags & coefficients
32×32 TU: split flag (1)
  16×16 TU: split flag (0) <16×16 cbf, coefficients>
  16×16 TU: split flag (1)
    8×8 TU: split flag (0) <8×8 cbf, coefficients>
    8×8 TU: split flag (0) <8×8 cbf, coefficients>
    8×8 TU: split flag (0) <8×8 cbf, coefficients>
    8×8 TU: split flag (0) <8×8 cbf, coefficients>
  16×16 TU: split flag (1)
    8×8 TU: split flag (0) <8×8 cbf, coefficients>
    8×8 TU: split flag (1)
      4×4 TU: split flag (0) <4×4 cbf, coefficients>
      4×4 TU: split flag (0) <4×4 cbf, coefficients>
      4×4 TU: split flag (0) <4×4 cbf, coefficients>
      4×4 TU: split flag (0) <4×4 cbf, coefficients>

FIGURE. Example of CU quadtree structure

FIGURE. Example of TU quadtree structure

ENCODING PROCESS OF LOOP FILTER

In-loop filter

- In HEVC, two processing steps, a deblocking filter (DBF) and a sample adaptive offset (SAO) operation, are applied
  - DBF: similar to the DBF of the H.264/AVC standard
  - SAO: applied adaptively to all samples satisfying certain conditions (while the DBF is applied only to the samples located at block boundaries)
- On/off syntax elements for in-loop filters
  1. slice_disable_deblocking_filter_flag: slice-level on/off
  2. sample_adaptive_offset_enabled_flag: slice-level on/off

Deblocking filter (DBF)

Basically, the deblocking filter of HEVC is similar to that of H.264/AVC

- In-loop filtering
  - Coding performance for inter frames
- Frame-based filtering
  - On/off control is provided
- Adaptive filtering
  - Boundary strength
- Filtering on the block boundaries
  - Transform and prediction boundaries
- Sequential filtering for vertical and horizontal edges
  - Sample values modified during filtering of vertical edges are used as input for the filtering of the horizontal edges

Deblocking filter (DBF)

Features of the HEVC deblocking filter compared to H.264/AVC

- For TUs and PUs with edges less than 8 samples in either the vertical or the horizontal direction, only the edges lying on the 8×8 sample grid are filtered

FIGURE. Derivation process for the boundary filter strength in AVC and HEVC, for (a) H.264/AVC and (b) HEVC, e.g. a 16×16 coding unit; vertical edges are filtered horizontally and horizontal edges are filtered vertically

Processing flow of DBF

- Boundary decision
  - Three kinds of boundaries are involved in the filtering: CU, TU, and PU boundaries
  - CU boundaries are always involved in the filtering
  - TU boundaries on the 8×8 block grid and PU boundaries between the PUs inside a CU are involved in the filtering
  - Exception: if a PU boundary is inside a TU, the boundary shall not be filtered
- Bs calculation (4×4 to 8×8)
  - Bs is calculated on a 4×4 block basis and then remapped to the 8×8 grid
  - Two Bs values belong to the 8 pixels forming a line in the 4×4 grid; the maximum Bs is selected as the Bs for the boundary in the 8×8 grid
- β, tc decision
- Filter on/off decision
- Strong/weak filter selection
  - Strong filtering
  - Weak filtering

FIGURE. Overall processing flow of the deblocking filter process

Overview of sample adaptive offset (1/2)

- Artifacts
  - Blocking artifacts, ringing artifacts, colour biases, and blurring artifacts
  - A larger transform could introduce more artifacts
    - HEVC: 4×4 to 32×32 transform
  - Artifacts exist at medium and low bit rates
  - A large number of interpolation taps can also lead to more serious ringing artifacts
    - HEVC: 8-tap (luma), 4-tap (chroma)
- Sample adaptive offset
  - Reduces sample distortion (reconstructed pixels - original pixels)
  - Average 3.5% BD-rate reduction (with 1% encoding time increase and 2.5% decoding time increase)
  - SAO is located after the DF and also belongs to in-loop filtering

Overview of sample adaptive offset (2/2)

- SAO features
  - Each colour component may have its own SAO parameters
  - Two SAO types
    - Edge offset (EO; 4 EO classes)
    - Band offset (BO; 1 BO class)
  - SAO merging (left CTU or above CTU)
    - SAO merge information is shared by the three colour components
- SAO objective and subjective results
  - Anchor: SAO disabled
  - Test: SAO enabled
  - CTU size in luma: 64×64
  - CTU boundary: option 1

TABLE. Y BD-rate, class summary and overall summary
                 All intra   Random access   Low delay B   Low delay P
                 (AI)        (RA)            (LB)          (LP)
Class A          0.6%        2.3%            -             -
Class B          0.5%        2.1%            2.0%          11.1%
Class C          0.5%        1.1%            1.8%          7.1%
Class D          0.4%        0.3%            0.7%          4.4%
Class E          0.6%        -               2.3%          11.0%
Class F          1.5%        2.6%            5.7%          12.3%
All              0.7%        1.7%            2.5%          9.2%
Enc. time (%)    101%        100%            100%          100%
Dec. time (%)    103%        103%            102%          102%

FIGURE. Subjective comparison: SAO is enabled (QP = 32) versus SAO is disabled (QP = 32)

Edge offset of SAO

- Four 1-D directional patterns
  - Horizontal, vertical, 135° diagonal, 45° diagonal

FIGURE. Four 1-D directional patterns for EO sample classification (current sample c with its two neighbours a and b)

- Only one EO class can be selected for each CTB for which EO is enabled
- Each sample inside the CTB is classified into one of five categories

TABLE. Sample classification rules for edge offset
Category   Condition
1          c < a && c < b
2          (c < a && c == b) || (c == a && c < b)
3          (c > a && c == b) || (c == a && c > b)
4          c > a && c > b
0          None of the above (SAO is not applied)

- One edge offset is encoded for each category (4 offsets are transmitted in the case of EO)
- No information is signalled for the classification into the five categories (encoder and decoder use the same rules)

FIGURE. Pixel level versus pixel index (x-1, x, x+1) for categories 1 to 4: positive edge offset for categories 1 and 2, negative edge offset for categories 3 and 4
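
The classification rules in the table translate directly into code. A minimal sketch, where `a` and `b` are the two neighbours of the current sample `c` along the selected 1-D pattern (the function name is illustrative):

```python
def eo_category(a, c, b):
    """Edge-offset category of sample c with neighbours a and b."""
    if c < a and c < b:
        return 1                      # local minimum
    if (c < a and c == b) or (c == a and c < b):
        return 2                      # concave corner
    if (c > a and c == b) or (c == a and c > b):
        return 3                      # convex corner
    if c > a and c > b:
        return 4                      # local maximum
    return 0                          # none of the above: SAO not applied

print([eo_category(*t) for t in [(10, 5, 10), (10, 5, 5), (5, 10, 10), (5, 10, 5), (1, 5, 9)]])
# [1, 2, 3, 4, 0]
```

Because both encoder and decoder can run this classification on the deblocked samples, only the four offsets need to be transmitted.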

Band offset of SAO

- BO implies that one offset is added to all samples of the same band
  - The sample value range is equally divided into 32 bands
  - For 8-bit samples ranging from 0 to 255, the width of a band is 8
- Only the offsets of four consecutive bands and the starting band position are signalled to the decoder
  - The average difference between the original samples and the reconstructed samples in a band is signalled to the decoder
  - Four offsets are transmitted in the case of BO

FIGURE. Sample value range from 0 to max, showing the first band for which an offset is transmitted and the four consecutive bands for which four offsets are transmitted
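
A short sketch of the band mapping described above: with 32 equal bands, the band index of a sample is its five most significant bits. The helper names are illustrative, and wrap-around of the four bands past band 31 is deliberately not modelled.

```python
def band_index(sample, bit_depth=8):
    """Index (0..31) of the band containing the sample."""
    return sample >> (bit_depth - 5)

def apply_band_offset(samples, start_band, offsets, bit_depth=8):
    """Add offsets[k] to samples in band start_band + k, for k = 0..3."""
    max_val = (1 << bit_depth) - 1
    out = []
    for s in samples:
        k = band_index(s, bit_depth) - start_band
        if 0 <= k < 4:                          # inside the signalled bands
            s = min(max(s + offsets[k], 0), max_val)
        out.append(s)
    return out

print(apply_band_offset([0, 16, 23, 24, 47, 48], 2, [1, 2, 3, 4]))
# [0, 17, 24, 26, 51, 48]
```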

A fast distortion estimation for SAO

- Distortions have to be calculated many times
- Let k, s(k), and x(k) be the sample positions, the original samples, and the pre-SAO samples, respectively
- Distortion between the original samples and the pre-SAO samples:

  $D_{pre} = \sum_{k \in C} (s(k) - x(k))^2$

- Distortion between the original samples and the post-SAO samples:

  $D_{post} = \sum_{k \in C} (s(k) - (x(k) + h))^2$

- h is the offset for the sample set C and N is the number of samples in the set; the delta distortion is defined as follows (N and E need to be calculated only once):

  $\Delta D = D_{post} - D_{pre} = \sum_{k \in C} (h^2 - 2h(s(k) - x(k))) = Nh^2 - 2hE$

  $E = \sum_{k \in C} (s(k) - x(k))$

  $\Delta J = \Delta D + \lambda R$

Offset refinement

- The initial offset value h is E/N
- All the numbers between zero and the initial offset are used in the offset refinement process

FIGURE. Offset refinement from the initial offset towards zero
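
The closed form ΔD = N·h² - 2·h·E can be checked against a brute-force computation, and the refinement step can be sketched as a search from zero to the initial offset. In this sketch, `rate` is a hypothetical bit-cost callback and the truncation of E/N is an assumed rounding rule; neither is taken from HM.

```python
def delta_distortion(n, e, h):
    """Fast estimate: N*h^2 - 2*h*E, using the precomputed N and E."""
    return n * h * h - 2 * h * e

def brute_force_delta(orig, pre, h):
    """D_post - D_pre computed sample by sample, for comparison."""
    d_pre = sum((s - x) ** 2 for s, x in zip(orig, pre))
    d_post = sum((s - (x + h)) ** 2 for s, x in zip(orig, pre))
    return d_post - d_pre

def refine_offset(n, e, lam, rate):
    """Pick the offset between 0 and E/N with the lowest delta RD cost."""
    if n == 0:
        return 0
    h0 = int(e / n)                       # initial offset, truncated (assumed)
    step = 1 if h0 >= 0 else -1
    candidates = range(0, h0 + step, step)
    return min(candidates,
               key=lambda h: delta_distortion(n, e, h) + lam * rate(h))

orig, pre = [10, 12, 14, 9], [8, 11, 12, 9]
n, e = len(orig), sum(s - x for s, x in zip(orig, pre))
print(delta_distortion(n, e, 1), brute_force_delta(orig, pre, 1))  # -6 -6
```

Since N and E are gathered once per category or band, each candidate offset costs only a few multiplications instead of a pass over all samples.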

Encoding flow of SAO in HM

- Slice pipeline: compress slice, deblocking filter (DBF), sample adaptive offset (SAO), encode slice
- RDO of SAO (CTU-based processing)
  1) Calculate SAO statistics
     - EO class 0 to class 3, per category: sum of differences, pixel count
     - BO, 32 bands: sum of differences, pixel count
  2) Calculate SAO RD cost
     - EO class 0: rdcost0 = distortion + rate (fast distortion estimation, offset refinement)
     - EO class 1: rdcost1 = distortion + rate (fast distortion estimation, offset refinement)
     - EO class 2: rdcost2 = distortion + rate (fast distortion estimation, offset refinement)
     - EO class 3: rdcost3 = distortion + rate (fast distortion estimation, offset refinement)
     - BO: band position (fast distortion estimation, offset refinement), rdcostBO = distortion + rate
     - RD cost type: one of BO, EO class 0, EO class 1, EO class 2, EO class 3
  3) Merge left or up
     - RD cost of left merge and up merge
- Process SAO

FIGURE. Flowchart: sample adaptive offset

Slice-level on/off control of SAO

- Hierarchical quantization parameter (QP) settings for each group of pictures (a higher QP at a higher depth)
  - Depth = 0: picture 8k
  - Depth = 1: picture 8k+4
  - Depth = 2: pictures 8k+2, 8k+6
  - Depth = 3: pictures 8k+1, 8k+3, 8k+5, 8k+7
- A slice-level on/off decision algorithm
  - For depth = 0 pictures, SAO is always enabled in the slice header
  - Other depths: if the previous picture (the last picture of depth N-1 in decoding order) disables SAO for more than 75% of its CTUs, the current picture terminates the SAO encoding process early and disables SAO in all slice headers

CTU-based encoding issues about SAO

- Since SAO is after the DF, the SAO parameters cannot be precisely estimated until the deblocked samples are available
  - In a CTU-based encoder, the deblocked samples of the right columns and the bottom rows in the current CTU may be unavailable
- Two practical CTU-based SAO decisions
  - Case 1. Avoid using the bottom rows and right columns (current HM)
  - Case 2. Use non-deblock-filtered pixels for the bottom rows and right columns (JCTVC-J0139)

TABLE. Average BD-rates of enabling SAO versus disabling SAO for different CTU sizes
Option 1: skip right and bottom samples in the CTU during parameter estimation
Option 2: use pre-deblocked samples near the right and bottom boundaries in the CTU during parameter estimation

CTU size     Option 1                  Option 2
in luma      Y      Cb     Cr          Y      Cb     Cr
64×64        3.5%   4.8%   5.8%        3.3%   5.3%   6.6%
32×32        2.0%   1.1%   1.5%        2.5%   2.0%   2.7%
16×16        0.0%   0.3%   0.3%        0.8%   0.4%   0.1%

COMPLEXITY ANALYSIS OF HEVC ENCODER

Complexity analysis of HM encoder

- Test sequences
  - Class B (1920×1080): Kimono, ParkScene, Cactus, BasketballDrive, BQTerrace
  - Class C (832×480): BasketballDrill, BQMall, PartyScene, RaceHorses
  - QP: 22, 27, 32, 37
  - Main profile
  - Random access, low delay
- Test environment
  - HM 7.0 software
  - Intel Core i7 CPU 860 @ 2.8 GHz
  - 4 GB memory
  - Windows 7 (64-bit)
  - Analysis tool: Intel VTune Amplifier XE

FIGURE. Class B: BasketballDrive

FIGURE. Class C: BQMall

Profiling result of HEVC encoder

TABLE. Complexity ratio (%) of HM 7.0 encoder (RA)
Class   Module        QP 22   QP 27   QP 32   QP 37
B       Entropy       6.6     3.4     1.0     0.9
B       Intra         3.3     2.2     2.1     1.4
B       Inter         68.4    78.1    83.9    85.7
B       TR+Q          18.6    13.0    9.9     8.5
B       Loop filter   0.2     0.2     0.2     0.1
B       etc           1.2     1.1     1.3     1.5
C       Entropy       6.5     3.9     2.8     1.3
C       Intra         2.9     2.7     2.2     1.8
C       Inter         68.8    74.9    79.8    83.3
C       TR+Q          18.2    14.9    12.1    10.1
C       Loop filter   0.2     0.2     0.2     0.1
C       etc           1.0     1.5     1.4     1.2

TABLE. Complexity ratio (%) of HM 7.0 encoder (LD)
Class   Module        QP 22   QP 27   QP 32   QP 37
B       Entropy       6.1     2.8     0.4     0.3
B       Intra         3.4     2.0     1.2     1.2
B       Inter         71.3    81.2    87.3    89.1
B       TR+Q          20.4    15.2    11.7    10.6
B       Loop filter   0.2     0.2     0.2     0.1
B       etc           0.8     1.2     0.8     0.9
C       Entropy       5.3     3.1     1.1     0.4
C       Intra         3.0     2.5     1.8     1.5
C       Inter         72.6    79.1    83.5    87.2
C       TR+Q          20.7    17.0    13.9    12.4
C       Loop filter   0.2     0.2     0.2     0.1
C       etc           1.1     0.6     1.6     1.0

Complexity portions of HM encoder

- Inter prediction (ME, MC, DCTIF, AMVP, merge): 77 to 81%
- Transform + Q (TU sizes 32×32 to 4×4, residual quadtree, delta QP, RDOQ): 14 to 16%
- Entropy coding (CABAC): 2 to 4%
- Intra prediction (reference sample padding, MDIS, planar, DC, 33 angular): 1 to 2%
- Loop filter (deblocking filter, sample adaptive offset): 0.1 to 0.2%
- etc

FIGURE. HEVC encoder block diagram and profiling result

Complexity portions for CU sizes and modes

TABLE. Complexity portions for CU sizes and modes
Size     Mode    RA (%)   LD (%)   Average (%)
64×64    Intra   2.1      1.0      1.6
64×64    Inter   19.0     31.9     25.5
64×64    Skip    3.9      3.4      3.7
32×32    Intra   1.9      0.7      1.3
32×32    Inter   25.0     27.4     26.2
32×32    Skip    4.5      3.2      3.9
16×16    Intra   2.3      0.2      1.3
16×16    Inter   17.0     12.5     14.8
16×16    Skip    3.2      1.7      2.5
8×8      Intra   2.4      0.4      1.4
8×8      Inter   8.7      4.9      6.8
8×8      Skip    1.7      0.6      1.2

FIGURE. Example of CU quadtree structure

Selected ratios of CU, PU and TU

TABLE. Selected ratio of CU size and PU mode
                          Class B (QP)                Class C (QP)
CU size   PU mode         22     27     32     37     22     27     32     37
64×64     Merge skip      10.6   26.6   43.3   55.2   11.7   20.6   30.6   39.5
64×64     Inter 2N×2N     4.5    7.1    7.2    6.0    5.8    7.5    6.7    5.5
64×64     Inter N×2N      1.4    2.2    1.8    1.3    1.6    1.8    1.7    1.7
64×64     Inter 2N×N      1.5    1.9    1.3    0.9    1.2    1.0    0.8    0.7
64×64     Inter AMP       1.2    1.4    1.0    0.7    1.0    1.1    1.0    1.1
64×64     Intra 2N×2N     0.3    0.4    0.6    1.0    0.0    0.0    0.0    0.1
32×32     Merge skip      9.9    12.4   19.9   8.4    12.2   13.5   15.2   16.8
32×32     Inter 2N×2N     8.1    6.9    4.6    3.1    9.1    7.2    5.4    4.3
32×32     Inter N×2N      1.8    1.4    0.9    0.4    2.2    1.9    1.9    1.7
32×32     Inter 2N×N      1.7    1.3    0.7    1.0    1.4    1.0    0.9    0.8
32×32     Inter AMP       4.4    2.9    1.6    0.6    4.2    3.5    3.1    2.6
32×32     Intra 2N×2N     2.3    2.3    2.6    2.6    0.2    0.4    0.7    1.1
16×16     Merge skip      6.8    5.6    3.9    2.9    8.0    7.7    7.3    6.1
16×16     Inter 2N×2N     9.1    3.7    1.7    0.8    6.9    4.8    3.1    2.0
16×16     Inter N×2N      1.6    0.7    0.3    0.1    2.0    1.4    1.0    0.6
16×16     Inter 2N×N      1.7    0.6    0.2    0.1    1.2    0.8    0.5    0.3
16×16     Inter AMP       4.1    1.4    0.5    0.2    4.1    2.7    1.7    0.9
16×16     Intra 2N×2N     2.6    2.1    1.7    1.4    1.2    1.6    1.8    1.7
8×8       Merge skip      2.8    1.9    1.2    0.9    3.9    3.3    2.3    1.4
8×8       Inter 2N×2N     5.8    1.3    0.4    0.1    4.9    2.5    1.1    0.4
8×8       Inter N×2N      0.3    0.2    0.1    0.0    1.2    0.7    0.3    0.1
8×8       Inter 2N×N      0.4    0.2    0.1    0.0    0.7    0.4    0.2    0.1
8×8       Intra 2N×2N     2.9    1.2    0.1    0.5    2.1    1.7    1.2    0.8
8×8       Intra N×N       0.8    0.6    0.7    0.2    1.9    1.1    0.6    0.3

TABLE. Selected ratio of TU
Class   Size     QP 22   QP 27   QP 32   QP 37
B       32×32    33.5    55.0    63.0    65.7
B       16×16    19.8    20.9    20.1    19.7
B       8×8      36.2    15.5    10.7    10.0
B       4×4      10.5    8.5     6.2     4.5
C       32×32    35.7    43.4    49.2    52.2
C       16×16    27.7    27.7    27.5    29.0
C       8×8      21.7    18.1    15.8    13.9
C       4×4      14.8    10.8    7.5     4.9

BD-BR vs. encoding time depending on CTU size

- SW: HM 7.1
- Seq: Class B
- cfg: random access & low delay
- CTU size 64×64: reference
- CTU size 32×32: 3.3 to 3.4% BD-bitrate, 78 to 79% encoding time
  - Enc T: 79.22%, BD-bitrate: 3.31%
  - Enc T: 78.92%, BD-bitrate: 3.43%
- CTU size 16×16: 15.4 to 17.5% BD-bitrate, 50 to 54% encoding time
  - Enc T: 54.7%, BD-bitrate: 15.43%
  - Enc T: 50.8%, BD-bitrate: 17.53%

BD-BR vs. encoding time depending on TU size

- SW: HM 7.1
- Seq: Class B
- cfg: random access & low delay
- Transform size
  - Max TU size 32×32, quadtree max depth 3: reference
  - 16×16 to 4×4 on case: 3.2 to 3.5% BD-bitrate, 96% encoding time
    - Max TU size 16×16, quadtree max depth 2: Enc T 96.8%, BD-bitrate 3.2%
    - Max TU size 16×16, quadtree max depth 2: Enc T 96.5%, BD-bitrate 3.5%
  - 8×8 to 4×4 on case: 10.2 to 11.2% BD-bitrate, 91 to 92% encoding time
    - Max TU size 8×8, quadtree max depth 1: Enc T 92.4%, BD-bitrate 11.2%
    - Max TU size 8×8, quadtree max depth 1: Enc T 91.4%, BD-bitrate 10.24%

Tool on/off test

Fast encoding algorithms in HM software

TABLE. Fast encoding algorithms in HM software
Contents                                                   Description                                                          Note
Fast Encoding Setting (FEN, JCTVC-A0124)                   Early CU termination; subsampled SAD operation;                      -
                                                           simple bi-prediction (number of iterations 4 to 1)
Fast Decision for Merge RD Cost (FDM, JCTVC-H178)          2N×2N merge CBF early termination                                    PU level
Rough Mode Decision for intra (RMD, JCTVC-C311/D283)       35 intra modes, SATD, RD, full RQT                                   PU level
AMP Speedup (AMPS, JCTVC-E316)                             AMP, ME or merge                                                     PU level
CBF Fast Mode Setting (CFM, JCTVC-F045)                    PU CBF 0, PU ME                                                      PU level
Early CU Setting (ECU, JCTVC-F092)                         CU skip, CU                                                          CU level
Early Skip Detection Setting (ESD, JCTVC-G543)             Inter 2N×2N early skip detection                                     CU level

HM encoder for FHD (BQTerrace.seq)

- For real-time? Target: 33.33 ms per frame
- CPU (Intel i7 CPU, 2.x GHz): one frame takes 57930 ms
- Compress slice
  - Interpolation filter (IF)
  - Motion estimation (ME)
  - Transform-quantization (TR-Q)
  - Intra prediction
  - MV derivation
  - Mode decision
  - Entropy encoding (CABAC update)
  - Timing: IF 21548.62 ms, RDOQ 2645.55 ms, TR 1687.37 ms, ITR 653.2829 ms
- DBF: 9.42 ms
- SAO: 77.33 ms
- Encode slice
  - Entropy encoding

KWHEVC encoder

- ANSI C HEVC encoder software based on the HM encoder
  - Clean-up of functions and variables
  - Non-recursive function calls
- Minimum memory allocation and bandwidth
  - Explicit minimum memory allocations (using static memory)
  - Removal of code related to duplicate variables and structures to avoid redundant memory copies
  - Removal of unnecessary memory allocation
- Software optimization
  - SIMD implementation (cost function, transform, interpolation, deblocking, ...)
  - Frame-level interpolation filter
- Parallel processing
  - Slice-level parallel processing using OpenMP
  - Motion estimation using CUDA

Performance of KWHEVC

1) C converting: 18% ATS gain (any BD-BR, BD-PSNR loss)
2) + SIMD + frame-level IF: 2× speedup (any BD-BR, BD-PSNR loss)
3) + Fast mode decision: 5× speedup (1 to 2% BD-BR loss)
4) + Slice-level parallel: 20× speedup (4 to 6% BD-BR loss)
5) + CUDA ME & MD (low delay P, adjustment config.): 200× speedup (15 to 20% BD-BR loss) {Intel i7 (3.3 GHz), GeForce 660} => 10 fps

FIGURE. Encoding speed in terms of the development steps (up to 200×)

TABLE. Encoding speed of KWHEVC (FPS)
Class   Sequence          Frames   QP 22   QP 27   QP 32   QP 37
B       Kimono            240      5.74    7.25    8.38    9.40
B       ParkScene         240      5.51    7.52    8.87    10.03
B       Cactus            500      5.19    7.70    9.09    10.09
B       BasketballDrive   500      4.80    6.71    8.09    9.18
B       BQTerrace         600      4.14    7.68    9.60    10.62
C       BasketballDrill   500      14.86   19.07   23.60   28.12
C       BQMall            600      14.81   19.88   24.91   29.20
C       PartyScene        500      11.09   16.46   22.03   27.60
C       RaceHorses        300      10.48   14.60   19.46   24.49

Comparison of decoder complexity

- HM 10.0 (C++) vs. KWHEVC decoder (C89)
  - C conversion
  - Software optimization

TABLE. Decoding performance
Sequence                            HM 10.0 (sec)   FPS     KWHEVC (sec)   FPS     Ratio
BQTerrace_1920x1080_60_qp22.bin     98.271          6.11    71.007         8.45    1.38
BQTerrace_1920x1080_60_qp27.bin     46.531          12.89   30.778         19.49   1.51
BQTerrace_1920x1080_60_qp32.bin     32.737          18.33   19.234         31.19   1.70
BQTerrace_1920x1080_60_qp37.bin     28.189          21.28   15.912         37.71   1.77
Cactus_1920x1080_50_qp22.bin        51.355          9.74    36.270         13.79   1.42
Cactus_1920x1080_50_qp27.bin        31.371          15.94   20.155         24.81   1.56
Cactus_1920x1080_50_qp32.bin        25.506          19.60   15.381         32.51   1.66
Cactus_1920x1080_50_qp37.bin        21.933          22.80   12.792         39.09   1.71

Parallelism and SIMD processing

- Parallelism
  - The decoder cannot expect the tile or slice partitioning of pictures
  - The decoder should consider worst-case bitstreams
  - The entropy decoder cannot be parallelized
  - CTU-based 2D wavefront parallel processing is a promising way to achieve parallelism
  - The deblocking filter and SAO are better suited to parallelism
    - Less data dependency
- SIMD processing
  - Inverse transform ($X = A^T Y A$)
  - Motion compensation
    - About 40% of decoder complexity
    - 8-tap and 4-tap filters

Performance of the optimized KWHEVC decoder

- SIMD and parallelization
  - Pixel reconstruction, interpolation (partial)
  - Task-level parallelism (entropy, pixel decoding)
  - Data-level parallelism (deblocking filter)

FIGURE. Decoder performance chart (values shown: 2.28 Mbps, 4.98, 2.93)

Conclusion

- Overview of HEVC
- Encoding parameters for HEVC test model (HM)
- Complexity analysis of HEVC encoder
- Fast encoding algorithms and performances
- Issues of parallel processing
