Kwangwoon University (KWU)
Donggyu Sim (dgsim@kw.ac.kr)
Contents
Overview of HEVC
Encoding issues for HEVC test model (HM)
Complexity analysis of HEVC encoder
Fast encoding algorithms and performances
Issues of parallel processing
Conclusion
OVERVIEW OF HEVC
Block diagram of HEVC standard
Typical block-based hybrid codec structure + additional enhanced tools
FIGURE. Block diagram of HEVC encoder. Blocks shown: transform (TU size 32x32 ~ 4x4, residual quadtree), quantization (delta QP, RDOQ), entropy coding (CABAC), inter prediction (ME, MC, AMVP, merge, DCT-IF), intra prediction (reference sample padding, MDIS, planar, DC, 33 angular), inverse quantization, inverse transform, loop filter (deblocking filter, sample adaptive offset), and picture buffer
Block structure in HEVC
Three block structures are defined in HEVC
Coding unit (CU)
Prediction unit (PU)
Transform unit (TU)
FIGURE. An example of CU, PU, and TU partition in HEVC (a 64x64 CTU split into 32x32, 16x16, and 8x8 CUs; PU partitions 2Nx2N, 2NxN, Nx2N, NxN, 2NxnU, 2NxnD, nLx2N, nRx2N; TU quadtree at depths 0, 1, and 2)
ENCODING STRUCTURES OF HEVC
Decision level for HEVC encoder
Hierarchy: Sequence > Picture > Slice or Tile > CTU > CU > PU & TU
Sequence level
Coding structure (All intra, Low delay, Random access)
Profile, tier, level
Max/Min CTU size, CU depth
Max/Min TU size, TU depth
Tool on/off (SAO, deblocking, WPP, tile)
Picture level
# ref frame, rate control
Slice or tile level
Tile, slice
Ref frames
Deblocking filter parameters
CTU level
CU partitioning
Sample adaptive offset parameters
CU level
PU and TU partitioning
PU & TU level
Prediction modes, motion vectors
cbf, coefficients
Temporal prediction structure (1/3)
All intra (AI)
Every picture is coded as an instantaneous decoding refresh (IDR) picture
No temporal prediction is allowed
FIGURE. All-intra structure: every picture is an IDR picture coded with QPI, and the coding order equals the POC order
Temporal prediction structure (2/3)
Low delay (LD)
The first picture shall be coded as an IDR picture
Generalized P and B (GPB) pictures shall be used for the other successive pictures
The GPB shall be able to use only reference pictures whose POC is smaller than that of the current picture (all reference pictures in List_0 and List_1 shall be temporally previous in display order relative to the current picture)
The QP of each inter-coded picture shall be derived by adding an offset to the QP of the intra-coded picture, depending on the temporal layer
QPBL1 = QPI + 1
QPBL2 = QPI + 2
QPBL3 = QPI + 3
FIGURE. Low-delay structure: an IDR or intra picture (QPI) followed by GPB (generalized P and B) pictures at depths 0, 1, and 2; the coding order equals the POC order
Temporal prediction structure (3/3)
Random access (RA)
A hierarchical B structure shall be used for coding
An IDR/intra picture or clean random access (CRA) picture shall be inserted cyclically, about once per second, as a random access point
The QP of each inter-coded picture shall be derived by adding an offset to the QP of the intra-coded picture, depending on the temporal layer
QPBL1 = QPI + 1
QPBL2 = QPI + 2
QPBL3 = QPI + 3
QPBL4 = QPI + 4
FIGURE. Random-access structure: an IDR or intra picture (QPI), a GPB (generalized P and B) picture, referenced B pictures, and non-referenced B pictures at depths 0 to 3; the coding order differs from the POC order
Picture partitioning
Coding order of coding tree units (CTUs) is raster scan order
Analogous to the macroblock in previous standards
The maximum allowed size of the luma block in a CTU is specified to be 64x64 in Main profile
*CTU & CTB: The CTU consists of a luma coding tree block (CTB) and the corresponding chroma CTBs and syntax elements
FIGURE. Example of a picture divided into CTUs (30 columns by 17 rows)
Example) Class B (1920x1080) BQTerrace
CTU size: 64x64
30x17 CTU partition
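The 30x17 figure in the example above follows from rounding the picture dimensions up to whole CTUs. A minimal sketch of that arithmetic (the function name `ctu_grid` is only illustrative and is not taken from HM):

```python
def ctu_grid(width, height, ctu_size):
    """Return (columns, rows) of CTUs needed to cover a picture.

    Partial CTUs at the right and bottom edges still count as one CTU,
    so both divisions round up.
    """
    cols = (width + ctu_size - 1) // ctu_size
    rows = (height + ctu_size - 1) // ctu_size
    return cols, rows


cols, rows = ctu_grid(1920, 1080, 64)
print(cols, rows, cols * rows)  # 30 17 510
```

1080 is not a multiple of 64, so the last CTU row covers only part of a CTU in height.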
Picture partitioning
A slice is a sequence of coding tree units (CTUs)
Unlike slices, tiles are always rectangular and always contain an integer number of coding tree units in coding tree unit raster scan
At least one of the following conditions should be true for each slice and tile in a picture
All CTBs in a slice belong to the same tile, or all CTBs in a tile belong to the same slice
FIGURE. A picture with 30x17 coding tree units that is partitioned into three slices
Coding unit (CU) and coding tree structure
Coding unit (CU): the leaf node of a quadtree structure
Square blocks
Size: from 8x8 up to the size of CTU (8x8 ~ 64x64)
Size of CTU is specified in sequence parameter set (SPS)
TABLE. Syntax for size of CTU in SPS
seq_parameter_set_rbsp() {                        Descriptor
  log2_min_coding_block_size_minus3               ue(v)
  log2_diff_max_min_coding_block_size_minus2      ue(v)
The quadtree partitioning structure allows recursive splitting into four equally sized nodes
FIGURE. Example of coding tree structure
Example of CU quadtree structure
Coding unit quadtree structure
Starting from CTU, each CU can be split into 4 smaller CUs
TABLE. Syntax for CU split flag in coding tree
coding_tree(x0, y0, log2CbSize, ctDepth) {        Descriptor
  split_coding_unit_flag                          ae(v)
Split flags for the example
64 CU: split_coding_unit_flag(1)
32 CU: split_coding_unit_flag(0)
32 CU: split_coding_unit_flag(1)
16 CU: split_coding_unit_flag(0)
16 CU: split_coding_unit_flag(0)
16 CU: split_coding_unit_flag(1)
FIGURE. Example of CU quadtree structure
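The split-flag signaling above can be sketched as a recursive parser. This is an illustration only: the flag sequence is made up for the example, the function name `parse_coding_tree` is hypothetical, and the sketch assumes no flag is sent once a CU has reached the minimum CU size.

```python
def parse_coding_tree(flags, x0, y0, size, min_size, leaves):
    """Consume split flags depth-first (z-order) and collect leaf CUs.

    flags  : iterator over 0/1 split flags
    leaves : output list of (x, y, size) tuples
    """
    # A CU at the minimum size cannot split, so no flag is read for it.
    split = size > min_size and next(flags) == 1
    if not split:
        leaves.append((x0, y0, size))
        return
    half = size // 2
    for dy in (0, half):
        for dx in (0, half):
            parse_coding_tree(flags, x0 + dx, y0 + dy, half, min_size, leaves)


# Illustrative flags: 64 -> split; first 32 -> no split; second 32 -> split
# (16, 16, 16 split into four 8x8, 16); third and fourth 32 -> no split.
flags = iter([1, 0, 1, 0, 0, 1, 0, 0, 0])
leaves = []
parse_coding_tree(flags, 0, 0, 64, 8, leaves)
for cu in leaves:
    print(cu)
```

The leaf CUs always tile the whole CTU, which is a handy sanity check when experimenting with other flag sequences.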
Coding unit (CU) decision
Coding unit quadtree structure
Starting from CTU, each CU can be split into 4 smaller CUs
Best CU RD cost calculation for each CU level
Competition of the best CU and its sub-partitioned CUs
FIGURE. CU sizes (64x64, 32x32, 16x16, 8x8) visited during the CU decision, numbered in processing order
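The competition between a CU and its four sub-partitioned CUs can be sketched as a bottom-up recursion. The cost model `toy_cost` below is entirely made up (a per-CU overhead plus a penalty for large blocks that cover a "detail" position); a real encoder would obtain the cost from the mode decision of each CU.

```python
def best_cu_cost(x0, y0, size, min_size, rd_cost):
    """Return (best_cost, partition) for the CU at (x0, y0)."""
    cost_here = rd_cost(x0, y0, size)
    if size == min_size:
        return cost_here, [(x0, y0, size)]
    half = size // 2
    split_cost, split_parts = 0, []
    for dy in (0, half):
        for dx in (0, half):
            c, p = best_cu_cost(x0 + dx, y0 + dy, half, min_size, rd_cost)
            split_cost += c
            split_parts += p
    # Keep the unsplit CU unless the four sub-CUs together are cheaper.
    if split_cost < cost_here:
        return split_cost, split_parts
    return cost_here, [(x0, y0, size)]


def toy_cost(x0, y0, size):
    """Hypothetical cost: area + 10 overhead; blocks larger than 8x8 that
    cover the detail position (5, 5) cost three times their area."""
    covers_detail = x0 <= 5 < x0 + size and y0 <= 5 < y0 + size
    weight = 3 if covers_detail and size > 8 else 1
    return weight * size * size + 10


cost, parts = best_cu_cost(0, 0, 64, 8, toy_cost)
print(cost, len(parts))  # 4196 10
```

With this toy cost only the branch containing the detail is split down to 8x8, while flat regions stay at 32x32 or 16x16.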
Prediction unit (PU) types
Prediction unit (PU): a region used for carrying the information related to the prediction processes
2 PU types for intra prediction
2Nx2N, (smallest CU: additionally NxN)
8 PU types for inter prediction
Smallest CU:
8x8: 2Nx2N, Nx2N, 2NxN
Others: 2Nx2N, Nx2N, 2NxN, NxN
Others: 2Nx2N, Nx2N, 2NxN, nLx2N, nRx2N, 2NxnU, 2NxnD
FIGURE. PU partitions in HEVC (2Nx2N, Nx2N, 2NxN, NxN, nLx2N, nRx2N, 2NxnU, 2NxnD)
Prediction unit (PU) types
Inputs: current CU size, SCU size, AMP enable flag
Current CU size == SCU size?
No, AMP enable flag == Yes: Intra 2Nx2N, Inter 2Nx2N, Inter 2NxN, Inter Nx2N, Inter AMP
No, AMP enable flag == No: Intra 2Nx2N, Inter 2Nx2N, Inter 2NxN, Inter Nx2N
Yes, current CU size == 8x8: Intra 2Nx2N, Intra NxN, Inter 2Nx2N, Inter 2NxN, Inter Nx2N
Yes, current CU size != 8x8: Intra 2Nx2N, Intra NxN, Inter 2Nx2N, Inter 2NxN, Inter Nx2N, Inter NxN
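The decision tree for the candidate PU types can be written as a small function. This is a sketch of the rules listed on these slides, not HM code; the function name and the returned dictionary layout are assumptions.

```python
def candidate_pu_modes(cu_size, scu_size, amp_enabled):
    """Return the PU partition types an encoder would test for one CU."""
    intra = ["2Nx2N"]
    inter = ["2Nx2N", "2NxN", "Nx2N"]
    if cu_size == scu_size:
        # Only the smallest CU may use the four-way NxN split.
        intra.append("NxN")
        if cu_size != 8:
            inter.append("NxN")  # inter NxN is excluded for 8x8 CUs
    elif amp_enabled:
        # Asymmetric motion partitions for CUs larger than the SCU.
        inter += ["2NxnU", "2NxnD", "nLx2N", "nRx2N"]
    return {"intra": intra, "inter": inter}


print(candidate_pu_modes(32, 8, True))
print(candidate_pu_modes(8, 8, True))
print(candidate_pu_modes(16, 16, False))
```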
Transform unit (TU) and transform tree structure
Transform unit (TU): a region sharing the transform and quantization processes
Square shape
Size: from 4x4 up to 32x32 (4x4 ~ 32x32)
Available transform block sizes and max transform hierarchy depth are specified in SPS
TABLE. Syntax for size of TU in SPS
seq_parameter_set_rbsp() {                        Descriptor
  log2_min_transform_block_size_minus_2           ue(v)
  log2_diff_max_min_transform_block_size          ue(v)
  max_transform_hierarchy_depth_inter             ue(v)
  max_transform_hierarchy_depth_intra             ue(v)
The root of the TU quadtree is the CU to which the TU belongs
FIGURE. TU quadtree structure in HEVC
INTER/INTRA PREDICTION AND PU/TU DECISION
Overall HM encoding process
Sequence
Picture
CTU decisions in a slice or a tile (RDO process)
CU partitioning decision
PU & TU partitioning decision
Deblocking filter
SAO
Entropy coding
RDO process to decide PU & TU
FIGURE. CompressCU flow: for each CU (64x64 down to 8x8), evaluate Merge skip, Inter 2Nx2N, Inter NxN, Inter AMP, Inter 2NxN, Inter Nx2N, Intra 2Nx2N, Intra NxN, and Intra PCM; if the CU size equals the SCU size, finish; otherwise call CompressCU for each of the four sub-CUs
Intra prediction flow
2Nx2N PU
Prediction modes
Luma (35 modes)
Planar, DC, Angular prediction (33 directions)
Chroma (5 modes)
Planar, DC, Vertical, Horizontal, DM
Steps shown in the flowchart: reference sample padding; filtering (MDIS, mode-dependent intra smoothing); DC filtering, Ver/Hor filtering; 3 MPM; intra prediction; loop while Mode < 35; best mode decision (RD cost, Intra_mode)
FIGURE. Directions and modes of HEVC intra prediction
Fast intra prediction & TU decision in HM
Intra prediction steps in HM
1) Rough prediction mode decision
Select N prediction modes out of the 35 prediction modes
Cost: Distortion (SATD) + lambda * mode bits
# of candidate prediction modes: N modes + MPM (3)
2) Best intra prediction mode decision with transform
Transform (RQT depth = 1)
1 best intra mode decision
3) Best RQT decision with RD costs
RQT depth = 3
FIGURE. 35 modes -> N modes + MPM -> 1 best mode -> best mode RD cost
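Step 1 above (rough mode decision) can be sketched as follows. The SATD values and mode-bit counts are synthetic placeholders chosen for illustration; `rough_mode_decision` is a hypothetical name, not the HM function.

```python
def rough_mode_decision(satd, mode_bits, lam, n_best, mpm):
    """Pick the n_best modes by SATD + lambda * bits, then append the MPMs.

    satd, mode_bits : dicts keyed by intra mode index (0..34)
    mpm             : most probable modes, added if not already selected
    """
    costs = {m: satd[m] + lam * mode_bits[m] for m in satd}
    ranked = sorted(costs, key=costs.get)  # stable: ties keep mode order
    candidates = ranked[:n_best]
    for m in mpm:
        if m not in candidates:
            candidates.append(m)
    return candidates


# Synthetic inputs: distortion grows with the distance from mode 26,
# and the three MPMs are cheaper to signal than the other modes.
mpm = [0, 1, 26]
satd = {m: 100 + 10 * abs(m - 26) for m in range(35)}
mode_bits = {m: 2 if m in mpm else 6 for m in range(35)}

print(rough_mode_decision(satd, mode_bits, lam=5, n_best=3, mpm=mpm))
# [26, 25, 27, 0, 1]
```

Only the returned candidates go on to step 2, where the transform and a full RD cost are evaluated.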
Inter prediction
Skip: Merge skip
Non-skip
Merge: spatial candidates derivation, temporal candidate derivation, additional candidates derivation
Inter 2Nx2N, Inter 2NxN, Inter Nx2N, AMP (nLx2N, nRx2N, 2NxnU, 2NxnD): unidirectional prediction, bidirectional prediction, half-pel/quarter-pel motion refinement, DCT-IF (8-tap/4-tap)
RD cost calculation
Best mode decision: RD cost, best mode
FIGURE. Flowchart of inter prediction
Inter prediction
Inter coding mode
Merge skip mode (CU level)
skip_flag = 1 and merge_idx
No reference index
No motion vector
No residual
Merge mode (PU level)
skip_flag = 0, pred_mode_flag, and part_mode
merge_flag = 1 and merge_idx
No reference index and motion vector
no_residual_syntax_flag: residual is encoded or not
General PU modes
skip_flag = 0, pred_mode_flag, and part_mode
merge_flag = 0
ref_idx_lx and mvp_lx_flag based on AMVP (x = 0 or 1)
MVD is encoded
no_residual_syntax_flag: residual is encoded or not
Inter prediction flow
BEGIN  input: current PU part mode for a CU
  FOR each PU partition DO
    FOR List = 0 to 1 DO
      FOR refIdx = 0 to last reference index DO
        Motion estimation (diamond search, SR: 64)
        Decide best RD-cost for uni-prediction
      ENDFOR
    ENDFOR
    IF bi-directional prediction THEN
      FOR iteration = 0 to 3 DO
        FOR refIdx = 0 to last reference index DO
          Motion estimation (full search, SR: 4)
          Decide best RD-cost for bi-prediction
        ENDFOR
      ENDFOR
    ENDIF
  ENDFOR
END
Fast encoder decision (FEN)
Subsampled SAD for integer ME
Use subsampled SAD when rows > 8 for integer ME
Only 1 iteration for bi-predictive motion search
Default number: 4
Fast Decision for Merge RD cost (FDM)
After merge with merge idx X, if all cbf are zero, then the merge process is terminated
FIGURE. RD-cost competition among uni-prediction (LIST_0, LIST_1), bi-prediction, and merge for the current PU
Bi-prediction
Practical bi-predictive search
O: block in the current frame, P0: List0 reference block, P1: List1 reference block, R: residual
1) Search P1 which produces the minimum 2R with (2O - P0)
R = O - (P0 + P1)/2  ->  2R = (2O - P0) - P1
2) Search P0 which produces the minimum error with (2O - P1)
R = O - (P0 + P1)/2  ->  2R = (2O - P1) - P0
BipredSearchRange: 4
FEN: 1 (iteration: 1)
Example) Bi-prediction
Unidirectional prediction (search range: 64) gives the initial P0 and P1
Bidirectional prediction, iterations 1 to 4 (BipredSearchRange: 4 each): P0 and P1 are refined alternately, searching P1 against the target 2O - P0 and P0 against the target 2O - P1
Motion estimation (Integer-pel)
Practical motion estimation (diamond search)
First search & early termination
Max 3 (default) more rounds after a recent best match
Raster refinement search
If the integer-pel distance is bigger than 5, then conduct the raster refinement search
Star refinement search & early termination
Diamond search centered on the best match from the earlier two steps
Max 2 rounds after the best match
FIGURE. First search & star refinement
FIGURE. Raster refinement search
Motion estimation (Sub-pel refinement)
Integer-pel motion search
Cost function: SAD
Sub-pel motion refinement
Cost function: SATD
Half-pel refinement
Quarter-pel refinement
FIGURE. Search range of the integer-pel search, followed by half-pel and quarter-pel refinement
Interpolation
DCT-IF in HEVC
Fixed 8-tap (7-tap) and 4-tap interpolation filters based on DCT
2D separable filter
8 * Horizontal 1D filter + 1 * Vertical 1D filter
TABLE. Interpolation filter coefficients
Component   Position   Filter coefficients
Luma        1/4        {-1, 4, -10, 58, 17, -5, 1, 0}
Luma        1/2        {-1, 4, -11, 40, 40, -11, 4, -1}
Chroma      1/8        {-2, 58, 10, -2}
Chroma      1/4        {-4, 54, 16, -2}
Chroma      3/8        {-6, 46, 28, -4}
Chroma      1/2        {-4, 36, 36, -4}
FIGURE. Integer sample positions (A, B) and fractional sample positions (a to r for luma, two-letter labels for chroma) used by the interpolation filters
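To make the filter table concrete, here is a minimal 1-D sketch that applies the luma 8-tap filters to a row of 8-bit samples. It assumes the standard HEVC luma coefficients (with their negative taps) and uses a simplified single-stage rounding by 64; the real codec keeps higher intermediate precision in the 2-D separable case, so this is an illustration rather than a bit-exact implementation.

```python
LUMA_HALF = (-1, 4, -11, 40, 40, -11, 4, -1)
LUMA_QUARTER = (-1, 4, -10, 58, 17, -5, 1, 0)


def interpolate_1d(samples, pos, taps):
    """Fractional sample between samples[pos] and samples[pos + 1].

    The 8 taps cover samples[pos - 3 .. pos + 4]; indices outside the row
    are clamped, which acts like edge padding.
    """
    acc = 0
    for k, coeff in enumerate(taps):
        idx = min(max(pos - 3 + k, 0), len(samples) - 1)
        acc += coeff * samples[idx]
    # Coefficients sum to 64: round, normalize, and clip to 8 bits.
    return min(max((acc + 32) >> 6, 0), 255)


flat = [100] * 8
ramp = [0, 10, 20, 30, 40, 50, 60, 70]
print(interpolate_1d(flat, 3, LUMA_HALF))     # 100
print(interpolate_1d(ramp, 3, LUMA_HALF))     # 35
print(interpolate_1d(ramp, 3, LUMA_QUARTER))  # 32
```

On the ramp, the half-pel result lands midway between 30 and 40, and the quarter-pel result lands about a quarter of the way.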
Example of PU decision
Example) CompressCU
Candidate modes: Merge skip, Inter 2Nx2N, Inter NxN, Inter AMP, Inter 2NxN, Inter Nx2N, Intra 2Nx2N, Intra NxN, Intra PCM
Comparison of prediction candidates (no TU decision, no reconstruction)
Uni-prediction: RD cost = SAD/SATD + lambda * B_mode = 12000
vs. Bi-prediction: RD cost = SAD/SATD + lambda * B_mode = 9000
vs. Merge: RD cost = SAD/SATD + lambda * B_mode = 11000
Selected candidate (TU decision, reconstruction)
Bi-prediction: RD cost = SSE + lambda * B_mode = 8500
If the CU size equals the SCU size, finish; otherwise call CompressCU for each of the four sub-CUs
TU decision flow (Inter)
Residual quadtree
PU partitions: 2Nx2N, Nx2N, 2NxN, NxN, nLx2N, nRx2N, 2NxnU, 2NxnD
Residual = Original - Predictor
T/Q
IT/IQ (recon)
RD cost (SSE + lambda * B_mode)
Evaluated at TU depth 0, TU depth 1, and TU depth 2
TU decision flow (Intra)
Example) intra_pred_mode = 10 (vertical mode)
Intra prediction using reference samples
Residual
T/Q
IT/IQ
RD cost (SSE + lambda * B_mode)
TU depth N: prediction direction from the reference samples
TU depth N+1: prediction direction from the reference samples (available after the above block is reconstructed)
Transform
Implementation of transform in HEVC
Matrix multiplication
Straightforward / few code lines
Huge number of operations, but SIMD friendly
Partial butterfly implementation
Utilizes symmetry/anti-symmetry properties of basis vectors
Fewer multiplications/additions
Increased number of code lines
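The trade-off between the two implementations can be seen on the 4-point case. The sketch below uses the 4x4 HEVC core transform matrix and compares a direct matrix multiplication (16 multiplications) with a partial butterfly (6 multiplications) on one input vector; the scaling and shift stages of the real transform are left out for brevity.

```python
DCT4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def dct4_matrix(x):
    """Direct matrix-vector product: 16 multiplications."""
    return [sum(c * v for c, v in zip(row, x)) for row in DCT4]


def dct4_butterfly(x):
    """Partial butterfly: even rows are symmetric, odd rows antisymmetric."""
    e0, e1 = x[0] + x[3], x[1] + x[2]  # even part
    o0, o1 = x[0] - x[3], x[1] - x[2]  # odd part
    return [
        64 * (e0 + e1),
        83 * o0 + 36 * o1,
        64 * (e0 - e1),
        36 * o0 - 83 * o1,
    ]


x = [10, 20, 30, 40]
print(dct4_matrix(x))     # [6400, -2850, 0, -250]
print(dct4_butterfly(x))  # [6400, -2850, 0, -250]
```

The saving grows with the transform size, at the price of longer, size-specific code.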
Partitioning syntax for a CTU
Syntax
64 CU: split_coding_unit_flag(1)
  32 CU: split_coding_unit_flag(0)
    PU partition & Pred_mode info
    TU split flags & Coefficients
  32 CU: split_coding_unit_flag(1)
    16 CU: split_coding_unit_flag(0)
      PU partition & Pred_mode info
      TU split flags & Coefficients
    16 CU: split_coding_unit_flag(0)
      PU partition & Pred_mode info
      TU split flags & Coefficients
    16 CU: split_coding_unit_flag(1)
FIGURE. Example of CU quadtree structure
PU partition & Pred_mode info
SKIP flag (merge idx)
Prediction mode flag (intra or inter)
PU part size (2Nx2N, 2NxN, Nx2N, NxN, AMP)
Prediction info. (intra mode or mv and ref. idx., merge idx, AMVP idx)
TU split flags & Coefficients
32x32 TU: split flag(1)
  16x16 TU: split flag(0) <16x16 cbf, coefficients>
  16x16 TU: split flag(1)
    8x8 TU: split flag(0) <8x8 cbf, coefficients>
    8x8 TU: split flag(0) <8x8 cbf, coefficients>
    8x8 TU: split flag(0) <8x8 cbf, coefficients>
    8x8 TU: split flag(0) <8x8 cbf, coefficients>
  16x16 TU: split flag(1)
    8x8 TU: split flag(0) <8x8 cbf, coefficients>
    8x8 TU: split flag(1)
      4x4 TU: split flag(0) <4x4 cbf, coefficients>
      4x4 TU: split flag(0) <4x4 cbf, coefficients>
      4x4 TU: split flag(0) <4x4 cbf, coefficients>
      4x4 TU: split flag(0) <4x4 cbf, coefficients>
FIGURE. Example of TU quadtree structure
ENCODING PROCESS OF LOOP FILTER
In-loop filter
In HEVC, two processing steps, a deblocking filter (DBF) and a sample adaptive offset (SAO) operation, are applied
DBF: similar to the DBF of the H.264/AVC standard
SAO: applied adaptively to all samples satisfying certain conditions (while the DBF is only applied to the samples located at block boundaries)
On/off syntaxes for in-loop filters
1. slice_disable_deblocking_filter_flag: slice-level on/off
2. sample_adaptive_offset_enabled_flag: slice-level on/off
Deblocking filter (DBF)
Basically, the deblocking filter of HEVC is similar to that of H.264/AVC
In-loop filtering
Coding performance for inter frame
Frame-based filtering
On/off control is provided
Adaptive filtering
Boundary strength
Filtering on the block boundaries
Transform and prediction boundary
Sequential filtering for vertical and horizontal edges
Sample values modified during filtering of vertical edges are used as input for the filtering of the horizontal edges
Deblocking filter (DBF)
Features of the HEVC deblocking filter compared to H.264/AVC
For the TUs and PUs with edges less than 8 samples in either vertical or horizontal direction, only the edges lying on the 8x8 sample grid are filtered
[e.g. 16x16 coding unit]
(a) H.264/AVC
(b) HEVC
Vertical edges -> horizontal filtering
Horizontal edges -> vertical filtering
FIGURE. Derivation process for the boundary filter strength in AVC and HEVC
Processing flow of DBF
Flow: boundary decision -> Bs calculation (4x4 -> 8x8) -> beta, tc decision -> filter on/off decision -> strong/weak filter selection -> strong filtering or weak filtering
Boundary decision
Three kinds of boundaries are involved in the filtering
CU, TU, PU boundary
CU boundaries are always involved in the filtering
TU boundaries on the 8x8 block grid and PU boundaries between PUs inside a CU are involved in the filtering
[Except] If a PU boundary is inside a TU, the boundary shall not be filtered
Bs calculation
Bs is calculated on a 4x4 block basis -> remapped to the 8x8 grid
Two Bs values belong to the 8 pixels forming a line on the 4x4 grid; the maximum Bs is selected as the Bs for boundaries on the 8x8 grid
Overview of sample adaptive offset (1/2)
Artifacts
Blocking artifacts, ringing artifacts, color biases, and blurring artifacts
A larger transform could introduce more artifacts
HEVC: 4x4 ~ 32x32 transform
Artifacts exist at medium and low bit rates
A large number of interpolation taps can also lead to more serious ringing artifacts
HEVC: 8-tap (luma), 4-tap (chroma)
Sample adaptive offset
To reduce sample distortion (reconstructed pixels - original pixels)
Average 3.5% BD-rate reduction (with 1% encoding time increase, 2.5% decoding time increase)
SAO is located after DF and also belongs to in-loop filtering
Overview of sample adaptive offset (2/2)
SAO features
Each color component may have its own SAO parameters
Two SAO types
Edge offset (EO; 4 EO classes)
Band offset (BO; 1 BO class)
SAO merging (left CTU or above CTU)
SAO merge information is shared for three color components
SAO objective and subjective results
Anchor: Disabling SAO
Test: Enabling SAO
CTU size in luma: 64x64
CTU boundary: option 1
TABLE. Y BD-rate, class summary and overall summary
                 All intra (AI)   Random access (RA)   Low delay B (LB)   Low delay P (LP)
Class A          0.6%             2.3%                 n/a                n/a
Class B          0.5%             2.1%                 2.0%               11.1%
Class C          0.5%             1.1%                 1.8%               7.1%
Class D          0.4%             0.3%                 0.7%               4.4%
Class E          0.6%             n/a                  2.3%               11.0%
Class F          1.5%             2.6%                 5.7%               12.3%
All              0.7%             1.7%                 2.5%               9.2%
Enc. Time (%)    101%             100%                 100%               100%
Dec. Time (%)    103%             103%                 102%               102%
FIGURE. SAO is enabled (QP=32) versus SAO is disabled (QP=32)
Edge offset of SAO
Four 1-D directional patterns
FIGURE. Four 1-D directional patterns for EO sample classification (c: current sample; a, b: its two neighbors along the pattern)
Only one EO class can be selected for each CTB for which EO is enabled
Each sample inside the CTB is classified into one of five categories
TABLE. Sample classification rules for edge offset
Category   Condition
1          c < a && c < b
2          (c < a && c == b) || (c == a && c < b)
3          (c > a && c == b) || (c == a && c > b)
4          c > a && c > b
0          None of the above (SAO is not applied)
FIGURE. Pixel level versus pixel index (x-1, x, x+1) for each category: categories 1 and 2 take a positive edge offset, categories 3 and 4 take a negative edge offset
One edge offset is encoded for each category (4 offsets are transmitted in the case of EO)
No information is sent for the classification into five categories (encoder and decoder use the same rules)
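The classification rules in the table translate directly into code. The sketch below classifies each sample of a 1-D row with the horizontal pattern and applies one offset per category; the offset values are arbitrary illustration values and the function names are not from HM.

```python
def eo_category(a, c, b):
    """Edge-offset category of sample c with neighbors a and b."""
    if c < a and c < b:
        return 1  # local minimum
    if (c < a and c == b) or (c == a and c < b):
        return 2  # concave corner
    if (c > a and c == b) or (c == a and c > b):
        return 3  # convex corner
    if c > a and c > b:
        return 4  # local maximum
    return 0      # none of the above: SAO is not applied


def apply_edge_offset(row, offsets):
    """Apply the horizontal EO class to a row; border samples are skipped.

    offsets maps category 1..4 to an offset. Classification always uses the
    unmodified input row.
    """
    out = list(row)
    for i in range(1, len(row) - 1):
        cat = eo_category(row[i - 1], row[i], row[i + 1])
        if cat:
            out[i] = min(max(row[i] + offsets[cat], 0), 255)
    return out


row = [10, 8, 10, 10, 12, 10]
offsets = {1: 2, 2: 1, 3: -1, 4: -2}  # positive for 1 and 2, negative for 3 and 4
print(apply_edge_offset(row, offsets))  # [10, 10, 9, 11, 10, 10]
```

With these offsets, local minima are raised and local maxima are lowered, which smooths ringing around edges.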
Band offset of SAO
BO implies one offset is added to all samples of the same band
The sample value range is equally divided into 32 bands
For 8-bit samples ranging from 0 to 255, the width of a band is 8
Only the offsets of four consecutive bands and the starting band position are signaled to the decoder
The average difference between the original samples and reconstructed samples in a band is signaled to the decoder
Four offsets are transmitted in the case of BO
FIGURE. Sample value range from 0 to max, showing the first band for which an offset is transmitted and the four consecutive bands whose offsets are transmitted
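A minimal sketch of band offset as described above: with 32 equal bands, the band index of a sample is its five most significant bits, and only samples in the four signaled bands are modified. The starting band and offsets below are made-up values.

```python
def apply_band_offset(samples, start_band, offsets, bit_depth=8):
    """Add offsets[k] to samples that fall into band start_band + k."""
    shift = bit_depth - 5  # 32 bands: band index = top 5 bits of the sample
    max_val = (1 << bit_depth) - 1
    out = []
    for s in samples:
        k = (s >> shift) - start_band
        if 0 <= k < len(offsets):
            s = min(max(s + offsets[k], 0), max_val)
        out.append(s)
    return out


samples = [0, 63, 64, 71, 72, 95, 96, 255]
# Bands 8..11 cover sample values 64..95 for 8-bit video.
print(apply_band_offset(samples, start_band=8, offsets=[1, 2, 3, 4]))
# [0, 63, 65, 72, 74, 99, 96, 255]
```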
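A fast distortion estimation for SAO
Distortions have to be calculated many times
Let k, s(k), and x(k) be sample positions, original samples, and pre-SAO samples, respectively, and let C be the set of sample positions under consideration
Distortion between original samples and pre-SAO samples
$$D_{pre} = \sum_{k \in C} \left( s(k) - x(k) \right)^2$$
Distortion between original samples and post-SAO samples
$$D_{post} = \sum_{k \in C} \left( s(k) - (x(k) + h) \right)^2$$
h is the offset for the sample set and N is the number of samples in the set; the delta distortion is defined as follows (N and E can be calculated only once)
$$\Delta D = D_{post} - D_{pre} = N h^2 - 2 h E, \qquad E = \sum_{k \in C} \left( s(k) - x(k) \right)$$
$$\Delta J = \Delta D + \lambda R$$
Offset refinement
Initial offset value, h, is E/N
All the numbers between zero and the initial offset are used for the offset refinement process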
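The point of the fast estimation is that the change in distortion for any offset h can be computed from two statistics, N and E, as N*h^2 - 2*h*E, without revisiting the samples. The sketch below checks this against a brute-force computation and then runs the offset refinement. The rate model `toy_rate` is a made-up placeholder, and the function names are illustrative only.

```python
def sao_stats(orig, rec):
    """Return (N, E): sample count and sum of (original - pre-SAO)."""
    return len(orig), sum(s - x for s, x in zip(orig, rec))


def delta_distortion(n, e, h):
    """Change in SSE when offset h is added to every sample of the set."""
    return n * h * h - 2 * h * e


def refine_offset(n, e, lam, rate_of):
    """Test every integer from 0 to the initial offset E/N; keep the best J."""
    h0 = round(e / n) if n else 0
    step = 1 if h0 > 0 else -1
    best_h, best_j = 0, lam * rate_of(0)  # h = 0 leaves the distortion unchanged
    for h in range(step, h0 + step, step):
        j = delta_distortion(n, e, h) + lam * rate_of(h)
        if j < best_j:
            best_h, best_j = h, j
    return best_h, best_j


def toy_rate(h):
    """Hypothetical rate model: larger offsets cost more bits."""
    return abs(h) + 1


orig = [12, 13, 14, 15]
rec = [10, 10, 10, 10]
n, e = sao_stats(orig, rec)  # N = 4, E = 14

# Brute-force check of the closed form for h = 3.
d_pre = sum((s - x) ** 2 for s, x in zip(orig, rec))
d_post = sum((s - (x + 3)) ** 2 for s, x in zip(orig, rec))
print(d_post - d_pre, delta_distortion(n, e, 3))  # -48 -48

print(refine_offset(n, e, lam=1, rate_of=toy_rate))  # (3, -44)
```

Here the initial offset rounds to 4, but the rate term makes h = 3 the better choice, which is the kind of case the refinement step is meant to catch.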
Encoding flow of SAO in HM
Compress slice -> Deblocking filter (DBF) -> Sample adaptive offset (SAO) -> Encode slice
RDO of SAO (CTU-based processing)
1) Calculate SAO statistics
EO class0 to EO class3, per category: sum of difference, pixel count
BO: band position
2) Calculate SAO RD cost
EO class0: rdcost0 = distortion + rate (a fast distortion estimation, offset refinement)
EO class1: rdcost1 = distortion + rate (a fast distortion estimation, offset refinement)
EO class2: rdcost2 = distortion + rate (a fast distortion estimation, offset refinement)
EO class3: rdcost3 = distortion + rate (a fast distortion estimation, offset refinement)
BO: rdcostBO = distortion + rate (a fast distortion estimation, offset refinement)
Rdcost type is selected among (BO, EO class0, EO class1, EO class2, EO class3)
3) Merge left or up
Slice-level on/off control of SAO
Hierarchical quantization parameter (QP) settings for each group of pictures
Depth = 0: pictures 8k
Depth = 1: pictures 8k+4
Depth = 2: pictures 8k+2, 8k+6
Depth = 3: pictures 8k+1, 8k+3, 8k+5, 8k+7
A higher QP is used at a higher depth
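The depth of a picture in this hierarchy depends only on its position inside the group of 8 pictures. A minimal sketch, assuming a dyadic hierarchy with a power-of-two GOP size (the function name is illustrative):

```python
def temporal_depth(poc, gop_size=8):
    """Hierarchy depth of a picture in a dyadic hierarchical-B GOP."""
    r = poc % gop_size
    if r == 0:
        return 0
    depth = gop_size.bit_length() - 1  # log2(gop_size): 3 for a GOP of 8
    # Each factor of two in the position moves the picture one level up.
    while r % 2 == 0:
        r //= 2
        depth -= 1
    return depth


print([temporal_depth(poc) for poc in range(9)])
# [0, 3, 2, 3, 1, 3, 2, 3, 0]
```

Odd positions (8k+1, 8k+3, 8k+5, 8k+7) land at depth 3, matching the list above.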
A slice-level on/off decision algorithm
For a depth = 0 picture, SAO is always enabled in the slice header
Other depths
If the previous picture (the last picture of depth N-1 in decoding order) disables SAO for more than 75% of CTUs, the current picture will early terminate the SAO encoding process and disable SAO in all slice headers
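The rule above amounts to a single threshold test per picture. A minimal sketch with hypothetical argument names; the behavior when no statistics are available yet is an assumption, not something stated on the slide.

```python
def sao_enabled_for_picture(depth, prev_disabled_ctus, prev_total_ctus,
                            threshold=0.75):
    """Decide whether SAO encoding runs for the current picture.

    prev_* describe the last picture of depth - 1 in decoding order.
    """
    if depth == 0:
        return True  # depth-0 pictures always enable SAO
    if prev_total_ctus == 0:
        return True  # assumption: without statistics, keep SAO on
    # Early termination when more than 75% of the CTUs had SAO off.
    return prev_disabled_ctus / prev_total_ctus <= threshold


print(sao_enabled_for_picture(0, 510, 510))  # True
print(sao_enabled_for_picture(2, 400, 510))  # False
print(sao_enabled_for_picture(2, 300, 510))  # True
```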
CTU-based encoding issues about SAO
Since SAO is after DF, the SAO parameters cannot be precisely estimated until the deblocked samples are available
In a CTU-based encoder, the deblocked samples of the right columns and the bottom rows in the current CTU may be unavailable
Two practical CTU-based SAO decisions
Case 1. Avoid using the bottom rows and right columns (current HM)
Case 2. Use non-deblock-filtered pixels for the bottom rows and right columns (JCTVC-J0139)
FIGURE. Deblock-filtered pixels and non-deblock-filtered pixels in a CTU
TABLE. Average BD-rates of enabling SAO versus disabling SAO for different CTU sizes
Option 1: Skip right and bottom samples in the CTU during parameter estimation
Option 2: Use pre-deblocked samples near right and bottom boundaries in the CTU during parameter estimation
CTU size in luma   Option 1: Y   Cb     Cr     Option 2: Y   Cb     Cr
64x64              3.5%          4.8%   5.8%   3.3%          5.3%   6.6%
32x32              2.0%          1.1%   1.5%   2.5%          2.0%   2.7%
16x16              0.0%          0.3%   0.3%   0.8%          0.4%   0.1%
COMPLEXITY ANALYSIS OF HEVC ENCODER
Complexity analysis of HM encoder
Test sequences
Sequence: Class B (1920x1080), Class C (832x480)
Class B: Kimono, ParkScene, Cactus, BasketballDrive, BQTerrace
Class C: BasketballDrill, BQMall, PartyScene, RaceHorses
QP: 22, 27, 32, 37
Main profile
Random access, low delay
Test environment
HM7.0 software
Intel Core i7 CPU 860 @ 2.8 GHz
4 GB memory
Windows 7 (64-bit)
Analysis tool: Intel VTune Amplifier XE
Profiling result of HEVC encoder
TABLE. Complexity ratio of HM7.0 encoder (RA)
TABLE. Complexity ratio of HM7.0 encoder (LD)
QP
Class
QP
Module
22
27
32
Class
37
Module
22
27
32
37
Entropy
6.6
3.4
1.0
0.9
Entropy
6.1
2.8
0.4
0.3
Intra
3.3
2.2
2.1
1.4
Intra
3.4
2.0
1.2
1.2
Inter
68.4
78.1
83.9
85.7
Inter
71.3
81.2
87.3
89.1
TR+Q
18.6
13.0
9.9
8.5
TR+Q
20.4
15.2
11.7
10.6
Loopfilter
0.2
0.2
0.2
0.1
Loopfilter
0.2
0.2
0.2
0.1
etc
1.2
1.1
1.3
1.5
etc
0.8
1.2
0.8
0.9
Entropy
6.5
3.9
2.8
1.3
Entropy
5.3
3.1
1.1
0.4
Intra
2.9
2.7
2.2
1.8
Intra
3.0
2.5
1.8
1.5
Inter
68.8
74.9
79.8
83.3
Inter
72.6
79.1
83.5
87.2
TR+Q
18.2
14.9
12.1
10.1
TR+Q
20.7
17.0
13.9
12.4
Loopfilter
0.2
0.2
0.2
0.1
Loopfilter
0.2
0.2
0.2
0.1
etc
1.0
1.5
1.4
1.2
etc
1.1
0.6
1.6
1.0
Complexity portions of HM encoder
Inter prediction: 77~81%
Tr + Q: 14~16%
Entropy coding: 2~4%
Intra prediction: 1~2%
Loop filter: 0.1~0.2%
FIGURE. Block diagram of the HEVC encoder annotated with the complexity portion of each module (intra prediction, inter prediction, transform + Q, loop filter, entropy coding, etc.)
Complexity portions for CU sizes and modes
TABLE. Complexity portions for CU sizes and modes
Size     Mode    RA (%)   LD (%)   Average (%)
64x64    Intra   2.1      1.0      1.6
64x64    Inter   19.0     31.9     25.5
64x64    Skip    3.9      3.4      3.7
32x32    Intra   1.9      0.7      1.3
32x32    Inter   25.0     27.4     26.2
32x32    Skip    4.5      3.2      3.9
16x16    Intra   2.3      0.2      1.3
16x16    Inter   17.0     12.5     14.8
16x16    Skip    3.2      1.7      2.5
8x8      Intra   2.4      0.4      1.4
8x8      Inter   8.7      4.9      6.8
8x8      Skip    1.7      0.6      1.2
FIGURE. Example of CU quadtree structure
Selected ratios of CU, PU and TU
CU size
64x64
32x32
16x16
8x8
PU mode
ClassB
ClassC
22
27
32
37
Merge skip
10.6
26.6
43.3
55.2
Inter2Nx2N
4.5
7.1
7.2
InterNx2N
1.4
2.2
Inter2NxN
1.5
InterAMP
22
TABLE. Selected ratio of CU size and PU mode
27
32
37
11.7
20.6
30.6
39.5
6.0
5.8
7.5
6.7
5.5
1.8
1.3
1.6
1.8
1.7
1.7
1.9
1.3
0.9
1.2
1.0
0.8
0.7
1.2
1.4
1.0
0.7
1.0
1.1
1.0
1.1
Intra 2Nx2N
0.3
0.4
0.6
1.0
0.0
0.0
0.0
0.1
Merge skip
9.9
12.4
19.9
8.4
12.2
13.5
15.2
16.8
Inter2Nx2N
8.1
6.9
4.6
3.1
9.1
7.2
5.4
4.3
InterNx2N
1.8
1.4
0.9
0.4
2.2
1.9
1.9
1.7
Inter2NxN
1.7
1.3
0.7
1.0
1.4
1.0
0.9
0.8
InterAMP
4.4
2.9
1.6
0.6
4.2
3.5
3.1
2.6
Intra 2Nx2N
2.3
2.3
2.6
2.6
0.2
0.4
0.7
1.1
Merge skip
6.8
5.6
3.9
2.9
8.0
7.7
7.3
6.1
Inter2Nx2N
9.1
3.7
1.7
0.8
6.9
4.8
3.1
2.0
InterNx2N
1.6
0.7
0.3
0.1
2.0
1.4
1.0
0.6
Inter2NxN
1.7
0.6
0.2
0.1
1.2
0.8
0.5
0.3
InterAMP
4.1
1.4
0.5
0.2
4.1
2.7
1.7
0.9
TABLE. Selected ratio of TU
Class
Size
QP
22
27
32
37
32x32
33.5
55.0
63.0
65.7
16x16
19.8
20.9
20.1
19.7
8x8
36.2
15.5
10.7
10.0
4x4
10.5
8.5
6.2
4.5
32x32
35.7
43.4
49.2
52.2
16x16
27.7
27.7
27.5
29.0
Intra 2Nx2N
2.6
2.1
1.7
1.4
1.2
1.6
1.8
1.7
Mergeskip
2.8
1.9
1.2
0.9
3.9
3.3
2.3
1.4
Inter2Nx2N
5.8
1.3
0.4
0.1
4.9
2.5
1.1
0.4
InterNx2N
0.3
0.2
0.1
0.0
1.2
0.7
0.3
0.1
Inter2NxN
0.4
0.2
0.1
0.0
0.7
0.4
0.2
0.1
Intra2Nx2N
2.9
1.2
0.1
0.5
2.1
1.7
1.2
0.8
8x8
21.7
18.1
15.8
13.9
IntraNxN
0.8
0.6
0.7
0.2
1.9
1.1
0.6
0.3
4x4
14.8
10.8
7.5
4.9
BD-BR vs. encoding time depending on CTU size
SW: HM7.1
Seq: Class B
cfg: Random access & Low delay
CTU size: 64x64 (reference)
CTU size: 32x32
3.3~3.4% BD-bitrate
78~79% encoding time
Data points: Enc T 79.22%, BD-bitrate 3.31%; Enc T 78.92%, BD-bitrate 3.43%
CTU size: 16x16
15.4~17.5% BD-bitrate
50~54% encoding time
Data points: Enc T 54.7%, BD-bitrate 15.43%; Enc T 50.8%, BD-bitrate 17.53%
BD-BR vs. encoding time depending on TU size
SW: HM7.1
Seq: Class B
cfg: Random access & Low delay
Transform size
Max TU size: 32x32, quadtree max depth: 3 (reference)
16x16 to 4x4 on case (max TU size: 16x16, quadtree max depth: 2)
3.2~3.5% BD-bitrate
96% encoding time
Data points: Enc T 96.8%, BD-bitrate 3.2%; Enc T 96.5%, BD-bitrate 3.5%
8x8 to 4x4 on case (max TU size: 8x8, quadtree max depth: 1)
10.2~11.2% BD-bitrate
91~92% encoding time
Data points: Enc T 92.4%, BD-bitrate 11.2%; Enc T 91.4%, BD-bitrate 10.24%
Tool on/off test
Fast encoding algorithms in HM software
TABLE. Fast encoding algorithms in HM software
Contents / note
Fast Encoding Setting: FEN, JCTVC-A0124
Early CU termination
Subsampled SAD operation
Simple bi-prediction (the number of iterations 4 -> 1)
Fast Decision for Merge RD Cost: FDM, JCTVC-H178
PU level
Rough Mode Decision (for intra): RMD, JCTVC-C311/D283
35 intra mode, SATD, RD
Full RQT
PU level
AMP Speedup: AMPS, JCTVC-E316
AMP, ME or Merge
PU level
CBF Fast Mode Setting: CFM, JCTVC-F045
PU CBF 0, PU ME
PU level
Early CU Setting: ECU, JCTVC-F092
CU Skip, CU
CU level
Early Skip Detection Setting: ESD, JCTVC-G543
Inter 2Nx2N, early skip detection
CU level
HM encoder for FHD (BQTerrace.seq)
For real-time? 33.33 ms per frame
CPU (Intel i7 CPU, 2.x GHz): one frame takes 57930 ms
Compress Slice
- Interpolation filter (IF)
- Motion estimation (ME)
- Transform-Quantization (TR-Q)
- Intra prediction
- MV derivation
- Mode decision
- Entropy encoding (CABAC update)
IF: 21548.62 ms
RDOQ: 2645.55 ms
TR: 1687.37 ms
ITR: 653.2829 ms
DBF: 9.42 ms
SAO: 77.33 ms
Encode Slice
- Entropy encoding
KWHEVC encoder
ANSI C HEVC encoder software based on HM encoder
Clean-up of functions and variables
Non-recursive function calls
Minimum memory allocation and bandwidth
Explicit minimum memory allocations (using static memory)
Removal of code related to duplicate variables and structures to avoid redundant memory copy
Removal of unnecessary memory allocation
Software optimization
SIMD implementation (cost function, transform, interpolation, deblocking, ...)
Frame-level interpolation filter
Parallel processing
Slice-level parallel processing using OpenMP
Motion estimation using CUDA
Performance of KWHEVC
1) C converting: 18% ATS gain (without any BD-BR, BD-PSNR loss)
2) + SIMD + frame-level IF: 2x speedup (without any BD-BR, BD-PSNR loss)
3) + Fast mode decision: 5x speedup (1~2% BD-BR loss)
4) + Slice-level parallel: 20x speedup (4~6% BD-BR loss)
5) + CUDA ME & MD (low delay P, adjusted config.): 200x speedup (15~20% BD-BR loss) {Intel i7 (3.3 GHz), GeForce 660} => 10 fps
FIGURE. Encoding speed in terms of the development steps
TABLE. Encoding speed of KWHEVC (FPS)
Class   Sequence          Frames   QP 22   QP 27   QP 32   QP 37
B       Kimono            240      5.74    7.25    8.38    9.40
B       ParkScene         240      5.51    7.52    8.87    10.03
B       Cactus            500      5.19    7.70    9.09    10.09
B       BasketballDrive   500      4.80    6.71    8.09    9.18
B       BQTerrace         600      4.14    7.68    9.60    10.62
C       BasketballDrill   500      14.86   19.07   23.60   28.12
C       BQMall            600      14.81   19.88   24.91   29.20
C       PartyScene        500      11.09   16.46   22.03   27.60
C       RaceHorses        300      10.48   14.60   19.46   24.49
Comparison of decoder complexity
HM10.0 (C++) vs. KWHEVC decoder (C89)
C conversion
Software optimization
Decoding performance
Sequences                         HM10.0 (sec)   FPS     KWHEVC (sec)   FPS     Ratio
BQTerrace_1920x1080_60_qp22.bin   98.271         6.11    71.007         8.45    1.38
BQTerrace_1920x1080_60_qp27.bin   46.531         12.89   30.778         19.49   1.51
BQTerrace_1920x1080_60_qp32.bin   32.737         18.33   19.234         31.19   1.70
BQTerrace_1920x1080_60_qp37.bin   28.189         21.28   15.912         37.71   1.77
Cactus_1920x1080_50_qp22.bin      51.355         9.74    36.270         13.79   1.42
Cactus_1920x1080_50_qp27.bin      31.371         15.94   20.155         24.81   1.56
Cactus_1920x1080_50_qp32.bin      25.506         19.60   15.381         32.51   1.66
Cactus_1920x1080_50_qp37.bin      21.933         22.80   12.792         39.09   1.71
Parallelism and SIMD processing
Parallelism
The decoder cannot expect the tile or slice partitioning of pictures
The decoder should consider worst-case bitstreams
The entropy decoder cannot be parallelized
CTU-based 2D wavefront parallel processing is a promising way for parallelism
Deblocking filter and SAO are more suitable for parallelism
Less data dependency
SIMD processing
Inverse transform (X = A^T Y A)
Motion compensation
About 40% of decoder complexity
8-tap and 4-tap filters
Performance of the optimized KWHEVC decoder
SIMD and parallelization
Pixel reconstruction, interpolation (partial)
Task-level parallelism (entropy, pixel decoding)
Data-level parallelism (deblocking filter)
2.28Mbps
4.98
2.93
Conclusion
Overview of HEVC
Encoding parameters for HEVC test model (HM)
Complexity analysis of HEVC encoder
Fast encoding algorithms and performances
Issues of parallel processing