
Exposing ILP in Custom Hardware with a Dataflow Compiler IR

Ali Mustafa Zaidi
Supervisor: Dr. David Greaves
University of Cambridge Computer Laboratory

The Dark Silicon Problem

Amdahl's Law + Utilization Wall = Dark Silicon

Utilization Wall (fraction of the chip usable at full frequency):
  2.1GHz, 90nm (80W): 18%
  5.2GHz, 45nm (80W): 7%
  7.3GHz, 32nm (80W): 3%

Projected speed-up, 45nm → 8nm (32x resources):
  CPU: 3.5x, GPU 2.4x (Cnsrv.)
  CPU: 7.9x, GPU 2.7x (ITRS)

Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling", IEEE Micro 2012.
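The Amdahl's Law half of the equation above can be made concrete with a small Python sketch. Only the 32x resource factor comes from the slide; the parallel fractions below are illustrative assumptions, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction: float, n_units: float) -> float:
    """Amdahl's Law: speed-up when only parallel_fraction of the work scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_units)


if __name__ == "__main__":
    # 32x resources is taken from the slide; the fractions are hypothetical.
    for f in (0.5, 0.9, 0.99):
        print(f, round(amdahl_speedup(f, 32), 2))
```

Even with 32x the resources, the serial fraction dominates quickly, which is why sequential performance still matters.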

Can we achieve Superscalar Performance, w/o Superscalar Overheads?

Solution: Spatial Architectures?

Custom Hardware, FPGAs, CGRAs, MPPAs, etc.

Advantages
  Scalable, decentralized architectures, with short, p2p wiring.
  High computational density.
  10-1000x energy and performance efficiency.

Issues
  Poor programmability: often requiring low-level hardware knowledge.
  Limited amenability: poor performance on sequential, irregular, or complex control-flow code.

Examples
  Conservation Cores: performance ≈ in-order MIPS24KE core.
  Phoenix CASH Hardware: performance 30% less than 4-way OOO core.

Solution: Spatial Architectures?

Key Reasons for High Performance of Complex, OOO Superscalars:
  Aggressive control-flow speculation
  Dynamic, out-of-order execution scheduling

Custom hardware has very limited speculation
  Single flow of control
  If-conversion & hyperblock formation for forward branches.
  No acceleration of backwards branches!

[Figure: Control-Data Flow Graph. Start; i = 0; load A[i]; A[i] > 0 (T: foo()); i++ < 100 (loop back-edge); bar(); End]

McFarlin et al., "Discerning the dominant out-of-order performance advantage: is it speculation or dynamism?", ASPLOS '13

Solution: Spatial Architectures!

Our Solution
  Instead of CDFG IR + Compile-time Execution Scheduling
  We Employ VSFG IR + Dataflow Execution Model

[Figure: Control-Data Flow Graph. Start; i = 0; load A[i]; A[i] > 0 (T: foo()); i++ < 100 (loop back-edge); bar(); End]

Overcoming Control-Flow with the VSFG

Hierarchical Dataflow Graph
  Instead of 'Basic Blocks + Control Flow', we have 'Nested Subgraphs + Dataflow'
  Functions → nested subgraphs
  Loops → tail-recursive functions.

Dataflow execution of operations
  Multiple subgraphs may execute concurrently in dataflow order (unlike basic blocks).
  Exposes Multiple Flows of Control!

[Figure: Value State Flow Graph. Inputs: inPred, i = 0, A, STATE_IN. Nodes: load A[i], > 0, foo(), i++, < 100, F/P/T multiplexers, nested subgraph "Next iteration of 'for' loop", bar(). Output: STATE_OUT]
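To illustrate the "loops → tail-recursive functions" idea, here is a minimal Python sketch of the slide's example loop in both forms. The explicit `state` list standing in for the state edge, and `foo` taking an argument, are assumptions made for illustration; the VSFG itself is a graph IR, not Python.

```python
def loop_iterative(A, foo):
    """Conventional form: one loop, one flow of control."""
    state = []
    for i in range(len(A)):
        if A[i] > 0:
            state.append(foo(A[i]))
    return state


def loop_body(A, i, state, foo):
    """One iteration as a subgraph; the next iteration is a tail call."""
    if i >= len(A):                      # loop-exit predicate
        return state                     # corresponds to STATE_OUT
    if A[i] > 0:
        state = state + [foo(A[i])]      # thread the state edge through foo()
    return loop_body(A, i + 1, state, foo)   # nested "next iteration" subgraph


if __name__ == "__main__":
    data = [3, -1, 0, 5]
    print(loop_iterative(data, lambda v: v * 2))
    print(loop_body(data, 0, [], lambda v: v * 2))
```

Because the recursive call is in tail position, each iteration is just another instance of the same subgraph, which is what makes the graph acyclic.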

Overcoming Control-Flow with the VSFG

Infinite DAG
  Loops represented as tail recursion
  Branches represented via if-conversion
  Enables Aggressive Speculation!

No single 'Flow of Control'
  Instead, control implemented via 'Boolean Predicate Expressions'.
  Logic minimization can simplify expressions, facilitating Control Dependence Analysis!
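A minimal Python sketch of if-conversion with Boolean predicate expressions follows. The function and its three branches are hypothetical, chosen only to show nested branches turning into guarded values.

```python
def branchy(a, b):
    """Original nested control flow."""
    if a > 0:
        if b > 0:
            return a + b
        return a - b
    return -a


def if_converted(a, b):
    """If-converted form: all operations run, predicates select the result."""
    p1 = a > 0                      # outer branch predicate
    p2 = b > 0                      # inner branch predicate
    # Speculatively compute every candidate value.
    v_add, v_sub, v_neg = a + b, a - b, -a
    # Each value is guarded by a Boolean predicate expression;
    # exactly one guard is true for any input.
    guards = [(p1 and p2, v_add), (p1 and not p2, v_sub), (not p1, v_neg)]
    return next(v for g, v in guards if g)


if __name__ == "__main__":
    print(branchy(2, -1), if_converted(2, -1))
```

Logic minimization shows up even here: `(p1 and p2) or (p1 and not p2)` reduces to `p1`, so anything needed on both inner paths depends only on the outer predicate.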

Overcoming Control-Flow with the VSFG

Hierarchical Dataflow Graph
  Subgraphs may be 'predicated', or executed speculatively (via 'if-conversion').
  'Flattening' loop tail-call subgraphs → loop unrolling/pipelining.
  Multiple loops in a loop-nest may be unrolled independently to expose ILP.
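The link between flattening a tail call and unrolling can be sketched in Python. The summation loop is a hypothetical example; the point is only that inlining the tail call once yields a body covering two iterations.

```python
def body_once(A, i, acc):
    """Tail-recursive loop: one iteration per call."""
    if i >= len(A):
        return acc
    return body_once(A, i + 1, acc + A[i])


def body_flattened(A, i, acc):
    """Tail call inlined once: two iterations per call (unrolled by 1)."""
    if i >= len(A):
        return acc
    acc = acc + A[i]                     # iteration i
    if i + 1 >= len(A):
        return acc
    acc = acc + A[i + 1]                 # iteration i + 1 (inlined tail call)
    return body_flattened(A, i + 2, acc)


if __name__ == "__main__":
    values = list(range(7))
    print(body_once(values, 0, 0), body_flattened(values, 0, 0))
```

In a dataflow setting the two inlined iterations sit in one subgraph, so their independent operations can overlap.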


High Level Synthesis Case Study

Any High Level Language → LLVM → VSFG → Low-Level IR → Bluespec SystemVerilog → ASIC / FPGA

LLVM IR:

    %1 = mul i32 %x, %y;
    %2 = srem i32 %1, %z;
    %3 = icmp slt i32 %2, %1;

Equivalent Bluespec:

    FIFOF#(int) x      <- mkFIFOF1;
    FIFOF#(int) y      <- mkFIFOF1;
    FIFOF#(int) z      <- mkFIFOF1;
    FIFOF#(int) srem_1 <- mkFIFOF1;
    FIFOF#(int) icmp_1 <- mkFIFOF1;
    FIFOF#(int) icmp_2 <- mkFIFOF1;
    FIFOF#(int) out_3  <- mkFIFOF1;

    rule mul_inst;
        let vl1 = x.first; x.deq;
        let vl2 = y.first; y.deq;
        let rslt = vl1 * vl2;
        srem_1.enq(rslt);
        icmp_1.enq(rslt);
    endrule

    rule srem_inst;
        let vl1 = srem_1.first; srem_1.deq;
        let vl2 = z.first; z.deq;
        let rslt = vl1 % vl2;
        icmp_2.enq(rslt);
    endrule
    ...
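The firing behaviour of these rules can be approximated in software. The Python sketch below is a simplification under stated assumptions: FIFOs are unbounded deques rather than one-element `mkFIFOF1` queues, rules are tried in a fixed order rather than concurrently, and the `icmp_inst` rule (elided on the slide) is inferred from the LLVM IR.

```python
from collections import deque


def srem(a, b):
    """LLVM-style signed remainder: the result takes the dividend's sign."""
    r = abs(a) % abs(b)
    return r if a >= 0 else -r


def run_pipeline(xs, ys, zs):
    """Fire the mul/srem/icmp rules until no rule has its inputs ready."""
    x, y, z = deque(xs), deque(ys), deque(zs)
    srem_1, icmp_1, icmp_2, out_3 = deque(), deque(), deque(), deque()
    fired = True
    while fired:
        fired = False
        if x and y:                          # rule mul_inst
            r = x.popleft() * y.popleft()
            srem_1.append(r)
            icmp_1.append(r)
            fired = True
        if srem_1 and z:                     # rule srem_inst
            icmp_2.append(srem(srem_1.popleft(), z.popleft()))
            fired = True
        if icmp_1 and icmp_2:                # rule icmp_inst (assumed): %2 < %1
            out_3.append(icmp_2.popleft() < icmp_1.popleft())
            fired = True
    return list(out_3)


if __name__ == "__main__":
    print(run_pipeline([3, -4], [5, 2], [4, 3]))
```

Each rule fires as soon as its input queues are non-empty, with no global schedule, which is the dynamic (dataflow) scheduling the toolchain relies on.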

Hardware Oriented Dataflow IR

Performance and Energy Evaluation by comparing with
  LegUp HLS Tool, & Altera Nios II/f Processor, implemented on Altera Stratix IV GX FPGA.
  Nehalem Core i7 (Sniper interval simulator from Intel).
  In all cases, memory access latency assumed == 1 cycle.

LegUp                          Our Toolchain
  LLVM 2.9                       LLVM 2.6
  O2                             O2
  No LTO, no LTI                 No LTO, no LTI
  No Op Chaining                 No Op Chaining
  Statically Scheduled CFG       Dynamically Scheduled VSFG

Performance (Cycle Counts)

[Figure: four bar charts of cycle counts, normalised to LegUp: Matrix Transpose (x1k cycles), adpcm (x1k cycles), dfsin (x1k cycles), Neural Net Simulator (x1M cycles). Compared to Nios II/f & Intel Nehalem Core i7 (SniperSim)]

Frequency & Delay

[Figure: Frequency in MHz (Higher is Better), 0 to 450, with a reference line for Nios II/f at 250MHz. Benchmarks: epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips, small_bimpa. Series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3]

[Figure: Normalized Delay (Lower is Better), 0 to 1.4. Same benchmarks and series]

Power and Speculation Overheads

[Figure: misspeculated activity (bits) vs useful activity (bits) per benchmark: epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips]

[Figure: power per benchmark: epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips. Series: LegUp, VSFG_0_2ff, VSFG_1_2ff, VSFG_3_2ff]

Power estimation assuming 250MHz operating frequency

Normalized Energy

[Figure: normalized energy on a log scale (0.1 to 100) per benchmark: epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips, and GEOMEAN. Series: LegUp, VSFG_0, VSFG_1, VSFG_3, Nios]

Sources of Energy Inefficiency

Energy Cost Comparison:
  vs Nios II/f: 0.25x (GEOMEAN)
  vs LegUp: 3-4x (GEOMEAN)

Overheads of Speculation
  Balance between speculation & predication must be found for efficiency & performance

Part of power dissipation proportional to Area
  Clock gating for predicated regions to reduce dynamic power (consider asynchronous ckts)
  Power gating for predicated regions to reduce static power?
  Selective loop unrolling.

Limitations on Performance

35% better performance than statically scheduled CFG, without any optimizations:
  Improvements due to dynamic scheduling, MFC & CDA
  Unrolling helps, but speed-up saturates quickly.

Further Improvements possible:
  Balance between predication & speculation, to improve speed-up without unrolling (thus reducing area and energy costs)
  State-edge is on critical path: limits both unrolling & MFC.
    Last remnant of 'sequential' nature of program.
  Frequency Scaling limited by Memory Interconnect
    Partition memory & pipeline memory access tree

Thank You

Implicit Parallelism & State-edge Partitioning

Increasing Programmer / Compiler Effort: OpenMP, OpenCL, Sieve C++

Assertion: Implicit (deterministic) parallel programming models are essentially means of partitioning the state-edge.

Increasing Runtime Effort: Alias Analysis, Specul. Loads, Dynamic OOO LSQ, SpMT / TLS

Overcoming Control-Flow with the VSFG

[Figure: side-by-side comparison of the Control-Data Flow Graph (Start; i = 0; load A[i]; > 0 (T: foo()); i++ < 100; bar(); End) and the Value State Flow Graph (inputs inPred, i = 0, A, STATE_IN; nodes load A[i], > 0, foo(), i++, < 100, F/P/T multiplexers, "Next iteration of 'for' loop", bar(); output STATE_OUT)]

Performance (Cycle Counts)

[Figure: Cycle Counts with Full Speculation, normalized (0 to 1.6). Benchmarks: epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips, small_bimpa. Series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3]

  Cycle counts normalized to LegUp results
  VSFG implemented with all loops unrolled 0, 1, and 3 times
  Full Speculation: all subgraphs (except loops) triggered without predicates

  Predication: only one block will execute
  Speculation: both blocks execute, but only one result is chosen

[Figure: Cycle Counts with Predicated Subgraphs, normalized (0 to 1.6). Same benchmarks and series]
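The predication/speculation distinction can be sketched as an operation count in Python. This is a rough illustration, not the evaluation methodology of the talk: `run_branch` and the idea of counting callables as "activity" are assumptions.

```python
def run_branch(cond, then_ops, else_ops, speculate):
    """Return (result, useful_ops, wasted_ops) for a two-way branch.

    then_ops and else_ops are lists of zero-argument callables; the last
    callable in a list yields that block's result.
    """
    def run(ops):
        result = None
        for op in ops:
            result = op()
        return result, len(ops)

    taken, other = (then_ops, else_ops) if cond else (else_ops, then_ops)
    result, useful = run(taken)
    wasted = 0
    if speculate:
        _, wasted = run(other)       # both blocks execute; this result is dropped
    return result, useful, wasted


if __name__ == "__main__":
    then_block = [lambda: 1, lambda: 2]
    else_block = [lambda: 3]
    print(run_branch(True, then_block, else_block, speculate=False))
    print(run_branch(True, then_block, else_block, speculate=True))
```

The sketch captures only the cost side: speculation adds discarded work, while its benefit (not waiting for the predicate before starting either block) would need a timing model to show.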

Performance (Cycle Counts)

[Figure: absolute cycle counts, one bar chart per benchmark: epic, adpcm, dfsin, small_bimpa, dfadd, dfdiv, dfmul, mips. Series: Core i7, Nios II/f, LegUp, VSFG_0, VSFG_1, VSFG_3]

Understanding OOO Performance

Control flow is the primary constraint on ILP

  Wall (1991): Conventional processors limited to ILP of [illegible]
    Single flow of control
    Branch prediction (95%+ accuracy)

  Lam & Wilson (1992), Mak & Mycroft (2009): 10x ILP possible, with:
    Control Dependence Analysis (CDA)
    Multiple Flows of Control (MFC)

Custom hardware has very limited speculation
  Single flow of control
  If-conversion & hyperblock formation for forward branches.
  No acceleration of backwards branches!

[Figure: Control-Data Flow Graph. Start; i = 0; load A[i]; A[i] > 0 (T: foo()); i++ < 100 (loop back-edge); bar(); End]

Formalizing & Evaluating the VSFG

Plotkin-style operational semantics developed for VSFG
  Assuming Static Dataflow execution model

Low-Level IR developed to facilitate conversion to Bluespec
  Based on Hierarchical Coloured Petri-nets

High-Level Synthesis Toolchain implemented
  Any High Level Language → LLVM → VSFG → Low-Level IR → Bluespec SystemVerilog → ASIC / FPGA

Hardware Oriented Dataflow IR

LLVM IR:

    %1 = mul i32 %x, %y        ; <i32>
    %2 = srem i32 %1, %z       ; <i32>
    %3 = icmp slt i32 %2, %1   ; <i1>

Value-State Flow Graph → Petri Net based Low Level Dataflow IR:
  Registers → Petri Net Places
  Instructions → Petri Net Transitions

[Figure: Petri net with input places x, y, z and transitions mul %1, srem %2, icmp %3]
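The place/transition reading of the three instructions can be sketched as a tiny token-firing loop in Python. The place names (`m_s`, `m_c`, `r`, `out`) and the `fire_all` helper are hypothetical; the low-level IR on the slide is based on hierarchical coloured Petri nets, which this flat sketch does not attempt to model.

```python
import math


def fire_all(marking, transitions):
    """Fire enabled transitions until none remain.

    marking: dict mapping place name -> list of tokens (values)
    transitions: list of (input_places, output_places, function);
    every transition is assumed to have at least one input place.
    """
    progress = True
    while progress:
        progress = False
        for inputs, outputs, fn in transitions:
            if all(marking[p] for p in inputs):      # a token on every input place
                args = [marking[p].pop(0) for p in inputs]
                value = fn(*args)
                for p in outputs:                    # copy the result to each consumer
                    marking[p].append(value)
                progress = True
    return marking


if __name__ == "__main__":
    marking = {"x": [3], "y": [5], "z": [4], "m_s": [], "m_c": [], "r": [], "out": []}
    transitions = [
        (["x", "y"], ["m_s", "m_c"], lambda a, b: a * b),            # mul  %1
        (["m_s", "z"], ["r"], lambda a, b: int(math.fmod(a, b))),    # srem %2
        (["r", "m_c"], ["out"], lambda a, b: a < b),                 # icmp %3
    ]
    print(fire_all(marking, transitions)["out"])
```

Because `%1` has two consumers, the `mul` transition writes to two places, mirroring the two FIFOs (`srem_1`, `icmp_1`) in the Bluespec version.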

Hardware Oriented Dataflow IR

LLVM IR:

    %1 = mul i32 %x, %y        ; <i32>
    %2 = srem i32 %1, %z       ; <i32>
    %3 = icmp slt i32 %2, %1   ; <i1>

Petri Net based Low Level Dataflow IR:
  Petri Net Places
  Petri Net Transitions

Equivalent Bluespec Code:

    FIFOF#(int) x      <- mkFIFOF1;
    FIFOF#(int) y      <- mkFIFOF1;
    FIFOF#(int) z      <- mkFIFOF1;
    FIFOF#(int) srem_1 <- mkFIFOF1;
    FIFOF#(int) icmp_1 <- mkFIFOF1;
    FIFOF#(int) icmp_2 <- mkFIFOF1;
    FIFOF#(int) out_3  <- mkFIFOF1;

    rule mul_inst;
        let vl1 = x.first; x.deq;
        let vl2 = y.first; y.deq;
        let rslt = vl1 * vl2;
        srem_1.enq(rslt);
        icmp_1.enq(rslt);
    endrule

    rule srem_inst;
        let vl1 = srem_1.first; srem_1.deq;
        let vl2 = z.first; z.deq;
        let rslt = vl1 % vl2;
        icmp_2.enq(rslt);
    endrule
    ...
