Amdahl's Law

Multicore and Parallel Processing
Hakim Weatherspoon
CS 3410, Spring 2012
Computer Science
Cornell University
P&H Chapter 4.10-11, 7.1-6
xkcd/619
Pitfall: Amdahl's Law
$$T_{\text{after improvement}} = \frac{T_{\text{affected}}}{\text{amount of improvement}} + T_{\text{unaffected}}$$
Pitfall: improving an aspect of a computer and expecting a proportional improvement in overall performance
Example: multiply accounts for 80 s out of 100 s
How much improvement do we need in the multiply performance to get a 5× overall improvement?
A 5× overall improvement means a total time of 20 s, so we would need $20 = \frac{80}{n} + 20$, which no finite $n$ satisfies.
Can't be done!
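The multiply example above can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names `time_after_improvement` and `required_factor` are made up for illustration.

```python
def time_after_improvement(affected, unaffected, factor):
    """Amdahl's Law: only the affected part shrinks by `factor`."""
    return affected / factor + unaffected


def required_factor(affected, unaffected, target_time):
    """Improvement factor needed to reach target_time, or None if unreachable."""
    remaining = target_time - unaffected
    if remaining <= 0:
        # The unaffected part alone already uses up the whole time budget.
        return None
    return affected / remaining


total, multiply = 100.0, 80.0
other = total - multiply  # 20 s that the multiply improvement cannot touch

# A 2x overall improvement (50 s total) is reachable with a finite factor.
print(required_factor(multiply, other, total / 2))
# A 5x overall improvement (20 s total) leaves no time for multiply at all.
print(required_factor(multiply, other, total / 5))
```

The second call returns `None`: the 20 s of unaffected work already equals the 20 s target, which is the "can't be done" case from the slide.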
Scaling Example
Workload: sum of 10 scalars, and 10 × 10 matrix sum
Speedup from 10 to 100 processors?
Single processor: Time = (10 + 100) × t_add
10 processors
  Time = 100/10 × t_add + 10 × t_add = 20 × t_add
  Speedup = 110/20 = 5.5 (55% of potential)
100 processors
  Time = 100/100 × t_add + 10 × t_add = 11 × t_add
  Speedup = 110/11 = 10 (10% of potential)
Assumes load can be balanced across processors
Scaling Example
What if matrix size is 100 × 100?
Single processor: Time = (10 + 10000) × t_add
10 processors
  Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add
  Speedup = 10010/1010 = 9.9 (99% of potential)
100 processors
  Time = 10 × t_add + 10000/100 × t_add = 110 × t_add
  Speedup = 10010/110 = 91 (91% of potential)
Assuming load balanced
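The two scaling examples follow the same arithmetic, so they can be reproduced in a few lines. This is a sketch under the slides' own assumptions (the scalar sum is serial, the matrix sum splits evenly across processors); the function names are illustrative only.

```python
def parallel_time(scalars, matrix_elems, processors):
    """Time in units of t_add: scalar sum stays serial, matrix sum is split evenly."""
    return scalars + matrix_elems / processors


def speedup(scalars, matrix_elems, processors):
    """Speedup relative to running the same workload on one processor."""
    single = parallel_time(scalars, matrix_elems, 1)
    return single / parallel_time(scalars, matrix_elems, processors)


for n in (10, 100):              # matrix is n x n
    for p in (10, 100):          # number of processors
        s = speedup(10, n * n, p)
        print(f"{n}x{n} matrix, {p:3d} processors: "
              f"speedup {s:5.1f} ({100 * s / p:.0f}% of potential)")
```

The larger problem keeps 100 processors far busier than the small one, because the fixed 10-add serial part becomes a much smaller share of the total work.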
Goals for Today
How to improve System Performance?
  Instruction Level Parallelism (ILP)
  Multicore
    Increase clock frequency vs multicore
  Beware of Amdahl's Law
Next time:
  Concurrency, programming, and synchronization
Problem Statement
Q: How to improve system performance?
  Increase CPU clock rate?
  But I/O speeds are limited
    Disk, Memory, Networks, etc.
Recall: Amdahl's Law
Solution: Parallelism
Pipelining: execute multiple instructions in parallel
Q: How to get more instruction level parallelism?
A: Deeper pipeline
  E.g. 250 MHz 1-stage; 500 MHz 2-stage; 1 GHz 4-stage; 4 GHz 16-stage
Pipeline depth limited by...
  max clock speed (less work per stage, so shorter clock cycle)
  min unit of work
  dependencies, hazards / forwarding logic
Pipelining: execute multiple instructions in parallel
Q: How to get more instruction level parallelism?
A: Multiple issue pipeline
  Start multiple instructions per clock cycle in duplicate stages
Static Multiple Issue
a.k.a. Very Long Instruction Word (VLIW)
Compiler groups instructions to be issued together
  Packages them into "issue slots"
Q: How does HW detect and resolve hazards?
A: It doesn't.
  Simple HW, assumes compiler avoids hazards
Example: Static Dual-Issue 32-bit MIPS
  Instructions come in pairs (64-bit aligned)
    One ALU/branch instruction (or nop)
    One load/store instruction (or nop)
MIPS with Static Dual Issue
Two-issue packets
  One ALU/branch instruction
  One load/store instruction
  64-bit aligned
    ALU/branch, then load/store
    Pad an unused instruction with nop

Address   Instruction type   Pipeline stages (one column per cycle)
n         ALU/branch         IF   ID   EX   MEM  WB
n + 4     Load/store         IF   ID   EX   MEM  WB
n + 8     ALU/branch              IF   ID   EX   MEM  WB
n + 12    Load/store              IF   ID   EX   MEM  WB
n + 16    ALU/branch                   IF   ID   EX   MEM  WB
n + 20    Load/store                   IF   ID   EX   MEM  WB
Scheduling Example
Schedule this for dual-issue MIPS

Loop: lw   $t0, 0($s1)        # $t0 = array element
      addu $t0, $t0, $s2      # add scalar in $s2
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # decrement pointer
      bne  $s1, $zero, Loop   # branch $s1 != 0

      ALU/branch              Load/store         cycle
Loop: nop                     lw  $t0, 0($s1)    1
      addi $s1, $s1, -4       nop                2
      addu $t0, $t0, $s2      nop                3
      bne  $s1, $zero, Loop   sw  $t0, 4($s1)    4

IPC = 5/4 = 1.25 (cf. peak IPC = 2)
Scheduling Example
Compiler scheduling for dual-issue MIPS...

Loop: lw   $t0, 0($s1)       # $t0 = A[i]
      lw   $t1, 4($s1)       # $t1 = A[i+1]
      addu $t0, $t0, $s2     # add $s2
      addu $t1, $t1, $s2     # add $s2
      sw   $t0, 0($s1)       # store A[i]
      sw   $t1, 4($s1)       # store A[i+1]
      addi $s1, $s1, +8      # increment pointer
      bne  $s1, $s3, Loop    # continue if $s1 != end

      ALU/branch slot        Load/store slot     cycle
Loop: nop                    lw  $t0, 0($s1)     1
      nop                    lw  $t1, 4($s1)     2
      addu $t0, $t0, $s2     nop                 3
      addu $t1, $t1, $s2     sw  $t0, 0($s1)     4
      addi $s1, $s1, +8      sw  $t1, 4($s1)     5
      bne  $s1, $s3, Loop    nop                 6
Scheduling Example

Loop: lw   $t0, 0($s1)       # $t0 = A[i]
      lw   $t1, 4($s1)       # $t1 = A[i+1]
      addu $t0, $t0, $s2     # add $s2
      addu $t1, $t1, $s2     # add $s2
      sw   $t0, 0($s1)       # store A[i]
      sw   $t1, 4($s1)       # store A[i+1]
      addi $s1, $s1, +8      # increment pointer
      bne  $s1, $s3, Loop    # continue if $s1 != end

      ALU/branch slot        Load/store slot     cycle
Loop: nop                    lw  $t0, 0($s1)     1
      addi $s1, $s1, +8      lw  $t1, 4($s1)     2
      addu $t0, $t0, $s2     nop                 3
      addu $t1, $t1, $s2     sw  $t0, -8($s1)    4
      bne  $s1, $s3, Loop    sw  $t1, -4($s1)    5

Moving the addi up to cycle 2 means the stores see the already-incremented pointer, so their offsets become -8 and -4.
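The three dual-issue schedules above can be compared by counting useful instructions per cycle. The sketch below encodes each schedule as a list of (ALU/branch, load/store) packets; the encoding and the helper name `ipc` are assumptions made for illustration, not something from the slides.

```python
NOP = None  # an empty issue slot


def ipc(schedule):
    """Useful instructions per cycle; each packet is (ALU/branch, load/store)."""
    useful = sum(1 for packet in schedule for slot in packet if slot is not NOP)
    return useful / len(schedule)


# One array element per iteration, 4 cycles.
simple_loop = [
    (NOP, "lw $t0, 0($s1)"),
    ("addi $s1, $s1, -4", NOP),
    ("addu $t0, $t0, $s2", NOP),
    ("bne $s1, $zero, Loop", "sw $t0, 4($s1)"),
]

# Unrolled twice, pointer increment left near the end, 6 cycles.
unrolled = [
    (NOP, "lw $t0, 0($s1)"),
    (NOP, "lw $t1, 4($s1)"),
    ("addu $t0, $t0, $s2", NOP),
    ("addu $t1, $t1, $s2", "sw $t0, 0($s1)"),
    ("addi $s1, $s1, +8", "sw $t1, 4($s1)"),
    ("bne $s1, $s3, Loop", NOP),
]

# Unrolled twice, pointer increment hoisted, 5 cycles.
unrolled_tight = [
    (NOP, "lw $t0, 0($s1)"),
    ("addi $s1, $s1, +8", "lw $t1, 4($s1)"),
    ("addu $t0, $t0, $s2", NOP),
    ("addu $t1, $t1, $s2", "sw $t0, -8($s1)"),
    ("bne $s1, $s3, Loop", "sw $t1, -4($s1)"),
]

for name, sched, elems in [("simple", simple_loop, 1),
                           ("unrolled", unrolled, 2),
                           ("unrolled, tight", unrolled_tight, 2)]:
    print(f"{name:16s} IPC = {ipc(sched):.2f}, "
          f"cycles per element = {len(sched) / elems:.1f}")
```

Cycles per array element is arguably the more telling number here: unrolling amortizes the pointer update and the branch over two elements.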
Limits of Static Scheduling

lw   $t0, 0($s1)       # load A
addi $t0, $t0, +1      # increment A
sw   $t0, 0($s1)       # store A
lw   $t0, 0($s2)       # load B
addi $t0, $t0, +1      # increment B
sw   $t0, 0($s2)       # store B

ALU/branch slot        Load/store slot     cycle
nop                    lw  $t0, 0($s1)     1
nop                    nop                 2
addi $t0, $t0, +1      nop                 3
nop                    sw  $t0, 0($s1)     4
nop                    lw  $t0, 0($s2)     5
nop                    nop                 6
addi $t0, $t0, +1      nop                 7
nop                    sw  $t0, 0($s2)     8
Limits of Static Scheduling

lw   $t0, 0($s1)       # load A
addi $t0, $t0, +1      # increment A
sw   $t0, 0($s1)       # store A
lw   $t1, 0($s2)       # load B
addi $t1, $t1, +1      # increment B
sw   $t1, 0($s2)       # store B

ALU/branch slot        Load/store slot     cycle
nop                    lw  $t0, 0($s1)     1
nop                    lw  $t1, 0($s2)     2
addi $t0, $t0, +1      nop                 3
addi $t1, $t1, +1      sw  $t0, 0($s1)     4
nop                    sw  $t1, 0($s2)     5

Problem: What if $s1 and $s2 are equal (aliasing)? Won't work
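The aliasing problem can be demonstrated with a toy interpreter. This is a simplified sketch, not a MIPS simulator: it runs instructions one at a time in the listed order (ignoring pipeline timing), and the tuple encoding, register names without `$`, and memory contents are all made up for illustration.

```python
def run(program, memory, regs):
    """Execute (op, reg, operand) tuples in order on copies of the given state."""
    memory, regs = dict(memory), dict(regs)
    for op, reg, operand in program:
        if op == "lw":        # reg <- memory[address held in operand register]
            regs[reg] = memory[regs[operand]]
        elif op == "addi":    # reg <- reg + immediate
            regs[reg] += operand
        elif op == "sw":      # memory[address held in operand register] <- reg
            memory[regs[operand]] = regs[reg]
    return memory


# Original program order: increment A, then increment B.
program_order = [
    ("lw", "t0", "s1"), ("addi", "t0", 1), ("sw", "t0", "s1"),
    ("lw", "t1", "s2"), ("addi", "t1", 1), ("sw", "t1", "s2"),
]
# Order produced by the 5-cycle schedule: both loads are hoisted to the top.
scheduled_order = [
    ("lw", "t0", "s1"), ("lw", "t1", "s2"),
    ("addi", "t0", 1), ("addi", "t1", 1),
    ("sw", "t0", "s1"), ("sw", "t1", "s2"),
]

memory = {0: 10, 4: 20}
distinct = {"s1": 0, "s2": 4, "t0": 0, "t1": 0}   # A and B are different words
aliased = {"s1": 0, "s2": 0, "t0": 0, "t1": 0}    # A and B are the same word

print(run(program_order, memory, distinct), run(scheduled_order, memory, distinct))
print(run(program_order, memory, aliased), run(scheduled_order, memory, aliased))
```

With distinct addresses both orders agree. When `s1` and `s2` alias, program order increments the word twice, while the reordered version loads the stale value for B and loses one increment. A compiler that cannot prove the pointers differ has to keep the slower 8-cycle schedule.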
Dynamic Multiple Issue
a.k.a. superscalar processor (cf. Intel)
  CPU examines instruction stream and chooses multiple instructions to issue each cycle
  Compiler can help by reordering instructions...
  ...but CPU is responsible for resolving hazards
Even better: Speculation / Out-of-order Execution
  Execute instructions as early as possible
  Aggressive register renaming
  Guess results of branches, loads, etc.
  Roll back if guesses were wrong
  Don't commit results until all previous instructions are retired
Does Multiple Issue Work?
Q: Does multiple issue / ILP work?
A: Kind of... but not as much as we'd like
Limiting factors?
  Program dependencies
  Hard to detect dependencies, so be conservative
    e.g. pointer aliasing: A[0] += 1; B[0] *= 2;
  Hard to expose parallelism
    Can only issue a few instructions ahead of PC
  Structural limits
    Memory delays and limited bandwidth
  Hard to keep pipelines full
Power Efficiency
Q: Does multiple issue / ILP cost much?
A: Yes.
  Dynamic issue and speculation require power

CPU             Year  Clock Rate  Pipeline Stages  Issue Width  Out-of-order/Speculation  Cores  Power
i486            1989    25 MHz                                  No                                 5 W
Pentium         1993    66 MHz                                  No                                10 W
Pentium Pro     1997   200 MHz    10                            Yes                               29 W
P4 Willamette   2001  2000 MHz    22                            Yes                               75 W
UltraSparc III  2003  1950 MHz    14                            No                                90 W
P4 Prescott     2004  3600 MHz    31                            Yes                              103 W
Core            2006  2930 MHz    14                            Yes                               75 W
UltraSparc T1   2005  1200 MHz                                  No                                70 W

Multiple simpler cores may be better?
Moore's Law
[Figure: processors plotted over time: 4004, 8008, 8080, 8088, 286, 386, 486, Pentium, P4, Atom, K8, Itanium 2, K10, Dual-core Itanium 2]
Why Multicore?
Moore's law
  A law about transistors
  Smaller means more transistors per die
  And smaller means faster too
But: power consumption growing too...
Power Limits
[Figure: reference points labeled Hot Plate, Xeon, Nuclear Reactor, Rocket Nozzle, Surface of Sun]
Power Wall
Power = capacitance × voltage² × frequency
In practice: Power ~ voltage³
  Reducing voltage helps (a lot)
  ... so does reducing clock speed
Better cooling helps
The power wall
  We can't reduce voltage further
  We can't remove more heat
Why Multicore?

Configuration                     Performance  Power
Single core, overclocked +20%     1.2x         1.7x
Single core                       1.0x         1.0x
Single core, underclocked -20%    0.8x         0.51x
Dual core, underclocked -20%      1.6x         1.02x
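The performance and power figures above are consistent with a simple cubic model. The sketch below assumes that voltage is scaled in proportion to clock frequency (so power per core goes as the cube of the clock scale) and that performance scales perfectly with clock and core count; both are simplifying assumptions, and the function names are illustrative.

```python
def relative_power(clock_scale, cores=1):
    """Power relative to one baseline core, assuming power ~ clock_scale**3."""
    return cores * clock_scale ** 3


def relative_performance(clock_scale, cores=1):
    """Performance relative to one baseline core, assuming perfect scaling."""
    return cores * clock_scale


configs = [
    ("single core, overclocked +20%", 1.2, 1),
    ("single core", 1.0, 1),
    ("single core, underclocked -20%", 0.8, 1),
    ("dual core, underclocked -20%", 0.8, 2),
]
for name, scale, cores in configs:
    print(f"{name:32s} performance {relative_performance(scale, cores):.2f}x, "
          f"power {relative_power(scale, cores):.2f}x")
```

Under this model two slower cores deliver 1.6x the performance for roughly the same power as one baseline core, whereas overclocking one core buys 1.2x at about 1.7x the power.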
Inside the Processor
AMD Barcelona Quad-Core: 4 processor cores
Inside the Processor
Intel Nehalem Hex-Core
Hyperthreading

                  Multi-Core  vs.  Multi-Issue  vs.  HT
Programs:         N                1                 N
Num. Pipelines:   N                1                 1
Pipeline Width:   1                N                 N

Hyperthreads
  HT = Multi-Issue + extra PCs and registers - dependency logic
  HT = Multi-Core - redundant functional units + hazard avoidance
Hyperthreads (Intel)
  Illusion of multiple cores on a single core
  Easy to keep HT pipelines full + share functional units
Example: All of the above
Parallel Programming
Q: So let's just all use multicore from now on!
A: Software must be written as parallel program
Multicore difficulties
  Partitioning work
  Coordination & synchronization
  Communications overhead
  Balancing load over cores
  How do you write parallel programs?
    ... without knowing exact underlying architecture?
Work Partitioning
Partition work so all cores have something to do
Load Balancing
Need to partition so all cores are actually working
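Why balance matters can be shown with a small calculation: if cores run independently, the job finishes only when the most heavily loaded core finishes. The sketch below uses made-up work units and a hypothetical helper `speedup_for`, and ignores coordination and communication overhead.

```python
def speedup_for(loads):
    """Speedup over one core when each core gets the given amount of work.

    One core would need sum(loads) time units; in parallel the finish time
    is set by the busiest core, max(loads).
    """
    return sum(loads) / max(loads)


balanced = [25, 25, 25, 25]     # 100 units spread evenly over 4 cores
imbalanced = [40, 20, 20, 20]   # same 100 units, one core overloaded
one_busy = [100, 0, 0, 0]       # cores exist but only one is working

for name, loads in [("balanced", balanced),
                    ("imbalanced", imbalanced),
                    ("one core busy", one_busy)]:
    print(f"{name:14s} speedup = {speedup_for(loads):.2f} on {len(loads)} cores")
```

Moving just 15 units onto one core drops the 4-core speedup from 4.0 to 2.5, even though every core still has something to do.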
Amdahl's Law
If tasks have a serial part and a parallel part...
Example:
  step 1: divide input data into n pieces
  step 2: do work on each piece
  step 3: combine all results
Recall: Amdahl's Law
As number of cores increases...
  time to execute parallel part? Goes to zero
  time to execute serial part? Remains the same
  Serial part eventually dominates
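The "serial part eventually dominates" claim can be made concrete with the usual speedup form of Amdahl's Law, speedup = 1 / (s + (1 - s)/n), where s is the serial fraction and n the number of cores. The 10% serial fraction below is an arbitrary value chosen for illustration, and perfect scaling of the parallel part is assumed.

```python
def amdahl_speedup(serial_fraction, cores):
    """Speedup on `cores` cores when `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


serial_fraction = 0.10  # assumed: steps 1 and 3 (divide, combine) take 10% of the time

for cores in (1, 2, 4, 16, 64, 1024):
    print(f"{cores:5d} cores: speedup {amdahl_speedup(serial_fraction, cores):6.2f}")

# The limit as cores grow without bound is 1 / serial_fraction.
print(f"upper bound: {1.0 / serial_fraction:.1f}")
```

However many cores are added, the speedup stays below 1 / s, which is 10 for this choice of s.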
Administrivia
FlameWar Games Night next Friday, April 27th
  5pm in Upson B17
  Please come, eat, drink and have fun
No Lab4 or Lab Section next week!
Administrivia
PA3: FlameWar is due next Monday, April 23rd
  The goal is to have fun with it
  Recitations today will talk about it
HW6 due next Tuesday, April 24th
Prelim3 next Thursday, April 26th
