RHK.SP96 1
Review From Last Time
Pipelining Lessons
[Figure: pipelined laundry analogy; tasks A, B, C, D in task order, overlapping from 6 PM to 9 PM, with timeline intervals of 30, 40, 40, 40, 40, and 20 minutes]
• Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" pipeline and time to "drain" it reduces speedup
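The lessons above can be made concrete with a small timing sketch. The three-stage wash/dry/fold split and the 30/40/20-minute stage times are assumptions for illustration:

```python
def sequential_time(stage_times, n_tasks):
    # Each task runs start-to-finish before the next one begins.
    return sum(stage_times) * n_tasks

def pipelined_time(stage_times, n_tasks):
    # Synchronous pipeline: every stage advances at the rate of the
    # slowest stage, so the slowest stage sets the cycle time.
    cycle = max(stage_times)
    n_stages = len(stage_times)
    # n_stages cycles to fill the pipe for the first task,
    # then one more cycle per additional task.
    return cycle * (n_stages + n_tasks - 1)

stages = [30, 40, 20]   # wash, dry, fold (minutes), assumed values
print(sequential_time(stages, 4))   # 360
print(pipelined_time(stages, 4))    # 240
```

Note the speedup is only 360/240 = 1.5, well below the 3-stage potential, because the stages are unbalanced and the fill/drain time is significant for only four tasks.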
Review From Last Time:
Software Scheduling to Avoid
Load Hazards
Try producing fast code for
a = b + c;
d = e – f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
  LW   Rb,b
  LW   Rc,c
  ADD  Ra,Rb,Rc
  SW   a,Ra
  LW   Re,e
  LW   Rf,f
  SUB  Rd,Re,Rf
  SW   d,Rd

Fast code:
  LW   Rb,b
  LW   Rc,c
  LW   Re,e
  ADD  Ra,Rb,Rc
  LW   Rf,f
  SW   a,Ra
  SUB  Rd,Re,Rf
  SW   d,Rd
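A rough way to compare the two schedules is to count load-use stalls mechanically. This sketch assumes a single load delay slot (only the instruction immediately after an LW can stall on its result) and encodes each instruction as a (opcode, destination, sources...) tuple:

```python
def count_stalls(program):
    # Stall one cycle if an instruction reads a register that the
    # immediately preceding LW is still loading (1-cycle load delay).
    stalls = 0
    prev_op, prev_dst = None, None
    for op, dst, *srcs in program:
        if prev_op == "LW" and prev_dst in srcs:
            stalls += 1
        prev_op, prev_dst = op, dst
    return stalls

slow = [("LW","Rb","b"), ("LW","Rc","c"), ("ADD","Ra","Rb","Rc"),
        ("SW","a","Ra"), ("LW","Re","e"), ("LW","Rf","f"),
        ("SUB","Rd","Re","Rf"), ("SW","d","Rd")]
fast = [("LW","Rb","b"), ("LW","Rc","c"), ("LW","Re","e"),
        ("ADD","Ra","Rb","Rc"), ("LW","Rf","f"), ("SW","a","Ra"),
        ("SUB","Rd","Re","Rf"), ("SW","d","Rd")]

print(count_stalls(slow))  # 2 (ADD after LW Rc,c; SUB after LW Rf,f)
print(count_stalls(fast))  # 0
```

The reordered code hides both load latencies by moving an independent load into each delay slot.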
Review From Last Time:
Pipelining Summary
• Just overlap tasks; pipelining is easy if the tasks are independent
• Speedup ≤ Pipeline Depth; if ideal CPI is 1, then:

      Speedup = Pipeline Depth / (1 + Pipeline stall CPI) × (Clock Cycle Unpipelined / Clock Cycle Pipelined)
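The speedup formula can be evaluated with a worked example. The numbers here (5-stage pipeline, 0.3 stall cycles per instruction, identical cycle times) are assumptions for illustration:

```python
# Speedup = Pipeline Depth / (1 + Pipeline stall CPI)
#           × (Clock Cycle Unpipelined / Clock Cycle Pipelined)
pipeline_depth = 5
stall_cpi = 0.3     # assumed average stall cycles per instruction
clock_ratio = 1.0   # assume both machines run at the same clock cycle time

speedup = pipeline_depth / (1 + stall_cpi) * clock_ratio
print(round(speedup, 2))   # 3.85
```

Even modest stall CPI pulls the speedup well below the ideal of 5.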
Pipelined DLX Datapath
Figure 3.4, page 134
[Figure: the five DLX pipeline stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back]
Branch Stall Impact
Pipelined DLX Datapath
Figure 3.22, page 163
ERROR IN TEXTBOOK!!
Pipelined DLX Datapath
Figure 3.22, page 163
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% DLX branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% DLX branches taken on average
– But haven’t calculated branch target address in DLX
» DLX still incurs 1 cycle branch penalty
» Other machines: branch target known before outcome
Four Branch Hazard Alternatives
#4: Delayed Branch
– Branch takes effect only after the instructions in its delay slots:
      branch instruction
      sequential successor_1
      sequential successor_2
      ........
      sequential successor_n     (branch delay of length n)
      branch target if taken
Delayed Branch
Evaluating Branch Alternatives
Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)
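The formula lets the strategies be compared numerically. The 53% taken figure comes from the slides above; the 14% branch frequency and the per-strategy average penalties are assumptions for illustration:

```python
depth = 5
branch_freq = 0.14   # assumed fraction of instructions that are branches
# Average branch penalty in cycles for each strategy (assumed):
penalties = {
    "stall":             1.0,         # always stall one cycle
    "predict not taken": 0.53 * 1.0,  # pay 1 cycle only on taken branches (53%)
    "delayed branch":    0.5 * 1.0,   # assume half the delay slots go unfilled
}
for name, penalty in penalties.items():
    speedup = depth / (1 + branch_freq * penalty)
    print(f"{name:20s} speedup = {speedup:.2f}")
```

The ranking depends heavily on how often the compiler can fill the delay slot, which is why the assumed fill rate matters.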
Compiler “Static” Prediction of
Taken/Untaken Branches
• Improves strategy for placing instructions in delay slot
• Two strategies
– Backward branch predict taken, forward branch not taken
– Profile-based prediction: record branch behavior, predict branch
based on prior run
[Figure: two bar charts, "Frequency of Misprediction" and "Misprediction Rate" (scales 0–70% and 0–14%), over the benchmarks espresso, doduc, gcc, tomcatv, compress, hydro2d, swm256, mdljsp2, ora, and alvinn, comparing "Always taken" vs. "Taken backwards / Not taken forwards"]
Evaluating Static Branch Prediction Strategies
• "Instructions between mispredicted branches" is a better metric
[Figure: instructions between mispredicted branches (log scale, 1–1000) for doduc, gcc, hydro2d, tomcatv, espresso, compress, swm256, ora, mdljsp2, and alvinn, comparing profile-based vs. direction-based prediction]
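The metric follows directly from two rates. This sketch uses hypothetical branch frequencies and misprediction rates for illustration:

```python
def instructions_between_mispredicts(branch_freq, mispredict_rate):
    # One misprediction occurs every
    # 1 / (branch frequency * misprediction rate) instructions.
    return 1.0 / (branch_freq * mispredict_rate)

# Hypothetical: 20% branches, 10% vs. 1% misprediction rate
print(instructions_between_mispredicts(0.20, 0.10))  # about 50
print(instructions_between_mispredicts(0.20, 0.01))  # about 500
```

Folding in branch frequency is what makes this a better metric than misprediction rate alone: a low rate on a branch-heavy program can still mean frequent mispredictions.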
Pipelining Complications
• Interrupts: 5 instructions executing in 5 stage pipeline
– How to stop the pipeline?
– How to restart the pipeline?
– Who caused the interrupt?
Pipelining Complications
• Floating Point: long execution time
• Also, the FP execution units may themselves be pipelined, so they can initiate new instructions without waiting out the full latency
Summary of Pipelining Basics
• Hazards limit performance
– Structural: need more HW resources
– Data: need forwarding, compiler scheduling
– Control: early evaluation & PC, delayed branch, prediction
• Increasing length of pipe increases impact of hazards;
pipelining helps instruction bandwidth, not latency
• Interrupts, Instruction Set, FP make pipelining harder
• Compilers reduce cost of data and control hazards
– Load delay slots
– Branch delay slots
– Branch prediction
• Next time: Longer pipelines (R4000) => Better branch
prediction, more instruction parallelism?