
Pipelined Processor

CHAPTER 4

Introduction
Single cycle data path
Run one instruction at a time, sequentially, in the datapath
Each instruction takes one cycle
Slow. Why?
How can we make it faster?
Run multiple instructions at the same time?
Pipelining

Agenda
What is Pipelining?

Improvement over Single Cycle

Pipelined Datapath
Pipelined Control
Hazards and Dependencies

Structural Hazards
Data Hazards
Control Hazards

Exceptions and Interrupts


Advanced Pipelining

Pipelining Analogy: Laundry Example


Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 30 minutes
Folder takes 30 minutes
Stacker takes 30 minutes to put clothes into drawers

Sequential Laundry
Sequential laundry takes 8 hours for 4 loads
[Figure: timeline from 6 PM to 2 AM. Loads A, B, C, and D run one after another in task order; each load occupies four consecutive 30-minute slots, 16 slots in total.]

CSUSM

CS 331

Pipelined Laundry: Start work ASAP


Overlapping execution (Parallelism improves performance)

Pipelined laundry takes 3.5 hours for 4 loads!

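[Figure: timeline starting at 6 PM. Loads A, B, C, and D overlap in task order, each starting 30 minutes after the previous one; the whole workload occupies seven 30-minute slots.]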

What is the speedup for n tasks?


Sequential time for n tasks
Each task requires k balanced stages: n × k time units

Pipeline time for n tasks
n + fill time = n + (k - 1) time units

Speedup
nk / (n + k - 1), which approaches k for large n
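
The speedup formula can be checked numerically. This is a minimal sketch, assuming perfectly balanced stages and no stalls; the function name pipeline_speedup is hypothetical and chosen only for illustration.

```python
def pipeline_speedup(n_tasks: int, k_stages: int) -> float:
    """Speedup of a k-stage pipeline over sequential execution for n tasks.

    Assumes every stage takes one time unit (balanced stages) and no stalls.
    """
    sequential_time = n_tasks * k_stages        # n * k
    pipelined_time = n_tasks + (k_stages - 1)   # n + fill time
    return sequential_time / pipelined_time


# Laundry example: 4 loads, 4 stages of 30 minutes each.
laundry_speedup = pipeline_speedup(4, 4)   # 16 slots vs. 7 slots

# With many tasks the speedup approaches the number of stages.
large_n_speedup = pipeline_speedup(1_000_000, 5)
```

For the laundry example this gives 16/7, which matches 8 hours versus 3.5 hours.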

What about unbalanced pipe stages?

Pipelining Lessons
Pipelining does not help the latency of a single task; it helps the throughput of the entire workload
Multiple tasks operate simultaneously using different resources
Potential max speedup = number of pipe stages
Pipeline rate is limited by the slowest pipeline stage
Unbalanced lengths of pipe stages reduce speedup
Time to fill the pipeline and time to drain it reduce speedup
Stall for dependences

MIPS Pipeline
Five stages, one step per stage

1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register

Pipelined vs. Single Cycle Performance


Compare average time between instructions
Assume time for stages is
100ps for register read or write
200ps for other stages
Instr     | Instr fetch | Register read | ALU op | Memory access | Register write | Total time
lw        | 200ps       | 100ps         | 200ps  | 200ps         | 100ps          | 800ps
sw        | 200ps       | 100ps         | 200ps  | 200ps         |                | 700ps
R-format  | 200ps       | 100ps         | 200ps  |               | 100ps          | 600ps
beq       | 200ps       | 100ps         | 200ps  |               |                | 500ps

Pipeline Performance

Single-cycle
(Tc= 800ps)

Pipelined
(Tc= 200ps)
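
The two clock periods follow from the stage times on the previous slide. This is a minimal sketch, assuming the stage latencies map onto the five MIPS stages as IF = 200ps, ID = 100ps, EX = 200ps, MEM = 200ps, WB = 100ps; all names in the code are hypothetical.

```python
# Stage latencies in picoseconds (register read is ID, register write is WB).
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class actually uses.
STAGES_USED = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def instruction_time_ps(name: str) -> int:
    """Total datapath time for one instruction class."""
    return sum(STAGE_PS[stage] for stage in STAGES_USED[name])


# Single-cycle: the clock must fit the slowest instruction (lw).
SINGLE_CYCLE_TC = max(instruction_time_ps(name) for name in STAGES_USED)

# Pipelined: the clock must fit the slowest stage.
PIPELINED_TC = max(STAGE_PS.values())


def total_time_ps(n_instructions: int, pipelined: bool) -> int:
    """Time to run n instructions, assuming no hazards or stalls."""
    if pipelined:
        return (n_instructions + len(STAGE_PS) - 1) * PIPELINED_TC
    return n_instructions * SINGLE_CYCLE_TC
```

For three instructions this yields 2400ps single-cycle versus 1400ps pipelined; the ratio only approaches 800/200 = 4 as the instruction count grows.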


Pipeline Performance
Pipelining improves performance by increasing instruction throughput
Executes multiple instructions in parallel
Each instruction has the same latency
Subject to hazards that prevent starting the next instruction in the next cycle
Structural hazard: a required resource is busy
Data hazard: need to wait for a previous instruction to complete its data read/write
Control hazard: deciding on control action depends on a previous instruction

Pipelined Datapath - Basic Idea


Divide a single-cycle datapath into 5 stages


Single Clock Cycle Diagram


Pipeline registers to chop the data path into separate stages
To hold information produced in previous cycle


Pipeline Operation
Cycle-by-cycle flow of instructions through the pipelined datapath
Single-clock-cycle pipeline diagram
Shows pipeline usage in a single cycle
Highlight resources used
Multi-clock-cycle diagram
We'll look at single-clock-cycle diagrams for load & store

IF for Load, Store,


ID for Load, Store,


EX for Load


MEM for Load


WB for Load
Did someone spot the error??

Wrong
register
number


Corrected Datapath for Load


Shift write register index through all subsequent pipeline stages


EX for Store


MEM for Store

WB for Store

Multi-Cycle Pipeline Diagram


Form showing resource usage

Can help answer:


o How many cycles to execute this code?
o What is the ALU doing during cycle 4?

Multi-Cycle Pipeline Diagram


Traditional form

Once the pipeline is full, one instruction is completed every cycle, so CPI = 1

Time to fill the pipeline

Single-Cycle Pipeline Diagram


State of pipeline in a given cycle


Question

Some instructions do nothing during some stages.
Should we force every instruction to go through all 5 stages? Can we have R-type take 4 cycles instead of 5?

Selection | Yes/No | Reason (Choose BEST answer)
A         | Yes    | Decreasing R-type to 4 cycles improves instruction throughput
B         | Yes    | Decreasing R-type to 4 cycles improves instruction latency
C         | No     | Decreasing R-type to 4 cycles causes hazards
D         | No     | Decreasing R-type to 4 cycles causes hazards and doesn't impact throughput
E         | No     | Decreasing R-type to 4 cycles causes hazards and doesn't impact latency

Pipelined Control
No information transfer from one pipeline stage to another is possible except through the pipeline register
Everything that happened in any previous stage will be overwritten
All data belonging to one instruction must be kept within the stage
Control information must travel with the instruction along pipeline stages, just like data

Pipelined Control (Simplified)


Pipeline Control
Recall what needs to be controlled in each stage:
Instruction Fetch and PC Increment: Identical for all instructions
Instruction Decode / Register Fetch: Identical for all instructions
Execution: RegDst, ALUOp, ALUSrc
Memory Stage: Branch, MemRead, MemWrite
Write Back: MemtoReg, RegWrite

Execution/Address Calculation stage control lines: RegDst, ALUOp1, ALUOp0, ALUSrc
Memory access stage control lines: Branch, MemRead, MemWrite
Write-back stage control lines: RegWrite, MemtoReg

Instruction | RegDst | ALUOp1 | ALUOp0 | ALUSrc | Branch | MemRead | MemWrite | RegWrite | MemtoReg
R-format    | 1      | 1      | 0      | 0      | 0      | 0       | 0        | 1        | 0
lw          | 0      | 0      | 0      | 1      | 0      | 1       | 0        | 1        | 1
sw          | X      | 0      | 0      | 1      | 0      | 0       | 1        | 0        | X
beq         | X      | 0      | 1      | 0      | 1      | 0       | 0        | 0        | X

Pipelined Control
Control signals derived from instruction
As in single-cycle implementation


Pipelined Control


Example 1
Instruction sequence

lw  $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or  $13, $6, $7
add $14, $8, $9

Show data flow and control through the pipeline

Example Pipeline - 1


Example Pipeline - 2


Example Pipeline - 3


Example Pipeline - 4
and $12, $4, $5


Example Pipeline - 5


Example Pipeline - 6


Example Pipeline - 7


Example Pipeline - 8


Example Pipeline - 9


Example 2
Instruction sequence

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

sub: does not write $2 until its 5th stage (cycle 5)
and: reads its operands in its 2nd stage (cycle 3)
Read after write (cycle 6): 3 stalls
Design registers so read/write can happen in the same cycle: 2 stalls

Can you spot the problem?
Data hazard: a result is needed in the pipeline before it is available
May result in stalls

Example 2: Dependencies


Hazards Continued
What happens when...
add $3, $10, $11
lw $8, 1000($3)
sub $11, $8, $7


MIPS Pipelining
Pipelining improves performance

Achieve high throughput without reducing instruction latency


Exploit instruction level parallelism (ILP)
Use combinational logic/registers to generate/propagate control signals

ISA design affects complexity of pipeline implementation


What makes it easy

MIPS ISA designed for pipelining


All instructions are the same length
Just a few instruction formats
Memory operands appear only in loads and stores

MIPS Pipelining
Hazards prevent the next instruction from starting in the next cycle
Structural hazards
Attempt to use the same resource two different ways at the same time
e.g. the memory
Data hazards
Attempt to use item before it is ready
For example: an instruction depends on a previous instruction
Control hazards
Attempt to make a decision before condition is evaluated
For example: branch instructions

Structural Hazards

What happens if we
had unified instruction
and data memory?
Structural Hazard


Structural Hazards
Conflict for use of a resource
In MIPS pipeline with a single memory
Load/store requires data access
Instruction fetch would have to stall for that cycle

Would cause a pipeline bubble

Pipelined datapaths require separate instruction/data memories
Or separate instruction/data caches

Data Hazards (and Dependencies)


RAW: read after write
True data dependency -> Data hazard
    add $3, $10, $11
    sub $12, $3, $7
Sub uses the old value of $3 instead of the new one
WAW: write after write
Write dependency: in complex pipelining with out-of-order execution
    add $3, $10, $11
    sub $3, $7, $3
Add is delayed so the final value of $3 is that of add, not sub
WAR: write after read
Anti-dependency
add $3, $10, $11
sub $10, $2, $7
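
The three dependency types can be told apart by comparing destination and source registers of an instruction pair. This is a minimal sketch; the Instr record and the dependences function are hypothetical names, and the check ignores special cases such as writes to $zero.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]        # register written, or None (e.g. sw, beq)
    srcs: Tuple[str, ...]      # registers read


def dependences(first: Instr, second: Instr) -> Set[str]:
    """Classify register dependences of `second` on an earlier `first`."""
    kinds = set()
    if first.dest is not None and first.dest in second.srcs:
        kinds.add("RAW")   # second reads what first writes
    if first.dest is not None and first.dest == second.dest:
        kinds.add("WAW")   # both write the same register
    if second.dest is not None and second.dest in first.srcs:
        kinds.add("WAR")   # second overwrites what first reads
    return kinds


add = Instr("add", "$3", ("$10", "$11"))
raw_pair = dependences(add, Instr("sub", "$12", ("$3", "$7")))
waw_pair = dependences(add, Instr("sub", "$3", ("$7", "$3")))
war_pair = dependences(add, Instr("sub", "$10", ("$2", "$7")))
```

Note that the WAW pair from the slide also carries a RAW dependence, because sub reads $3 as well as writing it.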


How to work around true data dependencies?
1. Hardware Stalling
2. Software Solution (nops)
3. Hardware Forwarding
4. Code Scheduling

Solution 1: Pipeline Stall


An instruction depends on completion of data access by a previous instruction

add $s0, $t0, $t1
sub $t2, $s0, $t3

2 stalls

Solution 2: Software Solution


Have compiler guarantee no hazards. How?
Where do we insert the nops?

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

Solution 2: Software Solution


Have compiler guarantee no hazards
Where do we insert the nops?

sub $2, $1, $3
nop
nop
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

Problem: this really slows us down!

Solution 3: Forwarding (aka Bypassing)


Use result when it is computed
Don't wait for it to be stored in a register
Forward it from one stage to the other
Requires extra connections in the datapath

Solution 3: Forwarding (aka Bypassing)

How do we detect
when to forward?


Detecting the Need to Forward


Pass register numbers along pipeline
e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
ALU operand register numbers in EX stage are given by ID/EX.RegisterRs, ID/EX.RegisterRt
Data hazards when
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs   (fwd from EX/MEM pipeline reg)
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt   (fwd from EX/MEM pipeline reg)
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs   (fwd from MEM/WB pipeline reg)
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt   (fwd from MEM/WB pipeline reg)

Detecting the Need to Forward


But only if forwarding instruction will write to a register!
EX/MEM.RegWrite, MEM/WB.RegWrite
And only if Rd for that instruction is not $zero
EX/MEM.RegisterRd ≠ 0,
MEM/WB.RegisterRd ≠ 0

Hardware Solution: Detection and Forward


[Figure: multi-clock-cycle pipeline diagram (IM, Reg, DM, Reg per instruction) over CC 1 to CC 9]

Time (in clock cycles) | CC 1 | CC 2 | CC 3 | CC 4 | CC 5  | CC 6 | CC 7 | CC 8 | CC 9
Value of register $2   | 10   | 10   | 10   | 10   | 10/20 | 20   | 20   | 20   | 20
Value of EX/MEM        | X    | X    | X    | 20   | X     | X    | X    | X    | X
Value of MEM/WB        | X    | X    | X    | X    | 20    | X    | X    | X    | X

Program execution order (in instructions)
sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

Hazard detectable in EX stage
Prior instruction in MEM stage
1a: EX/MEM.RegisterRd = ID/EX.RegisterRs = $2

Hardware Solution: Detection and Forward


[Figure: same pipeline diagram, register values, and instruction sequence as on the previous slide]

What type of hazard occurs between sub and or?
2b: MEM/WB.RegisterRd = ID/EX.RegisterRt = $2

Forwarding Paths


Resolve Data Hazards Through Forwarding

EX hazard

if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
  ForwardA = 10
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
  ForwardB = 10

Forwarding Conditions
EX hazard
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
  ForwardA = 10
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
  ForwardB = 10
MEM hazard
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
  ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
  ForwardB = 01

Double Data Hazard


Consider the sequence:
add $1,$1,$2
add $1,$1,$3
add $1,$1,$4
What data hazard type does the third instruction have?
Both hazards occur
Want to use the most recent
Revise MEM hazard condition
Only fwd if EX hazard condition is not true
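
The forwarding conditions, including the revised MEM hazard rule, can be sketched as one selection function per ALU operand. This is an illustration only: the function name forward_select is hypothetical, and the value "00" for "no forwarding, use the register file" is an assumption not stated on the slides.

```python
def forward_select(id_ex_reg: int,
                   ex_mem_regwrite: bool, ex_mem_rd: int,
                   mem_wb_regwrite: bool, mem_wb_rd: int) -> str:
    """Return the forwarding mux setting for one ALU operand.

    id_ex_reg is ID/EX.RegisterRs (for ForwardA) or ID/EX.RegisterRt
    (for ForwardB).
    """
    # EX hazard: the most recent result sits in EX/MEM.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return "10"
    # MEM hazard: only reached when the EX hazard condition is not true,
    # so the most recent value wins in a double data hazard.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return "01"
    return "00"  # assumed encoding: take the operand from the register file


# Third instruction of: add $1,$1,$2 / add $1,$1,$3 / add $1,$1,$4
# Both EX/MEM and MEM/WB hold a pending write to $1.
double_hazard = forward_select(1, True, 1, True, 1)
```

Checking the EX condition first is what implements "only forward from MEM/WB if the EX hazard condition is not true".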

Data Hazards
Data hazards
Attempt to use item before it is ready
Example: an instruction depends on a previous instruction
Working around Data Hazards
Hardware Stalling
Software Solution (nops)
Hardware Forwarding
Code Scheduling


Datapath with Forwarding


Review: Data Hazard Example


Consider this code on the 5-stage pipeline processor
sub $2, $1, $3
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
How many cycles are needed without forwarding?

14

How many cycles with forwarding?


Example with Forwarding


Example with Forwarding


Example with Forwarding


Example with Forwarding


What about Load?


Cannot always avoid stalls by forwarding
If value not computed when needed
Cannot forward backward in time!

Stall/Bubble in the Pipeline

Stall inserted here

Stall/Bubble in the Pipeline

Or, more accurately

Load-Use Hazard Detection


Check when ALU instruction is being decoded in ID stage
ALU operand register numbers in ID stage are given by
IF/ID.RegisterRs, IF/ID.RegisterRt
Load-use hazard when
ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
If detected, stall and insert bubble
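
The load-use condition can be written as a single predicate over pipeline-register fields. This is a minimal sketch; the function and parameter names are hypothetical, and the returned dictionary of stall actions is only an illustration of what a hazard detection unit would drive.

```python
def load_use_hazard(id_ex_memread: bool, id_ex_rt: int,
                    if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in ID needs the result of a load in EX."""
    return bool(id_ex_memread and
                (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt))


def stall_controls(hazard: bool) -> dict:
    """Control settings for one cycle: hold PC and IF/ID, bubble into ID/EX."""
    return {
        "PCWrite": 0 if hazard else 1,
        "IF/IDWrite": 0 if hazard else 1,
        "zero_ID/EX_controls": hazard,
    }


# lw $2, 20($1) is in EX while and $4, $2, $5 is in ID.
hazard = load_use_hazard(True, id_ex_rt=2, if_id_rs=2, if_id_rt=5)
controls = stall_controls(hazard)
```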


Stalling the Pipeline


Once you detect a hazard in ID, how do you insert the nop and stall?
1. Flush all instructions in the pipeline (set control signals to 0)
2. Set all control signals going to ID/EX register to zero
3. Set PCWrite = 0
4. Set IF/IDWrite = 0

Selection | Changes
A         | 1, 3, 4
B         | 1, 2, 3
C         | 2, 3, 4
D         | None of the above

How to Stall the Pipeline


Turn the instruction in EX stage to a bubble
Force control values in ID/EX register to 0
EX, MEM and WB do nop (no-operation)
Prevent update of PC and IF/ID register
Using instruction is decoded again
Following instruction is fetched again

Keep the same instructions in ID and IF stages

Set PCWrite to 0 and set IF/ID Write to 0

1-cycle stall allows MEM to read data for lw

Can subsequently forward to EX stage


Datapath with Hazard Detection


Example with Stall


Code sequence
lw $2, 20($1)
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
How many cycles required?

9 cycles

How many values require hardware forwarding?

3 forwards


Example with Stall


Example with Stall


Example with Stall


Example with Stall


Example with Stall


Example with Stall


Another Example with Stall


Code sequence
add $3, $2, $1
lw $4, 100($3)
and $6, $4, $3
sub $7, $6, $2
add $9, $3, $6
How many stalls occur?
1 stall after lw
How many values require hardware forwarding support to avoid stalling for our MIPS 5-stage pipeline?
4 forwarding: add-lw, lw-and, and-sub, and-add

Stalls and Performance


Stalls reduce performance
But are required to get correct results
Compiler can arrange code to avoid hazards and stalls
Requires knowledge of the pipeline structure


Code Scheduling to Avoid Stalls


C code for A = B + E; C = B + F;
Find the hazards and reorder instructions to avoid stalls
Reorder code to avoid use of load result in next instruction

Original order:
lw  $t1, 0($t0)
lw  $t2, 4($t0)
(stall)
add $t3, $t1, $t2
sw  $t3, 12($t0)
lw  $t4, 8($t0)
(stall)
add $t5, $t1, $t4
sw  $t5, 16($t0)

13 cycles

Reordered:
lw  $t1, 0($t0)
lw  $t2, 4($t0)
lw  $t4, 8($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)

11 cycles
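
The two cycle counts can be reproduced with a small counter. This is a minimal sketch, assuming a 5-stage pipeline with full forwarding, so that the only stall is one cycle when an instruction uses the result of the load immediately before it; the function name and the tuple encoding of instructions are hypothetical.

```python
def count_cycles(program, stages=5):
    """Count cycles for straight-line code with load-use stalls only.

    program: list of (op, dest, srcs) tuples; dest is None for sw.
    """
    stalls = 0
    for (prev_op, prev_dest, _), (_, _, cur_srcs) in zip(program, program[1:]):
        # One bubble when the next instruction reads a just-loaded register.
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    # n instructions + time to fill the pipeline + stall cycles.
    return len(program) + (stages - 1) + stalls


ORIGINAL = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]

REORDERED = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]
```

Under these assumptions the original order costs 7 + 4 + 2 cycles and the reordered code 7 + 4 + 0.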


Example: Code Scheduling


Consider the following MIPS code segment

sw  $t2, 20($t0)
add $t3, $t1, $t4
lw  $t1, 4($t2)
and $t2, $t1, $t3

How many cycles required?
9 cycles
What is ALU doing during cycle 4?
Adding $t1 and $t4
Can we rearrange the code to minimize the number of cycles?
Move sw behind lw

Example
Suppose EX is the longest (in time) pipeline stage. To reduce cycle time (CT), we split it in half. So the pipeline becomes:
IF ID EX1 EX2 M WB
Assume the input data must be available at the start of EX1 and the ALU output is available after EX2
How many hardware stalls are required in the following code (assuming hardware forwarding wherever possible)?
lw  r1, 0(r3)
add r2, r1, r3

Dealing with Data Hazards


As an ISA designer, you can deal with hazards in software or hardware. Which statement(s) is (are) True?
_____ Compilers have a large window of instructions available to do reordering to eliminate hazards
_____ Detecting data hazards in hardware can be difficult and expensive
_____ Hardware knows at runtime the actual dependencies and can exploit that knowledge for better reordering
_____ Exposing the number of required stalls violates the abstraction between hardware and software

Pipelining Recap
Pipelining improves performance by increasing instruction throughput
Executes multiple instructions in parallel
Subject to hazards that prevent starting the next instruction in the next cycle
Structural hazard: a required resource is busy
Data hazard: need to wait for a previous instruction to complete its data read/write
Control hazard: deciding on control action depends on a previous instruction

Example: Code Scheduling


Consider the following MIPS code segment

sw  $t2, 20($t0)
add $t3, $t1, $t4
lw  $t1, 4($t2)
and $t2, $t1, $t3

How many cycles required if processor supports forwarding?
9 cycles
What is ALU doing during cycle 4?
Adding $t1 and $t4
Can we rearrange the code to minimize the number of cycles?
Move sw behind lw

Control Hazards
Current design for branch instruction
Decision (Taken or Not Taken) occurs in MEM stage
Don't know which is the next instruction until the decision is made
Wait until branch outcome determined before fetching next instruction
How many cycles will we lose per branch if we stall until we know the branch outcome?

Control Hazards
MIPS 5 stage pipeline:
Stall 3 cycles until the branch outcome is known

[Figure: pipeline diagram over CC1 to CC8. beq $4, $0, there is followed by three bubbles before and $12, $2, $5 is fetched.]

With longer pipelines, stall penalty becomes unacceptable

Control Hazards
Control (or branch) hazards arise because we must fetch the next instruction before we know if we are branching or where we are branching
Control hazards are detected in hardware
We can reduce the impact of control hazards through
1. Static Branch Prediction
2. Reducing the Delay/cost of Branch Hazard
3. Delayed Branch
4. Dynamic Branch Prediction

Solution 1: Static Branch Prediction


Predict outcome of branch

For example, predict branch not taken


Fetch instruction after branch, with no delay
Works pretty well when prediction is right
Only stall if prediction is wrong

Add hardware to flush instructions if prediction is wrong

Same performance as stalling when wrong


Solution 1: Static Prediction Not Taken


Fetch instruction after branch, with no delay
Works pretty well if prediction is correct

beq $4, $0, Else
and $12, $2, $5
or ...
add ...
sw ...

[Figure: pipeline diagram over CC1 to CC8; each instruction enters the pipeline one cycle after the previous one, with no bubbles.]

Solution 1: Static Prediction and Flushing


Flush or discard instructions when wrong
Change the original control values to 0s
Change the 3 instructions in IF, ID, and EX when branch is in MEM

beq $4, $0, Else
and $12, $2, $5
or ...
add ...
Else: sub $12, $4, $2

[Figure: pipeline diagram over CC1 to CC8; the and, or, and add instructions are marked Flush.]
Flush these instructions if prediction is wrong

Solution 2: Reducing Cost of Branch Hazard


Move decision to earlier stages

Move up to EXE stage, and save ____ cycle per branch


Move up to ID stage, and save ____ cycles per branch

Must add hardware to check registers as they are read

Move hardware to determine branch outcome in ID stage

Branch or Target address adder


Register comparator

Exclusive-or (XOR) of the bits, and OR of the results

Need to copy the forwarding and hazard detection hardware (why?)

add $t1, $s1, $s2


beq $t1, $0, Loop


Data Hazards for Branches


Do we need to stall if a comparison register is a destination of 2nd or 3rd preceding ALU instruction?
Can resolve using forwarding

add $1, $2, $3        IF  ID  EX  MEM WB
add $4, $5, $6            IF  ID  EX  MEM WB
...                           IF  ID  EX  MEM WB
beq $1, $4, target                IF  ID  EX  MEM WB

Data Hazards for Branches


Do we need to stall if a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction?
Need 1 stall cycle

lw $1, addr           IF  ID  EX  MEM WB
add $4, $5, $6            IF  ID  EX  MEM WB
beq stalled                   IF  ID
beq $1, $4, target                ID  EX  MEM WB

Data Hazards for Branches


Do we need to stall if a comparison register is a destination of immediately preceding load instruction?
Need 2 stall cycles

lw $1, addr           IF  ID  EX  MEM WB
beq stalled               IF  ID
beq stalled                   ID
beq $1, $4, target                ID  EX  MEM WB

Impact of Reducing Cost of Branch Hazard


Branch detection in ID stage
Predict: guess one direction then back up if wrong
0 lost cycles per branch instruction if right, 1 if wrong
Need to flush and restart following instruction if wrong
Clear instruction field in IF/ID pipeline -> creates a NOP
Assume we are right 50% of the time
CPI of branch = (1 × 0.5 + 2 × 0.5) = 1.5
If 20% of instructions are branches and CPI of other instructions = 1
Total CPI = 1.5 × 0.2 + 1 × 0.8 = 1.1
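
The same arithmetic can be parameterized to try other prediction accuracies and branch frequencies. This is a minimal sketch; the function names and the default 1-cycle misprediction penalty (branch resolved in ID) are assumptions for illustration.

```python
def branch_cpi(prediction_accuracy: float,
               misprediction_penalty: int = 1) -> float:
    """Average cycles per branch: 1 if predicted right, 1 + penalty if wrong."""
    return 1 + (1 - prediction_accuracy) * misprediction_penalty


def total_cpi(branch_fraction: float, prediction_accuracy: float,
              other_cpi: float = 1.0, misprediction_penalty: int = 1) -> float:
    """Weighted CPI over branches and all other instructions."""
    return (branch_fraction * branch_cpi(prediction_accuracy, misprediction_penalty)
            + (1 - branch_fraction) * other_cpi)


slide_branch_cpi = branch_cpi(0.5)        # right 50% of the time
slide_total_cpi = total_cpi(0.2, 0.5)     # 20% branches
```

With a 3-cycle penalty (branch resolved in MEM) and the same 50% accuracy, the same functions give a branch CPI of 2.5.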



Datapath with Branch Prediction


[Figure: pipelined datapath with PC, instruction memory, registers, sign extend, shift left 2, ALU, data memory, the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, control, hazard detection unit, forwarding unit, and the IF.Flush signal.]

Example
Branch is taken but predicted not-taken

36  sub $10, $4, $8
40  beq $1, $3, 7       # target = 40 + 4 + 7 × 4 = 72
44  and $12, $2, $5
48  or  $13, $2, $6
52  add $14, $4, $2
56  slt $15, $6, $7
...
72  lw  $4, 50($7)

Example: Branch Taken


Example: Branch Taken


Solution 3: Delayed Branch


Fill slot after branch with an instruction that executes independent of the branch decision
Impact: 0 lost cycles per branch instruction if an instruction can be found to put in the slot
Harder to fill for pipelines with more stages

Filling the Branch Delay Slot


add $5, $3, $7
add $9, $1, $3
sub $6, $1, $4
and $7, $8, $2
beq $6, $7, there
nop /* branch delay slot */
add $9, $1, $2
sub $2, $9, $5
...
there:
mult $2, $10, $11

Which instruction can be used to fill the delay slot?
add $9,$1,$3

Solution 4: More-Realistic Branch Prediction


Static branch prediction
Based on typical branch behavior
Example: loop and if-statement branches

Predict backward branches taken


Predict forward branches not taken

Dynamic branch prediction


Hardware measures actual branch behavior

e.g., Record recent history of each branch

Assume future behavior will continue the trend

When wrong, stall while re-fetching, and update history


Dynamic Branch Prediction


Analysis of the branch history -> Branch prediction buffer
Keep a list of the recent branch instructions' outcomes (taken/not taken)
Indexed by the low order bits of the branch instruction address
To execute a branch
Check table, expect the same outcome
Start fetching from fall-through or target
If wrong, flush pipeline and flip prediction
Limited precision, but it's only a prediction

1-Bit Predictor: Shortcoming


Assume 1st prediction is taken
How many times will the prediction be wrong for the inner loop?
Inner loop branches mispredicted twice!
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of inner loop next time around

for (j = 0; j < 100; j++)
{
    for (i = 0; i < 9; i++)
    {
        ...
    }
}

outer: ...
inner: ...
       beq ..., ..., inner
       beq ..., ..., outer

Improvement: 2-Bit Predictor


Only change prediction on 2 successive wrong predictions


Example
Assume the following 3 sequences of branch patterns
Assume initial predict taken for 1-bit predictor and strongly taken for 2-bit predictor
What is the accuracy of each?

Pattern | 1-bit | 2-bit
TTTTN   |       |
TTNTN   |       |
NTNTN   |       |
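
One way to work the exercise is to simulate both predictors. This is a minimal sketch under stated assumptions: each pattern is run independently from the initial state, the 2-bit predictor is modeled as a saturating counter (3 = strongly taken, 0 = strongly not taken), and the function names are hypothetical.

```python
def one_bit_accuracy(outcomes: str, initial: str = "T") -> float:
    """1-bit predictor: always predict the most recent outcome."""
    prediction = initial
    correct = 0
    for actual in outcomes:
        if prediction == actual:
            correct += 1
        prediction = actual  # flip on every misprediction
    return correct / len(outcomes)


def two_bit_accuracy(outcomes: str, counter: int = 3) -> float:
    """2-bit saturating counter: predict taken when counter >= 2."""
    correct = 0
    for actual in outcomes:
        predicted = "T" if counter >= 2 else "N"
        if predicted == actual:
            correct += 1
        # Move toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if actual == "T" else max(counter - 1, 0)
    return correct / len(outcomes)


PATTERNS = ["TTTTN", "TTNTN", "NTNTN"]
results = {p: (one_bit_accuracy(p), two_bit_accuracy(p)) for p in PATTERNS}
```

Under these assumptions the 2-bit predictor is never worse than the 1-bit predictor on the three patterns, because a single not-taken outcome does not flip its prediction.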


Branch Prediction
Latest branch predictors are significantly more sophisticated, using more advanced correlating techniques, larger structures, and even AI techniques
Use patterns of branches (local history) and recent other branch history (global history) to make predictions

Pipelining -- Recap
Pipelining focuses on improving instruction throughput, not individual instruction latency
Data hazards can be handled by hardware or software, but most modern processors have hardware support for stalling and forwarding
Control hazards can be handled by hardware or software, but most modern processors use Branch Target Buffers and advanced dynamic branch prediction to reduce hazards
Exec Time = IC × CPI × CT

Example
Consider the following times per stage for MIPS 5-stage pipeline processor:
IF = 200ps, ID = 100ps, EX = 100ps, M = 200ps, WB = 100ps
Consider splitting IF and M into 2 stages each
IF1 IF2 and M1 M2
Most frequent code run (assume branch taken most of the time):
Loop: lw  r1, 0(r2)
      add r2, r1, r4
      sub r5, r1, r2
      beq r5, $zero, Loop
Assume pipeline has forwarding where available, predicts branch not taken, and resolves branches in ID. What is the impact of 7-stage pipeline vs. 5-stage MIPS pipeline on CPI and CT?

Exceptions and Interrupts


Unexpected events that change the normal flow of instruction execution
Different ISAs use the terms differently
Exception
Arises within the CPU
e.g., undefined opcode, overflow, ...
Interrupt
From an external I/O controller
Dealing with them without sacrificing performance is hard
Detecting and taking appropriate action is often on critical path

Handling Exceptions
In MIPS, exceptions are managed by a System Control Coprocessor (CP0)
Save the address of offending (or interrupted) instruction
In MIPS: Exception Program Counter (EPC)
Save indication of the problem
In MIPS: Cause register
We'll assume 1-bit
0 for undefined opcode, 1 for overflow
Jump to exception handler at 8000 0180

An Alternate Mechanism
Vectored Interrupts
Handler address determined by the cause of exception
Example:
Undefined opcode: C000 0000
Overflow: C000 0020
...: C000 0040
Instructions either
Deal with the interrupt, or
Jump to real handler

Handler Actions
Read cause, and transfer to relevant handler
Determine action required
If restartable
Take corrective action
Use EPC to return to program
Otherwise
Terminate program
Report error using EPC, cause, ...

Pipeline with Exceptions


Exception Properties
Restartable exceptions
Pipeline flushes the offending instruction

What about instructions before and after it?

Handler executes, then returns to the instruction

Refetched and executed from scratch

PC saved in EPC register


Identifies causing instruction
Actually PC + 4 is saved

Handler must adjust


Exception Example
Exception on add in
40        sub $11, $2, $4
44        and $12, $2, $5
48        or  $13, $2, $6
4C        add $1, $2, $1
50        slt $15, $6, $7
54        lw  $16, 50($7)

Handler
80000180  sw  $25, 1000($0)
80000184  sw  $26, 1004($0)

Exception Example


Exception Example


Multiple Exceptions
Pipelining overlaps multiple instructions
Could have multiple exceptions at once
Simple approach: deal with exception from earliest instruction
Flush subsequent instructions
In complex pipelines
Multiple instructions issued per cycle
Out-of-order completion
Maintaining precise exceptions is difficult!

Final Datapath
Basic pipelined architecture
Forwarding
Hazard detection unit
Branch handling
Exception handling


Final Pipelined Datapath


[Figure: final pipelined datapath with PC, instruction memory, registers, sign extend, shift left 2, ALU and ALU control, data memory, the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, control, hazard detection unit, forwarding unit, the IF.Flush, ID.Flush, and EX.Flush signals, the Cause and Except PC registers, and the constant 40000040 as a PC input.]

Advanced Pipelining


Instruction-Level Parallelism (ILP)


Pipelining: executing multiple instructions in parallel
To increase amount of ILP
Deeper pipeline
Less work per stage -> shorter clock cycle
Multiple issue
Replicate pipeline stages -> multiple pipelines
Start multiple instructions per clock cycle
CPI < 1, so use Instructions Per Cycle (IPC)
E.g., 4GHz 4-way multiple-issue
16 BIPS, peak CPI = 0.25, peak IPC = 4
But dependencies reduce this in practice

Super Pipelining
Five-stage pipeline is a good start
Many designs include pipelines as long as 7, 10, or 20 stages
Intel Pentium III: 10 stages
Pentium 4: 20 stages
Prescott Pentium 4: 31 stages
Deeper pipelines let the processor clock run faster, but the benefit shrinks as depth grows
Pipeline register overheads play a role
Thermal wall/Power wall
Cannot increase clock rate

Multiple Issue
Static multiple issue
Compiler groups instructions to be issued together
Packages them into issue slots
Compiler detects and avoids hazards

Dynamic multiple issue
CPU examines instruction stream and chooses instructions to issue each cycle
Compiler can help by reordering instructions
CPU resolves hazards using advanced techniques at runtime

Static Multiple Issue


Compiler groups instructions into issue packets
Group of instructions that can be issued on a single cycle
Determined by pipeline resources required
Think of an issue packet as a very long instruction
Specifies multiple concurrent operations
Very Long Instruction Word (VLIW)
Compiler must remove some/all hazards
Reorder instructions into issue packets
No dependencies within a packet
Pad with nop if necessary


Which Instructions Can This Do in Parallel?

Any two instructions?


Arithmetic and memory
instruction?
Any instruction and
memory instruction?

MIPS with Static Dual Issue


Two-issue packets
One ALU/branch instruction
One load/store instruction
64-bit aligned
ALU/branch, then load/store
Pad an unused instruction with nop

Address | Instruction type | Pipeline Stages
n       | ALU/branch       | IF  ID  EX  MEM WB
n + 4   | Load/store       | IF  ID  EX  MEM WB
n + 8   | ALU/branch       |     IF  ID  EX  MEM WB
n + 12  | Load/store       |     IF  ID  EX  MEM WB
n + 16  | ALU/branch       |         IF  ID  EX  MEM WB
n + 20  | Load/store       |         IF  ID  EX  MEM WB

Hazards in the Dual-Issue MIPS


More instructions executing in parallel
EX data hazard
Forwarding avoided stalls with single-issue
Now can't use ALU result in load/store in same packet
add  $t0, $s0, $s1
load $s2, 0($t0)
Split into two packets, effectively a stall
Load-use hazard
Still one cycle use latency, but now two instructions
More aggressive scheduling required

Scheduling Example
Schedule on a static dual-issue pipeline for MIPS

Loop: lw   $t0, 0($s1)        # $t0=array element
      addu $t0, $t0, $s2      # add scalar in $s2
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # decrement pointer
      bne  $s1, $zero, Loop   # branch $s1!=0

      | ALU/branch             | Load/store        | cycle
Loop: | nop                    | lw $t0, 0($s1)    | 1
      | addi $s1, $s1, -4      | nop               | 2
      | addu $t0, $t0, $s2     | nop               | 3
      | bne  $s1, $zero, Loop  | sw $t0, 4($s1)    | 4

IPC = 5/4 = 1.25 (c.f. peak IPC = 2)

Loop Unrolling
Basic block: straight-line code sequence with no branches in except at the entry and no branches out except at the exit
On average, only 4 to 7 instructions execute between a pair of branches
To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
Replicate loop body to expose more parallelism among different iterations
Reduce loop-control overhead


Loop Unrolling
During unrolling, the compiler either introduces additional registers or is left with WAR dependences (name dependences)
Name dependence: two instructions use the same register or memory location, called a name, but no data flows between the instructions through that name
Repeated instances are completely independent sequences despite all using $t0
lw   $t0, 0($s1)
addu $t0, $t0, $s2
sw   $t0, 0($s1)
lw   $t0, 4($s1)
addu $t0, $t0, $s2
sw   .....

Register renaming: use different registers per replication

Avoid loop-carried anti-dependencies


Store followed by a load of the same register
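
Renaming per replication can be sketched as a small generator that emits the unrolled loop body with a different temporary register for each copy. This is an illustrative sketch, not how a production compiler performs renaming: the function name `unroll_with_renaming`, the pool of temporaries, and the 4-byte element stride are assumptions.

```python
def unroll_with_renaming(copies, temps=("$t0", "$t1", "$t2", "$t3")):
    """Emit `copies` replications of the lw/addu/sw body, one temporary per copy."""
    if copies > len(temps):
        raise ValueError("not enough temporary registers for this unroll factor")
    body = []
    for k in range(copies):
        reg = temps[k]       # fresh register: removes the name dependence on $t0
        offset = 4 * k       # each copy handles the next 4-byte word
        body.append(f"lw   {reg}, {offset}($s1)")
        body.append(f"addu {reg}, {reg}, $s2")
        body.append(f"sw   {reg}, {offset}($s1)")
    return body


unrolled = unroll_with_renaming(2)
for line in unrolled:
    print(line)
```

Because each copy now writes its own register, the copies share no names and a scheduler is free to interleave them.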

Loop Unrolling Example


lp: addi $s1, $s1, -16   # decrement pointer
    lw   $t0, 0($s1)     # $t0=array element
    lw   $t1, 4($s1)     # $t1=array element
    lw   $t2, 8($s1)     # $t2=array element
    lw   $t3, 12($s1)    # $t3=array element
    addu $t0, $t0, $s2   # add scalar in $s2
    addu $t1, $t1, $s2   # add scalar in $s2
    addu $t2, $t2, $s2   # add scalar in $s2
    addu $t3, $t3, $s2   # add scalar in $s2
    sw   $t0, 0($s1)     # store result
    sw   $t1, 4($s1)     # store result
    sw   $t2, 8($s1)     # store result
    sw   $t3, 12($s1)    # store result
    bne  $s1, $0, lp     # branch if $s1 != 0


Loop Unrolling Example


IPC = 14/8 = 1.75

Closer to 2, but at cost of registers and code size

        ALU/branch              Load/store           cycle
Loop:   addi $s1, $s1, -16      lw   $t0, 0($s1)     1
        nop                     lw   $t1, 12($s1)    2
        addu $t0, $t0, $s2      lw   $t2, 8($s1)     3
        addu $t1, $t1, $s2      lw   $t3, 4($s1)     4
        addu $t2, $t2, $s2      sw   $t0, 16($s1)    5
        addu $t3, $t3, $s2      sw   $t1, 12($s1)    6
        nop                     sw   $t2, 8($s1)     7
        bne  $s1, $zero, Loop   sw   $t3, 4($s1)     8
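
One way to see why the scheduled offsets (0, 12, 8, 4 for the loads and 16, 12, 8, 4 for the stores) are right is to check the eight-cycle schedule against the original one-element loop. The sketch below is a functional model only, with no timing: it assumes both slots of a packet read their source registers before either writes (so the first lw sees the old $s1), that the trip count is a multiple of 4, and that memory is a Python dict keyed by byte address. The function names are hypothetical.

```python
def run_original(memory, s1, s2):
    """Reference: the original loop, one array element per iteration."""
    mem = dict(memory)
    while True:
        t0 = mem[s1]        # lw   $t0, 0($s1)
        t0 += s2            # addu $t0, $t0, $s2
        mem[s1] = t0        # sw   $t0, 0($s1)
        s1 -= 4             # addi $s1, $s1, -4
        if s1 == 0:         # bne  $s1, $zero, Loop
            return mem


def run_unrolled_schedule(memory, s1, s2):
    """The eight-cycle dual-issue schedule, executed packet by packet."""
    mem = dict(memory)
    while True:
        t0 = mem[s1]        # cycle 1: lw $t0, 0($s1) reads the old $s1 ...
        s1 -= 16            #          ... while addi $s1, $s1, -16 updates it
        t1 = mem[s1 + 12]   # cycle 2: nop              | lw $t1, 12($s1)
        t0 += s2            # cycle 3: addu $t0,$t0,$s2 | lw $t2, 8($s1)
        t2 = mem[s1 + 8]
        t1 += s2            # cycle 4: addu $t1,$t1,$s2 | lw $t3, 4($s1)
        t3 = mem[s1 + 4]
        t2 += s2            # cycle 5: addu $t2,$t2,$s2 | sw $t0, 16($s1)
        mem[s1 + 16] = t0
        t3 += s2            # cycle 6: addu $t3,$t3,$s2 | sw $t1, 12($s1)
        mem[s1 + 12] = t1
        mem[s1 + 8] = t2    # cycle 7: nop              | sw $t2, 8($s1)
        mem[s1 + 4] = t3    # cycle 8: bne $s1,$zero    | sw $t3, 4($s1)
        if s1 == 0:
            return mem


# Eight words at byte addresses 4..32; add the scalar 10 to each.
array = {addr: addr // 4 for addr in range(4, 36, 4)}
reference = run_original(array, 32, 10)
scheduled = run_unrolled_schedule(array, 32, 10)
print(reference == scheduled)  # True
```

Both versions update the same eight words, so moving the addi to the top of the loop only changes the offsets, not the result.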


Dynamic Multiple Issue


Superscalar processors
CPU decides whether to issue 0, 1, 2, ... instructions each cycle
Avoiding structural and data hazards
Avoids the need for compiler scheduling
Though it may still help
Code semantics ensured by the CPU


Dynamic Pipeline Scheduling


Allow the CPU to execute instructions out of order to avoid stalls
But commit results to registers in order
Example
lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20
Can start sub while addu is waiting for lw
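
The decision of which instructions may proceed can be sketched as a scan over the instructions that follow the load. This is a simplified model, not a description of real issue logic: the name `split_ready` is hypothetical, instructions are given as (name, destination, sources) tuples, and register renaming is assumed to take care of WAR and WAW conflicts.

```python
def split_ready(window, pending):
    """Split instructions into those that can proceed and those that must wait.

    `pending` holds registers whose values have not been produced yet.
    """
    unavailable = set(pending)
    ready, waiting = [], []
    for name, dest, srcs in window:
        if unavailable & set(srcs):
            waiting.append(name)
            unavailable.add(dest)       # its result is not available either
        else:
            ready.append(name)
            unavailable.discard(dest)   # later readers get this new value
    return ready, waiting


# Instructions after the lw; $t0 is still pending while the load is outstanding.
window = [
    ("addu", "$t1", ("$t0", "$t2")),
    ("sub",  "$s4", ("$s4", "$t3")),
    ("slti", "$t5", ("$s4",)),
]
ready, waiting = split_ready(window, pending={"$t0"})
print(ready, waiting)  # ['sub', 'slti'] ['addu']
```

Here "ready" means not blocked by the load; slti still has to run after sub, whose result it reads.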


Dynamically Scheduled CPU


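[Figure: block diagram of a dynamically scheduled CPU, with these callouts]
In-order issue: preserves dependencies
Reservation stations: hold pending operands
Results are also sent to any waiting reservation stations
Reorder buffer for register writes; can supply operands for issued instructions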


Dynamic Scheduling
Exploit instruction-level parallelism
To make programs behave as if they were running on a simple in-order pipeline:
Issue instructions in order, which allows dependences to be tracked
Hardware chooses which instructions to execute next
Execute instructions out of order as long as the correct flow of data is ensured
Commit in order
Often extended by speculating on branches to keep the pipeline full
May need to roll back if the prediction is incorrect
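
In-order commit on top of out-of-order execution can be sketched as a running maximum over finish times: an instruction commits only once it and every earlier instruction have finished. The function name `commit_cycles` and the cycle numbers are hypothetical, and the sketch assumes any number of instructions may commit in the same cycle.

```python
def commit_cycles(finish):
    """Map (name, finish cycle) pairs in program order to (name, commit cycle)."""
    commits = []
    earliest = 0
    for name, done in finish:
        # Cannot commit before this instruction finishes,
        # nor before the previous instruction has committed.
        earliest = max(earliest, done)
        commits.append((name, earliest))
    return commits


# Hypothetical finish cycles: sub and slti finish early, out of order.
finish = [("lw", 4), ("addu", 5), ("sub", 2), ("slti", 3)]
committed = commit_cycles(finish)
print(committed)
# [('lw', 4), ('addu', 5), ('sub', 5), ('slti', 5)]
```

sub finishes in cycle 2 but its register write is held back until cycle 5, so the program still appears to run in order.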


Speculation
Guess what to do with an instruction
Start operation as soon as possible
Check whether guess was right

If so, complete the operation


If not, roll back and do the right thing
Examples
Speculate on branch outcome
Roll back if path taken is different
Speculate that a store preceding a load refers to a different address
Roll back if location is updated


Speculation
Common to static and dynamic multiple issue
Compiler can reorder instructions
e.g., move load before branch
Can include fix-up instructions to recover from incorrect guess
Hardware can look ahead for instructions to execute
Buffer results until it determines they are actually needed
Flush buffers on incorrect speculation
Out-of-order execution


Static or Dynamic Scheduling?


Why not just let the compiler schedule code?

Not all stalls are predictable, and some dependences are unknown at compile time
e.g., cache misses
Can't always schedule around branches
Branch outcome is dynamically determined
Different implementations of an ISA have different latencies and hazards
Dynamic scheduling allows code that was compiled for one pipeline to run efficiently on a different pipeline


Power Efficiency
Microprocessor   Year   Clock Rate   Pipeline Stages   Issue width   Out-of-order/Speculation   Cores   Power
i486             1989   25MHz                                        No                                 5W
Pentium          1993   66MHz                                        No                                 10W
Pentium Pro      1997   200MHz       10                              Yes                                29W
P4 Willamette    2001   2000MHz      22                              Yes                                75W
P4 Prescott      2004   3600MHz      31                              Yes                                103W
Core             2006   2930MHz      14                              Yes                                75W
UltraSparc III   2003   1950MHz      14                              No                                 90W
UltraSparc T1    2005   1200MHz                                      No                                 70W

- Complexity of dynamic scheduling and speculation requires power
- Multiple simpler cores per chip may be better than deeper, aggressively speculated ones


Fallacies
Pipelining is easy (!)
The basic idea is easy
The devil is in the details

e.g., Detecting data hazards

Pipelining is independent of technology


So why haven't we always done pipelining?
More transistors make more advanced techniques feasible
As transistor budgets continued to double and logic became much
faster than memory, multiple functional units and dynamic
pipelining made more sense
Today, concerns about power are leading to less aggressive designs

Concluding Remarks
ISA influences design of datapath and control & vice versa
Pipelining improves throughput using parallelism
More instructions completed per second
Latency for each instruction not reduced
Limited by structural, data, and control hazards
Multiple issue and dynamic scheduling (ILP)
Dependencies limit achievable parallelism
Complexity leads to the power wall
Pipelining in today's most advanced processors is not fundamentally different from the techniques we discussed

This class has given you the background you need to learn more!


Concluding Remarks
What does each technique help reduce: data hazard stalls, control stalls, and/or CPI?

Technique                      Reduces
Dynamic scheduling             Data hazard stalls
Branch prediction              Control stalls
Multiple issue                 CPI
Speculation                    Data and control stalls
Loop unrolling                 Control hazard stalls
Compiler pipeline scheduling   Data hazard stalls
