M S Bhat
Dept. of E&C,
NITK Suratkal
Presentation Outline
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
8/16/2015
VL722
Pipelining Example
Laundry Example: Three Stages
1. Wash dirty load of clothes
2. Dry wet clothes
3. Fold and put clothes into drawers
Each stage takes 30 minutes to complete
Four loads of clothes to wash, dry, and fold
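The laundry timing can be checked with a short calculation. This is a minimal sketch: the function names are illustrative, and it assumes every stage takes the same 30 minutes and that a new load can start as soon as the washer is free.

```python
def sequential_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    # Each load finishes all stages before the next load starts.
    return loads * stages * stage_minutes


def pipelined_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    # The first load needs `stages` slots; each further load adds one slot.
    return (stages + loads - 1) * stage_minutes


seq = sequential_minutes(loads=4, stages=3, stage_minutes=30)
pipe = pipelined_minutes(loads=4, stages=3, stage_minutes=30)
print(f"sequential: {seq} min, pipelined: {pipe} min, speedup: {seq / pipe:.1f}x")
```

With four loads the sequential schedule takes 360 minutes and the pipelined schedule takes 180 minutes.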
Sequential Laundry
[Figure: timeline from 6 PM to 12 AM in 30-minute slots; loads A, B, C, and D are each washed, dried, and folded one after another]
[Figure: the same four loads overlapped in 30-minute slots; the timeline ends at 9 PM]
[Figure: successive tasks passing through stages 1, 2, ..., k, without and with pipelining]
Without Pipelining
One completion every k time units
With Pipelining
One completion every 1 time unit
Synchronous Pipeline
Uses clocked registers between stages
Upon arrival of a clock edge
All registers hold the results of previous stages simultaneously
[Figure: Input -> Register -> S1 -> Register -> S2 -> ... -> Sk -> Register, with all registers driven by a common Clock]
Pipeline Performance
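Let $t_i$ = time delay in stage $S_i$
Clock cycle $t = \max(t_i)$ is the maximum stage delay
Time to process $n$ tasks on a $k$-stage pipeline: $(k + n - 1)\,t$
Speedup over serial execution: $S_k = \dfrac{n k t}{(k + n - 1)\,t} = \dfrac{nk}{k + n - 1}$
$S_k \to k$ for large $n$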
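The limit of the speedup can be seen numerically. The sketch below assumes n tasks on a k-stage pipeline with equal stage delays, so serial execution takes n*k cycles and pipelined execution takes k + n - 1 cycles; the function name is illustrative.

```python
def pipeline_speedup(n: int, k: int) -> float:
    # Serial time n*k cycles divided by pipelined time (k + n - 1) cycles.
    return (n * k) / (k + n - 1)


for n in (1, 10, 100, 10_000):
    print(f"n = {n:>6}: speedup = {pipeline_speedup(n, k=5):.3f}")
```

For a single task there is no gain, and the speedup approaches k only when many tasks flow through the pipeline.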
Performance Example
Assume the following operation times for components:
Instruction and data memories: 200 ps
ALU operation: 180 ps
Register file read or write: 150 ps
Solution
Instruction   Instruction   Register   ALU         Data     Register   Total
Class         Memory        Read       Operation   Memory   Write
ALU           200           150        180                  150        680 ps
Load          200           150        180         200      150        880 ps
Store         200           150        180         200                 730 ps
Branch        200           150        180                             530 ps
Jump          200           150                                        350 ps
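The totals in the solution table follow from summing the delays of the components each instruction class uses. A minimal sketch, with dictionary keys chosen here purely for illustration:

```python
# Component delays in picoseconds, as used in the solution table.
DELAY_PS = {"IM": 200, "RegRead": 150, "ALU": 180, "DM": 200, "RegWrite": 150}

# Components used by each instruction class.
STEPS = {
    "ALU":    ["IM", "RegRead", "ALU", "RegWrite"],
    "Load":   ["IM", "RegRead", "ALU", "DM", "RegWrite"],
    "Store":  ["IM", "RegRead", "ALU", "DM"],
    "Branch": ["IM", "RegRead", "ALU"],
    "Jump":   ["IM", "RegRead"],
}

totals = {cls: sum(DELAY_PS[s] for s in steps) for cls, steps in STEPS.items()}
single_cycle_clock = max(totals.values())  # the slowest class sets the clock

for cls, ps in totals.items():
    print(f"{cls:<7}{ps} ps")
print("single-cycle clock:", single_cycle_clock, "ps")
```

The load is the longest instruction, so a fixed single-cycle clock has to be 880 ps.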
[Figure: single-cycle execution; each instruction (IF, Reg, ALU, MEM, Reg) occupies one 880 ps clock cycle]
[Figure: pipelined execution; successive instructions overlap, with each stage (IF, Reg, ALU, MEM, Reg) taking one 200 ps clock cycle]
Next . . .
Pipelined Datapath and Control
Single-Cycle Datapath
Shown below is the single-cycle datapath
How to pipeline this single-cycle datapath?
Answer: Introduce a pipeline register at the end of each stage
IF = Instruction Fetch
ID = Decode & Register Read
EX = Execute
MEM = Memory Access
WB = Write Back
[Figure: single-cycle datapath: PC, Next PC logic (+1, PCSrc, J, Beq, Bne), Instruction Memory, Registers (RA, RB, RW, BusA, BusB, BusW), ALU, and Data Memory, with control signals RegDst, RegWrite, MemRead, MemWrite, and MemtoReg]
Pipelined Datapath
Pipeline registers are shown in green, including the PC
The same clock edge updates all pipeline registers, the register file, and the data memory (for store instructions)
IF = Instruction Fetch
ID = Decode & Register Read
EX = Execute
MEM = Memory Access
WB = Write Back
[Figure: five-stage pipelined datapath with pipeline registers (NPC, NPC2, A, Imm, ALUout, WB Data) between Instruction Memory, Register File, ALU, and Data Memory]
[Figure: the same datapath with the destination register number Rd carried along the pipeline as Rd2, Rd3, and Rd4 to the register file write port RW]
Figure shows the use of resources at each stage and each cycle
Time (in cycles)     CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
lw  R14, 8(R21)      IF   Reg  ALU  DM   Reg
add R17, R18, R19         IF   Reg  ALU  DM   Reg
ori R20, R11, 7                IF   Reg  ALU  DM   Reg
sub R13, R18, R11                   IF   Reg  ALU  DM   Reg
sw  R18, 10(R19)                         IF   Reg  ALU  DM
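The resource diagram follows a simple rule: instruction i (counting from 0) is in stage number (cycle - 1 - i) during a given cycle. The sketch below assumes an ideal pipeline with no stalls; the helper names are illustrative.

```python
STAGES = ["IF", "Reg", "ALU", "DM", "Reg"]


def stage_in_cycle(instr_index: int, cycle: int):
    """Stage used by instruction `instr_index` (0-based) in `cycle` (1-based), or None."""
    s = cycle - 1 - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None


def diagram(program, cycles):
    """Return one text row per instruction, one column per clock cycle."""
    rows = []
    for i, instr in enumerate(program):
        cells = [(stage_in_cycle(i, c) or "").ljust(4) for c in range(1, cycles + 1)]
        rows.append(f"{instr:<20}" + " ".join(cells))
    return rows


program = ["lw  R14, 8(R21)", "add R17, R18, R19", "ori R20, R11, 7",
           "sub R13, R18, R11", "sw  R18, 10(R19)"]
print("\n".join(diagram(program, cycles=8)))
```

The fifth instruction reaches only its DM stage by CC8, which is why its last Reg stage does not appear in an eight-cycle window.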
Instruction-Time Diagram
Instruction-Time Diagram shows:
Which instruction occupies which stage at each clock cycle
[Figure: instruction-time diagram, Instruction Order versus Time starting at CC1; five instructions, including lw R14, 8(R21), each shifted by one cycle through IF, ID, EX, MEM, WB; some rows skip the MEM stage]
Control Signals
[Figure: pipelined datapath annotated with control signals: RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, PCSrc (J, Beq, Bne), MemRead, MemWrite, and MemtoReg]
Pipelined Control
Pass control signals along the pipeline just like the data
[Figure: Main & ALU Control decodes Op and func; the EX, MEM, and WB control signals are carried through the pipeline registers to the stage that uses them]
Execute Stage Control Signals (the Memory Stage and Write Back columns are not recoverable)
Op       RegDst   ALUSrc   ExtOp    ALUCtrl
R-Type   1=Rd     0=Reg    -        func
addi     0=Rt     1=Imm    1=sign   ADD
slti     0=Rt     1=Imm    1=sign   SLT
andi     0=Rt     1=Imm    0=zero   AND
ori      0=Rt     1=Imm    0=zero   OR
lw       0=Rt     1=Imm    1=sign   ADD
sw       -        1=Imm    1=sign   ADD
beq      -        0=Reg    -        SUB
bne      -        0=Reg    -        SUB
Next . . .
Pipeline Hazards
Pipeline Hazards
Hazards: situations that would cause incorrect execution
If the next instruction were launched during its designated clock cycle
1. Structural hazards
Caused by resource contention
Two instructions use the same resource during the same cycle
2. Data hazards
An instruction may compute a result needed by the next instruction
Hardware can detect dependencies between instructions
3. Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control
Structural Hazards
Problem
Attempt to use the same hardware resource by two different instructions during the same cycle
Example
Writing back ALU result in stage 4
Structural Hazard: two instructions are attempting to write the register file during the same cycle
[Figure: instruction-time diagram; lw R14, 8(R21) writes back in stage 5 (IF ID EX MEM WB) while the following instruction writes back in stage 4 (IF ID EX WB), so both reach WB in the same cycle]
Second write port can be used to write back load data in stage 5
Next . . .
Data Hazards and Forwarding
Data Hazards
Dependency between instructions causes a data hazard
The dependent instructions are close to each other
Pipelined execution might change the order of operand access
# R17 is written
# R17 is read
Time (cycles)         CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
value of R18          10    10    10    10    10    20    20    20
sub R18, R9, R11      IF    Reg   ALU   DM    Reg
add R20, R18, R13           IF    Reg   ALU   DM    Reg
(instruction lost)                IF    Reg   ALU   DM    Reg
(instruction lost)                      IF    Reg   ALU   DM    Reg
sw  R24, 10(R18)                              IF    Reg   ALU   DM

Instruction Order     CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
register value        10    10    10    10    10    20    20    20    20
instruction 1         IF    Reg   ALU   DM    Reg
instruction 2               IF    Reg   Reg   Reg   Reg   ALU   DM    Reg
instruction 3                     stall stall stall IF    Reg   ALU   DM

[Figure: the first diagram repeated for sub R18, R9, R11 and add R20, R18, R13 and the instructions that follow, CC1 to CC8]
Implementing Forwarding
Two multiplexers added at the inputs of A & B registers
Data from ALU stage, MEM stage, and WB stage is fed back
[Figure: datapath with two 4-input multiplexers (inputs 0 to 3) on BusA and BusB, selected by ForwardA and ForwardB; the ALU result, the MEM stage data, and the WB stage data (WData) are fed back; Rd is carried along as Rd2, Rd3, and Rd4]
Explanation
ForwardA = 0: first ALU operand comes from the register file = value of (Rs)
Forwarding Example
Instruction sequence:
lw  R12, 4(R8)
ori R15, R9, 2
sub R11, R12, R15
[Figure: forwarding datapath annotated with the instruction sequence above]
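if      (condition lost in extraction)   ForwardA = 1
else if (condition lost in extraction)   ForwardA = 2
else if (condition lost in extraction)   ForwardA = 3
else                                     ForwardA = 0

if      (condition lost in extraction)   ForwardB = 1
else if (condition lost in extraction)   ForwardB = 2
else if (condition lost in extraction)   ForwardB = 3
else                                     ForwardB = 0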
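A forwarding selector along these lines could be sketched as follows. The conditions are not given in the source, so everything here is an assumption: select value 1 is taken to mean the ALU (EX) stage result whose destination is Rd2, 2 the MEM stage (Rd3), 3 the WB stage (Rd4); the youngest matching producer wins; and register 0 is assumed to be hardwired to zero and never forwarded. The function name is hypothetical.

```python
def forward_select(src, rd2, ex_regwrite, rd3, mem_regwrite, rd4, wb_regwrite):
    """Return the mux select (0..3) for one ALU source register number.

    Assumed encoding: 0 = register file, 1 = EX stage, 2 = MEM stage, 3 = WB stage.
    """
    if src != 0 and ex_regwrite and src == rd2:
        return 1  # newest value: instruction currently in the EX stage
    if src != 0 and mem_regwrite and src == rd3:
        return 2  # instruction currently in the MEM stage
    if src != 0 and wb_regwrite and src == rd4:
        return 3  # instruction currently in the WB stage
    return 0      # no dependency: read the register file


# sub R11, R12, R15 is being decoded while ori R15 is in EX and lw R12 is in MEM.
forward_a = forward_select(12, rd2=15, ex_regwrite=True, rd3=12, mem_regwrite=True,
                           rd4=0, wb_regwrite=False)
forward_b = forward_select(15, rd2=15, ex_regwrite=True, rd3=12, mem_regwrite=True,
                           rd4=0, wb_regwrite=False)
print("ForwardA =", forward_a, "ForwardB =", forward_b)
```

Under these assumptions the first operand of sub (R12) is taken from the MEM stage and the second operand (R15) from the ALU stage.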
[Figure: forwarding datapath with a Hazard Detect and Forward unit driving ForwardA and ForwardB; RegWrite is carried through the EX, MEM, and WB stages]
Next . . .
Load Delay, Hazard Detection, and Pipeline Stall
Load Delay
Unfortunately, not all data hazards can be resolved by forwarding
A load has a delay that cannot be eliminated by forwarding
[Figure: Program Order versus Time (cycles), CC1 to CC8: lw R18, 20(R17) followed by three instructions, each passing through IF, Reg, ALU, DM, Reg one cycle apart]
[Figure: the same sequence with a stall after lw R18, 20(R17): the next instruction is held for one cycle and a bubble passes through the remaining stages]
[Figure: instruction-time diagram over CC1 to CC10 for a sequence including lw R18, 8(R17), with Stall cycles inserted before two ID stages]
[Figure: datapath with a Hazard Detect, Forward, & Stall unit; inputs include MemRead and RegWrite from the later stages; on a Stall the PC is disabled and the control signals are replaced by a bubble (Bubble = 0)]
Slow code:
    lw   R8,  4(R16)      # &B = 4(R16)
    lw   R9,  8(R16)      # &C = 8(R16)
    add  R10, R8, R9      # stall cycle
    sw   R10, 0(R16)      # &A = 0(R16)
    lw   R11, 16(R16)     # &E = 16(R16)
    lw   R12, 20(R16)     # &F = 20(R16)
    sub  R13, R11, R12    # stall cycle
    sw   R13, 12(R0)      # &D = 12(R0)

Reordered code (loads moved ahead of their uses):
    lw   R8,  4(R16)
    lw   R9,  8(R16)
    lw   R11, 16(R16)
    lw   R12, 20(R16)
    add  R10, R8, R9
    sw   R10, 0(R16)
    sub  R13, R11, R12
    sw   R13, 12(R0)
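The effect of reordering can be checked by counting load-use pairs. This is a sketch under stated assumptions: forwarding resolves every dependency except a load immediately followed by an instruction that reads the loaded register, and each such pair costs one stall cycle. The tuple encoding (opcode, destination register, source registers) is illustrative only.

```python
def load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction just before them."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in cur[2]:
            stalls += 1
    return stalls


# (opcode, destination register or None, source registers)
slow = [("lw", 8, (16,)), ("lw", 9, (16,)), ("add", 10, (8, 9)), ("sw", None, (10, 16)),
        ("lw", 11, (16,)), ("lw", 12, (16,)), ("sub", 13, (11, 12)), ("sw", None, (13, 0))]
fast = [("lw", 8, (16,)), ("lw", 9, (16,)), ("lw", 11, (16,)), ("lw", 12, (16,)),
        ("add", 10, (8, 9)), ("sw", None, (10, 16)), ("sub", 13, (11, 12)),
        ("sw", None, (13, 0))]

print("slow code stalls:", load_use_stalls(slow))
print("reordered code stalls:", load_use_stalls(fast))
```

The slow sequence has two load-use pairs (before add and before sub); the reordered sequence has none.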
Next . . .
Control Hazards
Control Hazards
Jump and Branch can cause great performance loss
Jump instruction needs only the jump target address
PC + 4 immediate
If Branch is Taken
[Figure: Beq R9,R10,L1 is resolved in its ALU stage (cc3); Next1 and Next2 have already been fetched and are turned into bubbles; the instruction at the Branch Target Addr is fetched in cc4]
[Figure: pipelined datapath with branch resolution in the EX stage; J, Beq, and Bne drive PCSrc, and Bubble = 0 clears the control signals of the fetched instructions]
[Figure: instruction-time diagram, cc1 to cc7, for Beq R9,R10,L1 followed by Next1 and Next2]
[Figure: datapath with the branch comparison moved to the decode stage: data is forwarded and then compared (Zero), giving a longer cycle; J, Beq, and Bne drive PCSrc, with Reset and Bubble = 0 signals]
Next . . .
Delayed Branch and Dynamic Branch Prediction
Delayed Branch
Define branch to take place AFTER the next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)
Delayed Branch
Define branch to take place after the next instruction
Instruction in branch delay slot is always executed
Compiler (tries to) move a useful instruction into delay slot.
From before the Branch: Always helpful when possible
For a 1-cycle branch delay, we have one delay slot
        branch instruction
        branch delay slot    (next instruction)
        . . .
label:  branch target        (if branch taken)

Before filling the delay slot:
        add R10,R11,R12
        beq R17,R16,label
        Delay Slot
        . . .
label:

After filling the delay slot:
        beq R17,R16,label
        add R10,R11,R12
        . . .
label:
Delayed Branch
From the Target: Helps when branch is taken. May duplicate instructions
        ADD  R2, R1, R3
        BEQZ R2, L2
        SUB  R4, R5, R6      (delay slot: copy of the instruction at L1)
        . . .
L1:     SUB  R4, R5, R6
L2:
Instructions between BEQZ and SUB (in the fall-through path) must not use R4.
Why is the instruction at L1 duplicated? What if R5 or R6 changed?
(Because L1 can be reached by another path!)
Delayed Branch
From Fall Through: Helps when branch is not taken (default case)
Before:                      After:
        ADD  R2, R1, R3              ADD  R2, R1, R3
        BEQZ R2, L1                  BEQZ R2, L1
        DELAY SLOT                   (fall-through instruction, which writes R4)
        . . .                        . . .
L1:                          L1:
Instructions at the target (L1 and after) must not use R4 until it is set (written) again.
[Figure: Profile-based versus Direction-based static branch prediction for benchmarks alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, and tomcatv; axis ticks from 10 to 100000; per-benchmark values not recoverable]
Zero-Delayed Branching
How to achieve zero delay for a jump or a taken branch?
Jump or branch target address is computed in the ID stage
Next instruction has already been fetched in the IF stage
Solution
Introduce a Branch Target Buffer (BTB) in the IF stage
Store the target address of recent branch and jump instructions
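[Figure: Branch Target Buffer in the IF stage; low-order bits of the PC are used as index into a table of branch addresses, Target Addresses, and Predict Bits; a comparator (=) checks the stored address against the PC and, with predict_taken, a mux selects between the incremented PC (Inc) and the target address]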
IF: Increment PC. Found BTB entry with predict taken?
  No  -> ID/EX: Jump or taken branch?
           No  -> Normal Execution
           Yes -> Mispredicted Jump/branch
                  Enter jump/branch address, target address, and set prediction in BTB entry.
                  Flush fetched instructions
                  Restart PC at target address
  Yes -> ID/EX: Jump or taken branch?
           Yes -> Correct Prediction
                  No stall cycles
           No  -> Mispredicted branch
                  Branch not taken
                  Update prediction bits
                  Flush fetched instructions
                  Restart PC after branch
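The flow above can be sketched as a small branch target buffer model. This is a hedged illustration, not the design from the slides: it assumes a direct-mapped table indexed by low-order PC bits, a single predict bit per entry, byte-addressed 4-byte instructions (next PC = PC + 4), and the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    def __init__(self, index_bits: int = 4):
        self.index_bits = index_bits
        self.entries = {}  # index -> (branch_pc, target, predict_taken)

    def _index(self, pc: int) -> int:
        # Low-order bits of the word address select the entry.
        return (pc >> 2) & ((1 << self.index_bits) - 1)

    def predict(self, pc: int) -> int:
        """IF stage: return the predicted next PC."""
        entry = self.entries.get(self._index(pc))
        if entry and entry[0] == pc and entry[2]:
            return entry[1]          # hit with predict taken: fetch from target
        return pc + 4                # otherwise fetch the next sequential instruction

    def update(self, pc: int, taken: bool, target: int) -> bool:
        """Branch resolution: update the BTB; return True if a flush is needed."""
        mispredicted = self.predict(pc) != (target if taken else pc + 4)
        idx = self._index(pc)
        if taken:
            self.entries[idx] = (pc, target, True)     # enter address, target, prediction
        elif idx in self.entries and self.entries[idx][0] == pc:
            self.entries[idx] = (pc, self.entries[idx][1], False)  # update prediction bit
        return mispredicted


btb = BranchTargetBuffer()
first = btb.update(0x100, taken=True, target=0x200)   # not in BTB yet: mispredicted
second = btb.update(0x100, taken=True, target=0x200)  # now predicted taken: correct
print(first, second, hex(btb.predict(0x100)))
```

A correctly predicted jump or taken branch costs no stall cycles in this model; only the first encounter or a change of direction causes a flush.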
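[Figure: 1-bit prediction state diagram with two states, Predict Not Taken and Predict Taken; a Taken outcome leads to Predict Taken and a Not Taken outcome leads to Predict Not Taken]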
inner:
bne , , inner
bne , , outer
[Figure: 2-bit prediction state diagram with four states: Strong Predict Not Taken, Weak Predict Not Taken, Weak Predict Taken, Strong Predict Taken; transitions are labeled Taken and Not Taken]
[Table residue: trace of a last-outcome (1-bit) predictor with rows branch outcome, last outcome, prediction, and correctness; O : correct, X : mispredict; values not recoverable]
[Figure: Branch History Table (BHT) indexed by the PC; each entry is an automaton that produces the Prediction and is updated with the Branch outcome]
2-Bit BHT
2-bit scheme: the prediction changes only after mispredicting twice
MSB of the state symbol represents the prediction: 1 = TAKEN, 0 = NOT TAKEN
[Figure: four-state diagram; states 11 and 10 predict TAKEN, states 01 and 00 predict NOT TAKEN; transitions are labeled T (taken) and NT (not taken)]
branch outcome   1   1   1   0   1   1   1   0   1   1   1   0   ...
counter value    11  11  11  11  10  11  11  11  10  11  11  11  ...
prediction       T   T   T   T   T   T   T   T   T   T   T   T   ...
correctness      O   O   O   X   O   O   O   X   O   O   O   X   ...
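The trace can be reproduced with a 2-bit counter. The sketch assumes the state machine behaves as a saturating counter (increment on taken, decrement on not taken), which is consistent with the counter values shown but is one of several possible 2-bit schemes; the function name is illustrative.

```python
def run_two_bit_counter(outcomes, counter=0b11):
    """Return (counter, predicted_taken, correct) for each branch outcome."""
    trace = []
    for taken in outcomes:
        predicted_taken = counter >= 0b10          # MSB of the counter is the prediction
        trace.append((counter, predicted_taken, predicted_taken == bool(taken)))
        # Saturating update: move toward 11 on taken, toward 00 on not taken.
        counter = min(counter + 1, 0b11) if taken else max(counter - 1, 0b00)
    return trace


outcomes = [1, 1, 1, 0] * 3
trace = run_two_bit_counter(outcomes)
print("counter    :", " ".join(f"{c:02b}" for c, _, _ in trace))
print("prediction :", " ".join("T " if p else "NT" for _, p, _ in trace))
print("correctness:", " ".join("O " if ok else "X " for _, _, ok in trace))
```

Only the loop-exit outcome (the 0 in each group of four) is mispredicted, because a single not-taken outcome moves the counter from 11 to 10 without flipping the prediction.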
L1:
if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) { }
L2:
Branch Correlation
Code Snippet
if (aa==2)          // b1
    aa = 0;
if (bb==2)          // b2
    bb = 0;
if (aa!=bb) {       // b3
    .
}
[Figure: decision tree over the outcomes of b1 and b2 (0 = NT, 1 = T), each path ending at b3; on path A (0-0), aa=0 and bb=0]
Branch direction
Not independent
Correlated to the path taken
Correlating Branches
Example:
    if (d==0)
        d=1;
    if (d==1)

        BNEZ R1, b1        ; (b1) (d!=0)
        ADDI R1, R1, #1    ; since d==0, make d=1
b1:     SUBI R3, R1, #1
        BNEZ R3, b2        ; (b2) (d!=1)
        .....
b2:

Initial value of d   d==0?   b1   Value of d before b2   d==1?   b2
0                    Y       NT   1                      Y       NT
1                    N       T    1                      Y       NT
2                    N       T    2                      N       T
If b1 is NT, then b2 is NT

1-bit self history predictor
Sequence of d: 2,0,2,0,2,0,...
d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
2     NT              T           T                   NT              T           T
0     T               NT          NT                  T               NT          NT
2     NT              T           T                   NT              T           T
0     T               NT          NT                  T               NT          NT
Correlating Branches
Example:
    if (d==0)
        d=1;
    if (d==1)

        BNEZ R1, b1        ; branch b1 (d!=0)
        ADDI R1, R1, #1    ; since d==0, make d=1
b1:     SUBI R3, R1, #1
        BNEZ R3, b2        ; branch b2 (d!=1)
        .....
b2:

Self: prediction bits (XX)   Global: prediction, if last branch action was NT   Global: prediction, if last branch action was T
NT/NT                        NT                                                 NT
NT/T                         NT                                                 T
T/NT                         T                                                  NT
T/T                          T                                                  T

d=?   b1 prediction   b1 action   new b1 prediction   b2 prediction   b2 action   new b2 prediction
2     NT/NT           T           T/NT                NT/NT           T           NT/T
0     T/NT            NT          T/NT                NT/T            NT          NT/T
2     T/NT            T           T/NT                NT/T            T           NT/T
0     T/NT            NT          T/NT                NT/T            NT          NT/T
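The difference between the two schemes can be checked by simulating both on the sequence d = 2, 0, 2, 0. This is a sketch under stated assumptions: all predictions start as NT, the correlating predictor keeps one prediction bit per branch for each outcome of the most recently executed branch, and that global history starts as NT. Function names are illustrative.

```python
def branch_actions(d: int):
    """Return (b1_taken, b2_taken) for: if (d==0) d=1; if (d==1) ..."""
    b1 = d != 0            # b1 is taken when d != 0
    if d == 0:
        d = 1
    b2 = d != 1            # b2 is taken when d != 1
    return b1, b2


def one_bit_mispredicts(ds) -> int:
    pred = {"b1": False, "b2": False}            # False = predict NT
    misses = 0
    for d in ds:
        for name, taken in zip(("b1", "b2"), branch_actions(d)):
            misses += pred[name] != taken
            pred[name] = taken                   # remember the last outcome
    return misses


def correlating_mispredicts(ds) -> int:
    # pred[name][h]: prediction used when the last executed branch had outcome h.
    pred = {"b1": [False, False], "b2": [False, False]}
    last = False                                 # assumed initial history: NT
    misses = 0
    for d in ds:
        for name, taken in zip(("b1", "b2"), branch_actions(d)):
            misses += pred[name][last] != taken
            pred[name][last] = taken             # update only the selected bit
            last = taken
    return misses


ds = [2, 0, 2, 0]
print("1-bit predictor mispredictions:", one_bit_mispredicts(ds))
print("correlating predictor mispredictions:", correlating_mispredicts(ds))
```

Under these assumptions the 1-bit predictor mispredicts every branch, while the correlating predictor mispredicts only in the first iteration.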
Local/Global Predictors
Instead of maintaining a counter for each branch to
capture the common case,
Maintain a counter for each branch and surrounding pattern
If the surrounding pattern belongs to the branch being predicted,
the predictor is referred to as a local predictor
If the surrounding pattern includes neighboring branches, the
predictor is referred to as a global predictor
Correlating Branches
(2,2) predictor
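Behavior of recent branches selects between four predictions of the next branch, updating just that prediction
[Figure: (2,2) predictor; 4 bits of the branch address index a table of 2-bits-per-branch predictors, and the recent branch behavior selects which of them supplies the Prediction]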
[Figure: predictor organizations built from tables of 2-bit counters; the Branch PC, optionally hashed with w bits of branch direction history, selects the counter that gives the Prediction]
[Figure: Frequency of Mispredictions (axis 0% to 18%) for benchmarks nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li; per-benchmark values not recoverable]
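[Figure: two-level predictor; a first-in first-out branch history register (BHR) holding recent outcomes Rc-k ... Rc-1 indexes a pattern history table (PHT); the selected entry's current state gives the Prediction, and FSM update logic produces the PHT update]
BHR can be
Global
Per-set
Local (Per-address)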
[Figure: PHT lookup for PC = 0x4001000C with BHR = 0110; index 00110110 selects a 2-bit counter with value 10; MSB = 1, so Predict Taken]
[Figure: update for PC = 0x4001000C; the counter at index 00110110 is decremented from 10 to 01, the BHR changes from 0110 to 1100, and the next lookup uses index 00111100]
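The index values in this example can be reproduced with a few bit operations. The sketch assumes the PHT index is formed by concatenating four PC bits (after dropping the two byte-offset bits) with a 4-bit BHR, and that the counters saturate; the figure shows only the resulting values, so the exact bit selection and the function names are assumptions.

```python
def pht_index(pc: int, bhr: int, pc_bits: int = 4, bhr_bits: int = 4) -> int:
    # Drop the 2 byte-offset bits, keep `pc_bits` bits, then append the history.
    pc_part = (pc >> 2) & ((1 << pc_bits) - 1)
    return (pc_part << bhr_bits) | bhr


def update(counter: int, bhr: int, taken: bool, bhr_bits: int = 4):
    # 2-bit saturating counter update, then shift the outcome into the BHR.
    counter = min(counter + 1, 0b11) if taken else max(counter - 1, 0b00)
    bhr = ((bhr << 1) | int(taken)) & ((1 << bhr_bits) - 1)
    return counter, bhr


pc, bhr, counter = 0x4001000C, 0b0110, 0b10
print(f"index = {pht_index(pc, bhr):08b}, predict taken = {counter >= 0b10}")
counter, bhr = update(counter, bhr, taken=False)
print(f"counter = {counter:02b}, BHR = {bhr:04b}, next index = {pht_index(pc, bhr):08b}")
```

A not-taken outcome decrements the selected counter and shifts a 0 into the history, so the same branch uses a different PHT entry on its next execution.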
Tournament Predictors
A local predictor might work well for some branches or
programs, while a global predictor might work well for others
Provide one of each and maintain another predictor to
identify which predictor is best for each branch
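[Figure: the Branch PC feeds a Local Predictor, a Global Predictor, and a Tournament Predictor (table of 2-bit saturating counters); the tournament predictor drives a MUX that selects between the local and global predictions]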
Tournament Predictors
Multilevel branch predictor
Selector for the Global and Local predictors of correlating branch
prediction
Tournament Predictors
Advantage of tournament predictor is the ability to select the right
predictor for a particular branch
A typical tournament predictor selects global predictor 40% of the time
for SPEC integer benchmarks
AMD Opteron and Phenom use tournament style
[Figure: chart with a percentage axis from 0% to 10% and a horizontal axis from 0 to 96; one curve is labeled Tournament; other labels not recoverable]
Pipelines are deeper
Branches are not resolved until more cycles after fetch, so the misprediction penalty is greater
Cycle times are smaller: more emphasis on throughput (performance)
More functionality between fetch & execute
Object-oriented programming
More indirect branches, which are harder to predict
All this means that the potential stalling due to branches is greater
Pipelining Complications
Exceptions: Events other than branches or jumps that change the
normal flow of instruction execution. Some types of exceptions:
I/O Device request
Invoking an OS service from user program
Pipelining Complications
Exceptions: Events other than branches or jumps
that change the normal flow of instruction execution.
5 instructions executing in 5 stage pipeline
How to stop the pipeline?
Who caused the interrupt?
How to restart the pipeline?
Stage
IF
ID
EX
MEM
Pipelining Complications
Simultaneous exceptions in more than one pipeline stage,
e.g.,
LOAD with data page fault in MEM stage
ADD with instruction page fault in IF stage
Solution #1
Interrupt status vector per instruction
Defer check until last stage, kill state update if exception
Solution #2
Interrupt ASAP
Restart everything that is incomplete
Pipelining Complications
Our DLX pipeline only writes results at the end of the instruction's execution. Not all processors do this.
Condition Codes
Need to detect the last instruction to change condition codes