Advanced Computer

Architecture
Chapter 4
Advanced Pipelining

Ioannis Papaefstathiou
CS 590.25
Easter 2003
(thanks to Hennessy & Patterson)
Chapter Overview
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

Chap. 4 - Pipelining II 2
Chapter Overview
Technique                                   Reduces                      Section
Loop Unrolling                              Control stalls               4.1
Basic Pipeline Scheduling                   RAW stalls                   4.1
Dynamic Scheduling with Scoreboarding       RAW stalls                   4.2
Dynamic Scheduling with Register Renaming   WAR and WAW stalls           4.2
Dynamic Branch Prediction                   Control stalls               4.3
Issuing Multiple Instructions per Cycle     Ideal CPI                    4.4
Compiler Dependence Analysis                Ideal CPI & data stalls      4.5
Software Pipelining and Trace Scheduling    Ideal CPI & data stalls      4.5
Speculation                                 All data & control stalls    4.6
Dynamic Memory Disambiguation               RAW stalls involving memory  4.2, 4.6

Instruction Level Parallelism

ILP is the principle that there are many instructions in code that don't
depend on each other. That means it's possible to execute those
instructions in parallel.

This is easier said than done. Issues include:
• Building compilers to analyze the code,
• Building hardware to be even smarter than that code.

This section looks at some of the problems to be solved.

Instruction Level Parallelism: Pipeline Scheduling and Loop Unrolling
Terminology
Basic Block - That set of instructions between entry points and between
branches. A basic block has only one entry and one exit. Typically
this is about 6 instructions long.

Loop Level Parallelism - that parallelism that exists within a loop. Such
parallelism can cross loop iterations.

Loop Unrolling - Replicating the loop body multiple times so that either the
compiler or the hardware can exploit the parallelism inherent in the loop.
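The unrolling idea can be sketched in C. A minimal sketch (the function names and the 4x factor are illustrative, and the trip count is assumed to be a multiple of 4, as in the example that follows):

```c
#include <assert.h>

#define N 1000  /* trip count, assumed to be a multiple of 4 */

/* Rolled loop: one add per element, loop-maintenance overhead every element. */
void add_scalar(double *x, double s) {
    for (int i = 0; i < N; i++)
        x[i] = x[i] + s;
}

/* Unrolled 4x: four independent adds per iteration; the increment, test,
   and branch are amortized over four elements. */
void add_scalar_unrolled(double *x, double s) {
    for (int i = 0; i < N; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```

Both versions compute the same result; the unrolled body simply exposes four independent adds that a scheduler can interleave.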


Simple Loop and its Assembler Equivalent


for (i=1; i<=1000; i++) This is a clean and
x(i) = x(i) + s; simple example!

Loop: LD F0,0(R1) ;F0=vector element


ADDD F4,F0,F2 ;add scalar from F2
SD 0(R1),F4 ;store result
SUBI R1,R1,8 ;decrement pointer 8bytes (DW)
BNEZ R1,Loop ;branch R1!=zero
NOP ;delayed branch slot

FP Loop Hazards
Loop: LD F0,0(R1) ;F0=vector element
ADDD F4,F0,F2 ;add scalar in F2
SD 0(R1),F4 ;store result
SUBI R1,R1,8 ;decrement pointer 8B (DW)
BNEZ R1,Loop ;branch R1!=zero
NOP ;delayed branch slot

Instruction producing result  Instruction using result  Latency in clock cycles
FP ALU op                     Another FP ALU op         3
FP ALU op                     Store double              2
Load double                   FP ALU op                 1
Load double                   Store double              0
Integer op                    Integer op                0

Where are the stalls?



FP Loop Showing Stalls


1 Loop: LD F0,0(R1) ;F0=vector element
2 stall
3 ADDD F4,F0,F2 ;add scalar in F2
4 stall
5 stall
6 SD 0(R1),F4 ;store result
7 SUBI R1,R1,8 ;decrement pointer 8Byte (DW)
8 stall
9 BNEZ R1,Loop ;branch R1!=zero
10 stall ;delayed branch slot
Instruction producing result  Instruction using result  Latency in clock cycles
FP ALU op                     Another FP ALU op         3
FP ALU op                     Store double              2
Load double                   FP ALU op                 1
Load double                   Store double              0
Integer op                    Integer op                0
10 clocks: Rewrite code to minimize stalls?
Scheduled FP Loop Minimizing Stalls
1 Loop: LD   F0,0(R1)
2       SUBI R1,R1,8
3       ADDD F4,F0,F2
4       stall            ;SD can't proceed until ADDD completes
5       BNEZ R1,Loop     ;delayed branch
6       SD   8(R1),F4    ;offset altered when moved past SUBI

Swap BNEZ and SD by changing address of SD


Instruction producing result  Instruction using result  Latency in clock cycles
FP ALU op                     Another FP ALU op         3
FP ALU op                     Store double              2
Load double                   FP ALU op                 1

Now 6 clocks: now unroll the loop 4 times to make it faster.
Unroll Loop Four Times (straightforward way)
1 Loop: LD   F0,0(R1)
2       stall
3       ADDD F4,F0,F2
4       stall
5       stall
6       SD   0(R1),F4
7       LD   F6,-8(R1)
8       stall
9       ADDD F8,F6,F2
10      stall
11      stall
12      SD   -8(R1),F8
13      LD   F10,-16(R1)
14      stall
15      ADDD F12,F10,F2
16      stall
17      stall
18      SD   -16(R1),F12
19      LD   F14,-24(R1)
20      stall
21      ADDD F16,F14,F2
22      stall
23      stall
24      SD   -24(R1),F16
25      SUBI R1,R1,#32
26      BNEZ R1,LOOP
27      stall
28      NOP

15 + 4 x (1+2) + 1 = 28 clock cycles, or 7 per iteration.
Assumes the number of loop iterations is a multiple of 4 (R1 is a multiple of 32).
Rewrite loop to minimize stalls.
Unrolled Loop That Minimizes Stalls
1 Loop: LD   F0,0(R1)
2       LD   F6,-8(R1)
3       LD   F10,-16(R1)
4       LD   F14,-24(R1)
5       ADDD F4,F0,F2
6       ADDD F8,F6,F2
7       ADDD F12,F10,F2
8       ADDD F16,F14,F2
9       SD   0(R1),F4
10      SD   -8(R1),F8
11      SD   -16(R1),F12
12      SUBI R1,R1,#32
13      BNEZ R1,LOOP
14      SD   8(R1),F16   ; 8-32 = -24

What assumptions were made when the code was moved?
– OK to move the store past SUBI even though SUBI changes a register the
  store uses.
– OK to move the loads before the stores: do we still get the right data?
– When is it safe for the compiler to make such changes?

No stalls! 14 clock cycles, or 3.5 per iteration.
Summary of Loop Unrolling Example
• Determine that it was legal to move the SD after the SUBI and BNEZ,
and find the amount to adjust the SD offset.
• Determine that unrolling the loop would be useful by finding that the
loop iterations were independent, except for the loop maintenance
code.

• Use different registers to avoid unnecessary constraints that would be
  forced by using the same registers for different computations.
• Eliminate the extra tests and branches and adjust the loop
maintenance code.

• Determine that the loads and stores in the unrolled loop can be
interchanged by observing that the loads and stores from different
iterations are independent. This requires analyzing the memory
addresses and finding that they do not refer to the same address.
• Schedule the code, preserving any dependences needed to yield the
same result as the original code.
Instruction Level Parallelism: Dependencies
Compiler Perspectives on Code Movement
The compiler is concerned about dependencies in the program; whether a given
dependence actually causes a HW hazard depends on the pipeline.
• Tries to schedule code to avoid hazards.
• Looks for Data dependencies (RAW if a hazard for HW)
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data
dependent on instruction i.
• If dependent, can’t execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory:
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
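In C terms, the memory-disambiguation problem is pointer aliasing. A hypothetical illustration (the function name is invented): the two stores below can only be reordered if p and q are known not to refer to the same location, which generally cannot be determined from the code alone.

```c
#include <assert.h>

/* If p and q never alias, the two stores are independent and could be
   reordered or executed in parallel.  If they alias, the order matters.
   This is the C analogue of asking "does 100(R4) = 20(R6)?". */
void store_pair(int *p, int *q) {
    *p = 1;
    *q = 2;
}
```

With distinct targets the result is (1, 2); with aliased targets the second store wins, so reordering would change the answer.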

Instruction Level Parallelism: Data Dependencies

Compiler Perspectives on Code Movement

Where are the data dependencies?
1 Loop: LD F0,0(R1)
2 ADDD F4,F0,F2
3 SUBI R1,R1,8
4 BNEZ R1,Loop ;delayed branch
5 SD 8(R1),F4 ;altered when move past SUBI

Instruction Level Parallelism: Name Dependencies

Compiler Perspectives on Code Movement

• Another kind of dependence is called a name dependence:
  two instructions use the same name (register or memory location) but don't
  exchange data.
• Anti-dependence (WAR if a hazard for HW)
– Instruction j writes a register or memory location that instruction i reads from
and instruction i is executed first
• Output dependence (WAW if a hazard for HW)
– Instruction i and instruction j write the same register or memory location;
ordering between instructions must be preserved.
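A C analogue of these name dependences, with a reused temporary standing in for a reused register (function names are illustrative): the first version reuses the name t, creating anti- and output dependences between the two pairs even though no data flows between them; the second removes them by renaming.

```c
#include <assert.h>

/* One temporary 't' is reused: each read of 't' must happen before the
   next write (anti-dependence, WAR), and the two writes must stay ordered
   (output dependence, WAW) -- even though no data flows between the pairs. */
void sums_shared_temp(const int *x, int *out) {
    int t;
    t = x[0] + x[1]; out[0] = t;   /* pair 1 */
    t = x[2] + x[3]; out[1] = t;   /* pair 2: reuses the name 't' */
}

/* "Register renaming" at the source level: distinct names leave only true
   data dependences, so the two pairs are fully independent. */
void sums_renamed(const int *x, int *out) {
    int t0 = x[0] + x[1];
    int t1 = x[2] + x[3];
    out[0] = t0;
    out[1] = t1;
}
```

Both functions compute the same results; only the renamed version lets the two sums execute in either order.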

Compiler Perspectives on Code Movement
Where are the name dependencies?

1 Loop: LD   F0,0(R1)
2       ADDD F4,F0,F2
3       SD   0(R1),F4
4       LD   F0,-8(R1)
5       ADDD F4,F0,F2
6       SD   -8(R1),F4
7       LD   F0,-16(R1)
8       ADDD F4,F0,F2
9       SD   -16(R1),F4
10      LD   F0,-24(R1)
11      ADDD F4,F0,F2
12      SD   -24(R1),F4
13      SUBI R1,R1,#32
14      BNEZ R1,LOOP
15      NOP

No data is passed in F0, but F0 cannot be reused in instruction 4.
How can we remove these dependencies?
Compiler Perspectives on Code Movement
Where are the name dependencies?

1 Loop: LD   F0,0(R1)
2       ADDD F4,F0,F2
3       SD   0(R1),F4
4       LD   F6,-8(R1)
5       ADDD F8,F6,F2
6       SD   -8(R1),F8
7       LD   F10,-16(R1)
8       ADDD F12,F10,F2
9       SD   -16(R1),F12
10      LD   F14,-24(R1)
11      ADDD F16,F14,F2
12      SD   -24(R1),F16
13      SUBI R1,R1,#32
14      BNEZ R1,LOOP
15      NOP

Now there are data dependencies only. F0 appears only in instructions 1 and 2.
Called “register renaming”
Compiler Perspectives on Code Movement
• Again Name Dependencies are Hard for Memory Accesses
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
• Our example required the compiler to know that if R1 doesn't change, then
  0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
• There were no dependencies between some loads and stores, so they could be
  moved around each other.

Compiler Perspectives on Code Movement

• Final kind of dependence called control dependence


• Example
if p1 {S1;};
if p2 {S2;};
S1 is control dependent on p1 and S2 is control dependent on p2 but not
on p1.
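A small C illustration of why this matters (a hypothetical example, not from the text): the dereference below is control dependent on the null test, and hoisting it above the branch would change the program's exception behavior.

```c
#include <assert.h>

/* The load '*p' (S1) is control dependent on the test 'p != 0' (p1).
   Moving S1 above the branch could fault on a null pointer -- the
   exception-ordering constraint mentioned on the next slide. */
int safe_deref(const int *p) {
    if (p != 0)        /* p1 */
        return *p;     /* S1: control dependent on p1 */
    return -1;
}
```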

Instruction Level Parallelism: Control Dependencies
Compiler Perspectives on Code Movement

• Two (obvious) constraints on control dependences:


– An instruction that is control dependent on a branch cannot be moved
before the branch so that its execution is no longer controlled by the
branch.

– An instruction that is not control dependent on a branch cannot be moved
  to after the branch so that its execution is controlled by the branch.

• Control dependencies are relaxed to get parallelism; we get the same effect
  if we preserve the order of exceptions (an address in a register is checked
  by the branch before use) and the data flow (a value in a register depends
  on the branch).

Compiler Perspectives on Code Movement
Where are the control dependencies?

1 Loop: LD   F0,0(R1)
2       ADDD F4,F0,F2
3       SD   0(R1),F4
4       SUBI R1,R1,8
5       BEQZ R1,exit
6       LD   F0,0(R1)
7       ADDD F4,F0,F2
8       SD   0(R1),F4
9       SUBI R1,R1,8
10      BEQZ R1,exit
11      LD   F0,0(R1)
12      ADDD F4,F0,F2
13      SD   0(R1),F4
14      SUBI R1,R1,8
15      BEQZ R1,exit
....
Instruction Level Parallelism: Loop Level Parallelism
When Safe to Unroll Loop?
• Example: Where are data dependencies?
(A,B,C distinct & non-overlapping)
for (i=1; i<=100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}

1. S2 uses the value, A[i+1], computed by S1 in the same iteration.


2. S1 uses a value computed by S1 in an earlier iteration, since
iteration i computes A[i+1] which is read in iteration i+1. The same
is true of S2 for B[i] and B[i+1].
This is a “loop-carried dependence” between iterations

• Implies that iterations are dependent, and can’t be executed in parallel

• Note the case for our prior example; each iteration was distinct
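The recurrence above can be written out in C (a minimal sketch; the function name and bounds are illustrative). Each iteration reads the A value the previous iteration wrote, so the iterations must run in order:

```c
#include <assert.h>

/* The loop-carried dependence from S1: iteration i needs A[i], which was
   produced by iteration i-1, so the iterations cannot run in parallel. */
void recurrence(double *A, const double *C, int n) {
    for (int i = 1; i <= n; i++)
        A[i + 1] = A[i] + C[i];   /* S1: carried through A */
}
```

With C[i] = 1 everywhere and A[1] = 0, the chain forces A[101] = 100: the value threads serially through all 100 iterations.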

When Safe to Unroll Loop?
• Example: Where are data dependencies?
(A,B,C,D distinct & non-overlapping)

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

1. There is no dependence from S1 to S2. If there were, there would be a
   cycle in the dependencies and the loop would not be parallel. Since this
   dependence is absent, interchanging the two statements will not affect
   the execution of S2.
2. On the first iteration of the loop, statement S1 depends on the value of B[1]
computed prior to initiating the loop.


Now Safe to Unroll Loop? (p. 240)


No circular dependencies; the loop-carried dependence is on B.

OLD:
for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

NEW (loop-carried dependence eliminated):
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
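The transformation can be checked numerically. A sketch in C, using S1 in the form given on p. 240 of Hennessy & Patterson (A[i] = A[i] + B[i]); array sizes and initial values are illustrative:

```c
#include <assert.h>

#define N 100

/* Original loop: the only loop-carried dependence is S2 -> S1 through B,
   and it is not circular. */
void old_loop(double *A, double *B, const double *C, const double *D) {
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop: S2 is pulled one iteration ahead, with the first S1
   and the last S2 peeled off, so no dependence crosses an iteration. */
void new_loop(double *A, double *B, const double *C, const double *D) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];
}
```

Running both on identical inputs yields identical A and B arrays, which is the point of the rewrite: same results, but the new loop body has no loop-carried dependence.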

Dynamic Scheduling
Dynamic Scheduling is when the hardware rearranges the order of
instruction execution to reduce stalls.

Advantages:
• Dependencies unknown at compile time can be handled by the hardware.
• Code compiled for one type of pipeline can be run efficiently on another.

Disadvantages:
• The hardware is much more complex.

Dynamic Scheduling: The Idea
HW Schemes: Instruction Parallelism
• Why in HW at run time?
– Works when can’t know real dependence at compile time
– Compiler simpler
– Code for one machine runs well on another
• Key idea: allow instructions behind a stall to proceed.
• Key idea: instructions execute in parallel; there are multiple execution
  units, so use them.

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
– Enables out-of-order execution => out-of-order completion

HW Schemes: Instruction Parallelism
• Out-of-order execution divides ID stage:
1. Issue—decode instructions, check for structural hazards
2. Read operands—wait until no data hazards, then read operands
• Scoreboards allow instruction to execute whenever 1 & 2 hold, not
waiting for prior instructions.
• A scoreboard is a “data structure” that provides the information
necessary for all pieces of the processor to work together.
• We will use in-order issue, out-of-order execution, out-of-order commit
  (also called completion).
• First used in the CDC 6600; our example is modified here for DLX.
• CDC had 4 FP units, 5 memory reference units, 7 integer units.
• DLX has 2 FP multiply, 1 FP adder, 1 FP divider, 1 integer.

Dynamic Scheduling: Using a Scoreboard

Scoreboard Implications
• Out-of-order completion => WAR, WAW hazards?
• Solutions for WAR
– Queue both the operation and copies of its operands
– Read registers only during Read Operands stage
• For WAW, must detect hazard: stall until other completes
• Need to have multiple instructions in execution phase => multiple
execution units or pipelined execution units
• Scoreboard keeps track of dependencies and the state of operations
• Scoreboard replaces ID, EX, WB with 4 stages


Four Stages of Scoreboard Control


1. Issue —decode instructions & check for structural hazards (ID1)
If a functional unit for the instruction is free and no other active
instruction has the same destination register (WAW), the
scoreboard issues the instruction to the functional unit and
updates its internal data structure.
If a structural or WAW hazard exists, then the instruction issue
stalls, and no further instructions will issue until these hazards
are cleared.


Four Stages of Scoreboard Control


2. Read operands —wait until no data hazards, then read
operands (ID2)

A source operand is available if no earlier issued active
instruction is going to write it, i.e., if no currently active
functional unit still has that register as its destination.
When the source operands are available, the scoreboard tells
the functional unit to proceed to read the operands from
the registers and begin execution. The scoreboard
resolves RAW hazards dynamically in this step, and
instructions may be sent into execution out of order.

Four Stages of Scoreboard Control
3. Execution —operate on operands (EX)
The functional unit begins execution upon receiving
operands. When the result is ready, it notifies the
scoreboard that it has completed execution.

4. Write result —finish execution (WB)


Once the scoreboard is aware that the functional unit has
completed execution, the scoreboard checks for WAR
hazards. If none, it writes results. If WAR, then it stalls the
instruction.
Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
Scoreboard would stall SUBD until ADDD reads operands

Three Parts of the Scoreboard

1. Instruction status—which of 4 steps the instruction is in

2. Functional unit status—indicates the state of the functional unit (FU);
   nine fields for each functional unit:
Busy—Indicates whether the unit is busy or not
Op—Operation to perform in the unit (e.g., + or –)
Fi—Destination register
Fj, Fk—Source-register numbers
Qj, Qk—Functional units producing source registers Fj, Fk
Rj, Rk—Flags indicating when Fj, Fk are ready

3. Register result status—indicates which functional unit will write each
   register, if one exists; blank when no pending instruction will write
   that register.
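These data structures can be sketched in C (a minimal sketch; the type and field names are invented for illustration, and only the stage-1 issue test — structural hazard plus WAW — is shown):

```c
#include <assert.h>

#define NUM_REGS 32
#define NO_UNIT  (-1)

/* Functional unit status: the nine fields listed above. */
typedef struct {
    int  busy;          /* Busy */
    char op;            /* Op, e.g. '+' or '*' */
    int  fi, fj, fk;    /* destination and source register numbers */
    int  qj, qk;        /* units producing Fj, Fk (NO_UNIT when ready) */
    int  rj, rk;        /* flags: source operands ready to read */
} FUStatus;

typedef struct {
    FUStatus fu[4];             /* one entry per functional unit */
    int result[NUM_REGS];       /* register result status:
                                   which unit will write each register,
                                   NO_UNIT when no pending write */
} Scoreboard;

/* Stage-1 issue test: the unit must be free (no structural hazard) and no
   active instruction may already target the destination (no WAW hazard). */
int can_issue(const Scoreboard *sb, int unit, int dest) {
    return !sb->fu[unit].busy && sb->result[dest] == NO_UNIT;
}
```

The actual scoreboard also records the bookkeeping updates on issue (Fi, Fj, Fk, Qj, Qk, Rj, Rk); the struct above only captures the fields and the issue check.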

Detailed Scoreboard Pipeline Control
Stage               Wait until                          Bookkeeping
Issue               Not Busy(FU) and not Result(D)      Busy(FU)←Yes; Op(FU)←op; Fi(FU)←'D';
                                                        Fj(FU)←'S1'; Fk(FU)←'S2';
                                                        Qj←Result('S1'); Qk←Result('S2');
                                                        Rj←not Qj; Rk←not Qk; Result('D')←FU
Read operands       Rj and Rk                           Rj←No; Rk←No
Execution complete  Functional unit done                (functional unit notifies scoreboard)
Write result        ∀f((Fj(f)≠Fi(FU) or Rj(f)=No) and   ∀f(if Qj(f)=FU then Rj(f)←Yes);
                    (Fk(f)≠Fi(FU) or Rk(f)=No))         ∀f(if Qk(f)=FU then Rk(f)←Yes);
                                                        Result(Fi(FU))←0; Busy(FU)←No
Scoreboard Example
This is the sample code we’ll be working with in the example:

LD    F6, 34(R2)
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

What are the hazards in this code?


Latencies (clock cycles):
LD    1
MULTD 10
SUBD  2
DIVD  40
ADDD  2

Scoreboard Example
Instruction status
Instruction  j    k     Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2
LD     F2   45+  R3
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Functional unit status (Fi = dest, Fj/Fk = sources S1/S2, Qj/Qk = FU
producing Fj/Fk, Rj/Rk = source ready?)
Time  Name     Busy  Op  Fi  Fj  Fk  Qj  Qk  Rj  Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
       FU
Scoreboard Example Cycle 1
Instruction status Read Execution
Write
Instruction j k Issue operands
complete
Result Issue LD #1
LD F6 34+ R2 1
LD F2 45+ R3
MULTD F0 F2 F4 Shows in which cycle
SUBD F8 F6 F2 the operation occurred.
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Integer

Scoreboard Example Cycle 2
Instruction status Read Execution
Write LD #2 can’t issue since
Instruction j k Issue operands
complete
Result integer unit is busy.
LD F6 34+ R2 1 2
MULT can’t issue because
LD F2 45+ R3
MULTD F0 F2 F4
we require in-order issue.
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Integer

Scoreboard Example Cycle 3
Instruction status Read Execution
Write
Instruction j k Issue operands
complete
Result
LD F6 34+ R2 1 2 3
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Integer

Scoreboard Example Cycle 4
Instruction status Read Execution
Write
Instruction j k Issue operands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Integer

Scoreboard Example Cycle 5
Instruction status Read Execution
Write Issue LD #2 since integer
Instruction j k Issue operands
complete
Result unit is now free.
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Integer

Scoreboard Example Cycle 6
Instruction status Read Execution
Write Issue MULT.
Instruction j k Issue operands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6
MULTD F0 F2 F4 6
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 Integer

Scoreboard Example Cycle 7
Instruction status Read Execution
Write MULT can’t read its
Instruction j k Issue operands
complete
Result operands (F2) because LD
LD F6 34+ R2 1 2 3 4 #2 hasn’t finished.
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 Integer Add

Scoreboard Example Cycle 8a

Instruction status Read Execution


Write
DIVD issues.
Instruction j k Issue operands
complete
Result MULT and SUBD both
LD F6 34+ R2 1 2 3 4 waiting for F2.
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Integer Add Divide
Scoreboard Example Cycle 8b

Instruction status Read Execution


Write LD #2 writes F2.
Instruction j k Issue operands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Add Divide
Scoreboard Example Cycle 9
Instruction status Read Execution
Write
Instruction j k Issue operands
complete
Result
LD F6 34+ R2 1 2 3 4 Now MULT and SUBD can
LD F2 45+ R3 5 6 7 8 both read F2.
MULTD F0 F2 F4 6 9 How can both instructions
SUBD F8 F6 F2 7 9 do this at the same time??
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
10 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 Add Divide

Scoreboard Example Cycle 11
Instruction status Read Execution
Write ADDD can’t start because
Instruction j k Issue operands
complete
Result add unit is busy.
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
8 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
0 Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 Add Divide

Scoreboard Example Cycle 12

Instruction status Read Execution


Write
SUBD finishes.
Instruction j k Issue operands
complete
Result DIVD waiting for F0.
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
7 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 Divide
Scoreboard Example Cycle 13

Instruction status Read Execution


Write ADDD issues.
Instruction j k Issue operands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
6 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
13 FU Mult1 Add Divide
Scoreboard Example Cycle 14
Instruction status Read Execution
Write
Instruction j k Issue operands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status dest S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
5 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
14 FU Mult1 Add Divide

Scoreboard Example Cycle 15

Instruction status Read Execution


Write
Instruction j k Issue operands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
4 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
1 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 Add Divide
Scoreboard Example Cycle 16
Instruction status Read Execution
Write
Instruction j k Issue operands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
3 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
0 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU Mult1 Add Divide
Scoreboard Example Cycle 17

Instruction status Read Execution


Write ADDD can’t write because
Instruction j k Issue operands
complete
Result of DIVD. WAR!
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
2 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
17 FU Mult1 Add Divide
Scoreboard Example Cycle 18
Instruction status Read Execution
Write Nothing Happens!!
Instruction j k Issue operands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
1 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
18 FU Mult1 Add Divide

Scoreboard Example Cycle 19

Instruction status Read Execution


Write MULT completes execution.
Instruction j k Issue operands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
0 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
19 FU Mult1 Add Divide
Scoreboard Example Cycle 20
Instruction status Read Execution
Write MULT writes.
Instruction j k Issue operands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
20 FU Add Divide
Scoreboard Example Cycle 21
Instruction status Read Execution
Write DIVD loads operands
Instruction j k Issue operands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
21 FU Add Divide

Scoreboard Example Cycle 22

Now ADDD can write its result, since the WAR hazard is removed.

Instruction status:
Instruction          Issue   Read operands   Exec. complete   Write result
LD    F6  34+ R2       1          2                3                4
LD    F2  45+ R3       5          6                7                8
MULTD F0  F2  F4       6          9               19               20
SUBD  F8  F6  F2       7          9               11               12
DIVD  F10 F0  F6       8         21
ADDD  F6  F8  F2      13         14               16               22

Functional unit status (Fi = dest, Fj/Fk = S1/S2, Qj/Qk = FU producing j/k, Rj/Rk = operand ready?):
Time Name   Busy  Op    Fi   Fj   Fk   Qj     Qk     Rj   Rk
Integer No
Mult1 No
Mult2 No
Add No
40 Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
22 FU Divide
Chap. 4 - Pipelining II 57
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 61
DIVD completes execution.

Instruction status:
Instruction          Issue   Read operands   Exec. complete   Write result
LD    F6  34+ R2       1          2                3                4
LD    F2  45+ R3       5          6                7                8
MULTD F0  F2  F4       6          9               19               20
SUBD  F8  F6  F2       7          9               11               12
DIVD  F10 F0  F6       8         21               61
ADDD  F6  F8  F2      13         14               16               22

Functional unit status (Fi = dest, Fj/Fk = S1/S2, Qj/Qk = FU producing j/k, Rj/Rk = operand ready?):
Time Name   Busy  Op    Fi   Fj   Fk   Qj     Qk     Rj   Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
61 FU Divide

Chap. 4 - Pipelining II 58
Using A Scoreboard
Dynamic Scheduling
Scoreboard Example Cycle 62
DONE!!

Instruction status:
Instruction          Issue   Read operands   Exec. complete   Write result
LD    F6  34+ R2       1          2                3                4
LD    F2  45+ R3       5          6                7                8
MULTD F0  F2  F4       6          9               19               20
SUBD  F8  F6  F2       7          9               11               12
DIVD  F10 F0  F6       8         21               61               62
ADDD  F6  F8  F2      13         14               16               22

Functional unit status (Fi = dest, Fj/Fk = S1/S2, Qj/Qk = FU producing j/k, Rj/Rk = operand ready?):
Time Name   Busy  Op    Fi   Fj   Fk   Qj     Qk     Rj   Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
62 FU

Chap. 4 - Pipelining II 59
Using A Scoreboard
Dynamic Scheduling
Another Dynamic Algorithm:
Tomasulo Algorithm
• For IBM 360/91 about 3 years after CDC 6600 (1966)
• Goal: High Performance without special compilers
• Differences between IBM 360 & CDC 6600 ISA
– IBM has only 2 register specifiers / instruction vs. 3 in CDC 6600
– IBM has 4 FP registers vs. 8 in CDC 6600
• Why study? It led to the Alpha 21264, HP PA-8000, MIPS R10000, Pentium II,
PowerPC 604, …

Chap. 4 - Pipelining II 60
Using A Scoreboard
Dynamic Scheduling
Tomasulo Algorithm vs. Scoreboard
• Control & buffers distributed with Function Units (FU) vs.
centralized in scoreboard;
– FU buffers called “reservation stations”; have pending operands
• Registers in instructions replaced by values or pointers to
reservation stations(RS); called register renaming ;
– avoids WAR, WAW hazards
– More reservation stations than registers, so can do optimizations
compilers can’t
• Results to FU from RS, not through registers, over Common
Data Bus that broadcasts results to all FUs
• Load and Stores treated as FUs with RSs as well
• Integer instructions can go past branches, allowing
FP ops beyond basic block in FP queue

Chap. 4 - Pipelining II 61
Dynamic Scheduling Using A Scoreboard

Tomasulo Organization
FP Op Queue FP
Registers
Load
Buffer

Store
Common Buffer
Data
Bus
FP Add FP Mul
Res. Res.
Station Station

Chap. 4 - Pipelining II 62
Using A Scoreboard
Dynamic Scheduling
Reservation Station Components
Op—Operation to perform in the unit (e.g., + or –)
Vj, Vk—Value of Source operands
– Store buffers have V field, result to be stored
Qj, Qk—Reservation stations producing source registers (value to be
written)
– Note: No ready flags as in Scoreboard; Qj,Qk=0 => ready
– Store buffers only have Qi for RS producing result
Busy—Indicates reservation station or FU is busy

Register result status—Indicates which functional unit will write each
register, if one exists. Blank when no pending instruction will write
that register.

Chap. 4 - Pipelining II 63
Using A Scoreboard
Dynamic Scheduling

Three Stages of Tomasulo Algorithm


1. Issue—get instruction from FP Op Queue
If reservation station free (no structural hazard),
control issues instruction & sends operands (renames registers).
2. Execution—operate on operands (EX)
When both operands ready then execute;
if not ready, watch Common Data Bus for result
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting units;
mark reservation station available
• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source address
– Write if matches expected Functional Unit (produces result)
– Does the broadcast

Chap. 4 - Pipelining II 64
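The three stages above can be sketched in Python. This is a heavily simplified model, not the 360/91: timing is collapsed (a station executes as soon as its operands arrive), there is one shared pool of stations, and the instruction format and register names are invented. What it does show is register renaming: each source operand becomes either a value or a reservation-station tag, so WAR and WAW hazards disappear.

```python
from collections import namedtuple

Inst = namedtuple("Inst", "op dst src1 src2")

class Tomasulo:
    def __init__(self):
        self.regs = {r: 0 for r in ("F0", "F2", "F4", "F6")}
        self.stat = {}        # register result status: reg -> producing RS tag
        self.rs = {}          # reservation stations: tag -> entry
        self.next_tag = 0

    def issue(self, inst):
        tag = f"RS{self.next_tag}"; self.next_tag += 1
        e = {"op": inst.op, "dst": inst.dst}
        for field, src in (("j", inst.src1), ("k", inst.src2)):
            if src in self.stat:   # operand still being produced: record tag
                e["Q" + field], e["V" + field] = self.stat[src], None
            else:                  # operand ready: copy the value now
                e["Q" + field], e["V" + field] = None, self.regs[src]
        self.stat[inst.dst] = tag  # rename: dst is now owned by this station
        self.rs[tag] = e

    def step(self):
        # Execute + write result for every station whose operands are ready,
        # broadcasting the result on a (modelled) common data bus.
        for tag, e in list(self.rs.items()):
            if e["Qj"] is None and e["Qk"] is None:
                val = e["Vj"] + e["Vk"] if e["op"] == "ADD" else e["Vj"] - e["Vk"]
                del self.rs[tag]
                if self.stat.get(e["dst"]) == tag:   # only the newest writer
                    self.regs[e["dst"]] = val        # updates the register
                    del self.stat[e["dst"]]
                for w in self.rs.values():           # CDB broadcast to waiters
                    for f in ("j", "k"):
                        if w["Q" + f] == tag:
                            w["Q" + f], w["V" + f] = None, val

t = Tomasulo()
t.regs.update(F0=1, F2=2, F4=10)
t.issue(Inst("ADD", "F6", "F0", "F2"))   # F6 = 1 + 2
t.issue(Inst("SUB", "F0", "F4", "F6"))   # RAW on F6: waits via Qk
t.issue(Inst("ADD", "F6", "F2", "F2"))   # WAW on F6: renamed, no stall
while t.rs:
    t.step()
```

The SUB correctly receives the first ADD's F6 value (the RAW case, resolved via Qk), while the register F6 ends up holding the second ADD's result (the WAW case, resolved because only the newest owner of F6 writes it back).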
Using A Scoreboard
Dynamic Scheduling
Tomasulo Example Cycle 0
Instruction status:
Instruction          Issue   Exec. complete   Write result
LD    F6  34+ R2
LD    F2  45+ R3
MULTD F0  F2  F4
SUBD  F8  F6  F2
DIVD  F10 F0  F6
ADDD  F6  F8  F2

Load buffers:   Busy   Address
Load1           No
Load2           No
Load3           No

Reservation stations (Vj/Vk = values of S1/S2, Qj/Qk = RS producing them):
Time Name  Busy  Op   Vj   Vk   Qj   Qk
0 Add1 No
0 Add2 No
0 Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
0 FU

Chap. 4 - Pipelining II 65
Using A Scoreboard
Dynamic Scheduling
Review: Tomasulo

• Prevents the register file from becoming a bottleneck


• Avoids WAR, WAW hazards of Scoreboard
• Allows loop unrolling in HW
• Not limited to basic blocks (provided branch prediction)
• Lasting Contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
• 360/91 descendants are PowerPC 604, 620; MIPS R10000; HP-PA
8000; Intel Pentium Pro

Chap. 4 - Pipelining II 66
Dynamic Hardware
Prediction
4.1 Instruction Level Parallelism:
Concepts and Challenges Dynamic Branch Prediction is the ability
of the hardware to make an educated
4.2 Overcoming Data Hazards
with Dynamic Scheduling
guess about which way a branch will
go - will the branch be taken or not.
4.3 Reducing Branch Penalties
with Dynamic Hardware
Prediction The hardware can look for clues based
on the instructions, or it can use past
4.4 Taking Advantage of More ILP
with Multiple Issue history - we will discuss both of
these directions.
4.5 Compiler Support for
Exploiting ILP
4.6 Hardware Support for
Extracting more Parallelism
4.7 Studies of ILP

Chap. 4 - Pipelining II 67
Dynamic Hardware Basic Branch Prediction:
Branch Prediction Buffers
Prediction
Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table: lower-order bits of the branch PC index a table of 1-bit values
– Says whether or not branch taken last time
• Problem: in a loop, 1-bit BHT will cause two mis-predictions:
– End of loop case, when it exits instead of looping as before
– First time through loop on next time through code, when it predicts exit instead
of looping

[Diagram: PC address bits 13 - 2 index a 1024-entry (0..1023) table of 1-bit prediction entries]
Chap. 4 - Pipelining II 68
Dynamic Hardware Basic Branch Prediction:
Branch Prediction Buffers
Prediction
Dynamic Branch Prediction

• Solution: 2-bit scheme where change prediction only if get


misprediction twice: (Figure 4.13, p. 264)

[State diagram: four states - Predict Taken (strong), Predict Taken (weak),
Predict Not Taken (weak), Predict Not Taken (strong). A taken branch (T)
moves toward strong-taken, a not-taken branch (NT) toward strong-not-taken,
so the prediction flips only after two consecutive mispredictions.]

Chap. 4 - Pipelining II 69
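A minimal sketch of the 2-bit saturating-counter scheme, assuming a 1024-entry table indexed by low-order PC bits (the table size is our choice, not from the slide):

```python
# States 0-1 predict not taken, 2-3 predict taken; the prediction
# only flips after two consecutive mispredictions.
class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.table = [1] * entries      # start weakly not-taken
        self.entries = entries

    def predict(self, pc):
        return self.table[pc % self.entries] >= 2   # True => predict taken

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop branch taken 9 times then not taken, executed twice: the
# 2-bit counter mispredicts once on warm-up and once per loop exit,
# instead of twice per loop pass as a 1-bit scheme would.
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
misses = 0
for _ in range(2):
    for taken in outcomes:
        if p.predict(0x400) != taken:
            misses += 1
        p.update(0x400, taken)
```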
Dynamic Hardware Basic Branch Prediction:
Branch Prediction Buffers
Prediction

BHT Accuracy

• Mispredict because either:


– Wrong guess for that branch
– Got branch history of wrong branch when index the table
• With a 4096-entry table, misprediction rates vary from 1% (nasa7,
tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• 4096 about as good as infinite table, but 4096 is a lot of HW

Chap. 4 - Pipelining II 70
Dynamic Hardware Basic Branch Prediction:
Branch Prediction Buffers
Prediction

Correlating Branches
Idea: whether recently executed branches were taken or not is related to
the behavior of the next branch (as well as the history of that branch's
own behavior)
– The behavior of recent branches then selects between, say, four
predictions of the next branch, updating just that prediction

[Diagram: the branch address plus a 2-bit global branch history together
index a table of 2-bit per-branch predictors, yielding the prediction]
Chap. 4 - Pipelining II 71
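A hedged sketch of a (2,2)-style correlating predictor along the lines of the diagram above: 2 bits of global history select one of four 2-bit counters per branch entry. Table size and indexing are our assumptions.

```python
class CorrelatingPredictor:
    def __init__(self, entries=1024):
        self.table = [[1, 1, 1, 1] for _ in range(entries)]
        self.entries = entries
        self.history = 0                 # 2-bit global branch history

    def predict(self, pc):
        return self.table[pc % self.entries][self.history] >= 2

    def update(self, pc, taken):
        ctr = self.table[pc % self.entries]
        h = self.history
        ctr[h] = min(3, ctr[h] + 1) if taken else max(0, ctr[h] - 1)
        self.history = ((self.history << 1) | int(taken)) & 0b11

# An alternating branch (T, N, T, N, ...) defeats a simple 2-bit
# counter, but after a short warm-up the global history distinguishes
# the two cases and the branch becomes fully predictable.
p = CorrelatingPredictor()
misses = 0
for i in range(40):
    taken = (i % 2 == 0)
    if p.predict(0x800) != taken:
        misses += 1
    p.update(0x800, taken)
```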
Dynamic Hardware Basic Branch Prediction:
Branch Prediction Buffers
Prediction
Accuracy of Different Schemes
(Figure 4.21, p. 272) Frequency of mispredictions for three schemes:
4,096 entries with 2 bits per entry; unlimited entries with 2 bits per
entry; and 1,024 entries with 2 bits of history and 2 bits per entry,
i.e. a (2,2) correlating predictor.

[Bar chart over nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc,
espresso, eqntott, li: misprediction frequencies reach 18% on the worst
integer program and fall to 0 - 1% on several FP programs; the 1,024-entry
(2,2) predictor matches or beats the unlimited-entry simple predictor]

Chap. 4 - Pipelining II 72
Dynamic Hardware Basic Branch Prediction:
Branch Target Buffers
Prediction


Branch Target Buffer
Branch Target Buffer (BTB): use the address of the branch as an index to
get the prediction AND the branch-target address (if taken).
Note: must check for a branch match now, since we can't use a wrong
branch address. (Figure 4.22, p. 273)

Return-instruction addresses are predicted with a stack.

[Diagram: each BTB entry holds the predicted PC and the branch
prediction (taken or not taken)]

Chap. 4 - Pipelining II 73
Dynamic Hardware Basic Branch Prediction:
Branch Target Buffers
Prediction
Example

Instruction in buffer   Prediction   Actual branch   Penalty cycles
Yes                     Taken        Taken                 0
Yes                     Taken        Not taken             2
No                      -            Taken                 2

Example on page 274.


Determine the total branch penalty for a BTB using the above
penalties. Assume also the following:
• Prediction accuracy of 90%
• Hit rate in the buffer of 90%
• 60% taken branch frequency.

Branch Penalty = Percent buffer hit rate X Percent incorrect predictions X 2 +(1-
percent buffer hit rate) X Taken branches X 2
Branch Penalty = ( 90% X 10% X 2) + (10% X 60% X 2)
Branch Penalty = 0.18 + 0.12 = 0.30 clock cycles

Chap. 4 - Pipelining II 74
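The worked penalty computation above can be wrapped in a small helper. Parameter names are ours; the numbers reproduce the slide's arithmetic.

```python
def btb_penalty(hit_rate, accuracy, taken_freq, miss_cycles=2):
    # Mispredicted branches that hit in the buffer pay the penalty,
    # and taken branches that miss in the buffer pay it as well.
    return (hit_rate * (1 - accuracy) * miss_cycles
            + (1 - hit_rate) * taken_freq * miss_cycles)

penalty = btb_penalty(hit_rate=0.9, accuracy=0.9, taken_freq=0.6)
# (90% x 10% x 2) + (10% x 60% x 2) = 0.18 + 0.12 = 0.30 cycles/branch
```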
Multiple Issue
4.1 Instruction Level Parallelism: Multiple Issue is the ability of the
Concepts and Challenges processor to start more than one
4.2 Overcoming Data Hazards instruction in a given cycle.
with Dynamic Scheduling
4.3 Reducing Branch Penalties Flavor I:
with Dynamic Hardware Superscalar processors issue varying
Prediction
number of instructions per clock - can
4.4 Taking Advantage of More ILP be either statically scheduled (by the
with Multiple Issue
compiler) or dynamically scheduled
4.5 Compiler Support for (by the hardware).
Exploiting ILP
4.6 Hardware Support for Superscalar has a varying number of
Extracting more Parallelism instructions/cycle (1 to 8), scheduled
4.7 Studies of ILP by compiler or by HW (Tomasulo).

IBM PowerPC, Sun UltraSparc, DEC


Alpha, HP 8000

Chap. 4 - Pipelining II 75
Multiple Issue

Issuing Multiple Instructions/Cycle


Flavor II:
VLIW - Very Long Instruction Word - issues a fixed number of
instructions formatted either as one very large instruction or as a
fixed packet of smaller instructions.

fixed number of instructions (4-16) scheduled by the compiler; put


operators into wide templates
– Joint HP/Intel agreement in 1999/2000
– Intel Architecture-64 (IA-64) 64-bit address
– Style: “Explicitly Parallel Instruction Computer (EPIC)”

Chap. 4 - Pipelining II 76
Multiple Issue

Issuing Multiple Instructions/Cycle


Flavor II - continued:
• 3 Instructions in 128 bit “groups”; field determines if instructions
dependent or independent
– Smaller code size than old VLIW, larger than x86/RISC
– Groups can be linked to show independence > 3 instr
• 64 integer registers + 64 floating point registers
– Not separate files per functional unit as in old VLIW
• Hardware checks dependencies
(interlocks => binary compatibility over time)
• Predicated execution (select 1 out of 64 1-bit flags)
=> 40% fewer mis-predictions?
• IA-64 : name of instruction set architecture; EPIC is type
• Merced is name of first implementation (1999/2000?)

Chap. 4 - Pipelining II 77
A SuperScalar Version of DLX
Multiple Issue
Issuing Multiple Instructions/Cycle
In our DLX example, we can handle 2 instructions/cycle:
• Floating Point
• Anything Else
– Fetch 64 bits/clock cycle; Int on left, FP on right
– Can only issue 2nd instruction if 1st instruction issues
– More ports for FP registers to do FP load & FP op in a pair
Type Pipe Stages
Int. instruction IF ID EX MEM WB
FP instruction IF ID EX MEM WB
Int. instruction IF ID EX MEM WB
FP instruction IF ID EX MEM WB
Int. instruction IF ID EX MEM WB
FP instruction IF ID EX MEM WB
• 1 cycle load delay causes delay to 3 instructions in Superscalar
– instruction in right half can’t use it, nor instructions in next slot

Chap. 4 - Pipelining II 78
A SuperScalar Version of DLX
Multiple Issue
Unrolled Loop Minimizes Stalls for Scalar
Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles

1 Loop: LD F0,0(R1)
2 LD F6,-8(R1)
3 LD F10,-16(R1)
4 LD F14,-24(R1)
5 ADDD F4,F0,F2
6 ADDD F8,F6,F2
7 ADDD F12,F10,F2
8 ADDD F16,F14,F2
9 SD 0(R1),F4
10 SD -8(R1),F8
11 SD -16(R1),F12
12 SUBI R1,R1,#32
13 BNEZ R1,LOOP
14 SD 8(R1),F16 ; 8-32 = -24

14 clock cycles, or 3.5 per iteration


Chap. 4 - Pipelining II 79
A SuperScalar Version of DLX
Multiple Issue
Loop Unrolling in Superscalar
Integer instruction FP instruction Clock cycle
Loop: LD F0,0(R1) 1
LD F6,-8(R1) 2
LD F10,-16(R1) ADDD F4,F0,F2 3
LD F14,-24(R1) ADDD F8,F6,F2 4
LD F18,-32(R1) ADDD F12,F10,F2 5
SD 0(R1),F4 ADDD F16,F14,F2 6
SD -8(R1),F8 ADDD F20,F18,F2 7
SD -16(R1),F12 8
SD -24(R1),F16 9
SUBI R1,R1,#40 10
BNEZ R1,LOOP 11
SD 8(R1),F20 12
• Unrolled 5 times to avoid delays (+1 due to SS)
• 12 clocks, or 2.4 clocks per iteration
Chap. 4 - Pipelining II 80
Multiple Instruction Issue &
Multiple Issue Dynamic Scheduling

Dynamic Scheduling in Superscalar

Code compiled for the scalar version will run poorly on the superscalar.

May want code to vary depending on how superscalar the machine is.

Simple approach: separate Tomasulo Control for separate reservation


stations for Integer FU/Reg and for FP FU/Reg

Chap. 4 - Pipelining II 81
Multiple Instruction Issue &
Multiple Issue Dynamic Scheduling

Dynamic Scheduling in Superscalar


• How to do instruction issue with two instructions and keep in-order
instruction issue for Tomasulo?
– Issue 2X Clock Rate, so that issue remains in order
– Only FP loads might cause dependency between integer and FP
issue:
• Replace load reservation station with a load queue;
operands must be read in the order they are fetched
• Load checks addresses in Store Queue to avoid RAW violation
• Store checks addresses in Load Queue to avoid WAR,WAW

Chap. 4 - Pipelining II 82
Multiple Instruction Issue &
Multiple Issue Dynamic Scheduling

Performance of Dynamic Superscalar


Iteration no.  Instruction      Issues   Executes   Writes result   (clock-cycle numbers)
1 LD F0,0(R1) 1 2 4
1 ADDD F4,F0,F2 1 5 8
1 SD 0(R1),F4 2 9
1 SUBI R1,R1,#8 3 4 5
1 BNEZ R1,LOOP 4 5
2 LD F0,0(R1) 5 6 8
2 ADDD F4,F0,F2 5 9 12
2 SD 0(R1),F4 6 13
2 SUBI R1,R1,#8 7 8 9
2 BNEZ R1,LOOP 8 9
- 4 clocks per iteration
Branches, Decrements still take 1 clock cycle

Chap. 4 - Pipelining II 83
VLIW
Multiple Issue
Loop Unrolling in VLIW
Memory Memory FP FP Int. op/ Clock
reference 1 reference 2 operation 1 op. 2 branch
LD F0,0(R1) LD F6,-8(R1) 1
LD F10,-16(R1) LD F14,-24(R1) 2
LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2 ADDD F8,F6,F2 3
LD F26,-48(R1) ADDD F12,F10,F2 ADDD F16,F14,F2 4
ADDD F20,F18,F2 ADDD F24,F22,F2 5
SD 0(R1),F4 SD -8(R1),F8 ADDD F28,F26,F2 6
SD -16(R1),F12 SD -24(R1),F16 7
SD -32(R1),F20 SD -40(R1),F24 SUBI R1,R1,#48 8
SD -0(R1),F28 BNEZ R1,LOOP 9

• Unrolled 7 times to avoid delays


• 7 results in 9 clocks, or 1.3 clocks per iteration
• Need more registers to effectively use VLIW

Chap. 4 - Pipelining II 84
Limitations With Multiple Issue
Multiple Issue
Limits to Multi-Issue Machines
• Inherent limitations of ILP
– 1 branch in 5 instructions => how to keep a 5-way VLIW busy?
– Latencies of units => many operations must be scheduled
– Need about Pipeline Depth x No. Functional Units of independent
operations to keep machines busy.

• Difficulties in building HW
– Duplicate Functional Units to get parallel execution
– Increase ports to Register File (VLIW example needs 6 read and 3
write for Int. Reg. & 6 read and 4 write for FP Reg.)
– Increase ports to memory
– Decoding SS and impact on clock rate, pipeline depth

Chap. 4 - Pipelining II 85
Limitations With Multiple Issue
Multiple Issue

Limits to Multi-Issue Machines

• Limitations specific to either SS or VLIW implementation


– Decode issue in SS
– VLIW code size: unroll loops + wasted fields in VLIW
– VLIW lock step => 1 hazard & all instructions stall
– VLIW & binary compatibility

Chap. 4 - Pipelining II 86
Limitations With Multiple Issue
Multiple Issue
Multiple Issue Challenges
• While Integer/FP split is simple for the HW, get CPI of 0.5 only for
programs with:
– Exactly 50% FP operations
– No hazards
• If more instructions issue at same time, greater difficulty of decode and
issue
– Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2
instructions can issue
• VLIW: tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction word are
independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
• 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need compiling technique that schedules across several branches

Chap. 4 - Pipelining II 87
Compiler Support For ILP
4.1 Instruction Level Parallelism:
Concepts and Challenges How can compilers be smart?
1. Produce good scheduling of code.
4.2 Overcoming Data Hazards
with Dynamic Scheduling
2. Determine which loops might contain
parallelism.
4.3 Reducing Branch Penalties 3. Eliminate name dependencies.
with Dynamic Hardware
Prediction
Compilers must be REALLY smart to
4.4 Taking Advantage of More ILP
with Multiple Issue figure out aliases -- pointers in C are
a real problem.
4.5 Compiler Support for
Exploiting ILP
Techniques lead to:
4.6 Hardware Support for
Symbolic Loop Unrolling
Extracting more Parallelism
Critical Path Scheduling
4.7 Studies of ILP

Chap. 4 - Pipelining II 88
Compiler Support For ILP Symbolic Loop Unrolling

Software Pipelining
• Observation: if iterations from loops are independent, then can get ILP
by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made
from instructions chosen from different iterations of the original loop
(Tomasulo in SW)

[Diagram: iterations 0 - 4 overlap in time; one software-pipelined
iteration draws its instructions from several original iterations]

Chap. 4 - Pipelining II 89
Compiler Support For ILP Symbolic Loop Unrolling

SW Pipelining Example
Before: Unrolled 3 times           After: Software Pipelined

 1 LD   F0,0(R1)                       LD   F0,0(R1)
 2 ADDD F4,F0,F2                       ADDD F4,F0,F2
 3 SD   0(R1),F4                       LD   F0,-8(R1)
 4 LD   F6,-8(R1)                  1   SD   0(R1),F4    ; stores M[i]
 5 ADDD F8,F6,F2                   2   ADDD F4,F0,F2    ; adds to M[i-1]
 6 SD   -8(R1),F8                  3   LD   F0,-16(R1)  ; loads M[i-2]
 7 LD   F10,-16(R1)                4   SUBI R1,R1,#8
 8 ADDD F12,F10,F2                 5   BNEZ R1,LOOP
 9 SD   -16(R1),F12                    SD   0(R1),F4
10 SUBI R1,R1,#24                      ADDD F4,F0,F2
11 BNEZ R1,LOOP                        SD   -8(R1),F4

[Timing of the pipelined body: SD (IF ID EX Mem WB) reads the old F4
before ADDD (IF ID EX Mem WB) writes the new F4, and ADDD reads the old
F0 before LD (IF ID EX Mem WB) writes the new F0]
Chap. 4 - Pipelining II 90
Compiler Support For ILP Symbolic Loop Unrolling

SW Pipelining Example
Symbolic loop unrolling (software pipelining) vs. loop unrolling:
– Less code space
– Pipe fill/drain overhead paid only once, vs. loop overhead paid on
every unrolled pass in loop unrolling

[Diagram: with loop unrolling, 100 iterations = 25 passes through a
body of 4 unrolled iterations, each pass paying the overhead]
Chap. 4 - Pipelining II 91
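The software-pipelined DLX loop above can be mirrored in scalar Python: the steady-state body mixes the store of iteration i, the add of iteration i+1, and the load of iteration i+2, with an explicit prologue and epilogue. Array and variable names are ours.

```python
def sw_pipelined_add(a, c):
    """Compute [x + c for x in a] with a software-pipelined loop body."""
    n = len(a)
    out = [0] * n
    if n < 3:                        # too short to pipeline
        return [x + c for x in a]
    # Prologue: start the first two iterations (fill the pipe).
    load0 = a[0]
    add0 = load0 + c
    load1 = a[1]
    # Steady state: one store, one add, one load per trip,
    # each belonging to a different original iteration.
    for i in range(n - 2):
        out[i] = add0                # store for iteration i
        add0 = load1 + c             # add for iteration i+1
        load1 = a[i + 2]             # load for iteration i+2
    # Epilogue: drain the last two iterations.
    out[n - 2] = add0
    out[n - 1] = load1 + c
    return out
```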
Compiler Support For ILP Critical Path Scheduling

Trace Scheduling
• Parallelism across IF branches vs. LOOP branches
• Two steps:
– Trace Selection
• Find likely sequence of basic blocks (trace)
of (statically predicted or profile predicted)
long sequence of straight-line code
– Trace Compaction
• Squeeze trace into few VLIW instructions
• Need bookkeeping code in case prediction is wrong
• Compiler undoes bad guess
(discards values in registers)
• Subtle compiler bugs mean wrong answer
vs. poorer performance; no hardware interlocks

Chap. 4 - Pipelining II 92
Hardware Support For
Parallelism
4.1 Instruction Level Parallelism:
Concepts and Challenges Software support of ILP is best when
code is predictable at compile time.
4.2 Overcoming Data Hazards
with Dynamic Scheduling
But what if there’s no predictability?
4.3 Reducing Branch Penalties Here we’ll talk about hardware
with Dynamic Hardware
Prediction techniques. These include:
4.4 Taking Advantage of More ILP
with Multiple Issue • Conditional or Predicated
Instructions
4.5 Compiler Support for
Exploiting ILP
• Hardware Speculation
4.6 Hardware Support for
Extracting more Parallelism
4.7 Studies of ILP

Chap. 4 - Pipelining II 93
Hardware Support For Nullified Instructions
Parallelism
Tell the Hardware To Ignore An Instruction
• Avoid branch prediction by turning branches into
conditionally executed instructions:
IF (x) then A = B op C else NOP
– If false, then neither store result nor cause exception
– Expanded ISAs of Alpha, MIPS, PowerPC, SPARC, and x86
have conditional move. PA-RISC can annul any
following instruction.
– IA-64: 64 1-bit condition fields allow
conditional execution of any instruction.
• Drawbacks to conditional instructions:
– Still takes a clock, even if “annulled”
– Stalls if condition evaluated late
– Complex conditions reduce effectiveness; condition
becomes known late in pipeline.
This can be a major win because there is no time lost by
taking a branch!!
Chap. 4 - Pipelining II 94
Hardware Support For Nullified Instructions
Parallelism
Tell the Hardware To Ignore An Instruction
Suppose we have the code:
    if ( VarA == 0 )
        VarS = VarT;

Previous method (branch):
        LD    R1, VarA
        BNEZ  R1, Label
        LD    R2, VarT
        SD    VarS, R2
Label:

Nullified method (compare and nullify next instruction if not zero):
        LD      R1, VarA
        LD      R2, VarT
        CMPNNZ  R1, #0
        SD      VarS, R2
Label:

Nullified method (compare and move if zero):
        LD     R1, VarA
        LD     R2, VarT
        CMOVZ  VarS, R2, R1

Chap. 4 - Pipelining II 95
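The CMOVZ idea — selecting a value without taking a branch — can be sketched with integer masking. The function name mirrors the instruction, but the masking trick is our illustration, not how any particular hardware implements it.

```python
def cmovz(dest, src, cond):
    # If cond == 0, the result is src; otherwise dest is kept.
    # No data-dependent branch: the condition becomes an all-ones
    # or all-zeros mask that selects one of the two operands.
    mask = -(cond == 0)              # -1 (all ones) when cond is zero
    return (src & mask) | (dest & ~mask)
```

This is the same trade the slide describes: the select always costs an operation, but there is no branch to mispredict.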
Hardware Support For Compiler Speculation
Parallelism
Increasing Parallelism
The theory here is to move an instruction across a branch so as to
increase the size of a basic block and thus to increase parallelism.

The primary difficulty is in avoiding exceptions. For example,
if ( a != 0 ) c = b/a; may cause a divide-by-zero error if the divide
is moved above the test.

Methods for increasing speculation include:

1. Use a set of status bits (poison bits) associated with the registers.
They signal that the instruction's results are invalid until some later
time.
2. Result of instruction isn’t written until it’s certain the instruction is
no longer speculative.

Chap. 4 - Pipelining II 96
Hardware Support For Compiler Speculation
Parallelism
Increasing Parallelism - Example on Page 305.

Source code (assume A is at 0(R3) and B is at 0(R2)):
    if ( A == 0 )  A = B;  else  A = A + 4;

Original Code:
        LW   R1, 0(R3)      ; Load A
        BNEZ R1, L1         ; Test A
        LW   R1, 0(R2)      ; If clause
        J    L2             ; Skip else
L1:     ADDI R1, R1, #4     ; Else clause
L2:     SW   0(R3), R1      ; Store A

Speculated Code:
        LW   R1, 0(R3)      ; Load A
        LW   R14, 0(R2)     ; Speculative load of B
        BEQZ R1, L3         ; Branch on the if condition
        ADDI R14, R1, #4    ; Else clause
L3:     SW   0(R3), R14     ; Non-speculative store

Note here that only ONE side needs to take a branch!!

Chap. 4 - Pipelining II 97
Hardware Support For Compiler Speculation
Parallelism

Poison Bits
In the example on the last page, if the LW* produces an exception, a
poison bit is set on that register. Then, if a later instruction tries
to use the register, an exception is raised at that point.

Speculated Code:
        LW   R1, 0(R3)      ; Load A
        LW*  R14, 0(R2)     ; Speculative load of B
        BEQZ R1, L3         ; Branch on the if condition
        ADDI R14, R1, #4    ; Else clause
L3:     SW   0(R3), R14     ; Non-speculative store

Chap. 4 - Pipelining II 98
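A toy model of the poison-bit mechanism: a speculative load that faults sets the destination register's poison bit instead of raising, and the exception surfaces only if some later instruction actually reads the poisoned register. The dictionary-backed memory (a missing address stands in for a faulting access) and the register names are our assumptions.

```python
class PoisonRegs:
    def __init__(self):
        self.value = {}
        self.poison = set()

    def spec_load(self, reg, mem, addr):
        try:
            self.value[reg] = mem[addr]
            self.poison.discard(reg)
        except KeyError:              # fault is deferred, not raised
            self.poison.add(reg)

    def read(self, reg):
        if reg in self.poison:        # use of a poisoned value: raise NOW
            raise RuntimeError(f"deferred exception on {reg}")
        return self.value[reg]

mem = {0x100: 42}
r = PoisonRegs()
r.spec_load("R14", mem, 0x100)       # legal speculative load
ok = r.read("R14")
r.spec_load("R14", mem, 0x999)       # would fault; poison bit set instead
```

If the branch later shows R14's value was never needed, the poisoned register is simply never read and no exception occurs.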
Hardware Support For Hardware Speculation
Parallelism
HW support for More ILP
• Need a HW buffer for results of uncommitted instructions: the reorder buffer
  – Reorder buffer can be an operand source
  – Once an operand commits, the result is found in the register
  – 3 fields: instr. type, destination, value
  – Use the reorder buffer number instead of the reservation station
  – Discard instructions on mis-predicted branches or on exceptions

[Diagram (Figure 4.34, page 311): the FP op queue and FP registers feed
reservation stations and FP adders; results flow through the reorder
buffer before committing to the registers]

Chap. 4 - Pipelining II 99
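A minimal sketch of in-order commit through a reorder buffer, using the slide's three fields reduced to destination and value (instruction type omitted). Entry layout and method names are ours.

```python
class ReorderBuffer:
    def __init__(self):
        self.entries = []             # [dest, value_or_None] in program order

    def issue(self, dest):
        self.entries.append([dest, None])
        return len(self.entries) - 1  # the ROB index names the result

    def complete(self, idx, value):
        self.entries[idx][1] = value  # results may arrive out of order

    def commit(self, regs):
        # Retire strictly from the head, only while results are ready.
        while self.entries and self.entries[0][1] is not None:
            dest, val = self.entries.pop(0)
            regs[dest] = val

    def flush(self):
        self.entries.clear()          # mispredicted branch: discard all

regs = {}
rob = ReorderBuffer()
i0 = rob.issue("F0")
i1 = rob.issue("F2")
rob.complete(i1, 7)                  # the younger instruction finishes first...
rob.commit(regs)                     # ...but cannot retire past the stalled head
regs_after_first_commit = dict(regs) # still empty
rob.complete(i0, 3)
rob.commit(regs)                     # now both retire, in program order
```

Because the registers are only updated at commit, `flush()` after a mispredicted branch or an exception discards every uncommitted result without any register cleanup.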
Hardware Support For Hardware Speculation
Parallelism
HW support for More ILP
How is this used in practice?

Rather than predicting the direction of a branch, execute the
instructions on both sides!!

We know the target of a branch early on, long before we know whether it
will be taken or not.

So begin fetching/executing at that new Target PC.


But also continue fetching/executing as if the branch NOT taken.

Chap. 4 - Pipelining II 100


Studies of ILP
4.1 Instruction Level Parallelism: • Conflicting studies of amount of
Concepts and Challenges
improvement available
4.2 Overcoming Data Hazards – Benchmarks (vectorized FP
with Dynamic Scheduling
Fortran vs. integer C programs)
4.3 Reducing Branch Penalties
with Dynamic Hardware – Hardware sophistication
Prediction – Compiler sophistication
4.4 Taking Advantage of More ILP • How much ILP is available using existing
with Multiple Issue mechanisms with increasing HW
4.5 Compiler Support for budgets?
Exploiting ILP • Do we need to invent new HW/SW
4.6 Hardware Support for mechanisms to keep on processor
Extracting more Parallelism performance curve?
4.7 Studies of ILP

Chap. 4 - Pipelining II 101


Studies of ILP
Limits to ILP
Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
1. Register renaming–infinite virtual registers and all WAW & WAR
hazards are avoided
2. Branch prediction–perfect; no mispredictions
3. Jump prediction–all jumps perfectly predicted => machine with
perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis–addresses are known & a store can
be moved before a load provided addresses not equal
1 cycle latency for all instructions; unlimited number of instructions
issued per clock cycle

Chap. 4 - Pipelining II 102


Studies of ILP - Upper Limit to ILP: Ideal Machine
(Figure 4.38, page 319)

This is the amount of parallelism when there are no branch
mis-predictions and we're limited only by data dependencies.

Instruction issues per cycle (FP: 75 - 150, Integer: 18 - 60):

Program    gcc    espresso    li     fpppp    doducd    tomcatv
IPC        54.8     62.6     17.9     75.2     118.7     150.1

Instructions that could theoretically be issued per cycle.

Chap. 4 - Pipelining II 103
Studies of ILP
Impact of Realistic Branch
Prediction

What parallelism do we get when we don’t allow perfect branch


prediction, as in the last picture, but assume some realistic model?
Possibilities include:

1. Perfect - all branches are perfectly predicted (the last slide)

2. Selective History Predictor - a complicated but do-able mechanism for


selection.

3. Standard 2-bit history predictor with 512 2-bit entries.

4. Static prediction based on past history of the program.

5. None - Parallelism is limited to basic block.

Chap. 4 - Pipelining II 104


Studies of ILP - Bonus!!

Selective History Predictor

[Diagram: the branch address indexes a non-correlating predictor
(8096 x 2 bits) and, combined with a 2-bit global history, a
correlating predictor (2048 x 4 x 2 bits). An 8K x 2-bit selector
table, also indexed by the branch address, chooses which predictor's
taken/not-taken output to use: selector values 11/10 choose the
non-correlator, 01/00 choose the correlator.]

Chap. 4 - Pipelining II 105
Studies of ILP - Impact of Realistic Branch Prediction
(Figure 4.42, Page 325)

Effect of limiting the type of branch prediction.

[Bar chart: instruction issues per cycle for gcc, espresso, li, fpppp,
doducd, tomcatv under five schemes - perfect prediction, the selective
history predictor, a standard 2-bit predictor (512 entries), static
profile-based prediction, and no prediction. With realistic predictors
the FP programs achieve roughly 15 - 45 issues per cycle, the integer
programs only 6 - 12.]

Chap. 4 - Pipelining II 106
Studies of ILP - More Realistic HW: Register Impact
(Figure 4.44, Page 328)

Effect of limiting the number of renaming registers.

[Bar chart: instruction issues per cycle for gcc, espresso, li, fpppp,
doducd, tomcatv with infinite, 256, 128, 64, 32, and no extra renaming
registers. The FP programs range roughly over 11 - 45 issues per cycle,
the integer programs over 5 - 15.]

Chap. 4 - Pipelining II 107
Studies of ILP - More Realistic HW: Alias Impact
(Figure 4.46, Page 330)

What happens when there may be conflicts with memory aliasing?

[Bar chart: instruction issues per cycle for gcc, espresso, li, fpppp,
doducd, tomcatv under four alias-analysis models - perfect, global/stack
perfect with heap conflicts assumed (realistic for Fortran, which has no
heap), inspection, and none. The FP programs range roughly over 4 - 45
issues per cycle, the integer programs over 4 - 9.]

Chap. 4 - Pipelining II 108
Summary

4.1 Instruction Level Parallelism: Concepts and Challenges


4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

Chap. 4 - Pipelining II 109
