Compiler Techniques For Exposing ILP

Chapter 2-1
Instruction-Level Parallelism
and Its Exploitation
Seoul Natl, CAPP, Hyuk-Jae Lee
2.1 Instruction-Level Parallelism: Concepts and Challenges
Instruction-level parallelism (ILP): the property that multiple instructions can be executed in parallel (simultaneously) or in a pipelined manner
Loop-level parallelism
Ex)
    for (I=1; I <= 1000; I++)
        x[I] = x[I] + y[I];
Every iteration can be executed in parallel with the other iterations
    x[1] = x[1] + y[1];    // I=1
    x[2] = x[2] + y[2];    // I=2
    x[3] = x[3] + y[3];    // I=3
Loop-level parallelism can be converted into ILP

Exploit ILP
HW approach (dynamic)
SW (compiler, assembly programming) approach (static)

Pipeline CPI = ideal pipeline CPI + structural stalls + data hazard stalls + control stalls
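The CPI formula above can be written as a one-line helper. This is only a sketch: the function name and the choice to pass each stall term as an average number of stall cycles per instruction are assumptions made for illustration, not something the slides define.

```c
/* Hypothetical helper: each stall argument is the average number of stall
   cycles per instruction contributed by that hazard class. */
double pipeline_cpi(double ideal_cpi, double structural_stalls,
                    double data_hazard_stalls, double control_stalls)
{
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls;
}
```

For example, an ideal CPI of 1 with 0.25 data hazard stall cycles and 0.5 control stall cycles per instruction gives a pipeline CPI of 1.75.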
Pipeline CPI
Ideal pipeline CPI = 1
Example)
    ADD   R1,R2,R3
    SUB   R4,R5,R6
    AND   R7,R8,R9
    OR    R10,R11,R12
    XOR   R13,R14,R15
    ADD   R16,R17,R18
Pipelined execution: produces a result in every clock cycle
    clock  1   2   3   4   5   6   7   8   9   10
    ADD    IF  ID  EX  ME  WB
    SUB        IF  ID  EX  ME  WB
    AND            IF  ID  EX  ME  WB
    OR                 IF  ID  EX  ME  WB
    XOR                    IF  ID  EX  ME  WB
    ADD                        IF  ID  EX  ME  WB
Pipeline CPI
Data hazard causes stall
Pipeline CPI = ideal pipeline CPI + data hazard stalls > 1
Example)
    ADD   R1,R2,R3
    LD    R4,100(R5)
    AND   R7,R4,R9
    OR    R10,R11,R12
    XOR   R13,R14,R15
    ADD   R16,R17,R18
Pipelined execution: no longer produces a result in every clock cycle (AND must wait one cycle for the value that LD loads into R4)
    clock  1     2     3     4     5     6     7     8     9     10    11
    ADD    IF    ID    EX    ME    WB
    LD           IF    ID    EX    ME    WB
    AND                IF    ID    stall EX    ME    WB
    OR                       IF    stall ID    EX    ME    WB
    XOR                                  IF    ID    EX    ME    WB
    ADD                                        IF    ID    EX    ME    WB
Pipeline CPI
Control hazard causes stall
Pipeline CPI = ideal pipeline CPI + data hazard stalls + control hazard stalls > 1
Example)
       ADD   R1,R2,R3
       LD    R4,100(R5)
       AND   R7,R4,R9
       BEQZ  R8,L
       XOR   R13,R14,R15
       ADD   R16,R17,R18
    L: SUB   R19,R20,R21
Pipelined execution: the load-use stall and the branch stall each cost one cycle (the diagram shows the case where the branch is taken, so SUB follows BEQZ)
    clock  1     2     3     4     5     6     7     8     9     10    11
    ADD    IF    ID    EX    ME    WB
    LD           IF    ID    EX    ME    WB
    AND                IF    ID    stall EX    ME    WB
    BEQZ                     IF    stall ID    EX    ME    WB
    SUB                                  stall IF    ID    EX    ME    WB
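The cycle counts in the three pipeline examples above can be checked with a small calculation. The sketch below assumes a single-issue pipeline in which every stall cycle delays all later instructions by exactly one cycle; the function names are hypothetical.

```c
/* Cycles needed to complete n instructions on a single-issue pipeline with
   `depth` stages: the first instruction takes `depth` cycles, each later one
   completes one cycle after its predecessor, and every stall cycle pushes
   all following instructions back by one cycle. */
int total_cycles(int n_instructions, int depth, int stall_cycles)
{
    if (n_instructions <= 0)
        return 0;
    return depth + (n_instructions - 1) + stall_cycles;
}

/* Steady-state CPI (pipeline fill time ignored), assuming an ideal CPI of 1. */
double effective_cpi(int n_instructions, int stall_cycles)
{
    return 1.0 + (double)stall_cycles / (double)n_instructions;
}
```

With a depth of 5, six instructions and no stalls finish in 10 cycles, six instructions with one load-use stall finish in 11, and the five executed instructions of the branch example (two stalls) also finish in 11.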
Pipeline CPI
Pipeline CPI = ideal pipeline CPI + structural stalls + data hazard stalls + control stalls
Structural stalls: required resources are not available
To minimize CPI, try to avoid these additional stalls by exploiting parallelism
Data Dependence
Data dependence definition:
An instruction j is data dependent on instruction i if
    i produces a result used by j, or
    j is data dependent on k and k is data dependent on i
Ex)
Loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        BNE     R1, R2, Loop
ADD.D is data dependent on L.D
S.D is data dependent on ADD.D & L.D
BNE is data dependent on DADDUI
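The two-part definition above (direct use of a result, plus transitivity) can be sketched as a small analysis over one loop iteration. The instruction encoding, register numbering, and function names below are assumptions made for this illustration; loop-carried dependences between iterations are deliberately ignored.

```c
#include <stdbool.h>

#define MAX_INSTR 16
#define NO_REG    (-1)
#define F(n)      (n)          /* floating-point register Fn */
#define R(n)      (32 + (n))   /* integer register Rn */

typedef struct {
    const char *name;
    int dst;       /* register written, or NO_REG */
    int src[2];    /* registers read, NO_REG if unused */
} Instr;

/* The loop body from the example, in program order. */
static const Instr loop_body[5] = {
    { "L.D",    F(0),   { R(1), NO_REG } },
    { "ADD.D",  F(4),   { F(0), F(2)   } },
    { "S.D",    NO_REG, { F(4), R(1)   } },
    { "DADDUI", R(1),   { R(1), NO_REG } },
    { "BNE",    NO_REG, { R(1), R(2)   } },
};

/* Sets dep[j][i] when instruction j is data dependent on instruction i. */
void find_data_dependences(const Instr *code, int n,
                           bool dep[MAX_INSTR][MAX_INSTR])
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            dep[j][i] = false;

    for (int j = 0; j < n; j++) {
        /* Direct: j reads a register whose most recent earlier writer is i. */
        for (int s = 0; s < 2; s++) {
            int reg = code[j].src[s];
            if (reg == NO_REG)
                continue;
            for (int i = j - 1; i >= 0; i--) {
                if (code[i].dst == reg) {
                    dep[j][i] = true;
                    break;
                }
            }
        }
        /* Transitive: j depends on k and k depends on i. Rows k < j are
           already complete because instructions are handled in order. */
        for (int k = j - 1; k >= 0; k--) {
            if (!dep[j][k])
                continue;
            for (int i = 0; i < k; i++)
                if (dep[k][i])
                    dep[j][i] = true;
        }
    }
}

/* Convenience query on the example loop body. */
bool loop_depends(int j, int i)
{
    bool dep[MAX_INSTR][MAX_INSTR];
    find_data_dependences(loop_body, 5, dep);
    return dep[j][i];
}
```

Here `loop_depends(2, 0)` is true because S.D depends on ADD.D, which in turn depends on L.D, while `loop_depends(3, 0)` is false because DADDUI uses no result produced earlier in the same iteration.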
Data Dependence and Hazard
Hazard: overlap or re-ordering of instructions may generate incorrect results
Data dependent instructions -> potential for a hazard
Ex) DADDUI R1,R1,#-8 -> BNE R1,R2,Loop
    DADDUI  IF    ID    EX    MEM   WB
    BNE           IF    ID    EX    MEM   WB
This overlap may cause a hazard because BNE requires the result of DADDUI at ID stage
Solution: stop overlapping -> stall the pipeline
    DADDUI  IF    ID    EX    MEM   WB
    BNE           IF    stall ID    EX    MEM   WB
Solution (HW): pipeline interlock, which stops the pipeline for the later instructions
Solution (SW): compiler scheduling, which inserts NOP instructions
Occurrence of hazard depends on the pipeline organization
Ex) L.D -> ADD.D, or ADD.D -> S.D
Data Dependence and Hazard
Ex) L.D F0,0(R1) -> ADD.D F4,F0,F2
    L.D     IF    ID    EX    MEM   WB
    ADD.D         IF    ID    EX    MEM   WB
    Data hazard
Ex) ADD.D F4,F0,F2 -> S.D F4,0(R1)
    ADD.D   IF    ID    EX    MEM   WB
    S.D           IF    ID    EX    MEM   WB
    No data hazard
Name Dependences
Antidependence: j writes a register/memory location that i reads (i precedes j)
Ex)
    ADD.D   F2, F4, F6
    SUB.D   F6, F8, F10
    S.D     F6, 0(R1)
Output dependence: i and j write the same register/memory location
Ex)
    ADD.D   F2, F4, F6
    SUB.D   F2, F8, F10
Register renaming: can remove name dependence
Ex)
    ADD.D   F2, F4, F6
    SUB.D   F12, F8, F10
    S.D     F12, 0(R1)
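The renaming example above can be reproduced with a simple pass over a straight-line sequence. This is a minimal sketch under stated assumptions: the instruction encoding and all names are hypothetical, registers numbered `first_free` and above are assumed unused by the code, and the old value of a renamed register is assumed not to be needed after the sequence.

```c
#define NUM_REGS 64
#define NO_REG   (-1)
#define F(n)     (n)          /* floating-point register Fn */
#define R(n)     (32 + (n))   /* integer register Rn */

typedef struct {
    int dst;       /* register written, or NO_REG */
    int src[2];    /* registers read, NO_REG if unused */
} Instr;

/* ADD.D F2,F4,F6 / SUB.D F6,F8,F10 / S.D F6,0(R1) from the example. */
static Instr slide_example[3] = {
    { F(2),   { F(4), F(6)  } },
    { F(6),   { F(8), F(10) } },
    { NO_REG, { F(6), R(1)  } },
};

/* Gives a destination a fresh register whenever an earlier instruction has
   already read or written that register, then redirects later reads to the
   new name. This removes antidependences and output dependences while
   keeping true data dependences intact. */
void rename_registers(Instr *code, int n, int first_free)
{
    int current[NUM_REGS];          /* current name of each original register */
    int touched[NUM_REGS] = { 0 };  /* read or written by an earlier instruction */
    int next_free = first_free;

    for (int r = 0; r < NUM_REGS; r++)
        current[r] = r;

    for (int i = 0; i < n; i++) {
        for (int s = 0; s < 2; s++) {
            int reg = code[i].src[s];
            if (reg == NO_REG)
                continue;
            touched[reg] = 1;
            code[i].src[s] = current[reg];
        }
        int dst = code[i].dst;
        if (dst == NO_REG)
            continue;
        if (touched[dst] && next_free < NUM_REGS) {
            current[dst] = next_free;
            code[i].dst = next_free;
            next_free++;
        }
        touched[dst] = 1;
    }
}
```

Calling `rename_registers(slide_example, 3, F(12))` leaves ADD.D unchanged, rewrites SUB.D to write F12, and makes S.D read F12, which matches the renamed sequence in the example.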
Data Hazards
Data hazards: data dependent instructions are close enough that the overlap or reordering of instructions would cause incorrect results (i precedes j in program order)
RAW (Read after Write): j tries to read a source before i writes it, so j incorrectly gets the old value.
    Caused by violation of true data dependence
WAW (Write after Write): j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination.
    Caused by violation of output dependence
WAR (Write after Read): j tries to write a destination before it is read by i, so i incorrectly gets the new value.
    Caused by violation of antidependence
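The three hazard classes above follow mechanically from which registers two instructions read and write. The sketch below classifies a pair of instructions, with i preceding j; the encoding and names are assumptions for illustration, and memory locations are not modeled.

```c
#define NO_REG (-1)
#define F(n)   (n)          /* floating-point register Fn */
#define R(n)   (32 + (n))   /* integer register Rn */

enum { DEP_NONE = 0, DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 };

typedef struct {
    int dst;       /* register written, or NO_REG */
    int src[2];    /* registers read, NO_REG if unused */
} Instr;

/* Instructions taken from the name dependence examples. */
static const Instr add_d        = { F(2),   { F(4), F(6)  } };  /* ADD.D F2,F4,F6  */
static const Instr sub_d_anti   = { F(6),   { F(8), F(10) } };  /* SUB.D F6,F8,F10 */
static const Instr sub_d_output = { F(2),   { F(8), F(10) } };  /* SUB.D F2,F8,F10 */
static const Instr s_d          = { NO_REG, { F(6), R(1)  } };  /* S.D F6,0(R1)    */

/* Returns a bit mask of the potential hazards between i and a later j. */
unsigned classify_hazards(const Instr *i, const Instr *j)
{
    unsigned kinds = DEP_NONE;

    for (int s = 0; s < 2; s++) {
        /* RAW: j reads what i writes (true data dependence). */
        if (i->dst != NO_REG && j->src[s] == i->dst)
            kinds |= DEP_RAW;
        /* WAR: j writes what i reads (antidependence). */
        if (j->dst != NO_REG && i->src[s] == j->dst)
            kinds |= DEP_WAR;
    }
    /* WAW: both write the same register (output dependence). */
    if (i->dst != NO_REG && i->dst == j->dst)
        kinds |= DEP_WAW;

    return kinds;
}
```

For instance, `classify_hazards(&add_d, &sub_d_anti)` reports only WAR, because SUB.D writes F6 after ADD.D reads it.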
Control Dependence
Control dependence:
    Between an instruction and a branch instruction
    Determines the ordering of the instruction and the branch
Ex)
    if p1 {
        S1;
    }
    if p2 {
        S2;
    }
S1 control dependent on p1
S2 control dependent on p2, but not p1
Control Dependence
Constraints imposed by control dependence:
    An instruction control dependent on a branch cannot be moved before the branch
    An instruction NOT control dependent on a branch cannot be moved after the branch
Two properties to preserve control dependence
    Instructions execute in program order; an instruction that occurs before a branch is executed before the branch
    An instruction control dependent on a branch is not executed until the branch direction is known
Violation of control dependences may not always affect the correctness of a program
Control dependence is not the critical property that must be preserved
Control Dependence
Two properties critical for the program correctness: exception behavior and data flow
Exception behavior:
    Any changes in the ordering of instruction execution must not change how exceptions are raised in the program.
Ex) Original program
        DADDU   R2, R3, R4
        BEQZ    R2, L1
        LW      R1, 0(R2)    (may cause a memory protection exception)
    L1:
Reorganized program:
        DADDU   R2, R3, R4
        LW      R1, 0(R2)    (may cause a memory protection exception)
        BEQZ    R2, L1
    L1:
    Data dependence is not violated
    Unnecessary exception may be generated
Control Dependence
Two properties critical for the program correctness
Data flow: the flow of data values among instructions that produce results and consume them.
Branches make data flow dynamic: preserving data dependences alone is not sufficient for correctness
Ex)
        DADDU   R1, R2, R3
        BEQZ    R4, L
        DSUBU   R1, R5, R6
    L:  OR      R7, R1, R8
    OR is data dependent on both DADDU and DSUBU
    Preserving both data dependences does not guarantee correctness
Reorganization of code: preserves data dependence but violates control dependence
        DADDU   R1, R2, R3
        DSUBU   R1, R5, R6
        BEQZ    R4, L
    L:  OR      R7, R1, R8
Data flow must be preserved (by preserving control dependence)
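The effect of the reorganization above on the value that reaches OR can be seen by modeling both instruction orders as plain functions. This is an illustrative sketch only: the function names are hypothetical, and registers are modeled as 64-bit unsigned values passed as arguments.

```c
/* Original order: DSUBU executes only when the branch falls through. */
unsigned long long or_result_original(unsigned long long r2, unsigned long long r3,
                                      unsigned long long r4, unsigned long long r5,
                                      unsigned long long r6, unsigned long long r8)
{
    unsigned long long r1 = r2 + r3;    /* DADDU R1, R2, R3 */
    if (r4 != 0)                        /* BEQZ R4, L skips DSUBU when R4 == 0 */
        r1 = r5 - r6;                   /* DSUBU R1, R5, R6 */
    return r1 | r8;                     /* L: OR R7, R1, R8 */
}

/* Reorganized order: DSUBU is hoisted above the branch and always executes. */
unsigned long long or_result_reordered(unsigned long long r2, unsigned long long r3,
                                       unsigned long long r4, unsigned long long r5,
                                       unsigned long long r6, unsigned long long r8)
{
    unsigned long long r1 = r2 + r3;    /* DADDU R1, R2, R3 */
    r1 = r5 - r6;                       /* DSUBU R1, R5, R6 */
    (void)r4;                           /* BEQZ R4, L: both paths reach L with the same R1 */
    return r1 | r8;                     /* L: OR R7, R1, R8 */
}
```

When R4 is nonzero both versions agree, but when R4 is zero the original OR sees the DADDU result while the reorganized OR sees the DSUBU result, which is the data flow violation the example describes.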
Control Dependence
Violation of control dependence may not always affect the exception behavior or the data flow
Ex)
            DADDU   R1, R2, R3
            BEQZ    R12, skipnext
            DSUBU   R4, R5, R6
            DADDU   R5, R4, R9
skipnext:   OR      R7, R8, R9
DSUBU can be moved before BEQZ if R4 is not used after skipnext (R4 is then called dead)
            DADDU   R1, R2, R3
            DSUBU   R4, R5, R6
            BEQZ    R12, skipnext
            DADDU   R5, R4, R9
skipnext:   OR      R7, R8, R9
2.2 Basic Compiler Techniques for Exposing ILP
Basic Pipeline Scheduling and Loop Unrolling
Assumption
    Standard 5-stage integer pipeline
    Pipeline latency
        Instruction producing result    Instruction using result    Latency in clock cycles
        FP ALU op                       FP ALU op                   3
        FP ALU op                       Store double                2
        Load double                     FP ALU op                   1
        Load double                     Store double                0
    Branch: one cycle delay
    Functional units fully pipelined
    Operation can be issued on every clock
Example)
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
assembly code
Loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        BNE     R1, R2, Loop
Basic Pipeline Scheduling and Loop Unrolling
Example)
Pipeline stall: dependent instructions should be separated by cycles equal to the latency
                                    clock cycle
Loop:   L.D     F0, 0(R1)           1
        stall                       2
        ADD.D   F4, F0, F2          3
        stall                       4
        stall                       5
        S.D     F4, 0(R1)           6
        DADDUI  R1, R1, #-8         7
        stall                       8
        BNE     R1, R2, Loop        9
# of cycles per iteration: 9
Example)
Scheduling:
    change of instruction execution order
    Minimize stalls
                                    clock cycle
Loop:   L.D     F0, 0(R1)           1
        DADDUI  R1, R1, #-8         2
        ADD.D   F4, F0, F2          3
        stall                       4
        stall                       5
        S.D     F4, 8(R1)           6
        BNE     R1, R2, Loop        7
# of cycles per iteration: 7
Note: the S.D offset becomes 8 because DADDUI has already decremented R1 by 8
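The 9-cycle and 7-cycle counts above can be reproduced by a small in-order issue model. This is a sketch under simplifying assumptions: one instruction issues per clock, each instruction names at most one producer, and the latencies (1 between L.D and ADD.D, 2 between ADD.D and S.D, 1 between DADDUI and BNE) are read off the stall pattern in the examples. All type and function names are hypothetical.

```c
#define MAX_INSTR   32
#define NO_PRODUCER (-1)

typedef struct {
    const char *text;   /* for readability only */
    int producer;       /* index of the earlier instruction whose result is used */
    int latency;        /* stall cycles needed between producer and this instruction */
} Instr;

/* Original program order. */
static const Instr unscheduled[5] = {
    { "L.D    F0, 0(R1)",    NO_PRODUCER, 0 },
    { "ADD.D  F4, F0, F2",   0,           1 },
    { "S.D    F4, 0(R1)",    1,           2 },
    { "DADDUI R1, R1, #-8",  NO_PRODUCER, 0 },
    { "BNE    R1, R2, Loop", 3,           1 },
};

/* Scheduled order: DADDUI moved up, S.D offset adjusted to 8. */
static const Instr scheduled[5] = {
    { "L.D    F0, 0(R1)",    NO_PRODUCER, 0 },
    { "DADDUI R1, R1, #-8",  NO_PRODUCER, 0 },
    { "ADD.D  F4, F0, F2",   0,           1 },
    { "S.D    F4, 8(R1)",    2,           2 },
    { "BNE    R1, R2, Loop", 1,           1 },
};

/* Returns the clock cycle in which the last instruction issues,
   or -1 if the sequence is too long for this sketch. */
int count_cycles(const Instr *code, int n)
{
    int issue[MAX_INSTR];
    int cycle = 0;

    if (n > MAX_INSTR)
        return -1;

    for (int i = 0; i < n; i++) {
        int earliest = cycle + 1;              /* in-order, one issue per clock */
        if (code[i].producer != NO_PRODUCER) {
            int ready = issue[code[i].producer] + code[i].latency + 1;
            if (ready > earliest)
                earliest = ready;              /* wait out the latency: stall */
        }
        issue[i] = earliest;
        cycle = earliest;
    }
    return cycle;
}
```

`count_cycles(unscheduled, 5)` gives 9 and `count_cycles(scheduled, 5)` gives 7, so the two saved cycles come from filling the L.D and DADDUI delay slots with useful work.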
To avoid a pipeline stall, a dependent instruction should be separated from the source instruction by the pipeline latency
    Need independent instructions between these two
Single iteration: 7 cycles
    3 instructions for real work: L.D, ADD.D, S.D
    Loop control overhead: DADDUI, BNE
    Stall: 2 cycles
Insufficient number of independent instructions
Solution: loop unrolling
    Replicate the loop body multiple times, adjust the loop control
Example) unroll (replicate the loop body) 4 times
Original loop:
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
Unrolled loop:
    for (i=1000; i>0; i=i-4) {
        x[i] = x[i] + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }
Example) unroll 4 times
Loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        L.D     F6, -8(R1)
        ADD.D   F8, F6, F2
        S.D     F8, -8(R1)
        L.D     F10, -16(R1)
        ADD.D   F12, F10, F2
        S.D     F12, -16(R1)
        L.D     F14, -24(R1)
        ADD.D   F16, F14, F2
        S.D     F16, -24(R1)
        DADDUI  R1, R1, #-32
        BNE     R1, R2, Loop
Example) unroll 4 times: execution cycles: 27 cycles
                                    clock cycle
Loop:   L.D     F0, 0(R1)           1
        stall                       2
        ADD.D   F4, F0, F2          3
        stall                       4
        stall                       5
        S.D     F4, 0(R1)           6
        L.D     F6, -8(R1)          7
        stall                       8
        ADD.D   F8, F6, F2          9
        stall                       10
        stall                       11
        S.D     F8, -8(R1)          12
        L.D     F10, -16(R1)        13
        stall                       14
        ADD.D   F12, F10, F2        15
        stall                       16
        stall                       17
        S.D     F12, -16(R1)        18
        L.D     F14, -24(R1)        19
        stall                       20
        ADD.D   F16, F14, F2        21
        stall                       22
        stall                       23
        S.D     F16, -24(R1)        24
        DADDUI  R1, R1, #-32        25
        stall                       26
        BNE     R1, R2, Loop        27
Example) unroll 4 times after scheduling: 14 cycles (no stalls)
                                    clock cycle
Loop:   L.D     F0, 0(R1)           1
        L.D     F6, -8(R1)          2
        L.D     F10, -16(R1)        3
        L.D     F14, -24(R1)        4
        ADD.D   F4, F0, F2          5
        ADD.D   F8, F6, F2          6
        ADD.D   F12, F10, F2        7
        ADD.D   F16, F14, F2        8
        S.D     F4, 0(R1)           9
        S.D     F8, -8(R1)          10
        DADDUI  R1, R1, #-32        11
        S.D     F12, 16(R1)         12
        S.D     F16, 8(R1)          13
        BNE     R1, R2, Loop        14
Note: the two S.D instructions after DADDUI use offsets relative to the decremented R1 (-16 + 32 = 16, -24 + 32 = 8)
Unroll 4 times after scheduling: a large number of independent instructions -> no stalls
Boundary condition
n-iteration loop (when n is unknown at compile time) to unroll k times
Generate two loops
    (n mod k)-iteration loop
    (n/k)-iteration loop unrolled k times
Example)
    for (i=0; i<n; i++)
        x[i] = x[i]+s;
becomes
    for (i=0; i<(n%4); i++)
        x[i] = x[i]+s;
    for (i=(n%4); i<n; i=i+4) {
        x[i] = x[i]+s; x[i+1]=x[i+1]+s;
        x[i+2]=x[i+2]+s; x[i+3]=x[i+3]+s;
    }
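The two-loop scheme above can be written as a complete function and checked against the plain loop. This is a minimal sketch: the function names, the fixed test buffer size, and the self-check helper are assumptions added for illustration.

```c
#include <stddef.h>

/* Reference version: one element per iteration. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Unrolled by 4: a short (n mod 4)-iteration loop handles the leftover
   elements, then the main loop processes four elements per iteration. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    size_t remainder = n % 4;
    size_t i;

    for (i = 0; i < remainder; i++)
        x[i] = x[i] + s;

    /* n - remainder is a multiple of 4, so i + 3 never runs past the end. */
    for (i = remainder; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}

/* Hypothetical self-check: returns 1 when both versions agree for length n. */
int unrolled_matches_reference(size_t n)
{
    static double a[1024], b[1024];

    if (n > 1024)
        return 0;
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] = (double)i * 0.5;

    add_scalar(a, n, 3.0);
    add_scalar_unrolled4(b, n, 3.0);

    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Running the remainder loop first keeps the unrolled loop free of bounds checks, and lengths such as 0, 3, or 7 exercise the cases where the main loop runs zero or one time.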
Limits of loop unrolling
Decreased gain (amortized loop overhead) with each unroll
    Four times: removes all stalls; loop control overhead is 2 cycles out of four iterations (1/2 per iteration)
    Eight times: loop control overhead is 2 cycles out of eight iterations (1/4 per iteration)
Growth in code size
Shortfall in registers
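The diminishing return from further unrolling can be made concrete with a short calculation. The sketch below assumes that scheduling removes every stall once the body is unrolled k times (true for k = 4 in the example above, and assumed here for larger k), so each copy costs 3 cycles for L.D, ADD.D, and S.D plus a shared 2 cycles for DADDUI and BNE. The function names are hypothetical, and k must be positive.

```c
/* Loop control overhead (DADDUI + BNE) charged to each original iteration. */
double overhead_per_element(int k)
{
    return 2.0 / (double)k;
}

/* Total cycles per original iteration for a stall-free loop unrolled k times. */
double cycles_per_element(int k)
{
    return (3.0 * (double)k + 2.0) / (double)k;
}
```

Going from k = 4 to k = 8 lowers the cost from 3.5 to 3.25 cycles per element, a much smaller gain than the drop from 7 cycles to 3.5 achieved by the first four-way unrolling, while code size and register demand keep growing.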
