
Chapter 2-1

Instruction-Level Parallelism
and Its Exploitation

Seoul Natl, CAPP, Hyuk-Jae Lee

2.1 Instruction-Level Parallelism: Concepts and Challenges

Instruction-level parallelism (ILP): the property that multiple instructions can
be executed in parallel (simultaneously) or in a pipelined manner

Loop-level parallelism

Ex)
for (i = 1; i <= 1000; i++)
    x[i] = x[i] + y[i];

Every iteration can be executed in parallel with the other iterations

x[1] = x[1] + y[1];    // i=1
x[2] = x[2] + y[2];    // i=2
x[3] = x[3] + y[3];    // i=3

Loop-level parallelism can be converted into ILP

Exploit ILP
    HW approach (dynamic)
    SW (compiler, assembly programming) approach (static)

Pipeline CPI = ideal pipeline CPI + structural stalls + data hazard stalls +
control stalls


Pipeline CPI

Ideal pipeline CPI = 1

Example)
ADD   R1,R2,R3
SUB   R4,R5,R6
AND   R7,R8,R9
OR    R10,R11,R12
XOR   R13,R14,R15
ADD   R16,R17,R18

Pipelined execution: produces a result in every clock cycle

clock  1   2   3   4   5   6   7   8   9   10
ADD    IF  ID  EX  ME  WB
SUB        IF  ID  EX  ME  WB
AND            IF  ID  EX  ME  WB
OR                 IF  ID  EX  ME  WB
XOR                    IF  ID  EX  ME  WB
ADD                        IF  ID  EX  ME  WB
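
The cycle count in the diagram above follows from a simple formula. A minimal sketch in C; the function name `pipeline_cycles` and its parameters are illustrative and not from the slides, and `stall_cycles` is zero in the ideal case:

```c
#include <assert.h>

/* Total clock cycles to run n_instr instructions on a pipeline with
 * `depth` stages: the first instruction needs `depth` cycles, every
 * later one finishes one cycle after its predecessor, and each stall
 * cycle delays completion by one more cycle. */
unsigned pipeline_cycles(unsigned n_instr, unsigned depth, unsigned stall_cycles)
{
    assert(depth > 0);
    if (n_instr == 0)
        return 0;
    return depth + (n_instr - 1) + stall_cycles;
}
```

For the six instructions above on the 5-stage pipeline, `pipeline_cycles(6, 5, 0)` gives 10, which matches the last WB in clock 10. As n_instr grows, cycles per instruction approach the ideal CPI of 1.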


Pipeline CPI

Data hazard causes stall

Pipeline CPI = ideal pipeline CPI + data hazard stalls > 1

Example)
ADD   R1,R2,R3
LD    R4,100(R5)
AND   R7,R4,R9
OR    R10,R11,R12
XOR   R13,R14,R15
ADD   R16,R17,R18

Pipelined execution: AND must wait one cycle for R4 from LD, so a result is
not produced in every clock cycle

clock  1     2     3     4     5     6     7     8     9     10    11
ADD    IF    ID    EX    ME    WB
LD           IF    ID    EX    ME    WB
AND                IF    ID    stall EX    ME    WB
OR                       IF    stall ID    EX    ME    WB
XOR                                  IF    ID    EX    ME    WB
ADD                                        IF    ID    EX    ME    WB


Pipeline CPI

Control hazard causes stall

Pipeline CPI = ideal pipeline CPI + data hazard stalls + control hazard stalls > 1

Example)
     ADD   R1,R2,R3
     LD    R4,100(R5)
     AND   R7,R4,R9
     BEQZ  R8,L
     XOR   R13,R14,R15
     ADD   R16,R17,R18
L:   SUB   R19,R20,R21

Pipelined execution (branch taken): SUB cannot be fetched until the branch is
resolved, so a result is not produced in every clock cycle

clock  1     2     3     4     5     6     7     8     9     10    11
ADD    IF    ID    EX    ME    WB
LD           IF    ID    EX    ME    WB
AND                IF    ID    stall EX    ME    WB
BEQZ                     IF    stall ID    EX    ME    WB
SUB                                  stall IF    ID    EX    ME    WB

Pipeline CPI = ideal pipeline CPI + structural stalls + data hazard stalls +
control stalls

Structural stalls: resources not available

To minimize CPI, try to avoid these additional stalls by exploiting parallelism
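
The CPI equation can be written out directly. A minimal sketch in C, assuming each stall term is an average number of stall cycles per instruction; the names `pipeline_cpi` and `stalls_per_instruction` are hypothetical helpers, not from the slides:

```c
#include <assert.h>

/* Average stall cycles per instruction from a measured stall count. */
double stalls_per_instruction(unsigned stall_cycles, unsigned n_instr)
{
    assert(n_instr > 0);
    return (double)stall_cycles / n_instr;
}

/* Pipeline CPI = ideal CPI + structural + data hazard + control stalls,
 * where every stall term is an average per instruction. */
double pipeline_cpi(double ideal_cpi, double structural_stalls,
                    double data_hazard_stalls, double control_stalls)
{
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls;
}
```

With no stalls the result equals the ideal CPI; any non-zero stall term pushes it above 1, which is why the stalls are the target of the techniques that follow.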


Data Dependence

Data dependence definition:

An instruction j is data dependent on instruction i if
    i produces a result used by j, or
    j is data dependent on k and k is data dependent on i

Ex)
Loop:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     F4, 0(R1)
       DADDUI  R1, R1, #-8
       BNE     R1, R2, Loop

ADD.D is data dependent on L.D
S.D is data dependent on ADD.D & L.D
BNE is data dependent on DADDUI

Data Dependence and Hazard

Hazard: overlap or reordering of instructions may generate incorrect results

Data dependent instructions -> potential for a hazard

Ex) DADDUI R1,R1,#-8 -> BNE R1,R2,Loop

DADDUI  IF    ID    EX    MEM   WB
BNE           IF    ID    EX    MEM   WB

This overlap may cause a hazard because BNE requires the result of
DADDUI at the ID stage

Solution: stop overlapping -> stall the pipeline

DADDUI  IF    ID    EX    MEM   WB
BNE           IF    stall ID    EX    MEM   WB

Solution (HW): pipeline interlock (stop the pipeline for the later instructions)
Solution (SW): compiler scheduling (insert NOP instructions)

Occurrence of a hazard depends on the pipeline organization

Ex) L.D -> ADD.D, or ADD.D -> S.D


Data Dependence and Hazard

Ex) L.D F0,0(R1) -> ADD.D F4,F0,F2

L.D     IF    ID    EX    MEM   WB
ADD.D         IF    ID    EX    MEM   WB

Data hazard

Ex) ADD.D F4,F0,F2 -> S.D F4,0(R1)

ADD.D   IF    ID    EX    MEM   WB
S.D           IF    ID    EX    MEM   WB

No data hazard


Name Dependences

Antidependence: j writes a register/memory location that i reads (i precedes j)

Ex)
ADD.D  F2, F4, F6
SUB.D  F6, F8, F10
S.D    F6, 0(R1)

Output dependence: i and j write the same register/memory location

Ex)
ADD.D  F2, F4, F6
SUB.D  F2, F8, F10

Register renaming: can remove name dependences

Ex)
ADD.D  F2, F4, F6
SUB.D  F12, F8, F10
S.D    F12, 0(R1)

Data Hazards

Data hazards: data dependent instructions are close enough that the overlap
or reordering of instructions would cause incorrect results

RAW (Read after Write):
    j tries to read a source before i writes it, so j incorrectly gets
    the old value.
    Caused by violation of true data dependence

WAW (Write after Write):
    j tries to write an operand before it is written by i. The writes
    end up being performed in the wrong order, leaving the value
    written by i rather than the value written by j in the destination.
    Caused by violation of output dependence

WAR (Write after Read):
    j tries to write a destination before it is read by i, so i
    incorrectly gets the new value.
    Caused by violation of antidependence
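
The three cases can be told apart by comparing the registers two instructions read and write. A minimal sketch in C, assuming each instruction is reduced to one destination and up to two source registers; `struct instr`, `make_instr`, `classify`, and the `DEP_*` flags are hypothetical names introduced for illustration:

```c
#include <assert.h>

/* Bit flags for the dependence found between two instructions. */
enum { DEP_NONE = 0, DEP_RAW = 1, DEP_WAW = 2, DEP_WAR = 4 };

/* An instruction reduced to the registers it touches.
 * Register ids are plain integers (F2 -> 2); -1 means "unused". */
struct instr {
    int dest;
    int src[2];
};

struct instr make_instr(int dest, int src0, int src1)
{
    struct instr in;
    in.dest = dest;
    in.src[0] = src0;
    in.src[1] = src1;
    return in;
}

/* True if the instruction reads the given register. */
static int reads(struct instr in, int reg)
{
    return reg >= 0 && (in.src[0] == reg || in.src[1] == reg);
}

/* i precedes j in program order. Returns the dependences that would
 * turn into RAW, WAW, or WAR hazards if j overtook i. */
int classify(struct instr i, struct instr j)
{
    int mask = DEP_NONE;
    assert(i.dest >= -1 && j.dest >= -1);
    if (reads(j, i.dest))
        mask |= DEP_RAW;               /* j reads what i writes   */
    if (i.dest >= 0 && i.dest == j.dest)
        mask |= DEP_WAW;               /* both write one register */
    if (reads(i, j.dest))
        mask |= DEP_WAR;               /* j writes what i reads   */
    return mask;
}
```

For example, `classify(make_instr(2, 4, 6), make_instr(6, 8, 10))` models ADD.D F2,F4,F6 followed by SUB.D F6,F8,F10 and reports `DEP_WAR`; renaming the SUB.D destination to F12 makes the result `DEP_NONE`.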


Control Dependence

Control dependence:
    Between an instruction and a branch instruction
    Determines the ordering of the instruction and the branch

Ex)
if p1 {
    s1;
}
if p2 {
    s2;
}

s1 is control dependent on p1
s2 is control dependent on p2, but not on p1


Control Dependence

Constraints imposed by control dependence:
    An instruction control dependent on a branch cannot be moved
    before the branch
    An instruction NOT control dependent on a branch cannot be
    moved after the branch

Two properties to preserve control dependence
    Instructions execute in program order; an instruction that
    occurs before a branch is executed before the branch
    An instruction control dependent on a branch is not executed
    until the branch direction is known

Violation of control dependences may not always affect the correctness
of a program

Control dependence is not the critical property that must be preserved

Control Dependence

Two properties critical for program correctness are

Exception behavior:
Any change in the ordering of instruction execution must not
change how exceptions are raised in the program.

Ex) Original program
      DADDU  R2, R3, R4
      BEQZ   R2, L1
      LW     R1, 0(R2)      ; may cause a memory protection exception
L1:

Reorganized program: data dependence is not violated
      DADDU  R2, R3, R4
      LW     R1, 0(R2)      ; may cause a memory protection exception
      BEQZ   R2, L1
L1:

Unnecessary exception may be generated


Control Dependence

Two properties critical for program correctness

Data flow: the flow of data values among instructions that produce results and
those that consume them.
Branches make data flow dynamic: preserving data dependences alone is not
sufficient for correctness

Ex)
      DADDU  R1, R2, R3
      BEQZ   R4, L
      DSUBU  R1, R5, R6
L:
      OR     R7, R1, R8

OR is data dependent on both DADDU and DSUBU

Preserving both data dependences does not guarantee correctness

Reorganization of code: preserves data dependence but violates control dependence

      DADDU  R1, R2, R3
      DSUBU  R1, R5, R6
      BEQZ   R4, L
L:
      OR     R7, R1, R8

Data flow must be preserved (by preserving control dependence)
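
The effect of the reorganization can be checked by modelling both instruction orders. A minimal sketch in C, assuming 64-bit unsigned registers passed as arguments; the function names are hypothetical and each returns the value OR would write to R7:

```c
#include <stdint.h>

/* Original order: DSUBU executes only when the branch falls through. */
uint64_t or_result_original(uint64_t r2, uint64_t r3, uint64_t r4,
                            uint64_t r5, uint64_t r6, uint64_t r8)
{
    uint64_t r1 = r2 + r3;       /* DADDU R1,R2,R3                  */
    if (r4 != 0)                 /* BEQZ  R4,L skips DSUBU if R4==0 */
        r1 = r5 - r6;            /* DSUBU R1,R5,R6                  */
    return r1 | r8;              /* L: OR R7,R1,R8                  */
}

/* Reorganized order: DSUBU is hoisted above the branch, so it always
 * executes and the branch no longer selects which value reaches OR. */
uint64_t or_result_reordered(uint64_t r2, uint64_t r3, uint64_t r4,
                             uint64_t r5, uint64_t r6, uint64_t r8)
{
    uint64_t r1 = r2 + r3;       /* DADDU R1,R2,R3 */
    r1 = r5 - r6;                /* DSUBU R1,R5,R6 */
    (void)r4;                    /* BEQZ  R4,L guards nothing now   */
    return r1 | r8;              /* L: OR R7,R1,R8 */
}
```

When R4 is non-zero the two versions agree. When R4 is zero (branch taken) the original feeds the DADDU result to OR while the reordered code feeds the DSUBU result, so the data flow has changed even though both data dependences are still present.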


Control Dependence

Violation of control dependence may not always affect the
exception behavior or the data flow

Ex)
           DADDU  R1, R2, R3
           BEQZ   R12, skipnext
           DSUBU  R4, R5, R6
           DADDU  R5, R4, R9
skipnext:  OR     R7, R8, R9

DSUBU can be moved before BEQZ if R4 is not used after
skipnext (R4 is then called dead)

           DADDU  R1, R2, R3
           DSUBU  R4, R5, R6
           BEQZ   R12, skipnext
           DADDU  R5, R4, R9
skipnext:  OR     R7, R8, R9

2.2 Basic Compiler Techniques for Exposing ILP

Basic Pipeline Scheduling and Loop Unrolling

Assumption
    Standard 5-stage integer pipeline
    Pipeline latency

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      FP ALU op                  3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0

    Branch: one cycle delay
    Functional units fully pipelined
    Operation can be issued on every clock

Example)
for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

assembly code
Loop:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     F4, 0(R1)
       DADDUI  R1, R1, #-8
       BNE     R1, R2, Loop

Basic Pipeline Scheduling and Loop Unrolling

Example)
Pipeline stall: dependent instructions should be separated by
cycles equal to the latency

                                   cycle
Loop:  L.D     F0, 0(R1)           1
       stall                       2
       ADD.D   F4, F0, F2          3
       stall                       4
       stall                       5
       S.D     F4, 0(R1)           6
       DADDUI  R1, R1, #-8         7
       stall                       8
       BNE     R1, R2, Loop        9

# of cycles per iteration: 9
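
The stall counts above can be reproduced with a small in-order issue model. A minimal sketch in C; all names are hypothetical, the latency values (FP ALU to FP ALU 3, FP ALU to store 2, load to FP ALU 1, otherwise 0) are assumed to be those of the latency table, and the 1-cycle integer-ALU-to-branch latency is inferred from the stall between DADDUI and BNE:

```c
#include <assert.h>
#include <stddef.h>

enum kind { LOAD, FP_ALU, STORE, INT_ALU, BRANCH };

/* Register ids: integer registers as-is, FP registers offset by 100. */
#define R(n) (n)
#define F(n) (100 + (n))
#define NONE (-1)
#define MAX_INSTRS 64

struct instr {
    enum kind kind;
    int dest;       /* register written, or NONE */
    int src[2];     /* registers read, or NONE   */
};

/* Stall cycles needed between a producer and a dependent consumer
 * (assumed values, see the lead-in). */
static int latency(enum kind producer, enum kind consumer)
{
    if (producer == FP_ALU  && consumer == FP_ALU) return 3;
    if (producer == FP_ALU  && consumer == STORE)  return 2;
    if (producer == LOAD    && consumer == FP_ALU) return 1;
    if (producer == INT_ALU && consumer == BRANCH) return 1;
    return 0;
}

/* Issue cycle of the last instruction on an in-order, single-issue
 * pipeline that stalls until every source operand is ready. */
int schedule_cycles(const struct instr *code, size_t n)
{
    int issue[MAX_INSTRS];
    int last = 0;
    assert(n <= MAX_INSTRS);
    for (size_t j = 0; j < n; j++) {
        int cycle = last + 1;
        for (int s = 0; s < 2; s++) {
            int reg = code[j].src[s];
            if (reg == NONE)
                continue;
            /* Find the most recent earlier writer of this register. */
            for (size_t i = j; i-- > 0; ) {
                if (code[i].dest == reg) {
                    int ready = issue[i] + 1 +
                                latency(code[i].kind, code[j].kind);
                    if (ready > cycle)
                        cycle = ready;
                    break;
                }
            }
        }
        issue[j] = cycle;
        last = cycle;
    }
    return last;
}

/* The loop body in its original order. */
int unscheduled_loop_cycles(void)
{
    const struct instr loop[] = {
        { LOAD,    F(0), { R(1), NONE } },  /* L.D    F0, 0(R1)    */
        { FP_ALU,  F(4), { F(0), F(2) } },  /* ADD.D  F4, F0, F2   */
        { STORE,   NONE, { F(4), R(1) } },  /* S.D    F4, 0(R1)    */
        { INT_ALU, R(1), { R(1), NONE } },  /* DADDUI R1, R1, #-8  */
        { BRANCH,  NONE, { R(1), R(2) } },  /* BNE    R1, R2, Loop */
    };
    return schedule_cycles(loop, sizeof loop / sizeof loop[0]);
}

/* Same body with DADDUI moved up between L.D and ADD.D. */
int reordered_loop_cycles(void)
{
    const struct instr loop[] = {
        { LOAD,    F(0), { R(1), NONE } },  /* L.D    F0, 0(R1)    */
        { INT_ALU, R(1), { R(1), NONE } },  /* DADDUI R1, R1, #-8  */
        { FP_ALU,  F(4), { F(0), F(2) } },  /* ADD.D  F4, F0, F2   */
        { STORE,   NONE, { F(4), R(1) } },  /* S.D    F4, 8(R1)    */
        { BRANCH,  NONE, { R(1), R(2) } },  /* BNE    R1, R2, Loop */
    };
    return schedule_cycles(loop, sizeof loop / sizeof loop[0]);
}
```

Under these assumptions the original order takes 9 cycles per iteration, and moving DADDUI up hides one stall and the DADDUI-to-BNE stall, giving 7. The model only tracks register dependences, so it does not check that a reordering is legal.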


Basic Pipeline Scheduling and Loop Unrolling

Example)

Scheduling:
    change of instruction execution order
    Minimize stalls

                                   cycle
Loop:  L.D     F0, 0(R1)           1
       DADDUI  R1, R1, #-8         2
       ADD.D   F4, F0, F2          3
       stall                       4
       stall                       5
       S.D     F4, 8(R1)           6
       BNE     R1, R2, Loop        7

note: the S.D offset becomes 8 because DADDUI has already decremented R1

# of cycles per iteration: 7


Basic Pipeline Scheduling and Loop Unrolling

To avoid a pipeline stall, a dependent instruction should be
separated from the source instruction by the pipeline latency

Need independent instructions between these two

Single iteration: 7 cycles
    3 instructions for real work: L.D, ADD.D, S.D
    Loop control overhead: DADDUI, BNE
    Stall: 2 cycles
    Insufficient number of independent instructions

Solution: loop unrolling
    Replicate the loop body multiple times, adjust the loop control


Basic Pipeline Scheduling and Loop Unrolling

Example) unroll (replicate the loop body) 4 times

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

for (i=1000; i>0; i=i-4) {
    x[i] = x[i] + s;
    x[i-1] = x[i-1] + s;
    x[i-2] = x[i-2] + s;
    x[i-3] = x[i-3] + s;
}


Basic Pipeline Scheduling and Loop Unrolling

Example) unroll 4 times

Loop:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     F4, 0(R1)
       L.D     F6, -8(R1)
       ADD.D   F8, F6, F2
       S.D     F8, -8(R1)
       L.D     F10, -16(R1)
       ADD.D   F12, F10, F2
       S.D     F12, -16(R1)
       L.D     F14, -24(R1)
       ADD.D   F16, F14, F2
       S.D     F16, -24(R1)
       DADDUI  R1, R1, #-32
       BNE     R1, R2, Loop


Basic Pipeline Scheduling and Loop Unrolling

Example) unroll 4 times: execution cycles: 27 cycles

                                   cycle
Loop:  L.D     F0, 0(R1)           1
       stall                       2
       ADD.D   F4, F0, F2          3
       stall                       4
       stall                       5
       S.D     F4, 0(R1)           6
       L.D     F6, -8(R1)          7
       stall                       8
       ADD.D   F8, F6, F2          9
       stall                       10
       stall                       11
       S.D     F8, -8(R1)          12
       L.D     F10, -16(R1)        13
       stall                       14
       ADD.D   F12, F10, F2        15
       stall                       16
       stall                       17
       S.D     F12, -16(R1)        18
       L.D     F14, -24(R1)        19
       stall                       20
       ADD.D   F16, F14, F2        21
       stall                       22
       stall                       23
       S.D     F16, -24(R1)        24
       DADDUI  R1, R1, #-32        25
       stall                       26
       BNE     R1, R2, Loop        27


Basic Pipeline Scheduling and Loop Unrolling

Example) unroll 4 times after schedule: 14 cycles (no stalls)

                                   cycle
Loop:  L.D     F0, 0(R1)           1
       L.D     F6, -8(R1)          2
       L.D     F10, -16(R1)        3
       L.D     F14, -24(R1)        4
       ADD.D   F4, F0, F2          5
       ADD.D   F8, F6, F2          6
       ADD.D   F12, F10, F2        7
       ADD.D   F16, F14, F2        8
       S.D     F4, 0(R1)           9
       S.D     F8, -8(R1)          10
       DADDUI  R1, R1, #-32        11
       S.D     F12, 16(R1)         12
       S.D     F16, 8(R1)          13
       BNE     R1, R2, Loop        14

The two stores after DADDUI use offsets adjusted by +32 (-16 becomes 16, -24 becomes 8)


Basic Pipeline Scheduling and Loop Unrolling

Example) unroll 4 times after schedule: 14 cycles (no stalls)

Large number of independent instructions -> no stalls

Boundary condition

n-iteration loop (when n is unknown at compile time) to unroll k times

Generate two loops
    (n mod k)-iteration loop
    (n/k)-iteration loop unrolled k times

Example)
for (i=0; i<n; i++)
    x[i] = x[i]+s;

for (i=0; i<(n mod 4); i++)
    x[i] = x[i]+s;
for (i=(n mod 4); i<n; i=i+4) {
    x[i] = x[i]+s; x[i+1]=x[i+1]+s;
    x[i+2]=x[i+2]+s; x[i+3]=x[i+3]+s;
}
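
The two-loop scheme can be written as compilable C. A minimal sketch, assuming k = 4 and a `double` array; the function names and the small self-check helper are hypothetical, not from the slides:

```c
#include <assert.h>
#include <stddef.h>

/* Reference: the original rolled loop. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Unrolled by k = 4: an (n mod 4)-iteration loop first, then the
 * remaining n - (n mod 4) elements four per trip. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    size_t i;
    size_t rem = n % 4;
    assert(x != NULL || n == 0);
    for (i = 0; i < rem; i++)
        x[i] = x[i] + s;
    for (; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}

/* Returns 1 if both versions give identical results for an
 * n-element test array (n <= 64), 0 otherwise. */
int unrolled_matches_rolled(size_t n)
{
    double a[64], b[64];
    assert(n <= 64);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] = (double)i;
    add_scalar(a, n, 1.5);
    add_scalar_unrolled4(b, n, 1.5);
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

The second loop must run up to n rather than n/4, because i counts elements and advances by 4 per trip. After the first loop the number of remaining elements is a multiple of 4, so the unrolled body never reads past the end of the array.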

Limits of loop unrolling

Decreased gain (amortized loop overhead) with each unroll
    Four times: removes all stalls; loop control overhead is 2 cycles
    out of four iterations (1/2 cycle per iteration)
    Eight times: loop control overhead is 1/4 cycle per iteration

Growth in code size

Shortfall in registers
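
The diminishing return can be put in numbers. A minimal sketch in C, assuming the 2-cycle loop control overhead (DADDUI + BNE) and 3 instructions of real work per original iteration from the running example, and assuming the unrolled body can be scheduled with no stalls (shown in the slides only for k = 4); the function names are hypothetical:

```c
#include <assert.h>

/* Loop-control overhead (2 cycles) amortized over the k original
 * iterations covered by one trip through the unrolled loop. */
double overhead_per_iteration(unsigned k)
{
    assert(k > 0);
    return 2.0 / k;
}

/* Cycles per original iteration under the no-stall assumption:
 * 3 instructions of real work per copy plus the shared overhead. */
double cycles_per_iteration(unsigned k)
{
    assert(k > 0);
    return (3.0 * k + 2.0) / k;
}
```

Going from k = 4 to k = 8 lowers the overhead from 0.5 to 0.25 cycles per iteration, a saving of only a quarter cycle, while the loop body doubles in size and needs roughly twice as many FP registers.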
