Computer Architecture
Spring 2004
Harvard University
Instructor: Prof. David Brooks
dbrooks@eecs.harvard.edu
Lecture 10: Static Scheduling, Loop
Unrolling, and Software Pipelining
Computer Science 146
David Brooks
Lecture Outline
Finish Pentium Pro and Pentium 4 case studies
Loop unrolling and static scheduling (Section 4.1)
Software pipelining (Section 4.4, pages 329-332)
Register Renaming Example

Original code (architectural registers):
    ADD R1, R2, R4
    SUB R4, R1, R2
    ADD R3, R1, R3
    ADD R1, R3, R2

Renamed code (physical registers):
    ADD P5, P2, P4
    SUB P6, P5, P2
    ADD P7, P5, P3
    ADD P8, P7, P2

Map Table         R1  R2  R3  R4
Initially:        P1  P2  P3  P4
After ADD R1:     P5  P2  P3  P4
After SUB R4:     P5  P2  P3  P6
After ADD R3:     P5  P2  P7  P6
After ADD R1:     P8  P2  P7  P6
MIPS R10K: How to free registers?

Old method (Tomasulo + reorder buffer):
Don't free speculative storage explicitly
At retire: copy value from ROB to register file, free ROB entry

MIPS R10K:
Can't free a physical register when its instruction retires
There is no architectural register file to copy to; instead, free the physical register that previously mapped the same architectural destination, once the renaming instruction retires
[Figure 10, from Bhandarkar and Ding: branch mispredict frequency and BTB miss frequency (0% to 45%) for the SPEC FP benchmarks hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5]
1% to 60% of instructions do not commit: 20% average (30% for integer)

[Figure: distribution of micro-ops committed per cycle (0, 1, 2, or 3 uops) across SPEC benchmarks (vortex, tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5)]

    uops committed per cycle:   0     1     2     3
    Average:                   55%   13%    8%   23%
    Integer:                   40%   21%   12%   27%
P6 Dynamic Benefit?
Sum-of-parts CPI vs. actual CPI

[Figure: per-benchmark CPI broken into uops, instruction cache stalls, resource capacity stalls, branch mispredict penalty, and data cache stalls, compared against actual CPI, for SPECint (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex) and SPECfp (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5)]

Ratio of sum of parts vs. actual CPI: 1.38x average (1.29x integer)
Pentium 4
Still translates from 80x86 instructions to micro-ops
P4 has a better branch predictor and more functional units
Instruction cache holds micro-operations instead of 80x86 instructions:
no decode stages for 80x86 on a cache hit (Trace Cache)
Clock rates:
Pentium III 1 GHz vs. Pentium 4 1.5 GHz
14-stage pipeline vs. 24-stage pipeline
Trace Cache
IA-32 instructions are difficult to decode
Conventional instruction cache: provides instructions up to and including a taken branch
L1 execution trace cache: buffers 12,000 micro-ops (already decoded)
Die size: Pentium 4 217 mm2 vs. PIII 106 mm2
Pentium 4 features
Multimedia instructions 128 bits wide vs. 64 bits wide => 144 new instructions
When are they used by programs?
Faster floating point: executes 2 64-bit FP operations per clock
Memory FU: 1 128-bit load and 1 128-bit store per clock to MMX regs
[Figure: SPECint2000 (peak) and SPECint2000 (peak) per mm2 plotted against clock rate, 800 to 2400 MHz]
Instruction producing result   Instruction using result   Execution (cycles)   Latency (cycles)
FP ALU op                      Another FP ALU op          4                    3
FP ALU op                      Store double               3                    2
Load double                    FP ALU op                  1                    1
Load double                    Store double               1                    0
Integer op                     Integer op                 1                    0
Original loop:

Loop: L.D    F0,0(R1)     ;F0 = array element
      ADD.D  F4,F0,F2     ;add scalar in F2
      S.D    0(R1),F4     ;store result
      DSUBUI R1,R1,#8     ;decrement pointer (8 bytes per double)
      BNEZ   R1,Loop      ;branch R1!=zero
      NOP                 ;delayed branch slot

Schedule with stalls (10 cycles per iteration):

 1  L.D    F0,0(R1)
 2  stall
 3  ADD.D  F4,F0,F2
 4  stall
 5  stall
 6  S.D    0(R1),F4
 7  DSUBUI R1,R1,#8
 8  stall
 9  BNEZ   R1,Loop      ;branch R1!=zero
10  stall               ;delayed branch slot
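For reference, this is the textbook add-scalar-to-vector loop at the C level (array length and the function name are illustrative):

```c
#include <assert.h>

/* x[i] += s for i = n-1 .. 0: one load, one FP add, one store, a pointer
 * decrement, and a branch per element -- the five instructions (plus the
 * delay-slot NOP) in the MIPS loop above. */
void scalar_add(double *x, long n, double s) {
    for (long i = n - 1; i >= 0; i--)
        x[i] += s;
}
```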
Instruction producing result   Instruction using result   Latency (clock cycles)
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Unrolled 4x (reusing F0/F4 in every group):

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    0(R1),F4
      L.D    F0,-8(R1)
      ADD.D  F4,F0,F2
      S.D    -8(R1),F4
      L.D    F0,-16(R1)
      ADD.D  F4,F0,F2
      S.D    -16(R1),F4
      L.D    F0,-24(R1)
      ADD.D  F4,F0,F2
      S.D    -24(R1),F4
      DSUBUI R1,R1,#32     ;alter to 4*8
      BNEZ   R1,LOOP
Unrolled 4x with distinct registers per group to remove name dependences:

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2      ;1 cycle stall after each L.D
      S.D    0(R1),F4      ;2 cycles stall after each ADD.D; drop DSUBUI & BNEZ
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    -8(R1),F8     ;drop DSUBUI & BNEZ
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    -16(R1),F12
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    -24(R1),F16
      DSUBUI R1,R1,#32     ;alter to 4*8
      BNEZ   R1,LOOP

This is why we call it register renaming!

Rewrite the loop to minimize stalls?
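The unroll-and-rename transformation, expressed at the C level (a sketch; assumes the trip count is a multiple of 4, as the slide's example does):

```c
#include <assert.h>

/* Loop unrolled by 4, mirroring the renamed MIPS version: four independent
 * load/add/store groups per iteration share one copy of the loop overhead
 * (pointer update and branch). Assumes n is a multiple of 4. */
void scalar_add_unrolled(double *x, long n, double s) {
    for (long i = n; i > 0; i -= 4) {
        x[i - 1] += s;   /* group using F0/F4   */
        x[i - 2] += s;   /* group using F6/F8   */
        x[i - 3] += s;   /* group using F10/F12 */
        x[i - 4] += s;   /* group using F14/F16 */
    }
}
```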
Scheduled, unrolled loop:

Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    0(R1),F4
      S.D    -8(R1),F8
      DSUBUI R1,R1,#32
      S.D    16(R1),F12    ;16-32 = -16
      BNEZ   R1,LOOP
      S.D    8(R1),F16     ;8-32 = -24

Scheduling assumptions?
Moving stores past DSUBUI is legal even though DSUBUI changes the register they index off of (must change the offsets)
Alias analysis: move loads before stores
Easy for humans to see this; what about compilers?
Dual issue (one integer/memory op plus one FP op per cycle), unrolled 5x:

      Integer pipe            FP pipe
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)     ADD.D F4,F0,F2
      L.D    F14,-24(R1)     ADD.D F8,F6,F2
      L.D    F18,-32(R1)     ADD.D F12,F10,F2
      S.D    0(R1),F4        ADD.D F16,F14,F2
      S.D    -8(R1),F8       ADD.D F20,F18,F2
      S.D    -16(R1),F12
      DSUBUI R1,R1,#40
      S.D    16(R1),F16      ;40-24 = 16
      BNEZ   R1,LOOP
      S.D    8(R1),F20       ;40-32 = 8
Loop Performance

[Figure: clock cycles per element for the unscheduled loop, scheduled loop, and unscheduled 4x unroll (annotated gains: 33%, ~50%)]

Get 1.7x from unrolling (6 -> 3.5) and 1.5x (3.5 -> 2.5) from dual issue
Compiler Scheduling: Memory Disambiguation

Name dependences are hard to discover for memory accesses:
Does 100(R4) = 20(R6)?
From different loop iterations, does 20(R6) = 20(R6)?
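In C, the programmer can answer this aliasing question for the compiler with the C99 `restrict` qualifier; a small sketch (function and array names are illustrative):

```c
/* Without restrict, the compiler must assume a and b may overlap, so it
 * cannot reorder the load of b[i] past an earlier store to a[i-1].
 * With restrict, the programmer asserts the two arrays are disjoint,
 * which licenses moving loads before stores when scheduling. */
void copy_plus_one(double *restrict a, const double *restrict b, long n) {
    for (long i = 0; i < n; i++)
        a[i] = b[i] + 1.0;
}
```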
What the compiler had to do:
1. Check that it is OK to move the S.D after the DSUBUI and BNEZ, and find the amount by which to adjust the S.D offset.
2. Determine that unrolling the loop would be useful by finding that the loop iterations are independent.
3. Rename registers to avoid name dependences.
4. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.
5. Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent. This requires analyzing memory addresses and finding that they do not refer to the same address.
6. Schedule the code, preserving any dependences needed to yield the same result as the original code.
Register Pressure
Aggressive unrolling and scheduling can exhaust the register file on 32-register machines
[Figure: overlapped ops over time; loop unrolling overlaps whole unrolled iterations, while a software-pipelined iteration draws its operations from several different original iterations]
Software Pipelining

Now must optimize the inner loop:
Want to do as much work as possible in each iteration
Keep all of the functional units in the processor busy

Example: C[j] += A * B[j];

Not pipelined: each iteration runs load B[j], load C[j], multiply by A, add, store C[j] strictly in sequence, so functional units sit idle between dependent operations.

Pipelined: operations from different iterations execute together. While one iteration's multiply and add are computed, the loads for the next iteration and the store for the previous one proceed in parallel. The schedule has a fill phase (start-up, loads only), a steady state (a load, a multiply-add, and a store every cycle, each from a different iteration), and a drain phase (finish the remaining computes and stores).
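The fill/steady-state/drain structure can be sketched in C (a hand-pipelined sketch, not compiler output; the function name is illustrative):

```c
#include <assert.h>

/* Software-pipelined C[j] += A * B[j]: in steady state each trip does the
 * store for iteration j-2, the multiply-add for iteration j-1, and the
 * loads for iteration j, so loads, FP ops, and stores come from three
 * different original iterations. Fill and drain code handle the edges. */
void saxpy_swp(double *c, const double *b, double a, long n) {
    if (n < 2) {                          /* too short to pipeline */
        for (long j = 0; j < n; j++) c[j] += a * b[j];
        return;
    }
    double load_b = b[0], load_c = c[0];  /* fill: loads for iteration 0 */
    double sum = load_c + a * load_b;     /* fill: compute iteration 0   */
    load_b = b[1]; load_c = c[1];         /* fill: loads for iteration 1 */
    for (long j = 2; j < n; j++) {        /* steady state */
        c[j - 2] = sum;                   /* store, iteration j-2   */
        sum = load_c + a * load_b;        /* compute, iteration j-1 */
        load_b = b[j]; load_c = c[j];     /* loads, iteration j     */
    }
    c[n - 2] = sum;                       /* drain: last two stores */
    sum = load_c + a * load_b;
    c[n - 1] = sum;
}
```

Note the steady-state body contains no dependent back-to-back operations: the store uses a value computed on the previous trip, exactly as in the slide's diagram.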
Software pipelining targets the time when the pipeline is filling and draining: that overhead is paid only once per loop, at the beginning and the end, rather than once per unrolled body.