A. Moshovos
Superscalar - In-order
Two or more consecutive instructions in the original program order can execute in parallel
  This is the dynamic execution order
N-way Superscalar
  Can issue up to N instructions per cycle
  2-way, 3-way, ...
Example: sum += a[i--]
[Figure: timing diagrams of the instructions ld, add, sub, bne moving through fetch and decode over time, once under "Pipelining:" and once under "Superscalar:"]
Superscalar Performance
Performance Spectrum?
What if all instructions were dependent?
  Speedup gain = 0 (speedup factor of 1): superscalar buys us nothing
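As a rough illustration of the two ends of the performance spectrum, the sketch below models an idealized in-order issue stage. The function name `issue_cycles`, the `(target, sources)` encoding, and the unit-latency, full-bypass model are assumptions made for this example, not something taken from the slides. Only RAW dependences inside an issue group are modeled.

```python
def issue_cycles(program, width):
    """Count cycles for an idealized in-order, `width`-way issue stage.

    program: list of (target, sources) tuples in program order.
    Assumes unit latency with full bypassing, so a value produced in one
    cycle is usable in the next. An instruction cannot issue in the same
    cycle as an earlier instruction it depends on (RAW), and because issue
    is in order, nothing behind it can issue in that cycle either.
    """
    cycles = 0
    i = 0
    while i < len(program):
        written = set()   # targets of instructions issued in this cycle
        issued = 0
        while i < len(program) and issued < width:
            target, sources = program[i]
            if any(s in written for s in sources):
                break     # RAW on an instruction in the same group: stall
            written.add(target)
            issued += 1
            i += 1
        cycles += 1
    return cycles


# A fully dependent chain versus four independent instructions.
dependent = [("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]
independent = [("r1", ["r0"]), ("r2", ["r0"]), ("r3", ["r0"]), ("r4", ["r0"])]
```

With these inputs the dependent chain takes four cycles at any width, while the independent sequence takes two cycles on a 2-way machine and one cycle on a 4-way machine.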
Real-Life Performance
OLTP = Online Transaction Processing
SOURCE: Partha Ranganathan, Kourosh Gharachorloo, Sarita Adve, Luiz André Barroso, "Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors," ASPLOS '98
IPC
Superscalar Issue
An instruction at decode can execute if:
  Dependences
    RAW: input operand availability
    WAR and WAW
Issue Rules
Stall at decode if:
  RAW dependence and no data available
    Source registers against previous targets
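The RAW stall rule above (source registers compared against previous targets) can be sketched as a small predicate. The dictionary layout and the name `must_stall` are hypothetical choices for this example.

```python
def must_stall(instr, in_flight):
    """Return True if `instr` must stall at decode because of a RAW hazard.

    instr: dict with 'srcs' (list of source registers) and 'tgt'.
    in_flight: earlier instructions (same dict shape) whose results are
    not yet available, including those decoded in the same cycle.
    """
    pending_targets = {older["tgt"] for older in in_flight}
    # One comparison per (source, previous target) pair.
    return any(src in pending_targets for src in instr["srcs"])


ld = {"tgt": "r1", "srcs": ["r9"]}
add = {"tgt": "r2", "srcs": ["r1", "r3"]}   # reads r1 produced by ld
sub = {"tgt": "r4", "srcs": ["r5", "r6"]}   # independent of ld
```

Here `must_stall(add, [ld])` is true because `add` reads `r1` before `ld` has produced it, while `must_stall(sub, [ld])` is false.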
[Figure: register fields (tgt, src1) of the instructions at decode, with the comparisons between them]
Assume at most 2 source registers and 1 target register per instruction.
Comparators for 2-way:
  3 for tgt and 2 for src (tgt: WAW + WAR, src: RAW)
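To see how this comparator count grows with issue width, the following sketch extends the 2-way count to an N-way decode group. It is an extrapolation under the same assumption (2 sources, 1 target per instruction) and covers only checks inside one decode group, not checks against instructions already in the pipeline. The function name `comparators` is made up for this example.

```python
def comparators(width, srcs_per_instr=2):
    """Register comparators needed to cross-check one decode group.

    Each instruction is compared against every earlier instruction in the
    same group: its target against the earlier target (WAW) and the earlier
    sources (WAR), and each of its sources against the earlier target (RAW).
    Returns (target comparators, source comparators).
    """
    tgt_cmp = 0
    src_cmp = 0
    for position in range(1, width):   # position 0 has nothing to check
        tgt_cmp += position * (1 + srcs_per_instr)
        src_cmp += position * srcs_per_instr
    return tgt_cmp, src_cmp
```

For a 2-way group this gives 3 target and 2 source comparators, matching the count above. For a 4-way group it gives 18 and 12, so the cost grows roughly with the square of the issue width.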
Pending Read?
  Not needed if all reads are done in order
  WAR and WAW not possible
Can handle structural hazards
  Busy indicators per resource
Can handle bypass
  Where a register value is produced
  E.g., R0 busy, in ALU0, at time +3
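A table of busy indicators like the one described above (often called a scoreboard) might look like the following minimal sketch. The class and method names are assumptions for illustration. Each entry records which unit will produce a register and when, mirroring the "R0 busy, in ALU0, at time +3" example.

```python
class Scoreboard:
    """Per-register busy indicators; a hypothetical minimal model."""

    def __init__(self):
        # register -> (functional unit, cycle at which the value is produced)
        self.pending = {}

    def issue(self, tgt, unit, now, latency):
        """Mark `tgt` busy: `unit` will produce it at cycle now + latency."""
        self.pending[tgt] = (unit, now + latency)

    def ready(self, reg, now):
        """True if no write is pending, or the value has been produced
        (and is available through bypass) by cycle `now`."""
        entry = self.pending.get(reg)
        return entry is None or entry[1] <= now

    def can_issue(self, srcs, tgt, now):
        # RAW: every source must be ready. WAW: target must not be busy.
        return all(self.ready(r, now) for r in srcs) and self.ready(tgt, now)


sb = Scoreboard()
sb.issue("R0", "ALU0", now=0, latency=3)   # R0 busy, in ALU0, at time +3
```

A consumer of `R0` is held back at cycles 1 and 2 and may issue at cycle 3, when the value can be bypassed from `ALU0`.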
ECE1773 - Fall 07 ECE Toronto
Implications
Need to multiport some structures
  Register File
    Multiple reads and writes per cycle
Resource tracking
Additional issue conditions
Many superscalars had additional restrictions
  E.g., execute one integer and one floating-point op, one branch, or one store/load
Allow all preceding instructions to commit
Recall comparisons are done in program order
Must have sufficient time in clock cycle to handle these
Interrupts Example
[Figure: superscalar timing diagram of ld, add, div, bne in fetch and decode, marking the cycle where the exception is raised and the cycles where it is taken]
[Figures: three pipeline timing diagrams showing successive instructions (inst) moving through stages F1, F2, D1, D2, E1, E2 over time]
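Amdahl's Law
f = fraction of the work that is sped up, v = speedup of that fraction
$$\text{Speedup} = \frac{1}{(1 - f) + \dfrac{f}{v}}$$
$$\lim_{v \to \infty} \frac{1}{(1 - f) + \dfrac{f}{v}} = \frac{1}{1 - f}$$
[Figure: work performed (1 to N, N = No. of Processors) versus time, split into fractions 1-f and f]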
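Amdahl's Law in its usual form, speedup = 1 / ((1 - f) + f / v), can be checked numerically with a short sketch. The function names are chosen for this example. Here f is the fraction of the work that is sped up and v is the speedup applied to that fraction.

```python
def amdahl_speedup(f, v):
    """Overall speedup when a fraction f of the work is sped up by v."""
    return 1.0 / ((1.0 - f) + f / v)


def amdahl_limit(f):
    """Upper bound on the speedup as v grows without bound."""
    return 1.0 / (1.0 - f)
```

For instance, with f = 0.8 and v = 4 the denominator is 0.2 + 0.2 = 0.4, so the overall speedup is 2.5. However large v becomes, the speedup for f = 0.8 stays below 1 / 0.2 = 5.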
Pipeline Performance
[Figure: pipeline depth (1 to N) versus time, split into fractions 1-g and g]
g = fraction of time pipeline is filled
1-g = fraction of time pipeline is not filled (stalled)
As 1-g grows, performance suffers
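Applying Amdahl-style reasoning to a pipeline of depth N gives a simple speedup estimate. This assumes that the filled fraction g runs N times faster than an unpipelined machine and the stalled fraction 1-g gains nothing. That reading of the slide's figure is an assumption.

$$\text{Speedup} = \frac{1}{(1 - g) + \dfrac{g}{N}}$$

As a worked example under that assumption, a pipeline with N = 6 that is filled 90% of the time (g = 0.9) gives

$$\text{Speedup} = \frac{1}{0.1 + \dfrac{0.9}{6}} = \frac{1}{0.25} = 4$$

so a 10% stall fraction already costs a third of the ideal sixfold speedup.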
Performance Comparison
Source:
CPI Comparison
Compiler Impact
When no instruction is committing
Does not capture overlapping factors:
  Stall due to dependence while committing
  Stall due to cache miss while committing
Replay Traps
Tried to do something and couldn't
  E.g., a store when the write buffer is full
  Can't complete the instruction
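[Figure: two-instruction pipeline diagram with "Cache hit" and "Cache miss" annotations; recovered stage sequences: D E M M W for the first instruction and F D D D E for the second]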
Optimistic Scheduling
ld r1        F  D  E  M  M  W
add _, r1       F  D  D  D  E
Cache hit
Must decide that add should execute
Start making scheduling decisions
Optimistic Scheduling #2
ld r1        F  D  E  M  M  W
add _, r1       F  D  D  D  E
Cache hit
Guess Hit/Miss
Must decide that add should execute
Start making scheduling decisions
Stall Distribution
21164 Microarchitecture
Instruction Fetch/Decode + Branch Units
Integer Execution Unit
Floating-Point Execution Unit
Memory Address Translation Unit
Cache Control and Bus Interface
Data Cache
Instruction Cache
Second-Level Cache
Instruction Decode/Issue
Up to four insts/cycle
Naturally aligned groups
  Must start at 16-byte boundary (INT16)
  Simplifies fetch path (in a second)
All of a group must issue before the next group gets in
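The effect of naturally aligned groups can be sketched as follows. Alpha instructions are fixed 4-byte words, so a 16-byte aligned group holds four of them. The helper name `fetch_group` is hypothetical, and `pc` is assumed to be 4-byte aligned.

```python
INSTR_BYTES = 4     # Alpha instructions are fixed 32-bit words
GROUP_BYTES = 16    # naturally aligned 16-byte group (INT16)


def fetch_group(pc):
    """Return (group_base, useful_pcs) for the aligned group holding `pc`.

    The group always starts at a 16-byte boundary. When `pc` points into
    the middle of a group (for example after a taken branch), only the
    instructions from `pc` to the end of the group are useful.
    """
    base = pc & ~(GROUP_BYTES - 1)   # clear the low four address bits
    useful = list(range(pc, base + GROUP_BYTES, INSTR_BYTES))
    return base, useful
```

A fetch at `0x1000` yields four useful instructions, while a branch target of `0x1008` yields only two (`0x1008` and `0x100C`), so the group issues at most two instructions that cycle.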
Where do instructions come from?
[Figure: contents of the I-Cache compared with what the CPU needs]
Branch Execution
One cycle delay
  Calc. target PC
Naive implementation:
  Can fetch every other cycle
Branch prediction to avoid the delay
Up to four pending branches in stage 2 (Assignment to Functional Units)
One at stage 3 (Instruction Scheduling/Issue)
One at stage 4 (Instruction Execution)
Full and execute from right PC
Floating-Point Unit
FPU Add and FPU Multiply
  2 ops per cycle
  Divides take multiple cycles
32 registers, five reads, four writes
  Four reads and two writes for FP pipe
  One read for stores (handled by integer pipe)
  Two writes for loads (handled by integer pipe)
Memory Unit
Up to two accesses
Data translation buffer
  512-entries, not-MRU
  Loads access in parallel with D-cache
Miss Address File
  Pending misses
    Six data loads
    Four instruction reads
  Merges loads to same block
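The merging behaviour of the Miss Address File can be illustrated with a small model. This is a sketch under stated assumptions: the 32-byte block size, the class name, and the return values are chosen for the example and are not taken from the slides. Only the six data-load entries are modeled.

```python
BLOCK_BYTES = 32   # assumed cache block size for this sketch


class MissAddressFile:
    """Tracks pending data-load misses, one entry per missing block."""

    def __init__(self, entries=6):
        self.entries = entries
        self.pending = {}            # block number -> loads waiting on it

    def record_miss(self, addr, load_id):
        """Return 'merged', 'allocated', or 'full' for a missing load."""
        block = addr // BLOCK_BYTES
        if block in self.pending:
            self.pending[block].append(load_id)   # same block: merge
            return "merged"
        if len(self.pending) >= self.entries:
            return "full"                         # no free entry
        self.pending[block] = [load_id]
        return "allocated"

    def fill(self, addr):
        """Block returned from memory: release the loads waiting on it."""
        return self.pending.pop(addr // BLOCK_BYTES, [])


maf = MissAddressFile(entries=2)   # small capacity to show the full case
```

Two loads to `0x100` and `0x110` fall in the same 32-byte block, so the second merges into the first entry and only one request goes to the next cache level.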
Store/Load Conflicts
Load immediately after a store
  Can't see the data
  Detect and replay
    Flush pipe and re-execute
Compiler can help
  Schedule load three cycles after store
  Two cycles stalls the load at issue/address generation
Write Buffer
Six 32-byte entries
Defer stores until there is a port available
Loads can read from the write buffer
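A write buffer with these properties could be modeled roughly as below. The byte-granularity storage, the merging of stores into an existing entry, the oldest-first drain order, and all names are assumptions made for this sketch.

```python
class WriteBuffer:
    """Holds deferred stores in block-sized entries, oldest first."""

    def __init__(self, entries=6, entry_bytes=32):
        self.entries = entries
        self.entry_bytes = entry_bytes
        self.blocks = {}   # block number -> {byte offset: value}

    def store(self, addr, byte):
        """Buffer one byte; return False when no entry is free."""
        block, offset = divmod(addr, self.entry_bytes)
        if block not in self.blocks:
            if len(self.blocks) == self.entries:
                return False            # buffer full: the store must wait
            self.blocks[block] = {}
        self.blocks[block][offset] = byte
        return True

    def load(self, addr):
        """Return the buffered byte at `addr`, or None if not buffered."""
        block, offset = divmod(addr, self.entry_bytes)
        return self.blocks.get(block, {}).get(offset)

    def drain_one(self):
        """A port is available: write out and free the oldest entry."""
        if not self.blocks:
            return None
        block = next(iter(self.blocks))   # dicts keep insertion order
        return block, self.blocks.pop(block)


wb = WriteBuffer(entries=2)   # small capacity to show the full case
```

A store that finds the buffer full returns `False`, which corresponds to the case where the instruction cannot complete until an entry drains.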
Integer Add
Floating-Point Add
Load Hit
Load Miss
Store Hit
80486 Pipeline
Fetch
  Load 16 bytes into prefetch buffer
Decode 1
  Determine instruction length and type
Decode 2
  Compute memory address
  Generate immediate operands
Execute
  Register read
  ALU
  Memory read/write
Write-back
  Update register file
(source: CS740 CMU, 97, all slides on 486)
80486 Pipeline
D2
  Extract memory displacements and immediate operands
  Compute memory addresses
    Add base register, and possibly scaled index register
  May require two cycles
    If index register involved, or both address & immediate operand
    Approx. 5% of executed instructions
EX
  Read register operands
  Compute ALU function
  Read or write memory (data cache)
WB
  Update register result
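The address computation done in D2 can be sketched as below. The function names are made up for this example. Reading "both address & immediate operand" as "the instruction carries both a memory address and an immediate" is an interpretation of the slide, not something it states outright.

```python
def effective_address(base, displacement=0, index=0, scale=1):
    """x86-style effective address: base + index * scale + displacement,
    wrapped to 32 bits."""
    assert scale in (1, 2, 4, 8)
    return (base + index * scale + displacement) & 0xFFFFFFFF


def d2_cycles(uses_index, has_address=False, has_immediate=False):
    """Cycles spent in D2 under the rule above: two if an index register
    is involved, or if the instruction has both a memory address and an
    immediate operand; otherwise one."""
    return 2 if uses_index or (has_address and has_immediate) else 1
```

For example, base `0x1000`, index 3 scaled by 4, and displacement 8 give `0x1014`, and because an index register is involved D2 takes two cycles for that instruction.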