Superscalar Summary
Compiler Hardware
EE382A – Autumn 2009 Lecture 11 - 3 John P Shen
“Iron Law” of Processor Performance
$$\frac{1}{\text{Processor Performance}} = \frac{\text{Time}}{\text{Program}}$$
$$\text{Processor Performance} = \frac{\text{IPC} \times \text{Frequency}}{\text{PathLength}} \quad (\text{Frequency in GHz})$$
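As a rough numerical illustration of the Iron Law, the sketch below computes execution time and performance from IPC, frequency, and path length. The function names and all workload numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def execution_time(path_length: float, ipc: float, freq_hz: float) -> float:
    """Seconds per program: path length / (IPC * frequency)."""
    return path_length / (ipc * freq_hz)


def performance(path_length: float, ipc: float, freq_hz: float) -> float:
    """Programs per second: the reciprocal of execution time."""
    return ipc * freq_hz / path_length


# Hypothetical workload: 1e9 dynamic instructions (the path length).
PATH_LENGTH = 1.0e9

base = execution_time(PATH_LENGTH, ipc=1.0, freq_hz=2.0e9)    # baseline
wider = execution_time(PATH_LENGTH, ipc=2.0, freq_hz=2.0e9)   # double the IPC
faster = execution_time(PATH_LENGTH, ipc=1.0, freq_hz=4.0e9)  # double the clock

print(f"baseline: {base:.3f} s, 2x IPC: {wider:.3f} s, 2x clock: {faster:.3f} s")
```

In this idealized model, doubling IPC and doubling frequency give the same speedup; the formula alone does not capture how the two interact in a real design.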
Frequency vs. Parallelism
[Figure: pipeline diagrams with stages Fetch, Decode, Dispatch, Execute, Memory, Retire; labels: Branch Mispredict Penalty, Optimize, Back-End Optimization]
• Front-End Contraction
– Code Re-mapping and Caching
– Trace Construction, Caching, Optimization
– Leverage Back-End Optimizations
• Back-End Optimization
– Multiple-Branch, Trace, Stream, Prediction
– Code Reordering, Alignment, Optimization
– Pre-decode, Pre-rename, Pre-scheduling
– Memory Pre-fetch Prediction and Control
[Figure: pipeline stages Fetch, Decode, Dispatch, Execute, Memory, Retire, Optimize]
• Super-pipelined ALU design
• Very high-speed arithmetic units
• Speculative OoO execution
• Criticality-based data caching
• Aggressive data pre-fetching
$$P_{dyn} \approx \sum_{i \in \text{units}} C_i V_i^2 A_i f$$
• Static CMOS: current flows when active
– Combinational logic evaluates new inputs
– Flip-flop, latch captures new value (clock edge)
• Terms
– C: capacitance of circuit
• wire length, number and size of transistors
– V: supply voltage
– A: activity factor
– f: frequency
• Future: Fundamentally power-constrained
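The per-unit sum for dynamic power can be sketched as a short calculation. The `Unit` class, the unit names, and every capacitance, voltage, and activity value below are hypothetical, chosen only to show how the terms C, V, A, and f combine.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    capacitance_f: float  # C: switched capacitance in farads
    vdd: float            # V: supply voltage in volts
    activity: float       # A: activity factor, between 0 and 1


def dynamic_power(units, freq_hz: float) -> float:
    """Approximate dynamic power in watts: sum of C_i * V_i^2 * A_i * f."""
    return sum(u.capacitance_f * u.vdd ** 2 * u.activity * freq_hz for u in units)


# Hypothetical two-unit design clocked at 1 GHz.
units = [
    Unit("alu", capacitance_f=1.0e-9, vdd=1.0, activity=0.5),
    Unit("cache", capacitance_f=2.0e-9, vdd=1.0, activity=0.1),
]
total_w = dynamic_power(units, freq_hz=1.0e9)
print(f"dynamic power: {total_w:.2f} W")
```

With these made-up numbers the ALU dominates despite its smaller capacitance, because its activity factor is five times higher.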
Reducing Dynamic Power
$$P_{dyn} \approx C V^2 A f$$
• Reduce capacitance
– Simpler, smaller design (yeah right)
– Reduced IPC
• Reduce activity
– Smarter design
– Reduced IPC
• Reduce frequency
– Often in conjunction with reduced voltage
• Reduce voltage
– Biggest hammer due to quadratic effect, widely employed
– Can be static (binning/sorting of parts), and/or
– Dynamic (power modes)
• E.g. Transmeta LongRun, AMD PowerNow, Intel SpeedStep
• Multiple voltages
– Realize non-critical circuits with slower transistors
– Voltage islands: Vdd and Vth are lower
• Problem: supplying multiple Vdd
• Multiple frequencies
– Globally Asynchronous Locally Synchronous (GALS)
• Exploiting safety margins
– Average case vs. worst case design
– Razor latch [UMichigan]:
• Sample latch input twice, then compare, recover
• Body biasing
– Reduce leakage by adapting Vth
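To see why voltage is the "biggest hammer" among the options above, the sketch below compares relative dynamic power under frequency-only scaling and under combined voltage and frequency scaling, using the C·V²·A·f approximation with C and A held fixed. Scaling voltage by the same factor as frequency is a simplifying assumption for illustration, not something the slides state, and the function name is hypothetical.

```python
def relative_dynamic_power(v_scale: float, f_scale: float) -> float:
    """Dynamic power relative to baseline, with C and A unchanged: V^2 * f."""
    return v_scale ** 2 * f_scale


# Slow the clock to 80% of baseline, supply voltage untouched.
freq_only = relative_dynamic_power(v_scale=1.0, f_scale=0.8)

# Assumption: supply voltage can be lowered by the same 20% along with the clock.
freq_and_voltage = relative_dynamic_power(v_scale=0.8, f_scale=0.8)

print(f"frequency only:        {freq_only:.3f} of baseline power")
print(f"frequency and voltage: {freq_and_voltage:.3f} of baseline power")
```

Under that assumption power falls roughly with the cube of the scaling factor, which is the quadratic voltage effect compounded with the linear frequency term.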
• Energy issues:
– Limited energy capacity (battery)
– Efficiency (work per unit energy)
[Figure: plot of power versus time, labeled with energy]
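$$\text{Performance} = \frac{\text{IPC} \times \text{Frequency}}{\text{PathLength}} = \frac{\frac{\text{inst}}{\text{cycle}} \times \frac{\text{cycle}}{\text{sec}}}{\text{PathLength}} = \frac{\text{IPS}}{\text{PathLength}}$$
$$\frac{\text{Power}}{\text{Performance}} = \frac{\text{Joule}}{\text{Inst}} = \text{EPI}$$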
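A minimal sketch of the energy-per-instruction (EPI) calculation follows, taking EPI as power divided by instructions per second (IPS = IPC × frequency). The function names and the 20 W, 2 GHz, IPC = 1 operating point are hypothetical.

```python
def instructions_per_second(ipc: float, freq_hz: float) -> float:
    """IPS = (inst / cycle) * (cycle / sec)."""
    return ipc * freq_hz


def energy_per_instruction(power_w: float, ipc: float, freq_hz: float) -> float:
    """EPI in joules: (joule / sec) / (inst / sec) = joule / inst."""
    return power_w / instructions_per_second(ipc, freq_hz)


# Hypothetical core: 20 W at 2 GHz, sustaining one instruction per cycle.
epi_j = energy_per_instruction(power_w=20.0, ipc=1.0, freq_hz=2.0e9)
print(f"EPI = {epi_j * 1e9:.1f} nJ per instruction")
```

Because the time units cancel, EPI depends on how much energy each instruction costs rather than on how quickly instructions complete.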
MIPS/watt
$$\frac{\text{MIPS}}{\text{Watt}} = \frac{\text{Instructions}/\text{Second}}{\text{Joules}/\text{Second}} = \frac{\text{Instructions}}{\text{Joule}}$$
[Figure: normalized MIPS/watt for the 486, P5, P6, and Pentium 4]
MIPS^2/watt
Equivalent to the inverse of the (energy x delay) product.
[Figure: normalized MIPS^2/watt for the 486, P5, P6, and Pentium 4]
[Figure: a third normalized metric for the 486, P5, P6, and Pentium 4]
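The relationship between MIPS^2/watt and the energy-delay product can be checked numerically. The sketch below uses plain instructions per second rather than millions of instructions per second, so that the reciprocal identity holds without a constant factor. The function names and operating point are hypothetical.

```python
def ips_per_watt(ips: float, power_w: float) -> float:
    """Instructions per joule: (inst / sec) / (joule / sec)."""
    return ips / power_w


def ips_squared_per_watt(ips: float, power_w: float) -> float:
    """IPS^2 / watt, the per-instruction analogue of MIPS^2 / watt."""
    return ips ** 2 / power_w


def energy_delay_product(ips: float, power_w: float) -> float:
    """(Energy per instruction) * (time per instruction)."""
    energy_per_inst = power_w / ips  # joules per instruction
    delay_per_inst = 1.0 / ips       # seconds per instruction
    return energy_per_inst * delay_per_inst


# Hypothetical operating point: 2e9 instructions per second at 20 W.
ips, power_w = 2.0e9, 20.0
metric = ips_squared_per_watt(ips, power_w)
edp = energy_delay_product(ips, power_w)
print(f"IPS^2/W = {metric:.3e}, 1/EDP = {1.0 / edp:.3e}")
```

A design that raises IPS^2/watt therefore lowers the energy-delay product by the same factor, which is why the squared metric rewards performance more heavily than MIPS/watt does.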
[Figure: EPI (nJ) vs. scalar performance for the i486, Pentium, Pentium Pro, Pentium 4 (Willamette), Pentium M (Banias, Dothan), Core Duo (Yonah), and Merom; annotated "EPI = 10"]

Processor              EPI
Pentium Pro            24
Pentium 4 (WMT)        38
Pentium 4 (CDM)        48
Pentium M (Banias)     13
Pentium M (Dothan)     15
Core Duo (Yonah)       11
Core Duo (Merom)       10
[Figure: relative power vs. relative performance for the i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), and Pentium 4 (Psc). Scalar/latency performance: power = perf^1.74. Throughput performance: power = perf^1.0 (?), labeled "Low EPI".]
• Assume a large-scale CMP with potentially many cores.
• Replication of cores results in proportional increases to both throughput performance and power (hopefully).
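The two trends in the figure can be compared with a short calculation. The sketch treats the 1.74 exponent from the figure as an empirical fit for scalar performance and takes the "proportional" (exponent 1.0) case for core replication at face value. The function names and target speedups are hypothetical.

```python
SCALAR_EXPONENT = 1.74  # exponent from the figure's power = perf^1.74 label


def scalar_power(relative_perf: float, exponent: float = SCALAR_EXPONENT) -> float:
    """Relative power to reach a given relative scalar performance."""
    return relative_perf ** exponent


def throughput_power(num_cores: int) -> float:
    """Relative power if replicating cores scales power and throughput linearly."""
    return float(num_cores)


for target in (2, 4, 8):
    print(
        f"{target}x performance: scalar path ~{scalar_power(target):.1f}x power, "
        f"replicated cores ~{throughput_power(target):.1f}x power"
    )
```

The gap widens as the target grows, which is the argument for many low-EPI cores when the workload has enough parallelism to use them.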