
EE382A Lecture 11:

Superscalar Summary

Department of Electrical Engineering


Stanford University
http://eeclass.stanford.edu/ee382a
EE382A – Autumn 2009 Lecture 11 - 1 John P Shen
Announcement

• EXAM coming up on Friday November 13, 9:00am-12:00noon


– All lectures included



HW/SW Design Space for ILP
[B. Rau & J. Fisher, 1993]

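[Figure: division of work between compiler and hardware for four classes of ILP architecture. The stages are Front end & Optimizer, Determine Dependences, Determine Independences, Bind Resources, and Execute; each class hands off from compiler to hardware at a different stage.]
• Sequential Architecture (Superscalar): hardware determines dependences and independences and binds resources
• Dependence Architecture (Dataflow): compiler determines dependences; hardware determines independences and binds resources
• Independence Architecture (VLIW): compiler determines dependences and independences; hardware binds resources
• Independence Architecture (Attached Array Processor): compiler determines dependences and independences and binds resources; hardware executes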
“Iron Law” of Processor Performance

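$$\frac{1}{\text{Processor Performance}} = \frac{\text{Time}}{\text{Program}} = \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{path length}} \times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{CPI}} \times \underbrace{\frac{\text{Time}}{\text{Cycle}}}_{\text{cycle time}}$$

$$\text{Processor Performance} = \frac{\text{IPC} \times \text{GHz}}{\text{PathLength}}$$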
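
As a rough illustration of the Iron Law, the sketch below multiplies the three factors for a made-up workload and checks that performance is the reciprocal of time per program. The function names and the workload numbers are assumptions chosen for the example, not values from the lecture.

```python
def execution_time(path_length, cpi, cycle_time_s):
    """Iron Law: time/program = instructions/program * cycles/instruction * time/cycle."""
    return path_length * cpi * cycle_time_s


def performance(ipc, freq_hz, path_length):
    """Performance = IPC * frequency / path length, in programs per second."""
    return ipc * freq_hz / path_length


# Hypothetical workload: 1e9 instructions, CPI of 1.25, 2 GHz clock.
path_length = 1_000_000_000
cpi = 1.25
freq_hz = 2.0e9

time_s = execution_time(path_length, cpi, 1.0 / freq_hz)
perf = performance(1.0 / cpi, freq_hz, path_length)

print(f"time per program: {time_s:.3f} s")
print(f"performance:      {perf:.3f} programs/s")
```

Halving CPI or doubling frequency has the same effect on `time_s`, which is why the later slides treat IPC and GHz as two competing routes to the same goal.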



Microprocessor Performance Evolution

$$\text{Performance} = \frac{\text{IPC} \times \text{Frequency}}{\text{PathLength}}$$
Frequency vs. Parallelism

• Increase Frequency (GHz)


– Deeper Pipelines
– Increased Overall Latency
– Lower IPC

• Increase Instruction Parallelism (IPC)


– Wider Pipelines
– Increased Complexity
– Lower GHz



Deeper and Wider Pipelines

[Figure: a shallow pipeline (Fetch, Dec., Disp., Exec., Mem., Retire) beside a deeper and wider pipeline (Fetch, Decode, Dispatch, Execute, Memory, Retire), with the branch mispredict penalty marked.]



Front-End Pipe-Depth Penalty

[Figure: two pipelines (Fetch, Decode, Dispatch, Execute, Memory, Retire), the second with an added Optimize stage; labeled Front-End Contraction and Back-End Optimization.]



Alleviate Pipe-Depth Penalty

• Front-End Contraction
– Code Re-mapping and Caching
– Trace Construction, Caching, Optimization
– Leverage Back-End Optimizations

• Back-End Optimization
– Multiple-Branch, Trace, and Stream Prediction
– Code Reordering, Alignment, Optimization
– Pre-decode, Pre-rename, Pre-scheduling
– Memory Pre-fetch Prediction and Control



Execution Core Improvement

[Figure: pipeline (Fetch, Decode, Dispatch, Execute, Memory, Retire, Optimize) annotated with execution-core improvements:]
• Super-pipelined ALU design
• Very high-speed arithmetic units
• Speculative OoO execution
• Criticality-based data caching
• Aggressive data pre-fetching



Trends

• Moore’s Law for device integration


• Chip power consumption
• Single-thread performance trend [source: Intel]
Power Density
[Hu et al, MICRO ’03 tutorial]

• Power density increasing exponentially


– Power delivery, packaging, thermal implications
– Thermal effects on leakage, delay, reliability, etc.
Dynamic Power

$$P_{dyn} \approx \sum_{i \in \text{units}} C_i V^2 A_i f$$
• Static CMOS: current flows when active
– Combinational logic evaluates new inputs
– Flip-flop, latch captures new value (clock edge)
• Terms
– C: capacitance of circuit
• wire length, number and size of transistors
– V: supply voltage
– A: activity factor
– f: frequency
• Future: Fundamentally power-constrained
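
A minimal sketch of the dynamic power estimate, assuming per-unit capacitance and activity factor with a shared supply voltage and clock frequency. The unit list and all numeric values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def dynamic_power(units, voltage, freq_hz):
    """Approximate dynamic power as the sum over units of C_i * V^2 * A_i * f.

    units: iterable of (capacitance_farads, activity_factor) pairs.
    """
    return sum(c * a for c, a in units) * voltage**2 * freq_hz


# Hypothetical units: (capacitance in farads, activity factor).
units = [
    (1.0e-9, 0.5),  # a busy block
    (2.0e-9, 0.1),  # a larger but mostly idle block
]

p_nominal = dynamic_power(units, voltage=1.0, freq_hz=1.0e9)
p_low_v = dynamic_power(units, voltage=0.9, freq_hz=1.0e9)

print(f"nominal voltage: {p_nominal:.3f} W")
print(f"10% lower V:     {p_low_v:.3f} W")
```

Because voltage enters squared, the 10% voltage reduction in this example cuts the estimate by 19% with frequency held fixed.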
Reducing Dynamic Power

$$P_{dyn} \approx C V^2 A f$$

• Reduce capacitance
– Simpler, smaller design (yeah right)
– Reduced IPC
• Reduce activity
– Smarter design
– Reduced IPC
• Reduce frequency
– Often in conjunction with reduced voltage
• Reduce voltage
– Biggest hammer due to quadratic effect, widely employed
– Can be static (binning/sorting of parts), and/or
– Dynamic (power modes)
• E.g. Transmeta LongRun, AMD PowerNow!, Intel SpeedStep



Frequency/Voltage Scaling

• Voltage/frequency scaling rule of thumb:


– +/- 1% performance buys -/+ 3% power (3:1 rule)
• Hence, any power-saving technique that saves less than 3% power per 1% of performance lost is uninteresting
• Example 1:
– New technique saves 12% power
– However, performance degrades 5%
– Useless, since 12 < 3 x 5
– Instead, reduce f by 5% (also V), and get 15% power savings
• Example 2:
– New technique saves 5% power
– Performance degrades 1%
– Useful, since 5 > 3 x 1
• Does this rule always hold?
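
The two examples above can be checked mechanically. This is a small sketch of the 3:1 rule of thumb as stated on the slide; the function name is made up for illustration, and the rule itself is only an approximation (the slide's closing question is a reminder of that).

```python
def beats_voltage_scaling(power_saving_pct, perf_loss_pct, ratio=3.0):
    """Return True if a technique saves more power than plain V/f scaling would.

    Under the rule of thumb, giving up perf_loss_pct percent of performance
    through voltage/frequency scaling already buys ratio * perf_loss_pct
    percent of power, so a technique must beat that to be interesting.
    """
    return power_saving_pct > ratio * perf_loss_pct


# Example 1 from the slide: 12% power saved for a 5% performance loss.
example_1 = beats_voltage_scaling(12, 5)
# Example 2 from the slide: 5% power saved for a 1% performance loss.
example_2 = beats_voltage_scaling(5, 1)

print("Example 1 useful?", example_1)
print("Example 2 useful?", example_2)
```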
Leakage Power (Static/DC)

[Figure: transistor with Source, Gate, and Drain labeled]
• Transistors aren’t perfect on/off switches
• Even in static CMOS, transistors leak
– Channel (source/drain) leakage
– Gate leakage through insulator
• High-K dielectric replacing SiO2 will help
• Leakage compounded by
– Low threshold voltage
• Low Vth => fast switching, more leakage
• High Vth => slow switching, less leakage
– Higher temperature
• Temperature increases with power
• Power increases with C, V^2, A, f
• Rough approximation: leakage proportional to area
– Transistors aren’t free
• Huge problem in future technologies
– Estimates are 40%-50% of total power
Circuit-Level Techniques

• Multiple voltages
– Realize non-critical circuits with slower transistors
– Voltage islands: Vdd and Vth are lower
• Problem: supplying multiple Vdd
• Multiple frequencies
– Globally Asynchronous Locally Synchronous (GALS)
• Exploiting safety margins
– Average case vs. worst case design
– Razor latch [UMichigan]:
• Sample latch input twice, then compare, recover
• Body biasing
– Reduce leakage by adapting Vth



Architectural Techniques

• Clock gating (dynamic power)


– 70% of dynamic power in IBM Power5 [Jacobson et al., HPCA 04]
– Inhibit clock for
• Functional block
• Pipeline stage
• Pipeline register (sub-stage)
– Widely used in real designs today
– Control overhead, timing complexity (violates FSD rules)
• Power gating (leakage power)
– (Big) sleep transistor cuts off ground path
– Apply to FU, cache subarray, even entire core in CMP



Architectural Techniques

• Cache reconfiguration (leakage power)


– Not all applications or phases require full L1 cache capacity
– Power gate portions of cache memory
– Complicates a critical path (L1 cache access)
– Does not apply to lower level caches
• Heterogeneous cores [Kumar et al., MICRO-36]
– Prior-generation simple core consumes small fraction of die area
– Use simple core to run low-ILP workloads
• And many others…check proceedings of
– ISLPED, MICRO, ISCA, HPCA, ASPLOS, PACT



Power vs. Energy

• Energy: integral of power (area under the curve)


– Energy & power driven by different design constraints
• Power issues:
– Power delivery (supply current @ right voltage)
– Thermal (don’t fry the chip)
– Reliability effects (chip lifetime)

• Energy issues:
– Limited energy capacity (battery)
– Efficiency (work per unit energy)
[Figure: power plotted against time; energy is the area under the curve]

• Different usage models drive tradeoffs



Power vs. Energy

• With constant time base, two are “equivalent”


– 10% reduction in power => 10% reduction in energy
• Once time changes, must treat as separate metrics
– E.g. reduce frequency to save power => reduce performance => increase time to completion => consume more energy (perhaps)
• Metric: energy-delay product per unit of work
– Tries to capture both effects
– Others advocate energy-delay^2
– Best to consider all
• Plot performance (time), energy, ED, ED^2
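
To see why power and energy must be treated separately once time changes, the sketch below compares a baseline against two slowed-down design points. It assumes constant average power over a fixed unit of work, and every number is hypothetical.

```python
def energy_metrics(power_w, time_s):
    """Return (energy, energy*delay, energy*delay^2) for a fixed unit of work.

    Assumes constant average power over the run, so energy = power * time.
    """
    energy = power_w * time_s
    return energy, energy * time_s, energy * time_s**2


# Hypothetical design points: (average power in watts, time to completion in s).
baseline = energy_metrics(10.0, 1.0)
deep_scaling = energy_metrics(7.0, 1.1)   # 10% slower, 30% less power
weak_scaling = energy_metrics(9.5, 1.1)   # 10% slower, only 5% less power

for name, (e, ed, ed2) in [
    ("baseline", baseline),
    ("deep scaling", deep_scaling),
    ("weak scaling", weak_scaling),
]:
    print(f"{name:13s} E={e:.3f} J  ED={ed:.3f}  ED^2={ed2:.3f}")
```

In this example the weak-scaling point draws less power than the baseline yet consumes more energy, which is the "(perhaps)" case above; the delay-weighted metrics penalize it further.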



Performance, Power, and Energy

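$$\text{Performance} = \frac{\text{IPC} \times \text{Frequency}}{\text{PathLength}} = \frac{\frac{\text{inst}}{\text{cycle}} \times \frac{\text{cycle}}{\text{sec}}}{\text{PathLength}} = \frac{\text{IPS}}{\text{PathLength}}$$

$$\frac{\text{Performance}}{\text{Power}} = \frac{\text{IPS}}{\text{Watt}} = \frac{\text{Inst}}{\text{Watt} \times \text{Sec}} = \frac{\text{Inst}}{\text{Joule}} = \frac{1}{\text{EPI}}$$

$$\frac{\text{Power}}{\text{Performance}} = \frac{\text{Joule}}{\text{Inst}} = \text{EPI}$$

$$\text{Power} = \text{EPI} \times \text{Performance} = \text{EPI} \times \text{IPC} \times \text{Frequency}$$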
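
A quick numeric reading of the last identity, with performance expressed as instructions per second. The function name and the IPC and frequency values are assumptions for illustration only.

```python
def power_watts(epi_joules, ipc, freq_hz):
    """Power = EPI * IPC * Frequency (joules/inst * inst/cycle * cycles/sec)."""
    return epi_joules * ipc * freq_hz


# Hypothetical cores, both sustaining IPC = 1.0 at 2 GHz.
low_epi_core = power_watts(10e-9, ipc=1.0, freq_hz=2.0e9)   # 10 nJ per instruction
high_epi_core = power_watts(48e-9, ipc=1.0, freq_hz=2.0e9)  # 48 nJ per instruction

print(f"10 nJ/inst core: {low_epi_core:.1f} W")
print(f"48 nJ/inst core: {high_epi_core:.1f} W")
```

At a fixed instruction rate, power scales linearly with EPI, so lowering EPI is the only way to raise throughput inside a fixed power budget.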



Estimating Energy Per Instruction
[Ed Grochowski, 2006]

• Think of the microprocessor as a capacitor
– Charged or discharged with every instruction processed
– Ignore leakage current and short-circuit switching current
• Apply capacitor formula: $E = \tfrac{1}{2} C V^2$
– E = energy expended per instruction (from fetch to retirement)
– C = switching capacitance per instruction (equal to activity factor multiplied by total capacitance)
– V = supply voltage
• Energy per instruction depends on only two things
– Amount of capacitance toggled to execute an instruction
– Supply voltage
[Figure: capacitor C charged to voltage V and then discharged, each transition labeled ½·C·V²]
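
The capacitor model can be evaluated directly. The sketch below is only an illustration: the 20 nF switching capacitance is a made-up figure picked so that the result at 1.0 V is a round 10 nJ.

```python
def energy_per_instruction(switched_capacitance_f, supply_voltage_v):
    """Capacitor model: E = 1/2 * C * V^2, ignoring leakage and short-circuit current."""
    return 0.5 * switched_capacitance_f * supply_voltage_v**2


# Hypothetical switching capacitance per instruction: 20 nF.
c_switch = 20e-9

epi_nominal = energy_per_instruction(c_switch, 1.0)
epi_low_v = energy_per_instruction(c_switch, 0.9)

print(f"EPI at 1.0 V: {epi_nominal * 1e9:.2f} nJ")
print(f"EPI at 0.9 V: {epi_low_v * 1e9:.2f} nJ")
```

Under this model the only two levers are the capacitance toggled per instruction and the supply voltage, and the voltage lever is quadratic.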
Power Efficiency Metrics
[Ed Grochowski, 2006]
• MIPS/watt
– Common measure of power efficiency
– Equivalent to energy per instruction
– Independent of time

$$\frac{\text{MIPS}}{\text{Watt}} = \frac{\text{Instructions}/\text{Second}}{\text{Joules}/\text{Second}} = \frac{\text{Instructions}}{\text{Joule}}$$

• MIPS^2/watt
– Equivalent to (energy x delay) product
– Common metric for comparing logic families
• MIPS^3/watt
– Equivalent to (energy x delay^2)
– Assigns increasing weight to time
– Appropriate metric for latency performance

[Figures: three bar charts of normalized MIPS/watt, MIPS^2/watt, and MIPS^3/watt for the 486, P5, P6, and Pentium 4]
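
The equivalences above can be checked numerically: for a fixed instruction count, energy times MIPS/watt and (energy x delay) times MIPS^2/watt both come out the same regardless of the design point. This is a sketch with hypothetical design points; the function names are made up for the example.

```python
def efficiency_metrics(mips, watts):
    """Return MIPS/watt, MIPS^2/watt, and MIPS^3/watt for one design point."""
    return mips / watts, mips**2 / watts, mips**3 / watts


def energy_and_delay(mips, watts, instructions):
    """Energy (J) and delay (s) to run a fixed number of instructions."""
    delay_s = instructions / (mips * 1e6)
    return watts * delay_s, delay_s


INSTRUCTIONS = 1e9  # hypothetical fixed unit of work

# Two hypothetical design points: (MIPS, watts).
design_points = [(1000.0, 10.0), (4000.0, 100.0)]

for mips, watts in design_points:
    mpw, m2pw, _ = efficiency_metrics(mips, watts)
    energy, delay = energy_and_delay(mips, watts, INSTRUCTIONS)
    # Both products depend only on INSTRUCTIONS, not on the design point,
    # so each metric is the reciprocal of its energy-based counterpart.
    print(f"{mips:.0f} MIPS @ {watts:.0f} W: "
          f"E*(MIPS/W)={energy * mpw:.1f}  E*D*(MIPS^2/W)={energy * delay * m2pw:.1f}")
```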



Raw Data
for Four Generations of Intel Microprocessors
[Ed Grochowski, 2006]

Method:
• Compare pair of processors at same process, voltage, and time
• Compute the performance ratio of the pair
• Multiply ratios together across uarch generations
• Repeat calculation for power
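
The chaining step in the method above amounts to a running product of pairwise ratios. The sketch below uses invented ratios purely to show the mechanics; they are not the lecture's raw data.

```python
from itertools import accumulate
from operator import mul


def chain_ratios(pairwise_ratios):
    """Turn generation-to-generation ratios into values relative to the first generation."""
    return list(accumulate(pairwise_ratios, mul, initial=1.0))


# Hypothetical pairwise ratios (each measured at matched process and voltage).
perf_ratios = [1.5, 2.0, 1.2]
power_ratios = [2.0, 2.5, 1.5]

relative_perf = chain_ratios(perf_ratios)
relative_power = chain_ratios(power_ratios)

for generation, (perf, power) in enumerate(zip(relative_perf, relative_power)):
    print(f"generation {generation}: perf x{perf:.2f}, power x{power:.2f}")
```

Comparing each pair at a matched process and voltage keeps manufacturing improvements out of the ratios, so the running product isolates the microarchitectural contribution.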



Power/Performance (EPI) Evolution
[Ed Grochowski, 2006]
Power = EPI × IPC × Frequency

[Figure: Power versus Scalar Performance for i486, Pentium, Pentium Pro, Pentium 4 (Willamette), Pentium 4 (Cedarmill), Pentium M (Banias, Dothan), Core Duo (Yonah), and Merom; fitted curve Power = Performance^1.74; annotations EPI = 10 nJ and EPI = 48 nJ.]

Intel Microprocessors     EPI (nJ), 65nm at 1.33V
i486                      10
Pentium                   14
Pentium Pro               24
Pentium 4 (WMT)           38
Pentium 4 (CDM)           48
Pentium M (Banias)        13
Pentium M (Dothan)        15
Core Duo (Yonah)          11
Core Duo (Merom)          10

Power: single core power (relative to i486 baseline)
Performance: SPECint performance (relative to i486 baseline)
EPI: average energy spent per instruction (in nanojoules)
Power and Throughput Performance

• Assume a large-scale CMP with potentially many cores.
• Replication of cores results in proportional increases to both throughput performance and power (hopefully).

[Figure: Relative Power versus Relative Performance for i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), and Pentium 4 (Psc). Scalar/latency performance follows power = perf^1.74; throughput performance is drawn as power = perf^1.0 (?).]

[Figure: EPI scale running from 10 nJ through 1 nJ and 0.1 nJ to 0.01 nJ (low EPI), with CPU cores, programmable accelerators, and fixed-function units placed along it.]
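
A back-of-the-envelope comparison of the two scaling curves, taking the 1.74 exponent from the fitted curve and treating the linear throughput curve as an assumption (the slide itself marks it with a question mark). The function names are made up for illustration.

```python
def scalar_power(relative_perf, exponent=1.74):
    """Relative power to reach a given scalar performance on the fitted curve."""
    return relative_perf**exponent


def throughput_power(core_count):
    """Relative power for replicated cores, assuming power = perf^1.0.

    This linear model is an optimistic assumption; it ignores interconnect,
    shared caches, and any workload that does not scale across cores.
    """
    return float(core_count)


target = 4  # want 4x the baseline performance

one_big_core = scalar_power(target)
four_small_cores = throughput_power(target)

print(f"one core at {target}x scalar performance: {one_big_core:.2f}x power")
print(f"{target} baseline cores (throughput):      {four_small_cores:.2f}x power")
```

Under these assumptions a single core pushed to 4x scalar performance costs roughly 11x the power, against 4x for four baseline cores, provided the workload has the thread-level parallelism to use them.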
So Far, Single Flow of Control…Next, Multiple Threads
