L11 Summary

EE382A Lecture 11:
Superscalar Summary
Department of Electrical Engineering

Stanford University
http://eeclass.stanford.edu/ee382a
EE382A – Autumn 2009 Lecture 11 - 1 John P Shen
Announcement
• EXAM coming up on Friday November 13, 9:00am-12:00noon

– All lectures included

HW/SW Design Space for ILP
[B. Rau & J. Fisher, 1993]
[Figure: division of ILP work between compiler and hardware. Tasks in order: front end & optimizer, determine dependences, determine independences, bind resources, execute. Architecture classes shown: sequential (superscalar), dependence (dataflow), independence (VLIW; attached array processor).]
“Iron Law” of Processor Performance
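$$\frac{1}{\text{Processor Performance}} = \frac{\text{Time}}{\text{Program}} = \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{path length}} \times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{CPI}} \times \underbrace{\frac{\text{Time}}{\text{Cycle}}}_{\text{cycle time}}$$
$$\text{Processor Performance} = \frac{\text{IPC} \times \text{GHz}}{\text{PathLength}}$$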
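As a quick numeric check of the Iron Law, the sketch below multiplies the three factors and confirms that performance is the reciprocal of execution time. The function names and the workload numbers are illustrative assumptions, not values from the lecture.

```python
def execution_time(path_length, cpi, cycle_time_s):
    """Time/program = instructions/program * cycles/instruction * time/cycle."""
    return path_length * cpi * cycle_time_s


def performance(path_length, ipc, freq_hz):
    """Performance = IPC * frequency / path length (programs per second)."""
    return ipc * freq_hz / path_length


# Hypothetical workload: 1e9 instructions, CPI of 1.25, 2 GHz clock.
t = execution_time(1e9, 1.25, 1 / 2e9)
p = performance(1e9, 1 / 1.25, 2e9)
print(f"time = {t:.3f} s, performance = {p:.3f} programs/s")
```

Because IPC is 1/CPI and frequency is 1/cycle time, the two functions are reciprocals of each other for the same inputs.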

Microprocessor Performance Evolution
$$\text{Performance} = \frac{\text{IPC} \times \text{Frequency}}{\text{PathLength}}$$
Frequency vs. Parallelism
• Increase Frequency (GHz)

– Deeper Pipelines
– Increased Overall Latency
– Lower IPC
• Increase Instruction Parallelism (IPC)

– Wider Pipelines
– Increased Complexity
– Lower GHz
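The frequency side of this tradeoff can be sketched with a toy model in which a deeper pipeline raises the clock rate but also lengthens the branch mispredict penalty, so IPC drops. The CPI model (base CPI plus a mispredict stall term) and every number below are assumptions chosen for illustration only.

```python
def cpi_with_mispredicts(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """CPI including a stall term for mispredicted branches (assumed model)."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def relative_performance(freq_ghz, cpi):
    """For a fixed path length, performance is proportional to IPC * frequency."""
    return freq_ghz / cpi


# Hypothetical shallow design: 2 GHz, 10-cycle mispredict penalty.
shallow = relative_performance(2.0, cpi_with_mispredicts(1.0, 0.2, 0.05, 10))
# Hypothetical deep design: 3 GHz, 20-cycle mispredict penalty.
deep = relative_performance(3.0, cpi_with_mispredicts(1.0, 0.2, 0.05, 20))

print(f"frequency gain   = {3.0 / 2.0:.3f}x")
print(f"performance gain = {deep / shallow:.3f}x")
```

Under these assumed numbers the performance gain is smaller than the frequency gain, because part of the extra clock rate is spent on the longer penalty.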

Deeper and Wider Pipelines
[Figure: a shallow pipeline (Fetch, Decode, Dispatch, Execute, Memory, Retire) beside a deeper and wider one with the same stages, with the branch mispredict penalty marked.]

Front-End Pipe-Depth Penalty
[Figure: pipeline (Fetch, Decode, Dispatch, Execute, Memory, Retire) before and after front-end contraction, with an Optimize stage added for back-end optimization.]

Alleviate Pipe-Depth Penalty
• Front-End Contraction
– Code Re-mapping and Caching
– Trace Construction, Caching, Optimization
– Leverage Back-End Optimizations
• Back-End Optimization
– Multiple-Branch, Trace, Stream, Prediction
– Code Reordering, Alignment, Optimization
– Pre-decode, Pre-rename, Pre-scheduling
– Memory Pre-fetch Prediction and Control

Execution Core Improvement
[Figure: pipeline stages Fetch, Decode, Dispatch, Execute, Memory, Retire, Optimize]
• Super-pipelined ALU design
• Very high-speed arithmetic units
• Speculative OoO execution
• Criticality-based data caching
• Aggressive data pre-fetching

Trends
• Moore’s Law for device integration

• Chip power consumption
• Single-thread performance trend [source: Intel]
Power Density
[Hu et al, MICRO ’03 tutorial]
• Power density increasing exponentially

– Power delivery, packaging, thermal implications
– Thermal effects on leakage, delay, reliability, etc.
Dynamic Power
$$P_{dyn} \approx \sum_{i \in \text{units}} C_i V_i^2 A_i f$$
• Static CMOS: current flows when active
– Combinational logic evaluates new inputs
– Flip-flop, latch captures new value (clock edge)
• Terms
– C: capacitance of circuit
• wire length, number and size of transistors
– V: supply voltage
– A: activity factor
– f: frequency
• Future: Fundamentally power-constrained
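The dynamic power formula can be evaluated directly by summing C·V²·A·f over the units of a design. The sketch below does that for two made-up units and then lowers the supply voltage by 10% at the same frequency; the unit capacitances, activity factors, and clock rate are assumptions for illustration.

```python
def dynamic_power(units, freq_hz):
    """Sum C * V^2 * A * f over units.

    Each unit is a tuple (capacitance in farads, supply voltage in volts,
    activity factor between 0 and 1).
    """
    return sum(c * v ** 2 * a * freq_hz for c, v, a in units)


# Hypothetical units: a busy execution block and a mostly idle cache array.
units = [
    (1e-9, 1.0, 0.5),
    (2e-9, 1.0, 0.1),
]

p_nominal = dynamic_power(units, 2e9)
# Same design with every supply voltage scaled to 90%, frequency unchanged.
p_low_v = dynamic_power([(c, 0.9 * v, a) for c, v, a in units], 2e9)

print(f"nominal: {p_nominal:.3f} W")
print(f"0.9 x V: {p_low_v:.3f} W ({p_low_v / p_nominal:.2f} of nominal)")
```

The ratio printed on the last line is 0.9² regardless of the unit mix, since V enters the formula squared.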
Reducing Dynamic Power
$$P_{dyn} \approx C V^2 A f$$
• Reduce capacitance
– Simpler, smaller design (yeah right)
– Reduced IPC
• Reduce activity
– Smarter design
– Reduced IPC
• Reduce frequency
– Often in conjunction with reduced voltage
• Reduce voltage
– Biggest hammer due to quadratic effect, widely employed
– Can be static (binning/sorting of parts), and/or
– Dynamic (power modes)
• E.g. Transmeta LongRun, AMD PowerNow!, Intel SpeedStep

Frequency/Voltage Scaling
• Voltage/frequency scaling rule of thumb:

– +/- 1% performance buys -/+ 3% power (3:1 rule)
• Hence, any power-saving technique that saves less than 3% power
per 1% of performance lost is uninteresting
• Example 1:
– New technique saves 12% power
– However, performance degrades 5%
– Useless, since 12 < 3 x 5
– Instead, reduce f by 5% (also V), and get 15% power savings
• Example 2:
– New technique saves 5% power
– Performance degrades 1%
– Useful, since 5 > 3 x 1
• Does this rule always hold?
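The 3:1 rule of thumb reduces to a one-line comparison, sketched below and applied to the two examples from this slide. The function name and the `ratio` parameter are illustrative; the 3.0 default is the slide's rule of thumb, not a universal constant.

```python
def beats_voltage_scaling(power_saving_pct, perf_loss_pct, ratio=3.0):
    """True if a technique saves more power than plain V/f scaling would
    for the same performance loss (3% power per 1% performance by default)."""
    return power_saving_pct > ratio * perf_loss_pct


# Example 1: saves 12% power, costs 5% performance.
example_1 = beats_voltage_scaling(12, 5)
# Example 2: saves 5% power, costs 1% performance.
example_2 = beats_voltage_scaling(5, 1)

print(f"example 1 useful: {example_1}")
print(f"example 2 useful: {example_2}")
```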
Leakage Power (Static/DC)
[Figure: transistor with source, gate, and drain terminals]
• Transistors aren’t perfect on/off switches
• Even in static CMOS, transistors leak
– Channel (source/drain) leakage
– Gate leakage through insulator
• High-K dielectric replacing SiO2 will help
• Leakage compounded by
– Low threshold voltage
• Low Vth => fast switching, more leakage
• High Vth => slow switching, less leakage
– Higher temperature
• Temperature increases with power
• Power increases with C, V², A, f
• Rough approximation: leakage proportional to area
– Transistors aren’t free
• Huge problem in future technologies
– Estimates are 40%-50% of total power
Circuit-Level Techniques
• Multiple voltages
– Realize non-critical circuits with slower transistors
– Voltage islands: Vdd and Vth are lower
• Problem: supplying multiple Vdd
• Multiple frequencies
– Globally Asynchronous Locally Synchronous (GALS)
• Exploiting safety margins
– Average case vs. worst case design
– Razor latch [UMichigan]:
• Sample latch input twice, then compare, recover
• Body biasing
– Reduce leakage by adapting Vth

Architectural Techniques
• Clock gating (dynamic power)

– 70% of dynamic power in IBM Power5 [Jacobson et al., HPCA 04]
– Inhibit clock for
• Functional block
• Pipeline stage
• Pipeline register (sub-stage)
– Widely used in real designs today
– Control overhead, timing complexity (violates FSD rules)
• Power gating (leakage power)
– (Big) sleep transistor cuts off ground path
– Apply to FU, cache subarray, even entire core in CMP

Architectural Techniques
• Cache reconfiguration (leakage power)

– Not all applications or phases require full L1 cache capacity
– Power gate portions of cache memory
– Complicates a critical path (L1 cache access)
– Does not apply to lower level caches
• Heterogeneous cores [Kumar et al., MICRO-36]
– Prior-generation simple core consumes small fraction of die area
– Use simple core to run low-ILP workloads
• And many others…check proceedings of
– ISLPED, MICRO, ISCA, HPCA, ASPLOS, PACT

Power vs. Energy
• Energy: integral of power (area under the curve)

– Energy & power driven by different design constraints
• Power issues:
– Power delivery (supply current @ right voltage)
– Thermal (don’t fry the chip)
– Reliability effects (chip lifetime)
• Energy issues:
– Limited energy capacity (battery)
– Efficiency (work per unit energy)
• Different usage models drive tradeoffs
[Figure: power plotted against time; energy is the area under the curve]

Power vs. Energy
• With constant time base, two are “equivalent”

– 10% reduction in power => 10% reduction in energy
• Once time changes, must treat as separate metrics
– E.g. reduce frequency to save power => reduce performance
=> increase time to completion => consume more energy
(perhaps)
• Metric: energy-delay product per unit of work
– Tries to capture both effects
– Others advocate energy-delay²
– Best to consider all
• Plot performance (time), energy, ED, ED²
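To see why power and energy separate once time changes, the sketch below computes energy, energy-delay, and energy-delay² for a baseline run and for the same work at 80% frequency. It assumes dynamic power only, voltage held fixed (so power scales linearly with f), and run time scaling as 1/f; all numbers are hypothetical.

```python
def metrics(power_w, time_s):
    """Return energy (J), energy-delay (J*s), and energy-delay^2 (J*s^2)."""
    energy = power_w * time_s
    return {"energy": energy, "ed": energy * time_s, "ed2": energy * time_s ** 2}


# Hypothetical baseline: 10 W for 1 s.
base = metrics(10.0, 1.0)

# Frequency cut to 80% with voltage unchanged (assumed model):
# power drops to 80%, run time grows by 1/0.8.
scale = 0.8
slow = metrics(10.0 * scale, 1.0 / scale)

for name in ("energy", "ed", "ed2"):
    print(f"{name}: base = {base[name]:.3f}, slowed = {slow[name]:.3f}")
```

Under this model the 20% power reduction saves no energy at all, and both delay-weighted metrics get worse. With leakage included, the longer run time would add energy, which is one way the "(perhaps)" above can come true.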

Performance, Power, and Energy
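$$\text{Performance} = \frac{\text{IPC} \times \text{Frequency}}{\text{PathLength}} = \frac{\frac{\text{inst}}{\text{cycle}} \times \frac{\text{cycle}}{\text{sec}}}{\text{PathLength}} = \frac{\text{IPS}}{\text{PathLength}}$$
$$\frac{\text{Performance}}{\text{Power}} = \frac{\text{IPS}}{\text{Watt}} = \frac{\text{Inst}}{\text{Watt} \times \text{Sec}} = \frac{\text{Inst}}{\text{Joule}} = \frac{1}{\text{EPI}}$$
$$\frac{\text{Power}}{\text{Performance}} = \frac{\text{Joule}}{\text{Inst}} = \text{EPI}$$
$$\text{Power} = \text{EPI} \times \text{Performance} = \text{EPI} \times \text{IPC} \times \text{Frequency}$$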
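The last relation, Power = EPI × IPC × Frequency, and its inverse can be checked with a few lines of arithmetic. The helper names and the core parameters below are assumptions for illustration; performance is taken as instructions per second.

```python
def power_watts(epi_j, ipc, freq_hz):
    """Power = EPI * IPC * Frequency (J/inst * inst/cycle * cycles/s)."""
    return epi_j * ipc * freq_hz


def epi_joules(power_w, ipc, freq_hz):
    """EPI = Power / Performance, with performance in instructions per second."""
    return power_w / (ipc * freq_hz)


# Hypothetical core: 10 nJ per instruction, IPC of 1.5, 2 GHz clock.
p = power_watts(10e-9, 1.5, 2e9)
print(f"power = {p:.1f} W")
print(f"EPI recovered from power = {epi_joules(p, 1.5, 2e9) * 1e9:.1f} nJ")
```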

Estimating Energy Per Instruction
[Ed Grochowski, 2006]
• Think of the microprocessor as a capacitor
– Charged or discharged with every instruction processed
– Ignore leakage current and short-circuit switching current
• Apply capacitor formula: $E = \tfrac{1}{2} C V^2$
– E = energy expended per instruction (from fetch to retirement)
– C = switching capacitance per instruction (equal to activity factor multiplied by total capacitance)
– V = supply voltage
• Energy per instruction depends on only two things
– Amount of capacitance toggled to execute an instruction
– Supply voltage
[Figure: capacitor C at supply voltage V; each charge or discharge is labeled ½·C·V²]
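The capacitor model gives a two-parameter estimate of energy per instruction. The sketch below evaluates E = ½·C·V² and shows how EPI responds to a voltage change; the switched capacitance and voltages are made-up values, and leakage and short-circuit current are ignored as in the model above.

```python
def energy_per_instruction(switched_capacitance_f, vdd):
    """E = 1/2 * C * V^2, with C the capacitance toggled per instruction."""
    return 0.5 * switched_capacitance_f * vdd ** 2


# Hypothetical core: 10 nF toggled per instruction.
epi_nominal = energy_per_instruction(10e-9, 1.0)
epi_low_v = energy_per_instruction(10e-9, 0.9)

print(f"EPI at 1.0 V: {epi_nominal * 1e9:.2f} nJ")
print(f"EPI at 0.9 V: {epi_low_v * 1e9:.2f} nJ")
```

Halving the toggled capacitance halves EPI, while a 10% voltage reduction cuts it by about 19%, which matches the two levers named in the last bullet.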
Power Efficiency Metrics
• MIPS/watt
– Common measure of power efficiency
– Equivalent to energy per instruction
– Independent of time
$$\frac{\text{MIPS}}{\text{Watt}} = \frac{\text{Instructions}/\text{Second}}{\text{Joules}/\text{Second}} = \frac{\text{Instructions}}{\text{Joule}}$$
• MIPS²/watt
– Equivalent to (energy × delay) product
– Common metric for comparing logic families
• MIPS³/watt
– Equivalent to (energy × delay²)
– Assigns increasing weight to time
– Appropriate metric for latency performance
[Figure: three bar charts of the normalized metric (MIPS/watt, MIPS²/watt, MIPS³/watt) for 486, P5, P6, and Pentium 4]
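The link between the MIPSⁿ/watt family and the energy-delay family can be verified numerically. For a fixed amount of work, MIPSⁿ/watt is proportional to 1 / (energy × delayⁿ⁻¹), so a higher MIPSⁿ/watt means a lower energy-delay figure. The sketch below checks that identity; the instruction count, delay, and energy are hypothetical.

```python
def mips_n_per_watt(instructions, delay_s, energy_j, n):
    """MIPS^n / watt for a run of `instructions` taking `delay_s` and `energy_j`."""
    mips = instructions / delay_s / 1e6
    watts = energy_j / delay_s
    return mips ** n / watts


def inverse_energy_delay(instructions, delay_s, energy_j, n):
    """(millions of instructions)^n / (energy * delay^(n-1))."""
    return (instructions / 1e6) ** n / (energy_j * delay_s ** (n - 1))


# Hypothetical run: 2e9 instructions in 2 s using 20 J.
N, D, E = 2e9, 2.0, 20.0
for n in (1, 2, 3):
    direct = mips_n_per_watt(N, D, E, n)
    via_ed = inverse_energy_delay(N, D, E, n)
    print(f"MIPS^{n}/watt = {direct:.4g}, via energy-delay form = {via_ed:.4g}")
```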

Raw Data for Four Generations of Intel Microprocessors
Method
• Compare pair of processors at same process, voltage, and time

• Compute the performance ratio of the pair
• Multiply ratios together across uarch generations
• Repeat calculation for power

Power/Performance (EPI) Evolution
$$\text{Power} = \text{EPI} \times \text{IPC} \times \text{Frequency}$$
[Figure: power (0 to 50) vs. scalar performance (0 to 10) for Intel microprocessors, with the fitted curve Power = Performance^1.74. Points: i486, Pentium, Pentium Pro, Pentium 4 (Willamette), Pentium 4 (Cedarmill), Pentium M (Banias, Dothan), Core Duo (Yonah), Merom. Annotations: EPI = 48 nJ at Pentium 4 (Cedarmill); EPI = 10 nJ.]

Intel Microprocessors    EPI (nJ), 65nm at 1.33v
i486                     10
Pentium                  14
Pentium Pro              24
Pentium 4 (WMT)          38
Pentium 4 (CDM)          48
Pentium M (Banias)       13
Pentium M (Dothan)       15
Core Duo (Yonah)         11
Core Duo (Merom)         10

Power: single core power (relative to i486 baseline)

Performance: SPECint performance (relative to i486 baseline)
EPI: average energy spent per instruction (in nano-joules)
Power and Throughput Performance
• Assume a large-scale CMP with potentially many cores.
• Replication of cores results in proportional increases to both throughput performance and power (hopefully).
[Figure: relative power (0 to 30) vs. relative performance (0 to 8). Scalar/latency performance follows power = perf^(1.74) through i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), and Pentium 4 (Psc); throughput performance follows power = perf^(1.0)?, toward low EPI.]
EPI scale, from CPU cores through programmable accelerators to fixed-function units: 10 nJ, 1 nJ, 0.1 nJ, 0.01 nJ
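The gap between the two curves, power = perf^1.74 for scalar performance and power = perf^1.0 for throughput, can be put in numbers. The sketch below compares the relative power needed to reach a 4x performance target by scaling up one core against replicating four baseline cores. It assumes ideal linear throughput scaling with core count, which the slide itself marks with "(hopefully)" and a question mark.

```python
def scalar_power(rel_perf, exponent=1.74):
    """Relative power of one core scaled up to `rel_perf`, using the fitted exponent."""
    return rel_perf ** exponent


def throughput_power(n_cores, core_power=1.0):
    """Relative power of `n_cores` replicated baseline cores (ideal scaling assumed)."""
    return n_cores * core_power


target = 4.0
single = scalar_power(target)
replicated = throughput_power(4)

print(f"one core at {target:.0f}x scalar performance: {single:.2f}x power")
print(f"four baseline cores at {target:.0f}x throughput: {replicated:.2f}x power")
```

The comparison only holds for workloads with enough thread-level parallelism to use all the cores; single-thread latency does not improve by replication.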
So Far, Single Flow of Control…Next, Multiple Threads
