
ECE 552: Performance


Prof. Natalie Enright Jerger


Lecture notes based on slides created by Amir Roth of University of
Pennsylvania with sources that included University of Wisconsin slides by
Mark Hill, Guri Sohi, Jim Smith, and David Wood.
Lecture notes enhanced by Milo Martin, Mark Hill, and David Wood with
sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith,
Sohi, Vijaykumar, and Wood

Before we start...
To have a meaningful discussion
about modern architectures
Must discuss metrics to evaluate them
Discuss how metrics are impacted by Moore's Law

Moore's Law: devices per chip double
every 18-24 months

Empirical Evaluation
Metrics
Performance


Cost
Power
Reliability

Often more important in combination than
individually
Performance/cost (MIPS/$)
Performance/power (MIPS/W)

Basis for
Design decisions
Purchasing decisions

Performance
Performance metrics
Latency
Throughput

Reporting performance
Benchmarking and averaging
CPU performance equation and
performance trends
Two definitions
Latency (execution time): time to finish a fixed task

Throughput (bandwidth): number of tasks completed per unit time

Very different: throughput can exploit parallelism,
latency cannot
Often contradictory
Choose definition that matches goals (most frequently
throughput)

Latency/Throughput Example
Example: move people from A to B, 10 miles
Car: capacity = 5, speed = 60 miles/hour
Bus: capacity = 60, speed = 20 miles/hour
Latency:
Car: 10 miles / 60 mph = 10 minutes; Bus: 10 miles / 20 mph = 30 minutes

Throughput (counting the return trip):
Car: 5 people per 20-minute round trip = 15 people/hour
Bus: 60 people per 60-minute round trip = 60 people/hour

Performance Improvement
Processor A is X times faster than processor B if
Latency(P,A) = Latency(P,B) / X
Throughput(P,A) = X × Throughput(P,B)

Processor A is X% faster than processor B if

Latency(P,A) = Latency(P,B) / (1 + X/100)

Throughput(P,A) = (1 + X/100) × Throughput(P,B)

Car/bus example
Latency? The car is 3 times faster than the bus (10 min vs. 30 min)
Throughput? The bus is 4 times faster than the car (60 vs. 15 people/hour)

What Is P in Latency(P,A)?
Program
Latency(A) alone makes no sense; a processor executes
some program
But which one?

Actual target workload?

Some representative benchmark program(s)?


Some small kernel benchmarks (micro-benchmarks)

Adding/Averaging Performance Numbers


You can add latencies, but not throughputs
Latency(P1+P2, A) = Latency(P1,A) + Latency(P2,A)
Throughput(P1+P2,A) != Throughput(P1,A) +
Throughput(P2,A)
1 km @ 30 kph + 1 km @ 90 kph
Average speed is not (30 + 90)/2 = 60 kph

0.033 hours at 30 kph + 0.011 hours at 90 kph
= 2 km in 0.044 hours = 45 kph

Throughput(P1+P2,A) =
2 / [(1/Throughput(P1,A)) + (1/Throughput(P2,A))]

Same goes for means (averages)


Arithmetic: (1/N) × Σ_{P=1..N} Latency(P)

Harmonic: N / Σ_{P=1..N} (1/Throughput(P))

Geometric: (Π_{P=1..N} Speedup(P))^(1/N)
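The kph example can be checked directly; a minimal Python sketch (the function names are ours, not from the slides):

```python
# Why rates need the harmonic mean: driving 1 km at 30 kph and then
# 1 km at 90 kph does NOT average out to 60 kph.

def arithmetic_mean(rates):
    return sum(rates) / len(rates)

def harmonic_mean(rates):
    # N divided by the sum of reciprocals: the right mean for throughputs
    return len(rates) / sum(1.0 / r for r in rates)

speeds = [30.0, 90.0]                 # kph over two equal 1 km legs
total_time = 1.0 / 30.0 + 1.0 / 90.0  # hours for the whole 2 km trip
true_average = 2.0 / total_time       # distance / time

print(arithmetic_mean(speeds))            # 60.0, which is misleading
print(round(harmonic_mean(speeds), 6))    # 45.0, matching the true average
print(round(true_average, 6))             # 45.0
```

The harmonic mean reproduces the true average speed because equal distances, not equal times, are spent at each rate.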

CPU Performance Equation


Multiple aspects to performance: helps to
isolate them
Latency(P,A) = seconds / program =
(instructions / program) × (cycles / instruction) × (seconds / cycle)

Instructions / program: dynamic instruction count

Cycles / instruction: CPI

Seconds / cycle: clock period

For low latency (better performance) minimize all three

Hard: they often pull against each other
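The three terms multiply out directly; a small sketch with made-up numbers (not from the slides):

```python
# Iron law: seconds/program = (insns/program) × (cycles/insn) × (seconds/cycle)

def latency_seconds(insn_count, cpi, clock_hz):
    # seconds/cycle is the reciprocal of the clock frequency
    return insn_count * cpi / clock_hz

# 1 billion dynamic instructions, CPI of 2, 1 GHz clock:
print(latency_seconds(1e9, 2.0, 1e9))   # 2.0 seconds
```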


Cycles per Instruction (CPI)

This course is mostly about improving CPI


Cycles per instruction for the average instruction
IPC = 1/CPI
Different instructions have different cycle costs
E.g., integer add typically takes 1 cycle, FP divide takes > 10

Assumes you know something about insn frequencies

CPI Example
A program executes equal integer, FP, and
memory operations
Cycles per instruction type:
integer = 1, memory = 2, FP = 3

What is the CPI? (1/3) × 1 + (1/3) × 2 + (1/3) × 3 = 2


Caveat: this sort of calculation ignores
dependences completely
Back-of-the-envelope arguments only

Measuring CPI
How are CPI & execution-time actually measured?

Execution time: the Unix time command (wall clock + CPU + system)


CPI = (CPU time × clock frequency) / dynamic insn count
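CPI can be back-solved from three measurable quantities; a minimal sketch (the numbers are hypothetical):

```python
def measured_cpi(cpu_time_s, clock_hz, dyn_insn_count):
    # CPI = total cycles / total instructions = (time × frequency) / insns
    return cpu_time_s * clock_hz / dyn_insn_count

# 4 s of CPU time at 1 GHz over 2 billion dynamic instructions:
print(measured_cpi(4.0, 1e9, 2e9))   # 2.0
```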


How is dynamic instruction count measured?
Want CPI breakdowns (CPI_CPU, CPI_MEM, etc.) to see what to fix

CPI breakdowns
Hardware event counters
Calculate CPI using counter frequencies/event costs
Cycle-level micro-architecture simulation (e.g., SimpleScalar)
+ Measures breakdown exactly (provided the model is accurate)
+ Models micro-architecture faithfully
+ Runs realistic workloads (some)

Method of choice for many micro-architects (and you)

Improving CPI

This course is more about improving CPI than frequency


Historically, clock accounts for 70%+ of performance
improvement
Achieved via deeper pipelines

This has changed


Deep pipelining is not power efficient
Physical speed limits are approaching
1GHz: 1999, 2GHz: 2001, 3GHz: 2002, 3.8GHz: 2004,
5GHz: 2008
Intel Core 2: 1.8-3.2GHz: 2008

Techniques we will look at


Caching, speculation, multiple issue, out-of-order issue,
multiprocessing, more

Moore helps because CPI reduction requires
transistors

Parallelism, by definition, requires more transistors
But the best example is caches

Another CPI Example

Assume a processor with insn frequencies and costs

Integer ALU: 50%, 1 cycle


Load: 20%, 5 cycles
Store: 10%, 1 cycle
Branch: 20%, 2 cycles

Which change would improve performance more?


A. Branch prediction to reduce branch cost to 1 cycle?
B. A bigger data cache to reduce load cost to 3 cycles?

Compute CPI
Base = 0.5×1 + 0.2×5 + 0.1×1 + 0.2×2 = 2.0
A = 0.5×1 + 0.2×5 + 0.1×1 + 0.2×1 = 1.8
B = 0.5×1 + 0.2×3 + 0.1×1 + 0.2×2 = 1.6 → B improves performance more
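The weighted sums can be sketched in a few lines of Python (the helper and option names are ours; frequencies and cycle counts are from the example):

```python
# Average CPI = sum over instruction classes of frequency × cycles.
# Frequencies are kept as integer percentages to avoid float noise.

def cpi(mix):
    return sum(pct * cycles for pct, cycles in mix.values()) / 100

base  = {"alu": (50, 1), "load": (20, 5), "store": (10, 1), "branch": (20, 2)}
opt_a = dict(base, branch=(20, 1))   # branch prediction: branches cost 1 cycle
opt_b = dict(base, load=(20, 3))     # bigger data cache: loads cost 3 cycles

print(cpi(base))   # 2.0
print(cpi(opt_a))  # 1.8
print(cpi(opt_b))  # 1.6 -> the cache change (B) helps more
```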

CPI Example 3

Operation   Frequency   Cycles
ALU         45%
Load        20%
Store       15%
Branch      20%

You can reduce store to 1 cycle


But slow down clock by 20%

Old CPI =
New CPI =
Speedup = Old time/New time

Now, if ALU ops were 2 cycles originally and stores
were 1 cycle,
and you could reduce ALU to 1 cycle, while slowing
down the clock by 20%,
This optimization:

Example of Amdahl's Law: you don't want to
speed up a small fraction to the detriment of the rest
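The trade-off can be sketched numerically. The slide leaves the cycle counts blank, so the values below are assumed purely for illustration; the structure of the calculation is what matters:

```python
# Reducing one class's cycle cost while the clock slows by 20%.
# The (frequency, cycles) pairs are ASSUMED values, not from the slide.

def exec_time(mix, cycle_time):
    # relative time = CPI × cycle time (instruction count cancels out)
    return sum(f * c for f, c in mix) * cycle_time

old = [(0.45, 1), (0.20, 2), (0.15, 2), (0.20, 2)]   # ALU, Load, Store, Branch
new = [(0.45, 1), (0.20, 2), (0.15, 1), (0.20, 2)]   # store reduced to 1 cycle

speedup = exec_time(old, 1.0) / exec_time(new, 1.2)  # clock now 20% slower
print(round(speedup, 3))   # ≈ 0.923: less than 1, a net slowdown
```

Under these assumed numbers the "optimization" loses: the store savings apply to 15% of instructions, but the slower clock taxes all of them.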


Performance Rule of Thumb:
Amdahl's Law

f = fraction that can run in parallel (be sped up)

1-f = fraction that must run serially

[Figure: execution timelines with 1 CPU and with N CPUs: the parallel fraction f shrinks with more CPUs, the serial fraction 1-f does not]

Speed-up = 1 / [(1-f) + f/N]


[Figure: speedup vs. number of cores, with axes up to 16 and up to 64 cores; curves bend away from ideal scaling as the core count grows]
Pretty good ideal scaling for a modest number of cores


Large number of cores require a lot of parallelism
Amdahl's Law is not just about parallelism
f = fraction that you can speed up
Your performance will always be limited by the (1-f) part
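The limit is easy to see numerically; a minimal sketch of the formula above (the function name is ours):

```python
# Amdahl's Law: only fraction f benefits from an n-fold speedup,
# so the overall gain is capped by the serial (1-f) part.

def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# Even 95% parallel code is capped at 1/(1-f) = 20x, no matter how many cores:
print(round(amdahl_speedup(0.95, 64), 2))     # ≈ 15.42 on 64 cores
print(round(amdahl_speedup(0.95, 10**9), 2))  # ≈ 20.0 in the limit
```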

Summary
Latency = seconds/program =
(instructions/program) * (cycles/instruction) * (seconds/cycle)

Instructions/program: dynamic instruction count


Function of program, compiler, instruction set architecture (ISA)

Cycles/instruction: CPI


Function of program, compiler, ISA, micro-architecture

Seconds/cycle: clock period


Function of micro-architecture, technology parameters

To improve performance, optimize each component


Focus mostly on CPI in this course

Other Metrics
Will (try to) come back to
Cost
Power
Reliability

Interested in learning more?

Grad school
