
ECE 552: Performance


Prof. Natalie Enright Jerger


Lecture notes based on slides created by Amir Roth of University of
Pennsylvania with sources that included University of Wisconsin slides by
Mark Hill, Guri Sohi, Jim Smith, and David Wood.
Lecture notes enhanced by Milo Martin, Mark Hill, and David Wood with
sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith,
Sohi, Vijaykumar, and Wood

Before we start...
To have a meaningful discussion
about modern architectures
Must discuss metrics to evaluate them
Discuss how metrics are impacted by Moore's Law

Moore's Law: devices per chip double
every 18-24 months

Empirical Evaluation
Metrics
Performance


Cost
Power
Reliability

Often more important in combination than
individually
Performance/cost (MIPS/$)
Performance/power (MIPS/W)

Basis for
Design decisions
Purchasing decisions

Performance
Performance metrics
Latency
Throughput

Reporting performance
Benchmarking and averaging
CPU performance equation and
performance trends
Two definitions
Latency (execution time): time to finish a fixed task

Throughput (bandwidth): number of tasks completed per unit time

Very different: throughput can exploit parallelism,
latency cannot
Often contradictory
Choose definition that matches goals (most frequently
throughput)

Latency/Throughput Example
Example: move people from A to B, 10 miles
Car: capacity = 5, speed = 60 miles/hour
Bus: capacity = 60, speed = 20 miles/hour
Latency:
Car: 10 miles / 60 mph = 10 minutes; Bus: 10 miles / 20 mph = 30 minutes

Throughput (counting the return trip):
Car: 5 people per 20-minute round trip = 15 people/hour
Bus: 60 people per 60-minute round trip = 60 people/hour

Performance Improvement
Processor A is X times faster than processor B if
Latency(P,A) = Latency(P,B) / X
Throughput(P,A) = X × Throughput(P,B)

Processor A is X% faster than processor B if

Latency(P,A) = Latency(P,B) / (1 + X/100)

Throughput(P,A) = (1 + X/100) × Throughput(P,B)

Car/bus example
Latency? The car is 3 times faster than the bus (10 min vs. 30 min)
Throughput? The bus is 4 times faster than the car (60 vs. 15 people/hour)

What Is P in Latency(P,A)?
Program
Latency(A) alone makes no sense; a processor executes
some program
But which one?

Actual target workload?

Some representative benchmark program(s)?


Some small kernel benchmarks (micro-benchmarks)

Adding/Averaging Performance Numbers


You can add latencies, but not throughputs
Latency(P1+P2, A) = Latency(P1,A) + Latency(P2,A)
Throughput(P1+P2,A) != Throughput(P1,A) +
Throughput(P2,A)
1 km @ 30 kph + 1 km @ 90 kph
Average speed is not (30 + 90)/2 = 60 kph

0.033 hours at 30 kph + 0.011 hours at 90 kph
= 2 km in 0.044 hours = 45 kph

Throughput(P1+P2,A) =
2 / [(1/Throughput(P1,A)) + (1/Throughput(P2,A))]

Same goes for means (averages)


Arithmetic: (1/N) × Σ_{P=1..N} Latency(P)

Harmonic: N / Σ_{P=1..N} (1/Throughput(P))

Geometric: (Π_{P=1..N} Speedup(P))^(1/N)
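The kph example can be checked directly; a minimal Python sketch (the function names are ours, not from the slides):

```python
# Why rates need the harmonic mean: driving 1 km at 30 kph and then
# 1 km at 90 kph does NOT average out to 60 kph.

def arithmetic_mean(rates):
    return sum(rates) / len(rates)

def harmonic_mean(rates):
    # N divided by the sum of reciprocals: the right mean for throughputs
    return len(rates) / sum(1.0 / r for r in rates)

speeds = [30.0, 90.0]                 # kph over two equal 1 km legs
total_time = 1.0 / 30.0 + 1.0 / 90.0  # hours for the whole 2 km trip
true_average = 2.0 / total_time       # distance / time

print(arithmetic_mean(speeds))            # 60.0, which is misleading
print(round(harmonic_mean(speeds), 6))    # 45.0, matching the true average
print(round(true_average, 6))             # 45.0
```

The harmonic mean reproduces the true average speed because equal distances, not equal times, are spent at each rate.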

CPU Performance Equation


Multiple aspects to performance: helps to
isolate them
Latency(P,A) = seconds / program =
(instructions / program) × (cycles / instruction) × (seconds / cycle)

Instructions / program: dynamic instruction count

Cycles / instruction: CPI

Seconds / cycle: clock period

For low latency (better performance) minimize all three

Hard: they often pull against each other
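The three terms multiply out directly; a small sketch with made-up numbers (not from the slides):

```python
# Iron law: seconds/program = (insns/program) × (cycles/insn) × (seconds/cycle)

def latency_seconds(insn_count, cpi, clock_hz):
    # seconds/cycle is the reciprocal of the clock frequency
    return insn_count * cpi / clock_hz

# 1 billion dynamic instructions, CPI of 2, 1 GHz clock:
print(latency_seconds(1e9, 2.0, 1e9))   # 2.0 seconds
```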


Cycles per Instruction (CPI)

This course is mostly about improving CPI


Cycles per instruction for the average instruction
IPC = 1/CPI
Different instructions have different cycle costs
E.g., integer add typically takes 1 cycle, FP divide takes > 10

Assumes you know something about insn frequencies

CPI Example
A program executes equal integer, FP, and
memory operations
Cycles per instruction type:
integer = 1, memory = 2, FP = 3

What is the CPI? (1/3) × 1 + (1/3) × 2 + (1/3) × 3 = 2


Caveat: this sort of calculation ignores
dependences completely
Back-of-the-envelope arguments only

Measuring CPI
How are CPI & execution-time actually measured?

Execution time: the Unix time command (wall clock + CPU + system)


CPI = (CPU time × clock frequency) / dynamic insn count
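CPI can be back-solved from three measurable quantities; a minimal sketch (the numbers are hypothetical):

```python
def measured_cpi(cpu_time_s, clock_hz, dyn_insn_count):
    # CPI = total cycles / total instructions = (time × frequency) / insns
    return cpu_time_s * clock_hz / dyn_insn_count

# 4 s of CPU time at 1 GHz over 2 billion dynamic instructions:
print(measured_cpi(4.0, 1e9, 2e9))   # 2.0
```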


How is dynamic instruction count measured?
Want CPI breakdowns (CPI_CPU, CPI_MEM, etc.) to see what to fix

CPI breakdowns
Hardware event counters
Calculate CPI using counter frequencies/event costs
Cycle-level micro-architecture simulation (e.g., SimpleScalar)
+ Measures breakdown exactly (provided the model is accurate)
+ Models micro-architecture faithfully
+ Runs realistic workloads (some)

Method of choice for many micro-architects (and you)

Improving CPI

This course is more about improving CPI than frequency


Historically, clock accounts for 70%+ of performance
improvement
Achieved via deeper pipelines

This has changed


Deep pipelining is not power efficient
Physical speed limits are approaching
1GHz: 1999, 2GHz: 2001, 3GHz: 2002, 3.8GHz: 2004,
5GHz: 2008
Intel Core 2: 1.8-3.2GHz: 2008

Techniques we will look at


Caching, speculation, multiple issue, out-of-order issue,
multiprocessing, more

Moore helps because CPI reduction requires
transistors

Parallelism, by definition, requires more transistors
But the best example is caches

Another CPI Example

Assume a processor with insn frequencies and costs

Integer ALU: 50%, 1 cycle


Load: 20%, 5 cycles
Store: 10%, 1 cycle
Branch: 20%, 2 cycles

Which change would improve performance more?


A. Branch prediction to reduce branch cost to 1 cycle?
B. A bigger data cache to reduce load cost to 3 cycles?

Compute CPI
Base = 0.5×1 + 0.2×5 + 0.1×1 + 0.2×2 = 2.0
A = 0.5×1 + 0.2×5 + 0.1×1 + 0.2×1 = 1.8
B = 0.5×1 + 0.2×3 + 0.1×1 + 0.2×2 = 1.6 → B improves performance more
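The weighted sums can be sketched in a few lines of Python (the helper and option names are ours; frequencies and cycle counts are from the example):

```python
# Average CPI = sum over instruction classes of frequency × cycles.
# Frequencies are kept as integer percentages to avoid float noise.

def cpi(mix):
    return sum(pct * cycles for pct, cycles in mix.values()) / 100

base  = {"alu": (50, 1), "load": (20, 5), "store": (10, 1), "branch": (20, 2)}
opt_a = dict(base, branch=(20, 1))   # branch prediction: branches cost 1 cycle
opt_b = dict(base, load=(20, 3))     # bigger data cache: loads cost 3 cycles

print(cpi(base))   # 2.0
print(cpi(opt_a))  # 1.8
print(cpi(opt_b))  # 1.6 -> the cache change (B) helps more
```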

CPI Example 3

Operation   Frequency   Cycles
ALU         45%
Load        20%
Store       15%
Branch      20%

You can reduce store to 1 cycle


But slow down clock by 20%

Old CPI =
New CPI =
Speedup = Old time/New time

Now, if ALU ops were 2 cycles originally and stores
were 1 cycle,
and you could reduce ALU to 1 cycle, while slowing
down the clock by 20%,
This optimization:

Example of Amdahl's Law: you don't want to
speed up a small fraction to the detriment of the rest
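The trade-off can be sketched numerically. The slide leaves the cycle counts blank, so the values below are assumed purely for illustration; the structure of the calculation is what matters:

```python
# Reducing one class's cycle cost while the clock slows by 20%.
# The (frequency, cycles) pairs are ASSUMED values, not from the slide.

def exec_time(mix, cycle_time):
    # relative time = CPI × cycle time (instruction count cancels out)
    return sum(f * c for f, c in mix) * cycle_time

old = [(0.45, 1), (0.20, 2), (0.15, 2), (0.20, 2)]   # ALU, Load, Store, Branch
new = [(0.45, 1), (0.20, 2), (0.15, 1), (0.20, 2)]   # store reduced to 1 cycle

speedup = exec_time(old, 1.0) / exec_time(new, 1.2)  # clock now 20% slower
print(round(speedup, 3))   # ≈ 0.923: less than 1, a net slowdown
```

Under these assumed numbers the "optimization" loses: the store savings apply to 15% of instructions, but the slower clock taxes all of them.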


Performance Rule of Thumb:
Amdahl's Law

f = fraction that can run in parallel (be sped up)

1-f = fraction that must run serially

[Figure: execution timelines with 1 CPU and with N CPUs: the parallel fraction f shrinks with more CPUs, the serial fraction 1-f does not]

Speed-up = 1 / [(1-f) + f/N]


[Figure: speedup vs. number of cores, with axes up to 16 and up to 64 cores; curves bend away from ideal scaling as the core count grows]
Pretty good ideal scaling for a modest number of cores


Large number of cores require a lot of parallelism
Amdahl's Law is not just about parallelism
f = fraction that you can speed up
Your performance will always be limited by the (1-f) part
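The limit is easy to see numerically; a minimal sketch of the formula above (the function name is ours):

```python
# Amdahl's Law: only fraction f benefits from an n-fold speedup,
# so the overall gain is capped by the serial (1-f) part.

def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# Even 95% parallel code is capped at 1/(1-f) = 20x, no matter how many cores:
print(round(amdahl_speedup(0.95, 64), 2))     # ≈ 15.42 on 64 cores
print(round(amdahl_speedup(0.95, 10**9), 2))  # ≈ 20.0 in the limit
```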

Summary
Latency = seconds/program =
(instructions/program) * (cycles/instruction) * (seconds/cycle)

Instructions/program: dynamic instruction count


Function of program, compiler, instruction set architecture (ISA)

Cycles/instruction: CPI


Function of program, compiler, ISA, micro-architecture

Seconds/cycle: clock period


Function of micro-architecture, technology parameters

To improve performance, optimize each component


Focus mostly on CPI in this course

Other Metrics
Will (try to) come back to
Cost
Power
Reliability

Interested in learning more?

Grad school
