Events
Some predefined change to system state, for example:
Memory reference
Disk access
Change in a register's state
Network message
Processor interrupt
Copyright 2004 David J. Lilja
Event Classification
Count metrics
The number of times event X occurs
E.g. number of cache misses, number of I/O operations
Event Classification
Secondary-event metrics
Record a value when triggered by some event
E.g. record the block size for each I/O operation, count the number of operations, then find the average I/O transfer size
Event Classification
Profiles
Characterization of overall behavior
Aggregate/big-picture view of an application program
E.g. time spent in each function
Event-Driven Strategies
Record necessary information only when the selected event occurs
Modify the system to record the event
Dump data when the program terminates
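The strategy above can be sketched in a few lines. This is only an illustration: the `EventCounter` class and the event names are hypothetical, not part of the slides. The counter does work only when an event fires, and the totals are dumped once at program exit.

```python
import atexit
from collections import Counter


class EventCounter:
    """Hypothetical event-driven counter: work is done only when an event fires."""

    def __init__(self):
        self.counts = Counter()

    def record(self, event):
        # Called from the instrumented code path each time the event occurs.
        self.counts[event] += 1

    def dump(self):
        # Emit the totals once, when the program terminates.
        for event, count in sorted(self.counts.items()):
            print(f"{event}: {count}")


counter = EventCounter()
atexit.register(counter.dump)  # dump data when the program terminates

# Illustrative event stream (assumed for this sketch).
for event in ["cache_miss", "io_op", "cache_miss"]:
    counter.record(event)
```

The overhead of `record` is paid once per event, which is why this approach suits infrequent events.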
Event-Driven Strategies
System overhead
Incurred only when the event of interest actually occurs
Infrequent events: little perturbation
Frequent events: high perturbation
Perturbation changes the system being measured
Event-Driven Strategies
Overhead depends on when events actually occur
This makes it hard to estimate perturbation
How long to measure?
Good for low-frequency events
Event-Driven Strategies
[Figure: timeline in which a counter is incremented (+1) at each occurrence of the event.]
Tracing
Records that the event has occurred (a count), plus additional information to uniquely identify the event
E.g. the addresses that cause page faults
Overhead: additional memory or disk storage, and the time to save state
Tracing
[Figure: timeline in which each event increments the counter and records an address (+1; Addr).]
Sampling
May miss infrequent events
Each replication will produce different results
Sampling
[Figure: timeline in which the counter is incremented (+1) at each sample point.]
Comparisons
              Event count    Tracing          Sampling
Resolution    Exact count    Detailed info    Statistical summary
Overhead      Low            High             Constant
Perturbation  ~ #events      High             Fixed
Comparison
Event counting
Best for low-frequency events
Required if exact counts are needed
Sampling
Best for high-frequency events
If a statistical summary is adequate
Tracing
When additional detail is required
Indirect Measurements
Used when the desired metric is not directly accessible
Measure one thing directly, then derive the desired metric from it
Interval Timers
Fundamental tool of performance measurement
Measure the execution time of any portion of a program
Provide the time basis for sampling
Interval Timers
Read the counter value x1 before the event and x2 after it. With clock period $T_c$, the event time is
$T_e = (x_2 - x_1)\,T_c$
In code: (x2 - x1) * clock_period;
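A minimal sketch of the interval-timer idea, assuming Python's `time.perf_counter_ns()` as a stand-in for the hardware counter (so the nominal clock period is 1 ns, although the real resolution of the underlying clock may be coarser). The helper names `read_timer` and `time_event` are illustrative only.

```python
import time

CLOCK_PERIOD_S = 1e-9  # T_c: perf_counter_ns() reports nanosecond counts


def read_timer():
    # Stand-in for reading the counter value.
    return time.perf_counter_ns()


def time_event(event):
    x1 = read_timer()
    event()
    x2 = read_timer()
    return (x2 - x1) * CLOCK_PERIOD_S  # T_e = (x2 - x1) * T_c


elapsed = time_event(lambda: sum(range(100_000)))
print(f"T_e = {elapsed:.6f} s")
```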
Hardware Timer
[Figure: a clock with period $T_c$ drives an n-bit counter.]
$T_e = (x_2 - x_1)\,T_c$
Software Timer
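[Figure: a clock with period $T_c$ feeds a prescaler (divide-by-n), whose output increments a software counter.]
$T_e = (x_2 - x_1)\,T_c$, where $T_c$ here is the period of the prescaled clock that drives the software counter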
Quantization Errors
Quantization Error
Timer Rollover
An n-bit counter rolls over to zero after $2^n$ counts, i.e. after a time of $2^n T_c$.
Timer Rollover
Time until rollover, by resolution (Tc) and counter width n (bits):

Resolution (Tc)   n = 16     n = 32     n = 64
10 ns             655 us     43 s       58.5 centuries
1 us              65.5 ms    1.2 h      5,850 centuries
100 us            6.55 s     5 days     585,000 centuries
1 ms              1.1 min    50 days    5,850,000 centuries
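The rollover times follow directly from $2^n T_c$. A short sketch that recomputes them (the function name `rollover_time_s` is made up for this illustration):

```python
def rollover_time_s(n_bits, resolution_s):
    """Seconds until an n-bit counter, incremented every resolution_s seconds, wraps."""
    return (2 ** n_bits) * resolution_s


for n_bits in (16, 32, 64):
    for resolution_s in (10e-9, 1e-6, 100e-6, 1e-3):
        seconds = rollover_time_s(n_bits, resolution_s)
        print(f"n = {n_bits:2d}, Tc = {resolution_s:g} s: rollover after {seconds:.3g} s")
```

For example, a 16-bit counter with a 10 ns resolution wraps after about 655 us, while a 32-bit counter with a 1 ms resolution lasts about 50 days.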
Timer Overhead
Start_count = read_timer();
To access the timer:
A minimum of one memory read to read the counter, possibly a subroutine call
A minimum of one memory write to store the value, possibly a subroutine call
Timer Overhead
Timer Overhead
T1 = time to read counter value
T2 = time to store counter value
T3 = time of the event we are measuring
T4 = time to read counter value
$T_4 = T_1$
[Figure: timeline of the intervals T1, T2, T3, T4 in sequence.]
Timer Overhead
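Measured time: $T_m = T_2 + T_3 + T_4$
The event itself is $T_e = T_3$. Since $T_4 = T_1$, this gives $T_e = T_m - (T_1 + T_2) = T_m - T_{ovhd}$, with timer overhead $T_{ovhd} = T_1 + T_2$.
[Figure: timeline of the intervals T1, T2, T3, T4 in sequence.]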
Timer Overhead
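If $T_e \gg T_{ovhd}$, the overhead can be ignored.
If $T_e \approx T_{ovhd}$, the overhead is a significant fraction of the measured time.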
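One way to act on this is to estimate the timer overhead separately and subtract it. The sketch below is an assumption-laden illustration, not a method from the slides: it times an empty event many times, takes the minimum as the overhead estimate (on the assumption that the minimum is least disturbed by other system activity), and clamps the corrected result at zero. `read_timer`, `estimate_overhead_ns` and `measure_ns` are hypothetical names.

```python
import time


def read_timer():
    # Stand-in for reading the counter; reports nanoseconds.
    return time.perf_counter_ns()


def estimate_overhead_ns(trials=1000):
    # Time an empty event: what remains is the cost of reading and storing the counter.
    samples = []
    for _ in range(trials):
        x1 = read_timer()
        x2 = read_timer()
        samples.append(x2 - x1)
    return min(samples)


def measure_ns(event, overhead_ns):
    x1 = read_timer()
    event()
    x2 = read_timer()
    t_m = x2 - x1  # measured time, overhead included
    return max(t_m - overhead_ns, 0)  # T_e = T_m - T_ovhd, clamped at zero


overhead = estimate_overhead_ns()
t_e = measure_ns(lambda: sum(range(10_000)), overhead)
print(f"overhead ~ {overhead} ns, event ~ {t_e} ns")
```

When the corrected value is of the same order as the overhead estimate, the measurement should be treated with suspicion.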
How to measure an event that is shorter than the resolution of the clock?
Cannot directly measure events with $T_e < T_c$
Overhead makes it hard to measure even when $T_e > nT_c$, where n is a small integer
Bernoulli experiment
Outcome = +1 with probability p
Outcome = +0 with probability (1-p)
Equivalent to flipping a biased coin
Approximates a binomial distribution
Only approximate, since each measurement cannot be guaranteed to be independent
Repeat n times and count the +1 outcomes
m = number of measurements in which a clock tick was counted
n = total number of measurements
Average duration is given by the ratio m/n:
$T_e = \frac{m}{n}\,T_c$
Use a confidence interval for proportions
Example
Clock resolution = 10 us
n = 8764 measurements
m = 467 clock ticks counted
Find the 95% confidence interval for the event duration
Example
$(c_1, c_2) = \frac{467}{8764} \mp 1.96 \sqrt{\frac{\frac{467}{8764}\left(1 - \frac{467}{8764}\right)}{8764}} = (0.0486, 0.0580)$
Multiplying by the 10 us clock resolution: $T_e$ = (0.49, 0.58) us
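The interval above can be reproduced with a few lines of code. This sketch uses the normal-approximation confidence interval for a proportion; the function name `proportion_interval` is made up for the illustration, and z = 1.96 is the usual value for a 95% interval.

```python
import math


def proportion_interval(m, n, z):
    """Normal-approximation confidence interval for the proportion m/n."""
    p = m / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


clock_period_us = 10.0
c1, c2 = proportion_interval(m=467, n=8764, z=1.96)
print(f"proportion: ({c1:.4f}, {c2:.4f})")
print(f"T_e: ({c1 * clock_period_us:.2f}, {c2 * clock_period_us:.2f}) us")
```

The same function applies to any measurement expressed as m successes in n trials, with z chosen for the desired confidence level.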
Profiling
Overall view of a program's execution-time behavior
Fraction of total time spent in specific states:
Fraction of time in each subroutine
Fraction of time in OS kernel
Fraction of time doing I/O
Optimize the sections that take the largest fraction of time first
Statistical Sampling
Select a random subset of a population
Gather information on only this subset
Extrapolate this information to the overall population
Results are a statistical summary with corresponding error probabilities
PC Sampling
[Figure: timeline in which a sample is taken (+1) at each periodic interrupt.]
Periodically interrupt the program at fixed intervals
Record appropriate state information in the interrupt service routine
Post-process to obtain the overall profile
PC Sampling
At each interrupt:
Examine the PC on the return address stack
Use the address map to translate this PC to subroutine i
Increment array element H[i]

Address map:
0-1298: Subr 1
1299-3455: Subr 2
3456-5567: Subr 3
5568-9943: Subr 4

Example: PC = 4582 falls in 3456-5567, so it maps to Subr 3 and H[3] is incremented.
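The address-map lookup can be sketched with a binary search over the range start addresses. The address ranges are the ones on the slide; the function names and the sample PC values passed to `build_histogram` are assumptions for illustration.

```python
from bisect import bisect_right

# Address map from the slide: first address of each range, and its subroutine.
RANGE_STARTS = [0, 1299, 3456, 5568]
SUBROUTINES = [1, 2, 3, 4]
LAST_ADDRESS = 9943


def subroutine_for(pc):
    """Translate a sampled PC value to its subroutine number."""
    if not 0 <= pc <= LAST_ADDRESS:
        raise ValueError(f"PC {pc} is outside the address map")
    return SUBROUTINES[bisect_right(RANGE_STARTS, pc) - 1]


def build_histogram(pc_samples):
    """Increment H[i] for each sampled PC that falls in subroutine i."""
    hist = {sub: 0 for sub in SUBROUTINES}
    for pc in pc_samples:
        hist[subroutine_for(pc)] += 1
    return hist


print(subroutine_for(4582))               # the PC value shown on the slide
print(build_histogram([4582, 10, 5000]))  # made-up samples
```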
PC Sampling
[Figure: histogram of the sample counts H[i]; vertical axis from 0 to 140.]
PC Sampling
n = total number of interrupts (samples)
H[i]/n = fraction of time executing in subroutine i
H[i] * (interrupt period) = time in subroutine i
PC Sampling
Infer the behavior of the entire program from a small sample
Apply confidence intervals to quantify the precision of the results
Example
Interrupt period = 40 us
36,128 interrupts occur in subroutine A
Program runs for 10 seconds
Time in this subroutine?
Example
n = 10 s / 40 us = 250,000 interrupts, so p = 36,128/250,000 = 0.144512
$(c_1, c_2) = 0.144512 \mp 1.645 \sqrt{\frac{0.144512\,(0.855488)}{250000}} = (0.143, 0.146)$
90% chance that the program spent 14.3-14.6% of its time in subroutine A
Example
p = m/n = 0.015
Example
$(c_1, c_2) = 0.015 \mp 2.576 \sqrt{\frac{0.015\,(1 - 0.015)}{800}} = (0.0039, 0.0261)$
99% chance that the program spent 31-210 ms in subroutine A
A pretty wide range! But only <3% of total execution time
Start optimizing somewhere else first
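Collecting more samples to narrow the interval has limits: it may not be possible, the sampling rate may be fixed by the system, and more frequent sampling will increase overhead and perturbation.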
PC Sampling
[Figure: timeline in which a sample is taken (+1) at each periodic interrupt.]
Samples must be independent of each other
Otherwise, events synchronous with the interrupt are over- or under-sampled
Basic block
Sequence of instructions with no branches into or out of the block
When the first instruction is executed, all instructions in the block are guaranteed to be executed
Single entry, single exit
Produces a histogram of program execution
Can post-process to find instruction execution frequencies
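A toy sketch of basic block counting and the post-processing step. Everything here is hypothetical: the block names, their opcode contents, and the executed block sequence are invented to show how exact per-block counts yield instruction execution frequencies.

```python
from collections import Counter

# Hypothetical program: the opcodes contained in each basic block.
BLOCKS = {
    "B1": ["load", "cmp", "branch"],
    "B2": ["add", "store"],
    "B3": ["add", "add", "store"],
}

block_counts = Counter()


def enter_block(name):
    # The extra instrumentation executed once on entry to each basic block.
    block_counts[name] += 1


def instruction_frequencies(counts):
    """Post-process block counts into per-opcode execution counts."""
    freq = Counter()
    for name, executions in counts.items():
        for opcode in BLOCKS[name]:
            freq[opcode] += executions
    return freq


# Made-up execution order of the blocks.
for name in ["B1", "B2", "B1", "B3", "B1", "B2"]:
    enter_block(name)

freq = instruction_frequencies(block_counts)
print(dict(block_counts))
print(dict(freq))
```

Because every instruction in a block executes whenever the block is entered, one counter per block is enough to recover exact instruction counts.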
Comparison
               PC sampling                   Basic block counting
Output         Statistical estimate          Exact count
Overhead       Interrupt service routine     Extra instructions per block
Perturbation   Randomly distributed          High
Repeatability  Within statistical variance   Perfect
Event Tracing
A profile ignores the time-ordering of events
Program trace: the dynamic list of events generated by the program
Events = anything you want to instrument
Trace Generation
[Figure: the application program is modified to generate a trace. The trace is either compressed, stored, and later uncompressed for the trace consumer, or consumed online as it is generated.]
Trace Generation
Source-code modification
Allows precise control of what events are traced and what data is recorded
Typically a manual process
[Figure: Source code -> Compiler -> Object code -> Proc -> Trace]
Trace Generation
Software exceptions
Trace Generation
Emulation
Make a system appear to be something else
Modify the emulator to generate the trace
E.g. Java Virtual Machine
Trace Generation
Microcode modification
Depends on access to the lower levels of the processor
E.g. Transmeta Crusoe processor
Trace Generation
Compiler modification
Insert trace code directly in the object file
Requires access to the compiler itself
Trace Generation
Compiler modification
Insert trace code directly in the object file
Requires access to the compiler itself
Alternatively, write a post-compilation binary editor/rewrite tool
Trace Data
Tracing generates a tremendous volume of data
E.g. tracing 100,000,000 instrs/sec with 16 bits of data per event gives 190 Mbytes of data per second
Huge perturbations
Apply standard compression algorithms as the trace is written to disk
Uncompress when reading
Typical reduction: 20-70%
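The data-rate figure can be checked with a one-line calculation. This sketch assumes that "Mbytes" means units of 2**20 bytes, which is the reading that reproduces the 190 figure; the function name is made up.

```python
def trace_rate_mbytes_per_s(events_per_s, bits_per_event):
    """Trace data rate, with Mbytes taken as 2**20 bytes (an assumption)."""
    bytes_per_s = events_per_s * bits_per_event / 8
    return bytes_per_s / 2**20


rate = trace_rate_mbytes_per_s(events_per_s=100_000_000, bits_per_event=16)
print(f"{rate:.1f} Mbytes/s")
```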
Online trace consumption
Use the trace data as it is generated
Never stored on disk
Multitasking may lead to non-deterministic behavior
Repeatability issue: is a difference due to a change in the system or a change in the trace?
Becomes a statistical comparison with n runs
Abstract Execution
1. if (i > 5)
2.   then a = a + i;
3.   else b = b + i;
4. i = i + 1;
[Figure: control flow graph of statements 1-4; the two possible paths are 1-2-4 and 1-3-4.]
Store only 2 or 3
Combine with the compiler-generated control flow graph to regenerate the trace
Slowdown = 2-10x
Compression = 10-100x
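A small sketch of the regeneration step for the four-statement example. The control flow graph is written out by hand here (in practice it would come from the compiler), and it assumes each stored value corresponds to one pass through the fragment; the function name `regenerate` is hypothetical.

```python
# Control flow graph of the example: statement -> possible successors.
CFG = {1: [2, 3], 2: [4], 3: [4], 4: []}
ENTRY = 1


def regenerate(decisions):
    """Rebuild the full statement trace from the stored branch outcomes (2 or 3)."""
    trace = []
    for taken in decisions:
        node = ENTRY
        while True:
            trace.append(node)
            successors = CFG[node]
            if not successors:
                break  # reached the last statement of the fragment
            if len(successors) > 1:
                if taken not in successors:
                    raise ValueError(f"{taken} is not a successor of {node}")
                node = taken  # the stored value resolves the branch
            else:
                node = successors[0]  # only one way to go: nothing was stored
    return trace


print(regenerate([2, 3]))
```

Only one value per pass is stored instead of three, which is where the compression comes from; larger graphs with long straight-line paths compress further.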
Trace Sampling
Save only subsequences of the overall trace
Drive the simulator with the samples
Results should be statistically similar to driving with the complete trace
One sample = k consecutive events
Sampling interval = P (period)
[Figure: samples of k consecutive events, taken once every P events.]
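Periodic trace sampling with sample size k and period P can be sketched as follows. The function name and the integer trace are placeholders; a real trace would hold event records.

```python
def sample_trace(trace, k, period):
    """Keep k consecutive events at the start of every `period` events."""
    if not 0 < k <= period:
        raise ValueError("require 0 < k <= period")
    return [trace[start:start + k] for start in range(0, len(trace), period)]


trace = list(range(10))  # stand-in for a sequence of trace events
samples = sample_trace(trace, k=2, period=5)
print(samples)
```

If the trace length is not a multiple of the period, the final sample may contain fewer than k events.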
SimPoint
Perform detailed timing simulation on only these samples
Fast-forward (functional simulation) between samples
[Sherwood et al, ASPLOS, 2002]
SimPoint
Weight each sample's result by execution frequency to produce the overall result
A relatively small number (10s) of SimPoints produced 3% error in IPC on SPEC
SMARTS
Sometimes the desired metric cannot be measured directly
Use your creativity to measure one thing and then derive/infer the desired value
How should system load be defined?
Number of jobs in the run queue?
Number of jobs actively time-sharing?
Fraction of time the processor is not in the idle loop?
Others?
How to measure it?
Modify the OS
PC sampling
Indirectly?
Example
Let the system run for a fixed time T
Compare the value of the loaded-system monitor counter to the unloaded-system count value
[Figure: over the interval T, the monitor running alone counts to n; sharing the processor with App 1 it counts to n/2; sharing with App 1 and App 2 it counts to n/3.]
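The inference in this example can be written down directly. The sketch assumes the processor is time-shared equally among the jobs, as the n, n/2, n/3 pattern suggests; the function name and the counter values are made up.

```python
def jobs_sharing_processor(unloaded_count, loaded_count):
    """Estimate how many jobs (monitor included) time-shared the processor.

    Assumes equal time-sharing, so the monitor's count scales as 1/jobs.
    """
    if loaded_count <= 0:
        raise ValueError("loaded_count must be positive")
    return unloaded_count / loaded_count


n = 900  # made-up monitor count on an unloaded system over time T
print(jobs_sharing_processor(n, 450))  # monitor plus one application
print(jobs_sharing_processor(n, 300))  # monitor plus two applications
```

Subtracting one for the monitor itself gives the number of other active jobs.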
Perturbation
[Figure: perturbation (low to high) plotted against resolution (low to high).]
Perturbation
Non-linear
Non-additive
No way to predict!
Instrumentation Code
Affects memory banking optimizations
More frequent cache flushes and replacements
But may reduce set associativity conflicts
Important Points
Event types
Simple counts of a primary event
Secondary events triggered by some primary event
Overall profiles
Important Points
Measurement strategies
Important Points
Interval timers
Important Points
Profiling
PC sampling
Important Points
Trace generation
Source code modification
Force exceptions
Emulation
Microcode modification
Compiler modification
Object code editor
Important Points
System load example
Perturbations
Nobody likes them
Have to learn to live with them