Measurement Tools and Techniques

Fundamental strategies
Interval timers
Program profiling
Tracing
Indirect measurement
Copyright 2004 David J. Lilja
Events
Most measurement tools are based on events
An event is some predefined change to system state
  Memory reference
  Disk access
  Change in a register's state
  Network message
  Processor interrupt
The definition depends on the metric being measured

Event Classification
Count metrics
  The number of times event X occurs
  Number of cache misses
  Number of I/O operations
Secondary-event metrics
  Record a value when triggered by some event
  Record block size for each I/O operation
  Count number of operations
  Find average I/O transfer size
Profiles
  Characterization of overall behavior
  Aggregate/big-picture view of an application program
  Time spent in each function

Event-Driven Strategies
Record necessary information only when the selected event occurs
Modify the system to record the event
  E.g., a simple counter in the page-fault routine
Dump data when the program terminates
  May need intermediate dumps also
System overhead
  Incurred only when the event of interest actually occurs
  Infrequent events -> little perturbation
  Frequent events -> high perturbation
  Perturbation changes the system being measured
    No longer typical behavior?
Inter-event time is unpredictable
  Depends on when events actually occur
  Makes it hard to estimate perturbation
  How long to measure?
Good for low-frequency events

Event-Driven Measurement Tools
[Figure: each occurrence of the event adds +1 to a counter. Counts 8 events exactly.]

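The event-driven strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `record_event`, `dump_counts`, and `handle_page_fault` are hypothetical names, and the "page-fault routine" is a stand-in function rather than real operating-system code.

```python
import atexit
from collections import Counter

# Hypothetical event counters; the names are illustrative, not from the slides.
event_counts = Counter()

def record_event(name):
    """Called from the instrumented code path each time the event occurs."""
    event_counts[name] += 1

def dump_counts():
    """Dump the collected counts once, when the program terminates."""
    for name, count in sorted(event_counts.items()):
        print(f"{name}: {count}")

atexit.register(dump_counts)

def handle_page_fault(address):
    """Stand-in for a page-fault routine with one added counter increment."""
    record_event("page_fault")
    # ... the real fault handling would go here ...

# Eight simulated faults, so the counter ends at exactly 8.
for addr in (0x1000, 0x2000, 0x1000, 0x3000, 0x4000, 0x2000, 0x5000, 0x6000):
    handle_page_fault(addr)
```

The only cost is one increment per event, which is why the overhead scales with the number of events.
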

Tracing
Similar to event-driven, but records additional system state
  A count that the event has occurred
  Additional information to uniquely identify the event
  E.g., the addresses that cause page faults
Overhead
  Additional memory or disk storage
  Time to save state
  Relatively large system perturbation

Tracing
[Figure: each occurrence of the event records +1 together with an address (Addr). Counts 8 events plus extra data.]

Sampling
Record necessary state at fixed time intervals
Overhead
  Independent of specific event frequency
  Depends on sampling frequency
Misses some events
Produces a statistical summary
  May miss infrequent events
  Each replication will produce different results

Sampling
[Figure: state is examined at fixed intervals; +1 is recorded only when a sample coincides with an event. Counts 3 events out of 5 samples.]
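A minimal sketch of fixed-interval sampling, assuming the "event" is the system being in some busy state during known time intervals. The function name `sample_busy_fraction` and the interval data are made up for illustration.

```python
def sample_busy_fraction(busy_intervals, total_time, period):
    """Sample at a fixed period and count how many samples land in a busy interval."""
    hits = 0
    samples = 0
    t = period
    while t <= total_time:
        samples += 1
        # Half-open intervals: busy from start up to, but not including, end.
        if any(start <= t < end for start, end in busy_intervals):
            hits += 1
        t += period
    return hits, samples

# Hypothetical busy intervals, sampled at t = 20, 40, 60, 80, 100.
hits, samples = sample_busy_fraction(
    [(15, 25), (35, 50), (75, 85)], total_time=100, period=20
)
print(f"{hits} hits out of {samples} samples")
```

Here the samples suggest a busy fraction of 3/5 while the intervals actually cover 35/100 of the run, which shows why the result is only a statistical summary and why it improves with more samples.
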

Comparisons

               Event count    Tracing         Sampling
Resolution     Exact count    Detailed info   Statistical summary
Overhead       Low            High            Constant
Perturbation   ~ #events      High            Fixed

Comparison
Event counting
  Best for low-frequency events
  Required if exact counts are needed
Sampling
  Best for high-frequency events
  If a statistical summary is adequate
Tracing
  When additional detail is required

Indirect Measurements
Used when the desired metric is not directly accessible
Measure one thing directly
Derive or deduce the desired metric
Highly dependent on the creativity of the performance analyst

Interval Timers
Fundamental tool of performance measurement
Measure execution time of any portion of a program
Provide the time basis for sampling

Interval Timers
Actually count clock pulses between two events
[Figure: a clock with period $T_c$ drives a counter; $x_1$ = counter value at Event 1, $x_2$ = counter value at Event 2.]
$T_e = (x_2 - x_1)\,T_c$

Using an Interval Timer
Within an application program

start_count = read_timer();
/* portion of program to be measured */
stop_count = read_timer();
elapsed_time = (stop_count - start_count) * clock_period;

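The same start/stop pattern can be sketched in Python with the standard library's `time.perf_counter_ns()`. The `measure` helper is a hypothetical wrapper, and the nanosecond unit of the returned count says nothing about the timer's true resolution on a given machine.

```python
import time

def measure(func, *args):
    """Return (result, elapsed_seconds) for one call of func(*args)."""
    start_count = time.perf_counter_ns()   # read the timer at the start event
    result = func(*args)                   # portion of program to be measured
    stop_count = time.perf_counter_ns()    # read the timer at the stop event
    clock_period = 1e-9                    # perf_counter_ns() reports nanoseconds
    elapsed_time = (stop_count - start_count) * clock_period
    return result, elapsed_time

total, elapsed = measure(sum, range(100_000))
print(f"sum = {total}, elapsed = {elapsed:.6f} s")
```
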
Hardware Timer
[Figure: a clock with period $T_c$ drives an n-bit counter whose value is read through a CPU input port.]
$T_e = (x_2 - x_1)\,T_c$

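Software Timer
[Figure: a clock with period $T_c$ feeds a prescaler (divide-by-n); the prescaler output, with period $T_c' = n\,T_c$, drives the CPU interrupt input, and each interrupt increments a software counter.]
$T_e = (x_2 - x_1)\,T_c'$
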
Quantization Errors
Timer resolution -> quantization error
  $n T_c < T_e < (n+1)\,T_c$
  $T_e$ is rounded to a whole number of clock ticks, so it can be off by up to one tick
Repeated measurements
  Completely unpredictable rounding
Want $T_c$ to be as small as possible

Timer Rollover
n-bit counter
  count in $[0, 2^n - 1]$
Rollover = transition from $2^n - 1$ to $0$
If rollover occurs between the start and stop events
  Then count $= (x_2 - x_1) < 0$
Check for count < 0
  Measure again, or
  Add $2^n$ to the count
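The rollover check can be written as a small helper. This is a sketch that assumes at most one rollover between the two readings; `elapsed_ticks` is a hypothetical name.

```python
def elapsed_ticks(x1, x2, n_bits):
    """Ticks between two readings of an n-bit counter, assuming at most one rollover."""
    count = x2 - x1
    if count < 0:              # counter wrapped from 2**n - 1 back to 0
        count += 2 ** n_bits   # add 2^n to recover the true count
    return count

# A 16-bit counter read at 65530 and then, after wrapping, at 4.
print(elapsed_ticks(65530, 4, 16))
```

If more than one rollover could have happened, this correction is not enough and the measurement has to be repeated or a wider counter used.
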

Timer Rollover
Time until an n-bit counter rolls over ($2^n\,T_c$)

Resolution ($T_c$)   n = 16     n = 32     n = 64
10 ns                655 us     43 s       58.5 centuries
1 us                 65.5 ms    1.2 h      5,850 centuries
100 us               6.55 s     5 days     585,000 centuries
1 ms                 1.1 min    50 days    5,850,000 centuries

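The table entries follow from $2^n\,T_c$. A short sketch that recomputes a few of them; the function name is hypothetical and a year is taken as 365.25 days.

```python
def rollover_seconds(n_bits, resolution_s):
    """Time for an n-bit counter with the given tick period to roll over."""
    return (2 ** n_bits) * resolution_s

SECONDS_PER_CENTURY = 100 * 365.25 * 24 * 3600  # assumes 365.25-day years

us_16 = rollover_seconds(16, 10e-9) * 1e6               # 16 bits at 10 ns, in microseconds
hours_32 = rollover_seconds(32, 1e-6) / 3600            # 32 bits at 1 us, in hours
centuries_64 = rollover_seconds(64, 1e-6) / SECONDS_PER_CENTURY  # 64 bits at 1 us

print(f"{us_16:.0f} us, {hours_32:.1f} h, {centuries_64:.0f} centuries")
```
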
Timer Overhead

start_count = read_timer();
/* portion of program to be measured */
stop_count = read_timer();
elapsed_time = (stop_count - start_count) * clock_period;

To access the timer
  Minimum of 1 memory read, possibly a subroutine call
  Minimum of 1 memory write, possibly a subroutine call
  Once at start, again at stop

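Timer Overhead
[Figure: timeline of one measurement, divided into four intervals]
  T1: from initiating read_timer() until the current time is actually read
  T2: from that point until the event being measured begins
  T3: the event itself, ending when the second read_timer() is initiated
  T4: from initiating the second read_timer() until the current time is actually read
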
Timer Overhead
$T_1$ = time to read counter value
$T_2$ = time to store counter value
$T_3$ = time of the event we are measuring
$T_4$ = time to read counter value
$T_4 = T_1$

Timer Overhead
$T_e$ = event time = $T_3$
But what is actually measured is
  $T_m = T_2 + T_3 + T_4$
  $T_e = T_m - (T_2 + T_4) = T_m - (T_1 + T_2)$
Timer overhead = $T_{ovhd} = T_1 + T_2$

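One way to apply this correction in practice is to estimate the overhead empirically and subtract it. The sketch below assumes that two back-to-back timer reads approximate $T_1 + T_2$, which is an approximation rather than something the slides state; both function names are hypothetical.

```python
import time

def estimate_timer_overhead(trials=1000):
    """Smallest observed gap between two back-to-back timer reads, in seconds."""
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter_ns()
        stop = time.perf_counter_ns()
        best = min(best, stop - start)
    return best * 1e-9

def corrected_time(t_measured, t_overhead):
    """T_e = T_m - T_ovhd, clamped at zero because noise can push it negative."""
    return max(t_measured - t_overhead, 0.0)

t_ovhd = estimate_timer_overhead()
print(f"estimated timer overhead: {t_ovhd * 1e9:.0f} ns")
```

Taking the minimum over many trials is a design choice that filters out interruptions; it gives a lower bound on the overhead, not its typical value.
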
Timer Overhead
If $T_e \gg T_{ovhd}$
  Ignore the timer overhead
If $T_e \approx T_{ovhd}$
  Measurements will be highly suspect
  Potentially large variations in $T_{ovhd}$
Good rule of thumb
  $T_e$ should be 100-1000x greater than $T_{ovhd}$

Approximate Measures of Short Intervals
How to measure an event that is shorter than the resolution of the clock?
Cannot directly measure events with $T_e < T_c$
Overhead makes it hard to measure even when $T_e > n T_c$, where $n$ is a small integer

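[Figure: an event of duration $T_e < T_c$ placed on the clock timeline. Case 1: a clock tick falls within the event, so the count is +1. Case 2: the event falls entirely between two ticks, so the count is +0.]
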
Bernoulli experiment
  Outcome = +1 with probability p
  Outcome = +0 with probability (1 - p)
  Equivalent to flipping a biased coin
Repeat n times
  Approximates a binomial distribution
  Only approximate since each measurement cannot be guaranteed to be independent
  Usually close enough in practice

m = number of times Case 1 (count +1) occurs
n = total number of measurements
Average duration is the ratio m/n, scaled by the clock period
  $T_e \approx \frac{m}{n}\,T_c$
Use a confidence interval for proportions

Example
Clock resolution = 10 us
n = 8764 measurements
m = 467 clock ticks counted (Case 1: 467, Case 2: 8297)
95% confidence interval?

$(c_1, c_2) = \frac{467}{8764} \mp 1.96 \sqrt{\frac{\frac{467}{8764}\left(1 - \frac{467}{8764}\right)}{8764}} = (0.0486, 0.0580)$

Scale by the clock period = 10 us
95% chance that the duration of the measured event is in (0.49, 0.58) us
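The interval above can be checked numerically. This sketch uses the normal approximation for a proportion; `proportion_interval` is a hypothetical helper name, and z = 1.96 is the usual value for 95% confidence.

```python
import math

def proportion_interval(m, n, z):
    """Normal-approximation confidence interval for the proportion m/n."""
    p = m / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

c1, c2 = proportion_interval(467, 8764, z=1.96)

clock_period_us = 10
low_us, high_us = c1 * clock_period_us, c2 * clock_period_us
print(f"({c1:.4f}, {c2:.4f}) ticks -> ({low_us:.2f}, {high_us:.2f}) us")
```
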
Profiling
Overall view of a program's execution-time behavior
Fraction of total time spent in specific states
  Fraction of time in each subroutine
  Fraction of time in OS kernel
  Fraction of time doing I/O
Find bottlenecks, code hot-spots
  Optimize those sections first

Statistical Sampling
Select a random subset of a population
Gather information on only this subset
Extrapolate this information to the overall population
Results are a statistical summary with corresponding error probabilities

PC Sampling
Periodically interrupt the program at fixed intervals
Record appropriate state information in the interrupt service routine
Post-process to obtain the overall profile
[Figure: interrupts at fixed intervals along the program timeline, each recording +1.]

PC Sampling
At each interrupt
  Examine the PC on the return address stack
  Use the address map to translate this PC to subroutine i
  Increment array element H[i]

Address map
  0-1298:     Subr 1
  1299-3455:  Subr 2
  3456-5567:  Subr 3
  5568-9943:  Subr 4
PC: 4582
Histogram counters: H[3] = H[3] + 1

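The lookup step can be sketched with a sorted list of subroutine start addresses and a binary search. The address ranges are the ones on the slide; the function names and the dictionary used for H are illustrative assumptions.

```python
import bisect

# First address of each subroutine from the slide's address map, ascending.
START_ADDRESSES = [0, 1299, 3456, 5568]
LAST_ADDRESS = 9943
histogram = {i: 0 for i in range(1, len(START_ADDRESSES) + 1)}  # H[1]..H[4]

def subroutine_for(pc):
    """Translate a sampled PC to its subroutine number i (1-based)."""
    if not 0 <= pc <= LAST_ADDRESS:
        raise ValueError(f"PC {pc} is outside the address map")
    # Number of start addresses <= pc is exactly the subroutine index.
    return bisect.bisect_right(START_ADDRESSES, pc)

def record_sample(pc):
    """What the interrupt service routine would do for one sample."""
    histogram[subroutine_for(pc)] += 1

record_sample(4582)
print(histogram)
```
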
PC Sampling
[Figure: histogram of the sample counts H[i] for each subroutine; vertical axis from 0 to 140.]
PC Sampling
n total interrupts
Post-processing step
  H[i]/n = fraction of time executing in subroutine i
  H[i] * (interrupt period) = (H[i]/n) * (total execution time) = time in subroutine i

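A sketch of the post-processing step, assuming the histogram is available as a mapping from subroutine name to sample count. The `profile` function and the sample counts are hypothetical.

```python
def profile(histogram, interrupt_period_s):
    """Return {subroutine: (fraction_of_time, estimated_seconds)} from sample counts."""
    n = sum(histogram.values())   # total number of interrupts
    return {
        name: (count / n, count * interrupt_period_s)
        for name, count in histogram.items()
    }

# Hypothetical counts: 800 samples at a 10 ms period, 12 of them in subroutine A.
result = profile({"A": 12, "other": 788}, interrupt_period_s=0.010)
fraction_a, seconds_a = result["A"]
print(f"A: {fraction_a:.3f} of the run, about {seconds_a:.2f} s")
```
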
PC Sampling
This is a statistical process
  Different counts each time the experiment is performed
Infer the behavior of the entire program from a small sample
Apply confidence intervals to quantify the precision of the results

Example
40 us interrupt period
36,128 interrupts in subroutine A
Program runs for 10 seconds
Time in this subroutine? 90% confidence interval

m = 36,128
n = 10 s / 40 us = 250,000
p = m/n = 0.1445

$(c_1, c_2) = 0.144512 \mp 1.645 \sqrt{\frac{0.144512\,(0.855488)}{250000}} = (0.143, 0.146)$

90% chance that the program spent 14.3-14.6% of its time in subroutine A

Example
10 ms interrupt period
12 interrupts in subroutine A
8 seconds total execution time, so n = 800 samples
Time in this subroutine? 99% confidence interval

p = m/n = 12/800 = 0.015

$(c_1, c_2) = 0.015 \mp 2.576 \sqrt{\frac{0.015\,(1 - 0.015)}{800}} = (0.0039, 0.0261)$

99% chance that the program spent 31-210 ms in subroutine A
A pretty wide range!
But less than 3% of total execution time
  Start optimizing somewhere else first

Reducing the Interval Size
Use a lower confidence level
Obtain more samples
  Run the program longer
    May not be possible
  Increase the sample rate
    May be fixed by the system
    Will increase overhead and perturbation
  Run multiple times and add the samples from each run

PC Sampling
Interrupts must occur asynchronously w.r.t. any program events
  Samples must be independent of each other
  Else events that are synchronous with the interrupt are over- or under-sampled
Periodic versus random sampling

Basic Block Counting
Basic block
  Sequence of instructions with no branches into or out of the block
  When the first instruction is executed, all instructions in the block are guaranteed to be executed
  Single entry, single exit

Basic Block Counting
Generate a program profile by inserting additional instructions in each block
  Increment a unique counter each time a block is entered
Produces a histogram of program execution
Can post-process to find instruction execution frequencies

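A hand-instrumented sketch of the idea. Real tools insert the counter increments at the machine-code level; here the block boundaries are approximated by hand in a toy Python function, and the block labels are made up.

```python
from collections import Counter

block_counts = Counter()

def count_positive(values):
    """Toy function instrumented by hand; each block bumps its own counter on entry."""
    block_counts["B0_entry"] += 1
    total = 0
    for v in values:
        block_counts["B1_loop_body"] += 1
        if v > 0:
            block_counts["B2_then"] += 1
            total += 1
    block_counts["B3_exit"] += 1
    return total

positives = count_positive([3, -1, 4, -1, 5])
print(dict(block_counts))
```

Unlike PC sampling, repeating this run with the same input gives exactly the same counts.
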
Comparison

                PC sampling                   Basic block counting
Output          Statistical estimate          Exact count
Overhead        Interrupt service routine     Extra instructions per block
Perturbation    Randomly distributed          High
Repeatability   Within statistical variance   Perfect

Event Tracing
Profile shows overall frequency-of-execution behavior
  Ignores time-ordering of events
Program trace
  Dynamic list of events generated by the program
  Events = anything you want to instrument
    Sequence of memory addresses
    I/O blocks accessed
  Typically used to drive a simulator

Trace Generation
[Figure: the application program is modified to generate a trace; the trace is compressed for storage, then uncompressed and passed to the trace consumer. With online trace consumption, the trace is passed directly to the trace consumer.]

Trace Generation
Source-code modification
  Allows precise control of what events are traced and what data is recorded
  Typically a manual process
[Figure: source code -> compiler -> object code -> processor -> trace]

Trace Generation
Software exceptions
  HW forces an exception before each instruction
  Exception routine decodes the instruction
    Store instruction type, PC, operand addresses, etc.
  Trace bit in many processors
  Tremendous slowdown
[Figure: source code -> compiler -> object code -> processor -> trace]

Trace Generation
Emulation
  Make a system appear to be something else
  Modify the emulator to generate the trace
  E.g., Java Virtual Machine
[Figure: source code -> compiler -> object code -> processor -> trace]

Trace Generation
Microcode modification
  Modify instruction execution directly
  Allows tracing of all instructions
    Including the operating system
  Depends on access to lower levels of the processor
  E.g., Transmeta Crusoe processor
[Figure: source code -> compiler -> object code -> processor -> trace]

Trace Generation
Compiler modification
  Insert trace code directly in the object file
  Requires access to the compiler itself
  Alternatively, write a post-compilation binary editor/rewrite tool
[Figure: source code -> compiler -> object code -> processor -> trace]

Trace Data
Tracing generates a tremendous volume of data
  Trace 100,000,000 instrs/sec
  16 bits of data per event
  190 Mbytes of data per second
  11 Gbytes per minute
Huge perturbations
  Due to tracing code
  Time to store trace data

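The data-volume figures can be reproduced with a few lines of arithmetic. The slide's numbers match binary units (1 Mbyte = 2^20 bytes, 1 Gbyte = 2^30 bytes), which is an inference from the values rather than something the slide states.

```python
events_per_second = 100_000_000
bytes_per_event = 16 // 8                  # 16 bits of data per event

bytes_per_second = events_per_second * bytes_per_event
mbytes_per_second = bytes_per_second / 2**20        # binary megabytes
gbytes_per_minute = bytes_per_second * 60 / 2**30   # binary gigabytes

print(f"{mbytes_per_second:.1f} Mbytes/s, {gbytes_per_minute:.1f} Gbytes/min")
```
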
Trace Data Compression
Standard compression algorithms as the trace is written to disk
Uncompress when reading
Typical reduction
  20-70%
Tradeoff is compress/uncompress time
[Figure: modified application program -> compress -> stored trace -> uncompress -> trace consumer]

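A minimal sketch of compressing a trace with a standard algorithm, here Python's `zlib`. The trace is synthetic and highly repetitive, so its compression ratio is not representative of the 20-70% reduction quoted above; the function names and the 16-bit record format are assumptions.

```python
import struct
import zlib

def pack_trace(addresses):
    """Pack addresses as 16-bit little-endian values and compress them."""
    raw = struct.pack(f"<{len(addresses)}H", *addresses)
    return raw, zlib.compress(raw)

def unpack_trace(compressed):
    """Uncompress and unpack the trace for the trace consumer."""
    raw = zlib.decompress(compressed)
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))

# Synthetic trace: a loop repeatedly touching the same 64 addresses.
trace = [0x4000 + 2 * (i % 64) for i in range(10_000)]
raw, compressed = pack_trace(trace)
restored = unpack_trace(compressed)
print(f"{len(raw)} bytes raw, {len(compressed)} bytes compressed")
```
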
Online Trace Consumption
Use trace data as it is generated
  Never stored on disk
Multitasking may lead to non-deterministic behavior
  Repeatability issue
Before-and-after comparison tests
  Difference due to change in system or change in trace?
  Becomes statistical comparison with n runs
[Figure: modified application program -> online trace consumption -> trace consumer]

Abstract Execution
Use higher-level information to intelligently compress trace info
Two-step process
  Compiler-style analysis to find the critical subset of the trace
    Store only control-flow information sufficient to reconstruct the trace later
  Produce trace-regeneration code for subsequent use of the trace

Abstract Execution

1. if (i > 5)
2.   then a = a + i;
3.   else b = b + i;
4. i = i + 1;

Trace will be either
  1-2-4
  1-3-4
Store only 2 or 3
Combine with the compiler-generated control flow graph to regenerate the trace
Slowdown = 2-10x
Compress = 10-100x
[Figure: control flow graph in which block 1 (if (i>5)) branches to block 2 (a=a+i) or block 3 (b=b+i), both of which lead to block 4 (i=i+1).]

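The store-only-2-or-3 idea can be sketched as a recorder and a regenerator. Both functions are hypothetical illustrations; the control flow graph is hard-coded here rather than produced by a compiler.

```python
def run_and_record(i_values):
    """Execute the fragment for each i, storing only which branch block ran (2 or 3)."""
    return [2 if i > 5 else 3 for i in i_values]

def regenerate_trace(branch_record):
    """Rebuild the full block trace from the stored branch outcomes.

    The control flow graph says every execution is 1 -> (2 or 3) -> 4.
    """
    trace = []
    for taken in branch_record:
        trace.extend([1, taken, 4])
    return trace

record = run_and_record([4, 5, 6, 7])
full_trace = regenerate_trace(record)
print(record, "->", full_trace)
```

Only one of every three trace entries is stored, and the rest is recovered from the known control flow.
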
Trace Sampling
Save only subsequences of the overall trace
Drive the simulator with the samples
  Results should be statistically similar to driving with the complete trace
One sample = k consecutive events
Sampling interval = P (period)
[Figure: trace timeline with samples of k consecutive events, repeating with period P.]

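A sketch of trace sampling, assuming the period P is measured in events and each sample starts at the beginning of a period. The function name `sample_trace` is hypothetical.

```python
def sample_trace(trace, k, period):
    """Keep k consecutive events at the start of every `period` events."""
    if not 0 < k <= period:
        raise ValueError("need 0 < k <= period")
    return [trace[start:start + k] for start in range(0, len(trace), period)]

# A stand-in trace of 20 events, sampled with k = 3 and P = 8.
samples = sample_trace(list(range(20)), k=3, period=8)
print(samples)
```
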
SimPoint
Find representative program samples
  Match basic block execution frequencies
  Clustering tool to automate the process
Perform detailed timing simulation on only these samples
  Fast-forward (functional simulation) between samples
[Sherwood et al., ASPLOS, 2002]

SimPoint
Weight each sample's result by execution frequency to produce the overall result
A relatively small number (10s) of SimPoints produced 3% error in IPC on SPEC

SMARTS
Uses systematic sampling
  Fixed sample interval
Apply statistical sampling techniques to determine j, k, P
[Figure: timeline alternating functional simulation and detailed simulation, with segments labeled j and k repeating with period P.]

Indirect Ad Hoc Techniques
Sometimes the desired metric cannot be measured directly
Use your creativity to measure one thing and then derive/infer the desired value

Example: System Load
What is system load?
  Number of jobs in run queue?
  Number of jobs actively time-sharing?
  Fraction of time processor is not in idle loop?
  Others?
How to measure it?
  Modify OS
  PC sampling
  Indirect?

Example
Let the system run for a fixed time T
Note the value of the monitor's counter
[Figure: on an unloaded system the monitor program runs alone for time T and reaches a count of n.]

Example
Let the system run for a fixed time T
Compare the loaded-system monitor counter value to the unloaded-system count value
[Figure: when the monitor time-shares with App 1, its count is n/2; with App 1 and App 2, its count is n/3.]

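The comparison can be turned into a number. This sketch assumes fair time-sharing among compute-bound jobs, so that with j other jobs the monitor's count drops from n to about n/(j + 1); the function name is hypothetical.

```python
def estimated_load(unloaded_count, loaded_count):
    """Estimate how many other jobs shared the processor with the monitor.

    Under fair time-sharing, count = n / (j + 1), so j = n / count - 1.
    """
    if loaded_count <= 0:
        raise ValueError("loaded_count must be positive")
    return unloaded_count / loaded_count - 1

n = 900  # hypothetical monitor count on the unloaded system
for count in (900, 450, 300):
    print(f"count {count}: about {estimated_load(n, count):.0f} other job(s)")
```
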
Perturbation
To obtain more information (higher resolution)
  Use more instrumentation points
More instrumentation points
  Greater perturbation

Perturbation
Computer performance measurement uncertainty principle
  Accuracy is inversely proportional to resolution.
[Figure: accuracy (vertical axis, low to high) plotted against resolution (horizontal axis, low to high).]

Perturbation
Superposition does not work here
  Non-linear
  Non-additive
Double instrumentation does not mean double impact on performance
  Some instrumentation cancels out
  Some multiplies impact
No way to predict!

Instrumentation Code
Changes memory access patterns
  Affects memory banking optimizations
  More frequent cache flushes and replacements
  But may reduce set-associativity conflicts
Generates additional load/store instructions
Generates more I/O operations
Will increase overall execution time
  More time-sharing context switches
  Alters virtual memory paging behavior

Important Points
Event types
  Simple counts of primary event
  Secondary events triggered by some primary event
  Overall profiles
Measurement strategies
  Event-driven
  Tracing
  Sampling
  Indirect approaches
Interval timers
  Stopwatch functionality
  Rollover problem
  Overhead
  Quantization errors
  Statistical measures of short intervals
Profiling
  PC sampling
    Statistical view
  Basic block counting
    Exact behavior
    High overhead and perturbation
Trace generation
  Source code modification
  Force exceptions
  Emulation
  Microcode modification
  Compiler modification
  Object code editor
Online trace consumption
Trace sampling
Indirect measurements when all else fails
  System load example
Perturbations
  Nobody likes them
  Have to learn to live with them
