
MODULE - 5

Elements of CPU performance

Cycle time. CPU pipeline. Memory system.

Pipelining
Several instructions are executed simultaneously, at different stages of completion. Pipelining increases the efficiency of the CPU. Various conditions can cause pipeline bubbles that reduce utilization:
branches; memory system delays.

Performance measures
Latency: time it takes for an instruction to get through the pipeline.
Throughput: number of instructions executed per time period.
Pipelining increases throughput without reducing latency.
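A quick worked example, assuming ideal one-cycle stages rather than any specific CPU:

3-stage pipeline, 1 cycle per stage:
  latency = 3 cycles per instruction
  throughput = 1 instruction/cycle once the pipeline is full
  (an unpipelined CPU would finish only 1 instruction every 3 cycles)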


ARM7 pipeline
The ARM7 has a 3-stage pipeline:
fetch instruction from memory; decode opcode and operands; execute.

ARM pipeline execution

[Pipeline diagram: add r0,r1,#5, sub r2,r3,r6, and cmp r2,#3 move through fetch, decode, and execute in successive, overlapping cycles.]

Each of these operations requires one clock cycle for typical instructions.

Pipeline stalls
If every step cannot be completed in the same amount of time, the pipeline stalls. Bubbles introduced by a stall increase latency and reduce throughput.

ARM multi-cycle LDMIA instruction

[Pipeline diagram: ldmia r0,{r2,r3} occupies the execute stage for two cycles (ex ld r2, then ex ld r3); the following sub r2,r3,r6 and cmp r2,#3 are fetched and decoded but must wait to execute.]


Control stalls

Branches often introduce stalls (branch penalty). Stall time may depend on whether the branch is taken.

ARM pipelined branch

[Pipeline diagram: bne foo spends several cycles in execute; the following sub r2,r3,r6 is fetched and decoded but squashed when the branch is taken, and add r0,r1,r2 at foo is fetched instead.]

May have to squash instructions that already started executing. The CPU doesn't know what to fetch until the condition is evaluated.

Delayed branch
To increase pipeline efficiency, the delayed branch mechanism requires that some instructions after the branch are always executed, whether or not the branch is taken. Any instruction in the delayed branch window must therefore be valid for both execution paths.

Example: ARM execution time


Determine the execution time of an FIR filter:

for (i=0; i<N; i++)
  f = f + c[i]*x[i];

Only the branch in the loop test may take more than one cycle:
BLT loop takes 1 cycle best case, 3 worst case.


FIR filter ARM code


; loop initiation code
  MOV r0,#0        ; use r0 for i, set to 0
  MOV r8,#0        ; use a separate index for arrays
  ADR r2,N         ; get address for N
  LDR r1,[r2]      ; get value of N
  MOV r2,#0        ; use r2 for f, set to 0
  ADR r3,c         ; load r3 with address of base of c
  ADR r5,x         ; load r5 with address of base of x
; loop body
loop
  LDR r4,[r3,r8]   ; get value of c[i]
  LDR r6,[r5,r8]   ; get value of x[i]
  MUL r4,r4,r6     ; compute c[i]*x[i]
  ADD r2,r2,r4     ; add into running sum
; update loop counter and array index
  ADD r8,r8,#4     ; add one word to array index
  ADD r0,r0,#1     ; add 1 to i
; test for exit
  CMP r0,r1
  BLT loop         ; if i < N, continue loop
loopend ...

FIR filter performance by block

Block            Variable   # instructions   # cycles
Initialization   t_init     7                7
Body             t_body     4                4
Update           t_update   2                2
Test             t_test     2                [2,4]

t_loop = t_init + N(t_body + t_update) + (N-1) t_test,worst + t_test,best

The loop test succeeding (branch taken) is the worst case; the loop test failing (loop exit) is the best case.
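Plugging in N = 100 with the cycle counts above gives a quick worst-case estimate:

t_loop = 7 + 100(4 + 2) + 99 x 4 + 2 = 1005 cycles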

Memory system performance


Caches introduce indeterminacy into execution time: the time depends on the order of execution.

Types of cache misses


Compulsory miss: location has not been referenced before. Conflict miss: two locations are fighting for the same block. Capacity miss: working set is too large.

Cache miss penalty: added time due to a cache miss.


CPU power consumption


Most modern CPUs are designed with power consumption in mind to some degree. Power vs. energy:
heat generation depends on power consumption; battery life depends on energy consumption.

CMOS power consumption


The high-level power consumption characteristics of CPUs and other system components are derived from the circuits used to build those components. A CMOS circuit uses most of its power when it is changing its output value.

Voltage drops: power consumption is proportional to V².
Toggling: more activity means more power.
Leakage: a basic circuit characteristic; can be eliminated only by disconnecting power.
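A quick illustration of the V² dependence (idealized; dynamic CMOS power also scales with clock frequency, roughly P ≈ CV²f):

Halving the supply voltage from 3.3 V to 1.65 V cuts dynamic power to (1.65/3.3)² = 1/4 at the same clock rate.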

CPU power-saving strategies


Reduce the power supply voltage. Run at a lower clock frequency. Disable function units with control signals when not in use. Disconnect parts from the power supply when not in use to eliminate leakage current.

C55x low power features


Parallel execution units: longer idle shutdown times. Multiple data widths:
16-bit ALU vs. 40-bit ALU.

Instruction caches minimize main memory accesses. Power management:
function unit idle detection; memory idle detection; user-configurable IDLE domains allow programmer control of what hardware is shut down.


Power management styles

Static power management: does not depend on CPU activity; invoked by the user.
Example: user-activated power-down mode.

Dynamic power management: based on CPU activity.
Example: disabling idle function units.

Application: PowerPC 603 energy features

Provides doze, nap, and sleep modes. Dynamic power management features:
uses static logic; can shut down unused execution units; cache is organized into subarrays to minimize the amount of active circuitry.

PowerPC 603 activity


Percentage of time units are idle for SPEC integer/floating-point benchmarks:

unit              SPECint92   SPECfp92
D cache           29%         28%
I cache           29%         17%
load/store        35%         17%
fixed-point       38%         76%
floating-point    99%         30%
system register   89%         97%

Power-down costs
Going into a power-down mode costs:
time; energy.

Must determine whether going into the mode is worthwhile. CPU power states can be modeled with a power state machine.


Application: StrongARM SA-1100 power saving


The processor takes two supplies:
VDD is the main 3.3V supply. VDDX is 1.5V.

Three power modes:
Run: normal operation.
Idle: stops the CPU clock, with logic still powered.
Sleep: shuts off most of chip activity; entered in 3 steps, each about 30 µs; wakeup takes > 10 ms.

[Figure: a power state machine for a processor.]

SA-1100 power state machine


[State machine: run (P_run = 400 mW), idle (P_idle = 50 mW), sleep (P_sleep = 0.16 mW). Transitions: run <-> idle take 10 µs each; run -> sleep and idle -> sleep take 90 µs; sleep -> run takes 160 ms.]
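The state machine numbers let us estimate when sleeping pays off. Below is a minimal C sketch, not vendor code: the 90 µs entry and 160 ms wakeup are read from the figure, and it deliberately charges both transitions at run-level power, a pessimistic assumption.

#include <stdio.h>

/* Power levels and transition times from the SA-1100 state machine. */
#define P_RUN   0.400    /* W */
#define P_IDLE  0.050    /* W */
#define P_SLEEP 0.00016  /* W */
#define T_ENTER 90e-6    /* s, run -> sleep */
#define T_WAKE  160e-3   /* s, sleep -> run */

/* Energy if we sleep through an idle period of t seconds,
   charging both transitions at run-level power (pessimistic). */
static double energy_sleep(double t) {
    return (T_ENTER + T_WAKE) * P_RUN + t * P_SLEEP;
}

/* Energy if we just sit in idle mode for the same period. */
static double energy_idle(double t) {
    return t * P_IDLE;
}

int main(void) {
    /* Scan idle-period lengths to locate the break-even point
       (with these numbers it falls near t = 1.3 s). */
    for (double t = 0.1; t <= 2.0; t += 0.1)
        printf("t=%.1fs idle=%.4fJ sleep=%.4fJ -> %s\n",
               t, energy_idle(t), energy_sleep(t),
               energy_sleep(t) < energy_idle(t) ? "sleep" : "idle");
    return 0;
}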

System-level performance analysis


Performance depends on all the elements of the system:
CPU; cache; bus; main memory; I/O devices.


Bandwidth as performance
Bandwidth applies to several components:
Memory. Bus. CPU fetches.

Bandwidth and data transfers


Video frame: 320 x 240 x 3 = 230,400 bytes, to be transferred in 1/30 sec.
At 1 byte/µs, the transfer takes 0.23 sec per frame: too slow.

Different parts of the system run at different clock rates. Different components may have different widths (bus, memory).

To increase bandwidth:
increase bus width; increase bus clock rate.

Bus bandwidth

T: # bus cycles. P: time per bus cycle. Total time for a transfer:
t = TP.

D: data payload length. O1 + O2 = overhead O. W: bus width.

For a basic transfer of N bytes:
T_basic(N) = (D + O)N/W

Bus burst transfer bandwidth

[Timing diagram: a burst moves B words of D cycles each between overhead periods O1 and O2 on a bus of width W.]

For a burst transfer of B words per overhead period:
T_burst(N) = (BD + O)N/(BW)
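A small sanity check of the two formulas in C (a sketch: the parameter names mirror the slide, the bus width W=2 is inferred from the arithmetic on the next page, and the 612,000-byte example from that page is used as input):

#include <stdio.h>

/* Basic transfer: D payload cycles + O overhead cycles per W-byte beat. */
static long t_basic(long D, long O, long N, double W) {
    return (long)((D + O) * N / W);
}

/* Burst transfer: B beats of D cycles share one overhead O. */
static long t_burst(long D, long O, long B, long N, double W) {
    return (long)((B * D + O) * N / (B * W));
}

int main(void) {
    long N = 612000; /* bytes, from the video-frame example */
    /* Bus: D=1, O=3, W=2 -> 1,224,000 cycles, matching the slide. */
    printf("T_basic = %ld cycles\n", t_basic(1, 3, N, 2.0));
    /* Memory: burst B=4, D=1, O=4, W=0.5 -> 2,448,000 cycles. */
    printf("T_burst = %ld cycles\n", t_burst(1, 4, 4, N, 0.5));
    return 0;
}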


Memory aspect ratios

[Figure: the same memory capacity can be organized with different aspect ratios, e.g. 64M x 1, 16M x 4, or 8M x 8.]

Memory access times

Memory component access times come from the chip data sheet. Page modes allow faster access for successive transfers on the same page.

If data doesn't fit naturally into physical words:
A = [(E/w) mod W] + 1

Bus performance bottlenecks

Transfer 320 x 240 video frames @ 30 frames/sec = 612,000 bytes/sec. Is the performance bottleneck the bus or the memory?

Bus: assume a 1 MHz bus with D=1, O=3, W=2:
T_basic = (1+3) x 612,000/2 = 1,224,000 cycles = 1.224 sec.

Memory: try burst mode with B=4, D=1, O=4, and width W=0.5:
T_mem = (4x1+4) x 612,000/(4x0.5) = 2,448,000 cycles = 0.2448 sec.


Performance spreadsheet
D   O   N        T_basic     t
1   3   612000   1224000     1.22E+00

D   O   B   N        T_mem       t
1   4   4   612000   2448000     2.45E-02

Parallelism
Speed things up by running several units at once. DMA provides parallelism if the CPU doesn't need the bus:
the DMA engine uses the bus while the CPU computes.

Expression simplification
A machine-independent program transformation. Constant folding:
8+1 = 9

Algebraic simplification:
a*b + a*c = a*(b+c)


Dead code elimination


Code that will never be executed can be eliminated by analysis of control flow and constant folding:

#define DEBUG 0
...
if (DEBUG) print_debug_stuff();

Procedure inlining
Inlining substitutes the body of the procedure at the call site, eliminating procedure linkage overhead (call, return, argument passing).
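A small illustrative sketch (hypothetical function square):

int square(int x) { return x * x; }

y = square(a);   /* original call: incurs call/return linkage */
y = a * a;       /* after inlining: linkage overhead is gone  */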

Loop transformations
Goals:
reduce loop overhead; increase opportunities for pipelining; improve memory system performance.

Loop unrolling

Reduces loop overhead and enables some other optimizations.

for (i=0; i<4; i++)
  a[i] = b[i] * c[i];

becomes

a[0] = b[0]*c[0];
a[1] = b[1]*c[1];
a[2] = b[2]*c[2];
a[3] = b[3]*c[3];

The loop can also be partially unrolled:

for (i=0; i<2; i++) {
  a[i*2] = b[i*2] * c[i*2];
  a[i*2+1] = b[i*2+1] * c[i*2+1];
}

Unrolling may interfere with the cache and expands the amount of code required.

Loop fusion

Fusion combines two loops into one:

for (i=0; i<N; i++)
  a[i] = b[i] * 5;
for (j=0; j<N; j++)
  w[j] = c[j] * d[j];

becomes

for (i=0; i<N; i++) {
  a[i] = b[i] * 5;
  w[i] = c[i] * d[i];
}

The loops must iterate over the same values, and the loop bodies must not have dependencies that would be violated if they are executed together.

Loop distribution
Distribution breaks one loop into two. Changes optimizations within loop body.
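Distribution is the inverse of fusion; a minimal sketch on the same loops used in the fusion example above:

/* before: one loop body with two independent statements */
for (i=0; i<N; i++) {
  a[i] = b[i] * 5;
  w[i] = c[i] * d[i];
}

/* after distribution: each loop can be optimized separately */
for (i=0; i<N; i++)
  a[i] = b[i] * 5;
for (i=0; i<N; i++)
  w[i] = c[i] * d[i];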

Loop tiling
Breaks one loop into a set of nested loops. Changes order of accesses within array.
Changes cache behavior.


Loop tiling example

for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    c[i] = a[i,j]*b[i];

becomes

for (i=0; i<N; i+=2)
  for (j=0; j<N; j+=2)
    for (ii=i; ii<min(i+2,N); ii++)
      for (jj=j; jj<min(j+2,N); jj++)
        c[ii] = a[ii,jj]*b[ii];

Array padding
Array padding adds dummy data elements to an array in order to change the array's layout in the cache; it can reduce the number of cache conflicts during loop execution.
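A minimal sketch (array sizes are illustrative; the right amount of padding depends on the cache geometry):

int a[128][64];    /* rows are a power-of-two length: successive rows
                      can map onto the same cache sets and conflict   */
int a[128][65];    /* padded with one dummy element per row: successive
                      rows now start in different cache sets           */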

Register allocation
Goals:
choose a register to hold each variable; determine the lifespan of each variable in its register; choose assignments of variables to registers to minimize the total number of registers required.

w = a + b;
x = c + w;
y = c + d;

Register lifetime graph


[Lifetime graph over t=1..3 for the statements above: a and b are live at t=1; w from t=1 to t=2; c from t=2 to t=3; x at t=2; d and y at t=3. At most three values are live at any time.]


Register assignment

LDR r0,[p_a]   ; load a into r0 using pointer to a (p_a)
LDR r1,[p_b]   ; load b into r1
ADD r3,r0,r1   ; compute a + b
STR r3,[p_w]   ; w = a + b
LDR r2,[p_c]   ; load c into r2
ADD r0,r2,r3   ; compute c + w, reusing r0 for x
STR r0,[p_x]   ; x = c + w
LDR r0,[p_d]   ; load d into r0
ADD r3,r2,r0   ; compute c + d, reusing r3 for y
STR r3,[p_y]   ; y = c + d

If a section of code requires more registers than are available, some values must be spilled to memory temporarily. Spilling registers costs extra CPU time and uses up both instruction and data memory.

Operator scheduling

Different orders of loads, stores, and arithmetic operations can give different execution times on pipelined machines. If we can keep values in registers without having to reread them from main memory, execution time and code size can be reduced. Consider the expression (a + b) * (c - d), or this sequence:

w = a + b; /* statement 1 */
x = c + d; /* statement 2 */
y = x + e; /* statement 3 */
z = a - b; /* statement 4 */

In this order, a and b must stay live until statement 4, so 5 registers are required.


Rescheduling so that values die sooner reduces register pressure; this order needs only 3 registers:

w = a + b; /* statement 1 */
z = a - b; /* statement 2 */
x = c + d; /* statement 3 */
y = x + e; /* statement 4 */

Instruction scheduling
Non-pipelined machines do not need instruction scheduling: any order of instructions that satisfies data dependencies runs equally fast. In pipelined machines, execution time of one instruction depends on the nearby instructions: opcode, operands.

Reservation table
A reservation table relates instructions/time to CPU resources.
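An illustrative reservation table for a hypothetical machine with one ALU and one memory port (not a specific CPU); each entry shows which instruction holds the resource in that cycle:

time   ALU      memory port
t1     instr1   -
t2     -        instr2
t3     instr3   instr2

Here the load (instr2) occupies the memory port for two cycles, but the ALU is free in t3 for instr3, so the two can overlap.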


Software pipelining
Software pipelining is a technique for reordering instructions across several loop iterations to reduce pipeline bubbles: latency in iteration i is hidden by executing instructions from iteration i+1 in the meantime.
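A C-level sketch of the idea on the FIR loop from earlier (illustrative only: a compiler applies this at the instruction level, and f, c, x, and N come from that example; the prologue and epilogue are what the transformation adds):

int i, ci, xi;
ci = c[0]; xi = x[0];              /* prologue: first iteration's loads   */
for (i = 0; i < N - 1; i++) {
  f += ci * xi;                    /* compute with iteration i's values   */
  ci = c[i + 1];                   /* overlap: load iteration i+1 ...     */
  xi = x[i + 1];                   /* ... while the multiply proceeds     */
}
f += ci * xi;                      /* epilogue: last multiply-accumulate  */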

Instruction selection
There may be several ways to implement an operation or sequence of operations. Represent operations as graphs, and match possible instruction sequences (templates) onto the graph.

[Figure: an expression graph containing a multiply feeding an add can be covered either by separate MUL and ADD templates or by a single MADD (multiply-add) template.]


Understanding & using your compiler


Compilers differ in the optimizations they perform. Studying the compiler's assembly language output is a good way to learn what it does. Understand the various optimization levels. Look at mixed compiler/assembler output. Modifying compiler output requires care:
correctness must be preserved.


Interpreters and JIT compilers

Interpreter: translates and executes program statements on the fly, one statement at a time. Only a small amount of memory is needed to hold intermediate representations of the program.

[Figure: structure of a program interpretation system.]

JIT compiler: compiles small sections of code into instructions during program execution; sits somewhere between an interpreter and a stand-alone compiler. It saves the compiled version of the code so that the code does not have to be retranslated the next time it is executed. Eliminates some translation overhead, but often requires more memory.

Program-level performance analysis

Need to understand performance in detail:
real-time behavior, not just typical behavior; on complex platforms.

Program performance is not the same as CPU performance:
pipeline and cache are only windows into the program; we must analyze the entire program.


Complexities of program performance

Varies with input data:
different-length paths; cache effects.

Instruction-level performance variations:
pipeline interlocks; fetch times.

How to measure program performance

Simulate execution of the CPU:
makes CPU state visible.

Measure on a real CPU using a timer:
requires modifying the program to control the timer.

Measure on a real CPU using a logic analyzer:
requires events visible on the pins.

Program performance metrics

Average-case execution time:
typically used in application programming.

Worst-case execution time:
a component in deadline satisfaction.

Best-case execution time:
task-level interactions can cause best-case program behavior to result in worst-case system behavior.

Elements of program performance

Basic program execution time formula:
execution time = program path + instruction timing

Solving these two problems independently helps simplify analysis; they are easier to separate on simpler CPUs.

Accurate performance analysis requires:
assembly/binary code; the execution platform.


Data-dependent paths in an if statement

if (a || b) {    /* T1 */
  if (c)         /* T2 */
    x = r*s+t;   /* A1 */
  else
    y = r+s;     /* A2 */
  z = r+s+u;     /* A3 */
} else {
  if (c)         /* T3 */
    y = r-t;     /* A4 */
}

a  b  c  path
0  0  0  T1=F, T3=F: no assignments
0  0  1  T1=F, T3=T: A4
0  1  0  T1=T, T2=F: A2, A3
0  1  1  T1=T, T2=T: A1, A3
1  0  0  T1=T, T2=F: A2, A3
1  0  1  T1=T, T2=T: A1, A3
1  1  0  T1=T, T2=F: A2, A3
1  1  1  T1=T, T2=T: A1, A3

Paths in a loop
for (i=0, f=0; i<N; i++)
  f = f + c[i] * x[i];

[Flowchart: initialize i=0, f=0; test i<N; if yes, compute f = f + c[i]*x[i], then i = i+1 and repeat the test; if no, exit the loop.]

Instruction timing

Not all instructions take the same amount of time:
multi-cycle instructions; fetches.

Execution times of instructions are not independent:
pipeline interlocks; cache effects.

Execution times may vary with operand values:
floating-point operations; some multi-cycle integer operations.

Measurement-driven performance analysis

Not as easy as it sounds:
must actually have access to the CPU; must know the data inputs that give worst/best-case performance; must make state visible.

Still an important method for performance analysis.


Feeding the program


Need to know the desired input values. May need to write software scaffolding to generate the input values. Software scaffolding may also need to examine outputs to generate feedback-driven inputs.
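A minimal sketch of such scaffolding (illustrative: the fir routine stands in for the code under test, and the output check is only a placeholder for real feedback-driven analysis):

#include <stdio.h>
#include <stdlib.h>

#define N 64

/* Stand-in for the code under test. */
static int fir(const int *c, const int *x, int n) {
    int f = 0;
    for (int i = 0; i < n; i++) f += c[i] * x[i];
    return f;
}

int main(void) {
    int c[N], x[N];
    srand(1);                              /* fixed seed: reproducible runs */
    for (int trial = 0; trial < 1000; trial++) {
        for (int i = 0; i < N; i++) {      /* generate input values */
            c[i] = rand() % 256;
            x[i] = rand() % 256;
        }
        int f = fir(c, x, N);
        /* examine outputs; here just a simple sanity bound check */
        if (f < 0) {
            printf("trial %d: unexpected negative sum\n", trial);
            return 1;
        }
    }
    return 0;
}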

Trace-driven measurement
Trace-driven:
Instrument the program. Save information about the path.

Requires modifying the program. Trace files are large. Widely used for cache analysis.

Physical measurement

In-circuit emulator allows tracing:
affects execution timing.

Logic analyzer can measure behavior at the pins:
the address bus can be analyzed to look for events; code can be modified to make events visible.

Particularly important for real-world input streams.

CPU simulation

Some simulators are less accurate; a cycle-accurate simulator provides exact clock-cycle timing:
the simulator models CPU internals; the simulator writer must know how the CPU works.


SimpleScalar FIR filter simulation

int x[N] = {8, 17, ... };
int c[N] = {1, 2, ... };
main() {
  int i, k, f = 0;
  for (k=0; k<COUNT; k++)
    for (i=0; i<N; i++)
      f += c[i]*x[i];
}

N        total sim cycles   sim cycles per filter execution
100      25,854             259
1,000    155,759            156
10,000   1,451,840          145

Performance optimization motivation

Embedded systems must often meet deadlines:
faster may not be fast enough.

Need to be able to analyze execution time:
worst-case, not typical.

Need techniques for reliably improving execution time.

Programs and performance analysis


Best results come from analyzing optimized instructions, not high-level language code:
non-obvious translations of HLL statements into instructions; code may move; cache effects are hard to predict.

Loop optimizations
Loops are good targets for optimization. Basic loop optimizations:
code motion; induction-variable elimination; strength reduction (x*2 -> x<<1).


Code motion

Moves computations that don't need to be inside the loop out of it:

for (i=0; i<N*M; i++)
  z[i] = a[i] + b[i];

becomes

X = N*M;
for (i=0; i<X; i++)
  z[i] = a[i] + b[i];

Induction variable elimination

Induction variable: the loop index. Consider the loop:

for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    z[i,j] = b[i,j];

Rather than recompute i*M+j for each array in each iteration, share the induction variable between the arrays and increment it at the end of the loop body, as in the sketch below.
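A sketch of the transformed loop (zbinduct is an illustrative name; it assumes z and b are accessed through linearized addresses i*M+j, as described above):

/* before: i*M+j recomputed for every array reference */
for (i = 0; i < N; i++)
  for (j = 0; j < M; j++)
    z[i*M + j] = b[i*M + j];

/* after: one shared induction variable, incremented per iteration */
zbinduct = 0;
for (i = 0; i < N; i++)
  for (j = 0; j < M; j++) {
    z[zbinduct] = b[zbinduct];
    zbinduct++;
  }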

Cache analysis
Loop nest: set of loops, one inside other. Perfect loop nest: no conditionals in nest. Because loops use large quantities of data, cache conflicts are common.
Array conflicts in cache

[Figure: a[0,0] stored at main-memory address 1024 and b[0,0] at address 4099 map into the same line of the cache.]


Array conflicts, contd.


Array elements conflict because they are in the same line, even if they are not mapped to the same location. Solutions:
move one array; pad one of the arrays.

Performance optimization hints


Use registers efficiently. Use page-mode memory accesses. Analyze cache behavior:
instruction conflicts can be handled by rewriting code or rescheduling; conflicting scalar data can easily be moved; conflicting array data can be moved or padded.
