
MODULE - 5

Elements of CPU performance

Cycle time. CPU pipeline. Memory system.

Pipelining
Several instructions are executed simultaneously, at different stages of completion. Pipelining increases the efficiency of the CPU. Various conditions can cause pipeline bubbles that reduce utilization:
branches; memory system delays.

Performance measures
Latency: time it takes for an instruction to get through the pipeline.
Throughput: number of instructions executed per time period.
Pipelining increases throughput without reducing latency.
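A quick worked example, assuming ideal one-cycle stages rather than any specific CPU:

3-stage pipeline, 1 cycle per stage:
  latency = 3 cycles per instruction
  throughput = 1 instruction/cycle once the pipeline is full
  (an unpipelined CPU would finish only 1 instruction every 3 cycles)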


ARM7 pipeline
The ARM7 has a 3-stage pipeline:
fetch instruction from memory; decode opcode and operands; execute.

ARM pipeline execution

[Pipeline diagram: add r0,r1,#5, sub r2,r3,r6, and cmp r2,#3 move through fetch, decode, and execute in successive, overlapping cycles.]

Each of these operations requires one clock cycle for typical instructions.

Pipeline stalls
If every step cannot be completed in the same amount of time, the pipeline stalls. Bubbles introduced by a stall increase latency and reduce throughput.

ARM multi-cycle LDMIA instruction

[Pipeline diagram: ldmia r0,{r2,r3} occupies the execute stage for two cycles (ex ld r2, then ex ld r3); the following sub r2,r3,r6 and cmp r2,#3 are fetched and decoded but must wait to execute.]


Control stalls

Branches often introduce stalls (branch penalty). Stall time may depend on whether the branch is taken.

ARM pipelined branch

[Pipeline diagram: bne foo spends several cycles in execute; the following sub r2,r3,r6 is fetched and decoded but squashed when the branch is taken, and add r0,r1,r2 at foo is fetched instead.]

May have to squash instructions that already started executing. The CPU doesn't know what to fetch until the condition is evaluated.

Delayed branch
To increase pipeline efficiency, the delayed branch mechanism requires that some instructions after the branch are always executed, whether or not the branch is taken. Any instruction in the delayed branch window must therefore be valid for both execution paths.

Example: ARM execution time


Determine the execution time of an FIR filter:

for (i=0; i<N; i++)
  f = f + c[i]*x[i];

Only the branch in the loop test may take more than one cycle:
BLT loop takes 1 cycle best case, 3 worst case.


FIR filter ARM code


; loop initiation code
  MOV r0,#0        ; use r0 for i, set to 0
  MOV r8,#0        ; use a separate index for arrays
  ADR r2,N         ; get address for N
  LDR r1,[r2]      ; get value of N
  MOV r2,#0        ; use r2 for f, set to 0
  ADR r3,c         ; load r3 with address of base of c
  ADR r5,x         ; load r5 with address of base of x
; loop body
loop
  LDR r4,[r3,r8]   ; get value of c[i]
  LDR r6,[r5,r8]   ; get value of x[i]
  MUL r4,r4,r6     ; compute c[i]*x[i]
  ADD r2,r2,r4     ; add into running sum
; update loop counter and array index
  ADD r8,r8,#4     ; add one word to array index
  ADD r0,r0,#1     ; add 1 to i
; test for exit
  CMP r0,r1
  BLT loop         ; if i < N, continue loop
loopend ...

FIR filter performance by block

Block            Variable   # instructions   # cycles
Initialization   t_init     7                7
Body             t_body     4                4
Update           t_update   2                2
Test             t_test     2                [2,4]

t_loop = t_init + N(t_body + t_update) + (N-1) t_test,worst + t_test,best

The loop test succeeding (branch taken) is the worst case; the loop test failing (loop exit) is the best case.
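Plugging in N = 100 with the cycle counts above gives a quick worst-case estimate:

t_loop = 7 + 100(4 + 2) + 99 x 4 + 2 = 1005 cycles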

Memory system performance


Caches introduce indeterminacy into execution time: the time depends on the order of execution.

Types of cache misses


Compulsory miss: location has not been referenced before. Conflict miss: two locations are fighting for the same block. Capacity miss: working set is too large.

Cache miss penalty: added time due to a cache miss.


CPU power consumption


Most modern CPUs are designed with power consumption in mind to some degree. Power vs. energy:
heat generation depends on power consumption; battery life depends on energy consumption.

CMOS power consumption


The high-level power consumption characteristics of CPUs and other system components are derived from the circuits used to build those components. A CMOS circuit uses most of its power when it is changing its output value.

Voltage drops: power consumption is proportional to V².
Toggling: more activity means more power.
Leakage: a basic circuit characteristic; can be eliminated only by disconnecting power.
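A quick illustration of the V² dependence (idealized; dynamic CMOS power also scales with clock frequency, roughly P ≈ CV²f):

Halving the supply voltage from 3.3 V to 1.65 V cuts dynamic power to (1.65/3.3)² = 1/4 at the same clock rate.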

CPU power-saving strategies


Reduce the power supply voltage. Run at a lower clock frequency. Disable function units with control signals when not in use. Disconnect parts from the power supply when not in use to eliminate leakage current.

C55x low power features


Parallel execution units: longer idle shutdown times. Multiple data widths:
16-bit ALU vs. 40-bit ALU.

Instruction caches minimize main memory accesses. Power management:
function unit idle detection; memory idle detection; user-configurable IDLE domains allow programmer control of what hardware is shut down.


Power management styles

Static power management: does not depend on CPU activity; invoked by the user.
Example: user-activated power-down mode.

Dynamic power management: based on CPU activity.
Example: disabling idle function units.

Application: PowerPC 603 energy features

Provides doze, nap, and sleep modes. Dynamic power management features:
uses static logic; can shut down unused execution units; cache is organized into subarrays to minimize the amount of active circuitry.

PowerPC 603 activity


Percentage of time units are idle for SPEC integer/floating-point benchmarks:

unit              SPECint92   SPECfp92
D cache           29%         28%
I cache           29%         17%
load/store        35%         17%
fixed-point       38%         76%
floating-point    99%         30%
system register   89%         97%

Power-down costs
Going into a power-down mode costs:
time; energy.

Must determine whether going into the mode is worthwhile. CPU power states can be modeled with a power state machine.


Application: StrongARM SA-1100 power saving


The processor takes two supplies:
VDD is the main 3.3V supply. VDDX is 1.5V.

Three power modes:
Run: normal operation.
Idle: stops the CPU clock, with logic still powered.
Sleep: shuts off most of chip activity; entered in 3 steps, each about 30 µs; wakeup takes > 10 ms.

[Figure: a power state machine for a processor.]

SA-1100 power state machine


[State machine: run (P_run = 400 mW), idle (P_idle = 50 mW), sleep (P_sleep = 0.16 mW). Transitions: run <-> idle take 10 µs each; run -> sleep and idle -> sleep take 90 µs; sleep -> run takes 160 ms.]
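The state machine numbers let us estimate when sleeping pays off. Below is a minimal C sketch, not vendor code: the 90 µs entry and 160 ms wakeup are read from the figure, and it deliberately charges both transitions at run-level power, a pessimistic assumption.

#include <stdio.h>

/* Power levels and transition times from the SA-1100 state machine. */
#define P_RUN   0.400    /* W */
#define P_IDLE  0.050    /* W */
#define P_SLEEP 0.00016  /* W */
#define T_ENTER 90e-6    /* s, run -> sleep */
#define T_WAKE  160e-3   /* s, sleep -> run */

/* Energy if we sleep through an idle period of t seconds,
   charging both transitions at run-level power (pessimistic). */
static double energy_sleep(double t) {
    return (T_ENTER + T_WAKE) * P_RUN + t * P_SLEEP;
}

/* Energy if we just sit in idle mode for the same period. */
static double energy_idle(double t) {
    return t * P_IDLE;
}

int main(void) {
    /* Scan idle-period lengths to locate the break-even point
       (with these numbers it falls near t = 1.3 s). */
    for (double t = 0.1; t <= 2.0; t += 0.1)
        printf("t=%.1fs idle=%.4fJ sleep=%.4fJ -> %s\n",
               t, energy_idle(t), energy_sleep(t),
               energy_sleep(t) < energy_idle(t) ? "sleep" : "idle");
    return 0;
}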

System-level performance analysis


Performance depends on all the elements of the system:
CPU; cache; bus; main memory; I/O devices.


Bandwidth as performance
Bandwidth applies to several components:
Memory. Bus. CPU fetches.

Bandwidth and data transfers


Video frame: 320 x 240 x 3 = 230,400 bytes, to be transferred in 1/30 sec.
At 1 byte/µs, the transfer takes 0.23 sec per frame: too slow.

Different parts of the system run at different clock rates. Different components may have different widths (bus, memory).

To increase bandwidth:
increase bus width; increase bus clock rate.

Bus bandwidth

T: # bus cycles. P: time per bus cycle. Total time for a transfer:
t = TP.

D: data payload length. O1 + O2 = overhead O. W: bus width.

For a basic transfer of N bytes:
T_basic(N) = (D + O)N/W

Bus burst transfer bandwidth

[Timing diagram: a burst moves B words of D cycles each between overhead periods O1 and O2 on a bus of width W.]

For a burst transfer of B words per overhead period:
T_burst(N) = (BD + O)N/(BW)
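A small sanity check of the two formulas in C (a sketch: the parameter names mirror the slide, the bus width W=2 is inferred from the arithmetic on the next page, and the 612,000-byte example from that page is used as input):

#include <stdio.h>

/* Basic transfer: D payload cycles + O overhead cycles per W-byte beat. */
static long t_basic(long D, long O, long N, double W) {
    return (long)((D + O) * N / W);
}

/* Burst transfer: B beats of D cycles share one overhead O. */
static long t_burst(long D, long O, long B, long N, double W) {
    return (long)((B * D + O) * N / (B * W));
}

int main(void) {
    long N = 612000; /* bytes, from the video-frame example */
    /* Bus: D=1, O=3, W=2 -> 1,224,000 cycles, matching the slide. */
    printf("T_basic = %ld cycles\n", t_basic(1, 3, N, 2.0));
    /* Memory: burst B=4, D=1, O=4, W=0.5 -> 2,448,000 cycles. */
    printf("T_burst = %ld cycles\n", t_burst(1, 4, 4, N, 0.5));
    return 0;
}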


Memory aspect ratios

[Figure: the same memory capacity can be organized with different aspect ratios, e.g. 64M x 1, 16M x 4, or 8M x 8.]

Memory access times

Memory component access times come from the chip data sheet. Page modes allow faster access for successive transfers on the same page.

If data doesn't fit naturally into physical words:
A = [(E/w) mod W] + 1

Bus performance bottlenecks

Transfer 320 x 240 video frames @ 30 frames/sec = 612,000 bytes/sec. Is the performance bottleneck the bus or the memory?

Bus: assume a 1 MHz bus with D=1, O=3, W=2:
T_basic = (1+3) x 612,000/2 = 1,224,000 cycles = 1.224 sec.

Memory: try burst mode with B=4, D=1, O=4, and width W=0.5:
T_mem = (4x1+4) x 612,000/(4x0.5) = 2,448,000 cycles = 0.2448 sec.


Performance spreadsheet
D   O   N        T_basic     t
1   3   612000   1224000     1.22E+00

D   O   B   N        T_mem       t
1   4   4   612000   2448000     2.45E-02

Parallelism
Speed things up by running several units at once. DMA provides parallelism if the CPU doesn't need the bus:
the DMA engine uses the bus while the CPU computes.

Expression simplification
A machine-independent program transformation. Constant folding:
8+1 = 9

Algebraic simplification:
a*b + a*c = a*(b+c)


Dead code elimination


Code that will never be executed can be eliminated by analysis of control flow and constant folding:

#define DEBUG 0
...
if (DEBUG) print_debug_stuff();

Procedure inlining
Inlining substitutes the body of the procedure at the call site, eliminating procedure linkage overhead (call, return, argument passing).
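A small illustrative sketch (hypothetical function square):

int square(int x) { return x * x; }

y = square(a);   /* original call: incurs call/return linkage */
y = a * a;       /* after inlining: linkage overhead is gone  */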

Loop transformations
Goals:
reduce loop overhead; increase opportunities for pipelining; improve memory system performance.

Loop unrolling

Reduces loop overhead and enables some other optimizations.

for (i=0; i<4; i++)
  a[i] = b[i] * c[i];

becomes

a[0] = b[0]*c[0];
a[1] = b[1]*c[1];
a[2] = b[2]*c[2];
a[3] = b[3]*c[3];

The loop can also be partially unrolled:

for (i=0; i<2; i++) {
  a[i*2] = b[i*2] * c[i*2];
  a[i*2+1] = b[i*2+1] * c[i*2+1];
}

Unrolling may interfere with the cache and expands the amount of code required.

Loop fusion

Fusion combines two loops into one:

for (i=0; i<N; i++)
  a[i] = b[i] * 5;
for (j=0; j<N; j++)
  w[j] = c[j] * d[j];

becomes

for (i=0; i<N; i++) {
  a[i] = b[i] * 5;
  w[i] = c[i] * d[i];
}

The loops must iterate over the same values, and the loop bodies must not have dependencies that would be violated if they are executed together.

Loop distribution
Distribution breaks one loop into two. Changes optimizations within loop body.
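Distribution is the inverse of fusion; a minimal sketch on the same loops used in the fusion example above:

/* before: one loop body with two independent statements */
for (i=0; i<N; i++) {
  a[i] = b[i] * 5;
  w[i] = c[i] * d[i];
}

/* after distribution: each loop can be optimized separately */
for (i=0; i<N; i++)
  a[i] = b[i] * 5;
for (i=0; i<N; i++)
  w[i] = c[i] * d[i];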

Loop tiling
Breaks one loop into a set of nested loops. Changes order of accesses within array.
Changes cache behavior.


Loop tiling example

for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    c[i] = a[i,j]*b[i];

becomes

for (i=0; i<N; i+=2)
  for (j=0; j<N; j+=2)
    for (ii=i; ii<min(i+2,N); ii++)
      for (jj=j; jj<min(j+2,N); jj++)
        c[ii] = a[ii,jj]*b[ii];

Array padding
Array padding adds dummy data elements to an array in order to change the array's layout in the cache; it can reduce the number of cache conflicts during loop execution.
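A minimal sketch (array sizes are illustrative; the right amount of padding depends on the cache geometry):

int a[128][64];    /* rows are a power-of-two length: successive rows
                      can map onto the same cache sets and conflict   */
int a[128][65];    /* padded with one dummy element per row: successive
                      rows now start in different cache sets           */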

Register allocation
Goals:
choose a register to hold each variable; determine the lifespan of each variable in its register; choose assignments of variables to registers to minimize the total number of registers required.

w = a + b;
x = c + w;
y = c + d;

Register lifetime graph


[Lifetime graph over t=1..3 for the statements above: a and b are live at t=1; w from t=1 to t=2; c from t=2 to t=3; x at t=2; d and y at t=3. At most three values are live at any time.]


Register assignment

LDR r0,[p_a]   ; load a into r0 using pointer to a (p_a)
LDR r1,[p_b]   ; load b into r1
ADD r3,r0,r1   ; compute a + b
STR r3,[p_w]   ; w = a + b
LDR r2,[p_c]   ; load c into r2
ADD r0,r2,r3   ; compute c + w, reusing r0 for x
STR r0,[p_x]   ; x = c + w
LDR r0,[p_d]   ; load d into r0
ADD r3,r2,r0   ; compute c + d, reusing r3 for y
STR r3,[p_y]   ; y = c + d

If a section of code requires more registers than are available, some values must be spilled to memory temporarily. Spilling registers costs extra CPU time and uses up both instruction and data memory.

Operator scheduling

Different orders of loads, stores, and arithmetic operations can give different execution times on pipelined machines. If we can keep values in registers without having to reread them from main memory, execution time and code size can be reduced. Consider the expression (a + b) * (c - d), or this sequence:

w = a + b; /* statement 1 */
x = c + d; /* statement 2 */
y = x + e; /* statement 3 */
z = a - b; /* statement 4 */

In this order, a and b must stay live until statement 4, so 5 registers are required.


Rescheduling so that values die sooner reduces register pressure; this order needs only 3 registers:

w = a + b; /* statement 1 */
z = a - b; /* statement 2 */
x = c + d; /* statement 3 */
y = x + e; /* statement 4 */

Instruction scheduling
Non-pipelined machines do not need instruction scheduling: any order of instructions that satisfies data dependencies runs equally fast. In pipelined machines, execution time of one instruction depends on the nearby instructions: opcode, operands.

Reservation table
A reservation table relates instructions/time to CPU resources.
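An illustrative reservation table for a hypothetical machine with one ALU and one memory port (not a specific CPU); each entry shows which instruction holds the resource in that cycle:

time   ALU      memory port
t1     instr1   -
t2     -        instr2
t3     instr3   instr2

Here the load (instr2) occupies the memory port for two cycles, but the ALU is free in t3 for instr3, so the two can overlap.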


Software pipelining
Software pipelining is a technique for reordering instructions across several loop iterations to reduce pipeline bubbles: latency in iteration i is hidden by executing instructions from iteration i+1 in the meantime.
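A C-level sketch of the idea on the FIR loop from earlier (illustrative only: a compiler applies this at the instruction level, and f, c, x, and N come from that example; the prologue and epilogue are what the transformation adds):

int i, ci, xi;
ci = c[0]; xi = x[0];              /* prologue: first iteration's loads   */
for (i = 0; i < N - 1; i++) {
  f += ci * xi;                    /* compute with iteration i's values   */
  ci = c[i + 1];                   /* overlap: load iteration i+1 ...     */
  xi = x[i + 1];                   /* ... while the multiply proceeds     */
}
f += ci * xi;                      /* epilogue: last multiply-accumulate  */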

Instruction selection
There may be several ways to implement an operation or sequence of operations. Represent operations as graphs, and match possible instruction sequences (templates) onto the graph.

[Figure: an expression graph containing a multiply feeding an add can be covered either by separate MUL and ADD templates or by a single MADD (multiply-add) template.]


Understanding & using your compiler


Compilers differ in the optimizations they perform. Studying the compiler's assembly language output is a good way to learn what it does. Understand the various optimization levels. Look at mixed compiler/assembler output. Modifying compiler output requires care:
correctness must be preserved.


Interpreters and JIT compilers

Interpreter: translates and executes program statements on the fly, one statement at a time. Only a small amount of memory is needed to hold intermediate representations of the program.

[Figure: structure of a program interpretation system.]

JIT compiler: compiles small sections of code into instructions during program execution; sits somewhere between an interpreter and a stand-alone compiler. It saves the compiled version of the code so that the code does not have to be retranslated the next time it is executed. Eliminates some translation overhead, but often requires more memory.

Program-level performance analysis

Need to understand performance in detail:
real-time behavior, not just typical behavior; on complex platforms.

Program performance is not the same as CPU performance:
pipeline and cache are only windows into the program; we must analyze the entire program.


Complexities of program performance

Varies with input data:
different-length paths; cache effects.

Instruction-level performance variations:
pipeline interlocks; fetch times.

How to measure program performance

Simulate execution of the CPU:
makes CPU state visible.

Measure on a real CPU using a timer:
requires modifying the program to control the timer.

Measure on a real CPU using a logic analyzer:
requires events visible on the pins.

Program performance metrics

Average-case execution time:
typically used in application programming.

Worst-case execution time:
a component in deadline satisfaction.

Best-case execution time:
task-level interactions can cause best-case program behavior to result in worst-case system behavior.

Elements of program performance

Basic program execution time formula:
execution time = program path + instruction timing

Solving these two problems independently helps simplify analysis; they are easier to separate on simpler CPUs.

Accurate performance analysis requires:
assembly/binary code; the execution platform.


Data-dependent paths in an if statement

if (a || b) {    /* T1 */
  if (c)         /* T2 */
    x = r*s+t;   /* A1 */
  else
    y = r+s;     /* A2 */
  z = r+s+u;     /* A3 */
} else {
  if (c)         /* T3 */
    y = r-t;     /* A4 */
}

a  b  c  path
0  0  0  T1=F, T3=F: no assignments
0  0  1  T1=F, T3=T: A4
0  1  0  T1=T, T2=F: A2, A3
0  1  1  T1=T, T2=T: A1, A3
1  0  0  T1=T, T2=F: A2, A3
1  0  1  T1=T, T2=T: A1, A3
1  1  0  T1=T, T2=F: A2, A3
1  1  1  T1=T, T2=T: A1, A3

Paths in a loop
for (i=0, f=0; i<N; i++)
  f = f + c[i] * x[i];

[Flowchart: initialize i=0, f=0; test i<N; if yes, compute f = f + c[i]*x[i], then i = i+1 and repeat the test; if no, exit the loop.]

Instruction timing

Not all instructions take the same amount of time:
multi-cycle instructions; fetches.

Execution times of instructions are not independent:
pipeline interlocks; cache effects.

Execution times may vary with operand values:
floating-point operations; some multi-cycle integer operations.

Measurement-driven performance analysis

Not as easy as it sounds:
must actually have access to the CPU; must know the data inputs that give worst/best-case performance; must make state visible.

Still an important method for performance analysis.


Feeding the program


Need to know the desired input values. May need to write software scaffolding to generate the input values. Software scaffolding may also need to examine outputs to generate feedback-driven inputs.
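A minimal sketch of such scaffolding (illustrative: the fir routine stands in for the code under test, and the output check is only a placeholder for real feedback-driven analysis):

#include <stdio.h>
#include <stdlib.h>

#define N 64

/* Stand-in for the code under test. */
static int fir(const int *c, const int *x, int n) {
    int f = 0;
    for (int i = 0; i < n; i++) f += c[i] * x[i];
    return f;
}

int main(void) {
    int c[N], x[N];
    srand(1);                              /* fixed seed: reproducible runs */
    for (int trial = 0; trial < 1000; trial++) {
        for (int i = 0; i < N; i++) {      /* generate input values */
            c[i] = rand() % 256;
            x[i] = rand() % 256;
        }
        int f = fir(c, x, N);
        /* examine outputs; here just a simple sanity bound check */
        if (f < 0) {
            printf("trial %d: unexpected negative sum\n", trial);
            return 1;
        }
    }
    return 0;
}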

Trace-driven measurement
Trace-driven:
Instrument the program. Save information about the path.

Requires modifying the program. Trace files are large. Widely used for cache analysis.

Physical measurement

In-circuit emulator allows tracing:
affects execution timing.

Logic analyzer can measure behavior at the pins:
the address bus can be analyzed to look for events; code can be modified to make events visible.

Particularly important for real-world input streams.

CPU simulation

Some simulators are less accurate; a cycle-accurate simulator provides exact clock-cycle timing:
the simulator models CPU internals; the simulator writer must know how the CPU works.


SimpleScalar FIR filter simulation

int x[N] = {8, 17, ... };
int c[N] = {1, 2, ... };
main() {
  int i, k, f = 0;
  for (k=0; k<COUNT; k++)
    for (i=0; i<N; i++)
      f += c[i]*x[i];
}

N        total sim cycles   sim cycles per filter execution
100      25,854             259
1,000    155,759            156
10,000   1,451,840          145

Performance optimization motivation

Embedded systems must often meet deadlines:
faster may not be fast enough.

Need to be able to analyze execution time:
worst-case, not typical.

Need techniques for reliably improving execution time.

Programs and performance analysis


Best results come from analyzing optimized instructions, not high-level language code:
non-obvious translations of HLL statements into instructions; code may move; cache effects are hard to predict.

Loop optimizations
Loops are good targets for optimization. Basic loop optimizations:
code motion; induction-variable elimination; strength reduction (x*2 -> x<<1).


Code motion

Moves computations that don't need to be inside the loop out of it:

for (i=0; i<N*M; i++)
  z[i] = a[i] + b[i];

becomes

X = N*M;
for (i=0; i<X; i++)
  z[i] = a[i] + b[i];

Induction variable elimination

Induction variable: the loop index. Consider the loop:

for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    z[i,j] = b[i,j];

Rather than recompute i*M+j for each array in each iteration, share the induction variable between the arrays and increment it at the end of the loop body, as in the sketch below.
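A sketch of the transformed loop (zbinduct is an illustrative name; it assumes z and b are accessed through linearized addresses i*M+j, as described above):

/* before: i*M+j recomputed for every array reference */
for (i = 0; i < N; i++)
  for (j = 0; j < M; j++)
    z[i*M + j] = b[i*M + j];

/* after: one shared induction variable, incremented per iteration */
zbinduct = 0;
for (i = 0; i < N; i++)
  for (j = 0; j < M; j++) {
    z[zbinduct] = b[zbinduct];
    zbinduct++;
  }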

Cache analysis
Loop nest: set of loops, one inside other. Perfect loop nest: no conditionals in nest. Because loops use large quantities of data, cache conflicts are common.
Array conflicts in cache

[Figure: a[0,0] stored at main-memory address 1024 and b[0,0] at address 4099 map into the same line of the cache.]


Array conflicts, contd.


Array elements conflict because they are in the same line, even if they are not mapped to the same location. Solutions:
move one array; pad one of the arrays.

Performance optimization hints


Use registers efficiently. Use page-mode memory accesses. Analyze cache behavior:
instruction conflicts can be handled by rewriting code or rescheduling; conflicting scalar data can easily be moved; conflicting array data can be moved or padded.
