Pipelining
Several instructions are executed simultaneously, each at a different stage of completion. Pipelining increases the efficiency of the CPU. Various conditions can cause pipeline bubbles that reduce utilization:
branches; memory system delays.
Performance measures
Latency: time it takes for an instruction to get through the pipeline. Throughput: number of instructions executed per time period. Pipelining increases throughput without reducing latency.
12/5/2013
ARM7 pipeline
ARM7 has a 3-stage pipeline:
fetch instruction from memory; decode opcode and operands; execute.
[Figure: add r0,r1,#5 moving through the fetch, decode, and execute pipeline stages]
Each of these operations requires one clock cycle for typical instructions.
Pipeline stalls
If any stage cannot be completed in the same amount of time as the others, the pipeline stalls. Bubbles introduced by a stall increase latency and reduce throughput.
[Figure: pipeline diagram over time — multi-cycle ld instructions occupy the execute stage for extra cycles, delaying the fetch and decode of the following sub and cmp]
Control stalls
Branches often introduce stalls (branch penalty).
Stall time may depend on whether branch is taken.
bne foo
May have to squash instructions that have already started executing. The CPU doesn't know what to fetch until the condition is evaluated.
Delayed branch
To increase pipeline efficiency, the delayed branch mechanism requires that some instructions after the branch are always executed, whether or not the branch is taken. Any instruction in the delayed branch window must be valid for both execution paths.
Only the branch in the loop test may take more than one cycle. BLT loop takes 1 cycle in the best case, 3 in the worst case.
12/5/2013
tloop = tinit + N(tbody + tupdate) + (N-1) ttest,worst + ttest,best

The loop test succeeding (branch taken, loop continues) is the worst case; the loop test failing (falling out of the loop) is the best case.
The high-level power consumption characteristics of CPUs and other system components are derived from the circuits used to build those components.
Function unit idle detection. Memory idle detection. User-configurable IDLE domains allow programmer control of what hardware is shut down.
Power-down costs
Going into a power-down mode costs:
time; energy.
Must determine if going into mode is worthwhile. Can model CPU power states with power state machine.
Bandwidth as performance
Bandwidth applies to several components:
Memory. Bus. CPU fetches.
Different parts of the system run at different clock rates. Different components may have different widths (bus, memory).
Increase bandwidth:
Increase bus width. Increase bus clock rate.
Bus bandwidth
T: # bus cycles. P: time/bus cycle. Total time for transfer:
t = TP.
Tbasic(N) = (D+O)N/W
Tburst(N) = (BD+O)N/(BW)
(D: data cycles per word; O: overhead cycles per transfer; W: bus width in words; B: burst length in words; N: number of words transferred.)
Page modes allow faster access for successive transfers on same page.
[Figure: CPU connected to memory over a bus; annotated A = [(E/w) mod W] + 1]
Performance spreadsheet
D  O  N       T_basic   t
1  3  612000  1224000   1.22E+00

D  O  B  N       T_mem     t
1  4  4  612000  2448000   2.45E-02
Parallelism
Speed things up by running several units at once. DMA provides parallelism if the CPU doesn't need the bus:
DMA + bus. CPU.
Expression simplification
A machine-independent program transformation. Constant folding:
8+1 = 9
Algebraic:
a*b + a*c = a*(b+c)
Procedure inlining
Eliminates procedure linkage overhead.
Loop transformations
Goals:
reduce loop overhead; increase opportunities for pipelining; improve memory system performance.
Loop unrolling
Reduces loop overhead, enables some other optimizations.
for (i=0; i<4; i++) a[i] = b[i] * c[i];
Unrolling may interfere with the cache and expands the amount of code required:

for (i=0; i<2; i++) {
    a[i*2] = b[i*2] * c[i*2];
    a[i*2+1] = b[i*2+1] * c[i*2+1];
}

Loop fusion
Fusion combines two loops into one:

for (i=0; i<N; i++) a[i] = b[i] * 5;
for (j=0; j<N; j++) w[j] = c[j] * d[j];

becomes

for (i=0; i<N; i++) { a[i] = b[i] * 5; w[i] = c[i] * d[i]; }
The loops must iterate over the same values, and the loop bodies must not have dependencies that would be violated if they were executed together.
Loop distribution
Distribution breaks one loop into two. Changes optimizations within loop body.
Loop tiling
Breaks one loop into a set of nested loops. Changes order of accesses within array.
Changes cache behavior.
Array padding
Array padding adds dummy data elements to an array in order to change its layout in the cache. It can reduce the number of cache conflicts during loop execution.
Register allocation
Goals:
choose a register to hold each variable; determine the lifespan of each variable in a register; choose assignments of variables to registers to minimize the total number of required registers.

w = a + b;
x = c + w;
y = c + d;
Register assignment
LDR r0,[p_a]  ; load a into r0 using pointer to a (p_a)
LDR r1,[p_b]  ; load b into r1
ADD r3,r0,r1  ; compute a + b
STR r3,[p_w]  ; w = a + b
LDR r2,[p_c]  ; load c into r2
ADD r0,r2,r3  ; compute c + w, reusing r0 for x
STR r0,[p_x]  ; x = c + w
LDR r0,[p_d]  ; load d into r0
ADD r3,r2,r0  ; compute c + d, reusing r3 for y
STR r3,[p_y]  ; y = c + d
If a section of code requires more registers than are available, some values must be spilled to memory temporarily. Spilling requires extra CPU time and uses up both instruction and data memory.
Operator scheduling
(a + b) * (c - d)

Different orders of loads, stores, and arithmetic operations may result in different execution times on pipelined machines. If we can keep values in registers without having to reread them from main memory, execution time and code size can be reduced.
w = a + b; /* statement 1 */
x = c + d; /* statement 2 */
y = x + e; /* statement 3 */
z = a - b; /* statement 4 */

Requires 5 registers.
Instruction scheduling
Non-pipelined machines do not need instruction scheduling: any order of instructions that satisfies data dependencies runs equally fast. In pipelined machines, the execution time of one instruction depends on the nearby instructions: their opcodes and operands.
Reservation table
A reservation table relates instructions/time to CPU resources.
Software pipelining
A technique for reordering instructions across several loop iterations to reduce pipeline bubbles. Reduces instruction latency in iteration i by inserting instructions from iteration i+1.
Instruction selection
May be several ways to implement an operation or sequence of operations. Represent operations as graphs, match possible instruction sequences onto graph.
[Figure: expression graph with + and * nodes matched against instruction templates — a single MADD (multiply-add) template versus separate MUL and ADD templates]
Paths in a loop
for (i=0, f=0; i<N; i++)
    f = f + c[i] * x[i];

[Figure: CDFG for the loop — initialization i=0, f=0; test i<N; body f = f + c[i]*x[i] followed by i = i+1]
Instruction timing
Not all instructions take the same amount of time.
Multi-cycle instructions. Fetches.
Trace-driven measurement
Trace-driven:
Instrument the program. Save information about the path.
Requires modifying the program. Trace files are large. Widely used for cache analysis.
Physical measurement
In-circuit emulator allows tracing.
Affects execution timing.
CPU simulation
Some simulators are less accurate; a cycle-accurate simulator provides exact clock-cycle timing.
Simulator models CPU internals. Simulator writer must know how CPU works.
Loop optimizations
Loops are good targets for optimization. Basic loop optimizations:
code motion; induction-variable elimination; strength reduction (x*2 -> x<<1).
Code motion
for (i=0; i<N*M; i++)
    z[i] = a[i] + b[i];

The loop bound N*M is recomputed on every test; code motion hoists it out of the loop:

X = N*M;
for (i=0; i<X; i++)
    z[i] = a[i] + b[i];

The same idea applies to a nested loop over a two-dimensional array:

for (i=0; i<N; i++)
    for (j=0; j<M; j++)
        z[i][j] = b[i][j];
Rather than recompute i*M+j for each array in each iteration, share the induction variable between the arrays and increment it at the end of the loop body.
Cache analysis
Loop nest: a set of loops, one inside the other. Perfect loop nest: no conditionals in the nest. Because loops use large quantities of data, cache conflicts are common.
[Figure: a[0,0] and b[0,0] at different main-memory addresses (one at 4099) mapping to the same cache location]