
Master SDTW, Year II, 2011 - 2012

Visualization in Distributed Systems (Vizualizarea in sisteme distribuite)

S.l. Dr. ing. Simona Caraiman

VSD - Lecture 10-11

GPU Programming (IV)

CUDA Advanced Topics

Textures in CUDA

A texture is an object for reading data. Benefits:

- data is cached (optimized for 2D spatial locality)
- filtering: linear / bilinear / trilinear, in dedicated hardware
- wrap modes for out-of-bounds addresses: clamp to edge / repeat
- addressable in 1D, 2D or 3D, using integer or normalized coordinates

Usage:

- CPU code binds data to a texture object
- the kernel reads data by calling a fetch function

Textures in CUDA
Texture Addressing

[figure: texture addressing examples]

Textures in CUDA

Two texture types:

Bound to linear memory:
- a global memory address is bound to a texture
- only 1D integer addressing
- no filtering, no addressing modes

Bound to CUDA arrays:
- a CUDA array is bound to a texture
- 1D, 2D or 3D float addressing (size-based or normalized)
- filtering
- addressing modes (clamp, repeat)

CUDA Texturing Steps

Host (CPU) code:
- allocate/obtain memory (global linear memory, or a CUDA array)
- create a texture reference object
- bind the texture reference to the memory/array
- when done: unbind the texture reference, free resources

Device (kernel) code:
- fetch using the texture reference
- linear memory textures: tex1Dfetch()
- array textures: tex1D(), tex2D() or tex3D()
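A minimal sketch of these steps for a texture bound to linear memory (not taken from the original slides; texRef, d_data, d_out and N are illustrative names):

    // Bind a 1D texture to linear global memory and read it with tex1Dfetch().
    texture<float, 1, cudaReadModeElementType> texRef;   // file-scope texture reference

    __global__ void read_through_texture(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(texRef, i);              // cached fetch via the texture
    }

    // Host side:
    //   cudaMalloc(&d_data, N * sizeof(float));
    //   cudaMalloc(&d_out,  N * sizeof(float));
    //   ... fill d_data ...
    //   cudaBindTexture(0, texRef, d_data, N * sizeof(float));
    //   read_through_texture<<<(N + 255) / 256, 256>>>(d_out, N);
    //   cudaUnbindTexture(texRef);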

Atomics

Problem: how do you do global communication?

- Finish a grid and start a new one
- Finish a kernel and start a new one
- All writes from all threads complete before a kernel finishes

    step1<<<grid1,blk1>>>(...);
    // The system ensures that all writes from step1 complete.
    step2<<<grid2,blk2>>>(...);

Atomics

Global communication:
- would need to decompose kernels into "before" and "after" parts
- or, write to a predefined memory location
  - race condition! updates can be lost

    // threadId:1917
    vector[0] += 1;
    ...
    a = vector[0];

    // threadId:0   (vector[0] was equal to 0)
    vector[0] += 5;
    ...
    a = vector[0];

- What is the value of a in thread 0?
- What is the value of a in thread 1917?

Atomics

Race conditions:
- Thread 0 could have finished execution before thread 1917 started
- Or the other way around
- Or both are executing at the same time
- Answer: not defined by the programming model; the result can be arbitrary
- CUDA provides atomic operations to deal with this problem

Atomics

- An atomic operation guarantees that only a single thread has access to a piece of memory while the operation completes
- The name "atomic" comes from the fact that it is uninterruptible
- No dropped data, but ordering is still arbitrary
- Different types of atomic instructions: atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}
- More types on Fermi
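As an illustration of how these primitives compose (a sketch, not from the original slides): on hardware without a native floating-point atomicAdd, the same effect can be built from an atomicCAS retry loop. atomicAddFloat is a hypothetical helper name.

    __device__ float atomicAddFloat(float *address, float val)
    {
        // Reinterpret the float's bits as an int so atomicCAS can operate on it,
        // and retry until no other thread has modified *address in between.
        int *address_as_int = (int *)address;
        int old = *address_as_int, assumed;
        do {
            assumed = old;
            old = atomicCAS(address_as_int, assumed,
                            __float_as_int(val + __int_as_float(assumed)));
        } while (assumed != old);
        return __int_as_float(old);   // returns the previous value, like atomicAdd
    }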

Atomics

Example: Histogram

    // Determine the frequency of colors in a picture.
    // Colors have already been converted into ints.
    // Each thread looks at one pixel and increments a counter atomically.
    __global__ void histogram(int *colors, int *buckets)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int c = colors[i];
        atomicAdd(&buckets[c], 1);
    }

Atomics

Example: Workqueue

    // For algorithms where the amount of work per item is highly
    // non-uniform, it often makes sense for each thread to
    // continuously grab work from a queue.
    __global__ void workq(int *work_q, unsigned int *q_counter,
                          int *output, unsigned int queue_max)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        unsigned int q_index = atomicInc(q_counter, queue_max);
        int result = do_work(work_q[q_index]);
        output[i] = result;
    }

Atomics

- Atomics are slower than normal loads/stores
- You can have the whole machine queuing on a single location in memory
- Atomics are unavailable on G80!


Atomics

Example: Global Min/Max (Naive)

    // If you require the maximum across all threads in a grid, you
    // could do it with a single global maximum value, but it will
    // be VERY slow.
    __global__ void global_max(int *values, int *gl_max)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int val = values[i];
        atomicMax(gl_max, val);
    }

Atomics

Example: Global Min/Max (Better)

    // Introduce intermediate maximum results, so that most threads
    // do not try to update the global max directly.
    __global__ void global_max(int *values, int *gl_max,
                               int *regional_maxes, int num_regions)
    {
        // i and val as before
        int region = i % num_regions;
        if (atomicMax(&regional_maxes[region], val) < val) {
            atomicMax(gl_max, val);
        }
    }

Atomics

Global Min/Max:
- A single value causes a serial bottleneck
- Create a hierarchy of values for more parallelism
- Performance will still be slow, so use atomics judiciously
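A sketch of the "hierarchy of values" idea (not from the original slides; assumes hardware that supports shared-memory atomics): each block first reduces to one per-block maximum, so only one thread per block touches the global maximum.

    __global__ void global_max_hier(int *values, int *gl_max)
    {
        __shared__ int block_max;                // one intermediate maximum per block
        int i = threadIdx.x + blockDim.x * blockIdx.x;

        if (threadIdx.x == 0)
            block_max = values[i];               // seed with one element of this block
        __syncthreads();

        atomicMax(&block_max, values[i]);        // contention stays inside the block
        __syncthreads();

        if (threadIdx.x == 0)
            atomicMax(gl_max, block_max);        // a single global atomic per block
    }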

Performance optimization

Overview:
- Memory Optimizations
- Execution Configuration Optimizations
- Examples

Performance optimization - Overview

Optimize algorithms for the GPU:
- maximize independent parallelism
- maximize arithmetic intensity (math/bandwidth)
- sometimes it's better to recompute than to cache
  - the GPU spends its transistors on ALUs, not memory
- do more computation on the GPU to avoid costly data transfers
  - even low-parallelism computations can sometimes be faster than transferring back and forth to the host

Performance optimization - Overview

Optimize memory access:
- Coalesced vs. non-coalesced access to global/local device memory = an order of magnitude difference
- Optimize for spatial locality in cached texture memory
- In shared memory, avoid high-degree bank conflicts

Performance optimization - Overview

Take advantage of shared memory:
- hundreds of times faster than global memory
- threads can cooperate via shared memory
- use one / a few threads to load / compute data shared by all threads
- use it to avoid non-coalesced access
  - stage loads and stores in shared memory to reorder non-coalesceable addressing

Performance optimization - Overview

Use parallelism efficiently:
- partition your computation to keep the GPU multiprocessors equally busy
  - many threads, many thread blocks
- keep resource usage low enough to support multiple active thread blocks per multiprocessor
  - registers, shared memory

Memory optimizations

The global, constant and texture spaces are regions of device memory. Each multiprocessor has:
- a set of 32-bit registers per processor
- on-chip shared memory, where the shared memory space resides
- a read-only constant cache, to speed up access to the constant memory space
- a read-only texture cache, to speed up access to the texture memory space

Memory optimizations

- Optimizing host-device data transfers
- Coalescing global data accesses
- Using shared memory effectively

Memory optimizations
Host-Device Data Transfers

- Device-to-host memory bandwidth is much lower than device-to-device memory bandwidth
  - 4 GB/s peak (PCI-e x16 Gen 1) vs. 76 GB/s peak (Tesla C870)
- Minimize transfers
  - intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory
- Group transfers
  - one large transfer is much better than many small ones
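A host-side sketch of "group transfers" (not from the original slides; h_a, h_b, h_c, h_staging, d_packed and N are illustrative): three small copies are packed into one staging buffer so the PCI-e overhead is paid once.

    #include <cuda_runtime.h>
    #include <string.h>

    void upload_grouped(const float *h_a, const float *h_b, const float *h_c,
                        float *h_staging, float *d_packed, int N)
    {
        // pack the three small arrays contiguously on the host...
        memcpy(h_staging,         h_a, N * sizeof(float));
        memcpy(h_staging + N,     h_b, N * sizeof(float));
        memcpy(h_staging + 2 * N, h_c, N * sizeof(float));

        // ...and issue a single large host-to-device transfer
        cudaMemcpy(d_packed, h_staging, 3 * N * sizeof(float),
                   cudaMemcpyHostToDevice);
    }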

Memory optimizations
Global and shared memory

Global memory is not cached on G8x GPUs:
- high latency, but launching more threads hides the latency
- important to minimize accesses
- coalesce global memory accesses

Shared memory is on-chip, with very high bandwidth:
- low latency
- like a user-managed per-multiprocessor cache
- try to minimize or avoid bank conflicts

Memory optimizations
Texture and Constant Memory

The texture partition is cached:
- uses the texture cache, which is also used for graphics
- optimized for 2D spatial locality
- best performance when threads of a warp read locations that are close together in 2D
- 4 cycles per address read within a single warp

Constant memory is cached:
- total cost of 4 cycles if all threads in a warp read the same address
- total cost of 64 cycles if all threads read different addresses
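A sketch of the constant-memory access pattern described above (not from the original slides; coeffs, NUM_COEFFS and eval_poly are illustrative names): every thread of a warp reads the same coeffs[k] in each iteration, which is the broadcast case of the constant cache.

    #define NUM_COEFFS 16
    __constant__ float coeffs[NUM_COEFFS];        // lives in cached constant memory

    // Evaluate a fixed polynomial at each input value (Horner's rule).
    __global__ void eval_poly(const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float acc = coeffs[0];
        for (int k = 1; k < NUM_COEFFS; ++k)
            acc = acc * x[i] + coeffs[k];         // all threads read the same address
        y[i] = acc;
    }

    // Host side: fill the table once before launching.
    //   float h_coeffs[NUM_COEFFS] = { ... };
    //   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));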

Memory optimizations
Global Memory Reads/Writes

- Global memory is not cached on G8x
- Highest-latency instructions: 400-600 clock cycles
- Likely to be a performance bottleneck
- Optimizations can greatly increase performance

Memory optimizations

Coalescing: a coordinated read by a half-warp (16 threads) of a contiguous region of global memory:
- 64 bytes - each thread reads a word: int, float, ...
- 128 bytes - each thread reads a double-word: int2, float2, ...
- 256 bytes - each thread reads a quad-word: int4, float4, ...

Additional restrictions:
- the starting address for a region must be a multiple of the region size
- the k-th thread in a half-warp must access the k-th element in the block being read
- exception: not all threads must be participating
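Two toy kernels contrasting these patterns (a sketch, not from the original slides): in copy_coalesced the k-th thread of the half-warp reads the k-th consecutive float, so the reads combine into one transaction; in copy_strided they do not.

    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread k -> element k
        if (i < n) out[i] = in[i];
    }

    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];                       // non-contiguous: uncoalesced on G8x
    }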

Memory optimizations
Coalesced Access: Reading floats

[figure: coalesced read patterns for a half-warp reading floats]

Memory optimizations
Uncoalesced Access: Reading floats

[figure: uncoalesced read patterns - permuted / misaligned half-warp accesses]

Memory optimizations
Coalescing: Timing results

Experiment:
- kernel: read a float, increment, write back
- 3M floats (12 MB)
- times averaged over 10K runs

12K blocks x 256 threads:
- 356 µs - coalesced
- 357 µs - coalesced, some threads don't participate
- 3,494 µs - permuted/misaligned thread access

Memory optimizations
Shared Memory

- Hundreds of times faster than global memory
- Cache data to reduce global memory accesses
- Threads can cooperate via shared memory
- Use it to avoid non-coalesced access
  - stage loads and stores in shared memory to reorder non-coalesceable addressing

Memory optimizations
Example: thread-local variables

    // Motivate per-thread variables with a
    // Ten Nearest Neighbors application.
    __global__ void ten_nn(float2 *result, float2 *ps, float2 *qs,
                           size_t num_qs)
    {
        // p goes in a register
        float2 p = ps[threadIdx.x];

        // the per-thread heap goes in off-chip (local) memory
        float2 heap[10];

        // read through num_qs points, maintaining
        // the nearest 10 qs to p in the heap
        ...

        // write out the contents of heap to result
        ...
    }

Memory optimizations
Example: shared variables

    // Motivate shared variables with an
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

        if (i > 0) {
            // each thread loads two elements from global memory
            int x_i = input[i];
            int x_i_minus_one = input[i-1];

            result[i] = x_i - x_i_minus_one;
        }
    }

Memory optimizations
Example: shared variables

    // Motivate shared variables with an
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

        if (i > 0) {
            // what are the bandwidth requirements of this kernel?
            int x_i = input[i];               // two loads
            int x_i_minus_one = input[i-1];

            result[i] = x_i - x_i_minus_one;
        }
    }

Memory optimizations
Example: shared variables

    // Motivate shared variables with an
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

        if (i > 0) {
            // How many times does this kernel load input[i]?
            int x_i = input[i];               // once by thread i
            int x_i_minus_one = input[i-1];   // again by thread i+1

            result[i] = x_i - x_i_minus_one;
        }
    }

Memory optimizations
Example: shared variables

    // Motivate shared variables with an
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

        if (i > 0) {
            // Idea: eliminate redundancy by sharing data
            int x_i = input[i];
            int x_i_minus_one = input[i-1];

            result[i] = x_i - x_i_minus_one;
        }
    }

Memory optimizations
Example: shared variables

    // optimized version of adjacent difference
    __global__ void adj_diff(int *result, int *input)
    {
        // shorthand for threadIdx.x
        int tx = threadIdx.x;

        // allocate a __shared__ array, one element per thread
        __shared__ int s_data[BLOCK_SIZE];

        // each thread reads one element to s_data
        unsigned int i = blockDim.x * blockIdx.x + tx;
        s_data[tx] = input[i];

        // avoid race condition: ensure all loads
        // complete before continuing
        __syncthreads();

        ...
    }

Memory optimizations
Example: shared variables

    // optimized version of adjacent difference
    __global__ void adj_diff(int *result, int *input)
    {
        ...
        if (tx > 0)
            result[i] = s_data[tx] - s_data[tx-1];
        else if (i > 0) {
            // handle the thread block boundary
            result[i] = s_data[tx] - input[i-1];
        }
    }

Memory optimizations
Example: shared variables

    // when the size of the array isn't known at compile time...
    __global__ void adj_diff(int *result, int *input)
    {
        // use extern to indicate a __shared__ array will be
        // allocated dynamically at kernel launch time
        extern __shared__ int s_data[];
        ...
    }

    // pass the size of the per-block array, in bytes, as the third
    // argument to the triple chevrons
    adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r, i);

Execution Configuration Optimizations

Occupancy:
- Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy
- Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently
- Occupancy is limited by resource usage:
  - registers
  - shared memory
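As a worked example using the figures quoted later in these slides (up to 768 concurrent threads per multiprocessor, i.e. 24 warps of 32): a kernel launched with 256 threads per block (8 warps) whose register and shared-memory usage allows 3 concurrent blocks reaches 24/24 = 100% occupancy, while the same kernel limited to 2 concurrent blocks reaches only 16/24, about 67%.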

Execution Configuration Optimizations

Grid/Block size heuristics:
- # of blocks > # of multiprocessors
  - so all multiprocessors have at least one block to execute
- # of blocks / # of multiprocessors > 2
  - multiple blocks can run concurrently on a multiprocessor
  - blocks that aren't waiting at a __syncthreads() keep the hardware busy
  - subject to resource availability: registers, shared memory
- # of blocks > 100 to scale to future devices
  - blocks are executed in pipeline fashion
  - 1000 blocks per grid will scale across multiple generations

Execution Configuration Optimizations

Register Dependency:
- Read-after-write register dependency
- An instruction's result can be read ~11 cycles later
- Scenario: [example of dependent instructions lost in extraction]
- To completely hide the latency:
  - run at least 192 threads (6 warps) per multiprocessor
  - at least 25% occupancy
  - the threads don't have to belong to the same thread block

Execution Configuration Optimizations

Register Pressure:
- Hide latency by using more threads per SM
- Limiting factors:
  - number of registers per kernel: 8192 per SM, partitioned among concurrent threads
  - amount of shared memory: 16 KB per SM, partitioned among concurrent thread blocks
- Compile with the --ptxas-options=-v flag
- Use the --maxrregcount=N flag to NVCC
  - N = desired maximum registers per kernel
  - at some point spilling into local memory (LMEM) may occur
  - this reduces performance - LMEM is slow
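A command-line sketch of the two flags above (kernel.cu and the register limit 32 are illustrative):

    nvcc --ptxas-options=-v --maxrregcount=32 -o app kernel.cu

With -v, ptxas then reports the per-kernel register, shared, constant and local memory usage at compile time.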

Execution Configuration Optimizations

Determining resource usage:
- compile the kernel code with the -cubin flag to determine register usage
- open the .cubin file with a text editor and look for the "code" section

Execution Configuration Optimizations

Optimizing threads per block:
- Choose threads per block as a multiple of the warp size
  - avoid wasting computation on under-populated warps
- More threads per block == better memory latency hiding
- But more threads per block == fewer registers per thread
  - kernel invocations can fail if too many registers are used
- Heuristics:
  - minimum: 64 threads per block, and only if there are multiple concurrent blocks
  - 192 or 256 threads is a better choice
    - usually still enough registers to compile and invoke successfully
  - this all depends on your computation, so experiment!

Execution Configuration Optimizations

Occupancy != Performance:
- Increasing occupancy does not necessarily increase performance
- BUT low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels
- It all comes down to arithmetic intensity and available parallelism

Execution Configuration Optimizations

Parameterize your application:
- Parameterization helps adaptation to different GPUs
- GPUs vary in many ways:
  - # of multiprocessors
  - memory bandwidth
  - shared memory size
  - register file size
  - max. threads per block
- You can even make apps self-tuning
  - an "experiment" mode discovers and saves the optimal configuration
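A sketch of run-time parameterization (not from the original slides): cudaGetDeviceProperties exposes exactly the quantities listed above, so launch parameters can be chosen when the program starts.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0

        printf("multiprocessors:        %d\n", prop.multiProcessorCount);
        printf("shared memory / block:  %zu bytes\n", prop.sharedMemPerBlock);
        printf("registers / block:      %d\n", prop.regsPerBlock);
        printf("max threads / block:    %d\n", prop.maxThreadsPerBlock);

        // e.g. pick a block size that is a multiple of the warp size
        int block_size = (prop.maxThreadsPerBlock >= 256) ? 256
                                                          : prop.maxThreadsPerBlock;
        printf("chosen block size:      %d\n", block_size);
        return 0;
    }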

A Common Programming Strategy

- Global memory resides in device memory (DRAM)
  - much slower access than shared memory
- Tile data to take advantage of fast shared memory:
  - generalize from the adjacent_difference example
  - divide and conquer

A Common Programming Strategy

1. Partition data into subsets that fit into shared memory
2. Handle each data subset with one thread block
3. Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
4. Perform the computation on the subset from shared memory
5. Copy the result from shared memory back to global memory

A Common Programming Strategy

Carefully partition data according to access patterns:
- Read-only -> __constant__ memory (fast)
- R/W & shared within a block -> __shared__ memory (fast)
- R/W within each thread -> registers (fast)
- Indexed R/W within each thread -> local memory (slow)
- R/W inputs/results -> cudaMalloc'ed global memory (slow)

Communication Through Memory

Question:

    __global__ void race(void)
    {
        __shared__ int my_shared_variable;
        my_shared_variable = threadIdx.x;

        // what is the value of
        // my_shared_variable?
    }

Communication Through Memory

- This is a race condition
- The result is undefined
- The order in which threads access the variable is undefined without explicit coordination
- Use barriers (e.g., __syncthreads) or atomic operations (e.g., atomicAdd) to enforce well-defined semantics

Communication Through Memory

Use __syncthreads to ensure data is ready for access:

    __global__ void share_data(int *input)
    {
        __shared__ int data[BLOCK_SIZE];
        data[threadIdx.x] = input[threadIdx.x];
        __syncthreads();
        // the state of the entire data array
        // is now well-defined for all threads
        // in this block
    }

Communication Through Memory

Use atomic operations to ensure exclusive access to a variable:

    // assume *result is initialized to 0
    __global__ void sum(int *input, int *result)
    {
        atomicAdd(result, input[threadIdx.x]);
        // after this kernel exits, the value of
        // *result will be the sum of the input
    }

Resource Contention

Atomic operations aren't cheap! They imply serialized access to a variable.

    __global__ void sum(int *input, int *result)
    {
        atomicAdd(result, input[threadIdx.x]);
    }

    ...
    // how many threads will contend
    // for exclusive access to result?
    sum<<<B, N/B>>>(input, result);

Hierarchical Atomics

Divide & Conquer:
- Per-thread atomicAdd to a __shared__ partial sum
- Per-block atomicAdd to the total sum

Hierarchical Atomics

    __global__ void sum(int *input, int *result)
    {
        __shared__ int partial_sum;

        // thread 0 is responsible for
        // initializing partial_sum
        if (threadIdx.x == 0)
            partial_sum = 0;
        __syncthreads();

        ...
    }

Hierarchical Atomics

    __global__ void sum(int *input, int *result)
    {
        ...

        // each thread updates the partial sum
        atomicAdd(&partial_sum, input[threadIdx.x]);
        __syncthreads();

        // thread 0 updates the total sum
        if (threadIdx.x == 0)
            atomicAdd(result, partial_sum);
    }
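The two fragments above, combined into one complete kernel as a sketch. The global index i and the launch line are assumptions added here so that several blocks cover different parts of the input; the fragments themselves index with threadIdx.x.

    __global__ void sum(int *input, int *result)
    {
        __shared__ int partial_sum;

        if (threadIdx.x == 0)                 // thread 0 initializes the block's partial sum
            partial_sum = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        atomicAdd(&partial_sum, input[i]);    // contention only within this block
        __syncthreads();

        if (threadIdx.x == 0)
            atomicAdd(result, partial_sum);   // one global atomic per block
    }

    // e.g. sum<<<B, N/B>>>(d_input, d_result);   // *d_result assumed initialized to 0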

Advice

- Use barriers such as __syncthreads to wait until __shared__ data is ready
- Prefer barriers to atomics when data access patterns are regular or predictable
- Prefer atomics to barriers when data access patterns are sparse or unpredictable
- Atomics to __shared__ variables are much faster than atomics to global variables
- Don't synchronize or serialize unnecessarily

Matrix Multiplication Example

- Generalize the adjacent_difference example
- AB = A * B
- Each element AB[i][j] = dot(row(A,i), col(B,j))
- Parallelization strategy:
  - one thread per element AB[i][j]
  - 2D kernel

First Implementation

    __global__ void mat_mul(float *a, float *b, float *ab, int width)
    {
        // calculate the row & col index of the element
        int row = blockIdx.y*blockDim.y + threadIdx.y;
        int col = blockIdx.x*blockDim.x + threadIdx.x;

        float result = 0;

        // do dot product between row of a and col of b
        for (int k = 0; k < width; ++k)
            result += a[row*width+k] * b[k*width+col];

        ab[row*width+col] = result;
    }

How will this perform?

- How many loads per term of the dot product? 2 (a & b) = 8 bytes
- How many floating-point operations? 2 (multiply & addition)
- Global memory access to flop ratio (GMAC)? 8 bytes / 2 ops = 4 B/op
- What is the peak fp performance of a GeForce GTX 260? 805 GFLOPS
- Lower bound on bandwidth required to reach peak fp performance? GMAC * peak FLOPS = 4 * 805 = 3.2 TB/s
- What is the actual memory bandwidth of a GeForce GTX 260? 112 GB/s
- So what is an upper bound on the performance of our implementation? Actual BW / GMAC = 112 / 4 = 28 GFLOPS

Idea: Use __shared__ memory to reuse global data

- Each input element is read by width threads
- Load each element into __shared__ memory and have several threads use the local copy to reduce the memory bandwidth

Tiled Multiply

- Partition the kernel loop into phases
- In each phase, load a TILE_WIDTH x TILE_WIDTH tile of both matrices into __shared__ memory
- In each phase, each thread computes a partial result

Better Implementation

    __global__ void mat_mul(float *a, float *b, float *ab, int width)
    {
        // shorthand
        int tx = threadIdx.x, ty = threadIdx.y;
        int bx = blockIdx.x,  by = blockIdx.y;

        // allocate tiles in __shared__ memory
        __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
        __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];

        // calculate the row & col index
        int row = by*blockDim.y + ty;
        int col = bx*blockDim.x + tx;

        float result = 0;

Better Implementation

        // loop over the tiles of the input in phases
        for (int p = 0; p < width/TILE_WIDTH; ++p) {
            // collaboratively load tiles into __shared__
            s_a[ty][tx] = a[row*width + (p*TILE_WIDTH + tx)];
            s_b[ty][tx] = b[(p*TILE_WIDTH + ty)*width + col];
            __syncthreads();

            // dot product between row of s_a and col of s_b
            for (int k = 0; k < TILE_WIDTH; ++k)
                result += s_a[ty][k] * s_b[k][tx];
            __syncthreads();
        }

        ab[row*width+col] = result;
    }
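A launch sketch for the tiled kernel (not from the original slides; d_a, d_b, d_ab and width are illustrative, and width is assumed to be a multiple of TILE_WIDTH):

    #define TILE_WIDTH 16

    dim3 block(TILE_WIDTH, TILE_WIDTH);                 // one thread per output element of a tile
    dim3 grid(width / TILE_WIDTH, width / TILE_WIDTH);  // one block per output tile
    mat_mul<<<grid, block>>>(d_a, d_b, d_ab, width);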

Use of Barriers in mat_mul

Two barriers per phase:
- __syncthreads after all data is loaded into __shared__ memory
- __syncthreads after all data is read from __shared__ memory
- note that the second __syncthreads in phase p guards the load in phase p+1

Use barriers to guard data:
- guard against using uninitialized data
- guard against bashing live data

First Order Size Considerations

- Each thread block should have many threads
  - TILE_WIDTH = 16 gives 16*16 = 256 threads
- There should be many thread blocks
  - 1024*1024 matrices give 64*64 = 4096 thread blocks
  - TILE_WIDTH = 16 gives each SM 3 blocks, 768 threads - full occupancy
- Each thread block performs 2 * 256 = 512 32-bit loads for 256 * (2 * 16) = 8,192 fp ops
  - memory bandwidth is no longer a limiting factor

Optimization Analysis

    Implementation | Global Loads          | Throughput   | SLOCs | Relative Improvement | Improvement/SLOC
    Original       | 2N^3                  | 10.7 GFLOPS  | 20    | 1x                   | 1x
    Improved       | 2N^2 * (N/TILE_WIDTH) | 183.9 GFLOPS | 44    | 17.2x                | 7.8x

- Experiment performed on a GT200
- This optimization was clearly worth the effort
- Better performance is still possible in theory

TILE_SIZE Effects

[figure: performance as a function of TILE_SIZE]

Memory Resources as Limit to Parallelism

    Resource          | Per GT200 SM | Full Occupancy on GT200
    Registers         | 16384        | <= 16384 / 768 threads = 21 per thread
    __shared__ memory | 16 KB        | <= 16 KB / 8 blocks = 2 KB per block

- Effective use of the different memory resources reduces the number of accesses to global memory
- These resources are finite!
- The more memory locations each thread requires, the fewer threads an SM can accommodate

Final Thoughts

- Effective use of the CUDA memory hierarchy decreases bandwidth consumption and increases throughput
- Use __shared__ memory to eliminate redundant loads from global memory
  - use __syncthreads barriers to protect __shared__ data
  - use atomics if access patterns are sparse or unpredictable
- Optimization comes with a development cost
- Memory resources ultimately limit parallelism
