Master SDTW an II
2011 - 2012
Textures in CUDA
- dedicated hardware for linear / bilinear / trilinear filtering
- clamp-to-edge / repeat addressing modes
- integer or normalized coordinates
addressable in 1D, 2D or 3D
Usage:
- CPU code binds data to a texture object
- kernel reads data by calling a fetch function
VSD Curs 10
Texture Addressing
Two ways to bind:
- a global memory address is bound to a texture:
  - only 1D integer addressing
  - no filtering, no addressing modes
- a CUDA array is bound to a texture:
  - 1D, 2D, or 3D
  - float addressing (size-based or normalized)
  - filtering
  - addressing modes (clamp, repeat)
Steps:
- allocate/obtain memory (global linear, or CUDA array)
- create a texture reference object
- bind the texture reference to the memory/array
- when done: unbind the texture
Atomics
Problem: how do you do global communication?
- finish a grid and start a new one
- finish a kernel and start a new one
- all writes from all threads complete before a kernel finishes
step1<<<grid1,blk1>>>(...);
// The system ensures that all
// writes from step1 complete.
step2<<<grid2,blk2>>>(...);
Global communication:
- would need to decompose kernels into before and after parts
- or, write to a predefined memory location
Race conditions:
- thread 0 could have finished execution before thread 1917 started
- or the other way around
- or both are executing at the same time
- answer: not defined by the programming model, can be arbitrary
- CUDA provides atomic operations to deal with this problem
An atomic operation guarantees that only a single thread has access to a piece of memory while an operation completes
- the name atomic comes from the fact that it is uninterruptible
- no dropped data, but ordering is still arbitrary
- different types of atomic instructions: atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}
- more types in Fermi
// Example: Histogram
// Determine frequency of colors in a picture
// colors have already been converted into ints
// Each thread looks at one pixel
// and increments a counter atomically
__global__ void histogram(int* colors, int* buckets)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int c = colors[i];
  atomicAdd(&buckets[c], 1);
}
Example: Workqueue
// For algorithms where the amount of work per item
// is highly non-uniform, it often makes sense
// for threads to continuously grab work from a queue
__global__ void workq(int* work_q, int* q_counter,
                      int* output, int queue_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int q_index = atomicInc(q_counter, queue_max);
  int result = do_work(work_q[q_index]);
  output[i] = result;
}
- atomics are slower than normal load/store
- you can have the whole machine queuing on a single location in memory
- atomics unavailable on G80!
Example: Global Min/Max (Naive)

// If you require the maximum across all threads
// in a grid, you could do it with a single
// global maximum value, but it will be VERY slow
__global__ void global_max(int* values, int* gl_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int val = values[i];
  atomicMax(gl_max, val);
}
Example: Global Min/Max (Better)
// introduce intermediate maximum results, so that
// most threads do not try to update the global max
__global__ void global_max(int* values, int* max,
                           int* regional_maxes, int num_regions)
{
  // i and val as before
  int region = i % num_regions;
  if (atomicMax(&regional_maxes[region], val) < val)
  {
    atomicMax(max, val);
  }
}
Global Min/Max
- single value causes serial bottleneck
- create hierarchy of values for more parallelism
- performance will still be slow, so use judiciously
Performance optimization
- maximize independent parallelism
- maximize arithmetic intensity (math/bandwidth)
- sometimes it's better to recompute than to cache
- even low-parallelism computations can sometimes be faster than transferring data back and forth to the host
- optimize for spatial locality in cached texture memory
- in shared memory, avoid high-degree bank conflicts
- hundreds of times faster than global memory
- threads can cooperate via shared memory
- use one / a few threads to load / compute data shared by all threads
- use it to avoid non-coalesced access
keep resource usage low enough to support multiple active thread blocks per multiprocessor
Memory optimizations
The global, constant, and texture spaces are regions of device memory. Each multiprocessor has:
- a set of registers
- on-chip shared memory
- a read-only constant cache, to speed up access to the constant memory space
- a read-only texture cache, to speed up access to the texture memory space
- optimizing host-device data transfers
- coalescing global data accesses
- using shared memory effectively
Host-Device Data Transfers
- 4 GB/s peak (PCI-e x16 Gen 1) vs. 76 GB/s peak (Tesla C870)
- minimize transfers
  - intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory
- group transfers
  - one large transfer is much better than many small ones
Global and shared memory
Global memory:
- high latency, but launching more threads hides latency
- important to minimize accesses
- coalesce global memory accesses
Shared memory:
- low latency, like a user-managed per-multiprocessor cache
- try to minimize or avoid bank conflicts
Texture and Constant Memory
Texture memory:
- uses the texture cache, also used for graphics
- optimized for 2D spatial locality
- best performance when threads of a warp read locations that are close together in 2D
Constant memory:
- 4 cycles per address read within a single warp
- total cost 4 cycles if all threads in a warp read the same address
- total cost 64 cycles if all threads read different addresses
Global Memory Reads/Writes
- global memory is not cached on G8x
- highest-latency instructions: 400-600 clock cycles
- likely to be a performance bottleneck
- optimizations can greatly increase performance
Coalescing
- a coordinated read by a half-warp (16 threads)
- reads a contiguous region of global memory:
  - 64 bytes: each thread reads a word (int, float, ...)
  - 128 bytes: each thread reads a double-word (int2, float2, ...)
  - 256 bytes: each thread reads a quad-word (int4, float4, ...)
- additional restrictions:
  - the starting address for a region must be a multiple of the region size
  - the kth thread in a half-warp must access the kth element in a block being read
Coalesced Access: Reading floats
Uncoalesced Access: Reading floats
Coalescing: Timing results
Experiment:
- kernel: read a float, increment, write back
- 3M floats (12 MB)
- times averaged over 10K runs
Results:
- 356 µs: coalesced
- 357 µs: coalesced, some threads don't participate
- 3,494 µs: permuted/misaligned thread access
Shared Memory
- hundreds of times faster than global memory
- cache data to reduce global memory accesses
- threads can cooperate via shared memory
- use it to avoid non-coalesced access
Example: thread-local variables
// motivate per-thread variables with
// Ten Nearest Neighbors application
__global__ void ten_nn(float2 *result, float2 *ps,
                       float2 *qs, size_t num_qs)
{
  // p goes in a register
  float2 p = ps[threadIdx.x];
  // per-thread heap goes in off-chip memory
  float2 heap[10];
  // read through num_qs points, maintaining
  // the nearest 10 qs to p in the heap
  ...
  // write out the contents of heap to result
  ...
}
Example: shared variables
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i > 0)
  {
    // each thread loads two elements from global memory
    int x_i = input[i];
    int x_i_minus_one = input[i-1];
    result[i] = x_i - x_i_minus_one;
  }
}
Observations on adj_diff_naive:
- what are the bandwidth requirements of this kernel? two loads per thread
- how many times does this kernel load input[i]? once by thread i, and again by thread i+1
- idea: eliminate the redundancy by sharing data
Example: shared variables
// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
  // shorthand for threadIdx.x
  int tx = threadIdx.x;
  // allocate a __shared__ array, one element per thread
  __shared__ int s_data[BLOCK_SIZE];
  // each thread reads one element to s_data
  unsigned int i = blockDim.x * blockIdx.x + tx;
  s_data[tx] = input[i];
  // avoid race condition: ensure all loads
  // complete before continuing
  __syncthreads();
  ...
}
// optimized version of adjacent difference (continued)
__global__ void adj_diff(int *result, int *input)
{
  ...
  if (tx > 0)
    result[i] = s_data[tx] - s_data[tx-1];
  else if (i > 0)
  {
    // handle thread block boundary
    result[i] = s_data[tx] - input[i-1];
  }
}
Example: shared variables
// when the size of the array isn't known at compile time...
__global__ void adj_diff(int *result, int *input)
{
  // use extern to indicate a __shared__ array will be
  // allocated dynamically at kernel launch time
  extern __shared__ int s_data[];
  ...
}

// pass the size of the per-block array, in bytes, as the third
// argument to the triple chevrons
adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r, i);
Occupancy
- thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy
- occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently
- limited by resource usage: registers and shared memory
Grid size heuristics:
- launch enough blocks so that all multiprocessors have at least one block to execute
- multiple blocks can run concurrently in a multiprocessor
- blocks that aren't waiting at a __syncthreads() keep the hardware busy, subject to resource availability (registers, shared memory)
- blocks are executed in pipelined fashion
- 1000 blocks per grid will scale across multiple generations
The threads that hide each other's latency don't have to belong to the same thread block.
Register Pressure
- hide latency by using more threads per SM
- limiting factors:
  - registers: 8192 per SM, partitioned among concurrent threads
  - shared memory: 16 KB per SM, partitioned among concurrent thread blocks
- pass -maxrregcount=N to the compiler, where N = desired maximum registers/kernel
- at some point, spilling into local memory (LMEM) may occur
Determining resource usage
- compile the kernel code with the -cubin flag to determine register usage
- open the .cubin file with a text editor and look for the "code" section
Optimizing threads per block
- choose threads per block as a multiple of the warp size
- more threads per block == better memory latency hiding
- but more threads per block == fewer registers per thread; kernel invocations can fail if too many registers are used
- heuristics:
  - minimum: 64 threads per block, and only if there are multiple concurrent blocks
  - 192 or 256 threads per block is usually a better choice (usually still enough registers to compile and invoke successfully)
  - this all depends on your computation, so experiment!
GPUs differ in:
- # of multiprocessors
- memory bandwidth
- shared memory size
- register file size
- max. threads per block
An experiment mode can discover and save the optimal configuration.
Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
Carefully partition data according to access patterns:
- read-only: __constant__ memory (fast)
- R/W and shared within block: __shared__ memory (fast)
- R/W within each thread: registers (fast)
- indexed R/W within each thread: local memory (slow)
- R/W inputs/results: cudaMalloc'ed global memory (slow)
Question:
__global__ void race(void)
{
  __shared__ int my_shared_variable;
  my_shared_variable = threadIdx.x;
  // what is the value of
  // my_shared_variable?
}
- this is a race condition
- the result is undefined
- the order in which threads access the variable is undefined without explicit coordination
- use barriers (e.g., __syncthreads) or atomic operations (e.g., atomicAdd) to enforce well-defined semantics
__global__ void share_data(int *input)
{
  __shared__ int data[BLOCK_SIZE];
  data[threadIdx.x] = input[threadIdx.x];
  __syncthreads();
  // the state of the entire data array
  // is now well-defined for all threads
  // in this block
}
// assume *result is initialized to 0
__global__ void sum(int *input, int *result)
{
  atomicAdd(result, input[threadIdx.x]);
  // after this kernel exits, the value of
  // *result will be the sum of the input
}
Resource Contention
__global__ void sum(int *input, int *result)
{
  atomicAdd(result, input[threadIdx.x]);
}
...
// how many threads will contend
// for exclusive access to result?
sum<<<B,N/B>>>(input, result);
Hierarchical Atomics
- per-thread atomicAdd to a __shared__ partial sum
- per-block atomicAdd to the total sum
__global__ void sum(int *input, int *result)
{
  __shared__ int partial_sum;
  // thread 0 is responsible for
  // initializing partial_sum
  if (threadIdx.x == 0)
    partial_sum = 0;
  __syncthreads();
  ...
}
__global__ void sum(int *input, int *result)
{
  ...
  // each thread updates the partial sum
  atomicAdd(&partial_sum, input[threadIdx.x]);
  __syncthreads();
  // thread 0 updates the total sum
  if (threadIdx.x == 0)
    atomicAdd(result, partial_sum);
}
Advice
- use barriers such as __syncthreads to wait until __shared__ data is ready
- prefer barriers to atomics when data access patterns are regular or predictable
- prefer atomics to barriers when data access patterns are sparse or unpredictable
- atomics to __shared__ variables are much faster than atomics to global variables
- don't synchronize or serialize unnecessarily
Parallelization strategy
First Implementation
__global__ void mat_mul(float *a, float *b,
                        float *ab, int width)
{
  // calculate the row & col index of the element
  int row = blockIdx.y*blockDim.y + threadIdx.y;
  int col = blockIdx.x*blockDim.x + threadIdx.x;
  float result = 0;
  // do dot product between row of a and col of b
  for (int k = 0; k < width; ++k)
    result += a[row*width+k] * b[k*width+col];
  ab[row*width+col] = result;
}
What is the lower bound on the bandwidth required to reach peak fp performance?
- GMAC * peak FLOPS = 4 * 805 = 3.2 TB/s (GMAC: bytes of global memory traffic per FLOP; 4 bytes per FLOP in this kernel)
What is the actual memory bandwidth of the GeForce GTX 260?
- 112 GB/s
Then what is an upper bound on the performance of our implementation?
- actual BW / GMAC = 112 / 4 = 28 GFLOPS
- each input element is read by width threads
- load each element into __shared__ memory and have several threads use the local version to reduce the memory bandwidth
Tiled Multiply
- partition the kernel loop into phases
- load a tile of both matrices into __shared__ memory each phase
- each phase, each thread computes a partial result
Better Implementation
__global__ void mat_mul(float *a, float *b,
                        float *ab, int width)
{
  // shorthand
  int tx = threadIdx.x, ty = threadIdx.y;
  int bx = blockIdx.x,  by = blockIdx.y;
  // allocate tiles in __shared__ memory
  __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
  __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];
  // calculate the row & col index
  int row = by*blockDim.y + ty;
  int col = bx*blockDim.x + tx;
  float result = 0;
  // loop over the tiles of the input in phases
  for (int p = 0; p < width/TILE_WIDTH; ++p)
  {
    // collaboratively load tiles into __shared__
    s_a[ty][tx] = a[row*width + (p*TILE_WIDTH + tx)];
    s_b[ty][tx] = b[(p*TILE_WIDTH + ty)*width + col];
    __syncthreads();
    // dot product between row of s_a and col of s_b
    for (int k = 0; k < TILE_WIDTH; ++k)
      result += s_a[ty][k] * s_b[k][tx];
    __syncthreads();
  }
  ab[row*width+col] = result;
}
Use of barriers in mat_mul:
- __syncthreads after all data is loaded into __shared__ memory
- __syncthreads after all data is read from __shared__ memory
- note that the second __syncthreads in phase p guards the load in phase p+1
- guards against using uninitialized data and against overwriting live data
Performance analysis (1024*1024 matrices):
- 64*64 = 4096 thread blocks
- TILE_WIDTH = 16 gives each SM 3 blocks, 768 threads
- full occupancy
- each thread block performs 2 * 256 = 512 32-bit loads for 256 * (2 * 16) = 8,192 fp ops
Optimization Analysis
Implementation | Global Loads          | Throughput   | SLOCs | Relative Improvement | Improvement/SLOC
Original       | 2N^3                  | 10.7 GFLOPS  | 20    | 1x                   | 1x
Improved       | 2N^2 * (N/TILE_WIDTH) | 183.9 GFLOPS | 44    | 17.2x                | 7.8x
- experiment performed on a GT200
- this optimization was clearly worth the effort
- better performance still possible in theory
TILE_SIZE Effects
Effective use of different memory resources reduces the number of accesses to global memory. These resources are finite! The more memory locations each thread requires, the fewer threads an SM can accommodate.
Final Thoughts
- effective use of the CUDA memory hierarchy decreases bandwidth consumption to increase throughput
- use __shared__ memory to eliminate redundant loads from global memory
- use __syncthreads barriers to protect __shared__ data
- use atomics if access patterns are sparse or unpredictable
- optimization comes with a development cost
- memory resources ultimately limit parallelism