Master SDTW an II
2011 - 2012
Textures in CUDA
- dedicated hardware for linear / bilinear / trilinear filtering
- clamp-to-edge / repeat addressing modes
- integer or normalized coordinates
addressable in 1D, 2D or 3D
Usage:
- CPU code binds data to a texture object
- kernel reads data by calling a fetch function
VSD Curs 10
Texture Addressing
Two ways to bind:
- a global memory address is bound to a texture:
  - only 1D integer addressing
  - no filtering, no addressing modes
- a CUDA array is bound to a texture:
  - 1D, 2D, or 3D
  - float addressing (size-based or normalized)
  - filtering
  - addressing modes (clamp, repeat)
Steps:
- allocate/obtain memory (global linear, or CUDA array)
- create a texture reference object
- bind the texture reference to the memory/array
- when done: unbind the texture
Atomics
Problem: how do you do global communication?
- finish a grid and start a new one
- finish a kernel and start a new one
- all writes from all threads complete before a kernel finishes
step1<<<grid1,blk1>>>(...);
// The system ensures that all
// writes from step1 complete.
step2<<<grid2,blk2>>>(...);
Global communication:
- would need to decompose kernels into before and after parts
- or, write to a predefined memory location
Race conditions:
- thread 0 could have finished execution before thread 1917 started
- or the other way around
- or both are executing at the same time
- answer: not defined by the programming model, can be arbitrary
- CUDA provides atomic operations to deal with this problem
An atomic operation guarantees that only a single thread has access to a piece of memory while an operation completes
- the name atomic comes from the fact that it is uninterruptible
- no dropped data, but ordering is still arbitrary
- different types of atomic instructions: atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}
- more types in Fermi
// Example: Histogram
// Determine frequency of colors in a picture
// colors have already been converted into ints
// Each thread looks at one pixel
// and increments a counter atomically
__global__ void histogram(int* colors, int* buckets)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int c = colors[i];
  atomicAdd(&buckets[c], 1);
}
Example: Workqueue
// For algorithms where the amount of work per item
// is highly non-uniform, it often makes sense
// for threads to continuously grab work from a queue
__global__ void workq(int* work_q, int* q_counter,
                      int* output, int queue_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int q_index = atomicInc(q_counter, queue_max);
  int result = do_work(work_q[q_index]);
  output[i] = result;
}
- atomics are slower than normal load/store
- you can have the whole machine queuing on a single location in memory
- atomics unavailable on G80!
Example: Global Min/Max (Naive)

// If you require the maximum across all threads
// in a grid, you could do it with a single
// global maximum value, but it will be VERY slow
__global__ void global_max(int* values, int* gl_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int val = values[i];
  atomicMax(gl_max, val);
}
Example: Global Min/Max (Better)
// introduce intermediate maximum results, so that
// most threads do not try to update the global max
__global__ void global_max(int* values, int* max,
                           int* regional_maxes, int num_regions)
{
  // i and val as before
  int region = i % num_regions;
  if (atomicMax(&regional_maxes[region], val) < val)
  {
    atomicMax(max, val);
  }
}
Global Min/Max
- single value causes serial bottleneck
- create hierarchy of values for more parallelism
- performance will still be slow, so use judiciously
Performance optimization
- maximize independent parallelism
- maximize arithmetic intensity (math/bandwidth)
- sometimes it's better to recompute than to cache
- even low-parallelism computations can sometimes be faster than transferring data back and forth to the host
- optimize for spatial locality in cached texture memory
- in shared memory, avoid high-degree bank conflicts
- hundreds of times faster than global memory
- threads can cooperate via shared memory
- use one / a few threads to load / compute data shared by all threads
- use it to avoid non-coalesced access
keep resource usage low enough to support multiple active thread blocks per multiprocessor
Memory optimizations
The global, constant, and texture spaces are regions of device memory. Each multiprocessor has:
- a set of registers
- on-chip shared memory
- a read-only constant cache, to speed up access to the constant memory space
- a read-only texture cache, to speed up access to the texture memory space
- optimizing host-device data transfers
- coalescing global data accesses
- using shared memory effectively
Host-Device Data Transfers
- 4 GB/s peak (PCI-e x16 Gen 1) vs. 76 GB/s peak (Tesla C870)
- minimize transfers
  - intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory
- group transfers
  - one large transfer is much better than many small ones
Global and shared memory
Global memory:
- high latency, but launching more threads hides latency
- important to minimize accesses
- coalesce global memory accesses
Shared memory:
- low latency, like a user-managed per-multiprocessor cache
- try to minimize or avoid bank conflicts
Texture and Constant Memory
Texture memory:
- uses the texture cache, also used for graphics
- optimized for 2D spatial locality
- best performance when threads of a warp read locations that are close together in 2D
Constant memory:
- 4 cycles per address read within a single warp
- total cost 4 cycles if all threads in a warp read the same address
- total cost 64 cycles if all threads read different addresses
Global Memory Reads/Writes
- global memory is not cached on G8x
- highest-latency instructions: 400-600 clock cycles
- likely to be a performance bottleneck
- optimizations can greatly increase performance
Coalescing
- a coordinated read by a half-warp (16 threads)
- reads a contiguous region of global memory:
  - 64 bytes: each thread reads a word (int, float, ...)
  - 128 bytes: each thread reads a double-word (int2, float2, ...)
  - 256 bytes: each thread reads a quad-word (int4, float4, ...)
- additional restrictions:
  - the starting address for a region must be a multiple of the region size
  - the kth thread in a half-warp must access the kth element in a block being read
Coalesced Access: Reading floats
Uncoalesced Access: Reading floats
Coalescing: Timing results
Experiment:
- kernel: read a float, increment, write back
- 3M floats (12 MB)
- times averaged over 10K runs
Results:
- 356 µs: coalesced
- 357 µs: coalesced, some threads don't participate
- 3,494 µs: permuted/misaligned thread access
Shared Memory
- hundreds of times faster than global memory
- cache data to reduce global memory accesses
- threads can cooperate via shared memory
- use it to avoid non-coalesced access
Example: thread-local variables
// motivate per-thread variables with
// Ten Nearest Neighbors application
__global__ void ten_nn(float2 *result, float2 *ps,
                       float2 *qs, size_t num_qs)
{
  // p goes in a register
  float2 p = ps[threadIdx.x];
  // per-thread heap goes in off-chip memory
  float2 heap[10];
  // read through num_qs points, maintaining
  // the nearest 10 qs to p in the heap
  ...
  // write out the contents of heap to result
  ...
}
Example: shared variables
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i > 0)
  {
    // each thread loads two elements from global memory
    int x_i = input[i];
    int x_i_minus_one = input[i-1];
    result[i] = x_i - x_i_minus_one;
  }
}
Observations on adj_diff_naive:
- what are the bandwidth requirements of this kernel? two loads per thread
- how many times does this kernel load input[i]? once by thread i, and again by thread i+1
- idea: eliminate the redundancy by sharing data
Example: shared variables
// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
  // shorthand for threadIdx.x
  int tx = threadIdx.x;
  // allocate a __shared__ array, one element per thread
  __shared__ int s_data[BLOCK_SIZE];
  // each thread reads one element to s_data
  unsigned int i = blockDim.x * blockIdx.x + tx;
  s_data[tx] = input[i];
  // avoid race condition: ensure all loads
  // complete before continuing
  __syncthreads();
  ...
}
// optimized version of adjacent difference (continued)
__global__ void adj_diff(int *result, int *input)
{
  ...
  if (tx > 0)
    result[i] = s_data[tx] - s_data[tx-1];
  else if (i > 0)
  {
    // handle thread block boundary
    result[i] = s_data[tx] - input[i-1];
  }
}
Example: shared variables
// when the size of the array isn't known at compile time...
__global__ void adj_diff(int *result, int *input)
{
  // use extern to indicate a __shared__ array will be
  // allocated dynamically at kernel launch time
  extern __shared__ int s_data[];
  ...
}

// pass the size of the per-block array, in bytes, as the third
// argument to the triple chevrons
adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r, i);
Occupancy
- thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy
- occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently
- limited by resource usage: registers and shared memory
Grid size heuristics:
- launch enough blocks so that all multiprocessors have at least one block to execute
- multiple blocks can run concurrently in a multiprocessor
- blocks that aren't waiting at a __syncthreads() keep the hardware busy, subject to resource availability (registers, shared memory)
- blocks are executed in pipelined fashion
- 1000 blocks per grid will scale across multiple generations
The threads that hide each other's latency don't have to belong to the same thread block.
Register Pressure
- hide latency by using more threads per SM
- limiting factors:
  - registers: 8192 per SM, partitioned among concurrent threads
  - shared memory: 16 KB per SM, partitioned among concurrent thread blocks
- pass -maxrregcount=N to the compiler, where N = desired maximum registers/kernel
- at some point, spilling into local memory (LMEM) may occur
Determining resource usage
- compile the kernel code with the -cubin flag to determine register usage
- open the .cubin file with a text editor and look for the "code" section
Optimizing threads per block
- choose threads per block as a multiple of the warp size
- more threads per block == better memory latency hiding
- but more threads per block == fewer registers per thread; kernel invocations can fail if too many registers are used
- heuristics:
  - minimum: 64 threads per block, and only if there are multiple concurrent blocks
  - 192 or 256 threads per block is usually a better choice (usually still enough registers to compile and invoke successfully)
  - this all depends on your computation, so experiment!
GPUs differ in:
- # of multiprocessors
- memory bandwidth
- shared memory size
- register file size
- max. threads per block
An experiment mode can discover and save the optimal configuration.
Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
Carefully partition data according to access patterns:
- read-only: __constant__ memory (fast)
- R/W and shared within block: __shared__ memory (fast)
- R/W within each thread: registers (fast)
- indexed R/W within each thread: local memory (slow)
- R/W inputs/results: cudaMalloc'ed global memory (slow)
Question:
__global__ void race(void)
{
  __shared__ int my_shared_variable;
  my_shared_variable = threadIdx.x;
  // what is the value of
  // my_shared_variable?
}
- this is a race condition
- the result is undefined
- the order in which threads access the variable is undefined without explicit coordination
- use barriers (e.g., __syncthreads) or atomic operations (e.g., atomicAdd) to enforce well-defined semantics
__global__ void share_data(int *input)
{
  __shared__ int data[BLOCK_SIZE];
  data[threadIdx.x] = input[threadIdx.x];
  __syncthreads();
  // the state of the entire data array
  // is now well-defined for all threads
  // in this block
}
// assume *result is initialized to 0
__global__ void sum(int *input, int *result)
{
  atomicAdd(result, input[threadIdx.x]);
  // after this kernel exits, the value of
  // *result will be the sum of the input
}
Resource Contention
__global__ void sum(int *input, int *result)
{
  atomicAdd(result, input[threadIdx.x]);
}
...
// how many threads will contend
// for exclusive access to result?
sum<<<B,N/B>>>(input, result);
Hierarchical Atomics
- per-thread atomicAdd to a __shared__ partial sum
- per-block atomicAdd to the total sum
__global__ void sum(int *input, int *result)
{
  __shared__ int partial_sum;
  // thread 0 is responsible for
  // initializing partial_sum
  if (threadIdx.x == 0)
    partial_sum = 0;
  __syncthreads();
  ...
}
__global__ void sum(int *input, int *result)
{
  ...
  // each thread updates the partial sum
  atomicAdd(&partial_sum, input[threadIdx.x]);
  __syncthreads();
  // thread 0 updates the total sum
  if (threadIdx.x == 0)
    atomicAdd(result, partial_sum);
}
Advice
- use barriers such as __syncthreads to wait until __shared__ data is ready
- prefer barriers to atomics when data access patterns are regular or predictable
- prefer atomics to barriers when data access patterns are sparse or unpredictable
- atomics to __shared__ variables are much faster than atomics to global variables
- don't synchronize or serialize unnecessarily
Parallelization strategy
First Implementation
__global__ void mat_mul(float *a, float *b,
                        float *ab, int width)
{
  // calculate the row & col index of the element
  int row = blockIdx.y*blockDim.y + threadIdx.y;
  int col = blockIdx.x*blockDim.x + threadIdx.x;
  float result = 0;
  // do dot product between row of a and col of b
  for (int k = 0; k < width; ++k)
    result += a[row*width+k] * b[k*width+col];
  ab[row*width+col] = result;
}
What is the lower bound on the bandwidth required to reach peak fp performance?
- GMAC * peak FLOPS = 4 * 805 = 3.2 TB/s (GMAC: bytes of global memory traffic per FLOP; 4 bytes per FLOP in this kernel)
What is the actual memory bandwidth of the GeForce GTX 260?
- 112 GB/s
Then what is an upper bound on the performance of our implementation?
- actual BW / GMAC = 112 / 4 = 28 GFLOPS
- each input element is read by width threads
- load each element into __shared__ memory and have several threads use the local version to reduce the memory bandwidth
Tiled Multiply
- partition the kernel loop into phases
- load a tile of both matrices into __shared__ memory each phase
- each phase, each thread computes a partial result
Better Implementation
__global__ void mat_mul(float *a, float *b,
                        float *ab, int width)
{
  // shorthand
  int tx = threadIdx.x, ty = threadIdx.y;
  int bx = blockIdx.x,  by = blockIdx.y;
  // allocate tiles in __shared__ memory
  __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
  __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];
  // calculate the row & col index
  int row = by*blockDim.y + ty;
  int col = bx*blockDim.x + tx;
  float result = 0;
  // loop over the tiles of the input in phases
  for (int p = 0; p < width/TILE_WIDTH; ++p)
  {
    // collaboratively load tiles into __shared__
    s_a[ty][tx] = a[row*width + (p*TILE_WIDTH + tx)];
    s_b[ty][tx] = b[(p*TILE_WIDTH + ty)*width + col];
    __syncthreads();
    // dot product between row of s_a and col of s_b
    for (int k = 0; k < TILE_WIDTH; ++k)
      result += s_a[ty][k] * s_b[k][tx];
    __syncthreads();
  }
  ab[row*width+col] = result;
}
Use of barriers in mat_mul:
- __syncthreads after all data is loaded into __shared__ memory
- __syncthreads after all data is read from __shared__ memory
- note that the second __syncthreads in phase p guards the load in phase p+1
- guards against using uninitialized data and against overwriting live data
Performance analysis (1024*1024 matrices):
- 64*64 = 4096 thread blocks
- TILE_WIDTH = 16 gives each SM 3 blocks, 768 threads
- full occupancy
- each thread block performs 2 * 256 = 512 32-bit loads for 256 * (2 * 16) = 8,192 fp ops
Optimization Analysis
Implementation | Global Loads          | Throughput   | SLOCs | Relative Improvement | Improvement/SLOC
Original       | 2N^3                  | 10.7 GFLOPS  | 20    | 1x                   | 1x
Improved       | 2N^2 * (N/TILE_WIDTH) | 183.9 GFLOPS | 44    | 17.2x                | 7.8x
- experiment performed on a GT200
- this optimization was clearly worth the effort
- better performance still possible in theory
TILE_SIZE Effects
Effective use of different memory resources reduces the number of accesses to global memory. These resources are finite! The more memory locations each thread requires, the fewer threads an SM can accommodate.
Final Thoughts
- effective use of the CUDA memory hierarchy decreases bandwidth consumption to increase throughput
- use __shared__ memory to eliminate redundant loads from global memory
- use __syncthreads barriers to protect __shared__ data
- use atomics if access patterns are sparse or unpredictable
- optimization comes with a development cost
- memory resources ultimately limit parallelism