Memory
Thomas Bradley
Agenda
System optimizations
  Transfers between host and device
Kernel optimizations
  Measuring performance / effective bandwidth
  Coalescing
  Shared memory
  Constant memory
  Textures
System Optimizations
PCIe Performance
Host-device bandwidth is much lower than device-to-device bandwidth:
  PCIe x16 Gen 2: 8 GB/s
  C1060 device memory: 102 GB/s
Minimize transfers
Intermediate data can be allocated, operated on, and deallocated without ever copying to host memory
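For instance, an intermediate buffer can live entirely on the device. A minimal sketch (produce/consume are placeholder kernels):

float *d_tmp;
cudaMalloc((void**)&d_tmp, N * sizeof(float));   // never allocated on the host

produce<<<grid, block>>>(d_tmp, N);              // one kernel writes the intermediate data
consume<<<grid, block>>>(d_tmp, N);              // a later kernel reads it

cudaFree(d_tmp);                                 // freed without any host copy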
Pinned Memory
Allocation:
  Use cudaHostAlloc() / cudaFreeHost() instead of malloc() / free()
Implication:
  Pinned (page-locked) memory is essentially removed from host virtual memory management: it cannot be paged out, so over-allocating it can degrade overall system performance
Asynchronous transfers and launches
cudaMemcpyAsync() on pinned memory returns control to the host immediately; a kernel launched in the same stream is queued behind the memcpy, while a launch in a different stream can overlap the memcpy, as in the sketch below
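A minimal sketch of this pattern (kernels, buffers, and launch configuration are placeholders; h_a must be pinned):

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Asynchronous transfer: returns immediately to the host
cudaMemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream1);

// Asynchronous launch (queued behind the memcpy since it is in the same stream)
kernel1<<<grid, block, 0, stream1>>>(d_a);

// Issued to a different stream: overlaps the memcpy above
kernel2<<<grid, block, 0, stream2>>>(d_b);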
Stream-based synchronization:
  cudaStreamSynchronize(stream): blocks until all outstanding CUDA commands in the stream complete
  cudaStreamQuery(stream): tests whether the stream is idle (empty); returns cudaSuccess or cudaErrorNotReady; non-blocking
Events
An event is recorded (assigned a timestamp) when it is reached by the GPU
  cudaEventSynchronize(event): blocks until the event has been recorded
  cudaEventQuery(event): tests whether the event has been recorded; returns cudaSuccess or cudaErrorNotReady; non-blocking
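A minimal sketch of event-based synchronization and timing (the kernel is a placeholder):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);            // timestamped when the GPU reaches it
kernel<<<grid, block>>>(d_data);
cudaEventRecord(stop, 0);

while (cudaEventQuery(stop) == cudaErrorNotReady)
    ;                                 // non-blocking poll; CPU work could go here

cudaEventSynchronize(stop);           // or simply block until recorded
float ms;
cudaEventElapsedTime(&ms, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);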
Zero-Copy Memory
Device directly accesses host memory (i.e. no explicit copy)
Check canMapHostMemory device property
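A minimal sketch of mapping pinned host memory into the device address space (the kernel is a placeholder; cudaSetDeviceFlags must run before the CUDA context is created):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.canMapHostMemory) {
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *h_data, *d_alias;
    cudaHostAlloc((void**)&h_data, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_alias, h_data, 0);

    kernel<<<grid, block>>>(d_alias, N);   // reads/writes host memory directly
    cudaDeviceSynchronize();

    cudaFreeHost(h_data);
}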
Kernel Optimizations
MEASURING PERFORMANCE
Theoretical Bandwidth
Device bandwidth for C1060
  Memory clock: 800 MHz
  Memory interface: 512 bits
  DDR: yes (2 bits per pin per cycle)
Theoretical bandwidth = (800 × 10^6 Hz) × (512 / 8 bytes) × 2 = 102.4 GB/s
Effective Bandwidth
Measure time and calculate the effective bandwidth
For example, copying an array of N floats:
  Size of data: N * sizeof(float) = 4N bytes
  Read and write: yes (2 operations per word)
  Time: t
Effective bandwidth
= (N * 4) * 2 / t bytes/s
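A minimal sketch of measuring the effective bandwidth of a copy kernel with events (kernel name and launch configuration are illustrative):

__global__ void copy(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                 // one 4-byte read + one 4-byte write
}

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
copy<<<(N + 255) / 256, 256>>>(d_out, d_in, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms;
cudaEventElapsedTime(&ms, start, stop);
double gbps = (double)N * 4 * 2 / (ms * 1e-3) / 1e9;   // (N * 4 bytes) * 2 ops / t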
When optimizing:
  Measure effective memory throughput
  Compare to the theoretical bandwidth
70-80% is very good, ~50% is good if arithmetic is nontrivial
Measuring throughput
  From the app point of view (useful bytes)
  From the hw point of view (actual bytes moved across the bus)
The two are likely to be different
Due to coalescing, discrete bus transaction sizes
How throughput is calculated:
  Count load/store bus transactions of each size (32B, 64B, 128B) on one TPC
  Extrapolate from one TPC to the entire GPU:
    multiply by (total threadblocks / threadblocks on the TPC), i.e. (grid size / CTAs launched)
Kernel Optimizations
COALESCING
[Diagram: threads 0-15 of a half-warp accessing consecutive 4-byte words, with the requested addresses falling into aligned 32B, 64B, and 128B segments.]
For each memory request issued by a half-warp (compute capability 1.2+):
1. Find the segment containing the address requested by the lowest-numbered active thread
2. Find all other active threads whose requested address lies in the same segment
3. Reduce the transaction size, if possible:
   If size == 128B and only the lower or upper half is used, reduce the transaction to 64B
   If size == 64B and only the lower or upper half is used, reduce the transaction to 32B
   (applied even if 64B was already a reduction from 128B)
4. Carry out the transaction and mark the serviced threads as inactive
5. Repeat until all threads in the half-warp are serviced
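The effect is easy to see with offset and strided copies; a minimal sketch (kernel names and the stride are illustrative):

// Fully coalesced: thread i touches word i of an aligned array, so a
// half-warp's 64 bytes are serviced by a single 64B transaction
__global__ void copy(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: with stride 16 (64 bytes), each thread of a half-warp falls
// in a different segment, so 16 separate transactions are issued
__global__ void copy_strided(float *out, const float *in, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}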
[Diagrams: worked examples of half-warp accesses and the resulting 32B, 64B, and 128B segments and transactions.]
Kernel Optimizations
SHARED MEMORY
Shared Memory
Uses:
  Inter-thread communication within a block
  Cache data to reduce global memory accesses
  Avoid non-coalesced access
Organization:
  16 banks, 32 bits wide
  Successive 32-bit words belong to different banks
Performance:
  32 bits per bank per 2 clocks per multiprocessor
  Shared memory accesses are per half-warp (16 threads)
  Serialization: if n threads (out of 16) access different words in the same bank, the n accesses are executed serially
  Broadcast: if n threads access the same word, it is served in one fetch
[Diagrams: conflict-free patterns (stride-1 linear addressing and 1:1 permutations mapping threads 0-15 onto banks 0-15) versus conflicting patterns where several threads hit the same bank, e.g. 8 threads on one bank serialized 8-way.]
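To exploit shared memory while avoiding both non-coalesced global accesses and bank conflicts, a tiled transpose is the classic pattern; a minimal sketch (assumes width is a multiple of TILE):

#define TILE 16

__global__ void transpose(float *out, const float *in, int width)
{
    // +1 padding: column accesses fall in different banks,
    // avoiding 16-way conflicts on the transposed read
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}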
Kernel Optimizations
CONSTANTS
Constant Memory
Data stored in global memory, read through a constant-cache path
  Declared with the __constant__ qualifier
  Can only be read by GPU kernels
  Limited to 64KB
Throughput:
32 bits per warp per clock per multiprocessor
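A minimal sketch (coeffs and filter are placeholder names):

__constant__ float coeffs[16];          // lives in constant memory (64KB total)

__global__ void filter(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeffs[0] * in[i];     // read via the constant cache; all threads
                                        // read the same word -> broadcast
}

// Host: write the constants before launching
float h_coeffs[16] = { /* ... */ };
cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));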
Kernel Optimizations
TEXTURES
Textures in CUDA
A texture is an object for reading data
Benefits:
Data is cached
Helpful when coalescing is a problem
Filtering:
  Linear / bilinear / trilinear interpolation
  Dedicated hardware
Using Textures
Host (CPU) code:
1. Allocate/obtain memory (global linear, pitch linear, or a CUDA array)
2. Create a texture reference object (currently must be at file scope)
3. Bind the texture reference to the memory/array
4. When finished: unbind the texture reference and free resources
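A minimal sketch using the legacy texture reference API with global linear memory (names and launch configuration are placeholders):

// Step 2: texture reference, at file scope
texture<float, 1, cudaReadModeElementType> texRef;

__global__ void kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);   // cached read through the texture path
}

// Host code
float *d_in;
cudaMalloc((void**)&d_in, N * sizeof(float));            // step 1: global linear memory
cudaBindTexture(NULL, texRef, d_in, N * sizeof(float));  // step 3: bind

kernel<<<(N + 255) / 256, 256>>>(d_out, N);

cudaUnbindTexture(texRef);                               // step 4: unbind, then free
cudaFree(d_in);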