
Advanced CUDA Optimization 2: Memory

Thomas Bradley
NVIDIA Corporation 2010

Agenda
System optimizations
Transfers between host and device

Kernel optimizations
Measuring performance (effective bandwidth)
Coalescing
Shared memory
Constant memory
Textures

System Optimizations

HOST-DEVICE TRANSFERS


PCIe Performance
Host-to-device bandwidth is much lower than device-to-device bandwidth
  PCIe x16 Gen 2: 8 GB/s
  C1060 device memory: 102 GB/s

Minimize transfers
Intermediate data can be allocated, operated on, and deallocated without ever copying to host memory

Group transfers together


One large transfer is better than many small transfers
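A hedged sketch of grouping transfers (the array names and sizes are illustrative, not from the original deck): pack several small host arrays into one contiguous pinned staging buffer so that a single cudaMemcpy replaces many small ones.

#include <cuda_runtime.h>
#include <cstring>

// Illustrative only: move three small arrays with one large transfer.
void copyGrouped(const float *a, const float *b, const float *c,
                 size_t nA, size_t nB, size_t nC, float *d_dst)
{
    size_t bytesA = nA * sizeof(float), bytesB = nB * sizeof(float), bytesC = nC * sizeof(float);
    size_t total  = bytesA + bytesB + bytesC;

    float *h_staging;
    cudaHostAlloc((void **)&h_staging, total, cudaHostAllocDefault);   // pinned staging buffer

    memcpy((char *)h_staging,                   a, bytesA);            // pack on the host
    memcpy((char *)h_staging + bytesA,          b, bytesB);
    memcpy((char *)h_staging + bytesA + bytesB, c, bytesC);

    cudaMemcpy(d_dst, h_staging, total, cudaMemcpyHostToDevice);       // one large transfer

    cudaFreeHost(h_staging);
}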


Pinned (non-pageable) Memory


Pinned memory enables:
Faster PCIe copies (~2x throughput on FSB systems)
Memcopies asynchronous with the CPU
Memcopies asynchronous with the GPU

Allocation
cudaHostAlloc() / cudaFreeHost(), i.e. instead of malloc() / free()

Implication:
Pinned memory is essentially removed from host virtual memory
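A minimal sketch of allocating and using pinned memory (the buffer size is arbitrary, chosen only for illustration):

#include <cuda_runtime.h>

int main()
{
    const size_t N = 1 << 20;                       // 1M floats, illustrative size
    float *h_a, *d_a;

    cudaHostAlloc((void **)&h_a, N * sizeof(float), cudaHostAllocDefault);  // pinned, not malloc
    cudaMalloc((void **)&d_a, N * sizeof(float));

    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);        // faster from pinned memory

    cudaFree(d_a);
    cudaFreeHost(h_a);                              // pair with cudaHostAlloc, not free()
    return 0;
}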


Asynchronous API and Streams


Default API:
Kernel launches are asynchronous with the CPU
Memcopies (D2H, H2D) block the CPU thread
CUDA calls are serialized by the driver

Streams and async functions provide:


Memcopies (D2H, H2D) asynchronous with the CPU
Ability to concurrently execute a kernel and a memcopy

Stream = sequence of operations that execute in issue-order


Operations from different streams can be interleaved
A kernel and a memcopy from different streams can be overlapped

Overlap CPU Work with Kernel


Kernel launches are asynchronous
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);   // Blocks CPU until transfer complete
kernel<<<grid, block>>>(a_d);                         // Asynchronous launch
cpuFunction();                                        // Overlaps kernel
cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);   // Blocks until kernel is complete,
                                                      // then blocks until transfer complete


Overlap CPU Work with Kernel/Memcpy


Overlap the CPU with the memory transfer
Default stream is 0
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);  // Asynchronous transfer
kernel<<<grid, block>>>(a_d);                                // Asynchronous launch (queued behind
                                                             // the memcpy since in the same stream)
cpuFunction();                                               // Overlaps memcpy and kernel


Overlap Kernel with Memcpy


Requirements:
D2H or H2D memcpy from pinned memory
Device with compute capability 1.1 (G84 and later)
Kernel and memcpy in different, non-0 streams

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);                        // Create stream handles
cudaStreamCreate(&stream2);
cudaMemcpyAsync(dst, src, size, dir, stream1);     // Asynchronous
kernel<<<grid, block, 0, stream2>>>();             // Asynchronous in a different stream,
                                                   // overlaps the transfer

Host-Device Synchronization


Context based
Block until all outstanding CUDA commands complete
cudaThreadSynchronize(), cudaMemcpy(...)

Stream based
Block until all outstanding CUDA commands in the stream complete
cudaStreamSynchronize(stream)

cudaStreamQuery(stream)
Test whether the stream is idle (empty); non-blocking; returns cudaSuccess or cudaErrorNotReady

Host-Device Synchronization


Event based (within streams)

cudaEventRecord(event, stream)
  Record an event in a stream; the event is recorded (assigned a timestamp) when it is reached by the GPU

cudaEventSynchronize(event)
  Block until the event has been recorded

cudaEventQuery(event)
  Test whether the event has been recorded; non-blocking; returns cudaSuccess or cudaErrorNotReady
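A short sketch combining these calls, assuming an already created stream1 and buffers h_a/d_a (all names, including doSomeCpuWork, are placeholders); it shows both the polling and the blocking style.

cudaEvent_t done;
cudaEventCreate(&done);

cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1);
cudaEventRecord(done, stream1);                 // recorded when the copy has finished

while (cudaEventQuery(done) == cudaErrorNotReady)
    doSomeCpuWork();                            // poll: keep the CPU busy (placeholder function)

cudaEventSynchronize(done);                     // or simply block until the event is recorded
cudaEventDestroy(done);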

Zero-Copy Memory
Device directly accesses host memory (i.e. no explicit copy)
Check canMapHostMemory device property

Setup from host


cudaSetDeviceFlags(cudaDeviceMapHost);
...
cudaHostAlloc((void **)&a_h, size, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&a_d, (void *)a_h, 0);
kernel<<<grid, block>>>(a_d, N);


Zero-Copy Memory Considerations


Always beneficial for integrated devices
Integrated devices access host memory directly
Test the integrated device property

Typically beneficial if data only read/written once


Copy input data from CPU to GPU memory
Run one kernel
Copy output data back from GPU to CPU memory

Coalescing is even more important with zero-copy!



Kernel Optimizations

MEASURING PERFORMANCE


Theoretical Bandwidth
Device bandwidth for C1060
Memory clock:      800 MHz   =  (800 * 10^6) Hz
Memory interface:  512 bits  =  (512 / 8) bytes
DDR:               yes       =  2 bits per cycle

Theoretical bandwidth = (800 * 10^6) * (512 / 8) * 2 bytes/sec
                      = 102 GB/s

Be consistent in the definition of giga (1024^3 or 10^9)
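The same calculation can be done at run time from the device properties; a hedged sketch (memoryClockRate is reported in kHz and memoryBusWidth in bits, these fields assume a CUDA runtime newer than the one contemporary with these slides, and the factor 2 assumes DDR memory):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // memoryClockRate: kHz, memoryBusWidth: bits, factor 2 for DDR
    double bw = 2.0 * (prop.memoryClockRate * 1e3) * (prop.memoryBusWidth / 8.0) / 1e9;
    printf("Theoretical bandwidth: %.1f GB/s\n", bw);
    return 0;
}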


Effective Bandwidth
Measure time and calculate the effective bandwidth
For example, copying array of N floats
Size of data:    N * sizeof(float)  =  N * 4 bytes
Read and write:  yes                =  2 ops per word
Time:            t

Effective bandwidth = (N * 4) * 2 / t bytes/s
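A sketch of measuring this in practice with CUDA events (copyKernel, N, and the buffer names are illustrative placeholders):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                     // one read + one write per element
}

void measureEffectiveBandwidth(const float *d_in, float *d_out, int N)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    copyKernel<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);        // elapsed time in milliseconds

    double bytes = 2.0 * N * sizeof(float);        // read + write
    printf("Effective bandwidth: %.1f GB/s\n", bytes / (ms * 1e-3) / 1e9);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}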


Global Memory Throughput Metric


Many applications are memory throughput bound

When coding from scratch:
  Start with the memory operations first, achieve good throughput
  Add the arithmetic, measuring performance as you go

When optimizing:
  Measure effective memory throughput
  Compare to the theoretical bandwidth
  70-80% is very good, ~50% is good if the arithmetic is nontrivial

Measuring throughput:
  From the app point of view (useful bytes)
  From the HW point of view (actual bytes moved across the bus)
  The two are likely to differ, due to coalescing and discrete bus transaction sizes

Measuring Memory Throughput

The Visual Profiler reports memory throughput:
  From the HW point of view
  Based on counters for one TPC (3 multiprocessors)
  Needs a GPU with compute capability 1.2 or higher

How throughput is calculated:
  Count load/store bus transactions of each size (32B, 64B, 128B) on the TPC
  Extrapolate from one TPC to the entire GPU:
    multiply by (total threadblocks / threadblocks on the TPC), i.e. (grid size / CTAs launched)

Kernel Optimizations

COALESCING


GMEM Coalescing: Compute Capability 1.2-1.3


Possible GPU memory bus transaction sizes:
32B, 64B, or 128B
The transaction segment must be aligned: its first address is a multiple of the segment size

Hardware coalescing for each half-warp (16 threads):


Memory accesses are handled per half-warp
Carry out the smallest possible number of transactions
Reduce the transaction size when possible

[Figure: threads 0-15 of a half-warp mapped onto consecutive addresses, showing the aligned 32B, 64B, and 128B segments that can service the accesses.]

HW Steps when Coalescing for a Half-Warp


Find the memory segment that contains the address requested by the lowest-numbered active thread:
  32B segment for 8-bit data
  64B segment for 16-bit data
  128B segment for 32-, 64-, and 128-bit data

Find all other active threads whose requested address lies in the same segment

Reduce the transaction size, if possible:
  If size == 128B and only the lower or upper half is used, reduce the transaction to 64B
  If size == 64B and only the lower or upper half is used, reduce the transaction to 32B
    (applied even if 64B was already a reduction from 128B)

Carry out the transaction, mark serviced threads as inactive

Repeat until all threads in the half-warp are serviced


Threads 0-15: 4-byte words at addresses 116-176

Thread 0 is the lowest-numbered active thread; it accesses address 116
128-byte segment: addresses 0-127

[Figure: address line from 0 to 288 with threads t0-t15 mapped above addresses 116-176; the 128B segment 0-127 is highlighted.]

Threads 0-15: 4-byte words at addresses 116-176

Thread 0 is the lowest-numbered active thread; it accesses address 116
128-byte segment: addresses 0-127, reduced to the 64B segment 64-127

[Figure: same address line; the 64B segment 64-127 is highlighted.]

Threads 0-15: 4-byte words at addresses 116-176

Thread 0 is the lowest-numbered active thread; it accesses address 116
128-byte segment: addresses 0-127, reduced again to a 32B transaction covering 96-127

[Figure: same address line; the 32B transaction covering 96-127 is highlighted.]

Threads 0-15: 4-byte words at addresses 116-176

Thread 3 is now the lowest-numbered active thread; it accesses address 128
128-byte segment: addresses 128-255

[Figure: same address line; the 128B segment 128-255 is highlighted.]

Threads 0-15: 4-byte words at addresses 116-176

Thread 3 is now the lowest-numbered active thread; it accesses address 128
128-byte segment: addresses 128-255, reduced to a 64B transaction covering 128-191

[Figure: same address line; the 64B transaction covering 128-191 is highlighted.]


Comparing Compute Capabilities


Compute capability 1.0-1.1
  Requires the threads of a half-warp to:
    Access a single aligned 64B, 128B, or 256B segment
    Issue addresses in sequence
  If the requirements are not satisfied:
    A separate 32B transaction is issued for each thread

Compute capability 1.2-1.3
  Does not require sequential addressing by threads
  Performance degrades gracefully when a half-warp addresses multiple segments

Experiment: Impact of Address Alignment


Assume a half-warp accesses a contiguous region
Throughput is maximized when the region is aligned on its size boundary
  (100% of the bytes in each bus transaction are useful)

Impact of misaligned addressing:
  32-bit words, streaming code, Quadro FX5800 (102 GB/s theoretical)
  0-word offset:  76 GB/s (perfect alignment, typical performance)
  8-word offset:  57 GB/s (75% of the aligned case)
  All others:     46 GB/s (61% of the aligned case)
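A hedged sketch of the kind of streaming kernel behind these numbers (kernel and parameter names are illustrative): each thread copies one 32-bit word starting offset words into the buffers, so the half-warps can start at a misaligned address.

__global__ void offsetCopy(const float *in, float *out, int offset)
{
    // With offset = 0 the half-warp accesses are perfectly aligned;
    // other offsets reproduce the misaligned cases measured above.
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    out[i] = in[i];
}
// Host side: allocate N + maxOffset words and time the kernel for each offset
// (timing as in the effective-bandwidth sketch earlier).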


Address Alignment, 32-bit words


8-word (32B) offset from perfect alignment:
  Observed 75% of the perfectly aligned performance

  Segments starting at a multiple of 32B:
    One 128B transaction (50% efficiency)
  Segments starting at a multiple of 96B:
    Two 32B transactions (100% efficiency)

[Figure: address line marked at 32, 64, 96, 128, 160 bytes, showing the single 128B transaction in the first case and the two 32B transactions in the second.]

Address Alignment, 32-bit words


4-word (16B) offset (other offsets have the same performance):
  Observed 61% of the perfectly aligned performance

  Two types of segments, based on starting address:
    One 128B transaction (50% efficiency)
    One 64B and one 32B transaction (67% efficiency)

[Figure: address line marked at 32, 64, 96, 128, 160 bytes, showing the single 128B transaction in one case and the 64B + 32B pair in the other.]

Address Alignment, 64-bit words


Can be analyzed similarly to the 32-bit case:
  0B offset:   80 GB/s (perfectly aligned)
  8B offset:   62 GB/s (78% of perfectly aligned)
  16B offset:  62 GB/s (78% of perfectly aligned)
  32B offset:  68 GB/s (85% of perfectly aligned)
  64B offset:  76 GB/s (95% of perfectly aligned)

Compare the 0B and 64B offset cases:
  Both consume 100% of the bytes moved
  64B offset: two 64B transactions
  0B offset:  a single 128B transaction, slightly faster

Kernel Optimizations

SHARED MEMORY


Shared Memory
Uses:
  Inter-thread communication within a block
  Caching data to reduce global memory accesses
  Avoiding non-coalesced access

Organization:
  16 banks, each 32 bits wide
  Successive 32-bit words belong to different banks

Performance:
  32 bits per bank per 2 clocks per multiprocessor
  Shared memory accesses are issued per half-warp (16 threads)
  Serialization: if n threads (out of 16) access the same bank, the n accesses are executed serially
  Broadcast: n threads reading the same word are serviced in one fetch
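A small sketch of the staging idea (a hypothetical 16x16 tile transpose, assuming a square matrix whose width is a multiple of 16 and a 16x16 thread block): global loads and stores stay coalesced, and the extra padding column keeps column-wise shared memory reads spread across all 16 banks.

#define TILE 16

__global__ void transposeTile(const float *in, float *out, int width)
{
    __shared__ float tile[TILE][TILE + 1];                 // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced global load

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                   // swapped block indices
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];   // coalesced store, conflict-free read
}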

Bank Addressing Examples


[Figure: two conflict-free mappings ("No Bank Conflicts") of threads 0-15 onto banks 0-15; in each, every thread accesses a different bank.]

Bank Addressing Examples


[Figure: two conflicting mappings of threads 0-15 onto the 16 banks: a 2-way bank conflict (two threads per bank) and an 8-way bank conflict (eight threads per bank, marked x8).]

Shared Memory Bank Conflicts


warp_serialize profiler signal reflects conflicts
The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)

The slow case:


Bank conflict: multiple threads in the same half-warp access the same bank
The accesses must be serialized

Assess Impact On Performance


Replace all SMEM indexes with threadIdx.x
Eliminates conflicts
Shows how much performance could be improved by eliminating bank conflicts
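A hedged before/after fragment (buffer and index names are placeholders); the kernel's results are not meaningful while the diagnostic index is in place, the point is only the timing difference.

__global__ void diagnosticKernel(const int *indices, float *out)
{
    __shared__ float smem[256];
    smem[threadIdx.x] = (float)threadIdx.x;        // fill the tile (illustrative)
    __syncthreads();

    // Original, possibly conflict-prone access:
    // float v = smem[indices[threadIdx.x]];

    // Diagnostic version: every thread reads its own bank, so any speedup seen
    // here is the cost of the original bank conflicts.
    float v = smem[threadIdx.x];

    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}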


Kernel Optimizations

CONSTANTS


Constant Memory
Data stored in global memory, read through a constant-cache path
__constant__ qualifier in declarations
Can only be read by GPU kernels
Limited to 64KB

To be used when all threads in a warp read the same address


Serializes otherwise (indicated by warp_serialize counter in profiler)

Throughput:
32 bits per warp per clock per multiprocessor
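A minimal sketch of declaring, filling, and reading constant memory (the coefficient array and kernel are illustrative):

#include <cuda_runtime.h>

__constant__ float c_coeff[16];                    // lives in the 64KB constant space

__global__ void scale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * c_coeff[0];               // all threads read the same address: broadcast
}

void setCoefficients(const float *h_coeff)
{
    cudaMemcpyToSymbol(c_coeff, h_coeff, 16 * sizeof(float));   // host writes constant memory
}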


Kernel Optimizations

TEXTURES


Textures in CUDA
A texture is an object for reading data

Benefits:
Data is cached
Helpful when coalescing is a problem

Filtering
Linear / bilinear / trilinear interpolation
Dedicated hardware

Wrap modes (for out-of-bounds addresses)


Clamp to edge / repeat

Addressable in 1D, 2D, or 3D


Using integer or normalized coordinates


Texture Addressing Modes


Wrap: an out-of-bounds coordinate is wrapped (modulo arithmetic)
Clamp: an out-of-bounds coordinate is replaced with the closest in-bounds boundary

[Figure: a small 2D texture fetched at the out-of-bounds coordinate (5.5, 1.5); wrap mode folds the fetch back into the texture, clamp mode pins it to the edge texel.]

CUDA Texture Types


Bound to linear memory
  A global memory address is bound to a texture
  Only 1D
  Integer addressing; no filtering, no addressing modes

Bound to CUDA arrays
  A block-linear CUDA array is bound to a texture
  1D, 2D, or 3D
  Float addressing (size-based or normalized)
  Filtering and addressing modes (clamp, repeat) supported

Bound to pitch linear memory
  A global memory address is bound to a 2D texture
  Float/integer addressing, filtering, and clamp/repeat addressing modes similar to CUDA arrays

Using Textures
Host (CPU) code:
Allocate/obtain memory (global linear, pitch linear, or CUDA array)
Create a texture reference object (currently must be at file scope)
Bind the texture reference to the memory/array
Unbind the texture reference when finished, free resources

Device (kernel) code:


Fetch using the texture reference
  Linear memory textures: tex1Dfetch()
  Array textures: tex1D(), tex2D(), or tex3D()
  Pitch linear textures: tex2D()
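A sketch using the texture reference API described on these slides (this API is deprecated and has been removed in recent CUDA releases; names are illustrative):

#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> texIn;          // texture reference, file scope

__global__ void readThroughTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texIn, i);                      // cached fetch from linear memory
}

void run(float *d_in, float *d_out, int n)
{
    cudaBindTexture(0, texIn, d_in, n * sizeof(float));     // bind linear memory to the reference
    readThroughTexture<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaUnbindTexture(texIn);                               // unbind when finished
}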
