
Advanced CUDA Optimization 2: Memory

Thomas Bradley
NVIDIA Corporation 2010

Agenda
System optimizations
Transfers between host and device

Kernel optimizations
Measuring performance (effective bandwidth)
Coalescing
Shared memory
Constant memory
Textures

System Optimizations

HOST-DEVICE TRANSFERS


PCIe Performance
Host-to-device bandwidth is much lower than device-to-device bandwidth
  PCIe x16 Gen 2: 8 GB/s
  C1060 device memory: 102 GB/s

Minimize transfers
Intermediate data can be allocated, operated on, and deallocated without ever copying to host memory

Group transfers together


One large transfer is better than many small transfers
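A hedged sketch of grouping transfers (the array names and sizes are illustrative, not from the original deck): pack several small host arrays into one contiguous pinned staging buffer so that a single cudaMemcpy replaces many small ones.

#include <cuda_runtime.h>
#include <cstring>

// Illustrative only: move three small arrays with one large transfer.
void copyGrouped(const float *a, const float *b, const float *c,
                 size_t nA, size_t nB, size_t nC, float *d_dst)
{
    size_t bytesA = nA * sizeof(float), bytesB = nB * sizeof(float), bytesC = nC * sizeof(float);
    size_t total  = bytesA + bytesB + bytesC;

    float *h_staging;
    cudaHostAlloc((void **)&h_staging, total, cudaHostAllocDefault);   // pinned staging buffer

    memcpy((char *)h_staging,                   a, bytesA);            // pack on the host
    memcpy((char *)h_staging + bytesA,          b, bytesB);
    memcpy((char *)h_staging + bytesA + bytesB, c, bytesC);

    cudaMemcpy(d_dst, h_staging, total, cudaMemcpyHostToDevice);       // one large transfer

    cudaFreeHost(h_staging);
}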


Pinned (non-pageable) Memory


Pinned memory enables:
Faster PCIe copies (~2x throughput on FSB systems)
Memcopies asynchronous with the CPU
Memcopies asynchronous with the GPU

Allocation
cudaHostAlloc() / cudaFreeHost(), i.e. instead of malloc() / free()

Implication:
Pinned memory is essentially removed from host virtual memory
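A minimal sketch of allocating and using pinned memory (the buffer size is arbitrary, chosen only for illustration):

#include <cuda_runtime.h>

int main()
{
    const size_t N = 1 << 20;                       // 1M floats, illustrative size
    float *h_a, *d_a;

    cudaHostAlloc((void **)&h_a, N * sizeof(float), cudaHostAllocDefault);  // pinned, not malloc
    cudaMalloc((void **)&d_a, N * sizeof(float));

    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);        // faster from pinned memory

    cudaFree(d_a);
    cudaFreeHost(h_a);                              // pair with cudaHostAlloc, not free()
    return 0;
}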


Asynchronous API and Streams


Default API:
Kernel launches are asynchronous with the CPU
Memcopies (D2H, H2D) block the CPU thread
CUDA calls are serialized by the driver

Streams and async functions provide:


Memcopies (D2H, H2D) asynchronous with the CPU
Ability to concurrently execute a kernel and a memcopy

Stream = sequence of operations that execute in issue-order


Operations from different streams can be interleaved
A kernel and a memcopy from different streams can be overlapped

Overlap CPU Work with Kernel


Kernel launches are asynchronous
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);   // Blocks CPU until transfer complete
kernel<<<grid, block>>>(a_d);                         // Asynchronous launch
cpuFunction();                                        // Overlaps kernel
cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);   // Blocks until kernel is complete,
                                                      // then blocks until transfer complete


Overlap CPU Work with Kernel/Memcpy


Overlap the CPU with the memory transfer
Default stream is 0
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);  // Asynchronous transfer
kernel<<<grid, block>>>(a_d);                                // Asynchronous launch (queued behind
                                                             // the memcpy since in the same stream)
cpuFunction();                                               // Overlaps memcpy and kernel


Overlap Kernel with Memcpy


Requirements:
D2H or H2D memcpy from pinned memory
Device with compute capability 1.1 (G84 and later)
Kernel and memcpy in different, non-0 streams

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);                        // Create stream handles
cudaStreamCreate(&stream2);
cudaMemcpyAsync(dst, src, size, dir, stream1);     // Asynchronous
kernel<<<grid, block, 0, stream2>>>();             // Asynchronous in a different stream,
                                                   // overlaps the transfer

Host-Device Synchronization


Context based
Block until all outstanding CUDA commands complete
cudaThreadSynchronize(), cudaMemcpy(...)

Stream based
Block until all outstanding CUDA commands in the stream complete
cudaStreamSynchronize(stream)

cudaStreamQuery(stream)
Test whether the stream is idle (empty); non-blocking; returns cudaSuccess or cudaErrorNotReady

Host-Device Synchronization


Event based (within streams)

cudaEventRecord(event, stream)
  Record an event in a stream; the event is recorded (assigned a timestamp) when it is reached by the GPU

cudaEventSynchronize(event)
  Block until the event has been recorded

cudaEventQuery(event)
  Test whether the event has been recorded; non-blocking; returns cudaSuccess or cudaErrorNotReady
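A short sketch combining these calls, assuming an already created stream1 and buffers h_a/d_a (all names, including doSomeCpuWork, are placeholders); it shows both the polling and the blocking style.

cudaEvent_t done;
cudaEventCreate(&done);

cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1);
cudaEventRecord(done, stream1);                 // recorded when the copy has finished

while (cudaEventQuery(done) == cudaErrorNotReady)
    doSomeCpuWork();                            // poll: keep the CPU busy (placeholder function)

cudaEventSynchronize(done);                     // or simply block until the event is recorded
cudaEventDestroy(done);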

Zero-Copy Memory
Device directly accesses host memory (i.e. no explicit copy)
Check canMapHostMemory device property

Setup from host


cudaSetDeviceFlags(cudaDeviceMapHost);
...
cudaHostAlloc((void **)&a_h, size, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&a_d, (void *)a_h, 0);
kernel<<<grid, block>>>(a_d, N);


Zero-Copy Memory Considerations


Always beneficial for integrated devices
Integrated devices access host memory directly
Test the integrated device property

Typically beneficial if data only read/written once


Copy input data from CPU to GPU memory
Run one kernel
Copy output data back from GPU to CPU memory

Coalescing is even more important with zero-copy!



Kernel Optimizations

MEASURING PERFORMANCE


Theoretical Bandwidth
Device bandwidth for C1060
Memory clock:      800 MHz   =  (800 * 10^6) Hz
Memory interface:  512 bits  =  (512 / 8) bytes
DDR:               yes       =  2 bits per cycle

Theoretical bandwidth = (800 * 10^6) * (512 / 8) * 2 bytes/sec
                      = 102 GB/s

Be consistent in the definition of giga (1024^3 or 10^9)
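The same calculation can be done at run time from the device properties; a hedged sketch (memoryClockRate is reported in kHz and memoryBusWidth in bits, these fields assume a CUDA runtime newer than the one contemporary with these slides, and the factor 2 assumes DDR memory):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // memoryClockRate: kHz, memoryBusWidth: bits, factor 2 for DDR
    double bw = 2.0 * (prop.memoryClockRate * 1e3) * (prop.memoryBusWidth / 8.0) / 1e9;
    printf("Theoretical bandwidth: %.1f GB/s\n", bw);
    return 0;
}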


Effective Bandwidth
Measure time and calculate the effective bandwidth
For example, copying array of N floats
Size of data:    N * sizeof(float)  =  N * 4 bytes
Read and write:  yes                =  2 ops per word
Time:            t

Effective bandwidth = (N * 4) * 2 / t bytes/s
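A sketch of measuring this in practice with CUDA events (copyKernel, N, and the buffer names are illustrative placeholders):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                     // one read + one write per element
}

void measureEffectiveBandwidth(const float *d_in, float *d_out, int N)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    copyKernel<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);        // elapsed time in milliseconds

    double bytes = 2.0 * N * sizeof(float);        // read + write
    printf("Effective bandwidth: %.1f GB/s\n", bytes / (ms * 1e-3) / 1e9);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}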


Global Memory Throughput Metric


Many applications are memory throughput bound

When coding from scratch:
  Start with the memory operations first, achieve good throughput
  Add the arithmetic, measuring performance as you go

When optimizing:
  Measure effective memory throughput
  Compare to the theoretical bandwidth
  70-80% is very good, ~50% is good if the arithmetic is nontrivial

Measuring throughput:
  From the app point of view (useful bytes)
  From the HW point of view (actual bytes moved across the bus)
  The two are likely to differ, due to coalescing and discrete bus transaction sizes

Measuring Memory Throughput

The Visual Profiler reports memory throughput:
  From the HW point of view
  Based on counters for one TPC (3 multiprocessors)
  Needs a GPU with compute capability 1.2 or higher

How throughput is calculated:
  Count load/store bus transactions of each size (32B, 64B, 128B) on the TPC
  Extrapolate from one TPC to the entire GPU:
    multiply by (total threadblocks / threadblocks on the TPC), i.e. (grid size / CTAs launched)

Kernel Optimizations

COALESCING


GMEM Coalescing: Compute Capability 1.2-1.3


Possible GPU memory bus transaction sizes:
32B, 64B, or 128B
The transaction segment must be aligned: its first address is a multiple of the segment size

Hardware coalescing for each half-warp (16 threads):


Memory accesses are handled per half-warp
Carry out the smallest possible number of transactions
Reduce the transaction size when possible

[Figure: threads 0-15 of a half-warp mapped onto consecutive addresses, showing the aligned 32B, 64B, and 128B segments that can service the accesses.]

HW Steps when Coalescing for a Half-Warp


Find the memory segment that contains the address requested by the lowest-numbered active thread:
  32B segment for 8-bit data
  64B segment for 16-bit data
  128B segment for 32-, 64-, and 128-bit data

Find all other active threads whose requested address lies in the same segment

Reduce the transaction size, if possible:
  If size == 128B and only the lower or upper half is used, reduce the transaction to 64B
  If size == 64B and only the lower or upper half is used, reduce the transaction to 32B
    (applied even if 64B was already a reduction from 128B)

Carry out the transaction, mark serviced threads as inactive

Repeat until all threads in the half-warp are serviced


Threads 0-15: 4-byte words at addresses 116-176

Thread 0 is the lowest-numbered active thread; it accesses address 116
128-byte segment: addresses 0-127

[Figure: address line from 0 to 288 with threads t0-t15 mapped above addresses 116-176; the 128B segment 0-127 is highlighted.]

Threads 0-15: 4-byte words at addresses 116-176

Thread 0 is the lowest-numbered active thread; it accesses address 116
128-byte segment: addresses 0-127, reduced to the 64B segment 64-127

[Figure: same address line; the 64B segment 64-127 is highlighted.]

Threads 0-15: 4-byte words at addresses 116-176

Thread 0 is the lowest-numbered active thread; it accesses address 116
128-byte segment: addresses 0-127, reduced again to a 32B transaction covering 96-127

[Figure: same address line; the 32B transaction covering 96-127 is highlighted.]

Threads 0-15: 4-byte words at addresses 116-176

Thread 3 is now the lowest-numbered active thread; it accesses address 128
128-byte segment: addresses 128-255

[Figure: same address line; the 128B segment 128-255 is highlighted.]

Threads 0-15: 4-byte words at addresses 116-176

Thread 3 is now the lowest-numbered active thread; it accesses address 128
128-byte segment: addresses 128-255, reduced to a 64B transaction covering 128-191

[Figure: same address line; the 64B transaction covering 128-191 is highlighted.]


Comparing Compute Capabilities


Compute capability 1.0-1.1
  Requires the threads of a half-warp to:
    Access a single aligned 64B, 128B, or 256B segment
    Issue addresses in sequence
  If the requirements are not satisfied:
    A separate 32B transaction is issued for each thread

Compute capability 1.2-1.3
  Does not require sequential addressing by threads
  Performance degrades gracefully when a half-warp addresses multiple segments

Experiment: Impact of Address Alignment


Assume a half-warp accesses a contiguous region
Throughput is maximized when the region is aligned on its size boundary
  (100% of the bytes in each bus transaction are useful)

Impact of misaligned addressing:
  32-bit words, streaming code, Quadro FX5800 (102 GB/s theoretical)
  0-word offset:  76 GB/s (perfect alignment, typical performance)
  8-word offset:  57 GB/s (75% of the aligned case)
  All others:     46 GB/s (61% of the aligned case)
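A hedged sketch of the kind of streaming kernel behind these numbers (kernel and parameter names are illustrative): each thread copies one 32-bit word starting offset words into the buffers, so the half-warps can start at a misaligned address.

__global__ void offsetCopy(const float *in, float *out, int offset)
{
    // With offset = 0 the half-warp accesses are perfectly aligned;
    // other offsets reproduce the misaligned cases measured above.
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    out[i] = in[i];
}
// Host side: allocate N + maxOffset words and time the kernel for each offset
// (timing as in the effective-bandwidth sketch earlier).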


Address Alignment, 32-bit words


8-word (32B) offset from perfect alignment:
  Observed 75% of the perfectly aligned performance

  Segments starting at a multiple of 32B:
    One 128B transaction (50% efficiency)
  Segments starting at a multiple of 96B:
    Two 32B transactions (100% efficiency)

[Figure: address line marked at 32, 64, 96, 128, 160 bytes, showing the single 128B transaction in the first case and the two 32B transactions in the second.]

Address Alignment, 32-bit words


4-word (16B) offset (other offsets have the same performance):
  Observed 61% of the perfectly aligned performance

  Two types of segments, based on starting address:
    One 128B transaction (50% efficiency)
    One 64B and one 32B transaction (67% efficiency)

[Figure: address line marked at 32, 64, 96, 128, 160 bytes, showing the single 128B transaction in one case and the 64B + 32B pair in the other.]

Address Alignment, 64-bit words


Can be analyzed similarly to the 32-bit case:
  0B offset:   80 GB/s (perfectly aligned)
  8B offset:   62 GB/s (78% of perfectly aligned)
  16B offset:  62 GB/s (78% of perfectly aligned)
  32B offset:  68 GB/s (85% of perfectly aligned)
  64B offset:  76 GB/s (95% of perfectly aligned)

Compare the 0B and 64B offset cases:
  Both consume 100% of the bytes moved
  64B offset: two 64B transactions
  0B offset:  a single 128B transaction, slightly faster

Kernel Optimizations

SHARED MEMORY


Shared Memory
Uses:
  Inter-thread communication within a block
  Caching data to reduce global memory accesses
  Avoiding non-coalesced access

Organization:
  16 banks, each 32 bits wide
  Successive 32-bit words belong to different banks

Performance:
  32 bits per bank per 2 clocks per multiprocessor
  Shared memory accesses are issued per half-warp (16 threads)
  Serialization: if n threads (out of 16) access the same bank, the n accesses are executed serially
  Broadcast: n threads reading the same word are serviced in one fetch
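A small sketch of the staging idea (a hypothetical 16x16 tile transpose, assuming a square matrix whose width is a multiple of 16 and a 16x16 thread block): global loads and stores stay coalesced, and the extra padding column keeps column-wise shared memory reads spread across all 16 banks.

#define TILE 16

__global__ void transposeTile(const float *in, float *out, int width)
{
    __shared__ float tile[TILE][TILE + 1];                 // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced global load

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                   // swapped block indices
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];   // coalesced store, conflict-free read
}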

Bank Addressing Examples


[Figure: two conflict-free mappings ("No Bank Conflicts") of threads 0-15 onto banks 0-15; in each, every thread accesses a different bank.]

Bank Addressing Examples


[Figure: two conflicting mappings of threads 0-15 onto the 16 banks: a 2-way bank conflict (two threads per bank) and an 8-way bank conflict (eight threads per bank, marked x8).]

Shared Memory Bank Conflicts


warp_serialize profiler signal reflects conflicts
The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)

The slow case:


Bank conflict: multiple threads in the same half-warp access the same bank
The accesses must be serialized

Assess Impact On Performance


Replace all SMEM indexes with threadIdx.x
Eliminates conflicts
Shows how much performance could be improved by eliminating bank conflicts
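A hedged before/after fragment (buffer and index names are placeholders); the kernel's results are not meaningful while the diagnostic index is in place, the point is only the timing difference.

__global__ void diagnosticKernel(const int *indices, float *out)
{
    __shared__ float smem[256];
    smem[threadIdx.x] = (float)threadIdx.x;        // fill the tile (illustrative)
    __syncthreads();

    // Original, possibly conflict-prone access:
    // float v = smem[indices[threadIdx.x]];

    // Diagnostic version: every thread reads its own bank, so any speedup seen
    // here is the cost of the original bank conflicts.
    float v = smem[threadIdx.x];

    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}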


Kernel Optimizations

CONSTANTS


Constant Memory
Data stored in global memory, read through a constant-cache path
__constant__ qualifier in declarations
Can only be read by GPU kernels
Limited to 64KB

To be used when all threads in a warp read the same address


Serializes otherwise (indicated by warp_serialize counter in profiler)

Throughput:
32 bits per warp per clock per multiprocessor
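A minimal sketch of declaring, filling, and reading constant memory (the coefficient array and kernel are illustrative):

#include <cuda_runtime.h>

__constant__ float c_coeff[16];                    // lives in the 64KB constant space

__global__ void scale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * c_coeff[0];               // all threads read the same address: broadcast
}

void setCoefficients(const float *h_coeff)
{
    cudaMemcpyToSymbol(c_coeff, h_coeff, 16 * sizeof(float));   // host writes constant memory
}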


Kernel Optimizations

TEXTURES


Textures in CUDA
A texture is an object for reading data

Benefits:
Data is cached
Helpful when coalescing is a problem

Filtering
Linear / bilinear / trilinear interpolation
Dedicated hardware

Wrap modes (for out-of-bounds addresses)


Clamp to edge / repeat

Addressable in 1D, 2D, or 3D


Using integer or normalized coordinates


Texture Addressing Modes


Wrap: an out-of-bounds coordinate is wrapped (modulo arithmetic)
Clamp: an out-of-bounds coordinate is replaced with the closest in-bounds boundary

[Figure: a small 2D texture fetched at the out-of-bounds coordinate (5.5, 1.5); wrap mode folds the fetch back into the texture, clamp mode pins it to the edge texel.]

CUDA Texture Types


Bound to linear memory
  A global memory address is bound to a texture
  Only 1D
  Integer addressing; no filtering, no addressing modes

Bound to CUDA arrays
  A block-linear CUDA array is bound to a texture
  1D, 2D, or 3D
  Float addressing (size-based or normalized)
  Filtering and addressing modes (clamp, repeat) supported

Bound to pitch linear memory
  A global memory address is bound to a 2D texture
  Float/integer addressing, filtering, and clamp/repeat addressing modes similar to CUDA arrays

Using Textures
Host (CPU) code:
Allocate/obtain memory (global linear, pitch linear, or CUDA array)
Create a texture reference object (currently must be at file scope)
Bind the texture reference to the memory/array
Unbind the texture reference when finished, free resources

Device (kernel) code:


Fetch using the texture reference
  Linear memory textures: tex1Dfetch()
  Array textures: tex1D(), tex2D(), or tex3D()
  Pitch linear textures: tex2D()
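A sketch using the texture reference API described on these slides (this API is deprecated and has been removed in recent CUDA releases; names are illustrative):

#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> texIn;          // texture reference, file scope

__global__ void readThroughTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texIn, i);                      // cached fetch from linear memory
}

void run(float *d_in, float *d_out, int n)
{
    cudaBindTexture(0, texIn, d_in, n * sizeof(float));     // bind linear memory to the reference
    readThroughTexture<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaUnbindTexture(texIn);                               // unbind when finished
}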
