CUDA Optimizations
Overview
Memory optimizations
Execution configuration optimization
Instruction optimization
A. Di Blas
Terminology
Thread: concurrent code and associated state, executed on the CUDA
device in parallel with other threads
Warp: a group of threads executed physically in parallel on a streaming
multiprocessor (SM) in SIMD fashion.
Memory optimizations
Memory architecture
Thousands of lightweight threads
Memory     Cached   Access   Who
Local      NO       R/W      One thread
Shared     n/a      R/W      All threads in a block
Global     NO       R/W      All threads + host
Constant   YES      R        All threads + host
Texture    YES      R        All threads + host
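The table above can be illustrated by where each space appears in CUDA C. A minimal sketch (the kernel and variable names are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>

__constant__ float coeffs[16];          // constant memory: read-only in kernels, set from host

__global__ void memory_spaces(const float *in, float *out)   // in/out: global memory
{
    __shared__ float tile[256];         // shared memory: visible to all threads in the block
    float local_val;                    // register (spills go to local memory): one thread only

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];          // global -> shared
    __syncthreads();

    local_val = tile[threadIdx.x] * coeffs[0];
    out[i] = local_val;                 // result back to global memory
}
```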
Constant memory
- cudaMemcpyToSymbol(..., kind)
  where kind can be cudaMemcpyHostToDevice or cudaMemcpyDeviceToDevice
- cudaMemcpyFromSymbol(..., kind)
  where kind can be cudaMemcpyDeviceToHost or cudaMemcpyDeviceToDevice
NOTE that cudaGetSymbolAddress() cannot take the address of
__constant__ data.
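A minimal sketch of the two calls above, assuming a `__constant__` array named `coeffs` (the name and size are illustrative):

```cuda
#include <cuda_runtime.h>

__constant__ float coeffs[16];

int main(void)
{
    float h_coeffs[16] = {0};

    // Host -> constant memory (offset 0, kind = cudaMemcpyHostToDevice)
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs), 0,
                       cudaMemcpyHostToDevice);

    // Constant memory -> host (offset 0, kind = cudaMemcpyDeviceToHost)
    cudaMemcpyFromSymbol(h_coeffs, coeffs, sizeof(h_coeffs), 0,
                         cudaMemcpyDeviceToHost);
    return 0;
}
```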
Coalescing
Global memory accesses by the threads of a (half-)warp can be combined
into a single memory transaction when the threads access consecutive,
properly aligned words.
[Figure: coalesced access. Threads t0..t15 read consecutive 4-byte words
at addresses 128..192, combined into one memory transaction.]
[Figure: out-of-order and misaligned accesses to addresses 128..192;
these patterns are not coalesced on early CUDA hardware.]
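The two access patterns in the figures can be sketched as kernels; the names are illustrative:

```cuda
__global__ void coalesced(const float *in, float *out)
{
    // Thread t of a (half-)warp reads consecutive 4-byte words:
    // t0 -> 128, t1 -> 132, ... as in the figure. One transaction.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

__global__ void strided(const float *in, float *out, int stride)
{
    // Threads read words `stride` elements apart: the accesses cannot
    // be combined into one transaction, so effective bandwidth drops.
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```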
Shared memory
16 banks, 4-byte wide.
As fast as registers, if there are no bank conflicts.
The fast case:
[Figure: stride-1 access. Thread i reads bank i (threads 0..15 map
one-to-one onto banks 0..15), so there are no bank conflicts.]
[Figure: stride-2 and stride-8 access. Several threads map onto the same
bank (e.g. with stride 2, threads i and i+8 both hit bank 2i mod 16),
causing bank conflicts and serialized accesses.]
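A small sketch of a stride-2 bank conflict and the common padding trick used to avoid conflicts in 2-D tiles (the kernel and array names are illustrative):

```cuda
#define N 16

__global__ void bank_conflict_demo(float *out)
{
    // Stride-2: with 16 banks of 4-byte words, threads t and t+8 both
    // hit bank (2t mod 16): a two-way bank conflict.
    __shared__ float a[2 * N];
    a[2 * threadIdx.x] = (float)threadIdx.x;
    __syncthreads();

    // Common fix for 2-D tiles: pad each row by one word so that
    // column-wise accesses fall in different banks.
    __shared__ float tile[N][N + 1];
    tile[threadIdx.x][0] = (float)threadIdx.x;   // conflict-free column access
    __syncthreads();

    out[threadIdx.x] = a[2 * threadIdx.x] + tile[threadIdx.x][0];
}
```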
Execution configuration optimization
Occupancy
- Executing other warps is the only way to hide latency and keep the
  hardware busy.
- Occupancy = number of warps running concurrently on a multiprocessor
  as a fraction of the maximum number of warps that can run concurrently.
- Limited by resource usage:
  - Registers
  - Shared memory
- Increasing occupancy does not necessarily improve performance, but
  low-occupancy kernels cannot hide memory latency.
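Later CUDA runtime versions than these slides assume expose an occupancy query; a sketch with an illustrative kernel `my_kernel` and a 32-thread warp size:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *x) { x[threadIdx.x] *= 2.0f; }

int main(void)
{
    int max_blocks_per_sm = 0, block_size = 256;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm,
                                                  my_kernel, block_size, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Occupancy = resident warps / maximum warps per multiprocessor
    float occupancy = (max_blocks_per_sm * block_size / 32.0f)
                    / (prop.maxThreadsPerMultiProcessor / 32.0f);
    printf("occupancy: %.2f\n", occupancy);
    return 0;
}
```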
Choosing the number of blocks:
- # of blocks > # of multiprocessors, so every multiprocessor has at
  least one block to execute
- # of blocks / # of multiprocessors > 2, so multiple blocks can run
  concurrently on each multiprocessor
Register dependency
- A read-after-write (RAW) dependency between registers has a latency of
  about 11 clock cycles.
- To completely hide this latency, enough other warps must be resident
  so that the scheduler always has an independent instruction to issue.
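A RAW register dependency in a kernel looks like this (illustrative sketch; the second instruction must wait for the first unless other warps fill the pipeline):

```cuda
__global__ void raw_dependency(float *out, float a, float b, float c)
{
    float x = a * b;    // write x
    float y = x + c;    // read-after-write on x: stalls ~11 clock cycles
                        // unless the scheduler can issue other warps
    out[threadIdx.x] = y;
}
```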
Instruction optimization
Control flow instructions
Branching may generate divergence: threads of the same warp that take
different paths are serialized, and the warp executes each path in turn.
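A sketch of a divergent branch and a warp-uniform alternative (kernel names are illustrative):

```cuda
__global__ void divergent(float *out)
{
    // Threads within one warp take different paths: the warp executes
    // both paths serially, masking off the inactive threads.
    if (threadIdx.x % 2 == 0)
        out[threadIdx.x] = 0.0f;
    else
        out[threadIdx.x] = 1.0f;
}

__global__ void uniform(float *out)
{
    // Branch on a warp-granular condition: all threads of a warp take
    // the same path, so there is no divergence.
    if ((threadIdx.x / warpSize) % 2 == 0)
        out[threadIdx.x] = 0.0f;
    else
        out[threadIdx.x] = 1.0f;
}
```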
Lab