
CUDA Optimization Strategies
for Compute- and Memory-bound
Neuroimaging Algorithms
Daren Lee, Ivo Dinov, Bin Dong, Boris Gutman, Igor Yanovsky, Arthur W. Toga

Presentation for TDT24, by Andreas Berg Skomedal


Intro
The GPU is more desirable for computational work
Neuroimaging algorithms share many properties with, for example, physics and math algorithms
Analysis of neighborhood data for each data element can be exploited through effective shared memory use
CPU vs GPU
CPUs are optimized to minimize memory latency
GPUs are optimized for computational throughput
Balancing GPU resources:
memory
registers
threads
memory latency versus execution time
GPU
Streaming Multiprocessors (SMs) contain the processor cores
SMs have a fixed pool of registers and on-chip memory
Kernels are executed as groups of Thread Blocks, each allocated to one SM
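A minimal sketch of this execution model (kernel and variable names are illustrative, not from the paper): a kernel is launched as a grid of thread blocks, and the hardware scheduler assigns each block to one SM, where it shares that SM's register pool and on-chip memory until it finishes.

```cuda
#include <cuda_runtime.h>

// Each thread handles one element; blocks of threads are scheduled onto SMs.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threadsPerBlock = 256;                       // one thread block's size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

The block size (256 here) is exactly the tuning knob the later slides discuss: it trades threads against registers and shared memory per SM.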
Compute Bound vs Memory Bound
Compute Bound
Large number of computations per data element
Memory Bound
Many data elements per computation
Performance is tuned by determining the maximum resources that can be allocated to a Thread Block
The goal is the maximum number of threads per Thread Block
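The distinction can be made concrete with two small kernels (illustrative, not from the paper), one at each end of the spectrum:

```cuda
// Memory bound: roughly one addition per twelve bytes moved, so the
// memory bus, not the ALUs, limits throughput.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];            // 1 FLOP per 3 memory accesses
}

// Compute bound: one load and one store, but many register-only
// operations per element, so the ALUs limit throughput.
__global__ void iterate(float *x, int n, int steps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];                // single global load
        for (int s = 0; s < steps; ++s)
            v = v * 1.0001f + 0.5f;    // 2 FLOPs per step, all in registers
        x[i] = v;                      // single global store
    }
}
```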
Compute Bound Problems
The number of threads per SM is typically limited by the number of registers used
Improvements:
Use shared memory to reduce register use, which increases the number of threads
Precompute and store intermediate values
Try different registers/blocks-per-SM configurations
Applied to the force field calculation in an image registration algorithm for automatically spatially aligning multiple sets of 3D images
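The first two improvements can be combined in one sketch (assumed kernel, not the paper's force-field code): intermediate values that every thread needs are precomputed once into shared memory instead of being held in per-thread registers, freeing registers so more threads fit on each SM.

```cuda
// Hypothetical kernel: a 9-entry weight table is precomputed once per
// block and read from shared memory, rather than recomputed (or held in
// registers) by every thread.
__global__ void applyWeights(const float *in, float *out, int n) {
    __shared__ float weights[9];                  // block-wide intermediate table
    if (threadIdx.x < 9)
        weights[threadIdx.x] = __expf(-(float)threadIdx.x); // precompute once
    __syncthreads();                              // table ready for all threads

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * weights[i % 9];          // shared read, no register copy
}
```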
Memory Bound Problems
Memory bandwidth or the amount of memory needed is the main issue
Improvements:
Caching, and the type of cache used
Repartition data / copy subsets
Multipass / integrated multipass
Multi-GPU
Reused Data Halo
(Figure: the area marked in orange is reused data that is read by each thread)
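The halo idea can be sketched for a 1D 3-point stencil (names and sizes are illustrative): each block stages its tile plus one halo element on each side into shared memory, so neighboring reads hit fast on-chip memory instead of re-fetching the same global data from every thread.

```cuda
#define TILE 256  // assumed equal to blockDim.x

__global__ void stencil3(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 2];               // interior + left/right halo
    int g = blockIdx.x * blockDim.x + threadIdx.x; // global index
    int l = threadIdx.x + 1;                       // local index, past left halo

    if (g < n) tile[l] = in[g];                    // each thread loads its element
    if (threadIdx.x == 0)                          // first thread loads left halo
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)             // last thread loads right halo
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();                               // tile fully staged

    if (g < n)
        out[g] = tile[l - 1] + tile[l] + tile[l + 1]; // neighbors from shared mem
}
```

Without the shared-memory tile, each input element would be read from global memory by up to three different threads; the halo makes that reuse explicit and on-chip.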
Multipass (figure)
Scaling for multi-GPU on the FDTD algorithm (figure)
Effect of GPU tradeoffs (figure)
Summary and general ideas
Recompute vs reuse
Utilize broadcast if possible
Avoid calculations on data that doesn't affect the result "enough"
Optimizing the thread block configuration is most important
The number of Thread Blocks should be a multiple of the number of SMs
Optimized result: 6 times as fast as the first implementation
Based on the article
"CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms"
Daren Lee (a), Ivo Dinov (a), Bin Dong (b), Boris Gutman (a), Igor Yanovsky (c), Arthur W. Toga (a)
(a) Laboratory of Neuro Imaging, David Geffen School of Medicine, UCLA, 635 Charles Young Drive South, Suite 225, Los Angeles, CA 90095, USA
(b) Department of Mathematics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
(c) Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109, USA
References
Some figures taken from:
"3D Finite Difference Computation on GPUs using CUDA", Paulius Micikevicius, NVIDIA
