Strategies
for Compute- and Memory-bound
Neuroimaging Algorithms
Daren Lee, Ivo Dinov, Bin Dong, Boris Gutman, Igor Yanovsky, Arthur W. Toga
Compute Bound
Many computations per data element
Improvements
Using shared memory to reduce register use,
which allows more threads to run concurrently
Precomputing/storing intermediate values
Different registers / blocks per SM configurations
Example: the force field calculation in an image registration algorithm
for automatically spatially aligning multiple sets of 3D images.
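A minimal sketch of the shared-memory idea above (not the paper's actual force field kernel; all names and the computation itself are illustrative): a per-thread intermediate value is precomputed into shared memory instead of being held in a register, trading register pressure for shared-memory use so that more threads fit on each SM.

```cuda
// Illustrative sketch, assuming a 1D launch with blockDim.x == 256.
// The intermediate value lives in shared memory rather than a register,
// which can lower per-thread register use and raise occupancy.
__global__ void forceFieldSketch(const float *img, float *force, int n)
{
    __shared__ float tmp[256];   // one precomputed intermediate per thread

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Precompute and store the intermediate value once...
    tmp[threadIdx.x] = img[i] * img[i];

    // ...then reuse it in the (here trivial) force computation.
    // No __syncthreads() is needed: each thread reads only its own slot.
    force[i] = 0.5f * tmp[threadIdx.x];
}
```

Whether this wins depends on the kernel: if registers are not the occupancy limiter, spilling intermediates to shared memory only adds latency.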
Memory Bound Problems
Memory bandwidth or the amount of memory
needed is the main bottleneck
Improvements
Caching, type of cache used
Repartition data/copy subsets
Multipass / integrated multipass
Multi-GPU
Reused Data Halo
The area marked in orange is reused (halo)
data that is read by multiple threads
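The halo pattern can be sketched as a 2D stencil kernel (illustrative, not the FDTD kernel from the paper; tile size, radius, and the Laplacian update are assumptions): each block stages a tile plus an R-wide border into shared memory, and that border is the halo data which neighbouring blocks also load from global memory.

```cuda
// Sketch of halo loading for a 2D stencil, assuming the block is launched
// with blockDim = (TILE + 2*R, TILE + 2*R) so each thread loads one element.
#define TILE 16
#define R    1   // stencil radius -> halo width

__global__ void stencilSketch(const float *in, float *out, int nx, int ny)
{
    __shared__ float tile[TILE + 2*R][TILE + 2*R];

    // Global coordinates including the halo border; clamp at image edges.
    int gx = blockIdx.x * TILE + threadIdx.x - R;
    int gy = blockIdx.y * TILE + threadIdx.y - R;
    gx = min(max(gx, 0), nx - 1);
    gy = min(max(gy, 0), ny - 1);

    tile[threadIdx.y][threadIdx.x] = in[gy * nx + gx];
    __syncthreads();

    // Only interior threads compute; the halo threads exist just to load.
    if (threadIdx.x >= R && threadIdx.x < TILE + R &&
        threadIdx.y >= R && threadIdx.y < TILE + R)
    {
        float c   = tile[threadIdx.y][threadIdx.x];
        float lap = tile[threadIdx.y-1][threadIdx.x]
                  + tile[threadIdx.y+1][threadIdx.x]
                  + tile[threadIdx.y][threadIdx.x-1]
                  + tile[threadIdx.y][threadIdx.x+1]
                  - 4.0f * c;
        out[gy * nx + gx] = c + 0.1f * lap;   // example update coefficient
    }
}
```

The halo is the cost of tiling: the wider the radius R relative to TILE, the larger the fraction of redundant global-memory reads, which is one motivation for the multipass and multi-GPU repartitioning strategies above.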
Multipass
Scaling for multi-GPU on the FDTD algorithm
Effect of GPU tradeoffs
Summary and general ideas
Recompute vs reuse
Utilize broadcast if possible
Avoid calculations on data that doesn't affect
the result "enough"
Optimizing the thread block configuration is
most important
The number of thread blocks should be a multiple
of the number of SMs
The optimized implementation was 6 times as fast
as the first implementation
Based on the article:
CUDA optimization strategies for compute- and
memory-bound neuroimaging algorithms
Daren Lee (a), Ivo Dinov (a), Bin Dong (b),
Boris Gutman (a), Igor Yanovsky (c), Arthur W. Toga (a)
(a) Laboratory of Neuro Imaging, David Geffen School of Medicine, UCLA, 635 Charles Young Drive South Suite 225, Los Angeles, CA 90095, USA
(b) Department of Mathematics, University of California, 9500 Gilman Drive, La Jolla, San Diego, CA 92093, USA
(c) Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109, USA
References
Some figures taken from:
3D Finite Difference Computation on GPUs using CUDA, Paulius Micikevicius, NVIDIA