Beruflich Dokumente
Kultur Dokumente
Scott Le Grand DESCRIBING INHERENTLY PARALLEL COMPUTATIONS, AND NVIDIA’S TESLA GPU
DIVERSE SET OF PROBLEMS AND THE PARALLEL SPEEDUPS OVER SEQUENTIAL CODES
Joshua Anderson RUNNING ON TRADITIONAL CPU ARCHITECTURES ATTAINED BY EXECUTING KEY
Iowa State University COMPUTATIONS ON THE GPU.
and Ames Laboratory
...... With the transition from single- parallel processors rather than simply as
Jim Hardwick core to multicore processors essentially graphics API accelerators.
complete, virtually all commodity CPUs NVIDIA developed the CUDA program-
TechniScan Medical Systems are now parallel processors. Increasing ming model and software environment to
parallelism, rather than increasing clock let programmers write scalable parallel
rate, has become the primary engine of programs using a straightforward extension
Scott Morton processor performance growth, and this of the C language. The CUDA program-
trend is likely to continue. This raises many ming model guides the programmer to
Hess important questions about how to produc- expose substantial fine-grained parallelism
tively develop efficient parallel programs sufficient for utilizing massively multi-
that will scale well across increasingly threaded GPUs, while at the same time
Everett Phillips parallel processors. providing scalability across the broad spec-
Modern graphics processing units trum of physical parallelism available in the
Yao Zhang (GPUs) have been at the leading edge of range of GPU devices. Because it provides a
increasing chip-level parallelism for some fairly simple, minimalist abstraction of
University of California, time. Current NVIDIA GPUs are many- parallelism and inherits all the well-known
core processor chips, scaling from 8 to 240 semantics of C, it lets programmers develop
Davis cores. This degree of hardware parallelism massively parallel programs with relative ease.
reflects the fact that GPU architectures In the year since its release, many
evolved to fit the needs of real-time developers have used CUDA to parallelize
Vasily Volkov computer graphics, a problem domain with and accelerate computations across various
tremendous inherent parallelism. With the problem domains. In this article, we survey
University of California, advent of the GeForce 8800—the first GPU some experiences gained in applying CUDA
based on NVIDIA’s Tesla unified architec- to a diverse set of problems and the parallel
Berkeley ture—it has become possible to program speedups attained by executing key compu-
GPU processors directly, as massively tations on the GPU.
...........................................................................
Figure 1. Parallel programming with CUDA: serial (a) and parallel (b) kernels for computing y
r a x + y.
CUDA parallel programming model blocks. The threads of a single thread block
The CUDA parallel programming model are allowed to synchronize with each other
emphasizes two key design goals. First, it via barriers and have access to a high-speed,
aims to extend a standard sequential per-block shared on-chip memory for inter-
programming language, specifically C/C++, thread communication. Threads from dif-
with a minimalist set of abstractions for ferent blocks in the same grid can coordi-
expressing parallelism. Broadly speaking, this nate only via operations in a shared global
lets the programmer focus on the important memory space visible to all threads. CUDA
issues of parallelism—how to craft efficient requires that thread blocks be independent,
parallel algorithms—rather than grappling meaning that a kernel must execute correct-
with the mechanics of an unfamiliar and ly no matter the order in which blocks are
complicated language. Second, it is designed run, even if all blocks are executed sequen-
for writing highly scalable parallel code that tially in arbitrary order without preemption.
can run across tens of thousands of concur- This restriction on the dependencies be-
rent threads and hundreds of processor cores. tween blocks of a kernel provides scalability.
This is essential because the physical paral- It also implies that the need for global
lelism of current NVIDIA GPUs ranges communication or synchronization amongst
from eight processor cores and 768 thread threads is the main consideration in decom-
contexts up to 240 processor cores and posing parallel work into separate kernels.
30,720 thread contexts. The CUDA model The details of the CUDA programming
naturally guides the programmer to write model are available in NVIDIA’s CUDA
parallel programs that transparently and Programming Guide (www.nvidia.com/
efficiently scale across these different levels CUDA) and related literature.1 Figure 1
of parallelism. shows some basic features of parallel
A CUDA program is organized into a programming with CUDA. It contains
host program, consisting of one or more straightforward implementations, both se-
sequential threads running on the host quential and parallel, of the SAXPY routine
CPU, and one or more parallel kernels that defined by the BLAS linear algebra library.
are suitable for execution on a parallel Given vectors x and y containing n floating-
processing device like the GPU. A kernel point numbers, it performs the update y r
executes a scalar sequential program on a set a x + y. The serial implementation is a
of parallel threads. The programmer orga- simple loop that computes one element of y
nizes these threads into a grid of thread in each iteration. The parallel kernel
...........................................................................
14 IEEE MICRO
effectively executes each of these independent supporting arbitrary address arithmetic,
iterations in parallel, assigning a separate and it fully supports both floating-point
thread to compute each element of y. The and integer data types.
__global__ modifier indicates that the Figure 2 shows the Tesla architecture of a
procedure is a kernel entry point, and the GeForce GTX 280 or Tesla T10 GPU with
extended function call syntax saxpy 240 SP streaming processor cores, organized
,,,B, T...(…) is used to launch in 30 SM streaming multiprocessors. Each
the kernel saxpy() in parallel across B multithreaded SP core executes up to 128
blocks of T threads each. Each thread of the concurrent coresident threads sharing a
kernel determines which element it should register file of 2,048 entries; in total, the
process from its integer thread block index GPU executes up to 30,720 concurrent
(blockIdx.x), its index within its block threads. The earlier GeForce 8800 GTX
(threadIdx.x), and the total number of GPU provides 16 SMs with 1,024 registers
threads per block (blockDim.x). per SP core and supports a maximum of
This example demonstrates a common 12,288 concurrent threads. Thread crea-
parallelization pattern, where a serial loop tion, scheduling, and resource management
with independent iterations can be executed are performed entirely in hardware. The
in parallel across many threads. In the cost of creating and destroying threads is
CUDA paradigm, the programmer writes a negligible, and there is effectively no
scalar program—the parallel saxpy() overhead involved in thread scheduling.
kernel—that specifies the behavior of a Each thread of a CUDA program is mapped
single thread of the kernel. This lets CUDA to a physical thread resident in the GPU,
leverage the underlying C language, with and each running thread block is physically
only a few small additions, such as the built- resident on a single SM. The SM multi-
in thread and block index variables. processor supports efficient communication
The SAXPY kernel is also a simple among threads in a block by providing an
example of data parallelism, where parallel extremely lightweight barrier synchroniza-
work is decomposed to match result data tion facility—CUDA’s barrier intrinsic
elements. Although different thread blocks translates into a single instruction—and 16
or different threads of a kernel can poten- Kbytes per SM of on-chip shared memory
tially execute entirely different code, such that has low access latency and high
task parallelism does not generally scale as bandwidth, similar to an L1 cache.
well as data parallelism. Moreover, data- The Tesla architecture is designed to
parallel kernels typically expose substantially operate on massively parallel problems. The
more fine-grained parallelism than task- deep multithreading provides substantial
parallel kernels and, therefore, generally can latency tolerance, allowing resources to be
take best advantage of the GPU architecture. dedicated toward throughput rather than
large caches. As we discussed earlier,
NVIDIA Tesla GPU architecture experience shows that data-parallel software
NVIDIA designed its Tesla unified design is frequently the best approach for
graphics and computing architecture2 to managing this level of fine-grained parallel-
accelerate parallel programs written in the ism. The threads of a data-parallel kernel
CUDA programming model. It is built will often be following substantially similar
around a fully programmable processor paths of execution. Consequently, the Tesla
array, organized into a number of SM architecture is optimized for this case. The
multithreaded multiprocessors, each of Tesla SM employs a single instruction,
which contains eight scalar SP processor multiple thread (SIMT) architecture,1,2
cores. In contrast to earlier generations of where a block’s threads are grouped into
GPUs, threads executing on the SP cores warps containing 32 threads each. A warp’s
have access to the full range of instructions threads are free to follow arbitrary and
programmers expect in any general-purpose independent execution paths, involving any
processor core. In particular, memory is number of branches, but can collectively
accessed through load/store instructions execute only a single instruction at any
...........................................................................
JULY–AUGUST 2008 15
.........................................................................................................................................................................................................................
ACCELERATOR ARCHITECTURES
Figure 2. Tesla unified graphics and computing architecture of a GeForce GTX 280 or Tesla T10 GPU with 240 SP streaming
processor cores, organized in 30 SM multithreaded multiprocessors. Each multithreaded SP core executes up to 128
concurrent threads; the GPU executes up to 30,720 concurrent threads.
16 IEEE MICRO
tion, and track their trajectory over specified
time intervals. Molecular dynamics is a
versatile tool that scientists can use to
calculate materials’ static and dynamic
properties, study methods of crystal growth,
or examine mechanisms by which proteins
perform their functions, to name only a few
of its uses.8 Figure 3 shows an example
snapshot from a bead-spring polymer
simulation with each of 64,017 particles
displayed as a sphere.9
Molecular dynamics simulations are in-
herently parallel computations and are
ideally suited to the CUDA programming
model. During each time step of the
simulation, the program must calculate the
forces acting on all atoms and use the
resulting forces to integrate atom positions
and velocities forward to the next step. The
different tasks required in a time step can
each be implemented in a separate kernel.
Because each atom can be processed
independently of the others during a single
time step, it is natural to map each atom to
a single thread. Thus, the application Figure 3. Snapshot of a bead-spring polymer simulation with
naturally provides the large amount of 64,017 particles.
fine-grained parallelism for which the
GPU is designed, and several molecular
dynamics10 and molecular modeling11,12 CPU cores.10 A multi-GPU implementation
codes have been successfully accelerated is currently under development, and the
with CUDA. performance is expected to scale well with
A typical molecular dynamics simulation the number of GPUs in the workstation.
of the polymer model in Figure 3 would be Due to the excellent scalability of Tesla-
run for 18 million time steps, and during architecture GPUs, HOOMD’s perfor-
each time step, every atom can be processed mance scales linearly with particle count,
independently of the others. Performing all from systems as small as 5,000 particles to
this work sequentially can be expensive; a those consuming all available memory.
serial calculation using a single core of a 2.4- Figure 4 shows the time taken to execute
GHz Opteron 280 CPU can calculate about just the pair force summation kernel for the
6.46 time steps per second (tps) and would polymer system with varying numbers of
take about 32 days to complete the entire particles (N ) and the speedup relative to an
run. Usually, scientists perform jobs of this optimized serial CPU implementation run-
size using parallel software such as ning on a single-core 3.0-GHz Xeon
LAMMPS,13 possibly completing it in a 80546K processor.10 The GPU is very
single day if executed on 32 processor cores efficient at the pair force calculation kernel,
in a cluster of Opteron 280 nodes connect- offering speedup factors of approximately
ed by InfiniBand. Implemented in CUDA 60 times for large systems. Furthermore, it
and executing on a single GeForce 8800 is able to accelerate the entire application-
GTX, Highly Optimized Object-Oriented level simulation for the problem shown in
Molecular Dynamics (HOOMD; www. Figure 3 by a total of 32 times.
ameslab.gov/hoomd) can execute the same In HOOMD, the different tasks required
simulation with a performance of 203 tps, in a time step are each implemented in a
equivalent to that of LAMMPS using 32 separate kernel. The host program performs
...........................................................................
JULY–AUGUST 2008 17
.........................................................................................................................................................................................................................
ACCELERATOR ARCHITECTURES
18 IEEE MICRO
of the CUDA implementation of the
Folding@Home energy kernel on a GeForce
8800M GTS (a laptop GPU with 64 SPs),
a GeForce 8800 GTS (a desktop GPU with
128 SPs), and the latest GeForce GTX 280
(a high-end GPU with 240 SPs). The
CUDA programming model was specifical-
ly designed to enable the design of scalable
parallel computations, and here we see clear
evidence of strong scaling across a range of
processors with substantially different levels
of physical parallelism. To place their
performance in context, the graph also
shows the performance of this simulation
on a Core2 Duo CPU, the Cell processor of
a Sony PlayStation 3, and an ATI Radon
3870 GPU. The algorithmic flexibility that
CUDA provides, and the architectural Figure 5. Performance of Folding@Home energy kernel on various
features that it exploits—specifically shared platforms.
memory—allow the CUDA implementa-
tion on the GTX 280 to run 3.6 times faster
than the Brook-based ATI implementation more careful optimization. Volkov and
and 6.7 times faster than the Cell imple- Demmel use a block algorithm similar to
mentation. those used for vector computers, using GPU
registers and per-block shared memory to
Numerical linear algebra store the data blocks. As the GPU has an
Dense matrix-matrix multiplication, par- unusually large register file, registers can be
ticularly as provided by the BLAS library used as the primary scratch space for the
GEMM routines, is one of the fundamental computation. Furthermore, assigning small
building blocks of numerical linear algebra blocks of elements to each thread, rather
algorithms. It is also a natural fit for CUDA than a single element to each thread, boosts
and the GPU because it is inherently efficiency much as strip-mining boosts
parallel and can naturally be expressed as a efficiency on vector machines. Finally, the
blocked computation. nonblocking nature of loads on the GPU
Volkov and Demmel15 describe the
design of a custom SGEMM matrix-matrix
multiplication kernel. Figure 6 summarizes
the performance of their algorithm running
on a GeForce 8800 GTX and Intel’s Math
Kernel Library (MKL) 10.0, running on a
2.4-GHz Core2 Quad Q6600 (‘‘Kents-
field’’) processor. This algorithm, which
operates on single-precision floating-point
numbers, achieves up to 206 Gflops—a 2.9
times improvement over the highest rate
achieved by the Core2 Quad—and roughly
60 percent of the GeForce 8800 GTX peak
multiply-add rate.
Writing a basic dense matrix-matrix
multiplication kernel is a fairly simple
exercise (see the CUDA Programming
Guide for detail). Achieving this high level Figure 6. Performance on dense matrix-matrix multiplication (SGEMM).
of performance, on the other hand, requires
...........................................................................
JULY–AUGUST 2008 19
.........................................................................................................................................................................................................................
ACCELERATOR ARCHITECTURES
20 IEEE MICRO
Figure 8. Example scans produced from ultrasound by TechniScan Medical Systems.
Cell processor to meet this goal, but each with CUDA—and a simple kernel to
has disadvantages. FPGAs can be a huge perform point-wise multiplication. This
engineering undertaking, and DSPs don’t approach is approximately eight times faster
provide the floating-point performance than a CPU version using an optimized
necessary to compute images in minutes. FFT and running on one core of a 2.4-GHz
The Cell processor showed potential, but Core2 Quad Q6600 processor. However,
the development workstation was expensive because the FFT size is fairly small (256 3
for a small startup. 64), using the entire GPU to perform a
For approximately $600—the cost of one single FFT does not produce enough work
GeForce 8800 GTX and an ATX power to efficiently utilize the GPU. Instead,
supply—TechniScan was able to start a creating a batched FFT that assigns multiple
proof-of-concept project to port the inverse- FFTs to different thread blocks is a much
scattering algorithm to the GPU. The more effective way of utilizing the hardware.
scanner collects ultrasound signals at regular Implementing a batched 2D FFT kernel
rotational positions and at regular coronal nearly doubled convolution performance
slice positions. The inverse-scattering algo- and made the GPU convolution almost 16
rithm is a modified nonlinear conjugate times faster than the CPU implementation.
gradient method that tries to match simu- Other parts of the algorithm showed
lated scattered ultrasound to the collected more dramatic speed improvements run-
signals. It uses 2D convolution by fast ning on the GPU. The most impressive
Fourier transform (FFT) to simulate ultra- improvements were seen in the CAXPBY
sound propagation.17 This simulation is used routine, which multiplies two complex-
to compute the residual, gradient, and step valued vectors by two different scalars and
length for each iteration. Approximately sums the resulting vectors; computation of
63 million 2D FFTs are performed during the L2 norm of a complex vector; and
the algorithm’s run, which accounts for over computation of the complex exponent of
75 percent of the computation time. each value of a large, complex vector.
A 2D GPU convolution routine can be Figure 9 shows the runtime of each routine
implemented easily using CUFFT—the running on a GeForce 8800 GTX and one
Fast Fourier Transform library supplied core of a 2.4-GHz Q6600 GPU.
...........................................................................
JULY–AUGUST 2008 21
.........................................................................................................................................................................................................................
ACCELERATOR ARCHITECTURES
22 IEEE MICRO
Figure 10. Decomposition of a 33 3 17 subdomain. A single-thread block computes a result
for the light gray tile in the center and also accesses neighboring information for integration
(white boxes surrounding the center tile) and smoothing (dark gray boxes surrounding the
center tile) overlap.
tion, the solver scales nearly linearly as more exact effort depending on the fidelity of the
GPUs are used, delivering 22 times the approximations used to simulate the wave
serial performance on one GPU and 160 propagation.
times the serial performance on eight GPUs. Given the large amount of parallel
All GPU simulations were performed in computation involved in the seismic imag-
single precision, and the results were ing process, it is natural to consider
verified to agree with a reference CPU mapping the process onto the GPU with
implementation. The use of single precision CUDA. We consider a specific imaging
required a priori calculation of the inlet and method, where the wave equation is solved
exit properties to compensate for numerical in the frequency domain using a paraxial
drift due to the exponentiation and square- approximation implemented with an ADI
root functions’ limited accuracy. (alternating-direction implicit) finite-differ-
ence method. This method’s computational
Seismic imaging cost is dominated by the evaluation and
The petroleum industry makes heavy use
of seismic data to construct images of the
Earth’s subsurface structure in its search for
oil and gas. A seismic survey of a prospective
region will typically consist of hundreds of
thousands of seismic experiments. Each
experiment involves an impulsive acoustic
source that generates a signal that propa-
gates up to tens of kilometers through the
subsurface, reflects off the interfaces be-
tween rock layers, and is recorded by a few
thousand pressure-sensitive receivers at the
surface. This acquisition process produces
terabytes of seismic data.
Reconstructing a subsurface image from Figure 11. Simulation results showing pressure distribution in a rocket
this data typically requires weeks of com- nozzle (a) and over a supersonic airfoil (b).
putations on thousands of CPUs, with the
...........................................................................
JULY–AUGUST 2008 23
.........................................................................................................................................................................................................................
ACCELERATOR ARCHITECTURES
Figure 12. Performance speedup for GPU clusters of varying size over a single CPU for
solving 2D Euler equations.
solution of complex-valued tridiagonal lin- bers of processors. The CUDA code was
ear systems, where there are on the order of executed on a Tesla C870 GPU. CPU
a billion systems for each seismic experi- performance was measured on two systems,
ment and the dimension of each system is in a dual 3.6-GHz Intel Xeon and a dual 3.0-
the hundreds. GHz quad-core Xeon (Harpertown). The
Although the solution of tridiagonal CPU code was optimized to use SSE
systems can be parallelized directly,21,22 this (streaming SIMD extension) vector instruc-
is not an effective way to solve such a large tions and compiled with the Intel Fortran
number of relatively small systems. A much compiler. Even though eight CPU cores
more efficient approach is to assign each were available on the Harpertown-based
linear system to a single thread. In this system, performance did not scale beyond
particular application, it also happens that four processes because of memory band-
many systems to be solved involve the same width limitations. This resulted in a single
coefficient matrix. This leads to a natural GPU out-performing an entire Harpertown
design in CUDA where each thread block is system by a factor of six on this seismic
given one tridiagonal matrix and each imaging code.
thread of that block solves a linear system
involving that matrix and a specific right-
hand side. Furthermore, keeping these
shared coefficients in shared memory avoids
I n considering the experiences of a
number of applications parallelized with
CUDA, we believe that a few basic trends
redundant accesses to memory. A thread are apparent. CUDA provides a straightfor-
block’s threads collectively read their coef- ward means of describing inherently parallel
ficients into shared memory and synchro- computations and is particularly well suited
nize, and then each thread computes its for data-parallel algorithms. The NVIDIA
system’s solution. Tesla GPU architecture delivers high com-
Figure 13 examines the speedup of the putational throughput on such kernels.
prototype CUDA implementation over Furthermore, unlike the massively parallel
CPU production code with varying num- machines of the past, which were large,
...........................................................................
24 IEEE MICRO
Figure 13. Speedup of a CUDA prototype wave-equation solver compared with various
CPU configurations.
expensive, and available to a select few, these The latest CUDA toolkit, documenta-
GPUs capable of running CUDA are tion, and code examples, as well as a
ubiquitous. By the end of summer 2008, directory of some of the many available
NVIDIA will have shipped roughly 80 mil- CUDA-based applications and research
lion CUDA-capable GPUs, transforming projects, are available at www.nvidia.com/
acceleration with massively parallel hardware CUDA/. A course on parallel programming
from a rarity into an everyday commodity. using CUDA is also available online (http://
Reviewing the many CUDA-enabled courses.ece.uiuc.edu/ece498/al). MICRO
applications now available, we encounter a
few important design techniques. First, and ................................................................................................
foremost, is the fundamental importance of References
exposing sufficient amounts of fine-grained 1. J. Nickolls et al., ‘‘Scalable Parallel Pro-
parallelism to exploit hardware like the gramming with CUDA,’’ ACM Queue,
Tesla-architecture GPU. Second is the vol. 6, no. 2, Mar./Apr. 2008, pp. 40-53.
importance of blocking computations, a 2. E. Lindholm et al., ‘‘NVIDIA Tesla: A Unified
process that naturally fits the CUDA thread Graphics and Computing Architecture,’’
block abstraction and encourages data IEEE Micro, vol. 28, no. 2, Mar./Apr.
layout and access patterns with high locality. 2008, pp. 39-55.
Third is the efficiency of data-parallel 3. B. Catanzaro, N. Sundaram, and K. Keutzer,
programs where threads of a warp follow ‘‘Fast Support Vector Machine Training and
the same execution path, thus fully utilizing Classification on Graphics Processors,’’
the GPU’s processor cores. Finally is the Proc. 25th Ann. Int’l Conf. Machine Learn-
benefit of the on-chip, per-block shared ing, Omnipress, 2008, pp. 104-111.
memory provided by the Tesla architecture, 4. B. He et al., ‘‘Relational Joins on Graphics
which provides high-speed, low-latency Processors,’’ Proc. ACM SIGMOD 2008,
scratchpad space that is critical to the ACM Press, 2008; www.cse.ust.hk/
performance of many efficient algorithms. catalac/papers/gpujoin_sigmod08.pdf.
...........................................................................
JULY–AUGUST 2008 25
.........................................................................................................................................................................................................................
ACCELERATOR ARCHITECTURES
26 IEEE MICRO
PhD in electrical engineering from Stanford and electrical and computer engineering.
University. His research interests include computational
fluid dynamics, parallel computing, and
Joshua Anderson is a graduate student in graphics hardware. Phillips received his BS
the Department of Physics and Astronomy in mechanical engineering from the Uni-
at Iowa State University and the Ames versity of California, Davis.
Laboratory. His PhD work focuses on the
design and implementation of a general- Yao Zhang is a PhD student in the
purpose molecular dynamics simulation Department of Electrical and Computer
code on the GPU and the theoretical design Engineering at University of California,
of new polymeric materials by computa- Davis. Zhang received his BS in electrical
tional methods. Anderson received his BS in engineering from the Beijing Institute of
physics and computer science from Michi- Technology. His research interests are in the
gan Tech. area of GPU computing, especially in solving
linear algebra and fluid problems on GPUs.
Jim Hardwick is the senior software
engineer at TechniScan Medical Systems Vasily Volkov is a PhD student in the
in Salt Lake City, Utah. His expertise is in Department of Electrical Engineering and
medical device development, including Computer Science at the University of
software and hardware development and California, Berkeley. His reserach interests
human-computer interaction. Hardwick include high-performance computing and
received his BS in computer science from numerical linear algebra. Volkov received
Westminster College. his MS from the Moscow Institute of
Physics and Technology and MEng from
Scott Morton leads a geophysical technol- the Nanyang Technological University.
ogy group at Hess Corporation. His
research interests include geophysical tech- Direct questions and comments about
nologies with a focus on seismic imaging. this article to Michael Garland, NVIDIA,
Morton received his PhD in astrophysics 2701 San Tomas Expressway, Santa Clara,
from the University of Illinois at Urbana- CA 95050; mgarland@nvidia.com.
Champaign.
For more information on this or any
Everett Phillips is a graduate student at the other computing topic, please visit our
University of California, Davis, working in Digital Library at http://computer.org/
mechanical and aeronautical engineering csdl.
...........................................................................
JULY–AUGUST 2008 27