
practice

doi:10.1145/1400181.1400197

A Closer Look at GPUs

As the line between GPUs and CPUs begins to blur, it's important to understand what makes GPUs tick.

by Kayvon Fatahalian and Mike Houston

A gamer wanders through a virtual world rendered in near-cinematic detail. Seconds later, the screen fills with a 3D explosion, the result of unseen enemies hiding in physically accurate shadows. Disappointed, the user exits the game and returns to a computer desktop that exhibits the stylish 3D look-and-feel of a modern window manager. Both of these visual experiences require hundreds of gigaflops of computing performance, a demand met by the GPU (graphics processing unit) present in every consumer PC.

The modern GPU is a versatile processor that constitutes an extreme but compelling point in the growing space of multicore parallel computing architectures. These platforms, which include GPUs, the STI Cell Broadband Engine, the Sun UltraSPARC T2, and increasingly multicore x86 systems from Intel and AMD, differentiate themselves from traditional CPU designs by prioritizing high-throughput processing of many parallel operations over the low-latency execution of a single task.

GPUs assemble a large collection of fixed-function and software-programmable processing resources. Impressive statistics, such as ALU (arithmetic logic unit) counts and peak floating-point rates, often emerge during discussions of GPU design. Despite the inherently parallel nature of graphics, however, efficiently mapping common rendering algorithms onto GPU resources is extremely challenging.

The key to high performance lies in strategies that hardware components and their corresponding software interfaces use to keep GPU processing resources busy. GPU designs go to great lengths to obtain high efficiency and, conveniently, to reduce the difficulty programmers face when programming graphics applications. As a result, GPUs deliver high performance and expose an expressive but simple programming interface. This interface remains largely devoid of explicit parallelism or asynchronous execution and has proven to be portable across vendor implementations and generations of GPU designs.

At a time when the shift toward throughput-oriented CPU platforms is prompting alarm about the complexity of parallel programming, understanding key ideas behind the success of GPU computing is valuable not only for developers targeting software for GPU execution, but also for informing the design of new architectures and programming systems for other domains. In this article, we dive under the hood of a modern GPU to look at why interactive rendering is challenging and to explore the solutions GPU architects have devised to meet these challenges.

The Graphics Pipeline
A graphics system generates images that represent views of a virtual scene. This scene is defined by the geometry, orientation, and material properties of object surfaces and the position and characteristics of light sources.

A scene view is described by the location of a virtual camera. Graphics systems seek to find the appropriate balance between conflicting goals of enabling maximum performance and maintaining an expressive but simple interface for describing graphics computations.

Real-time graphics APIs such as Direct3D and OpenGL strike this balance by representing the rendering computation as a graphics processing pipeline that performs operations on four fundamental entities: vertices, primitives, fragments, and pixels. Figure 1 provides a block diagram of a simplified seven-stage graphics pipeline. Data flows between stages in streams of entities. This pipeline contains fixed-function stages (green) implementing API-specified operations and three programmable stages (red) whose behavior is defined by application code. Figure 2 illustrates the operation of key pipeline stages.

Figure 1: A simplified graphics pipeline. [Diagram: the stages vertex generation (VG), vertex processing (VP), primitive generation (PG), primitive processing (PP), fragment generation (FG), fragment processing (FP), and pixel operations (PO) connected in sequence; memory buffers supply vertex descriptors, vertex data buffers, vertex topology, textures, and global buffers, and the pipeline ends at the output image. A legend distinguishes fixed-function stages from shader-defined stages.]

VG (vertex generation). Real-time graphics APIs represent surfaces as collections of simple geometric primitives (points, lines, or triangles). Each primitive is defined by a set of vertices. To initiate rendering, the application provides the pipeline's VG stage with a list of vertex descriptors. From this list, VG prefetches vertex data from memory and constructs a stream of vertex data records for subsequent processing. In practice, each record contains the 3D (x, y, z) scene position of the vertex plus additional application-defined parameters such as surface color and normal vector orientation.

VP (vertex processing). The behavior of VP is application programmable. VP operates on each vertex independently and produces exactly one output vertex record from each input record. One of the most important operations of VP execution is computing the 2D output image (screen) projection of the 3D vertex position.

PG (primitive generation). PG uses vertex topology data provided by the application to group vertices from VP into an ordered stream of primitives (each primitive record is the concatenation of several VP output vertex records). Vertex topology also defines the order of primitives in the output stream.

PP (primitive processing). PP operates independently on each input primitive to produce zero or more output primitives. Thus, the output of PP is a new (potentially longer or shorter) ordered stream of primitives. Like VP, PP operation is application programmable.

FG (fragment generation). FG samples each primitive densely in screen space (this process is called rasterization). Each sample is manifest as a fragment record in the FG output stream. Fragment records contain the output image position of the surface sample, its distance from the virtual camera, as well as values computed via interpolation of the source primitive's vertex parameters.

FP (fragment processing). FP simulates the interaction of light with scene surfaces to determine surface color and opacity at each fragment's sample point. To give surfaces realistic appearances, FP computations make heavy use of filtered lookups into large, parameterized 1D, 2D, or 3D arrays called textures. FP is an application-programmable stage.

PO (pixel operations). PO uses each fragment's screen position to calculate and apply the fragment's contribution to output image pixel values. PO accounts for a sample's distance from the virtual camera and discards fragments that are blocked from view by surfaces closer to the camera. When fragments from multiple primitives contribute to the value of a single pixel, as is often the case when semi-transparent surfaces overlap, many rendering techniques rely on PO to perform pixel updates in the order defined by the primitives' positions in the PP output stream. All graphics APIs guarantee this behavior, and PO is the only stage where the order of entity processing is specified by the pipeline's definition.
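To make these stage interfaces concrete, the following C++ sketch models the pipeline's entities as plain data records and each stage as a stream-transforming function. It is only an interface sketch under assumed type and field names; it is not an actual graphics API or a description of GPU hardware.

    #include <vector>

    // Illustrative data records for the pipeline's entities (assumed field names).
    struct Vertex    { float position[4]; float color[4]; float normal[3]; };
    struct Primitive { Vertex v[3]; };                    // e.g., one triangle
    struct Fragment  { int x, y;                          // output image position
                       float depth;                       // distance from the camera
                       float interpolated[4]; };          // interpolated vertex parameters
    struct Pixel     { float rgba[4]; float depth; };

    // Stage interfaces, expressed as stream transformations (declarations only).
    Vertex                 vertexProcessing   (const Vertex& in);      // 1 in -> exactly 1 out
    std::vector<Primitive> primitiveProcessing(const Primitive& in);   // 1 in -> 0..n out
    std::vector<Fragment>  fragmentGeneration (const Primitive& in);   // rasterization
    Fragment               fragmentProcessing (const Fragment& in);    // shading
    void                   pixelOperations    (const Fragment& in, std::vector<Pixel>& image);

The signatures capture the dataflow described above: VP maps one vertex to exactly one vertex, PP may lengthen or shorten the primitive stream, FG expands each primitive into the fragments it covers, and PO folds fragments into the output image.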
Shader Programming
The behavior of application-programmable pipeline stages (VP, PP, FP) is defined by shader functions (or shaders). Graphics programmers express vertex, primitive, and fragment shader functions in high-level shading languages such as NVIDIA's Cg, OpenGL's GLSL, or Microsoft's HLSL. Shader source is compiled into bytecode offline, then transformed into a GPU-specific binary by the graphics driver at runtime.

Shading languages support complex data types and a rich set of control-flow constructs, but they do not contain primitives related to explicit parallel execution. Thus, a shader definition is a C-like function that serially computes output-entity data records from a single input entity. Each function invocation is abstracted as an independent sequence of control that executes in complete isolation from the processing of other stream entities.

As a convenience, in addition to data records from stage input and output streams, shader functions may access (but not modify) large, globally shared data buffers. Prior to pipeline execution, these buffers are initialized to contain shader-specific parameters and textures by the application.
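As an illustration of this programming model, here is a C++ analogue of a simple diffuse-shading fragment shader: a serial function that consumes one input fragment record and produces one output record, reading (but never writing) globally shared parameters and a texture. The structs and the sampleTexture helper are assumptions made for the sketch, not part of any shading language.

    // Globally shared, read-only shader parameters (assumed layout).
    struct LightParams { float direction[3]; float intensity; };

    // Assumed helper: filtered lookup into a single-channel 2D texture.
    float sampleTexture(const float* texture, int width, int height, float u, float v);

    struct FragmentIn  { float u, v; float normal[3]; float depth; };
    struct FragmentOut { float rgba[4]; float depth; };

    // One invocation shades exactly one fragment, in isolation from all others.
    FragmentOut shadeFragment(const FragmentIn& in,
                              const LightParams& light,
                              const float* texture, int w, int h) {
        // Filtered texture lookup provides the surface's base color.
        float base = sampleTexture(texture, w, h, in.u, in.v);

        // Simple diffuse term: dot(N, L) clamped to zero, scaled by light intensity.
        float ndotl = in.normal[0] * light.direction[0] +
                      in.normal[1] * light.direction[1] +
                      in.normal[2] * light.direction[2];
        if (ndotl < 0.0f) ndotl = 0.0f;

        FragmentOut out;
        out.rgba[0] = out.rgba[1] = out.rgba[2] = base * ndotl * light.intensity;
        out.rgba[3] = 1.0f;          // opaque surface
        out.depth   = in.depth;      // pass depth through for PO
        return out;
    }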
Characteristics and Challenges
Graphics pipeline execution is characterized by the following key properties.

Opportunities for parallel processing. Graphics presents opportunities for both task- (across pipeline stages) and data- (stages operate independently on stream entities) parallelism, making parallel processing a viable strategy for increasing throughput.


Despite abundant potential parallelism, however, the unpredictable cost of shader execution and constraints on the order of PO-stage processing introduce dynamic, fine-grained dependencies that complicate parallel implementation throughout the pipeline. Although output image contributions from most fragments can be applied in parallel, those that contribute to the same pixel cannot.

Extreme variations in pipeline load. Although the number of stages and data flows of the graphics pipeline is fixed, the computational and bandwidth requirements of all stages vary significantly depending on the behavior of shader functions and the properties of the scene. For example, primitives that cover large regions of the screen generate many more fragments than vertices. In contrast, many small primitives result in high vertex-processing demands. Applications frequently reconfigure the pipeline to use different shader functions that vary from tens of instructions to a few hundred. For these reasons, over the duration of processing for a single frame, different stages will dominate overall execution, often resulting in bandwidth- and compute-intensive phases of execution. Dynamic load balancing is required to maintain an efficient mapping of the graphics pipeline to a GPU's resources in the face of this variability, and GPUs employ sophisticated heuristics for reallocating execution and on-chip storage resources amongst pipeline stages depending on load.

Fixed-function stages encapsulate difficult-to-parallelize work. Programmable stages are trivially parallelizable by executing shader function logic simultaneously on multiple stream entities. In contrast, the pipeline's nonprogrammable stages involve multiple entity interactions (such as ordering dependencies in PO or vertex grouping in PG) and stateful processing. Isolating this non-data-parallel work into fixed stages allows the GPU's programmable processing components to be highly specialized for data-parallel execution and keeps the shader programming model simple. In addition, the separation enables difficult aspects of the graphics computation to be encapsulated in optimized, fixed-function hardware components.

Mixture of predictable and unpredictable data access. The graphics pipeline rigidly defines inter-stage data flows using streams of entities. This predictability presents opportunities for aggregate prefetching of stream data records and highly specialized hardware management of on-chip storage resources. In contrast, buffer and texture accesses performed by shaders are fine-grained memory operations on dynamically computed addresses, making prefetch difficult. As both forms of data access are critical to maintaining high throughput, shader programming models explicitly differentiate stream from buffer/texture memory accesses, permitting specialized hardware solutions for both types of accesses.

Opportunities for instruction stream sharing. While the shader programming model permits each shader invocation to follow a unique stream of control, in practice, shader execution on nearby stream elements often results in the same dynamic control-flow decisions. As a result, multiple shader invocations can likely share an instruction stream. Although GPUs must accommodate situations where this is not the case, the use of SIMD-style execution to exploit shared control flow across multiple shader invocations is a key optimization in the design of GPU processing cores and is accounted for in algorithms for pipeline scheduling.

Figure 2: Graphics pipeline operations. (a) Six vertices (v0 through v5) from the VG output stream define the scene position and orientation of two triangles. (b) Following VP and PG, the vertices have been transformed into their screen-space positions and grouped into two triangle primitives, p0 and p1. (c) FG samples the two primitives, producing a set of fragments corresponding to p0 and p1. (d) FP computes the appearance of the surface at each sample location. (e) PO updates the output image with contributions from the fragments, accounting for surface visibility. In this example, p1 is nearer to the camera than p0; as a result, p0 is occluded by p1.

Programmable Processing Resources
A large fraction of a GPU's resources exist within programmable processing cores responsible for executing shader functions. While substantial implementation differences exist across vendors and product lines, all modern GPUs maintain high efficiency through the use of multicore designs that employ both hardware multithreading and SIMD (single instruction, multiple data) processing. As shown in Table 1, these throughput-computing techniques are not unique to GPUs (top two rows). In comparison with CPUs, however, GPU designs push these ideas to extreme scales.

Multicore + SIMD Processing = Lots of ALUs. A logical thread of control is realized by a stream of processor instructions that execute within a processor-managed environment, called an execution (or thread) context.


This context consists of state such as a program counter, a stack pointer, general-purpose registers, and virtual memory mappings. A single-core processor managing a single execution context can run one thread of control at a time. A multicore processor replicates processing resources (ALUs, control logic, and execution contexts) and organizes them into independent cores. When an application features multiple threads of control, multicore architectures provide increased throughput by executing these instruction streams on each core in parallel. For example, an Intel Core 2 Quad contains four cores and can execute four instruction streams simultaneously. As significant parallelism exists across shader invocations in a graphics pipeline, GPU designs easily push core counts higher.

Even higher performance is possible by populating each core with multiple floating-point ALUs. This is done efficiently through SIMD (single instruction, multiple data) processing, where several ALUs perform the same operation on a different piece of data. SIMD processing amortizes the complexity of decoding an instruction stream and the cost of ALU control structures across multiple ALUs, resulting in both power- and area-efficient chip execution.

The most common implementation of SIMD processing is via explicit short-vector instructions, similar to those provided by the x86 SSE or PowerPC Altivec ISA extensions. These extensions provide instructions that control the operation of four ALUs (SIMD width of 4). Alternatively, most GPUs realize the benefits of SIMD execution by implicitly sharing an instruction stream across threads with identical PCs. In this implementation, the SIMD width of the machine is not explicitly made visible to the programmer. CPU designers have chosen a SIMD width of four as a balance between providing increased throughput and retaining high single-threaded performance. Characteristics of the shading workload make it beneficial for GPUs to employ significantly wider SIMD processing (widths ranging from 32 to 64) and to support a rich set of operations. It is common for GPUs to support SIMD implementations of reciprocal square root, trigonometric functions, and memory gather/scatter operations.

The efficiency of wide SIMD processing allows GPUs to pack many cores densely with ALUs. For example, the NVIDIA GeForce GTX 280 GPU contains 240 ALUs operating at 1.3GHz. These ALUs are organized into 30 processing cores and yield a peak rate of 933GFLOPS. In comparison, a high-end 3GHz Intel Core 2 Quad CPU contains four cores, each with eight SIMD floating-point ALUs (two 4-width vector instructions per clock), and is capable of, at most, 96GFLOPS of peak performance.
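These peak figures follow from simple arithmetic over the core counts, ALU counts, and clock rates just quoted. The sketch below reproduces them; the flops-per-ALU-per-clock factors are the usual accounting assumptions behind such numbers (a multiply-add counted as two operations plus a co-issued multiply credited to the GTX 280, versus one operation per SSE ALU per clock for the CPU) and are assumptions of this sketch rather than figures stated in the article.

    #include <cstdio>

    // Peak rate = cores x ALUs per core x clock (GHz) x flops per ALU per clock.
    double peak_gflops(int cores, int alus_per_core, double ghz, double flops_per_alu_clock) {
        return cores * alus_per_core * ghz * flops_per_alu_clock;
    }

    int main() {
        printf("GTX 280:     %.0f GFLOPS\n", peak_gflops(30, 8, 1.3, 3.0)); // ~936 (933 at 1.296GHz)
        printf("Core 2 Quad: %.0f GFLOPS\n", peak_gflops(4, 8, 3.0, 1.0));  // 96
        return 0;
    }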
Recall that a shader function defines processing on a single pipeline entity. GPUs execute multiple invocations of the same shader function in parallel to take advantage of SIMD processing. Dynamic per-entity control flow is implemented by executing all control paths taken by the shader invocations in the group. SIMD operations that do not apply to all invocations, such as those within shader code conditional or loop blocks, are partially nullified using write-masks. In this implementation, when shader control flow diverges, fewer SIMD ALUs do useful work. Thus, on a chip with width-S SIMD processing, worst-case behavior yields performance equaling 1/S the chip's peak rate. Fortunately, shader workloads exhibit sufficient levels of instruction stream sharing to justify wide SIMD implementations. Additionally, GPU ISAs contain special instructions that make it possible for shader compilers to transform per-entity control flow into efficient sequences of explicit or implicit SIMD operations.
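A minimal sketch of how implicit SIMD execution handles divergent control flow: both sides of a branch are executed across the whole group, and a per-lane mask determines which lanes commit results. The width of eight and the particular branch are illustrative only, not any GPU's actual ISA.

    // Emulate width-8 SIMD execution of:  if (x < 0) y = -x; else y = x * 0.5f;
    // Every lane evaluates both paths; a per-lane mask nullifies writes on the untaken path.
    const int S = 8;                                   // SIMD width (illustrative)

    void masked_abs_or_halve(const float x[S], float y[S]) {
        bool mask[S];
        for (int lane = 0; lane < S; ++lane)           // evaluate the branch condition per lane
            mask[lane] = (x[lane] < 0.0f);

        for (int lane = 0; lane < S; ++lane)           // "then" path, under the mask
            if (mask[lane]) y[lane] = -x[lane];

        for (int lane = 0; lane < S; ++lane)           // "else" path, under the inverse mask
            if (!mask[lane]) y[lane] = x[lane] * 0.5f;
    }

When control flow is uniform across the group, a real implementation detects this and skips the untaken path entirely; when it diverges, useful work per issued instruction drops toward 1/S, as described above.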
Table 1. Tale of the tape: throughput architectures.

    Type  Processor                 Cores/Chip  ALUs/Core(3)  SIMD width  Max T(4)
    GPUs  AMD Radeon HD 4870            10          80            64         25
          NVIDIA GeForce GTX 280        30           8            32        128
    CPUs  Intel Core 2 Quad(1)           4           8             4          1
          STI Cell BE(2)                 8           4             4          1
          Sun UltraSPARC T2              8           1             1          4

    (1) SSE processing only; does not account for traditional FPU.
    (2) Stream processing (SPE) cores only; does not account for PPU cores.
    (3) 32-bit floating-point operations.
    (4) Max T is defined as the maximum ratio of hardware-managed thread execution contexts
        to simultaneously executable threads (not an absolute count of hardware-managed
        execution contexts). This ratio is a measure of a processor's ability to automatically
        hide thread stalls using hardware multithreading.

Hardware Multithreading = High ALU Utilization. Thread stalls pose an additional challenge to high-performance shader execution. Threads stall (or block) when the processor cannot dispatch the next instruction in an instruction stream due to a dependency on an outstanding instruction. High-latency off-chip memory accesses, most notably those generated by texture access operations, cause thread stalls lasting hundreds of cycles (recall that while shader input and output records lend themselves to streaming prefetch, texture accesses do not).

Allowing ALUs to remain idle during the period while a thread is stalled is inefficient. Instead, GPUs maintain more execution contexts on chip than they can simultaneously execute, and they perform instructions from runnable threads when others are stalled. Hardware scheduling logic determines which context(s) to execute in each processor cycle. This technique of overprovisioning cores with thread contexts to hide the latency of thread stalls is called hardware multithreading. GPUs use multithreading as the primary mechanism to hide both memory access and instruction pipeline latencies.

The amount of stall latency a GPU can tolerate via multithreading is dependent on the ratio of hardware thread contexts to the number of threads that are simultaneously executed in a clock (we refer to this ratio as T). Support for more thread contexts allows the GPU to hide longer or more frequent stalls.


All modern GPUs maintain large numbers of execution contexts on chip to provide maximal memory latency-hiding ability (T reaches 128 in modern GPUs; see Table 1). This represents a significant departure from CPU designs, which attempt to avoid or minimize stalls primarily using large, low-latency data caches and complicated out-of-order execution logic. Current Intel Core 2 and AMD Phenom processors maintain one thread per core, and even high-end models of Sun's multithreaded UltraSPARC T2 processor manage only four times the number of threads they can simultaneously execute.

Note that in the absence of stalls, the throughput of single- and multithreaded processors is equivalent. Multithreading does not increase the number of processing resources on a chip. Rather, it is a strategy that interleaves execution of multiple threads in order to use existing resources more efficiently (improve throughput). On average, a multithreaded core operating at its peak rate runs each thread 1/T of the time.

To achieve large-scale multithreading, execution contexts must be compact. The number of thread contexts supported by a GPU core is limited by the size of on-chip execution context storage. GPUs require compiled shader binaries to statically declare input and output entity sizes, as well as bounds on temporary storage and scratch registers required for their execution. At runtime, GPUs use these bounds to dynamically partition on-chip storage (including data registers) to support the maximum possible number of threads. As a result, the latency-hiding ability of a GPU is shader dependent. GPUs can manage many thread contexts (and provide maximal latency-hiding ability) when shaders use fewer resources. When shaders require large amounts of storage, the number of execution contexts (and latency-hiding ability) provided by a GPU drops.
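This trade-off reduces to simple division: the register file is a fixed size, so the more registers a compiled shader declares, the fewer thread contexts fit on chip. A toy calculation with assumed sizes:

    #include <cstdio>

    // On-chip storage is fixed; the number of resident thread contexts is not.
    // The sizes below are illustrative, not those of any particular GPU.
    int max_contexts(int register_file_entries, int registers_per_thread) {
        return register_file_entries / registers_per_thread;   // integer division
    }

    int main() {
        const int register_file = 256;  // vector registers available per core (assumed)
        printf("light shader (4 regs):  %d contexts\n", max_contexts(register_file, 4));  // 64
        printf("heavy shader (32 regs): %d contexts\n", max_contexts(register_file, 32)); // 8
        return 0;
    }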
Running a Fragment Shader on a GPU Core

Shader compilation to SIMD (single instruction, multiple data) instruction sequences coupled with dynamic hardware thread scheduling leads to efficient execution of a fragment shader on the simplified single-core GPU shown in Figure A.

- The core executes an instruction from at most one thread each processor clock, but maintains state for four threads on chip simultaneously (T=4).
- Core threads issue explicit width-32 SIMD vector instructions; 32 ALUs simultaneously execute a vector instruction in a single clock.
- The core contains a pool of 16 general-purpose vector registers (each containing a vector of 32 single-precision floats) partitioned among thread contexts.
- The only source of thread stalls is texture access; these accesses have a maximum latency of 50 cycles.

Figure A: Example GPU core. [Diagram: 32 ALUs operating as a single SIMD unit; a general register file of 16 vector registers, R0 through R15, partitioned among threads; and four execution (thread) contexts, T0 through T3.]

Shader compilation by the graphics driver produces a GPU binary from high-level fragment shader source. The resulting vector instruction sequence performs 32 invocations of the fragment shader simultaneously by carrying out each invocation in a single lane of the width-32 vectors. The compiled binary requires four vector registers for temporary results and contains 20 arithmetic instructions between each texture access operation.

At runtime, the GPU executes a copy of the shader binary on each of its four thread contexts, as illustrated in Figure B. The core executes T0 (thread 0) until it detects a stall resulting from texture access in cycle 20. While T0 waits for the result of the texturing operation, the core continues to execute its remaining three threads. The result of T0's texture access becomes available in cycle 70. Upon T3's stall in cycle 80, the core immediately resumes T0. Thus, at no point during execution are ALUs left idle.

When executing the shader program for this example, a minimum of four threads is needed to keep core ALUs busy. Each thread operates simultaneously on 32 fragments; thus, 4*32=128 fragments are required for the chip to achieve peak performance. As memory latencies on real GPUs involve hundreds of cycles, modern GPUs must contain support for significantly more threads to sustain high utilization. If we extend our simple GPU to a more realistic size of 16 processing cores and provision each core with storage for 16 execution contexts, then simultaneous processing of 8,192 fragments is needed to approach peak processing rates. Clearly, GPU performance relies heavily on the abundance of parallel shading work.

Figure B: Thread execution on the example GPU core. [Diagram: a timeline from cycle 0 to cycle 80 showing each of T0 through T3 cycling through executing, ready (not executing), and stalled states; whenever the running thread stalls on a texture access, the core switches to the next ready thread.]
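The schedule illustrated in Figure B can be checked with a few lines of simulation. The sketch below uses the sidebar's parameters (four contexts, 20 single-cycle arithmetic instructions between texture requests, 50-cycle texture latency) and counts cycles in which no context can issue; for these parameters the count is zero, matching the claim that the ALUs never sit idle. The round-robin policy is an assumption of the sketch, not a statement about real hardware.

    #include <cstdio>

    int main() {
        const int T = 4, RUN = 20, TEX_LATENCY = 50, HORIZON = 1000;
        int remaining[T];   // arithmetic instructions left in the current run
        int ready_at[T];    // cycle at which the outstanding texture result arrives
        for (int t = 0; t < T; ++t) { remaining[t] = RUN; ready_at[t] = 0; }

        int current = 0, idle = 0;
        for (int cycle = 0; cycle < HORIZON; ++cycle) {
            // Pick the first ready thread, preferring the currently running one.
            int chosen = -1;
            for (int i = 0; i < T; ++i) {
                int t = (current + i) % T;
                if (ready_at[t] <= cycle) { chosen = t; break; }
            }
            if (chosen < 0) { ++idle; continue; }      // every context is waiting on memory

            current = chosen;
            if (--remaining[current] == 0) {           // after 20 instructions the thread issues
                remaining[current] = RUN;              // a texture request and stalls until the
                ready_at[current] = cycle + TEX_LATENCY;   // result returns
            }
        }
        printf("idle cycles: %d of %d\n", idle, HORIZON);  // prints 0 for these parameters
        return 0;
    }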

Fixed-Function Processing Resources
A GPU's programmable cores interoperate with a collection of specialized fixed-function processing units that provide high-performance, power-efficient implementations of nonshader stages.


These components do not simply augment programmable processing; they perform sophisticated operations and contribute hundreds of additional gigaflops of processing power. Two of the most important operations performed via fixed-function hardware are texture filtering and rasterization (fragment generation).

Texturing is handled almost entirely by fixed-function logic. A texturing operation samples a contiguous 1D, 2D, or 3D signal (a texture) that is discretely represented by a multidimensional array of color values (2D texture data is simply an image). A GPU texture-filtering unit accepts a point within the texture's parameterization (represented by a floating-point tuple, such as {.5, .75}) and loads array values surrounding the coordinate from memory. The values are then filtered to yield a single result that represents the texture's value at the specified coordinate. This value is returned to the calling shader function. Sophisticated texture filtering is required for generating high-quality images. As graphics APIs provide a finite set of filtering kernels, and because filtering kernels are computationally expensive, texture filtering is well suited for fixed-function processing.
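To give a sense of the work a texture-filtering unit performs on every lookup, here is a minimal CPU-side bilinear filter over a single-channel 2D texture. Real units support many formats, addressing modes, and filter kernels (trilinear, anisotropic); the row-major layout and clamp-to-edge behavior below are simplifying assumptions.

    #include <algorithm>
    #include <cmath>

    // Bilinear sample of a single-channel 2D texture at normalized coordinate (u, v).
    // tex is a row-major width x height array; coordinates are clamped to the edge.
    float sampleBilinear(const float* tex, int width, int height, float u, float v) {
        float x = u * width  - 0.5f;           // texel-space position of the sample
        float y = v * height - 0.5f;
        int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
        float fx = x - x0, fy = y - y0;        // fractional position between texel centers

        auto texel = [&](int tx, int ty) {
            tx = std::min(std::max(tx, 0), width  - 1);   // clamp addressing mode
            ty = std::min(std::max(ty, 0), height - 1);
            return tex[ty * width + tx];
        };

        // Weighted average of the four surrounding texels.
        float top    = texel(x0, y0)     * (1 - fx) + texel(x0 + 1, y0)     * fx;
        float bottom = texel(x0, y0 + 1) * (1 - fx) + texel(x0 + 1, y0 + 1) * fx;
        return top * (1 - fy) + bottom * fy;
    }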
Primitive rasterization in the FG stage is another key pipeline operation currently implemented by fixed-function components. Rasterization involves densely sampling a primitive (at least once per output image pixel) to determine which pixels the primitive overlaps. This process involves computing the location of the surface at each sample point and then generating fragments for all sample points covered by the primitive. Bounding-box computations and hierarchical techniques optimize the rasterization process. Nonetheless, rasterization involves significant computation.
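One common formulation of this sampling step evaluates signed edge functions at each pixel center within the triangle's screen-space bounding box, emitting a fragment for every covered pixel. The sketch below shows that formulation; it omits the hierarchical and incremental optimizations mentioned above, as well as careful fill-rule handling of shared edges.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Frag { int x, y; };   // illustrative fragment record: covered pixel position

    // Signed area of triangle (a, b, p); its sign tells which side of edge ab point p lies on.
    static float edge(float ax, float ay, float bx, float by, float px, float py) {
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
    }

    // Rasterize a screen-space triangle, sampling once per pixel center.
    std::vector<Frag> rasterize(float x0, float y0, float x1, float y1, float x2, float y2,
                                int imageW, int imageH) {
        std::vector<Frag> frags;
        int minX = std::max(0, (int)std::floor(std::min({x0, x1, x2})));
        int maxX = std::min(imageW - 1, (int)std::ceil (std::max({x0, x1, x2})));
        int minY = std::max(0, (int)std::floor(std::min({y0, y1, y2})));
        int maxY = std::min(imageH - 1, (int)std::ceil (std::max({y0, y1, y2})));

        for (int y = minY; y <= maxY; ++y)
            for (int x = minX; x <= maxX; ++x) {
                float px = x + 0.5f, py = y + 0.5f;    // sample at the pixel center
                float e0 = edge(x0, y0, x1, y1, px, py);
                float e1 = edge(x1, y1, x2, y2, px, py);
                float e2 = edge(x2, y2, x0, y0, px, py);
                // Covered if the sample lies on the same side of all three edges.
                if ((e0 >= 0 && e1 >= 0 && e2 >= 0) || (e0 <= 0 && e1 <= 0 && e2 <= 0))
                    frags.push_back({x, y});
            }
        return frags;
    }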
In addition to the components for texturing and rasterization, GPUs contain dedicated hardware components for operations such as surface visibility determination, output pixel compositing, and data compression/decompression.

The Memory System
Parallel-processing resources place extreme load on a GPU's memory system, which services memory requests from both fixed-function and programmable components. These requests include a mixture of fine-granularity and bulk prefetch operations and may even require real-time guarantees (such as display scan-out).

Recall that a GPU's programmable cores tolerate large memory latencies via hardware multithreading and that interstage stream data accesses can be prefetched. As a result, GPU memory systems are architected to deliver high-bandwidth, rather than low-latency, data access. High throughput is obtained through the use of wide memory buses and specialized GDDR (graphics double data rate) memories that operate most efficiently when memory access granularities are large. Thus, GPU memory controllers must buffer, reorder, and then coalesce large numbers of memory requests to synthesize large operations that make efficient use of the memory system. As an example, the ATI Radeon HD 4870 memory controller manipulates thousands of outstanding requests to deliver 115GB per second of bandwidth from GDDR5 memories attached to a 256-bit bus.
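The buffer-reorder-coalesce idea can be pictured as follows: collect outstanding requests, sort them by address, and merge requests that fall within the same naturally aligned burst so the DRAM sees a small number of large transfers. This sketch ignores bank and channel scheduling, ordering constraints, and real-time guarantees; the 256-byte burst size and the assumption that each request fits within one burst are simplifications.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Request { uint64_t address; int bytes; };
    struct Burst   { uint64_t start;   int bytes; };   // one large, aligned DRAM transfer

    // Coalesce buffered requests into aligned bursts (illustrative burst size: 256 bytes).
    std::vector<Burst> coalesce(std::vector<Request> pending) {
        const uint64_t BURST = 256;
        std::sort(pending.begin(), pending.end(),
                  [](const Request& a, const Request& b) { return a.address < b.address; });

        std::vector<Burst> bursts;
        for (const Request& r : pending) {
            uint64_t start = r.address & ~(BURST - 1);   // align down to a burst boundary
            if (!bursts.empty() && bursts.back().start == start)
                continue;                                // already covered by the last burst
            bursts.push_back({start, (int)BURST});
        }
        return bursts;                                   // far fewer transfers than requests
    }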
GPU data caches meet different needs from CPU caches. GPUs employ relatively small, read-only caches (with no cache coherence) that serve to filter requests destined for the memory controller and to reduce the bandwidth requirements placed on main memory. Thus, GPU caches typically serve to amplify total bandwidth to processing units rather than to decrease the latency of memory accesses. Interleaved execution of many threads renders large read-write caches inefficient because of severe cache thrashing. Instead, GPUs benefit from small caches that capture spatial locality across simultaneously executed shader invocations. This situation is common, as texture accesses performed while processing fragments in close screen proximity are likely to have overlapping texture-filter support regions.

Although most GPU caches are small, this does not imply that GPUs contain little on-chip storage. Significant amounts of on-chip storage are used to hold entity streams, execution contexts, and thread scratch data.

Pipeline Scheduling and Control
Mapping the entire graphics pipeline efficiently onto GPU resources is a challenging problem that requires dynamic and adaptive techniques.

A unique aspect of GPU computing is that hardware logic assumes a major role in mapping and scheduling computation onto chip resources. GPU hardware "scheduling" logic extends beyond the thread-scheduling responsibilities discussed in previous sections. GPUs automatically assign computations to threads, clean up after threads complete, size and manage buffers that hold stream data, guarantee ordered processing when needed, and identify and discard unnecessary pipeline work. This logic relies heavily on specific upfront knowledge of graphics workload characteristics.

Conventional thread programming uses operating-system or threading-API mechanisms for thread creation, completion, and synchronization on shared structures. Large-scale multithreading coupled with the brevity of shader function execution (at most a few hundred instructions), however, means GPU thread management must be performed entirely by hardware logic.

GPUs minimize thread launch costs by preconfiguring execution contexts to run one of the pipeline's three types of shader functions and reusing the configuration multiple times for shaders of the same type. GPUs prefetch shader input records and launch threads when a shader stage's input stream contains a sufficient number of entities. Similar hardware logic commits records to the output stream buffer upon thread completion. The distribution of execution contexts to shader stages is reprovisioned periodically as pipeline needs change and stream buffers drain or approach capacity.

GPUs leverage upfront knowledge of pipeline entities to identify and skip unnecessary computation. For example, vertices shared by multiple primitives are identified and VP results cached to avoid duplicate vertex processing. GPUs also discard fragments prior to FP when the fragment will not alter the value of any image pixel. Early fragment discard is triggered when a fragment's sample point is occluded by a previously processed surface located closer to the camera.
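Early discard can be pictured as a depth comparison made before a fragment is handed to FP: if a previously processed surface at the same pixel is already closer, no shading thread is launched for the fragment. The sketch below assumes a simple less-than depth test and ignores cases (for example, shaders that modify depth) in which early discard must be disabled.

    // Early fragment discard: test against the depth buffer before shading.
    // depthBuffer holds the nearest depth seen so far at each pixel.
    bool survivesEarlyDiscard(int x, int y, float fragmentDepth,
                              const float* depthBuffer, int imageWidth) {
        return fragmentDepth < depthBuffer[y * imageWidth + x];   // closer than stored surface
    }

    // Only surviving fragments are shaded by FP and later committed by PO,
    // which repeats the test and updates the depth buffer in primitive order.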
Another class of hardware optimizations reorganizes fine-grained operations for more efficient processing. For example, rasterization orders fragment generation to maximize the screen proximity of samples. This ordering improves texture cache hit rates, as well as instruction stream sharing across shader invocations. The GPU memory controller also performs automatic reorganization when it reorders memory requests to optimize memory bus and DRAM utilization.

GPUs enforce inter-fragment PO ordering dependencies using hardware logic. Implementations use structures such as post-FP reorder buffers or scoreboards that delay fragment thread launch until the processing of overlapping fragments is complete.
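One way to picture such a scoreboard: before a fragment's shading thread launches, its pixel position is checked against the set of positions whose earlier fragments are still being shaded, and the launch is held until those fragments retire. The class below is a deliberately simplified illustration, not a description of any vendor's hardware.

    #include <set>
    #include <utility>

    // Tracks pixels whose earlier fragments are still being shaded.
    class FragmentScoreboard {
        std::set<std::pair<int,int>> inFlight;
    public:
        // A new fragment may launch only if no earlier fragment at its pixel is in flight.
        bool tryLaunch(int x, int y) {
            if (inFlight.count({x, y})) return false;   // hold back: preserve PO ordering
            inFlight.insert({x, y});
            return true;
        }
        void retire(int x, int y) { inFlight.erase({x, y}); }   // called when FP completes
    };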
GPU hardware can take responsibility for sophisticated scheduling decisions because the semantics and invariants of the graphics pipeline are known a priori. Hardware implementation enables fine-granularity logic that is informed by precise knowledge of both the graphics pipeline and the underlying GPU implementation. As a result, GPUs are highly efficient at using all available resources. The drawback of this approach is that GPUs execute only those computations for which these invariants and structures are known.

Graphics programming is becoming increasingly versatile. Developers constantly seek to incorporate more sophisticated algorithms and leverage more configurable graphics pipelines. Simultaneously, the growing popularity of GPU-based computing for nongraphics applications has led to new interfaces for accessing GPU resources. Given both of these trends, the extent to which GPU designers can embed a priori knowledge of computations into hardware scheduling logic will inevitably decrease over time.

A major challenge in the evolution of GPU programming involves preserving GPU performance levels and ease of use while increasing the generality and expressiveness of application interfaces. The designs of "GPU-compute" interfaces, such as NVIDIA's CUDA and AMD's CAL, are evidence of how difficult this challenge is. These frameworks abstract computation as large batch operations that involve many invocations of a kernel function operating in parallel. The resulting computations execute on GPUs efficiently only under conditions of massive data parallelism. Programs that attempt to implement non-data-parallel algorithms perform poorly.
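The batch abstraction these frameworks expose can be summarized in a few lines: the programmer supplies a kernel function and an element count, and the runtime applies the kernel to every element, free to run the invocations in parallel and in any order. The names and the serial loop below only illustrate the semantics; they do not mirror the actual syntax of CUDA or CAL.

    // A GPU-compute-style batch launch: apply 'kernel' to each of n independent elements.
    // A real runtime distributes the invocations across cores and SIMD lanes; the serial
    // loop below only defines the semantics the programmer relies on.
    template <typename Kernel>
    void launch(int n, Kernel kernel) {
        for (int i = 0; i < n; ++i)
            kernel(i);                  // invocations must not depend on one another
    }

    // Example: scale an array by a constant.
    void scale(float* data, int n, float s) {
        launch(n, [=](int i) { data[i] *= s; });
    }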
GPU-compute programming models are simple to use and permit well-written programs to make good use of both GPU programmable cores and (if needed) texturing resources. Programs using these interfaces, however, cannot use powerful fixed-function components of the chip, such as those related to compression, image compositing, or rasterization. Also, when these interfaces are enabled, much of the logic specific to graphics-pipeline scheduling is simply turned off. Thus, current GPU-compute programming frameworks significantly restrict computations so that their structure, as well as their use of chip resources, remains sufficiently simple for GPUs to run these programs in parallel.

GPU and CPU Convergence
The modern graphics processor is a powerful computing platform that resides at the extreme end of the design space of throughput-oriented architectures. A GPU's processing resources and accompanying memory system are heavily optimized to execute large numbers of operations in parallel. In addition, specialization to the graphics domain has enabled the use of fixed-function processing and allowed hardware scheduling of a parallel computation to be practical. With this design, GPUs deliver unsurpassed levels of performance to challenging workloads while maintaining a simple and convenient programming interface for developers.

Today, commodity CPU designs are adopting features common in GPU computing, such as increased core counts and hardware multithreading. At the same time, each generation of GPU evolution adds flexibility to previous high-throughput GPU designs. Given these trends, software developers in many fields are likely to take interest in the extent to which CPU and GPU architectures and, correspondingly, CPU and GPU programming systems, ultimately converge.

Kayvon Fatahalian (kayvonf@gmail.com) and Mike Houston are Ph.D. candidates in computer science in the Computer Graphics Laboratory at Stanford University.

A previous version of this article was published in the March 2008 issue of ACM Queue.

© 2008 ACM 0001-0782/08/1000 $5.00
