CUDA
Index
The death of the single-core solution
NVIDIA and CUDA
GPU hardware
Alternatives to CUDA
Understanding parallelism with GPUs
CUDA hardware overview
2 12/6/2019
GPU
Graphics Processing Unit
A single GPU consists of a large number of cores – hundreds of cores – whereas a single CPU consists of 2, 4, 8, or 12 cores.
Cores? – Processing units in a chip sharing at least the main memory and the L1 cache
GPU cont.
GPUs are massively multithreaded many-core chips:
Hundreds of scalar processors
Tens of thousands of concurrent threads
Around 1 teraflop of peak performance
Fine-grained data-parallel computation
The death of the single-core solution
- GPU and CPU
Typically, the GPU and CPU coexist in a heterogeneous setting.
The "less" computationally intensive part runs on the CPU (coarse-grained parallelism), and the more intensive parts run on the GPU (fine-grained parallelism).
NVIDIA's GPU architecture is called the CUDA (Compute Unified Device Architecture) architecture; it is accompanied by the CUDA programming model and the CUDA C language.
CPU vs. GPU Architecture
Fast Fourier Transform
The death of the single-core solution
The death of conventional scaling has sparked a sharp increase in the
number of companies researching various types of specialized CPU cores.
Processing Flow
Processing flow of CUDA:
1. Copy data from main memory to GPU memory.
2. The CPU instructs the GPU to start processing.
3. The GPU executes in parallel on each core.
4. Copy the result from GPU memory back to main memory.
CUDA
Definitions: Programming Model
Device = GPU
Host = CPU
Kernel = a function that runs on the device
CUDA Programming Model
A kernel is executed by a grid of thread blocks
CUDA Kernels and Threads
Parallel portions of an application are executed on the
device as kernels
One kernel is executed at a time
Many threads execute each kernel
Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads
All threads run the same code
Each thread has an ID that it uses to compute memory addresses
and make control decisions
Graphics Processing Units (GPUs)
Brief History
GPU Computing
General-purpose computing on
graphics processing units (GPGPUs)
[Timeline figure, 1993–2010: NVIDIA, established in 1993 by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem; NV1 (1995); GeForce 1 (1999); GeForce 2 series; GeForce FX series; GeForce 8 series, including the G80-based GeForce 8800 (NVIDIA's first GPU with general-purpose processors); GeForce 200 series (GTX 260/275/280/285/295); Fermi-based GeForce 400 series (GTX 460/465/470/475/480/485); the Quadro line; and the Tesla compute line (C870, S870, C1060, S1070, C2050, …).]
GPU Hardware
The NVIDIA G80 series processor and beyond implemented
a design that is similar to both the Connection Machine and
IBM’s Cell processor.
Each graphics card consists of a number of SMs. To
each SM is attached eight or more SPs (Stream
Processors). The original 9800 GTX card has eight
SMs, giving a total of 128 SPs.
The GPU cards can broadly be considered as accelerator or
coprocessor cards.
Streaming Multiprocessor (SM)
• Streaming Multiprocessor (SM)
– 8 Streaming Processors (SP)
– 2 Special Function Units (SFU)
– 16 KB shared memory
– Instruction L1 and data L1 caches
Alternatives to CUDA
OpenCL
DirectCompute
CPU alternatives
MPI
OpenMP
Pthreads
ZeroMQ
Hadoop
Directives and libraries
OpenCL
OpenCL is an open and royalty-free standard supported by NVIDIA, AMD, and others. The OpenCL trademark is owned by Apple. It sets out an open standard that allows the use of compute devices.
Unlike CUDA, OpenCL can use any of the available accelerators when parallelizing a task.
A compute device can be a GPU, CPU, or another specialist device for which an OpenCL driver exists. As of 2012, OpenCL supports all major brands of GPU devices, as well as CPUs.
DirectCompute
DirectCompute is Microsoft's alternative to CUDA and OpenCL. It is a proprietary product linked to the Windows operating system and, in particular, the DirectX 11 API.
The DirectX API was a huge leap forward for anyone who remembers programming video cards before it. It meant developers had to learn only one library API to program all graphics cards, rather than write or license drivers for each major video card manufacturer.
DirectX 11 is the latest standard and is supported under Windows 7.
CPU Alternatives
MPI
OpenMP
Pthreads
ZeroMQ
Hadoop
CPU Alternatives
MPI (Message Passing Interface): Parallelism is expressed by spawning hundreds of processes over a cluster of nodes and explicitly exchanging messages, typically over high-speed network-based communication links. It's a good solution within a controlled cluster environment.
OpenMP is a system designed for parallelism within a node or computer system. OpenMP support is built into many compilers, including the NVCC compiler used for CUDA. OpenMP tends to hit scaling problems due to the underlying CPU architecture.
Pthreads is a library that is used significantly for multithreaded applications on Linux. Unlike OpenMP, the programmer is responsible for thread management and synchronization. This provides more flexibility and consequently better performance for well-written programs.
ZeroMQ (0MQ) is a simple library for developing multinode, multi-GPU applications. ZeroMQ supports thread-, process-, and network-based communication models with a single cross-platform API. It is available on both Linux and Windows platforms. It's designed for distributed computing, so the connections are dynamic and nodes fail gracefully.
Hadoop is an open-source version of Google’s MapReduce
framework. It’s aimed primarily at the Linux platform.
The concept is that you take a huge dataset and break (or map)
it into a number of chunks.
However, instead of sending the data to the node, the dataset is
already split over hundreds or thousands of nodes using a
parallel file system.
Directives and Libraries
There are a number of compiler vendors, PGI, CAPS, and
Cray being the most well-known, that support the recently
announced OpenACC set of compiler directives for GPUs.
These, in essence, replicate the approach of OpenMP, in that
the programmer inserts a number of compiler directives
marking regions as “to be executed on the GPU.” The
compiler then does the grunt work of moving data to or
from the GPU, invoking kernels, etc.
The CUDA SDK provides libraries such as Thrust, which implement common functions in a very efficient way. Libraries like CUBLAS are some of the best around for linear algebra. Libraries exist for many well-known applications such as MATLAB and Mathematica.
Language bindings exist for Python, Perl, Java, and many others. CUDA can even be integrated with Excel.
Understanding parallelism with GPUs
TRADITIONAL SERIAL CODE
SERIAL/PARALLEL PROBLEMS
CONCURRENCY
LOCALITY
TYPES OF PARALLELISM
FLYNN’S TAXONOMY
SOME COMMON PARALLEL PATTERNS
Traditional Serial Code
Traditionally, software has been written for serial
computation:
To be run on a single computer having a single Central
Processing Unit (CPU);
A problem is broken into a discrete series of instructions.
Instructions are executed one after another.
Only one instruction may execute at any moment in time.
What is Parallel Computing? (2)
In the simplest sense, parallel computing is the simultaneous use of multiple compute
resources to solve a computational problem.
To be run using multiple CPUs
A problem is broken into discrete parts that can be solved concurrently
Each part is further broken down to a series of instructions
Instructions from each part execute simultaneously on different CPUs
Parallel Computing: Resources
The compute resources can include:
A single computer with multiple processors;
A single computer with (multiple) processor(s) and some
specialized computer resources (GPU, …)
An arbitrary number of computers connected by a network;
A combination of both.
Parallel Computing: The Computational Problem
The computational problem usually demonstrates characteristics such as the ability to be:
Broken apart into discrete pieces of work that can be solved simultaneously;
Executed as multiple program instructions at any moment in time;
Solved in less time with multiple compute resources than with a single compute resource.
Parallel Computing: what for? (1)
Parallel computing is an evolution of serial computing that attempts to emulate what has
always been the state of affairs in the natural world: many complex, interrelated events
happening at the same time, yet within a sequence.
Some examples:
Planetary and galactic orbits
Weather and ocean patterns
Tectonic plate drift
Rush hour traffic in Paris
Automobile assembly line
Daily operations within a business
Building a shopping mall
Ordering a hamburger at the drive-through.
Parallel Computing: what for? (2)
Traditionally, parallel computing has been considered to be
"the high end of computing" and has been motivated by
numerical simulations of complex systems and "Grand
Challenge Problems" such as:
weather and climate
chemical and nuclear reactions
biological, human genome
geological, seismic activity
mechanical devices - from prosthetics to spacecraft
electronic circuits
manufacturing processes
Parallel Computing: what for? (3)
Today, commercial applications are providing an equal or greater driving force in
the development of faster computers. These applications require the processing of
large amounts of data in sophisticated ways. Example applications include:
TYPES OF PARALLELISM
Task-based parallelism
Data-based parallelism
Task-Level Parallelism
[Figure: Task A and Task B proceed concurrently between synchronization points; serial regions represent unexploited parallelism.]
Master/Workers
One object called the master worker threads
initially owns all data.
Creates several workers to process
individual elements
Waits for workers to report results
back
master
Producer/Consumer Flow
[Figure: producer threads (P) passing data downstream to consumer threads (C).]
Data Parallelism
x := (a * b) + (y * z);
[Figure: computation A is (a * b) and computation B is (y * z); the two can be evaluated in parallel.]
Parallelism: Dependency Graphs
x := foo(a) + bar(b)
[Dependency graph: foo(a) and bar(b) are independent and may run in parallel; the write to x depends on both results.]
Flynn Matrix
The matrix below defines the 4 possible classifications
according to Flynn
Flynn’s Taxonomy
Michael Flynn (from Stanford)
Made a characterization of computer systems which became known
as Flynn’s Taxonomy
[Diagram: a computer consumes two streams: instructions and data.]
Flynn’s Taxonomy
SISD – Single Instruction Single Data Systems
Single Instruction, Single Data (SISD)
A serial (non-parallel) computer
Single instruction: only one instruction stream
is being acted on by the CPU during any one
clock cycle
Single data: only one data stream is being used
as input during any one clock cycle
Deterministic execution
This is the oldest and, until recently, the most
prevalent form of computer
Examples: most PCs, single CPU workstations
and mainframes
Flynn’s Taxonomy
SIMD – Single Instruction Multiple Data Systems “Array
Processors”
Single Instruction, Multiple Data (SIMD)
A type of parallel computer
This type of machine typically has an instruction dispatcher, a very high-
bandwidth internal network, and a very large array of very small-capacity
instruction units.
Best suited for specialized problems characterized by a high degree of
regularity, such as image processing.
Synchronous (lockstep) and deterministic execution
Two varieties: Processor Arrays and Vector Pipelines
Flynn’s Taxonomy
MISD- Multiple Instructions / Single Data System
Some people say “pipelining” lies here, but this is debatable.
Multiple Instruction, Single Data (MISD)
A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via independent
instruction streams.
Some conceivable uses might be:
multiple frequency filters operating on a single signal stream
multiple cryptography algorithms attempting to crack a single coded message.
Flynn’s Taxonomy
MIMD Multiple Instructions Multiple Data System:
“Multiprocessors”
Multiple Instruction, Multiple Data (MIMD)
Currently, the most common type of parallel computer. Most modern
computers fall into this category.
Multiple Instruction: every processor may be executing a different instruction
stream
Multiple Data: every processor may be working with a different data stream
Execution can be synchronous or asynchronous, deterministic or non-
deterministic
Examples: most current supercomputers, networked parallel computer "grids"
and multi-processor SMP computers - including some types of PCs.
SOME COMMON PARALLEL PATTERNS
Loop-based patterns
Fork/join pattern
Tiling/grids
Divide and conquer
Loop-level parallelism
Collections of tasks are defined as iterations of one or more loops.
Loop iterations are divided between a collection of processing
elements to compute tasks concurrently.
This design pattern is also heavily used with data parallel design patterns. OpenMP
programmers commonly use this pattern.
Fork-join
Tasks are associated with threads.
The threads are spawned (forked), carry out their execution, and then
terminate (join).
Note that due to the high cost of thread creation and destruction,
programming languages that support this pattern often use logical threads
that execute with physical threads pulled from a thread pool … But to
the programmer, this is an implementation detail managed by the
runtime environment. Cilk and the explicit tasks in OpenMP 3.0
commonly use this pattern.
A queue of data processed by N threads.
Tiling / grid
CUDA provides a simple two-dimensional grid model. For a significant number of problems this is entirely sufficient. If you have a linear distribution of work within a single block, you have an ideal decomposition into CUDA blocks. As we can assign up to sixteen blocks per SM, and we can have up to 16 SMs (32 on some GPUs), any number of blocks of 256 or larger is fine. In practice, we'd like to limit the number of elements within the block to 128, 256, or 512, so this in itself may drive much larger numbers of blocks with a typical dataset.
Divide and Conquer
You often see divide-and-conquer algorithms used with recursion. Quicksort is a classic example. It recursively partitions the data into two sets: those above a pivot point and those below the pivot point. When a partition finally consists of just two items, they are compared and swapped.
CUDA hardware overview
GPU Hardware
The NVIDIA G80 series processor and beyond implemented
a design that is similar to both the Connection Machine and
IBM’s Cell processor.
Each graphics card consists of a number of SMs. To
each SM is attached eight or more SPs (Stream
Processors). The original 9800 GTX card has eight
SMs, giving a total of 128 SPs.
The GPU cards can broadly be considered as accelerator or
coprocessor cards.
Typical Core 2 series layout.
GPU hardware
Notice the GPU hardware consists of a number of key
blocks:
• Memory (global, constant, shared)
• Streaming multiprocessors (SMs)
• Streaming processors (SPs)
Inside an SM.
Global memory – used by all SMs, per GPU
L2 cache – shared by all SMs, per GPU
L1 cache – per SM
Texture memory – filled by the host, read-only to the SMs
Constant memory – data readable by all threads in an SM
SFUs – perform special hardware instructions such as sin, cos, and exponent
There are 8 SPs in each SM; in Fermi this grows to 32–48 SPs, and in Kepler to 192.
Each SM has access to something called a register file, which is
much like a chunk of memory that runs at the same speed as the
SP units, so there is effectively zero wait time on this memory.
There is also a shared memory block accessible only to the
individual SM; this can be used as a program-managed cache.
Constant memory is used for read-only data.
Each SM has a separate bus into the texture memory, constant
memory, and global memory spaces.
Texture memory is a special view onto global memory that is useful for data where interpolation is needed, e.g., with 2D or 3D lookup tables.
Compute Levels
Compute 1.0
Compute 1.1
Compute 1.2
Compute 1.3
Compute 2.0
Compute 2.1
Compute 1.0
Compute level 1.0 is found on the older graphics cards
The main features lacking in compute 1.0 cards are
those for atomic operations. Atomic operations are
those where we can guarantee a complete operation without
any other thread interrupting. In effect, the hardware
implements a barrier point at the entry of the atomic
function and guarantees the completion of the operation
(add, sub, min, max, logical and, or, xor, etc.) as one
operation.
Compute 1.0 cards are now effectively obsolete.
Compute 1.1
Compute level 1.1 is found in many of the later shipping 9000
series cards, such as the 9800 GTX, which were extremely
popular. These are based on the G92 hardware as opposed to the
G80 hardware of compute 1.0 devices.
One major change brought in with compute 1.1 devices was
support, on many but not all devices, for overlapped data
transfer and kernel execution.
The SDK call to cudaGetDeviceProperties() returns the deviceOverlap property, which defines whether this functionality is available. This allows for a very nice and important optimization called double buffering, which works as shown in the figure.
To use this method we require double the memory space we'd normally use, which may well be an issue if your target market only has a 512 MB card. With Tesla cards, used mainly for scientific computing, you can have up to 6 GB of GPU memory.
Let us see what happens:
Cycle 0: Having allocated two areas of memory in the GPU
memory space, the CPU fills the first buffer.
Cycle 1: The CPU then invokes a CUDA kernel (a GPU task)
on the GPU, which returns immediately to the CPU (a
nonblocking call). The CPU then fetches the next data packet,
from a disk, the network, or wherever. Meanwhile, the GPU is
processing away in the background on the data packet provided.
When the CPU is ready, it starts filling the other buffer.
Cycle 2: When the CPU is done filling the buffer, it invokes a kernel
to process buffer 1. It then checks if the kernel from cycle 1, which
was processing buffer 0, has completed. If not, it waits until this
kernel has finished and then fetches the data from buffer 0 and then
loads the next data block into the same buffer. During this time the
kernel kicked off at the start of the cycle is processing data on the
GPU in buffer 1.
Cycle N: We then repeat cycle 2, alternating the buffer we read and write on the CPU with the buffer being processed on the GPU.
Compute 1.3
The compute 1.3 devices were introduced with the move
from GT200 to the GT200 a/b revisions of the hardware.
The major change that occurs with compute 1.3
hardware is the introduction of support for limited
double-precision calculations.
GPUs are primarily aimed at graphics and here there is a
huge need for fast single-precision calculations, but limited
need for double-precision ones.
You see an order of magnitude drop in performance using double-precision as opposed to single-precision floating-point operations, so time should be taken to see if there is any way single-precision arithmetic can be used to get the most out of this hardware.
Compute 2.0
Compute 2.0 devices saw the switch to Fermi hardware.
Some of main changes in compute 2.x hardware are as follows:
• Introduction of 16 K to 48 K of L1 cache memory on each SM.
• Introduction of a shared L2 cache for all SMs.
• Support in Tesla-based devices for ECC (Error Correcting
Code)-based memory checking and error correction.
• Support in Tesla-based devices for dual-copy engines.
• Extension in size of the shared memory from 16 K per
SM up to 48 K per SM.
Support for ECC memory is a must for data centers. ECC memory
provides for automatic error detection and correction.
Electrical devices emit small amounts of radiation. When in close
proximity to other devices, this radiation can change the contents
of memory cells in the other device.
Although the probability of this happening is tiny, as you increase
the exposure of the equipment by densely packing it into data
centers, the probability of something going wrong rises to an
unacceptable level.
ECC, therefore, detects and corrects single-bit upset conditions
that you may find in large data centers.
This reduces the amount of available RAM and negatively impacts
memory bandwidth. Because this is a major drawback on graphics
cards, ECC is only available on Tesla products.
Dual-copy engines allow you to extend the dual-buffer
example we looked at earlier to use multiple streams.
Compute 2.1
Compute 2.1 is seen on certain devices aimed specifically at the
games market, such as the GTX460 and GTX560. These
devices change the architecture of the device as follows:
• 48 CUDA cores per SM instead of the usual 32 per SM.
• Eight single-precision special function units for transcendental operations per SM instead of the usual four.
• Dual-warp dispatcher instead of the usual single-warp
dispatcher.
Warps, which we will cover in detail later, are groups of
threads. On compute 2.0 hardware, the single-warp
dispatcher takes two clock cycles to dispatch instructions of
an entire warp.
On compute 2.1 hardware, instead of the usual two
instruction dispatchers per two clock cycles, we now have
four.
Thank You