CUDA
Index
The death of the single-core solution
NVIDIA and CUDA
GPU hardware
Alternatives to CUDA
Understanding parallelism with GPUs
CUDA hardware overview
2 12/6/2019
GPU
Graphics Processing Unit
A single GPU consists of a large number of cores – hundreds of cores – whereas a single CPU consists of 2, 4, 8, or 12 cores.
Cores? – Processing units in a chip sharing at least the main memory and the L1 cache
GPU cont.
GPUs are massively multithreaded many-core chips:
Hundreds of scalar processors
Tens of thousands of concurrent threads
Around 1 teraflop of peak performance
Fine-grained data-parallel computation
The death of the single-core solution
- GPU and CPU
Typically, the GPU and CPU coexist in a heterogeneous setting.
The "less" computationally intensive part runs on the CPU (coarse-grained parallelism), and the more intensive parts run on the GPU (fine-grained parallelism).
NVIDIA's GPU architecture is called the CUDA (Compute Unified Device Architecture) architecture; it is accompanied by the CUDA programming model and the CUDA C language.
CPU vs. GPU Architecture
Fast Fourier Transform
The death of the single-core solution
The death of conventional scaling has sparked a sharp increase in the
number of companies researching various types of specialized CPU cores.
Processing Flow
Processing flow of CUDA:
1. Copy data from main memory to GPU memory.
2. The CPU instructs the GPU to start processing.
3. The GPU executes in parallel on each core.
4. Copy the result from GPU memory back to main memory.
CUDA
Definitions: Programming Model
Device = GPU
Host = CPU
Kernel = a function that runs on the device
CUDA Programming Model
A kernel is executed by a grid of thread blocks
CUDA Kernels and Threads
Parallel portions of an application are executed on the
device as kernels
One kernel is executed at a time
Many threads execute each kernel
Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads
All threads run the same code
Each thread has an ID that it uses to compute memory addresses
and make control decisions
Graphics Processing Units (GPUs)
Brief History
GPU Computing
General-purpose computing on
graphics processing units (GPGPUs)
[Timeline figure, 1993–2010: NVIDIA, established in 1993 by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem; NV1 (1995); GeForce 1 (1999); GeForce 2 series; GeForce FX series; GeForce 8 series, including the G80-based GeForce 8800 (NVIDIA's first GPU with general-purpose processors); GeForce 200 series (GTX 260/275/280/285/295); Fermi-based GeForce 400 series (GTX 460/465/470/475/480/485); the Quadro line; and the Tesla compute line (C870, S870, C1060, S1070, C2050, …).]
GPU Hardware
The NVIDIA G80 series processor and beyond implemented
a design that is similar to both the Connection Machine and
IBM’s Cell processor.
Each graphics card consists of a number of SMs. To
each SM is attached eight or more SPs (Stream
Processors). The original 9800 GTX card has eight
SMs, giving a total of 128 SPs.
The GPU cards can broadly be considered as accelerator or
coprocessor cards.
Streaming Multiprocessor (SM)
• Streaming Multiprocessor (SM)
– 8 Streaming Processors (SP)
– 2 Special Function Units (SFU)
– 16 KB shared memory
– Instruction L1 and data L1 caches
Alternatives to CUDA
OpenCL
DirectCompute
CPU alternatives
MPI
OpenMP
Pthreads
ZeroMQ
Hadoop
Directives and libraries
OpenCL
OpenCL is an open and royalty-free standard supported by NVIDIA, AMD, and others. The OpenCL trademark is owned by Apple. It sets out an open standard that allows the use of compute devices.
Unlike CUDA, OpenCL can use any of the available accelerators when parallelizing a task.
A compute device can be a GPU, CPU, or another specialist device for which an OpenCL driver exists. As of 2012, OpenCL supports all major brands of GPU devices, as well as CPUs.
DirectCompute
DirectCompute is Microsoft's alternative to CUDA and OpenCL. It is a proprietary product linked to the Windows operating system and, in particular, the DirectX 11 API.
The DirectX API was a huge leap forward for anyone who remembers programming video cards before it. It meant developers had to learn only one library API to program all graphics cards, rather than write or license drivers for each major video card manufacturer.
DirectX 11 is the latest standard and is supported under Windows 7.
CPU Alternatives
MPI
OpenMP
Pthreads
ZeroMQ
Hadoop
CPU Alternatives
MPI (Message Passing Interface): Parallelism is expressed by spawning hundreds of processes over a cluster of nodes and explicitly exchanging messages, typically over high-speed network-based communication links. It's a good solution within a controlled cluster environment.
OpenMP is a system designed for parallelism within a node or computer system. OpenMP support is built into many compilers, including the NVCC compiler used for CUDA. OpenMP tends to hit scaling problems due to the underlying CPU architecture.
Pthreads is a library that is used significantly for multithreaded applications on Linux. Unlike OpenMP, the programmer is responsible for thread management and synchronization. This provides more flexibility and consequently better performance for well-written programs.
ZeroMQ (0MQ) is a simple library for developing multinode, multi-GPU applications. ZeroMQ supports thread-, process-, and network-based communication models with a single cross-platform API. It is available on both Linux and Windows platforms. It's designed for distributed computing, so the connections are dynamic and nodes fail gracefully.
Hadoop is an open-source version of Google’s MapReduce
framework. It’s aimed primarily at the Linux platform.
The concept is that you take a huge dataset and break (or map)
it into a number of chunks.
However, instead of sending the data to the node, the dataset is
already split over hundreds or thousands of nodes using a
parallel file system.
Directives and Libraries
There are a number of compiler vendors, PGI, CAPS, and
Cray being the most well-known, that support the recently
announced OpenACC set of compiler directives for GPUs.
These, in essence, replicate the approach of OpenMP, in that
the programmer inserts a number of compiler directives
marking regions as “to be executed on the GPU.” The
compiler then does the grunt work of moving data to or
from the GPU, invoking kernels, etc.
The CUDA SDK provides libraries such as Thrust, which implement common functions in a very efficient way. Libraries like CUBLAS are some of the best around for linear algebra. Libraries exist for many well-known applications such as MATLAB and Mathematica.
Language bindings exist for Python, Perl, Java, and many others. CUDA can even be integrated with Excel.
Understanding parallelism with GPUs
TRADITIONAL SERIAL CODE
SERIAL/PARALLEL PROBLEMS
CONCURRENCY
LOCALITY
TYPES OF PARALLELISM
FLYNN’S TAXONOMY
SOME COMMON PARALLEL PATTERNS
Traditional Serial Code
Traditionally, software has been written for serial
computation:
To be run on a single computer having a single Central
Processing Unit (CPU);
A problem is broken into a discrete series of instructions.
Instructions are executed one after another.
Only one instruction may execute at any moment in time.
What is Parallel Computing? (2)
In the simplest sense, parallel computing is the simultaneous use of multiple compute
resources to solve a computational problem.
To be run using multiple CPUs
A problem is broken into discrete parts that can be solved concurrently
Each part is further broken down to a series of instructions
Instructions from each part execute simultaneously on different CPUs
Parallel Computing: Resources
The compute resources can include:
A single computer with multiple processors;
A single computer with (multiple) processor(s) and some
specialized computer resources (GPU, …)
An arbitrary number of computers connected by a network;
A combination of both.
Parallel Computing: The Computational Problem
The computational problem usually demonstrates characteristics such as the ability to be:
Broken apart into discrete pieces of work that can be solved simultaneously;
Executed as multiple program instructions at any moment in time;
Solved in less time with multiple compute resources than with a single compute resource.
Parallel Computing: what for? (1)
Parallel computing is an evolution of serial computing that attempts to emulate what has
always been the state of affairs in the natural world: many complex, interrelated events
happening at the same time, yet within a sequence.
Some examples:
Planetary and galactic orbits
Weather and ocean patterns
Tectonic plate drift
Rush hour traffic in Paris
Automobile assembly line
Daily operations within a business
Building a shopping mall
Ordering a hamburger at the drive-through.
Parallel Computing: what for? (2)
Traditionally, parallel computing has been considered to be
"the high end of computing" and has been motivated by
numerical simulations of complex systems and "Grand
Challenge Problems" such as:
weather and climate
chemical and nuclear reactions
biological, human genome
geological, seismic activity
mechanical devices - from prosthetics to spacecraft
electronic circuits
manufacturing processes
Parallel Computing: what for? (3)
Today, commercial applications are providing an equal or greater driving force in
the development of faster computers. These applications require the processing of
large amounts of data in sophisticated ways. Example applications include:
TYPES OF PARALLELISM
Task-based parallelism
Data-based parallelism
Task-Level Parallelism
[Figure: Task A and Task B proceed concurrently between synchronization points; serial regions represent unexploited parallelism.]
Master/Workers
One object called the master worker threads
initially owns all data.
Creates several workers to process
individual elements
Waits for workers to report results
back
master
Producer/Consumer Flow
[Figure: producer threads (P) passing data downstream to consumer threads (C).]
Data Parallelism
x := (a * b) + (y * z);
[Figure: computation A is (a * b) and computation B is (y * z); the two can be evaluated in parallel.]
Parallelism: Dependency Graphs
x := foo(a) + bar(b)
[Dependency graph: foo(a) and bar(b) are independent and may run in parallel; the write to x depends on both results.]
Flynn Matrix
The matrix below defines the 4 possible classifications
according to Flynn
Flynn’s Taxonomy
Michael Flynn (from Stanford)
Made a characterization of computer systems which became known
as Flynn’s Taxonomy
[Diagram: a computer consumes two streams: instructions and data.]
Flynn’s Taxonomy
SISD – Single Instruction Single Data Systems
Single Instruction, Single Data (SISD)
A serial (non-parallel) computer
Single instruction: only one instruction stream
is being acted on by the CPU during any one
clock cycle
Single data: only one data stream is being used
as input during any one clock cycle
Deterministic execution
This is the oldest and, until recently, the most
prevalent form of computer
Examples: most PCs, single CPU workstations
and mainframes
Flynn’s Taxonomy
SIMD – Single Instruction Multiple Data Systems “Array
Processors”
Single Instruction, Multiple Data (SIMD)
A type of parallel computer
This type of machine typically has an instruction dispatcher, a very high-
bandwidth internal network, and a very large array of very small-capacity
instruction units.
Best suited for specialized problems characterized by a high degree of
regularity, such as image processing.
Synchronous (lockstep) and deterministic execution
Two varieties: Processor Arrays and Vector Pipelines
Flynn’s Taxonomy
MISD- Multiple Instructions / Single Data System
Some people say “pipelining” lies here, but this is debatable.
Multiple Instruction, Single Data (MISD)
A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via independent
instruction streams.
Some conceivable uses might be:
multiple frequency filters operating on a single signal stream
multiple cryptography algorithms attempting to crack a single coded message.
Flynn’s Taxonomy
MIMD Multiple Instructions Multiple Data System:
“Multiprocessors”
Multiple Instruction, Multiple Data (MIMD)
Currently, the most common type of parallel computer. Most modern
computers fall into this category.
Multiple Instruction: every processor may be executing a different instruction
stream
Multiple Data: every processor may be working with a different data stream
Execution can be synchronous or asynchronous, deterministic or non-
deterministic
Examples: most current supercomputers, networked parallel computer "grids"
and multi-processor SMP computers - including some types of PCs.
SOME COMMON PARALLEL PATTERNS
Loop-based patterns
Fork/join pattern
Tiling/grids
Divide and conquer
Loop-level parallelism
Collections of tasks are defined as iterations of one or more loops.
Loop iterations are divided between a collection of processing
elements to compute tasks concurrently.
This design pattern is also heavily used with data parallel design patterns. OpenMP
programmers commonly use this pattern.
Fork-join
Tasks are associated with threads.
The threads are spawned (forked), carry out their execution, and then
terminate (join).
Note that due to the high cost of thread creation and destruction,
programming languages that support this pattern often use logical threads
that execute with physical threads pulled from a thread pool … But to
the programmer, this is an implementation detail managed by the
runtime environment. Cilk and the explicit tasks in OpenMP 3.0
commonly use this pattern.
A queue of data processed by N threads.
Tiling / grid
CUDA provides a simple two-dimensional grid model. For a significant number of problems this is entirely sufficient. If you have a linear distribution of work within a single block, you have an ideal decomposition into CUDA blocks. As we can assign up to sixteen blocks per SM, and we can have up to 16 SMs (32 on some GPUs), any number of blocks of 256 or larger is fine. In practice, we'd like to limit the number of elements within the block to 128, 256, or 512, so this in itself may drive much larger numbers of blocks with a typical dataset.
Divide and Conquer
You often see divide-and-conquer algorithms used with recursion. Quicksort is a classic example. It recursively partitions the data into two sets: those above a pivot point and those below the pivot point. When a partition finally consists of just two items, they are compared and swapped.
CUDA hardware overview
GPU Hardware
The NVIDIA G80 series processor and beyond implemented
a design that is similar to both the Connection Machine and
IBM’s Cell processor.
Each graphics card consists of a number of SMs. To
each SM is attached eight or more SPs (Stream
Processors). The original 9800 GTX card has eight
SMs, giving a total of 128 SPs.
The GPU cards can broadly be considered as accelerator or
coprocessor cards.
Typical Core 2 series layout.
GPU hardware
Notice the GPU hardware consists of a number of key
blocks:
• Memory (global, constant, shared)
• Streaming multiprocessors (SMs)
• Streaming processors (SPs)
Inside an SM.
Global memory – used by all SMs, per GPU
L2 cache – shared by all SMs, per GPU
L1 cache – per SM
Texture memory – filled by the host, read-only to the SMs
Constant memory – data readable by all threads in an SM
SFUs – perform special hardware instructions such as sin, cos, and exponent
There are 8 SPs in each SM; in Fermi this grows to 32–48 SPs, and in Kepler to 192.
Each SM has access to something called a register file, which is
much like a chunk of memory that runs at the same speed as the
SP units, so there is effectively zero wait time on this memory.
There is also a shared memory block accessible only to the
individual SM; this can be used as a program-managed cache.
Constant memory is used for read-only data.
Each SM has a separate bus into the texture memory, constant
memory, and global memory spaces.
Texture memory is a special view onto global memory that is useful for data where interpolation is needed, e.g., with 2D or 3D lookup tables.
Compute Levels
Compute 1.0
Compute 1.1
Compute 1.2
Compute 1.3
Compute 2.0
Compute 2.1
Compute 1.0
Compute level 1.0 is found on the older graphics cards
The main features lacking in compute 1.0 cards are
those for atomic operations. Atomic operations are
those where we can guarantee a complete operation without
any other thread interrupting. In effect, the hardware
implements a barrier point at the entry of the atomic
function and guarantees the completion of the operation
(add, sub, min, max, logical and, or, xor, etc.) as one
operation.
Compute 1.0 cards are now effectively obsolete.
Compute 1.1
Compute level 1.1 is found in many of the later shipping 9000
series cards, such as the 9800 GTX, which were extremely
popular. These are based on the G92 hardware as opposed to the
G80 hardware of compute 1.0 devices.
One major change brought in with compute 1.1 devices was
support, on many but not all devices, for overlapped data
transfer and kernel execution.
The SDK call to cudaGetDeviceProperties() returns the deviceOverlap property, which defines whether this functionality is available. This allows for a very nice and important optimization called double buffering, which works as shown in the figure.
To use this method we require double the memory space we'd normally use, which may well be an issue if your target market only has a 512 MB card. With Tesla cards, used mainly for scientific computing, you can have up to 6 GB of GPU memory.
Let us see what happens:
Cycle 0: Having allocated two areas of memory in the GPU
memory space, the CPU fills the first buffer.
Cycle 1: The CPU then invokes a CUDA kernel (a GPU task)
on the GPU, which returns immediately to the CPU (a
nonblocking call). The CPU then fetches the next data packet,
from a disk, the network, or wherever. Meanwhile, the GPU is
processing away in the background on the data packet provided.
When the CPU is ready, it starts filling the other buffer.
Cycle 2: When the CPU is done filling the buffer, it invokes a kernel
to process buffer 1. It then checks if the kernel from cycle 1, which
was processing buffer 0, has completed. If not, it waits until this
kernel has finished and then fetches the data from buffer 0 and then
loads the next data block into the same buffer. During this time the
kernel kicked off at the start of the cycle is processing data on the
GPU in buffer 1.
Cycle N: We then repeat cycle 2, alternating the buffer we read and write on the CPU with the buffer being processed on the GPU.
Compute 1.3
The compute 1.3 devices were introduced with the move
from GT200 to the GT200 a/b revisions of the hardware.
The major change that occurs with compute 1.3
hardware is the introduction of support for limited
double-precision calculations.
GPUs are primarily aimed at graphics and here there is a
huge need for fast single-precision calculations, but limited
need for double-precision ones.
You see an order of magnitude drop in performance using double-precision as opposed to single-precision floating-point operations, so time should be taken to see if there is any way single-precision arithmetic can be used to get the most out of this hardware.
Compute 2.0
Compute 2.0 devices saw the switch to Fermi hardware.
Some of main changes in compute 2.x hardware are as follows:
• Introduction of 16 K to 48 K of L1 cache memory on each SM.
• Introduction of a shared L2 cache for all SMs.
• Support in Tesla-based devices for ECC (Error Correcting
Code)-based memory checking and error correction.
• Support in Tesla-based devices for dual-copy engines.
• Extension in size of the shared memory from 16 K per
SM up to 48 K per SM.
Support for ECC memory is a must for data centers. ECC memory
provides for automatic error detection and correction.
Electrical devices emit small amounts of radiation. When in close
proximity to other devices, this radiation can change the contents
of memory cells in the other device.
Although the probability of this happening is tiny, as you increase
the exposure of the equipment by densely packing it into data
centers, the probability of something going wrong rises to an
unacceptable level.
ECC, therefore, detects and corrects single-bit upset conditions
that you may find in large data centers.
This reduces the amount of available RAM and negatively impacts
memory bandwidth. Because this is a major drawback on graphics
cards, ECC is only available on Tesla products.
Dual-copy engines allow you to extend the dual-buffer
example we looked at earlier to use multiple streams.
Compute 2.1
Compute 2.1 is seen on certain devices aimed specifically at the
games market, such as the GTX460 and GTX560. These
devices change the architecture of the device as follows:
• 48 CUDA cores per SM instead of the usual 32 per SM.
• Eight single-precision special function units for transcendental operations per SM instead of the usual four.
• Dual-warp dispatcher instead of the usual single-warp
dispatcher.
Warps, which we will cover in detail later, are groups of
threads. On compute 2.0 hardware, the single-warp
dispatcher takes two clock cycles to dispatch instructions of
an entire warp.
On compute 2.1 hardware, instead of the usual two
instruction dispatchers per two clock cycles, we now have
four.
Thank You