
CONCURRENT PROGRAMMING



Background and Motivation
A PROCESS or THREAD is a potentially-active
execution context
The classic von Neumann (stored-program) model
of computing has a single thread of control
Parallel programs have more than one
A process can be thought of as an abstraction
of a physical processor; here, a single
processor will run multiple threads



Background
Processes/threads can come from
multiple CPUs
kernel-level multiplexing of a single physical machine
language- or library-level multiplexing of kernel-level abstractions
They can run
in true parallel
unpredictably interleaved
run-until-block
Most work focuses on the first two cases,
which are equally difficult to deal with
Background
Two main classes of programming notation
synchronized access to shared memory
message passing between processes that don't share memory
Both approaches can be implemented on hardware designed
for the other, though shared memory on message-passing
hardware tends to be slow
The principal difference is that message passing requires the
active participation of two processors, one at each end - one to send
and one to receive - while on a multiprocessor, reading or
writing shared memory needs only one processor



Race Condition
A race condition occurs when actions in two processes are not synchronized
and program behavior depends on the order in which the actions happen
Race conditions are not all bad; sometimes any of the possible program
outcomes is OK (e.g. workers taking things off a task queue)
Race conditions (which we want to avoid):
Suppose processors A and B share memory, and both try to increment variable
X at more or less the same time
Very few processors support arithmetic operations on memory, so each
processor executes
LOAD X
INC
STORE X
Suppose X is initialized to 0. If both processors execute these instructions
simultaneously, what are the possible outcomes?
X could go up by one or by two (a sketch of such a race is shown below)
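As a concrete illustration (a sketch not taken from the slides; the thread routine, iteration count, and variable names are made up), two POSIX threads incrementing a shared counter without synchronization typically lose updates, because each increment is really the LOAD/INC/STORE sequence above:

#include <pthread.h>
#include <stdio.h>

static int X = 0;                        /* shared variable */

static void *increment_many(void *arg) {
    for (int i = 0; i < 1000000; i++)
        X++;                             /* LOAD X; INC; STORE X - not atomic */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, increment_many, NULL);
    pthread_create(&b, NULL, increment_many, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("X = %d\n", X);               /* expected 2000000, usually prints less */
    return 0;
}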



Synchronization
SYNCHRONIZATION is the act of ensuring that events in different
processes happen in a desired order
Synchronization can be used to eliminate race conditions
In our example we need to synchronize the increment operations
to enforce MUTUAL EXCLUSION on access to X
Most synchronization can be regarded as either:
Mutual exclusion (making sure that only one process is executing a CRITICAL
SECTION [touching a variable, for example] at a time),
or as
CONDITION SYNCHRONIZATION, which means making sure that a given
process does not proceed until some condition holds (e.g. that a variable
contains a given value)
We do not in general want to over-synchronize:
that eliminates parallelism, which we generally want to encourage for performance
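Continuing the sketch above (not from the slides; names are made up), a POSIX mutex enforces mutual exclusion on the critical section that touches X:

#include <pthread.h>

static int X = 0;
static pthread_mutex_t X_lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment_many(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&X_lock);     /* enter the critical section */
        X++;                             /* only one thread touches X at a time */
        pthread_mutex_unlock(&X_lock);   /* leave the critical section */
    }
    return NULL;
}

Dropped into the main function of the earlier race-condition sketch, this version always ends with X = 2000000, at the cost of serializing the increments.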
Basic Options
SCHEDULERS give us the ability to "put a thread/process to
sleep" and run something else on its process/processor
Six principal options in most systems:
Co-begin
Parallel loops - OpenCL, CUDA, C#
Launch-at-elaboration - Ada
Fork (with optional join) - C, C++, Java
makes the creation of threads an explicit executable task
Join is the reverse operation, which allows a thread to wait for the
completion of a previously forked thread (the pthread_create/pthread_join
calls in the race-condition sketch above are fork and join in this sense)
Implicit receipt
Early reply



Implementation of Threads
Threads must be implemented on top of OS processes.
Could put each thread on a separate OS process - but this is
expensive.
Processes are implemented in the kernel, so operations on them require kernel calls.
They are general purpose, which means extra features we don't need
but must pay for anyway.
At the other end, could put all threads onto one process.
Means there is no real parallel programming!
If the current thread blocks, then none of the other threads can run, because the
whole process is suspended by the OS.
Generally, some in-between approach is used:
need to implement threads running on top of processes



Implementing Synchronization
Barriers are provided to make sure every thread completes a
particular section of code before proceeding.
Implemented as a number, initialized to the number of threads,
and a boolean, done = FALSE.
When a thread finishes, it subtracts 1 from the number and
waits until done is set to true.
The last thread, which sets the counter to 0, sets done to
true, freeing all the threads to continue (see the sketch below).
Note that the counter means that we need O(n) time for n
processors to synchronize and continue, which is too long on
some machines.
Best known is O(log n), although some specially designed hardware can
get this closer to O(1) in practice
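A minimal sketch of the counter-and-flag barrier just described, using C11 atomics (the type and function names are made up; like the version on the slide, it is single-use rather than reusable):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  remaining;   /* threads that have not yet arrived */
    atomic_bool done;        /* set by the last thread to arrive */
} barrier_t;

void barrier_init(barrier_t *b, int n) {
    atomic_init(&b->remaining, n);
    atomic_init(&b->done, false);
}

void barrier_wait(barrier_t *b) {
    if (atomic_fetch_sub(&b->remaining, 1) == 1) {
        atomic_store(&b->done, true);    /* last arriver frees everyone */
    } else {
        while (!atomic_load(&b->done))   /* wait until done becomes true */
            ;                            /* spin */
    }
}

Counting every thread down through one shared counter is what gives the O(n) cost mentioned above; tree-structured (combining) barriers reduce this to O(log n).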



Implementing Synchronization
Monitors have the highest-level semantics, but a few
sticky semantic problems - they are also widely used
Synchronization in Java is sort of a hybrid of monitors
and CCRs (Java 3 will have true monitors.)
A monitor is a shared object with operations, internal state,
and a number of condition queues. Only one operation of a
given monitor may be active at a given point in time
A process that calls a busy monitor is delayed until the
monitor is free
On behalf of its calling process, any operation may suspend itself by
waiting on a condition
An operation may also signal a condition, in which case one of the
waiting processes is resumed, usually the one that waited first
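A monitor-like sketch in C (not from the slides; pthreads stands in for the monitor lock and one condition queue, and all names are made up): the mutex guarantees that only one operation is active at a time, and the condition variable plays the role of a condition queue.

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;     /* the monitor's mutual-exclusion lock */
    pthread_cond_t  nonzero;  /* one condition queue */
    int value;                /* the monitor's internal state */
} counter_t;

void counter_init(counter_t *c) {
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->nonzero, NULL);
    c->value = 0;
}

void counter_inc(counter_t *c) {          /* a monitor operation */
    pthread_mutex_lock(&c->lock);
    c->value++;
    pthread_cond_signal(&c->nonzero);     /* resume one waiting process, if any */
    pthread_mutex_unlock(&c->lock);
}

void counter_wait_nonzero(counter_t *c) { /* an operation that suspends itself */
    pthread_mutex_lock(&c->lock);
    while (c->value == 0)
        pthread_cond_wait(&c->nonzero, &c->lock);  /* wait on the condition */
    pthread_mutex_unlock(&c->lock);
}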
Conditional Critical Regions
In Java, every object accessible to more than one thread has an
implicit mutual exclusion lock built in.
synchronized (my_shared_object) {
    // critical section of code
}
Inside synchronized code, a thread can voluntarily suspend itself, releasing the
lock, using wait. It can even wait until some condition is met:
while (!condition) {
    wait();   // releases the lock and blocks; the loop re-checks the condition on wake-up
}
C# has a lock statement that is essentially the same



CUDA



C Extension
Consists of:
New syntax and built-in variables
Restrictions to ANSI C
API/Libraries



C Extension: New Syntax
New Syntax:
<<< ... >>>
__host__, __global__, __device__
__constant__, __shared__, __device__
__syncthreads()



C Extension: Built-in Variables
Built-in Variables:
dim3 gridDim;
Dimensions of the grid in blocks (gridDim.z unused)
dim3 blockDim;
Dimensions of the block in threads
uint3 blockIdx;
Block index within the grid
uint3 threadIdx;
Thread index within the block
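A minimal sketch of how kernels usually combine these variables (the kernel name, array, and launch configuration are made up): each thread derives a unique global index from its block and thread coordinates.

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // the grid may overshoot n
        data[i] *= factor;
}

// Launched from the host, e.g. with 256 threads per block:
//   scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);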



C Extension: Restrictions
New Restrictions:
No recursion in device code
No function pointers in device code



Compiling a CUDA Program



Matrix Transpose



What is a GPU?
Processor dedicated to rapid rendering of
polygons - texturing, shading
They have lots of compute cores, but a simpler
architecture than a standard CPU
The shader pipeline can be used to do
floating point calculations
cheap scientific/technical computing



What is CUDA?
Compute Unified Device Architecture
Extension to the C programming language
Adds library functions to access the GPU
Adds directives to translate C into instructions
that run on the host CPU or the GPU when
needed
Allows easy multi-threading - parallel execution
on all thread processors on the GPU
CUDA works on modern nVidia cards (Quadro,
GeForce, Tesla)
nVidia's compiler - nvcc
CUDA code must be compiled using nvcc
nvcc generates both instructions for the host and the GPU (PTX instruction
set), as well as instructions to send data back and forth between them
Standard CUDA install: /usr/local/cuda/bin/nvcc
A shell executing the compiled code needs the dynamic linker path
LD_LIBRARY_PATH environment variable set to include /usr/local/cuda/lib
Simple overview
GPU can't directly access main memory
CPU can't directly access GPU memory
Need to explicitly copy data (see the sketch below)
No printf in device code!
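A minimal sketch of the explicit-copy pattern (buffer names and sizes are made up): host memory and device memory are allocated separately, and cudaMemcpy moves data between them.

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float *h_buf = (float *)malloc(bytes);    // host (CPU) memory
    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, bytes);       // device (GPU) memory

    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU
    // ... launch kernels that read and write d_buf ...
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}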
Writing some code (1) - specifying where code runs
CUDA provides function type qualifiers (that are not in C/C++) to enable the
programmer to define where a function should run
__host__ : specifies the code should run on the host CPU (redundant on its own
- it is the default)
__device__ : specifies the code should run on the GPU, and the function can
only be called by code running on the GPU
__global__ : specifies the code should run on the GPU, but be called from the
host - this is the access point to start multi-threaded code running on the GPU
The device can't execute code on the host!
CUDA imposes some restrictions, such as device code is C-only (host code can be
C++), and device code can't be called recursively
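A small sketch showing the three qualifiers in one place (function names are made up; compiled with nvcc as a .cu file):

__host__   float double_on_cpu(float x) { return 2.0f * x; }  // host only (the default)
__device__ float double_on_gpu(float x) { return 2.0f * x; }  // GPU only, callable from GPU code
__global__ void  double_all(float *v)                         // GPU code, launched from the host
{
    v[threadIdx.x] = double_on_gpu(v[threadIdx.x]);
}
// Host code launches it with the <<< >>> syntax described next,
// e.g. double_all<<<1, 256>>>(d_v);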
Writing some code (2) - launching a __global__ function
All calls to a global function must specify how
many threaded copies are to launch and in
what configuration.
CUDA syntax: <<< >>>
Threads are grouped into thread blocks, then
into a grid of blocks
This defines a memory hierarchy (important for
performance)
The thread/block/grid model
The thread/block/grid model (contd.)
Writing some code: __global__
Inside the <<< >>>, need at least two arguments (there can be two more, which
have default values)
A call looks e.g. like myfunc<<<bg, tb>>>(arg1, arg2)
bg specifies the dimensions of the block grid and tb specifies the
dimensions of each thread block
bg and tb are both of type dim3 (a new datatype defined by CUDA; three
unsigned ints where any unspecified component defaults to 1)
dim3 has struct-like access - members are x, y and z
CUDA provides a constructor: dim3 mygrid(2,2); sets mygrid.x=2,
mygrid.y=2 and mygrid.z=1
1D syntax allowed: myfunc<<<5, 6>>>() makes 5 blocks (in a linear array)
with 6 threads each and runs myfunc on them all

CUDA API
CUDA Runtime (Host and Device)
Device Memory Handling (cudaMalloc, ...)
Built-in Math Functions (sin, sqrt, mod, ...)
Atomic operations (for concurrency - see the sketch below)
Datatypes (2D textures, dim2, dim3, ...)
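As a sketch of the atomic operations (the kernel and names are made up), atomicAdd performs the whole read-modify-write as one indivisible step, avoiding the LOAD/INC/STORE race discussed earlier:

__global__ void count_positive(const float *v, int n, int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && v[i] > 0.0f)
        atomicAdd(count, 1);   // safe even when many threads hit it at once
}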



Built-in variables on the GPU
For code running on the GPU (__device__ and __global__),
some variables are predefined, which allow
threads to be located inside their blocks and grids
dim3 gridDim - dimensions of the grid
uint3 blockIdx - location of this block in the grid
dim3 blockDim - dimensions of the blocks
uint3 threadIdx - location of this thread in the block
Where variables are stored
For code running on the GPU (__device__ and __global__), the memory
used to hold a variable can be specified (see the sketch below)
__device__ : the variable resides in the GPU's global memory and is
defined while the code runs
__constant__ : the variable resides in the constant memory space of the
GPU and is defined while the code runs
__shared__ : the variable resides in the shared memory of the thread
block and has the same lifespan as the block
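A small sketch combining the three qualifiers (names are made up; it assumes blockDim.x == 256 and a grid that exactly covers the input):

__device__   float d_scale  = 2.0f;   // global memory, visible to every thread
__constant__ float c_offset = 1.0f;   // constant memory, read-only in kernels

__global__ void scale_and_shift(const float *in, float *out)
{
    __shared__ float tile[256];                      // shared memory, one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                                 // wait until the whole block has loaded
    out[i] = tile[threadIdx.x] * d_scale + c_offset;
}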




Matrix Transpose: First idea
Each thread block transposes an equal-sized block of matrix M
Assume M is square (n x n)
What is a good block size?
CUDA places limitations on the number of threads per block
512 threads per block is the maximum allowed by CUDA





Matrix Transpose: First idea
#include <stdio.h>
#include <stdlib.h>

// Each thread copies one element of the input matrix to its
// transposed position in the output matrix.
__global__
void transpose(float* in, float* out, unsigned int width) {
    unsigned int tx = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
    unsigned int ty = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    out[tx * width + ty] = in[ty * width + tx];
}

int main(int argc, char** argv) {
    const int HEIGHT = 1024;
    const int WIDTH = 1024;
    const int SIZE = WIDTH * HEIGHT * sizeof(float);

    dim3 bDim(16, 16);                             // 16 x 16 = 256 threads per block
    dim3 gDim(WIDTH / bDim.x, HEIGHT / bDim.y);    // enough blocks to cover M

    float* M = (float*)malloc(SIZE);               // host matrix
    for (int i = 0; i < HEIGHT * WIDTH; i++) { M[i] = i; }

    float* Md = NULL;
    cudaMalloc((void**)&Md, SIZE);                 // device copy of M
    cudaMemcpy(Md, M, SIZE, cudaMemcpyHostToDevice);

    float* Bd = NULL;
    cudaMalloc((void**)&Bd, SIZE);                 // device result matrix

    transpose<<<gDim, bDim>>>(Md, Bd, WIDTH);      // launch the kernel

    cudaMemcpy(M, Bd, SIZE, cudaMemcpyDeviceToHost);  // copy the result back

    cudaFree(Md);
    cudaFree(Bd);
    free(M);
    return 0;
}
