
CONCURRENT PROGRAMMING



Background and Motivation
A PROCESS or THREAD is a potentially-active
execution context
The classic von Neumann (stored-program) model
of computing has a single thread of control
Parallel programs have more than one
A process can be thought of as an abstraction
of a physical processor; here, a single
processor will run multiple threads



Background
Processes/threads can come from
multiple CPUs
kernel-level multiplexing of a single physical machine
language- or library-level multiplexing of kernel-level abstractions
They can run
in true parallel
unpredictably interleaved
run-until-block
Most work focuses on the first two cases,
which are equally difficult to deal with
Background
Two main classes of programming notation
synchronized access to shared memory
message passing between processes that don't share memory
Both approaches can be implemented on hardware designed
for the other, though shared memory on message-passing
hardware tends to be slow
The principal difference is that message passing requires the
active participation of two processors, one at each end - one to send
and one to receive - while on a multiprocessor, reading or
writing shared memory needs only one processor



Race Condition
A race condition occurs when actions in two processes are not synchronized
and program behavior depends on the order in which the actions happen
Race conditions are not all bad; sometimes any of the possible program
outcomes is OK (e.g. workers taking things off a task queue)
Race conditions (which we want to avoid):
Suppose processors A and B share memory, and both try to increment variable
X at more or less the same time
Very few processors support arithmetic operations on memory, so each
processor executes
LOAD X
INC
STORE X
Suppose X is initialized to 0. If both processors execute these instructions
simultaneously, what are the possible outcomes?
X could go up by one or by two (a sketch of such a race is shown below)
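As a concrete illustration (a sketch not taken from the slides; the thread routine, iteration count, and variable names are made up), two POSIX threads incrementing a shared counter without synchronization typically lose updates, because each increment is really the LOAD/INC/STORE sequence above:

#include <pthread.h>
#include <stdio.h>

static int X = 0;                        /* shared variable */

static void *increment_many(void *arg) {
    for (int i = 0; i < 1000000; i++)
        X++;                             /* LOAD X; INC; STORE X - not atomic */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, increment_many, NULL);
    pthread_create(&b, NULL, increment_many, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("X = %d\n", X);               /* expected 2000000, usually prints less */
    return 0;
}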



Synchronization
SYNCHRONIZATION is the act of ensuring that events in different
processes happen in a desired order
Synchronization can be used to eliminate race conditions
In our example we need to synchronize the increment operations
to enforce MUTUAL EXCLUSION on access to X
Most synchronization can be regarded as either:
Mutual exclusion (making sure that only one process is executing a CRITICAL
SECTION [touching a variable, for example] at a time),
or as
CONDITION SYNCHRONIZATION, which means making sure that a given
process does not proceed until some condition holds (e.g. that a variable
contains a given value)
We do not in general want to over-synchronize:
that eliminates parallelism, which we generally want to encourage for performance
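Continuing the sketch above (not from the slides; names are made up), a POSIX mutex enforces mutual exclusion on the critical section that touches X:

#include <pthread.h>

static int X = 0;
static pthread_mutex_t X_lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment_many(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&X_lock);     /* enter the critical section */
        X++;                             /* only one thread touches X at a time */
        pthread_mutex_unlock(&X_lock);   /* leave the critical section */
    }
    return NULL;
}

Dropped into the main function of the earlier race-condition sketch, this version always ends with X = 2000000, at the cost of serializing the increments.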
Basic Options
SCHEDULERS give us the ability to "put a thread/process to
sleep" and run something else on its process/processor
Six principal options in most systems:
Co-begin
Parallel loops - OpenCL, CUDA, C#
Launch-at-elaboration - Ada
Fork (with optional join) - C, C++, Java
makes the creation of threads an explicit executable task
Join is the reverse operation, which allows a thread to wait for the
completion of a previously forked thread (the pthread_create/pthread_join
calls in the race-condition sketch above are fork and join in this sense)
Implicit receipt
Early reply



Implementation of Threads
Threads must be implemented on top of OS processes.
Could put each thread on a separate OS process - but this is
expensive.
Processes are implemented in the kernel, so operations on them require kernel calls.
They are general purpose, which means extra features we don't need
but must pay for anyway.
At the other end, could put all threads onto one process.
Means there is no real parallel programming!
If the current thread blocks, then none of the other threads can run, because the
whole process is suspended by the OS.
Generally, some in-between approach is used:
need to implement threads running on top of processes



Implementing Synchronization
Barriers are provided to make sure every thread completes a
particular section of code before proceeding.
Implemented as a number, initialized to the number of threads,
and a boolean, done = FALSE.
When a thread finishes, it subtracts 1 from the number and
waits until done is set to true.
The last thread, which sets the counter to 0, sets done to
true, freeing all the threads to continue (see the sketch below).
Note that the counter means that we need O(n) time for n
processors to synchronize and continue, which is too long on
some machines.
Best known is O(log n), although some specially designed hardware can
get this closer to O(1) in practice
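A minimal sketch of the counter-and-flag barrier just described, using C11 atomics (the type and function names are made up; like the version on the slide, it is single-use rather than reusable):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  remaining;   /* threads that have not yet arrived */
    atomic_bool done;        /* set by the last thread to arrive */
} barrier_t;

void barrier_init(barrier_t *b, int n) {
    atomic_init(&b->remaining, n);
    atomic_init(&b->done, false);
}

void barrier_wait(barrier_t *b) {
    if (atomic_fetch_sub(&b->remaining, 1) == 1) {
        atomic_store(&b->done, true);    /* last arriver frees everyone */
    } else {
        while (!atomic_load(&b->done))   /* wait until done becomes true */
            ;                            /* spin */
    }
}

Counting every thread down through one shared counter is what gives the O(n) cost mentioned above; tree-structured (combining) barriers reduce this to O(log n).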



Implementing Synchronization
Monitors have the highest-level semantics, but a few
sticky semantic problems - they are also widely used
Synchronization in Java is sort of a hybrid of monitors
and CCRs (Java 3 will have true monitors.)
A monitor is a shared object with operations, internal state,
and a number of condition queues. Only one operation of a
given monitor may be active at a given point in time
A process that calls a busy monitor is delayed until the
monitor is free
On behalf of its calling process, any operation may suspend itself by
waiting on a condition
An operation may also signal a condition, in which case one of the
waiting processes is resumed, usually the one that waited first
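A monitor-like sketch in C (not from the slides; pthreads stands in for the monitor lock and one condition queue, and all names are made up): the mutex guarantees that only one operation is active at a time, and the condition variable plays the role of a condition queue.

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;     /* the monitor's mutual-exclusion lock */
    pthread_cond_t  nonzero;  /* one condition queue */
    int value;                /* the monitor's internal state */
} counter_t;

void counter_init(counter_t *c) {
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->nonzero, NULL);
    c->value = 0;
}

void counter_inc(counter_t *c) {          /* a monitor operation */
    pthread_mutex_lock(&c->lock);
    c->value++;
    pthread_cond_signal(&c->nonzero);     /* resume one waiting process, if any */
    pthread_mutex_unlock(&c->lock);
}

void counter_wait_nonzero(counter_t *c) { /* an operation that suspends itself */
    pthread_mutex_lock(&c->lock);
    while (c->value == 0)
        pthread_cond_wait(&c->nonzero, &c->lock);  /* wait on the condition */
    pthread_mutex_unlock(&c->lock);
}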
Conditional Critical Regions
In Java, every object accessible to more than one thread has an
implicit mutual exclusion lock built in.
synchronized (my_shared_object) {
    // critical section of code
}
Inside synchronized code, a thread can voluntarily suspend itself, releasing the
lock, using wait. It can even wait until some condition is met:
while (!condition) {
    wait();   // releases the lock and blocks; the loop re-checks the condition on wake-up
}
C# has a lock statement that is essentially the same



CUDA



C Extension
Consists of:
New syntax and built-in variables
Restrictions to ANSI C
API/Libraries



C Extension: New Syntax
New Syntax:
<<< ... >>>
__host__, __global__, __device__
__constant__, __shared__, __device__
__syncthreads()



C Extension: Built-in Variables
Built-in Variables:
dim3 gridDim;
Dimensions of the grid in blocks (gridDim.z unused)
dim3 blockDim;
Dimensions of the block in threads
uint3 blockIdx;
Block index within the grid
uint3 threadIdx;
Thread index within the block
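A minimal sketch of how kernels usually combine these variables (the kernel name, array, and launch configuration are made up): each thread derives a unique global index from its block and thread coordinates.

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // the grid may overshoot n
        data[i] *= factor;
}

// Launched from the host, e.g. with 256 threads per block:
//   scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);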



C Extension: Restrictions
New Restrictions:
No recursion in device code
No function pointers in device code



Compiling a CUDA Program



Matrix Transpose



What is a GPU?
Processor dedicated to rapid rendering of
polygons - texturing, shading
They have lots of compute cores, but a simpler
architecture than a standard CPU
The shader pipeline can be used to do
floating point calculations
cheap scientific/technical computing



What is CUDA?
Compute Unified Device Architecture
Extension to the C programming language
Adds library functions to access the GPU
Adds directives to translate C into instructions
that run on the host CPU or the GPU when
needed
Allows easy multi-threading - parallel execution
on all thread processors on the GPU
CUDA works on modern nVidia cards (Quadro,
GeForce, Tesla)
nVidia's compiler - nvcc
CUDA code must be compiled using nvcc
nvcc generates both instructions for the host and the GPU (PTX instruction
set), as well as instructions to send data back and forth between them
Standard CUDA install: /usr/local/cuda/bin/nvcc
A shell executing the compiled code needs the dynamic linker path
LD_LIBRARY_PATH environment variable set to include /usr/local/cuda/lib
Simple overview
GPU can't directly access main memory
CPU can't directly access GPU memory
Need to explicitly copy data (see the sketch below)
No printf in device code!
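A minimal sketch of the explicit-copy pattern (buffer names and sizes are made up): host memory and device memory are allocated separately, and cudaMemcpy moves data between them.

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float *h_buf = (float *)malloc(bytes);    // host (CPU) memory
    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, bytes);       // device (GPU) memory

    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU
    // ... launch kernels that read and write d_buf ...
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}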
Writing some code (1) - specifying where code runs
CUDA provides function type qualifiers (that are not in C/C++) to enable the
programmer to define where a function should run
__host__ : specifies the code should run on the host CPU (redundant on its own
- it is the default)
__device__ : specifies the code should run on the GPU, and the function can
only be called by code running on the GPU
__global__ : specifies the code should run on the GPU, but be called from the
host - this is the access point to start multi-threaded code running on the GPU
The device can't execute code on the host!
CUDA imposes some restrictions, such as device code is C-only (host code can be
C++), and device code can't be called recursively
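A small sketch showing the three qualifiers in one place (function names are made up; compiled with nvcc as a .cu file):

__host__   float double_on_cpu(float x) { return 2.0f * x; }  // host only (the default)
__device__ float double_on_gpu(float x) { return 2.0f * x; }  // GPU only, callable from GPU code
__global__ void  double_all(float *v)                         // GPU code, launched from the host
{
    v[threadIdx.x] = double_on_gpu(v[threadIdx.x]);
}
// Host code launches it with the <<< >>> syntax described next,
// e.g. double_all<<<1, 256>>>(d_v);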
Writing some code (2) - launching a __global__ function
All calls to a global function must specify how
many threaded copies are to launch and in
what configuration.
CUDA syntax: <<< >>>
Threads are grouped into thread blocks, then
into a grid of blocks
This defines a memory hierarchy (important for
performance)
The thread/block/grid model
The thread/block/grid model (contd.)
Writing some code: __global__
Inside the <<< >>>, need at least two arguments (there can be two more, which
have default values)
A call looks e.g. like myfunc<<<bg, tb>>>(arg1, arg2)
bg specifies the dimensions of the block grid and tb specifies the
dimensions of each thread block
bg and tb are both of type dim3 (a new datatype defined by CUDA; three
unsigned ints where any unspecified component defaults to 1)
dim3 has struct-like access - members are x, y and z
CUDA provides a constructor: dim3 mygrid(2,2); sets mygrid.x=2,
mygrid.y=2 and mygrid.z=1
1D syntax allowed: myfunc<<<5, 6>>>() makes 5 blocks (in a linear array)
with 6 threads each and runs myfunc on them all

CUDA API
CUDA Runtime (Host and Device)
Device Memory Handling (cudaMalloc, ...)
Built-in Math Functions (sin, sqrt, mod, ...)
Atomic operations (for concurrency - see the sketch below)
Datatypes (2D textures, dim2, dim3, ...)
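As a sketch of the atomic operations (the kernel and names are made up), atomicAdd performs the whole read-modify-write as one indivisible step, avoiding the LOAD/INC/STORE race discussed earlier:

__global__ void count_positive(const float *v, int n, int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && v[i] > 0.0f)
        atomicAdd(count, 1);   // safe even when many threads hit it at once
}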



Built-in variables on the GPU
For code running on the GPU (__device__ and __global__),
some variables are predefined, which allow
threads to be located inside their blocks and grids
dim3 gridDim - dimensions of the grid
uint3 blockIdx - location of this block in the grid
dim3 blockDim - dimensions of the blocks
uint3 threadIdx - location of this thread in the block
Where variables are stored
For code running on the GPU (__device__ and __global__), the memory
used to hold a variable can be specified (see the sketch below)
__device__ : the variable resides in the GPU's global memory and is
defined while the code runs
__constant__ : the variable resides in the constant memory space of the
GPU and is defined while the code runs
__shared__ : the variable resides in the shared memory of the thread
block and has the same lifespan as the block
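A small sketch combining the three qualifiers (names are made up; it assumes blockDim.x == 256 and a grid that exactly covers the input):

__device__   float d_scale  = 2.0f;   // global memory, visible to every thread
__constant__ float c_offset = 1.0f;   // constant memory, read-only in kernels

__global__ void scale_and_shift(const float *in, float *out)
{
    __shared__ float tile[256];                      // shared memory, one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                                 // wait until the whole block has loaded
    out[i] = tile[threadIdx.x] * d_scale + c_offset;
}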




Matrix Transpose: First idea
Each thread block transposes an equal-sized block of matrix M
Assume M is square (n x n)
What is a good block size?
CUDA places limitations on the number of threads per block
512 threads per block is the maximum allowed by CUDA





Matrix Transpose: First idea
#include <stdio.h>
#include <stdlib.h>

// Each thread copies one element of the input matrix to its
// transposed position in the output matrix.
__global__
void transpose(float* in, float* out, unsigned int width) {
    unsigned int tx = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
    unsigned int ty = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
    out[tx * width + ty] = in[ty * width + tx];
}

int main(int argc, char** argv) {
    const int HEIGHT = 1024;
    const int WIDTH = 1024;
    const int SIZE = WIDTH * HEIGHT * sizeof(float);

    dim3 bDim(16, 16);                             // 16 x 16 = 256 threads per block
    dim3 gDim(WIDTH / bDim.x, HEIGHT / bDim.y);    // enough blocks to cover M

    float* M = (float*)malloc(SIZE);               // host matrix
    for (int i = 0; i < HEIGHT * WIDTH; i++) { M[i] = i; }

    float* Md = NULL;
    cudaMalloc((void**)&Md, SIZE);                 // device copy of M
    cudaMemcpy(Md, M, SIZE, cudaMemcpyHostToDevice);

    float* Bd = NULL;
    cudaMalloc((void**)&Bd, SIZE);                 // device result matrix

    transpose<<<gDim, bDim>>>(Md, Bd, WIDTH);      // launch the kernel

    cudaMemcpy(M, Bd, SIZE, cudaMemcpyDeviceToHost);  // copy the result back

    cudaFree(Md);
    cudaFree(Bd);
    free(M);
    return 0;
}
