
PARALLEL PROCESSING : FUNDAMENTALS
Khushdeep Singh
Department of Computer Science and Engineering
IIT Kanpur
TUTOR : Prof. Dr. U. Rüde, Florian Schornbaum

OUTLINE

What is Parallel Processing?
Why use Parallel Processing?
Flynn's Classical Taxonomy
Parallel Computer Memory Architectures :
  1. Shared Memory
  2. Distributed Memory
  3. Hybrid Distributed-Shared Memory
Parallel Programming Models
Designing Parallel Programs
Amdahl's Law
Embarrassingly Parallel
Summary

What is Parallel Processing?

Simultaneous use of multiple resources to solve a computational problem :
  The problem is broken into discrete parts that can be solved concurrently
  Instructions from each part execute simultaneously on different CPUs

Why use Parallel Processing?


Save time
Solve larger problems :
  Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer
Use of non-local resources :
  Using compute resources on a wide area network, or even the Internet, when local compute resources are scarce
  E.g. : SETI@home, with over 1.3 million users and 3.2 million computers in nearly every country in the world

Why use Parallel Processing?


Limits to serial computing :
  Transmission speeds : limits on how fast data can move through hardware
  Limits to miniaturization
  Heating issues : power consumption is proportional to frequency
  Economic limitations : it is increasingly expensive to make a single processor faster
Current computer architectures increasingly rely on hardware-level parallelism to improve performance :
  Multiple execution units
  Pipelined instructions
  Multi-core processors

Why use Parallel Processing?

Parallelism and Moore's law :
  Moore's law : chip performance effectively doubles every two years due to the addition of more transistors to a chip
  Parallel computation is necessary to take full advantage of the gains allowed by Moore's law

Flynn's Classical Taxonomy

Classification of parallel computers : Flynn's Classical Taxonomy

Single Instruction, Single Data (SISD) :
  A serial (non-parallel) computer
  Single Instruction : only one instruction stream is acted on by the CPU during any one clock cycle
  Single Data : only one data stream is used as input during any one clock cycle

Single Instruction, Multiple Data (SIMD) :
  Single Instruction : all processing units execute the same instruction at any given clock cycle
  Multiple Data : each processing unit can operate on a different data element
  Best suited for problems characterized by a high degree of regularity, such as image processing
  E.g. : GPUs

Flynn's Classical Taxonomy

Multiple Instruction, Single Data (MISD) :
  Multiple Instruction : each processing unit operates on the data independently via separate instruction streams
  Single Data : a single data stream is fed into multiple processing units
  Few actual examples exist

Multiple Instruction, Multiple Data (MIMD) :
  Multiple Instruction : every processor may be executing a different instruction stream
  Multiple Data : every processor may be working with a different data stream
  E.g. : networked parallel computer clusters and "grids", multiprocessor SMP computers, multi-core PCs

Parallel Architectures

Shared Memory :
  All processors can access all memory as a global address space
  Changes in a memory location made by one processor are visible to all other processors
  Shared memory machines can be divided into two main classes based upon memory access times :
    Uniform Memory Access (UMA)
    Non-Uniform Memory Access (NUMA)

Parallel Architectures

Uniform Memory Access (UMA) :
  Commonly represented by Symmetric Multiprocessor (SMP) machines
  Identical processors
  Equal access times to memory

Parallel Architectures

Non-Uniform Memory Access (NUMA) :
  Made by physically linking two or more SMPs
  One SMP can directly access the memory of another
  Not all processors have equal access time to all memories
  Memory access across the link is slower

Parallel Architectures

Distributed Memory :
  Processors have their own local memory
  Changes in a processor's local memory have no effect on the memory of other processors
  Requires message passing
  Explicit programming required

Parallel Architectures

Shared vs Distributed memory :

Shared Memory
  Advantages :
    Data sharing between tasks is fast
    User-friendly programming perspective to memory
  Disadvantages :
    Lack of scalability
    Expense with increase in the number of processors
    Programmer responsible for synchronization

Distributed Memory
  Advantages :
    Memory is scalable with the number of processors
    No overhead for cache coherency
    Cost effectiveness due to networking
  Disadvantages :
    Explicit programming required
    Message passing involves overhead

Parallel Architectures

Hybrid Distributed-Shared Memory :
  Shared memory component : a cache coherent SMP machine
  Distributed memory component : networking of multiple SMP machines

Parallel Programming Models

An abstraction above hardware and memory architectures
Models are NOT specific to a particular type of memory architecture

Shared Memory Model :
  Tasks share a common address space
  Mechanisms such as locks / semaphores are used for synchronization
  Advantage : simplified program development
  Threads can be used (see the sketch below) :
    Each thread has local data, but also shares the entire resources of the main program
    Threads communicate with each other through global memory
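The sketch below is not part of the original slides; it is a minimal illustration of the shared memory model in C using POSIX threads, where all threads update one shared global counter and a mutex (lock) provides the synchronization mentioned above. The thread count and iteration count are arbitrary choices.

/* Minimal shared-memory sketch with POSIX threads (illustrative values). */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                       /* shared data in the global address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);             /* lock: one thread updates the counter at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);        /* threads exchange data only through shared memory */
    printf("counter = %ld\n", counter);        /* always 4 * 100000, because the lock serializes updates */
    return 0;
}

Built with something like cc -pthread, the final count is deterministic precisely because the mutex provides the synchronization that the shared address space itself does not.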

Parallel Programming Models

Implementation of the shared memory model : OpenMP
  Directive based
  The master thread forks a specified number of slave threads and the task is divided among them
  After execution of the parallel task, the threads join back (see the work-sharing sketch below)
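As a sketch of this fork-join pattern (not taken from the slides; the array size and the parallel-for work-sharing directive are illustrative choices):

/* Fork-join sketch: the master thread forks a team, the loop iterations are
   divided among the threads, and the team joins at the end of the region. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N];

int main(void) {
    #pragma omp parallel for                  /* fork: iterations are split among the threads */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;
    /* implicit barrier and join here: every element has been written */
    printf("a[N-1] = %f, threads available = %d\n", a[N - 1], omp_get_max_threads());
    return 0;
}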

Parallel Programming Models

OpenMP : Core elements (figure)

Parallel Programming Models

OpenMP : Example Program

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main (int argc, char *argv[]) {
  int th_id, nthreads;
  #pragma omp parallel private(th_id)
  {
    th_id = omp_get_thread_num();              /* each thread prints its own id */
    printf("Hello World from thread %d\n", th_id);
    #pragma omp barrier                        /* wait until all threads have printed */
    if (th_id == 0) {
      nthreads = omp_get_num_threads();
      printf("There are %d threads\n", nthreads);
    }
  }
  return EXIT_SUCCESS;
}

Parallel Programming Models

Message Passing Model :
  Tasks use their own local memory
  Tasks exchange data by sending and receiving messages
  The user explicitly distributes the data

Parallel Programming Models

Implementation of the message passing model : Message Passing Interface (MPI)
  PORTABILITY : architecture- and hardware-independent code
  Provides well-defined and safe data transfer
  Supports heterogeneous environments (e.g. clusters)
  Most MPI implementations consist of a specific set of routines (i.e., an API) directly callable from C, C++ and Fortran

Parallel Programming Models

Message Passing Interface (MPI) : Concepts
  Communicator and rank : connect groups of processes in the MPI session
  Point-to-point basics : communication between two specific processes, e.g. the MPI_Send and MPI_Recv calls
  Collective basics : communication among all processes in a process group, e.g. the MPI_Bcast and MPI_Reduce calls (see the sketch below)
  Derived data types :
    specify the type of the data that is sent between processes
    predefined MPI data types such as MPI_INT, MPI_CHAR, MPI_DOUBLE
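The program below is a minimal sketch of a collective call, not part of the original slides: every rank contributes one integer and MPI_Reduce sums the contributions onto rank 0.

/* Collective communication sketch: MPI_Reduce sums one value per rank onto rank 0. */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[]) {
  int rank, sum = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  /* every rank calls MPI_Reduce; the result is only defined on the root (rank 0) */
  MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("sum of all ranks = %d\n", sum);
  MPI_Finalize();
  return 0;
}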

Parallel Programming Models

Message Passing Interface (MPI) : Example Program

#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define BUFSIZE 128
#define TAG 0

int main (int argc, char *argv[]) {
  char idstr[32];
  char buff[BUFSIZE];
  int numprocs;
  int myid;
  int i;
  MPI_Status stat;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  if (myid == 0) {
    /* rank 0 sends a greeting to every other rank ... */
    for (i = 1; i < numprocs; i++) {
      sprintf(buff, "Hello %d! ", i);
      MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
    }

Parallel Programming Models

Message Passing Interface (MPI) : Example Program (continued)

    /* ... and then collects and prints the replies in rank order */
    for (i = 1; i < numprocs; i++) {
      MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
      printf("%d: %s\n", myid, buff);
    }
  }
  else {
    /* every other rank receives the greeting, appends its id and replies */
    MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
    sprintf(idstr, "Processor %d ", myid);
    strncat(buff, idstr, BUFSIZE - 1);
    strncat(buff, "reporting for duty\n", BUFSIZE - 1);
    MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}

Designing Parallel Programs

Automatic and Manual Parallelization :
  Manual parallelization : time consuming, complex and error-prone
  Automatic parallelization : done by a parallelizing compiler or preprocessor, in two different ways :
    Fully automatic : the compiler analyzes the source code and identifies opportunities for parallelism
    Programmer directed : using "compiler directives" or flags, the programmer explicitly tells the compiler how to parallelize the code
      E.g. : OpenMP

Designing Parallel Programs

Partitioning :
  Breaking the problem into discrete "chunks" of work that can be distributed to multiple tasks
  Two basic ways to partition :
    Domain decomposition : the data associated with the problem is decomposed, and each task works on a portion of it (see the sketch below)
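A minimal domain decomposition sketch, not from the slides (the array length and the block distribution are illustrative choices): a 1-D array of N elements is split into contiguous blocks, one block per MPI rank.

/* Domain decomposition sketch: each rank owns one contiguous block of a 1-D array. */
#include <stdio.h>
#include <mpi.h>

#define N 1000

int main (int argc, char *argv[]) {
  int rank, nprocs;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  /* block boundaries: the first N % nprocs ranks get one extra element */
  int base  = N / nprocs;
  int rest  = N % nprocs;
  int lo    = rank * base + (rank < rest ? rank : rest);
  int count = base + (rank < rest ? 1 : 0);
  printf("rank %d of %d owns elements [%d, %d)\n", rank, nprocs, lo, lo + count);
  MPI_Finalize();
  return 0;
}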

Designing Parallel Programs

Partitioning :
  Two basic ways to partition :
    Functional decomposition : the focus is on the computation that is to be performed rather than on the data manipulated by the computation

Designing Parallel Programs

Load Balancing :
  The practice of distributing work among tasks so that all tasks are kept busy all of the time
  Two types :
    Static load balancing : a fixed amount of work is assigned to each processing site a priori
    Dynamic load balancing, of which there are two kinds (see the sketch below) :
      Task-oriented : when one processing site finishes its task, it is assigned another task
      Data-oriented : when a processing site finishes its task before other sites, the site with the most work gives the idle site some of its data to process
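The sketch below shows one possible realization of task-oriented dynamic load balancing; the use of OpenMP's dynamic loop schedule and the artificial uneven work function are illustrative choices, not something named on the slide.

/* Dynamic load balancing sketch: iterations have very uneven cost, so they are
   handed out in small chunks to whichever thread becomes idle first. */
#include <omp.h>
#include <stdio.h>

static double work(int i) {              /* artificial work whose cost grows with i */
    double s = 0.0;
    for (int k = 0; k < i * 1000; k++)
        s += 1.0 / (k + 1.0);
    return s;
}

int main(void) {
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int i = 0; i < 256; i++)
        total += work(i);                /* an idle thread grabs the next chunk of 4 iterations */
    printf("total = %f\n", total);
    return 0;
}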

Designing Parallel Programs

amounts of computation between


communication events
Facilitates load balancing
High communication overhead

Coarse-grain Parallelism : significant work

done between communications


Most efficient granularity depends on the
algorithm and the hardware environment
used

Parallel Processing : Fundamentals

Granularity :
Qualitative measure of the ratio of
computation to communication
Fine-grain Parallelism : relatively small

28

Amdahl's Law

Expected speedup of a parallelized implementation of an algorithm relative to the serial algorithm
Eq. (a worked example follows below) :
  Speedup = 1 / ((1 - P) + P / N)
  P : portion of the program that can be made parallel
  N : number of processors
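As a quick worked example (the numbers are illustrative, not taken from the slide): with P = 0.9 and N = 10,

  Speedup = 1 / ((1 - 0.9) + 0.9 / 10) = 1 / 0.19 ≈ 5.3

and even as N grows without bound the speedup only approaches 1 / (1 - P) = 10, so the serial fraction ultimately limits the gain.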

Embarrassingly Parallel

Embarrassingly parallel problem : little or no effort is required to separate the problem into a number of parallel tasks (see the sketch after the examples)
No dependency (or communication) between the parallel tasks
Examples :
  Distributed relational database queries using distributed set processing
  Rendering of computer graphics
  Event simulation and reconstruction in particle physics
  Brute-force searches in cryptography
  Ensemble calculations of numerical weather prediction
  The tree growth step of the random forest machine learning technique
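As a sketch of an embarrassingly parallel computation (a Monte Carlo estimate of pi is an illustrative choice, not one of the examples listed above): every sample is independent, so the threads only coordinate in the final reduction.

/* Embarrassingly parallel sketch: Monte Carlo estimate of pi with OpenMP.
   Every sample is independent; only the final reduction combines results. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long samples = 10000000;
    long hits = 0;
    #pragma omp parallel reduction(+:hits)
    {
        unsigned int seed = 1234u + omp_get_thread_num();   /* independent RNG state per thread */
        #pragma omp for
        for (long i = 0; i < samples; i++) {
            double x = rand_r(&seed) / (double)RAND_MAX;
            double y = rand_r(&seed) / (double)RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;
        }
    }
    printf("pi is approximately %f\n", 4.0 * (double)hits / samples);
    return 0;
}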

Applications of parallel processing (figure)

Summary

Parallel processing : simultaneous use of multiple resources to solve a computational problem
Need for parallel processing : limits to serial computing and Moore's law
Flynn's Classical Taxonomy : SISD, SIMD, MISD, MIMD
Parallel architectures : shared memory, distributed memory and hybrid
Parallel programming models : OpenMP, MPI
Designing parallel programs : automatic parallelization, partitioning, load balancing and granularity
Embarrassingly parallel problems : very easy to solve by parallel processing

References

Introduction to Parallel Computing : https://computing.llnl.gov/tutorials/parallel_comp/#Hybrid
Introduction to Scientific High Performance Computing : Reinhold Bader (LRZ), Georg Hager (RRZE), Heinz Bast (Intel)
Elementary Parallel Programming With Examples : Reinhold Bader (LRZ), Georg Hager (RRZE)
Programming Shared Memory Systems with OpenMP : Reinhold Bader (LRZ), Georg Hager (RRZE)
http://en.wikipedia.org

THANK YOU !