
PARALLEL PROCESSING : FUNDAMENTALS
Khushdeep Singh
Department of Computer Science and Engineering
IIT Kanpur
TUTOR : Prof. Dr. U. Rüde, Florian Schornbaum

OUTLINE

What is Parallel Processing?
Why use Parallel Processing?
Flynn's Classical Taxonomy
Parallel Computer Memory Architectures :
  1. Shared Memory
  2. Distributed Memory
  3. Hybrid Distributed-Shared Memory
Parallel Programming Models
Designing Parallel Programs
Amdahl's Law
Embarrassingly Parallel
Summary

What is Parallel Processing?

Simultaneous use of multiple resources to solve a computational problem :
  The problem is broken into discrete parts that can be solved concurrently
  Instructions from each part execute simultaneously on different CPUs

Why use Parallel Processing?


Save time
Solve larger problems :
  Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer
Use of non-local resources :
  Using compute resources on a wide area network, or even the Internet, when local compute resources are scarce
  E.g. : SETI@home, with over 1.3 million users and 3.2 million computers in nearly every country in the world

Why use Parallel Processing?


Limits to serial computing :
  Transmission speeds : limits on how fast data can move through hardware
  Limits to miniaturization
  Heating issues : power consumption is proportional to frequency
  Economic limitations : it is increasingly expensive to make a single processor faster
Current computer architectures increasingly rely on hardware-level parallelism to improve performance :
  Multiple execution units
  Pipelined instructions
  Multi-core processors

Why use Parallel Processing?

Parallelism and Moore's law :
  Moore's law : chip performance effectively doubles every two years due to the addition of more transistors to a chip
  Parallel computation is necessary to take full advantage of the gains allowed by Moore's law

Flynn's Classical Taxonomy

Classification of parallel computers : Flynn's Classical Taxonomy

Single Instruction, Single Data (SISD) :
  A serial (non-parallel) computer
  Single Instruction : only one instruction stream is acted on by the CPU during any one clock cycle
  Single Data : only one data stream is used as input during any one clock cycle

Single Instruction, Multiple Data (SIMD) :
  Single Instruction : all processing units execute the same instruction at any given clock cycle
  Multiple Data : each processing unit can operate on a different data element
  Best suited for problems characterized by a high degree of regularity, such as image processing
  E.g. : GPUs

Flynn's Classical Taxonomy

Multiple Instruction, Single Data (MISD) :
  Multiple Instruction : each processing unit operates on the data independently via separate instruction streams
  Single Data : a single data stream is fed into multiple processing units
  Few actual examples exist

Multiple Instruction, Multiple Data (MIMD) :
  Multiple Instruction : every processor may be executing a different instruction stream
  Multiple Data : every processor may be working with a different data stream
  E.g. : networked parallel computer clusters and "grids", multiprocessor SMP computers, multi-core PCs

Parallel Architectures

Shared Memory :
  All processors can access all memory as a global address space
  Changes in a memory location made by one processor are visible to all other processors
  Shared memory machines can be divided into two main classes based upon memory access times :
    Uniform Memory Access (UMA)
    Non-Uniform Memory Access (NUMA)

Parallel Architectures

Uniform Memory Access (UMA) :
  Commonly represented by Symmetric Multiprocessor (SMP) machines
  Identical processors
  Equal access times to memory

Parallel Architectures

Non-Uniform Memory Access (NUMA) :
  Made by physically linking two or more SMPs
  One SMP can directly access the memory of another
  Not all processors have equal access time to all memories
  Memory access across the link is slower

Parallel Architectures

Distributed Memory :
  Processors have their own local memory
  Changes in a processor's local memory have no effect on the memory of other processors
  Requires message passing
  Explicit programming required

Parallel Architectures

Shared vs Distributed memory :

Shared Memory
  Advantages :
    Data sharing between tasks is fast
    User-friendly programming perspective to memory
  Disadvantages :
    Lack of scalability
    Expense with increase in the number of processors
    Programmer responsible for synchronization

Distributed Memory
  Advantages :
    Memory is scalable with the number of processors
    No overhead for cache coherency
    Cost effectiveness due to networking
  Disadvantages :
    Explicit programming required
    Message passing involves overhead

Parallel Architectures

Hybrid Distributed-Shared Memory :
  Shared memory component : a cache coherent SMP machine
  Distributed memory component : networking of multiple SMP machines

Parallel Programming Models

An abstraction above hardware and memory architectures
Models are NOT specific to a particular type of memory architecture

Shared Memory Model :
  Tasks share a common address space
  Mechanisms such as locks / semaphores are used for synchronization
  Advantage : simplified program development
  Threads can be used (see the sketch below) :
    Each thread has local data, but also shares the entire resources of the main program
    Threads communicate with each other through global memory
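The sketch below is not part of the original slides; it is a minimal illustration of the shared memory model in C using POSIX threads, where all threads update one shared global counter and a mutex (lock) provides the synchronization mentioned above. The thread count and iteration count are arbitrary choices.

/* Minimal shared-memory sketch with POSIX threads (illustrative values). */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                       /* shared data in the global address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);             /* lock: one thread updates the counter at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);        /* threads exchange data only through shared memory */
    printf("counter = %ld\n", counter);        /* always 4 * 100000, because the lock serializes updates */
    return 0;
}

Built with something like cc -pthread, the final count is deterministic precisely because the mutex provides the synchronization that the shared address space itself does not.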

Parallel Programming Models

Implementation of the shared memory model : OpenMP
  Directive based
  The master thread forks a specified number of slave threads and the task is divided among them
  After execution of the parallel task, the threads join back (see the work-sharing sketch below)
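As a sketch of this fork-join pattern (not taken from the slides; the array size and the parallel-for work-sharing directive are illustrative choices):

/* Fork-join sketch: the master thread forks a team, the loop iterations are
   divided among the threads, and the team joins at the end of the region. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N];

int main(void) {
    #pragma omp parallel for                  /* fork: iterations are split among the threads */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;
    /* implicit barrier and join here: every element has been written */
    printf("a[N-1] = %f, threads available = %d\n", a[N - 1], omp_get_max_threads());
    return 0;
}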

Parallel Programming Models

OpenMP : Core elements (figure)

Parallel Programming Models

OpenMP : Example Program

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main (int argc, char *argv[]) {
  int th_id, nthreads;
  #pragma omp parallel private(th_id)
  {
    th_id = omp_get_thread_num();              /* each thread prints its own id */
    printf("Hello World from thread %d\n", th_id);
    #pragma omp barrier                        /* wait until all threads have printed */
    if (th_id == 0) {
      nthreads = omp_get_num_threads();
      printf("There are %d threads\n", nthreads);
    }
  }
  return EXIT_SUCCESS;
}

Parallel Programming Models

Message Passing Model :
  Tasks use their own local memory
  Tasks exchange data by sending and receiving messages
  The user explicitly distributes the data

Parallel Programming Models

Implementation of the message passing model : Message Passing Interface (MPI)
  PORTABILITY : architecture- and hardware-independent code
  Provides well-defined and safe data transfer
  Supports heterogeneous environments (e.g. clusters)
  Most MPI implementations consist of a specific set of routines (i.e., an API) directly callable from C, C++ and Fortran

Parallel Programming Models

Message Passing Interface (MPI) : Concepts
  Communicator and rank : connect groups of processes in the MPI session
  Point-to-point basics : communication between two specific processes, e.g. the MPI_Send and MPI_Recv calls
  Collective basics : communication among all processes in a process group, e.g. the MPI_Bcast and MPI_Reduce calls (see the sketch below)
  Derived data types :
    specify the type of the data that is sent between processes
    predefined MPI data types such as MPI_INT, MPI_CHAR, MPI_DOUBLE
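The program below is a minimal sketch of a collective call, not part of the original slides: every rank contributes one integer and MPI_Reduce sums the contributions onto rank 0.

/* Collective communication sketch: MPI_Reduce sums one value per rank onto rank 0. */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[]) {
  int rank, sum = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  /* every rank calls MPI_Reduce; the result is only defined on the root (rank 0) */
  MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("sum of all ranks = %d\n", sum);
  MPI_Finalize();
  return 0;
}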

Parallel Programming Models

Message Passing Interface (MPI) : Example Program

#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define BUFSIZE 128
#define TAG 0

int main (int argc, char *argv[]) {
  char idstr[32];
  char buff[BUFSIZE];
  int numprocs;
  int myid;
  int i;
  MPI_Status stat;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  if (myid == 0) {
    /* rank 0 sends a greeting to every other rank ... */
    for (i = 1; i < numprocs; i++) {
      sprintf(buff, "Hello %d! ", i);
      MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
    }

Parallel Programming Models

Message Passing Interface (MPI) : Example Program (continued)

    /* ... and then collects and prints the replies in rank order */
    for (i = 1; i < numprocs; i++) {
      MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
      printf("%d: %s\n", myid, buff);
    }
  }
  else {
    /* every other rank receives the greeting, appends its id and replies */
    MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
    sprintf(idstr, "Processor %d ", myid);
    strncat(buff, idstr, BUFSIZE - 1);
    strncat(buff, "reporting for duty\n", BUFSIZE - 1);
    MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}

Designing Parallel Programs

Automatic and Manual Parallelization :
  Manual parallelization : time consuming, complex and error-prone
  Automatic parallelization : done by a parallelizing compiler or preprocessor, in two different ways :
    Fully automatic : the compiler analyzes the source code and identifies opportunities for parallelism
    Programmer directed : using "compiler directives" or flags, the programmer explicitly tells the compiler how to parallelize the code
      E.g. : OpenMP

Designing Parallel Programs

Partitioning :
  Breaking the problem into discrete "chunks" of work that can be distributed to multiple tasks
  Two basic ways to partition :
    Domain decomposition : the data associated with the problem is decomposed, and each task works on a portion of it (see the sketch below)
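A minimal domain decomposition sketch, not from the slides (the array length and the block distribution are illustrative choices): a 1-D array of N elements is split into contiguous blocks, one block per MPI rank.

/* Domain decomposition sketch: each rank owns one contiguous block of a 1-D array. */
#include <stdio.h>
#include <mpi.h>

#define N 1000

int main (int argc, char *argv[]) {
  int rank, nprocs;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  /* block boundaries: the first N % nprocs ranks get one extra element */
  int base  = N / nprocs;
  int rest  = N % nprocs;
  int lo    = rank * base + (rank < rest ? rank : rest);
  int count = base + (rank < rest ? 1 : 0);
  printf("rank %d of %d owns elements [%d, %d)\n", rank, nprocs, lo, lo + count);
  MPI_Finalize();
  return 0;
}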

Designing Parallel Programs

Partitioning :
  Two basic ways to partition :
    Functional decomposition : the focus is on the computation that is to be performed rather than on the data manipulated by the computation

Designing Parallel Programs

Load Balancing :
  The practice of distributing work among tasks so that all tasks are kept busy all of the time
  Two types :
    Static load balancing : a fixed amount of work is assigned to each processing site a priori
    Dynamic load balancing, of which there are two kinds (see the sketch below) :
      Task-oriented : when one processing site finishes its task, it is assigned another task
      Data-oriented : when a processing site finishes its task before other sites, the site with the most work gives the idle site some of its data to process
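The sketch below shows one possible realization of task-oriented dynamic load balancing; the use of OpenMP's dynamic loop schedule and the artificial uneven work function are illustrative choices, not something named on the slide.

/* Dynamic load balancing sketch: iterations have very uneven cost, so they are
   handed out in small chunks to whichever thread becomes idle first. */
#include <omp.h>
#include <stdio.h>

static double work(int i) {              /* artificial work whose cost grows with i */
    double s = 0.0;
    for (int k = 0; k < i * 1000; k++)
        s += 1.0 / (k + 1.0);
    return s;
}

int main(void) {
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int i = 0; i < 256; i++)
        total += work(i);                /* an idle thread grabs the next chunk of 4 iterations */
    printf("total = %f\n", total);
    return 0;
}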

Designing Parallel Programs

amounts of computation between


communication events
Facilitates load balancing
High communication overhead

Coarse-grain Parallelism : significant work

done between communications


Most efficient granularity depends on the
algorithm and the hardware environment
used

Parallel Processing : Fundamentals

Granularity :
Qualitative measure of the ratio of
computation to communication
Fine-grain Parallelism : relatively small

28

Amdahl's Law

Expected speedup of a parallelized implementation of an algorithm relative to the serial algorithm
Eq. (a worked example follows below) :
  Speedup = 1 / ((1 - P) + P / N)
  P : portion of the program that can be made parallel
  N : number of processors
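As a quick worked example (the numbers are illustrative, not taken from the slide): with P = 0.9 and N = 10,

  Speedup = 1 / ((1 - 0.9) + 0.9 / 10) = 1 / 0.19 ≈ 5.3

and even as N grows without bound the speedup only approaches 1 / (1 - P) = 10, so the serial fraction ultimately limits the gain.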

Embarrassingly Parallel

Embarrassingly parallel problem : little or no effort is required to separate the problem into a number of parallel tasks (see the sketch after the examples)
No dependency (or communication) between the parallel tasks
Examples :
  Distributed relational database queries using distributed set processing
  Rendering of computer graphics
  Event simulation and reconstruction in particle physics
  Brute-force searches in cryptography
  Ensemble calculations of numerical weather prediction
  The tree growth step of the random forest machine learning technique
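As a sketch of an embarrassingly parallel computation (a Monte Carlo estimate of pi is an illustrative choice, not one of the examples listed above): every sample is independent, so the threads only coordinate in the final reduction.

/* Embarrassingly parallel sketch: Monte Carlo estimate of pi with OpenMP.
   Every sample is independent; only the final reduction combines results. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long samples = 10000000;
    long hits = 0;
    #pragma omp parallel reduction(+:hits)
    {
        unsigned int seed = 1234u + omp_get_thread_num();   /* independent RNG state per thread */
        #pragma omp for
        for (long i = 0; i < samples; i++) {
            double x = rand_r(&seed) / (double)RAND_MAX;
            double y = rand_r(&seed) / (double)RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;
        }
    }
    printf("pi is approximately %f\n", 4.0 * (double)hits / samples);
    return 0;
}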

Applications of parallel processing (figure)

Summary

Parallel processing : simultaneous use of multiple resources to solve a computational problem
Need for parallel processing : limits to serial computing and Moore's law
Flynn's Classical Taxonomy : SISD, SIMD, MISD, MIMD
Parallel architectures : shared memory, distributed memory and hybrid
Parallel programming models : OpenMP, MPI
Designing parallel programs : automatic parallelization, partitioning, load balancing and granularity
Embarrassingly parallel problems : very easy to solve by parallel processing

References

Introduction to Parallel Computing : https://computing.llnl.gov/tutorials/parallel_comp/#Hybrid
Introduction to Scientific High Performance Computing : Reinhold Bader (LRZ), Georg Hager (RRZE), Heinz Bast (Intel)
Elementary Parallel Programming With Examples : Reinhold Bader (LRZ), Georg Hager (RRZE)
Programming Shared Memory Systems with OpenMP : Reinhold Bader (LRZ), Georg Hager (RRZE)
http://en.wikipedia.org

THANK YOU !