
Parallel Programming and MPI
A course for IIT-M. September 2008
R Badrinath, STSD Bangalore
(ramamurthy.badrinath@hp.com)

© 2006 Hewlett-Packard Development Company, L.P.


The information contained herein is subject to change without notice
Context and Background
• IIT-Madras has recently added a good deal of compute power.
• Why –
−Further R&D in sciences, engineering
−Provide computing services to the region
−Create new opportunities in education and skills
−…
• Why this course –
−Update skills to program modern cluster computers
• Length – 2 theory and 2 practice sessions, 4 hrs each

Audience Check

Contents
Instead we
 •Understand Issues
 •Understand Concepts
 •Learn enough to pick up from the manual
 •Go by motivating examples
 •Try out some of the examples

MPI calls covered:
1. MPI_Init
2. MPI_Comm_rank
3. MPI_Comm_size
4. MPI_Send
5. MPI_Recv
6. MPI_Bcast
7. MPI_Create_comm
8. MPI_Sendrecv
9. MPI_Scatter
10. MPI_Gather
…

Outline
• Sequential vs Parallel programming
• Shared vs Distributed Memory
• Parallel work breakdown models
• Communication vs Computation
• MPI Examples
• MPI Concepts
• The role of IO

Sequential vs Parallel
• We are used to sequential programming – C, Java, C++, etc. E.g., Bubble Sort, Binary Search, Strassen Multiplication, FFT, BLAST, …
• Main idea – Specify the steps in perfect order
• Reality – We are used to parallelism a lot more than
we think – as a concept; not for programming
• Methodology – Launch a set of tasks; communicate to
make progress. E.g., Sorting 500 answer papers by –
making 5 equal piles, have them sorted by 5 people,
merge them together.

Shared vs Distributed Memory
Programming
• Shared Memory – All tasks access the same memory, hence the same data (e.g., pthreads)
• Distributed Memory – All memory is local. Data
sharing is by explicitly transporting data from one
task to another (send-receive pairs in MPI, e.g.)

[Figure: several tasks, each with its own program and memory, connected by a communications channel.]

• HW – Programming model relationship – Tasks vs CPUs
• SMPs vs Clusters
Designing Parallel Programs

Simple Parallel Program – sorting
numbers in a large array A
• Notionally divide A into 5 pieces [0..99; 100..199; 200..299; 300..399; 400..499].
• Each part is sorted by an independent sequential algorithm and left within its region.
• The resultant parts are merged by simply reordering among adjacent parts.
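As a concrete (sequential) sketch of this idea, assuming qsort for the piece-wise sorting and a simple repeated two-way merge for the reordering step (the array size, piece count and helper names are ours, not from the slides):

/* Sequential sketch: sort A in 5 pieces with qsort, then merge the sorted pieces. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N  500
#define NP 5                         /* number of pieces */

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int A[N], tmp[N], out[N];
    for (int i = 0; i < N; i++) A[i] = rand() % 1000;

    /* 1. Sort each piece independently: the step the 5 "people" could do in parallel. */
    for (int p = 0; p < NP; p++)
        qsort(A + p * (N / NP), N / NP, sizeof(int), cmp_int);

    /* 2. Merge the sorted pieces into tmp, one adjacent piece at a time. */
    int done = N / NP;                        /* length of the merged prefix */
    memcpy(tmp, A, done * sizeof(int));
    for (int p = 1; p < NP; p++) {
        int i = 0, j = p * (N / NP), end = (p + 1) * (N / NP), k = 0;
        while (i < done && j < end) out[k++] = (tmp[i] <= A[j]) ? tmp[i++] : A[j++];
        while (i < done) out[k++] = tmp[i++];
        while (j < end)  out[k++] = A[j++];
        memcpy(tmp, out, k * sizeof(int));
        done = k;
    }
    printf("smallest = %d, largest = %d\n", tmp[0], tmp[N - 1]);
    return 0;
}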

What is different – Think about…
• How many people doing the work. (Degree of
Parallelism)
• What is needed to begin the work. (Initialization)
• Who does what. (Work distribution)
• Access to work part. (Data/IO access)
• Whether they need info from each other to finish
their own job. (Communication)
• When are they all done. (Synchronization)
• What needs to be done to collate the result.

Work Break-down
• Parallel algorithm
• Prefer simple intuitive breakdowns
• Usually highly optimized sequential
algorithms are not easily parallelizable
• Breaking work often involves some pre- or
post- processing (much like divide and
conquer)
• Fine vs large grain parallelism and
relationship to communication

Digression – Let’s get a simple MPI Program to
work
#include <mpi.h>
#include <stdio.h>

int main()
{
int total_size, my_rank;

MPI_Init(NULL,NULL);

MPI_Comm_size(MPI_COMM_WORLD, &total_size);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

printf("\n Total number of programs = %d, out of


which rank of this process is %d\n", total_size,
my_rank);
MPI_Finalize();
return 0;
}
Getting it to work
• Compile it:
− mpicc -o simple simple.c    # If you want HP-MPI, set your path to include /opt/hpmpi/bin
• Run it
− This depends a bit on the system
− mpirun -np 2 simple
− qsub -l ncpus=2 -o simple.out /opt/hpmpi/bin/mpirun <your program location>/simple
− [Fun: qsub -l ncpus=2 -I hostname ]

• Results are in the output file.


• What is mpirun ?
• What does qsub have to do with MPI?... More about qsub in a
separate talk.

What goes on
• Same program is run at the same time on 2
different CPUs
• Each is slightly different in that each returns
different values for some simple calls like
MPI_Comm_rank.
• This gives each instance its identity
• We can make different instances run different
pieces of code based on this identity difference
• Typically it is an SPMD model of computation
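A quick hedged illustration of the SPMD idea, branching on the rank (the branch and the messages are ours, not from the slides):

/* SPMD sketch: every task runs this same program and branches on its rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("I am the coordinator of %d tasks\n", size);
    else
        printf("I am worker %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}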

Continuing work breakdown…
Simple Example: Find shortest
distances
PROBLEM: Find shortest path distances.

[Figure: a small weighted directed graph on nodes 0..4; its distance matrix is shown below.]

Let nodes be numbered 0,1,…,n-1.
Let us put all of this in a matrix; A[i][j] is the distance from i to j:

   0   2   1  ..   6
   7   0  ..  ..  ..
   1   5   0   2   3
  ..  ..   2   0   2
  ..  ..  ..  ..   0

Floyd’s (sequential) algorithm
For (k=0; k<n; k++)
For (i=0; i<n; i++)
for (j=0; j<n; j++)
a[i][j]=min( a[i][j] , a[i,k]+a[k][j] );

Observation:
For a fixed k,
Computing i-th row needs i-th row and k-th row

Parallelizing Floyd
• Actually we just need n² tasks, with each task iterating n times (once for each value of k).
• After each iteration we need to make sure everyone sees the matrix.
• 'Ideal' for shared-memory programming
• What if we have less than n² tasks?... Say p<n.
• Need to divide the work among the p tasks.
• We can simply divide up the rows.
Dividing the work
• Each task gets [n/p] rows, with the last
possibly getting a little more.
[Figure: the rows of A split into blocks of about n/p rows each; task T0 owns the first block, and task Tq owns the block starting at row q x [n/p]. The i-th row and the k-th row generally live in different blocks. Remember the observation.]
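The pseudocode on the next slide uses helpers such as GET_BLOCK_OWNER, LOW_END_OF_MY_BLOCK and GET_MY_BLOCK_SIZE. One plausible definition for this block distribution (our guess at their intent, not code from the course) is:

/* Hypothetical helpers for a block distribution of n rows over p tasks;
   task id owns rows starting at id*(n/p), the last task takes the remainder. */
#define BLOCK                    (n / p)
#define GET_BLOCK_OWNER(k)       ( (k) / BLOCK < p - 1 ? (k) / BLOCK : p - 1 )
#define LOW_END_OF_MY_BLOCK(id)  ( (id) * BLOCK )
#define GET_MY_BLOCK_SIZE(id)    ( (id) == p - 1 ? n - (id) * BLOCK : BLOCK )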

The MPI Model…
-All nodes run the same code!! p replica tasks!! Distributed Memory Model
-Sometimes they need to do different things

Note that each node calls its own matrix by the same name a[ ][ ] but has only [n/p] rows.

/* "id" is TASK NUMBER, each node has only the part of A
   that it owns. This is approximate code. */
for (k=0; k<n; k++) {
    current_owner_task = GET_BLOCK_OWNER(k);
    if (id == current_owner_task) {
        k_here = k - LOW_END_OF_MY_BLOCK(id);
        for (j=0; j<n; j++)
            rowk[j] = a[k_here][j];
    }
    /* rowk is broadcast by the owner and received by the others..
       The MPI code will come here later */
    for (i=0; i<GET_MY_BLOCK_SIZE(id); i++)
        for (j=0; j<n; j++)
            a[i][j] = min(a[i][j], a[i][k] + rowk[j]);
}
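The broadcast promised in the comment can be written with MPI_Bcast (introduced on the later slides). A hedged sketch of the complete k loop, with all declarations and the block helpers assumed from the surrounding slides:

/* Sketch: the same loop with the promised MPI_Bcast filled in. */
for (k = 0; k < n; k++) {
    current_owner_task = GET_BLOCK_OWNER(k);
    if (id == current_owner_task) {
        k_here = k - LOW_END_OF_MY_BLOCK(id);
        for (j = 0; j < n; j++)
            rowk[j] = a[k_here][j];
    }
    /* the owner sends row k; every other task receives it into rowk[] */
    MPI_Bcast(rowk, n, MPI_INT, current_owner_task, MPI_COMM_WORLD);

    for (i = 0; i < GET_MY_BLOCK_SIZE(id); i++)
        for (j = 0; j < n; j++)
            if (a[i][k] + rowk[j] < a[i][j])
                a[i][j] = a[i][k] + rowk[j];
}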
The MPI model
• Recall MPI tasks are typically created when the
jobs are launched – not inside the MPI
program (no forking).
−mpirun usually creates the task set
−mpirun –np 2 a.out <args to a.out>
−a.out is run on all nodes and a communication
channel is setup between them
• Functions allow for tasks to find out
−Size of the task group
−Ones own position within the group

MPI Notions [ Taking from the
example ]
• Communicator – A group of tasks in a program
• Rank – Each task’s ID in the group
−MPI_Comm_rank() … /* use this to set “id” */
• Size – Of the group
−MPI_Comm_size() … /* use to set “p” */
• Notion of send/receive/broadcast…
−MPI_Bcast() … /* use to broadcast rowk[] */

• For actual syntax use a good MPI book or manual


• Online resource: http://www-unix.mcs.anl.gov/mpi/www/

MPI Prologue to our Floyd example
int a[MAX][MAX];
int n=20;          /* real size of the matrix, can be read in */
int id, p;

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);
.
.   /* This is where all the real work happens */
.
MPI_Finalize();    /* Epilogue */

This is the time to try out several
simple MPI programs using the
few functions we have seen.
- use mpicc
- use mpirun

Visualizing the execution

[Figure: the job is launched; the scheduler ensures one task per CPU; the tasks run on CPUs, and multiple tasks/CPUs may be on the same node.]

•MPI_Init, MPI_Comm_rank, MPI_Comm_size etc…
•Other initializations, like reading in the array
•For initial values of k, task with rank 0 broadcasts row k, others receive
•For each value of k they do their computation with the correct row
•Loop above for all values of k
•Task 0 receives all blocks of the final array and prints them out
•MPI_Finalize
Communication vs Computation
• Often communication is needed between iterations to
complete the work.
• Often, the more tasks there are, the more communication there can be.
 −In Floyd, a bigger "p" means that "rowk" will be sent to a larger number of tasks.
 −If each iteration depends on more data, it can get very busy.
• This may mean network contention; i.e., delays.
• Try to count the number of "a"s in a string. Time vs p.
• This is why for a fixed problem size increasing
number of CPUs does not continually increase
performance
• This needs experimentation – problem specific

Communication primitives
• MPI_Send(sendbuffer, senddatalength, datatype, destination, tag, communicator);
• MPI_Send("Hello", strlen("Hello"), MPI_CHAR, 2, 100, MPI_COMM_WORLD);
• MPI_Recv(recvbuffer, recvdatalength, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
• Send-Recv happen in pairs.
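As a minimal illustration of such a pair, a hedged sketch in which rank 0 sends a short string to rank 1 (the message text and tag are just for illustration):

/* Minimal send/receive pair: rank 0 sends a string to rank 1. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char buf[32];
    int rank;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        strcpy(buf, "Hello");
        MPI_Send(buf, strlen(buf) + 1, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 100, MPI_COMM_WORLD, &status);
        printf("Rank 1 received: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}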

Collectives
• Broadcast is one-to-all communication
• Both receivers and sender call the same function
• All MUST call it. All end up with SAME result.
• MPI_Bcast (buffer, count, type, root, comm);
• Examples
−MPI_Bcast(&k, 1, MPI_INT, 0, MPI_Comm_World);
−Task 0 sends its integer k and all others receive it.
−MPI_Bcast(rowk,n,MPI_INT,current_owner_task,MPI_COMM_
WORLD);
−Current_owner_task sends rowk to all others.

Try out a simple MPI program with send-recvs and broadcasts.

Try out Floyd’s algorithm.


What if you have to read a file to
initialize Floyd’s algorithm?

A bit more on Broadcast
Ranks:                 0    1    2
x before:              0    1    2
Each rank calls:  MPI_Bcast(&x, 1, .., 0, ..);
x after:               0    0    0
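The same picture as a runnable hedged sketch (our example, assuming the program is started with a few tasks):

/* Every rank calls MPI_Bcast; afterwards all ranks hold rank 0's value. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, x;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    x = rank;                                    /* x starts out different on each rank */
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Rank %d now has x = %d\n", rank, x); /* prints 0 everywhere */

    MPI_Finalize();
    return 0;
}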

Other useful collectives
• MPI_Reduce(&values,&results,count,type,o
perator, root,comm);
• MPI_Reduce(&x, &res, 1, MPI_INT, MPI_SUM,
9, MPI_COMM_WORLD);

• Task number 9 gets in the variable res the


sum of whatever was in x in all of the tasks
(including itself).
• Must be called by ALL tasks.
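As a hedged sketch that also revisits the earlier "count the a's" exercise: each task counts in its own slice of a string and MPI_Reduce sums the counts at rank 0 (the input string and the slicing are ours):

/* Each rank counts 'a' in its share of a string; MPI_Reduce sums the counts at rank 0. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    const char *text = "a banana has a lot of a's";   /* illustrative input */
    int rank, p, mycount = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n = strlen(text);
    int lo = rank * n / p, hi = (rank + 1) * n / p;   /* this rank's slice */
    for (int i = lo; i < hi; i++)
        if (text[i] == 'a') mycount++;

    MPI_Reduce(&mycount, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("Total number of a's = %d\n", total);

    MPI_Finalize();
    return 0;
}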

Scattering as opposed to
broadcasting
• MPI_Scatterv(sndbuf, sndcount[], send_disp[], type, recvbuf, recvcount, recvtype, root, comm);
• All nodes MUST call it.

[Figure: the send buffer on Rank0 is cut into pieces that are delivered to Rank0, Rank1, Rank2 and Rank3.]
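A hedged sketch of the simpler fixed-size MPI_Scatter (MPI_Scatterv adds the per-rank counts and displacements shown above):

/* Rank 0 scatters one integer to every rank, itself included. */
#include <mpi.h>
#include <stdio.h>

#define MAXP 64

int main(int argc, char *argv[])
{
    int rank, p, mine;
    int sendbuf[MAXP];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (rank == 0)
        for (int i = 0; i < p; i++) sendbuf[i] = 100 + i;   /* one value per rank */

    /* every rank must make this call; each receives its own piece */
    MPI_Scatter(sendbuf, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Rank %d got %d\n", rank, mine);

    MPI_Finalize();
    return 0;
}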

Common Communication pitfalls!!
• Make sure that
communication primitives are
called by the right number of
tasks.
• Make sure they are called in
the right sequence.
• Make sure that you use the
proper tags.
• If not, you can easily get into
deadlock (“My program seems to
be hung”)
More on work breakdown
• Finding the right work breakdown can be challenging
• Sometimes a dynamic work breakdown is good
• Master (usually task 0) decides who will do what and
collects the results.
• E.g.,
you have a huge number of 5x5 matrices to multiply
(chained matrix multiplication).
• E.g., Search for a substring in a huge collection of strings.

Master-slave dynamic work
assignment
[Figure: task 0 is the master; tasks 1, 2, … are the slaves, each exchanging work items and results with the master.]

Master slave example – Reverse
strings
Slave(){   /* work is a char array of size MAX; stat is an MPI_Status */
  do{
    MPI_Recv(work, MAX, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);   /* always from the master */
    n = strlen(work);
    if (n == 0) break;   /* zero-length string: detecting the end */

    reverse(work);

    MPI_Send(work, n+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
  } while (1);

  MPI_Finalize();
}

Master slave example – Reverse
strings
Master(){   /* rank 0 task; work is a char*, res is a char array of size MAX */
  initialize_work_items();
  for(i=1;i<np;i++){   /* Initial work distribution */
    work = next_work_item();
    n = strlen(work)+1;
    MPI_Send(work, n, MPI_CHAR, i, 0, MPI_COMM_WORLD);
  }
  unfinished_work = np-1;   /* one slave per task other than the master */
  while (unfinished_work != 0) {
    MPI_Recv(res, MAX, MPI_CHAR, MPI_ANY_SOURCE, 0,
             MPI_COMM_WORLD, &status);
    process(res);
    work = next_work_item();
    if (work == NULL) {
      MPI_Send("", 1, MPI_CHAR, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
      unfinished_work--;   /* that slave sees the empty string and stops */
    } else {
      n = strlen(work)+1;
      MPI_Send(work, n, MPI_CHAR, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
    }
  }
}
Master slave example
main(){
  ...   /* MPI_Init etc. */
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  if (id == 0)
    Master();
  else
    Slave();
  ...
}

Matrix Multiply and Communication
Patterns

Block Distribution of Matrices
• Matrix Multiply:
  −Cij = Σ (Aik * Bkj)
• BMR Algorithm:
• Each task owns a block – its own part of A, B and C
• Each is a smaller block – a submatrix
• The old formula holds for blocks!
• Example:
  C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
Block Distribution of Matrices
• Matrix Multiply:
  −Cij = Σ (Aik * Bkj)
• BMR Algorithm:
• Example: C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
• One BMR step:
  −A22 is row broadcast
  −A22*B21 is added into C21
  −B_1 is rolled up one slot
  −Our task now has B31
• Now repeat the above block, except the item to broadcast is A23
• Each is a smaller block – a submatrix
Attempt doing this with just Send-
Recv and Broadcast

Communicators and Topologies
• BMR example shows limitations of broadcast… although there is a pattern
• Communicators can be created on
subgroups of processes.
• Communicators can be created that have a
topology
−Will make programming natural
−Might improve performance by matching to
hardware

for (k = 0; k < s; k++) {
sender = (my_row + k) % s;
if (sender == my_col) {
MPI_Bcast(&my_A, m*m,
MPI_INT, sender,
row_comm);
T = my_A;
else MPI_Bcast(&T, m*m, MPI_INT,
sender, row_comm);
my_C = my_C + T x my_B;
}
MPI_Sendrecv_replace(my_B, m*m, MPI_INT,
dest, 0, source, 0, col_comm, &status); }
Creating topologies and
communicators
• Creating a grid
• MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus, canreorder, &grid_comm);
  −int dim_sizes[2], int istorus[2], int canreorder, MPI_Comm grid_comm
• Divide a grid into rows – each with its own communicator
• MPI_Cart_sub(grid_comm, free, &rowcomm)
  −MPI_Comm rowcomm; int free[2]
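Putting the two calls together, a hedged sketch that builds an s x s periodic grid and derives row and column communicators, using the variable names from the BMR code above (the wraparound and reorder settings are our choices):

/* Sketch: build an s x s grid with wraparound, then split it into row and column communicators. */
int dim_sizes[2] = { s, s };
int istorus[2]   = { 1, 1 };     /* periodic in both dimensions, for the "roll" */
int canreorder   = 1;
int coords[2], free[2], grid_id;
MPI_Comm grid_comm, row_comm, col_comm;

MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus, canreorder, &grid_comm);
MPI_Comm_rank(grid_comm, &grid_id);
MPI_Cart_coords(grid_comm, grid_id, 2, coords);   /* my_row = coords[0], my_col = coords[1] */

free[0] = 0; free[1] = 1;        /* columns vary within a group: one communicator per row */
MPI_Cart_sub(grid_comm, free, &row_comm);

free[0] = 1; free[1] = 0;        /* rows vary within a group: one communicator per column */
MPI_Cart_sub(grid_comm, free, &col_comm);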

Try implementing the BMR
algorithm with communicators

A brief on other MPI Topics – The
last leg
• MPI+Multi-threaded / OpenMP
• One sided Communication
• MPI and IO

MPI and OpenMP
•Grain
•Communication
•Where does the interesting pragma omp for fit in our MPI Floyd?
•How do I assign exactly one MPI task per CPU?
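One natural placement, offered as a hedged sketch rather than the course's prescribed answer: thread the per-row update loop of the MPI Floyd code with OpenMP, so MPI works across nodes and OpenMP across the cores within a node.

/* Hybrid sketch: inside the k loop, after rowk[] has been broadcast
   (compile with the compiler's OpenMP flag, e.g. -fopenmp with gcc). */
#pragma omp parallel for private(j)
for (i = 0; i < GET_MY_BLOCK_SIZE(id); i++)
    for (j = 0; j < n; j++)
        if (a[i][k] + rowk[j] < a[i][j])
            a[i][j] = a[i][k] + rowk[j];

Pinning exactly one MPI task per node (with one thread per core) is then a matter of mpirun and scheduler options, which vary from installation to installation.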
One-Sided Communication
• Have no corresponding send-recv pairs!
• RDMA
• Get
• Put
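A minimal hedged sketch of the put style using the MPI-2 window calls (the value moved and the window layout are illustrative; run with at least two tasks):

/* One-sided sketch: rank 0 puts a value directly into rank 1's window; no matching recv. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, local = -1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every rank exposes one int as a window others may access */
    MPI_Win_create(&local, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int value = 42;
        MPI_Put(&value, 1, MPI_INT, 1 /* target rank */, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);          /* completes the RDMA-style transfer */

    if (rank == 1) printf("Rank 1's window now holds %d\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}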

IO in Parallel Programs
• Typically a root task does the IO.
  −Simpler to program
  −Natural because of some post-processing occasionally needed (sorting)
  −All nodes generating IO requests might overwhelm the fileserver, essentially sequentializing it.
• Performance is not the limitation for Lustre/SFS.
• Parallel IO interfaces such as MPI-IO can make use of parallel filesystems such as Lustre.
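For a flavour of MPI-IO, a hedged sketch in which every rank writes its own block of one shared file, so no single root funnels the IO (the filename and sizes are illustrative):

/* MPI-IO sketch: each rank writes 4 ints at its own offset of one shared file. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, buf[4];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 4; i++) buf[i] = rank * 10 + i;

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* each rank writes to a disjoint region of the same file */
    MPI_File_write_at(fh, (MPI_Offset)rank * 4 * sizeof(int),
                      buf, 4, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}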

MPI-BLAST exec time vs other
time[4]

How IO/Comm Optimizations help
MPI-BLAST[4]

What did we learn?
• Distributed Memory Programming Model
• Parallel Algorithm Basics
• Work Breakdown
• Topologies in Communication
• Communication Overhead vs Computation
• Impact of Parallel IO

What MPI Calls did we see here?
1. MPI_Init
2. MPI_Finalize
3. MPI_Comm_size
4. MPI_Comm_rank
5. MPI_Send
6. MPI_Recv
7. MPI_Sendrecv_replace
8. MPI_Bcast
9. MPI_Reduce
10. MPI_Cart_create
11. MPI_Cart_sub
12. MPI_Scatter

References
1. Parallel Programming in C with MPI and OpenMP, M. J. Quinn, TMH. This is an excellent practical book; it motivated much of the material here, specifically Floyd's algorithm.
2. The BMR algorithm for matrix multiply and the topology ideas are motivated by http://www.cs.indiana.edu/classes/b673/notes/matrix
3. MPI online manual: http://www-unix.mcs.anl.gov/mpi/www/
4. Efficient Data Access For Parallel BLAST, IPDPS'05

