Programming and MPI
A course for IIT-M, September 2008
R Badrinath, STSD Bangalore
(ramamurthy.badrinath@hp.com)
September 2008, IIT-Madras
Audience Check
Contents
1. MPI_Init
2. MPI_Comm_rank
3. MPI_Comm_size
4. MPI_Send
5. MPI_Recv
6. MPI_Bcast
7. MPI_Comm_create
8. MPI_Sendrecv
9. MPI_Scatter
10. MPI_Gather
………………

Instead of a function-by-function tour, we:
• Understand issues
• Understand concepts
• Learn enough to pick up the rest from the manual
• Go by motivating examples
• Try out some of the examples
Outline
• Sequential vs Parallel programming
• Shared vs Distributed Memory
• Parallel work breakdown models
• Communication vs Computation
• MPI Examples
• MPI Concepts
• The role of IO
Sequential vs Parallel
• We are used to sequential programming – C, Java, C++,
etc. E.g., Bubble Sort, Binary Search, Strassen
Multiplication, FFT, BLAST, …
• Main idea – Specify the steps in perfect order
• Reality – We are used to parallelism a lot more than
we think – as a concept, though not for programming
• Methodology – Launch a set of tasks; communicate to
make progress. E.g., sorting 500 answer papers by
making 5 equal piles, having them sorted by 5 people,
and merging them together.
Shared vs Distributed Memory Programming
• Shared Memory – All tasks access the same
memory, hence the same data. E.g., pthreads
• Distributed Memory – All memory is local. Data
sharing is by explicitly transporting data from one
task to another (send-receive pairs in MPI, e.g.)
(Figure: each task's program and memory, linked by a communications channel)
Simple Parallel Program – sorting
numbers in a large array A
• Notionally divide A into 5 pieces
[0..99; 100..199; 200..299; 300..399; 400..499].
• Each part is sorted by an independent
sequential algorithm and left within its region.
What is different – Think about…
• How many people doing the work. (Degree of
Parallelism)
• What is needed to begin the work. (Initialization)
• Who does what. (Work distribution)
• Access to work part. (Data/IO access)
• Whether they need info from each other to finish
their own job. (Communication)
• When are they all done. (Synchronization)
• What needs to be done to collate the result.
Work Break-down
• Parallel algorithm
• Prefer simple, intuitive breakdowns
• Usually highly optimized sequential
algorithms are not easily parallelizable
• Breaking up work often involves some pre- or
post-processing (much like divide and conquer)
• Fine- vs large-grain parallelism and its
relationship to communication
Digression – Let’s get a simple MPI Program to
work
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int total_size, my_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &total_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    printf("Hello from task %d of %d\n", my_rank, total_size);
    MPI_Finalize();
    return 0;
}
What goes on
• Same program is run at the same time on 2
different CPUs
• Each is slightly different in that each returns
different values for some simple calls like
MPI_Comm_rank.
• This gives each instance its identity
• We can make different instances run different
pieces of code based on this identity difference
• Typically it is an SPMD model of computation
Continuing work breakdown…
Simple Example: Find shortest distances
PROBLEM: Find shortest path distances between all pairs of nodes.
(Figure: a small directed graph with weighted edges)
Floyd’s (sequential) algorithm
for (k=0; k<n; k++)
  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      a[i][j] = min( a[i][j], a[i][k]+a[k][j] );
Observation:
For a fixed k,
computing the i-th row needs the i-th row and the k-th row.
Parallelizing Floyd
• Actually we just need n² tasks, with each
task iterating n times (once for each value of k).
• After each iteration we need to make sure
everyone sees the matrix.
• ‘Ideal’ for shared-memory programming
• What if we have fewer than n² tasks?... Say p<n.
• Need to divide the work among the p tasks.
• We can simply divide up the rows.
Dividing the work
• Each task gets [n/p] rows, with the last
possibly getting a little more.
(Figure: the matrix rows stacked as blocks T0..Tq…; task Tq's block starts at row q × [n/p])
The MPI Model…
- All nodes run the same code!! P replica tasks!! (Distributed Memory Model)
- Sometimes they need to do different things

/* "id" is the TASK NUMBER; each node has only the part of A that
   it owns. Note that each node calls its own matrix by the same
   name a[][] but has only [n/p] rows. This is approximate code. */
for (k=0; k<n; k++) {
    current_owner_task = GET_BLOCK_OWNER(k);
    if (id == current_owner_task) {
        k_here = k - LOW_END_OF_MY_BLOCK(id);
        for (j=0; j<n; j++)
            rowk[j] = a[k_here][j];
        …
    }
    /* rowk is broadcast by the owner and received by the others…
       The MPI code will come here later */
    for (i=0; i<GET_MY_BLOCK_SIZE(id); i++)
        for (j=0; j<n; j++)
            a[i][j] = min(a[i][j], a[i][k]+rowk[j]);
}
The MPI model
• Recall MPI tasks are typically created when the
job is launched – not inside the MPI
program (no forking).
−mpirun usually creates the task set
−mpirun -np 2 a.out <args to a.out>
−a.out is run on all nodes and a communication
channel is set up between them
• Functions allow for tasks to find out
−Size of the task group
−One's own position within the group
MPI Notions [taken from the example]
• Communicator – A group of tasks in a program
• Rank – Each task's ID in the group
−MPI_Comm_rank() … /* use this to set "id" */
• Size – Of the group
−MPI_Comm_size() … /* use to set "p" */
• Notion of send/receive/broadcast…
−MPI_Bcast() … /* use to broadcast rowk[] */
MPI Prologue to our Floyd example
int a[MAX][MAX];
int n = 20;  /* real size of the matrix, can be read in */
int id, p;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);
…
/* This is where all the real work happens */
…
MPI_Finalize();  /* Epilogue */
This is the time to try out several
simple MPI programs using the
few functions we have seen.
- use mpicc
- use mpirun
Visualizing the execution
• Job is launched
• Scheduler ensures 1 task per CPU (multiple tasks/CPUs,
maybe on the same node)
• Tasks run on the CPUs
• Task 0 receives all blocks of the final array and prints them out
• MPI_Finalize
Communication vs Computation
• Often communication is needed between iterations to
complete the work.
• Often, the more tasks there are, the more communication
there is.
−In Floyd, bigger "p" means that "rowk" will be sent to a
larger number of tasks.
−If each iteration depends on more data, it can get very busy.
• This may mean network contention, i.e., delays.
• Try counting the number of "a"s in a string. Plot time vs p.
• This is why, for a fixed problem size, increasing the
number of CPUs does not continually increase performance.
• This needs experimentation – it is problem specific.
Communication primitives
• MPI_Send(sendbuffer, senddatalength,
           datatype, destination, tag, communicator);
• MPI_Send("Hello", strlen("Hello"), MPI_CHAR,
           2, 100, MPI_COMM_WORLD);
• MPI_Recv(recvbuffer, recvdatalength, MPI_CHAR,
           source, tag, MPI_COMM_WORLD, &status);
• Send-Recv happen in pairs.
Collectives
• Broadcast is one-to-all communication
• Both receivers and sender call the same function
• All MUST call it. All end up with the SAME result.
• MPI_Bcast(buffer, count, type, root, comm);
• Examples
−MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
−Task 0 sends its integer k and all others receive it.
−MPI_Bcast(rowk, n, MPI_INT, current_owner_task, MPI_COMM_WORLD);
−current_owner_task sends rowk to all others.
Try out a simple MPI program with
send-recvs and broadcasts.
A bit more on Broadcast
Ranks:     0    1    2
x before:  0    1    2
(each rank calls MPI_Bcast(&x, 1, .., 0, ..);)
x after:   0    0    0
Other useful collectives
• MPI_Reduce(&values, &results, count, type,
             operator, root, comm);
• MPI_Reduce(&x, &res, 1, MPI_INT, MPI_SUM,
             9, MPI_COMM_WORLD);
Scattering as opposed to broadcasting
• MPI_Scatterv(sndbuf, sndcount[], send_disp[],
               type, recvbuf, recvcount, recvtype,
               root, comm);
• All nodes MUST call it
(Figure: rank 0's send buffer split into pieces, one per rank)
Common Communication pitfalls!!
• Make sure that communication primitives are
called by the right number of tasks.
• Make sure they are called in the right sequence.
• Make sure that you use the proper tags.
• If not, you can easily get into deadlock
("My program seems to be hung")
More on work breakdown
• Finding the right work breakdown can be challenging
• Sometimes a dynamic work breakdown is good
• A master (usually task 0) decides who will do what and
collects the results.
• E.g., you have a huge number of 5x5 matrices to multiply
(chained matrix multiplication).
• E.g., search for a substring in a huge collection of strings.
Master-slave dynamic work assignment
(Figure: master task 0 exchanging work items and results with slave tasks 1 and 2)
Master-slave example – Reverse strings
Slave() {
  do {
    MPI_Recv(&work, MAX, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
    n = strlen(work);
    if (n == 0) break;  /* detecting the end */
    reverse(work);
    MPI_Send(&work, n+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
  } while (1);
  MPI_Finalize();
}
Master-slave example – Reverse strings
Master() {  /* rank 0 task */
  initialize_work_items();
  for (i=1; i<np; i++) {  /* Initial work distribution */
    work = next_work_item();
    n = strlen(work)+1;
    MPI_Send(&work, n, MPI_CHAR, i, 0, MPI_COMM_WORLD);
  }
  unfinished_work = np-1;
  while (unfinished_work != 0) {
    MPI_Recv(&res, MAX, MPI_CHAR, MPI_ANY_SOURCE, 0,
             MPI_COMM_WORLD, &status);
    process(res);
    work = next_work_item();
    if (work == NULL) unfinished_work--;
    else {
      n = strlen(work)+1;
      MPI_Send(&work, n, MPI_CHAR, status.MPI_SOURCE,
               0, MPI_COMM_WORLD);
    }
  }
}
Master-slave example
main() {
  ...
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  if (id == 0)
    Master();
  else
    Slave();
  ...
}
Matrix Multiply and Communication
Patterns
Block Distribution of Matrices
• Matrix Multiply: Cij = Σk (Aik * Bkj)
• Each task owns a block – its own part of A, B and C
• The old formula holds for blocks!
• BMR Algorithm
• Example:
  C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
  (each is a smaller block – a submatrix)
Block Distribution of Matrices
• Matrix Multiply: Cij = Σk (Aik * Bkj)
• BMR Algorithm:
  C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
Communicators and Topologies
• The BMR example shows the limitations of
broadcast… although there is a pattern
• Communicators can be created on
subgroups of processes.
• Communicators can be created that have a
topology
−Makes programming natural
−Might improve performance by matching to
hardware
for (k = 0; k < s; k++) {
  sender = (my_row + k) % s;
  if (sender == my_col) {
    MPI_Bcast(&my_A, m*m, MPI_INT, sender, row_comm);
    T = my_A;
  }
  else
    MPI_Bcast(&T, m*m, MPI_INT, sender, row_comm);
  my_C = my_C + T x my_B;  /* block multiply-accumulate (pseudocode) */
  MPI_Sendrecv_replace(my_B, m*m, MPI_INT,
                       dest, 0, source, 0, col_comm, &status);
}
Creating topologies and
communicators
• Creating a grid
• MPI_Cart_create(MPI_COMM_WORLD, 2,
dim_sizes, istorus, canreorder, &grid_comm);
−int dim_sizes[2], int istorus[2], int canreorder,
MPI_Comm grid_comm
Try implementing the BMR
algorithm with communicators
A brief on other MPI Topics – The
last leg
• MPI+Multi-threaded / OpenMP
• One sided Communication
• MPI and IO
MPI and OpenMP
• Grain
• Communication
• Where does the interesting "pragma omp for" fit in our MPI Floyd?
• How do I assign exactly one MPI task per CPU?
One-Sided Communication
• Have no corresponding send-recv pairs!
• RDMA
• Get
• Put
IO in Parallel Programs
• Typically a root task does the IO.
−Simpler to program
−Natural because of some post-processing occasionally needed (sorting)
−All nodes generating IO requests might overwhelm the fileserver,
essentially sequentializing it.
• Performance is not the limitation for Lustre/SFS.
• Parallel IO interfaces such as MPI-IO can make use of parallel
filesystems such as Lustre.
MPI-BLAST exec time vs other
time[4]
How IO/Comm Optimizations help
MPI-BLAST[4]
What did we learn?
• Distributed Memory Programming Model
• Parallel Algorithm Basics
• Work Breakdown
• Topologies in Communication
• Communication Overhead vs Computation
• Impact of Parallel IO
What MPI Calls did we see here?
1. MPI_Init
2. MPI_Finalize
3. MPI_Comm_size
4. MPI_Comm_rank
5. MPI_Send
6. MPI_Recv
7. MPI_Sendrecv_replace
8. MPI_Bcast
9. MPI_Reduce
10. MPI_Cart_create
11. MPI_Cart_sub
12. MPI_Scatter
References
1. Parallel Programming in C with MPI and OpenMP,
M J Quinn, TMH. This is an excellent practical
book. Motivated much of the material here,
specifically Floyd’s algorithm.
2. BMR Algorithm for Matrix Multiply and topology
ideas is motivated by
http://www.cs.indiana.edu/classes/b673/notes/matrix
3. MPI online manual
http://www-unix.mcs.anl.gov/mpi/www/
4. Efficient Data Access For Parallel BLAST, IPDPS’05