Special thanks:
David Luebke & Chandra Cheij @ NVIDIA
During this course, we'll try to adapt and use existing material ;-)
adapted for CS264
Today
yey!!
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
The Problem
Many computational problems too big for single CPU
Lack of RAM
Lack of CPU cycles
Want to distribute work between many CPUs
slide by Richard Edgar
Types of Parallelism
Some computations are embarrassingly parallel
Can do a lot of computation on minimal data
RC5, DES cracking, SETI@home, etc.
Solution is to distribute across the Internet
Use TCP/IP or similar
slide by Richard Edgar
Types of Parallelism
Some computations very tightly coupled
Have to communicate a lot of data at each step
e.g. hydrodynamics
Internet latencies much too high
Need a dedicated machine
slide by Richard Edgar
Tightly Coupled Computing
Two basic approaches
Shared memory
Distributed memory
Each has advantages and disadvantages
slide by Richard Edgar
Some More Terminology
One way to classify machines distinguishes between:
shared memory: global memory can be accessed by all processors or cores. Information is exchanged between threads using shared variables written by one thread and read by another. Need to coordinate access to shared variables.
distributed memory: private memory for each processor, only accessible to this processor, so no synchronization for memory accesses needed. Information is exchanged by sending data from one processor to another via an interconnection network using explicit communication operations.
[diagrams: shared memory: processors P share memory modules M through an interconnection network; distributed memory: each processor P has its own memory M, and the processors are linked by an interconnection network]
Hybrid approach increasingly common (now: mostly hybrid, i.e. distributed memory between nodes, shared memory within a node)
Shared Memory Machines
Have lots of CPUs share the same memory banks
Spawn lots of threads
Each writes to globally shared memory
Multicore CPUs now ubiquitous
Most computers now shared memory machines
slide by Richard Edgar
Shared Memory Machines
NASA Columbia Computer
Up to 2048 cores in single system
slide by Richard Edgar
Shared Memory Machines
Spawning lots of threads (relatively) easy
pthreads, OpenMP
Don't have to worry about data location
Disadvantage is memory performance scaling
Frontside bus saturates rapidly
Can use Non-Uniform Memory Architecture (NUMA)
Silicon Graphics Origin & Altix series
Gets expensive very fast
slide by Richard Edgar
Distributed Memory Clusters
Alternative is a lot of cheap machines
High-speed network between individual nodes
Network can cost as much as the CPUs!
How do nodes communicate?
slide by Richard Edgar
Distributed Memory Clusters
NASA Pleiades Cluster
51,200 cores
slide by Richard Edgar
Distributed Memory Model
Communication is key issue
Each node has its own address space
(exclusive access, no global memory?)
Could use TCP/IP
Painfully low level
Solution: a message-passing communication protocol (e.g. MPI)
slide by Richard Edgar
Distributed Memory Model
All data must be explicitly partitioned
Exchange of data by explicit communication
slide by Richard Edgar
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
Message Passing Interface
MPI is a communication protocol for parallel programs
Language independent
Open standard
Originally created by working group at SC92
Bindings for C, C++, Fortran, Python, etc.
http://www.mcs.anl.gov/research/projects/mpi/
http://www.mpi-forum.org/
slide by Richard Edgar
Message Passing Interface
MPI processes have independent address spaces
Communicate by sending messages
Means of sending messages is invisible to the programmer
Shared memory is used if available! (i.e. on shared-memory architectures MPI can use it behind the scenes)
On Level 5 (Session) and higher of OSI model
slide by Richard Edgar
OSI Model ?
Message Passing Interface
MPI is a standard, a specification, for message-passing libraries
Two major implementations of MPI
MPICH
OpenMPI
Programs should work with either
slide by Richard Edgar
Basic Idea
Usually programmed with SPMD model (single program,
multiple data)
In MPI-1 number of tasks is static - cannot dynamically
spawn new tasks at runtime. Enhanced in MPI-2.
No assumptions on type of interconnection network; all
processors can send a message to any other processor.
All parallelism explicit - programmer responsible for
correctly identifying parallelism and implementing parallel
algorithms
adapted from Berger & Klöckner (NYU 2010)
Credits: James Carr (OCI)
Hello World
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
printf("Hello world from %d of %d\n", rank, size);
MPI_Finalize();
return 0;
}
adapted from Berger & Klöckner (NYU 2010)
Hello World
To compile: need to load an MPI wrapper module in addition to the compiler modules (OpenMPI, MPICH, ...), e.g.
module load openmpi/intel/1.3.3   (or: module load mpi/openmpi/1.2.8/gnu)
then compile with: mpicc hello.c
To run: need to tell how many processes you are requesting:
mpiexec -n 10 a.out   (or: mpirun -np 10 a.out)
adapted from Berger & Klöckner (NYU 2010)
http://www.youtube.com/watch?v=pLqjQ55tz-U
The beauty of data visualization
Example: gprof2dot
"They've done studies, you know. 60% of the time, it works every time..."
- Brian Fantana (Anchorman, 2004)
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
Basic MPI
MPI is a library of routines
Bindings exist for many languages
Principal languages are C, C++ and Fortran
Python: mpi4py
We will discuss C++ bindings from now on
http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node287.htm
slide by Richard Edgar
Basic MPI
MPI allows processes to exchange messages
Processes are members of communicators
Communicator shared by all is MPI::COMM_WORLD
In C++ API, communicators are objects
Within a communicator, each process has unique ID
slide by Richard Edgar
A Minimal MPI Program
Very much a minimal
program
No actual
communication occurs
#include <iostream>
#include <cstdlib>
using namespace std;

#include "mpi.h"

int main( int argc, char* argv[] ) {
    MPI::Init( argc, argv );
    cout << "Hello World!" << endl;
    MPI::Finalize();
    return( EXIT_SUCCESS );
}
slide by Richard Edgar
A Minimal MPI Program
To compile MPI programs use mpic++
mpic++ -o MyProg myprog.cpp
The mpic++ command is a wrapper for default compiler
Adds in libraries
Use mpic++ --show to see what it does
Will also find mpicc, mpif77 and mpif90 (usually)
slide by Richard Edgar
A Minimal MPI Program
To run the program, use mpirun
mpirun -np 2 ./MyProg
The -np 2 option launches two processes
Check documentation for your cluster
Number of processes might be implicit
Program should print Hello World twice
slide by Richard Edgar
Communicators
Processes are members of communicators
A process can
Find the size of a given communicator
Determine its ID (or rank) within it
Default communicator is MPI::COMM_WORLD
slide by Richard Edgar
Communicators
Queries COMM_WORLD communicator for
Number of processes
Current process rank (ID)
Prints these out
Process rank counts from zero
int nProcs, iMyProc;
MPI::Init( argc, argv );
nProcs = MPI::COMM_WORLD.Get_size();
iMyProc = MPI::COMM_WORLD.Get_rank();
cout << "Hello from process ";
cout << iMyProc << " of ";
cout << nProcs << endl;
MPI::Finalize();
slide by Richard Edgar
Communicators
By convention, process with rank 0 is master
const int iMasterProc = 0;
Can have more than one communicator
Process may have different rank within each
slide by Richard Edgar
Messages
Haven't sent any data yet
Communicators have Send and Recv methods for this
One process posts a Send
Must be matched by Recv in the target process
slide by Richard Edgar
Sending Messages
A sample send is as follows:
int a[10];
MPI::COMM_WORLD.Send( a, 10, MPI::INT, iTargetProc, iTag );
The method prototype is
void Comm::Send( const void* buf, int count,
const Datatype& datatype,
int dest, int tag) const
MPI copies the buffer into a system buffer and returns
No delivery notification
slide by Richard Edgar
Receiving Messages
Similar call to receive
int a[10];
MPI::COMM_WORLD.Recv( a, 10, MPI::INT, iSrcProc, iMyTag);
Function prototype is
void Comm::Recv( void* buf, int count,
const Datatype& datatype,
int source, int tag) const
Blocks until data arrives
Wildcards: MPI::ANY_SOURCE and MPI::ANY_TAG match any sender or any tag
slide by Richard Edgar
MPI Datatypes
MPI datatypes are independent of language and endianness
Most common listed opposite:

MPI Datatype    C/C++
MPI::CHAR       signed char
MPI::SHORT      signed short
MPI::INT        signed int
MPI::LONG       signed long
MPI::FLOAT      float
MPI::DOUBLE     double
MPI::BYTE       untyped byte data

slide by Richard Edgar
MPI Send & Receive
Master process sends
out numbers
Worker processes print
out number received
if( iMyProc == iMasterProc ) {
    for( int i=1; i<nProcs; i++ ) {
        int iMessage = 2 * i + 1;
        cout << "Sending " << iMessage <<
                " to process " << i << endl;
        MPI::COMM_WORLD.Send( &iMessage, 1,
                              MPI::INT,
                              i, iTag );
    }
} else {
    int iMessage;
    MPI::COMM_WORLD.Recv( &iMessage, 1,
                          MPI::INT,
                          iMasterProc, iTag );
    cout << "Process " << iMyProc <<
            " received " << iMessage << endl;
}
slide by Richard Edgar
Six Basic MPI Routines
Have now encountered six MPI routines
MPI::Init(), MPI::Finalize()
MPI::COMM_WORLD.Get_size(), MPI::COMM_WORLD.Get_rank(),
MPI::COMM_WORLD.Send(), MPI::COMM_WORLD.Recv()
These are enough to get started ;-)
More sophisticated routines available...
slide by Richard Edgar
Collective Communications
Send and Recv are point-to-point
Communicate between specific processes
Sometimes we want all processes to exchange data
These are called collective communications
slide by Richard Edgar
Barriers
Barriers require all processes to synchronise
MPI::COMM_WORLD.Barrier();
Processes wait until all processes arrive at barrier
Potential for deadlock
Bad for performance
Only use if necessary
slide by Richard Edgar
Broadcasts
Suppose one process has array to be shared with all
int a[10];
MPI::COMM_WORLD.Bcast( a, 10, MPI::INT, iSrcProc );
If process has rank iSrcProc, it will send the array
Other processes will receive it
All will have a[10] identical to iSrcProc's copy on completion
slide by Richard Edgar
MPI Broadcast
Broadcast
[diagram: before the call, only the root holds A; after the broadcast, P0 through P3 all hold A]
MPI_Bcast(&buf, count, datatype, root, comm)
All processors must call MPI_Bcast with the same root value.
adapted from Berger & Klöckner (NYU 2010)
Reductions
Suppose we have a large array split across processes
We want to sum all the elements
Use MPI::COMM_WORLD.Reduce() with the MPI::SUM operation
Also MPI::COMM_WORLD.Allreduce() variant
Can perform MAX, MIN, MAXLOC, MINLOC too
slide by Richard Edgar
MPI Reduce
[diagram: Reduce combines A, B, C, D from P0 through P3 into a single result on the root]
Reduction operators can be min, max, sum, multiply, logical ops, max value and location ... Must be associative (commutative optional)
adapted from Berger & Klöckner (NYU 2010)
Scatter and Gather
Split a large array between processes
Use MPI::COMM_WORLD.Scatter()
Each process receives part of the array
Combine small arrays into one large one
Use MPI::COMM_WORLD.Gather()
Designated process will construct entire array
Has MPI::COMM_WORLD.Allgather() variant
slide by Richard Edgar
MPI Scatter/Gather
[diagram: Scatter splits the root's A, B, C, D across P0 through P3; Gather collects the pieces back onto the root]
adapted from Berger & Klöckner (NYU 2010)
MPI Allgather
[diagram: Allgather collects A, B, C, D from P0 through P3 and leaves the complete A, B, C, D on every process]
adapted from Berger & Klöckner (NYU 2010)
MPI Alltoall
[diagram: Alltoall: P0 starts with A0..A3, P1 with B0..B3, etc.; afterwards Pi holds Ai, Bi, Ci, Di, i.e. a transpose of the data across processes]
adapted from Berger & Klöckner (NYU 2010)
Asynchronous Messages
An asynchronous API exists too
Have to allocate buffers
Have to check if send or receive has completed
Will give better performance
Trickier to use
slide by Richard Edgar
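A sketch of the non-blocking pattern described above (not from the slides; `dest`, `src`, and `tag` are illustrative placeholders):

```c
#include <mpi.h>

/* Sketch: overlap communication with computation using the
   non-blocking API. dest/src/tag values are illustrative. */
void exchange(int *sendbuf, int *recvbuf, int n,
              int dest, int src, int tag)
{
    MPI_Request reqs[2];
    MPI_Isend(sendbuf, n, MPI_INT, dest, tag, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvbuf, n, MPI_INT, src,  tag, MPI_COMM_WORLD, &reqs[1]);

    /* ... do useful computation here while the messages are in flight ... */

    /* Buffers must not be touched until the requests complete. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```

The Waitall is the "check if send or receive has completed" step; skipping it (or reusing the buffers early) is a classic bug.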
User-Defined Datatypes
Usually have complex data structures
Require means of distributing these
Can pack & unpack manually
MPI allows us to define our own datatypes for this
slide by Richard Edgar
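As a sketch of the idea (the `Particle` struct and its fields are made up for illustration), MPI_Type_create_struct lets us describe a C struct once and then send it directly instead of packing by hand:

```c
#include <mpi.h>
#include <stddef.h>   /* offsetof */

struct Particle {
    double pos[3];
    int    id;
};

/* Build an MPI datatype that matches struct Particle's memory layout. */
MPI_Datatype make_particle_type(void)
{
    MPI_Datatype particle_type;
    int          blocklens[2] = { 3, 1 };
    MPI_Aint     displs[2]    = { offsetof(struct Particle, pos),
                                  offsetof(struct Particle, id) };
    MPI_Datatype types[2]     = { MPI_DOUBLE, MPI_INT };

    MPI_Type_create_struct(2, blocklens, displs, types, &particle_type);
    MPI_Type_commit(&particle_type);
    return particle_type;   /* release later with MPI_Type_free */
}
```

The committed type can then be used wherever MPI::INT appeared in the earlier examples.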
MPI-2
One-sided RMA (remote memory access) communication:
potential for greater efficiency, easier programming.
Uses windows into memory to expose regions for access.
Race conditions now possible.
Parallel I/O: like message passing, but to the file system, not other processes.
Allows for a dynamic number of processes and inter-communicators (as opposed to intra-communicators)
Cleaned up MPI-1
adapted from Berger & Klöckner (NYU 2010)
RMA
Processors can designate portions of their address space as available to other processors for read/write operations (MPI_Get, MPI_Put, MPI_Accumulate).
RMA window objects are created by collective window-creation functions (MPI_Win_create must be called by all participants).
Before accessing, call MPI_Win_fence (or other synchronization mechanisms) to start an RMA access epoch; the fence (like a barrier) separates local ops on the window from remote ops.
RMA operations are non-blocking; separate synchronization is needed to check completion: call MPI_Win_fence again.
[diagram: a Put transfers data from P0's local memory into an RMA window in P1's local memory]
adapted from Berger & Klöckner (NYU 2010)
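The create/fence/Put/fence sequence above can be sketched as follows (a fragment, not a full program; `rank`, `N`, and the choice of target are illustrative):

```c
/* Sketch of the fence-synchronized RMA pattern: every process exposes
   win_buf, and rank 0 Puts one value into rank 1's window. */
double win_buf[N];
MPI_Win win;

MPI_Win_create(win_buf, N * sizeof(double), sizeof(double),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);

MPI_Win_fence(0, win);                 /* start RMA access epoch */
if (rank == 0) {
    double x = 3.14;
    MPI_Put(&x, 1, MPI_DOUBLE, /* target rank */ 1,
            /* target displacement */ 0, 1, MPI_DOUBLE, win);
}
MPI_Win_fence(0, win);                 /* completes all RMA operations */

MPI_Win_free(&win);
```

Only after the second fence is rank 1 allowed to read the value rank 0 put into its window.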
Some MPI Bugs
Sample MPI Bugs
What's wrong?
Only works for an even number of processors.
adapted from Berger & Klöckner (NYU 2010)
Sample MPI Bugs
Suppose we have a local variable, e.g. energy, and want to sum all the processors' energy to find the total energy of the system.
Recall:
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)
Using the same variable for both buffers, as in
MPI_Reduce(&energy, &energy, 1, MPI_REAL, MPI_SUM, root, MPI_COMM_WORLD)
will bomb: the send and receive buffers must not overlap.
adapted from Berger & Klöckner (NYU 2010)
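The standard way around this aliasing restriction is MPI_IN_PLACE, which tells MPI that the root's send and receive buffers are the same. A sketch (`rank` and `root` are assumed to be set up as in the earlier examples):

```c
double energy = /* this process's local energy */;

if (rank == root)
    /* Root receives the global sum in place, on top of its own term. */
    MPI_Reduce(MPI_IN_PLACE, &energy, 1, MPI_DOUBLE,
               MPI_SUM, root, MPI_COMM_WORLD);
else
    /* recvbuf is ignored on non-root ranks and may be NULL. */
    MPI_Reduce(&energy, NULL, 1, MPI_DOUBLE,
               MPI_SUM, root, MPI_COMM_WORLD);
```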
Communication Topologies
Some topologies very common
Grid, hypercube etc.
API provided to set up communicators following these
slide by Richard Edgar
Parallel Performance
Recall Amdahl's law:
if T_1 = serial cost + parallel cost
then T_p = serial cost + parallel cost / p
But really:
T_p = serial cost + parallel cost / p + T_communication
How expensive is it?
adapted from Berger & Klöckner (NYU 2010)
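To get intuition for the communication term, it helps to plug in numbers. A small sketch (the fixed per-run communication cost t_comm is a simplifying assumption; real costs depend on message sizes and the network):

```c
#include <assert.h>
#include <math.h>

/* Predicted runtime on p processors: serial part, plus the parallel
   part divided across p processors, plus a (fixed) communication cost. */
double parallel_time(double serial, double parallel, int p, double t_comm)
{
    return serial + parallel / p + t_comm;
}

/* Speedup relative to the single-processor time T_1 = serial + parallel. */
double speedup(double serial, double parallel, int p, double t_comm)
{
    return (serial + parallel) / parallel_time(serial, parallel, p, t_comm);
}
```

For example, 1 unit of serial work plus 99 units of parallel work on 100 processors gives T_p = 1.99 even with free communication, a speedup of only about 50; any real t_comm pushes it lower.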
Network Characteristics
Interconnection network connects nodes, transfers data
Important qualities:
Topology - the structure used to connect the nodes
Routing algorithm - how messages are transmitted
between processors, along which path (= nodes along
which message transferred).
Switching strategy = how message is cut into pieces and
assigned a path
Flow control (for dealing with congestion) - stall, store data
in buffers, re-route data, tell source to halt, discard, etc.
adapted from Berger & Klöckner (NYU 2010)
Interconnection Network
Represent as graph G = (V, E), V = set of nodes to be
connected, E = direct links between the nodes. Links usually
bidirectional - transfer msg in both directions at same time.
Characterize network by:
diameter - maximum over all pairs of nodes of the shortest
path between the nodes (length of path in message
transmission)
degree - number of direct links for a node (number of direct
neighbors)
bisection bandwidth - minimum number of edges that must
be removed to partition network into two parts of equal size
with no connection between them. (measures network
capacity for transmitting messages simultaneously)
node/edge connectivity - numbers of node/edges that must
fail to disconnect the network (measure of reliability)
adapted from Berger & Klöckner (NYU 2010)
Linear Array
p vertices, p − 1 links
Diameter = p − 1
Degree = 2
Bisection bandwidth = 1
Node connectivity = 1, edge connectivity = 1
adapted from Berger & Klöckner (NYU 2010)
Ring topology
diameter = ⌊p/2⌋
degree = 2
bisection bandwidth = 2
node connectivity = 2
edge connectivity = 2
adapted from Berger & Klöckner (NYU 2010)
Mesh topology
diameter = 2(√p − 1) (a 3d mesh has 3(∛p − 1))
degree = 4 (6 in 3d)
bisection bandwidth = √p
node connectivity 2
edge connectivity 2
Route along each dimension in turn
adapted from Berger & Klöckner (NYU 2010)
Torus topology
Diameter halved, Bisection bandwidth doubled,
Edge and Node connectivity doubled over mesh
adapted from Berger & Klöckner (NYU 2010)
Hypercube topology
[diagram: 1-, 2-, 3-, and 4-dimensional hypercubes with nodes labelled by binary strings]
p = 2^k processors labelled with binary numbers of length k
k-dimensional cube constructed from two (k − 1)-cubes
Connect corresponding procs if labels differ in 1 bit
(Hamming distance d between 2 k-bit binary words = path of length d between 2 nodes)
adapted from Berger & Klöckner (NYU 2010)
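The Hamming-distance observation turns directly into a routing rule (a sketch of dimension-order, or e-cube, routing: each hop fixes the lowest differing bit, so the route length equals the Hamming distance):

```c
#include <assert.h>

/* Number of bit positions in which two node labels differ. */
int hamming_distance(unsigned a, unsigned b)
{
    unsigned x = a ^ b;
    int d = 0;
    while (x) {
        d += x & 1u;
        x >>= 1;
    }
    return d;
}

/* One routing step from src toward dst: flip the lowest differing bit. */
unsigned next_hop(unsigned src, unsigned dst)
{
    unsigned diff = src ^ dst;
    return src ^ (diff & -diff);   /* diff & -diff isolates the lowest set bit */
}
```

For example, routing from node 000 to node 101 takes two hops (000 to 001 to 101), matching the Hamming distance of 2.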
Hypercube topology
[diagram: the same labelled hypercubes as on the previous slide]
diameter = k (= log p)
degree = k
bisection bandwidth = p/2
node connectivity k
edge connectivity k
adapted from Berger & Klöckner (NYU 2010)
Dynamic Networks
The above networks were direct, or static, interconnection networks: processors connected directly with each other through fixed physical links.
Indirect or dynamic networks contain switches which provide an indirect connection between the nodes. Switches are configured dynamically to establish a connection.
bus
crossbar
multistage network - e.g. butterfly, omega, baseline
adapted from Berger & Klöckner (NYU 2010)
Crossbar
[diagram: crossbar connecting processors P1..Pn to memory modules M1..Mm]
Connecting n inputs and m outputs takes nm switches.
(Typically only for small numbers of processors.)
At each switch can either go straight or change direction.
Diameter = 1, bisection bandwidth = p
adapted from Berger & Klöckner (NYU 2010)
Butterfly
A 16 × 16 butterfly network:
[diagram: 4 stages (stage 0 to stage 3) of 2 × 2 switches, 8 switches per stage, connecting 16 inputs to 16 outputs]
for p = 2^(k+1) processors: k + 1 stages, 2^k switches per stage, each a 2 × 2 switch
adapted from Berger & Klöckner (NYU 2010)
Fat tree
Complete binary tree
Processors at leaves
Increase links for higher bandwidth near root
adapted from Berger & Klöckner (NYU 2010)
Current picture
Old style: mapped algorithms to topologies
New style: avoid topology-specific optimizations
Want code that runs on next year's machines too.
Topology awareness in vendor MPI libraries?
Software topology: ease of programming, but not used for performance?
adapted from Berger & Klöckner (NYU 2010)
Should we care?
Meta-programming / Auto-tuning ?
Top500 Interconnects
[screenshot: top500.org chart "Interconnect Family Share Over Time", generated from the 06/2010 TOP500 list]
adapted from Berger & Klöckner (NYU 2010)
MPI References
Lawrence Livermore tutorial
https://computing.llnl.gov/tutorials/mpi/
Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, Skjellum
Using MPI-2: Advanced Features of the Message-Passing Interface, by Gropp, Lusk, Thakur
Lots of other on-line tutorials, books, etc.
adapted from Berger & Klöckner (NYU 2010)
Ignite: Google Trends
http://www.youtube.com/watch?v=m0b-QX0JDXc
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
MPI with CUDA
MPI and CUDA almost orthogonal
Each node simply becomes faster
Problem: matching MPI processes to GPUs
Use compute-exclusive mode on GPUs
Tell cluster environment to limit processes per node
Have to know your cluster documentation
slide by Richard Edgar
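One simple convention for the matching problem (a sketch, not the only scheme; the function name and parameters are illustrative, and it assumes consecutive ranks are placed on the same node):

```c
#include <assert.h>

/* Map an MPI rank to a GPU index on its node, assuming `ranks_per_node`
   consecutive ranks share a node that hosts `gpus_per_node` devices.
   A real program would follow this with cudaSetDevice(gpu). */
int gpu_for_rank(int rank, int ranks_per_node, int gpus_per_node)
{
    int local_rank = rank % ranks_per_node;
    return local_rank % gpus_per_node;
}
```

The alternative, as noted above, is compute-exclusive mode: let the driver hand each process the first free GPU instead of computing the mapping yourself.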
Data Movement
Communication now very expensive
GPUs can only communicate via their hosts
Very laborious
Again: need to minimize communication
slide by Richard Edgar
MPI Summary
MPI provides cross-platform interprocess
communication
Invariably available on computer clusters
Only need six basic commands to get started
Much more sophistication available
slide by Richard Edgar
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
ZeroMQ
Excellent documentation:
examples
Design:
MPI is designed for tightly-coupled compute clusters with fast and reliable
networks.
Fault tolerance:
MPI has very limited facilities for fault tolerance (the default error handling
behavior in most implementations is a system-wide fail, ouch!).
Scalable
(Volume Renderer presented 90 minutes ago @ MapReduce '10)
Benchmarks
Why?