
Lecture #7: GPU Cluster Programming | March 8th, 2011

Nicolas Pinto (MIT, Harvard)


pinto@mit.edu
Massively Parallel Computing
CS 264 / CSCI E-292
Administrivia

Homeworks: HW2 due Mon 3/14/11, HW3 out Fri 3/11/11

Project info: http://www.cs264.org/projects/projects.html

Project ideas: http://forum.cs264.org/index.php?board=6.0

Project proposal deadline: Fri 3/25/11


(but you should submit well before then, so you can start working on it asap)

Need a private repo for your project?


Let us know! Poll on the forum:
http://forum.cs264.org/index.php?topic=228.0
Goodies

Guest Lectures: 14 distinguished speakers

Schedule updated (see website)


Goodies (cont'd)

Amazon AWS free credits coming soon


(only for students who completed HW0+1)

It's more than a $14,000 donation to the class!

Special thanks: Kurt Messersmith @ Amazon


Goodies (cont'd)

Best Project Prize: Tesla C2070 (Fermi) Board

It's more than a $4,000 donation to the class!

Special thanks:
David Luebke & Chandra Cheij @ NVIDIA
During this course, we'll try to ... and use existing material ;-)
(adapted for CS 264)
Today
yey!!
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
The Problem
Many computational problems too big for single CPU
Lack of RAM
Lack of CPU cycles
Want to distribute work between many CPUs
slide by Richard Edgar
Types of Parallelism
Some computations are embarrassingly parallel
Can do a lot of computation on minimal data
RC5, DES, SETI@home, etc.
Solution is to distribute across the Internet
Use TCP/IP or similar
slide by Richard Edgar
Types of Parallelism
Some computations very tightly coupled
Have to communicate a lot of data at each step
e.g. hydrodynamics
Internet latencies much too high
Need a dedicated machine
slide by Richard Edgar
Tightly Coupled Computing
Two basic approaches
Shared memory
Distributed memory
Each has advantages and disadvantages
slide by Richard Edgar
Some terminology
One way to classify machines distinguishes between:
shared memory: global memory can be accessed by all processors or cores. Information is exchanged between threads using shared variables written by one thread and read by another. Access to shared variables needs to be coordinated.
distributed memory: private memory for each processor, only accessible by that processor, so no synchronization for memory accesses is needed. Information is exchanged by sending data from one processor to another via an interconnection network using explicit communication operations.
[Figure: distributed memory (processors P, each with private memory M, joined by an interconnection network) vs. shared memory (processors P sharing memories M through an interconnection network).]
Hybrid approach increasingly common (now: mostly hybrid)
Shared Memory Machines
Have lots of CPUs share the same memory banks
Spawn lots of threads
Each writes to globally shared memory
Multicore CPUs now ubiquitous
Most computers now shared memory machines
slide by Richard Edgar
Shared Memory Machines
NASA Columbia Computer
Up to 2048 cores in single system
slide by Richard Edgar
Shared Memory Machines
Spawning lots of threads (relatively) easy
pthreads, OpenMP
Don't have to worry about data location
Disadvantage is memory performance scaling
Frontside bus saturates rapidly
Can use Non-Uniform Memory Architecture (NUMA)
Silicon Graphics Origin & Altix series
Gets expensive very fast
slide by Richard Edgar
Distributed Memory Clusters
Alternative is a lot of cheap machines
High-speed network between individual nodes
Network can cost as much as the CPUs!
How do nodes communicate?
slide by Richard Edgar
Distributed Memory Clusters
NASA Pleiades Cluster
51,200 cores
slide by Richard Edgar
Distributed Memory Model
Communication is key issue
Each node has its own address space
(exclusive access, no global memory?)
Could use TCP/IP
Painfully low level
Solution: a communication protocol like message-
passing (e.g. MPI)
slide by Richard Edgar
Distributed Memory Model
All data must be explicitly partitioned
Exchange of data by explicit communication
slide by Richard Edgar
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
Message Passing Interface
MPI is a communication protocol for parallel programs
Language independent
Open standard
Originally created by working group at SC92
Bindings for C, C++, Fortran, Python, etc.
http://www.mcs.anl.gov/research/projects/mpi/
http://www.mpi-forum.org/
slide by Richard Edgar
Message Passing Interface
MPI processes have independent address spaces
Communicate by sending messages
The means of sending messages is invisible to the programmer
Shared memory is used if available (i.e. it can be used behind the scenes on shared-memory architectures)
On Level 5 (Session) and higher of OSI model
slide by Richard Edgar
OSI Model ?
Message Passing Interface
MPI is a standard, a specification, for message-passing libraries
Two major implementations of MPI
MPICH
OpenMPI
Programs should work with either
slide by Richard Edgar
Basic Idea
Usually programmed with SPMD model (single program,
multiple data)
In MPI-1 number of tasks is static - cannot dynamically
spawn new tasks at runtime. Enhanced in MPI-2.
No assumptions on type of interconnection network; all
processors can send a message to any other processor.
All parallelism explicit - programmer responsible for
correctly identifying parallelism and implementing parallel
algorithms
adapted from Berger & Klckner (NYU 2010)
Credits: James Carr (OCI)
Hello World
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
printf("Hello world from %d of %d\n", rank, size);
MPI_Finalize();
return 0;
}
adapted from Berger & Klckner (NYU 2010)
Hello World
To compile: Need to load MPI wrappers in addition to the
compiler modules (OpenMPI, MPICH,...)
module load openmpi/intel/1.3.3
To compile: mpicc hello.c
To run: need to tell how many processes you are requesting
mpiexec -n 10 a.out (mpirun -np 10 a.out)
module load mpi/openmpi/1.2.8/gnu
adapted from Berger & Klckner (NYU 2010)
http://www.youtube.com/watch?v=pLqjQ55tz-U
The beauty of data
visualization
Example: gprof2dot
They've done studies, you
know. 60% of the time, it
works every time...
- Brian Fantana
(Anchorman, 2004)
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
Basic MPI
MPI is a library of routines
Bindings exist for many languages
Principal languages are C, C++ and Fortran
Python: mpi4py
We will discuss C++ bindings from now on
http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node287.htm
slide by Richard Edgar
Basic MPI
MPI allows processes to exchange messages
Processes are members of communicators
Communicator shared by all is MPI::COMM_WORLD
In C++ API, communicators are objects
Within a communicator, each process has unique ID
slide by Richard Edgar
A Minimal MPI Program
Very much a minimal
program
No actual
communication occurs
#include <iostream>
#include <cstdlib>
using namespace std;
#include "mpi.h"

int main( int argc, char** argv ) {
    MPI::Init( argc, argv );
    cout << "Hello World!" << endl;
    MPI::Finalize();
    return( EXIT_SUCCESS );
}
slide by Richard Edgar
A Minimal MPI Program
To compile MPI programs use mpic++
mpic++ -o MyProg myprog.cpp
The mpic++ command is a wrapper for default compiler
Adds in libraries
Use mpic++ --show to see what it does
Will also find mpicc, mpif77 and mpif90 (usually)
slide by Richard Edgar
A Minimal MPI Program
To run the program, use mpirun
mpirun -np 2 ./MyProg
The -np 2 option launches two processes
Check documentation for your cluster
Number of processes might be implicit
Program should print Hello World twice
slide by Richard Edgar
Communicators
Processes are members of communicators
A process can
Find the size of a given communicator
Determine its ID (or rank) within it
Default communicator is MPI::COMM_WORLD
slide by Richard Edgar
Communicators
Queries COMM_WORLD communicator for
Number of processes
Current process rank (ID)
Prints these out
Process rank counts from zero
int nProcs, iMyProc;
MPI::Init( argc, argv );
nProcs = MPI::COMM_WORLD.Get_size();
iMyProc = MPI::COMM_WORLD.Get_rank();
cout << "Hello from process ";
cout << iMyProc << " of ";
cout << nProcs << endl;
MPI::Finalize();
slide by Richard Edgar
Communicators
By convention, process with rank 0 is master
const int iMasterProc = 0;
Can have more than one communicator
Process may have different rank within each
slide by Richard Edgar
Messages
Haven't sent any data yet
Communicators have Send and Recv methods for this
One process posts a Send
Must be matched by Recv in the target process
slide by Richard Edgar
Sending Messages
A sample send is as follows:
int a[10];
MPI::COMM_WORLD.Send( a, 10, MPI::INT, iTargetProc, iTag );
The method prototype is
void Comm::Send( const void* buf, int count,
const Datatype& datatype,
int dest, int tag) const
MPI copies the buffer into a system buffer and returns
No delivery notification
slide by Richard Edgar
Receiving Messages
Similar call to receive
int a[10];
MPI::COMM_WORLD.Recv( a, 10, MPI::INT, iSrcProc, iMyTag);
Function prototype is
void Comm::Recv( void* buf, int count,
const Datatype& datatype,
int source, int tag) const
Blocks until data arrives
MPI::ANY_SOURCE
MPI::ANY_TAG
slide by Richard Edgar
MPI Datatypes
MPI datatypes are
independent of
Language
Endianess
Most common listed
opposite
MPI Datatype C/C++
MPI::CHAR signed char
MPI::SHORT signed short
MPI::INT signed int
MPI::LONG signed long
MPI::FLOAT float
MPI::DOUBLE double
MPI::BYTE
Untyped byte data
slide by Richard Edgar
MPI Send & Receive
Master process sends
out numbers
Worker processes print
out number received
if( iMyProc == iMasterProc ) {
for( int i=1; i<nProcs; i++ ) {
int iMessage = 2 * i + 1;
cout << "Sending " << iMessage <<
" to process " << i << endl;
MPI::COMM_WORLD.Send( &iMessage, 1,
MPI::INT,
i, iTag );
}
} else {
int iMessage;
MPI::COMM_WORLD.Recv( &iMessage, 1,
MPI::INT,
iMasterProc, iTag );
cout << "Process " << iMyProc <<
" received " << iMessage << endl;
}
slide by Richard Edgar
Six Basic MPI Routines
Have now encountered six MPI routines
MPI::Init(), MPI::Finalize()
MPI::COMM_WORLD.Get_size(), MPI::COMM_WORLD.Get_rank(),
MPI::COMM_WORLD.Send(), MPI::COMM_WORLD.Recv()
These are enough to get started ;-)
More sophisticated routines available...
slide by Richard Edgar
Collective Communications
Send and Recv are point-to-point
Communicate between specific processes
Sometimes we want all processes to exchange data
These are called collective communications
slide by Richard Edgar
Barriers
Barriers require all processes to synchronise
MPI::COMM_WORLD.Barrier();
Processes wait until all processes arrive at barrier
Potential for deadlock
Bad for performance
Only use if necessary
slide by Richard Edgar
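As a quick illustration (not from the original slides), a barrier is often used to line processes up before and after a timed region; MPI::Wtime() is the standard wall-clock timer:

#include <iostream>
#include <mpi.h>

int main( int argc, char** argv ) {
    MPI::Init( argc, argv );
    MPI::COMM_WORLD.Barrier();          // wait until every process gets here
    double t0 = MPI::Wtime();
    // ... local work being timed ...
    MPI::COMM_WORLD.Barrier();          // everyone has finished the work
    if ( MPI::COMM_WORLD.Get_rank() == 0 )
        std::cout << "Elapsed: " << MPI::Wtime() - t0 << " s" << std::endl;
    MPI::Finalize();
    return 0;
}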
Broadcasts
Suppose one process has array to be shared with all
int a[10];
MPI::COMM_WORLD.Bcast( a, 10, MPI::INT, iSrcProc );
If process has rank iSrcProc, it will send the array
Other processes will receive it
All will have a[10] identical to iSrcProc on completion
slide by Richard Edgar
MPI Broadcast
[Figure: before the broadcast only P0 holds A; afterwards P0, P1, P2 and P3 all hold A.]
MPI_Bcast(&buf, count, datatype, root, comm)
All processors must call MPI_Bcast with the same root value.
adapted from Berger & Klöckner (NYU 2010)
Reductions
Suppose we have a large array split across processes
We want to sum all the elements
Use MPI::COMM_WORLD.Reduce() with MPI::Op SUM
Also MPI::COMM_WORLD.Allreduce() variant
Can perform MAX, MIN, MAXLOC, MINLOC too
slide by Richard Edgar
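A small sketch of the sum reduction described above; nLocal, localData and iMasterProc are assumed to be defined by the surrounding program:

double localSum = 0.0;
for ( int i = 0; i < nLocal; i++ )      // each rank sums its own slice
    localSum += localData[i];

double totalSum = 0.0;
MPI::COMM_WORLD.Reduce( &localSum, &totalSum, 1,
                        MPI::DOUBLE, MPI::SUM, iMasterProc );
// totalSum is only meaningful on iMasterProc; use Allreduce()
// instead if every rank needs the result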
MPI Reduce
[Figure: Reduce. P0, P1, P2, P3 hold A, B, C, D; after the reduce, the root holds A op B op C op D (here: ABCD).]
Reduction operators can be min, max, sum, multiply, logical ops, max value and location, ... Must be associative (commutative optional).
adapted from Berger & Klöckner (NYU 2010)
Scatter and Gather
Split a large array between processes
Use MPI::COMM_WORLD.Scatter()
Each process receives part of the array
Combine small arrays into one large one
Use MPI::COMM_WORLD.Gather()
Designated process will construct entire array
Has MPI::COMM_WORLD.Allgather() variant
slide by Richard Edgar
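A sketch of the scatter/compute/gather pattern; it assumes N is divisible by nProcs and that the arrays big and result live on the master rank:

const int chunk = N / nProcs;
std::vector<float> myPart( chunk );

MPI::COMM_WORLD.Scatter( &big[0],    chunk, MPI::FLOAT,
                         &myPart[0], chunk, MPI::FLOAT, iMasterProc );

for ( int i = 0; i < chunk; i++ )
    myPart[i] *= 2.0f;                  // stand-in for real per-chunk work

MPI::COMM_WORLD.Gather( &myPart[0], chunk, MPI::FLOAT,
                        &result[0], chunk, MPI::FLOAT, iMasterProc );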
MPI Scatter/Gather
[Figure: Scatter distributes the elements A, B, C, D held by P0 so that P0, P1, P2, P3 each receive one; Gather is the inverse operation.]
adapted from Berger & Klöckner (NYU 2010)
MPI Allgather
[Figure: Allgather. Each of P0, P1, P2, P3 contributes one element (A, B, C, D); afterwards every process holds the full array A, B, C, D.]
adapted from Berger & Klöckner (NYU 2010)
MPI Alltoall
[Figure: Alltoall (a transpose). P0 starts with A0, A1, A2, A3, P1 with B0..B3, P2 with C0..C3, P3 with D0..D3; afterwards P0 holds A0, B0, C0, D0, P1 holds A1, B1, C1, D1, and so on.]
adapted from Berger & Klöckner (NYU 2010)
Asynchronous Messages
An asynchronous API exists too
Have to allocate buffers
Have to check if send or receive has completed
Will give better performance
Trickier to use
slide by Richard Edgar
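A hedged sketch of the non-blocking calls (Isend/Irecv); the buffers, the neighbour ranks iLeft and iRight, and iTag are assumed to exist elsewhere:

MPI::Request recvReq = MPI::COMM_WORLD.Irecv( inBuf,  count, MPI::FLOAT, iLeft,  iTag );
MPI::Request sendReq = MPI::COMM_WORLD.Isend( outBuf, count, MPI::FLOAT, iRight, iTag );

// ... computation that neither reads inBuf nor overwrites outBuf ...

recvReq.Wait();     // inBuf is now safe to read
sendReq.Wait();     // outBuf may now be reused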
User-Defined Datatypes
Usually have complex data structures
Require means of distributing these
Can pack & unpack manually
MPI allows us to define our own datatypes for this
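One possible sketch, using the C API (MPI_Type_create_struct), of describing a small struct once so whole structs can be sent without manual packing; the Particle type is just an illustration:

typedef struct { int id; double pos[3]; } Particle;

Particle     p;
int          blockLens[2] = { 1, 3 };
MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };
MPI_Aint     displs[2], base;
MPI_Datatype particleType;

MPI_Get_address( &p,        &base );       /* displacements relative to &p */
MPI_Get_address( &p.id,     &displs[0] );
MPI_Get_address( &p.pos[0], &displs[1] );
displs[0] -= base;  displs[1] -= base;

MPI_Type_create_struct( 2, blockLens, displs, types, &particleType );
MPI_Type_commit( &particleType );
/* now e.g. MPI_Send( &p, 1, particleType, dest, tag, MPI_COMM_WORLD ); */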
MPI-2
One-sided RMA (remote memory access) communication:
potential for greater efficiency, easier programming.
Use windows into memory to expose regions for access.
Race conditions now possible.
Parallel I/O: like message passing, but to the file system, not to other processes.
Allows for a dynamic number of processes and inter-communicators (as opposed to intra-communicators)
Cleaned up MPI-1
adapted from Berger & Klöckner (NYU 2010)
RMA
Processors can designate portions of their address space as available to other processors for read/write operations (MPI_Get, MPI_Put, MPI_Accumulate).
RMA window objects are created by collective window-creation functions (MPI_Win_create must be called by all participants).
Before accessing, call MPI_Win_fence (or other synchronization mechanisms) to start an RMA access epoch; the fence (like a barrier) separates local ops on the window from remote ops.
RMA operations are non-blocking; separate synchronization is needed to check completion: call MPI_Win_fence again.
[Figure: a Put transfers data from P0's local memory into an RMA window in P1's local memory.]
adapted from Berger & Klöckner (NYU 2010)
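A minimal one-sided sketch using the C API; rank is assumed to hold the process rank, and the fences bracket the access epoch as described above:

int localWin[10] = { 0 };                 /* memory this rank exposes */
MPI_Win win;
MPI_Win_create( localWin, 10 * sizeof(int), sizeof(int),
                MPI_INFO_NULL, MPI_COMM_WORLD, &win );

MPI_Win_fence( 0, win );                  /* start the RMA epoch */
if ( rank == 0 ) {
    int value = 42;
    MPI_Put( &value, 1, MPI_INT,          /* origin buffer              */
             1,                           /* target rank                */
             0, 1, MPI_INT,               /* target displacement/count  */
             win );
}
MPI_Win_fence( 0, win );                  /* end the epoch: the Put is complete */

MPI_Win_free( &win );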
Some MPI Bugs
Sample MPI Bugs
[The code for this example was shown as an image on the slide and is not captured here.]
What's wrong?
Answer: it only works for an even number of processors.
adapted from Berger & Klöckner (NYU 2010)
Sample MPI Bugs
Suppose you have a local variable, e.g. energy, and you want to sum all the processors' energies to find the total energy of the system.
Recall:
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)
Using the same variable for both buffers, as in
MPI_Reduce(&energy, &energy, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
will bomb: the send and receive buffers are not allowed to alias each other (a fix is sketched after this slide).
adapted from Berger & Klöckner (NYU 2010)
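Two legal ways to write that reduction, sketched with the C API (the root rank 0 and MPI_FLOAT are illustrative choices, not part of the original slide):

float energy = 0.0f;   /* this rank's local energy */
float total  = 0.0f;

/* (a) separate send and receive buffers: always allowed */
MPI_Reduce( &energy, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD );

/* (b) or reuse the same variable on the root via MPI_IN_PLACE */
if ( rank == 0 )
    MPI_Reduce( MPI_IN_PLACE, &energy, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD );
else
    MPI_Reduce( &energy, NULL, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD );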
Communication
Topologies
Communication Topologies
Some topologies very common
Grid, hypercube etc.
API provided to set up communicators following these
slide by Richard Edgar
Parallel Performance
Recall Amdahl's law:
if T_1 = serial cost + parallel cost
then T_p = serial cost + parallel cost / p
But really: T_p = serial cost + parallel cost / p + T_communication
How expensive is it?
adapted from Berger & Klöckner (NYU 2010)
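Restating the slide as a formula (a standard rearrangement; writing t_s for the serial cost and t_p for the parallel cost):

\[
  S_p \;=\; \frac{T_1}{T_p}
      \;=\; \frac{t_s + t_p}{\,t_s + t_p/p + T_{\mathrm{comm}}\,}
      \;\le\; \frac{t_s + t_p}{t_s},
\]

so even with free communication the serial fraction caps the speedup, and T_comm only pushes it further down.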
Network Characteristics
Interconnection network connects nodes, transfers data
Important qualities:
Topology - the structure used to connect the nodes
Routing algorithm - how messages are transmitted
between processors, along which path (= nodes along
which message transferred).
Switching strategy = how message is cut into pieces and
assigned a path
Flow control (for dealing with congestion) - stall, store data
in buffers, re-route data, tell source to halt, discard, etc.
adapted from Berger & Klckner (NYU 2010)
Interconnection Network
Represent as graph G = (V, E), V = set of nodes to be
connected, E = direct links between the nodes. Links usually
bidirectional - transfer msg in both directions at same time.
Characterize network by:
diameter - maximum over all pairs of nodes of the shortest
path between the nodes (length of path in message
transmission)
degree - number of direct links for a node (number of direct
neighbors)
bisection bandwidth - minimum number of edges that must
be removed to partition network into two parts of equal size
with no connection between them. (measures network
capacity for transmitting messages simultaneously)
node/edge connectivity - numbers of node/edges that must
fail to disconnect the network (measure of reliability)
adapted from Berger & Klckner (NYU 2010)
Linear Array
p vertices, p - 1 links
Diameter = p - 1
Degree = 2
Bisection bandwidth = 1
Node connectivity = 1, edge connectivity = 1
adapted from Berger & Klckner (NYU 2010)
Ring topology
diameter = p/2
degree = 2
bisection bandwidth = 2
node connectivity = 2
edge connectivity = 2
adapted from Berger & Klckner (NYU 2010)
Mesh topology
diameter = 2(√p - 1); for a 3D mesh it is 3(p^(1/3) - 1)
degree = 4 (6 in 3D)
bisection bandwidth = √p
node connectivity 2, edge connectivity 2
Route along each dimension in turn
adapted from Berger & Klckner (NYU 2010)
Torus topology
Diameter halved, Bisection bandwidth doubled,
Edge and Node connectivity doubled over mesh
adapted from Berger & Klckner (NYU 2010)
Hypercube topology
[Figure: hypercubes of dimension 1 to 4, nodes labelled with binary strings.]
p = 2^k processors, labelled with binary numbers of length k
A k-dimensional cube is constructed from two (k - 1)-cubes
Connect corresponding procs if their labels differ in 1 bit
(Hamming distance d between two k-bit binary words = path of length d between the 2 nodes)
adapted from Berger & Klöckner (NYU 2010)
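A tiny sketch of the bit-flip rule: the neighbours of node r are found by flipping one bit of its label, and routing fixes the differing bits one dimension at a time (r = 6 here is an arbitrary example):

#include <cstdio>

int main() {
    const int k = 4;                     // 2^k = 16 nodes
    const int r = 6;                     // node 0110 in binary
    for ( int d = 0; d < k; d++ )        // flip bit d to get the neighbour
        std::printf( "neighbour across dimension %d: %d\n", d, r ^ (1 << d) );
    return 0;
}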
Hypercube topology
[Figure: the same hypercubes of dimension 1 to 4.]
diameter = k (= log2 p)
degree = k
bisection bandwidth = p/2
node connectivity k, edge connectivity k
adapted from Berger & Klckner (NYU 2010)
Dynamic Networks
Above networks were direct, or static, interconnection networks = processors connected directly with each other through fixed physical links.
Indirect or dynamic networks contain switches which provide an indirect connection between the nodes. Switches are configured dynamically to establish a connection.
bus
crossbar
multistage network - e.g. butterfly, omega, baseline
adapted from Berger & Klckner (NYU 2010)
Crossbar
[Figure: an n x m crossbar connecting processors P1 ... Pn to memories M1 ... Mm.]
Connecting n inputs and m outputs takes n·m switches.
(Typically only for small numbers of processors.)
At each switch you can either go straight or change direction.
Diameter = 1, bisection bandwidth = p
adapted from Berger & Klckner (NYU 2010)
Butterfly
A 16 × 16 butterfly network:
[Figure: stages 0 to 3 of a 16 × 16 butterfly, rows labelled 000 to 111.]
For p = 2^(k+1) processors: k + 1 stages, 2^k switches per stage, 2 × 2 switches
adapted from Berger & Klckner (NYU 2010)
Fat tree
Complete binary tree
Processors at leaves
Increase links for higher bandwidth near root
adapted from Berger & Klckner (NYU 2010)
Current picture
Old style: mapped algorithms to topologies
New style: avoid topology-specific optimizations
Want code that runs on next year's machines too.
Topology awareness in vendor MPI libraries?
Software topology: ease of programming, but not used for performance?
adapted from Berger & Klckner (NYU 2010)
Should we care?
Old school: map algorithms to specific topologies
New school: avoid topology-specific optimizations (the code should be optimal on next year's infrastructure...)
Meta-programming / auto-tuning?
Top500 Interconnects
Interconnect Family Share Over Time (TOP500 statistics, list of 06/2010)
http://www.top500.org/overtime/list/35/connfam
[Chart: interconnect family share over time, from top500.org.]
adapted from Berger & Klöckner (NYU 2010)
MPI References
Lawrence Livermore tutorial:
https://computing.llnl.gov/tutorials/mpi/
Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, Skjellum
Using MPI-2: Advanced Features of the Message-Passing Interface, by Gropp, Lusk, Thakur
Lots of other on-line tutorials, books, etc.
adapted from Berger & Klöckner (NYU 2010)
Ignite: Google Trends
http://www.youtube.com/watch?v=m0b-QX0JDXc
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
MPI with CUDA
MPI and CUDA almost orthogonal
Each node simply becomes faster
Problem matching MPI processes to GPUs
Use compute-exclusive mode on GPUs
Tell cluster environment to limit processes per node
Have to know your cluster documentation
slide by Richard Edgar
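A common sketch of the matching (not from the original slides): let each rank pick a GPU with rank modulo device count, which assumes the ranks on a node are numbered consecutively:

#include <mpi.h>
#include <cuda_runtime.h>

int main( int argc, char** argv ) {
    MPI_Init( &argc, &argv );
    int rank;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    int nDevices = 0;
    cudaGetDeviceCount( &nDevices );
    cudaSetDevice( rank % nDevices );    // naive rank-to-GPU mapping

    // ... launch kernels; copy results back to the host before any MPI_Send ...

    MPI_Finalize();
    return 0;
}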
Data Movement
Communication now very expensive
GPUs can only communicate via their hosts
Very laborious
Again: need to minimize communication
slide by Richard Edgar
MPI Summary
MPI provides cross-platform interprocess
communication
Invariably available on computer clusters
Only need six basic commands to get started
Much more sophistication available
slide by Richard Edgar
Outline
1. The problem
2. Intro to MPI
3. MPI Basics
4. MPI+CUDA
5. Other approaches
ZeroMQ
messaging middleware: "TCP on steroids", a new layer on the networking stack
not a complete messaging system, just a simple messaging library to be used programmatically
a "pimped" socket interface allowing you to quickly design / build a complex communication system without much effort
http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
ZeroMQ
"Fastest. Messaging. Ever."
Excellent documentation: examples, white papers for everything
Bindings for Ada, Basic, C, Chicken Scheme, Common Lisp, C#, C++, D, Erlang*, Go*, Haskell*, Java, Lua, node.js, Objective-C, ooc, Perl, PHP, Python, Racket, Ruby, Tcl
http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
Message Patterns
http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
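To make this concrete, a hedged sketch of the simplest pattern (request-reply) using the plain C API from zmq.h; the calls follow the ZeroMQ 3.x naming (the 2.x series spelled some of them differently), and the port is arbitrary:

#include <zmq.h>
#include <cstdio>

int main() {
    void* ctx  = zmq_ctx_new();
    void* sock = zmq_socket( ctx, ZMQ_REP );   // reply side of REQ/REP
    zmq_bind( sock, "tcp://*:5555" );

    char buf[256];
    for ( ;; ) {
        int n = zmq_recv( sock, buf, sizeof(buf) - 1, 0 );
        if ( n < 0 ) break;                    // interrupted or context closed
        buf[n] = '\0';
        std::printf( "got: %s\n", buf );
        zmq_send( sock, "ack", 3, 0 );         // answer the matching REQ socket
    }
    zmq_close( sock );
    zmq_ctx_destroy( ctx );
    return 0;
}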
Demo: Why ZeroMQ ?
http://www.youtube.com/watch?v=_JCBphyciAs
MPI vs ZeroMQ?
MPI is a specification, ZeroMQ is an implementation.
Design:
MPI is designed for tightly-coupled compute clusters with fast and reliable networks.
ZeroMQ is designed for large distributed systems (web-like).
Fault tolerance:
MPI has very limited facilities for fault tolerance (the default error-handling behavior in most implementations is a system-wide fail, ouch!).
ZeroMQ is resilient to faults and network instability.
ZeroMQ could be a good transport layer for an MPI-like implementation.
http://stackoverflow.com/questions/35490/spread-vs-mpi-vs-zeromq
CUDASA
Fast Forward!
(CUDASA slides adapted from the Visualization Research Institute (VISUS), University of Stuttgart)
Motivation
Extend the parallelism of CUDA for a single GPU to:
bus level: multi-GPU systems (SLI, Tesla, QuadroPlex, ...)
network level: GPU cluster environments (Myrinet, InfiniBand, Myri-10G, ...)
Consistent developer interface
Easily embedded into the CUDA compile process
CUDA: Compute Unified Device Architecture → CUDASA: Compute Unified Device and Systems Architecture
System Overview: GPU Layer
(layers: GPU Layer, Bus Layer, Network Layer, Application)
Basic execution unit (BEU): the device (GPU)
Kernel blocks run in parallel
No communication between kernel blocks
Unmodified CUDA programming interface; no code modification required
System Overview: Bus Layer
BEU: a single GPU of the host; one BEU equals one POSIX thread
Task blocks run in parallel
All blocks share common system memory (eqv. to CUDA global memory)
Workload-balanced scheduling of blocks to the BEUs
System Overview: Network Layer
BEU: a cluster node; one BEU equals one MPI process
The MPI process group executes job blocks in parallel
No intrinsic global memory → distributed shared memory management
Workload-balanced scheduling of blocks to the BEUs
System Overview: Application Layer
Sequential application process
Arbitrary C/C++ application code
Allocation/deallocation of distributed/system memory
Issues function calls on the network/bus level
Language Extensions
Concepts:
Unchanged CUDA syntax and semantics for the GPU layer
Minimal set of extensions for the additional abstraction layers
Generalize the present programming paradigm
Mimic the CUDA/GPU interface at the new layers
Programmability:
Hidden underlying mechanisms for parallelism / communication
Language Extensions (cont'd)
CUDA model (execution configuration, function qualifiers):
__global__ void gFunc(float* parameter) {...}   // DEVICE
__host__ void hFunc(...) {                      // HOST
    ...
    gFunc <<< Dg, Db, Ns >>>(parameter);
}
CUDASA model (generalized execution configuration, new function qualifiers):
__task__ void tFunc(float* parameter) {         // HOST
    hFunc(parameter);
}
__sequence__ void sFunc(...) {                  // APP
    tFunc <<< Dg >>>(parameter);
}
Language Extensions (cont'd)
Abstraction         Exposed        Internal     Built-ins
application layer   __sequence__
network layer       __job__        __node__     jobIdx, jobDim
bus layer           __task__       __host__     taskIdx, taskDim
GPU layer           __global__     __device__   gridDim, blockIdx, blockDim, threadIdx
Exposed functions are accessible from the next higher abstraction
Built-ins are automatically propagated to all underlying layers
Implementation
Runtime library:
Job and task scheduling; idle BEUs request new workload
Distributed shared memory for the network layer: cudasaMalloc, cudasaMemcpy, cudasaFree, ...
Common interface functions (e.g. atomic functions): cudasaAtomicAdd, cudasaAtomicSub, cudasaAtomicExch, ...
CUDASA compiler:
Code translation from CUDASA code to CUDA plus threads/MPI
Self-contained pre-compiler in front of the CUDA compiler process
Based on Elsa (C++ parser) with added CUDA/CUDASA functionality
Full analysis of syntax and semantics required
Implementation: Bus Layer
Example code translation:
__task__ void tFunc(int i, float *f) { ... }
becomes
typedef struct {
    int i; float *f;                         // user-defined parameters
    dim3 taskIdx, taskDim;                   // built-ins
} wrapper_struct_tFunc;
void tFunc(wrapper_struct_tFunc *param) {    // POSIX-thread-compatible signature
    int i = param->i; float *f = param->f;
    dim3 taskIdx = param->taskIdx;
    dim3 taskDim = param->taskDim;
    { ... }                                  // original function body
}
Implementation: Bus Layer (cont'd)
Example code translation, continued: the call site
tFunc <<< Dg >>>(i, f);
Generated code layout:
Copy function parameters into the wrapper struct (i, f)
Populate the scheduler queue with all blocks of the task grid (Dg)
Determine built-ins for each block (taskIdx, taskDim)
Wake up BEU worker threads from the thread pool
Idle BEUs request the next pending block from the queue
Wait for all blocks to be processed
Issuing a task grid is a blocking call
Implementation: Network Layer
Code translation:
Analogous to the bus layer, using the MPI interface
BEUs run a compile-time-generated event loop (eqv. to the thread pool)
The application issues broadcast messages to:
issue execution of job functions
perform shared distributed memory operations
Implementation: Network Layer (cont'd)
Shared distributed memory (DSM):
Enables computations exceeding the system memory of a single node
Each cluster node dedicates part of its system memory to the DSM
Continuous virtual address range
Interface via cudasaMalloc, cudasaMemcpy, cudasaFree (eqv. to CUDA global memory management)
Implemented using MPI Remote Memory Access (RMA)
No guarantees for concurrent non-atomic memory accesses (as in CUDA); CUDASA atomic functions for the DSM
Results: Bus-Level Parallelism
Test case 1: BLAS general matrix multiply (SGEMM)
Built on top of the CUBLAS SGEMM library function
Block-based sub-matrix processing applied on all levels of parallelism
CPUs: AMD Opteron 270 (2 x 2 cores), Intel Q6600 (4 cores)
GPUs: NVIDIA Quadro FX5600; NVIDIA 8800GTX Ultra, 2 cards (16/16 lanes); NVIDIA 8800GT, 2 cards (16/4 lanes), 3 cards (16/4/4 lanes), 4 cards (16/4/4/4 lanes)
Results: Bus-Level Parallelism (cont'd)
Test case 2: visibility computation for interactive global illumination
(Implicit Visibility and Antiradiance for Interactive Global Illumination, Dachsbacher et al., 2007)
Partition of all scene elements into task and kernel blocks
Uniform directional radiance distribution (128 samples)
Timings in milliseconds for a single-node multi-GPU system with four NVIDIA 8800GTs:
Number of scene elements   1 GPU   2 GPUs   4 GPUs
32768                       526     263      129
131072                     1030     520      265
Results: Network Parallelism
Proof of concept: SGEMM using network parallelism
Test case 1: 2 cluster nodes, two 8800GTs each, Gigabit Ethernet
23000² matrices
Job computation: 1.6 s, job communication: 4.9 s → high communication costs
Test case 2: a single PC with 4 GPUs used as 4 single-GPU cluster nodes
10240² matrices (bus level only: 314 GFlops)
Requires inter-process communication
DSM accesses take 1.3 times longer than the computation
No awareness of data locality → high (unnecessary) communication overhead
Conclusion
CUDASA: an extension to CUDA enabling bus/network parallelism
Minimal changes to the original language
Low programming and learning overhead
Good scaling behavior on the bus level, especially for very large target computations
Easy to integrate into the CUDA development process
Current project state:
Extension of CUDASA to add awareness of data locality
Idea: callback mechanism in the execution configuration
Minimize the amount of DSM data to be communicated
Automatically make use of asynchronous data transfers to the GPUs
Preparations for making CUDASA publicly available
MultiGPU MapReduce
Fast Forward
MapReduce
http://m.blog.hu/dw/dwbi/image/2009/Q4/mapreduce_small.png
Why MapReduce?
Simple programming model (see the sketch after this slide)
Parallel programming model
Scalable
Previous GPU work: neither multi-GPU nor out-of-core
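A tiny single-threaded word count written in MapReduce style, only to make the model concrete; it does not use GPMR's actual API, which runs map and reduce as CUDA kernels over chunks of key-value pairs:

#include <iostream>
#include <map>
#include <sstream>
#include <string>

int main() {
    std::string text = "the quick brown fox jumps over the lazy dog the end";
    std::map<std::string, int> counts;

    std::istringstream in( text );
    std::string word;
    while ( in >> word )        // "map": emit the pair (word, 1)
        counts[word] += 1;      // "reduce": sum the values for each key

    std::map<std::string, int>::const_iterator it;
    for ( it = counts.begin(); it != counts.end(); ++it )
        std::cout << it->first << " : " << it->second << "\n";
    return 0;
}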
Benchmarks: Which?
Matrix Multiplication (MM)
Word Occurrence (WO)
Sparse-Integer Occurrence (SIO)
Linear Regression (LR)
K-Means Clustering (KMC)
(Volume Renderer: presented 90 minutes ago @ MapReduce '10)
Benchmarks: Why?
Needed to stress aspects of GPMR:
Unbalanced work (WO)
Multiple emits / non-uniform number of emits (LR, KMC, WO)
Sparsity of keys (SIO)
Accumulation (WO, LR, KMC)
Many key-value pairs (SIO)
Compute-bound scalability (MM)
Benchmarks: Results
TABLE 1: Dataset sizes for the benchmarks.
MM:  first set 1024², 2048², 4096², 16384² elements
SIO: 4 bytes/element;  first set (10^6) 1, 8, 32, 128;   second set (10^6/GPU) 1, 2, 4, 8, 16, 32
WO:  1 byte/element;   first set (10^6) 1, 16, 64, 512;  second set (10^6/GPU) 1, 2, 4, 8, 16, 32, 64, 128, 256
KMC: 16 bytes/element; first set (10^6) 1, 8, 32, 512;   second set (10^6/GPU) 1, 2, 4, 8, 16, 32
LR:  8 bytes/element;  first set (10^6) 1, 16, 64, 512;  second set (10^6/GPU) 1, 2, 4, 8, 16, 32, 64
We tested Phoenix against the first input set for SIO, KMC, and LR, and the second set for WO. We test GPMR against all available input sets.
               MM       KMC      LR     SIO    WO
1-GPU Speedup  162.712  2.991    1.296  1.450  11.080
4-GPU Speedup  559.209  11.726   4.085  2.322  18.441
TABLE 2: Speedup for GPMR over Phoenix on our large (second-biggest) input data from our first set. The exception is MM, for which we use our small input set (Phoenix required almost twenty seconds to multiply two 1024 × 1024 matrices).
               MM      KMC      WO
1-GPU Speedup  2.695   37.344   3.098
4-GPU Speedup  10.760  129.425  11.709
TABLE 3: Speedup for GPMR over Mars on 4096 × 4096 Matrix Multiplication, an 8M-point K-Means Clustering, and a 512 MB Word Occurrence. These sizes represent the largest problems that can meet the in-core memory requirements of Mars.
Table 2 summarizes speedup results over Phoenix, while Table 3 gives speedup results of GPMR over Mars. Note that GPMR, even in the one-GPU configuration, is faster on all benchmarks than either Phoenix or Mars, and GPMR shows good scalability to four GPUs as well.
Source code size is another important metric. One significant benefit of MapReduce in general is its high level of abstraction: as a result, code sizes are small and development time is reduced, since the developer does not have to focus on the low-level details of communication and scheduling but instead on the algorithm. Table 4 shows the different number of lines required for each of three benchmarks implemented in Phoenix, Mars, and GPMR. We would also like to show developer time required to implement each benchmark for each platform, but neither Mars nor Phoenix published such information (and we wanted to use the applications provided so as not to introduce bias in Mars's or Phoenix's runtimes). As a frame of reference, the lead author of this paper implemented and tested MM in GPMR in three hours, SIO in half an hour, KMC in two hours, LR in two hours, and WO in four hours. KMC, LR, and WO were then later modified in about half an hour each to add Accumulation.

          MM   KMC  WO
Phoenix   317  345  231
Mars      235  152  140
GPMR      214  129  397
TABLE 4: Lines of source code for three common benchmarks written in Phoenix, Mars, and GPMR. We exclude setup code from all counts as it was roughly the same for all benchmarks and had little to do with the actual MapReduce code. For GPMR we included boilerplate code in the form of class header files and C++ wrapper functions that invoke CUDA kernels. If we excluded these files, GPMR's totals would be even smaller. Also, WO is so large because of the hashing required in GPMR's implementation.

Fig. 2: GPMR runtime breakdowns on our largest datasets. This figure shows how each application exhibits different runtime characteristics, and also how the exhibited characteristics change as we increase the number of GPUs.

7 CONCLUSION
GPMR offers many benefits to MapReduce programmers. The most important is scalability. While it is unrealistic to expect perfect scalability from all but the most compute-bound tasks, GPMR's minimal overhead and transfer costs position it well in comparison to other MapReduce implementations. GPMR also offers flexibility to developers in several areas, particularly when compared with Mars. GPMR allows flexible mappings between threads and keys and customization of the MapReduce pipeline with additional communication-reducing stages while still providing sensible default implementations. Our results demonstrate that even difficult applications that have not traditionally been addressed by GPUs can still show ...
Benchmarks: Results
[Charts: GPMR speedups vs. the CPU implementation (Phoenix) and vs. the GPU implementation (Mars); the slides annotate the scaling as "Good".]
iPhD
one more thing
or two...
Life/Code Hacking #3
The Pomodoro Technique
http://lifehacker.com/#!5554725/the-pomodoro-technique-trains-your-brain-away-from-distractions
Life/Code Hacking #3
The Pomodoro Technique
http://www.youtube.com/watch?v=QYyJZOHgpco