
Hybrid Parallel Programming

with MPI and PGAS (UPC)

P. Balaji (Argonne), R. Thakur (Argonne), E. Lusk (Argonne), James Dinan (OSU)

MPI Forum (07/27/2009)
Motivation
• MPI and UPC each have their own advantages
– UPC:
• Distributed data structures (arrays, trees)
• Implicit and explicit one-sided communication
– Good for irregular codes
• Can support large data sets
– Multiple virtual address spaces joined together to form a global
address space
– MPI:
• Groups
• Topology-aware functionality (e.g., Cartesian topology routines)

Extending MPI to work well with PGAS
• MPI can handle some parts and allow PGAS to handle others
– E.g., MPI can handle outer-level coarse-grained parallelism,
scalability, and fault tolerance, and allow PGAS to handle
inner-level fine-grained parallelism

[Figure: the three interoperability models: Flat, Nested Funneled, and Nested Multiple]

Description of Models
• Nested Multiple
– MPI launches multiple UPC groups of processes
• Note: Here “one process” refers to all entities that share one
virtual address space
– Each UPC process will have an MPI rank
• Can make MPI calls
• Nested Funneled
– MPI launches multiple UPC groups of processes
– Only one UPC process can make MPI calls
• Currently not restricted to the “master process” like with threads
– Applications can extend the address space without affecting
other internal components
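
A minimal sketch of what Nested Funneled could look like in application code, assuming an MPI library and UPC runtime that tolerate running in the same processes; the choice of UPC thread 0 as the MPI caller is only illustrative, since the model does not restrict which process funnels the MPI calls:

    /* Hypothetical Nested Funneled usage: only one UPC thread per group
     * (here thread 0, by choice) ever makes MPI calls. */
    #include <upc.h>
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (MYTHREAD == 0)
            MPI_Init(&argc, &argv);      /* outer, coarse-grained level */
        upc_barrier;

        /* ... all UPC threads do the inner, fine-grained work here ... */

        upc_barrier;
        if (MYTHREAD == 0) {
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            printf("group funneled through MPI rank %d\n", rank);
            MPI_Finalize();
        }
        return 0;
    }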
Description of Models (contd.)
• Flat Model
– A subset of Nested Multiple, but might be easier to implement
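
A minimal sketch of the Flat model, assuming it pairs every UPC thread with exactly one MPI rank (so the MPI world size equals THREADS); the assert and printf are only illustrative:

    /* Flat model sketch: each process is both an MPI rank and a UPC thread. */
    #include <upc.h>
    #include <mpi.h>
    #include <stdio.h>
    #include <assert.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Assumption for the Flat model: one MPI rank per UPC thread.
         * Note that rank and MYTHREAD need not be numbered identically. */
        assert(size == THREADS);
        printf("MPI rank %d is UPC thread %d of %d\n", rank, MYTHREAD, THREADS);

        MPI_Finalize();
        return 0;
    }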

What does MPI need to do?
• Hybrid initialization
– MPI_Init_hybrid(int *argc, char ***argv, int ranks_per_group)
• When MPI is launched, it needs to know how many
processes are being launched
– Currently we use a flat model
– If 10 processes are being launched, we know that the world
size is 10
– Hybrid launching can be hierarchical
• If 10 processes are launched, each of which launches 10
other processes, the world size can be 100 (in the case of
Nested Multiple)
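
A hypothetical sketch of how the proposed call might be used; MPI_Init_hybrid is only the proposal from this slide, not a standard MPI routine, and passing THREADS as ranks_per_group assumes the Nested Multiple model where every UPC thread becomes an MPI rank:

    #include <upc.h>
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int world_size;

        /* Tell MPI how many ranks this group contributes so it can size
         * MPI_COMM_WORLD across the hierarchical launch. */
        MPI_Init_hybrid(&argc, &argv, THREADS);

        MPI_Comm_size(MPI_COMM_WORLD, &world_size);
        if (MYTHREAD == 0)
            printf("world size across all groups: %d\n", world_size);
        /* e.g., 10 groups launched with THREADS = 10 gives world_size = 100 */

        MPI_Finalize();
        return 0;
    }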

Other Issues with Interoperability
• No mapping between MPI and UPC ranks
– The application needs to figure out the mapping explicitly
– Can be done portably with a few MPI_Alltoall and
MPI_Allgather calls (see the sketch after this list)
• Communication deadlock
– In some cases deadlocks can be avoided by the implicit
progress made by either MPI or UPC
– Being handled as ticket #154
• Might get voted out
• Applications might need to assume the worst case
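
A minimal sketch of building the rank mapping explicitly, assuming the Flat model (every UPC thread also holds an MPI rank); the array and function names are illustrative, not from the slides:

    #include <upc.h>
    #include <mpi.h>
    #include <stdlib.h>

    int *upc_thread_of_rank;    /* indexed by MPI rank, gives the UPC thread id */

    void build_rank_map(void)
    {
        int world_size, my_upc_id = MYTHREAD;

        MPI_Comm_size(MPI_COMM_WORLD, &world_size);
        upc_thread_of_rank = malloc(world_size * sizeof(int));

        /* Every process contributes its UPC thread id; afterwards each
         * process knows which UPC thread sits behind every MPI rank. */
        MPI_Allgather(&my_upc_id, 1, MPI_INT,
                      upc_thread_of_rank, 1, MPI_INT, MPI_COMM_WORLD);
    }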

Other Issues with Interoperability (contd.)
• There is no sharing of MPI and UPC objects
– MPI does not know how to send data from the “global address
space”
• The user has to stage the data into the process's own
virtual address space (as sketched below)
– UPC cannot perform RMA into MPI windows
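
A sketch of the staging step, assuming the Flat model and an illustrative array size; the point is only that data must be copied out of the UPC global address space into private memory before MPI can touch it:

    #include <upc.h>
    #include <mpi.h>

    #define N 1024
    shared double data[N];              /* distributed UPC array */

    void send_copy(int dest, int tag)
    {
        double local[N];

        /* MPI only understands private virtual addresses, so pull the
         * shared elements into this thread's own memory first. */
        for (int i = 0; i < N; i++)
            local[i] = data[i];

        MPI_Send(local, N, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    }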

Implementation in MPICH2
• Rough implementation available
– Will be corrected once the details are finalized

Random Access Benchmark
• UPC: Threads access random elements of a distributed shared array (see the sketch below)
[Figure: a single shared array, shared double data[N], distributed across UPC threads P0 ... Pn]
• Hybrid: Array is replicated on every group

[Figure: two groups, each with its own replicated copy of shared double data[N] across threads P0 ... Pn/2]
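
A sketch of the UPC random-access kernel described above; the 1,000,000 accesses per process match the benchmark description on the following slide, while the array size and checksum are illustrative:

    #include <upc.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N            (1 << 22)
    #define NUM_ACCESSES 1000000

    shared double data[N];              /* cyclically distributed over threads */

    int main(void)
    {
        double sum = 0.0;
        srand(MYTHREAD + 1);

        for (int i = 0; i < NUM_ACCESSES; i++) {
            int idx = rand() % N;
            /* Elements owned by other threads are fetched with implicit
             * one-sided reads; local elements are plain loads. */
            sum += data[idx];
        }

        upc_barrier;
        if (MYTHREAD == 0)
            printf("done (checksum %.3f)\n", sum);
        return 0;
    }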

Impact of Data Locality on Performance
[Plot: time (sec), 0 to 1000, vs. number of cores, 1 to 128 on quad-core nodes, for UPC, Hybrid-4, Hybrid-8, and Hybrid-16]

• Each process performs 1,000,000 random accesses


• Weak scaling; the ideal result is a flat line
Percentage Local References
[Plot: percent local data, 0% to 100%, vs. number of cores, 1 to 128, for UPC, Hybrid-4, Hybrid-8, and Hybrid-16]

Barnes-Hut n-Body Cosmological Simulation
• Simulates gravitational interactions of a system of n bodies
• Represents 3-d space using an oct-tree
• Summarizes distant interactions using centers of mass

for i in 1..t_max
    t <- new octree()

    forall b in bodies
        insert(t, b)

    summarize_subtrees(t)

    forall b in bodies
        compute_forces(b, t)

    forall b in bodies
        advance(b)

Credit: Lonestar Benchmarks (Pingali et al.)
Hybrid Barnes Algorithm
for i in 1..t_max
    t <- new octree()                       // tree is distributed across the group

    forall b in bodies
        insert(t, b)

    summarize_subtrees(t)

    our_bodies <- partition(group_id, bodies)

    forall b in our_bodies                  // smaller distribution: only
        compute_forces(b, t)                // O(|our_bodies|) tree traversals per group

    forall b in bodies
        advance(b)

    Allgather(bodies)
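
A sketch of the final Allgather(bodies) step, assuming each group's updated partition is exchanged over an MPI communicator of group leaders; the body_t layout, the leaders communicator, and the helper names are assumptions, not from the slides:

    #include <mpi.h>
    #include <stdlib.h>

    typedef struct { double pos[3], vel[3], mass; } body_t;

    /* Gather every group's updated partition so that each group again
     * holds the full, replicated body set. */
    void exchange_bodies(const body_t *my_bodies, int my_count,
                         body_t *all_bodies, MPI_Comm leaders)
    {
        int nbytes = my_count * (int) sizeof(body_t);
        int nleaders, *counts, *displs;

        MPI_Comm_size(leaders, &nleaders);
        counts = malloc(nleaders * sizeof(int));
        displs = malloc(nleaders * sizeof(int));

        /* First learn how many bytes each group contributes ... */
        MPI_Allgather(&nbytes, 1, MPI_INT, counts, 1, MPI_INT, leaders);
        displs[0] = 0;
        for (int i = 1; i < nleaders; i++)
            displs[i] = displs[i - 1] + counts[i - 1];

        /* ... then gather everyone's bodies into the replicated array. */
        MPI_Allgatherv(my_bodies, nbytes, MPI_BYTE,
                       all_bodies, counts, displs, MPI_BYTE, leaders);

        free(counts);
        free(displs);
    }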

Barnes Force Computation
[Plot: speedup, 0 to 256, vs. number of cores, 0 to 256, for UPC, Hybrid-4, Hybrid-8, and Hybrid-16]

• Strong scaling: 100,000 body system
