
Hybrid Parallel Programming

with MPI and PGAS (UPC)

P. Balaji (Argonne), R. Thakur (Argonne), E. Lusk (Argonne), James Dinan (OSU)

MPI Forum (07/27/2009)
Motivation
• MPI and UPC each have their own advantages
– UPC:
• Distributed data structures (arrays, trees)
• Implicit and explicit one-sided communication
– Good for irregular codes
• Can support large data sets
– Multiple virtual address spaces joined together to form a global
address space
– MPI:
• Groups
• Topology-aware functionality (e.g., Cartesian topology routines)

Extending MPI to work well with PGAS
• MPI can handle some parts and allow PGAS to handle others
– E.g., MPI can handle outer-level coarse-grained parallelism,
scalability, and fault tolerance, and allow PGAS to handle
inner-level fine-grained parallelism

[Figure: the three interoperability models: Flat, Nested Funneled, and Nested Multiple]

Description of Models
• Nested Multiple
– MPI launches multiple UPC groups of processes
• Note: Here “one process” refers to all entities that share one
virtual address space
– Each UPC process will have an MPI rank
• Can make MPI calls
• Nested Funneled
– MPI launches multiple UPC groups of processes
– Only one UPC process can make MPI calls
• Currently not restricted to the “master process” like with threads
– Applications can extend the address space without affecting
other internal components
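
A minimal sketch of what Nested Funneled could look like in application code, assuming an MPI library and UPC runtime that tolerate running in the same processes; the choice of UPC thread 0 as the MPI caller is only illustrative, since the model does not restrict which process funnels the MPI calls:

    /* Hypothetical Nested Funneled usage: only one UPC thread per group
     * (here thread 0, by choice) ever makes MPI calls. */
    #include <upc.h>
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (MYTHREAD == 0)
            MPI_Init(&argc, &argv);      /* outer, coarse-grained level */
        upc_barrier;

        /* ... all UPC threads do the inner, fine-grained work here ... */

        upc_barrier;
        if (MYTHREAD == 0) {
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            printf("group funneled through MPI rank %d\n", rank);
            MPI_Finalize();
        }
        return 0;
    }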
Description of Models (contd.)
• Flat Model
– A subset of Nested Multiple, but might be easier to implement
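
A minimal sketch of the Flat model, assuming it pairs every UPC thread with exactly one MPI rank (so the MPI world size equals THREADS); the assert and printf are only illustrative:

    /* Flat model sketch: each process is both an MPI rank and a UPC thread. */
    #include <upc.h>
    #include <mpi.h>
    #include <stdio.h>
    #include <assert.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Assumption for the Flat model: one MPI rank per UPC thread.
         * Note that rank and MYTHREAD need not be numbered identically. */
        assert(size == THREADS);
        printf("MPI rank %d is UPC thread %d of %d\n", rank, MYTHREAD, THREADS);

        MPI_Finalize();
        return 0;
    }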

What does MPI need to do?
• Hybrid initialization
– MPI_Init_hybrid(int *argc, char ***argv, int ranks_per_group)
• When MPI is launched, it needs to know how many
processes are being launched
– Currently we use a flat model
– If 10 processes are being launched, we know that the world
size is 10
– Hybrid launching can be hierarchical
• If 10 processes are launched, each of which launches 10
other processes, the world size can be 100 (in the case of
Nested Multiple)
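
A hypothetical sketch of how the proposed call might be used; MPI_Init_hybrid is only the proposal from this slide, not a standard MPI routine, and passing THREADS as ranks_per_group assumes the Nested Multiple model where every UPC thread becomes an MPI rank:

    #include <upc.h>
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int world_size;

        /* Tell MPI how many ranks this group contributes so it can size
         * MPI_COMM_WORLD across the hierarchical launch. */
        MPI_Init_hybrid(&argc, &argv, THREADS);

        MPI_Comm_size(MPI_COMM_WORLD, &world_size);
        if (MYTHREAD == 0)
            printf("world size across all groups: %d\n", world_size);
        /* e.g., 10 groups launched with THREADS = 10 gives world_size = 100 */

        MPI_Finalize();
        return 0;
    }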

Other Issues with Interoperability
• No mapping between MPI and UPC ranks
– The application needs to figure out the mapping explicitly
– Can be done portably with a few MPI_Alltoall and
MPI_Allgather calls (see the sketch after this list)
• Communication deadlock
– In some cases deadlocks can be avoided by the implicit
progress made by either MPI or UPC
– Being handled as ticket #154
• Might get voted out
• Applications might need to assume the worst case
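
A minimal sketch of building the rank mapping explicitly, assuming the Flat model (every UPC thread also holds an MPI rank); the array and function names are illustrative, not from the slides:

    #include <upc.h>
    #include <mpi.h>
    #include <stdlib.h>

    int *upc_thread_of_rank;    /* indexed by MPI rank, gives the UPC thread id */

    void build_rank_map(void)
    {
        int world_size, my_upc_id = MYTHREAD;

        MPI_Comm_size(MPI_COMM_WORLD, &world_size);
        upc_thread_of_rank = malloc(world_size * sizeof(int));

        /* Every process contributes its UPC thread id; afterwards each
         * process knows which UPC thread sits behind every MPI rank. */
        MPI_Allgather(&my_upc_id, 1, MPI_INT,
                      upc_thread_of_rank, 1, MPI_INT, MPI_COMM_WORLD);
    }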

Other Issues with Interoperability (contd.)
• There is no sharing of MPI and UPC objects
– MPI does not know how to send data from the “global address
space”
• The user has to stage the data into the process's own
virtual address space (as sketched below)
– UPC cannot perform RMA into MPI windows
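
A sketch of the staging step, assuming the Flat model and an illustrative array size; the point is only that data must be copied out of the UPC global address space into private memory before MPI can touch it:

    #include <upc.h>
    #include <mpi.h>

    #define N 1024
    shared double data[N];              /* distributed UPC array */

    void send_copy(int dest, int tag)
    {
        double local[N];

        /* MPI only understands private virtual addresses, so pull the
         * shared elements into this thread's own memory first. */
        for (int i = 0; i < N; i++)
            local[i] = data[i];

        MPI_Send(local, N, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    }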

Implementation in MPICH2
• Rough implementation available
– Will be corrected once the details are finalized

Random Access Benchmark
• UPC: Threads access random elements of a distributed shared array (see the sketch below)
[Figure: a single shared array, shared double data[N], distributed across UPC threads P0 ... Pn]
• Hybrid: Array is replicated on every group

[Figure: two groups, each with its own replicated copy of shared double data[N] across threads P0 ... Pn/2]
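
A sketch of the UPC random-access kernel described above; the 1,000,000 accesses per process match the benchmark description on the following slide, while the array size and checksum are illustrative:

    #include <upc.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N            (1 << 22)
    #define NUM_ACCESSES 1000000

    shared double data[N];              /* cyclically distributed over threads */

    int main(void)
    {
        double sum = 0.0;
        srand(MYTHREAD + 1);

        for (int i = 0; i < NUM_ACCESSES; i++) {
            int idx = rand() % N;
            /* Elements owned by other threads are fetched with implicit
             * one-sided reads; local elements are plain loads. */
            sum += data[idx];
        }

        upc_barrier;
        if (MYTHREAD == 0)
            printf("done (checksum %.3f)\n", sum);
        return 0;
    }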

Impact of Data Locality on Performance
[Plot: time (sec), 0 to 1000, vs. number of cores, 1 to 128 on quad-core nodes, for UPC, Hybrid-4, Hybrid-8, and Hybrid-16]

• Each process performs 1,000,000 random accesses


• Weak scaling; the ideal result is a flat line
Percentage Local References
[Plot: percent local data, 0% to 100%, vs. number of cores, 1 to 128, for UPC, Hybrid-4, Hybrid-8, and Hybrid-16]

Barnes-Hut n-Body Cosmological Simulation
• Simulates gravitational interactions of a system of n bodies
• Represents 3-d space using an oct-tree
• Summarizes distant interactions using centers of mass

for i in 1..t_max
    t <- new octree()

    forall b in bodies
        insert(t, b)

    summarize_subtrees(t)

    forall b in bodies
        compute_forces(b, t)

    forall b in bodies
        advance(b)

Credit: Lonestar Benchmarks (Pingali et al.)
Hybrid Barnes Algorithm
for i in 1..t_max
    t <- new octree()                       // tree is distributed across the group

    forall b in bodies
        insert(t, b)

    summarize_subtrees(t)

    our_bodies <- partition(group_id, bodies)

    forall b in our_bodies                  // smaller distribution: only
        compute_forces(b, t)                // O(|our_bodies|) tree traversals per group

    forall b in bodies
        advance(b)

    Allgather(bodies)
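
A sketch of the final Allgather(bodies) step, assuming each group's updated partition is exchanged over an MPI communicator of group leaders; the body_t layout, the leaders communicator, and the helper names are assumptions, not from the slides:

    #include <mpi.h>
    #include <stdlib.h>

    typedef struct { double pos[3], vel[3], mass; } body_t;

    /* Gather every group's updated partition so that each group again
     * holds the full, replicated body set. */
    void exchange_bodies(const body_t *my_bodies, int my_count,
                         body_t *all_bodies, MPI_Comm leaders)
    {
        int nbytes = my_count * (int) sizeof(body_t);
        int nleaders, *counts, *displs;

        MPI_Comm_size(leaders, &nleaders);
        counts = malloc(nleaders * sizeof(int));
        displs = malloc(nleaders * sizeof(int));

        /* First learn how many bytes each group contributes ... */
        MPI_Allgather(&nbytes, 1, MPI_INT, counts, 1, MPI_INT, leaders);
        displs[0] = 0;
        for (int i = 1; i < nleaders; i++)
            displs[i] = displs[i - 1] + counts[i - 1];

        /* ... then gather everyone's bodies into the replicated array. */
        MPI_Allgatherv(my_bodies, nbytes, MPI_BYTE,
                       all_bodies, counts, displs, MPI_BYTE, leaders);

        free(counts);
        free(displs);
    }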

Barnes Force Computation
[Plot: speedup, 0 to 256, vs. number of cores, 0 to 256, for UPC, Hybrid-4, Hybrid-8, and Hybrid-16]

• Strong scaling: 100,000 body system
