
Cooperative Parallelism:

An evolutionary programming model
for exploiting massively parallel systems

David Jefferson, John May,
Nathan Barton, Rich Becker, Jarek Knap
Gary Kumfert, James Leek, John Tannahill
Lawrence Livermore National Laboratory

This work was performed under the auspices of the U.S. Department of Energy
by University of California Lawrence Livermore National Laboratory
under contract No. W-7405-Eng-48.
Blue Gene / L
65,536 x 2 processors, 360 Tflops (peak)

Petaflop (peak) machine in 2 years


Petaflop (sustained) in 5 years
Co-op is a new programming paradigm and
components model for petascale simulation

• Petascale performance driven by need for multiphysics, multiscale
models
– fluid -- molecule
– continuum metal -- crystal
– plasma -- charged particle
– classical -- quantum

• Multiphysics, multiscale models call for a simulation components
architecture
– whole, parallel simulation codes used as building blocks in larger simulations
– allows composition (federation) and reuse of codes already mature and trusted

• Multiphysics, multiscale models naturally exhibit MPMD parallelism
– different subsystems, or length and time scales, require multiphysics
– multiphysics most efficient with different codes in parallel

• Efficient use of petascale resources requires more dynamic
simulation algorithms
– much more flexible use of resources: dynamic (sub)allocation of processor nodes
– adaptive sampling family of multiscale algorithms
Co-op allows parallel simulations to be
used as components in larger computations

• Large parallel models treated as single objects:
– coupled with little knowledge
of each others’ internals

• Coupled models:
– different languages
– different parallel decomposition
– different physics

• Components:
– dynamically launched
– internally parallel
– externally parallel
– communicate in parallel

[Figure labels: time scale; space; state space; ensemble coupling for parametric sensitivity or optimization]
Strain rate localization can be predicted
with multiscale expanding cylinder model

1/8 exploding cylinder


• expands radially
• rings with reflecting
strain rate waves
• develops diagonal
shear bands
Classic SPMD embedding of fine-scale calculations

• nodes statically allocated and scheduled
• fine scale models executed sequentially

[Figure: coarse scale model and fine scale physics on 64 nodes; time for one major cycle]
Adaptive Sampling: a class of dynamic
algorithms for multiscale simulation

• Apply fine scale model where continuum model is invalid…

• …but just a sample of the elements

• Elsewhere, interpolate material response function from results
previously calculated

• Much less fine scale work; remaining computation may be
seriously unbalanced, however.

• More than an order of magnitude of performance improvement
may be achieved.

• Adaptive sampling is not AMR!

[Figure labels: coarse model is generally accurate; coarse model assumptions break down]
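The decision described in the bullets above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FineScaleDB`, `adaptive_sample`, `toy_fine_scale_model`, and the tolerance `tol` are hypothetical names, the database is a brute-force list rather than a real nearest-neighbor index, and "interpolation" is reduced to reusing the nearest stored result.

```python
import math


class FineScaleDB:
    """Hypothetical brute-force store of previously computed fine-scale results."""

    def __init__(self):
        self.entries = []  # list of (input point, response) pairs

    def nearest(self, x):
        """Return (distance, response) of the closest stored point, or None if empty."""
        best = None
        for point, response in self.entries:
            d = math.dist(point, x)
            if best is None or d < best[0]:
                best = (d, response)
        return best

    def insert(self, x, response):
        self.entries.append((tuple(x), response))


def adaptive_sample(x, db, fine_scale_model, tol):
    """Reuse a stored result when one is close enough; otherwise run the fine-scale model.

    Returns (response, evaluated), where evaluated is True only when the
    fine-scale model was actually executed.
    """
    hit = db.nearest(x)
    if hit is not None and hit[0] <= tol:
        return hit[1], False          # cheap path: database retrieval
    response = fine_scale_model(x)    # expensive path: full fine-scale evaluation
    db.insert(x, response)
    return response, True


def toy_fine_scale_model(x):
    """Stand-in for an expensive fine-scale simulation (assumption for illustration)."""
    return sum(v * v for v in x)


def run(points, tol=0.05):
    """Process all points and count how many needed a full fine-scale evaluation."""
    db = FineScaleDB()
    evaluations = 0
    results = []
    for x in points:
        response, evaluated = adaptive_sample(x, db, toy_fine_scale_model, tol)
        evaluations += evaluated
        results.append(response)
    return results, evaluations
```

The ratio `evaluations / len(points)` in this sketch plays the role of the evaluation fraction: the smaller it is, the more fine-scale work is replaced by database lookups.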
Co-op model adds layer of dynamic MPMD
parallelism to familiar SPMD paradigm

New parallelism layer:
• MPMD federation: composed of symponents
that use remote method invocation (RMI)

Familiar parallelism layers:
• SPMD symponent: composed of processes
that use MPI
• Process: composed of threads
that use shared variables, locks, etc.
• Thread: sequential, with vector, pipeline,
or multi-issue parallelism
Adaptive sampling app with integrated
fine scale DB

[Diagram: CSM (ALE3D Continuum, Coupler Lib, FS DB) connected to FSM Masters, each with FSM Servers]

n = 100 processes
z/p = 10^4 zones/process
z = 10^6 zones
T = 10^4 timesteps
? = 100 µsec/timestep
? = 10^-2 (eval fraction)
Co-op Architecture

• NodeSet allocate / deallocate
– Contiguous node sets only
– Suballocation from original allocation
– Algorithms somewhat like memory allocation

• Symponent launch
– Array of symponents can be launched on array of nodesets
by single call

• Component termination detection
– Parent symponent notified if child terminates

• Component kill
– Must work when target is deadlocked, looping, etc.
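Since the slide compares nodeset allocation to memory allocation, a first-fit allocator over a contiguous node range gives a feel for it. The sketch below is an assumption-laden illustration, not the Co-op runtime: the class `NodeSetAllocator`, its method names, and the first-fit policy are all hypothetical choices.

```python
class NodeSetAllocator:
    """Hypothetical first-fit allocator of contiguous node sets.

    A node set is represented as a (start, length) pair carved out of the
    job's original allocation, much like a block in a memory allocator.
    """

    def __init__(self, first_node, count):
        self.free = [(first_node, count)]  # sorted list of free (start, length) runs

    def allocate(self, size):
        """Return a contiguous (start, size) node set, or None if no run is big enough."""
        for i, (start, length) in enumerate(self.free):
            if length >= size:
                if length == size:
                    del self.free[i]
                else:
                    self.free[i] = (start + size, length - size)
                return (start, size)
        return None

    def deallocate(self, nodeset):
        """Return a node set to the free list and merge adjacent free runs."""
        self.free.append(nodeset)
        self.free.sort()
        merged = [self.free[0]]
        for start, length in self.free[1:]:
            last_start, last_length = merged[-1]
            if last_start + last_length == start:
                merged[-1] = (last_start, last_length + length)
            else:
                merged.append((start, length))
        self.free = merged
```

As with memory allocators, the contiguity requirement means a request can fail because of fragmentation even when enough nodes are free in total, which is why freed runs are merged eagerly here.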
Remote Method Invocation (RMI)

• General semantics
– Operation done by a thread on a symponent
– It can be nonblocking: caller gets a ticket and can later check, or wait for,
completion of the RMI
– Exceptions supported
– Concurrent RMIs on same symponent executed in nondeterministic order

• Three kinds of RMI recognized
– Sequential body, threaded execution
• Inter-thread synchronization required
• MPI in body not permitted
• Thread concurrency limited by OS
– Parallel body, serialized execution
• Atomic
• No recursion; no circularity (results in deadlock)
• MPI permitted and needed in body
– One way
• “Call” does not involve a return
• Essentially an asynchronous, one-sided “active” message
– Others might be recognized in the future
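The ticket semantics of a nonblocking RMI resemble a future. The following sketch uses Python's standard `concurrent.futures` as a local stand-in; it is not the Babel or Co-op API. `SymponentStub`, `invoke`, `invoke_one_way`, and `FineScaleModel` are hypothetical names, and the "remote" object lives in the same process purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class SymponentStub:
    """Hypothetical local stand-in for a remote reference to a symponent."""

    def __init__(self, target, max_threads=4):
        self._target = target
        # Models "sequential body, threaded execution": each call runs in its own thread.
        self._pool = ThreadPoolExecutor(max_workers=max_threads)

    def invoke(self, method, *args):
        """Nonblocking call: returns a ticket (a Future) immediately."""
        return self._pool.submit(getattr(self._target, method), *args)

    def invoke_one_way(self, method, *args):
        """One-way call: no ticket and no result are handed back to the caller."""
        self._pool.submit(getattr(self._target, method), *args)

    def shutdown(self):
        self._pool.shutdown(wait=True)


class FineScaleModel:
    """Toy callee; the method body is an assumption for illustration."""

    def evaluate(self, strain):
        if strain < 0:
            raise ValueError("strain must be non-negative")
        return 2.0 * strain


stub = SymponentStub(FineScaleModel())

ticket = stub.invoke("evaluate", 1.5)   # returns at once; caller may do other work
result = ticket.result()                # wait for completion of the call

failed = stub.invoke("evaluate", -1.0)  # exceptions travel back through the ticket
try:
    failed.result()
    error = None
except ValueError as exc:
    error = str(exc)

stub.shutdown()
```

Checking `ticket.done()` corresponds to "later check", and `ticket.result()` to "wait for" completion.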
More about RMI

• Inter-symponent synchronization
– RMIs queued, and executed only when callee executes
AtConsistentState() method
– Last RMI signaled by special RMI: continue()

• Intra-symponent synchronization
– Sequential body, threaded RMIs must use proper POSIX inter-thread
synchronization

• Implementation
– Babel RMI over TCP
– Persistent connections at the moment (except for one-way)
• Soon to be non-persistent
– Future implementations over
• MPI-2
• UDP
• Native packet transports
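One plausible reading of the inter-symponent synchronization rule is that incoming RMIs wait in a queue until the callee declares a consistent state, and that a `continue()` RMI ends the batch. The single-process sketch below illustrates that reading only; the class `Symponent`, the `post` helper, and the exact draining behavior are assumptions, not the Co-op implementation.

```python
from collections import deque


class Symponent:
    """Hypothetical callee that defers incoming RMIs until it is at a consistent state."""

    def __init__(self):
        self._pending = deque()  # RMIs received but not yet executed
        self.log = []            # side effects of executed RMIs, for demonstration

    def post(self, name, *args):
        """Caller side: enqueue an RMI; nothing runs yet."""
        self._pending.append((name, args))

    def record(self, value):
        """Example user RMI body (an assumption for illustration)."""
        self.log.append(value)

    def at_consistent_state(self):
        """Callee side: execute queued RMIs in arrival order.

        Returns True if a "continue" RMI ended the batch, False if the queue
        simply ran dry. RMIs queued after "continue" stay pending.
        """
        while self._pending:
            name, args = self._pending.popleft()
            if name == "continue":
                return True
            getattr(self, name)(*args)
        return False
```

The point of the design is that the callee's own computation is never interrupted mid-step: remote calls take effect only at points the callee has chosen.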
Babel and Co-op are intimately related

• Symponents are Babel objects

• Co-op RMI implemented over Babel RMI

• Symponent APIs expressed in Babel’s SIDL language

• Any thread with a reference to a symponent can call RMIs on it

• References can be passed as args, results

• Caller and callee can be in different languages

• Co-op rests totally on Babel for
– RMI syntax
– SIDL specification language
– Language interoperability
– Parts of implementation of RMI
Classic SPMD embedding of fine-scale calculations

• nodes statically allocated and scheduled
• fine scale models executed sequentially

[Figure: coarse scale model and fine scale physics on 64 nodes; time for one major cycle]
MPMD refactoring and parallelized fine
scale models

[Figure: coarse scale model and fine scale physics over time on 64 nodes]
Adaptive Sampling

• evaluation fraction is the most critical performance parameter

[Figure: coarse scale model, full fine scale simulations, and interpolated fine scale behavior over time on 64 nodes]
Adaptive sampling + active load balancing
yields dramatic speedup

[Figure: coarse scale model, adaptive sample fine scale simulations, and database retrieval and interpolation; axes: time, nodes]
Performance of adaptive sampling
using the Co-op programming model
[Plot: wallclock time (sec, 0 to 100000) versus sim time (µsec, 0.0 to 25.0) for three cases: classic model embedding, adaptive sampling, and adaptive sampling with load balancing]

Conclusions

• MP/MS simulation drives need for petascale
performance
• MP/MS simulation requires
– componentized model construction
– MPMD execution
– dynamic instantiation of components
• hence dynamic node allocation
– language interoperability
• Adaptive Sampling is amazingly powerful
End
PSI Project Overview

David Jefferson
Lawrence Livermore National Lab
Co-op allows whole parallel simulations to
be used in multiscale couplings

• separate codes, coupled
without knowledge of each
others’ internals

• different languages (Babel),
different decomposition,
different physics

• components
– can be dynamically launched
– are internally parallel
– execute in parallel
– communicate via RMI
Distribution of Coarse-scale and
Fine-scale Models across Processors
[Figure: wallclock time across processors for one coarse scale time step, showing the coarse-scale model and many instances of the fine-scale model]
MPMD refactoring allows better
scheduling of fine scale model executions

• remote fine scale models
• nodes dynamically allocated and scheduled
• improved performance due to better balance

[Figure: coarse scale model and fine scale physics over time on 64 nodes]
Additional parallelism then becomes
available
• fine scale model executions independent
• “nearest neighbor” DB queries are mostly independent and
easily parallelizable as well

[Figure: coarse scale model, adaptive sample fine scale simulations, and database retrieval and interpolation over time on 125 nodes]
Performance Data

• Primitive time constants
– 0.030 µsec Babel method invocation (one process, C-to-C)
– 9.43 µsec MPI ping-pong (between processes, C-to-C)
– 251 µsec OmniOrb Corba RPC (C++ to C++)
– 609 µsec Co-op blocking RMI call
– 539 msec symponent launch
– 273 msec symponent teardown
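A little arithmetic puts these constants in proportion. The sketch below only divides the figures listed above; the variable names are hypothetical, and the comparisons say nothing beyond what the listed numbers imply.

```python
# Timing constants copied from the list above.
MPI_PING_PONG_US = 9.43
CORBA_RPC_US = 251.0
COOP_RMI_US = 609.0
LAUNCH_MS = 539.0
TEARDOWN_MS = 273.0

# A blocking Co-op RMI costs roughly 65 MPI ping-pongs.
rmi_vs_mpi = COOP_RMI_US / MPI_PING_PONG_US

# It costs roughly 2.4 times the measured Corba RPC.
rmi_vs_corba = COOP_RMI_US / CORBA_RPC_US

# Launching and tearing down one symponent takes 812 msec in total,
# about the time of 1333 blocking RMI calls.
lifecycle_ms = LAUNCH_MS + TEARDOWN_MS
rmis_per_lifecycle = lifecycle_ms * 1000.0 / COOP_RMI_US
```

On these numbers, a dynamically launched symponent pays for itself only if it does substantially more than about a second of useful work, and RMI traffic is best kept coarse-grained relative to MPI traffic inside a symponent.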
Multiscale material science application
with parallel FS database

[Diagram: CSM (ALE3D, Coupler Lib) connected to DB Clones 1 through k, each with a DB Master and DB Servers, and to FSM Masters, each with FSM Servers]

query()   max = z / ?
insert()  max = z / ?   mean = ?z / ?
runFSM()  max = z / ?   mean = ?z / ?

n = 100 processes
z/p = 10^4 zones/process
z = 10^6 zones
T = 10^4 timesteps
? = 100 µsec/timestep
? = 10^-2 (eval fraction)
The PSI Project

• Development of Co-op model of hybrid
componentized MPMD computation.
– Definition of computational model and semantic
issues
– Implementation of Co-op runtime system
– Implementation of extensions to Babel
• Development of multiscale simulation
technology using Co-op
– Theory and practice of adaptive sampling
– Implementation of adaptive sampling coupler within
Co-op framework
– Implementation of Fine Scale Model “database”
suitable for adaptive sampling
• M-tree database with nearest neighbor queries
Co-op Capabilities
• NodeSet allocate/deallocate
– Suballocation of nodeset of any size from job’s static allocation
– Free sets of nodesets, not nodes

• Symponent launch / kill
– Any process can launch an SPMD executable as a new symponent with any
number of processes on a nodeset whose size divides n.
– Parent-child hierarchy: parent process notified of child death; child killed if
parent dies
– Launch uses SLURM srun
– Runaway or wedged symponent can be killed & its nodeset recovered

• Symponent remote references
– Symponents can have remote references to one another, which they use for
making RMI calls
– Remote references can be used as arguments in RMI calls

• Symponents and Babel
– Symponents are Babel objects, and present SIDL interfaces
– Symponents inherit interfaces in type hierarchy, so they can be treated in
object-oriented fashion
– A symponent RMI is a Babel RMI
• Full type safety
• Language independence / interoperability
Co-op Capabilities
• Symponent RMI & synchronization
– RMI calls are from a thread to a symponent
– RMIs are one-sided, unexpected, and by default nonblocking
– Any number of in- and out-args of any size and type can be used
– Full exception-throwing capability
– RMIs can only be executed when callee calls atConsistentState()
– Special “system” RMIs inherited by all symponents: continue() and
kill()
– Two kinds of user RMIs
• Sequential body, threaded execution, executes in Rank 0 only
– Body executes in rank 0 process only
– Body is sequential, and does not need MPI
– Concurrent RMIs must synchronize with one another as threads
• Parallel body, serialized execution, executes in all processes
– Each may be parallel, running on all processes of callee
symponent, but multiple RMI calls are serially executed, and hence
atomic
– Normally use MPI
Adaptive Sampling substitutes DB
retrieval and interpolation for full fine
scale evaluation
• subscale results tabulated in a DB
• faster DB queries and interpolations substituted for slower
fine scale model executions

[Figure: coarse scale model, adaptive sample fine scale simulations, and database retrieval and interpolation over time on 64 nodes]
Current implementation of Co-op runs
multiscale models on Linux cluster

[Diagram layers: FSMs and CSM; Co-opd and Co-oplib; Babel RMI (over UDP) and MPI; launch (SLURM / srun); Linux. Not shown: SLURM daemons and srun() processes]