Jefferson Coop

Cooperative Parallelism:
An evolutionary programming model
for exploiting massively parallel systems
David Jefferson, John May,
Nathan Barton, Rich Becker, Jarek Knap,
Gary Kumfert, James Leek, John Tannahill
Lawrence Livermore National Laboratory
This work was performed under the auspices of the U.S. Department of Energy
by University of California Lawrence Livermore National Laboratory
under contract No. W-7405-Eng-48.
Blue Gene / L
65,536 x 2 processors, 360 Tflops (peak)
Petaflop (peak) machine in 2 years
Petaflop (sustained) in 5 years
Co-op is a new programming paradigm and
components model for petascale simulation
• Petascale performance driven by need for multiphysics, multiscale models
– fluid -- molecule
– continuum metal -- crystal
– plasma -- charged particle
– classical -- quantum
• Multiphysics, multiscale models call for a simulation components architecture
– whole, parallel simulation codes used as building blocks in larger simulations
– allows composition (federation) and reuse of codes already mature and trusted
• Multiphysics, multiscale models naturally exhibit MPMD parallelism
– different subsystems, or length and time scales, require multiphysics
– multiphysics most efficient with different codes in parallel
• Efficient use of petascale resources requires more dynamic simulation algorithms
– much more flexible use of resources: dynamic (sub)allocation of processor nodes
– adaptive sampling family of multiscale algorithms
Co-op allows parallel simulations to be
used as components in larger computations
• Large parallel models treated as single objects:
– coupled with little knowledge of each other’s internals
• Coupled models:
– different languages
– different parallel decomposition
– different physics
• Components:
– dynamically launched
– internally parallel
– externally parallel
– communicate in parallel
[Figure: axes labeled time scale, space, and state space; ensemble coupling for parametric sensitivity or optimization]
Strain rate localization can be predicted
with multiscale expanding cylinder model
1/8 exploding cylinder
• expands radially
• rings with reflecting strain rate waves
• develops diagonal shear bands
Classic SPMD embedding of fine-scale calculations
• nodes statically allocated and scheduled
• fine scale models executed sequentially
[Figure: time for one major cycle across 64 nodes; legend: coarse scale model, fine scale physics]
Adaptive Sampling: a class of dynamic
algorithms for multiscale simulation
• Apply fine scale model where continuum model is invalid…
• …but just a sample of the elements
• Elsewhere, interpolate material response function from results previously calculated
• Much less fine scale work; remaining computation may be seriously unbalanced, however.
• More than an order of magnitude of performance improvement may be achieved.
• Adaptive sampling is not AMR!
[Figure: regions where the coarse model is generally accurate vs. regions where coarse model assumptions break down]
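The decision rule in the bullets above can be sketched in a few lines of Python. This is a minimal one-dimensional illustration under stated assumptions, not the project's coupler: `fine_scale_model`, `SampleDB`, `adaptive_sample`, and the tolerance `tol` are all hypothetical names, the database is a plain list with a brute-force nearest-neighbor search, and interpolation is linear between two bracketing points.

```python
import math

def fine_scale_model(x):
    # Hypothetical stand-in for an expensive fine-scale simulation.
    return math.sin(x)

class SampleDB:
    """Toy database of previously computed (input, response) pairs."""
    def __init__(self):
        self.points = []

    def nearest(self, x, k=2):
        # Brute-force nearest-neighbor query (assumption: a real system
        # would use an index structure instead of a linear scan).
        return sorted(self.points, key=lambda p: abs(p[0] - x))[:k]

    def insert(self, x, y):
        self.points.append((x, y))

def adaptive_sample(db, x, tol=0.05):
    """Return (response, ran_fine_scale) for input x."""
    near = db.nearest(x, k=2)
    if len(near) == 2:
        (x0, y0), (x1, y1) = sorted(near)
        # Interpolate only when x is bracketed by two close stored points.
        if x0 <= x <= x1 and 0 < (x1 - x0) <= tol:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0), False
    # Otherwise run the fine-scale model and remember the result.
    y = fine_scale_model(x)
    db.insert(x, y)
    return y, True

db = SampleDB()
coarse = [i * 0.04 for i in range(26)]      # first pass: nothing to reuse
between = [x + 0.02 for x in coarse[:-1]]   # second pass: bracketed points
results = [adaptive_sample(db, x) for x in coarse + between]
n_eval = sum(1 for _, ran in results if ran)
print(f"fine-scale runs: {n_eval} of {len(results)} queries")
```

In this toy run every query in the second pass is answered by interpolation, so the fraction of queries that trigger a fine-scale run falls as the database fills.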
Co-op model adds layer of dynamic MPMD
parallelism to familiar SPMD paradigm
New parallelism layer:
• MPMD federation: composed of symponents that use remote method invocation (RMI)
Familiar parallelism layers:
• SPMD symponent: composed of processes that use MPI
• Process: composed of threads that use shared variables, locks, etc.
• Thread: sequential, with vector, pipeline, or multi-issue parallelism
Adaptive sampling app with integrated
fine scale DB
[Diagram labels: CSM (ALE3D, Coupler, Continuum Lib, FS DB); FSM Master; FSM Servers]
n = 100 processes
z/p = 10^4 zones/process
z = 10^6 zones
T = 10^4 timesteps
? = 100 msec/timestep
? = 10^-2 (eval fraction)
Co-op Architecture
• NodeSet allocate / deallocate
– Contiguous node sets only
– Suballocation from original allocation
– Algorithms somewhat like memory allocation
• Symponent launch
– Array of symponents can be launched on array of nodesets by single call
• Component termination detection
– Parent symponent notified if child terminates
• Component kill
– Must work when target is deadlocked, looping, etc.
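Since the slide compares nodeset allocation to memory allocation, a first-fit allocator over contiguous node ranges gives a feel for the idea. This is a sketch under assumptions: the class `NodeSetAllocator`, its methods, and the first-fit policy are illustrative choices, not the Co-op runtime's actual interface or algorithm.

```python
class NodeSetAllocator:
    """First-fit allocator over node indices [0, total). Hypothetical sketch."""
    def __init__(self, total):
        # Sorted list of free extents, each stored as (start, length).
        self.free = [(0, total)]

    def allocate(self, n):
        """Return a contiguous nodeset (start, n), or None if none fits."""
        for i, (start, length) in enumerate(self.free):
            if length >= n:
                if length == n:
                    del self.free[i]
                else:
                    self.free[i] = (start + n, length - n)
                return (start, n)
        return None

    def deallocate(self, nodeset):
        """Return a nodeset to the pool and merge adjacent free extents."""
        self.free.append(nodeset)
        self.free.sort()
        merged = [self.free[0]]
        for start, length in self.free[1:]:
            last_start, last_len = merged[-1]
            if last_start + last_len == start:
                merged[-1] = (last_start, last_len + length)
            else:
                merged.append((start, length))
        self.free = merged

alloc = NodeSetAllocator(64)
a = alloc.allocate(16)       # nodes 0..15
b = alloc.allocate(32)       # nodes 16..47
c = alloc.allocate(32)       # fails: only 16 nodes remain
alloc.deallocate(a)
d = alloc.allocate(32)       # still fails: 32 nodes free, but not contiguous
alloc.deallocate(b)          # free extents now merge back into one range
```

The second failed request shows the fragmentation cost of the contiguous-only rule, the same trade-off a memory allocator faces.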
Remote Method Invocation (RMI)
• General semantics
– Operation done by a thread on a symponent
– It can be nonblocking: caller gets a ticket and can later check, or wait for,
completion of the RMI
– Exceptions supported
– Concurrent RMIs on same symponent executed in nondeterministic order
• Three kinds of RMI recognized
– Sequential body, threaded execution
• Inter-thread synchronization required
• MPI in body not permitted
• Thread concurrency limited by OS
– Parallel body, serialized execution
• Atomic
• No recursion; no circularity (results in deadlock)
• MPI permitted and needed in body
– One way
• “Call” does not involve a return
• Essentially an asynchronous, one-sided “active” message
– Others might be recognized in the future
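The nonblocking "ticket" semantics resemble a future. The sketch below uses Python's standard `concurrent.futures` as an analogy only: `ToySymponent`, `invoke`, and `run_fine_scale` are hypothetical names, the call runs on a local thread instead of a remote component, and nothing here reflects the actual Co-op or Babel API.

```python
from concurrent.futures import ThreadPoolExecutor

class ToySymponent:
    """Local stand-in for a remote component (assumption: threads, not RMI)."""
    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=4)

    def invoke(self, method, *args):
        # Nonblocking call: returns at once with a ticket (a Future).
        return self._pool.submit(method, *args)

    def shutdown(self):
        self._pool.shutdown(wait=True)

def run_fine_scale(strain):
    # Hypothetical method body; raises to show exception propagation.
    if strain < 0:
        raise ValueError("negative strain")
    return 2.0 * strain

target = ToySymponent()
ticket = target.invoke(run_fine_scale, 0.5)
# The caller may do other work here, poll ticket.done(), or block:
result = ticket.result()

# An exception raised in the callee surfaces when the caller waits.
bad_ticket = target.invoke(run_fine_scale, -1.0)
try:
    bad_ticket.result()
    caught = False
except ValueError:
    caught = True
target.shutdown()
```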
More about RMI
• Inter-symponent synchronization
– RMIs queued, and executed only when callee executes
AtConsistentState() method
– Last RMI signaled by special RMI: continue()
• Intra-symponent synchronization
– Sequential body, threaded RMIs must use proper POSIX inter-thread
synchronization
• Implementation
– Babel RMI over TCP
– Persistent connections at the moment (except for one-way)
• Soon to be non-persistent
– Future implementations over
• MPI-2
• UDP
• Native packet transports
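The queuing rule for inter-symponent synchronization can be illustrated with a single-threaded toy. All names are assumptions made for this sketch (`QueuedSymponent`, `post`, `continue_`, `at_consistent_state`); a real callee would block until the continue() RMI arrives, whereas this version simply returns when the queue is empty.

```python
from collections import deque

class QueuedSymponent:
    """Toy callee that defers incoming calls until it reaches a safe point."""
    def __init__(self):
        self._pending = deque()
        self._resume = False
        self.state = 0

    def post(self, name, *args):
        # Caller side: the request is queued, not executed.
        self._pending.append((name, args))

    def add(self, amount):
        # An ordinary user method.
        self.state += amount

    def continue_(self):
        # Special method marking the last call of a batch.
        self._resume = True

    def at_consistent_state(self):
        """Run queued calls in order until continue_ is seen."""
        executed = 0
        self._resume = False
        while self._pending and not self._resume:
            name, args = self._pending.popleft()
            getattr(self, name)(*args)
            executed += 1
        return executed

s = QueuedSymponent()
s.post("add", 3)
s.post("add", 4)
state_before = s.state        # nothing has run yet
s.post("continue_")
s.post("add", 10)             # arrives after continue_, stays queued
executed = s.at_consistent_state()
```

Deferring execution this way means the callee's internal state is only touched at points it has declared safe, which is the purpose the slide attributes to the consistent-state call.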
Babel and Co-op are intimately related
• Symponents are Babel objects
• Co-op RMI implemented over Babel RMI
• Symponent APIs expressed in Babel’s SIDL language
• Any thread with a reference to a symponent can call RMIs on it
• References can be passed as args, results
• Caller and callee can be in different languages
• Co-op rests totally on Babel for
– RMI syntax
– SIDL specification language
– Language interoperability
– Parts of implementation of RMI
Classic SPMD embedding of fine-scale calculations
• nodes statically allocated and scheduled
• fine scale models executed sequentially
[Figure: time for one major cycle across 64 nodes; legend: coarse scale model, fine scale physics]
MPMD refactoring and parallelized fine
scale models
[Figure: time vs. 64 nodes; legend: coarse scale model, fine scale physics]
Adaptive Sampling
• evaluation fraction is the most critical performance parameter
[Figure: time vs. 64 nodes; legend: coarse scale model, full fine scale simulations, interpolated fine scale behavior]
Adaptive sampling + active load balancing
yields dramatic speedup
[Figure: time vs. nodes; legend: coarse scale model, adaptive sample fine scale simulations, database retrieval and interpolation]
Performance of adaptive sampling
using the Co-op programming model
[Plot: wallclock time (sec, 0 to 100000) vs. simulation time (µsec, 0.0 to 25.0); curves: adaptive sampling, adaptive sampling with load balancing, classic model embedding]
Conclusions
• MP/MS simulation drives need for petascale performance
• MP/MS simulation requires
– componentized model construction
– MPMD execution
– dynamic instantiation of components
• hence dynamic node allocation
– language interoperability
• Adaptive Sampling is amazingly powerful
End
PSI Project Overview
David Jefferson
Lawrence Livermore National Lab
Co-op allows whole parallel simulations to
be used in multiscale couplings
• separate codes, coupled without knowledge of each other’s internals
• different languages (Babel), different decomposition, different physics
• components
– can be dynamically launched
– are internally parallel
– execute in parallel
– communicate via RMI
Distribution of Coarse-scale and
Fine-scale Models across Processors
[Figure: wallclock time for one coarse-scale time step; one coarse-scale model and many instances of the fine-scale model distributed across processors]
MPMD refactoring allows better
scheduling of fine scale model executions
• remote fine scale models
• nodes dynamically allocated and scheduled
• improved performance due to better balance
[Figure: time vs. 64 nodes; legend: coarse scale model, fine scale physics]
Additional parallelism then becomes
available
• fine scale model executions independent
• “nearest neighbor” DB queries are mostly independent and easily parallelizable as well
[Figure: time vs. 125 nodes; legend: coarse scale model, adaptive sample fine scale simulations, interpolation]
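Because the fine scale executions are independent, they can be handed to whichever worker is least loaded. The sketch below uses a greedy longest-task-first heuristic as one possible balancing policy. That policy, the function `balance`, and the task costs are assumptions chosen for illustration; the slides do not say which scheduling method Co-op uses.

```python
import heapq

def balance(task_costs, n_workers):
    """Greedy longest-task-first assignment; returns the load per worker."""
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    loads = [0.0] * n_workers
    for cost in sorted(task_costs, reverse=True):
        # Give the next-largest task to the currently least-loaded worker.
        load, w = heapq.heappop(heap)
        loads[w] = load + cost
        heapq.heappush(heap, (loads[w], w))
    return loads

# Hypothetical costs of eight independent fine-scale evaluations.
costs = [8, 7, 6, 5, 4, 3, 2, 1]

# Static split: first half to worker 0, second half to worker 1.
static_loads = [sum(costs[:4]), sum(costs[4:])]

# Dynamic assignment of the same tasks to two workers.
balanced_loads = balance(costs, 2)

print("static makespan:  ", max(static_loads))
print("balanced makespan:", max(balanced_loads))
```

The finishing time is set by the most heavily loaded worker, so evening out the loads shortens the cycle even though the total work is unchanged.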
Performance Data
• Primitive time constants
– 0.030 µsec  Babel method invocation (one process, C-to-C)
– 9.43 µsec  MPI ping-pong (between processes, C-to-C)
– 251 µsec  omniORB CORBA RPC (C++ to C++)
– 609 µsec  Co-op blocking RMI call
– 539 msec  symponent launch
– 273 msec  symponent teardown
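To put these constants in proportion, the short script below computes a few ratios from the numbers on the slide. Only the arithmetic is new; the dictionary keys are names invented for this example.

```python
# Timings quoted on the slide, all converted to microseconds.
timings_usec = {
    "babel_call": 0.030,
    "mpi_pingpong": 9.43,
    "corba_rpc": 251.0,
    "coop_rmi": 609.0,
    "launch": 539.0e3,
    "teardown": 273.0e3,
}

rmi = timings_usec["coop_rmi"]

# How much slower a blocking RMI is than other primitives.
rmi_vs_mpi = rmi / timings_usec["mpi_pingpong"]
rmi_vs_corba = rmi / timings_usec["corba_rpc"]

# How many blocking RMIs fit in one launch, and in launch plus teardown.
launch_in_rmis = timings_usec["launch"] / rmi
lifecycle_in_rmis = (timings_usec["launch"] + timings_usec["teardown"]) / rmi

print(f"RMI / MPI ping-pong:       {rmi_vs_mpi:.1f}x")
print(f"RMI / CORBA RPC:           {rmi_vs_corba:.1f}x")
print(f"launch in RMI calls:       {launch_in_rmis:.0f}")
print(f"launch + teardown in RMIs: {lifecycle_in_rmis:.0f}")
```

A blocking RMI costs roughly 65 MPI ping-pongs, and one launch costs roughly 885 RMIs, which suggests that both RMIs and symponent launches pay off only for coarse-grained work.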
Multiscale material science application
with parallel FS database
[Diagram labels: CSM (ALE3D, Coupler, Lib); DB Master with DB Servers, Clone 1 through Clone k; FSM Master with FSM Servers]
Annotations on the diagram:
– query(): max = z / ?
– insert(): max = z / ?, mean = ?z / ?
– runFSM(): max = z / ?, mean = ?z / ?
n = 100 processes
z/p = 10^4 zones/process
z = 10^6 zones
T = 10^4 timesteps
? = 100 msec/timestep
? = 10^-2 (eval fraction)
The PSI Project
• Development of Co-op model of hybrid componentized MPMD computation.
– Definition of computational model and semantic
issues
– Implementation of Co-op runtime system
– Implementation of extensions to Babel
• Development of multiscale simulation
technology using Co-op
– Theory and practice of adaptive sampling
– Implementation of adaptive sampling coupler within
Co-op framework
– Implementation of Fine Scale Model “database”
suitable for adaptive sampling
• M-tree database with nearest neighbor queries
Co-op Capabilities
• NodeSet allocate/deallocate
– Suballocation of nodeset of any size from job’s static allocation
– Free sets of nodesets, not nodes
• Symponent launch / kill
– Any process can launch an SPMD executable as a new symponent with any
number of processes on a nodeset whose size divides n.
– Parent-child hierarchy: parent process notified of child death; child killed if
parent dies
– Launch uses SLURM srun
– Runaway or wedged symponent can be killed & its nodeset recovered
• Symponent remote references
– Symponents can have remote references to one another, which they use for
making RMI calls
– Remote references can be used as arguments in RMI calls
• Symponents and Babel
– Symponents are Babel objects, and present SIDL interfaces
– Symponents inherit interfaces in type hierarchy, so they can be treated in
object-oriented fashion
– A symponent RMI is a Babel RMI
• Full type safety
• Language independence / interoperability
Co-op Capabilities
• Symponent RMI & synchronization
– RMI calls are from a thread to a symponent
– RMIs are one-sided, unexpected, and by default nonblocking
– Any number of in- and out-args of any size and type can be used
– Full exception-throwing capability
– RMIs can only be executed when callee calls atConsistentState()
– Special “system” RMIs inherited by all symponents: continue() and
kill()
– Two kinds of user RMIs
• Sequential body, threaded execution, executes in Rank 0 only
– Body executes in rank 0 process only
– Body is sequential, and does not need MPI
– Concurrent RMIs must synchronize with one another as threads
• Parallel body, serialized execution, executes in all processes
– Each may be parallel, running on all processes of callee
symponent, but multiple RMI calls are serially executed, and hence
atomic
– Normally use MPI
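The requirement that concurrent sequential-body RMIs synchronize as threads can be shown with a toy counter. This is a sketch under assumptions: `Rank0State` and `record_sample` are hypothetical names, and Python's `threading.Lock` stands in for the POSIX primitives a real symponent would use.

```python
import threading

class Rank0State:
    """Toy state owned by the rank 0 process of a callee symponent."""
    def __init__(self):
        self.samples = 0
        self._lock = threading.Lock()

    def record_sample(self, count):
        # Sequential-body method: several calls may run at once on
        # different threads, so the shared counter is updated under a lock.
        for _ in range(count):
            with self._lock:
                self.samples += 1

state = Rank0State()
threads = [
    threading.Thread(target=state.record_sample, args=(1000,))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = state.samples
print("samples recorded:", total)
```

The parallel-body kind avoids this burden by construction: calls run one at a time, which behaves like holding a single lock around each entire body.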
Adaptive Sampling substitutes DB
retrieval and interpolation for full fine
scale evaluation
• subscale results tabulated in a DB
• faster DB queries and interpolations substituted for slower
fine scale model executions
[Figure: time vs. 64 nodes; legend: coarse scale model, adaptive sample fine scale simulations, interpolation]
Current implementation of Co-op runs
multiscale models on Linux cluster
[Diagram, built up in three steps: each node runs Linux with a Co-opd daemon; the CSM is launched via SLURM srun and is layered on Co-oplib, Babel, and MPI; FSMs are launched the same way and exchange RMI (over UDP) with the CSM. Not shown: SLURM daemons and srun() processes]
