Abstract
1. Introduction
2. Semi-Lagrangian advection
The advection of a scalar field F(x, t) by a flow is governed by

\frac{\partial F}{\partial t} + (u \cdot \nabla)F = 0, \qquad (1)

where u(x, t) is the velocity field. In a Lagrangian frame of reference dF/dt = 0 and the function
F is constant along the path (trajectory, characteristic) of a fluid particle in the x-t
plane. Consider the arrival of a fluid particle at a grid point x_i at time t_n. It is
assumed that F(x, t) is known at all grid points x_i at time t_n - \Delta t and the values
F(x_i, t_n) are sought. Integrating (1) along a particle path and approximating dF/dt by
a finite difference along the trajectory gives F(x_i, t_n) = F(x_i - \alpha_i, t_n - \Delta t), where
\alpha_i is the displacement of the particle over the time step. The value at the departure
point x_i - \alpha_i must be interpolated from neighbouring grid values; a cubic interpolating
polynomial based on divided differences may be written
P(x) = F_2 (1 - \alpha) + F_3 \alpha
     + c_2\, \frac{h_2^2\, \alpha(\alpha - 1)\, [h_3 + h_2 (1 - \alpha)]}{h_1 + h_2 + h_3}
     + c_3\, \frac{h_2^2\, \alpha(\alpha - 1)\, (h_1 + h_2 \alpha)}{h_1 + h_2 + h_3}, \qquad (6)

where

\alpha = \frac{x - x_2}{h_2}, \qquad h_i = \Delta x_i, \qquad F_i = F(x_i),

and

c_i = \left[ \frac{F_{i+1} - F_i}{h_i} - \frac{F_i - F_{i-1}}{h_{i-1}} \right] \Big/ (x_{i+1} - x_{i-1}).
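As an illustration (this sketch is ours, not code from the model), formula (6) can be evaluated directly from the four neighbouring function values and the two precomputed divided differences c_2 and c_3; for cubic data the interpolant is exact:

   program cubic_demo
     implicit none
     real :: xg(4), fg(4), c2, c3, p
     xg = (/ 0.0, 1.0, 2.5, 3.5 /)    ! non-uniform grid points x_1..x_4
     fg = xg**3                       ! sample data: F(x) = x^3
     ! second divided differences c_2 and c_3, as defined above
     c2 = ((fg(3)-fg(2))/(xg(3)-xg(2)) - (fg(2)-fg(1))/(xg(2)-xg(1))) / (xg(3)-xg(1))
     c3 = ((fg(4)-fg(3))/(xg(4)-xg(3)) - (fg(3)-fg(2))/(xg(3)-xg(2))) / (xg(4)-xg(2))
     p  = cubic(1.7, xg, fg, c2, c3)  ! interpolate inside [x_2, x_3]
     print *, 'P(1.7) =', p, '  exact =', 1.7**3   ! cubic data is reproduced exactly
   contains
     real function cubic(x, xg, fg, c2, c3)
       real, intent(in) :: x, xg(4), fg(4), c2, c3
       real :: h1, h2, h3, a, s
       h1 = xg(2) - xg(1);  h2 = xg(3) - xg(2);  h3 = xg(4) - xg(3)
       a  = (x - xg(2)) / h2
       s  = h2*h2 * a * (a - 1.0) / (h1 + h2 + h3)     ! common factor of formula (6)
       cubic = fg(2)*(1.0-a) + fg(3)*a + c2*s*(h3 + h2*(1.0-a)) + c3*s*(h1 + h2*a)
     end function cubic
   end program cubic_demo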
In higher dimensions such an approach has several advantages for computer architec-
tures employing a high-speed cache memory. For example, a bicubic Lagrange
polynomial requires function values at 16 neighbouring grid points, whereas function
values and divided differences at the four nearest grid points are needed in the above
formulation. In fact, precomputing the differences is advantageous since the formulae
for bicubic splines turn out to be quite similar. Traditionally, splines have been the
method of choice for semi-Lagrangian advection schemes, but require the solution
of tridiagonal systems to ensure continuity of derivatives at the grid points and
polynomial interpolation may be more efficient. A detailed comparison of Eulerian
and semi-Lagrangian methods applied to passive advection can be found in [9].
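For reference, a bicubic value can be assembled by successive applications of a one-dimensional routine such as the one sketched above: four interpolations along x on the neighbouring rows, followed by one interpolation in y. The arrangement below is the generic tensor-product construction, not necessarily the exact organization used in the model, and it assumes the 1-D routine cubic is available as an external function; cxx holds precomputed x-direction divided differences.

   real function bicubic(xd, yd, xg, yg, f, cxx)
     implicit none
     real, intent(in) :: xd, yd          ! interpolation (departure) point
     real, intent(in) :: xg(4), yg(4)    ! neighbouring grid coordinates
     real, intent(in) :: f(4,4)          ! function values F(x_i, y_j)
     real, intent(in) :: cxx(2,4)        ! x divided differences c_2, c_3 for each row
     real, external   :: cubic           ! 1-D routine from the previous sketch
     real :: g(4), cy2, cy3
     integer :: j
     do j = 1, 4
        g(j) = cubic(xd, xg, f(:,j), cxx(1,j), cxx(2,j))   ! interpolate along x, row j
     end do
     ! divided differences in y of the intermediate values, then interpolate in y
     cy2 = ((g(3)-g(2))/(yg(3)-yg(2)) - (g(2)-g(1))/(yg(2)-yg(1))) / (yg(3)-yg(1))
     cy3 = ((g(4)-g(3))/(yg(4)-yg(3)) - (g(3)-g(2))/(yg(3)-yg(2))) / (yg(4)-yg(2))
     bicubic = cubic(yd, yg, g, cy2, cy3)
   end function bicubic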
For atmospheric modeling, special care must be taken in spherical coordinate systems
to ensure the numerical stability of semi-Lagrangian methods. The main difficulties
are associated with the metric terms in spherical geometry and estimating the
departure points, see [2,12,18,10]. Recent progress has also been made in understand-
ing the response to stationary forcing [ 131.
More generally, F may be forced along a trajectory:

\frac{dF}{dt} + G(x, t) = R(x, t), \qquad \frac{dF}{dt} = \frac{\partial F}{\partial t} + (u \cdot \nabla)F, \qquad (7)

\frac{dx}{dt} = u(x, t). \qquad (8)

A trapezoidal discretization of (7) along the trajectory gives

\frac{F^+ - F^0}{\Delta t} + \frac{1}{2}\,[G^+ + G^0] = \frac{1}{2}\,[R^+ + R^0],

and the displacement of the particle over one time step satisfies

\alpha = \Delta t\, u(x - \alpha/2,\; t - \Delta t/2),
where the superscripts + and 0 represent evaluation at the arrival point (x, t) and
the departure point (x - \alpha, t - \Delta t) respectively. In practice, the terms at different
time levels are grouped together and interpolation is performed on the combined
right-hand side terms as indicated below.
F++$G-R]+=F’-$G-R1”.
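A minimal one-dimensional sketch of this grouping (our own illustration, assuming a uniform grid, time-independent G and R, and linear interpolation standing in for the cubic formula (6)) is:

   subroutine sl_step(n, x, hx, alpha, f, g, r, dt, fnew)
     implicit none
     integer, intent(in) :: n
     real, intent(in)    :: x(n), hx, alpha(n), f(n), g(n), r(n), dt
     real, intent(out)   :: fnew(n)
     real :: rhs(n), xd, w, rhs_d
     integer :: i, ix
     rhs = f - 0.5*dt*(g - r)                    ! combined right-hand side, time level 0
     do i = 1, n
        xd = x(i) - alpha(i)                     ! departure point of the trajectory
        ix = max(1, min(n-1, int((xd - x(1))/hx) + 1))   ! interval containing xd
        w  = (xd - x(ix))/hx
        rhs_d = (1.0-w)*rhs(ix) + w*rhs(ix+1)    ! interpolated value at departure point
        fnew(i) = rhs_d - 0.5*dt*(g(i) - r(i))   ! solve for F at the arrival point
     end do
   end subroutine sl_step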
For the shallow-water equations,

\frac{\partial u}{\partial t} + (u \cdot \nabla)u + f\, k \times u + \nabla \phi = 0, \qquad (9)

\frac{\partial \phi}{\partial t} + (u \cdot \nabla)\phi + \phi\, \nabla \cdot u = 0, \qquad (10)

or, in component form for the momentum,

\frac{du}{dt} - f v + \phi_x = 0, \qquad (11)

\frac{dv}{dt} + f u + \phi_y = 0, \qquad (12)
The trajectory displacements are obtained by iterating

\alpha^{[k+1]} = \Delta t\, u(x - \alpha^{[k]}/2,\; t - \Delta t/2). \qquad (17)
Self-advection of momentum in Eqs. (14)-(16) requires extrapolation in time [16].
Several schemes are possible, but a simple Adams-Bashforth method is sufficiently
accurate to obtain midpoint values of the wind at grid points.
The departure point of the trajectory arriving at grid point x(i) and the index of the grid interval containing it are computed as

   xd = x(i) - alpha(i)             ! departure point for the trajectory arriving at x(i)
   ix = int((xd - x(1))/hx) + 1     ! index of the (uniform) grid interval containing xd
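The complete departure-point calculation can then be sketched as follows (again our own illustration; the two-level Adams-Bashforth extrapolation shown is assumed to be the "simple Adams-Bashforth method" referred to above, and linear interpolation of the wind is used for brevity):

   subroutine departure(n, x, hx, u_nm1, u_nm2, dt, niter, alpha)
     implicit none
     integer, intent(in) :: n, niter
     real, intent(in)    :: x(n), hx, u_nm1(n), u_nm2(n), dt   ! winds at t-dt and t-2dt
     real, intent(inout) :: alpha(n)          ! previous displacement used as first guess
     real :: umid(n), xm, w, um
     integer :: i, k, ix
     umid = 1.5*u_nm1 - 0.5*u_nm2             ! extrapolated wind at grid points, t - dt/2
     do i = 1, n
        do k = 1, niter                       ! fixed-point iteration, Eq. (17)
           xm = x(i) - 0.5*alpha(i)           ! trajectory midpoint
           ix = max(1, min(n-1, int((xm - x(1))/hx) + 1))
           w  = (xm - x(ix))/hx
           um = (1.0-w)*umid(ix) + w*umid(ix+1)   ! wind interpolated at the midpoint
           alpha(i) = dt*um
        end do
     end do
   end subroutine departure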
The total flop count per grid point for one time step of semi-Lagrangian advection
is 157 flops. If it is known that the wind is constant in time, then the operation
count per time step can be reduced since the trajectories, and hence the departure
points, need to be computed only once. In this case 53 flops are needed to compute
divided differences and the bicubic polynomial.
3. Parallel implementation
The scalability analysis technique described by Foster et al. [5] will be used
throughout the discussion. The terminology and definitions employed by the authors
are widely adopted in the parallel processing literature. Consider the simplified model
of a parallel computer consisting of p processors, each executing at the same speed
and able to exchange data by means of messages sent across a high-speed interconnec-
tion network. During the execution of a parallel program each processor will perform
useful computations; however, there will also be overhead associated with communication.
The sequential time T_{seq} is defined to be the execution time of a good sequential
implementation of an algorithm. For a parallel program, the execution time is
T = T_{comp} + T_{comm}, where T_{comp} represents the time spent computing and T_{comm} is the
communications overhead. To simplify the analysis it is assumed that T_{comp} = T_{seq}/p.
T_{comm} = 2\left( t_s + h\,t_h + t_w \lceil C+1 \rceil \frac{n_y}{p_y} \right)
         + 2\left( t_s + h\,t_h + t_w \lceil C+1 \rceil \frac{n_x}{p_x} \right). \qquad (22)
Table 1
Target machine parameters (times in \mu s)

Machine      Topology      t_s      t_h      t_w
T_{comm} = 4\left( t_s + h\,t_h + t_w \lceil C+1 \rceil \frac{n}{\sqrt{p}} \right). \qquad (23)
(24)
where T_{comp} represents the execution time for one time step on a single processor.
The value of T_{comp} must be "calibrated" for the particular machine and will depend
on the problem size (due to cache effects), the flop count per grid point and the
execution rate of the processor. Let t_{fl} represent the time to complete one floating
point operation for a given program. If the execution rate is 1/t_{fl} for the semi-
Lagrangian algorithm, then T_{comp} is approximately 157 t_{fl} per grid point. For a grid
of size 160 x 160, the resulting execution time is T_{seq} = 0.200 s per time step. The logical
process structure of a p_x × p_y processor mesh implemented using PVM 3.0 on the
iPSC/860 does not map directly onto the hypercube network (e.g. through the use
of binary reflected Gray codes). Therefore, it is assumed that the communications
channels are shared and the communications overhead is modeled by N = p in (21).
Eqs. (22) and (23) are then modified accordingly.
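The resulting performance model is easily coded. The sketch below is our own and makes two assumptions the text leaves implicit: the sharing factor N multiplies the per-word time t_w, and the predicted speedup is formed as T_seq/(T_seq/p + T_comm). The machine parameters shown are placeholders, not the values of Table 1.

   program speedup_model
     implicit none
     real, parameter :: ts = 75.0e-6, th = 0.0, tw = 0.4e-6   ! illustrative values only
     real :: tseq, tcomm, s
     integer :: nx, ny, px, py, p, h, c
     nx = 160;  ny = 160;  c = 1              ! grid size and Courant number
     tseq = 0.200                             ! measured single-node time per step (s)
     px = 4;  py = 4;  p = px*py;  h = 1      ! processor mesh and hop count
     ! Eq. (22) with the per-word time scaled by the sharing factor N = p
     tcomm = 2.0*(ts + h*th + p*tw*real(c+1)*real(ny)/real(py)) &
           + 2.0*(ts + h*th + p*tw*real(c+1)*real(nx)/real(px))
     s = tseq / (tseq/real(p) + tcomm)        ! assumed speedup definition
     print *, 'predicted speedup on', p, 'processors:', s
   end program speedup_model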
To assess how well the problem scales on the iPSC/860, the problem size is
increased up to a grid size of 1280 x 640. Predicted speedup curves are plotted as
solid lines in Fig. 2 and the observed values are plotted as single points. Predicted
and observed execution times for 100 time steps of the program on the iPSC/860
are also plotted in Fig. 3. The predictions fit well with the observed speedup and
execution times, indicating that the available bandwidth is decreasing with the
number of processors and that timings on the iPSC/860 are only influenced by t_w,
since t_s and t_h are negligible. A single node execution rate of 17.34 Mflops is obtained
on the Cray T3D for a 160 x 160 grid, whereas 16.15 Mflops is obtained for a
640 x 320 grid. It appears, therefore, that performance of the DEC Alpha processor
is affected by problem size and access to cache memory. A logical mesh process
topology maps directly onto the 3-D torus interconnection network of the T3D.
Nevertheless, our tests indicate that available bandwidth is reduced as the number
of processors increases. Sharing of communication channels on the T3D was modeled
by setting N = 2& in (21).
In order to assess scalability, the problem size was increased once again on the
T3D. Predicted speedup curves are plotted in Fig. 4, where observed values are
plotted as single points. Predicted and observed execution times for 100 time steps
are plotted in Fig. 5. For small size problems, the predictions are quite accurate. In
the case of very large grids, the observed speedups are slightly larger. This effect is
most likely due to the decrease in size of the local grids, resulting in fewer cache
misses. Surprisingly, the Cray T3D implementation of PVM appears to impose a
larger latency and for small size problems this could limit performance. A better
choice might be to use the Cray T3D SHMEM shared memory primitives to
minimize communication overheads.
The basic motivation for the analysis presented in this paper is to eventually
develop a full 3-D atmospheric model on the sphere for massively parallel computers.
Even though the implementation of parallel advection on a 2-D Cartesian plane has
provided useful information, several important issues need to be addressed in spheri-
cal geometry. These are the use of fixed overlap regions and the difficulties associated
with the poles.
Williamson [17] and Zero [19] have gathered statistics on the average length of
particle trajectories at different latitudes in existing atmospheric models on the
sphere. Their results indicate that near the poles one can expect to encounter Courant
numbers on the order of C = 15 to C = 20. Therefore, the fixed overlap strategy
would require very wide overlap regions near the poles.
Message passing libraries such as PVM or MPI should also aid in the development of these
new methods.
To summarize, we found that our performance models provided a good fit to the
experimental results and it was also made clear that the semi-Lagrangian algorithm
based on cubic interpolation is scalable due to a high flop count per grid point. We
believe that our results are very promising for the eventual parallel implementation
of a complete semi-implicit, semi-Lagrangian shallow-water model based on a block
data distribution of the computational grid. Each time step in such a model consists
of semi-Lagrangian advection followed by a semi-implicit correction, requiring the
solution of a nonlinear Helmholtz problem. The task of building an efficient parallel
elliptic solver still remains.
Acknowledgements
We are most grateful to John Drake of the Oak Ridge National Laboratory for
providing access to an Intel iPSC/860. The efforts of John Tulley and Jimmy Scott
of Cray Canada in obtaining time on a Cray T3D are very much appreciated. In
particular, we would like to thank Ivar Lie and Roar Skålin for sharing their insights
S. Thomas, J. C&e / Simulation Practice and Theory 3 (1995) 223-238 231
ExecutionTime
lo3 I I I 1 I
=‘
lo*
g10'
P ‘1
loo
X
-1 0
lo-
I I I I I
into parallel SLT algorithms with us. We would also like to thank Pierre Gauthier
and Michel Valin of RPN for reviewing the manuscript.
References