Particle-in-Cell Simulation Codes in High Performance Fortran
Erol Akarsu‡, Kivanc Dincer†‡, Tomasz Haupt†, Geoffrey C. Fox†‡
†Northeast Parallel Architectures Center
‡Department of Electrical Engineering and Computer Science
111 College Place, Mail Stop 3-217
Syracuse University
Syracuse, NY 13244-4100
{akarsu, dincer, haupt, gcf}@npac.syr.edu
Abstract
Particle-in-Cell (PIC) plasma simulation codes model the interaction of charged particles
with surrounding electrostatic and magnetic fields. PIC's computational requirements are
classified as one of the grand-challenge problems facing the high-performance computing community.
In this paper we present the implementation of 1-D and 2-D electrostatic PIC codes in High
Performance Fortran (HPF) on an IBM SP-2. We used one of the most successful commercial
HPF compilers currently available and augmented the compiler's missing HPF functions with
extrinsic routines when necessary. We obtained a near-linear speed-up in execution time
and a performance comparable to the native message-passing implementations on the same
platform.
Keywords: Particle-in-Cell, High Performance Fortran.
1 Introduction
High Performance Fortran [1] has been accepted as the standard set of data parallel extensions to Fortran
90 [4]. Its aim is to simplify the process of developing data parallel applications on distributed
memory systems. HPF programs are expected to be scalable and portable, as their performance
is preserved while moving them to different platforms with a comparable number of processors.
The relative simplicity of developing codes in HPF comes from the fact that HPF provides a
global name space, as sequential codes do. The tedious and error-prone tasks of detecting
interprocessor communication and generating appropriate calls to the runtime system are left to the
compiler. This makes HPF codes easier to understand, develop, debug, and maintain.
HPF language features allow the user to control two basic aspects of parallel computation: data
distribution and parallelism. The user specifies the data distributions using compiler directives.
The parallelism is expressed explicitly using Fortran 90 array syntax augmented by the FORALL
construct, the INDEPENDENT directive, and a rich set of new intrinsic and library functions.
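As a generic illustration (this fragment is not taken from the PIC code), the two mechanisms can be combined as follows: a PROCESSORS and DISTRIBUTE directive pair maps two arrays block-wise onto four processors, and a FORALL statement expresses the parallel update explicitly.

      REAL    :: a(1024), b(1024)
      INTEGER :: i
!HPF$ PROCESSORS :: procs(4)
!HPF$ DISTRIBUTE (BLOCK) ONTO procs :: a, b
      ! each processor updates only the block of a and b it owns
      FORALL (i = 1:1024) a(i) = 2.0 * b(i)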
HPF is a new technology, and only a few compilers are available at this time. We question in
this paper whether this technology actually fulfills its promises. We selected a nontrivial, well-
known application, the Particle-in-Cell (PIC) plasma simulation code, and implemented it in HPF.
Particle-in-cell plasma simulation codes are large-scale application codes that comprise thousands
of lines of code and have large memory and high-performance requirements. The numerical simu-
lation of plasmas is one of the Grand Challenge problems facing the high-performance computing
community. These characteristics of PIC codes provide a strong motivation for their paralleliza-
tion. However, data accesses with multiple levels of indirection and complicated message-passing
patterns in some parts of the parallel PIC codes make them hard to implement in HPF.
The PIC method is an alternative to direct particle-particle methods, in which the simulation in-
cludes the interaction of all the particles in the simulation domain and the complexity is O(Np^2)
for a system of Np particles. Instead, the PIC method simulates the interaction of particles with
the fields resulting from the particles' charges. The simulation domain is partitioned into cells
separated by grid points. Each cell contains a number of particles. The PIC method is especially
preferable for simulating systems containing on the order of 10^6 or more particles, since its computa-
tional complexity is limited to O(Np) for the particles and O(Ng log Ng) for the grid points, where Ng
is the number of grid points in the simulation domain. In the PIC method, moving the particles
to a new position dominates the runtime of the algorithm, and this phase can be parallelized easily
due to the independent behavior of the particles.
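As a rough illustration (the numbers are chosen only for this example), for Np = 10^6 particles a direct particle-particle method requires on the order of Np^2 = 10^12 pair interactions per time step, whereas the PIC method on a grid of Ng = 16K points requires on the order of Np + Ng log Ng ≈ 1.2 x 10^6 operations, a reduction of roughly six orders of magnitude.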
The basic PIC algorithm consists of an initialization phase followed by four processing phases
that are repeated many times (Figure 1). In the scatter phase, the particle attributes are interpo-
lated to nearby points of a regular computational mesh that discretizes the problem domain. The
appropriate field equations are then solved on the computational mesh (field solve phase), and the
force on each particle due to the resultant fields is found by interpolation on the mesh (gather phase).
Finally, the particles are repositioned under the influence of this force in the particle push phase.
Parallel PIC methods have been previously studied by a number of researchers on a wide variety
of platforms using different parallel programming methodologies [11, 12, 13, 14, 15, 16].
We implemented one- and two-dimensional PIC simulation codes in HPF. We compiled our
codes using PGI's HPF compiler and ran them on an IBM SP-2 with up to 32 nodes.
In this paper we talk about our data decomposition and parallelization strategy, and emphasize
HPF features used in individual phases of the simulation code. We assess the power of HPF in
representing such complicated codes. For the sake of simplicity, most of the following discussion
involves only the one-dimensional PIC method; the interested reader may refer to [9, 10] for higher
dimensional PIC methods.
Figure 1: The basic PIC algorithm. Initialization: initialize particle and grid (field) data; arrange particle and field partitions. Scatter: compute local charge densities; update charges on partition borders; copy Q to QC. Field solve: inverse FFT; solve the Poisson equation; FFT. Gather: copy FC to F; update forces on partition borders. Particle push: update positions and velocities; move particles to new partitions. Print statistics: print particle and grid data; print statistics.
2 Data Decomposition
The performance of the PIC simulation codes depends critically on the decomposition (distribu-
tion) strategy of the grid points and particles among the nodes of a distributed memory machine
or workstation cluster. We used a variant of Eulerian decomposition [14], which assigns to each
processor a fixed spatial partition of the grid and the particles within the cells bounded by
those grid points. This avoids a high communication cost in the gather and scatter phases, yet
load imbalance among processors may develop after repeated iterations as a result of excessive
particle accumulation in certain cells.
Figure 2: Illustration of particle and grid decompositions, the associated charge densities at each grid
point, and the interaction among them.
Figure 2 shows how a one-dimensional domain of length 16 is decomposed into grid (field) and
particle partitions assuming there are four processors (nvp=4).
The virtual processor domain in HPF is declared as follows:
!HPF$ PROCESSORS :: PROCS(nvp)
Particle partitions (PP) are used to divide the particles and the particle computation efficiently
among the processors. The particles are distributed according to their coordinate x; that is, the
k-th partition (PPk) contains all particles with edges(left,k) < x <= edges(rght,k). The second
dimension of edges enumerates the partitions, and the array is distributed block-wise along that dimension:
INTEGER, DIMENSION(2, numpp) :: edges
!HPF$ DISTRIBUTE (*, BLOCK) ONTO PROCS :: edges
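For illustration only, the partition boundaries of Figure 2 could be set up as in the sketch below; the domain length nx and the values of the index constants left and rght are assumptions, not taken from the original listing.

      ! Illustrative sketch (not from the original code): uniform partition
      ! boundaries for a domain of nx unit cells split across numpp partitions.
      DO k = 1, numpp
         edges(left, k) = (k-1) * nx / numpp
         edges(rght, k) =  k    * nx / numpp
      END DO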
The particles' parameters are stored in the array part(i,j,k): the first index selects between the
spatial coordinate and the velocity of the particle, the second enumerates the particles in the par-
tition, and the third enumerates the partitions. Consequently, the array part is of shape
(2,npmax,numpp), where npmax is the maximum number of particles allowed in one partition.
npmax sets an upper limit on the number of particles that can be placed into a partition and
prevents enormous load imbalances among partitions. The current number of particles for each
PP is kept in the npp(numpp) array. The particle array is distributed in a block fashion:
!HPF$ DISTRIBUTE (*, *, BLOCK) ONTO PROCS :: part
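The npp array that holds the per-partition particle counts would typically be given a matching distribution so that each processor owns the counts for its own partitions; the directive below is a sketch and does not appear in the original listing.

!HPF$ DISTRIBUTE (BLOCK) ONTO PROCS :: npp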
Field partitions (FP) are used to divide the electromagnetic computation uniformly among the
processors. Each partition has an equal number of cells (delimited by the grid points) of unit
length. For example, in Figure 2 we have 4 cells and 4 primary grid points for each of the 4 partitions.
Each partition owns the grid point on its right boundary.
Since the interactions between the field and the particles are not confined to a partition, each
subdomain stores additional information about the charge density associated with grid points that
lie outside the partition: one guard point on the left of the partition and two guard points on
the right of the partition (Figure 2), which shadow the rightmost grid point of the left partition and
the two leftmost grid points of the right partition, respectively. This allows each processor to calculate
the partial effect of its local particles on these overlapped grid points. The total charge density
at the primary grid points is found by adding the partial deposits of the guard cells of the neighboring
partitions.
The charge densities on the grid points (the primary grid points and those of the guard cells) are stored
in the real charge array q(nxpmx,numpp), where nxpmx is the number of primary and guard-cell
grid points of one partition and numpp is the number of particle partitions. These structures are
distributed as:
!HPF$ DISTRIBUTE (*, BLOCK) ONTO PROCS :: q
In order to solve the field equations, we have to transform the real charge density q into the
complex charge density qc(kxp,numfp), where kxp is the number of complex grid points in each partition
and numfp is the number of field partitions. The complex charge density on each processor is
calculated from the local real charge density values of the primary grid points and those of the overlapped
grid points of the neighboring processors. The number of complex grid points for each partition is
equal to one half of its primary grid points. Each processor holds all the complex charges for
the complex grid points in its subdomain.
!HPF$ DISTRIBUTE (*, BLOCK) ONTO PROCS :: qc
The complex grid points are aligned with the odd-valued real primary grid points.
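One way to state this alignment explicitly is an HPF ALIGN directive; the line below is a sketch of the relationship rather than a line from the code, which may instead rely on the matching BLOCK distributions given above.

!HPF$ ALIGN qc(j,k) WITH q(2*j-1,k)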
3 Implementation Details
In this section we will describe how we implemented each phase of the PIC code using HPF.
Figure 3: For a one-dimensional simulation, interaction of a particle (a) on the lower half of a grid
interval; (b) on the upper half of a grid interval with the neighboring grid points.
The scatter phase consists of three major steps. In the first step, the charge densities at each grid
point due to nearby charged particles are calculated using a second-order spline interpolation with
periodic boundaries. The interaction between the particles and nearby grid points is shown in
Figure 3. Since only the particles local to a partition participate in the computation of the charge
densities at this step, the whole computation can be done easily using an INDEPENDENT DO loop over
the number of field partitions (Figure 4) without any communication. Each processor executes
the inner body for all the partitions belonging to it. The variable n is assigned a value and
then used as an index on the left-hand side of assignment expressions. Because the compiler at
this time lacks adequate dependency and communication analysis mechanisms, these types of loops
are automatically serialized instead of complicated runtime inspector-executor loops [8] being
generated. To prevent this performance degradation, we wrote an extrinsic routine that does static
loop partitioning based on loop indices and executes the loop body in parallel.
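This first scatter step can be sketched as follows. The loop structure follows the description above; the particle charge qm, the exact spline weights, and the local grid numbering are illustrative assumptions, and the guard points are assumed to absorb the n-1 and n+1 accesses at the partition ends.

!HPF$ INDEPENDENT, NEW(j, n, dx)
      DO k = 1, numpp                     ! loop over the partitions
         DO j = 1, npp(k)                 ! particles local to partition k
            n  = INT(part(1,j,k) + 0.5)   ! nearest grid point (local numbering assumed)
            dx = part(1,j,k) - n
            ! second-order spline: deposit the charge on the three nearest points
            q(n,  k) = q(n,  k) + qm * (0.75 - dx*dx)
            q(n+1,k) = q(n+1,k) + qm * 0.5 * (0.5 + dx)**2
            q(n-1,k) = q(n-1,k) + qm * 0.5 * (0.5 - dx)**2
         END DO
      END DO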
In the second step, the partial results at guard grid points are sent to the appropriate processors
to augment the partial values at shadowed primary grid points of neighboring partitions.
Figure 5: Copying data from real charge density to complex charge density.
FORALL (k = 1:kblok, j = 1:kxp)
   qc(j,k) = CMPLX(q(2*j-1,k), q(2*j,k))
END FORALL

C Consider contributions from guard cells. NP is the number of partitions.
FORALL (k = 1:kblok)
   ! add the last two guard cells to the first complex charge of partition k+1
   qc(1, MOD(k,NP)+1) = qc(1, MOD(k,NP)+1) + CMPLX(q(nxpmx-1,k), q(nxpmx,k))
   ! add the left guard cell to the last complex charge of partition k-1
   qc(kxp, MOD(k-2+NP,NP)+1) = qc(kxp, MOD(k-2+NP,NP)+1) + CMPLX(0.0, q(1,k))
END FORALL
Figure 6: Copying data from real charge density array to complex charge density array (Scatter-steps
2 and 3).
[Figure: mapping of input indices to processor IDs across the three communication phases of the parallel FFT.]
Figure 9: Finding the electrical fields at FP grid points by using a Poisson solver.
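For reference, the electrostatic field solve carried out on the complex grid amounts to solving Poisson's equation in Fourier space (a standard formulation stated here for completeness, not copied from the code):
\[ \hat{\phi}(k) = \frac{\hat{\rho}(k)}{\varepsilon_0 k^2}, \qquad \hat{E}(k) = -\,i\,k\,\hat{\phi}(k), \]
after which the transform back to the spatial grid yields the complex field values fc that are copied into the real field array f in the gather phase (Figure 10).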
FORALL (k = 1:kblok, j = 1:kxp)
   f(2*j-1,k) = REAL(fc(j,k))
   f(2*j,k)   = AIMAG(fc(j,k))
END FORALL
C Consider contributions to guard cells
FORALL (k = 1:kblok)
   f(1,k)       = AIMAG(fc(kxp, MOD(k-2+NP,NP)+1))
   f(nxpmx-1,k) = REAL(fc(1, MOD(k,NP)+1))
   f(nxpmx,k)   = AIMAG(fc(1, MOD(k,NP)+1))
END FORALL
Figure 10: Copying data from the complex field array into the real field array.
The electrostatic force at each particle position is calculated by interpolation of the electric
fields on the grid points nearest to the given position. The interaction between the electrical fields
on grid points and nearby particles is the reverse of that shown in Figure 3. The electrostatic
force for partition k at local grid point j is kept in f(j,k).
The particle push phase can be performed in HPF by using an INDEPENDENT DO loop over all
the particle partitions. Since all data items used in the loop body are local, there is no need
for interprocessor communication. However, the compiler we used serialized this loop when
it detected the dynamically computed value of nn in the body, so we wrote extrinsic routines to
simulate the effect of the INDEPENDENT DO loop. The compiler assumes that the use of nn to index
the f array will cause a wide range of elements to be accessed, which may disturb inter-iteration
independence, and it therefore completely serializes the loop.
!HPF$ INDEPENDENT, NEW(j,nn,dx,dv)
      DO k = 1, numpp                        ! loop over the particle partitions
         DO j = 1, npp(k)                    ! particles in partition k (POS = 1, VEL = 2)
            nn = INT(part(POS,j,k) + 0.5)    ! nearest grid point
            dx = part(POS,j,k) - nn
            dv = part(VEL,j,k) + (...dx...f(nn,k)...f(nn+1,k)...f(nn-1,k)...)
            part(VEL,j,k) = dv
            part(POS,j,k) = part(POS,j,k) + dv*dt
         END DO
      END DO
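Since the compiler serialized the INDEPENDENT loop shown above, a local extrinsic version of it was used instead. The following is a minimal sketch of such a routine using HPF's EXTRINSIC(HPF_LOCAL) mechanism; the routine name push_local and its argument list are assumptions, and the force interpolation is elided as in the loop above.

      EXTRINSIC(HPF_LOCAL) SUBROUTINE push_local(part, f, npp, dt)
      ! Each processor sees only its local blocks of the distributed arrays
      ! and runs the push loop over the partitions it owns.
      REAL,    INTENT(INOUT) :: part(:,:,:)   ! local block of part(2,npmax,numpp)
      REAL,    INTENT(IN)    :: f(:,:)        ! local block of f(nxpmx,numpp)
      INTEGER, INTENT(IN)    :: npp(:)        ! particle counts of the local partitions
      REAL,    INTENT(IN)    :: dt
      INTEGER, PARAMETER :: POS = 1, VEL = 2  ! index convention described in the text
      INTEGER :: j, k, nn
      REAL    :: dx, dv
      DO k = 1, SIZE(npp)                     ! partitions owned by this processor
         DO j = 1, npp(k)
            nn = INT(part(POS,j,k) + 0.5)
            dx = part(POS,j,k) - nn
            dv = part(VEL,j,k)                ! + interpolated force from f(nn-1:nn+1,k), elided
            part(VEL,j,k) = dv
            part(POS,j,k) = part(POS,j,k) + dv*dt
         END DO
      END DO
      END SUBROUTINE push_local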
Figure 12: Moving the particles into appropriate spatial regions after the Push phase.
Figure 13: Determining the outgoing particles and placing them into send buffers (Pmove1-1).
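The step named in Figure 13 can be sketched as follows. The send buffers sbufl and sbufr, the OUT-side counters jsl(OUT,k) and jsr(OUT,k), and the loop structure are assumptions; the hole list ihole and the remaining names follow those used in Figure 15 and in the text.

!HPF$ INDEPENDENT, NEW(j, nl, nr, nh)
      DO k = 1, kblok
         nl = 0; nr = 0; nh = 0
         DO j = 1, npp(k)
            IF (part(POS,j,k) .LE. edges(left,k)) THEN       ! particle leaves to the left
               nl = nl + 1; nh = nh + 1
               sbufl(POS,nl,k) = part(POS,j,k)
               sbufl(VEL,nl,k) = part(VEL,j,k)
               ihole(nh,k) = j                               ! remember the freed slot
            ELSE IF (part(POS,j,k) .GT. edges(rght,k)) THEN  ! particle leaves to the right
               nr = nr + 1; nh = nh + 1
               sbufr(POS,nr,k) = part(POS,j,k)
               sbufr(VEL,nr,k) = part(VEL,j,k)
               ihole(nh,k) = j
            END IF
         END DO
         jsl(OUT,k) = nl
         jsr(OUT,k) = nr
         jss(OUT,k) = nh
      END DO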
In the last step we place the incoming particles into the particle array using Pmove1_part5(). If any gaps
remain in the particle arrays, then the particles toward the end of the array are shifted forward
to fill in the gaps (Figure 15).
These five steps are repeated until each particle arrives at its final destination. However, since
many particles travel for only short distances, usually only a few iterations are needed.
4 Performance Results
Tables 1 and 2 show the completion time of one iteration of the main phases of the 1-D serial
PIC code for a varying number of grid points and a varying number of particles, respectively. We
repeated the main phase 225 times for the 1-D code and 325 times for the 2-D code. The loop
iteration time in the tables shows the total time for the specified number of iterations. As can be
seen from Table 2, when the number of particles changes but the number of grid points remains
constant, the execution times of the push and migrate steps of the particle push phase and of the first
step of the scatter phase increase linearly. On the other hand, the FFT(), Poisson(), and copy()
routines of the scatter and gather phases take the same time. When we keep the number of
particles the same but increase the number of grid points, the routines other than FFT(),
Poisson(), and copy() stay constant, while the execution times of those three routines increase linearly
(Table 1).
We illustrate the execution time changes with respect to a varying number of grid points and a varying
number of particles for the 2-D serial PIC codes in Tables 3 and 4, respectively. The behavior of the 2-D figures
!HPF$ INDEPENDENT, NEW(j,offset)
      DO k = 1, kblok                          ! number of partitions
         IF (jsl(IN,k) .LE. jss(OUT,k)) THEN
            ! put into holes
            DO j = 1, jsl(IN,k)
               part(POS, ihole(j,k), k) = rbufl(POS, j, k)
               part(VEL, ihole(j,k), k) = rbufl(VEL, j, k)
            END DO
            ! check right ones
            jss(IN,k) = MIN(jss(OUT,k) - jsl(IN,k), jsr(IN,k))
            DO j = 1, jss(IN,k)
               part(POS, ihole(j+jsl(IN,k), k), k) = rbufr(POS, j, k)
               part(VEL, ihole(j+jsl(IN,k), k), k) = rbufr(VEL, j, k)
            END DO
            IF (jsr(IN,k) .GT. jss(OUT,k) - jsl(IN,k)) THEN
               DO j = 1, jsr(IN,k) - jss(IN,k)
                  part(POS, npp(k)+j, k) = rbufr(POS, j+jss(IN,k), k)
                  part(VEL, npp(k)+j, k) = rbufr(VEL, j+jss(IN,k), k)
               END DO
            END IF
         ELSE
            jss(IN,k) = MIN(jsl(IN,k), jss(OUT,k))
            DO j = 1, jss(IN,k)
               part(POS, ihole(j,k), k) = rbufl(POS, j, k)
               part(VEL, ihole(j,k), k) = rbufl(VEL, j, k)
            END DO
            offset = jsl(IN,k) - jss(IN,k)
            DO j = 1, offset
               part(POS, npp(k)+j, k) = rbufl(POS, j+jss(IN,k), k)
               part(VEL, npp(k)+j, k) = rbufl(VEL, j+jss(IN,k), k)
            END DO
            DO j = 1, jsr(IN,k) - offset
               part(POS, npp(k)+offset+j, k) = rbufr(POS, j, k)
               part(VEL, npp(k)+offset+j, k) = rbufr(VEL, j, k)
            END DO
            IF ((jss(OUT,k) - jsl(IN,k) - jsr(IN,k)) .GT. 0) THEN
               offset = jss(OUT,k) - jsl(IN,k) - jsr(IN,k)
               DO j = 1, offset
                  part(POS, ihole(jsl(IN,k)+jsr(IN,k)+j, k), k) = part(POS, npp(k), k)
                  part(VEL, ihole(jsl(IN,k)+jsr(IN,k)+j, k), k) = part(VEL, npp(k), k)
                  npp(k) = npp(k) - 1
               END DO
            END IF
         END IF
      END DO
Figure 15: Distributing incoming particles from buffers (Pmove1-5).
is similar to that of the 1-D figures.
When we compare the execution times of the 1-D problem with 32K grid points and 4M
particles to the 2-D problem with 128*256 = 32K grid points and 3.4M particles, we notice that
the push step and the first step of the scatter phase in the 2-D case take more time. This may seem to
be an anomaly, but we should remember that in the 2-D case nine points participate
in the interpolation, while there are only three points in the 1-D case. The same reasoning is valid
for the push module.
From these four tables we can conclude that the performance of the main routines of the program
is either dependent on the number of particles or on the number of grid points.
Next, we investigate the behavior of the main routines in parallel environments using HPF.
Table 5 shows the speedup pattern of the 1-D HPF routines for different numbers of processors.
One observation is that increasing the number of processors decreases the execution times of all
the routines almost linearly, except for FFT() and copy(). These two routines involve interprocessor
communication; therefore, increasing the number of processors does not really help to improve their
execution time. In the other routines the computation is all local, so using more processors
divides the work into more parallel pieces and reduces the overall time. Table 11 presents a
similar pattern for the 2-D codes.
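For instance, the loop iteration time in Table 5 drops from 1818.0 seconds on one processor to 94.5 seconds on 32 processors, a speedup of about 1818/94.5 ≈ 19; the gap to the ideal factor of 32 is largely due to the communication-bound FFT() and copy() routines.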
As seen in Table 7 (1-D) and Table 13 (2-D), when we keep the number of grid points and
the number of processors constant and change the number of particles, the execution times of the
scatter, push, and migrate routines increase linearly while FFT(), Poisson(), and copy() are unaffected.
Tables 6 and 12 demonstrate the case where the numbers of particles and processors are kept the
same and the number of grid points is changed, for the 1-D and 2-D cases, respectively.
In order to judge the success of the HPF codes as compared to their message-passing equivalents,
we repeated the same experiments using message-passing programs written with the SP-2's native MPL
library. The speedups of the 1-D and 2-D MPL PIC codes (Tables 8 and 14) are similar to those of the HPF
versions of the same codes. The other cases for the MPL versions are summarized in Tables 10 and
9 for the 1-D codes, and in Tables 16 and 15 for the 2-D codes. The results are approximately equal
to the HPF test results, which shows that HPF compilers can approach the performance of
message-passing programs for simulating PIC methods.
5 Conclusion
Coding and compiling large-scale application programs for execution on distributed-memory paral-
lel machines and workstation clusters is a big challenge. High Performance Fortran eases this task
by letting the user define the data decompositions and parallel fragments of the code and doing
the rest of the work itself: it simply detects the necessary communication patterns and generates
calls to the runtime libraries to make the necessary data exchanges among the processors. Until
recently, HPF compilers were capable only of generating code for toy applications. The compilers
still have restricted capabilities, and some important pieces of functionality are missing.
We used PGI's HPF compiler and successfully implemented the PIC codes. We had to augment
the HPF codes with extrinsic procedures supplying the missing functionalities of the compiler,
but we always remained loyal to the standard language definition while implementing those statements.
Finally, we obtained a performance comparable to the native message-passing implementations on
the IBM SP-2 platform.
Acknowledgements
We would like to thank Victor Decyk for supplying his 1-D and 2-D PIC codes, which were the main
focus of this work. We would also like to thank David Walker for sending us a set of research
papers about PIC, Robert D. Ferraro for helping us to find message-passing PIC benchmark
data on other architectures, and Elaine Weinman for proofreading this manuscript.
References
[1] High Performance Fortran Forum (HPFF), High Performance Fortran Language Specification, Scientific Programming, (2)1, July 1993.
[2] J. Li and M. Chen, Compiling Communication Efficient Programs for Massively Parallel Machines, IEEE Trans. on Parallel and Distributed Systems, (2)3:361-375, July 1991.
[3] C. H. Koelbel, D. B. Loveman, R. S. Schreiber, G. L. Steele, M. E. Zosel, The High Performance
Fortran Handbook, MIT Press 1994.
[4] M. Metcalf, J. Reid, Fortran 90 Explained, Oxford, 1990.
[5] P. C. Liewer and V. K. Decyk, A General Concurrent Algorithm for Particle-in-Cell Simulation Codes, J. of Computational Physics, 85, pp. 302-322, 1989.
[6] V. Sunderam, PVM: A Framework for Parallel Distributed Computing, Concurrency: Practice and Experience, 2(4):315-339, 1990.
[7] Message Passing Interface Forum, Document for a Standard Message Passing Interface, Oct.
1993.
[8] C.Koelbel, P.Mehrotra, J.Saltz, and S.Berryman, Parallel Loops on Distributed Machines, In
Proc. of the 5th Distributed Memory Computing Conference, Charleston, SC, April 1990.
[9] C. K. Birdsall and A. B. Langdon, Plasma Physics via Computer Simulation, McGraw-Hill,
New York, 1985.
[10] R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles, Adam Hilger,
Bristol, 1988.
[11] C. D. Norton, B. K. Szymanski, V. K. Decyk, Object Oriented Parallel Computation for
Plasma Simulation, Communications of ACM, 1995.
[12] T. Hoshino, R. Hiromoto, S. SekiGuchi, S. Majima, Mapping Schemes of the Particle-in-Cell Method Implemented on the PAX Computer, Parallel Computing, (9):53-75, 1988.
[13] D. W. Walker, Particle-in-Cell Plasma Simulation Codes on the Connection Machine, Computing Systems in Engineering, (2)2/3:307-319, 1991.
[14] D. W. Walker, Characterizing the Parallel Performance of a Large-Scale, Particle-in-Cell Plasma Simulation Code, Concurrency: Practice and Experience, (2)4:257-288, Dec. 1990.
[15] R. D. Ferraro, P. C. Liewer and V. K. Decyk, Dynamic Load Balancing for a 2D Concurrent
Plasma PIC Code, J. Computational Physics 109, pp. 329, 1993.
[16] V. K. Decyk, Skeleton PIC Codes for Parallel Computers, Computer Physics Communications
87, pp. 87, 1995.
Author Biographies
Erol Akarsu is a Ph.D. student in Computer and Information Science at Syracuse University.
He received the B.S. degree in Computer Engineering from Ege University, Turkey, in 1991,
the M.S. degree in Computer Engineering from Istanbul Technical University, Turkey, in 1993,
and the M.S. degree in Computer Science from Syracuse University in 1995. His research
interests include parallel compiler design, the design and analysis of parallel algorithms, and high
performance computing. (akarsu@npac.syr.edu)
Kivanc Dincer is a Ph.D. candidate at Syracuse University and a research assistant at North-
east Parallel Architectures Center. He has an M.S. degree in Computer Science from Iowa State
University. He worked on several HPCC-related research projects, including development of the NPAC
F90D/HPF compiler and implementation of the Parallel Compiler Runtime Consortium (PCRC)
runtime support systems. His major research interests include parallel processing, distributed com-
puting and Web-based parallel programming environments. (dincer@npac.syr.edu)
(http://www.npac.syr.edu/users/dincer/)
Tomasz Haupt graduated from Jagiellonian University, Krakow, Poland (Ph.D. in Physics
1985). After several years of research in experimental high-energy physics, he changed his re-
search interest toward computer science. He is currently a research scientist at NPAC, Syracuse
University. He is one of the people who contributed to the HPF Language Specification. He
played a leading role in developing the Syracuse Fortran 90D/HPF compiler. Currently he is
involved in developing data-parallel applications, and benchmarking commercial HPF compilers.
(haupt@npac.syr.edu) (http://www.npac.syr.edu/users/haupt/)
Geoffrey C. Fox is an internationally recognized expert in the use of parallel architectures and
the development of concurrent algorithms. He leads a major project to develop prototype high
performance Fortran (Fortran90D) compilers. He is also a leading proponent for the development
of computational science as an academic discipline and a scientific method. His research on
parallel computing has focused on development and use of this technology to solve large scale
computational problems. Fox directs InfoMall, which is focused on accelerating the introduction
of high speed communications and parallel computing into New York State industry and developing
the corresponding software and systems industry. Much of this activity is centered on NYNET
with ISDN and ATM connectivity throughout the state including schools where Fox is leading
developments of new K-12 applications that exploit modern technology. (gcf@npac.syr.edu)
(http://www.npac.syr.edu/users/gcf/)
Appendix
Number of Grid Points | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
32k 2.91 0.002 0.28 0.005 0.002 4.88 0.0 1895.31
64k 2.91 0.004 0.61 0.010 0.004 4.88 0.0 2042.19
128k 2.91 0.007 1.30 0.002 0.007 4.88 0.0 2350.62
256k 2.91 0.013 2.75 0.004 0.013 4.88 0.0 2995.02
Table 1: Timing Results(in seconds) for 1D PIC serial code using 4505600 particles.
Table 2: Timing Results(in seconds) for 1D PIC serial code using 128K Grid Points.
Number of Grid Points | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
16*32 5.46 0.0009 0.0009 0.0008 0.0009 8.14 0.0 4403.75
32*64 5.44 0.0009 0.0009 0.0008 0.0009 8.18 0.0 4409.73
64*128 5.45 0.0009 0.0009 0.0008 0.0009 8.14 0.0 4404.27
128*256 5.45 0.0009 0.0050 0.0090 0.0009 8.15 0.0 4453.80
Table 3: Timing Results(in seconds) for 2D PIC serial code using 3571712 particles.
Number of Particles | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
313344 0.41 0.0099 0.0099 0.0099 0.0099 0.72 0.0 391.30
368640 0.55 0.0099 0.0099 0.0099 0.0099 0.83 0.0 458.12
589824 0.87 0.0099 0.0099 0.0099 0.0099 1.35 0.0 735.67
1474560 2.19 0.0099 0.0099 0.0099 0.0099 3.36 0.0 1826.50
3571712 5.30 0.0099 0.0099 0.0099 0.0099 8.18 0.0 4408.30
Table 4: Timing Results(in seconds) for 2D PIC serial code using 32*64 Grid Points.
Number of Processors | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
1 2.910 0.001 0.123 0.003 0.001 4.880 0.000 1818.0
2 1.450 0.004 0.075 0.003 0.004 1.930 0.800 999.0
4 0.730 0.003 0.045 0.003 0.003 0.970 0.410 513.0
8 0.360 0.003 0.030 0.002 0.003 0.480 0.210 261.0
16 0.188 0.005 0.023 0.003 0.005 0.248 0.115 144.0
32 0.094 0.007 0.024 0.004 0.007 0.136 0.070 94.5
Table 5: Timing Results(in seconds) for 1D PIC code in HPF using 4505600 Particles and 16K Grid
Points.
Number of Grid Points | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
32k 0.367 0.004 0.058 0.003 0.004 0.488 0.212 283.5
64k 0.367 0.004 0.112 0.004 0.004 0.488 0.212 315.0
128k 0.367 0.006 0.240 0.009 0.006 0.488 0.212 369.0
256k 0.367 0.008 0.470 0.020 0.008 0.488 0.212 468.0
Table 6: Timing Results(in seconds) for 1D PIC code in HPF using 4505600 Particles and 8 Processors.
Number of Particles | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
11264 0.002 0.006 0.23 0.013 0.006 0.004 0.008 108.0
90112 0.008 0.006 0.23 0.013 0.006 0.012 0.012 117.0
72089 0.060 0.006 0.23 0.013 0.006 0.082 0.040 135.0
5767168 0.470 0.006 0.23 0.013 0.006 0.620 0.260 423.0
2306867 1.830 0.006 0.23 0.013 0.006 2.600 1.040 1368.0
Table 7: Timing Results(in seconds) for 1D PIC code in HPF using 16K Grid Points and 8 Processors.
Number of Processors | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
1 2.910 0.001 0.123 0.003 0.001 4.880 0.00 1818.000
2 1.450 0.004 0.075 0.003 0.004 1.930 0.81 1004.067
4 0.730 0.003 0.029 0.003 0.003 0.980 0.40 505.620
8 0.360 0.003 0.030 0.002 0.003 0.490 0.20 263.880
16 0.189 0.005 0.029 0.005 0.005 0.250 0.10 132.120
32 0.090 0.007 0.009 0.004 0.007 0.136 0.07 85.500
Table 8: Timing Results(in seconds) for 1D PIC code in MPL using 4505600 Particles and 16K Grid
Points.
Number of Grid Points | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
32k 0.37 0.004 0.058 0.003 0.004 0.490 0.200 270.00
64k 0.37 0.004 0.112 0.004 0.004 0.490 0.200 307.62
128k 0.37 0.006 0.240 0.009 0.006 0.478 0.209 346.41
256k 0.36 0.008 0.419 0.019 0.008 0.490 0.209 452.25
Table 9: Timing Results(in seconds) for 1D PIC code in MPL using 4505600 Particles and 8 Processors.
Number of Particles | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
11264 0.002 0.006 0.20 0.0099 0.006 0.004 0.008 95.58
90112 0.008 0.006 0.20 0.0099 0.006 0.012 0.012 100.17
72089 0.060 0.006 0.20 0.0099 0.006 0.082 0.040 133.47
5767168 0.470 0.006 0.20 0.0090 0.006 0.620 0.260 421.56
2306867 1.830 0.006 0.20 0.0090 0.006 2.600 1.040 1334.61
Table 10: Timing Results(in seconds) for 1D PIC code in MPL using 16K Grid Points and 16 Processors.
Number of Processors | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
1 5.450 0.0009 0.005 0.009 0.0009 8.150 0.000 4453.8
4 1.300 0.004 0.015 0.003 0.0040 1.890 0.380 1196.0
8 0.650 0.004 0.009 0.003 0.0040 0.950 0.230 624.0
16 0.329 0.004 0.009 0.003 0.0040 0.486 0.178 325.0
32 0.167 0.005 0.015 0.004 0.0050 0.251 0.204 247.0
Table 11: Timing Results(in seconds) for 2D PIC code in HPF using 3571712 Particles and 128 * 256
Grid Points.
Number of Grid Points | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
16*32 0.655 0.003 0.002 0.003 0.003 0.952 0.530 728.00
32*64 0.655 0.003 0.003 0.003 0.003 0.952 0.354 676.00
64*128 0.655 0.003 0.004 0.002 0.003 0.952 0.266 637.00
128*256 0.655 0.004 0.009 0.002 0.004 0.952 0.226 624.00
Table 12: Timing Results(in seconds) for 2D PIC code in HPF using 3571712 Particles and 8 Processors.
Number of Particles | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
313344 0.004 0.004 0.005 0.003 0.004 0.005 0.014 39.0
368640 0.009 0.004 0.005 0.003 0.004 0.015 0.025 39.0
589824 0.030 0.004 0.005 0.003 0.004 0.046 0.054 65.0
1474560 0.124 0.004 0.005 0.003 0.004 0.190 0.164 169.0
3571712 0.332 0.004 0.005 0.003 0.004 0.486 0.379 416
Table 13: Timing Results(in seconds) for 2D PIC code in HPF using 32*64 Grid Points and 16
Processors.
Number of Processors | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
1 5.45 0.0009 0.0050 0.0090 0.0009 8.150 0.000 4453.80
2 2.66 0.0009 0.0250 0.0090 0.0009 3.820 0.689 2368.60
4 1.32 0.0090 0.0099 0.0099 0.0090 1.880 0.370 1206.53
8 0.66 0.0099 0.0099 0.0099 0.0099 0.940 0.180 614.25
16 0.33 0.0040 0.0090 0.0030 0.0040 0.479 0.100 303.03
32 0.15 0.0050 0.01500 0.0040 0.0050 0.250 0.200 237.90
Table 14: Timing Results(in seconds) for 2D PIC code in MPL using 3571712 Particles and 128 * 256
Grid Points.
Number of Grid Points | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
16*32 0.660 0.003 0.002 0.003 0.003 0.949 0.219 614.51
32*64 0.655 0.003 0.003 0.003 0.003 0.952 0.219 617.37
64*128 0.670 0.003 0.004 0.002 0.003 0.949 0.189 611.26
128*256 0.670 0.004 0.009 0.002 0.004 0.952 0.189 615.29
Table 15: Timing Results(in seconds) for 2D PIC code in MPL using 3571712 Particles and 8 Processors.
Number of Particles | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
313344 0.029 0.004 0.005 0.003 0.004 0.039 0.029 41.197
368640 0.029 0.004 0.005 0.003 0.004 0.050 0.029 45.604
589824 0.030 0.004 0.005 0.003 0.004 0.046 0.054 65.000
1474560 0.050 0.004 0.005 0.003 0.004 0.090 0.029 135.330
3571712 0.140 0.004 0.005 0.003 0.004 0.209 0.059 134.680
Table 16: Timing Results(in seconds) for 2D PIC code in MPL using 32*64 Grid Points and 16
Processors.