Particle-in-Cell Simulation Codes in High Performance Fortran
Erol Akarsu‡, Kivanc Dincer†‡, Tomasz Haupt†, Geoffrey C. Fox†‡
†Northeast Parallel Architectures Center
‡Department of Electrical Engineering and Computer Science
111 College Place, Mail Stop 3-217
Syracuse University
Syracuse, NY 13244-4100
{akarsu, dincer, haupt, gcf}@npac.syr.edu
Abstract
Particle-in-Cell (PIC) plasma simulation codes model the interaction of charged particles
with surrounding electrostatic and magnetic fields. PIC's computational requirements are
classified as one of the grand-challenge problems facing the high-performance computing community.
In this paper we present the implementation of 1-D and 2-D electrostatic PIC codes in High
Performance Fortran (HPF) on an IBM SP-2. We used one of the most successful commercial
HPF compilers currently available and augmented the compiler's missing HPF functions with
extrinsic routines when necessary. We obtained a near-linear speed-up in execution time
and a performance comparable to the native message-passing implementations on the same
platform.
Keywords: Particle-in-Cell, High Performance Fortran.
1 Introduction
High Performance Fortran [1] has been accepted as the standard set of data parallel extensions to Fortran
90 [4]. Its aim is to simplify the process of developing data parallel applications on distributed
memory systems. HPF programs are expected to be scalable and portable, as their performance
is preserved while moving them to different platforms with a comparable number of processors.
The relative simplicity of developing codes in HPF comes from the fact that HPF provides a
global name space, as sequential codes do. The tedious and error-prone tasks of detecting
interprocessor communication and generating appropriate calls to the runtime system are left to the
compiler. This makes HPF codes easier to understand, develop, debug, and maintain.
HPF language features allow the user to control two basic aspects of parallel computation: data
distribution and parallelism. The user specifies the data distributions using compiler directives.
The parallelism is expressed explicitly using Fortran 90 array syntax augmented by the FORALL
construct, the INDEPENDENT directive, and a rich set of new intrinsic and library functions.
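As a generic illustration (this fragment is not taken from the PIC code), the two mechanisms can be combined as follows: a PROCESSORS and DISTRIBUTE directive pair maps two arrays block-wise onto four processors, and a FORALL statement expresses the parallel update explicitly.

      REAL    :: a(1024), b(1024)
      INTEGER :: i
!HPF$ PROCESSORS :: procs(4)
!HPF$ DISTRIBUTE (BLOCK) ONTO procs :: a, b
      ! each processor updates only the block of a and b it owns
      FORALL (i = 1:1024) a(i) = 2.0 * b(i)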
HPF is a new technology, and only a few compilers are available at this time. We question in
this paper whether this technology actually fulfills its promises. We selected a nontrivial, well-
known application, the Particle-in-Cell (PIC) plasma simulation code, and implemented it in HPF.
Particle-in-cell plasma simulation codes are large-scale application codes that comprise thousands
of lines of code and have large memory and high-performance requirements. The numerical simu-
lation of plasmas is one of the Grand Challenge problems facing the high-performance computing
community. These characteristics of PIC codes provide a strong motivation for their paralleliza-
tion. However, data accesses with multiple levels of indirection and complicated message-passing
patterns in some parts of the parallel PIC codes make them hard to implement in HPF.
The PIC method is an alternative to direct particle-particle methods, in which the simulation in-
cludes the interaction of all the particles in the simulation domain and the complexity is O(Np^2)
for a system of Np particles. Instead, the PIC method simulates the interaction of particles with
the fields resulting from the particles' charges. The simulation domain is partitioned into cells
separated by grid points. Each cell contains a number of particles. The PIC method is especially
preferable for simulating systems containing on the order of 10^6 or more particles, since its computa-
tional complexity is limited to O(Np) for the particles and O(Ng log Ng) for the grid points, where Ng
is the number of grid points in the simulation domain. In the PIC method, moving the particles
to a new position dominates the runtime of the algorithm, and this phase can be parallelized easily
due to the independent behavior of the particles.
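As a rough illustration (the numbers are chosen only for this example), for Np = 10^6 particles a direct particle-particle method requires on the order of Np^2 = 10^12 pair interactions per time step, whereas the PIC method on a grid of Ng = 16K points requires on the order of Np + Ng log Ng ≈ 1.2 x 10^6 operations, a reduction of roughly six orders of magnitude.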
The basic PIC algorithm consists of an initialization phase followed by four processing phases
that are repeated many times (Figure 1). In the scatter phase, the particle attributes are interpo-
lated to nearby points of a regular computational mesh that discretizes the problem domain. The
appropriate field equations are then solved on the computational mesh (field solve phase), and the
force on each particle due to the resultant fields is found by interpolation on the mesh (gather phase).
Finally, the particles are repositioned under the influence of this force in the particle push phase.
Parallel PIC methods have been previously studied by a number of researchers on a wide variety
of platforms using different parallel programming methodologies [11, 12, 13, 14, 15, 16].
We implemented one- and two-dimensional PIC simulation codes in HPF. We compiled our
codes using PGI's HPF compiler and ran them on an IBM SP-2 with up to 32 nodes.
In this paper we talk about our data decomposition and parallelization strategy, and emphasize
HPF features used in individual phases of the simulation code. We assess the power of HPF in
representing such complicated codes. For the sake of simplicity, most of the following discussion
involves only the one-dimensional PIC method; the interested reader may refer to [9, 10] for higher
dimensional PIC methods.
Figure 1: The basic PIC algorithm. Initialization: initialize particle and grid (field) data; arrange particle and field partitions. Scatter: compute local charge densities; update charges on partition borders; copy Q to QC. Field solve: inverse FFT; solve the Poisson equation; FFT. Gather: copy FC to F; update forces on partition borders. Particle push: update positions and velocities; move particles to new partitions. Print statistics: print particle and grid data; print statistics.
2 Data Decomposition
The performance of the PIC simulation codes depends critically on the decomposition (distribu-
tion) strategy of the grid points and particles among the nodes of a distributed memory machine
or workstation cluster. We used a variant of Eulerian decomposition [14], which assigns to each
processor a fixed spatial partition of the grid and the particles within the cells bounded by
those grid points. This avoids a high communication cost in the gather and scatter phases, yet
load imbalance among processors may develop after repeated iterations as a result of excessive
particle accumulation in certain cells.
Figure 2: Illustration of particle and grid decompositions, the associated charge densities at each grid
point, and the interaction among them.
Figure 2 shows how a one-dimensional domain of length 16 is decomposed into grid (field) and
particle partitions assuming there are four processors (nvp=4).
The virtual processor domain in HPF is declared as follows:
!HPF$ PROCESSORS :: PROCS(nvp)
Particle partitions (PP) are used to divide the particles and the particle computation efficiently
among the processors. The particles are distributed according to their coordinate x; that is, the
k-th partition (PPk) contains all particles with edges(left,k) < x <= edges(rght,k). The second
dimension of edges enumerates the partitions, and the array is distributed block-wise along that dimension:
INTEGER, DIMENSION(2, numpp) :: edges
!HPF$ DISTRIBUTE (*, BLOCK) ONTO PROCS :: edges
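For illustration only, the partition boundaries of Figure 2 could be set up as in the sketch below; the domain length nx and the values of the index constants left and rght are assumptions, not taken from the original listing.

      ! Illustrative sketch (not from the original code): uniform partition
      ! boundaries for a domain of nx unit cells split across numpp partitions.
      DO k = 1, numpp
         edges(left, k) = (k-1) * nx / numpp
         edges(rght, k) =  k    * nx / numpp
      END DO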
The particles' parameters are stored in the array part(i,j,k): the first index selects between the
spatial coordinate and the velocity of the particle, the second enumerates the particles in the par-
tition, and the third enumerates the partitions. Consequently, the array part is of shape
(2,npmax,numpp), where npmax is the maximum number of particles allowed in one partition.
npmax sets an upper limit on the number of particles that can be placed into a partition and
prevents enormous load imbalances among partitions. The current number of particles for each
PP is kept in the npp(numpp) array. The particle array is distributed in a block fashion:
!HPF$ DISTRIBUTE (*, *, BLOCK) ONTO PROCS :: part
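The npp array that holds the per-partition particle counts would typically be given a matching distribution so that each processor owns the counts for its own partitions; the directive below is a sketch and does not appear in the original listing.

!HPF$ DISTRIBUTE (BLOCK) ONTO PROCS :: npp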
Field partitions (FP) are used to divide the electromagnetic computation uniformly among the
processors. Each partition has an equal number of cells (delimited by the grid points) of unit
length. For example, in Figure 2 we have 4 cells and 4 primary grid points for each of the 4 partitions.
Each partition owns the grid point on its right boundary.
Since the interactions between the field and the particles are not confined to a partition, each
subdomain stores additional information about the charge density associated with grid points that
lie outside the partition: one guard point on the left of the partition and two guard points on
the right of the partition (Figure 2), which shadow the rightmost grid point of the left partition and
the two leftmost grid points of the right partition, respectively. This allows each processor to calculate
the partial effect of its local particles on these overlapped grid points. The total charge density
at the primary grid points is found by adding the partial deposits of the guard cells of the neighboring
partitions.
The charge densities on the grid points (the primary grid points and those of the guard cells) are stored
in the real charge array q(nxpmx,numpp), where nxpmx is the number of primary and guard-cell
grid points of one partition and numpp is the number of particle partitions. These structures are
distributed as:
!HPF$ DISTRIBUTE (*, BLOCK) ONTO PROCS :: q
In order to solve the field equations, we have to transform the real charge density q into the
complex charge density qc(kxp,numfp), where kxp is the number of complex grid points in each partition
and numfp is the number of field partitions. The complex charge density on each processor is
calculated from the local real charge density values of the primary grid points and those of the overlapped
grid points of the neighboring processors. The number of complex grid points for each partition is
equal to one half of its primary grid points. Each processor holds all the complex charges for
the complex grid points in its subdomain.
!HPF$ DISTRIBUTE (*, BLOCK) ONTO PROCS :: qc
The complex grid points are aligned with the odd-valued real primary grid points.
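One way to state this alignment explicitly is an HPF ALIGN directive; the line below is a sketch of the relationship rather than a line from the code, which may instead rely on the matching BLOCK distributions given above.

!HPF$ ALIGN qc(j,k) WITH q(2*j-1,k)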
3 Implementation Details
In this section we will describe how we implemented each phase of the PIC code using HPF.
Figure 3: For a one-dimensional simulation, interaction of a particle (a) on the lower half of a grid
interval; (b) on the upper half of a grid interval with the neighboring grid points.
The scatter phase consists of three major steps. In the first step, the charge densities at each grid
point due to nearby charged particles are calculated using a second-order spline interpolation with
periodic boundaries. The interaction between the particles and nearby grid points is shown in
Figure 3. Since only the particles local to a partition participate in the computation of the charge
densities at this step, the whole computation can be done easily using an INDEPENDENT DO loop over
the number of field partitions (Figure 4) without any communication. Each processor executes
the inner body for all the partitions belonging to it. The variable n is assigned a value and
then used as an index on the left-hand side of assignment expressions. Because the compiler at
this time lacks adequate dependency and communication analysis mechanisms, these types of loops
are automatically serialized instead of complicated runtime inspector-executor loops [8] being
generated. To prevent this performance degradation, we wrote an extrinsic routine that does static
loop partitioning based on loop indices and executes the loop body in parallel.
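This first scatter step can be sketched as follows. The loop structure follows the description above; the particle charge qm, the exact spline weights, and the local grid numbering are illustrative assumptions, and the guard points are assumed to absorb the n-1 and n+1 accesses at the partition ends.

!HPF$ INDEPENDENT, NEW(j, n, dx)
      DO k = 1, numpp                     ! loop over the partitions
         DO j = 1, npp(k)                 ! particles local to partition k
            n  = INT(part(1,j,k) + 0.5)   ! nearest grid point (local numbering assumed)
            dx = part(1,j,k) - n
            ! second-order spline: deposit the charge on the three nearest points
            q(n,  k) = q(n,  k) + qm * (0.75 - dx*dx)
            q(n+1,k) = q(n+1,k) + qm * 0.5 * (0.5 + dx)**2
            q(n-1,k) = q(n-1,k) + qm * 0.5 * (0.5 - dx)**2
         END DO
      END DO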
In the second step, the partial results at guard grid points are sent to the appropriate processors
to augment the partial values at shadowed primary grid points of neighboring partitions.
Figure 5: Copying data from real charge density to complex charge density.
FORALL (k = 1:kblok, j = 1:kxp)
   qc(j,k) = CMPLX(q(2*j-1,k), q(2*j,k))
END FORALL

C Consider contributions from guard cells. NP is the number of partitions.
FORALL (k = 1:kblok)
   ! add the last two guard cells to the first complex charge of partition k+1
   qc(1, MOD(k,NP)+1) = qc(1, MOD(k,NP)+1) + CMPLX(q(nxpmx-1,k), q(nxpmx,k))
   ! add the left guard cell to the last complex charge of partition k-1
   qc(kxp, MOD(k-2+NP,NP)+1) = qc(kxp, MOD(k-2+NP,NP)+1) + CMPLX(0.0, q(1,k))
END FORALL
Figure 6: Copying data from real charge density array to complex charge density array (Scatter-steps
2 and 3).
[Figure: mapping of input indices to processor IDs across the three communication phases of the parallel FFT.]
Figure 9: Finding the electrical fields at FP grid points by using a Poisson solver.
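For reference, the electrostatic field solve carried out on the complex grid amounts to solving Poisson's equation in Fourier space (a standard formulation stated here for completeness, not copied from the code):
\[ \hat{\phi}(k) = \frac{\hat{\rho}(k)}{\varepsilon_0 k^2}, \qquad \hat{E}(k) = -\,i\,k\,\hat{\phi}(k), \]
after which the transform back to the spatial grid yields the complex field values fc that are copied into the real field array f in the gather phase (Figure 10).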
FORALL (k = 1:kblok, j = 1:kxp)
   f(2*j-1,k) = REAL(fc(j,k))
   f(2*j,k)   = AIMAG(fc(j,k))
END FORALL
C Consider contributions to guard cells
FORALL (k = 1:kblok)
   f(1,k)       = AIMAG(fc(kxp, MOD(k-2+NP,NP)+1))
   f(nxpmx-1,k) = REAL(fc(1, MOD(k,NP)+1))
   f(nxpmx,k)   = AIMAG(fc(1, MOD(k,NP)+1))
END FORALL
Figure 10: Copying data from the complex field array into the real field array.
The electrostatic force at each particle position is calculated by interpolation of the electric
fields on the grid points nearest to the given position. The interaction between the electrical fields
on grid points and nearby particles is the reverse of that shown in Figure 3. The electrostatic
force for partition k at local grid point j is kept in f(j,k).
The particle push phase can be performed in HPF by using an INDEPENDENT DO loop over all
the particle partitions. Since all data items used in the loop body are local, there is no need
for interprocessor communication. However, the compiler we used serialized this loop when
it detected the dynamically computed value of nn in the body, so we wrote extrinsic routines to
simulate the effect of the INDEPENDENT DO loop. The compiler assumes that the use of nn to index
the f array will cause a wide range of elements to be accessed, which may disturb inter-iteration
independence, and it therefore completely serializes the loop.
!HPF$ INDEPENDENT, NEW(j,nn,dx,dv)
      DO k = 1, numpp                        ! loop over the particle partitions
         DO j = 1, npp(k)                    ! particles in partition k (POS = 1, VEL = 2)
            nn = INT(part(POS,j,k) + 0.5)    ! nearest grid point
            dx = part(POS,j,k) - nn
            dv = part(VEL,j,k) + (...dx...f(nn,k)...f(nn+1,k)...f(nn-1,k)...)
            part(VEL,j,k) = dv
            part(POS,j,k) = part(POS,j,k) + dv*dt
         END DO
      END DO
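Since the compiler serialized the INDEPENDENT loop shown above, a local extrinsic version of it was used instead. The following is a minimal sketch of such a routine using HPF's EXTRINSIC(HPF_LOCAL) mechanism; the routine name push_local and its argument list are assumptions, and the force interpolation is elided as in the loop above.

      EXTRINSIC(HPF_LOCAL) SUBROUTINE push_local(part, f, npp, dt)
      ! Each processor sees only its local blocks of the distributed arrays
      ! and runs the push loop over the partitions it owns.
      REAL,    INTENT(INOUT) :: part(:,:,:)   ! local block of part(2,npmax,numpp)
      REAL,    INTENT(IN)    :: f(:,:)        ! local block of f(nxpmx,numpp)
      INTEGER, INTENT(IN)    :: npp(:)        ! particle counts of the local partitions
      REAL,    INTENT(IN)    :: dt
      INTEGER, PARAMETER :: POS = 1, VEL = 2  ! index convention described in the text
      INTEGER :: j, k, nn
      REAL    :: dx, dv
      DO k = 1, SIZE(npp)                     ! partitions owned by this processor
         DO j = 1, npp(k)
            nn = INT(part(POS,j,k) + 0.5)
            dx = part(POS,j,k) - nn
            dv = part(VEL,j,k)                ! + interpolated force from f(nn-1:nn+1,k), elided
            part(VEL,j,k) = dv
            part(POS,j,k) = part(POS,j,k) + dv*dt
         END DO
      END DO
      END SUBROUTINE push_local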
Figure 12: Moving the particles into appropriate spatial regions after the Push phase.
Figure 13: Determining the outgoing particles and placing them into send buffers (Pmove1-1).
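The step named in Figure 13 can be sketched as follows. The send buffers sbufl and sbufr, the OUT-side counters jsl(OUT,k) and jsr(OUT,k), and the loop structure are assumptions; the hole list ihole and the remaining names follow those used in Figure 15 and in the text.

!HPF$ INDEPENDENT, NEW(j, nl, nr, nh)
      DO k = 1, kblok
         nl = 0; nr = 0; nh = 0
         DO j = 1, npp(k)
            IF (part(POS,j,k) .LE. edges(left,k)) THEN       ! particle leaves to the left
               nl = nl + 1; nh = nh + 1
               sbufl(POS,nl,k) = part(POS,j,k)
               sbufl(VEL,nl,k) = part(VEL,j,k)
               ihole(nh,k) = j                               ! remember the freed slot
            ELSE IF (part(POS,j,k) .GT. edges(rght,k)) THEN  ! particle leaves to the right
               nr = nr + 1; nh = nh + 1
               sbufr(POS,nr,k) = part(POS,j,k)
               sbufr(VEL,nr,k) = part(VEL,j,k)
               ihole(nh,k) = j
            END IF
         END DO
         jsl(OUT,k) = nl
         jsr(OUT,k) = nr
         jss(OUT,k) = nh
      END DO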
In the last step we place the incoming particles into the particle array using Pmove1_part5(). If any gaps
remain in the particle arrays, then the particles toward the end of the array are shifted forward
to fill in the gaps (Figure 15).
These five steps are repeated until each particle arrives at its final destination. However, since
many particles travel for only short distances, usually only a few iterations are needed.
4 Performance Results
Tables 1 and 2 show the completion time of one iteration of the main phases of the 1-D serial
PIC code for a varying number of grid points and a varying number of particles, respectively. We
repeated the main phase 225 times for the 1-D code and 325 times for the 2-D code. The loop
iteration time in the tables shows the total time for the specified number of iterations. As can be
seen from Table 2, when the number of particles changes but the number of grid points remains
constant, the execution times of the push and migrate steps of the particle push phase and of the first
step of the scatter phase increase linearly. On the other hand, the FFT(), Poisson(), and copy()
routines of the scatter and gather phases take the same time. When we keep the number of
particles the same but increase the number of grid points, the routines other than FFT(),
Poisson(), and copy() stay constant, while the execution times of those three routines increase linearly
(Table 1).
We illustrate the execution time changes with respect to a varying number of grid points and a varying
number of particles for the 2-D serial PIC codes in Tables 3 and 4, respectively. The behavior of the 2-D figures
!HPF$ INDEPENDENT, NEW(j,offset)
      DO k = 1, kblok                          ! number of partitions
         IF (jsl(IN,k) .LE. jss(OUT,k)) THEN
            ! put into holes
            DO j = 1, jsl(IN,k)
               part(POS, ihole(j,k), k) = rbufl(POS, j, k)
               part(VEL, ihole(j,k), k) = rbufl(VEL, j, k)
            END DO
            ! check right ones
            jss(IN,k) = MIN(jss(OUT,k) - jsl(IN,k), jsr(IN,k))
            DO j = 1, jss(IN,k)
               part(POS, ihole(j+jsl(IN,k), k), k) = rbufr(POS, j, k)
               part(VEL, ihole(j+jsl(IN,k), k), k) = rbufr(VEL, j, k)
            END DO
            IF (jsr(IN,k) .GT. jss(OUT,k) - jsl(IN,k)) THEN
               DO j = 1, jsr(IN,k) - jss(IN,k)
                  part(POS, npp(k)+j, k) = rbufr(POS, j+jss(IN,k), k)
                  part(VEL, npp(k)+j, k) = rbufr(VEL, j+jss(IN,k), k)
               END DO
            END IF
         ELSE
            jss(IN,k) = MIN(jsl(IN,k), jss(OUT,k))
            DO j = 1, jss(IN,k)
               part(POS, ihole(j,k), k) = rbufl(POS, j, k)
               part(VEL, ihole(j,k), k) = rbufl(VEL, j, k)
            END DO
            offset = jsl(IN,k) - jss(IN,k)
            DO j = 1, offset
               part(POS, npp(k)+j, k) = rbufl(POS, j+jss(IN,k), k)
               part(VEL, npp(k)+j, k) = rbufl(VEL, j+jss(IN,k), k)
            END DO
            DO j = 1, jsr(IN,k) - offset
               part(POS, npp(k)+offset+j, k) = rbufr(POS, j, k)
               part(VEL, npp(k)+offset+j, k) = rbufr(VEL, j, k)
            END DO
            IF ((jss(OUT,k) - jsl(IN,k) - jsr(IN,k)) .GT. 0) THEN
               offset = jss(OUT,k) - jsl(IN,k) - jsr(IN,k)
               DO j = 1, offset
                  part(POS, ihole(jsl(IN,k)+jsr(IN,k)+j, k), k) = part(POS, npp(k), k)
                  part(VEL, ihole(jsl(IN,k)+jsr(IN,k)+j, k), k) = part(VEL, npp(k), k)
                  npp(k) = npp(k) - 1
               END DO
            END IF
         END IF
      END DO
Figure 15: Distributing incoming particles from buffers (Pmove1-5).
is similar to that of the 1-D figures.
When we compare the execution times of the 1-D problem with 32K grid points and 4M
particles to the 2-D problem with 128*256 = 32K grid points and 3.4M particles, we notice that
the push step and the first step of the scatter phase in the 2-D case take more time. This may seem to
be an anomaly, but we should remember that in the 2-D case nine points participate
in the interpolation, while there are only three points in the 1-D case. The same reasoning is valid
for the push module.
From these four tables we can conclude that the performance of the main routines of the program
is either dependent on the number of particles or on the number of grid points.
Next, we investigate the behavior of the main routines in parallel environments using HPF.
Table 5 shows the speedup pattern of the 1-D HPF routines for different numbers of processors.
One observation is that increasing the number of processors decreases the execution times of all
the routines almost linearly, except for FFT() and copy(). These two routines involve interprocessor
communication; therefore, increasing the number of processors does not really help to improve their
execution time. In the other routines the computation is all local, so using more processors
divides the work into more parallel pieces and reduces the overall time. Table 11 presents a
similar pattern for the 2-D codes.
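For instance, the loop iteration time in Table 5 drops from 1818.0 seconds on one processor to 94.5 seconds on 32 processors, a speedup of about 1818/94.5 ≈ 19; the gap to the ideal factor of 32 is largely due to the communication-bound FFT() and copy() routines.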
As seen in Table 7 (1-D) and Table 13 (2-D), when we keep the number of grid points and
the number of processors constant and change the number of particles, the execution times of the
scatter, push, and migrate routines increase linearly while FFT(), Poisson(), and copy() are unaffected.
Tables 6 and 12 demonstrate the case where the numbers of particles and processors are kept the
same and the number of grid points is changed, for the 1-D and 2-D cases, respectively.
In order to judge the success of the HPF codes as compared to their message-passing equivalents,
we repeated the same experiments using message-passing programs written with the SP-2's native MPL
library. The speedups of the 1-D and 2-D MPL PIC codes (Tables 8 and 14) are similar to those of the HPF
versions of the same codes. The other cases for the MPL versions are summarized in Tables 10 and
9 for the 1-D codes, and in Tables 16 and 15 for the 2-D codes. The results are approximately equal
to the HPF test results, which shows that HPF compilers can approach the performance of
message-passing programs for simulating PIC methods.
5 Conclusion
Coding and compiling large-scale application programs for execution on distributed-memory paral-
lel machines and workstation clusters is a big challenge. High Performance Fortran eases this task
by letting the user define the data decompositions and parallel fragments of the code and doing
the rest of the work itself: it simply detects the necessary communication patterns and generates
calls to the runtime libraries to make the necessary data exchanges among the processors. Until
recently, HPF compilers were capable only of generating code for toy applications. The compilers
still have restricted capabilities, and some important pieces of functionality are missing.
We used PGI's HPF compiler and successfully implemented the PIC codes. We had to augment
the HPF codes with extrinsic procedures supplying the missing functionalities of the compiler,
but we always remained loyal to the standard language definition while implementing those statements.
Finally, we obtained a performance comparable to the native message-passing implementations on
the IBM SP-2 platform.
Acknowledgements
We would like to thank Victor Decyk for supplying his 1-D and 2-D PIC codes, which were the main
focus of this work. We would also like to thank David Walker for sending us a set of research
papers about PIC, Robert D. Ferraro for helping us to find message-passing PIC benchmark
data on other architectures, and Elaine Weinman for proofreading this manuscript.
References
[1] High Performance Fortran Forum (HPFF), High Performance Fortran Language Specification, Scientific Programming, (2)1, July 1993.
[2] J. Li and M. Chen, Compiling Communication Efficient Programs for Massively Parallel Machines, IEEE Trans. on Parallel and Distributed Systems, (2)3:361-375, July 1991.
[3] C. H. Koelbel, D. B. Loveman, R. S. Schreiber, G. L. Steele, M. E. Zosel, The High Performance
Fortran Handbook, MIT Press 1994.
[4] M. Metcalf, J. Reid, Fortran 90 Explained, Oxford, 1990.
[5] P. C. Liewer and V. K. Decyk, A General Concurrent Algorithm for Particle-in-Cell Simulation Codes, J. of Computational Physics, 85, pp. 302-322, 1989.
[6] V. Sunderam, PVM: A Framework for Parallel Distributed Computing, Concurrency: Practice and Experience, 2(4):315-339, 1990.
[7] Message Passing Interface Forum, Document for a Standard Message Passing Interface, Oct.
1993.
[8] C.Koelbel, P.Mehrotra, J.Saltz, and S.Berryman, Parallel Loops on Distributed Machines, In
Proc. of the 5th Distributed Memory Computing Conference, Charleston, SC, April 1990.
[9] C. K. Birdsall and A. B. Langdon, Plasma Physics via Computer Simulation, McGraw-Hill,
New York, 1985.
[10] R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles, Adam Hilger,
Bristol, 1988.
[11] C. D. Norton, B. K. Szymanski, V. K. Decyk, Object Oriented Parallel Computation for
Plasma Simulation, Communications of ACM, 1995.
[12] T. Hoshino, R. Hiromoto, S. SekiGuchi, S. Majima, Mapping Schemes of the Particle-in-Cell Method Implemented on the PAX Computer, Parallel Computing, (9):53-75, 1988.
[13] D. W. Walker, Particle-in-Cell Plasma Simulation Codes on the Connection Machine, Computing Systems in Engineering, (2)2/3:307-319, 1991.
[14] D. W. Walker, Characterizing the Parallel Performance of a Large-Scale, Particle-in-Cell Plasma Simulation Code, Concurrency: Practice and Experience, (2)4:257-288, Dec. 1990.
[15] R. D. Ferraro, P. C. Liewer and V. K. Decyk, Dynamic Load Balancing for a 2D Concurrent
Plasma PIC Code, J. Computational Physics 109, pp. 329, 1993.
[16] V. K. Decyk, Skeleton PIC Codes for Parallel Computers, Computer Physics Communications
87, pp. 87, 1995.
Author Biographies
Erol Akarsu is a Ph.D. student in Computer and Information Science at Syracuse University.
He received the B.S. degree in Computer Engineering from Ege University, Turkey, in 1991,
the M.S. degree in Computer Engineering from Istanbul Technical University, Turkey, in 1993,
and the M.S. degree in Computer Science from Syracuse University in 1995. His research
interests include parallel compiler design, the design and analysis of parallel algorithms, and high
performance computing. (akarsu@npac.syr.edu)
Kivanc Dincer is a Ph.D. candidate at Syracuse University and a research assistant at North-
east Parallel Architectures Center. He has an M.S. degree in Computer Science from Iowa State
University. He worked on several HPCC-related research projects, including development of the NPAC
F90D/HPF compiler and implementation of the Parallel Compiler Runtime Consortium (PCRC)
runtime support systems. His major research interests include parallel processing, distributed com-
puting and Web-based parallel programming environments. (dincer@npac.syr.edu)
(http://www.npac.syr.edu/users/dincer/)
Tomasz Haupt graduated from Jagiellonian University, Krakow, Poland (Ph.D. in Physics
1985). After several years of research in experimental high-energy physics, he changed his re-
search interest toward computer science. He is currently a research scientist at NPAC, Syracuse
University. He is one of the people who contributed to the HPF Language Specification. He
played a leading role in developing the Syracuse Fortran 90D/HPF compiler. Currently he is
involved in developing data-parallel applications, and benchmarking commercial HPF compilers.
(haupt@npac.syr.edu) (http://www.npac.syr.edu/users/haupt/)
Geoffrey C. Fox is an internationally recognized expert in the use of parallel architectures and
the development of concurrent algorithms. He leads a major project to develop prototype high
performance Fortran (Fortran90D) compilers. He is also a leading proponent for the development
of computational science as an academic discipline and a scientific method. His research on
parallel computing has focused on development and use of this technology to solve large scale
computational problems. Fox directs InfoMall, which is focused on accelerating the introduction
of high speed communications and parallel computing into New York State industry and developing
the corresponding software and systems industry. Much of this activity is centered on NYNET
with ISDN and ATM connectivity throughout the state including schools where Fox is leading
developments of new K-12 applications that exploit modern technology. (gcf@npac.syr.edu)
(http://www.npac.syr.edu/users/gcf/)
Appendix
Number of Grid Points | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
32k 2.91 0.002 0.28 0.005 0.002 4.88 0.0 1895.31
64k 2.91 0.004 0.61 0.010 0.004 4.88 0.0 2042.19
128k 2.91 0.007 1.30 0.002 0.007 4.88 0.0 2350.62
256k 2.91 0.013 2.75 0.004 0.013 4.88 0.0 2995.02
Table 1: Timing Results(in seconds) for 1D PIC serial code using 4505600 particles.
Table 2: Timing Results(in seconds) for 1D PIC serial code using 128K Grid Points.
Number of Grid Points | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
16*32 5.46 0.0009 0.0009 0.0008 0.0009 8.14 0.0 4403.75
32*64 5.44 0.0009 0.0009 0.0008 0.0009 8.18 0.0 4409.73
64*128 5.45 0.0009 0.0009 0.0008 0.0009 8.14 0.0 4404.27
128*256 5.45 0.0009 0.0050 0.0090 0.0009 8.15 0.0 4453.80
Table 3: Timing Results(in seconds) for 2D PIC serial code using 3571712 particles.
Number of Particles | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
313344 0.41 0.0099 0.0099 0.0099 0.0099 0.72 0.0 391.30
368640 0.55 0.0099 0.0099 0.0099 0.0099 0.83 0.0 458.12
589824 0.87 0.0099 0.0099 0.0099 0.0099 1.35 0.0 735.67
1474560 2.19 0.0099 0.0099 0.0099 0.0099 3.36 0.0 1826.50
3571712 5.30 0.0099 0.0099 0.0099 0.0099 8.18 0.0 4408.30
Table 4: Timing Results(in seconds) for 2D PIC serial code using 32*64 Grid Points.
Number of Processors | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
1 2.910 0.001 0.123 0.003 0.001 4.880 0.000 1818.0
2 1.450 0.004 0.075 0.003 0.004 1.930 0.800 999.0
4 0.730 0.003 0.045 0.003 0.003 0.970 0.410 513.0
8 0.360 0.003 0.030 0.002 0.003 0.480 0.210 261.0
16 0.188 0.005 0.023 0.003 0.005 0.248 0.115 144.0
32 0.094 0.007 0.024 0.004 0.007 0.136 0.070 94.5
Table 5: Timing Results(in seconds) for 1D PIC code in HPF using 4505600 Particles and 16K Grid
Points.
Number of Grid Points | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
32k 0.367 0.004 0.058 0.003 0.004 0.488 0.212 283.5
64k 0.367 0.004 0.112 0.004 0.004 0.488 0.212 315.0
128k 0.367 0.006 0.240 0.009 0.006 0.488 0.212 369.0
256k 0.367 0.008 0.470 0.020 0.008 0.488 0.212 468.0
Table 6: Timing Results(in seconds) for 1D PIC code in HPF using 4505600 Particles and 8 Processors.
Number of Particles | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
11264 0.002 0.006 0.23 0.013 0.006 0.004 0.008 108.0
90112 0.008 0.006 0.23 0.013 0.006 0.012 0.012 117.0
72089 0.060 0.006 0.23 0.013 0.006 0.082 0.040 135.0
5767168 0.470 0.006 0.23 0.013 0.006 0.620 0.260 423.0
2306867 1.830 0.006 0.23 0.013 0.006 2.600 1.040 1368.0
Table 7: Timing Results(in seconds) for 1D PIC code in HPF using 16K Grid Points and 8 Processors.
Number of Processors | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
1 2.910 0.001 0.123 0.003 0.001 4.880 0.00 1818.000
2 1.450 0.004 0.075 0.003 0.004 1.930 0.81 1004.067
4 0.730 0.003 0.029 0.003 0.003 0.980 0.40 505.620
8 0.360 0.003 0.030 0.002 0.003 0.490 0.20 263.880
16 0.189 0.005 0.029 0.005 0.005 0.250 0.10 132.120
32 0.090 0.007 0.009 0.004 0.007 0.136 0.07 85.500
Table 8: Timing Results(in seconds) for 1D PIC code in MPL using 4505600 Particles and 16K Grid
Points.
Number of Grid Points | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
32k 0.37 0.004 0.058 0.003 0.004 0.490 0.200 270.00
64k 0.37 0.004 0.112 0.004 0.004 0.490 0.200 307.62
128k 0.37 0.006 0.240 0.009 0.006 0.478 0.209 346.41
256k 0.36 0.008 0.419 0.019 0.008 0.490 0.209 452.25
Table 9: Timing Results(in seconds) for 1D PIC code in MPL using 4505600 Particles and 8 Processors.
Number of Particles | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
11264 0.002 0.006 0.20 0.0099 0.006 0.004 0.008 95.58
90112 0.008 0.006 0.20 0.0099 0.006 0.012 0.012 100.17
72089 0.060 0.006 0.20 0.0099 0.006 0.082 0.040 133.47
5767168 0.470 0.006 0.20 0.0090 0.006 0.620 0.260 421.56
2306867 1.830 0.006 0.20 0.0090 0.006 2.600 1.040 1334.61
Table 10: Timing Results(in seconds) for 1D PIC code in MPL using 16K Grid Points and 16 Processors.
Number of Processors | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
1 5.450 0.0009 0.005 0.009 0.0009 8.150 0.000 4453.8
4 1.300 0.004 0.015 0.003 0.0040 1.890 0.380 1196.0
8 0.650 0.004 0.009 0.003 0.0040 0.950 0.230 624.0
16 0.329 0.004 0.009 0.003 0.0040 0.486 0.178 325.0
32 0.167 0.005 0.015 0.004 0.0050 0.251 0.204 247.0
Table 11: Timing Results(in seconds) for 2D PIC code in HPF using 3571712 Particles and 128 * 256
Grid Points.
Number of Grid Points | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
16*32 0.655 0.003 0.002 0.003 0.003 0.952 0.530 728.00
32*64 0.655 0.003 0.003 0.003 0.003 0.952 0.354 676.00
64*128 0.655 0.003 0.004 0.002 0.003 0.952 0.266 637.00
128*256 0.655 0.004 0.009 0.002 0.004 0.952 0.226 624.00
Table 12: Timing Results(in seconds) for 2D PIC code in HPF using 3571712 Particles and 8 Processors.
Number of Particles | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
313344 0.004 0.004 0.005 0.003 0.004 0.005 0.014 39.0
368640 0.009 0.004 0.005 0.003 0.004 0.015 0.025 39.0
589824 0.030 0.004 0.005 0.003 0.004 0.046 0.054 65.0
1474560 0.124 0.004 0.005 0.003 0.004 0.190 0.164 169.0
3571712 0.332 0.004 0.005 0.003 0.004 0.486 0.379 416
Table 13: Timing Results(in seconds) for 2D PIC code in HPF using 32*64 Grid Points and 16
Processors.
Number of Processors | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
1 5.45 0.0009 0.0050 0.0090 0.0009 8.150 0.000 4453.80
2 2.66 0.0009 0.0250 0.0090 0.0009 3.820 0.689 2368.60
4 1.32 0.0090 0.0099 0.0099 0.0090 1.880 0.370 1206.53
8 0.66 0.0099 0.0099 0.0099 0.0099 0.940 0.180 614.25
16 0.33 0.0040 0.0090 0.0030 0.0040 0.479 0.100 303.03
32 0.15 0.0050 0.01500 0.0040 0.0050 0.250 0.200 237.90
Table 14: Timing Results(in seconds) for 2D PIC code in MPL using 3571712 Particles and 128 * 256
Grid Points.
Number of Grid Points | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
16*32 0.660 0.003 0.002 0.003 0.003 0.949 0.219 614.51
32*64 0.655 0.003 0.003 0.003 0.003 0.952 0.219 617.37
64*128 0.670 0.003 0.004 0.002 0.003 0.949 0.189 611.26
128*256 0.670 0.004 0.009 0.002 0.004 0.952 0.189 615.29
Table 15: Timing Results(in seconds) for 2D PIC code in MPL using 3571712 Particles and 8 Processors.
Number of Particles | Scatter: Phase 1 | Scatter: Phase 2 | Field Solve: IFFT/FFT | Field Solve: Poisson | Gather: Copy | Particle Push: Push | Particle Push: Migrate | Loop Iteration Time
313344 0.029 0.004 0.005 0.003 0.004 0.039 0.029 41.197
368640 0.029 0.004 0.005 0.003 0.004 0.050 0.029 45.604
589824 0.030 0.004 0.005 0.003 0.004 0.046 0.054 65.000
1474560 0.050 0.004 0.005 0.003 0.004 0.090 0.029 135.330
3571712 0.140 0.004 0.005 0.003 0.004 0.209 0.059 134.680
Table 16: Timing Results(in seconds) for 2D PIC code in MPL using 32*64 Grid Points and 16
Processors.