
Performance Analysis of Partition Algorithms for Parallel Solution of Nonlinear Systems of Equations

Geng Yang¹, Chunming Rong²

¹Department of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China. ²Department of Computer Science, Stavanger University College, N-4068 Stavanger, Norway.

Abstract—In this paper, we discuss the performance of partition algorithms for the parallel solution of large-scale nonlinear systems of equations. We first describe a block Broyden algorithm for solving a nonlinear system, in which a block-diagonal matrix is used as the iteration matrix. Then, we analyze the parallelism of the algorithm and discuss in detail different partitioning schemes. Finally, we give some numerical results and analyze the performance of the partitioning schemes. The numerical results show that algorithms combining the block Broyden method with partitioning techniques are effective, and that they can be used in large-scale problems arising from scientific and engineering computing.

Keywords: parallel computation, block partitioning, supercomputing, nonlinear systems

I. Introduction

For algorithms solving large-scale nonlinear systems of equations, we face two challenges. The first is to reduce the storage requirement: because the dimension of the system is usually very large, an algorithm must have low memory requirements to be usable in practice. The second concerns parallel performance: the algorithm should parallelize well, so that numerical solutions can be obtained in less CPU time, or in real time. In some practical applications, such as computational fluid dynamics and numerical weather forecasting, it makes no sense for an algorithm to take a great deal of CPU time to reach a solution. Several algorithms have been developed, such as Newton methods and quasi-Newton methods [1-3], but because of their storage requirements or parallel performance, only a few of them are used in practice for large-scale systems. A nonlinear GMRES(m) (Generalized Minimal RESidual) algorithm, which is now widely used, was developed in [1]; it has advantages in both memory requirements and parallelization. Later, a block Broyden (BB) algorithm was proposed in [4]. This simple algorithm needs little storage and has good parallel performance. The complexity and performance of the algorithm were discussed in [5], and some alternative versions of the algorithm were described in [6,7]. The BB algorithm is easy to implement as a parallel program, because it involves only matrix-vector products. Moreover, for a nonlinear system of dimension N, the BB algorithm needs only N²/q memory locations instead of N², where q is the number of blocks; in practice, an optimal choice for N/q is about 25 [4]. The BB algorithm therefore yields significant storage savings, which plays a very important role in large-scale scientific computing. However, its convergence is not as good as that of the GMRES(m) method, and the block partitioning influences the convergence. Therefore, in order to obtain an optimal BB algorithm and to balance tasks among processors, it is necessary to study partitioning algorithms and to analyze their influence on the BB algorithm.

This work was supported by the National Supercomputing Foundation of China under grant No. 9927.

0-7803-7840-7/03/$17.00 ©2003 IEEE.




The rest of this paper is organized as follows. In Section 2, a general parallelizable BB method is described, including some specific remarks concerning the BB-GMRES(m) method. Section 3 is devoted to block partition strategies, where several partitioning schemes are discussed. Numerical experiments are presented in Section 4; in particular, the relative importance of several parameters on the convergence is studied numerically. Finally, some concluding remarks are given in Section 5.



II. Block Broyden Algorithm

The BB algorithms meet the two criteria mentioned in Section 1 [4-5]. We describe only the algorithm here (see [4-5] for details). For a nonlinear system F(x) = 0, where F(x) is a function from R^n to R^n, the BB algorithm is defined as follows:



For some given x^0 and B^0, calculate the residual r^0 = F(x^0).




  • 2.1 Solve the linear system B^k s^k = -r^k in parallel.

  • 2.2 Calculate x^{k+1} = x^k + s^k in parallel.

  • 2.3 Calculate the residual r^{k+1} = F(x^{k+1}); if it is accurate enough, stop. Otherwise, update B^{k+1} using (1)-(4) in parallel, set k = k+1, and go to step 2.1.




Suppose the unknowns and the components of F(x) are partitioned into q blocks,

F(x) = (F_1, ..., F_q)^T,  x = (x_1, ..., x_q)^T.

Let n_i be the number of equations in the i-th block. The updating formulas are:

s_i^k = x_i^{k+1} - x_i^k,  (1)

y_i^k = F_i(x^{k+1}) - F_i(x^k),  (2)

B_i^{k+1} = B_i^k + (y_i^k - B_i^k s_i^k)(s_i^k)^T / ((s_i^k)^T s_i^k),  (3)

B^{k+1} = diag(B_1^{k+1}, ..., B_q^{k+1}).  (4)
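The iteration in steps 2.1-2.3, with Broyden's rank-one update applied independently to each diagonal block, can be sketched in Python. This is a minimal illustration, not the authors' FORTRAN code: the test function F, the block layout, and all names are ours; the per-block solve uses a tiny Gaussian elimination in place of a library LU solver; and the two block loops are written sequentially even though the paper runs them in parallel.

```python
import math

def gauss_solve(A, b):
    """Solve a small dense system A x = b by Gaussian elimination with
    partial pivoting (sufficient for the tiny diagonal blocks used here)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def block_broyden(F, x0, B0, blocks, tol=1e-10, max_iter=50):
    """Block Broyden iteration: each diagonal block B_i is solved and
    updated independently, so both block loops below are parallelizable."""
    x = x0[:]
    B = [[row[:] for row in Bi] for Bi in B0]
    r = F(x)
    for _ in range(max_iter):
        if max(abs(v) for v in r) < tol:
            break
        x_new = x[:]
        for bi, idx in enumerate(blocks):          # step 2.1/2.2, per block
            s = gauss_solve(B[bi], [-r[j] for j in idx])
            for loc, j in enumerate(idx):
                x_new[j] = x[j] + s[loc]
        r_new = F(x_new)
        for bi, idx in enumerate(blocks):          # step 2.3, Broyden update
            s = [x_new[j] - x[j] for j in idx]
            y = [r_new[j] - r[j] for j in idx]
            ss = sum(v * v for v in s)
            if ss > 0.0:
                Bs = [sum(B[bi][a][b] * s[b] for b in range(len(s)))
                      for a in range(len(s))]
                for a in range(len(s)):
                    for b in range(len(s)):
                        B[bi][a][b] += (y[a] - Bs[a]) * s[b] / ss
        x, r = x_new, r_new
    return x, r

# Illustrative test problem: F_i(x) = 2*x_i + 0.5*sin(x_{(i+1) mod n}) - 1,
# a diagonally dominant system with two 2x2 blocks.
def F(x):
    n = len(x)
    return [2.0 * x[i] + 0.5 * math.sin(x[(i + 1) % n]) - 1.0 for i in range(n)]

blocks = [[0, 1], [2, 3]]
# B^0 holds the diagonal blocks of the Jacobian at x = 0.
B0 = [[[2.0, 0.5], [0.0, 2.0]], [[2.0, 0.5], [0.0, 2.0]]]
x, r = block_broyden(F, [0.0] * 4, B0, blocks)
```

Because B stays block diagonal, each processor owns a subset of blocks and needs only the updated iterate x, which is exactly the storage saving of N²/q discussed above.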

III. Some Block Partitioning Strategies


an NxN matrix




with a symmetric

zermonzero structure, it is possible to define an induced undirected graph C(A) = <V,E> without loops or multiple edges, or more simply, a gmph. The set V bas N vertices



and E is a collection of unordered pairs of

elements of V such that < *,aj> E E if i# j and if Aij# 0. Let q 3 be a given positive integer. Using a graph partitioning algorillun, Vcould be decomposed into q disjoint subsets V, of comparable size N;, i.e. such that each Vi has about N/q vertices (i=l,,,.,q), For completeness and to facilitate comparisons, we will study five partitioning schemes: linear partitioning, random partitioning, scattered partitioning. MAPLO partitioning (a Modification of the PArameterized BLock Ordering), and Multilevel partitioning.

  • A. Linear Partitioning

In the linear scheme, vertices are assigned to blocks in order, in accordance with their numbering in the original graph: the first N/q vertices are assigned to block V_1, the next N/q vertices to block V_2, and so on. This algorithm is simple and often produces surprisingly good results, because data locality is implicit in the numbering of the graph. For a graph with N vertices and NE edges, the algorithm has O(N) complexity.

  • B. Scattered Partitioning

In the scattered scheme, vertices are dealt out in card fashion to the q sets in the order they are numbered. Note that neighboring vertices are assigned to different blocks. Hence, this scheme often gives bad performance in parallel computing for systems arising from real problems, such as fluid dynamics computations with a finite element method. The scattered scheme also runs in time O(N).

  • C. Random Partitioning

In the random scheme, vertices are assigned randomly to the q sets. Because of the randomness it is difficult to control the matrix B; in other words, the matrix B may be different each time. Usually the random ordering produces partitions with a quality between those of the linear and scattered partitioning algorithms. The complexity of the random scheme is O(N).
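The three O(N) schemes above can be sketched in a few lines of Python. This is an illustrative sketch; the function names and the ceiling-sized chunking are our choices, not taken from the paper.

```python
import random

def linear_partition(n, q):
    """Assign the first ~n/q vertices to block 0, the next to block 1, ..."""
    size = -(-n // q)  # ceil(n / q) vertices per block
    return [min(v // size, q - 1) for v in range(n)]

def scattered_partition(n, q):
    """Deal vertices out to the q blocks in round-robin ("card") fashion."""
    return [v % q for v in range(n)]

def random_partition(n, q, seed=0):
    """Shuffle the vertices, then cut the shuffled order into q chunks."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    size = -(-n // q)
    part = [0] * n
    for pos, v in enumerate(order):
        part[v] = min(pos // size, q - 1)
    return part
```

Each function returns, for every vertex, the index of the block it belongs to; all three visit each vertex once, which is the O(N) cost stated above.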









  • D. MPABLO Partitioning







A modification of the PABLO algorithm (PArameterized BLock Ordering) was used here [8], in order to obtain diagonal blocks of comparable size. The crux of this method is to choose groups of nodes in G(A), the graph induced by the matrix A, so that the corresponding diagonal blocks are either full or very dense. The algorithm runs in time O(N + NE).

  • E. Multilevel Partitioning

In the multilevel algorithm used in this work [9], the graph is approximated by a sequence of increasingly smaller graphs. The smallest graph is then partitioned into a power-of-two number of sets using a spectral method, and this partition is propagated back through the hierarchy of graphs. A variant of the Kernighan-Lin algorithm is applied periodically to refine the partition. A C version of this partitioning scheme is included in a software package called Chaco, which was kindly offered to us by the authors for research purposes. The cost of constructing the coarse graphs and of the local improvement algorithms is proportional to the number of edges in the graph.
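A common way to compare such schemes is the edge cut: the number of graph edges whose endpoints land in different blocks, i.e. couplings A_ij that are dropped from the block-diagonal matrix B. The following Python sketch uses a small chain graph of our own choosing (not one of the paper's grids) to show why neighbor-separating schemes behave poorly:

```python
def edge_cut(edges, part):
    """Count edges whose two endpoints belong to different blocks.
    For the BB method, each cut edge is a coupling lost from B."""
    return sum(1 for (i, j) in edges if part[i] != part[j])

# 1-D chain of 6 vertices: 0-1-2-3-4-5
edges = [(i, i + 1) for i in range(5)]
linear    = [0, 0, 0, 1, 1, 1]   # consecutive vertices share a block
scattered = [0, 1, 0, 1, 0, 1]   # neighbours are always separated
print(edge_cut(edges, linear))     # 1 coupling lost
print(edge_cut(edges, scattered))  # 5 couplings lost
```

On this chain the linear scheme cuts a single edge while the scattered scheme cuts all of them, mirroring the ranking observed in the numerical results below.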

IV. Numerical Results

All numerical tests in this section were run on a shared-memory Power Challenge XL with 8 R8000 CPUs, 1.5 GB RAM, and 64-bit arithmetic. All computations used 2 processors, except where mentioned. The programs were written in FORTRAN 77, using double-precision floating-point numbers, in an automatic parallel environment. In the following discussion, the CPU time for a parallel case is the maximum CPU time among processors over all parallel computations, including communication time. It does not include the computational cost of the partitioning strategies: this preliminary step is performed only once, and its influence on the total computational work is insignificant, particularly for large-scale scientific and engineering computing. The nonlinear partial differential equation for the Bratu problem can be written as [5]:


Δu + λ e^u = f,  (x, y) ∈ Ω = [0,1] × [0,1],
u = 0,  (x, y) ∈ ∂Ω.

We take f = e and λ = 1, and divide the two sides of the domain Ω into N-1 intervals. We use five-point and forward difference schemes to discretize the second- and first-order derivatives. We then obtain a nonlinear system of dimension N².
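The residual of the five-point discretization can be sketched in Python rather than the paper's FORTRAN 77. The mesh-size convention (N interior points per side, h = 1/(N+1)) and the row-by-row ordering are our assumptions for illustration:

```python
import math

def bratu_residual(u, N, lam=1.0, f=math.e):
    """Residual of the five-point discretization of  Δu + λ·exp(u) = f
    on the unit square with u = 0 on the boundary.  u is a flat vector of
    the N*N interior grid values, ordered row by row."""
    h = 1.0 / (N + 1)  # assumed mesh size for N interior points per side

    def val(i, j):
        # Interior value, or the homogeneous Dirichlet boundary value 0.
        return u[i * N + j] if 0 <= i < N and 0 <= j < N else 0.0

    r = [0.0] * (N * N)
    for i in range(N):
        for j in range(N):
            lap = (val(i + 1, j) + val(i - 1, j) + val(i, j + 1)
                   + val(i, j - 1) - 4.0 * val(i, j)) / (h * h)
            r[i * N + j] = lap + lam * math.exp(val(i, j)) - f
    return r
```

Passing this residual function to a nonlinear solver (such as the block Broyden sketch in Section 2) reproduces the structure of the N²-dimensional system described above.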

We consider two grids, M1 and M2, with N = 30 and N = 50 respectively; the dimensions of the nonlinear system are 900 and 2500. The stopping criterion is ||r^k|| < ε for a small tolerance ε. We use an LU elimination method for the linear systems [5], which yields the nonlinear solver BB-LU. The initial approximation x^0 is zero and B^0 is the Jacobian matrix. Five block partitioning schemes are considered: the multilevel, MPABLO, linear, random and scattered schemes. Figures 1 and 2 show the convergence histories (log ||F(x^k)|| versus the number of nonlinear iterations) of BB-LU for the block structures generated by the different partitioning schemes, on grids M1 and M2. The number of blocks is q = 64 on M1 and q = 128 on M2.


Figure 3 presents the influence of the number q of diagonal blocks on the convergence of BB-LU. The partitioning scheme used is MPABLO. Figure 3 shows that BB-LU converges even for a large q. For example, for the grid M1 and q = 300, each block contains only about 3 unknowns. Therefore, the memory requirements for storing the BB matrix can be adjusted according to the resources of the computer. For large-scale scientific computation, such as biological or fluid dynamics computation, the dimension N of the system is very large, and it is impossible to store an entire matrix B. That is also the reason to use an optimal partitioning scheme and parallel computing. Figure 4 shows the CPU time used by BB-LU, with 2 processors, for different choices of q. In this case, the CPU time increases with the number of blocks, because the number of iterations increases too.

Fig. 4 Influence of block numbers on CPU times

Table 1 gives the parallel performance of the BB-LU algorithm. The CPU times are normalized by those obtained with only one processor. The speedup is the sequential CPU time divided by the parallel CPU time, and the efficiency is the speedup divided by the number of processors. Table 1 shows that the BB-LU algorithm is effective. With 8 processors, it runs about 6 times faster than with one processor, and the efficiency values are about 0.625 and 0.658 for the grids M1 and M2 respectively. This is an important advantage for large-scale scientific computing with parallel computers.
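The speedup and efficiency definitions above are simple ratios. As a worked check (the parallel time 0.2 below is illustrative, derived from the reported M1 efficiency under the normalized-time convention, not a value quoted in the paper):

```python
def speedup(t_seq, t_par):
    """Speedup: sequential CPU time divided by parallel CPU time."""
    return t_seq / t_par

def efficiency(t_seq, t_par, n_proc):
    """Efficiency: speedup divided by the number of processors."""
    return speedup(t_seq, t_par) / n_proc

# Normalized sequential time 1.0; a parallel time of 0.2 on 8 processors
# gives speedup 5.0 and efficiency 0.625, matching the M1 figures above.
print(speedup(1.0, 0.2))        # 5.0
print(efficiency(1.0, 0.2, 8))  # 0.625
```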

Fig. 2 Convergence behavior for partitioning schemes with M2

Figures 1 and 2 show that the MPABLO and multilevel schemes perform very well. BB-LU takes about 1.13 times more iterations with the MPABLO scheme than with the multilevel scheme on both grids, because both schemes make partitions without losing coupling information between blocks. The linear partitioning scheme also gives a satisfactory partition in this case, since the nodes in a block are neighboring nodes in the geometry and are therefore closely related. The behavior of the random scheme lies between those of the linear and scattered ones. The scattered scheme is the worst: we observe some oscillation, because this scheme destroys the coupling information between unknowns. Thus, it is worthwhile to select an appropriate partitioning scheme (which is used only once, as a preliminary step of the computations) in order to obtain significant CPU time savings. Moreover, because of its power-of-two partitioning limitation, it may be difficult to use the multilevel scheme in practice. We propose to use the MPABLO scheme, since it gives almost the same performance as the multilevel scheme.





Table 1 Parallel performance of BB-LU on grids M1 and M2: normalized CPU times, speedup, and efficiency (legible entries include speedups of 2.86, 5.00 and 5.26, and efficiencies of 1.000 and 0.715).

V. Conclusions

In this paper a parallelizable block Broyden method

for solving nonlinear systems is presented. Combining it

with some iterative or direct linear solvers, it is possible to

obtain a family of nonlinear solvers. The algorithm

parallelizes well, allowing a speedup of about 6 on an

8-processor system. It was successfully used to solve nonlinear systems arising from the Bratu problem.

Several partitioning algorithms have been evaluated. The block partitions influence the convergence performance of the BB methods. The linear scheme is very simple and sometimes gives good performance. The scattered scheme is the worst among the five schemes, because the nodes in a block have little relative information about each other. The multilevel scheme and the MPABLO scheme give almost the same performance. However, because of the power-of-two partitioning limitation of the multilevel scheme, we prefer the MPABLO scheme for combination with the BB algorithm. It is clear that the BB method needs less memory storage and has good parallel convergence performance. Therefore, it

could be used to solve large nonlinear systems of equations, in particular systems arising from real engineering applications. In future work, we intend to apply some parallelizable preconditioning techniques to the BB methods, in order to accelerate the convergence of such methods and to increase their robustness. Moreover, it would be interesting to apply the BB method to real problems arising from engineering domains.


References

[1] P. Brown and Y. Saad, "Hybrid Krylov Methods for Nonlinear Systems of Equations", SIAM J. Sci. Stat. Comput., Vol. 11, No. 3, pp. 450-481, May 1990.

[2] "A Class of Asynchronous Parallel Iterations for the Systems of Nonlinear Algebraic Equations", Computers & Mathematics with Applications, Vol. 39, No. 7, pp. 81-94, April 2000.

[3] E. Babolian and J. Biazar, "Solution of Nonlinear Equations by Modified Adomian Decomposition Method", Applied Mathematics and Computation, Vol. 132, No. 1, pp. 167-172, October 2002.

[4] G. Yang, "Analysis of Parallel Algorithms for Solving Nonlinear Systems of Equations", Chinese Journal of Computers, Vol. 23, No. 10, pp. 1035-1039, October 2000 (in Chinese).

[5] G. Yang, S. D. Wang and R. C. Wang, "Analysis of Parallel Algorithms for Solving Nonlinear Systems of Equations Based on Dawning-1000A Computers", Chinese Journal of Computers, Vol. 25, No. 4, pp. 397-402, April 2002 (in Chinese).

[6] J. J. Xu, "Convergence of Partially Asynchronous Block Quasi-Newton Methods for Nonlinear Systems of Equations", Journal of Computational and Applied Mathematics, Vol. 103, No. 2, pp. 307-321, March 1999.

[7] Y. R. Chen and D. Y. Cai, "Inexact Overlapped Block Broyden Methods for Solving Nonlinear Equations", Applied Mathematics and Computation, Vol. 136, No. 2, pp. 215-228, April 2002.

[8] J. O'Neil and D. B. Szyld, "A Block Ordering Method for Sparse Matrices", SIAM J. Sci. Stat. Comput., Vol. 11, No. 5, pp. 811-823, 1990.

[9] B. Hendrickson and T. Kolda, "Partitioning Rectangular and Structurally Unsymmetric Sparse Matrices for Parallel Processing", SIAM J. Scientific Computing, Vol. 21, No. 6, pp. 2048-2072, 2000.