An Optimal Parallel Jacobi-Like Solution Method for the Singular Value Decomposition
G. R. Gao and S. J. Thomas
January, 1988

Abstract
A new parallel Jacobi-like solution method for the singular value decomposition (SVD) is presented which is optimal in achieving both the maximum concurrency in computation and the minimum overhead in communication. Unlike previously published parallel SVD algorithms based on a nearest neighbour ring topology for communication, the new algorithm introduces a recursive divide-exchange communication pattern. As a result of the recursive nature of the algorithm, proofs are given to show that it achieves the lower bounds both in computation and communication costs. In general, the recursive pairwise exchange communication operations of the new algorithm can be efficiently supported by multiprocessors with interconnect patterns used in many networks that have been proposed to support large-scale parallelism. As an example, this paper illustrates that the new algorithm can be mapped efficiently and naturally onto hypercube architectures. Preliminary results with an implementation of the new algorithm are reported. Convergence aspects of the new algorithm are briefly discussed. A comparison with related work is outlined.

1 Introduction


Rapid technological advances in multiprocessor architectures have aroused much interest in parallel computation. Parallel methods to compute the singular value decomposition (SVD) have received attention due to its many important applications in science and engineering. A recent paper by Heath et al [8] includes a history of various Jacobi-like SVD algorithms. An early investigation into parallel computation for the symmetric eigenvalue problem, on the SIMD Illiac IV, is described by Sameh in [18]. Sameh outlines the criteria for maximal parallelism in a Jacobi-like algorithm. More recently, a number of authors including Berry et al [1] advocate the one-sided SVD of Hestenes [9], [8], [15] for parallel computation of the SVD. Luk and his co-workers have examined various systolic array configurations to compute the SVD [12], [3], [4]. Brent and Luk [4] have invented a linear array of n/2 processors which implements a one-sided Hestenes algorithm that, in real arithmetic, is an exact analogue of their Jacobi method applied to the eigenvalue problem. The array requires O(mnS) time, where S is the number of sweeps (typically ≤ 10). Brent and Luk demonstrate that their algorithm is computationally optimal in the sense that it requires the minimum number of computational steps per sweep, i.e. n - 1, to ensure the execution of every possible pairwise column rotation. Maximum concurrency is maintained throughout the computation. Their systolic array is comparable to the architecture of a nearest-neighbour linear array of processors, where communication is based on a ring topology. Brent and Luk's algorithm is not optimal in terms of communication overhead. Unnecessary costs are incurred by mapping the systolic array architecture onto a ring connected linear array due to the double sends and receives required between pairs of neighbouring processors. Eberlein [5], Bischof [2] and others have proposed various modifications for hypercube implementations, which require the embedding of rings via binary reflected Gray codes.

In this paper, we present a new parallel Jacobi-like solution method for the SVD which is optimal in achieving both the maximum concurrency in computation and the minimum overhead in communication. Unlike previously published parallel SVD algorithms based on a nearest neighbour ring topology for communication, the new algorithm proposed in this paper introduces a recursive divide-exchange communication pattern. As a result of the recursive nature of the algorithm, proofs are given to show that it achieves the lower bounds both in computation and communication costs. Convergence aspects of the new algorithm are briefly discussed. The paper illustrates that the new algorithm can be mapped efficiently and naturally onto hypercube architectures. We have implemented the new algorithm on the Intel hypercube through simulation and the preliminary results will be discussed. A comparison with related work is briefly outlined. We believe that the new algorithm can be efficiently mapped onto multiprocessors with interconnection patterns that have been proposed to support large-scale parallelism such as the many PM2I-based or cube-based networks [20].
School of Computer Science, McGill University, Montreal, Quebec, Canada, H3A 2K6. This work was partially supported by the Natural Sciences and Engineering Research Council of Canada under Grant A9236.

2 Jacobi-like Algorithms

2.1 The Singular Value Decomposition


The singular value decomposition (SVD) of a general nonsquare matrix may be given as follows,
Theorem 2.1 For a real matrix A (m × n) of rank r, there exist orthogonal matrices U (m × m) and V (n × n), such that

U^T A V = Σ = diag(σ_1, σ_2, ...) ≥ 0,

where the elements of Σ (m × n) may be ordered so that

σ_1 ≥ σ_2 ≥ ... ≥ σ_r > σ_{r+1} = ... = σ_q = 0,   q = min{m, n}.
If m = n, Σ is a square diagonal n × n matrix [11]. In order to compute the SVD in an iterative fashion, a series of plane rotations may be applied to the matrix A (m × n) described in theorem 2.1 above. This approach is similar in nature to Jacobi's original method for computing the eigenvalues of a symmetric matrix, where orthogonal matrices J(i, j, θ) are applied so as to annihilate a symmetrically placed pair of the n(n - 1) off-diagonal elements. These rotation matrices differ from the identity matrix of order n by the principal submatrix formed at the intersection of the row and column pairs corresponding to i and j. A 2 × 2 submatrix has the form

[  c  s ]
[ -s  c ]

The cosine and sine of the rotation angle θ are the constants c = cos θ and s = sin θ. Initially A_1 = A and at the k-th iteration,

A_{k+1} = J(i_k, j_k, θ_k)^T A_k J(i_k, j_k, θ_k).
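To make the structure of J(i, j, θ) concrete, the following C sketch builds the n × n plane rotation as an identity matrix with the 2 × 2 block embedded at the (i, j) intersection; the function name and the row-major storage convention are illustrative assumptions, not details taken from the paper.

#include <math.h>

/* Build the plane rotation J(i,j,theta): an n x n identity matrix whose
   principal submatrix at rows/columns i and j holds the 2 x 2 rotation.
   J is stored row-major in an array of length n*n (an assumed convention). */
void build_plane_rotation(double *J, int n, int i, int j, double theta)
{
    double c = cos(theta), s = sin(theta);
    for (int r = 0; r < n; r++)
        for (int col = 0; col < n; col++)
            J[r * n + col] = (r == col) ? 1.0 : 0.0;
    J[i * n + i] = c;   J[i * n + j] = s;
    J[j * n + i] = -s;  J[j * n + j] = c;
}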

Rotations are applied simultaneously, in a symmetric fashion from the left and right. Cyclic Jacobi methods refer to a sequence of rotations which update row and column pairs in some predetermined order. For a square matrix, a cyclic sweep refers to the updating of n(n - 1)/2 elements. A number of sweeps are required in order to effectively reduce the off-diagonal mass of the matrix to a sufficiently small value, which eventually can be ignored. A diagonal containing the eigenvalues then remains. Annihilation of 2 off-diagonal elements of a symmetric matrix takes the form,

[ c  -s ] [ a_ii^(k)  a_ij^(k) ] [  c  s ]   =   [ a_ii^(k+1)      0       ]
[ s   c ] [ a_ji^(k)  a_jj^(k) ] [ -s  c ]       [     0       a_jj^(k+1)  ]     (2.2)

Kogbetliantz appears to have been the first to apply this method to general nonsymmetric matrices [10] (see [8] and [7]). We can generalize the above equation to the computation of a 2 × 2 SVD, by using two different orthogonal rotation matrices [8]. A serial-cyclic sweep of a general m × n matrix A can be performed either by a cyclic-by-row or a cyclic-by-column scheme. As noted by Brent et al [4] and others, serial cyclic-by-row and cyclic-by-column schemes are not suitable for parallel computation due to column and row conflicts throughout. In section 2.2 we shall indicate that orderings suitable for parallel computation would apply ⌊n/2⌋ rotations simultaneously. In terms of convergence for algorithms which compute the SVD in a cyclic manner we may appeal to the results of Paige and Van Dooren [16].
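As a small illustration of the serial cyclic-by-row ordering mentioned above, the loop below enumerates the n(n - 1)/2 (i, j) pairings of one sweep in row-major order; it is a generic sketch of the classical ordering, not code from the paper.

#include <stdio.h>

int main(void)
{
    int n = 4;                       /* example matrix order */
    /* cyclic-by-row sweep: visit each off-diagonal pair (i,j), i < j, once */
    for (int i = 0; i < n - 1; i++)
        for (int j = i + 1; j < n; j++)
            printf("rotate pair (%d,%d)\n", i + 1, j + 1);
    return 0;
}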

2.2 Exploiting Parallelism

Sameh was one of the first researchers to observe that there is a bound on the number of rotations which may be applied in parallel [18], [1], [19]. Given a general m × n matrix, a Kogbetliantz cyclic sweep consists of a maximum of

N = max{m,n}(max{m,n} - 1)/2

pairs of rotations. Our goal is to complete a sweep in the minimum number of parallel steps, each consisting of the maximum number of rotations applied in parallel. In addition the maximum number of processors should be kept busy at all times. Criteria such as these were originally formulated by Sameh [18]. For square matrices with n(n - 1)/2 elements above the main diagonal, it is possible to update or annihilate ⌊n/2⌋ elements at a time. Defining r = ⌊(n + 1)/2⌋, we can have (2r - 1) rotation sets applied per sweep. To summarize,

1. An orthogonal rotation set must annihilate or update ⌊n/2⌋ elements.

2. A sweep should annihilate or update each off-diagonal element only once. This implies each of the (2r - 1) orthogonal rotation sets should annihilate or update ⌊n/2⌋ elements.

The size of a rotation set is simply determined by the maximum number of non-conflicting column pairings possible. For example, given an n × n square matrix with n = 4 we may simultaneously apply 2 rotations from the left or right. This is equivalent to multiplication by an orthogonal matrix V of the form,

V = [  c_1    0    s_1    0  ]
    [   0    c_2    0    s_2 ]
    [ -s_1    0    c_1    0  ]
    [   0   -s_2    0    c_2 ]

The number of parallel iterations in a computation is therefore bounded below by

n(n - 1)/2 × 1/⌊n/2⌋,     (2.1)

or equivalently,

2r - 1 = { n        n odd,
         { n - 1    n even.

Proposition 2.2 For n a positive integer, if r = ⌊(n + 1)/2⌋ then

n(n - 1)/2 × 1/⌊n/2⌋ = 2r - 1 = { n        n odd,
                                { n - 1    n even.

Proof. Consider two cases.

Case 1. When n is even, ⌊n/2⌋ = n/2 so that

n(n - 1)/2 × 1/⌊n/2⌋ = n - 1.

Furthermore, since n is even, n + 1 is odd, hence ⌊(n + 1)/2⌋ = ⌊n/2⌋ = n/2 and 2r - 1 = 2(n/2) - 1 = n - 1.

Case 2. When n is odd, ⌊n/2⌋ = ⌊(n - 1)/2⌋ = (n - 1)/2 and

n(n - 1)/2 × 1/⌊n/2⌋ = n.

With n odd, n + 1 is even, so that ⌊(n + 1)/2⌋ = (n + 1)/2 and 2r - 1 = 2((n + 1)/2) - 1 = n.

If we assume that n is even, then not all algorithms described in the literature have achieved the n - 1 lower bound. Sameh's implementation of Hestenes' one-sided computation on a linear array of processors requires 3n - 2 parallel iterations per sweep [19], whereas Brent and Luk report that they achieve the minimum with their systolic array [4].
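For readers who want a concrete picture of a full rotation-set schedule, the sketch below generates n - 1 rounds of n/2 non-conflicting pairings using the standard round-robin (tournament) construction; this is the generic scheme underlying orderings such as Brent and Luk's, shown here only as an illustration and not as the ordering proposed in this paper (which is the divide-exchange of section 4).

#include <stdio.h>

/* Print a round-robin schedule: n - 1 rounds, each containing n/2 disjoint
   column pairings, so that every pair (i,j) occurs exactly once per sweep.
   Columns are numbered 1..n and n is assumed even (generic construction). */
int main(void)
{
    int n = 8;
    for (int r = 0; r < n - 1; r++) {
        printf("rotation set %d:", r + 1);
        for (int k = 0; k < n / 2; k++) {
            int i = (k == 0) ? n - 1 : (r + k) % (n - 1);
            int j = (r + n - 1 - k) % (n - 1);
            printf(" (%d,%d)", i + 1, j + 1);
        }
        printf("\n");
    }
    return 0;
}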

2.3 A One-sided Computation

When we consider general non-square m × n matrices where m ≥ n there exists a convenient computation for the SVD which is appropriate for parallel implementation. This method is based on a one-sided computation originally due to Hestenes [9]. It is referred to as one-sided because orthogonal rotations are only applied from the right, updating columns. Brent and Luk's [4] systolic array implements Hestenes' algorithm. Basic operations in each processor of their array reflect a tournament ordering scheme for rotations performed in parallel. The performance of their scheme is analyzed in section 3. Eberlein [5] has proposed a block variant of Hestenes' algorithm on a hypercube, suitable for computing either singular values or eigenvalues of symmetric matrices. Hestenes' one-sided computation produces an orthogonal matrix V and a matrix Q with orthogonal columns such that

AV = Q = [q_1, q_2, ..., q_n],     (2.3)

where A is m × n, m ≥ n. The Euclidean norms of the columns will be equated with the singular values of A,

q_i^T q_j = σ_i^2 δ_ij,   i, j = 1, ..., n.

By normalizing the columns, we see that the SVD of theorem 2.1 is implicit in (2.3):

Q = UΣ,   A = UΣV^T.
A one-sided algorithm is somewhat different from its earlier counterparts, as rotations are applied from the right and therefore only columns are affected. Off-diagonal elements are no longer annihilated; instead rotations are designed in order to produce two orthogonal columns. As with similar Jacobi-like algorithms, the orthogonal matrix V may be accumulated from plane rotations J(i, j, θ) which differ from the unit matrix I_n in a 2 × 2 principal submatrix containing the cosines and sines of the rotation. Setting A_1 = A, the k-th iteration updates A_k:

A_{k+1} = A_k J(i, j, θ_k).

If the matrix sequence A_k converges, the result is Q in (2.3). A column update via a 2 × 2 submatrix takes the form defined by the following,

[ a_i^{(k+1)}  a_j^{(k+1)} ] = [ a_i^{(k)}  a_j^{(k)} ] [  c  s ]
                                                        [ -s  c ]

The orthogonality condition determines the rotation angle θ,

2 (a_i^{(k)T} a_j^{(k)}) / (a_i^{(k)T} a_i^{(k)} - a_j^{(k)T} a_j^{(k)}) = tan 2θ.     (2.4)

By avoiding a potential loss of significant digits, the magnitude of the angle may be restricted to |θ| ≤ π/4; Nash [15] and Rutishauser [17] provide formulae for the rotation. As noted by Brent and Luk [4], if a cyclic-by-row rotation ordering is chosen to update the n(n - 1)/2 column pairings determined by the off-diagonal elements above the main diagonal, convergence would follow. Hestenes' computation is mathematically equivalent to a Jacobi algorithm applied to A^T A, therefore we expect that the convergence analyses of Forsythe and Henrici [6] or Wilkinson [22] are applicable under these circumstances. Rather than testing for convergence, the threshold Jacobi method originally introduced in the symmetric eigenproblem is often employed [23, pp. 277-278], [17].
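A minimal C sketch of this column update is given below: it evaluates the angle condition (2.4) in the numerically stable tangent form with |θ| ≤ π/4 and applies the rotation [c s; -s c] from the right. The function name, the storage of columns as plain arrays, and the absolute threshold test are illustrative assumptions rather than details taken from the paper.

#include <math.h>

/* Orthogonalize one column pair (a_i, a_j), each of length m, by a single
   Hestenes rotation. Returns 0 if the pair already satisfies the threshold
   test |a_i . a_j| <= tau (threshold Jacobi), 1 if a rotation was applied. */
int hestenes_rotate(double *a_i, double *a_j, int m, double tau)
{
    double aii = 0.0, ajj = 0.0, aij = 0.0;
    for (int k = 0; k < m; k++) {            /* the three inner products in (2.4) */
        aii += a_i[k] * a_i[k];
        ajj += a_j[k] * a_j[k];
        aij += a_i[k] * a_j[k];
    }
    if (fabs(aij) <= tau)
        return 0;                            /* columns deemed orthogonal */

    /* stable evaluation of the rotation angle with |theta| <= pi/4 */
    double zeta = (ajj - aii) / (2.0 * aij);
    double t = (zeta >= 0.0 ? 1.0 : -1.0) / (fabs(zeta) + sqrt(1.0 + zeta * zeta));
    double c = 1.0 / sqrt(1.0 + t * t);
    double s = t * c;

    for (int k = 0; k < m; k++) {            /* [a_i a_j] <- [a_i a_j][c s; -s c] */
        double x = a_i[k], y = a_j[k];
        a_i[k] = c * x - s * y;
        a_j[k] = s * x + c * y;
    }
    return 1;
}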

3 Parallel Computation

3.1 Maximizing Concurrency

In this paper the computation cost is measured by the number of parallel computation steps. The methods discussed process (i, j) pairings consisting of partitions containing at least 1 column or row. When n is even, if we assume one parallel computation step has unit cost, then of the algorithms presented the minimum cost achieved is n - 1 per sweep. The systolic array and associated algorithm proposed by Brent and Luk were proven to achieve this lower bound in [4]. We have illustrated their basic scheme in figure 1 for the case n = 8, where a linear array of four processors {P_1, P_2, P_3, P_4} is used.

Figure 1: Brent and Luk's Systolic Array

3.2 Minimizing Communication Costs

Another important performance criterion for a parallel algorithm is the total communication cost. For our purposes the communication cost can be measured by the total number of interprocessor transactions (messages). A transaction consists of a column transmission between a pair of processors. The total communication cost of one sweep will be denoted C. From the last section, we know that the minimum number of computation steps in a sweep is K = n - 1. The minimum number of interprocessor transactions is achieved when each processor retains one column from a pairing, and transmits the other to a destination processor. As a result, if there are p processors, p transmissions are performed between two consecutive steps. Hence the minimum total communication cost C_min is

C_min = (K - 1)p.     (3.1)

In the parallel one-sided SVD algorithm each processor is assigned one of n/2 column pairs at each step, assuming n is even. The total number of processors required is p = n/2 in (3.1) and the communication costs are O(n^2):

C_min = (K - 1)p = n(n - 2)/2.

As a contrast, a global broadcasting strategy may request each processor to send both columns to all other p - 1 processors between each step. The total cost for this case will be O(n^3). Brent and Luk's algorithm has the following communication cost:

C_BL = (K - 1) × 2p = (n - 2) × 2(n/2) = n(n - 2).

Therefore their algorithm is close to, but not quite, optimal. In fact the inefficiency lies in the double sends and receives between processors in the systolic array which are dictated by the tournament ordering. Several ways of modifying Brent and Luk's algorithm to avoid the double sends and receives have been proposed [13], [14], [5], [2]. These algorithms all represent a communication regimen based on a ring topology. A ring topology resembles the architecture of a linear array of processors. Embedding a ring within another topology, for example the binary n-cube, requires a special mapping scheme.
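The gap between the two costs is easy to see numerically; the tiny program below evaluates C_min = (K - 1)p and C_BL = (K - 1)·2p for the n = 8, p = 4 example used throughout the paper (a worked check, not code from the paper).

#include <stdio.h>

int main(void)
{
    int n = 8, p = n / 2, K = n - 1;                 /* example from the text */
    printf("C_min = %d\n", (K - 1) * p);             /* n(n-2)/2 = 24 */
    printf("C_BL  = %d\n", (K - 1) * 2 * p);         /* n(n-2)   = 48 */
    return 0;
}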

4 An Optimal Parallel SVD Algorithm

In this section we present a new parallel Jacobi-like algorithm which is optimal in terms of both achieving maximum concurrency and minimum communication overhead. The algorithm relies on a recursive divide-exchange of n = 2^d columns. Unlike several orderings cited earlier, the new algorithm maps naturally onto parallel architectures which support recursive pairwise exchanges. A mapping onto a hypercube is presented as an example in section 5. Pairwise exchanges of columns here are specified by a Perfect Shuffle of processor addresses [21].


4.1 The Parallel Algorithm

Let us first illustrate the basic principle of the new algorithm through an example where n = 8 and p = 4. The computation steps K_l and communication steps X_l, consisting of pairwise exchanges, are shown in figure 2.

Figure 2: Recursive Divide-Exchange

Initially the 8 column indices are divided into two sets,

G_1 = {1, 3, 5, 7},   G_2 = {2, 4, 6, 8}.     (4.1)

The pairs {(1,2), (3,4), (5,6), (7,8)} ∈ G_1 × G_2 are assigned, in order, to processors in the set P = {P_1, P_2, P_3, P_4}. The algorithm for n = 2^d = 2^3 = 8 consists of three parts:

Part 1: Compute-Exchange stage. The first stage consists of n/2 = 4 computation steps {K_1, K_2, K_3, K_4} and n/2 - 1 = 3 communication (exchange) steps {X_1, X_2, X_3}. In one computation step, each processor performs a plane rotation on an (i, j) pairing. A communication step X_l exchanges columns with indices in G_2 between processor pairs.

Part 2: Divide step. Processors are divided into two sets,

P_1 = {P_1, P_2},   P_2 = {P_3, P_4}.

The column indices in G_1 are divided into two subsets,

G_3 = {1, 3},   G_4 = {5, 7},

and are assigned to P_1. Similarly, G_2 is split into

G_5 = {2, 4},   G_6 = {6, 8},

and assigned to P_2, as indicated in figure 2 by step D_1.

Part 3: Recursively solve the two subproblems using a scheme similar to parts 1 and 2. A subproblem consists of n' = n/2 = 4 column pairs and n'/2 = 2 processors.

In order to specify the pairwise exchange of columns between processors described in part 1 above we introduce the notion of distance between processors. Given an index set S = {1, 2, ..., N} synonymous with processor addresses and a set of processors P = {P_i : i ∈ S} we have,

Definition 4.1 The distance s ∈ S between processors P_{i1} ∈ P and P_{i2} ∈ P is defined to be s = |i1 - i2|.

The algorithm (for n = 2^d, d = 3) can be unwound into a sequence of d = 3 compute-exchange stages (with one divide step between each pair of successive compute-exchange stages) as shown in figure 2:

{K_1, X_1, K_2, X_2, K_3, X_3, K_4, D_1, K_5, X_5, K_6, D_2, K_7}.

In general, the algorithm (for n = 2^d) can be unwound into a sequence of d compute-exchange stages (with one divide step between each pair of successive compute-exchange stages). If we number the d compute-exchange stages by k: k = 1, ..., d, the k-th compute-exchange stage consists of 2^{d-k} = n/2^k computation steps K_l, l = 1, ..., 2^{d-k}, and 2^{d-k} - 1 communication (exchange) steps X_l, l = 1, ..., 2^{d-k} - 1, forming

{K_1, X_1, K_2, X_2, ..., X_{2^{d-k} - 1}, K_{2^{d-k}}}.

1. At each computation step K_l, processors concurrently compute rotations on their assigned column pairings.

2. At each communication step X_l a parallel pairwise exchange of columns with indices in G_2 is performed between processor pairs at a distance 2^h, where h is given by the function

h = h(l) = { q             if l = 2^q,
           { h(l - 2^q)    if l > 2^q,

and q is the largest integer which satisfies 2^q ≤ l.

Each exchange step X_l is thus a parallel pairwise exchange of column indices in G_2 between processor pairs (P_{i1}, P_{i2}), where P_{i1} and P_{i2} are at a distance 2^h (h ≥ 0, h an integer, and i1 < i2). Furthermore, the binary representations of i1 and i2 may only differ in bit position h. For example, the three communication steps X_1, X_2 and X_3 result in the exchange pairings illustrated in figure 3.

Figure 3: Parallel Processor Pairings
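A direct C transcription of the exchange-distance function h(l) is given below, together with a small driver that prints the distances used by the first-stage exchange steps for n = 8; variable names and the printed format are illustrative only.

#include <stdio.h>

/* h(l): if l = 2^q then q, otherwise h(l - 2^q), where q is the largest
   integer with 2^q <= l. The exchange at step X_l is between processors
   at distance 2^h(l). */
int h(int l)
{
    int q = 0;
    while ((2 << q) <= l)          /* find largest q with 2^q <= l */
        q++;
    if (l == (1 << q))
        return q;
    return h(l - (1 << q));
}

int main(void)
{
    int n = 8, steps = n / 2 - 1;  /* exchange steps X_1 ... X_{n/2-1} of stage 1 */
    for (int l = 1; l <= steps; l++)
        printf("X_%d: exchange between processor pairs at distance %d\n",
               l, 1 << h(l));
    return 0;
}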

4.2 Computation and Communication Costs

Let n = 2^d, and let the total number of computation steps be f(n). If g(n) is the number of computation steps in stage 1 then a recurrence relation for f(n) is f(n) = g(n) + f(n/2). From our description of the algorithm we have g(n) = n/2, hence

f(n) = { n/2 + f(n/2)   n > 2,
       { 1              n = 2.      (4.2)

Solving the recurrence (4.2), we have f(n) = n - 1. Therefore, we have verified the fact that the new parallel algorithm has achieved the optimal computation cost. The reader should note that in solving the above recurrence, a geometric progression corresponding to the stage lengths results. To establish that we have achieved the optimal communication cost, consider a stage k consisting of 2^{d-k} computation steps and 2^{d-k} - 1 communication steps. For n - 1 total computation steps, d stages are required. The inter-stage divide steps account for d - 1 of the total. The total number of communication steps c(n) may be derived from a recurrence relation,

c(n) = { n/2 + c(n/2)   n > 2,
       { 0              n = 2.      (4.3)

Solving (4.3), we obtain c(n) = n - 2. Multiplying by the number of processors p = n/2 gives the communication cost C_DE for our recursive divide-exchange algorithm. We have achieved the optimum since C_DE = C_min:

C_DE = (n - 2)p = n(n - 2)/2.

Referring to our example in figure 2, 4 column transactions have occurred at each communication step, with a total cost of 6 × 4 = 24 transactions, which is optimal.
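The two recurrences are easy to check mechanically; the short program below unrolls f(n) and c(n) for a few powers of two and prints the resulting transaction count p·c(n), reproducing the 24 transactions of the n = 8 example (a verification sketch, not part of the paper).

#include <stdio.h>

int f(int n) { return (n == 2) ? 1 : n / 2 + f(n / 2); }   /* recurrence (4.2) */
int c(int n) { return (n == 2) ? 0 : n / 2 + c(n / 2); }   /* recurrence (4.3) */

int main(void)
{
    for (int n = 4; n <= 32; n *= 2) {
        int p = n / 2;
        printf("n = %2d: f(n) = %2d (= n-1), c(n) = %2d (= n-2), "
               "transactions = %3d (= n(n-2)/2)\n",
               n, f(n), c(n), p * c(n));
    }
    return 0;
}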

5 Mapping onto the Hypercube

The relative ease of mapping a recursive divide-exchange onto the hypercube is due to the recursive nature of the hypercube itself. The fact that a hypercube is recursively constructed out of lower dimensional subcubes may be exploited. A divide step in our algorithm corresponds to a subdivision of the problem, allowing computations to proceed on the subcubes. Exchanges will always consist of communication between pairs of nearest neighbours on the hypercube. A cube of dimension d - 1 is required for a problem with n = 2^d. The computation and communication steps are determined by the exchange sequence shown in figure 3.

In order to map the recursive divide-exchange algorithm of section 4 onto a hypercube architecture we must first specify the operations performed by each processor in the cube. Given the two major components of our algorithm, namely a compute-exchange and a divide, deriving an algorithm for individual processors is straightforward. Due to the tail-recursion in the parallel SVD algorithm, it may be transformed into an iterative form.

Algorithm Divide-Exchange
    for k = 1 to d do
        for l = 1 to 2^{d-k} - 1 do
            Compute (i, j)
            q = h(l)
            Exchange 2^q
        end
        Compute (i, j)
        Divide 2^{d-k-1}
    end

The step "Compute (i, j)" refers to a column update in the parallel version of Hestenes' one-sided computation. Using the terminology introduced in section 4, each processor cycles through a Jacobi sweep consisting of d stages. A divide step exchanging at a distance of 2^{-1} (i.e. in the final stage) would not be carried out. The function h(l) computes the height of an exchange node X_l, where l is the label number derived by an inorder traversal of a complete binary tree.

Function h(l)
    begin
        q = ⌊log_2 l⌋
        t = l - 2^q
        if t = 0 then return q
        else return h(t)
        end
    end
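The same node program can be written directly in C; in the sketch below the communication and update primitives (compute_pair, exchange_at_distance, divide_at_distance) are placeholder stubs standing in for whatever message-passing calls a particular machine provides, so only the control structure reflects the algorithm above.

#include <stdio.h>

/* The exchange-height function of section 4.1. */
static int h(int l)
{
    int q = 0;
    while ((2 << q) <= l) q++;
    return (l == (1 << q)) ? q : h(l - (1 << q));
}

/* Placeholder primitives: on a real hypercube these would perform the
   Hestenes column update and nearest-neighbour message exchanges. */
static void compute_pair(void)             { printf("compute (i,j)\n"); }
static void exchange_at_distance(int dist) { printf("exchange at distance %d\n", dist); }
static void divide_at_distance(int dist)   { printf("divide, new distance %d\n", dist); }

/* One Jacobi sweep of the divide-exchange algorithm as executed by a node. */
void divide_exchange_sweep(int d)
{
    for (int k = 1; k <= d; k++) {
        int steps = 1 << (d - k);                 /* 2^(d-k) computation steps */
        for (int l = 1; l <= steps - 1; l++) {
            compute_pair();
            exchange_at_distance(1 << h(l));      /* Exchange 2^h(l) */
        }
        compute_pair();
        if (k < d)
            divide_at_distance(1 << (d - k - 1)); /* skipped in the final stage */
    }
}

int main(void)
{
    divide_exchange_sweep(3);                     /* d = 3, i.e. n = 8 columns */
    return 0;
}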

5.2 Processor Pairings

Nearest neighbour processor pairings on the hypercube may be determined by a Perfect Shuffle of node addresses. Stone's original paper [21] details the generation of such pairings via a left cyclic shift of the bits in an address. A perfect shuffle of an N element vector is a permutation P of the indices or addresses a of the elements such that

P(a) = { 2a           0 ≤ a ≤ N/2 - 1,
       { 2a + 1 - N   N/2 ≤ a ≤ N - 1.      (5.1)

Consider the binary representation of an integer address a for which N = 2^d. Individual bits at position i are denoted a_i,

a = a_{d-1} 2^{d-1} + a_{d-2} 2^{d-2} + ... + a_1 2 + a_0.      (5.2)

A perfect shuffle (5.1) of an address a creating a new address a' corresponds to a left cyclic shift of all bits a_i to a_{i+1}, with the leftmost bit a_{d-1} wrapped around to a_0 [21]:

a' = a_{d-2} 2^{d-1} + a_{d-3} 2^{d-2} + ... + a_0 2 + a_{d-1}.

Our earlier requirement for a pairwise exchange of columns at a distance 2^h is easily satisfied, due to the geometry of a hypercube. The implication is that for addresses of the form (5.2), a difference in a single bit a_i indicates a distance of 2^i. We also note that the addresses of neighbouring processors in the hypercube differ in only one bit position. Exchanges, therefore, will always be between directly connected neighbours. Processor nodes in a hypercube are labelled from 0 to 2^d - 1; for example, in a 3-dimensional cube there are 8 processors with addresses 0 to 7. We can use the perfect shuffle to generate processor pairings required for exchanges at a distance which is a power of 2. This may be illustrated by an example with d = 3. Initially processor pairings for exchanges are at a distance of 1. After a perfect shuffle from addresses a to a', exchanges may take place at a distance of 2, from a' to a'' at a distance of 4, and so on. Processor pairings before and after a perfect shuffle are given in figure 4.

    node     a      a'     a''
      0     000    000    000
      1     001    010    100
      2     010    100    001
      3     011    110    101
      4     100    001    010
      5     101    011    110
      6     110    101    011
      7     111    111    111

Figure 4: 3-Dimensional Processor Pairings

The exchange and divide steps required to complete one sweep of a Jacobi-like algorithm, when n = 2^4 = 16, are illustrated in figure 5.
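The address arithmetic is compact in C: the function below performs the left cyclic shift of a d-bit address that realizes the perfect shuffle (5.1), and the small driver reproduces the a, a', a'' columns of figure 4 for d = 3 (an illustrative sketch; names are not from the paper).

#include <stdio.h>

/* Perfect shuffle of a d-bit address: a left cyclic shift of the bits,
   with the leading bit a_{d-1} wrapped around into position a_0. */
unsigned shuffle(unsigned a, int d)
{
    unsigned top = (a >> (d - 1)) & 1u;
    return ((a << 1) | top) & ((1u << d) - 1u);
}

int main(void)
{
    int d = 3;                                  /* 3-dimensional cube, 8 nodes */
    for (unsigned a = 0; a < (1u << d); a++) {
        unsigned a1 = shuffle(a, d);            /* addresses for distance-2 exchanges */
        unsigned a2 = shuffle(a1, d);           /* addresses for distance-4 exchanges */
        printf("node %u: a' = %u, a'' = %u\n", a, a1, a2);
    }
    return 0;
}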

Figure 5: Divide-Exchange on a 3-Cube. There are 15 computation steps. The column pairs (i, j) at each step are written in the processor nodes. The communication links used between the computation steps are marked.

5.3 Computational Results

An implementation of Hestenes' one-sided SVD via the non-recursive version of our algorithm was written in C for subsequent testing and analysis on the Intel iPSC hypercube. A simulator for the hypercube was provided by Intel Scientific Corp. to McGill University for a SUN 3/280 running the BSD 4.3 operating system. This SUN has an IEEE 754 standard co-processor with a floating point precision of ε = 2.22 × 10^{-16} in double precision arithmetic. A threshold Jacobi method, as described in section 2, was employed to ensure proper termination of the algorithm. Following the methods introduced by Berry et al in [1] for computation on an array processor, each node processor in the hypercube maintains a counter istop. The counter is incremented by a processor when one of its assigned column pairs (i, j) is deemed to be orthogonal according to a threshold parameter τ. For the purposes of our tests we chose τ = ε‖A‖_F. The parallel computation terminates at the end of a sweep if each of the n/2 processors reports an istop count of n - 1. For a series of random 8 × 8 matrices generated using the interactive matrix software package Matlab, we typically observe convergence in the hypercube computation after 6 sweeps. Finally, we have observed a communication pattern for random 16 × 16 matrices matching exactly with that shown in figure 6.
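For reference, the threshold used in these tests is just the machine epsilon scaled by the Frobenius norm of A; a minimal C helper computing τ = ε‖A‖_F is shown below (the storage convention and function name are assumptions, not taken from the implementation described above).

#include <float.h>
#include <math.h>

/* Compute the threshold tau = eps * ||A||_F for an m x n matrix stored
   row-major in a; DBL_EPSILON plays the role of the double precision eps. */
double threshold_tau(const double *a, int m, int n)
{
    double sum = 0.0;
    for (int k = 0; k < m * n; k++)
        sum += a[k] * a[k];
    return DBL_EPSILON * sqrt(sum);
}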

6 Conclusions and Future Research

We have described a new optimal parallel Jacobi-like algorithm for the singular value decomposition (SVD). We have demonstrated that the new algorithm can be mapped naturally onto hypercube architectures, effectively utilizing the nearest neighbour communication capacity throughout the computation. In general, the recursive pairwise exchange communication operations of the new algorithm can be efficiently supported by multiprocessors with interconnect patterns used in many networks that have been proposed to support large-scale parallelism [20]. For example, we believe that the new algorithm can be mapped effectively onto SIMD or MIMD parallel computers with interconnection networks such as PM2I-based networks and cube-based networks. These interconnection networks have the partitionability property, the ability to divide the network into independent subnetworks of different sizes [20], which matches the recursive divide-exchange structure of the new parallel algorithm proposed in this paper. We suggest the following future research directions: study extensions of the new algorithm to various forms of the SVD, to the unsymmetric eigenvalue problem and to the generalized eigenvalue problem Ax = λBx. Furthermore, we would like to gather empirical information concerning convergence properties from numerical simulations.

Acknowledgements

Intel Scientific Corp. provided us with a hypercube simulator which allowed us to test the new algorithm. Martin Santavy gave many valuable comments concerning the details of software development for the hypercube, particularly in the area of synchronization problems. The figures were prepared with the help of Peggy Gao. We would especially like to thank Prof. Chris Paige for many helpful comments and corrections related to historical background and convergence results.

References

[1] M. Berry and A. H. Sameh, "Multiprocessor Jacobi algorithms for dense symmetric eigenvalue problems and singular value decompositions", Proceedings of the International Conference on Parallel Processing, 1986.
[2] C. Bischof, "The two-sided Jacobi method on a hypercube", SIAM Proceedings of the Second Conference on Hypercube Multiprocessors, 1987.
[3] R. P. Brent, F. T. Luk and C. F. Van Loan, "Computation of the singular value decomposition using mesh-connected processors", J. VLSI Computer Systems, 1 (1985), pp. 242-270.
[4] R. P. Brent and F. T. Luk, "The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays", SIAM J. Sci. Stat. Comput., 6 (1985), pp. 69-84.
[5] P. J. Eberlein, "On using the Jacobi method on the hypercube", SIAM Proceedings of the Second Conference on Hypercube Multiprocessors, 1987.
[6] G. E. Forsythe and P. Henrici, "The cyclic Jacobi method for computing the principal values of a complex matrix", Trans. Amer. Math. Soc., 94 (1960), pp. 1-23.
[7] G. H. Golub and W. Kahan, "Calculating the singular values and pseudo-inverse of a matrix", J. SIAM Ser. B: Numer. Anal., 2 (1965), pp. 205-224.
[8] M. T. Heath, A. J. Laub, C. C. Paige and R. C. Ward, "Computing the singular value decomposition of a product of two matrices", SIAM J. Sci. Stat. Comput., 7 (1986), pp. 1147-1159.
[9] M. R. Hestenes, "Inversion of matrices by biorthogonalization and related results", J. Soc. Indust. Appl. Math., 6 (1958), pp. 51-90.
[10] E. G. Kogbetliantz, "Solution of linear equations by diagonalization of coefficients matrix", Quart. Appl. Math., 13 (1955), pp. 123-132.
[11] C. Lawson and R. Hanson, Solving Least Squares Problems, Prentice-Hall, Englewood Cliffs, N.J., 1974.
[12] F. T. Luk, "A triangular processor array for computing singular values", Linear Algebra Appl., 77 (1986), pp. 259-273.
[13] F. T. Luk and H. Park, "On parallel Jacobi orderings", Cornell University, School of Elec. Eng. Report, EE-CEG-86-5, 1986.
[14] J. J. Modi and J. D. Pryce, "Efficient implementation of Jacobi's diagonalization method on the DAP", Numer. Math., 46 (1985), pp. 443-454.
[15] J. C. Nash, "A one-sided transformation method for the singular value decomposition and algebraic eigenproblem", Comput. J., 18 (1975), pp. 74-76.
[16] C. C. Paige and P. Van Dooren, "On the quadratic convergence of Kogbetliantz's algorithm for computing the singular value decomposition", Linear Algebra Appl., 77 (1986), pp. 301-313.
[17] H. Rutishauser, "The Jacobi method for real symmetric matrices", Numer. Math., 16 (1966), pp. 205-223.
[18] A. H. Sameh, "On Jacobi and Jacobi-like algorithms for a parallel computer", Math. Comp., 25 (1971), pp. 579-590.
[19] A. H. Sameh, "Solving the linear least squares problem on a linear array of processors", Algorithmically Specialized Parallel Computers, Academic Press, 1985, pp. 191-200.
[20] H. J. Siegel, Interconnection Networks for Large-Scale Parallel Processing, Lexington Books, D.C. Heath and Co., Mass., 1985.
[21] H. S. Stone, "Parallel processing with the perfect shuffle", IEEE Trans. Comput., C-20 (1971), pp. 153-161.
[22] J. H. Wilkinson, "A note on the quadratic convergence of the cyclic Jacobi process", Numer. Math., 4 (1962), pp. 296-300.
[23] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965.


