
Solution of Large, Dense Symmetric Generalized Eigenvalue Problems Using Secondary Storage

ROGER G. GRIMES and HORST D. SIMON
Boeing Computer Services

This paper describes a new implementation of algorithms for solving large, dense symmetric eigenproblems AX = BXΛ, where the matrices A and B are too large to fit in the central memory of the computer. Here A is assumed to be symmetric, and B symmetric positive definite. A combination of block Cholesky and block Householder transformations is used to reduce the problem to a symmetric banded eigenproblem whose eigenvalues can be computed in central memory. Inverse iteration is applied to the banded matrix to compute selected eigenvectors, which are then transformed back to eigenvectors of the original problem. This method is especially suitable for the solution of large eigenproblems arising in quantum physics, using a vector supercomputer with a fast secondary storage device such as the Cray X-MP with SSD. Some numerical results demonstrate the efficiency of the new implementation.
Categories and Subject Descriptors: F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems - computation on matrices; G.1.3 [Numerical Analysis]: Numerical Linear Algebra - eigenvalues, sparse and very large systems

General Terms: Algorithms

Additional Key Words and Phrases: Block Householder reduction, eigenvalue problems, out-of-core algorithms, supercomputers, vector computers

1. INTRODUCTION
Quantum mechanical bandstructure computations require repeated computation
of a large number of eigenvalues of a symmetric generalized eigenproblem [9].
After the eigenvalues are examined, a selected number of eigenvectors are
required in order to continue the computations. This application is memory-limited on most of today's supercomputers, with the exception of the Cray-2. For
example, on a 4-Mword Cray X-MP/24, eigenvalue problems of order up to 1900
can be solved with a modification of standard in-core software such as EISPACK.
Several applications require the solution of problems of about twice that size.
The authors gratefully acknowledge support through NSF grant ASC-8519354.
Authors' current addresses: R. G. Grimes, Boeing Computer Services, Engineering Scientific Services
Division, P.O.B. 24346, M/S 7L-21, Seattle, WA 98124-0346; H. D. Simon, NAS Systems Division,
NASA, Ames Research Center, Moffett Field, CA 94035.
Permission to copy without fee all or part of this material is granted provided that the copies are not
made or distributed for direct commercial advantage, the ACM copyright notice and the title of the
publication and its date appear, and notice is given that copying is by permission of the Association
for Computing Machinery.
To copy otherwise, or to republish, requires a fee and/or specific
permission.
© 1988 ACM 0098-3500/88/0900-0241 $01.50

This need for an efficient out-of-core solution algorithm for generalized eigenvalue problems motivated our research.
The efficient in-core solution of the symmetric generalized eigenproblem

    A X = B X Λ,        (1)

where A and B fit into memory, has been addressed in [9]. The authors extended
EISPACK [7, 16] to include a path for the generalized eigenproblem where A
and B are in packed symmetric storage. They also applied a block-shifted Lanczos
algorithm to the same problem.
This paper describes an algorithm developed to solve (1) where A and B are too large to fit in central memory. The key computational element in eigenvalue calculations for dense matrices that requires an efficient implementation is the Householder reduction to tridiagonal form. Even though the in-core implementation of a Householder reduction is well understood, an out-of-core Householder reduction is far from trivial. The main problem is that in each of the n steps of the Householder reduction an access to the unreduced remainder of the matrix appears to be necessary. This seems to indicate that O(n^3) I/O transfers are necessary. The main contribution of this paper is the application of a block algorithm that reduces the I/O requirements to O(n^3/p), where p is the block size. With an appropriately chosen block size, the I/O requirements can be reduced to a level where they are completely dominated by the execution time.
The approach is to first reduce (1) to standard form by applying the Cholesky factor L of B as follows:

    (L^{-1} A L^{-T})(L^T X) = (L^T X) Λ    or    C Y = Y Λ,        (2)

where

    C = L^{-1} A L^{-T},        (3)
    Y = L^T X.
The matrix B is factored using the block Cholesky algorithm, where B is initially stored in secondary storage and is overwritten with its factor L [8, 17]. Since B is assumed to be symmetric positive definite, the Cholesky factorization exists, and no pivoting is necessary.
The next step in the standard EISPACK approach would be to reduce C to tridiagonal form using a sequence of Householder transformations. The application of Householder transformations to reduce C to tridiagonal form requires access to all of the unreduced portion of C for the reduction of each column. This access requirement would incur O(n^3) I/O transfers. This amount of I/O is too high for an out-of-core implementation, since it would effectively limit the efficiency of the algorithm by the transfer speed to and from secondary storage. Instead, a block Householder transformation is used to reduce C to a band matrix, where the band matrix will have a small enough bandwidth to be held in central memory. We partition the matrix C symmetrically into an m-by-m block matrix. Each block has p rows and columns, except for the blocks in the last
block row (column) which may have fewer than p columns (rows). Only the block
lower triangular part of the coefficient matrix is stored on the secondary storage
device. For each block column we compute the QR factorization of the matrix
comprised of the blocks below the diagonal block. The product of Householder
transformations used to compute the QR factorization is accumulated in the WY
representation described in [2]. This representation is then applied simultaneously to the left and right of the remainder of the matrix. When all block columns
except the last are reduced in this fashion, the result is a band matrix with
bandwidth p which is similar to the matrix C in (2). The I/O requirements of the block Householder transformation are O(n^3/p). The performance of the algorithm will no longer be I/O-bound even for small block sizes (say p > 10).
Obviously, the precise value of p, for which the processing time will begin to
dominate, will depend on the ratio of processor speed and I/O speed of a particular
machine.
The banded eigenproblem is then transformed to tridiagonal form using an
enhanced version of EISPACK subroutine BANDR. Finally, EISPACK subroutine TQLRAT is used to compute the eigenvalues.
Specified eigenvectors are computed by first computing the associated eigenvectors of the band matrix and then back-transforming them to Y using the accumulated block Householder transformations. Application of L^{-T} to Y back-transforms Y to X, the eigenvectors of the original problem (1). The eigenvectors of the band matrix are computed using inverse iteration, implemented in a heavily modified version of EISPACK subroutine BANDV.
This algorithm has been implemented in a pair of subroutines, HSSXGV and HSSXG1, which, respectively, compute the eigenvalues and a specified set of eigenvectors of (1). The usage of these two subroutines, as well as of subroutines HSSXEV and HSSXE1, which compute the eigenvalues and specified eigenvectors of (2), is described in [10].
Section 2 describes in further detail the reduction of the generalized problem
to standard form (3). Section 3 discusses the implementation of the block
Householder reduction to banded form. Section 4 describes the modifications that were applied to several EISPACK subroutines for band matrices; these modifications resulted in the increased efficiency of several routines. The computation of the eigenvectors is discussed in Section 5. Some resource estimates,
performance results, and comparisons with in-core algorithms are presented in
the final sections.
Our target machine is the Cray X-MP with a Solid-state Storage Device (SSD).
Current Cray configurations range from 64 Mword SSDs up to 1024 Mword
SSDs. A 1024 Mword SSD can hold a matrix of order 32000, or, in our application,
two symmetric packed matrices of the same order. The SSD has an asymptotic
maximum transfer rate of about 150 million words per second. This means that
the whole contents of the SSD could be read in a few seconds (if there were
enough memory). Another way of appreciating the fast rate of the SSD is by
comparing it to the peak performance of the single processor Cray X-MP, which
is about 210 million floating point operations per second (MFLOPS). Thus the
overall speed of the machine is well balanced with the fast secondary memory
access times. By comparison, the fastest disks available on a Cray X-MP are an
order of magnitude slower. A DD-49 disk can hold 150 Mwords and has an
asymptotic transfer rate of 12.5 Mwords per second. In spite of the focus on this
particular target machine, we have attempted to develop a portable implementation. Indeed, our software has been successfully tested on a variety of machines
such as a Sun 3/260 workstation, a Cyber 875, an SCS-40, and a MicroVAX.
2. REDUCTION TO STANDARD FORM

The first step in the reduction to standard form is the Cholesky factorization of B. The out-of-core block factorization of a full matrix is a standard computational task, and efficient software is available. Here we use a row-oriented implementation which computes the lower triangular matrix L [8, 17]. We do not discuss this algorithm in more detail, except for two facts. First, the implementation [17] is written in a way that makes extensive use of matrix-matrix multiplication as a basic computational kernel (see [6, 13]). By providing a specially tailored version of this kernel, very high computational speeds can be obtained on vector computers. For example, on a Cray X-MP the implementation in [17] performs at 192 Mflops for a system of order 3000 using the SSD as a secondary storage device. The second fact which should be mentioned is that the partitioning of B into blocks can be independent of the partitioning of A; in other words, the blocking of B for the Cholesky factorization is not required to conform to the blocking of A for the block Householder reduction.
Assuming that there is existing software for computing the Cholesky factor L of B using secondary storage and for performing forward elimination and back substitution with L, we partition the matrix A symmetrically into an m-by-m block matrix. Each block has p rows and columns, except for the blocks in the last block row (column) of A, which may have fewer than p columns (rows); that is,

A 191 A&
A 291 Aw
.
.

. . . A:,,
. . . A:,,
A=
.
.
.
(4)
*
.
A m,1 Am,2 . . - Am,,
The blocks of the lower triangle of A are assumed to be stored in secondary

I : * *I

storage.
The reduction of the eigenproblem to standard form given by (2) is accomplished as follows. First the lower block triangle of X = L^{-1}A is computed, one block column at a time, starting with the last block column. The whole column is computed, but only the lower block triangle is stored. With X = L^{-1}A, the matrix C = X L^{-T} is computed next by reading in one column of X^T at a time and performing forward elimination in order to compute C^T = L^{-1} X^T. Because of symmetry, it is only necessary to carry out the forward elimination of the kth block column up to block k. The result is then stored, and the lower block triangle of C is obtained as a result.
Even though the intermediate result X is not symmetric, this procedure exploits
the symmetry of the problem. Both the number of operations and the out-of-core storage requirements are reduced, since the computation of the full C is
avoided. The detailed algorithm for this reduction process is given below.
Algorithm 2.1. Reduction to Standard Form

Compute the Cholesky factor L of B.
for k = m, 1, -1
    for j = 1, k - 1
        Read in block A_{k,j} and set X_j = A_{k,j}^T.
    end for
    for j = k, m
        Read in block A_{j,k} and set X_j = A_{j,k}.
    end for
    Compute X = L^{-1} X.
    for j = k, m
        Write X_j over A_{j,k}.
    end for
end for
The lower triangle of A has now been overwritten with L^{-1}A. Now form C = X L^{-T}, with C overwriting X.
for k = 1, m
    for j = 1, k
        Read in block X_{k,j} and store in Y_j.
    end for
    Compute Y^T = L^{-1} Y^T, performing the forward elimination only to block k of Y.
    for j = 1, k
        Write block Y_j over X_{k,j}.
    end for
end for
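The following Python sketch models Algorithm 2.1 in core, with a dictionary of blocks standing in for secondary storage; the block-I/O layer, the out-of-core Cholesky factorization of B, and all error handling are omitted, and the names are illustrative rather than taken from the actual implementation.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(1)
p, m = 3, 4
n = p * m
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
Bh = rng.standard_normal((n, n)); B = Bh @ Bh.T + n * np.eye(n)
L = cholesky(B, lower=True)                # done out of core in the paper

blk = lambda M, i, j: M[i*p:(i+1)*p, j*p:(j+1)*p]
store = {(i, j): blk(A, i, j).copy() for i in range(m) for j in range(i + 1)}

# Pass 1: overwrite the lower block triangle with X = L^{-1} A, processing
# block columns from last to first so that the row blocks of A that are
# still needed remain unmodified.
for k in range(m - 1, -1, -1):
    col = np.vstack([store[(k, j)].T for j in range(k)] +
                    [store[(j, k)] for j in range(k, m)])   # full block column k of A
    X = solve_triangular(L, col, lower=True)                # L^{-1} A(:, block col k)
    for j in range(k, m):
        store[(j, k)] = X[j*p:(j+1)*p]                      # keep lower part only

# Pass 2: overwrite with C = X L^{-T}, i.e. C^T = L^{-1} X^T, one block row
# of X at a time, eliminating only through block k (symmetry).
for k in range(m):
    rhs = np.vstack([store[(k, j)].T for j in range(k + 1)])  # block col k of X^T
    Ct = solve_triangular(L[:(k+1)*p, :(k+1)*p], rhs, lower=True)
    for j in range(k + 1):
        store[(k, j)] = Ct[j*p:(j+1)*p].T                     # C_{k,j}

# Check against the dense computation of L^{-1} A L^{-T}.
C = np.zeros((n, n))
for (i, j), block_ij in store.items():
    C[i*p:(i+1)*p, j*p:(j+1)*p] = block_ij
    C[j*p:(j+1)*p, i*p:(i+1)*p] = block_ij.T
Cref = solve_triangular(L, solve_triangular(L, A, lower=True).T, lower=True).T
print(np.allclose(C, Cref))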
3. BLOCK HOUSEHOLDER REDUCTION TO BANDED FORM

The symmetric generalized eigenproblem (1) has now been reduced to the symmetric standard eigenproblem (2). The next step of the algorithm is to reduce (2) to banded form by a sequence of block Householder transformations. Each individual block Householder transformation is implemented using the Bischof and Van Loan WY representation [2] of a product of Householder transformations. This form has proven to be useful in implementing the QR factorization in a parallel processing environment and in obtaining optimal performance in FORTRAN implementations of the QR factorization on vector computers [1, 12].
The product of p Householder transformations can be represented as

    H_1 H_2 ... H_p = I - W Y^T,

where W and Y are matrices with p columns [2]. Let the matrix C be symmetrically partitioned as

    C = [ C_{1,1}   C_{2,1}^T   ...   C_{m,1}^T
          C_{2,1}   C_{2,2}     ...   C_{m,2}^T
            ...       ...       ...     ...
          C_{m,1}   C_{m,2}     ...   C_{m,m}  ],

and let M be the first block column of C excluding the diagonal block. Let H_1, H_2, ..., H_p be the Householder transformations from the QR factorization of M, where each H_i has the form

    H_i = I - 2 u_i u_i^T.
The Householder transformations are accumulated into the WY representation, with the columns w_i and y_i of W and Y defined as follows:

    w_i = 2 u_i,
    y_i = H_p H_{p-1} ... H_{i+1} u_i.

It is easily demonstrated that

    H_1 H_2 ... H_p = I - W Y^T.

Let Q be a similarity transformation defined as

    Q = [ I        0
          0   I - W Y^T ].

It can easily be shown that Q is indeed a similarity transformation. Furthermore, when applied to C, Q reduces the first block column to banded form with half bandwidth p + 1:

    Q^T C Q = [ C_{1,1}   R^T       0          ...   0
                R         Ĉ_{2,2}   Ĉ_{3,2}^T  ...   Ĉ_{m,2}^T
                0         Ĉ_{3,2}   Ĉ_{3,3}    ...   Ĉ_{m,3}^T
                ...         ...       ...      ...     ...
                0         Ĉ_{m,2}   Ĉ_{m,3}    ...   Ĉ_{m,m}  ],        (5)

where R is the upper triangular matrix from the QR factorization of M and Ĉ_{i,j} denotes the blocks of C that are modified from the previous step.
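A small sketch of this accumulation (illustrative Python, not the production FORTRAN): the Householder vectors of M are generated column by column, W and Y are built with w_i = 2 u_i and y_i = H_p ... H_{i+1} u_i, and the identity H_1 ... H_p = I - W Y^T is checked by brute force.

import numpy as np

def householder_unit(a):
    """Unit vector u such that (I - 2 u u^T) a is a multiple of e_1."""
    v = a.copy()
    v[0] += np.copysign(np.linalg.norm(a), a[0])
    return v / np.linalg.norm(v)

rng = np.random.default_rng(2)
nrows, p = 12, 4
M = rng.standard_normal((nrows, p))

U = np.zeros((nrows, p))                 # Householder vectors u_1 ... u_p
R = M.copy()
for i in range(p):
    u = np.zeros(nrows)
    u[i:] = householder_unit(R[i:, i])
    U[:, i] = u
    R -= 2.0 * np.outer(u, u @ R)        # apply H_i = I - 2 u u^T

W = 2.0 * U                              # w_i = 2 u_i
Y = U.copy()
for i in range(p):                       # y_i = H_p ... H_{i+1} u_i
    for j in range(i + 1, p):
        Y[:, i] -= 2.0 * U[:, j] * (U[:, j] @ Y[:, i])

H = np.eye(nrows)
for i in range(p):
    H = H @ (np.eye(nrows) - 2.0 * np.outer(U[:, i], U[:, i]))
print(np.allclose(H, np.eye(nrows) - W @ Y.T))   # H_1 ... H_p = I - W Y^T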
The standard trick in the single-vector case for reducing the amount of computation when symmetrically applying a single Householder transformation can be applied in the block context as well. Let C^(2) be the matrix consisting of the second through mth block rows and columns of C. Then (5) reduces to forming

    Ĉ^(2) = (I - W Y^T)^T C^(2) (I - W Y^T).        (6)

Expanding the expression yields

    Ĉ^(2) = C^(2) - Y W^T C^(2) - C^(2) W Y^T + Y W^T C^(2) W Y^T.        (7)

Let S = C^(2) W and V = W^T C^(2) W. Substituting S and V into (7) yields

    Ĉ^(2) = C^(2) - Y S^T - S Y^T + Y V Y^T

or

    Ĉ^(2) = C^(2) - Y (S - (1/2) Y V)^T - (S - (1/2) Y V) Y^T.        (8)

Let T = S - (1/2) Y V and substitute into (8) to obtain

    Ĉ^(2) = C^(2) - Y T^T - T Y^T.        (9)

Thus the cost of forming Ĉ^(2) is reduced to the cost of multiplications with C^(2), W, and Y. Furthermore, the matrix C^(2) is accessed only 3 times per block row (2 fetches and 1 store). All these operations are matrix-matrix multiplications, which can be performed very efficiently on vector computers.
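The update (6)-(9) can be checked directly; the sketch below forms S, V, and T for a small symmetric block C^(2) and compares the result with the explicit product (I - W Y^T)^T C^(2) (I - W Y^T). The identity only requires C^(2) (and hence V) to be symmetric, so random W and Y suffice for the test; all names are illustrative.

import numpy as np

rng = np.random.default_rng(3)
n2, p = 10, 3
C2 = rng.standard_normal((n2, n2)); C2 = (C2 + C2.T) / 2
W = rng.standard_normal((n2, p))
Y = rng.standard_normal((n2, p))

S = C2 @ W                        # about 2 n^2 p flops per block step
V = W.T @ S
T = S - 0.5 * Y @ V
C2_new = C2 - Y @ T.T - T @ Y.T   # three passes over C2 in the real code

Q2 = np.eye(n2) - W @ Y.T
print(np.allclose(C2_new, Q2.T @ C2 @ Q2))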
The same procedure will be applied repeatedly to the first block column of the
remainder of the matrix. After m - 1 steps, the dense matrix has then been
reduced to banded form. The details of the reduction process are given in the
algorithm below. In the following description, C^(k+1) refers to block rows and columns k + 1 through m of C.
Algorithm 3.1. Reduction to Banded Form

for k = 1, m - 1
    for j = k + 1, m
        Read in C_{j,k} and store in M_j.
    end for
    Compute the QR factorization of M, accumulating the matrices W, Y, and R.
    Write R over block C_{k+1,k}.
    Write W and Y to secondary storage.
    Compute S = C^(k+1) W with S overwriting W.
    Compute V = W^T S with V overwriting R.
    Compute T = S - (1/2) Y V with T overwriting W.
    for j = k + 1, m
        for i = j, m
            Read in C_{i,j}.
            C_{i,j} = C_{i,j} - Y_i T_j^T - T_i Y_j^T.
            Write out C_{i,j}.
        end for
    end for
end for
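A simplified dense model of the whole reduction is sketched below: at step k the subdiagonal part of block column k is factored with a dense QR, and the orthogonal factor is applied as a two-sided similarity transformation (the paper does this with the WY form and out-of-core blocks). The eigenvalues of the resulting band matrix, obtained here with SciPy's band eigensolver as a stand-in for the modified BANDR/TQLRAT path, agree with those of the original dense matrix.

import numpy as np
from scipy.linalg import eig_banded

rng = np.random.default_rng(4)
p, m = 3, 5
n = p * m
C = rng.standard_normal((n, n)); C = (C + C.T) / 2
eig_ref = np.sort(np.linalg.eigvalsh(C))

Cb = C.copy()
for k in range(m - 1):
    lo = (k + 1) * p
    # QR of the blocks below the diagonal block of column k
    Q, _ = np.linalg.qr(Cb[lo:, k*p:(k+1)*p], mode='complete')
    Cb[lo:, :] = Q.T @ Cb[lo:, :]        # apply diag(I, Q) from the left ...
    Cb[:, lo:] = Cb[:, lo:] @ Q          # ... and from the right

# Cb is now banded with p nonzero subdiagonals; hand its bands to a band solver.
bands = np.array([np.pad(np.diagonal(Cb, -d), (0, d)) for d in range(p + 1)])
eig_band = np.sort(eig_banded(bands, lower=True, eigvals_only=True))
print(np.allclose(eig_ref, eig_band))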

4. COMPUTING THE EIGENVALUES OF THE BAND MATRIX

We have thus reduced the large out-of-core problem to a standard in-core problem: the computation of the eigenvalues of a symmetric band matrix. The standard software for this task is EISPACK [7, 16]. However, we found the EISPACK routines for this problem, BANDR and BANDV, to be comparatively inefficient, and therefore applied several modifications. These modifications are described here and in the next section.
We examined the available software for solving the banded eigenproblem: BANDR from EISPACK and VBANDR, the vectorized version of BANDR [11]. BANDR uses modified Givens rotations to annihilate the nonzero entries within the band in order to reduce the band matrix to tridiagonal form. With this method, the zeroing of an entry in the band creates an extra element on the subdiagonal immediately outside the band. This new entry must then be zeroed, which creates another new entry p rows further down the subdiagonal. Several rotations are carried out until the element is chased off the bottom of the matrix (see [15]). This algorithm takes on the average n/(2p) rotations to chase away one nonzero in the band. Overall, approximately n^2/2 rotations are computed and applied with vector length of p. The implementation of BANDR in EISPACK, even with the appropriate compiler directives, is very inefficient on the Cray X-MP.
Kaufman's VBANDR uses Givens rotations to annihilate the nonzero entries within the band. Furthermore, it postpones the annihilation of the entries outside the band so that several entries can be annihilated simultaneously. It then vectorizes across the entries to be annihilated, instead of over the rows or columns of the band matrix. On the average, VBANDR simultaneously annihilates n/(2p) entries. VBANDR is more efficient than BANDR when p << n. Kaufman's original results [11] were on a Cray-1S, but they are essentially the same for the Cray X-MP.
Neither the performance of BANDR nor that of VBANDR was acceptable for the problems we were considering. Since BANDR seemed to be the likely candidate for further enhancements, we rewrote BANDR to use SROTG and SROT from the BLAS [13], which have very efficient implementations on the Cray X-MP in the Boeing computational kernel library VectorPak [17]. This modification to BANDR provided a factor of 10 speed-up over standard BANDR on problems with n > 500 and p = 50.
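For reference, the kernel the rewritten BANDR leans on is the plane-rotation pair (SROTG to generate, SROT to apply). The toy sketch below generates one rotation and applies it symmetrically to zero a single entry; the bookkeeping of the actual chase is, of course, more involved, and the Python names are illustrative.

import numpy as np

def rotg(a, b):
    """Return c, s with [[c, s], [-s, c]] @ [a, b] = [r, 0] (SROTG analogue)."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0 else (a / r, b / r)

A = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.5],
              [2.0, 0.5, 5.0]])
c, s = rotg(A[1, 0], A[2, 0])
G = np.eye(3); G[1:, 1:] = [[c, s], [-s, c]]
A = G @ A @ G.T            # symmetric application; A[2, 0] is now zero
print(np.round(A, 6))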
The tridiagonal eigenvalue problem is then solved with TQLRAT from EISPACK, which is known to be one of the fastest tridiagonal eigenvalue solvers. TREEQL, a recent successful implementation of Cuppen's [4] divide-and-conquer algorithm for the tridiagonal eigenvalue problem by Dongarra and Sorensen [5], promised some additional speed-up. However, TREEQL computes both eigenvalues and eigenvectors. Because of the size of the eigenvector matrix, the computation of the full spectrum is neither desired nor feasible in our application. TREEQL was therefore modified to avoid the storage of all eigenvectors. Only intermediate quantities were computed, which are related to the eigenvectors of the tridiagonal submatrices and are required for the computation of the updates. This modified version of TREEQL required a maximum of 16n storage locations, compared to the n^2 words of memory required in the original TREEQL for the full eigenvector matrix. The modified TREEQL also performs fewer operations than the original TREEQL. Because the number of operations is determined by the number of deflations occurring in the algorithm, an operation count for the modified TREEQL is difficult to give. In some computational tests, however, it turned out that the modified TREEQL performed about the same as TQLRAT on matrices of order up to 1500 (for more details, see Section 7). Therefore, we decided not to include the modified TREEQL in the algorithm. These results were obtained using the single-processor version of TREEQL. We did not use TREEQL as a multiprocessor algorithm, where its advantages are more obvious.
5. COMPUTATION OF EIGENVECTORS

After the computation of all the eigenvalues, control is returned to the user. The user can then examine the computed eigenvalues and decide how many (if any) eigenvectors to compute. Thus the last decision to be made in the development of this software package concerns the approach to be used for the computation of the selected eigenvectors.
The first approach was to compute the selected eigenvectors of the tridiagonal matrix using inverse iteration (subroutine TINVIT from EISPACK). This, however, requires the back-transformation of the eigenvectors using the n^2/2 Givens rotations generated by the modified version of BANDR. This seemed to be prohibitively expensive. The second approach was to compute the selected eigenvectors of the band matrix, also using inverse iteration. Subroutine BANDV of EISPACK performs this computation. We tried BANDV and found it to be very inefficient in both central memory requirements and central processing time.
An examination of EISPACK's BANDV shows that it first computes a pivoted LU factorization of the band matrix. Only U, which now has a bandwidth of 2p, is saved, and L is discarded. Since L is not available for a true implementation of inverse iteration, a heuristic, which uses back-solves with U, is used to compute the eigenvector. The convergence is apparently slow: EISPACK permits up to n iterations per eigenvector instead of the limit of 5 iterations allowed in other EISPACK subroutines using inverse iteration. BANDV was modified to compute an LDL^T factorization of the shifted matrix. The factorization is modeled after an efficient implementation of SPBFA from LINPACK [3] on the Cray X-MP based on the Level 2 BLAS kernel STRSV [6]. Also, the heuristic was replaced with true inverse iteration.
These two modifications give a 100-fold speed-up over the original implementation of BANDV on the Cray X-MP. One order of magnitude of speed-up is obtained from the replacement of the linear equation solution technique; the other comes from the faster convergence of true inverse iteration compared with the original heuristic. Two other by-products of the enhancements are a reduction of the storage required by BANDV to represent the shifted band matrix from 2np to np and, more importantly, more accurate eigenvectors.
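The following sketch illustrates inverse iteration for one eigenvector of a symmetric band matrix. A general banded LU solve from SciPy stands in for the LDL^T factorization of the shifted matrix described above, and the convergence test and the reorthogonalization against neighboring eigenvectors are omitted; all names are illustrative.

import numpy as np
from scipy.linalg import solve_banded, eigvals_banded

rng = np.random.default_rng(5)
n, p = 30, 4
# random symmetric band matrix, built densely for clarity
A = np.zeros((n, n))
for d in range(p + 1):
    v = rng.standard_normal(n - d)
    A += np.diag(v, -d) + (np.diag(v, d) if d else 0)

# band storage with l = u = p for solve_banded
ab = np.zeros((2 * p + 1, n))
for d in range(-p, p + 1):
    ab[p - d, max(d, 0):n + min(d, 0)] = np.diagonal(A, d)

lam = eigvals_banded(ab[p:], lower=True)[3]      # pick one eigenvalue as target
shift = lam + 1e-7                               # slightly perturbed shift
ab_shift = ab.copy(); ab_shift[p] -= shift       # main diagonal of A - shift*I

x = rng.standard_normal(n)
for _ in range(5):                               # a handful of iterations suffices
    x = solve_banded((p, p), ab_shift, x)        # solve (A - shift I) x_new = x
    x /= np.linalg.norm(x)
print(np.linalg.norm(A @ x - lam * x))           # small residual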
The central memory requirement for computing the eigenvalues alone is approximately 3np. To compute eigenvectors, 2np words of this are required for the band matrix and for the workspace for carrying out inverse iteration. This leaves storage for p eigenvectors to be held in central memory. Thus the eigenvectors are computed in blocks of at most p eigenvectors at a time.
Reorthogonalization of the eigenvectors is required for eigenvectors corresponding to multiple eigenvalues, and is often needed when the eigenvalues are close, but not exactly equal. If |λ_i - λ_{i+1}| is smaller than a fixed small tolerance times ||B||_1, then the two neighboring eigenvalues λ_i and λ_{i+1} are considered to belong to a cluster. After all clusters have been determined, up to p eigenvectors are computed at a time, whereby all eigenvalues in a cluster are assigned to the same group of p eigenvalues. Then reorthogonalization is carried out within the cluster. Since all eigenvectors belonging to the cluster are in core at this time, no extra I/O costs are incurred.
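A sketch of the clustering and in-cluster reorthogonalization logic follows (illustrative Python; the grouping rule mirrors the description above, but the tolerance and the names are assumptions rather than the values used in the modified BANDV).

import numpy as np

def cluster(eigvals, tol):
    """Split a sorted array of eigenvalues into clusters of nearby values."""
    groups, start = [], 0
    for i in range(1, len(eigvals) + 1):
        if i == len(eigvals) or eigvals[i] - eigvals[i - 1] >= tol:
            groups.append(range(start, i))
            start = i
    return groups

def reorthogonalize(V):
    """Orthonormalize the columns of V in place (modified Gram-Schmidt),
    applied to the eigenvectors of one cluster while they are in core."""
    for j in range(V.shape[1]):
        for k in range(j):
            V[:, j] -= (V[:, k] @ V[:, j]) * V[:, k]
        V[:, j] /= np.linalg.norm(V[:, j])
    return V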
There are two possible drawbacks to this scheme. One is that obviously we will not be able to compute accurate eigenvectors if the multiplicity or the cluster size is greater than the block size p. In our current setup the block size is very large (p = 63), and this is unlikely to occur. In the event that very high multiplicities are indeed found, the user still has the option to increase the block size and thus compute all the eigenvectors belonging to a cluster within one block.
The other possible difficulty arises when there is an inefficient allocation of clusters to groups of eigenvalues. A worst case is if there are about 2n/p clusters with p/2 + 1 eigenvalues each. In this case about twice as many groups of eigenvectors would be required as in the normal case. This increases the I/O requirements for back-transforming the eigenvectors. In other contexts [14], reorthogonalizations occupied a significant portion of the computation time. We did not find this to occur in our numerical experiments, which are reported in Section 7.
The algorithm to compute the eigenvectors of the band matrix and back-transform them to eigenvectors of the original problem is given below.
Algorithm 5.1. Computation of Eigenvectors

for ivec = 1, numvec, p
    Using BANDV, compute the eigenvectors of the band matrix corresponding to the next set of up to p eigenvalues specified by the user. The eigenvectors are stored in X.
    for k = m - 1, 1, -1
        Read in W and Y.
        X = (I - W Y^T) X.
    end for
    X = L^{-T} X.
    Write X to secondary storage.
end for
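A sketch of this back-transformation in Python is given below. It assumes that the (W, Y) pair written to secondary storage at each reduction step is available (here simply as a Python list, with step k acting on the trailing rows starting at block k + 1), together with the Cholesky factor L; the function and its names are illustrative, not part of the actual package.

from scipy.linalg import solve_triangular

def back_transform(Z, WY_blocks, L, p):
    """Map eigenvectors Z of the band matrix to eigenvectors of A x = lambda B x."""
    X = Z.copy()
    for k in range(len(WY_blocks) - 1, -1, -1):   # reverse order, as in Algorithm 5.1
        W, Y = WY_blocks[k]
        lo = (k + 1) * p                           # rows touched at reduction step k
        X[lo:] -= W @ (Y.T @ X[lo:])               # X <- (I - W Y^T) X
    return solve_triangular(L, X, trans='T', lower=True)   # X <- L^{-T} X

Combined with the band eigenvectors produced by inverse iteration, the returned columns approximately satisfy A x = λ B x for the selected eigenvalues.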
6. RESOURCE REQUIREMENTS

The software described above makes extensive use of both the central processing unit and the secondary storage of a computer. This section discusses the requirements of the algorithm for secondary storage, the amount of I/O transfer between central memory and secondary storage, the central memory requirements, and the number of floating point operations.
Secondary storage for this algorithm is used to hold B, which is overwritten with L; to hold A, which is overwritten with C and later the band matrix; and to hold the matrices W and Y. Approximately n^2/2 words of storage are used for each of the 4 matrices; thus the total amount of secondary storage required is approximately 2n^2. A Solid-state Storage Device (SSD) with 128 million words of secondary storage with fast access on a Cray X-MP would allow problems of order up to 8000 before overflowing the SSD to slower secondary storage on disks.
The amount of I/O transfer between central memory and the secondary storage unit holding A, C, and the resulting band matrix is O(n^3/p) real words. The I/O transfer for the unit storing L, the Cholesky factor of B, is n^3/p words, so the total amount of I/O transfer is also O(n^3/p).
The maximum central memory requirement for the above algorithm is

    2np + max(4p^2, np + 3n + p, np + n + p(p + 1)),

where the three terms correspond to the Cholesky factorization of B, the reduction to banded form, and the computation of the eigenvectors. A good working estimate for the amount of central memory required is 3np. For the size of problems being considered (e.g., n = 5000 and p = 50), this is less than 1 million words of working storage.
The operation count for computing the eigenvalues of the eigenproblem in (1) by this approach is approximately (11/3)n^3 + 10n^2 p, plus lower order terms. This operation count comes from analyzing the four major components of the algorithm. The operation counts for these components are as follows:

    Component                                Operation count
    Cholesky factorization                   (1/3)n^3
    Reduction to standard form               2n^3
    Block Householder transformations        (4/3)n^3 + 2n^2 p
        QR factorizations                    2n^2 p
    Reduction of band matrix                 6n^2 p
    Total                                    (11/3)n^3 + 10n^2 p

For an execution with n = 1492 and p = 50, the monitored operation count was 13.3 x 10^9. The operation count computed by the above formula is 13.2 x 10^9, which is within 1% of the actual count.
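The estimate can be tallied directly; the small sketch below adds up the component counts from the table above (lower-order terms dropped) and reproduces the quoted figure for n = 1492 and p = 50.

def eigenvalue_flops(n, p):
    cholesky = n**3 / 3
    to_standard_form = 2 * n**3
    block_householder = 4 * n**3 / 3 + 2 * n**2 * p
    qr_factorizations = 2 * n**2 * p
    band_reduction = 6 * n**2 * p
    return (cholesky + to_standard_form + block_householder +
            qr_factorizations + band_reduction)

print(f"{eigenvalue_flops(1492, 50):.3e}")   # about 1.32e10, i.e. 13.2 x 10^9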
The computational kernels used in the algorithm are the Cholesky factorization, block solves with the Cholesky factor, the QR factorization, matrix multiplication, and a banded eigenvalue solver. All of these kernels, except for those used in solving the banded eigenproblem, perform well on vector computers. In fact, they have computational rates in the 150 to 190 Mflop range on a Cray X-MP for the problem sizes being discussed in this application. The code is portable and can be implemented efficiently on other vector supercomputers whenever high performance implementations of the Level 2 BLAS [6] are available.
7. PERFORMANCE RESULTS

After establishing that each component of the implementation of the above algorithm in HSSXGV and HSSXG1 was executing at the expected efficiency, a parameter study for the optimal choice of block size p was performed. A small problem of n = 667 from [9] was chosen as the test problem. HSSXGV and HSSXG1 were executed with varying block sizes. Table I lists the results of the study. The block size was chosen such that mod(p, 4) = 3 to avoid bank conflicts on the Cray X-MP. The central processing time is in seconds, and the computational speed is reported in millions of floating point operations per second (MFLOPS). The time reported for HSSXG1 is the average time for computing one eigenvector. The optimal choice of block size is dependent on the number of eigenvectors to be computed. A choice of 63 for the block size is good on the Cray X-MP. Further experiments indicate that 63 is a good choice on larger problems as well. If a large number of eigenvectors is to be computed, a smaller block size might be in order.
Table I. Various Choices of Block Size for n = 667 (Execution Times in Seconds)

    p     HSSXGV time   HSSXGV MFLOPS   HSSXG1 time   HSSXG1 MFLOPS
    27    11.92         100.4           .040          48.0
    39    11.08         111.8           .043          57.5
    47    10.75         117.6           .048          60.8
    59    10.66         122.1           .058          64.3
    63    10.37         126.6           .061          66.1
    67    11.31         117.3           .063          69.1
    79    11.14         122.0           .073          74.5
    87    11.02         125.3           .078          79.8
    99    11.25         125.4           --            84.7

Table II compares the combination of HSSXGV and HSSXG1 with the performance of the symmetric packed storage version of EISPACK and the block Lanczos algorithm described in [9] on the same eigenproblems as in that paper. HSSXGV and HSSXG1 used block size 63 for all problems. The statement of the eigenproblem is to compute all eigenvalues and eigenvectors in the interval [-1.0, 0.30].

Table II. Comparison of Performance with In-Core Methods (Execution Times in Seconds)

    n      Number of eigenvectors   EISPACK   HSSXGV and HSSXG1   Block Lanczos
    219    22                       1.37      1.07                1.37
    667    32                       14.73     12.33               8.25
    992    35                       37.33     35.38               17.00
    1496   35                       108.54    88.74               47.70

Table III. Comparison of HSSXGV versus Block Lanczos on Matrix of Order 992 (Execution Times in Seconds)

    Number of eigenvectors   HSSXGV and HSSXG1   Block Lanczos
    20                       33.77               21.20
    40                       35.70               20.62
    60                       37.78               27.02
    80                       39.69               30.78
    100                      41.32               39.35
    120                      43.66               53.91
    140                      45.58               62.35
HSSXGV and HSSXG1 not only allow larger problem sizes than the symmetric packed storage EISPACK path, but also show an 18% performance increase because of the block nature of the computations. Block Lanczos is indeed faster than HSSXGV and HSSXG1 for these eigenproblems. However, if more eigenvectors are required, HSSXGV and HSSXG1 are more efficient than block Lanczos. For the problem of order n = 992, the figures in Table III indicate that HSSXGV and HSSXG1 are faster than the block Lanczos code if the number of required eigenvectors is increased to more than 100. We estimate that for the largest problem above, n = 1496, HSSXGV would become the algorithm of choice
if 80 or more eigenvectors were required. Also, HSSXGV computes all the eigenvalues and assumes no prior knowledge of the spectrum. Both Lanczos and the EISPACK path require an interval as input and proceed to compute all the eigenvalues in the interval. Hence HSSXGV in Table II has computed additional information.

Table IV. Performance in MFLOPS

    n      HSSXGV MFLOPS   HSSXG1 MFLOPS
    219    14              51
    667    127             66
    992    128             72
    1496   161             80

Table V. Comparison of TREEQL and TQLRAT (Execution Times in Seconds)

    n      TQLRAT   TREEQL   Number of levels   Number of deflations
    219    0.75     0.93     4                  13
    667    10.37    11.38    6                  135
    992    31.97    33.87    7                  260
    1496   83.08    85.28    7                  217
The overall computation rate for the problems in Table II is given in Table IV. The highest performance, 161 MFLOPS, was obtained on the largest problem. This number is quite remarkable, since a significant amount of the computation in the tridiagonal eigensolver is carried out in scalar mode.
The overall change in performance when using TREEQL instead of TQLRAT
can be seen from the results in Table V. For the runs in Table V, all eigenvalues,
but no eigenvectors, were computed.
The number of levels and the number of deflations refer to quantities used in the implementation of TREEQL (see [5]); the numerical values of the computed eigenvalues agreed to within machine precision. TREEQL apparently offers no advantage in this context for a single-processor implementation.
These results in no way contradict the very favorable ones obtained with
TREEQL elsewhere (e.g., in [5]). As mentioned before, the version of TREEQL
used here has been modified to compute eigenvalues only. The modified version
of TREEQL requires considerably less arithmetic than the original one, which
computes eigenvectors as well. However, the modified version still requires more
arithmetic than a comparable tridiagonal eigenvalue solver based on the QL
algorithm alone. Hence, the roughly comparable execution times indicate that
TREEQL would probably be more efficient if eigenvectors were required.
A second point worth mentioning relates to the number of deflations reported
above. The number of deflations is apparently close to a worst-case behavior for
TREEQL. This behavior usually has been observed for random matrices. One
could argue therefore that the example chosen here is a worst-case example, and
thus insufficient to dismiss TREEQL. On the other hand, one could also argue
that tridiagonal matrices obtained through a Householder or Givens reduction process are very unlikely to look like a tridiagonal matrix with entries (1, -2, 1), which is, in some sense, a best case for TREEQL. Since there was no clear-cut advantage in using TREEQL, we stayed with a routine that has for many years been proven to be reliable and efficient.

Table VI. Performance on Large Random Matrices

    n      Memory (Mwords)   CP time (sec)   Rate (Mflops)   Secondary storage (Mwords)   I/O wait SSD (sec)   I/O wait DD-49 (sec)
    1000   0.40              29.8            141             2.16                         0.48                 290.09
    2000   0.60              192.6           164             8.36                         4.62                 --
    3000   0.79              579.7           180             18.61                        13.84                --

Table VII. Performance on Large Sparse Matrices

    n      CP time (sec)   Rate (Mflops)   Lanczos CP time (sec)
    1919   98.9            116             83.1
    1922   97.4            118             19.5
    3562   514.5           131             8.43
In order to demonstrate the overall performance of the software on some
problems that are too large to fit into central memory of the Cray X-MP, we
performed two additional series of tests. In the first series we computed all
eigenvalues and 50 eigenvectors of random matrices of orders n = 1000, 2000,
and 3000. The results are presented in Table VI.
Table VI demonstrates the excellent asymptotic performance that can be
obtained with HSSXGV on the Cray X-MP, as well as its capability of efficiently
solving very large out-of-core problems. The cost using DD-49 disks was about
50% higher than using the SSD for n = 1000. Also, the I/O wait time was an
order of magnitude larger than the CP time. In contrast, the I/O wait time for
the SSD was negligible. Therefore, the larger problems were not run using the
DD-49 disks. For the same test runs, the eigenvector computation using HSSXG1
required 0.095, 0.226, and 0.389 seconds per eigenvector for the problems of size
1000, 2000, and 3000. The corresponding Mflops rates were 76, 91, and 103.
Finally, we tested the performance of HSSXEV on some large sparse problems.
Here we did not want to utilize the sparsity, but only test the out-of-core
capability on some real problems with known answers. In order to do so, the
sparse problems were unpacked and stored in the full upper triangle of the
coefficient matrix. Then HSSXEV was applied to this matrix. For comparison,
we list the execution time of a block Lanczos algorithm for sparse matrices. The
Lanczos algorithm was only required to compute the eigenvalues to about
8 correct digits. The eigenvalues computed with HSSXEV agreed with the
Lanczos results to this number of digits. It should be noted that the execution
times in Table VII cannot and should not be compared as times for two competing
algorithms. The execution times listed for HSSXEV are for the computation of
all eigenvalues, but no eigenvectors. The execution times for block Lanczos are
for the computation of 550, 200, and 10 eigenpairs, respectively.
8. SUMMARY

Software for the out-of-core solution of the symmetric generalized eigenproblem has been developed and implemented. Because of its block nature, this software is more efficient on vector computers than a related in-core algorithm. If the number of required eigenpairs is large, in particular in applications where all eigenvalues and eigenvectors are required, the new software is more efficient than a previous code of the authors [9] based on the Lanczos algorithm. The Lanczos algorithm remains the algorithm of choice for large sparse problems. Most importantly, the new software allows the efficient solution of problems too large to fit in central memory, thus providing an important computational tool to researchers in quantum mechanics and other disciplines that generate large symmetric generalized eigenproblems.
ACKNOWLEDGMENT

We would like to thank Dan Pierce for modifying TREEQL and for carrying out
the related numerical experiments.
REFERENCES
1. ARMSTRONG, J. Optimization of Householder Transformations, Part I: Linear Least Squares. CONVEX Computer Corp., 701 N. Plano Rd., Richardson, TX 75081.
2. BISCHOF, C., AND VAN LOAN, C. The WY representation for products of Householder matrices. SIAM J. Sci. Stat. Comput. 8 (1987), s2-s13.
3. BUNCH, J., DONGARRA, J., MOLER, C., AND STEWART, G. LINPACK Users' Guide. SIAM, Philadelphia, 1979.
4. CUPPEN, J. J. A divide and conquer method for the symmetric tridiagonal eigenvalue problem. Numer. Math. 36 (1981), 177-195.
5. DONGARRA, J. J., AND SORENSEN, D. C. A fully parallel algorithm for the symmetric eigenvalue problem. SIAM J. Sci. Stat. Comput. 8 (1987), s139-s154.
6. DONGARRA, J. J., DU CROZ, J., HAMMARLING, S., AND HANSON, R. An extended set of FORTRAN basic linear algebra subprograms. Argonne National Lab. Rep. ANL-MSC-TM-41 (Revision 3), 1986.
7. GARBOW, B. S., BOYLE, J. M., DONGARRA, J. J., AND MOLER, C. B. Matrix Eigensystem Routines - EISPACK Guide Extension. Lecture Notes in Computer Science, Vol. 51, Springer-Verlag, Berlin, 1977.
8. GRIMES, R. Solving systems of large dense linear equations. Rep. ETA-TR-44, Boeing Computer Services, Seattle, Wash., Feb. 1987; submitted to Supercomputing and Its Applications.
9. GRIMES, R., KRAKAUER, H., LEWIS, J., SIMON, H., AND WEI, S. The solution of large dense generalized eigenvalue problems on the Cray X-MP/24 with SSD. J. Comput. Phys. 69, 2 (1987), 471-481.
10. GRIMES, R., AND SIMON, H. Subroutines for the out-of-core solution of generalized symmetric eigenvalue problems. Rep. ETA-TR-54, Boeing Computer Services, 1987.
11. KAUFMAN, L. Banded eigenvalue solvers on vector machines. ACM Trans. Math. Softw. 10, 1 (1984), 73-86.
12. KAUFMAN, L., DONGARRA, J., AND HAMMARLING, S. Squeezing the most out of eigenvalue solvers on high-performance computers. Linear Algebra Appl. 77 (1986), 113-136.
13. LAWSON, C., HANSON, R., KINCAID, D., AND KROGH, F. Basic linear algebra subprograms for FORTRAN usage. ACM Trans. Math. Softw. 5 (1979), 308-321.
14. LO, S., PHILIPPE, B., AND SAMEH, A. A multiprocessor algorithm for the symmetric eigenvalue problem. SIAM J. Sci. Stat. Comput. 8 (1987), s155-s165.
15. PARLETT, B. N. The Symmetric Eigenvalue Problem. Prentice-Hall, Englewood Cliffs, N.J., 1980.
16. SMITH, B. T., BOYLE, J. M., DONGARRA, J. J., GARBOW, B. S., IKEBE, Y., KLEMA, V. C., AND MOLER, C. B. Matrix Eigensystem Routines - EISPACK Guide. Lecture Notes in Computer Science, Vol. 6, Springer-Verlag, Berlin, 1976.
17. VectorPak User's Manual. Boeing Computer Services Doc. 20460-0501-R1, 1987.

Received July 1987; accepted March 1988
