
CR07 – Sparse Matrix Computations /

Cours de Matrices Creuses (2010-2011)

Jean-Yves L’Excellent (INRIA) and Bora Uçar (CNRS)

LIP-ENS Lyon, Team: ROMA

(Jean-Yves.L.Excellent@ens-lyon.fr, Bora.Ucar@ens-lyon.fr)

Fridays 10h15-12h15

prepared in collaboration with P. Amestoy (ENSEEIHT-IRIT)

1/ 94
Motivations
I Applications with an ever-growing need for computing power:
I modelling,
I simulation (rather than experimentation),
I numerical optimization
I Typically:
Continuous problem ⇒ Discretization (mesh)
⇒ Numerical solution algorithm (according to physical laws)
⇒ Matrix problem (Ax = b, . . .)
I Needs:
I Increasingly accurate models
I Increasingly complex problems
I Applications critical in response time
I Minimization of computing costs

⇒ High-performance computers, parallelism

⇒ Algorithms that make the best use of these computers and of the
properties of the matrices considered
2/ 94
Example of preprocessing for Ax = b

[Spy plots: original matrix (A = lhr01) and preprocessed matrix (A′(lhr01)), both of order 1400 with nz = 18427]

Modified problem: A′x′ = b′ with A′ = Pn P Dr A Q Dc P^t
3/ 94
A few examples in the field of scientific computing

I Time constraints: weather forecasting

4/ 94
A few examples in the field of scientific computing
I Cost constraints: wind tunnels, crash simulation, . . .

5/ 94
Scale Constraints

I large scale: climate modelling, pollution, astrophysics


I tiny scale: combustion, quantum chemistry

6/ 94
Contents of the course

I Introduction, reminders on graph theory and on numerical


linear algebra
- Introduction to sparse matrices
- Graphs and hypergraphs, trees, some classical algorithms
- Gaussian elimination, LU factorization, fill-in
- Conditioning/sensitivity of a problem and error analysis

7/ 94
Contents of the course

[Illustration: sparse matrix ↔ graph]

I Sparse Gaussian elimination


- Sparse matrices and graphs
- Gaussian elimination of sparse matrices
- Ordering and permuting sparse matrices

7/ 94
Contents of the course

I Maximum (weighted) Matching algorithms and their use in


sparse linear algebra

I Efficient factorization methods


- Implementation of sparse direct solvers
- Parallel sparse solvers
- Scheduling of computations to optimize memory usage
and/or performance

I Graph and hypergraph partitioning

I Iterative methods

I Current research activities

7/ 94
Tentative outline
A. INTRODUCTION
I. Sparse matrices
II. Graph theory and algorithms
III. Linear algebra basics
------------------------------------------------------
B. SPARSE GAUSSIAN ELIMINATION
IV. Elimination tree and structure prediction
V. Fill-reducing ordering methods
VI. Matching in bipartite graphs
VII. Factorization: Methods
VIII. Factorization: Parallelization aspects
------------------------------------------------------
C. SOME OTHER ESSENTIAL SPARSE MATRIX ALGORITHMS
IX. Graph and hypergraph partitioning
X. Iterative methods
------------------------------------------------------
D. CLOSING
XI. Current research activities
XII. Presentations
8/ 94
Tentative organization of the course

I References, teaching material will be made available at


http://graal.ens-lyon.fr/~bucar/CR07/

I 2+ hours dedicated to exercises (manipulation of sparse


matrices and graphs)

I Evaluation: mainly based on the study of research articles


related to the contents of the course; the students will write a
report and do an oral presentation.

9/ 94
Outline

Introduction to Sparse Matrix Computations


Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

10/ 94
A selection of references
I Books
I Duff, Erisman and Reid, Direct methods for Sparse Matrices,
Clarendon Press, Oxford 1986.
I Dongarra, Duff, Sorensen and H. A. van der Vorst, Solving
Linear Systems on Vector and Shared Memory Computers,
SIAM, 1991.
I Davis, Direct methods for sparse linear systems, SIAM, 2006.
I Saad, Iterative methods for sparse linear systems, 2nd edition,
SIAM, 2004.
I Articles
I Gilbert and Liu, Elimination structures for unsymmetric sparse
LU factors, SIMAX, 1993.
I Liu, The role of elimination trees in sparse factorization,
SIMAX, 1990.
I Heath and E. Ng and B. W. Peyton, Parallel Algorithms for
Sparse Linear Systems, SIAM review 1991.

11/ 94
Introduction to Sparse Matrix Computations
Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

12/ 94
Motivations
I solution of linear systems of equations → key algorithmic
kernel
Continuous problem → Discretization → Solution of a linear system Ax = b
I Main parameters:
I Numerical properties of the linear system (symmetry, pos.
definite, conditioning, . . . )
I Size and structure:
I Large (> 1000000 × 1000000 ?), square/rectangular
I Dense or sparse (structured / unstructured)
I Target computer (sequential/parallel/multicore)
→ Algorithmic choices are critical

13/ 94
Motivations for designing efficient algorithms

I Time-critical applications
I Solve larger problems
I Decrease elapsed time (parallelism ?)
I Minimize cost of computations (time, memory)

14/ 94
Difficulties

I Access to data :
I Computer : complex memory hierarchy (registers, multilevel
cache, main memory (shared or distributed), disk)
I Sparse matrix : large irregular dynamic data structures.
→ Exploit the locality of references to data on the computer
(design algorithms providing such locality)
I Efficiency (time and memory)
I Number of operations and memory depend very much on the
algorithm used and on the numerical and structural properties
of the problem.
I The algorithm depends on the target computer (vector, scalar,
shared, distributed, clusters of Symmetric Multi-Processors
(SMP), multicore).
→ Algorithmic choices are critical

15/ 94
Introduction to Sparse Matrix Computations
Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

16/ 94
Sparse matrices

Example:
3x1 + 2x2 = 5
2x2 − 5x3 = 1
2x1 + 3x3 = 0

can be represented as

Ax = b,

where

      | 3  2  0 |        | x1 |            | 5 |
  A = | 0  2 −5 | ,  x = | x2 | ,  and b = | 1 |
      | 2  0  3 |        | x3 |            | 0 |
Sparse matrix: only nonzeros are stored.

17/ 94
Sparse matrix ?

[Spy plot: matrix dwt_592.rua (N = 592, NZ = 5104); structural analysis of a submarine]
18/ 94
Sparse matrix ?

[Spy plot: matrix from Computational Fluid Dynamics (collaboration Univ. Tel Aviv); order ≈ 7000, nz = 43105]

“Saddle-point” problem
19/ 94
Preprocessing sparse matrices

[Spy plots: original matrix (A = lhr01) and preprocessed matrix (A′(lhr01)), both with nz = 18427]

Modified problem: A′x′ = b′ with A′ = Pn P Dr A Dc Q P^t
20/ 94
Factorization process

Solution of Ax = b
I A is unsymmetric :
I A is factorized as: A = LU, where
L is a lower triangular matrix, and
U is an upper triangular matrix.
I Forward-backward substitution: Ly = b then Ux = y
I A is symmetric:
I A = LDL^T or LL^T
I A is rectangular m × n with m ≥ n and min_x ||Ax − b||_2 :
I A = QR where Q is orthogonal (Q^{-1} = Q^T) and R is triangular.
I Solve: y = Q^T b then Rx = y

21/ 94
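As an illustration of the forward–backward substitution step above, here is a minimal dense sketch (not from the course material; the routine name lu_solve and its interface are assumptions), assuming A already holds the LU factors with a unit-diagonal L and no pivoting:

  ! Sketch: solve A x = b given the dense LU factors stored in A
  ! (L strictly below the diagonal with unit diagonal, U on and above).
  subroutine lu_solve(n, A, b, x)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in) :: A(n,n), b(n)
    double precision, intent(out) :: x(n)
    double precision :: y(n)
    integer :: i, j
    ! Forward substitution: L y = b
    do i = 1, n
       y(i) = b(i)
       do j = 1, i-1
          y(i) = y(i) - A(i,j)*y(j)
       end do
    end do
    ! Backward substitution: U x = y
    do i = n, 1, -1
       x(i) = y(i)
       do j = i+1, n
          x(i) = x(i) - A(i,j)*x(j)
       end do
       x(i) = x(i) / A(i,i)
    end do
  end subroutine lu_solve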
Difficulties

I Only non-zero values are stored


I Factors L and U have far more nonzeros than A
I Data structures are complex
I Computations are only a small portion of the code (the rest is
data manipulation)
I Memory size is a limiting factor ( → out-of-core solvers )

22/ 94
Key numbers:
1- Small size: 500 MB matrix;
Factors = 5 GB; Flops = 100 GFlops;
2- Example of 2D problem: Lab. Géosciences Azur, Valbonne
I Complex 2D finite-difference matrix, n = 16 × 10^6, 150 × 10^6 nonzeros
I Storage (single prec.): 2 GB (12 GB with the factors)
I Flops: 10 TeraFlops
3- Example of 3D problem: EDF (Code_Aster, structural engineering)
I Real finite-element matrix, n = 10^6, nz = 71 × 10^6 nonzeros
I Storage: 3.5 × 10^9 entries (28 GB) for factors, 35 GB total
I Flops: 2.1 × 10^13
4- Typical performance (MUMPS):
I PC LINUX, 1 core (P4, 2 GHz): 1.0 GFlops/s
I Cray T3E (512 procs): speed-up ≈ 170, perf. 71 GFlops/s
I AMD Opteron 8431, 24 cores @ 2.4 GHz: 50 GFlops/s (1 core: 7 GFlops/s)
23/ 94
Typical test problems:

BMW car body (MSC.Software),
227,362 unknowns, 5,757,996 nonzeros,
Size of factors: 51.1 million entries
Number of operations: 44.9 × 10^9

24/ 94
Typical test problems:

BMW crankshaft (MSC.Software),
148,770 unknowns, 5,396,386 nonzeros,
Size of factors: 97.2 million entries
Number of operations: 127.9 × 10^9

25/ 94
Sources of parallelism

Several levels of parallelism can be exploited:


I At problem level: problem can be decomposed into
sub-problems (e.g. domain decomposition)
I At matrix level: sparsity implies independence in the computations
I At submatrix level: within dense linear algebra computations
(parallel BLAS, . . . )

26/ 94
Data structure for sparse matrices

I Storage scheme depends on the pattern of the matrix and on


the type of access required
I band or variable-band matrices
I “block bordered” or block tridiagonal matrices
I general matrix
I row, column or diagonal access

27/ 94
Data formats for a general sparse matrix A

What needs to be represented


I Assembled matrices: MxN matrix A with NNZ nonzeros.
I Elemental matrices (unassembled): MxN matrix A with NELT
elements.
I Arithmetic: Real (4 or 8 bytes) or complex (8 or 16 bytes)
I Symmetric (or Hermitian)
→ store only part of the data.
I Distributed format ?
I Duplicate entries and/or out-of-range values ?

28/ 94
Classical Data Formats for Assembled Matrices
I Example of a 3×3 matrix with NNZ = 5 nonzeros

        1    2    3
  1   a11
  2        a22  a23
  3   a31       a33

I Coordinate format
  IRN [1 : NNZ] = 1 3 2 2 3
  JCN [1 : NNZ] = 1 1 2 3 3
  VAL [1 : NNZ] = a11 a31 a22 a23 a33
I Compressed Sparse Column (CSC) format
  IRN [1 : NNZ] = 1 3 2 2 3
  VAL [1 : NNZ] = a11 a31 a22 a23 a33
  COLPTR [1 : N + 1] = 1 3 4 6
  column J is stored in IRN/A locations COLPTR(J)...COLPTR(J+1)-1
I Compressed Sparse Row (CSR) format:
  Similar to CSC, but row by row
I Diagonal format (M=N):
  NDIAG = 3
29/ 94
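To make the link between the two formats concrete, here is a minimal sketch (an assumption, not part of the original slides) of a coordinate-to-CSC conversion; the array names IRN, JCN, VAL, COLPTR follow the slide, while the routine name and the output arrays IRNC/VALC are hypothetical:

  ! Sketch: convert coordinate format (IRN, JCN, VAL) to CSC
  ! (IRNC, VALC, COLPTR). Entries within a column keep their
  ! original order; duplicates are kept as-is.
  subroutine coo_to_csc(n, nnz, irn, jcn, val, irnc, valc, colptr)
    implicit none
    integer, intent(in)  :: n, nnz, irn(nnz), jcn(nnz)
    double precision, intent(in)  :: val(nnz)
    integer, intent(out) :: irnc(nnz), colptr(n+1)
    double precision, intent(out) :: valc(nnz)
    integer :: k, j, p
    ! Count the number of entries in each column
    colptr(1:n+1) = 0
    do k = 1, nnz
       colptr(jcn(k)+1) = colptr(jcn(k)+1) + 1
    end do
    ! Prefix sum: COLPTR(j) = start of column j
    colptr(1) = 1
    do j = 1, n
       colptr(j+1) = colptr(j+1) + colptr(j)
    end do
    ! Scatter entries, using COLPTR as a moving pointer
    do k = 1, nnz
       j = jcn(k)
       p = colptr(j)
       irnc(p) = irn(k)
       valc(p) = val(k)
       colptr(j) = p + 1
    end do
    ! Shift COLPTR back by one column and restore COLPTR(1)
    do j = n, 1, -1
       colptr(j+1) = colptr(j)
    end do
    colptr(1) = 1
  end subroutine coo_to_csc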
Classical Data Formats for Assembled Matrices

I Example of a 3x3 matrix with NNZ=5 nonzeros


1 2 3
1 a11

2 a22 a23

3 a31 a33

I Diagonal format (M=N):

  NDIAG = 3
  IDIAG = −2 0 1

        | na   a11  0   |
  VAL = | na   a22  a23 |   (na: not accessed)
        | a31  a33  na  |

  VAL(i,j) corresponds to A(i,i+IDIAG(j)) (for 1 ≤ i + IDIAG(j) ≤ N)

29/ 94
Sparse Matrix-vector products Y ← AX
Algorithm depends on sparse matrix format:

I Coordinate format:
  Y(1:M) = 0
  DO k = 1, NNZ
     Y(IRN(k)) = Y(IRN(k)) + VAL(k) * X(JCN(k))
  ENDDO

I CSC format:
  Y(1:M) = 0
  DO J = 1, N
     Xj = X(J)
     DO k = COLPTR(J), COLPTR(J+1)-1
        Y(IRN(k)) = Y(IRN(k)) + VAL(k) * Xj
     ENDDO
  ENDDO

I CSR format:
  DO I = 1, M
     Yi = 0
     DO k = ROWPTR(I), ROWPTR(I+1)-1
        Yi = Yi + VAL(k) * X(JCN(k))
     ENDDO
     Y(I) = Yi
  ENDDO

I Diagonal format (VAL(i,j) corresponds to A(i,i+IDIAG(j))):
  Y(1:N) = 0
  DO j = 1, NDIAG
     DO i = max(1, 1-IDIAG(j)), min(N, N-IDIAG(j))
        Y(i) = Y(i) + VAL(i,j) * X(i+IDIAG(j))
     ENDDO
  ENDDO

30/ 94
Jagged diagonal storage (JDS)

        1    2    3
  1   a11
  2        a22  a23
  3   a31       a33

1. Shift all elements left (similar to CSR) and keep column indices:
     a11 (1)
     a22 (2)  a23 (3)
     a31 (1)  a33 (3)
2. Sort rows in decreasing order of their number of nonzeros:
     a22 (2)  a23 (3)
     a31 (1)  a33 (3)
     a11 (1)
3. Store the corresponding row permutation: PERM = 2 3 1
4. Store the jagged diagonals (columns of step 2):
     VAL     = a22 a31 a11 a23 a33
     COL_IND = 2 1 1 3 3
     COL_PTR = 1 4 6

Pros: manipulate longer vectors than CSR (interesting on vector computers or GPUs).
Cons: extra indirection due to the permutation array.
31/ 94
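A matrix–vector product kernel for this JDS structure could look as follows (a sketch under the naming assumptions NJD = number of jagged diagonals and arrays VAL, COL_IND, COL_PTR, PERM; not from the original slides):

  ! Sketch: Y <- A*X with A stored in JDS. Jagged diagonal d occupies
  ! VAL(COL_PTR(d) : COL_PTR(d+1)-1); its p-th entry belongs to the
  ! p-th longest row, i.e. to original row PERM(p).
  subroutine jds_matvec(m, njd, col_ptr, col_ind, val, perm, x, y)
    implicit none
    integer, intent(in) :: m, njd, col_ptr(njd+1), col_ind(*), perm(m)
    double precision, intent(in) :: val(*), x(*)
    double precision, intent(out) :: y(m)
    integer :: d, p, i
    y(1:m) = 0.0d0
    do d = 1, njd
       ! Each jagged diagonal is one long vector operation
       do p = col_ptr(d), col_ptr(d+1)-1
          i = perm(p - col_ptr(d) + 1)       ! original row index
          y(i) = y(i) + val(p) * x(col_ind(p))
       end do
    end do
  end subroutine jds_matvec

On the 3×3 example, the first sweep handles a22, a31, a11 as one vector of length 3, the second a23, a33 as one vector of length 2.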
Example of elemental matrix format

      | −1  2  3  0  0 |
      |  2  1  1  0  0 |
  A = |  1  1  3 −1  3 |  = A1 + A2
      |  0  0  1  2 −1 |
      |  0  0  3  2  1 |

         1 | −1  2  3 |           3 |  2 −1  3 |
  A1 =   2 |  2  1  1 | ,   A2 =  4 |  1  2 −1 |
         3 |  1  1  1 |           5 |  3  2  1 |

32/ 94
Example of elemental matrix format

         1 | −1  2  3 |           3 |  2 −1  3 |
  A1 =   2 |  2  1  1 | ,   A2 =  4 |  1  2 −1 |
         3 |  1  1  1 |           5 |  3  2  1 |

I N=5, NELT=2, NVAR=6, A = Σ_{i=1}^{NELT} Ai
I ELTPTR [1:NELT+1] = 1 4 7
  ELTVAR [1:NVAR]   = 1 2 3 3 4 5
  ELTVAL [1:NVAL]   = -1 2 1 2 1 1 3 1 1 2 1 3 -1 2 2 3 -1 1
I Remarks:
  I NVAR = ELTPTR(NELT+1) − 1
  I NVAL = Σ Si² (unsym.) or Σ Si(Si+1)/2 (sym.), with Si = ELTPTR(i+1) − ELTPTR(i)
  I storage of elements in ELTVAL: by columns

32/ 94
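For illustration, a possible assembly of this elemental storage into a dense array (a sketch, not from the slides; real solvers assemble into sparse structures or work directly on the elements, and the routine name is an assumption):

  ! Sketch: assemble unsymmetric elemental storage (ELTPTR/ELTVAR/ELTVAL,
  ! values stored by columns) into a dense N x N array, A = sum of the Ai.
  subroutine assemble_dense(n, nelt, eltptr, eltvar, eltval, A)
    implicit none
    integer, intent(in) :: n, nelt, eltptr(nelt+1), eltvar(*)
    double precision, intent(in) :: eltval(*)
    double precision, intent(out) :: A(n,n)
    integer :: e, s, iloc, jloc, i, j, p
    A = 0.0d0
    p = 1                                ! running pointer into ELTVAL
    do e = 1, nelt
       s = eltptr(e+1) - eltptr(e)       ! size of element e
       do jloc = 1, s                    ! columns first (storage by columns)
          j = eltvar(eltptr(e) + jloc - 1)
          do iloc = 1, s
             i = eltvar(eltptr(e) + iloc - 1)
             A(i, j) = A(i, j) + eltval(p)
             p = p + 1
          end do
       end do
    end do
  end subroutine assemble_dense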
File storage: Rutherford-Boeing

I Standard ASCII format for files


I Header + Data (CSC format). key xyz:
I x=[rcp] (real, complex, pattern)
I y=[suhzr] (sym., uns., herm., skew sym., rectang.)
I z=[ae] (assembled, elemental)
I ex: M_T1.RSA, SHIP003.RSE
I Supplementary files: right-hand-sides, solution,
permutations. . .
I Canonical format introduced to guarantee a unique
representation (order of entries in each column, no duplicates).

33/ 94
File storage: Rutherford-Boeing

DNV-Ex 1 : Tubular joint-1999-01-17 M_T1


1733710 9758 492558 1231394 0
rsa 97578 97578 4925574 0
(10I8) (10I8) (3e26.16)
1 49 96 142 187 231 274 346 417 487
556 624 691 763 834 904 973 1041 1108 1180
1251 1321 1390 1458 1525 1573 1620 1666 1711 1755
1798 1870 1941 2011 2080 2148 2215 2287 2358 2428
2497 2565 2632 2704 2775 2845 2914 2982 3049 3115
...
1 2 3 4 5 6 7 8 9 10
11 12 49 50 51 52 53 54 55 56
57 58 59 60 67 68 69 70 71 72
223 224 225 226 227 228 229 230 231 232
233 234 433 434 435 436 437 438 2 3
4 5 6 7 8 9 10 11 12 49
50 51 52 53 54 55 56 57 58 59
...
-0.2624989288237320E+10 0.6622960540857440E+09 0.2362753266740760E+11
0.3372081648690030E+08 -0.4851430162799610E+08 0.1573652896140010E+08
0.1704332388419270E+10 -0.7300763190874110E+09 -0.7113520995891850E+10
0.1813048723097540E+08 0.2955124446119170E+07 -0.2606931100955540E+07
0.1606040913919180E+07 -0.2377860366909130E+08 -0.1105180386670390E+09
0.1610636280324100E+08 0.4230082475435230E+07 -0.1951280618776270E+07
0.4498200951891750E+08 0.2066239484615530E+09 0.3792237438608430E+08
0.9819999042370710E+08 0.3881169368090200E+08 -0.4624480572242580E+08

34/ 94
File storage: Matrix-market

I Example

%%MatrixMarket matrix coordinate real general


% Comments
5 5 8
1 1 1.000e+00
2 2 1.050e+01
3 3 1.500e-02
1 4 6.000e+00
4 2 2.505e+02
4 4 -2.800e+02
4 5 3.332e+01
5 5 1.200e+01

35/ 94
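A minimal reader for such a coordinate-format Matrix Market file might look like this (a sketch assuming a real, general matrix and a hypothetical file name example.mtx; no error handling):

  ! Sketch: read a real, general Matrix Market coordinate file into COO arrays.
  program read_mm
    implicit none
    integer, parameter :: maxnz = 1000000
    integer :: m, n, nnz, k
    integer :: irn(maxnz), jcn(maxnz)
    double precision :: val(maxnz)
    character(len=256) :: line
    open(10, file='example.mtx', status='old')
    ! Skip the banner and comment lines starting with '%'
    do
       read(10,'(A)') line
       if (line(1:1) /= '%') exit
    end do
    ! First non-comment line: dimensions and number of entries
    read(line, *) m, n, nnz
    do k = 1, nnz
       read(10, *) irn(k), jcn(k), val(k)
    end do
    close(10)
    print *, 'Read ', nnz, ' entries of a ', m, ' x ', n, ' matrix'
  end program read_mm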
Examples of sparse matrix collections

I The University of Florida Sparse Matrix Collection


http://www.cise.ufl.edu/research/sparse/matrices/
I Matrix market http://math.nist.gov/MatrixMarket/
I Rutherford-Boeing
http://www.cerfacs.fr/algor/Softs/RB/index.html
I TLSE http://gridtlse.org/

36/ 94
Introduction to Sparse Matrix Computations
Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

37/ 94
Gaussian elimination

A = A^(1), b = b^(1), A^(1) x = b^(1):

  | a11 a12 a13 | | x1 |   | b1 |
  | a21 a22 a23 | | x2 | = | b2 |      row 2 ← row 2 − row 1 × a21/a11
  | a31 a32 a33 | | x3 |   | b3 |      row 3 ← row 3 − row 1 × a31/a11

A^(2) x = b^(2):

  | a11  a12      a13     | | x1 |   | b1     |
  |  0   a22^(2)  a23^(2) | | x2 | = | b2^(2) |     b2^(2) = b2 − a21 b1/a11, . . .
  |  0   a32^(2)  a33^(2) | | x3 |   | b3^(2) |     a32^(2) = a32 − a31 a12/a11, . . .

Finally A^(3) x = b^(3):

  | a11  a12      a13     | | x1 |   | b1     |
  |  0   a22^(2)  a23^(2) | | x2 | = | b2^(2) |
  |  0   0        a33^(3) | | x3 |   | b3^(3) |     a33^(3) = a33^(2) − a32^(2) a23^(2)/a22^(2), . . .

Typical Gaussian elimination step k:   a_ij^(k+1) = a_ij^(k) − a_ik^(k) × a_kj^(k) / a_kk^(k)

38/ 94
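The elimination step above translates directly into a dense, unpivoted LU factorization kernel (an illustrative sketch, not the course's own code):

  ! Sketch: dense LU factorization without pivoting. L (unit diagonal)
  ! and U overwrite A, as mentioned on the next slide.
  subroutine dense_lu(n, A)
    implicit none
    integer, intent(in) :: n
    double precision, intent(inout) :: A(n,n)
    integer :: i, j, k
    do k = 1, n-1
       do i = k+1, n
          A(i,k) = A(i,k) / A(k,k)           ! l_ik = a_ik / a_kk
       end do
       do j = k+1, n
          do i = k+1, n
             A(i,j) = A(i,j) - A(i,k)*A(k,j) ! a_ij <- a_ij - l_ik * a_kj
          end do
       end do
    end do
  end subroutine dense_lu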
Relation with A = LU factorization

I One step of Gaussian elimination can be written A^(k+1) = L^(k) A^(k), where
  L^(k) is the identity matrix except for column k, whose subdiagonal entries are
  −l_{i,k}, with l_{i,k} = a_ik^(k) / a_kk^(k) for i > k.

I Then A^(n) = U = L^(n−1) · · · L^(1) A, which gives A = LU with
  L = [L^(1)]^{-1} · · · [L^(n−1)]^{-1}, the unit lower triangular matrix whose
  entry (i,j) below the diagonal is l_{i,j}.

I In dense codes, entries of L and U overwrite entries of A.

I Furthermore, if A is symmetric, A = LDL^T with d_kk = a_kk^(k):
  A = LU = A^T = U^T L^T implies U (L^T)^{-1} = L^{-1} U^T = D diagonal
  and U = D L^T, thus A = L(D L^T) = LDL^T.
Gaussian elimination and sparsity
Step k of LU factorization (a_kk pivot):
I For i > k compute l_ik = a_ik / a_kk (= a′_ik),
I For i > k, j > k
    a′_ij = a_ij − (a_ik × a_kj) / a_kk
  or
    a′_ij = a_ij − l_ik × a_kj
I If a_ik ≠ 0 and a_kj ≠ 0 then a′_ij ≠ 0
I If a_ij was zero → its non-zero value must be stored

      k       j              k       j
  k   x       x          k   x       x
  i   x       x          i   x       0

              fill-in

40/ 94
I Idem for Cholesky:
  I For i > k compute l_ik = a_ik / √(a_kk) (= a′_ik),
  I For i > k, j > k, j ≤ i (lower triang.)
      a′_ij = a_ij − (a_ik × a_jk) / √(a_kk)
    or
      a′_ij = a_ij − l_ik × a_jk

41/ 94
Example

I Original matrix
  x x x x x
  x x
  x     x
  x         x
  x             x
I Matrix is full after the first step of elimination
I After reordering the matrix (1st row and column ↔ last row
and column)

42/ 94
  x             x
      x         x
          x     x
              x x
  x x x x x

I No fill-in
I Ordering the variables has a strong impact on
I the fill-in
I the number of operations
NP-hard problem in general (Yannakakis, 1981)

43/ 94
Illustration: Reverse Cuthill-McKee on matrix dwt_592.rua
Harwell-Boeing matrix: dwt_592.rua, structural computing on a submarine.
NZ(LU factors) = 58202

[Spy plots: original matrix (nz = 5104) and factorized matrix (nz = 58202)]

44/ 94
Illustration: Reverse Cuthill-McKee on matrix dwt_592.rua
NZ(LU factors) = 16924

[Spy plots: matrix permuted with RCM (nz = 5104) and factorized permuted matrix (nz = 16924)]

44/ 94
Table: Benefits of sparsity on a matrix of order 2021 × 2021 with 7353 nonzeros (Dongarra et al., 1991).

  Procedure                      Total storage   Flops          Time (sec.) on CRAY J90
  Full system                    4084 Kwords     5503 × 10^6    34.5
  Sparse system                  71 Kwords       1073 × 10^6    3.4
  Sparse system and reordering   14 Kwords       42 × 10^3      0.9

45/ 94
Control of numerical stability: numerical pivoting

I In dense linear algebra, partial pivoting is commonly used (at each step the
  largest entry in the column is selected).
I In sparse linear algebra, flexibility to preserve sparsity is offered:
  I Partial threshold pivoting: eligible pivots are not too small with respect
    to the maximum in the column.
    Set of eligible pivots = { r | |a_rk^(k)| ≥ u × max_i |a_ik^(k)| }, where 0 < u ≤ 1.
  I Then, among the eligible pivots, select one that better preserves sparsity.
I u is called the threshold parameter (u = 1 → partial pivoting).
I It restricts the maximum possible growth of a_ij = a_ij − (a_ik × a_kj)/a_kk
  to 1 + 1/u, which is sufficient to preserve numerical stability.
I u ≈ 0.1 is often chosen in practice.
I For symmetric indefinite problems, 2-by-2 pivots (with threshold) are also
  used to preserve symmetry and sparsity.
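A possible way to combine the threshold test with a sparsity criterion (a sketch on a dense working array; the routine name and the cost array, e.g. a Markowitz-type count supplied by the caller, are assumptions):

  ! Sketch: partial threshold pivoting in column k. Among the rows r >= k
  ! with |a(r,k)| >= u * max_i |a(i,k)|, return the one with smallest
  ! sparsity "cost" (e.g. a Markowitz count).
  subroutine choose_pivot(n, k, a, u, cost, piv)
    implicit none
    integer, intent(in) :: n, k, cost(n)
    double precision, intent(in) :: a(n,n), u
    integer, intent(out) :: piv
    double precision :: colmax
    integer :: r
    colmax = 0.0d0
    do r = k, n
       colmax = max(colmax, abs(a(r,k)))
    end do
    piv = 0
    do r = k, n
       if (abs(a(r,k)) >= u*colmax) then        ! numerically eligible
          if (piv == 0) then
             piv = r
          else if (cost(r) < cost(piv)) then    ! better for sparsity
             piv = r
          end if
       end if
    end do
  end subroutine choose_pivot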
Threshold pivoting and numerical accuracy

Table: Effect of variation of the threshold parameter u on a matrix of order 541 × 541 with 4285 nonzeros (Dongarra et al., 1991).

  u        Nonzeros in LU factors   Error
  1.0      16767                    3 × 10^−9
  0.25     14249                    6 × 10^−10
  0.1      13660                    4 × 10^−9
  0.01     15045                    1 × 10^−5
  10^−4    16198                    1 × 10^2
  10^−10   16553                    3 × 10^23

Difficulty: numerical pivoting implies dynamic data structures that cannot be predicted symbolically.

47/ 94
Three-phase scheme to solve Ax = b

1. Analysis step
I Preprocessing of A (symmetric/unsymmetric orderings,
scalings)
I Build the dependency graph (elimination tree, eDAG . . . )
2. Factorization (A = LU, LDL^T, LL^T, QR)
Numerical pivoting
3. Solution based on factored matrices
I triangular solves: Ly = b, then Ux = y
I improvement of solution (iterative refinement), error analysis

48/ 94
Efficient implementation of sparse algorithms

I Indirect addressing is often used in sparse calculations: e.g.


sparse SAXPY
do i = 1, m
A( ind(i) ) = A( ind(i) ) + alpha * w( i )
enddo
I Even if manufacturers provide hardware support for improving indirect
addressing, it penalizes performance
I Identify dense blocks or switch to dense calculations as soon
as the matrix is not sparse enough

49/ 94
Effect of switch to dense calculations

Matrix from a 5-point discretization of the Laplacian on a 50 × 50 grid (Dongarra et al., 1991)

  Density for switch   Order of         Millions   Time
  to full code         full submatrix   of flops   (seconds)
  No switch            0                7          21.8
  1.00                 74               7          21.4
  0.80                 190              8          15.0
  0.60                 235              11         12.5
  0.40                 305              21         9.0
  0.20                 422              50         5.5
  0.10                 531              100        3.7
  0.005                1420             1908       6.1

Sparse structure should be exploited if density < 10%.

50/ 94
Introduction to Sparse Matrix Computations
Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

51/ 94
Main processor (r)evolutions

I pipelined functional units


I superscalar processors
I out-of-order execution (ILP)
I larger caches
I evolution of instruction set (CISC, RISC, EPIC, . . . )
I multicores

52/ 94
Why parallel processing?

I Computing needs not met in many disciplines (to solve significant problems)
I Uniprocessor performance close to physical limits
  Cycle time 0.5 nanosecond ↔ 8 GFlop/s (with 4 floating-point operations / cycle)
I A 40 TFlop/s computer ⇒ 5000 cores
  → massively parallel computers
I Not because it is the simplest way, but because it is necessary

Current power (June 2010, cf. http://www.top500.org):
Cray XT5, Oak Ridge Natl Lab:
1.7 PFlop/s, 300 TBytes of memory, 224256 cores

53/ 94
Some units for high-performance computing

Speed
  1 MFlop/s   1 Megaflop/s   10^6 operations / second
  1 GFlop/s   1 Gigaflop/s   10^9 operations / second
  1 TFlop/s   1 Teraflop/s   10^12 operations / second
  1 PFlop/s   1 Petaflop/s   10^15 operations / second
  1 EFlop/s   1 Exaflop/s    10^18 operations / second
Memory
  1 kB   1 kilobyte   10^3 bytes
  1 MB   1 Megabyte   10^6 bytes
  1 GB   1 Gigabyte   10^9 bytes
  1 TB   1 Terabyte   10^12 bytes
  1 PB   1 Petabyte   10^15 bytes

54/ 94
Performance measures

I Number of floating-point operations per second (not MIPS)
I Peak performance:
  I What appears in vendors' advertisements
  I Assumes that all processing units are active
  I You are guaranteed not to go any faster:
      Peak performance = #functional units / clock (sec.)
I Actual performance:
  I Usually (unfortunately) much lower than the peak

55/ 94
The ratio (actual performance / peak performance) is often low!!
Let P be a program:
1. Sequential processor:
  I 1 scalar unit (1 GFlop/s)
  I Execution time of P: 100 s
2. Parallel machine with 100 processors:
  I Each processor: 1 GFlop/s
  I Peak performance: 100 GFlop/s
3. If P = sequential code (10%) + parallelized code (90%):
  I Execution time of P: 0.9 + 10 = 10.9 s
  I Actual performance: 9.2 GFlop/s
4. Actual performance / peak performance = 0.1

56/ 94
Moore’s law

I Gordon Moore (co-fondateur d’Intel) a prédit en 1965 que la


densité en transitors des circuits intégrés doublerait tous les
24 mois.
I A aussi servi de but à atteindre pour les fabriquants.
I A été déformé:
I 24 → 18 mois
I nombre de transistors → performance
57/ 94
How to increase computing speed?
I Increase the clock frequency with faster technologies
  We are approaching the limits:
  I Chip design
  I Power consumption and heat dissipation
  I Cooling ⇒ space problem
I Miniaturization can go further, but:
  I not indefinitely
  I the resistance of conductors (R = ρ × l / s) increases, and resistance is
    responsible for energy dissipation (Joule effect).
  I capacitance effects are hard to control
  Remark: 0.5 nanosecond = the time for a signal to travel along 15 cm of cable
I Cycle time 0.5 nanosecond ↔ 8 GFlop/s (with 4 floating-point operations per cycle)

58/ 94
The only solution: parallelism

I parallelism: simultaneous execution of several instructions within a program
I Inside a core:
  I micro-instructions
  I pipelined processing
  I overlap of instructions executed by distinct functional units
  → transparent for the programmer (handled by the compiler or at run time)
I Between distinct processors or cores:
  I different instruction streams are executed → synchronizations:
  I implicit (compiler, automatic parallelization, use of parallel libraries)
  I or explicit (message passing, multithreaded programming, critical sections)

59/ 94
The data access problem

I In practice, one often runs at 10% of peak performance
I Faster processors → faster data access is needed:
  I processor organization,
  I memory organization,
  I inter-processor communication
I More complex hardware: pipelines, technology, network, . . .
I More complex software: compiler, operating system, programming languages,
  parallelism management, debugging tools . . . applications
It becomes harder to program efficiently

60/ 94
Memory bandwidth problems
I Data access is a crucial issue in modern computers
  I Processor performance: +60% per year
  I DRAM memory: +9% per year
  → The ratio (processor performance / memory access time) increases by about 50% per year!
  MFlop/s are easier to obtain than MB/s of memory bandwidth
I Memory hierarchies are more and more complex (but latency increases)
I The way data are accessed becomes more and more critical:
  I Minimize cache misses
  I Minimize memory paging
  I Locality: improve the ratio of references to local memory over references to remote memory
  I Reuse, blocking: increase the flops / memory-access ratio
  I Manage data transfers "by hand"? (Cell, GPU)

61/ 94
  Size              Level           Average access time (# cycles, hit / miss)
                    Registers       < 1
  1 − 128 KB        Cache level 1   1 − 2 / 8 − 66
  256 KB − 16 MB    Cache level 2   6 − 15 / 30 − 200
  1 − 10 GB         Main memory     10 − 100
                    Remote memory   500 − 5000
                    Disks           700,000 / 6,000,000

Figure: Example of a memory hierarchy.

62/ 94
Memory design for a large number of processors?
How can 500 processors access data stored in a shared memory (technology,
interconnection, price?)
→ Solution at a reasonable cost: physically distributed memory (each processor
or group of processors has its own local memory)
I 2 solutions:
  I globally addressable local memories: virtual shared-memory computers
  I explicit data transfers between processors via message passing
I Scalability requires:
  I memory bandwidth growing linearly with processor speed
  I communication bandwidth growing with the number of processors
I Cost/performance ratio → distributed memory and a good cost/performance ratio for the processors
63/ 94
Multiprocessor architectures

A large number of processors → physically distributed memory

  Logical          Physical organization
  organization     Shared (32 procs max)        Distributed
  Shared           shared-memory                global address space (hard/soft)
                   multiprocessors              on top of messages;
                                                virtual shared memory
  Distributed      emulation of message         message passing
                   passing (buffers)

Table: Organization of the processors

64/ 94
Terminology
SMP (Symmetric Multi-Processor) architecture
I Shared memory (physically and logically)
I Uniform memory access time
I Similar, from the application point of view, to multicore architectures
  (1 core = 1 logical processor)
I But communications are much faster within multicores (latency < 3 ns,
  bandwidth > 20 GB/s) than in SMPs (latency ≈ 60 ns, bandwidth ≈ 2 GB/s)

NUMA (Non Uniform Memory Access) architecture

I Memory physically distributed and logically shared
I Easier to increase the number of processors than with SMP
I Access time depends on where the data is located
I Local accesses are faster than remote accesses
I hardware maintains cache coherence (ccNUMA)
65/ 94
Programming

Programming standards
Shared logical organization: POSIX threads, OpenMP directives
Distributed logical organization: PVM, MPI, sockets (message passing)

I Hybrid programming: MPI + OpenMP
I Machines with 1 million cores? emerging GPGPU-type architectures? no standard yet!

66/ 94
Evolution of high-performance computing

I Rapid evolution of high-performance architectures (SMP, clusters, NUMA,
  multicores, Cell, GPUs, . . . )
I Several levels of parallelism
I Increasingly complex memory hierarchies
⇒ Programming becomes harder and harder, with software tools and standards
that always lag behind.
67/ 94
Introduction to Sparse Matrix Computations
Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

68/ 94
Numerical simulation and sparse matrices

I General approach in scientific computing:
  1. Simulation problem (continuous problem)
  2. Application of physical laws (partial differential equations)
  3. Discretization, formulation of the equations in finite dimension
  4. Solution of linear systems (Ax = b)
  5. (Analysis of the results, possibly revisiting the model or the method)
I Solution of linear systems = fundamental algorithmic kernel.
  Parameters to take into account:
  I Properties of the system (symmetry, positive definiteness, conditioning,
    over-determined, . . . )
  I Structure: dense or sparse,
  I Size: several million equations?

69/ 94
Partial differential equations

I Modelling of a physical phenomenon
I Differential equations involving:
  I forces
  I moments
  I temperatures
  I velocities
  I energies
  I time
I Analytical solutions are rarely available

70/ 94
Examples of partial differential equations

I Find the electric potential φ for a given charge distribution f:
    ∇²φ = f  ⇔  Δφ = f,  i.e.
    ∂²φ/∂x² (x,y,z) + ∂²φ/∂y² (x,y,z) + ∂²φ/∂z² (x,y,z) = f(x,y,z)

I Heat equation (Fourier's equation):
    ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² = (1/α) ∂u/∂t
  with
  I u = u(x, y, z, t): temperature,
  I α: thermal diffusivity of the medium.

I Wave propagation equations, Schrödinger equation, Navier–Stokes, . . .

71/ 94
Discretization (the step that follows the physical modelling)

The numerical analyst's job:
I Build a mesh (regular, irregular)
I Choose the solution methods and study their behaviour
I Study the loss of information due to moving to finite dimension

Main discretization techniques

I Finite differences
I Finite elements
I Finite volumes

72/ 94
Discretization with finite differences (1D)
I Basic approximation (ok if h is small enough):
    du/dx (x) ≈ (u(x + h) − u(x)) / h
I Results from Taylor's formula:
    u(x + h) = u(x) + h du/dx + (h²/2) d²u/dx² + (h³/6) d³u/dx³ + O(h⁴)
I Replacing h by −h:
    u(x − h) = u(x) − h du/dx + (h²/2) d²u/dx² − (h³/6) d³u/dx³ + O(h⁴)
I Thus:
    d²u/dx² = (u(x + h) − 2u(x) + u(x − h)) / h² + O(h²)
73/ 94
Discretization with finite differences (1D)

    d²u/dx² = (u(x + h) − 2u(x) + u(x − h)) / h² + O(h²)

3-point stencil for the centered difference approximation to the second-order derivative:

    1   −2   1

74/ 94
Finite Differences for the Laplacian Operator (2D)
Assuming the same mesh refinement h in the x and y directions:

  Δu(x,y) ≈ (u(x−h,y) − 2u(x,y) + u(x+h,y)) / h² + (u(x,y−h) − 2u(x,y) + u(x,y+h)) / h²

  Δu(x,y) ≈ (1/h²) (u(x−h,y) + u(x+h,y) + u(x,y−h) + u(x,y+h) − 4u(x,y))

5-point stencils for the centered difference approximation to the Laplacian
operator (left: standard, right: skewed):

      1                1     1
  1  −4  1                −4
      1                1     1
75/ 94
[Figure: 27-point stencil used for 3D geophysical applications (collaboration with Geosciences Azur, S. Operto and J. Virieux).]
1D example

I Consider the problem
    −u″(x) = f(x) for x ∈ (0, 1),   u(0) = u(1) = 0
I xi = i × h, i = 0, . . . , n + 1, f(xi) = fi, u(xi) = ui, h = 1/(n + 1)
I Centered difference approximation:
    −u_{i−1} + 2u_i − u_{i+1} = h² f_i   (u_0 = u_{n+1} = 0)
I We obtain a linear system Au = f, or (for n = 6):

         |  2 −1  0  0  0  0 | | u1 |   | f1 |
         | −1  2 −1  0  0  0 | | u2 |   | f2 |
  1/h² × |  0 −1  2 −1  0  0 | | u3 | = | f3 |
         |  0  0 −1  2 −1  0 | | u4 |   | f4 |
         |  0  0  0 −1  2 −1 | | u5 |   | f5 |
         |  0  0  0  0 −1  2 | | u6 |   | f6 |

77/ 94
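For illustration, the matrix above can be assembled in coordinate format as follows (a sketch, not from the slides; the routine name and array sizes are assumptions):

  ! Sketch: assemble the 1D Poisson matrix (1/h^2)*tridiag(-1, 2, -1),
  ! h = 1/(n+1), in coordinate format (3n-2 nonzeros).
  subroutine assemble_poisson_1d(n, irn, jcn, val, nnz)
    implicit none
    integer, intent(in) :: n
    integer, intent(out) :: irn(3*n-2), jcn(3*n-2), nnz
    double precision, intent(out) :: val(3*n-2)
    double precision :: h2inv
    integer :: i
    h2inv = dble(n+1)**2            ! 1/h^2
    nnz = 0
    do i = 1, n
       nnz = nnz + 1; irn(nnz) = i; jcn(nnz) = i;   val(nnz) =  2.0d0*h2inv
       if (i > 1) then
          nnz = nnz + 1; irn(nnz) = i; jcn(nnz) = i-1; val(nnz) = -1.0d0*h2inv
       end if
       if (i < n) then
          nnz = nnz + 1; irn(nnz) = i; jcn(nnz) = i+1; val(nnz) = -1.0d0*h2inv
       end if
    end do
  end subroutine assemble_poisson_1d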
Slightly more complicated (2D)

Consider an elliptic PDE:

  −∂/∂x ( a(x,y) ∂u/∂x ) − ∂/∂y ( b(x,y) ∂u/∂y ) + c(x,y) u = g(x,y)  on Ω
  u(x,y) = 0  on ∂Ω
  0 ≤ x, y ≤ 1
  a(x,y) > 0,  b(x,y) > 0,  c(x,y) ≥ 0

78/ 94
I Case of a regular 2D mesh:

  [Figure: regular grid of n × n interior points (n = 4) on the unit square,
   numbered row by row: 1 2 3 4 on the first row, 5 . . . on the second;
   discretization step h = 1/(n+1)]

I 5-point finite difference scheme:

  ∂/∂x ( a(x,y) ∂u/∂x )_ij = ( a_{i+1/2,j} (u_{i+1,j} − u_{i,j}) − a_{i−1/2,j} (u_{i,j} − u_{i−1,j}) ) / h² + O(h²)

I Similarly:

  ∂/∂y ( b(x,y) ∂u/∂y )_ij = ( b_{i,j+1/2} (u_{i,j+1} − u_{i,j}) − b_{i,j−1/2} (u_{i,j} − u_{i,j−1}) ) / h² + O(h²)

I a_{i+1/2,j}, b_{i,j+1/2}, c_ij, . . . known.
I With the ordering of unknowns of the example, we obtain a linear system of the form Ax = b,
I where
    x1 ↔ u_{1,1} = u(1/(n+1), 1/(n+1))
    x2 ↔ u_{2,1} = u(2/(n+1), 1/(n+1))
    x3 ↔ u_{3,1}
    x4 ↔ u_{4,1}
    x5 ↔ u_{1,2}, . . .
I and A is n² by n², b is of size n², with the following structure:

80/ 94
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
|x x x | 1 |g11|
|x x x x | 2 |g21|
| x x x x | 3 |g31|
| x x 0 x | 4 |g41|
|x 0 x x x | 5 |g12|
| x x x x x | 6 |g22|
| x x x x x | 7 |g32|
A=| x x x 0 x | 8 b=|g42|
| x 0 x x x | 9 |g13|
| x x x x x |10 |g23|
| x x x x x |11 |g33|
| x x x 0 x |12 |g43|
| x 0 x x |13 |g14|
| x x x x |14 |g24|
| x x x x |15 |g34|
| x x x |16 |g44|

81/ 94
Solution of the linear system

Often the most costly part in a numerical simulation code

I Direct methods:
I LU factorization followed by triangular substitutions
I parallelism depends highly on the structure of the matrix

I Iterative methods:
I usually rely on sparse matrix-vector products
I algebraic preconditioner useful

82/ 94
Evolution in time of a complex phenomenon

I Examples:
  I climate modelling, evolution of radioactive waste, . . .
  I heat equation:
      Δu(x, y, z, t) = ∂u(x, y, z, t)/∂t
      u(x, y, z, t0) = u0(x, y, z)
I Discretization in both space and time (1D case):
  I Explicit approaches:
      (u_j^{n+1} − u_j^n) / (t_{n+1} − t_n) = (u_{j+1}^n − 2u_j^n + u_{j−1}^n) / h²
  I Implicit approaches:
      (u_j^{n+1} − u_j^n) / (t_{n+1} − t_n) = (u_{j+1}^{n+1} − 2u_j^{n+1} + u_{j−1}^{n+1}) / h²
I Implicit approaches are preferred (more stable, larger timesteps possible)
  but are more numerically intensive: a sparse linear system must be solved
  at each iteration.

83/ 94
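To make the implicit case concrete in 1D, one backward-Euler step amounts to solving a tridiagonal system, for instance with the Thomas algorithm (an illustrative sketch, not from the slides; r = Δt/h² and zero Dirichlet boundary conditions are assumed):

  ! Sketch: one implicit time step of u_t = u_xx, i.e. solve
  ! (I + r*tridiag(-1,2,-1)) u_new = u_old by tridiagonal elimination.
  subroutine heat_implicit_step(n, r, u)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in) :: r
    double precision, intent(inout) :: u(n)
    double precision :: c(n), d(n), m
    integer :: j
    ! Forward elimination (diagonal 1+2r, off-diagonals -r)
    c(1) = -r / (1.0d0 + 2.0d0*r)
    d(1) = u(1) / (1.0d0 + 2.0d0*r)
    do j = 2, n
       m    = (1.0d0 + 2.0d0*r) + r*c(j-1)
       c(j) = -r / m
       d(j) = (u(j) + r*d(j-1)) / m
    end do
    ! Back substitution
    u(n) = d(n)
    do j = n-1, 1, -1
       u(j) = d(j) - c(j)*u(j+1)
    end do
  end subroutine heat_implicit_step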
Discretization with Finite elements

I Consider a partial differential equation of the form (Poisson equation):
    Δu = ∂²u/∂x² + ∂²u/∂y² = f,   u = 0 on ∂Ω
I We can show (using Green's formula) that the previous problem is equivalent to:
    a(u, v) = − ∫ f v dx dy   for all v such that v = 0 on ∂Ω,
  where a(u, v) = ∫_Ω ( ∂u/∂x ∂v/∂x + ∂u/∂y ∂v/∂y ) dx dy

84/ 94
Finite element scheme: 1D Poisson Equation

I Δu = ∂²u/∂x² = f,  u = 0 on ∂Ω
I Equivalent to
    a(u, v) = g(v)  for all v (v|∂Ω = 0)
  where a(u, v) = ∫_Ω (∂u/∂x)(∂v/∂x) and g(v) = − ∫_Ω f(x) v(x) dx
  (1D: similar to integration by parts)
I Idea: we search u of the form u = Σ_k α_k Φ_k(x),
  (Φ_k)_{k=1,n} a basis of functions such that Φ_k is linear on each E_i,
  and Φ_k(x_i) = δ_ik = 1 if k = i, 0 otherwise.

  [Figure: hat functions Φ_{k−1}, Φ_k, Φ_{k+1} on the elements E_k = [x_{k−1}, x_k] and E_{k+1}]
85/ 94
Finite Element Scheme: 1D Poisson Equation

  [Figure: hat functions Φ_{k−1}, Φ_k, Φ_{k+1} around x_k, elements E_k and E_{k+1}]

I We rewrite a(u, v) = g(v) for all Φ_k:
    a(u, Φ_k) = g(Φ_k) for all k  ⇔  Σ_i α_i a(Φ_i, Φ_k) = g(Φ_k)
    a(Φ_i, Φ_k) = ∫_Ω (∂Φ_i/∂x)(∂Φ_k/∂x) = 0 when |i − k| ≥ 2
I k-th equation, associated with Φ_k:
    α_{k−1} a(Φ_{k−1}, Φ_k) + α_k a(Φ_k, Φ_k) + α_{k+1} a(Φ_{k+1}, Φ_k) = g(Φ_k)
I a(Φ_{k−1}, Φ_k) = ∫_{E_k} (∂Φ_{k−1}/∂x)(∂Φ_k/∂x)
  a(Φ_{k+1}, Φ_k) = ∫_{E_{k+1}} (∂Φ_{k+1}/∂x)(∂Φ_k/∂x)
  a(Φ_k, Φ_k) = ∫_{E_k} (∂Φ_k/∂x)(∂Φ_k/∂x) + ∫_{E_{k+1}} (∂Φ_k/∂x)(∂Φ_k/∂x)

86/ 94
Finite Element Scheme: 1D Poisson Equation

From the point of view of E_k, we have a 2×2 contribution matrix:

  | ∫_{E_k} (∂Φ_{k−1}/∂x)(∂Φ_{k−1}/∂x)   ∫_{E_k} (∂Φ_{k−1}/∂x)(∂Φ_k/∂x) |   | I_{E_k}(Φ_{k−1},Φ_{k−1})   I_{E_k}(Φ_{k−1},Φ_k) |
  | ∫_{E_k} (∂Φ_k/∂x)(∂Φ_{k−1}/∂x)       ∫_{E_k} (∂Φ_k/∂x)(∂Φ_k/∂x)     | = | I_{E_k}(Φ_k,Φ_{k−1})       I_{E_k}(Φ_k,Φ_k)     |

  [Figure: mesh 0—1—2—3—4 of Ω, elements E1 . . . E4, basis functions Φ1, Φ2, Φ3]

Assembling over the elements gives the 3×3 system:

  row 1:  [ I_{E1}(Φ1,Φ1) + I_{E2}(Φ1,Φ1)    I_{E2}(Φ1,Φ2)                                   ]
  row 2:  [ I_{E2}(Φ2,Φ1)                    I_{E2}(Φ2,Φ2) + I_{E3}(Φ2,Φ2)    I_{E3}(Φ2,Φ3)  ]
  row 3:  [                                  I_{E3}(Φ2,Φ3)                    I_{E3}(Φ3,Φ3) + I_{E4}(Φ3,Φ3) ]

  × (α1, α2, α3)^T = (g(Φ1), g(Φ2), g(Φ3))^T

87/ 94
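For illustration, the element-by-element assembly above can be sketched as follows for the uniform 1D mesh (dense storage for clarity; the routine name and conventions are assumptions):

  ! Sketch: assemble the 1D Poisson stiffness matrix from the 2x2 element
  ! contributions, for n interior hat functions on n+1 elements, h = 1/(n+1).
  subroutine assemble_fe_1d(n, K)
    implicit none
    integer, intent(in) :: n
    double precision, intent(out) :: K(n,n)
    double precision :: h, ke(2,2)
    integer :: e, i, j, nodes(2), a, b
    h = 1.0d0 / dble(n+1)
    ! Element stiffness: integrals of products of hat-function slopes
    ke = reshape( (/ 1.0d0/h, -1.0d0/h, -1.0d0/h, 1.0d0/h /), (/2,2/) )
    K = 0.0d0
    do e = 1, n+1                  ! element E_e = [x_{e-1}, x_e]
       nodes(1) = e-1              ! global indices of Phi_{e-1}, Phi_e
       nodes(2) = e                ! (nodes 0 and n+1 are boundary, dropped)
       do j = 1, 2
          do i = 1, 2
             a = nodes(i); b = nodes(j)
             if (a >= 1 .and. a <= n .and. b >= 1 .and. b <= n) then
                K(a,b) = K(a,b) + ke(i,j)
             end if
          end do
       end do
    end do
  end subroutine assemble_fe_1d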
Finite Element Scheme in Higher Dimension

I Can be used for higher dimensions


I Mesh can be irregular
I Φi can be a higher degree polynomial
I Matrix pattern depends on mesh connectivity/ordering

88/ 94
Finite Element Scheme in Higher Dimension

I Set of elements (tetrahedra, triangles) to assemble:

  [Figure: triangular element T with vertices i, j, k]

           | a^T_{i,i}  a^T_{i,j}  a^T_{i,k} |
  C(T) =   | a^T_{j,i}  a^T_{j,j}  a^T_{j,k} |
           | a^T_{k,i}  a^T_{k,j}  a^T_{k,k} |

Needs for the parallel case

I Assemble the sparse matrix A = Σ_k C(T_k): graph coloring algorithms
I Parallelization domain by domain: graph partitioning
I Solution of Ax = b: high-performance matrix computation kernels

88/ 94
Other example: linear least squares

I mathematical model + approximate measurements ⇒ estimate the parameters of the model
I m "experiments" + n parameters x_i:
    min ||Ax − b|| with:
  I A ∈ R^{m×n}, m ≥ n: data matrix
  I b ∈ R^m: vector of observations
  I x ∈ R^n: parameters of the model
I Solving the problem:
  I Decompose A in the form A = QR, with Q orthogonal and R triangular
  I ||Ax − b|| = ||Q^T Ax − Q^T b|| = ||Q^T QRx − Q^T b|| = ||Rx − Q^T b||
  I Problems can be large (meteorological data, . . . ), sparse or not

89/ 94
Introduction to Sparse Matrix Computations
Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

90/ 94
Solution of sparse linear systems Ax = b: direct or iterative approaches?

Direct methods
I Very general/robust
  I Numerical accuracy
  I Irregular/unstructured problems
I Factorization of matrix A
  I May be costly (flops/memory)
  I Factors can be reused for multiple right-hand sides b
I Computing issues
  I Good granularity of computations
  I Several levels of parallelism can be exploited

Iterative methods
I Efficiency depends on:
  I convergence – preconditioning
  I numerical properties / structure of A
I Rely on an efficient Mat-Vect product
  I memory effective
  I successive right-hand sides b are problematic
I Computing issues
  I Smaller granularity of computations
  I Often, only one level of parallelism
Introduction to Sparse Matrix Computations
Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

92/ 94
Summary – sparse matrices

I Widely used in engineering and industry


I Irregular data structures
I Strong relations with graph theory
I Efficient algorithms are critical
I Ordering
I Sparse Gaussian elimination
I Sparse matrix-vector multiplication
I Parallelization
I Challenges:
I Modern applications leading to
I bigger and bigger problems
I different types of matrices and requirements
I Dynamic data structures (numerical pivoting) → need for
dynamic scheduling
I More and more parallelism (evolution of parallel architectures)

93/ 94
Suggested home reading

I Google page rank, The world’s largest matrix computation,


Cleve Moler.

I Architecture and Performance Characteristics of Modern


High Performance Computers, Georg Hager and Gerhard
Wellein, Lect. Notes in Physics

I Optimization Techniques for Modern High Performance


Computers, Georg Hager and Gerhard Wellein, Lect. Notes
in Physics

94/ 94
