
CR07 – Sparse Matrix Computations /

Cours de Matrices Creuses (2010-2011)

Jean-Yves L’Excellent (INRIA) and Bora Uçar (CNRS)

LIP-ENS Lyon, Team: ROMA

(Jean-Yves.L.Excellent@ens-lyon.fr, Bora.Ucar@ens-lyon.fr)

Fridays 10h15-12h15

prepared in collaboration with P. Amestoy (ENSEEIHT-IRIT)

1/ 94
Motivations
I Applications with an ever-growing need for computing power:
I modelling,
I simulation (rather than experimentation),
I numerical optimization
I Typically:
Continuous problem ⇒ Discretization (mesh)
⇒ Numerical solution algorithm (according to physical laws)
⇒ Matrix problem (Ax = b, . . .)
I Needs:
I Increasingly accurate models
I Increasingly complex problems
I Applications critical in response time
I Minimization of computing costs

⇒ High-performance computers, parallelism

⇒ Algorithms that make the best use of these computers and of the
properties of the matrices considered
2/ 94
Example of preprocessing for Ax = b

[Spy plots: original matrix (A = lhr01) and preprocessed matrix (A′(lhr01)), both of order 1400 with nz = 18427]

Modified problem: A′x′ = b′ with A′ = Pn P Dr A Q Dc P^t
3/ 94
A few examples in the field of scientific computing

I Time constraints: weather forecasting

4/ 94
A few examples in the field of scientific computing
I Cost constraints: wind tunnels, crash simulation, . . .

5/ 94
Scale Constraints

I large scale: climate modelling, pollution, astrophysics


I tiny scale: combustion, quantum chemistry

6/ 94
Contents of the course

I Introduction, reminders on graph theory and on numerical


linear algebra
- Introduction to sparse matrices
- Graphs and hypergraphs, trees, some classical algorithms
- Gaussian elimination, LU factorization, fill-in
- Conditioning/sensitivity of a problem and error analysis

7/ 94
Contents of the course

[Illustration: sparse matrix ↔ graph]

I Sparse Gaussian elimination


- Sparse matrices and graphs
- Gaussian elimination of sparse matrices
- Ordering and permuting sparse matrices

7/ 94
Contents of the course

I Maximum (weighted) Matching algorithms and their use in


sparse linear algebra

I Efficient factorization methods


- Implementation of sparse direct solvers
- Parallel sparse solvers
- Scheduling of computations to optimize memory usage
and/or performance

I Graph and hypergraph partitioning

I Iterative methods

I Current research activities

7/ 94
Tentative outline
A. INTRODUCTION
I. Sparse matrices
II. Graph theory and algorithms
III. Linear algebra basics
------------------------------------------------------
B. SPARSE GAUSSIAN ELIMINATION
IV. Elimination tree and structure prediction
V. Fill-reducing ordering methods
VI. Matching in bipartite graphs
VII. Factorization: Methods
VIII. Factorization: Parallelization aspects
------------------------------------------------------
C. SOME OTHER ESSENTIAL SPARSE MATRIX ALGORITHMS
IX. Graph and hypergraph partitioning
X. Iterative methods
------------------------------------------------------
D. CLOSING
XI. Current research activities
XII. Presentations
8/ 94
Tentative organization of the course

I References, teaching material will be made available at


http://graal.ens-lyon.fr/~bucar/CR07/

I 2+ hours dedicated to exercises (manipulation of sparse


matrices and graphs)

I Evaluation: mainly based on the study of research articles


related to the contents of the course; the students will write a
report and do an oral presentation.

9/ 94
Outline

Introduction to Sparse Matrix Computations


Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

10/ 94
A selection of references
I Books
I Duff, Erisman and Reid, Direct methods for Sparse Matrices,
Clarendon Press, Oxford 1986.
I Dongarra, Duff, Sorensen and H. A. van der Vorst, Solving
Linear Systems on Vector and Shared Memory Computers,
SIAM, 1991.
I Davis, Direct methods for sparse linear systems, SIAM, 2006.
I Saad, Iterative methods for sparse linear systems, 2nd edition,
SIAM, 2004.
I Articles
I Gilbert and Liu, Elimination structures for unsymmetric sparse
LU factors, SIMAX, 1993.
I Liu, The role of elimination trees in sparse factorization,
SIMAX, 1990.
I Heath and E. Ng and B. W. Peyton, Parallel Algorithms for
Sparse Linear Systems, SIAM review 1991.

11/ 94
Introduction to Sparse Matrix Computations
Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

12/ 94
Motivations
I solution of linear systems of equations → key algorithmic
kernel
Continuous problem → Discretization → Solution of a linear system Ax = b
I Main parameters:
I Numerical properties of the linear system (symmetry, pos.
definite, conditioning, . . . )
I Size and structure:
I Large (> 1000000 × 1000000 ?), square/rectangular
I Dense or sparse (structured / unstructured)
I Target computer (sequential/parallel/multicore)
→ Algorithmic choices are critical

13/ 94
Motivations for designing efficient algorithms

I Time-critical applications
I Solve larger problems
I Decrease elapsed time (parallelism ?)
I Minimize cost of computations (time, memory)

14/ 94
Difficulties

I Access to data :
I Computer : complex memory hierarchy (registers, multilevel
cache, main memory (shared or distributed), disk)
I Sparse matrix : large irregular dynamic data structures.
→ Exploit the locality of references to data on the computer
(design algorithms providing such locality)
I Efficiency (time and memory)
I Number of operations and memory depend very much on the
algorithm used and on the numerical and structural properties
of the problem.
I The algorithm depends on the target computer (vector, scalar,
shared, distributed, clusters of Symmetric Multi-Processors
(SMP), multicore).
→ Algorithmic choices are critical

15/ 94
Introduction to Sparse Matrix Computations
Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

16/ 94
Sparse matrices

Example:
3x1 + 2x2 = 5
2x2 − 5x3 = 1
2x1 + 3x3 = 0

can be represented as

Ax = b,

where

      | 3  2  0 |        | x1 |            | 5 |
  A = | 0  2 −5 | ,  x = | x2 | ,  and b = | 1 |
      | 2  0  3 |        | x3 |            | 0 |
Sparse matrix: only nonzeros are stored.

17/ 94
Sparse matrix ?

[Spy plot: matrix dwt_592.rua (N = 592, NZ = 5104); structural analysis of a submarine]
18/ 94
Sparse matrix ?

[Spy plot: matrix from Computational Fluid Dynamics (collaboration Univ. Tel Aviv); order ≈ 7000, nz = 43105]

“Saddle-point” problem
19/ 94
Preprocessing sparse matrices

[Spy plots: original matrix (A = lhr01) and preprocessed matrix (A′(lhr01)), both with nz = 18427]

Modified problem: A′x′ = b′ with A′ = Pn P Dr A Dc Q P^t
20/ 94
Factorization process

Solution of Ax = b
I A is unsymmetric :
I A is factorized as: A = LU, where
L is a lower triangular matrix, and
U is an upper triangular matrix.
I Forward-backward substitution: Ly = b then Ux = y
I A is symmetric:
I A = LDL^T or LL^T
I A is rectangular m × n with m ≥ n and min_x ||Ax − b||_2 :
I A = QR where Q is orthogonal (Q^{-1} = Q^T) and R is triangular.
I Solve: y = Q^T b then Rx = y

21/ 94
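As an illustration of the forward–backward substitution step above, here is a minimal dense sketch (not from the course material; the routine name lu_solve and its interface are assumptions), assuming A already holds the LU factors with a unit-diagonal L and no pivoting:

  ! Sketch: solve A x = b given the dense LU factors stored in A
  ! (L strictly below the diagonal with unit diagonal, U on and above).
  subroutine lu_solve(n, A, b, x)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in) :: A(n,n), b(n)
    double precision, intent(out) :: x(n)
    double precision :: y(n)
    integer :: i, j
    ! Forward substitution: L y = b
    do i = 1, n
       y(i) = b(i)
       do j = 1, i-1
          y(i) = y(i) - A(i,j)*y(j)
       end do
    end do
    ! Backward substitution: U x = y
    do i = n, 1, -1
       x(i) = y(i)
       do j = i+1, n
          x(i) = x(i) - A(i,j)*x(j)
       end do
       x(i) = x(i) / A(i,i)
    end do
  end subroutine lu_solve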
Difficulties

I Only non-zero values are stored


I Factors L and U have far more nonzeros than A
I Data structures are complex
I Computations are only a small portion of the code (the rest is
data manipulation)
I Memory size is a limiting factor ( → out-of-core solvers )

22/ 94
Key numbers:
1- Small size: 500 MB matrix;
Factors = 5 GB; Flops = 100 GFlops;
2- Example of 2D problem: Lab. Géosciences Azur, Valbonne
I Complex 2D finite-difference matrix, n = 16 × 10^6, 150 × 10^6 nonzeros
I Storage (single prec.): 2 GB (12 GB with the factors)
I Flops: 10 TeraFlops
3- Example of 3D problem: EDF (Code_Aster, structural engineering)
I Real finite-element matrix, n = 10^6, nz = 71 × 10^6 nonzeros
I Storage: 3.5 × 10^9 entries (28 GB) for factors, 35 GB total
I Flops: 2.1 × 10^13
4- Typical performance (MUMPS):
I PC LINUX, 1 core (P4, 2 GHz): 1.0 GFlops/s
I Cray T3E (512 procs): speed-up ≈ 170, perf. 71 GFlops/s
I AMD Opteron 8431, 24 cores @ 2.4 GHz: 50 GFlops/s (1 core: 7 GFlops/s)
23/ 94
Typical test problems:

BMW car body (MSC.Software),
227,362 unknowns, 5,757,996 nonzeros,
Size of factors: 51.1 million entries
Number of operations: 44.9 × 10^9

24/ 94
Typical test problems:

BMW crankshaft (MSC.Software),
148,770 unknowns, 5,396,386 nonzeros,
Size of factors: 97.2 million entries
Number of operations: 127.9 × 10^9

25/ 94
Sources of parallelism

Several levels of parallelism can be exploited:


I At problem level: problem can be decomposed into
sub-problems (e.g. domain decomposition)
I At matrix level: sparsity implies independence in the computations
I At submatrix level: within dense linear algebra computations
(parallel BLAS, . . . )

26/ 94
Data structure for sparse matrices

I Storage scheme depends on the pattern of the matrix and on


the type of access required
I band or variable-band matrices
I “block bordered” or block tridiagonal matrices
I general matrix
I row, column or diagonal access

27/ 94
Data formats for a general sparse matrix A

What needs to be represented


I Assembled matrices: MxN matrix A with NNZ nonzeros.
I Elemental matrices (unassembled): MxN matrix A with NELT
elements.
I Arithmetic: Real (4 or 8 bytes) or complex (8 or 16 bytes)
I Symmetric (or Hermitian)
→ store only part of the data.
I Distributed format ?
I Duplicate entries and/or out-of-range values ?

28/ 94
Classical Data Formats for Assembled Matrices
I Example of a 3×3 matrix with NNZ = 5 nonzeros

        1    2    3
  1   a11
  2        a22  a23
  3   a31       a33

I Coordinate format
  IRN [1 : NNZ] = 1 3 2 2 3
  JCN [1 : NNZ] = 1 1 2 3 3
  VAL [1 : NNZ] = a11 a31 a22 a23 a33
I Compressed Sparse Column (CSC) format
  IRN [1 : NNZ] = 1 3 2 2 3
  VAL [1 : NNZ] = a11 a31 a22 a23 a33
  COLPTR [1 : N + 1] = 1 3 4 6
  column J is stored in IRN/A locations COLPTR(J)...COLPTR(J+1)-1
I Compressed Sparse Row (CSR) format:
  Similar to CSC, but row by row
I Diagonal format (M=N):
  NDIAG = 3
29/ 94
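To make the link between the two formats concrete, here is a minimal sketch (an assumption, not part of the original slides) of a coordinate-to-CSC conversion; the array names IRN, JCN, VAL, COLPTR follow the slide, while the routine name and the output arrays IRNC/VALC are hypothetical:

  ! Sketch: convert coordinate format (IRN, JCN, VAL) to CSC
  ! (IRNC, VALC, COLPTR). Entries within a column keep their
  ! original order; duplicates are kept as-is.
  subroutine coo_to_csc(n, nnz, irn, jcn, val, irnc, valc, colptr)
    implicit none
    integer, intent(in)  :: n, nnz, irn(nnz), jcn(nnz)
    double precision, intent(in)  :: val(nnz)
    integer, intent(out) :: irnc(nnz), colptr(n+1)
    double precision, intent(out) :: valc(nnz)
    integer :: k, j, p
    ! Count the number of entries in each column
    colptr(1:n+1) = 0
    do k = 1, nnz
       colptr(jcn(k)+1) = colptr(jcn(k)+1) + 1
    end do
    ! Prefix sum: COLPTR(j) = start of column j
    colptr(1) = 1
    do j = 1, n
       colptr(j+1) = colptr(j+1) + colptr(j)
    end do
    ! Scatter entries, using COLPTR as a moving pointer
    do k = 1, nnz
       j = jcn(k)
       p = colptr(j)
       irnc(p) = irn(k)
       valc(p) = val(k)
       colptr(j) = p + 1
    end do
    ! Shift COLPTR back by one column and restore COLPTR(1)
    do j = n, 1, -1
       colptr(j+1) = colptr(j)
    end do
    colptr(1) = 1
  end subroutine coo_to_csc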
Classical Data Formats for Assembled Matrices

I Example of a 3x3 matrix with NNZ=5 nonzeros


1 2 3
1 a11

2 a22 a23

3 a31 a33

I Diagonal format (M=N):

  NDIAG = 3
  IDIAG = −2 0 1

        | na   a11  0   |
  VAL = | na   a22  a23 |   (na: not accessed)
        | a31  a33  na  |

  VAL(i,j) corresponds to A(i,i+IDIAG(j)) (for 1 ≤ i + IDIAG(j) ≤ N)

29/ 94
Sparse Matrix-vector products Y ← AX
Algorithm depends on sparse matrix format:

I Coordinate format:
  Y(1:M) = 0
  DO k = 1, NNZ
     Y(IRN(k)) = Y(IRN(k)) + VAL(k) * X(JCN(k))
  ENDDO

I CSC format:
  Y(1:M) = 0
  DO J = 1, N
     Xj = X(J)
     DO k = COLPTR(J), COLPTR(J+1)-1
        Y(IRN(k)) = Y(IRN(k)) + VAL(k) * Xj
     ENDDO
  ENDDO

I CSR format:
  DO I = 1, M
     Yi = 0
     DO k = ROWPTR(I), ROWPTR(I+1)-1
        Yi = Yi + VAL(k) * X(JCN(k))
     ENDDO
     Y(I) = Yi
  ENDDO

I Diagonal format (VAL(i,j) corresponds to A(i,i+IDIAG(j))):
  Y(1:N) = 0
  DO j = 1, NDIAG
     DO i = max(1, 1-IDIAG(j)), min(N, N-IDIAG(j))
        Y(i) = Y(i) + VAL(i,j) * X(i+IDIAG(j))
     ENDDO
  ENDDO

30/ 94
Jagged diagonal storage (JDS)

        1    2    3
  1   a11
  2        a22  a23
  3   a31       a33

1. Shift all elements left (similar to CSR) and keep column indices:
     a11 (1)
     a22 (2)  a23 (3)
     a31 (1)  a33 (3)
2. Sort rows in decreasing order of their number of nonzeros:
     a22 (2)  a23 (3)
     a31 (1)  a33 (3)
     a11 (1)
3. Store the corresponding row permutation: PERM = 2 3 1
4. Store the jagged diagonals (columns of step 2):
     VAL     = a22 a31 a11 a23 a33
     COL_IND = 2 1 1 3 3
     COL_PTR = 1 4 6

Pros: manipulate longer vectors than CSR (interesting on vector computers or GPUs).
Cons: extra indirection due to the permutation array.
31/ 94
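A matrix–vector product kernel for this JDS structure could look as follows (a sketch under the naming assumptions NJD = number of jagged diagonals and arrays VAL, COL_IND, COL_PTR, PERM; not from the original slides):

  ! Sketch: Y <- A*X with A stored in JDS. Jagged diagonal d occupies
  ! VAL(COL_PTR(d) : COL_PTR(d+1)-1); its p-th entry belongs to the
  ! p-th longest row, i.e. to original row PERM(p).
  subroutine jds_matvec(m, njd, col_ptr, col_ind, val, perm, x, y)
    implicit none
    integer, intent(in) :: m, njd, col_ptr(njd+1), col_ind(*), perm(m)
    double precision, intent(in) :: val(*), x(*)
    double precision, intent(out) :: y(m)
    integer :: d, p, i
    y(1:m) = 0.0d0
    do d = 1, njd
       ! Each jagged diagonal is one long vector operation
       do p = col_ptr(d), col_ptr(d+1)-1
          i = perm(p - col_ptr(d) + 1)       ! original row index
          y(i) = y(i) + val(p) * x(col_ind(p))
       end do
    end do
  end subroutine jds_matvec

On the 3×3 example, the first sweep handles a22, a31, a11 as one vector of length 3, the second a23, a33 as one vector of length 2.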
Example of elemental matrix format

      | −1  2  3  0  0 |
      |  2  1  1  0  0 |
  A = |  1  1  3 −1  3 |  = A1 + A2
      |  0  0  1  2 −1 |
      |  0  0  3  2  1 |

         1 | −1  2  3 |           3 |  2 −1  3 |
  A1 =   2 |  2  1  1 | ,   A2 =  4 |  1  2 −1 |
         3 |  1  1  1 |           5 |  3  2  1 |

32/ 94
Example of elemental matrix format

         1 | −1  2  3 |           3 |  2 −1  3 |
  A1 =   2 |  2  1  1 | ,   A2 =  4 |  1  2 −1 |
         3 |  1  1  1 |           5 |  3  2  1 |

I N=5, NELT=2, NVAR=6, A = Σ_{i=1}^{NELT} Ai
I ELTPTR [1:NELT+1] = 1 4 7
  ELTVAR [1:NVAR]   = 1 2 3 3 4 5
  ELTVAL [1:NVAL]   = -1 2 1 2 1 1 3 1 1 2 1 3 -1 2 2 3 -1 1
I Remarks:
  I NVAR = ELTPTR(NELT+1) − 1
  I NVAL = Σ Si² (unsym.) or Σ Si(Si+1)/2 (sym.), with Si = ELTPTR(i+1) − ELTPTR(i)
  I storage of elements in ELTVAL: by columns

32/ 94
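For illustration, a possible assembly of this elemental storage into a dense array (a sketch, not from the slides; real solvers assemble into sparse structures or work directly on the elements, and the routine name is an assumption):

  ! Sketch: assemble unsymmetric elemental storage (ELTPTR/ELTVAR/ELTVAL,
  ! values stored by columns) into a dense N x N array, A = sum of the Ai.
  subroutine assemble_dense(n, nelt, eltptr, eltvar, eltval, A)
    implicit none
    integer, intent(in) :: n, nelt, eltptr(nelt+1), eltvar(*)
    double precision, intent(in) :: eltval(*)
    double precision, intent(out) :: A(n,n)
    integer :: e, s, iloc, jloc, i, j, p
    A = 0.0d0
    p = 1                                ! running pointer into ELTVAL
    do e = 1, nelt
       s = eltptr(e+1) - eltptr(e)       ! size of element e
       do jloc = 1, s                    ! columns first (storage by columns)
          j = eltvar(eltptr(e) + jloc - 1)
          do iloc = 1, s
             i = eltvar(eltptr(e) + iloc - 1)
             A(i, j) = A(i, j) + eltval(p)
             p = p + 1
          end do
       end do
    end do
  end subroutine assemble_dense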
File storage: Rutherford-Boeing

I Standard ASCII format for files


I Header + Data (CSC format). key xyz:
I x=[rcp] (real, complex, pattern)
I y=[suhzr] (sym., uns., herm., skew sym., rectang.)
I z=[ae] (assembled, elemental)
I ex: M_T1.RSA, SHIP003.RSE
I Supplementary files: right-hand-sides, solution,
permutations. . .
I Canonical format introduced to guarantee a unique
representation (order of entries in each column, no duplicates).

33/ 94
File storage: Rutherford-Boeing

DNV-Ex 1 : Tubular joint-1999-01-17 M_T1


1733710 9758 492558 1231394 0
rsa 97578 97578 4925574 0
(10I8) (10I8) (3e26.16)
1 49 96 142 187 231 274 346 417 487
556 624 691 763 834 904 973 1041 1108 1180
1251 1321 1390 1458 1525 1573 1620 1666 1711 1755
1798 1870 1941 2011 2080 2148 2215 2287 2358 2428
2497 2565 2632 2704 2775 2845 2914 2982 3049 3115
...
1 2 3 4 5 6 7 8 9 10
11 12 49 50 51 52 53 54 55 56
57 58 59 60 67 68 69 70 71 72
223 224 225 226 227 228 229 230 231 232
233 234 433 434 435 436 437 438 2 3
4 5 6 7 8 9 10 11 12 49
50 51 52 53 54 55 56 57 58 59
...
-0.2624989288237320E+10 0.6622960540857440E+09 0.2362753266740760E+11
0.3372081648690030E+08 -0.4851430162799610E+08 0.1573652896140010E+08
0.1704332388419270E+10 -0.7300763190874110E+09 -0.7113520995891850E+10
0.1813048723097540E+08 0.2955124446119170E+07 -0.2606931100955540E+07
0.1606040913919180E+07 -0.2377860366909130E+08 -0.1105180386670390E+09
0.1610636280324100E+08 0.4230082475435230E+07 -0.1951280618776270E+07
0.4498200951891750E+08 0.2066239484615530E+09 0.3792237438608430E+08
0.9819999042370710E+08 0.3881169368090200E+08 -0.4624480572242580E+08

34/ 94
File storage: Matrix-market

I Example

%%MatrixMarket matrix coordinate real general


% Comments
5 5 8
1 1 1.000e+00
2 2 1.050e+01
3 3 1.500e-02
1 4 6.000e+00
4 2 2.505e+02
4 4 -2.800e+02
4 5 3.332e+01
5 5 1.200e+01

35/ 94
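A minimal reader for such a coordinate-format Matrix Market file might look like this (a sketch assuming a real, general matrix and a hypothetical file name example.mtx; no error handling):

  ! Sketch: read a real, general Matrix Market coordinate file into COO arrays.
  program read_mm
    implicit none
    integer, parameter :: maxnz = 1000000
    integer :: m, n, nnz, k
    integer :: irn(maxnz), jcn(maxnz)
    double precision :: val(maxnz)
    character(len=256) :: line
    open(10, file='example.mtx', status='old')
    ! Skip the banner and comment lines starting with '%'
    do
       read(10,'(A)') line
       if (line(1:1) /= '%') exit
    end do
    ! First non-comment line: dimensions and number of entries
    read(line, *) m, n, nnz
    do k = 1, nnz
       read(10, *) irn(k), jcn(k), val(k)
    end do
    close(10)
    print *, 'Read ', nnz, ' entries of a ', m, ' x ', n, ' matrix'
  end program read_mm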
Examples of sparse matrix collections

I The University of Florida Sparse Matrix Collection


http://www.cise.ufl.edu/research/sparse/matrices/
I Matrix market http://math.nist.gov/MatrixMarket/
I Rutherford-Boeing
http://www.cerfacs.fr/algor/Softs/RB/index.html
I TLSE http://gridtlse.org/

36/ 94
Introduction to Sparse Matrix Computations
Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

37/ 94
Gaussian elimination

A = A^(1), b = b^(1), A^(1) x = b^(1):

  | a11 a12 a13 | | x1 |   | b1 |
  | a21 a22 a23 | | x2 | = | b2 |      row 2 ← row 2 − row 1 × a21/a11
  | a31 a32 a33 | | x3 |   | b3 |      row 3 ← row 3 − row 1 × a31/a11

A^(2) x = b^(2):

  | a11  a12      a13     | | x1 |   | b1     |
  |  0   a22^(2)  a23^(2) | | x2 | = | b2^(2) |     b2^(2) = b2 − a21 b1/a11, . . .
  |  0   a32^(2)  a33^(2) | | x3 |   | b3^(2) |     a32^(2) = a32 − a31 a12/a11, . . .

Finally A^(3) x = b^(3):

  | a11  a12      a13     | | x1 |   | b1     |
  |  0   a22^(2)  a23^(2) | | x2 | = | b2^(2) |
  |  0   0        a33^(3) | | x3 |   | b3^(3) |     a33^(3) = a33^(2) − a32^(2) a23^(2)/a22^(2), . . .

Typical Gaussian elimination step k:   a_ij^(k+1) = a_ij^(k) − a_ik^(k) × a_kj^(k) / a_kk^(k)

38/ 94
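The elimination step above translates directly into a dense, unpivoted LU factorization kernel (an illustrative sketch, not the course's own code):

  ! Sketch: dense LU factorization without pivoting. L (unit diagonal)
  ! and U overwrite A, as mentioned on the next slide.
  subroutine dense_lu(n, A)
    implicit none
    integer, intent(in) :: n
    double precision, intent(inout) :: A(n,n)
    integer :: i, j, k
    do k = 1, n-1
       do i = k+1, n
          A(i,k) = A(i,k) / A(k,k)           ! l_ik = a_ik / a_kk
       end do
       do j = k+1, n
          do i = k+1, n
             A(i,j) = A(i,j) - A(i,k)*A(k,j) ! a_ij <- a_ij - l_ik * a_kj
          end do
       end do
    end do
  end subroutine dense_lu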
Relation with A = LU factorization

I One step of Gaussian elimination can be written A^(k+1) = L^(k) A^(k), where
  L^(k) is the identity matrix except for column k, whose subdiagonal entries are
  −l_{i,k}, with l_{i,k} = a_ik^(k) / a_kk^(k) for i > k.

I Then A^(n) = U = L^(n−1) · · · L^(1) A, which gives A = LU with
  L = [L^(1)]^{-1} · · · [L^(n−1)]^{-1}, the unit lower triangular matrix whose
  entry (i,j) below the diagonal is l_{i,j}.

I In dense codes, entries of L and U overwrite entries of A.

I Furthermore, if A is symmetric, A = LDL^T with d_kk = a_kk^(k):
  A = LU = A^T = U^T L^T implies U (L^T)^{-1} = L^{-1} U^T = D diagonal
  and U = D L^T, thus A = L(D L^T) = LDL^T.
Gaussian elimination and sparsity
Step k of LU factorization (a_kk pivot):
I For i > k compute l_ik = a_ik / a_kk (= a′_ik),
I For i > k, j > k
    a′_ij = a_ij − (a_ik × a_kj) / a_kk
  or
    a′_ij = a_ij − l_ik × a_kj
I If a_ik ≠ 0 and a_kj ≠ 0 then a′_ij ≠ 0
I If a_ij was zero → its non-zero value must be stored

      k       j              k       j
  k   x       x          k   x       x
  i   x       x          i   x       0

              fill-in

40/ 94
I Idem for Cholesky:
  I For i > k compute l_ik = a_ik / √(a_kk) (= a′_ik),
  I For i > k, j > k, j ≤ i (lower triang.)
      a′_ij = a_ij − (a_ik × a_jk) / √(a_kk)
    or
      a′_ij = a_ij − l_ik × a_jk

41/ 94
Example

I Original matrix
  x x x x x
  x x
  x     x
  x         x
  x             x
I Matrix is full after the first step of elimination
I After reordering the matrix (1st row and column ↔ last row
and column)

42/ 94
  x             x
      x         x
          x     x
              x x
  x x x x x

I No fill-in
I Ordering the variables has a strong impact on
I the fill-in
I the number of operations
NP-hard problem in general (Yannakakis, 1981)

43/ 94
Illustration: Reverse Cuthill-McKee on matrix dwt_592.rua
Harwell-Boeing matrix: dwt_592.rua, structural computing on a submarine.
NZ(LU factors) = 58202

[Spy plots: original matrix (nz = 5104) and factorized matrix (nz = 58202)]

44/ 94
Illustration: Reverse Cuthill-McKee on matrix dwt_592.rua
NZ(LU factors) = 16924

[Spy plots: matrix permuted with RCM (nz = 5104) and factorized permuted matrix (nz = 16924)]

44/ 94
Table: Benefits of sparsity on a matrix of order 2021 × 2021 with 7353 nonzeros (Dongarra et al., 1991).

  Procedure                      Total storage   Flops          Time (sec.) on CRAY J90
  Full system                    4084 Kwords     5503 × 10^6    34.5
  Sparse system                  71 Kwords       1073 × 10^6    3.4
  Sparse system and reordering   14 Kwords       42 × 10^3      0.9

45/ 94
Control of numerical stability: numerical pivoting

I In dense linear algebra, partial pivoting is commonly used (at each step the
  largest entry in the column is selected).
I In sparse linear algebra, flexibility to preserve sparsity is offered:
  I Partial threshold pivoting: eligible pivots are not too small with respect
    to the maximum in the column.
    Set of eligible pivots = { r | |a_rk^(k)| ≥ u × max_i |a_ik^(k)| }, where 0 < u ≤ 1.
  I Then, among the eligible pivots, select one that better preserves sparsity.
I u is called the threshold parameter (u = 1 → partial pivoting).
I It restricts the maximum possible growth of a_ij = a_ij − (a_ik × a_kj)/a_kk
  to 1 + 1/u, which is sufficient to preserve numerical stability.
I u ≈ 0.1 is often chosen in practice.
I For symmetric indefinite problems, 2-by-2 pivots (with threshold) are also
  used to preserve symmetry and sparsity.
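A possible way to combine the threshold test with a sparsity criterion (a sketch on a dense working array; the routine name and the cost array, e.g. a Markowitz-type count supplied by the caller, are assumptions):

  ! Sketch: partial threshold pivoting in column k. Among the rows r >= k
  ! with |a(r,k)| >= u * max_i |a(i,k)|, return the one with smallest
  ! sparsity "cost" (e.g. a Markowitz count).
  subroutine choose_pivot(n, k, a, u, cost, piv)
    implicit none
    integer, intent(in) :: n, k, cost(n)
    double precision, intent(in) :: a(n,n), u
    integer, intent(out) :: piv
    double precision :: colmax
    integer :: r
    colmax = 0.0d0
    do r = k, n
       colmax = max(colmax, abs(a(r,k)))
    end do
    piv = 0
    do r = k, n
       if (abs(a(r,k)) >= u*colmax) then        ! numerically eligible
          if (piv == 0) then
             piv = r
          else if (cost(r) < cost(piv)) then    ! better for sparsity
             piv = r
          end if
       end if
    end do
  end subroutine choose_pivot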
Threshold pivoting and numerical accuracy

Table: Effect of variation of the threshold parameter u on a matrix of order 541 × 541 with 4285 nonzeros (Dongarra et al., 1991).

  u        Nonzeros in LU factors   Error
  1.0      16767                    3 × 10^−9
  0.25     14249                    6 × 10^−10
  0.1      13660                    4 × 10^−9
  0.01     15045                    1 × 10^−5
  10^−4    16198                    1 × 10^2
  10^−10   16553                    3 × 10^23

Difficulty: numerical pivoting implies dynamic data structures that cannot be predicted symbolically.

47/ 94
Three-phase scheme to solve Ax = b

1. Analysis step
I Preprocessing of A (symmetric/unsymmetric orderings,
scalings)
I Build the dependency graph (elimination tree, eDAG . . . )
2. Factorization (A = LU, LDL^T, LL^T, QR)
Numerical pivoting
3. Solution based on factored matrices
I triangular solves: Ly = b, then Ux = y
I improvement of solution (iterative refinement), error analysis

48/ 94
Efficient implementation of sparse algorithms

I Indirect addressing is often used in sparse calculations: e.g.


sparse SAXPY
do i = 1, m
A( ind(i) ) = A( ind(i) ) + alpha * w( i )
enddo
I Even if manufacturers provide hardware support for improving indirect
addressing, it penalizes performance
I Identify dense blocks or switch to dense calculations as soon
as the matrix is not sparse enough

49/ 94
Effect of switch to dense calculations

Matrix from a 5-point discretization of the Laplacian on a 50 × 50 grid (Dongarra et al., 1991)

  Density for switch   Order of         Millions   Time
  to full code         full submatrix   of flops   (seconds)
  No switch            0                7          21.8
  1.00                 74               7          21.4
  0.80                 190              8          15.0
  0.60                 235              11         12.5
  0.40                 305              21         9.0
  0.20                 422              50         5.5
  0.10                 531              100        3.7
  0.005                1420             1908       6.1

Sparse structure should be exploited if density < 10%.

50/ 94
Introduction to Sparse Matrix Computations
Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

51/ 94
Main processor (r)evolutions

I pipelined functional units


I superscalar processors
I out-of-order execution (ILP)
I larger caches
I evolution of instruction set (CISC, RISC, EPIC, . . . )
I multicores

52/ 94
Why parallel processing?

I Computing needs not met in many disciplines (to solve significant problems)
I Uniprocessor performance close to physical limits
  Cycle time 0.5 nanosecond ↔ 8 GFlop/s (with 4 floating-point operations / cycle)
I A 40 TFlop/s computer ⇒ 5000 cores
  → massively parallel computers
I Not because it is the simplest way, but because it is necessary

Current power (June 2010, cf. http://www.top500.org):
Cray XT5, Oak Ridge Natl Lab:
1.7 PFlop/s, 300 TBytes of memory, 224256 cores

53/ 94
Some units for high-performance computing

Speed
  1 MFlop/s   1 Megaflop/s   10^6 operations / second
  1 GFlop/s   1 Gigaflop/s   10^9 operations / second
  1 TFlop/s   1 Teraflop/s   10^12 operations / second
  1 PFlop/s   1 Petaflop/s   10^15 operations / second
  1 EFlop/s   1 Exaflop/s    10^18 operations / second
Memory
  1 kB   1 kilobyte   10^3 bytes
  1 MB   1 Megabyte   10^6 bytes
  1 GB   1 Gigabyte   10^9 bytes
  1 TB   1 Terabyte   10^12 bytes
  1 PB   1 Petabyte   10^15 bytes

54/ 94
Performance measures

I Number of floating-point operations per second (not MIPS)
I Peak performance:
  I What appears in vendors' advertisements
  I Assumes that all processing units are active
  I You are guaranteed not to go any faster:
      Peak performance = #functional units / clock (sec.)
I Actual performance:
  I Usually (unfortunately) much lower than the peak

55/ 94
The ratio (actual performance / peak performance) is often low!!
Let P be a program:
1. Sequential processor:
  I 1 scalar unit (1 GFlop/s)
  I Execution time of P: 100 s
2. Parallel machine with 100 processors:
  I Each processor: 1 GFlop/s
  I Peak performance: 100 GFlop/s
3. If P = sequential code (10%) + parallelized code (90%):
  I Execution time of P: 0.9 + 10 = 10.9 s
  I Actual performance: 9.2 GFlop/s
4. Actual performance / peak performance = 0.1

56/ 94
Moore’s law

I Gordon Moore (co-fondateur d’Intel) a prédit en 1965 que la


densité en transitors des circuits intégrés doublerait tous les
24 mois.
I A aussi servi de but à atteindre pour les fabriquants.
I A été déformé:
I 24 → 18 mois
I nombre de transistors → performance
57/ 94
How to increase computing speed?
I Increase the clock frequency with faster technologies
  We are approaching the limits:
  I Chip design
  I Power consumption and heat dissipation
  I Cooling ⇒ space problem
I Miniaturization can go further, but:
  I not indefinitely
  I the resistance of conductors (R = ρ × l / s) increases, and resistance is
    responsible for energy dissipation (Joule effect).
  I capacitance effects are hard to control
  Remark: 0.5 nanosecond = the time for a signal to travel along 15 cm of cable
I Cycle time 0.5 nanosecond ↔ 8 GFlop/s (with 4 floating-point operations per cycle)

58/ 94
The only solution: parallelism

I parallelism: simultaneous execution of several instructions within a program
I Inside a core:
  I micro-instructions
  I pipelined processing
  I overlap of instructions executed by distinct functional units
  → transparent for the programmer (handled by the compiler or at run time)
I Between distinct processors or cores:
  I different instruction streams are executed → synchronizations:
  I implicit (compiler, automatic parallelization, use of parallel libraries)
  I or explicit (message passing, multithreaded programming, critical sections)

59/ 94
The data access problem

I In practice, one often runs at 10% of peak performance
I Faster processors → faster data access is needed:
  I processor organization,
  I memory organization,
  I inter-processor communication
I More complex hardware: pipelines, technology, network, . . .
I More complex software: compiler, operating system, programming languages,
  parallelism management, debugging tools . . . applications
It becomes harder to program efficiently

60/ 94
Memory bandwidth problems
I Data access is a crucial issue in modern computers
  I Processor performance: +60% per year
  I DRAM memory: +9% per year
  → The ratio (processor performance / memory access time) increases by about 50% per year!
  MFlop/s are easier to obtain than MB/s of memory bandwidth
I Memory hierarchies are more and more complex (but latency increases)
I The way data are accessed becomes more and more critical:
  I Minimize cache misses
  I Minimize memory paging
  I Locality: improve the ratio of references to local memory over references to remote memory
  I Reuse, blocking: increase the flops / memory-access ratio
  I Manage data transfers "by hand"? (Cell, GPU)

61/ 94
  Size              Level           Average access time (# cycles, hit / miss)
                    Registers       < 1
  1 − 128 KB        Cache level 1   1 − 2 / 8 − 66
  256 KB − 16 MB    Cache level 2   6 − 15 / 30 − 200
  1 − 10 GB         Main memory     10 − 100
                    Remote memory   500 − 5000
                    Disks           700,000 / 6,000,000

Figure: Example of a memory hierarchy.

62/ 94
Memory design for a large number of processors?
How can 500 processors access data stored in a shared memory (technology,
interconnection, price?)
→ Solution at a reasonable cost: physically distributed memory (each processor
or group of processors has its own local memory)
I 2 solutions:
  I globally addressable local memories: virtual shared-memory computers
  I explicit data transfers between processors via message passing
I Scalability requires:
  I memory bandwidth growing linearly with processor speed
  I communication bandwidth growing with the number of processors
I Cost/performance ratio → distributed memory and a good cost/performance ratio for the processors
63/ 94
Multiprocessor architectures

A large number of processors → physically distributed memory

  Logical          Physical organization
  organization     Shared (32 procs max)        Distributed
  Shared           shared-memory                global address space (hard/soft)
                   multiprocessors              on top of messages;
                                                virtual shared memory
  Distributed      emulation of message         message passing
                   passing (buffers)

Table: Organization of the processors

64/ 94
Terminology
SMP (Symmetric Multi-Processor) architecture
I Shared memory (physically and logically)
I Uniform memory access time
I Similar, from the application point of view, to multicore architectures
  (1 core = 1 logical processor)
I But communications are much faster within multicores (latency < 3 ns,
  bandwidth > 20 GB/s) than in SMPs (latency ≈ 60 ns, bandwidth ≈ 2 GB/s)

NUMA (Non Uniform Memory Access) architecture

I Memory physically distributed and logically shared
I Easier to increase the number of processors than with SMP
I Access time depends on where the data is located
I Local accesses are faster than remote accesses
I hardware maintains cache coherence (ccNUMA)
65/ 94
Programming

Programming standards
Shared logical organization: POSIX threads, OpenMP directives
Distributed logical organization: PVM, MPI, sockets (message passing)

I Hybrid programming: MPI + OpenMP
I Machines with 1 million cores? emerging GPGPU-type architectures? no standard yet!

66/ 94
Evolution of high-performance computing

I Rapid evolution of high-performance architectures (SMP, clusters, NUMA,
  multicores, Cell, GPUs, . . . )
I Several levels of parallelism
I Increasingly complex memory hierarchies
⇒ Programming becomes harder and harder, with software tools and standards
that always lag behind.
67/ 94
Introduction to Sparse Matrix Computations
Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

68/ 94
Numerical simulation and sparse matrices

I General approach in scientific computing:
  1. Simulation problem (continuous problem)
  2. Application of physical laws (partial differential equations)
  3. Discretization, formulation of the equations in finite dimension
  4. Solution of linear systems (Ax = b)
  5. (Analysis of the results, possibly revisiting the model or the method)
I Solution of linear systems = fundamental algorithmic kernel.
  Parameters to take into account:
  I Properties of the system (symmetry, positive definiteness, conditioning,
    over-determined, . . . )
  I Structure: dense or sparse,
  I Size: several million equations?

69/ 94
Partial differential equations

I Modelling of a physical phenomenon
I Differential equations involving:
  I forces
  I moments
  I temperatures
  I velocities
  I energies
  I time
I Analytical solutions are rarely available

70/ 94
Examples of partial differential equations

I Find the electric potential φ for a given charge distribution f:
    ∇²φ = f  ⇔  Δφ = f,  i.e.
    ∂²φ/∂x² (x,y,z) + ∂²φ/∂y² (x,y,z) + ∂²φ/∂z² (x,y,z) = f(x,y,z)

I Heat equation (Fourier's equation):
    ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² = (1/α) ∂u/∂t
  with
  I u = u(x, y, z, t): temperature,
  I α: thermal diffusivity of the medium.

I Wave propagation equations, Schrödinger equation, Navier–Stokes, . . .

71/ 94
Discretization (the step that follows the physical modelling)

The numerical analyst's job:
I Build a mesh (regular, irregular)
I Choose the solution methods and study their behaviour
I Study the loss of information due to moving to finite dimension

Main discretization techniques

I Finite differences
I Finite elements
I Finite volumes

72/ 94
Discretization with finite differences (1D)
I Basic approximation (ok if h is small enough):
    du/dx (x) ≈ (u(x + h) − u(x)) / h
I Results from Taylor's formula:
    u(x + h) = u(x) + h du/dx + (h²/2) d²u/dx² + (h³/6) d³u/dx³ + O(h⁴)
I Replacing h by −h:
    u(x − h) = u(x) − h du/dx + (h²/2) d²u/dx² − (h³/6) d³u/dx³ + O(h⁴)
I Thus:
    d²u/dx² = (u(x + h) − 2u(x) + u(x − h)) / h² + O(h²)
73/ 94
Discretization with finite differences (1D)

    d²u/dx² = (u(x + h) − 2u(x) + u(x − h)) / h² + O(h²)

3-point stencil for the centered difference approximation to the second-order derivative:

    1   −2   1

74/ 94
Finite Differences for the Laplacian Operator (2D)
Assuming the same mesh refinement h in the x and y directions:

  Δu(x,y) ≈ (u(x−h,y) − 2u(x,y) + u(x+h,y)) / h² + (u(x,y−h) − 2u(x,y) + u(x,y+h)) / h²

  Δu(x,y) ≈ (1/h²) (u(x−h,y) + u(x+h,y) + u(x,y−h) + u(x,y+h) − 4u(x,y))

5-point stencils for the centered difference approximation to the Laplacian
operator (left: standard, right: skewed):

      1                1     1
  1  −4  1                −4
      1                1     1
75/ 94
[Figure: 27-point stencil used for 3D geophysical applications (collaboration with Geosciences Azur, S. Operto and J. Virieux).]
1D example

I Consider the problem
    −u″(x) = f(x) for x ∈ (0, 1),   u(0) = u(1) = 0
I xi = i × h, i = 0, . . . , n + 1, f(xi) = fi, u(xi) = ui, h = 1/(n + 1)
I Centered difference approximation:
    −u_{i−1} + 2u_i − u_{i+1} = h² f_i   (u_0 = u_{n+1} = 0)
I We obtain a linear system Au = f, or (for n = 6):

         |  2 −1  0  0  0  0 | | u1 |   | f1 |
         | −1  2 −1  0  0  0 | | u2 |   | f2 |
  1/h² × |  0 −1  2 −1  0  0 | | u3 | = | f3 |
         |  0  0 −1  2 −1  0 | | u4 |   | f4 |
         |  0  0  0 −1  2 −1 | | u5 |   | f5 |
         |  0  0  0  0 −1  2 | | u6 |   | f6 |

77/ 94
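For illustration, the matrix above can be assembled in coordinate format as follows (a sketch, not from the slides; the routine name and array sizes are assumptions):

  ! Sketch: assemble the 1D Poisson matrix (1/h^2)*tridiag(-1, 2, -1),
  ! h = 1/(n+1), in coordinate format (3n-2 nonzeros).
  subroutine assemble_poisson_1d(n, irn, jcn, val, nnz)
    implicit none
    integer, intent(in) :: n
    integer, intent(out) :: irn(3*n-2), jcn(3*n-2), nnz
    double precision, intent(out) :: val(3*n-2)
    double precision :: h2inv
    integer :: i
    h2inv = dble(n+1)**2            ! 1/h^2
    nnz = 0
    do i = 1, n
       nnz = nnz + 1; irn(nnz) = i; jcn(nnz) = i;   val(nnz) =  2.0d0*h2inv
       if (i > 1) then
          nnz = nnz + 1; irn(nnz) = i; jcn(nnz) = i-1; val(nnz) = -1.0d0*h2inv
       end if
       if (i < n) then
          nnz = nnz + 1; irn(nnz) = i; jcn(nnz) = i+1; val(nnz) = -1.0d0*h2inv
       end if
    end do
  end subroutine assemble_poisson_1d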
Slightly more complicated (2D)

Consider an elliptic PDE:

  −∂/∂x ( a(x,y) ∂u/∂x ) − ∂/∂y ( b(x,y) ∂u/∂y ) + c(x,y) u = g(x,y)  on Ω
  u(x,y) = 0  on ∂Ω
  0 ≤ x, y ≤ 1
  a(x,y) > 0,  b(x,y) > 0,  c(x,y) ≥ 0

78/ 94
I Case of a regular 2D mesh:

  [Figure: regular grid of n × n interior points (n = 4) on the unit square,
   numbered row by row: 1 2 3 4 on the first row, 5 . . . on the second;
   discretization step h = 1/(n+1)]

I 5-point finite difference scheme:

  ∂/∂x ( a(x,y) ∂u/∂x )_ij = ( a_{i+1/2,j} (u_{i+1,j} − u_{i,j}) − a_{i−1/2,j} (u_{i,j} − u_{i−1,j}) ) / h² + O(h²)

I Similarly:

  ∂/∂y ( b(x,y) ∂u/∂y )_ij = ( b_{i,j+1/2} (u_{i,j+1} − u_{i,j}) − b_{i,j−1/2} (u_{i,j} − u_{i,j−1}) ) / h² + O(h²)

I a_{i+1/2,j}, b_{i,j+1/2}, c_ij, . . . known.
I With the ordering of unknowns of the example, we obtain a linear system of the form Ax = b,
I where
    x1 ↔ u_{1,1} = u(1/(n+1), 1/(n+1))
    x2 ↔ u_{2,1} = u(2/(n+1), 1/(n+1))
    x3 ↔ u_{3,1}
    x4 ↔ u_{4,1}
    x5 ↔ u_{1,2}, . . .
I and A is n² by n², b is of size n², with the following structure:

80/ 94
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
|x x x | 1 |g11|
|x x x x | 2 |g21|
| x x x x | 3 |g31|
| x x 0 x | 4 |g41|
|x 0 x x x | 5 |g12|
| x x x x x | 6 |g22|
| x x x x x | 7 |g32|
A=| x x x 0 x | 8 b=|g42|
| x 0 x x x | 9 |g13|
| x x x x x |10 |g23|
| x x x x x |11 |g33|
| x x x 0 x |12 |g43|
| x 0 x x |13 |g14|
| x x x x |14 |g24|
| x x x x |15 |g34|
| x x x |16 |g44|

81/ 94
Solution of the linear system

Often the most costly part in a numerical simulation code

I Direct methods:
I LU factorization followed by triangular substitutions
I parallelism depends highly on the structure of the matrix

I Iterative methods:
I usually rely on sparse matrix-vector products
I algebraic preconditioner useful

82/ 94
Evolution in time of a complex phenomenon

I Examples:
  I climate modelling, evolution of radioactive waste, . . .
  I heat equation:
      Δu(x, y, z, t) = ∂u(x, y, z, t)/∂t
      u(x, y, z, t0) = u0(x, y, z)
I Discretization in both space and time (1D case):
  I Explicit approaches:
      (u_j^{n+1} − u_j^n) / (t_{n+1} − t_n) = (u_{j+1}^n − 2u_j^n + u_{j−1}^n) / h²
  I Implicit approaches:
      (u_j^{n+1} − u_j^n) / (t_{n+1} − t_n) = (u_{j+1}^{n+1} − 2u_j^{n+1} + u_{j−1}^{n+1}) / h²
I Implicit approaches are preferred (more stable, larger timesteps possible)
  but are more numerically intensive: a sparse linear system must be solved
  at each iteration.

83/ 94
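To make the implicit case concrete in 1D, one backward-Euler step amounts to solving a tridiagonal system, for instance with the Thomas algorithm (an illustrative sketch, not from the slides; r = Δt/h² and zero Dirichlet boundary conditions are assumed):

  ! Sketch: one implicit time step of u_t = u_xx, i.e. solve
  ! (I + r*tridiag(-1,2,-1)) u_new = u_old by tridiagonal elimination.
  subroutine heat_implicit_step(n, r, u)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in) :: r
    double precision, intent(inout) :: u(n)
    double precision :: c(n), d(n), m
    integer :: j
    ! Forward elimination (diagonal 1+2r, off-diagonals -r)
    c(1) = -r / (1.0d0 + 2.0d0*r)
    d(1) = u(1) / (1.0d0 + 2.0d0*r)
    do j = 2, n
       m    = (1.0d0 + 2.0d0*r) + r*c(j-1)
       c(j) = -r / m
       d(j) = (u(j) + r*d(j-1)) / m
    end do
    ! Back substitution
    u(n) = d(n)
    do j = n-1, 1, -1
       u(j) = d(j) - c(j)*u(j+1)
    end do
  end subroutine heat_implicit_step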
Discretization with Finite elements

I Consider a partial differential equation of the form (Poisson equation):
    Δu = ∂²u/∂x² + ∂²u/∂y² = f,   u = 0 on ∂Ω
I We can show (using Green's formula) that the previous problem is equivalent to:
    a(u, v) = − ∫ f v dx dy   for all v such that v = 0 on ∂Ω,
  where a(u, v) = ∫_Ω ( ∂u/∂x ∂v/∂x + ∂u/∂y ∂v/∂y ) dx dy

84/ 94
Finite element scheme: 1D Poisson Equation

I Δu = ∂²u/∂x² = f,  u = 0 on ∂Ω
I Equivalent to
    a(u, v) = g(v)  for all v (v|∂Ω = 0)
  where a(u, v) = ∫_Ω (∂u/∂x)(∂v/∂x) and g(v) = − ∫_Ω f(x) v(x) dx
  (1D: similar to integration by parts)
I Idea: we search u of the form u = Σ_k α_k Φ_k(x),
  (Φ_k)_{k=1,n} a basis of functions such that Φ_k is linear on each E_i,
  and Φ_k(x_i) = δ_ik = 1 if k = i, 0 otherwise.

  [Figure: hat functions Φ_{k−1}, Φ_k, Φ_{k+1} on the elements E_k = [x_{k−1}, x_k] and E_{k+1}]
85/ 94
Finite Element Scheme: 1D Poisson Equation

  [Figure: hat functions Φ_{k−1}, Φ_k, Φ_{k+1} around x_k, elements E_k and E_{k+1}]

I We rewrite a(u, v) = g(v) for all Φ_k:
    a(u, Φ_k) = g(Φ_k) for all k  ⇔  Σ_i α_i a(Φ_i, Φ_k) = g(Φ_k)
    a(Φ_i, Φ_k) = ∫_Ω (∂Φ_i/∂x)(∂Φ_k/∂x) = 0 when |i − k| ≥ 2
I k-th equation, associated with Φ_k:
    α_{k−1} a(Φ_{k−1}, Φ_k) + α_k a(Φ_k, Φ_k) + α_{k+1} a(Φ_{k+1}, Φ_k) = g(Φ_k)
I a(Φ_{k−1}, Φ_k) = ∫_{E_k} (∂Φ_{k−1}/∂x)(∂Φ_k/∂x)
  a(Φ_{k+1}, Φ_k) = ∫_{E_{k+1}} (∂Φ_{k+1}/∂x)(∂Φ_k/∂x)
  a(Φ_k, Φ_k) = ∫_{E_k} (∂Φ_k/∂x)(∂Φ_k/∂x) + ∫_{E_{k+1}} (∂Φ_k/∂x)(∂Φ_k/∂x)

86/ 94
Finite Element Scheme: 1D Poisson Equation

From the point of view of E_k, we have a 2×2 contribution matrix:

  | ∫_{E_k} (∂Φ_{k−1}/∂x)(∂Φ_{k−1}/∂x)   ∫_{E_k} (∂Φ_{k−1}/∂x)(∂Φ_k/∂x) |   | I_{E_k}(Φ_{k−1},Φ_{k−1})   I_{E_k}(Φ_{k−1},Φ_k) |
  | ∫_{E_k} (∂Φ_k/∂x)(∂Φ_{k−1}/∂x)       ∫_{E_k} (∂Φ_k/∂x)(∂Φ_k/∂x)     | = | I_{E_k}(Φ_k,Φ_{k−1})       I_{E_k}(Φ_k,Φ_k)     |

  [Figure: mesh 0—1—2—3—4 of Ω, elements E1 . . . E4, basis functions Φ1, Φ2, Φ3]

Assembling over the elements gives the 3×3 system:

  row 1:  [ I_{E1}(Φ1,Φ1) + I_{E2}(Φ1,Φ1)    I_{E2}(Φ1,Φ2)                                   ]
  row 2:  [ I_{E2}(Φ2,Φ1)                    I_{E2}(Φ2,Φ2) + I_{E3}(Φ2,Φ2)    I_{E3}(Φ2,Φ3)  ]
  row 3:  [                                  I_{E3}(Φ2,Φ3)                    I_{E3}(Φ3,Φ3) + I_{E4}(Φ3,Φ3) ]

  × (α1, α2, α3)^T = (g(Φ1), g(Φ2), g(Φ3))^T

87/ 94
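For illustration, the element-by-element assembly above can be sketched as follows for the uniform 1D mesh (dense storage for clarity; the routine name and conventions are assumptions):

  ! Sketch: assemble the 1D Poisson stiffness matrix from the 2x2 element
  ! contributions, for n interior hat functions on n+1 elements, h = 1/(n+1).
  subroutine assemble_fe_1d(n, K)
    implicit none
    integer, intent(in) :: n
    double precision, intent(out) :: K(n,n)
    double precision :: h, ke(2,2)
    integer :: e, i, j, nodes(2), a, b
    h = 1.0d0 / dble(n+1)
    ! Element stiffness: integrals of products of hat-function slopes
    ke = reshape( (/ 1.0d0/h, -1.0d0/h, -1.0d0/h, 1.0d0/h /), (/2,2/) )
    K = 0.0d0
    do e = 1, n+1                  ! element E_e = [x_{e-1}, x_e]
       nodes(1) = e-1              ! global indices of Phi_{e-1}, Phi_e
       nodes(2) = e                ! (nodes 0 and n+1 are boundary, dropped)
       do j = 1, 2
          do i = 1, 2
             a = nodes(i); b = nodes(j)
             if (a >= 1 .and. a <= n .and. b >= 1 .and. b <= n) then
                K(a,b) = K(a,b) + ke(i,j)
             end if
          end do
       end do
    end do
  end subroutine assemble_fe_1d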
Finite Element Scheme in Higher Dimension

I Can be used for higher dimensions


I Mesh can be irregular
I Φi can be a higher degree polynomial
I Matrix pattern depends on mesh connectivity/ordering

88/ 94
Finite Element Scheme in Higher Dimension

I Set of elements (tetrahedra, triangles) to assemble:

  [Figure: triangular element T with vertices i, j, k]

           | a^T_{i,i}  a^T_{i,j}  a^T_{i,k} |
  C(T) =   | a^T_{j,i}  a^T_{j,j}  a^T_{j,k} |
           | a^T_{k,i}  a^T_{k,j}  a^T_{k,k} |

Needs for the parallel case

I Assemble the sparse matrix A = Σ_k C(T_k): graph coloring algorithms
I Parallelization domain by domain: graph partitioning
I Solution of Ax = b: high-performance matrix computation kernels

88/ 94
Other example: linear least squares

I mathematical model + approximate measurements ⇒ estimate the parameters of the model
I m "experiments" + n parameters x_i:
    min ||Ax − b|| with:
  I A ∈ R^{m×n}, m ≥ n: data matrix
  I b ∈ R^m: vector of observations
  I x ∈ R^n: parameters of the model
I Solving the problem:
  I Decompose A in the form A = QR, with Q orthogonal and R triangular
  I ||Ax − b|| = ||Q^T Ax − Q^T b|| = ||Q^T QRx − Q^T b|| = ||Rx − Q^T b||
  I Problems can be large (meteorological data, . . . ), sparse or not

89/ 94
Introduction to Sparse Matrix Computations
Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

90/ 94
Solution of sparse linear systems Ax = b: direct or iterative approaches?

Direct methods
I Very general/robust
  I Numerical accuracy
  I Irregular/unstructured problems
I Factorization of matrix A
  I May be costly (flops/memory)
  I Factors can be reused for multiple right-hand sides b
I Computing issues
  I Good granularity of computations
  I Several levels of parallelism can be exploited

Iterative methods
I Efficiency depends on:
  I convergence – preconditioning
  I numerical properties / structure of A
I Rely on an efficient Mat-Vect product
  I memory effective
  I successive right-hand sides b are problematic
I Computing issues
  I Smaller granularity of computations
  I Often, only one level of parallelism
Introduction to Sparse Matrix Computations
Motivation and main issues
Sparse matrices
Gaussian elimination
Parallel and high performance computing
Numerical simulation and sparse matrices
Direct vs iterative methods
Conclusion

92/ 94
Summary – sparse matrices

I Widely used in engineering and industry


I Irregular data structures
I Strong relations with graph theory
I Efficient algorithms are critical
I Ordering
I Sparse Gaussian elimination
I Sparse matrix-vector multiplication
I Parallelization
I Challenges:
I Modern applications leading to
I bigger and bigger problems
I different types of matrices and requirements
I Dynamic data structures (numerical pivoting) → need for
dynamic scheduling
I More and more parallelism (evolution of parallel architectures)

93/ 94
Suggested home reading

I Google page rank, The world’s largest matrix computation,


Cleve Moler.

I Architecture and Performance Characteristics of Modern


High Performance Computers, Georg Hager and Gerhard
Wellein, Lect. Notes in Physics

I Optimization Techniques for Modern High Performance


Computers, Georg Hager and Gerhard Wellein, Lect. Notes
in Physics

94/ 94
