Beruflich Dokumente
Kultur Dokumente
A Collection of Notes
on Numerical Linear Algebra
Robert A. van de Geijn
Release Date March 28, 2015
Kindly do not share this PDF
Point others to * http://www.ulaff.net instead
This is a work in progress
Contents
Preface
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
ix
xi
1
1
2
3
5
5
5
5
6
6
6
6
7
9
9
10
11
12
12
15
18
18
19
19
20
20
20
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
2.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
23
23
24
25
25
25
27
27
27
27
27
28
29
31
31
32
33
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
35
35
36
37
39
41
41
45
49
50
51
54
55
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
57
57
58
59
64
68
70
71
72
73
73
74
75
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
75
75
75
76
78
79
80
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
83
83
84
85
85
85
87
87
88
89
92
97
99
99
102
104
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
109
109
110
111
111
112
112
113
114
115
115
116
116
123
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
127
127
128
129
129
133
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
8.4. Why Using the Method of Normal Equations Could be Bad . . . . . . . . . . . . . . . . . 134
8.5. Why Multiplication with Unitary Matrices is a Good Thing . . . . . . . . . . . . . . . . . 135
9. Notes on the Stability of an Algorithm
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2. Floating Point Numbers . . . . . . . . . . . . . . . . . . . . . . . .
9.3. Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.4. Floating Point Computation . . . . . . . . . . . . . . . . . . . . . .
9.4.1. Model of floating point computation . . . . . . . . . . . . .
9.4.2. Stability of a numerical algorithm . . . . . . . . . . . . . .
9.4.3. Absolute value of vectors and matrices . . . . . . . . . . .
9.5. Stability of the Dot Product Operation . . . . . . . . . . . . . . . .
9.5.1. An algorithm for computing D OT . . . . . . . . . . . . . .
9.5.2. A simple start . . . . . . . . . . . . . . . . . . . . . . . . .
9.5.3. Preparation . . . . . . . . . . . . . . . . . . . . . . . . . .
9.5.4. Target result . . . . . . . . . . . . . . . . . . . . . . . . . .
9.5.5. A proof in traditional format . . . . . . . . . . . . . . . . .
9.5.6. A weapon of math induction for the war on error (optional) .
9.5.7. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6. Stability of a Matrix-Vector Multiplication Algorithm . . . . . . . .
9.6.1. An algorithm for computing G EMV . . . . . . . . . . . . .
9.6.2. Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.7. Stability of a Matrix-Matrix Multiplication Algorithm . . . . . . . .
9.7.1. An algorithm for computing G EMM . . . . . . . . . . . . .
9.7.2. Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.7.3. An application . . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
137
137
138
139
140
142
142
142
143
143
144
144
144
146
148
149
149
152
152
152
152
154
154
154
155
157
159
159
160
161
161
161
162
164
165
165
167
173
174
175
175
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
11.7.1. Lz = y . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11.7.2. Ux = z . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11.8. Other LU Factorization Algorithms . . . . . . . . . . . . . . . . . .
11.8.1. Variant 1: Bordered algorithm . . . . . . . . . . . . . . . .
11.8.2. Variant 2: Left-looking algorithm . . . . . . . . . . . . . .
11.8.3. Variant 3: Up-looking variant . . . . . . . . . . . . . . . .
11.8.4. Variant 4: Crout variant . . . . . . . . . . . . . . . . . . .
11.8.5. Variant 5: Classical LU factorization . . . . . . . . . . . . .
11.8.6. All algorithms . . . . . . . . . . . . . . . . . . . . . . . .
11.8.7. Formal derivation of algorithms . . . . . . . . . . . . . . .
11.9. Numerical Stability Results . . . . . . . . . . . . . . . . . . . . . .
11.10.Is LU with Partial Pivoting Stable? . . . . . . . . . . . . . . . . . .
11.11.Blocked Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . .
11.11.1.Blocked classical LU factorization (Variant 5) . . . . . . . .
11.11.2.Blocked classical LU factorization with pivoting (Variant 5)
11.12.Variations on a Triple-Nested Loop . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
175
178
178
183
183
184
184
185
185
185
187
188
188
188
191
192
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
195
195
196
197
197
198
199
200
200
203
204
204
206
206
209
209
209
213
213
214
214
214
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
215
215
216
217
220
222
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
223
223
224
225
225
226
227
230
230
231
231
232
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
235
235
236
237
237
242
242
242
244
244
245
247
247
248
250
250
251
252
254
258
258
258
258
258
261
261
261
261
262
265
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
267
268
269
269
272
274
274
278
279
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
281
282
283
283
287
290
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
293
293
294
295
296
296
298
298
299
300
301
307
307
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
313
314
315
315
316
317
318
318
320
323
324
325
325
.
.
.
.
.
.
.
.
Answers
1. Notes on Simple Vector and Matrix Operations . . . . . . .
2. Notes on Vector and Matrix Norms . . . . . . . . . . . . .
3. Notes on Orthogonality and the SVD . . . . . . . . . . . .
4. Notes on Gram-Schmidt QR Factorization . . . . . . . . . .
6. Notes on Householder QR Factorization . . . . . . . . . . .
7. Notes on Solving Linear Least-squares Problems . . . . . .
8. Notes on the Condition of a Problem . . . . . . . . . . . .
9. Notes on the Stability of an Algorithm . . . . . . . . . . . .
10. Notes on Performance . . . . . . . . . . . . . . . . . . .
11. Notes on Gaussian Elimination and LU Factorization . . .
12. Notes on Cholesky Factorization . . . . . . . . . . . . . .
13. Notes on Eigenvalues and Eigenvectors . . . . . . . . . .
14. Notes on the Power Method and Related Methods . . . . .
15. Notes on the Symmetric QR Algorithm . . . . . . . . . .
16. Notes on the Method of Relatively Robust Representations
17. Notes on Computing the SVD . . . . . . . . . . . . . . .
18. Notes on Splitting Methods . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
333
333
337
342
347
349
352
354
356
361
362
369
373
378
379
383
393
394
A. How to Download
395
Bibliography
397
Index
401
Preface
This document was created over the course of many years, as I periodically taught an introductory course
titled Numerical Analysis: Linear Algebra, cross-listed in the departments of Computer Science, Math,
and Statistics and Data Sciences (SDS), as well as the Computational Science Engineering Mathematics
(CSEM) graduate program.
Over the years, my colleagues and I have used many different books for this course: Matrix Computations
by Golub and Van Loan [21], Fundamentals of Matrix Computation by Watkins [45], Numerical Linear
Algebra by Trefethen and Bau [36], and Applied Numerical Linear Algebra by Demmel [11]. All are books
with tremendous strenghts and depth. Nonetheless, I found myself writing materials for my students that
add insights that are often drawn from our own research experiences in the field. These became a series of
notes that are meant to supplement rather than replace any of the mentioned books.
Fundamental to our exposition is the FLAME notation [24, 38], which we use to present algorithms handin-hand with theory. For example, in Figure 1 (left), we present a commonly encountered LU factorization
algorithm, which we will see performs exactly the same computations as does Gaussian elimination. By
abstracting away from the detailed indexing that are required to implement an algorithm in, for example,
Matlabs M-script language, the reader can focus on the mathematics that justifies the algorithm rather
than the indices used to express the algorithm. The algorithms can be easily translated into code with
the help of a FLAME Application Programming Interface (API). Such interfaces have been created for
C, M-script, and Python, to name a few [24, 5, 29]. In Figure 1 (right), we show the LU factorization
algorithm implemented with the FLAME@lab API for M-script. The C API is used extensively in our
implementation of the libflame dense linear algebra library [39, 40] for sequential and shared-memory
architectures and the notation also inspired the API used to implement the Elemental dense linear algebra
library [30] that targets distributed memory architectures. Thus, the reader is exposed to the abstractions
that experts use to translate algorithms to high-performance software.
These notes may at times appear to be a vanity project, since I often point the reader to our research papers
for a glipse at the cutting edge of the field. The fact is that over the last two decades, we have helped
further the understanding of many of the topics discussed in a typical introductory course on numerical
linear algebra. Since we use the FLAME notation in many of our papers, they should be relatively easy to
read once one familiarizes oneself with these notes. Lets be blunt: these notes do not do the field justice
when it comes to also giving a historic perspective. For that, we recommend any of the above mentioned
texts, or the wonderful books by G.W. Stewart [34, 35]. This is yet another reason why they should be
used to supplement other texts.
ix
AT L AT R
Partition A
ABL ABR
where AT L is 0 0
while n(AT L ) < n(A) do
Repartition
AT L
ABL
AT R
ABR
A00
aT
10
A20
a01
A02
a21
aT12
A22
A00
a01
A02
aT
10
A20
11
aT12
A22
11
AT L
ABL
endwhile
AT R
ABR
a21
Figure 1: LU factorization represented with the FLAME notation and the FLAME@lab API.
The order of the chapters should not be taken too seriously. They were initially written as separate,
relatively self-contained notes. The sequence on which I settled roughly follows the order that the topics
are encountered in Numerical Linear Algebra by Trefethen and Bau [36]. The reason is the same as the
one they give: It introduces orthogonality and the Singular Value Decomposition early on, leading with
important material that many students will not yet have seen in the undergraduate linear algebra classes
they took. However, one could just as easily rearrange the chapters so that one starts with a more traditional
topic: solving dense linear systems.
The notes frequently refer the reader to another resource of ours titled Linear Algebra: Foundations to
Frontiers - Notes to LAFF With (LAFF Notes) [29]. This is a 900+ page document with more than 270
videos that was created for the Massive Open Online Course (MOOC) Linear Algebra: Foundations
to Frontiers (LAFF), offered by the edX platform. That course provides an appropriate undergraduate
background for these notes.
I (tried to) video tape my lectures during Fall 2014. Unlike the many short videos that we created for the
Massive Open Online Course (MOOC) titled Linear Algebra: Foundations to Frontiers that are now part
of the notes for that course, I simply set up a camera, taped the entire lecture, spent minimal time editing,
and uploaded the result for the world to see. Worse, I did not prepare particularly well for the lectures,
other than feverishly writing these notes in the days prior to the presentation. Sometimes, I forgot to turn
on the microphone and/or the camera. Sometimes the memory of the camcorder was full. Sometimes I
forgot to shave. Often I forgot to comb my hair. You are welcome to watch, but dont expect too much!
One should consider this a living document. As time goes on, I will be adding more material. Ideally,
people with more expertise than I have on, for example, solving sparse linear systems will contributed
notes of their own.
Video
Here is the video for the first lecture, which is meant as an introduction.
* YouTube
* Download from UT Box
* View After Local Download
(For help on viewing, see Appendix A.)
(In the downloadable version, the lecture starts about 27 seconds into the video. I was still learning
how to do the editing...)
Acknowledgments
These notes use notation and tools developed by the FLAME project at The University of Texas at Austin
(USA), Universidad Jaume I (Spain), and RWTH Aachen University (Germany). This project involves
a large and ever expanding number of very talented people, to whom I am indepted. Over the years, it
has been supported by a number of grants from the National Science Foundation and industry. The most
recent and most relevant funding came from NSF Award ACI-1148125 titled SI2-SSI: A Linear Algebra
Software Infrastructure for Sustained Innovation in Computational Chemistry and other Sciences1
In Texas, behind every successful man there is a woman who really pulls the strings. For more than thirty
years, my research, teaching, and other pedagogical activities have been greatly influenced by my wife,
Dr. Maggie Myers. For parts of these notes that are particularly successful, the credit goes to her. Where
they fall short, the blame is all mine!
Any opinions, findings and conclusions or recommendations expressed in this mate- rial are those of the author(s) and do
not necessarily reflect the views of the National Science Foundation (NSF).
xiii
Chapter
Video
The video for this particular note didnt turn out, so you will want to read instead. As I pointed out in the
preface: these videos arent refined by any measure!
Outline
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.1. Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2.
(Hermitian) Transposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2.1.
1.2.2.
Conjugate of a vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2.3.
Conjugate of a matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2.4.
Transpose of a vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2.5.
1.2.6.
Transpose of a matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2.7.
1.2.8. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.3.1.
1.3.2.
10
1.3.3.
11
Matrix-vector Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
12
1.4.1.
12
1.4.2.
Rank-1 update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15
18
1.5.1.
Element-by-element computation . . . . . . . . . . . . . . . . . . . . . . . . .
18
1.5.2.
19
1.5.3.
19
1.5.4.
20
1.5.5.
Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
20
20
1.4.
1.5.
1.1. Notation
1.1
Notation
Throughout our notes we will adopt notation popularized by Alston Householder, and that is therefore
sometimes called Householder notation. As a rule, we will use lower case Greek letters (, , etc.) to
denote scalars. For vectors, we will use lower case Roman letters (a, b, etc.). Matrices are denoted by
upper case Roman letters (A, B, etc.). A table of how symbols are often used in these notes is given in
Figure 1.1.
The set of all real numbers will be denoted by R and the set of all complex numbers by C. The set of
all vectors of size n is denoted by Rn or Cn , depending on whether its elements are real or complex valued.
The set of all m n matrices is denoted by Rmn or Cmn .
If x Cn , and A Cmn then the elements of x and A can be exposed as
0
0,0
0,1 0,n1
1,1 1,n1
1,0
1
x = . and A =
.
.
.
.
..
..
..
..
n1
m1,0 m1,1 m1,n1
Notice how the symbols used to represent the elements in x and A are the Greek letters that (roughly)
correspond to the Roman letters used to denote the vector and matrix.
If vectors x is partitioned into N subvectors, we may denote this by
x0
x0
x
x1
x = . or x = . ,
..
..
xN1
xN1
where the horizontal lines should not be mistaken for division. They instead emphasize how the vector is
partitioned into subvectors. It is possible for a subvector to be of size zero (no elements) or one (a scalar).
If we want to emphasize that a specific subvector is a scalar, then we may choose to use a lower case Greek
letter for that scalar, as in
x0
x0
or x =
.
x=
1
1
x2
hlinex2
We will see frequent examples where a matrix, A Cmn , is partitioned into columns or rows:
abT0
abT
1
A = a0 a1 an1 = . .
..
T
abm1
The horizontal and/or vertical lines are optional. We add the b to be able to distinguish between the
identifiers for the columns and rows. When from context it is obvious whether we refer to a row or
Matrix Vector
Scalar
Note
Symbol
LATEX
Code
\alpha
alpha
\beta
beta
\gamma
gamma
\delta
delta
\phi
phi
\xi
xi
\eta
eta
I
K
\kappa
kappa
\lambda
lambda
\mu
mu
\nu
nu
is shared with V.
n() = column dimension.
\pi
pi
\theta
theta
\rho
rho
\sigma
sigma
\tau
tau
\nu
nu
\omega
omega
\chi
chi
\psi
psi
\zeta
zeta
\upsilon upsilon
shared with N.
Figure 1.1: Correspondence between letters used for matrices (uppercase Roman)/vectors (lowercase
Roman) and the symbols used to denote their scalar entries (lowercase Greek letters).
column, we may choose to skip the b. The T is used to indicate transposition of the column vector abi , to
make it into the row vector abTi .
Sometimes, we partition matrices into submatrices:
A=
A0 A1 AN1
b0
A
b1
A
..
.
bM1
A
A0,0
A0,1
A0,N1
A1,0
..
.
A1,1
..
.
A1,N1
..
.
1.2
(Hermitian) Transposition
1.2.1
|| = |r + ic | = 2r + 2c = r + ic )(r ic = = = ||.
1.2.2
Conjugate of a vector
1
x= .
..
n1
1.2.3
Conjugate of a matrix
0,0
0,1
1,1
1,0
A=
.
..
..
m1,0 m1,1
0,n1
1,n1
..
.
m1,n1
1.2.4
Transpose of a vector
1
xT = .
..
n1
0 1 n1 .
Notice that transposing a (column) vector rearranges its elements to make a row vector.
1.2.5
1
xH (= xc ) = (x)T = .
..
n1
1.2.6
1
= .
..
n1
= 0 1 n1 .
Transpose of a matrix
0,0
0,1
0,n1
1,1 1,n1
1,0
AT =
.
..
..
..
1.2.7
0,0
1,0
m1,0
1,1 m1,1
0,1
=
.
..
..
..
0,0
0,1
0,n1
1,0
1,1 1,n1
T
H
c
A (= A ) = A =
.
..
..
..
T
0,0
1,0
1,1
0,1
=
.
..
..
0,n1 1,n1
m1,0
m1,1
..
.
m1,n1
1.2.8
Exercises
A=
a0 a1 an1
aT
0
T aT
1
= .
a0 a1 an1
..
aTm1
abT0
abT
1
.
..
abTm1
abT
1
.
..
abTm1
a0 a1 an1
abT0
abT
1
= .
..
abTm1
abT0
H
aH
0
aH
1
= .
..
aH
m1
* SEE ANSWER
x0
x1
x= .
..
xN1
Convince yourself that the following hold:
x0
x1
x= .
..
xN1
xT =
x0T
x1T
xH
x0H
x1H
T
xN1
H
xN1
.
.
* SEE ANSWER
A0,0
A0,1
A0,N1
A=
A1,0
..
.
A1,1
..
.
A1,N1
..
.
A0,0
A0,1
A0,N1
A1,0
..
.
A1,1
..
.
A1,N1
..
.
A0,0
A0,1
A0,N1
A1,0
..
.
A1,1
..
.
A1,N1
..
.
AT0,0
AT1,0
ATM1
AT
AT1,1 ATM1,1
0,1
=
..
..
..
.
.
AH
0,0
AH
1,0
AH
M1
AH
AH
AH
0,1
1,1
M1,1
=
.
.
..
..
..
H
H
AH
0,N1 A1,N1 AM1,N1
* SEE ANSWER
1.3
1.3.1
Vector-vector Operations
Scaling a vector (scal)
1
x= .
..
n1
0
0
1
1
x = . =
.
..
..
n1
n1
If y := x with
y= .
..
n1
x0
x0
x1 x1
. =
..
..
.
xN1
xN1
* SEE ANSWER
10
Cost
Scaling a vector of size n requires, approximately, n multiplications. Each of these becomes a floating
point operation (flop) when the computation is performed as part of an algorithm executed on a computer
that performs floating point computation. We will thus say that scaling a vector costs n flops.
It should be noted that arithmetic with complex numbers is roughly 4 times as expensive as is arithmetic
with real numbers. In the chapter Notes on Performance (page ??) we also discuss that the cost of
moving data impacts the cost of a flop. Thus, not all flops are created equal!
1.3.2
1
x= .
..
n1
Then x + y equals the vector
1
x + y = .
..
n1
and y = .
..
n1
1
+ .
..
n1
1
=
.
..
n1
1
+
.
..
n1
0
0
0
1
1
1
y := x + y = . + . =
.
..
.. ..
n1
n1
n1
1
+
.
..
n1
x0
x1
.
..
xN1
* SEE ANSWER
11
Cost
The axpy with two vectors of size n requires, approximately, n multiplies and n additions or 2n flops.
1.3.3
Let x, y Cn with
1
x= .
..
n1
and y = .
..
n1
1
xH y = .
..
n1
.
..
n1
1
= 0 1 n1 .
..
n1
n1
= 0 0 + 1 1 + + n1 n1 =
i i .
i=0
x0
x1
.
..
xN1
y0
y1
.
..
yN1
N1
H
ni
= N1
i=0 xi yi . (Provided xi , yi C and i=0 ni = n.)
* SEE ANSWER
Homework 1.7 Prove that xH y = yH x.
* SEE ANSWER
12
As we discuss matrix-vector multiplication and matrix-matrix multiplication, the closely related operation xT y is also useful:
1
T
x y = .
..
n1
.
..
n1
1
= 0 1 n1 .
..
n1
n1
= 0 0 + 1 1 + + n1 n1 =
i i .
i=0
We will sometimes refer to this operation as a dot product since it is like the dot product xH y, but without
conjugation. In the case of real valued vectors, it is the dot product.
Cost
The dot product of two vectors of size n requires, approximately, n multiplies and n additions. Thus, a dot
product cost, approximately, 2n flops.
1.4
1.4.1
Matrix-vector Operations
Matrix-vector multiplication (product)
Be sure to understand the relation between linear transformations and matrix-vector multiplication by
reading Week 2 of
Linear Algebra: Foundations to Fountiers - Notes to LAFF With [29].
Let y Cm , A Cmn , and x Cn with
y= .
..
m1
0,0
0,1
0,n1
1,1 1,n1
1,0
A=
.
..
..
..
.
.
1
and x = .
..
n1
.
.
.
m1
0,0
0,1
0,n1
1,1 1,n1
1,0
=
.
..
..
..
.
.
1
.
..
n1
13
0,0
0,1
0,n1
1,1 1,n1
1,0
=
.
..
..
..
.
.
1
.
..
n1
This is the definition of the matrix-vector product Ax, sometimes referred to as a general matrix-vector
multiplication (gemv) when no special structure or properties of A are assumed.
Now, partition
abT0
abT
1
A = a0 a1 an1 = . .
..
T
abm1
Focusing on how A can be partitioned by columns, we find that
y = Ax =
a0 a1 an1
1
.
..
n1
= a0 0 + a1 1 + + an1 n1
= 0 a0 + 1 a1 + + n1 an1
= n1 an1 + ( + (1 a1 + (0 a0 + 0)) ),
where 0 denotes the zero vector of size m. This suggests the following loop for computing y := Ax:
y := 0
for j := 0, . . . , n 1
y := j a j + y
(axpy)
endfor
In Figure 1.2 (left), we present this algorithm using the FLAME notation (LAFF Notes Week 3 [29]).
Focusing on how A can be partitioned by rows, we find that
y= .
..
n1
abT0
abT
1
= Ax = .
..
abTm1
abT0 x
abT x
1
x =
..
abTm1 x
14
xT
Partition A AL AR , x
xB
where AL is 0 columns, xT has 0 elements
Repartition
Repartition
AL AR
A0 a1 A2
xT
endwhile
A0 a1 A2
x
0
, 1
xB
x2
AL AR
Continue with
A
y
0
0
AT
y
T
aT1 , 1
AB
yB
A2
y2
Continue with
AT
1 := aT1 x + 1
y := 1 a1 + y
A
y
0
0
yT
aT , 1
1
AB
yB
A2
y2
x
0
, 1
xB
x2
xT
endwhile
Figure 1.2: Matrix-vector multiplication algorithms for y := Ax + y. Left: via axpy operations (by
columns). Right: via dot products (by rows).
for i := 0, . . . , m 1
i := abTi x + i
endfor
(dot)
Here we use the term dot because for complex valued matrices it is not really a dot product. In Figure 1.2
(right), we present this algorithm using the FLAME notation (LAFF Notes Week 3 [29]).
It is important to notice that this first matrix-vector operation (matrix-vector multiplication) can be
layered upon vector-vector operations (axpy or dot).
Cost
Matrix-vector multiplication with a m n matrix costs, approximately, 2mn flops. This can be easily
argued in three different ways:
The computation requires a multiply and an add with each element of the matrix. There are mn such
elements.
15
The operation can be computed via n axpy operations, each of which requires 2m flops.
The operation can be computed via m dot operations, each of which requires 2n flops.
1.4.2
Rank-1 update
0
0,0
0,1 0,n1
1,1 1,n1
1
1,0
y = . , A =
.
..
..
..
..
.
.
m1
m1,0 m1,1 m1,n1
1
and x = .
..
n1
0
0
0
1
1
1
T
yx = . . = . 0 1 n1
.. ..
..
m1
n1
m1
0 0
0 1 0 n1
1 1 1 n1
1 0
=
.
.
.
.
..
..
..
m1 0 m1 1 m1 n1
Also,
yxT = y
0 1 n1
0 y 1 y n1 y
0
0 xT
xT
1
1
T
T
yx = . x =
..
..
m1
m1 xT
which shows that all columns are a multiple of row vector xT . This motivates the observation that the
matrix yxT has rank at most equal to one (LAFF Notes Week 10 [29]).
The operation A := yxT + A is called a rank-1 update to matrix A and is often referred to as underlinegeneral rank-1 update (ger):
0,0
0,1 0,n1
1,0
1,1
1,n1
:=
..
..
..
.
.
.
16
0 0 + 0,0
0 1 + 0,1
0 n1 + 0,n1
1 0 + 1,0
..
.
1 1 + 1,1
..
.
1 n1 + 1,n1
..
.
A=
a0 a1 an1
abT0
abT
1
= .
..
abTm1
(axpy)
In Figure 1.3 (left), we present this algorithm using the FLAME notation (LAFF Notes Week 3).
Focusing on how A can be partitioned by rows, we find that
abT0
0 xT
1 xT abT
1
yxT + A =
+ .
.
..
..
T
T
m1 x
abm1
0 xT + abT0
1 xT + abT
1
=
.
.
..
m1 xT + abTm1
Notice that each row is updated with an axpy operation. This suggests the following loop for computing
A := yxT + A:
for i := 0, . . . , m 1
abTj := i xT + abTj
endfor
(axpy)
17
Repartition
x
0
1
, AL AR A0 a1 A2
xB
x2
xT
y
A
0
0
AT
1 , aT
1
yB
AB
y2
A2
yT
a1 := y1 + a1
aT1 := 1 xT + aT1
Continue with
x
0
xT
1 , AL AR A0 a1 A2
xB
x2
Continue with
y
A
0
0
yT
A
T
1 , aT1
yB
AB
y2
A2
endwhile
endwhile
Figure 1.3: Rank-1 update algorithms for computing A := yxT + A. Left: by columns. Right: by rows.
In Figure 1.3 (right), we present this algorithm using the FLAME notation (LAFF Notes Week 3).
Again, it is important to notice that this matrix-vector operation (rank-1 update) can be layered
upon the axpy vector-vector operation.
Cost
A rank-1 update of a m n matrix costs, approximately, 2mn flops. This can be easily argued in three
different ways:
The computation requires a multiply and an add with each element of the matrix. There are mn such
elements.
The operation can be computed one column at a time via n axpy operations, each of which requires
2m flops.
The operation can be computed one row at a time via m axpy operations, each of which requires 2n
flops.
1.5
18
Be sure to understand the relation between linear transformations and matrix-matrix multiplication (LAFF
Notes Weeks 3 and 4 [29]).
We will now discuss the computation of C := AB + C, where C Cmn , A Cmk , and B Ckn .
(If one wishes to compute C := AB, one can always start by setting C := 0, the zero matrix.) This is the
definition of the matrix-matrix product AB, sometimes referred to as a general matrix-matrix multiplication
(gemm) when no special structure or properties of A and B are assumed.
1.5.1
Element-by-element computation
Let
0,0
0,1
0,n1
0,0
0,1
0,k1
1,1 1,k1
1,0
1,1
1,n1
1,0
C=
,
A
=
.
.
.
.
..
..
..
..
..
..
0,0
0,1 0,n1
1,1 1,n1
1,0
B=
.
.
.
.
..
..
..
Then
C := AB +C
k1
k1
p=0 0,p p,0 + 0,0
p=0 0,p p,1 + 0,1
k1
k1
=
.
..
..
k1
k1
p=0 m1,p p,0 + m1,0 p=0 m1,p p,1 + m1,1
k1
p=0 0,p p,n1 + 0,n1
k1
p=0 1,p p,n1 + 1,n1
k1
p=0 m1,p p,n1 + m1,n1
abT0
abT
1
A = . and B = b0 b1 bn1 .
..
T
abm1
Then
abT0
abT
1
C := AB +C = .
..
abTm1
..
.
b0 b1 bn1
1.5.2
19
abT0 b0 + 0,0
abT0 b1 + 0,1
abT1 b0 + 1,0
..
.
abT1 b1 + 1,1
..
.
Partition
C=
c0 c1 cn1
and B =
b0 b1 bn1
Then
C := AB+C = A
b0 b1 bn1
+ c0 c1 cn1 = Ab0 + c0 Ab1 + c1 Abn1 + cn1
1.5.3
(matrix-vector multiplication)
Partition
cbT0
cbT
1
C= .
..
cbTm1
abT
1
A= .
..
abTm1
and
abT0
Then
abT0
abT
1
C := AB +C = .
..
abTm1
cbT0
cbT
1
B
+
..
cbTm1
abT0 B + cbT0
abT1 B + cbT1
..
.
abTm1 B + cbTm1
which shows that each row of C is updated with a row-vector time matrix multiplication: cbTi := abTi B + cbTi :
for i := 0, . . . , m 1
cbTi := abTi B + cbTi
endfor
1.5.4
20
Partition
A=
a0 a1 ak1
b
bT0
b
bT1
..
.
b
bT
B=
and
k1
The
C := AB +C =
a0 a1 ak1
b
bT0
b
bT1
..
.
b
bT
+C
k1
= a0 b
bT0 + a1b
bT1 + + ak1b
bTk1 +C
= ak1b
bT + ( (a1b
bT + (a0b
bT +C)) )
k1
which shows that C can be updated with a sequence of rank-1 update, suggesting the loop
for p := 0, . . . , k 1
bTp +C
C := a pb
endfor
1.5.5
(rank-1 update)
Cost
1.6
Summarizing All
It is important to realize that almost all operations that we discussed in this note are special cases of
matrix-matrix multiplication. This is summarized in Figure 1.4. A few notes:
21
Illustration
k
:=
Label
+
gemm
:=
large large
ger
:=
large
gemv
gemv
large
axpy
axpy
large
dot
:=
1
:=
1
:=
large
large large
:=
large
:=
MAC
Figure 1.4: Special shapes of gemm C := AB +C. Here C, A, and B are m n, m k, and k n matrices,
respectively.
Row-vector times matrix is matrix-vector multiplication in disguise: If yT := xT A then y := AT x.
For a similar reason yT := xT + yT is an axpy in disguise: y := x + y.
The operation y := x + y is the same as y := x + y since multiplication of a vector by a scalar
commutes.
The operation := + is known as a Multiply-ACcumulate (MAC) operation. Often floating
point hardware implements this as an integrated operation.
22
Observations made about operations with partitioned matrices and vectors can be summarized by partitioning matrices into blocks: Let
C0,0
C0,1 C0,N1
A0,0
A0,1 A0,K1
C
C
C
A
A
1,0
1,1
1,N1
1,0
1,1
1,K1
C=
,A =
..
..
..
..
..
..
.
.
.
.
.
B0,1
B0,1
B0,N1
B
B1,1 B1,N1
1,0
B=
.
..
..
..
Then
C := AB +C =
K1
K1
p=0 A0,p B p,0 +C0,0
p=0 A0,p B p,1 +C0,1
K1
K1
.
..
..
k1
K1
p=0 AM1,p B p,0 +CM1,0 p=0 AM1,p B p,1 +CM1,1
K1
p=0 A0,p B p,N1 +C0,N1
K1
p=0 A1,p B p,N1 +C1,N1
..
.
k1
p=0 AM1,p B p,N1 +CM1,N1
Provided the partitionings of C, A, and B are conformal. Loosely speaking, partitionings of vectors
and/or matrices are conformal if they match in a way that makes operations with the subvectors and/or
submatrices legal.
Notice that multiplication with partitioned matrices is exactly like regular matrix-matrix multiplication
with scalar elements, except that multiplication of two blocks does not necessarily commute.
Chapter
23
24
Outline
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
25
25
25
27
27
27
27
27
28
29
2.3.4. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
31
2.3.5.
31
32
33
2.1
25
Absolute Value
Recall
that if C, then || equals its absolute value. In other words, if = r + ic , then || =
p
2
r + 2c = .
This absolute value function has the following properties:
6= 0 || > 0 (| | is positive definite),
|| = |||| (| | is homogeneous), and
| + | || + || (| | obeys the triangle inequality).
2.2
Vector Norms
A (vector) norm extends the notion of an absolute value (length or size) to vectors:
Definition 2.1 Let : Cn R. Then is a vector norm if for all x, y Cn
x 6= 0 (x) > 0 ( is positive definite),
(x) = ||(x) ( is homogeneous), and
(x + y) (x) + (y) ( obeys the triangle inequality).
Homework 2.2 Prove that if : Cn R is a norm, then (0) = 0 (where the first 0 denotes the zero vector
in Cn ).
* SEE ANSWER
Note: often we will use k k to denote a vector norm.
2.2.1
H
kxk2 = x x = 0 0 + + n1 n1 = |0 |2 + + |n1 |2 .
To show that the vector 2-norm is a norm, we will need the following theorem:
Theorem 2.4 (Cauchy-Schwartz inequality) Let x, y Cn . Then |xH y| kxk2 kyk2 .
Proof: Assume that x 6= 0 and y 6= 0, since otherwise the inequality is trivially true. We can then choose
xb = x/kxk2 and yb = y/kyk2 . This leaves us to prove that |b
xH yb| 1, with kb
xk2 = kb
yk2 = 1.
H
Pick C with || = 1 s that b
x yb is real and nonnegative. Note that since it is real, b
xH yb = b
xH yb =
H
b
y xb.
26
Now,
0 kb
x b
yk22
= (x b
y)H (b
x b
y)
(kzk22 = zH z)
= xbH xb b
xH yb+ b
yH xb b
yH yb
(multiplying out)
= 1 2b
xH yb+ ||2
(kb
xk2 = kb
yk2 = 1 and b
xH yb = b
xH yb = b
yH xb)
= 2 2b
xH yb
(|| = 1).
Thus 1 b
xH yb and, taking the absolute value of both sides,
1 |b
xH yb| = |||b
xH yb| = |b
xH yb|,
which is the desired result.
QED
kx + yk22 =
=
(x + y)H (x + y)
xH x + yH x + xH y + yH y
kxk22 + 2kxk2 kyk2 + kyk22
(kxk2 + kyk2 )2 .
Taking the square root of both sides yields the desired result.
QED
2.2.2
27
Vector 1-norm
2.2.3
2.2.4
Vector p-norm
2.3
Matrix Norms
It is not hard to see that vector norms are all measures of how big the vectors are. Similarly, we want
to have measures for how big matrices are. We will start with one that are somewhat artificial and then
move on to the important class of induced matrix norms.
2.3.1
Frobenius norm
28
Notice that one can think of the Frobenius norm as taking the columns of the matrix, stacking them on
top of each other to create a vector of size m n, and then taking the vector 2-norm of the result.
Homework 2.13 Show that the Frobenius norm is a norm.
* SEE ANSWER
Similarly, other matrix norms can be created from vector norms by viewing the matrix as a vector. It
turns out that other than the Frobenius norm, these arent particularly interesting in practice.
2.3.2
kAk, = sup
x 6= 0
Let us start by interpreting this. How big A is, as measured by kAk, , is defined as the most that A
magnifies the length of nonzero vectors, where the length of the vectors (x) is measured with norm k k
and the length of the transformed vector (Ax) is measured with norm k k .
Two comments are in order. First,
kAxk
= sup kAxk .
kxk =1
x Cn kxk
sup
x 6= 0
x 6= 0
x 6= 0
x 6= 0
x 6= 0
x
y = kxk
Also the sup (which stands for supremum) is used because we cant claim yet that there is a vector x
with kxk = 1 for which
kAk, = kAxk .
The fact is that there is always such a vector x. The proof depends on a result from real analysis (sometimes
called advanced calculus) that states that supxS f (x) is attained for some vector x S as long as f is
continuous and S is a compact set. Since real analysis is not a prerequisite for this course, the reader may
have to take this on faith!
We conclude that the following two definitions are equivalent definitions to the one we already gave:
Definition 2.15 Let k k : Cm R and k k : Cn R be vector norms. Define k k, : Cmn R by
kAxk
.
x C kxk
kAk, = maxn
x 6= 0
29
and
Definition 2.16 Let k k : Cm R and k k : Cn R be vector norms. Define k k, : Cmn R by
kAk, = max kAxk .
kxk =1
=
> 0.
ke j k
ke j k
x C kxk
kAk, = maxn
x 6= 0
kAk, = maxn
x 6= 0
x 6= 0
x 6= 0
kA + Bk, =
k(A + B)xk
kAx + Bxk
kAxk + kBxk
= maxn
maxn
kxk
kxk
kxk
xC
xC
xC
maxn
x 6= 0
x 6= 0
maxn
xC
x 6= 0
kAxk kBxk
+
kxk
kxk
x 6= 0
kAxk
kBxk
+ maxn
= kAk, + kBk, .
x C kxk
x C kxk
maxn
x 6= 0
x 6= 0
QED
2.3.3
The most important case of k k, : Cmn R uses the same norm for k k and k k (except that m may
not equal n).
Definition 2.18 Define k k p : Cmn R by
kAxk p
= max kAxk p .
kxk p =1
x C kxk p
kAk p = maxn
x 6= 0
30
Special cases are the matrix 2-norm (k k2 ), matrix 1-norm (k k1 ), and matrix -norm (k k ).
Theorem 2.19 Let A Cmn and partition A = a0 a1 an1 . Show that
kAk1 = max ka j k1 .
0 j<n
1
n1
=
max k0 a0 + 1 a1 + + n1 an1 k1
kxk1 =1
kxk1 =1
kxk1 =1
kxk1 =1
kxk1 =1
= ka k1 .
Also,
ka k1 = kAe k1 max kAxk1 .
kxk1 =1
Hence
ka k1 max kAxk1 ka k1
kxk1 =1
kxk1 =1
QED
abT0
abT
1
mn
Homework 2.20 Let A C
and partition A = .
..
abTm1
. Show that
kAk = max kb
ai k1 = max (|i,0 | + |i,1 | + + |i,n1 |)
0i<m
0i<m
* SEE ANSWER
Notice that in the above exercise abi is really (b
aTi )T since abTi is the label for the ith row of matrix A.
Homework 2.21 Let y Cm and x Cn . Show that kyxH k2 = kyk2 kxk2 .
* SEE ANSWER
2.3.4
31
Discussion
While k k2 is a very important matrix norm, it is in practice often difficult to compute. The matrix norms,
k kF , k k1 , and k k are more easily computed and hence more practical in many instances.
2.3.5
kCxk
= max kCxk .
kxk
kxk =1
Then for any A Cmk and B Ckn the inequality kABk kAk kBk holds.
In other words, induced matrix norms are submultiplicative.
To prove the above, it helps to first proof a simpler result:
Lemma 2.24 Let k k : Cn R be a vector norm and given any matrix C Cmn define the induced
matrix norm as
kCxk
= max kCxk .
kCk = max
x6=0 kxk
kxk =1
Then for any A Cmn and y Cn the inequality kAyk kAk kyk holds.
Proof: If y = 0, the result obviously holds since then kAyk = 0 and kyk = 0. Let y 6= 0. Then
kAk = max
x6=0
kAxk kAyk
.
kxk
kyk
QED
kABk = max kABxk = max kA(Bx)k max kAk kBxk max kAk kBk kxk = kAk kBk .
kxk =1
kxk =1
kxk =1
kxk =1
QED
Homework 2.25 Show that kAxk kAk, kxk .
* SEE ANSWER
Homework 2.26 Show that kABk kAk, kBk .
* SEE ANSWER
Homework 2.27 Show that the Frobenius norm, k kF , is submultiplicative.
* SEE ANSWER
2.4
32
A question we will run into later in the course asks how accurate we can expect the solution of a linear
system to be if the right-hand side of the system has error in it.
Formally, this can be stated as follows: We wish to solve Ax = b, where A Cmm but the right-hand
side has been perturbed by a small vector so that it becomes b + b. (Notice how that touches the b. This
is meant to convey that this is a symbol that represents a vector rather than the vector b that is multiplied
by a scalar .) The question now is how a relative error in b propogates into a potential error in the solution
x.
This is summarized as follows:
Ax = b
Exact equation
A(x + x) = b + b
Perturbed equation
We would like to determine a formula, (A, b, b), that tells us how much a relative error in b is potentially
amplified into an error in the solution b:
kbk
kxk
(A, b, b)
.
kxk
kbk
We will assume that A has an inverse. To find an expression for (A, b, b), we notice that
Ax + Ax = b + b
= b
Ax
Ax =
1
kAk kbk
kxk
kbk
kAkkA1 k
.
kxk
kbk
Thus, the desired expression (A, b, b) doesnt depend on anything but the matrix A:
kbk
kxk
kAkkA1 k
.
| {z } kbk
kxk
(A)
33
b
kA1 xk kA1 bk
=
,
b
kxk6=0 kxk
kbk
kA1 k = max
2.5
Equivalence of Norms
Many results we encounter show that the norm of a particular vector or matrix is small. Obviously, it
would be unfortunate if a vector or matrix is large in one norm and small in another norm. The following
result shows that, modulo a constant, all norms are equivalent. Thus, if the vector is small in one norm, it
is small in other norms as well.
Theorem 2.29 Let k k : Cn R and k k : Cn R be vector norms. Then there exist constants ,
and , such that for all x Cn
, kxk kxk , kxk .
A similar result holds for matrix norms:
Theorem 2.30 Let k k : Cmn R and k k : Cmn R be matrix norms. Then there exist constants
, and , such that for all A Cmn
, kAk kAk , kAk .
34
Chapter
Video
Read disclaimer regarding the videos in the preface!
* YouTube Part 1
* YouTube Part 2
* View Part 1 After Local Download * View Part 2 After Local Download
(For help on viewing, see Appendix A.)
35
36
Outline
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
35
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
36
37
39
3.3.
41
3.4.
Geometric Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
41
45
3.6.
49
3.7.
50
3.8. An Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
51
54
55
3.1
37
1 if i = j
qH
q
=
.
i j
0 otherwise
Notice that for n vectors of length m to be mutually orthonormal, n must be less than or equal to
m. This is because n mutually orthonormal vectors are linearly independent and there can be at most m
linearly independent vectors of length m.
Definition 3.3 Let Q Cmn (with n m). Then Q is said to be an orthonormal matrix if QH Q = I (the
identity).
mn
Homework 3.4 Let Q C
(with n m). Partition Q = q0 q1 qn1 . Show that Q is an
orthonormal matrix if and only if q0 , q1 , . . . , qn1 are mutually orthonormal.
* SEE ANSWER
Definition 3.5 Let Q Cmm . Then Q is said to be a unitary matrix if QH Q = I (the identity).
Notice that unitary matrices are always square and only square matrices can be unitary. Sometimes the
term orthogonal matrix is used instead of unitary matrix, especially if the matrix is real valued.
Homework 3.6 Let Q Cmm . Show that if Q is unitary then Q1 = QH and QQH = I.
* SEE ANSWER
Homework 3.7 Let Q0 , Q1 Cmm both be unitary. Show that their product, Q0 Q1 , is unitary.
* SEE ANSWER
Homework 3.8 Let Q0 , Q1 , . . . , Qk1 Cmm all be unitary. Show that their product, Q0 Q1 Qk1 , is
unitary.
* SEE ANSWER
The following is a very important observation: Let Q be a unitary matrix with
Q = q0 q1 qm1 .
Let x Cm . Then
H
x = QQ x =
q0 q1
q0 q1 qm1
qH
0
qH
1
qm1 .
..
qH
m1
q0 q1 qm1
H
q0 q1 qm1
qH
0x
qH x
1
..
qH
m1 x
38
H
H
= (qH
0 x)q0 + (q1 x)q1 + + (qm1 x)qm1 .
The vector x = . gives the coefficients when the vector x is written as a linear combination
..
m1
of the unit basis vectors:
x = 0 e0 + 1 e1 + + m1 em1 .
The vector
qH
0x
qH x
1
QH x =
..
qH
m1 x
gives the coefficients when the vector x is written as a linear combination of the orthonormal vectors
q0 , q1 , . . . , qm1 :
H
H
x = (qH
0 x)q0 + (q1 x)q1 + + (qm1 x)qm1 .
The vector (qH
i x)qi equals the component of x in the direction of vector qi .
Another way of looking at this is that if q0 , q1 , . . . , qm1 is an orthonormal basis for Cm , then any x Cm
can be written as a linear combination of these vectors:
x = 0 q0 + 1 q1 + + m1 qm1 .
Now,
H
qH
i x = qi (0 q0 + 1 q1 + + i1 qi1 + i qi + i+1 qi+1 + + m1 qm1 )
= 0 q H
q + 1 qH
q + + i1 qH
q
| i{z }0
| i{z }1
| i {zi1}
0
0
0
H
H
+ i qH
i qi + i+1 qi qi+1 + + m1 qi qm1
|{z}
| {z }
| {z }
1
0
0
= i .
Thus qH
i x = i , the coefficient that multiplies qi .
39
3.2
DT L
0
We will prove this for m n, leaving the case where m n as an exercise, employing a proof by
induction on n.
Base case: n = 1. In this case A = a0 where a0 Rm is its only column. By assumption,
a0 6= 0. Then
H
A = a0 = u0 (ka0 k2 ) 1
where u0 = a0 /ka0 k2 . Choose U1 Cm(m1) so that U = u0 U1 is unitary. Then
A=
a0
where DT L =
0
u0
(ka0 k2 )
ka0 k2
H
and V =
u0 U1
ka0 k2
0
H
= UDV H
1 .
Inductive step: Assume the result is true for all matrices with 1 k < n columns. Show that it is
true for matrices with n columns.
Let A Cmn with n 2. W.l.o.g., A 6= 0 so that kAk2 6= 0. Let 0 and v0 Cn have the property that kv0 k2 = 1 and 0 = kAv0 k2 = kAk2 . (In other words, v0 is the vector that maximizes
40
maxkxk2 =1 kAxk2 .) Let u0 = Av0 /0 . Note that ku0 k2 = 1. Choose U1 Cm(m1) and V1 Cn(n1)
so that U = u0 U1 and V = v0 V1 are unitary. Then
H
A
u0 U1
v0 V1
H
H
H
H
H
u u u0 AV1
w
u Av0 u0 AV1
= 0 0 0
= 0
,
= 0
U1H Av0 U1H AV1
U1H u0 U1H AV1
0 B
U H AV =
where w = V1H AH u0 and B = U1H AV1 . Now, we will argue that w = 0, the zero vector of appropriate
size:
2
0 wH
x
0 B
kU H AV xk22
2
2
H
2
2
0 = kAk2 = kU AV k2 = max
=
max
2
2
x6=0
x6=0
kxk2
kxk2
2
2
H
2
H
0 w
0 + w w
0
0 B
w
Bw
2
2
=
2
2
0
0
w
w
2
(20 + wH w)2
= 20 + wH w.
20 + wH w
0 0
0
(n1)(n1) , and
U = U
1 0
0 U
,V = V
1 0
0 V
, and D =
(There are some really tough to see checks in the definition of U, V , and D!!) Then A = UDV H
where U, V , and D have the desired properties.
By the Principle of Mathematical Induction the result holds for all matrices A Cmn with m n.
Homework 3.13 Let D = diag(0 , . . . , n1 ). Show that kDk2 = maxn1
i=0 |i |.
* SEE ANSWER
AT
0
41
Homework 3.15 Assume that U Cmm and V Cnn be unitary matrices. Let A, B Cmn with B =
UAV H . Show that the singular values of A equal the singular values of B.
* SEE ANSWER
0 0
and assume that kAk2 = 0 . Show that kBk2
Homework 3.16 Let A Cmn with A =
0 B
kAk2 . (Hint: Use the SVD of B.)
* SEE ANSWER
Homework 3.17 Prove Lemma 3.12 for m n.
* SEE ANSWER
3.3
T L
with T L = diag(0 , , r1 )
0 0
and 0 1 r1 > 0. The 0 , . . . , r1 are known as the singular values of A.
Proof: Notice that the proof of the above theorem is identical to that of Lemma 3.12. However, thanks
to the above exercises, we can conclude that kBk2 0 in the proof, which then can be used to show that
the singular values are found in order.
Proof: (Alternative) An alternative proof uses Lemma 3.12 to conclude that A = UDV H . If the entries on
the diagonal of D are not ordered from largest to smallest, then this can be fixed by permuting the rows
and columns of D, and correspondingly permuting the columns of U and V .
3.4
Geometric Interpretation
We will now quickly illustrate what the SVD Theorem tells us about matrix-vector multiplication (linear
transformations) by examining the case where A R22 . Let A = UV T be its SVD. (Notice that all
matrices are now real valued, and hence V H = V T .) Partition
0 0
T
.
A = u0 u1
v0 v1
0 1
Since U and V are unitary matrices, {u0 , u1 } and {v0 , v1 } form orthonormal bases for the range and domain
of A, respectively:
R2 : Domain of A.
42
R2 : Range (codomain) of A.
0 0
T
0 0
T
v0 v1
v0 v1
A =
= u0 u1
u0 u1
0 1
0 1
T
=
.
0 u0 1 u1
v0 v1
Now let us look at how A transforms v0 and v1 :
Av0 =
0 u0 1 u1
v0 v1
T
v0 =
0 u 0 1 u 1
1
0
= 0 u 0
R2 : Range (codomain) of A.
43
Now let us look at how A transforms any vector with (Euclidean) unit length. Notice that x =
means that
x = 0 e0 + 1 e1 ,
where e0 and e1 are the unit basis vectors. Thus, 0 and 1 are the coefficients when x is expressed using
e0 and e1 as basis. However, we can also express x in the basis given by v0 and v1 :
T
vT x
T
x = VV
x = v0 v1 0
v0 v1
|{z} x = v0 v1
vT1 x
I
0
.
= vT0 x v0 + vT1 x v1 = 0 v0 + 0 v1 = v0 v1
|{z}
|{z}
1
1
0
Thus, in the basis formed by v0 and v1 , its coefficients are 0 and 1 . Now,
Ax =
0 u0 1 u1
v0 v1
T
x=
0 u0 1 u1
v0 v1
T
v0 v1
0 u0 1 u1
= 0 0 u 0 + 1 1 u 1 .
1
This is illustrated by the following picture, which also captures the fact that the unit ball is mapped to an
ellipse1 with major axis equal to 0 = kAk2 and minor axis equal to 1 :
R2 : Domain of A.
1 It
R2 : Range (codomain) of A.
is not clear that it is actually an ellipse and this is not important to our observations.
44
Finally, we show the same insights for general vector x (not necessarily of unit length).
R2 : Domain of A.
R2 : Range (codomain) of A.
Another observation is that if one picks the right basis for the domain and codomain, then the computation Ax simplifies to a matrix multiplication with a diagonal matrix. Let us again illustrate this for
nonsingular A R22 with
T
0 0
.
A=
u0 u1
v0 v1
0 1
| {z }
| {z }
|
{z
}
U
V
Now, if we chose to express y using u0 and u1 as the basis and express x using v0 and v1 as the basis, then
T
T
T
0
yb = UU
| {z } y = (u0 y)u0 + (u1 y)u1 =
b1
T
T
T
0 .
xb = VV
|{z} x = (v0 x)v0 + (v1 x)v1 ==
b
1 .
I
If y = Ax then
T
U U T y = UV
x
| {z }x = Ub
|{z}
Ax
yb
so that yb = b
x and
b0
b 1.
0 b
0
1 b
1 .
3.5
45
T L 0
where T L = diag(0 , . . . , r1 ) with 0 1 . . . r1 > 0.
=
0 0
U = UL UR with UL Cmr .
V=
VL VR
with VL Cnr .
We first generalize the observations we made for A R22 . Let us track what the effect of Ax = UV H x
is on vector x. We assume that m n.
Let U = u0 um1 and V = v0 vn1 .
Let
x = VV H x =
v0 vn1
v0 vn1
H
x=
v0 vn1
vH
x
0
..
.
H
vn1 x
H
= vH
0 xv0 + + vn1 xvn1 .
This can be interpreted as follows: vector x can be written in terms of the usual basis of Cn as 0 e0 +
H
+ 1 en1 or in the orthonormal basis formed by the columns of V as vH
0 xv0 + + vn1 xvn1 .
H
H
H
Notice that Ax = A(vH
0 xv0 + + vn1 xvn1 ) = v0 xAv0 + + vn1 xAvn1 so that we next look at
how A transforms each vi : Avi = UV H vi = Uei = iUei = i ui .
A = UV H =
UL UR
T L
0
0
VL VR
H
= UL T LVLH .
and VL =
46
v0 vr1
Then
H
H
A = 0 u0 vH
1 u1 v1 + + r1 ur1 vr1 .
0 +
|
{z
}
| {z }
|
{z
}
1
0
r1
(Each term nonzero and an outer product, and hence a rank-1 matrix.)
Proof: We leave the proof as an exercise.
Corollary 3.21 C (A) = C (UL ).
Proof:
Let y C (A). Then there exists x Cn such that y = Ax (by the definition of y C (A)). But then
y = Ax = UL T LVLH x = UL z,
| {z }
z
i.e., there exists z Cr such that y = UL z. This means y C (UL ).
Let y C (UL ). Then there exists z Cr such that y = UL z. But then
y = UL z = UL T L 1
z = UL T L VLH VL 1
z = A VL 1
z = Ax
T
L
T
L
| {z }
| {z }
| {zT L}
I
I
x
so that there exists x Cn such that y = Ax, i.e., y C (A).
Corollary 3.22 Let A = UL T LVLH be the reduced SVD of A where UL and VL have r columns. Then the
rank of A is r.
Proof: The rank of A equals the dimension of C (A) = C (UL ). But the dimension of C (UL ) is clearly r.
Corollary 3.23 N (A) = C (VR ).
Proof:
Let x N (A). Then
x =
H
VV
| {z } x =
I
VL VR
VL VR
VLH x
VRH x
VL VR
H
x=
VL VR
VLH
VRH
= VLVLH x +VRVRH x.
If we can show that VLH x = 0 then x = VR z where z = VRH x. Assume that VLH x 6= 0. Then T L (VLH x) 6=
0 (since T L is nonsingular) and UL (T L (VLH x)) 6= 0 (since UL has linearly independent columns).
But that contradicts the fact that Ax = UL T LVLH x = 0.
47
VL VR
VL VR
H
= A VLVLH x +VRVRH x = AVLVLH x + AVRVRH x
= AVLVLH x +UL T L VLH VR VRH x = A VLVLH x .
| {z }
| {z }
0
z
Alternative proof (which uses the last corollary):
Ax = A VLVLH x +VRVRH x = AVLVLH x + A VRVRH x = A VLVLH x .
| {z }
| {z }
N (A)
z
The proof of the last corollary also shows that
Corollary 3.25 Any vector x Cn can be written as x = z + xn where z C (VL ) and xn N (A) = C (VR ).
Corollary 3.26 AH = VL T LULH so that C (AH ) = C (VL ) and N (AH ) = C (UR ).
The above corollaries are summarized in Figure 3.1.
Theorem 3.27 Let A Cnn be nonsingular. Let A = UV H be its SVD. Then
1. The SVD is the reduced SVD.
2. n1 6= 0.
3. If
U=
u0 un1
, = diag(0 , . . . , n1 ), and V =
v0 vn1
then
A
1 T
T H
= (V P )(P P )(UP ) =
where P =
vn1 v0
1
diag(
, . . . , ) un1 u0 ,
n1
0
0 0 1
0
..
.
is the permutation matrix such that Px reverses the order of the entries
1
..
.
0
..
.
1 0 0
in x. (Note: for this permutation matrix, PT = P. In general, this is not the case. What is the case
for all permutation matrices P is that PT P = PPT = I.)
Cn
Cm
dim r
C (AH ) = C (VL )
48
C (A) = C (UL )
y = Az
dim r
x = z + xn
y
y = Ax
= UL T LVLH x
xn
N (A) = C (VR )
N (AH ) = C (UR )
dim n r
dim m r
Since VL is unitary and T L is diagonal with nonzero diagonal entries, tey are both nonsingular. Thus
VL 2T LVLH
1 H
VL 2T L
VL ) = I.
3.6
49
Definition 3.29 Let UL Cmk have orthonormal columns. The projection of a vector y Cm onto C (UL )
is the vector UL x that minimizes ky UL xk2 , where x Ck . We will also call this vector y the component
of x in C (UL ).
Theorem 3.30 Let UL Cmk have orthonormal columns. The projection of y onto C (UL ) is given by
ULULH y.
Proof: The vector UL x that we want must satisfy
kUL x yk2 = min kUL w yk2 .
wCk
UL UR
H
2
= min
U H (UL w y)
2 (since the two norm is preserved)
wCk
2
H
= min
UL UR
(UL w y)
wCk
min
wCk
min
wCk
min
wCk
min
wCk
min
wCk
2
H
UL
(UL w y)
URH
2
2
H
H
UL
U
UL w L y
URH
URH
2
2
UHy
ULH UL w
L
URH UL w
URH y
2
2
UHy
w
L
0
URH y
2
2
w ULH y
H
UR y
2
2
= min
w ULH y
2 +
URH y
2
wCk
2
u
= kuk2 + kvk2 )
(since
2
2
v
2
=
2
H
2
min w UL y 2 +
URH y
2 .
wCk
This is minimized when w = ULH y. Thus, the vector that is closest to y in the space spanned by UL is given
by x = ULULH y.
50
Corollary 3.31 Let A Cmn and A = UL T LVLH be its reduced SVD. Then the projection of y Cm onto
C (A) is given by ULULH y.
Proof: This follows immediately from the fact that C (A) = C (UL ).
Corollary 3.32 Let A Cmn have linearly independent columns. Then the projection of y Cm onto
C (A) is given by A(AH A)1 AH y.
1 H
Proof: From Corrolary 3.28, we know that AH A is nonsingular and that (AH A)1 = VL 2T L
VL . Now,
A(AH A)1 AH y = (UL T LVLH )(VL 2T L
1
1 V H V U H y = ULULH y.
= UL T L VLH VL 1
| {z } T L T L | L{z L} T L L
I
I
Hence the projection of y onto C (A) is given by A(AH A)1 AH y.
Definition 3.33 Let A have linearly independent columns. Then (AH A)1 AH is called the pseudo-inverse
or Moore-Penrose generalized inverse of matrix A.
3.7
Theorem 3.34 Let A Cmn have SVD A = UV H and assume A has rank r. Partition
0
T L
,
U = UL UR , V = VL VR , and =
0 BR
where UL Cmk , VL Cnk , and T L Rkk with k r. Then B = UL T LVLH is the matrix in Cmn
closest to A in the following sense:
kA Bk2 =
min
C Cmn
kA Ck2 = k .
rank(C) k
Proof: First, if B is as defined, then clearly kA Bk2 = k :
kA Bk2 = kU H (A B)V k2 = kU H AV U H BV k2
H
T L
=
B VL VR
UL UR
=
2
0
0 0
= kBR k2 = k
=
0 BR
0
BR
T L
0
0
0
Next, assume that C has rank t k and kA Ck2 < kA Bk2 . We will show that this leads to a
contradiction.
3.8. An Application
51
The null space of C has dimension at least n k since dim(N (C)) + rank(C) = n.
If x N (C) then
kAxk2 = k(A C)xk2 kA Ck2 kxk2 < k kxk2 .
Partition U = u0 um1 and V = v0 vn1 . Then kAv j k2 = k j u j k2 = j s
for j = 0, . . . , k. Now, let x be any linear combination of v0 , . . . , vk : x = 0 v0 + + k vk . Notice
that
kxk22 = k0 v0 + + k vk k22 |0 |2 + |k |2 .
Then
kAxk22 = kA(0 v0 + + k vk )k22 = k0 Av0 + + k Avk k22
= k0 0 u0 + + k k uk k22 = k0 0 u0 k22 + + kk k uk k22
= |0 |2 20 + + |k |2 2k (|0 |2 + + |k |2 )2k
so that kAxk2 k kxk2 . In other words, vectors in the subspace of all linear combinations of
{v0 , . . . , vk } satisfy kAxk2 k kxk2 . The dimension of this subspace is k + 1 (since {v0 , , vk }
form an orthonormal basis).
Both these subspaces are subspaces of Cn . Since their dimensions add up to more than n there must
be at least one nonzero vector z that satisfies both kAzk2 < k kzk2 and kAzk2 k kzk2 , which is a
contradiction.
The above theorem tells us how to pick the best approximation with given rank to a given matrix.
3.8
An Application
Let Y Rmn be a matrix that, for example, stores a picture. In this case, the (i, j) entry in Y is, for
example, a number that represents the grayscale value of pixel (i, j). The following instructions, executed
in octave or matlab, generate the picture of Mexican artist Frida Kahlo in Figure 3.2(top-left). The file
FridaPNG.png can be found at http://www.cs.utexas.edu/users/flame/Notes/FridaPNG.png.
octave> IMG = imread( FridaPNG.png ); % this reads the image
octave> Y = IMG( :,:,1 );
octave> imshow( Y )
% this dispays the image
Although the picture is black and white, it was read as if it is a color image, which means a m n 3 array
of pixel information is stored. Setting Y = IMG( :,:,1 ) extracts a single matrix of pixel information.
(If you start with a color picture, you will want to approximate IMG( :,:,1), IMG( :,:,2), and IMG(
:,:,3) separately.)
Now, let Y = UV T be the SVD of matrix Y . Partition, conformally,
T L
0
,
U = UL UR , V = VL VR , and =
0 BR
Original picture
k=1
k=2
k=5
k = 10
k = 25
52
3.8. An Application
53
T L
0
Y =
UL UR
0 BR
T L
0
=
UL UR
0 BR
T LV T
L
=
UL UR
BRVRT
VL VR
VLT
VRT
T
>>
>>
>>
>>
>>
UL = U(
VL = V(
SigmaTL
Yapprox
imshow(
54
:, 1:k );
% first k columns
:, 1:k );
% first k columns
= Sigma( 1:k, 1:k ); % TL submatrix of Sigma
= uint8( UL * SigmaTL * VL );
Yapprox );
As one increases k, the approximation gets better, as illustrated in Figure 3.2. The graph in Figure 3.3
helps explain. The original matrix Y is 387 469, with 181, 503 entries. When k = 10, matrices U, V , and
are 387 10, 469 10 and 10 10, respectively, requiring only 8, 660 entries to be stores.
3.9
R2 : Range (codomain) of A.
In this case, the ratio 0 /n1 represents the ratio between the major and minor axes of the ellipse on
the right.
3.10
55
It would seem that the proof of the existence of the SVD is constructive in the sense that it provides an
algorithm for computing the SVD of a given matrix A Cmm . Not so fast! Observe that
Computing kAk2 is nontrivial.
Computing the vector that maximizes maxkxk2 =1 kAxk2 is nontrivial.
Given a vector q0 computing vectors q0 , . . . , qm1 is expensive (as we will see when we discuss the
QR factorization).
Towards the end of the course we will discuss algorithms for computing the eigenvalues and eigenvectors
of a matrix, and related algorithms for computing the SVD.
56
Chapter
Video
Read disclaimer regarding the videos in the preface!
* YouTube
* Download from UT Box
* View After Local Download
(For help on viewing, see Appendix A.)
57
58
Outline
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
57
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
58
4.1.
59
4.2.
64
68
4.4.
Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
70
71
72
4.1
59
Given a set of linearly independent vectors {a0 , . . . , an1 } Cm , the Gram-Schmidt process computes an
orthonormal basis {q0 , . . . , qn1 } that span the same subspace, i.e.
Span({a0 , . . . , an1 }) = Span({q0 , . . . , qn1 }).
The process proceeds as described in Figure 4.1 and in the algorithms in Figure 4.2.
Homework 4.1
What happens in the Gram-Schmidt algorithm if the columns of A are NOT linearly
independent?
How might one fix this?
How can the Gram-Schmidt algorithm be used to identify which columns of A are linearly independent?
* SEE ANSWER
Homework 4.2 Homework 4.3 Convince yourself that the relation between the vectors {a j } and {q j } in
the algorithms in Figure 4.2 is given by
0 1,1 1,n1
,
a0 a1 an1 = q0 q1 qn1 .
.
.
.
..
..
..
..
0
0 n1,n1
where
1 for i = j
qH
q
=
i j
0 otherwise
and
i, j =
qi a j
for i < j
j1
ka j i=0 i, j qi k2 for i = j
otherwise.
* SEE ANSWER
Thus, this relationship between the linearly independent vectors {a j } and the orthonormal vectors {q j }
can be concisely stated as
A = QR,
where A and Q are m n matrices (m n), QH Q = I, and R is an n n upper triangular matrix.
Theorem 4.4 Let A have linearly independent columns, A = QR where A, Q Cmn with n m, R Cnn ,
QH Q = I, and R is an upper triangular matrix with nonzero diagonal entries. Then, for 0 < k < n, the first
k columns of A span the same space as the first k columns of Q.
Proof: Partition
AL
AR
QL
QR
and R
RT L
RT R
RBR
where AL , QL Cmk and RT L Ckk . Then RT L is nonsingular (since it is upper triangular and has no
zero on its diagonal), QH
L QL = I, and AL = QL RT L . We want to show that C (AL ) = C (QL ):
Steps
60
Comment
0,0 := ka0 k2
q0 =: a0 /0,0
0,1 = qH
0 a1
a
1 = a1 0,1 q0
Compute a
1 , the component of vector a1 orthogonal to q0 .
1,1 = ka
1 k2
q1 = a
1 /1,1
Set q1 = a
1 /1,1 , creating a unit vector in the direction of
a
.
1
Now, q0 and q1 are mutually orthonormal
Span({a0 , a1 }) = Span({q0 , q1 }). (Why?)
and
0,2 = qH
0 a2
1,2 = qH
1 a2
a
2 = a2 0,2 q0 1,2 q1 Compute a2 , the component of vector a2 orthogonal to q0
and q1 .
= ka k
2,2
2 2
q2 = a2 /2,2
a2 .
Now, {q0 , q1 , q2 } is an orthonormal basis
Span({a0 , a1 , a2 }) = Span({q0 , q1 , q2 }). (Why?)
And so forth.
Figure 4.1: Gram-Schmidt orthogonalization.
and
for j = 0, . . . , n 1
for j = 0, . . . , n 1
for k = 0, . . . , j 1
k, j := qH
k aj
end
aj := a j
aj := a j
for k = 0, . . . , j 1
for k = 0, . . . , j 1
k, j := qH
a
k j
aj := aj k, j qk
aj := aj k, j qk
end
end
j, j := ka j k2
j, j := kaj k2
q j := a j / j, j
q j := aj / j, j
end
end
61
for
j = 0, . . .
, n 1
0, j
a
qH
0 j
H
.. ..
aj
. := . = q0 q j1
j1, j
qHj1 a j
0, j
.
aj := a j q0 q j1 ..
j1, j
j, j := kaj k2
q j := aj / j, j
end
RT L RT R
R
0
RBR
where AL and QL has 0 columns and RT L is 0 0
while n(AL ) 6= n(A) do
Repartition
AL AR A0 a1 A2 ,
QL QR Q0 q1 Q2 ,
R
RT R
T
TL
0
11 r12
0
RBR
0
0 R22
r01 := QT0 a1
a
1 := a1 Q0 r01
11 := ka
1 k2
q1 := a
1 /11
Continue with
AL AR A0 a1 A2 ,
QL QR Q0 q1 Q2 ,
RT R
R
TL
0 11 rT
12
0
RBR
0
0
R22
62
for j = 0, . . . , n 1
0, j
H
.
.. := q0 q j1
aj
|{z}
|
{z
}
j1, j
H
a1
Q0
|
{z
}
r01
0, j
...
aj := a j q0 q j1
|{z}
|{z}
|
{z
}
j1, j
a1
a
Q0
1
|
{z
r01
j, j := kaj k2
q j := aj / j, j
end
(11 := ka
1 k2 )
(q1 := a1 /11 )
endwhile
Figure 4.3: (Classical) Gram-Schmidt algorithm for computing the QR factorization of a matrix A.
compute r01 = QH
0 a1 and a1 = a1 Q0 r01 , the component of a1 orthogonal to C (Q0 ). Because the
Q0 q1
R00 r01
0
Q0 R00 Q0 r01 + q1 11
11
A0 Q0 r01 + a
1
A0 a1
= A.
Hence Q =
Q0 q1
and R =
63
R00 r01
0
11
By the Principle of Mathematical Induction the result holds for all matrices A Cmn with m n.
The proof motivates the algorithm in Figure 4.3 (left) in FLAME notation1 .
An alternative for motivating that algorithm is as follows: Consider A = QR. Partition A, Q, and R to
yield
A0 a1 A2
Q0 q1
r
R02
R
00 01
T
Q2 0 11 r12
0
0 R22
Assume that Q0 and R00 have already been computed. Since corresponding columns of both sides must be
equal, we find that
a1 = Q0 r01 + q1 11 .
(4.1)
H
H
Also, QH
0 Q0 = I and Q0 q1 = 0, since the columns of Q are mutually orthonormal. Hence Q0 a1 =
H
H
Q0 Q0 r01 + Q0 q1 11 = r01 . This shows how r01 can be computed from Q0 and a1 , which are already
known. Next, a
1 = a1 Q0 r01 is computed from (4.1). This is the component of a1 that is perpendicular
to the columns of Q0 . We know it is nonzero since the columns of A are linearly independent. Since
11 q1 = a
1 and we know that q1 has unit length, we now compute 11 = ka1 k2 and q1 = a1 /11 , which
completes a derivation of the algorithm in Figure 4.3.
Homework 4.6 Homework 4.7 Let A have linearly independent columns and let A = QR be a QR factorization of A. Partition
RT L RT R
,
A AL AR , Q QL QR , and R
0
RBR
where AL and QL have k columns and RT L is k k. Show that
1. AL = QL RT L : QL RT L equals the QR factorization of AL ,
2. C (AL ) = C (QL ): the first k columns of Q form an orthonormal basis for the space spanned by the
first k columns of A.
3. RT R = QH
L AR ,
4. (AR QL RT R )H QL = 0,
5. AR QL RT R = QR RBR , and
6. C (AR QL RT R ) = C (QR ).
* SEE ANSWER
1
The FLAME notation should be intuitively obvious. If it is not, you may want to review the earlier weeks in Linear
Algebra: Foundations to Frontiers - Notes to LAFF With.
64
y = y
y = y
for i = 0, . . . , k 1
for i = 0, . . . , k 1
i := qH
i y
y := y i qi
endfor
i := qH
i y
y := y i qi
endfor
Figure 4.4: Two different ways of computing y = (I QQH )y, the component of y orthogonal to C (Q),
where Q has k orthonormal columns.
4.2
We start by considering the following problem: Given y Cm and Q Cmk with orthonormal columns,
compute y , the component of y orthogonal to the columns of Q. This is a key step in the Gram-Schmidt
process in Figure 4.3.
Recall that if A has linearly independent columns, then A(AH A)1 AH y equals the projection of y onto
the columns space of A (i.e., the component of y in C (A) ) and y A(AH A)1 AH y = (I A(AH A)1 AH )y
equals the component of y orthogonal to C (A). If Q has orthonormal columns, then QH Q = I and hence
QQH y equals the projection of y onto the columns space of Q (i.e., the component of y in C (Q) ) and
y QQH y = (I QQH )y equals the component of y orthogonal to C (A).
Thus, mathematically, the solution to the stated problem is given by
y = (I QQH )y = y QQH y
H
y
= y q0 qk1
q0 qk1
qH
0
.
= y q0 qk1 .. y
H
qk1
H
q y
0
.
= y q0 qk1 ..
H
qk1 y
H
= y (q0 y)q0 + + (qH
k1 y)qk1
H
= y (qH
0 y)q0 (qk1 y)qk1 .
This can be computed by the algorithm in Figure 4.4 (left) and is used by what is often called the Classical
Gram-Schmidt (CGS) algorithm given in Figure 4.3.
An alternative algorithm for computing y is given in Figure 4.4 (right) and is used by the Modified
Gram-Schmidt (MGS) algorithm also given in Figure 4.5. This approach is mathematically equivalent to
65
RT L RT R
Partition A AL AR , R
0
RBR
whereAL has 0 columns and RT L is 0 0
while n(AL ) 6= n(A) do
Repartition
AL
AR
A0 a1 A2
RT L
RT R
RBR
R00
r01 R02
11
T
r12
R22
MGS
MGS (alternative)
r01 := AH
0 a1
a1 := a1 A0 r01
11 := ka1 k2
11 := ka1 k2
11 := ka1 k2
a1 := a1 /11
q1 := a1 /11
a1 := a1 /11
T := aH A
r12
1 2
T
A2 := A2 a1 r12
Continue with
AL
AR
A0 a1 A2
RT L
RT R
RBR
R
r
R02
00 01
0 11 rT
12
0
0 R22
endwhile
Figure 4.5: Left: Classical Gram-Schmidt algorithm. Middle: Modified Gram-Schmidt algorithm. Right:
Modified Gram-Schmidt algorithm where every time a new column of Q, q1 is computed the component
of all future columns in the direction of this new vector are subtracted out.
the algorithm to its left for the following reason:
The algorithm on the left in Figure 4.4 computes
H
y := y (qH
0 y)q0 (qk1 y)qk1
for j = 0, . . . , n 1
aj := a j
for k = 0, . . . , j 1
k, j := qH
k aj
aj := aj k, j qk
end
j, j := kaj k2
q j := aj / j, j
end
for j = 0, . . . , n 1
for k = 0, . . . , j 1
k, j := aH
k aj
a j := a j k, j ak
end
j, j := ka j k2
a j := a j / j, j
end
(a) MGS algorithm that computes Q and R from (b) MGS algorithm that computes Q and R from
A.
A, overwriting A with Q.
for j = 0, . . . , n 1
j, j := ka j k2
a j := a j / j, j
for k = j + 1, . . . , n 1
j,k := aHj ak
for j = 0, . . . , n 1
j, j := ka j k2
a j := a j / j, j
for k = j + 1, . . . , n 1
j,k := aHj ak
end
for k = j + 1, . . . , n 1
ak := ak j,k a j
end
end
ak := ak j, j a j
end
end
(c) MGS algorithm that normalizes the jth col- (d) Slight modification of the algorithm in (c) that
umn to have unit length to compute q j (overwrit- computes j,k in a separate loop.
ing a j with the result) and then subtracts the component in the direction of q j off the rest of the
columns (a j+1 , . . . , an1 ).
for j = 0, . . . , n 1
j, j := ka j k2
a j := a j / j, j
for j = 0, . . . , n 1
j, j := ka j k2
aj := a j / j, j
j, j+1 j,n1 := aHj a j+1 aHj an1
a j+1 an1 :=
a j+1 j, j+1 a j an1 j,n1 a j
end
j, j+1 j,n1 := aHj a j+1 an1
a j+1 an1 := a j+1 an1
a j j, j+1 j,n1
end
a
a j+1 an1
j j, j+1 j,n1 .
66
RT L RT R
R
0
RBR
where AL and QL has 0 columns and RT L is 0 0
while n(AL ) 6= n(A) do
Repartition
AL AR A0 a1 A2 ,
RT L RT R
T
0
11 r12
0
RBR
0
0 R22
67
for j = 0, . . . , n 1
j, j := ka j k2
a j := a j / j, j
(11 := ka
1 k2 )
(a1 := a1 /11 )
T
r12
z
}|
{
:=
j, j+1 j,n1
aHj
a j+1 an1
|{z} |
{z
}
aH
A2
1
11 := ka1 k2
a1 := a1 /11
T := aH A
r12
1 2
z
T
A2 := A2 a1 r12
Continue with
AL AR A0 a1 A2 ,
RT L RT R
T
0 11 r12
0
RBR
0
0
R22
a j+1
A2
A2
}|
z
}|
{
{
:=
an1
a j+1 an1
aj
j, j+1 j,n1
|{z} |
{z
}
T
a1
r12
end
endwhile
Figure 4.7: Modified Gram-Schmidt algorithm for computing the QR factorization of a matrix A.
off the vector y that already contains
H
y = y (qH
0 y)q0 (qi1 y)qi1 ,
leaving us with
H
H
y = y (qH
0 y)q0 (qi1 y)qi1 (qi y)qi .
= qH
i y.
68
H
i := qH
i y = qi y,
4.3
In theory, all Gram-Schmidt algorithms discussed in the previous sections are equivalent: they compute
the exact same QR factorizations. In practice, in the presense of round-off error, MGS is more accurate
than CGS. We will (hopefully) get into detail about this later, but for now we will illustrate it with a classic
example.
When storing real (or complex for that matter) valued numbers in a computer, a limited accuracy can
be maintained, leading to round-off error when a number is stored and/or when computation with numbers
are performed. The machine epsilon or unit roundoff error is defined as the largest positive number mach
such that the stored value of 1 + mach is rounded to 1. Now, let us consider a computer where the only
error that is ever incurred is when 1 + mach is computed and rounded to 1. Let = mach and consider
the matrix
1 1 1
0 0
A=
=
(4.2)
a0 a1 a2
0 0
0 0
In Figure 4.8 (left) we execute the CGS algorithm. It yields the approximate matrix
2 2
2
2
Q
2
0
0
2
2
0
0
2
If we now ask the question Are the columns of Q orthonormal? we can check this by computing QH Q,
which should equal I, the identity. But
2 2
2
2
QH Q =
2
0
0
2
2
0
0
2
2 2
2
2
2
0
0
2
2
0
0
2
2
2
1
+
mach
2
2
1
= 22
1
2
2
1
2
1
2
69
First iteration
First iteration
0,0 = ka0 k2 =
0,0 = ka0 k2 =
p
1 + 2 = 1 + mach
which is rounded to 1.
q0 = a0 /0,0 = /1 =
which is rounded to 1.
q0 = a0 /0,0 = /1 =
Second iteration
Second iteration
0,1 = qH
0 a1
p
p
1 + 2 = 1 + mach
0,1 = qH
0 a1 = 1
=1
a1 = a1 0,1 q0 =
2
1,1 = ka
1 k2 = 2 = 2
0
0
q1 = a
/( 2) = 22
1 /1,1 =
2
0
0
= a1 0,1 q0 =
1,1 = ka1 k2 = 22 = 2
0
0
q1 = a
/( 2) = 22
1 /1,1 =
0
0
a
1
Third iteration
Third iteration
0,2 = qH
0 a2
0,2 = qH
0 a2 = 1
=1
= a2 0,2 q0 =
H
1,2 = q1 a2 = ( 2/2)
/2
a2 = a2 1,2 q1 =
/2
2,2 = ka
(6/4)2 = ( 6/2)
2 k2 =
0
0
6
6
6
q2 = a
/
=
/(
)
=
2,2
2
6
2
6
2
2 6
6
a
2
1,2 = qH
1 a2 = 0
2 = a2 0,2 q0 1,2 q1 =
0
2
2,2 = ka
2 k2 = 2 = 2
0
0
2
q2 = a
/( 2) =
2 /2,2 =
0
0
Figure 4.8: Execution of the CGS algorith (left) and MGS algorithm (right) on the example in Eqn. (4.2).
70
Similarly, in Figure 4.8 (right) we execute the MGS algorithm. It yields the approximate matrix
1
0
0
6
2
6
2
Q
.
6
2
0
6
2 6
0
0
6
If we now ask the question Are the columns of Q orthonormal? we can check if QH Q = I. The answer:
H
1
0
0
1
0
0
6
2
1
+
mach
2
6
2 6 2 6
H
2
6
2
6
2
,
=
Q Q=
1
0
2
6
6
2
2
0
6 0
6
2
2
0
1
6
2 6
2 6
0
0
0
0
6
6
which shows that for this example MGS yields better orthogonality than does CGS. What is going on?
The answer lies with how a
2 is computed in the last step of each of the algorithms.
In the CGS algorithm, we find that
H
H
a
2 := a2 (q0 a2 )q0 (q1 a2 )q1 .
a2 is small, it has a relatively large error in the direction of q1 . This error is amplified when q2 is
computed by normalizing a
2.
In the MGS algorithm, we find that
H
a
2 := a2 (q0 a2 )q0
after which
H
H
H
H
a
2 := a2 q1 a2 q1 = [a2 (q0 a2 )q0 ] (q1 [a2 (q0 a2 )q0 ])q1 .
This time, if a2 qH
1 a2 q1 has an error in the direction of q1 , this error is subtracted out when
H
Obviously, we have argued via an example that MGS is more accurage than CGS. A more thorough
analysis is needed to explain why this is generally so. This is beyond the scope of this note.
4.4
Cost
Let us examine the cost of computing the QR factorization of an m n matrix A. We will count multiplies
and an adds as each as one floating point operation.
We start by reviewing the cost, in floating point operations (flops), of various vector-vector and matrixvector operations:
4.4. Cost
71
Name
Operation
:= xH y
2n
Axpy
y := x + y
2n
Scal
x := x
Nrm2
:= ka1 k2
2n
2mn
y := AH x + y 2mn
A := yxH + A
2mn
Now, consider the algorithms in Figure 4.5. Notice that the columns of A are of size m. During the kth
iteration (0 k < n), A0 has k columns and A2 has n k 1 columns.
4.4.1
Cost of CGS
Operation
r01 := AH
0 a1
2mk
a1 := a1 A0 r01 2mk
11 := ka1 k2
2m
a1 := a1 /11
n1
k=0 [2mk + 2mk + 2m + m]
= n1
k=0 [3m + 4mk]
= 3mn + 4m n1
k=0 k
2
3mn + 4m n2
2
(n1
k=0 k = n(n 1)/2 n /2
= 3mn + 2mn2
2mn2
4.4.2
72
Cost of MGS
Operation
11 := ka1 k2
2m
a1 := a1 /11
T := aH A
r12
1 2
2m(n k 1)
T 2m(n k 1)
A2 := A2 a1 r12
3mn + 4m 2
(Change of variable: i = n k 1)
2
(n1
i=0 i = n(n 1)/2 n /2
= 3mn + 2mn2
2mn2
Chapter
73
74
Outline
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
73
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
74
5.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
75
75
75
75
76
78
5.3.4. Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
79
80
5.1. Motivation
75
y = y
y = y
for i = 0, . . . , k 1
for i = 0, . . . , k 1
i := qH
i y
i := qH
i y
y := y i qi
y := y i qi
endfor
endfor
Figure 5.1: Two different ways of computing y = (I QQH )y, the component of y orthogonal to C (Q),
where Q has k orthonormal columns.
5.1
Motivation
In the course so far, we have frequently used the FLAME Notation to express linear algebra algorithms.
In this note we show how to translate such algorithms into code, using various QR factorization algorithms
as examples.
5.2
Install FLAME@lab
The API we will use we refer to as the FLAME@lab API, which is an API that targets the M-script
language used by Matlab and Octave (an Open Source Matlab implementation). This API is very intuitive,
and hence we will spend (almost) no time explaining it.
Download all files from http://www.cs.utexas.edu/users/flame/Notes/FLAMEatlab/ and place
them in the same directory as you will the remaining files that you will create as part of the exercises in
this document. (Unless you know how to set up paths in Matlab/Octave, in which case you can put it
whereever you please, and set the path.)
5.3
Let us start by considering the various Gram-Schmidt based QR factorization algorithms from Notes on
Gram-Schmidt QR Factorization, typeset using the FLAME Notation in Figure 5.2.
5.3.1
We wish to typeset the code so that it closely resembles the algorithms in Figure 5.2. The FLAME
notation itself uses white space to better convey the algorithms. We want to do the same for the codes
that implement the algorithms. However, typesetting that code is somewhat bothersome because of the
careful spacing that is required. For this reason, we created a webpage that creates a code skeleton.. We
call this page the Spark page:
http://www.cs.utexas.edu/users/flame/Spark/.
76
RT L RT R
Partition A AL AR , R
0
RBR
whereAL has 0 columns and RT L is 0 0
while n(AL ) 6= n(A) do
Repartition
AL
AR
A0 a1 A2
RT L
RT R
RBR
R00
r01 R02
11
T
r12
R22
MGS
MGS (alternative)
r01 := AH
0 a1
a1 := a1 A0 r01
11 := ka1 k2
11 := ka1 k2
11 := ka1 k2
a1 := a1 /11
q1 := a1 /11
a1 := a1 /11
T := aH A
r12
1 2
T
A2 := A2 a1 r12
Continue with
AL
AR
A0 a1 A2
RT L
RT R
RBR
R
r
R02
00 01
0 11 rT
12
0
0 R22
endwhile
Figure 5.2: Left: Classical Gram-Schmidt algorithm. Middle: Modified Gram-Schmidt algorithm. Right:
Modified Gram-Schmidt algorithm where every time a new column of Q, q1 is computed the component
of all future columns in the direction of this new vector are subtracted out.
When you open the link, you will get a page that looks something like the picture in Figure 5.3.
5.3.2
.
77
78
RT L RT R
R
0
RBR
where AL has 0 columns and RT L is 0 0
while n(AL ) 6= n(A) do
Repartition
AL AR A0 a1 A2 ,
RT L RT R
11 r12
0
0
RBR
0
0 R22
r01 := AH
0 a1
a1 := a1 A0 r01
11 := ka1 k2
a1 := a1 /11
Continue with
AL AR A0 a1 A2 ,
RT L RT R
0 11 rT
12
0
RBR
0
0
R22
endwhile
Figure 5.4: Left: Classical Gram-Schmidt algorithm. Right: Generated code-skeleton for CGS.
5.3.3
At this point, one should copy the code skeleton into ones favorite text editor. (We highly recommend
emacs for the serious programmer.) Once this is done, there are two things left to do:
Fix the code skeleton: The Spark webpage guesses the code skeleton. One detail that it sometimes
gets wrong is the stopping criteria. In this case, the algorithm should stay in the loop as long
as n(AL ) 6= n(A) (the width of AL is not yet the width of A). In our example, the Spark webpage
guessed that the column size of matrix A is to be used for the stopping criteria:
while ( size( AL, 2 ) < size( A, 2 ) )
which happens to be correct. (When you implement the Householder QR factorization, you may not
be so lucky...)
The update statements: The Spark webpage cant guess what the actual updates to the various parts
of matrices A and R should be. It fills in
79
Figure 5.5: The Spark webpage filled out for CGS Variant 1.
%
%
%
update line 1
:
update line n
%
%
%
r01 = A0 * a1;
a1 = a1 - A0 * r01;
rho11 = norm( a1 );
a1 = a1 / rho11;
(Notice: if one forgets the ;, when executed the results of the assignment will be printed by
Matlab/Octave.)
At this point, one saves the resulting code in the file CGS unb var1.m. The .m ending is important
since the name of the file is used to find the routine when using Matlab/Octave.
5.3.4
Testing
To now test the routine, one starts octave and, for example, executes the commands
80
RT L RT R
R
0
RBR
where AL has 0 columns and RT L is 0 0
while n(AL ) 6= n(A) do
Repartition
AL AR A0 a1 A2 ,
RT L RT R
11 r12
0
0
RBR
0
0 R22
r01 := AH
0 a1
a1 := a1 A0 r01
11 := ka1 k2
a1 := a1 /11
Continue with
AL AR A0 a1 A2 ,
RT L RT R
0 11 rT
12
0
RBR
0
0
R22
endwhile
Figure 5.6: Left: Classical Gram-Schmidt algorithm. Right: Generated code-skeleton for CGS.
>
>
>
>
A
R
[
A
= rand( 5, 4 )
= zeros( 4, 4 )
Q, R ] = CGS_unb_var1( A, R )
- Q * triu( R )
5.4
81
The Householder QR factorization algorithm and algorithm to form Q from Notes on Householder
QR Factorization.
The routine for computing a Householder transformation (similar to Figure 5.1) can be found at
http://www.cs.utexas.edu/users/flame/Notes/FLAMEatlab/Housev.m
That routine implements the algorithm on the left in Figure 5.1). Try and see what happens if you
replace it with the algorithm to its right.
Note: For the Householder QR factorization and form Q algorithm how to start the algorithm when the
matrix is not square is a bit tricky. Thus, you may assume that the matrix is square.
82
Chapter
83
84
Outline
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
83
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
84
6.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
85
85
85
87
87
88
89
6.4. Forming Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
92
6.5. Applying QH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
97
99
99
6.1. Motivation
6.1
85
Motivation
A fundamental problem to avoid in numerical codes is the situation where one starts with large values and
one ends up with small values with large relative errors in them. This is known as catastrophic cancellation.
The Gram-Schmidt algorithms can inherently fall victim to this: column a j is successively reduced in
length as components in the directions of {q0 , . . . , q j1 } are subtracted, leaving a small vector if a j was
almost in the span of the first j columns of A. Application of a unitary transformation to a matrix or vector
inherently preserves length. Thus, it would be beneficial if the QR factorization can be implementated
as the successive application of unitary transformations. The Householder QR factorization accomplishes
this.
The first fundamental insight is that the product of unitary matrices is itself unitary. If, given A Cmn
(with m n), one could find a sequence of unitary matrices, {H0 , . . . , Hn1 }, such that
R
Hn1 H0 A = ,
0
where R Cmn is upper triangular, then
Hn1 H0 A = H0 Hn1 = Q =
| {z }
0
0
Q
QL QR
R
0
= QL R,
where QL equals the first n columns of A. Then A = QL R is the QR factorization of A. The second
fundamental insight will be that the desired unitary transformations {H0 , . . . , Hn1 } can be computed and
applied cheaply.
6.2
6.2.1
86
x
x=z+u xu
H
u xu
v=xy
u xu
u
H
(I 2 u u )x
u
u
(I 2 u u )x
Figure 6.1: Left: Illustration that shows how, given vectors x and unit length vector u, the subspace
orthogonal to u becomes a mirror for reflecting x represented by the transformation (I 2uuH ). Right:
Illustration that shows how to compute u given vectors x and y with kxk2 = kyk2 .
This can be interpreted as follows: The space perpendicular to u acts as a mirror: any vector in that space
(along the mirror) is not reflected, while any other vector has the component that is orthogonal to the space
(the component outside, orthogonal to, the mirror) reversed in direction, as illustrated in Figure 6.1. Notice
that a reflection preserves the length of the vector.
Homework 6.2 Show that if H is a reflector, then
HH = I (reflecting the reflection of a vector results in the original vector),
H = H H , and
H H H = I (a reflection is a unitary matrix and thus preserves the norm).
* SEE ANSWER
Next, let us ask the question of how to reflect a given x Cn into another vector y Cn with kxk2 =
kyk2 . In other words, how do we compute vector u so that (I 2uuH )x = y. From our discussion above,
we need to find a vector u that is perpendicular to the space with respect to which we will reflect. From
Figure 6.1(right) we notice that the vector from y to x, v = x y, is perpendicular to the desired space.
Thus, u must equal a unit vector in the direction v: u = v/kvk2 .
Remark 6.3 In subsequent discussion we will prefer to give Householder transformations as I uuH /,
where = uH u/2 so that u needs no longer be a unit vector, just a direction. The reason for this will
become obvious later.
In the next subsection, we will need to find a Householder transformation H that maps a vector x to a
multiple of the first unit basis vector (e0 ).
Let us first discuss how to find H in the case where x Rn . We seek v so that (I vT2 v vvT )x = kxk2 e0 .
Since the resulting vector that we want is y = kxk2 e0 , we must choose v = x y = x kxk2 e0 .
87
Homework 6.4 Show that if x Rn , v = x kxk2 e0 , and = vT v/2 then (I 1 vvT )x = kxk2 e0 .
* SEE ANSWER
In practice, we choose v = x + sign(1 )kxk2 e0 where 1 denotes the first element of x. The reason is as
follows: the first element of v, 1 , will be 1 = 1 kxk2 . If 1 is positive and kxk2 is almost equal to 1 ,
then 1 kxk2 is a small number and if there is error in 1 and/or kxk2 , this error becomes large relative
to the result 1 kxk2 , due to catastrophic cancellation. Regardless of whether 1 is positive or negative,
we can avoid this by choosing x = 1 + sign(1 )kxk2 e0 .
6.2.2
Next, we discuss a slight variant on the above discussion that is used in practice. To do so, we view x as a
vector that consists of its first element, 1 , and the rest of the vector, x2 : More precisely, partition
1
,
x=
x2
where 1 equals
the
first element of x and x2 is the rest of x. Then we will wish to find a Householder
1
so that
vector u =
u2
1
1
kxk2
1 =
.
I 1
u2
u2
x2
0
kxk2
, so the direction of u is given by
Notice that y in the previous discussion equals the vector
0
1 kxk2
.
v=
x2
We now wish to normalize this vector so its first entry equals 1:
kxk2
1
v
1
1
=
.
u=
=
1 1 kxk2
x2
x2 /1
where 1 = 1 kxk2 equals the first element of v. (Note that if 1 = 0 then u2 can be set to 0.)
6.2.3
Next, let us work out the complex case, dealing explicitly with x as a vector that consists of its first element,
1 , and the rest of the vector, x2 : More precisely, partition
1
,
x=
x2
88
where 1 equals
the
first element of x and x2 is the rest of x. Then we will wish to find a Householder
1
so that
vector u =
u2
kxk2
1
1
1 =
.
I 1
u2
u2
x2
0
denotes a complex scalar on the complex unit circle. By the same argument as before
Here
kxk2
1
.
v=
x2
kxk2
1
v
1
1
=
.
u=
=
kxk2
kvk2 1
x2
x2 /1
kxk2 . (If 1 = 0 then we set u2 to 0.)
where 1 = 1
I 1
u2
u2
x2
0
1
kxk2 .
where = uH u/2 = (1 + uH
2 u2 )/2 and =
Hint: = ||2 = kxk22 since H preserves the norm. Also, kxk22 = |1 |2 + kx2 k22 and
z
z
z
= |z|
.
* SEE ANSWER
6.2.4
1
|1 |
1
u2
, := Housev 1
u2
x2
as the computation of the above mentioned vector u2 , and scalars and , from vector x. We will use the
notation H(x) for the transformation I 1 uuH where u and are computed by Housev(x).
Algorithm:
89
u2
, = Housev
1
x2
2 := kx2 k2
1
(= kxk2 )
:=
2
2
= sign(1 )kxk2
:= sign(1 )
1 = 1 + sign(1 )kxk2
1 := 1
u2 = x2 /1
u2 := x2 /1
2 = 2 /|1 |(= ku2 k2 )
= (1 + 22 )/2
= (1 + uH
2 u2 )/2
Figure 6.2: Computing the Householder transformation. Left: simple formulation. Right: efficient computation. Note: I have not completely double-checked these formulas for the complex case. They
work for the real case.
6.3
Householder QR Factorization
Let A be an m n with m n. We will now show how to compute A QR, the QR factorization, as a
sequence of Householder transformations applied to A, which eventually zeroes out all elements of that
matrix below the diagonal. The process is illustrated in Figure 6.3.
In the first iteration, we partition
A
Let
11
u21
11 aT12
a21 A22
, 1 = Housev
11
a21
be the Householder transform computed from the first column of A. Then applying this Householder
transform to A yields
T
T
a
a
1
1
11 12 := I 1
11 12
1
a21 A22
u2
u2
a21 A22
11
aT12 wT12
,
=
T
0 A22 u21 w12
where wT12 = (aT12 + uH
21 A22 )/1 . Computation of a full QR factorization of A will now proceed with the
updated matrix A22 .
11
Original matrix
u21
, 1 =
Housev
11
a21
90
11
aT12
a21
A22
:=
aT12 wT12
11
0
Move forward
91
Now let us assume that after k iterations of the algorithm matrix A contains
R
r01 R02
00
RT L RT R
A
= 0 11 a12 ,
0
ABR
0 a21 A22
where RT L and R00 are k k upper triangular matrices. Let
11
11
, 1 = Housev
.
u21
a21
and update
A :=
0 I 1 1 1
1
u2
u2
H
0
r01
0
R
00
1
= I 1 1 0 11
1
u2
u2
0 a21
r
R02
R
00 01
T
T
= 0 11
a12 w12 ,
0
0 A22 u21 wT12
where again wT12 = (aT12 + uH
21 A22 )/1 .
Let
0k
0k
R00
r01
R02
11
aT12
a21 A22
R02
aT12
A22
Hk = I 1 1
1
u21
u21
be the Householder transform so computed during the (k + 1)st iteration. Then upon completion matrix A
contains
RT L
= Hn1 H1 H0 A
R=
0
where A denotes the original contents of A and RT L is an upper triangular matrix. Rearranging this we find
that
A = H0 H1 Hn1 R
which shows that if Q = H0 H1 Hn1 then A = QR.
0 I
92
0
0
H
1
=
I
1
1
1
1
1
1
1
u2
u2
u2
u2
* SEE ANSWER
Typically, the algorithm overwrites the original matrix A with the upper triangular matrix, and at each
step u21 is stored over the elements that become zero, thus overwriting a21 . (It is for this reason that the
first element of u was normalized to equal 1.) In this case Q is usually not explicitly formed as it can be
stored as the separate Householder vectors below the diagonal of the overwritten matrix. The algorithm
that overwrites A in this manner is given in Fig. 6.4.
We will let
[{U\R},t] = HouseholderQR(A)
denote the operation that computes the QR factorization of m n matrix A, with m n, via Householder
transformations. It returns the Householder vectors and matrix R in the first argument and the vector of
scalars i that are computed as part of the Householder transformations in t.
Theorem 6.7 Given A Cmn the cost of the algorithm in Figure 6.4 is given by
2
CHQR (m, n) 2mn2 n3 flops.
3
T
Proof: The bulk of the computation is in wT12 = (aT12 + uH
21 A22 )/1 and A22 u21 w12 . During the kth
iteration (when RT L is k k), this means a matrix-vector multiplication (uH
21 A22 ) and rank-1 update with
matrix A22 which is of size approximately (m k) (n k) for a cost of 4(m k)(n k) flops. Thus the
total cost is approximately
n1
n1
n1
n1
j=0
j=0
j=0
n1
= 2(m n)n(n 1) + 4 j2
j=0
2(m n)n2 + 4
Z n
0
6.4
4
2
x2 dx = 2mn2 2n3 + n3 = 2mn2 n3 .
3
3
Forming Q
Given A Cmn , let [A,t] = HouseholderQR(A) yield the matrix A with the Householder vectors stored
below the diagonal, R stored on and above the diagonal, and the i stored in vector t. We now discuss how
to form the first n columns of Q = H0 H1 Hn1 . The computation is illustrated in Figure 6.5.
6.4. Forming Q
93
t
AT L AT R
and t T
Partition A
ABL ABR
tB
whereAT L is 0 0 and tT has 0 elements
while n(ABR ) 6= 0 do
Repartition
a01 A02
A
00
aT 11 aT
12
10
ABL ABR
A20 a21 A22
where11 and 1 are scalars
AT L
11
AT R
, 1 :=
a21
Update
aT12
11
u21
A22
via the steps
1
1
1
u21
1 uH
21
, 1 = Housev
:= I
0
tT
1
and
tB
t2
11
a21
T
a
12
A22
wT12 := (aT12 + aH
21 A22 )/1
T
T
T
a
a12 w12
12 :=
A22
A22 a21 wT12
Continue with
AT L
AT R
ABL
ABR
A
a
00 01
aT 11
10
A20 a21
A02
aT12
A22
0
tT
1
and
tB
t2
endwhile
Figure 6.4: Unblocked Householder transformation based QR factorization.
aT12
11
Original matrix
a21 A22
1 1/1
u21 /1
94
:=
(uH
21 A22 )/1
A22 + u21 aT12
Move forward
1 0 0 0
1 0 0
1 0 0
0 1 0 0
0 1 0
0 1 0
0 0 1 0
0 0 1
0 0 1
0 0 0 1
0 0 0
0 0 0
0 0 0 0
0 0 0
0 0 0
1 0
1 0
0 1
0 1
0 0
0 0
0 0
0 0
0 0
0 0
I
Inn
I
= H0 Hn1 nn = H0 Hk1 Hk Hn1 nn .
Q
0
0
0
|
{z
}
Bk
6.4. Forming Q
95
Bk = Hk Hn1
Inn
0
Ikk
B k
Inn
, which has the desired form.
Base case: k = n. Then Bn =
0
Inductive step: Assume the result is true for Bk . We show it is true for Bk1 :
Inn
I
0
= Hk1 Bk = Hk1 kk
.
Bk1 = Hk1 Hk Hn1
0
0
Bk
0
I(k1)(k1)
I(k1)(k1) 0 0
=
1
0
1 0
1
H
0
I k
1 uk
uk
0
0 B k
I(k1)(k1)
0
=
1
1 0
1
H
I
1 u
0
k
k
uk
0 B k
I
0
(k1)(k1)
=
1 0
1
where yTk = uH
k Bk /k
1/k yT
0
k
0 B k
uk
0
I
(k1)(k1)
T
=
1 1/k
yk
0
uk /k Bk uk yTk
I
0
0
(k1)(k1)
I(k1)(k1)
.
=
0
1 1/k
yTk
=
0
Bk1
0
uk /k Bk uk yTk
By the Principle of Mathematical Induction the result holds for B0 , . . . , Bn .
Theorem 6.9 Given [A,t] = HouseholderQR(A) from Figure 6.4, the algorithm in Figure 6.6 overwrites
A with the first n = n(A) columns of Q as defined by the Householder transformations stored below the
diagonal of A and in the vector t.
96
t
AT L AT R
and t T
Partition A
ABL ABR
tB
whereAT L is n(A) n(A) and tT has n(A) elements
while n(AT R ) 6= 0 do
Repartition
AT L
AT R
Update
11
a21 A22
via the steps
:= I
1
1
1
u21
t
0
tT
1
and
tB
t2
aT12
A
a
A02
00 01
aT 11 aT
12
10
ABL ABR
A20 a21 A22
where11 and 1 are scalars
1 uH
21
1 0
0 A22
11 := 1 1/1
aT12 := (aH
21 A22 )/1
A22 := A22 + a21 aT12
a21 := a21 /1
Continue with
AT L
AT R
ABL
ABR
A
00
aT
10
A20
a01 A02
11 aT12
a21 A22
0
tT
1
and
tB
t2
endwhile
Figure 6.6: Algorithm for overwriting A with Q from the Householder transformations stored as Householder vectors below the diagonal of A (as produced by [A,t] = HouseholderQR(A) ).
6.5. Applying QH
97
6.5
Applying QH
In a future Note, we will see that the QR factorization is used to solve the linear least-squares problem. To
do so, we need to be able to compute y = QH y where QH = Hn1 H0 .
Let us start by computing H0 y:
1
1
1
I 1
1 = 1
1 /1
1
u2
u2
y2
y2
u2
u2
y2
{z
}
|
1
1
1
1
1
= 1
.
=
y2
u2
y2 1 u2
More generally, let us compute Hk y:
I 1
1
u2
u2
y0
y0
1 = 1 1
y2
y2 1 u2
where 1 = (1 + uH
2 y2 )/1 . This motivates the algorithm in Figure 6.7 for computing y := Hn1 H0 y
given the output matrix A and vector t from routine HouseholderQR.
The cost of this algorithm can be analyzed as follows: When yT is of length k, the bulk of the computation is in an inner product with vectors of length m k (to compute 1 ) and an axpy operation with
vectors of length m k to subsequently update 1 and y2 . Thus, the cost is approximately given by
n1
Notice that this is much cheaper than forming Q and then multiplying.
98
AT L AT R
tT
yT
, t
, and y
Partition A
ABL ABR
tB
yB
whereAT L is 0 0 and tT , yT has 0 elements
while n(ABR ) 6= 0 do
Repartition
AT L
AT R
A00
a01 A02
aT10 11 aT12
ABL ABR
A20 a21 A22
where11 , 1 , and 1 are scalars
Update
:= I
y2
via the steps
1
1
1
u21
t0
y0
yT
tT
1 , and
,
1
tB
yB
t2
y2
1 uH
21
1
y2
1 := (1 + aH
21 y2 )/1
1
1
:= 1
y2
y2 1 u2
Continue with
AT L
AT R
ABL
ABR
A
a
00 01
aT 11
10
A20 a21
A02
aT12
A22
t
y
0
0
yT
tT
1
1 , and
,
tB
yB
t2
y2
endwhile
Figure 6.7: Algorithm for computing y := Hn1 H0 y given the output from routine HouseholderQR.
6.6
6.6.1
99
Through a series of exercises, we will show how the application of a sequence of k Householder transformations can be accumulated. What we exactly mean that this will become clear.
Homework 6.12 Assuming all inverses exist, show that
1
1
1
T
t
T00 t01 /1
T
.
00 01 = 00
0 1
0
1/1
* SEE ANSWER
An algorithm that computes the inverse of an upper triangular matrix T based on the above exercise is
given in Figure 6.8.
Homework 6.13 Homework 6.14 Consider u1 Cm with u1 6= 0 (the zero vector), U0 Cmk , and nonsingular T00 Ckk . Define 1 = (uH
1 u1 )/2, so that
H1 = I
1
u1 uH
1
1
Show that
1 H
Q0 H1 = (I U0 T00
U0 )(I
1
u1 uH
1 )=I
1
U0 u1
T00 t01
0
1
U0 u1
H
where t01 = QH
0 u1 .
* SEE ANSWER
Homework 6.15 Homework 6.16 Consider ui Cm with ui 6= 0 (the zero vector). Define i = (uH
i ui )/2,
so that
1
Hi = I ui uH
i i
equals a Householder transformation, and let
U = u0 u1 uk1 .
Show that
H0 H1 Hk1 = I UT 1U H ,
where T is an upper triangular matrix.
* SEE ANSWER
TT L TT R
Partition T
TBL TBR
whereTT L is 0 0
100
VAR 1(T )
t
T
T
00 01 02
t T 11 t T
12
10
TBL TBR
T20 t21 T22
where11 is 1 1
TT L
TT R
Continue with
TT L
TT R
TBL
TBR
T00 t01
11
t10
T20 t21
T02
T
t12
T22
endwhile
Figure 6.8: Unblocked algorithm for inverting an upper triangular matrix. The algorithm assumes that T00
has already been inverted, and computes the next column of T in the current iteration.
101
t
T
UT L UT R
, t T , T TL
Partition U
UBL UBR
tB
TBL
whereUT L is 0 0, tT has 0 rows, TT L is 0 0
TT R
TBR
U
u01 U02
t
00
0
tT
uT 11 uT ,
1
10
12
UBL UBR
tB
U20 u21 U22
t2
t
T
T
00 01 02
TT L TT R
t T 11 t T
12
10
TBL TBR
T20 t21 T22
where11 is 1 1, 1 has 1 row, 11 is 1 1
UT L UT R
t01 := U0H u1
11 := 1
Continue with
U
00
uT10
UBR
U
20
T
00
TT R
T
t10
TBR
T20
UT L UT R
UBL
TT L
TBL
u01 U02
11
uT12
u21 U22
t01
11
t21
T02
0
t
T
,
1 ,
tB
t2
T
t12
T22
endwhile
Figure 6.9: Algorithm that computes T from U and the vector t so that I UTU H equals the UT transform.
Here U is assumed to the output of, for example, an unblocked Householder based QR algorithm. This
means that it is a lower trapezoidal matrix, with ones on the diagonal.
102
The above exercises can be summarized in the algorithm for computing T from U in Figure 6.9.
In [27] we call the transformation I UT 1U H that equals the accumulated Householder transformations the UT transform and prove that T can instead by computed as
T = triu UH U
(the upper triangular part of U H U) followed by either dividing the diagonal elements by two or setting
them to 0 , . . . , k1 (in order). In that paper, we point out similar published results [9, 33, 44, 31].
6.6.2
A blocked algorithm
A QR factorization that exploits the insights that resulted in the UT transform can now be described:
Partition
A11 A12
A21 A22
where A11 is b b.
We can use the unblocked algorithm in Figure 6.4 to factor the panel
A11
A21
,t1 ] := H OUSE QR
UNB VAR 1(
A11
A21
A11
A21
),
overwriting the entries below the diagonal with the Householder vectors
U11
U21
on the diagonal implicitly stored) and the upper triangular part with R11 .
For T11 from the Householder vectors using the procedure descrived in Section 6.6.1:
A11
,t1 )
T11 := F ORM T(
A21
Now we need to also apply the Householder transformations to the rest of the columns:
A12
A22
:= I
A12
A22
U11
T 1
U21
11
U11
U21
H
A12 U11W21
H
A22 U21W21
U11
U21
H
W21
H H
A12
A22
103
tT
AT L AT R
,t
Partition A
ABL ABR
tB
whereAT L is 0 0, tT has 0 rows
while m(AT L ) < m(A) do
Determine block size b
Repartition
A
A01 A02
00
AT L AT R
ABL ABR
A20 A21 A22
whereA11 is b b, t1 has b rows
A11
A21
,t1 ] := H OUSE QR
T11 := F ORM T(
A11
A21
0
tT
t1
,
tB
t2
UNB VAR 1(
A11
A21
,t1 )
H := T H (U H A +U H A )
W21
21 22
11 11 12
H
A U11W21
A
12 := 12
H
A22
A22 U21W21
Continue with
AT L
AT R
ABL
ABR
A00 A01
A10 A11
A20 A21
A02
A12
A22
t0
tT
t1
tB
t2
endwhile
Figure 6.10: Blocked Householder transformation based QR factorization.
104
where
H
H
H
H
W21
= T11
(U11
A12 +U21
A22 ).
6.6.3
Variations on a theme
There are many possible algorithms for computing the QR factorization. For example, the unblocked
algorithm from Figure 6.4 can be merged with the unblocked algorithm for forming T in Figure 6.9 to
yield the algorithm in Figure 6.11.
An alternative unblocked merged algorithm
Let us now again compute the QR factorization of A simultaneous with the forming of T , but now taking
advantage of the fact that T is partically computed to change the algorithm into what some would consider
a left-looking algorithm.
Partition
a01 A02
T
t
T
A
00 01 02
00
T
T
A a10 11 aT12 and T 0 11 t12
.
A
U
00
00
T
T
Assume that a10 has been factored and overwritten with u10 and R00 while also computing
A20
U20
T00 . In the next step, we need to apply previous Householder transformations to the next column of A and
then update that column with the next column of R and U. In addition, the next column of T must be
computing. This means:
Update
a
01
11
a21
U
00
T
:= I u10
U20
00
1 T
T00 u10
U20
H H
11
11
, 11 ] := Housev(
)
[
a21
a21
Compute the rest of the next column of T
t01 := AH
20 a21
This yields the algorithm in Figure 6.12.
a
01
11
a21
105
T
AT L AT R
T
, T TL TR
Partition A
ABL ABR
TBL TBR
whereAT L is 0 0, TT L is 0 0
while m(AT L ) < m(A) do
Repartition
a01 A02
A
00
aT 11 aT
12
10
ABL ABR
A20 a21 A22
where11 is 1 1, 11 is 1 1
11
AT R
AT L
, 11 :=
a21
Update
aT12
11
u21
:= I
A22
via the steps
1
11
TT L
,
TBL
1
u21
TT R
TBR
, 11 = Housev
1 uH
21
T
00
tT
10
T20
t01
T02
11
T
t12
t21
T22
11
a
21
aT12
A22
wT12 := (aT12 + aH
21 A22 )/11
T
T
T
a12 w12
a
12 :=
A22
A22 a21 wT12
t01 := AH
20 a21
Continue with
AT L
AT R
ABL
ABR
A
a
A02
00 01
aT 11 aT
12
10
A20 a21 A22
TT L
,
TBL
TT R
TBR
T
t
00 01
t T 11
10
T20 t21
T02
T
t12
T22
endwhile
Figure 6.11: Unblocked Householder transformation based QR factorization merged with the computation
of T for the UT transform.
106
AT L AT R
TT L TT R
,T
Partition A
ABL ABR
TBL TBR
whereAT L is 0 0, TT L is 0 0
while m(AT L ) < m(A) do
Repartition
A
a01 A02
00
aT 11 aT
12
10
ABL ABR
A20 a21 A22
where11 is 1 1, 11 is 1 1
AT L
AT R
a
U
01 00
T
11 := I u10
a02
U20
via the steps
U
00
1 T
T00 u10
U20
TT L
,
TBL
TT R
TBR
H H
a
01
11
a21
T
00
tT
10
T20
t01
T02
11
T
t12
t21
T22
H
H a + (uT )H +U H a )
w01 := T00
(U00
01
11 10
20 21
a
a
U
01 01 00
T
11 := 11 u10 w01
a02
a02
U20
11 , 11 := 11 , 11 = Housev 11
a21
u21
a21
t01 := AH
20 a21
Continue with
AT L
AT R
ABL
ABR
A
a
A02
00 01
aT 11 aT
12
10
A20 a21 A22
TT L
,
TBL
TT R
TBR
T
t
00 01
t T 11
10
T20 t21
T02
T
t12
T22
endwhile
Figure 6.12: Alternative unblocked Householder transformation based QR factorization merged with the
computation of T for the UT transform.
107
An alternative blocked variant that uses either of the unblocked factorization routines that merges the
formation of T is given in Figure 6.13.
108
AT L AT R
t
,t T
Partition A
ABL ABR
tB
whereAT L is 0 0, tT has 0 rows
while m(AT L ) < m(A) do
Determine block size b
Repartition
A
A01 A02
00
AT L AT R
ABL ABR
A20 A21 A22
whereA11 is b b, t1 has b rows
A11
A21
tT
t1
tB
t2
t0
UNB VAR X(
A11
A21
H := T H (U H A +U H A )
W21
21 22
11 11 12
H
A
A U11W21
12 := 12
H
A22
A22 U21W21
Continue with
AT L
AT R
ABL
ABR
A
A01
00
A10 A11
A20 A21
A02
A12
A22
0
tT
t1
,
tB
t2
endwhile
Figure 6.13: Alternative blocked Householder transformation based QR factorization.
Chapter
Video
Read disclaimer regarding the videos in the preface!
No video... Camera ran out of memory...
109
Outline
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.1. The Linear Least-Squares Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2. Method of Normal Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.3. Solving the LLS Problem Via the QR Factorization . . . . . . . . . . . . . . . . . . 112
7.3.1. Simple derivation of the solution . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.3.2. Alternative derivation of the solution . . . . . . . . . . . . . . . . . . . . . . . . 113
7.4. Via Householder QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.5. Via the Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.5.1. Simple derivation of the solution . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.5.2. Alternative derivation of the solution . . . . . . . . . . . . . . . . . . . . . . . . 116
7.6. What If A Does Not Have Linearly Independent Columns? . . . . . . . . . . . . . . 116
7.7. Exercise: Using the the LQ factorization to solve underdetermined systems . . . . . 123
110
7.1
111
Let A Cmn and y Cm . Then the linear least-square problem (LLS) is given by
Find x s.t. kAx yk2 = minn kAz yk2 .
zC
In other words, x is the vector that minimizes the expression kAx yk2 . Equivalently, we can solve
Find x s.t. kAx yk22 = minn kAz yk22 .
zC
If x solves the linear least-squares problem, then Ax is the vector in C (A) (the column space of A) closest
to the vector y.
7.2
Let A Rmn have linearly independent columns (which implies m n). Let f : Rn Rm be defined by
f (x) = kAx yk22 = (Ax y)T (Ax y) = xT AT Ax xT AT y yT Ax + yT y
= xT AT Ax 2xT AT y + yT y.
This function is minimized when the gradient is zero, 5 f (x) = 0. Now,
5 f (x) = 2AT Ax 2AT y.
If A has linearly independent columns then AT A is nonsingular. Hence, the x that minimizes kAx yk2
solves AT Ax = AT y. This is known as the method of normal equations. Notice that then
x = (AT A)1 AT y,
| {z }
A
where A is known as the pseudo inverse or Moore-Penrose pseudo inverse.
In practice, one performs the following steps:
Form B = AT A, a symmetric positive-definite (SPD) matrix.
Cost: approximately mn2 floating point operations (flops), if one takes advantage of symmetry.
Compute the Cholesky factor L, a lower triangular matrix, so that B = LLT .
This factorization, discussed in Week 8 (Section 8.4.2) of Linear Algebra: Foundations to Frontiers
- Notes to LAFF With and to be revisited later in this course, exists since B is SPD.
Cost: approximately 31 n3 flops.
Compute y = AT y.
Cost: 2mn flops.
Solve Lz = y and LT x = z.
Cost: n2 flops each.
112
Thus, the total cost of solving the LLS problem via normal equations is approximately mn2 + 13 n3 flops.
Remark 7.1 We will later discuss that if A is not well-conditioned (its columns are nearly linearly dependent), the Method of Normal Equations is numerically unstable because AT A is ill-conditioned.
The above discussion can be generalized to the case where A Cmn . In that case, x must solve AH Ax =
AH y.
A geometric explanation of the method of normal equations (for the case where A is real valued) can
be found in Week 10 (Sections 10.3-10.5) of Linear Algebra: Foundations to Frontiers - Notes to LAFF
With.
7.3
Assume A Cmn has linearly independent columns and let A = QL RT L be its QR factorization. We wish
to compute the solution to the LLS problem: Find x Cn such that
kAx yk22 = minn kAz yk22 .
zC
7.3.1
Notice that we know that, if A has linearly independent columns, the solution is given by x = (AH A)1 AH y
(the solution to the normal equations). Now,
x = [AH A]1 AH y
1
= (QL RT L )H (QL RT L )
(QL RT L )H y
1 H H
H
= RH
RT L QL y
T L QL QL RT L
H
1 H H
= RT L RT L
RT L QL y
H H
H
= R1
T L RT L RT L QL y
(BC)1 = C1 B1
H
= R1
T L QL y
H
RH
T L RT L = I
7.3.2
113
QL
QR
is unitary. Now,
(substitute A = QL RT L )
2
H
QH
Q
L
L
QL RT L z
y
= minzCn
QH
QH
R
R
2
2
H
RT L z
Q y
L
= minzCn
0
QH
Ry
2
2
H
RT L z QL y
= minzCn
H
QR y
2
2
+ kQH yk2
y
= minzCn
RT L z QH
L
R 2
2
2
+ kQH yk2
y
minzCn
RT L z QH
L
R 2
2
2
= kQH
R yk2
(partitioning, distributing)
(QH
R y is independent of z)
(minimized by x that satisfies RT L x = QH
L y)
Thus, the desired x that minimizes the linear least-squares problem solves RT L x = QH
L y. The solution is
unique because RT L is nonsingular (because A has linearly independent columns).
In practice, one performs the following steps:
Compute the QR factorization A = QL RT L .
If Gram-Schmidt or Modified Gram-Schmidt are used, this costs 2mn2 flops.
Form y = QH
L y.
Cost: 2mn flops.
Solve RT L x = y (triangular solve).
Cost: n2 flops.
Thus, the total cost of solving the LLS problem via (Modified) Gram-Schmidt QR factorization is approximately 2mn2 flops.
Notice that the solution computed by the Method of Normal Equations (generalized to the complex
case) is given by
H
1 H
H
(AH A)1 AH y = ((QL RT L )H (QL RT L ))1 (QL RT L )H y = (RH
T L QL QL RT L ) RT L QL y
1 H H
1 H
1
1 H
H
H
= (RH
T L RT L ) RT L QL y = RT L RT L RT L QL y = RT L QL y = RT L y = x
where RT L x = y.
This shows that the two approaches compute the same solution, generalizes the Method of
Normal Equations to complex valued problems, and shows that the Method of Normal Equations computes
the desired result without requiring multivariate calculus.
7.4
114
Given A Cmn with linearly independent columns, the Householder QR factorization yields n Householder transformations, H0 , . . . , Hn1 , so that
RT L
.
Hn1 H0 A =
| {z }
0
H
Q = QL QR
We wish to solve RT L x = QH
L y . But
|{z}
y
y = QH
y
=
I
L
=
0
0
QH
L
y =
(Hn1 H0 )y. =
QH
R
QL
QR
H
y=
QH y
( Hn1 H0 y ) = wT .
| {z
}
wT
w=
wB
RT L
H0 , . . . , Hn1
tion).
Cost: 2mn2 32 n3 flops.
Form w= (Hn1 ( (H0 y) )) (see Notes on Householder QR Factorization). Partition w =
w
T where wT Cn . Then y = wT .
wB
Cost: 4m2 2n2 flops. (See Notes on Householder QR Factorization regarding this.)
Solve RT L x = y.
Cost: n2 flops.
Thus, the total cost of solving the LLS problem via Householder QR factorization is approximately 2mn2
2 3
3 n flops. This is cheaper than using (Modified) Gram-Schmidt QR factorization, and hence preferred
(because it is also numerically more stable, as we will discuss later in the course).
7.5
115
Given A Cmn with linearly independent columns, let A = UV H be its SVD decomposition. Partition
U=
UL UR
and
T L
0
A=
UL UR
T L
V H = UL T LV H .
We wish to compute the solution to the LLS problem: Find x Cn such that
7.5.1
Notice that we know that, if A has linearly independent columns, the solution is given by x = (AH A)1 AH y
(the solution to the normal equations). Now,
x = [AH A]1 AH y
1
= (UL T LV H )H (UL T LV H )
(UL T LV H )H y
1
= (V T LULH )(UL T LV H )
(V T LULH )y
1
= V T L T LV H
V T LULH y
1 H
H
= V 1
T L T LV V T LUL y
H
= V 1
T LUL y
V H V = I and 1
T L T L = I
7.5.2
116
We now discuss a derivation of the result that does not depend on the Normal Equations, in preparation
for the more general case discussed in the next section.
minzCn kAz yk22
= minzCn kUV H z yk22
(substitute A = UV H )
= minzCn kV H z U H yk22
preserves two-norm)
2
T L
ULH y
H
(partition, partitioned matrix-matrix multiplication)
V z
= minzCn
H
0
UR y
2
2
H
H
T LV z UL y
= minzCn
(partitioned matrix-matrix multiplication and addition)
H
UR y
2
2
2
2
v
T
= kvT k2 + kvB k2
= minzCn
T LV H z ULH y
2 +
URH y
2
2
2
vB
2
H
The x that solves T LV H x = ULH y minimizes the expression. That x is given by x = V 1
T LUL y.
This suggests the following approach:
Compute z = V y.
7.6
In the above discussions we assume that A has linearly independent columns. Things get a bit trickier if A
does not have linearly independent columns. There is a variant of the QR factorization known as the QR
factorization with column pivoting that can be used to find the solution. We instead focus on using the
SVD.
Given A Cmn with rank(A) = r < n, let A = UV H be its SVD decomposition. Partition
U=
UL UR
V=
VL VR
and =
T L
117
T L 0
H
VL VR
= UL T LVLH .
A = UL UR
0 0
Now,
minzCn kAz yk22
= minzCn kUV H z yk22
(substitute A = UV H )
(UU H = I)
= min
kUV H V w UU H yk2
2
w Cn
z = V w
= min
kU(w U H y)k2
2
w Cn
z = V w
= min
kw U H yk2
2
(kUvk2 = kvk2 )
w Cn
z = V w
= min
w Cn
T L
0
0
0
wT
wB
ULH
URH
z = V w
2
T L wT ULH y
= min
w Cn
H
UR y
2
z = V w
= min
Cn
T L wT
2
2
ULH y
2 +
URH y
2
2
v
T
= kvT k2 + kvB k2
2
2
vB
2
z = V w
1 H
Since T L is a diagonal with no zeroes on its diagonal, we know that 1
T L exists. Choosing wT = T LUL y
means that
H
2
min T L wT UL y 2 = 0,
w Cn
z = V w
1U H y
H
x = V w = VL VR T L L = VL 1
T LUL y +VR wB
wB
characterizes all solutions to the linear least-squares problem, where wB can be chosen to be any vector of
H
size n r. By choosing wB = 0 and hence x = VL 1
T LUL y we choose the vector x that itself has minimal
2-norm.
118
The sequence of pictures on the following pages reasons through the insights that we gained so far
(in Notes on the Singular Value Decomposition and this note). These pictures can be downloaded as a
PowerPoint presentation from
http://www.cs.utexas.edu/users/flame/Notes/Spaces.pptx
119
A
C(VL )
C(U L )
Row space
Column
space
dim = r
dim = r
Ax = y
0
x = xr + x n
dim = n r
C(VR )
dim = m r
Le1
null
space
Null space
C(U R )
If A Cmn and
A=
UL UR
T L
0
0
VL VR
H
= UL T LVLH
120
A
C(VL )
C(U L )
Row space
Column
space
dim = r
dim = r
Ax = y??
0
dim = n r
C(VR )
Null space
y
dim = m r
Le1
null
space
C(U R )
Given an arbitrary y Cm not in C (A) = C (UL ), there cannot exist a vector x Cn such
that Ax = y.
121
A
C(VL )
C(U L )
Row
space
H
xr = VL TLU L y
dim = r
Axr = U L TLVLH xr = y
Column
space
y = U LU LH y
dim = r
Ax = y
0
dim = n r
x = xr + x n
Axn = 0
xn = VR wB
C(VR )
Null space
y
dim = m r
Le1
null
space
C(U R )
The solution to the Linear Least-Squares problem, x, equals any vector that is mapped by
A to the projection of y onto the column space of A: yb = A(AH A)1 AH y = QQH y. This
solution can be written as the sum of a (unique) vector in the row space of A and any vector
in the null space of A. The vector in the row space of A is given by
1 H
H
b.
xr = VL 1
T LUL y == VL T LUL y
122
The sequence of pictures, and their explanations, suggest a much simply path towards the formula for
solving the LLS problem.
We know that we are looking for the solution x to the equation
Ax = ULULH y.
We know that there must be a solution xr in the row space of A. It suffices to find wT such that
xr = VL wT .
Hence we search for wT that satisfies
AVL wT = ULULH y.
Since there is a one-to-one mapping by A from the row space of A to the column space of A, we
know that wT is unique. Thus, if we find a solution to the above, we have found the solution.
Multiplying both sides of the equation by ULH yields
ULH AVL wT = ULH y.
H
Since A = UL 1
T LVL , we can rewrite the above equation as
T L wT = ULH y
H
so that wT = 1
T LUL y.
Hence
H
xr = VL 1
T LUL y.
Adding any vector in the null space of A to xr also yields a solution. Hence all solutions to the LLS
problem can be characterized by
H
x = VL 1
T LUL y +VR wR .
Here is yet another important way of looking at the problem:
We start by considering the LLS problem: Find x Cn such that
kAx yk22 = maxn kAz yk22 .
zC
7.7
123
We next discuss another special case of the LLS problem: Let A Cmn where m < n and A has linearly
independent rows. A series of exercises will lead you to a practical algorithm for solving the problem of
describing all solutions to the LLS problem
kAx yk2 = min kAz yk2 .
z
You may want to review Notes on the QR Factorization as you do this exercise.
Homework 7.2 Let A Cmn with m < n have linearly independent rows. Show that there exist a lower
triangular matrix LL Cmm and a matrix QT Cmn with orthonormal
rows such that A = LL QT ,
noting that LL does not have any zeroes on the diagonal. Letting L = LL 0 be Cmn and unitary
QT
, reason that A = LQ.
Q=
QB
Dont overthink the problem: use results you have seen before.
* SEE ANSWER
Homework 7.3 Let A Cmn with m < n have linearly independent rows. Consider
kAx yk2 = min kAz yk2 .
z
Use the fact that A = LL QT , where LL Cmm is lower triangular and QT has orthonormal rows, to argue
1
H
nm ) is a solution to the LLS
that any vector of the
QH
T LL y + QB wB (where wB is any vector in C
form
problem. Here Q =
QT
QB
.
* SEE ANSWER
Homework 7.4 Continuing Exercise 7.2, use Figure 7.1 to give a Classical Gram-Schmidt inspired algorithm for computing LL and QT . (The best way to check you got the algorithm right is to implement
it!)
* SEE ANSWER
Homework 7.5 Continuing Exercise 7.2, use Figure 7.2 to give a Householder QR factorization inspired
algorithm for computing L and Q, leaving L in the lower triangular part of A and Q stored as Householder
vectors above the diagonal of A. (The best way to check you got the algorithm right is to implement it!)
* SEE ANSWER
124
LT L LT R
QT
AT
,L
,Q
Partition A
AB
LBL LBR
QB
whereAT has 0 rows, LT L is 0 0, QT has 0 rows
while m(AT ) < m(A) do
Repartition
AT
A0
L00
LT L LT R
aT1 ,
l10
AB
LBL LBR
A2
L20
wherea1 has 1 row, 11 is 1 1, q1 has 1 row
l01
L02
11
T
l12
l21
L22
Q0
QT T
q1
,
QB
Q2
Continue with
A
0
L
aT
TL
,
1
AB
LBL
A2
AT
LT R
LBR
L
l
00 01
l T 11
10
L20 l21
L02
T
l12
L22
0
QT T
q1
,
QB
Q2
endwhile
Figure 7.1: Algorithm skeleton for CGS-like LQ factorization.
AT L AT R
t
,t T
Partition A
ABL ABR
tB
whereAT L is 0 0, tT has 0 rows
while m(AT L ) < m(A) do
Repartition
A
a01 A02
00
aT 11 aT
12
10
ABL ABR
A20 a21 A22
where11 is 1 1, 1 has 1 row
AT L
AT R
t
0
tT
1
,
tB
t2
Continue with
AT L
AT R
ABL
ABR
a
A
00 01
aT 11
10
A20 a21
A02
aT12
A22
0
tT
1
,
tB
t2
endwhile
Figure 7.2: Algorithm skeleton for Householder QR factorization inspired LQ factorization.
125
126
Chapter
Video
Read disclaimer regarding the videos in the preface!
Video did not turn out...
127
Outline
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8.1. Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
8.2. The Prototypical Example: Solving a Linear System . . . . . . . . . . . . . . . . . . 129
8.3. Condition Number of a Rectangular Matrix . . . . . . . . . . . . . . . . . . . . . . . 133
8.4. Why Using the Method of Normal Equations Could be Bad . . . . . . . . . . . . . . 134
8.5. Why Multiplication with Unitary Matrices is a Good Thing . . . . . . . . . . . . . . 135
128
8.1. Notation
8.1
129
Notation
Throughout this note, we will talk about small changes (perturbations) to scalars, vectors, and matrices.
To denote these, we attach a delta to the symbol for a scalar, vector, or matrix.
A small change to scalar C will be denoted by C;
A small change to vector x Cn will be denoted by x Cn ; and
A small change to matrix A Cmn will be denoted by A Cmn .
Notice that the delta touches the , x, and A, so that, for example, x is not mistaken for x.
8.2
Assume that A Rnn is nonsingular and x, y Rn with Ax = y. The problem here is the function that
computes x from y and A. Let us assume that no error is introduced in the matrix A when it is stored,
but that in the process of storing y a small error is introduced: y Rn so that now y + y is stored. The
question becomes by how much the solution x changes as a function of y. In particular, we would like
to quantify how a relative change in the right-hand side y (kyk/kyk in some norm) translates to a relative
change in the solution x (kxk/kxk). It turns out that we will need to compute norms of matrices, using
the norm induced by the vector norm that we use.
Since Ax = y, if we use a consistent (induced) matrix norm,
kyk = kAxk kAkkxk or, equivalently,
1
1
kAk
.
kxk
kyk
(8.1)
Also,
A(x + x) = y + y
implies that Ax = y so that x = A1 y.
Ax
= y
Hence
kxk = kA1 yk kA1 kkyk.
(8.2)
130
In order for this equality to hold, we need to find y and y such that
kyk = kAxk = kAkkxk or, equivalently, kAk =
kAxk
kxk
and
kxk = kA1 yk = kA1 kkyk. or, equivalently, kA1 k =
kA1 yk
.
kyk
In other words, x can be chosen as a vector that maximizes kAxk/kxk and y should maximize kA1 yk/kyk.
The vector y is then chosen as y = Ax.
For this norm, 2 (A) = kAk2 kA1 k2 = 0 /n1 . So, the ratio between the
largest and smallest singular value determines whether a matrix is well-conditioned or ill-conditioned.
To show for what vectors the maximal magnification is attained, consider the SVD
What if we use the 2-norm?
A = UV T =
u0 u1 un1
H
.
v0 v1 vn1
1
..
.
n1
Recall that
kAk2 = 0 , v0 is the vector that maximizes maxkzk2 =1 kAzk2 , and Av0 = 0 u0 ;
kA1 k2 = 1/n1 , un1 is the vector that maximizes maxkzk2 =1 kA1 zk2 , and Avn1 = n1 un1 .
Now, take y = 0 u0 . Then Ax = y is solved by x = v0 . Take y = 1 u1 . Then Ax = y is solved by
x = v1 . Now,
kyk2 ||1
kxk2
=
and
= ||.
kyk2
0
kxk2
Hence
kxk2
0 kyk2
=
kxk2
n1 kyk2
R2 : Domain of A.
131
R2 : Codomain of A.
0 kyk2
2
Figure 8.1: Illustration for choices of vectors y and y that result in kxk
kxk2 = n1 kyk2 . Because of the
eccentricity of the ellipse, the relatively small change y relative to y is amplified into a relatively large
change x relative to x. On the right, we see that kyk2 /kyk2 = 1 /0 (since ku0 k2 = ku1 k2 = 1). On the
left, we see that kxk2 /kxk = (since kv0 k2 = kv1 k2 = 1).
then
log10 (kxk) log10 (kxk) = log10 ((A)) + log10 (kyk) log10 (kyk)
so that
log10 (kxk) log10 (kxk) = [log10 (kyk) log10 (kyk)] log10 ((A)).
In other words, if there were k digits of accuracy in the right-hand side, then it is possible that (due only
to the condition number of A) there are only k log10 ((A)) digits of accuracy in the solution. If we start
with only 8 digits of accuracy and (A) = 105 , we may only get 3 digits of accuracy. If (A) 108 , we
may not get any digits of accuracy...
Homework 8.1 Show that, if A is a nonsingular matrix, for a consistent matrix norm, (A) 1.
* SEE ANSWER
We conclude from this that we can generally only expect as much relative accuracy in the solution as
we had in the right-hand side.
132
Note: the below links conditioning of matrices to the relative condition number
of a more general function. For a more thorough treatment, you may want to read Lecture 12 of Trefethen
and Bau. That book discusses the subject in much more generality than is needed for our discussion of
linear algebra. Thus, if this alternative exposition baffles you, just skip it!
Let f : Rn Rm be a continuous function such that f (y) = x. Let k k be a vector norm. Consider for
y 6= 0
k f (y + y) f (y)k
kyk
f
(y) = lim
sup
/
.
k f (y)k
kyk
0
kyk
Alternative exposition
(y) = lim
0
sup
kyk
kyk
kx + xk
/
.
kxk
kyk
(Obviously, if y = 0 or y = 0 or f (y) = 0 , things get a bit hairy, so lets not allow that.)
Roughly speaking, f (y) equals the maximum that a(n infitessimally) small relative error in y is magnified into a relative error in f (y). This can be considered the relative condition number of function f . A
large relative condition number means a small relative error in the input (y) can be magnified into a large
relative error in the output (x = f (y)). This is bad, since small errors will invariable occur.
Now, if f (y) = x is the function that returns x where Ax = y for a nonsingular matrix A Cnn , then
via an argument similar to what we did earlier in this section we find that f (y) (A) = kAkkA1 k, the
condition number of matrix A:
lim
0
sup
kyk
= lim
0
= lim
0
k f (y + y) f (y)k
kyk
/
k f (y)k
kyk
kyk
kA1 (y + y) A1 (y)k
/
kA1 yk
kyk
kA1 (y + y) A1 (y)k
kyk
/
kA1 yk
kyk
1
kA1 yk
kA yk
/
kyk
kyk
kA1 yk
kyk
sup
kyk
max
kzk = 1
y = z
= lim
0
max
kzk = 1
y = z
= lim
0
max
kzk = 1
y = z
kyk
kA1 yk
133
1 yk kyk
kA
= lim max
0
kA1 yk
kyk
kzk = 1
y = z
1
kyk
kA ( z)k
= lim max
k zk
kA1 yk
0
kzk = 1
y = z
1
kA zk
kyk
=
max
kzk
kA1 yk
kzk = 1
1
kAxk
kA zk
=
max
kzk
kxk
kzk = 1
1
kA zk
kAxk
max
max
kzk
kxk
kzk = 1
x 6= 0
= kAkkA1 k,
where k k is the matrix norm induced by vector norm k k.
8.3
Given A Cmn with linearly independent columns and y Cm , consider the linear least-squares (LLS)
problem
kAx yk2 = min kAw yk2
(8.3)
w
(8.4)
w+w
134
Figure 8.2: Linear least-squares problem kAx yk2 = minv kAv yk2 . Vector z is the projection of y onto
C (A).
and hence
We conclude that
1
kAk2
=
kxk2
cos
kyk2
cos n1 kyk2
where 0 and n1 are (respectively) the largest and smallest singular values of A, because of the following
result:
Homework 8.2 If A has linearly independent columns, show that k(AH A)1 AH k2 = 1/n1 , where n1
equals the smallest singular value of A. Hint: Use the SVD of A.
* SEE ANSWER
The condition number of A Cmn with linearly independent columns is 2 (A) = 0 /n1 .
Notice the effect of the cos . When y is almost perpendicular to C (A), then its projection z is small
and cos is small. Hence a small relative change in y can be greatly amplified. This makes sense: if y is
almost perpendical to C (A), then x 0, and any small y C (A) can yield a relatively large change x.
8.4
Homework 8.3 Let A have linearly independent columns. Show that 2 (AH A) = 2 (A)2 .
* SEE ANSWER
Homework 8.4 Let A Cnn have linearly independent columns.
Show that Ax = y if and only if AH Ax = AH y.
Reason that using the method of normal equations to solve Ax = y has a condition number of 2 (A)2 .
135
* SEE ANSWER
Let A Cmn have linearly independent columns. If one uses the Method of Normal Equations to solve
the linear least-squares problem minx kAxyk2 , one ends up solving the square linear system AH Ax = AH y.
Now, 2 (AH A) = 2 (A)2 . Hence, using this method squares the condition number of the matrix being used.
8.5
Next, consider the computation C = AB where A Cmm is nonsingular and B, B,C, C Cmn . Then
(C + C) = A(B + B)
C = AB
C = AB
Thus,
kCk2 = kABk2 kAk2 kBk2 .
Also, B = A1C so that
kBk2 = kA1Ck2 kA1 k2 kCk2
and hence
1
1
kA1 k2
.
kCk2
kBk2
Thus,
kCk2
kBk2
kBk2
kAk2 kA1 k2
= 2 (A)
.
kCk2
kBk2
kBk2
This means that the relative error in matrix C = AB is at most 2 (A) greater than the relative error in B.
The following exercise gives us a hint as to why algorithms that cast computation in terms of multiplication by unitary matrices avoid the buildup of error:
Homework 8.5 Let U Cnn be unitary. Show that 2 (U) = 1.
* SEE ANSWER
This means is that the relative error in matrix C = UB is no greater than the relative error in B when U is
unitary.
Homework 8.6 Characterize the set of all square matrices A with 2 (A) = 1.
* SEE ANSWER
136
Chapter
Video
Read disclaimer regarding the videos in the preface!
* Lecture on the Stability of an Algorithm
* YouTube
* Download from UT Box
* View After Local Download
(For help on viewing, see Appendix A.)
137
Outline
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
9.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
9.2. Floating Point Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
9.3. Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
9.4. Floating Point Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
9.4.1. Model of floating point computation . . . . . . . . . . . . . . . . . . . . . . . . 142
9.4.2. Stability of a numerical algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 143
9.4.3. Absolute value of vectors and matrices . . . . . . . . . . . . . . . . . . . . . . 143
9.5. Stability of the Dot Product Operation . . . . . . . . . . . . . . . . . . . . . . . . . . 144
9.5.1. An algorithm for computing D OT . . . . . . . . . . . . . . . . . . . . . . . . . 144
9.5.2. A simple start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
9.5.3. Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
9.5.4. Target result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
9.5.5. A proof in traditional format . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
9.5.6. A weapon of math induction for the war on error (optional) . . . . . . . . . . . . 149
9.5.7. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
9.6. Stability of a Matrix-Vector Multiplication Algorithm . . . . . . . . . . . . . . . . . 152
9.6.1. An algorithm for computing G EMV . . . . . . . . . . . . . . . . . . . . . . . . 152
9.6.2. Analysis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
138
9.1. Motivation
139
Figure 9.1: In this illustation, f : D R is a function to be evaluated. The function f represents the
implementation of the function that uses floating point arithmetic, thus incurring errors. The fact that
for a nearby value x the computed value equals the exact function applied to the slightly perturbed x,
f (x)
= f(x), means that the error in the computation can be attributed to a small change in the input. If
this is true, then f is said to be a (numerically) stable implementation of f for input x.
9.1
Motivation
Correctness in the presence of error (e.g., when floating point computations are performed) takes on a
different meaning. For many problems for which computers are used, there is one correct answer and we
expect that answer to be computed by our program. The problem is that most real numbers cannot be stored
exactly in a computer memory. They are stored as approximations, floating point numbers, instead. Hence
storing them and/or computing with them inherently incurs error. The question thus becomes When is a
program correct in the presense of such errors?
Let us assume that we wish to evaluate the mapping f : D R where D Rn is the domain and
R Rm is the range (codomain). Now, we will let f : D R denote a computer implementation of this
function. Generally, for x D it is the case that f (x) 6= f(x). Thus, the computed value is not correct.
From the Notes on Conditioning, we know that it may not be the case that f(x) is close to f (x). After
all, even if f is an exact implementation of f , the mere act of storing x may introduce a small error x and
f (x + x) may be far from f (x) if f is ill-conditioned.
The following defines a property that captures correctness in the presense of the kinds of errors that
are introduced by computer arithmetic:
Definition 9.1 Let given the mapping f : D R, where D Rn is the domain and R Rm is the range
(codomain), let f : D R be a computer implementation of this function. We will call f a (numerically)
stable implementation of f on domain D if for all x D there exists a x close to x such that f(x) = f (x).
140
In other words, f is a stable implementation if the error that is introduced is similar to that introduced
when f is evaluated with a slightly changed input. This is illustrated in Figure 9.1 for a specific input x. If
an implemention is not stable, it is numerically unstable.
9.2
Only a finite number of (binary) digits can be used to store a real number number. For so-called singleprecision and double-precision floating point numbers 32 bits and 64 bits are typically employed, respectively. Let us focus on double precision numbers.
Recall that any real number can be written as e , where is the base (an integer greater than one),
(1, 1) is the mantissa, and e is the exponent (an integer). For our discussion, we will define F as
the set of all numbers = e such that = 2, = .0 1 t1 has only t (binary) digits ( j {0, 1}),
0 = 0 iff = 0 (the mantissa is normalized), and L e U. Elements in F can be stored with a finite
number of (binary) digits.
There is a largest number (in absolute value) that can be stored. Any number that is larger overflows. Typically, this causes a value that denotes a NaN (Not-a-Number) to be stored.
There is a smallest number (in absolute value) that can be stored. Any number that is smaller
underflows. Typically, this causes a zero to be stored.
Example 9.2 For x Rn , consider computing
v
un1
u
kxk2 = t 2i .
(9.1)
i=0
Notice that
kxk2
n1
n max |i |
i=0
and hence unless some i is close to overflowing, the result will not overflow. The problem is that if
some element i has the property that 2i overflows, intermediate results in the computation in (9.1) will
overflow. The solution is to determine k such that
n1
|k | = maxi=0
|i |
141
2
which can also be written as
21t .
A careful analysis of what happens when might equal zero or be negative yields
|| 21t ||.
Now, in practice any base can be used and floating point computation uses rounding rather than
truncating. A similar analysis can be used to show that then
|| u||
where u = 12 1t is known as the machine epsilon or unit roundoff. When using single precision or double precision real arithmetic, u 108 or 1016 , respectively. The quantity u is machine dependent; it is a
function of the parameters characterizing the machine arithmetic. The unit roundoff is often alternatively
defined as the maximum positive floating point number which can be added to the number stored as 1
without changing the number stored as 1. In the notation introduced below, fl(1 + u) = 1.
Homework 9.3 Assume a floating point number system with = 2 and a mantissa with t digits so that a
typical positive number is written as .d0 d1 . . . dt1 2e , with di {0, 1}.
Write the number 1 as a floating point number.
What is the largest positive real number u (represented as a binary fraction) such that the floating point representation of 1 + u equals the floating point representation of 1? (Assume rounded
arithmetic.)
Show that u = 12 21t .
* SEE ANSWER
9.3
142
Notation
When discussing error analyses, we will distinguish between exact and computed quantities. The function
fl(expression) returns the result of the evaluation of expression, where every operation is executed in
floating point arithmetic. For example, assuming that the expressions are evaluated from left to right,
fl( + + /) is equivalent to fl(fl(fl() + fl()) + fl(fl()/fl())). Equality between the quantities lhs
and rhs is denoted by lhs = rhs. Assignment of rhs to lhs is denoted by lhs := rhs (lhs becomes rhs). In
the context of a program, the statements lhs := rhs and lhs := fl(rhs) are equivalent. Given an assignment
:= expression, we use the notation (pronounced check kappa) to denote the quantity resulting from
fl(expression), which is actually stored in the variable .
9.4
We introduce definitions and results regarding floating point arithmetic. In this note, we focus on real
valued arithmetic only. Extensions to complex arithmetic are straightforward.
9.4.1
The Standard Computational Model (SCM) assumes that, for any two floating point numbers and ,
the basic arithmetic operations satisfy the equality
fl( op ) = ( op )(1 + ),
The quantity is a function of , and op. Sometimes we add a subscript (+ , , . . .) to indicate what
operation generated the (1 + ) error factor. We always assume that all the input variables to an operation
are floating point numbers. We can interpret the SCM as follows: These operations are performed
exactly and it is only in storing the result that a roundoff error occurs (equal to that introduced
when a real number is stored as a floating point number).
Remark 9.4 Given , F, performing any operation op {+, , , /} with and in floating point
arithmetic, [ op ], is a stable operation: Let = op and = + = [ (op) ]. Then || u||
and hence is close to (it has k correct binary digits).
For certain problems it is convenient to use the Alternative Computational Model (ACM) [25],
which also assumes for the basic arithmetic operations that
op
,
|| u, and op {+, , , /}.
fl( op ) =
1+
As for the standard computation model, the quantity is a function of , and op. Note that the s
produced using the standard and alternative models are generally not equal.
Remark 9.5 Notice that the Taylor series expansion of 1/(1 + ) is given by
1
= 1 + () + O(2 ),
1+
which explains how the SCM and ACM are related.
Remark 9.6 Sometimes it is more convenient to use the SCM and sometimes the ACM. Trial and error,
and eventually experience, will determine which one to use.
9.4.2
143
In the presence of round-off error, an algorithm involving numerical computations cannot be expected to
yield the exact result. Thus, the notion of correctness applies only to the execution of algorithms in
exact arithmetic. Here we briefly introduce the notion of stability of algorithms.
Let f : D R be a mapping from the domain D to the range R and let f : D R represent the
mapping that captures the execution in floating point arithmetic of a given algorithm which computes f .
The algorithm is said to be backward stable if for all x D there exists a perturbed input x D ,
close to x, such that f(x) = f (x).
In other words, the computed result equals the result obtained when
the exact function is applied to slightly perturbed data. The difference between x and x, x = x x, is the
perturbation to the original input x.
The reasoning behind backward stability is as follows. The input to a function typically has some errors
associated with it. Uncertainty may be due to measurement errors when obtaining the input and/or may be
the result of converting real numbers to floating point numbers when storing the input on a computer. If it
can be shown that an implementation is backward stable, then it has been proved that the result could have
been obtained through exact computations performed on slightly corrupted input. Thus, one can think of
the error introduced by the implementation as being comparable to the error introduced when obtaining
the input data in the first place.
When discussing error analyses, x, the difference between x and x,
is the backward error and the
difference f(x) f (x) is the forward error. Throughout the remainder of this note we will be concerned
with bounding the backward and/or forward errors introduced by the algorithms executed with floating
point arithmetic.
The algorithm is said to be forward stable on domain D if for all x D it is that case that f(x) f (x).
In other words, the computed result equals a slight perturbation of the exact result.
9.4.3
In the above discussion of error, the vague notions of near and slightly perturbed are used. Making
these notions exact usually requires the introduction of measures of size for vectors and matrices, i.e.,
norms. Instead, for the operations analyzed in this note, all bounds are given in terms of the absolute
values of the individual elements of the vectors and/or matrices. While it is easy to convert such bounds
to bounds involving norms, the converse is not true.
Definition 9.7 Given x Rn and A Rmn ,
|0 |
|0,0 |
|0,1 |
| |
| |
|1,1 |
1
1,0
|x| =
and
|A|
=
..
..
..
.
.
.
|n1 |
|m1,0 | |m1,1 |
Definition 9.8 Let M {<, , =, , >} and x, y Rn . Then
|x| M |y| iff |i | M |i |,
with i = 0, . . . , n 1. Similarly, given A and B Rmn ,
|A| M |B| iff |i j | M |i j |,
with i = 0, . . . , m 1 and j = 0, . . . , n 1.
...
|0,n1 |
...
..
.
|1,n1 |
..
.
. . . |m1,n1 |
144
9.5
The matrix-vector multiplication algorithm discussed in the next section requires the computation of the
dot (inner) product (D OT) of vectors x, y Rn : := xT y. In this section, we give an algorithm for this
operation and the related error results.
9.5.1
We will consider the algorithm given in Figure 9.2. It uses the FLAME notation [24, 4] to express the
computation
:= ((0 0 + 1 1 ) + ) + n2 n2 + n1 n1
(9.2)
in the indicated order.
9.5.2
A simple start
0
=
1
1
= [0 0 + 1 1 ]
= [[0 0 ] + [1 1 ]]
(0)
(1)
(0)
(1)
= [0 0 (1 + ) + 1 1 (1 + )]
(1)
= (0 0 (1 + ) + 1 1 (1 + ))(1 + + )
(0)
(1)
(1)
(1)
= 0 0 (1 + )(1 + + ) + 1 1 (1 + )(1 + + )
yT
xT
,y
x
xB
yB
while m(xT ) < m(x)
x
0
xT
xB
x2
145
do
y0
yT
1
,
yB
y2
:= + 1 1
x
y
0
0
yT
1 ,
1
xB
yB
x2
y2
xT
endwhile
Figure 9.2: Algorithm for computing := xT y.
T
0
1
T
0
1
(0)
(1)
(0)
(1)
0 (1 + )(1 + + )
(1)
(1)
1 (1 + )(1 + + )
(0)
(1)
(1 + )(1 + + )
(0)
(1)
0 (1 + )(1 + + )
(1)
(1)
1 (1 + )(1 + + )
0
(1)
0
T
0
(1)
(1 + )(1 + + )
(1)
where | |, | |, |+ | u.
9.5.3
146
Preparation
satisfies
Under the computational model given in Section 9.4, the computed result of (9.2), ,
(0)
(1)
(1)
(n2)
=
(0 0 (1 + ) + 1 1 (1 + ))(1 + + ) + (1 + + )
(n1)
(n1)
+ n1 n1 (1 +
) (1 + + )
!
n1
(i)
i i (1 + )
(0)
( j)
(1 + + )
(9.3)
j=i
i=0
(0)
n1
( j)
( j)
(1 + i)1 = 1 + n,
i=0
We will show that if i R, 0 i n, (n + 1)u < 1, and |i | u, then there exists a n+1 R such
that
n
1
Case 1: ni=0 (1 + i )1 = n1
i=0 (1 + i ) (1 + n ). See Exercise 9.15.
1
Case 2: ni=0 (1 + i )1 = (n1
i=0 (1 + i ) )/(1 + n ). By the I.H. there exists a n such that
1 and | | nu/(1 nu). Then
(1 + n ) = n1
n
i=0 (1 + i )
1
1 + n
n n
n1
i=0 (1 + i )
,
=
= 1+
1 + n
1 + n
1 + n
| {z }
n+1
=
1 + n
1u
1u
(1 nu)(1 u)
(n + 1)u nu2
(n + 1)u
=
.
2
1 (n + 1)u + nu
1 (n + 1)u
147
(1 + n )
0
0
0
0
0
(1 + n )
0
0
1
=
0
0
(1 + n1 )
0
2
.
.
.
.
..
..
..
..
..
..
.
.
n1
=
2
.
..
n1
I +
0
..
.
0
..
.
n1
..
.
.
..
. ..
2
(9.4)
.
..
(1 + 2 )
n1
2
,
..
.
n1
(9.5)
where | j | j , j = 2, . . . , n.
Two instances of the symbol n , appearing even in the same expression, typically do not represent the
same number. For example, in (9.4) a (1 + n ) multiplies each of the terms 0 0 and 1 1 , but these
two instances of n , as a rule, do not denote the same quantity. In particular, One should be careful when
factoring out such quantities.
As part of the analyses the following bounds will be useful to bound error that accumulates:
Lemma 9.17 If n, b 1 then n n+b and n + b + n b n+b .
Homework 9.18 Prove Lemma 9.17.
* SEE ANSWER
9.5.4
148
Target result
It is of interest to accumulate the roundoff error encountered during computation as a perturbation of input
and/or output parameters:
= (x + x)T y;
= xT (y + y);
= xT y + .
The first two are backward error results (error is accumulated onto input parameters, showing that the
algorithm is numerically stable since it yields the exact output for a slightly perturbed input) while the last
one is a forward error result (error is accumulated onto the answer). We will see that in different situations,
a different error result may be needed by analyses of operations that require a dot product.
Let us focus on the second result. Ideally one would show that each of the entries of y is slightly
perturbed relative to that entry:
0
0
0
. .
.. ..
= ..
..
. .
0 n1
n1
0 0
..
.
y =
n1 n1
= y,
where each i is small and = diag(()0 , . . . , n1 ). The following special structure of , inspired
by (9.5) will be used in the remainder of this note:
(n)
0 0 matrix
1
if n = 0
(9.6)
if n = 1
diag(()n , n , n1 , . . . , 2 ) otherwise.
(k)
I +
0
(1 + 2 ) = (I + (k+1) ).
0
(1 + 1 )
Hint: reason the case where k = 0 separately from the case where k > 0.
* SEE ANSWER
We state a theorem that captures how error is accumulated by the algorithm.
Theorem 9.20 Let x, y Rn and let := xT y be computed by executing the algorithm in Figure 9.2. Then
= [xT y] = xT (I + (n) )y.
9.5.5
149
In the below proof, we will pick symbols to denote vectors so that the proof can be easily related to the
alternative framework to be presented in Section 9.5.6.
Proof: By Mathematical Induction on n, the length of vectors x and y.
Base case. m(x) = m(y) = 0. Trivial.
Inductive Step. I.H.: Assume that if xT , yT Rk , k > 0, then
fl(xTT yT ) = xTT (I + T )yT , where T = (k) .
T
T
T )yT holds true again.
We will show that when xT , yT Rk+1 , the equality
fl(xT yT ) = xT (I +
x0
y
and yT 0 . Then
Assume that xT , yT Rk+1 , and partition xT
1
1
fl(
x0
y0
) = fl(fl(xT y0 ) + fl(1 1 ))
(definition)
(I.H. with xT = x0 ,
yT = y0 , and 0 = (k) )
= x0T (I + 0 )y0 + 1 1 (1 + ) (1 + + )
T
x0
(I + 0 )
0
y
(1 + + ) 0
=
1
0
(1 + )
1
(SCM, twice)
= xTT (I + T )yT
(renaming),
where | |, |+ | u, + = 0 if k = 0, and (I + T ) =
(rearrangement)
(I + 0 )
(1 + )
(1 + + ) so that T =
(k+1) .
By the Principle of Mathematical Induction, the result holds.
9.5.6
We focus the readers attention on Figure 9.3 in which we present a framework, which we will call the
error worksheet, for presenting the inductive proof of Theorem 9.20 side-by-side with the algorithm for
D OT. This framework, in a slightly different form, was first introduced in [3]. The expressions enclosed
by { } (in the grey boxes) are predicates describing the state of the variables used in the algorithms and
in their analysis. In the worksheet, we use superscripts to indicate the iteration number, thus, the symbols
vi and vi+1 do not denote two different variables, but two different states of variable v.
150
:= 0
Partition x
xT
, y
yT
Step
{=0}
!
1a
Error side
0 B
xB
yB
where xT and yT are empty, and T is 0 0
n
o
= xTT (I + T )yT T = (k) m(xT ) = k
2a
3
o
n
= xTT (I + T )yT T = (k) m(xT ) = k
2b
Repartition
xT
x0
y0
y
, T ,
1
1
xB
yB
x2
y2
i0
0 i1 0
0
5a
SCM, twice
(+ = 0 if k = 0)
:= + 1 1
(k)
= x0T (I + 0 )y0 + 1 1 (1 + ) (1 + + )
T
(k)
0
I + 0
x0
(1 + + )
=
1
0
1 +
x0
Step 6: I.H.
y0
Rearrange
y0
I + (k+1)
1
Exercise 9.19
i+1
0
y
0
i+1 = 0 I + 0
i+1
1
0 1
1
i+1
0
x
0
(k+1) m 0 = (k + 1)
i+1
0 1
1
Continue
with
!
x0
xT
1 ,
xB
x2
yT
yB
y0
1 ,
y
i+1
0
0
0
i+1
1
2
n 2
o
T
(k)
= xT (I + T )yT T = m(xT ) = k
5b
2c
endwhile
n
o
= xTT (I + T )yT T = (k) m(xT ) = k m(xT ) = m(x)
n
o
= xT (I + (n) )y m(x) = n
2d
1b
Figure 9.3: Error worksheet completed to establish the backward error result for the given algorithm that
computes the D OT operation.
151
The proof presented in Figure 9.3 goes hand in hand with the algorithm, as it shows that before and
xT , yT , T are such that the predicate
after each iteration of the loop that computes := xT y, the variables ,
{ = xTT (I + T )yT k = m(xT ) T = (k) }
(9.7)
holds true. This relation is satisfied at each iteration of the loop, so it is also satisfied when the loop
completes. Upon completion, the loop guard is m(xT ) = m(x) = n, which implies that = xT (I + (n) )y,
i.e., the thesis of the theorem, is satisfied too.
In details, the inductive proof of Theorem 9.20 is captured by the error worksheet as follows:
Base case. In Step 2a, i.e. before the execution of the loop, predicate (9.7) is satisfied, as k = m(xT ) = 0.
Inductive step. Assume that the predicate (9.7) holds true at Step 2b, i.e., at the top of the loop. Then
Steps 6, 7, and 8 in Figure 9.3 prove that the predicate is satisfied again at Step 2c, i.e., the bottom
of the loop. Specifically,
Step 6 holds by virtue of the equalities x0 = xT , y0 = yT , and i0 = T .
The update in Step 8-left introduces the error indicated in Step 8-right (SCM, twice), yielding
i+1
the results for i+1
0 and 1 , leaving the variables in the state indicated in Step 7.
Finally, the redefinition of T in Step 5b transforms the predicate in Step 7 into that of Step 2c,
completing the inductive step.
By the Principle of Mathematical Induction, the predicate (9.7) holds for all iterations. In particular,
when the loop terminates, the predicate becomes
= xT (I + (n) )y n = m(xT ).
This completes the discussion of the proof as captured by Figure 9.3.
In the derivation of algorithms, the concept of loop-invariant plays a central role. Let L be a loop and
P a predicate. If P is true before the execution of L , at the beginning and at the end of each iteration of L ,
and after the completion of L , then predicate P is a loop-invariant with respect to L . Similarly, we give
the definition of error-invariant.
Definition 9.21 We call the predicate involving the operands and error operands in Steps 2ad the errorinvariant for the analysis. This predicate is true before and after each iteration of the loop.
For any algorithm, the loop-invariant and the error-invariant are related in that the former describes the
status of the computation at the beginning and the end of each iteration, while the latter captures an error
result for the computation indicated by the loop-invariant.
The reader will likely think that the error worksheet is an overkill when proving the error result for the
dot product. We agree. However, it links a proof by induction to the execution of a loop, which we believe
is useful. Elsewhere, as more complex operations are analyzed, the benefits of the structure that the error
worksheet provides will become more obvious. (We will analyze more complex algorithms as the course
proceeds.)
9.5.7
152
Results
A number of useful consequences of Theorem 9.20 follow. These will be used later as an inventory
(library) of error results from which to draw when analyzing operations and algorithms that utilize D OT.
Corollary 9.22 Under the assumptions of Theorem 9.20 the following relations hold:
R1-B: (Backward analysis) = (x + x)T y, where |x| n |x|, and = xT (y + y), where |y| n |y|;
R1-F: (Forward analysis) = xT y + , where || n |x|T |y|.
Proof: We leave the proof of R1-B as an exercise. For R1-F, let = xT (n) y, where (n) is as in
Theorem 9.20. Then
|| =
|xT (n) y|
|0 ||n ||0 | + |1 ||n ||1 | + + |n1 ||2 ||n1 |
n |0 ||0 | + n |1 ||1 | + + 2 |n1 ||n1 |
n |x|T |y|.
9.6
In this section, we discuss the numerical stability of the specific matrix-vector multiplication algorithm
that computes y := Ax via dot products. This allows us to show how results for the dot product can be used
in the setting of a more complicated algorithm.
9.6.1
We will consider the algorithm given in Figure 9.4 for computing y := Ax, which computes y via dot
products.
9.6.2
Analysis
aT0
aT
1
A= .
..
aTm1
.
.
.
m1
and
yT
AT
,y
A
AB
yB
while m(AT ) < m(A)
A
0
AT
aT1
AB
A2
do
153
y0
yT
1
,
yB
y2
1 := aT1 x
x
y
0
0
yT
1 ,
1
xB
yB
x2
y2
xT
endwhile
Figure 9.4: Algorithm for computing y := Ax.
Then
.
..
m1
aT0 x
aT x
1
:=
..
.
aTm1 x
From Corollary 9.22 R1-B regarding the dot product we know that
1
y = .
..
m1
(a0 + a0 )T x
aT0
aT
1
= .
..
T
(am1 + am1 ) x
aTm1
(a1 + a1 )T x
..
.
aT0
aT
1
+
.
..
aTm1
x = (A + A)x,
154
Also, from Corollary 9.22 R1-F regarding the dot product we know that
0
aT0
0
aT0 x + 0
T
aT
1
1
1 a1 x + 1
y = . =
= . x+
.
.
..
..
..
..
m1
aTm1
m1
aTm1 x + m1
= Ax + y.
9.7
In this section, we discuss the numerical stability of the specific matrix-matrix multiplication algorithm
that computes C := AB via the matrix-vector multiplication algorithm from the last section, where C
Rmn , A Rmk , and B Rkn .
9.7.1
We will consider the algorithm given in Figure 9.5 for computing C := AC, which computes one column
at a time so that the matrix-vector multiplication algorithm from the last section can be used.
9.7.2
Analysis
Partition
C=
Then
c0 c1 cn1
c0 c1 cn1
and B =
:=
b0 b1 bn1
From Corollary 9.22 R1-B regarding the dot product we know that
=
c0 c1 cn1
Ab0 + c0 Ab1 + c1 Abn1 + cn1
=
Ab0 Ab1 Abn1 + c0 c1 cn1
= AB + C.
155
VAR 1(A,t)
BR B0 b1 B2
c1 := Ab1
CL
CR
C0 c1 C2 , BL
BR
B0 b1 B2
endwhile
Figure 9.5: Algorithm for computing C := AB one column at a time.
where |c j | k |A||b j |, j = 0, . . . , n 1, and hence |C| k |A||B|.
The above observations can be summarized in the following theorem:
Theorem 9.26 (Forward) error results for matrix-matrix multiplication. Let C Rmn , A Rmk ,
and B Rkn and consider the assignment C := AB implemented via the algorithm in Figure 9.5. Then
the following equality holds:
R1-F: C = AB + C, where |C| k |A||B|.
Homework 9.27 In the above theorem, could one instead prove the result
C = (A + A)(B + B),
where A and B are small?
* SEE ANSWER
9.7.3
An application
A collaborator of ours recently implemented a matrix-matrix multiplication algorithm and wanted to check
if it gave the correct answer. To do so, he followed the following steps:
He created random matrices A Rmk , and C Rmn , with positive entries in the range (0, 1).
He computed C = AB with an implementation that was known to be correct and assumed it yields
the exact solution. (Of course, it has error in it as well. We discuss how he compensated for that,
below.)
He computed C = AB with his new implementation.
He computed C = C C and checked that each of its entries satisfied i, j 2kui, j .
156
In the above, he took advantage of the fact that A and B had positive entries so that |A||B| = AB = C.
ku
with ku, and introduced the factor 2 to compensate for the fact that
He also approximated k = 1ku
C itself was inexactly computed.
Chapter
10
Notes on Performance
How to attain high performance on modern architectures is of importance: linear algebra is fundamental
to scientific computing. Scientific computing often involves very large problems that require the fastest
computers in the world to be employed. One wants to use such computers efficiently.
For now, we suggest the reader become familiar with the following resources:
Week 5 of Linear Algebra: Foundations to Frontiers - Notes to LAFF With [29].
Focus on Section 5.4 Enrichment.
Kazushige Goto and Robert van de Geijn.
Anatomy of high-performance matrix multiplication [22].
ACM Transactions on Mathematical Software, 34 (3), 2008.
Field G. Van Zee and Robert van de Geijn.
BLIS: A Framework for Rapid Instantiation of BLAS Functionality [41].
ACM Transactions on Mathematical Software, to appear.
Robert van de Geijn.
How to Optimize Gemm.
wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm.
(An exercise on how to write a high-performance matrix-matrix multiplication in C.)
A similar exercise:
Michael Lehn
GEMM: From Pure C to SSE Optimized Micro Kernels
http://apfel.mathematik.uni-ulm.de/ lehn/sghpc/gemm/index.html.
157
158
Chapter
11
Video
Read disclaimer regarding the videos in the preface!
Lecture on Gaussian Elimination and LU factorization:
* YouTube
* Download from UT Box
* View After Local Download
Lecture on deriving dense linear algebra algorithms:
* YouTube
* Download from UT Box
* View After Local Download
* Slides
(For help on viewing, see Appendix A.)
159
Outline
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
11.1. Definition and Existence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
11.2. LU Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
11.2.1. First derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
11.2.2. Gauss transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
11.2.3. Cost of LU factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
11.3. LU Factorization with Partial Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . 165
11.3.1. Permutation matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
11.3.2. The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
11.4. Proof of Theorem 11.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
11.5. LU with Complete Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
11.6. Solving Ax = y Via the LU Factorization with Pivoting . . . . . . . . . . . . . . . . . 175
11.7. Solving Triangular Systems of Equations . . . . . . . . . . . . . . . . . . . . . . . . 175
11.7.1. Lz = y . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
11.7.2. Ux = z . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
11.8. Other LU Factorization Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
11.8.1. Variant 1: Bordered algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
11.8.2. Variant 2: Left-looking algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 183
11.8.3. Variant 3: Up-looking variant . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
11.8.4. Variant 4: Crout variant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
11.8.5. Variant 5: Classical LU factorization . . . . . . . . . . . . . . . . . . . . . . . . 185
11.8.6. All algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
11.8.7. Formal derivation of algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 185
11.9. Numerical Stability Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
11.10.Is LU with Partial Pivoting Stable? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
11.11.Blocked Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
11.11.1.Blocked classical LU factorization (Variant 5) . . . . . . . . . . . . . . . . . . . 188
11.11.2.Blocked classical LU factorization with pivoting (Variant 5) . . . . . . . . . . . 191
11.12.Variations on a Triple-Nested Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
160
11.1
161
Definition 11.1 LU factorization (decomposition)Given a matrix A Cmn with m n its LU factorization is given by A = LU where L Cmn is unit lower trapezoidal and U Cnn is upper triangular.
The first question we will ask is when the LU factorization exists. For this, we need a definition.
Definition 11.2 The k k
principle leading
submatrix of a matrix A is defined to be the square matrix
AT L AT R
.
AT L Ckk such that A =
ABL ABR
This definition allows us to indicate when a matrix has an LU factorization:
Theorem 11.3 Existence Let A Cmn and m n have linearly independent columns. Then A has a
unique LU factorization if and only if all its principle leading submatrices are nonsingular.
The proof of this theorem is a bit involved and can be found in Section 11.4.
11.2
LU Factorization
We are going to present two different ways of deriving the most commonly known algorithm. The first is
a straight forward derivation. The second presents the operation as the application of a sequence of Gauss
transforms.
11.2.1
First derivation
T
11 a12
,
A
a21 A22
l21 L22
and U
11
uT12
U22
T
T
T
a
1
0
u
11
u12
11 12 =
.
11 12 =
T
a21 A22
l21 L22
0 U22
l21 11 l21 u12 + L22U22
This means that
11 = 11
aT12 = uT12
aT12 = uT12
162
11.2.2
Gauss transforms
Ik
0
form.
1
l21
Example 11.5 Gauss transforms can be used to take multiples of a row and subtract these multiples from
other rows:
T
T
T
b
a
b
a
ab0
0 0 0
1
0
0
0
T
T
b
a
b
b
a
a
1
0
0
1
1
=
=
.
0
T
T
T
b
b
b
b
a
a
a
a
1
0
21 1
21
2 2 21 abT 2
1
abT3
31
abT3
abT3 31 abT1
0 31 0 1
Notice the similarity with what one does in Gaussian Elimination: take a multiples of one row and subtract
these from other rows.
Now assume that the LU factorization in the previous subsection has proceeded to where A contains
A
a01 A02
00
T
0 11 a12
0 a21 A22
where A00 is upper triangular (recall: it is being overwritten by
U!). Whatwe would like to do is eliminate
the elements in a21 by taking multiples of the current row 11 aT12 and subtract these from the rest
of the rows: a21 A22 . The vehicle is a Gauss transform: we must determine l21 so that
0
0
A
a01 A02
I
00
0
1
0 0 11 aT12
0 l21 I
0 a21 A22
A
a01
A02
00
= 0 11
aT12
0
0 A22 l21 aT12
11.2. LU Factorization
163
AT L AT R
0
L
, L T L
Partition A
ABL ABR
LBL LBR
whereAT L and LT L are 0 0
while n(AT L ) < n(A) do
Repartition
A
a01 A02
00
aT 11 aT
12
10
ABL ABR
A20 a21 A22
where11 , 11 are 1 1
AT L
AT R
LT L
,
LBL
LT R
LBR
L
0
0
00
l T 11 0
10
L20 l21 L22
(a21 := 0)
or, alternatively,
A
a01 A02
00
0 11 aT12
0 a21 A22
A00
a01 A02
:=
0
1
0 0 11 aT12
0 l21 0
0 a21 A22
A02
A00 a01
T
= 0 11
a12
0
0 A22 l21 aT12
Continue with
AT L
AT R
ABL
ABR
A
a
A02
00 01
aT 11 aT
12
10
A20 a21 A22
LT L
,
LBL
0
LBR
L
0
0
00
l T 11 0
10
L20 l21 L22
endwhile
Figure 11.1: Most commonly known algorithm for overwriting a matrix with its LU factorization.
0
0
a01 A02
I
A
00
0
1
0 0 11 aT12
0 l21 I
0 a21 A22
164
a01
A02
A
00
= 0
11
aT12
0 a21 11 l21 A22 l21 aT12
The resulting algorithm is summarized in Figure 11.1 under or, alternatively,. Notice that this algorithm
is identical to the algorithm for computing LU factorization discussed before!
How can this be? The following set of exercises explains it.
Homework 11.6 Show that
I
k
0
1
l21
I
k
= 0
0
1
l21
I
* SEE ANSWER
b0 , . . . , L
bn1
Now, clearly, what the algorithm does is to compute a sequence of n Gauss transforms L
1
b
b
b
b
b
such that Ln1 Ln2 L1 L0 A = U. Or, equivalently, A = L0 L1 Ln2 Ln1U, where Lk = Lk . What
will show next is that L = L0 L1 Ln2 Ln1 is the unit lower triangular matrix computed by the LU
factorization.
0 0
L
00
T
Homework 11.7 Let L k = L0 L1 . . . Lk . Assume that L k has the form L k1 = l10
1 0 , where L 00
L20 0 I
L00 0 0
I
0
0
L20 l21 I
0 l21 I
* SEE ANSWER
What this exercise shows is that L = L0 L1 Ln2 Ln1 is the triangular matrix that is created by simply
placing the computed vectors l21 below the diagonal of a unit lower triangular matrix.
11.2.3
Cost of LU factorization
The cost of the LU factorization algorithm given in Figure 11.1 can be analyzed as follows:
Assume A is n n.
During the kth iteration, AT L is initially k k.
165
Computing l21 := a21 /11 is typically implemented as := 1/11 and then the scaling l21 := a21 .
The reason is that divisions are expensive relative to multiplications. We will ignore the cost of the
division (which will be insignificant if n is large). Thus, we count this as n k 1 multiplies.
The rank-1 update of A22 requires (n k 1)2 multiplications and (n k 1)2 additions.
Thus, the total cost (in flops) can be approximated by
n1
(n k 1) + 2(n k 1)
n1
j + 2 j2
n1
n1
(Change of variable: j = n k 1)
j=0
k=0
j + 2 j2
j=0
j=0
n
n(n 1)
+2
x2 dx
2
0
n(n 1) 2 3
=
+ n
2
3
2 3
n
3
Notice that this involves roughly half the number of floating point operations as are required for a Householder transformation based QR factorization.
11.3
It is well-known that the LU factorization is numerically unstable under general circumstances. In particular, a backward stability analysis, given for example in [3, 8, 6] and summarized in Section 11.9, shows
that the computed matrices L and U statisfy
(A + A) = L U
U|.
(This is the backward error result for the Crout variant for LU factorization, discussed later in this note.
U|).)
Some of the other variants have an error result of (A + A) = L U where |A| n (|A| + |L||
Now, if
is small in magnitude compared to the entries of a21 then not only will l21 have large entries, but the
update A22 l21 aT12 will potentially introduce large entries in the updated A22 (in other words, the part of
matrix A from which the future matrix U will be computed), a phenomenon referred to as element growth.
To overcome this, we take will swap rows in A as the factorization proceeds, resulting in an algorithm
known as LU factorization with partial pivoting.
11.3.1
Permutation matrices
Definition 11.8 An n n matrix P is said to be a permutation matrix, or permutation, if, when applied
to a vector x = (0 , 1 , . . . , n1 )T , it merely rearranges the order of the elements in that vector. Such a
permutation can be represented by the vector of integers, (0 , 1 , . . . , n1 )T , where {0 , 1 , . . . , n1 } is a
permutation of the integers {0, 1, . . . , n 1} and the permuted vector Px is given by (0 , 1 , . . . , n1 )T .
166
A A
BL BR
LT L
AT L AT R
, p
,
pT
LBL LBR
pB
whereAT L and LT L are 0 0 and pT is 0 1
AT L AT R
L00 0
p0
LT L LT R T
pT
T
T
a10 11 a12
l10 11 0
1
ABL ABR
LBL LBR
pB
A20 a21 A22
L20 l21 L22
p2
where11 , 11 , 1 are 1 1
1 = maxi
aT
11 12
a21 A22
11
a21
:= P(1 )
aT
11 12
a21 A22
Continue with
ABL
A a A
L
0 0
p
00 01 02
00
0
L
0
p
T
L
T
aT 11 aT ,
l T 11 0 , 1
10
12
10
ABR
LBL LBR
pB
A20 a21 A22
L20 l21 L22
p2
AT L AT R
endwhile
Figure 11.2: LU factorization with partial pivoting.
167
If P is a permutation matrix then PA rearranges the rows of A exactly as the elements of x are rearranged
by Px.
We will see that when discussing the LU factorization with partial pivoting, a permutation matrix that
swaps the first element of a vector with the -th element of that vector is a fundamental tool. We will
denote that matrix by
In
if = 0
0 0 1
0
0 I
P() =
0
0
1
otherwise,
1 0 0
0 0 0 In1
where n is the dimension of the permutation matrix. In the following we will use the notation Pn to indicate
that the matrix P is of size n. Let p be a vector of integers satisfying the conditions
p = (0 , . . . , k1 )T , where 1 k n and 0 i < n i,
(11.1)
Ik1
0
I
0
1
0
k2
Pn (0 ).
Pn (p) =
0 Pnk+1 (k1 )
0 Pnk+2 (k2 )
0 Pn1 (1 )
Remark 11.9 In the algorithms, the subscript that indicates the matrix dimensions is omitted.
Example 11.10 Let aT0 , aT1 , . . . , aTn1 be the rows of a matrix A. The application of P(p) to A yields a
matrix that results from swapping row aT0 with aT0 , then swapping aT1 with aT1 +1 , aT2 with aT2 +2 , until
finally aTk1 is swapped with aTk1 +k1 .
Remark 11.11 For those familiar with how pivot information is stored in LINPACK and LAPACK, notice
that those packages store the vector of pivot information (0 + 1, 1 + 2, . . . , k1 + k)T .
11.3.2
The algorithm
Having introduced our notation for permutation matrices, we can now define the LU factorization with
partial pivoting: Given an n n matrix A, we wish to compute a) a vector p of n integers which satisfies
the conditions (11.1), b) a unit lower trapezoidal matrix L, and c) an upper triangular matrix U so that
P(p)A = LU. An algorithm for computing this operation is typically represented by
[A, p] := LUpivA,
where upon completion A has been overwritten by {L\U}.
Let us start with revisiting the first derivation of the LU factorization. The first step is to find a first
permutation matrix P(1 ) such that the element on the diagonal in the first column is maximal in value.
For this, we will introduce the function maxi(x) which, given a vector x, returns the index of the element
in x with maximal magnitude (absolute value). The algorithm then proceeds as follows:
168
Partition A, L as follows:
Compute 1 = maxi
11
a21 A22
and L
l21 L22
11
a21
aT12
11
aT12
a21 A22
:= P(1 )
11
aT12
a21 A22
A
a01
00
0 11
0 a21
where A00 is upper triangular. If no pivoting was added one would compute l21 := a21 /11 followed by
the update
A00
a01 A02
11
aT12
a21 A22
A00
:= 0
1
0 0
0 l21 I
0
a01 A02
11
aT12
a21 A22
A00
= 0
0
a01
A02
11
aT12
Compute 1 = maxi
11
a21
A00
0
Compute l21 := a21 /11 .
a01 A02
11
aT12
a21 A22
A00
0
I
:=
0
0 P(1 )
0
a01 A02
11
aT12
a21 A22
Update
A
a01 A02
00
0 11 aT12
0 a21 A22
169
I
0
0
a01 A02
A
00
:= 0
1
0 0 11 aT12
0 l21 I
0 a21 A22
A
00
= 0
0
a01
A02
11
aT12
b1 . What we will finally show is that there are Gauss transforms L0 , . . . Ln1 (here the bar
where Lk = L
k
does NOT mean conjugation. It is just a symbol) such that
T
A = P0T Pn1
L0 Ln1 U
| {z }
L
or, equivalently,
P(p)A = Pn1 P0 A = L0 Ln1 U,
| {z }
L
which is what we set out to compute.
Here is the insight. Assume that after k steps of LU factorization we have computed pT , LT L , LBL , etc.
so that
A
AT R
LT L 0
TL
,
P(pT )A =
LBL I
0
ABR
where AT L is upper triangular and k k.
Now compute the next step of LU factorization with partial pivoting with
Partition
Compute 1 = maxi
11
a21
AT L
AT R
ABR
A00
a01 A02
11
aT12
a01 A02
AT L
AT R
ABR
170
L and A with
Partition A
AT L AT R
, L
LT L
, p
ABL ABR
LBL LBR
whereAT L and LT L are 0 0 and pT is 0 1
pT
pB
AT L AT R
L00 0
p0
LT L LT R
pT
T
,
aT10 11 aT12 ,
l10
1
11 0
ABL ABR
LBL LBR
pB
A20 a21 A22
L20 l21 L22
p2
where11 , 11 , 1 are 1 1
1 = maxi
11
a21
A a A
00 01 02
T
l10 11 aT12
A a A
00 01 02
0 11 aT12
0 a21 A22
A a A
00 01 02
l T 11 aT
:=
10
12
0 P(1 )
L20 a21 A22
A02
A a A
I 0 0
A a
00 01 02 00 01
:= 0 1 0 0 11 aT12 = 0 11
aT12
T
0 l21 0
0 a21 A22
0 0 A22 l21 a12
Continue with
L00 0
p0
LT L 0 T
pT
T
T
a10 11 a12
l10 11 0
1
ABR
LBL LBR
pB
A20 a21 A22
L20 l21 L22
p2
AT L AT R
ABL
endwhile
Figure 11.3: LU factorization with partial pivoting.
171
Partition A
AT L
AT R
, p
ABL ABR
whereAT L is 0 0 and pT is 0 1
pT
pB
a01 A02
A
00
aT 11 aT
12
10
ABL ABR
A20 a21 A22
where11 , 11 , 1 are 1 1
AT L
AT R
1 = maxi
aT10
A20
0
pT
1
,
pB
p2
11
a21
11 aT12
a21 A22
:= P(1 )
aT10
A20
11 aT12
a21 A22
Continue with
AT L
AT R
ABL
ABR
A00 a01
A02
aT10 11
A20 a21
aT12
A22
p0
pT
pB
p2
endwhile
Figure 11.4: LU factorization with partial pivoting, overwriting A with the factors.
A
a01 A02
00
Permute 0 11 aT12
0 a21 A22
Compute l21 := a21 /11 .
00
0
I
0
:=
0 P(1 )
0
a01 A02
11
aT12
a21 A22
172
Update
A
a01 A02
00
0 11 aT12
0 a21 A22
A
00
:= 0
0
a01
A02
11
aT12
After this,
P(pT )A =
LT L
LBL
I 0 0
a01
A02
A
00
0 1 0 0 11
aT12
I
0 P(1 )
0 l21 I
0
0 A22 l21 aT12
(11.2)
But
LT L
LBL
I
0
I
0
I 0 0
A
00
0 1 0 0
0 P(1 )
0 l21 I
0
I 0 0
0
0
LT L
0 1 0
P(1 )
P(1 )LBL I
0 l21 I
I 0 0
0
L
0
TL
0 1 0
P(1 )
LBL I
0 l21 I
a01
A02
11
aT12
A
a01
A02
00
0 11
aT12
0
0 A22 l21 aT12
A00 a01
A02
T
,,
0 11
a12
0
0 A22 l21 aT12
NOTE: There is a bar above some of Ls and ls. Very hard to see!!! For this reason, these show
up in red as well. Here we use the fact that P(1 ) = P(1 )T because of its very special structure.
Bringing the permutation to the left of (11.2) and repartitioning we get
L
0 0
I 0 0
A
a01
A02
00
00
I
0
T
P(p0 ) A = l T 1 0 0 1 0 0 11
.
a12
10
0 P(1 )
T
L
0
I
0
l
I
0
0
A
l
a
|
{z
}
20
21
22
21 12
|
{z
}
p0
P
L
0 0
00
1
T
l 10 1 0
L20 l21 I
This explains how the algorithm in Figure 11.3 compute p, L, and U (overwriting A with U) so that
P(p)A = LU.
Finally, we recognize that L can overwrite the entries of A below its diagonal, yielding the algorithm
in Figure 11.4.
11.4
173
Proof:
() Let nonsingular A have a (unique) LU factorization. We will show that its principle leading submatrices are nonsingular. Let
U
L
AT R
0
UT R
A
= TL
TL
TL
ABL ABR
LBL LBR
0 UBR
|
|
{z
}
{z
} |
{z
}
A
L
U
be the LU factorization of A where AT L , LT L , and UT L are k k. Notice that U cannot have a zero on
the diagonal since then A would not have linearly independent columns. Now, the k k principle leading
submatrix AT L equals AT L = LT LUT L which is nonsingular since LT L has a unit diagonal and UT L has no
zeroes on the diagonal. Since k was chosen arbitrarily, this means that all principle leading submatrices
are nonsingular.
() We will do a proof by induction on n.
11
a21
11 is the LU factorization of
|{z}
a21 /11
|
{z
} U
L
A. This LU factorization is unique because the first element of L must be 1.
Inductive Step: Assume the result is true for all matrices with n = k. Show it is true for matrices with
n = k + 1.
Let A of size n = k + 1 have nonsingular principle leading submatrices. Now, if an LU factorization
of A exists, A = LU, then it would have to form
L00 0
A00 a01
U00 u01
T
.
(11.3)
a10 11 = l10
1
0 11
A20 a21
L20 l21
|
{z
}
|
{z
}
|
{z
}
U
A
L
If we can show that the different parts of L and U
can be rewritten as
A
L
00 00
T T
a10 = l10 U00 and
A20
L20
a
01
11
a21
L00 u01
T u +
l10
01
11
174
u01 exists and is unique. Since L00 is nonsingular (it has ones on its diagonal) L00 u01 = a01
has a solution that is unique.
T and u exist and are unique, = l T u
11 exists, is unique, and is nonzero. Since l10
01
11
11
10 01
exists and is unique. It is also nonzero since the principle leading submatrix of A given by
A
a01
0
u01
U
L
00
= 00
00
,
T
T
a10 11
l10 1
0 11
11.5
LU factorization with partial pivoting builds on the insight that pivoting (rearranging) rows in a linear
system does not change the solution: if Ax = b then P(p)Ax = P(p)b, where p is a pivot vector. Now, if
r is another pivot vector, then notice that P(r)T P(r) = I (a simple property of pivot matrices) and AP(r)T
permutes the columns of A in exactly the same order as P(r)A permutes the rows of A.
What this means is that if Ax = b then P(p)AP(r)T [P(r)x] = P(p)b. This supports the idea that one
might want to not only permute rows of A, as in partial pivoting, but also columns of A. This is done in a
variation on LU factorization that is known as LU factorization with complete pivoting.
The idea is as follows: Given matrix A, partition
T
11 a12
.
A=
a21 A22
Now, instead of finding the largest element in magnitude in the first column, find the largest element in
magnitude in the entire matrix. Lets say it is element (0 , 0 ). Then, one permutes
11 aT12
11 aT12
:= P(0 )
P(0 )T ,
a21 A22
a21 A22
making 11 the largest element in magnitude. This then reduces the magnitude of multipliers and element
growth.
175
It can be shown that the maximal element growth experienced when employing LU with complete
pivoting indeed reduces element growth. The problem is that it requires O(n2 ) comparisons per iteration. Worse, it completely destroys the ability to utilize blocked algorithms, which attain much greater
performance.
In practice LU with complete pivoting is not used.
11.6
Given nonsingular matrix A Cmm , the above discussions have yielded an algorithm for computing
permutation matrix P, unit lower triangular matrix L and upper triangular matrix U such that PA = LU.
We now discuss how these can be used to solve the system of linear equations Ax = y.
Starting with
Ax = y
we multiply both sizes of the equation by permutation matrix P
PAx =
Py
|{z}
yb
11.7
11.7.1 Lz = y
First, we discuss solving Lz = y where L is a unit lower triangular matrix.
Variant 1
1
, z 1
L
l21 L22
z2
and y
1
y2
176
LT L LT R
y
,y T
Partition L
LBL LBR
yB
where LT L is 0 0, yT has 0 rows
LT L LT R
y
,y T
Partition L
LBL LBR
yB
where LT L is 0 0, yT has 0 rows
Repartition
Repartition
L l L
00 01 02
l T 11 l T
,
12
10
LBL LBR
L l L
20 21 22
y
0
yT
yB
y2
LT L LT R
Continue with
Continue with
LT L LT R
T
T
l
10 11 12 ,
LBL LBR
L l L
20 21 22
y
0
yT
1
yB
y2
endwhile
L l L
00 01 02
l T 11 l T
,
12
10
LBL LBR
L l L
20 21 22
y
0
yT
yB
y2
LT L LT R
T y
1 := 1 l10
0
y2 := y2 1 l21
LT L LT R
T
T
l
10 11 12 ,
LBL LBR
L l L
20 21 22
y
0
yT
1
yB
y2
endwhile
Figure 11.5: Algorithms for the solution of a unit lower triangular system Lz = y that overwrite y with z.
Then
177
1 l21 + L22 z2
.
y2
| {z }
y
l21 L22
z2
|
{z
} | {z }
L
z
1
y2
z0
y0
L00 0
, z
and y
.
L
T
l10 1
1
1
Then
L00 0
T
l10
{z
L
z0
1
} | {z }
z
L00 z0
T z +
l10
0
1
y0
1
| {z }
y
y0
178
Discussion
Notice that Variant 1 casts the computation in terms of an AXPY operation while Variant 2 casts it in terms
of DOT products.
11.7.2 Ux = z
Next, we discuss solving Ux = y where U is an upper triangular matrix (with no assumptions about its
diagonal entries).
Homework 11.13 Derive an algorithm for solving Ux = y, overwriting y with the solution, that casts
most computation in terms of DOT products. Hint: Partition
T
11 u12
.
U
0 U22
Call this Variant 1 and use Figure 11.6 to state the algorithm.
* SEE ANSWER
Homework 11.14 Derive an algorithm for solving Ux = y, overwriting y with the solution, that casts most
computation in terms of AXPY operations. Call this Variant 2 and use Figure 11.6 to state the algorithm.
* SEE ANSWER
11.8
There are actually five different (unblocked) algorithms for computing the LU factorization that were
discovered over the course of the centuries1 . The LU factorization in Figure 11.1 is sometimes called
classical LU factorization or the right-looking algorithm. We now briefly describe how to derive the other
algorithms.
Finding the algorithms starts with the following observations.
b to denote the original contents
Our algorithms will overwrite the matrix A, and hence we introduce A
of A. We will say that the precondition for the algorithm is that
b
A=A
(A starts by containing the original contents of A.)
We wish to overwrite A with L and U. Thus, the postcondition for the algorithm (the state in which
we wish to exit the algorithm) is that
b
A = L\U LU = A
(A is overwritten by L below the diagonal and U on and above the diagonal, where multiplying L
and U yields the original matrix A.)
1 For
a thorough discussion of the different LU factorization algorithms that also gives a historic perspective, we recommend
Matrix Algorithms Volume 1 by G.W. Stewart [34]
179
UT L UT R
yT
UT L UT R
y
,y
,y T
Partition U
Partition U
UBL UBR
yB
UBL UBR
yB
where UBR is 0 0, yB has 0 rows
where UBR is 0 0, yB has 0 rows
while m(UBR ) < m(U) do
Repartition
Repartition
U u U
00 01 02
uT 11 uT
,
12
10
UBL UBR
U u U
20 21 22
y
0
yT
yB
y2
UT L UT R
Continue with
U u U
00 01 02
uT 11 uT
,
12
10
UBL UBR
U u U
20 21 22
y
0
yT
yB
y2
UT L UT R
endwhile
UT L UT R
Continue with
U u U
00 01 02
uT 11 uT
,
12
10
UBL UBR
U u U
20 21 22
y
0
yT
yB
y2
U u U
00 01 02
uT 11 uT
,
12
10
UBL UBR
U u U
20 21 22
y
0
yT
yB
y2
UT L UT R
endwhile
Figure 11.6: Algorithms for the solution of an upper triangular system Ux = y that overwrite y with x.
180
All the algorithms will march through the matrices from top-left to bottom-right. Thus, at a representative point in the algorithm, the matrices are viewed as quadrants:
AT L AT R
LT L
0
UT L UT R
, L
, and U
.
A
ABL ABR
LBL LBR
0 UBR
where AT L , LT L , and UT L are all square and equally sized.
In terms of these exposed quadrants, in the end we wish for matrix A to contain
A
AT R
L\UT L UT R
TL
=
ABL ABR
L
L\UBR
BL
bT L A
bT R
UT L UT R
A
LT L
0
where
b
b
LBL LBR
0 UBR
ABL ABR
Manipulating this yields what we call the Partitioned Matrix Expression (PME), which can be
viewed as a recursive definition of the LU factorization:
A
AT R
L\UT L UT R
TL
=
ABL ABR
LBL
L\UBR
bT L
bT R
LT LUT L = A
LT LUT R = A
UT R
AT L AT R
L\UT L
=
b
ABL ABR
LBL
ABR LBLUT R
.
bT L
bT R
LT LUT R = A
LT LUT L = A
(
(
((((
(U
(
bBL LBLU
b
(
LBLUT L = A
+
L
=
A
(
BR BR
BR
(T R
((
What we are saying is that AT L , AT R , and ABL have been completely updated with the corresponding
parts of L and U, and ABR has been partially updated. This is exactly the state that the algorithm
that we discussed previously in this document maintains! What is left is to factor ABR , since it
bBR LBLUT R , and A
bBR LBLUT R = LBRUBR .
contains A
By carefully analyzing the order in which computation must occur (in compiler lingo: by performing
a dependence analysis), we can identify five states that can be maintained at the top of the loop, by
deleting subexpressions from the PME. These are called loop invariants and are listed in Figure 11.8.
Key to figuring out what updates must occur in the loop for each of the variants is to look at how the
matrices are repartitioned at the top and bottom of the loop body.
181
Algorithm: A := LU(A)
LT L LT R
UT L UT R
AT L AT R
,L
,U
Partition A
ABL ABR
LBL LBR
UBL UBR
whereAT L is 0 0, LT L is 0 0, UT L is 0 0
while m(AT L ) < m(A) do
Repartition
AT L
A00
a01 A02
LT L
aT10 11 aT12 ,
ABL ABR
LBL
A20 a21 A22
U
UT R
T
T
TL
u10 11 u12
UBL UBR
U20 u21 U22
where11 is 1 1, 11 is 1 1, 11 is 1 1
AT R
LT R
LBR
L00
l01
L02
T
T
l10
11 l12
Continue with
AT L
AT R
ABL
ABR
UT L UT R
UBL
UBR
A
00
aT
10
A
20
U
00
uT
10
U20
a01
A02
11
aT12
a21
A22
u01 U02
11
uT12
u21 U22
LT L
,
LBL
LT R
LBR
L
l
00 01
l T 11
10
L20 l21
endwhile
Figure 11.7: Code skeleton for LU factorization.
L02
T
l12
L22
Variant
Algorithm
Bordered
AT L
ABL
Left-looking
Up-looking
Crout variant
ABL
ABR
AT L
ABL
AT L
ABL
Classical LU
(
((
b(
(T(
(U
LBL
(
L = ABL
AT R
bT R
AT R
L\UT L A
=
b
b
ABR
ABL
ABR
(
((b(
bT L
(T(
LT(
LT LUT L = A
(
LU
R = AT R
AT L
AT L
ABL
182
(
(
((((
(
(
b
(
L (U
R + LBRUBR = ABR
(T(
(BL
L\UT L
bT R
A
LBL
bBR
A
bT L
LT LUT L = A
(
((b(
(T(
LT(
(
LU
R = AT R
(
(
((((
(U
(
b
(
L
U
+
L
=
A
(
BR BR
BR
((T R
(BL
bBL
LBLUT L = A
AT R
L\UT L UT R
=
b
b
ABL
ABR
ABR
bT L
bT R
LT LUT L = A
LT LUT R = A
(
((
b(
(T(
(U
LBL
(
L = ABL
(
(((
(( b
(U
LBL
U
+(
LBR
= ABR
T(
R(
(
(
(
BR
AT R
L\UT L UT R
=
b
ABR
ABR
LBL
bT L
bT R
LT LUT L = A
LT LUT R = A
((((
(
(
(U = A
bBR
bBL LBLU
+(
LBR
LBLUT L = A
R(
BR
(((T(
AT R
L\UT L
UT R
=
bBR LBLUT R
ABR
LBL
A
bT L
bT R
LT LUT L = A
LT LUT R = A
(
(((
(( b
(U
bBL LBLU
LBLUT L = A
+(
LBR
R(
BR = ABR
(((T(
11.8.1
183
AT L
AT R
ABL
ABR
L\UT L
bT R
A
bBL
A
bBR
A
(
((b(
(T(
LT(
(
LU
R = AT R
(
((
((( b
(U
(
(
L
U
+
L
=
A
(
BR BR
BR
((T R
(BL
bT L
LT LUT L = A
(
((
b(
(T(
(U
LBL
(
L = ABL
b02
ab01 A
abT10
b20
A
b 11 abT12
b22
ab21 A
L\U00
b02
u01 A
T
l10
b20
A
11 abT12
b22
ab21 A
b we notice that
where the entries in blue are to be computed. Now, considering LU = A
b00
L00U00 = A
b02
L00U02 = A
TU =a
bT10
l10
00
T u + =
b 11
l10
01
11
T U + uT = a
bT12
l10
02
12
b22
L20U02 + l21 uT12 + L22U22 = A
where the entries in red are already known. The equalities in yellow can be used to compute the desired
parts of L and U:
Solve L00 u01 = a01 for u01 , overwriting a01 with the result.
T U = aT (or, equivalently, U T (l T )T = (aT )T for l T ), overwriting aT with the result.
Solve l10
00
10
00 10
10
10
10
T u , overwriting with the result.
Compute 11 = 11 l10
01
11
Homework 11.15 If A is an n n matrix, show that the cost of Variant 1 is approximately 23 n3 flops.
* SEE ANSWER
11.8.2
AT L
ABL
AT R
ABR
L\UT L
bT R
A
LBL
bBR
A
bT L
LT LUT L = A
bBL
LBLUT L = A
(
((b(
(T(
LT(
(
LU
R = AT R
(
(
((((
(
(
b
(
L (U
R + LBRUBR = ABR
(T(
(BL
184
b02
ab01 A
T
l10
L20
b 11 abT12
b22
ab21 A
L\U00
b02
u01 A
T
l10
11 abT12
b22
l21 A
L20
b we notice that
where the entries in blue are to be computed. Now, considering LU = A
b00
L00U00 = A
b02
L00U02 = A
TU =a
bT10
l10
00
T u + =
b 11
l10
01
11
T U + uT = a
bT12
l10
02
12
b22
L20U02 + l21 uT12 + L22U22 = A
The equalities in yellow can be used to compute the desired parts of L and U:
Solve L00 u01 = a01 for u01 , overwriting a01 with the result.
T u , overwriting with the result.
Compute 11 = 11 l10
01
11
Compute l21 := (a21 L20 u01 )/11 , overwriting a21 with the result.
11.8.3
Homework 11.16 Derive the up-looking variant for computing the LU factorization.
* SEE ANSWER
11.8.4
AT L
AT R
ABL
ABR
L\UT L UT R
LBL
bT L
LT LUT L = A
bBR
A
bT R
LT LUT R = A
(
(((
(( b
(U
bBL LBLU
LBLUT L = A
+(
LBR
R(
BR = ABR
(((T(
u01 U02
T
l10
b 11 abT12
b22
ab21 A
L20
185
u01 U02
T
l10
11 uT12
b22
l21 A
L20
b we notice that
where the entries in blue are to be computed. Now, considering LU = A
b00
L00U00 = A
b02
L00U02 = A
TU =a
bT10
l10
00
T u + =
b 11
l10
01
11
T U + uT = a
bT12
l10
02
12
b22
L20U02 + l21 uT12 + L22U22 = A
The equalities in yellow can be used to compute the desired parts of L and U:
T u , overwriting with the result.
Compute 11 = 11 l10
01
11
Compute l21 := (21 L20 u01 )/11 , overwriting a21 with the result.
T U , overwriting aT with the result.
Compute uT12 := aT12 l10
02
12
11.8.5
We have already derived this algorithm. You may want to try rederiving it using the techniques discussed
in this section.
11.8.6
All algorithms
11.8.7
The described approach to deriving algorithms, linking the process to the a priori identification of loop
invariants, was first proposed in [24]. It was refined into what we call the worksheet for deriving algorithms hand-in-hand with their proofs of correctness, in [4]. A book that describes the process at a level
also appropriate for the novice is The Science of Programming Matrix Computations [38].
186
AT L AT R
Partition A
ABL ABR
whereAT L is 0 0
while n(AT L ) < n(A) do
Repartition
A
00
aT
10
ABL ABR
A20
where11 is 1 1
AT L
AT R
a01 A02
11
aT12
a21 A22
A00 contains L00 and U00 in its strictly lower and upper triangular part, respectively.
Variant 1
Variant 2
Variant 3
Variant 4
Variant 5
Bordered
Left-looking
Up-looking
Crout variant
Classical LU
1
a01 := L00
a01
1
a01 := L00
a01
Exercise in 11.16
aT10
1
:= aT10U00
11 := 11 aT21 a01
aT12 := aT12 aT10 A02
Continue with
AT L
AT R
ABL
ABR
A00 a01
A02
aT10 11
A20 a21
aT12
A22
endwhile
Figure 11.9: All five LU factorization algorithms.
11.9
187
The numerical stability of various LU factorization algorithms as well as the triangular solve algorithms
can be found in standard graduate level numerical linear algebra texts and references [21, 25, 34]. Of
particular interest may be the analysis of the Crout variant (Variant 4) in [8], since it uses our notation as
well as the results in Notes on Numerical Stability. (We recommend the technical report version [6] of
the paper, since it has more details as well as exercises to help the reader understand.) In that paper, a
systematic approach towards the derivation of backward error results is given that mirrors the systematic
approach to deriving the algorithms given in [?, 4, 38].
Here are pertinent results from that paper, assuming floating point arithmetic obeys the model of computation given in Notes on Numerical Stability (as well as [8, 6, 25]). It is assumed that the reader is
familiar with those notes.
Theorem 11.19 Let A Rnn and let the LU factorization of A be computed via the Crout variant (Variant
Then
4), yielding approximate factors L and U.
(A + A) = L U
U|.
Theorem 11.20 Let L Rnn be lower triangular and y, z Rn with Lz = y. Let z be the approximate
solution that is computed. Then
(L + L)z = y
Theorem 11.21 Let U Rnn be upper triangular and x, z Rn with Ux = z. Let x be the approximate
solution that is computed. Then
(U + U)x = z
Theorem 11.22 Let A Rnn and x, y Rn with Ax = y. Let x be the approximate solution computed via
the following steps:
Then (A + A)x = y
U|.
The analysis of LU factorization without partial pivoting is related that of LU factorization with partial
pivoting. We have shown that LU with partial pivoting is equivalent to the LU factorization without partial
pivoting on a pre-permuted matrix: PA = LU, where P is a permutation matrix. The permutation doesnt
involve any floating point operations and therefore does not generate error. It can therefore be argued that,
as a result, the error that is accumulated is equivalent with or without partial pivoting
11.10
188
1
0 0
1
1 0
1 1 1
A= .
.. ..
..
. .
1 1
1 1
0 1
0 1
0 1
.
.. ..
. .
1 1
..
.
1 1
11.11
Blocked Algorithms
It is well-known that matrix-matrix multiplication can achieve high performance on most computer architectures [2, 22, 19]. As a result, many dense matrix algorithms are reformulated to be rich in matrix-matrix
multiplication operations. An interface to a library of such operations is known as the level-3 Basic Linear
Algebra Subprograms (BLAS) [17]. In this section, we show how LU factorization can be rearranged so
that most computation is in matrix-matrix multiplications.
11.11.1
A11 A12
,
A
A21 A22
L11
L21 L22
and U
U11 U12
0
U22
A11 A12
L11 0
U11 U12
L11U11
L11U12
=
.
A21 A22
L21 L22
0 U22
L21U11 L21U12 + L22U22
This means that
A11 = L11U11
A12 = L11U12
189
Algorithm: A := LU(A)
AT L AT R
Partition A
ABL ABR
whereAT L is 0 0
while m(AT L ) < m(A) do
Determine block size b
Repartition
A
00
AT L AT R
A10
ABL ABR
A20
whereA11 is b b
A01 A02
A11 A12
A21 A22
A00 contains L00 and U00 in its strictly lower and upper triangular part, respectively.
Variant 1:
A01 := L00 1 A01
Variant 2:
A01 := L00 1 A01
Variant 3:
A10 := A10U00 1
A10 := A10U00 1
A11 := LUA11
A11 := LUA11
A11 := LUA11
Variant 4:
A11 := A11 A10 A01
Variant 5:
A11 := LUA11
A21 := A21U11 1
A11 := LUA11
1
A21 := (A21 A20 A01 )U11
A12 := L11 1 A12
Continue with
AT L
AT R
ABL
ABR
A
A01
00
A10 A11
A20 A21
A02
A12
A22
endwhile
Figure 11.10: Blocked algorithms for computing the LU factorization.
190
AT L AT R
pT
, p
Partition A
ABL ABR
pB
whereAT L is 0 0, pT has 0 elements.
while n(AT L ) < n(A) do
Determine block size b
Repartition
AT L AT R
A
10 A11 A12
ABL ABR
A20 A21 A22
whereA11 is b b, p1 has b elements
p0
p
, T p
1
pB
p2
Variant 2:
A01 := L00 1 A01
Variant 4:
Variant 5:
A
11 , p1 :=
A21
A11
, p1
LUpiv
A21
A
A12
10
:=
A20 A22
A10 A12
P(p1 )
A20 A22
A
11 , p1 :=
A21
A11
, p1
LUpiv
A21
A
A12
10
:=
A20 A22
A10 A12
P(p1 )
A20 A22
A12 := A12 A10 A02
A11
A21
, p1 :=
LUpiv
A11
A
21
A10
A12
A20
A22
, p1
:=
P(p1 )
A10
A20
A12
A22
Continue with
AT L
AT R
ABL
ABR
A00
A01
A10
A20
A11
A21
A02
p0
p
T p1
A12
pB
A22
p2
endwhile
Figure 11.11: Blocked algorithms for computing the LU factorization with partial pivoting.
.
191
or, equivalently,
A11 = L11U11
A12 = L11U12
11.11.2
Pivoting can be added to some of the blocked algorithms. Let us focus once again on Variant 5.
Partition A, L, and U as follows:
00
A
A01 A02
L\U00
U01
U02
00
b11 L10U01 A12 L10U02
A10 A11 A12 = L10
A
b21 L20U01 A22 L20U02
A20 A21 A22
L20
A
b denotes the original contents of A and
where, as before, A
b
b
b
A
A01 A02
L
0 0
U
U01 U02
00
00
00
b
b11 A
b12 = L10 I 0 0 A11 A12
P(p0 ) A10 A
b20 A
b21 A
b22
A
L20 0 I
0 A21 A22
In the current blocked step, we now perform the following computations
192
P(p1 )
A11
A21
A11
A21
L11
U11 ,
L21
A10
A12
A20
A22
:=
P(p1 )
A10
A12
A20
A22
Solve L11U12 = A12 , overwriting A12 with U12 . (This can also be more concisely written as A12 :=
1
L11
A12 .)
Update A22 := A22 A21 A12 .
Careful consideration shows that this puts the matrix A in the state
L\U00
= L10
L20
U01
U02
L\U11
U12
L21
where
b
b01 A
b02
A
A
00
p0
)
b
b11 A
b12
P(
A
A
10
p1
b20 A
b21 A
b22
A
L00
= L10
L20
0
L11
L21
0 0
I
0
U11 U12
0
A22
Similarly, blocked algorithms with pivoting for some of the other variants can be derived. All are given
in Figure 11.10.
11.12
All LU factorization algorithms presented in this note perform exactly the same floating point operations
(with some rearrangement of data thrown in for the algorithms that perform pivoting) as does the triplenested loop that implements Gaussian elimination:
for j = 0, . . . , n 1
for i = j + 1, . . . n 1
i, j := i, j / j, j
for k = j + 1, . . . , n 1
193
194
Chapter
12
195
Outline
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
12.1. Definition and Existence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
12.2. Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
12.3. An Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
12.4. Proof of the Cholesky Factorization Theorem . . . . . . . . . . . . . . . . . . . . . . 199
12.5. Blocked Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
12.6. Alternative Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
12.7. Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
12.8. Solving the Linear Least-Squares Problem via the Cholesky Factorization . . . . . . 204
12.9. Other Cholesky Factorization Algorithms . . . . . . . . . . . . . . . . . . . . . . . 204
12.10.Implementing the Cholesky Factorization with the (Traditional) BLAS . . . . . . . 206
12.10.1.What are the BLAS? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
12.10.2.A simple implementation in Fortran . . . . . . . . . . . . . . . . . . . . . . . . 209
12.10.3.Implemention with calls to level-1 BLAS . . . . . . . . . . . . . . . . . . . . . 209
12.10.4.Matrix-vector operations (level-2 BLAS) . . . . . . . . . . . . . . . . . . . . . 209
12.10.5.Matrix-matrix operations (level-3 BLAS) . . . . . . . . . . . . . . . . . . . . . 213
12.10.6.Impact on performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
12.11.Alternatives to the BLAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
12.11.1.The FLAME/C API
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
12.11.2.BLIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
196
12.1
197
12.2
Application
The Cholesky factorization is used to solve the linear system Ax = y when A is HPD: Substituting the
factors into the equation yields LLH x = y. Letting z = LH x,
Ax = L (LH x) = Lz = y.
| {z }
z
Thus, z can be computed by solving the triangular system of equations Lz = y and subsequently the desired
solution x can be computed by solving the triangular linear system LH x = z.
12.3
198
An Algorithm
The most common algorithm for computing A := CholA can be derived as follows: Consider A = LLH .
Partition
0
11 ?
and L = 11
.
(12.1)
A=
a21 A22
l21 L22
Remark 12.6 We adopt the commonly used notation where Greek lower case letters refer to scalars, lower
case letters refer to (column) vectors, and upper case letters refer to matrices. (This is convention is often
attributed to Alston Householder.) The ? refers to a part of A that is neither stored nor updated.
By substituting these partitioned matrices into A = LLH we find that
11
a21 A22
11
l21
11
L22
l21
L22
|11 |2
11 l21
H + L LH
l21 l21
22 22
11
l21
L22
11
H
l21
H
L22
11
H
l21 = a21 /11 L22 = CholA22 l21 l21
11 ?
.
1. Partition A
a21 A22
2. Overwrite 11 := 11 =
ness.)
12.4
199
11
aH
21
a21 A22
H is HPD.
Lemma 12.9 Let A Cmm be HPD and l21 = a21 / 11 . Then A22 l21 l21
H.
Proof: Since A is Hermitian so are A22 and A22 l21 l21
1
x2
where 1 = aH x2 /11 .
21
Then, since x 6= 0,
H
0 < xH Ax =
x2
11
1
x2
aH
21
a21 A22
11 1 + aH
21 x2
a21 1 + A22 x2
x2
H
H
= 11 |1 |2 + 1 aH
21 x2 + x2 a21 1 + x2 A22 x2
aH x2
aH x2 x2H a21 x2H a21 H
= 11 21
Base case: n = 1. Clearly the result is true for a 1 1 matrix A = 11 : In this case, the fact that A is
HPD means that 11 is real and positive and a Cholesky factor is then given by 11 = 11 , with
uniqueness if we insist that 11 is positive.
Inductive step: Assume the result is true for HPD matrix A C(n1)(n1) . We will show that it holds for
A Cnn . Let A Cnn be HPD. Partition A and L as in (12.1) and let 11 = 11 (which is wellH (which exists as a consequence
defined by Lemma 12.8), l21 = a21 /11 , and L22 = CholA22 l21 l21
of the Inductive Hypothesis and Lemma 12.9). Then L is the desired Cholesky factor of A.
By the principle of mathematical induction, the theorem holds.
12.5
200
Blocked Algorithm
In order to attain high performance, the computation is cast in terms of matrix-matrix multiplication by
so-called blocked algorithms. For the Cholesky factorization a blocked version of the algorithm can be
derived by partitioning
L
0
A11 ?
and L 11
,
A
A21 A22
L21 L22
where A11 and L11 are b b. By substituting these partitioned matrices into A = LLH we find that
A11
A21 A22
L11
L21 L22
L11
L21 L22
H
L11 L11
H
L21 L11
?
H + L LH
L21 L21
22 22
H
L21 = A21 L11
H
L22 = CholA22 L21 L21
A11 ?
, where A11 is b b.
1. Partition A
A21 A22
2. Overwrite A11 := L11 = CholA11 .
H
3. Overwrite A21 := L21 = A21 L11
.
H (updating only the lower triangular part).
4. Overwrite A22 := A22 L21 L21
12.6
Alternative Representation
When explaining the above algorithm in a classroom setting, invariably it is accompanied by a picture
sequence like the one in Figure 12.1(left)1 and the (verbal) explanation:
1
done
done
done
partially
updated
201
Beginning of iteration
AT L
ABL
ABR
A00
Repartition
aT10 11
A20 a21
A22
UPD.
Update
UPD.
11
done
A22
a21
11
UPD.
done
@
R
@
a21 aT21
done
AT L
ABL
ABR
End of iteration
partially
updated
Figure 12.1: Left: Progression of pictures that explain Cholesky factorization algorithm. Right: Same
pictures, annotated with labels and updates.
202
AT L ?
Partition A
ABL ABR
where AT L is 0 0
AT L ?
Partition A
ABL ABR
where AT L is 0 0
Repartition
Repartition
AT L
ABL
11 :=
A00 ?
aT10 11 ?
ABR
A20 a21 A22
11
ABL
endwhile
A10 A11 ?
ABR
A20 A21 A22
Continue with
ABL
Continue with
AT L
AT L
A00 ?
A11 := CholA11
A00 ?
a10 11 ?
ABR
A20 a21 A22
AT L
ABL
A00 ?
A10 A11 ?
ABR
A20 A21 A22
endwhile
Figure 12.2: Unblocked and blocked algorithms for computing the Cholesky factorization in FLAME
notation.
12.7. Cost
203
Beginning of iteration: At some stage of the algorithm (Top of the loop), the computation has moved
through the matrix to the point indicated by the thick lines. Notice that we have finished with the
parts of the matrix that are in the top-left, top-right (which is not to be touched), and bottom-left
quadrants. The bottom-right quadrant has been updated to the point where we only need to perform
a Cholesky factorization of it.
Repartition: We now repartition the bottom-right submatrix to expose 11 , a21 , and A22 .
Update: 11 , a21 , and A22 are updated as discussed before.
End of iteration: The thick lines are moved, since we now have completed more of the computation, and
only a factorization of A22 (which becomes the new bottom-right quadrant) remains to be performed.
Continue: The above steps are repeated until the submatrix ABR is empty.
To motivate our notation, we annotate this progression of pictures as in Figure 12.1 (right). In those
pictures, T, B, L, and R stand for Top, Bottom, Left, and Right, respectively. This then
motivates the format of the algorithm in Figure 12.2 (left). It uses what we call the FLAME notation for
representing algorithms [24, 23, 38]. A similar explanation can be given for the blocked algorithm, which
is given in Figure 12.2 (right). In the algorithms, m(A) indicates the number of rows of matrix A.
Remark 12.12 The indices in our more stylized presentation of the algorithms are subscripts rather than
indices in the conventional sense.
Remark 12.13 The notation in Figs. 12.1 and 12.2 allows the contents of matrix A at the beginning of the
iteration to be formally stated:
?
?
LT L
AT L
=
A=
,
H
ABL ABR
LBL ABR tril LBL LBL
12.7
Cost
The cost of the Cholesky factorization of A Cmm can be analyzed as follows: In Figure 12.2 (left)
during the kth iteration (starting k at zero) A00 is k k. Thus, the operations in that iteration cost
204
m1
(m k 1)2
CChol (m)
|k=0 {z
}
(Due to update of A22 )
m1
m1
j2 +
j=0
j=0
(m k 1)
|k=0 {z
}
(Due to update of a21 )
1
1
1
j m3 + m2 m3
3
2
3
which allows us to state that (obvious) most computation is in the update of A22 . It can be shown that the
blocked Cholesky factorization algorithm performs exactly the same number of floating point operations.
Comparing the cost of the Cholesky factorization to that of the LU factorization from Notes on LU
Factorization we see that taking advantage of symmetry cuts the cost approximately in half.
12.8
Recall that if B Cmn has linearly independent columns, then A = BH B is HPD. Also, recall from Notes
on Linear Least-Squares that the solution, x Cn to the linear least-squares (LLS) problem
kBx yk2 = minn kBz yk2
zC
This makes it obvious how the Cholesky factorization can (and often is) used to solve the LLS problem.
Homework 12.15 Consider B Cmn with linearly independent columns. Recall that B has a QR factorization, B = QR where Q has orthonormal columns and R is an upper triangular matrix with positive
diagonal elements. How are the Cholesky factorization of BH B and the QR factorization of B related?
* SEE ANSWER
12.9
There are actually three different unblocked and three different blocked algorithms for computing the
Cholesky factorization. The algorithms we discussed so far in this note are sometimes called right-looking
algorithms. Systematic derivation of all these algorithms, as well as their blocked counterparts, are given
in Chapter 6 of [38]. In this section, a sequence of exercises leads to what is often referred to as the
bordered Cholesky factorization algorithm.
?
AT L
Partition A
ABL ABR
whereAT L is 0 0
while m(AT L ) < m(A) do
Repartition
AT L
A00
aT
10
ABL ABR
A20
where11 is 1 1
?
11
a21
A22
Continue with
AT L
ABL
?
ABR
A00
aT
10
A20
?
11
a21
A22
endwhile
Figure 12.3: Unblocked Cholesky factorization, bordered variant, for Homework 12.17.
205
206
A00 a10
aT10 11
(Hint: For this exercise, use techniques similar to those in Section 12.4.)
1. Show that A00 is SPD.
T , where L is lower triangular and nonsingular, argue that the assign2. Assuming that A00 = L00 L00
00
T
T
T
ment l10 := a10 L00 is well-defined.
T where L is lower triangular and nonsingular, and l T =
3. Assuming that A00 is SPD, A00 = L00 L00
00
10
q
T
T
T
T
a10 L00 , show that 11 l10 l10 > 0 so that 11 := 11 l10 l10 is well-defined.
4. Show that
A00 a10
aT10
11
L00
T
l10
11
L00
T
l10
11
* SEE ANSWER
Homework 12.17 Use the results in the last exercise to give an alternative proof by induction of the
Cholesky Factorization Theorem and to give an algorithm for computing it by filling in Figure 12.3. This
algorithm is often referred to as the bordered Cholesky factorization algorithm.
* SEE ANSWER
Homework 12.18 Show that the cost of the bordered algorithm is, approximately, 13 n3 flops.
* SEE ANSWER
12.10
The Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear
algebra operations. In this section, we illustrate how the unblocked and blocked Cholesky factorization
algorithms can be implemented in terms of the BLAS. The explanation draws from the entry we wrote for
the BLAS in the Encyclopedia of Parallel Computing [37].
12.10.1
The BLAS interface [28, 18, 17] was proposed to support portable high-performance implementation of
applications that are matrix and/or vector computation intensive. The idea is that one casts computation in
terms of the BLAS interface, leaving the architecture-specific optimization of that interface to an expert.
Simple
Algorithm
Code
for j = 1 : n
j, j := j, j
do j=1, n
A(j,j) = sqrt(A(j,j))
for i = j + 1 : n
i, j := i, j / j, j
endfor
for k = j + 1 : n
for i = k : n
i,k := i,k i, j k, j
endfor
endfor
endfor
Vector-vector
for j = 1 : n
j, j := j, j
207
do i=j+1,n
A(i,j) = A(i,j) / A(j,j)
enddo
do k=j+1,n
do i=k,n
A(i,k) = A(i,k) - A(i,j) * A(k,j)
enddo
enddo
enddo
do j=1, n
A( j,j ) = sqrt( A( j,j ) )
call dscal( n-j, 1.0d00 / A(j,j), A(j+1,j), 1 )
j+1:n, j := j+1:n, j / j, j
for k = j + 1 : n
k:n,k := k, j k:n, j + k:n,k
endfor
endfor
do k=j+1,n
call daxpy( n-k+1, -A(k,j), A(k,j), 1,
A(k,k), 1 )
enddo
enddo
Figure 12.4: Simple and vector-vector (level-1 BLAS) based representations of the right-looking algorithm.
Matrix-vector
Algorithm
Code
for j = 1 : n
j, j := j, j
do j=1, n
A(j,j) = sqrt(A(j,j))
j+1:n, j := j+1:n, j / j, j
j+1:n, j+1:n :=
call dsyr( lower triangular,
tril( j+1:n, j Tj+1:n, j ) + j+1:n, j+1:n
n-j, -1.0, A(j+1,j), 1, A(j+1,j+1), lda )
endfor
enddo
int Chol_unb_var3( FLA_Obj A )
{
AT L ?
FLA_Obj ATL,
ATR,
A00, a01,
A02,
Partition A
ABL,
ABR,
a10t, alpha11, a12t,
ABL ABR
A20, a21,
A22;
whereAT L is 0 0
while m(AT L ) < m(A) do
FLAME Notation
208
FLA_Part_2x2( A,
&ATL, &ATR,
&ABL, &ABR,
0, 0, FLA_TL );
Repartition
AT L
A00
aT
10 11 ?
ABL ABR
A20 a21 A22
where11 is 1 1
11 :=
11
Continue with
AT L
ABL
A00 ?
aT 11 ?
10
ABR
A20 a21 A22
endwhile
}
Figure 12.5: Matrix-vector (level-2 BLAS) based representations of the right-looking algorithm.
12.10.2
209
We start with a simple implementation in Fortran. A simple algorithm that does not use BLAS and the
corresponding code in given in the row labeled Simple in Figure 12.4. This sets the stage for our
explaination of how the algorithm and code can be represented with vector-vector, matrix-vector, and
matrix-matrix operations, and the corresponding calls to BLAS routines.
12.10.3
The first BLAS interface [28] was proposed in the 1970s when vector supercomputers like the early Cray
architectures reigned. On such computers, it sufficed to cast computation in terms of vector operations.
As long as memory was accessed mostly contiguously, near-peak performance could be achieved. This
interface is now referred to as the Level-1 BLAS. It was used for the implementation of the first widely
used dense linear algebra package, LINPACK [16].
Let x and y be vectors of appropriate length and be scalar. In this and other notes we have vectorvector operations such as scaling of a vector (x := x), inner (dot) product ( := xT y), and scaled vector
addition (y := x + y). This last operation is known as an axpy, which stands for alpha times x plus y.
Our Cholesky factorization algorithm expressed in terms of such vector-vector operations and the corresponding code are given in Figure 12.4 in the row labeled Vector-vector. If the operations supported
by dscal and daxpy achieve high performance on a target archecture (as it was in the days of vector
supercomputers) then so will the implementation of the Cholesky factorization, since it casts most computation in terms of those operations. Unfortunately, vector-vector operations perform O(n) computation on
O(n) data, meaning that these days the bandwidth to memory typically limits performance, since retrieving a data item from memory is often more than an order of magnitude more costly than a floating point
operation with that data item.
We summarize information about level-1 BLAS in Figure 12.6.
12.10.4
The next level of BLAS supports operations with matrices and vectors. The simplest example of such
an operation is the matrix-vector product: y := Ax where x and y are vectors and A is a matrix. Another
example is the computation A22 = a21 aT21 + A22 (symmetric rank-1 update) in the Cholesky factorization.
Here only the lower (or upper) triangular part of the matrix is updated, taking advantage of symmetry.
The use of symmetric rank-1 update is illustrated in Figure 12.5, in the row labeled Matrix-vector.
There dsyr is the routine that implements a double precision symmetric rank-1 update. Readability of the
code is improved by casting computation in terms of routines that implement the operations that appear in
the algorithm: dscal for a21 = a21 /11 and dsyr for A22 = a21 aT21 + A22 .
If the operation supported by dsyr achieves high performance on a target archecture (as it was in
the days of vector supercomputers) then so will this implementation of the Cholesky factorization, since
it casts most computation in terms of that operation. Unfortunately, matrix-vector operations perform
O(n2 ) computation on O(n2 ) data, meaning that these days the bandwidth to memory typically limits
performance.
We summarize information about level-2 BLAS in Figure 12.7.
xy
scal
x x
copy
yx
axpy
y x + y
dot
xT y
nrm2
kxk2
asum
kre(x)k1 + kim(x)k1
imax
210
YY matrix shape
ge general (rectangular)
sy symmetric
sv solve vector
he Hermitian
tr triangular
r2 rank-2 update
rank-1 update
In addition, operations with banded matrices are supported, which we do not discuss
here.
A representative call to a level-2 BLAS operation is given by
dsyr( uplo, n, alpha, x, incx, A, lda )
which implements the operation A = xxT + A, updating the lower or upper triangular part of
A by choosing uplo as Lower triangular or Upper triangular, respectively. The
parameter lda (the leading dimension of matrix A) indicates the increment by which memory
has to be traversed in order to address successive elements in a row of matrix A.
The following table gives the most commonly used level-2 BLAS operations:
routine/ operation
function
gemv
symv
trmv
trsv
ger
syr
syr2
211
Algorithm
212
Code
do j=1, n, nb
jb = min( nb, n-j+1 )
call chol( jb, A( j, j ), lda )
for j = 1 : n in steps of nb
b := min(n j + 1, nb )
Matrix-matrix
Partition A
whereAT L
AT L
ABL ABR
is 0 0
FLAME Notation
FLA_Part_2x2( A,
A00 ? ?
A
?
TL
A A
10 11 ?
ABL ABR
A20 A21 A22
whereA11 is b b
Continue with
AT L
ABL
A00 ?
0, 0, FLA_TL );
A11 := CholA11
A22 := A22 tril A21 AH
21
&ATL, &ATR,
&ABL, &ABR,
A10 A11 ?
ABR
A20 A21 A22
endwhile
}
213
The naming convention for level-3 BLAS routines are similar to those for the level-2 BLAS.
A representative call to a level-3 BLAS operation is given by
dsyrk( uplo, trans, n, k, alpha, A, lda, beta, C, ldc )
which implements the operation C := AAT + C or C := AT A + C depending on whether
trans is chosen as No transpose or Transpose, respectively. It updates the lower or
upper triangular part of C depending on whether uplo equal Lower triangular or Upper
triangular, respectively. The parameters lda and ldc are the leading dimensions of arrays
A and C, respectively.
The following table gives the most commonly used Level-3 BLAS operationsx
routine/ operation
function
gemm
symm
trmm
trsm
syrk
12.10.5
Finally, we turn to the implementation of the blocked Cholesky factorization algorithm from Section 12.5.
The algorithm is expressed with FLAME notation and Matlab-like notation in Figure 12.8.
The routines dtrsm and dsyrk are level-3 BLAS routines:
T =A .
The call to dtrsm implements A21 := L21 where L21 L11
21
T +A .
The call to dsyrk implements A22 := L21 L21
22
The bulk of the computation is now cast in terms of matrix-matrix operations which can achieve high
performance.
We summarize information about level-3 BLAS in Figure 12.9.
12.10.6
Impact on performance
Figure 12.10 illustrates the performance benefits that come from using the different levels of BLAS on a
typical architecture.
214
One thread
10
GFlops
Hand optimized
BLAS3
BLAS2
BLAS1
triple loops
0
200
400
600
800
1000
n
1200
1400
1600
1800
2000
Figure 12.10: Performance of the different implementations of Cholesky factorization that use different
levels of BLAS. The target processor has a peak of 11.2 Gflops (billions of floating point operations per
second). BLAS1, BLAS2, and BLAS3 indicate that the bulk of computation was cast in terms of level-1,
-2, or -3 BLAS, respectively.
12.11
12.11.1
In a number of places in these notes we presented algorithms in FLAME notation. Clearly, there is a
disconnect between this notation and how the algorithms are then encoded with the BLAS interface. In
Figures 12.4, 12.5, and 12.8 we also show how the FLAME API for the C programming language [5]
allows the algorithms to be more naturally translated into code. While the traditional BLAS interface
underlies the implementation of Cholesky factorization and other algorithms in the widely used LAPACK
library [1], the FLAME/C API is used in our libflame library [24, 39, 40].
12.11.2
BLIS
The implementations that call BLAS in this paper are coded in Fortran. More recently, the languages of
choice for scientific computing have become C and C++. While there is a C interface to the traditional
BLAS called the CBLAS [10], we believe a more elegant such interface is the BLAS-like Library Instantiation Software (BLIS) interface [41]. BLIS is not only a framework for rapid implementation of the
traditional BLAS, but also presents an alternative interface for C and C++ users.
Chapter
13
Video
Read disclaimer regarding the videos in the preface!
* YouTube
* Download from UT Box
* View After Local Download
(For help on viewing, see Appendix A.)
215
Outline
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
13.1. Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
13.2. The Schur and Spectral Factorizations . . . . . . . . . . . . . . . . . . . . . . . . . 220
13.3. Relation Between the SVD and the Spectral Decomposition . . . . . . . . . . . . . . 222
216
13.1. Definition
13.1
217
Definition
Definition 13.1 Let A Cmm . Then C and nonzero x Cm are said to be an eigenvalue and corresponding eigenvector if Ax = x. The tuple (, x) is said to be an eigenpair. The set of all eigenvalues of A
is denoted by (A) and is called the spectrum of A. The spectral radius of A, (A) equals the magnitude
of the largest eigenvalue, in magnitude:
(A) = max(A) ||.
The action of A on an eigenvector x is as if it were multiplied by a scalar. The direction does not
change, only its length is scaled:
Ax = x.
Theorem 13.2 Scalar is an eigenvalue of A if and only if
is singular
det(I A) = 0
etc.
The following exercises expose some other basic properties of eigenvalues and eigenvectors:
Homework 13.3 Eigenvectors are not unique.
* SEE ANSWER
Homework 13.4 Let be an eigenvalue of A and let E (A) = {x Cm |Ax = x} denote the set of all
eigenvectors of A associated with (including the zero vector, which is not really considered an eigenvector). Show that this set is a (nontrivial) subspace of Cm .
* SEE ANSWER
Definition 13.5 Given A Cmm , the function pm () = det(I A) is a polynomial of degree at most m.
This polynomial is called the characteristic polynomial of A.
The definition of pm () and the fact that is a polynomial of degree at most m is a consequence of the
definition of the determinant of an arbitrary square matrix. This definition is not particularly enlightening
other than that it allows one to succinctly related eigenvalues to the roots of the characteristic polynomial.
Remark 13.6 The relation between eigenvalues and the roots of the characteristic polynomial yield a
disconcerting insight: A general formula for the eigenvalues of a m m matrix with m > 4 does not exist.
218
The reason is that there is no general formula for the roots of a polynomial of degree m > 4. Given any
polynomial pm () of degree m, an m m matrix can be constructed such that its characteristic polynomial
is pm (). If
pm () = 0 + 1 + + m1 m1 + m
and
A=
n1 n2 n3 1 0
1
0
..
.
0
..
.
1
..
.
..
.
0
..
.
0
..
.
then
pm () = det(I A)
Hence, we conclude that no general formula can be found for the eigenvalues for m m matrices when
m > 4. What we will see in future Notes on ... is that we will instead create algorithms that converge to
the eigenvalues and/or eigenvalues of matrices.
Theorem 13.7 Let A Cmm and pm () be its characteristic polynomial. Then (A) if and only if
pm () = 0.
Proof: This is an immediate consequence of Theorem 13.2.
In other words, is an eigenvalue of A if and only if it is a root of pm (). This has the immediate
consequence that A has at most m eigenvalues and, if one counts multiple roots by their multiplicity, it has
exactly m eigenvalues. (One says Matrix A Cmm has m eigenvalues, multiplicity counted.)
Homework 13.8 The eigenvalues of a diagonal matrix equal the values on its diagonal. The eigenvalues
of a triangular matrix equal the values on its diagonal.
* SEE ANSWER
Corollary 13.9 If A Rmm is real valued then some or all of its eigenvalues may be complex valued. In
this case, if (A) then so is its conjugate, .
Proof: It can be shown that if A is real valued, then the coefficients of its characteristic polynomial are
all real valued. Complex roots of a polynomial with real coefficients come in conjugate pairs.
It is not hard to see that an eigenvalue that is a root of multiplicity k has at most k eigenvectors. It
is, however, not necessarily the case that an eigenvalue that is a root of multiplicity k also has k linearly
independent eigenvectors. In other words, the null space of I A may have dimension less then the
algebraic multiplicity of . The prototypical counter example is the k k matrix
1 0 0 0
0 1 ... 0 0
. .
.
.
.
.
.. . . . . . . .. ..
J() =
0 0 0 ... 1
13.1. Definition
219
where k > 1. Observe that I J() is singular if and only if = . Since I J() has k 1 linearly
independent columns its null-space has dimension one: all eigenvectors are scalar multiples of each other.
This matrix is known as a Jordan block.
Definition 13.10 A matrix A Cmm that has fewer than m linearly independent eigenvectors is said to
be defective. A matrix that does have m linearly independent eigenvectors is said to be nondefective.
Theorem 13.11 Let A Cmm . There exist nonsingular matrix X and diagonal matrix such that A =
XX 1 if and only if A is nondefective.
Proof:
(). Assume there exist nonsingular matrix X and diagonal matrix so that A = XX 1 . Then, equivalently, AX = X. Partition X by columns so that
0 0
0
0 1
0
=
A x0 x1 xm1
x0 x1 xm1 . .
.
.
..
.. ..
..
0 0 m1
=
0 x0 1 x1 m1 xm1 .
Then, clearly, Ax j = j x j so that A has m linearly independent eigenvectors and is thus nondefective.
(). Assume that A is nondefective. Let {x0 , , xm1
} equal m linearly independent eigenvectors corresponding to eigenvalues {0 , , m1 }. If X = x0 x1 xm1 then AX = X where =
diag(0 , . . . , m1 ). Hence A = XX 1 .
Definition 13.12 Let (A) and pm () be the characteristic polynomial of A. Then the algebraic
multiplicity of is defined as the multiplicity of as a root of pm ().
Definition 13.13 Let (A). Then the geometric multiplicity of is defined to be the dimension of
E (A). In other words, the geometric multiplicity of equals the number of linearly independent eigenvectors that are associated with .
Theorem 13.14 Let A Cmm . Let the eigenvalues of A be given by 0 , 1 , , k1 , where an eigenvalue
is listed exactly n times if it has geometric multiplicity n. There exists a nonsingular matrix X such that
J(0 )
0
J(1 )
0
A=X .
.
.
.
.
..
..
..
..
0
0
J(k1 )
For our discussion, the sizes of the Jordan blocks J(i ) are not particularly important. Indeed, this decomposition, known as the Jordan Canonical Form of matrix A, is not particularly interesting in practice. For
this reason, we dont discuss it further and do not we give its proof.
13.2
220
Theorem 13.15 Let A,Y, B Cmm , assume Y is nonsingular, and let B = Y 1 AY . Then (A) = (B).
Proof: Let (A) and x be an associated eigenvector. Then Ax = x if and only if Y 1 AYY 1 x = Y 1 x
if and only if B(Y 1 x) = (Y 1 x).
Definition 13.16 Matrices A and B are said to be similar if there exists a nonsingular matrix Y such that
B = Y 1 AY .
Given a nonsingular matrix Y the transformation Y 1 AY is called a similarity transformation of A.
It is not hard to expand the last proof to show that if A is similar to B and (A) has algebraic/geometric multiplicity k then (B) has algebraic/geometric multiplicity k.
The following is the fundamental theorem for the algebraic eigenvalue problem:
Theorem 13.17 Schur Decomposition Theorem Let A Cmm . Then there exist a unitary matrix Q and
upper triangular matrix U such that A = QUQH . This decomposition is called the Schur decomposition
of matrix A.
In the above theorem, (A) = (U) and hence the eigenvalues of A can be found on the diagonal of U.
Proof: We will outline how to construct Q so that QH AQ = U, an upper triangular matrix.
Since a polynomial of degree m has at least one root, matrix A has at least one eigenvalue, 1 , and
corresponding eigenvector
we normalize this eigenvector to have length one. Thus Aq1 = 1 q1 .
q1 , where
Choose Q2 so that Q = q1 Q1 is unitary. Then
H
Q AQ =
A q1 Q2
q1 Q2
H AQ
H AQ
T
qH
Aq
q
w
1
2
1
2
1
1
=
= 1
,
= 1
H
H
H
H
Q2 Aq1 Q2 AQ2
Q2 q1 Q2 AQ2
0 B
H
where wT = qH
1 AQ2 and B = Q2 AQ2 . This insight can be used to construct an inductive proof.
One should not mistake the above theorem and its proof as a constructive way to compute the Schur
decomposition: finding an eigenvalue and/or the eigenvalue associated with it is difficult.
AT L
AT R
ABR
A=
QT L
0
0
QBR
QT L AT L QH
TL
QT L AT R QH
BR
QBR ABR QH
BR
QT L
QBR
221
Homework 13.19 Prove Lemma 13.18. Then generalize it to a result for block upper triangular matrices:
A0,0 A0,1
A=
0
A1,1
..
.
0
A0,N1
A1,N1
..
.
AN1,N1
* SEE ANSWER
AT L
AT R
ABR
Homework 13.21 Prove Corollary 13.20. Then generalize it to a result for block upper triangular matrices.
* SEE ANSWER
A theorem that will later allow the eigenvalues and vectors of a real matrix to be computed (mostly)
without requiring complex arithmetic is given by
Theorem 13.22 Let A Rmm . Then there exist a unitary matrix Q Rmm and quasi upper triangular
matrix U Rmm such that A = QUQT .
A quasi upper triangular matrix is a block upper triangular matrix where the blocks on the diagonal are
1 1 or 2 2. Complex eigenvalues of A are found as the complex eigenvalues of those 2 2 blocks on
the diagonal.
Theorem 13.23 Spectral Decomposition Theorem Let A Cmm be Hermitian. Then there exist a
unitary matrix Q and diagonal matrix Rmm such that A = QQH . This decomposition is called the
Spectral decomposition of matrix A.
Proof: From the Schur Decomposition Theorem we know that there exist a matrix Q and upper triangular
matrix U such that A = QUQH . Since A = AH we know that QUQH = QU H QH and hence U = U H . But
a Hermitian triangular matrix is diagonal with real valued diagonal entries.
What we conclude is that a Hermitian matrix is nondefective and its eigenvectors can be chosen to
form an orthogonal basis.
Homework 13.24 Let A be Hermitian and and be distinct eigenvalues with eigenvectors x and x ,
respectively. Then xH x = 0. (In other words, the eigenvectors of a Hermitian matrix corresponding to
distinct eigenvalues are orthogonal.)
* SEE ANSWER
13.3
222
Homework 13.25 Let A Cmm be a Hermitian matrix, A = QQH its Spectral Decomposition, and
A = UV H its SVD. Relate Q, U, V , , and .
* SEE ANSWER
Homework 13.26 Let A Cmm and A = UV H its SVD. Relate the Spectral decompositions of AH A and
AAH to U, V , and .
* SEE ANSWER
Chapter
14
Video
Read disclaimer regarding the videos in the preface!
Tragically, I forgot to turn on the camera... This was a great lecture!
223
Outline
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
14.1. The Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
14.1.1. First attempt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
14.1.2. Second attempt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
14.1.3. Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
14.1.4. Practical Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
14.1.5. The Rayleigh quotient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
14.1.6. What if |0 | |1 |? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
14.2. The Inverse Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
14.3. Rayleigh-quotient Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
224
14.1
225
The Power Method is a simple method that under mild conditions yields a vector corresponding to the
eigenvalue that is largest in magnitude.
Throughout this section we will assume that a given matrix A Cmm is nondeficient: there exists a
nonsingular matrix X and diagonal matrix such that A = XX 1 . (Sometimes this is called a diagonalizable matrix since there exists a matrix X so that
X 1 AX = or, equivalently, A = XX 1 .
From Notes on Eigenvalues and Eigenvectors we know then that the columns of X equal eigenvectors
of A and the elements on the diagonal of equal the eigenvalues::
0 0
0
0 1
0
and = .
X = x0 x0 xm1
.
.
.
..
..
..
..
0 0 m1
so that
Axi = i xi
for i = 0, . . . , m 1.
14.1.1
First attempt
Now, let v(0) Cmm be an initial guess. Our (first attempt at the) Power Method iterates as follows:
for k = 0, . . .
v(k+1) = Av(k)
endfor
Clearly v(k) = Ak v(0) . Let
v(0) = Xy = 0 x0 + 1 x1 + + m1 xm1 .
What does this mean? We view the columns of X as forming a basis for Cm and then the elements in
vector y = X 1 v(0) equal the coefficients for describing v(0) in that basis. Then
v(1) = Av(0) =
=
(2)
(1)
v = Av
=
..
.
(k)
(k1)
v = Av
=
A (0 x0 + 1 x1 + + m1 xm1 )
0 0 x0 + 1 0 x1 + + m1 m1 xm1 ,
0 20 x0 + 1 21 x1 + + m1 2m1 xm1 ,
0 k0 x0 + 1 k1 x1 + + m1 km1 xm1 .
226
Now, as long as 0 6= 0 clearly 0 k0 x0 will eventually dominate which means that v(k) will start pointing
in the direction of x0 . In other words, it will start pointing in the direction of an eigenvector corresponding
to 0 . The problem is that it will become infinitely long if |0 | > 1 or infinitessimily short if |0 | < 1. All
is good if |0 | = 1.
14.1.2
Second attempt
Again, let v(0) Cmm be an initial guess. The second attempt at the Power Method iterates as follows:
for k = 0, . . .
v(k+1) = Av(k) /0
endfor
It is not hard to see that then
v(k) = Av(k1) /0 = Ak v(0) /k0
k
k
0
1
m1 k
= 0
x0 + 1
x1 + + m1
xm1
0
0
0
k
1
m1 k
= 0 x0 + 1
x1 + + m1
xm1 .
0
0
Clearly limk v(k) = 0 x0 , as long as 0 6= 0, since 0k < 1 for k > 0.
Another way of stating this is to notice that
Ak = (AA A) = (XX 1 )(XX 1 ) (XX 1 ) = Xk X 1 .
| {z }
|
{z
}
k
k times
so that
v(k) =
=
=
=
Ak v(0) /k0
Ak Xy/k0
Xk X 1 Xy/k0
Xk y/k0
= X k /k0 y = X
k
1
0
0
..
...
...
.
1
0
0
..
.
m1
0
y.
k
227
Now, since 0k < 1 for k > 1 we can argue that
lim v(k)
= lim X
k
1
0
..
.
0
0
k
1
0
..
.
..
0
0
..
.
m1
0
y = X
k
0 0 0
y
.. . . . . ..
.
. .
.
0 0 0
= X0 e0 = 0 Xe0 = 0 x0 .
Thus, as long as 0 6= 0 (which means v must have a component in the direction of x0 ) this method will
eventually yield a vector in the direction of x0 . However, this time the problem is that we dont know 0
when we start.
14.1.3
Convergence
Before we make the algorithm practical, let us examine how fast the iteration converges. This requires a
few definitions regarding rates of convergence.
Definition 14.1 Let 0 , 1 , 2 , . . . C be an infinite sequence of scalars. Then k is said to converge to
if
lim |k | = 0.
k
Let x0 , x1 , x2 , . . .
norm if
Cm
Notice that because of the equivalence of norms, if the sequence converges in one norm, it converges in
all norms.
Definition 14.2 Let 0 , 1 , 2 , . . . C be an infinite sequence of scalars that converges to . Then
k is said to converge linearly to if for large enough k
|k+1 | C|k |
for some constant C < 1.
k is said to converge super-linearly to if
|k+1 | Ck |k |
with Ck 0.
k is said to converge quadratically to if for large enough k
|k+1 | C|k |2
for some constant C.
228
0.99000
0.98010
0.97030
0.96060
0.95099
0.94148
0.93206
0.92274
0.91351
0.99000
0.970299
0.932065
0.860058
0.732303
0.530905
0.279042
0.077085
0.005882
0.000034
229
If we consider the correct result then, eventually, the number of correct digits roughly doubles in each
iteration. This can be explained as follows: If |k | < 1, then the number of correct decimal digits is
given by
log10 |k |.
Since log10 is a monotonically increasing function,
log10 |k+1 | log10 C|k |2 = log10 (C) + 2 log10 |k | 2 log10 (|k |
and hence
log10 |k+1 | 2( log10 (|k | ).
{z
}
|
{z
}
|
number of correct
number of correct
digits in k+1
digits in k
Cubic convergence is dizzyingly fast: Eventually the number of correct digits triples from one iteration
to the next.
We now define a convenient norm.
Lemma 14.3 Let X Cmm be nonsingular. Define k kX : Cm R by kykX = kXyk for some given norm
k k : Cm R. Then k kX is a norm.
Homework 14.4 Prove Lemma 14.3.
* SEE ANSWER
With this new norm, we can do our convergence analysis:
1
0
k
1
0
1 (0)
0
(k)
k (0) k
v 0 x0 = A v /0 0 x0 = X .
X v 0 x0
.
.
.
..
..
..
..
m1 k
0
0
1
0
0
0
0
k
k
1
1
0
0
0
0
= X .
y
x
=
X
0 0
..
..
...
...
...
...
..
...
.
.
k
m1
m1 k
0
0
0
0
0
0
Hence
X 1 (v(k) 0 x0 ) =
k
1
0
0
..
...
...
.
0
0
..
.
m1
0
k
230
and
X 1 (v(k+1) 0 x0 ) =
0 10
.. . . . .
.
.
.
0
..
.
1 (k)
X (v 0 x0 ).
m1
0
Now, let k k be a p-norm1 and its induced matrix norm and k kX 1 as defined in Lemma 14.3. Then
kv(k+1) 0 x0 kX 1 = kX 1 (v(k+1) 0 x0 )k
0 0
0
0 1
0
1 (k)
=
. .0 .
X
(v
x
)
0 0
..
.. . . . .
.
m1
0 0
0
1
1
kX 1 (v(k) 0 x0 )k = kv(k) 0 x0 kX 1 .
0
0
This shows that, in this norm, the convergence of v(k) to 0 x0 is linear: The difference between current
approximation, v(k) , and the solution, x0 , is reduced by at least a constant factor in each iteration.
14.1.4
The following algorithm, known as the Power Method, avoids the problem of v(k) growing or shrinking in
length, without requiring 0 to be known, by scaling it to be of unit length at each step:
for k = 0, . . .
v(k+1) = Av(k)
v(k+1) = v(k+1) /kv(k+1) k
endfor
14.1.5
choose a p-norm to make sure that the norm of a diagonal matrix equals the absolute value of the largest element (in
magnitude) on its diagonal.
231
What if |0 | |1 |?
14.1.6
Now, what if
|0 | = = |k1 | > |k | . . . |m1 |?
By extending the above analysis one can easily show that v(k) will converge to a vector in the subspace
spanned by the eigenvectors associated with 0 , . . . , k1 .
An important special case is when k = 2: if A is real valued then 0 still may be complex valued in
which case 0 is also an eigenvalue and it has the same magnitude as 0 . We deduce that v(k) will always
be in the space spanned by the eigenvectors corresponding to 0 and 0 .
14.2
The Power Method homes in on an eigenvector associated with the largest (in magnitude) eigenvalue. The
Inverse Power Method homes in on an eigenvector associated with the smallest eigenvalue (in magnitude).
Throughout this section we will assume that a given matrix A Cmm is nondeficient and nonsingular
so that there exist matrix X and diagonal matrix such that A = XX 1 . We further assume that =
diag(0 , , m1 ) and
|0 | |1 | |m2 | > |m1 |.
Theorem 14.6 Let A Cmm be nonsingular. Then and x are an eigenvalue and associated eigenvector
of A if and only if 1/ and x are an eigenvalue and associated eigenvector of A1 .
Homework 14.7 Assume that
|0 | |1 | |m2 | > |m1 | > 0.
Show that
1 1 1
>
1 .
0
m1 m2 m3
* SEE ANSWER
Thus, an eigenvector associated with the smallest (in magnitude) eigenvalue of A is an eigenvector associated with the largest (in magnitude) eigenvalue of A^{-1}. This suggests the following naive iteration:
for k = 0, . . .
   v^(k+1) = A^{-1} v^(k)
   v^(k+1) = λ_{m-1} v^(k+1)
endfor
Of course, we would want to factor A = LU once and solve L(Uv^(k+1)) = v^(k) rather than multiplying
with A^{-1}. From the analysis of the convergence of the second attempt for a Power Method algorithm
we conclude that now
    ‖v^(k+1) − ψ_{m-1} x_{m-1}‖_{X^{-1}} ≤ |λ_{m-1}/λ_{m-2}| ‖v^(k) − ψ_{m-1} x_{m-1}‖_{X^{-1}}.
A practical Inverse Power Method algorithm is given by
for k = 0, . . .
   v^(k+1) = A^{-1} v^(k)
   v^(k+1) = v^(k+1) / ‖v^(k+1)‖
endfor
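A minimal SciPy sketch of this idea, factoring A once and reusing the LU factors in every iteration; the function name and test matrix are illustrative choices, not part of the notes:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def inverse_power_method(A, v, num_iters=50):
        lu_piv = lu_factor(A)              # factor A = LU once
        for _ in range(num_iters):
            v = lu_solve(lu_piv, v)        # solve A v_next = v instead of forming A^{-1}
            v = v / np.linalg.norm(v)
        return v

    A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
    v = inverse_power_method(A, np.ones(3))
    print(v @ A @ v)   # Rayleigh quotient: approximates the smallest (in magnitude) eigenvalue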
Often, we would expect the Inverse Power Method to converge faster than the Power Method. For
example, take the case where the eigenvalues are equally spaced between 0 and m, say λ_k = m − k. Then
    |λ1/λ0| = (m − 1)/m   while   |λ_{m-1}/λ_{m-2}| = 1/2,
which means that the Power Method converges much more slowly than the Inverse Power Method.
14.3
Rayleigh-quotient Iteration
A natural way to accelerate the Inverse Power Method is to shift the matrix by an approximation μ of λ_{m-1} and to iterate with (A − μI)^{-1}:
for k = 0, . . .
   v^(k+1) = (A − μI)^{-1} v^(k)
   v^(k+1) = v^(k+1) / ‖v^(k+1)‖
endfor
The question now becomes how to choose μ so that it is a good guess for λ_{m-1}. Often an application
inherently supplies a reasonable approximation for the smallest eigenvalue or an eigenvalue of particular
interest. However, we know that eventually v^(k) becomes a good approximation for x_{m-1}, and therefore
the Rayleigh quotient gives us a way to find a good approximation for λ_{m-1}. This suggests the (naive)
Rayleigh-quotient iteration:
for k = 0, . . .
   μ_k = v^(k)ᴴ A v^(k) / ( v^(k)ᴴ v^(k) )
   v^(k+1) = (A − μ_k I)^{-1} v^(k)
   v^(k+1) = (λ_{m-1} − μ_k) v^(k+1)
endfor
Now²
    ‖v^(k+1) − ψ_{m-1} x_{m-1}‖_{X^{-1}} ≤ |λ_{m-1} − μ_k| / |λ_{m-2} − μ_k| ‖v^(k) − ψ_{m-1} x_{m-1}‖_{X^{-1}},
with
    lim_{k→∞} (λ_{m-1} − μ_k) = 0,
which means that superlinear convergence is observed. In fact, it can be shown that once k is large enough,
    ‖v^(k+1) − ψ_{m-1} x_{m-1}‖_{X^{-1}} ≤ C ‖v^(k) − ψ_{m-1} x_{m-1}‖²_{X^{-1}},
which is known as quadratic convergence. Roughly speaking this means that every iteration doubles
the number of correct digits in the current approximation. To prove this, one shows that |λ_{m-1} − μ_k| ≤
C ‖v^(k) − ψ_{m-1} x_{m-1}‖_{X^{-1}}.
Better yet, it can be shown that if A is Hermitian, then, once k is large enough,
    ‖v^(k+1) − ψ_{m-1} x_{m-1}‖_{X^{-1}} ≤ C ‖v^(k) − ψ_{m-1} x_{m-1}‖³_{X^{-1}},
which is known as cubic convergence. Roughly speaking this means that every iteration triples the number
of correct digits in the current approximation. This is mind-bogglingly fast convergence!
A practical Rayleigh quotient iteration is given by
v^(0) = v^(0) / ‖v^(0)‖₂
for k = 0, . . .
   μ_k = v^(k)ᴴ A v^(k)        (now ‖v^(k)‖₂ = 1)
   v^(k+1) = (A − μ_k I)^{-1} v^(k)
   v^(k+1) = v^(k+1) / ‖v^(k+1)‖
endfor
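A minimal NumPy sketch of this practical Rayleigh quotient iteration; the dense solve and the small test matrix are illustrative (a production code would exploit structure and handle near-singular solves more carefully):

    import numpy as np

    def rayleigh_quotient_iteration(A, v, num_iters=10):
        v = v / np.linalg.norm(v)
        for _ in range(num_iters):
            mu = v @ A @ v                                   # Rayleigh quotient (v has unit length)
            v = np.linalg.solve(A - mu * np.eye(A.shape[0]), v)
            v = v / np.linalg.norm(v)
        return v, v @ A @ v

    A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
    v, lam = rayleigh_quotient_iteration(A, np.array([0.0, 0.0, 1.0]))
    print(lam)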
² I think... I have not checked this thoroughly. But the general idea holds. λ_{m-1} has to be defined as the eigenvalue to which
the method eventually converges.
Chapter 15
Outline
Video
Outline
15.1. Preliminaries
15.2. Subspace Iteration
15.3. The QR Algorithm
  15.3.1. A basic (unshifted) QR algorithm
  15.3.2. A basic shifted QR algorithm
15.4. Reduction to Tridiagonal Form
  15.4.1. Householder transformations (reflectors)
  15.4.2. Algorithm
15.5. The QR algorithm with a Tridiagonal Matrix
  15.5.1. Givens rotations
15.6. QR Factorization of a Tridiagonal Matrix
15.7. The Implicitly Shifted QR Algorithm
  15.7.1. Upper Hessenberg and tridiagonal matrices
  15.7.2. The Implicit Q Theorem
  15.7.3. The Francis QR Step
  15.7.4. A complete algorithm
15.8. Further Reading
  15.8.1. More on reduction to tridiagonal form
  15.8.2. Optimizing the tridiagonal QR algorithm
15.9. Other Algorithms
  15.9.1. Jacobi's method for the symmetric eigenvalue problem
  15.9.2. Cuppen's Algorithm
  15.9.3. The Method of Multiple Relatively Robust Representations (MRRR)
15.10. The Nonsymmetric QR Algorithm
  15.10.1. A variant of the Schur decomposition
  15.10.2. Reduction to upperHessenberg form
  15.10.3. The implicitly double-shifted QR algorithm
15.1 Preliminaries
The QR algorithm is a standard method for computing all eigenvalues and eigenvectors of a matrix. In this
note, we focus on the real valued symmetric eigenvalue problem (the case where A ∈ R^{n×n}). For this case,
recall the Spectral Decomposition Theorem:
Theorem 15.1 If A ∈ R^{n×n} is symmetric then there exist a unitary matrix Q and diagonal matrix Λ such that A = QΛQ^T.
We will partition Q = ( q0 | ··· | q_{n-1} ) and assume that Λ = diag(λ0, . . ., λ_{n-1}), so that throughout
this note q_i and λ_i refer to the ith column of Q and the ith diagonal element of Λ, which means that each
tuple (λ_i, q_i) is an eigenpair.
15.2
Subspace Iteration
We start with a matrix V ∈ R^{n×r} with normalized columns and iterate something like
V (0) = V
for k = 0, . . . convergence
V (k+1) = AV (k)
Normalize the columns to be of unit length.
end for
The problem with this approach is that all columns will (likely) converge to an eigenvector associated with
the dominant eigenvalue, since the Power Method is being applied to all columns simultaneously. We will
now lead the reader through a succession of insights towards a practical algorithm.
Let us examine what V̂ = AV looks like, for the simple case where V = ( v0 | v1 | v2 ) (three columns).
We know that
    v_j = Q (Q^T v_j) = Q y_j,   where y_j = Q^T v_j.
Hence
    v0 = Σ_{j=0}^{n-1} γ_{0,j} q_j,   v1 = Σ_{j=0}^{n-1} γ_{1,j} q_j,   and   v2 = Σ_{j=0}^{n-1} γ_{2,j} q_j,
where γ_{i,j} denotes the jth entry of y_i.
Then
    AV = ( Σ_{j=0}^{n-1} γ_{0,j} A q_j | Σ_{j=0}^{n-1} γ_{1,j} A q_j | Σ_{j=0}^{n-1} γ_{2,j} A q_j )
       = ( Σ_{j=0}^{n-1} γ_{0,j} λ_j q_j | Σ_{j=0}^{n-1} γ_{1,j} λ_j q_j | Σ_{j=0}^{n-1} γ_{2,j} λ_j q_j ).
If we happened to know λ0, λ1, and λ2 then we could divide the columns by these, respectively, and
get new vectors
    ( γ_{0,0} q0 + Σ_{j=1}^{n-1} γ_{0,j} (λ_j/λ0) q_j |
      γ_{1,0} (λ0/λ1) q0 + γ_{1,1} q1 + Σ_{j=2}^{n-1} γ_{1,j} (λ_j/λ1) q_j |
      γ_{2,0} (λ0/λ2) q0 + γ_{2,1} (λ1/λ2) q1 + γ_{2,2} q2 + Σ_{j=3}^{n-1} γ_{2,j} (λ_j/λ2) q_j ).      (15.1)
Assume that |λ0| > |λ1| > |λ2| > |λ3| ≥ ··· ≥ |λ_{n-1}|. Then, similarly to what we saw for the Power Method,
The first column will see components in the direction of {q1, . . ., q_{n-1}} shrink relative to the
component in the direction of q0.
The second column will see components in the direction of {q2, . . ., q_{n-1}} shrink relative to the
component in the direction of q1, but the component in the direction of q0 increases, relatively,
since |λ0/λ1| > 1.
The third column will see components in the direction of {q3, . . ., q_{n-1}} shrink relative to the
component in the direction of q2, but the components in the directions of q0 and q1 increase,
relatively, since |λ0/λ2| > 1 and |λ1/λ2| > 1.
How can we make it so that v̂_j converges to a vector in the direction of q_j?
If we happen to know q0, then we can subtract out the component of
    v̂1 = γ_{1,0} (λ0/λ1) q0 + γ_{1,1} q1 + Σ_{j=2}^{n-1} γ_{1,j} (λ_j/λ1) q_j
in the direction of q0:
    v̂1 − (q0^T v̂1) q0 = γ_{1,1} q1 + Σ_{j=2}^{n-1} γ_{1,j} (λ_j/λ1) q_j,
so that we are left with the component in the direction of q1 and components in directions of
q2, . . ., q_{n-1} that are suppressed every time through the loop.
Similarly, if we also know q1, the components of v̂2 in the direction of q0 and q1 can be subtracted
from that vector.
We do not know λ0, λ1, and λ2 but from the discussion about the Power Method we remember that
we can just normalize the so updated v̂0, v̂1, and v̂2 to have unit length.
How can we make these insights practical?
We do not know q0, q1, and q2, but we can informally argue that if we keep iterating,
The vector v̂0, normalized in each step, will eventually point in the direction of q0.
span(v̂0, v̂1) will eventually equal span(q0, q1). In each iteration, we can subtract the component of v̂1 in the direction of v̂0 from v̂1, and then
normalize v̂1, so that the result eventually points in the direction of q1.
span(v̂0, v̂1, v̂2) will eventually equal span(q0, q1, q2). In each iteration, we can subtract the component of v̂2 in the directions of v̂0 and v̂1 from v̂2,
and then normalize the result, to make v̂2 eventually point in the direction of q2.
What we recognize is that normalizing v̂0, subtracting out the component of v̂1 in the direction of v̂0, and
then normalizing v̂1, etc., is exactly what the Gram-Schmidt process does. And thus, we can use any
convenient (and stable) QR factorization method. This also shows how the method can be generalized to
work with more than three columns and even all columns simultaneously.
The algorithm now becomes:
V^(0) = I_{n×p}
for k = 0, . . . until convergence
   AV^(k) → V^(k+1) R^(k+1)      (QR factorization)
end for
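A minimal NumPy sketch of this orthogonal (subspace) iteration, using a reduced QR factorization as the normalization step; the test matrix and the number of columns are illustrative choices:

    import numpy as np

    def subspace_iteration(A, p, num_iters=100):
        V = np.eye(A.shape[0], p)           # V^(0) = I_{n x p}
        for _ in range(num_iters):
            V, R = np.linalg.qr(A @ V)      # AV^(k) -> V^(k+1) R^(k+1)
        return V

    A = np.diag([5.0, 4.0, 3.0, 2.0, 1.0]) + 0.01 * np.ones((5, 5))
    V = subspace_iteration(A, 3)
    print(V.T @ A @ V)    # approaches a diagonal matrix holding the 3 dominant eigenvalues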
Now consider again (15.1), focusing on the third column:
    γ_{2,0} (λ0/λ2) q0 + γ_{2,1} (λ1/λ2) q1 + γ_{2,2} q2 + γ_{2,3} (λ3/λ2) q3 + Σ_{j=4}^{n-1} γ_{2,j} (λ_j/λ2) q_j.
This shows that, if the components in the direction of q0 and q1 are subtracted out, it is the component
in the direction of q3 that is diminished in length the most slowly, dictated by the ratio |λ3/λ2|. This, of
course, generalizes: the jth column of V^(k), v_j^(k), will have a component in the direction of q_{j+1}, of length
|q_{j+1}^T v_j^(k)|, that can be expected to shrink most slowly.
We demonstrate this in Figure 15.1, which shows the execution of the algorithm with p = n for a 5 × 5
matrix, and shows how the |q_{j+1}^T v_j^(k)| converge to zero as a function of k.
Figure 15.1: Convergence of the subspace iteration for a 5 × 5 matrix. (This graph is mislabeled: x should
be labeled with v.) The (linear) convergence of v_j to a vector in the direction of q_j is dictated by how
quickly the component in the direction of q_{j+1} converges to zero. The line labeled |q_{j+1}^T x_j| plots the length
of the component in the direction of q_{j+1} as a function of the iteration number.
Next, we observe that if V ∈ R^{n×n} in the above iteration (which means we are iterating with n vectors
at a time), then AV yields a next-to-last column of the form
    Σ_{j=0}^{n-3} γ_{n-2,j} (λ_j/λ_{n-2}) q_j + γ_{n-2,n-2} q_{n-2} + γ_{n-2,n-1} (λ_{n-1}/λ_{n-2}) q_{n-1},
where γ_{i,j} = q_j^T v_i. Thus, given that the components in the direction of q_j, j = 0, . . ., n−2, can be expected
in later iterations to be greatly reduced by the QR factorization that subsequently happens with AV^(k), we
notice that it is the ratio λ_{n-1}/λ_{n-2} that dictates how fast the component in the direction of q_{n-1} disappears from v_{n-2}^(k).
This is a ratio we also saw in the Inverse Power Method and that we noticed we could accelerate in the
Rayleigh Quotient Iteration: at each iteration we should shift the matrix to (A − μ_k I) where μ_k ≈ λ_{n-1}.
Since the last column of V^(k) is supposed to be converging to q_{n-1}, it seems reasonable to use μ_k =
v_{n-1}^{(k)T} A v_{n-1}^{(k)} (recall that v_{n-1}^(k) has unit length, so this is the Rayleigh quotient).
The above discussion motivates the iteration
Figure 15.2: Convergence of the shifted subspace iteration for a 5 × 5 matrix. (This graph is mislabeled: x
should be labeled with v.) What this graph shows is that the components of v_4 in the directions q0 through
q3 disappear very quickly. The vector v_4 quickly points in the direction of the eigenvector associated
with the smallest (in magnitude) eigenvalue. Just like the Rayleigh-quotient iteration is not guaranteed to
converge to the eigenvector associated with the smallest (in magnitude) eigenvalue, the shifted subspace
iteration may home in on a different eigenvector than the one associated with the smallest (in magnitude)
eigenvalue. (Something is wrong in this graph: all curves should quickly drop to (near) zero!)
V^(0) := I        (V^(0) ∈ R^{n×n}!)
for k := 0, . . . until convergence
   μ_k := v_{n-1}^{(k)T} A v_{n-1}^{(k)}         (Rayleigh quotient)
   (A − μ_k I) V^(k) → V^(k+1) R^(k+1)           (QR factorization)
end for
Notice that this does not require one to solve with (A − μ_k I), unlike in the Rayleigh Quotient Iteration.
However, it does require a QR factorization, which requires more computation than the LU factorization
(approximately (4/3) n³ flops).
We demonstrate the convergence in Figure 15.2, which shows the execution of the algorithm with a
5 × 5 matrix and illustrates how the |q_j^T v_{n-1}^(k)| converge to zero as a function of k.
Subspace iteration:
   Â^(0) := A;  V̂^(0) := I
   for k := 0, . . . until convergence
      AV̂^(k) → V̂^(k+1) R̂^(k+1)           (QR factorization)
      Â^(k+1) := V̂^(k+1)T A V̂^(k+1)
   end for
QR algorithm:
   A^(0) := A;  V^(0) := I
   for k := 0, . . . until convergence
      A^(k) → Q^(k+1) R^(k+1)              (QR factorization)
      A^(k+1) := R^(k+1) Q^(k+1)
      V^(k+1) := V^(k) Q^(k+1)
   end for
Figure 15.3: Basic subspace iteration and basic QR algorithm.
15.3
The QR Algorithm
The QR algorithm is a classic algorithm for computing all eigenvalues and eigenvectors of a matrix.
While we explain it for the symmetric eigenvalue problem, it generalizes to the nonsymmetric eigenvalue
problem as well.
15.3.1 A basic (unshifted) QR algorithm

We have informally argued that the columns of the orthogonal matrices V^(k) ∈ R^{n×n} generated by the
(unshifted) subspace iteration converge to eigenvectors of matrix A. (The exact conditions under which
this happens have not been fully discussed.) In Figure 15.3 (left), we restate the subspace iteration. In
it, we denote the matrices V^(k) and R^(k) from the subspace iteration by V̂^(k) and R̂^(k) to distinguish them from
the ones computed by the algorithm on the right. The algorithm on the left also computes the matrix
Â^(k) = V̂^(k)T A V̂^(k), a matrix that hopefully converges to Λ, the diagonal matrix with the eigenvalues of A
on its diagonal. To the right is the QR algorithm. The claim is that the two algorithms compute the same
quantities.
Homework 15.2 Prove that in Figure 15.3, V̂^(k) = V^(k) and Â^(k) = A^(k), k = 0, . . ..
* SEE ANSWER
We conclude that if V̂^(k) converges to the matrix of orthonormal eigenvectors when the subspace iteration
is applied to V^(0) = I, then A^(k) converges to the diagonal matrix Λ with the eigenvalues along its diagonal.
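A minimal NumPy sketch of the basic (unshifted) QR algorithm on the right of Figure 15.3; the symmetric test matrix is an arbitrary example:

    import numpy as np

    def basic_qr_algorithm(A, num_iters=200):
        Ak = A.copy()
        V = np.eye(A.shape[0])
        for _ in range(num_iters):
            Q, R = np.linalg.qr(Ak)   # A^(k) -> Q^(k+1) R^(k+1)
            Ak = R @ Q                 # A^(k+1) := R^(k+1) Q^(k+1)
            V = V @ Q                  # V^(k+1) := V^(k) Q^(k+1)
        return Ak, V

    B = np.random.default_rng(0).standard_normal((4, 4))
    A = (B + B.T) / 2                  # symmetric test matrix
    Ak, V = basic_qr_algorithm(A)
    print(np.diag(Ak))                 # approximates the eigenvalues of A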
15.3.2 A basic shifted QR algorithm

In Figure 15.4 (left), we restate the subspace iteration with shifting. In it, we denote the matrices V^(k) and R^(k)
from the subspace iteration by V̂^(k) and R̂^(k) to distinguish them from the ones computed by the algorithm
on the right. The algorithm on the left also computes the matrix Â^(k) = V̂^(k)T A V̂^(k), a matrix that hopefully
converges to Λ, the diagonal matrix with the eigenvalues of A on its diagonal. To the right is the shifted
QR algorithm. The claim is that the two algorithms compute the same quantities.
Subspace iteration:
   Â^(0) := A;  V̂^(0) := I
   for k := 0, . . . until convergence
      μ_k := v̂_{n-1}^{(k)T} A v̂_{n-1}^{(k)}
      (A − μ_k I) V̂^(k) → V̂^(k+1) R̂^(k+1)       (QR factorization)
      Â^(k+1) := V̂^(k+1)T A V̂^(k+1)
   end for
QR algorithm:
   A^(0) := A;  V^(0) := I
   for k := 0, . . . until convergence
      μ_k = α_{n-1,n-1}^(k)
      A^(k) − μ_k I → Q^(k+1) R^(k+1)            (QR factorization)
      A^(k+1) := R^(k+1) Q^(k+1) + μ_k I
      V^(k+1) := V^(k) Q^(k+1)
   end for
Figure 15.4: Basic shifted subspace iteration and basic shifted QR algorithm.
Homework 15.3 Prove that in Figure 15.4, V̂^(k) = V^(k) and Â^(k) = A^(k), k = 1, . . ..
* SEE ANSWER
We conclude that if V̂^(k) converges to the matrix of orthonormal eigenvectors when the shifted subspace
iteration is applied to V^(0) = I, then A^(k) converges to the diagonal matrix Λ with the eigenvalues along its
diagonal.
The convergence of the basic shifted QR algorithm is illustrated below. Pay particular attention to the
convergence of the last row and column.
A^(0) = [ 2.01131953448  0.05992695085  0.14820940917
          0.05992695085  2.30708673171  0.93623515213
          0.14820940917  0.93623515213  1.68159373379 ]

A^(1) = [ 2.21466116574  0.34213192482  0.31816754245
          0.34213192482  2.54202325042  0.57052186467
          0.31816754245  0.57052186467  1.24331558383 ]

A^(2) = [ 2.63492207667  0.47798481637  0.07654607908
          0.47798481637  2.35970859985  0.06905042811
          0.07654607908  0.06905042811  1.00536932347 ]

A^(3) = [ 2.87588550968  0.32971207176  0.00024210487
          0.32971207176  2.12411444949  0.00014361630
          0.00024210487  0.00014361630  1.00000004082 ]

A^(4) = [ 2.96578660126  0.18177690194  0.00000000000
          0.18177690194  2.03421339873  0.00000000000
          0.00000000000  0.00000000000  1.00000000000 ]

A^(5) = [ 2.9912213907   0.093282073553  0.000000000000
          0.0932820735   2.008778609226  0.000000000000
          0.0000000000   0.000000000000  1.000000000000 ]
Once the off-diagonal elements of the last row and column have converged (are sufficiently small), the
problem can be deflated by applying the following theorem:
Theorem 15.4 Let
    A = [ A_{0,0}  A_{0,1}  ···  A_{0,N-1}
          0        A_{1,1}  ···  A_{1,N-1}
          ⋮                 ⋱    ⋮
          0        0        ···  A_{N-1,N-1} ]
be block upper triangular, with square blocks A_{k,k} on its diagonal. Then the set of eigenvalues of A equals the union of the sets of eigenvalues of A_{0,0}, . . ., A_{N-1,N-1}.
15.4 Reduction to Tridiagonal Form

In the next section, we will see that if A^(0) is a tridiagonal matrix, then so are all A^(k). This reduces the cost
of each iteration from O(n³) to O(n). We first show how unitary similarity transformations can be used to
reduce a matrix to tridiagonal form.
15.4.1 Householder transformations (reflectors)

We briefly review the main tool employed to reduce a matrix to tridiagonal form: the Householder transform, also known as a reflector. Full details were given in Chapter 6.
Definition 15.6 Let u ∈ R^n, τ ∈ R. Then H = H(u) = I − uu^T/τ, where τ = ½ u^T u, is said to be a reflector
or Householder transformation.
We observe:
Let z be any vector that is perpendicular to u. Applying a Householder transform H(u) to z leaves
the vector unchanged: H(u)z = z.
Let any vector x be written as x = z + (u^T x/(u^T u)) u, where z is perpendicular to u and (u^T x/(u^T u)) u is the component
of x in the direction of u. Then H(u)x = z − (u^T x/(u^T u)) u.
This can be interpreted as follows: the space perpendicular to u acts as a mirror: any vector in that
space (along the mirror) is not reflected, while any other vector has the component that is orthogonal
to the space (the component outside and orthogonal to the mirror) reversed in direction. Notice that a
reflection preserves the length of the vector. Also, it is easy to verify that:
1. HH = I (reflecting the reflection of a vector results in the original vector);
2. H = H T , and so H T H = HH T = I (a reflection is an orthogonal matrix and thus preserves the norm);
and
3. if H0, . . ., H_{k-1} are Householder transformations and Q = H0 H1 ··· H_{k-1}, then Q^T Q = QQ^T = I (an
accumulation of reflectors is an orthogonal matrix).
As part of the reduction to condensed form operations, given a vector x we will wish to find a
Householder transformation, H(u), such that H(u)x equals a vector with zeroes below the first element:
H(u)x = ±‖x‖₂ e₀, where e₀ equals the first column of the identity matrix. It can be easily checked that
choosing u = x ∓ ‖x‖₂ e₀ yields the desired H(u). Notice that any nonzero scaling of u has the same property, and the convention is to scale u so that the first element equals one. Let us define [u, τ, h] = HouseV(x)
to be the function that returns u with first element equal to one, τ = ½ u^T u, and h = H(u)x.
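A minimal NumPy sketch of such a HouseV routine, under the sign convention that avoids cancellation (the exact sign convention and interface in these notes may differ; x is assumed nonzero):

    import numpy as np

    def house_v(x):
        # Returns (u, tau, h) with u[0] = 1, tau = u^T u / 2, and h = H(u) x,
        # where H(u) = I - u u^T / tau maps x onto a multiple of e_0.
        normx = np.linalg.norm(x)
        alpha = -np.sign(x[0]) * normx if x[0] != 0 else -normx   # h = alpha * e_0
        u = x.copy()
        u[0] = u[0] - alpha
        u = u / u[0]                    # scale so that the first element equals one
        tau = (u @ u) / 2.0
        h = np.zeros_like(x)
        h[0] = alpha
        return u, tau, h

    u, tau, h = house_v(np.array([3.0, 4.0, 0.0]))
    print(h)   # [-5, 0, 0]: entries below the first are zeroed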
15.4.2 Algorithm
The first step towards computing the eigenvalue decomposition of a symmetric matrix is to reduce the
matrix to tridiagonal form.
The basic algorithm for reducing a symmetric matrix to tridiagonal form, overwriting the original
matrix with the result, can be explained as follows. We assume that symmetric A is stored only in the
lower triangular part of the matrix and that only the diagonal and subdiagonal of the symmetric tridiagonal
matrix is computed, overwriting those parts of A. Finally, the Householder vectors used to zero out parts
of A overwrite the entries that they annihilate (set to zero).
Partition A → [ α11  a21^T ; a21  A22 ].
Let [u21, τ, a21] := HouseV(a21).¹
Update
    [ α11  a21^T ; a21  A22 ] := [ 1  0 ; 0  H ] [ α11  a21^T ; a21  A22 ] [ 1  0 ; 0  H ] = [ α11  a21^T H ; H a21  H A22 H ],
where H = H(u21). Note that a21 := H a21 need not be executed since this update was performed by
the instance of HouseV above.² Also, a12^T is not stored nor updated due to symmetry. Finally, only
the lower triangular part of H A22 H is computed, overwriting A22. The update of A22 warrants closer
scrutiny:
    A22 := (I − u21 u21^T/τ) A22 (I − u21 u21^T/τ)
         = (A22 − u21 y21^T)(I − u21 u21^T/τ)                         where y21^T := u21^T A22 / τ
         = A22 − u21 y21^T − y21 u21^T + (2β/τ) u21 u21^T             where y21 := A22 u21 / τ  and  β := u21^T y21 / 2
         = A22 − u21 ( y21 − (β/τ) u21 )^T − ( y21 − (β/τ) u21 ) u21^T
         = A22 − ( u21 w21^T + w21 u21^T ),                           where w21 := y21 − (β/τ) u21,
a symmetric rank-2 update.
¹ Note that the semantics here indicate that a21 is overwritten by H a21.
² In practice, the zeros below the first element of H a21 are not actually written. Instead, the implementation overwrites these
elements with the corresponding elements of the vector u21.
Algorithm: [A, t] := TriRed_unb(A)
Partition A into quadrants with A_{TL} 0 × 0, and partition t conformally; while m(A_{TL}) < m(A), repartition to expose
    [ A00  a01  A02 ; a10^T  α11  a12^T ; A20  a21  A22 ]   and   [ t0 ; τ1 ; t2 ],
perform the updates
    [u21, τ1, a21] := HouseV(a21)
    y21 := A22 u21 / τ1
    β := u21^T y21 / 2
    w21 := y21 − (β/τ1) u21
    A22 := A22 − ( u21 w21^T + w21 u21^T )      (symmetric rank-2 update; lower triangular part only)
and continue with the repartitioned quadrants.
Figure 15.5: Basic algorithm for reduction of a symmetric matrix to tridiagonal form.
Original matrix → first iteration → second iteration → third iteration.
Figure 15.6: Illustration of reduction of a symmetric matrix to tridiagonal form. The ×'s denote nonzero
elements in the matrix. The gray entries above the diagonal are not actually updated.
Continue this process with the updated A22 .
This is captured in the algorithm in Figure 15.5. It is also illustrated in Figure 15.6.
The total cost for reducing A ∈ R^{n×n} to tridiagonal form is approximately
    Σ_{k=0}^{n-1} 4(n − k − 1)² flops ≈ (4/3) n³ flops.
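A minimal NumPy sketch of this reduction (a dense, unblocked version that forms the full symmetric update; the in-place storage of Householder vectors described above is omitted for clarity, and the test matrix is arbitrary):

    import numpy as np

    def tridiagonalize(A):
        # Reduce a symmetric matrix to tridiagonal form by Householder similarity transformations.
        T = A.copy().astype(float)
        n = T.shape[0]
        for k in range(n - 2):
            x = T[k + 1:, k]
            normx = np.linalg.norm(x)
            if normx == 0.0:
                continue
            u = x.copy()
            u[0] += np.sign(x[0]) * normx if x[0] != 0 else normx
            u = u / np.linalg.norm(u)                        # unit Householder vector
            H = np.eye(n - k - 1) - 2.0 * np.outer(u, u)
            T[k + 1:, k + 1:] = H @ T[k + 1:, k + 1:] @ H    # A22 := H A22 H
            T[k + 1:, k] = H @ T[k + 1:, k]                  # zeroes below the subdiagonal
            T[k, k + 1:] = T[k + 1:, k]                      # maintain symmetry
        return T

    B = np.random.default_rng(1).standard_normal((5, 5))
    A = (B + B.T) / 2
    T = tridiagonalize(A)
    print(np.allclose(np.sort(np.linalg.eigvalsh(T)), np.sort(np.linalg.eigvalsh(A))))  # eigenvalues preserved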
15.5 The QR algorithm with a Tridiagonal Matrix

We are now ready to describe an algorithm for the QR algorithm with a tridiagonal matrix.
15.5.1 Givens rotations

First, we introduce another important class of unitary matrices known as Givens rotations. Given a vector
    x = [ χ1 ; χ2 ] ∈ R²,
there exists an orthogonal matrix G such that G^T x = [ ‖x‖₂ ; 0 ]. The Householder transformation is one
example of such a matrix. An alternative is the Givens rotation
    G = [ γ  −σ ; σ  γ ],
where γ² + σ² = 1. (Notice that γ and σ can be thought of as the cosine and sine of an angle.) Then
    G^T G = [ γ  σ ; −σ  γ ] [ γ  −σ ; σ  γ ] = [ γ² + σ²   −γσ + γσ ; −γσ + γσ   γ² + σ² ] = [ 1  0 ; 0  1 ],
which means that a Givens rotation is a unitary matrix.
Now, if γ = χ1/‖x‖₂ and σ = χ2/‖x‖₂, then γ² + σ² = (χ1² + χ2²)/‖x‖₂² = 1 and
    G^T x = [ γ  σ ; −σ  γ ] [ χ1 ; χ2 ] = [ (χ1² + χ2²)/‖x‖₂ ; (χ1 χ2 − χ1 χ2)/‖x‖₂ ] = [ ‖x‖₂ ; 0 ].
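A minimal Python sketch of computing such a rotation from a 2-vector (a production routine would add safeguards against overflow; the function name is illustrative):

    import numpy as np

    def givens(chi1, chi2):
        # Returns (gamma, sigma) so that [[gamma, sigma], [-sigma, gamma]] @ [chi1, chi2] = [r, 0].
        r = np.hypot(chi1, chi2)        # sqrt(chi1^2 + chi2^2), computed robustly
        if r == 0.0:
            return 1.0, 0.0
        return chi1 / r, chi2 / r

    gamma, sigma = givens(3.0, 4.0)
    G = np.array([[gamma, -sigma], [sigma, gamma]])
    print(G.T @ np.array([3.0, 4.0]))   # [5, 0]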
15.6 QR Factorization of a Tridiagonal Matrix
Consider the 4 × 4 tridiagonal matrix
    [ α_{0,0}  α_{0,1}  0        0
      α_{1,0}  α_{1,1}  α_{1,2}  0
      0        α_{2,1}  α_{2,2}  α_{2,3}
      0        0        α_{3,2}  α_{3,3} ].
From α_{0,0} and α_{1,0} one can compute γ_{1,0} and σ_{1,0} so that
    [ γ_{1,0}  σ_{1,0} ; −σ_{1,0}  γ_{1,0} ] [ α_{0,0} ; α_{1,0} ] = [ α̂_{0,0} ; 0 ].
Applying this rotation to rows 0 and 1 yields
    [ α̂_{0,0}  α̂_{0,1}  α̂_{0,2}  0
      0        α̂_{1,1}  α̂_{1,2}  0
      0        α_{2,1}  α_{2,2}  α_{2,3}
      0        0        α_{3,2}  α_{3,3} ].
Next, from α̂_{1,1} and α_{2,1} one computes γ_{2,1} and σ_{2,1} so that the corresponding rotation, applied to rows 1 and 2, annihilates α_{2,1} (updating rows 1 and 2 and creating a nonzero in position (1,3)). Finally, from the updated (2,2) entry and α_{3,2} one computes γ_{3,2} and σ_{3,2} so that the corresponding rotation, applied to rows 2 and 3, annihilates α_{3,2}. The result is an upper triangular matrix R with (at most) two nonzero superdiagonals.
The matrix Q is the orthogonal matrix that results from multiplying the different Givens rotations together:
    Q = [ γ_{1,0}  −σ_{1,0}  0  0
          σ_{1,0}   γ_{1,0}  0  0
          0         0        1  0
          0         0        0  1 ]
        [ 1  0        0         0
          0  γ_{2,1}  −σ_{2,1}  0
          0  σ_{2,1}   γ_{2,1}  0
          0  0        0         1 ]
        [ 1  0  0        0
          0  1  0        0
          0  0  γ_{3,2}  −σ_{3,2}
          0  0  σ_{3,2}   γ_{3,2} ].      (15.2)
The next question is how to compute RQ given the QR factorization of the tridiagonal matrix. Since Q is
the product of the three Givens rotations in (15.2), RQ is computed by applying those rotations to R from
the right, in order. Postmultiplying R by the first rotation combines the first two columns and introduces a
nonzero in position (1,0); postmultiplying by the second rotation combines columns 1 and 2 and introduces
a nonzero in position (2,1); postmultiplying by the third combines columns 2 and 3 and introduces a
nonzero in position (3,2). The result is again a tridiagonal matrix: RQ = Q^T A Q is symmetric (since A is)
and upper Hessenberg (the product of an upper triangular matrix and an upper Hessenberg matrix), hence
tridiagonal, so that only its diagonal and subdiagonal need to be computed.
15.7 The Implicitly Shifted QR Algorithm

15.7.1 Upper Hessenberg and tridiagonal matrices

Definition 15.7 A matrix is said to be upper Hessenberg if all entries below its first subdiagonal equal
zero.
In other words, if matrix A ∈ R^{n×n} is upper Hessenberg, it looks like
    A = [ α_{0,0}  α_{0,1}  α_{0,2}  ···  α_{0,n-1}
          α_{1,0}  α_{1,1}  α_{1,2}  ···  α_{1,n-1}
          0        α_{2,1}  α_{2,2}  ···  α_{2,n-1}
          ⋮        ⋱        ⋱        ⋱    ⋮
          0        ···      0        α_{n-1,n-2}  α_{n-1,n-1} ].
15.7.2 The Implicit Q Theorem
The following theorem sets up one of the most remarkable algorithms in numerical linear algebra, which
allows us to greatly simplify the implementation of the shifted QR algorithm when A is tridiagonal.
Theorem 15.8 (Implicit Q Theorem) Let A, B ∈ R^{n×n}, where B is upper Hessenberg and has only positive
elements on its first subdiagonal, and assume there exists a unitary matrix Q such that Q^T AQ = B. Then Q
and B are uniquely determined by A and the first column of Q.
Proof: Partition
    Q = ( q0 | q1 | q2 | ··· | q_{n-2} | q_{n-1} )
and
    B = [ β_{0,0}  β_{0,1}  ···  β_{0,n-1}
          β_{1,0}  β_{1,1}  ···  β_{1,n-1}
          0        β_{2,1}  ···  β_{2,n-1}
          ⋮        ⋱        ⋱    ⋮
          0        ···      β_{n-1,n-2}  β_{n-1,n-1} ].
Then Q^T AQ = B is equivalent to AQ = QB, that is,
    A ( q0 | q1 | ··· | q_{n-1} ) = ( q0 | q1 | ··· | q_{n-1} ) B.
Equating the first column on the left and right, we notice that
    A q0 = β_{0,0} q0 + β_{1,0} q1.
Now, q0 is given and ‖q0‖₂ = 1 since Q is unitary. Hence
    q0^T A q0 = β_{0,0} q0^T q0 + β_{1,0} q0^T q1 = β_{0,0}.
Next,
    β_{1,0} q1 = A q0 − β_{0,0} q0 =: q̃1.
Since ‖q1‖₂ = 1 (it is a column of a unitary matrix) and β_{1,0} is assumed to be positive, we know that
    β_{1,0} = ‖q̃1‖₂.
Finally,
    q1 = q̃1 / β_{1,0}.
The point is that the first column of B and second column of Q are prescribed by the first column of Q and
the fact that B has positive elements on the first subdiagonal.
In this way, each column of Q and each column of B can be determined, one by one.
15.7.3 The Francis QR Step

The Francis QR Step combines the steps (A^{(k-1)} − μ_k I) → Q^(k) R^(k) and A^{(k+1)} := R^(k) Q^(k) + μ_k I into a
single step.
Now, consider the 4 × 4 tridiagonal matrix
    A = [ α_{0,0}  α_{0,1}  0        0
          α_{1,0}  α_{1,1}  α_{1,2}  0
          0        α_{2,1}  α_{2,2}  α_{2,3}
          0        0        α_{3,2}  α_{3,3} ]
and a shift μ. The first Givens rotation is computed from the first column of A − μI, yielding γ_{1,0} and σ_{1,0} so that
    [ γ_{1,0}  σ_{1,0} ; −σ_{1,0}  γ_{1,0} ] [ α_{0,0} − μ ; α_{1,0} ]
has a zero second entry. Now, to preserve eigenvalues, any orthogonal matrix that is applied from the left
must also have its transpose applied from the right. Applying this rotation to rows and columns 0 and 1 of A yields
    [ α̂_{0,0}  α̂_{1,0}  α̂_{2,0}  0
      α̂_{1,0}  α̂_{1,1}  α̂_{2,1}  0
      α̂_{2,0}  α̂_{2,1}  α_{2,2}  α_{2,3}
      0        0        α_{3,2}  α_{3,3} ]:
the entry α̂_{2,0} (the "bulge") makes the matrix fail to be tridiagonal. Next, from α̂_{1,0} and α̂_{2,0} one computes γ_{2,0} and σ_{2,0} so that
    [ γ_{2,0}  σ_{2,0} ; −σ_{2,0}  γ_{2,0} ] [ α̂_{1,0} ; α̂_{2,0} ] = [ α̃_{1,0} ; 0 ].
Applying this rotation to rows and columns 1 and 2 annihilates the bulge α̂_{2,0} but introduces a new bulge α̂_{3,1}:
    [ α̂_{0,0}  α̃_{1,0}  0        0
      α̃_{1,0}  α̃_{1,1}  α̃_{2,1}  α̂_{3,1}
      0        α̃_{2,1}  α̂_{2,2}  α̂_{3,2}
      0        α̂_{3,1}  α̂_{3,2}  α_{3,3} ].
Finally, from α̃_{2,1} and α̂_{3,1} one computes γ_{3,1} and σ_{3,1} so that
    [ γ_{3,1}  σ_{3,1} ; −σ_{3,1}  γ_{3,1} ] [ α̃_{2,1} ; α̂_{3,1} ] = [ α̌_{2,1} ; 0 ].
Applying this rotation to rows and columns 2 and 3 annihilates α̂_{3,1}, and the matrix is again tridiagonal.
The orthogonal matrix that corresponds to this sequence of rotations is
    Q = [ γ_{1,0}  −σ_{1,0}  0  0
          σ_{1,0}   γ_{1,0}  0  0
          0         0        1  0
          0         0        0  1 ]
        [ 1  0        0         0
          0  γ_{2,0}  −σ_{2,0}  0
          0  σ_{2,0}   γ_{2,0}  0
          0  0        0         1 ]
        [ 1  0  0        0
          0  1  0        0
          0  0  γ_{3,1}  −σ_{3,1}
          0  0  σ_{3,1}   γ_{3,1} ],
whose first column equals ( γ_{1,0}, σ_{1,0}, 0, 0 )^T. This is the same first
column had Q been computed as in Section 15.6 (Equation 15.2). Thus, by the Implicit Q Theorem, the
tridiagonal matrix that results from this approach is equal to the tridiagonal matrix that would be computed
by applying the QR factorization from Section 15.6 to A − μI, A − μI → QR, followed by the formation
of RQ + μI using the algorithm for computing RQ in Section 15.6.
The successive elimination of the elements α̂_{i+1,i} is often referred to as chasing the bulge, while the entire
process that introduces the bulge and then chases it is known as a Francis Implicit QR Step. Obviously,
the method generalizes to matrices of arbitrary size, as illustrated in Figure 15.8. An algorithm for the
chasing of the bulge is given in Figure 17.4. (Note that in those figures T is used for A, something that
needs to be made consistent in these notes, eventually.) In practice, the tridiagonal matrix is not stored as
a matrix. Instead, its diagonal and subdiagonal are stored as vectors.
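A minimal NumPy sketch of what one such step computes, written in its explicit (non-implicit) form — a QR factorization of T − μI followed by RQ + μI — which, as argued above, the implicit bulge-chasing step reproduces while touching only the diagonal and subdiagonal; the shift choice and test matrix are illustrative:

    import numpy as np

    def shifted_qr_step(T):
        # One explicit shifted QR step on a symmetric tridiagonal matrix T.
        # The shift is the last diagonal element (a Wilkinson shift works better in practice).
        mu = T[-1, -1]
        Q, R = np.linalg.qr(T - mu * np.eye(T.shape[0]))
        return R @ Q + mu * np.eye(T.shape[0])

    T = np.diag([4.0, 3.0, 2.0, 1.0]) + np.diag([0.5, 0.4, 0.3], 1) + np.diag([0.5, 0.4, 0.3], -1)
    for _ in range(8):
        T = shifted_qr_step(T)
    print(np.round(T, 8))   # the trailing off-diagonal entries converge to zero rapidly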
15.7.4 A complete algorithm
This last section shows how one iteration of the QR algorithm can be performed on a tridiagonal matrix
by implicitly shifting and then chasing the bulge. All that is left to complete the algorithm is to note that
The shift μ_k can be chosen to equal α_{n-1,n-1} (the last element on the diagonal, which tends to
converge to the eigenvalue smallest in magnitude). In practice, choosing the shift to be an eigenvalue
of the bottom-right 2 × 2 matrix works better. This is known as the Wilkinson Shift.
Figure 15.8: One step of chasing the bulge in the implicitly shifted symmetric QR algorithm.
If an element of the subdiagonal (and corresponding element on the superdiagonal) becomes small
enough, it can be considered to be zero and the problem deflates (decouples) into two smaller tridiagonal matrices. "Small" is often taken to mean that |α_{i+1,i}| ≤ ε(|α_{i,i}| + |α_{i+1,i+1}|), where ε is some
quantity close to the machine epsilon (unit roundoff).
If A = Q_T T Q_T^T reduced A to the tridiagonal matrix T before the QR algorithm commenced, then the
Givens rotations encountered as part of the implicitly shifted QR algorithm can be applied to Q_T from the
right, thus accumulating the eigenvectors of A.
If an element on the subdiagonal becomes zero (or very small), and hence the corresponding element
of the superdiagonal, then the problem can be deflated: if
    T = [ T00  0 ; 0  T11 ],
then
The computation can continue separately with T00 and T11 .
One can pick the shift from the bottom-right of T00 as one continues finding the eigenvalues of
T00 , thus accelerating the computation.
One can pick the shift from the bottom-right of T11 as one continues finding the eigenvalues of
T11 , thus accelerating the computation.
One must continue to accumulate the eigenvectors by applying the rotations to the appropriate
columns of Q.
Because of the connection between the QR algorithm and the Inverse Power Method, subdiagonal
entries near the bottom-right of T are more likely to converge to a zero, so most deflation will happen
there.
A question becomes when an element on the subdiagonal, α_{i+1,i}, can be considered to be zero. The
answer is when |α_{i+1,i}| is small relative to |α_{i,i}| and |α_{i+1,i+1}|. A typical condition that is used is
    |α_{i+1,i}| ≤ u √( |α_{i,i} α_{i+1,i+1}| ),
where u is the unit roundoff.
If A Cnn is Hermitian, then its Spectral Decomposition is also computed via the following steps,
which mirror those for the real symmetric case:
Reduce to tridiagonal form. Householder transformation based similarity transformations can
again be used for this. This leaves one with a tridiagonal matrix, T , with real values along the
diagonal (because the matrix is Hermitian) and values on the subdiagonal and superdiagonal
that may be complex valued.
The matrix Q_T such that A = Q_T T Q_T^H can then be formed from the Householder transformations.
A simple step can be used to then change this tridiagonal form to have real values even on the
subdiagonal and superdiagonal. The matrix QT can be updated accordingly.
The tridiagonal QR algorithm that we described can then be used to diagonalize the matrix,
accumulating the eigenvectors by applying the encountered Givens rotations to Q_T. This is
where the real expense is: applying the Givens rotations to matrix T requires O(n) computation per sweep, while
applying them to Q_T requires O(n²) per sweep.
For details, see some of our papers mentioned in the next section.
15.8 Further Reading

15.8.1 More on reduction to tridiagonal form
The reduction to tridiagonal form can only be partially cast in terms of matrix-matrix multiplication [20].
This is a severe hindrance to high performance for that first step towards computing all eigenvalues and
eigenvectors of a symmetric matrix. Worse, a considerable fraction of the total cost of the computation is
in that first step.
For a detailed discussion on the blocked algorithm that uses FLAME notation, we recommend [43]:
Field G. Van Zee, Robert A. van de Geijn, Gregorio Quintana-Ortí, G. Joseph Elizondo.
Families of Algorithms for Reducing a Matrix to Condensed Form.
ACM Transactions on Mathematical Software (TOMS), Vol. 39, No. 1, 2012.
(Reduction to tridiagonal form is one case of what is more generally referred to as condensed form.)
15.8.2 Optimizing the tridiagonal QR algorithm
As the Givens rotations are applied to the tridiagonal matrix, they are also applied to a matrix in which
eigenvectors are accumulated. While one Implicit Francis Step requires O(n) computation, this accumulation of the eigenvectors requires O(n2 ) computation with O(n2 ) data. We have learned before that this
means the cost of accessing data dominates on current architectures.
In a recent paper, we showed how accumulating the Givens rotations for several Francis Steps allows
one to attain performance similar to that attained by a matrix-matrix multiplication. Details can be found
in [42]:
Field G. Van Zee, Robert A. van de Geijn, Gregorio Quintana-Ortí.
Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance.
ACM Transactions on Mathematical Software (TOMS), Vol. 40, No. 3, 2014.
15.9
Other Algorithms
15.9.1 Jacobi's method for the symmetric eigenvalue problem
(Not to be mistaken for the Jacobi iteration for solving linear systems.)
The oldest algorithm for computing the eigenvalues and eigenvectors of a matrix is due to Jacobi and
dates back to 1846 [26]. This is a method that keeps resurfacing, since it parallelizes easily. The operation
count tends to be higher (by a constant factor) than that of reduction to tridiagonal form followed by the
tridiagonal QR algorithm.
The idea is as follows. Given the symmetric 2 × 2 submatrix
    A31 = [ α11  α31 ; α31  α33 ]
(formed, say, from rows and columns 1 and 3 of a larger symmetric matrix),
Sweep 1: zero (1,0); zero (2,0); zero (3,0); zero (2,1); zero (3,1); zero (3,2).
Sweep 2: zero (1,0); zero (2,0); zero (3,0); zero (2,1); zero (3,1); zero (3,2).
(Illustration of sweeps of the column-cyclic Jacobi algorithm, one rotation per annihilated off-diagonal element.)
one can compute γ and σ with γ² + σ² = 1 such that the Jacobi rotation
    J31 = [ γ  −σ ; σ  γ ]
satisfies
    J31^T A31 J31 = [ γ  σ ; −σ  γ ] [ α11  α31 ; α31  α33 ] [ γ  −σ ; σ  γ ] = [ α̂11  0 ; 0  α̂33 ].
We know this exists since the Spectral Decomposition of the 2 × 2 matrix exists. Such a rotation is called
a Jacobi rotation. (Notice that it is different from a Givens rotation because it diagonalizes a 2 × 2 matrix
when used as a unitary similarity transformation. By contrast, a Givens rotation zeroes an element when
applied from one side of a matrix.)
Homework 15.11 In the above discussion, show that α11² + 2α31² + α33² = α̂11² + α̂33².
* SEE ANSWER
Jacobi rotations can be used to selectively zero off-diagonal elements by observing the following. Let J
equal the identity matrix, except that its entries in positions (1,1), (1,3), (3,1), and (3,3) are replaced by the
corresponding entries of the 2 × 2 Jacobi rotation J31. Partitioning A conformally (with scalar rows and
columns 1 and 3),
    J A J^T =
    J [ A00    a10    A20^T  a30    A40^T
        a10^T  α11    a21^T  α31    a41^T
        A20    a21    A22    a32    A42^T
        a30^T  α31    a32^T  α33    a43^T
        A40    a41    A42    a43    A44   ] J^T
      = [ A00    â10    A20^T  â30    A40^T
          â10^T  α̂11    â21^T  0      â41^T
          A20    â21    A22    â32    A42^T
          â30^T  0      â32^T  α̂33    â43^T
          A40    â41    A42    â43    A44   ] = Â,
where the (1,1), (1,3), (3,1), (3,3) entries transform as J31^T A31 J31 = diag(α̂11, α̂33) and each pair of
subvectors in rows/columns 1 and 3 (such as (a10, a30), (a21, a32), and (a41, a43)) is combined by the rotation.
Importantly, since J31 is orthogonal,
    a10^T a10 + a30^T a30 = â10^T â10 + â30^T â30,
    a21^T a21 + a32^T a32 = â21^T â21 + â32^T â32,
    a41^T a41 + a43^T a43 = â41^T â41 + â43^T â43.
What this means is that if one defines off(A) as the square of the Frobenius norm of the off-diagonal
elements of A,
    off(A) = ‖A‖_F² − ‖diag(A)‖_F²,
then off(Â) = off(A) − 2α31².
The good news: every time a Jacobi rotation is used to zero an off-diagonal element, off(A) decreases by twice the square of that element.
The bad news: a previously introduced zero may become nonzero in the process.
The original algorithm developed by Jacobi searched for the largest (in absolute value) off-diagonal
element and zeroed it, repeating this process until all off-diagonal elements were small. The algorithm
was applied by hand by one of his students, Seidel (of Gauss-Seidel fame). The problem with this is that
searching for the largest off-diagonal element requires O(n²) comparisons. Computing and applying one
Jacobi rotation as a similarity transformation requires O(n) flops. Thus, for large n this is not practical.
Instead, it can be shown that zeroing the off-diagonal elements by columns (or rows) also converges to a
diagonal matrix. This is known as the column-cyclic Jacobi algorithm. We illustrate this in Figure 15.10.
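A minimal NumPy sketch of the column-cyclic Jacobi method; the angle formula is the standard one for diagonalizing a symmetric 2 × 2 block, and the sweep count and tolerance are arbitrary choices:

    import numpy as np

    def jacobi_eigenvalues(A, num_sweeps=10):
        A = A.copy().astype(float)
        n = A.shape[0]
        for _ in range(num_sweeps):
            for q in range(1, n):                  # column-cyclic ordering
                for p in range(q):
                    if abs(A[q, p]) < 1e-14:
                        continue
                    # Jacobi rotation diagonalizing the (p,q) 2x2 block
                    theta = 0.5 * np.arctan2(2.0 * A[q, p], A[p, p] - A[q, q])
                    c, s = np.cos(theta), np.sin(theta)
                    J = np.eye(n)
                    J[p, p] = c; J[q, q] = c; J[p, q] = -s; J[q, p] = s
                    A = J.T @ A @ J                # off(A) decreases by 2*A[q,p]^2
        return np.diag(A)

    B = np.random.default_rng(2).standard_normal((4, 4))
    A = (B + B.T) / 2
    print(np.sort(jacobi_eigenvalues(A)))
    print(np.sort(np.linalg.eigvalsh(A)))          # for comparison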
15.9.2 Cuppen's Algorithm

15.9.3 The Method of Multiple Relatively Robust Representations (MRRR)
Even once the problem has been reduced to tridiagonal form, the computation of the eigenvalues and
eigenvectors via the QR algorithm requires O(n³) computations. A method that reduces this to O(n²)
time (which can be argued to achieve the lower bound for computation, within a constant, because the n
vectors must be at least written) is achieved by the Method of Multiple Relatively Robust Representations
(MRRR) by Dhillon and Parlett [13, 12, 14, 15]. The details of that method go beyond the scope of this
note.
15.10 The Nonsymmetric QR Algorithm

The QR algorithm that we have described can be modified to compute the Schur decomposition of a
nonsymmetric matrix. We briefly describe the high-level ideas.

15.10.1 A variant of the Schur decomposition
Theorem 15.12 Let A ∈ R^{n×n}. Then there exist a unitary Q ∈ R^{n×n} and quasi upper triangular matrix
R ∈ R^{n×n} such that A = QRQ^T.
Here quasi upper triangular matrix means that the matrix is block upper triangular with blocks on the
diagonal that are 1 × 1 or 2 × 2.
Remark 15.13 The important thing is that this alternative to the Schur decomposition can be computed
using only real arithmetic.
15.10.2 Reduction to upperHessenberg form

The basic algorithm for reducing a real-valued nonsymmetric matrix to upperHessenberg form, overwriting the original matrix with the result, can be explained similarly to the explanation of the reduction of a
symmetric matrix to tridiagonal form. We assume that the upperHessenberg matrix overwrites the upperHessenberg part of A and that the Householder vectors used to zero out parts of A overwrite the entries that they
annihilate (set to zero).
Assume that the process has proceeded to where
    A = [ A00  a01  A02 ; a10^T  α11  a12^T ; 0  a21  A22 ],
with A00 being k × k and upperHessenberg, a10^T being a row vector with only a last nonzero entry, and the rest of the submatrices
updated according to the application of the previous k Householder transformations.
Let [u21, τ, a21] := HouseV(a21).³
Update
    [ A00  a01  A02 ; a10^T  α11  a12^T ; 0  a21  A22 ]
       := [ I  0  0 ; 0  1  0 ; 0  0  H ] [ A00  a01  A02 ; a10^T  α11  a12^T ; 0  a21  A22 ] [ I  0  0 ; 0  1  0 ; 0  0  H ]
       = [ A00  a01  A02 H ; a10^T  α11  a12^T H ; 0  H a21  H A22 H ],
where H = H(u21).
Note that a21 := H a21 need not be executed since this update was performed by the instance
of HouseV above.⁴
³ Note that the semantics here indicate that a21 is overwritten by H a21.
⁴ In practice, the zeros below the first element of H a21 are not actually written. Instead, the implementation overwrites these
elements with the corresponding elements of the vector u21.
Algorithm (unblocked) for reducing A to upperHessenberg form:
Partition A into quadrants with A_{TL} 0 × 0, and partition t conformally; while m(A_{TL}) < m(A), repartition to expose
    [ A00  a01  A02 ; a10^T  α11  a12^T ; A20  a21  A22 ]   and   [ t0 ; τ1 ; t2 ]   (α11 a scalar),
perform the updates
    [u21, τ1, a21] := HouseV(a21)
    y01 := A02 u21
    A02 := A02 − (1/τ1) y01 u21^T
    ψ11 := a12^T u21
    a12^T := a12^T − (ψ11/τ1) u21^T
    y21 := A22 u21
    β := u21^T y21 / 2
    z21 := ( y21 − (β/τ1) u21 ) / τ1
    w21 := ( A22^T u21 − (β/τ1) u21 ) / τ1
    A22 := A22 − ( u21 w21^T + z21 u21^T )      (rank-2 update)
and continue with the repartitioned quadrants.
Figure 15.11: Basic algorithm for reduction of a nonsymmetric matrix to upperHessenberg form.
Original matrix → first iteration → second iteration → third iteration.
Figure 15.12: Illustration of reduction of a nonsymmetric matrix to upperHessenberg form. The ×'s denote
nonzero elements in the matrix.
The update A02 := A02 H requires
    A02 := A02 ( I − (1/τ1) u21 u21^T ) = A02 − (1/τ1) (A02 u21) u21^T = A02 − (1/τ1) y01 u21^T      (ger),
where y01 := A02 u21. Similarly,
    a12^T := a12^T ( I − (1/τ1) u21 u21^T ) = a12^T − (1/τ1) (a12^T u21) u21^T = a12^T − (ψ11/τ1) u21^T      (axpy),
where ψ11 := a12^T u21. Finally,
    A22 := ( I − (1/τ1) u21 u21^T ) A22 ( I − (1/τ1) u21 u21^T )
         = A22 − (1/τ1) u21 u21^T A22 − (1/τ1) A22 u21 u21^T + (1/τ1²) u21 ( u21^T A22 u21 ) u21^T
         = A22 − u21 w21^T − z21 u21^T      (rank-2 update),
where y21 := A22 u21, β := u21^T y21 / 2, w21 := ( A22^T u21 − (β/τ1) u21 )/τ1, and z21 := ( y21 − (β/τ1) u21 )/τ1.
15.10.3 The implicitly double-shifted QR algorithm
Chapter 16
Outline
Outline
16.1. MRRR, from 35,000 Feet
16.2. Cholesky Factorization, Again
16.3. The LDL^T Factorization
16.4. The UDU^T Factorization
16.5. The UDU^T Factorization
16.6. The Twisted Factorization
16.7. Computing an Eigenvector from the Twisted Factorization
16.1 MRRR, from 35,000 Feet
The Method of Relatively Robust Representations (MRRR) is an algorithm that, given a tridiagonal matrix,
computes eigenvectors associated with that matrix in O(n) time per eigenvector. This means it computes
all eigenvectors in O(n2 ) time, which is much faster than the tridiagonal QR algorithm (which requires
O(n3 ) computation). Notice that highly accurate eigenvalues of a tridiagonal matrix can themselves be
computed in O(n2 ) time. So, it is legitimate to start by assuming that we have these highly accurate
eigenvalues. For our discussion, we only need one.
The MRRR algorithm has at least two benefits over the symmetric QR algorithm for tridiagonal matrices:
It can compute all eigenvalues and eigenvectors of a tridiagonal matrix in O(n²) time (versus O(n³)
time for the symmetric QR algorithm).
It can efficiently compute eigenvectors corresponding to a subset of eigenvalues.
The benefit when computing all eigenvalues and eigenvectors of a dense matrix is considerably less because transforming the eigenvectors of the tridiagonal matrix back into the eigenvectors of the dense matrix
requires an extra O(n3 ) computation, as discussed in [42]. In addition, making the method totally robust
has been tricky, since the method does not rely exclusively on unitary similarity transformations.
The fundamental idea is that of a twisted factorization of the tridiagonal matrix. This note builds up
to what that factorization is, how to compute it, and how to then compute an eigenvector with it. We
start by reminding the reader of what the Cholesky factorization of a symmetric positive definite (SPD)
matrix is. Then we discuss the Cholesky factorization of a tridiagonal SPD matrix. This then leads to
the LDL^T factorization of an indefinite matrix and of an indefinite tridiagonal matrix. Next follows a discussion of the
UDU^T factorization of an indefinite matrix, which then finally yields the twisted factorization. When the
matrix is nearly singular, an approximation of the twisted factorization can then be used to compute an
approximate eigenvector.
Notice that the devil is in the details of the MRRR algorithm. We will not tackle those details.
16.2 Cholesky Factorization, Again
We have discussed in class the Cholesky factorization, A = LL^T, which requires a matrix to be symmetric
positive definite. (We will restrict our discussion to real matrices.) The following computes the Cholesky
factorization:
Partition A → [ α11  a21^T ; a21  A22 ].
Update α11 := √α11.
Update a21 := a21 / α11.
Update A22 := A22 − a21 a21^T (updating only the lower triangular part).
Continue to compute the Cholesky factorization of the updated A22.
Algorithm: A := Chol(A)
Partition A into quadrants with A_{TL} 0 × 0; while m(A_{TL}) < m(A), repartition to expose
    [ A00  ⋆  ⋆ ; a10^T  α11  ⋆ ; A20  a21  A22 ],
perform the updates
    α11 := √α11
    a21 := a21 / α11
    A22 := A22 − a21 a21^T      (lower triangular part only)
and continue with the repartitioned quadrants.
Figure 16.1: Unblocked algorithm for computing the Cholesky factorization. Updates to A22 affect only
the lower triangular part.
Partition A into a 3 × 3 blocking
    [ A_FF  ⋆  ⋆ ; α_MF e_L^T  α_MM  ⋆ ; 0  α_LM e_F  A_LL ]
with A_FF 0 × 0; while m(A_FF) < m(A), repartition to expose
    [ A00  ⋆  ⋆  ⋆ ; α10 e_L^T  α11  ⋆  ⋆ ; 0  α21  α22  ⋆ ; 0  0  α32 e_F  A33 ],
perform the updates
    α11 := √α11
    α21 := α21 / α11
    α22 := α22 − α21²
and continue with the repartitioned blocks.
Figure 16.2: Algorithm for computing the Cholesky factorization of a tridiagonal matrix.
Partition
    A → [ α11  ⋆  ⋆ ; α21  α22  ⋆ ; 0  α32 e_F  A33 ],
where ⋆ indicates the symmetric part that is not stored. Here e_F indicates the unit basis vector with a 1 as first element.
Update α11 := √α11.
Update α21 := α21 / α11.
Update α22 := α22 − α21².
Continue this process with
    [ α22  ⋆ ; α32 e_F  A33 ].
The resulting algorithm is given in Figure 16.2. In that figure, it helps to interpret F, M, and L as First,
Middle, and Last; e_F and e_L are the unit basis vectors with a 1 as first and last element,
respectively. Notice that the Cholesky factor of a tridiagonal matrix is a lower bidiagonal matrix that
overwrites the lower triangular part of the tridiagonal matrix A.
Naturally, the whole matrix need not be stored. But that detail is not important for our discussion.
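A minimal NumPy sketch of this tridiagonal Cholesky factorization, storing only the diagonal and subdiagonal as vectors (names and the small test matrix are illustrative):

    import numpy as np

    def tridiagonal_cholesky(d, e):
        # d: diagonal of the SPD tridiagonal matrix (length n); e: subdiagonal (length n-1).
        # Returns (dl, el): diagonal and subdiagonal of the lower bidiagonal Cholesky factor L.
        dl, el = d.astype(float).copy(), e.astype(float).copy()
        n = len(dl)
        for k in range(n - 1):
            dl[k] = np.sqrt(dl[k])             # alpha_11 := sqrt(alpha_11)
            el[k] = el[k] / dl[k]              # alpha_21 := alpha_21 / alpha_11
            dl[k + 1] = dl[k + 1] - el[k]**2   # alpha_22 := alpha_22 - alpha_21^2
        dl[n - 1] = np.sqrt(dl[n - 1])
        return dl, el

    d = np.array([4.0, 5.0, 6.0]); e = np.array([1.0, 2.0])
    dl, el = tridiagonal_cholesky(d, e)
    L = np.diag(dl) + np.diag(el, -1)
    print(np.allclose(L @ L.T, np.diag(d) + np.diag(e, 1) + np.diag(e, -1)))   # True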
16.3 The LDL^T Factorization
Now, one can alternatively compute A = LDL^T, where L is unit lower triangular and D is diagonal. We will
look at how this is done first and will then note that this factorization can be computed for any indefinite
(nonsingular) symmetric matrix. (Notice: the L so computed is not the same as the L computed by the
Cholesky factorization.) Partition
    A → [ α11  a21^T ; a21  A22 ],   L → [ 1  0 ; l21  L22 ],   and   D → [ δ1  0 ; 0  D22 ].
Then
    [ α11  a21^T ; a21  A22 ] = [ 1  0 ; l21  L22 ] [ δ1  0 ; 0  D22 ] [ 1  l21^T ; 0  L22^T ]
                              = [ δ1  δ1 l21^T ; δ1 l21  δ1 l21 l21^T + L22 D22 L22^T ].
Figure 16.3: Algorithm skeleton for A := LDLT(A), with the same partitioning as Figure 16.1 (to be filled in as part of Homework 16.1).
Partition A → [ α11  a21^T ; a21  A22 ].
δ1 := α11 (no-op).
Compute l21 := a21 / α11.
Update A22 := A22 − l21 a21^T (updating only the lower triangular part).
a21 := l21.
Continue with computing A22 → L22 D22 L22^T.
This algorithm will complete as long as α11 ≠ 0 at every step, which happens when A is nonsingular. This is equivalent
to A not having zero as an eigenvalue. Such a matrix is also called indefinite.
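A minimal NumPy sketch of this (dense, unblocked) LDL^T factorization, without pivoting; overwriting of A is not modeled and L and D are returned separately, and the test matrix is arbitrary:

    import numpy as np

    def ldlt(A):
        A = A.copy().astype(float)
        n = A.shape[0]
        L = np.eye(n)
        d = np.zeros(n)
        for k in range(n):
            d[k] = A[k, k]                                        # delta_1 := alpha_11
            if k < n - 1:
                l21 = A[k + 1:, k] / d[k]                         # l21 := a21 / alpha_11
                L[k + 1:, k] = l21
                A[k + 1:, k + 1:] -= np.outer(l21, A[k + 1:, k])  # A22 := A22 - l21 a21^T
        return L, d

    B = np.random.default_rng(3).standard_normal((4, 4))
    A = (B + B.T) / 2                                # symmetric, (almost surely) nonsingular
    L, d = ldlt(A)
    print(np.allclose(L @ np.diag(d) @ L.T, A))      # True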
Homework 16.1 Modify the algorithm in Figure 16.1 so that it computes the LDLT factorization. (Fill in
Figure 16.3.)
* SEE ANSWER
Homework 16.2 Modify the algorithm in Figure 16.2 so that it computes the LDLT factorization of a
tridiagonal matrix. (Fill in Figure 16.4.) What is the approximate cost, in floating point operations, of
computing the LDLT factorization of a tridiagonal matrix? Count a divide, multiply, and add/subtract as
a floating point operation each. Show how you came up with the algorithm, similar to how we derived the
algorithm for the tridiagonal Cholesky factorization.
* SEE ANSWER
Notice that computing the LDL^T factorization of an indefinite matrix is not a good idea when A is
nearly singular. This would lead to some α11 being nearly zero, meaning that dividing by it leads to very
large entries in l21 and corresponding element growth in A22. (In other words, strange things will happen
downstream from the small α11.) Not a good thing. Unfortunately, we are going to need something like
this factorization specifically for the case where A is nearly singular.
16.4 The UDU^T Factorization

The fact that the LU, Cholesky, and LDL^T factorizations start in the top-left corner of the matrix and
work towards the bottom-right is an accident of history. Gaussian elimination works in that direction,
hence so does LU factorization, and the rest kind of follow.
One can imagine, given a SPD matrix A, instead computing A = UU T , where U is upper triangular.
Such a computation starts in the lower-right corner of the matrix and works towards the top-left. Similarly,
an algorithm can be created for computing A = UDU T where U is unit upper triangular and D is diagonal.
Homework 16.3 Derive an algorithm that, given an indefinite matrix A, computes A = UDU T . Overwrite
only the upper triangular part of A. (Fill in Figure 16.5.) Show how you came up with the algorithm,
similar to how we derived the algorithm for LDLT .
* SEE ANSWER
16.5 The UDU^T Factorization of a Tridiagonal Matrix
Homework 16.4 Derive an algorithm that, given an indefinite tridiagonal matrix A, computes A = UDU T .
Overwrite only the upper triangular part of A. (Fill in Figure 16.6.) Show how you came up with the algorithm, similar to how we derived the algorithm for LDLT .
* SEE ANSWER
Notice that the UDU^T factorization has the exact same shortcomings as does LDL^T when applied
to a singular matrix. If D has a small element on the diagonal, it is again downstream that this can
cause large elements in the factored matrix. But now downstream means towards the top-left, since the
algorithm moves in the opposite direction.
Figure 16.4: Algorithm (skeleton, to be filled in as part of Homework 16.2) for computing the LDL^T factorization of a tridiagonal matrix.
Figure 16.5: Unblocked algorithm (skeleton, to be filled in as part of Homework 16.3) for computing A = UDU^T. Updates to A00 affect only the upper triangular part.
Figure 16.6: Algorithm (skeleton, to be filled in as part of Homework 16.4) for computing the UDU^T factorization of a tridiagonal matrix.
16.6 The Twisted Factorization
Let us assume that A is a tridiagonal matrix and that we are given one of its eigenvalues (or rather, a very
good approximation), λ, and let us assume that this eigenvalue has multiplicity one. Indeed, we are going
to assume that it is well-separated, meaning there is no other eigenvalue close to it.
Here is a way to compute an associated eigenvector: find a nonzero vector in the null space of B =
A − λI. Let us think back to how one was taught how to do this:
Reduce B to row-echelon form. This may well require pivoting.
Find a column that has no pivot in it. This identifies an independent (free) variable.
Set the element in vector x corresponding to the independent variable to one.
Solve for the dependent variables.
There are a number of problems with this approach:
It does not take advantage of symmetry.
It is inherently unstable, since you are working with a matrix, B = A − λI, that inherently has a bad
condition number. (Indeed, it is infinity if λ is an exact eigenvalue.)
λ is not an exact eigenvalue when it is computed by a computer, and hence B is not exactly singular.
So, no independent variables will even be found.
These are only some of the problems. To overcome this, one computes something called a twisted factorization.
Again, let B be a tridiagonal matrix. Assume that λ is an approximate eigenvalue of B and let A =
B − λI. We are going to compute an approximate eigenvector of B associated with λ by computing a vector
that is in the null space of a singular matrix that is close to A. We will assume that λ is an eigenvalue of
B that has multiplicity one and is well-separated from other eigenvalues. Thus, the singular matrix close
to A has a null space with dimension one. Thus, we would expect A = LDL^T to have one element on the
diagonal of D that is essentially zero. Ditto for A = UEU^T, where E is diagonal and we use this letter to
be able to distinguish the two diagonal matrices.
Thus, we have:
λ is an approximate (but not exact) eigenvalue of B.
A = B − λI is indefinite.
A = LDL^T. L is bidiagonal, unit lower triangular, and D is diagonal with one small element on the
diagonal.
A = UEU^T. U is bidiagonal, unit upper triangular, and E is diagonal with one small element on the
diagonal.
Let
    A = [ A00  α10 e_L  0 ; α10 e_L^T  α11  α21 e_F^T ; 0  α21 e_F  A22 ],
    L = [ L00  0  0 ; λ10 e_L^T  1  0 ; 0  λ21 e_F  L22 ],        U = [ U00  υ01 e_L  0 ; 0  1  υ21 e_F^T ; 0  0  U22 ],
    D = [ D00  0  0 ; 0  δ1  0 ; 0  0  D22 ],        and        E = [ E00  0  0 ; 0  ε1  0 ; 0  0  E22 ],
where A = LDL^T and A = UEU^T. Homework 16.5 Show that, for an appropriate choice of the scalar η1,
    [ L00  0  0 ; λ10 e_L^T  1  υ21 e_F^T ; 0  0  U22 ]
    [ D00  0  0 ; 0  η1  0 ; 0  0  E22 ]
    [ L00  0  0 ; λ10 e_L^T  1  υ21 e_F^T ; 0  0  U22 ]^T
    = [ A00  α10 e_L  0 ; α10 e_L^T  α11  α21 e_F^T ; 0  α21 e_F  A22 ].
(Hint: multiply out A = LDL^T and A = UEU^T with the partitioned matrices first. Then multiply out the
above. Compare and match...) How should η1 be chosen? What is the cost of computing the twisted
factorization, given that you have already computed the LDL^T and UDU^T factorizations? A "Big O"
estimate is sufficient. Be sure to take into account what e_L^T D00 e_L and e_F^T E22 e_F equal in your cost estimate.
* SEE ANSWER
16.7 Computing an Eigenvector from the Twisted Factorization

The way the method now works for computing the desired approximate eigenvector is to find the η1 that is
smallest in magnitude of all possible such values. In other words, you can partition A, L, U, D, and E in many
ways, singling out any of the diagonal values of these matrices. The partitioning chosen is the one that
makes η1 the smallest of all possibilities. It is then set to zero so that
    [ L00  0  0 ; λ10 e_L^T  1  υ21 e_F^T ; 0  0  U22 ]
    [ D00  0  0 ; 0  0  0 ; 0  0  E22 ]
    [ L00  0  0 ; λ10 e_L^T  1  υ21 e_F^T ; 0  0  U22 ]^T
    ≈ [ A00  α10 e_L  0 ; α10 e_L^T  α11  α21 e_F^T ; 0  α21 e_F  A22 ].
Because λ was assumed to have multiplicity one and to be well-separated, the resulting twisted factorization
has nice properties. We won't get into what that exactly means.
Homework 16.6 Show how to compute a vector
    x = [ x0 ; 1 ; x2 ],
which is not a zero vector, such that the above (approximately singular) twisted factorization maps x to (approximately) zero.
(Hint: solve
    [ L00  0  0 ; λ10 e_L^T  1  υ21 e_F^T ; 0  0  U22 ]^T [ x0 ; χ1 ; x2 ] = [ 0 ; 1 ; 0 ].)
What is the cost of this computation, given that L00 and U22 have special structure?
* SEE ANSWER
The vector x that is so computed is the desired approximate eigenvector of B.
Now, if the eigenvalues of B are well separated, and one follows essentially this procedure to find
eigenvectors associated with each eigenvalue, the resulting eigenvectors are quite orthogonal to each other.
We know that the eigenvectors of a symmetric matrix with distinct eigenvalues should be orthogonal, so
this is a desirable property.
The approach becomes very tricky when eigenvalues are clustered, meaning that some of them are
very close to each other. A careful scheme that shifts the matrix is then used to make eigenvalues in a
cluster relatively well separated. But that goes beyond this note.
Chapter 17
Outline
Outline
17.1. Background
17.2. Reduction to Bidiagonal Form
17.3. The QR Algorithm with a Bidiagonal Matrix
17.4. Putting it all together
17.1 Background
Recall:
Theorem 17.1 If A ∈ R^{m×n} then there exist unitary matrices U ∈ R^{m×m} and V ∈ R^{n×n}, and a diagonal
matrix Σ ∈ R^{m×n}, such that A = UΣV^T. This is known as the Singular Value Decomposition.
There is a relation between the SVD of a matrix A and the Spectral Decomposition of A^T A:
Homework 17.2 If A = UΣV^T is the SVD of A then A^T A = VΣ²V^T is the Spectral Decomposition of A^T A.
* SEE ANSWER
The above exercise suggests steps for computing the Reduced SVD:
Form C = A^T A.
Compute unitary V and diagonal Λ such that C = VΛV^T, ordering the eigenvalues from largest to
smallest on the diagonal of Λ.
Form W = AV. Then W = UΣ, so that U and Σ can be computed from W by choosing the diagonal elements of Σ to equal the lengths of the corresponding columns of W and normalizing those
columns by those lengths.
The problem with this approach is that forming AT A squares the condition number of the problem. We
will show how to avoid this.
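A minimal NumPy sketch of these steps (fine as an illustration, but, as just noted, forming A^T A squares the condition number, which the bidiagonalization-based approach below avoids; the test matrix is arbitrary and full column rank is assumed):

    import numpy as np

    def svd_via_ata(A):
        C = A.T @ A                            # form C = A^T A
        lam, V = np.linalg.eigh(C)             # spectral decomposition of C
        idx = np.argsort(lam)[::-1]            # order eigenvalues from largest to smallest
        lam, V = lam[idx], V[:, idx]
        W = A @ V                              # W = U * Sigma
        sigma = np.linalg.norm(W, axis=0)      # singular values = column lengths of W
        U = W / sigma                          # normalize columns (assumes no zero singular values)
        return U, sigma, V

    A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    U, s, V = svd_via_ata(A)
    print(s)
    print(np.linalg.svd(A, compute_uv=False))  # for comparison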
17.2 Reduction to Bidiagonal Form

In the last chapter, we saw that it is beneficial to start by reducing a matrix to tridiagonal or upperHessenberg form when computing the Spectral or Schur decompositions. The corresponding step when computing the SVD of a given matrix is to reduce the matrix to bidiagonal form.
We assume that the bidiagonal matrix overwrites matrix A. Householder vectors will again play a role,
and will again overwrite elements that are annihilated (set to zero).
Partition A → [ α11  a12^T ; a21  A22 ]. In the first iteration, this can be visualized as follows.
We introduce zeroes below the diagonal in the first column by computing a Householder transformation and applying it from the left:
Algorithm: [A, t, r] := BiDRed_unb(A)
Partition A into quadrants with A_{TL} 0 × 0, and partition the vectors t and r conformally; while m(A_{TL}) < m(A), repartition to expose
    [ A00  a01  A02 ; a10^T  α11  a12^T ; A20  a21  A22 ],   [ t0 ; τ1 ; t2 ],   [ r0 ; ρ1 ; r2 ],
perform the updates described in the text below (a Householder transformation computed from ( α11 ; a21 ) and applied from the left, followed by one computed from a12^T and applied from the right), and continue with the repartitioned quadrants.
Figure 17.1: Basic algorithm for reduction of a nonsymmetric matrix to bidiagonal form.
Original matrix → first iteration → second iteration → third iteration → fourth iteration.
(Figure 17.2: illustration of the reduction of a matrix to bidiagonal form.)
Let [ ( 1 ; u21 ), τ1, ( α11 ; 0 ) ] := HouseV( ( α11 ; a21 ) ). This not only computes the Householder
vector, but also zeroes the entries in a21 and updates α11.
Update
    [ α11  a12^T ; a21  A22 ] := ( I − (1/τ1) ( 1 ; u21 ) ( 1 ; u21 )^T ) [ α11  a12^T ; a21  A22 ] = [ α11  a12^T − w12^T ; 0  A22 − u21 w12^T ],
where
* α11 is updated as part of the computation of the Householder vector;
* w12^T := ( a12^T + u21^T A22 ) / τ1;
* a12^T := a12^T − w12^T; and
* A22 := A22 − u21 w12^T.
This introduces zeroes below the diagonal in the first column.
The zeroes below α11 are not actually written. Instead, the implementation overwrites these
elements with the corresponding elements of the vector u21.
Next, we introduce zeroes in the first row to the right of the element on the superdiagonal by computing a Householder transformation from a12^T:
Let [v12, ρ1, a12] := HouseV(a12). This not only computes the Householder vector, but also
zeroes all but the first entry in a12^T, updating that first entry.
Update
    [ a12^T ; A22 ] := [ a12^T ; A22 ] ( I − (1/ρ1) v12 v12^T ) = [ α12 e0^T ; A22 − y21 v12^T ],
where
* a12^T is updated as part of the computation of the Householder vector;
* y21 := A22 v12 / ρ1; and
* A22 := A22 − y21 v12^T.
This introduces zeroes to the right of the superdiagonal in the first row.
The zeros to the right of the first element of a12^T are not actually written. Instead, the implementation overwrites these elements with the corresponding elements of the vector v12^T.
Continue this process with the updated A22.
The above observations are captured in the algorithm in Figure 17.1. It is also illustrated in Figure 17.2.
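As a concrete illustration, the following Mscript (MATLAB) sketch reduces a square matrix to bidiagonal form with explicit Householder reflectors. It forms the reflectors as full matrices for clarity rather than storing the Householder vectors in $A$ as the algorithm in Figure 17.1 does; the function name bidred_simple and this dense formulation are illustrative choices, not part of these notes.

% Simple (dense, unoptimized) reduction to bidiagonal form: B = U' * A * V.
% Illustrative sketch only; reflectors are applied as explicit matrices.
function [ B, U, V ] = bidred_simple( A )
  [ m, n ] = size( A );
  B = A;  U = eye( m );  V = eye( n );
  for k = 1:n
    % Zero entries below the diagonal in column k (apply from the left).
    x = B( k:m, k );
    v = x;  v( 1 ) = v( 1 ) + sign0( x( 1 ) ) * norm( x );
    if norm( v ) > 0
      H = eye( m );  H( k:m, k:m ) = eye( m-k+1 ) - 2 * ( v * v' ) / ( v' * v );
      B = H * B;  U = U * H;
    end
    % Zero entries to the right of the superdiagonal in row k (apply from the right).
    if k < n - 1
      x = B( k, k+1:n )';
      v = x;  v( 1 ) = v( 1 ) + sign0( x( 1 ) ) * norm( x );
      if norm( v ) > 0
        G = eye( n );  G( k+1:n, k+1:n ) = eye( n-k ) - 2 * ( v * v' ) / ( v' * v );
        B = B * G;  V = V * G;
      end
    end
  end
end

function s = sign0( alpha )   % sign that maps zero to one, to avoid a zero reflector
  if alpha >= 0, s = 1; else, s = -1; end
end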
Homework 17.3 Homework 17.4 Give the approximate total cost for reducing $A \in \mathbb{R}^{n \times n}$ to bidiagonal form.
* SEE ANSWER
For a detailed discussion of the blocked algorithm that uses FLAME notation, we recommend [43]:
Field G. Van Zee, Robert A. van de Geijn, Gregorio Quintana-Ortí, G. Joseph Elizondo.
Families of Algorithms for Reducing a Matrix to Condensed Form.
ACM Transactions on Mathematical Software (TOMS), Vol. 39, No. 1, 2012.
17.3 The QR Algorithm with a Bidiagonal Matrix
Let $B$ be upper bidiagonal and partition
$$B = \left(\begin{array}{c c} B_{00} & \beta_{10} e_l \\ 0 & \beta_{11} \end{array}\right),$$
where $e_l$ denotes the unit basis vector with a 1 in the last entry. Then
$$B^T B = \left(\begin{array}{c c} B_{00}^T & 0 \\ \beta_{10} e_l^T & \beta_{11} \end{array}\right)\left(\begin{array}{c c} B_{00} & \beta_{10} e_l \\ 0 & \beta_{11} \end{array}\right) = \left(\begin{array}{c c} B_{00}^T B_{00} & \beta_{10} B_{00}^T e_l \\ \beta_{10} e_l^T B_{00} & \beta_{10}^2 + \beta_{11}^2 \end{array}\right),$$
where $\beta_{10} B_{00}^T e_l$ is (clearly) a vector with a nonzero only in the last entry, so that $B^T B$ is tridiagonal.
The above also shows that if $B$ has no zeroes on its superdiagonal, then neither does $B^T B$.
This means that one can apply the QR algorithm to matrix $B^T B$. The problem, again, is that this squares the condition number of the problem.
The following extension of the Implicit Q Theorem comes to the rescue:
Theorem 17.6 Let $C, B \in \mathbb{R}^{m \times n}$. If there exist unitary matrices $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ so that $B = U^T C V$ is upper bidiagonal and has positive values on its superdiagonal, then $B$ and $V$ are uniquely determined by $C$ and the first column of $V$.
The above theorem supports an algorithm that, starting with an upper bidiagonal matrix B, implicitly
performs a shifted QR algorithm with T = BT B.
Consider the $5 \times 5$ bidiagonal matrix
$$B = \left(\begin{array}{c c c c c} \beta_{0,0} & \beta_{0,1} & 0 & 0 & 0 \\ 0 & \beta_{1,1} & \beta_{1,2} & 0 & 0 \\ 0 & 0 & \beta_{2,2} & \beta_{2,3} & 0 \\ 0 & 0 & 0 & \beta_{3,3} & \beta_{3,4} \\ 0 & 0 & 0 & 0 & \beta_{4,4} \end{array}\right).$$
Then $T = B^T B$ equals
$$T = \left(\begin{array}{c c c c c} \beta_{0,0}^2 & \beta_{0,0}\beta_{0,1} & 0 & 0 & 0 \\ \beta_{0,0}\beta_{0,1} & ? & ? & 0 & 0 \\ 0 & ? & ? & ? & 0 \\ 0 & 0 & ? & ? & ? \\ 0 & 0 & 0 & ? & \beta_{3,4}^2 + \beta_{4,4}^2 \end{array}\right),$$
where the ?s indicate "don't care" entries. Now, if an iteration of the implicitly shifted QR algorithm were executed with this matrix, then the shift would be $\kappa = \beta_{3,4}^2 + \beta_{4,4}^2$ (actually, it is usually computed from the bottom-right $2 \times 2$ matrix, a minor detail) and the first Givens' rotation would be computed so that
$$\left(\begin{array}{c c} \gamma_{0,1} & \sigma_{0,1} \\ -\sigma_{0,1} & \gamma_{0,1} \end{array}\right) \left(\begin{array}{c} \beta_{0,0}^2 - \kappa \\ \beta_{0,0}\beta_{0,1} \end{array}\right)$$
has a zero second entry. Let us call the $n \times n$ matrix with this as its top-left submatrix $G_0$. Then applying this from the left and right of $T$ means forming $G_0 T G_0^T = G_0 B^T B G_0^T = (B G_0^T)^T (B G_0^T)$. What if we only apply it to $B$? Then $B G_0^T$ equals $B$ except that a nonzero entry $\beta_{1,0}$ (the "bulge") appears below the diagonal in the first column:
$$B G_0^T = \left(\begin{array}{c c c c c} \beta_{0,0} & \beta_{0,1} & 0 & 0 & 0 \\ \beta_{1,0} & \beta_{1,1} & \beta_{1,2} & 0 & 0 \\ 0 & 0 & \beta_{2,2} & \beta_{2,3} & 0 \\ 0 & 0 & 0 & \beta_{3,3} & \beta_{3,4} \\ 0 & 0 & 0 & 0 & \beta_{4,4} \end{array}\right).$$
The idea now is to apply Givens' rotations from the left and right to "chase" the bulge, except that now the rotations applied from the two sides need not be the same. Thus, next, from $\left(\begin{array}{c}\beta_{0,0} \\ \beta_{1,0}\end{array}\right)$ one can compute $\gamma_{1,0}$ and $\sigma_{1,0}$ so that
$$\left(\begin{array}{c c} \gamma_{1,0} & \sigma_{1,0} \\ -\sigma_{1,0} & \gamma_{1,0} \end{array}\right)\left(\begin{array}{c}\beta_{0,0} \\ \beta_{1,0}\end{array}\right) = \left(\begin{array}{c}\hat\beta_{0,0} \\ 0\end{array}\right).$$
Applying this rotation from the left annihilates the bulge $\beta_{1,0}$ but creates a new bulge $\beta_{0,2}$ to the right of the superdiagonal in the first row. Continuing, from $\left(\begin{array}{c}\beta_{0,1} \\ \beta_{0,2}\end{array}\right)$ one can compute $\gamma_{0,2}$ and $\sigma_{0,2}$ so that applying the corresponding rotation from the right annihilates $\beta_{0,2}$, which in turn creates a new bulge $\beta_{2,1}$ below the superdiagonal, and so forth, until the bulge is chased off the end of the matrix.
Figure 17.3: Illustration of one sweep of an implicit QR algorithm with a bidiagonal matrix.
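The bulge chase is built from Givens' rotations. As a small Mscript sketch (the function name givens_rot is an illustrative choice, not from these notes), the following computes $\gamma$ and $\sigma$ so that the rotation annihilates the second entry of a 2-vector, which is exactly the computation needed at each step of the sweep.

% Compute gamma, sigma so that [ gamma sigma; -sigma gamma ] * [ alpha; beta ] = [ rho; 0 ].
function [ gamma, sigma ] = givens_rot( alpha, beta )
  if beta == 0
    gamma = 1;  sigma = 0;
  else
    rho   = sqrt( alpha^2 + beta^2 );
    gamma = alpha / rho;
    sigma = beta  / rho;
  end
end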
17.4 Putting it all together
[Figure: one sweep of the implicitly shifted QR algorithm applied to an upper bidiagonal matrix $B$, stated in FLAME notation, chasing the bulge with Givens' rotations applied from the left and the right.]
Chapter 18
Outline
Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
18.1. A Simple Example: One-Dimensional Boundary Value Problem . . . . . . . . . . . 295
18.2. A Two-dimensional Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
18.2.1. Discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
18.3. Direct solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
18.4. Iterative solution: Jacobi iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
18.4.1. Motivating example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
18.4.2. More generally . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
18.5. The general case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
18.6. Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
18.6.1. Some useful facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
18.1 A Simple Example: One-Dimensional Boundary Value Problem
A computational science application typically starts with a physical problem. This problem is described
by a law that governs the physics. Such a law can be expressed as a (continuous) mathematical equation
that must be satisfied. Often it is impossible to find an explicit solution for this equation and hence it is
approximated by discretizing the problem. At the bottom of the food chain, a linear algebra problem often
appears.
Let us illustrate this for a very simple, one-dimensional problem. An example of a one-dimensional Poisson equation with Dirichlet boundary condition on the domain $[0,1]$ can be described as
$$-u''(x) = f(x) \quad\text{where } u(0) = u(1) = 0.$$
The Dirichlet boundary condition is just a fancy way of saying that on the boundary (at $x = 0$ and $x = 1$) the function $u$ is restricted to take on the value 0. Think of this as a string where there is some driving force (e.g., a sound wave hitting the string) that is given by a continuous function $f(x)$, meaning that $u$ is twice continuously differentiable on the interior (for $0 < x < 1$). The ends of the string are fixed, giving rise to the given boundary condition. The function $u(x)$ indicates how far from rest (which would happen if $f(x) = 0$) the string is displaced at point $x$. There are a few details missing here, such as the fact that $u$ and $f$ are probably functions of time as well, but let's ignore that.
Now, we can transform this continuous problem into a discretized problem by partitioning the interval $[0,1]$ into $N+1$ equal intervals of length $h$ and looking at the points $x_0, x_1, \ldots, x_{N+1}$, where $x_i = ih$, which are the nodes where the intervals meet. We now instead compute $\chi_i \approx u(x_i)$. In our discussion, we will let $\phi_i = f(x_i)$ (which is exactly available, since $f(x)$ is given).
Now, we recall that
$$u'(x_i) \approx \frac{u(x_i + h) - u(x_i)}{h} \approx \frac{\chi_{i+1} - \chi_i}{h}.$$
Also,
$$u''(x_i) \approx \frac{u'(x_i + h) - u'(x_i)}{h} \approx \frac{\frac{u(x_i + h) - u(x_i)}{h} - \frac{u(x_i) - u(x_{i-1})}{h}}{h} \approx \frac{\chi_{i-1} - 2\chi_i + \chi_{i+1}}{h^2}.$$
So, one can approximate our problem with
$$-\chi_{i-1} + 2\chi_i - \chi_{i+1} = h^2 \phi_i, \quad 0 < i \le N, \qquad \chi_0 = \chi_{N+1} = 0,$$
or, equivalently,
$$\begin{array}{r c l} 2\chi_1 - \chi_2 & = & h^2 \phi_1 \\ -\chi_{i-1} + 2\chi_i - \chi_{i+1} & = & h^2 \phi_i, \quad 1 < i < N \\ -\chi_{N-1} + 2\chi_N & = & h^2 \phi_N. \end{array}$$
Writing all $N$ equations out and collecting them in matrix form yields
$$\left(\begin{array}{r r r r r} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{array}\right) \left(\begin{array}{c} \chi_1 \\ \chi_2 \\ \chi_3 \\ \vdots \\ \chi_N \end{array}\right) = \left(\begin{array}{c} h^2\phi_1 \\ h^2\phi_2 \\ h^2\phi_3 \\ \vdots \\ h^2\phi_N \end{array}\right).$$
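A minimal Mscript sketch of this discretization, assuming $f$ is given as a function handle (the names and the particular right-hand side are illustrative, not from these notes):

% Set up and solve the 1-D Poisson problem -u'' = f, u(0) = u(1) = 0,
% discretized with N interior points.
N = 100;
h = 1 / ( N + 1 );
x = ( 1:N )' * h;                         % interior mesh points
f = @( x ) pi^2 * sin( pi * x );          % example right-hand side
A = 2 * eye( N ) - diag( ones( N-1, 1 ), 1 ) - diag( ones( N-1, 1 ), -1 );
u = A \ ( h^2 * f( x ) );                 % approximates u(x) = sin( pi x )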
18.2 A Two-dimensional Example
A more typical computational engineering or sciences application starts with a Partial Differential Equation (PDE) that governs the physics of the problem. This is the higher dimensional equivalent of the last example. In this note, we will use one of the simplest, Laplace's equation. Consider
$$-\Delta u = f,$$
which in two dimensions is
$$-\frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial y^2} = f(x,y)$$
with boundary condition $u = 0$ on $\partial\Omega$ (meaning that $u(x,y) = 0$ on the boundary of the domain $\Omega$). For example, the domain may be $0 \le x, y \le 1$, $\partial\Omega$ its boundary, and the physical problem may be a membrane with, again, $f$ being some force from a sound wave.
18.2.1 Discretization
To solve the problem computationally with a computer, the problem is again discretized. Relating back to the problem of the membrane on the unit square in the previous section, this means that the continuous domain is viewed as a mesh instead, as illustrated in Figure 18.1. In that figure, $\upsilon_i$ will equal the displacement from rest at mesh point $i$.
Now, just like before, we will let $\phi_i$ be the value of $f(x,y)$ at mesh point $i$. One can approximate
$$\frac{\partial^2 u(x,y)}{\partial x^2} \approx \frac{u(x-h,y) - 2u(x,y) + u(x+h,y)}{h^2} \quad\text{and}\quad \frac{\partial^2 u(x,y)}{\partial y^2} \approx \frac{u(x,y-h) - 2u(x,y) + u(x,y+h)}{h^2},$$
so that
$$-\frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial y^2} = f(x,y)$$
Figure 18.1: Discretized domain, yielding a mesh. The specific ordering of the elements of u, by rows,
is known as the natural ordering. If one ordered differently, for example randomly, this would induce
a permutation on the rows and columns of matrix A, since it induces a permutation on the order of the
indices.
becomes
$$\frac{-u(x-h,y) + 2u(x,y) - u(x+h,y)}{h^2} + \frac{-u(x,y-h) + 2u(x,y) - u(x,y+h)}{h^2} = f(x,y)$$
or, equivalently,
$$\frac{-\upsilon_{i-N} - \upsilon_{i-1} + 4\upsilon_i - \upsilon_{i+1} - \upsilon_{i+N}}{h^2} = \phi_i,$$
where terms involving mesh points on the boundary are dropped. For the mesh in Figure 18.1 this gives
$$\begin{array}{r c l} 4\upsilon_1 - \upsilon_2 - \upsilon_5 & = & h^2\phi_1 \\ -\upsilon_1 + 4\upsilon_2 - \upsilon_3 - \upsilon_6 & = & h^2\phi_2 \\ -\upsilon_2 + 4\upsilon_3 - \upsilon_4 - \upsilon_7 & = & h^2\phi_3 \\ -\upsilon_3 + 4\upsilon_4 - \upsilon_8 & = & h^2\phi_4 \\ -\upsilon_1 + 4\upsilon_5 - \upsilon_6 - \upsilon_9 & = & h^2\phi_5 \\ & \vdots & \end{array} \qquad (18.1)$$
In matrix form this becomes
$$\left(\begin{array}{r r r r r r r r} 4 & -1 & & & -1 & & & \\ -1 & 4 & -1 & & & -1 & & \\ & -1 & 4 & -1 & & & -1 & \\ & & -1 & 4 & & & & -1 \\ -1 & & & & 4 & -1 & & \\ & -1 & & & -1 & 4 & -1 & \\ & & \ddots & & & \ddots & \ddots & \ddots \end{array}\right) \left(\begin{array}{c} \upsilon_1 \\ \upsilon_2 \\ \upsilon_3 \\ \upsilon_4 \\ \upsilon_5 \\ \upsilon_6 \\ \vdots \end{array}\right) = \left(\begin{array}{c} h^2\phi_1 \\ h^2\phi_2 \\ h^2\phi_3 \\ h^2\phi_4 \\ h^2\phi_5 \\ h^2\phi_6 \\ \vdots \end{array}\right). \qquad (18.2)$$
This now shows how solving the discretized Laplace's equation boils down to the solution of a linear system $Au = h^2 f\ (= b)$, where $A$ has a distinct sparsity pattern (pattern of nonzeroes).
Remark 18.1 The point is that many problems that arise in computational science require the solution to
a system of linear equations Au = b where A is a (very) sparse matrix.
Definition 18.2 Wilkinson defined a sparse matrix as any matrix with enough zeros that it pays to take
advantage of them.
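For concreteness, a short Mscript sketch that assembles the matrix in (18.2) for an $N \times N$ interior mesh using Kronecker products (this particular construction is a standard trick, not something derived in these notes):

% Assemble the 2-D Poisson (5-point stencil) matrix for an N x N interior mesh.
N = 4;
T = 2 * speye( N ) - spdiags( ones( N, 2 ), [ -1 1 ], N, N );  % 1-D -u'' matrix
I = speye( N );
A = kron( I, T ) + kron( T, I );   % diagonal entries equal 4, off-diagonals -1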
18.3 Direct solution
It is tempting to simply use a dense linear solver to compute the solution to $Au = b$. This would require $O(n^3)$ operations, where $n$ equals the size of matrix $A$. One can take advantage of the fact that beyond the outer band of $-1$s the matrix is entirely zero to reduce the cost to $O(nB^2)$, where $B$ equals the width of the band. And there are sparse direct methods that try to further reduce fill-in (zeroes that become nonzero during the factorization) to bring down the number of operations required to factor the matrix. Details of these go beyond the scope of this note.
18.4 Iterative solution: Jacobi iteration
It is more common to solve sparse problems like our model problem via iterative methods, which generate a sequence of vectors $u^{(k)}$, $k = 0, 1, \ldots$. The first vector, $u^{(0)}$, represents an initial guess of what the solution is. The hope is that $u^{(k)}$ converges to the solution $u$; in other words, that eventually $u^{(k)}$ becomes arbitrarily close to the solution. More precisely, we would like it to be the case that $\| u^{(k)} - u \| \rightarrow 0$ for some norm $\| \cdot \|$, where $u$ is the true solution.
18.4.1 Motivating example
Consider $Au = b$. If we guessed perfectly, then $u^{(0)}$ would solve $Au = b$ and it would be the case that
$$\begin{array}{r c l} \upsilon_1^{(0)} & = & (\beta_1 + \upsilon_2^{(0)} + \upsilon_5^{(0)})/4 \\ \upsilon_2^{(0)} & = & (\beta_2 + \upsilon_1^{(0)} + \upsilon_3^{(0)} + \upsilon_6^{(0)})/4 \\ \upsilon_3^{(0)} & = & (\beta_3 + \upsilon_2^{(0)} + \upsilon_4^{(0)} + \upsilon_7^{(0)})/4 \\ \upsilon_4^{(0)} & = & (\beta_4 + \upsilon_3^{(0)} + \upsilon_8^{(0)})/4 \\ & \vdots & \end{array}$$
Naturally, that is a bit optimistic. Therefore, a new approximation to the solution is created by computing
$$\begin{array}{r c l} \upsilon_1^{(1)} & = & (\beta_1 + \upsilon_2^{(0)} + \upsilon_5^{(0)})/4 \\ \upsilon_2^{(1)} & = & (\beta_2 + \upsilon_1^{(0)} + \upsilon_3^{(0)} + \upsilon_6^{(0)})/4 \\ \upsilon_3^{(1)} & = & (\beta_3 + \upsilon_2^{(0)} + \upsilon_4^{(0)} + \upsilon_7^{(0)})/4 \\ \upsilon_4^{(1)} & = & (\beta_4 + \upsilon_3^{(0)} + \upsilon_8^{(0)})/4 \\ & \vdots & \end{array}$$
In other words, a new guess for the displacement $\upsilon_i$ at point $i$ is generated from the values at the points around it, plus some contribution from $\beta_i$ (which itself incorporates a contribution from $\phi_i$ in our example). This process can be used to compute a sequence of vectors $u^{(0)}, u^{(1)}, \ldots$ by similarly computing $u^{(k+1)}$ from $u^{(k)}$:
$$\begin{array}{r c l} \upsilon_1^{(k+1)} & = & (\beta_1 + \upsilon_2^{(k)} + \upsilon_5^{(k)})/4 \\ \upsilon_2^{(k+1)} & = & (\beta_2 + \upsilon_1^{(k)} + \upsilon_3^{(k)} + \upsilon_6^{(k)})/4 \\ \upsilon_3^{(k+1)} & = & (\beta_3 + \upsilon_2^{(k)} + \upsilon_4^{(k)} + \upsilon_7^{(k)})/4 \\ \upsilon_4^{(k+1)} & = & (\beta_4 + \upsilon_3^{(k)} + \upsilon_8^{(k)})/4 \\ & \vdots & \end{array}$$
In matrix form this is
$$4 \left(\begin{array}{c} \upsilon_1^{(k+1)} \\ \upsilon_2^{(k+1)} \\ \upsilon_3^{(k+1)} \\ \upsilon_4^{(k+1)} \\ \upsilon_5^{(k+1)} \\ \vdots \end{array}\right) = \left(\begin{array}{r r r r r r} 0 & 1 & & & 1 & \\ 1 & 0 & 1 & & & \ddots \\ & 1 & 0 & 1 & & \\ & & 1 & 0 & & \\ 1 & & & & 0 & \ddots \\ & \ddots & & & \ddots & \ddots \end{array}\right) \left(\begin{array}{c} \upsilon_1^{(k)} \\ \upsilon_2^{(k)} \\ \upsilon_3^{(k)} \\ \upsilon_4^{(k)} \\ \upsilon_5^{(k)} \\ \vdots \end{array}\right) + \left(\begin{array}{c} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \\ \vdots \end{array}\right).$$
Typically, the physics of the problem dictates that this iteration will eventually yield a vector $u^{(k)}$ that is arbitrarily close to the solution of $Au = b$. More on the mathematical reasons for this later. In other words, the vectors $u^{(k)}$ will converge to $u$: $\lim_{k\rightarrow\infty} u^{(k)} = u$ or $\lim_{k\rightarrow\infty} \| u^{(k)} - u \| = 0$.
18.4.2 More generally
The above discussion can be more concisely described as follows. Split matrix $A = L + D + U$, where $L$ and $U$ are the strictly lower and upper triangular parts of $A$, respectively, and $D$ its diagonal. Now,
$$Au = b$$
implies that
$$(L + D + U)u = b$$
or
$$u = D^{-1}[ -(L+U)u + b ].$$
This is an example of a fixed-point equation: plug $u$ into $D^{-1}[ -(L+U)u + b ]$ and the result is again $u$. The iteration is then created by viewing the vector on the left as the next guess given the guess for $u$ on the right:
$$u^{(k+1)} = D^{-1}[ -(L+U)u^{(k)} + b ].$$
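A minimal Mscript sketch of this fixed-point iteration (the function name jacobi and the residual-based stopping test are illustrative choices, not prescribed by these notes):

% Jacobi iteration: u^(k+1) = D^{-1}( -(L+U) u^(k) + b ) = D^{-1}( (D-A) u^(k) + b ).
function u = jacobi( A, b, u, tol, maxits )
  D = diag( diag( A ) );
  for k = 1:maxits
    u = D \ ( ( D - A ) * u + b );
    if norm( b - A * u ) <= tol * norm( b )
      break;
    end
  end
end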
Now let us see how to extract a more detailed algorithm from this. Partition, conformally,
$$A = \left(\begin{array}{c c c} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22}\end{array}\right),\quad D = \left(\begin{array}{c c c} D_{00} & 0 & 0 \\ 0 & \delta_{11} & 0 \\ 0 & 0 & D_{22}\end{array}\right),\quad L = \left(\begin{array}{c c c} L_{00} & 0 & 0 \\ l_{10}^T & 0 & 0 \\ L_{20} & l_{21} & L_{22}\end{array}\right), \qquad (18.3)$$
$$U = \left(\begin{array}{c c c} U_{00} & u_{01} & U_{02} \\ 0 & 0 & u_{12}^T \\ 0 & 0 & U_{22}\end{array}\right),\quad u^{(k)} = \left(\begin{array}{c} x_0^{(k)} \\ \chi_1^{(k)} \\ x_2^{(k)}\end{array}\right), \quad\text{and}\quad b = \left(\begin{array}{c} b_0 \\ \beta_1 \\ b_2\end{array}\right). \qquad (18.4)$$
Consider $D u^{(k+1)} = -(L+U) u^{(k)} + b$. The middle row then states that
$$\delta_{11}\, \chi_1^{(k+1)} = - l_{10}^T x_0^{(k)} - u_{12}^T x_2^{(k)} + \beta_1.$$
Now, for the Jacobi iteration $\delta_{11} = \alpha_{11}$, $l_{10}^T = a_{10}^T$, and $u_{12}^T = a_{12}^T$, so that
$$\chi_1^{(k+1)} = ( \beta_1 - a_{10}^T x_0^{(k)} - a_{12}^T x_2^{(k)} ) / \alpha_{11}.$$
This suggests the algorithm in Figure 18.2. For those who prefer indices (and for these kinds of iterations, indices often clarify rather than obscure), this can be described as
$$\chi_i^{(k+1)} = \Bigl( \beta_i - \sum_{j \ne i,\ \alpha_{i,j} \ne 0} \alpha_{i,j}\, \chi_j^{(k)} \Bigr) / \alpha_{i,i}.$$
For our model problem,
$$-(L+U) = \left(\begin{array}{r r r r r r} 0 & 1 & & & 1 & \\ 1 & 0 & 1 & & & \ddots \\ & 1 & 0 & 1 & & \\ & & 1 & 0 & & \\ 1 & & & & 0 & \ddots \\ & \ddots & & & \ddots & \ddots \end{array}\right) \quad\text{and}\quad D = 4I.$$
18.5 The general case
The Jacobi iteration is a special case of a family of methods known as splitting methods. One starts with a splitting of a nonsingular matrix $A$ into $A = M - N$. One next observes that $Au = b$ is equivalent to $(M-N)u = b$, which is equivalent to $Mu = Nu + b$ or $u = M^{-1}(Nu + b)$. Notice that $M^{-1}$ is not explicitly
[Figure 18.2 presents, in FLAME notation, one iteration of the Jacobi and Gauss-Seidel splitting methods; for Gauss-Seidel the current element is updated as $\chi_1^{(k+1)} = ( \beta_1 - a_{10}^T x_0^{(k+1)} - a_{12}^T x_2^{(k)} ) / \alpha_{11}$.]
Figure 18.2: Various splitting methods (one iteration $x^{(k+1)} = M^{-1}(N x^{(k)} + b)$).
formed: one solves with $M$ instead. Now, a vector $u$ is the solution of $Au = b$ if and only if it satisfies $u = M^{-1}(Nu + b)$. The idea is to start with an initial guess (approximation) of $u$, $u^{(0)}$, and to then iterate to compute $u^{(k+1)} = M^{-1}(N u^{(k)} + b)$, which requires a multiplication with $N$ and a solve with $M$ in each iteration.
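In Mscript, a generic splitting-method iteration might be sketched as follows (the name splitting_method and the stopping test are assumptions on our part):

% Generic splitting method for A = M - N: solve with M, multiply with N.
function u = splitting_method( M, N, b, u, tol, maxits )
  for k = 1:maxits
    u = M \ ( N * u + b );
    if norm( b - ( M - N ) * u ) <= tol * norm( b )
      break;
    end
  end
end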
Example 18.4 Let $A \in \mathbb{C}^{n \times n}$ and split this matrix as $A = L + D + U$, where $L$ and $U$ are the strictly lower and upper triangular parts of $A$, respectively, and $D$ its diagonal. Then choosing $M = D$ and $N = -(L+U)$ yields the Jacobi iteration.
The convergence of these methods is summarized by the following theorem:
Theorem 18.5 Let $A \in \mathbb{C}^{n \times n}$ be nonsingular and $u, b \in \mathbb{C}^n$ so that $Au = b$. Let $A = M - N$ be a splitting of $A$, $u^{(0)}$ be given (an initial guess), and $u^{(k+1)} = M^{-1}(N u^{(k)} + b)$. If $\| M^{-1} N \| < 1$ for some matrix norm induced by a vector norm $\| \cdot \|$, then $u^{(k)}$ will converge to the solution $u$.
Proof: Notice that $Au = b$ implies that $(M-N)u = b$, which implies that $Mu = Nu + b$ and hence $u = M^{-1}(Nu + b)$. Hence
$$u^{(k+1)} - u = M^{-1}(N u^{(k)} + b) - M^{-1}(N u + b) = M^{-1} N (u^{(k)} - u).$$
Taking norms of both sides, one gets that
$$\| u^{(k+1)} - u \| = \| M^{-1} N (u^{(k)} - u) \| \le \| M^{-1} N \|\, \| u^{(k)} - u \|.$$
Since this holds for all $k$, one can deduce that
$$\| u^{(k)} - u \| \le \| M^{-1} N \|^k\, \| u^{(0)} - u \|.$$
If $\| M^{-1} N \| < 1$ then the right-hand side of this will converge to zero, meaning that $u^{(k)}$ converges to $u$.
Now, often one checks convergence by computing the residual $r^{(k)} = b - A u^{(k)}$. After all, if $r^{(k)}$ is approximately the zero vector, then $u^{(k)}$ approximately solves $Au = b$. The following corollary links the convergence of $r^{(k)}$ to the convergence of $u^{(k)}$.
Corollary 18.6 Let $A \in \mathbb{C}^{n \times n}$ be nonsingular and $u, b \in \mathbb{C}^n$ so that $Au = b$. Let $u^{(k)} \in \mathbb{C}^n$ and let $r^{(k)} = b - A u^{(k)}$. Then $u^{(k)}$ converges to $u$ if and only if $r^{(k)}$ converges to the zero vector.
Proof: Note that $r^{(k)} = b - A u^{(k)} = Au - A u^{(k)} = A( u - u^{(k)} )$. Since $A$ is nonsingular, clearly $r^{(k)}$ converges to the zero vector if and only if $u^{(k)}$ converges to $u$.
Corollary 18.7 Let $A \in \mathbb{C}^{n \times n}$ be nonsingular and $u, b \in \mathbb{C}^n$ so that $Au = b$. Let $A = M - N$ be a splitting of $A$, $u^{(0)}$ be given (an initial guess), $u^{(k+1)} = M^{-1}(N u^{(k)} + b)$, and $r^{(k)} = b - A u^{(k)}$. If $\| M^{-1} N \| < 1$ for some matrix norm induced by a vector norm $\| \cdot \|$, then $r^{(k)}$ will converge to the zero vector.
Now let us see how to extract a more detailed algorithm from this. Partition, conformally,
$$M = \left(\begin{array}{c c c} M_{00} & m_{01} & M_{02} \\ m_{10}^T & \mu_{11} & m_{12}^T \\ M_{20} & m_{21} & M_{22}\end{array}\right), \quad N = \left(\begin{array}{c c c} N_{00} & n_{01} & N_{02} \\ n_{10}^T & \nu_{11} & n_{12}^T \\ N_{20} & n_{21} & N_{22}\end{array}\right), \quad u^{(k)} = \left(\begin{array}{c} x_0^{(k)} \\ \chi_1^{(k)} \\ x_2^{(k)}\end{array}\right), \quad\text{and}\quad b = \left(\begin{array}{c} b_0 \\ \beta_1 \\ b_2\end{array}\right).$$
Consider $M u^{(k+1)} = N u^{(k)} + b$. The middle row then states that
$$m_{10}^T x_0^{(k+1)} + \mu_{11}\, \chi_1^{(k+1)} + m_{12}^T x_2^{(k+1)} = n_{10}^T x_0^{(k)} + \nu_{11}\, \chi_1^{(k)} + n_{12}^T x_2^{(k)} + \beta_1.$$
Jacobi iteration
Example 18.8 For the Jacobi iteration $m_{10}^T = 0$, $\mu_{11} = \alpha_{11}$, $m_{12}^T = 0$, $n_{10}^T = -a_{10}^T$, $\nu_{11} = 0$, and $n_{12}^T = -a_{12}^T$, so that
$$\chi_1^{(k+1)} = ( \beta_1 - a_{10}^T x_0^{(k)} - a_{12}^T x_2^{(k)} ) / \alpha_{11},$$
recovering the update we derived before.
Gauss-Seidel iteration
If instead the most recently computed values are used as soon as they become available, one obtains, for our model problem,
$$\begin{array}{r c l} \upsilon_1^{(k+1)} & = & (\beta_1 + \upsilon_2^{(k)} + \upsilon_5^{(k)})/4 \\ \upsilon_2^{(k+1)} & = & (\beta_2 + \upsilon_1^{(k+1)} + \upsilon_3^{(k)} + \upsilon_6^{(k)})/4 \\ \upsilon_3^{(k+1)} & = & (\beta_3 + \upsilon_2^{(k+1)} + \upsilon_4^{(k)} + \upsilon_7^{(k)})/4 \\ \upsilon_4^{(k+1)} & = & (\beta_4 + \upsilon_3^{(k+1)} + \upsilon_8^{(k)})/4 \\ \upsilon_5^{(k+1)} & = & (\beta_5 + \upsilon_1^{(k+1)} + \upsilon_6^{(k)} + \upsilon_9^{(k)})/4 \\ & \vdots & \end{array}$$
which in matrix form reads $(D + L)\, u^{(k+1)} = -U u^{(k)} + b$, with $D + L$ the lower triangular part of $A$ (diagonal entries 4, subdiagonal entries $-1$) and $-U$ the strictly upper triangular part (entries 1).
Again, split $A = L + D + U$ and partition these matrices as in (18.3) and (18.4). Consider
$$\left(\begin{array}{c c c} D_{00}+L_{00} & 0 & 0 \\ l_{10}^T & \delta_{11} & 0 \\ L_{20} & l_{21} & D_{22}+L_{22}\end{array}\right) \left(\begin{array}{c} x_0^{(k+1)} \\ \chi_1^{(k+1)} \\ x_2^{(k+1)}\end{array}\right) = -\left(\begin{array}{c c c} U_{00} & u_{01} & U_{02} \\ 0 & 0 & u_{12}^T \\ 0 & 0 & U_{22}\end{array}\right) \left(\begin{array}{c} x_0^{(k)} \\ \chi_1^{(k)} \\ x_2^{(k)}\end{array}\right) + \left(\begin{array}{c} b_0 \\ \beta_1 \\ b_2\end{array}\right).$$
The middle row then states that
$$l_{10}^T x_0^{(k+1)} + \delta_{11}\, \chi_1^{(k+1)} = -u_{12}^T x_2^{(k)} + \beta_1.$$
Now, for the Gauss-Seidel iteration $\delta_{11} = \alpha_{11}$, $l_{10}^T = a_{10}^T$, and $u_{12}^T = a_{12}^T$, so that
$$\chi_1^{(k+1)} = ( \beta_1 - a_{10}^T x_0^{(k+1)} - a_{12}^T x_2^{(k)} ) / \alpha_{11}.$$
This suggests the algorithm in Figure 18.2. For those who prefer indices, this can be described as
$$\chi_i^{(k+1)} = \Bigl( \beta_i - \sum_{j=0,\ \alpha_{i,j}\ne 0}^{i-1} \alpha_{i,j}\, \chi_j^{(k+1)} - \sum_{j=i+1,\ \alpha_{i,j}\ne 0}^{n-1} \alpha_{i,j}\, \chi_j^{(k)} \Bigr) / \alpha_{i,i}.$$
Notice that this fits the general framework, since now $M = L + D$ and $N = -U$.
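A componentwise Mscript sketch of one Gauss-Seidel sweep, assuming a dense representation of $A$ (the name gauss_seidel_sweep is an illustrative choice):

% One Gauss-Seidel sweep: overwrite u with the next iterate, elementwise.
function u = gauss_seidel_sweep( A, b, u )
  n = length( b );
  for i = 1:n
    u( i ) = ( b( i ) - A( i, 1:i-1 ) * u( 1:i-1 ) ...
                      - A( i, i+1:n ) * u( i+1:n ) ) / A( i, i );
  end
end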
Successive Over-Relaxation (SOR)
One can view the Gauss-Seidel update
$$\chi_1^{GS(k+1)} = ( \beta_1 - a_{10}^T x_0^{(k+1)} - a_{12}^T x_2^{(k)} ) / \alpha_{11}$$
as defining $\chi_1^{GS(k+1)} - \chi_1^{(k)}$, a direction in which the solution is changing. Now, one can ask the question "What if we go a little further in that direction?" In other words, one can think of this as
$$\chi_1^{(k+1)} = \chi_1^{(k)} + \omega\,[\chi_1^{GS(k+1)} - \chi_1^{(k)}] = \omega\,\chi_1^{GS(k+1)} + (1-\omega)\,\chi_1^{(k)},$$
for some relaxation parameter $\omega = 1 + \epsilon$. Going further would mean picking $\omega > 1$. Then
$$\chi_1^{(k+1)} = \omega\,( \beta_1 - a_{10}^T x_0^{(k+1)} - a_{12}^T x_2^{(k)} ) / \alpha_{11} + (1-\omega)\,\chi_1^{(k)},$$
so that
$$\omega\, a_{10}^T x_0^{(k+1)} + \alpha_{11}\, \chi_1^{(k+1)} = (1-\omega)\,\alpha_{11}\,\chi_1^{(k)} - \omega\, a_{12}^T x_2^{(k)} + \omega\,\beta_1,$$
or
$$a_{10}^T x_0^{(k+1)} + \frac{\alpha_{11}}{\omega}\, \chi_1^{(k+1)} = \frac{1-\omega}{\omega}\,\alpha_{11}\,\chi_1^{(k)} - a_{12}^T x_2^{(k)} + \beta_1.$$
Now, if one once again partitions $A = L + D + U$ and one takes $M = L + \frac{1}{\omega} D$ and $N = \frac{1-\omega}{\omega} D - U$, then it can be easily checked that $M u^{(k+1)} = N u^{(k)} + b$ reproduces this update. Thus,
$$\| u^{(k)} - u \| \le \Bigl\| \bigl(L + \tfrac{1}{\omega} D\bigr)^{-1} \bigl(\tfrac{1-\omega}{\omega} D - U\bigr) \Bigr\|^k\, \| u^{(0)} - u \|.$$
The forward sweep can be followed by a backward sweep (in which the unknowns are updated in reverse order), which corresponds to the splittings
$$M_F = L + \tfrac{1}{\omega} D, \quad N_F = \tfrac{1-\omega}{\omega} D - U \quad\text{(the Forward splitting)}$$
and
$$M_B = U + \tfrac{1}{\omega} D, \quad N_B = \tfrac{1-\omega}{\omega} D - L \quad\text{(the Backward splitting).}$$
Then
$$M_F u^{(k+1/2)} = N_F u^{(k)} + b, \qquad M_B u^{(k+1)} = N_B u^{(k+1/2)} + b,$$
or
$$\begin{array}{r c l} u^{(k+1)} & = & M_B^{-1}\bigl[ N_B u^{(k+1/2)} + b \bigr] \\ & = & M_B^{-1}\bigl[ N_B M_F^{-1}\bigl[ N_F u^{(k)} + b \bigr] + b \bigr] \\ & = & \bigl(U + \tfrac{1}{\omega} D\bigr)^{-1}\Bigl[\bigl(\tfrac{1-\omega}{\omega} D - L\bigr)\bigl(L + \tfrac{1}{\omega} D\bigr)^{-1}\bigl[\bigl(\tfrac{1-\omega}{\omega} D - U\bigr) u^{(k)} + b\bigr] + b\Bigr] \\ & = & \Bigl[(\omega U + D)^{-1}\bigl((1-\omega) D - \omega L\bigr)\Bigr]\Bigl[(\omega L + D)^{-1}\bigl((1-\omega) D - \omega U\bigr)\Bigr] u^{(k)} + \cdots \end{array}$$
18.6 Theory
18.6.1 Some useful facts
Recall that for $A \in \mathbb{C}^{n \times n}$ its spectral radius is denoted by $\rho(A)$: it equals the magnitude of the eigenvalue of $A$ that is largest in magnitude.
Theorem 18.9 Let $\| \cdot \|$ be a matrix norm induced by a vector norm $\| \cdot \|$. Then for any $A \in \mathbb{C}^{n \times n}$, $\rho(A) \le \| A \|$.
Homework 18.10 Prove the above theorem.
* SEE ANSWER
Theorem 18.11 Given a matrix $A \in \mathbb{C}^{n \times n}$ and $\epsilon > 0$, there exists a consistent matrix norm $\| \cdot \|$ such that $\| A \| \le \rho(A) + \epsilon$.
Proof: Let $A = U T U^H$ be a Schur decomposition of $A$, ordered so that an eigenvalue of largest magnitude appears in the top-left entry of $T$, and partition
$$U = \bigl( u_0 \; U_1 \bigr), \qquad T = \left(\begin{array}{c c} \lambda & t_{01}^T \\ 0 & T_{11} \end{array}\right), \quad |\lambda| = \rho(A).$$
For a parameter $\delta > 0$, define the norm
$$\| X \| = \left\| \left(\begin{array}{c c} 1 & 0 \\ 0 & (\delta/\|A\|_2) I \end{array}\right)^{-1} U^H X\, U \left(\begin{array}{c c} 1 & 0 \\ 0 & (\delta/\|A\|_2) I \end{array}\right) \right\|_2,$$
which is a consistent matrix norm, since it is the 2-norm composed with a fixed similarity transformation. Applying it to $A$ scales the off-diagonal block $t_{01}^T$ by $\delta/\|A\|_2$ while leaving the diagonal blocks untouched, so that, by the triangle inequality,
$$\| A \| \le \left\| \left(\begin{array}{c c} \lambda & 0 \\ 0 & T_{11} \end{array}\right) \right\|_2 + \left\| \left(\begin{array}{c c} 0 & (\delta/\|A\|_2)\, t_{01}^T \\ 0 & 0 \end{array}\right) \right\|_2 \le \max\bigl( \rho(A), \|T_{11}\|_2 \bigr) + \delta\, \frac{\| t_{01} \|_2}{\|A\|_2}.$$
Choosing $\delta$ small enough, and applying the same construction recursively to $T_{11}$, yields a consistent norm for which $\| A \| \le \rho(A) + \epsilon$.
In fact, convergence of the splitting method for every initial guess is equivalent to $\rho(M^{-1}N) < 1$:
$(\Leftarrow)$ Assume $\rho(M^{-1}N) < 1$. Pick $\epsilon$ such that $\rho(M^{-1}N) + \epsilon < 1$. Finally, pick $\| \cdot \|$ such that $\| M^{-1}N \| \le \rho(M^{-1}N) + \epsilon$. Then $\| M^{-1}N \| < 1$ and hence the sequence converges.
$(\Rightarrow)$ We will show that if $\rho(M^{-1}N) \ge 1$ then there exists an $x^{(0)}$ such that the iteration does not converge. Let $y \ne 0$ have the property that $M^{-1}N y = \lambda y$ with $|\lambda| = \rho(M^{-1}N)$. Choose $x^{(0)} = y + x$, where $Ax = b$. Then
$$\| x^{(1)} - x \| = \| M^{-1}N (x^{(0)} - x) \| = \| M^{-1}N y \| = |\lambda|\, \| y \| = \rho(M^{-1}N)\, \| x^{(0)} - x \| \ge \| x^{(0)} - x \|.$$
Extending this, one finds that for all $k$, $\| x^{(k)} - x \| = \rho(M^{-1}N)^k\, \| x^{(0)} - x \| \ge \| x^{(0)} - x \|$. Thus, the sequence does not converge.
Definition 18.13 A matrix $A \in \mathbb{C}^{n \times n}$ is said to be diagonally dominant if for all $i$, $0 \le i < n$,
$$|\alpha_{i,i}| \ge \sum_{j \ne i} |\alpha_{i,j}|.$$
It is strictly diagonally dominant if the inequality is strict for all $i$.
Recall the Gershgorin disks
$$D_i = \Bigl\{ \lambda : |\lambda - \alpha_{i,i}| \le \sum_{j \ne i} |\alpha_{i,j}| \Bigr\};$$
every eigenvalue of $A$ lies in one of these disks, so that if $\lambda$ is an eigenvalue of $A$, there exists a $k$ such that
$$|\lambda - \alpha_{k,k}| \le \sum_{j \ne k} |\alpha_{k,j}|.$$
A matrix $A \in \mathbb{C}^{n \times n}$ is said to be reducible if there exists a permutation matrix $P$ such that
$$P A P^T = \left(\begin{array}{c c} T_{TL} & T_{TR} \\ 0 & T_{BR} \end{array}\right),$$
where $T_{TL}$ (and hence $T_{BR}$) is square of size $b \times b$ with $0 < b < n$. A matrix that is not reducible is irreducible.
In other words, a matrix is reducible if and only if there exists a symmetric permutation of the rows and columns that leaves the matrix block upper triangular.
Homework 18.17 A symmetric matrix $A \in \mathbb{C}^{n \times n}$ is reducible if and only if there exists a permutation matrix $P$ such that
$$P A P^T = \left(\begin{array}{c c} T_{TL} & 0 \\ 0 & T_{BR} \end{array}\right),$$
where $T_{TL}$ (and hence $T_{BR}$) is square of size $b \times b$ with $0 < b < n$.
In other words, a symmetric matrix is reducible if and only if there exists a symmetric permutation of the rows and columns that leaves the matrix block diagonal.
Theorem 18.18 If $A$ is strictly diagonally dominant, then the Jacobi iteration converges.
Proof: For the Jacobi iteration, $M = D$ and $N = -(L+U)$, so that
$$M^{-1}N = -D^{-1}(L+U) = -\left(\begin{array}{c c c c} 0 & \frac{\alpha_{0,1}}{\alpha_{0,0}} & \cdots & \frac{\alpha_{0,n-1}}{\alpha_{0,0}} \\ \frac{\alpha_{1,0}}{\alpha_{1,1}} & 0 & \cdots & \frac{\alpha_{1,n-1}}{\alpha_{1,1}} \\ \vdots & & \ddots & \vdots \\ \frac{\alpha_{n-1,0}}{\alpha_{n-1,n-1}} & \frac{\alpha_{n-1,1}}{\alpha_{n-1,n-1}} & \cdots & 0 \end{array}\right).$$
Since $A$ is strictly diagonally dominant, for each row $i$
$$\sum_{j \ne i} \left| \frac{\alpha_{i,j}}{\alpha_{i,i}} \right| < 1.$$
By the Gershgorin Disk Theorem we therefore know that $\rho(M^{-1}N) < 1$. Hence we know there exists a norm $\| \cdot \|$ such that $\| M^{-1}N \| < 1$, and hence the Jacobi iteration converges.
Theorem 18.19 If $A$ is strictly diagonally dominant, then the Gauss-Seidel iteration converges.
Proof:
Theorem 18.20 If $A$ is weakly diagonally dominant, irreducible, and has at least one row for which
$$|\alpha_{k,k}| > \sum_{j \ne k} |\alpha_{k,j}|,$$
Chapter 19
Notes on Descent Methods and the Conjugate Gradient Method
Outline
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
19.1. Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
19.2. Descent Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
19.3. Relation to Splitting Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
19.4. Method of Steepest Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
19.5. Preconditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
19.6. Methods of A-conjugate Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
19.7. Conjugate Gradient Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
19.1 Basics
The basic idea is to look at the problem of solving $Ax = b$ instead as the minimization problem that finds the vector $x$ that minimizes the function $f(x) = \frac{1}{2} x^T A x - x^T b$.
Theorem 19.1 Let $A$ be SPD and assume that $Ax = b$. Then the vector $x$ minimizes the function $f(y) = \frac{1}{2} y^T A y - y^T b$.
Proof: Let $Ax = b$. Then
$$f(y) = \frac{1}{2} y^T A y - y^T b = \frac{1}{2} y^T A y - y^T A x = \frac{1}{2} y^T A y - y^T A x + \frac{1}{2} x^T A x - \frac{1}{2} x^T A x = \frac{1}{2} (y-x)^T A (y-x) - \frac{1}{2} x^T A x.$$
Since $\frac{1}{2} x^T A x$ is independent of $y$, and $A$ is SPD, this is clearly minimized when $y = x$.
An alternative way to look at this is to note that the function is minimized when the gradient is zero and that $\nabla f(x) = A x - b$.
19.2 Descent Methods
The basic idea behind a descent method is that at the $k$th iteration one has an approximation to $x$, $x^{(k)}$. One would like to create a better approximation, $x^{(k+1)}$. To do so, one picks a search direction, $p^{(k)}$, and lets $x^{(k+1)} = x^{(k)} + \alpha_k p^{(k)}$. In other words, one searches for a minimum along the line defined by the current iterate, $x^{(k)}$, and the search direction, $p^{(k)}$. One then picks $\alpha_k$ so that, preferably, $f(x^{(k+1)}) \le f(x^{(k)})$. Typically, one picks $\alpha_k$ to exactly minimize the function along the line (leading to exact descent methods).
Now,
$$\begin{array}{r c l} f(x^{(k+1)}) & = & f(x^{(k)} + \alpha_k p^{(k)}) = \frac{1}{2}(x^{(k)} + \alpha_k p^{(k)})^T A (x^{(k)} + \alpha_k p^{(k)}) - (x^{(k)} + \alpha_k p^{(k)})^T b \\ & = & \frac{1}{2} x^{(k)T} A x^{(k)} + \alpha_k\, p^{(k)T} A x^{(k)} + \frac{1}{2}\alpha_k^2\, p^{(k)T} A p^{(k)} - x^{(k)T} b - \alpha_k\, p^{(k)T} b \\ & = & f(x^{(k)}) + \frac{1}{2}\alpha_k^2\, p^{(k)T} A p^{(k)} + \alpha_k\, p^{(k)T}( A x^{(k)} - b ) \\ & = & f(x^{(k)}) + \frac{1}{2}\alpha_k^2\, p^{(k)T} A p^{(k)} - \alpha_k\, p^{(k)T} r^{(k)}, \end{array}$$
where $r^{(k)} = b - A x^{(k)}$, the residual. This is a quadratic polynomial in $\alpha_k$ (since this is the only free variable). Thus, minimizing this expression exactly requires the derivative with respect to $\alpha_k$ to be zero:
$$0 = \frac{d\, f(x^{(k)} + \alpha_k p^{(k)})}{d \alpha_k} = \alpha_k\, p^{(k)T} A p^{(k)} - p^{(k)T} r^{(k)},$$
[Figure 19.1 presents three equivalent formulations of a basic descent method: one that recomputes $r^{(k+1)} = b - A x^{(k+1)}$, one that updates the residual via $r^{(k+1)} = r^{(k)} - \alpha_k A p^{(k)}$, and one that overwrites the various vectors to reduce storage.]
Figure 19.1: Three basic descent methods. The formulation in the middle follows from the one on the left by virtue of $b - A x^{(k+1)} = b - A( x^{(k)} + \alpha_k p^{(k)} ) = r^{(k)} - \alpha_k A p^{(k)}$. It is preferred because it requires only one matrix-vector multiplication per iteration (to form $A p^{(k)}$). The one on the right overwrites the various vectors to reduce storage.
and hence, for exact descent methods,
$$\alpha_k = \frac{p^{(k)T} r^{(k)}}{p^{(k)T} A p^{(k)}}.$$
A question now becomes how to pick the search directions $\{ p^{(0)}, p^{(1)}, \ldots \}$.
Basic descent methods based on these ideas are given in Figure 19.1. What is missing in those algorithms is a stopping criterion. Notice that $\alpha_k = 0$ if $p^{(k)T} r^{(k)} = 0$. But this does not necessarily mean that we have converged, since it merely means that $p^{(k)}$ is orthogonal to $r^{(k)}$. What we will see is that commonly used methods pick $p^{(k)}$ not to be orthogonal to $r^{(k)}$. Then $p^{(k)T} r^{(k)} = 0$ implies $r^{(k)} = 0$, and this condition, cheap to compute, can be used as a stopping criterion.
19.3 Relation to Splitting Methods
What if we pick the search directions cyclically to equal the unit basis vectors, $p^{(0)} = e_0, p^{(1)} = e_1, \ldots$? Then, for the first step,
$$x^{(1)} = x^{(0)} + \alpha_0 p^{(0)} = x^{(0)} + \frac{p^{(0)T} r^{(0)}}{p^{(0)T} A p^{(0)}}\, e_0 = x^{(0)} + \frac{\beta_0 - a_0^T x^{(0)}}{\alpha_{0,0}}\, e_0,$$
where $a_0^T$ denotes the first row of $A$, so that only the first element of $x^{(0)}$ changes:
$$\chi_0^{(1)} = \chi_0^{(0)} + \frac{1}{\alpha_{0,0}}\Bigl( \beta_0 - \sum_{j=0}^{n-1} \alpha_{0,j}\, \chi_j^{(0)} \Bigr) = \frac{1}{\alpha_{0,0}}\Bigl( \beta_0 - \sum_{j=1}^{n-1} \alpha_{0,j}\, \chi_j^{(0)} \Bigr).$$
Careful contemplation then reveals that this is exactly how the first element in vector $x$ is changed in the Gauss-Seidel method!
We leave it to the reader to similarly deduce that $x^{(n)}$ is exactly the vector one gets from applying one step of Gauss-Seidel (updating all elements of $x$ once):
$$x^{(n)} = (D + L)^{-1}( -L^T x^{(0)} + b ),$$
where $D$, $L$, and $L^T$ equal the diagonal, strictly lower triangular, and strictly upper triangular parts of the symmetric matrix $A$, so that $A = M - N$ with $M = D + L$ and $N = -L^T$.
Given that choosing the search directions cyclically as $\{ e_0, e_1, \ldots \}$ yields the Gauss-Seidel iteration, one can imagine picking $x^{(k+1)} = x^{(k)} + \omega \alpha_k p^{(k)}$, with some relaxation factor $\omega$. This would yield SOR if we continue to use, cyclically, the search directions $\{ e_0, e_1, \ldots \}$.
When the next iterate is chosen to equal the exact minimum along the line defined by the search direction, the method is called an exact descent method and it is, clearly, guaranteed that $f(x^{(k+1)}) \le f(x^{(k)})$. When a relaxation parameter $\omega \ne 1$ is used, this is no longer the case, and the method is called inexact.
19.4 Method of Steepest Descent
For a function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ that we are trying to minimize, for a given $x$, the direction in which the function most rapidly increases in value at $x$ is given by the gradient, $\nabla f(x)$. Thus, the direction in which it decreases most rapidly is $-\nabla f(x)$. For our function
$$f(x) = \frac{1}{2} x^T A x - x^T b,$$
this direction of steepest descent is given by
$$-\nabla f(x) = -(Ax - b) = b - Ax.$$
Thus, recalling that $r^{(k)} = b - A x^{(k)}$, the direction of steepest descent at $x^{(k)}$ is given by $p^{(k)} = r^{(k)} = b - A x^{(k)}$, the residual, so that the iteration becomes
$$x^{(k+1)} = x^{(k)} + \frac{r^{(k)T} r^{(k)}}{r^{(k)T} A r^{(k)}}\, r^{(k)}.$$
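A minimal Mscript sketch of the method of steepest descent for SPD $A$ (illustrative; the tolerance-based stopping test is our own choice):

% Method of steepest descent for SPD A: p^(k) = r^(k) = b - A x^(k).
function x = steepest_descent( A, b, x, tol, maxits )
  r = b - A * x;
  for k = 1:maxits
    q     = A * r;
    alpha = ( r' * r ) / ( r' * q );
    x     = x + alpha * r;
    r     = r - alpha * q;           % r^(k+1) = r^(k) - alpha_k A r^(k)
    if norm( r ) <= tol * norm( b )
      break;
    end
  end
end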
19.5 Preconditioning
For general nonlinear $f(x)$, using the direction of steepest descent as the search direction is often effective. Not necessarily so for our problem, if $A$ is relatively ill-conditioned.
A picture clarifies this (on the blackboard). The key insight is that if $\kappa(A) = \lambda_0/\lambda_{n-1}$ (the ratio between the largest and smallest eigenvalues or, equivalently for SPD $A$, the ratio between the largest and smallest singular values) is large, then convergence can take many iterations.
What would happen if $\lambda_0 = \cdots = \lambda_{n-1}$? Then $A = Q \Lambda Q^T$ is the spectral decomposition of $A$, $A = Q(\lambda_0 I)Q^T = \lambda_0 I$, and
$$x^{(1)} = x^{(0)} + \frac{r^{(0)T} r^{(0)}}{r^{(0)T} \lambda_0 I\, r^{(0)}}\, r^{(0)} = x^{(0)} + \frac{1}{\lambda_0} r^{(0)} = x^{(0)} + \frac{1}{\lambda_0}( b - \lambda_0 x^{(0)} ) = \frac{1}{\lambda_0} b,$$
the exact solution, in one iteration. The idea behind preconditioning is to pick an SPD matrix $M \approx A$ with $M = L L^T$ and to apply the method of steepest descent to the transformed problem $\bar A \bar x = \bar b$ with $\bar A = L^{-1} A L^{-T}$, $\bar b = L^{-1} b$, and $\bar x = L^T x$. If $M$ is chosen carefully, $\kappa( L^{-1} A L^{-T} )$ can be greatly improved. The best choice would be $M = A$, of course, but that is not realistic. The matrix $M$ is called the preconditioner. Some careful rearrangement takes the method of steepest descent on the transformed problem to the much simpler preconditioned algorithm in Figure 19.2.
19.6 Methods of A-conjugate Directions
We now borrow heavily from the explanation in Golub and Van Loan [21].
If we start with $x^{(0)} = 0$, then
$$x^{(k+1)} = \alpha_0 p^{(0)} + \cdots + \alpha_k p^{(k)}.$$
Thus, $x^{(k+1)} \in \mathcal{S}( p^{(0)}, \ldots, p^{(k)} )$. It would be nice if each $x^{(k+1)}$ satisfied
$$f(x^{(k+1)}) = \min_{x \in \mathcal{S}( p^{(0)}, \ldots, p^{(k)} )} f(x) \qquad (19.1)$$
and the search directions were linearly independent. The primary reason is that then the descent method is guaranteed to complete in at most $n$ iterations, since then
$$\mathcal{S}( p^{(0)}, \ldots, p^{(n-1)} ) = \mathbb{R}^n,$$
so that
$$f(x^{(n)}) = \min_{x \in \mathcal{S}( p^{(0)}, \ldots, p^{(n-1)} )} f(x) = \min_{x \in \mathbb{R}^n} f(x),$$
so that $A x^{(n)} = b$.
[Figure 19.2 presents, side by side, the method of steepest descent, the method of steepest descent applied to the transformed problem $\bar A = L^{-1} A L^{-T}$, $\bar b = L^{-1} b$, $\bar x^{(0)} = L^T x^{(0)}$, and the preconditioned method of steepest descent with $p^{(k+1)} = M^{-1} r^{(k+1)}$.]
Figure 19.2: Left: method of steepest descent. Middle: method of steepest descent with transformed problem. Right: preconditioned method of steepest descent. It can be checked that the $x^{(k)}$ computed by the middle algorithm is exactly the $x^{(k)}$ computed by the one on the right. Of course, the computation $x^{(k+1)} = L^{-T} \bar x^{(k+1)}$ need only be done once, after convergence, in the algorithm in the middle.
Unfortunately, the method of steepest descent does not have this property. The iterate $x^{(k+1)}$ minimizes $f(x)$ where $x$ is constrained to be on the line $x^{(k)} + \alpha p^{(k)}$. Because in each step $f(x^{(k+1)}) \le f(x^{(k)})$, a slightly stronger result holds: it also minimizes $f(x)$ where $x$ is constrained to be on the union of lines $x^{(j)} + \alpha p^{(j)}$, $j = 0, \ldots, k$. However, that is not the same as minimizing over all vectors in $\mathcal{S}( p^{(0)}, \ldots, p^{(k)} )$.
We can write (19.1) more concisely. Let $P^{(k)} = \bigl( p^{(0)} \, \cdots \, p^{(k)} \bigr)$ and write $x = P^{(k)} y$ with $y = \left(\begin{array}{c} y_0 \\ \psi_1 \end{array}\right)$, so that $P^{(k)} y = P^{(k-1)} y_0 + \psi_1 p^{(k)}$. Then
$$\begin{array}{r c l} \displaystyle \min_{x \in \mathcal{S}( p^{(0)}, \ldots, p^{(k)} )} f(x) & = & \displaystyle \min_y f( P^{(k)} y ) = \min_y \Bigl[ \tfrac{1}{2} \bigl( P^{(k-1)} y_0 + \psi_1 p^{(k)} \bigr)^T A \bigl( P^{(k-1)} y_0 + \psi_1 p^{(k)} \bigr) - \bigl( P^{(k-1)} y_0 + \psi_1 p^{(k)} \bigr)^T b \Bigr] \\ & = & \displaystyle \min_y \Bigl[ \tfrac{1}{2} y_0^T P^{(k-1)T} A P^{(k-1)} y_0 + \psi_1\, y_0^T P^{(k-1)T} A p^{(k)} + \tfrac{1}{2} \psi_1^2\, p^{(k)T} A p^{(k)} - y_0^T P^{(k-1)T} b - \psi_1\, p^{(k)T} b \Bigr]. \end{array}$$
Now, if
$$P^{(k-1)T} A p^{(k)} = 0$$
then the term that couples $y_0$ and $\psi_1$ vanishes, so that the minimizations over $y_0$ and over $\psi_1$ can be performed independently: the minimizing $y_0$ is the one already found in step $k-1$, and only $\psi_1$ remains to be determined, so that
$$\min_{x \in \mathcal{S}( p^{(0)}, \ldots, p^{(k)} )} f(x) = \min_{\psi_1} f( x^{(k)} + \psi_1 p^{(k)} ).$$
19.7 Conjugate Gradient Method
The Conjugate Gradient Method (CG) is a descent method that picks the search directions very carefully. Specifically, it picks them to be A-conjugate, meaning that $p_i^T A p_j = 0$ if $i \ne j$. It also starts with $x^{(0)} = 0$, so that $p^{(0)} = r^{(0)} = b$.
Notice that, if $x^{(0)} = 0$ then, by nature, descent methods yield
$$x^{(k+1)} = \alpha_0 p^{(0)} + \cdots + \alpha_k p^{(k)}.$$
Also, the residual
$$r^{(k+1)} = b - A x^{(k+1)} = b - A( \alpha_0 p^{(0)} + \cdots + \alpha_k p^{(k)} ) = b - \alpha_0 A p^{(0)} - \cdots - \alpha_k A p^{(k)}.$$
[Figure 19.3 presents three formulations of the Conjugate Gradient method, each starting from $x^{(0)} = 0$.]
In CG the next search direction is chosen as $p^{(k+1)} = r^{(k+1)} + \gamma_k p^{(k)}$, with $\gamma_k$ chosen so that
$$p^{(k+1)T} A p^{(k)} = 0.$$
Now,
$$0 = p^{(k+1)T} A p^{(k)} = ( r^{(k+1)} + \gamma_k p^{(k)} )^T A p^{(k)} = r^{(k+1)T} A p^{(k)} + \gamma_k\, p^{(k)T} A p^{(k)},$$
so that
$$\gamma_k = -\frac{ r^{(k+1)T} A p^{(k)} }{ p^{(k)T} A p^{(k)} }.$$
Now, it can be shown that not only $p^{(k+1)T} A p^{(k)} = 0$, but also $p^{(k+1)T} A p^{(j)} = 0$ for $j = 0, 1, \ldots, k$. Thus, the Conjugate Gradient Method is an A-conjugate direction method.
Working these insights into the basic descent method algorithm yields the most basic form of the Conjugate Gradient method, summarized in Figure 19.3 (left).
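A compact Mscript sketch of the basic Conjugate Gradient method as described above (the name cg_basic and the stopping test are illustrative choices):

% Basic Conjugate Gradient method for SPD A, starting from x^(0) = 0.
function x = cg_basic( A, b, tol, maxits )
  x = zeros( size( b ) );
  r = b;                       % r^(0) = b - A x^(0) = b
  p = r;                       % p^(0) = r^(0)
  for k = 1:maxits
    q     = A * p;
    alpha = ( p' * r ) / ( p' * q );
    x     = x + alpha * p;
    r     = r - alpha * q;
    if norm( r ) <= tol * norm( b )
      break;
    end
    gamma = -( r' * q ) / ( p' * q );   % makes p^(k+1) A-conjugate to p^(k)
    p     = r + gamma * p;
  end
end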
Chapter 20
Outline
Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
20.1. Krylov Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
20.2. The Lanczos Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
20.1 Krylov Subspaces
Consider a matrix $A$, some initial unit vector $q_0$, and the sequence $x_0, x_1, \ldots$ defined by
$$x_{k+1} := A x_k, \quad k = 0, 1, \ldots,$$
where $x_0 = q_0$. We already saw this sequence when we discussed the Power Method.
Definition 20.1 Given a square matrix $A$ and initial vector $x_0$, the Krylov subspace $\mathcal{K}(A, x_0, k)$ is defined by
$$\mathcal{K}(A, x_0, k) = \mathcal{S}( x_0, A x_0, \ldots, A^{k-1} x_0 ) = \mathcal{S}( x_0, x_1, \ldots, x_{k-1} ),$$
where $x_{j+1} = A x_j$ for $j = 0, 1, \ldots$. The Krylov matrix is defined by
$$K(A, x_0, k) = \bigl( x_0 \; A x_0 \; \cdots \; A^{k-1} x_0 \bigr) = \bigl( x_0 \; x_1 \; \cdots \; x_{k-1} \bigr).$$
20.2 The Lanczos Method
From our discussion of the Power Method, we notice that it is convenient to instead work with mutually orthonormal vectors. Consider the QR factorization of $K(A, q_0, n)$:
$$K(A, q_0, n) = \bigl( q_0 \; x_1 \; \cdots \; x_{n-1} \bigr) = \bigl( q_0 \; q_1 \; \cdots \; q_{n-1} \bigr) \left(\begin{array}{c c c c} 1 & \rho_{0,1} & \cdots & \rho_{0,n-1} \\ & \rho_{1,1} & \cdots & \rho_{1,n-1} \\ & & \ddots & \vdots \\ & & & \rho_{n-1,n-1} \end{array}\right) = Q R.$$
Notice that
$$A\, K(A, q_0, n) = A\,\bigl( q_0 \; x_1 \; \cdots \; x_{n-1} \bigr) = \bigl( x_1 \; x_2 \; \cdots \; x_{n-1} \; A x_{n-1} \bigr),$$
so that every column of $A\, K(A, q_0, n)$ except the last already lies in the Krylov subspace. Substituting $K(A, q_0, n) = QR$ and multiplying by $Q^T$ from the left and $R^{-1}$ from the right, one finds that $Q^T A Q$ is upper Hessenberg. We emphasize that
* $T = Q^T A Q$ is upper Hessenberg (and if $A$ is symmetric, it is tridiagonal).
* $Q$ is generated by the initial vector $x_0 = q_0$.
From the Implicit Q Theorem, we thus know that $T$ and $Q$ are uniquely determined. Finally, $Q^T A Q$ is a unitary similarity transformation, so that the eigenvalues can be computed from $T$, and the eigenvectors of $T$ can then be transformed back to eigenvectors of $A$.
One way of computing the tridiagonal matrix is, of course, to use Householder transformations. However, that would quickly fill in the zeroes that exist in matrix $A$. Instead, we compute it more directly. We will do so under the assumption that $A$ is symmetric and that therefore $T$ is tridiagonal.
Consider $A Q = Q T$ and partition
$$Q \rightarrow \bigl( Q_L \; q_M \; Q_R \bigr) \quad\text{and}\quad T \rightarrow \left(\begin{array}{c c c} T_{TL} & \star & 0 \\ \beta_{ML} e_L^T & \mu_{MM} & \star \\ 0 & \beta_{BM} e_0 & T_{BR} \end{array}\right),$$
where $e_L$ ($e_0$) denotes the unit basis vector with a 1 in the last (first) entry. The idea is that $Q_L$ and $q_M$ have already been computed, as have $T_{TL}$ and $\beta_{ML}$. Also, we assume that a temporary vector $u$ holds $A q_M - \beta_{ML} Q_L e_L$. Initially, $Q_L$ has zero columns and $q_M$ equals the $q_0$ with which we started our discussion; also, initially, $T_{TL}$ is $0 \times 0$ so that $\beta_{ML}$ doesn't really exist.
Now, repartition
$$\bigl( Q_L \; q_M \; Q_R \bigr) \rightarrow \bigl( Q_0 \; q_1 \; q_2 \; Q_3 \bigr) \quad\text{and}\quad T \rightarrow \left(\begin{array}{c c c c} T_{00} & \star & 0 & 0 \\ \beta_{10} e_L^T & \alpha_{11} & \star & 0 \\ 0 & \beta_{21} & \alpha_{22} & \star \\ 0 & 0 & \beta_{32} e_0 & T_{33} \end{array}\right).$$
Equating the column of $A Q = Q T$ that corresponds to $q_1$ yields
$$A q_1 = \beta_{10} Q_0 e_L + \alpha_{11} q_1 + \beta_{21} q_2,$$
so that
$$\alpha_{11} = q_1^T A q_1 = q_1^T u \quad\text{and}\quad \beta_{21} q_2 = A q_1 - \beta_{10} Q_0 e_L - \alpha_{11} q_1 = u - \alpha_{11} q_1,$$
and $q_2$ is obtained by normalizing this last vector: $\beta_{21} := \| u - \alpha_{11} q_1 \|_2$ and $q_2 := ( u - \alpha_{11} q_1 ) / \beta_{21}$. This yields the Lanczos algorithm in Figure 20.1.
[Figure 20.1 presents the Lanczos algorithm in FLAME notation; the key updates in each iteration are $\alpha_{11} := q_1^T u$, $u := u - \alpha_{11} q_1$, $\beta_{21} := \| u \|_2$, $q_2 := u / \beta_{21}$, and $u := A q_2 - \beta_{21} q_1$.]
Figure 20.1: Lanczos algorithm for symmetric $A$. It computes $Q$ and $T$ such that $A Q = Q T$. It lacks a stopping criterion.
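A plain Mscript sketch of the symmetric Lanczos process described above, returning the tridiagonal $T$ and the basis $Q$ (the name lanczos_simple and the formulation without reorthogonalization are illustrative; in floating point arithmetic, orthogonality of the computed $Q$ degrades without extra work):

% Simple symmetric Lanczos: after k steps, A*Q(:,1:k) is approximately Q(:,1:k)*T(1:k,1:k).
function [ Q, T ] = lanczos_simple( A, q0, k )
  n = length( q0 );
  Q = zeros( n, k );  T = zeros( k, k );
  Q( :, 1 ) = q0 / norm( q0 );
  u = A * Q( :, 1 );
  for j = 1:k
    alpha     = Q( :, j )' * u;
    T( j, j ) = alpha;
    u         = u - alpha * Q( :, j );
    if j < k
      beta = norm( u );
      if beta == 0, break; end
      T( j+1, j ) = beta;  T( j, j+1 ) = beta;
      Q( :, j+1 ) = u / beta;
      u = A * Q( :, j+1 ) - beta * Q( :, j );
    end
  end
end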
329
Consider
Q0 q1
q2 Q3
T
00
10 eT
L
Q3
Q0 q1 q2
11
21
22
32 e0
T33
If q2 represents the kth column of Q, then it is in thedirection of (Ak1 q0 ) , the component of
Ak1 q0 that is orthogonal to the columns of Q0 q1 .
As k gets large, Ak1 q0 is essentially in the space spanned by Q0 q1 .
q2 is computed from
21 q2 = Aq1 10 Q0 eL 11 q1 .
The length of Aq1 10 Q0 eL 11 q1 must be very small as k gets large since Ak1 q0 eventually lies
approximately in the direction of Ak2 q0 .
This means that 21 = kAq1 10 Q0 eL 11 q1 k2 will be small.
Hence
Q0 q1 q2 Q3
Q0 q1 q2
T00
10 eTL
Q3
?
11
0
0
0 0
0 0
? ?
? ?
or, equivalently,
Q0 q1
Q0 q1
q1
T00
10 eTL
11
= S,
), is said to be an invariant subspace of
h
Q0 q1
QS
h
Q0 q1
QS S .
[Figure 20.2 presents the Lanczos algorithm of Figure 20.1, modified so that the iteration stops when $|\beta_{21}|$ drops below a tolerance.]
Figure 20.2: Lanczos algorithm for symmetric $A$. It computes $Q_L$ and $T_{TL}$ such that $A Q_L \approx Q_L T_{TL}$.
Answers
A=
a0 a1 an1
a0 a1 an1
abT0
abT
1
.
..
abTm1
T
aT
1
= .
..
aTm1
aT0
a0 a1 an1
H
aH
0
aH
1
= .
..
aH
m1
333
abT0
abT
1
= .
..
abTm1
abT0
abT
1
.
..
abTm1
* BACK TO TEXT
x0
x1
x= .
..
xN1
x0
x1
x= .
..
xN1
xT =
x0T
xH =
H
x0H x1H xN1
x1T
T
xN1
.
.
* BACK TO TEXT
A0,0
A0,1
A0,N1
A=
A1,0
..
.
A1,1
..
.
A1,N1
..
.
334
A0,0
A0,1
A0,N1
A1,0
..
.
A1,1
..
.
A1,N1
..
.
A0,0
A0,1
A0,N1
A1,0
..
.
A1,1
..
.
A1,N1
..
.
AT0,0
AT1,0
ATM1
AT
AT1,1 ATM1,1
0,1
=
..
..
..
.
.
AH
0,0
AH
1,0
AH
M1
AH
AH
AH
M1,1
1,1
0,1
=
.
.
..
..
..
H
H
H
A0,N1 A1,N1 AM1,N1
* BACK TO TEXT
x0
x0
x1 x1
. =
..
..
.
xN1
xN1
* BACK TO TEXT
Homework 1.5
x0
x1
.
..
xN1
* BACK TO TEXT
x0
x1
.
..
xN1
y0
y1
.
..
yN1
N1
ni
H
= N1
i=0 xi yi . (Provided xi , yi C and i=0 ni = n.)
* BACK TO TEXT
336
0 x = ~0
= |0|(x)
() is homogeneous
= 0
algebra
* BACK TO TEXT
kx + yk1 =
=
=
|0 + 0 | + |1 + 1 | + + |n1 + n1 |
|0 | + |0 | + |1 | + |1 | + + |n1 | + |n1 |
|0 | + |1 | + + |n1 | + |0 | + |1 | + + |n1 |
kxk1 + kyk1 .
* BACK TO TEXT
kx + yk = max |i + i |
i
max(|i | + |i |)
i
max(|i | + max | j |)
i
* BACK TO TEXT
a0 a1 an1
then
v
u
u
u
a
v
v
v
u
0
um1 n1
un1 m1
un1
u
u
u
u
u
a1
2
2
2
t
t
t
kAkF =
.
|i, j | = |i, j | = ka j k2 = u
u
..
i=0 j=0
j=0 i=0
j=0
u
t
an1
2
.
2
In other words, it equals the vector 2-norm of the vector that is created by stacking the columns of A on
top of each other. The fact that the Frobenius norm is a norm then comes from realizing this connection
and exploiting it.
Alternatively, just grind through the three conditions!
* BACK TO TEXT
abT0
abT
1
Homework 2.20 Let A Cmn and partition A = .
..
abTm1
. Show that
kAk = max kb
ai k1 = max (|i,0 | + |i,1 | + + |i,n1 |)
0i<m
0i<m
338
abT0
abT
1
Answer: Partition A = .
..
abTm1
kAk
. Then
T
Tx
b
b
a
a
0
0
abT
abT x
1
1
= max kAxk = max
. x
= max
.
..
..
kxk =1
kxk =1
kxk =1
abT
abT x
m1
m1
n1
n1
= max max |aTi x| = max max | i,p p | max max |i,p p |
i
kxk =1
kxk =1
n1
kxk =1
p=0
p=0
n1
n1
max max (|i,p || p |) max max (|i,p |(max |k |)) max max (|i,p |kxk )
i
kxk =1
kxk =1
p=0
p=0
kxk =1
p=0
n1
= max (|i,p | = kb
ai k1
i
p=0
n1
|i | = 1 and i k,i = |k,i |.) Then
abT
0
abT
1
= max
.
.
kxk1 =1
.
abT
m1
abT0 y
abT y
1
=
..
.
T
ab y
m1
abT
0
abT
1
x
.
..
abT
m1
|b
aTk y| = abTk y = kb
ak k1 . = max kb
ai k1
i
* BACK TO TEXT
339
kAyk kAxk
.
kyk
kxk
kxk =1
kxk =1
* BACK TO TEXT
kABk2F
abH
0
abH
1
=
.
..
abH
m1
b0 b1 bn1
kbaHi k22kb j k2
i
kbaik22
kb j k2
2
2
aH
= |b
i b j|
i j
(Cauchy-Schwarz)
!
=
2
abH
abH
abH
0 b1
0 bn1
0 b0
abH b0
abH
abH
0
0 b1
0 bn1
=
..
..
..
.
.
.
abH b0 abH b1 abH bn1
m1
m1
m1
F
|baHi abj |
i
!
=
abHi abi
bHj b j
|bHi b j |
i
= kAk2F kBk2F .
so that kABk2F kAk22 kBk22 . Taking the square-root of both sides established the desired result.
* BACK TO TEXT
340
Homework 2.28 Let $\| \cdot \|$ be a matrix norm induced by the $\| \cdot \|$ vector norm. Show that $\kappa(A) = \| A \|\, \| A^{-1} \| \ge 1$.
Answer:
$$\| I \| = \| A A^{-1} \| \le \| A \|\, \| A^{-1} \|.$$
But
$$\| I \| = \max_{\|x\|=1} \| I x \| = \max_{\|x\|=1} \| x \| = 1.$$
Hence $1 \le \| A \|\, \| A^{-1} \|$.
* BACK TO TEXT
341
Homework 3.6 Let $Q \in \mathbb{C}^{m \times m}$. Show that if $Q$ is unitary then $Q^{-1} = Q^H$ and $Q Q^H = I$.
Answer: If $Q$ is unitary, then $Q^H Q = I$. If $A, B \in \mathbb{C}^{m \times m}$, the matrix $B$ such that $BA = I$ is the inverse of $A$. Hence $Q^{-1} = Q^H$. Also, if $BA = I$ then $AB = I$ and hence $Q Q^H = I$.
* BACK TO TEXT
Homework 3.7 Let $Q_0, Q_1 \in \mathbb{C}^{m \times m}$ both be unitary. Show that their product, $Q_0 Q_1$, is unitary.
Answer: Obviously, $Q_0 Q_1$ is a square matrix. Moreover,
$$(Q_0 Q_1)^H (Q_0 Q_1) = Q_1^H \underbrace{Q_0^H Q_0}_{I} Q_1 = Q_1^H Q_1 = I.$$
Hence $Q_0 Q_1$ is unitary.
* BACK TO TEXT
Homework 3.8 Let $Q_0, Q_1, \ldots, Q_{k-1} \in \mathbb{C}^{m \times m}$ all be unitary. Show that their product, $Q_0 Q_1 \cdots Q_{k-1}$, is unitary.
Answer: Strictly speaking, we should do a proof by induction. But instead we will make the more informal argument that
$$(Q_0 Q_1 \cdots Q_{k-1})^H ( Q_0 Q_1 \cdots Q_{k-1} ) = Q_{k-1}^H \cdots Q_1^H Q_0^H Q_0 Q_1 \cdots Q_{k-1} = I,$$
cancelling factors from the inside out.
* BACK TO TEXT
* BACK TO TEXT
Homework 3.10 Let $U \in \mathbb{C}^{m \times m}$ and $V \in \mathbb{C}^{n \times n}$ be unitary matrices and $A \in \mathbb{C}^{m \times n}$. Then
$$\| U A \|_2 = \| A V \|_2 = \| A \|_2.$$
Answer:
$$\| U A \|_2 = \max_{\|x\|_2 = 1} \| U A x \|_2 = \max_{\|x\|_2 = 1} \| A x \|_2 = \| A \|_2.$$
$$\| A V \|_2 = \max_{\|x\|_2 = 1} \| A V x \|_2 = \max_{\|V x\|_2 = 1} \| A (V x) \|_2 = \max_{\|y\|_2 = 1} \| A y \|_2 = \| A \|_2.$$
* BACK TO TEXT
Homework 3.11 Let U Cmm and V Cnn be unitary matrices and A Cmn . Then kUAkF =
kAV kF = kAkF .
Answer:
n1
a0 a1 an1 . Then it is easy to show that kAk2F = j=0 ka j k22 . Thus
kUAk2F = kU a0 a1 an1 k2F = k Ua0 Ua1 Uan1 k2F
Partition A =
n1
n1
j=0
j=0
T
ab
0
abT
1
Partition A = . . Then it is easy to show that kAk2F
..
T
abm1
abT0
abT V
0
abT V
abT
1
1
kAV k2F = k . V k2F = k
..
..
abTm1
abTm1V
m1
i=0
= m1
aTi )H k22 . Thus
i=0 k(b
m1
2
k
=
F
k(baTi V )H k22
i=0
m1
kV H (b
aTi )H k22 =
i=0
* BACK TO TEXT
kDk22
0
0
0
0
0 1
1
2
= max kDxk2 = max
. . .
.
.
..
..
.. ..
kxk2 =1
kxk2 =1
..
0 0 n1
n1
2
2
0 0
"
#
"
#
n1
n1
1 1
= max
= max | j j |2 = max | j |2 | j |2
..
kxk2 =1
kxk2 =1 j=0
kxk2 =1 j=0
.
n1 n1
2 #
"
#
"
2
n1
n1
n1
n1
n1
2
2
2
2
max max |i | | j |
= max max |i | | j | = max |i |
max kxk22
kxk2 =1
kxk2 =1
i=0
j=0
i=0
j=0
kxk2 =1
i=0
2
n1
=
max |i | .
i=0
i=0
n1
n1
so that maxn1
i=0 |i | kDk2 maxi=0 |i |, which implies that kDk2 = maxi=0 |i |.
* BACK TO TEXT
AT
0
H
AT
U V
U
= T T T = T T VTH ,
A=
0
0
0
which is the SVD of A. As a result, cleary the largest singular value of AT equals the largest singular value
of A and hence kAk2 = kAT k2 .
* BACK TO TEXT
344
Homework 3.15 Assume that U Cmm and V Cnn be unitary matrices. Let A, B Cmn with B =
UAV H . Show that the singular values of A equal the singular values of B.
Answer: Let A = UA AVAH be the SVD of A. Then B = UUA AVAH V H = (UUA )A (VVA )H where both
UUA and VVA are unitary. This gives us the SVD for B and it shows that the singular values of B equal the
singular values of A.
* BACK TO TEXT
H
0 0
0
0
1 0
1 0
= 0
0
=
A=
0 B
0 UB BVBH
0 UB
0 B
0 VB
which shows the relationship between the SVD of A and B. Since kAk2 = 0 , it must be the case that the
diagonal entries of B are less than or equal to 0 in magnitude, which means that kBk2 kAk2 .
* BACK TO TEXT
0
We will employ a proof by induction on m.
T
Base case: m = 1. In this case A = ab0
where abT0 R1 n is its only row. By assumption,
abT0 6= 0. Then
H
T
T
A = ab0 = 1
kb
a0 k2
v0
T
H
T
n(n1)
where v0 = (b
a0 ) /kb
a0 k2 . Choose V1 C
so that V = v0 V1 is unitary. Then
A=
where DT L =
0
abT0
kb
aT0 k2
kb
aT0 k2 )
and U =
v0 V1
H
= UDV H
.
Inductive step: Similarly modify the inductive step of the proof of the theorem.
345
By the Principle of Mathematical Induction the result holds for all matrices A Cmn with m n.
* BACK TO TEXT
346
Homework 4.2 Homework 4.3 Convince yourself that the relation between the vectors {a j } and {q j }
in the algorithms in Figure 4.2 is given by
0 1,1 1,n1
,
a0 a1 an1 = q0 q1 qn1 .
.
.
.
..
..
..
..
0
0 n1,n1
where
1 for i = j
H
qi q j =
0 otherwise
and
i, j =
qi a j
for i < j
j1
ka j i=0 i, j qi k2 for i = j
otherwise.
Homework 4.5 Homework 4.6 Let A have linearly independent columns and let A = QR be a QR factorization of A. Partition
RT L RT R
,
A AL AR , Q QL QR , and R
0
RBR
where AL and QL have k columns and RT L is k k. Show that
347
RT L
AL AR = QL QR
0
RT R
RBR
QL RT L
QL RT R + QR RBR
Hence
AL = QL RT L
and AR = QL RT R + QR RBR .
C (AL ) C (QL ): Let y C (AL ). Then there exists x such that AL x = y. But then QL RT L x = y and
hence QL (RT L x) = y which means that y C (QL ).
C (QL ) C (AL ): Let y C (QL ). Then there exists x such that QL x = y. But then AL R1
T L x = y and
1
hence AL (RT L x) = y which means that y C (AL ). (Notice that RT L is nonsingular because it
is a triangular matrix that has only nonzeroes on its diagonal.)
This answer 2.
Take AR QL RT R = QR RBR . and multiply both side by QH
L:
H
QH
L (AR QL RT R ) = QL QR RBR
is equivalent to
H
Q R = 0.
QH
Q R = QH
L AR Q
| L{z R} BR
| L{z L} T R
I
0
Rearranging yields 3.
Since AR QL RT R = QR RBR we find that (AR QL RT R )H QL = (QR RBR )H QL and
H
(AR QL RT R )H QL = RH
BR QR QL = 0.
Homework 6.4 Show that if x Rn , v = x kxk2 e0 , and = vT v/2 then (I 1 vvT )x = kxk2 e0 .
* BACK TO TEXT
1 =
I 1
u2
u2
x2
0
kxk2 .
where = uH u/2 = (1 + uH
2 u2 )/2 and =
Hint: = ||2 = kxk22 since H preserves the norm. Also, kxk22 = |1 |2 + kx2 k22 and
z
z
z
|z| .
Answer:
* BACK TO TEXT
0 I
0
0
H
1
= I 1 1 .
1
1
1
1
1
u
u
u2
u2
2
2
Answer:
* BACK TO TEXT
349
1
1
T00
T00 t01
T00
t01 /1
=
.
0 1
0
1/1
Answer:
1
1
1
1
T
T T
I 0
T
t
T00 t01 /1
T00 T00 t01 /1 + 1 /1
.
00 01 00
= 00 00
=
0 1
0
1/1
0
1 /1
0 1
* BACK TO TEXT
Homework 6.13 Homework 6.14 Consider u1 Cm with u1 6= 0 (the zero vector), U0 Cmk , and nonsingular T00 Ckk . Define 1 = (uH
1 u1 )/2, so that
H1 = I
1
u1 uH
1
1
Show that
1 H
Q0 H1 = (I U0 T00
U0 )(I
1
u1 uH
1 )=I
1
U0 u1
T00 t01
0
1
where t01 = QH
0 u1 .
Answer:
1 H
Q0 H1 = (I U0 T00
U0 )(I
1
u1 uH
1)
1
1
1 H 1
u1 uH
u1 uH
1 +U0 T00 U0
1
1
1
1
1
1 H
1
= I U0 T00
U0 u1 uH
U0 T00
t01 uH
1 +
1
1
1
1 H
= I U0 T00
U0
350
U0 u1
H
Also
U0 u1
T00 t01
= I
= I
U0 u1
U0 u1
U0 u1
H
1
T00
1
T00
t01 /1
1/1
U0 u1
1
1 H
t01 uH
T00
U0 T00
1 /1
1/1 uH
1
H
1
1 H
H
t01 uH
= I U0 (T00
U0 T00
1 /1 ) 1/1 u1 u1
1
1
1 H
1
= I U0 T00
U0 u1 uH
U0 T00
t01 uH
1 +
1
1
1
* BACK TO TEXT
Homework 6.14 Homework 6.15 Consider ui Cm with ui 6= 0 (the zero vector). Define i = (uH
i ui )/2,
so that
1
Hi = I ui uH
i i
equals a Householder transformation, and let
U = u0 u1 uk1 .
Show that
H0 H1 Hk1 = I UT 1U H ,
where T is an upper triangular matrix.
Answer: The results follows from Homework 6.14 via a proof by induction.
* BACK TO TEXT
351
Q
T , reason that A = LQ.
QB
Dont overthink the problem: use results you have seen before.
Answer: We know that AH Cnm with linearly independent
columns (and n m). Hence there exists
AH
RT
0
Homework 7.3 Let A Cmn with m < n have linearly independent rows. Consider
kAx yk2 = min kAz yk2 .
z
Use the fact that A = LL QT , where LL Cmm is lower triangular and QT has orthonormal rows, to argue
1
H
nm ) is a solution to the LLS
that any vector of the
QH
T LL y + QB wB (where wB is any vector in C
form
problem. Here Q =
QT
QB
Answer:
min kAz yk2 = min kLQz yk2
z
min
LL
w
H
z=Q w
w
T y
wB
z = QH w
z = QH w =
QH
T
QH
B
352
wT
wB
= QH
T wT + QB wB .
Homework 7.4 Continuing Exercise 7.2, use Figure 7.1 to give a Classical Gram-Schmidt inspired algorithm for computing LL and QT . (The best way to check you got the algorithm right is to implement
it!)
Answer:
* BACK TO TEXT
Homework 7.5 Continuing Exercise 7.2, use Figure 7.2 to give a Householder QR factorization inspired
algorithm for computing L and Q, leaving L in the lower triangular part of A and Q stored as Householder
vectors above the diagonal of A. (The best way to check you got the algorithm right is to implement it!)
Answer:
* BACK TO TEXT
353
Homework 8.2 If $A$ has linearly independent columns, show that $\| (A^H A)^{-1} A^H \|_2 = 1/\sigma_{n-1}$, where $\sigma_{n-1}$ equals the smallest singular value of $A$. Hint: use the SVD of $A$.
Answer: Let $A = U \Sigma V^H$ be the reduced SVD of $A$. Then
$$\begin{array}{r c l} \| (A^H A)^{-1} A^H \|_2 & = & \| ( (U\Sigma V^H)^H U\Sigma V^H )^{-1} (U\Sigma V^H)^H \|_2 = \| ( V \Sigma U^H U \Sigma V^H )^{-1} V \Sigma U^H \|_2 \\ & = & \| ( V \Sigma^{-1} \Sigma^{-1} V^H )\, V \Sigma U^H \|_2 = \| V \Sigma^{-1} U^H \|_2 = \| \Sigma^{-1} \|_2 = 1/\sigma_{n-1} \end{array}$$
(since the two norm of a diagonal matrix equals its largest diagonal element in absolute value).
* BACK TO TEXT
Homework 8.3 Let $A$ have linearly independent columns. Show that $\kappa_2(A^H A) = \kappa_2(A)^2$.
Answer: Let $A = U \Sigma V^H$ be the reduced SVD of $A$. Then
$$\begin{array}{r c l} \kappa_2(A^H A) & = & \| A^H A \|_2\, \| (A^H A)^{-1} \|_2 = \| (U\Sigma V^H)^H U\Sigma V^H \|_2\, \| ( (U\Sigma V^H)^H U\Sigma V^H )^{-1} \|_2 \\ & = & \| V \Sigma^2 V^H \|_2\, \| V (\Sigma^{-1})^2 V^H \|_2 = \| \Sigma^2 \|_2\, \| (\Sigma^{-1})^2 \|_2 = \frac{\sigma_0^2}{\sigma_{n-1}^2} = \left( \frac{\sigma_0}{\sigma_{n-1}} \right)^2 = \kappa_2(A)^2. \end{array}$$
* BACK TO TEXT
Homework 8.6 Characterize the set of all square matrices $A$ with $\kappa_2(A) = 1$.
Answer: If $\kappa_2(A) = 1$ then $\sigma_0/\sigma_{n-1} = 1$, which means that $\sigma_0 = \sigma_{n-1}$ and hence $\sigma_0 = \sigma_1 = \cdots = \sigma_{n-1}$. Hence the singular value decomposition of $A$ must satisfy $A = U \Sigma V^H = \sigma_0 U V^H$. But $U V^H$ is unitary since $U$ and $V^H$ are, and hence we conclude that $A$ must be a nonzero multiple of a unitary matrix.
* BACK TO TEXT
355
Homework 9.3 Assume a floating point number system with $\beta = 2$ and a mantissa with $t$ digits, so that a typical positive number is written as $.d_0 d_1 \ldots d_{t-1} \times 2^e$, with $d_i \in \{0, 1\}$.
* Write the number 1 as a floating point number.
Answer: $.\underbrace{10\cdots 0}_{t\ \mathrm{digits}} \times 2^1$.
* What is the largest positive real number $u$ (represented as a binary fraction) such that the floating point representation of $1 + u$ equals the floating point representation of 1? (Assume rounded arithmetic.)
Answer: $.\underbrace{10\cdots 0}_{t\ \mathrm{digits}} \times 2^1 + .\underbrace{00\cdots 0}_{t\ \mathrm{digits}}1 \times 2^1 = .\underbrace{10\cdots 0}_{t\ \mathrm{digits}}1 \times 2^1$, which rounds to $.\underbrace{10\cdots 0}_{t\ \mathrm{digits}} \times 2^1 = 1$, so that $u = .\underbrace{00\cdots 0}_{t\ \mathrm{digits}}1 \times 2^1 = 2^{-t}$.
and B =
b0 bn1
Then
m1
kAk1 =
max ka j k1 = max
0 j<n
|i, j |
0 j<n
i=0
m1
|i,k |
i=0
m1
|i,k |
i=0
m1
max
0 j<n
|i, j |
= max kb j k1 = kBk1 .
0 j<n
i=0
kAk2F =
m1 n1
2i, j
i=0 j=0
2i, j = kBk2F .
i=0 j=0
1
Case 1: ni=0 (1 + i )1 = (n1
i=0 (1 + i ) )(1 + n ). By the I.H. there exists a n such that (1 + n ) =
1 and | | nu/(1 nu). Then
n1
n
i=0 (1 + i )
!
n1
(1 + i)1
i=0
(1 + n ) = (1 + n )(1 + n ) = 1 + n + n + n n ,
|
{z
}
n+1
.
1 nu
1 nu
1 (n + 1)u
|n+1 | = |n + n + n n | |n | + |n | + |n ||n |
* BACK TO TEXT
nu
(n + b)u
(n + b)u
= n+b .
1 nu
1 nu
1 (n + b)u
nu
bu
nu
bu
+
+
1 nu 1 bu (1 nu) (1 bu)
nu(1 bu) + (1 nu)bu + bnu2
=
(1 nu)(1 bu)
nu bnu2 + bu bnu2 + bnu2
(n + b)u bnu2
=
=
1 (n + b)u + bnu2
1 (n + b)u + bnu2
(n + b)u
(n + b)u
= n+b .
2
1 (n + b)u + bnu
1 (n + b)u
n + b + n b =
* BACK TO TEXT
(k)
I +
0
(1 + 2 ) = (I + (k+1) ).
0
(1 + 1 )
Hint: reason the case where k = 0 separately from the case where k > 0.
Answer:
Case: k = 0. Then
(k)
I +
0
(1 + 2 ) = (1 + 0)(1 + 2 ) = (1 + 2 ) = (1 + 1 ) = (I + (1) ).
0
(1 + 1 )
358
Case: k = 1. Then
I + (k)
0
1 + 1
0
(1 + 2 ) =
(1 + 2 )
0
(1 + 1 )
0
(1 + 1 )
(1 + 1 )(1 + 2 )
0
=
0
(1 + 1 )(1 + 2 )
(1 + 2 )
0
= (I + (2) ).
=
0
(1 + 2 )
Case: k > 1. Notice that
(I + (k) )(1 + 2 ) = diag(()(1 + k ), (1 + k ), (1 + k1 ), , (1 + 2 ))(1 + 2 )
= diag(()(1 + k+1 ), (1 + k+1 ), (1 + k ), , (1 + 3 ))
Then
I + (k)
(1 + 1 )
(1 + 2 ) =
(I + (k) )(1 + 2 )
(1 + 1 )(1 + 2 )
(k)
(I + )(1 + 2 )
0
= (I + (k+1) ).
=
0
(1 + 2 )
0
* BACK TO TEXT
Homework 9.23 Prove R1-B.
Answer: From Theorem 9.20 we know that
= xT (I + (n) )y = (x + (n) x )T y.
x
Then
(n)
|x| = | x| =
|n ||0 |
| || |
n
1
|n ||2 |
..
n 0
|n 0 |
n 1 |n 1 |
= |n1 2 | =
n1 2
..
..
.
.
2 n1
|2 n1 |
|0 |
| |
1
= |n | |2 | n |x|.
.
.
|n ||n1 |
|n1 |
359
|n ||0 |
|n ||1 |
|n1 ||2 |
..
|2 ||n1 |
(Note: strictly speaking, one should probably treat the case n = 1 separately.)!.
* BACK TO TEXT
Homework 9.25 In the above theorem, could one instead prove the result
y = A(x + x),
where x is small?
Answer: The answer is no. The reason is that for each individual element of y
i = aTi (x + x)
1
.
..
m1
aT0 (x + x)
aT (x + x)
1
=
..
.
aTm1 (x + x)
Homework 9.27 In the above theorem, could one instead prove the result
C = (A + A)(B + B),
where A and B are small?
Answer: The answer is no for reasons similar to why the answer is no for Exercise 9.25.
* BACK TO TEXT
360
361
Ik
1
l21
Ik
= 0
1
l21
Answer:
I
k
0
1
l21
I
k
0 0
I
0
0
1
l21
I
k
0 = 0
I
0
0
1
l21 + l21
I
k
0 = 0
I
0
0 0
1 0
0 I
* BACK TO TEXT
L
0 0
00
Homework 11.7 Let Lk = L0 L1 . . . Lk . Assume that Lk has the form Lk1 = l10 1 0 , where L 00
L20 0 I
L
I
0 0
0
0
00
b
..
(Recall:
L
=
is k k. Show that L k is given by L k = l10
1 0
0
1
0 .)
k
L20 l21 I
0 l21 I
Answer:
b1
L k = L k1 Lk = L k1 L
k
1
L00 0 0
I
0
0
L
00
T
T
= l10
1 0 . 0
1
0 = l10
L20 0 I
0 l21 I
L20
L
0 0
00
= l10 1 0 .
L20 l21 I
0 0
1 0 . 0 1 0
0 I
0 l21 I
* BACK TO TEXT
362
Homework 11.12 Implement LU factorization with partial pivoting with the FLAME@lab API, in Mscript.
Answer: See Figure 11.3.
* BACK TO TEXT
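For reference, a plain Mscript sketch of LU factorization with partial pivoting that does not use the FLAME@lab API (the function name lu_piv and the representation of the permutation as a vector p are illustrative choices):

% LU factorization with partial pivoting: returns L, U, and pivot vector p
% such that L * U = A( p, : ).
function [ L, U, p ] = lu_piv( A )
  n = size( A, 1 );
  p = ( 1:n )';
  for k = 1:n-1
    [ ~, i ] = max( abs( A( k:n, k ) ) );   % index of the pivot row
    i = i + k - 1;
    A( [ k i ], : ) = A( [ i k ], : );      % swap rows k and i
    p( [ k i ] )    = p( [ i k ] );
    A( k+1:n, k )     = A( k+1:n, k ) / A( k, k );
    A( k+1:n, k+1:n ) = A( k+1:n, k+1:n ) - A( k+1:n, k ) * A( k, k+1:n );
  end
  L = tril( A, -1 ) + eye( n );
  U = triu( A );
end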
Homework 11.13 Derive an algorithm for solving Ux = y, overwriting y with the solution, that casts most
computation in terms of DOT products. Hint: Partition
T
11 u12
.
U
0 U22
Call this Variant 1 and use Figure 11.6 to state the algorithm.
363
y
UT L UT R
,y T
Partition U
UBL UBR
yB
where UBR is 0 0, yB has 0 rows
y
UT L UT R
,y T
Partition U
UBL UBR
yB
where UBR is 0 0, yB has 0 rows
Repartition
Repartition
U u U
00 01 02
uT 11 uT
,
12
10
UBL UBR
U u U
20 21 22
y
0
yT
yB
y2
UT L UT R
UT L UT R
1 := 1 uT12 x2
1 := 1 /11
1 := 1 /11
y0 := y0 1 u01
Continue with
Continue with
UT L UT R
U u U
00 01 02
uT 11 uT
,
12
10
UBL UBR
U u U
20 21 22
y
0
yT
yB
y2
uT10 11 uT12 ,
UBL UBR
U u U
20 21 22
y
0
yT
1
yB
y2
UT L UT R
uT10 11 uT12 ,
UBL UBR
U u U
20 21 22
y
0
yT
1
yB
y2
endwhile
endwhile
Figure 6 (Answer): Algorithms for the solution of upper triangular system Ux = y that overwrite y with x.
364
Answer:
Partition
uT12
11
U22
x2
1
y2
11 1 + uT12 x2
U22 x2
y2
So, if we assume that x2 has already been computed and has overwritten y2 , then 1 can be computed as
1 = (1 uT12 x2 )/11
which can then overwrite 1 . The resulting algorithm is given in Figure 11.6 (Answer) (left).
* BACK TO TEXT
Homework 11.14 Derive an algorithm for solving Ux = y, overwriting y with the solution, that casts most
computation in terms of AXPY operations. Call this Variant 2 and use Figure 11.6 to state the algorithm.
Answer: Partition
U00 u01
x0
11
y0
U00 x0 + u01 1
11 1
y0
So, 1 = 1 /11 after which x0 can be computed by solving U00 x0 = y0 1 u01 . The resulting algorithm
is given in Figure 11.6 (Answer) (right).
* BACK TO TEXT
Homework 11.15 If $A$ is an $n \times n$ matrix, show that the cost of Variant 1 is approximately $\frac{2}{3} n^3$ flops.
Answer: During the $k$th iteration, $L_{00}$ is $k \times k$, for $k = 0, \ldots, n-1$. The (approximate) costs of the steps are:
* Solve $L_{00} u_{01} = a_{01}$ for $u_{01}$, overwriting $a_{01}$ with the result. Cost: $k^2$ flops.
* Solve $l_{10}^T U_{00} = a_{10}^T$ (or, equivalently, $U_{00}^T (l_{10}^T)^T = (a_{10}^T)^T$) for $l_{10}^T$, overwriting $a_{10}^T$ with the result. Cost: $k^2$ flops.
* Compute $\upsilon_{11} = \alpha_{11} - l_{10}^T u_{01}$, overwriting $\alpha_{11}$ with the result. Cost: $2k$ flops.
Hence the total cost is approximately
$$\sum_{k=0}^{n-1} ( k^2 + k^2 + 2k ) \approx 2 \sum_{k=0}^{n-1} k^2 \approx \frac{2}{3} n^3 \text{ flops}.$$
* BACK TO TEXT
365
Homework 11.16 Derive the up-looking variant for computing the LU factorization.
Answer: Consider the loop invariant
\[
\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L\backslash U_{TL} & U_{TR} \\ \widehat A_{BL} & \widehat A_{BR} \end{pmatrix}
\;\wedge\;
L_{TL} U_{TL} = \widehat A_{TL}
\;\wedge\;
L_{TL} U_{TR} = \widehat A_{TR},
\]
where the conditions involving \(L_{BL}\) and \(U_{BR}\) (crossed out in the text) are not maintained. At the top of the loop, after repartitioning, A contains
\[
\begin{pmatrix}
L\backslash U_{00} & u_{01} & U_{02} \\
\widehat a_{10}^T & \widehat\alpha_{11} & \widehat a_{12}^T \\
\widehat A_{20} & \widehat a_{21} & \widehat A_{22}
\end{pmatrix}
\quad\text{and it must be updated to contain}\quad
\begin{pmatrix}
L\backslash U_{00} & u_{01} & U_{02} \\
l_{10}^T & \upsilon_{11} & u_{12}^T \\
\widehat A_{20} & \widehat a_{21} & \widehat A_{22}
\end{pmatrix},
\]
where the entries in the middle row (in blue in the text) are to be computed. Now, considering \(LU = \widehat A\) we notice that
\[
\begin{array}{l}
L_{00} U_{00} = \widehat A_{00}, \qquad L_{00} U_{02} = \widehat A_{02}, \\
l_{10}^T U_{00} = \widehat a_{10}^T, \qquad
l_{10}^T u_{01} + \upsilon_{11} = \widehat\alpha_{11}, \qquad
l_{10}^T U_{02} + u_{12}^T = \widehat a_{12}^T, \\
L_{20} U_{02} + l_{21} u_{12}^T + L_{22} U_{22} = \widehat A_{22}.
\end{array}
\]
The equalities involving the middle row (highlighted in yellow in the text) can be used to compute the desired parts of L and U:
• Solve \(l_{10}^T U_{00} = a_{10}^T\) for \(l_{10}^T\), overwriting \(a_{10}^T\) with the result.
• Compute \(\upsilon_{11} = \alpha_{11} - l_{10}^T u_{01}\), overwriting \(\alpha_{11}\) with the result.
• Compute \(u_{12}^T := a_{12}^T - l_{10}^T U_{02}\), overwriting \(a_{12}^T\) with the result.
* BACK TO TEXT
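A minimal plain-Mscript sketch of the up-looking variant just derived (no pivoting; the function name is hypothetical and a triangular solve routine would replace the explicit division in a real implementation):

    % Sketch of the up-looking LU variant: at step i the i-th row of L (strictly
    % lower part) and of U is computed from the rows above it, overwriting A with L\U.
    function A = LU_up_looking( A )
      n = size( A, 1 );
      for i = 1:n
        % Solve l10^T U00 = a10^T for l10^T (a row-oriented triangular solve).
        A( i, 1:i-1 ) = A( i, 1:i-1 ) / triu( A( 1:i-1, 1:i-1 ) );
        % upsilon11 := alpha11 - l10^T u01  and  u12^T := a12^T - l10^T U02.
        A( i, i:n )   = A( i, i:n ) - A( i, 1:i-1 ) * A( 1:i-1, i:n );
      end
    end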
Homework 11.17 Implement all five LU factorization algorithms with the FLAME@lab API, in Mscript.
* BACK TO TEXT
Homework 11.18 Which of the five variants can be modified to incorporate partial pivoting?
Answer: Variants 2, 4, and 5.
* BACK TO TEXT
Consider the matrix
\[
A = \begin{pmatrix}
 1 &  0 &  0 & \cdots & 0 & 1 \\
-1 &  1 &  0 & \cdots & 0 & 1 \\
-1 & -1 &  1 & \cdots & 0 & 1 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
-1 & -1 & -1 & \cdots & 1 & 1 \\
-1 & -1 & -1 & \cdots & -1 & 1
\end{pmatrix}.
\]
Partial pivoting never interchanges rows for this matrix, since the pivot element is always 1, which is as large in magnitude as any entry below it. Eliminating the entries below the diagonal in the first column yields
\[
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 & 1 \\
0 & 1 & 0 & \cdots & 0 & 2 \\
0 & -1 & 1 & \cdots & 0 & 2 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & -1 & -1 & \cdots & 1 & 2 \\
0 & -1 & -1 & \cdots & -1 & 2
\end{pmatrix}.
\]
Eliminating the entries below the diagonal in the second column yields
\[
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 & 1 \\
0 & 1 & 0 & \cdots & 0 & 2 \\
0 & 0 & 1 & \cdots & 0 & 4 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & -1 & \cdots & 1 & 4 \\
0 & 0 & -1 & \cdots & -1 & 4
\end{pmatrix}.
\]
Continuing, eliminating the entries below the diagonal in the (n-1)st column yields
\[
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 & 1 \\
0 & 1 & 0 & \cdots & 0 & 2 \\
0 & 0 & 1 & \cdots & 0 & 4 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & 2^{n-2} \\
0 & 0 & 0 & \cdots & 0 & 2^{n-1}
\end{pmatrix},
\]
so the entries in the last column double with every step of elimination.
* BACK TO TEXT
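A quick numerical check of this element growth, sketched in plain Mscript with the built-in lu routine (the value n = 8 is an arbitrary choice):

    % Build the n x n "growth" matrix above and observe that the largest entry of
    % U after LU with partial pivoting is 2^(n-1).
    n = 8;
    A = eye( n ) - tril( ones( n ), -1 );   % 1 on the diagonal, -1 below it
    A( :, n ) = 1;                          % last column of ones
    [ L, U, P ] = lu( A );                  % built-in LU with partial pivoting
    max( abs( U( : ) ) )                    % prints 2^(n-1) = 128 for n = 8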
Consider an SPD matrix A partitioned as
\[
A \rightarrow \begin{pmatrix} A_{00} & a_{10} \\ a_{10}^T & \alpha_{11} \end{pmatrix}.
\]
(Hint: For this exercise, use techniques similar to those in Section 12.4.)
1. Show that \(A_{00}\) is SPD.
Answer: Assume that A is \(n \times n\) so that \(A_{00}\) is \((n-1) \times (n-1)\). Let \(x_0 \in \mathbb{R}^{n-1}\) be a nonzero vector. Then
\[
x_0^T A_{00} x_0 =
\begin{pmatrix} x_0 \\ 0 \end{pmatrix}^T
\begin{pmatrix} A_{00} & a_{10} \\ a_{10}^T & \alpha_{11} \end{pmatrix}
\begin{pmatrix} x_0 \\ 0 \end{pmatrix} > 0,
\]
since \((x_0; 0)\) is a nonzero vector and A is SPD.
2. Assuming that \(A_{00} = L_{00} L_{00}^T\), where \(L_{00}\) is lower triangular and nonsingular, argue that the assignment \(l_{10}^T := a_{10}^T L_{00}^{-T}\) is well-defined.
Answer: The computation is well-defined because \(L_{00}^{-T}\) exists (\(L_{00}\) is nonsingular) and hence \(l_{10}^T := a_{10}^T L_{00}^{-T}\) uniquely computes \(l_{10}^T\).
3. Assuming that \(A_{00}\) is SPD, \(A_{00} = L_{00} L_{00}^T\) where \(L_{00}\) is lower triangular and nonsingular, and \(l_{10}^T = a_{10}^T L_{00}^{-T}\), show that \(\alpha_{11} - l_{10}^T l_{10} > 0\) so that \(\lambda_{11} := \sqrt{\alpha_{11} - l_{10}^T l_{10}}\) is well-defined.
Answer: We want to show that \(\alpha_{11} - l_{10}^T l_{10} > 0\). To do so, we are going to construct a nonzero vector x so that \(x^T A x = \alpha_{11} - l_{10}^T l_{10}\), at which point we can invoke the fact that A is SPD.
Rather than just giving the answer, we go through the steps that give insight into the thought process leading up to the answer. Consider
\[
\begin{pmatrix} x_0 \\ \chi_1 \end{pmatrix}^T
\begin{pmatrix} A_{00} & a_{10} \\ a_{10}^T & \alpha_{11} \end{pmatrix}
\begin{pmatrix} x_0 \\ \chi_1 \end{pmatrix}
=
\begin{pmatrix} x_0 \\ \chi_1 \end{pmatrix}^T
\begin{pmatrix} A_{00} x_0 + \chi_1 a_{10} \\ a_{10}^T x_0 + \alpha_{11}\chi_1 \end{pmatrix}
= x_0^T A_{00} x_0 + \chi_1 x_0^T a_{10} + \chi_1 a_{10}^T x_0 + \alpha_{11}\chi_1^2.
\]
Choosing \(x_0 = -L_{00}^{-T} l_{10}\) and \(\chi_1 = 1\), and noting that \(a_{10} = L_{00} l_{10}\),
\[
0 <
\begin{pmatrix} -L_{00}^{-T} l_{10} \\ 1 \end{pmatrix}^T
\begin{pmatrix} A_{00} & a_{10} \\ a_{10}^T & \alpha_{11} \end{pmatrix}
\begin{pmatrix} -L_{00}^{-T} l_{10} \\ 1 \end{pmatrix}
= (L_{00}^{-T} l_{10})^T L_{00} L_{00}^T (L_{00}^{-T} l_{10})
  - (L_{00}^{-T} l_{10})^T (L_{00} l_{10})
  - (L_{00} l_{10})^T (L_{00}^{-T} l_{10})
  + \alpha_{11}
\]
\[
= l_{10}^T l_{10} - l_{10}^T l_{10} - l_{10}^T l_{10} + \alpha_{11}
= \alpha_{11} - l_{10}^T l_{10}.
\]
[Figure 12.4: Bordered algorithm for computing the Cholesky factorization, stated with the FLAME notation: while \(m(A_{TL}) < m(A)\), repartition to expose \(a_{10}^T\) and \(\alpha_{11}\), then update \(a_{10}^T := a_{10}^T L_{00}^{-T}\) (= \(l_{10}^T\)), \(\alpha_{11} := \alpha_{11} - a_{10}^T a_{10}\), \(\alpha_{11} := \sqrt{\alpha_{11}}\).]
These updates correspond to the partitioned equality
\[
\begin{pmatrix} A_{00} & a_{10} \\ a_{10}^T & \alpha_{11} \end{pmatrix}
=
\begin{pmatrix} L_{00} & 0 \\ l_{10}^T & \lambda_{11} \end{pmatrix}
\begin{pmatrix} L_{00} & 0 \\ l_{10}^T & \lambda_{11} \end{pmatrix}^T.
\]
Homework 12.17 Use the results in the last exercise to give an alternative proof by induction of the
Cholesky Factorization Theorem and to give an algorithm for computing it by filling in Figure 12.3. This
algorithm is often referred to as the bordered Cholesky factorization algorithm.
Answer:
Proof by induction.
Base case: n = 1. Clearly the result is true for a \(1 \times 1\) matrix \(A = \alpha_{11}\): In this case, the fact that A is SPD means that \(\alpha_{11}\) is real and positive, and a Cholesky factor is then given by \(\lambda_{11} = \sqrt{\alpha_{11}}\), with uniqueness if we insist that \(\lambda_{11}\) is positive.
Inductive step: Inductive Hypothesis (I.H.): Assume the result is true for SPD matrix \(A \in \mathbb{C}^{(n-1)\times(n-1)}\). We will show that it holds for \(A \in \mathbb{C}^{n\times n}\). Let \(A \in \mathbb{C}^{n\times n}\) be SPD. Partition A and L like
\[
A = \begin{pmatrix} A_{00} & \star \\ a_{10}^T & \alpha_{11} \end{pmatrix}
\quad\text{and}\quad
L = \begin{pmatrix} L_{00} & 0 \\ l_{10}^T & \lambda_{11} \end{pmatrix}.
\]
Assume that \(A_{00} = L_{00} L_{00}^T\) is the Cholesky factorization of \(A_{00}\). By the I.H., we know this exists, since \(A_{00}\) is \((n-1) \times (n-1)\), and that \(L_{00}\) is nonsingular. By the previous homework, we then know that
• \(l_{10}^T = a_{10}^T L_{00}^{-T}\) is well-defined,
• \(\lambda_{11} = \sqrt{\alpha_{11} - l_{10}^T l_{10}}\) is well-defined, and
• \(A = L L^T\).
Hence L is the desired Cholesky factor of A.
By the principle of mathematical induction, the theorem holds.
The algorithm is given in Figure 12.4.
* BACK TO TEXT
Homework 12.18 Show that the cost of the bordered algorithm is, approximately, \(\frac{1}{3} n^3\) flops.
Answer: During the kth iteration, \(k = 0, \ldots, n-1\), assume that \(A_{00}\) is \(k \times k\). Then
• \(a_{10}^T := a_{10}^T L_{00}^{-T}\) (implemented as the lower triangular solve \(L_{00} l_{10} = a_{10}\)) requires, approximately, \(k^2\) flops.
The remaining updates (\(\alpha_{11} := \alpha_{11} - a_{10}^T a_{10}\) and \(\alpha_{11} := \sqrt{\alpha_{11}}\)) contribute only O(k) flops. Hence the total cost is approximately
\[
\sum_{k=0}^{n-1} k^2 \approx \int_0^n x^2\, dx = \frac{1}{3}\, n^3.
\]
* BACK TO TEXT
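A minimal plain-Mscript sketch of the bordered Cholesky factorization algorithm discussed in the last two exercises (the function name is hypothetical; only the lower triangular part of A is referenced and overwritten):

    % Sketch of the bordered Cholesky factorization: overwrite the lower
    % triangular part of the SPD matrix A with its Cholesky factor L.
    function A = Chol_bordered( A )
      n = size( A, 1 );
      for k = 1:n
        % l10^T := a10^T L00^{-T}, computed as the lower triangular solve L00 * l10 = a10.
        A( k, 1:k-1 ) = ( tril( A( 1:k-1, 1:k-1 ) ) \ A( k, 1:k-1 )' )';
        % lambda11 := sqrt( alpha11 - l10^T l10 ).
        A( k, k )     = sqrt( A( k, k ) - A( k, 1:k-1 ) * A( k, 1:k-1 )' );
      end
    end

The dominant cost is the triangular solve in each iteration, consistent with the (1/3)n^3 flop count above.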
Homework 13.4 Let \(\lambda\) be an eigenvalue of A and let \(E_\lambda(A) = \{x \in \mathbb{C}^m \,|\, Ax = \lambda x\}\) denote the set of all eigenvectors of A associated with \(\lambda\) (including the zero vector, which is not really considered an eigenvector). Show that this set is a (nontrivial) subspace of \(\mathbb{C}^m\).
Answer:
• \(0 \in E_\lambda(A)\): (since we explicitly include it).
• \(x \in E_\lambda(A)\) implies \(\alpha x \in E_\lambda(A)\): (by the last exercise).
• \(x, y \in E_\lambda(A)\) implies \(x + y \in E_\lambda(A)\):
\[
A(x + y) = Ax + Ay = \lambda x + \lambda y = \lambda(x + y).
\]
* BACK TO TEXT
Homework 13.8 The eigenvalues of a diagonal matrix equal the values on its diagonal. The eigenvalues of a triangular matrix equal the values on its diagonal.
Answer: Since a diagonal matrix is a special case of a triangular matrix, it suffices to prove that the eigenvalues of a triangular matrix are the values on its diagonal.
If A is triangular, so is \(A - \lambda I\). By definition, \(\lambda\) is an eigenvalue of A if and only if \(A - \lambda I\) is singular. But a triangular matrix is singular if and only if it has a zero on its diagonal. The triangular matrix \(A - \lambda I\) has a zero on its diagonal if and only if \(\lambda\) equals one of the diagonal elements of A.
* BACK TO TEXT
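A quick numerical illustration in Mscript (the matrix is an arbitrary example):

    % The eigenvalues of a triangular matrix are its diagonal entries.
    T = [ 1 2 3
          0 4 5
          0 0 6 ];
    sort( eig( T ) )'      % prints 1 4 6, the diagonal of T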
Homework 13.19 Prove Lemma 13.18. Then generalize it to a result for block upper triangular matrices:
\[
A = \begin{pmatrix}
A_{0,0} & A_{0,1} & \cdots & A_{0,N-1} \\
0 & A_{1,1} & \cdots & A_{1,N-1} \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & A_{N-1,N-1}
\end{pmatrix}.
\]
Answer:
Lemma 13.18 Let \(A \in \mathbb{C}^{m\times m}\) be of the form \(A = \begin{pmatrix} A_{TL} & A_{TR} \\ 0 & A_{BR} \end{pmatrix}\), and let \(Q_{TL}\) and \(Q_{BR}\) be unitary matrices of appropriate size. Then
\[
A = \begin{pmatrix} Q_{TL} & 0 \\ 0 & Q_{BR} \end{pmatrix}
\begin{pmatrix} Q_{TL}^H A_{TL} Q_{TL} & Q_{TL}^H A_{TR} Q_{BR} \\ 0 & Q_{BR}^H A_{BR} Q_{BR} \end{pmatrix}
\begin{pmatrix} Q_{TL} & 0 \\ 0 & Q_{BR} \end{pmatrix}^H.
\]
Proof: Multiplying out the right-hand side,
\[
\begin{pmatrix} Q_{TL} & 0 \\ 0 & Q_{BR} \end{pmatrix}
\begin{pmatrix} Q_{TL}^H A_{TL} Q_{TL} & Q_{TL}^H A_{TR} Q_{BR} \\ 0 & Q_{BR}^H A_{BR} Q_{BR} \end{pmatrix}
\begin{pmatrix} Q_{TL}^H & 0 \\ 0 & Q_{BR}^H \end{pmatrix}
=
\begin{pmatrix} Q_{TL} Q_{TL}^H A_{TL} Q_{TL} Q_{TL}^H & Q_{TL} Q_{TL}^H A_{TR} Q_{BR} Q_{BR}^H \\ 0 & Q_{BR} Q_{BR}^H A_{BR} Q_{BR} Q_{BR}^H \end{pmatrix}
=
\begin{pmatrix} A_{TL} & A_{TR} \\ 0 & A_{BR} \end{pmatrix} = A.
\]
The generalization: let \(Q = \mathrm{diag}(Q_{0,0}, Q_{1,1}, \ldots, Q_{N-1,N-1})\) with each \(Q_{i,i}\) unitary, and let
\[
B = \begin{pmatrix}
Q_{0,0}^H A_{0,0} Q_{0,0} & Q_{0,0}^H A_{0,1} Q_{1,1} & \cdots & Q_{0,0}^H A_{0,N-1} Q_{N-1,N-1} \\
0 & Q_{1,1}^H A_{1,1} Q_{1,1} & \cdots & Q_{1,1}^H A_{1,N-1} Q_{N-1,N-1} \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & Q_{N-1,N-1}^H A_{N-1,N-1} Q_{N-1,N-1}
\end{pmatrix}.
\]
Then \(A = Q B Q^H\), and B is again block upper triangular.
* BACK TO TEXT
Homework 13.21 Prove Corollary 13.20. Then generalize it to a result for block upper triangular matrices.
Answer:
Corollary 13.20 Let \(A \in \mathbb{C}^{m\times m}\) be of the form \(A = \begin{pmatrix} A_{TL} & A_{TR} \\ 0 & A_{BR} \end{pmatrix}\), where \(A_{TL}\) and \(A_{BR}\) are square. Then \(\Lambda(A) = \Lambda(A_{TL}) \cup \Lambda(A_{BR})\).
Proof: This follows immediately from Lemma 13.18. Let \(A_{TL} = Q_{TL} T_{TL} Q_{TL}^H\) and \(A_{BR} = Q_{BR} T_{BR} Q_{BR}^H\) be the Schur decompositions of the diagonal blocks. Then \(\Lambda(A_{TL})\) equals the diagonal elements of \(T_{TL}\) and \(\Lambda(A_{BR})\) equals the diagonal elements of \(T_{BR}\). Now, by Lemma 13.18 the Schur decomposition of A is given by
\[
A = \begin{pmatrix} Q_{TL} & 0 \\ 0 & Q_{BR} \end{pmatrix}
\begin{pmatrix} Q_{TL}^H A_{TL} Q_{TL} & Q_{TL}^H A_{TR} Q_{BR} \\ 0 & Q_{BR}^H A_{BR} Q_{BR} \end{pmatrix}
\begin{pmatrix} Q_{TL} & 0 \\ 0 & Q_{BR} \end{pmatrix}^H
= \begin{pmatrix} Q_{TL} & 0 \\ 0 & Q_{BR} \end{pmatrix}
\underbrace{\begin{pmatrix} T_{TL} & Q_{TL}^H A_{TR} Q_{BR} \\ 0 & T_{BR} \end{pmatrix}}_{T_A}
\begin{pmatrix} Q_{TL} & 0 \\ 0 & Q_{BR} \end{pmatrix}^H.
\]
Hence the diagonal elements of \(T_A\) (which is upper triangular) equal the elements of \(\Lambda(A)\). The set of those elements is clearly \(\Lambda(A_{TL}) \cup \Lambda(A_{BR})\).
The generalization is that if
\[
A = \begin{pmatrix}
A_{0,0} & A_{0,1} & \cdots & A_{0,N-1} \\
0 & A_{1,1} & \cdots & A_{1,N-1} \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & A_{N-1,N-1}
\end{pmatrix},
\]
where the blocks on the diagonal are square, then
\[
\Lambda(A) = \Lambda(A_{0,0}) \cup \Lambda(A_{1,1}) \cup \cdots \cup \Lambda(A_{N-1,N-1}).
\]
* BACK TO TEXT
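A quick numerical illustration in Mscript of the generalization (the blocks are arbitrary example values):

    % The spectrum of a block upper triangular matrix is the union of the
    % spectra of its diagonal blocks.
    A00 = [ 2 1; 0 3 ];                        % eigenvalues 2, 3
    A11 = 5;                                   % eigenvalue 5
    A   = [ A00, [ 7; 8 ]; zeros( 1, 2 ), A11 ];
    sort( eig( A ) )'                          % prints 2 3 5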
Homework 13.24 Let A be Hermitian and \(\lambda\) and \(\mu\) be distinct eigenvalues with eigenvectors \(x_\lambda\) and \(x_\mu\), respectively. Then \(x_\lambda^H x_\mu = 0\). (In other words, the eigenvectors of a Hermitian matrix corresponding to distinct eigenvalues are orthogonal.)
Answer: Assume that \(A x_\lambda = \lambda x_\lambda\) and \(A x_\mu = \mu x_\mu\), for nonzero vectors \(x_\lambda\) and \(x_\mu\) and \(\lambda \neq \mu\). Then
\[
x_\mu^H A x_\lambda = \lambda\, x_\mu^H x_\lambda
\quad\text{and}\quad
x_\lambda^H A x_\mu = \mu\, x_\lambda^H x_\mu.
\]
Because A is Hermitian and \(\mu\) is real,
\[
\lambda\, x_\mu^H x_\lambda = x_\mu^H A x_\lambda = (x_\lambda^H A x_\mu)^H = (\mu\, x_\lambda^H x_\mu)^H = \mu\, x_\mu^H x_\lambda.
\]
Hence \(\lambda\, x_\mu^H x_\lambda = \mu\, x_\mu^H x_\lambda\). If \(x_\mu^H x_\lambda \neq 0\) then \(\lambda = \mu\), which is a contradiction.
* BACK TO TEXT
Homework 13.25 Let \(A \in \mathbb{C}^{m\times m}\) be a Hermitian matrix, \(A = Q\Lambda Q^H\) its Spectral Decomposition, and \(A = U\Sigma V^H\) its SVD. Relate Q, U, V, \(\Lambda\), and \(\Sigma\).
Answer: I am going to answer this by showing how to take the Spectral Decomposition of A and turn it into the SVD of A. Observations:
• We will assume all eigenvalues are nonzero. It should be pretty obvious how to fix the below if some of them are zero.
• Q is unitary and \(\Lambda\) is diagonal. Thus, \(A = Q\Lambda Q^H\) is the SVD except that the diagonal elements of \(\Lambda\) are not necessarily nonnegative and they are not ordered from largest to smallest in magnitude.
• We can fix the fact that they are not necessarily nonnegative with the following observation:
\[
A = Q\Lambda Q^H
= Q\,\mathrm{diag}(\lambda_0, \ldots, \lambda_{m-1})\,
\underbrace{\mathrm{diag}(\mathrm{sign}(\lambda_0), \ldots, \mathrm{sign}(\lambda_{m-1}))\,\mathrm{diag}(\mathrm{sign}(\lambda_0), \ldots, \mathrm{sign}(\lambda_{m-1}))}_{I}\, Q^H
\]
\[
= Q\,
\underbrace{\mathrm{diag}(|\lambda_0|, \ldots, |\lambda_{m-1}|)}_{\widetilde\Sigma}\,
\underbrace{\mathrm{diag}(\mathrm{sign}(\lambda_0), \ldots, \mathrm{sign}(\lambda_{m-1}))\, Q^H}_{\widetilde Q^H}
= Q\, \widetilde\Sigma\, \widetilde Q^H.
\]
Now \(\widetilde\Sigma\) has nonnegative diagonal entries and \(\widetilde Q\) is unitary since it is the product of two unitary matrices.
• Next, we fix the fact that the entries of \(\widetilde\Sigma\) are not ordered from largest to smallest. We do so by noting that there is a permutation matrix P such that \(\Sigma = P\widetilde\Sigma P^H\) equals the diagonal matrix with the values ordered from largest to smallest. Then
\[
A = Q\widetilde\Sigma\widetilde Q^H
= Q\,\underbrace{P^H P}_{I}\,\widetilde\Sigma\,\underbrace{P^H P}_{I}\,\widetilde Q^H
= \underbrace{Q P^H}_{U}\,\underbrace{P\widetilde\Sigma P^H}_{\Sigma}\,\underbrace{P\widetilde Q^H}_{V^H}
= U\Sigma V^H.
\]
* BACK TO TEXT
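The construction in the answer can be sketched in plain Mscript as follows (the example matrix is an arbitrary symmetric matrix with nonzero eigenvalues; variable names are illustrative):

    % Build an SVD of a symmetric (Hermitian) matrix from its spectral decomposition.
    A = [ 2 -1  0; -1  1  3;  0  3 -2 ];
    [ Q, Lambda ] = eig( A );                 % spectral decomposition A = Q*Lambda*Q'
    lambda = diag( Lambda );
    Qtilde = Q * diag( sign( lambda ) );      % absorb the signs of the eigenvalues into Q
    SigmaTilde = diag( abs( lambda ) );       % nonnegative "singular values", unordered
    [ ~, perm ] = sort( abs( lambda ), 'descend' );
    U = Q( :, perm );                         % U = Q * P^H
    V = Qtilde( :, perm );                    % V = Qtilde * P^H
    Sigma = SigmaTilde( perm, perm );         % Sigma = P * SigmaTilde * P^H
    norm( A - U * Sigma * V' )                % should be (close to) zero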
Homework 13.26 Let \(A \in \mathbb{C}^{m\times m}\) and \(A = U\Sigma V^H\) its SVD. Relate the Spectral Decompositions of \(A^H A\) and \(A A^H\) to U, V, and \(\Sigma\).
Answer:
\[
A^H A = (U\Sigma V^H)^H U\Sigma V^H = V\Sigma U^H U\Sigma V^H = V\Sigma^2 V^H.
\]
Thus V is a unitary matrix of eigenvectors of \(A^H A\), and the eigenvalues of \(A^H A\) equal the squares of the singular values of A.
\[
A A^H = U\Sigma V^H (U\Sigma V^H)^H = U\Sigma V^H V\Sigma U^H = U\Sigma^2 U^H.
\]
Thus U is a unitary matrix of eigenvectors of \(A A^H\), and the eigenvalues of \(A A^H\) equal the squares of the singular values of A.
* BACK TO TEXT
Chapter 14. Notes on the Power Method and Related Methods (Answers)
Homework 14.4 Prove Lemma 14.3.
Answer: We need to show that \(\|\cdot\|_X\) is a norm:
• Let \(y \neq 0\). Show that \(\|y\|_X > 0\): Let \(z = Xy\). Then \(z \neq 0\) since X is nonsingular. Hence \(\|y\|_X = \|Xy\| = \|z\| > 0\).
• Show that if \(\alpha \in \mathbb{C}\) and \(y \in \mathbb{C}^m\) then \(\|\alpha y\|_X = |\alpha|\,\|y\|_X\): \(\|\alpha y\|_X = \|X(\alpha y)\| = \|\alpha Xy\| = |\alpha|\,\|Xy\| = |\alpha|\,\|y\|_X\).
• Show that if \(x, y \in \mathbb{C}^m\) then \(\|x + y\|_X \leq \|x\|_X + \|y\|_X\): \(\|x + y\|_X = \|X(x + y)\| = \|Xx + Xy\| \leq \|Xx\| + \|Xy\| = \|x\|_X + \|y\|_X\).
* BACK TO TEXT
Show that
\[
\frac{1}{|\lambda_{m-1}|} > \frac{1}{|\lambda_{m-2}|} \geq \frac{1}{|\lambda_{m-3}|} \geq \cdots \geq \frac{1}{|\lambda_0|} > 0.
\]
Answer: This follows immediately from the fact that if \(\alpha > 0\) and \(\beta > 0\) then
• \(\alpha > \beta\) implies that \(1/\beta > 1/\alpha\), and
• \(\alpha \geq \beta\) implies that \(1/\beta \geq 1/\alpha\).
* BACK TO TEXT
Homework 15.3 Prove that in Figure 15.4, \(\widehat V^{(k)} = V^{(k)}\) and \(\widehat A^{(k)} = A^{(k)}\), for \(k = 1, \ldots\).
Answer: This requires a proof by induction. The inductive step uses the observation that \(\widehat v^{(k)}_{n-1} = V^{(k)} e_{n-1}\), so that, for example,
\[
\widehat v^{(k)\,T}_{n-1} \widehat A\, \widehat v^{(k)}_{n-1}
= (V^{(k)} e_{n-1})^T A (V^{(k)} e_{n-1})
= e_{n-1}^T V^{(k)\,T} A V^{(k)} e_{n-1}.
\]
Homework 15.11 In the above discussion, show that \(\alpha_{11}^2 + 2\alpha_{31}^2 + \alpha_{33}^2 = \widehat\alpha_{11}^2 + \widehat\alpha_{33}^2\).
Answer: If \(\widehat A_{31} = J_{31} A_{31} J_{31}^T\) then \(\|\widehat A_{31}\|_F^2 = \|A_{31}\|_F^2\), because multiplying on the left and/or the right by a unitary matrix preserves the Frobenius norm of a matrix. Hence
\[
\left\| \begin{pmatrix} \alpha_{11} & \alpha_{31} \\ \alpha_{31} & \alpha_{33} \end{pmatrix} \right\|_F^2
=
\left\| \begin{pmatrix} \widehat\alpha_{11} & 0 \\ 0 & \widehat\alpha_{33} \end{pmatrix} \right\|_F^2.
\]
But
\[
\left\| \begin{pmatrix} \alpha_{11} & \alpha_{31} \\ \alpha_{31} & \alpha_{33} \end{pmatrix} \right\|_F^2
= \alpha_{11}^2 + 2\alpha_{31}^2 + \alpha_{33}^2
\quad\text{and}\quad
\left\| \begin{pmatrix} \widehat\alpha_{11} & 0 \\ 0 & \widehat\alpha_{33} \end{pmatrix} \right\|_F^2
= \widehat\alpha_{11}^2 + \widehat\alpha_{33}^2,
\]
which establishes the equality.
Homework 15.14 Give the approximate total cost for reducing a nonsymmetric \(A \in \mathbb{R}^{n\times n}\) to upper Hessenberg form.
Answer: Assume that in the current iteration of the algorithm \(A_{00}\) is \(k \times k\), so that the active block \(A_{22}\) is \((n-k-1) \times (n-k-1)\). The significant updates in the current iteration are, approximately,
• \([u_{21}, \tau_1, a_{21}] := \mathrm{HouseV}(a_{21})\), plus forming \(y_{21} := A_{22} u_{21}\) (and the corresponding vector for the transformation applied from the other side) and \(\beta := u_{21}^T y_{21}/2\): approximately \(4(n-k-1)^2\) flops.
• \(A_{22} := A_{22} - (u_{21} w_{21}^T + z_{21} u_{21}^T)\), a rank-2 update of \(A_{22}\): approximately \(4(n-k-1)^2\) flops.
• Applying the transformation from the right to the k rows above the active block: approximately \(4k(n-k-1)\) flops.
Hence the total cost is approximately
\[
\sum_{k=0}^{n-1}\!\left[ 4k(n-k-1) + 4(n-k-1)^2 + 4(n-k-1)^2 \right]
\approx \sum_{k=0}^{n-1}\!\left[ 4n(n-k-1) + 4(n-k-1)^2 \right]
= 4n\sum_{j=0}^{n-1} j + 4\sum_{j=0}^{n-1} j^2
\approx 2n^3 + \int_0^n 4x^2\,dx
= 2n^3 + \frac{4}{3}n^3 = \frac{10}{3}n^3.
\]
[Figure 16.5: Algorithm A := LDLT(A), stated with the FLAME notation; three options for updating \(\alpha_{11}\), \(a_{21}\), and \(A_{22}\) are given, each ending with \(a_{21} := l_{21}\).]
Homework 16.2 Modify the algorithm in Figure 16.2 so that it computes the LDL^T factorization of a tridiagonal matrix. (Fill in Figure 16.4.) What is the approximate cost, in floating point operations, of computing the LDL^T factorization of a tridiagonal matrix? Count a divide, multiply, and add/subtract as a floating point operation each. Show how you came up with the algorithm, similar to how we derived the algorithm for the tridiagonal Cholesky factorization.
Answer: Three possible algorithms are given in Figure 16.6.
[Figure 16.6: Algorithms for computing the LDL^T factorization of a tridiagonal matrix, stated with the FLAME notation. All three options compute \(\lambda_{21} := \alpha_{21}/\alpha_{11}\) and \(\alpha_{22} := \alpha_{22} - \alpha_{21}^2/\alpha_{11}\), written in algebraically equivalent ways (Option 1: \(\alpha_{22} := \alpha_{22} - \lambda_{21}\alpha_{21}\); Option 2: \(\alpha_{22} := \alpha_{22} - \alpha_{11}\lambda_{21}^2\); Option 3: \(\alpha_{22} := \alpha_{22} - \alpha_{21}^2/\alpha_{11}\)).]
The key insight is to recognize that, relative to the algorithm in Figure 16.5, \(a_{21} = \begin{pmatrix} \alpha_{21} \\ 0 \end{pmatrix}\), so that
\[
a_{21} := a_{21}/\alpha_{11} = \begin{pmatrix} \alpha_{21}/\alpha_{11} \\ 0 \end{pmatrix}.
\]
Then an update like \(A_{22} := A_{22} - \alpha_{11} a_{21} a_{21}^T\) becomes
\[
\begin{pmatrix} \alpha_{22} & \star \\ \alpha_{32} e_F & A_{33} \end{pmatrix}
:=
\begin{pmatrix} \alpha_{22} & \star \\ \alpha_{32} e_F & A_{33} \end{pmatrix}
- \alpha_{11}
\begin{pmatrix} \alpha_{21}/\alpha_{11} \\ 0 \end{pmatrix}
\begin{pmatrix} \alpha_{21}/\alpha_{11} \\ 0 \end{pmatrix}^T
=
\begin{pmatrix} \alpha_{22} - \alpha_{11}\lambda_{21}^2 & \star \\ \alpha_{32} e_F & A_{33} \end{pmatrix},
\]
so that only the diagonal entry \(\alpha_{22}\) is actually modified. Each iteration therefore requires one divide, one multiply, and one add/subtract, so the total cost is approximately 3n flops.
* BACK TO TEXT
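A minimal plain-Mscript sketch of the tridiagonal LDL^T factorization, storing the matrix as its diagonal d and subdiagonal e (the function name and storage scheme are assumptions for illustration):

    % Tridiagonal LDL^T: on return d holds D and e holds the subdiagonal of the
    % unit lower bidiagonal factor L.
    function [ d, e ] = LDLT_tridiag( d, e )
      n = length( d );
      for k = 1:n-1
        lambda   = e( k ) / d( k );            % lambda := alpha_{k+1,k} / delta_k
        d( k+1 ) = d( k+1 ) - lambda * e( k ); % delta_{k+1} := alpha_{k+1,k+1} - lambda * alpha_{k+1,k}
        e( k )   = lambda;                     % overwrite the subdiagonal entry with lambda
      end
    end

This makes the roughly 3n flop count evident: one divide, one multiply, and one subtract per iteration.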
Homework 16.3 Derive an algorithm that, given an indefinite matrix A, computes A = UDU T . Overwrite
only the upper triangular part of A. (Fill in Figure 16.5.) Show how you came up with the algorithm,
similar to how we derived the algorithm for LDLT .
Answer: Three possible algorithms are given in Figure 16.7.
Partition
\[
A \rightarrow \begin{pmatrix} A_{00} & a_{01} \\ a_{01}^T & \alpha_{11} \end{pmatrix}, \quad
U \rightarrow \begin{pmatrix} U_{00} & u_{01} \\ 0 & 1 \end{pmatrix}, \quad\text{and}\quad
D \rightarrow \begin{pmatrix} D_{00} & 0 \\ 0 & \delta_1 \end{pmatrix}.
\]
Then
\[
\begin{pmatrix} A_{00} & a_{01} \\ a_{01}^T & \alpha_{11} \end{pmatrix}
=
\begin{pmatrix} U_{00} & u_{01} \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} D_{00} & 0 \\ 0 & \delta_1 \end{pmatrix}
\begin{pmatrix} U_{00} & u_{01} \\ 0 & 1 \end{pmatrix}^T
=
\begin{pmatrix} U_{00} D_{00} U_{00}^T + \delta_1 u_{01} u_{01}^T & \delta_1 u_{01} \\ \star & \delta_1 \end{pmatrix}.
\]
This suggests the following algorithm for overwriting the strictly upper triangular part of A with the strictly upper triangular part of U and the diagonal of A with D:
• Partition \(A \rightarrow \begin{pmatrix} A_{00} & a_{01} \\ a_{01}^T & \alpha_{11} \end{pmatrix}\).
• \(\delta_1 := \alpha_{11} = \alpha_{11}\) (no-op).
• \(u_{01} := a_{01}/\delta_1\), overwriting \(a_{01}\) (since \(\delta_1 u_{01} = a_{01}\)).
• \(A_{00} := A_{00} - \delta_1 u_{01} u_{01}^T\) (since \(U_{00} D_{00} U_{00}^T = A_{00} - \delta_1 u_{01} u_{01}^T\)), and continue with \(A_{00}\).
[Figure 16.7: Algorithms for computing the UDU^T factorization, stated with the FLAME notation; three options for the update (analogous to those for LDL^T) are given.]
Homework 16.4 Derive an algorithm that, given an indefinite tridiagonal matrix A, computes A = UDU T .
Overwrite only the upper triangular part of A. (Fill in Figure 16.6.) Show how you came up with the
algorithm, similar to how we derived the algorithm for LDLT .
Answer: Three possible algorithms are given in Figure 16.8.
The key insight is to recognize that, relative to the algorithm in Figure 16.7, the current diagonal element (\(\alpha_{11}\) there) is \(\alpha_{22}\) here and \(a_{01} = \begin{pmatrix} 0 \\ \alpha_{12} \end{pmatrix}\), so that
\[
\begin{pmatrix} 0 \\ \alpha_{12} \end{pmatrix} := \begin{pmatrix} 0 \\ \alpha_{12} \end{pmatrix}/\alpha_{22}
= \begin{pmatrix} 0 \\ \alpha_{12}/\alpha_{22} \end{pmatrix}.
\]
Then an update like \(A_{00} := A_{00} - \alpha_{22}\, a_{01} a_{01}^T\) becomes
\[
\begin{pmatrix} A_{00} & \alpha_{01} e_L \\ \star & \alpha_{11} \end{pmatrix}
:=
\begin{pmatrix} A_{00} & \alpha_{01} e_L \\ \star & \alpha_{11} \end{pmatrix}
- \alpha_{22}
\begin{pmatrix} 0 \\ \upsilon_{12} \end{pmatrix}
\begin{pmatrix} 0 \\ \upsilon_{12} \end{pmatrix}^T
=
\begin{pmatrix} A_{00} & \alpha_{01} e_L \\ \star & \alpha_{11} - \alpha_{22}\upsilon_{12}^2 \end{pmatrix},
\]
so that only the diagonal entry \(\alpha_{11}\) is actually modified.
* BACK TO TEXT
[Figure 16.8: Algorithms for computing the UDU^T factorization of a tridiagonal matrix, stated with the FLAME notation; as for the tridiagonal LDL^T case, three algebraically equivalent options for updating \(\upsilon_{12}\) and \(\alpha_{11}\) are given.]
Next, consider computing a twisted factorization of the tridiagonal matrix A from its LDL^T and UEU^T factorizations:
\[
\begin{pmatrix} L_{00} & 0 & 0 \\ \lambda_{10} e_L^T & 1 & \upsilon_{21} e_F^T \\ 0 & 0 & U_{22} \end{pmatrix}
\begin{pmatrix} D_{00} & 0 & 0 \\ 0 & \gamma_1 & 0 \\ 0 & 0 & E_{22} \end{pmatrix}
\begin{pmatrix} L_{00} & 0 & 0 \\ \lambda_{10} e_L^T & 1 & \upsilon_{21} e_F^T \\ 0 & 0 & U_{22} \end{pmatrix}^T
=
\begin{pmatrix} A_{00} & \alpha_{01} e_L & 0 \\ \alpha_{01} e_L^T & \alpha_{11} & \alpha_{21} e_F^T \\ 0 & \alpha_{21} e_F & A_{22} \end{pmatrix}.
\]
(Hint: multiply out A = LDLT and A = UEU T with the partitioned matrices first. Then multiply out the
above. Compare and match...) How should \(\gamma_1\) be chosen? What is the cost of computing the twisted
factorization given that you have already computed the LDLT and UDU T factorizations? A Big O
estimate is sufficient. Be sure to take into account what eTL D00 eL and eTF E22 eF equal in your cost estimate.
Answer: Multiplying out \(A = LDL^T\) with the partitioned matrices,
\[
\begin{pmatrix} L_{00} & 0 & 0 \\ \lambda_{10} e_L^T & 1 & 0 \\ 0 & \lambda_{21} e_F & L_{22} \end{pmatrix}
\begin{pmatrix} D_{00} & 0 & 0 \\ 0 & \delta_1 & 0 \\ 0 & 0 & D_{22} \end{pmatrix}
\begin{pmatrix} L_{00} & 0 & 0 \\ \lambda_{10} e_L^T & 1 & 0 \\ 0 & \lambda_{21} e_F & L_{22} \end{pmatrix}^T
=
\begin{pmatrix}
L_{00} D_{00} L_{00}^T & \lambda_{10} L_{00} D_{00} e_L & 0 \\
\lambda_{10} e_L^T D_{00} L_{00}^T & \lambda_{10}^2 e_L^T D_{00} e_L + \delta_1 & \lambda_{21}\delta_1 e_F^T \\
0 & \lambda_{21}\delta_1 e_F & \lambda_{21}^2\delta_1 e_F e_F^T + L_{22} D_{22} L_{22}^T
\end{pmatrix}
=
\begin{pmatrix} A_{00} & \alpha_{01} e_L & 0 \\ \alpha_{01} e_L^T & \alpha_{11} & \alpha_{21} e_F^T \\ 0 & \alpha_{21} e_F & A_{22} \end{pmatrix}. \quad (16.1)
\]
Similarly, multiplying out \(A = UEU^T\),
\[
\begin{pmatrix} U_{00} & \upsilon_{01} e_L & 0 \\ 0 & 1 & \upsilon_{21} e_F^T \\ 0 & 0 & U_{22} \end{pmatrix}
\begin{pmatrix} E_{00} & 0 & 0 \\ 0 & \varepsilon_1 & 0 \\ 0 & 0 & E_{22} \end{pmatrix}
\begin{pmatrix} U_{00} & \upsilon_{01} e_L & 0 \\ 0 & 1 & \upsilon_{21} e_F^T \\ 0 & 0 & U_{22} \end{pmatrix}^T
=
\begin{pmatrix}
U_{00} E_{00} U_{00}^T + \upsilon_{01}^2\varepsilon_1 e_L e_L^T & \upsilon_{01}\varepsilon_1 e_L & 0 \\
\upsilon_{01}\varepsilon_1 e_L^T & \varepsilon_1 + \upsilon_{21}^2 e_F^T E_{22} e_F & \upsilon_{21} e_F^T E_{22} U_{22}^T \\
0 & \upsilon_{21} U_{22} E_{22} e_F & U_{22} E_{22} U_{22}^T
\end{pmatrix}
=
\begin{pmatrix} A_{00} & \alpha_{01} e_L & 0 \\ \alpha_{01} e_L^T & \alpha_{11} & \alpha_{21} e_F^T \\ 0 & \alpha_{21} e_F & A_{22} \end{pmatrix}. \quad (16.2)
\]
Finally, multiplying out the twisted factorization,
\[
\begin{pmatrix} L_{00} & 0 & 0 \\ \lambda_{10} e_L^T & 1 & \upsilon_{21} e_F^T \\ 0 & 0 & U_{22} \end{pmatrix}
\begin{pmatrix} D_{00} & 0 & 0 \\ 0 & \gamma_1 & 0 \\ 0 & 0 & E_{22} \end{pmatrix}
\begin{pmatrix} L_{00} & 0 & 0 \\ \lambda_{10} e_L^T & 1 & \upsilon_{21} e_F^T \\ 0 & 0 & U_{22} \end{pmatrix}^T
=
\begin{pmatrix}
L_{00} D_{00} L_{00}^T & \lambda_{10} L_{00} D_{00} e_L & 0 \\
\lambda_{10} e_L^T D_{00} L_{00}^T & \lambda_{10}^2 e_L^T D_{00} e_L + \gamma_1 + \upsilon_{21}^2 e_F^T E_{22} e_F & \upsilon_{21} e_F^T E_{22} U_{22}^T \\
0 & \upsilon_{21} U_{22} E_{22} e_F & U_{22} E_{22} U_{22}^T
\end{pmatrix}
=
\begin{pmatrix} A_{00} & \alpha_{01} e_L & 0 \\ \alpha_{01} e_L^T & \alpha_{11} & \alpha_{21} e_F^T \\ 0 & \alpha_{21} e_F & A_{22} \end{pmatrix}. \quad (16.3)
\]
The equivalence of the submatrices highlighted in yellow in (16.1) and (16.2) justifies the submatrices highlighted in yellow in (16.3). The equivalence of the submatrices highlighted in grey in (16.1), (16.2), and (16.3) tells us that
\[
\alpha_{11} = \lambda_{10}^2 e_L^T D_{00} e_L + \delta_1, \qquad
\alpha_{11} = \upsilon_{21}^2 e_F^T E_{22} e_F + \varepsilon_1, \qquad
\alpha_{11} = \lambda_{10}^2 e_L^T D_{00} e_L + \gamma_1 + \upsilon_{21}^2 e_F^T E_{22} e_F,
\]
or
\[
\alpha_{11} - \delta_1 = \lambda_{10}^2 e_L^T D_{00} e_L
\qquad\text{and}\qquad
\alpha_{11} - \varepsilon_1 = \upsilon_{21}^2 e_F^T E_{22} e_F,
\]
so that
\[
\alpha_{11} = (\alpha_{11} - \delta_1) + \gamma_1 + (\alpha_{11} - \varepsilon_1).
\]
Solving this for \(\gamma_1\) yields
\[
\gamma_1 = \delta_1 + \varepsilon_1 - \alpha_{11}.
\]
Notice that, given the factorizations \(A = LDL^T\) and \(A = UEU^T\), the cost of computing the twisted factorization is O(1).
* BACK TO TEXT
Consider the twisted factorization above with the middle diagonal entry of the twisted factor set to zero, and find a nonzero vector \(x = (x_0; \chi_1; x_2)\) such that
\[
\begin{pmatrix} L_{00} & 0 & 0 \\ \lambda_{10} e_L^T & 1 & \upsilon_{21} e_F^T \\ 0 & 0 & U_{22} \end{pmatrix}
\begin{pmatrix} D_{00} & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & E_{22} \end{pmatrix}
\begin{pmatrix} L_{00} & 0 & 0 \\ \lambda_{10} e_L^T & 1 & \upsilon_{21} e_F^T \\ 0 & 0 & U_{22} \end{pmatrix}^T
\begin{pmatrix} x_0 \\ \chi_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}.
\qquad\left(\text{Hint: }
\begin{pmatrix} D_{00} & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & E_{22} \end{pmatrix}
\begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}.\right)
\]
What is the cost of this computation, given that \(L_{00}\) and \(U_{22}\) have special structure?
Answer: Choose \(x_0\), \(\chi_1\), and \(x_2\) so that
\[
\begin{pmatrix} L_{00} & 0 & 0 \\ \lambda_{10} e_L^T & 1 & \upsilon_{21} e_F^T \\ 0 & 0 & U_{22} \end{pmatrix}^T
\begin{pmatrix} x_0 \\ \chi_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}
\quad\text{or, equivalently,}\quad
\begin{pmatrix} L_{00}^T & \lambda_{10} e_L & 0 \\ 0 & 1 & 0 \\ 0 & \upsilon_{21} e_F & U_{22}^T \end{pmatrix}
\begin{pmatrix} x_0 \\ \chi_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix},
\]
so that \(\chi_1 = 1\), while \(x_0\) and \(x_2\) can be computed via solves with a bidiagonal upper triangular matrix (\(L_{00}^T x_0 = -\lambda_{10} e_L\)) and a bidiagonal lower triangular matrix (\(U_{22}^T x_2 = -\upsilon_{21} e_F\)), respectively. Then
\[
\begin{pmatrix} L_{00} & 0 & 0 \\ \lambda_{10} e_L^T & 1 & \upsilon_{21} e_F^T \\ 0 & 0 & U_{22} \end{pmatrix}
\begin{pmatrix} D_{00} & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & E_{22} \end{pmatrix}
\begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}
=
\begin{pmatrix} L_{00} & 0 & 0 \\ \lambda_{10} e_L^T & 1 & \upsilon_{21} e_F^T \\ 0 & 0 & U_{22} \end{pmatrix}
\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},
\]
so x is indeed the desired vector. Moreover, since \(L_{00}^T\) is upper bidiagonal and the right-hand side \(-\lambda_{10} e_L\) has only its last entry nonzero (and similarly for \(U_{22}^T\) and \(-\upsilon_{21} e_F\)), these solves require only O(n) flops.
Given tridiagonal \(\widehat A\), computing all eigenvalues requires \(O(n^2)\) computation. (We have not discussed this in detail...)
Then, for each eigenvalue \(\lambda\):
• Computing \(A := \widehat A - \lambda I\) requires O(n) flops.
• Factoring \(A = LDL^T\) requires O(n) flops.
• Factoring \(A = UEU^T\) requires O(n) flops.
• Computing all \(\gamma_1\) so that the smallest can be chosen requires O(n) flops.
• Computing the eigenvector of \(\widehat A\) associated with \(\lambda\) from the twisted factorization requires O(n) computation.
Thus, computing all eigenvalues and eigenvectors of a tridiagonal matrix via this method requires \(O(n^2)\) computation. This is much better than computing these via the tridiagonal QR algorithm.
* BACK TO TEXT
Homework 17.3
Homework 17.4 Give the approximate total cost for reducing \(A \in \mathbb{R}^{n\times n}\) to bidiagonal form.
* BACK TO TEXT
\[
\max_{\substack{\|x\|=1 \\ Ax=\lambda x}} \|Ax\|
= \max_{\substack{\|x\|=1 \\ Ax=\lambda x}} \|\lambda x\|
= \max_{\substack{\|x\|=1 \\ Ax=\lambda x}} |\lambda|\,\|x\|
= \max_{\substack{\|x\|=1 \\ Ax=\lambda x}} |\lambda|
= \rho(A).
\]
* BACK TO TEXT
Appendix
How to Download
Videos associated with these notes can be viewed in one of three ways:
* YouTube links to the video uploaded to YouTube.
* Download from UT Box links to the video uploaded to UT-Austin's UT Box File Sharing Service.
* View After Local Download links to the video downloaded to your own computer in directory Video within the same directory in which you stored this document. You can download videos by first clicking on * Download from UT Box. Alternatively, visit the UT Box directory in which the videos are stored and download some or all.
Index
Greek letters, 4
|·| (e.g., |α|), 5
C, 3
C^{m×n}, 3
C^n, 3
κ(·) (e.g., κ(A)), 33
·^H (e.g., A^H), 6
·^c (e.g., A^c), 6
|·| (e.g., |α|), 25
‖·‖_F (e.g., ‖A‖_F), 27
‖·‖_∞ (e.g., ‖A‖_∞), 30
‖·‖_∞ (e.g., ‖x‖_∞), 27
‖·‖ (e.g., ‖x‖), 28
‖·‖_{μ,ν} (e.g., ‖A‖_{μ,ν}), 28
‖·‖ (e.g., ‖x‖), 28
‖·‖_1 (e.g., ‖A‖_1), 30
‖·‖_1 (e.g., ‖x‖_1), 27
‖·‖_p (e.g., ‖A‖_p), 29
‖·‖_2 (e.g., ‖A‖_2), 30
‖·‖_2 (e.g., ‖x‖_2), 25
R, 3
R^{m×n}, 3
R^n, 3
·^T (e.g., A^T), 5
ˆ· (e.g., â_i^T), 3
absolute value, 5, 25
axpy, 10
cost, 11, 12
blocked matrix-matrix multiplication, 22
bordered Cholesky factorization algorithm, 204
Cauchy-Schwartz inequality, 25
CGS, 59
Cholesky factorization
bordered algorithm, 204
other algorithm, 204
Classical Gram-Schmidt, 59
code skeleton, 77
complete pivoting
LU factorization, 174
complex conjugate, 5
complex scalar
absolute value, 5
condition number, 33, 54
conformal partitioning, 22
conjugate, 5
complex, 5
matrix, 5
scalar, 5
vector, 5
dot, 11
dot product, 11
dot product, 12
eigenpair
definition, 217
eigenvalue, 215222
definition, 217
eigenvector, 215222
definition, 217
equivalence of norms, 33
FLAME
API, 73–81
notation, 75
FLAME API, 73–81
FLAME notation, 13
floating point operations, 10
flop, 10
Frobenius norm, 27
definition, 27
Gauss transform, 162–164
Gaussian elimination, 159–193
gemm
cost, 20
gemm, 18
gemv
cost, 14
gemv, 13
ger
cost, 17
ger, 15
Gershgorin Disc Theorem, 309
Gram-Schmidt
Classical, 59
cost, 70
Modified, 64
Gram-Schmidt orthogonalization
implementation, 75
Gram-Schmidt QR factorization, 57–72
Greek letters
use of, 4
Hermitian transpose
vector, 6
homogeneous, 25
Householder notation, 3
Householder transformation, 85
i, 5
identity matrix, 4
induced matrix norm, 28
infinity norm
vector, 27
∞-norm
definition, 27
matrix, 30
vector, 27
inner product, 11
invariant subspace, 329
inverse
Moore-Penrose generalized, 50
pseudo, 50
LAFF Notes, x
linear system
conditioning, 32
linear system solve, 175
triangular, 175–178
low-rank approximation, 50
lower triangular solve, 175–178
LU decomposition, 161
LU factorization, 159–193
complete pivoting, 174
cost, 164–165
existence, 161
existence proof, 173–174
partial pivoting, 165–172
algorithm, 167
LU factorization, derivation, 161–162
MAC, 21
Matlab, 75
matrix
condition number, 33, 54
conjugate, 5
∞-norm, 30
low-rank approximation, 50
1-norm, 30
orthogonal, 3555
orthonormal, 37
permutation, 165
spectrum, 217
transpose, 58
2-norm, 30
unitary, 37
matrix norm, 23–33
Frobenius, 27
definition, 27
induced, 28
submultiplicative, 31
definition, 31
matrix p-norm
definition, 29
matrix-matrix multiplication, 18
blocked, 22
cost, 20
element-by-element, 18
via matrix-vector multiplications, 19
via rank-1 updates, 20
via row-vector times matrix multiplications, 19
matrix-matrix operation
gemm, 18
matrix-matrix multiplication, 18
matrix-matrix product, 18
matrix-vector multiplication, 12
algorithm, 14
by columns, 14
by rows, 14
cost, 14
via axpy, 14
via dot, 14
matrix-vector operation, 12
gemv, 13
ger, 15
matrix-vector multiplication, 12
rank-1 update, 15
matrix-vector product, 12
max, 27, 28
MGS, 64
Modified Gram-Schmidt, 64
Moore-Penrose generalized inverse, 50
multiplication
matrix-matrix, 18
blocked, 22
cost, 20
element-by-element, 18
via matrix-vector multiplications, 19
via rank-1 updates, 20
via row-vector times matrix multiplications,
19
matrix-vector, 12
cost, 14
multiply-accumulate, 21
norm, 23
matrix, 23
submultiplicative, 31
vector, 23–33
definition, 25
norms
equivalence, 33
notation
FLAME, 13
Householder, 3
Octave, 75
1-norm
definition, 27
matrix, 30
vector, 27
orthonormal basis, 57
outer product, 15
p-norm
matrix
definition, 29
vector, 27
partial pivoting
LU factorization, 165–172
partitioning
conformal, 22
permutation
matrix, 165
positive definite, 25
preface, ix
product
dot, 11
inner, 11
matrix-matrix, 18
matrix-vector, 12
outer, 15
projection
onto column space, 49
pseudo inverse, 50
QR factorization, 61
Gram-Schmidt, 57–72
rank-1 update, 15
algorithm, 17
by columns, 17
by rows, 17
cost, 17
via axpy, 17
via dot, 17
Reflector, 85
reflector, 85
residual, 303
Roman letters
use of, 4
scal, 9
cost, 10
scaled vector addition, 10
cost, 11, 12
scaling
vector
cost, 10
singular value, 41
Singular Value Decomposition, 35, 39, 41–55
geometric interpretation, 41
reduced, 45
theorem, 41
solve
triangular, 175–178
Spark webpage, 75
spectral radius
definition, 217
spectrum
definition, 217
submultiplicative matrix norm, 31
definition, 31
sup, 28
supremum, 28
matrix, 6
vector, 6
triangle inequality, 25
triangular solve, 175
lower, 175–178
upper, 178
triangular system solve, 178
2-norm
definition, 25
matrix, 30
vector, 25–26
unit basis vector, 4
upper triangular solve, 178
vector
2-norm (length)
definition, 25
complex conjugate, 5
conjugate, 5
Hermitian transpose, 6
infinity norm, 27
-norm, 27
-norm, 27
definition, 27
1-norm, 27
definition, 27
orthogonal, 37
orthonormal, 37
p-norm, 27
perpendicular, 37
scaling
cost, 10
transpose, 6
2-norm (length), 25–26
vector addition
scaled, 10
vector norm, 23–33
definition, 25
vector-vector operation, 9
axpy, 10
dot, 11
scal, 9
transpose, 58