
Mathematical Tripos Part IB: Lent 2010

Numerical Analysis – Lecture 1

1 Polynomial interpolation

1.1 The interpolation problem

Given n + 1 distinct real points x0 , x1 , . . . , xn and real numbers f0 , f1 , . . . , fn , we seek a function


p : R → R such that p(xi ) = fi , i = 0, 1, . . . , n. Such a function is called an interpolant.
We denote by Pn [x] the linear space of all real polynomials of degree at most n and observe that
each p ∈ Pn [x] is uniquely defined by its n + 1 coefficients. In other words, we have n + 1 degrees
of freedom, while interpolation at x0 , x1 , . . . , xn constitutes n + 1 conditions. This, intuitively,
justifies seeking an interpolant from Pn [x].

1.2 The Lagrange formula

Although, in principle, we may solve a linear problem with n + 1 unknowns to determine a poly-
nomial interpolant, this can be accomplished more easily by using the explicit Lagrange formula.
We claim that
p(x) = \sum_{k=0}^{n} f_k \prod_{\ell=0,\,\ell\neq k}^{n} \frac{x - x_\ell}{x_k - x_\ell}, \qquad x ∈ R.

Note that p ∈ Pn [x], as required. We wish to show that it interpolates the data. Define

L_k(x) := \prod_{\ell=0,\,\ell\neq k}^{n} \frac{x - x_\ell}{x_k - x_\ell}, \qquad k = 0, 1, . . . , n
(Lagrange cardinal polynomials). It is trivial to verify that L_j(x_j) = 1 and L_j(x_k) = 0 for k ≠ j,
hence
p(x_j) = \sum_{k=0}^{n} f_k L_k(x_j) = f_j, \qquad j = 0, 1, . . . , n,

and p is an interpolant.
Uniqueness Suppose that both p ∈ Pn [x] and q ∈ Pn [x] interpolate the same n + 1 data.
Then the polynomial p − q, of degree at most n, vanishes at n + 1 distinct points. But the only
polynomial of degree ≤ n with ≥ n + 1 zeros is the zero polynomial. Therefore p − q ≡ 0 and the
interpolating polynomial is unique.
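For illustration, a minimal Python sketch (assuming NumPy is available; all names are illustrative) that evaluates the interpolant directly from the Lagrange cardinal polynomials:

import numpy as np

def lagrange_interpolant(xs, fs):
    # Return p(x) built from the Lagrange cardinal polynomials L_k.
    xs = np.asarray(xs, dtype=float)
    fs = np.asarray(fs, dtype=float)

    def p(x):
        total = 0.0
        for k in range(len(xs)):
            others = np.delete(xs, k)
            # L_k(x) = prod_{l != k} (x - x_l) / (x_k - x_l)
            Lk = np.prod((x - others) / (xs[k] - others))
            total += fs[k] * Lk
        return total

    return p

# Interpolate f(x) = sin x at four points and check that p(x_i) = f(x_i).
xs = np.array([0.0, 0.5, 1.0, 1.5])
p = lagrange_interpolant(xs, np.sin(xs))
print([abs(p(xi) - np.sin(xi)) for xi in xs])   # all (close to) zero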

1.3 The error of polynomial interpolation

Let [a, b] be a closed interval of R. We denote by C[a, b] the space of all continuous functions from
[a, b] to R and let C s [a, b], where s is a positive integer, stand for the linear space of all functions
in C[a, b] that possess s continuous derivatives.
1 Corrections and suggestions to these notes should be emailed to A.Iserles@damtp.cam.ac.uk. All handouts are available on the WWW at the URL http://www.damtp.cam.ac.uk/user/na/PartIB/.

Theorem Given f ∈ C n+1 [a, b], let p ∈ Pn [x] interpolate the values f (xi ), i = 0, 1, . . . , n, where
x0 , . . . , xn ∈ [a, b] are pairwise distinct. Then for every x ∈ [a, b] there exists ξ ∈ [a, b] such that

f(x) − p(x) = \frac{1}{(n+1)!} f^{(n+1)}(ξ) \prod_{i=0}^{n} (x − x_i). \qquad (1.1)

Proof. The formula (1.1) is true when x = xj for j ∈ {0, 1, . . . , n}, since both sides of the
equation vanish. Let x ∈ [a, b] be any other point and define

φ(t) := [f(t) − p(t)] \prod_{i=0}^{n} (x − x_i) − [f(x) − p(x)] \prod_{i=0}^{n} (t − x_i), \qquad t ∈ [a, b].

[Note: The variable in φ is t, whereas x is a fixed parameter.] Note that φ(xj ) = 0, j = 0, 1, . . . , n,


and φ(x) = 0. Hence, φ has at least n + 2 distinct zeros in [a, b]. Moreover, φ ∈ C n+1 [a, b].
We now apply the Rolle theorem: if the function g ∈ C 1 [a, b] vanishes at two distinct points in
[a, b] then its derivative vanishes at an intermediate point. We deduce that φ′ vanishes at (at least)
n + 1 distinct points in [a, b]. Next, applying Rolle to φ′ , we conclude that φ′′ vanishes at n points
in [a, b]. In general, we prove by induction that φ(s) vanishes at n + 2 − s distinct points of [a, b]
for s = 0, 1, . . . , n + 1. Letting s = n + 1, we have φ(n+1) (ξ) = 0 for some ξ ∈ [a, b]. Hence

0 = φ^{(n+1)}(ξ) = [f^{(n+1)}(ξ) − p^{(n+1)}(ξ)] \prod_{i=0}^{n} (x − x_i) − [f(x) − p(x)] \frac{d^{n+1}}{dt^{n+1}} \prod_{i=0}^{n} (t − x_i) \Big|_{t=ξ}.

Since p^{(n+1)} ≡ 0 and \frac{d^{n+1}}{dt^{n+1}} \prod_{i=0}^{n} (t − x_i) ≡ (n + 1)!, we obtain (1.1). 2
Runge’s example We interpolate f(x) = 1/(1 + x^2), x ∈ [−5, 5], at the equally-spaced points
x_j = −5 + 10j/n, j = 0, 1, . . . , n. Some of the errors are displayed below.
Table: Errors for n = 20

x       f(x) − p(x)      \prod_{i=0}^{n} (x − x_i)
0.75    3.2 × 10^{−3}    −2.5 × 10^{6}
1.75    7.7 × 10^{−3}    −6.6 × 10^{6}
2.75    3.6 × 10^{−2}    −4.1 × 10^{7}
3.75    5.1 × 10^{−1}    −7.6 × 10^{8}
4.75    4.0 × 10^{+2}    −7.3 × 10^{10}

[Figure: Errors for n = 15, plotted for x ∈ [−5, 5].]


The growth in the error is explained by the product term in (1.1) (the rightmost column of the
table). Adding more interpolation points makes the largest error even worse. A remedy to this
state of affairs is to cluster points toward the ends of the range. A considerably smaller error is
attained for x_j = 5 cos((n−j)π/n), j = 0, 1, . . . , n (so-called Chebyshev points). It is possible to prove
that this choice of points minimizes \max_{x∈[−5,5]} \left| \prod_{i=0}^{n} (x − x_i) \right|.
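The same phenomenon can be reproduced with a short Python sketch (NumPy assumed; all names illustrative): it interpolates f(x) = 1/(1 + x^2) at equispaced and at Chebyshev points and compares the largest errors over [−5, 5].

import numpy as np

f = lambda x: 1.0 / (1.0 + x**2)

def lagrange_eval(nodes, values, x):
    # Evaluate the interpolating polynomial at the points x via the Lagrange formula.
    p = np.zeros_like(x)
    for k in range(len(nodes)):
        others = np.delete(nodes, k)
        Lk = np.ones_like(x)
        for xl in others:
            Lk *= (x - xl) / (nodes[k] - xl)
        p += values[k] * Lk
    return p

n = 20
x = np.linspace(-5, 5, 2001)
equi = np.array([-5 + 10 * j / n for j in range(n + 1)])
cheb = np.array([5 * np.cos((n - j) * np.pi / n) for j in range(n + 1)])

for name, nodes in [("equispaced", equi), ("Chebyshev", cheb)]:
    err = np.max(np.abs(f(x) - lagrange_eval(nodes, f(nodes), x)))
    print(name, err)   # the equispaced error blows up; the Chebyshev error stays small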

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 2

1.4 Divided differences: a definition

Given pairwise-distinct points x0 , x1 , . . . , xn ∈ [a, b], we let p ∈ Pn [x] interpolate f ∈ C[a, b] there.
The coefficient of xn in p is called the divided difference and denoted by f [x0 , x1 , . . . , xn ]. We say
that this divided difference is of degree n.
We can derive f [x0 , . . . , xn ] from the Lagrange formula,
f[x_0, x_1, . . . , x_n] = \sum_{k=0}^{n} f(x_k) \prod_{\ell=0,\,\ell\neq k}^{n} \frac{1}{x_k − x_\ell}. \qquad (1.2)

Theorem Let [ā, b̄] be the shortest interval that contains x0 , x1 , . . . , xn and let f ∈ C n [ā, b̄]. Then
there exists ξ ∈ [ā, b̄] such that
f[x_0, x_1, . . . , x_n] = \frac{1}{n!} f^{(n)}(ξ). \qquad (1.3)

Proof. Let p be the interpolating polynomial. The error function f − p has at least n + 1 zeros in
[ā, b̄] and, applying Rolle’s theorem n times, it follows that f (n) − p(n) vanishes at some ξ ∈ [ā, b̄].
But p(x) = \frac{1}{n!} p^{(n)}(ζ) x^n + lower order terms (for any ζ ∈ R), therefore, letting ζ = ξ,

f[x_0, x_1, . . . , x_n] = \frac{1}{n!} p^{(n)}(ξ) = \frac{1}{n!} f^{(n)}(ξ)

and we deduce (1.3). 2


Application It is a consequence of the theorem that divided differences can be used to approximate
derivatives.

1.5 Recurrence relations for divided differences

Our next topic is a useful way to calculate divided differences (and, ultimately, to derive yet
another means to construct an interpolating polynomial). We commence with the remark that
f [xi ] is the coefficient of x0 in the polynomial of degree 0 (i.e., a constant) that interpolates f (xi ),
hence f [xi ] = f (xi ).
Theorem Suppose that x0 , x1 , . . . , xk+1 are pairwise distinct, where k ≥ 0. Then
f[x_0, x_1, . . . , x_{k+1}] = \frac{f[x_1, x_2, . . . , x_{k+1}] − f[x_0, x_1, . . . , x_k]}{x_{k+1} − x_0}. \qquad (1.4)
Proof. Let p, q ∈ Pk [x] be the polynomials that interpolate f at

{x0 , x1 , . . . , xk } and {x1 , x2 , . . . , xk+1 }

respectively and define


r(x) := \frac{(x − x_0) q(x) + (x_{k+1} − x) p(x)}{x_{k+1} − x_0} ∈ P_{k+1}[x].
1 Corrections and suggestions to these notes should be emailed to A.Iserles@damtp.cam.ac.uk. All handouts are

available on the WWW at the URL http://www.damtp.cam.ac.uk/user/na/PartIB/.

1
We readily verify that r(x_i) = f(x_i), i = 0, 1, . . . , k + 1. Hence r is the interpolating polynomial
from P_{k+1}[x] and f[x_0, . . . , x_{k+1}] is the coefficient of x^{k+1} therein. But that coefficient is, by
construction, the difference of the leading coefficients of q and p divided by x_{k+1} − x_0, and the
recurrence (1.4) follows from the definition of divided differences. 2

1.6 The Newton interpolation formula

Recalling that f [xi ] = f (xi ), the recursive formula allows for rapid evaluation of the divided
difference table, in the following manner:

f[x_0]
            f[x_0, x_1]
f[x_1]                      f[x_0, x_1, x_2]
            f[x_1, x_2]                            · · ·
f[x_2]                      f[x_1, x_2, x_3]                 f[x_0, x_1, . . . , x_n]
  ·              · · ·              · · ·              · · ·
  ·
            f[x_{n−1}, x_n]         f[x_{n−2}, x_{n−1}, x_n]
f[x_n]

Each entry of a column is obtained, via (1.4), from the two adjacent entries in the column to its left.
This can be done in O(n^2) operations and the outcome is the set of numbers \{f[x_0, x_1, . . . , x_l]\}_{l=0}^{n}.


We now provide an alternative representation of the interpolating polynomial. Again, f (xi ), i =


0, 1, . . . , k, are given and we seek p ∈ Pk [x] such that p(xi ) = f (xi ), i = 0, . . . , k.
Theorem Suppose that x0 , x1 , . . . , xk are pairwise distinct. The polynomial
p_k(x) := f[x_0] + f[x_0, x_1](x − x_0) + · · · + f[x_0, x_1, . . . , x_k] \prod_{i=0}^{k−1} (x − x_i) ∈ P_k[x]

obeys pk (xi ) = f (xi ), i = 0, 1, . . . , k.


Proof. By induction on k. The statement is obvious for k = 0 and we suppose that it is
true for k. We now prove that p_{k+1}(x) − p_k(x) = f[x_0, x_1, . . . , x_{k+1}] \prod_{i=0}^{k} (x − x_i). Clearly,
p_{k+1} − p_k ∈ P_{k+1}[x] and the coefficient of x^{k+1} therein is, by definition, f[x_0, . . . , x_{k+1}]. Moreover,
p_{k+1}(x_i) − p_k(x_i) = 0, i = 0, 1, . . . , k, hence it is a multiple of \prod_{i=0}^{k} (x − x_i), and this proves the
asserted form of p_{k+1} − p_k. The explicit form of p_{k+1} follows by adding p_{k+1} − p_k to p_k. 2
We have derived the Newton interpolation formula, which requires only the top row of the divided
difference table. It has several advantages over Lagrange’s. In particular, its evaluation at a given
point x (provided that divided differences are known) requires just O(k) operations, as long as we
do it by the Horner scheme

p_k(x) = \{· · · \{\{f[x_0, . . . , x_k](x − x_{k−1}) + f[x_0, . . . , x_{k−1}]\}(x − x_{k−2}) + f[x_0, . . . , x_{k−2}]\}(x − x_{k−3}) + · · ·\} + f[x_0].

On the other hand, the Lagrange formula is often better when we wish to manipulate the interpo-
lation polynomial as part of a larger mathematical expression. We’ll see an example in the section
on Gaussian quadrature.
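A minimal Python sketch of both ingredients, the divided-difference table computed via the recurrence (1.4) and the Horner-style evaluation of the Newton form (NumPy assumed; the helper names are illustrative only):

import numpy as np

def divided_differences(xs, fs):
    # Top row of the divided-difference table: f[x0], f[x0,x1], ..., f[x0,...,xk].
    xs = np.asarray(xs, dtype=float)
    col = np.asarray(fs, dtype=float).copy()      # current column of the table
    top = [col[0]]
    for j in range(1, len(xs)):
        # recurrence (1.4): each entry uses the two adjacent entries of the previous column
        col = (col[1:] - col[:-1]) / (xs[j:] - xs[:-j])
        top.append(col[0])
    return np.array(top)

def newton_eval(xs, coeffs, x):
    # Evaluate the Newton form by the Horner-like scheme, O(k) operations.
    p = coeffs[-1]
    for j in range(len(coeffs) - 2, -1, -1):
        p = p * (x - xs[j]) + coeffs[j]
    return p

xs = np.array([0.0, 1.0, 2.0, 4.0])
fs = np.exp(xs)
c = divided_differences(xs, fs)
print([abs(newton_eval(xs, c, xi) - np.exp(xi)) for xi in xs])   # ~0 at the nodes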

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 3

2 Orthogonal polynomials

2.1 Orthogonality in general linear spaces


We have already seen the scalar product ⟨x, y⟩ = \sum_{i=1}^{n} x_i y_i, acting on x, y ∈ R^n. Likewise, given
arbitrary weights w_1, w_2, . . . , w_n > 0, we may define ⟨x, y⟩ = \sum_{i=1}^{n} w_i x_i y_i. In general, a scalar (or
inner) product is any function V × V → R, where V is a vector space over the reals, subject to the
following three axioms:
Symmetry: ⟨x, y⟩ = ⟨y, x⟩ ∀x, y ∈ V;
Nonnegativity: ⟨x, x⟩ ≥ 0 ∀x ∈ V and ⟨x, x⟩ = 0 iff x = 0; and
Linearity: ⟨ax + by, z⟩ = a⟨x, z⟩ + b⟨y, z⟩ ∀x, y, z ∈ V, a, b ∈ R.
Given a scalar product, we may define orthogonality: x, y ∈ V are orthogonal if ⟨x, y⟩ = 0.
Let V = C[a, b], let w ∈ V be a fixed positive function and define ⟨f, g⟩ := \int_a^b w(x) f(x) g(x) dx for all
f, g ∈ V. It is easy to verify all three axioms of the scalar product.

2.2 Orthogonal polynomials – definition, existence, uniqueness

Given a scalar product in V = P_n[x], we say that p_n ∈ P_n[x] is the nth orthogonal polynomial
if ⟨p_n, p⟩ = 0 for all p ∈ P_{n−1}[x]. [Note: different inner products lead to different orthogonal
polynomials.] A polynomial in P_n[x] is monic if the coefficient of x^n therein equals one.
Theorem For every n ≥ 0 there exists a unique monic orthogonal polynomial of degree n. More-
over, any p ∈ Pn [x] can be expanded as a linear combination of p0 , p1 , . . . , pn .
Proof. We let p0 (x) ≡ 1 and prove the theorem by induction on n. Thus, suppose that
p0 , p1 , . . . , pn have been already derived consistently with both assertions of the theorem and let
q(x) := x^{n+1} ∈ P_{n+1}[x]. Motivated by the Gram–Schmidt algorithm, we choose

p_{n+1}(x) = q(x) − \sum_{k=0}^{n} \frac{⟨q, p_k⟩}{⟨p_k, p_k⟩} p_k(x), \qquad x ∈ R. \qquad (2.1)

Clearly, pn+1 ∈ Pn+1 [x] and it is monic (since all the terms in the sum are of degree ≤ n).
Let m ∈ {0, 1, . . . , n}. It follows from (2.1) and the induction hypothesis that

⟨p_{n+1}, p_m⟩ = ⟨q, p_m⟩ − \sum_{k=0}^{n} \frac{⟨q, p_k⟩}{⟨p_k, p_k⟩} ⟨p_k, p_m⟩ = ⟨q, p_m⟩ − \frac{⟨q, p_m⟩}{⟨p_m, p_m⟩} ⟨p_m, p_m⟩ = 0.

Hence, pn+1 is orthogonal to p0 , . . . , pn . Consequently, according to the second inductive assertion,


it is orthogonal to all p ∈ Pn [x].
To prove uniqueness, we suppose the existence of two monic orthogonal polynomials p_{n+1}, p̃_{n+1} ∈
P_{n+1}[x]. Let p := p_{n+1} − p̃_{n+1} ∈ P_n[x], hence ⟨p_{n+1}, p⟩ = ⟨p̃_{n+1}, p⟩ = 0, and this implies

0 = ⟨p_{n+1}, p⟩ − ⟨p̃_{n+1}, p⟩ = ⟨p_{n+1} − p̃_{n+1}, p⟩ = ⟨p, p⟩,

1 Corrections and suggestions to these notes should be emailed to A.Iserles@damtp.cam.ac.uk. All handouts are

available on the WWW at the URL http://www.damtp.cam.ac.uk/user/na/PartIB/.

1
and we deduce p ≡ 0.
Finally, in order to prove that each p ∈ Pn+1 [x] is a linear combination of p0 , . . . , pn+1 , we note
that we can always write it in the form p = cpn+1 + q, where c is the coefficient of xn+1 in p and
where q ∈ Pn [x]. According to the induction hypothesis, q can be expanded as a linear combination
of p0 , p1 , . . . , pn , hence our assertion is true. 2
Well-known examples of orthogonal polynomials include
Name         Notation    Interval       Weight function
Legendre     P_n         [−1, 1]        w(x) ≡ 1
Chebyshev    T_n         [−1, 1]        w(x) = (1 − x^2)^{−1/2}
Laguerre     L_n         [0, ∞)         w(x) = e^{−x}
Hermite      H_n         (−∞, ∞)        w(x) = e^{−x^2}

2.3 The three-term recurrence relation

How to construct orthogonal polynomials? (2.1) might help, but it suffers from loss of accuracy
due to imprecisions in the calculation of scalar products. A considerably better procedure follows
from our next theorem.
Theorem Monic orthogonal polynomials are given by the formula
p_{−1}(x) ≡ 0, \qquad p_0(x) ≡ 1,
p_{n+1}(x) = (x − α_n) p_n(x) − β_n p_{n−1}(x), \qquad n = 0, 1, . . . , \qquad (2.2)

where

α_n := \frac{⟨p_n, x p_n⟩}{⟨p_n, p_n⟩}, \qquad β_n := \frac{⟨p_n, p_n⟩}{⟨p_{n−1}, p_{n−1}⟩} > 0.
Proof. Pick n ≥ 0 and let ψ(x) := p_{n+1}(x) − (x − α_n) p_n(x) + β_n p_{n−1}(x). Since p_n and p_{n+1} are
monic, it follows that ψ ∈ P_n[x]. Moreover, because of orthogonality of p_{n−1}, p_n, p_{n+1},

⟨ψ, p_ℓ⟩ = ⟨p_{n+1}, p_ℓ⟩ − ⟨p_n, (x − α_n) p_ℓ⟩ + β_n ⟨p_{n−1}, p_ℓ⟩ = 0, \qquad ℓ = 0, 1, . . . , n − 2.

Because of monicity, x p_{n−1} = p_n + q, where q ∈ P_{n−1}[x]. Thus, from the definition of α_n, β_n,

⟨ψ, p_{n−1}⟩ = −⟨p_n, x p_{n−1}⟩ + β_n ⟨p_{n−1}, p_{n−1}⟩ = −⟨p_n, p_n⟩ + β_n ⟨p_{n−1}, p_{n−1}⟩ = 0,
⟨ψ, p_n⟩ = −⟨x p_n, p_n⟩ + α_n ⟨p_n, p_n⟩ = 0.

Every p ∈ P_n[x] that obeys ⟨p, p_ℓ⟩ = 0, ℓ = 0, 1, . . . , n, must necessarily be the zero polynomial.
For suppose that it is not so and let x^s be the highest power of x in p. Then ⟨p, p_s⟩ ≠ 0, which is
impossible. We deduce that ψ ≡ 0, hence (2.2) is true. 2
Example Chebyshev polynomials We choose the scalar product
⟨f, g⟩ := \int_{−1}^{1} f(x) g(x) \frac{dx}{\sqrt{1 − x^2}}, \qquad f, g ∈ C[−1, 1]

and define T_n ∈ P_n[x] by the relation T_n(cos θ) = cos(nθ). Hence T_0(x) ≡ 1, T_1(x) = x, T_2(x) =
2x^2 − 1 etc. Changing the integration variable,

⟨T_n, T_m⟩ = \int_{−1}^{1} T_n(x) T_m(x) \frac{dx}{\sqrt{1 − x^2}} = \int_0^π \cos nθ \cos mθ\, dθ = \frac{1}{2} \int_0^π [\cos(n+m)θ + \cos(n−m)θ]\, dθ = 0

whenever n ≠ m. The recurrence relation for Chebyshev polynomials is particularly simple,
T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x), as can be verified at once from the identity cos[(n + 1)θ] + cos[(n −
1)θ] = 2 cos θ cos nθ. Note that the T_n's aren't monic, hence the inconsistency with (2.2). To
obtain monic polynomials take T_n(x)/2^{n−1}, n ≥ 1.
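For illustration, a minimal Python sketch of the recurrence (2.2) for the Legendre weight w(x) ≡ 1 on [−1, 1], where the inner products of polynomials can be integrated exactly (NumPy assumed; names illustrative):

import numpy as np
from numpy.polynomial import Polynomial as P

def inner(p, q):
    # <p, q> = integral over [-1, 1] of p(x) q(x) dx  (Legendre weight w = 1)
    r = (p * q).integ()
    return r(1.0) - r(-1.0)

def monic_orthogonal(n):
    # Generate p_0, ..., p_n by the three-term recurrence (2.2).
    polys = [P([1.0])]                         # p_0 = 1
    x = P([0.0, 1.0])
    p_prev, p_curr = P([0.0]), polys[0]        # p_{-1} = 0
    for k in range(n):
        alpha = inner(p_curr, x * p_curr) / inner(p_curr, p_curr)
        beta = 0.0 if k == 0 else inner(p_curr, p_curr) / inner(p_prev, p_prev)
        p_next = (x - alpha) * p_curr - beta * p_prev
        p_prev, p_curr = p_curr, p_next
        polys.append(p_curr)
    return polys

ps = monic_orthogonal(4)
print(ps[2].coef)              # the monic multiple of the Legendre polynomial P_2: x^2 - 1/3
print(inner(ps[2], ps[3]))     # ~0: orthogonality of distinct polynomials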

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 4

2.4 Least-squares polynomial fitting


Given f ∈ C[a, b] and a scalar product ⟨g, h⟩ = \int_a^b w(x) g(x) h(x) dx, we wish to pick p ∈ P_n[x] so
as to minimise ⟨f − p, f − p⟩. Again, we stipulate that w(x) > 0 for x ∈ (a, b). Intuitively speaking,
p approximates f and is an alternative to an interpolating polynomial.
Let p_0, p_1, . . . , p_n be orthogonal polynomials w.r.t. the underlying inner product, p_ℓ ∈ P_ℓ[x]. They
form a basis of P_n[x], therefore for every p ∈ P_n[x] there exist c_0, c_1, . . . , c_n ∈ R such that p =
\sum_{k=0}^{n} c_k p_k. Because of orthogonality,

⟨f − p, f − p⟩ = \left\langle f − \sum_{k=0}^{n} c_k p_k, f − \sum_{k=0}^{n} c_k p_k \right\rangle = ⟨f, f⟩ − 2 \sum_{k=0}^{n} c_k ⟨p_k, f⟩ + \sum_{k=0}^{n} c_k^2 ⟨p_k, p_k⟩.

To derive optimal c0 , c1 , . . . , cn we seek to minimise the last expression. (Note that it is a quadratic
function in the ci s.) Since

\frac{1}{2} \frac{∂}{∂c_k} ⟨f − p, f − p⟩ = −⟨p_k, f⟩ + c_k ⟨p_k, p_k⟩, \qquad k = 0, 1, . . . , n,

setting the gradient to zero yields

p(x) = \sum_{k=0}^{n} \frac{⟨p_k, f⟩}{⟨p_k, p_k⟩} p_k(x). \qquad (2.3)

Note that
⟨f − p, f − p⟩ = ⟨f, f⟩ − \sum_{k=0}^{n} \{2 c_k ⟨p_k, f⟩ − c_k^2 ⟨p_k, p_k⟩\} = ⟨f, f⟩ − \sum_{k=0}^{n} \frac{⟨p_k, f⟩^2}{⟨p_k, p_k⟩}. \qquad (2.4)

This identity can be rewritten as ⟨f − p, f − p⟩ + ⟨p, p⟩ = ⟨f, f⟩, reminiscent of the Pythagoras
theorem.
How to choose n? Note that c_k = ⟨p_k, f⟩/⟨p_k, p_k⟩ is independent of n. Thus, we can continue
to add terms to (2.3) until ⟨f − p, f − p⟩ is below a specified tolerance ε. Because of (2.4), we need
to pick n so that ⟨f, f⟩ − ε < \sum_{k=0}^{n} ⟨p_k, f⟩^2/⟨p_k, p_k⟩.
Theorem (The Parseval identity) Let [a, b] be finite. Then

\sum_{k=0}^{∞} \frac{⟨p_k, f⟩^2}{⟨p_k, p_k⟩} = ⟨f, f⟩. \qquad (2.5)

Incomplete proof. Let

σ_n := \sum_{k=0}^{n} \frac{⟨p_k, f⟩^2}{⟨p_k, p_k⟩}, \qquad n = 0, 1, . . . ,

hence ⟨f − p, f − p⟩ = ⟨f, f⟩ − σ_n ≥ 0. The sequence \{σ_n\}_{n=0}^{∞} increases monotonically and σ_n ≤ ⟨f, f⟩
implies that \lim_{n→∞} σ_n exists. According to the Weierstrass theorem, any function in C[a, b] can
be approximated arbitrarily closely by a polynomial, hence \lim_{n→∞} ⟨f − p, f − p⟩ = 0; we deduce
that σ_n → ⟨f, f⟩ as n → ∞ and (2.5) is true. 2

2.5 Least-squares fitting to discrete function values

Suppose that m ≥ n + 1. We are given m function values f (x1 ), f (x2 ), . . . , f (xm ), where the xk s
are pairwise distinct, and seek p ∈ P_n[x] that minimises ⟨f − p, f − p⟩, where

⟨g, h⟩ := \sum_{k=1}^{m} g(x_k) h(x_k). \qquad (2.6)
One alternative is to express p as \sum_{ℓ=0}^{n} p_ℓ x^ℓ and find optimal p_0, . . . , p_n using numerical linear
algebra (which will be considered in the sequel). An alternative is to construct orthogonal polynomials
w.r.t. the scalar product (2.6). The theory is identical to that of subsections 2.1–4, except
that we have enough data to evaluate only p_0, p_1, . . . , p_{m−1}. However, we need just p_0, p_1, . . . , p_n
and n ≤ m − 1, so we have enough information to implement the algorithm. Thus
1. Employ the three-term recurrence (2.2) to calculate p_0, p_1, . . . , p_n (of course, using the scalar
product (2.6));

2. Form p(x) = \sum_{k=0}^{n} \frac{⟨p_k, f⟩}{⟨p_k, p_k⟩} p_k(x).
Since the work for each k is bounded by a constant multiple of m, the complete cost is O(mn), as
compared with O(n^2 m) if linear algebra is used.

2.6 Gaussian quadrature

We are again in C[a, b] and a scalar product is defined as in subsection 2.1, namely ⟨f, g⟩ =
\int_a^b w(x) f(x) g(x) dx, where w(x) > 0 for x ∈ (a, b). Our goal is to approximate integrals by finite
sums,

\int_a^b w(x) f(x) dx ≈ \sum_{k=1}^{ν} b_k f(c_k), \qquad f ∈ C[a, b].
The above is known as a quadrature formula. Here ν is given, whereas the numbers b_1, . . . , b_ν (the
weights) and the points c_1, . . . , c_ν (the nodes) are independent of the choice of f.
A reasonable approach to achieving high accuracy is to require that the approximation is exact for
all f ∈ Pm [x], where m is as large as possible – this results in Gaussian quadrature and we will
demonstrate that m = 2ν − 1 can be attained.
Firstly, we claim that m = 2ν is impossible. To prove this, choose arbitrary nodes c_1, . . . , c_ν and
note that p(x) := \prod_{k=1}^{ν} (x − c_k)^2 lives in P_{2ν}[x]. But \int_a^b w(x) p(x) dx > 0, while \sum_{k=1}^{ν} b_k p(c_k) = 0
for any choice of weights b_1, . . . , b_ν. Hence the integral and the quadrature do not match.
Let p0 , p1 , p2 , . . . denote, as before, the monic polynomials which are orthogonal w.r.t. the under-
lying scalar product.
Theorem Given n ≥ 1, all the zeros of pn are real, distinct and lie in the interval (a, b).
Proof. Recall that p0 ≡ 1. Thus, by orthogonality,
\int_a^b w(x) p_n(x) dx = \int_a^b w(x) p_0(x) p_n(x) dx = ⟨p_0, p_n⟩ = 0

and we deduce that p_n changes sign at least once in (a, b).
and we deduce that pn changes sign at least once in (a, b).
Denote by m ≥ 1 the number of sign changes of p_n in (a, b) and assume that m ≤ n − 1.
Denoting the points where a sign change occurs by ξ_1, ξ_2, . . . , ξ_m, we let q(x) := \prod_{j=1}^{m} (x − ξ_j).
Since q ∈ P_m[x], m ≤ n − 1, it follows that ⟨q, p_n⟩ = 0. On the other hand, it follows from our
construction that q(x) p_n(x) does not change sign throughout [a, b] and vanishes at a finite number
of points, hence

|⟨q, p_n⟩| = \left| \int_a^b w(x) q(x) p_n(x) dx \right| = \int_a^b w(x) |q(x) p_n(x)| dx > 0,

a contradiction. It follows that m = n and the proof is complete. 2

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 5
We commence our construction of Gaussian quadrature by choosing pairwise-distinct nodes
c1 , c2 , . . . , cν ∈ [a, b] and define the interpolatory weights
b_k := \int_a^b w(x) \prod_{j=1,\,j\neq k}^{ν} \frac{x − c_j}{c_k − c_j}\, dx, \qquad k = 1, 2, . . . , ν.

Theorem The quadrature formula with the above choice is exact for all f ∈ Pν−1 [x]. Moreover,
if c1 , c2 , . . . , cν are the zeros of pν then it is exact for all f ∈ P2ν−1 [x].
Proof. Every f ∈ Pν−1 [x] is its own interpolating polynomial, hence by Lagrange’s formula
f(x) = \sum_{k=1}^{ν} f(c_k) \prod_{j=1,\,j\neq k}^{ν} \frac{x − c_j}{c_k − c_j}. \qquad (2.7)

The quadrature is exact for all f ∈ P_{ν−1}[x] if \int_a^b w(x) f(x) dx = \sum_{k=1}^{ν} b_k f(c_k), and this, in tandem
with the interpolating-polynomial representation, yields the stipulated form of b_1, . . . , b_ν.
Let c1 , . . . , cν be the zeros of pν . Given any f ∈ P2ν−1 [x], we can represent it uniquely as f = qpν +r,
where q, r ∈ Pν−1 [x]. Thus, by orthogonality,
\int_a^b w(x) f(x) dx = \int_a^b w(x) [q(x) p_ν(x) + r(x)] dx = ⟨q, p_ν⟩ + \int_a^b w(x) r(x) dx = \int_a^b w(x) r(x) dx.

On the other hand, the choice of quadrature knots gives

\sum_{k=1}^{ν} b_k f(c_k) = \sum_{k=1}^{ν} b_k [q(c_k) p_ν(c_k) + r(c_k)] = \sum_{k=1}^{ν} b_k r(c_k).

Hence the integral and its approximation coincide, because r ∈ Pν−1 [x] and the quadrature is exact
for all polynomials in Pν−1 [x]. 2
Example Let [a, b] = [−1, 1], w(x) ≡ 1. Then the underlying orthogonal polynomials are the
Legendre polynomials: P_0 ≡ 1, P_1(x) = x, P_2(x) = \frac{3}{2}x^2 − \frac{1}{2}, P_3(x) = \frac{5}{2}x^3 − \frac{3}{2}x, P_4(x) =
\frac{35}{8}x^4 − \frac{15}{4}x^2 + \frac{3}{8} (it is customary to use this, non-monic, normalisation). The nodes of Gaussian
quadrature are

n = 1:  c_1 = 0;
n = 2:  c_1 = −\frac{\sqrt{3}}{3}, c_2 = \frac{\sqrt{3}}{3};
n = 3:  c_1 = −\frac{\sqrt{15}}{5}, c_2 = 0, c_3 = \frac{\sqrt{15}}{5};
n = 4:  c_1 = −\sqrt{\frac{3}{7} + \frac{2}{35}\sqrt{30}}, c_2 = −\sqrt{\frac{3}{7} − \frac{2}{35}\sqrt{30}}, c_3 = \sqrt{\frac{3}{7} − \frac{2}{35}\sqrt{30}}, c_4 = \sqrt{\frac{3}{7} + \frac{2}{35}\sqrt{30}}.
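A quick numerical illustration of the exactness property, assuming NumPy (whose leggauss routine returns the zeros of P_ν together with the corresponding weights):

import numpy as np

nu = 4
nodes, weights = np.polynomial.legendre.leggauss(nu)   # nodes are the zeros of P_nu

for degree in range(2 * nu + 1):
    quad = np.sum(weights * nodes**degree)              # quadrature applied to x^degree
    exact = 0.0 if degree % 2 else 2.0 / (degree + 1)   # integral of x^degree over [-1, 1]
    print(degree, abs(quad - exact))   # ~0 up to degree 2*nu - 1 = 7, a genuine error at degree 8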

3 The Peano kernel theorem

3.1 The theorem

Our point of departure is the Taylor formula with an integral remainder term,
f(x) = f(a) + (x − a) f′(a) + \frac{(x − a)^2}{2!} f′′(a) + · · · + \frac{(x − a)^k}{k!} f^{(k)}(a) + \frac{1}{k!} \int_a^x (x − θ)^k f^{(k+1)}(θ) dθ, \qquad (3.1)

which can be verified by integration by parts for functions f ∈ Ck+1 [a, b], a < b. Suppose that
we are given an approximation (e.g. to a function, a derivative at a given point, an integral etc.)
which is exact for all f ∈ Pk [x]. The Taylor formula produces an expression for the error that
depends on f (k+1) . This is the basis for the Peano kernel theorem.
Formally, let L(f ) be the approximation error. Thus, L maps Ck+1 [a, b], say, to R. We assume
that it is linear , i.e. L(αf +βg) = αL(f )+βL(g) ∀α, β ∈ R, and that L(f ) = 0 for all f ∈ Pk [x]. In
general, a linear mapping from a function space (e.g. Ck+1 [a, b]) to R is called a linear functional.
The formula (3.1) implies
L(f) = L\left( \frac{1}{k!} \int_a^x (x − θ)^k f^{(k+1)}(θ) dθ \right), \qquad a ≤ x ≤ b.

To make the range of integration independent of x, we introduce the notation


(x − θ)_+^k := \begin{cases} (x − θ)^k, & x ≥ θ, \\ 0, & x ≤ θ, \end{cases} \qquad whence \qquad L(f) = L\left( \frac{1}{k!} \int_a^b (x − θ)_+^k f^{(k+1)}(θ) dθ \right).

Let K(θ) := L[(x − θ)_+^k], θ ∈ [a, b]. [Note: K is independent of f.] The function K is called
the Peano kernel of L. Suppose that it is allowed to exchange the order of action of the integral
and of L. Because of the linearity of L, we then have

L(f) = \frac{1}{k!} \int_a^b K(θ) f^{(k+1)}(θ) dθ. \qquad (3.2)

The Peano kernel theorem Let L be a linear functional such that L(f ) = 0 for all f ∈ Pk [x].
Provided that f ∈ Ck+1 [a, b] and the above exchange of L with the integration sign is valid, the
formula (3.2) is true. 2

3.2 An example and a few useful formulae

We approximate a derivative by a linear combination of function values, f′(0) ≈ −\frac{3}{2} f(0) + 2 f(1) −
\frac{1}{2} f(2). Therefore, L(f) := f′(0) − [−\frac{3}{2} f(0) + 2 f(1) − \frac{1}{2} f(2)] and it is easy to check that L(f) = 0
for f ∈ P_2[x]. (Verify by trying f(x) = 1, x, x^2 and using linearity of L.) Thus, for f ∈ C^3[0, 2] we
have

L(f) = \frac{1}{2} \int_0^2 K(θ) f′′′(θ) dθ.

To evaluate the Peano kernel K, we fix θ. Letting g(x) := (x − θ)_+^2, we have

K(θ) = L(g) = g′(0) − \left[ −\frac{3}{2} g(0) + 2 g(1) − \frac{1}{2} g(2) \right]
     = 2 (0 − θ)_+ − \left[ −\frac{3}{2} (0 − θ)_+^2 + 2 (1 − θ)_+^2 − \frac{1}{2} (2 − θ)_+^2 \right]

     = \begin{cases}
         −2θ + \frac{3}{2} θ^2 + (2θ − \frac{3}{2} θ^2) ≡ 0, & θ ≤ 0, \\
         −2 (1 − θ)^2 + \frac{1}{2} (2 − θ)^2 = 2θ − \frac{3}{2} θ^2, & 0 ≤ θ ≤ 1, \\
         \frac{1}{2} (2 − θ)^2, & 1 ≤ θ ≤ 2, \\
         0, & θ ≥ 2.
       \end{cases}

[Note: It is obvious that K(θ) = 0 for θ ∉ [0, 2], since then L acts on a quadratic polynomial.] This
gives the form of the Peano kernel for our example.
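A minimal numerical check of this example, assuming NumPy: for f(x) = x^3 the functional L can be evaluated both directly and via (3.2) with the kernel just derived.

import numpy as np

def K(theta):
    # Peano kernel of the example: L(f) = f'(0) + 1.5 f(0) - 2 f(1) + 0.5 f(2)
    theta = np.asarray(theta, dtype=float)
    piece1 = np.where((theta >= 0) & (theta <= 1), 2*theta - 1.5*theta**2, 0.0)
    piece2 = np.where((theta > 1) & (theta <= 2), 0.5*(2 - theta)**2, 0.0)
    return piece1 + piece2

# Test with f(x) = x^3, so f'(0) = 0 and f''' = 6 (a constant).
L_direct = 0.0 + 1.5 * 0.0**3 - 2.0 * 1.0**3 + 0.5 * 2.0**3        # = 2
h = 2.0 / 200000
mid = np.arange(0.0, 2.0, h) + h / 2
L_kernel = 0.5 * np.sum(K(mid) * 6.0) * h                           # (1/2!) * integral of K f'''
print(L_direct, L_kernel)   # both ~2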

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 6
Back to the general case. . . Typically, forming L involves differentiation, integration and linear
combination of function values. Since

\frac{d}{dx} (x − θ)_+^k = k (x − θ)_+^{k−1}, \qquad \int_a^x (t − θ)_+^k dt = \frac{1}{k+1} \left[ (x − θ)_+^{k+1} − (a − θ)_+^{k+1} \right],

the exchange of L with integration is justified in these cases. Similarly for differentiation and,
trivially, for linear combinations.
Theorem Suppose that K doesn’t change sign in (a, b) and that f ∈ Ck+1 [a, b]. Then
"Z #
b
1
L(f ) = K(θ) dθ f (k+1) (ξ) for some ξ ∈ (a, b).
k! a

Proof. Let K ≥ 0. Then

L(f) ≥ \frac{1}{k!} \int_a^b K(θ) \min_{x∈[a,b]} f^{(k+1)}(x)\, dθ = \left[ \frac{1}{k!} \int_a^b K(θ) dθ \right] \min_{x∈[a,b]} f^{(k+1)}(x).

Likewise L(f) ≤ \left[ \frac{1}{k!} \int_a^b K(θ) dθ \right] \max_{x∈[a,b]} f^{(k+1)}(x), consequently

\min_{x∈[a,b]} f^{(k+1)}(x) ≤ \frac{L(f)}{\frac{1}{k!} \int_a^b K(θ) dθ} ≤ \max_{x∈[a,b]} f^{(k+1)}(x)

and the required result follows from the intermediate value theorem. Similar analysis is true in the
case K ≤ 0. 2
Function norms: We can measure the ‘size’ of a function g in various manners. Particular importance
is afforded to the 1-norm ‖g‖_1 = \int_a^b |g(x)| dx, the 2-norm ‖g‖_2 = \left\{ \int_a^b [g(x)]^2 dx \right\}^{1/2} and the
∞-norm ‖g‖_∞ = \max_{x∈[a,b]} |g(x)|.
Back to our example We have K ≥ 0 and \int_0^2 K(θ) dθ = \frac{2}{3}. Consequently L(f) = \frac{1}{2!} · \frac{2}{3} f′′′(ξ) =
\frac{1}{3} f′′′(ξ) for some ξ ∈ (0, 2). We deduce in particular that |L(f)| ≤ \frac{1}{3} ‖f′′′‖_∞.
Likewise we can easily deduce from \left| \int_a^b f(x) g(x) dx \right| ≤ ‖g‖_∞ ‖f‖_1 that

|L(f)| ≤ \frac{1}{k!} ‖K‖_1 ‖f^{(k+1)}‖_∞ \qquad and \qquad |L(f)| ≤ \frac{1}{k!} ‖K‖_∞ ‖f^{(k+1)}‖_1.
This is valid also when K changes sign. Moreover, the Cauchy–Schwarz inequality

\left| \int_a^b f(x) g(x) dx \right| ≤ ‖f‖_2 ‖g‖_2

implies the inequality

|L(f)| ≤ \frac{1}{k!} ‖K‖_2 ‖f^{(k+1)}‖_2.
All these provide a very powerful means to bound the size of the error in our approximation proce-
dures and verify how well ‘polynomial assumptions’ translate to arbitrary functions in Ck+1 [a, b].

4 Ordinary differential equations
We wish to approximate the exact solution of the ordinary differential equation (ODE)
y ′ = f (t, y), t ≥ 0, (4.1)
where y ∈ R^N and the function f : R × R^N → R^N is sufficiently ‘nice’. (In principle, it is enough
for f to be Lipschitz to ensure that the solution exists and is unique. Yet, for simplicity, we
henceforth assume that f is analytic: in other words, we are always able to expand locally into
Taylor series.) The equation (4.1) is accompanied by the initial condition y(0) = y 0 .
Our purpose is to approximate y n+1 ≈ y(tn+1 ), n = 0, 1, . . ., where tm = mh and the time step
h > 0 is small, from y 0 , y 1 , . . . , y n and equation (4.1).

4.1 One-step methods

A one-step method is a map y n+1 = ϕh (tn , y n ), i.e. an algorithm which allows y n+1 to depend
only on tn , y n , h and the ODE (4.1).
The Euler method: We know y and its slope y ′ at t = 0 and wish to approximate y at t = h > 0.
The most obvious approach is to truncate y(h) = y(0) + h y′(0) + \frac{1}{2} h^2 y′′(0) + · · · at the h^2 term.
Since y ′ (0) = f (t0 , y 0 ), this procedure approximates y(h) ≈ y 0 + hf (t0 , y 0 ) and we thus set
y 1 = y 0 + hf (t0 , y 0 ).
By the same token, we may advance from h to 2h by letting y 2 = y 1 + hf (t1 , y 1 ). In general, we
obtain the Euler method
y n+1 = y n + hf (tn , y n ), n = 0, 1, . . . . (4.2)
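A minimal Python sketch of (4.2) on a simple test problem (NumPy assumed; names illustrative); the error at a fixed time is roughly proportional to h, in line with the analysis that follows.

import numpy as np

def euler(f, t0, y0, h, n_steps):
    # Forward Euler (4.2): y_{n+1} = y_n + h f(t_n, y_n)
    t, y = t0, np.asarray(y0, dtype=float)
    for n in range(n_steps):
        y = y + h * f(t, y)
        t = t + h
    return y

# y' = -y, y(0) = 1, exact solution e^{-t}: halving h roughly halves the error at t = 1.
f = lambda t, y: -y
for h in [0.1, 0.05, 0.025]:
    y_end = euler(f, 0.0, [1.0], h, round(1.0 / h))
    print(h, abs(y_end[0] - np.exp(-1.0)))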

Convergence: Let t^* > 0 be given. We say that a method, which for every h > 0 produces the
solution sequence y_n = y_n(h), n = 0, 1, . . . , ⌊t^*/h⌋, converges if, as h → 0 and n_k(h) h → t (k → ∞),
it is true that y_{n_k} → y(t), the exact solution of (4.1), uniformly for t ∈ [0, t^*].
Theorem Suppose that f satisfies the Lipschitz condition: there exists λ ≥ 0 such that
‖f(t, v) − f(t, w)‖ ≤ λ ‖v − w‖, \qquad t ∈ [0, t^*], \quad v, w ∈ R^N.
Then the Euler method (4.2) converges.
Proof. Let e_n = y_n − y(t_n), the error at step n, where 0 ≤ n ≤ t^*/h. Thus,

e_{n+1} = y_{n+1} − y(t_{n+1}) = [y_n + h f(t_n, y_n)] − [y(t_n) + h y′(t_n) + O(h^2)].

By the Taylor theorem, the O(h^2) term can be bounded uniformly for all [0, t^*] (in the underlying
norm ‖ · ‖) by c h^2, where c > 0. Thus, using (4.1) and the triangle inequality,

‖e_{n+1}‖ ≤ ‖y_n − y(t_n)‖ + h ‖f(t_n, y_n) − f(t_n, y(t_n))‖ + c h^2
        ≤ ‖y_n − y(t_n)‖ + h λ ‖y_n − y(t_n)‖ + c h^2 = (1 + hλ) ‖e_n‖ + c h^2.

Consequently, by induction,

‖e_{n+1}‖ ≤ (1 + hλ)^m ‖e_{n+1−m}‖ + c h^2 \sum_{j=0}^{m−1} (1 + hλ)^j, \qquad m = 0, 1, . . . , n + 1.

In particular, letting m = n + 1 and bearing in mind that e_0 = 0, we have

‖e_{n+1}‖ ≤ c h^2 \sum_{j=0}^{n} (1 + hλ)^j = c h^2 \frac{(1 + hλ)^{n+1} − 1}{(1 + hλ) − 1} ≤ \frac{c h}{λ} (1 + hλ)^{n+1}.

For small h > 0 it is true that 0 < 1 + hλ ≤ e^{hλ}. This and (n + 1) h ≤ t^* imply that (1 + hλ)^{n+1} ≤
e^{t^* λ}, therefore ‖e_n‖ ≤ \frac{c e^{t^* λ}}{λ} h → 0 as h → 0, uniformly for 0 ≤ nh ≤ t^*, and the theorem is true. 2

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 7
Order: The order of a general numerical method y n+1 = ϕh (tn , y 0 , y 1 , . . . , y n ) for the solution
of (4.1) is the largest integer p ≥ 0 such that
y(t_{n+1}) − ϕ_h(t_n, y(t_0), y(t_1), . . . , y(t_n)) = O(h^{p+1})


for all h > 0, n ≥ 0 and all sufficiently smooth functions f in (4.1). Note that, unless p ≥ 1, the
‘method’ is an unsuitable approximation to (4.1): in particular, p ≥ 1 is necessary for convergence.
The order of Euler's method: We now have ϕ_h(t, y) = y + h f(t, y). Substituting the exact
solution of (4.1), we obtain from the Taylor theorem

y(t_{n+1}) − [y(t_n) + h f(t_n, y(t_n))] = [y(t_n) + h y′(t_n) + \frac{1}{2} h^2 y′′(t_n) + · · ·] − [y(t_n) + h y′(t_n)] = O(h^2)


and we deduce that Euler’s method is of order 1.


Theta methods: We consider methods of the form
y n+1 = y n + h[θf (tn , y n ) + (1 − θ)f (tn+1 , y n+1 )], n = 0, 1, . . . , (4.4)
where θ ∈ [0, 1] is a parameter:
• If θ = 1, we recover Euler’s method.
• if θ ∈ [0, 1) then the theta method (4.4) is implicit: Each time step requires the solution of
N (in general, nonlinear) algebraic equations for the unknown vector y n+1 .
• The choices θ = 0 and θ = \frac{1}{2} are known as
Backward Euler: y_{n+1} = y_n + h f(t_{n+1}, y_{n+1}),
Trapezoidal rule: y_{n+1} = y_n + \frac{1}{2} h [f(t_n, y_n) + f(t_{n+1}, y_{n+1})].
Solution of nonlinear algebraic equations can be done by iteration. For example, for backward
Euler, letting y_{n+1}^{[0]} = y_n, we may use

Direct iteration: y_{n+1}^{[j+1]} = y_n + h f(t_{n+1}, y_{n+1}^{[j]});

Newton–Raphson: y_{n+1}^{[j+1]} = y_{n+1}^{[j]} − \left[ I − h \frac{∂f(t_{n+1}, y_{n+1}^{[j]})}{∂y} \right]^{−1} [y_{n+1}^{[j]} − y_n − h f(t_{n+1}, y_{n+1}^{[j]})];

Modified Newton–Raphson: y_{n+1}^{[j+1]} = y_{n+1}^{[j]} − \left[ I − h \frac{∂f(t_n, y_n)}{∂y} \right]^{−1} [y_{n+1}^{[j]} − y_n − h f(t_{n+1}, y_{n+1}^{[j]})].
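For illustration, a minimal Python sketch of backward Euler with the implicit equation solved by direct iteration, on a scalar test problem (NumPy assumed; names illustrative):

import numpy as np

def backward_euler_step(f, t_next, y_n, h, iters=20):
    # One backward-Euler step, the implicit equation solved by direct iteration.
    y = y_n                                   # initial guess y^[0] = y_n
    for _ in range(iters):
        y = y_n + h * f(t_next, y)            # y^[j+1] = y_n + h f(t_{n+1}, y^[j])
    return y

# y' = -2y, y(0) = 1; exact solution e^{-2t}. Here |h * df/dy| = 0.2 < 1, so the iteration converges.
f = lambda t, y: -2.0 * y
h, y = 0.1, 1.0
for n in range(10):
    y = backward_euler_step(f, (n + 1) * h, y, h)
print(y, np.exp(-2.0))   # backward-Euler approximation vs the exact value at t = 1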

The order of the theta method: It follows from (4.4) and Taylor’s theorem that
y(t_{n+1}) − y(t_n) − h [θ y′(t_n) + (1 − θ) y′(t_{n+1})]
= [y(t_n) + h y′(t_n) + \frac{1}{2} h^2 y′′(t_n) + \frac{1}{6} h^3 y′′′(t_n)] − y(t_n) − θ h y′(t_n) − (1 − θ) h [y′(t_n) + h y′′(t_n) + \frac{1}{2} h^2 y′′′(t_n)] + O(h^4)
= (θ − \frac{1}{2}) h^2 y′′(t_n) + (\frac{1}{2} θ − \frac{1}{3}) h^3 y′′′(t_n) + O(h^4).




Therefore the theta method is of order 1, except that the trapezoidal rule is of order 2.

4.2 Multistep methods

It is often useful to use past solution values in computing a new value. Thus, assuming that
y n , y n+1 , . . . , y n+s−1 are available, where s ≥ 1, we say that
\sum_{l=0}^{s} ρ_l y_{n+l} = h \sum_{l=0}^{s} σ_l f(t_{n+l}, y_{n+l}), \qquad n = 0, 1, . . . , \qquad (4.5)

where ρs = 1, is an s-step method. If σs = 0, the method is explicit, otherwise it is implicit.
If s ≥ 2, we need to obtain the extra starting values y_1, . . . , y_{s−1} by a different time-stepping method.
Let ρ(w) = \sum_{l=0}^{s} ρ_l w^l, σ(w) = \sum_{l=0}^{s} σ_l w^l.
Theorem The multistep method (4.5) is of order p ≥ 1 iff

ρ(e^z) − z σ(e^z) = O(z^{p+1}), \qquad z → 0. \qquad (4.6)

Proof. Substituting the exact solution and expanding into Taylor series about t_n,

\sum_{l=0}^{s} ρ_l y(t_{n+l}) − h \sum_{l=0}^{s} σ_l y′(t_{n+l}) = \sum_{l=0}^{s} ρ_l \sum_{k=0}^{∞} \frac{1}{k!} y^{(k)}(t_n) l^k h^k − h \sum_{l=0}^{s} σ_l \sum_{k=0}^{∞} \frac{1}{k!} y^{(k+1)}(t_n) l^k h^k
= \left( \sum_{l=0}^{s} ρ_l \right) y(t_n) + \sum_{k=1}^{∞} \frac{1}{k!} \left( \sum_{l=0}^{s} l^k ρ_l − k \sum_{l=0}^{s} l^{k−1} σ_l \right) h^k y^{(k)}(t_n).

Thus, to obtain O(h^{p+1}) regardless of the choice of y, it is necessary and sufficient that

\sum_{l=0}^{s} ρ_l = 0, \qquad \sum_{l=0}^{s} l^k ρ_l = k \sum_{l=0}^{s} l^{k−1} σ_l, \qquad k = 1, 2, . . . , p. \qquad (4.7)

On the other hand, expanding again into Taylor series,

ρ(e^z) − z σ(e^z) = \sum_{l=0}^{s} ρ_l e^{lz} − z \sum_{l=0}^{s} σ_l e^{lz} = \sum_{l=0}^{s} ρ_l \sum_{k=0}^{∞} \frac{1}{k!} l^k z^k − z \sum_{l=0}^{s} σ_l \sum_{k=0}^{∞} \frac{1}{k!} l^k z^k
= \sum_{k=0}^{∞} \frac{1}{k!} \left( \sum_{l=0}^{s} l^k ρ_l \right) z^k − \sum_{k=1}^{∞} \frac{1}{(k−1)!} \left( \sum_{l=0}^{s} l^{k−1} σ_l \right) z^k
= \left( \sum_{l=0}^{s} ρ_l \right) + \sum_{k=1}^{∞} \frac{1}{k!} \left( \sum_{l=0}^{s} l^k ρ_l − k \sum_{l=0}^{s} l^{k−1} σ_l \right) z^k.

The theorem follows from (4.7). 2


Example The 2-step Adams–Bashforth method is

y_{n+2} − y_{n+1} = h \left[ \frac{3}{2} f(t_{n+1}, y_{n+1}) − \frac{1}{2} f(t_n, y_n) \right]. \qquad (4.8)

Therefore ρ(w) = w^2 − w, σ(w) = \frac{3}{2} w − \frac{1}{2} and

ρ(e^z) − z σ(e^z) = [1 + 2z + 2z^2 + \frac{4}{3} z^3] − [1 + z + \frac{1}{2} z^2 + \frac{1}{6} z^3] − \frac{3}{2} z [1 + z + \frac{1}{2} z^2] + \frac{1}{2} z + O(z^4) = \frac{5}{12} z^3 + O(z^4).

Hence the method is of order 2.
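The same expansion can be checked symbolically; a minimal sketch, assuming SymPy is available:

import sympy as sp

z, w = sp.symbols('z w')
rho = w**2 - w                                        # 2-step Adams-Bashforth (4.8)
sigma = sp.Rational(3, 2)*w - sp.Rational(1, 2)

expr = rho.subs(w, sp.exp(z)) - z*sigma.subs(w, sp.exp(z))
print(sp.series(expr, z, 0, 5))                       # leading term 5*z**3/12, hence order p = 2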


Example (Absence of convergence) Consider the 2-step method

y_{n+2} − 3 y_{n+1} + 2 y_n = \frac{1}{12} h [13 f(t_{n+2}, y_{n+2}) − 20 f(t_{n+1}, y_{n+1}) − 5 f(t_n, y_n)]. \qquad (4.9)

Now ρ(w) = w^2 − 3w + 2, σ(w) = \frac{1}{12} (13 w^2 − 20 w − 5) and it is easy to verify that the method
is of order 2. Let us apply it, however, to the trivial ODE y′ = 0, y(0) = 1. Hence a single
step reads y_{n+2} − 3 y_{n+1} + 2 y_n = 0 and the general solution of this recursion is y_n = c_1 + c_2 2^n,
n = 0, 1, . . ., where c_1, c_2 are arbitrary constants, which are determined by y_0 = 1 and our value
of y_1. In general, c_2 ≠ 0. Suppose that h → 0 and nh → t > 0. Then n → ∞, thus |y_n| → ∞ and
we cannot recover the exact solution y(t) ≡ 1. (This remains true even if we force c_2 = 0 by our
choice of y_1, because of the presence of roundoff errors.)
We deduce that the method (4.9) does not converge! As a more general point, it is important to
realise that many ‘plausible’ multistep methods may fail to be convergent and we need a theoretical
tool to allow us to check for this feature.

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 8

Definition We say that a polynomial obeys the root condition if all its zeros reside in |w| ≤ 1 and
all zeros of unit modulus are simple.
Theorem (The Dahlquist equivalence theorem) The multistep method (4.5) is convergent iff
it is of order p ≥ 1 and the polynomial ρ obeys the root condition.2
Examples revisited For the Adams–Bashforth method (4.8) we have ρ(w) = (w − 1)w and the
root condition is obeyed. However, for (4.9) we obtain ρ(w) = (w − 1)(w − 2), the root condition
fails and we deduce that there is no convergence.
A technique A useful procedure to generate multistep methods which are convergent and of high
order is as follows. According to (4.6), order p ≥ 1 implies ρ(1) = 0. Choose an arbitrary s-degree
polynomial ρ that obeys the root condition and such that ρ(1) = 0. To maximize order, we let
σ be the s-degree (alternatively, (s − 1)-degree for explicit methods) polynomial arising from the
truncation of the Taylor expansion of
ρ(w)
log w
about the point w = 1. Thus, for example, for an implicit method,

ρ(w)
+ O |w − 1|s+1 ρ(ez ) − zσ(ez ) = O z s+2
 
σ(w) = ⇒
log w

and (4.6) implies order at least s + 1.


Example The choice ρ(w) = w^{s−1}(w − 1) corresponds to Adams methods: Adams–Bashforth
methods if σ_s = 0, whence the order is s, otherwise order-(s + 1) (but implicit) Adams–Moulton
methods. For example, letting s = 2 and ξ = w − 1, we obtain the 3rd-order Adams–Moulton
method by expanding

\frac{w(w − 1)}{\log w} = \frac{ξ + ξ^2}{\log(1 + ξ)} = \frac{ξ + ξ^2}{ξ − \frac{1}{2} ξ^2 + \frac{1}{3} ξ^3 − · · ·} = \frac{1 + ξ}{1 − \frac{1}{2} ξ + \frac{1}{3} ξ^2 − · · ·}
= (1 + ξ) \left[ 1 + (\frac{1}{2} ξ − \frac{1}{3} ξ^2) + (\frac{1}{2} ξ − \frac{1}{3} ξ^2)^2 + O(ξ^3) \right] = 1 + \frac{3}{2} ξ + \frac{5}{12} ξ^2 + O(ξ^3)
= 1 + \frac{3}{2} (w − 1) + \frac{5}{12} (w − 1)^2 + O(|w − 1|^3) = −\frac{1}{12} + \frac{2}{3} w + \frac{5}{12} w^2 + O(|w − 1|^3).
 

Therefore the 2-step, 3rd-order Adams–Moulton method is

y_{n+2} − y_{n+1} = h \left[ −\frac{1}{12} f(t_n, y_n) + \frac{2}{3} f(t_{n+1}, y_{n+1}) + \frac{5}{12} f(t_{n+2}, y_{n+2}) \right].

BDF methods For reasons that will be made clear in the sequel, we wish to consider s-step,
s-order methods s.t. σ(w) = σ_s w^s for some σ_s ∈ R \ {0}. In other words,

\sum_{l=0}^{s} ρ_l y_{n+l} = h σ_s f(t_{n+s}, y_{n+s}), \qquad n = 0, 1, . . . .

Such methods are called backward differentiation formulae (BDF).




2 If ρ obeys the root condition, the method (4.5) is sometimes said to be zero-stable: we will not use this terminology.

Lemma The explicit form of the s-step BDF method is

ρ(w) = σ_s \sum_{l=1}^{s} \frac{1}{l} w^{s−l} (w − 1)^l, \qquad where \qquad σ_s = \left( \sum_{l=1}^{s} \frac{1}{l} \right)^{−1}. \qquad (4.10)

Proof Set v = w^{−1}; the order condition ρ(w) = σ_s w^s \log w + O(|w − 1|^{s+1}) becomes

\sum_{l=0}^{s} ρ_l v^{s−l} = −σ_s \log v + O(|v − 1|^{s+1}), \qquad v → 1.

But \log v = \log(1 + (v − 1)) = \sum_{l=1}^{∞} (−1)^{l−1} (v − 1)^l / l, consequently

\sum_{l=0}^{s} ρ_{s−l} v^l = σ_s \sum_{l=1}^{s} \frac{(−1)^l}{l} (v − 1)^l.

Brief manipulation and a restoration of w = v^{−1} yield

\sum_{l=0}^{s} ρ_l w^l = σ_s \sum_{l=1}^{s} \frac{(−1)^l}{l} w^{s−l} (1 − w)^l

and we pick σ_s so that ρ_s = 1, collecting powers of w^s on the right of the last displayed equation. 2
Example Let s = 2. Substitution in (4.10) yields σ_2 = \frac{2}{3} and simple algebra results in ρ(w) =
w^2 − \frac{4}{3} w + \frac{1}{3}. Hence the 2-step BDF is

y_{n+2} − \frac{4}{3} y_{n+1} + \frac{1}{3} y_n = \frac{2}{3} h f(t_{n+2}, y_{n+2}).

Remark We cannot take it for granted that BDF methods are convergent. It is possible to prove
that they are convergent iff s ≤ 6. They must not be used outside this range!

4.3 Runge–Kutta methods

Recalling quadrature We may approximate

\int_0^h f(t) dt ≈ h \sum_{l=1}^{ν} b_l f(c_l h),

where the weights b_l are chosen in accordance with an explicit formula from Lecture 5 (with weight
function w ≡ 1). This quadrature formula is exact for all polynomials of degree ν − 1 and, provided
that \prod_{k=1}^{ν} (x − c_k) is orthogonal to all polynomials of degree ν − 1 w.r.t. the weight function
w(x) ≡ 1, 0 ≤ x ≤ 1, the formula is exact for all polynomials of degree 2ν − 1.
Suppose that we wish to solve the ‘ODE’ y′ = f(t), y(0) = y_0. The exact solution is y(t_{n+1}) =
y(t_n) + \int_{t_n}^{t_{n+1}} f(t) dt and we can approximate it by quadrature. In general, we obtain the time-stepping
scheme

y_{n+1} = y_n + h \sum_{l=1}^{ν} b_l f(t_n + c_l h), \qquad n = 0, 1, . . . .

Here h = tn+1 − tn (the points tn need not be equispaced). Can we generalize this to genuine
ODEs of the form y ′ = f (t, y)?

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 9
Formally, y(t_{n+1}) = y(t_n) + \int_{t_n}^{t_{n+1}} f(t, y(t)) dt, and this can be ‘approximated’ by

y_{n+1} = y_n + h \sum_{l=1}^{ν} b_l f(t_n + c_l h, y(t_n + c_l h)), \qquad (4.11)

except that, of course, the vectors y(tn + cl h) are unknown! Runge–Kutta methods are a means
of implementing (4.11) by replacing unknown values of y by suitable linear combinations. The
general form of a ν-stage explicit Runge–Kutta method (RK) is

k_1 = f(t_n, y_n),
k_2 = f(t_n + c_2 h, y_n + h c_2 k_1),
k_3 = f(t_n + c_3 h, y_n + h (a_{3,1} k_1 + a_{3,2} k_2)), \qquad a_{3,1} + a_{3,2} = c_3,
⋮
k_ν = f\left( t_n + c_ν h, y_n + h \sum_{j=1}^{ν−1} a_{ν,j} k_j \right), \qquad \sum_{j=1}^{ν−1} a_{ν,j} = c_ν,

y_{n+1} = y_n + h \sum_{l=1}^{ν} b_l k_l.

The choice of the RK coefficients a_{l,j} is motivated in the first instance by order considerations.
Example Set ν = 2. We have k_1 = f(t_n, y_n) and, Taylor-expanding about (t_n, y_n),

k_2 = f(t_n + c_2 h, y_n + c_2 h f(t_n, y_n)) = f(t_n, y_n) + h c_2 \left[ \frac{∂f(t_n, y_n)}{∂t} + \frac{∂f(t_n, y_n)}{∂y} f(t_n, y_n) \right] + O(h^2).

But

y′ = f(t, y) \qquad ⇒ \qquad y′′ = \frac{∂f(t, y)}{∂t} + \frac{∂f(t, y)}{∂y} f(t, y).

Therefore, substituting the exact solution y_n = y(t_n), we obtain k_1 = y′(t_n) and k_2 = y′(t_n) +
h c_2 y′′(t_n) + O(h^2). Consequently, the local error is

y(t_{n+1}) − y_{n+1} = [y(t_n) + h y′(t_n) + \frac{1}{2} h^2 y′′(t_n) + O(h^3)] − [y(t_n) + h (b_1 + b_2) y′(t_n) + h^2 b_2 c_2 y′′(t_n) + O(h^3)].
We deduce that the RK method is of order 2 if b_1 + b_2 = 1 and b_2 c_2 = \frac{1}{2}. It is easy to demonstrate
that no such method may be of order ≥ 3 (e.g. by applying it to y′ = λy).
General RK methods A general ν-stage Runge–Kutta method is

k_l = f\left( t_n + c_l h, y_n + h \sum_{j=1}^{ν} a_{l,j} k_j \right), \qquad where \qquad \sum_{j=1}^{ν} a_{l,j} = c_l, \qquad l = 1, 2, . . . , ν,

y_{n+1} = y_n + h \sum_{l=1}^{ν} b_l k_l.

Obviously, a_{l,j} = 0 for all l ≤ j yields the standard explicit RK. Otherwise, an RK method is said
to be implicit.

Example Consider the 2-stage method

k_1 = f(t_n, y_n + \frac{1}{4} h (k_1 − k_2)), \qquad k_2 = f(t_n + \frac{2}{3} h, y_n + \frac{1}{12} h (3 k_1 + 5 k_2)),
y_{n+1} = y_n + \frac{1}{4} h (k_1 + 3 k_2).

In order to analyse the order of this method, we restrict our attention to scalar, autonomous
equations of the form y′ = f(y). (This procedure might lead to loss of generality for methods of
order ≥ 5.) For brevity, we use the convention that all functions are evaluated at y = y_n, e.g.
f_y = df(y_n)/dy. Thus,

k_1 = f + \frac{1}{4} h f_y (k_1 − k_2) + \frac{1}{32} h^2 f_{yy} (k_1 − k_2)^2 + O(h^3),
k_2 = f + \frac{1}{12} h f_y (3 k_1 + 5 k_2) + \frac{1}{288} h^2 f_{yy} (3 k_1 + 5 k_2)^2 + O(h^3).

We have k_1, k_2 = f + O(h) and substitution in the above equations yields k_1 = f + O(h^2),
k_2 = f + \frac{2}{3} h f_y f + O(h^2). Substituting again, we obtain

k_1 = f − \frac{1}{6} h^2 f_y^2 f + O(h^3),
k_2 = f + \frac{2}{3} h f_y f + h^2 \left( \frac{5}{18} f_y^2 f + \frac{2}{9} f_{yy} f^2 \right) + O(h^3)
⇒ \qquad y_{n+1} = y + h f + \frac{1}{2} h^2 f_y f + \frac{1}{6} h^3 (f_y^2 f + f_{yy} f^2) + O(h^4).

But y′ = f ⇒ y′′ = f_y f ⇒ y′′′ = f_y^2 f + f_{yy} f^2 and we deduce from Taylor's theorem that the
method is at least of order 3. (It is easy to verify that it isn't of order 4, for example by applying it
to the equation y′ = λy.)

4.4 Stiff equations

Linear stability Consider the linear system

y′ = A y \qquad where \qquad A = \begin{pmatrix} −100 & 1 \\ 0 & −\frac{1}{10} \end{pmatrix}.

The exact solution is a linear combination of e−t/10 and e−100t : the first decays gently, whereas the
second becomes practically zero almost at once. Suppose that we solve the ODE with the forward
Euler method. As will be shown soon, the requirement that limn→∞ y n = 0 (for fixed h > 0) leads
to an unacceptable restriction on the size of h.
With greater generality, let us solve y′ = Ay, for a general N × N constant matrix A, with Euler's
method. Then y_{n+1} = (I + hA) y_n, therefore y_n = (I + hA)^n y_0. Let the eigenvalues of A be
λ_1, . . . , λ_N, with corresponding linearly-independent eigenvectors v_1, v_2, . . . , v_N. Let D = diag(λ_1, . . . , λ_N)
and V = [v_1, v_2, . . . , v_N], whence A = V D V^{−1}. We assume further that Re λ_l < 0, l = 1, . . . , N.
In that case it is easy to prove that \lim_{t→∞} y(t) = 0, e.g. by representing the exact solution of the
ODE explicitly as y(t) = e^{tA} y_0, where e^{tA} = \sum_{k=0}^{∞} \frac{1}{k!} t^k A^k = V e^{tD} V^{−1}. However, y_n = V (I +
hD)^n V^{−1} y_0, where A = V D V^{−1} and the matrix D is diagonal, therefore \lim_{n→∞} y_n = 0 for all
initial values y_0 iff |1 + hλ_l| < 1, l = 1, . . . , N. In our example we thus require |1 − \frac{1}{10} h| < 1 and
|1 − 100h| < 1, hence h < \frac{1}{50}.
This restriction, necessary for the recovery of the correct asymptotic behaviour, has nothing to do with local
accuracy, since, for large n, the genuinely ‘unstable’ component is exceedingly small. Its purpose is
solely to prevent this component from leading to an unbounded growth in the numerical solution.
Stiffness We say that the ODE y′ = f(t, y) is stiff if (for some methods) we need to depress h
to maintain stability well beyond the requirements of accuracy. An important example of stiff systems
occurs when an equation is linear, Re λ_l < 0, l = 1, 2, . . . , N, and the quotient max |λ_k|/min |λ_k|
is large: a ratio of 10^{20} is not unusual in real-life problems!
Stiff equations, mostly nonlinear, occur throughout applications, whenever we have two (or more)
different timescales in the ODE. A typical example is provided by the equations of chemical kinetics, where each
timescale is determined by the speed of reaction between two compounds: such speeds can differ
by many orders of magnitude.

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 10
Definition Suppose that a numerical method, applied to y′ = λy, y(0) = 1, with constant h,
produces the solution sequence {y_n}_{n∈Z^+}. We call the set

D = \{ hλ ∈ C : \lim_{n→∞} y_n = 0 \}

the linear stability domain of the method. Noting that the set of λ ∈ C for which y(t) → 0 as t → ∞ is
the left half-plane C^− = \{z ∈ C : Re z < 0\}, we say that the method is A-stable if C^− ⊆ D.
Example We have already seen that for Euler's method y_n → 0 iff |1 + hλ| < 1, therefore
D = \{z ∈ C : |1 + z| < 1\}. Moreover, solving y′ = λy with the trapezoidal rule, we obtain
y_{n+1} = [(1 + \frac{1}{2} hλ)/(1 − \frac{1}{2} hλ)] y_n, thus, by induction, y_n = [(1 + \frac{1}{2} hλ)/(1 − \frac{1}{2} hλ)]^n y_0. Therefore

z ∈ D \qquad ⇔ \qquad \left| \frac{1 + \frac{1}{2} z}{1 − \frac{1}{2} z} \right| < 1 \qquad ⇔ \qquad Re z < 0

and we deduce that D = C^−. Hence, the method is A-stable.


It can be proved by similar means that for backward Euler it is true that D = {z ∈ C : |1−z| > 1},
hence that the method is also A-stable.
Note that A-stability does not mean that any step size will do! We need to choose h small enough
to ensure the right accuracy, but we don’t want to depress it much further to prevent instability.
Discussion A-stability analysis of multistep methods is considerably more complicated. However,
according to the second Dahlquist barrier, no multistep method of order p ≥ 3 may be A-stable.
Note that the p = 2 barrier for A-stability is attained by the trapezoidal rule.
The Dahlquist barrier implies that, in our quest for higher-order methods with good stability
properties, we need to pursue one of the following strategies:
• either relax the definition of A-stability
• or consider other methods in place of multistep.
The two courses of action will be considered next.
Stiffness and BDF methods Although no multistep method of order p ≥ 3 may be A-stable,
the stability properties of BDF methods, say, are satisfactory for most stiff equations. The point is that in
many stiff linear systems in applications the eigenvalues are not just in C− but also well away from
iR. [Analysis of nonlinear stiff equations is difficult and well outside the scope of this course.] All
BDF methods of order p ≤ 6 (i.e., all convergent BDF methods) share the feature that the linear
stability domain D includes a wedge about (−∞, 0): such methods are said to be A0 -stable.
Stiffness and Runge–Kutta Unlike multistep methods, implicit high-order RK may be A-stable.
For example, recall the 3rd-order method
k_1 = f(t_n, y_n + \frac{1}{4} h (k_1 − k_2)), \qquad k_2 = f(t_n + \frac{2}{3} h, y_n + \frac{1}{12} h (3 k_1 + 5 k_2)),
y_{n+1} = y_n + \frac{1}{4} h (k_1 + 3 k_2)

from the last lecture. Applying it to y′ = λy, we have

h k_1 = hλ \left( y_n + \frac{1}{4} h k_1 − \frac{1}{4} h k_2 \right), \qquad h k_2 = hλ \left( y_n + \frac{1}{4} h k_1 + \frac{5}{12} h k_2 \right).
This is a linear system, whose solution is

\begin{pmatrix} h k_1 \\ h k_2 \end{pmatrix} = \begin{pmatrix} 1 − \frac{1}{4} hλ & \frac{1}{4} hλ \\ −\frac{1}{4} hλ & 1 − \frac{5}{12} hλ \end{pmatrix}^{−1} \begin{pmatrix} hλ y_n \\ hλ y_n \end{pmatrix} = \frac{hλ y_n}{1 − \frac{2}{3} hλ + \frac{1}{6} (hλ)^2} \begin{pmatrix} 1 − \frac{2}{3} hλ \\ 1 \end{pmatrix},

therefore

y_{n+1} = y_n + \frac{1}{4} h k_1 + \frac{3}{4} h k_2 = \frac{1 + \frac{1}{3} hλ}{1 − \frac{2}{3} hλ + \frac{1}{6} h^2 λ^2} y_n.

Let

r(z) = \frac{1 + \frac{1}{3} z}{1 − \frac{2}{3} z + \frac{1}{6} z^2}.
Then y_{n+1} = r(hλ) y_n, therefore, by induction, y_n = [r(hλ)]^n y_0 and we deduce that

D = \{z ∈ C : |r(z)| < 1\}.

We wish to prove that |r(z)| < 1 for every z ∈ C^−, since this is equivalent to A-stability. This will
be done by a technique that can be applied to other RK methods. According to the maximum
modulus principle from Complex Methods, if g is analytic in the closed complex domain V then |g|
attains its maximum on ∂V. We let g = r. This is a rational function, hence its only singularities
are the poles 2 ± i\sqrt{2}, and g is analytic in V = cl C^− = \{z ∈ C : Re z ≤ 0\}. Therefore it attains its
maximum on ∂V = iR and

A-stability \qquad ⇔ \qquad |r(z)| < 1, z ∈ C^− \qquad ⇔ \qquad |r(it)| ≤ 1, t ∈ R.

In turn,

|r(it)| ≤ 1 \qquad ⇔ \qquad |1 − \frac{2}{3} it − \frac{1}{6} t^2|^2 − |1 + \frac{1}{3} it|^2 ≥ 0.

But |1 − \frac{2}{3} it − \frac{1}{6} t^2|^2 − |1 + \frac{1}{3} it|^2 = \frac{1}{36} t^4 ≥ 0 and it follows that the method is A-stable.
Example It is possible to prove that the 2-stage Gauss–Legendre method

k_1 = f(t_n + (\frac{1}{2} − \frac{\sqrt{3}}{6}) h, y_n + \frac{1}{4} h k_1 + (\frac{1}{4} − \frac{\sqrt{3}}{6}) h k_2),
k_2 = f(t_n + (\frac{1}{2} + \frac{\sqrt{3}}{6}) h, y_n + (\frac{1}{4} + \frac{\sqrt{3}}{6}) h k_1 + \frac{1}{4} h k_2),
y_{n+1} = y_n + \frac{1}{2} h (k_1 + k_2)

is of order 4. [You can do this for y′ = f(y) by expansion, but it becomes messy for y′ = f(t, y).] It
can be easily verified that for y′ = λy we have y_n = [r(hλ)]^n y_0, where r(z) = (1 + \frac{1}{2} z + \frac{1}{12} z^2)/(1 −
\frac{1}{2} z + \frac{1}{12} z^2). Since the poles of r reside at 3 ± i\sqrt{3} and |r(it)| ≡ 1, we can again use the maximum
modulus principle to argue that D = C^− and the Gauss–Legendre method is A-stable.

4.5 Implementation of ODE methods

The step size h is not some preordained quantity: it is a parameter of the method (in reality, many
parameters, since we may vary it from step to step). The basic input of a well-written computer
package for ODEs is not the step size but the error tolerance: the level of precision, as required
by the user. The choice of h > 0 is an important tool at our disposal to keep a local estimate of
the error beneath the required tolerance in the solution interval. In other words, we need not just
a time-stepping algorithm, but also mechanisms for error control and for amending the step size.

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 11
The Milne device Suppose that we wish to monitor the error of the trapezoidal rule

y_{n+1} = y_n + \frac{1}{2} h [f(y_n) + f(y_{n+1})]. \qquad (4.12)

We already know that the order is 2. Moreover, substituting the true solution we deduce that
y(t_{n+1}) − \{y(t_n) + \frac{1}{2} h [y′(t_n) + y′(t_{n+1})]\} = −\frac{1}{12} h^3 y′′′(t_n) + O(h^4).

Therefore, the error in each step is increased roughly by −\frac{1}{12} h^3 y′′′(t_n). The number c_{TR} = −\frac{1}{12} is
called the error constant of TR. To estimate the error in a single step we assume that y_n = y(t_n)
and subtract y(t_{n+1}) = y(t_n) + \frac{1}{2} h [y′(t_n) + y′(t_{n+1})] − \frac{1}{12} h^3 y′′′(t_n) + O(h^4) from the numerical
method: this yields y_{n+1} − y(t_{n+1}) = −c_{TR} h^3 y′′′(t_n) + O(h^4). Similarly, each multistep method
(but not RK!) has its own error constant. For example, the 2nd-order 2-step Adams–Bashforth
method

y_{n+1} − y_n = \frac{1}{2} h [3 f(t_n, y_n) − f(t_{n−1}, y_{n−1})], \qquad (4.13)

has the error constant c_{AB} = \frac{5}{12}.

The idea behind the Milne device is to use two multistep methods of the same order, one explicit
and the second implicit (e.g., (4.13) and (4.12), respectively), to estimate the local error of the
implicit method. For example, locally,

y^{AB}_{n+1} ≈ y(t_{n+1}) − c_{AB} h^3 y′′′(t_n) = y(t_{n+1}) − \frac{5}{12} h^3 y′′′(t_n),
y^{TR}_{n+1} ≈ y(t_{n+1}) − c_{TR} h^3 y′′′(t_n) = y(t_{n+1}) + \frac{1}{12} h^3 y′′′(t_n).

Subtracting, we obtain the estimate

h^3 y′′′(t_n) ≈ −2 (y^{AB}_{n+1} − y^{TR}_{n+1}),

therefore

y^{TR}_{n+1} − y(t_{n+1}) ≈ −\frac{1}{6} (y^{AB}_{n+1} − y^{TR}_{n+1})

and we use the right hand side as an estimate of the local error.
Note that TR is a far better method than AB: it is A-stable, hence its global behaviour is superior.
We employ AB solely to estimate the local error. This adds very little to the overall cost of TR,
since AB is an explicit method.
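A minimal Python sketch of the device on the scalar test equation y′ = λy, for which the implicit TR step can be written down explicitly (NumPy assumed; all names illustrative):

import numpy as np

lam, h = -1.0, 0.1
f = lambda y: lam * y

y_prev, y_curr = 1.0, np.exp(lam * h)        # y_0 and y_1 (start AB2 with the exact value)
for n in range(1, 11):
    y_ab = y_curr + 0.5 * h * (3*f(y_curr) - f(y_prev))          # AB2 predictor (4.13)
    y_tr = (1 + 0.5*h*lam) / (1 - 0.5*h*lam) * y_curr            # trapezoidal rule (4.12)
    est = -(y_ab - y_tr) / 6.0                                   # Milne estimate of the TR local error
    true = y_tr - y_curr * np.exp(lam * h)                       # actual local error (y_n taken as exact)
    print(n, est, true)                                          # the two are close
    y_prev, y_curr = y_curr, y_tr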
Implementation of the Milne device We work with a pair of multistep methods of the same
order, one explicit (predictor) and the other implicit (corrector), e.g.
Predictor: y_{n+2} = y_{n+1} + h \left[ \frac{5}{12} f(t_{n−1}, y_{n−1}) − \frac{4}{3} f(t_n, y_n) + \frac{23}{12} f(t_{n+1}, y_{n+1}) \right],
Corrector: y_{n+2} = y_{n+1} + h \left[ −\frac{1}{12} f(t_n, y_n) + \frac{2}{3} f(t_{n+1}, y_{n+1}) + \frac{5}{12} f(t_{n+2}, y_{n+2}) \right],

the third-order Adams–Bashforth and Adams–Moulton methods respectively.


The predictor is employed not just to estimate the error of the corrector, but also to provide an
initial guess in the solution of the implicit corrector equations. Typically, for nonstiff equations, we
iterate correction equations at most twice, while stiff equations require iteration to convergence,
otherwise the typically superior stability features of the corrector are lost.
Let TOL > 0 be a user-specified tolerance: the maximal error allowed in approximating the ODE.
Having completed a single step and estimated the error, there are three possibilities:

(a) \frac{1}{10} TOL ≤ ‖error‖ ≤ TOL, say: Accept the step, continue to t_{n+2} with the same step size;
(b) ‖error‖ < \frac{1}{10} TOL, say: Accept the step and increase the step length;
(c) ‖error‖ > TOL: Reject the step, recommence integration from t_n with smaller h.
Amending step size can be done easily with polynomial interpolation, although this means that
we need to store past values well in excess of what is necessary for simple implementation of both
multistep methods.
Error estimation per unit step Let e be our estimate of local error. Then e/h is our estimate
for the global error in an interval of unit length. It is usual to require the latter quantity not to
exceed TOL since good implementations of numerical ODEs should monitor the accumulation of
global error. This is called error estimation per unit step.
Embedded Runge–Kutta methods The situation is more complicated with RK, since no single
error constant determines local growth of the error. The approach of embedded RK requires, again,
two (typically explicit) methods: an RK method of ν stages and order p, say, and another method,
of ν + l stages, l ≥ 1, and order p + 1, such that the first ν stages of both methods are identical.
(This means that the cost of implementing the higher-order method is marginal, once we have
computed the lower-order approximation.) For example, consider (and verify!)

k_1 = f(t_n, y_n),
k_2 = f(t_n + \frac{1}{2} h, y_n + \frac{1}{2} h k_1),
y^{[1]}_{n+1} = y_n + h k_2 \qquad ⟹ \qquad order 2,
k_3 = f(t_n + h, y_n − h k_1 + 2 h k_2),
y^{[2]}_{n+1} = y_n + \frac{1}{6} h (k_1 + 4 k_2 + k_3) \qquad ⟹ \qquad order 3.
We thus estimate y^{[1]}_{n+1} − y(t_{n+1}) ≈ y^{[1]}_{n+1} − y^{[2]}_{n+1}. [It might look paradoxical, at least at first glance,
but the only purpose of the higher-order method is to provide error control for the lower-order one!]
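A minimal Python sketch of this particular embedded pair on a scalar test problem (NumPy assumed; names illustrative):

import numpy as np

def embedded_rk_step(f, t, y, h):
    # One step of the pair above: returns the order-2 solution and an error estimate for it.
    k1 = f(t, y)
    k2 = f(t + 0.5*h, y + 0.5*h*k1)
    k3 = f(t + h, y - h*k1 + 2*h*k2)
    y1 = y + h*k2                            # order 2
    y2 = y + h*(k1 + 4*k2 + k3) / 6.0        # order 3
    return y1, y1 - y2                       # error estimate for the order-2 result

f = lambda t, y: -y
y1, est = embedded_rk_step(f, 0.0, 1.0, 0.1)
true = y1 - np.exp(-0.1)                     # actual error of the order-2 step
print(est, true)                             # the estimate tracks the true local error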
The Zadunaisky device Suppose that the ODE y ′ = f (t, y), y(0) = y 0 , is solved by an ar-
bitrary numerical method of order p and that we have stored (not necessarily equidistant) past
solution values y n , y n−1 , . . . , y n−p . We form an interpolating pth degree polynomial (with vector
coefficients) d such that d(tn−i ) = y n−i , i = 0, 1, . . . , p, and consider the differential equation

z ′ = f (t, z) + d′ (t) − f (t, d), z(tn ) = y n . (4.14)

There are two important observations with regard to (4.14):

(1) Since d(t) − y(t) = O(h^{p+1}), the term d′(t) − f(t, d) is usually small (because y′(t) −
f(t, y(t)) ≡ 0). Therefore, (4.14) is a small perturbation of the original ODE.
Now, having produced y n+1 with our numerical method, we proceed to evaluate z n+1 as well,
using exactly the same method and implementation details. We then evaluate the error in z n+1 ,
namely z n+1 − d(tn+1 ), and use it as an estimate of the error in y n+1 .
Solving nonlinear algebraic systems We have already observed that the implementation of an
implicit ODE method, whether multistep or RK, requires the solution of (in general, nonlinear)
algebraic equations in each step. For example, for an s-step method, we need to solve in each step
the algebraic system
y n+s = σs hf (tn+s , y n+s ) + v, (4.15)
where the vector v can be formed from past (hence known) solution values and their derivatives.
The easiest approach is functional iteration,

y^{[j+1]}_{n+s} = σ_s h f(t_{n+s}, y^{[j]}_{n+s}) + v, \qquad j = 0, 1, . . . ,

where y^{[0]}_{n+s} is typically provided by the predictor scheme. It is very effective for nonstiff equations
but fails for stiff ODEs, since the convergence of this iterative scheme requires a similar restriction
on h to the one we strive to avoid by choosing an implicit method in the first place!

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 12
If the ODE is stiff, we might prefer a Newton–Raphson method, namely
$$
y^{[j+1]}_{n+s} = y^{[j]}_{n+s} - \left[ I - \sigma_s h\, \frac{\partial f(t_{n+s}, y^{[j]}_{n+s})}{\partial y} \right]^{-1} \left[ y^{[j]}_{n+s} - \sigma_s h f\bigl(t_{n+s}, y^{[j]}_{n+s}\bigr) - v \right].
$$
The justification of the above is as follows: suppose that $y^{[j]}_{n+s}$ is an approximation to the solution. We linearise (4.15) locally about $(t_{n+s}, y^{[j]}_{n+s})$,
$$
y_{n+s} - \sigma_s h f(t_{n+s}, y_{n+s}) - v \approx \left[ y^{[j]}_{n+s} - \sigma_s h f\bigl(t_{n+s}, y^{[j]}_{n+s}\bigr) - v \right] + \left[ I - \sigma_s h\, \frac{\partial f(t_{n+s}, y^{[j]}_{n+s})}{\partial y} \right] \bigl(y_{n+s} - y^{[j]}_{n+s}\bigr),
$$
and choose $y^{[j+1]}_{n+s}$ by equating the right-hand side to 0.
The snag is that repeatedly evaluating and inverting (i.e. LU-factorizing) the Jacobian matrix in every iteration is very expensive. The remedy is to implement the modified Newton–Raphson method, namely
$$
y^{[j+1]}_{n+s} = y^{[j]}_{n+s} - \left[ I - \sigma_s h\, \frac{\partial f(t_{n+s}, y^{[0]}_{n+s})}{\partial y} \right]^{-1} \left[ y^{[j]}_{n+s} - \sigma_s h f\bigl(t_{n+s}, y^{[j]}_{n+s}\bigr) - v \right]. \qquad (4.16)
$$

Thus, the Jacobian need be evaluated only once per step.


The only role the Jacobian matrix plays in (4.16) is to ensure convergence: its precise value makes no difference to the ultimate value of $\lim_{j\to\infty} y^{[j]}_{n+s}$. Therefore we might replace it with a finite-difference approximation, evaluate it once every several steps, etc. An important observation for future use: the implementation of (4.16) requires the repeated solution of linear algebraic systems with the same matrix. We will soon study the LU factorization of matrices, where this remark will be appreciated and lead to substantial savings. For stiff equations it is much cheaper to solve the nonlinear algebraic equations with (4.16) than to use a minute step size with a 'bad' (e.g., explicit multistep or explicit RK) method.
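A minimal Python sketch of iteration (4.16) for the simplest implicit case, backward Euler (s = 1, σ_s = 1, v = y_n); this is an illustration, not part of the notes: the stiff test problem, the stopping rule and the use of scipy's LU routines to factorize the frozen Jacobian once per step are all assumptions.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def backward_euler_step(f, jac, t_new, y_old, h, tol=1e-12, maxit=20):
    """Solve y_new = y_old + h f(t_new, y_new) by modified Newton (4.16):
    the Jacobian is evaluated and LU-factorized only once per step."""
    y = y_old.copy()                                  # y^{[0]}
    J = np.eye(len(y)) - h * jac(t_new, y)            # I - sigma_s h df/dy at y^{[0]}
    lu, piv = lu_factor(J)                            # factorize once ...
    for _ in range(maxit):
        residual = y - h * f(t_new, y) - y_old        # y - sigma_s h f - v
        dy = lu_solve((lu, piv), residual)            # ... and reuse it in every iteration
        y = y - dy
        if np.linalg.norm(dy) < tol:
            break
    return y

# stiff linear test problem y' = A y
A = np.array([[-1000.0, 1.0], [0.0, -1.0]])
f = lambda t, y: A @ y
jac = lambda t, y: A
print(backward_euler_step(f, jac, 0.1, np.array([1.0, 1.0]), 0.1))
```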

5 Numerical linear algebra

5.1 LU factorization and its generalizations

Let A be a real n × n matrix. We say that the n × n matrices L and U are an LU factorization of
A if (1) L is lower triangular (i.e., Li,j = 0, i < j); (2) U is upper triangular, Ui,j = 0, i > j; and
(3) A = LU. Therefore the factorization expresses A as the product of a lower triangular matrix and an upper triangular matrix.
Application 1 Calculation of a determinant: $\det A = (\det L)(\det U) = \bigl(\prod_{k=1}^{n} L_{k,k}\bigr)\bigl(\prod_{k=1}^{n} U_{k,k}\bigr)$.
Application 2 Testing for nonsingularity: A = LU is nonsingular iff all the diagonal elements of
L and U are nonzero.
Application 3 Solution of linear systems: Let A = LU and suppose we wish to solve Ax = b.
This is the same as L(U x) = b, which we decompose into Ly = b, U x = y. Both latter systems are
triangular and can be calculated easily. Thus, L1,1 y1 = b1 gives y1 , next L2,1 y1 + L2,2 y2 = b2 yields
y2 etc. Having found y, we solve for x in reverse order: Un,n xn = yn gives xn , Un−1,n−1 xn−1 +
Un−1,n xn = yn−1 produces xn−1 and so on. This requires O(n2 ) computational operations (usually
we only bother to count multiplications/divisions).
Application 4 The inverse of A: It is straightforward to devise a direct way of calculating the
inverse of triangular matrices, subsequently forming A−1 = U −1 L−1 .
Why not Cramer’s rule? For the uninitiated, a recursive definition of a determinant may seem
to be a good method for its calculation (and perhaps even for the solution of linear systems with
Cramer’s rule). Unfortunately, the number of operations increases like n!. Thus, on a computer performing 10^9 flops per second,
n = 10 ⇒ 10^{-4} sec, n = 20 ⇒ 17 min, n = 30 ⇒ 4 × 10^5 years.

The calculation of LU factorization We denote the columns of L by l_1, l_2, ..., l_n and the rows of U by u_1^T, u_2^T, ..., u_n^T. Hence
$$
A = LU = [\, l_1 \;\; l_2 \;\; \cdots \;\; l_n \,] \begin{bmatrix} u_1^\top \\ u_2^\top \\ \vdots \\ u_n^\top \end{bmatrix} = \sum_{k=1}^{n} l_k u_k^\top. \qquad (5.1)
$$

Since the first k − 1 components of l_k and u_k are all zero, each rank-one matrix l_k u_k^T has zeros in its first k − 1 rows and columns.
Assume that the factorization exists (hence the diagonal elements of L are nonzero) and that A is nonsingular. Since l_k u_k^T stays the same if we replace l_k → α l_k, u_k → α^{-1} u_k, where α ≠ 0, we may assume w.l.o.g. that all the diagonal elements of L equal one. In other words, the kth row of l_k u_k^T is u_k^T and its kth column is U_{k,k} times l_k.
We begin our calculation by extracting l_1 and u_1^T from A, and then proceed similarly to extract l_2 and u_2^T, etc.
First we note that, since the leading k − 1 elements of l_k and u_k are zero for k ≥ 2, it follows from (5.1) that u_1^T is the first row of A and l_1 is the first column of A, divided by A_{1,1} (so that L_{1,1} = 1).
Next, having found l_1 and u_1, we form the matrix $A_1 = A - l_1 u_1^\top = \sum_{k=2}^{n} l_k u_k^\top$. The first row and column of A_1 are zero and it follows that u_2^T is the second row of A_1, while l_2 is its second column, scaled so that L_{2,2} = 1.
The LU algorithm: Set A_0 := A. For all k = 1, 2, ..., n set u_k^T to the kth row of A_{k−1} and l_k to the kth column of A_{k−1}, scaled so that L_{k,k} = 1. Further, calculate A_k := A_{k−1} − l_k u_k^T before incrementing k.
Note that all elements in the first k rows & columns of Ak are zero. Hence, we can use the storage
of the original A to accumulate L and U . The full LU factorization requires O(n3 ) computational
operations.
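A direct transcription of the LU algorithm above into Python (a sketch for illustration, assuming no pivoting is needed, i.e. all the pivots (A_{k−1})_{k,k} are nonzero):

```python
import numpy as np

def lu_factorize(A):
    """LU factorization by successive rank-one updates A_k = A_{k-1} - l_k u_k^T.
    Assumes all pivots are nonzero (no pivoting)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.zeros((n, n))
    U = np.zeros((n, n))
    for k in range(n):
        U[k, :] = A[k, :]                 # u_k^T: the kth row of A_{k-1}
        L[:, k] = A[:, k] / A[k, k]       # l_k: the kth column, scaled so L_{k,k} = 1
        A = A - np.outer(L[:, k], U[k, :])
    return L, U

A = np.array([[2.0, 4.0], [4.0, 11.0]])
L, U = lu_factorize(A)
print(L, U, np.allclose(L @ U, A))
```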
Relation to Gaussian elimination The equation Ak = Ak−1 − lk u⊤ k has the property that the
jth row of Ak is the jth row of Ak−1 minus Lj,k times u⊤ k (the kth row of Ak−1 ). Moreover, the
multipliers Lk,k , Lk+1,k , . . . , Ln,k are chosen so that the outcome of this elementary row operation
is that the kth column of Ak is zero. This construction is analogous to Gaussian elimination for
solving Ax = b. An important difference is that in LU factorization we do not consider the right-hand side b until the factorization is complete. This is useful, e.g., when there are many right-hand sides, in particular if not all the b's are known at the outset: in Gaussian elimination the solution for each new b would require O(n^3) computational operations, whereas with LU factorization O(n^3) operations are required for the initial factorization, but then the solution for each new b requires only O(n^2) operations.

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 13
Pivoting Naive LU factorization fails when, for example, A1,1 = 0. The remedy is to exchange
rows of A, a technique called column pivoting (or just pivoting). This is equivalent to picking a
suitable equation for eliminating the first unknown in Gaussian elimination. Specifically, column
pivoting means that, having obtained Ak−1 , we exchange two rows of Ak−1 so that the element of
largest magnitude in the kth column is in the ‘pivotal position’ (k, k). In other words,

|(Ak−1 )k,k | = max{|(Ak−1 )j,k | : j = 1, 2, . . . , n}.

Of course, the same exchange is required in the portion of L that has been formed already (i.e.,
the first k − 1 columns). Also, we need to record the permutation of rows to solve for the right
hand side and/or to compute the determinant. (The exchange of rows can be regarded as the
pre-multiplication of the relevant matrix by a permutation matrix.)
Column pivoting copes with zeros at the pivot position, except when the entire kth column of A_{k−1} is zero: in that case we let l_k be the kth unit vector while, as before, choosing u_k^T as the kth row of A_{k−1}. This choice preserves the condition that the matrix l_k u_k^T has the same kth row and column as A_{k−1}. Thus A_k := A_{k−1} − l_k u_k^T still has zeros in its kth row and column, as required.

An important advantage of column pivoting is that |L_{i,j}| ≤ 1 for all i, j = 1, ..., n. This avoids division by zero and tends to reduce the chance of large numbers occurring during the factorization, a phenomenon that might lead to ill conditioning and to accumulation of roundoff error.
In row pivoting one exchanges columns of Ak−1 , rather than rows (sic!), whereas total pivoting
corresponds to exchange of both rows and columns, so that the modulus of the pivotal element
(Ak−1 )k,k is maximised.
Symmetric matrices Let A be an n × n symmetric matrix (i.e., Ak,ℓ = Aℓ,k ). An analogue of LU
factorization takes advantage of symmetry: we express A in the form of the product LDL⊤ , where
L is n × n lower triangular, with ones on its diagonal, whereas D is a diagonal matrix. Subject to
its existence, we can write this factorization as
$$
A = [\, l_1 \;\; l_2 \;\; \cdots \;\; l_n \,] \begin{bmatrix} D_{1,1} & 0 & \cdots & 0 \\ 0 & D_{2,2} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & D_{n,n} \end{bmatrix} \begin{bmatrix} l_1^\top \\ l_2^\top \\ \vdots \\ l_n^\top \end{bmatrix} = \sum_{k=1}^{n} D_{k,k}\, l_k l_k^\top,
$$
where, as before, l_k is the kth column of L. The analogy with the LU algorithm becomes obvious by letting U = DL^T, but the present form lends itself better to the exploitation of symmetry and requires roughly half the storage of conventional LU. Specifically, to compute this factorization, we let A_0 = A and for k = 1, 2, ..., n let l_k be the multiple of the kth column of A_{k−1} such that L_{k,k} = 1. Set D_{k,k} = (A_{k−1})_{k,k} and form A_k = A_{k−1} − D_{k,k} l_k l_k^T.
   
Example Let
$$
A = A_0 = \begin{bmatrix} 2 & 4 \\ 4 & 11 \end{bmatrix}. \quad \text{Hence } l_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \; D_{1,1} = 2 \text{ and }
A_1 = A_0 - D_{1,1} l_1 l_1^\top = \begin{bmatrix} 2 & 4 \\ 4 & 11 \end{bmatrix} - 2 \begin{bmatrix} 1 \\ 2 \end{bmatrix} \begin{bmatrix} 1 & 2 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 3 \end{bmatrix}.
$$
We deduce that $l_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$, $D_{2,2} = 3$ and
$$
A = \begin{bmatrix} 1 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}.
$$
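The same example can be reproduced with a short Python sketch of the symmetric LDL^T algorithm described above (an illustration, assuming no pivoting is needed, i.e. the pivots (A_{k−1})_{k,k} are nonzero):

```python
import numpy as np

def ldlt(A):
    """LDL^T factorization by rank-one updates A_k = A_{k-1} - D_{k,k} l_k l_k^T."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.zeros((n, n))
    d = np.zeros(n)
    for k in range(n):
        d[k] = A[k, k]                        # D_{k,k} = (A_{k-1})_{k,k}
        L[:, k] = A[:, k] / d[k]              # l_k, scaled so that L_{k,k} = 1
        A = A - d[k] * np.outer(L[:, k], L[:, k])
    return L, d

A = np.array([[2.0, 4.0], [4.0, 11.0]])
L, d = ldlt(A)
print(L)              # [[1, 0], [2, 1]]
print(d)              # [2, 3]
print(np.allclose(L @ np.diag(d) @ L.T, A))
```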
Symmetric positive definite matrices Recall: A is positive definite if x⊤ Ax > 0 for all x 6= 0.
Theorem Let A be a real n × n symmetric matrix. It is positive definite if and only if it has an
LDL⊤ factorization in which the diagonal elements of D are all positive.
Proof. Suppose that A = LDL^T and let x ∈ R^n \ {0}. Since L is nonsingular, y := L^T x ≠ 0. Then $x^\top A x = y^\top D y = \sum_{k=1}^{n} D_{k,k} y_k^2 > 0$, hence A is positive definite.
Conversely, suppose that A is positive definite. We wish to demonstrate that an LDL^T factorization exists. We denote by e_k ∈ R^n the kth unit vector. Hence $e_1^\top A e_1 = A_{1,1} > 0$ and l_1 and D_{1,1} are well defined. We now show that $(A_{k-1})_{k,k} > 0$ for k = 1, 2, .... This is true for k = 1 and we continue by induction, assuming that $A_{k-1} = A - \sum_{j=1}^{k-1} D_{j,j} l_j l_j^\top$ has been computed successfully. Define x ∈ R^n as follows: the bottom n − k components are zero, x_k = 1 and x_1, x_2, ..., x_{k−1} are calculated in reverse order, each x_j being chosen so that $l_j^\top x = 0$ for j = k − 1, k − 2, ..., 1. In other words, since $0 = l_j^\top x = \sum_{i=1}^{n} L_{i,j} x_i = \sum_{i=j}^{k} L_{i,j} x_i$, we let $x_j = -\sum_{i=j+1}^{k} L_{i,j} x_i$, j = k − 1, k − 2, ..., 1.
Since the first k − 1 rows and columns of A_{k−1} vanish, our choice implies that $(A_{k-1})_{k,k} = x^\top A_{k-1} x$. Thus, from the definition of A_{k−1} and the choice of x,
$$
(A_{k-1})_{k,k} = x^\top A_{k-1} x = x^\top \Bigl( A - \sum_{j=1}^{k-1} D_{j,j}\, l_j l_j^\top \Bigr) x = x^\top A x - \sum_{j=1}^{k-1} D_{j,j} (l_j^\top x)^2 = x^\top A x > 0,
$$
as required. Hence $(A_{k-1})_{k,k} > 0$, k = 1, 2, ..., n, and the factorization exists. 2
Conclusion It is possible to check if a symmetric matrix is positive definite by trying to form its
LDL⊤ factorization.
Cholesky factorization Define D^{1/2} as the diagonal matrix whose (k, k) element is $D_{k,k}^{1/2}$, hence D^{1/2} D^{1/2} = D. Then, A being positive definite, we can write
$$
A = (LD^{1/2})(D^{1/2} L^\top) = (LD^{1/2})(LD^{1/2})^\top.
$$
In other words, letting L̃ := LD^{1/2}, we obtain the Cholesky factorization A = L̃ L̃^T.
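Continuing the small example and sketch above (an illustration, not part of the notes): once L and D are available, the Cholesky factor is obtained by scaling the columns of L, and for a positive definite matrix it agrees with numpy's built-in routine.

```python
import numpy as np

A = np.array([[2.0, 4.0], [4.0, 11.0]])
L = np.array([[1.0, 0.0], [2.0, 1.0]])    # from the LDL^T example above
d = np.array([2.0, 3.0])

L_tilde = L * np.sqrt(d)                  # scale column k by D_{k,k}^{1/2}
print(np.allclose(L_tilde @ L_tilde.T, A))            # Cholesky: A = L~ L~^T
print(np.allclose(L_tilde, np.linalg.cholesky(A)))    # matches numpy's factor
```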
Sparse matrices It is often required to solve very large systems Ax = b (n = 105 is considered
small in this context!) where nearly all the elements of A are zero. Such a matrix is called sparse
and efficient solution of Ax = b should exploit sparsity. In particular, we wish the matrices L
and U to inherit as much as possible of the sparsity of A and for the cost of computation to be
determined by the number of nonzero entries, rather than by n. The only tool at our disposal at
the moment is the freedom to exchange rows and columns to minimise fill-in.
Theorem Let A = LU be an LU factorization (without pivoting) of a sparse matrix. Then all
leading zeros in the rows of A to the left of the diagonal are inherited by L and all the leading
zeros in the columns of A above the diagonal are inherited by U .
Proof Follows from the second question on Examples’ Sheet 3. 2
This theorem suggests that if one requires a factorization of a sparse matrix then one might try to
reorder its rows and columns by a preliminary calculation so that many of the zero elements are
leading zero elements in rows and columns. This will reduce the fill-in.
Example 1 The LU factorization
$$
\begin{bmatrix} -3 & 1 & 1 & 2 & 0 \\ 1 & -3 & 0 & 0 & 1 \\ 1 & 0 & 2 & 0 & 0 \\ 2 & 0 & 0 & 3 & 0 \\ 0 & 1 & 0 & 0 & 3 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ -\tfrac13 & 1 & 0 & 0 & 0 \\ -\tfrac13 & -\tfrac18 & 1 & 0 & 0 \\ -\tfrac23 & -\tfrac14 & \tfrac{6}{19} & 1 & 0 \\ 0 & -\tfrac38 & \tfrac{1}{19} & \tfrac{4}{81} & 1 \end{bmatrix}
\begin{bmatrix} -3 & 1 & 1 & 2 & 0 \\ 0 & -\tfrac83 & \tfrac13 & \tfrac23 & 1 \\ 0 & 0 & \tfrac{19}{8} & \tfrac34 & \tfrac18 \\ 0 & 0 & 0 & \tfrac{81}{19} & \tfrac{4}{19} \\ 0 & 0 & 0 & 0 & \tfrac{272}{81} \end{bmatrix}
$$
has significant fill-in. However, reordering (symmetrically) rows and columns 1 ↔ 3, 2 ↔ 4 and
4 ↔ 5 yields
    
$$
\begin{bmatrix} 2 & 0 & 1 & 0 & 0 \\ 0 & 3 & 2 & 0 & 0 \\ 1 & 2 & -3 & 0 & 1 \\ 0 & 0 & 0 & 3 & 1 \\ 0 & 0 & 1 & 1 & -3 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ \tfrac12 & \tfrac23 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & -\tfrac{6}{29} & \tfrac13 & 1 \end{bmatrix}
\begin{bmatrix} 2 & 0 & 1 & 0 & 0 \\ 0 & 3 & 2 & 0 & 0 \\ 0 & 0 & -\tfrac{29}{6} & 0 & 1 \\ 0 & 0 & 0 & 3 & 1 \\ 0 & 0 & 0 & 0 & -\tfrac{272}{87} \end{bmatrix}.
$$

Example 2 If the nonzeros of A occur only on the diagonal, in one row and in one column, then
the full row and column should be placed at the bottom and on the right of A, respectively.

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 14
Banded matrices The matrix A is a banded matrix if there exists an integer r < n such that A_{i,j} = 0 for |i − j| > r, i, j = 1, 2, ..., n. In other words, all the nonzero elements of A reside in a band of width 2r + 1 along the main diagonal. In that case, according to the statement from the end of the last lecture, A = LU implies that L_{i,j} = U_{i,j} = 0 for all |i − j| > r, and the sparsity structure is inherited by the factorization.
In general, the expense of calculating an LU factorization of an n × n dense matrix A is O(n^3) operations and the expense of solving Ax = b, provided that the factorization is known, is O(n^2). However, in the case of a banded A, we need just O(r^2 n) operations to factorize and O(rn) operations to solve a linear system. If r ≪ n this represents a very substantial saving!
General sparse matrices feature in a wide range of applications, e.g. the solution of partial differential equations, and there exists a wealth of methods for their solution. One approach is efficient factorization that minimizes fill-in. Another is to use iterative methods (cf. the Part II Numerical Analysis course). There also exists a substantial body of other, highly effective methods, e.g. Fast Fourier Transforms, preconditioned conjugate gradients and multigrid techniques (cf. the Part II Numerical Analysis course), fast multipole techniques and much more.
Sparsity and graph theory An exceedingly powerful (and beautiful) methodology of ordering
pivots to minimize fill-in of sparse matrices uses graph theory and, like many other cool applications
of mathematics in numerical analysis, is alas not in the schedules :-(

5.2 QR factorization of matrices

Scalar products, norms and orthogonality We first recall a few definitions. Rn is the linear
space of all real n-tuples.

• For all u, v ∈ R^n we define the scalar product
$$
\langle u, v \rangle = \langle v, u \rangle = \sum_{j=1}^{n} u_j v_j = u^\top v = v^\top u.
$$

• If u, v, w ∈ R^n and α, β ∈ R then ⟨αu + βw, v⟩ = α⟨u, v⟩ + β⟨w, v⟩.
• The norm (a.k.a. the Euclidean length) of u ∈ R^n is $\|u\| = \bigl(\sum_{j=1}^{n} u_j^2\bigr)^{1/2} = \langle u, u \rangle^{1/2} \ge 0$.
• For u ∈ R^n, ‖u‖ = 0 iff u = 0.
• We say that u ∈ R^n and v ∈ R^n are orthogonal to each other if ⟨u, v⟩ = 0.
• The vectors q_1, q_2, ..., q_m ∈ R^n are orthonormal if
$$
\langle q_k, q_\ell \rangle = \begin{cases} 1, & k = \ell, \\ 0, & k \ne \ell, \end{cases} \qquad k, \ell = 1, 2, \ldots, m.
$$

• An n × n real matrix Q is orthogonal if all its columns are orthonormal. Since (Q^T Q)_{k,ℓ} = ⟨q_k, q_ℓ⟩, this implies that Q^T Q = I (I is the unit matrix). Hence Q^{-1} = Q^T and QQ^T = QQ^{-1} = I. We conclude that the rows of an orthogonal matrix are also orthonormal, and that Q^T is an orthogonal matrix. Further, 1 = det I = det(QQ^T) = det Q det Q^T = (det Q)^2, and thus we deduce that det Q = ±1, and that an orthogonal matrix is nonsingular.
Proposition If P, Q are orthogonal then so is P Q.
Proof. Since P ⊤ P = Q⊤ Q = I, we have (P Q)⊤ (P Q) = (Q⊤ P ⊤ )(P Q) = Q⊤ (P ⊤ P )Q = Q⊤ Q =
I, hence P Q is orthogonal. 2
Proposition Let q_1, q_2, ..., q_m ∈ R^n be orthonormal. Then m ≤ n.
Proof. We argue by contradiction. Suppose that m ≥ n + 1 and let Q be the orthogonal matrix whose columns are q_1, q_2, ..., q_n. Since Q is nonsingular and q_m ≠ 0, there exists a nonzero solution to the linear system Qa = q_m, hence $q_m = \sum_{j=1}^{n} a_j q_j$. But
$$
0 = \langle q_\ell, q_m \rangle = \Bigl\langle q_\ell, \sum_{j=1}^{n} a_j q_j \Bigr\rangle = \sum_{j=1}^{n} a_j \langle q_\ell, q_j \rangle = a_\ell, \qquad \ell = 1, 2, \ldots, n,
$$
hence a = 0, a contradiction. We deduce that m ≤ n. 2


Lemma Let q_1, q_2, ..., q_m ∈ R^n be orthonormal and m ≤ n − 1. Then there exists q_{m+1} ∈ R^n such that q_1, q_2, ..., q_{m+1} are orthonormal.
Proof. We construct q_{m+1}. Let Q be the n × m matrix whose columns are q_1, ..., q_m. Since
$$
\sum_{k=1}^{n} \sum_{j=1}^{m} Q_{k,j}^2 = \sum_{j=1}^{m} \|q_j\|^2 = m < n,
$$
it follows that there exists ℓ ∈ {1, 2, ..., n} such that $\sum_{j=1}^{m} Q_{\ell,j}^2 < 1$. We let $w = e_\ell - \sum_{j=1}^{m} \langle q_j, e_\ell \rangle q_j$. Then for i = 1, 2, ..., m
$$
\langle q_i, w \rangle = \langle q_i, e_\ell \rangle - \sum_{j=1}^{m} \langle q_j, e_\ell \rangle \langle q_i, q_j \rangle = 0,
$$
i.e. by design w is orthogonal to q_1, ..., q_m. Further, since Q_{ℓ,j} = ⟨q_j, e_ℓ⟩, we have
$$
\|w\|^2 = \langle w, w \rangle = \langle e_\ell, e_\ell \rangle - 2 \sum_{j=1}^{m} \langle q_j, e_\ell \rangle \langle e_\ell, q_j \rangle + \sum_{j=1}^{m} \sum_{k=1}^{m} \langle q_j, e_\ell \rangle \langle q_k, e_\ell \rangle \langle q_j, q_k \rangle = 1 - \sum_{j=1}^{m} Q_{\ell,j}^2 > 0.
$$
Thus we define q_{m+1} = w/‖w‖. 2


The QR factorization The QR factorization of an m × n matrix A has the form A = QR, where
Q is an m × m orthogonal matrix and R is an m × n upper triangular matrix (i.e., Ri,j = 0 for
i > j). We will demonstrate in the sequel that every matrix has a (non-unique) QR factorization.
We say that R is in standard form if, given that R_{k,j_k} is the first nonzero entry in the kth row, the j_k's form a strictly monotone sequence. (Such an R may also have entire rows of zeros, but only at the bottom.)
An application Let m = n and A be nonsingular. We can solve Ax = b by calculating the QR
factorization of A and solving first Qy = b (hence y = Q⊤ b) and then Rx = y (a triangular
system!).
Interpretation of the QR factorization Let m ≥ n and denote the columns of A and Q by a_1, a_2, ..., a_n and q_1, q_2, ..., q_m respectively. Since
$$
[\, a_1 \;\; a_2 \;\; \cdots \;\; a_n \,] = [\, q_1 \;\; q_2 \;\; \cdots \;\; q_m \,] \begin{bmatrix} R_{1,1} & R_{1,2} & \cdots & R_{1,n} \\ 0 & R_{2,2} & & \vdots \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & R_{n,n} \\ \vdots & & & \vdots \\ 0 & \cdots & \cdots & 0 \end{bmatrix},
$$
we have $a_k = \sum_{j=1}^{k} R_{j,k} q_j$, k = 1, 2, ..., n. In other words, Q has the property that each kth column of A can be expressed as a linear combination of the first k columns of Q.

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 15
The Gram–Schmidt algorithm Given an m × n matrix A ≠ O with the columns a_1, a_2, ..., a_n ∈ R^m, we construct Q and R, where Q is orthogonal, R is upper triangular and A = QR: in other words,
$$
\sum_{k=1}^{\ell} R_{k,\ell}\, q_k = a_\ell, \qquad \ell = 1, 2, \ldots, n, \qquad \text{where } A = [\, a_1 \;\; a_2 \;\; \cdots \;\; a_n \,]. \qquad (5.2)
$$

Assuming a_1 ≠ 0, we derive q_1 and R_{1,1} from the equation (5.2) for ℓ = 1. Since ‖q_1‖ = 1, we let q_1 = a_1/‖a_1‖, R_{1,1} = ‖a_1‖.
Next we form the vector b = a_2 − ⟨q_1, a_2⟩ q_1. It is orthogonal to q_1, since ⟨q_1, a_2 − ⟨q_1, a_2⟩ q_1⟩ = ⟨q_1, a_2⟩ − ⟨q_1, a_2⟩⟨q_1, q_1⟩ = 0. If b ≠ 0, we set q_2 = b/‖b‖, hence q_1 and q_2 are orthonormal. Moreover,
$$
\langle q_1, a_2 \rangle q_1 + \|b\| q_2 = \langle q_1, a_2 \rangle q_1 + b = a_2,
$$
hence, to obey (5.2) for ℓ = 2, we let R_{1,2} = ⟨q_1, a_2⟩, R_{2,2} = ‖b‖.
The above idea can be extended to all columns of A.
Step 1 Set k := 0, j := 0 (k is the number of columns of Q that have already been formed and j is the number of columns of A that have already been considered; clearly k ≤ j).
Step 2 Increase j by 1. If k = 0 then set b := a_j, otherwise (i.e., when k ≥ 1) set R_{i,j} := ⟨q_i, a_j⟩, i = 1, 2, ..., k, and $b := a_j - \sum_{i=1}^{k} \langle q_i, a_j \rangle q_i$. [Note: b is orthogonal to q_1, q_2, ..., q_k.]
Step 3 If b ≠ 0 increase k by 1. Subsequently, set q_k := b/‖b‖, R_{k,j} := ‖b‖ and R_{i,j} := 0 for i ≥ k + 1. [Note: hence each column of Q has unit length, as required, $a_j = \sum_{i=1}^{k} R_{i,j} q_i$ and R is upper triangular, because k ≤ j.]
Step 4 Terminate if j = n, otherwise go to Step 2.
Previous lecture ⇒ Since the columns of Q are orthonormal, there are at most m of them, i.e. the final value of k cannot exceed m. If it is less than m then a previous lemma demonstrates that we can add columns so that Q becomes m × m and orthogonal.
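A compact Python sketch of the Gram–Schmidt procedure above, restricted for simplicity to the case in which A has full column rank (so b never vanishes); this is an illustration, not part of the notes, and the test matrix is an assumption.

```python
import numpy as np

def gram_schmidt_qr(A):
    """Gram-Schmidt QR for an m x n matrix A of full column rank:
    returns Q (m x n, orthonormal columns) and R (n x n, upper triangular)."""
    A = A.astype(float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        b = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]      # R_{i,j} = <q_i, a_j>
            b -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(b)          # assumes b != 0 (full column rank)
        Q[:, j] = b / R[j, j]
    return Q, R

A = np.array([[6.0, 6.0, 1.0], [3.0, 6.0, 1.0], [2.0, 1.0, 1.0]])
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))
```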
The disadvantage of Gram–Schmidt is its ill-conditioning: using finite arithmetic, small impreci-
sions in the calculation of inner products spread rapidly, leading to effective loss of orthogonality.
Errors accumulate fast and the computed off-diagonal elements of Q⊤ Q may become large.
Orthogonality conditions are preserved well when one generates a new orthogonal matrix by com-
puting the product of two given orthogonal matrices. Therefore algorithms that express Q as a
product of simple orthogonal matrices are highly useful. This suggests an alternative way forward.
Orthogonal transformations Given a real m × n matrix A_0 = A, we seek a sequence Ω_1, Ω_2, ..., Ω_k of m × m orthogonal matrices such that the matrix A_i := Ω_i A_{i−1} has more zero elements below the main diagonal than A_{i−1} for i = 1, 2, ..., k, and so that the manner of insertion of such zeros is such that A_k is upper triangular. We then let R = A_k, therefore Ω_k Ω_{k−1} · · · Ω_2 Ω_1 A = R and Q = (Ω_k Ω_{k−1} · · · Ω_1)^{-1} = (Ω_k Ω_{k−1} · · · Ω_1)^T = Ω_1^T Ω_2^T · · · Ω_k^T. Hence A = QR, where Q is orthogonal and R upper triangular.
Givens rotations We say that an m × m orthogonal matrix Ω_j is a Givens rotation if it coincides with the unit matrix except for four elements, and det Ω_j = 1. Specifically, we use the notation Ω^{[p,q]}, where 1 ≤ p < q ≤ m, for a matrix such that
$$
\Omega^{[p,q]}_{p,p} = \Omega^{[p,q]}_{q,q} = \cos\theta, \qquad \Omega^{[p,q]}_{p,q} = \sin\theta, \qquad \Omega^{[p,q]}_{q,p} = -\sin\theta
$$
for some θ ∈ [−π, π]. The remaining elements of Ω^{[p,q]} are those of a unit matrix. For example,
$$
m = 4 \;\Longrightarrow\; \Omega^{[1,2]} = \begin{bmatrix} \cos\theta & \sin\theta & 0 & 0 \\ -\sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad \Omega^{[2,4]} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & 0 & \sin\theta \\ 0 & 0 & 1 & 0 \\ 0 & -\sin\theta & 0 & \cos\theta \end{bmatrix}.
$$
Geometrically, such matrices correspond to the underlying coordinate system being rigidly rotated
along a two-dimensional plane (in mechanics this is called an Euler rotation). It is trivial to confirm
that they are orthogonal.
Theorem Let A be an m × n matrix. Then, for every 1 ≤ p < q ≤ m, i ∈ {p, q} and 1 ≤ j ≤ n, there exists θ ∈ [−π, π] such that (Ω^{[p,q]} A)_{i,j} = 0. Moreover, all the rows of Ω^{[p,q]} A, except for the pth and the qth, are the same as the corresponding rows of A, whereas the pth and the qth rows are linear combinations of the 'old' pth and qth rows.
Proof. Let i = q. If A_{p,j} = A_{q,j} = 0 then any θ will do; otherwise we let
$$
\cos\theta := A_{p,j}\Big/\sqrt{A_{p,j}^2 + A_{q,j}^2}, \qquad \sin\theta := A_{q,j}\Big/\sqrt{A_{p,j}^2 + A_{q,j}^2}.
$$
Hence
$$
(\Omega^{[p,q]} A)_{q,k} = -(\sin\theta) A_{p,k} + (\cos\theta) A_{q,k}, \quad k = 1, 2, \ldots, n \;\Longrightarrow\; (\Omega^{[p,q]} A)_{q,j} = 0.
$$
Likewise, when i = p we let $\cos\theta := A_{q,j}\big/\sqrt{A_{p,j}^2 + A_{q,j}^2}$, $\sin\theta := -A_{p,j}\big/\sqrt{A_{p,j}^2 + A_{q,j}^2}$.
The last two statements of the theorem are an immediate consequence of the construction of Ω^{[p,q]}. 2
An example Suppose that A is 3 × 3. We can force zeros underneath the main diagonal as follows.
1. First pick Ω^{[1,2]} so that (Ω^{[1,2]} A)_{2,1} = 0, giving the pattern
$$
\Omega^{[1,2]} A = \begin{bmatrix} \times & \times & \times \\ 0 & \times & \times \\ \times & \times & \times \end{bmatrix}.
$$
2. Next pick Ω^{[1,3]} so that (Ω^{[1,3]} Ω^{[1,2]} A)_{3,1} = 0. Multiplication by Ω^{[1,3]} does not alter the second row, hence (Ω^{[1,3]} Ω^{[1,2]} A)_{2,1} remains zero and
$$
\Omega^{[1,3]} \Omega^{[1,2]} A = \begin{bmatrix} \times & \times & \times \\ 0 & \times & \times \\ 0 & \times & \times \end{bmatrix}.
$$
3. Finally, pick Ω^{[2,3]} so that (Ω^{[2,3]} Ω^{[1,3]} Ω^{[1,2]} A)_{3,2} = 0. Since both the second and the third row of Ω^{[1,3]} Ω^{[1,2]} A have a leading zero, (Ω^{[2,3]} Ω^{[1,3]} Ω^{[1,2]} A)_{2,1} = (Ω^{[2,3]} Ω^{[1,3]} Ω^{[1,2]} A)_{3,1} = 0. It follows that Ω^{[2,3]} Ω^{[1,3]} Ω^{[1,2]} A is upper triangular. Therefore
$$
R = \Omega^{[2,3]} \Omega^{[1,3]} \Omega^{[1,2]} A = \begin{bmatrix} \times & \times & \times \\ 0 & \times & \times \\ 0 & 0 & \times \end{bmatrix}, \qquad Q = (\Omega^{[2,3]} \Omega^{[1,3]} \Omega^{[1,2]})^\top.
$$

The Givens algorithm Given m × n matrix A, let ℓi be the number of leading zeros in the ith
row of A, i = 1, 2, . . . , m.
Step 1 Stop if the (integer) sequence {ℓ1 , ℓ2 , . . . , ℓm } increases monotonically, the increase being
strictly monotone for ℓi ≤ n.
Step 2 Pick any two integers 1 ≤ p < q ≤ m such that either ℓp > ℓq or ℓp = ℓq < n.
Step 3 Replace A by Ω[p,q] A, using the Givens rotation that annihilates the (q, ℓq + 1) element.
Update the values of ℓp and ℓq and go to Step 1.
The final matrix A is upper triangular and also has the property that the number of leading zeros
in each row increases strictly monotonically until all the rows of A are zero – a matrix of this form
is said to be in standard form. This end result, as we recall, is the required matrix R.
The cost There are fewer than mn rotations and each rotation replaces two rows by their linear combinations, hence the total cost is O(mn^2).
If we wish to obtain explicitly an orthogonal Q s.t. A = QR then we commence by letting Ω be the m × m unit matrix and, each time A is premultiplied by Ω^{[p,q]}, we also premultiply Ω by the same rotation. Hence the final Ω is the product of all the rotations, in the correct order, and we let Q = Ω^T. The extra cost is O(m^2 n). However, in most applications we do not need Q itself but, instead, just the action of Q^T on a given vector (recall: solution of linear systems!). This can be accomplished by multiplying the vector by successive rotations, the cost being O(mn).
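A sketch in Python of the Givens strategy (an illustration, not part of the notes): rotations are applied in place to annihilate subdiagonal entries column by column — the simple dense ordering, rather than the leading-zero bookkeeping of the algorithm above — and the product of the rotations is accumulated only so that the result can be checked.

```python
import numpy as np

def givens_qr(A):
    """QR by Givens rotations, annihilating the (q, j) entries column by column."""
    R = A.astype(float).copy()
    m, n = R.shape
    Qt = np.eye(m)                                     # accumulates the rotations
    for j in range(n):
        for q in range(j + 1, m):
            p = j
            if R[q, j] != 0.0:
                r = np.hypot(R[p, j], R[q, j])
                c, s = R[p, j] / r, R[q, j] / r        # cos(theta), sin(theta)
                new_p = c * R[p, :] + s * R[q, :]      # new pth row
                new_q = -s * R[p, :] + c * R[q, :]     # new qth row: entry (q, j) becomes 0
                R[p, :], R[q, :] = new_p, new_q
                Qt[[p, q], :] = np.vstack((c * Qt[p, :] + s * Qt[q, :],
                                           -s * Qt[p, :] + c * Qt[q, :]))
    return Qt.T, R                                     # A = Q R with Q = (product of rotations)^T

A = np.array([[6.0, 6.0, 1.0], [3.0, 6.0, 1.0], [2.0, 1.0, 1.0]])
Q, R = givens_qr(A)
print(np.allclose(Q @ R, A), np.allclose(np.tril(R, -1), 0.0))
```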

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Lecture 16
Householder reflections Let u ∈ R^m \ {0}. The m × m matrix $I - 2\frac{uu^\top}{\|u\|^2}$ is called a Householder reflection. Each such matrix is symmetric and orthogonal, since
$$
\Bigl(I - 2\frac{uu^\top}{\|u\|^2}\Bigr)^{\!\top} \Bigl(I - 2\frac{uu^\top}{\|u\|^2}\Bigr) = \Bigl(I - 2\frac{uu^\top}{\|u\|^2}\Bigr)^{\!2} = I - 4\frac{uu^\top}{\|u\|^2} + 4\frac{u(u^\top u)u^\top}{\|u\|^4} = I.
$$
Householder reflections offer an alternative to Givens rotations in the calculation of a QR factorization.
Deriving the first column of R Our goal is to multiply an m × n matrix A by a sequence of Householder reflections so that each product induces zeros under the diagonal in an entire column. To start with, we seek a reflection that transforms the first nonzero column of A into a multiple of e_1.
Let a ∈ R^m be the first nonzero column of A. We wish to choose u ∈ R^m such that the bottom m − 1 entries of
$$
\Bigl(I - 2\frac{uu^\top}{\|u\|^2}\Bigr) a = a - 2\frac{u^\top a}{\|u\|^2}\, u
$$
vanish and, in addition, we normalise u so that $2u^\top a = \|u\|^2$ (recall that a ≠ 0). Therefore u_i = a_i, i = 2, ..., m, and the normalisation implies that
$$
2u_1 a_1 + 2\sum_{i=2}^{m} a_i^2 = u_1^2 + \sum_{i=2}^{m} a_i^2 \;\Longrightarrow\; u_1^2 - 2u_1 a_1 + a_1^2 - \sum_{i=1}^{m} a_i^2 = 0 \;\Longrightarrow\; u_1 = a_1 \pm \|a\|.
$$
It is usual to let the sign be the same as the sign of a_1, since otherwise ‖u‖ ≪ 1 might lead to a division by a tiny number, hence to numerical difficulties.
For large m we do not execute explicit matrix multiplication. Instead, to calculate
$$
\Bigl(I - 2\frac{uu^\top}{\|u\|^2}\Bigr) A = A - 2\frac{u(u^\top A)}{\|u\|^2},
$$
we first evaluate $w^\top := u^\top A$, subsequently forming $A - \frac{2}{\|u\|^2}\, u w^\top$.
Subsequent columns of R Suppose that a is the first column of A that is not compatible with the standard form (previous columns have, presumably, already been dealt with by Householder reflections) and that the standard form requires bringing the k + 1, ..., m components to zero. Hence the nonzero elements in previous columns must be confined to the first k − 1 rows and we want them to be unamended by the reflection. Thus, we let the first k − 1 components of u be zero and choose
$$
u_k = a_k \pm \Bigl( \sum_{i=k}^{m} a_i^2 \Bigr)^{1/2}, \qquad u_i = a_i, \quad i = k + 1, \ldots, m.
$$
The Householder method We process columns of A in sequence, in each stage premultiplying
a current A by the requisite Householder reflection. The end result is an upper triangular matrix
R in its standard form.
Example
$$
A = \begin{bmatrix} 2 & 4 & 7 \\ 0 & 3 & -1 \\ 0 & 0 & 2 \\ 0 & 0 & 1 \\ 0 & 0 & -2 \end{bmatrix} \;\Longrightarrow\; u = \begin{bmatrix} 0 \\ 0 \\ 5 \\ 1 \\ -2 \end{bmatrix} \;\Longrightarrow\; \Bigl(I - 2\frac{uu^\top}{\|u\|^2}\Bigr) A = \begin{bmatrix} 2 & 4 & 7 \\ 0 & 3 & -1 \\ 0 & 0 & -3 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.
$$
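A Python sketch of the Householder method (an illustration, not part of the notes), assuming a dense matrix and the sign choice described above; the reflections are applied as rank-one updates rather than explicit matrix products.

```python
import numpy as np

def householder_qr(A):
    """QR via Householder reflections; returns Q (m x m) and R (m x n)."""
    R = A.astype(float).copy()
    m, n = R.shape
    Omega = np.eye(m)                                  # accumulated product of reflections
    for k in range(min(m, n)):
        a = R[k:, k]
        if np.linalg.norm(a) == 0.0:
            continue
        u = a.copy()
        u[0] += np.sign(a[0]) * np.linalg.norm(a) if a[0] != 0 else np.linalg.norm(a)
        beta = 2.0 / (u @ u)
        R[k:, :] -= beta * np.outer(u, u @ R[k:, :])       # reflection acts on rows k..m-1
        Omega[k:, :] -= beta * np.outer(u, u @ Omega[k:, :])
    return Omega.T, R                                   # A = Q R with Q = Omega^T

A = np.array([[2.0, 4.0, 7.0], [0.0, 3.0, -1.0], [0.0, 0.0, 2.0],
              [0.0, 0.0, 1.0], [0.0, 0.0, -2.0]])
Q, R = householder_qr(A)
print(np.round(R, 10))     # upper triangular; rows may differ in sign from the hand example
print(np.allclose(Q @ R, A))
```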

Calculation of Q If the matrix Q is required in an explicit form, set Ω = I initially and, for each successive reflection, replace Ω by $\bigl(I - 2\frac{uu^\top}{\|u\|^2}\bigr)\Omega = \Omega - \frac{2}{\|u\|^2}\, u (u^\top \Omega)$. As in the case of
Givens rotations, by the end of the computation Q = Ω^T. However, if we require just the vector c = Q^T b, say, rather than the matrix Q, then we set initially c = b and in each stage replace c by
$$
\Bigl(I - 2\frac{uu^\top}{\|u\|^2}\Bigr) c = c - 2\frac{u^\top c}{\|u\|^2}\, u.
$$
Givens or Householder? If A is dense, it is in general more convenient to use Householder reflections. Givens rotations come into their own, however, when A has many leading zeros in its rows. E.g., if an n × n matrix A consists of zeros underneath the first subdiagonal, they can be 'rotated away' in just n − 1 Givens rotations, at the cost of O(n^2) operations!

5.3 Linear least squares

Statement of the problem Suppose that an m×n matrix A and a vector b ∈ Rm are given. The
equation Ax = b, where x ∈ Rn is unknown, has in general no solution (if m > n) or an infinity
of solutions (if m < n). Problems of this form occur frequently when we collect m observations
(which, typically, are prone to measurement error) and wish to exploit them to form an n-variable
linear model, where n ≪ m. (In statistics, this is known as linear regression.) Bearing in mind the
likely presence of errors in A and b, we seek x ∈ Rn that minimises the Euclidean length kAx − bk.
This is the least squares problem.
Theorem x ∈ R^n is a solution of the least squares problem iff A^T(Ax − b) = 0.
Proof. If x is a solution then it minimises
$$
f(x) := \|Ax - b\|^2 = \langle Ax - b, Ax - b \rangle = x^\top A^\top A x - 2 x^\top A^\top b + b^\top b.
$$
Hence ∇f(x) = 0. But $\tfrac12 \nabla f(x) = A^\top A x - A^\top b$, hence A^T(Ax − b) = 0.
Conversely, suppose that A^T(Ax − b) = 0 and let u ∈ R^n. Then, letting y = u − x,
$$
\|Au - b\|^2 = \langle Ax + Ay - b, Ax + Ay - b \rangle = \langle Ax - b, Ax - b \rangle + 2 y^\top A^\top (Ax - b) + \langle Ay, Ay \rangle = \|Ax - b\|^2 + \|Ay\|^2 \ge \|Ax - b\|^2,
$$
and x is indeed optimal. 2
Corollary Optimality of x ⇔ the vector Ax − b is orthogonal to all columns of A.
Normal equations One way of finding the optimal x is by solving the n × n linear system A^T A x = A^T b – the method of normal equations. This approach is popular in many applications. However, there are three disadvantages: firstly, A^T A might be singular; secondly, a sparse A might be replaced by a dense A^T A; and, finally, forming A^T A might lead to loss of accuracy. Thus, suppose that our computer works to the IEEE arithmetic standard (≈ 15 significant digits) and let
$$
A = \begin{bmatrix} 10^8 & -10^8 \\ 1 & 1 \end{bmatrix} \;\Longrightarrow\; A^\top A = \begin{bmatrix} 10^{16} + 1 & -10^{16} + 1 \\ -10^{16} + 1 & 10^{16} + 1 \end{bmatrix} \approx 10^{16} \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}.
$$
Given b = [0, 2]^T, the solution of Ax = b is [1, 1]^T, as can easily be found by Gaussian elimination. However, our computer 'believes' that A^T A is singular!
QR and least squares
Lemma Let A be any m × n matrix and let b ∈ R^m. The vector x ∈ R^n minimises ‖Ax − b‖ iff it minimises ‖ΩAx − Ωb‖ for an arbitrary m × m orthogonal matrix Ω.
Proof. Given an arbitrary vector v ∈ R^m, we have $\|\Omega v\|^2 = v^\top \Omega^\top \Omega v = v^\top v = \|v\|^2$. In particular, ‖ΩAx − Ωb‖ = ‖Ax − b‖. 2
Method of solution Suppose that A = QR, a QR factorization with R in standard form. Because of the lemma, letting Ω := Q^T, we have ‖Ax − b‖ = ‖Q^T(Ax − b)‖ = ‖Rx − Q^T b‖, therefore we seek x ∈ R^n that minimises ‖Rx − Q^T b‖.
In general (m > n) many rows of R consist of zeros. Suppose for simplicity that rank R = rank A = n. Then the bottom m − n rows of R are zero and we find x by solving the (nonsingular) linear system given by the first n equations of Rx = Q^T b. A similar (but more complicated) algorithm applies when rank R ≤ n − 1. Note that we do not require Q explicitly, just the ability to evaluate Q^T b.
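A short Python sketch of this method of solution for the full-column-rank case m ≥ n, using numpy's QR routine in reduced form so that the first n rows of R are obtained directly; the data (a straight-line fit to four observations) are an illustrative assumption.

```python
import numpy as np

def least_squares_qr(A, b):
    """Minimise ||Ax - b|| for a full-column-rank A (m >= n) via A = QR."""
    Q, R = np.linalg.qr(A, mode='reduced')   # Q: m x n, R: n x n upper triangular
    return np.linalg.solve(R, Q.T @ b)       # solve the triangular system R x = Q^T b

# overdetermined example: fit c0 + c1*t to four observations
t = np.array([0.0, 1.0, 2.0, 3.0])
A = np.column_stack((np.ones_like(t), t))
b = np.array([0.1, 0.9, 2.1, 2.9])
x = least_squares_qr(A, b)
print(x)
print(np.allclose(A.T @ (A @ x - b), 0.0))   # optimality: residual orthogonal to the columns of A
```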

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Exercise Sheet 1
1. Suppose that the function values f(0), f(1), f(2) and f(3) are given and that we wish to estimate
$$
f(6), \qquad f'(0) \qquad \text{and} \qquad \int_0^3 f(x)\,dx.
$$
One method is to let p be the cubic polynomial that interpolates these function values, and then to employ the approximants
$$
p(6), \qquad p'(0) \qquad \text{and} \qquad \int_0^3 p(x)\,dx
$$
respectively. Deduce from the Lagrange formula for p that each approximant is a linear combination of the four data with constant coefficients. Calculate the numerical values of these constants. Verify your work by showing that the approximants are exact when f is an arbitrary cubic polynomial.
2. Let f be a function in C^4[0, 1] and let p be a cubic polynomial that interpolates f(0), f'(0), f(1) and f'(1). Deduce from the Rolle theorem that for every x ∈ [0, 1] there exists ξ ∈ [0, 1] such that the equation
$$
f(x) - p(x) = \tfrac{1}{24}\, x^2 (x - 1)^2 f^{(4)}(\xi)
$$
is satisfied.
3. Let a, b and c be distinct real numbers (not necessarily in ascending order), and
let f (a), f (b), f ′ (a), f ′ (b) and f ′ (c) be given. Because there are five data, one might
try to approximate f by a polynomial of degree at most four that interpolates the
data. Prove by a general argument that this interpolation problem has a solution
and the solution is unique if and only if there is no nonzero polynomial p ∈ P4 [x]
that satisfies p(a) = p(b) = p′ (a) = p′ (b) = p′ (c) = 0. Hence, given a and b, show
that there exists a unique value of c 6= a, b such that there is no unique solution.
[Note: This form of interpolation when both function values and derivatives are
fitted, perhaps at different points, is known as Birkhoff–Hermite interpolation.]
4. Let f : R → R be a given function and let p be the polynomial of degree at most n that interpolates f at the pairwise distinct points x_0, x_1, ..., x_n. Further, let x be any real number that is not an interpolation point. Deduce the identity
$$
f(x) - p(x) = f[x_0, x_1, \ldots, x_n, x] \prod_{j=0}^{n} (x - x_j)
$$

from the definition of the divided difference f [x0 , x1 , . . . , xn , x].
5. Simulating a computer that works to only four decimal places, form the table
of divided differences of the values f (0) = 0, f (0.1) = 0.0998, f (0.4) = 0.3894 and
f (0.7) = 0.6442 of sin x. Hence identify the polynomial that is given by Newton’s
interpolation method. Due to rounding errors, this polynomial should differ from
the one that would be given by exact arithmetic. Take the view, however, that the
computed values of f [0.0, 0.1], f [0.0, 0.1, 0.4] and f [0.0, 0.1, 0.4, 0.7] and the function
value f (0) are correct. Then, by working backwards through the difference table,
identify the values of f (0), f (0.1), f (0.4) and f (0.7) that would give these divided
differences in exact arithmetic.
6. Set f(x) = 2x − 1, x ∈ [0, 1]. We require a function of the form
$$
p(x) = \sum_{k=0}^{n} a_k \cos(k\pi x), \qquad 0 \le x \le 1,
$$
that satisfies the condition
$$
\int_0^1 [f(x) - p(x)]^2\,dx < 10^{-4}.
$$
Explain why it is sufficient if the value of $a_0^2 + \tfrac12 \sum_{k=1}^{n} a_k^2$ exceeds $\tfrac13 - 10^{-4}$, where the coefficients $\{a_k\}_{k=0}^{n}$ are calculated to minimize this integral. Hence find the smallest acceptable value of n.
7. The polynomials {p_n}_{n∈Z_+} are defined by the three-term recurrence formula
$$
p_0(x) \equiv 1, \qquad p_1(x) = 2x, \qquad p_{n+1}(x) = 2x\,p_n(x) - p_{n-1}(x), \quad n = 1, 2, \ldots.
$$
Prove that they are orthogonal with respect to the inner product
$$
\langle f, g \rangle = \int_{-1}^{1} f(x) g(x) \sqrt{1 - x^2}\,dx
$$
and evaluate ⟨p_n, p_n⟩ for n ∈ Z_+. [Hint: Prove that p_n(x) = sin((n + 1)θ)/sin θ, where x = cos θ.]
[Note: These p_n's are known as Chebyshev polynomials of the second kind and denoted by p_n = U_n.]
8. Calculate the coefficients b_1, b_2, c_1 and c_2 so that the approximant
$$
\int_0^1 f(x)\,dx \approx b_1 f(c_1) + b_2 f(c_2)
$$
is exact when f is a cubic polynomial. You may exploit the fact that c1 and c2
are the zeros of a quadratic polynomial that is orthogonal to all linear polynomials.
Verify your calculation by testing the formula when f (x) = 1, x, x2 and x3 .
9. The functions p_0, p_1, p_2, ... are generated by the Rodrigues formula
$$
p_n(x) = e^x \frac{d^n}{dx^n}\bigl(x^n e^{-x}\bigr), \qquad 0 \le x < \infty.
$$
Show that these functions are polynomials and prove by integration by parts that for every p ∈ P_{n−1}[x] we have the orthogonality condition ⟨p_n, p⟩ = 0 with respect to the scalar product
$$
\langle f, g \rangle := \int_0^{\infty} e^{-x} f(x) g(x)\,dx.
$$
Derive the coefficients of p_3, p_4 and p_5 from the Rodrigues formula. Verify that these coefficients are compatible with a three-term recurrence relation of the form
$$
p_5(x) = (\gamma x - \alpha) p_4(x) - \beta p_3(x), \qquad x \in \mathbb{R},
$$
where α, β and γ are constants.
[Note: These p_n's are known as Laguerre polynomials and denoted by p_n = L_n – or, if you want to be really sophisticated, $L_n^{(0)}$.]

10. Let p(1/2) = ½(f(0) + f(1)), where f is a function in C^2[0, 1]. Find the least constants c_0, c_1 and c_2 such that the error bounds
$$
|f(\tfrac12) - p(\tfrac12)| \le c_k \|f^{(k)}\|_\infty, \qquad k = 0, 1, 2,
$$
are valid.
[Note: The cases k = 0 and k = 1 are easy if one works from first principles, and the Peano kernel theorem is suitable when k = 2. Also try the Peano kernel theorem when k = 1.]
11. Express the divided difference f[0, 1, 2, 4] in the form
$$
f[0, 1, 2, 4] = \int_0^4 K(\theta) f'''(\theta)\,d\theta,
$$
assuming that f''' exists and is continuous. Sketch the kernel function K(θ) for 0 ≤ θ ≤ 4. By integrating K(θ) analytically and using the mean value theorem prove that
$$
f[0, 1, 2, 4] = \tfrac16 f'''(\xi)
$$
for some point ξ ∈ [0, 4]. Note that another proof of this result was given in the lecture on divided differences.
12. Let f be a function in C^4[0, 1] and let ξ be any fixed point in [0, 1]. Calculate the coefficients α, β, γ and δ such that the approximant
$$
f'''(\xi) \approx \alpha f(0) + \beta f(1) + \gamma f'(0) + \delta f'(1)
$$
is exact for all cubic polynomials. Prove that the inequality
$$
|f'''(\xi) - \alpha f(0) - \beta f(1) - \gamma f'(0) - \delta f'(1)| \le \bigl\{\tfrac12 - \xi + 2\xi^3 - \xi^4\bigr\} \|f^{(4)}\|_\infty
$$
is satisfied. Show that this inequality holds as an equation if we allow f to be the function
$$
f(x) = \begin{cases} -(x - \xi)^4, & 0 \le x \le \xi, \\ \phantom{-}(x - \xi)^4, & \xi \le x \le 1. \end{cases}
$$

13. [Not easy!] Given f and g in C[a, b], let h := fg. Prove by induction that the divided differences of h satisfy the equation
$$
h[x_0, x_1, \ldots, x_n] = \sum_{j=0}^{n} f[x_0, x_1, \ldots, x_j]\, g[x_j, x_{j+1}, \ldots, x_n].
$$
By expressing the differences in terms of derivatives and by letting the points x_0, x_1, ..., x_n become coincident, deduce the Leibniz formula for the nth derivative of a product of two functions.

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Exercise Sheet 2

14. Let h = 1/M, where M ≥ 1 is an integer, and let Euler's method be applied to calculate the estimates {y_n}_{n=1,2,...,M} of y(nh) for each of the differential equations
$$
y' = -\frac{y}{1+t} \qquad \text{and} \qquad y' = \frac{2y}{1+t}, \qquad 0 \le t \le 1,
$$
starting with y_0 = y(0) = 1 in both cases. By using induction and by cancelling as many terms as possible in the resultant products, deduce simple explicit expressions for y_n, n = 1, 2, ..., M, which should be free from summations and products of n terms. Hence deduce the exact solutions of the equations from the limit h → 0. Verify that the magnitude of the errors y_n − y(nh), n = 1, 2, ..., M, is at most O(h).
19. Assuming that f satisfies the Lipschitz condition and possesses a bounded third derivative in [0, t*], apply the method of analysis of the Euler method, given in the lectures, to prove that the trapezoidal rule
$$
y_{n+1} = y_n + \tfrac12 h\,[f(t_n, y_n) + f(t_{n+1}, y_{n+1})]
$$
converges and that ‖y_n − y(t_n)‖ ≤ ch^2 for some c > 0 and all n such that 0 ≤ nh ≤ t*.
20. The s-step Adams–Bashforth method is of order s and has the form
$$
y_{n+s} = y_{n+s-1} + h \sum_{j=0}^{s-1} \sigma_j f(t_{n+j}, y_{n+j}).
$$
Calculate the actual values of the coefficients in the case s = 3.
Denoting the polynomials generating the s-step Adams–Bashforth method by {ρ_s, σ_s}, prove that
$$
\sigma_s(z) = z\,\sigma_{s-1}(z) + \alpha_{s-1} (z - 1)^{s-1},
$$
where α_s ≠ 0 is a constant s.t. $\rho_s(z) - \sigma_s(z) \log z = \alpha_s (z - 1)^{s+1} + O\bigl(|z - 1|^{s+2}\bigr)$, z → 1.
[Hint: Use induction, the order conditions and the fact that the degree of each σ_s is s − 1.]
21. By solving a three-term recurrence relation, calculate analytically the sequence of
values {y n : n = 2, 3, 4, . . .} that is generated by the explicit midpoint rule

y n+2 = y n + 2hf (tn+1 , y n+1 ),

when it is applied to the ODE y ′ = −y, t ≥ 0. Starting from the values y0 = 1 and
y1 = 1 − h, show that the sequence diverges as n → ∞ for all h > 0. Recall, however, that
order ≥ 1, the root condition and suitable starting conditions imply convergence in a finite
interval. Prove that the above implementation of the explicit midpoint rule is consistent
with this theorem.
Hint: In the last part, relate the roots of the recurrence relation to $\pm e^{\mp h} + O(h^3)$.


22. Show that the multistep method
$$
\sum_{j=0}^{3} \rho_j\, y_{n+j} = h \sum_{j=0}^{2} \sigma_j f(t_{n+j}, y_{n+j})
$$
is of fourth order only if the conditions ρ_0 + ρ_2 = 8 and ρ_1 = −9 are satisfied. Hence deduce that this method cannot both be of fourth order and satisfy the root condition.
23. An s-stage explicit Runge–Kutta method of order s with constant step size h > 0 is applied to the differential equation y' = λy, t ≥ 0. Prove the identity
$$
y_n = \Biggl[\, \sum_{l=0}^{s} \frac{1}{l!} (h\lambda)^l \Biggr]^n y_0, \qquad n = 0, 1, 2, \ldots.
$$

24. The following four-stage Runge–Kutta method has order four:
$$
\begin{aligned}
k_1 &= f(t_n, y_n),\\
k_2 &= f\bigl(t_n + \tfrac13 h,\; y_n + \tfrac13 h k_1\bigr),\\
k_3 &= f\bigl(t_n + \tfrac23 h,\; y_n - \tfrac13 h k_1 + h k_2\bigr),\\
k_4 &= f\bigl(t_n + h,\; y_n + h k_1 - h k_2 + h k_3\bigr),\\
y_{n+1} &= y_n + h\bigl(\tfrac18 k_1 + \tfrac38 k_2 + \tfrac38 k_3 + \tfrac18 k_4\bigr).
\end{aligned}
$$

By considering the equation y ′ = y, show that the order is at most four. Then, for scalar
functions, prove that the order is at least four in the easy case when f is independent of
y, and that the order is at least three in the relatively easy case when f is independent of
t.
[You are not expected to derive all of the (gory) details when f (t, y) depends on both t and
y.]
25. Find D ∩ R, the intersection of the linear stability domain D with the real axis, for the following methods:
(1) y_{n+1} = y_n + hf(t_n, y_n);
(2) y_{n+1} = y_n + ½h[f(t_n, y_n) + f(t_{n+1}, y_{n+1})];
(3) y_{n+2} = y_n + 2hf(t_{n+1}, y_{n+1});
(4) y_{n+2} = y_{n+1} + ½h[3f(t_{n+1}, y_{n+1}) − f(t_n, y_n)];
(5) the RK method k_1 = f(t_n, y_n), k_2 = f(t_n + h, y_n + hk_1), y_{n+1} = y_n + ½h(k_1 + k_2).

26. Show that, if z is a nonzero complex number that is on the boundary of the linear stability domain of the two-step BDF method
$$
y_{n+2} - \tfrac43 y_{n+1} + \tfrac13 y_n = \tfrac23 h f(t_{n+2}, y_{n+2}),
$$
then the real part of z is positive. Thus deduce that this method is A-stable.

27. The (stiff) differential equation
$$
y'(t) = -10^4 \bigl(y - t^{-1}\bigr) - t^{-2}, \qquad t \ge 1, \qquad y(1) = 1,
$$
has the analytic solution y(t) = t^{-1}, t ≥ 1. Let it be solved numerically by Euler's method y_{n+1} = y_n + h_n f(t_n, y_n) and by the backward Euler method y_{n+1} = y_n + h_n f(t_{n+1}, y_{n+1}), where h_n = t_{n+1} − t_n is allowed to depend on n and to be different in the two cases. Suppose that, for any t_n ≥ 1, we have |y_n − y(t_n)| ≤ 10^{-6}, and that we require |y_{n+1} − y(t_{n+1})| ≤ 10^{-6}. Show that Euler's method can fail if h_n = 2 × 10^{-4}, but that the backward Euler method always succeeds if $h_n \le 10^{-2}\, t_n t_{n+1}^2$.
Hint: Find relations between y_{n+1} − y(t_{n+1}) and y_n − y(t_n) for general y_n and t_n.
28. This question concerns the predictor–corrector pair
$$
y^P_{n+3} = -\tfrac12 y_n + 3 y_{n+1} - \tfrac32 y_{n+2} + 3h f(t_{n+2}, y_{n+2}),
$$
$$
y^C_{n+3} = \tfrac{1}{11} \bigl[ 2 y_n - 9 y_{n+1} + 18 y_{n+2} + 6h f(t_{n+3}, y_{n+3}) \bigr].
$$
Show that both methods are of third order, and that the estimate of the error of the corrector formula by Milne's device has the value $\tfrac{6}{17}\, |y^P_{n+3} - y^C_{n+3}|$.

29. Let p be the cubic polynomial that is defined by p(t_j) = y_j, j = n, n + 1, n + 2, and by p'(t_{n+2}) = f(t_{n+2}, y_{n+2}). Show that the predictor formula of the previous exercise is $y^P_{n+3} = p(t_{n+2} + h)$. Further, show that the corrector formula is equivalent to the equation
$$
y^C_{n+3} = p(t_{n+2}) + \tfrac{5}{11} h p'(t_{n+2}) - \tfrac{1}{22} h^2 p''(t_{n+2}) - \tfrac{7}{66} h^3 p'''(t_{n+2}) + \tfrac{6}{11} h f(t_{n+2} + h, y_{n+3}).
$$
The point of these remarks is that p can be derived from available data, and then the above forms of the predictor and corrector can be applied for any choice of h = t_{n+3} − t_{n+2}.
30. Let u(x), 0 ≤ x ≤ 1, be a six-times differentiable function that satisfies the ODE u''(x) = f(x), 0 ≤ x ≤ 1, u(0) and u(1) being given. Further, we let x_m = mh = m/M, m = 0, 1, ..., M, for some positive integer M, and calculate the estimates u_m ≈ u(x_m), m = 1, 2, ..., M − 1, by solving the difference equation
$$
u_{m-1} - 2u_m + u_{m+1} = h^2 f(x_m) + \alpha h^2 [f(x_{m-1}) - 2f(x_m) + f(x_{m+1})], \qquad m = 1, 2, \ldots, M - 1,
$$
where u_0 = u(0), u_M = u(1), and α is a positive parameter. Show that there exists a choice of α such that the local truncation error of the difference equation is O(h^6). In this case, deduce that the Euclidean norm of the vector of errors u(x_m) − u_m, m = 0, 1, ..., M, is bounded above by a constant multiple of $\|u^{(6)}\|_\infty h^{7/2}$, and provide an upper bound on this constant.

Mathematical Tripos Part IB: Lent 2010
Numerical Analysis – Exercise Sheet 3

31. Calculate all LU factorizations of the matrix
$$
A = \begin{bmatrix} 10 & 6 & -2 & 1 \\ 10 & 10 & -5 & 0 \\ -2 & 2 & -2 & 1 \\ 1 & 3 & -2 & 3 \end{bmatrix},
$$
where all the diagonal elements of L are one. By using one of these factorizations, find all solutions of the equation Ax = b where b^T = [−2, 0, 2, 1].
32. By using column pivoting if necessary to exchange rows of A, an LU factorization of a real n × n matrix A is calculated, where L has ones on its diagonal, and where the moduli of the off-diagonal elements of L do not exceed one. Let α be the largest of the moduli of the elements of A. Prove by induction on i that the elements of U satisfy the condition $|u_{ij}| \le 2^{i-1} \alpha$. Then construct 2 × 2 and 3 × 3 nonzero matrices A that yield $|u_{22}| = 2\alpha$ and $|u_{33}| = 4\alpha$ respectively.
33. Let A be a real n × n matrix that has the factorization A = LU , where L is lower
triangular with ones on its diagonal and U is upper triangular. Prove that, for every
integer k ∈ {1, 2, . . . , n}, the first k rows of U span the same space as the first k rows
of A. Prove also that the first k columns of A are in the k-dimensional subspace that is
spanned by the first k columns of L. Hence deduce that no LU factorization of the given
form exists if we have rank Hk < rank Bk , where Hk is the leading k × k submatrix of A
and where Bk is the n × k matrix whose columns are the first k columns of A.
34. Calculate the Cholesky factorization of the matrix
$$
\begin{bmatrix} 1 & 1 & & & & \\ 1 & 2 & 1 & & & \\ & 1 & 3 & 1 & & \\ & & 1 & 4 & 1 & \\ & & & 1 & 5 & 1 \\ & & & & 1 & \lambda \end{bmatrix}.
$$
Deduce from the factorization the value of λ that makes the matrix singular. Also find this value of λ by seeking the vector in the null-space of the matrix whose first component is one.
35. Let A be an n × n nonsingular band matrix that satisfies the condition aij = 0
if |i − j| > r, where r is small, and let Gaussian elimination with column pivoting be
used to solve Ax = b. Identify all the coefficients of the intermediate equations that can
become nonzero. Hence deduce that the total number of additions and multiplications of the complete calculation can be bounded by a constant multiple of nr^2.
36. Let a_1, a_2 and a_3 denote the columns of the matrix
$$
A = \begin{bmatrix} 6 & 6 & 1 \\ 3 & 6 & 1 \\ 2 & 1 & 1 \end{bmatrix}.
$$
Apply the Gram–Schmidt procedure to A, which generates orthonormal vectors q_1, q_2 and q_3. Note that this calculation provides real numbers r_{jk} such that $a_k = \sum_{j=1}^{k} r_{jk} q_j$, k = 1, 2, 3. Hence express A as the product A = QR, where Q and R are orthogonal and upper triangular matrices respectively.
37. Calculate the QR factorization of the matrix of Exercise 36 by using three Givens
rotations. Explain why the initial rotation can be any one of the three types Ω(1,2) , Ω(1,3)
and Ω(2,3) . Prove that the final factorization is independent of this initial choice in exact
arithmetic, provided that we satisfy the condition that in each row of R the leading nonzero
element is positive.
38. Let A be an n × n matrix, and for i = 1, 2, ..., n let k(i) be the number of zero elements in the ith row of A that come before all nonzero elements in this row and before the diagonal element a_{ii}. Show that the QR factorization of A can be calculated by using at most $\tfrac12 n(n-1) - \sum_i k(i)$ Givens rotations. Hence show that, if A is an upper triangular matrix except that there are nonzero elements in its first column, i.e. a_{ij} = 0 when 2 ≤ j < i ≤ n, then its QR factorization can be calculated by using only 2n − 3 Givens rotations. [Hint: You should find the order of the first (n − 2) rotations that brings your matrix to the form considered above.]
39. Calculate the QR factorization of the matrix of Exercise 36 by using two Householder
reflections. Show that, if this technique is used to generate the QR factorization of a
general n × n matrix A, then the computation can be organised so that the total number
of additions and multiplications is bounded above by a constant multiple of n3 .
40. Let
$$
A = \begin{bmatrix} 3 & 4 & 7 & -2 \\ 5 & 4 & 9 & 3 \\ 1 & -1 & 0 & 3 \\ 1 & -1 & 0 & 0 \end{bmatrix}, \qquad b = \begin{bmatrix} 11 \\ 29 \\ 16 \\ 10 \end{bmatrix}.
$$
Calculate the QR factorization of A by using Householder reflections. In this case A is singular and you should choose Q so that the last row of R is zero. Hence identify all the least squares solutions of the inconsistent system Ax = b, where we require x to minimize ‖Ax − b‖^2. Verify that all the solutions give the same vector of residuals Ax − b, and that this vector is orthogonal to the columns of A. There is no need to calculate the elements of Q explicitly.
