
Part IB

Mathematics

version 1.2

King Ming Lam


Contents
Preface v

1 Metric and Topological Spaces 1


1.1 Metric spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Topological spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.4 Compactness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2 Variational principle 43
2.1 Multivariate calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.2 Functionals and the Euler-Lagrange equation . . . . . . . . . . . . . . . 51
2.3 Hamilton’s principle and Noether’s theorem . . . . . . . . . . . . . . . . 58
2.4 Multivariate calculus of variations . . . . . . . . . . . . . . . . . . . . . 65
2.5 The second variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

3 Optimization 73
3.1 Preliminaries and Lagrange multipliers . . . . . . . . . . . . . . . . . . . 73
3.2 Solutions of linear programs . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.3 Non-cooperative games . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.4 Network problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

4 Linear Algebra 103


4.1 Vector spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.2 Linear maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.3 Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.4 Bilinear forms I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.5 Determinants of matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.6 Endomorphisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.7 Bilinear forms II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.8 Inner product spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

5 Analysis II 161
5.1 Sequence of functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.2 Series of functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
5.3 Normed space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
5.4 Metric spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
5.5 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
5.6 Differentiation from Rm to Rn . . . . . . . . . . . . . . . . . . . . . . . . 192

6 Methods 209
6.1 Fourier series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
6.2 Sturm-Liouville Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6.3 Partial differential equations . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.4 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6.5 Green’s Functions for ODEs . . . . . . . . . . . . . . . . . . . . . . . . . 243
6.6 Fourier transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
6.7 PDEs and Method of Characteristics . . . . . . . . . . . . . . . . . . . . 258
6.8 Green’s Functions for PDEs . . . . . . . . . . . . . . . . . . . . . . . . . 265


7 Quantum mechanics 275


7.1 Wavefunctions and the Schrödinger equation . . . . . . . . . . . . . . . 278
7.2 Energy eigenstates in one dimension . . . . . . . . . . . . . . . . . . . . 281
7.3 Expectation and uncertainty . . . . . . . . . . . . . . . . . . . . . . . . 287
7.4 Wavepackets and Scatterings . . . . . . . . . . . . . . . . . . . . . . . . 290
7.5 Postulates for quantum mechanics . . . . . . . . . . . . . . . . . . . . . 296
7.6 Quantum mechanics in three dimensions . . . . . . . . . . . . . . . . . . 302
7.7 The hydrogen atom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308

8 Markov chains 313


8.1 Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
8.2 Classification of chains and states . . . . . . . . . . . . . . . . . . . . . . 316
8.3 Long-run behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
8.4 Time reversal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336

9 Groups, Rings and Modules 341


9.1 Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
9.2 Rings I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
9.3 Rings II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
9.4 Modules I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
9.5 Modules II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390

10 Complex Analysis and Methods 405


10.1 Basic notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
10.2 Conformal maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
10.3 Contour integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
10.4 Residue calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
10.5 Transform theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465

11 Geometry 473
11.1 Euclidean geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
11.2 Spherical geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
11.3 Triangulations and the Euler number . . . . . . . . . . . . . . . . . . . . 487
11.4 Hyperbolic geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
11.5 Smooth embedded surfaces (in R3 ) . . . . . . . . . . . . . . . . . . . . . 502
11.6 Abstract smooth surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . 513

12 Statistics 517
12.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
12.2 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
12.3 Linear models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
12.4 Rules of thumb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562

13 Numerical Analysis 565


13.1 Polynomial interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
13.2 Orthogonal polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
13.3 Approximation of linear functionals . . . . . . . . . . . . . . . . . . . . . 576
13.4 Ordinary differential equations . . . . . . . . . . . . . . . . . . . . . . . 585
13.5 Numerical linear algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
13.6 Linear least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608

14 Electromagnetism 617
14.1 Electrostatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619
14.2 Magnetostatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
14.3 Electrodynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
14.4 Electromagnetism and relativity . . . . . . . . . . . . . . . . . . . . . . 640

15 Fluid Dynamics 649


15.1 Parallel viscous flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
15.2 Kinematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
15.3 Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661
15.4 Inviscid irrotational flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
15.5 Water waves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682
15.6 Fluid dynamics on a rotating frame . . . . . . . . . . . . . . . . . . . . . 686

A Some useful results I


A.1 Integration and differentiation . . . . . . . . . . . . . . . . . . . . . . . . II
A.2 Coordinate systems and operators . . . . . . . . . . . . . . . . . . . . . III
A.3 Transform tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV
A.4 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V
A.5 Statistics tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VI

B List of symbols IX
Preface
This book is a compilation of Part IB course notes that I edited based on lectures,
supervisions, notes taken by Dexter Chua in the previous year, and notes taken by
various friends of mine. I edited and formatted these materials into one book so that
this is one coherent and complete set of notes for the whole part IB maths course. In
a sense, this is a fusion of different notes taken by me and various other people. That
being said, any errors or mistakes are mine. I believe (but make no guarantee) that
this book is a complete set of notes for part IB mathematics for the year 2016/2017.
However since courses and lecturers change year by year, as time passes this book may
resemble less and less the part IB mathematics course. Please use these notes at your
own risk.
I have highlighted titles of propositions and theorems, so hopefully they stand out. At
some places I also added in more materials, sometimes giving proof to propositions that
are only stated in lectures, and sometimes giving answers to a few selected questions
from example sheets. The book is made up of many little sections (or boxes). In
particular, a section labelled D stands for definition, C for commentary, L for
lemma, P for proposition, T for theorem and E for example or explanation.
Since now all courses are in this one book, I have removed repetition of contents in
different courses, and just reference the relevant bits either explicitly or implicitly. The
book is put together so that in theory one could (and should) read from beginning to
end; this might cause some problems for readers who are learning the courses simulta-
neously. For example, in this book the Methods course comes after the Linear Algebra
course since the start of the Methods course requires some knowledge of inner products
which the lecturer gave a short introduction of; however I removed this introduction
since inner products are covered in detail at the end of the Linear Algebra course. This
obviously isn’t a problem if one is reading this book from beginning to end in order, but
those who are doing courses simultaneously should bear this in mind. In particular,
knowledge of Metric and Topological Spaces, Linear Algebra, Analysis II and Methods
is heavily used or implied in subsequent courses, so one should get familiar with their
contents as early as possible.
I originally edited this book of notes for personal use. However I learned that people
might find it useful, so I decided to share it (for free, of course). The pdf version of
this and other notes can be found at the following URL:
https://1drv.ms/f/s!AtFdZ6-agiAQky4Y2DSTwhZeT7ha
Any future updates to any of the notes will be put at the above link (unless at some
point I fail to maintain it). Any comments, suggestions, or reports of typos and
errors are welcome at lamkingming@hotmail.com, or just drop me a
Facebook message.

King Ming Lam, September 2017.

CHAPTER 1

Metric and Topological Spaces


L. 1-1
Let f : X → Y be a function. If Uθ ⊆ Y for all θ ∈ Θ,
f⁻¹( ⋃_{θ∈Θ} Uθ ) = ⋃_{θ∈Θ} f⁻¹(Uθ)   and   f⁻¹( ⋂_{θ∈Θ} Uθ ) = ⋂_{θ∈Θ} f⁻¹(Uθ).

Also f⁻¹(Y) = X, f⁻¹(∅) = ∅ and f⁻¹(Y \ U) = X \ f⁻¹(U) for U ⊆ Y.

All easy to prove. Here is a proof of the last result: x ∈ f⁻¹(Y \ U) ⇔ f(x) ∈ Y \ U ⇔ f(x) ∉ U ⇔ x ∉ f⁻¹(U) ⇔ x ∈ X \ f⁻¹(U).
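For instance, take f : R → R, f(x) = x², with U₁ = [0, 1] and U₂ = [1, 4]. Then f⁻¹(U₁) = [−1, 1] and f⁻¹(U₂) = [−2, −1] ∪ [1, 2], and indeed f⁻¹(U₁ ∩ U₂) = f⁻¹({1}) = {−1, 1} = f⁻¹(U₁) ∩ f⁻¹(U₂), as the lemma predicts.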

1.1 Metric spaces


Given a set X, it is often helpful to have a notion of distance between points. This
distance function is known as the metric.

D. 1-2
• A metric space is a pair (X, dX ) where X is a set (the space ) and dX is a
function dX : X × X → R (the metric ) such that for all x, y, z ∈ X,

• dX (x, y) ≥ 0 (non-negativity)

• dX (x, y) = 0 iff x = y (identity of indiscernibles)

• dX (x, y) = dX (y, x) (symmetry)

• dX (x, z) ≤ dX (x, y) + dX (y, z) (triangle inequality)

It is clear that if (X, dX ) is a metric space and Y ⊆ X, then (Y, dY ) is also a


metric space, where dY (a, b) = dX (a, b). It’s said to be a subspace of X.

• Let (xn ) be a sequence in a metric space (X, dX ). We say (xn ) converges to


x ∈ X (written xn → x), if d(xn , x) → 0 (as a real sequence). Equivalently,

∀ε > 0, ∃N s.t. ∀n > N, d(xn , x) < ε.

• Let (X, dX ) and (Y, dY ) be metric spaces, and f : X → Y . We say f is (sequen-


tially) continuous if f (xn ) → f (x) (in Y ) whenever xn → x (in X).1

¹ An alternative definition of f being continuous is that ∀x ∈ X, ∀ε > 0, ∃δ > 0 s.t. ∀y ∈ X, dX(x, y) < δ ⇒ dY(f(x), f(y)) < ε. As we will show later, the two definitions are equivalent in metric spaces. However they are not equivalent in general topological spaces.


E. 1-3
The first condition in the definition of metric space is actually redundant because
2dX (x, y) = dX (x, y) + dX (y, x) ≥ dX (x, x) = 0.
C. 1-4
<Euclidean metric> Let X = Rⁿ and

d(v, w) = |v − w| = ( Σ_{i=1}^n (vi − wi)² )^{1/2}.

This is the usual notion of distance we have in the Rⁿ vector space. It is not
difficult to show that this is a metric (the fourth axiom follows from the Cauchy-Schwarz inequality).
<Discrete metric> Let X be a set, and
let dX(x, y) = 1 if x ≠ y, and dX(x, y) = 0 if x = y.

To show this is indeed a metric, we have to show it satisfies all the axioms. The
first three axioms are trivially satisfied. How about the fourth? We can prove this
by exhaustion. Since the distance function can only return 0 or 1, d(x, z) can be 0
or 1, while d(x, y) + d(y, z) can be 0, 1 or 2. For the fourth axiom to fail, we must
have RHS < LHS. This can only happen if the right hand side is 0. But for the
right hand side to be 0, we must have x = y = z. So the left hand side is also 0.
So the fourth axiom is always satisfied.
<Manhattan metric> Let X = R2 , and define the metric as

d(x, y) = d((x1 , x2 ), (y1 , y2 )) = |x1 − y1 | + |x2 − y2 |.

The first three axioms are again trivial. To prove the triangle inequality, we have

d(x, y) + d(y, z) = |x1 − y1| + |x2 − y2| + |y1 − z1| + |y2 − z2|
≥ |x1 − z1| + |x2 − z2| = d(x, z),

using the triangle inequality for R. This metric represents the distance you have
to walk from one point to another if you are only allowed to move horizontally and
vertically (and not diagonally).
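For a quick numerical check: d((0, 0), (3, 4)) = |3| + |4| = 7 in the Manhattan metric, whereas the Euclidean distance between the same two points is 5.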
<British railway metric> Let X = R2 . We define
d(x, y) = |x − y| if x = ky for some scalar k, and d(x, y) = |x| + |y| otherwise.

To explain the name of this metric, think of Britain with London as the origin.
Since the railway system is less than ideal, all trains go through London. For
example, if you want to go from Oxford to Cambridge, you first go from Oxford to
London, then London to Cambridge. So the distance traveled is the distance from
London to Oxford plus the distance from London to Cambridge. The exception
is when the two destinations lie along the same line, then you can directly take
1.1. METRIC SPACES 3

the train from one to the other without going through London, and hence the “if
x = ky” clause.
Another possible metric on X = R² is d : R² × R² → R defined by d(u, v) = ‖u‖2 + ‖v‖2 if u ≠ v, and d(u, v) = 0 if u = v (to get from A to B we always go via London).
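For example, in the first railway metric d((1, 0), (2, 0)) = |(1, 0) − (2, 0)| = 1, since (2, 0) = 2·(1, 0) lies on the same line through the origin, while d((1, 0), (0, 1)) = |(1, 0)| + |(0, 1)| = 2, since that journey has to pass through London (the origin).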

E. 1-5
• Sⁿ = {v ∈ Rⁿ⁺¹ : |v| = 1}, the n-dimensional sphere, is a subspace of Rⁿ⁺¹.
• Let (vn) be a sequence in Rᵏ with the Euclidean metric. Write vn = (vn^1, · · · , vn^k) and v = (v^1, · · · , v^k) ∈ Rᵏ. Then vn → v iff vn^i → v^i for all i.
• Let X have the discrete metric, and suppose xn → x. Pick ε = 1/2. Then there is some N such that d(xn, x) < 1/2 whenever n > N. But if d(xn, x) < 1/2, we must have d(xn, x) = 0. So xn = x. Hence if xn → x, then the xn are eventually all equal to x.
• Let X = R with the Euclidean metric. Let Y = R with the discrete metric. Then
f : X → Y that maps f (x) = x is not continuous. This is since 1/n → 0 in the
Euclidean metric, but not in the discrete metric. On the other hand, g : Y → X
by g(x) = x is continuous, since a sequence in Y that converges is eventually
constant.
P. 1-6
<Uniqueness of limits> If (X, d) is a metric space and (xn) is a sequence in X such that xn → x and xn → x′, then x = x′.

For any ε > 0, ∃N such that d(xn, x) < ε/2 if n > N. Similarly, there exists some N′ such that d(xn, x′) < ε/2 if n > N′. Hence if n > max(N, N′), then

0 ≤ d(x, x′) ≤ d(x, xn) + d(xn, x′) = d(xn, x) + d(xn, x′) < ε.

So 0 ≤ d(x, x′) < ε for all ε > 0. So d(x, x′) = 0, and x = x′.


Note that to prove the above, we used all of the four axioms.
D. 1-7
Let V be a vector space over R or C. A normed space is a pair (V, N), where the
function N : V → R (write N(v) = ‖v‖) is the norm on V, which satisfies
1. ‖v‖ ≥ 0 for all v ∈ V
2. ‖v‖ = 0 if and only if v = 0
3. ‖λv‖ = |λ|‖v‖
4. ‖v + w‖ ≤ ‖v‖ + ‖w‖
Let V be a real vector space. An inner product on V is a function M : V × V → R
(write M(u, v) = ⟨u, v⟩) such that
1. ⟨v, v⟩ ≥ 0 for all v ∈ V
2. ⟨v, v⟩ = 0 if and only if v = 0
3. ⟨v, w⟩ = ⟨w, v⟩
4. ⟨v1 + v2, w⟩ = ⟨v1, w⟩ + ⟨v2, w⟩
5. ⟨λv, w⟩ = λ⟨v, w⟩.

T. 1-8
<Cauchy-Schwarz inequality> If ⟨ , ⟩ is an inner product, then

⟨v, w⟩² ≤ ⟨v, v⟩⟨w, w⟩.

For any x, we have ⟨v + xw, v + xw⟩ = ⟨v, v⟩ + 2x⟨v, w⟩ + x²⟨w, w⟩ ≥ 0. Seen as a quadratic in x, since it is always non-negative, it can have at most one real root. So (2⟨v, w⟩)² − 4⟨v, v⟩⟨w, w⟩ ≤ 0. The result follows.
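As a concrete check in R² with the usual dot product: for v = (1, 2) and w = (3, 4) we get ⟨v, w⟩² = 11² = 121 and ⟨v, v⟩⟨w, w⟩ = 5 · 25 = 125, and indeed 121 ≤ 125.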

C. 1-9
Let V = Rⁿ. Possible norms on Rⁿ include:

‖v‖1 = Σ_{i=1}^n |vi|,   ‖v‖2 = ( Σ_{i=1}^n vi² )^{1/2},   ‖v‖∞ = max{|vi| : 1 ≤ i ≤ n}.

In general, for any 1 ≤ p < ∞ we can define the p-norm ‖v‖p = ( Σ_{i=1}^n |vi|ᵖ )^{1/p}, and ‖v‖∞ is the limit as p → ∞.

L. 1-10
i. If ⟨ , ⟩ is an inner product on V, then ‖v‖ = ⟨v, v⟩^{1/2} is a norm.
ii. If ‖ ‖ is a norm on V, then d(v, w) = ‖v − w‖ defines a metric on V.

i. 1. ‖v‖ = ⟨v, v⟩^{1/2} ≥ 0
   2. ‖v‖ = 0 ⇔ ⟨v, v⟩ = 0 ⇔ v = 0
   3. ‖λv‖ = ⟨λv, λv⟩^{1/2} = (λ²⟨v, v⟩)^{1/2} = |λ|‖v‖
   4. (‖v‖ + ‖w‖)² = ‖v‖² + 2‖v‖‖w‖ + ‖w‖² ≥ ⟨v, v⟩ + 2⟨v, w⟩ + ⟨w, w⟩ = ‖v + w‖², using the Cauchy-Schwarz inequality for the middle term.

ii. 1. d(v, w) = ‖v − w‖ ≥ 0 by the definition of the norm.
    2. d(v, w) = 0 ⇔ ‖v − w‖ = 0 ⇔ v − w = 0 ⇔ v = w.
    3. d(w, v) = ‖w − v‖ = ‖(−1)(v − w)‖ = |−1|‖v − w‖ = d(v, w).
    4. d(u, v) + d(v, w) = ‖u − v‖ + ‖v − w‖ ≥ ‖u − w‖ = d(u, w).

So inner products induce norms and hence metrics. For example the inner product ⟨v, w⟩ = Σ_{i=1}^n viwi on Rⁿ induces the ‖·‖2 norm, which induces the Euclidean metric. Norms that are derived from inner products satisfy the parallelogram law ‖u + v‖² + ‖u − v‖² = 2‖u‖² + 2‖v‖² since we can just “expand” it out, but this is not true for norms in general. Note also that although a norm naturally induces a metric, not all metrics can be derived from a norm. For example the discrete metric cannot be derived from a norm. Similarly, ‖·‖1 and ‖·‖∞ cannot be derived from inner products, since for a norm coming from an inner product ⟨v, v⟩ = Σ_{i,j} vivj⟨ei, ej⟩ is a quadratic polynomial in the coordinates of v, which ‖v‖1² and ‖v‖∞² are not.
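A quick way to see that ‖·‖1 fails the parallelogram law (and so cannot come from an inner product): in R² take u = (1, 0) and v = (0, 1). Then ‖u + v‖1² + ‖u − v‖1² = 4 + 4 = 8 while 2‖u‖1² + 2‖v‖1² = 4. The same two vectors show that ‖·‖∞ fails it as well (2 versus 4).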
1.1. METRIC SPACES 5

E. 1-11
<p-adic metric> Let p ∈ Z be a (fixed) prime number. For n ∈ Z we define |n|p = p^{−k}, where k is the highest power of p that divides n. If n = 0, we let |n|p = 0. For example, |20|2 = |2²·5|2 = 2⁻². For q = n/m ∈ Q we define |q|p = |n|p/|m|p. Some properties satisfied by |·|p are |a|p|b|p = |ab|p and |a + b|p ≤ max{|a|p, |b|p} ≤ |a|p + |b|p.
|·|p is sometimes called the p-adic norm, but it’s actually not a norm since it doesn’t satisfy condition 3 (‖λv‖ = |λ|‖v‖) of a norm. However it can still “induce” a metric. Take X = Q; then dp(a, b) = |a − b|p is a metric. This works because in the lemma above, to show that a norm induces a metric, we didn’t make use of the full ‖λv‖ = |λ|‖v‖; we only used ‖v‖ = ‖−v‖, which is true in this case.
Note that with respect to d2 we have 1, 2, 4, 8, 16, 32, · · · → 0, while 1, 2, 3, 4, · · · does not converge. We can also use it to prove certain number-theoretical results, but we will not go into details here.
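For a concrete feel for d2: d2(1, 33) = |1 − 33|2 = |−2⁵|2 = 2⁻⁵, so 1 and 33 are very close 2-adically, whereas d2(1, 2) = |−1|2 = 1.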
L. 1-12
Let f ∈ C[0, 1] satisfy f(x) ≥ 0 for all x ∈ [0, 1]. If f(x) is not constantly 0, then ∫_0^1 f(x) dx > 0.

Pick x0 ∈ [0, 1] with f(x0) = a > 0. Then since f is continuous, there is a δ > 0 such that |f(x) − f(x0)| < a/2 if |x − x0| < δ. So f(x) > a/2 in this region. Take

g(x) = a/2 if |x − x0| < δ, and g(x) = 0 otherwise.

Then f(x) ≥ g(x) for all x ∈ [0, 1]. So ∫_0^1 f(x) dx ≥ ∫_0^1 g(x) dx > 0, since g equals a/2 on a subinterval of [0, 1] of positive length.
C. 1-13
<Function space> C[0, 1], the set of all continuous functions on [0, 1], satisfies the axioms of a vector space (see IA Vectors and Matrices or [D.4-1]), so it forms a vector space. A possible metric on C[0, 1] is the uniform metric

d(f, g) = max_{x∈[0,1]} |f(x) − g(x)|.

The maximum always exists since continuous functions on [0, 1] are bounded and attain their bounds. Possible inner products on C[0, 1] include ⟨f, g⟩ = ∫_0^1 f(x)g(x) dx. Possible norms on C[0, 1] include:

‖f‖1 = ∫_0^1 |f(x)| dx,   ‖f‖2 = ( ∫_0^1 f(x)² dx )^{1/2},   ‖f‖∞ = max_{x∈[0,1]} |f(x)|

The first two are known as the L¹ and L² norms. The last is called the uniform norm or supremum norm, since it induces the uniform metric. In general we can also define the Lᵖ norm: ‖f‖p = ( ∫_0^1 |f(x)|ᵖ dx )^{1/p}.

We can check that all the above are indeed norms. For example, to show that the L² norm satisfies condition 4 (‖v + w‖ ≤ ‖v‖ + ‖w‖) we can use the Cauchy-Schwarz inequality for integrals, (∫fg)² ≤ (∫f²)(∫g²), proved in the Part IA course: (∫(f + g)²)^{1/2} = (∫f² + 2∫fg + ∫g²)^{1/2} ≤ (∫f²)^{1/2} + (∫g²)^{1/2}.

Another slightly tricky part is to show that ‖f‖ = 0 iff f = 0, which we obtain via
the lemma [L.1-12].
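For instance, taking f(x) = x we get ‖f‖1 = ∫_0^1 x dx = 1/2, ‖f‖2 = (∫_0^1 x² dx)^{1/2} = 1/√3 and ‖f‖∞ = 1, so the three norms really do measure the “size” of the same function differently.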

E. 1-14
Let F : C[0, 1] → R be defined by F(f) = f(1/2). Then this is continuous with respect to the uniform metric on C[0, 1] and the usual metric on R: Let fn → f in the uniform metric; we have to show that F(fn) → F(f), ie. fn(1/2) → f(1/2). This is easy, since we have

0 ≤ |F(fn) − F(f)| = |fn(1/2) − f(1/2)| ≤ max_{x∈[0,1]} |fn(x) − f(x)| → 0.

E. 1-15
Let X = C[0, 1], and let d1(f, g) = ‖f − g‖1 = ∫_0^1 |f(x) − g(x)| dx. Define the sequence

fn(x) = 1 − nx for x ∈ [0, 1/n], and fn(x) = 0 for x ≥ 1/n.

Now ‖fn‖1 = (1/2)·(1/n)·1 = 1/(2n) → 0 as n → ∞. So fn → 0 in (X, d1), where 0(x) = 0. On the other hand,

‖fn‖∞ = max_{x∈[0,1]} |fn(x)| = 1.

So fn ↛ 0 in the uniform metric.

So the function (C[0, 1], d1) → (C[0, 1], d∞) that maps f ↦ f is not continuous.
This is similar to the case that the identity function from the usual metric of R to
the discrete metric of R is not continuous.

Using the same example, we can show that the function G : (C[0, 1], d1 ) →
(R, usual) with G(f ) = f (0) is not continuous.

D. 1-16
• Let (X, d) be a metric space. We say U ⊆ X is an open subset if for every x ∈ U ,
∃δ > 0 such that d(x, y) < δ ⇒ y ∈ U . We say C ⊆ X is a closed subset if X \ C
is open.

• For any x ∈ X, r ∈ R: Br (x) = {y ∈ X : d(y, x) < r} is the open ball centered


at x; and B̄r (x) = {y ∈ X : d(y, x) ≤ r} is the closed ball centered at x.

• If x ∈ X, an open neighbourhood of x is an open U ⊆ X with x ∈ U .

• Let A ⊆ X. We call x ∈ X a limit point of A if there is a sequence xn → x such that xn ∈ A and xn ≠ x for all n.²

L. 1-17
The open ball Br (x) ⊆ X is an open subset, and the closed ball B̄r (x) ⊆ X is a
closed subset.

² Some authors drop the requirement xn ≠ x in the definition, but we are not going to.

Given y ∈ Br(x), we must find δ > 0 with Bδ(y) ⊆ Br(x). Since y ∈ Br(x), we must have a = d(y, x) < r. Let δ = r − a > 0. Then if z ∈ Bδ(y), then

d(z, x) ≤ d(z, y) + d(y, x) < (r − a) + a = r.

So z ∈ Br(x). So Bδ(y) ⊆ Br(x) as desired.
The second statement is equivalent to saying that X \ B̄r(x) = {y ∈ X : d(y, x) > r} is open. The proof is very similar.

E. 1-18
1. When X = R, Br(x) = (x − r, x + r) and B̄r(x) = [x − r, x + r].

2. When X = R²,
   i. If d is the metric induced by ‖v‖1 = |v1| + |v2|, then an open ball is a rotated square (a “diamond”).
   ii. If d is the metric induced by ‖v‖2 = (v1² + v2²)^{1/2}, then an open ball is an actual disk.
   iii. If d is the metric induced by ‖v‖∞ = max{|v1|, |v2|}, then an open ball is a square.

E. 1-19
Note that whether a subset A ⊆ X is open depends on both A and X, not just A. For example, [0, 1/2) is not an open subset of R, but is an open subset of [0, 1] (since it is B_{1/2}(0)), both with the Euclidean metric.

1. (0, 1) ⊆ R is open, while [0, 1] ⊆ R is closed. [0, 1) ⊆ R is neither closed nor


open.

2. Q ⊆ R is neither open nor closed, since any open interval contains both
rational numbers and irrational numbers. So any open interval (open ball)
cannot be a subset of Q or R \ Q.

3. Let X = [−1, 1] \ {0} with the Euclidean metric. Let A = [−1, 0) ⊆ X. Then A is open since it is equal to B1(−1). A is also closed since it is equal to B̄_{1/2}(−1/2).

L. 1-20
In a metric space, xn → x iff for all open neighbourhood U of x, ∃N such that
xn ∈ U for all n > N .

(Forward) Since U is open, there exists some δ > 0 such that Bδ (x) ⊆ U . Since
xn → x, ∃N such that d(xn , x) < δ for all n > N . This implies that xn ∈ Bδ (x)
for all n > N . So xn ∈ U for all n > N .

(Backward) Given any ε > 0, Bε (x) is open, so ∃N such that d(xn , x) < ε for all
n > N.

E. 1-21
• If A = (0, 1) ⊆ R, then 0 is a limit point of A, eg. take xn = 1/n.

• Every x ∈ R is a limit point of Q.

P. 1-22
Let X be a metric space. C ⊆ X is a closed subset if and only if C contains all of
its limit points.3

(Forward) Suppose C is closed and xn → x, xn ∈ C. We have to show that x ∈ C.

Since C is closed, A = X \ C ⊆ X is open. Suppose the contrary that x ∉ C. Then


x ∈ A. Hence A is an open neighbourhood of x. Then by our previous lemma,
we know that there is some N such that xn ∈ A for all n ≥ N . So xN ∈ A. But
we know that xN ∈ C by assumption. This is a contradiction. So we must have
x ∈ C.

(Backward) Suppose that C is not closed. We have to find a limit point not in C.

Since C is not closed, A is not open. So ∃x ∈ A such that Bδ(x) ⊄ A for all δ > 0.
This means that Bδ(x) ∩ C ≠ ∅ for all δ > 0. Pick xn ∈ B_{1/n}(x) ∩ C for each
n > 0. Then xn ∈ C, xn ≠ x and d(xn, x) ≤ 1/n → 0. So xn → x. So x is a limit
point of C which is not in C.

P. 1-23
Let (X, dx ) and (Y, dy ) be metric spaces, and f : X → Y . The following are
equivalent:
1. f is continuous
2. If xn → x, then f (xn ) → f (x)
3. For any closed subset C ⊆ Y , f −1 (C) is closed in X.
4. For any open subset U ⊆ Y , f −1 (U ) is open in X.
5. For any x ∈ X and ε > 0, ∃δ > 0 such that f (Bδ (x)) ⊆ Bε (f (x)). (Equiva-
lently, dX (x, z) < δ ⇒ dY (f (x), f (z)) < ε.)

1 ⇔ 2: By definition.

2 ⇒ 3: Suppose C ⊆ Y is closed. We want to show that f −1 (C) is closed. So let


xn → x, where xn ∈ f −1 (C) and xn 6= x.

We know that f (xn ) → f (x) by (2) and f (xn ) ∈ C. So f (x) is a limit


point of C. Since C is closed, f (x) ∈ C. So x ∈ f −1 (C). So every limit
point of f −1 (C) is in f −1 (C). So f −1 (C) is closed.

3 ⇒ 4: If U ⊆ Y is open, then Y \ U is closed in X. So f −1 (Y \ U ) = X \ f −1 (U )


is closed in X. So f −1 (U ) ⊆ X is open.

4 ⇒ 5: Given x ∈ X, ε > 0, Bε (f (x)) is open in Y . By (4), f −1 (Bε (f (x))) = A


is open in X. Since x ∈ A, ∃δ > 0 with Bδ (x) ⊆ A. So

f (Bδ (x)) ⊆ f (A) = f (f −1 (Bε (f (x)))) = Bε (f (x))


³ Some authors use this as the definition of a closed set.

5 ⇒ 2: Suppose xn → x. Given ε > 0, ∃δ > 0 such that f (Bδ (x)) ⊆ Bε (f (x)).


Since xn → x, ∃N such that xn ∈ Bδ(x) for all n > N. Then f(xn) ∈
f (Bδ (x)) ⊆ Bε (f (x)) for all n > N . So f (xn ) → f (x).
This result shows that in fact any of 2 to 5 can be used as the definition of
continuity.
E. 1-24
Let f : R³ → R be defined as f(x1, x2, x3) = x1² + x2⁴x3⁶ + x1⁸x3². This is continuous under the usual metric. So {x ∈ R³ : f(x) ≤ 1} = f⁻¹((−∞, 1]) is closed in R³.
E. 1-25
Define j_{α,β} : (C([0, 1]), ‖ ‖α) → (C([0, 1]), ‖ ‖β) by j_{α,β}(f) = f.
1. Show that j_{∞,1} and j_{∞,2} are continuous, but j_{1,∞} and j_{2,∞} are not.
2. By using the Cauchy-Schwarz inequality |⟨f, g⟩| ≤ ‖f‖2‖g‖2 show that j_{2,1} is continuous. Show that j_{1,2} is not.

1. Observe that

‖f‖1 = ∫_0^1 |f(t)| dt ≤ ∫_0^1 ‖f‖∞ dt = ‖f‖∞
‖f‖2² = ∫_0^1 |f(t)|² dt ≤ ∫_0^1 ‖f‖∞² dt = ‖f‖∞²,

so ‖f‖2 ≤ ‖f‖∞. Thus ‖fn − f‖∞ → 0 ⇒ ‖fn − f‖1, ‖fn − f‖2 → 0 and so j_{∞,1} and j_{∞,2} are continuous. However, if we put fn(t) = n^{1/3} max{0, 1 − nt}, then

‖fn − 0‖1 = ∫_0^1 fn(t) dt = n^{−2/3}/2 → 0
‖fn − 0‖2² = ∫_0^1 fn(t)² dt = n^{2/3} ∫_0^{1/n} (1 − nt)² dt = n^{2/3} [ −(1 − nt)³/(3n) ]_0^{1/n} = n^{−1/3}/3 → 0

as n → ∞, yet ‖fn − 0‖∞ = n^{1/3} → ∞, so j_{1,∞} and j_{2,∞} are not continuous.

2. Observe that

‖f‖1 = ∫_0^1 |f(t)| dt = ⟨|f|, 1⟩ ≤ ‖f‖2‖1‖2 ≤ ‖f‖2.

Thus ‖fn − f‖2 → 0 ⇒ ‖fn − f‖1 → 0 and so j_{2,1} is continuous. However, if we put fn(t) = n^{2/3} max{0, 1 − nt}, then ‖fn − 0‖1 = ∫_0^1 fn(t) dt = n^{−1/3}/2 → 0, yet

‖fn − 0‖2² = ∫_0^1 fn(t)² dt = n^{4/3} ∫_0^{1/n} (1 − nt)² dt = n^{1/3}/3 → ∞

as n → ∞, so j_{1,2} is not continuous.



L. 1-26
1. ∅ and X are open subsets of X.
2. If Vα ⊆ X is open for all α ∈ A, then U = ⋃_{α∈A} Vα is open in X.
3. If V1, · · · , Vn ⊆ X are open, then so is V = ⋂_{i=1}^n Vi.

1. ∅ satisfies the definition of an open subset vacuously. X is open since for any x, B1(x) ⊆ X.
2. If x ∈ U, then x ∈ Vα for some α. Since Vα is open, there exists δ > 0 such that Bδ(x) ⊆ Vα. So Bδ(x) ⊆ ⋃_{α∈A} Vα = U. So U is open.
3. If x ∈ V, then x ∈ Vi for all i = 1, · · · , n. So ∃δi > 0 with B_{δi}(x) ⊆ Vi. Take δ = min{δ1, · · · , δn}. Then Bδ(x) ⊆ Vi for all i. So Bδ(x) ⊆ V. So V is open.
Note that openness is preserved under arbitrary unions and finite intersections, but not under infinite intersections. For example, the intersection of all the sets (−1/n, 1/n) is {0}, which is not open.

1.2 Topological spaces


D. 1-27
• A topological space (X, τ ) is a set X (the space) together with a set τ ⊆ P(X)
(the topology ) such that:
1. ∅, X ∈ τ
2. If Vα ∈ τ for all α ∈ A, then ⋃_{α∈A} Vα ∈ τ.
3. If V1, · · · , Vn ∈ τ, then ⋂_{i=1}^n Vi ∈ τ.

The elements of X are the points . We extend the notion of open set by calling
the elements of τ the open subsets of X.
• Let (X, d) be a metric space, then the topology induced by d is the set of all open
sets of X under d. A notion or property is said to be a topological notion or
topological property if it only depends on the topology, and not the metric.
• Let f : X → Y be a map of topological spaces. Then f is continuous if f −1 (U )
is open in X whenever U is open in Y .
• Let (X, τ ) be a topological space and x ∈ X. We say that V is a open neighbourhood
of x if V ∈ τ and x ∈ V (same as the metric space definition). We say that N is
a neighbourhood of x if we can find U ∈ τ with x ∈ U ⊆ N .
E. 1-28
When the topologies are induced by metrics, the topological and metric notions
of continuous functions coincide, as we showed previously. The new definition of
continuity is more general in the sense that it also works for topological spaces not
induced by a metric.
Notions of limits and continuity (in a metric space) are in fact topological properties since they only depend on the topology (i.e. which sets are open), as shown previously; that’s also why we can define continuity solely using the topology.

C. 1-29
<Some common topology> Let X be any set.
1. τ = {∅, X} is the coarse topology or indiscrete topology on X.
2. τ = P(X) is the discrete topology on X; it is induced by the discrete metric.
3. τ = {A ⊆ X : X \ A is finite or A = ∅} is the cofinite topology on X.
4. Let X = R; then τ = {∅, R} ∪ {(a, ∞) : a ∈ R} is the right order topology on R.

E. 1-30
Note that if X is finite, then the cofinite topology is the same as the discrete
topology.

If F is a finite set with more than one point, the indiscrete topology on F is not induced by any metric: in any metric on F there is a minimum distance between distinct points, so every subset of F is open, and hence the induced topology must be the discrete topology.

L. 1-31
Let τ1 and τ2 be two topologies on the same space X. Then τ1 ⊆ τ2 if and only
if, given x ∈ U ∈ τ1 , we can find V ∈ τ2 such that x ∈ V ⊆ U .

(Forward) If τ1 ⊆ τ2 and x ∈ U ∈ τ1 , then setting V = U we automatically have


V ∈ τ2 and x ∈ V ⊆ U .

(Backward) Given any fixed U ∈ τ1, for each x ∈ U we can find Vx ∈ τ2 such that x ∈ Vx ⊆ U. Now U = ⋃_{x∈U} Vx ∈ τ2. Thus τ1 ⊆ τ2.

We have τ1 = τ2 if and only if, given x ∈ U ∈ τ1 , we can find V ∈ τ2 such that


x ∈ V ⊆ U and, given x ∈ U ∈ τ2 , we can find V ∈ τ1 such that x ∈ V ⊆ U .

E. 1-32
Let X = Rⁿ; the metrics induced by ‖·‖1, ‖·‖2 and ‖·‖∞ in fact all induce the same topology, as B^∞_{r/n}(x) ⊆ B^1_r(x) ⊆ B^2_r(x) ⊆ B^∞_r(x).

Example with more detail: as d1(x, y) = ‖x − y‖1 = Σ_{i=1}^n |xi − yi| and d∞(x, y) = ‖x − y‖∞ = max_{1≤i≤n} |xi − yi|, we have ‖v‖∞ ≤ ‖v‖1 ≤ n‖v‖∞, so B^∞_r(x) ⊇ B^1_r(x) ⊇ B^∞_{r/n}(x). Suppose that U is open with respect to d1. Given any x ∈ U, ∃δ > 0 s.t. B^1_δ(x) ⊆ U. So B^∞_{δ/n}(x) ⊆ B^1_δ(x) ⊆ U. So U is open with respect to d∞. The other direction is similar.

However if X = C[0, 1], then d1(f, g) = ‖f − g‖1 and d∞(f, g) = ‖f − g‖∞ do not induce the same topology, since (X, d1) → (X, d∞) given by f ↦ f is not continuous.

E. 1-33
• Any function f : X → Y is continuous if X has the discrete topology.

• Any function f : X → Y is continuous if Y has the coarse topology.

• If X and Y both have the cofinite topology, then f : X → Y is continuous iff f⁻¹({y}) is finite for every y ∈ Y or f is constant.

It’s because a constant map is clearly continuous, and if every f⁻¹({y}) is finite then for open U ⊆ Y either U = ∅, in which case trivial, or Y \ U is finite, so f⁻¹(Y \ U) = X \ f⁻¹(U) is also finite, i.e. f⁻¹(U) is open. For the other direction consider the open sets Y \ {y} ⊆ Y.
L. 1-34
If f : X → Y and g : Y → Z are continuous, then so is g ◦ f : X → Z.

If U ⊆ Z is open, g is continuous, then g −1 (U ) is open in Y . Since f is also


continuous, f −1 (g −1 (U )) = (g ◦ f )−1 (U ) is open in X.
E. 1-35
We usually show functions to be continuous by considering them as compositions
of simpler functions rather than using the definition directly.
For example, let R and R2 have their usual (Euclidean) metric, and suppose that
f : R → R and g : R → R are continuous. We show that m : R2 → R given by
m(x, y) = f (x)g(y) is continuous by showing that
(i) The map (f, g) : R2 → R2 is continuous.
(ii) The map M : R2 → R given by M (x, y) = xy is continuous.
L. 1-36
Let (X, τ ) be a topological space. Then U ∈ τ if and only if, given x ∈ U , we can
find a neighbourhood N of x with N ⊆ U .

(Forward) If U ∈ τ then U is a neighbourhood of x for all x ∈ U .


(Backward) Given any x ∈ U, we can find a neighbourhood Nx of x with Nx ⊆ U, so we can find an open neighbourhood Ux of x with Ux ⊆ Nx. We have U = ⋃_{x∈U} Ux ∈ τ.

Note that by how neighbourhood is defined, the result holds if we replace neigh-
bourhoods with open neighbourhoods. So the result can be written as: U ∈ τ if
and only if ∀x ∈ U, ∃Vx ∈ τ s.t. x ∈ Vx ⊆ U .
Note that this result is the analogue of how we define openness of a set in a metric space.
L. 1-37
Let (X, τ ) and (Y, σ) be topological spaces. Then f : X → Y is continuous if
and only if, given x ∈ X and M a neighbourhood of f (x) in Y , we can find a
neighbourhood N of x with f (N ) ⊆ M .

(Forward) If f : X → Y is continuous, x ∈ X and M is a neighbourhood of f (x),


then we can find a V ∈ σ with f (x) ∈ V ⊆ M . Since f is continuous f −1 (V ) ∈ τ .
Thus, since x ∈ f −1 (V ), we have that f −1 (V ) is an open neighbourhood and so a
neighbourhood of x. Setting N = f −1 (V ), we have f (N ) = V ⊆ M as required.
(Backward) Suppose that, given x ∈ X and M a neighbourhood of f (x) in Y , we
can find a neighbourhood N of x with f (N ) ⊆ M . Let V be open in Y . If x ∈ X
and f (x) ∈ V , then V is a neighbourhood of f (x) so there exists a neighbourhood
Nx of x with f(Nx) ⊆ V. We now choose Ux an open neighbourhood of x with Ux ⊆ Nx. We have f(Ux) ⊆ V and so Ux ⊆ f⁻¹(V) for all x ∈ f⁻¹(V). It follows that f⁻¹(V) = ⋃_{x∈f⁻¹(V)} Ux ∈ τ. We have shown that f is continuous.

We know that a function f between metric spaces is continuous iff ∀x ∈ X, ∀ε > 0,


∃δ > 0 s.t. f (Bδ (x)) ⊆ Bε (f (x). Note that this result is the analogous of this for
functions between topological spaces. In fact like in metric spaces we could define
continuity at a point x for functions between topological spaces: We say f : X → Y
is continuous at x ∈ X if given any neighbourhood M of f (x) in Y , we can find
a neighbourhood N of x such that f(N) ⊆ M. And this definition would be consistent with the definition of continuity at a point for metric spaces.
D. 1-38
A function f : X → Y between two topological spaces (X, τ ) and (Y, σ) is a
homeomorphism if f is a bijection and both f and f −1 are continuous.
Two spaces are homeomorphic if there exists a homeomorphism between them, and we write X ≃ Y.
E. 1-39
• Let X = C[0, 1] with the topology induced by ‖ ‖1 and Y = C[0, 1] with the topology induced by ‖ ‖∞. Then F : Y → X given by f ↦ f is continuous but F⁻¹ is not.
• Let X = [0, 2π) and Y = S¹ = {z ∈ C : |z| = 1}. Then f : X → Y given by f(x) = e^{ix} is continuous but its inverse is not.
L. 1-40
Homeomorphism is an equivalence relation.

1. The identity map I_X : X → X is always a homeomorphism. So X ≃ X.
2. If f : X → Y is a homeomorphism, then so is f⁻¹ : Y → X. So X ≃ Y ⇒ Y ≃ X.
3. If f : X → Y and g : Y → Z are homeomorphisms, then g ◦ f : X → Z is a homeomorphism. So X ≃ Y and Y ≃ Z implies X ≃ Z.
E. 1-41
Under the usual topology,
1. The open intervals (0, 1) ≃ (a, b) for all a, b ∈ R, using the homeomorphism x ↦ a + (b − a)x. Similarly, [0, 1] ≃ [a, b].
2. (−1, 1) ≃ R by x ↦ tan(πx/2).
3. R ≃ (0, ∞) by x ↦ eˣ.
4. (a, ∞) ≃ (b, ∞) by x ↦ x + (b − a).
The fact that ≃ is an equivalence relation implies that any two open intervals in R are homeomorphic.
C. 1-42
Homeomorphism and continuous function in topological spaces are the analogy of
isomorphisms and homomorphisms in groups. In group theory, we usually prove
that two groups are isomorphic by constructing an explicit isomorphism and that
two groups are not isomorphic by finding a group property exhibited by one but
not by the other. Similarly, in topology, we usually prove that two topological
spaces are homeomorphic by constructing an explicit homeomorphism and that
two topological spaces are not homeomorphic by finding a topological property

exhibited by one but not by the other. Later in this course we will meet some
topological properties like being Hausdorff and compactness.

D. 1-43
• A sequence xn converge to x (xn → x) if for every open neighbourhood U of x,
∃N such that xn ∈ U for all n > N . (Topological definition)4
• A topological space X is Hausdorff if for all x1 , x2 ∈ X with x1 6= x2 , there exists
open neighbourhoods U1 of x1 and U2 of x2 such that U1 ∩ U2 = ∅.
E. 1-44
1. If X has the coarse topology, then any sequence xn converges to every x ∈ X, since there is only one open neighbourhood of x.
2. If X has the cofinite topology and all xn s are distinct, then xn → x for every
x ∈ X, since every open set can only have finitely many xn not inside it.
Note that convergence only depends on the topology (not the metric, if one exists), as can be seen from [L.1-20] and our new topological definition of convergence. So convergence is a topological property.
Also it should be pointed out that for functions between topological spaces, in
general, f (xn ) → f (x) whenever xn → x does not mean f is continuous, although
the converse is true. A function satisfying f (xn ) → f (x) whenever xn → x is called
sequentially continuous . We already seen that this is equivalent to continuity in
a metric spaces, but this is not true in general for topological spaces.
L. 1-45
If X is Hausdorff and xn is a sequence in X with xn → x and xn → x′, then x = x′ (ie. limits are unique in a Hausdorff space).

Suppose the contrary, that x ≠ x′. Then by the definition of Hausdorff, there exist open neighbourhoods U, U′ of x, x′ respectively with U ∩ U′ = ∅.
Since xn → x and U is a neighbourhood of x, by definition there is some N such that whenever n > N, we have xn ∈ U. Similarly, since xn → x′, there is some N′ such that whenever n > N′, we have xn ∈ U′.
This means that whenever n > max(N, N′), we have xn ∈ U and xn ∈ U′. So xn ∈ U ∩ U′. This contradicts the fact that U ∩ U′ = ∅. Hence we must have x = x′.
Limits need not be unique in a general topological space. For example, let X =
{a, b} with a 6= b. If we give X the indiscrete topology, then, if we set xn = a for
all n, we have xn → a and xn → b.
E. 1-46
• If X has more than 1 element, then the coarse topology on X is not Hausdorff.
• If X has infinitely many elements, the cofinite topology on X is not Hausdorff.
• The discrete topology is always Hausdorff.
⁴ It might not be possible to prove the links between limits of sequences and topology that we would wish to be true. Deeper investigations into set theory reveal that sequences are inadequate tools for the study of topologies.

• If (X, d) is a metric space, the topology induced by d is Hausdorff: for x1 ≠ x2, let r = d(x1, x2) > 0. Then take Ui = B_{r/2}(xi). Then U1 ∩ U2 = ∅.
L. 1-47
Let X be a topological space
1. If Cα is a closed subset of X for all α ∈ A, then ⋂_{α∈A} Cα is closed in X.
2. If C1, · · · , Cn are closed in X, then so is ⋃_{i=1}^n Ci.

1. Since Cα is closed in X, X \ Cα is open in X. So ⋃_{α∈A} (X \ Cα) = X \ ⋂_{α∈A} Cα is open. So ⋂_{α∈A} Cα is closed.
2. If Ci is closed in X, then X \ Ci is open. So ⋂_{i=1}^n (X \ Ci) = X \ ⋃_{i=1}^n Ci is open. So ⋃_{i=1}^n Ci is closed.
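Note that the finiteness in 2 really is needed: in R the sets [1/n, 1] are closed for each n ≥ 2, but their union ⋃_{n≥2} [1/n, 1] = (0, 1] is not closed, since 0 is a limit point not in the set.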
P. 1-48
If X is Hausdorff and x ∈ X, then {x} is closed in X.

For every y ∈ X with y ≠ x, there exist open subsets Uy, Vy with y ∈ Uy, x ∈ Vy, Uy ∩ Vy = ∅. Let Cy = X \ Uy. Then Cy is closed, y ∉ Cy, x ∈ Cy. So {x} = ⋂_{y≠x} Cy is closed since it is an intersection of closed subsets.
Note that the converse is not true. If X has infinitely many elements, the cofinite
topology on X is not Hausdorff while {x} is closed in X for all x ∈ X since X \ {x}
is open.
D. 1-49
• Let (X, τ) be a topological space and A ⊆ X. Define

CA = {C closed in X : A ⊆ C}
OA = {U ∈ τ : U ⊆ A}

The closure of A in X is Cl(A) = Ā = ⋂_{C∈CA} C.
The interior of A in X is Int(A) = ⋃_{U∈OA} U.
• Define L(A) = {x ∈ X : ∃(xn) in A s.t. xn → x}.
• Let F be a closed subset of X. We say that A ⊆ X is a dense subset of F if Ā = F.
E. 1-50
Since Ā is defined as an intersection, we should make sure we are not taking an
intersection of no sets. Since X is closed in X (its complement ∅ is open), CA 6= ∅.
So we can safely take the intersection.
Note that Int A = {x ∈ A : ∃U ∈ τ with x ∈ U ⊆ A} and Cl A = {x ∈ X : ∀U ∈ τ with x ∈ U, we have A ∩ U ≠ ∅}.
P. 1-51
1. Ā is the smallest closed subset of X which contains A.
2. Int(A) is the largest open subset of X contained in A.

1. Since Ā is an intersection of closed sets, it is closed in X. Also, if C ∈ CA, then A ⊆ C. So A ⊆ ⋂_{C∈CA} C = Ā. Let K ⊆ X be a closed set containing A. Then K ∈ CA. So Ā = ⋂_{C∈CA} C ⊆ K, so Ā ⊆ K.

2. From the definition clearly Int A ⊆ A. Since the union of open sets is open, Int A ∈ τ. Let M ⊆ X be an open set contained in A. Then M ∈ OA, so M ⊆ ⋃_{U∈OA} U = Int(A).

P. 1-52
X \ Int(A) = Cl(X \ A) (equivalently (Int(A))ᶜ = Cl(Aᶜ)).

U ⊆ A ⇔ (X \ U ) ⊇ (X \ A). Also, U open in X ⇔ X \ U is closed in X.


So the complement of the largest open subset of X contained in A will be the
smallest closed subset containing X \ A.
E. 1-53
Consider R with its usual topology. If F is closed and F ⊆ (0, 1), there is a closed
G with F ⊆ G ⊆ (0, 1) and G 6= F . Thus there is no largest closed set contained
in (0, 1).
L. 1-54
If C ⊆ X is closed, then L(C) = C.

Similar to [P.1-22]. Note that C ⊆ L(C). Suppose xn → x with xn ∈ C. Since C


is closed, A = X \ C ⊆ X is open. Suppose x 6∈ C, then x ∈ A. Hence A is an
open neighbourhood of x. So by the topological definition of convergence there is
some N such that xn ∈ A for all n ≥ N . So xN ∈ A. But we know that xN ∈ C
by assumption. This is a contradiction. So we must have x ∈ C.
Note that the backward part of [P.1-22] doesn’t work for topological space, in fact
the converse is not true for topological spaces in general. This is also where the
proof for 2 ⇒ 3 in [P.1-23] fails for topological space, sequential continuity does
not imply continuity for functions between topological spaces.
P. 1-55
L(A) ⊆ Ā.

Note that if A ⊆ C, then L(A) ⊆ L(C). If C is closed, then L(C) = C. So C ∈ CA ⇒ L(A) ⊆ C. So L(A) ⊆ ⋂_{C∈CA} C = Ā.
L. 1-56
Suppose C ⊆ X is closed and A ⊆ C ⊆ L(A), then C = L(A) = Ā.

C ⊆ L(A) ⊆ Ā ⊆ C, where the last step is since Ā is the smallest closed set
containing A. So C = L(A) = Ā.
This is useful for finding the closure of subsets.
E. 1-57
• If (a, b) ⊆ R, then Cl((a, b)) = [a, b].
• If Q ⊆ R, then Q̄ = R. Also Cl(R \ Q) = R. So Q and R \ Q are both dense in R with the usual topology. Note also Int(Q) = Int(R \ Q) = ∅.
• In Rⁿ with the Euclidean metric, Cl(Br(x)) = B̄r(x). In general, Cl(Br(x)) ⊆ B̄r(x), since B̄r(x) is closed and Br(x) ⊆ B̄r(x), but these need not be equal.
For example, if X has the discrete metric, then B1(x) = {x}. Then Cl(B1(x)) = {x}, but B̄1(x) = X.

E. 1-58
1. Let (X, τ ) be a topological space and (Y, d) a metric space. If f, g : X → Y are
continuous show that E = {x ∈ X : f (x) = g(x)} is closed. If now f (x) = g(x)
for all x ∈ A, where A is dense in X, show that f (x) = g(x) for all x ∈ X.
2. Consider the unit interval [0, 1] with the Euclidean metric and A = [0, 1] ∩ Q
with the inherited metric. Exhibit, with proof, a continuous map f : A → R
(where R has the standard metric) such that there does not exist a continuous
map f˜ : [0, 1] → R with f˜(x) = f (x) for all x ∈ [0, 1].

1. We show that the complement of E is open. Suppose b ∈ X \ E. Then


f (b) 6= g(b). We can find open sets U and V such that f (b) ∈ U , g(b) ∈ V and
U ∩ V = ∅. Now f −1 (U ) is open, as is g −1 (V ), so b ∈ f −1 (U ) ∩ g −1 (V ) ∈ τ .
But f −1 (U ) ∩ g −1 (V ) ⊆ X \ E. Thus X \ E is open and we are done.
We have A ⊆ E and E closed so X = Cl A ⊆ E = X and E = X.
2. We observe that x ∈ A ⇒ x² ≠ 1/2. If x ∈ A, set

f(x) = 0 if x² < 1/2, and f(x) = 1 otherwise.

Observe that, if y ∈ A and y² < 1/2, we can find a δ > 0 such that

|y − x| < δ ⇒ x² < 1/2 ⇒ f(x) = f(y).

Similarly if y ∈ A and y² > 1/2 we can find a δ > 0 such that |y − x| < δ ⇒ f(x) = f(y). Thus f is continuous.
Suppose that f̃ : [0, 1] → R is such that f̃(x) = f(x) for all x ∈ A. Choose pn, qn ∈ A such that pn² > 1/2 > qn² and |pn − 2^{−1/2}|, |qn − 2^{−1/2}| → 0. Then f̃ cannot be continuous since

|f̃(pn) − f̃(2^{−1/2})| + |f̃(qn) − f̃(2^{−1/2})| ≥ |f̃(pn) − f̃(qn)| = 1.
L. 1-59
Let X be a space and let H be a collection of some subsets of X. Then there
exists a unique topology τH such that (i)[[ τH ⊇ H ]], and (ii)[[ if τ is a topology
with τ ⊇ H, then τ ⊇ τH ]].

(Uniqueness) Suppose that σ and σ′ are topologies satisfying (i) and (ii). By (i) for σ and (ii) for σ′, we have σ ⊇ σ′, and by (i) for σ′ and (ii) for σ, we have σ′ ⊇ σ. Thus σ = σ′.

(Existence) Let T be the set of topologies τ with τ ⊇ H. Since the discrete topology contains H, T is non-empty. Set τH = ⋂_{τ∈T} τ.
By construction, τH ⊇ H and τ ⊇ τH whenever τ ∈ T. Thus we need only show that τH is a topology, and this we now do.
• ∅, X ∈ τ for all τ ∈ T, so ∅, X ∈ τH.
• If Uα ∈ τH for all α ∈ A, then Uα ∈ τ for all α ∈ A and all τ ∈ T, so ⋃_{α∈A} Uα ∈ τ for all τ ∈ T, whence ⋃_{α∈A} Uα ∈ τH.
• If Uj ∈ τH for 1 ≤ j ≤ n, then Uj ∈ τ for all 1 ≤ j ≤ n and all τ ∈ T, so ⋂_{j=1}^n Uj ∈ τ for all τ ∈ T, whence ⋂_{j=1}^n Uj ∈ τH.

We call τH the smallest (or coarsest) topology containing H. However note that
there need not exist a largest topology contained in H. For example let X =
{1, 2, 3} and θ = {∅, {1}, {2}, X}. There does not exist a topology τ1 ⊆ θ such
that, if τ ⊆ θ is a topology, then τ ⊆ τ1 . Thus there does not exist a largest
topology contained in θ.

L. 1-60
Suppose that A is non-empty, the spaces (Xα , τα ) are topological spaces and we
have maps fα : X → Xα [α ∈ A]. Then there is a smallest topology τ on X for
which the maps fα are continuous.

A topology σ on X makes all the fα continuous if and only if it contains H = {fα⁻¹(U) : U ∈ τα, α ∈ A}. Now apply the above lemma.

D. 1-61
• Let (X, τ ) be a topological space and Y ⊆ X. The subspace topology τY on Y
induced by τ is given by τY = {Y ∩ U : U ∈ τ }.

• If Y ⊆ X, the inclusion function is ι : Y → X that sends y ↦ y.

• Write Dⁿ = {v ∈ Rⁿ : |v| ≤ 1}, the n-dimensional closed unit disk. Write Sⁿ⁻¹ = {v ∈ Rⁿ : |v| = 1}, the (n − 1)-dimensional sphere.

P. 1-62
If (X, τ ) is a topological space and Y ⊆ X, then the subspace topology τY on Y is
the smallest topology on Y for which the inclusion map is continuous.5

Let ι : Y → X be the inclusion map. Since Y ∩ U = ι−1 (U ), the smallest topology


on Y for which the inclusion map is continuous contains σ = {Y ∩ U : U ∈ τ }.
The result will follow if we show that τY = σ is a topology on Y :

1. ∅ = Y ∩ ∅ ∈ τY and Y = Y ∩ X ∈ τY .

2. If Vα ∈ τY, then Vα = Y ∩ Uα for some Uα ∈ τ, so

⋃_{α∈A} Vα = ⋃_{α∈A} (Y ∩ Uα) = Y ∩ ( ⋃_{α∈A} Uα ) ∈ τY

3. If Vi ∈ τY, then Vi = Y ∩ Ui for some Ui ∈ τ, so

⋂_{i=1}^n Vi = ⋂_{i=1}^n (Y ∩ Ui) = Y ∩ ( ⋂_{i=1}^n Ui ) ∈ τY

E. 1-63
If (X, d) is a metric space and Y ⊆ X, then the metric topology on (Y, d) is the subspace topology, since B^Y_r(y) = Y ∩ B^X_r(y).

P. 1-64
If Y ⊆ X has the subspace topology, then f : Z → Y is continuous iff ι◦f : Z → X
is continuous.

⁵ Some use this as the definition of the subspace topology.

(Forward) ι is continuous. So if f is continuous, so is ι ◦ f .


(Backward) Suppose we know that ι ◦ f is continuous. Given V ⊆ Y open, we know that V = Y ∩ U = ι⁻¹(U) for some open U ⊆ X. So f⁻¹(V) = f⁻¹(ι⁻¹(U)) = (ι ◦ f)⁻¹(U) is open since ι ◦ f is continuous. So f is continuous.
The converse is also true, i.e. Y ⊆ X has the subspace topology if “f : Z → Y
is continuous iff ι ◦ f : Z → X is continuous” holds. To see this suppose Y ⊆ X
is equip with a topology that is not the subspace topology and let Y 0 be Y but
with the subspace topology. Either there exist U open in Y but not Y 0 , or there
exist U open in Y 0 but not Y . In the former we can take f = id : Y 0 → Y , then
ι ◦ f is continuous while f is not. In the latter, we can take f : {0, 1} → Y such
that f (0) ∈ U and f (1) ∈ U c , then ι ◦ f is continuous while f is not. So in fact
this property characterise the subspace topology in the sense that it can be used
to define the subspace topology on Y .
E. 1-65
Int(Dⁿ) = {v ∈ Rⁿ : |v| < 1} = B1(0).
This is, in fact, homeomorphic to Rⁿ. To show this, we can first pick our favourite homeomorphism f : [0, 1) → [1, ∞). Then v ↦ f(|v|)v is a homeomorphism Int(Dⁿ) → Rⁿ.
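One concrete choice (among many) is f(t) = 1/(1 − t), which gives the map v ↦ v/(1 − |v|); its inverse is w ↦ w/(1 + |w|), and both maps are continuous, so indeed Int(Dⁿ) ≃ Rⁿ.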
D. 1-66
Let X be a set. A collection B of subsets is called a basis if (i)[[ ⋃_{B∈B} B = X ]] and (ii)[[ If B1, B2 ∈ B and x ∈ B1 ∩ B2, then ∃B3 ∈ B such that x ∈ B3 ⊆ B1 ∩ B2 ]].
Let τ be a topology on X. A subset B ⊆ τ is a basis of τ if “U ∈ τ iff U is a union of sets in B”.⁶
L. 1-67
Let X be a set and B a collection of subsets of X. Let τB be the collection of sets
U such that, whenever x ∈ U , we can find a B ∈ B such that x ∈ B ⊆ U . Then
τB is a topology if and only if B is a basis.

(Forward) If τB is a topology, then X ∈ τB and so for each x ∈ X we can find a Bx ∈ B with x ∈ Bx ⊆ X. Thus ⋃_{B∈B} B ⊇ ⋃_{x∈X} Bx ⊇ ⋃_{x∈X} {x} = X, so ⋃_{B∈B} B = X.
Next we observe that, by definition, B ⊆ τB. Thus if B1, B2 ∈ B we must have B1 ∩ B2 ∈ τB and, by definition, if x ∈ B1 ∩ B2 we can find a B3 ∈ B such that x ∈ B3 ⊆ B1 ∩ B2. Thus B is a basis.
(Backward) Suppose that B is a basis. We observe that, using the definition, B ⊆ τB and whenever A ⊆ τB we have ⋃_{A∈A} A ∈ τB.
We have ∅ ∈ τB vacuously and, by the definition of a basis, X = ⋃_{B∈B} B ∈ τB. Finally, if U1, U2 ∈ τB then whenever x ∈ U1 ∩ U2 we can find B1, B2 ∈ B with x ∈ B1 ⊆ U1, x ∈ B2 ⊆ U2. By the definition of a basis, we can find B3 ∈ B with x ∈ B3 ⊆ B1 ∩ B2 ⊆ U1 ∩ U2. Thus U1 ∩ U2 ∈ τB. Thus τB is a topology.
Note that the condition “τB is the collection of sets U such that, whenever x ∈ U, we can find a B ∈ B such that x ∈ B ⊆ U” is equivalent to “τB is the collection of sets U such that U is empty or is a union of sets in B”.
⁶ Here the union can be empty. We use the convention that the union of an empty collection of sets is the empty set.

Condition (ii) of a basis might seem strange. But as can be seen from the proof, (ii) ensures that (finite) intersections of sets in τB are in τB, i.e. (finite) intersections of open sets are open.

D. 1-68
• Let (X, τ ) and (Y, σ) be topological spaces. The product topology λ on X × Y is
given by: U ∈ λ iff ∀(x, y) ∈ U , ∃Vx ∈ τ, ∃Wy ∈ σ s.t. (x, y) ∈ Vx × Wy ⊆ U .

• If X and Y are sets the projection maps πX : X × Y → X and πY : X × Y → Y


are given by πX (x, y) = x and πY (x, y) = y.

P. 1-69
If (X, τ ) and (Y, σ) are topological spaces, then the product topology µ on X × Y
is the smallest topology on X × Y for which the projection maps πX and πY are
continuous.7

Let µ be the product topology on X × Y . Let λ be the smallest topology on X × Y


for which the projection maps πX and πY are continuous.
If U ∈ τ, then U × Y = πX⁻¹(U) ∈ λ since πX is continuous. Similarly, if V ∈ σ then X × V ∈ λ. Thus U × V = (U × Y) ∩ (X × V) ∈ λ.

If E ∈ µ then, given (x, y) ∈ E, we can find U_{(x,y)} ∈ τ and V_{(x,y)} ∈ σ such that (x, y) ∈ U_{(x,y)} × V_{(x,y)} ⊆ E. So E = ⋃_{(x,y)∈E} U_{(x,y)} × V_{(x,y)} and, since the union of open sets is open, E ∈ λ. Thus µ ⊆ λ. We check that µ is a topology as follows.

• ∅ ∈ µ vacuously. If (x, y) ∈ X × Y, then X ∈ τ, Y ∈ σ and (x, y) ∈ X × Y ⊆ X × Y. Thus X × Y ∈ µ.

• Suppose Eα ∈ µ for all α ∈ A. If (x, y) ∈ ⋃_{α∈A} Eα, then (x, y) ∈ Eβ for some β ∈ A. Then ⋃_{α∈A} Eα ∈ µ since we can find U ∈ τ and V ∈ σ such that

(x, y) ∈ U × V ⊆ Eβ ⊆ ⋃_{α∈A} Eα.

• Suppose Ej ∈ µ for all 1 ≤ j ≤ n. If (x, y) ∈ ⋂_{j=1}^n Ej, then (x, y) ∈ Ej for all 1 ≤ j ≤ n. We can find Uj ∈ τ and Vj ∈ σ such that (x, y) ∈ Uj × Vj ⊆ Ej and so

(x, y) ∈ ⋂_{j=1}^n Uj × ⋂_{j=1}^n Vj ⊆ ⋂_{j=1}^n Ej.

Since ⋂_{j=1}^n Uj ∈ τ and ⋂_{j=1}^n Vj ∈ σ, we have shown that ⋂_{j=1}^n Ej ∈ µ.

Finally, we observe that, if U ∈ τ, then πX⁻¹(U) = U × Y and (x, y) ∈ U × Y ⊆ πX⁻¹(U) with U ∈ τ, Y ∈ σ, so πX⁻¹(U) ∈ µ. Thus πX : X × Y → X is continuous if we give X × Y the topology µ. A similar result holds for πY so, by the minimality of λ, µ = λ.

P. 1-70
If X × Y has the product topology, then f : Z 7→ X × Y is continuous iff πX ◦ f
and πY ◦ f are continuous.

⁷ Some use this as the definition of the product topology.

(Forward) Follows from πX and πY being continuous.

(Backward) Suppose W ⊆ X × Y is open. Given any z ∈ f⁻¹(W), say f(z) = (x, y), then (x, y) ∈ W, so there exist open Ux ⊆ X and Vy ⊆ Y such that (x, y) ∈ Ux × Vy ⊆ W. Now Ux × Vy = πX⁻¹(Ux) ∩ πY⁻¹(Vy), so

z ∈ (πX ◦ f)⁻¹(Ux) ∩ (πY ◦ f)⁻¹(Vy) = f⁻¹(Ux × Vy) ⊆ f⁻¹(W)

So f⁻¹(W) is open.

E. 1-71
• If V ⊆ X and W ⊆ Y are open, then V × W ⊆ X × Y is open. (take VX = V ,
WY = W )

• Note that our definition of the product topology is rather similar to the definition
of open sets for metrics. We have a special class of subsets of the form V × W ,
and a subset U is open iff every point x ∈ U is contained in some V × W ⊆ U . In
some sense, these subsets “generate” the open sets.
Alternatively, if U ⊆ X × Y is open, then U = ⋃_{(x,y)∈U} Vx × Wy. So U ⊆ X × Y is open if and only if it is a union of members of our special class of subsets. We call this special class the basis.

 {V × W : V ⊆ X, W ⊆ Y are open} is a basis for the product topology on X × Y.

 If (X, d) is a metric space, then {B_{1/n}(x) : n ∈ N⁺, x ∈ X} is a basis for the topology induced by d.

E. 1-72
Suppose that (X, τ ) and (Y, σ) are topological spaces and we give X×Y the product
topology µ. Now fix x ∈ X and give E = {x} × Y the subspace topology µE . Show
that the map k : (Y, σ) → (E, µE ) given by k(y) = (x, y) is a homeomorphism.

We observe that k is a bijection. If U is open in (Y, σ), then X × U ∈ µ so k(U) = {x} × U = E ∩ (X × U) ∈ µE. Thus k⁻¹ is continuous.

If W is open in (E, µE) then W = E ∩ H for some H ∈ µ. If (x, y) ∈ W, then, by definition, we can find J ∈ τ, I ∈ σ such that (x, y) ∈ J × I ⊆ H. Thus y ∈ I ⊆ k⁻¹(W) with I ∈ σ. So k⁻¹(W) is open. Thus k is continuous.

E. 1-73
Let (X1, d1) and (X2, d2) be metric spaces. Let τ be the product topology on X1 × X2 where Xj has the topology induced by dj [j = 1, 2]. Define ρk : (X1 × X2)² → R by

ρ1((x, y), (u, v)) = d1(x, u) + d2(y, v),
ρ2((x, y), (u, v)) = max(d1(x, u), d2(y, v)),
ρ3((x, y), (u, v)) = (d1(x, u)² + d2(y, v)²)^{1/2}.

Show that the ρk are metrics. Show that each of the ρk induces the product topology τ on X1 × X2.

It is easy to show that ρk(u, v) = 0 iff u = v and that ρk(u, v) = ρk(v, u). Also, writing x = (x1, x2) etc.,

ρ1(x, y) + ρ1(y, z) = d1(x1, y1) + d1(y1, z1) + d2(x2, y2) + d2(y2, z2) ≥ ρ1(x, z)
ρ2(x, y) + ρ2(y, z) = max{d1(x1, y1), d2(x2, y2)} + max{d1(y1, z1), d2(y2, z2)}
  ≥ max{d1(x1, y1) + d1(y1, z1), d2(x2, y2) + d2(y2, z2)} ≥ ρ2(x, z)
ρ3(x, y) + ρ3(y, z) = (d1(x1, y1)² + d2(x2, y2)²)^{1/2} + (d1(y1, z1)² + d2(y2, z2)²)^{1/2}
  ≥ ((d1(x1, y1) + d1(y1, z1))² + (d2(x2, y2) + d2(y2, z2))²)^{1/2} ≥ ρ3(x, z)

where the second to last inequality is due to the triangle inequality ‖u‖ + ‖v‖ ≥ ‖u + v‖, where ‖ ‖ is the usual Euclidean norm. So they are metrics. Suppose ρi induces the topology τi. We now use [L.1-31].

Given any U ∈ τ and any (x1, x2) ∈ U, there exist Vi open in Xi such that (x1, x2) ∈ V1 × V2 ⊆ U. Since Vi is open in Xi and xi ∈ Vi, there exist ri > 0 such that xi ∈ B^{di}_{ri}(xi) ⊆ Vi. Let r = min{r1, r2}; then (x1, x2) ∈ B^{d1}_r(x1) × B^{d2}_r(x2) ⊆ U. Now (x1, x2) ∈ B^{ρi}_r(x1, x2) ⊆ B^{d1}_r(x1) × B^{d2}_r(x2) ⊆ U for all i = 1, 2, 3. So τi ⊇ τ for all i.

Given any Wi ∈ τi and any (x1, x2) ∈ Wi, there exists Ri > 0 such that (x1, x2) ∈ B^{ρi}_{Ri}(x1, x2) ⊆ Wi. Now (x1, x2) ∈ B^{d1}_{Ri/2}(x1) × B^{d2}_{Ri/2}(x2) ⊆ B^{ρi}_{Ri}(x1, x2) ⊆ Wi. So τi ⊆ τ for all i.
E. 1-74
• The product topology on R × R is same as the topology induced by the k k∞ ,
hence is also the same as the topology induced by k k2 or k k1 .[E.1-33] Similarly,
the product topology on Rn = Rn−1 × R is also the same as that induced by k k∞ .
• (0, 1) × (0, 1) × · · · × (0, 1) = (0, 1)n ⊆ Rn is the open n−dimensional cube in Rn .
Since (0, 1) ' R, we have (0, 1)N ' RN ' Int(Dn ).
• [0, 1]×S n ' [1, 2]×S n ' {v ∈ Rn+1 : 1 ≤ |v| ≤ 2}, where the last homeomorphism
is given by (t, w) 7→ tw with inverse v 7→ (|v|, v̂). This is a thickened sphere.
• Let A ⊆ {(r, z) : r > 0} ⊆ R2 , R(A) be the set obtained by rotating A around the
z axis. Then R(A) ' S × A by (x, y, z) = (v, z) 7→ (v̂, (|v|, z)). In particular, if A
is a circle, then R(A) ' S 1 × S 1 = T 2 is the two-dimensional torus.
z

D. 1-75
• If X is a set and ∼ is an equivalence relation on X, then the quotient X/∼ is the
set of equivalence classes. The projection q : X → X/∼ is defined as q(x) = [x],
the equivalence class containing x.
• If (X, τ ) is a topological space and ∼ an equivalence relation on X, the quotient topology
σ on X/∼ is given by σ = {U ⊆ X/∼ : q −1 (U ) ∈ τ }.
1.2. TOPOLOGICAL SPACES 23

P. 1-76
Let (X, τ ) be a topological space and ∼ an equivalence relation on X. The quotient
topology σ is the largest topology on X/ ∼ for which q is continuous.

For all U ∈ σ, q −1 (U ) ∈ τ , so q is continuous. Suppose σ 0 is a topology on X/ ∼


for which q is continuous. If U ∈ σ 0 , then q −1 (U ) ∈ τ , so U ∈ σ, so σ 0 ⊆ σ.
P. 1-77
Let (X, τ ) and (Y, ρ) be a topological space, ∼ an equivalence relation on X and σ
the quotient topology on X/∼. Then f : X/∼ → Y is continuous iff f ◦ q : X → Y
is continuous.

(Forward) Since q is continuous.


(Backward) Given any U ∈ ρ, (f ◦q)−1 (U ) = q −1 (f −1 (U )) ∈ τ , so f −1 (U ) ∈ σ.
E. 1-78
• We can think of the quotient as “gluing” the points identified by ∼ together. Note
that even if X is Hausdorff, X/∼ may not be! For example, R/Q is not Hausdorff.
• Let X = R, x ∼ y if x − y ∈ Z. Then X/∼ = R/Z ' S 1 , given by [x] 7→
(cos 2πx, sin 2πx).
Let X = R2 . Then v ∼ w iff v − w ∈ Z2 . Then X/∼ = R2 /Z2 = (R/Z) × (R/Z) '
S 1 × S 1 = T 2 . Similarly, Rn /Zn = T n = S 1 × S 1 × · · · × S 1 .
• If A ⊆ X, define ∼ by x ∼ y iff x = y or x, y ∈ A. This glues everything in A
together and leaves everything else alone. We often write this as X/A. Note that
this is not consistent with the notation we just used above!
 Let X = [0, 1] and A = {0, 1}, then X/A ' S 1 by, say, t 7→ (cos 2πt, sin 2πt).
Intuitively, the equivalence relation says that the two end points of [0, 1] are
“the same”. So we join the ends together to get a circle.

 Let X = Dn and A = S n−1 . Then X/A ∼ S n . This can be pictured as pulling


the boundary of the disk together to a point to create a closed surface

• Let X = [0, 1] × [0, 1] with ∼ given by (0, y) ∼ (1, y) and (x, 0) ∼ (x, 1), then
X/∼ ' S 1 × S 1 = T 2 , by, say (x, y) 7→ (cos 2πx, sin 2πx), (cos 2πy, sin 2πy)

Similarly, T 3 = [0, 1]3 /∼, where the equivalence is analogous to above.


24 CHAPTER 1. METRIC AND TOPOLOGICAL SPACES

L. 1-79
1. If (X, τ ) is a Hausdorff topological space and Y ⊆ X, then Y with the subspace
topology is also Hausdorff.
2. If (X, τ ) and (Y, σ) are Hausdorff topological spaces, then X × Y with the
product topology is also Hausdorff.

1. Given any distinct u, v ∈ Y we have u, v ∈ X, so ∃U, V ∈ τ , neighbourhoods


of u, v such that U ∩ V = ∅. Now U ∩ Y and V ∩ Y are neighbourhoods of u
and v in Y , also (U ∩ Y ) ∩ (V ∩ Y ) = ∅.
2. Given any distinct (x1 , y1 ), (x2 , y2 ) ∈ X × Y , wlog assume x1 6= x2 . Since
(X, τ ) is Hausdorff, we can find U1 , U2 disjoint open neighbourhoods of x1 and
x2 . We observe that U1 × Y and U2 × Y are disjoint open neighbourhoods of
(x1 , y1 ) and (x2 , y2 ), so we are done.
However this is not true for quotient topology.
D. 1-80
A topological group G is a group G that is also a topological space and that
the group operations of product G × G → G : (x, y) 7→ xy and taking inverses
G → G : x 7→ x−1 are continuous. Here G × G is viewed as a topological space
with the product topology.
E. 1-81
A topological group is a mathematical object with both an algebraic structure
and a topological structure. Typical examples of topological group include the
matrix groups such as SO(R3 ) and U(C3 ) (with the topology defined by viewing
the matrices as subspace of R3×3 /C3×3 ). Other examples includes S 1 = R/Z (view
as unit circle in complex plane), equivalently this is U(C1 ).

1.3 Connectivity
D. 1-82
A topological space X is disconnected if X can be written as A ∪ B, where A and
B are disjoint, non-empty open subsets of X. We say A and B disconnect X. A
space is connected if it is not disconnected.
If E is a subset of a topological space (X, τ ), then E is called connected if the
subspace topology on E is connected. The is equivalent to the condition that we
can find open sets U and V such that U ∪ V ⊇ E, U ∩ V ∩ E = ∅, U ∩ E 6= ∅ and
V ∩ E 6= ∅. Similar definition for disconnection.
E. 1-83
Intuitively, we would want to say that a space is “connected” if it is one-piece. For
example, R is connected, while R \ {0} is not. We will come up with two different
definitions of connectivity - normal connectivity and path connectivity, where the
latter implies the former, but not the other way round.
Note that being connected is a property of a space, not a subset. When we say “A
is a connected subset of X”, it means A is connected with the subspace topology
inherited from X.
1.3. CONNECTIVITY 25

Being (dis)connected is a topological property, ie. if X is (dis)connected, and X '


Y , then Y is (dis) connected. To show this, let f : X → Y be the homeomorphism.
By definition, A is open in X iff f (A) is open in Y . So A and B disconnect X iff
f (A) and f (B) disconnect Y .
E. 1-84
• If X has the coarse topology, it is connected.
• If X has the discrete topology and at least 2 elements, it is disconnected.
• Let X ⊆ R. If there is some α ∈ R \ X such that there is some a, b ∈ X with
a < α < b, then X is disconnected. In particular, X ∩ (−∞, α) and X ∩ (α, ∞)
disconnect X. For example, (0, 1) ∪ (1, 2) is disconnected (α = 1).
P. 1-85
The topological space (X, τ ) is disconnected iff there exists a continuous surjective
f : (X, τ ) → ({0, 1}, ∆) where ∆ is the discrete topology on {0, 1}.
(Alternatively, The topological space (X, τ ) is connected iff every continuous map
f : (X, τ ) → ({0, 1}, ∆) is constant.)

(Forward) If A and B disconnect X, define


(
0 x∈A
f (x) =
1 x∈B

Then f −1 (∅), f −1 ({0, 1}) = X, f −1 ({0}) = A and f −1 ({1}) = B are all open. So
f is continuous. Also, since A, B are non-empty, f is surjective.
(Backward) Given f : X 7→ {0, 1} surjective and continuous, define A = f −1 ({0}),
B = f −1 ({1}). Then A and B disconnect X.
In fact we can have, the topological space (X, τ ) is connected iff every continuous
map f : (X, τ ) → (A, ∆) is constant, where A is any set with more than 2 points
and ∆ its discrete topology.
Since Z and {0, 1} have the discrete topology when considered as subspaces of R
with the usual topology, we also have the following:
1. A topological space (X, τ ) is connected if and only if every continuous integer
valued function f : X → R (where R has its usual topology) is constant.
2. A topological space (X, τ ) is connected if and only if every continuous function
f : X → R (where R has its usual topology) which only takes the values 0 or
1 is constant.
P. 1-86
[0, 1] is connected. (More generally [a, b] is connected)

(Proof 1) Note that Q ∩ [0, 1] is disconnected, since we can pick our favorite
irrational number a, and then {x : x < a} and {x : x > a} disconnect the interval.
So we better use something special about [0, 1]. The key property of R is that
every non-empty A ⊆ [0, 1] has a supremum.
Suppose A and B disconnect [0, 1]. wlog, assume 1 ∈ B. Since A is non-empty,
α = sup A exists. Then either
26 CHAPTER 1. METRIC AND TOPOLOGICAL SPACES

• α ∈ A. Then α < 1, since 1 ∈ B. Since A is open, ∃ε > 0 with Bε (α) ⊆ A. So


α + 2ε ∈ A, contradicting supremality of α; or
• α 6∈ A. Then α ∈ B. Since B is open, ∃ε > 0 such that Bε (α) ⊆ B. Then
a ≤ α − ε for all a ∈ A. This contradicts α being the least upper bound of A.
Either option gives a contradiction. So A and B cannot exist and [0, 1] is con-
nected.
(Proof 2) Observe that if f : [0, 1] → R is continuous then if f (x) = 1 and f (y) = 0
the intermediate value theorem tells us that there is some z between x and y such
that f (z) = 1/2. So every continuous function f : X → R which only takes the
values 0 or 1 is constant.
P. 1-87
If f : X → Y is continuous and X is connected, then Im f is also connected.

(Proof 1) Suppose A and B disconnect Im f . We will show that f −1 (A) and


f −1 (B) disconnect X. Since A, B ⊆ Im f are open, we know that A = Im f ∩ A0
and B = Im f ∩ B 0 for some A0 , B 0 open in Y . Then f −1 (A) = f −1 (A0 ) and
f −1 (B) = f −1 (B 0 ) are open in X.
Since A, B are non-empty, f −1 (A) and f −1 (B) are non-empty. Also, f −1 (A) ∩
f −1 (B) = f −1 (A∩B) = f −1 (∅) = ∅. Finally, A∪B = Im f . So f −1 (A)∪f −1 (B) =
f −1 (A ∪ B) = X. So f −1 (A) and f −1 (B) disconnect X, contradiction. So Im f is
connected.
(Proof 2) If Im f is not connected, let g : Im f → {0, 1} be continuous surjective.
Then g ◦ f : X → {0, 1} is continuous surjective. Contradiction.
T. 1-88
<Intermediate value theorem> Suppose f : X → R is continuous and X is
connected. If ∃x0 , x1 such that f (x0 ) < y < f (x1 ), then ∃x ∈ X with f (x) = y.

(Proof 1) Suppose no such x exists. Then y 6∈ Im f while y > f (x0 ) ∈ Im f ,


y < f (x1 ) ∈ Im f . Then Im f is disconnected (from our previous example), con-
tradicting X being connected.
f (x)−y
(Proof 2) If f (x) 6= y for all x, then |f (x)−y|
is a continuous surjection from X to
{−1, +1}, which is a contradiction.
P. 1-89
1. If (X, τ ) is a connected topological space and ∼ is an equivalence relation on
X, then X/ ∼ with the quotient topology is connected.
2. If (X, τ ) and (Y, σ) are connected topological spaces, then X × Y with the
product topology is connected.

1. X/ ∼ is the continuous image of X under the quotient map which we know to


be continuous.
2. Suppose X ×Y with the product topology is not connected. Then we can find a
non-constant continuous function f : X ×Y → R taking only the values 0 and 1.
Take (x, y), (u, v) ∈ X × Y with f (x, y) 6= f (u, v). Then, if f (x, v) = f (x, y),
it follows that f (x, v) 6= f (u, v). Without loss of generality, suppose that
f (x, v) 6= f (x, y).
1.3. CONNECTIVITY 27

The function θ : Y → X × Y given by θ(z) = (x, z) is continuous. (Because if


Ω is open in X × Y and z ∈ θ−1 (Ω), then (x, z) ∈ Ω, so we can find U open
in X and V open in Y such that (x, z) ∈ U × V ⊆ Ω. Thus z ∈ V ⊆ θ−1 (Ω)
and we have shown θ−1 (Ω) open.) If we set F = f ◦ θ, then F : Y → R is
non-constant, continuous and only takes the values 0 and 1. Thus Y is not
connected.

Note that if (X, τ ) is a connected topological space and E is a subset of X, it


does not follow that E with the subspace topology is connected. For example R
is connected with the usual topology, but E = (−2, −1) ∪ (1, 2) is not.

L. 1-90
Let E be a subset of a topological space (X, τ ). If E is connected so is Cl E.

Suppose that U and V are open sets with Cl E ⊆ U ∩ V and Cl E ∩ U ∩ V = ∅.


Then (since Cl E ⊇ E) we have E ⊆ U ∩V and E ∩U ∩V = ∅. Since E is connected
we know that either E ∩ U = ∅ or E ∩ V = ∅. Wlog, suppose E ∩ V = ∅. Then
V ⊆ E c so V ⊆ Int E c = (Cl E)c and Cl E ∩ V = ∅. Thus Cl E is connected.

D. 1-91
• Let X be a topological space, and x0 , x1 ∈ X. Then a path from x0 to x1 is a
continuous function γ : [0, 1] 7→ X such that γ(0) = x0 , γ(1) = x1 .

• A topological space X is path connected if for all points x0 , x1 ∈ X, there is a


path from x0 to x1 .

E. 1-92
• (a, b), [a, b), (a, b], R are all path connected (using paths given by linear functions).

• Rn is path connected (eg. γ(t) = tx1 + (1 − t)x0 ).

• Rn \ {0} is path-connected for n > 1 (the paths either line segments or bent line
segments to get around the hole).

P. 1-93
If X is path connected, then X is connected.

(Proof 1) Let X be path connected, and let f : X → {0, 1} be a continuous


function. We want to show that f is constant.

Let x, y ∈ X. By path connectedness, there is a continuous map γ : [0, 1] → X such


that γ(0) = x and γ(1) = y. Composing with f gives a map f ◦ γ : [0, 1] → {0, 1}.
Since [0, 1] is connected, this must be constant. In particular, f (γ(0)) = f (γ(1)),
ie. f (x) = f (y). Since x, y were arbitrary, we know f is constant.

(Proof 2) Suppose that (X, τ ) is path-connected and that U and V are open sets
with U ∩ V = ∅ and U ∪ V = X. Wlog U 6= ∅, choose x ∈ U . If y ∈ X,
we can find f : [0, 1] → X continuous with f (0) = x and f (1) = y. We have
U ∩ V ∩ f ([0, 1]) = ∅ and U ∪ V ⊇ f ([0, 1]). Now the continuous image of a
connected set is connected and [0, 1] is connected, so f ([0, 1]) is connected. Since
U ∩ f ([0, 1]) 6= ∅, V ∩ f ([0, 1]) = ∅. So U ⊇ f ([0, 1]), so y ∈ U . Thus U = X. So
U, V does not disconnect X, so X is connected.

Path connectivity is a stronger condition than connectivity.


28 CHAPTER 1. METRIC AND TOPOLOGICAL SPACES

L. 1-94
Suppose f : X → Y is a homeomorphism and A ⊆ X, then f |A : A → f (A) is a
homeomorphism.

Since f is a bijection, f |A is a bijection. If U ⊆ f (A) is open, then U = f (A) ∩ U 0


for some U 0 open in Y . So f |−1 A (U ) = f
−1
(U 0 ) ∩ A is open in A. So f |A is
continuous. Similarly, we can show that (f |A )−1 is continuous.
E. 1-95
We can use connectivity to distinguish spaces. Apart from the obvious “X is
connected while Y is not”, by the above we can also try to remove points and see
what happens.
For example [0, 1] 6' (0, 1). Suppose otherwise, suppose f : [0, 1] → (0, 1) is
a homeomorphism. Let A = (0, 1]. Then f |A : (0, 1] → (0, 1) \ {f (0)} is a
homeomorphism. But (0, 1] is connected while (0, 1) \ {f (0)} is disconnected.
Contradiction. Similarly we have [0, 1) 6' [0, 1] and [0, 1) 6' (0, 1).
More generally let M (X) be the maximum number of points we can remove from
X such that X is still connected. If M (X) 6= M (Y ), then X 6' Y . Wlog suppose
M (X) > M (Y ), then we can remove M (X) points from X with X still connected
while on the image side Y is no-longer connected, so a homeomorphism f : X → Y
leads to a contradiction.
Consequently we also have Rn 6' R for n > 1, and S 1 is not homeomorphic to any
subset of R. We were able to use path connectivity to determine that R is not
homeomorphic to Rn for n > 1. If we want to distinguish general Rn and Rm , we
will need to generalize the idea of path connectivity to higher connectivity.
To do so, we have to formulate path connectivity in a different way. Recall that
S 0 = {−1, 1} ' {0, 1} ⊆ R, while D1 = [−1, 1] ' [0, 1] ⊆ R. Then we can
formulate X as: X is path-connected iff any continuous f : S 0 → X (if we pick
any two points in X) extends to a continuous γ : D1 → X (extends it to a path
connecting the two points) with γ|S 0 = f .
D. 1-96
X is n-connected if for any k ≤ n, any continuous f : S k → X extends to a
continuous F : Dk+1 → X such that F |S k = f .
E. 1-97
A 0-connect space is path-connected. A 1-connected space are said to be
simply connected .
For any point p ∈ Rn , Rn \{p} is m-connected iff m ≤ n−2. So Rn \{p} 6' Rm \{q}
unless n = m. So Rn 6' Rm . We casually stated that Rn \ {p} is m-connected iff
m ≤ n − 2. However, while this intuitively makes sense, it is in fact very difficult
to prove. To actually prove this, we will need tools from algebraic topology.
L. 1-98
Define x ∼ y if there is a path from x to y in X. Then ∼ is an equivalence relation.

1. For any x ∈ X, let γx : [0, 1] → X be γ(t) = x, the constant path. Then this
is a path from x to x. So x ∼ x.
2. If γ : [0, 1] → X is a path from x to y, then γ̄ : [0, 1] → X by t 7→ γ(1 − t) is a
path from y to x. So x ∼ y ⇒ y ∼ x.
1.3. CONNECTIVITY 29

3. If γ1 is a path form x to y and γ2 is a path from y to z, then γ2 ∗ γ1 defined by


(
γ1 (2t) t ∈ [0, 1/2]
t 7→
γ2 (2t − 1) t ∈ [1/2, 1]

is a path from x to z. So x ∼ y, y ∼ z ⇒ x ∼ z.
P. 1-99
T S
If Yα ⊆ X is connected for all α ∈ T and that α∈T Yα 6= ∅, then Y = α∈T Yα
is connected.
Let U and V be open sets such that
[ [
U ∪V ⊇ Yα and U ∩V ∩ Yα = ∅.
α∈A α∈A

We have U ∪ V ⊇ Yα and U ∩ V ∩ Yα = ∅Tfor each α ∈ A. Since T Yα is connected,


either U ∩ Yα = ∅ or V ∩ Yα = ∅. Since α∈T Yα 6= ∅, pick y ∈ α∈T Yα . Wlog,
let y ∈ U . But y ∈ U ∩ YαSso U ∩ Yα 6= ∅ and V ∩ Yα = ∅, and so weShave U ⊇ Yα
for all α ∈ A. Thus U ⊇ α∈A Yα . So U and V cannot disconnect α∈A Yα .
D. 1-100
• Equivalence classes of the relation “x ∼ y if there is a path from x to y” are
path components of X.
• If
S x ∈ X, define C(x) = {A ⊆ X : x ∈ A and A is connected}. Then C(x) =
A∈C(x) A is the connected component of x.

E. 1-101
If a space is disconnected, we could divide the space into different components,
each of which is (maximally) connected.
P. 1-102
C(x) is the largest connected subset of X containing x.

First noteTthat {x} ∈ C(x).


T So x ∈ C(x). To show that it is connected, just note
that x ∈ A∈C(x) A. So A∈C(x) A is non-empty. By our previous proposition, this
implies that C(x) is connected. If x ∈ D ⊆ X with D connected, then D ∈ C(x),
So D ∈ C(x).
L. 1-103
If y ∈ C(x), then C(y) = C(x).

Since y ∈ C(x) and C(x) is connected, C(x) ⊆ C(y). So x ∈ C(y). Then


C(y) ⊆ C(x), so C(x) = C(y).
It follows that x ∼ y if x ∈ C(y) (equivalent to x ∼ y if there exists a connected
set E with x, y ∈ E.) is an equivalence relation and the connected components of
X are the equivalence classes.
P. 1-104
The connected components of a topological space are closed. If there are only
finitely many components then they are also open.
30 CHAPTER 1. METRIC AND TOPOLOGICAL SPACES

By [L.1-90], Cl C(x) is connected. Since C(x) is the largest connected subset of


X containing x, Cl C(x) = C(x), so C(x) is closed. If there are only finitely many
components, than the complement of each components is the finite union of the
other closed components, so it itself is open.
E. 1-105
• Let X = (−∞, 0) ∪ (0, ∞) ⊆ R. Then the connected components are (−∞, 0)
and (0, ∞), which are also the path components.
• Let X = Q ⊆ R. Then C(x) = {x} for all x ∈ X. In this case, we say X is
totally disconnected.
Note that C(x) and X \ C(x) need not disconnect X, even though it is the case
in our first example. We must need C(x) and X \ C(x) to be open as well. For
example, in Example 2, C(x) = {x} is not open.
It is also important to note that path components need not be equal to the con-
nected components, as illustrated by the following example. However, since path
connected spaces are connected, the path component containing x must be a subset
of C(x).
Let Y = {(0, y) ∈ R2 : y ∈ R} ⊆ R2 be the y axis and y
Z = {(x, x1 sin x1 ) ∈ R2 : x ∈ (0, ∞)}. Y
Let X = Y ∪ Z ⊆ R2 . We claim that Y and Z are the
path components of X. Since Y and Z are individually Z x
path connected, it suffices to show that there is no
continuous γ : [0, 1] → X with γ(0) = (0, 0), γ(1) =
(1, sin 1).

Suppose γ existed. Then the function π2 ◦ γ : [0, 1] → R projecting the path to


the y direction is continuous. So it is bounded. Let M be such that π2 ◦ γ(t) ≤ M
for all t ∈ [0, 1]. Let W = X ∩ (R × (−∞, M ]) be the part of X that lies below
y = M . Then Im γ ⊆ W . However, W is disconnected: pick 0 < t0 < 1 with
1
t0
sin t10 > M . Then W ∩ ((−∞, t0 ) × R) and W ∩ ((t0 , ∞) × R) disconnect W .
This is a contradiction, since γ is continuous and [0, 1] is connected.
We also claim that X is connected: suppose A and B disconnect X. Then since
Y and Z are connected, either Y ⊆ A or Y ⊆ B; Z ⊆ A or Z ⊆ B. If both
Y ⊆ A, Z ⊆ A, then B = ∅, which is not possible. So wlog assume A = Y , B = Z.
This is also impossible, since Y is not open in X as it is not a union of balls (any
open ball containing a point in Y will also contain a point in X). Hence X must
be connected.
P. 1-106
If U ⊆ Rn is open and connected, then it is path-connected.

Let A be a path component of U . We first show that A is open. Let a ∈ A.


Since U is open, ∃ε > 0 such that Bε (a) ⊆ U . We know that Bε (a) ' Int(Dn ) is
path-connected (eg. use line segments connecting the points). Since A is a path
component and a ∈ A, we must have Bε (a) ⊆ A. So A is an open subset of U .
Now suppose b ∈ U \ A. Then since U is open, ∃ε > 0 such that Bε (b) ⊆ U . Since
Bε (b) is path-connected, so if Bε (b) ∩ A 6= ∅, then Bε (b) ⊆ A. But this implies
b ∈ A, which is a contradiction. So Bε (b) ∩ A = ∅. So Bε (b) ⊆ U \ A. Then U \ A
1.4. COMPACTNESS 31

is open. So A, U \ A are disjoint open subsets of U . Since U is connected, we must


have U \ A empty (since A is not). So U = A is path-connected.
E. 1-107
Show that the non-empty bounded connected subsets of R (with the usual topol-
ogy) are the intervals. (By intervals we mean sets of the form [a, b], [a, b), (a, b]
and (a, b) with a ≤ b. Note that [a, a] = {a}, (a, a) = ∅.) Also describe all the
connected subsets of R.

Since [a, b], [a, b), (a, b] and (a, b) are path connected, they are connected.
Suppose, conversely, that E is bounded and contains at least two points. Since E
is bounded α = inf E and β = sup E exist. Further α < β. If c ∈ (α, β) \ E we
can find x, y ∈ E such that α < x ≤ c and c ≤ y < β. If c ∈
/ E, then U = (−∞, c)
and V = (c, ∞) are open U ∩ V = ∅, U ∪ V ⊇ E but x ∈ U ∩ E, y ∈ V ∩ E so
U ∩ E, V ∩ E 6= ∅ and E is not connected. Thus, if E is connected, E ⊇ (α, β)
and E is one of [α, β], (α, β), (α, β] or [α, β).
The same kind of argument shows that the connected subsets of R are precisely
the sets of the form [a, b], [a, b), (a, b], (a, b), (−∞, b], (−∞, b), [a, ∞), (a, ∞) and
R [a ≤ b].
Note that the condition open is important, as can be seen by the above example
where Y ∪ Z is connected but not path connected.

1.4 Compactness
Compactness is an important concept in topology. It can be viewed as a generalization
of being “closed and bounded” in R. Alternatively, it can also be viewed as a general-
ization of being finite. Compact sets tend to have a lot of really nice properties. For
example, if X is compact and f : X → R is continuous, then f is bounded and attains
its bound.
There are two different definitions of compactness - one based on open covers (which
we will come to shortly), and the other based on sequences. In metric spaces, these
two definitions are equal. However, in general topological spaces, these notions can
be different. The first is just known as “compactness” and the second is known as
“sequential compactness”.
The actual definition of compactness is rather weird and unintuitive, and is difficult to
comprehend at first. However, as we go through more proofs and examples, (hopefully)
you will be able to appreciate this definition.
D. 1-108
• Let (X, τ ) be a topological space and Y ⊆ X. An open cover of Y is a subset
V ⊆ τ such that V ∈V V ⊇ Y . We say V covers Y . If V 0 ⊆ V, and V 0 covers Y ,
S

then we say V 0 is a subcover of V.


• A topological space X is compact if every open cover V of X has a finite subcover
V 0 = {V1 , · · · , Vn } ⊆ V. 8 A subset Y ⊆ X is called compact if the subspace
8
Some people (especially algebraic geometers) call this notion “quasi-compact”, and reserve the
name “compact” for “quasi-compact and Hausdorff”. We will not adapt this notion.
32 CHAPTER 1. METRIC AND TOPOLOGICAL SPACES

topology on Y is compact, which is equivalent to every open cover of Y having a


finite subcover.
• A metric space (X, d) is bounded if there exists M ∈ R such that d(x, y) ≤ M for
all x, y ∈ X.
E. 1-109
1. If X is finite, then P(X) is finite. So any open cover of X is finite, so X is
compact.
2. Let X = R and V = {(−R, R) : R ∈ R, R > 0}. Then this is an open cover
with no finite subcover. So R is not compact. Hence all open intervals are not
compact since they are homeomorphic to R.
3. Let X = [0, 1] ∩ Q. Let Un = X \ (α − 1/n, α + 1/n) for some irrational α in
√ −1
(0, 1) (eg. α = 2 ).
S
Then n>0 Un = X since α is irrational. Then V = {Un : n ∈ Z > 0} is an
open cover of X. Since this has no finite subcover, X is not compact.
4. The discrete topology on a set X is compact if and only if X is finite.
5. The indiscrete topology is always compact.
6. The cofinite topology is always compact. If X = ∅ there is nothing to prove.
If not, let Uα [α ∈ A] be an open cover. Since X 6= ∅ we can choose a β ∈ A
such that Uβ 6= ∅ and soSUβ = X \ F where F is a finite set. For each x ∈ F
we know that x S∈ X = α∈A Uα , so there exists an α(x) ∈ A with x ∈ Uα(x) .
We have Uβ ∪ x∈F Uα(x) = X, giving us the desired open cover.
E. 1-110
Let X be uncountable. Let τ consist of subsets A of X such that either A = ∅ or
X \ A is countable. Show that τ is a topology but that (X, τ ) is not compact.

∅, X ∈ τ . If Uα ∈ τ for α ∈ B, then
[ \ \ [
X\ Uα = (X \ Uα ), X\ Uα = (X \ Uα )
α∈B α∈B α∈B α∈B

which is countable (or equals


S X, so
T its complement empty) (B is take to be finite
in the second case), so α∈B Uα , α∈B Uα ∈ τ . So τ is a topology.
Let x1 , x2 , . . . , be distinct points ofSX. Let U = X \ {xj : 1 ≤ j} and Uk =
U ∪ {xk }. Then Uk ∈ τ [k ≥ 1] and k≥1 Uk = X. Now suppose k(1), k(2),. . . ,
/ N
S
k(N ) given. If m = max1≤r≤N k(r), then xm+1 ∈ r=1 Uk(r) so there is no finite
subcover.
E. 1-111
Note that being bounded is not a topological property. For example, (0, 1) ' R
but (0, 1) is bounded while R is not. It depends on the metric d, not just the
topology it induces.
L. 1-112
Suppose that (X, d) is a compact metric space (that is to say, the topology induced
by the metric is compact).
S
1. Given any δ > 0, we can find a finite set of points E such that X = e∈E B(e, δ).
1.4. COMPACTNESS 33

2. X has a countable dense subset.

1. Observe that the open balls B(x, δ) form an open cover of X and so have a
finite subcover.
S
2. For each n ≥ 1, choose
S∞ a finite subset En such that X = e∈En B(e, 1/n).
Observe that E = n=1 En is the countable union of finite sets, so countable.
If U is open and non-empty, then we can find a u ∈ U and a δ > 0 such
that U ⊇ B(u, δ). Choose N > δ −1 . We can find an e ∈ EN ⊆ E with
u ∈ B(e, 1/N ), so e ∈ B(u, 1/N ) ⊆ B(u, δ) ⊆ U . Thus Cl E = X and we are
done.

P. 1-113
[0, 1] with the usual topology is compact.

Suppose V is an open cover of [0, 1]. Let

A = {a ∈ [0, 1] : [0, a] has a finite subcover of V}.

First show that A is non-empty. Since V covers [0, 1], in particular, there is some
V0 that contains 0. So {0} has a finite subcover V0 . So 0 ∈ A.

Next we note that by definition, if 0 ≤ b ≤ a and a ∈ A, then b ∈ A. Now let


α = sup A. Suppose α < 1, then α ∈ [0, 1). Since V covers X, let α ∈ Vα . Since
Vα is open, there is some ε such that Bε (α) ⊆ Vα . By definition of α, we must
have α − ε/2 ∈ A. So [0, α − ε/2] has a finite subcover. Add Vα to that subcover
to get a finite subcover of [0, α + ε/2]. Contradiction. (technically, it will be a
finite subcover of [0, η] for η = min(α + ε/2, 1), in case α + ε/2 gets too large).

So we must have α = sup A = 1. Now we argue as before: ∃V1 ∈ V such that


1 ∈ V1 and ∃ε > 0 with (1 − ε, 1] ⊆ V1 . Since 1 − ε ∈ A, there exists a finite V 0 ⊆ V
which covers [0, 1 − ε/2]. Then W = V 0 ∪ {V1 } is a finite subcover of V .

Again, since this is not true for [0, 1] ∩ Q, we must use a special property of reals.

P. 1-114
A closed subset of a compact set is compact. (If X is compact and C is closed
subset of X, then C is also compact.)

Suppose V is an open cover of C. Say V = {Vα : α ∈ T }. For each


S α, since Vα is
in C, Vα = C ∩ Vα0 for some Vα0 open in X. Also, since α∈T Vα = C, we
open S
have α∈T Vα0 ⊇ C.

Since C is closed, U = X \ C is open in X. So W = {Vα0 : α ∈ T } ∪ {U } is an open


cover of X. Since X is compact, W has a finite subcover W 0 = {Vα0 1 , · · · , Vα0 n , U }
(U may or may not be in there, but it doesn’t matter). Now U ∩ C = ∅. So
{Vα1 , · · · , Vαn } is a finite subcover of C.

P. 1-115
Let X be a Hausdorff space, then every compact subset is closed. (If C ⊆ X is
compact, then C is closed in X.)

Let U = X \ C. We will show that U S


is open. For any x ∈ U , we will find a open
Ux such that x ∈ Ux ⊆ U . Then U = x∈U Ux will be open.
34 CHAPTER 1. METRIC AND TOPOLOGICAL SPACES
To construct Ux , fix x ∈ U . Since X is Hausdorff, so
for each y ∈ C, ∃Uxy , Wxy open neighbourhoods of x C Wxy Uxy
and y respectively with Uxy ∩ Wxy = ∅.
x y
Then W = {Wxy ∩ C : y ∈ C} is an open cover of
C. Since C is compact, there exists a finite subcover
W 0 = {Wxy1 ∩ C, · · · , Wxyn ∩ C}.
Let Ux = n
T
i=1 Uxyi . Then Ux is open
S since it is a finite intersection of open sets.
To show Ux ⊆ U , note that Wx = n i=1 Wxyi ⊇ C since {Wxyi ∩ C} is an open
cover. We also have Wx ∩ Ux = ∅. So Ux ⊆ U . So done.
P. 1-116
A compact metric space (X, d) is bounded.

Pick x ∈ X. Then V = {Br (x) : r ∈ R+ } is an open cover of X. Since X is com-


pact, there is a finite subcover {Br1 (x), · · · , Brn (x)}. Let R = max{r1 , · · · , rn }.
Then d(x, y) < R for all y ∈ X. So for all y, z ∈ X, d(y, z) ≤ d(y, x)+d(x, z) < 2R.
So X is bounded.
T. 1-117
<(weak) Heine-Borel theorem> C ⊆ R is compact iff C is closed and
bounded.

Since R is a metric space (hence Hausdorff), C is also a metric space.


So if C is compact, then C is closed in R and is bounded, by our previous two
propositions. Conversely, if C is closed and bounded, then C ⊆ [−N, N ] for some
N ∈ R. Since [−N, N ] ' [0, 1] is compact, and C = C ∩ [−N, N ] is closed in
[−N, N ], C is compact.
P. 1-118
If A ⊆ R is compact, ∃α ∈ A such that α ≥ a for all a ∈ A.

Since A is compact, it is bounded. Let α = sup A. Then by definition, α ≥ a for


all a ∈ A. So it is enough to show that α ∈ A. Suppose α 6∈ A. Then α ∈ R \ A.
Since A is compact, it is closed in R. So R \ A is open. So ∃ε > 0 such that
Bε (α) ⊆ R \ A, which implies that a ≤ α − ε for all a ∈ A. This contradicts the
assumption that α = sup A. So we can conclude α ∈ A.
We call α = max A the maximum element of A.
T. 1-119
Let (X, τ ), (Y, σ) be topological space. If f : X → Y is continuous and K ⊆ X is
compact, then f (K) ⊆ Y is also compact.

Suppose V = {Vα ∈ σ : α ∈ T } is an open cover of f (K). Then Wα = f −1 (Vα ) ∈ τ .


If x ∈ K then f (x) is in Vα for some α, so x ∈ Wα . Thus W = {Wα ∈ τ :
α ∈ T } is an open cover of K. Since K is compact, so there’s a finite subcover
{Wα(1) , · · · , Wα(n) } of W. Now
n n n
!
[ [  [
Vα(j) ⊇ f Wα(j) = f Wα(j) ⊇ f (K)
j=1 j=1 j=1

So {Vα(1) , · · · , Vα(n) } is a finite subcover of V .


1.4. COMPACTNESS 35

P. 1-120
Let (X, τ ) be a compact topological space and ∼ an equivalence relation on X.
Then the quotient topology on X/ ∼ is compact

By above since quotient map q : X → X/ ∼ is continuous.

T. 1-121
<Maximum value theorem> If f : X → R is continuous and X is compact,
then ∃x ∈ X such that f (x) ≥ f (y) for all y ∈ X.

Since X is compact, Im f is compact. Let α = max{Im f }. Then α ∈ Im f . So


∃x ∈ X with f (x) = α. Then by definition f (x) ≥ f (y) for all y ∈ X.

So Im f is closed and bounded, and f attains it bound.

L. 1-122
If f : [0, 1] → R is continuous, then ∃x ∈ [0, 1] such that f (x) ≥ f (y) for all
y ∈ [0, 1].

[0, 1] is compact.

E. 1-123
Let R have the usual metric.
1. If K is a subset of R with the property that, whenever f : K → R is continuous,
f is bounded, show that K is closed and bounded.
2. If K is a subset of R with the property that, f : K → R attains its bounds
whenever f is continuous and bounded, then K is closed and bounded.

1. If K = ∅ there is nothing to prove, so we assume K 6= ∅.

Let f : K → R be defined by f (k) = |k|. Since f is bounded, K must be.

/ K, then the function f : K → R given by f (k) = |k − x|−1 is continuous


If x ∈
and so bounded. Thus we can find an M > 0 such that |f (k)| < M for all
k ∈ K. It follows that |x−k| > M −1 for all k ∈ K and the open ball B(x, M −1 )
lies entirely in the complement of K. Thus K is closed.

2. If K is unbounded, then, setting f (x) = tan−1 x, we see that f is bounded


on K, but does not attain its bounds. If K is not closed, then we can find
a ∈ Cl(K) with a ∈ / K. If we set f (x) = tan−1 a−x
1
then f is bounded on K
but does not attain its bounds.

T. 1-124
If X and Y are compact, then so is X × Y (under the product topology).

S
Let Oα ∈ λ [α ∈ A] and α∈A Oα = X × Y . Then, given (x, y) ∈ X × Y , we can
find Ux,y ∈ τ , Vx,y ∈ σ and α(x, y) ∈ A such that (x, y) ∈ Ux,y × Vx,y ⊆ Oα(x,y) .
36 CHAPTER 1. METRIC AND TOPOLOGICAL SPACES
S
We have y∈Y {x}×V S x,y = {(x, y) : y ∈ Y } for each
Y Ux × Y
x ∈ X and so y∈Y Vx,y = Y . By compactness, Ux,y(x,1) × Vx,y(x,1)
we can find a positive integer n(x) and y(x, j) ∈ Y
Ux,y(x,2) × Vx,y(x,2)
Sn(x)
[1 ≤ j ≤ n(x)] such that j=1 Vx,y(x,j) = Y .
Tn(x)
Now Ux = j=1 Ux,y(x,j) is the finite intersection
of openSsets in X and so open. Further x ∈ Ux
and so x∈X Ux = X. By S compactness, we can find
x1 , x2 , . . . , xm such that m r=1 Uxr = X. And the X
result follows since x

m n(x
[ [r ) m n(x
[ [r ) m n(x
[ [r )
Oxr ,y(xr ,j) ⊇ Uxr ,y(xr ,j) × Vxr ,y(xr ,j) ⊇ Uxr × Vxr ,y(xr ,j)
r=1 j=1 r=1 j=1 r=1 j=1
m
[
⊇ Uxr × Y ⊇ X × Y
r=1

L. 1-125
Let (X, τ ) and (Y, σ) be topological spaces with subsets E and F . Let the subspace
topology on E be τE and on F be σF . Let the product topology on X × Y derived
from τ and σ be λ and let the product topology on E × F derived from τE and
σF be µ. Then µ is the subspace topology on E × F derived from λ.

T. 1-126
Let (X, τ ) and (Y, σ) be topological spaces and let λ be the product topology. If
K is a compact subset of X and L is a compact subset of Y , then K × L is a
compact in λ.

Follows from the lemma and theorem above.

E. 1-127
The unit cube [0, 1]n = [0, 1] × [0, 1] × · · · × [0, 1] is compact.

T. 1-128
<Heine-Borel theorem> C ⊆ Rn is compact iff C is closed and bounded.

If C is closed and bounded, C ⊆ [−N, N ]n for some N ∈ R which is compact, so


C is compact since C is closed. The rest of the proof is exactly the same as for
n = 1.

P. 1-129
Suppose f : X → Y is a continuous bijection. If X is compact and Y is Hausdorff,
then f is a homeomorphism.

We show that f −1 is continuous. To do this, it suffices to show (f −1 )−1 (C) is


closed in Y whenever C is closed in X. By hypothesis, f is a bijection . So
(f −1 )−1 (C) = f (C).

Supposed C is closed in X. Since X is compact, C is compact. Since f is con-


tinuous, so f (C) = Im(f |C ) is compact. Since Y is Hausdorff and f (C) ⊆ Y is
compact, f (C) is closed.
1.4. COMPACTNESS 37

E. 1-130
1. Give an example of a Hausdorff space (X, τ ) and a compact Hausdorff space
(Y, σ) together with a continuous bijection f : X → Y which is not a homeo-
morphism.
2. Give an example of a compact Hausdorff space (X, τ ) and a compact space
(Y, σ) together with a continuous bijection f : X → Y which is not a homeo-
morphism.

Let τ1 be the indiscrete topology on [0, 1], and τ2 the usual (Euclidean) topology
on [0, 1] and τ3 the discrete topology on [0, 1]. Then ([0, 1], τ1 ) is compact (but not
Hausdorff), ([0, 1], τ2 ) is compact and Hausdorff, and ([0, 1], τ3 ) is Hausdorff (but
not compact).
The identity maps id : ([0, 1], τ2 ) → ([0, 1], τ1 ) and id : ([0, 1], τ3 ) → ([0, 1], τ3 ) are
continuous bijections but not homeomorphisms.
L. 1-131
Let τ1 and τ2 be topologies on the same space X.
1. If τ1 ⊇ τ2 and (X, τ1 ) is compact, then so is (X, τ2 ).
2. If τ1 ⊇ τ2 and (X, τ2 ) is Hausdorff, then so is (X, τ1 ).
3. If τ1 ⊇ τ2 , (X, τ1 ) is compact and (X, τ2 ) is Hausdorff, then τ1 = τ2 .

1. The identity map id : (X, τ1 ) → (X, τ2 ) is continuous and so takes compact


sets to compact sets. Since X is compact in τ1 , X = id(X) is compact in τ2 .
2. If x 6= y we can find x ∈ U ∈ τ2 and y ∈ V ∈ τ2 with U ∩ V = ∅. Automatically
x ∈ U ∈ τ1 and y ∈ V ∈ τ1 so we are done.
3. The identity map id : (X, τ1 ) → (X, τ2 ) is a continuous bijection and so a
homeomorphism.
P. 1-132
Suppose f : X/∼ → Y is a bijection, X is compact, Y is Hausdorff, and f ◦ q :
X → Y is continuous, then f is a homeomorphism.

Since X is compact and q : X 7→ X/∼ is continuous, Im q = X/∼ is compact.


Since f ◦ q is continuous, f is continuous. So we can apply [P.1-129].
E. 1-133
Let X = D2 and A = S 1 ⊆ X. Then f : X/A 7→ S 2 by (r, θ) 7→ (1, πr, θ) in
spherical coordinates is a homeomorphism. We can check that f is a continuous
bijection and D2 is compact. So X/A ' S 2 .
E. 1-134
Consider the complex plane with its usual metric. Let ∂D = {z ∈ C : |z| = 1}
and give ∂D the subspace topology τ . Give R its usual topology and define an
equivalence relation ∼ by x ∼ y if x − y ∈ Z. We write R/ ∼= T and give T the
quotient topology. Show that ∂D and T are homeomorphic.

We show that ∼ is indeed an equivalence relation. Observe that x − x = 0 ∈ Z,


so x ∼ x. Observe that x ∼ y implies x − y ∈ Z, so y − x = −(x − y) ∈ Z
and y ∼ x. Observe that, if x ∼ y and y ∼ z, then x − y, y − z ∈ Z, so
x − z = (x − y) + (y − z) = x − z ∈ Z and x ∼ z.
38 CHAPTER 1. METRIC AND TOPOLOGICAL SPACES

Define f : R → ∂D by f (x) = exp(2πix). We now show that f (U ) is open


whenever U is open. If x ∈ U an open set, then we can find a 1 > δ > 0 such
that |x − y| < δ implies y ∈ U . By simple geometry, any z ∈ C with |z| = 1 and
| exp(2πix) − z| < δ/100 can be written as z = exp(2πiy) with |y − x| < δ. Thus

∂D ∩ {z ∈ C : |z − exp(2πix)| < δ/100} ⊆ f (U ).

We have shown that f (U ) is open.


Let q : R → T be the quotient map q(x) = [x]. Then q(x) = q(y) if and only if
f (x) = f (y), because

q(x) = q(y) ⇔ y ∈ [x] ⇔ x − y ∈ Z ⇔ exp(2πi(x − y)) = 1


⇔ exp(2πix) = exp(2πiy) ⇔ f (x) = f (y).

It follows that the equation F (exp(2πix)) = q f −1 ({exp(2πix)}) = [x] gives a




well defined bijection F : ∂D → T.


Observe that F −1 ([x]) = q −1 ([x]) and

−1 −1
 {exp(2πit) : exp(2πit) = exp(2πix)} = f −1
so F (V ) = f q (V ) . If V is open, then, since q is continuous, q (V ) is open
,so also F −1 (V ) is open. Thus F is continuous.
Now we show that T is Hausdorff. If [x] 6= [y], then we know that x − y ∈ / Z and
the set {|t| : t − (x − y) ∈ Z, |t| < 1} is finite and non-empty. Thus there exists a
δ > 0 such that {|t| : t − (x − y) ∈ Z, |t| < δ} = ∅. Let
∞   ∞  
[ δ δ [ δ δ
Ux = j + x − ,j + x + and Uy = j + y − ,j + y + .
j=−∞
4 4 j=−∞
4 4

Observe that Ux and Uy are open in R and q −1 q(Ux )) = Ux , q −1 q(Uy )) = Uy ,


and so q(Ux ) and q(Uy ) are open in the quotient topology. Since [x] ∈ q(Ux ),
[y] ∈ q(Uy ) and q(Ux ) ∩ q(Uy ) = ∅, we have shown that the quotient topology is
Hausdorff.
Since ∂D is closed and bounded in C and we can identify C with R2 as a metric
space, ∂D is compact.
Since a continuous bijection from a compact to a Hausdorff space is a homeomor-
phism, F is a homeomorphism.
It is just as simple to show that the natural map from T (which we know to be
compact since R/ ∼= [0, 1]/ ∼) to ∂D (which we know to be Hausdorff since C
is Hausdorff) is a bijective continuous map. Or we could show continuity in both
directions and not use the result on continuous bijections.
D. 1-135
A topological space X is sequentially compact if every sequence (xn ) in X has a
convergent subsequence (that converges to a point in X).
E. 1-136
(0, 1) ⊆ R is not sequentially compact since no subsequence of (1/n) converges to
any x ∈ (0, 1).
Consider the discrete metric on Z. If xn = n and x ∈ Z, then d(x, xn ) = 1 for
all n with at most one exception. Thus the sequence xn can have no convergent
subsequence.
1.4. COMPACTNESS 39

L. 1-137
Suppose that (X, d) is a sequentially compact metric space and that the collection
Uα with α ∈ A is an open cover of X. Then there exists a δ > 0 such that, given
any x ∈ X, there exists an α(x) ∈ A such that the open ball B(x, δ) ⊆ Uα(x) .

Suppose the first sentence is true and the second sentence false. Then, for each
n ≥ 1, we can find an xn such that the open ball B(xn , 1/n) 6⊆ Uα for all α ∈ A.
By sequential compactness, we can find y ∈ X and n(j) → ∞ such that xn(j) → y.
Since y ∈ X, we must have y ∈ Uβ for some β ∈ A. Since Uβ is open, we can find
an  such that B(y, ) ⊆ Uβ . Now choose J sufficiently large that n(J) > 2−1
and d(xn(J) , y) < /2. We now have, using the triangle inequality, that

B(xn(J) , 1/n(J)) ⊆ B(xn(J) , /2) ⊆ B(y, ) ⊆ Uβ ,

contradicting the definition of xn(J) .


L. 1-138
Let (xn ) be a sequence in a metric space (X, d) and x ∈ X. Then (xn ) has a
subsequence converging to x iff for every ε > 0, xn ∈ Bε (x) for infinitely many n
(∗).

(Forward) If (xni ) → x, then for every ε, we can find I such that i > I implies
xnj ∈ Bε (x) by definition of convergence. So (∗) holds.
(Backward) Suppose (∗) holds. We will construct a sequence xni → x inductively.
Take n0 = 0. Suppose we have defined xn0 , xni−1 . Now xn ∈ B1/i (x) for infinitely
many n. Take ni to be smallest such n with ni > ni−1 . Then d(xni x) < 1i implies
that xni → x.
T. 1-139
Let (X, d) be a metric space, then X is compact iff X is sequentially compact.

(Forward) Suppose xn is a sequence in X with no convergent subsequence. Then


for any y ∈ X, there is no subsequence converging to y. By [L.1-138], there exists
εy > 0 such that xn ∈ Bεy (y) for only finitely many n.
Let Uy = Bεy (y). Now V = {Uy : y ∈ X} is an open cover of S X. Since X is
compact, there is a finite subcover {Uy1 , · · · , Uym }. Then xn ∈ m
i=1 Uyi = X for
only finitely many n. This is nonsense, since xn ∈ X for all n. So xn must have a
convergent subsequence.
(Backward) Let (Uα )α∈A be an open cover and let δ be defined as in [L.1-137].
The B(x, δ) form a cover of X. If they have no finite subcover, then given x1 , x2 ,
/ n
S
. . . xn we can find (define) an xn ∈ X such that xn+1 ∈ j=1 B(xj , δ). Consider
the sequence xj thus obtained. We have d(xn+1 , xk ) > δ whenever n ≥ k ≥ 1 and
so d(xr , xs ) > δ for all r 6= s. It follows that, if x ∈ X, d(xn , x) > δ/2 for all n with
at most one exception. Thus the sequence of xn has no convergent subsequence.
Contradicting X being sequentially compact.
So it thus follows that B(x, δ) have a finite subcover. In other words, we can find
an M and yj ∈ X [1 ≤ j ≤ M ] such that X = M
S
j=1 B(yj , δ). We thus have

M
[ M
[
X= B(yj , δ) ⊆ Uα(yj ) ⊆ X
j=1 j=1
40 CHAPTER 1. METRIC AND TOPOLOGICAL SPACES
SM
so X = j=1 Uα(yj ) and we have found a finite subcover. Thus X is compact.
E. 1-140
Prove [P.1-113]. ([a, b] with the usual topology is compact)

By the Bolzano-Weierstrass theorem, [a, b] is sequentially compact. Since we are


in a metric space, it follows that [a, b] is compact.
E. 1-141
Sequentially compact metric space is bounded

(Proof 1) By [P.1-116].
(Proof 2) If (X, d) is sequentially compact but not bounded, then for any x0 ,
we can find a sequence (xk ) such that d(xk , x0 ) > k for every k. But then (xk )
cannot have a convergent subsequence. Otherwise, if xkj → x, then d(xkj , x0 ) ≤
d(xkj , x) + d(x, x0 ) and is bounded, which is a contradiction.
E. 1-142
Let X = C[0, 1] with the topology induced d∞ (uniform norm). Let
 y
nx
 x ∈ [0, 1/n]
fn (x) = 2 − nx x ∈ [1/n, 2/n]

0 x ∈ [2/n, 1]

x

Then fn (x) → 0 for all x ∈ [0, 1]. We now claim that fn has no convergent
subsequence. Suppose fni → f . Then fni (x) → f (x) for all x ∈ [0, 1]. However,
we know that fni (x) → 0 for all x ∈ [0, 1]. So f (x) = 0. However, d∞ (fni , 0) = 1.
So fni 6→ 0. It follows that B1 (0) ⊆ X is not sequentially compact. So it is not
compact.
D. 1-143
• Let (X, d) be a metric space. A sequence (xn ) in X is Cauchy if for every ε > 0,
∃N such that d(xn , xm ) < ε for all n, m ≥ N .
• A metric space (X, d) is complete if every Cauchy sequence in X converges con-
verges to a limit in X.
E. 1-144
• xn = n
P
k=1 1/k is not Cauchy.
1
• Let X = (0, 1) ⊆ R with xn = n
. Then this is Cauchy but does not converge.
• If xn → x ∈ X, then xn is Cauchy. The proof is the same as that in Analysis I.
• Let X = Q ⊆ R. Then the sequence (2, 2.7, 2.71, 2.718, · · · ) is Cauchy but does
not converge in Q.
• (0, 1) and Q are not complete.
P. 1-145
If X is a compact metric space, then X is complete.

Let xn be a Cauchy sequence in X. Since X is sequentially compact, there is a


convergent subsequence xni → x. We will show that xn → x.
1.4. COMPACTNESS 41

Given ε > 0, pick N such that d(xn , xm ) < ε/2 for n, m ≥ N . Pick I such
that nI ≥ N and d(xni , x) < ε/2 for all i > I. Then for n ≥ nI , d(xn , x) ≤
d(xn , xnI ) + d(xnI , x) < ε. So xn → x.
Observe that R with the usual Euclidean metric is complete but not compact.
P. 1-146
Rn is complete.

If xn ⊆ Rn is Cauchy, then xn ⊆ B̄R (0) for some R, and B̄R (0) is compact. So it
converges.
E. 1-147
Note that completeness is not a topological property. R ' (0, 1) but R is complete
while (0, 1) is not. This is since Cauchy-ness depends on the metric (not the
topology), Cauchy sequence only make sense in a metric. For example, R \ {0}
with the usual metric d1 (x, y) = |x − y| is incomplete since xn = 1/n is Cauchy
but does not converge, however it is complete under the metric d2 (x, y) = | x1 − y1 |.
Note that both metrics induce the same topology on R \ {0} since xn → x under
d1 iff xn → x under d2 , this is because for all x ∈ R \ {0} we have |xn − x| → 0 iff
| x1n − x1 | → 0.
E. 1-148
When searching for a counterexample we may start by looking at R and Rn with
the standard metrics and subspaces like Q, [a, b], (a, b) and [a, b). Then we might
look at the discrete and indiscrete topologies on a space. It is often worth looking
at possible topologies on spaces with a small number of points.
42 CHAPTER 1. METRIC AND TOPOLOGICAL SPACES
CHAPTER 2
Variational principle

2.1 Multivariate calculus


2.1.1 Convex functions
D. 2-1
A set S ⊆ Rn is convex if for any distinct x, y ∈ S and
t ∈ (0, 1), we have (1 − t)x + ty ∈ S. Alternatively, any line
joining two points in S lies completely within S. non-convex convex
n
A function f : R → R is convex if
1. The domain D(f ) is convex
(1 − t)f (x) + tf (y)
2. The function f lies below (or on) all its chords, ie.
f ((1 − t)x + ty) ≤ (1 − t)f (x) + tf (y) for all x, y ∈
D(f ), t ∈ (0, 1).
A function is strictly convex if the inequality is strict,
ie. f ((1 − t)x + ty) < (1 − t)f (x) + tf (y). A function f x (1 − t)x + ty y
is (strictly) concave iff −f is (strictly) convex.1
A matrix H is called positive semi-definite if v T Hv ≥ 0 for all v ∈ Rn .
E. 2-2
1. f (x) = x2 is strictly convex.
2. f (x) = |x| is convex, but not strictly.
1
3. f (x) = x
defined on x > 0 is strictly convex.
4. f (x) = defined on R∗ = R \ {0} is not convex. Apart from the fact that R∗ is
1
x
not a convex domain. But even if we defined, like f (0) = 0, it is not convex by
considering the line joining (−1, −1) and (1, 1) (and in fact f (x) = x1 defined
on x < 0 is concave).
T. 2-3
<First-order convexity conditions> For a function that is once-differentiable,
the convexity condition is equivalent to f (y) ≥ f (x) + (y − x) · ∇f (x).

(Forward) Suppose that f is convex. For fixed x, y, we define the function

h(t) = (1 − t)f (x) + tf (y) − f ((1 − t)x + tf (y)).

By convexity of f , we must have h(t) ≥ 0. Also, trivially h(0) = 0. So (h(t) −


h(0))/t ≥ 0 for any t ∈ (0, 1). So h0 (0) ≥ 0. Differentiate h directly and evaluate
at 0 gives h0 (0) = f (y) − f (x) − (y − x) · ∇f (x). Combining gives the result.
1
A concave function still requires the domain to be convex

43
44 CHAPTER 2. VARIATIONAL PRINCIPLE

(Backward) We have f (y) ≥ f (z)+(y−z)·∇f (z) and f (x) ≥ f (z)+(x−z)·∇f (z).


So let z = (1 − t)x + ty, we have

(1 − t)f (x) + tf (y) ≥ f (z) + ((1 − t)x + ty − z) · ∇f (z) = f ((1 − t)x + ty)

f (x) + (y − x) · ∇f (x) defines the tangent plane of f at x. Hence this condition


states that a convex differentiable function lies above all its tangent planes.
T. 2-4
<Alternative first-order convexity conditions> For a function that is once-
differentiable, the convexity condition is equivalent to (y−x)·(∇f (y)−∇f (x)) ≥ 0.

(Forward) We can rewrite first-order condition into the form

(y − x) · (∇f (y) − ∇f (x)) ≥ f (x) − f (y) − (x − y) · ∇f (y).

By applying the first-order condition to the right hand side (with x and y swapped),
we know that the right hand side is ≥ 0. So we have the result.
(Backward) Let z(t) = (1 − t)x + ty. Now (z − x) · (∇f (z) − ∇f (x)) ≥ 0 implies
(y − x) · (∇f (z) − ∇f (x)) ≥ 0 since t ≥ 0. Now note that
Z 1 Z 1
d
f (y) − f (x) = [f (z(t))]10 = (f (z(t)))dt = (y − x) · ∇f (z(t))dt
0 dt 0
Z 1
=⇒ f (y) − f (x) − (y − x) · ∇f (x) = (y − x) · (∇f (z(t)) − ∇f (x))dt ≥ 0
0

For example, when n = 1, the equation states that (y − x)(f 0 (y) − f 0 (x)) ≥ 0,
which is the same as saying f 0 (y) ≥ f 0 (x) whenever y > x.
T. 2-5
<Second-order convexity condition> For a function f that is everywhere
twice differentiable, the function is convex iff the Hessian matrix Hij never has a
negative eigenvalue (equivalently positive semi-definite).

(Forward) By [T.2-4] (y − x) · (∇f (y) − ∇f (x)) ≥ 0. Write y = x + h, we have


h · (∇f (x + h) − ∇f (x)) ≥ 0. Expand the left in Taylor series and with suffix
notation, ∇i f (x + h) = ∇i f (x) + hj Hij (x) + O(h2 )). So we have

hi Hij hj + O(h3 ) ≥ 0

Since we can take |h| as small as we wish, this implies the Hessian has no negative
eigenvalue for all x ∈ D(f ).
(Backward) For n = 1. So we have f 00 (x) ≥ 0 for all x in some convex domain. So
R x+h 00
0 ≤ (sign h) x f (z)dz = (sign h)(f 0 (x + h) − f 0 (x)). Now integrate between 0
R y−x R0
and y −x (note that h > 0 when y > x and h < 0 when y < x, also 0 = − y−x )
we have the first order condition since
Z y−x
0≤ (f 0 (x + h) − f 0 (x))dh = f (y) − f (x) − (y − x)f 0 (x)
0

For general n. For any x, y, let n̂ = (y − x)/|y − x|. Define g(z) = f (x + zn̂),
then g is convex, so by above g(|y − x|) ≥ g(0) + |y − x|g 0 (0), so f (y) ≥ f (x) +
|y − x|n̂ · ∇f (x) = f (x) + (y − x) · ∇f (x).
2.1. MULTIVARIATE CALCULUS 45

If all eigenvalues are everywhere strictly positive then the function is strictly con-
vex, but this is only a sufficient condition for strict convexity, not a necessary one.
For example, the function f (x) = x4 is strictly convex despite the fact f 00 (x) is
zero at x = 0.
E. 2-6
1
Let f (x, y) = xy
for x, y > 0. Then the Hessian and its determinant and trace are
 2 1   
1 x2 xy 3 2 1 1
H= 1 2 det H = 4 4 > 0 tr H = + 2 > 0.
xy xy y2 x y xy x2 y

Since the products of eigenvalues are the determinant and the sum are the trace,
the Hessian never has negative eigenvalues. So f is convex.
Note that to conclude that f is convex, we only used the fact that xy is positive,
but we cannot relax the domain condition to be xy > 0 instead because the domain
has to be a convex set.
P. 2-7
A stationary point of a convex function is a global minimum. There can be more
than one global minimum (eg a constant function), but there is at most one if the
function is strictly convex.

Given x0 such that ∇f (x0 ) = 0, the first-order condition implies that for any y,
f (y) ≥ f (x0 ) + (y − x0 ) · ∇f (x0 ) = f (x0 ).
If f is strictly convex and f (x) and f (y) are both global minimum, then we must
have f (x) = f (y). So f ((1 − t)x − ty) ≤ (1 − t)f (x) − tf (y) = f (x), contradiction.
So there is at most one global minimum.
P. 2-8
Let A ⊆ Rn be an open convex set. Then f : A → R is a convex function iff for
any a ∈ A, ∃m ∈ Rn such that f (x) ≥ f (a) + m · (x − a) for all x ∈ A.

(Forward) First we prove the case n = 1. Pick any a ∈ A. Note that if f (x) is
convex then so is h(x) = f (x + µ) − λ. So wlog let a = 0 and f (0) = 0. Now let
m = inf {f (x)/x : x ∈ A, x > 0}.
Suppose m = −∞. Pick y < 0 s.t. y ∈ A. Let α = f (y)/y. We can find x > 0
such that f (x)/x < α − 1. We have
       
−y x −y x
0 = f (0) ≤ f (x) + f (y) < (α − 1)x + yα < 0
x−y x−y x−y x−y

Contradiction, so m 6= −∞ (ie. m is finite). So we must have f (x) ≥ mx for all


x ≥ 0. Suppose ∃y < 0 s.t. f (y) < my. Define g : A → R by g(x) = f (x) − mx,
then g is convex. We have g(y) < 0, let ε = g(y)/y, then ε > 0. We can find x > 0
such that m ≤ f (x)/x ≤ m + ε/2, so 0 ≤ g(x)/x ≤ ε/2. Now by convexity
       
−y x −y xε x
0 = g(0) ≤ g(x) + g(y) < + (yε) < 0
x−y x−y x−y 2 x−y

Contradiction, so we have f (x) ≥ mx for all x ∈ A.


Now we prove the general statement by induction. Suppose it’s true for n = k. Let
A ⊆ Rk × R = Rk+1 be an open and convex set and f : A → R a convex function.
46 CHAPTER 2. VARIATIONAL PRINCIPLE

Pick a ∈ A, wlog assume a = (0, 0) and f (0, 0) = 0. By induction assumption


we can find m ∈ Rk such that f (x, 0) ≥ m · x for all (x, 0) ∈ A. By openness
∃(x, y) ∈ A with y > 0 and y < 0. Define g : A → R by g(x, y) = f (x, y) − m · x,
note that f is convex iff g is convex. Let
 
g(x, y)
M = inf : y > 0, (x, y) ∈ A
y
Suppose M = −∞. Pick (w, z) ∈ A with z < 0. Let α = g(w, z)/z. We can find
(x, y) ∈ A with y > 0 such that g(x, y)/y < α − 1. We have
   
−zg(x, y) yg(w, z) −z y
0 ≤ g(u, 0) ≤ + < (α − 1)y + zα < 0
y−z y−z y−z y−z
where u = (yw − zx)/(y − z). Contradiction, so M 6= −∞. So we must have
g(x, y) ≥ M y for all y ≥ 0. Suppose ∃(w, z) ∈ A with z < 0 such that g(w, z) <
M z. Define h : A → R by h(x, y) = g(x, y) − M y, then h is convex. We have
h(w, z) < 0, let ε = h(w, z)/z, then ε > 0. We can find (x, y) ∈ A with y > 0
such that M ≤ g(x, y)/y ≤ M + ε/2, so 0 ≤ h(x, y)/y ≤ ε/2. Now by convexity
   
−zh(x, y) yh(w, z) −z yε y
0 ≤ h(u, 0) ≤ + < + (zε) < 0
y−z y−z y−z 2 y−z
Contradiction, so we have g(x, y) ≥ M y for all (x, y) ∈ A. So f (x, y) ≥ (m, M ) ·
(x, y) for all (x, y) ∈ A. Hence the result by induction.
(Backward) We have f (y) ≥ f (z) + (y − z) · m and f (x) ≥ f (z) + (x − z) · m. So
let z = (1 − t)x + ty, we have
(1 − t)f (x) + tf (y) ≥ f (z) + ((1 − t)x + ty − z) · m = f ((1 − t)x + ty)
Note that it’s clear from the proof that forward direction can be modified as: If
f : A → R is a convex function (and A convex), then for any a ∈ Int(A), ∃m ∈ Rn
such that f (x) ≥ f (a) + m · (x − a) for all x ∈ A.

2.1.2 Legendre transform


D. 2-9
Given a function f : Rn → R,2 its Legendre transform f ∗ (the “conjugate”
function) is defined by
f ∗ (p) = sup(p · x − f (x)),
x
The domain of f ∗ is the set of p ∈ Rn such that the supremum is finite. p is
known as the conjugate variable .
L. 2-10
f ∗ is always convex.

f ∗ ((1 − t)p + tq) = sup ((1 − t)p · x + tq · x − f (x)


 
x
 
= sup (1 − t)(p · x − f (x)) + t(q · x − f (x))
x

≤ (1 − t) sup[p · x − f (x)] + t sup[q · x − f (x)]


x x

= (1 − t)f ∗ (p) + tf ∗ (q)


2
Rn can be replace by a subset of Rn but here we usually consider just Rn .
2.1. MULTIVARIATE CALCULUS 47

Now f ∗ ((1−t)p+tq) is bounded by the sum of two finite terms, which is finite. So
(1 − t)p + tq is also in the domain of f ∗ (domain is convex). So f ∗ is convex.

E. 2-11
When f is once-differentiable, the maximum (a supremum that is attained) of
p · x − f (x), if it exist, is found by solving the equation p = ∇f (x) for x in terms
of p (note that ∇(p · x) = p). The Legendre transform of f is then

f ∗ (p) = p · x(p) − f (x(p)).

In this case if f is a strictly convex function, then p · x − f (x) is strictly concave,


and so there is a unique solution of p = ∇f (x),[P.2-7] and which is the maximum
of p · x − f (x). In the 1D case, this correspond to f ∗ (p) = px − f (x), where x
satisfies f 0 (x) = p. Since f 0 is a monotonically increasing function of x, there can
be only one value of x for a given value of p. If f is convex but not strictly convex
there will be a solution to p = ∇f (x) but it will not be unique for all p ∈ D(f ∗ ).

E. 2-12
y
slope = p

If the convex function f is defined on the


whole line and is everywhere differentiable,
then f ∗ (p) can be interpreted as the nega- f (x)
tive of the y-intercept of the tangent line to px
the graph of f at x that has slope p. x
f ∗ (p) = px − f (x)

−f ∗ (p)

E. 2-13
1. Let f (x) = 12 ax2 for a > 0. Then p = ax at the maximum of px − f (x). So

p 1  p 2 1 2
f ∗ (p) = px − f (x) = p · − a = p , p ∈ R.
a 2 a 2a

So the Legendre transform maps a parabola to a parabola.



2. f (v) = − 1 − v 2 for |v| < 1 is a lower semi-circle. We have

v p
p = f 0 (v) = √ =⇒ v= p
1 − v2 1 + p2

and exists for all p ∈ R. So

p2 1
f ∗ (p) = pv − f (v) = p
p
+p = 1 + p2 .
1+p 2 1+p 2

A circle gets mapped to a hyperbola.

3. Let f = cx for c > 0. This is convex but not strictly convex. Then px − f (x) =
(p − c)x. This has no maximum unless p = c. So the domain of f ∗ is the one
point set {c}. So f ∗ (p) = 0. So a line goes to a point.
48 CHAPTER 2. VARIATIONAL PRINCIPLE

T. 2-14
If f is a convex differentiable function with f ∗ , then f ∗∗ = f .

(Proof 1, for nice functions) Suppose we have f ∗ (p) = (p · x(p) − f (x(p)) where
x(p) satisfies p = ∇f (x(p)) and with x(p) differentiable. Differentiating with
respect to p, we have

∇i f ∗ (p) = xi + pj ∇i xj (p) − ∇i xj (p)∇j f (x)


= xi + pj ∇i xj (p) − ∇i xj (p)pj = xi .

So ∇f ∗ (p) = x. This means that the conjugate variable of p is our original x. So

f ∗∗ (x) = (x · p − f ∗ (p))|p=p(x) = x · p − (p · x − f (x)) = f (x).

(Proof 2) Note that we have f (x) + f ∗ (y) ≥ x · y for all x, y ∈ Rn . So we have


f (x) ≥ supy∈D(f ∗ ) {x · y − f ∗ (y)} = f ∗∗ (x). Also,
 
∗∗
f (x) = sup x · y − sup {y · z − f (z)} = sup infn {y · (x − z) + f (z)}
y∈D(f ∗ ) z∈Rn y∈D(f ∗ ) z∈R

Since f is convex, for fixed x ∈ Rn , there exist m ∈ Rn such that f (z) ≥ f (x) +
m · (z − x) for all z ∈ Rn . Note that m ∈ D(f ∗ ) since m · x − f (x) ≥ supz∈Rn {z ·
m − f (z)}. Combining these inequalities gives

f ∗∗ (x) ≥ f (x) + sup infn (z − x) · (m − y) = f (x)


y∈D(f ∗ ) z∈R

f doesn’t need to be Strictly convex. For example, in our last example above with
the straight line, f ∗ (p) = 0 for p = c. So f ∗∗ (x) = (xp − f ∗ (p))|p=c = cx = f (x).
However, convexity is required. If f ∗∗ = f is true, then f must be convex, since it
is a Legendre transform. Hence f ∗∗ = f cannot be true for non-convex functions.
E. 2-15
<Application to thermodynamics> The first law of thermodynamics is

dE = T dS − P dV.

which states that a small change in the energy of a system in thermal equilibrium
at temperature T and pressure P is the sum of a heat energy term (T dS), due to
a change in the entropy S, and a mechanical work energy term (−P dV ), due to
a change in the volume V . The formula also shows that the total energy E is a
function of the two “extensive” variables (S, V ), so called because these variables
scale with the size of the system. This is in contrast to the “intensive” variables
(T, P ), which can be defined as (comparing with the chain rule)

∂E ∂E
T = , P =− (∗)
∂S ∂V
For a process occurring at fixed entropy the first law becomes dE + P dV = 0,
which tells us that work done by the system will lead to a corresponding reduction
of its energy E. However, many processes of interest occur at fixed temperature,
not at fixed entropy, and in such cases it is more useful to consider (T, V ) as the
independent variables. We can arrange for this by taking the Legendre transform
2.1. MULTIVARIATE CALCULUS 49

of E(S, V ) with respect to S (the volume variable V just goes along for the ride
here). We will call this new function −F (T, V ), so

F (T, V ) = − sup[T S − E(S, V )] = inf [E(S, V ) − T S]


S S

Strictly speaking, we do not yet know that the new independent variable T is the
temperature appearing in the first law. However, the maximum of the RHS w.r.t.
variations of S occurs when T = ∂E∂S
, and this is indeed the temperature as in (∗).
Solving T = ∂E∂S
gives S = S(T, V ). So F (T, V ) = E(S(T, V ), V ) − T S(T, V ). It
then follows that
∂F ∂F
dF = dT + dV and dF = −SdT − T dS + dE = −SdT − P dV
∂T ∂V
∂F ∂F
=⇒ S=− , P =−
∂T ∂V
We now have an alternative version of the first law: dF = −SdT − P dV . For a
process at fixed T , this reduces to dF + P dV = 0, which tells us that work done by
the system at fixed T implies a corresponding reduction in F , which is therefore
the energy that is available to do work at fixed temperature; this is less than the
total energy E because F = E −T S and both T and S are positive. This “available
energy”, as it is sometimes called, is more usually called the Helmholtz free energy,
or just “free energy”. It is also possible to take the Legendre transform of E(S, V )
with respect to the volume V . This gives another “thermodynamic potential”
H(S, P ), known as enthalpy:

H(S, P ) = − sup[(−P )V − E] =⇒ H(S, P ) = E(S, V (S, P )) + P V (S, P )


V

∂E
where V (S, P ) is the solution of P = − ∂V . We have
∂H ∂H
dH = dS + dV and dH = V dP − P dV + dE = V dP + T dS
∂S ∂P
∂H ∂H
=⇒ T = , V =−
∂S ∂P
and it gives us another alternative version of the first law: dH = T dS + V dP .
Enthalpy is useful for chemistry because chemical reactions often take place at
fixed (atmospheric) pressure P . At fixed P we have dH = T dS, which tells that
a transfer of heat to a substance raises its enthalpy by a corresponding amount.
Finally, if we take the Legendre transform with respect to both S and V , we get
the Gibbs free energy.

2.1.3 Constrained variation and Lagrange multipliers


At the beginning, we considered the problem of unconstrained maximization . For
example, we take a simple example of a hill, where the height at (x, y) is given by
f (x, y). The hilltop would be given by the maximum of f , which satisfies

0 = df = ∇f · dx

for any (infinitesimal) displacement dx. So we need ∇f = 0. This would be a case of


unconstrained maximization, since we are considering all possible values of x and y. A
50 CHAPTER 2. VARIATIONAL PRINCIPLE

problem of constrained maximization would be if we also have in addition a path p


defined by p(x, y) = 0, and we wish to find the highest point along the path p.

We still need ∇f · dx = 0, but now dx is not arbitrary, we only consider the dx parallel
to the path. That is to say ∇f has to be entirely perpendicular to the path. Since we
know that the normal to the path is ∇p, our condition becomes ∇f = λ∇p for some
lambda λ. Of course, we still have the constraint p(x, y) = 0. So what we have to
solve is
∇f = λ∇p, p=0

for the three variables x, y, λ. We can change this into a single problem of unconstrained
extremization. We ask for the stationary points of the function φ(x, y, λ) (called the
Lagrangian ) given by
φ(x, y, λ) = f (x, y) − λp(x, y)

since [[ ∇φ = 0 ]] ⇔ [[ ∇f = λ∇p ]] ∧ [[ p = 0 ]]. Variations of x and y, we obtain the


∇f = λ∇p condition, and variation against λ gives the condition p = 0. λ in this
context is called the Lagrange multiplier . When we solve for ∇f = λ∇p, we find all
the points such that ∇f is parallel to ∇p (ie. ∃λ s.t. ∇f = λ∇p), and then we solve
for p = 0 which further restrict those point to lie on our desire path.

Just like the ordinary unconstrained maximization, we still have to determine by other
means which stationary point, if any, is the one we need. However, this is usually easy
to sort out, and a bonus is that the value of the Lagrange multiplier often has some
significance that aids understanding of the problem.

The method of Lagrange multipliers is easily extended to find the stationary points
of f : Rn → R subject to m < n constraints pk (x) = 0 (k = 1, · · · , m). In this
case we need m Lagrange multipliers, one for each constraint, and we extremise the
function
m
X
φ(x; λ1 , · · · , λm ) = f (x) − λk pk (x)
k=1

with respect to the n + m variables on which it depends. Similar as above the interpre-
tation is that each of pk (x) = 0 define a plane in Rn and the normal of the plane at x is
given by ∇pk (x). For x to be a solution of this constrained maximization,
P ∇f (x) must
be in the span of ∇pk (x) = 0 (k = 1, · · · , m), ie. ∇f (x) = k λk ∇pk (x).

E. 2-16
Find the radius of the smallest circle centered on origin that intersects y = x2 − 1.

1. First do it the easy way: for a circle of radius R to work, x2 + y 2 = R2


and y = x2 − 1 must have a solution.
√ So (x2 )2 − x2 + 1 − R2 = 0 and
2 1 2 3 1/2
x = 2 ± (R − 4 ) . So Rmin = 3/2.

2. We can also view this as a variational problem. We want to minimize f (x, y) =


x2 + y 2 subject to the constraint p(x, y) = y − x2 + 1 = 0. We can solve this
directly. The constraint is y = x2 −1. Then R2 (x) = f (x, y(x)) = (x2 )2 −x2 +1.
We look for stationary points of R2 : (R2 (x))0 = 0 ⇒ x x2 − 21 = 0. So x = 0
√ √
and R = 1; or x = ± √12 and R = 23 . Since 23 is smaller, this is our minimum.

3. Finally, we can use Lagrange multipliers. We find stationary points of the


2.2. FUNCTIONALS AND THE EULER-LAGRANGE EQUATION 51

function φ(x, y, λ) = f (x, y) − λp(x, y) = x2 + y 2 − λ(y − x2 + 1).


 
2x(1 + λ)
0 = ∇φ =  2y − λ 
y − x2 + 1

The first equation gives us two choices


p
• x = 0. Then the third equation gives y = −1. So R = x2 + y 2 = 1.

• λ = −1. So the second equation gives y = − 21 and the third gives x = ± √12 .

Hence R = 3/2 is the minimum.

E. 2-17
For x ∈ Rn , find the minimum of the quadratic form f (x) = xi Aij xj on the surface
|x|2 = 1.

1. The constraint imposes a normalization condition x. But if we scale up x,


f (x) scales accordingly. So if we define Λ(x) = f (x)/g(x) with g(x) = |x|2 the
problem is equivalent to minimization of Λ(x) without constraint. Then
 
2 f
∇i Λ(x) = Aij xj − xi
g g

So we need Ax = Λx. So the extremal values of Λ(x) are the eigenvalues of A.


So Λmin is the lowest eigenvalue.

2. We can also do it with Lagrange multipliers. We want to find stationary values


of φ(x, λ) = f (x) − λ(|x|2 − 1). Now 0 = ∇φ gives Aij xj = λxi and |x|2 = 1.
So we get the same set of equations.

E. 2-18
P
Find the probability distributionP{p1 , · · · , pn } satisfying i pi = 1 that maximizes
n
the information entropy S = − i=1 pi log pi .
Pn
pi ln pi − λ( n
P
We look for stationary points of φ(p, λ) = − i=1 i=1 pi − 1).

∂φ
= − ln pi − (1 + λ) = 0.
∂pi

So pi = e−(1+λ) It is the same for all i. So we must have pi = 1


n
.

2.2 Functionals and the Euler-Lagrange equation


D. 2-19
A functional is a real-valued function on a vector space V , usually of functions
(i.e. it takes in a function and give out a number). We write them as F [x], where
x is a function. We say that F [x] is a functional of the function x(t).
52 CHAPTER 2. VARIATIONAL PRINCIPLE

E. 2-20
• We can have functional of functions: F [x] ∈ R where x : R → R. We can also have
functional of many functions of many variables: F [x] ∈ R where x : Rn → Rk .
• Given a medium with refractive index n(x), Rthe time taken by a path x(t) from
x
x0 to x1 is given by the functional T [x] = x01 n(x) dt. We might want to ask
questions like what path minimises the time taken? In which case we need to find
the “stationary point” of the functional.

2.2.1 Derivative of functional


Given a function x(t) defined for α ≤ t ≤ β, we study functional of the form
Z β
F [x] = f (x, ẋ, t) dt where f is some function.
α

Our objective is to find a stationary point of the functional F [x]. To do so, suppose we
vary x(t) by a small amount δx(t). The corresponding change δF [x] of F [x] is
Z β 
δF [x] = F [x + δx] − F [x] = f (x + δx, ẋ + δ ẋ, t) − f (x, ẋ, t) dt
α
Z β  
∂f ∂f
= + δ ẋ
δx dt + o(δ 2 ) (Taylor expand)
α ∂x ∂ ẋ
Z β     β
∂f d ∂f ∂f
= δx − dt + δx (Integration by parts)
α ∂x dt ∂ ẋ ∂ ẋ α

Usually we have boundary condition so that the boundary term δx ∂f

∂ ẋ α
vanishes.
There are three ways:
1. Fixed end boundary conditions. We specify the values of x(α) and x(β), so δx(α) =
δx(β) = 0.
2. Free end (or “natural”) boundary conditions. These are such that ∂f ∂ ẋ
= 0 at the
integration endpoints. Usually, this can be the case if we set ẋ(α) = ẋ(β) = 0.
3. Mixed boundary conditions. Fixed at one end and free at the other.
Now we write3
Z β    
δF [x] δF [x] ∂f d ∂f
δF [x] = δx dt where = −
α δx(t) δx ∂x dt ∂ ẋ

We call δF [x]/δx the functional derivative of F [x] with respect to x(t). If we want
to find a stationary point of F , then we need δFδx[x] = 0. So
 
∂f d ∂f
<Euler-Lagrange equation> − = 0, for α ≤ t ≤ β.
∂x dt ∂ ẋ

There is an obvious generalization to functionals F [x] for x(t) ∈ Rn :


 
∂f d ∂f
− = 0 for all i.
∂xi dt ∂ ẋi
2.2. FUNCTIONALS AND THE EULER-LAGRANGE EQUATION 53

E. 2-21
<Geodesics of a plane> What is the curve C of minimalR length between two
points A, B in the Euclidean plane? The length is L = C d` where d` =
p
dx2 + dy 2 . There are two ways we can do this:
1. We restrict to curves for which
p x (or y) is a good parameter, ie. y can be made
a function of x. Then d` = 1 + (y 0 )2 dx, so
Z βp
L[y] = 1 + (y 0 )2 dx.
α

Since there is no explicit dependence on y, we know that ∂f∂y


= 0. So the
Euler-Lagrange equation says that
 
d ∂f ∂f
=0 =⇒ = constant
dx ∂y 0 ∂y 0
This is known as a first integral, which will be studied more in detail later.
Plugging in our value of f , we obtain
y0
p = constant
1 + (y 0 )2

This shows that y 0 must be constant. So y must be a straight line.


2. We can get around the restriction to “good” curves by choosing an arbitrary
parameterization
p r = (x(t), y(t)) for t ∈ [0, 1] such that r(0) = A, r(1) = B.
So d` = ẋ2 + ẏ 2 dt, then
Z 1p
L[x, y] = ẋ2 + ẏ 2 dt.
0

∂f ∂f
We have, again ∂x
= ∂y
= 0. So
   
d ∂f d ∂f ẋ ẏ
= =0 =⇒ p = c, p =s
dt ∂ ẋ dt ∂ ẏ ẋ2 + ẏ 2 ẋ2 + ẏ 2
where c and s are constants. While we have two constants, they are not
independent. We must have c2 + s2 = 1. So we let c = cos θ, s = sin θ.
Then the two conditions are both equivalent to (ẋ sin θ)2 = (ẏ sin θ)2 . Hence
ẋ sin θ = ±ẏ cos θ. We can choose a θ such that we have a positive sign. So
y cos θ = x sin θ + A for a constant A. This is a straight line with slope tan θ.

2.2.2 First integrals


In our example above, f did not depend on x, and hence ∂f∂x
= 0. Then the Euler-
Lagrange equations entail
 
d ∂f ∂f
= 0, we integrate to obtain = constant.
dt ∂ ẋ ∂ ẋ

We call this the first integral . First integrals are important in several ways. Firstly
it simplifies the problem a lot, we only have to solve a first-order differential equation,
54 CHAPTER 2. VARIATIONAL PRINCIPLE
∂f
instead of a second-order one. Not needing to differentiate ∂ ẋ
also prevents a lot of
mess arising from the product and quotient rules.
Also in physics, if we have a first integral, then we get ∂f
∂ ẋ
= constant. This corre-
sponds to a conserved quantity of the system. When formulating physics problems as
variational problems, the conservation of energy and momentum will arise as constants
of integration from first integrals.
There is also a more complicated first integral appearing when f does not explicitly
depend on t (ie. ∂f ∂t
= 0). Consider the total derivative df dt
, by the chain rule, we
have
df ∂f dx ∂f dẋ ∂f ∂f ∂f ∂f
= + + = + ẋ + ẍ .
dt ∂t dt ∂x dt ∂ ẋ ∂t ∂x ∂ ẋ
On the other hand, the Euler-Lagrange equation says that ∂f d ∂f

∂x
= dt ∂ ẋ
. Substituting
this into our equation for the total derivative gives
     
df ∂f d ∂f ∂f ∂f d ∂f d ∂f ∂f
= + ẋ + ẍ = + ẋ =⇒ f − ẋ = .
dt ∂t dt ∂ ẋ ∂ ẋ ∂t dt ∂ ẋ dt ∂ ẋ ∂t
∂f
So if ∂t
= 0, then we have the first integral
∂f
f − ẋ = constant.
∂ ẋ
E. 2-22
Find the path of the light
√ray travels in the vertical xz plane inside a medium with
refractive index n(z) = a − bz for positive constants a, b. (The velocity of light
in the medium is v = nc )

Fermat’s principle states that the path theR light takes from A to B is one that min-
B
imizes the time taken (ie. minimise T = A d` v
). This is equivalent to minimizing
RB
the optical path length P = cT = A n d`. We specify our path by the function
√ p
z(x). Then the path element is given by d` = dx2 + dz 2 = 1 + z 0 (x)2 dx, then
Z xB p
P [z] = n(z) 1 + (z 0 )2 dx.
xA

Since this does not depend on x, we have the first integral


∂f n(z)
k = f − z0 = p .
∂z 0 1 + (z 0 )2
for an integration constant k. Squaring and putting in the value of n gives
b
(z 0 )2 = (z0 − z) where z0 = (a − k2 )/b.
k2
This is integrable and we obtain
√ √
dz b √ b
√ =± dx =⇒ z − z0 = ± (x − x0 ),
z0 − z k 2k

where x0 is our second integration constant. So z = z0 − b


4k2
(x − x0 )2 which is a
parabola.
3 P ∂f
Compare this with the ordinary chain rule df = i ∂xi dxi . Now the sum gets replaced by an
integral.
2.2. FUNCTIONALS AND THE EULER-LAGRANGE EQUATION 55

Parabola is also the motion of a projectile subject to the downward acceleration


g due to gravity near the Earth’s surface. Inspired by Fermat’s work in optics,
Maupertuis suggested that mechanics could be similarly based on a “principle of
least action”, where “action” should be the product of mass, velocity and distance.
He was vague about the details, but Euler had already discovered that the motion
of a body of constant total energy
R E = 12 mv 2 + U (x), where v = |ẋ|, would
minimise the integral A = m vd`, where d` is the path length element. This
means that we should minimise
Z Bp
A= 2m(E − U (x)) d`,
A

For a particle near the surface of the Earth, under the influence of gravity, U =
mgz. So we have
Z Bp p
A[z] = 2mE − 2m2 gz 1 + (z 0 )2 dx,
A

which is of exactly the same form as the optics problem we just solved. So the
result is again a parabola.
E. 2-23
<Brachistochrone> A bead slides on a frictionless wire in a vertical plane.
What shape of the wire minimises the time for the bead to fall from rest at point
A to a lower, and horizontally displaced, point B?

The conservation
√ of energy implies that R12 mv 2 = mgy. A
So v = 2gy. We want to minimize T = d` . So x
v
s
1
Z p 2
dx + dy 2 1
Z
1 + (y 0 )2 B
T = √ √ = √ dx
2g y 2g y y
Since there is no explicit dependence on x, we have the first integral
∂f 1
f − y0 = p = constant
∂y 0 y(1 + (y 0 )2 )

So the solution is y(1 + (y 0 )2 ) = c for some positive constant c. The solution of


this ODE is, in parametric form,
x = c(θ − sin θ) y = c(1 − cos θ).
Note that this has x = y = 0 at θ = 0. This de-
scribes a cycloid.

The Brachistochrone problem was one of the earliest problems in the calculus of
variations. The name comes from the Greek words brákhistos (“shortest”) and
khrónos (“time”).

2.2.3 Constrained variation of functionals


The method of Lagrange multipliers can be used to solve variational problems with
constraints when we are faced with finding the stationary values of some functional
subject to some other functional constraint. For example, if we want to find the
56 CHAPTER 2. VARIATIONAL PRINCIPLE

stationary points of F [x] subject to the constraint P [x] = c, for some constant c, we
may extremize, without constraint,

Φλ [x] = F [x] − λ(P [x] − c).

with respect to the function x(t) and the variable λ. Assuming that the boundary
term in the variation is zero, this yields the equations
δF δP
−λ = 0, P [x] = c.
δx(t) δx(t)

Instead of a constraint like P [x] = c, it can happen that we want to minimise a



functional F [x] = α f dt subject to a condition that restricts the functions x(t) for
all t (i.e. g(x(t)) = 0 for all t). In this case we need a Lagrange multiplier function

λ(t). We look for stationary point of Φ[x; λ] = α f − λ(t)g dt. So for L = f − λg, we
solve for  
∂L d ∂L
− = 0 for all i, g(x(t)) = 0.
∂xi dt ∂ ẋi
E. 2-24
<Isoperimetric problem> If we have a string of fixed length L, what is the
maximum area we can enclose with it?

Let x : [0, 1] → R2 with x(0) = x(1) = 0. The component of ẋRperpendicular to x


1
is ẋ · (−x2 , x1 )/kxk. We want to maximise the enclosed area 0 12 (−x2 , x1 ) · ẋ dt
R1
subject to 0 kẋk dt = L. So we find the stationary point of
Z 1
1
Φλ [x] = Φλ [x1 , x2 ] = (x1 ẋ2 − x2 ẋ1 ) − λ(ẋ21 + ẋ22 ) dt + λL.
0 2
 
∂f d ∂f
− =0 =⇒ ẋ2 + 2λẍ1 = 0
∂x1 dt ∂ ẋ1
 
∂f d ∂f
− =0 =⇒ ẋ1 − 2λẍ2 = 0
∂x2 dt ∂ ẋ2
1
Let X = x1 + ix2 , then Ẋ + 2λiẌ = 0. So X = Ae 2λ ti + B for some A, B ∈ C. By
1
initial condition we have 2λ = 2π and 2π|A| = L, so X = ( 2π L
)e−2παi (e2πti − 1)
for some real α. A circle as expected. So the maximum area is L2 /(4π).
E. 2-25
<Sturm-Liouville problem> The Sturm-Liouville problem is a very general
class of problems. We will develop some very general theory about these problems
without going into specific examples. It can be formulated as follows: let ρ(x),
σ(x) and w(x) be real functions of x defined on a ≤ x ≤ b. We will consider the
special case where ρ and w are positive on a < x < b. Given that y(x) is fixed at
x = α and x = β, our objective is to find stationary points of the functional
Z β Z β
F [y] = (ρ(x)(y 0 )2 + σ(x)y 2 ) dx subject to G[y] = w(x)y 2 dx = 1.
α α

Using the Euler-Lagrange equation, the functional derivatives of F and G are


δF [y] δG[y]
= 2 − (ρy 0 )0 + σy

= 2(wy).
δy δy
2.2. FUNCTIONALS AND THE EULER-LAGRANGE EQUATION 57

So the Euler-Lagrange equation of Φλ [y] = F [y] − λ(G[y] − 1) is

−(ρy 0 )0 + σy − λwy = 0.

We can write this as the eigenvalue problem


 
d d
Ly = λwy where L=− ρ +σ
dx dx

is the Sturm-Liouville operator . We call this a Sturm-Liouville eigenvalue prob-


lem. w is called the weight function .
We can view this problem in a different way. Notice that Ly = λwy is linear in y.
Hence if y is a solution, then so is Ay. But if G[y] = 1, then G[Ay] = A2 . Hence
the condition G[y] = 1 is simply a normalization condition. We can get around
this problem by asking for the minimum of the functional

F [y]
Λ[y] =
G[y]

instead. It turns out that this Λ has some significance. To minimize Λ, we cannot
apply the Euler-Lagrange equations, since Λ is not of the form of an integral.
However, we can try to vary it directly:
1 F 1
δΛ = (F + δF )(G + δG)−1 − F/G ≈ δF − 2 δG = (δF − ΛδG).
G G G
When Λ is minimized, we have
δF δG
δΛ = 0 ⇐⇒ =Λ ⇐⇒ Ly = Λwy.
δy δy

So at stationary values of Λ[y], Λ is the associated Sturm-Liouville eigenvalue.


E. 2-26
<Geodesics> Suppose that we have a surface in R3 defined by g(x) = 0, and
we want to find the path of shortest distance between two points on the surface.
These paths are known as geodesics.
One possible approach is to solve g(x) = 0 directly. For example, if we have a unit
sphere, a possible solution is x = cos θ cos φ, y = cos θ sin φ, z = sin θ. Then the
total length of a path, which we want to minimise, would be given by
Z Bq
D[θ, φ] = θ̇2 + sin2 θφ̇2 dt.
A

with respect to the functions θ(t) and φ(t). If θ happened to be a good parameter
for the curve, we can use it instead of the extra variable t.
Alternatively, we can impose the condition g(x(t)) = 0 with a Lagrange multiplier.
Then our problem would be to find stationary values of
Z 1

Φ[x, λ] = |ẋ| − λ(t)g(x(t)) dt
0
58 CHAPTER 2. VARIATIONAL PRINCIPLE

2.3 Hamilton’s principle and Noether’s theorem


2.3.1 Hamilton’s principle
Lagrange and Hamilton reformulated Newtonian dynamics into a much more robust
system based on an action principle.
The first important concept is the idea of a configuration space , which is a vector space
containing generalized coordinates ξ(t) that specify the configuration of the system.
The idea is to capture all information about the system in one single vector.
Lagrange introduced the concept of generalized coordinates ξ(t) in 1788 and showed
that it obeys certain complicated ODEs which are determined by the kinetic energy
and the potential energy. In the 1830s, Hamilton found an improved procedure that
also applies to a larger class of problems:
<Hamilton’s principle> R The actual path ξ(t) taken by a particle is the path that
makes the action S[ξ] = L dt stationary, where L = T − V is the Lagrangian , with
T the kinetic energy and V the potential energy.
E. 2-27
Note that S has dimensions M L2 T −1 , which is the same as the 18th century action
(and Plank’s constant).
In the simplest case of a single free particle, generalized coordinates would simply
be the coordinates of the position of the particle. If we have two particles given
by positions x(t) = (x1 , x2 , x3 ) and y(t) = (y1 , y2 , y3 ), our generalized coordinates
might be ξ(t) = (x1 , x2 , x3 , y1 , y2 , y3 ). In general, if we have N different free
particles, the configuration space has 3N dimensions.
The important thing is that the generalized coordinates need not be just the usual
Cartesian coordinates. If we are describing a pendulum in a plane, we do not
need to specify the x and y coordinates of the mass. Instead, the system can be
described by just the angle θ between the mass and the vertical. So the generalized
coordinates is just ξ(t) = θ(t). This is much more natural to work with and avoids
the hassle of imposing constraints on x and y.
E. 2-28
Suppose we have 1 particle in Euclidean 3-space. The configuration space is simply
the coordinates of the particle in space. We can choose Cartesian coordinates x.
Then T = 12 m|ẋ|2 , V = V (x, t). For a trajectory that starts at point A at time
tA and ends at point B at time tB > tA , the particle’s action is
Z tB Z tB  
1
S[x] = L(x, ẋ, t) dt = m|ẋ|2 − V (x, t) dt.
tA tA 2

We apply the Euler-Lagrange equations to obtain


 
d ∂L ∂L
0= − = mẍ + ∇V.
dt ∂ ẋ ∂x
So mẍ = −∇V . This is Newton’s law F = ma with F = −∇V . This shows
that Lagrangian mechanics is “the same” as Newton’s law. However, Lagrangian
mechanics has the advantage that it does not care what coordinates you use, while
Newton’s law requires an inertial frame of reference.
2.3. HAMILTON’S PRINCIPLE AND NOETHER’S THEOREM 59

E. 2-29
<Time-independent potential> Lagrangian mechanics applies even when V is
time-dependent. However, if V is independent of time, then so is L (i.e. ∂L
∂t
= 0).
Then we can obtain a first integral. The chain rule gives

dL ∂L ∂L ∂
= + ẋ · + ẍ ·
dt ∂t ∂x
 ∂ ẋ
   
∂L ∂L d ∂L d ∂L ∂L
= + ẋ · − +ẋ + ẍ ·
∂t ∂x dt ∂ ẋ dt ∂ ẋ ∂ ẋ
| {z }
=0
 
d ∂L ∂L ∂L
=⇒ L − ẋ · = =0 =⇒ ẋ · − L = E.
dt ∂ ẋ ∂t ∂ ẋ

for some constant E. For example, for one particle, the constant of motion is the
total energy E = T + V :

1
E = m|ẋ|2 − m|ẋ|2 + V = T + V = total energy.
2

E. 2-30
<Central force fields> Consider a central force field F = −∇V , where V =
V (r) is independent of time. We use spherical polar coordinates (r, θ, φ), where

x = r sin θ cos φ, y = r sin θ sin φ, z = r cos θ.

We’ll use the fact that motion is planar (a consequence of angular momentum
conservation). So wlog θ = π2 . In this case z = 0 and (x, y) = r(cos φ, sin φ), so
the Lagrangian is
1 1
L = mṙ2 + mr2 φ̇2 − V (r).
2 2
The Euler Lagrange equations give

d
mr̈ − mrφ̇2 + V 0 (r) = 0 and (mr2 φ̇) = 0.
dt

From the second equation, we see that r2 φ̇ = h is a constant (angular momentum


per unit mass). Then φ̇ = h/r2 . So

mh2 mh2
mr̈ − + V 0 (r) = 0 =⇒ 0
mr̈ = −Veff (r) where Veff = V (r) +
r3 2r2

Veff is the effective potential . For example, in a gravitational field, V (r) = − GM


r
.
Then
h2
 
GM
Veff = m − + 2 .
r 2r
60 CHAPTER 2. VARIATIONAL PRINCIPLE

2.3.2 Hamiltonian and Hamilton’s equations

The Hamiltonian of a system is the Legendre transform of the Lagrangian:

H(x, p, t) = p · ẋ(p) − L(x, ẋ(p), t),

where ẋ(p) is the solution to p = ∂L


∂ ẋ
. We say p is the conjugate momentum of x.
The space containing the variables x, p is known as the phase space .

Since the Legendre transform is its self-inverse and Lagrangian is convex with respect
to ẋ, the Lagrangian is the Legendre transform of the Hamiltonian with respect to
p:
L(x, ẋ, t) = p(ẋ) · ẋ − H(x, p(ẋ), t)

where p(ẋ) is the solution of ẋ = ∂H


∂p
. Hence we can write the action using the
Hamiltonian as Z
S[x, p] = (p · ẋ − H(x, p, t)) dt.

This is the phase-space form of the action. The Euler-Lagrange equations for these
are know as the Hamilton’s equations :

∂H ∂H
ẋ = , ṗ = −
∂p ∂x

This is the same as the original action because variation with respect to p makes the
integrand become L(x, ẋ, t), so then variation of x would gives stationary point of the
original action.

Using the Hamiltonian, the Euler-Lagrange equations put x and p on a much more
equal footing, and the equations are more symmetric. Solving the Hamilton’s equations
yields a trajectory in the phase space parametrized by position and momenta, which
are said to be canonically conjugate to each other.

E. 2-31
So what does the Hamiltonian look like? Consider the case of a single particle.
The Lagrangian is given by

1 ∂L
L= m|ẋ|2 − V (x, t) =⇒ p= = mẋ
2 ∂ ẋ

p 1  p 2 1
=⇒ H(x, p) = p · − m + V (x, t) = |p|2 + V.
m 2 m 2m
So p happens to coincide with the usual definition of the momentum. However, the
conjugate momentum is often something more interesting when we use generalized
coordinates. For example, in polar coordinates, the conjugate momentum of the
angle is the angular momentum. Also we have H as the total energy, but expressed
in terms of x, p, not x, ẋ.

2.3.3 Symmetries and Noether’s theorem


2.3. HAMILTON’S PRINCIPLE AND NOETHER’S THEOREM 61

D. 2-32

Given F [x] = α f (x, ẋ, t) dt, suppose we change variables by the transformation
t 7→ t (t) and x 7→ x∗ (t∗ ). Then we have a new independent variable and a new

function. This gives


Z β∗
F [x] 7→ F ∗ [x∗ ] = f (x∗ , ẋ∗ , t∗ ) dt∗
α∗

with α∗ = t∗ (α) and β ∗ = t∗ (β). If F ∗ [x∗ ] = F [x] for all x, α and β, then we say
the transformation ∗ is a symmetry . Symmetries may be discrete or continuous.
The transformations of a continuous symmetry will depend on a parameter ε ∈ R
such that t∗ (t) = t and x∗ (t∗ ) = x(t) when ε = 0.

E. 2-33
Transformation could be a translation of time, space, or a rotation, or even more
fancy stuff. The exact symmetries of F depends on the form of f . For example, if
f only depends on the magnitudes of x, ẋ and t, then rotation of space will be a
symmetry.

What does “continuous symmetry” mean? Intuitively, it is a symmetry we can


do “a bit of”. For example, rotation is a continuous symmetry, since we can do a
bit of rotation. However, reflection is not, since we cannot reflect by “a bit”. We
either reflect or we do not.

E. 2-34
1. Consider the transformation t 7→ t and x 7→ x + ε for some ε. Then
Z β
F ∗ [x∗ ] = f (x + ε, ẋ, t) dt
α

which equals F [x] for ε 6= 0 iff ∂f ∂x


= 0. So it is a space translation invariant
(symmetry) iff ∂f ∂x
= 0. We already know that if ∂f
∂x
= 0, then we have the first
d ∂f ∂f

integral dt ∂ ẋ
= 0 and so ∂ ẋ
is a conserved quantity.

2. Consider the transformation t 7→ t − ε with x 7→ x∗ such that x∗ (t∗ ) = x(t).


Then
Z β
F ∗ [x∗ ] = f (x, ẋ, t − ε) dt
α

∂f
which equals F [x] for ε 6= 0 iff = 0. Hence this is a symmetry (time
∂t
translation invariant) iff ∂f
∂t
= 0. We already know that if ∂f
∂t
= 0 is true, then
we obtain a first integral and the conserved quantity f − ẋ ∂f
∂ ẋ
.

We see that for each simple continuous symmetry we have above, we can obtain
a first integral, which then gives a constant of motion. Noether’s theorem is a
powerful generalization of this.

T. 2-35
<Noether’s theorem> Every continuous symmetry of an action I[x] is asso-
ciated with a corresponding first integral, and hence a constant of the motion
(conserved quantity).
62 CHAPTER 2. VARIATIONAL PRINCIPLE

To prove this theorem, we start by simply relabelling t∗ as t in the expression for


I[t]; this has no effect on the limits of integration, so we now have
Z β∗ 
dx∗ (t)

I ∗ [x∗ ] = L x∗ (t), ; t dt
α∗ dt
For a continuous symmetry we may suppose that the parameter ε is “infinitesimal”.
In this case
t∗ (t) = t + εξ(t), x∗ (t∗ ) = x(t) + εζ(t),
for some ξ(t) and ζ(t). Observe that x(t)+εζ(t) = x∗ (t∗ ) = x∗ (t+εξ(t)) = x∗ (t)+
εξ(t)ẋ∗ (t) + O(ε2 ), differentiating gives ẋ(t) = ẋ∗ (t) + O(ε). So x∗ (t) = x(t) + εh,
where h ≡ ζ − ξ ẋ. Also α∗ ≡ t∗ (α) = α + εξ(α) and β ∗ ≡ t∗ (β) = β + ξ ẋ. So
Z β+εξ(α)  
∗ ∗ d(εh)
I [x ] = L x + εh, ẋ + ; t dt
α+εξ(α) dt
Z β  
d(εh)
= [εξL]βα + L x + εh, ẋ + ; t dt
α dt
We expand the integrand to first order in h to get
 
d(εh) ∂L d(εh) ∂L
L x + εh, ẋ + ; t = L(x, ẋ; t) + εh +
dt ∂x dt ∂ ẋ
   
d ∂L ∂L d ∂L
= L(x, ẋ; t) + εh + εh −
dt ∂ ẋ ∂x dt ∂ ẋ

  β Z β  
∗ ∗ ∂L ∂L d ∂L
=⇒ I [x ] = I[x] + ε ξL + h + εh − dt
∂ ẋ α α ∂x dt ∂ ẋ
Z β  
∂L d ∂L
=⇒ I ∗ [x∗ ] − I[x] = [εQ]βα + εh − dt
α ∂x dt ∂ ẋ
 
∂L ∂L
where Q = ξ L − ẋ +ζ (∗)
∂ ẋ ∂ ẋ
Z β  
∂L d ∂L
=⇒ I ∗ [x∗ ] − I[x] = ε Q̇ + h − dt
α ∂x dt ∂ ẋ
Note that ε is a constant. The LHS of the last line is 0 for a symmetry transfor-
mation for all α and β, so
 
∂L d ∂L
Q̇ = −h − .
∂x dt ∂ ẋ
The RHS of this expression is 0 when the Euler-Lagrange equation is satisfied, in
which case Q̇ = 0. This is the first integral promised by the theorem, and Q is the
constant of motion associated to the symmetry of the action. (For a symmetry, Q
is constant as a consequence of the Euler-Lagrange equations)
Note that continuity is essential. For example, if f is quadratic in x and ẋ, then
x 7→ −x will be a symmetry. But since it is not continuous, there won’t be a
conserved quantity.
The proof given generalises to functions of many functions, and also (in modified
form) to functions of many independent variables (which was the original context).
2.3. HAMILTON’S PRINCIPLE AND NOETHER’S THEOREM 63

L. 2-36

Given I[x] = α L(x, ẋ, t)dt, the transformation t∗ (t) = t + εξ(t), x∗ (t∗ ) = x(t) +
εζ(t) is a symmetry if
 
∂L ∂L ∂L ∂L
ξ + ξ˙ L − ẋ +ζ + ζ̇ =0
∂t ∂x ∂x ∂ ẋ

In the proof of Noether’s theorem, we see that it’s a symmetry iff


 
∂L d ∂L
0 = Q̇ + h −
∂x dt ∂ ẋ
and this is equivalent to the expression in the heading. To show that note that Q̇
equals
     
∂L dL ∂L d ∂L ∂L d ∂L ∂L d ∂L
ξ˙ L − ẋ + − ẍ − ẋ +ζ̇ +ζ +(ζ−ξ ẋ) −
∂ ẋ dt ∂ ẋ dt ∂ ẋ ∂ ẋ dt ∂ ẋ ∂x dt ∂ ẋ
and then simplify using
dL ∂L ∂L ∂L
− ẋ − ẍ ≡
dt ∂x ∂ ẋ ∂t
E. 2-37
Consider the two example in [E.2-34] for the first case of space translation t∗ = t
and x∗ = x + ε, we have (ζ, ξ) = (1, 0), so [L.2-36] is satisfied when ∂L/∂x = 0,
and in which case Q = ∂L
∂ ẋ
is indeed a constant of motion.
For the second case of time translation t∗ = t + ε and x∗ (t∗ ) = x(t), we have
(ζ, ξ) = (0, 1), so [L.2-36] is satisfied when ∂L/∂t = 0, and in which case Q =
L − ẋ ∂L
∂ ẋ
is indeed a constant of motion.
L. 2-38

Given the action I[x] = α L(x, ẋ, t)dt and the transformation t∗ (t) = t + εξ(t),
x (t ) = x(t) + εζ(t). If assuming ε can vary with time we find that δI = I ∗ [x∗ ] −
∗ ∗

I[x] = α ε̇A dt, then the transformation is a symmetry of the action and A is its
constant of motion.
In the proof of [T.2-35] up to the point of (∗) we have not use the fact that ε is a
constant. If we now suppose ε depends on t then the next line of (∗) becomes
Z β Z β   
∂L d ∂L
δI = I ∗ [x∗ ] − I[x] = ε̇Q dt + ε Q̇ + h − dt
α α ∂x dt ∂ ẋ

If the transformation is a symmetry, then δI = α ε̇Q dt.

Conversely if we have δI = I ∗ [x∗ ] − I[x] = α ε̇A dt for all α, β and small ε, then
if we set ε as a constant (i.e. ε̇ = 0), we find that δI = I ∗ [x∗ ] − I[x] = 0 for all

α, β, hence the transformation is a symmetry of the action. But δI = α ε̇Q dt
too, so Q = A since α, β and ε is arbitrary.

Alternatively since δI = I ∗ [x∗ ] − I[x] = α ε̇A dt for all small ε(t), define ε(t) so
Rβ Rβ
that it vanish at the boundary, then we have δI = [εQ]βα − α εȦ dt = − α εȦ dt.
Rβ d
But δI = α L(x + εh, ẋ + dt (εh); t) − L (x, ẋ; t) dt. At the stationary point of the

action we have δI = 0, so 0 = α εȦ dt. Since α, β and ε is arbitrary, we have
Ȧ = 0, so A is a constant of motion.
64 CHAPTER 2. VARIATIONAL PRINCIPLE

E. 2-39
<Application to Hamiltonian mechanics> We can apply this to Hamiltonian
mechanics. The motion of a particle is the stationary point of
Z
1
S[x, p] = (p · ẋ − H(x, p)) dt, where H= |p|2 + V (x, t).
2m
1. Space translation invariance Suppose the potential is position indepen-
dent. Since the action depends only on ẋ (or p) and not x itself, it is invariant
under the translation x 7→ x + ε, p 7→ p. For general ε that can vary with
time, we have
Z Z
δS = S ∗ [x∗ , p∗ ]−S[x, p] =
 
p·(ẋ+ ε̇)−H(p) − p· ẋ−H(p) dt = p· ε̇ dt.

Hence p (the momentum) is a constant of motion. So translational invariance


implies conservation of momentum.
2. Times translational invariance If the potential has no time-dependence,
then the system is invariant under time translation. The transformation is
t 7→ t + ε and
x(t) 7→ x∗ (t + ε) = x(t) =⇒ x∗ (t) = x(t) − εẋ(t),

p(t) 7→ p (t + ε) = p(t) =⇒ p∗ (t) = p(t) − εṗ(t),
Write L(x, ẋ, p; t) = p · ẋ − H(x, p), we have
Z β+ε Z β
δS = L(x∗ (t), ẋ∗ (t), p∗ (t); t) dt − L(x, ẋ, p; t) dt
α+ε α
Z β
∂L d(εẋ) ∂L ∂L
= [εL]βα + −εẋ · − · − εṗ · dt
α ∂x dt ∂ ẋ ∂p
Z β
d(εẋ)
= [εL]βα + − · p − εṗ · ẋ − εḢ dt
α dt
Z β Z β
d
= [εL]βα + − (ε(p · ẋ + H)) − ε̇H dt = −ε̇H dt
α dt α

So the Hamiltonian H is the constant of motion, which is the total energy.


Hence time translational invariance implies conservation of energy.
3. Rotational invariance Suppose that we have a potential V (|x|) that only
depends on radius, then this has a rotational symmetry. Choose any axis of
rotational symmetry ω, and make the rotation
x 7→ x + εω × x p 7→ p + εω × p,
Then our rotation does not affect the radius |x| and |p|. So the Hamiltonian
H(x, p) is unaffected. Then (ignoring O(ε2 ) terms like εε̇, also note that p is
parallel to ẋ hence some of the triple products are 0) we have
Z Z
d d
δS = (p + εω × p) · (x + εω × x) − p · ẋ dt = p · (εω × x) dt
dt dt
Z   Z
d
= p · ω × (εx) dt = p · (ω × (ε̇x + εẋ)) dt
dt
Z Z
= (ε̇p · (ω × x) + εp · (ω × ẋ)) dt = ε̇ω · (x × p) dx.
2.4. MULTIVARIATE CALCULUS OF VARIATIONS 65

So ω · (x × p) is a constant of motion. Since this is true for all ω, the an-


gular momentum L = x × p must be a constant of motion. Hence rotational
invariance implies conservation of angular momentum.

2.4 Multivariate calculus of variations


Previously, the function x(t) we are varying is just a function of a single variable t.
We now consider the most general case y(x1 , · · · , xm ) ∈ Rn that maps Rm → Rn (we
can also view this as n different functions that map Rm → R). The functional will be
a multiple integral of the form
Z Z  
∂y ∂y
F [y] = · · · f (y, ∇y, x1 , · · · , xn ) dx1 · · · dxm , where ∇f = ,··· , .
∂x1 ∂xm
In principle, it is possible to derive a generalisation of the Euler-Lagrange equation
for such functionals, but it is as easy to consider the variation of F on a case by case
basis.
E. 2-40
<Minimal surfaces in E3 > This is a natural generalization of geodesics. A
minimal surface is a surface of least area subject to some boundary conditions.
Suppose that (x, y) are good parameters for a surface S, where (x, y) takes values
in the domain D ⊆ R2 . Then the surface is defined by z = h(x, y), where h is the
height function.
∂h
When possible, we will denote partial differentiation by suffices, ie. hx = ∂x
.
Then the area is given by
Z q
A[h] = 1 + h2x + h2y dA.
D

because small change of δx, δy produce a surface of area


q
k(x̂ + hx ẑ)δx × (ŷ + hy ẑ)δyk = kẑ − hy ŷ − hx x̂kδxδy = 1 + h2x + h2y δxδy.

Consider a variation of h(x, y): h 7→ h + δh(x, y). Then


Z p
A[h + δh] = 1 + (hx + (δh)x )2 + (hy + (δh)y )2 dA
D
Z !
hx (δh)x + hy (δh)y 2
= A[h] + p + O(δh ) dA
D 1 + h2x + h2y

We integrate by parts to obtain


Z ! !!
∂ hx ∂ hy
δA = − δh p + p dA + O(δh2 )
D ∂x 1 + h2x + h2y ∂y 1 + h2x + h2y

plus some boundary terms. So our minimal surface will satisfy


! !
∂ hx ∂ hy
p + p =0
∂x 1 + h2x + h2y ∂y 1 + h2x + h2y

=⇒ (1 + h2y )hxx + (1 + h2x )hyy − 2hx hy hxy = 0.


66 CHAPTER 2. VARIATIONAL PRINCIPLE

This is a non-linear 2nd-order PDE, the minimal-surface equation . While it is


difficult to come up with a fully general solution to this PDE, we can consider
some special cases.
• There is an obvious solution h(x, y) = Ax + By + C, since the equation involves
second-derivatives and this function is linear. This represents a plane.
• If |∇h|2  1, then h2x and h2y are small. So we have hyy + hyy = 0, or ∇2 h = 0.
So we end up with the Laplace equation. Hence harmonic functions are (approxi-
mately) minimal-area.
• We
p might want a cylindrically-symmetric solution, ie. h(x, y) = z(r), where r =
x2 + y 2 . Then we are left with an ordinary differential equation

rz 00 + z 0 + z 03 = 0.

The general solution is z = A−1 cosh(Ar) + B, a catenoid. Alternatively, to obtain


this, we can substitute h(x, y) = z(r) into A[h] to get
Z p
A[z] = 2π r 1 + (h0 (r))2 dr,

and we can apply the Euler-Lagrange equation.


E. 2-41
<Small amplitude oscillations of uniform string> Suppose we have a string
with uniform constant mass density ρ with uniform tension T .

Suppose we pull the line between x = 0 and x = a with some tension T . Then
we set it into motion such that the amplitude is given by y(x; t). Then the kinetic
energy is
1 a 2 ρ a 2
Z Z
T = ρv dx = ẏ dx.
2 0 2 0
The potential energy is the tension times the length. So
Z Z ap Z a
1
V =T d` = T 1 + (y 0 )2 dx = (T a) + T (y 02 ) dx.
0 0 2

Note that y 0 is the derivative wrt x while ẏ is the derivative wrt time. The T a
term can be seen as the ground-state energy. It is the energy initially stored if
there is no oscillation. Since this constant term doesn’t affect where the stationary
points lie, we will ignore it. Then the action is given by
ZZ a  
1 2 1
S[y] = ρẏ − T (y 0 )2 dx dt
0 2 2

We apply Hamilton’s principle which says that we need δS[y] = 0. We have


ZZ a  
∂ ∂
δS[y] = ρẏ δy − T y 0 δy dx dt.
0 ∂t ∂x
2.4. MULTIVARIATE CALCULUS OF VARIATIONS 67

Integrate by parts to obtain


ZZ a
δS[y] = δy(ρÿ − T y 00 ) dx dt + boundary term.
0

Assuming that the boundary term vanishes, we will need

ÿ − v 2 y 00 = 0, where v 2 = T /ρ.

This is the wave equation in two dimensions. Note that this is a linear PDE, which
is a simplification resulting from our assuming the oscillation is small. The general
solution to the wave equation is

y(x, t) = f+ (x − vt) + f− (x + vt),

which is a superposition of a wave travelling rightwards and a wave travelling


leftwards.

E. 2-42
<Maxwell’s equations> It is possible to obtain Maxwell’s equations from an
action principle, where we define a Lagrangian for the electromagnetic field. Note
that this is the Lagrangian for the field itself, and there is a separate Lagrangian
for particles moving in a field.

Let ρ represents the electric charge density and J represents the electric current
density. We have the potentials: φ is the electric scalar potential and A is the
magnetic vector potential. And we have the fields: E = −∇φ − Ȧ is the electric
field, and B = ∇ × A is the magnetic field.

We pick convenient units where c = ε0 = µ0 = 1. With these concepts in mind,


the Lagrangian is given by
Z  
1
S[A, φ] = (|E|2 − |B|2 ) + A · J − φρ dV dt
2

Varying A and φ by δA and δφ respectively, we have


Z    

δS = −E · ∇δφ + δA − B · ∇ × δA + δA · J − ρδφ dV dt.
∂t

Integrate by parts to obtain


Z  
δS = δA · (Ė − ∇ × B + J) + δφ(∇ · E − ρ) dV dt.

Since the coefficients have to be 0, we must have ∇ × B = J + Ė and ∇ · E = ρ.


Also, the definitions of E and B immediately give ∇ · B = 0 and ∇ × E = −Ḃ.
These four equations are Maxwell’s equations.
68 CHAPTER 2. VARIATIONAL PRINCIPLE

2.5 The second variation


2.5.1 Second variation
Previously we have only looked at stationary points of functionals. To distinguish
a maximum, minimum or a saddle, we can look at the second variation (“second
derivatives”).
Suppose x(t) = x0 (t) is a solution of δF [x]
δx(t)
= 0, ie. F [x] is stationary at x = x0 . For
convenience, let δx(t) = εξ(t) with constant ε  1. We consider functionals of the

form F [x] = α f (x, ẋ, t) dt with fixed-end boundary conditions, so ξ(α) = ξ(β) = 0.
Below, we will use both dots (ẋ) and dashes (x0 ) to denote the same derivatives with
respect to t.
To determine what type of stationary point it is, we expand F [x + δx] to second order
in δx. We consider a variation x 7→ x + δx and expand the integrand to obtain
f (t + εξ, ẋ + εξ,˙ t) − f (x, ẋ, t)
ε2 ∂2f ∂2f ∂2f
   
∂f ∂f
=ε ξ + ξ˙ + ξ 2 2 + 2ξ ξ˙ + ξ˙2 2 + O(ε3 )
∂x ∂ ẋ 2 ∂x ∂x∂ ẋ ∂ ẋ
Noting that 2ξ ξ˙ = (ξ 2 )0 and integrating by parts, we obtain
ε2
     2  2  
∂f d ∂f 2 ∂ f d ∂ f ˙ 2 ∂f
= εξ − + ξ − +ξ .
∂x dt ∂ ẋ 2 ∂x2 dt ∂x∂ ẋ ∂ ẋ2
plus some boundary terms which vanish. So
Z β
δF [x] ε2
F [x + εξ] − F [x] = εξ dt + δ 2 F [x, ξ] + O(ε3 ),
α δx(t) 2
Z β  2  2  2

∂ f d ∂ f ˙2 ∂ f dt
where δ 2 F [x, ξ] = ξ2 − + ξ
α ∂x2 dt ∂x∂ ẋ ∂ ẋ2
is a functional of both x(t) and ξ(t). For given x(t) it is a quadratic functional for
ξ(t), this is analogous to the term δxT H(x)δx appearing in the expansion of a regular
function f (x).

2.5.2 Conditions for minimum


In the case of normal functions, if H(x) is positive for all x, then f (x) is convex, and
the stationary point is hence a global minimum. A similar result holds for functionals.
In this case, if δ 2 F [x, ξ] > 0 for all non-zero and allowed ξ and all allowed x (where “al-
lowed” means satisfying boundary conditions and appropriate smoothness conditions),
then a solution x0 (t) of δF δx
= 0 is an absolute minimum. However, not all functions
are convex. We can still determine whether a solution x0 (t) of the Euler-Lagrange
equation is a local minimum. Write
Z β
δ 2 F [x0 , ξ] = (ρ(t)ξ˙2 + σ(t)ξ 2 ) dt,
α
2
∂2f ∂2f
  
∂ f d
where ρ(t) = and σ(t) = − .
∂ ẋ2 x=x0 ∂x2 dt ∂x∂ ẋ x=x0
This is of the same form as the Sturm-Liouville problem. For x0 to minimize F [x]
locally, we need δ 2 F [x0 , ξ] > 0.
2.5. THE SECOND VARIATION 69

1. A necessary condition is ρ(t) ≥ 0 which is called the Legendre condition . However,


this is not a sufficient condition. Even if we had a strict inequality ρ(t) > 0 for all
α < t < β, it is still not sufficient.

2. A sufficient (but not necessary) condition is ρ(t) > 0 and σ(t) ≥ 0, because in this
case δ 2 F [x0 , ξ] > 0 for all allowed ξ (same as before ξ˙ cannot be 0 everywhere).

The intuition behind the Legendre condition is as follows: suppose that ρ(t) is negative
in some interval I ⊆ [α, β]. Then we can find a ξ(t) that makes δ 2 F [x0 , ξ] negative.
We simply have to make ξ zero outside I, and small but crazily oscillating inside I.
Then inside I, ξ˙2 will be very large while ξ 2 is kept tiny. So we can make δ 2 F [y, ξ]
arbitrarily negative.

E. 2-43
• In Geodesics of a plane[E.2-21] shown that a straight line is a stationary point for
the curve-lengthpfunctional, but we didn’t show it is in fact the shortest distance.
Recall that f = 1 + (y 0 )2 . Then

∂f ∂f y0 ∂2f 1
= 0, = , = p 3,
∂y 0 ∂y 02
p
∂y 1 + (y 0 )2 1 + (y 0 )2

with the other second derivatives zero. So we have


β
ξ˙2
Z
δ 2 F [y, ξ] = dx
α (1 + (y 0 )2 )3/2

This is zero for constant ξ but the only constant permitted by the boundary
conditions is zero, so δ 2 F [y, ξ] is positive for non-zero allowed, and this is true for
any allowed y, so a straight line really does minimise the distance between two
points.

• In the Brachistochrone problem,[E.2-23] we have


r
Z β
1 + ẋ2
T [x] ∝ dt.
α x

where x is always positive. The cycloid (at least locally) minimize the time T
because
1 1
ρ(t) = p >0 and σ(t) = p > 0.
x(1 + ẋ2 )3 2x2 x(1 + ẋ2 )

R1 p
• Consider f [y] = −1
x 1 + y 02 dx. In this case

∂2f 3

02
= x(1 + y 02 )− 2
∂y

which change sign at x = 0. So we can say in advance of solving for stationary


points that any solution we find will not be a minimum of the functional, although
it could be a minimum of the same integral with different integration limits.
70 CHAPTER 2. VARIATIONAL PRINCIPLE

Jacobi condition

Legendre tried to prove that ρ > 0 is a sufficient condition for δ 2 F > 0. This is
known as the strong Legendre condition . However, he failed, since it is not a sufficient
condition. Yet, it turns out that he was close.
Thinking ρ > 0 is sufficient isn’t as crazy as it first sounds. If ρ > 0 and σ < 0, we
would want to create a negative δ 2 F [x0 , ξ] by choosing ξ to be large but slowly varying.
Then we will have a very negative σ(t)ξ 2 while a small positive ρ(t)ξ˙2 . But ξ has to
be 0 at the end points α and β, so if β − α is small, then ξ˙ cannot be small.
Now we show that if we have an extra condition on top of this, then we have an
sufficient condition. Assume ρ(t) > 0 for α < t < β (the strong Legendre condition)
and assume boundary conditions ξ(α) = ξ(β) = 0. First of all, notice that for any
smooth function w(t), we have
Z β Z β
0 = wξ 2 (α) − wξ 2 (β) = (wξ 2 )0 dt = (2wξ ξ˙ + ẇξ 2 ) dt.
α α

This allows us to rewrite δ 2 F as


β β 2  !
w2
Z Z  
w
2
δ F = ρξ + 2wξ ξ˙ + (σ + ẇ)ξ
˙2 2
dt = ρ ξ˙ + ξ + σ + ẇ − ξ 2 dt
α α ρ ρ

This δ 2 F is positive if w2 = ρ(σ+ẇ). It cannot equals 0 since that would require


 Z x 
w w(s)
ξ˙ + ξ = 0 =⇒ ξ(x) = C exp − ds
ρ α ρ(s)

But 0 = ξ(α) = Ce0 , so C = 0. Hence equality holds only for ξ = 0, which is not an
allowed ξ. So if we can find a solution to w2 = ρ(σ + ẇ), we know that δ 2 F > 0.
w2 = ρ(σ + ẇ) is non-linear in w. We can convert this into a linear equation by defining
w in terms of a new function u by w = −ρu̇/u. Then it becomes
 2  0  2
u̇ ρu̇ (ρu̇)0 u̇
ρ =σ− =σ− +ρ =⇒ −(ρu̇)0 + σu = 0
u u u u

This is called the Jacobi accessory equation . If we can find a solution u(t) of it such
that u(t) 6= 0 for all t ∈ [α, β], then we have δ 2 F > 0 for all allowed ξ. A suitable
solution will always exists for sufficiently small β − α, but may not exist if β − α is too
large.
E. 2-44
<Geodesics on unit sphere> For any curve C on the sphere, we have
Z q Z φ2 q
L= dθ2 + sin2 θ dφ2 or L[θ] = (θ0 )2 + sin2 θ dφ
C φ1

p
if φ is a good parameter. Assuming this, we have f (θ, θ0 ) = (θ0 )2 + sin2 θ. So

∂f sin θ cos θ ∂f θ0
= p , 0
= p .
∂θ (θ0 )2 + sin2 θ ∂θ (θ0 )2 + sin2 θ
2.5. THE SECOND VARIATION 71
∂f
Since ∂φ
= 0, we have the first integral

∂f sin2 θ
q
const = f − θ0 = p =⇒ c sin2 θ = (θ0 )2 + sin2 θ.
∂θ0 (θ )2 + sin2 θ
0

Here we need c ≥ 1 for the equation to make sense.


Consider the case c = 1. This occurs when θ0 = 0, ie. θ is a constant. Then our
first integral gives sin2 θ = sin θ. So sin θ = 1 and θ = π/2. This corresponds to
a curve on the equator connecting two points on the equator. (we ignore the case
sin θ = 0 that gives θ = 0, which is a rather silly solution). However, given any
two points on the sphere, we can always rotate the sphere so that they both lie on
the equator, so wlog we can assume c = 1.
There are two equatorial solutions to the Euler-Lagrange
equations. Which, if any, minimizes L[θ]? We have

∂2f ∂2f ∂2
= 1, = −1, = 0.
∂(θ0 )2 ∂θ∂θ0 ∂θ∂θ0

bearing in mind θ = π/2. So ρ(x) = 1 and σ(x) = −1. So


Z φ2
δ2 F = ((ξ 0 )2 − ξ 2 ) dφ.
φ1

The Jacobi accessory equation is u00 + u = 0. So the general solution is u ∝


sin φ − γ cos φ for arbitrary γ. This is equal to zero if tan φ = γ.
Looking at the graph of tan φ, we see that tan has a zero every π radians. Hence
if the domain φ2 − φ1 is greater than π (ie. we go the long way from the first point
to the second), it will always contain some values for which tan φ is zero. So we
cannot know whether or not this longer path is a local minimiser. On the other
hand, if φ2 − φ1 is less than π, then we will be able to pick a γ such that u is
non-zero in the domain, in which case the solution is (at least) a local minimiser.
The above shows that the condition σ ≥ 0 is not necessary for positivity of δ 2 F .
Also, the strong Legendre condition ρ > 0 is not sufficient for positivity of δ 2 F . It
also illustrates that when the domain β − α (or in this case φ2 − φ1 ) is small we
can find solution to the Jacobi accessory equation.
72 CHAPTER 2. VARIATIONAL PRINCIPLE
CHAPTER 3
Optimization

3.1 Preliminaries and Lagrange multipliers


D. 3-1
• The general problem is of constrained optimization is

minimise f (x), subject to h(x) = b and x ∈ X (∗)

where x ∈ Rn is the vector of decision variables , f : Rn → R is the objective function ,


h : Rn → Rm and b ∈ Rm are the functional constraints , and X ⊆ Rn is the
regional constraint .1
The set X(b) = {x ∈ X : h(x) = b} is called the feasible set . A problem is called
feasible if X(b) is non-empty, and called bounded if f (x) is bounded from below
on X(b). A solution x∗ is called optimal if it feasible and minimizes f among all
feasible solutions.
• Linear programming (LP) is the special case of the optimization of a linear ob-
jective function, subject to linear equality and linear inequality constraints. There
are two forms of linear programs:
General form : minimise cT x, subject to Ax ≥ b and x ≥ 0.2
Standard form : minimise cT x, subject to Ax = b and x ≥ 0.
where c ∈ Rn is called the cost vector , x ∈ Rn is a vector of decision variables,
and A ∈ Rm×n a matrix. Alternatively, the general form and the standard form
can be written as min{cT x : Ax ≥ b, x ≥ 0} and min{cT x : Ax = b, x ≥ 0}.
C. 3-2
<Representation of constraints> The above form for constrained optimization
is the most general form of the problem. If we want to maximize f instead of
minimize, we can minimize −f . If we want our constraints to be an inequality in
the form hi (x) ≥ bi (note also that hi (x) ≤ bi ⇔ −hi (x) ≥ −bi ), we can introduce
a slack variable zi , make the functional constraint as hi (x) − zi = bi , and add the
regional constraint zi ≥ 0.
For Linear programming, we can write the most general problem as:

aTi x ≥ bi for all i ∈ M1


aTi x ≤ bi for all i ∈ M2
minimise cT x, subject to
aTi x = bi for all i ∈ M3
xi ≥ 0 for all i ∈ N1
xj ≤ 0 for all i ∈ N2

1
Here almost everything we work with will be a vector, for convenience we will not bold them.
2
Here the meaning of ≤ is component-vise, ie. a ≤ b means ai ≤ bi for all i. This will be the
case for the rest of the chapter.

73
74 CHAPTER 3. OPTIMIZATION

where c ∈ Rn is a cost vector, x ∈ Rn is a vector of decision variables, and


constraints are given by ai ∈ Rn and bi ∈ R for i ∈ {1, · · · , m}. Index sets
M1 , M2 , M3 ∈ {1, · · · , m} and N1 , N2 ∈ {1, · · · , n} are used to distinguish between
different types of constraints.
Observe that constraints of the form xi ≥ 0 and xj ≤ 0 are just special cases of
linear constraints of the form aTi ≥ bi , but we often choose to make them explicit.

Each occurrence of an unconstrained variable xj can be replaced by x+ j + xj ,
+ − + −
where xj and xj are two new variables with xj ≥ 0 and xj ≤ 0 (alternatively

xj = x+ j − xj , where each part has to be positive). Note also that an equality
constraint aTi x = bi is equivalent to the pair of constraints aTi ≤ bi and aTi x ≥ bi .
Through tricks similar to that describe above, we can write any linear program
in the general form. The standard form is of course a special case of the general
form. However, we can also bring every general form problem into the standard
form by introducing slack variables.

E. 3-3
Minimise −(x1 + x2 ) subject to [[ x1 + 2x2 ≤ 6 ]], [[ x1 − x2 ≤ 3 ]] and [[ x1 , x2 ≥ 0 ]].

Write f (x) = −(x1 + x2 ). Since we are


x2
lucky to have a 2D problem, we can
draw this out. The shaded region is
the feasible region, and c is our cost
vector. The dotted lines, which are or- x1 − x2 = 3
thogonal to c are lines in which the ob-
jective function is constant. To min-
imize our objective function, we want x1 + 2x2 = 6
the line to be as right as possible, which x1
is clearly achieved at the intersection of c
the two boundary lines where we have f (x) = 0 f (x) = −2 f (x) = −5
f (x) = −5.

E. 3-4

Recall that the stationary point of a convex function (on a convex set) is the global
minimum. It is easy to see that in the case of linear programming, the feasible
set is convex and the objective function is both convex and concave. However the
above result cannot generally be used to solve constrained optimization problems,
because the gradient might not be zero anywhere on the feasible set. Instead we
would use Lagrange Multipliers.

3.1.1 Lagrange multipliers

T. 3-5
<Lagrangian sufficiency> Consider (∗) of [D.3-1]. Let L(x, λ) = f (x) −
λT (h(x) − b) for λ ∈ Rm (its Lagrangian). If x∗ ∈ X and λ∗ ∈ Rm are such
that L(x∗ , λ∗ ) = inf x∈X L(x, λ∗ ) and h(x∗ ) = b, then x∗ is an optimal solution for
(∗).
3.1. PRELIMINARIES AND LAGRANGE MULTIPLIERS 75

min f (x) = min (f (x) − λ∗T (h(x) − b)) [since ∀x ∈ X(b), h(x) − b = 0]
x∈X(b) x∈X(b)

≥ min(f (x) − λ∗T (h(x) − b)) = f (x∗ ) − λ∗T (h(x∗ ) − b) = f (x∗ )


x∈X

This result say that if x∗ minimizes L for some fixed λ∗ , and x∗ satisfies the
constraints, then x∗ minimizes f .
This result is powerful and useful in the aspect that any solution found is definitely
a optimal solution (not just a stationary solution). However this is not a necessary
condition for the optimal solution. For example consider f (x, y) = − 1+x21+y2
subject to x = 1. The optimal solution is clearly f (1, 0) = − 21 . But L((1, 0), λ) 6=
inf (x,y)∈R2 (− 1+x21+y2 − λ(x − 1)) for any λ ∈ R.
E. 3-6
Minimise x1 + x2 − 2x3 subject to x1 + x2 + x3 = 5 and x21 + x22 = 4.

The Lagrangian is
L(x, λ) = x1 − x2 − 2x3 − λ1 (x1 + x2 + x3 − 5) − λ2 (x21 + x22 − 4)
= ((1 − λ1 )x1 − 2λ2 x21 ) + ((−1 − λ1 )x2 − λ2 x22 ) + (−2 − λ1 )x3 + 5λ1 + 4λ2
We want to pick a λ∗ and x∗ such that L(x∗ , λ∗ ) = inf x∈X L(x, λ∗ ). Then in
particular, for our λ∗ , L(x, λ∗ ) must have a finite minimum.
We note that (−2 − λ1 )x3 does not have a finite minimum unless λ1 = −2, since
x3 can take any value. Also, the terms in x1 and x2 do not have a finite mini-
mum unless λ2 < 0. With these in mind, we find a minimum by setting all first
derivatives to be 0:
∂L
= 1 − λ1 − 2λ2 x1 = 3 − 2λ2 x1
∂x1
∂L
= −1 − λ1 − 2λ2 x2 = 1 − 2λ2 x2
∂x2
Since these must be both 0, we must have x1 = 2λ32 , x2 = 1
2λ2
. To show that this
is indeed a minimum, we look at the Hessian matrix:
 
−2λ2 0
H(L) =
0 −2λ2
which is positive semidefinite everywhere when λ2 < 0, so it’s a global minimum.
Let Y = {λ : R2 : λ1 = −2, λ2 < 0} be our helpful values of λ. For every λ ∈ Y ,
L(x, λ) has a unique minimum at x(λ) = ( 2λ32 , 2λ12 , x3 )T . Now all we have to do
is find λ and x such that x(λ) satisfies the functional constraints. The second
constraint gives
r
2 2 9 1 5
x1 + x2 = + 2 =4 ⇐⇒ λ2 = − .
4λ22 4λ2 8
The first constraint gives x3 = 5 − x1 − x2 . So [T.3-5] implies that the following
is an optimal solution:
r r r !
2 2 2
(x1 , x2 , x3 ) = −3 ,− ,5 + 4
5 5 5
76 CHAPTER 3. OPTIMIZATION

C. 3-7
In general to minimize f (x) subject to h(x) ≤ b, x ∈ X, we can proceed as follows:
1. Introduce slack variables to obtain the equivalent problem, to minimize f (x)
subject to h(x) + z = b, x ∈ X, z ≥ 0.
2. Compute the Lagrangian L(x, z, λ) = f (x) − λT (f (x) + z − b).
3. Find Y = {λ : inf x∈X,z≥0 L(x, z, λ) > −∞}.
4. For each λ ∈ Y , minimize L(x, z, λ). That is, find x∗ (λ) ∈ X, z ∗ (λ) ≥ 0 such
that L(x∗ (λ), z ∗ (λ), λ) = inf x∈X,z≥0 L(x, z, λ).
5. Find λ∗ ∈ Y such that h(x∗ (λ∗ )) + z ∗ (λ∗ ) = b.
Then by [T.3-5], x∗ (λ∗ ) is optimal for the constrained problem.
It is worth pointing out we have a property known as complementary slackness .
If we introduce a slack variable z, changing the value of zj does not affect our
objective function, and we are allowed to pick any non-negative z. For each j
we must have (z ∗ (λ))j λj = 0, because by definition z ∗ (λ)j minimizes −zj λj , so
if zj λj 6= 0, we can tweak the values of zj to make a smaller −zj λj . This prop-
erty makes our life easier since our search space is smaller. Note this also means
(h(x∗ (λ∗ )) − b)j λ∗j = 0 for each j.

E. 3-8
Consider the following problem: maximize x1 − 3x2 subject to x21 + x22 + z1 = 4,
x1 + x2 + z2 = 2 and z1 , z2 ≥ 0, where z1 , z2 are slack variables.

L(x, z, λ) = ((1 − λ2 )x1 − λ1 x21 ) + ((−3 − λ2 )x2 − λ1 x22 ) − λ1 z1 − λ2 z2 + 4λ1 + 2λ2 .

To ensure finite minimum, we need λ1 , λ2 ≤ 0. By complementary slackness,


λ1 z1 = λ2 z2 = 0. We can then consider the cases λ1 = 0 and z1 = 0 separately,
and save a lot of algebra.
T. 3-9
Consider (∗) of [D.3-1]. For each b ∈ Rm , let φ(b) = inf{f (x) : h(x) = b, x ∈ Rn }
be the optimal value of f . Suppose f and h are continuously differentiable, and
that there exist unique functions x∗ : Rm → Rn and λ∗ : Rm → Rm such that for
each b ∈ Rm , x∗ (b) and λ∗ satisfies the Lagrangian sufficiency (ie. h(x∗ (b)) = b
and f (x∗ (b)) = inf{f (x) − λ∗ (b)T (h(x) − b) : x ∈ Rn } = φ(b)). Then
∂φ
(b) = λ∗i (b).
∂bi

We have φ(b) = f (x∗ (b)) − λ∗ (b)T (h(x∗ (b)) − b), so (with summation)
  ∗
∂φ(b) ∂f ∗ ∗ T ∂h ∗ ∂xj (b)
= (x (b)) − λ (b) (x (b))
∂bi ∂xj ∂xj ∂bi
∂λ∗ (b)T ∂b
− (h(x∗ (b)) − b) +λ∗ (b)
∂bi ∂bi
| {z }
=0
∂x∗j (b)

∂   ∂b
= f (x) − λ∗ (b)T (h(x) − b) + λ∗ (b) = λ∗i (b)
∂xj x=x∗ (b) ∂bi ∂bi
| {z }
=0
3.1. PRELIMINARIES AND LAGRANGE MULTIPLIERS 77

This result also holds when the functional constraints are inequalities: if the i
th constraint does not not hold with equality, then λ∗i = 0 by complementary
∂λ∗
slackness, and therefore also ∂bii = 0.
E. 3-10
The Lagrange multipliers are also known as shadow prices, due to an economic
interpretation of the problem to

maximize f (x), subject to h(x) ≤ b, x ∈ X.

Consider a firm that produces n different goods from m different raw materials.
Vector b ∈ Rm describes the amount of each raw material available to the firm,
and vector x ∈ Rn describes the quantity produced of each good. Functions
h : Rn → Rm describe the amounts of raw material required to produce a particular
quantities of the goods. And f : Rn → R gives the profit derived from producing
a particular quantities of the goods. The goal of the above problem thus is to
maximize the profit of the firm for given amounts of raw materials available to it.
The shadow price of raw material i then is the price the firm would be willing to
pay per additional unit of this raw material, which of course should be equal to
∂φ
the additional profit derived from it, i.e. ∂bi
(b) = λ∗i (b).

3.1.2 Lagrangian Duality


D. 3-11
Consider (∗) of [D.3-1], denote (∗) as P . Let L(x, λ) = f (x) − λT (h(x) − b) be its
Lagrangian.
• The dual function is g : Rm → R define by g(λ) = inf x∈X L(x, λ).
• The dual problem (which we denote as D) is maximisation of g subject to
λ ∈ Y = {λ ∈ Rn : g(λ) > −∞}. The original problem P is called primal .
• (P ) and (D) are said to satisfy strong duality if supλ∈Y g(λ) = inf x∈X(b) f (x).
(Below we assume supλ∈Y g(λ) is attained for some λ)
T. 3-12
<Weak duality> Consider (∗) of [D.3-1]. If x ∈ X(b) and λ ∈ Y , then g(λ) ≤
f (x). In particular, supλ∈Y g(λ) ≤ inf x∈X(b) f (x).

g(λ) = inf x0 ∈X L(x0 , λ) ≤ L(x, λ) = f (x) − λT (h(x) − b) = f (x)


Any feasible solution of the dual provides a lower bound for the optimal solution of
the primal. This suggests that we can solve a dual problem - instead of minimizing
f , we can maximize g subject to λ ∈ Y . In particular, a pair of solutions of the
primal and dual that yield the same value must be optimal. If strong duality holds,
the optimal value of the primal can be determined by solving the dual, which in
some cases may be easier than solving the primal.
E. 3-13
Problems for which the method of Lagrange sufficiency works must satisfy strong
duality. Consider the problem of [E.3-6]. We saw that Y = {λ ∈ R2 : λ1 =
−2, λ2 < 0} and  
3 1 4
x∗ (λ) = , ,5 − .
2λ2 2λ2 2λ2
78 CHAPTER 3. OPTIMIZATION

So the dual function and the dual problem are

10
g(λ) = inf L(x, λ) = L(x∗ (λ), λ) = + 4λ2 − 10
x∈X 4λ2
10
maximise + 4λ2 − 10 subject to λ2 < 0
4λ2
p
The maximum is attained for√ λ2 = − 5/8 and the primal and dual have the same
optimal value, namely −2( 10 p+ 5). Note that it is not actually necessary to solve
the dual to see that λ2 = − 5/8 an optimizer, it suffices that the value of the
dual function at this point equals the value of the objective function of the primal
at some point in the feasible set of the primal.
D. 3-14
In Rn , given any fix m ∈ Rn and c ∈ R, the set {x ∈ Rn : x·m = c} is a hyperplane
(“planes in higher dimensions”). In an n-dimensional space, a hyperplane has n−1
dimensions.
Consider hyperplane given by α : Rm → R with α(x) = b + m · x for some b ∈ R
and m ∈ Rm . We say α is a supporting hyperplane to a function φ : Rm → R at
b ∈ Rm if φ(b) = α(b) and φ(c) ≥ α(c) for all c ∈ Rm .
E. 3-15
Note that α being a supporting hyperplane to φ at b means that α(c) = φ(b) + m ·
(c − b) for some m such that φ(c) ≥ φ(b) + m · (c − b) for all c ∈ Rm .
L. 3-16
Take the set up of [D.3-11]. Let βλ = sup{β : β + λT (c − b) ≤ φ(c) for all c ∈ Rm }
and φ(c) = inf x∈X(c) f (x), then g(λ) = βλ .

g(λ) = inf L(x, λ) = inf (f (x) − λT (h(x) − b))


x∈X x∈X

= infm inf (f (x) − λT (h(x) − b)) = infm (φ(c) − λT (c − b))


c∈R x∈X(c) c∈R

= sup{β : β + λ (c − b) ≤ φ(c) for all c ∈ Rm } = βλ


T

So weak duality is supλ βλ ≤ φ(b). Since βλ is the highest intercept at b of which


a hyperplane passing through it with gradient λ is below φ, so in fact when we are
solving for the dual problem maxλ g(λ), we are varying λ and see how high the
intercept at b can be.
T. 3-17
P ((∗) of [D.3-1]) satisfies strong duality iff φ(c) = inf x∈X(c) f (x) has a (non-
vertical) supporting hyperplane at b.

(Backward) We have φ(c) ≥ φ(b) + λT (c − b) for some λ. So

φ(b) ≤ infm (φ(c) − λT (c − b)) = infm inf (f (x) − λT (h(x) − b))


c∈R c∈R x∈X(c)

= inf L(x, λ) = g(λ)


x∈X

By weak duality, g(λ) ≤ φ(b). So φ(b) = g(λ) and strong duality holds.
3.1. PRELIMINARIES AND LAGRANGE MULTIPLIERS 79

(Forward) Assume strong duality, then ∃λ such that for any c ∈ Rm we have

φ(b) = g(λ) = inf L(x, λ)


x∈X

≤ inf L(x, λ) = inf (f (x) − λT (h(x) − b)) = φ(c) − λT (c − b)


x∈X(c) x∈X(c)

which defines a supporting hyperplane at b.

T. 3-18
<Supporting hyperplane theorem> If φ : Rm → R is convex and b ∈ Rm lies
in the interior of the set of points where φ is finite, then there exists a (non-vertical)
supporting hyperplane to φ at b.

By [P.2-8].

T. 3-19
Consider φ(c) = inf x∈X {f (x) : h(x) ≤ c}. If X, f, h are convex, then so is φ(c).

Consider any b1 , b2 ∈ Rm such that φ(b1 ) and φ(b2 ) are defined (ie. {f (x) :
h(x) ≤ c} with c = b1 , b2 non-empty, we allowed φ to take the value −∞). Let
δ ∈ [0, 1] and write b = δb1 + (1 − δ)b2 . Choose x1 ∈ X(b1 ), x2 ∈ X(b2 ), and let
x = δx1 + (1 − δ)x2 . By convexity of X, x ∈ X. By convexity of h,

h(x) ≤ δh(x1 ) + (1 − δ)h(x2 ) ≤ δb1 + (1 − δ)b2 = b

So x ∈ X(b). Since φ(b) is the optimal value and by convexity of f ,

φ(b) ≤ f (x) ≤ δf (x1 ) + (1 − δ)f (x2 )

This holds for any x1 ∈ X(b1 ) and x2 ∈ X(b2 ). So by taking infimum of the right
hand side, we have φ(b) ≤ δφ(b1 ) + (1 − δ)φ(b2 ). So φ is convex.

h(x) = b is equivalent to h(x) ≤ b and −h(x) ≤ −b. So the result holds for
problems with equality constraints if both h and −h are convex, ie. if h(x) is
linear.

In the case with equality constraints, convexity of X, f and h does not suffice for
convexity of φ. For example consider minimise f (x) = x2 subject to h(x) = x3 = b
for some b > 0, then φ(b) = b2/3 which is not convex. Also L(x, λ) = x2 −λ(x3 −b),
so inf x L(x, λ) > −∞ iff λ = 0. So the dual has optimal value 0, which is strictly
greater than φ(b) if b > 0. So strong duality is not satisfied.

This result shows that almost all (if b ∈ Rm lies in the interior of the set of points
where φ is finite) convex optimisation problem (inf x∈X {f (x) : h(x) ≤ c} with
X, f, h convex) satisfies strong duality.

T. 3-20
If a linear program is feasible and bounded, then it satisfies strong duality.

φ(c) = inf x∈X(c) f (x) is convex by the above. It can be shown via a result know
as Slaters condition that if such a linear problem is feasible, then b ∈ Rm is an
interior point of φ, so it has a support hyperplane at b, so it satisfies strong duality.
80 CHAPTER 3. OPTIMIZATION

3.2 Solutions of linear programs


We will consider the linear program:

maximize cT x subject to Ax ≤ b, x ≥ 0.

where x ∈ Rn , A ∈ Rm×n and b ∈ Rm . The feasible set of this LP is a convex polytope


in Rn , i.e., an intersection of half-spaces. Since a linear program is linear, the optimal
point will lie on a “corner” (extreme point) of the convex polytope of feasible region,
no matter what the shape of it might be. For example look at [E.3-3]. It might happen
that a whole “edge” are optimal solutions but we will still have an optimal solution at
a “corner” (extreme point).

This already allows us to solve linear programs, since we can just try all corners and
see which has the smallest value. However, this can be made more efficient, especially
when we have a large number of dimensions and hence corners.

Here we will assume that the rows of A are linearly independent, and any set of m
columns are linearly independent. Otherwise, we can just throw away the redundant
rows or columns since they are “repeated”. Note that assuming these, setting any
subset of n − m variables of x to zero uniquely determines the value of the remain-
ing.

D. 3-21
An extreme point x ∈ S of a convex set S is a point that cannot be written as a
convex combination of two distinct points in S, ie. if y, z ∈ S and δ ∈ (0, 1) satisfy
x = δy + (1 − δ)z, then x = y = z.

Consider linear program in standard form, maximize cT x subject to Ax = b, x ≥ 0,


where A ∈ Rm×n and b ∈ Rm .

• A solution x ∈ Rn to Ax = b is basic if it has at most m non-zero entries (out


of n), ie. if there exists a set B ⊆ {1, · · · , n} with |B| = m such that xi = 0
if i 6∈ B. In this case, B is called the basis , and xi are the basic variables if
i ∈ B.

• A basic solution is non-degenerate if it has exactly m non-zero entries.

• A basic solution x is feasible if it satisfies x ≥ 0.

E. 3-22
Consider the linear program: maximize f (x) = x1 + x2 subject to

x1 + 2x2 + z1 = 6, x1 − x2 + z2 = 3, x1 , x2 , z1 , z2 ≥ 0.

where we have included slack variables.

Since we have 2 constraints, a basic solution have at most 2 non-zero entries, and so
at least 2 zero entries. Since setting any subset of 2 variables of x to zero uniquely
determines the value of the remaining, setting a different pair of variables to 0 at
a time gives us the 6 possible basic solutions which are listed below. Among the 6,
E and F are not feasible solutions since they have negative entries. So the basic
feasible solutions are A, B, C, D.
3.2. SOLUTIONS OF LINEAR PROGRAMS 81

x2
x1 x2 z1 z2 f (x)
x1 − x2 = 3
A 0 0 6 3 0 D
B 0 3 0 6 3 C
C 4 1 0 0 5 A B E
x1
D 3 0 3 0 3 x1 + 2x2 = 6
E 6 0 0 −4 6
F 0 −3 12 0 −3 F

So the extreme points are exactly the basic feasible solutions. In fact this is true
in general.
T. 3-23
A vector x is a basic feasible solution (BFS) of Ax = b if and only if it is an
extreme point of the set X(b) = {x0 : Ax0 = b, x0 ≥ 0}.

We assume that every basic solution is non-degenerate and teh assumption stated
at the beginning of the section.
(Forward) Consider a BFS x and suppose that x = δy + (1 − δ)z for y, z ∈ X(b)
and δ ∈ (0, 1). Since y ≥ 0 and z ≥ 0, x = δy + (1 − δ)z implies that yi = zi = 0
whenever xi = 0. So y and z are basic solutions with the same basis, ie. both have
exactly m non-zero entries, which occur in the same rows. Moreover, Ay = b = Az
and thus A(y − z) = 0. This says that a linear combination of the m columns of
A equals 0, but by assumption any set of m columns of A is linearly independent,
so y = z. So x is an extreme point of X(b).
(Backward) Consider a feasible solution x ∈ X(b) that is not a BFS. Let i1 , · · · , ir
be the rows of x that are non-zero, then r > m. This means that the columns
ai1 , · · · , air where ai = (A1i , · · · , Ami )T , have to be linearly dependent, so there
exist yi1 , · · · , yir not all equals 0 such that yi1 ai1 + · · · + yir air = 0. Extend y
to a vector in Rn by setting yi = 0 if i 6∈ {i1 , · · · , ir }, we have Ay = 0 and thus
A(x ± εy) = b for every ε ∈ R. By choosing ε > 0 small enough, x ± εy ≥ 0 and
so x ± εy ∈ X(b). Now x = 21 (x + εy) + 12 (x − εy), so x is not an extreme point of
X(b).
T. 3-24
If a LP is feasible and bounded, then there exists an optimal solution that is a
basic feasible solution.
Let x be optimal solution. If x has at m most non-zero entries, it is a basic feasible
solution, and we are done. Now suppose x has r > m non-zero entries. Since it is
not an extreme point, we have y 6= z ∈ X(b), δ ∈ (0, 1) such that x = δy + (1 − δ)z.
We will show there exists an optimal solution strictly fewer than r non-zero entries.
Then the result follows by induction.
By optimality of x, we have cT x ≥ cT y and cT x ≥ cT z. Since cT x = δcT y + (1 −
δ)cT z, we must have that cT x = cT y = cT z, ie. y and z are also optimal. Since
y ≥ 0 and z ≥ 0, x = δy + (1 − δ)z implies that yi = zi = 0 whenever xi = 0. So y
and z have at most r non-zero entries, which must occur in rows where x is also
non-zero.
If y or z has strictly fewer than r non-zero entries, then we are done. Otherwise,
for any δ̂ ∈ R (not necessarily in (0, 1)), let xδ̂ = δ̂y + (1 − δ̂)z = z + δ̂(y − z).
82 CHAPTER 3. OPTIMIZATION

Observe that xδ̂ is optimal for every δ̂ ∈ R. Moreover, y − z 6= 0, and all non-zero
entries of y − z occur in rows where x is non-zero as well. We can thus choose
δ̂ ∈ R such that xδ̂ ≥ 0 and xδ̂ has strictly fewer than r non-zero entries.
Intuitively, this is what we do when we “slide along the line” if c is orthogonal to
one of the boundary lines.
This result in fact holds more generally for the maximum of a convex function
f over a compact (ie. closed and bounded) convex setP X. In that case, we can
write any point x ∈ X as a convex combination x = ki=1 δi xi of extreme points
xi ∈ X, and where δi ∈ [0, ∞) with ki=1 δi = 1. Then, by convexity of f ,
P

k
X
f (x) ≤ δi f (xi ) ≤ max f (xi )
i
i=1

So any point in the interior cannot be better than the extreme points.
C. 3-25
<Linear Programming Duality> Consider LP in general form min{cT x :
Ax ≥ b, x ≥ 0}. With slack variables it is min{cT x : Ax − z = b, x, z ≥ 0}. We
have X = {(x, z) : x, z ≥ 0} ⊆ Rm+n . The Lagrangian is

L(x, z, λ) = cT x − λT (Ax − z − b) = (cT − λT A)x + λT z + λT b.

Since x, z can be arbitrarily positive, L has a finite minimum if and only if cT −


λT A ≥ 0 and λT ≥ 0. Let Y be the set of such λ. For fixed λ ∈ Y , the minimum of
L(x, z, λ) is attained when (cT − λT A)x = 0 = λT z by complementary slackness.
So
g(λ) = inf L(x, z, λ) = λT b
(x,z)∈X

T T
and the dual is: max{λ b : A λ ≤ c, λ ≥ 0}. Similarly the dual of the standard
form min{cT x : Ax = b, x ≥ 0} is max{λT b : AT λ ≤ c}.

T. 3-26
The dual of the dual of a linear program is the primal.

It suffices to show this for the linear program in general form. The dual problem
is: minimize −bT λ subject to −AT λ ≥ −c and λ ≥ 0. This problem has the same
form as the primal, with −b taking the role of c, −c taking the role of b, and −AT
taking the role of A. So doing it again, we get back to the original problem.
E. 3-27
Let the primal problem be: maximize 3x1 + 2x2 subject to

2x1 + x2 + z1 = 4, 2x1 + 3x2 + z2 = 6, x1 , x2 , z1 , z2 ≥ 0

Then the dual problem is: minimize 4λ1 + 6λ2 subject to

2λ1 + 2λ2 − µ1 = 3, λ1 + 3λ2 − µ2 = 2 λ1 , λ2 , µ1 , µ2 ≥ 0.

We can compute all basic solutions of the primal and the dual by setting n − m − 2
variables to be zero in turn. Given a particular basic solutions of the primal,
the corresponding solutions of the dual can be found by using the complementary
slackness solutions: λ1 z1 = λ2 z2 = 0 and µ1 x1 = µ2 x2 = 0.
3.2. SOLUTIONS OF LINEAR PROGRAMS 83

x1 x2 z1 z2 f (x) λ1 λ2 µ1 µ2 g(λ)
A 0 0 4 6 0 0 0 -3 -2 0
3
B 2 0 0 2 6 2
0 0 − 21 6
3 5
C 3 0 -2 0 9 0 2
0 2
9
3 13 5 1 13
D 2
1 0 0 2 4 4
0 0 2
2
E 0 2 2 0 4 0 3
− 35 0 4
F 0 4 0 -6 8 2 0 1 0 8

x2 λ2

F
C

E D
B D
C F
x1 λ1
A B A E λ1 + 3λ2 = 2
2x1 + 3x2 = 6

2x1 + x2 = 4 2λ1 + 2λ2 = 3

We see that D is the only solution such that both the primal and dual solutions
are feasible. So we know it is optimal without even having to calculate f (x). It
turns out this is always the case.

T. 3-28
Let x and λ be feasible for the primal and the dual of the linear program. Then
x and λ are optimal if and only if they satisfy complementary slackness, ie. (cT −
λT A)x = 0 and λT (Ax − b) = 0.

The below proof if for the general form of LP, but it should be clear that it also
holds for the standard form.

(Forward) If x and λ are optimal, then cT x = λT b since every linear program


satisfies strong duality. So

cT x = λT b = inf (cT x0 − λT (Ax0 − z − b)) = inf (cT x0 − λT (Ax0 − b))


x0 ∈X,z≥0 0 x ∈X
T T T
≤ c x − λ (Ax − b) ≤ c x.

since Ax ≥ b and λ ≥ 0. The first and last term are the same. So the inequalities
hold with equality. Therefore λT b = cT x − λT (Ax − b) = (cT − λT A)x + λT b. So
(cT − λT A)x = 0. Also, λT (Ax − b) = 0.

(Backward) If (cT − λT A)x = 0 and λT (Ax − b) = 0, then

cT x = cT x − λT (Ax − b) = (cT − λT A)x + λT b = λT b.

Hence by weak duality, x and λ are optimal.


84 CHAPTER 3. OPTIMIZATION

3.2.1 Simplex method


Consider a linear program in the standard form as in [D.3-1] but we maximise instead
of minimise.
For a given base B (i.e. B ⊆ {1, 2, · · · , n} with |B| = m) we can decompose A to
AB , AN and x to xB , xN such that Ax = AB xB + AN xN = b. Where in the above AB
contains the columns of A indexed by B and AN contains those that are not, and xB
contains the basis variables of x and xN contains the rest. Similarly we decompose c
to cB , cN such that cT x = cTB xB + cTN xN .
By the assumption we made earlier, AB (an m × m matrix) has linearly independent
columns and hence invertable. So AB xB + AN xN = b gives xB = A−1 B (b − AN xN ).
In particular, when xN = 0, then xB = A−1
B b. So if there is a basic feasible solution
(BFS) with base B, then A−1
B b ≥ 0. We also have the objective function as

f (x) = cT x = cTB xB + cTN xN = cTB A−1 T


B (b − AN xN ) + cN xN

= cTB A−1 T T −1
B b + (cN − cB AB AN )xN .

Suppose we have and A−1 −1


B b ≥ 0 (i.e. ∃x such that xB = AB b and xN = 0 is a
BFS) and cTN − cTB A−1
B A N ≤ 0. For any feasible solution x ∈ Rn , we must have
T T −1 T −1
xN ≥ 0, so (cN − cB AB AN )xN ≤ 0. Hence f (x) ≤ cB AB b for any feasible solution x.
Therefore the basic solution x such that xB = A−1
B b and xN = 0 must be an optimal
solution.
If however (cTN − cTB A−1
B AN )i > 0, we can increase the value of the objective function
by increasing (xN )i . Either we can increasing (xN )i indefinitely, which means that
the maximum is unbounded, or the constraints force some of the variables in the basis
to become smaller and we have to stop when the first one reaches zero, say (xB )j .
Then we have found a new BFS (note we still have A−1 B b ≥ 0 in the new base) with
a larger value, we have switched basis by replacing (xB )j with (xN )i in the base. We
can continue from here. If in the new base (cTN − cTB A−1
B AN ) is negative, we are done.
Otherwise, we repeat the above procedure.
Assuming that the LP is feasible and has a bounded optimal solution, then there exists
a basis B for which A−1 T T −1
B b ≥ 0 and cN −cB AB AN ≤ 0 is satisfied. The above procedure
move us basis to basis until we reach this basis.

The algorithm
The simplex method is a systematic way of doing the above procedure. We made a
simplex tableau (a (n + 1) × (m + 1) table) of the form

aij = (A−1
B A)ij
aij ai0
where a0j = (cT − cTB A−1
B A)j

ai0 = (A−1
B b)i
a0j a00
a00 = −cTB A−1
B b

The simplex method proceeds as follows:


1. Find an initial basic feasible solution with basis B and make the simplex tableau.
This ensured A−1
B b ≥ 0.
3.2. SOLUTIONS OF LINEAR PROGRAMS 85

2. Check whether a0j ≤ 0 for every j. If so, the current solution is optimal, so stop.
−1 −1
Note that (cT − cT T T
B AB A)j = cj − (cB )p (AB )pq Aqj . If j ∈ B, write B(j) as the
reduced index of j such that Aij = (AB )iB(j) etc., we have
−1 −1
(cT − cT T T T T
B AB A)j = (cB )B(j) − (cB )p (AB AB )pB(j) = (cB )B(j) − (cB )B(j) = 0.

If j ∈ N , write N (j) as the reduced index of j, we have


−1 −1 T −1
(cT − cT T T T
B AB A)j = (cN )N (j) − (cB )p (AB )pq (AB )qN (j) = (cN − cB AB AB ))N (j) .

So a0j ≤ 0 for every j means that cT T −1


N − cB AB AN ≤ 0.

3. If not, choose a pivot column j such that a0j > 0. If aij ≤ 0 for all i, the problem
is unbounded, and we stop. Otherwise choose a pivot row i ∈ {i0 : ai0 j > 0}
that minimizes ai0 /aij . If multiple rows are minimize ai0 /aij , then the problem is
degenerate, and things might go wrong.
We chose j such that (cT T −1
N − cB AB AN )N (j) > 0. Note that the constraint b = Ax
is equivalent to A−1
B b = A −1
B Ax = A−1 −1
B AB xB + AB AN xN = IxB + AB AN xN .
−1

The version A−1B Ax = A −1


B b is much easier to work with than Ax = b as we will see
later because we have rewritten the constraints so that the basis variables are “singled
out”.

If aij ≤ 0 (ie. (A−1B AN )iN (j) ≤ 0) for all i, then as we increase (xN )N (j) , every
component of A−1B AN xN becomes smaller and smaller which can be offset by appro-
priate increasing in value of xB to maintain A−1 −1
B b = IxB + AB AN xN , this we can
do forever, hence the problem is unbounded.

As we increase (xN )N (j) , for every i0 such that ai0 j = (A−1B AN )i0 N (j) > 0, the value
of (IxB + A−1
B A N x N ) i0 would increase by a 0
i j ∆(x N ) −1
N (j) , so to maintain AB b =
−1
IxB + AB AN xN , we need to decrease (xB )i0 by the amount ai0 j ∆(xN )N (j) . Now
since the BSF solution we started with is (xB )i = ai0 with xN = 0. The i that
minimise ai0 /aij correspond the (xB )i that hits 0 first. Now we have found a better
BFS that has a different base.

4. We update the tableau by multiplying row i by 1/aij , and add a (−akj /aij ) multiple
of (original) row i to each row k 6= i, including k = 0. Now return to step 2 and
repeat.

This operation change the base B to our new base, so that our table is now in the new
base (all the B in the table becomes B 0 corresponding to the new base). Note that
row operations are allowed because doing so is equivalent to adding one constraints
to another or multiplying one constraints by a scaler, which are allowed. After this
step akj = 0 for all k 6= i and aij = 1, ie. j is now in the new base replacing the
original ith basis variable. So this operation is equivalent to changing base.

Note that in the tableau the column of ai0 is the xB of the current BFS x, and a00 is
the −f (x).

A point to note: Let x be such that xB = A−1 −1


B b and xN = 0. The condition AB b ≥ 0
T T −1 T T −1
implies that x is BFS. The condition c −cB AB A ≤ 0 says that λ = cB AB is feasible
for the dual (we are dealing with the standard form, see[C.3-25]). We have λT (Ax−b) =
0 since Ax = b. Also we have (cT − λT A)x = (cTB − λT AB )xB + (cTN − λT AN )xN = 0.
So by [T.3-28] x (and λ) are optimal solution for the LP (and its dual), consistent
with what we said before.
86 CHAPTER 3. OPTIMIZATION

E. 3-29
Consider the following problem: maximize x1 + x2 subject to
x1 + 2x2 + z1 = 6, x1 − x2 + z2 = 3, x1 , x2 , z1 , z2 ≥ 0.
Note that (x1 , x2 , z1 , z2 ) = (0, 0, 6, 3) is a BFS, so A−1
= I. We make the simplex
B
tableau, which now has the simple form which simply contains the coefficients of
the constraints and objective function:

x1 x2 z1 z2
Constraint 1 1 2 1 0 6
Constraint 2 1 -1 0 1 3
Objective 1 1 0 0 0

It’s pretty clear that our basic feasible solution is not optimal, since our objective
function is 0. This is since something in the last row is positive, and we can
increase the objective by, say, increasing x2 (pivot column is 2). The pivot row is
row 1, which means z1 would be the first to hit 0 as we increase x2 . We multiply
the first row by 12 and then add 1 times of the first row to the second row and -1
times first row to the third row. We have
x

1 x

2 z1
 z2

( 1 1
( ((((1
Constraint 2
1 2
0 3
(( 3 1
Constraint
( ( ( ( 2 2
0 2
1 6
1
− 12

Objective 0 0 -3
  2

Now we have changed base to (x2 , z2 ). Our new and better BFS is (x1 , x2 , z1 , z2 ) =
(0, 3, 0, 6). Do this one more time we have,

x

1 x

2 z1
 z2

1
− 31
(
( ((((1
Constraint 0 1 3
1
(( 1 2
Constraint
( ((( 2 1 0 3 3
4

Objective
 0 0 − 23 − 31 −5

Now we have changed base to (x2 , x1 ). Our new and better BFS is (x1 , x2 , z1 , z2 ) =
(4, 1, 0, 0) with an objective of value 5. This is optimum since a0j ≤ 0.
E. 3-30
<Two-phase simplex method> Sometimes there isn’t a obvious BFS, we
would need to use the two-phase simplex method to find our first BFS. This is best
illustrate with an example. Consider the problem:
minimize 6x1 + 3x2 subject to x1 , x 2 ≥ 0 and
x1 + x2 ≥ 1, 2x1 − x2 ≥ 1, 3x2 ≤ 2,
This is a minimization problem. To avoid being confused, we maximize −6x1 −3x2
instead. We add slack variables to obtain: maximize −6x1 − 3x2 subject to
x1 + x2 − z1 = 1, 2x1 − x2 − z2 = 1, 3x2 + z3 = 2, x1 , x2 , z1 , z2 , z3 ≥ 0
We don’t have an obvious BFS since (x1 , x2 , z1 , z2 , z3 ) = (0, 0, −1, −1, 2) is not
feasible. So we add more variables (called the artificial variables) to places where
it “doesn’t work” previously, and we solve to minimise the sum of the new artificial
variables. So
3.2. SOLUTIONS OF LINEAR PROGRAMS 87

minimize y1 + y2 subject to x1 , x 2 , z 1 , z 2 , z 3 , y 1 , y 2 ≥ 0 and

x1 + x2 − z1 + y1 = 1, 2x1 − x2 − z2 + y2 = 1, 3x2 + z3 = 2

This new problem has the obvious BFS (x1 , x2 , z1 , z2 , z3 , y1 , y2 ) = (0, 0, 0, 0, 2, 1, 1),
so we can solve this problem by the simplex method. If the original problem is
feasible, the optimal solution to the new problem must have y1 + y2 = 0 (ie.
y1 = 0 and y2 = 0), so the optimal solution for the new problem is a BFS to the
original problem. So we can solve the original problem by first solving this new
problem (phrase I), and then solve the original problem (phrase II). We write out
the coefficients in a table:

x1 x2 z1 z2 z3 y1 y2
Constraint 3 0 3 0 0 1 0 0 2
Constraint 1 1 1 -1 0 0 1 0 1
Constraint 2 2 -1 0 -1 0 0 1 1
Original objective -6 -3 0 0 0 0 0 0
New objective 0 0 0 0 0 -1 -1 0

where we rearranged the order of our constraints so that AB = A−1 B = I. This


is almost the simplex tableau except the last line (the “New objective” line) is
wrong, because it is in a wrong base (in particular we know that the last line
should say 0 at the places of the basis variables). This can be fixed by adding
the second and third line to the last line, as can be verified by cT − cTB A−1
B A =
(0, 0, 0, 0, 0, −1, −1) + (0, 1, 1)A. So the simplex tableau is:

x1 x2 z1 z2 z3 y1 y2
0 3 0 0 1 0 0 2
1 1 -1 0 0 1 0 1
2 -1 0 -1 0 0 1 1
-6 -3 0 0 0 0 0 0
3 0 -1 -1 0 0 0 2

In addition to the new objective, we also write our original objective in the tableau,
this is so that we can conveniently use it in the second phase (we can continue to
use this table when we go on to solve the original problem). Our pivot column is
x1 , and our pivot row is the third row, do the procedure we have:

x1 x2 z1 z2 z3 y1 y2
0 3 0 0 1 0 0 2
3 1
0 2
-1 2
0 1 − 12 1
2
1 − 12 0 − 12 0 0 1
2
1
2

0 -6 0 -3 0 0 3 3
3 1
0 2
−1 2
0 0 − 32 1
2

There are two possible pivot columns. We pick z2 and use the second row as the
pivot row. We have
88 CHAPTER 3. OPTIMIZATION

x1 x2 z1 z2 z3 y1 y2
0 3 0 0 1 0 0 2
0 3 -2 1 0 2 -1 1
1 1 -1 0 0 1 0 1
0 3 -6 0 0 6 0 6
0 0 0 0 0 -1 -1 0

We see that y1 and y2 are no longer in the basis, and hence take value 0. So phrase
I is complete. We drop all the phase I stuff in our table, what that remains is our
phrase II tableau (tableau of the original problem):

x1 x2 z1 z2 z3
0 3 0 0 1 2
0 3 -2 1 0 1
1 1 -1 0 0 1
0 3 -6 0 0 6

We see a basic feasible solution (x1 , x2 , z1 , z2 , z3 ) = (1, 0, 0, 1, 2). We pick x2 as


the pivot column, and the second row as the pivot row. Then we have

x1 x2 z1 z2 z3
0 0 2 -1 1 1
0 1 − 23 1
3
0 1
3
1 0 − 31 − 13 0 2
3

0 0 -4 -1 0 5

Since the last row is all negative, we have complementary slackness. So this is a
optimal solution. So (x1 , x2 , z1 , z2 , z3 ) = ( 23 , 13 , 0, 0, 1) is an optimal solution, and
our optimal value is 5.
Note that we previously said that the bottom right entry is the negative of the
optimal value, not the optimal value itself! This is correct, since in the tableau,
we are maximizing −6x1 − 3x2 , whose maximum value is −5. So the minimum
value of 6x1 + 3x2 is 5.
It is worth noting that the problem we have just solved is the dual of the LP in
[E.3-29], added in addition the constraint 3x2 ≤ 2. Ignoring the column and row
corresponding to z3 , the slack variable for this new constraint, the final tableau is
essentially (not quite because there is slack variables and the table is displaying
A−1
B A not A) the negative of the transpose of the final tableau we obtained in
[E.3-29]. This makes sense because the additional constraint is not tight in the
optimal solution, as we can see from the fact that z3 6= 0.

3.3 Non-cooperative games


Here we have a short digression to game theory. Non-cooperative games studies situ-
ations in which multiple self-interested entities (or players), simultaneously and inde-
pendently optimize different objectives and outcomes must therefore be self-enforcing.
3.3. NON-COOPERATIVE GAMES 89

We mostly focus on games with two players, but note that most concepts extend in a
straightforward way to games with more than two players.

3.3.1 Bimatrix game


A two-player game, or bimatrix game , with m actions for player 1 and n actions for
player 2 can be represented by a pair of matrices P, Q ∈ Rm×n , where Pij and Qij are
the payoffs of players 1 and 2 respectively when player 1 plays action i and player 2
plays action j. Player 1 are called the row player because he choose a row when he
decide an action, while player 2 are called the column player . Players choose actions
without knowledge of the other player’s decisions.
Players are allowed to play randomly.
P The set of possible strategies the row player
m
can have
P is X = {x ∈ R : x ≥ 0, xi = 1} and for the column player Y = {y ∈ Rn :
y ≥ 0, yi = 1}. Given a vector in the set (a strategy), the values of its component i
corresponds to the probabilities of the player selecting the action i.
A strategy profile (x, y) ∈ X × Y induces a lottery over outcomes, and we write
p(x, y) = xT P y for the expected payoff of the row player. Similar for the column
player.
D. 3-31
• If x is a strategy with xi = 1 for some i, ie. we always pick i, then we call x a
pure strategy .
• A strategy is dominant if, regardless of what any other players do, the strategy
earns a player a larger payoff than any other. Depending on whether “larger pay-
off” is defined with weak or strict inequalities, the strategy is termed strictly dominant
or weakly dominant . If one strategy is dominant, than all others are dominated.
• An outcome of a game is Pareto dominated if some other outcome would make
at least one player better off without hurting any other player. That is, some
other outcome is weakly preferred by all players and strictly preferred by at
least one player. If an outcome is not Pareto dominated by any other, than it
is Pareto optimal .
• The maximin strategy is the strategy that maximizes the worst payoff that can
be achieved. The security level 3 of a player is the payoff of his maximin strategy.
• A strategy x ∈ X is a best response to y ∈ Y if p(x, y) ≥ p(x0 , y) for all x0 ∈ X.
The concept of a best response for the column player is defined analogously. A pair
(x, y) is an equilibrium (also known as Nash equilibria ) if x is the best response
against y and y is a best response against x.
E. 3-32
A game of rock-paper-scissors can have payoff matrices
   
0 −1 1 0 1 −1
Pij =  1 0 −1 , Qij = −1 0 1 .
−1 1 0 1 −1 0
Here a victory gives you a payoff of 1, a loss gives a payoff of −1, and a draw
gives a payoff of 0. Also the first row/column corresponds to playing rock, second
corresponds to paper and third corresponds to scissors.
3
Some defined security level as a different thing.
90 CHAPTER 3. OPTIMIZATION
Usually, this is not the best way to display
the payoff matrices. First of all, we need R P S
two write out two matrices, and there isn’t R (0, 0) (−1, 1) (1, −1)
an easy way to indicate what row corre- P (1, −1) (0, 0) (−1, 1)
sponds to what decision. Instead, we usu- S (−1, 1) (1, −1) (0, 0)
ally write this as a table.
By convention, the first item in the tuple (−1, 1) indicates the payoff of the row
player, and the second item indicates the payoff of the column player.
E. 3-33
<Prisoner’s dilemma> Suppose Alice and Bob commit a crime together, and
are caught by the police. They can choose to remain silent (S) or testify (T ).
Different options will lead to different outcomes:
• Both keep silent: the police has little evidence and they go to jail for 2 years.
• One testifies and one remains silent: the one who testifies gets awarded and is
freed, while the other gets stuck in jail for 10 years.
• Both testify: they both go to jail for 5 years.

We can represent this by a payoff table one the right. S T


Note that higher payoff is desired, so a longer serv- S (2, 2) (0, 3)
ing time corresponds to a lower payoff. T (3, 0) (1, 1)

Here we see that regardless of what the other person does, it is always strictly
better to testify, so T is a dominant strategy. The strategy profile (T, T ) is a
dominant strategy equilibrium but it is Pareto dominated by (S, S). The source
of the dilemma is that outcome resulting from (T, T ) is strictly worse for both
players than the outcome resulting from (S, S).
L. 3-34
The maximin strategy/security level are the optimal solution/value of the LP
maximize v subject to
m
X m
X
x ≥ 0, xi = 1, xi pij ≥ v for all j = 1, · · · , n
i=1 i=1

The security level of the row player is maxx∈X miny∈Y p(x, y). It is easy to see
that it is the same to maximize the minimum payoff over all pure strategies of the
other player, so maxx∈X minj∈{1,...,n} m
P
i=1 xi pij . We can formulate this as the
LP given.
E. 3-35
<Chicken> The game of Chicken is as follows: two people drive their cars
towards each other at high speed, they can decide to chicken out (C) or continue
driving (D). If they collide (ie. both don’t chicken), they both die. If one chickens
and the other doesn’t, the person who chicken is cowardice.
This can be represented by the table on the right.
Here there is no dominating strategy, so we need a C D
different way of deciding what to do. Instead, we C (2, 2) (1, 3)
can use the maximin strategy. This strategy mini- D (3, 1) (0, 0)
mizes the worst possible loss.
3.3. NON-COOPERATIVE GAMES 91

The unique maximin strategy in this game is to chicken for a security level of 1.
This isn’t an equilibria solution since if one players employ this maximin strategy,
it would be better for the other to not chicken out.
In this game, there are two pure equilibrium, (C, D) and (D, C), and there is a
mixed equilibrium in which the players pick the options with equal probability.
T. 3-36
(Nash, 1961) Every bimatrix game has an equilibrium.

3.3.2 Matrix game


A bimatrix game is a zero-sum game , or matrix game , if qij = −pij for all i, j, ie.
the sum of the payoff of both player is always 0.
E. 3-37
The rock-paper-scissors games as specified in the beginning example is a zero-sum
game. Note that to specify a matrix game, we only need one matrix, since the
matrix of the other player is simply the negative of the matrix of the first.
Assuming invariance of utilities under positive affine transformations, results for
zero-sum games in fact apply to the larger class of constant-sum games, in which
the payoffs of the two players always sum up to the same constant.
T. 3-38
(von Neumann, 1928) If P ∈ Rm×n , then

max min p(x, y) = min max p(x, y).


x∈X y∈Y y∈Y x∈X

Recall the LP for max min p(x, y).[L.3-34] Adding slack variable z ∈ Rn with z ≥ 0,
we obtain the Lagrangian
n m
! m
!
X X X
L(v, x, z, w, y) = v + yj xi pij − zj − v − w xi − 1
j=1 i=1 i=1
n
! m n
! n
X X X X
= 1− yj v+ pij yj − w xi − yj zj + w.
j=1 i=1 j=1 j=1

where w ∈ R and yP∈ Rn areP Lagrange multipliers. This has finite minimum for
all v ∈ R, x ≥ 0 iff yi = 1, pij yj ≤ w for all i, and y ≥ 0. The dual is
minimize w subject to
n
X n
X
y ≥ 0, yj = 1, pij yj ≤ w for all i
j=1 j=1

This corresponds to the column player choosing a strategy (yi ) such that the
expected payoff of the row player is bounded above by w.
The optimum value of the dual is miny∈Y maxx∈X p(x, y). So the result follows
from strong duality.
92 CHAPTER 3. OPTIMIZATION

We call v = maxx∈X miny∈Y p(x, y) = miny∈Y maxx∈X p(x, y) the value of the
matrix game with payoff matrix P .
This result is equivalent to maxx∈X miny∈Y p(x, y) = − maxy∈Y minx∈X −p(x, y).
Then for a zero-sum game, we see that the left hand side is the worst payoff the
row player can get if he employs the minimax strategy, while the right hand side
is the worst payoff the column player can get if he uses his minimax strategy.
Combining with the next theorem, this theorem then says that if both players
employ the minimax strategy, then this is an equilibrium. So in a zero-sum game,
maximin strategies are optimal.
T. 3-39
(x, y) ∈ X × Y is an equilibrium of the matrix game with payoff matrix P iff

min p(x, y 0 ) = max min p(x0 , y 0 )


y 0 ∈Y 0 0x ∈X y ∈Y

max
0
p(x0 , y) = min
0
max
0
p(x0 , y 0 )
x ∈X y ∈Y x ∈X

(Forward) Suppose (x, y) is an equilibrium, then


p(x, y) = max
0
p(x0 , y) ≥ min
0
max
0
p(x0 , y 0 ) = max
0
min
0
p(x0 , y 0 )
x y x x y

≥ min
0
p(x, y 0 ) = − max
0
(−p(x, y 0 )) = −(−p(x, y)) = p(x, y)
y y

So we must have equality all the way through.


(Backward) We have
p(x, y) ≥ min
0
p(x, y 0 ) = max
0
p(x0 , y) ≥ p(x, y).
y x

3.4 Network problems


D. 3-40
• A directed graph or network is a pair G = (V, E), where V is the set of vertices
and E ⊆ V × V is the set of edges. If (u, v) ∈ E, we say there is an edge from u
to v. When the relation E is symmetric, G is called an undirected graph , and we
can write edges as unordered pairs {u, v} ∈ E for u, v ∈ V .
• The degree of a vertex u ∈ V is the number of v ∈ V such that (u, v) ∈ E or
(v, u) ∈ E.
• An walk from u ∈ V to v ∈ V is a sequence of vertices u = v1 , · · · , vk = v
such that (vi , vi+1 ) ∈ E for all i. An undirected walk allows (vi , vi+1 ) ∈ E or
(vi+1 , v) ∈ E.
• A path is a walk where v1 , · · · , vk are pairwise distinct. A cycle is a walk where
v1 , · · · , vk−1 are pairwise distinct and v1 = vk .
• A graph is connected if for any pair of vertices, there is an undirected path
between them.
• A graph G0 = (V 0 , E 0 ) is a subgraph of graph G = (V, E) if V 0 ⊆ V and E 0 ⊆ E.
In the special case where G0 is a tree and V 0 = V , it is called a spanning tree of
G.
3.4. NETWORK PROBLEMS 93

3.4.1 Minimum-cost flow problem


Let G = (V, E) be a directed graph. Let the number of vertices be |V | = n. Let b ∈ Rn
and C, M , M ∈ Rn×n .
Each component of the vector bi denotes the net amount of flow that enters or leaves
the network each vertex i ∈ V . If bi > 0, we call i ∈ V a source (ie. more stuff flow
out than entering). If bi < 0, we call i ∈ V a sink . cij is the cost of transferring one
unit of stuff from vertex i to vertex j , and mij and mij denote the lower and upper
bounds on the amount of flow along (i, j) ∈ E (note that (i, j) is ordered) respectively.
If there is no edge from i to j just let cij = mij = mij = 0.
Let xij be the amount of flow through (i, j) . x ∈ Rn×n is a minimum-cost flow if it
minimizes the cost of transferring stuff, while satisfying the constraints, ie. it is an
optimal solution to the problem
X
minimize cij xij subject to
(i,j)∈E
X X
bi + xji = xij for each i ∈ V
j:(j,i)∈E j:(i,j)∈E

mij ≤ xij ≤ mij for all (i, j) ∈ E.

This problem is a linear program. In theory, we can write it into the general form
Ax = b with regional constraint mk ≤ xk ≤ mk , where A is the matrix given by

1
 if the kth edge starts at vertex i
aik −1 if the kth edge ends at vertex i

0 otherwise

Note that instead of representing edge by indices i, j, in this case we represent them
with just one index k. However, this method is not very efficient, we will look for
better methods.
P
Note that i∈V bi = 0 is required for feasibility, which makes sense (total supply is
equal to the total consumption), and that a problem satisfying this condition can be
transformed into an equivalent circulation problem where bi = 0 for all i by introduc-
ing an additional vertex, and new edges from each sink to the new vertex and from
the new vertex to each of the sources, and let these new edges have upper and lower
bounds equal to the flow that should enter the sources or leave the sinks. Note also
we can assume wlog that the network G is connected. Otherwise the problem can be
decomposed into several smaller problems that can be solved independently.
An uncapacitated problem is the case where mij = 0 and mij = ∞ for all (i, j) ∈ E.
Clearly, an uncapacitated flow problem is either unbounded (which can happen if some
cij are negative), or is bounded and hence has an equivalent problem with finite capac-
ities (as we can add a bound greater than what the optimal solution wants).
The Lagrangian of the minimum-cost circulation problem is
 
X X X X X
L(x, λ) = cij xij − λi  xij − xji  = (cij − λi + λj )xij
(i,j)∈E i∈V j:(i,j)∈E j:(j,i)∈E (i,j)∈E

Bear also in mind we have the regional constraints M ≤ x ≤ M .


94 CHAPTER 3. OPTIMIZATION

T. 3-41
If x ∈ Rn×n is a feasible flow for a circulation problem and with λ ∈ Rn such that

cij − λi + λj > 0 =⇒ xij = mij


cij − λi + λj < 0 =⇒ xij = mij
mij < xij < mij =⇒ cij − λi + λj = 0,

then x is optimal.

For (i, j) ∈ E, let cij = cij − λi + λj . Then, for every feasible flow x0 ,
 
X X X X X X
cij x0ij = cij x0ij − λi  x0ij − x0ji  = cij x0ij
(i,j)∈E (i,j)∈E i∈V j:(i,j)∈E j:(j,i)∈E (i,j)∈E
| {z }
=0
X X X X
≥ cij mij + cij mij = cij xij = cij xij
(i,j)∈E (i,j)∈E (i,j)∈E (i,j)∈E
cij <0 cij >0

Note that this result is simply Lagrange sufficiency. Note that for an x and λ that
satisfies the conditions stated, we must have L(x, λ) = inf x0 ∈X L(x0 , λ) since the
conditions implies we cannot decreased L anymore.
The Lagrange multiplier λi is also referred to as a node number, or as a potential
associated with vertex i ∈ V . Since only the difference between pairs of Lagrange
multipliers appears in the optimality conditions, we can set wlog λ1 = 0.

3.4.2 Transportation problem


The transportation problem is a special case of the minimum-flow problem, where
the graph is a bipartite graph , which means that we can split (ie. partition) the
vertices into two halves A, B, where all edges flow from a vertex in A to a vertex in
B (ie. E ⊆ A × B). We call the vertices of A the suppliers and the vertices of B
the consumers . Again let c ∈ Rn×m representing the cost of transferral. As far as
optimal solutions are concerned, edges not contained in E are equivalent to edges with
a very large cost. We can thus restrict our attention to the case where E = S × C,
known as the Hitchcock transportation problem . We can write the problem as
n X
X m
minimize cij xij subject to x ≥ 0,
i=1 j=1

m
X n
X
xij = si for i = 1, · · · , n and xij = dj for j = 1, · · · , m
j=1 i=1

si is the supply of each supplier, and dP


i is the demand of each consumers. We have
s ∈ Rn , d ∈ Rm satisfying s, d ≥ 0 and
P
si = dj .
T. 3-42
Every minimum cost-flow problem with finite capacities or non-negative costs has
an equivalent transportation problem.
3.4. NETWORK PROBLEMS 95

Consider a minimum-cost flow problem on network (V, E). Wlog assume that
mij = 0 for all (i, j) ∈ E, because otherwise we can set mij to 0, mij to mij − mij ,
bi to bi − mij , bj to bj + mij , xij to xij − mij .
Moreover, we can assume that all capacities are finite: if some edge has infinite
capacity but P
non-negative cost, then setting the capacity to a large enough number,
for example i∈V |bi | does not affect the optimal solutions. This is since cost is
non-negative, P
and the optimal solution will not want shipping loops. So we will
have at most |bi | shipments.
We now construct an transportation problem
P as follows: Replace every vertex
i ∈ V with a consumer with demand ( k:(i,k)∈E mik ) − bi . Replace every edge
(i, j) ∈ E with a supplier with supply mij , this supplier has an edge to consumer
i with cost c(ij,i) = 0 and an edge to consumer j with cost c(ij,j) = cij .
P
i k:(i,k)∈E mik − bi
0

mij ij

cij P
j k:(j,k)∈E mjk − bj

The idea is that if the capacity of the edge (i, j) is, say, 5, in the original network,
and we want to transport 3 along this edge, then in the new network, we send 3
units from ij to j, and 2 units to i.
For any flow x in the original network, the corresponding flow P on (ij, j) is xij and
the flow on (ij, i) is mij − xij . The total flow into i is then k:(i,k)∈E (mik − xik ) +
P
k:(k,i)∈E xki . This satisfies the constraints of the new network iff
X X X
(mik − xik ) + xki = mik − bi ,
k:(i,k)∈E k:(k,i)∈E k:(i,k)∈E

which is true if and only if


X X
bi + xki − xik = 0,
k:(k,i)∈E k:(i,k)∈E

which is exactly the constraint for the node i in the original minimal-cost flow
problem. So the two problem is equivalent.
So we can solve a bounded minimum cost-flow problem by solving the equivalent
transportation problem, which is usually easier.
C. 3-43
<Transportation Algorithm> For the transportation problem, it is convenient
to have two sets of Lagrange multipliers, one for the supplier constraints and one
for the consumer constraint. Then the Lagrangian of the transportation problem
96 CHAPTER 3. OPTIMIZATION

can be written as
m X
n n m
! m n
!
X X X X X
L(x, λ, µ) = cij xij + λi si − xij − µj dj − xij
i=1 j=1 i=1 j=1 j=1 i=1
n X
X n n
X m
X
= (cij − λi + µj )xij + λi si − µj d j .
i=1 j=1 i=1 j=1

Note that we use different signs for the Lagrange multipliers for the suppliers and
the consumers, so that our ultimate optimality condition will look nicer.
Since x ≥ 0, the Lagrangian has a finite minimum iff cij − λi + µj ≥ 0 for all
i, j. So this is our dual feasibility condition. At an optimum, complementary
slackness entails that (cij − λi + µj )xij = 0 for all i, j. In fact if we have λ, µ
and x that satisfies these conditions, then L(x, λ, µ) = inf x0 ∈X L(x0 , λ, µ), so by
Lagrange sufficiency x is optimal. To solve the transportation problem we could
use a method similar to the simplex method. In this case, we made a tableau as
follows:
µ1 µ2 µ3 µ4
λ1 − µ1 λ1 − µ2 λ 1 − µ3 λ 1 − µ4
λ1 x11 c11 x12 c12 x13 c13 x14 c14 s1
λ2 − µ1 λ2 − µ2 λ 2 − µ3 λ 2 − µ4
λ2 x21 c21 x22 c22 x23 c23 x24 c24 s1
λ3 − µ1 λ3 − µ2 λ 3 − µ3 λ 3 − µ4
λ3 x31 c31 x32 c32 x33 c33 x34 c34 s1
d1 d2 d3 d4
We have a row for each supplier and a column for each consumer. We assume
there are 3 suppliers and 4 consumers but of course the table can be alter for any
number of suppliers or consumers. We proceed as follows:
1. Find an initial BFS, and let T be the edge of he corresponding spanning tree.
Although it looks like we have n + m constraints, we effectively only have m +
Pm Pn
n − 1 constraints since for example d1 = i=1 si − i=2 di so the constraint
Pn
i=1 xi1 = d1 can be derived from the other m + n − 1 constraints. So a
BFS has at most m + n − 1 non-zero entries. If we have a BFS, we can always
reduce so that it does not contains cycles. That is because if we have a cycle,
we can increase/decrease the flow an edge of the cycle, the flow of the other
edges of the cycle must change correspondingly to maintain feasibility, eventually
the flow of one of the edges will be reduced to 0, in which case we don’t have
the cycle anymore. Assuming there is no degeneracy, the resulting graph would
be a tree. In general, degeneracies occur when a subset of the consumers can be
satisfied exactly by a subset of the suppliers, hence the graph can be disconnected.
Assuming no degeneracy, the graph would be a spanning tree with n + m − 1
edges. Note that it must be spanning otherwise some suppliers/consumers are
not supplying/demanding.

2. Choose λ and µ such that λ1 = 0 and cij − λi + µj = 0 for all (i, j) ∈ T .


Here we find λ and µ such that cij − λi + µj = 0 for all (i, j) ∈ T one of the
condition for optimality. We set wlog λ1 = 0 (since we have n + m multipliers
3.4. NETWORK PROBLEMS 97

but only |T | = n + m − 1 constraints.

3. If cij − λi + µj ≥ 0 for all (i, j) ∈ E, the solution is optimal, so stop.


4. Otherwise pick (i, j) ∈ E such that cij − λi + µj < 0, and push flow along the
unique cycle in (V, T ∪ {(i, j)}) until xi0 j 0 = 0 for some edge (i0 , j 0 ) in the cycle.
Set T to (T \ {(i0 , j 0 )}) ∪ {(i, j)} and go to step 2.
We can increase xij to decrease L since cij − λi + µj < 0. Now T on addition of
(i, j), forms a cycle (since T is a tree). As we increase xij , to remain feasibility,
the flow in other edges of the cycle will have to change accordingly, until one of
them say (i0 , j 0 ), becomes 0. In that cycle, all the edges (p, q) apart from (i, j)
has cpq − λp + µq = 0, so our new BFS x0 decreased L by |(cij − λi + µj )∆xij |.
Note that for any feasible solution y, we have L(y, λ, µ) = f (y) where f is the
objective function. So f (x0 ) < f (x) and we have found a better BFS. And so we
can repeat the process to eventually get the optimal solution.

E. 3-44
Suppose we have three suppliers with supplies 8, 10 and 9; and four consumers
with demands 6, 5, 8, 8.

It is easy to create an initial feasible solution - we just start from the first supplier
and the first consumer, we supply as much as we can until one side runs out of
supply/demand. If it is the supplier that runs out, we take in the next supplier
to continue the job. If it is the consumer that has no more demand, we go on to
supply to the next consumer. We first fill our tableau with our feasible solution.
6
8 = s1 d1 = 6
2

6 5 2 3 4 6 8 3
10 = s2 d2 = 5
7
2 3 7 7 4 1 10
1
9 = s3 d3 = 8
5 6 1 2 8 4 9 8
6 5 8 8
d4 = 8

We see that our basic feasible solution corresponds to a spanning tree. In general,
if we have n suppliers and m consumers, then we have n + m vertices, and hence
n + m − 1 edges (assuming no degeneracy). To set λ, µ so that cij − λi + µj = 0 for
all these edges, we have n+m−1 constraints here, so we can arbitrarily choose one
Lagrange multiplier, and the other Lagrange multipliers will follow. We choose
λ1 = 0. Now we must have µ1 = −5 etc., we fill in the values of the other Lagrange
multipliers as follows, and obtain
-5 -3 0 -2

0 6 5 2 3 4 6

4 2 3 7 7 4 1

2 5 6 1 2 8 4
We can fill in the values of λi − µj :
98 CHAPTER 3. OPTIMIZATION

-5 -3 0 -2
0 2
0 6 5 2 3 4 6
9 6
4 2 3 7 7 4 1
7 5
2 5 6 1 2 8 4

We didn’t bother filling in the value for λi − µj for (i, j) ∈ T since we know
cij − λi + µj = 0. If λi − µi ≤ cij is satisfied everywhere, we have optimality.
However we do not in this case, for example 9 = λ2 − µ1 > c21 = 2. We add
an edge, from the second supplier to the first consumer. Then we have created a
cycle. We keep increasing the flow on the new edge. This causes the values on
other edges to change by flow conservation. So we keep doing this until some other
edge reaches zero. If we increase flow by, say, δ, we have

6−δ
8 = s1 d1 = 6
2+δ
δ

6−δ 5 2+δ 3 4 6 10 = s2
3−δ
d2 = 5
7
δ 2 3−δ 7 7 4 1
1
9 = s3 d3 = 8
8
5 6 1 2 8 4
d4 = 8

3 5 5 3 4 6
The maximum value of δ we can take
is 3. So we end up with 3 2 7 7 4 1

5 6 1 2 8 4

-5 -3 -7 -9
7 9
0 3 5 5 3 4 6
We re-compute the Lagrange multipli-
0 6
ers to obtain
-3 3 2 7 7 4 1
0 -2
-5 5 6 1 2 8 4

We see a violation of λi − µi ≤ cij at 3 5 5 3 4 6


ij = 24. So we do the process again:
we can increase x24 by 7 to obtain the 3 2 7 4 7 1
tableau on the right.
5 6 8 2 1 4
3.4. NETWORK PROBLEMS 99

-5 -3 -2 -4
Calculating the Lagrange multipliers 2 4
gives the table on the right. 0 3 5 5 3 4 6
0 -1
No more violations. So this is the op- -3 3 2 7 4 7 1
timal solution. 5 3
0 5 6 8 2 1 4

3.4.3 Maximum Flow Problem


D. 3-45
• Suppose G = (V, E) with capacities Cij for (i, j) ∈ E. A cut of G is a partition
of V into two sets. For S ⊆ V , the capacity of the cut (S, V \ S) is
X
C(S) = Cij ,
(i,j)∈(S×(V \S))∩E

ie. the combined capacities of all edges from S to V \ S.


P
• For any feasible flow vector x and X, Y ⊆ V , we define fx (X, Y ) = (i,j)∈(X×Y )∩E xij ,
ie. the overall amount of flow from X to Y .

• For any feasible flow vector x, we call a path v0 , · · · , vk an augmenting path if


the flow along the path can be increased. Formally, it is a path that satisfies
xvi−1 vi < Cvi−1 vi or xvi vi−1 > 0, where Cij is the capacity of the edge (i, j).

Suppose we have a network (V, E) with a single source 1 and a single sink n (but
how much stuff comes out of the source or into the sink are not fixed). There is no
costs in transportation, but each edge (i, j) has a capacity mij = Cij . We assume for
convenience that mij = 0 for all (i, j) ∈ E. We want to transport as much stuff from
1 to n as possible. We can write the problem as

maximize δ subject to

X X δ
 i=1
xij − xji = −δ i=n for each i

j:(i,j)∈E j(j,i)∈E 
0 otherwise

0 ≤ xij ≤ Cij for each (i, j) ∈ E.

Here δ is the total flow from 1 to n.

In fact we can turn this into a minimum-cost flow problem. We add an edge from
n to 1 with −1 cost and infinite capacity and let flow be conserved in this network
(circulation problem). Then the minimal cost flow will maximize the flow from on
(n, 1) hence maximise the flow from 1 to n through the network. However we will see
that there is an easier ways to solve the problem.
100 CHAPTER 3. OPTIMIZATION

L. 3-46
If x is a feasible flow vector that sends δ units from 1 to n, then for any cut S ⊆ V
with 1 ∈ S and n ∈ V \ S, we have δ = fx (S, V \ S) − fx (V \ S, S) ≤ C(S).

 
X X X
δ=  xij − xji  = fx (S, V ) − fx (V, S)
i∈S j:(i,j)∈E j:(j,i)∈E

= fx (S, S) + fx (S, V \ S) − fx (V \ S, S) − fx (S, S) = fx (S, V \ S) − fx (V \ S, S)


≤ fx (S, V \ S) ≤ C(S)

This says that the flow δ from 1 to n is bounded above by any capacity of any
cut S with 1 ∈ S and n ∈ V \ S, which is obviously true. In fact by the below
theorem, this bound is tight, ie. there is always a cut S such that δ = C(S).
T. 3-47
<Max-flow min-cut theorem> Let δ be an optimal solution for the maximum
flow problem, then δ = min{C(S) : S ⊆ V, 1 ∈ S, n ∈ V \ S}.

Suppose x is optimal, let

S = {1} ∪ {i ∈ V : there exists an augmenting path from 1 to i}.

We have n ∉ S by optimality, so n ∈ V \ S. We have previously shown that


δ = fx (S, V \ S) − fx (V \ S, S). We must have fx (V \ S, S) = 0: if xij > 0 for some
(i, j) ∈ E ∩ ((V \ S) × S), then an augmenting path from 1 to j (which exists since
j ∈ S) could be extended along the edge (i, j) to an augmenting path from 1 to i,
so i could not be in V \ S. Also, we must have fx (S, V \ S) = C(S), since
xij = Cij for every (i, j) ∈ E ∩ (S × (V \ S)) for a similar reason. So we have
δ = C(S). Since δ ≤ C(S′) for any S′ ⊆ V with 1 ∈ S′ and n ∈ V \ S′, we have
δ = min{C(S) : S ⊆ V, 1 ∈ S, n ∈ V \ S}.
The max-flow min-cut theorem provides a quick way to confirm that our path is
optimal. We just have to show that our flow δ equals C(S) for some S ⊆ V with
1 ∈ S and n ∈ V \ S.
C. 3-48
<Ford-Fulkerson algorithm> To find an optimal solution for the maximum
flow problem, we simply keep adding flow along augmenting paths until we cannot
do so:
1. Start from a feasible flow x, eg. x = 0.
2. If there is no augmenting path for x from 1 to n, then x is optimal.
3. Find an augmenting path for x from 1 to n, and send a maximum amount of
flow along it. Go to step 2.
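Below is a minimal Python sketch of the algorithm, assuming the network is given as a dictionary of capacities; choosing the augmenting path by breadth-first search is one concrete choice (the Edmonds–Karp variant), and all names are illustrative.

    from collections import deque

    def max_flow(capacity, source, sink):
        # capacity: dict {(u, v): C_uv}.  Augmenting paths are found by BFS in the
        # residual graph: forward edges with x_uv < C_uv, backward edges with x_vu > 0.
        flow = {e: 0 for e in capacity}
        nodes = {u for e in capacity for u in e}

        def residual(u, v):
            return capacity.get((u, v), 0) - flow.get((u, v), 0) + flow.get((v, u), 0)

        while True:
            parent, queue = {source: None}, deque([source])
            while queue and sink not in parent:
                u = queue.popleft()
                for v in nodes:
                    if v not in parent and residual(u, v) > 0:
                        parent[v] = u
                        queue.append(v)
            if sink not in parent:                    # no augmenting path: optimal
                break
            path, v = [], sink                        # recover the path ...
            while parent[v] is not None:
                path.append((parent[v], v)); v = parent[v]
            delta = min(residual(u, v) for (u, v) in path)
            for (u, v) in path:                       # ... and push delta along it
                back = min(flow.get((v, u), 0), delta)
                if back:
                    flow[(v, u)] -= back              # cancel backward flow first
                if delta > back:
                    flow[(u, v)] += delta - back      # then use forward capacity
        return sum(flow[e] for e in capacity if e[0] == source) - \
               sum(flow[e] for e in capacity if e[1] == source)

    # Example: max_flow({(1, 2): 5, (2, 'n'): 5, (1, 'n'): 1}, 1, 'n') returns 6.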

E. 3-49
Consider the diagram on the right with capacities as labelled. [Network diagram omitted.]
We can keep adding flow until we reach the diagram on the right, where gray numbers
are flows and black are capacities. [Diagram omitted.] We know this is an optimum,
since our total flow is 6, and we can draw a cut with capacity 6 (dashed line).

3.4.4 Bipartite Matching Problem


A matching of a graph (V, E) is a set of edges no two of which share a vertex, ie. a set
M ⊆ E such that any two distinct edges (s, t), (u, v) ∈ M satisfy {s, t} ∩ {u, v} = ∅. A matching M is called perfect
if it covers every vertex, i.e., if |M | = |V |/2. A graph is called k-regular if every vertex
has degree k. We can show using flows that every k-regular bipartite-graph, for k > 1,
has a perfect matching:
First note that in the Ford-Fulkerson algorithm, if all capacities are integral (ie. inte-
gers) and if we start from an integral (ie. integer components) flow vector, then the
algorithm maintains integrality and increases the overall amount of flow by at least one
unit in each iteration. The algorithm is therefore guaranteed to find a maximum flow
after a finite number of iterations. (Clearly, the latter also holds when all capacities
are rational, since each time the flow increases by at least 1/lcm(denominators of all the
capacities).)
Now consider a k-regular bipartite graph (L ∪ R, E), where the edges go from L to
R, and add two new vertices s and t and new edges (s, i) and (j, t) for every i ∈ L
and j ∈ R. Finally set the capacity of every new edge to 1, and that of every original
edge to infinity. We can now send |L| units of flow from s to t by setting the flow to
1 for every new edge and to 1/k for every original edge. If we start with the integral
solution x = 0, the Ford-Fulkerson algorithm is guaranteed to find an integral solution
with at least the same value, and it is easy to see that such a solution corresponds to
a perfect matching.
This result is a special case of a well-known characterization of the bipartite graphs that
have a perfect matching. Hall’s theorem states that a bipartite graph G = (L ∪ R, E)
with |L| = |R| has a perfect matching iff |N (X)| ≥ |X| for every X ⊆ L, where
N (X) = {j ∈ R : (i, j) ∈ E for some i ∈ X}.
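As a small illustration of Hall's condition, here is a brute-force Python check (exponential in |L|, so only sensible for tiny examples; the names are illustrative and this is not part of the notes).

    from itertools import combinations

    def hall_condition_holds(L, E):
        # Check |N(X)| >= |X| for every non-empty X ⊆ L,
        # where N(X) = {j : (i, j) in E for some i in X}.
        for k in range(1, len(L) + 1):
            for X in combinations(L, k):
                if len({j for (i, j) in E if i in X}) < k:
                    return False
        return True

    # A 2-regular bipartite graph (a 6-cycle) on L = {0, 1, 2}, R = {3, 4, 5}:
    E = {(0, 3), (0, 4), (1, 4), (1, 5), (2, 5), (2, 3)}
    print(hall_condition_holds({0, 1, 2}, E))    # True, so a perfect matching exists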
CHAPTER 4

Linear Algebra

4.1 Vector spaces


D. 4-1
• Let F be a field (usually R or C). An F- vector space is an (additive) abelian
group V together with a function F × V → V , written (λ, v) 7→ λv, such that
1. λ(µv) = (λµ)v for all λ, µ ∈ F, v ∈ V (associativity)
2. λ(u + v) = λu + λv for all λ ∈ F, u, v ∈ V (distributivity in V )
3. (λ + µ)v = λv + µv for all λ, µ ∈ F, v ∈ V (distributivity in F)
4. 1v = v for all v ∈ V (identity)
We always write 0 for the additive identity in V , and call this the identity. By
abuse of notation, we also write 0 for the trivial vector space {0}.
• If V is an F-vector space, then we say U ⊆ V is an (F-linear) subspace if
1. u, v ∈ U implies u + v ∈ U .
2. u ∈ U, λ ∈ F implies λu ∈ U .
3. 0 ∈ U .
These conditions can be expressed more concisely as “U is non-empty and if λ, µ ∈
F, u, v ∈ U , then λu + µv ∈ U ”. Alternatively, U is a subspace of V if it is itself a
vector space, inheriting the operations from V . We write U ≤ V if U is a subspace
of V .
• Let U, W be subspaces of an F vector space V . The sum of U and W is U + W =
{u + w : u ∈ U, w ∈ W }.
E. 4-2
Intuitively, a vector space V over a field F (or an F-vector space) is a space such
that
• We can add two vectors v1 , v2 ∈ V to obtain v1 + v2 ∈ V .
• We can multiply a scalar λ ∈ F with a vector v ∈ V to obtain λv ∈ V .
................................................................................
1. Rn = {column vectors of length n with coefficients in R} with the usual addi-
tion and scalar multiplication is a vector space.
An m × n matrix A with coefficients in R can be viewed as a linear map from
Rn to Rm via v 7→ Av. In fact under componentwise addition and scalar
multiplication, the set of m × n matrices with coefficients in R is also a vector
space.


2. We can have vector spaces consisting of functions. We define addition on functions
as (f + g)(x) = f (x) + g(x) and scalar multiplication of functions as (λf )(x) =
λf (x). Then under this addition and scalar multiplication the following are
vector spaces:
def
• RX = {f : X → R} where X is any given set (ie. RX consist of all functions
def
from X to R). More generally, V X = {f : X → V } where V is a vector
space and X a set.
def def
• C([a, b], R) = {f ∈ R[a,b] : f is continuous} and C ∞ ([a, b], R) = {f ∈ R[a,b] :
f is infinitely differentiable} where [a, b] ⊆ R is a closed interval.
E. 4-3
• {(x1 , x2 , x3 ) ∈ R3 : x1 + x2 + x3 = t} is a subspace of R3 iff t = 0.
• Let X be a set. We define the support of f in FX by supp(f ) = {x ∈ X : f (x) 6= 0},
then the set of functions with finite support forms a vector subspace. This is since
supp(f +g) ⊆ supp(f )∪supp(g), supp(λf ) = supp(f ) (for λ 6= 0) and supp(0) = ∅.
P. 4-4
For any v in a vector space, 0v = 0 and (−1)v = −v, where −v is the additive
inverse of v.
0v + 0v = (0 + 0)v = 0v, so adding the additive inverse of 0v to both sides gives 0v = 0.
v + (−1)v = (1 + (−1))v = 0v = 0, so (−1)v is the additive inverse of v.
P. 4-5
Let U, W be subspaces of vector space V . Then U + W and U ∩ W are subspaces.

Let ui + wi ∈ U + W , λ, µ ∈ F. Then

λ(u1 + w1 ) + µ(u2 + w2 ) = (λu1 + µu2 ) + (λw1 + µw2 ) ∈ U + W.

Similarly, if vi ∈ U ∩ W , then λv1 + µv2 ∈ U and λv1 + µv2 ∈ W . So λv1 + µv2 ∈


U ∩ W . Both U ∩ W and U + W contain 0, so are non-empty.
Note that taking the union will in general not produce a subspace: for example in R2 the union of the two coordinate axes is not closed under addition, since (1, 0) + (0, 1) = (1, 1) lies on neither axis.
C. 4-6
<Quotient space> Let V be a F-vector space, and U ⊆ V a subspace. Then the
quotient group V /U can be made into a F-vector space called the quotient space ,
where scalar multiplication is given by (λ, v + U ) 7→ (λv) + U .
This is well defined since if v + U = w + U ∈ V /U , then v − w ∈ U . Hence for
λ ∈ F, we have λv − λw ∈ U since U is a subspace, and so λv + U = λw + U . The
axioms of a vector space can be checked: for example the first condition is satisfied
since for v + U ∈ V /U and λ, µ ∈ F we have

λ(µ(v + U )) = λ(µv + U ) = λ(µv) + U = (λµ)v + U = (λµ)(v + U )

D. 4-7
Let V be a vector space over F and S ⊆ V . The span of S is defined as
    hSi = span S = { Σ_{i=1}^n λi si : λi ∈ F, si ∈ S, n ≥ 0 }

• We say S spans V if hSi = V .


• We say S is linearly independent if given λi ∈ F and distinct s1 , · · · , sn ∈ S,
we must have λi = 0 for all i whenever Σ_{i=1}^n λi si = 0. If S is not linearly
independent, we say it is linearly dependent .
• We say S is a basis for V if S is linearly independent and spans V .
We say a vector space is finite dimensional if there is a finite basis.
E. 4-8
Note that the sums in the definition of hSi are finite sums. We will not play with
infinite sums. Note also that in fact hSi is the smallest subspace of V containing
S. hSi can be seen as the subspace generated by S.
Note that no linearly independent set can contain 0, as 1 · 0 = 0. We also have
(by convention) h∅i = {0} and ∅ is a basis for this space.
................................................................................
• Let S ⊆ V = R3 where S = {(1, 0, 0)T , (0, 1, 1)T , (1, 2, 2)T }. Then

    hSi = {(a, b, b)T : a, b ∈ R}.

Note that any subset of S of order 2 has the same span as S. Also S is linearly
dependent since

    1 · (1, 0, 0)T + 2 · (0, 1, 1)T + (−1) · (1, 2, 2)T = 0.

S also does not span V since (0, 0, 1)T ∉ hSi.
• Let X be a set and x ∈ X. Define the function δx : X → F by δx (y) = 1 if y = x
and δx (y) = 0 if y ≠ x. Then hδx : x ∈ Xi is the set of all functions with finite support.


L. 4-9
S ⊆ V is linearly dependent if and only if there are distinct s0 , · · · , sn ∈ S and
λ1 , · · · , λn ∈ F such that Σ_{i=1}^n λi si = s0 .

(Forward) If S is linearly dependent, then there are some λ1 , · · · , λn ∈ F all non-zero
and distinct s1 , · · · , sn ∈ S such that Σ_{i=1}^n λi si = 0. Then s1 = Σ_{i=2}^n (−λi /λ1 ) si .
(Backward) If s0 = Σ_{i=1}^n λi si , then (−1)s0 + Σ_{i=1}^n λi si = 0 is a non-trivial relation. So S is linearly dependent.
P. 4-10
Let S = {e1 , · · · , en } be a subset of V over F. Then S is a basis iff every v ∈ V can
be written uniquely as a finite linear combination of elements in S, ie. uniquely
as v = Σ_{i=1}^n λi ei with λi ∈ F.

S spanning V is defined exactly to mean that every item v ∈ V can be written as


a finite linear combination in at least one way.
Now suppose that S is linearly independent, and v = Σ_{i=1}^n λi ei = Σ_{i=1}^n µi ei .
Then we have 0 = v − v = Σ_{i=1}^n (λi − µi )ei . Linear independence implies that
λi − µi = 0 for all i. Hence λi = µi . So v is expressed in a unique way.
On the other hand, if S is not linearly independent, then we have 0 = Σ_{i=1}^n λi ei
where λi ≠ 0 for some i. But we also know that 0 = Σ_{i=1}^n 0 · ei . So there are two
ways to write 0 as a linear combination.
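Numerically, finding the unique coefficients amounts to solving a linear system. A small illustration over R, assuming numpy (the basis below is just an example):

    import numpy as np

    # The columns of B form a basis of R^3; the coordinates of v in this basis
    # are the unique solution lam of B @ lam = v.
    B = np.array([[1., 0., 1.],
                  [0., 1., 2.],
                  [1., 1., 0.]])
    v = np.array([2., 3., 1.])
    lam = np.linalg.solve(B, v)       # unique since the columns are a basis
    print(np.allclose(B @ lam, v))    # True: B @ lam recovers v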
T. 4-11
<Steinitz exchange lemma> Let V be an vector space over F, and S =
{e1 , · · · , en } a finite linearly independent subset of V , and T a spanning subset of
V . Then there is some T ′ ⊆ T of order n such that (T \ T ′ ) ∪ S still spans V . In
particular, |T | ≥ n.

Suppose that we have already found Tr′ ⊆ T of order 0 ≤ r < n such that
Tr = (T \ Tr′ ) ∪ {e1 , · · · , er } spans V . Note that the case r = 0 is true, since we
can take Tr′ = ∅; and the case r = n is the theorem which we want to achieve.
Suppose we have these. Since Tr spans V , we can write
    er+1 = Σ_{i=1}^k λi ti ,    λi ∈ F, ti ∈ Tr .

We know that the ei are linearly independent, so not all ti ’s are ei ’s. So there is
some j such that tj ∈ (T \ Tr′ ) with λj ≠ 0 (otherwise er+1 would lie in he1 , · · · , er i,
contradicting linear independence). We can write this as

    tj = (1/λj ) er+1 + Σ_{i≠j} (−λi /λj ) ti .

We let Tr+1′ = Tr′ ∪ {tj }, of order r + 1, and

    Tr+1 = (T \ Tr+1′ ) ∪ {e1 , · · · , er+1 } = (Tr \ {tj }) ∪ {er+1 }.

Since tj is in the span of (Tr \ {tj }) ∪ {er+1 }, we have tj ∈ hTr+1 i. So V ⊇


hTr+1 i ⊇ hTr i = V . So hTr+1 i = V . Hence we can inductively find Tn .
This says if T is spanning and S is independent, there is a way of grabbing |S|
many elements away from T and replace them with S, and the result will still be
spanning. In some sense, the consequence |T | ≥ n is the most important part. It
tells us that we cannot have an independent set larger than a spanning set, and
most of our corollaries later will only use this remark.
When |T | < ∞, this theorem says: Let {e1 , · · · , en } be a linearly independent
subset of V , and suppose {f1 , · · · , fm } spans V . Then there is a re-ordering of the
{fi } such that {e1 , · · · , en , fn+1 , · · · , fm } spans V .
T. 4-12
Suppose V is a vector space over F with a basis S of order n. Then
1. Every basis of V has order n.
2. Any linearly independent set of order n is a basis.
3. Every spanning set of order n is a basis.

4. Every finite spanning set contains a basis.


5. Every linearly independent subset of V can be extended to a basis.

Let S = {e1 , · · · , en } be the basis for V .


1. Suppose T is another basis. Since S is independent and T is spanning, |T | ≥
|S|. The other direction is less trivial, since |T | might be infinite, and Steinitz
does not immediately apply. Instead, we argue as follows: since T is linearly
independent, every finite subset of T is independent. Also, S is spanning. So
every finite subset of T has order at most |S|. So |T | ≤ |S|. So |T | = |S|.
2. Suppose now that T is a linearly independent subset of order n, but hT i ≠ V .
Then there is some v ∈ V \ hT i. We now show that T ∪ {v} is independent.
Indeed, if λ0 v + Σ_{i=1}^m λi ti = 0 with λi ∈ F, t1 , · · · , tm ∈ T distinct, then λ0 v =
Σ_{i=1}^m (−λi )ti . Then λ0 v ∈ hT i. So λ0 = 0. As T is linearly independent, we
have λ0 = · · · = λm = 0. So T ∪ {v} is a linearly independent subset of size
> n. This is a contradiction since S is a spanning set of size n.
3. Let T be a spanning set of order n. If T were linearly dependent, then there
is some t0 , · · · , tm ∈ T distinct and λ1 , · · · , λm ∈ F such that t0 = Σ λi ti . So
t0 ∈ hT \ {t0 }i, ie. hT \ {t0 }i = V . So T \ {t0 } is a spanning set of order n − 1,
which is a contradiction.
4. Suppose T is any finite spanning set. Let T ′ ⊆ T be a spanning set of least
possible size. This exists because T is finite. If T ′ has size n, then we are done by
3. Otherwise by the Steinitz exchange lemma, it has size |T ′ | > n. So T ′ must
be linearly dependent because S is spanning. So there is some t0 , · · · , tm ∈ T ′
distinct and λ1 , · · · , λm ∈ F such that t0 = Σ λi ti . Then T ′ \ {t0 } is a smaller
spanning set. Contradiction.
5. Suppose T is a linearly independent set. Since S spans, there is some S ′ ⊆ S
of order |T | such that (S \ S ′ ) ∪ T spans V by the Steinitz exchange lemma.
So by 3, (S \ S ′ ) ∪ T is a basis of V containing T .
D. 4-13
If V is a vector space over F with a finite basis S, then the dimension of V is
dim V = dimF V = |S|.
E. 4-14
By the corollary, dim V does not depend on the choice of S. However, it does
depend on F. For example, dimC C = 1 (since {1} is a basis), but dimR C = 2
(since {1, i} is a basis).
In fact we could define the dimension of an infinite-dimensional space to be the
cardinality of any basis for V . We have not proven enough to see that this would
be well-defined but in fact there are no problems.
L. 4-15
If V is a finite dimensional vector space over F and U ⊆ V is a proper subspace
(ie. such that U ≠ V ), then U is finite dimensional and dim U < dim V .

Every linearly independent subset of V has size at most dim V . So let S ⊆ U be a


linearly independent subset of largest size. We want to show that S spans U and
|S| < dim V .

If v ∈ V \ hSi, then S ∪ {v} is linearly independent. So v ∉ U by maximality of


S. This means that hSi = U .
Since U 6= V , there is some v ∈ V \U = V \hSi. So S∪{v} is a linearly independent
subset of order |S| + 1. So |S| + 1 ≤ dim V . So dim U = |S| < dim V .
P. 4-16
If U, W are subspaces of a finite dimensional vector space V , then

dim(U + W ) = dim U + dim W − dim(U ∩ W ).

Let R = {v1 , · · · , vr } be a basis for U ∩ W . This is a linearly independent subset


of U . So we can extend it to be a basis of U by S = {v1 , · · · , vr , ur+1 , · · · , us }.
Similarly, for W , we can obtain a basis T = {v1 , · · · , vr , wr+1 , · · · , wt }. We want
to show that dim(U + W ) = |S| + |T | − |R|. It is sufficient to prove that S ∪ T is
a basis for U + W .
We first show it’s spanning. Suppose u + w ∈ U + W with u ∈ U, w ∈ W . Then
u ∈ hSi and w ∈ hT i, so u + w ∈ hS ∪ T i. Also clearly hS ∪ T i ⊆ U + W . So
U + W = hS ∪ T i. To show linear independence, suppose we have a linear relation
    Σ_{i=1}^r λi vi + Σ_{j=r+1}^s µj uj + Σ_{k=r+1}^t νk wk = 0
    =⇒ Σ λi vi + Σ µj uj = − Σ νk wk .
Since LHS ∈ U and RHS ∈ W , they both lie in U ∩ W . Since S is a basis of U ,
there is only one way of writing the LHS as a sum of vi and uj . However, since
R is a basis of U ∩ W , we can write the LHS just as a sum of vi ’s. So we must
have µj = 0 for all j. Then we have Σ λi vi + Σ νk wk = 0, but T is linearly
independent, so λi = νk = 0 for all i, k. Hence S ∪ T is linearly independent.
This proof shows that we can prove things by choosing the “right” basis.
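A quick numerical illustration of this formula over R, assuming numpy: dim(U + W ) is the rank of the matrix whose columns are bases of U and W placed side by side.

    import numpy as np

    # U = span{e1, e2} and W = span{e2, e3} inside R^4, given by columns.
    BU = np.array([[1., 0.], [0., 1.], [0., 0.], [0., 0.]])
    BW = np.array([[0., 0.], [1., 0.], [0., 1.], [0., 0.]])
    dim_U, dim_W = np.linalg.matrix_rank(BU), np.linalg.matrix_rank(BW)
    dim_sum = np.linalg.matrix_rank(np.hstack([BU, BW]))   # dim(U + W)
    dim_int = dim_U + dim_W - dim_sum                      # dim(U ∩ W) by the formula
    print(dim_U, dim_W, dim_sum, dim_int)                  # 2 2 3 1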
P. 4-17
If V is a finite dimensional vector space over F and U ⊆ V is a subspace, then

dim V = dim U + dim V /U.

Let {u1 , · · · , um } be a basis for U and extend this to a basis {u1 , · · · , um , vm+1 , · · · , vn }
for V . We want to show that S = {vm+1 + U, · · · , vn + U } is a basis for V /U .
Suppose v + U ∈ V /U . Since we can write v = Σ λi ui + Σ µi vi , we have

    v + U = Σ λi (ui + U ) + Σ µi (vi + U ) = Σ µi (vi + U ).

(recall in a more familiar Part IA notation, (ab)H = (aH)(bH). Also from [C.4-6]
we know λ(aH) = (λa)H.) So S spans V /U .
To show that they are linearly independent, suppose that Σ λi (vi + U ) = 0 + U =
U . This requires Σ λi vi ∈ U . Then we can write this as a linear combination
of the ui ’s, ie. Σ λi vi = Σ µj uj for some µj . Since {u1 , · · · , um , vm+1 , · · · , vn }
is a basis for V , we must have λi = µj = 0 for all i, j. So {vi + U } is linearly
independent.
We can view this as a linear algebra version of Lagrange’s theorem. This com-
bined with the first isomorphism theorem for vector spaces gives the rank-nullity

theorem. This is because if A is a linear map on V with nullity U , then the first iso-
morphism theorem says V /U ≅ Im A, so dim V = dim U + dim Im A = n(A) + r(A).
Note that this result also implies [L.4-15] since if U is proper, then dim(V /U ) > 0,
so dim U = dim V − dim(V /U ) < dim V .
D. 4-18
• Suppose V is a vector space over F and U, W subspaces of V . We say that V is
the (internal) direct sum of U and W if U + W = V and U ∩ W = 0. We write
V = U ⊕ W.
Equivalently, this requires that every v ∈ V can be written uniquely as u + w with
u ∈ U, w ∈ W . We say that U and W are complementary subspaces of V .
• If U1 , · · · , Un ⊆ V are subspaces of V , then V is the (internal) direct sum

    V = U1 ⊕ · · · ⊕ Un = ⊕_{i=1}^n Ui

if every v ∈ V can be written uniquely as v = Σ ui with ui ∈ Ui . This can be
extended to an infinite sum with the same definition, but the sum v = Σ ui still
has to be finite.
• If U, W are vector spaces over F, the (external) direct sum is

U ⊕ W = {(u, w) : u ∈ U, w ∈ W },

with addition and scalar multiplication componentwise: (u1 , w1 ) + (u2 , w2 ) =


(u1 + u2 , w1 + w2 ) and λ(u, w) = (λu, λw).
• If U1 , · · · , Un are vector spaces over F, the external direct sum is

    U1 ⊕ · · · ⊕ Un = ⊕_{i=1}^n Ui = {(u1 , · · · , un ) : ui ∈ Ui },

with pointwise operations. This can be made into an infinite sum if we require
that all but finitely many of the ui have to be zero.
E. 4-19
The difference between internal and external direct sum is that the first is decom-
posing V into smaller spaces, while the second is building a bigger space based on
two spaces.
Note, however, that the external direct sum U ⊕ W is the internal direct sum
of U and W viewed as subspaces of U ⊕ W , ie. as the internal direct sum of
{(u, 0) : u ∈ U } and {(0, v) : v ∈ V }. So these two are indeed compatible
notions, and this is why we give them the same name and notation.
E. 4-20
Let V = R2 , and U = h(0, 1)T i. Then h(1, 1)T i and h(1, 0)T i are both complementary
subspaces to U in V .

4.2 Linear maps



D. 4-21
• Let U, V be vector spaces over F. Then α : U → V is a linear map if
1. α(u1 + u2 ) = α(u1 ) + α(u2 ) for all ui ∈ U .
2. α(λu) = λα(u) for all λ ∈ F, u ∈ U .
We write L(U, V ) for the set of linear maps U → V .
• We say a linear map α : U → V is an isomorphism if there is some β : V → U
(also linear) such that α ◦ β = idV and β ◦ α = idU . If there exists an isomorphism
U → V , we say U and V are isomorphic , and write U ∼ =V.
• Let α : U → V be a linear map. Then the image of α is Im α = {α(u) : u ∈ U }.
The kernel of α is ker α = {u ∈ U : α(u) = 0}.
E. 4-22
• Note that we can combine the two requirements to the single requirement that
α(λu1 + µu2 ) = λα(u1 ) + µα(u2 ).
• It is easy to see that if α is linear, then it is a group homomorphism (if we view
vector spaces as groups). In particular, α(0) = 0.
• If we want to stress the field F, we say that α is F-linear. For example, complex
conjugation is a map C → C that is R-linear but not C-linear (since (iz)∗ = −iz ∗ ≠ iz ∗ ).
E. 4-23
• Let A be an n × m matrix with coefficients in F. We will write A ∈ Mn,m (F).
Then α : Fm → Fn defined by v 7→ Av is linear:

    α(λu + µv)i = Σ_{j=1}^m Aij (λu + µv)j = λ Σ_{j=1}^m Aij uj + µ Σ_{j=1}^m Aij vj = λα(u)i + µα(v)i .

• Let X be a set and g ∈ FX . Then mg : FX → FX defined by mg (f )(x) = g(x)f (x)


is linear.
• Integration I : C([a, b], R) → C([a, b], R) defined by f 7→ ∫_a^x f (t) dt is linear.
• Differentiation D : C ∞ ([a, b], R) → C ∞ ([a, b], R) defined by f 7→ f ′ is linear.
• If α, β ∈ L(U, V ), then α + β defined by (α + β)(u) = α(u) + β(u) ∈ L(U, V ).
Also, if λ ∈ F, then λα defined by (λα)(u) = λ(α(u)) ∈ L(U, V ). So L(U, V ) is a
vector space over F.
• Composition of linear maps is linear. Using this, we can show that many things
are linear, like differentiating twice, or adding and then multiplying linear maps.
L. 4-24
Let U and V are vector spaces over F. Then a linear map α : U → V is an
isomorphism iff α is a bijective linear map.

(Forward) If α is an isomorphism, then it is clearly bijective since it has an inverse


function.
(Backward) Suppose α is a linear bijection. Then as a function, it has an inverse
β : V → U . We want to show that this is linear. Let v1 , v2 ∈ V , λ, µ ∈ F. Then
αβ(λv1 + µv2 ) = λv1 + µv2 = λαβ(v1 ) + µαβ(v2 ) = α(λβ(v1 ) + µβ(v2 )).

Since α is injective, we have β(λv1 + µv2 ) = λβ(v1 ) + µβ(v2 ). So β is linear.


Note that a linear map is automatically a homomorphism, so a bijective linear
map is a bijective homomorphism.
E. 4-25
• Note that for any linear map α, Im α and ker α are subspaces of V and U respec-
tively.
• Let A ∈ Mm,n (F) and α : Fn → Fm be the linear map v 7→ Av. Then the system of
linear equations Σ_{j=1}^n Aij xj = bi (1 ≤ i ≤ m) has a solution iff (b1 , · · · , bm ) ∈ Im α.
The kernel of α consists of all solutions to Σ_j Aij xj = 0.
• Let β : C ∞ (R, R) → C ∞ (R, R) that sends

β(f )(t) = f 00 (t) + p(t)f 0 (t) + q(t)f (t).

for some p, q ∈ C ∞ (R, R). Then if y(t) ∈ Im β, then there is a solution (in
C ∞ (R, R)) to the differential equation f 00 (t) + p(t)f 0 (t) + q(t)f (t) = y(t). Simi-
larly, ker β contains the solutions to the homogeneous equation f 00 (t) + p(t)f 0 (t) +
q(t)f (t) = 0.
P. 4-26
Let α : U → V be an F-linear map.
i. If α is injective and S ⊆ U is linearly independent, then α(S) is linearly
independent in V .
ii. If α is surjective and S ⊆ U spans U , then α(S) spans V .
iii. If α is an isomorphism and S ⊆ U is a basis, then α(S) is a basis for V .
In particular, if U and V are finite-dimensional vector spaces over F and
α : U → V is an isomorphism, then dim U = dim V .

i. Suppose that α is injective and α(S) is linearly dependent. So there are
s0 , · · · , sn ∈ S distinct and non-zero λ1 , · · · , λn ∈ F such that

    α(s0 ) = Σ_{i=1}^n λi α(si ) = α( Σ_{i=1}^n λi si ).

Since α is injective, we must have s0 = Σ_{i=1}^n λi si . So S is linearly dependent.
ii. Suppose α is surjective and S spans U . Pick v ∈ V . Then there is some
u ∈ U such that α(u) = v. Since S spans U , there is some s1 , · · · , sn ∈ S
and λ1 , · · · , λn ∈ F such that u = Σ_{i=1}^n λi si . So v = α(u) = Σ_{i=1}^n λi α(si ).
Hence α(S) spans V .
iii. First part follows from (i) and (ii). Let S be a basis for U . Then α(S) is a
basis for V . Since α is injective, |S| = |α(S)|.
P. 4-27
Suppose V is a F-vector space of dimension n < ∞. Then writing e1 , · · · , en for
the standard basis of Fn , there is a bijection

Φ : {isomorphisms Fn → V } → {(ordered) bases (v1 , · · · , vn ) for V },



defined by α 7→ (α(e1 ), · · · , α(en )).

We first make sure this is indeed a function — if α is an isomorphism, then


from our previous proposition, we know that it sends a basis to a basis. So
(α(e1 ), · · · , α(en )) is indeed a basis for V .
We now prove injectivity. Suppose α, β : Fn → V are isomorphisms such that
Φ(α) = Φ(β). In other words, α(ei ) = β(ei ) for all i. Then α = β since

    α((x1 , · · · , xn )T ) = α( Σ_{i=1}^n xi ei ) = Σ xi α(ei ) = Σ xi β(ei ) = β((x1 , · · · , xn )T ).

Next to prove surjectivity, suppose that (v1 , · · · , vn ) is an ordered basis for V . Let
α((x1 , · · · , xn )) = Σ xi vi ; we just need to show that α is an isomorphism Fn → V ,
since if it is then by construction Φ(α) = (v1 , · · · , vn ). It is easy to check that α
is well-defined and linear. We also know that α is injective since (v1 , · · · , vn ) is
linearly independent: if Σ xi vi = Σ yi vi , then xi = yi . Also, α is surjective
since (v1 , · · · , vn ) spans V . So α is an isomorphism.
This result shows that if V is any F-vector space of dimension n < ∞, then there
must be an isomorphism Fn → V , so V is isomorphic to Fn . So in fact any two
F-vector space of dimension n < ∞ must be isomorphic. Choosing a basis for an
n-dimensional vector space V corresponds to choosing an identification of V with
Fn .
P. 4-28
Suppose U, V are vector spaces over F and S = {e1 , · · · , en } is a basis for U . Then
every function f : S → V extends uniquely to a linear map U → V .

For uniqueness, suppose α, β : U → V are both linear maps extending f : S → V
(ie. so that α(ei ) = f (ei ) = β(ei ) for all i). If u ∈ U , we can write u = Σ_{i=1}^n ui ei with
ui ∈ F since S spans. Then

    α(u) = α( Σ ui ei ) = Σ ui α(ei ) = Σ ui f (ei ).

Similarly β(u) = Σ ui f (ei ). So α(u) = β(u) for every u. So α = β.
Given any function f : S → V , if u ∈ U we can write u = Σ ui ei in a unique
way. So α : U → V defined by α(u) = Σ ui f (ei ) is well defined, and we can see
that α extends f (since α(ei ) = f (ei )). Also α is linear: let λ, µ ∈ F, u, v ∈ U .
Then

    α(λu + µv) = α( Σ (λui + µvi )ei ) = Σ (λui + µvi )f (ei ) = λ Σ ui f (ei ) + µ Σ vi f (ei ) = λα(u) + µα(v).

This illustrates that to define a linear map, it suffices to define its values on a basis.
In fact this result can also be extended to the infinite-dimensional case. It is not hard
to see that the only subsets of U that satisfy the conclusions of the proposition
are bases: spanning ensures uniqueness and linear independence ensures existence
(well-definedness).

P. 4-29
Let Matn,m (F) be the set of n × m matrices over F. Suppose U and V are finite-
dimensional vector spaces over F with bases (e1 , · · · , em ) and (f1 , · · · , fn ) respec-
tively.

1. There is a bijection f : Matn,m (F) → L(U, V ), sending A to the unique linear
map α such that α(ei ) = Σ_j aji fj .
2. f is an isomorphism and dim L(U, V ) = (dim U )(dim V ).

1. For any A = (aij ) ∈ Matn,m (F), by the above proposition there is a unique
linear map α such that α(ei ) = Σ_j aji fj .

If α is a linear map U → V , then for each 1 ≤ i ≤ m, we can write α(ei )
uniquely as α(ei ) = Σ_{j=1}^n aji fj for some aji ∈ F. This gives a matrix A =
(aij ) ∈ Matn,m (F) such that A 7→ α.

2. Both f (λA + µB) and λf (A) + µf (B) correspond to the linear map λα + µβ
such that (λα + µβ)(ei ) = Σ_j (λaji + µbji )fj , so f is a bijective linear map, hence
it’s an isomorphism. Now Matn,m (F) has dimension m × n, so dim L(U, V ) =
dim Matn,m (F) = m × n = (dim U )(dim V ).

We can interpret this as follows: the ith column of A tells us how to write α(ei )
in terms of the fj .

We can also draw a fancy diagram to display this result. Given bases e1 , · · · , em ,
by [P.4-27] we get an isomorphism s(ei ) : U → Fm ; similarly, we get an isomorphism
s(fi ) : V → Fn . Since a matrix is a linear map A : Fm → Fn , given a matrix
A we can produce a linear map α : U → V via the composition

    α = s(fi )−1 ◦ A ◦ s(ei ) : U → Fm → Fn → V.

We can put this into a commutative square. The proposition tells us that every A
gives rise to an α, and every α corresponds to an A that fits into this diagram.
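As a concrete illustration of this correspondence (not from the notes), here is the matrix representing differentiation on polynomials of degree at most 3 over R with respect to the basis (1, x, x2 , x3 ):

    # D(1) = 0, D(x) = 1, D(x^2) = 2x, D(x^3) = 3x^2, so column i of A holds the
    # coordinates of the image of the i-th basis vector:
    A = [[0, 1, 0, 0],
         [0, 0, 2, 0],
         [0, 0, 0, 3],
         [0, 0, 0, 0]]
    p = [2, 1, 0, 4]                 # p(x) = 2 + x + 4x^3 in coordinates
    Dp = [sum(A[i][j] * p[j] for j in range(4)) for i in range(4)]
    print(Dp)                        # [1, 0, 12, 0], ie. p'(x) = 1 + 12x^2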

D. 4-30
• We call the matrix corresponding to a linear map α ∈ L(U, V ) under the [P.4-29]
the matrix representing α with respect to the bases (e1 , · · · , em ) and (f1 , · · · , fn ).

• Let α : U → V be a linear map between vector spaces over F and U be finite-


dimensional. The rank of α is the number r(α) = dim(Im α), and the the nullity
of α is the number n(α) = dim(ker α).

P. 4-31
Suppose U, V, W are finite-dimensional vector spaces over F with bases R =
(u1 , · · · , ur ), S = (v1 , · · · , vs ) and T = (w1 , · · · , wt ) respectively. If α : U → V and
β : V → W are linear maps represented by A and B respectively (with respect to
R, S and T ), then βα is linear and represented by BA with respect to R and T .

Verifying βα is linear is straightforward. Next we write βα(ui ) as a linear combi-


nation of w1 , · · · , wt :

    βα(ui ) = β( Σ_k Aki vk ) = Σ_k Aki β(vk ) = Σ_k Aki Σ_j Bjk wj = Σ_j ( Σ_k Bjk Aki ) wj = Σ_j (BA)ji wj .

(As before, this fits into a commutative diagram relating U → V → W and Fr → Fs → Ft .)

T. 4-32
<First isomorphism theorem> Let α : U → V be a linear map. Then
ker α and Im α are subspaces of U and V respectively. Moreover, α induces an
isomorphism ᾱ : U/ ker α → Im α with ᾱ(u + ker α) = α(u).

We know that 0 ∈ ker α and 0 ∈ Im α. Suppose u1 , u2 ∈ ker α and λ1 , λ2 ∈ F,


then α(λ1 u1 + λ2 u2 ) = λ1 α(u1 ) + λ2 α(u2 ) = 0. So λ1 u1 + λ2 u2 ∈ ker α. So
ker α is a subspace. Similarly, if α(u1 ), α(u2 ) ∈ Im α, then λα(u1 ) + λ2 α(u2 ) =
α(λ1 u1 + λ2 u2 ) ∈ Im α. So Im α is a subspace.
Now note that ᾱ(u + ker α) = ᾱ(v + ker α) ⇔ α(u) = α(v) ⇔ u − v ∈ ker α ⇔
u + ker α = v + ker α. This shows that ᾱ is well-defined and injective. Clearly by
how ᾱ is defined, it is also surjective. Hence ᾱ is bijective. So it remains to show
that ᾱ is a linear map. Indeed, we have

ᾱ(λ(u + ker α) + µ(v + ker α)) = ᾱ((λu + µv) + ker α))


= α(λu + µv) = λα(u) + µα(v) = λ(ᾱ(u + ker α)) + µ(ᾱ(v + ker α)).

Note that if we view a vector space as an abelian group, then this is the first
isomorphism theorem of Part IA Groups, but with a little twist: here we don’t
just have a group and a homomorphism, we also need to take care of scalar
multiplication.
T. 4-33
<Rank-nullity theorem> If α : U → V is a linear map and U is finite-
dimensional, then r(α) + n(α) = dim U .

By the first isomorphism theorem, we know that U/ ker α ≅ Im α. So we have

    dim Im α = dim(U/ ker α) = dim U − dim ker α,

where the last equality is [P.4-17].

Most of the work is hidden in [P.4-17]. In fact conversely the Rank-nullity theorem
also implies [P.4-17]. Below is a direct proof of the rank-nullity theorem that doesn’t
involve quotient spaces, which is also the Part IA proof.
P. 4-34
If α : U → V is a linear map between finite-dimensional vector spaces over F, then
there are bases (e1 , · · · , em ) for U and (f1 , · · · , fn ) for V such that α is represented
by the n × m block matrix ( Ir 0 ; 0 0 ), where r = r(α) and Ir is the r × r identity matrix.
In particular, r(α) + n(α) = dim U = m.

Let ek+1 , · · · , em be a basis for the kernel of α, where k = m − n(α). We can


extend this to a basis (e1 , · · · , em ) of U . Let fi = α(ei ) for 1 ≤ i ≤ k. We now
show that (f1 , · · · , fk ) is a basis for Im α, and it follows that r = k.

We first show that it spans. Suppose v ∈ Im α; then for some λi ∈ F we have
v = α( Σ_{i=1}^m λi ei ) = Σ_{i=1}^m λi α(ei ) = Σ_{i=1}^k λi fi + 0. So v ∈ hf1 , · · · , fk i.

To show linear independence, suppose that Σ_{i=1}^k µi fi = 0. Then α( Σ_{i=1}^k µi ei ) = 0,
so Σ_{i=1}^k µi ei ∈ ker α. Since (ek+1 , · · · , em ) is a basis for ker α, we can write
Σ_{i=1}^k µi ei = Σ_{i=k+1}^m µi ei for some µi (i = k + 1, · · · , m). Since (e1 , · · · , em ) is a
basis, we must have µi = 0 for all i. So they are linearly independent.
Now we extend (f1 , · · · , fr ) to a basis for V , and we have α(ei ) = fi for 1 ≤ i ≤ k
and α(ei ) = 0 for k + 1 ≤ i ≤ m.
By [P.4-29] α is represented by the n × m block matrix ( Ir 0 ; 0 0 ).


Note that in the above we didn’t define n as n(α); they are different. Note also
that each zero in ( Ir 0 ; 0 0 ) can stand for a block of more than one zero.
E. 4-35
Let W = {x ∈ R5 : x1 + x2 + x3 = 0 = x3 − x4 − x5 }. We can guess that dim W is
3 (since we can only freely choose 3 numbers). To prove that, we can consider the
map α : R5 → R2 given by

    (x1 , · · · , x5 )T 7→ (x1 + x2 + x3 , x3 − x4 − x5 )T .

Then ker α = W . So dim W = 5 − r(α). We know that α(1, 0, 0, 0, 0) = (1, 0) and


α(0, 0, 1, 0, 0) = (0, 1). So r(α) = dim Im α = 2. So dim W = 3. More generally,
the rank-nullity theorem says that m linear equations of n variables have a space
of solutions of dimension at least n − m.
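A quick numerical check of this example, assuming numpy:

    import numpy as np

    A = np.array([[1., 1., 1., 0., 0.],      # the map alpha from the example,
                  [0., 0., 1., -1., -1.]])   # written as a 2 x 5 matrix
    r = np.linalg.matrix_rank(A)
    print(r, A.shape[1] - r)                 # rank 2, nullity 3 = dim W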
E. 4-36
Prove [P.4-16]: Suppose that U and W are subspaces of a finite-dimensional vector
spaces V , then dim U + dim W = dim(U + W ) + dim(U ∩ W ).

Let α : U ⊕W → V defined by α(u, w) = u+w, where the ⊕ is the external direct


sum. Then Im α = U + W and ker α = {(u, −u) : u ∈ U ∩ W } ∼ = dim(U ∩ W ).
Then we have

dim U + dim W = dim(U ⊕ W ) = r(α) + n(α) = dim(U + W ) + dim(U ∩ W ).

P. 4-37
Suppose α : U → V is a linear map between vector spaces over F both of dimension
n < ∞. Then
(i)[[ α is injective ]] ⇐⇒ (ii)[[ α is surjective ]] ⇐⇒ (iii)[[ α is an isomorphism ]]

It is clear that, (iii) implies (i) and (ii), and (i) and (ii) together implies (iii). So
it suffices to show that (i) and (ii) are equivalent.
Note that α is injective iff n(α) = 0, and α is surjective iff r(α) = dim V = n. By
the rank-nullity theorem, n(α) + r(α) = n. So the result follows.

L. 4-38
Let A ∈ Mn,n (F) = Mn (F) be a square matrix. Then
(i)[[ ∃B ∈ Mn (F) s.t. BA = In ]] ⇐⇒ (ii)[[ ∃C ∈ Mn (F) s.t. AC = In ]].
If these hold, then B = C, and we call A invertible or non-singular , and write
A−1 = B = C.

Let α, β, γ, ι : Fn → Fn be the linear maps represented by matrices A, B, C, In


respectively with respect to the standard basis.

(i)⇔[[ there exists linear map β s.t. βα = ι ]]⇔[[ α is injective ]]⇔[[ α is an isomor-
phism ]]⇔[[ α has an inverse α−1 ]]⇔[[ α is isomorphism ]]⇔[[ α is surjective ]]⇔[[ there
exists linear map γ s.t. αγ = ι ]]⇔(ii)

So these are the same things, and we have β = α−1 = γ if they exist.

Note [[ α is injective ]]⇒[[ there exists linear map β s.t. βα = ι ]] is actually because
[[ α is injective ]]⇒[[ α is an isomorphism ]]⇒[[ there exists linear map β s.t. βα = ι ]].
Similarly for [[ α is surjective ]]⇒[[ there exists linear map γ s.t. αγ = ι ]].

T. 4-39
Suppose that (e1 , · · · , em ) and (u1 , · · · , um ) are basis for a finite-dimensional
vector space U over F, and (f1 , · · · , fn ) and (v1 , · · · , vn ) are basis of a finite-
dimensional vector space V over F. Let α : U → V be a linear map represented
by a matrix A with respect to (ei ) and (fi ) and by B with respect to (ui ) and
(vi ). Then B = Q−1 AP where P and Q are given by ui = Σ_{k=1}^m Pki ek and
vi = Σ_{k=1}^n Qki fk .

Note that one can view P as the matrix representing the identity map iU from U
with basis (ui ) to U with basis (ei ), and similarly for Q. So both are invertible.

n
X XX X
α(ui ) = Bji vj = Bji Q`j f` = (QB)`i f`
j=1 j ` `
m
! m
X X X X
α(ui ) = α Pki ek = Pki A`k f` = (AP )`i f`
k=1 k=1 ` `

Since the f` are linearly independent, QB = AP . Since Q is invertible, B =


Q−1 AP .

The diagram on the right shows the linear map α : U → V represented by A in the
basis {ei } for U and the basis {fi } for V . When we have two different bases {ui } and
{ei } of U , these give rise to two different maps from Fm to our space U , and the two
bases are related by a change-of-basis map P . We can put them in the second diagram
on the right, where ιU is the identity map. If we perform a change of basis for both U
and V , we can stitch the diagrams together like the final diagram. [Commutative
diagrams omitted.]

Then if we want a matrix representing the map U → V with respect to the bases (ui )
and (vi ), we can write it as the composition B = Q−1 AP .
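A small numerical sanity check of B = Q−1 AP over R, assuming numpy; the matrices below are arbitrary illustrative choices, not from the notes.

    import numpy as np

    A = np.array([[2., 1.], [0., 3.]])      # alpha in the standard bases of R^2
    P = np.array([[1., 1.], [0., 1.]])      # new basis of the domain: u_i = sum_k P_ki e_k
    Q = np.array([[1., 0.], [1., 1.]])      # new basis of the codomain
    B = np.linalg.inv(Q) @ A @ P            # matrix of alpha in the new bases
    # alpha(u_1) computed directly, and via its coordinates B[:, 0] in (v_1, v_2):
    print(np.allclose(A @ P[:, 0], Q @ B[:, 0]))    # True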

D. 4-40
• We say A, B ∈ Matn,m (F) are equivalent if there are (invertible) matrices P ∈
GLm (F) and Q ∈ GLn (F) such that B = Q−1 AP .

• Let A ∈ Matn,m (F).
  ◦ The column rank of A, written r(A), is the dimension of the subspace of Fn
    spanned by the columns of A.
  ◦ The row rank of A, written r(AT ), is the dimension of the subspace of Fm
    spanned by the rows of A. Alternatively, it is the column rank of AT .

E. 4-41
Since GLk (F) = {A ∈ Matk (F) : A is invertible} is a group, for each k ≥ 1,
equivalence of matrices is indeed an equivalence relation. The equivalence classes
are orbits under the action of GLm (F) × GLn (F), given by

GLm (F) × GLn (F) × Matn,m (F) → Matn,m (F) with (P, Q, A) 7→ QAP −1 .

Two matrices are equivalent if and only if they represent the same linear map with
respect to different bases. Hence by [P.4-34]: if A ∈ Matn,m (F), then there exist
invertible matrices P ∈ GLm (F), Q ∈ GLn (F) so that Q−1 AP = ( Ir 0 ; 0 0 ) for some
0 ≤ r ≤ min(m, n). This also tells us that there are min(m, n) + 1 orbits of the
action, one for each r ∈ {0, 1, · · · , min(m, n)}.

E. 4-42
Note that if α : Fm → Fn is the linear map represented by A (with respect to the
standard basis), then r(A) = r(α), ie. the column rank is the rank. Moreover,
since the rank of a map is independent of the basis, equivalent matrices have the
same column rank.

T. 4-43
r(A) = r(AT ) for any A ∈ Matn,m (F). (row rank is equivalent to the column
rank)

We know that there are some invertible P, Q such that Q−1 AP = ( Ir 0 ; 0 0 ) where r =
r(A). We can transpose this whole equation to obtain (Q−1 AP )T = P T AT (QT )−1 =
( Ir 0 ; 0 0 ). So r(AT ) = r.

D. 4-44
A matrix in GLn (F) is called an elementary matrix if it differs from the identity
matrix by one single elementary row operation (i.e. switching two rows, multiply-
ing a row by a non-zero scalar, or adding a multiple of one row to another).

E. 4-45
Elementary matrices in GLn (F) consist of matrices of the following three types:
• S_ij^n : the matrix obtained by swapping row i and row j of the identity matrix.
• T_i^n (λ): a diagonal matrix, with diagonal entries 1 everywhere except in the ith
  position, where it is λ.
• E_ij^n (λ): the identity matrix but with a λ in the (i, j) position instead of 0.
................................................................................
Observe that if A is an m × n matrix, then
• AS_ij^n is the matrix A with columns i and j swapped.
• AT_i^n (λ) is the matrix A with the ith column multiplied by λ.
• AE_ij^n (λ) is the matrix A with λ times its column i added to column j.
Multiplying on the left by m × m elementary matrix instead of the right would
result in the same operations performed on the rows instead of the columns.
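A short numpy illustration (not from the notes) of how right-multiplication by an elementary matrix acts on columns:

    import numpy as np

    A = np.arange(12.).reshape(4, 3)             # an arbitrary 4 x 3 matrix
    S = np.eye(3); S[[0, 1]] = S[[1, 0]]         # swap matrix S_01
    E = np.eye(3); E[0, 2] = 7.                  # E_02(7): a 7 in position (0, 2)
    print(A @ S)    # A with columns 0 and 1 swapped
    print(A @ E)    # A with 7 * (column 0) added to column 2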
P. 4-46
If A ∈ Matn,m (F), then there exist invertible matrices P ∈ GLm (F) and Q ∈
GLn (F) so that Q−1 AP = ( Ir 0 ; 0 0 ) for some 0 ≤ r ≤ min(m, n).

We claim that there are elementary matrices G1 , · · · , Ga ∈ GLn (F) and F1 , · · · , Fb ∈
GLm (F) such that G1 · · · Ga AF1 · · · Fb = ( Ir 0 ; 0 0 ). This suffices since the Gi and
Fj are invertible. Moreover, to prove the claim, it suffices to find a sequence of
elementary row and column operations reducing A to this form.
If A = 0, then we are done. If not, there is some i, j such that Aij ≠ 0. By swapping
row 1 and row i, and then column 1 and column j, we can assume A11 ≠ 0. By
rescaling row 1 by 1/A11 , we can further assume A11 = 1.
Now we can add −A1j times column 1 to column j for each j ≠ 1, and then add
−Ai1 times row 1 to row i for each i ≠ 1. Then we now have the block form
( 1 0 ; 0 B ), ie. a 1 in the top-left corner, zeros elsewhere in the first row and
column, and a smaller matrix B in the remaining block.

Now B is smaller than A. So by induction on the size of A, we can reduce B to a


matrix of the required form.
Since elementary matrices represent bijective linear maps, elementary row and column
operations on a matrix do not change its rank; in particular we have r = r(A).
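The proof is effectively an algorithm. Below is a Python sketch of it, assuming exact rational arithmetic via fractions.Fraction; the function name is illustrative.

    from fractions import Fraction

    def normal_form(A):
        # Reduce A to the block form (I_r 0; 0 0) by elementary row and column
        # operations, following the inductive argument above; returns (A, r).
        A = [[Fraction(x) for x in row] for row in A]
        n, m = len(A), len(A[0])
        r = 0
        while r < min(n, m):
            pivot = next(((i, j) for i in range(r, n) for j in range(r, m)
                          if A[i][j] != 0), None)
            if pivot is None:
                break
            i, j = pivot
            A[r], A[i] = A[i], A[r]                      # swap rows r and i
            for row in A:
                row[r], row[j] = row[j], row[r]          # swap columns r and j
            piv = A[r][r]
            A[r] = [x / piv for x in A[r]]               # rescale row r
            for i2 in range(n):                          # clear the rest of column r
                if i2 != r and A[i2][r] != 0:
                    f = A[i2][r]
                    A[i2] = [a - f * b for a, b in zip(A[i2], A[r])]
            for j2 in range(m):                          # clear the rest of row r
                if j2 != r and A[r][j2] != 0:
                    f = A[r][j2]
                    for i2 in range(n):
                        A[i2][j2] -= f * A[i2][r]
            r += 1
        return A, r

    print(normal_form([[1, 2, 3], [2, 4, 6], [1, 0, 1]])[1])   # rank r = 2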

4.3 Duality
D. 4-47
• Let V be a vector space over F. The dual space of V is defined as V ∗ =
L(V, F) = {θ : V → F : θ linear}. Elements of V ∗ are called linear functionals or
linear forms .

• Let U ⊆ V . The annihilator of U is U 0 = {θ ∈ V ∗ : θ(u) = 0, ∀u ∈ U }.


Let W ⊆ V ∗ . The annihilator of W is W 0 = {v ∈ V : θ(v) = 0, ∀θ ∈ W }.1
E. 4-48
By convention, we use Roman letters for elements in V , and Greek letters for
elements in V ∗ .
• If V = R3 and θ : V → R that sends (x1 , x2 , x3 ) 7→ x1 − x3 , then θ ∈ V ∗ .
• Let V = FX . Then for any fixed x ∈ X, the function θ : V → F defined by
f 7→ f (x) is in V ∗ .
• Let V = C([0, 1], R), then f 7→ ∫_0^1 f (t) dt ∈ V ∗ .
• The trace tr : Mn (F) → F defined by A 7→ Σ_{i=1}^n Aii is in Mn (F)∗ .

L. 4-49
If V is a finite-dimensional vector space over F with basis (e1 , · · · , en ), then there
is a basis (ε1 , · · · , εn ) for V ∗ such that εi (ej ) = δij . In particular dim V = dim V ∗ .

Since linear maps are characterized by their values on a basis, there are unique
ε1 , · · · , εn ∈ V ∗ such that εi (ej ) = δij . Now we show that (ε1 , · · · , εn ) is a basis.
Given any θ ∈ V ∗ , we can write θ uniquely as a combination of ε1 , · · · , εn because
    θ = Σ_{i=1}^n λi εi ⇐⇒ θ(ej ) = Σ_{i=1}^n λi εi (ej ) for all j ⇐⇒ λj = θ(ej ).

We call (ε1 , · · · , εn ) the dual basis to (e1 , · · · , en ).


When V is infinite dimensional, dim V = dim V ∗ is not true; in fact dim V ∗
is strictly larger than dim V , if we manage to define dimensions for infinite-
dimensional vector spaces.
It helps to come up with a more concrete example of how dual spaces look like.
Consider the vector space Fn , where we treat each element as a column vector.
Given any a ∈ V ∗ and x ∈ Fn , we have

    a(x) = ( Σ_j aj εj )( Σ_i xi ei ) = Σ_{i,j} aj xi δij = Σ_{i=1}^n ai xi = (a1 · · · an ) (x1 , · · · , xn )T ∈ F.

So we can regard elements of V ∗ as just row vectors (a1 , · · · , an ) = Σ_{j=1}^n aj εj
with respect to the dual basis.
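Concretely, if the basis vectors of Fn are the columns of an invertible matrix B, then (viewing functionals as row vectors as above) the dual basis consists of the rows of B −1 . A small numpy illustration with an arbitrary example basis:

    import numpy as np

    B = np.array([[1., 0., 1.],      # basis of R^3 as columns
                  [0., 1., 2.],
                  [1., 1., 0.]])
    dual = np.linalg.inv(B)          # row i of B^{-1} is the dual basis vector eps_i
    print(np.round(dual @ B))        # the identity: eps_i(e_j) = delta_ij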
P. 4-50
Let V be a finite-dimensional vector space over F with bases (e1 , · · · , en ) and
(f1 , · · · , fn ), and suppose P is the change of basis matrix so that fi = Σ_{k=1}^n Pki ek .
Let (ε1 , · · · , εn ) and (η1 , · · · , ηn ) be the corresponding dual bases so that εi (ej ) =
δij = ηi (fj ). Then the change of basis matrix from (ε1 , · · · , εn ) to (η1 , · · · , ηn ) is
(P −1 )T , ie. εi = Σ_{ℓ=1}^n (P T )ℓi ηℓ .

1 It might seem like the definitions are not consistent, and that W 0 should be a subset of V ∗∗ and
not V . We will later show that there is a canonical isomorphism between V ∗∗ and V , and this will
all make sense.
Write Q = P −1 so that ej = Σ_k Qkj fk . Now εi = Σ_{ℓ=1}^n (P T )ℓi ηℓ because

    ( Σ_{ℓ=1}^n (P T )ℓi ηℓ )(ej ) = ( Σ_{ℓ=1}^n Piℓ ηℓ )( Σ_{k=1}^n Qkj fk ) = Σ_{k,ℓ} Piℓ δℓk Qkj = Σ_ℓ Piℓ Qℓj = (P Q)ij = δij .

E. 4-51
Consider R3 with standard basis (e1 , e2 , e3 ) and (R3 )∗ with dual basis (ε1 , ε2 , ε3 ).
If U = he1 + 2e2 + e3 i and W = hε1 − ε3 , 2ε1 − ε2 i, then U 0 = W and W 0 = U .
We see that the dimension of U and U 0 add up to three, which is the dimension
of R3 . This is typical.
P. 4-52
Let V be finite-dimensional vector space over F and U a subspace. Then dim U +
dim U 0 = dim V .
(Proof 1) Let (e1 , · · · , ek ) be a basis for U and extend to a basis (e1 , · · · , en ) for
V . Consider the dual basis for V ∗ , say (ε1 , · · · , εn ). We will prove the result
by showing that U 0 = hεk+1 , · · · , εn i. If j > k, then εj (ei ) = 0 for all i ≤ k,
so εk+1 , · · · , εn ∈ U 0 . On the other hand, suppose θ ∈ U 0 . Then we can write
θ = Σ_{j=1}^n λj εj . But then 0 = θ(ei ) = λi for i ≤ k, so θ ∈ hεk+1 , · · · , εn i.
(Proof 2) Consider the restriction map V ∗ → U ∗ , given by θ 7→ θ|U . This is
obviously linear. Since every linear map U → F can be extended to V → F,
this is a surjection. Moreover, the kernel is U 0 . So by rank-nullity theorem,
dim V = dim V ∗ = dim U 0 + dim U ∗ = dim U 0 + dim U .
(Proof 3) Consider the map α : U 0 → (V /U )∗ such that α(θ)(v + U ) = θ(v). This
is well-defined since [[ v + U = u + U ]] ⇒ [[ v − u ∈ U ]] ⇒ [[ θ(v) = θ(u) ]]. This is a
linear map since for all u + U ∈ V /U we have α(λθ + µφ)(u + U ) = (λθ + µφ)(u) =
λ(θ)(u) + µ(φ)(u) = λα(θ)(u + U ) + µα(φ)(u + U ). This is also injective since
[[ α(θ) = 0 ]] ⇒ [[ θ(v) = 0 for all v ]] ⇒ [[ θ = 0 ]]. This is also surjective since given
any σ ∈ (V /U )∗ , θ ∈ U 0 defined by θ(v) = σ(v + U ) is such that α(θ) = σ. Hence
α is an isomorphism and U 0 ∼ = (V /U )∗ , so dim U 0 = dim(V /U )∗ = dim(V /U ) =
dim V − dim U by [P.4-17].
D. 4-53
Let V, W be vector spaces over F and α : V → W a linear map. The dual map
to α, written α∗ : W ∗ → V ∗ is given by θ 7→ θ ◦ α.
E. 4-54
Note that since the composite of linear maps is linear, α∗ (θ) ∈ V ∗ for all θ ∈ W ∗ .
P. 4-55
Let α ∈ L(V, W ) be a linear map, then α∗ ∈ L(W ∗ , V ∗ ), i.e α∗ is a linear map.

For every v ∈ V we have

α∗ (λθ1 + µθ2 )(v) = (λθ1 + µθ2 )(αv) = λθ1 (α(v)) + µθ2 (α(v))
= (λα∗ (θ1 ) + µα∗ (θ2 ))(v).

So α∗ (λθ1 + µθ2 ) = λα∗ (θ1 ) + µα∗ (θ2 ) and hence α∗ ∈ L(W ∗ , V ∗ ).



P. 4-56
Let V, W be finite-dimensional vector spaces over F and α : V → W be a linear
map. If α is represented by A with respect to basis (e1 , · · · , en ) and (f1 , · · · , fm )
for V and W , then α∗ is represented by AT with respect to the corresponding dual
bases.

Write (ε1 , · · · , εn ) and (η1 , · · · , ηm ) for the corresponding dual bases of V ∗ and
W ∗ . We are given that α(ej ) = Σ_{k=1}^m Akj fk . We must compute α∗ (ηi ). To do so,
we evaluate it at ej :

    α∗ (ηi )(ej ) = ηi (α(ej )) = ηi ( Σ_{k=1}^m Akj fk ) = Σ_{k=1}^m Akj δik = Aij = ( Σ_{k=1}^n Aik εk )(ej ).

Since this is true for all j, we can “take away” the (ej ), so α∗ (ηi ) = Σ_{k=1}^n (AT )ki εk .
Note that if α : U → V and β : V → W , θ ∈ W ∗ , then

(βα)∗ (θ) = θβα = α∗ (θβ) = α∗ (β ∗ (θ)).

So we have (βα)∗ = α∗ β ∗ . This is obviously true for the finite-dimensional case,


since that’s how transposes of matrices work. Similarly, if α, β : U → V , then
(λα + µβ)∗ = λα∗ + µβ ∗ .
If we change base so B = Q−1 AP for some invertible P and Q, then

B T = (Q−1 AP )T = P T AT (Q−1 )T = ((P −1 )T )−1 AT (Q−1 )T .

So in the dual space, we conjugate by the dual of the change-of-basis matrices.[P.4-50]


L. 4-57
Let α ∈ L(V, W ) with V, W finite dimensional vector spaces over F. Then
(i) ker α∗ = (Im α)0 (ii) r(α) = r(α∗ ) (iii) Im α∗ = (ker α)0 .

i. If θ ∈ W ∗ , then [[ θ ∈ ker α∗ ]] ⇔ [[ α∗ (θ) = 0 ]] ⇔ [[ θα(v) = 0 for all v ∈


V ]] ⇔ [[ θ(w) = 0 for all w ∈ Im α ]] ⇔ [[ θ ∈ (Im α)0 ]].
ii. As Im α ≤ W , we’ve seen that dim Im α + dim(Im α)0 = dim W . From (i),
we know n(α∗ ) = dim(Im α)0 . So r(α) + n(α∗ ) = dim W = dim W ∗ . By the
rank-nullity theorem, we have r(α) = r(α∗ ).
iii. Let θ ∈ Im α∗ . Then θ = φα for some φ ∈ W ∗ . If v ∈ ker α, then θ(v) =
φ(α(v)) = φ(0) = 0. So Im α∗ ⊆ (ker α)0 . But we know dim(ker α)0 +
dim ker α = dim V , so we have

dim(ker α)0 = dim V − n(α) = r(α) = r(α∗ ) = dim Im α∗ .

Hence we must have Im α∗ = (ker α)0 .


Note that (ii) is another proof that row rank is equal to column rank.
L. 4-58
1. Let V be a vector space over F. Then ev : V → (V ∗ )∗ defined by ev(v)(θ) =
θ(v) is a linear map.
2. If V is finite-dimensional, then ev : V → V ∗∗ is an isomorphism.

1. We first show that ev(v) ∈ V ∗∗ for all v ∈ V . That is we need to show


ev(v) : V ∗ → F is linear for any v. This is indeed true since for any v ∈ V ,
λ, µ ∈ F and θ1 , θ2 ∈ V ∗ we have
ev(v)(λθ1 + µθ2 ) = (λθ1 + µθ2 )(v) = λθ1 (v) + µθ2 (v)
= λ ev(v)(θ1 ) + µ ev(v)(θ2 ).
Now we show that ev itself is linear. Let λ, µ ∈ F and v1 , v2 ∈ V . For any
θ ∈ V ∗ we have
ev(λv1 + µv2 )(θ) = θ(λv1 + µv2 ) = λθ(v1 ) + µθ(v2 )
= λ ev(v1 )(θ) + µ ev(v2 )(θ) = (λ ev(v1 ) + µ ev(v2 ))(θ).
Hence ev(λv1 + µv2 ) = λ ev(v1 ) + µ ev(v2 ).
2. We first show it is injective. Suppose ev(v) = 0 for some v ∈ V . Then
θ(v) = ev(v)(θ) = 0 for all θ ∈ V ∗ . So dimhvi0 = dim V ∗ = dim V . So
dimhvi = 0 and hence v = 0. So ev is injective. Since V and V ∗∗ have the
same dimension, this is also surjective.
Note that 2. is very false for infinite dimensional spaces. In fact, this is true only
for finite-dimensional vector spaces. In general, if V is infinite dimensional, then
ev is injective, but not surjective. So we can think of V as a subspace of V ∗∗ in a
canonical way.
Note when we take the dual of V ∗ to get a V ∗∗ , we already know that V ∗∗ is
isomorphic to V , since V ∗ is isomorphic to V already. However, the isomorphism
between V ∗ and V are not “natural”. To define such an isomorphism, we needed
to pick a basis for V and consider a dual basis. If we picked a different basis,
we would get a different isomorphism. There is no natural, canonical, uniquely-
defined isomorphism between V and V ∗ .
However, this is not the case when we want to construct an isomorphism V →
V ∗∗ . The construction of this isomorphism is obvious once we think hard what
V ∗∗ actually means. Unwrapping the definition, we know V ∗∗ = L(V ∗ , F). Our
isomorphism has to produce a function V ∗ → F in V ∗∗ given any v ∈ V . This is
equivalent to saying given any v ∈ V and a function θ ∈ V ∗ , produce something
in F. This is easy: by definition θ ∈ V ∗ is just a linear map V → F. So given v
and θ, we just return θ(v); this is ev.
We call ev the evaluation map . This is a “canonical” map since this does not
require picking a particular basis of the vector spaces. It is in some sense a “natu-
ral” map. From now on, we will just pretend that V and V ∗∗ are the same thing,
at least when V is finite dimensional.
P. 4-59
Let V, W be finite-dimensional vector space.
1. If α ∈ L(V, W ), then α∗∗ ◦ ev = ev ◦α.
2. If U ≤ V , then U 00 = ev(U ) and ev(U 0 ) = ev(U )0 .
3. If U1 , U2 ≤ V , then (U1 + U2 )0 = U10 ∩ U20 and (U1 ∩ U2 )0 = U10 + U20 .

1. Note that α∗∗ ◦ ev and ev ◦α are both linear maps V → W ∗∗ . Then α∗∗ ◦ ev =
ev ◦α since for any v ∈ V and θ ∈ W ∗ we have
α∗∗ (ev(v))(θ) = ev(v)(α∗ (θ)) = (α∗ (θ))(v) = θ(α(v)) = ev(α(v))(θ)

2. If u ∈ U , then ev(u)(θ) = θ(u) = 0 for all θ ∈ U 0 , so ev(u) ∈ U 00 . Hence


ev(U ) ⊆ U 00 . But in fact ev(U ) = U 00 since

dim(ev U ) = dim U = dim V −dim U 0 = dim V −(dim V ∗ −dim U 00 ) = dim U 00

For the second part we have ev(U )0 = (U 00 )0 = (U 0 )00 = ev(U 0 ).

3. [[ θ ∈ (U1 + U2 )0 ]] ⇔ [[ θ(u1 ) + θ(u2 ) = θ(u1 + u2 ) = 0 for all ui ∈ Ui ]] ⇔


[[ θ(u) = 0 for all u ∈ U1 and for all u ∈ U2 ]] ⇔ [[ θ ∈ U10 ∩ U20 ]].

For the second result first note that ev(U1 ) ∩ ev(U2 ) = U100 ∩ U200 = (U10 + U20 )0 .
Now as ev is an isomorphism and using 2. we have (U1 ∩ U2 )0 = U10 + U20 since

ev((U1 ∩ U2 )0 ) = (ev(U1 ) ∩ ev(U2 ))0 = (U10 + U20 )00 = ev(U10 + U20 ).

Note that if we think of ev(v) and v as the same thing and abuse the notation
and write ev(v) = v, then the result α∗∗ ◦ ev = ev ◦α is simply α∗∗ = α and
U 00 = ev(U ) is just U 00 = U . So again we can think of them as “the same”.

Another way get the result α∗∗ = α is by considering basis: Let (e1 , · · · , en ) be a
basis for V and (f1 , · · · , fm ) be a basis for W , and let (ε1 , · · · , εn ) and (η1 , · · · , ηn )
be the corresponding dual basis. We know that

ei (εj ) = δij = εj (ei ), fi (ηj ) = δij = ηj (fi ).

So (e1 , · · · , en ) is dual to (ε1 , · · · , εn ), and similarly for f and η. If α is represented


by A, then α∗ is represented by AT . So α∗∗ is represented by (AT )T = A.

4.4 Bilinear forms I


D. 4-60
Let V , W and X be vector spaces over the field F. A function B : V × W → X is a
bilinear map if it is linear in each variable: that is for any fixed w ∈ W the map
v 7→ B(v, w) is a linear map V → X, and for any v ∈ V the map w 7→ B(v, w) is a
linear map W → X. In the special case where X is the field F (i.e. B : V ×W → F)
we call B a bilinear form .

E. 4-61
• The map V × V ∗ → F defined by (v, θ) 7→ θ(v) = ev(v)(θ) is a bilinear form.

• Let V = W = Fn . Then the function (v, w) 7→ Σ_{i=1}^n vi wi is bilinear.
• If V = W = C([0, 1], R), then (f, g) 7→ ∫_0^1 f g dt is a bilinear form.

• Let A ∈ Matm,n (F). Then φ : Fm ×Fn → F defined by (v, w) 7→ vT Aw is bilinear.


In fact, this is the most general form of bilinear forms on finite-dimensional vector
spaces. Note that the (real) dot product is the special case of this, where n = m
and A = I.

C. 4-62
<Matrix representation> Let (e1 , · · · , en ) be a basis for V and (f1 , · · · , fm )
be a basis for W , and ψ : V × W → F a bilinear form. Define the matrix Aij =
ψ(ei , fj ). For any v ∈ V and w ∈ W , write v = Σ vi ei and w = Σ wj fj ; then by
linearity, we get

    ψ(v, w) = ψ( Σ_i vi ei , w) = Σ_i vi ψ(ei , w) = Σ_i vi ψ(ei , Σ_j wj fj ) = Σ_{i,j} vi wj ψ(ei , fj ) = veT Awf ,

where ve = (v1 , · · · , vn ) and wf = (w1 , · · · , wm ), both column vectors. So ψ is


determined by A. We call A the matrix representing ψ with respect to the given
basis.

P. 4-63
Suppose (e1 , · · · , en ) and (v1 , · · · , vn ) are bases for V such that vi = Σ_k Pki ek
for all i = 1, · · · , n; and (f1 , · · · , fm ) and (w1 , · · · , wm ) are bases for W such that
wj = Σ_ℓ Qℓj fℓ for all j = 1, · · · , m. If ψ : V × W → F is a bilinear form represented
by A with respect to (e1 , · · · , en ) and (f1 , · · · , fm ) and by B with respect to the
bases (v1 , · · · , vn ) and (w1 , · · · , wm ), then B = P T AQ.

Bij = ψ(vi , wj ) = ψ( Σ_k Pki ek , Σ_ℓ Qℓj fℓ ) = Σ_{k,ℓ} Pki Qℓj ψ(ek , fℓ ) = Σ_{k,ℓ} (P T )ik Akℓ Qℓj = (P T AQ)ij .

The difference between this and the transformation laws of matrices representing
linear maps is that this time we are taking transposes, not inverses. Note that
while the transformation laws for bilinear forms and linear maps are different, we
still get that two matrices are representing the same bilinear form with respect
to different bases if and only if they are equivalent, since if B = P T AQ, then
B = ((P −1 )T )−1 AQ.
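A small numerical check of B = P T AQ over R, assuming numpy; the matrices below are arbitrary illustrative choices, not from the notes.

    import numpy as np

    A = np.array([[1., 2., 0.], [0., 1., 3.], [4., 0., 1.]])   # psi(v, w) = v^T A w
    P = np.array([[1., 1., 0.], [0., 1., 1.], [0., 0., 1.]])   # new basis of V (columns)
    Q = np.array([[1., 0., 0.], [2., 1., 0.], [0., 0., 1.]])   # new basis of W (columns)
    B = P.T @ A @ Q
    v1, w2 = P[:, 0], Q[:, 1]        # psi(v_1, w_2) computed both ways:
    print(np.isclose(v1 @ A @ w2, B[0, 1]))                    # True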

D. 4-64
Let ψ : V × W → F be a bilinear form,

• ψL : V → W ∗ and ψR : W → V ∗ are two linear maps defined by ψL (v) =


ψ(v, · ) and ψR (w) = ψ( · , w).

• The kernel of ψL is called the left kernel of ψ, while the kernel of ψR is the
right kernel of ψ.

• ψ is non-degenerate if the left and right kernels are both trivial. We say ψ is
degenerate otherwise.

• If V is finite-dimensional, then the rank of ψ is the rank of any matrix repre-


senting ψ.

E. 4-65
If we are given a bilinear map ψ : V × W → X, we immediately get two linear
maps ψL : V → W ∗ and ψR : W → V ∗ . Note that ψL is indeed linear since for
any fixed w ∈ W , ψL (λu + µv)(w) = ψ(λu + µv, w) = λψ(u, w) + µψ(v, w) =
λψL (u)(w) + µψL (v)(w), hence ψL (λu + µv) = λψL (u) + µψL (v). Similarly ψR
is linear.

Also note that the rank of ψ is well-defined since r(P T AQ) = r(A) for invertible
P and Q.
E. 4-66
• If ψ : V × V ∗ → F, is defined by (v, θ) 7→ θ(v), then ψL : V → V ∗∗ is the
evaluation map. On the other hand, ψR : V ∗ → V ∗ is the identity map.
• For bilinear form ψ : V × W → F, v ∈ V is in the left kernel if ψ(v, w) = 0 for
all w ∈ W . More generally, for T ⊆ V , we can define T ⊥ = {w ∈ W : ψ(t, w) =
0 for all t ∈ T } and similarly for U ⊆ W we define ⊥ U = {v ∈ V : ψ(v, u) =
0 for all u ∈ U }. Then V ⊥ = ker ψR and ⊥ W = ker ψL .
• Note that for ψ : V × W → F we have ψL (v)(w) = ψ(v, w) = ψR (w)(v), and one can
easily show that ψR = ψL∗ ◦ ev and ψL = ψR∗ ◦ ev.
L. 4-67
Let (e1 , · · · , en ),(f1 , · · · , fn ) be basis of V, W respectively and (ε1 , · · · , εn ), (η1 , · · · , ηn )
their dual basis on V ∗ , W ∗ . If A represents ψ with respect to (e1 , · · · , en ) and
(f1 , · · · , fm ), then
• A also represents ψR with respect to (f1 , · · · , fm ) and (ε1 , · · · , εn );
• AT represents ψL with respect to (e1 , · · · , en ) and (η1 , · · · , ηm ).
ψL (ei )(fj ) = ψ(ei , fj ) = Aij = ( Σ_ℓ Aiℓ ηℓ )(fj ), so ψL (ei ) = Σ_ℓ (AT )ℓi ηℓ and hence
AT represents ψL . We also have ψR (fj )(ei ) = Aij , so ψR (fj ) = Σ_k Akj εk .
Note that this says that the rank of ψ is the same as the rank of ψL and ψR .
L. 4-68
Let V and W be finite-dimensional vector spaces over F with bases (e1 , · · · , en )
and (f1 , · · · , fm ) respectively, and let ψ : V ×W → F be a bilinear form represented
by A with respect to these bases. Then ψ is non-degenerate if and only if A is
(square and) invertible. In particular, V and W have the same dimension if ψ is
non-degenerate.

Since ψR and ψL are represented by A and AT (in some order), they both have
trivial kernel iff n(A) = n(AT ) = 0 iff dim W = r(AT ) = r(A) = dim V with A
having full rank, ie. the corresponding linear map is bijective.
E. 4-69
The map F² × F² → F defined by ((a, c)^T, (b, d)^T) ↦ ad − bc is a bilinear form. This,
obviously, corresponds to the determinant of a 2 × 2 matrix. We have ψ(v, w) =
−ψ(w, v) for all v, w ∈ F².

4.5 Determinants of matrices


D. 4-70
• Let A ∈ Mat_{n,n}(F). Its determinant is det A = Σ_{σ∈S_n} ε(σ) Π_{i=1}^n A_{iσ(i)}, where
  ε(σ) = sgn(σ) is the sign of σ. A matrix A is called singular if det A = 0;
  otherwise, it is non-singular.
• A volume form on F^n is a function d : F^n × ··· × F^n → F that is
  1. Multilinear, ie. d(v_1, ..., v_{i−1}, · , v_{i+1}, ..., v_n) ∈ (F^n)* for all 1 ≤ i ≤ n and
     all v_1, ..., v_{i−1}, v_{i+1}, ..., v_n ∈ F^n.
  2. Alternating, ie. if v_i = v_j for some i ≠ j, then d(v_1, ..., v_n) = 0.

E. 4-71
• If n = 2, then S_2 = {id, (1 2)}, so det A = A_{11}A_{22} − A_{12}A_{21}.
• We should think of d(v_1, ..., v_n) as the n-dimensional volume of the parallelepiped
  spanned by v_1, ..., v_n.

L. 4-72
1. det A = det A^T.
2. If A is an upper triangular matrix (ie. A_{ij} = 0 for all i > j), then det A = Π_{i=1}^n A_{ii}.

1. Let τ = σ⁻¹. Note that ε(τ) = ε(σ) and Π_{i=1}^n A_{σ(i)i} = Π_{j=1}^n A_{jτ(j)}. So

       det A^T = Σ_{σ∈S_n} ε(σ) Π_{i=1}^n A_{σ(i)i} = Σ_{τ∈S_n} ε(τ) Π_{j=1}^n A_{jτ(j)} = det A.

2. A_{iσ(i)} = 0 whenever i > σ(i). So Π_{i=1}^n A_{iσ(i)} = 0 if there is some i ∈
   {1, ..., n} such that i > σ(i). However, the only permutation in which
   i ≤ σ(i) for all i is the identity. So the only term that contributes in the
   sum Σ_{σ∈S_n} ε(σ) Π_{i=1}^n A_{iσ(i)} is σ = id. So det A = Π_{i=1}^n A_{ii}.
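The Leibniz formula in [D.4-70] can be checked directly against a computer algebra system. The following sympy sketch (an added illustration with an arbitrarily chosen upper triangular matrix) brute-forces the sum over permutations and confirms both parts of the lemma:

    # Sketch: brute-force det A = sum over sigma of sgn(sigma) * prod_i A[i, sigma(i)],
    # then check det A = det A^T and the diagonal-product formula for triangular A.
    import itertools
    import sympy as sp
    from sympy.combinatorics import Permutation

    def leibniz_det(A):
        n = A.rows
        return sum(Permutation(list(p)).signature() * sp.prod(A[i, p[i]] for i in range(n))
                   for p in itertools.permutations(range(n)))

    A = sp.Matrix([[1, 2, 3], [0, 4, 5], [0, 0, 6]])   # upper triangular
    assert leibniz_det(A) == A.det() == 1 * 4 * 6
    assert A.det() == A.T.det()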

L. 4-73
Let the n vectors A^{(i)} ∈ F^n (1 ≤ i ≤ n) be the columns of the matrix A =
(A^{(1)} A^{(2)} ··· A^{(n)}) ∈ Mat_n(F). Then det A is a volume form.

To see that det is multilinear, it is sufficient to show that Π_{i=1}^n A_{iσ(i)} is multilinear
for all σ ∈ S_n, since linear combinations of multilinear forms are multilinear. But
each such product contains precisely one entry from each column, and so is
multilinear.

To show it is alternating, suppose now there are some distinct k, ℓ such that
A^{(k)} = A^{(ℓ)} (ie. A_{ik} = A_{iℓ} for all i). Let τ = σ(k ℓ); then Π_{i=1}^n A_{σ(i)i} = Π_{i=1}^n A_{τ(i)i}
(since A_{τ(k)k} = A_{σ(ℓ)ℓ}, A_{τ(ℓ)ℓ} = A_{σ(k)k} and A_{τ(i)i} = A_{σ(i)i} otherwise). So det A = 0
because

    det A = Σ_{σ∈S_n} ε(σ) Π_{i=1}^n A_{σ(i)i} = Σ_{τ∈S_n} ε(τ(k ℓ)) Π_{i=1}^n A_{τ(i)i}
          = −Σ_{τ∈S_n} ε(τ) Π_{i=1}^n A_{τ(i)i} = −det A.

Alternatively, S_n is the union of the cosets A_n and A_n(k ℓ), and Σ_{σ∈A_n} Π_{i=1}^n A_{σ(i)i} =
Σ_{τ(k ℓ)∈A_n} Π_{i=1}^n A_{τ(i)i} = Σ_{τ∈A_n(k ℓ)} Π_{i=1}^n A_{τ(i)i}. But det A = LHS − RHS = 0.

L. 4-74
Let d be a volume form on Fn . Then swapping two entries changes the sign, ie.

d(v1 , · · · , vi , · · · , vj , · · · , vn ) = −d(v1 , · · · , vj , · · · , vi , · · · , vn ).

In particular if σ ∈ Sn , then d(vσ(1) , · · · , vσ(n) ) = ε(σ)d(v1 , · · · , vn ) for any


vi ∈ Fn .
By linearity, we have

0 = d(v1 , · · · , vi + vj , · · · , vi + vj , · · · , vn )
= d(v1 , · · · , vi , · · · , vi , · · · , vn ) + d(v1 , · · · , vi , · · · , vj , · · · , vn )
+ d(v1 , · · · , vj , · · · , vi , · · · , vn ) + d(v1 , · · · , vj , · · · , vj , · · · , vn )
= d(v1 , · · · , vi , · · · , vj , · · · , vn ) + d(v1 , · · · , vj , · · · , vi , · · · , vn ).

T. 4-75
Let d be any volume form on F^n, let (e_1, ..., e_n) be the standard basis of F^n, and
let A = (A^{(1)} ··· A^{(n)}) ∈ Mat_n(F). Then
1. d(A^{(1)}, ..., A^{(n)}) = (det A) d(e_1, ..., e_n);
2. d(Av_1, ..., Av_n) = (det A) d(v_1, ..., v_n) for any v_1, ..., v_n ∈ F^n.

1. We can compute

       d(A^{(1)}, ..., A^{(n)}) = d(Σ_{i=1}^n A_{i1} e_i, A^{(2)}, ..., A^{(n)}) = Σ_{i=1}^n A_{i1} d(e_i, A^{(2)}, ..., A^{(n)})
                              = Σ_{i,j=1}^n A_{i1} A_{j2} d(e_i, e_j, A^{(3)}, ..., A^{(n)}) = ···
                              = Σ_{i_1, ..., i_n} d(e_{i_1}, ..., e_{i_n}) Π_{j=1}^n A_{i_j j}.

   We know that lots of these terms are zero, since if i_k = i_j for some k ≠ j, then the
   term is zero. So we are just summing over distinct tuples, ie. when there is
   some σ such that i_j = σ(j). So we get

       d(A^{(1)}, ..., A^{(n)}) = Σ_{σ∈S_n} d(e_{σ(1)}, ..., e_{σ(n)}) Π_{j=1}^n A_{σ(j)j}
                              = Σ_{σ∈S_n} ε(σ) d(e_1, ..., e_n) Π_{j=1}^n A_{σ(j)j} = (det A) d(e_1, ..., e_n).

2. We can rewrite the result of part 1 as d(Ae_1, ..., Ae_n) = (det A) d(e_1, ..., e_n).
   Let B be the linear map such that Be_i = v_i for all i. Define d_A(u_1, ..., u_n) =
   d(Au_1, ..., Au_n); then d_A is a volume form since it is multilinear (as u_i ↦ Au_i
   is linear) and alternating (as u_i = u_j implies Au_i = Au_j). Now using part 1
   we have

       d(Av_1, ..., Av_n) = d_A(Be_1, ..., Be_n) = det(B) d_A(e_1, ..., e_n)
                          = det(B) d(Ae_1, ..., Ae_n) = det(B) det(A) d(e_1, ..., e_n)
                          = det(A) d(Be_1, ..., Be_n) = (det A) d(v_1, ..., v_n).

d(Av1 , · · · , Avn ) = (det A)d(v1 , · · · , vn ) says that det A is the volume rescaling
factor of an arbitrary parallelopiped, and this is true for any volume form d.
T. 4-76
Let A, B ∈ Matn (F). Then det(AB) = det(A) det(B).

Let d be a non-zero volume form on Fn (eg. the “determinant”). Then

(det AB)d(e1 , · · · , en ) = d(ABe1 , · · · , ABen ) = (det A)d(Be1 , · · · , Ben )


= (det A)(det B)d(e1 , · · · , en ).

Since d(e1 , · · · , en ) is non-zero, we must have det AB = det A det B.


Note that this result make use of part 2 of the last theorem, and they are basically
the same result.
P. 4-77
If A ∈ Matn (F) is invertible, then det A 6= 0 and det(A−1 ) = (det A)−1 .

We have 1 = det I = det(AA−1 ) = det A det A−1 .


T. 4-78
Let A ∈ Matn (F). Then (i)[[ A is invertible ]]⇔(ii)[[ det A 6= 0 ]]⇔(iii)[[ r(A) = n ]].

We have proved that (i) ⇒ (ii) above, and the rank-nullity theorem implies (iii)
⇒ (i). So we just need to prove (ii) ⇒ (iii). Suppose r(A) < n. By the rank-nullity
theorem, n(A) > 0. So there is some non-zero column vector x = (λ_1, ..., λ_n)^T
such that Ax = 0. Say λ_k ≠ 0. We define B to be the matrix which agrees with the
identity matrix I_n except that its kth column is x. Then Be_k = x, so the kth
column of AB is Ax = 0, and hence det(AB) = 0. On the other hand, det B =
d(e_1, ..., x, ..., e_n) = Σ_i λ_i d(e_1, ..., e_i, ..., e_n) = λ_k ≠ 0, since every term with a
repeated column vanishes. Since det(AB) = det A det B, we must have det A = 0.
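As a quick illustration (added here; sympy with arbitrary example matrices), one can check the three equivalent conditions on concrete matrices:

    # Sketch: A invertible <=> det A != 0 <=> r(A) = n, on two concrete 3x3 matrices.
    import sympy as sp

    A = sp.Matrix([[1, 2, 3], [4, 5, 6], [7, 8, 10]])   # det A = -3
    B = sp.Matrix([[1, 2, 3], [4, 5, 6], [7, 8, 9]])    # rows are linearly dependent
    assert A.det() != 0 and A.rank() == 3               # so A is invertible
    assert B.det() == 0 and B.rank() == 2               # so B is singular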
D. 4-79
• Write Âij for the matrix obtained from A by deleting the ith row and jth column.
• Let A ∈ Matn (F). The adjugate matrix of A, written adj A, is the n × n matrix
such that (adj A)ij = (−1)i+j det Âji .
L. 4-80
Let A ∈ Mat_n(F). Then for any fixed j ∈ {1, 2, ..., n}, we can expand det A as

    det A = Σ_{i=1}^n (−1)^{i+j} A_{ij} det Â_{ij} = Σ_{i=1}^n (−1)^{i+j} A_{ji} det Â_{ji}.

We just have to prove the first equality; the second equality then follows
from det A = det A^T. Let A^{(1)}, ..., A^{(n)} be the columns of A, so det A =
d(A^{(1)}, ..., A^{(n)}) where d is the volume form given by the determinant. Since
A^{(j)} = Σ_{i=1}^n A_{ij} e_i, we can write det A as

    det A = d(A^{(1)}, ..., A^{(j−1)}, Σ_{i=1}^n A_{ij} e_i, A^{(j+1)}, ..., A^{(n)})
          = Σ_{i=1}^n A_{ij} d(A^{(1)}, ..., A^{(j−1)}, e_i, A^{(j+1)}, ..., A^{(n)}).

The volume form on the last line is the determinant of a matrix B′ which is the
matrix A with the jth column replaced by e_i. We can make n − j column
transpositions and n − i row transpositions (i.e. column transpositions on its
transpose) on B′ to obtain the matrix

    B = ( Â_{ij}  0 ; stuff  1 ),

and we have (−1)^{n−j}(−1)^{n−i} det B = det B′ by property [L.4-74] of a volume
form. Now note that det B = det Â_{ij}, since the only permutations that give a
non-zero contribution are those that send n to n. So we have

    det A = Σ_{i=1}^n A_{ij} (−1)^{n−j}(−1)^{n−i} det B = Σ_{i=1}^n A_{ij} (−1)^{i+j} det Â_{ij}.

Note that instead of using volume forms, we could actually prove this directly
from the definition, as done in Part IA.
T. 4-81
If A ∈ Matn (F), then A(adj A) = (det A)In = (adj A)A. In particular, if det A 6= 0,
then A−1 = det1 A adj A.

    ((adj A)A)_{jk} = Σ_{i=1}^n (adj A)_{ji} A_{ik} = Σ_{i=1}^n (−1)^{i+j} det Â_{ij} A_{ik}.    (∗)

So if j = k, then ((adj A)A)jk = det A by the above lemma. Otherwise, if j 6= k,


consider the matrix B obtained from A by replacing the jth column by the kth
column. Then the right hand side of (∗) is just det B by the lemma. But we know
that if two columns are the same, the determinant is zero. So the right hand side
of (∗) is zero. So ((adj A)A)jk = δjk det A.
The calculation for A adj A = (det A)In can be done in a similar manner, or by
considering (A adj A)T = (adj A)T AT = (adj(AT ))AT = (det A)In .
Note that the coefficients of (adj A) are just given by polynomials in the entries
of A, and so is the determinant. So if A is invertible, then this result says that
its inverse is given by a rational function (ie. ratio of two polynomials) in the
entries of A. This is very useful theoretically, but not computationally, since
the polynomials are very large. There are better ways computationally, such as
Gaussian elimination.
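A short sympy sketch (added illustration with an arbitrary matrix) confirming the adjugate identity and the resulting inverse formula:

    # Sketch: A * adj(A) = (det A) * I_n = adj(A) * A, and A^{-1} = adj(A)/det(A) if det A != 0.
    import sympy as sp

    A = sp.Matrix([[2, 1, 0], [1, 3, 1], [0, 1, 2]])
    adjA = A.adjugate()
    assert A * adjA == A.det() * sp.eye(3) == adjA * A
    assert A.inv() == adjA / A.det()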

L. 4-82
Let A, B be square matrices. Then for any C, we have

    det ( A C ; 0 B ) = (det A)(det B).

Suppose A ∈ Mat_k(F) and B ∈ Mat_ℓ(F), so C ∈ Mat_{k,ℓ}(F). Let X = ( A C ; 0 B ). Then
by definition, we have

    det X = Σ_{σ∈S_{k+ℓ}} ε(σ) Π_{i=1}^{k+ℓ} X_{iσ(i)}.

If j ≤ k < i, then X_{ij} = 0. So we only want to sum over permutations σ such
that σ(i) > k whenever i > k. So we are permuting the last ℓ things among
themselves, and hence the first k things among themselves. So we can decompose
this into σ = σ_1 σ_2, where σ_1 is a permutation of {1, ..., k} and fixes the remaining
things, while σ_2 fixes {1, ..., k} and permutes the remaining. Then

    det X = Σ_{σ=σ_1σ_2} ε(σ_1σ_2) Π_{i=1}^k X_{iσ_1(i)} Π_{j=1}^ℓ X_{(k+j)σ_2(k+j)}
          = (Σ_{σ_1∈S_k} ε(σ_1) Π_{i=1}^k A_{iσ_1(i)}) (Σ_{σ_2∈S_ℓ} ε(σ_2) Π_{j=1}^ℓ B_{jσ_2(j)}) = (det A)(det B).

As a consequence, the determinant of a block upper triangular matrix with
diagonal blocks A_1, A_2, ..., A_n (and arbitrary entries above them) is Π_{i=1}^n det A_i.
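The block-triangular determinant rule is easy to check numerically; a sympy sketch (added, with arbitrary blocks):

    # Sketch: det of a block upper triangular matrix = product of dets of diagonal blocks.
    import sympy as sp

    A = sp.Matrix([[1, 2], [3, 4]])
    B = sp.Matrix([[5, 6], [7, 9]])
    C = sp.Matrix([[1, 0], [0, 1]])
    X = sp.Matrix(sp.BlockMatrix([[A, C], [sp.zeros(2, 2), B]]))
    assert X.det() == A.det() * B.det()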

4.6 Endomorphisms
D. 4-83
• If V is a (finite-dimensional) vector space over F. An endomorphism of V is a
linear map α : V → V . We write End(V ) for the F-vector space of all such linear
maps, and ι for the identity map V → V .

• Properties of endomorphisms that are not dependent on the basis we pick is known
as invariants .

• We say matrices A and B are similar or conjugate if there is some P invertible


such that B = P −1 AP .

• The trace of a matrix A ∈ Mat_n(F) is defined by tr A = Σ_{i=1}^n A_{ii}.

E. 4-84
Endomorphisms are linear maps from a vector space V to itself. Unlike when we
work with arbitrary linear maps where we are free to choose any basis for the
domain, and any basis for the co-domain, when working with endomorphisms, we
require ourselves to use the same basis for the domain and co-domain, and there
is much more we can study assuming this.

One major objective is to classify all matrices up to similarity, where two ma-
trices are similar if they represent the same endomorphism under different bases.
GLn (F), the group of invertible n × n matrices, acts on Matn (F) by conjugation:

(P, A) 7→ P · A = P AP −1 .

We are conjugating it this way so that the associativity axiom Q·(P ·A) = (Q·P )·A
holds (otherwise we get a right action instead of a left action). Then A and B are
similar iff they are in the same orbit. Since orbits always partition the set, this is
an equivalence relation. Our main goal is to classify the orbits, ie. find a “nice”
representative for each orbit.

L. 4-85
Suppose (e_1, ..., e_n) and (f_1, ..., f_n) are bases for V and α ∈ End(V). If A repre-
sents α with respect to (e_1, ..., e_n) and B represents α with respect to (f_1, ..., f_n),
then B = P⁻¹AP where P is given by f_i = Σ_{j=1}^n P_{ji} e_j.

A special case of [T.4-39] where we always use the same base for the domain and
co-domain.

L. 4-86
1. If A ∈ Matm,n (F) and B ∈ Matn,m (F), then tr AB = tr BA.
2. If A, B ∈ Matn (F) are similar, then tr A = tr B.
3. If A, B ∈ Matn (F) are similar, then det A = det B.

1.  tr AB = Σ_{i=1}^m (AB)_{ii} = Σ_{i=1}^m Σ_{j=1}^n A_{ij} B_{ji} = Σ_{j=1}^n Σ_{i=1}^m B_{ji} A_{ij} = tr BA.

2. Suppose B = P −1 AP , then tr B = tr(P −1 (AP )) = tr((AP )P −1 ) = tr A.

3. det(P −1 AP ) = det P −1 det A det P = (det P )−1 det A det P = det A.

This allows us to define the trace and determinant of an endomorphism since they
are invariant (i.e. independent on basis used). In fact we can define determinant
even without reference to a basis, by defining more general volume forms and
define the determinant as a scaling factor. The trace is slightly more tricky to
define without reference to a basis, but in fact it is the directional derivative of
the determinant at the origin.
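A quick sympy check (added illustration, arbitrary matrix and basis change) of these invariants:

    # Sketch: trace, determinant and characteristic polynomial are unchanged by conjugation.
    import sympy as sp

    t = sp.symbols('t')
    A = sp.Matrix([[1, 2, 0], [0, 3, 1], [1, 0, 1]])
    P = sp.Matrix([[1, 1, 0], [0, 1, 1], [0, 0, 1]])   # an invertible change of basis
    B = P.inv() * A * P
    assert sp.simplify(B.trace() - A.trace()) == 0
    assert sp.simplify(B.det() - A.det()) == 0
    assert sp.expand(B.charpoly(t).as_expr() - A.charpoly(t).as_expr()) == 0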

D. 4-87
Let V be a finite dimensional vector space and α ∈ End(V ).

• Let A be a matrix representing α under any basis. Then the trace of α is


tr α = tr A, and the determinant is det α = det A.

• Let α ∈ End(V). We call λ ∈ F an eigenvalue (or E-value) of α if there is
  some v ∈ V \ {0} such that α(v) = λv. And we call such a v an eigenvector.²
• When λ ∈ F, the λ-eigenspace , written Eα (λ) or E(λ) is the subspace of V
containing all the λ-eigenvectors, ie. Eα (λ) = ker(λι − α) where ι is the identity
function.
• The characteristic polynomial of α is defined by χα (t) = det(tι − α).3 For
A ∈ Matn (F), we can also define χA (t) = det(tI − A).
• We say α is diagonalizable if there is some basis for V such that α is represented
by a diagonal matrix, ie. all terms not on the diagonal are zero.
E. 4-88
Note that λ is an eigenvalue of α iff n(α − λι) > 0 iff r(α − λι) < dim V iff χα (λ) =
det(λι − α) = 0. So the eigenvalues are precisely the roots of the characteristic
polynomial.
L. 4-89
If A and B are similar, then they have the same characteristic polynomial.

det(tI − P −1 AP ) = det(P −1 (tI − A)P ) = det(tI − A).


L. 4-90
Let α ∈ L
End(V ) and λ1 , · · · , λk distinct eigenvalues of α. Then E(λ1 ) + · · · +
E(λk ) = ki=1 E(λi ) (i.e. E(λ1 )+· · ·+E(λk ) is a direct sum of E(λ1 ), · · · , E(λk )).

To prove the result we need to show that if ki=1 xi = ki=1 yi with xi , yi ∈ E(λi ),
P P
then xi = yi for all i. We are going to find some clever
Q map that tells us what xi
and yi are. Consider βj ∈ End(V ) defined by βj = r6=j (α − λr ι). Then

i
! k Y k Y
X X X Y
βj xk = (α − λr ι)(xi ) = (λi − λr )(xi ) = (λj − λr )(xj ).
i=1 i=1 r6=j i=1 r6=j r6=j

Pi Q
Similarly,
P P we obtain βj ( i=1
Q yk ) = − λr )(yj ). Since we Q
r6=j (λj Q know that
xi = yi , we must have r6=j (λj −λr )xj = r6=j (λj −λr )yj . Since r6=j (λr −
λj ) 6= 0, we must have xi = yi for all i.
The proof shows that any set of non-zero eigenvectors with distinct eigenvalues is
linearly independent.
T. 4-91
Let α ∈ End(V ) and λ1 , · · · , λk be distinct eigenvalues of α. Write Ei for E(λi ).
Then the following are equivalent:
i. α is diagonalizable.
ii. V has a basis of eigenvectors for α.
iii. V = ⊕_{i=1}^k E_i.
iv. dim V = Σ_{i=1}^k dim E_i.
[Footnote 2: Note that some authors in addition call the zero vector 0 an eigenvector.]
[Footnote 3: You might be used to the definition χ_α(t) = det(α − tι) instead. The two
definitions agree up to a sign, but this definition has the advantage that χ_α(t) is always
monic, ie. the leading coefficient is 1. However, when doing computations in practice, we
often use det(α − tι) instead, since it is easier to negate tι than α.]

i ⇔ ii: Suppose (e_1, ..., e_n) is a basis for V; then α(e_i) = Σ_j A_{ji} e_j where A represents
        α under the basis. A is diagonal iff each e_i is an eigenvector.

ii ⇔ iii: Statement ii holds iff Σ_i E_i = V, and we know Σ_{i=1}^k E_i = ⊕_{i=1}^k E_i.

iii ⇔ iv: (Forward) Let (e_1^{(i)}, ..., e_{dim E_i}^{(i)}) be a basis for E_i; then B = {e_j^{(i)} :
          i ∈ {1, ..., k}, j ∈ {1, ..., dim E_i}} is a basis for V = ⊕_{i=1}^k E_i, since any
          v ∈ V can be written uniquely as Σ_i v_i with v_i ∈ E_i, which in turn can be
          written uniquely as a linear combination of elements of B. Hence
          dim V = Σ_{i=1}^k dim E_i.

          (Backward) Σ_i E_i is a subspace of V, and Σ_i E_i = ⊕_{i=1}^k E_i, so dim(Σ_i E_i) =
          Σ_{i=1}^k dim E_i = dim V; hence we must have V = Σ_i E_i = ⊕_{i=1}^k E_i.

D. 4-92
• A polynomial over F is an object of the form f (t) = am tm + am−1 tm−1 + · · · +
a1 t + a0 with m ≥ 0, a0 , · · · , am ∈ F. We write F[t] for the set of polynomials over
F.4

• Let f ∈ F[t]. Then the degree of f , written deg f is the largest n such that
an 6= 0. In particular, deg 0 = −∞.

• Let f ∈ F[t]. We say λ is a root of f if f (λ) = 0. In addition λ has multiplicity


k if (t − λ)k is a factor of f but (t − λ)k+1 is not (ie. f (t) = (t − λ)k g(t) for some
g(t) ∈ F[t] with g(λ) 6= 0).
• Given f(t) = Σ_{i=0}^m a_i t^i ∈ F[t], A ∈ Mat_n(F) and α ∈ End(V), we can write
  f(A) = Σ_{i=0}^m a_i A^i or f(α) = Σ_{i=0}^m a_i α^i, where A^0 = I and α^0 = ι.

E. 4-93
Note that deg f g = deg f + deg g and deg(f + g) ≤ max{deg f, deg g}. This also
illustrate why it make sense to single out the 0 polynomials (with degree −∞)
from other constant polynomials (with degrees 0).

L. 4-94
<Polynomial division> If f, g ∈ F[t] with g 6= 0, then there exists q, r ∈ F[t]
with deg r < deg g such that f = qg + r.

We prove that given f, g ∈ F[t] with g 6= 0 and deg f ≥ deg g, there exists q, r ∈ F[t]
with deg r < deg f such that f = qg + r. And then we can just repeatedly apply
this result to get the stated result.

Let f(t) = Σ_{i=0}^{deg f} f_i t^i and similarly for g; then f = (f_{deg f}/g_{deg g}) t^{deg f − deg g} g + r
where r is a polynomial with deg r < deg f.

4
Note that we don’t identify a polynomial f with the corresponding function it represents. For
example, if F = Z/pZ, then tp and t are different polynomials, even though they define the same
function (by Fermat’s little theorem/Lagrange’s theorem). Two polynomials are equal if and only
if they have the same coefficients. However, we will later see that if F is R or C, then polynomials
are equal if and only if they represent the same function, and this distinction is not as important.

L. 4-95
1. If λ is a root of f ∈ F[t], then there is a polynomial g such that f(t) = (t − λ)g(t).
2. Any non-zero f ∈ F[t] can be written as f(t) = g(t) Π_{i=1}^k (t − λ_i)^{a_i} where
   λ_1, ..., λ_k are all distinct, a_i ≥ 1, and g is a polynomial with no roots in F.

1. By polynomial division, we have f(t) = (t − λ)g(t) + r(t) for some g(t), r(t) ∈
   F[t] with deg r < deg(t − λ) = 1. So r has to be constant, ie. r(t) = a_0 for
   some a_0 ∈ F. But 0 = f(λ) = (λ − λ)g(λ) + r(λ) = a_0. So r(t) = a_0 = 0.

2. We induct on the degree. The statement is true for any polynomial of degree 0.
   Suppose it is true for all polynomials of degree at most k, and let f be any
   polynomial of degree k + 1. If f has no roots, we are done. If f has a root λ,
   then by part 1, f(t) = (t − λ)g(t), and g has degree k. By the induction
   hypothesis, g, and hence f, can be written in the desired form.

L. 4-96
A non-zero polynomial f ∈ F[t] has at most deg f roots, counted with or without
multiplicity.

Using the above lemma, f(t) = g(t) Π_{i=1}^k (t − λ_i)^{a_i} where g is a polynomial with
no roots in F. Then f has k distinct roots, and Σ_{i=1}^k a_i roots counted with
multiplicity. Since deg f = deg g + Σ_{i=1}^k a_i ≥ Σ_{i=1}^k a_i ≥ k, the result follows.

L. 4-97
1. Let f, g ∈ F[t] have degree less than n. If there are λ1 , · · · , λn distinct such
that f (λi ) = g(λi ) for all i, then f = g (in the polynomial sense).
2. If F is infinite, then f = g if and only if they agree on all points.

1. Consider f −g. This has degree less than n, and (f −g)(λi ) = 0 for i = 1, · · · , n.
Since it has at least n ≥ deg(f − g) roots, we must have f − g = 0 and so
f = g.

2. The forward direction is obviously true. If f and g agrees on all points, pick n
such that n > deg f and n > deg g, then by the first part f = g.

T. 4-98
<Fundamental theorem of algebra> Every non-constant polynomial over C
has a root in C.

We will not prove this here; a proof is given in [T.10-50]. Because of this result
we say C is an algebraically closed field.

In fact it follows from this result that every polynomial over C of degree n > 0 has
precisely n roots, counted with multiplicity, since if we write f(t) = g(t) Π (t − λ_i)^{a_i}
and g has no roots, then g is constant. So the number of roots is Σ a_i = deg f,
counted with multiplicity.

It also follows that every polynomial in R factors into linear polynomials and
quadratic polynomials with no real roots since complex roots of real polynomials
come in complex conjugate pairs.

T. 4-99
<Diagonalizability theorem> Suppose α ∈ End(V ). Then α is diagonalizable
if and only if there exists non-zero p(t) ∈ F[t] that can be expressed as a product
of distinct linear factors such that p(α) = 0.

(Forward) Suppose α is diagonalizable. Let λ_1, ..., λ_k be the distinct eigenvalues
of α. We have V = ⊕_{i=1}^k E(λ_i) by [T.4-91]. So each v ∈ V can be written (uniquely)
as v = Σ_i v_i with v_i ∈ E(λ_i). Now let p(t) = Π_{i=1}^k (t − λ_i). Then for any v, we get

    p(α)(v) = Σ_{i=1}^k p(α)(v_i) = Σ_{i=1}^k p(λ_i) v_i = 0.

So p(α) = 0. By construction, p is a product of distinct linear factors.

(Backward) Conversely, suppose there exists p(t) = Π_{i=1}^k (t − λ_i) with λ_1, ..., λ_k ∈ F
distinct, and p(α) = 0 (we can wlog assume p is monic, ie. the leading coefficient
is 1). By [T.4-91] and [L.4-90] we just need to show that V = Σ_{i=1}^k E_α(λ_i). In
other words, we want to show that for all v ∈ V, there is some v_i ∈ E_α(λ_i) for
i = 1, ..., k such that v = Σ_i v_i. To find these v_i, we let

    q_j(t) = Π_{i≠j} (t − λ_i)/(λ_j − λ_i).

This is a polynomial of degree k − 1, and q_j(λ_i) = δ_{ij}. Now consider q(t) =
Σ_{j=1}^k q_j(t). We still have deg q ≤ k − 1, but q(λ_i) = 1 for every i. Since q and 1
agree on k points, we must have q = 1. Let π_j : V → V be given by π_j = q_j(α).
Then the above says that Σ_{j=1}^k π_j = ι. Hence given v ∈ V, we know that
v = Σ_j π_j v. We now check that π_j v ∈ E_α(λ_j). This is true since

    (α − λ_j ι) π_j v = (Π_{i=1}^k (α − λ_i ι))(v) / Π_{i≠j}(λ_j − λ_i) = p(α)(v) / Π_{i≠j}(λ_j − λ_i) = 0.

In the above proof, if v ∈ E_α(λ_i), then π_j(v) = δ_{ij} v. So π_j is the projection onto
E_α(λ_j).
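As a small illustration of the theorem (an added sympy sketch), any idempotent matrix is killed by p(t) = t(t − 1), a product of distinct linear factors, and is therefore diagonalizable:

    # Sketch: A^2 = A means p(A) = 0 for p(t) = t(t - 1), so A is diagonalizable.
    import sympy as sp

    A = sp.Matrix([[1, 1, 0], [0, 0, 0], [0, 0, 1]])   # an idempotent (a projection)
    assert A**2 == A                                   # so A(A - I) = 0
    assert A.is_diagonalizable()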
D. 4-100
The minimal polynomial of α ∈ End(V ) is the non-zero monic polynomial Mα (t)
of least degree such that Mα (α) = 0.
E. 4-101
The monic requirement is just for things to look nice, since we can always divide
by the leading coefficient of a polynomial to get a monic version.
Note that if A represents α, then for all p ∈ F[t], p(A) represents p(α). In this case
p(α) = 0 iff p(A) = 0. So the minimal polynomial of α is the minimal polynomial
of A if we define MA analogously.
There are two things we want to know — whether the minimal polynomial exists,
and whether it is unique. Existence is always guaranteed in finite-dimensional
cases: if dim V = n < ∞, then dim End(V) = n², so ι, α, α², ..., α^{n²} are linearly
dependent. So there are some λ_0, ..., λ_{n²} ∈ F, not all zero, such that Σ_{i=0}^{n²} λ_i α^i = 0.
So deg M_α ≤ n², and a minimal polynomial exists.

L. 4-102
Let α ∈ End(V ), and p ∈ F[t]. Then p(α) = 0 if and only if Mα (t) is a factor of
p(t). In particular, Mα is unique.

For all such p, we can write p(t) = q(t)Mα (t) + r(t) for some r of degree less than
deg Mα . Then r(α) = 0 iff p(α) = 0. But deg r < deg Mα . By the minimality of
Mα , we must have r(α) = 0 iff r = 0. So p(α) = 0 iff Mα (t) | p(t).
So if M1 and M2 are both minimal polynomials for α, then M1 | M2 and M2 | M1 .
So M2 is just a scalar multiple of M1 . But since M1 and M2 are monic, they must
be equal.
E. 4-103
Let V = F2 , and consider the matrices A = ( 10 01 ) and B = ( 10 11 ). Consider the
polynomial p(t) = (t − 1)2 . We can compute p(A) = p(B) = 0. So MA (t) and
MB (t) are factors of (t−1)2 . There aren’t many factors of (t−1)2 . So the minimal
polynomials are either (t − 1) or (t − 1)2 . Since A − I = 0 and B − I 6= 0, the
minimal polynomial of A is t − 1 and the minimal polynomial of B is (t − 1)2 .
T. 4-104
<Diagonalizability theorem v2> Let α ∈ End(V ). Then α is diagonalizable
if and only if Mα (t) is a product of distinct linear factors.

(Forward) α is diagonalizable, so there is some p ∈ F[t] non-zero such that p(α) = 0


and p is a product of distinct linear factors. Since Mα divides p, Mα also has
distinct linear factors.
(Backward) This follows directly from the previous diagonalizability theorem.
T. 4-105
Let α, β ∈ End(V ) be both diagonalizable. Then α and β are simultaneously
diagonalizable (ie. there exists a basis with respect to which both are diagonal) if
and only if αβ = βα.

(Forward) If there exists a basis (e1 , · · · , en ) for V such that α and β are repre-
sented by A and B respectively, with both diagonal, then by direct computation,
AB = BA. But AB represents αβ and BA represents βα. So αβ = βα.
(Backward) Suppose αβ = βα. The idea is to consider each eigenspace of α
individually, and then diagonalize β in each of the eigenspaces. Since α is diago-
nalizable, V = ⊕_{i=1}^k E_α(λ_i) where λ_i are the different eigenvalues of α. Write E_i
for E_α(λ_i).
We now show that β(Ei ) ⊆ Ei . Let v ∈ Ei , then α(β(v)) = β(α(v)) = β(λi v) =
λi β(v). So β(v) is an eigenvector of α with eigenvalue λi , hence β(v) ∈ Ei .
Thus we can view β|Ei ∈ End(Ei ). Note that Mβ (β|Ei ) = Mβ (β)|Ei = 0. Since
Mβ (t) is a product of distinct linear factors (as β is diagonalizable), it follows that
β|Ei is diagonalizable for each Ei , and we can choose a basis Bi of Ei consist of
eigenvectors of β|Ei which must also be eigenvectors of β.
Then since V is a direct sum of the E_i's, we know that B = ∪_{i=1}^k B_i is a basis for
V consisting of eigenvectors for both α and β.
This result is important in quantum mechanics. This means that if two operators
do not commute, then they do not have a common eigenbasis. Hence we have the
uncertainty principle.

D. 4-106
An endomorphism α ∈ End(V ) is triangulable if there is a basis for V such that
α is represented by an upper triangular matrix (ie. Aij = 0 for all i > j).

L. 4-107
An endomorphism α is triangulable if and only if χα (t) can be written as a prod-
uct of linear factors, not necessarily distinct. In particular, if F = C (or any
algebraically closed field), then every endomorphism is triangulable.

(Forward) Suppose that α is triangulable and represented by an upper triangular
matrix with diagonal entries λ_1, λ_2, ..., λ_n. Then tι − α is represented by an upper
triangular matrix with diagonal entries t − λ_1, ..., t − λ_n, so

    χ_α(t) = det(tι − α) = Π_{i=1}^n (t − λ_i).

So χ_α(t) is a product of linear factors.

(Backward) We are going to prove the converse by induction on the dimension


of our space. The base case dim V = 1 is trivial, since every 1 × 1 matrix is
already upper triangular. Suppose α ∈ End(V ) and the result holds for all spaces
of dimensions < dim V , and χα is a product of linear factors. In particular, χα (t)
has a root, say λ ∈ F.

Let U = E(λ), and let W be a complementary subspace to U in V , ie. V =


U ⊕ W . Let u1 , · · · , ur be a basis for U and wr+1 , · · · , wn be a basis for W so
that u1 , · · · , ur , wr+1 , · · · , wn is a basis for V , then α is represented by
 
λIr stuff
0 B

We know χα (t) = (t − λ)r χB (t). So χB (t) is also a product of linear factors


(since χα (t) is). We let β : W → W be the map defined by B with respect to
wr+1 , · · · , wn .5

Since dim W < dim V, by the induction hypothesis there is a basis v_{r+1}, ..., v_n for
W such that β is represented by an upper triangular matrix C. For j = 1, ..., n − r,
we have α(v_{j+r}) = u + Σ_{k=1}^{n−r} C_{kj} v_{k+r} for some u ∈ U. So α is represented by

    ( λI_r  stuff ; 0  C )

with respect to (u_1, ..., u_r, v_{r+1}, ..., v_n), which is upper triangular.

When we let β : W → W be the map defined by B with respect to wr+1 , · · · , wn ,


we ignore all the stuff happening to u1 , · · · , ur . Alternatively, this is equivalent
to having a map β : V /U → V /U given by β(v + U ) = α(v) + U . In general if
we have α ∈ End V and U ≤ V such that α(U ) ≤ U , then χα = χα|U χᾱ where
ᾱ : V /U → V /U is given by ᾱ(v + U ) = α(v) + U .

5
Note that in general, β is not α|W in general, since α does not necessarily map W to W (as
can be seen from the “stuff” in the matrix above). However, we can say that (α − β)(w) ∈ U for
all w ∈ W . This can be much more elegantly expressed in terms of quotient spaces.

E. 4-108
Consider the real rotation matrix

    ( cos θ  sin θ ; −sin θ  cos θ ).
This is not similar to a real upper triangular matrix (if θ is not an integer multiple
of π). This is since the eigenvalues are e±iθ and are not real. On the other
hand, as a complex matrix, it is triangulable, and in fact diagonalizable since the
eigenvalues are distinct. For this reason, in the rest of the section, we are mostly
going to work in C. We can now prove the Cayley-Hamilton theorem.
T. 4-109
<Cayley-Hamilton theorem> Let V be a finite-dimensional vector space with
dimension n, and α ∈ End(V ) with characteristic polynomial χα . Then χα (α) = 0,
ie. Mα (t) | χα (t). In particular, deg Mα ≤ n.

(Proof 1) Assume V is over C. By the lemma, we can choose a basis {e_1, ..., e_n}
with respect to which α is represented by an upper triangular matrix A, say with
diagonal entries λ_1, ..., λ_n (and arbitrary entries above the diagonal).

We must prove that χ_α(α) = χ_A(α) = Π_{i=1}^n (α − λ_i ι) = 0. Write V_j = ⟨e_1, ..., e_j⟩.
So we have the inclusions 0 = V_0 ⊆ V_1 ⊆ ··· ⊆ V_{n−1} ⊆ V_n = V. We also know
that dim V_j = j. Such an increasing sequence is known as a flag. Now note that
since A is upper triangular, we get α(e_i) = Σ_{k=1}^i A_{ki} e_k ∈ V_i. So α(V_j) ⊆ V_j for
all j = 0, ..., n. Moreover, we have

    (α − λ_j ι)(e_j) = Σ_{k=1}^{j−1} A_{kj} e_k ∈ V_{j−1}

for all j = 1, ..., n. So every time we apply one of these factors, we land in a
smaller space. Hence by induction on n − j, we have Π_{i=j}^n (α − λ_i ι)(V_n) ⊆ V_{j−1}.
In particular, when j = 1, we get χ_α(α) = 0 since

    Π_{i=1}^n (α − λ_i ι)(V) ⊆ V_0 = 0.

Hence we have the desired result for V over C. Now suppose V is over a field F
which is not C but a subfield of C (for example R). Say α is represented by
B ∈ Mat_n(F) over some basis; we can view B as an element of Mat_n(C) to see
that χ_B(B) = 0. But χ_α(α) = χ_B(α) is represented by χ_B(B) and so is 0.

Hence we have the desired result for V that is over C. Now suppose V is over a
field F, which is not C but a subfield of C (for example R). Say α is represented
by B ∈ Matn (F) over some basis, we can view B as an element of Matn (C) to see
that χB (B) = 0. But χα (α) = χB (α) is represented by χB (B) and so is 0.
(Proof 2) Let α be represented by A, and set B = tI_n − A. Then B adj B = (det B) I_n =
χ_α(t) I_n. But we know that adj B is a matrix with entries in F[t] of degree at most
n − 1. So we can write adj B = B_{n−1} t^{n−1} + B_{n−2} t^{n−2} + ··· + B_0 with B_i ∈ Mat_n(F).
We can also write χ_α(t) = a_n t^n + a_{n−1} t^{n−1} + ··· + a_0. Then we get

    (tI_n − A)(B_{n−1} t^{n−1} + B_{n−2} t^{n−2} + ··· + B_0) = (a_n t^n + a_{n−1} t^{n−1} + ··· + a_0) I_n

from B adj B = χ_α(t) I_n. Both sides are equal as polynomials, so the coefficients
on both sides must be equal. Equating coefficients of t^k gives a_k I_n = B_{k−1} − A B_k,
where we let B_{−1} = B_n = 0. So

    χ_α(A) = Σ_{k=0}^n a_k A^k = Σ_{k=0}^n (A^k B_{k−1} − A^{k+1} B_k) = A^0 B_{−1} − A^{n+1} B_n = 0,

since the sum telescopes.

................................................................................
It is tempting to prove this result by substituting t = α into det(tι − α) and
getting det(α − α) = 0, but this is meaningless: what the statement χ_α(t) =
det(tι − α) tells us to do is to expand the determinant of the matrix tI_n − A,
whose diagonal entries are t − a_{11}, ..., t − a_{nn} and whose (i, j) entry for i ≠ j is −a_{ij},
to obtain a polynomial, and we clearly cannot substitute t = A in this expression.
Similarly in proof 2 we also cannot substitute t = A in (tI_n − A)(B_{n−1} t^{n−1} +
B_{n−2} t^{n−2} + ··· + B_0) = (a_n t^n + a_{n−1} t^{n−1} + ··· + a_0) I_n to get the result, since t
is assumed to be a scalar and tI_n is the scalar multiplication of I_n by t.

Note also that if ρ(t) ∈ F[t] and A is a diagonal matrix with diagonal entries
λ_1, ..., λ_n, then ρ(A) is a diagonal matrix with diagonal entries ρ(λ_1), ..., ρ(λ_n).
If α can be represented by the diagonal matrix A, then χ_α(t) = Π_{i=1}^n (t − λ_i), and it
follows that χ_α(A) = 0. So if α is diagonalizable, then the theorem follows easily.
................................................................................
We can see proof 1 more “visually” as follows: for simplicity of expression, suppose
n = 4. In the basis where α is upper triangular, each matrix A − λ_i I is upper
triangular with a zero in its ith diagonal entry. Multiplying the factors out directly
(from the right), one checks that the image of the product of the last k factors is
contained in the span of the first 4 − k basis vectors, and in particular
Π_{i=1}^4 (A − λ_i I) = 0. This is exactly what we showed in the proof — after multiplying
out the first k elements of the product (counting from the right), the image is
contained in the span of the first n − k basis vectors.
L. 4-110
Let α ∈ End(V ), λ ∈ F. Then
(i)[[ λ is an eigenvalue of α ]]⇔(ii)[[ λ is a root of χα (t) ]]⇔(iii)[[ λ is a root of
Mα (t) ]].

(i) ⇔ (ii): λ is an eigenvalue of α iff (α − λι)(v) = 0 has a non-trivial root iff


det(α − λι) = 0.

(iii) ⇒ (ii): This follows from Cayley-Hamilton theorem since Mα | χα .


(i) ⇒ (iii): Let λ be an eigenvalue, and v be a corresponding eigenvector. Then
by definition of Mα , Mα (α)(v) = 0(v) = 0. But also Mα (α)(v) =
Mα (λ)v. Since v is non-zero, we must have Mα (λ) = 0.
(iii) ⇒ (i): (This is not necessary since it follows from the above, but we could
as well do it explicitly.) Suppose λ is a root of Mα (t). Then Mα (t) =
(t−λ)g(t) for some g ∈ F[t]. But deg g < deg Mα . Hence by minimality
of Mα , we must have g(α) 6= 0. So there is some v ∈ V such that
g(α)(v) 6= 0. But (α − λι)(g(α)(v)) = Mα (α)v = 0, so g(α)(v) is an
eigenvector of α with eigenvalue λ.
E. 4-111
What is the minimal polynomial of A = ( 1 0 −2 ; 0 1 1 ; 0 0 2 ), and is A diagonalizable?

We can compute χA (t) = (t − 1)2 (t − 2). By Cayley–Hamilton MA (t) is a factor of


(t − 1)2 (t − 2). Moreover by the above lemma it must be a multiple of (t − 1)(t − 2).
So the minimal polynomial is either (t − 1)2 (t − 2) or (t − 1)(t − 2). By direct
computations, we can find (A − I)(A − 2I) = 0, so MA (t) = (t − 1)(t − 2). Also
this means A is diagonalizable by [T.4-104].
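The computation in this example can be reproduced with sympy (an added sketch):

    # Sketch: chi_A(t) = (t-1)^2 (t-2) but already (A - I)(A - 2I) = 0, so M_A(t) = (t-1)(t-2).
    import sympy as sp

    t = sp.symbols('t')
    A = sp.Matrix([[1, 0, -2], [0, 1, 1], [0, 0, 2]])
    assert sp.expand(A.charpoly(t).as_expr() - (t - 1)**2 * (t - 2)) == 0
    assert (A - sp.eye(3)) * (A - 2 * sp.eye(3)) == sp.zeros(3, 3)
    assert A.is_diagonalizable()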
D. 4-112
• Let α ∈ End(V ) and λ an eigenvalue of α.
 The algebraic multiplicity of λ, written aλ , is the multiplicity of λ as a root of
χα (t).
 The geometric multiplicity of λ, written gλ , is the dimension of the corre-
sponding eigenspace, dim Eα (λ).
 cλ is the multiplicity of λ as a root of the minimal polynomial Mα (t).
• Define the m × m matrix J_m(λ) by (J_m(λ))_{i,i} = λ, (J_m(λ))_{i,i+1} = 1 and all other
  entries 0; that is, λ along the diagonal and 1 along the superdiagonal. These
  matrices are called Jordan blocks. We say a matrix A ∈ Mat_n(C) is in Jordan
  normal form if it is a block diagonal matrix

      A = diag(J_{n_1}(λ_1), J_{n_2}(λ_2), ..., J_{n_k}(λ_k)),

  where k ≥ 1, n_1, ..., n_k ∈ N are such that n = Σ n_i, and λ_1, ..., λ_k ∈ C are not
  necessarily distinct.
E. 4-113
• For the single Jordan block A = J_n(λ), λ is an eigenvalue with a_λ = n = c_λ and g_λ = 1.
• For A = λI, λ is an eigenvalue with a_λ = g_λ = n and c_λ = 1.
• Note that J_m(λ) = λI_m + J_m(0).

• For example, a block diagonal matrix with diagonal blocks J_3(λ_1), J_1(λ_2), J_2(λ_3), ..., J_2(λ_n)
  (and zeros everywhere else) is in Jordan normal form.

L. 4-114
If λ is an eigenvalue of α, then 1.[[ 1 ≤ gλ ≤ aλ ]] and 2.[[ 1 ≤ cλ ≤ aλ ]].

1. The first inequality is easy. If λ is an eigenvalue, then E(λ) ≠ 0. So g_λ =
   dim E(λ) ≥ 1. To prove the other inequality, if v_1, ..., v_g is a basis for E(λ),
   then we can extend it to a basis for V, and then α is represented by
   ( λI_g  stuff ; 0  B ). So χ_α(t) = (t − λ)^g χ_B(t). So a_λ ≥ g = g_λ.

2. This is straightforward: M_α(λ) = 0 implies 1 ≤ c_λ, and since M_α(t) | χ_α(t),
   we know that c_λ ≤ a_λ.
L. 4-115
Suppose F = C and α ∈ End(V ). Then the following are equivalent:
i. α is diagonalizable.
ii. gλ = aλ for all eigenvalues of α.
iii. cλ = 1 for all λ.
(i) ⇔ (ii): α is diagonalizable iff dim V = Σ_i dim E_α(λ_i). But this is equivalent to

    dim V = Σ_i g_{λ_i} ≤ Σ_i a_{λ_i} = deg χ_α = dim V.

So we must have Σ_i g_{λ_i} = Σ_i a_{λ_i}. Since each g_{λ_i} is at most a_{λ_i}, they
must be individually equal.
(i) ⇔ (iii): α is diagonalizable if and only if Mα (t) is a product of distinct linear
factors if and only if cλ = 1 for all eigenvalues λ.
T. 4-116
Every matrix A ∈ Matn (C) is similar to a matrix in Jordan normal form. More-
over, this Jordan normal form matrix is unique up to permutation of the blocks.

This is a complete solution to the classification problem of matrices, at least in


C. In this chapter we will not prove this result completely, we will only prove
the uniqueness part, and then reduce the existence part to a special form of en-
domorphisms. The existence part is proved in [T.9-154] in Groups, Rings and
Modules.
We can rephrase this result using linear maps. If α ∈ End(V ) is an endomorphism
of a finite-dimensional vector space V over C, then the theorem says there exists
a basis such that α is represented by a matrix in Jordan normal form, and this is
unique as before.
Note that the permutation thing is necessary, since if two matrices in Jordan
normal form differ only by a rearrangement of blocks, then they are similar, by
permuting the basis.

E. 4-117
Every 2 × 2 matrix in Jordan normal form is one of three types, with λ, µ ∈ C distinct:

    Jordan normal form          χ_A                 M_A
    diag(λ, λ)                  (t − λ)²            (t − λ)
    diag(λ, µ)                  (t − λ)(t − µ)      (t − λ)(t − µ)
    J_2(λ) = ( λ 1 ; 0 λ )      (t − λ)²            (t − λ)²

We see that M_A determines the Jordan normal form of A, but χ_A does not.

Every 3 × 3 matrix in Jordan normal form is one of six types. Here λ_1, λ_2 and
λ_3 are distinct complex numbers.

    Jordan normal form          χ_A                              M_A
    diag(λ_1, λ_2, λ_3)         (t − λ_1)(t − λ_2)(t − λ_3)      (t − λ_1)(t − λ_2)(t − λ_3)
    diag(λ_1, λ_1, λ_2)         (t − λ_1)²(t − λ_2)              (t − λ_1)(t − λ_2)
    diag(J_2(λ_1), λ_2)         (t − λ_1)²(t − λ_2)              (t − λ_1)²(t − λ_2)
    diag(λ_1, λ_1, λ_1)         (t − λ_1)³                       (t − λ_1)
    diag(J_2(λ_1), λ_1)         (t − λ_1)³                       (t − λ_1)²
    J_3(λ_1)                    (t − λ_1)³                       (t − λ_1)³

Notice that χ_A and M_A together determine the Jordan normal form of a 3 × 3
complex matrix. We do indeed need χ_A in the 3 × 3 case, since if we are given
M_A = (t − λ_1)(t − λ_2), we know one of the roots is double, but not which one. In
general, however, even χ_A and M_A together do not suffice.
C. 4-118
<Jordan normal form and g_λ, a_λ, c_λ> If (e_1, ..., e_n) is the standard basis for
C^n, we have J_n(0)(e_1) = 0 and J_n(0)(e_i) = e_{i−1} for 2 ≤ i ≤ n. Thus we know

    J_n(0)^k (e_i) = 0 if i ≤ k,  and  J_n(0)^k (e_i) = e_{i−k} if k < i ≤ n.

Since J_n(λ) = λI + J_n(0), for k < n we have

    (J_n(λ) − λI)^k = J_n(0)^k = ( 0  I_{n−k} ; 0  0 ),

and if k ≥ n then (J_n(λ) − λI)^k = 0. Hence n((J_m(λ) − λI_m)^r) = min{r, m}.
And so for A = J_n(λ), we have χ_A(t) = M_A(t) = (t − λ)^n. So λ is the only
eigenvalue of A. Just to be clear, writing the algebraic multiplicity of λ for A = J_n(λ)
as a_{J_n(λ)} (instead of just a_λ) etc., we have (∗)[[ a_{J_n(λ)} = n ]] and (†)[[ c_{J_n(λ)} = n ]]. We
also know that n(A − λI) = n − r(A − λI) = 1. So (‡)[[ g_{J_n(λ)} = 1 ]].

Recall that a general Jordan normal form is a block diagonal matrix of Jordan
blocks. We have just studied individual Jordan blocks. Next, we want to look at
some properties of block diagonal matrices in general. If A is the block diagonal
matrix with diagonal blocks A_1, A_2, ..., A_k, then

    χ_A(t) = Π_{i=1}^k χ_{A_i}(t).    (∗)

Moreover, if ρ ∈ F[t], then ρ(A) is block diagonal with diagonal blocks ρ(A_1), ..., ρ(A_k), and so

    M_A(t) = lcm(M_{A_1}(t), ..., M_{A_k}(t))    (†)        n(ρ(A)) = Σ_{i=1}^k n(ρ(A_i))    (‡)

The above tells us that if A is in Jordan normal form, we get the following:
(‡): g_λ is the number of Jordan blocks in A with eigenvalue λ.
(∗): a_λ is the sum of the sizes of the Jordan blocks of A with eigenvalue λ.
(†): c_λ is the size of the largest Jordan block with eigenvalue λ.

T. 4-119
Let α ∈ End(V ), and A in Jordan normal form representing α. Then the number
of Jordan blocks Jn (λ) in A with n ≥ r is n((α − λι)r ) − n((α − λι)r−1 ).

Write A = diag(J_{n_1}(λ_1), J_{n_2}(λ_2), ..., J_{n_k}(λ_k)). We have previously computed
n((J_m(λ) − λI_m)^r) = min{r, m}, so

    n((J_m(λ) − λI_m)^r) − n((J_m(λ) − λI_m)^{r−1}) = 1 if r ≤ m, and 0 if r > m.

It is also easy to see that for µ ≠ λ, n((J_m(µ) − λI_m)^r) = n(J_m(µ − λ)^r) = 0.
Adding up for each block, for r ≥ 1, we have

    n((α − λι)^r) − n((α − λι)^{r−1}) = number of Jordan blocks J_n(λ) with n ≥ r.

n((α − λι)r ) − n((α − λι)r−1 ) is independent on basis. So with this result, for any
λ ∈ {λi : i} we can figure out how many Jordan blocks of size exactly n by doing
the right subtraction. And this is true for all Jordan normal forms that represents
α. Hence this tells us that Jordan normal forms are unique up to permutation of
blocks.
We can interpret this result as follows: if r ≤ m, when we take an additional
power of J_m(λ) − λI_m, we get from ( 0  I_{m−r} ; 0  0 ) to ( 0  I_{m−r−1} ; 0  0 ). So we kill
off one more column in the matrix, and the nullity increases by one. This happens
until (J_m(λ) − λI_m)^r = 0, in which case increasing the power no longer affects the
matrix. So when we look at the difference in nullity, we are counting the number
of blocks that are affected by the increase in power, which is the number of blocks
of size at least r.
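Here is a small sympy sketch (added; the block sizes are an arbitrary example) of the nullity count from [T.4-119]:

    # Sketch: number of Jordan blocks J_n(lambda) with n >= r equals
    # n((A - lambda I)^r) - n((A - lambda I)^{r-1}).
    import sympy as sp

    def nullity(M):
        return M.cols - M.rank()

    J3_2 = sp.Matrix([[2, 1, 0], [0, 2, 1], [0, 0, 2]])     # J_3(2)
    J1_2 = sp.Matrix([[2]])                                 # J_1(2)
    J2_5 = sp.Matrix([[5, 1], [0, 5]])                      # J_2(5)
    A = sp.diag(J3_2, J1_2, J2_5)
    I = sp.eye(6)
    counts = [nullity((A - 2*I)**r) - nullity((A - 2*I)**(r - 1)) for r in (1, 2, 3)]
    assert counts == [2, 1, 1]   # two blocks of size >= 1, one of size >= 2, one of size >= 3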
We have now proved uniqueness, but existence is not yet clear. To show this, we
will reduce it to the case where there is exactly one eigenvalue. This reduction
is easy if the matrix is diagonalizable, because we can decompose the matrix into
each eigenspace and then work in the corresponding eigenspace. In general, we
need to work with “generalized eigenspaces”.
T. 4-120
Let V be a finite-dimensional vector space over C and α ∈ End(V). Write M_α(t) =
Π_{i=1}^k (t − λ_i)^{c_i} with λ_1, ..., λ_k ∈ C distinct. Then V = V_1 ⊕ ··· ⊕ V_k where
V_i = ker((α − λ_i ι)^{c_i}).

Let p_j(t) = Π_{i≠j} (t − λ_i)^{c_i}. Then p_1, ..., p_k have no common factor, that is
hcf(p_1, ..., p_k) = 1. Then by Euclid's algorithm, there exist q_1, ..., q_k ∈ C[t]
such that Σ_i p_i q_i = 1.

We can apply Euclid's algorithm because of [L.4-94]. Back substitution in Euclid's
algorithm tells us that given b, c, there exist β, γ such that bβ + cγ = hcf(b, c). Now
given b, c, w, there exist β′, γ′, ω such that bβ′ + cγ′ + wω = hcf(b, c, w), because
there exist η, ω such that η(bβ + cγ) + ωw = hcf(hcf(b, c), w) = hcf(b, c, w). This
generalises to our claim.

We now define the endomorphism π_j = q_j(α) p_j(α) for j = 1, ..., k. We have
Σ_j π_j = ι. Below we will prove that Im π_j ⊆ V_j ⊆ ker(Σ_{i≠j} π_i) ⊆ Im π_j and hence
Im π_j = V_j for all j.
• Since M_α(α) = 0 and M_α(t) = (t − λ_j)^{c_j} p_j(t), we have (α − λ_j ι)^{c_j} π_j = 0. So
  Im π_j ⊆ V_j.
• Since (α − λ_j ι)^{c_j} is a factor of π_i for every i ≠ j (and factors of the form (α − λι)
  commute), we get V_j ⊆ ker(Σ_{i≠j} π_i).
• If v ∈ ker(Σ_{i≠j} π_i), then v = (Σ_i π_i)v = π_j(v), so ker(Σ_{i≠j} π_i) ⊆ Im π_j.

Also note that π_i π_j = 0 for i ≠ j, since the product contains M_α(α) as a factor. So
π_i² = ιπ_i = (Σ_j π_j)π_i = π_i. Given any v ∈ V, we have v = ι(v) = Σ_{j=1}^k π_j(v) ∈
Σ_j Im π_j. On the other hand, if v = Σ_j π_j(u_j), then applying π_i to both sides
gives π_i(v) = π_i²(u_i) = π_i(u_i). Hence there is a unique way of writing v as a sum
of elements of the Im π_j. So V = ⊕_j Im π_j = ⊕_j V_j.
We call Vi = ker((α − λi ι)ci ) the generalized eigenspace .
This allows us to decompose V into a block diagonal matrix, and then each block
will only have one eigenvalue. Note that if c1 = · · · = ck = 1, then we recover the
diagonalizability theorem.
Note that we didn’t really use the fact that the vector space is over C, except to get
that the minimal polynomial is a product of linear factors. In fact, for arbitrary
vector spaces, if the minimal polynomial of a matrix is a product of linear factors,
then it can be put into Jordan normal form. The converse is also true — if it
can be put into Jordan normal form, then the minimal polynomial is a product
of linear factors, since we’ve seen that a necessary and sufficient condition for the

minimal polynomial to be a product of linear factors is for there to be a basis in


which the matrix is upper triangular.
Using this theorem, by restricting α to its generalized eigenspaces, we can reduce
the existence part of the Jordan normal form theorem to the case Mα (t) = (t−λ)c .
Further by replacing α by α − λι, we can reduce this to the case where 0 is the
only eigenvalue.
We say α ∈ End(V ) is nilpotent if there is some r such that αr = 0. Note that
by [L.4-102] we see that α is nilpotent iff the minimal polynomial is tr for some
r. So using [L.4-110] we see that over C, α is nilpotent iff the only eigenvalue of
α is 0. We’ve now reduced the problem of classifying complex endomorphisms to
the classifying nilpotent endomorphisms.
E. 4-121
Consider the matrix A = ( 3 −2 0 ; 1 0 0 ; 1 0 1 ). We know we can find the Jordan
normal form by just computing the minimal polynomial and the characteristic
polynomial. But we can do better and try to find a P such that P⁻¹AP is in
Jordan normal form.

We first compute the eigenvalues of A. The characteristic polynomial is

    χ_A(t) = det(tI − A) = det ( t−3  2  0 ; −1  t  0 ; −1  0  t−1 )
           = (t − 1)((t − 3)t + 2) = (t − 1)²(t − 2).

We now compute the eigenspaces of A. We have

    A − I = ( 2 −2 0 ; 1 −1 0 ; 1 0 0 ),

which has rank 2 and hence nullity 1, and the eigenspace is its kernel: E_A(1) = ⟨(0, 0, 1)^T⟩.
Similarly we compute the other eigenspace:

    A − 2I = ( 1 −2 0 ; 1 −2 0 ; 1 0 −1 ),    so    E_A(2) = ⟨(2, 1, 2)^T⟩.

Since dim E_A(1) + dim E_A(2) = 2 < 3, A is not diagonalizable. So the minimal
polynomial must also be M_A(t) = χ_A(t) = (t − 1)²(t − 2). From [E.4-117] we know
that A is similar to the Jordan normal form matrix J = ( 1 1 0 ; 0 1 0 ; 0 0 2 ).

We now want to compute a basis that transforms A to this. We want a basis
(v_1, v_2, v_3) of C³ such that Av_1 = v_1, Av_2 = v_1 + v_2 and Av_3 = 2v_3. Equivalently,
this is

    (A − I)v_1 = 0,    (A − I)v_2 = v_1,    (A − 2I)v_3 = 0.

There is an obvious choice for v_3, namely an eigenvector of eigenvalue 2. To find
v_1 and v_2, the idea is to find some v_2 such that (A − I)v_2 ≠ 0 but (A − I)²v_2 = 0.
Then we can let v_1 = (A − I)v_2. We can compute the kernel of (A − I)²:

    (A − I)² = ( 2 −2 0 ; 1 −1 0 ; 2 −2 0 ),    so    ker(A − I)² = ⟨(0, 0, 1)^T, (1, 1, 0)^T⟩.

We need to pick our v_2 in this kernel but not in the kernel of A − I (which is the
eigenspace E_A(1) we computed above). So we have v_2 = (1, 1, 0)^T, v_1 = (0, 0, 1)^T
and v_3 = (2, 1, 2)^T. Hence we have

    P = ( 0 1 2 ; 0 1 1 ; 1 0 2 )    and    P⁻¹AP = ( 1 1 0 ; 0 1 0 ; 0 0 2 ).
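The whole computation above can be cross-checked with sympy's built-in Jordan form (added sketch; note that sympy may order the blocks and the columns of P differently from the hand computation):

    # Sketch: reproduce the Jordan normal form of A with sympy.
    import sympy as sp

    A = sp.Matrix([[3, -2, 0], [1, 0, 0], [1, 0, 1]])
    P, J = A.jordan_form()       # A = P * J * P^{-1}
    assert A == P * J * P.inv()
    print(J)                     # a 2x2 Jordan block for eigenvalue 1 and a 1x1 block for 2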

4.7 Bilinear forms II


D. 4-122
• Let V is a vector space over F. A bilinear form φ : V ×V → F is called symmetric
if φ(v, w) = φ(w, v) for all v, w ∈ V .
• Two square matrices A, B are congruent if there exists some invertible P such
that B = P T AP .
• A function q : V → F is a quadratic form if there exists some bilinear form φ
such that q(v) = φ(v, v) for all v ∈ V .
E. 4-123
If S ∈ Mat_n(F) is a symmetric matrix, ie. S^T = S, the bilinear form φ : F^n × F^n → F
defined by φ(x, y) = x^T S y = Σ_{i,j=1}^n x_i S_{ij} y_j is a symmetric bilinear form.

Note that quadratic forms are not linear maps (they are quadratic).
L. 4-124
Let V be a finite-dimensional vector space over F with basis (e1 , · · · , en ), and
φ : V × V → F is a bilinear form represented by the matrix M with respect to the
basis, ie. Mij = φ(ei , ej ). Then φ is symmetric if and only if M is symmetric.

(Forward) If φ is symmetric, then Mij = φ(ei , ej ) = φ(ej , ei ) = Mji .


(Backward) If M is symmetric, then φ(x, y) = φ(Σ_i x_i e_i, Σ_j y_j e_j) = Σ_{i,j} x_i M_{ij} y_j =
Σ_{i,j} y_j M_{ji} x_i = φ(y, x).

L. 4-125
Let V be a finite-dimensional vector space, and φ : V × V → F a bilinear form.
Let (e_1, ..., e_n) and (f_1, ..., f_n) be bases of V such that f_i = Σ_{k=1}^n P_{ki} e_k. If A
represents φ with respect to (ei ) and B represents φ with respect to (fi ), then
B = P T AP .

Special case of [P.4-63].


It is easy to see that congruence is an equivalence relation. Two matrices are
congruent if and only if they represent the same bilinear form with respect to
different bases. Thus, to classify symmetric bilinear forms is the same as classifying
symmetric matrices up to congruence.
P. 4-126
<Polarization identity> Let F be a field such that 1 + 1 6= 0 (eg. F = R or C).
If q : V → F is a quadratic form, then there exists a unique symmetric bilinear
form φ : V × V → F such that q(v) = φ(v, v).

Let ψ : V × V → F be a bilinear form such that ψ(v, v) = q(v). We define


φ : V × V → F by φ(v, w) = 12 (ψ(v, w) + ψ(w, v)) for all v, w ∈ F. This is clearly
a symmetric bilinear form with q(v) = φ(v, v). So we have proved existence.
To prove uniqueness, suppose φ is such a symmetric bilinear form. We compute

q(v + w) = φ(v + w, v + w) = φ(v, v) + φ(v, w) + φ(w, v) + φ(w, w)


= q(v) + 2φ(v, w) + q(w).
So we have φ(v, w) = (1/2)(q(v + w) − q(v) − q(w)). So φ is determined by q, and
hence unique.
T. 4-127
Let V be a finite-dimensional vector space over F, and φ : V × V → F a sym-
metric bilinear form. Then there exists a basis (e1 , · · · , en ) for V such that φ is
represented by a diagonal matrix with respect to this basis.

We induct over n = dim V . The cases n = 0 and n = 1 are trivial, since all of its
matrices are diagonal.
Suppose we have proven the result for all spaces of dimension less than n. First
consider the case where φ(v, v) = 0 for all v ∈ V . We want to show that we must
have φ = 0. This follows from the polarization identity, since this φ induces the
zero quadratic form, and we know that there is a unique symmetric bilinear form
that induces the zero quadratic form. Since we know that the zero bilinear form,
which is symmetric, induces the zero quadratic form, we must have φ = 0. Then φ
will be represented by the zero matrix with respect to any basis, which is trivially
diagonal.
If not, pick e1 ∈ V such that φ(e1 , e1 ) 6= 0. Let

U = ker φ(e1 , · ) = {u ∈ V : φ(e1 , u) = 0}.

Since φ(e1 , · ) ∈ V ∗ \{0}, we know that dim U = n−1 by the rank-nullity theorem.
Consider φ|U ×U : U × U → F, a symmetric bilinear form. By the induction
hypothesis, we can find a basis e2 , · · · , en for U such that φ|U ×U is represented by
a diagonal matrix with respect to this basis. Now by construction, φ(ei , ej ) = 0
for all 1 ≤ i 6= j ≤ n and (e1 , · · · , en ) is a basis for V .
This tells us classifying symmetric bilinear forms is easier than classifying en-
domorphisms, since for endomorphisms, even over C, we cannot always make it
diagonal, but we can for bilinear forms over arbitrary fields.
E. 4-128
Let q be a quadratic form on R3 given by

q(x, y, z) = x2 + y 2 + z 2 + 2xy + 4yz + 6xz.

Find a basis f1 , f2 , f3 for R3 such that q is of the form q(af1 + bf2 + cf3 ) = λa2 +
µb2 + νc2 .

(Method 1) There are two ways to do this. The first way is to follow the proof we
just had. We first find our symmetric bilinear form: it is the bilinear form
represented by the matrix

    S = ( 1 1 3 ; 1 1 2 ; 3 2 1 ).

We then find f_1 such that φ(f_1, f_1) ≠ 0. We note that q(e_1) = 1 ≠ 0. So we pick
f_1 = e_1 = (1, 0, 0). Then

    φ(f_1, v) = (1 0 0) S (v_1, v_2, v_3)^T = v_1 + v_2 + 3v_3.

Next we need to pick our f_2. It must satisfy φ(f_1, f_2) = 0, so it is in the kernel
of φ(f_1, · ). To continue the proof inductively, it also needs to satisfy φ(f_2, f_2) ≠ 0.
For example, we can pick f_2 = (3, 0, −1). Then we have q(f_2) = −8. Now

    φ(f_2, v) = (3 0 −1) S (v_1, v_2, v_3)^T = v_2 + 8v_3.

Finally, we want φ(f_1, f_3) = φ(f_2, f_3) = 0, and f_3 = (5, −8, 1) works. We have
q(f_3) = 8. With these basis elements, we have

    q(a f_1 + b f_2 + c f_3) = φ(a f_1 + b f_2 + c f_3, a f_1 + b f_2 + c f_3)
                             = a² q(f_1) + b² q(f_2) + c² q(f_3) = a² − 8b² + 8c².

(Method 2) Alternatively, we can solve the problem by completing the square. We
have

    x² + y² + z² + 2xy + 4yz + 6xz = (x + y + 3z)² − 2yz − 8z²
                                   = (x + y + 3z)² − 8(z + y/8)² + (1/8) y².

Now we must have

    φ((x, y, z), (x′, y′, z′)) = (x + y + 3z)(x′ + y′ + 3z′) − 8(z + y/8)(z′ + y′/8) + (1/8) y y′.

This is because this is clearly a symmetric bilinear form, and it clearly induces
the q given above. By uniqueness, we know that this is the right symmetric
bilinear form.

We now use this form to find f_1, f_2, f_3 on which the three linear forms x + y + 3z,
z + y/8 and y take the values δ_{ij} (so that φ(f_i, f_j) is diagonal). Solving the
equations x + y + 3z = 1, z + y/8 = 0 and y = 0 gives our first vector as f_1 = (1, 0, 0).
Then we solve x + y + 3z = 0, z + y/8 = 1 and y = 0 to get f_2 = (−3, 0, 1). Finally,
we solve x + y + 3z = 0, z + y/8 = 0 and y = 1 to get f_3 = (−5/8, 1, −1/8). Then
we can see that the result follows, and we get

    q(a f_1 + b f_2 + c f_3) = a² − 8b² + (1/8) c².

Note that the diagonal matrix we get is not unique. We can re-scale our basis by
any constant, and get an equivalent expression.
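Both diagonalizations found above can be verified in one line each with sympy (an added sketch):

    # Sketch: check that the bases found in Methods 1 and 2 diagonalize the form.
    import sympy as sp

    S = sp.Matrix([[1, 1, 3], [1, 1, 2], [3, 2, 1]])             # matrix of phi
    P1 = sp.Matrix([[1, 3, 5], [0, 0, -8], [0, -1, 1]])          # columns f1, f2, f3 (Method 1)
    P2 = sp.Matrix([[1, -3, sp.Rational(-5, 8)],
                    [0, 0, 1],
                    [0, 1, sp.Rational(-1, 8)]])                 # columns f1, f2, f3 (Method 2)
    assert P1.T * S * P1 == sp.diag(1, -8, 8)
    assert P2.T * S * P2 == sp.diag(1, -8, sp.Rational(1, 8))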
D. 4-129
Let φ be a symmetric bilinear form on a finite-dimensional real vector space V .
We say φ is
• positive definite if φ(v, v) > 0 for all v ∈ V \ {0};
positive semi-definite if φ(v, v) ≥ 0 for all v ∈ V .
• negative definite if φ(v, v) < 0 for all v ∈ V \ {0};
negative semi-definite if φ(v, v) ≤ 0 for all v ∈ V .
We say a quadratic form is ...-definite if the corresponding bilinear form is so.

T. 4-130
Let φ be a symmetric bilinear form on a complex vector space V. Then there
exists a basis (v_1, ..., v_n) for V such that φ is represented by ( I_r 0 ; 0 0 ) with respect
to this basis, where r = r(φ).

We've already shown that there exists a basis (e_1, ..., e_n) such that φ(e_i, e_j) =
λ_i δ_{ij} for some λ_i. By reordering the e_i, we can assume that λ_1, ..., λ_r ≠ 0 and
λ_{r+1}, ..., λ_n = 0. For each 1 ≤ i ≤ r, there exists some µ_i such that µ_i² = λ_i. For
r + 1 ≤ i ≤ n, we let µ_i = 1 (or anything non-zero). We define v_i = e_i/µ_i; then

    φ(v_i, v_j) = (1/(µ_i µ_j)) φ(e_i, e_j) = 1 if i = j ≤ r, and 0 otherwise.

Note that it follows that for the corresponding quadratic form q, we have

    q(Σ_{i=1}^n a_i v_i) = Σ_{i=1}^r a_i².

As a result of this, every symmetric A ∈ Mat_n(C) is congruent to a unique
matrix of the form ( I_r 0 ; 0 0 ).
T. 4-131
Let φ be a symmetric bilinear form on a finite-dimensional vector space over R.
Then there exists a basis (v_1, ..., v_n) for V such that φ is represented by

    A = ( I_p 0 0 ; 0 −I_q 0 ; 0 0 0 ),

with p + q = r(φ) and p, q ≥ 0. Equivalently, the corresponding quadratic form is
given by q(Σ_{i=1}^n a_i v_i) = Σ_{i=1}^p a_i² − Σ_{j=p+1}^{p+q} a_j².

We've already shown that there exists a basis (e_1, ..., e_n) such that φ(e_i, e_j) =
λ_i δ_{ij} for some λ_1, ..., λ_n ∈ R. By reordering, we may assume λ_i > 0 for 1 ≤ i ≤ p,
λ_i < 0 for p + 1 ≤ i ≤ r, and λ_i = 0 for i > r. We define µ_i by µ_i = √λ_i for
1 ≤ i ≤ p, µ_i = √(−λ_i) for p + 1 ≤ i ≤ r, and µ_i = 1 for i > r. Now defining
v_i = e_i/µ_i, we find that φ is indeed represented by A.

Note that we have seen these things in special relativity, where the Minkowski
inner product is given by the symmetric bilinear form represented by the matrix
diag(−1, 1, 1, 1), in units where c = 1.

It is easy to see that if p = 0 and q = n, then φ is negative definite; if p = 0 and


q 6= n, then φ is negative semi-definite etc.
E. 4-132
Let φ be a symmetric bilinear form on R^n represented by ( I_p 0 ; 0 0_{n−p} ). Then φ is
positive semi-definite. And φ is positive definite if and only if n = p.
If instead φ is represented by ( −I_p 0 ; 0 0_{n−p} ), then φ is negative semi-definite. And
φ is negative definite precisely if n = p.

T. 4-133
<Sylvester’s law of inertia> Let φ be a symmetric bilinear form on a finite-
dimensional real vector space V . Then there exists unique non-negative integers
p, q such that with respect to some basis φ is represented by
 
Ip 0 0
A =  0 −Iq 0 .
0 0 0

In particular every real symmetric matrix is congruent to precisely one matrix of


the form as A.
We have already proved the existence part, and we just have to prove uniqueness.
To do so, we characterize p and q in a basis-independent way. We already know
that p + q = r(φ) does not depend on the basis. So it suffices to show p is unique.
To see that p is unique, we show that p is the largest dimension of a subspace
P ⊆ V such that φ|P ×P is positive definite. First we show we can find such a
P . Suppose φ is represented by A with respect to (e1 , · · · , en ). Then φ restricted
to he1 , · · · , ep i is represented by Ip with respect to e1 , · · · , ep . So φ restricted to
this is positive definite. Let Q = ⟨e_{p+1}, ..., e_n⟩. Then φ restricted to Q × Q is
represented by ( −I_q 0 ; 0 0 ).
Now suppose P is any subspace of V such that φ|P ×P is positive definite. Suppose
v ∈ (P ∩ Q) \ {0}, then φ(v, v) > 0 since v ∈ P \ {0} and φ(v, v) ≤ 0 since v ∈ Q,
which is a contradiction. So P ∩ Q = 0. We have

dim V ≥ dim(P + Q) = dim P + dim Q = dim P + (n − p).

Rearranging gives dim P ≤ p.


A similar argument shows that q is the maximal dimension of a subspace Q ⊆ V
such that φ|Q×Q is negative definite.
The signature of a bilinear form φ is the number p − q, where p and q are as
above. Clearly we can recover p and q from the signature and the rank of φ.
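As a numerical aside (not part of the original notes): since an orthogonal diagonalization followed by rescaling is a congruence and does not change signs, the signature of a real symmetric matrix can be read off from the signs of its eigenvalues. A minimal NumPy sketch, where the example matrix is chosen arbitrarily:

```python
import numpy as np

def signature(A, tol=1e-10):
    """Return (p, q) for a real symmetric matrix A, where p (resp. q) is the
    number of positive (resp. negative) eigenvalues; p - q is the signature."""
    eig = np.linalg.eigvalsh(A)          # real eigenvalues of a symmetric matrix
    p = int(np.sum(eig > tol))
    q = int(np.sum(eig < -tol))
    return p, q

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])          # symmetric, rank 2
p, q = signature(A)
print(p, q, p - q)                        # here p = q = 1, so the signature is 0
```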
D. 4-134
• Let V, W be complex vector spaces. Then a sesquilinear form is a function φ :
V × W → C such that
1. φ(λv1 + µv2 , w) = λ̄φ(v1 , w) + µ̄φ(v2 , w).
2. φ(v, λw1 + µw2 ) = λφ(v, w1 ) + µφ(v, w2 ).
for all v, v1 , v2 ∈ V , w, w1 , w2 ∈ W and λ, µ ∈ C.6
• Let V, W be finite-dimensional complex vector spaces with basis (v1 , · · · , vn ) and
(w1 , · · · , wm ) respectively, and φ : V × W → C be a sesquilinear form. Then
the matrix representing φ with respect to these bases is Aij = φ(vi , wj ) for
1 ≤ i ≤ n, 1 ≤ j ≤ m.
• A sesquilinear form on V × V is Hermitian if φ(v, w) = φ(w, v)∗ , the complex
conjugate of φ(w, v) (in alternative notation for the conjugate, ā = a∗ ).
6
It is called sesquilinear since “sesqui” means “one and a half”, and this is linear in the second
argument and “half linear” in the first. Note that some people have an opposite definition, where
we have linearity in the first argument and conjugate linearity in the second.

E. 4-135
Note that alternatively, to define a sesquilinear form, we can define a new complex
vector space V̄ structure on V by taking the same abelian group (ie. the same
underlying set and addition), but with the scalar multiplication of V̄ satisfying
α∗v = ᾱv where ∗ is the scalar multiplication of V̄ and “the normal” multiplication
is the scalar multiplication of V . Then a sesquilinear form on V × W is a bilinear
form on V̄ × W (since φ(λ ∗ v1 + µ ∗ v2 , w) = λφ(v1 , w) + µφ(v2 , w)). Alternatively,
this is a linear map W → V̄ ∗ .
As usual, the matrix representing the sesquilinear form determines the whole
sesquilinear form. This follows from the analogous fact for the bilinear form on
V̄ × W → C. Let v = Σi λi vi and w = Σj µj wj . Then we have

    φ(v, w) = Σ_{i,j} λ̄i µj φ(vi , wj ) = λ† Aµ.

We now want the right definition of symmetric sesquilinear form. We cannot just
require φ(v, w) = φ(w, v), since φ is linear in the second variable and conjugate
linear on the first variable. So in particular, if φ(v, w) 6= 0, we have φ(iv, w) 6=
φ(v, iw), so requiring φ(v, w) = φ(w, v) just means that φ(v, w) = 0 for all v, w.
So instead we have the Hermitian forms.
Note that if φ is Hermitian, then φ(v, v) = φ(v, v)∗ , so φ(v, v) ∈ R for any v ∈ V . So it
makes sense to ask if it is positive or negative. Moreover, for any complex number
λ, we have φ(λv, λv) = |λ|2 φ(v, v). So multiplying by a scalar does not change
the sign. So it makes sense to talk about positive (semi-)definite and negative
(semi-)definite Hermitian forms.
L. 4-136
Let φ : V × V → C be a sesquilinear form on a finite-dimensional vector space over
C, and (e1 , · · · , en ) a basis for V . Then φ is Hermitian if and only if the matrix
A representing φ is Hermitian (ie. A = A† ).

(Forward) If φ is Hermitian, then Aij = φ(ei , ej ) = φ(ej , ei )∗ = (A† )ij .


(Backward) If A is Hermitian, then φ( Σi λi ei , Σj µj ej ) = λ† Aµ = (µ† A† λ)∗ =
φ( Σj µj ej , Σi λi ei )∗ .
P. 4-137
Let φ be a Hermitian form on a finite dimensional P vector space V . Let (e1 , · · · , en )
and (v1 , · · · , vn ) be bases for V such that vi = n k=1 Pki ek , and A, B represent
φ with respect to (e1 , . . . , en ) and (v1 , · · · , vn ) respectively. Then B = P † AP .
    Bij = φ(vi , vj ) = φ( Σk Pki ek , Σ` P`j e` ) = Σ_{k,`=1}^n P̄ki P`j Ak` = (P † AP )ij .

L. 4-138
<Polarization identity> A Hermitian form φ on V is determined by the func-
tion ψ : V → R defined by v 7→ φ(v, v).

ψ(x + y) = φ(x, x) + φ(x, y) + φ(y, x) + φ(y, y)


−ψ(x − y) = −φ(x, x) + φ(x, y) + φ(y, x) − φ(y, y)
iψ(x − iy) = iφ(x, x) + φ(x, y) − φ(y, x) + iφ(y, y)
−iψ(x + iy) = −iφ(x, x) + φ(x, y) − φ(y, x) − iφ(y, y)

So φ(x, y) = (1/4)( ψ(x + y) − ψ(x − y) + iψ(x − iy) − iψ(x + iy) ).


T. 4-139
<Hermitian form of Sylvester’s law of inertia> Let V be a finite-dimensional
complex vector space and φ a hermitian form on V . Then there exists unique non-
negative integers p and q such that with respect to some basis φ is represented
by
    ( Ip 0 0 ; 0 −Iq 0 ; 0 0 0 ).

Almost the same as for symmetric forms over R, see [T.4-127], [T.4-131] and
[T.4-133]. For the first part of the proof (cf. [T.4-127]): as before, if φ(v, v) = 0 for
all v, then by the polarization identity φ = 0; otherwise we can pick e1 with
φ(e1 , e1 ) ≠ 0 and continue as before. For the middle part (cf. [T.4-131]), note that it is
like [T.4-131] but not [T.4-130]: we cannot simply get ( Ir 0 ; 0 0 ) because here we have
φ(λe, λe) = |λ|2 φ(e, e), so when φ(e, e) is negative we cannot "normalise" it
to 1. For the last part (cf. [T.4-133]), we should note that positive (semi-)definiteness
and negative (semi-)definiteness make sense for Hermitian forms.

4.8 Inner product spaces


For this section, F will always mean either R or C.
D. 4-140
• Let V be a vector space. An inner product φ on V is a positive-definite symmetric
bilinear (for real vector space) or hermitian (for complex vector space) form. We
usually write hx, yi instead of φ(x, y).7
• A vector space equipped with an inner product is an inner product space ; it has
norm ‖v‖ = √⟨v, v⟩ .
• Let V be an inner product space. Then
 v, w ∈ V are orthogonal if hv, wi = 0.
 A set {vi : i ∈ I} is an orthonormal set if hvi , vj i = δij for any i, j ∈ I.
 A subset of V is an orthonormal basis if it is an orthonormal set and is a basis.
E. 4-141
•  Rn or Cn with the usual inner product ⟨x, y⟩ = Σ_{i=1}^n x̄i yi forms an inner
product space.
In some sense, this is the only inner product on finite-dimensional spaces, by
Sylvester’s law of inertia.
 Let C([0, 1], F) be the vector space of real/complex valued continuous functions
on [0, 1]. Then ⟨f, g⟩ = ∫_0^1 f̄(t)g(t) dt is an inner product.
More generally, for any continuous w : [0, 1] → R+ , we can define the inner
product on C([0, 1], F) as ⟨f, g⟩ = ∫_0^1 w(t)f̄(t)g(t) dt.
7
The lecturer of this course use the notation (x, y) instead of hx, yi.

• Like in [L.1-10], if V is an inner product space, we can define a norm on V by
‖v‖ = √⟨v, v⟩. This is just the usual notion of norm on Rn and Cn . This gives
the notion of length in inner product spaces. Note that ‖v‖ ≥ 0 with equality
iff v = 0. Note also that the norm ‖ · ‖ determines the inner product by the
polarization identity.
• An orthonormal set must be linearly independent. Suppose {vi : i ∈ I} is an
orthonormal set, then for any finite subset S of I, if we have Σ_{i∈S} λi vi = 0 then
for any j ∈ S we must have λj ⟨vj , vj ⟩ = ⟨vj , Σ_{i∈S} λi vi ⟩ = ⟨vj , 0⟩ = 0.
• Given an orthonormal basis, it is easy to find the coordinates of any vector in this
basis. Suppose V is a finite-dimensional inner product space with an orthonormal
basis v1 , · · · , vn . To write v = Σ_{i=1}^n λi vi , note that ⟨vj , v⟩ = Σ_{i=1}^n λi ⟨vj , vi ⟩ =
λj . So v ∈ V can always be written as Σ_{i=1}^n ⟨vi , v⟩vi .

T. 4-142
Let V be an inner product space, then for any v, w ∈ V ,
1. <Cauchy-Schwarz inequality> |hv, wi| ≤ kvkkwk
2. <Triangle inequality> kv + wk ≤ kvk + kwk

1. If w = 0, then this is trivial. Otherwise, since the inner product is positive


definite, for any λ, we get 0 ≤ hv − λw, v − λwi = hv, vi − λ̄hw, vi − λhv, wi +
|λ|2 hw, wi. Let λ = hw, vi/hw, wi. Then we get

    0 ≤ ⟨v, v⟩ − |⟨w, v⟩|^2/⟨w, w⟩ − |⟨w, v⟩|^2/⟨w, w⟩ + |⟨w, v⟩|^2/⟨w, w⟩
      = ⟨v, v⟩ − |⟨w, v⟩|^2/⟨w, w⟩.

Rearranging gives |⟨v, w⟩|^2 ≤ ⟨v, v⟩⟨w, w⟩, i.e. |⟨v, w⟩| ≤ ‖v‖‖w‖.

2. Note that hv, wi + hw, vi = 2 Re(hv, wi), then we have

kv + wk2 = hv + w, v + wi = hv, vi + hv, wi + hw, vi + hw, wi


≤ kvk2 + 2kvkkwk + kwk2 = (kvk + kwk)2 .

L. 4-143
<Parseval’s identity> Let V be a finite-dimensional innerP product space with
n
1 , · · · , un . For any v, w ∈ V , hv, wi =
an orthonormal basis uP i=1 hui , vihui , wi.
2 n 2
In particular, kvk = i=1 |hui , vi| .

    ⟨v, w⟩ = ⟨ Σ_{i=1}^n ⟨ui , v⟩ui , Σ_{j=1}^n ⟨uj , w⟩uj ⟩ = Σ_{i,j=1}^n ⟨ui , v⟩∗ ⟨uj , w⟩⟨ui , uj ⟩
           = Σ_{i,j=1}^n ⟨ui , v⟩∗ ⟨uj , w⟩δij = Σ_{i=1}^n ⟨ui , v⟩∗ ⟨ui , w⟩.

T. 4-144
<Gram-Schmidt process> Let V be an inner product space and e1 , e2 , · · · a
linearly independent set. Then we can construct an orthonormal set v1 , v2 , · · ·
with the property that h{v1 , · · · , vk }i = h{e1 , · · · , ek }i for every k.

We construct it iteratively, and prove this by induction on k. The base case


k = 0 is contentless. Suppose we have already found v1 , · · · , vk that satisfy
the properties. Define uk+1 = ek+1 − Σ_{i=1}^k ⟨vi , ek+1 ⟩vi (i.e. we remove the
component of ek+1 that is parallel to the vectors v1 , · · · , vk ). Indeed uk+1
is orthogonal to vj for any j ≤ k since

    ⟨vj , uk+1 ⟩ = ⟨vj , ek+1 ⟩ − Σ_{i=1}^k ⟨vi , ek+1 ⟩δij = ⟨vj , ek+1 ⟩ − ⟨vj , ek+1 ⟩ = 0.

Now since h{v1 , · · · , vk }i = h{e1 , · · · , ek }i and {e1 , · · · , ek+1 } is linearly inde-


pendent, {v1 , · · · , vk , ek+1 } is linearly independent, so uk+1 is non-zero. There-
fore, we can define vk+1 = uk+1 /kuk+1 k. Now v1 , · · · , vk+1 is orthonormal and
h{v1 , · · · , vk+1 }i = h{e1 , · · · , ek+1 }i as required.
Note that we are not requiring e1 , e2 , · · · to be finite. We are just requiring it to
be countable.
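The construction in the proof is directly algorithmic. Below is a minimal NumPy sketch of the process for the standard inner product on R^n (an illustration, not from the notes; the input vectors are an arbitrary example):

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalise a list of linearly independent vectors in R^n
    (standard inner product), following the iterative construction above."""
    basis = []
    for e in vectors:
        u = e - sum(np.dot(v, e) * v for v in basis)   # remove components along v_1, ..., v_k
        basis.append(u / np.linalg.norm(u))            # u is non-zero by linear independence
    return basis

e = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])]
v = gram_schmidt(e)
print(np.round([[np.dot(a, b) for b in v] for a in v], 10))   # Gram matrix: the identity
```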
P. 4-145
Any orthonormal set in a finite-dimensional inner product space can be extended
to an orthonormal basis.
Let v1 , · · · , vk be an orthonormal set. Since this is linearly independent, we
can extend it to a basis (v1 , · · · , vk , xk+1 , · · · , xn ). We apply the Gram-Schmidt
process to this basis to get an orthonormal basis, say (u1 , · · · , un ). Moreover,
we can check that the process does not modify our v1 , · · · , vk , ie. ui = vi for
1 ≤ i ≤ k.
D. 4-146
• Let V be an inner product space and V1 , V2 subspace of V . We say V is the
orthogonal internal direct sum of V1 and V2 , written V = V1 ⊥ V2 , if it is a direct
sum and V1 and V2 are orthogonal. More precisely, we require
1. V = V1 + V2
2. V1 ∩ V2 = 0
3. hv1 , v2 i = 0 for all v1 ∈ V1 and v2 ∈ V2 .8
• If W is a subspace of an inner product space V , then the orthogonal complement
of W in V is the subspace W ⊥ = {v ∈ V : hv, wi = 0, ∀w ∈ W }.
• Let V1 , V2 be inner product spaces. The orthogonal external direct sum of V1
and V2 is the vector space V1 ⊕ V2 (the external direct sum) with the inner prod-
uct defined by h(v1 , v2 ), (w1 , w2 )i = hv1 , w1 i + hv2 , w2 i where v1 , w1 ∈ V1 and
v 2 , w2 ∈ V 2 .
• Suppose V = U ⊕W . The projection map onto W along U is the map π : V → W
given by π(u + w) = w for u ∈ U and w ∈ W . If in fact U = W ⊥ then we call
this map the orthogonal projection onto W .
P. 4-147
If W is a subspace of a finite-dimensional inner product space V , then V = W ⊥
W ⊥.
Condition 3 and hence 2 in the definition of orthogonal internal direct sum is
obviously satisfied by definition of W ⊥ . So it remains to prove condition 1, ie.
V = W + W ⊥ . Let w1 , · · · , wk be an orthonormal basis for W , and pick v ∈ V .
8
Note that condition (3) implies (2), but we write it for the sake of explicitness.

Now let w = Σ_{i=1}^k ⟨wi , v⟩wi . Clearly, we have w ∈ W . We just need to show
v − w ∈ W ⊥ . For each j, we can compute

    ⟨wj , v − w⟩ = ⟨wj , v⟩ − Σ_{i=1}^k ⟨wi , v⟩⟨wj , wi ⟩ = ⟨wj , v⟩ − Σ_{i=1}^k ⟨wi , v⟩δij = 0.

Hence for any λj , we have ⟨ Σj λj wj , v − w⟩ = 0. So we have v − w ∈ W ⊥ .
Notice that unlike general vector space complements, orthogonal complements are
unique.
E. 4-148
Note that the external direct sum is equivalent to the internal direct sum of
{(v1 , 0) : v1 ∈ V1 } and {(0, v2 ) : v2 ∈ V2 }.
P. 4-149
Let V be a finite-dimensional inner product space and W ≤ V . Let (e1 , · · · , ek )
be an orthonormal basis of W . Let π be the orthogonal projection of V onto W .
Then
1. π is given by the formula π(v) = Σ_{i=1}^k ⟨ei , v⟩ei .
2. ‖v − π(v)‖ ≤ ‖v − w‖ for all v ∈ V and w ∈ W , with equality iff π(v) = w.


1. Let v ∈ V , and define w = Σ_{i=1}^k ⟨ei , v⟩ei . We want to show this is π(v). First
we show that v − w ∈ W ⊥ . We compute

    ⟨ej , v − w⟩ = ⟨ej , v⟩ − Σ_{i=1}^k ⟨ei , v⟩⟨ej , ei ⟩ = 0.

So v − w is orthogonal to every basis vector of W , hence v − w ∈ W ⊥ . So now


π(v) = π(w) + π(v − w) = w as required.
2. This is just Pythagoras’ theorem. Note that if x and y are orthogonal, then

kx + yk2 = hx + y, x + yi = hx, xi + hx, yi + hy, xi + hy, yi = kxk2 + kyk2 .

We apply this to our projection. For any w ∈ W , we have

kv − wk2 = kv − π(v)k2 + kπ(v) − wk2 ≥ kv − π(v)k2

with equality if and only if kπ(v)−wk = 0, ie. π(v) = w. Note that v−π(v) ∈
ker π = W ⊥ while π(v) − w ∈ Im π = W , hence hv − π(v), π(v) − wi = 0.

Note that 2. says that π(v) is the point on W that is closest to v.
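As a quick numerical illustration of the closest-point property (not in the original notes; the subspace and vectors below are arbitrary examples):

```python
import numpy as np

# Orthonormal basis of a 2-dimensional subspace W of R^3
e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0])
v = np.array([3.0, -2.0, 5.0])

pi_v = np.dot(e1, v) * e1 + np.dot(e2, v) * e2   # the formula pi(v) = sum <e_i, v> e_i
print(pi_v)                                       # [ 3. -2.  0.]

# For a few other points w in W, check ||v - pi(v)|| <= ||v - w|| as in part 2.
for w in [pi_v, np.array([1.0, 1.0, 0.0]), np.array([-4.0, 0.0, 0.0])]:
    print(np.linalg.norm(v - pi_v) <= np.linalg.norm(v - w) + 1e-12)   # True each time
```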

L. 4-150
Let V and W be finite-dimensional inner product spaces and α : V → W a linear
map. There exists a unique linear map α∗ : W → V such that hαv, wi = hv, α∗ wi
for all v ∈ V , w ∈ W .

Let (v1 , · · · , vn ) and (w1 , · · · , wm ) be orthonormal basis for V and W . Suppose


α is represented by A.
To show uniqueness, suppose α∗ : W → V satisfies hαv, wi = hv, α∗ wi for all
v ∈ V , w ∈ W , then for all i, j, by definition, we know
    ⟨vi , α∗ (wj )⟩ = ⟨α(vi ), wj ⟩ = ⟨ Σk Aki wk , wj ⟩ = Σk Āki ⟨wk , wj ⟩ = Āji .

So we get α∗ (wj ) = Σi ⟨vi , α∗ (wj )⟩vi = Σi Āji vi . Hence α∗ must be represented
by A† . So α∗ is unique.
To show existence, all we have to do is to show A† indeed works. Now let α∗
be represented by A† . For arbitrary v, w, we compute hαv, wi and hv, α∗ wi and
show that they are equal. We have
    ⟨ α( Σi λi vi ), Σj µj wj ⟩ = Σ_{i,j} λ̄i µj ⟨α(vi ), wj ⟩ = Σ_{i,j} λ̄i µj ⟨ Σk Aki wk , wj ⟩
                               = Σ_{i,j} λ̄i Āji µj

    ⟨ Σi λi vi , α∗ ( Σj µj wj ) ⟩ = Σ_{i,j} λ̄i µj ⟨ vi , Σk A†kj vk ⟩ = Σ_{i,j} λ̄i Āji µj .

What does this mean, conceptually? Note that the inner product on V defines an
isomorphism V → V̄ ∗ by v ↦ ⟨ · , v⟩. Similarly, we have an isomorphism W → W̄ ∗ .
We can then put these into a square diagram, with α : V → W along the top and the
two isomorphisms as vertical arrows down to V̄ ∗ and W̄ ∗ . Then α∗ is what fills in
the dashed arrow along the bottom of the square. So α∗ is in some sense the "dual" of the
map α. We call the map α∗ the adjoint of α.
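Concretely, with respect to orthonormal bases the lemma says the matrix of α∗ is the conjugate transpose A†. A short numerical sanity check using the standard inner product (an illustrative sketch, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3)) + 1j * rng.normal(size=(2, 3))   # alpha : C^3 -> C^2
v = rng.normal(size=3) + 1j * rng.normal(size=3)
w = rng.normal(size=2) + 1j * rng.normal(size=2)

inner = lambda x, y: np.vdot(x, y)        # <x, y>, conjugate-linear in the first slot
lhs = inner(A @ v, w)                     # <alpha v, w>
rhs = inner(v, A.conj().T @ w)            # <v, alpha* w>, with alpha* represented by A^dagger
print(np.isclose(lhs, rhs))               # True
```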
D. 4-151
• Let V be an inner product space, and α ∈ End(V ). Then α is self-adjoint if
α = α∗ , ie. hα(v), wi = hv, α(w)i for all v, w.
• Let V be a real inner product space. Then α ∈ End(V ) is orthogonal if hα(v), α(w)i =
hv, wi for all v, w ∈ V . For a finite-dimensional V , the orthogonal group of V is
O(V ) = {α ∈ End(V ) : α is orthogonal}.
• Let V be a complex vector space. Then α ∈ End(V ) is unitary if hα(v), α(w)i =
hv, wi for all v, w ∈ V . For a finite-dimensional V , the unitary group of V is
U (V ) = {α ∈ End(V ) : α is unitary}.
E. 4-152
• Thus if V = Rn with the usual inner product, then A ∈ Matn (R) is self-adjoint if
and only if it is symmetric, ie. A = AT . If V = Cn with the usual inner product,
then A ∈ Matn (C) is self-adjoint if and only if A is Hermitian, ie. A = A† .
• By the polarization identity, α (in a real vector space) is orthogonal if and only
if kα(v)k = kvk for all v ∈ V . A real square matrix (as an endomorphism of
Rn with the usual inner product) is orthogonal if and only if its columns are an
orthonormal set.
Similarly by the polarization identity, α (in a complex vector space) is unitary if
and only if kα(v)k = kvk for all v ∈ V .

L. 4-153
Let V be a real finite-dimensional space and α ∈ End(V ).
1. α is orthogonal if and only if α−1 = α∗ .
2. α is orthogonal if and only if α is represented by an orthogonal matrix (ie. a
matrix A such that AT = A−1 ) with respect to any orthonormal basis.

1. (Backward) If α−1 = α∗ , then ⟨αv, αw⟩ = ⟨v, α∗ αw⟩ = ⟨v, α−1 αw⟩ = ⟨v, w⟩.
(Forward) If α is orthogonal and (v1 , · · · , vn ) is an orthonormal basis for V ,
then for 1 ≤ i, j ≤ n, we have δij = ⟨vi , vj ⟩ = ⟨αvi , αvj ⟩ = ⟨vi , α∗ αvj ⟩. So
we know α∗ α(vj ) = Σ_{i=1}^n ⟨vi , α∗ αvj ⟩vi = vj . So by linearity of α∗ α, we know
α∗ α = idV . So α∗ = α−1 .
2. Let (e1 , · · · , en ) be an arbitrary orthonormal basis for V . Let A represent
α under this basis; then AT = A† represents α∗ . So α is orthogonal iff α−1 = α∗ iff
A−1 = AT .
If α∗ = α−1 , then α is invertible. It is clear from definition that O(V ) is closed
under multiplication (since if AT = A−1 and B T = B −1 then (AB)T = (AB)−1 )
and inverses (since if AT = A−1 then (A−1 )T = (A−1 )−1 ). So O(V ) is indeed a
group.
The same result and the same proof extend to a complex finite-dimensional space,
with orthogonal replaced by unitary (and AT by A† ).
P. 4-154
Let V be a finite-dimensional real inner product space and (e1 , · · · , en ) is an or-
thonormal basis of V . Then there is a bijection O(V ) → {orthonormal basis for V }
defined by α 7→ (α(e1 ), · · · , α(en )).

If α ∈ O(V ), then hα(v), α(w)i = hv, wi for all v, w ∈ V . So (α(e1 ), · · · , α(en ))


must be an orthonormal set and hence an orthonormal basis.
Given any orthonormal basis, by [P.4-28], there is a unique linear map α such that
(α(e1 ), · · · , α(en )) is the given orthonormal basis. Furthermore α must be orthogonal,
since for any v = Σi vi ei and w = Σj wj ej in V we have ⟨α(v), α(w)⟩ =
⟨α( Σi vi ei ), α( Σj wj ej )⟩ = Σ_{i,j} vi wj ⟨α(ei ), α(ej )⟩ = Σ_{i,j} vi wj δij = Σ_{i,j} vi wj ⟨ei , ej ⟩ = ⟨v, w⟩.

Again the same result and the same proof extend to complex spaces, with the
orthogonal group replaced by the unitary group.
L. 4-155
Let V be a finite-dimensional inner product space, and α ∈ End(V ) self-adjoint,
1. α has a real eigenvalue, and all eigenvalues of α are real.
2. Eigenvectors of α with distinct eigenvalues are orthogonal.

We are going to do real and complex cases separately.


1. Suppose first V is a complex inner product space. Then by the fundamental
theorem of algebra, α has an eigenvalue, say λ. We pick v ∈ V \ {0} such that
αv = λv. Then
    λ̄⟨v, v⟩ = ⟨λv, v⟩ = ⟨αv, v⟩ = ⟨v, αv⟩ = ⟨v, λv⟩ = λ⟨v, v⟩.
Since v 6= 0, we know hv, vi 6= 0. So λ = λ̄.

Suppose V is a real inner product space, let e1 , · · · , en be an orthonormal


basis for V . Then α is represented by a symmetric matrix A (with respect to
this basis). Since real symmetric matrices are Hermitian viewed as complex
matrices, this gives a self-adjoint endomorphism of Cn . By the complex case,
A has real eigenvalues only. But the eigenvalues of A are the eigenvalues of α
and MA (t) = Mα (t).

2. Now suppose αv = λv and αw = µw with λ 6= µ. We know hαv, wi = hv, αwi


by definition. This then gives λhv, wi = µhv, wi. Since λ 6= µ, we must have
hv, wi = 0.

We can also prove the real case of 1. without reducing to the complex case. We
know every irreducible factor of Mα (t) in R[t] must have degree 1 or 2, since the
roots are either real or come in complex conjugate pairs. Suppose f (t) were an
irreducible factor of degree 2. Then (Mα /f )(α) 6= 0 since it has degree less than
the minimal polynomial. So ∃v ∈ V such that u = (Mα /f )(α)(v) ≠ 0. Then
f (α)(u) = Mα (α)(v) = 0. Let U = ⟨{u, α(u)}⟩. Then this is an α-invariant subspace of
V (ie. α(U ) ⊆ U ) since f has degree 2 and f (α)(u) = 0 (thus α2 (u) is a linear
combination of u and α(u)).

Now α|U ∈ End(U ) is self-adjoint since α is. So if (e1 , e2 ) is an orthonormal
basis of U , then α|U is represented by a real symmetric matrix, say ( a b ; b c ). But then
χα|U (t) = (t − a)(t − c) − b2 , which has real roots since its discriminant is
(a − c)2 + 4b2 ≥ 0. This is a contradiction, since we must have Mα|U = f , but f is
irreducible of degree 2 and so has no real roots.

This result says that a Hermitian matrix has real eigenvalues and that eigenvectors
corresponding to distinct eigenvalues are orthogonal, which we saw in Part IA.

T. 4-156
Let V be a finite-dimensional inner product space, and α ∈ End(V ) self-adjoint.
Then V has an orthonormal basis of eigenvectors of α. In particular V is the
orthogonal (internal) direct sum of the eigenspaces of α.

By the previous lemma, α has a real eigenvalue, say λ. Then we can find an
eigenvector v ∈ V \ {0} such that αv = λv. Let U = hvi⊥ . Then we can write
V = hvi ⊥ U . We now want to prove α sends U into U . Suppose u ∈ U . Then

hv, α(u)i = hαv, ui = λhv, ui = 0.

So α(u) ∈ hvi⊥ = U . So α|U ∈ End(U ) and is self-adjoint. By induction on dim V ,


U has an orthonormal basis (v2 , · · · , vn ) of α–eigenvectors. Let v1 = v/kvk, then
(v1 , v2 , · · · , vn ) is an orthonormal basis of eigenvectors of α.

P. 4-157
Let A ∈ Matn (R) (resp. Matn (C)) be symmetric (resp. Hermitian). Then there
exists an orthogonal (resp. unitary) matrix P such that P T AP = P −1 AP (resp.
P † AP ) is diagonal with real entries.

Let h · , · i be the standard inner product on Fn . Then A is self-adjoint as an


endomorphism of Fn . So Fn has an orthonormal basis of eigenvectors for A, say
(v1 , · · · , vn ). Taking P = (v1 v2 · · · vn ) gives the result.
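Numerically, this is what numpy.linalg.eigh computes: an orthonormal set of eigenvectors assembled into the columns of P, with P^T AP diagonal and real. A minimal sketch (not from the notes; the symmetric matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])                  # real symmetric

eigvals, P = np.linalg.eigh(A)                   # columns of P are orthonormal eigenvectors
print(np.allclose(P.T @ P, np.eye(3)))           # P is orthogonal: True
print(np.round(P.T @ A @ P, 10))                 # diagonal matrix of the (real) eigenvalues
```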

P. 4-158
Let V be a finite-dimensional real inner product space and ψ : V × V → R a
symmetric bilinear form. Then there exists an orthonormal basis (v1 , · · · , vn ) for
V with respect to which ψ is represented by a diagonal matrix.

Let (u1 , · · · , un ) be any orthonormal basis for V . Then ψ is represented by a


symmetric matrix A. Then P there exists an orthogonal matrix P such that P T AP
is diagonal. Now let vi = Pki uk . Then (v1 , · · · , vn ) is an orthonormal basis
since
DX X E X
T
hvi , vj i = Pki uk , P`j u` = Pik P`j huk , u` i = (P T P )ij = δij .

Also, ψ is represented by P T AP with respect to (v1 , · · · , vn ).


Note that the diagonal values of P T AP are just the eigenvalues of A. So the
signature of ψ is just the number of positive eigenvalues of A minus the number
of negative eigenvalues of A.
We have the same results (with essentially the same proof) for complex inner
product space with symmetric bilinear form replaced by Hermitian form.
P. 4-159
Let V be a finite-dimensional real vector space and φ, ψ symmetric bilinear forms
on V such that φ is positive-definite. Then we can find a basis (v1 , · · · , vn ) for V
such that both φ and ψ are represented by diagonal matrices with respect to this
basis.

We use φ to define an inner product. Choose an orthonormal basis (v1 , · · · , vn )


for V (equipped with φ) with respect to which ψ is diagonal. Then φ is represented
by I with respect to this basis, since φ(vi , vj ) = δij .
This result also says that: If A, B ∈ Matn (R) are symmetric and A is positive
definite (ie. vT Av > 0 for all v ∈ Rn \ {0}), then there exists an invertible
matrix Q such that QT AQ and QT BQ are both diagonal.
We have a similar result in the complex case: If φ, ψ are Hermitian forms
on a finite-dimensional complex vector space and φ is positive definite, then there
exists a basis for which φ and ψ are diagonalized. Similarly, if A, B ∈ Matn (C)
are Hermitian, and A is positive definite (ie. v† Av > 0 for all v ∈ Cn \ {0}), then
there exists some invertible Q such that Q† AQ and Q† BQ are diagonal.
T. 4-160
Let V be a finite-dimensional complex vector space and α ∈ U (V ) be unitary.
Then V has an orthonormal basis consisting of eigenvectors of α. Moreover all the
eigenvalues have length 1.

By the fundamental theorem of algebra, there exists v ∈ V \ {0} and λ ∈ C such


that αv = λv. Now consider W = hvi⊥ . Then V = W ⊥ hvi. Let w ∈ W , then

hαw, vi = hw, α−1 vi = hw, λ−1 vi = 0.

So α(w) ∈ W and hence α|W ∈ End(W ). Also, α|W is unitary since α is. So
by induction on dim V , W has an orthonormal basis of α eigenvectors. If we
add v/kvk to this basis, we get an orthonormal basis of V itself comprised of α
eigenvectors.

If αv = λv, then |λ|2 hv, vi = hαv, αvi = hv, vi, hence |λ| = 1.
This theorem and the analogous one for self-adjoint endomorphisms have a com-
mon generalization, at least for complex inner product spaces. The key fact that
leads to the existence of an orthonormal basis of eigenvectors is that α and α∗
commute. This is clearly a necessary condition, since if α is diagonalizable, then
α∗ is diagonal in the same basis (since it is just the transpose (and conjugate)),
and hence they commute. It turns out this is also a sufficient condition.
However, we cannot generalize this in the real orthogonal case. For example the
rotation matrix ( cos θ  sin θ ; −sin θ  cos θ ) ∈ O(R2 ) cannot be diagonalized (over R)
if θ ∉ πZ.
CHAPTER 5

Analysis II

5.1 Sequence of functions

D. 5-1

Let E be a set and M be a metric space. We say the sequence of functions


fn : E → M

• converges pointwise to the function f : E → M if f (x) = limn→∞ fn (x) for all


x ∈ E.

• converges uniformly to the function f : E → M if

∀ε > 0, ∃N s.t. ∀x ∈ E, ∀n > N, d(fn (x), f (x)) < ε.

Alternatively, we can say ∀ε > 0, ∃N s.t. ∀n > N, supx∈E d(fn (x), f (x)) < ε.
For M = R with the usual metric (norm) this is kfn − f k∞ → 0 as a real
sequence, where kgk∞ = supx∈E |g(x)|.

? Just to be clear, when we write fn → 0, the 0 refers to the function that send
everything to 0, this can be understood as the “0” (additive identity) of the
vector space of functions (eg. C[a, b]).

• is pointwise Cauchy if for each x ∈ E, (fn (x)) is a Cauchy sequence in M .

• is uniformly Cauchy if ∀ε > 0, ∃N s.t.∀m, n > N, supx∈E d(fn (x), fm (x)) < ε.

E. 5-2

When defining convergence for sequence of functions, we want it to have properties


similar to that of convergence of numbers. For example, a constant sequence
fn = f has to converge to f , and convergence should not be affected if we change
finitely many terms. It should also act nicely with products and sums.

Pointwise convergence has the usual properties of convergence. However, there


is a problem. Ideally, We want to deduce properties of f from properties of fn .
For example, it would be great if continuity of all fn implies continuity of f , and
similarly for integrability and values of derivatives and integrals. However, it turns
out we cannot. The notion of pointwise convergence is too weak. We will look at
many examples where f fails to preserve the properties of fn .


• Let fn : [−1, 1] → R be defined by fn (x) = x^{1/(2n+1)} . These are all continuous,
but the pointwise limit function is

    fn (x) → f (x) = 1 for 0 < x ≤ 1,   0 for x = 0,   −1 for −1 ≤ x < 0,

which is not continuous. Alternatively, we can let fn be a piecewise linear function
(as in the graph originally drawn here, with corners at x = ±1/n, equal to ∓1 outside
them) which converges to the same function as above.

• Let fn : [0, 1] → R be the piecewise linear function formed by joining (0, 0),
(1/n, n), (2/n, 0) and (1, 0). The pointwise limit of this function is fn (x) →
f (x) = 0. However, we have

    ∫_0^1 fn (x) dx = 1 for all n;        ∫_0^1 f (x) dx = 0.

So the limit of the integral is not the integral of the limit.

• Let fn : [0, 1] → R be defined as fn (x) = 1 if n!x ∈ Z and fn (x) = 0 otherwise.


Since fn has finitely many discontinuities, it is Riemann integrable. However,
the limit is

    fn (x) → f (x) = 1 if x ∈ Q,   0 if x ∉ Q,

which is not integrable. So integrability of a function is not preserved by point-


wise limits.

This suggests that we need a stronger notion of convergence. Of course, we don’t


want this notion to be too strong. For example, we could define fn → f to mean
“fn = f for all sufficiently large n”, then any property common to fn is obviously
inherited by the limit. However, this is clearly silly since only the most trivial
sequences would converge.

Hence we want to find a middle ground between the two cases — a notion of conver-
gence that is sufficiently strong to preserve most interesting properties, without
being too trivial. To do so, we can examine what went wrong in the examples
above. In the last example, even though our sequence fn does indeed tends point-
wise to f , different points converge at different rates to f . For example, at x = 1,
we already have f1 (1) = f (1) = 1. However, at x = (100!)−1 , f99 (x) = 0 while
f (x) = 1. No matter how large n is, we can still find some x where fn (x) differs
a lot from f (x). In other words, if we are given pointwise convergence, there is no
guarantee that for very large n, fn will “look like” f , since there might be some
points for which fn has not started to move towards f . Hence, what we need is
for fn to converge to f at the same pace, this is known as uniform convergence.

E. 5-3
• It should be clear from definition that if fn → f uniformly, then fn → f pointwise.
But the converse is false:
 fn : [−1, 1] → R is defined by fn (x) = x^{1/(2n+1)} . If the uniform limit existed,
then it must be given by

    f (x) = 1 for 0 < x ≤ 1,   0 for x = 0,   −1 for −1 ≤ x < 0,

since uniform convergence implies pointwise convergence. Pick ε = 1/4. Then
for x = 2^{−(2n+1)} we have fn (x) = 1/2, f (x) = 1. So for all N there exist some x
and n > N such that |fn (x) − f (x)| > ε. So fn does not converge to f uniformly.
 Let fn : R → R be defined by fn (x) = x/n. Then fn (x) → f (x) = 0 pointwise.
However, this convergence is not uniform in R since |fn (x) − f (x)| = |x|/n, and
this can be arbitrarily large for any n. However, if we restrict fn to a bounded
domain like [−a, a] then the convergence is uniform.
• Uniform convergence is quite strong in the sense that if fn → f uniformly and
fn ∈ C[a, b] (and hence f ∈ C[a, b], which we will prove later),[C.1-13] then fn → f
wrt ‖ · ‖p for any p ∈ N0 . This is since

    ‖fn − f ‖p ≡ ( ∫_a^b |fn (x) − f (x)|^p dx )^{1/p} ≤ ( ∫_a^b ‖fn − f ‖∞^p dx )^{1/p}
               = (b − a)^{1/p} ‖fn − f ‖∞ → 0   as n → ∞.
E. 5-4
Let fn : [0, 1] → R be define by fn (x) = (1 − x)xn . Show that fn → 0 uniformly.

We split the function into two parts which we can “control”. Given ε > 0,
∃N s.t. (1 − ε)^N ≤ ε. Now for all n ≥ N we have sup_{x∈[0,1]} |fn (x) − 0| ≤ ε
since

    |fn (x)| = (1 − x)x^n ≤ 1 · (1 − ε)^n ≤ ε   for x ∈ [0, 1 − ε],
    |fn (x)| = (1 − x)x^n ≤ ε · 1^n = ε          for x ∈ [1 − ε, 1].
Alternatively we could do this by finding the maximum of fn through differentia-
tion.
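A quick numerical check of this example (an illustration, not part of the notes): the suprema over a fine grid of [0, 1] do tend to 0, consistent with uniform convergence; by calculus the exact maximum is attained at x = n/(n + 1).

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10001)                  # fine grid of [0, 1]
for n in [1, 5, 20, 100, 1000]:
    fn = (1 - x) * x**n
    print(n, fn.max())                            # approximate sup over [0,1], decreasing to 0
```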
T. 5-5
Let E be a set and M be a metric space. (fn : E → M ) converges uniformly
implies (fn : E → M ) is uniformly Cauchy. Also, the converse is true if M is
complete.

(Forward) First suppose that fn → f uniformly. Given ε, we know that there is


some N such that ∀n > N, sup_{x∈E} d(fn (x), f (x)) < ε/2. Then if n, m > N and
x ∈ E, we have d(fn (x), fm (x)) ≤ d(fn (x), f (x)) + d(fm (x), f (x)) < ε.
(Backward) If (fn ) is uniformly Cauchy, then for each x ∈ E, (fn (x)) is Cauchy
and so converges (since M is complete). Let f (x) = limn→∞ fn (x). Since (fn ) is
uniformly Cauchy, given ε > 0, we can choose N such that whenever n, m > N ,
x ∈ E, we have d(fn (x), fm (x)) < ε/2. Letting m → ∞, fm (x) → f (x). So
d(fn (x), f (x)) ≤ ε/2 < ε.

Clearly the same result holds for pointwise convergence and pointwise Cauchy
sequence.
If we are given a concrete sequence of functions, then the usual way to show it
converges uniformly is to compute the pointwise limit (since the function that it
uniformly converge to must be same as its pointwise limit) and then prove that
the convergence is uniform. Since R is complete, if the sequence of functions is
real functions, it is often much easier to show that it is uniformly convergent by
showing that it is uniformly Cauchy.
T. 5-6
1. <Uniform limit theorem> Let E and M be metric spaces with x ∈ E. If
fn : E → M are continuous at x for all n and fn → f uniformly, then f is also
continuous at x.
2. If fn , f : [a, b] → R are Riemann integrable for all n and fn → f uniformly,
Rb Rb
then a fn (t) dt → a f (t) dt.

1. Let ε > 0. Choose N such that ∀n ≥ N , supy∈E d(fn (y), f (y)) < ε. Since fN is
continuous at x, there is some δ such that d(x, y) < δ ⇒ d(fN (x), fN (y)) < ε.
Then for each y such that d(x, y) < δ, we have

d(f (x), f (y)) ≤ d(f (x), fN (x)) + d(fN (x), fN (y)) + d(fN (y), f (y)) < 3ε.

2.  | ∫_a^b fn (t) dt − ∫_a^b f (t) dt | = | ∫_a^b (fn (t) − f (t)) dt | ≤ ∫_a^b |fn (t) − f (t)| dt
    ≤ sup_{t∈[a,b]} |fn (t) − f (t)| · (b − a) → 0   as n → ∞

Note also the first result implies if fn are continuous everywhere, then f is con-
tinuous everywhere. A slightly different version of the first result is: Let E be a
topological space and M a metric space. If fn : E → M is continuous for all n and
fn → f uniformly, then f is also continuous. To prove this note that by [L.1-37]
we just need to show that given any x ∈ X and ε > 0, we can find a neighbour-
hood U of x in E such that f (U ) ⊆ Bε (f (x)). Again we could pick N such that
supy∈E d(fN (y), f (y)) < ε/3, now since fN is continuous, we can find a neighbour-
hood U of x in E such that fN (U ) ⊆ Bε/3 (fN (x)). So now for all u ∈ U we have
d(f (u), f (x)) ≤ d(f (u), fN (u)) + d(fN (u), fN (x)) + d(fN (x), f (x)) < ε. The first
result can be concisely phrased as “the uniform limit of continuous functions is
continuous”.
We will prove later that if fn is integrable and fn → f uniformly, then f is
integrable.
These results show that uniform convergence tends to preserve properties of func-
tions. However, the relationship between uniform convergence and differentiability
is more subtle. The uniform limit of differentiable functions need not be differ-
entiable. Even if it were, the limit of the derivative is not necessarily the same
as the derivative of the limit, even if we just want pointwise convergence of the
derivative.
• Let fn , f : [−1, 1] → R be defined by fn (x) = |x|^{1+1/n} and f (x) = |x|. Then
fn → f uniformly. Each fn is differentiable — this is obvious at x ≠ 0, and at
x = 0, the derivative is

    fn′ (0) = lim_{x→0} (fn (x) − fn (0))/x = lim_{x→0} sgn(x)|x|^{1/n} = 0.

However, the limit f is not differentiable at x = 0.


• Let fn (x) = sin(nx)/√n for all x ∈ R. Then

    sup_{x∈R} |fn (x)| ≤ 1/√n → 0.

So fn → f = 0 uniformly in R. However, the derivative is fn′ (x) = √n cos(nx),
which does not converge to f ′ = 0, eg. at x = 0.
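A small numerical illustration of this last example (not in the notes): the functions become uniformly small while their derivatives at 0 blow up.

```python
import numpy as np

for n in [1, 10, 100, 10000]:
    sup_fn = 1 / np.sqrt(n)          # sup_x |sin(nx)/sqrt(n)| = 1/sqrt(n) -> 0
    deriv_at_0 = np.sqrt(n)          # f_n'(x) = sqrt(n) cos(nx), so f_n'(0) = sqrt(n) -> infinity
    print(n, sup_fn, deriv_at_0)
```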
T. 5-7
Let fn : [a, b] → R be a sequence of functions differentiable on [a, b] (at the end
points a, b, this means that the one-sided derivatives exist). If
1. For some c ∈ [a, b], fn (c) converges
2. The sequence of derivatives (fn0 ) converges uniformly on [a, b]
then (fn ) converges uniformly on [a, b], and if f = limn fn , then f is differentiable
with derivative f 0 (x) = limn fn0 (x).

To show that (fn ) converges uniformly on [a, b], we want to find an N such that
n, m > N implies sup |fn − fm | < ε. Fix x ∈ [a, b]. We apply the mean value
theorem to fn − fm to get

    (fn − fm )(x) − (fn − fm )(c) = (x − c)(fn′ − fm′ )(t)

for some t ∈ (x, c). Taking the supremum and rearranging terms, we obtain

    sup_{x∈[a,b]} |fn (x) − fm (x)| ≤ |fn (c) − fm (c)| + (b − a) sup_{t∈[a,b]} |fn′ (t) − fm′ (t)|.

So given any ε, since (fn′ ) and (fn (c)) converge and are hence Cauchy, there is some N
such that for any n, m ≥ N , sup_{t∈[a,b]} |fn′ (t) − fm′ (t)| < ε and |fn (c) − fm (c)| < ε.
Hence for all n, m ≥ N, sup_{x∈[a,b]} |fn (x) − fm (x)| < (1 + b − a)ε. So (fn ) converges
uniformly on [a, b]. Let f = lim fn .
Now we have to check differentiability. Let fn′ → h (ie. write h = lim fn′ ). For any
fixed y ∈ [a, b], define the two functions:

    gn (x) = (fn (x) − fn (y))/(x − y)  if x ≠ y,    gn (x) = fn′ (y)  if x = y;
    g(x)  = (f (x) − f (y))/(x − y)     if x ≠ y,    g(x)  = h(y)     if x = y.
By definition, fn is differentiable at y iff gn is continuous at y, and also f is
differentiable with derivative h at y iff g is continuous at y. However, we know
that gn → g pointwise on [a, b], and we know that gn are all continuous. So if we
can show that gn → g uniformly, then g is continuous and hence the final result.
For x ≠ y, we know that

    gn (x) − gm (x) = ((fn − fm )(x) − (fn − fm )(y))/(x − y) = (fn′ − fm′ )(t)

for some t ∈ [x, y]. This also holds for x = y, since gn (y) − gm (y) = fn′ (y) − fm′ (y)
by definition.
Let ε > 0. Since (fn′ ) converges uniformly, there is some N such that for all n, m > N ,
we have |gn (x) − gm (x)| ≤ sup |fn′ − fm′ | < ε. So for all n, m ≥ N , sup_{[a,b]} |gn − gm | <
ε, ie. gn converges uniformly.
Note that we do not assume that fn0 are continuous or even Riemann integrable.
If they are, then the proof is much easier! If we assume fn0 are continuous, then
by the fundamental theorem of calculus, we have
Z x
fn (x) = fn (c) + fn0 (t) dt. (∗)
c

Then for sufficiently large n, m > N we get

    sup_{[a,b]} |fn (x) − fm (x)| ≤ |fn (c) − fm (c)| + sup_{x∈[a,b]} | ∫_c^x (fn′ (t) − fm′ (t)) dt |
                                 ≤ |fn (c) − fm (c)| + (b − a) sup_{t∈[a,b]} |fn′ (t) − fm′ (t)| < ε

So fn is uniformly Cauchy and so fn → f uniformly for some function f : [a, b] →


R. Since the fn0 are continuous, h = limn→∞ fn0 is continuous and hence integrable.
Taking the limit of (∗), we get
Z x
f (x) = f (c) + h(t) dt.
c

Then the fundamental theorem of calculus says that f is differentiable and f 0 (x) =
h(x) = lim fn0 (x). So done.
................................................................................
The result can be generalised for the sequence of differentiable functions fn : Ω →
C, where Ω ⊆ C is a bounded convex set.
Let F : Ω → C be a holomorphic function, and let c and x be distinct points in
Ω. Applying the mean value theorem to F1 , F2 : [0, 1] → R defined by F1 (t) =
Re(F (c + t(x − c))/(x − c)) and F2 (t) = Im(F (c + t(x − c))/(x − c)), we see that
there exist points u, v on the line segment from c to x such that
   
F (x) − F (c) F (x) − F (c)
Re(F 0 (u)) = Re , Im(F 0 (v)) = Im .
x−c x−c

Let F = fn − fm . For Ω bounded, say with |x − c| ≤ α for all x, c ∈ Ω, we have

    ((fn − fm )(x) − (fn − fm )(c))/(x − c) = Re((fn − fm )′ (u)) + i Im((fn − fm )′ (v))

    =⇒ sup_{x∈Ω} |fn (x) − fm (x)| ≤ |fn (c) − fm (c)| + 2α sup_{t∈Ω} |fn′ (t) − fm′ (t)|

Also for x ≠ y, we have gn (x) − gm (x) = Re((fn − fm )′ (t)) + i Im((fn − fm )′ (w))
for some t, w on the line segment from x to y. So sup_{x∈Ω} |gn (x) − gm (x)| ≤
2 sup_{t∈Ω} |fn′ (t) − fm′ (t)|. So our original proof still holds.

P. 5-8
1. Let fn , gn : E → C, be sequences, and fn → f , gn → g uniformly on E. Then
for any a, b ∈ C, afn + bgn → af + bg uniformly.
2. Let fn → f uniformly, and let g : E → C be bounded. Then gfn : E → C
converges uniformly to gf .

1. sup |(afn + bgn ) − (af + bg)| ≤ |a| sup |fn − f | + |b| sup |gn − g| → 0 as n → ∞.
2. Say |g(x)| < M for all x ∈ E. Then |(gfn )(x) − (gf )(x)| ≤ M |fn (x) − f (x)|.
So supE |gfn − gf | ≤ M supE |fn − f | → 0.
Note that 2 is false without assuming boundedness. An easy example is to take
the constant functions fn = 1/n on R, and g(x) = x. Then fn → 0 uniformly, but
g(x)fn (x) = x/n does not converge uniformly to 0.

5.2 Series of functions


D. 5-9
Let gn be a sequence of functions. We say the series Σ_{n=1}^∞ gn converges at a
point x if the sequence of partial sums fn = Σ_{j=1}^n gj converges at x. The series
converges uniformly if fn converges uniformly.
For gn : E → M where M is a normed space, we say Σ gn converges absolutely at
a point x ∈ E if Σ ‖gn ‖ converges at x, and say Σ gn converges absolutely uniformly
if Σ ‖gn ‖ converges uniformly.
P. 5-10
Let gn : E → M , where M is a complete normed space. If Σ gn converges
absolutely uniformly, then Σ gn converges uniformly.

Note we don’t have a candidate for the limit,


P so we make use of uniformly Cauchy
implying uniform convergence. Let fn = n
Pn
j=1 gj and hn (x) = j=1 kgj k be the
partial sums. Then for n > m, we have
n n
X X
kfn (x) − fm (x)k = gj (x) ≤ kgj (x)k = khn (x) − hm (x)k.


j=m+1 j=m+1

By hypothesis, we have supx∈E khn (x) − hm (x)k → 0 as n, m → ∞. Therefore


supx∈E kfn (x) − fm (x)k → 0 as n, m → ∞.
It is important to remember that uniform convergence plus absolute pointwise
convergence does not imply absolute uniform convergence.
E. 5-11
Consider the series Σ_{n=1}^∞ (−1)^n x^n /n. This converges absolutely for every x ∈ [0, 1)
since it is bounded by the geometric series. Also, it converges uniformly on [0, 1)
since it is uniformly Cauchy, as sup_{x∈[0,1)} | Σ_{k=m}^n (−1)^k x^k /k | → 0 as n, m → ∞.
However, this does not converge absolutely uniformly on [0, 1), as can be seen by
considering the difference in partial sums
    Σ_{j=m}^n | (−1)^j x^j /j | = Σ_{j=m}^n |x|^j /j ≥ ( 1/m + 1/(m+1) + · · · + 1/n ) |x|^n .

For each N , we can make this difference large enough by picking a really large n,
and then making x close enough to 1. So the supremum of it does not tend to 0
as n, m → ∞.
T. 5-12
<Weierstrass M-test> Let gn : E → M , where M is a complete normed space.
Suppose there is some sequence Mn such that for all n, we have

sup kgn (x)k ≤ Mn .


x∈E
If Σ Mn converges, then Σ gn converges absolutely uniformly.

Let fn = Σ_{j=1}^n ‖gj ‖ be the partial sums. Then for n > m, we have

    |fn (x) − fm (x)| = Σ_{j=m+1}^n ‖gj (x)‖ ≤ Σ_{j=m+1}^n Mj .

Taking the supremum, we have sup_x |fn (x) − fm (x)| ≤ Σ_{j=m+1}^n Mj → 0 as n, m → ∞.
So done by [T.5-5].
Note that this holds when M = C or R.
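As a numerical illustration of the M-test (not part of the notes), consider Σ sin(nx)/n^2 on R with Mn = 1/n^2: the difference of partial sums is bounded, uniformly in x, by the corresponding tail of Σ 1/n^2.

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
N, M = 200, 100                                       # compare partial sums f_N and f_M
tail = sum(np.sin(n * x) / n**2 for n in range(M + 1, N + 1))
print(np.abs(tail).max())                             # sup_x |f_N(x) - f_M(x)| on the grid
print(sum(1.0 / n**2 for n in range(M + 1, N + 1)))   # the M-test bound: sum of M_n, slightly larger
```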
L. 5-13
Let V, W be normed spaces and f : V → W . Let U = ∪_{α∈A} Uα , where Uα are
open subsets of V (for all α ∈ A). If f |Uα : Uα → W is continuous for all α ∈ A,
then f |U : U → W is continuous.

We need to show that ∀v ∈ U, ∀ε > 0, ∃δ > 0 s.t. f (Bδ (v) ∩ U ) ⊆ Bε (f (v)).


Suppose v ∈ U , then v ∈ Uα for some α ∈ A. Let ε > 0. Now f |Uα is continuous,
so ∃δ1 > 0 s.t.f (Bδ1 (v)∩Uα ) ⊆ Bε (f (v)). But Uα is open, so ∃ε2 > 0 s.t.Bδ2 (v) ⊆
Uα . Let δ = min{δ1 , δ2 }, then Bδ (v) ⊆ Bδ1 (v) ∩ Uα . Thus f (Bδ (v) ∩ U ) =
f (Bδ (v)) ⊆ f (Bδ1 (v) ∩ Uα ) ⊆ Bε (f (v)).
T. 5-14
Let Σ_{n=0}^∞ cn (x − a)^n be a (complex) power series and R ∈ [0, +∞] its radius of
convergence.¹
1. Σ cn (x − a)^n converges absolutely uniformly on B̄r (a) = {y ∈ C : |y − a| ≤ r}
for any r < R.
2. f (x) = Σ cn (x − a)^n is continuous on BR (a).

1. We know that Σ |cn |r^n is convergent. But we know that if x ∈ B̄r (a), then
|cn (x − a)^n | ≤ |cn |r^n . So the result follows from the Weierstrass M-test by
taking Mn = |cn |r^n .
2. For any N , Σ_{n=0}^N cn (x − a)^n is a polynomial, hence it's continuous. So by the
uniform limit theorem, f (x) is continuous on B̄r (a) for any r < R.
(Proof 1) Given x ∈ BR (a), we can pick r with x ∈ Br (a), then f is continuous
at x. Hence f is continuous on BR (a).
(Proof 2) BR (a) = ∪_{r<R} Br (a), hence done by the previous lemma.
1
cn (x − a)n converges absolutely. If |x − a| > R,
P
Recall part IA analysis I: If |x − a| < R, then
cn (x − a)n diverges.
P
then

We say that the sum converges locally absolutely uniformly inside circle of conver-
gence, ie. for every point y ∈ BR (a), there is some open disc around y on which
the sum converges absolutely uniformly.

Note that uniform convergence need not hold on the entire interval of convergence.
For example consider Σ x^n for real x. This converges for x ∈ (−1, 1), but uniform
convergence fails on (−1, 1) since the tail

    Σ_{j=m}^n x^j = x^m Σ_{j=0}^{n−m} x^j ≥ (n − m + 1) x^n    for 0 ≤ x < 1.

This is not uniformly small since we can make this large by picking x close to 1.

T. 5-15
(Term-wise differentiation of power series) If Σ cn (x − a)^n is a power series with
radius of convergence R > 0, then
1. The "derived series" Σ_{n=1}^∞ n cn (x − a)^{n−1} has radius of convergence R.
2. The function defined by f (x) = Σ cn (x − a)^n , x ∈ BR (a) = {y ∈ C : |y − a| <
R} is differentiable with derivative f ′ (x) = Σ n cn (x − a)^{n−1} within the (open)
circle of convergence.

1. Let R1 be the radius of convergence of the derived series. We know

    |cn (x − a)^n | = |cn ||x − a|^{n−1} |x − a| ≤ |n cn (x − a)^{n−1} ||x − a|.

Hence if the derived series Σ n cn (x − a)^{n−1} converges absolutely for some x,
then so does Σ cn (x − a)^n . So R1 ≤ R.
Suppose that R1 < R, then there are r1 , r such that R1 < r1 < r < R, where
Σ n|cn |r1^{n−1} diverges while Σ |cn |r^n converges. But this cannot be true since
n|cn |r1^{n−1} ≤ |cn |r^n for sufficiently large n. So we must have R1 = R.

2. Below we show the real case, but it generalises to C.
Let fn (x) = Σ_{j=0}^n cj (x − a)^j , then fn′ (x) = Σ_{j=1}^n j cj (x − a)^{j−1} . We want to
use [T.5-7]. This requires that fn converges at a point, and that fn′ converges
uniformly. The first is obviously true, and we know that fn′ converges uniformly
on [a − r, a + r] for any r < R. So for each x0 , there is some interval containing x0
on which fn′ is uniformly convergent. So on this interval, we know that f (x) =
limn→∞ fn (x) is differentiable with f ′ (x) = limn→∞ fn′ (x) = Σ_{j=1}^∞ j cj (x − a)^{j−1} .
In particular, f ′ (x0 ) = Σ_{j=1}^∞ j cj (x0 − a)^{j−1} . Since this is true for all x0 , the
result follows.
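A concrete check (illustrative, not from the notes): for the geometric series Σ x^n, which has R = 1, the derived series Σ n x^{n−1} converges to 1/(1 − x)^2 inside the circle of convergence, in line with term-wise differentiation.

```python
x = 0.5                                            # inside the circle of convergence |x| < 1
N = 60
f_prime_partial = sum(n * x**(n - 1) for n in range(1, N + 1))   # partial sum of the derived series
print(f_prime_partial, 1.0 / (1.0 - x)**2)         # both approximately 4.0
```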

5.3 Normed space


Given a normed space[D.1-7] , we obtain notions of boundedness, continuity of functions,
etc. in the space through the induced metric[L.1-10] . Material already covered in Metric
and Topological Spaces will not be repeated here.

D. 5-16
Write RN to be the set of all infinite real sequences (xk ).
C. 5-17
<Space of sequences> We extend our notions on Rn (a finite-dimensional vector
space) to RN , which is infinite-dimensional. RN is a vector space with termwise
addition and scalar multiplication.
• Define `1 = { (xk ) ∈ RN : Σ |xk | < ∞ }. This is a linear subspace of RN . We can
define the norm on it by ‖(xk )‖1 = ‖(xk )‖`1 = Σ |xk |.
• Similarly, we can have the subspace `2 = { (xk ) ∈ RN : Σ xk^2 < ∞ } and a norm
on it defined by ‖(xk )‖2 = ‖(xk )‖`2 = ( Σ xk^2 )^{1/2} .
We can also write this as ‖(xk )‖`2 = limn→∞ ‖(x1 , · · · , xn )‖2 . So the triangle
inequality for the Euclidean norm implies the triangle inequality for `2 .
• In general, for p ≥ 1, we can define `p = { (xk ) ∈ RN : Σ |xk |^p < ∞ } with the
norm ‖(xk )‖p = ‖(xk )‖`p = ( Σ |xk |^p )^{1/p} .
• Finally, we have `∞ , where `∞ = {(xk ) ∈ RN : sup |xk | < ∞}, with the norm
‖(xk )‖∞ = ‖(xk )‖`∞ = sup |xk |.
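For a concrete feel (not in the notes): the sequence xk = 1/k lies in `2 and `∞ but not in `1, which the truncated sums below suggest numerically.

```python
import numpy as np

k = np.arange(1, 100001, dtype=float)
x = 1.0 / k                               # the sequence x_k = 1/k, truncated at 10^5 terms
print(np.sum(np.abs(x)))                  # ~ log(10^5): keeps growing, so x is not in l^1
print(np.sqrt(np.sum(x**2)))              # ~ sqrt(pi^2/6): converges, so x is in l^2
print(np.max(np.abs(x)))                  # 1.0, so x is in l^infinity
```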

E. 5-18
When we define the `p norm and space, we first have the norm defined as a sum,
and then `p to be the set of all sequences for which the sum converges. However,
note in [C.1-13] when we define the Lp space, we restrict ourselves to C([0, 1]), and
then define the norm. Can we just define, say, L1 to be the set of all functions such
that ∫_0^1 |f | dx exists? No, because then ‖ · ‖1 would no longer be a norm: if
we take the function f (x) = 1 when x = 0.5 and 0 otherwise, then f is integrable
with integral 0, but is not identically zero (ie. the requirement ‖v‖ = 0 ⇔ v = 0
is violated). So we cannot expand our vector space to be too large. To define Lp
properly, we need some more sophisticated notions such as Lebesgue integrability
and other fancy stuff, which we are not doing here.
D. 5-19
Let V be a (real) vector space. Two norms k · k and k · k0 on V are called
Lipschitz equivalent if there are real constants 0 < a < b such that akxk ≤
kxk0 ≤ bkxk for all x ∈ V .
E. 5-20
Note that Lipschitz equivalence forms an equivalence relation on the set of all
norms on V . Looking at [L.1-31] and [E.1-32], we see that norms that are Lipschitz
equivalent induces the same topology, hence the “topological” properties of the
space do not depend on which norm we choose, and the norms will agree on which
sequences are convergent and which functions are continuous.
We can also see that the requirement akxk ≤ kxk0 ≤ bkxk for Lipschitz equivalent
is equivalent to B1/b (0) ⊆ B10 (0) ⊆ B1/a (0) where B 0 is the ball with respect to
k · k0 , while B is the ball with respect to k · k. (Recall Br (a) = {x ∈ V : kx − ak <
r})
................................................................................
Later we will show that any two norms on a finite-dimensional vector space are
Lipschitz equivalent. Here we look at infinite dimensional cases.
Let V = C([0, 1]) with the norms ‖f ‖1 = ∫_0^1 |f | dx and ‖f ‖∞ = sup_{[0,1]} |f |. We
clearly have the bound ‖f ‖1 ≤ ‖f ‖∞ . However, there is no constant b such that
‖f ‖∞ ≤ b‖f ‖1 for all f .
This is easy to show by constructing a sequence of functions fn as in the diagram
originally drawn here: a triangular spike of width 2/n and height 1. Then
‖fn ‖∞ = 1 but ‖fn ‖1 = 1/n → 0.
Similarly, consider the space `2 = { (xn ) : Σ xn^2 < ∞ } under the regular `2 norm
and the `∞ norm. We have ‖(xk )‖∞ ≤ ‖(xk )‖`2 but there is no b such that
‖(xk )‖`2 ≤ b‖(xk )‖∞ . For example, consider the sequence x(n) = (1, 1, · · · , 1, 0, 0, · · · ),
where the first n terms are 1.
So far in all our examples, out of the two inequalities, one holds and one does not.
Actually it is possible for both inequalities to not hold.
P. 5-21
Suppose k · k and k · k0 are two norms on the vector space V . The followings are
equivalent:
1. ∃C > 0 such that kvk0 ≤ Ckvk for all v ∈ V .
2. The map id : (V, k · k) → (V, k · k0 ) defined by id(v) = v is continuous.
3. τ 0 ⊆ τ where τ and τ 0 are the topology induced by k · k and k · k0 respectively.
In particular if k · k and k · k0 are Lipschitz equivalent, then τ 0 = τ and convergence
and Cauchy-ness (of a sequence), boundedness (of a set), continuity (of a function),
completeness (of the space) etc. on k · k and k · k0 agrees.

2 ⇔ 3: id is continuous iff every open set wrt k · k0 is open wrt k · k.


1 ⇒ 2: Suppose vn → v wrt k · k, then kvn − vk0 ≤ Ckvn − vk → 0, so vn → v
wrt k · k0 , i.e. id(vn ) → id(v). Hence id is continuous.
2 ⇒ 1: ∃δ > 0 s.t. Bδ (0) ⊆ B10 (0). For any v ∈ V with v 6= 0, ∃K s.t. kKvk =
δ/2, so Kv ∈ Bδ (0) ⊆ B10 (0), so kKvk0 < 1 = (2/δ)kKvk. Hence
kvk0 ≤ (2/δ)kvk, which clearly also holds for v = 0.
If the norms are not equivalent it is possible that there are some sequences that
converge with respect to one norm but not another, or even that a sequence con-
verges to different limits under different norms.
P. 5-22
Let (V, k · k) be a normed space. Then
1. If xk → x and xk → y, then x = y.
2. If xk → x, then axk → ax.
3. If xk → x, yk → y, then xk + yk → x + y.

1. kx − yk ≤ kx − xk k + kxk − yk → 0. So kx − yk = 0. So x = y.
2. kaxk − axk = |a|kxk − xk → 0.
3. k(xk + yk ) − (x + y)k ≤ kxk − xk + kyk − yk → 0.

P. 5-23
Let (V, k · k) be a normed vector space, then k · k : (V, k · k) → R is continuous.

Given any u, v ∈ V , we have kuk ≤ ku − vk + kvk, so ku − vk ≥ kuk − kvk and


by symmetry ku − vk ≥ kvk − kuk. Hence |kuk − kvk| ≤ ku − vk and so k · k is
continuous.
E. 5-24
Given v ∈ V , define fv : V → R by fv (w) = kw − vk, then fv is a continuous
function since it’s a composite of continuous function. So Br (v) = fv−1 ((−r, r))
is open in V , so we have proven that open balls are open. We can do similar for
closed balls.
P. 5-25
Convergence in Rn (with respect to, say, the Euclidean norm) is equivalent to
(k)
coordinate-wise convergence, ie. x(k) → x if and only if xj → xj for all j.

Fix ε > 0. Suppose x(k) → x. Then there is some N such that for any k ≥ N ,

    ‖x(k) − x‖2^2 = Σ_{j=1}^n (xj^{(k)} − xj )^2 < ε^2 .

Hence |xj^{(k)} − xj | < ε for all k ≥ N and all j. Conversely, suppose for each fixed j
there is some Nj such that k ≥ Nj implies |xj^{(k)} − xj | < ε/√n. Then for
k ≥ max{Nj : j = 1, · · · , n},

    ‖x(k) − x‖2 = ( Σ_{j=1}^n (xj^{(k)} − xj )^2 )^{1/2} < ε.

Note that this results also says that f = (f1 , · · · , fm ) : Rn → Rm is continuous iff
each of fi is continuous. This is because f is continuous iff f (xn ) → f (x) whenever
xn → x.
E. 5-26
Another space we would like to understand is the space of continuous functions.
It should be clear that uniform convergence (supx |fn − f | → 0) is the same as
convergence under the uniform norm (as kfn −f k = supx |fn −f |), hence the name.
However, there is no norm such that convergence under the norm is equivalent to
pointwise convergence, ie. pointwise convergence is not normable. In fact, it is
not even metrizable. However, we will not prove this.
T. 5-27
<Bolzano-Weierstrass theorem in Rn > Any bounded sequence in Rn (with,
say, the Euclidean norm) has a convergent subsequence.

(Proof 1) We induct on n. The n = 1 case is the usual Bolzano-Weierstrass on


the real line, which was proved in IA Analysis I. Assume the theorem holds in
(k) (k)
Rn−1 , and let x(k) = (x1 , · · · , xn ) be a bounded sequence in Rn . Then let
(k) (k)
y(k) = (x1 , · · · , xn−1 ). Since for any k, we know that

    ‖y(k) ‖^2 + |xn^{(k)} |^2 = ‖x(k) ‖^2 ,
(k)
it follows that both (y(k) ) and (xn ) are bounded. So by the induction hypothesis,
there is a subsequence (kj ) of (k) and some y ∈ Rn−1 such that y(kj ) → y. Also,
by Bolzano-Weierstrass in R, there is a further subsequence (xn^{(k_{j`})} ) of (xn^{(k_j)} ) that
converges to, say, yn ∈ R. Then we know that x^{(k_{j`})} → (y, yn ).
converges to, say, yn ∈ R. Then we know that x(kj` ) → (y, yn ).
(Proof 2) By [T.1-128] and [T.1-139].
All finite dimensional vector spaces are isomorphic to Rn as vector spaces for some
n, and we will later show that all norms on finite dimensional spaces are equiv-
alent. This means every finite-dimensional normed space satisfies the Bolzano-
Weierstrass property. It turns out the converse is also true: if a normed vector
space satisfies the Bolzano-Weierstrass property, then it must be finite-dimensional.
E. 5-28
Note that the above is generally not true for normed spaces. Finite-dimensionality
is important for both of the above results.
• Consider (`∞ , ‖ · ‖∞ ). We let e(k) be the sequence with ej^{(k)} = δjk , i.e. 1 in the kth
component and 0 in the other components. Then ej^{(k)} → 0 for all fixed j, and hence
e(k) converges componentwise to the zero element 0 = (0, 0, · · · ). However, e(k)
does not converge to the zero element since ‖e(k) − 0‖∞ = 1 for all k. Also, this
sequence is bounded but does not have a convergent subsequence for the same reasons.
• Let C([0, 1]) have the ‖ · ‖L2 norm. Consider fn (x) = sin(2nπx). We know that

    ‖fn ‖L2^2 = ∫_0^1 |fn |^2 = 1/2.

So it is bounded. However, it doesn't have a convergent subsequence. If it did,
say fnj → f in L2 , then we must have ‖fnj − fnj+1 ‖2 → 0. However, by direct
calculation, we know that

    ‖fnj − fnj+1 ‖2^2 = ∫_0^1 (sin(2nj πx) − sin(2nj+1 πx))^2 = 1.

In fact the same argument shows also that the sequence (sin 2nπx) has no
subsequence that converges pointwise on [0, 1]: we need the result that if (fj )
is a sequence in C([0, 1]) that is uniformly bounded with fj → f pointwise,
then fj converges to f under the L2 norm. However, we will not be able to
prove this (in a nice way) without Lebesgue integration from IID Probability
and Measure.
L. 5-29
1. Any convergent sequence is Cauchy.
2. A Cauchy sequence is bounded.
3. If a Cauchy sequence has a subsequence converging to an element x, then the
whole sequence converges to x.

1. If xk → x, then kxk − x` k ≤ kxk − xk + kx` − xk → 0 as k, ` → ∞.


2. There is some N such that for all n ≥ N , we have kxN − xn k < 1. So kxn k <
1 + kxN k for n ≥ N . So, for all n, kxn k ≤ max{kx1 k, · · · , kxN −1 k, 1 + kxN k}.
3. Suppose xkj → x. Since (xk ) is Cauchy, given ε > 0, we can choose an N
such that ‖xn − xm ‖ < ε/2 for all n, m ≥ N . We can also choose j0 such that
kj0 ≥ N and ‖xkj0 − x‖ < ε/2. Then for any n ≥ N , we have ‖xn − x‖ ≤
‖xn − xkj0 ‖ + ‖x − xkj0 ‖ < ε.

Clearly these results apply to metric spaces as well as normed spaces.
T. 5-30
Rn (with the Euclidean norm, say) is complete.
(k)
(Proof 1) If (xk ) is Cauchy in Rn , then (xj^{(k)} )k is a Cauchy sequence of real numbers
for each j ∈ {1, · · · , n}. By the completeness of the reals, we know that xj^{(k)} → xj ∈
R for some xj . So xk → x = (x1 , · · · , xn ) since convergence in Rn is equivalent to
componentwise convergence.
(Proof 2) By [T.1-146].
E. 5-31
• Let V = {(xn ) ∈ RN : xj = 0 for all but finitely many j}. Take the supremum
norm ‖ · ‖∞ on V . V is a subspace of `∞ (and is sometimes denoted `0 ). Then
(V, ‖ · ‖∞ ) is not complete: we define x(k) = (1, 1/2, 1/3, · · · , 1/k, 0, 0, · · · ) for k =
1, 2, 3, · · · . Then this is Cauchy, since

    ‖x(k) − x(`) ‖ = 1/(min{`, k} + 1) → 0,

but it is not convergent in V . If it actually converged to some x, then xj^{(k)} → xj .
So we must have xj = 1/j, but this sequence is not in V .

• We claim that C([0, 1]) is not complete in L1 (i.e. with the ‖ · ‖1 norm). Consider
fn ∈ C([0, 1]) where fn is the function whose graph {(x, fn (x))} is the piecewise
straight line joining (0, 0), (1/2, 0), (1/2 + 1/n, 1) and (1, 1); then (fn ) is Cauchy,
however it doesn't converge:
Suppose fn → f ∈ C([0, 1]) in L1 . Then fn → f in L1 on [0, 1/2], but fn is the zero
constant function on [0, 1/2], and the only continuous function it can converge to in L1
is the zero constant function, so f |[0,1/2] = 0. Also for any N we must have fn → f
in L1 on [1/2 + 1/N, 1], but eventually for n large enough fn |[1/2+1/N,1] will be the
constant function 1, so f |[1/2+1/N,1] = 1; however N is arbitrary, so f |(1/2,1] = 1.
But this means f ∉ C([0, 1]), contradiction.
E. 5-32
1. Show that (`1 , k · k1 ) is complete.
2. Show that (C([a, b]), k · k∞ ) is complete.

1. Suppose (xn )1 , (xn )2 , · · · is a Cauchy sequence in `1 . Then

    ∀ε > 0, ∃N s.t. ∀p, q ≥ N, ‖(xn )p − (xn )q ‖1 = Σ_{n=1}^∞ |xn^p − xn^q | < ε.   (∗)

First we find the possible limit. In (∗) note that |xn^p − xn^q | ≤ Σ_{n′=1}^∞ |xn′^p − xn′^q | < ε,
so for each n we see that xn^1 , xn^2 , · · · is a Cauchy sequence in R, so it converges to
say xn .
Next we want to show that (xn )m → (xn ) in (`1 , ‖ · ‖1 ). Note that ∀p, q ≥ N
we have Σ_{n=1}^M |xn^p − xn^q | ≤ Σ_{n=1}^∞ |xn^p − xn^q | < ε for any M . Taking the limit
q → ∞ we have Σ_{n=1}^M |xn^p − xn | ≤ ε for any M ; now taking the limit M → ∞
we have ‖(xn )p − (xn )‖1 = Σ_{n=1}^∞ |xn^p − xn | ≤ ε. So we now have ∀ε > 0,
∃N s.t. ∀p ≥ N, ‖(xn )p − (xn )‖1 ≤ ε, that is (xn )p → (xn ).
5.3. NORMED SPACE 175

2. Since uniform Cauchy convergence implies uniform convergence (as R is com-


plete), and the uniform limit of continuous functions is continuous.

E. 5-33
The spaces `1 , `2 , `∞ are all complete with respect to the standard norms. C([0, 1])
is complete with respect to k · k∞ but not with the L1 or L2 norms.

The incompleteness of L1 tells us that C([0, 1]) is not large enough to to be com-
plete under the L1 or L2 norm. In fact, the space of Riemann integrable functions,
say R([0, 1]), is the natural space for the L1 norm, and of course contains C([0, 1]).
As we have previously
R1 mentioned, this time R([0, 1]) is too large for k · k to be
a norm, since 0 |f | dx = 0 does not imply f = 0. This is a problem we can
solve. We just have to take the equivalence R 1 classes of Riemann integrable func-
tions, where f and g are equivalent if 0 |f − g| dx = 0. But still, L1 is not
complete on R([0, 1])/∼. This is a serious problem in the Riemann integral. This
eventually lead to the Lebesgue integral, which generalizes the Riemann integral,
and gives a complete normed space.

Note
R1 that when we quotient our R([0, 1]) by the equivalence relation f ∼ g if
0
|f − g| dx = 0, we are not losing too much information about our functions. We
know that for the integral to be non-zero, f − g cannot be non-zero at a point of
continuity. Hence they agree on all points of continuities. By Lebesgue’s theorem,
the set of points of discontinuity has Lebesgue measure zero. So they disagree on
at most a set of Lebesgue measure zero.

T. 5-34
Let (V, k · k) be a normed vector space and K ⊆ V .
1. If K is compact, then K is closed and bounded.
2. If V = Rn (with, say, the Euclidean norm), then the converse of 1 is also true.
That is K is compact iff it is closed and bounded. (Heine-Borel theorem)

1. (Proof 1) Let K be compact, hence sequentially compact.[T.1-139] If K is un-


bounded, then we can generate a sequence xk such that kxk k → ∞. Then
this cannot have a convergent subsequence, since any subsequence will also be
unbounded, and convergent sequences are bounded. So K must be bounded.

Let y be a limit point of K. Then there is some yk ∈ K such that yk → y.


Then by compactness, there is a subsequence of yk converging to some point
in K. But any subsequence must converge to y. So y ∈ K. So K is closed by
[L.1-22].

(Proof 2) By [P.1-115] and [P.1-116].

2. (Proof 1) Suppose K is closed and bounded and xk a sequence in K. Then (xk )


is a bounded sequence in Rn . So by [T.5-27], this has a convergent subsequence
xkj . By closedness of K, we know that the limit is in K. So K is sequentially
compact and so compact.[T.1-139]

(Proof 2) By [T.1-128].
176 CHAPTER 5. ANALYSIS II

E. 5-35
Let (V, k · k), (V 0 , k · k0 ) be normed spaces, and let E ⊆ V be a subset, and
f : E → V 0 a mapping. Let y ∈ E. By [P.1-23] f : E → V 0 is continuous at y if
for all ε > 0, there is δ > 0 such that ∀x ∈ E, kx−ykV < δ ⇒ kf (x)−f (y)kV 0 < ε.
Note that x ∈ E and kx − yk < δ is equivalent to saying x ∈ Bδ (y) ∩ E. Sim-
ilarly, kf (x) − f (y)k < ε is equivalent to f (x) ∈ Bε (f (y)). In other words,
x ∈ f −1 (Bε (f (y))). So we can rewrite this statement as there is some δ > 0 such
that E ∩ Bδ (y) ⊆ f −1 (Bε (f (y))).
We are going show again that the above definition of continuity of f at y ∈ E is
equivalent to: for any sequence yk → y in E, we have f (yk ) → f (y).
(Forward) Suppose f is continuous at y ∈ E, and that yk → y. Given ε >
0, by continuity, there is some δ > 0 such that Bδ (y) ∩ E ⊆ f −1 (Bε (f (y))).
For sufficiently large k, yk ∈ Bδ (y) ∩ E. So f (yk ) ∈ Bε (f (y)), or equivalently,
kf (yk ) − f (y)kV 0 < ε.
(Backward) If f is not continuous at y, then there is some ε > 0 such that for any
k, we have B 1 (y) 6⊆ f −1 (Bε (f (y))). Choose yk ∈ B 1 (y) \ f −1 (Bε (f (y))). Then
k k
yk → y, yk ∈ F , but kf (yk ) − f (y)k ≥ ε, contrary to the hypothesis.
T. 5-36
Let (V, k · k) and (V 0 , k · k0 ) be normed spaces, and K a compact subset of V ,
and f : V → V 0 a continuous function. Then
1. f (K) is compact in V 0
2. If V 0 = R, then f |K attains its supremum and infimum, ie. ∃y1 , y2 ∈ K such
that f (y1 ) = sup f (K) and f (y2 ) = inf f (K).

1. (Proof 1) Let (xk ) be a sequence in f (K) with xk = f (yk ) for some yk ∈ K.


By compactness of K, there is a subsequence (ykj ) such that ykj → y. By the
previous theorem, we know that f (yjk ) → f (y). So xkj → f (y) ∈ f (K). So
f (K) is compact.
(Proof 2) By [T.1-119].
2. (Proof 1) If F is any bounded subset of R, then either sup F ∈ F or sup F is a
limit point of F (or both), by definition of the supremum. If F is closed and
bounded, then any limit point must be in F . So sup F ∈ F . Applying this fact
to F = f (K) gives the desired result, and similarly for infimum.
(Proof 2) By [T.1-121].
L. 5-37
Let V be an n-dimensional (real) vector space with a basis {v1 , · · · , vn }.
Pn
1. For any x ∈ V , write x = j=1 xj vj (with xj ∈ R), we define kxk2 =
P 2 1/2
( xj ) . Then this is a norm, and S = {x ∈ V : kxk2 = 1} is compact in
(V, k · k2 ).
2. Let k · k be any norm on V , then k · k : (V, k · k2 ) → R is continuous.

1. k · k2 is well-defined since x1 , · · · , xn are uniquely determined by x (as {vi } is


a basis). It is easy to check that k · k2 is a norm.
5.3. NORMED SPACE 177

First note that S̃ = {x̃ ∈ Rn : kx̃kEuclid = 1} is compact (by Heine–Borel),


since it is closed and bounded. (It is closed as g : (Rn , k · kEuclid ) → R given
by g(x̃) = kx̃kEuclid is continuous,[P.5-23] so S̃ = g −1 ({1}) is closed as {1} is
closed in R.)

Now for x̃ = (x1 , · · · , xn ) ∈ Rn we define f : Rn → V by f (x̃) = n


P
j=1 xj vj .
Then f (S̃) = S. Also f is continuous since kf (x̃) − f (ỹ)k2 = kx̃ − ỹkEuclid , so
f (S̃) = S is compact by 1 of our last theorem.

2. For any x = n
P
j=1 xj vj we have

n n
!1
X X X 2
2
kxk = xj v j ≤ |xj |kvj k ≤ kxk2 kvj k


j=1 j=1
| {z }
=b

by the Cauchy-Schwarz inequality.


Note that b is fixed. Now
by the triangle
inequality we have kxk − kyk ≤ kx − yk ≤ bkx − yk2 . So kxk − kyk → 0 as
kx − yk2 → 0. So k · k : (V, k · k2 ) → R is continuous.
P 2 1/2
Note that instead of defining P the norm kxk2 = ( xj ) , we could get the same
result by defining kxk1 = |xj | and using the standard k · k1 norm forPRn in the
place
P of k · kEuclid . For the second result we would then have kxk = k xj vj k ≤
|xj |kvj k ≤ kxk1 (maxj kvj k).

T. 5-38
If V is a finite dimensional (real) vector space, then any two norms on it are
Lipschitz equivalent.

Fix a basis {v1 , · · · , vn } for V , and define k · k2 as in the lemma above. We


now will prove that any norm k · k on the space is equivalent to k · k2 , then it
follows that any two norm on the space is equivalent since Lipschitz equivalence
is an equivalence relation hence transitive. To show that an arbitrary norm k · k
is equivalent to k · k2 , we need to find a, b > 0 such that akxk2 ≤ kxk ≤ bkxk2 for
any x. Note that

x
akxk2 ≤ kxk ≤ bkxk2 for all x ∈ V ⇐⇒ a≤
kxk2 ≤ b
for all x ∈ V

⇐⇒ a ≤ kyk ≤ b for all y ∈ S = {y ∈ V : kyk2 = 1}

So it suffices to show that the image of k · k under S is bounded by two positive


numbers. By the previous lemma k · k : (V, k · k2 ) → R is continuous and (S, k · k2 )
compact, so by 2 of [T.5-36] k · k under S attains it supremum and infimum, that
is ∃x0 , x1 ∈ S such that 0 < kx0 k = inf kSk and kx1 k = sup kSk. Let a = kx0 k
and b = kx1 k and we are done.

In fact the same proof (along with the lemma) can be extend to for complex vector
space. Also note that the key to the proof is the compactness of the unit sphere
of (V, k · k). On the other hand, compactness of the unit sphere also characterizes
finite dimensionality. If the unit sphere of a space is compact, then the space must
be finite-dimensional.
178 CHAPTER 5. ANALYSIS II

P. 5-39
Let (V, k · k) be a finite-dimensional normed space, then
1. The Bolzano-Weierstrass theorem holds, ie. any bounded sequence sequence
in V has a convergent subsequence.
2. A subset of it is compact iff it is closed and bounded.
3. It is complete.

1. If a subset is bounded in one norm, then it is bounded in any Lipschitz equiv-


alent norm. Similarly, if it converges to x in one norm, then it converges to x
in any Lipschitz equivalent norm.

2. Since these results hold for the Euclidean norm k · k2 , it follows that they hold
for arbitrary finite-dimensional vector spaces. Note that closeness must be the
same in any norm since convergence is and closeness is that the set contains
all its limit points.

3. This is true since if a space is complete in one norm, then it is complete in any
Lipschitz equivalent norm, and we know that Rn under the Euclidean norm is
complete.

Note that the finite-dimensional condition is important, for example B̄1 (0) wrt
k · k∞ in C[0, 1] is closed and bounded but not (sequentially) compact, eg. take
the sequence fn = [function with straight line joining (0, 1), (1/n, 0), (1, 0)].

5.4 Metric spaces


E. 5-40
Given a metric space (X, d), let

d(x, y)
g(x, y) = min{1, d(x, y)} h(x, y) =
1 + d(x, y)

then g and h are a metrics on X. In both cases, we obtain a bounded metric.

The axioms are easily shown to be satisfied, apart from the triangle inequality. So
let’s check the triangle inequality for h. We’ll use a general fact that for numbers
a, c ≥ 0 and b, d > 0 we have

a c a c
≤ ⇐⇒ ≤ .
b d a+b c+d

Based on this fact, we can start with d(x, y) ≤ d(x, z) + d(z, y), then we obtain

d(x, y) d(x, z) + d(z, y) d(x, z) d(z, y)


≤ = +
1 + d(x, y) 1 + d(x, z) + d(z, y) 1 + d(x, z) + d(z, y) 1 + d(x, z) + d(z, y)
d(x, z) d(z, y)
≤ + .
1 + d(x, z) 1 + d(z, y)
5.4. METRIC SPACES 179

D. 5-41
Metrics d, d0 on a set X are said to be Lipschitz equivalent if there are (positive)
constants A, B such that Ad(x, y) ≤ d0 (x, y) ≤ Bd(x, y) for all x, y ∈ X.
E. 5-42
Clearly, any Lipschitz equivalent norms give Lipschitz equivalent metrics. Any
metric coming from a norm in Rn is thus Lipschitz equivalent to the Euclidean
metric. Two norms induce the same topology if and only if they are equivalent. In
some sense, Lipschitz equivalent norms are indistinguishable.
Lipschitz equivalent metrics induce the same topology. The converse, however, is
not true in general. For example, let X = R, d(x, y) = |x − y| and d0 (x, y) =
min{1, |x − y|}. It is easy to check that these are not Lipschitz equivalent, but
they induce the same set collection of open subsets.
E. 5-43
We can create an easy example of an incomplete metric on Rn . We start by
defining h : Rn → Rn by
x
h(x) = ,
1 + kxk
where k · k is the Euclidean norm. We can check that this is injective: if h(x) =
h(y), taking the norm gives kxk/(1 + kxk) = kyk/(1 + kyk). So we must have
kxk = kyk. Also h(x) = h(y) means that x = λy for some real λ. So h(x) = h(y)
implies x = y.
Now we define d(x, y) = kh(x) − h(y)k. It is an easy check that this is a metric
on Rn . Rn under this metric is incomplete, we can consider the sequence xk =
(k − 1)e1 , where e1 = (1, 0, 0, · · · , 0) is the usual basis vector. Then (xk ) is Cauchy
in (Rn , d). To show this, first note that h(xk ) = 1 − k1 e1 . Hence we have



1 1
d(xn , xm ) = kh(xn ) − h(xm )k = − → 0.
n m
n
So it is Cauchy. To show it does not converge in (R k , x) → 0
, d), suppose d(x
for some x. Then since d(xk , x) = kh(xk ) − h(x)k ≥ kh(xk )k − kh(x)k . We must
have
kh(x)k = lim kh(xk )k = 1.
k→∞

However, there is no element with kh(x)k = 1.


In fact as a side note, we can show that h : Rn → B1 (0), and h is a homeomorphism
(ie. continuous bijection with continuous inverse) between Rn and the unit ball
B1 (0), both with the Euclidean metric.
What is happening in this example, is that we are pulling in the whole Rn in to
the unit ball. Then under this metric a sequence that “goes to infinity” in the
usual metric will be Cauchy in this metric, but we have nothing at infinity for it
to converge to.
T. 5-44
Let (X, d) be a metric space, Y ⊆ X any subset.
1. If (Y, d|Y ×Y ) is complete, then Y is closed in X.
180 CHAPTER 5. ANALYSIS II

2. If (X, d) is complete, then (Y, d|Y ×Y ) is complete if and only if Y is closed in


X.
1. Let x ∈ X be a limit point of Y . Then there is some sequence xk → x, where
each xk ∈ Y . Since (xk ) is convergent, it is a Cauchy sequence. Hence it is
Cauchy in Y . By completeness of Y , (xk ) has to converge to some point in Y .
By uniqueness of limits, this limit must be x. So x ∈ Y . So Y contains all its
limit points.
2. We have just showed that if Y is complete, then it is closed. Now suppose Y is
closed. Let (xk ) be a Cauchy sequence in Y . Then (xk ) is Cauchy in X. Since
X is complete, xk → x for some x ∈ X. Since x is a limit point of Y , we must
have x ∈ Y . So xk converges in Y .
D. 5-45
A metric space (X, d) is said to be totally bounded if ∀ε > 0, there exist N ∈ N
and points x1 , · · · , xN ∈ X such that X = N
S
i=1 Bε (xi ).

E. 5-46
It is easy to check that being totally bounded implies being bounded.
From metric and topological space we know that all compact metric spaces are
complete and bounded. The converse is not true. For example, recall if we have
an infinite-dimensional normed vector space, then the closed unit sphere can be
complete and bounded, but not compact. Alternatively, we can take X = R with
the metric d(x, y) = min{1, |x − y|}. This is clearly bounded (by 1), and it is easy
to check that this is complete. However, this is not compact since the sequence
xk = k has no convergent subsequence.
However, we can strengthen the condition of boundedness to total boundedness,
and get the equivalence between “completeness and total boundedness” and com-
pactness.
T. 5-47
Let (X, d) be a metric space. X is (sequentially) compact if and only if X is
complete and totally bounded.

(Backward) Let (yi ) ∈ X. For every j ∈ N, there exists a finite set of points Ej
such that every point is within 1j of one of these points. Write Br (x) = B(x, r)
Now since E1 is finite, there is some x1 ∈ E1 such that there are infinitely many
yi ’s in B(x1 , 1). Pick the first yi in B(x1 , 1) and call it yi1 . Now there is some
x2 ∈ E2 such that there are infinitely many yi ’s in B(x1 , 1) ∩ B(x2 , 21 ). Pick the
one with smallest value of i > i1 , and call this yi2 . Continue till infinity.
This procedure gives a sequence xi ∈ Ei and subsequence (yik ) of (yi ) with
n  
\ 1
yin ∈ B xj , .
j=1
j

2
It is easy to see that (yin ) is Cauchy since if m > n, then d(yim , yin ) < n
. By
completeness of X, this subsequence converges.
(Forward) Suppose X is not totally bounded, Sthen there exist ε such that there is
no finite set of points x1 , · · · , xN with X = N
i=1 Bε (xi ).
5.4. METRIC SPACES 181

We construct a sequence starting by picking an arbitrary y1 . Pick y2 such that


d(y1 , y2 ) ≥ ε. This exists or else Bε (y1 ) covers all of X. Now given y1 , · · · , yn
such that d(yi , yj ) ≥ ε for all i, j = 1, · · · , n, i 6= j, we pick S yn+1 such that
d(yn+1 , yj ) ≥ ε for all j = 1, · · · , n. Again, this exists, or else n
i=1 Bε (yi ) covers
X. Then clearly the sequence (yn ) is not Cauchy and have no Cauchy subsequence,
hence it also has no convergent subsequence.
D. 5-48
• Let (X, d) and (X 0 , d0 ) be metric spaces. A function f : X → X 0 is
 uniformly continuous if

∀ε > 0, ∃δ > 0 s.t. ∀x, y ∈ X, d(x, y) < δ ⇒ d(f (x), f (y)) < ε.

This is equivalent to ∀ε > 0, ∃δ > 0 s.t. ∀y ∈ X, Bδ (y) ⊆ f −1 (Bε (f (y))).


 Lipschitz if there is some K ∈ [0, ∞) such that for all x, y ∈ X, d0 (f (x), f (y)) ≤
Kd(x, y). Such a K is called a Lipschitz constant.
• Let (X, d) be metric space. A mapping f : X → X is a contraction if it is
Lipschitz with Lipschitz constant of less then 1, that is there exists some λ with
0 ≤ λ < 1 such that d(f (x), f (y)) ≤ λd(x, y).
E. 5-49
It is easy to show

Lipschitz ⇒ uniform continuity ⇒ continuity.

Continuity does not imply uniform continuity, an example is given in the next
theorem. To show that uniform continuity does not imply Lipschitz, take X =
X 0 = R. We define the metrics as d(x, y) = min{1, |x − y|}, and d0 (x, y) = |x − y|.
Now consider the function f : (X, d) → (X 0 , d0 ) defined by f (x) = x. We can then
check that this is uniformly continuous but not Lipschitz.
Note that the statement that metrics d and d0 are Lipschitz equivalent is equivalent
to saying the two identity maps i : (X, d) → (X, d0 ) and i0 : (X, d0 ) → (X, d) are
Lipschitz, hence the name.
The metric map itself is also a Lipschitz map for any metric. That is if we view
the metric as a function d : X × X → R with the metric on X × X defined as
˜ 1 , y1 ), (x2 , y2 )) = d(x1 , x2 ) + d(y1 , y2 ). Then by triangle inequality, d(x1 , y1 ) ≤
d((x
d(x1 , x2 )+d(x2 , y2 )+d(y1 , y2 ). Moving the middle term to the left gives d(x1 , y1 )−
˜ 1 , y1 ), (x2 , y2 )). Swapping the theorems around, we can put in the
d(x2 , y2 ) ≤ d((x
absolute value to obtain |d(x1 , y1 ) − d(x2 , y2 )| ≤ d((x˜ 1 , y1 ), (x2 , y2 )).

Note that a contraction mapping is by definition Lipschitz and hence (uniformly)


continuous.
T. 5-50
Let (X, d) and (X 0 , d0 ) be metric space. If (X, d) is (sequentially) compact and
f : X → X 0 is continuous, then f is uniformly continuous.

Suppose f : X → X 0 is not uniformly continuous, then there is some ε > 0


such that for all δ = n1 , there is some xn , yn such that d(xn , yn ) < n1 but
d0 (f (xn ), f (yn )) > ε.
182 CHAPTER 5. ANALYSIS II

By compactness of X, (xn ) has a convergent subsequence (xni ) → x. Then we also


have yni → x. So by continuity, we must have f (xni ) → f (x) and f (yni ) → f (x).
But d0 (f (xni ), f (yni )) > ε for all ni . Contradiction.
While this is a nice theorem, in general a continuous function need not be uniformly
continuous: Consider f : (0, 1] → R given by f (x) = x1 . This is not uniformly
continuous, since when we get very close to 0, a small change in x produces a large
change in x1 . In this case, the function is unbounded. However, even bounded
functions can be not uniformly continuous. For example f : (0, 1] → R with
f (x) = sin x1 . We let
1 1
xn = , yn = .
2nπ (2n + 21 )π
π
Then we have |f (xn ) − f (yn )| = |0 − 1| = 1, while |xn − yn | = 2n(4n+1)
→ 0.
T. 5-51
<Contraction mapping theorem> Let X be a (non-empty) complete metric
space. If f : X → X is a contraction, then f has a unique fixed point (ie. there is
a unique x such that f (x) = x). Moreover, if f : X → X is a function such that
f (m) : X → X (ie. f composed with itself m times) is a contraction for some m,
then f has a unique fixed point.

Uniqueness is straightforward. By assumption, there is some 0 ≤ λ < 1 such that


d(f (x), f (y)) ≤ λd(x, y) for all x, y ∈ X. If x and y are both fixed points, then
this says d(x, y) = d(f (x), f (y)) ≤ λd(x, y). This is possible only if d(x, y) = 0, ie.
x = y.
To prove existence, pick x0 ∈ X. Define the sequence (xn ) inductively by xn+1 =
f (xn ). We first show that this is Cauchy. For any n ≥ 1, we can compute

d(xn+1 , xn ) = d(f (xn ), f (xn−1 )) ≤ λd(xn , xn−1 ) ≤ λn d(x1 , x0 ).

Since this is true for any n, for m > n, we have

d(xm , xn ) ≤ d(xm , xm−1 ) + d(xm−1 , xm−2 ) + · · · + d(xn+1 , xn )


m−1 m−1 ∞
X X X λn
= d(xj+1 , xj ) ≤ λj d(x1 , x0 ) = d(x1 , x0 ) λj = d(x1 , x0 ).
j=n j=n j=n
1−λ

Note that we have again used the property that λ < 1. This implies d(xm , xn ) → 0
as m, n → ∞. So this sequence is Cauchy.
By the completeness of X, there exists some x ∈ X such that xn → x. Since
f is a contraction, it is continuous, so f (xn ) → f (x). However, by definition
f (xn ) = xn+1 , taking the limit on both sides, we get f (x) = x. So x is a fixed
point.
Now suppose that f (m) is a contraction for some m. Hence by the first part, there
is a unique x ∈ X such that f (m) (x) = x. But then

f (m) (f (x)) = f (m+1) (x) = f (f (m) (x)) = f (x).

So f (x) is also a fixed point of f (n) (x). By uniqueness of fixed points, we must
have f (x) = x. Since any fixed point of f is clearly a fixed point of f (m) as well,
it follows that x is the unique fixed point of f .
5.5. INTEGRATION 183

Based on the proof of the theorem, we have the following error estimate in the
contraction mapping theorem: for x0 ∈ X and xn = f (xn−1 ), we showed that for
λn
m > n, we have d(xm , xn ) ≤ 1−λ d(x1 , x0 ). If xn → x, taking the limit of the
above bound as m → ∞ say that for any n
λn
d(x, xn ) ≤ d(x1 , x0 ).
1−λ
Note that the theorem is false if we drop the completeness assumption. For ex-
ample, f : (0, 1) → (0, 1) defined by x2 is clearly a contraction with no fixed point.
The theorem is also false if we drop the assumption λ < 1. In fact, it is not enough
to assume d(f (x), f (y)) < d(x, y) for all x, y.
We can see finding fixed points as the process of solving equations. One important
application we will have is to use this to solve differential equations. After we do
some integration we will look at Picard-Lindelöf existence theorem which gives
condition for the existence of solutions (at least locally) to the ODE
df
= F(t, f (t)) subject to f (t0 ) = x0
dt
where t0 ∈ R, x0 ∈ Rn , f : R → Rn and F : R × Rn → Rn . We can imagine f as
the position vector of a particle moving in Rn , passing through x0 at time t0 . We
then ask if there is a trajectory f (t) such that the velocity of the particle at any
time t is given by F(t, f (t)).

5.5 Integration
T. 5-52
If f : [a, b] → [A, B] is integrable and g : [A, B] → R is continuous, then g ◦ f :
[a, b] → R is integrable.

Since g is continuous, g is uniformly continuous. Given any ε > 0, we can find δ =


δ(ε) > 0 such that for any x, y ∈ [A, B] such that |x−y| < δ, then |g(x)−g(y)| < ε.
Since f is integrable, for arbitrary ε0 , we can find a partition P = {a = a0 < a1 <
· · · < an = b} such that
n
!
X
U (P, f ) − L(P, f ) = (aj − aj−1 ) sup f − inf f < ε0 . (∗)
Ij Ij
j=1

where Ij = [aj , aj−1 ]. Our objective is to make U (P, g ◦ f ) − L(P, g ◦ f ) small. By


uniform continuity of g, if supIj f −inf Ij f is less than δ, then supIj g ◦f −inf Ij g ◦f
will be less that ε. We like these sorts of intervals. Let J = {j ∈ {1, 2, · · · , n} :
supIj f − inf Ij f < δ}. Now
n
!
X
U (P, g ◦ f ) − L(P, g ◦ f ) = (aj+1 − aj ) sup g ◦ f − inf g ◦ f
Ij Ij
j=1
! !
X X
= (aj+1 − aj ) sup g ◦ f − inf g ◦ f + (aj+1 − aj ) sup g ◦ f − inf g ◦ f
Ij Ij Ij Ij
j∈J j6∈J
X
≤ ε(b − a) + 2 sup |g| (aj+1 − aj ).
[A,B] j6∈J
184 CHAPTER 5. ANALYSIS II
ε0
P
From (∗), we know that we must have j6∈J (aj+1 − aj ) < δ
. So we can bound

ε0
U (P, g ◦ f ) − L(P, g ◦ f ) ≤ ε(b − a) + 2 sup |g| .
[A,B] δ

Note that g must be bounded by the maximum value theorem. Now let ε0 = εδ,
then we have shown that given any ε > 0 there exists a partition such that
!
U (P, g ◦ f ) − L(P, g ◦ f ) < (b − a) + 2 sup |g| ε.
[A,B]

Hence the result by the Riemann criterion.

Note as an consequence, any continuous function is integrable, since we can just


let f be the identity function, which we can easily show to be integrable.

T. 5-53
If fn : [a, b] → R is integrable (and bounded) for all n, and (fn ) converges uniformly
to f : [a, b] → R, then f is bounded and integrable.

Let cn = sup[a,b] |fn − f |. Uniform convergence says that cn → 0 as n → ∞. By


definition, for each x ∈ [a, b], we have fn (x) − cn ≤ f (x) ≤ fn (x) + cn . Since fn is
bounded, this implies that f is bounded by sup |fn | + cn . Also, for any x, y ∈ [a, b],
we know
f (x) − f (y) ≤ (fn (x) − fn (y)) + 2cn .

Hence for any partition P ,

U (P, f ) − L(P, f ) ≤ U (P, fn ) − L(P, fn ) + 2(b − a)cn .

So given ε > 0, we choose n such that 2(b − a)cn < 2ε , and then choose P such
that U (P, fn ) − L(P, fn ) < 2ε . Then for this partition, U (P, f ) − L(P, f ) < ε.

P. 5-54
Suppose fn : [a, b] → RR is integrable for eachR n and fn → f uniformly. Suppose
x x
c ∈ [a, b], let Fn (x) = c fn (y)dy and F (x) = c f (y)dy. Then Fn → F uniformly.

By the last theorem f is integrable (on [a, b] and hence also on [x, c] for any
x ∈ [a, b]), so F (x) exist.
Z x Z x

|Fn (x) − F (x)| = fn (y) − f (y)dy ≤ |fn (y) − f (y)|dy
Z c x c

≤ kfn − f k∞ dy ≤ (b − a)kfn − f k∞
c

Hence kFn − F k ≤ (b − a)kfn − f k∞ → 0 as n → ∞.

Note that in general Fn → F uniformly does not hold if we replace [a, b] with R,
but it does hold for [a, b].
5.5. INTEGRATION 185

E. 5-55
Let f (x) = ∞ n
P
n=0 cn (x − a) be a real power series with radius of convergence R,
then for any x ∈ (a − R, a + R) the following exist and equals
Z x ∞ ∞
X cn X
f (y)dy = (x − a)n+1 f 0 (x) = ncn (x − a)n−1
a n=0
n+1 n=0

1. f converge uniformly on [a − r, a + r] for any 0 < r < R.[T.5-14] Also its partial
Rx
sum fn (x) = N
P n PN cn
n=0 cn (x−a) are integrable with a fn (y)dy = n=0 n+1 (x−
n+1
a) . So it follows from the last two results
R x that f P is integrable on [x, a] for
x ∈ [a − r, a + r] for any 0 < r < R with a f (y)dy = ∞ cn
n=0 n+1 (x − a)
n+1
. So
Rx P∞ cn n+1
it follows that a f (y)dy = n=0 n+1 (x − a) for any x ∈ (a − R, a + R).

2. Let g(x) = ∞ n−1


P
n=0 ncn (x − a) , then g is a power
R x seriesPwith radius of con-
∞ n
vergence R.[T.5-15] By the first part we have a g(x) = n=1 cn (x − a) =
f (x) − f (a) for x ∈ (a − R, a + R). Also g is continuous on (a − R, a + R),[T.5-14]
so by the fundamental theorem of Calculus, f is differentiable on (a − R, a + R)
with f 0 (x) = g(x).

This says we can integrate or differentiate term by term inside the radius of con-
vergence.

D. 5-56
Let f : [a, b] → Rn be a vector-valued function, where f (x) = (f1 (x), f2 (x), · · · , fn (x)).
Then we say f is Riemann integrable iff fj : [a, b] → R is Riemann integrable for
all j. We define the integral as

Z b Z b Z b 
f (x) dx = f1 (x) dx, · · · , fn (x) dx ∈ Rn .
a a a

Integration for complex-valued function is defined similarly.

E. 5-57
It is easy to see that most basic properties of integrals of real functions extend to
the vector-valued case.

E. 5-58
Let k · k be the k · k1 or k · k2 norm. If f : [a, b] → Rn is integrable, then kf (x)k
Rb Rb
is integrable (on [a, b]), and k a f (x) dxk ≤ a kf (x)k dx.

The integrability of kf (x)k2 = ( n


P 2 1/2 Pn 2
j=1 fj (x)) and kf (x)k1 = j=1 |fj (x)| is
clear since squaring, taking square roots and taking modulus are continuous, and
a finite sum of integrable
Rb functions is integrable. To showR the inequality, we let
b
v = (v1 , · · · , vn ) = a f (x) dx. Then by definition vj = a fj (x) dx. If v = 0,
then we are done. Otherwise, we have

n Z
X b
X n Z b Z b n
X Z b

kvk1 =
fj (x)dx ≤ |fj (x)|dx = |fj (x)|dx = kf (x)k1
j=1 a j=1 a a j=1 a
186 CHAPTER 5. ANALYSIS II
n
X n
X Z b Z b n
X Z b
kvk22 = vj2 = vj fj (x) dx = (vj fj (x)) dx = v · f (x) dx
j=1 j=1 a a j=1 a
Z b Z b
≤ kvk2 kf (x)k2 dx = kvk2 kf (x)k2 dx.
a a

where the inequality is by Cauchy-Schwarz. Divide by kvk and we are done.


P. 5-59
Let k · k be any norm on Rn . If f : [a, b] → Rn is integrable, then kf (x)k is
integrable (on [a, b]), and
Z b Z b

f (x) dx ≤ kf (x)k dx.

a a

In part IA analysis we prove that f : [a, b] → R is Riemann integrable if and only


if UDm f − LDm f → 0 as m → ∞, where Dm is the dissection a = x0 < x1 < · · · <
xm = b given by xi = a + i(b−a)m
for each i. Write Ik = [xk−1 , xk ]. To show that
kf (x)k is integrable, note that
sup kf (x)k − inf kf (x)k = sup (kf (x)k − kf (y)k) ≤ sup kf (x) − f (y)k
Ik Ik x,y∈Ik x,y∈Ik
n
!
X
≤ C sup kf (x) − f (y)k1 = C sup |fi (x) − fi (y)|
x,y∈Ik x,y∈Ik
i=1
n n
!
X X
≤C sup |fi (x) − fi (y)| = C sup fi (x) − inf fi (y)
x,y∈Ik Ik Ik
i=1 i=1

where we have use kxk − kyk ≤ kx − yk (triangle inequality) and kxk ≤ Ckxk1
(all norms on Rn being equivalent). So
m
!
X
UDm kf k − LDm kf k = |Ik | sup kf (x)k − inf kf (x)k
Ik Ik
k=1
n m
!! n
X X X
≤C |Ik | sup fi (x) − inf fi (y) =C (UDm fi − LDm fi ) → 0
Ik Ik
i=1 k=1 i=1

as m → ∞. Hence kf (x)k is integrable. Now note that for f integrable we have


Xm Z b Xm
|Ii |f (xi ) → f (x)dx since LDm f ≤ |Ii |f (xi ) ≤ UDm f.
i=0 a i=0
Pm Rb Pm Rb
So i=0 |Ii |kf (xi )k → a
kf (x)kdx. Also we have i=0 |Ii |f (xi ) → a
f (x)dx
component-wise, so this also converge wrt the k · k2 as component-wise convergence
implies convergence in k · k2 , in fact this converge wrt any norm k · k since all
Rb
norms on Rn is equivalent. Hence k m
P
i=0 |Ii |f (xi )k → k a f (x)dxk as k · k :
(Rn , k · k) → R is continuous. Now we have
Z b m m Z b
X X
f (x) dx = lim |I |f (x ) ≤ lim |I | f (xi ) = kf (x)k dx.

i i i

m→∞ m→∞
a i=0 i=0 a

since k m
P Pm
i=0 |Ii |f (xi )k ≤ i=0 |Ii | kf (xi )k for all m.

Note that this result is basically glorified triangle inequality.


5.5. INTEGRATION 187

T. 5-60
<Picard-Lindelöf existence theorem> Let x0 ∈ Rn , R > 0, a < b, t0 ∈ [a, b].
Suppose F : [a, b] × B̄R (x0 ) → Rn is a continuous function satisfying

kF(t, x) − F(t, y)k2 ≤ κkx − yk2

for some fixed κ > 0 for all t ∈ [a, b] and x, y ∈ B̄R (x0 ) = {x ∈ Rn : kx − x0 k2 ≤
R}. In other words, F(t, · ) : Rn → Rn is Lipschitz on B̄R (x0 ) with the same
Lipschitz constant for every t. Then
i. There exists an ε > 0 and a unique differentiable function f : [t0 − ε, t0 + ε] ∩
[a, b] → Rn such that
df
= F(t, f (t)) and f (t0 ) = x0 (∗)
dt

R 2
ii. If sup[a,b]×B̄R (x0 ) kFk2 ≤ b−a , then there exists a unique differentiable func-
n
tion f : [a, b] → R that satisfies (∗).

First we show that that (ii) implies (i). We know that sup[a,b]×B̄R (x) kFk2 is
bounded since it is a continuous function on a compact domain. So we can pick
ε > 0 such that 2ε ≤ R/ sup[a,b]×B̄R (x) kFk2 . Then writing [t0 − ε, t0 + ε] ∩ [a, b] =
[a1 , b1 ], we have
R R
sup kFk2 ≤ sup kFk2 ≤ ≤ .
[a1 ,b1 ]×B̄R (x) [a,b]×B̄R (x) 2ε b1 − a1

So (ii) implies there is a solution on [t0 − ε, t0 + ε] ∩ [a, b]. Hence it suffices to prove
(ii).
To apply the contraction mapping theorem, we need to convert this into a fixed
point problem. The key is to reformulate the problem as an integral equation. We
know that a differentiable f : [a, b] → Rn satisfies the differential equation (∗) if
and only if f : [a, b] → B̄R (x0 ) is continuous and satisfies
Z t
f (t) = x0 + F(s, f (s)) ds
t0

by the fundamental theorem of calculus. Note R t that f being continuous means that
F(s, f (s)) is continuous, and so f (t) = x0 + t0 F(s, f (s)) ds is differentiable by the
fundamental theorem of calculus. This is very helpful, since we can work over the
much larger vector space of continuous functions, and it would be easier to find a
solution.
We let X = C([a, b], B̄R (x0 )) be the set of all continuous f : [a, b] → B̄R (x0 ). We
equip X with the supremum metric so that for all g, h ∈ X,

kg − hk = sup kg(t) − h(t)k2 .


t∈[a,b]

We see that X is a closed subset of the complete metric space C([a, b], Rn ) (again
taken with the supremum metric). So X is complete. For every g ∈ X, we define
2
Intuitively we can understand this condition as “not escaping B̄R (x0 )”. Since F represents the
gradient of how f changes with t, if the maximum gradient times (b − a) is less then R, our f would
not escape B̄R (x0 ) for any t ∈ [a, b], hence we have nice solution.
188 CHAPTER 5. ANALYSIS II
Rt
a function T g : [a, b] → Rn by (T g)(t) = x0 + t0
F(s, g(s)) ds. Our differential
equation is thus
f = Tf.
So we first want to show that T is actually mapping X → X, ie. T g ∈ X whenever
g ∈ X, and then prove it (or a power of it) is a contraction map. If g ∈ X, then
Z t Z t

kT g(t) − x0 k2 =
F(s, g(s)) ds ≤
kF(s, g(s))k2 ds

t0 2 t0

≤ |b − a| sup kFk2 ≤ R
[a,b]×B̄R (x0 )

Hence we know that T g(t) ∈ B̄R (x0 ), so T g ∈ X. It turns out T itself need not
be a contraction. Instead, what we have is that for g1 , g2 ∈ X, we have
Z t

kT g1 − T g2 k = sup
F(s, g1 (s)) − F(s, g2 (s)) ds

t∈[a,b] t0 2
t
Z

≤ sup kF(s, g1 (s)) − F(s, g2 (s))k2 ds
t∈[a,b] t0

≤ κ(b − a)kg1 − g2 k

by the Lipschitz condition on F . If we indeed have κ(b − a) < 1 (†), then the
contraction
Rt mapping theorem gives an f ∈ X such that T f = f , ie. f = x0 +
t0
F(s, f (s)) ds. However, we do not necessarily have (†). There are many ways
we can solve this problem. Here, we can solve it by finding an m such that
T (m) = T ◦ T ◦ · · · ◦ T : X → X is a contraction map. We will now show that this
map satisfies the bound
(b − a)m κm
kT (m) g1 − T (m) g2 k ≤ kg1 − g2 k. (‡)
m!
The key is the m!, since this grows much faster than any exponential. Given this
bound, we know that for sufficiently large m, we have ((b − a)m κm )/m! < 1, ie.
T (m) is a contraction. So by the contraction mapping theorem, the result holds.
To prove this, we prove instead the pointwise bound: ie. for any t ∈ [a, b], we have

(|t − t0 |)m κm
kT (m) g1 (t) − T (m) g2 (t)k2 ≤ sup kg1 (s) − g2 (s)k2 .
m! s∈[t0 ,t]

From this, taking the supremum on t ∈ [a, b], we obtain the bound (‡).
To prove this pointwise bound, we induct on m. We wlog assume t > t0 . We know
that for every m, the difference is given by
Z t
(m) (m) (m−1) (m−1)

kT g1 (t) − T g2 (t)k2 =
F(s, T g1 (s)) − F(s, T g2 (s)) ds

t0 2
Z t
(m−1) (m−1) a
≤κ kT g1 (s) − T g2 (s)k2 ds. ( )
t0

This is true for all m. If m = 1, then this gives

kT g1 (t) − T g2 (t)k2 ≤ κ(t − t0 ) sup kg1 − g2 k2 .


[t0 ,t]
5.5. INTEGRATION 189

a is done. For m ≥ 2, assume by induction the bound holds with


So the base case
m − 1. Then ( ) gives
t
κm−1 (s − t0 )m−1
Z
(m) (m)
kT g1 (t) − T g2 (t)k2 ≤ κ sup kg1 − g2 k2 ds
t0 (m − 1)! [t0 ,s]
t
κm κm (t − t0 )m
Z
≤ sup kg1 − g2 k2 (s − t0 )m−1 ds = sup kg1 − g2 k2
(m − 1)! [t0 ,t] t0 m! [t0 ,t]

Note in the final part of the proof to get the factor of m!, we had to actually
perform the integral integrating (s − t0 )m−1 , instead of just bounding (s − t0 )m−1
by (t − t0 )m−1 . In general, this is a good strategy if we want tight bounds. Instead
Rb
of bounding | a f (x) dx| ≤ (b − a) sup |f (x)|, we write f (x) = g(x)h(x), where
Rb
h(x) is something easily integrable. Then we can have a bound | a f (x) dx| ≤
Rb
sup |g(x)| a |h(x)| dx.
Note also that any differentiable f satisfying the differential equation is auto-
matically continuously differentiable, since the derivative is F(t, f (t)), which is
continuous.
Even n = 1 case of this result is important, special, non-trivial case. Even if we
have only one dimension, explicit solutions may be very difficult to find, if not
impossible. For example, df
dt
= f 2 + sin f + ef would be almost impossible to solve.
However, the theorem tells us there will be a solution, at least locally.
The requirements of the theorem are indeed necessary:
• We first look at that ε in (i). Without the addition requirement in (ii), there
might not exist a solution globally on [a, b]. For example, we can consider
the n = 1 case, where we want to solve df dt
= f 2 , with boundary condition
2
f (0) = 1. Our F (t, f ) = f is a nice, uniformly Lipschitz function on any
[0, b] × BR (1) = [0, b] × [1 − R, 1 + R]. However, there is no global solution: If we
d
assume f 6= 0, then for all t ∈ [0, b], the equation is equivalent to dt (t+f −1 ) = 0.
−1
So we need t + f to be constant. The initial conditions tells us this constant
1 1
is 1. So we have f (t) = 1−t . Hence the solution on [0, 1) is 1−t . Any solution
on [0, b] must agree with this on [0, 1). So if b ≥ 1, then there is no solution in
[0, b].
• The Lipschitz condition is also necessary to guarantee uniqueness. Without this
condition, existence of a solution is still guaranteed (but is another theorem,
the Cauchy-Peano theorem), but we could have many different solutions. For
example, we can consider the differential equation

df p
= |f | with f (0) = 0.
dt
p
Here F (t, x) = |x| is not Lipschitz near x = 0. It is easy to see that both
f = 0 and f (t) = 41 t2 are both solutions. In fact, for any α ∈ [0, b], the function
(
0 0≤t≤α
fα (t) = 1 2
4
(t − α) α≤t≤b

is also a solution. So we have an infinite number of solutions.


190 CHAPTER 5. ANALYSIS II

We can have slightly different version of this theorem. If F : R × Rn → Rn is


a continuous function satisfying kF(t, x) − F(t, y)k2 ≤ κkx − yk2 for some fixed
κ > 0 for all t ∈ R and x, y ∈ Rn , then using almost the same proof (with
X = C([a, b], Rn ) instead of X = C([a, b], B̄R x0 )) we can show that the DE (∗)
has a unique solution on any [a, b] with t0 ∈ [a, b], hence there is a unique solution
f : R → Rn globally.
In general, it is possible to use other fixed point theorems to show the existence
of solutions to partial differential equations. This is much more difficult, but has
many far-reaching important applications to theoretical physics and geometry.
This result is already quite a strong result since most of the time we can reduce
high order ODE to a system of first order ODE. For example ÿ + aẏ + by = f (t)
can be equivalently given as
   
ẋ1 x2
= where x1 = y and x2 = ẏ
ẋ2 −ax2 − bx1 + f (t)

More generally any system of ODE can be reduce to an equation of the form
G(x, ẋ) = 0 where G : Rn × Rn → Rm . For example we reduces the DE of our
theorem ḟ = F(t, f (t)) to
     
Ṫ 1 T
= which is of the form ẋ = F̃(x(t)) where x=
ḟ F(T, f ) f

In the case of ÿ + aẏ 2 + by = 0 we have


   
G1 (x1 , x2 , ẋ1 , ẋ2 ) ẋ1 − x2
G(x, ẋ) = = 2 where x1 = y and x2 = ẏ
G2 (x1 , x2 , ẋ1 , ẋ2 ) ẋ2 + ax2 + bx1

If we have m > n in G : Rn × Rn → Rm , then the system is over-determined, so


we probably have no solution. If m < n, then the equation is under-determined,
so we probably have solutions that are not unique. Even if m = n we might still
have no solutions at all (eg. ẋ2 = −1) or many solutions (eg. ẋ2 = 1).
T. 5-61
<Weierstrass Approximation Theorem> If f : [0, 1] → R is continuous, then
there exists a sequence of polynomials (pn ) such that pn → f uniformly. One such
polynomials are the Bernstein polynomials given by
n   !
X i n k
pn (x) = f x (1 − x)n−k .
n k
k=0

Of course, there are many different sequences of polynomials converging uniformly


to f . Apart from the silly examples like adding n1 to each pn , there can also be
vastly different ways of constructing such polynomial sequences.
For convenience, let
!
n k
pn,k (x) = x (1 − x)n−k .
k

First we need a few facts about these functions.  Clearly, pn,k (x) ≥ 0 for all
x ∈ [0, 1]. Also, by the binomial theorem, n n k n−k
(x + y)n . So we get
P
k=0 k x y =
5.5. INTEGRATION 191
Pn
k=0 pn,k (x) = 1. Differentiating the binomial theorem partially with respect to
x and putting y = 1 − x gives
n
! n
! n
X n k−1 n−k
X n X
kx (1−x) =n =⇒ kxk (1−x)n−k = kpn,k (x) = nx.
k k
k=0 k=0 k=0
Pn
Similarly but differentiating once more gives k=0 k(k − 1)pn,k (x) = n(n − 1)x2 .
Adding these two results gives
n
X
k2 pn,k (x) = n2 x2 + nx(1 − x) =⇒
k=0

n
X
(nx − k)2 pn,k (x) = n2 x2 − 2nx · nx + n2 x2 + nx(1 − x) = nx(1 − x). (∗)
k=0

Given any ε > 0, we can pick δ suchPthat |f (x)−f (y)| < ε whenever |x−y|
P < δ since
f is uniformly continuous. Since pn,k (x) = 1, we have f (x) = pn,k (x)f (x).
Now for each fixed x, we can write
n    
n  
k f k − f (x) pn,k (x)
X X
|pn (x) − f (x)| = f − f (x) pn,k (x) ≤

n n
k=0 k=0
       
f k − f (x) pn,k (x) + f k − f (x) pn,k (x)
X X
= n n
k:|x−k/n|<δ k:|x−k/n|≥δ
n
X X
≤ε pn,k (x) + 2 sup |f | pn,k (x)
k=0 [0,1]
k:|x−k/n|>δ
 2
1 X k
≤ ε + 2(sup |f |) x− pn,k (x)
[0,1] δ2 n
k:|x−k/n|>δ
n  2
1 X k 2 sup |f | 2 sup |f |
≤ ε + 2(sup |f |) x − pn,k (x) = ε + nx(1 − x) ≤ ε +
[0,1] δ2 n δ 2 n2 δ2 n
k=0

Hence given any ε and δ, we can pick n sufficiently large that that |pn (x) − f (x)| <
2ε. This is picked independently of x. So done.
D. 5-62
A subset A ⊆ R is said to have Lebesgue measure zero if for any ε > 0, there exists
S∞
a countable (possibly finite) collection of open intervals I j such that A ⊆ j=1 IJ
and ∞
P
j=1 |Ij | < ε where |Ij | is the length of the interval.

E. 5-63
Lebesgue measure zero sets are “small”. It turns out that if a function’s set of
discontinuity is of measure 0, then it is integrable.
• The empty set has measure zero. Any finite set has measure zero.
• Any countable set has measure zero. If A = {a0 , a1 , · · · }, take
 ε ε 
Ij = aj − j+1 , aj + j+1 .
2 2
Then A is contained in the union and the sum of lengths is ε.
192 CHAPTER 5. ANALYSIS II

• A countable union of sets of measure zero has measure zero, using a similar
proof strategy as above.
• Any (non-trivial) interval does not have measure zero.
• The Cantor set , despite being uncountable, has measure zero. The Cantor set
1 2
is constructed asfollows:
  start  with C0 = [0, 1]. Remove the middle third 3 , 3
1 2
to obtain C1 = 0, 3 ∪ 3
, 1 . Removing the middle third of each segment to
obtain C2 = 0, 19 ∪ 29 , 93 ∪ 69 , 79 ∪ 98 , 1 . Continue
   
iteratively by removing
the middle thirds of each part. Then the set C = ∞
T
n=0 Cn is the Cantor set.
n
Since each Cn consists of 2 disjoint closed intervals of length 1/3n , the total
n
length of the segments of Cn is 23

→ 0. So we can cover C by arbitrarily
small union of intervals. Hence the Cantor set has measure zero. It is slightly
trickier to show that C is uncountable.
T. 5-64
<Lebesgue’s theorem on the Riemann integral> Let f : [a, b] → R be a
bounded function, and let Df be the set of points of discontinuities of f . Then f
is Riemann integrable if and only if Df has Lebesgue measure zero.

Using this result, a lot of our theorems follow easily of these. Apart from the easy
ones like the sum and product of integrable functions is integrable, we can also
easily show that the composition of a continuous function with an integrable func-
tion is integrable, since composing with a continuous function will not introduce
more discontinuities.
Similarly, we can show that the uniform limit of integrable functions is integrable,
since the points of discontinuities of the uniform limit is at most the (countable)
union of all discontinuities of the functions in the sequence.

5.6 Differentiation from Rm to Rn


In this section, we write k · k to be the Euclidean norm, but in fact we could use any
norm since they are Lipschitz equivalent.
D. 5-65
• Let E ⊆ Rn , f : E → Rm . Let a ∈ Rn be a limit point of E and b ∈ Rm . We say
limx→a f (x) = b (or f (x) → b as x → a) if

∀ε > 0, ∃δ > 0 s.t. ∀x ∈ E, 0 < kx − ak < δ ⇒ kf (x) − bk < ε.

Equivalently f (x) → b as x → a if f̃ is continuous at a, where f̃ is the function


equal to f at all points except at a where f̃ (a) = b.
• Little o notation : For any function α : U → Rm where Br (0) ⊆ U ⊆ Rn for some
r > 0, we write α(h) = o(h) if α(h)/khk → 0 as h → 0.
• Let U ⊆ Rn be open, f : U → Rm . We say f is differentiable 3 at a point a ∈ U if
there exists a linear map A : Rn → Rm such that f (a + h) = f (a) + Ah + o(h) or
3
We could define more generally: If V and W are normed real or complex vector spaces, and
U ⊆ V open, then a function f : U → W is (Fréchet) differentiable at x ∈ U if there exists a
continuous linear map A : V → W such that f (a + h) = f (a) + Ah + o(h).
5.6. DIFFERENTIATION FROM RM TO RN 193

equivalently
f (a + h) − f (a) − Ah
lim = 0.
h→0 khk
We call A the derivative of f at a. We write the derivative A as Df (a) or Df |a .
• Write L(Rn ; Rm ) for the space of linear maps A : Rn → Rm . More generally write
L(V ; W ) for the space of linear maps A : V → W .
• The directional derivative of f at a ∈ U in the direction of u ∈ Rn is

f (a + tu) − f (a)
Du f (a) = lim
t→0 t
d

whenever this limit exists. By definition, we have Du f (a) = dt
f (a + tu) t=0 .

• For f : U → R, we call Dej f (a) the jth partial derivative of f at a ∈ U when


∂f
the limit exists. We often write this as Dej f (a) = Dj f (a) = ∂xj
(a).

E. 5-66
As in the case of R in IA Analysis I, for limits we do not impose any requirements
on F when x = a. In particular, we don’t assume that a ∈ E.
Usual laws of limits like [[ f (x) → a ]] ∧ [[ g(x) → b ]] ⇒ [[ λf (x) + µg(x) → λa + µb ]]
follows from the fact that [[ f̃ , g̃ continuous ]]⇒[[ λf̃ + µg̃ continuous ]].
Note that officially, α(h) = o(h) as a whole is a piece of notation, and does not
represent equality.
Note for differentiability we require the domain U of f to be open, so that for each
a ∈ U , there is a small ball around a on which f is defined. We could relax this
condition and consider “one-sided” derivatives instead, but we will not look into
these here.
We can interpret the definition of differentiability as saying we can find a “good”
linear approximation (technically, it is affine) to the function f near a. Equiva-
lently, that is we can approximate the function by a hyperplane near a as f (a)+Ah
defines an hyperplane through f (a).
P. 5-67
Derivatives are unique.

Suppose A, B : Rn → Rm is such that


f (a + h) − f (a) − Ah f (a + h) − f (a) − Bh
lim = 0 = lim
h→0 khk h→0 khk
By the triangle inequality, we get

k(B − A)hk ≤ kf (a + h) − f (a) − Ahk + kf (a + h) − f (a) − Bhk.

So k(B − A)hk/khk → 0 as h → 0. Set h = tu, then

k(B − A)(tu)k k(B − A)uk


= →0 as t→0
ktuk kuk

since B − A is linear and so k(B − A)(tu)k = |t|k(B − A)uk. This says that
(B − A)u = 0 for all u ∈ Rn . So B = A.
194 CHAPTER 5. ANALYSIS II

P. 5-68
Let X and Y be normed spaces with X finite dimensional, then all linear maps
X → Y are continuous.
Suppose
Pn f : X → Y is linear. Choose a basis (e1 , e2 , · · · , en ) in X. Then f (x) =
i=1 xi f (ei ). Let M = maxi kf (ei )k. Since
P any two norms on a finite-dimensional
space are equivalent, ∃C > 0 such that n i=1 |xi | ≤ Ckxk for all x ∈ X. Now by
the triangle inequality,
n n n
!
X X X
kf (x)k = xi f (ei ) ≤ |xi |kf (ei )k ≤ |xi | M ≤ CM kxk.


i=1 i=1 i=1

Thus, f is Lipschitz and so is continuous.


Note that linear maps of infinite dimensional spaces might not be continuous,
for example F : (C[0, 1], k · k1 ) → R defined by F (f ) = f (0) is not continuous.
This can be seen by taking fn (x) = max{1 − nx, 0}, then fn → 0 in k · k1 , but
F (fn ) = fn (0) = 1 6→ F (0) = 0(0) = 0. However F is continuous if k · k1 is
replace by k · k∞ .
From kf (x)k ≤ CM kxk we can in addition see that f (x) → 0 as x → 0, again
this is not true in general when X is infinite dimensional.
P. 5-69
Let U ⊆ Rn be open, a ∈ U .
1. If f : U → Rm is differentiable at a, then f is continuous at a.
2. If we write f = (f1 , f2 , · · · , fn ) : U → Rm , where each fi : U → R, then f is
differentiable at a if and only if each fj is differentiable at a for each j.
3. If f, g : U → Rm are both differentiable at a, then λf + µg is differentiable at
a with D(λf + µg)(a) = λDf (a) + µDg(a).
4. If A : Rn → Rm is a linear map, then A is differentiable for any a ∈ Rn with
DA(a) = A.
5. If f is differentiable at a, then the directional derivative Du f (a) exists for all
u ∈ Rn , and Du f (a) = Df (a)u. In particular, all partial derivatives Dj fi (a)
exist for j = 1, · · · , n; i = 1, · · · , m, and are given by Dj fi (a) = Dfi (a)ej .
6. If A = (Aij ) be the matrix representing Df (a) with respect to the standard
basis for Rn and Rm , ie. for any h ∈ Rn , Df (a)h = Ah. Then A is given by
Aij = hDf (a)ej , bi i = Dj fi (a) where {e1 , · · · , en } is the standard basis for
Rn , and {b1 , · · · , bm } is the standard basis for Rm .

P as h → 0, we know f (a + h) − f (a) =
1. By definition, if f is differentiable, then
Df (a)h + o(h) → 0, since Df (a)h = n i=1 hi Df (a)ei → 0.

2. f (a + h) = f (a) + Ah + o(h) ⇐⇒ fj (a + h) = fj (a) + Aji hi + o(h).


3. We just have to check this directly:
(λf + µg)(a + h) − (λf + µg)(a) − (λDf (a) + µDg(a))
khk
f (a + h) − f (a) − Df (a)h g(a + h) − g(a) − Dg(a)h
=λ +µ → 0 as h → 0.
khk khk
5.6. DIFFERENTIATION FROM RM TO RN 195

4. Since A is linear, we always have A(a + h) − A(a) − Ah = 0 for all h.


5. Fix some u ∈ Rn , take h = tu (with t ∈ R). Assuming u 6= 0, differentiability
tells
f (a + tu) − f (a) − Df (a)(tu) f (a + tu) − f (a) − tDf (a)u
0 = lim = lim .
t→0 ktuk t→0 |t|kuk

Since kuk is fixed, This in turn is equivalent to

f (a + tu) − f (a) − tDf (a)u f (a + tu) − f (a)


lim = 0 =⇒ Df (a)u = lim .
t→0 t t→0 t
We derived this assuming u 6= 0, but this is trivially true for u = 0. So this
valid for all u. Hence Du f (a) exist and equals Df (a)u.
6. This follows from the general result for linear maps: for any linear map A
represented by (Aij )m×n , we have Aij = hAej , ei i. Applying this with A =
Df (a) and note that for any h ∈ Rn , Df (a)h = (Df1 (a)h, · · · , Dfm (a)h).
The second property is useful, since instead of considering arbitrary Rm -valued
functions, we can just look at real-valued functions.
The converse of 5 is not true in general. Even if the directional derivative exists
for all u, we still cannot conclude that the derivative exists.
• Let f : R2 → R be defined by
(
0 xy = 0
f (x, y) =
1 xy 6= 0

Then the partial derivatives at (0, 0) are ∂f∂x


(0, 0) = ∂f
∂y
(0, 0) = 0. In other
directions, say u = (1, 1), we have (f (0 + tu) − f (0))/t = 1t which diverges as
t → 0. So the directional derivative at these direction does not exist, hence the
derivative doesn’t exist too.
• Let f : R2 → R be defined by

x3
(
y
y 6= 0
f (x, y) =
0 y=0

Then for u = (u1 , u2 ) 6= 0 and t 6= 0, we can compute

tu3
(
f (0 + tu) − f (0) u2
1
6 0
u2 =
=
t 0 u2 = 0

So Du f (0) = limt→0 f (0+tu)−f t


(0)
= 0. So all directional derivative exist, how-
ever, the function is not differentiable at 0, since it is not even continuous at 0,
as f (δ, δ 4 ) = 1δ diverges as δ → 0.
• Let f : R2 → R be defined by
x3
(
x2 +y 2
(x, y) 6= (0, 0)
f (x, y) = .
0 (x, y) = (0, 0)
196 CHAPTER 5. ANALYSIS II

It is clear that f continuous at points other than 0, and f is also continuous at


0 since |f (x, y)| ≤ |x|. We see that ∂f
∂x
(0, 0) = 1 and ∂f
∂y
(0, 0) = 0. In fact, we
can compute the difference quotient in the direction u = (u1 , u2 ) 6= 0 to be
f (0 + tu) − f (0) u3
Du f (0) = = 2 1 2.
t u1 + u2
We can now immediately conclude that f is not differentiable at 0, since if it
were, then we would have Du f (0) = Df (0)u which should be a linear expression
in u, but this is not.
Alternatively by property 6, if f were differentiable, then we have
 
 h1
Df (0)h = 1 0 = h1 .
h2
However, we have
h3
2 − h1
1
f (0 + h) − f (0) − Df (0)h h2
1 +h2 h1 h22
= = −p 3,
khk
p
h21 + h22 h21 + h22
which does not tend to 0 as h → 0. For example, if h = (t, t), this quotient is
1
− 23/2 6 0.
for t =
In general, to decide if a function is differentiable, the first step would be to
compute the partial derivatives. If they don’t exist, then we can immediately
know the function is not differentiable. However, if they do, then we have a
candidate for what the derivative is, and we plug it into the definition to check if
it actually is the derivative.
T. 5-70
Let a ∈ U ⊆ Rn with U open and f : U → Rm . Then f is differentiable at a if
there exists some open ball Br (a) ⊆ U such that
1. Dj fi (x) exists for every x ∈ Br (a) and 1 ≤ i ≤ m, j ≤ 1 ≤ n;
2. Dj fi are continuous at a for all 1 ≤ i ≤ m, j ≤ 1 ≤ n,

It suffices to prove for m = 1 by 2. of the above proposition. For each h =


(h1 , · · · , hn ) ∈ Rn , write h(j) = h1 e1 + · · · + hj ej = (h1 , · · · , hj , 0, · · · , 0). Then
we have
Xn
f (a + h) − f (a) = f (a + h(j) ) − f (a + h(j−1) )
j=1
n
X
= f (a + h(j−1) + hj ej ) − f (a + h(j−1) ).
j=1

Note that in each term, we are just moving along the coordinate axes. Since the
partial derivatives exist, the mean value theorem of single-variable calculus applied
to g(t) = f (a + h(j−1) + tej ) on the interval t ∈ [0, hj ], so ∃θj ∈ (0, 1) for each j
so that
X n
f (a + h) − f (a) = hj Dj f (a + h(j−1) + θj hj ej )
j=1
n
X n
X  
= hj Dj f (a) + hj Dj f (a + h(j−1) + θj hj ej ) − Dj f (a)
j=1 j=1
5.6. DIFFERENTIATION FROM RM TO RN 197

Note that Dj f (a + h(j−1) + θj hj ej ) − Dj f (a) → 0 as h → 0 since the partial


derivatives are continuous at a. So the second term is o(h). So f is differentiable
at a with Df (a)h = n
P
j=1 Dj f (a)hj .

This is a very useful result. For example, we can now immediately conclude that
the function  
x  2
y + e6z

y  7→ 3x + 4 sin14x
xyze
z
is differentiable everywhere, since it has continuous partial derivatives. This is
much better than messing with the definition itself.
D. 5-71
Let V, W be finite dimensional normed vector spaces over R. The operator norm
on L = L(V ; W ) is defined by kAk = supx∈V :kxk=1 kAxk.
C. 5-72
So far, we have only looked at derivatives at a single point. We haven’t dis-
cussed much about the derivative at, say, a neighbourhood or the whole space.
We might want to ask if the derivative is continuous or bounded. However, this
is not straightforward, since the derivative is a linear map, and we need to define
these notions for functions whose values are linear maps. In particular, we want
to understand the map Df : Br (a) → L(Rn ; Rm ) given by x 7→ Df (x). To do so,
we need a metric on the space L(Rn ; Rm ). In fact, we will use a norm.
Let L = L(Rn ; Rm ). This is a vector space over R defined with addition and scalar
multiplication defined pointwise. In fact, L is a subspace of C(Rn , Rm ) continuous
functions Rn → Rm . Since L is finite-dimensional (it is isomorphic to the space
of real m × n matrices, as vector spaces, and hence have dimension mn), it really
doesn’t matter which norm we pick as they are all Lipschitz equivalent, but a
convenient choice is the sup norm, or the operator norm.

P. 5-73
1. k · k is indeed a norm on L.
kAxk
2. kAk = sup .
V \{0} kxk
3. kAxk ≤ kAkkxk for all x ∈ V .
4. Let A ∈ L(V ; W ) and B ∈ L(W ; U ). Then kBAk ≤ kBkkAk where BA =
B ◦ A ∈ L(V ; U ).

1. Firstly kAk < ∞ for all A ∈ L, this is since A is continuous and {x ∈ V :


kxk = 1} is closed and bounded and hence compact. To show k · k is a norm,
the only non-trivial part is the triangle inequality. We have
kA + Bk = sup kAx + Bxk ≤ sup (kAxk + kBxk)
kxk=1 kxk=1

≤ sup kAxk + sup kBxk = kAk + kBk


kxk=1 kxk=1

2. For any x ∈ V \ {0}, we have


 
x kAxk Ax x
kxk = 1 and =
= A
kxk kxk kxk
198 CHAPTER 5. ANALYSIS II

3. Immediate from 2.
kBAxk kBkkAxk
4. kBAk = sup ≤ sup = kBkkAk.
V \{0} kxk V \{0} kxk

From 2 and 3 we see that A is Lipschitz with Lipschitz constant K iff kAk ≤ K
since both says that kA(u) − A(v)k = kA(u − v)k ≤ Kku − vk.

Note that as a consequence of 4 we have that if An → A and Bn → B, then


Bn An → BA since

kBn An − BAk = kBn An − Bn A + Bn A − BAk


≤ kBn kkAn − Ak + kBn − BkkAk → 0 as n→∞

P. 5-74
Let V be finite dimensional normed vector spaces over R. Let U = {A ∈ L(V ; V ) :
A invertible}. Then U is open and · −1 : U → U is continuous.

L(V ; W ) is also finite dimensional normed space, hence it’s complete. Suppose
h ∈ U with khk < 1, then
m m
X X
n
h ≤ khkn → 0 as m > k → ∞.



n=k n=k

P∞ P∞
So n
Phm is Cauchy
n=0 and hence converges. Write h0 = n
n=0 h . Note that
n m+1
(I − h)( n=0 h ) = I − h where I is the identity function V → V . Take
limit m → ∞, we have (I − h)h0 = I, hence (I − h) is invertible with inverse
(I − h)−1 = h0 .

Suppose A ∈ U . Then if khk < 1/kA−1 k, then khA−1 k < 1 so A+h = (I +hA−1 )A
−1 −1 P∞ −1 n
is invertible. Hence U is open. Moreover, (A + h) = A n=0 (−hA ) , so
−1 −1 −1 P∞ −1 n
k(A + h) − A k = kA n=1 (−hA ) k → 0 as h → 0 since for any m,

m m
kA−1 k2 khk(1 − kA−1 km khkm )
X
−1 X −1 n
A (−hA ) ≤ kA−1 kn+1 khkn =

n=1

n=1
1 − kA−1 kkhk
kA−1 k2 khk
≤ → 0 as h → 0.
1 − kA−1 kkhk

P. 5-75
1. If A ∈ L(R, Rm ), then A can be written as Ax = xa for some a ∈ Rm .
Moreover, kAk = kak, where the second norm is the Euclidean norm in Rn
2. If A ∈ L(Rn , R), then Ax = x · a for some fixed a ∈ Rn . Again, kAk = kak.

1. Set A(1) = a. Then by linearity, we get Ax = xA(1) = xa. Then we have


kAxk = |x|kak. So kAxk
|x|
= kak for all x ∈ R \ {0}, so kAk = kak by 3 of [P.5
-73].

2. Since A is a linear map, Ax = xi (Aei ) = x · a where ai = Aei . By Cauchy-


Schwarz inequality, Ax̂ = x̂·a ≤ kx̂kkak = kak, so kAk = supx∈Rn :kxk=1 kAxk =
Aâ = kak.
5.6. DIFFERENTIATION FROM RM TO RN 199

T. 5-76
<Chain rule> Let U ⊆ Rn be open and f : U → Rm differentiable at a ∈ U .
Also, V ⊆ Rm is open with f (U ) ⊆ V and g : V → Rp differentiable at f (a). Then
g ◦f : U → Rp is differentiable at a, with derivative D(g ◦f )(a) = Dg(f (a)) Df (a).

Let A = Df (a) and B = Dg(f (a)). By differentiability of f , we know f (a + h) =


f (a) + Ah + o(h) and g(f (a) + k) = g(f (a)) + Bk + o(k). Now we have

(g ◦ f )(a + h) = g(f (a) + Ah + o(h)) = g(f (a)) + B(Ah + o(h)) + o(Ah + o(h))
| {z }
k

= g ◦ f (a) + BAh + B(o(h)) + o(Ah + o(h)).

Since B are bounded, kB(o(h))k ≤ kBkko(h)k, so B(o(h)) = o(h). Similarly,


kAh + o(h)k ≤ kAkkhk + ko(h)k ≤ (kAk + 1)khk for sufficiently small khk. So
o(Ah+o(h)) is in fact o(h) as well. Hence (g◦f )(a+h) = (g◦f )(a)+BAh+o(h).
T. 5-77
Let f : [a, b] → Rm be continuous on [a, b] and differentiable on (a, b). If ∃M such
that kDf (t)k ≤ M for all t ∈ (a, b), then kf (b) − f (a)k ≤ M (b − a).
Let v = f (b) − f (a). We define g(t) = v · f (t) = Σ_{i=1}^{m} vi fi (t). Since each fi is differentiable, g is continuous on [a, b] and differentiable on (a, b) with g′(t) = Σ_i vi fi′(t). Hence using Cauchy-Schwarz we get

   |g′(t)| ≤ | Σ_{i=1}^{m} vi fi′(t) | ≤ kvk ( Σ_{i=1}^{m} fi′(t)^2 )^{1/2} = kvkkDf (t)k ≤ M kvk.

Mean value theorem says that g(b) − g(a) = g 0 (t)(b − a) for some t ∈ (a, b). By
definition of g, we get v · (f (b) − f (a)) = g 0 (t)(b − a). By definition of v, we have

kf (b) − f (a)k2 = |g 0 (t)(b − a)| ≤ (b − a)M kf (b) − f (a)k.

If f (b) = f (a), then there is nothing to prove. Otherwise, divide by kf (b) − f (a)k
and done.
T. 5-78
<Mean value inequality> If U ⊆ Rn is an open convex set and f : U → Rm
is differentiable on U with kDf (x)k ≤ M for all x ∈ U , then kf (b1 ) − f (b2 )k ≤
M kb1 − b2 k for any b1 , b2 ∈ U .

We will reduce this to the previous theorem. Fix b1 , b2 ∈ U . Since U is convex, tb1 + (1 − t)b2 ∈ U for all t ∈ [0, 1]. Consider g : [0, 1] → Rm defined by
g(t) = f (tb1 + (1 − t)b2 ). By the chain rule, g is differentiable and g0 (t) =
Dg(t) = (Df (tb1 + (1 − t)b2 ))(b1 − b2 ). Therefore

kDg(t)k ≤ kDf (tb1 + (1 − t)b2 )kkb1 − b2 k ≤ M kb1 − b2 k.

Now we can apply the previous theorem, and get kf (b1 )−f (b2 )k = kg(1)−g(0)k ≤
M kb1 − b2 k
Note that this result says that if f : U → Rm have Df (x) = 0 for all x ∈ U , then
f is constant. Just apply the result with M = 0.
200 CHAPTER 5. ANALYSIS II

T. 5-79
Let U ⊆ Rn be open and path-connected. Then for any differentiable f : U → Rm , if Df (x) = 0 for all x ∈ U , then f is constant on U .

We are going to use the fact that f is locally constant. wlog, assume m = 1 since
we just need to show that fi is constant for all i where f = (f1 , · · · , fm ). Fix any
a, b ∈ U . Let γ : [0, 1] → U be a (continuous) path from a to b. For any s ∈ (0, 1),
there exists some ε > 0 such that Bε (γ(s)) ⊆ U since U is open. By continuity of
γ, there is a δ such that (s − δ, s + δ) ⊆ [0, 1] with γ((s − δ, s + δ)) ⊆ Bε (γ(s)) ⊆ U .
We know f is constant on Bε (γ(s)) by the previous result, so g(t) = f ◦ γ(t) is
constant on (s − δ, s + δ). So g is differentiable at s with derivative 0. This is true
for all s ∈ (0, 1). So the map g : [0, 1] → R has zero derivative on (0, 1), also it is
continuous on [0, 1] since it is a composition of continuous map. So g is constant
and hence g(0) = g(1), ie. f (a) = f (b).
Note that if γ is differentiable, then this is much easier, since we can show g 0 = 0
by the chain rule g 0 (t) = Df (γ(t))γ 0 (t).
D. 5-80
• Let U ⊆ Rn be open. We say f : U → Rm is C 1 on U if f is differentiable at each
x ∈ U and Df : U → L(Rn , Rm ) is continuous.
• We write C 1 (U ) or C 1 (U ; Rm ) for the set of all C 1 maps from U to Rm .
• Let U, U 0 ⊆ Rn be open, then a map g : U → U 0 is a diffeomorphism (or C 1 -diffeomorphism) if it is C 1 and has a C 1 inverse. (Some authors define “diffeomorphism” differently: some require the map to be merely differentiable, while others require it to be infinitely differentiable.)
E. 5-81
Suppose U ⊆ Rn is open and f : U → Rn is C 1 , then ∃ε > 0 such that ẋ = f (x)
with x(t0 ) = a ∈ U has a unique solution x : [t0 − ε, t0 + ε] → Rn .

Let K = B̄r (a) ⊆ U . We claim that f |K is Lipschitz. By assumption Df : U →


L(Rn ; Rn ) is continuous, it follows that kDf k : U → R is also continuous. K
is closed and bounded, hence compact, so ∃M ∈ R such that kDf (x)k ≤ M for
all x ∈ K. Now K is convex, so by the mean value inequality kf (u) − f (v)k ≤
M ku − vk for any u, v ∈ K, that is f |K is Lipschitz. Hence done by [T.5-60].
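The existence result cited here is constructive (Picard iteration). A minimal numerical sketch, with an assumed right-hand side f (x) = x whose solution is known to be e^t , illustrating the iterates converging on a small interval:

import numpy as np

t = np.linspace(0.0, 0.5, 501)           # a small interval around t0 = 0
f = lambda x: x                           # Lipschitz on bounded sets; x(0) = 1
x = np.ones_like(t)                       # initial guess x_0(t) = 1
for _ in range(30):                       # x_{k+1}(t) = 1 + int_0^t f(x_k(s)) ds
    integrand = f(x)
    x = 1.0 + np.concatenate(([0.0], np.cumsum((integrand[1:] + integrand[:-1]) / 2 * np.diff(t))))
print(np.max(np.abs(x - np.exp(t))))      # small: the iterates converge to e^t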
P. 5-82
Let U ⊆ Rm be open. Then f = (f1 , · · · , fn ) : U → Rn is C 1 on U if and only
if for all x ∈ U , 1 ≤ i ≤ n, 1 ≤ j ≤ m the partial derivatives Dj fi (x) exists and
Dj fi : U → R are continuous.

(Forward) Differentiability of f at x implies Dj fi (x) exists and is given by Dj fi (x) = hDf (x)ej , bi i, where {e1 , · · · , em } and {b1 , · · · , bn } are the standard bases for Rm and Rn respectively. So we know

|Dj fi (x) − Dj fi (y)| = |h(Df (x) − Df (y))ej , bi i| ≤ kDf (x) − Df (y)k

since ej and bi are unit vectors. Hence Dj fi is continuous since Df is.


(Backward) By [T.5-70] the derivative Df exists at all x ∈ U . To show Df : U → L(Rm ; Rn ) is continuous, note that for any linear map A ∈ L(Rm ; Rn ) represented by (aij ) so that (Ah)i = Σ_j aij hj , we have

   kAxk^2 = Σ_{i=1}^{n} ( Σ_{j=1}^{m} aij xj )^2 ≤ Σ_{i=1}^{n} ( Σ_{j=1}^{m} aij^2 )( Σ_{j=1}^{m} xj^2 ) = kxk^2 Σ_{i=1}^{n} Σ_{j=1}^{m} aij^2 ,

where x = (x1 , · · · , xm ) and we have used Cauchy-Schwarz. Dividing by kxk^2 , we know kAk ≤ ( Σ_i Σ_j aij^2 )^{1/2} . Applying this to A = Df (x) − Df (y), we get

   kDf (x) − Df (y)k ≤ ( Σ_i Σ_j (Dj fi (x) − Dj fi (y))^2 )^{1/2} .

So Df is continuous since all of Dj fi are continuous.


Note that instead of going through all that algebra to show the inequality kAk ≤ ( Σ Σ aij^2 )^{1/2} , we can instead note that ( Σ Σ aij^2 )^{1/2} is a norm on L(Rm , Rn ), since it is just the Euclidean norm if we treat the matrix as a vector written in a funny way. So by the equivalence of norms on finite-dimensional vector spaces, there is some C such that kAk ≤ C ( Σ Σ aij^2 )^{1/2} and then the result follows.

This result shows that an equivalent definition of f : U → Rn being C 1 is that for all x ∈ U , 1 ≤ i ≤ n, 1 ≤ j ≤ m the partial derivatives Dj fi (x) exist and Dj fi : U → R are continuous.
T. 5-83
<Inverse function theorem> Let a ∈ U ⊆ Rn with U open, and f : U → Rn
a C 1 map. If the linear map Df (a) is invertible, then there exists open sets
V, W ⊆ Rn with a ∈ V ⊆ U and f (a) ∈ W such that f |V : V → W is a bijection, and the inverse map (f |V )^{−1} : W → V is C 1 .

(i.e. If f is C 1 and Df (a) is invertible, then f is a local diffeomorphism at a.)

Note that if A : Rn → Rn is an invertible linear map, then f is C 1 iff Af is C 1 ,


also f is a diffeomorphism iff Af is a diffeomorphism. So wlog we can consider
(Df (a))−1 f instead of f . That is we can assume Df (a) = I, the identity map.
By continuity of Df , ∃r > 0 such that kDf (x) − Ik < 1/2 for all x ∈ B̄r (a). By
shrinking r sufficiently, we can assume B̄r (a) ⊆ U . Let W = Br/2 (f (a)), and let
V = f −1 (W ) ∩ Br (a).
To prove the theorem, we just need to prove the following 3 results:
i. V is open, and f |V : V → W is a bijection.
ii. The inverse map g = (f |V )^{−1} : W → V is Lipschitz (and hence continuous). In fact, we have kg(y1 ) − g(y2 )k ≤ 2ky1 − y2 k.
iii. g is in fact C 1 , and moreover, Dg(y) = Df (g(y))−1 for all y ∈ W .
(Part i) Since f is continuous, f −1 (W ) is open as W is open. So V is open. We
now show that f |V : V → W is bijection by showing that for each y ∈ W , there is
a unique x ∈ V such that f (x) = y using the contraction mapping theorem. We
need to show that for each y ∈ W , the map T (x) = x − f (x) + y has a unique
fixed point x ∈ V .
202 CHAPTER 5. ANALYSIS II

Let h(x) = x − f (x), then Dh(x) = I − Df (x), so kDh(x)k ≤ 1/2 for every x ∈ B̄r (a). Now for any x1 , x2 ∈ B̄r (a), the mean value inequality tells us that kT (x1 ) − T (x2 )k = kh(x1 ) − h(x2 )k ≤ (1/2)kx1 − x2 k. For any x ∈ B̄r (a), we have

   kT (x) − ak = kx − f (x) + y − ak = kx − f (x) − (a − f (a)) + y − f (a)k ≤ kh(x) − h(a)k + ky − f (a)k ≤ (1/2)kx − ak + ky − f (a)k < r/2 + r/2 = r.
So T : B̄r (a) → Br (a) ⊆ B̄r (a). Since B̄r (a) is complete, T has a unique fixed
point x ∈ B̄r (a) so that T (x) = x. Since f (x) = y ∈ W , we know x ∈ f −1 (W ), so
x ∈ V . So we have shown that ∀y ∈ W , ∃!x ∈ V s.t.f (x) = y. Hence f |V : V → W
is a bijection.
(Part ii) For any x1 , x2 ∈ V , by the triangle inequality, we know

   kx1 − x2 k − kf (x1 ) − f (x2 )k ≤ k(x1 − f (x1 )) − (x2 − f (x2 ))k = kh(x1 ) − h(x2 )k ≤ (1/2)kx1 − x2 k.

Hence kx1 − x2 k ≤ 2kf (x1 ) − f (x2 )k. Apply this to x1 = g(y1 ) and x2 = g(y2 ), and note that f (g(yj )) = yj ; we have kg(y1 ) − g(y2 )k ≤ 2ky1 − y2 k.
(Part iii) Note that if g is differentiable, then its derivative must be given by
Dg(y) = Df (g(y))−1 since by definition f (g(y)) = y and hence the chain rule
gives Df (g(y)) · Dg(y) = I. Also, we immediately know Dg is continuous, since
it is the composition of continuous functions. So we only need to check that
Df (g(y))−1 is indeed the derivative of g.
First we check that Df (x) is indeed invertible for every x ∈ B̄r (a). Suppose
Df (x)v = 0, then
   kvk = kDf (x)v − vk ≤ kDf (x) − Ikkvk ≤ (1/2)kvk.
So we must have kvk = 0, ie. v = 0. So ker Df (x) = {0} hence Df (g(y))−1
exists.
Now let x ∈ V be fixed, and y = f (x). Let k be small (so that y + k ∈ W ) and
h = g(y + k) − g(y). In other words f (x + h) − f (x) = k. We have

g(y + k) − g(y) − Df (g(y))−1 k = h − Df (g(y))−1 k = Df (x)−1 (Df (x)h − k)


= −Df (x)−1 (f (x + h) − f (x) − Df (x)h) = o(h)

It remains to show that o(h) = o(k). This is true since

   o(h)/kkk = (o(h)/khk)(khk/kkk) = (o(h)/khk) · ( kg(y + k) − g(y)k / k(y + k) − yk ) → 0 as k → 0,

where the second factor is bounded by 2 (by part ii), and h → 0 as k → 0 so the first factor tends to 0.

Consider when n = 1. Suppose f 0 (a) 6= 0 (hence invertible as a linear map), then


there exists a δ such that f 0 (t) > 0 or f 0 (t) < 0 in t ∈ (a − δ, a + δ). So f |(a−δ,a+δ)
is strictly monotone and hence is invertible.
Note that in the case where n = 1, if f : (a, b) → R is C 1 with f 0 (x) 6= 0
for every x, then f is strictly monotone on the whole domain (a, b), and hence
f : (a, b) → f ((a, b)) is a bijection. In higher dimensions, this is not true. Even if
5.6. DIFFERENTIATION FROM RM TO RN 203

we know that Df (x) is invertible for all x ∈ U , we cannot say f |U is a bijection.


We still only know there is a local inverse:
Let U = R2 , and f : R2 → R2 be given by f (x, y) = (ex cos y, ex sin y). We can
directly compute

   Df (x, y) = [ e^x cos y , −e^x sin y ; e^x sin y , e^x cos y ] for all (x, y) ∈ R2 .

We have det(Df (x, y)) = e^{2x} 6= 0 for all (x, y) ∈ R2 . However, by periodicity f (x, y + 2nπ) = f (x, y) for all n. So f is not injective on R2 , hence not a bijection.
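A quick numerical sketch of this example (illustration only) confirms that the Jacobian determinant is e^{2x} yet the map repeats itself in y:

import numpy as np

f = lambda x, y: (np.exp(x) * np.cos(y), np.exp(x) * np.sin(y))

def jac_det(x, y):
    J = np.array([[np.exp(x) * np.cos(y), -np.exp(x) * np.sin(y)],
                  [np.exp(x) * np.sin(y),  np.exp(x) * np.cos(y)]])
    return np.linalg.det(J)

x, y = 0.3, 1.1
print(np.isclose(jac_det(x, y), np.exp(2 * x)))        # True: det Df = e^{2x} > 0
print(np.allclose(f(x, y), f(x, y + 2 * np.pi)))       # True: f is not injective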
T. 5-84
<Implicit function Theorem>
1. Suppose n > m, F : Rn → Rm is C 1 , F(x0 ) = y0 and DF (x0 ) is surjective.
Then there exist open set V ⊆ Rn with x0 ∈ V and an injective C 1 map
G : Bεn−m → Rn , where Bεn−m = Bε (0) ⊆ Rn−m (open ball in Rn−m ), such
that F −1 (y0 ) ∩ V = Im G and DG(z) is injective for all z ∈ Bεn−m .
2. Let Rn+m have coordinates (x, y). Suppose f : Rn+m → Rm is C 1 in a neigh-
bourhood of (a, b) with f (a, b) = c ∈ Rm . If D(y) f (a, b) (the derivative of f
wrt the y variable at (a, b)) is invertible, then there exists an open set U ⊆ Rn
containing a, an open set V ⊆ Rm containing b, and a C 1 function g : U → V
such that {(x, g(x)) | x ∈ U } = {(x, y) ∈ U × V | f (x, y) = c}.

1. DF (x0 ) : Rn → Rm is surjective, so K = ker DF (x0 ) has dimension n − m.


Choose a linear map π : Rn → Rn−m with π(K) = Rn−m . Define f : Rn →
Rm ×Rn−m = Rn with f (x) = (F(x), π(x)). Then Df (x0 )v = (DF(x0 )v, π(v))
since Dπ = π as π is a linear map.
Suppose Df (x0 )v = 0, then (π(v) = 0 and) DF(x0 )v = 0, so v ∈ K. But
π : K → Rn−m is an isomorphism (bijective linear map), so π(v) = 0 implies
v = 0. Hence Df (x0 ) is an isomorphism.
By the inverse function theorem ∃V ⊆ Rn with x0 ∈ V and W ⊆ Rm × Rn−m
with (y0 , π(x0 )) ∈ W such that f : V → W is a diffeomorphism. Wlog we can
assume W = U ×Bεn−m where U is open with y0 ∈ U . Write g = f −1 : W → V .
Then F −1 (y0 )∩V = f −1 (y0 ×Rn−m )∩V , so g(y0 ×Rn−m ∩W ) = F −1 (y0 )∩V .
Define G(z) = g(y0 , z), then F −1 (y0 ) ∩ V = Im G. Also G and DG(z) are
injective since g and Dg(y0 , z) are. (Note that DG(z)v = Dg(y0 , z)(0, v)).
2. In matrix form we can write Df (a, b) = (D(x) f (a, b) D(y) f (a, b)). Define
F : Rn+m → Rn+m by F(x, y) = (ι(x), f (x, y)) where ι : Rn → Rn is the
identity function, then F is C 1 in a neighbourhood of (a, b). In particular F
is differentiable at (a, b) with derivative in matrix form

   DF(a, b) = [ I , 0 ; D(x) f (a, b) , D(y) f (a, b) ]

Suppose DF(a, b)(h, k) = 0, then we must have h = 0; and since D(y) f (a, b) is an isomorphism, we must then also have k = 0. Hence the linear map DF(a, b) is invertible (as it has trivial kernel). By the inverse function theorem there exist
open U × V ⊆ Rn+m with (a, b) ∈ U × V such that F : U × V → F(U × V )
is a diffeomorphism. Note that we can write F(U × V ) = U × W for some W .
204 CHAPTER 5. ANALYSIS II

Now F−1 (x, c) = (x, G(x, c)) for some G : U × W → V . Define g : U → V by


g(x) = G(x, c), then g satisfies the claimed property.
In 1, since n > m in F : Rn → Rm , given an equation F(x) = y0 we expect
there to be many solution x. If we know F(x0 ) = y0 , we might want to ask what
points near x0 is such that we also have F(x) = y0 , i.e. what does F−1 (y0 ) looks
like. From this result, we can obtain a function G whose image is all the points
of F−1 (y0 ) near x0 . For example when n − m = 1, Bε1 = (−ε, ε), so F−1 (y0 ) is
a curve locally, and G gives us a parametrisation of this curve. Similarly when n − m = 2 we have a parametrised surface.
In 2, the equation f (x, y) = c defines a surface in Rn+m . If the stated condition is satisfied, then the result says that one can in principle express y in terms of x, at least locally in some disk, so that geometrically the locus defined by f (x, y) = c will coincide locally with the hypersurface given by y = g(x).
This is useful, for example, suppose we can write a differential equation as f (x, ẋ) =
0, and we found some y such that f (x, y) = 0, then (if the conditions are satisfied)
2 says that we can express ẋ = g(x) locally, so we can solve the differential equation
locally using [T.5-60].
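As a toy illustration of 2 (an assumed example, not from the notes): for f (x, y) = x^2 + y^2 with c = 1 near (a, b) = (0, 1), D(y) f (a, b) = 2b = 2 is invertible, and the locus f = 1 is locally the graph of g(x) = √(1 − x^2 ). The sketch below recovers g numerically by Newton's method in the y variable.

import numpy as np

f  = lambda x, y: x**2 + y**2 - 1.0
fy = lambda x, y: 2.0 * y                  # D_(y) f, invertible near (0, 1)

def g(x, y0=1.0, steps=20):
    y = y0
    for _ in range(steps):
        y -= f(x, y) / fy(x, y)            # Newton iteration solving f(x, y) = 0 for y
    return y

xs = np.linspace(-0.5, 0.5, 11)
print(np.allclose([g(x) for x in xs], np.sqrt(1 - xs**2)))   # True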
D. 5-85
• Let U ⊆ Rn be open, f : U → Rm be differentiable. Then Df : U → L(Rn ; Rm ).
We say Df is differentiable at a ∈ U if there exists A ∈ L(Rn ; L(Rn ; Rm )) such
that Df (a + h) = Df (a) + Ah + o(h). We say f is two times differentiable, and A
is the second derivative of f at a, we may write A as D(Df )(a). Higher derivatives
are defined similarly.
Write Dij f (a) = Di (Dj f )(a) = ∂^2 f /∂xi ∂xj (a). We define D2 f (a) : Rn × Rn → Rm by D2 f (a)(u, v) = D(Df )(a)(u)(v).
• Let U ⊆ Rn be open. We say f : U → Rm is C k if all partial derivatives of f of
order k exist and are continuous on U (that is Di1 Di2 · · · Dik f (x) = Di1 i2 ···ik f (x)
exist for all ij ∈ {1, · · · , n} and x, and are continuous).
E. 5-86
We’ve done so much work to understand first derivatives. For real functions, we
can immediately know a lot about higher derivatives, since the derivative is just a
normal real function again. Here, it is slightly more complicated, since the derivative
is a linear operator. However, this is not really a problem, since the space of linear
operators is just yet another vector space, so we can essentially use the same
definition.
For Df (a + h) = Df (a) + Ah + o(h) to make sense, we would need to put a
norm on L(Rn ; Rm ) (eg. the operator norm), but the derivative A of Df , if it
exists, is independent of the choice of the norm, since all norms are equivalent for
a finite-dimensional space.
This is, in fact, the same definition as our usual differentiability, since L(Rn ; Rm ) is
just a finite-dimensional space, and is isomorphic to Rnm . So Df is differentiable if
and only if Df : U → Rnm is differentiable with A ∈ L(Rn ; Rnm ). This allows us
to recycle our previous theorems about differentiability. In particular, we know
Df being differentiable is implied by the existence of partial derivatives Di (Dj fk )
in a neighbourhood of a, and their continuity at a, for all k = 1, · · · , m and
i, j = 1, · · · , n.
5.6. DIFFERENTIATION FROM RM TO RN 205

C. 5-87
<2nd derivatives as bilinear map>
By linear algebra, in general, a linear map φ : R` → L(Rn ; Rm ) induces a bilinear
map Φ : R` × Rn → Rm by Φ(u, v) = φ(u)(v) ∈ Rm . In particular, we know

Φ(au + bv, w) = aΦ(u, w) + bΦ(v, w)


Φ(u, av + bw) = aΦ(u, v) + bΦ(u, w).

Conversely, if Φ : R` × Rn → Rm is bilinear, then φ : R` → L(Rn ; Rm ) defined by φ(u) = Φ(u, · ) = (v 7→ Φ(u, v)) is linear. These are clearly inverse operations to each other. So there is a one-to-one correspondence between bilinear maps Φ : R` × Rn → Rm and linear maps φ : R` → L(Rn ; Rm ).
In other words, instead of treating our second derivative as a weird linear map
D(Df )(a) in L(Rn ; L(Rn ; Rm )), we can view it as a bilinear map D2 f (a) : Rn ×
Rn → Rm , and we have D(Df )(a)(u) = D2 f (a)(u, · ) ∈ L(Rn ; Rm ).

<Higher derivatives as multi-linear map>


Just like the second derivative of f (at a) can be understood as a bilinear map
D2 f (a) : Rn × Rn → Rm , the kth derivative (at a) can be understood as a multi-
linear map Dk f (a) : (Rn )k → Rm linear in each of the Rn entry. And finding the
k + 1th derivative (at a) is equivalent to finding a multi-linear map Dk+1 f (a) :
(Rn )k+1 → Rm such that Dk f (a + h) = Dk f (a) + Dk+1 f (a)( · , · · · , · , h) + o(h),
so that Dk+1 f (a)( · , · · · , · , h) = D(Dk f )(a)(h).

P. 5-88
Suppose D(Df )(a) exist, then for any u, v,
1. D2 f (a)(u, v) = Du (Dv f )(a)
2. Write u = Σ_{j=1}^{n} uj ej and v = Σ_{j=1}^{n} vj ej , where {ei } is the standard basis, then

   D2 f (a)(u, v) = Σ_{i,j=1}^{n} Dij f (a)ui vj = Σ_{i,j=1}^{n} Σ_{k=1}^{m} Dij fk (a)ui vj ek

1. D2 f (a)(u, v) ≡ D(Df )(a)(u)(v) = Du (Df )(a)(v)
   = lim_{t→0} [ (Df (a + tu) − Df (a))/t ](v) = lim_{t→0} (Df (a + tu)(v) − Df (a)(v))/t
   = lim_{t→0} (Dv f (a + tu) − Dv f (a))/t = Du (Dv f )(a)

Note that if An ∈ L(Rn ; Rm ) and An → A ∈ L(Rn ; Rm ) (ie. kAn − Ak → 0)


then k(An − A)xk ≤ kAn − Akkxk → 0, ie. An (x) → A(x).
2. Using bilinearity and part 1, we have

   D2 f (a)(u, v) = Σ_{i=1}^{n} Σ_{j=1}^{n} D2 f (a)(ei , ej )ui vj = Σ_{i,j=1}^{n} Dij f (a)ui vj = Σ_{i,j=1}^{n} Σ_{k=1}^{m} Dij fk (a)ui vj ek .
206 CHAPTER 5. ANALYSIS II

This result generalises to higher derivatives in an obvious way, eg.

   Dk f (a)(v(1) , v(2) , · · · , v(k) ) = Σ_{i1 ,i2 ,··· ,ik =1}^{n} Di1 i2 ···ik f (a) v(1)_{i1} v(2)_{i2} · · · v(k)_{ik} .

We have been very careful to keep the right order of the partial derivatives. How-
ever, in most cases we care about, it doesn’t matter as will be illustrate below.
T. 5-89
Let a ∈ Bρ (a) ⊆ U ⊆ Rn with U open, and f : U → Rm . Let i, j ∈ {1, · · · , n} be
fixed and suppose that Di Dj f (x) and Dj Di f (x) exist for all x ∈ Bρ (a) and are
continuous at a, then Di Dj f (a) = Dj Di f (a).

Wlog assume m = 1. If i = j, then there is nothing to prove. So assume i 6= j.


Let
gij (t) = f (a + tei + tej ) − f (a + tei ) − f (a + tej ) + f (a).
For each fixed t, define φ : [0, 1] → R by φ(s) = f (a + stei + tej ) − f (a + stei ).
Then we get gij (t) = φ(1) − φ(0). By the mean value theorem and the chain rule,
there is some θ ∈ (0, 1) such that

   gij (t) = φ′(θ) = t ( Di f (a + θtei + tej ) − Di f (a + θtei ) ).

Now apply the mean value theorem to the function s 7→ Di f (a + θtei + stej ); there is some η ∈ (0, 1) such that

   gij (t) = t^2 Dj Di f (a + θtei + ηtej ).

We can do the same for gji , and find some θ̃, η̃ such that gji (t) = t^2 Di Dj f (a + θ̃tei + η̃tej ). From the definition of gij we see that gij = gji , so

t2 Dj Di f (a + θtei + ηtej ) = t2 Di Dj f (a + θ̃tei + η̃tej ).

Divide by t2 , and take the limit as t → 0. By continuity of the partial derivatives,


we get Dj Di f (a) = Di Dj f (a).
Note that Dj Di f (a) = Di Dj f (a) implies that D2 f is a symmetric bilinear form,
that is D2 f (u, v) = D2 f (v, u)
P. 5-90
If f : U → Rm is differentiable in U such that Di Dj f (x) exists for all i, j in a
neighbourhood of a ∈ U and are continuous at a, then Df is differentiable at a and D2 f (a)(u, v) = Σ_j Σ_i Di Dj f (a)ui vj is a symmetric bilinear form.

This follows from the fact that continuity of second partials implies differentiability,
and the symmetry of mixed partials.
This result along with the related results generalised to higher derivatives. So if
f : U → Rm is C k in a neighbourhood of a ∈ U , then f is k times differentiable
with
   Dk f (a)(v(1) , v(2) , · · · , v(k) ) = Σ_{i1 ,i2 ,··· ,ik =1}^{n} Di1 i2 ···ik f (a) v(1)_{i1} v(2)_{i2} · · · v(k)_{ik}

symmetric, that is Dk f (a)(v(1) , v(2) , · · · , v(k) ) = Dk f (a)(v(σ(1)) , v(σ(2)) , · · · , v(σ(k)) )


for any permutation σ ∈ Sk .
5.6. DIFFERENTIATION FROM RM TO RN 207

T. 5-91
<Taylor’s theorem> Let U ⊆ Rn be open and a ∈ U . Suppose f : U → R is k
times differentiable and that h ∈ Rn is such that the line segment from a to a + h
is contained in U , then
   f (a + h) = Σ_{i=0}^{k−1} (1/i!) Di f (a)hi + (1/k!) Dk f (a + sh)hk for some s ∈ [0, 1],

where hi = (h, · · · , h) ∈ (Rn )i (h repeated i times). If f is also C k , then

   f (a + h) = Σ_{i=0}^{k} (1/i!) Di f (a)hi + o(khkk ).

Consider the function g(t) = f (a + th). Then the assumptions tell us g is k times
differentiable. By the 1D Taylor’s theorem with Lagrange form remainder, we
know

   g(1) = Σ_{i=0}^{k−1} (1/i!) g^{(i)}(0) + (1/k!) g^{(k)}(s) for some s ∈ [0, 1].

In other words, noting that g^{(i)}(t) = Dh · · · Dh f (a + th) = Di f (a + th)hi , this is

   f (a + h) = Σ_{i=0}^{k−1} (1/i!) Di f (a)hi + (1/k!) Dk f (a + sh)hk = Σ_{i=0}^{k} (1/i!) Di f (a)hi + E(h),

   where E(h) = (1/k!) ( Dk f (a + sh)hk − Dk f (a)hk ).

Now kBk = sup_{v1 ,··· ,vk 6=0} kB(v1 , · · · , vk )k / ( Π_{i=1}^{k} kvi k ) is a norm on the space of multilinear maps (Rn )k → Rm ; using this we have

   |E(h)| ≤ (1/k!) kDk f (a + sh) − Dk f (a)k khkk .
By continuity of the kth derivative, as h → 0, we get kDk f (a+sh)−Dk f (a)k → 0.
So E(h) = o(khkk ).
In fact the last result is still true even without assuming that f is C k . This is not as crazy as it seems, noting that the case k = 1 is true by definition of the derivative.
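As a numerical illustration (an assumed scalar field, not from the notes), the order-2 Taylor expansion f (a + h) = f (a) + Df (a)h + (1/2)D2 f (a)(h, h) + o(khk^2 ) can be checked by watching the error divided by khk^2 go to zero.

import numpy as np

f = lambda x: np.sin(x[0]) * np.exp(x[1])
a = np.array([0.4, -0.2])
grad = np.array([np.cos(a[0]) * np.exp(a[1]), np.sin(a[0]) * np.exp(a[1])])
hess = np.array([[-np.sin(a[0]) * np.exp(a[1]), np.cos(a[0]) * np.exp(a[1])],
                 [ np.cos(a[0]) * np.exp(a[1]), np.sin(a[0]) * np.exp(a[1])]])

for eps in [1e-1, 1e-2, 1e-3]:
    h = eps * np.array([0.6, -0.8])                       # khk = eps
    err = f(a + h) - (f(a) + grad @ h + 0.5 * h @ hess @ h)
    print(eps, err / eps**2)                              # ratio tends to 0, so error is o(khk^2)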
CHAPTER 6

Methods
In physics many important differential equations are linear, that is if φ1 , φ2 are solutions, then so are λ1 φ1 + λ2 φ2 (for any constants λi ); two standard examples are Laplace's equation ∂^2 φ/∂x^2 + ∂^2 φ/∂y^2 = 0 and the heat equation ∂φ/∂t = κ( ∂^2 φ/∂x^2 + ∂^2 φ/∂y^2 ). Where did the linearity of equations of physics come from? The real world is not linear in general. However, often we are not looking for a completely accurate and precise description of the universe. When we have low energy/speed/whatever, we can often quite accurately approximate reality by a linear equation. Whatever the complicated equations governing the dynamics of the underlying theory, if we just look to first order in the small perturbations then we'll find a linear equation essentially by definition.

The possible exception is Schrödinger’s equation in Quantum Mechanics. We know


of many ways to generalize this equation, such as making it relativistic or passing
to Quantum Field Theory, but in each case the analogue of Schrödinger’s equation
always remains exactly linear. No one knows if there is a fundamental reason for
this (though it’s certainly built into the principles of Quantum Mechanics at a deep
level), or whether our experiments just haven’t probed far enough. In any case, due
to the prevalence of linear equations, it is rather important that we understand these
equations well, and this is our primary objective of the course.

When dealing with functions and differential equations, we will often think of the space
of functions as a vector space. In many cases, it will be useful to find a “basis” for our
space of functions. Under different situations, we would want to use a different basis
for our space. A familiar example would be the Taylor series, where we are thinking
of {xn : n ∈ N} as the basis of our space, and trying to approximate an arbitrary
function as a sum of the basis elements. When writing the function f as a sum like
this, it is of course important to consider whether the sum converges, and when it
does, whether it actually converges back to f . Note that the set of solutions to a linear
differential equation would form a vector space since if φ1 , φ2 are solutions, then so are
λ 1 φ1 + λ 2 φ2 .

We will often want to restrict our functions to take particular values on some boundary,
known as boundary conditions. Often, we want the boundary conditions to preserve
linearity. We call these nice boundary conditions homogeneous conditions.

D. 6-1
A boundary condition is homogeneous if whenever f and g satisfy the boundary
conditions, then so does λf + µg for any λ, µ ∈ C (or R).

E. 6-2
Let Σ = [a, b]. We could require that f (a)+7f 0 (b) = 0, or maybe f (a)+3f 00 (a) = 0.
These are examples of homogeneous boundary conditions. On the other hand, the
requirement f (a) = 1 is not homogeneous.


6.1 Fourier series


D. 6-3
A function f : R → C is called periodic if there exists some fixed T > 0 such that f (x + T ) = f (x) for all x ∈ R. The smallest positive such T is called the period , and 1/T the frequency .
C. 6-4
<Fourier series> We will study periodic functions R → C, wlog we study
functions S 1 → C (ie. from unit circle to C), and parametrize our function with
θ ∈ [−π, π). So if we have a periodic function f (t) with period T , we let θ = 2πt/T .
The simplest case of complex-valued periodic functions is the exponential function
eiθ , it has a period of 2π. Integer powers of these exponentials are orthogonal with
respect to the inner product
   heimθ , einθ i = ∫_{−π}^{π} e^{−imθ} e^{inθ} dθ = ∫_{−π}^{π} e^{i(n−m)θ} dθ = { 2π if n = m ; 0 if n 6= m } = 2πδnm

We can normalize these to get { (1/√2π) einθ : n ∈ Z}, an orthonormal set of complex valued periodic functions. Fourier’s idea was to try to use this as a “basis” for any
periodic function. Fourier (not entirely correctly) claimed that any f : S 1 → C
can be expanded in this basis as a Fourier series given by

   f (θ) = Σ_{n∈Z} fˆn e^{inθ} , where the fˆn = (1/2π) he^{inθ} , f i = (1/2π) ∫_{−π}^{π} e^{−inθ} f (θ) dθ

are called the Fourier coefficients .1 The partial Fourier sum of f is

   Sn f := Σ_{m=−n}^{n} fˆm e^{imθ}

It turns out that in many cases, the Fourier series don’t converge, that is the partial
Fourier sum do not converge to f . As we have seen in Analysis, there are many
ways of defining convergence of functions. Of course, with different definitions of
convergence, we can get different answers to whether it converges. It can be shown
that if f is continuously differentiable on S 1 , then Sn f converges to f uniformly
(and hence also pointwise).

E. 6-5
<Real Fourier series> In the special case where f is a real valued function, we can re-formulate the Fourier series in terms of sin and cos.

   ( fˆn )∗ = ( (1/2π) ∫_{−π}^{π} e^{−inθ} f (θ) dθ )∗ = (1/2π) ∫_{−π}^{π} e^{inθ} f (θ) dθ = fˆ−n .

So we can replace our Fourier series by

   f (θ) = fˆ0 + Σ_{n=1}^{∞} ( fˆn e^{inθ} + fˆ−n e^{−inθ} ) = fˆ0 + Σ_{n=1}^{∞} ( fˆn e^{inθ} + fˆn∗ e^{−inθ} ).
1 These really should be defined as f (θ) = Σ fˆn (e^{inθ} /√2π) with fˆn = he^{inθ} /√2π, f i, but for convenience reasons, we move all the constant factors to the fˆn coefficients.

Setting fˆn = (1/2)(an − ibn ), we can write this as

   f (θ) = fˆ0 + Σ_{n=1}^{∞} (an cos nθ + bn sin nθ) = a0 /2 + Σ_{n=1}^{∞} (an cos nθ + bn sin nθ),

   with an = (1/π) ∫_{−π}^{π} cos(nθ)f (θ) dθ and bn = (1/π) ∫_{−π}^{π} sin(nθ)f (θ) dθ.
This formulation is particularly useful when the given f is an odd (or even) func-
tion, since the cosine (or sine) terms will simply disappear. On the other hand,
if we want to stick our function into a differential equation, exponential functions
are usually more helpful.
E. 6-6
<Sawtooth function> Consider the sawtooth function f (θ) = θ (for θ ∈ [−π, π)). Note that this function is discontinuous at odd multiples of π. The Fourier coefficients (for n 6= 0) are

   fˆn = (1/2π) ∫_{−π}^{π} e^{−inθ} θ dθ = −[ e^{−inθ} θ/(2πin) ]_{−π}^{π} + (1/2πin) ∫_{−π}^{π} e^{−inθ} dθ = (−1)^{n+1}/(in).

We also have fˆ0 = 0. Hence we have

   Sk f = Σ_{n=−k, n6=0}^{k} ( (−1)^{n+1}/(in) ) e^{inθ} = 2 Σ_{n=1}^{k} ( (−1)^{n+1}/n ) sin(nθ).

It turns out that this series converges (pointwise) to the sawtooth for all θ 6=
(2m + 1)π, ie. everywhere that the sawtooth is continuous.
Let’s look explicitly at the case where θ = π. Each term (n and −n part together)
of the partial Fourier series is zero. So the Fourier series converges to 0 here. This
is the average value of limε→0+ f (π + ε) and limε→0+ f (π − ε). This is typical. At
an isolated discontinuity, the Fourier series is the average of the limiting values of
the original function as we approach from either side.
C. 6-7
<Integration and differentiation of Fourier series>
Suppose f : S 1 → C is such that Sn f converge pointwise to f . Then we can define
a new sequence Sn F by the integrals
   Sn F ≡ ∫_{−π}^{θ} Sn f (φ) dφ = (θ − π)fˆ0 + Σ_{k=−n}^{−1} fˆk ( e^{ikθ} − (−1)^k )/(ik) + Σ_{k=1}^{n} fˆk ( e^{ikθ} − (−1)^k )/(ik)

This new series is guaranteed to converge, since the original series did by assumption and integration has suppressed each coefficient by a further factor of k. In fact even if the original function had jump discontinuities, so that at some discrete points the Fourier series converged to the average value of f on either side of the discontinuity, we've seen that integration produces a continuous function for us, so the new series will in fact converge to F (θ) = ∫_{−π}^{θ} f (φ) dφ everywhere.
By contrast, if we differentiate a Fourier series term by term then we multiply each
coefficient by ik and this makes convergence worse, perhaps fatally.
Integration is a smoothening operator. The indefinite integral of the step function

Θ(x) = H(x) = I[x > 0] (where I is the indicator function) is a continuous


function. If we integrate it again, we get a differentiable function. On the other
hand, when we differentiate a function, we generally make it worse. Hence, it is
often helpful to characterize the “smoothness” of a function by how many times
we can differentiate it.
L. 6-8
Suppose f : S 1 → C is m times differentiable and that f (m) is integrable, then ∃M such that |fˆk | ≤ M/|k|^m for all k 6= 0.

Note that by assumption f (m) and hence |f (m) | is integrable. Also f (n) is continuous (hence f (n) (π) = f (n) (−π)) for all n ≤ m − 1. So repeatedly applying integration by parts we have

   fˆk = (1/2π) ∫_{−π}^{π} e^{−ikθ} f (θ) dθ = −[ e^{−ikθ} f (θ)/(2πik) ]_{−π}^{π} + (1/2πik) ∫_{−π}^{π} e^{−ikθ} f ′(θ) dθ
      = (1/2πik) ∫_{−π}^{π} e^{−ikθ} f ′(θ) dθ = · · · = (1/(ik)^m ) (1/2π) ∫_{−π}^{π} e^{−ikθ} f^{(m)} (θ) dθ

   =⇒ |fˆk | ≤ (1/|k|^m ) (1/2π) ∫_{−π}^{π} |f^{(m)} (θ)| dθ
For example if f (m) is bounded and continuous except at finitely many points,
then it is integrable.
This result makes sense. If we have a rather smooth function, then we would
expect the first few Fourier terms (with low frequency) to account for most of
the variation of f . Hence the coefficients decay really quickly. However, if the
function is jiggly and bumps around all the time, we would expect to need some
higher frequency terms to account for the minute variation. Hence the terms would
not decay away that quickly. So in general, if we can differentiate it more times,
then the terms should decay quicker. Conversely the smoothness of the function
can be inferred from the decay speed of the Fourier coefficients fˆk as k → ∞.
T. 6-9
<Fejér’s theorem> If f : S 1 → C is continuous, then σn (f ) = (1/(n + 1)) Σ_{m=0}^{n} Sm f converges uniformly to f as n → ∞.

By definition of Sm f and fˆn we have

   σn (f (θ)) = (1/(n + 1)) Σ_{m=0}^{n} Sm f (θ) = (1/2π) ∫_{−π}^{π} f (φ)Fn (θ − φ) dφ

   where Fn (x) = (1/(n + 1)) Σ_{m=0}^{n} Σ_{k=−m}^{m} e^{ikx} = { (1/(n + 1)) sin^2 ((n + 1)x/2)/ sin^2 (x/2) for x 6= 0 ; n + 1 for x = 0 }

i. Clearly Fn (x) ≥ 0 everywhere.

ii. Also (1/2π) ∫_{−π}^{π} Fn (θ) dθ = 1 because

   (1/2π) ∫_{−π}^{π} Fn (θ) dθ = (1/2π) ∫_{−π}^{π} (1/(n + 1)) Σ_{m=0}^{n} Σ_{k=−m}^{m} e^{ikθ} dθ = (1/(n + 1)) Σ_{m=0}^{n} Σ_{k=−m}^{m} (1/2π) ∫_{−π}^{π} e^{ikθ} dθ = (1/(n + 1)) Σ_{m=0}^{n} Σ_{k=−m}^{m} δ0k = 1

iii. From the closed form above, Fn (x) → 0 uniformly outside an arbitrarily small region (−δ, δ) around x = 0; this is because for any δ ≤ |x| ≤ π,

   Fn (x) ≤ (1/(n + 1)) (1/ sin^2 (x/2)) ≤ (1/(n + 1)) (1/ sin^2 (δ/2)) → 0 as n → ∞

(i), (ii) and (iii) basically tell us that as n → ∞, Fn behaves like the Dirac delta function we see in Part IA, hence for large n, σn (f (θ)) = (1/2π) ∫_{−π}^{π} f (φ)Fn (θ − φ) dφ ≈ f (θ) (1/2π) ∫_{−π}^{π} Fn (θ − φ) dφ = f (θ). We are going to formalise this.
Let δ = 1/n^{1/4} , then Fn (x) ≤ (1/(n + 1)) (1/ sin^2 (δ/2)) = Bn → 0 as n → ∞ for all x ∈ [−π, π] \ (−δ, δ). Also f is continuous on a compact domain, so |f (x)| ≤ C for all x. Furthermore both f and Fn are periodic with period 2π, plus Fn is even, so

   σn (f (θ)) = (1/2π) ∫_{−π}^{π} f (φ)Fn (θ − φ) dφ = (1/2π) ∫_{θ−π}^{θ+π} f (φ)Fn (θ − φ) dφ = (1/2π) ∫_{−π}^{π} f (θ + x)Fn (x) dx

Now we have

   sup_θ |σn (f (θ)) − f (θ)| = sup_θ | (1/2π) ∫_{−π}^{π} (f (θ + x) − f (θ))Fn (x) dx |
   ≤ sup_θ (1/2π) ∫_{−π}^{π} |f (θ + x) − f (θ)| Fn (x) dx
   ≤ 2CBn + sup_θ (1/2π) ∫_{−δ}^{δ} |f (θ + x) − f (θ)| Fn (x) dx
   ≤ 2CBn + sup_θ [ sup_{x∈(−δ,δ)} |f (θ + x) − f (θ)| ] (1/2π) ∫_{−δ}^{δ} Fn (x) dx
   ≤ 2CBn + sup_θ [ sup_{x∈(−δ,δ)} |f (θ + x) − f (θ)| ] (1/2π) ∫_{−π}^{π} Fn (x) dx
   = 2CBn + sup_θ sup_{x∈(−δ,δ)} |f (θ + x) − f (θ)| → 0 as n → ∞

We have the convergence to 0 in the last line because f , being continuous on a compact domain, must be uniformly continuous.

Note that this result is saying that if we are given the coefficients fˆn and the
information that f is continuous, then we can recover the function f from fˆn .

It’s actually possible to generalize this result to allow f : S 1 → C to be discontinuous at a finite number of isolated points {θ1 , · · · , θr } ⊆ S 1 provided ∫_{S 1} |f (θ)| dθ exists. Then σn (f ) converges to the original function at all points θ ∈ S 1 where f (θ) is continuous.

In this chapter we’ll not pay too much attention about rigour (this result and proof
is actually non-examinable), we’ll henceforth mostly gloss over these subtle issues
of convergence.

T. 6-10
Suppose f : S 1 → C is continuous,
1. If Σ_{n=−∞}^{∞} |fˆn | converges, then Sn f converges uniformly to f .

2. If Σ_{k∈Z} |k||fˆk | converges, then f is differentiable and the partial sums (Sn f )′ = Σ_{k=−n}^{n} ik fˆk e^{ikθ} converge uniformly to f ′(θ) as n → ∞.

1. Both Σ_{n=1}^{∞} |fˆn | and Σ_{n=1}^{∞} |fˆ−n | converge, so by [T.5-12], [P.5-8] and [T.5-6], Sn f = fˆ0 + Σ_{m=1}^{n} fˆm e^{imθ} + Σ_{m=1}^{n} fˆ−m e^{−imθ} converges uniformly to a continuous function g.

Now heinθ , · i is continuous, since if sup |f − g| < ε, then

   |heinθ , f i − heinθ , gi| = |heinθ , f − gi| = | ∫_{−π}^{π} e^{−inθ} (f (θ) − g(θ)) dθ | ≤ ∫_{−π}^{π} |f (θ) − g(θ)| dθ ≤ 2πε.

So ĝn := (1/2π)heinθ , gi = (1/2π)heinθ , limm→∞ Sm f i = limm→∞ (1/2π)heinθ , Sm f i = fˆn . So Sn f = Sn g for all n, and hence also σn f = σn g for all n, where σn is as defined in Fejér’s theorem. By Fejér’s theorem, σn f converges uniformly to both f and g, but uniform limits are unique, hence f = g.
2. Since Σ_{k∈Z} |k||fˆk | converges, by [T.5-12] the (Sn f )′ converge uniformly to some function. Since Σ_{k∈Z} |k||fˆk | converges, Σ_{k∈Z} |fˆk | also converges by the comparison test, so by 1, Sn f converges uniformly to f . Now by [T.5-7] f is differentiable and (Sn f )′ converge uniformly to f ′(θ).

T. 6-11
If f : S 1 → C is C 2 (i.e. two times differentiable with continuous second deriva-
tive), then Sn f converges uniformly to f .

f (2) is continuous on a compact domain, hence it’s integrable. By [L.6-8] we have |fˆk | ≤ M/k^2 for some fixed M . Therefore Σ_{n=−∞}^{∞} |fˆn | converges, hence the result follows by 1 of [T.6-10].

T. 6-12

<Parseval’s identity> hf, f i = ∫_{−π}^{π} |f (θ)|^2 dθ = 2π Σ_{n∈Z} |fˆn |^2
P

Here is a not very rigorous proof:

   hf, f i = ∫_{−π}^{π} |f (θ)|^2 dθ = ∫_{−π}^{π} ( Σ_{m∈Z} fˆm e^{imθ} )∗ ( Σ_{n∈Z} fˆn e^{inθ} ) dθ = Σ_{m,n∈Z} fˆm∗ fˆn ∫_{−π}^{π} e^{i(n−m)θ} dθ = 2π Σ_{m,n∈Z} fˆm∗ fˆn δmn = 2π Σ_{n∈Z} |fˆn |^2

Parseval’s identity can be viewed as an infinite dimensional version of Pythagoras’


theorem. Note that the factor of 2π comes from using the orthogonal “basis” {einθ : n ∈ Z} instead of the orthonormal {einθ /√2π : n ∈ Z}.

We can provide a more rigours proof here under some restricted conditions, al-
though the result actually holds under some more relaxed condition. We have
   hSn f, Sn f i = ∫_{−π}^{π} ( Σ_{j=−n}^{n} fˆj e^{ijθ} )∗ ( Σ_{k=−n}^{n} fˆk e^{ikθ} ) dθ = Σ_{j,k=−n}^{n} fˆj∗ fˆk ∫_{−π}^{π} e^{i(k−j)θ} dθ = 2π Σ_{j,k=−n}^{n} fˆj∗ fˆk δkj = 2π Σ_{k=−n}^{n} |fˆk |^2

   hSn f, f i = ∫_{−π}^{π} ( Σ_{k=−n}^{n} fˆk∗ e^{−ikθ} ) f (θ) dθ = Σ_{k=−n}^{n} fˆk∗ ∫_{−π}^{π} e^{−ikθ} f (θ) dθ = 2π Σ_{k=−n}^{n} |fˆk |^2

which is real, so hSn f, Sn f i = hSn f, f i = hf, Sn f i, so we have

   hSn f − f, Sn f − f i = hSn f, Sn f i + hf, f i − hSn f, f i − hf, Sn f i = hf, f i − 2π Σ_{k=−n}^{n} |fˆk |^2 .

Now if Sn f converges to f in the norm induced by h · , · i, that is limn→∞ hSn f − f, Sn f − f i = 0, then we have hf, f i = 2π Σ_{k∈Z} |fˆk |^2 .

For example if f is such that sup_{n∈N, θ∈(−π,π)} |Sn f (θ) − f (θ)| < ∞ and Sn f converges uniformly to f except on finitely many arbitrarily small intervals, then limn→∞ hSn f − f, Sn f − f i = limn→∞ ∫_{−π}^{π} |Sn f (θ) − f (θ)|^2 dθ = 0.
E. 6-13
From Parseval’s identity we can obtain some interesting results. Consider the Riemann ζ-function ζ(s) = Σ_{n=1}^{∞} 1/n^s . We can show that for any m, ζ(2m) = π^{2m} q for some q ∈ Q. This may not be obvious at first sight. Last time, we computed that the sawtooth function f (θ) = θ has Fourier coefficients fˆ0 = 0 and fˆn = i(−1)^n /n for n 6= 0. Applying Parseval’s theorem for the sawtooth function we get

   2π^3 /3 = ∫_{−π}^{π} θ^2 dθ = hf, f i = 2π Σ_{n∈Z} |fˆn |^2 = 4π Σ_{n=1}^{∞} 1/n^2 .

So Σ_{n=1}^{∞} 1/n^2 = ζ(2) = π^2 /6. We have just done it for the case where m = 1. But if
we integrate the sawtooth function repeatedly, then we can get the general result
for all m.
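A quick numerical check of this (illustration only): the partial sums of Σ 1/n^2 approach π^2 /6, and correspondingly 4π Σ 1/n^2 approaches ∫ θ^2 dθ = 2π^3 /3, the two sides of Parseval for the sawtooth.

import numpy as np

n = np.arange(1, 200001)
partial = np.sum(1.0 / n**2)
print(partial, np.pi**2 / 6)                    # agree to about 5 decimal places
print(4 * np.pi * partial, 2 * np.pi**3 / 3)    # the two sides of Parseval's identity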

6.2 Sturm-Liouville Theory


If M is a self-adjoint (ie. Hermitian) matrix then there is a nice way to solve linear equations of the form M u = f which is easier than inverting M . Recall that self-adjoint maps have an orthonormal basis of eigenvectors {vi } [T.4-156]. We expand u = Σ ui vi and f = Σ fi vi in this basis to obtain M u = Σ ui M vi = Σ ui λi vi . Hence we have Σ ui λi vi = Σ fi vi . Taking the inner product with vj , we know that uj = fj /λj . Sturm-Liouville theory is the infinite-dimensional analogue.
In our vector space of differentiable functions, our “matrices” would be linear differ-
ential operators L. For example, we could have
   L = Ap (x) d^p /dx^p + Ap−1 (x) d^{p−1} /dx^{p−1} + · · · + A1 (x) d/dx + A0 (x).

It is an easy check that this is in fact linear (i.e. L(λy + µz) = λL(y) + µL(z)). We say L has order p if the highest derivative that appears is d^p /dx^p . In most applications,
we will be interested in the case p = 2.

C. 6-14
<Sturm-Liouville operators> Consider the 2nd order linear differential oper-
ator,

   Ly = P (x) d^2 y/dx^2 + R(x) dy/dx − Q(x)y = P [ d^2 y/dx^2 + (R/P ) dy/dx − (Q/P ) y ]
      = P e^{−∫(R/P )dx} [ d/dx ( e^{∫(R/P )dx} dy/dx ) − (Q/P ) e^{∫(R/P )dx} y ] = P p^{−1} [ d/dx ( p dy/dx ) − (Q/P ) p y ],

where p = exp( ∫ (R/P ) dx ). We further define q = (Q/P ) p. We also drop a factor of P p^{−1} . Then we are left with

   L = d/dx ( p(x) d/dx ) − q(x).

This is the Sturm-Liouville form of the operator. Now let’s compute hf, Lgi.
Assuming that p, q are real, we integrate by parts numerous times to obtain
   hf, Lgi = ∫_a^b f ∗ [ d/dx ( p dg/dx ) − qg ] dx = [f ∗ pg ′]_a^b − ∫_a^b [ p (df /dx)∗ (dg/dx) + f ∗ qg ] dx
      = [f ∗ pg ′ − f ′∗ pg]_a^b + ∫_a^b [ d/dx ( p df /dx ) − qf ]∗ g dx = [(f ∗ g ′ − f ′∗ g)p]_a^b + hLf, gi.

So (assuming that p, q are real) L is self-adjoint (i.e. hf, Lgi = hLf, gi) with
respect to this norm when we restrict ourself to the set of functions (satisfying some
appropriate boundary conditions) such that the boundary terms vanish. Examples
of such boundary conditions are:
1. All our functions satisfy a1 h0 (a) + a2 h(a) = 0 and b1 h0 (b) + b2 h(b) = 0 where
a1,2 and b1,2 are fixed real constants (and h is f or g).2 The boundary term
above vanishes at each boundary separately for any function f and g that
satisfy this condition. This is called the separated boundary conditions .
2. If the function p obeys p(a) = p(b) then we restrict all out functions to be peri-
odic so that h(a) = h(b) and h0 (a) = h0 (b); this ensures that the contributions
at each boundary cancels. This is called the periodic boundary condition .
3. Finally it may sometimes be that p(a) = p(b) = 0, though in this case the
endpoints of the interval [a, b] are singular points of the differential equation.
Note that it is important that we have a second-order differential operator. If it were first-order, then we would have a negative sign, since we integrate by parts once, so it would not be possible for L to be self-adjoint.

2
Note that a1 and a2 cannot both be zero, otherwise there is no boundary condition. Similarly
for b1 and b2

D. 6-15
• A weight function w is real, non-negative function that has only finitely many
zeroes on the domain.
• An inner product with weight w is defined by hf, giw = ∫_a^b f ∗ (x)g(x)w(x) dx.

• An eigenfunction with weight w of L is a non-zero function y : [a, b] → C obeying


the differential equation Ly = λwy for some λ ∈ C which is its eigenvalue .
• The Sturm-Liouville problem is the problem of solving Ly = λwy, where L is a
Sturm-Liouville form and w a weight function, for arbitrary λ under some bound-
ary conditions.
A Sturm-Liouville problem is regular if the boundary conditions is the separated
boundary conditions and also p(x), w(x) > 0, and p, p0 , q, w are continuous func-
tions over the finite interval [a, b].
E. 6-16
Note that since w is real, hf, giw = hf, wgi = hwf, gi. One of the reason we want a
weight w(x) is that we might want to work with the unit disk, instead of a square
in R2 . When we want to use polar coordinates, we will have to integrate with
r dr dθ, instead of just dr dθ. Hence we need the weight of r. Also, we allow it to
have finitely many zeroes, so that the radius can be 0 at the origin. But we want
the inner product to keep the property that hf, f iw = 0 iff f = 0 (for continuous
f ), so we cannot have zeros at too many places.
P. 6-17
Let L be the Sturm-Liouville operator.
1. Its eigenvalues are real.
2. If f is an eigenfunction, then so is f ∗ with the same eigenvalue.
3. Eigenfunctions with different eigenvalues (but same weight) are orthogonal
with respect to inner product of the weight.

1. Suppose Lyi = λi wyi , then

λi hyi , yi iw = λi hyi , wyi i = hyi , Lyi i = hLyi , yi i = hλi wyi , yi i = λ∗i hyi , yi iw .

Since hyi , yi iw 6= 0, we have λi = λ∗i .


2. Since the eigenvalue λ, weight w and coefficients p and q are real, we have
L(f ∗ ) = (Lf )∗ = (λwf )∗ = λwf ∗ .
3. Let Lyi = λi wyi and Lyj = λj wyj . Then

λi hyj , yi iw = hyj , Lyi i = hLyj , yi i = λj hyj , yi iw .

Since λi 6= λj , we must have hyj , yi iw = 0.


Note that result 2 means that we can always make the eigenfunctions real by using f + f ∗ . Note that from result 3, we can divide each of our eigenfunctions yn by ( ∫_a^b |yn |^2 w dx )^{1/2} to normalise the eigenfunctions, so that we can have
an orthonormal set of eigenfunctions. Note also that these property holds in
general for any self-adjoint linear differential operator with coefficients which are
real functions.

T. 6-18
For a Sturm-Liouville problem with periodic or separated boundary conditions (and some further conditions, including p, q and w being “nice”),
the eigenvalues are real and can be ordered to form a countable infinite (not
necessary strictly) increasing sequence λ1 , λ2 , · · · with λn → ∞ as n → ∞, and
such that the corresponding eigenfunction y1 , y2 , · · · upon normalisation forms
a complete orthonormal basis for the function space on which the operator is
defined (including satisfying the boundary condition). That is, any such function f : [a, b] → C in the function space can be expanded as f (x) = Σ_{n=1}^{∞} fˆn yn (x) where fˆn = hyn , f iw = ∫ yn∗ (x)f (x)w(x) dx.

We will not prove this. A detailed version of this result is that for a regular
Sturm-Liouville problem (in particular it has separate boundary conditions), each
eigenvalue only has one independent eigenfunction, so the sequence λ1 , λ2 , · · · is
actually strictly increasing. On the other hand if we have periodic boundary
conditions, it may happen that there are two linearly independent eigenfunctions
corresponding to the same eigenvalue. Since the Sturm-Liouville problem is second
order we know that it is impossible to have more than two linearly independent
solutions irrespective of the boundary conditions. If there are two independent
eigenfunctions corresponding to the same eigenvalue, we can always made them
to be mutually orthogonal by the Gram-Schmidt process. And the totality of all
these eigenfunctions forms a complete orthonormal basis as said in the result.
The significant feature here is that the function f (x) is expanded as a discrete sum,
just as we saw for Fourier series. This is remarkable, because the definition of the
yn s that they be normalised eigenfunctions of L involves no hint of discreteness.
In fact the discreteness arises because the domain [a, b] is compact, and because
of the boundary conditions.
E. 6-19
Let L = d^2 /dx^2 . Here we have p = 1, q = 0. If we ask for functions to satisfy the periodic boundary condition on [a, b], then L is self-adjoint. Now let [a, b] = [−L, L] and w = 1. Eigenfunctions obey d^2 yn /dx^2 = λn yn (x). So our eigenfunctions are
yn (x) = einπx/L with eigenvalues λn = −n2 π 2 /L2 for n ∈ Z. This is just the
Fourier series! Note that each eigenvalue λn = λ−n has two independent eigen-
functions, also note that this is related to the result that if y is an eigenfunction,
then so is y ∗ with the same eigenvalue.
If however instead of the periodic boundary condition, we require our functions
to satisfy f (−L) = f (L) = 0 which is a separate boundary condition, then the
problem is a regular Sturm-Liouville problem, and we would find just the sinusoidal
Fourier series which has non-degenerate eigenvalues (i.e. each eigenvalue λn has
only one independent eigenfunction namely yn (x) = sin(nπx/L)).
E. 6-20
<Hermite polynomials> We want to study the DE (1/2)H ′′ − xH ′ = −λH with H : R → C for arbitrary λ. We want to put this in Sturm-Liouville form. We have the integrating factor p(x) = exp( −∫_0^x 2t dt ) = e^{−x^2} . We can rewrite the DE as

   d/dx ( e^{−x^2} dH/dx ) = −2λ e^{−x^2} H(x).

Here L = d/dx ( e^{−x^2} d/dx ) and we are solving LH = −2λwH where we have weight function w(x) = e^{−x^2} . We now ask that H(x) grows at most polynomially as |x| → ∞. In particular, we want e^{−x^2} H(x)^2 → 0. This ensures that our Sturm-Liouville operator is self-adjoint, since it ensures that the boundary terms from integration by parts vanish at the infinite boundary and that the integral hf, Lgi = ∫_{−∞}^{∞} f ∗ d/dx ( e^{−x^2} dg/dx ) dx = −∫_{−∞}^{∞} (df /dx)∗ e^{−x^2} (dg/dx) dx remains finite. The eigenfunctions turn out to be

   Hn (x) = (−1)^n e^{x^2} (d^n /dx^n ) e^{−x^2} .
dx
These are known as the Hermite polynomials . Note that these are indeed poly-
2
nomials. When we differentiate the e−x term many times, we get a lot of things
2
from the product rule, but they will always keep an e−x , which will ultimately
2
cancel with ex . The Hermite polynomials are orthogonal with respect to our
weight function.
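As a symbolic sketch (illustration only, using sympy), the formula above generates the usual Hermite polynomials, and their orthogonality with respect to the weight e^{−x^2} can be verified directly.

import sympy as sp

x = sp.symbols('x')
def hermite(n):
    # H_n(x) = (-1)^n e^{x^2} d^n/dx^n e^{-x^2}
    return sp.simplify((-1) ** n * sp.exp(x**2) * sp.diff(sp.exp(-x**2), x, n))

print([hermite(n) for n in range(4)])    # 1, 2x, 4x^2 - 2, 8x^3 - 12x
# orthogonality with weight w = e^{-x^2}: <H_1, H_3>_w = 0
print(sp.integrate(hermite(1) * hermite(3) * sp.exp(-x**2), (x, -sp.oo, sp.oo)))  # 0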
T. 6-21
<Parseval’s identity II> hf, f iw = Σ_n |fˆn |^2 for the inner product with weight w.
Again a not very rigorous proof: Let {y1 , y2 , · · · } be a complete set of functions that are orthonormal with respect to the weight function w. We can expand f in this basis, then

   hf, f iw = ∫_a^b f ∗ (x)f (x)w(x) dx = Σ_{n,m∈N} ∫_a^b ( fˆn yn (x) )∗ fˆm ym (x)w(x) dx = Σ_{n,m∈N} fˆn∗ fˆm hyn , ym iw = Σ_{n∈N} |fˆn |^2 .

C. 6-22
<Least squares approximation> So far, we have expanded functions in terms
of infinite series. However, in real life, when we ask a computer to do this for
us, it is incapable of storing and calculating “infinite” terms. So it’s important to
know how accurately we can represent a function by expanding it in just a limited,
incomplete set of eigenfunctions.
Suppose we approximate some function f : Ω → C by a finite set of eigenfunctions {y1 , · · · , yn }. Suppose we write the approximation g as g(x) = Σ_{k=1}^{n} ck yk (x). The objective here is to figure out what values of the coefficients ck are “the best”, ie. can make g represent f as closely as possible. One notion of “as closely as possible” is that we want to minimize hf − g, f − giw . To minimize this norm,
first we want ∇hf − g, f − giw = 0 where we are treating hf − g, f − giw as a
function of variables Re(c1 ), · · · , Re(cn ), Im(c1 ), · · · , Im(cn ). The requirement for
∇hf − g, f − giw = 0 is that for all j we have
   0 = ∂/∂ Re(cj ) hf − g, f − giw = ∂/∂ Re(cj ) ( hf, f iw + hg, giw − hf, giw − hg, f iw )
      = ∂/∂ Re(cj ) ( Σ_{k=1}^{n} |ck |^2 − Σ_{k=1}^{n} fˆk∗ ck − Σ_{k=1}^{n} ck∗ fˆk ) = 2 Re(cj ) − fˆj∗ − fˆj = 2 Re(cj ) − 2 Re(fˆj ).

and 0= hf −g, f −giw = · · · = 2 Im(cj )−ifˆj∗ +ifˆj = 2 Im(cj )−2 Im(fˆj )
∂ Re(c∗j )
where the hf, f iw term vanishes since it does not depend on our variables and
we expanded the other inner products in a similar manner like in the Parseval’s
identity. These conditions are satisfied iff cj = fˆj for all j. Now to check that this is indeed a minimum, we can look at the second derivatives; we have

   ∂^2 /∂ Re(ci )∂ Re(cj ) hf − g, f − giw = ∂^2 /∂ Im(ci )∂ Im(cj ) hf − g, f − giw = 2δij ,

   and ∂^2 /∂ Im(ci )∂ Re(cj ) hf − g, f − giw = 0.
Hence the Hessian matrix is 2I which is obviously positive-definite, so this is indeed
a minimum. Thus we know that hf − g, f − giw is minimized over all g(x) when
ck = fˆk = hyk , f iw . These are exactly the coefficients in our infinite expansion.
Hence if we truncate our infinite series at an arbitrary point, it is still the best
approximation we can get using only the remaining eigenfunctions.

P. 6-23
Let L be a linear differential operator. Given f , a necessary condition for the
inhomogeneous equation Lg = f to have a solution g is that f is orthogonal to
ker(L∗ ), where L∗ is the adjoint of L.

Suppose Lg = f and ψ ∈ ker(L∗ ), then hψ, f i = hψ, Lgi = hL∗ ψ, gi = 0.

In particular if L is self-adjoint then f must be orthogonal to ker(L) for Lg = f


to have a solution.

E. 6-24
Let Ω ⊆ R2 be a compact domain with boundary ∂Ω. We want to seek a solution φ : Ω → R that solves the Poisson equation ∇2 φ = f with Neumann boundary condition n · ∇φ|∂Ω = 0.

With this boundary condition ∇2 is self-adjoint: Note that ∇ · (ψ∇φ) = ∇ψ · ∇φ + ψ∇2 φ, so using the divergence theorem we have

   hψ, ∇2 φi = ∫_Ω ψ∇2 φ dV = ∫_{∂Ω} ψ∇φ · dS − ∫_Ω ∇ψ · ∇φ dV = −∫_Ω ∇ψ · ∇φ dV = ∫_Ω φ∇2 ψ dV = h∇2 ψ, φi.


Now ∇2 ψ = 0 with n · ∇ψ|∂Ω = 0 only has solution ψ = constant, so ker(∇2 ) consists of all the constant functions. In order for ∇2 φ = f to have a solution we need 0 = h1, f i = ∫_Ω f (x) dV , which makes sense since if ∇2 φ = f then

   0 = ∫_{∂Ω} ∇φ · dS = ∫_Ω ∇2 φ dV = ∫_Ω f (x) dV.

C. 6-25
<Inhomogeneous equations and Green’s functions>
Analogously to solving the inhomogeneous matrix equation M u = f for a self-adjoint matrix M , in the context of Sturm–Liouville differential operators we seek to solve the inhomogeneous differential equation Lg = f (x) where f (x) is a forcing term. We can write this as Lg = w(x)F (x). Let {y1 , y2 , · · · } be the orthonormal basis consisting of eigenfunctions of L, so that Lyn = λn wyn . We expand g and F as g(x) = Σ_{n∈N} ĝn yn (x) and F (x) = Σ_{n∈N} F̂n yn (x). By linearity,

   w(x) Σ_{n∈N} F̂n yn (x) = w(x)F (x) = Lg = Σ_{n∈N} ĝn Lyn (x) = Σ_{n∈N} ĝn λn w(x)yn (x).

Taking the (regular) inner product with ym (x) (and noting orthogonality of eigenfunctions), we obtain F̂m = ĝm λm . This tells us that ĝm = F̂m /λm . So provided all λn are non-zero we have found the (particular) solution

   g(x) = Σ_{n∈N} (F̂n /λn ) yn (x).

Note that there are no non-trivial complementary solutions, otherwise we would have a 0 eigenvalue.
It is often helpful to rewrite this into another form, using the fact that F̂n = hyn , F iw and f = wF . Not caring too much about rigorousness, we have

   g(x) = Σ_{n∈N} ( hyn , F iw /λn ) yn (x) = ∫_a^b Σ_{n∈N} ( yn∗ (t)F (t)/λn ) yn (x)w(t) dt = ∫_a^b G(x, t)f (t) dt,

   where G(x, t) = Σ_{n∈N} (1/λn ) yn∗ (t)yn (x).

We call G the Green’s function . The Green’s function is a function of two vari-
ables (x, t) ∈ [a, b] × [a, b]. Note G depends on λn and yn only, it depends on the
differential operator L both through its eigenfunctions and through the boundary
conditions we chose to ensure L is self-adjoint, but it doesn’t depend on the forc-
ing term f . Thus if we know the Green’s function we can use it to construct a
particular solution g of Lg = f for an arbitrary forcing term f .
In this way, the Green’s function provides a formal inverse to the differential oper-
ator L in analogy to the finite dimensional case where for an non-singular matrix
M , M −1 is its inverse so that u = M −1 f provides a solution to M u = f . Recall
that for a matrix, the inverse exists if the determinant is non-zero, which is true
if the eigenvalues are all non-zero, equivalently ker(M ) = 0. Similarly, here a
necessary condition for the Green’s function to exist is that all the eigenvalues are
non-zero, that is ker(L) = 0.
What happens if we have a 0 eigenvalue, say λm = 0? Well, F̂m = ĝm λm tells us that there will be no solution if F̂m 6= 0. If F̂m = 0, then we have a solution, but we can add an arbitrary amount of ĝm , ie. an arbitrary amount of ym . This makes sense since ym is in ker(L).
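As a concrete sketch (an assumed toy problem, not from the notes): take L = d^2 /dx^2 on [0, π] with boundary conditions g(0) = g(π) = 0 and w = 1, so yn (x) = √(2/π) sin(nx) and λn = −n^2 . The eigenfunction-expansion Green's function then reproduces the solution of g′′ = f .

import numpy as np

x = np.linspace(0, np.pi, 401)
f = x * (np.pi - x)                        # an arbitrary smooth forcing term on the grid

def integrate(vals, x):
    return np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(x))   # trapezoid rule

N = 200
n = np.arange(1, N + 1)
yn = np.sqrt(2 / np.pi) * np.sin(np.outer(x, n))              # columns y_n(x)
fn = np.array([integrate(yn[:, k] * f, x) for k in range(N)]) # <y_n, f> with w = 1
g = yn @ (fn / (-n.astype(float) ** 2))                       # sum_n <y_n, f>/lambda_n y_n(x)

exact = np.pi * x**3 / 6 - x**4 / 12 - np.pi**3 * x / 12      # solves g'' = x(pi - x), g(0) = g(pi) = 0
print(np.max(np.abs(g - exact)))                              # small (truncation error only)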

6.3 Partial differential equations


6.3.1 Legendre Polynomials and Bessel functions
Legendre Polynomials

Consider the Legendre’s equation which is


 
   d/dx ( (1 − x^2 ) dΘ/dx ) = −λΘ.

The domain of x is −1 ≤ x ≤ 1. This operator is a Sturm-Liouville operator with


p(x) = 1 − x2 and q(x) = 0. For the Sturm-Liouville operator to be self-adjoint, we
had
hg, Lf i = hLg, f i + [p(x)(g ∗ f 0 − g 0∗ f )]1−1 .
We want the boundary term to vanish. Since p(x) = 1 − x2 vanishes at our boundary
x = ±1, the Sturm-Liouville operator is self-adjoint provided our function Θ(x) is
regular (i.e. finite) at x = ±1. Hence we look for solutions of Legendre’s equation inside (−1, 1) that remain regular including at x = ±1. We can try a power series Θ(x) = Σ_{n=0}^{∞} an x^n . The Legendre’s equation becomes


   (1 − x^2 ) Σ_{n=0}^{∞} an n(n − 1)x^{n−2} − 2 Σ_{n=0}^{∞} an n x^n + λ Σ_{n=0}^{∞} an x^n = 0.

For this to hold for all x ∈ (−1, 1), the equation must hold for each coefficient of x separately. So an (λ − n(n + 1)) + an+2 (n + 2)(n + 1) = 0. This requires that

   an+2 = [ (n(n + 1) − λ)/((n + 2)(n + 1)) ] an .

This equation relates an+2 to an . So we can choose a0 and a1 freely. So we get two linearly independent solutions Θ0 (x) and Θ1 (x), where they satisfy Θ0 (−x) = Θ0 (x) and Θ1 (−x) = −Θ1 (x). In particular, we can expand our recurrence formula to obtain

   Θ0 (x) = a0 [ 1 − (λ/2) x^2 − ((6 − λ)λ/4!) x^4 + · · · ]
   Θ1 (x) = a1 [ x + ((2 − λ)/3!) x^3 + ((12 − λ)(2 − λ)/5!) x^5 + · · · ].

We now consider the boundary conditions. We know that Θ(x) must be regular (ie.
finite) at x = ±1. However, we have limn→∞ an+2 /an = 1. This is fine if we are
inside (−1, 1), since the power series will still converge. However, at ±1, the ratio
test is not conclusive and does not guarantee convergence. In fact, more sophisticated
convergence tests show that the infinite series would diverge on the boundary!
To avoid this problem, we need to choose λ such that the series truncates. That is, if
we set λ = `(` + 1) for some ` ∈ N0 , then our power series will truncate, and so Θ(x)
is finite at x = ±1. Note that in this case, the finiteness boundary condition fixes our
possible values of eigenvalues. This is how quantization occurs in quantum mechanics.
In this case, this process is why angular momentum is quantized.

The resulting polynomial solutions P` (x) are called Legendre polynomials . For example, we have P0 (x) = 1, P1 (x) = x, P2 (x) = (1/2)(3x^2 − 1) and P3 (x) = (1/2)(5x^3 − 3x) where the overall normalization is chosen to fix P` (1) = 1. It turns out that

   P` (x) = (1/(2^` `!)) (d^` /dx^` ) (x^2 − 1)^` .
The constants in front are just to ensure normalization. We now check that this indeed
gives the desired normalization:
   P` (1) = (1/(2^` `!)) [ (d^` /dx^` ) ((x − 1)^` (x + 1)^` ) ]_{x=1} = (1/(2^` `!)) [ `!(x + 1)^` + (x − 1)(stuff) ]_{x=1} = 1
2 dx x=1 2 `! x=1

(Orthogonality): We have previously shown that the eigenfunctions of Sturm-Liouville


operators are orthogonal. Let’s see that directly now for this operator. The weight
function is w(x) = 1. We first prove a short lemma: for 0 ≤ k ≤ `, we have
   (d^k /dx^k ) (x^2 − 1)^` = Q`,k (x) (x^2 − 1)^{`−k}
for some degree k polynomial Q`,k (x). We show this by induction. This is trivially true when k = 0. We have

   (d^{k+1} /dx^{k+1} ) (x^2 − 1)^` = (d/dx) [ Q`,k (x)(x^2 − 1)^{`−k} ]
      = Q′`,k (x^2 − 1)^{`−k} + 2(` − k)Q`,k x(x^2 − 1)^{`−k−1} = (x^2 − 1)^{`−k−1} [ Q′`,k (x^2 − 1) + 2(` − k)Q`,k x ].

Since Q`,k has degree k, Q0`,k has degree k − 1. So the right bunch of stuff has degree
k + 1. Done. Now we can show orthogonality. We have
   hP` , P`0 i = ∫_{−1}^{1} P` (x)P`0 (x) dx = (1/(2^` `!)) ∫_{−1}^{1} [ (d^` /dx^` ) (x^2 − 1)^` ] P`0 (x) dx
      = (1/(2^` `!)) [ (d^{`−1} /dx^{`−1} ) (x^2 − 1)^` P`0 (x) ]_{−1}^{1} − (1/(2^` `!)) ∫_{−1}^{1} [ (d^{`−1} /dx^{`−1} ) (x^2 − 1)^` ] (dP`0 /dx) dx
      = −(1/(2^` `!)) ∫_{−1}^{1} [ (d^{`−1} /dx^{`−1} ) (x^2 − 1)^` ] (dP`0 /dx) dx = · · ·
      = (−1)^k (1/(2^` `!)) ∫_{−1}^{1} [ (d^{`−k} /dx^{`−k} ) (x^2 − 1)^` ] (d^k P`0 /dx^k ) dx
Note that the boundary term disappears since the (` − 1)th derivative of (x2 − 1)` still
has a factor of x2 − 1. So integration by parts allows us to transfer the derivative from
x2 − 1 to P`0 . Now if ` 6= `0 , we can wlog assume `0 < `. We can integrate by parts
`0 + 1 times until we get the (`0 + 1)th derivative of P`0 , which is zero. In fact, we can show that hP` , P`0 i = (2/(2` + 1)) δ``0 , hence the P` (x) are orthogonal polynomials.
(Roots of P` (x)): By the fundamental theorem of algebra, P` (x) has ` roots. In fact, they are always real, and lie in (−1, 1). To see this, suppose only m < ` roots lie in (−1, 1). Then let Qm (x) = Π_{r=1}^{m} (x − xr ), where {x1 , x2 , · · · , xm } are these m roots. Consider the polynomial P` (x)Qm (x). If we factorize this, we get Π_{r=m+1}^{`} (x − xr ) Π_{r=1}^{m} (x − xr )^2 . The first few terms have roots outside (−1, 1), and hence do not change sign in (−1, 1). The latter terms are always non-negative. Hence for some appropriate sign, we have ±∫_{−1}^{1} P` (x)Qm (x) dx > 0. However, we can expand Qm (x) = Σ_{r=1}^{m} qr Pr (x) in a basis of Legendre polynomials, but hP` , Pr i = 0 for all r < `, so the integral is 0. This is a contradiction.
224 CHAPTER 6. METHODS

Bessel functions
Consider the Bessel’s equation

d2 R dR
x2 +x + (x2 − n2 )R = 0
dx2 dx
Note that this is actually a whole family of differential equations, one for each n.
Here we assume n ∈ N0 . Since Bessel’s equations are second-order, there are two
independent solution for each n, namely Jn (x) and Yn (x). These are called Bessel
functions of order n of the first (Jn ) or second (Yn ) kind. We will not study these
functions’ properties, but just state some useful properties of them.
The Jn ’s are all regular at the origin. In particular, as x → 0 Jn (x) ∼ xn . These
look like decaying sine waves, but the zeroes are not regularly spaced. On the other
hand, the Yn ’s are similar but singular at the origin. As x → 0, Y0 (x) ∼ ln x, while
Yn (x) ∼ x−n .
The Bessel functions satisfies the orthogonality condition
Z a
a2
   
knj r kni r
Jn Jn r dr = δij (Jn0 (kni ))2 ,
0 a a 2
Note that we have a weight function r here. Also, this is the orthogonality relation
for different roots of Bessel’s functions of the same order, it does not relate Bessel’s
functions of different orders.

6.3.2 Laplace’s equation


The Laplace’s equation is ∇2 φ = 0. A harmonic function is a function that satisfies
the Laplace’s equation. Suppose we are solving Laplace’s equation for φ with domain
Ω. A boundary condition of the form φ|∂Ω = f is called Dirichlet .4 A boundary
condition of the form n · ∇φ|∂Ω = g is called a Neumann .
Laplace’s equation arises in many situations. For example, if we have a conservative
force F = −∇φ, and we are in a region where the force obeys the source-free Gauss’
law ∇ · F = 0, then we get Laplace’s equation. It also arises as a special case of the
2
heat equation ∂φ
∂t
= κ∇2 φ, and wave equation ∂∂t2φ = c2 ∇2 φ, when there is no time
dependence.
P. 6-26
Let Ω be a compact domain, and ∂Ω be its boundary. Any solutions to ∇2 φ = 0
on Ω satisfying the Dirichlet boundary condition are unique. And any solutions
to ∇2 φ = 0 on Ω satisfying the Neumann boundary condition are unique up to a
constant.
Suppose φ1 and φ2 are both solutions to the Laplace’s equation. Let Φ = φ1 − φ2 .
We have
Z Z Z
0= Φ∇2 Φ dV = − (∇Φ) · (∇Φ) dV + Φ∇Φ · n dS.
Ω Ω ∂Ω

We note that the second term vanishes since either we have Φ = 0 on the boundary
for Dirichlet boundary condition or we have ∇Φ · n = 0 on the boundary for
4
∂Ω is the boundary of Ω.
6.3. PARTIAL DIFFERENTIAL EQUATIONS 225
R
Neumann boundary condition. So we have 0 = − Ω (∇Φ) · (∇Φ) dV . However,
this can only hold if ∇Φ = 0 throughout. Hence Φ is constant throughout Ω,
i.e. φ1 = φ2 + c for constant c. For Dirichlet boundary condition, Φ = 0 at the
boundary, so we must have Φ = 0 throughout, i.e. c = 0 and φ1 = φ2 .
E. 6-27
<Laplace’s equation on a disk> Solve ∇2 φ = 0 on a disk.

For (x, y) ∈ R2 , let z = x + iy and z̄ = x − iy, then the Laplace’s equation becomes
∂2φ
0= =⇒ φ(z, z̄) = ψ(z) + χ(z̄) for some ψ, χ
∂z∂ z̄
where φ(z) is holomorphic (i.e. ∂φ/∂ z̄ = 0) and χ(z̄) is antiholomorphic (i.e.
∂χ/∂z = 0). Suppose that we wish to solve Laplace’s equation inside the unit disc,
obeying the Dirichlet boundary condition φ(z, z̄)|∂Ω = f (θ), where the boundary
∂Ω is the unit circle. Since the domain of f is the unit circle S 1 , we can Fourier-
expand it (assuming that f is “nice” enough):
X ∞
X ∞
X
f (θ) = fˆn einθ = fˆ0 + fˆn einθ + fˆ−n e−inθ .
n∈Z n=1 n=1

iθ −iθ
However, we know that z = re and z̄ = re . So on the P boundary, we know
that z = eiθ and z̄ = e−iθ . So we can write f (θ) = fˆ0 + ∞ ˆ n ˆ n
n=1 (fn z + f−n z̄ ).
This is defined for |z| = 1. Now we can extend this and define φ on the unit disk
by

X ∞
X
φ(z, z̄) = fˆ0 + fˆn z n + fˆ−n z̄ n .
n=1 n=1

It is clear that this obeys the boundary conditions by construction. Also, φ(z, z̄) is
of the form ψ(z) + χ(z̄), a sum of a holomorphic and an antiholomorphic function
(the constant fˆ0 being both holomorphic and antiholomorphic, may be included
with either). Note that φ(z, z̄) certainly converges when |z| ≤ 1 if the series for
f (θ) converged on ∂Ω. So it is a solution to Laplace’s equation on the unit disk.
Since the solution is unique, we’re done!
C. 6-28
<Separation of variables> Unfortunately, the use of complex variables is very
special to the case where Ω ⊆ R2 . In higher dimensions, we proceed differently.
The method of separation of variables is proceed as follows:
1. Write ψ as a product of functions that depend only on one variable each.
Hence reduce Laplace’s PDE to a system of ODEs that depend on a number
of constants (here λ and µ).
2. Solve the system of ODEs. Since Laplace’s equation was a second order linear
equation, these ODEs will always be of Sturm-Liouville type; the constants will
appear as eigenvalues of the Sturm-Liouville equation and the equations will
be solved by the eigenfunctions of the Sturm-Liouville operator.
3. Use the homogeneous boundary conditions to impose restrictions on the possi-
ble values of the eigenvalues. The solution for a fixed permissible choice of the
eigenvalues is known as a normal mode of the system. By linearity, the general
solution is a linear combination of these normal modes.
226 CHAPTER 6. METHODS

4. Use the inhomogeneous boundary conditions to determine which linear combi-


nation we should take; this will require using the orthogonality property of the
eigenfunctions of the Sturm-Liouville operator.
This is best illustrate with an example, see below.
E. 6-29
Let be a f given function. Solve ∇2 ψ = 0 on Ω = {(x, y, z) ∈ R3 : 0 ≤ x ≤ a, 0 ≤
y ≤ b, z ≥ 0} with boundary conditions:

ψ(0, y, z) = ψ(a, y, z) = 0 lim ψ(x, y, z) = 0


z→∞

ψ(x, 0, z) = ψ(x, b, z) = 0 ψ(x, y, 0) = f (x, y)

The boundary conditions says that we want our function to be f when z = 0 and
vanish at the boundaries. The first step is to look for a solution of ∇2 ψ(x, y, z) = 0
of the form ψ(x, y, z) = X(x)Y (y)Z(z). Then we have
 00
Y 00 Z 00

X
0 = ∇2 ψ = Y ZX 00 + XZY 00 + XY Z 00 = XY Z + + .
X Y Z
As long as ψ 6= 0, we can divide through by ψ and obtain
X 00 Y 00 Z 00
+ + = 0.
X Y Z
The key point is each term depends on only one of the variables (x, y, z). If we
vary, say, x but keep others unchanged, then Y 00 /Y + Z 00 /Z does not change. For
the total sum to be 0, X 00 /X must be constant. So each term has to be separately
constant. We can thus write

X 00 = −λX, Y 00 = −µY, Z 00 = (λ + µ)Z.

The signs before λ and µ are there just to make our life easier later on. We can
solve these to obtain
√ √
X = a sin λx + b cos λx,
√ √
Y = c sin λy + d cos λx,
p p
Z = g exp( λ + µz) + h exp(− λ + µz).

We now impose the homogeneous boundary conditions, ie. the conditions that ψ
vanishes at the walls and at infinity: At x = 0, we need ψ(0, y, z) = 0, so b = 0. At
x = a, we need ψ(a, y, z) = 0, so λ = ( nπa
)2 . At y = 0, we need ψ(x, 0, z) = 0, so
d = 0. At y = b, we need ψ(x, b, z) = 0, so µ = ( mπ b
)2 . As z → ∞, ψ(x, y, z) → 0,
so g = 0. Therefore for each n, m ∈ N, we have solutions
 2
m2
 nπx   mπy  
n
ψ(x, y, z) = An,m sin sin exp(−sn,m z) where s2n,m = + π2
a b a2 b2
and An,m is an arbitrary constant. This obeys the homogeneous boundary condi-
tions for any n, m ∈ N but not the inhomogeneous condition at z = 0. By linearity,
the general solution obeying the homogeneous boundary conditions is

X  nπx   mπy 
ψ(x, y, z) = An,m sin sin exp(−sn,m z)
n,m=1
a b
6.3. PARTIAL DIFFERENTIAL EQUATIONS 227

The final step is to impose the inhomogeneous boundary condition ψ(x, y, 0) =


f (x, y). In other words, we need

X  nπx   mπy 
An,m sin sin = f (x, y). (†)
n,m=1
a b

The objective is thus to find the An,m coefficients. We can use the orthogonality
relation: Z a  
kπx  nπx  a
sin sin dx = δk,n .
0 a a 2
kπx

So we multiply (†) by sin a and integrate wr,t, x:
∞  mπy  Z a   Z a  
X kπx  nπx  kπx
An,m sin sin sin dx = sin f (x, y) dx.
n,m=1
b 0 a a 0 a

Using the orthogonality relation, we have


∞  mπy  Z a  
a X kπx
Ak,m sin = sin f (x, y) dx.
2 m=1 b 0 a

We can perform the same trick again, and obtain


Z    
ab kπx jπy
Ak,j = sin sin f (x, y) dx dy. (∗)
4 [0,a]×[0,b] a b

So we found the solution



X  nπx   mπy 
ψ(x, y, z) = An,m sin sin exp(−sn,m z)
n,m=1
a b

where Am,n is given by (∗). Since we have shown that there is a unique solution
to Laplace’s equation obeying Dirichlet boundary conditions, we’re done. Note
that if we imposed a boundary
√ condition at finite z, say 0 ≤ z ≤ c, then both
sets of exponential exp(± λ + µz) would have contributed. Similarly, if ψ does
not vanish at the other boundaries, then the cos terms would also contribute. In
general, to actually find our ψ, we have to do the horrible integral for Am,n , and
this is not always easy.
In particular suppose f (x, y) = 1, then
(
Z a Z b 16
4  nπx   mπy 
π 2 mn
n, m both odd
Am,n = sin dx sin dy =
ab 0 a 0 b 0 otherwise

Hence we have
16 X 1  nπx   mπy 
ψ(x, y, z) = sin sin exp(−sm,n z).
π2 nm a b
n,m odd

Note that in this example, we obtained a Fourier sine series because of the ho-
mogeneous Dirichlet boundary conditions on x and y. If instead we’d imposed
Neumann boundary conditions ∂ψ/dx = 0 at y = 0, b and ∂ψ/dy = 0 at x = 0, a
then we would instead find Fourier cosine series.
228 CHAPTER 6. METHODS

E. 6-30
<Laplace’s equation in spherical polar coordinates>
p Find the axisymmet-
ric solutions of ∇2 ψ = 0 on Ω = {(x, y, z) ∈ R3 : x2 + y 2 + z 2 ≤ a}.

Since our domain is spherical, it makes sense to use some coordinate system with
spherical symmetry. We use spherical coordinates (r, θ, φ), where x = r sin θ cos φ,
y = r sin θ sin φ and z = r cos θ. The Laplacian is
∂2
   
1 ∂ ∂ 1 ∂ ∂ 1
∇2 = 2 r2 + 2 sin θ + 2 2 2
.
r ∂r ∂r r sin θ ∂θ ∂θ r sin θ ∂φ

Similarly, the volume element is dV = dx dy dz = r2 sin θ dr dθ dφ. Here we’ll only


look at axisymmetric solutions , where ψ(r, θ, φ) = ψ(r, θ) does not depend on φ.
Note that with this condition, ψ must be even in θ, i.e. ψ(r, θ) = ψ(r, −θ), since
ψ(r, θ, φ) = ψ(r, −θ, φ + π). We perform separation of variables, where we look
for any set of solution of ∇2 ψ = 0 inside Ω where ψ(r, θ) = R(r)Θ(θ). Laplace’s
equation then becomes
  
ψ 1 d 2 0 1 d dΘ
(r R ) + sin θ = 0.
r2 R dr Θ sin θ dθ dθ
Similarly, since each term depends only on one variable, each must separately be
constant. So we have
   
d dR d dΘ
r2 = λR, sin θ = −λ sin θΘ.
dr dr dθ dθ
Note that both these equations are eigenfunction equations for some Sturm-Liouville
operator and the Θ(θ) equation has a non-trivial weight function w(θ) = sin θ.
d
sin θ dΘ

Consider the angular equation dθ dθ
= −λ sin θΘ above. Use the substitu-
tion5 x = cos θ, then we have dθd
= − sin θ dxd
. So the Legendre’s equation becomes
 
d dΘ
− sin θ sin θ(− sin θ) + λ sin θΘ = 0.
dx dx
 
d dΘ
Equivalently (1 − x2 ) = −λΘ.
dx dx
which is the Legendre’s equation. Note that since 0 ≤ θ ≤ π, the domain of x
is −1 ≤ x ≤ 1. The solutions are the Legendre’s polynomials P` (cos θ). In our
original equation ∇2 φ = 0, we now set Θ(θ) = P` (cos θ). The radial equation
becomes (r2 R0 )0 = `(` + 1)R. Trying a solution of the form R(r) = rα , we find
that α(α + 1) = `(` + 1). So the solution is α = ` or −(` + 1). Hence
∞  
X b`
φ(r, θ) = a` r` + `+1 P` (cos θ)
r
`=0

is the general solution to the Laplace equation ∇2 φ = 0 on the domain. If we


want regularity at r = 0, then we need b` = 0 for all `. Suppose we require
φ(a, θ) = f (θ) on ∂Ω for some fixed a. We expand

2` + 1 1
X Z
f (θ) = F` P` (cos θ) where F` = P` (y)f (cos−1 y) dy.
2 −1
`=0
5
x here has nothing to do with the Cartesian coordinate.
6.3. PARTIAL DIFFERENTIAL EQUATIONS 229
R1
where the (2` + 1)/2 comes from the normalisation −1 (P` (x))2 dx = 2/(2` + 1).
So we have the axisymmetric solution to Laplace’s equation in spherical polar
coordinates

X  r `
φ(r, θ) = F` P` (cos θ).
a
`=0

E. 6-31
<Multipole expansions for Laplace’s equation> One can quickly check that

1
φ(r) =
|r − r0 |

solves Laplace’s equation ∇2 φ = 0 for all r ∈ R3 \ r0 . For example, if r0 = k̂, where


k̂ is the unit vector in the z direction, then

1 1 X
= √ = c` r` P` (cos θ).
2 + 1 − 2r cos θ
|r − k̂| r `=0

We can expand it as ∞ `
P
`=0 c` r P` (cos θ) by our previous result since 1/|r − k̂| is
finite at the origin. To find c` , we can employ a little trick. Since P` (1) = 1, at
θ = 0, we have
∞ ∞
X 1 X
c` r` = = r` .
1−r
`=0 `=0

So all the coefficients must be 1. This is valid for r < 1. More generally, we have

1 1 X  r ` 1 r
0
= 0 P` (r̂ · r̂0 ) = 0 + 02 r̂ · r̂0 + · · · ..
|r − r | r r0 r r
`=0

This is called the multiple expansion , and is valid when r < r0 . The first term
1/r0 is known as the monopole , and the second term is known as the dipole ,
in analogy to charges in electromagnetism. The monopole term is what we get
due to a single charge, and the second term is the result of the interaction of two
charges.
E. 6-32
<Laplace’s equation in cylindrical coordinates> Solve the Laplace’s equa-
tion in cylindrical coordinates on Ω = {(r, θ, z) ∈ R3 : r ≤ a, z ≥ 0}. Let f be a
given function. Give the solution that is regular inside Ω and obeys the boundary
conditions

φ(a, θ, z) = 0 φ(r, θ, 0) = f (r, θ) lim φ(r, θ, z) = 0.


z→∞

In cylindrical polars, we have

1 ∂2φ ∂2φ
 
1 ∂ ∂φ
∇2 φ = r + 2 2 + = 0.
r ∂r ∂r r ∂θ ∂z 2

Again, we start by separation of variables. We try φ(r, θ, z) = R(r)Θ(θ)Z(z), then


we get
1 1 Θ00 Z 00
(rR0 )0 + 2 + = 0.
rR r Θ Z
230 CHAPTER 6. METHODS
√ √
So we immediately know that Z 00 = µZ, so Z(z) = Ae − µz
+ Be µz
. Now replace
Z 00 /Z by µ, and multiply by r2 to obtain

r Θ00
(rR0 )0 + + µr2 = 0.
R Θ

So we know that Θ00 = −λΘ. Since we require periodicity in θ, we have Θ(θ) =


an sin nθ + bn cos nθ with λ = n2 , n ∈ N0 . Now we are left with r2 R00 + rR0 +
(µr2 − λ)R = 0, this is of Sturm-Liouville type since it can be written as

n2
 
d dR
r − R = −µrR.
dr dr r
2 √
Here we have p(r) = r, q(r) = − nr and w(r) = r. Introducing x = r µ, we can
rewrite this as

d2 R dR
<Bessel’s equation> x2 +x + (x2 − n2 )R = 0
dx2 dx
Note that the n here is not the eigenvalue we are trying to figure out in this
equation. It is already fixed by the Θ equation. The actual eigenvalue we are
finding out is µ, not n.
The boundary√
conditions require the solution to decay as z → ∞, so we have
Z(z) = cµ e− µz . Now we can write our separable solution as
√ √ √
φ(r, θ, z) = (an sin nθ + bn cos nθ)e− µz
(cµ,n Jn (r µ) + dµ,n Yn (r, µ)).

Since we want regularity at r = 0, we don’t want to have the Yn terms. So


dµ,n = 0. We now impose our first boundary condition φ(a, θ, z) = 0. This
√ √
demands Jn (a µ) = 0. So we must have µ = kni /a where kni is the ith root of
Jn (x) (since there is not much pattern to these roots, this is the best description
we can give!). So our general solution obeying the homogeneous conditions is
∞    
X X kni kni r
φ(r, θ, z) = (Ani sin nθ + Bni cos nθ) exp − z Jn (∗)
n=0 i∈roots
a a

We finally impose the inhomogeneous boundary condition φ(r, θ, 0) = f (r, θ). We


set z = 0 in our general solution (∗) and integrate with respect to cos mθdθ to
obtain
1 π
Z  
X kmi r
f (r, θ) cos mθ dθ = Bmi Jm .
π −π i∈roots
a
Rπ Rπ
where we have use the fact that −π sin mθ sin nθdθ = −π cos mθ cos nθdθ = πδmn

and −π sin mθ cos nθdθ = 0. Now we just have Bessel’s function for a single order
m. So we can use the orthogonality relation for the Bessel’s functions to obtain
Z aZ π  
1 2 kmj r
Bmj = 0 cos mθJ m f (r, θ)r dr dθ.
(Jm (kmj )2 πa2 0 −π a

Often it’s hard to explicitly evaluate these integral, but we can ask our computers
to do this numerically for us.
6.3. PARTIAL DIFFERENTIAL EQUATIONS 231

6.3.3 Heat equation


The heat equation is ∂φ ∂t
= κ∇2 φ where κ > 0 is the diffusion constant . It acts on
functions φ : Ω × [0, ∞) → C, where Ω is “space” and [0, ∞) is “time after an initial
event”.Note that if our system (of heat) is in equilibrium and does not change with
time, then we are left with Laplace’s equation. Some important properties of the heat
equation are
1. If no
R heat flows out of Ω, then the total amount of heat is conserved, that is
d
dt Ω
φ(x, t) dV = 0. To see this note
Z Z Z Z
d ∂φ
φ(x, t) dV = dV = κ ∇2 φ dV = κ ∇φ · dS
dt Ω Ω ∂t Ω ∂Ω

So (assuming Ω is compact) if no heat flows out of ∂Ω, that is ∇φ · dS = 0 on ∂Ω,


d
φ(x, t) dV = 0. For non-compact spaces Ω like Rn we require |∇φ(x, t)|
R
then dt Ω
to decay sufficiently quickly as |x| → ∞ for the above to hold.
2. The second property is that if φ(x, t) solves the heat equation on Rn × [0, ∞), then
so do φ1 (x, t) = φ(x − x0 , t − t0 ) and φ2 (x, t) = Aφ(λx, λ2 t) where A, λ and t0 are
real constants and x0 ∈ Rn . This can be shown straightforwardly.

Heat equation on Rn × [0, ∞): The heat kernel


Let’s try to choose A such that the total amount of heat
R in φ(x, t) is the same as
that in φ2 (x, t) = Aφ(λx, λ2 t), that is Rn φ2 (x, t) dV = Rn φ(x, t0 ) dV (note that the
R

value of t and t0 doesn’t matter since the total amount of heat is conserved, that is
property 1). We have
Z Z Z
φ2 (x, t) dn x = A φ(λx, λ2 t) dn x = Aλ−n φ(y, λ2 t) dn y,
Rn Rn Rn
n
where we substituted y = λx. So we need A = λ .
Whenever φ(x, t) solves ∂φ
∂t
= κ∇2 φ, then so does λn φ(λx, λ2 t) with the same amount
of total heat. So we try to find solutions of the form
 
1 x 1
φ(x, t) = F √ = F (η),
(κt)n/2 κt (κt)n/2

where η = x/ κt is called a similarity variable . Note we have φ(x, t) = λn φ(λx, λ2 t).
In other words we are finding solution that satisfies φ(x, t) = λn φ(λx, λ2 t) for any λ.
Turns out our solution correspond to the solution to the heat equation with initial
condition Cδ(x) (Dirac delta function) at t = 0. Intuitively it’s reasonable to say that
there will be some heat at the origin at some point, so Rwe expect F (0) 6= 0, but this
means limt→0 φ(0, t) = ∞; also we want the total heat Rn φ2 (x, t) dV to be finite, so
we expect F (y) → 0 as kyk → ∞, this means limt→0 φ(x, t) = 0 for any x 6= 0.
In 1 + 1 dimensions, we can look for a solution of the form φ(x, t) = √1 F (η) to
κt
2
∂φ 0
∂t
= κ ∂∂xφ2 with boundary condition F (0) = 0. We have
 
∂φ ∂ 1 −1 1 dη 0 −1
= √ F (η) = √ F (η) + √ F (η) = √ (F + ηF 0 )
∂t ∂t κt 2 κt3 κt dt 2 κt3
∂2φ ∂2
   
1 κ ∂ ∂η 0 1
κ = κ 2 √ F (η) = √ F = √ F 00 .
∂x2 ∂x κt κt ∂x ∂x κt3
232 CHAPTER 6. METHODS

So the heat equation is 0 = 2F 00 +ηF 0 +F = (2F 0 +ηF )0 . So we have 2F 0 +ηF = const


and the boundary conditions require that the constant is zero. So we get F 0 = − η2 F .
2
We now see a solution F (η) = R ∞a exp(−η /4). By convention, we now normalize our
solution φ(x, t) by requiring −∞ φ(x, t) dx = 1, that is total heat is 1. This gives

a = 1/ 4π and the full solution is
x2
 
1
φ(x, t) = √ exp − .
4πκt 4κt
More generally, on Rn × [0, ∞), we have the solution
−kx − x0 k2
 
1
φ(x, t) = exp .
(4πκ(t − t0 ))n/2 4κ(t − t0 )
This class of solutions is known as the heat kernel , or sometimes as the fundamental
p time t 6= t0 , the solutions are Gaussians
solutions of the heat equation. At any fixed
centered on
p x 0 with standard deviation 2κ(t − t0 ), and the height of the peak at
x = x0 is 1/(4πκ(t − t0 )). The solution gets “flatter” as t increases. This is in fact
a general property of the evolution by the heat equation.

Eigenfunctions solution
We can also find solution to the heat equation using eigenfunctions. Suppose φ(x, t)
obeys the heat equation ∂φ ∂t
= ∇2 φ for t > 0, where for convenience we P let κ = 1.
Given the initial state φ(x, 0) at time t = 0, we can expand it as φ(x, 0) = I cI yI (x)
in the complete set {yI (x)} of eigenfunctions for ∇2 . Let λI be the eigenvalue of yI ,
i.e. ∇2 yI = −λI yI , then the solution for all time is
X
φ(x, t) = cI e−λI t yI
I
P
as can be verified by direct substitution. We can write φ(x, t) = I cI (t)yI (x) where
cI (t) = cI e−λ2 t .
In some cases the eigenvalues λ must be positive. Suppose we have an eigenfunction
y(x) of the Laplacian ∇2 on Ω, so that ∇2 y = −λy. Using the identity ∇ · (Φ∇Ψ) =
(∇Φ) · (∇Ψ) + Φ∇2 Ψ we have
Z Z Z Z
−λ |y|2 dV = y ∗ (x)∇2 y(x) dV = y ∗ n · ∇y dS − |∇y|2 dV.
Ω Ω ∂Ω Ω

If the boundary term vanish (e.gR when n · ∇y R= 0 on ∂Ω or when ∂Ω = 0) then the


eigenvalues are positive as λ = ( Ω |∇y|2 dV )/( Ω |y|2 dV ).
If this is the case, and λI are all positive, then cI (t) decays exponentially with time.
In particular, the coefficients corresponding to large eigenvalues |λI | decays rapidly.
Eigenfunctions with larger eigenvalues are in general more “spiky”, so these smooth
out rather quickly. So he heat equation smooth-en the function over time.

Heat equation derivation


When people first came up with the heat equation to describe heat loss, they were
slightly skeptical — this flattening out caused by the heat equation is not time-
reversible, while Newton’s laws of motions are all time reversible. How could this
smoothening out arise from reversible equations?
6.3. PARTIAL DIFFERENTIAL EQUATIONS 233

Einstein came up with an example where the heat equation can come out of apparently
naturally from seemingly reversible laws, namely Brownian motion. The idea is that
a particle will randomly jump around in space, where the movement is independent of
time and position.
Let the probability that a dust particle moves R ∞ through a step y over a time ∆t be
p(y, ∆t). For any fixed ∆t, we must have −∞ p(y, ∆t) dy = 1. Here we assume
that p(y, ∆t) is independent of time, and of the location of the dust particle. We also
assume p(y, ∆t) is strongly peaked around y = 0 and also p(y, ∆t) = p(−y, ∆t). Now
let P (x, t) be the probability that the dust particle is located at x at time t. Then at
time t + ∆t, we have
Z ∞
P (x, t + ∆t) = P (x − y, t)p(y, ∆t) dy.
−∞

For P (x − y, t) sufficiently smooth, we can write

∂P y2 ∂ 2 P
P (x − y, t) ≈ P (x, t) − y (x, t) + (x, t)
∂x 2 ∂x2
∂P 1 ∂2P
=⇒ P (x, t + ∆t) ≈ P (x, t) − (x, t)hyi + hy 2 i 2 (x, t) + · · · ,
∂x 2 ∂x
Z ∞
where hy r i = y r p(y, ∆t) dy.
−∞

Since p is even function, we expect hy r i to vanish when r is odd. Also, since y is strongly
peaked at 0, we expect the higher order terms to be small. So we can write

1 2 ∂2P
P (x, t + ∆t) − P (x, t) = hy i 2 .
2 ∂x
2
Suppose that as we take the limit ∆t → 0, we get hy i
2∆t
→ κ for some κ. Then this
becomes the heat equation
∂P ∂2P
=κ 2.
∂t ∂x
P. 6-33
Suppose φ : Ω × [0, ∞) → R satisfies the heat equation ∂φ
∂t
= κ∇2 φ, and obeys
• Initial conditions φ(x, 0) = f (x) for all x ∈ Ω
• Boundary condition φ(x, t)|∂Ω = g(x, t) for all t ∈ [0, ∞).
Then φ(x, t) is unique.

Suppose φ1 and φ2 are both solutions. Then define Φ = φ1 − φ2 and E(t) =


1 2
R
2 Ω
Φ dV . Then we know that E(t) ≥ 0. Since φ1 , φ2 both obey the heat
equation, so does Φ. Also, Φ = 0 on the boundary. So we have
Z Z Z Z
dE dΦ
= Φ dV = κ Φ∇2 Φ dV = κ Φ∇Φ · dS − κ (∇Φ)2 dV
dt Ω dt Ω ∂Ω Ω
Z
2
= −κ (∇Φ) dV ≤ 0.

So we know that E decreases with time but is always non-negative. But at time
t = 0, E = Φ = 0. So E = 0 always, so Φ = 0 and hence φ1 = φ2 .
234 CHAPTER 6. METHODS

E. 6-34
<Heat conduction in uniform medium> Suppose we are on Earth, and the
Sun heats up the surface of the Earth through sun light. So the sun will maintain
the soil at some fixed temperature. However, this temperature varies with time as
we move through the day-night cycle and seasons of the year.
We let φ(x, t) be the temperature of the soil as a function of the depth x, defined
2
on [0, ∞) × [0, ∞). Then it obeys the heat equation ∂φ∂t
= K ∂∂xφ2 with conditions
   
1. φ(0, t) = φ0 + A cos 2πt
tD
+ B cos 2πt
tY
.

2. φ(x, t) → const as x → ∞.
We try the separation of variables. Suppose φ(x, t) = T (t)X(x), then we get the
equations T 0 = λT and X 00 = K λ
X. From the boundary solution, we know that
our things will be oscillatory. So we let λ be imaginary, and set λ = iω. So we
have  √ √ 
φ(x, t) = eiωt aω e− iω/Kx + bω e iω/Kx .
Note that we have  q
r (1 + i) |ω| ω>0
iω q 2K
=
K (i − 1) |ω| ω<0
2K

Since φ(x, t) → constant as x → ∞, we don’t want our φ to blow up. So if ω < 0,


we need aω = 0. Otherwise, we need bω = 0. To match up at x = 0, we just want
terms with |ω| = ωD = t2π
D
and |ω| = ωY = t2π
Y
. So we can write out solution as
 r   r 
ωD ωD
φ(x, t) = φ0 + A exp − t cos ωD t − x
2K 2K
 r   r 
ωY ωY
+ B exp − t cos ωY t − x
2K 2K
We can notice a few things. Firstly, as we go further down, the effect of the
sun decays, and the temperature is more stable. Also, the effect of the day-night
cycle decays more quickly than the annual cycle, which makes sense. We also see
that while the temperature does fluctuate with the same frequency as the day-
night/annual cycle, as we go down, there is a phase shift. This is helpful since we
can store things underground and make them cool in summer, warm in winter.
E. 6-35
<Heat conduction in a finite rod> We now have a finite rod of length 2L:

x = −L x=0 x=L

We have the initial conditions


(
1 0<x<L
φ(x, 0) = Θ(x) =
0 −L < x < 0

and the boundary conditions φ(−L, t) = 0 and φ(L, t) = 1. So we start with a


step temperature, and then maintain the two ends at fixed temperatures 0 and 1.
We are going to do separation of variables, but we note that all our boundary and
initial conditions are inhomogeneous. This is not helpful. So we use a little trick.
6.3. PARTIAL DIFFERENTIAL EQUATIONS 235

We first look for any solution satisfying the boundary conditions φS (−L, t) =
0 and φS (L, t) = 1. For example, we can look for time-independent solutions
2
φS (x, t) = φS (x). Then we need ddxφ2S = 0. So we get
x+L
φS (x) = .
2L
By linearity, ψ(x, t) = φ(x, t) − φs (x) obeys the heat equation with the con-
ditions ψ(−L, t) = ψ(L, t) = 0 which is homogeneous! Our initial condition
now becomes ψ(x, 0) = Θ(x) − x+L 2L
. We now perform separation of variables,
ψ(x, t) = X(x)T (t). Then we obtain the equations T 0 = −κλT and X 0 = −λX.
Then we have  √ √ 
ψ(x, t) = a sin( λx) + b cos( λx) e−κλt .
Since initial condition is odd, we can eliminate all cos terms. Our boundary
conditions also requires λ = n2 π 2 /L2 where n = 1, 2, · · · . So we have

κn2 π 2
X  nπx   
φ(x, t) = φs (x) + an sin exp − t ,
n=1
L L2

where an are the Fourier coefficients


1 L
Z  
x+L  nπx  1
an = Θ(x) − sin dx = .
L −L 2L L nπ
E. 6-36
<Cooling of a uniform sphere> Once upon a time, Charles Darwin went
around the Earth, looked at species, and decided that evolution happened. When
he came up with his theory, it worked quite well, except that there was one worry.
Was there enough time on Earth for evolution to occur? He knew well that the
Earth started as a ball of molten rock, and obviously life couldn’t have evolved
when the world was still molten rock. So he would want to know how long it took
for the Earth to cool down to its current temperature, and if that was sufficient
for evolution to occur.
We can, unsurprisingly, use the heat equation. We assume that at time t = 0,
the temperature is φ0 , the melting temperature of rock. We also assume that the
space is cold, and we let the temperature on the boundary of Earth as 0. We then
solve the heat equation on a sphere (or ball), and see how much time it takes for
Earth to cool down to its present temperature.
We let Ω = {(x, y, z) ∈ R3 , r ≤ R}, and we want a solution φ(r, t) of the heat
equation that is spherically symmetric and obeys the conditions φ(R, t) = 0 and
φ(r, 0) = φ0 . We can write the heat equation as
 
∂φ κ ∂ ∂φ
= κ∇2 φ = 2 r2 .
∂t r ∂r ∂r
Again, we do the separation of variables, φ(r, t) = R(r)T (t). So we get
 
d dR
T 0 = −λ2 κT, r2 = −λ2 r2 R.
dr dr
We can simplify this a bit by letting R(r) = S(r)
r
, then our radial equation just
becomes S 00 = −λ2 S. We can solve this to get
sin λr cos λr
R(r) = Aλ + Bλ .
r r
236 CHAPTER 6. METHODS

We want a regular solution at r = 0. So we need Bλ = 0. Also, the boundary


condition φ(R, t) = 0 gives λ = nπ/R for n = 1, 2, · · · . So we get
−κn2 π 2 t
X An  nπr   
φ0 R
φ(r, t) = sin exp 2
where An = (−1)n+1 .
n∈Z
r R R nπ

We know that the Earth isn’t just a cold piece of rock. There are still volcanoes.
So we know many terms still contribute to the sum nowadays. This is rather
difficult to work with. So we instead look at the temperature gradient
−κn2 π 2 t
 
∂φ φ0 X  nπr 
= (−1)n+1 cos exp + sin stuff.
∂r r n∈Z R R2

We evaluate this at the surface of the Earth, R = r. So we get the gradient


r
κn2 π 2 t φ0 ∞ κy 2 π 2 t
  Z  
∂φ φ0 X 1
= − exp − ≈ exp − dy = φ 0 .
∂r R R n∈Z R2 R −∞ R2 πκt

So the age of the Earth is approximately


 2
φ0 1 ∂φ
t≈ where V = .
V πκ ∂r R
We can plug the numbers in, and get that the Earth is 100 million years. This is
not enough time for evolution. Later on, people discovered that fission of metals
inside the core of Earth produce heat, and provides an alternative source of heat.
So problem solved! The current estimate of the age of the world is around 4 billion
years, and evolution did have enough time to occur.

6.3.4 Wave equation


∂2φ
The wave equation is ∂t2
= c2 ∇2 φ where c is constant.

Wave equation derivation


Consider a string x ∈ [0, L] undergoing small oscillations described by φ(x, t).

φ(x, t)

B
A

Consider two points A, B separated by a small distance δx. Let TA and TB be the
tension of the string at A and B respectively, and θA , θB the angle they make with the
horizontal. Consider the segment of the string between A and B. It has no sideways
(x) motion, there is no net horizontal force, so
TA cos θA = TB cos θB = T. (∗)
If the string has mass per unit length µ, then in the vertical direction, Newton’s second
law gives
∂2φ
µδx 2 = TB sin θB − TA sin θA .
∂t
6.3. PARTIAL DIFFERENTIAL EQUATIONS 237

We now divide everything by T , noting the relation (∗), and get


δx ∂ 2 φ ∂2φ

TB sin θB TA sin θA ∂φ ∂φ
µ 2
= − = tan θ B − tan θA = − ≈ δx 2 .
T ∂t TB cos θB TA cos θA ∂x B ∂x A ∂x
Taking the limit δx → 0 and setting c2 = T /µ, we have that φ(x, t) obeys the (one
dimensional) wave equation
1 ∂2φ ∂2φ
= .
c2 ∂t2 ∂x2

Solution to the wave equation


Assume that the string is fixed at both ends. Then φ(0, t) = φ(L, t) = 0 for all t. The
2 2
we can perform separation of variables, and the general solution of c12 ∂∂t2φ = ∂∂xφ2 can
then be written as
∞  nπx      
X nπct nπct
φ(x, t) = sin An cos + Bn sin .
n=1
L L L

The coefficients An are fixed by the initial profile φ(x, 0) of the string, while the
coefficients Bn are fixed by the initial string velocity ∂φ
∂t
(x, 0). Note that we need two
sets of initial conditions, since the wave equation is second-order in time.
2 2
From IA Differential Equations, we’ve learnt that the solution of c12 ∂∂t2φ = ∂∂xφ2 can
be written as f (x − ct) + g(x + ct). However, the method does not extend to higher
dimensions, but the method of separation of variables does.

Energy of oscillating string


An oscillating string contains some sort of energy. The kinetic energy of a small
2
element δx of the string is 21 µδx ∂φ
∂t
. The total kinetic energy of the string is hence
the integral
Z  2
µ L ∂φ
K(t) = dx.
2 0 ∂t
The string also has potential energy due to tension. The potential energy of a small
element δx of the string is
p
T × extension = T (δs − δx) = T ( δx2 + δφ2 − δx)
 2 !  2
1 δφ T δφ
= T δx 1 + + · · · − T δx ≈ δx .
2 δx 2 δx

Hence the total potential energy of the string is


 2
µ L 2 ∂φ
Z
V (t) = c dx,
2 0 ∂x
using the definition of c2 . Using our series expansion for φ, we get
∞ 2
µπ 2 c2 X 2
   
nπct nπct
K(t) = n An sin − Bn cos
4L n=1 L L
∞ 2
µπ 2 c2 X 2
   
nπct nπct
V (t) = n An cos + Bn sin
4L n=1 L L
238 CHAPTER 6. METHODS

The total energy is then



µπ 2 c2 X 2 2
E(t) = n (An + Bn2 ).
4L n=1

What can we see here? Our solution is essentially an (infinite) sum of independent
harmonic oscillators, one for each n. The period of the fundamental mode (n = 1) is
2π L
ω
= 2π πc = 2Lc
. Thus, averaging over a period, the average kinetic energy is
Z 2L/c Z 2L/c
c c E
K̄ = K(t) dt = V̄ = V (t) dt = .
2L 0 2L 0 2
Hence we have an equipartition of the energy between the kinetic and potential en-
ergy.
P. 6-37
2
Suppose φ : Ω × [0, ∞) → R obeys the wave equation ∂∂t2φ = c2 ∇2 φ inside Ω ×
(0, ∞), and is fixed at the boundary. Then E is constant.

By definition of E = V + K, dropping the constant µ we have


∂ 2 φ ∂φ
Z  
dE 2 ∂φ
= 2 ∂t
+ c ∇ · ∇φ dV.
dt Ω ∂t ∂t
We integrate by parts (by divergence theorem) in the second term to obtain
dφ ∂ 2 φ
Z   Z
dE 2 2 2 ∂φ
= − c ∇ φ dV + c ∇φ · dS.
dt Ω dt ∂t2 ∂Ω ∂t
2
noting that ∇ · ( ∂φ ∇φ) = ∇ ∂φ · ∇φ + ∂φ ∇2 φ. Since ∂∂t2φ = c2 ∇2 φ by the wave

∂t ∂t ∂t
equation, and φ is constant on ∂Ω, we know that dE dt
= 0.
P. 6-38
∂2φ
Suppose φ : Ω × [0, ∞) → R obeys the wave equation ∂t2
= c2 ∇2 φ inside Ω ×
(0, ∞), and obeys, for some f, g, h,
∂φ
1. φ(x, 0) = f (x); 2. (x, 0) = g(x); and 3. φ|∂Ω×[0,∞) = h(x, t).
∂t
Then φ is unique.

Suppose φ1 and φ2 are two such solutions. Then ψ = φ1 − φ2 obeys the wave
2
equation ∂∂tψ ∂ψ
2 2

2 = c ∇ ψ and ψ|∂Ω×[0,∞) = ψ|Ω×{0} = ∂t Ω×{0} = 0. Consider the

energy (per µ) !
Z  2
1 ∂ψ 2
Eψ (t) = + c ∇ψ · ∇ψ dV.
2 Ω ∂t
Then since ψ obeys the wave equation with fixed boundary conditions, we know
Eψ is constant. Initially, at t = 0, we know that ψ = ∂ψ ∂t
= 0. So Eψ (0) = 0. At
time t, we have
Z  2
1 ∂ψ
Eψ = + c2 (∇ψ) · (∇ψ) dV = 0.
2 Ω ∂t
∂ψ
Hence we must have ∂t
= 0. So ψ is constant. Since it is 0 at the beginning, it is
always 0.
6.3. PARTIAL DIFFERENTIAL EQUATIONS 239

E. 6-39
<Vibrations of a circular membrane> Consider Ω = {(x, y) ∈ R2 , x2 + y 2 ≤
1}, and let φ(r, θ, t) solve

1 ∂2φ 1 ∂2φ
 
2 1 ∂φ
2 2
=∇ φ= r + 2 2,
c ∂t r ∂r r ∂θ

with the boundary condition φ|∂Ω = 0. We can imagine this as a drum, where the
membrane can freely oscillate with the boundary fixed. Separating variables with
φ(r, θ, t) = T (t)R(r)Θ(θ), we get

T 00 = −c2 λT, Θ00 = −µΘ, r(R0 )0 + (r2 λ − µ)R = 0.

Then as before, T and Θ are both sine and cosine waves. Since we are in polars
coordinates, we need φ(t, r, θ + 2π) = φ(t, r, θ). So we must have µ = m2 for some
m ∈ N. Then the radial equation becomes r(rR0 )0 + (r√ 2
λ − m2 )R = 0
√ which is
Bessel’s equation of order m. So we have R(r) = am Jm ( λr) + bm Ym ( λr).
Since we want regularity at r = 0, we need √
bm = 0 for all m. To satisfy the
boundary condition φ|∂Ω = 0, we must choose λ = kmi , where kmi is the ith root
of Jm . Hence the general solution is
∞ 
X 
φ(t, r, θ) = A0i sin(k0i ct) + B0i cos(k0i ct) J0 (k0i r)
i=0
∞ X
X ∞  
+ Ami cos(mθ) + Bmi sin(mθ) sin(kmi ct)Jm (kmi r)
m=1 i=0
X∞ X ∞  
+ Cmi cos(mθ) + Dmi sin(mθ) cos(kmi ct)Jm (kmi r)
m=1 i=0

For example, suppose we have the initial conditions φ(0, r, θ) = 0, ∂t φ(0, r, θ) =


g(r). So we start with a flat surface and suddenly hit it with some force. By
symmetry, we must have Ami , Bmi , Cmi , Dmi = 0 for m 6= 0. If this does not seem
obvious, we can perform some integrals with the orthogonality relations to prove
this. At t = 0, we need φ = 0. So we must have B0i = 0. So we are left with

X
φ= A0i sin(k0i ct)J0 (k0j r).
i=0

We can differentiate this w.r.t t, set t = 0 and multiply with J0 (k0j r)r to obtain
Z ∞
1X Z 1
k0i cA0i J0 (k0i r)J0 (k0j r)r dr = g(r)J0 (k0j r)r dr.
0 i=0 0

Using the orthogonality relations for J0 , we would get


Z 1
2 1
A0i = g(r)J0 (k0j r)r dr.
ck0i (J00 (k0i ))2 0

Note that the frequencies come from the roots of the Bessel’s function, and are not
evenly spaced. This is different from, say, string instruments, where the frequencies
are evenly spaced. So drums sound differently from strings.
240 CHAPTER 6. METHODS

E. 6-40
So far, we have used separation of variables to solve our differential equations. It
worked in our examples, but there are a few issues with it. Of course, we have the
problem of whether it converges or not, but there is a more fundamental problem.
To perform separation of variables, we need to pick a good coordinate system,
such that the boundary conditions come in a nice form. However, in real life, our
domain might have a weird shape, and we cannot easily find good coordinates for
it.
Mark Kac asked the interesting question “can you hear the shape of a drum?” —
suppose we know all the frequencies of the modes of oscillation on some domain
Ω. Can we know what Ω is like? The answer is no, and we can explicitly construct
two distinct drums that sound the same. However if we require Ω to be convex,
and has a real analytic boundary, then yes! For example, we can recover the area:
let N (λ0 ) be the number of eigenfrequencies less than λ0 . Then we can show that

N (λ0 )
4π 2 lim = Area(Ω).
λ0 →∞ λ0

6.4 Distributions
When performing separation of variables, we first find some particular solutions of the
form, say, X(x)Y (y)Z(z). We know that these solve, say, the wave equation. However,
what we do next is take an infinite sum of these functions. First of all, how can we
be sure that this converges at all? Even if it did, how do we know that the sum
satisfies the differential equation? As we have seen in Fourier series, an infinite sum
of continuous functions can be discontinuous. If it is not even continuous, how can
we say it is a solution of a differential equation? Hence, at first people were rather
skeptical of this method.
Quite remarkably, the most fruitful way forward has turned out not to be to restrict
ourselves to sufficiently differentiable functions that our concerns are eased, but rather
to be to generalize the very notion of a function itself with the aim of finding the right
class of object where the method always makes sense. Generalized functions were intro-
duced in mathematics by Sobolev and Schwartz. They’re designed to full an apparently
mutually contradictory pair of requirements: They are sufficiently well behaved that
they’re infinitely differentiable and thus have a chance to satisfy partial differential
equations, yet at the same time they can be arbitrarily singular neither smooth, nor
differentiable, nor continuous, nor even finite if interpreted naively as ‘ordinary func-
tions’. These generalised functions are called distributions, inspired by the distribution
of singular source which is represented by the Dirac delta distribution.
D. 6-41
• For Ω ⊆ Rn , write D(Ω) for the set of functions Ω → R which is smooth (i.e.
infinitely differentiable) and has compact support (i.e. is identically zero outside
some compact set). Alternatively they are the set of functions Rn → R which is
smooth and has compact support on Ω (i.e. it’s zero outside some compact set
K ⊆ Ω).
• Distributions are a class of linear functionals that map a set (space) of test
functions (conventional and well-behaved functions) into the set of real numbers.
6.4. DISTRIBUTIONS 241

Below we will take D(Ω) as the space of test functions, and distributions are
elements of the dual space (D(Ω))∗ , which here we will write as D0 (Ω).
Like a usual vector space given distributions T1 , T2 and constants λ, µ, the distri-
bution λT1 + µT2 is given by (λT1 + µT2 )[φ] = λT1 [φ] + µT2 [φ]. In addition, given
a smooth function ψ ∈ C ∞ (Ω) and distribution T , we define the distribution ψT
by (ψT )[φ] = T [ψφ].6
• Given an ordinary function f : Ω → R that is locally integrable (i.e. inte-
Rgrable over any compact region), define the distribution Tf by Tf [φ] = hf, φi =

f (x)φ(x) dV . Sometimes Tf [φ] is simply written as f [φ].

• The derivative T 0 of a a distributions T is defined by T 0 [φ] = −T [φ0 ].


• The Dirac delta δ : D(Ω) → R is a distribution defined by δ[ϕ] = ϕ(0) for every
test function ϕ.
E. 6-42
• Here is an example of a function in D(R): the bump function defined by
2
(
e−1/(1−x ) |x| < 1
φ(x) =
0 otherwise. x

• Given an ordinary function f : Ω → R that is locally integrable, its corresponding


generalized function (i.e. distribution), written Tf is define by
Z
Tf [φ] = hf, φi = f (x)φ(x) dV.

Note that this is a linear map since integration is linear (and multiplication is
commutative and distributes over addition). Also this integral is guaranteed to be
well-defined even when Ω is non-compact (say, the whole of Rn ) since φ has com-
pact support and f is locally integrable. Note also that unlike the test functions,
we do not require f itself to have compact support. When the context is clear, we
might write Tf simply as f .
R
• By analogy with above, we often abuse notation and write δ[φ] = Ω δ(x)φ(x) dV
and pretend δ(x) is an actual function (more precisely, pretend that there is an
ordinary function δ(x) that gave rise to the Dirac delta distribution) like we did
in part IA. Of course, this cannot really be the case, since if it were, then we must
have δ(x) = 0 whenever x 6= 0, since δ[φ] = φ(0) only depends on what happens
at 0. But then this integral will just give 0 if δ(0) ∈ R. Some people like to think
of this as a function that is zero anywhere and “infinitely large” everywhere else.
Formally, the Dirac delta should be think of as a distribution.
• Although distributions can be arbitrarily singular and insane, we can nonetheless
define all their derivatives, as T 0 [φ] = −T [φ0 ]. This is motivated by the case of
regular functions, where we would want Tf0 = Tf 0 : In one-dimension (i.e. Ω ⊆ R,
after integrating by parts we get
Z Z
Tf 0 [φ] = f 0 (x)φ(x) dx = − f (x)φ0 (x) dx = −T [φ0 ],
Ω Ω

6
In general there is no way to multiply two distributions together.
242 CHAPTER 6. METHODS

with no boundary terms since we have a compact support. Since φ is infinitely


differentiable, we can take arbitrary derivatives of distributions. So even though
distributions can be crazy and singular, everything can be differentiated.
E. 6-43
Consider the Heaviside
R∞ step function
R∞Θ(x) = I[x > 0], we have the distribution
TΘ [φ] = Θ[φ] = −∞ Θ(x)φ(x)dx = 0 φ(x)dx which converges since φ has com-
pact support. While Θ(x) is not differentiable (nor continuous) as a function, it
is differentiable as a distribution. We have
Z ∞
∂φ
Θ0 ]φ] = −Θ[φ0 ] = − dx = φ(0) − φ(∞) = φ(0)
0 ∂x

So in terms of distribution, Θ0 = δ, the Dirac delta. The excellent differentiability


of distributions also allows us to make sense
P of divergent Fourier series. The saw-
tooth function has Fourier series,[E.6-6] 2 ∞n=1 (−1) n+1 1
n
sin(nx) which converges
to the sawtooth function (except at discontinuity). Differentiating the series term
by term leads to the divergent series 2 ∞ n+1
P
n=1 (−1) cos(nx). But if we treat the
sawtooth function as a distribution, writing it as
∞ ∞ ∞
X (−1)n+1 X X
2 sin(nx) = x + 2π Θ(x − nπ) − 2π Θ(−x − nπ)
n=1
n n=1 n=0

where the step functions provide the jumps in the sawtooth. Then differentiating
this term by term gives

X X
2 (−1)n+1 cos(nx) = 1 + 2π δ(x − nπ)
n=1 n∈Z

which gives meaning to the non-convergent sum as a distribution: The sawtooth


function has a constant gradient everywhere except at x = nπ for n ∈ Z, where it
has a δ-function spike.
C. 6-44
<Properties of δ(x)> We can look at some properties of δ(x), we will treat
it in the functions sense, but it can be make rigorous in the deeper theory of
distributions.
• For a f continuous in a neighbourhood of 0, then the distribution f δ obeys
(f δ)[φ] = δ[f φ] = f (0)φ(0) = f (0)δ[φ], we write this in function notation as
f (x)δ(x) = f (0)δ(x) using the idea that δ(x) vanishes whenever x 6= 0.
• Translation: For a ∈ R we can define a translated Dirac delta by δa [φ] = φ(a).
We can write it as δ(x − a)
Z ∞ Z ∞
δ(x − a)φ(x) dx = δ(y)φ(y + a) dy = φ(a)
−∞ −∞

• Scaling: We can write δ(cx) = δ(x)/|c| as


Z ∞ Z ∞  y  dy 1
δ(cx)φ(x) dx = δ(y)φ = φ(0)
−∞ −∞ c |c| |c|
6.5. GREEN’S FUNCTIONS FOR ODES 243

• The previous two are special cases of the following: suppose f (x) is a continu-
ously differentiable function with isolated simple zeros at xi . Then near any of
its zeros xi , we have f (x) ≈ (x − xi ) ∂f . Then
∂x x i

Z ∞ n Z ∞
! n
X ∂f X 1
δ(f (x))φ(x) dx = δ (x − xi ) φ(x) dx = 0 (x )|
φ(xi ).
−∞ i=1 −∞ ∂x xi

i=1
|f i

E. 6-45
• Generalized functions can occur as limits of sequences of normal functions. For
example, the family of functions
n 2 2
Gn (x) = √ e−n x
π
are smooth for any finite n, and Gn [φ] → δ[φ] for any φ. It thus makes Rsense to de-
fine δ 0 [φ] = −δ[φ0 ] = −φ0 (0) as this is the limit of the sequence limn→∞ Ω G0n (x)φ(x) dx.
It is often convenient to think of δ(x) as limn→∞ Gn (x), and δ 0 (x) = limn→∞ G0n (x)
etc., despite the fact that these limits do not exist as functions.
• We can also expand the δ-function in a basis of eigenfunctions.
P Suppose we live in
the interval [−L, L], and write a Fourier expansion δ(x) = n∈Z δ̂n einπx/L with
1
R L −inπx/L 1
δ̂n = 2L −L
e δ(x) dx = 2L . So we have

1 X inπx/L
δ(x) = e .
2L n∈Z

This does make sense as a distribution. Consider the partial sum SN δ(x),
Z L Z L N
1 X
lim SN δ(x)φ(x) dx = lim einπx/L φ(x) dx
N →∞ −L N →∞ 2L L n=−N
N  Z L 
X 1
= lim einπx/L φ(x) dx
N →∞
n=−N
2L −L
N
X N
X
= lim φ̂−n = lim φ̂n einπ0/L = φ(0),
N →∞ N →∞
n=−N n=−N

since the Fourier series of the smooth function φ(x) does converge for all x ∈
[−L, L].
• We can equally well expand δ(x) in terms of any other set of orthonormal eigenfunc-
tions. Let {yn (x)} be a complete set of eigenfunctions on [a, b] that areP
orthogonal
with respect to a weight function w(x). Then we can write δ(x − ξ) = n cn yn (x)
Rb
with cn = a yn∗ (x)δ(x − ξ)w(x) dx = yn∗ (ξ)w(ξ). So
X ∗ X ∗
δ(x − ξ) = w(ξ) yn (ξ)yn (x) = w(x) yn (ξ)yn (x),
n n

using the fact that w(x)δ(x − ξ) = w(ξ)δ(x − ξ).

6.5 Green’s Functions for ODEs


244 CHAPTER 6. METHODS

C. 6-46
<Green’s functions> One of the main uses of the δ function is the Green’s
function. Suppose we wish to solve the 2nd order ordinary differential equation
Ly = f on [a, b] (which may be ±∞ respectively), where f (x) is a bounded forcing
term, and L is a differential operator

∂2 ∂
L = α(x) + β(x) + γ(x).
∂x2 ∂x
where α, β, γ are continuous with α nonzero except perhaps at a finite number of
isolated points (on [a, b]). We now define the Green’s function G(x, ξ) of L to be
the any solution (might not be unique) to the problem LG = δ(x − ξ).
Rb
Given G(x, ξ), if we define y(x) = a G(x, ξ)f (ξ) dξ, then
Z b Z b
Ly = LG(x, ξ)f (ξ) dξ = δ(x − ξ)f (ξ) dξ = f (x).
a a

We can see that y is a solution to Ly = f . The important point is that G depends


on L but not on the forcing term f . Once G is known, we will be able to write down
the solution to Ly = f for an arbitrary force term. To put this differently, since
asking for a solution to the differential equation Ly = f is like asking to invert
the differential operator L, and we might formally write y(x) = L−1 f . This result
shows what is meant by the inverse of the differential operator L is integration
with the Green’s function as the integral kernel.

E. 6-47
Use Green’s function to solve Ly = f on [a, b] with boundary condition y(a) =
y(b) = 0.

It would be enough if we could find the Green’s function G(x, ξ) obeying the
homogeneous boundary conditions G(a, ξ) = G(b, ξ) = 0, which would gives us
Rb
the (unique) solution y(x) = a G(x, ξ)f (ξ) dξ (satisfying the required boundary
condition) for the problem.
Note that LG(x, ξ) = 0 whenever x 6= ξ. Thus, for both x < ξ and x > ξ we can
express G in terms of solutions of the homogeneous equation Ly = 0. Suppose that
{y1 (x), y2 (x)} is a basis of linearly independent solutions to the problem Ly = 0
on [a, b]. We define this basis by requiring that y1 (a) = 0 and y2 (b) = 0. That is,
each of y1 and y2 obeys one of the homogeneous boundary conditions. Note that
such y1 and y2 are unique up to multiplication of a constant. Therefore we must
have (
A(ξ)y1 (x) a≤x<ξ
G(x, ξ) =
B(ξ)y2 (x) ξ<x≤b
So we have a whole family of solutions. To fix the coefficients, we must decide how
to join these solutions together over x = ξ.
If G(x, ξ) were discontinuous at x = ξ, then ∂x G|x=ξ would involve a δ function,
while ∂x2 G|x=ξ would involve the derivative of the δ function. This is not good,
since nothing in LG = δ(x − ξ) would balance a δ 0 . So G(x, ξ) must be everywhere
continuous. Hence we require

A(ξ)y1 (ξ) = B(ξ)y2 (ξ). (∗)


6.5. GREEN’S FUNCTIONS FOR ODES 245

Now integrate over a small region (ξ − ε, ξ + ε) surrounding ξ. Then we have


Z ξ+ε  Z ξ+ε
d2 G

dG
α(x) 2 + β(x) + γ(x)G dx = δ(x − ξ) dx = 1.
ξ−ε dx dx ξ−ε

If we take ε → 0, by continuity of G, the γG term does not contribute. While G0


0
is discontinuous,
R ξ+εit is00still finite. So the βG term also does not contribute. So we
have limε→0 ξ−ε αG dx = 1. We now integrate by parts to obtain
 Z ξ+ε  !
∂G ∂G
lim [αG0 ]ξ+ε
ξ−ε + 0 0
α G dx = 1 =⇒ α(ξ) − =1
ε→0 ξ−ε ∂x ξ+ ∂x ξ−

since by finiteness of G0 , the integral of α0 G0 does not contribute. Hence we obtain


B(ξ)y20 (ξ) − A(ξ)y10 (ξ) = 1/α(ξ). Together with (∗), we know that

y2 (ξ) y1 (ξ)
A(ξ) = , B(ξ) = ,
α(ξ)W (ξ) α(ξ)W (ξ)

where W is the Wronskian W = y1 y20 − y2 y10 . Hence, we know that


(
1 y2 (ξ)y1 (x) a≤x≤ξ
G(x, ξ) =
α(ξ)W (ξ) y1 (ξ)y2 (x) ξ<x≤b

Using the step function Θ, we can write this as


Θ(ξ − x)y2 (ξ)y1 (x) + Θ(x − ξ)y1 (ξ)y2 (x)
G(x, ξ) = .
α(ξ)W (ξ)
So our general solution is
Z b Z b Z x
f (ξ) f (ξ)
y(x) = G(x, ξ)f (ξ) dξ = y2 (ξ)y1 (x) dξ+ y1 (ξ)y2 (x) dξ.
a x α(ξ)W (ξ) a α(ξ)W (ξ)
E. 6-48
Consider Ly = −y 00 − y = f for x ∈ (0, 1) with y(0) = y(1) = 0. We choose our
basis solution to satisfy y 00 = −y as {sin x, sin(1 − x)}. Then we can compute the
Wronskian W (x) = − sin x cos(1 − x) − sin(1 − x) cos x = − sin 1. Our Green’s
function is
1  
G(x, ξ) = Θ(ξ − x) sin(1 − ξ) sin x + Θ(x − ξ) sin ξ sin(1 − x) .
sin 1
Hence we get
Z 1 Z x
sin(1 − ξ) sin ξ
y(x) = sin x f (ξ) dξ + sin(1 − x) f (ξ) dξ.
x sin 1 0 sin 1
E. 6-49
Consider a elastic string with ends fixed at x = 0, L. If y(x, t) represents the small
2
vertical displacement transverse to the string, the wave equation gives µ ∂∂t2y =
2
∂ y
T ∂x2 . In the presence of gravity, the equation becomes

∂2y ∂2y
µ 2
= T 2 + µg.
∂t ∂x
246 CHAPTER 6. METHODS

as can be seen by slightly altering the derivation in [C.6.3.4]. We look for the
steady state solution ẏ = 0 (i.e. shape of the string when the string is at rest)
obeying y(0, t) = y(L, t) = 0. In this case the above equation reduces to

∂2y µ(x)
=− g.
∂x2 T
2
∂ y
We look for a Green’s function obeying ∂x 2 = −δ(x − ξ). This can be interpreted
as the contribution of a pointlike mass of mass T /g located at x = ξ, or in other
∂2y
words the solution of ∂x2 = −µ(x)g/T with µ(x) = δ(x − ξ)T /g (i.e under a point
mass). The homogeneous equation y 00 = 0 gives y = Ax + B(x − L). So we get
(
A(ξ)x 0≤x<ξ
G(x, ξ) =
B(ξ)(x − L) ξ < x ≤ L.

Continuity at x = ξ gives A(ξ)ξ = B(ξ)(ξ − L). The jump condition on the


derivative gives A(ξ) − B(ξ) = 1. We can solve these to get A(ξ) = (ξ − L)/L and
B(ξ) = ξ/L. Hence the Green’s function is

ξ−L ξ
G(x, ξ) = xΘ(ξ − x) + (x − L)Θ(x − ξ).
L L
0 ξ L
Notice that ξ is always less that L. So the first x
term has a negative slope; while ξ is always pos-
itive, so the second term has positive slope. G(x, ξ)

We can model the string with arbitrary mass per unit length µ(x) as having many
R x +∆x
pointlike particles of mass mi = x i µ(x) dx at small separations ∆x along
i
the string. So we get the solution
L/∆x Z L
X g µ(ξ)g
y(x) = G(x, xi )mi → G(x, ξ) dξ in the limit ∆x → 0
i=1
T 0 T

This is what the Green’s function are supposed to do. In general we can think
of the Green’s function as the solution to the point sources, and then we can
reconstruct the forcing term using a weighted sum of point sources, and hence the
the solution for an arbitrary forcing term.
E. 6-50
Use Green’s function to solve Ly = f (t) subject to y(t0 ) = y 0 (t0 ) = 0.

Now instead of having two boundaries, we have one boundary and restrict both
the value and the derivative at this boundary. Note that this boundary condition
is still homogeneous. Similar to before, if we can find Green’s function G(t, τ )
satisfying G(t0 , τ ) = G0 (t0 , τ ) = 0, the the solution y construct from G(t, τ ) would
obey y(t0 ) = y 0 (t0 ) = 0.
As before, let y1 (t), y2 (t) be any basis of solutions to Ly = 0. The Green’s function
obeys L(G) = δ(t − τ ). We can write our Green’s function as
(
A(τ )y1 (t) + B(τ )y2 (t) t0 ≤ t < τ
G(t, τ ) =
C(τ )y1 (t) + D(τ )y2 (t) t > τ.
6.6. FOURIER TRANSFORMS 247

Our initial conditions require that


    
y1 (t0 ) y2 (t0 ) A 0
=
y10 (t0 ) y20 (t0 ) B 0
However, we know that the matrix is non-singular (by definition of y1 , y2 being a
basis). So we must have A, B = 0 (which makes sense, since G = 0 is obviously a
solution for t0 ≤ t < τ ). Our continuity and jump conditions now require
1
0 = C(τ )t1 (t) + D(τ )y2 (t) = C(τ )y10 (τ ) + D(τ )y20 (τ ).
α(τ )
We can solve this to get C(τ ) and D(τ ). Then we get
Z ∞ Z t
y(t) = G(t, τ )f (τ ) dτ = G(t, τ )f (τ ) dτ,
t0 t0

since we know that when τ > t, the Green’s function G(t, τ ) = 0. Thus the
solution y(t) depends on the forcing term term f only for times < t. This expresses
causality!
E. 6-51
Suppose we have ÿ + y = f (t) with y(0) = ẏ(0) = 0. Then we have
G(t, τ ) = Θ(t − τ )(C(τ ) cos(t − τ ) + D(τ ) sin(t − τ ))
for some C(τ ), D(τ ). The continuity and jump conditionsRgives D(τ ) = 1, C(τ ) =
t
0. So we get G(t, τ ) = Θ(t − τ ) sin t(τ ). So we get y(t) = 0 sin(t − τ )f (τ ) dτ .
E. 6-52
<Eigenfunction expansion of Green’s function>
P
If L is the Sturm-Liouville operator, then we can expand G(x, ξ) = n∈N Ĝn (ξ)yn (x)
where {yn } is the basis of w orthonormal eigenfunctions of L. We have
X X
δ(x − ξ) = LG(x, ξ) = Ĝn (ξ)Lyn (x) = w(x) Ĝn (ξ)λn yn (x)
n∈N n∈N
X Z b Z b
∗ ∗
=⇒ Ĝn (ξ)λn ym (x)yn (x)w(x)dx = ym (x)δ(x − ξ)dx = yn∗ (ξ)
n∈N a a
∗ X yn∗ (ξ)
ym (ξ)
=⇒ Ĝm = and so G(x, ξ) = yn (x)
λm n∈N
λn

This is in agreement with [C.6-25].

6.6 Fourier transforms


D. 6-53
• The Fourier transform of an (absolutely integrable) function f : R → C is the
function f˜ : R → C (we’ll also write f˜ = F[f (x)]) defined by7
Z ∞
f˜(k) = e−ikx f (x) dx.
−∞

R∞
• The convolution of functions f, g : R → C is f ∗ g(x) = −∞
f (x − y)g(y) dy.
R∞ −2πikx
7
Some authors use a different definition f˜(k) = −∞
e f (x) dx.
248 CHAPTER 6. METHODS

E. 6-54
Note that for any k, we have
Z ∞ Z ∞ Z ∞
|f˜(k)| = e−ikx f (x) dx ≤ |e−ikx f (x)| dx =

|f (x)| dx.
−∞ −∞ −∞

Since we have assumed that our function is absolutely integrable, this is finite,
and the definition makes sense. Note also that
Z ∞ Z ∞
f ∗ g(x) = f (x − y)g(y) dy = f (y)g(x − y) dy = g ∗ f (x).
−∞ −∞

C. 6-55
<Properties of Fourier transform>
1. Linearity: If f, g : R → C are absolutely integrable and c1 , c2 are constants,
then F[c1 f (x) + c2 g(x)] = c1 F[f (x)] + c2 F[g(x)]. So F is a linear operator.
2. Translation:
Z Z
F[f (x − a)] = e−ikx f (x − a) dx = e−ik(y+a) f (y) dy
R R
Z
−ika
=e e−iky f (y) dy = e−ika F[f (x)]
R

So a translation in x-space becomes a re-phasing in k-space.


3. Re-phasing: For ` ∈ R,
Z ∞ Z ∞
F[e−i`x f (x)] = e−ikx e−i`x f (x) dx = e−i(k+`)x f (x) dx = f˜(k + `).
−∞ −∞

4. Scaling:
Z ∞ Z ∞  
−ikx −iky/c dy 1 ˜ k
F[f (cx)] = e f (cx) dx = e f (y) = f .
−∞ −∞ |c| |c| c

Note that we have to take the absolute value of c, since if we replace x by


y/c and c is negative, then we will flip the bounds of the integral. The extra
minus sign then turns c into −c = |c|. Note that this result can be written as
1
F[f (cx)](k) = |c| F[f (x)]( kc ).
5. Convolutions: We have
Z ∞ Z ∞ 
F[f ∗ g(x)] = e−ikx f (x − y)g(y) dy dx
−∞ −∞
Z
= eik(x−y) f (x − y)e−iky g(y) dy dx
R2
Z Z
= e−iku f (u) du e−iky g(y) dy = F[f ]F[g],
R R

where u = x − y. So the Fourier transform of a convolution is the product of


individual Fourier transforms.
6.6. FOURIER TRANSFORMS 249

6. The most useful property of the Fourier transform is that it “turns differenti-
ation into multiplication”. Integrating by parts, we have
Z ∞ Z ∞
df d
F[f 0 (x)] = e−ikx dx = − (e−ikx )f (x) dx
−∞ dx −∞ dx
Z ∞
= ik e−ikx f (x) dx = ikF[f (x)]
−∞

Note that we don’t have any boundary terms since for the function to be
absolutely integrable, it has to decay to zero as we go to infinity. Conversely,
Z ∞ Z ∞
d
F[xf (x)] = e−ikx xf (x) dx = i e−ikx f (x) dx = if˜0 (k).
−∞ dk −∞

T. 6-56
R∞
<Fourier inversion theorem> f (x) = 1
2π −∞
eikx f˜(k) dk.8

We will only give a non-rigours proof: Recall that in the periodic case where
f (x) = f (x + L), we have the Fourier series
Z L/2
X 1
f (x) = fˆn e2inxπ/L where fˆn = e−2inπu/L f (u) du
n∈Z
L −L/2

We can try to obtain a similar expression for a non-periodic function f : R → C


by taking the limit L → ∞. We define ∆k = 2π/L. So we have
Z L/2
X ∆k
f (x) = einx∆k e−inu∆k f (u) du
n∈Z
2π −L/2
!
X eix(n∆k) Z L/2
−iu(n∆k)
= e f (u) du ∆k
n∈Z
2π −L/2

As we take the limit L → ∞,


∞ ∞ Z ∞
eixk
Z Z 
1
f (x) = F −1 [f˜(k)] = e−iuk f (u) du dk = eikx f˜(k) dk.
−∞ 2π −∞ 2π −∞

This result says that we can express our original function f (x) in terms of its
Fourier transform f˜(k), so we can write f (x) = F −1 [f˜(k)].
Nevertheless, note that the inverse Fourier transform looks very similar to the
Fourier transform itself. We have F −1 [f (x)] = 2π1
F[f (−x)] and the duality prop-
erty
1
f˜(k) = F[f (x)] ⇐⇒ f (−x) = F[f˜(k)].

These are useful because it means we can use our knowledge of Fourier transforms
to compute inverse Fourier transform. Note that this does not occur in the case of
the Fourier series. In the Fourier series, we obtain the coefficients by evaluating
an integral, and restore the original function by taking a discrete sum. These
operations are not symmetric.
8
This in fact requires f to be well behaved satisfying certain conditions.
R ∞ −2πikx Also if we use the
R ∞ 2πikx
definition f˜(k) = −∞ e f (x) dx, then this result becomes f (x) = −∞ e f˜(k) dk.
250 CHAPTER 6. METHODS

Note that f = F −1 [F[f ](k)] = 2π 1


F[F[f ](−k)]. Using the scaling property we
have
1  1  1 ˜
f˜(x).
 
f (−x) = F F[f ](−k) (−x) = F F[f ](k) (x) =
2π 2π 2π
Recall F[f ∗ g(x)] = f˜(k)f˜(k). So f ∗ g(x) = F −1 [f˜(k)g̃(k)], so

f˜ ∗ g̃(x) = (2π)2 F −1 [f (−k)g(−k)] = 2πF[f (k)g(k)].

E. 6-57
<Fourier transform on differential equation> Suppose we have a differential
equation
p
X dr
L(∂)y = f where L(∂) = cr r
r=0
dx
is a differential operator of pth order with constant coefficients. Taking the Fourier
transform of both sides of the equation, we find F[L(∂)y] = F[f (x)] = f˜(k). The
interesting part is the left hand side, since the Fourier transform turns differenti-
ation into multiplication, we have
p
X
F[L(∂)y] = cr (ik)r ỹ(k) = L(ik)ỹ(k).
r=0

Here L(ik) is a polynomial in ik. Thus taking the Fourier transform has changed
our ordinary differential equation into the algebraic equation L(ik)ỹ(k) = f˜(k).
Since L(ik) is just multiplication by a polynomial, we can immediately get

f˜(k)
ỹ(k) = .
L(ik)

Using F −1 [f˜(k)g̃(k)] = f ∗ g(x), we can recover y as


  Z ∞  
−1 −1 1 −1 1
y(x) = F [ỹ(k)] = f ∗ F (x) = F (x − y)f (y) dy.
L(ik) −∞ L(ik)
Recall that if we have Green’s function, the solution of the DE is an integral of
forcing term f multiplied by the Green’s function. Here we have a similar result!
So Fourier transforms can be used to solve ordinary differential equations provided
the functions are sufficiently well-behaved. For example, suppose y : R → R solves
y 00 − A2 y = −f (x) with y and y 0 → 0 as |x| → ∞. We get

f˜(k)
ỹ(k) = .
k2 + A2
1 −µ|x|
Consider h(x) = 2µ
e with µ > 0. Then
Z ∞ Z ∞ 
1 −ikx −µ|x| 1 −(µ+ik)x
h̃(k) = e e dx = Re e dx
2µ −∞ µ 0
 
1 1 1
= Re = .
µ ik + µ µ2 + k 2
Therefore we get Z ∞
1
y(x) = e−A|x−u| f (u) du.
2A −∞
6.6. FOURIER TRANSFORMS 251

E. 6-58
For φ : Rn → C, suppose we have the equation ∇2 φ − m2 φ = ρ(x). We define the
n dimensional Fourier transform by
Z
F[φ(x)](k) = φ̃(k) = e−ik·x f (x) dV
Rn

where now k ∈ Rn is an n-dimensional vector. So we get F[∇2 φ − m2 φ] = F[ρ] =


ρ̃(k). The first term is
Z Z Z
F[∇2 φ] = e−ik·x ∇2 φ dV = − ∇(e−ik·x ) · ∇φ dV = ∇2 (e−ik·x )φ dV
Rn Rn Rn
Z
= −k · k e−ik·x φ(x) dV = −k · kφ̃(k).
Rn

So our equation becomes −k · kφ̃(k) − m2 φ̃(k) = ρ̃(k). So we get

ρ̃(k)
φ̃(k) = − .
|k|2 + m2

So differential equations are trivial in k space. We can retrieve our φ as


Z
1 ρ̃(k)
φ(x) = F −1 [φ̃(k)] = − eik·x dV.
(2π)n Rn |k|2 + m2

Note that we have (2π)n instead of 2π since we get a factor of 2π for each dimension
(and the negative sign was just brought down from the original expression). Using
F −1 [f˜(k)g̃(k)] = f ∗ g(x), we have
 
−1
φ(x) = F −1 [φ̃(k)] = ρ ∗ F −1 .
|k|2 + m2

T. 6-59
<Parseval’s theorem> Suppose f, g : R → C are sufficiently well-behaved that
f˜ and g̃ exist and that F −1 [f˜] = f and F −1 [g̃] = g. Then
Z
1 ˜
f ∗ (x)g(x) dx =
def
hf, gi = hf , g̃i.
R 2π

In particular kf k2 = 1

kf˜k2 .

Z ∞  Z ∞  Z ∞ Z ∞ 
1 1
hf, gi = f ∗ (x) eikx g̃(x) dk dx = f ∗ (x)eikx dx g̃(k) dk
−∞ −∞ 2π 2π −∞ −∞
Z ∞ Z ∞ ∗ Z ∞
1 −ikx 1
= f (x)e dx g̃(k) dk = f˜∗ (k)g̃(k) dk
2π −∞ −∞ 2π −∞
1 ˜
= hf , g̃i.

So Fourier transform preserves the L2 norm up to a constant factor of 1



.
252 CHAPTER 6. METHODS

E. 6-60
Suppose f (x) is defined by y
(
1 |x| < 1
f (x) =
0 |x| ≥ 1. x

This function looks rather innocent. Sure it has discontinuities at ±1, but these are
not too absurd, and we are just integrating. This is certainly absolutely integrable.
We can easily compute the Fourier transform as
Z ∞ Z 1
2 sin k
f˜(k) = e−ikx f (x) dx = e−ikx dx = .
−∞ −1 k

Is our original function equal to the integral F −1 [f˜(k)]? We have

1 ∞ ikx sin k
  Z
2 sin k
F −1 = e dk.
k π −∞ k

This is hard to do. So let’s first see if this function is absolutely integrable. We
have
Z ∞ Z ∞ Z ∞ N Z (n+3/4)π
ikx sin k sin k sin k X sin k
e dx = dk = 2 dk ≥ 2
k dk.

−∞
k −∞
k
0
k
n=0 (n+1/4)π

The idea here is instead of looking at the integral over the whole real line, we just
pick out segments of it. In these small segments, we know that | sin k| ≥ √12 . So
we can bound this by
N N Z (n+3/4)π
∞ Z (n+3/4)π √ X
Z
ikx sin k X 1 dk dk
e dx ≥ 2 √ ≥ 2
−∞
k n=0
2 (n+1/4)π k n=0 (n+1/4)π (n + 1)π
√ N
2X 1
= ,
2 n=0 n + 1

and this diverges as N → ∞. So we don’t have absolute integrability. So we have


to be careful when we are doing these things.
C. 6-61
<Fourier transformation of distributions> We have
Z ∞
F[δ(x)] = eikx δ(x) dx = 1
−∞
Z ∞
1
Hence we have F −1 [1] = eikx dk = δ(x)
2π −∞

Of course, it is extremely hard to make sense of this integral. It quite obviously


diverges as a normal integral, but we can just have faith and believe this makes
sense as long as we are talking about distributions. Similarly, from our rules of
translations, we get F[δ(x − a)] = e−ika and F[ei`x ] = 2πδ(k − `). Hence we get
 
1 i`x 1 1
F[cos(`x)] = F (e + e−i`x ) = F[ei`x ] + F[e−i`x ] = π[δ(k − `) + δ(k + `)].
2 2 2
6.6. FOURIER TRANSFORMS 253

We see that highly localized functions in x-space have very spread-out behaviour
in k-space, and vice versa.
E. 6-62
More formally, we define Fourier transformation of distributions as follows: Recall
given an ordinary function g we have the distribution Tg [φ] = g[φ] = hg, φi =
1
R

g(x)φ(x)dx. Parseval’s theorem tell us that hg, φi = 2π hF[g], F[φ]i, equiva-
−1
lently we have hF[g], χi = 2πhg, F [χ]i. In light of this, for any distribution g
(not just those derived from ordinary function), we define the distribution Fg to
be
(Fg)[χ] = 2πg[F −1 [χ]].
Using this, we have
Z ∞
(Fδ)[φ] = 2πδ[F −1 [φ]] = 2πF −1 [φ](0) = eik0 φ(k)dk = h1, φi = 1[φ],
−∞

in agreement with what we got before. Conversely we have (F1)[φ] = 2π1[F −1 [φ]] =
2πh1, F −1 [φ(x)]i = h1, F[φ(−x)]i = (Fδ)[F[φ(−x)]] = 2πδ[φ(−x)] R ∞ = 2πφ(0).
Hence (F −1 1)[φ] = δ[φ]. So in the world of distributions we have −∞ e−ikx dx =
2πδ(k).
Consider the step function Θ(x) = I[x > 0] were I is the indicator function. Define
Θε (x) = Θ(x)e−εx . Note that Θ(x) = limε→0+ Θε (x). Now
Z ∞ Z ∞
1
F[Θε ] = e−ikx Θε (x)dx = e−(ε+ik)x dx =
−∞ 0 ε + ik
The presence of ε is important to ensure convergence of the integral. However
F[Θ] is in fact not just 1/ik. To understand what F[Θ] we let it act on a test
function, for any δ > 0 we have
Z ∞ Z Z δ
φ(k) φ(k) φ(k) − φ(0) φ(0)
(FΘ)[φ] = lim dk = dk+ lim + dk
ε→0+ −∞ ε + ik |k|>δ ik ε→0+ −δ ε + ik ε + ik

Since limk→p k1 (φ(k)−φ(0)) = φ0 (0), for small enough δ we have | k1 (φ(k)−φ(0))| ≤


|φ0 (0)| + 1. Then now since |ε + ik| ≥ |k| for all ε we have
Z δ Z δ Z δ
φ(k) − φ(0) φ(k) − φ(0) φ(k) − φ(0)
lim dk ≤ lim dk ≤ dk
ε→0+ −δ ε + ik ε→0+ −δ
ε + ik
−δ
k
Z δ
≤ |φ0 (0)| + 1 dk = 2δ(|φ0 (0)| + 1) → 0 as δ → 0
−δ

Also for any δ > 0 we have


Z δ
φ(0)
lim dk = −iφ(0) lim [ln(ε + ik)]δ−δ
ε→0+ −δ ε + ik ε→0+
 p δ
k
= −iφ(0) lim ln ε2 + k2 + i tan−1 = πφ(0).
ε→0+ 2 −δ

Combining these we have


(
1
k 6= 0
Z
φ(k) ik
(FΘ)[φ] = πφ(0) + lim dk hence F[Θ](k) = .
δ→0 |k|>δ ik πδ(k) k=0
254 CHAPTER 6. METHODS

This is sometimes written as F[Θ](k) =p.v.(ik)−1 + πδ(k) where the letters p.v.
stand for the (Cauchy) principle value and mean that we should exclude the point
k = 0 from any intgral containing this term. What happens at k = 0 is instead
governed by the δ-function.
E. 6-63
<Linear systems and response functions> Suppose we have an amplifier
that modifies an input signal I(t) to produce an output O(t). Typically, amplifiers
work by modifying the amplitudes and phases of specific frequencies in the output.
By Fourier’s inversion theorem, we know
Z ∞
1 ˜
I(t) = eiωt I(ω) dω.
2π −∞

˜
This I(ω) is the resolution of I(t). We specify what the amplifier does by the
transfer function R̃(ω) such that the output is given by
Z ∞
1 ˜
O(t) = eiωt R̃(ω)I(ω) dω.
2π −∞

˜
Since this R̃(ω)I(ω) ˜
is a product, on computing O(t) = F −1 [R̃(ω)I(ω)], we obtain
a convolution
Z ∞ Z ∞
1
O(t) = R(t − u)I(u) du where R(t) = eiωt R̃(ω) dω
−∞ 2π −∞

is the response function . By plugging it directly into the equation above, we see
˜
that R(t) is the output O(t) of the system when the input has I(ω) = 1 – in other
words when the input signal is I(t) = δ(t). Note that causality implies that the
amplifier cannot “respond” before any input has been given. So we must have
R(t) = 0 for all t < 0. Assume that we only start providing input at t = 0. Then
Z ∞ Z t
O(t) = R(t − u)I(u) du = R(t − u)I(u) du.
−∞ 0

This is exactly the same form of solution as we found for initial value problems
with the response function R(t) playing the role of the Green’s function.
................................................................................
<General form of transfer function> To model the situation, suppose the
amplifier’s operation is described by the ordinary differential equation
m
X di
I(t) = Lm O(t), where Lm = ai
i=0
dti

with ai ∈ C. In other words, we have an mth order ordinary differential equation


with constant coefficients, and the input is the forcing function. Notice that we
are not saying that O(t) can be expressed as a function of derivatives of I(t).
Instead, I(t) influences O(t) by being the forcing term the differential equation
has to satisfy. This makes sense because when we send an input into the amplifier,
this input wave “pushes” the amplifier, and the amplifier is forced to react in some
way to match the input.
6.6. FOURIER TRANSFORMS 255

˜ = m j
 dO  P
Using the fact F dt = iω Õ(ω), we have I(ω) j=0 aj (iω) Õ(ω). So we get

˜
I(ω) 1
Õ(ω) = =⇒ R̃(ω) = .
a0 + iωa1 + · · · + (iω)m am a0 + iωa1 + · · · + (iω)m am

˜ The denominator is an nth order polynomial. By the fundamental


since Õ = R̃I.
theorem of algebra, it has m roots, say cj ∈ C for j = 1, · · · , J, where cj has
multiplicity kj . Then we can write our transfer function as

J J X kj
1 Y 1 X Γrj
R̃(ω) = =
am j=1 (iω − cj )kj j=1 r=1
(iω − cj )r

for some constants Γrj ∈ C, where we obtain the last equality by repeated use of
partial fractions. By linearity of the (inverse) Fourier transform, we can find O(t)
if we know the inverse Fourier transform of all functions of the form 1/(iω − α)p .
Consider the function ( p
t
eαt t > 0
hp (t) = p!
0 otherwise

Now provided Re(α) < 0 (so that e(α−iω)t → 0 as t → ∞), we have


Z ∞ Z ∞
−iωt 1
h̃0 (ω) = e h0 (t) dt = e(α−iω)t dt =
−∞ 0 iω −α
d 1
h̃1 (ω) = F[th0 (t)] = i F[h0 (t)] =
dω (iω − α)2
1
h̃p (ω) =
(iω − α)p+1

So the response function is a linear combination of these functions hp (t) (if any of
the roots cj have non-negative real part, then it turns out the system is unstable,
and is better analysed by the Laplace transform). We see that the response func-
tion does indeed vanish for all t < 0. In fact, each term (except h0 ) increases from
zero at t = 0 to rise to some maximum before eventually decaying exponentially.
E. 6-64
<Discrete Fourier transform> So far, we have done Fourier analysis over
some abelian groups. For example, we’ve done it over S 1 , which is an abelian
group under multiplication, and R, which is an abelian group under addition. We
will next look at Fourier analysis over another abelian group, Zm , known as the
discrete Fourier transform. Recall that the Fourier transform is defined as
Z
f˜(k) = e−ikx f (x) dx.
R

To find the Fourier transform, we have to know how to perform the integral. If we
cannot do the integral, then we cannot do the Fourier transform. However, it is
usually difficult to perform integrals for even slightly more complicated functions.
A more serious problem is that we usually don’t even have a closed form of the
function f for us to perform the integral. In real life, f might be some radio signal,
and we are just given some data of what value f takes at different points. There
is no hope that we can do the integral exactly.
256 CHAPTER 6. METHODS

Hence, we first give ourselves a simplifying assumption. We suppose that f is


mostly concentrated in some finite region [−R, S]. More precisely, |f (x)|  1 for
x 6∈ [−R, S] for some R, S > 0. Then we can approximate our integral as
Z S
f˜(k) = e−ikx f (x) dx.
−R

Afterwards, we perform the integral numerically. Suppose we “sample” f (x) at


x = xj = −R + j R+S
N
for N a large integer and j = 0, 1, · · · , N − 1. Then
N −1
R+S X
f˜(k) ≈ f (xj )e−ikxj .
N j=0

This is just the Riemann sum. Similarly, our computer can only store the result
f˜(k) for some finite list of k. Let’s choose these to be at k = km = 2πm/(R + S).
Then after some cancellation,
N −1 −1
N
!
˜ R + S ikm R X − 2πi jm 2πimR 1 X −jm
f (km ) ≈ e f (xj )e N = (R + S)e R+S f (xj )ω ,
N j=0
N j=0
2πi PN −1 −jm
where ω = e N is an N th root of unity. Let F (m) = N1 j=0 f (xj )ω . Of
course, we’ve thrown away lots of information about our original function f (x),
since we made approximations all over the place. For example, we have already lost
all knowledge of structures varying more rapidly than our sampling interval R+S N
.
Also, F (m + N ) = F (m), since ω N = 1. So we have “forced” some periodicity
into our function F , while f˜(k) itself was not necessarily periodic.
For the usual Fourier transform, we were able to re-construct our original function
from the f˜(k), but here we clearly cannot. However, if we know the F (m) for
all m = 0, 1, · · · , N − 1, then we can reconstruct the exact values of f (xj ) for
all j by just solving linear equations. To make this more precise, we want to
put what we’re doing into the linear algebra framework we’ve been using. Let
G = {1, ω, ω 2 , · · · , ω N −1 }. For our purposes below, we can just treat this as a
discrete set, and ω i are just meaningless symbols that happen to visually look like
the powers of our N th roots of unity.
Consider a function g : G → C defined by g(ω j ) = f (xj ). This is actually nothing
but just a new notation for f (xj ). Then using this new notation, we have
N −1
1 X −jm
F (m) = ω g(ω j ).
N j=0

The space of all functions g : G → C is a finite-dimensional vector space, isomor-


phic to CN . This has an inner product
N −1
1 X ∗ j
hf, gi = f (ω )g(ω j ).
N j=0

Now let en : G → C be the function em (ω j ) = ω jm . Then the set of functions {em }


for m = 0, · · · , N − 1 is an orthonormal basis with respect to our inner product.
To show this, we can compute
N −1 N −1 N −1
1 X ∗ j 1 X −jm jm 1 X
hem , em i = em (ω )em (ω j ) = ω ω = 1 = 1.
N j=0 N j=0 N j=0
6.6. FOURIER TRANSFORMS 257

For n 6= m, we have
N −1 N −1
1 X ∗ j 1 X j(m−n) 1 1 − ω (m−n)N
hen , em i = en (ω )em (ω j ) = ω = = 0.
N j=0 N j=0 N 1 − ω m−n

Note that m − n is an integer and ω is an N th root of unity, we know that


ω (m−n)N = 1, so the numerator is 0. However, since n 6= m, m − n 6= 0, the
denominator is non-zero. We can now rewrite our F (m) as
N −1 N −1
1 X −jm 1 X ∗ j
F (m) = ω f (xj ) = em (ω )g(ω j ) = hem , gi.
N j=0 N j=0
P −1 PN −1
Hence we can expand our g as g = N m=0 hem , giem = m=0 F (m)em . Writing f
instead of g, we recover the formula
N
X −1
f (xj ) = g(ω j ) = F (m)em (ω j ).
m=0

If we forget about our f s and just look at the g, what we have effectively done is
take the Fourier transform of functions taking
PN −1values on G = {1, ω, · · · , ω N −1 } ∼
=
−jm
1
ZN . This can be seen from F (m) = N j=0 ω g(ω j ). This is exactly anal-
ogous to what we did for the Fourier transform, except that everything is now
discrete, and we don’t have to worry about convergence since these are finite
sums.
E. 6-65
<Fast Fourier transform> What we’ve said so far is we’ve defined
N −1
1 X −mj
F (m) = ω g(ω j ).
N j=0

To compute this directly, even given the values of ω −mj for all j, m, this takes
N − 1 additions and N + 1 multiplications. This is 2N operations for a single
value of m. To know F (m) for all m, this takes 2N 2 operations.
This is a problem. Historically, during the cold war, people were in fear that the
world will one day go into a huge nuclear war. Countries decided to come up with
a treaty to ensure people don’t perform nuclear testings anymore. However, it is
difficult to monitor whether other countries are doing nuclear tests. Underground
nuclear tests are hard to detect.
They then came up with the idea to put some seismometers all around the world,
and record vibrations in the ground. To distinguish normal crust movements
from nuclear tests, they wanted to take the Fourier transform and analyze the
frequencies of the vibrations. However, they had a large amount of data, and the
value of N is on the order of magnitude of a few million. So 2N 2 will be a huge
number that the computers at that time were not able to handle. So they develop a
trick, known as the fast Fourier transform, to perform Fourier transforms quickly.
This is nothing new mathematically, but entirely a computational trick.
Now suppose N = 2M . We can write
2M −1 M −1
!
1 X −jm 1 1 X −2km −(2k+1)m
F (m) = ω g(ω j ) = ω 2k
g(ω ) + ω g(ω 2k+1
) .
2M j=0 2 M
k=0
258 CHAPTER 6. METHODS

We now let η be an M th root of unity, and define G(η k ) = g(ω 2k ) and H(η k ) =
g(ω 2k+1 ). We then have

M −1 M −1
!
1 1 X −km ω −m X −km
F (m) = η G(η k ) + η H(η k )
2 M M
k=0 k=0
1
= [G̃(m) + ω −m H̃(m)].
2

Suppose we are given the values of G̃(m) and H̃(m) and ω −m for all m =
{0, · · · , N − 1}. Then we can compute F (m) using 3 operations per value of
m. So this takes 3 × N = 6M operations for all M .
We can compute ω −m for all m using at most 2N operations, and suppose it takes
PM operations to compute the transform G̃(m) (or equivalently H̃(m)) for all m.
Then the number of operations needed to compute F (m) is P2M = 2PM +6M +2M .
Now let N = 2n . Then by iterating this, we can find PN ≤ 4N log2 N  N 2 . So
by breaking the Fourier transform apart, we are able to greatly reduce the number
of operations needed to compute the Fourier transform.

6.7 PDEs and Method of Characteristics


D. 6-66
Given a partial differential equation (pde), we have boundary conditions which
restricts the value of the solution or its derivatives on a surface of 1 dimension less
than the domain. The choice of exactly how the solution should look like on these
surfaces are known as the Cauchy data for the pde and solving the pde subject
to these conditions is said to be a Cauchy problem . A Cauchy problem is said to
be well-posed if
1. A solution exists;
2. The solution is unique;
3. The solution depends continuously on the auxiliary data.
E. 6-67
Recall that
• For Laplace’s equation ∇2 φ = 0 on a bounded domain Ω ⊆ Rn , we imposed the
Dirichlet boundary condition φ|∂Ω = f or the Neumann boundary condition
n · ∇φ|∂Ω = g.
∂φ
• For the heat equation ∂t
= κ∇2 φ on Ω × [0, ∞), we asked for φ|∂Ω×[0,∞) = f
and also φ|Ω×{0} = g.
2
• For the wave equation ∂∂t2φ = c2 ∇2 φ on Ω × [0, ∞), we imposed φ|∂Ω×[0,∞) = f ,
φ|Ω×{0} = g and also ∂t φ|Ω×{0} = h.
All the boundary and initial conditions restrict the value of φ on some co-dimension
1 surface, ie. a surface whose dimension is one less than the domain. And f, g and
h, i.e. what the solution should look like on this surface, is the Cauchy data for
our pde.
6.7. PDES AND METHOD OF CHARACTERISTICS 259

E. 6-68
Intuitively, “the solution depends continuously on the auxiliary data” means “small
change” in the Cauchy data leads to a “small change” in the solution. To under-
stand this, we need to make it clear what we mean by “small change”. To do this
properly, we need to impose some topology on our space of functions, which is
some technicalities we will not go into. Instead, we can look at a simple example.
Suppose we have the heat equation ∂t φ = κ∇2 φ. We know that whatever starting
condition we have, the solution quickly smooths out with time. Any spikiness
of the initial conditions get exponentially suppressed. Hence this is a well-posed
problem — changing the initial condition slightly will result in a similar solution.
However, if we take the heat equation but run it backwards in time, we get a non-
well-posed problem. If we provide a tiny, small change in the “ending condition”,
as we go back in time, this perturbation grows exponentially, and the result could
vary wildly.
Another example is as follows: consider the Laplace’s equation ∇2 φ on the upper
half plane (x, y) ∈ R × R≥0 subject to the boundary conditions φ(x, 0) = 0 and
∂y φ(x, 0) = g(x). If we take g(x) = 0, then φ(x, y) = 0 is the unique solution,
obviously. However, if we take g(x) = sin(Ax)/A, then we get the unique solution

sin(Ax) sinh(Ay)
φ(x, y) = .
A2
So far so good. However, now consider the limit as A → ∞. Then g(x) =
π

sin(Ax)/A → 0 for all x ∈ R. However, at the special point φ 2A , y , we get
 π  sinh(Ay)
φ ,y = → A−2 eAy ,
2A A2
which is unbounded. So as we take the limit as our boundary conditions g(x) → 0,
we get an unbounded solution.
The condition the solution depends continuously on the auxiliary data is important
in physics since we can neither set up our apparatus nor measure our results with
infinite precision, equations that can usefully model the physics had better obey
this condition or else our approximation would give us very wrong answer.

6.7.1 Characteristics for first order PDEs


D. 6-69
• The tangent vector to a smooth curve C given by x : R → R2 with x(s) =
(x(s), y(s)) is ( dx , dy ).
ds ds

• Let V(x, y) : R2 → R2 be a vector field. The integral curves associated to V are


curves whose tangent is just V(x, y).
E. 6-70
Suppose we have a curve on the plane x : R → R2 given by s 7→ (x(s), y(s)).
If we have some quantity φ : R2 → R, then its value along the curve C is just
φ(x(s), y(s)).
A vector field V(x, y) : R2 → R2 defines a family of curves called integral curves.
To imagine this, suppose we are living in a river. The vector field tells us the how
260 CHAPTER 6. METHODS

the water flows at each point. We can then obtain a curve by starting a point
and flow along with the water. Notice that the integral curves are determined by
a system of 1st order ODE, ( dx , dy ) = V(x(s), y(s)), and hence always exist, at
ds ds
least locally.
Note that we have a 1st order ODE, so our solution has
a free variable, so the solution is not unique. For suffi-
ciently regular vector fields, we can fill the whole space
with different integral curves. We can parametrize which B(t)
curve we are on by the parameter t. More precisely, C4
we pick a curve B = (x(t), y(t)) that is transverse (ie.
nowhere parallel) to our family of curves at any point, C3
C2
and we can label the members of our family by the value C1
of t at which they intersect B. We can thus label our
family of curves by (x(s, t), y(s, t)), so that for each t we
have the integral curve Ct . If the Jacobian
∂(x, y) ∂x ∂y ∂x ∂y
J= = − 6= 0,
∂(s, t) ∂s ∂t ∂t ∂s
then we can invert this to find (s, t) as a function of (x, y), ie. at any point, we
know which curve we are on, and how far along the curve we are. This means
the set of integral curves fills the whole space and we now have a new coordinate
system (s, t) for our points in R2 . It turns out by picking the right vector field
V, we can make differential equations much easier to solve in this new coordinate
system.
C. 6-71
<The method of characteristics> Suppose φ : R2 → R obeys
∂φ ∂φ
a(x, y) + b(x, y) = f (x, y).
∂x ∂y

with boundary condition φ|B = h(t) for some function h(t) along a curve B ⊆ R2
parametrise by t. Our differential equation is equivalent to V · ∇φ = f where
V(x, y) = (a, b). Along any particular integral curve (x(s), y(s)) of V, we have

∂φ(x(s), y(s)) dx(s) ∂φ dy(s) ∂φ


= + = V · ∇φ,
∂s ds ∂x ds ∂y
We try to find a set of integral curves (x(s, t), y(s, t)) of V indexed by t, that sat-
isfies (x(0, t), y(0, t)) = B(t). These integral curves (x(s, t), y(s, t)) are determined
by
∂x(s, t) ∂y(s, t)
= a(x(s, t), y(s, t)), = b(x(s, t), y(s, t)).
∂s ∂s
known as the characteristic equation. The integral curves is called the characteris-
tic curves of the partial DE. Writing φ as a function φ(x(s, t), y(s, t)) of (s, t), our
partial differential equation just becomes the equation
∂φ(x(s, t), y(s, t))
= f (x(s, t), y(s, t)) subject to φ(x(0, t), y(0, t)) = h(t).
∂s
Note that this equation tell us how the solution φ vary along the tth characteristic
curves of the DE. This might be easier to solve than the original version. If we
6.7. PDES AND METHOD OF CHARACTERISTICS 261

manage to find the solution φ(x(s, t), y(s, t)) in terms of s, t, and we could invert
the variables so that (s, t) is a function of (x, y), then we have found the solution
φ(x, y). In particular if we have f = 0, then since φ does not vary with s, ie.
∂φ
∂s
= 0, the solution is simply φ(x(s, t), y(s, t)) = h(t). Now if we can invert the
variables so that t = t(x, y) is a function of (x, y), then we have φ(x, y) = h(t(x, y)).
Note the following features of the above construction:
• If any characteristic curve intersects the initial curve B more than once then
the problem is over-determined. In this case we might have a contradiction
and so no solution. For example, in the case of a homogeneous equation, i.e.
when f = 0, the solution will be constant along the same characteristic curve,
so in order to have a solution our Cauchy data φ|B = h(t) must be such that
h(t1 ) = h(t2 ) for any points t1 , t2 ∈ B that intersect the same characteristic
curve.
• If B does not intersect all characteristics curves, we will not get a unique solu-
tion, as the solution is not fixed along those characteristics.
• If the initial curve is transverse to all characteristics and intersects them once
only, then the problem is well-posed for any h(t) and has a unique solution
φ(x, y) (at least in a neighbourhood of B).
Note that the initial data cannot be propagated from one characteristic to an-
other. In particular, if h(t) is discontinuous, then these discontinuities will
propagate along the corresponding characteristic curve.
E. 6-72
Suppose φ obeys
∂φ(x, y)
= 0. with φ(0, y) = f (y).
∂x
The solution is obviously φ(x, y) = f (y). However, we can try to use the method
of characteristics to practice our skills. Our vector field and integral curves is given
by    dx 
1 ds
dx dy
V= = dy =⇒ = 1, = 0.
0 ds ds ds
So we have x(s) = s + c and y(s) = d. We have boundary condition along the
cure (xB (t), yB (t)) = (0, t). We want our integral curves to intersect this at s = 0,
i.e. (xB (t), yB (t)) = (x(0), y(0)), so our integral curves are (x(s, t), y(s, t)) = (s, t).
Now we have to solve
∂φ(x(s, t), y(s, t))
=0 with φ(x(0, t), y(0, t)) = f (t).
∂s
So we know that φ(x, y) = f (y).
E. 6-73
Consider the equation
∂φ ∂φ
ex + =0 with φ(x, 0) = cosh x.
∂x ∂y
The vector field is V = (ex , 1). So integral cures obey dx ds
= ex and dy
ds
= 1. Thus
−x(s)
our integral curves (x(s, t), y(s, t)) obeys e = −s + c(t) and y(s) = s + d(t).
So
(x(s, t), y(s, t)) = (− ln(−s + c(t)), s + d(t)).
262 CHAPTER 6. METHODS

We want cosh t = (x(0, t), y(0, t)) = (− ln(c(t)), d(t)). So (x, y) = (− ln(−s +
e−t ), s). We thus have φ(x, y) = cosh(t) = cosh(− ln(y + e−x )).
E. 6-74
Suppose φ : R2 → R solve the inhomogeneous partial differential equation ∂x φ +
2∂y φ = yex with φ(x, x) = sin x. We can still use the method of characteristics.
We have V = (1, 2). So the characteristic curves (x(s, t), y(s, t)) are obtained as

∂x ∂y
= 1, =2 with φ(x(0, t), y(0, t)) = (t, t)
∂s ∂s
So (x, y) = (s + t, 2s + t). We can invert this to obtain (s, t) = (y − x, 2x − y). The
partial differential equation now becomes

∂φ(x(s, t), y(s, t))


= V·∇φ = yex = (2s+t)es+t with φ(x(0, t), y(0, t)) = sin t.
∂s

=⇒ φ(x(s, t), y(s, t)) = 2(s − 1)es+t + tes+t + h(t) with sin t = et (t − 2) + h(t)
= (2 − t)et (1 − es ) + sin t + 2ses+t
=⇒ φ(x, y) = (2 − 2x + y)e2x−y + sin(2x − y) + (y − 2)ex

6.7.2 Characteristics for second order PDEs


D. 6-75
Let L be the general 2nd order differential operator on Rn = {(x1 , x2 , · · · , xn ) :
xi ∈ R}. We can write it as
n n
X ∂2 X i ∂
L= aij (x) + b (x) i + c(x),
i,j=1
∂xi ∂xj i=1
∂x

where aij (x), bi (x), c(x) ∈ R and aij = aji (wlog). We define the symbol of L as
n
X n
X
σ(k, x) = aij (x)ki kj + bi (x)ki + c(x),
i,j=1 i=1
| {z }
=σ p (k,x)

where we just replace the derivatives by the variable k. The principal part of
the symbol is the quadratic form σ p (k, x) = kT A(x)k, where A(x) is the real
symmetric matrix with entries aij (x). We say the differential operator L at a
point x is
• elliptic if all eigenvalues of A(x) have the same sign;9
• hyperbolic if all but one eigenvalues of A(x) have the same sign;
• ultra-hyperbolic if there are more than one eigenvalues of A(x) of each sign;
• parabolic if A(x) is degenerate, ie. has a zero eigenvalue.
9
Recalling that the eigenvalues of a real, symmetric matrix are always real.
6.7. PDES AND METHOD OF CHARACTERISTICS 263

The surface C ⊆ Rn defined by f (x1 , x2 , · · · , xn ) = 0, where f : Rn → R don’t


have 0 gradient everywhere, is a characteristic surface of the operator L at a
point x ∈ Rn if
n
X ∂f ∂f
aij (x) i j
= (∇f )T A(∇f ) = 0.
i,j=1
∂x ∂x
We say C is a characteristic surface for L if it is characteristic surface everywhere.
E. 6-76
If L = ∇2 , then σ(k, x) = σ p (k, x) = n 2
P
i=1 (ki ) . If L is the heat operator (where
n
we think of the last coordinate x as the time, and others as space), then the
operator is given by
n−1
X ∂2

L= n
− .
∂x i=1
∂xi2

symbol is then σ(k, x) = kn − n−1 2 p


P
ThePn−1 i=1 (ki ) and the principal part is σ (k, x) =
2
− i=1 (ki ) .
Note that the symbol is closely related to the Fourier transform of the differential
operator, since both turn differentiation into multiplication. Indeed, they are
equal if the coefficients are constant. However, we define this symbol for arbitrary
differential operators with non-constant coefficients.
E. 6-77
On R2 the general second order L is
∂2 ∂2 ∂2 ∂ ∂
L = a(x, y) 2
+ 2b(x, y) + c(x, y) 2 + d(x, y) + e(x, y) + f (x, y).
∂x ∂x∂y ∂y ∂x ∂y
The principal part of the symbol is
  
 a(x, y) b(x, y) kx
σ p (k, x) = kx ky
b(x, y) c(x, y) ky

Then L is elliptic if b2 − ac < 0; hyperbolic if b2 − ac > 0; and parabolic if


b2 − ac = 0. Note that since we only have two dimensions, we cannot possibly
have an ultra-hyperbolic equation.
E. 6-78
<Characteristic surfaces> We see that the characteristic equation restricts
what values ∇f can take. Recall that ∇f is the normal to the surface C. So in
general, at any point, we can find what the normal of the surface should be, and
stitch these together to form the full characteristic surfaces.
For an elliptic operator, all the eigenvalues of A have the same sign. So there
are no non-trivial real solutions to this equation (∇f )T A(∇f ) = 0. Consequently,
elliptic operators have no real characteristics. So the method of characteristics
would not be of any use when studying, say, Laplace’s equation, at least if we
want to stay in the realm of real numbers.
If L is parabolic, we for simplicity consider the case where A has exactly one zero
eigenvector, say n, and the other eigenvalues have the same sign. This is the
case when, say, there are just two dimensions. So we have An = nT A = 0. We
normalize n such that n · n = 1. For any ∇f , we can always decompose it as
∇f = n(n · ∇f ) + (∇f − n(n · ∇f )).
264 CHAPTER 6. METHODS

This is a relation that is trivially true, since we just add and subtract the same
thing. Note, however, that the first term points along n, while the latter term is
orthogonal to n. Write ∇⊥ f = ∇f − n(n · ∇f ). So we have ∇f = n(n · ∇f ) + ∇f⊥ .
Then we can compute
T
(∇f )T A(∇f ) = n(n · ∇f ) + ∇⊥ f A n(n · ∇f ) + ∇⊥ f = (∇⊥ f )T A(∇⊥ f ).


Then by assumption, (∇⊥ f )T A(∇⊥ f ) is definite. So just as in the elliptic case,


there are no non-trivial solutions. Hence, if f defines a characteristic surface, then
∇⊥ f = 0. So at any point, the normal to a characteristic surface must be n. Thus
there is a unique characteristic surface through any point x ∈ Rn .
If L is hyperbolic, we assume all but one eigenvalues are positive, and let −λ be
the unique negative eigenvalue. We let n be the corresponding unit eigenvector,
where we normalize it such that n · n = 1. Then f is characteristic if
T
0 = (∇f )T A(∇f ) = n(n · ∇f ) + ∇⊥ f A n(n · ∇f ) + ∇⊥ f


= −λ(n · ∇f )2 + (∇⊥ f )T A(∇⊥ f ).


Consequently, for this to be a characteristic, we need
r
(∇⊥ f )T A(∇⊥ f )
n · ∇f = ± .
λ
So there are two choices for n · ∇f , given any ∇⊥ f .
................................................................................
In the case where we have two dimensions, we can find the characteristic curves
explicitly. Suppose our curve is given by f (x, y) = 0. We can write y = y(x).
Then since f is constant along each characteristic, by the chain rule, we know
∂f ∂f dy dy fx
0= + =⇒ =− .
∂x ∂y dx dx fy
Substitute this into (∇f )T A(∇f ) = 0 where A = ( ab cb ), we have

dy −b ± b2 − ac
=− .
dx a
We now see explicitly how the type of the differential equation influences the
number of characteristics — if b2 − ac > 0 (hyperbolic operators), then we obtain
two distinct differential equations and obtain two solutions; if b2 −ac = 0 (parabolic
operators), then we only have one equation; if b2 − ac < 0 (elliptic operators), then
there are no real characteristics.
E. 6-79
Consider ∂y2 φ − xy∂x2 φ = 0 on R2 . Then a = −xy, b = 0, c = 1. So b2 − ac = xy.
So the type is elliptic if xy < 0, hyperbolic if xy > 0, and parabolic if xy = 0. In
the regions where it is hyperbolic, the two characteristics are given by

dy −b ± b2 − ac 1
= = ±√ .
dx a xy
This has a solution 31 y 3/2 ± x1/2 = c. We now let u = 13 y 3/2 + x1/2 and v =
1 3/2
3
y − x1/2 . Then the equation in the hyperbolic region becomes
∂2φ
+ lower order terms = 0.
∂u∂v
6.8. GREEN’S FUNCTIONS FOR PDES 265

E. 6-80
Consider the wave equation
∂2φ ∂2φ
2
− c2 2 = 0
∂t ∂x
on R1,1 . Then the equation is hyperbolic everywhere, and the characteristic curves
are x ± ct = const. Let’s look for a solution to the wave equation that obeys
φ(x, 0) = f (x) and ∂t φ(x, 0) = g(x). Now put u = x − ct and v = x + ct. Then
∂2φ
the wave equation becomes ∂u∂v = 0. So the general solution to this is

φ(x, t) = G(u) + H(v) = G(x − ct) + H(x + ct).

The initial conditions means f (x) = G(x) + H(x) and g(x) = −cG0 (x) + cH 0 (x).
Solving these, we find
Z x+ct
1  1
φ(x, t) = f (x − ct) + f (x + ct) + g(y) dy.
2 2c x−ct

This is d’Alembert’s solution to the 1 + 1 dimensional wave equation. Note that


the value of φ at any point (x, t) is completely determined by f, g in the interval
[x − ct, x + ct]. This is known as the (past) domain of dependence of the solution
at (x, t), written D− (x, t). Similarly, at any time t, the initial data at (x0 , 0)
only affects the solution within the region x0 − ct ≤ x ≤ x0 + ct. This is the
range of influence of data at x0 , written D+ (x0 ). We see that disturbances in
the wave equation propagate with speed c.

6.8 Green’s Functions for PDEs


6.8.1 Heat equation
Heat equation with inhomogeneous boundary condition
Suppose φ : Rn × [0, ∞) → R solves the heat equation ∂t φ = D∇2 φ subject to
φ|Rn ×{0} = f , where D is the diffusion constant. To solve this, we take the Fourier
transform of the heat equation in the spatial variables. For simplicity of notation, we
take n = 1. Writing F[φ(x, t)] = φ̃(k, t), we get

∂t φ̃(k, t) = −Dk2 φ̃(k, t) with the boundary conditions φ̃(k, 0) = f˜(k).


2
This has solution φ̃(k, t) = f˜(k)e−Dk t . We now take the inverse Fourier transform
and get
Z
2 1  2
 2
φ(x, t) = F −1 [f˜(k)e−Dk t ] = eikx f˜(k)e−Dk t dk = f ∗ F −1 [e−Dk t ]
2π R
So we are done if we can find the inverse Fourier transform of the Gaussian. Trans-
2 2 2 2 2 2
forming the DE satisfied by e−a x and then solving it we find F[e−a x ] = Ae−k /4a
2 2 √
for some constant A. Evaluating F[e−a x ](0) we find that A = π/a. Now setting
2
a = 1/(4Dt), we get

x2
 
2 1
F −1 [e−Dk t ] = √ exp −
4πDt 4Dt
266 CHAPTER 6. METHODS

We shall call this S1 (x, t), known as the fundamental solution of the heat equation,
where the subscript 1 tells us we are in 1 + 1 dimensions. We then get
Z ∞
φ(x, t) = f (y)S1 (x − y, t) dy
−∞

Suppose our initial data is f (x) = φ0 δ(x). So we start with a really cold room, with a
huge spike in temperature at the middle of the room. Then we get
x2
 
φ0
φ(x, t) = √ exp − .
4πDt 4Dt
What this shows, as we’ve seen before, is that if we start with a delta function, then
as time evolves, we get a Gaussian that gets shorter and broader.
Now note that if we start with a delta function, at t = 0, everywhere outside the origin
is zero. However, after any infinitely small time t, φ becomes non-zero everywhere,
instantly. Unlike the wave equation, information travels instantly to all of space, ie.
heat propagates arbitrarily fast according to this equation (of course in reality, it
doesn’t). Mathematically, this is a consequence of the fact that the heat equation is
parabolic, and so has only one family of characteristic surfaces (in this case, they are
the surfaces t =const). Physically, we see that the heat equation is not compatible
with Special Relativity; this is because it is really just a macroscopic approximation
to the underlying statistical mechanics of microscopic particle motion.

Forced heat equation with homogeneous boundary condition


Now suppose instead φ satisfies the inhomogeneous, forced, heat equation ∂t φ −
D∇2 φ = F (x, t) but with homogeneous initial conditions φ|t=0 = 0. Physically, this
represents having an external heat source somewhere, and we have zero temperature
initially.
Note that if we can solve this, then we have completely solved the heat equation. If
we have an inhomogeneous equation and an inhomogeneous initial condition, then we
can solve this forced problem with homogeneous boundary conditions to get φF ; and
solve the unforced equation with homogeneous equation to get φH . Then the sum
φ = φF + φH solves the full equation.
As before, we take the Fourier transform of the forced equation with respect to the
spacial variables. As before, we will just do it in the cases with one spacial dimension.
We find
∂t φ̃(k, t) + Dk2 φ̃(k, t) = F̃ (k, t), with φ̃(k, 0) = 0.
As before, we have reduced this to a first-order ordinary differential equation in t.
Using an integrating factor, we can rewrite this as
Z t
∂  Dk2 t  2 2 2
e φ̃(k, t) = eDk t F̃ (k, t) =⇒ φ̃(k, t) = e−Dk t eDk u F̃ (k, u) du.
∂t 0

We define the Green’s function G(x, t; y, τ )) to be the solution to


∂t − D∇2x G(x, t; y, τ ) = δ(x − y)δ(t − τ ).


So the Fourier transform with respect to x gives


Z t
2 2
G̃(k, t, y, τ ) = e−Dk t eDk u eiky δ(t − τ ) du,
0
6.8. GREEN’S FUNCTIONS FOR PDES 267

where eiky is just the Fourier transform of δ(x − y). This is equal to
(
0 t<τ 2
G̃(k, t; y, τ ) = −iky −Dk2 (t−τ )
= Θ(t − τ )e−iky e−Dk (t−τ ) .
e e t>τ

Reverting the Fourier transform, we get


Z
Θ(t − τ ) 2
G(x, t; y, τ ) = eik(x−y) e−Dk (t−τ )
dk
2π R

This integral is just the inverse Fourier transform of the Gaussian with a phase shift.
So we end up with
(x − y)2
 
Θ(t − τ )
G(x, t; y, τ ) = p exp − = Θ(t − τ )S1 (x − y; t − τ ).
4πD(t − τ ) 4D(t − τ )

The solution we seek is then


Z t Z
φ(x, t) = F (y, τ )G(x, t; y, τ ) dy dτ.
0 R

It is interesting that the solution to the forced equation involves the same function
S1 (x, t) as the homogeneous equation with inhomogeneous boundary conditions.

Duhamel’s principle
In general, Sn (x, t) solves
∂Sn
− D∇2 Sn = 0 with Sn (x, 0) = δ (n) (x − y)
∂t
and we can find
|x − y|2
 
1
Sn (x, t) = exp − .
(4πDt)n/2 4Dt
Then in general, given an initial condition φ|t=0 = f (x), the solution is
Z
φ(x, t) = f (y)S(x − y, t) dn y.

Similarly, Gn (x, t; y, t) solves


∂Gn
− D∇2 Gn = δ(t − τ )δ (n) (x − y) with Gn (x, 0; y, τ ) = 0
∂t
The solution is G(x, t; y, τ ) = Θ(t − τ )Sn (x − y, t − τ ). Given our Green’s function,
the general solution to the forced heat equation
∂φ
− D∇2 φ = F (x, t) with φ(x, 0) = 0
∂t
is just
Z ∞ Z Z t Z
φ(x, t) = F (y, τ )G(x, t; y, τ ) dn y dτ = F (y, τ )G(x, t; y, τ ) dn y dτ.
0 Rn 0 Rn

Duhamel noticed that we can write this as


Z t Z
φ(x, t) = φF (x, t; τ ) dτ where φF = F (y, t)Sn (x − y, t − τ ) dn y
0 R
268 CHAPTER 6. METHODS

ie. φF solves the homogeneous heat equation with φF |t=τ = F (x, τ ).


Hence in general, we can think of the forcing term as providing a whole sequence of
“initial conditions” for all t > 0. We then integrate over the times at which these condi-
tions were imposed to find the full solution to the forced problem. This interpretation
is called Duhamel’s principle .

6.8.2 Wave equation


Suppose φ : Rn × [0, ∞) → C solves the inhomogeneous wave equation
∂2φ ∂
− c2 ∇2 φ = F (x, t) with φ(x, 0) = φ(x, 0) = 0.
∂t2 ∂t
We look for a Green’s function Gn (x, t; y, τ ) that solves
∂Gn
− c2 ∇2 Gn = δ(t − τ )δ (n) (x − y) (∗)
∂t

with the same initial conditions Gn (x, 0, y, τ ) = ∂t Gn (x, 0, y, τ ) = 0. Just as before,
we take the Fourier transform of this equation with respect the spacial variables x.
We get

G̃n + c2 |k|2 G̃n = δ(t − τ )e−ik·y where G̃n = G̃n (k, t, y, τ )
∂t
This is just an ordinary differential equation from the point of view of t, and is of the
same type of initial value problem that we studied earlier, and the solution is
sin |k|c(t − τ )
G̃n (k, t, y, τ ) = Θ(t − τ )e−ik·y .
|k|c
To recover the Green’s function itself, we have to compute the inverse Fourier trans-
form, and find
Z
1 sin |k|c(t − τ ) n
Gn (x, t; y, τ ) = eik·x Θ(t − τ )e−ik·y d k.
(2π)n Rn |k|c
Unlike the case of the heat equation, the form of the answer we get here does depend
on the number of spatial dimensions n. For definiteness, we look at the case where
n = 3. Then our Green’s function is
Z
Θ(t − τ ) sin |k|c(t − τ ) 3
G(x, t; y, τ ) = eik·(x−y) d k.
(2π)3 c R3 |k|
We use spherical polar coordinates with the z-axis in k-space aligned along the direction
of x − y. Hence k · (x − y) = kr cos θ, where r = |x − y| and k = |k|. Note that nothing
in our integral depends on ϕ, so we can pull out a factor of 2π, and get
Θ(t − τ ) ∞ π ikr cos θ sin kc(t − τ ) 2
Z Z
G(x, t; y, τ ) = e k sin θ dθ dk
(2π)2 c 0 0 k
The next integral to do is the θ integral, which is straightforward since it is an exact
differential. So we get
Θ(t − τ ) ∞ eikr − e−ikr sin kc(t − τ ) 2
Z  
= k dk
(2π)2 c 0 ikr k
Z ∞ Z ∞ 
Θ(t − τ ) ikr −ikr
= e sin kα dk − e sin kα dk
(2π)2 icr 0 0
 Z ∞ 
Θ(t − τ ) 1 Θ(t − τ ) −1
= eikr sin kα dk = F [sin kα],
2πicr 2π −∞ 2πicr
6.8. GREEN’S FUNCTIONS FOR PDES 269

where we let α = c(t − τ ). Now recall F[δ(x − α)] = eikα . So

− e−ikα
 ikα 
e 1
F −1 [sin kα] = F −1

= δ(x − α) − δ(x + α) .
2i 2i
Hence our Green’s function is
Θ(t − τ )  
G(x, t; y, τ ) = − δ(|x − y| − c(t − τ )) − δ(|x − y| + c(t − τ )) .
4πc|x − y|
Now we look at our delta functions. The step function is non-zero only if t > τ . Hence
|x − y| + c(t − τ ) is always positive. So δ(|x − y| + c(t − τ )) does not contribute. On
the other hand, δ(|x − y| − c(t − τ )) is non-zero only if t > τ . So Θ(t − τ ) is always
positive in this region. So we can write our Green’s function as
1 1
G(x, t; y, τ ) = − δ(|x − y| − c(t − τ )).
4πc |x − y|
As always, given our Green’s function, the general solution to the forced equation
∂2φ
∂t2
− c2 ∇2 φ = F (x, t) is
Z ∞Z
F (y, τ )
φ(x, t) = − δ(|x − y| − c(t − τ )) d3 y dt.
0 R3 4πc|x − y|
We can use the delta function to do one of the integrals. It is up to us which integral
we do, but we pick the time integral to do. Then we get
Z
1 F (y, tret ) 3 |x − y|
φ(x, t) = − d y, where tret = t − .
4πc2 R3 |x − y| c

This shows that the effect of the forcing term at some point y ∈ R3 affects the solution
φ at some other point x not instantaneously, but only after time |x − y|/c has elapsed.
This is just as we saw for characteristics. This, again, tells us that information travels
at speed c.
Also, we see the effect of the forcing term gets weaker and weaker as we move further
away from the source. This dispersion depends on the number of dimensions of the
space. As we spread out in a three-dimensional space, the “energy” from the forcing
term has to be distributed over a larger sphere, and the effect diminishes. On the
contrary, in one-dimensional space, there is no spreading out to do, and we don’t have
this reduction. In fact, in one dimensions, we get
Z tZ
Θ(c(t − τ ) − |x − y|)
φ(x, t) = F (y, τ ) dy dτ.
0 R 2c
We see there is now no suppression factor at the bottom, as expected.

6.8.3 Poisson’s equation


P. 6-81
Suppose φ, ψ : R3 → R are both smooth everywhere in some region Ω ⊆ R3 with
boundary ∂Ω, then
Z Z
<Green’s first identity> φn · ∇ψ dS = φ∇2 ψ + (∇φ) · (∇ψ) dV
∂Ω Ω
270 CHAPTER 6. METHODS
Z Z
<Green’s second identity> φ∇2 ψ − ψ∇2 φdV = φn · ∇ψ − ψn · ∇φ dS.
Ω ∂Ω

By divergence theorem,
Z Z Z
φn · ∇ψ dS = ∇ · (φ∇ψ) dV = φ∇2 ψ + (∇φ) · (∇ψ) dV.
∂Ω Ω Ω

So we get Green’s first identity. Similarly, we obtain


Z Z
ψn · ∇φ dS = ψ∇2 φ + (∇φ) · (∇ψ) dV.
∂Ω Ω

Subtracting these two equations gives Green’s second identity.


Why is this useful? On the left, we have things like ∇2 ψ and ∇2 φ. These are
things we are given by Poisson’s or Laplace’s equation. On the right, we have
things on the boundary, and these are often the boundary conditions we are given.
So this can be rather helpful when we have to solve these equations.

Green’s third identity

Let φ : Rn → R satisfy the n-dimensional Poisson’s equation ∇2 φ = −F, where F (x) is


a forcing term. The fundamental solution to this equation is defined to be Gn (x, y),
where ∇2 Gn (x, y) = δ (n) (x−y), where δ (n) is the n-dimensional delta function.
By rotational symmetry, Gn (x, y) = f (|x − y|) for some one variable function f .
(r)
We will abuse the notation to write Gn just as f . Write An as the surface area of
the spherical shell of radius r in Rn (i.e. surface area of Srn−1 = ∂Br ), also write
(1) (r) (r)
An = An . For example A2 = 2πr and A3 = 4πr2 . Integrating over a ball
n
Br = {x ∈ R : |x − y| ≤ r} we have
Z Z
dG3 (r) n−1 dG3
∇2 G3 dV =

1= n · ∇G3 dS = A n = An r =⇒
Br ∂Br dr r dr r

|x − y| + c1


 n=1

 1

dGn 1
= =⇒ Gn (x, y) = 2π ln |x − y| + c2 n=2
dr An rn−1 


− 1
 + cn n≥3
An (n − 2)|x − y|n−2

For n ≥ 3 we often set cn = 0 so that lim|x|→∞ Gn = 0. We wish to apply Green’s


identities to the case ψ = Gn (|x − y|). However, recall that when deriving Green’s
identity we assumed φ and ψ are smooth everywhere in our domain, but our Green’s
function is singular at x = y (for n ≥ 2) and we want to integrate over this region as
well. So we need to do this carefully, we take

Ω = Br − Bε = {x ∈ Rn : ε ≤ |x − y| ≤ r}.

In other words, we remove a small region of radius ε centered on y from the domain.
In this choice of Ω, it is completely safe to use Green’s identity, since our Green’s
6.8. GREEN’S FUNCTIONS FOR PDES 271

function is certainly regular everywhere in this Ω. Using Green’s second identity and
noting that ∇2 Gn = 0 everywhere except at x = y, we get
Z Z
− Gn ∇2 φ dV = φ∇2 Gn − Gn ∇2 φ dV
Ω Ω
Z Z
= φ(n · ∇Gn ) − Gn (n · ∇φ) dS + φ(n · ∇G3 ) − Gn (n · ∇φ) dS
n−1 n−1
Sr Sε

Note that on the inner boundary, we have n = −r̂. Also, at Sεn−1 , we have
1 ε
Gn |Sεn−1 = − =− ,
An (n − 2)εn−2 (ε)
(n − 2)An

dGn 1 1
=− = − (ε) .
dr Sεn−1 An εn−1 An
So the inner boundary terms are
Z  
φ(n · ∇Gn ) − Gn (n · ∇φ) dS
n−1

Z Z
1 ε
=− (ε)
φ dS + (ε)
(n · ∇φ)dS
An n−1
Sε (n − 2)An n−1

This in fact tends to −φ(y) as ε → 0. To see this note that the final integral is bounded
by the assumption that φ is everywhere smooth (so the value of ∇φ is bounded). So
as we take the limit ε → 0, the final term vanishes. For the first term
Z
1
φ dS = − average of φ on Sεn−1 → −φ(y) as ε → 0

− (ε)
An Sεn−1
Now suppose ∇2 φ = −F . Then this gives Green’s third identity :
Z Z
φ(y) = φ(n · ∇Gn ) − Gn (n · ∇φ) dS − Gn (x, y)F (x) dV.
∂Ω Ω

where the integral are taken over the x variable. This is a remarkable formula! It de-
scribes the solution throughout our domain in terms of the solution on the boundary,
the forcing term and the known function Gn . Also notice that (unlike the previous
cases) the Green’s function is here providing an expression for the solution with inho-
mogeneous boundary conditions. If the boundary values of φ and n · ∇φ vanish as we
take r → ∞, then we have
Z
φ(y) = − Gn (x, y)F (x) dV.
Rn

So the fundamental solution is the Green’s function for Poisson’s equation on Rn . We


can verify the formula using the fact that ∇2 Gn = δ (n) (x − y) since
Z Z
∇2 φ(y) = − (∇2y Gn (x, y))F (x)dV = − δ (n) (x − y)G(x)dV = −F (y)
Rn Rn

where the Laplacian ∇2y


is wrt y. However, there is a puzzle. Suppose F = 0, so
∇2 φ = 0. Then Green’s identity says
Z  
dGn dφ
φ(y) = φ − Gn dS.
n−1
Sr dr dr
272 CHAPTER 6. METHODS

But we know there is a unique solution to Laplace’s equation on every bounded do-
main once we specify the boundary value φ|∂Ω (Dirichlet boundary condition), or a
unique-up-to-constant solution if we specify the boundary value of n·∇φ|∂Ω (Neumann
condition). However, to get φ(y) using Green’s identity, we need to know φ and and
n · ∇φ on the boundary. This is too much. Green’s third identity is a valid relation
obeyed by solutions to Poisson’s equation, but it is not constructive. We cannot specify
φ and n · ∇φ freely.

Dirichlet Green’s function


When we have Dirichlet boundary condition we would like a formula for φ given just
the value of φ on the boundary. We can achieve this as follows: We seek to modify Gn
via
Gn → G = Gn + H(x, y),
where ∇2 H = 0 everywhere in Ω, H is regular throughout Ω, and G|∂Ω = 0. In other
words, we find some H that does not affect the relations on G when acted on by ∇2 ,
but now our G will have boundary value 0. We will find this H later, but given such
an H, we replace Gn with G − H in Green’s third identity, and see that all the H
terms fall out since
Z Z Z
φ(n · ∇H) − H(n · ∇φ) dS = φ∇2 H − H∇2 φ dV = HF dV,
∂Ω Ω Ω

therefore G also satisfies Green’s third identity, and so


Z Z Z Z
φ(y) = φn · ∇G − Gn · ∇φ dS − F G dV = φn · ∇G dS − F G dV.
∂Ω Ω ∂Ω Ω

So as long as we find H, we can express the value of φ(y) just in terms of the values
of φ on the boundary. Similarly, if we are given a Neumann condition, ie. the value of
n · ∇φ on the boundary, we have to find an H that kills off n · ∇G on the boundary,
and get a similar result.
In general, finding a harmonic ∇2 H = 0 with H|∂Ω = −Gn |∂Ω is a difficult problem.
However, the method of images allows us to solve this in some special cases with lots
of symmetry. The key concept is to match the boundary conditions by placing a
extending the domain beyond the region of interest, and placing a ‘mirror’ or ‘image’
source or forcing term in the unphysical region.
E. 6-82
Let Ω = {(x, y, z) ∈ R3 : z ≥ 0}. Find a solution to ∇2 φ = −F in Ω with φ → 0
rapidly as |x| → ∞ with boundary condition φ(x, y, 0) = g(x, y).

The fundamental solution


1 1
G3 (x, x0 ) = −
4π |x − x0 |
obeys all the conditions we need except
1 1
G3 |z=0 = − 6= 0.
4π [(x − x0 )2 + (y − y0 )2 + z02 ]1/2

However, let xR
0 be the point (x0 , y0 , −z0 ) . This is the reflection of x0 in the
boundary plane z = 0. Since the point xR R
0 is outside our domain, G3 (x, x0 ) obeys
6.8. GREEN’S FUNCTIONS FOR PDES 273

∇2 G3 (x, xR R
0 ) = 0 for all x ∈ Ω, and also G3 (x, x0 )|z=0 = G3 (x, x0 )|z=0 . Hence
R
we take G(x, x0 ) = G3 (x, x0 ) − G3 (x, x0 ). The outward pointing normal to Ω at
z = 0 is n = −ẑ. Hence we have
 
1 −(z − z0 ) −(z + z0 ) 1 z0
n·∇G|z=0 = − = 2 + (y − y )2 + z 2 ]3/2
.
4π |x − x0 |3 |x − xR
0 | 3
z=0 2π [(x − x 0 ) 0 0

Therefore our solution is


Z  
1 1 1
φ(x0 ) = − F (x) d3 x
4π Ω |x − x0 | |x − xR0|
Z
z0 g(x, y)
+ dx dy.
2π R2 [(x − x0 )2 + (y − y0 )2 + z02 ]3/2

What have we actually done here? The Green’s function G3 (x, x0 ) in some sense
represents a “charge” at x0 . We can imagine that the term G3 (x, xR
0 ) represents
the contribution to our solution from a point charge of opposite sign located at
xR
0 . Then by symmetry, G is zero at z = 0.

E. 6-83
Suppose a chimney produces smoke such that the density φ(x, t) of smoke obeys

∂t φ − D∇2 φ = F (x, t).

The left side is just the heat equation, modelling the diffusion of smoke, while the
right forcing term describes the production of smoke by the chimney. If this were
a problem for x ∈ R3 , then the solution is
Z tZ
φ(x, t) = F (y, τ )S3 (x − y, t − τ ) d3 y dτ,
0 R3

|x − y|2
 
1
where S3 (x − y, t − τ ) = exp − .
[4πD(t − τ )]3/2 4D(t − τ )
This is true only if the smoke can diffuse in all of R3 . However, this is not true
for our current circumstances, since smoke does not diffuse into the ground. To
account for this, we should find a Green’s function that obeys n · ∇G|z=0 = 0
which says that there is no smoke diffusing in to the ground. This is achieved by
picking
 
G(x, t; y, τ ) = Θ(t − τ ) S3 (x − y, t − τ ) + S3 (x − yR , t − τ ) .

We can directly check that this obeys ∂t D2 −∇2 G = δ(t−τ )δ 3 (x−y) when x ∈ Ω,
and also n · ∇G|z0 = 0. Hence the smoke density is given by
Z t Z  
φ(x, t) = F (y, τ ) S3 (x − y, t − τ ) + S3 (x − yR , t − τ ) d3 y.
0 Ω

We can think of the second term as the contribution from a “mirror chimney”.
Without a mirror chimney, we will have smoke flowing into the ground. With a
mirror chimney, we will have equal amounts of mirror smoke flowing up from the
ground, so there is no net flow. Of course, there are no mirror chimneys in reality.
These are just artifacts we use to find the solution we want.
274 CHAPTER 6. METHODS

E. 6-84
Suppose we want to solve the wave equation in the region (x, t) with x > 0 with
boundary conditions

φ(x, 0) = b(x), ∂t φ(x, 0) = 0, ∂x φ(0, t) = 0.

On R1,1 d’Alembert’s solution gives φ(x, t) = 21 (b(x − ct) + b(x + ct)). This is
not what we want, since eventually we will have a wave moving past the x = 0
line. To compensate for this, we introduce a mirror wave moving in the opposite
direction, such that when as they pass through each other at x = 0, there is no
net flow across the boundary.
More precisely, we include a mirror initial condition φ(x, 0) = b(x) + b(−x), where
we set b(x) = 0 when x < 0. In the region x > 0 we are interested in, only the
b(x) term will contribute. In the x < 0 region, only b(−x) will contribute. Then
the general solution is
1 
φ(x, t) = b(x + ct) + b(x − ct) + b(−x − ct) + b(−x + ct) .
2
CHAPTER 7
Quantum mechanics
Quantum mechanics (QM) is a radical generalization of classical physics. Profound
new features of quantum mechanics include
1. Quantisation — Quantities such as energy are often restricted to a discrete set of
values, called quanta.
2. Wave-particle duality — Classical concepts of particles and waves become merged
in quantum mechanics. They are different aspects of a single entity. For example
electrons has properties of both particles and waves.
3. Probability and uncertainty — Predictions in quantum mechanics involve prob-
ability in a fundamental way. This probability does not arise from our lack of
knowledge of the system, but is a genuine uncertainty in reality.
Quantum mechanics also involves a new fundamental constant — Planck constant h
h
or it reduced form ~ = 2π . The dimension of this is

[h] = M L2 T −1 = [energy] × [time] = [position] × [momentum].

We can think of this constant as representing the “strength” of quantum effects. De-
spite having these new profound features, we expect to recover classical physics when
we take the limit ~ → 0.
Historically, there are a few experiments that led to the development of quantum
mechanics:

Light quanta
In quantum mechanics, light (or electromagnetic waves) consists of quanta called pho-
tons. We can think of them as waves that come in discrete “packets” that behave like
particles.
In particular, photons behave like particles with energy E = hν = ~ω, where ν is the
frequency and ω = 2πν is the angular frequency. However, we usually don’t care about
ν and just call ω the frequency. Similarly, the momentum is given by p = h/λ = ~k,
where λ is the wavelength and k = 2π/λ is the wave number.
For electromagnetic waves with speed c = ω/k = νλ, the above is consistent with the
fact that photons are massless particles, since we have E = cp, as entailed by special
relativity.
The physical reality of photons was clarified by Einstein in explaining the photo-electric
effect. When we shine some light (or electromagnetic radiation γ) of frequency ω onto
certain metals, this can cause an emission of electrons (e). We can measure the maximum
kinetic energy K of these electrons. Experiments show that
1. K depends only (linearly) on the frequency but not the intensity.


2. For ω < ω0 (for some critical value ω0 ), no electrons are emitted at all, regardless
of the intensity of the incident light.
3. For a given frequency of incident radiation, the rate at which electrons are ejected
is directly proportional to the intensity of the incident light.
This is hard to understand classically, but is exactly as expected if each electron emitted
is due to the impact with a single photon (of energy E = ~ω). If W is the energy
required to liberate an electron, then we would expect K = ~ω−W by the conservation
of energy. We will have no emission at all if ω < ω0 = W/~, even if we increase the
number of such photons hitting the metal (i.e. increase the intensity).
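
Here is a tiny numerical sketch (not in the original notes, with illustrative values) of this relation K = ℏω − W and the threshold ω0 = W/ℏ, assuming a hypothetical work function of 2 eV.

hbar = 1.055e-34       # reduced Planck constant, J s
eV = 1.602e-19         # J
W = 2.0*eV             # hypothetical work function
omega0 = W/hbar                               # threshold angular frequency, rad/s
nu0 = omega0/(2*3.141592653589793)            # threshold frequency, Hz
print(omega0, nu0)                            # ≈ 3.0e15 rad/s, ≈ 4.8e14 Hz (visible light)
print((hbar*1.5*omega0 - W)/eV)               # K ≈ 1 eV for light at 1.5 times the threshold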

Bohr model of the atom


When we heat atoms up to make them emit light; or shine light at atoms so that they
absorb light, we will find that light is emitted and absorbed at very specific frequencies,
known as the emission and absorption spectra. This suggests that the inner structure
of atoms is discrete.
However, this is not the case in the classical model. In the classical model, the simplest
atom, the hydrogen, consists of an electron with charge −e and mass m, orbiting a
proton of charge +e and mass mp  m fixed at the origin. The dynamics of the electron
is governed by Newton’s laws of motions, just as we derived the orbits of planets under
the gravitational potential. This model implies that the angular momentum L is
constant, and so is the energy E = ½mv² + V(r), where V(r) = −e²/(4πε₀r).
This is not a very satisfactory model for the atom. First of all, it cannot explain the
discrete emission and absorption spectra. More importantly, while this model seems
like a mini solar system, electromagnetism behaves differently from gravitation. To
maintain a circular orbit, an acceleration has to be applied onto the electron. But
accelerating particles emit radiation and lose energy. So according to classical electro-
dynamics, the electron will just decay into the proton and atoms will implode.
In response to this, Bohr proposed the Bohr quantization conditions which restricts
the classical orbits by saying that the angular momentum can only take values

L = mrv = n~ for n = 1, 2, · · · .

Assuming these, together with the requirement for the centripetal force mv²/r = e²/(4πε₀r²),
we can solve for r and v completely for each n and obtain

rn = (4πε₀/(me²)) ℏ²n²,    vn = (e²/(4πε₀)) 1/(ℏn),    En = −½ m (e²/(4πε₀ℏ))² 1/n².

When an electron makes a transition between different energy levels n and m > n, we
will see emission or absorption of a photon of frequency ω given by

E = ℏω = Em − En = ½ m (e²/(4πε₀ℏ))² (1/n² − 1/m²).
This model explains a vast amount of experimental data. This also gives an estimate
of the size of a hydrogen atom: r1 = (4πε₀/(me²)) ℏ² ≈ 5.29 × 10⁻¹¹ m, known as the Bohr
radius. While the model fits experiments very well, it does not provide a good expla-
nation for why the radius/angular momentum should be quantized, and we shall seek
the answer in quantum mechanics.
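
As a numerical check (not part of the original notes), the Bohr formulas above can be evaluated directly with standard SI constants:

import math
me   = 9.109e-31      # electron mass, kg
e    = 1.602e-19      # elementary charge, C
eps0 = 8.854e-12      # vacuum permittivity, F/m
hbar = 1.055e-34      # reduced Planck constant, J s

r1 = 4*math.pi*eps0*hbar**2/(me*e**2)              # Bohr radius (n = 1)
E1 = -0.5*me*(e**2/(4*math.pi*eps0*hbar))**2       # ground state energy, J
print(r1)                                          # ≈ 5.29e-11 m
print(E1/e)                                        # ≈ -13.6 eV
print((E1/e)*(1/2**2 - 1/1**2))                    # photon energy of the m=2 -> n=1 transition, ≈ 10.2 eV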

Matter waves

The relations E = hν = ℏω and p = h/λ = ℏk mean we can associate wave properties
(frequency and wave number) to particles. So we can associate a wavelength λ, known as
the de Broglie wavelength, to an electron. The Bohr model requires that L = rp = nℏ;
this is equivalent to requiring that nλ = 2πr. This is exactly the statement that the
circumference of the orbit is an integer multiple of the wavelength. This is the condition
we need for a standing wave to form on the circumference. This looks promising as an
explanation for the quantization relation.

The double-slit experiment on electrons shows that electrons really behave like waves;
in particular it shows that electrons exhibit interference like waves. We have a sinusoidal
wave/electron incident on some barrier with narrow openings, and we observe the density
of electrons on a screen behind it. (Sketch: incident wave, two slits, path difference δ,
and the resulting electron density on the screen.)

At different points, depending on the difference δ in path length, we may have con-
structive interference (large amplitude) or destructive interference (no amplitude). In
particular, constructive interference occurs if δ = nλ, and destructive if δ = (n + ½)λ.
Not only does this experiment allow us to verify whether something is a wave; we can
also figure out its wavelength λ by experiment.

For a single slit, we can also show that electrons diffract. This also has a conceptual
importance. We know that if we fire many, many electrons, the density of electrons
will follow the diffraction pattern. But what if we just fire a single elec-
tron? On average, it should still follow the distribution. However, for this individual
electron, we cannot know where it will actually land. We can only provide a probability
distribution of where it will end up. In quantum mechanics, everything is inherently
probabilistic.

Practically, the actual experiment for electrons is slightly more complicated. Since
the wavelength of an electron is rather small, to obtain the diffraction pattern, we
cannot just poke holes in sheets. Instead, we need to use crystals as our diffraction
grating.

7.1 Wavefunctions and the Schrödinger equation


7.1.1 Particle state and probability
Classically, a point particle in 1 dimension has a definitive position x (and momentum
p) at each time. To completely specify a particle, it suffices to write down these two
numbers. In quantum mechanics, this is much more complicated. Instead, a particle
has a state at each time, specified by a complex-valued wavefunction ψ(x).

The physical content of the wavefunction is as follows: if ψ is appropriately normalized
so that ∫_{−∞}^{∞} |ψ(x)|² dx = 1, then when we measure the position of a particle, we get a
result x with probability density function |ψ(x)|², ie. the probability that the position
is found in [x, x+δx] (for small δx) is given by |ψ(x)|² δx. Alternatively, the probability
of finding it in an interval [a, b] is given by

P(particle position in [a, b]) = ∫_a^b |ψ(x)|² dx.

The normalisation condition ensures that the probability of finding the particle somewhere
is 1. It is possible that in some cases, the particles in the configuration space may be
restricted. For example, we might require −ℓ/2 ≤ x ≤ ℓ/2 with some boundary conditions
at the edges. Then the normalization condition would not be integrating over (−∞, ∞),
but [−ℓ/2, ℓ/2].

While it is nice to have a normalized wavefunction, it is often inconvenient to deal


exclusively with normalized wavefunctions, or else we will have a lot of ugly-looking
constants floating all around. As a matter of fact, we can always restore normalization
at the end of the calculation. So we often don’t bother.

If we do not care about normalization, then for any (non-zero) λ ∈ C, ψ(x) and λψ(x)
represent the same quantum state (since they give the same probabilities). In practice,
we usually refer to either of these as “the state”. We can thus think of the states as
equivalence classes of wavefunctions under the equivalence relation ψ ∼ φ if φ = λψ for
some non-zero λ. What we do require, then, is not that the wavefunction is normalized,
but normalizable, ie. ∫_{−∞}^{∞} |ψ(x)|² dx < ∞. We will encounter wavefunctions that are
not normalizable. Mathematically, these are useful things to have, but we have to be
more careful when interpreting these things physically.

A characteristic property of quantum mechanics is that if ψ1 (x) and ψ2 (x) are wave-
functions for a particle, then ψ1 (x) + ψ2 (x) is also a possible particle state (ignoring
normalization), provided the result is non-zero. This is the principle of superposition,
and arises from the fact that the equations of quantum mechanics are linear.

E. 7-1

ψ(x) = B [exp(−(x − c)²/(2α)) + exp(−x²/(2β))]

We choose B so that ψ is a normalized wavefunction for a single particle. Then the
resulting distribution |ψ(x)|² has two peaks, one near x = 0 and one near x = c.
A numerical sketch of this normalization is given below.
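
This is a minimal numerical sketch (not part of the original notes) of how one could fix B and compute an interval probability for this state, assuming illustrative values α = 1, β = 2 and c = 4.

import numpy as np

alpha, beta, c = 1.0, 2.0, 4.0
x = np.linspace(-15, 25, 20001)
psi_unnorm = np.exp(-(x - c)**2/(2*alpha)) + np.exp(-x**2/(2*beta))

B = 1.0/np.sqrt(np.trapz(psi_unnorm**2, x))      # choose B so that ∫|ψ|² dx = 1
psi = B*psi_unnorm

print(np.trapz(psi**2, x))                        # ≈ 1.0 (normalization check)
mask = (x >= 2) & (x <= 6)
print(np.trapz(psi[mask]**2, x[mask]))            # P(position in [2, 6])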

7.1.2 Operators
We know that the square of the wavefunction gives the probability distribution of the
position of the particle. How about other information such as the momentum and
energy? It turns out that all the information about the particle is contained in the
wavefunction (which is why we call it the “state” of the particle).
We call each property of the particle which we can measure an observable . Each
observable has a corresponding operator , which acts on wavefunctions ψ(x). For
example, the position is represented by the operator x̂ = x. This means that (x̂ψ)(x) =
xψ(x). Here are a few operators:1


Position:    x̂ = x,               so that x̂ψ = xψ(x)
Momentum:    p̂ = −iℏ ∂/∂x,        so that p̂ψ = −iℏψ′(x)
Energy:      H = p̂²/(2m) + V(x̂),  so that Hψ = −(ℏ²/2m)ψ″(x) + V(x)ψ(x)

The final H is called the Hamiltonian , where m is the mass and V is the potential.
We see that the Hamiltonian is just the kinetic energy p2 /2m and the potential energy
V . There will be more insight into why the operators are defined like this in IIC
Classical Dynamics and IID Principles of Quantum Mechanics.
How do these operators relate to the actual physical properties? In general, when
we measure an observable, the result is not certain. They are randomly distributed
according to some probability distribution, which we will go into in full detail later. The
expected value of an observable Q is in fact given by ⟨ψ, Q̂ψ⟩ = ∫_{−∞}^{∞} ψ* Q̂ψ dx.
However, a definite result for the observable is obtained if and only if ψ is an eigen-
function (without weight) of the operator. Such a ψ is called an eigenstate of the
operator. If this is the case, the result of the measurement is the eigenvalue associated.
For example, we have p̂ψ = pψ for constant p iff ψ is a state with definite momentum
p. Similarly, Hψ = Eψ for constant E iff ψ has definite energy E.
As a result quantization occurs in quantum mechanics. Suppose we can measure a
definite value of say energy and momentum, then the only possible values of E and
p are the eigenvalues of the operators. So if the operators have a discrete set of
eigenvalues then we can only have discrete values of p and E.
The time-independent Schrödinger equation is the energy eigenvalue equation Hψ =
Eψ. This is in general what determines how the system behaves. In particular, the
eigenvalues E are precisely the allowed energy values.
E. 7-2
Let ψ(x) = Ceikx . We would expect it to have a wavelength of λ = 2π/k. This
is a momentum eigenstate, since we have p̂ψ = −i~ψ 0 = (~k)ψ. So we know that
the momentum is p = ℏk. So we have wavelength λ = h/p = 2π/k. Note also that
if there is no potential, ie. V = 0, then the energy is E = ℏ²k²/(2m), since

Hψ = (p̂²/2m)ψ = −(ℏ²/2m)ψ″ = (ℏ²k²/2m)ψ.
Note, however, that our wavefunction has |ψ(x)|² = |C|², which is a constant. So
this wavefunction is not normalizable on R. However, if we restrict ourselves to
some finite domain −ℓ/2 ≤ x ≤ ℓ/2, then we can normalize by picking C = 1/√ℓ.

E. 7-3
Consider the Gaussian distribution ψ(x) = C exp(−x²/(2α)). We get p̂ψ(x) = −iℏψ′(x) ≠
pψ(x) for any number p. So this is not an eigenfunction of the momentum. How-
ever, for the harmonic oscillator with potential V(x) = ½Kx², this ψ(x) is an
eigenfunction of the Hamiltonian operator, provided we picked the right α. We
have
Hψ = −(ℏ²/2m)ψ″ + ½Kx²ψ = Eψ
for some constant E when α² = ℏ²/(Km). The energy is in fact E = (ℏ/2)√(K/m).

7.1.3 Time-dependent Schrödinger equation

The wavefunction specifies the state, however the state can change with time. For a
time-dependent wavefunction Ψ(x, t), its evolution with time is described by

<Time-dependent Schrödinger equation>    iℏ ∂Ψ/∂t = HΨ.

The classical dynamics (time evolution) of, say a particle, is determined by its potential
through F (x) = −V 0 (x). In quantum mechanics, the time evolution of a state is deter-
mined by the Hamiltonian through the Time-dependent Schrödinger equation. For a
particle in a potential V (x), the time-dependent Schrödinger equation reads

iℏ ∂Ψ/∂t = −(ℏ²/2m) ∂²Ψ/∂x² + V(x)Ψ.

Note that it is linear. So the sums and multiples of solutions are also solutions. It is
also first-order in time. So if we know the wavefunction Ψ(x, t0 ) at a particular time
t0 , then this determines the whole function Ψ(x, t).

This is similar to classical dynamics, where knowing the potential V (and hence the
Hamiltonian H) completely specifies how the system evolves with time. However, this
is in some ways different from classical dynamics. Newton’s second law is second-order
in time, while this is first-order in time. This is significant since if our equation is
first-order in time, then the current state of the wavefunction completely specifies the
evolution of the wavefunction in time.

Yet, this difference is just an illusion. The wavefunction is the state of the particle,
and not just the “position”. Instead, we can think of it as capturing the position and
momentum. Indeed, if we write the equations of classical dynamics in terms of position
and momentum, it will be first order in time.

D. 7-4
A stationary state is a state of the form Ψ(x, t) = ψ(x)e−iEt/~ where ψ(x) is an
eigenfunction of the Hamiltonian with eigenvalue E. This term is also sometimes
applied to ψ instead.

E. 7-5
Note that a stationary state satisfies the time-dependent Schrödinger equation,
also we have |Ψ(x, t)|2 = |ψ(x)|2 which is independent of time. The stationary state
is the unique solution to the time-dependent Schrödinger equation with Ψ(x, 0) =
ψ(x) and HΨ = EΨ. Note that a measurement of energy for a stationary state
would give definite result of E at any time.
P. 7-6
Let Ψ(x, t) a time-dependent wavefunction. The probability density P (x, t) =
|Ψ(x, t)|2 obeys a conservation equation

∂P/∂t = −∂j/∂x    where    j(x, t) = −(iℏ/2m)(Ψ* ∂Ψ/∂x − Ψ ∂Ψ*/∂x)

This is straightforward from the Schrödinger equation and its complex conjugate.
Assuming V is real, we have
∂P/∂t = Ψ* ∂Ψ/∂t + Ψ ∂Ψ*/∂t = (iℏ/2m)Ψ*Ψ″ − (iℏ/2m)Ψ″*Ψ = −∂j/∂x,
since the two V terms cancel each other out.
j(x, t) is called the probability current . Note that Ψ∗ Ψ0 is the complex conjugate
of Ψ0∗ Ψ, so Ψ∗ Ψ0 − Ψ0∗ Ψ is imaginary. So multiplying by i ensures that j(x, t) is
real, which is a good thing since P is also real.
The important thing here is not the specific form of j, but that ∂P/∂t can be written
as the space derivative of some quantity. This implies that the probability that
we find the particle in [a, b] at time t changes with time as

d/dt ∫_a^b |Ψ(x, t)|² dx = −∫_a^b ∂j/∂x (x, t) dx = j(a, t) − j(b, t).
We can think of the final term as the probability current getting in and out of the
interval at the boundary.
In particular, consider a normalizable state such that Ψ, Ψ0 , j → 0 as x → ±∞ for
fixed t. Taking a → −∞ and b → +∞, we have

d/dt ∫_{−∞}^{∞} |Ψ(x, t)|² dx = 0.
What does this tell us? This tells us that if Ψ(x, 0) is normalized, Ψ(x, t) is
normalized for all t. Hence we know that for each fixed t, |Ψ(x, t)|2 is a probability
distribution. So what this really says is that the probability interpretation is
consistent with the time evolution.

7.2 Energy eigenstates in one dimension


We are going to consider the energy eigenvalue problem for a particle in one dimension
in a potential V (x), that is
Hψ = −(ℏ²/2m)ψ″ + V(x)ψ = Eψ.
1
Note that we put hats on x̂ and p̂ to make it explicit that these are operators, as opposed to
the classical quantities position and momentum.

In other words, we want to find the allowed energy eigenvalues. This is a hard problem
in general. We will consider simple cases involving simple V (x).

7.2.1 Parity
Consider the case with potential such that V (x) = V (−x). By changing variables
x → −x, we see that ψ(x) is an eigenfunction of H with energy E if and only if ψ(−x)
is an eigenfunction of H with energy E. There are two possibilities:
1. If ψ(x) and ψ(−x) represent the same quantum state, this can only happen if
ψ(−x) = ηψ(x) for some constant η. Since this is true for all x, we can do this twice
and get ψ(x) = ηψ(−x) = η 2 ψ(x). So we get that η = ±1 and ψ(−x) = ±ψ(x).
We call η the parity, and say ψ has even/odd parity if η is +1/ − 1 respectively.
For example, in our particle in a box, our states ψn have parity (−1)n+1 .
2. If ψ(x) and ψ(−x) represent different quantum states, then we can still take linear
combinations ψ± (x) = α(ψ(x) ± ψ(−x)) and these are also eigenstates with energy
eigenvalue E, where α is for normalization. Then by construction, ψ± (−x) =
±ψ± (x) and have parity η = ±1.
Hence, if we are given a potential with reflective symmetry V (−x) = V (x), then we
can restrict our attention and just look for solutions with definite parity.

7.2.2 Piecewise constant potential


Consider the case with V (x) = U , where U is a constant. Then we can easily write
down solutions.
1. If U > E, then the Schrödinger equation is equivalent to ψ″ − κ²ψ = 0, where
   κ is such that U − E = ℏ²κ²/(2m). We take wlog κ > 0. The solution is then
   ψ = Ae^{κx} + Be^{−κx}.
2. On the other hand, if U < E, then the Schrödinger equation says ψ″ + k²ψ = 0,
   where k is picked such that E − U = ℏ²k²/(2m). The solutions are ψ = Ae^{ikx} + Be^{−ikx}.2
Wouldn’t the case where the potential is constant be just equivalent to a free particle?
This is indeed true. However, knowing these solutions allows us to study piecewise
flat potentials such as steps, wells and barriers.

(Sketches: a potential step, a potential well, and a potential barrier.)

Here a finite discontinuity in V is allowed. In this case, we can have ψ, ψ 0 continu-


ous and ψ 00 discontinuous. Then the discontinuity of ψ 00 cancels that of V , and the
Schrödinger equation holds everywhere.
2
κ and k are merely there to simplify our equations, they generally need not have physical
meanings.
We would want normalizable solutions with ∫_{−∞}^{∞} |ψ(x)|² dx = 1. This requires that
ψ(x) → 0 as x → ±∞. We see that for the segments at the ends, we want to have
decaying exponentials e^{−κx} instead of oscillating exponentials e^{−ikx}.
E. 7-7
<Infinite well: particle in a box> The simplest
case to consider is the infinite well. Here the potential
is infinite outside the region [−a, a] and zero inside, and
we have much less to think about. For |x| > a, we must
have ψ(x) = 0, or else V(x)ψ(x) would be infinite.

We require ψ = 0 for |x| > a and ψ continuous at x = ±a. Within |x| < a, the
Schrödinger equation is

−(ℏ²/2m)ψ″ = Eψ    ⟹    ψ″ + k²ψ = 0    where    E = ℏ²k²/(2m)

Here, instead of working with the complex exponentials, we use sin and cos since
we know well when these vanish. The general solution is thus ψ = A cos kx +
B sin kx. Our boundary conditions require that ψ vanishes at x = ±a. So we need
A cos ka ± B sin ka = 0. In other words, we require A cos ka = B sin ka = 0. Since
sin ka and cos ka cannot be simultaneously 0, either A = 0 or B = 0. So the two
possibilities are
1. B = 0 and ka = nπ/2 with n = 1, 3, · · ·
2. A = 0 and ka = nπ/2 with n = 2, 4, · · ·
Hence the allowed energy levels are

En = (ℏ²π²/(8ma²)) n²    for n = 1, 2, · · ·

and the normalized (∫_{−a}^{a} |ψn(x)|² dx = 1) wavefunctions are

ψn(x) = (1/a)^{1/2} cos(nπx/(2a)) for n odd,    ψn(x) = (1/a)^{1/2} sin(nπx/(2a)) for n even.

(Sketches of ψ1, ψ2, ψ3 and ψ4 on [−a, a]: standing waves with n half-wavelengths,
vanishing at x = ±a.)

This was a rather simple and nice example. We have an infinite well, and the
particle is well-contained inside the box. The solutions just look like standing
waves on a string with two fixed end points.

Note that ψn (−x) = (−1)n+1 ψn (x). We will see that this is a general feature of
energy eigenfunctions of a symmetric potential. This is known as parity.
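
A small numerical sketch (not in the original notes) checking the normalization and orthogonality of these ψn, with an illustrative width a = 1:

import numpy as np

a = 1.0
x = np.linspace(-a, a, 4001)

def psi(n, x):
    """Normalized infinite-well eigenfunctions on [-a, a]."""
    if n % 2 == 1:                                        # n odd: cosine solutions
        return np.sqrt(1/a)*np.cos(n*np.pi*x/(2*a))
    return np.sqrt(1/a)*np.sin(n*np.pi*x/(2*a))           # n even: sine solutions

print(np.trapz(psi(1, x)**2, x))          # ≈ 1 (normalization)
print(np.trapz(psi(1, x)*psi(3, x), x))   # ≈ 0 (orthogonality)
print(np.trapz(psi(2, x)*psi(4, x), x))   # ≈ 0
print([n**2 for n in (1, 2, 3, 4)])       # E_n grows like n² in units of ħ²π²/(8ma²)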

E. 7-8
<Potential well> We will consider a potential

V(x) = { −U   |x| < a
       {  0   |x| ≥ a

for some constant U > 0.

Classically, this is not very interesting. If the energy E < 0, then the particle is
contained in the well. Otherwise it is free to move around. However, in quantum
mechanics, this is much more interesting.

We want to seek energy levels for a particle of mass m, defined by the Schrödinger
equation Hψ = −(ℏ²/2m)ψ″ + V(x)ψ = Eψ. For energies in the range −U < E < 0
we set

U + E = ℏ²k²/(2m) > 0,    E = −ℏ²κ²/(2m),

where k, κ > 0 are new real constants. Note that these coefficients are not inde-
pendent, since U is given and fixed. So they must satisfy k² + κ² = 2mU/ℏ². Using
these constants, the Schrödinger equation becomes

ψ″ + k²ψ = 0 for |x| < a,    ψ″ − κ²ψ = 0 for |x| > a.

As we previously said, we want the Schrödinger equation to hold even at the


discontinuities. So we need ψ and ψ 0 to be continuous at x = ±a. We first
consider the even parity solutions ψ(−x) = ψ(x). We can write our solution as
ψ(x) = A cos kx for |x| < a,    ψ(x) = B e^{−κ|x|} for |x| > a.

We match ψ and ψ 0 at x = a. So we need

A cos ka = B e^{−κa},    −Ak sin ka = −κB e^{−κa}.

By parity, there is no additional information from x = −a. We can divide the


equations to obtain k tan ka = κ. This is still not something we can solve easily.
To find when solutions exist, it is convenient to introduce ξ = ak, η = aκ, where
these two constants are dimensionless and positive.3 Hence the solutions we need
are solutions to η = ξ tan ξ. Also, our initial conditions on k and κ require ξ² + η² =
2ma²U/ℏ². We can look for solutions by plotting these two equations. We first
plot the curve η = ξ tan ξ:
3
This η has nothing to do with parity.

(Plot: the curve η = ξ tan ξ has branches between its vertical asymptotes at
ξ = π/2, 3π/2, 5π/2, . . . in the (ξ, η) plane.)

The other equation is the equation of a circle. Depending on the size of the
constant 2ma²U/ℏ², there will be a different number of points of intersection.

So there will be a different number of solutions depending on the value of 2ma²U/ℏ².
In particular, if

(n − 1)π < (2mUa²/ℏ²)^{1/2} < nπ,

then we have exactly n even parity solutions (for n ≥ 1). We can do exactly the
same thing for odd parity eigenstates. For E > 0 or E < −U , we will end up
finding non-normalizable solutions.
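
As an illustration (not from the original notes), here is a minimal numerical sketch that finds the even-parity solutions by intersecting η = ξ tan ξ with the circle ξ² + η² = 2ma²U/ℏ², for an illustrative value of that constant.

import numpy as np

r2 = 50.0                      # illustrative value of 2 m a² U / ħ²

def f(xi):
    # a zero of f corresponds to an intersection of η = ξ tan ξ with the circle
    return xi*np.tan(xi) - np.sqrt(r2 - xi**2)

def bisect(f, lo, hi, n=200):
    for _ in range(n):
        mid = 0.5*(lo + hi)
        if f(lo)*f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5*(lo + hi)

R = np.sqrt(r2)
roots = []
n = 0
while n*np.pi < R:                                         # scan each branch of ξ tan ξ
    lo = n*np.pi + 1e-9
    hi = min(n*np.pi + np.pi/2 - 1e-9, R - 1e-9)           # stop at the asymptote or circle edge
    if lo < hi and f(lo) < 0 < f(hi):
        roots.append(bisect(f, lo, hi))
    n += 1
print(roots)    # even-parity ξ = ka values; the energies follow from E = -ħ²κ²/(2m), κ = η/a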

We can compare the solutions we have now with what we would expect classically.
Classically, any value of E in the range −U < E < 0 is allowed, and the motion is
deeply uninteresting. The particle just goes back and forth inside the well, and is
strictly confined in −a ≤ x ≤ a.

Quantum mechanically, there is just a discrete, finite set of allowed energies. What
is more surprising is that while ψ decays exponentially outside the well, it is non-
zero! This means there is in theory a non-zero probability of finding the particle
outside the well! We call these particles bound in the potential, but in fact there
is a non-zero probability of finding the particle outside the well.

7.2.3 Harmonic oscillator


So far in our examples, the quantization (mathematically) comes from us requiring con-
tinuity at the boundaries. In the harmonic oscillator, it arises in a different way.


V(x) = ½mω²x².

This is a harmonic oscillator of mass m. Classically, this has a
motion of x = A cos ω(t − t₀).

This is a really important example. First of all, we can solve it, which is a good thing.
More importantly, any smooth potential can be approximated by a harmonic oscillator
near an equilibrium x0 , since


V(x) = V(x₀) + ½V″(x₀)(x − x₀)² + · · · .
Systems with many degrees of freedom, like crystals, can also be treated as collections of inde-
pendent oscillators by considering the normal modes. If we apply this to the electro-
magnetic field, we get photons! So it is very important to understand the quantum
mechanical oscillator.
We are going to seek all normalizable solutions to the time-independent Schrödinger
equation Hψ = −(ℏ²/2m)ψ″ + ½mω²x²ψ = Eψ. To simplify constants, we define
y = (mω/ℏ)^{1/2} x and ℰ = 2E/(ℏω), both of which are dimensionless. Then we are left with

−d²ψ/dy² + y²ψ = ℰψ.

We can consider the behaviour for y² ≫ ℰ. For large y, the y²ψ term will be large,
and so we want the ψ″ term to offset it. We might want to try the Gaussian e^{−y²/2};
when we differentiate it twice, we bring down a factor of y². So we
can wlog set ψ = f(y)e^{−y²/2}, and then the Schrödinger equation gives

<Hermite's equation>    d²f/dy² − 2y df/dy + (ℰ − 1)f = 0.

We try a series solution f(y) = Σ_{r≥0} a_r y^r and substitute in to get

Σ_{r≥0} [(r + 2)(r + 1)a_{r+2} + (ℰ − 1 − 2r)a_r] y^r = 0
⟹  a_{r+2} = (2r + 1 − ℰ)/((r + 2)(r + 1)) · a_r,   r ≥ 0.
We can choose a0 and a1 independently, and can get two linearly independent solutions.
Each solution involves either all even or all odd powers.
However, we have a problem. We want normalizable solutions. So we want to make
sure our function does not explode at large y. Note that it is okay if f(y) is quite large,
since our ψ is suppressed by the e^{−y²/2} factor, but f cannot grow too fast.
We look at these two solutions individually. To examine the behaviour of f(y) when
y is large, observe that unless the coefficients vanish, we get a_p/a_{p−2} ∼ 2/p. This is
bad: f(y) then behaves like e^{y²} = Σ_{p≥0} y^{2p}/p!, and ψ cannot be normalized.

Hence, we get normalizable ψ if and only if the series for f terminates to give a
polynomial. This occurs iff ℰ = 2n + 1 for some n. Note that for each n, only one
of the two independent solutions is normalizable. So for each ℰ, we get exactly one
solution. For n even, we have

a_{r+2} = (2r − 2n)/((r + 2)(r + 1)) · a_r    for r even,

and ar = 0 for r odd. And we have the other way round when n is odd. The
solutions are thus f (y) = hn (y), where hn is a polynomial of degree n with hn (−y) =
(−1)n hn (y). For example, we have
h0(y) = a0,    h1(y) = a1 y,
h2(y) = a0(1 − 2y²),    h3(y) = a1(y − (2/3)y³).
These are known as the Hermite polynomials . We have now solved our harmonic
oscillator. With the constant restored, the possible energy eigenvalues are
 

En = ℏω(n + ½)    for n = 0, 1, 2, · · · .

The wavefunctions are

ψn(x) = hn((mω/ℏ)^{1/2} x) exp(−(mω/(2ℏ)) x²),
where normalization fixes a0 and a1 .
Harmonic oscillators are everywhere. It turns out quantised electromagnetic fields
correspond to sums of quantised harmonic oscillators, with En − E0 = n~ω. This is
equivalent to saying the nth state contains n photons, each of energy ~ω.
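
Here is a minimal sketch (not part of the original notes) that builds hn from the recursion a_{r+2} = (2r − 2n)a_r/((r+2)(r+1)) with a0 = a1 = 1, checks the polynomials above, and verifies the eigenvalue 2n + 1 numerically.

import numpy as np

def hermite_coeffs(n):
    """Coefficients [a_0, ..., a_n] of h_n from the series recursion (a_0 = 1 or a_1 = 1)."""
    a = np.zeros(n + 1)
    a[n % 2] = 1.0
    for r in range(n % 2, n - 1, 2):
        a[r + 2] = (2*r - 2*n)/((r + 2)*(r + 1)) * a[r]
    return a

print(hermite_coeffs(2))     # [ 1.  0. -2.]           -> h2 = 1 - 2y²
print(hermite_coeffs(3))     # [ 0.  1.  0. -0.6667]   -> h3 = y - (2/3)y³

# check that ψ = h_n(y) e^{-y²/2} satisfies -ψ'' + y²ψ = (2n+1)ψ on a grid
n = 3
y = np.linspace(-6, 6, 2001)
psi = np.polyval(hermite_coeffs(n)[::-1], y)*np.exp(-y**2/2)
d2 = np.gradient(np.gradient(psi, y), y)
print(np.max(np.abs(-d2 + y**2*psi - (2*n + 1)*psi)))   # ≈ 0 up to discretization error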

7.3 Expectation and uncertainty


D. 7-9
• Let ψ(x) and φ(x) be normalizable wavefunctions at some fixed time (not nec-
essarily stationary states). We define the complex inner product by ⟨φ, ψ⟩ =
∫_{−∞}^{∞} φ(x)* ψ(x) dx. The norm of ψ is ‖ψ‖ = √⟨ψ, ψ⟩ = (∫_{−∞}^{∞} |ψ(x)|² dx)^{1/2}.

• Suppose we have a normalized state ψ, ie. kψk = 1. The expectation value of


any observable Q on the state ψ is hQiψ = hψ, Qψi.
• The uncertainty in position (∆x)ψ and momentum (∆p)ψ are defined by (∆x)2ψ =
h(x̂ − hx̂iψ )2 iψ = hx̂2 iψ − hx̂i2ψ with exactly the same expression for momentum:
(∆p)2ψ = h(p̂ − hp̂iψ )2 iψ = hp̂2 iψ − hp̂i2ψ .
• An operator Q is said to be Hermitian if ⟨φ, Qψ⟩ = ⟨Qφ, ψ⟩ for all normalizable
φ, ψ. In other words, ∫ φ* Qψ dx = ∫ (Qφ)* ψ dx.

• The commutator of operators Q and S is denoted and defined by [Q, S] = QS −
SQ.
E. 7-10
• Note that for any complex number α, we have hφ, αψi = αhφ, ψi = hα∗ φ, ψi. Also,
we have hφ, ψi = hψ, φi∗ . This ensures the norm is real and positive. These are
just the usual properties of an inner product.
• For the position, we have ⟨x̂⟩ψ = ⟨ψ, x̂ψ⟩ = ∫_{−∞}^{∞} x|ψ(x)|² dx. Similarly, for the
momentum, we have ⟨p̂⟩ψ = ⟨ψ, p̂ψ⟩ = ∫_{−∞}^{∞} ψ*(−iℏψ′) dx.

• Suppose φ(x) = ψ(x)e^{ikx}, then |φ(x)|² = |ψ(x)|² and ⟨x̂⟩φ = ⟨x̂⟩ψ. However
⟨p̂⟩φ = ⟨p̂⟩ψ + ℏk. So an additional factor of e^{ikx} changes the momentum by ℏk.

• (∆x)²ψ and (∆p)²ψ are like variances; we'll show that they are real and positive.
• If Q is Hermitian, then hψ, Qψi = hQψ, ψi = hψ, Qψi∗ , so hψ, Qψi is real, ie. hQiψ
is real.
• The commutator is a measure of the lack of commutativity of the two operators.
The commutator of position and momentum is [x̂, p̂] = x̂p̂ − p̂x̂ = iℏ. This comes
from the product rule: (x̂p̂ − p̂x̂)ψ = −iℏxψ′ − (−iℏ(xψ)′) = iℏψ.
Note that if α and β are any real constants, then the operators X = x̂ − α,
P = p̂ − β also obey [X, P ] = i~.
P. 7-11
The operators x̂, p̂ and H (for real potentials) are all Hermitian.

We do x̂ first: Since position is real,


⟨φ, x̂ψ⟩ = ∫_{−∞}^{∞} φ(x)* xψ(x) dx = ∫_{−∞}^{∞} (xφ(x))* ψ(x) dx = ⟨x̂φ, ψ⟩.

To show that p̂ is Hermitian, note that −iℏ[φ*ψ]_{−∞}^{∞} = 0 since φ, ψ are normaliz-
able. So using integration by parts

⟨φ, p̂ψ⟩ = ∫_{−∞}^{∞} φ*(−iℏψ′) dx = ∫_{−∞}^{∞} (−iℏφ′)* ψ dx = ⟨p̂φ, ψ⟩.

2 2
h d
To show that H = − 2m dx2
+ V (x) is Hermitian, we want to show hφ, Hψi =
hHφ, ψi. To show this, it suffices to consider the kinetic and potential terms
separately. For the kinetic energy, we want hφ, ψ 00 i = hφ00 , ψi, which is true since
we can integrate by parts twice to obtain hφ, ψ 00 i = −hφ0 , ψ 0 i = hφ00 , ψi. For the
potential term, we have hφ, V (x̂)ψi = hφ, V (x)ψi = hV (x)φ, ψi = hV (x̂)φ, ψi. So
H is Hermitian.
Thus we know that hxiψ , hp̂iψ , hHiψ are all real. Furthermore, observe that X =
x̂ − α and P = p̂ − β are (similarly) Hermitian for any real α, β. Hence

hX 2 iψ = hψ, X 2 ψi = hψ, X(Xψ)i = hXψ, Xψi = kXψk2 ≥ 0.

and same for P . If we choose α = hx̂iψ and β = hp̂iψ , the expressions above say
that (∆x)2ψ and (∆p)2ψ are indeed real and positive.
P. 7-12
<Cauchy-Schwarz inequality> kψkkφk ≥ |hψ, φi| for any normalizable ψ, φ.

Same as [T.4-142].
T. 7-13
<Ehrenfest's theorem>    d/dt ⟨x̂⟩Ψ = (1/m)⟨p̂⟩Ψ,    d/dt ⟨p̂⟩Ψ = −⟨V′(x̂)⟩Ψ
dt m dt

The time evolution of Ψ satisfies the Schrödinger equation iℏΨ̇ = HΨ. So we


have
     
d/dt ⟨x̂⟩Ψ = ⟨Ψ̇, x̂Ψ⟩ + ⟨Ψ, x̂Ψ̇⟩ = ⟨(1/iℏ)HΨ, x̂Ψ⟩ + ⟨Ψ, x̂(1/iℏ)HΨ⟩

Since H is Hermitian, we can move it around and get

= −(1/iℏ)⟨Ψ, H(x̂Ψ)⟩ + (1/iℏ)⟨Ψ, x̂(HΨ)⟩ = (1/iℏ)⟨Ψ, (x̂H − Hx̂)Ψ⟩.

But plugging in H we know

(x̂H − Hx̂)Ψ = −(ℏ²/2m)(xΨ″ − (xΨ)″) + (xVΨ − VxΨ) = (ℏ²/m)Ψ′ = (iℏ/m)p̂Ψ.

Hence we have the first result. The second part is similar. We have
     
d/dt ⟨p̂⟩Ψ = ⟨Ψ̇, p̂Ψ⟩ + ⟨Ψ, p̂Ψ̇⟩ = ⟨(1/iℏ)HΨ, p̂Ψ⟩ + ⟨Ψ, p̂(1/iℏ)HΨ⟩
          = −(1/iℏ)⟨Ψ, H(p̂Ψ)⟩ + (1/iℏ)⟨Ψ, p̂(HΨ)⟩ = (1/iℏ)⟨Ψ, (p̂H − Hp̂)Ψ⟩.

Again, we can compute

(p̂H − Hp̂)Ψ = −iℏ(−ℏ²/2m)((Ψ″)′ − (Ψ′)″) − iℏ((V(x)Ψ)′ − V(x)Ψ′)
            = −iℏV′(x)Ψ.

These are the quantum counterparts to the classical equations of motion.

In general, quantum mechanics can be portrayed in different “pictures”. We will


be using the Schrödinger picture all the time, in which the operators are time-
independent, and the states evolve in time. An alternative picture is the Heisen-
berg picture, in which states are fixed in time, and all the time dependence lie in
the operators. When written in this way, quantum mechanics is even more like
classical mechanics.

E. 7-14
When we proved Ehrenfest’s theorem, the last step was to calculate the [x̂, H] and
[p̂, H]. Commutator relations are important in quantum mechanics. When we first

defined the momentum operator p̂ as −i~ ∂x , you might have wondered where this
definition came from.

This definition naturally comes up if we require that x̂ is “multiply by x” (so that


the delta function δ(x − x₀) is the eigenstate with definite position x₀), and that
x̂ and p̂ satisfies the commutator relation [x̂, p̂] = i~. With this requirements, p̂
must be defined as that derivative. Then one might ask, why would we want x̂
and p̂ to satisfy this commutator relation? It turns out that in classical dynamics,
there is something similar known as the Poisson bracket { · , · }, where we have
{x, p} = 1. To get from classical dynamics to quantum mechanics, we just have
to promote our Poisson brackets to commutator brackets, and multiply by i~.

T. 7-15
<Heisenberg’s uncertainty principle> If ψ is any normalized state (at any
fixed time), then (∆x)ψ (∆p)ψ ≥ ~/2.

Let X = x̂ − hx̂iψ and P = p̂ − hp̂iψ . Then (∆x)2ψ = hψ, X 2 ψi = hXψ, Xψi =


kXψk2 and similarly (∆p)2ψ = kP ψk2 . So
(∆x)ψ(∆p)ψ = ‖Xψ‖‖Pψ‖ ≥ |⟨Xψ, Pψ⟩| ≥ |Im⟨Xψ, Pψ⟩|
            ≥ |(1/2i)(⟨Xψ, Pψ⟩ − ⟨Pψ, Xψ⟩)| = |(1/2i)(⟨ψ, XPψ⟩ − ⟨ψ, PXψ⟩)|
            = |(1/2i)⟨ψ, [X, P]ψ⟩| = (ℏ/2)⟨ψ, ψ⟩ = ℏ/2.
E. 7-16
Consider the normalized Gaussian ψ(x) = (1/(απ))^{1/4} e^{−x²/(2α)}. We would find that
⟨x̂⟩ψ = ⟨p̂⟩ψ = 0. Also (∆x)²ψ = α/2 and (∆p)²ψ = ℏ²/(2α). So (∆x)ψ(∆p)ψ = ℏ/2.
A small α corresponds to ψ sharply peaked around x = 0, ie. it has a rather
definite position, but has a large uncertainty in momentum. On the other hand, if
α is large, ψ is spread out in position but has a small uncertainty in momentum.
When α = ℏ/(mω), this is in fact the lowest energy eigenstate of the harmonic oscillator
with H = p̂²/(2m) + ½mω²x̂², with eigenvalue ½ℏω. We can use the uncertainty
principle to understand why we have a minimum energy of ½ℏω instead of 0. If we


had almost no energy, then we would just sit at the bottom of the potential well
and do nothing, with both a small (and definite) momentum and position. Hence
for uncertainty to occur, we need a non-zero ground state energy.
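
Here is a minimal numerical sketch (not from the original notes, with ℏ set to 1 and an illustrative α) checking the minimum-uncertainty property of this Gaussian.

import numpy as np

hbar, alpha = 1.0, 0.7
x = np.linspace(-20, 20, 40001)
psi = (1/(alpha*np.pi))**0.25 * np.exp(-x**2/(2*alpha))

dpsi = np.gradient(psi, x)
norm   = np.trapz(psi**2, x)                                   # ≈ 1
mean_x = np.trapz(x*psi**2, x)                                 # ≈ 0
var_x  = np.trapz(x**2*psi**2, x) - mean_x**2                  # ≈ α/2
var_p  = np.trapz(psi*(-hbar**2)*np.gradient(dpsi, x), x)      # ⟨p²⟩ ≈ ħ²/(2α), and ⟨p⟩ = 0 here

print(norm, var_x, var_p, np.sqrt(var_x*var_p))                # last value ≈ ħ/2 = 0.5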

7.4 Wavepackets and Scatterings


D. 7-17
• A wavefunction localised in space (about some point, on some scale) is usually
called a wavepacket .
• For a particle of mass m in a potential V (x), a normalisable energy eigenstate is
called a bound state .
• The lowest energy eigenstate is called the ground state . Eigenstates with higher
energies are called excited states .

7.4.1 Wavepacket
When we solve Schrödinger’s equation, what we get is a “wave” that represents the
probability of finding our thing at each position. However, in real life, we don’t think
of particles as being randomly distributed over different places. Instead, particles are
localized to some small regions of space.
These would be represented by wavefunctions in which most
of the distribution is concentrated in some small region.
These are known as wavepackets; the term has a rather loose defi-
nition, and can refer to anything that is localized in space.
A Gaussian wavepacket given by

Ψ0(x, t) = (α/π)^{1/4} (1/γ(t)^{1/2}) e^{−x²/(2γ(t))}
is a solution of the time-dependent Schrödinger equation with V = 0 for γ(t) = α + iℏt/m.
Note that Ψ0(x, 0) is the normalized Gaussian ψ(x) = (1/(απ))^{1/4} e^{−x²/(2α)}. Gaussian
wavepackets are particularly nice wavefunctions. For example, we can show that for
a Gaussian wavepacket, (∆x)ψ(∆p)ψ = ℏ/2 exactly, so the uncertainty is minimized. Sub-
stituting γ(t) = α + iℏt/m into our expression, we find that the probability density is

P0(x, t) = |Ψ0(x, t)|² = (α/π)^{1/2} (1/|γ(t)|) e^{−αx²/|γ(t)|²},
which is peaked at x = 0. This corresponds to a particle at rest at the origin, spreading
out with time.
A related solution to the time-dependent Schrödinger equation with V = 0 is a moving
particle:

Ψu(x, t) = Ψ0(x − ut, t) exp(i(mu/ℏ)x) exp(−i(mu²/(2ℏ))t).
The probability density of this is Pu (x, t) = |Ψu (x, t)|2 = P0 (x − ut, t). So this corre-
sponds to a particle moving with velocity u. Furthermore, we get hp̂iΨu = mu. This
corresponds with the classical momentum, mass × velocity.
We see that wavepackets do indeed behave like particles, in the sense that we can set
them moving and the quantum momentum of these objects do indeed behave like the
classical momentum.
In the limit α → 0, our particle becomes more and more spread out in space. The un-
certainty in the position becomes larger and larger, while the momentum becomes more
and more definite. Then the wavefunction above resembles something like Ψ(x, t) =
Ce^{ikx} e^{−iEt/ℏ}, which is a momentum eigenstate with ℏk = mu and energy E = ½mu² =
ℏ²k²/(2m). Note, however, that this is not normalizable.
E. 7-18
Consider a free particle (zero potential), then the time-dependent Schrödinger
equation is

−(ℏ/2m) ∂²Ψ/∂x² = i ∂Ψ/∂t    (∗)
Suppose we are given at time zero Ψ(x, 0) = ψ(x) and we wish to find Ψ(x, t).
Taking the Fourier transform on (∗) in the x variable we obtain

(ℏk²/2m) Ψ̃(k, t) = i ∂Ψ̃/∂t (k, t)    ⟹    Ψ̃(k, t) = A(k) e^{−iℏk²t/(2m)}

Taking Fourier transform on Ψ(x, 0) = ψ(x) we have Ψ̃(k, 0) = ψ̃(k), so A(k) =


ψ̃(k). Hence

Ψ(x, t) = (1/2π) ∫_{−∞}^{∞} Ψ̃(k, t) e^{ikx} dk = (1/2π) ∫_{−∞}^{∞} ψ̃(k) e^{−iℏk²t/(2m)} e^{ikx} dk

We can use this method to obtain the Gaussian wavepacket in the above example!
Note that the essence of this method is that the Fourier transform turns the ∂²/∂x²
term into multiplication by −k². In fact, equivalently, we can achieve this by expanding
ψ(x) in terms of momentum eigenstates. A numerical version of this procedure is
sketched below.
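
The following is a minimal numerical sketch (not from the original notes, in units ℏ = m = 1) of exactly this procedure: transform the initial wavefunction, multiply each mode by the free-evolution phase e^{−iℏk²t/(2m)}, and transform back.

import numpy as np

hbar = m = 1.0
L, N = 80.0, 2048
x = np.linspace(-L/2, L/2, N, endpoint=False)
k = 2*np.pi*np.fft.fftfreq(N, d=x[1]-x[0])

alpha = 1.0
psi0 = (1/(alpha*np.pi))**0.25 * np.exp(-x**2/(2*alpha))     # Gaussian at t = 0

def evolve(psi, t):
    """Free-particle evolution: multiply each Fourier mode by exp(-i ħ k² t / 2m)."""
    return np.fft.ifft(np.fft.fft(psi)*np.exp(-1j*hbar*k**2*t/(2*m)))

for t in (0.0, 2.0, 5.0):
    psi_t = evolve(psi0, t)
    norm = np.trapz(np.abs(psi_t)**2, x)
    width = np.sqrt(np.trapz(x**2*np.abs(psi_t)**2, x))
    print(t, round(norm, 6), round(width, 3))     # the norm stays 1 while the packet spreads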

7.4.2 Scattering
Consider the time-dependent Schrödinger equation with a potential barrier. We would
like to send a wavepacket Ψ (moving with some velocity u) towards the barrier and
see what happens. Classically, we would expect the particle to either pass through the
barrier or get reflected. However, in quantum mechanics, we would expect it to “partly”
pass through and “partly” get reflected, ending up in a superposition AΨref + BΨtr.
Here Ψ, Ψref and Ψtr are normalized wavefunctions, and Pref = |A|² and Ptr = |B|² are
the probabilities of reflection and transmission respectively.
This is generally hard to solve. Scattering problems are much simpler to solve for
momentum eigenstates of the form eikx . However, these are not normalizable wave-
functions, and despite being mathematically convenient, we are not allowed to use
them directly, since they do not make sense physically. In some sense, they represent
particles that are “infinitely spread out” and can appear anywhere in the universe with
equal probability, which doesn’t really make sense.

There are two ways we can get around this problem. We know that we can construct
normalized momentum eigenstates for a single particle confined in a box −ℓ/2 ≤ x ≤ ℓ/2,
namely ψ(x) = e^{ikx}/√ℓ, where the periodic boundary conditions require ψ(x + ℓ) =
ψ(x), ie. k = 2πn/ℓ for some integer n. After calculations have been done, the box can
be removed by taking the limit ℓ → ∞.

Identical results are obtained more conveniently by allowing Ψ(x, t) to represent beams
of infinitely many particles, with |Ψ(x, t)|2 being the density of the number of particles
(per unit length) at x, t. When we do this, instead of having one particle and watching
it evolve, we constantly send in particles so that the system does not appear to change
with time. This allows us to find steady states, which mathematically corresponds to
finding solutions to the Schrödinger equation that do not change with time. To de-
termine, say, the probability of reflection, roughly speaking, we look at the proportion
of particles moving left compared to the proportion of particles moving right in this
steady state. In principle, this interpretation is obtained by considering a constant
stream of wavepackets and using some limiting/averaging procedure, but we usually
don’t care about these formalities.

For these particle beams, Ψ(x, t) is bounded, but no longer normalizable. Recall that
for a single particle, the probability current was defined as

j(x, t) = −(iℏ/2m)(Ψ*Ψ′ − ΨΨ′*).

If we have a particle beam instead of a particle, and Ψ is the particle density instead
of the probability distribution, j now represents the flux of particles at x, t, ie. the
number of particles passing the point x in unit time.

Recall that a stationary state of energy E is of the form Ψ(x, t) = ψ(x)e^{−iEt/ℏ}. We
have |Ψ(x, t)|² = |ψ(x)|² and j(x, t) = −(iℏ/2m)(ψ*ψ′ − ψψ′*). Often, when solving a
scattering problem, the solution will involve sums of momentum eigenstates. So it
helps to understand these better.

Our momentum eigenstates are ψ(x) = Ce^{ikx}, which are solutions to the time-independent
Schrödinger equation with V = 0 with E = ℏ²k²/(2m). Applying the momentum op-
erator, we find that p = ~k is the momentum of each particle in the beam, and
|ψ(x)|2 = |C|2 is the density of particles in the beam. We can also evaluate the current
to be j = ~k|C|2 /m. This makes sense — ~k/m = p/m is the velocity of the particles,
and |C|2 is how many particles we have. So this still roughly corresponds to what we
used to have classically.
In scattering problems, we will seek the transmitted and reflected flux jtr , jref in terms
of the incident flux jinc , and the probabilities for transmission and reflection are then
given by
Ptr = |jtr|/|jinc|,    Pref = |jref|/|jinc|.
E. 7-19
<Potential step> Consider the time-independent
Schrödinger equation for a step potential

V(x) = { 0   x ≤ 0
       { U   x > 0,

where U > 0 is a constant. The Schrödinger equation is −(ℏ²/2m)ψ″ + V(x)ψ = Eψ.
We require ψ and ψ to be continuous at x = 0. We can consider two different
cases:
1. 0 < E < U : We apply the standard method, introducing constants k, κ > 0
such that E = ℏ²k²/(2m), U − E = ℏ²κ²/(2m). Then the Schrödinger equation
becomes

ψ″ + k²ψ = 0 for x < 0,    ψ″ − κ²ψ = 0 for x > 0
⟹  ψ = { I e^{ikx} + R e^{−ikx}   x < 0
       { C e^{−κx}              x > 0

We only have ψ = Ce^{−κx} for x > 0 since ψ has to be bounded. Since ψ and
ψ′ are continuous at x = 0, we have the equations

I + R = C,    ikI − ikR = −κC    ⟹    R = ((k − iκ)/(k + iκ)) I,    C = (2k/(k + iκ)) I.

If x < 0, ψ(x) is a superposition of beams (momentum eigenstates) with |I|²
particles per unit length in the incident part, and |R|² particles per unit length
in the reflected part, with p = ±ℏk. The current is

j = jinc + jref = (ℏk/m)|I|² − (ℏk/m)|R|².

The probability of reflection is Pref = |jref|/|jinc| = |R|²/|I|² = 1, which makes
sense. On the right hand side, we have j = 0. So Ptr = 0. However, |ψ(x)|² ≠ 0
in this classically forbidden region.
2. E > U : This time, we set E = ℏ²k²/(2m) and E − U = ℏ²κ²/(2m) with k, κ > 0.
Then the Schrödinger equation becomes

ψ″ + k²ψ = 0 for x < 0,    ψ″ + κ²ψ = 0 for x > 0
⟹  ψ = { I e^{ikx} + R e^{−ikx}   x < 0
       { T e^{iκx}              x > 0

Note that it is in principle possible to get an e−iκx term on x > 0, but this
would correspond to sending in a particle from the right. We, by choice, assume
there is no such term. We now match ψ and ψ 0 at x = 0. Then we get the
equations
I + R = T,    ikI − ikR = iκT    ⟹    R = ((k − κ)/(k + κ)) I,    T = (2k/(k + κ)) I.

Our flux on the left and the right are


jleft = jinc + jref = (ℏk/m)|I|² − (ℏk/m)|R|²,    jright = jtr = (ℏκ/m)|T|².
The probability of reflection and transmission are

Pref = |jref|/|jinc| = |R|²/|I|² = ((k − κ)/(k + κ))²,
Ptr = |jtr|/|jinc| = (|T|²/|I|²)(κ/k) = 4kκ/(k + κ)².
Note that Pref + Ptr = 1. Classically, we would expect all particles to be trans-
mitted, since they all have sufficient energy. However, quantum mechanically,
there is still a probability of being reflected.
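
A minimal numerical sketch (not in the original notes, with ℏ = m = 1 and an illustrative step height) of these formulas, checking that Pref + Ptr = 1 for E > U:

import numpy as np

hbar = m = 1.0
U = 1.0                                  # illustrative step height

def step_probabilities(E):
    """Reflection/transmission probabilities for the potential step, E > U."""
    k = np.sqrt(2*m*E)/hbar
    kappa = np.sqrt(2*m*(E - U))/hbar
    P_ref = ((k - kappa)/(k + kappa))**2
    P_tr = 4*k*kappa/(k + kappa)**2
    return P_ref, P_tr

for E in (1.1, 1.5, 3.0, 10.0):
    P_ref, P_tr = step_probabilities(E)
    print(E, round(P_ref, 4), round(P_tr, 4), round(P_ref + P_tr, 4))   # the sum is 1
# classically every particle with E > U would be transmitted; here P_ref > 0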
E. 7-20
<Potential barrier> Consider the following potential:

V(x) = { 0   x ≤ 0
       { U   0 < x < a
       { 0   x ≥ a

We consider a stationary state with energy E with 0 < E < U. We set the con-
stants E = ℏ²k²/(2m) and U − E = ℏ²κ²/(2m). Then the Schrödinger equations
become

ψ″ + k²ψ = 0 for x < 0,    ψ″ − κ²ψ = 0 for 0 < x < a,    ψ″ + k²ψ = 0 for x > a

⟹  ψ = { I e^{ikx} + R e^{−ikx}   x < 0
       { A e^{κx} + B e^{−κx}    0 < x < a
       { T e^{ikx}               x > a

Matching ψ and ψ 0 at x = 0 and a gives the equations


I + R = A + B,    A e^{κa} + B e^{−κa} = T e^{ika},
ik(I − R) = κ(A − B),    κ(A e^{κa} − B e^{−κa}) = ik T e^{ika}.
We can solve these to obtain
I + ((κ − ik)/(κ + ik)) R = T e^{ika} e^{−κa}    and    I + ((κ + ik)/(κ − ik)) R = T e^{ika} e^{κa}.

After some algebra, we obtain

T = I e^{−ika} (cosh κa − i ((k² − κ²)/(2kκ)) sinh κa)^{−1}.
To interpret this, we use the currents for x < 0 and x > a which are
j_{x<0} = jinc + jref = (ℏk/m)(|I|² − |R|²)    and    j_{x>a} = jtr = (ℏk/m)|T|².

We can use these to find the transmission probability, and it turns out to be

Ptr = |jtr|/|jinc| = |T|²/|I|² = (1 + (U²/(4E(U − E))) sinh² κa)^{−1}.
This demonstrates quantum tunneling . There is a non-zero probability that the
particles can pass through the potential barrier even though it classically does not
have enough energy. In particular, for κa  1, the probability of tunneling decays
as e−2κa . This is important, since it allows certain reactions with high potential
barrier to occur in practice even if the reactants do not classically have enough
energy to overcome it.
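
A small numerical sketch (not in the original notes, with ℏ = m = 1 and illustrative U, E) of this transmission formula, showing the e^{−2κa} suppression for a wide barrier:

import numpy as np

hbar = m = 1.0
U, E = 2.0, 1.0                              # illustrative barrier height and energy, E < U
kappa = np.sqrt(2*m*(U - E))/hbar

def P_tr(a):
    """Tunneling probability through a square barrier of width a."""
    return 1.0/(1.0 + U**2/(4*E*(U - E))*np.sinh(kappa*a)**2)

for a in (1.0, 2.0, 4.0, 8.0):
    print(a, P_tr(a), np.exp(-2*kappa*a))    # for κa >> 1, P_tr tracks e^{-2κa} up to a prefactor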

7.4.3 General features of stationary states


We will look at the difference between bound states and scattering states in general.
Consider the time-independent Schrödinger equation for a particle of mass m
Hψ = −(ℏ²/2m)ψ″ + V(x)ψ = Eψ,
with the potential V (x) → 0 as x → ±∞. This is a second order ordinary differential
equation for a complex function ψ, and hence there are two complex constants in
general solution. However, since this is a linear equation, this implies that 1 complex
constant corresponds to changing ψ 7→ λψ, which gives no change in physical state.
So we just have one constant to mess with.
As |x| → ∞, our equation simply becomes −(ℏ²/2m)ψ″ = Eψ. So we get

ψ ∼ { A e^{ikx} + B e^{−ikx}   E = ℏ²k²/(2m) > 0
    { A e^{κx} + B e^{−κx}    E = −ℏ²κ²/(2m) < 0.
So we get two kinds of stationary states depending on the sign of E. These correspond
to bound states and scattering states.

Bound state solutions (E < 0)


If we want ψ to be normalizable, then there are 2 boundary conditions for ψ:
ψ ∼ { A e^{κx}    x → −∞
    { B e^{−κx}   x → +∞

This is an overdetermined system, since we have too many boundary conditions (we
have two conditions requiring no exponential growing on either side). Solutions exist
only when we are lucky, and only for certain values of E. So bound state energy levels
are quantized. We may find several bound states or none.

Scattering state solutions (E > 0)


Now ψ is not normalized but bounded. We can view this as particle beams, and the
boundary conditions determines the direction of the incoming beam. So we have
ψ ∼ { I e^{ikx} + R e^{−ikx}   x → −∞
    { T e^{ikx}              x → +∞

This is no longer overdetermined since we have more free constants (we only used up
one by requiring no e−ikx term on x → ∞). The solution for any E > 0 (imposing
condition on one complex constant) gives
j ∼ { jinc + jref = |I|²ℏk/m − |R|²ℏk/m   x → −∞
    { jtr = |T|²ℏk/m                     x → +∞

We also get the reflection and transmission probabilities

Pref = |Aref|² = |jref|/|jinc|,    Ptr = |Atr|² = |jtr|/|jinc|,

where Aref (k) = R/I and Atr (k) = T /I are the reflection and transmission amplitudes .
In quantum mechanics, “amplitude” generally refers to things that give probabilities
when squared.

7.5 Postulates for quantum mechanics


D. 7-21
• An operator is a mapping from one vector space to another. An operator is
linear if it is a linear map.
• Let V be a (complex) inner product space and A a linear operator V → V , i.e.
A ∈ End(V ). The Hermitian conjugate or adjoint of A, denoted A† is defined to
be the unique operator satisfying hφ, A† ψi = hAφ, ψi. And A is called Hermitian
or self-adjoint if A† = A.
• Let V be a vector space over F and A ∈ End(V ). A non zero v ∈ V is said to be
an eigenvector with eigenvalue λ ∈ F if Av = λv. The set of all eigenvalues of
A are called the spectrum of A.
P. 7-22
Let Q be Hermitian operator4 on the inner product space V , then
1. Eigenvalues of Q are real, and eigenvectors of Q corresponding to different
eigenvalues are orthogonal.
2. The set of eigenvectors is complete. That is, any vector in V can be written as
a (possibly infinite) linear combination of eigenvectors of Q, ie. eigenvectors
of Q provide a basis for V .5

1. Since Q is Hermitian, ⟨χ, Qχ⟩ = ⟨Qχ, χ⟩. Let χ be an eigenvector with
eigenvalue λ. Then λ⟨χ, χ⟩ = ⟨χ, λχ⟩ = ⟨λχ, χ⟩ = λ*⟨χ, χ⟩. Since ⟨χ, χ⟩ ≠ 0,
we must have λ = λ*. So λ is real.
Suppose Qχ = λχ and Qφ = µφ with λ ≠ µ. Then λ⟨φ, χ⟩ = ⟨φ, λχ⟩ = ⟨φ, Qχ⟩ =
⟨Qφ, χ⟩ = ⟨µφ, χ⟩ = µ*⟨φ, χ⟩ = µ⟨φ, χ⟩. Since λ ≠ µ, ⟨φ, χ⟩ = 0.

Note the similarity and difference of this and [L.4-155].


4
Note that when we say Hermitian operator, it is by definition a linear operator.
5
A more precise and accurate statement of this result is: If Q is a compact self-adjoint operator
on a Hilbert space V , then there is an orthonormal basis of V consisting of eigenvectors of Q.

2. We will not prove this. Note that the finite dimensional case is proved in
[T.4-156], however in this chapter we mostly care about the case when V is
infinite dimensional, like function spaces. Note also the relation of this result
to [T.6-18].

Note that using this result, given a wavefunction ψ(x) (say at time 0), we can
expand it in energy eigenstates Hχn = En χn to get ψ = Σn αn χn, and,
just as we did for stationary states, its subsequent time evolution
according to the time-dependent Schrödinger equation is simply
Ψ(x, t) = Σn αn χn(x) e^{−iEn t/ℏ}.

C. 7-23
<Postulates for quantum mechanics>
1. <States> The state of a quantum system at a given time correspond to non-
zero elements of a complex complete inner product space V .6 Two elements of
V that are a (non-zero) multiple of each other are physically equivalent.
2. <Observables> Each observable (i.e. measurable quantity) Q has a corre-
sponding Hermitian (self-adjoint) operator Q̂.
3. <Measurement> If Q̂ has a discrete spectrum and ψ ∈ V is a normalised
state, then we have the following: A measurement of Q in the system represented
by ψ would be one of the eigenvalues of Q̂. The probability of obtaining the
eigenvalue λn is Pn = |αn |2 where αn = hψλn , ψi/hψλn , ψλn i is called the
amplitude and ψλn is the projection of ψ onto the eigenspace corresponding
to λn . The measurement is instantaneous and forces the system into the state
ψλn . That is our ψ turns into ψλn after the measurement.
Note that with appropriate conditions, [P.7-22] says that the set of eigen-
states forms an orthonormal basis for V, so for each eigenvalue λn, we can
pick a normalised eigenvector χn from the eigenspace of λn so that the
normalised ψ can be written as ψ = Σn αn χn with αn = ⟨χn, ψ⟩. Note
that by construction all the χn have different eigenvalues. The measurement
then forces the system into the state χn. That is, our ψ turns into χn after
the measurement.
Also note that we assume above that Q̂ has a discrete spectrum, if that’s
not the case (e.g. the position operator x̂ = x) then there is a different
way to find out the probability of the measurements. Note that we can
view x̂ as either having no eigenvalues/eigenvectors, or that the Dirac delta
functions δ(x − λ) are its eigenvectors so that all of R is its eigenvalues.
4. <Dynamics> The time evolution of a state Ψ(x, t) of a quantum system obeys
the time-dependent Schrödinger equation i~Ψ̇ = HΨ where H is a Hermitian
operator, the Hamiltonian ; this holds at all times except at the instant a
measurement is made.

6
V has nothing to do with the potential. More precisely V should be a Hilbert space, and it is
infinite dimensional in general.

E. 7-24
So far, we have worked with the vector space of functions (that are sufficiently
smooth), and used the integral as the inner product. However, in fact, we can
work with arbitrary complex vector spaces and arbitrary inner products.
Postulate 3 says that our ψ turns into χn after the measurement, which is rather
weird. However, it make sense because if we measure the state of the system,
and get, say, 3, then if we measure it immediately afterwards, we would expect to
get the result 3 again with certainty, instead of being randomly distributed like
the original state. So after a measurement, the system must be forced into the
corresponding eigenstate.
Note that these axioms are consistent in the following sense. If ψ is normalized,
then 1 = ⟨ψ, ψ⟩ = Σn Pn because

⟨ψ, ψ⟩ = ⟨Σm αm χm, Σn αn χn⟩ = Σ_{m,n} αm* αn ⟨χm, χn⟩ = Σ_{m,n} αm* αn δmn = Σn |αn|².

So if the state is normalized, then the sum of probabilities is indeed 1.


P. 7-25
1. The expectation value of Q in state ψ is ⟨Q⟩ψ = ⟨ψ, Q̂ψ⟩ = Σn λn Pn.
2. The uncertainty (∆Q)ψ is given by

(∆Q)²ψ = ⟨(Q̂ − ⟨Q⟩ψ)²⟩ψ = ⟨Q²⟩ψ − ⟨Q⟩²ψ = Σn (λn − ⟨Q⟩ψ)² Pn.

1. Write ψ = Σn αn χn, then Q̂ψ = Σn αn λn χn. So we have

⟨ψ, Q̂ψ⟩ = ⟨Σm αm χm, Σn αn λn χn⟩ = Σn αn* αn λn = Σn λn Pn.

2. (Q̂ − ⟨Q⟩ψ)²χn = λn²χn − 2λn⟨Q⟩ψ χn + ⟨Q⟩²ψ χn = (λn − ⟨Q⟩ψ)²χn, hence done
by the first part.
From this, we see that ψ is an eigenstate of Q̂ with eigenvalue λ if and only if
hQiψ = λ and (∆Q)ψ = 0.
E. 7-26
Consider the harmonic oscillator as in [E.7.2.3], with the operator Q̂ = H with eigen-
functions χn = ψn and eigenvalues λn = En = ℏω(n + ½). Suppose we have
prepared our system in the state

ψ = (1/√6)(ψ0 + 2ψ1 − iψ4).

Then the coefficients are α0 = 1/√6, α1 = 2/√6, α4 = −i/√6. This is normalized
since ‖ψ‖² = |α0|² + |α1|² + |α4|² = 1. Measuring the energy would give answers
with the probabilities in the table below. If a measurement gives E1, then ψ1 is
the new state immediately after measurement.

Energy            Probability
E0 = ½ℏω          P0 = 1/6
E1 = (3/2)ℏω      P1 = 2/3
E4 = (9/2)ℏω      P4 = 1/6

The expectation value of energy is


 
⟨H⟩ψ = Σn En Pn = (½ · 1/6 + 3/2 · 2/3 + 9/2 · 1/6) ℏω = (11/6) ℏω.

Note that the measurement postulate tells us that after measurement, the system is
then forced into the eigenstate. So when we said that we interpret the expectation
value as the “average result for many measurements”, we do not mean measuring
a single system many many times. Instead, we prepare a lot of copies of the system
in state ψ, and measure each of them once.
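
A tiny numerical sketch (not in the original notes, with energies in units of ℏω) of this bookkeeping:

import numpy as np

# amplitudes of ψ = (ψ0 + 2ψ1 - iψ4)/√6 in the energy eigenbasis
alphas = {0: 1/np.sqrt(6), 1: 2/np.sqrt(6), 4: -1j/np.sqrt(6)}
E = {n: n + 0.5 for n in alphas}              # E_n = (n + 1/2) in units of ħω

probs = {n: abs(a)**2 for n, a in alphas.items()}
print(probs)                                  # {0: 1/6, 1: 2/3, 4: 1/6}
print(sum(probs.values()))                    # 1.0, the state is normalized
print(sum(probs[n]*E[n] for n in probs))      # 11/6 ≈ 1.833, i.e. <H> = (11/6) ħω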
E. 7-27
Consider the normalised energy eigenstates Hψn = En ψn with ⟨ψm, ψn⟩ = δmn.
Then we have certain simple solutions of the (time-dependent) Schrödinger equa-
tion of the form Ψn = ψn e^{−iEn t/ℏ}. In general, given an initial state Ψ(0) =
Σn αn ψn, since the Schrödinger equation is linear, we can get the following solu-
tion for all time:

Ψ(t) = Σn αn e^{−iEn t/ℏ} ψn.

For example, consider the harmonic oscillator with initial state Ψ(0) = (1/√6)(ψ0 +
2ψ1 − iψ4). Then Ψ(t) is given by

Ψ(t) = (1/√6)(ψ0 e^{−iωt/2} + 2ψ1 e^{−3iωt/2} − iψ4 e^{−9iωt/2}).
T. 7-28
<Ehrenfest’s theorem (General form)> If Q is any operator with no explicit
time dependence, then (recall that [Q, H] = QH − HQ)

iℏ d/dt ⟨Q⟩Ψ = ⟨[Q, H]⟩Ψ.

If Q does not have time dependence, then

iℏ d/dt ⟨Ψ, QΨ⟩ = ⟨−iℏΨ̇, QΨ⟩ + ⟨Ψ, Q iℏΨ̇⟩ = ⟨−HΨ, QΨ⟩ + ⟨Ψ, QHΨ⟩
               = ⟨Ψ, (QH − HQ)Ψ⟩ = ⟨Ψ, [Q, H]Ψ⟩.

If Q has explicit time dependence, then we have an extra term on the right,
and have iℏ d/dt ⟨Q⟩Ψ = ⟨[Q, H]⟩Ψ + iℏ⟨Q̇⟩Ψ. These general versions correspond to
classical equations of motion in Hamiltonian formalism.
C. 7-29
<Discrete and continuous spectra> In stating the measurement postulates,
we have assumed that our spectrum of eigenvalues of Q̂ was discrete and eigenstates
normalisable, and we got nice results about measurements. However, this is not
always the case.
One way to get around this problem is to put the system in a “box” of length ℓ
with suitable boundary conditions on ψ(x) (so that we are forced to have only certain
discrete eigenvalues with normalisable eigenvectors). We can then take ℓ → ∞ at
the end of the calculation.
Alternatively, when we have a continuous spectrum, we can proceed analogous
to the discrete case. Not caring too much about rigour below, suppose we have

Qχξ = λξ χξ for a continuous label ξ instead of the discrete label n. And we have
eigenstates χξ with orthonormality conditions ⟨χξ, χη⟩ = δ(ξ − η), where we replaced
our old δmn with δ(ξ − η), the Dirac delta function. To expand ψ in eigenstates, the
discrete sum becomes an integral, so we have ψ = ∫ αξ χξ dξ where αξ = ⟨χξ, ψ⟩. In
the discrete case, |αn|² is the probability mass function. The obvious generalization
here would be to let |αξ|² be our probability density function. More precisely,
∫_a^b |αξ|² dξ is the probability that the result corresponds to λξ with a ≤ ξ ≤ b.

E. 7-30
• Consider p̂ = −iℏ d/dx. We have p̂ e^{ikx} = ℏk e^{ikx}. So ℏk is an eigenvalue for all k ∈ R,
so the spectrum of p̂ is continuous; also the eigenvectors e^{ikx} are not normalisable
on (−∞, ∞).
Consider ψ(x) with periodic boundary conditions ψ(x + ℓ) = ψ(x). So we can
restrict to −ℓ/2 ≤ x ≤ ℓ/2. The eigenstates of p̂ = −iℏ d/dx are now χn(x) =
e^{i kn x}/√ℓ where kn = 2πn/ℓ, with eigenvalues λn = ℏkn, discrete and normalised.
The states are orthonormal on −ℓ/2 ≤ x ≤ ℓ/2. We expand ψ(x) in terms of
the eigenstates to get a complex Fourier series ψ(x) = Σn αn χn(x), where the
amplitudes are given by αn = ⟨χn, ψ⟩. Taking the limit as ℓ → ∞, the Fourier series
becomes a Fourier integral.
• Consider the particle in one dimension with position as our operator. The eigen-
states of x̂ are χξ (x) = δ(x − ξ) with corresponding eigenvalue λξ = ξ. This comes
from x̂χξ (x) = xδ(x − ξ) = ξδ(x − ξ) = ξχξ (x) since δ(x − ξ) is non-zero only when
x = ξ. With these eigenstates, we can expand
ψ(x) = ∫ αξ χξ (x) dξ = ∫ αξ δ(x − ξ) dξ = αx .
So our coefficients are given by αξ = ψ(ξ). So ∫ab |ψ(ξ)|2 dξ is indeed the probability of measuring the particle to be in [a, b]. So we recover our original interpretation of the wavefunction.
D. 7-31
For any observable Q, the number of linearly independent eigenstates with eigen-
value λ is the degeneracy of the eigenvalue. In other words, the degeneracy
is the dimension of the eigenspace Vλ = {ψ : Qψ = λψ}. An eigenvalue is
non-degenerate if the degeneracy is exactly 1, and is degenerate if the degen-
eracy is more than 1. We say two states are degenerate if they have the same
eigenvalue.
T. 7-32
<Uncertainty principle (General form)> Let A and B be observables. If ψ is any normalized state (at any fixed time), then (∆A)ψ (∆B)ψ ≥ (1/2)|h[A, B]iψ |.

Let α = Â − hÂiψ and β = B̂ − hB̂iψ . Then (∆A)ψ² = hψ, α2 ψi = hαψ, αψi = kαψk2 and similarly (∆B)ψ² = kβψk2 . Also note that [α, β] = [A, B]. So
(∆A)ψ (∆B)ψ = kαψk kβψk ≥ |hαψ, βψi| ≥ | Imhαψ, βψi|
≥ |(1/2i)(hαψ, βψi − hβψ, αψi)| = |(1/2i)(hψ, αβψi − hψ, βαψi)|
= |(1/2i)hψ, [α, β]ψi| = (1/2)|h[A, B]iψ |

P. 7-33
Let A and B be operators resulting from observables. A and B are simultaneously diagonalizable (i.e. there exists a basis of joint eigenstates) if and only if A and B commute (ie. [A, B] = 0).

(Forward) Suppose there exist a basis of joint eigenstates {χn }. So we have Aχn =
λn χn and Bχn = µn χn . So now [A, B]χn = (AB −BA)χn = λn µn χn −µn λn χn =
0 for all n. Since {χn } is a basis for V , we must have [A, B] = 0.
(Backward) Suppose [A, B] = 0. For any eigenvalue λ of A, consider the eigenspace
Vλ = {ψ : Aψ = λψ}. Now if ψ ∈ Vλ , then A(Bψ) = BAψ = λ(Bψ), so Bψ ∈ Vλ .
So B|Vλ ∈ End(Vλ ). Now V is the direct sum of the eigenspaces Vλ over all possible
eigenvalues λ since V is spanned by eigenstates of A. So given any basis for each
of the eigenspaces, their union is a basis for V . Since B is hermitian, B|Vλ is also
hermitian. It follows that Vλ has a basis of eigenstates of B, and all its elements are eigenstates of A too (with eigenvalue λ) by definition. Since this holds for
every Vλ , this provides a basis of joint eigenstates for V .
The proof is similar to [T.4-105], although now we are in an infinite dimensional space, so we have to work around that.
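As a finite-dimensional toy illustration (added here, not in the notes, where the spaces are infinite dimensional), one can verify numerically that two commuting Hermitian matrices share an orthonormal eigenbasis: diagonalise A, then diagonalise B restricted to each eigenspace of A. The matrices below are an arbitrary example built from a common eigenbasis.

import numpy as np

rng = np.random.default_rng(0)
n = 5
Q, _ = np.linalg.qr(rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)))
A = Q @ np.diag([1, 1, 2, 3, 3]) @ Q.conj().T     # deliberately degenerate
B = Q @ np.diag([5, 7, 7, 7, 2]) @ Q.conj().T
print(np.allclose(A @ B, B @ A))                  # True: they commute

evals, V = np.linalg.eigh(A)
joint = []
for lam in np.unique(np.round(evals, 8)):
    cols = V[:, np.isclose(evals, lam)]           # basis of the eigenspace V_lambda
    Bsub = cols.conj().T @ B @ cols               # B restricted to V_lambda (Hermitian)
    _, W = np.linalg.eigh(Bsub)
    joint.append(cols @ W)                        # joint eigenvectors of A and B
U = np.hstack(joint)
DA, DB = U.conj().T @ A @ U, U.conj().T @ B @ U
print(np.allclose(DA, np.diag(np.diag(DA))), np.allclose(DB, np.diag(np.diag(DB))))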
E. 7-34
<Commuting observables> In one dimension, the energy bound states are
always non-degenerate. However, in three dimensions, energy bound states may
be degenerate. If λ is degenerate, then there is a large freedom in choosing an
orthonormal basis for Vλ . Physically, we cannot distinguish degenerate states
by measuring Q alone. So we would like to measure something else as well to
distinguish these eigenstates. When can we do this? We have previously seen
that we cannot simultaneously measure position and momentum. It turns out the
criterion for whether we can do this is simple.
Recall that after performing a measurement, the state is forced into the corre-
sponding eigenstate. Hence, to simultaneously measure two observables A and
B, we need the state to be simultaneously an eigenstate of A and B. In other
words, simultaneous measurement is possible if and only if there is a basis for V
consisting of simultaneous or joint eigenstates χn with

Aχn = λn χn , Bχn = µn χn .

Our measurement axioms imply that if the state is in χn , then measuring A, B


in rapid succession, in any order, will give definite results λn and µn respectively
(assuming the time interval between each pair of measurements is short enough
that we can neglect the evolution of state in time).
A necessary and sufficient condition for A and B to be simultaneously measurable
(ie. simultaneously diagonalizable) is for A and B to commute, ie. [A, B] = 0.
This is consistent with a “generalized uncertainty relation”
1
(∆A)ψ (∆B)ψ ≥ |h[A, B]iψ |,
2
since if we if have a state that is simultaneously an eigenstate for A and B, then
the uncertainties on the left would vanish. So h[A, B]iψ = 0.
This will be a technique we will use to tackle degeneracy in general. If we have
a system where, say, the energy eigenvalue λ (with eigenspace VλH ) is degenerate,

we will attempt to find another operator A that commutes with H, then we can
find a common eigenstate, say with eigenvalue µ with respect to A. Then if VλH 6= VµA ,
we would be able to tell apart eigenstates in VλH by grouping them into those that
are in VλH ∩ VµA and those that are not. So we are able to further classify the
underlying states. When dealing with the hydrogen atom later, we will use the
angular momentum to separate the degenerate energy eigenstates.
A complete set of commuting observables (CSCO) is a set of commuting opera-
tors whose eigenvalues completely specify the state of a system.

7.6 Quantum mechanics in three dimensions


To begin with, we translate everything we’ve had for the one-dimensional world into
the three-dimensional setting.
A quantum state of a particle in three dimensions is given by a wavefunction ψ(x) at fixed time, or Ψ(x, t) for a state evolving in time. The inner product is defined as hϕ, ψi = ∫ ϕ(x)∗ ψ(x) d3x, where we adopt the convention that no limits on the integral means integrating over all space. If ψ is normalized, ie. kψk2 = hψ, ψi = ∫ |ψ(x)|2 d3x = 1, then the probability of measuring the particle to be inside a small volume δV (containing x) is |ψ(x)|2 δV . The position and momentum are Hermitian operators
x̂ = (x̂1 , x̂2 , x̂3 ) where x̂i ψ = xi ψ
p̂ = (p̂1 , p̂2 , p̂3 ) = −i~∇ = −i~ (∂/∂x1 , ∂/∂x2 , ∂/∂x3 )
We have the canonical commutation relations

[x̂i , p̂j ] = i~δij , [x̂i , x̂j ] = [p̂i , p̂j ] = 0.

We see that position and momentum in different directions don’t come into conflict.
We can have a definite position in one direction and a definite momentum in another
direction. The uncertainty principle only kicks in when we are in the same direction.
In particular, we have
(∆xi )(∆pj ) ≥ (~/2) δij .
Similar to what we did in classical mechanics, we assume our particles are structureless , that is, all of their observables can be written in terms of position and momentum. In reality, many particles are not structureless, and (at least) possess an
additional quantity known as “spin”. We will not study these and will only mention
it briefly near the end.
The Hamiltonian for a structureless particle in a potential V is

H ≡ p̂2 /2m + V (x̂) = −(~2 /2m)∇2 + V (x).
The time-dependent Schrödinger equation is i~ ∂Ψ/∂t = HΨ. The probability current and the conservation equation which it obeys are
j ≡ −(i~/2m)(Ψ∗ ∇Ψ − Ψ∇Ψ∗ ) and ∂/∂t |Ψ(x, t)|2 = −∇ · j.

The conservation equation implies that for any fixed volume V with boundary ∂V ,
d/dt ∫V |Ψ(x, t)|2 d3x = − ∫V ∇ · j d3x = − ∫∂V j · dS.
So if |Ψ(x, t)| → 0 sufficiently rapidly as |x| → ∞, then the boundary term disappears and d/dt ∫ |Ψ(x, t)|2 d3x = 0. This is the conservation of probability (or normalization).

7.6.1 Separable eigenstate solutions


Sometimes we can solve the three dimensional problem by reducing it to the one-
dimensional problems, using the symmetry of the system. For example, we will solve
the hydrogen atom exploiting the fact that the potential is spherically symmetric, and
then use the method of separation of variables. Consider the simpler case where we only
have two dimensions. The time-independent Schrödinger equation then gives
Hψ = −(~2 /2m)(∂2 /∂x12 + ∂2 /∂x22 )ψ + V (x1 , x2 )ψ = Eψ.
We are going to consider potentials of the form V (x1 , x2 ) = U1 (x1 ) + U2 (x2 ). The
Hamiltonian then splits into
H = H1 + H2 where Hi = −(~2 /2m) ∂2 /∂xi2 + Ui (xi ).
We look for separable solutions of the form ψ = χ1 (x1 )χ2 (x2 ). The Schrödinger equation then gives (upon division by ψ = χ1 χ2 )
(−(~2 /2m) χ1''/χ1 + U1 ) + (−(~2 /2m) χ2''/χ2 + U2 ) = E.
Since each term is independent of x2 and x1 respectively, we have H1 χ1 = E1 χ1 and
H2 χ2 = E2 χ2 with E1 + E2 = E. This is the usual separation of variables, but here
we can interpret this physically — in this scenario, the two dimensions are de-coupled,
and we can treat them separately. The individual E1 and E2 are just the contributions
from each component to the energy. Thus the process of separation of variables can be seen as looking for joint or simultaneous eigenstates of H1 and H2 , noting the fact
that [H1 , H2 ] = 0.
E. 7-35
Consider the harmonic oscillator with
V = (1/2) mω2 kxk2 = (1/2) mω2 (x12 + x22 ),
so Ui (xi ) = (1/2) mω2 xi2 . We have Hi = H0 (x̂i , p̂i ) with H0 the usual one-dimensional harmonic oscillator Hamiltonian. Using the previous results for the one-dimensional harmonic oscillator, we have χ1 = ψn1 (x1 ) and χ2 = ψn2 (x2 ) where ψi is the ith eigenstate of the one-dimensional harmonic oscillator, and n1 , n2 ∈ {0, 1, 2, · · · }. The corresponding energies are Ei = ~ω(ni + 1/2). The energy eigenvalues of the two-dimensional oscillator are thus
E = ~ω (n1 + n2 + 1) for ψ(x1 , x2 ) = ψn1 (x1 )ψn2 (x2 ).

We have the following energies and states:


Ground state: E = ~ω, with ψ = ψ0 (x1 )ψ0 (x2 ).
1st excited state: E = 2~ω, with ψ = ψ1 (x1 )ψ0 (x2 ) or ψ = ψ0 (x1 )ψ1 (x2 ).

We see there is a degeneracy of 2 for the first excited state.
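A small counting script (an added illustration, not in the notes) enumerates the states ψn1 ψn2 by energy and reproduces the degeneracy pattern 1, 2, 3, … for E = ~ω(n1 + n2 + 1).

from collections import Counter
from itertools import product

# Count two-dimensional oscillator states with E = hbar*omega*(n1 + n2 + 1)
nmax = 5
counts = Counter(n1 + n2 + 1 for n1, n2 in product(range(nmax + 1), repeat=2)
                 if n1 + n2 + 1 <= nmax + 1)
for level in sorted(counts):
    print("E =", level, "hbar*omega : degeneracy", counts[level])
# E = 1: 1 state, E = 2: 2 states, E = 3: 3 states, ...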

7.6.2 Angular momentum


D. 7-36
The angular momentum is a vector of operators L = x̂ × p̂ = −i~ x × ∇. In components, this is given by Li = εijk x̂j p̂k = −i~ εijk xj ∂/∂xk . The total angular momentum operator is L2 = Li Li = L1² + L2² + L3² .
E. 7-37
To understand spherically symmetric potentials in quantum mechanics, it is also
helpful to understand the angular momentum. Note that this is just the same
definition as in classical dynamics, but with everything promoted to operators.
These operators are Hermitian, ie. L† = L, since x̂i and p̂j are themselves
Hermitian, and noting the fact that x̂i and p̂j commute whenever i 6= j. So
L†i = εijk (x̂j p̂k )† = εijk p̂†k x̂†j = εijk p̂k x̂j = εijk x̂j p̂k = Li .
Each component of L is the angular momentum in one direction. We can also
consider the length of L, which is the total angular momentum. Again, this is
Hermitian, and hence an observable.
L. 7-38
<Leibnitz property> [A, BC] = [A, B]C + B[A, C] or equivalently [AB, C] =
[A, C]B + A[B, C].

[A, BC] = ABC − BCA = ABC − BAC + BAC − BCA = [A, B]C + B[A, C].
The second version follows from [A, B] = −[B, A].
P. 7-39
1. [Li , Lj ] = i~εijk Lk .
2. [L2 , Li ] = 0.
3. [Li , x̂j ] = i~εijk x̂k and [Li , p̂j ] = i~εijk p̂k

1. We have

Li Lj = εiar x̂a p̂r εjbs x̂b p̂s = εiar εjbs (x̂a p̂r x̂b p̂s ) = εiar εjbs (x̂a (x̂b p̂r + [p̂r , x̂b ])p̂s )
= εiar εjbs (x̂a x̂b p̂r p̂s − i~δbr x̂a p̂s )

Similarly, Lj Li = εiar εjbs (x̂b x̂a p̂s p̂r − i~δas x̂b p̂r ). So the commutator is

Li Lj − Lj Li = −i~εiar εjbs (δbr x̂a p̂s − δas x̂b p̂r )


= −i~(εiab εjbs x̂a p̂s − εiar εjba x̂b p̂r )
= −i~((δis δja − δij δas )x̂a p̂s − (δib δrj − δij δrb )x̂b p̂r )
= i~(x̂i p̂j − x̂j p̂i ) = i~εijk Lk .
7.6. QUANTUM MECHANICS IN THREE DIMENSIONS 305

2. This follows from (1) using the Leibnitz property. We get

[Li , L2 ] = [Li , Lj Lj ] = [Li , Lj ]Lj + Lj [Li , Lj ] = i~εijk (Lk Lj + Lj Lk ) = 0

where we get 0 since we are contracting the antisymmetric tensor εijk with the
symmetric tensor Lk Lj + Lj Lk .
3. We will use the Leibnitz property again

[Li , x̂j ] = εiab [x̂a p̂b , x̂j ] = εiab ([x̂a , x̂j ]p̂b + x̂a [p̂b , x̂j ]) = εiab x̂a (−i~δbj )
= i~εija x̂a

as claimed. We also have

[Li , p̂j ] = εiab [x̂a p̂b , p̂j ] = εiab ([x̂a , p̂j ]p̂b + x̂a [p̂b , p̂j ]) = εiab (i~δaj p̂b )
= i~εijb p̂b .

Recall that in classical dynamics, an important result is that the angular momen-
tum is conserved in all directions. However, we know we can’t do this in quantum
mechanics, since the angular momentum operators do not commute (result (1)),
and we cannot measure all of them. This is why we have this L2 . It captures the
total angular momentum, and it commutes with the angular momentum operators:
[L2 , Li ] = 0 for all i. So (1) implies we cannot simultaneously measure, say, L1 and L2 ; the best that can be done is to measure the total L2 together with one component, say L3 .
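The commutation relation in (1) can be checked symbolically by letting the operators act on an arbitrary smooth function, which sidesteps the infinite-dimensional space. The sympy sketch below is an added illustration (indices 0, 1, 2 stand for 1, 2, 3).

import sympy as sp

hbar = sp.symbols('hbar', positive=True)
X = sp.symbols('x1 x2 x3')
f = sp.Function('f')(*X)

def L(i, g):
    # L_i g = -i*hbar*eps_{ijk} x_j dg/dx_k
    return -sp.I * hbar * sum(sp.LeviCivita(i, j, k) * X[j] * sp.diff(g, X[k])
                              for j in range(3) for k in range(3))

lhs = L(0, L(1, f)) - L(1, L(0, f))     # [L1, L2] acting on f
rhs = sp.I * hbar * L(2, f)             # i*hbar*L3 acting on f
print(sp.simplify(lhs - rhs))           # 0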

7.6.3 Spherical polars and spherical harmonics


Angular momentum is something coming from rotation. So let’s work with something
with spherical symmetry. We will exploit the symmetry and work with spherical
polar coordinates. We define our spherical polar coordinates (r, θ, ϕ) by the usual
relations

x1 = r sin θ cos ϕ x2 = r sin θ sin ϕ x3 = r cos θ.

If we express our angular momentum operators as differential operators, then we can


write them entirely in terms of r, θ and ϕ using the chain rule. The formula for L3
will be rather simple, since x3 is our axis of rotation. However, those for L1 and L2
will be much more complicated. Instead of writing them out directly, we write down
the formula for L± = L1 ± iL2 . A routine application of the chain rule gives

L3 = −i~ ∂/∂ϕ ,
L± = L1 ± iL2 = ±~ e±iϕ ( ∂/∂θ ± i cot θ ∂/∂ϕ ),
L2 = −~2 [ (1/sin θ) ∂/∂θ ( sin θ ∂/∂θ ) + (1/sin2 θ) ∂2 /∂ϕ2 ].

Note these operators involve only θ and ϕ. Furthermore, the expression for L2 is
something we have all seen before — we have

∇2 = (1/r) ∂2 /∂r2 (r · ) − (1/(~2 r2 )) L2 .

Since we know [L3 , L2 ] = 0 there are simultaneous eigenfunctions of these operators,


which we shall call Y`m (θ, ϕ), with ` = 0, 1, 2, · · · and m = 0, ±1, ±2, · · · , ±`. These
have eigenvalues ~m for L3 and ~2 `(` + 1) for L2 . In general, we have

Y`m = (const)eimϕ P`m (cos θ),

where P`m is the associated Legendre function . For the simplest case m = 0, we have
Y`0 = const P` (cos θ) where P` is the Legendre polynomial. The details are not important; the important thing to take away is that there are solutions Y`m (θ, ϕ).
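The quoted eigenvalue equations can be checked symbolically for a particular Y`m . The snippet below is an added illustration; sympy's Ynm normalisation is assumed, which does not affect the eigenvalues.

import sympy as sp

theta, phi = sp.symbols('theta phi')
l, m = 2, 1                                   # any l >= |m| should work

Y = sp.Ynm(l, m, theta, phi).expand(func=True)

# L^2/hbar^2 acting on Y, using the angular operator quoted above
L2Y = -(sp.diff(sp.sin(theta) * sp.diff(Y, theta), theta) / sp.sin(theta)
        + sp.diff(Y, phi, 2) / sp.sin(theta) ** 2)
# L3/hbar acting on Y
L3Y = -sp.I * sp.diff(Y, phi)

print(sp.simplify(L2Y / Y))   # l*(l+1) = 6
print(sp.simplify(L3Y / Y))   # m = 1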

Joint eigenstates for a spherically symmetric potential


It is standard to use m for the L3 quantum number, as we did above. Hence, to avoid a clash of notation, here we shall write µ for the mass, and consider a particle of mass µ in a potential V (r) with spherical symmetry. The Hamiltonian is
H = (1/2µ) p̂2 + V = −(~2 /2µ) ∇2 + V (r) = −(~2 /2µ)(1/r) ∂2 /∂r2 (r · ) + (1/2µr2 ) L2 + V (r).

The first thing we want to check is that [Li , H] = [L2 , H] = 0. This implies we can
use the eigenvalues of H, L2 and L3 to label our solutions to the equation. We check
this using Cartesian coordinates. The kinetic term is

[Li , p̂2 ] = [Li , p̂j p̂j ] = [Li , p̂j ]p̂j + p̂j [Li , p̂j ] = i~εijk (p̂k p̂j + p̂j p̂k ) = 0

since we are contracting an antisymmetric tensor with a symmetric term. We can also
compute the commutator with the potential term
 
∂V xk
[Li , V (r)] = −i~εijk xj = −i~εijk xj V 0 (r) = 0,
∂xk r
∂r
using the fact that ∂x i
= xri . Now that H, L2 and L3 are a commuting set of
observables, we have the joint eigenstates ψ(x) = R(r)Y`m (θ, ϕ) and we have

L2 Y`m = ~2 `(` + 1)Y`m ` = 0, 1, 2, · · ·


L3 Y`m = ~mY`m m = 0, ±1, ±2, · · · , ±`.

The numbers ` and m are known as the angular momentum quantum numbers . Note
that ` = 0 is the special case where we have a spherically symmetric solution. Finally,
we solve the Schrödinger equation Hψ = Eψ to obtain

−(~2 /2µ)(1/r) d2 /dr2 (rR) + (~2 /2µr2 ) `(` + 1)R + V R = ER.

This is now an ordinary differential equation in R. We can interpret the terms as


the radial kinetic energy, angular kinetic energy, the potential energy and the energy
eigenvalue respectively. Note that similar to what we did in classical dynamics, under
a spherically symmetric potential, we can replace the angular part of the motion with
an “effective potential” (~2 /2µr2 ) `(` + 1) + V .

We often call R(r) the radial part of the wavefunction, defined on r ≥ 0. Often, it is convenient to work with χ(r) = rR(r), which is sometimes called the radial wavefunction . Multiplying the original Schrödinger equation by r, we obtain
<Radial Schrödinger equation> −(~2 /2µ) χ00 + (~2 `(` + 1)/2µr2 ) χ + V χ = Eχ.

This has to obey some boundary conditions. Since we want R to be finite as r → 0,


we must have χ = 0 at r = 0. Moreover, the normalization condition is now
1 = ∫ |ψ(x)|2 d3x = ∫ |R(r)|2 r2 dr ∫ |Y`m (θ, ϕ)|2 sin θ dθ dϕ.
Hence, ψ is normalizable if and only if ∫0∞ |R(r)|2 r2 dr = ∫0∞ |χ(r)|2 dr < ∞.
E. 7-40
<Three-dimensional well> Suppose we have a spherically symmetric potential given by V (r) = 0 for r ≥ a and V (r) = −U for r < a,
where U > 0 is a constant. We now look for bound state solutions to the Schrödinger equation with −U < E < 0, with total angular momentum quantum number `. Our radial wavefunction χ obeys
χ00 − (`(` + 1)/r2 ) χ + k2 χ = 0 with U + E = ~2 k2 /2µ for r < a,
χ00 − (`(` + 1)/r2 ) χ − κ2 χ = 0 with E = −~2 κ2 /2µ for r ≥ a.

We can solve in each region and match χ, χ0 at r = a, with boundary condition


χ(0) = 0. Note that given this boundary condition, solving this is equivalent to
solving it for the whole R but requiring the solution to be odd.
Solving this for general ` is slightly complicated. So we shall look at some partic-
ular examples. For ` = 0, we have no angular term, and we have done this before.
The general solution is χ(r) = A sin kr for r < a and χ(r) = B e−κr for r > a.
Matching the values at r = a determines the values of k, κ and hence E. For ` = 1, it turns out the solution is just
χ(r) = A ( cos kr − (1/kr) sin kr ) for r < a, and χ(r) = B ( 1 + 1/κr ) e−κr for r > a.

After matching, the solution is

ψ = R(r)Y1m (θ, ϕ) = (χ(r)/r) Y1m (θ, ϕ),
where m can take values m = 0, ±1. Solution for general ` involves spherical
Bessel functions, which we’ll not look at here.
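For ` = 0 the matching condition can be solved numerically. The sketch below is an added illustration in units where ~2/(2µ) = 1 (the well depth U and radius a are arbitrary choices); continuity of χ and χ0 at r = a gives k cot ka = −κ with k2 + κ2 = U.

import numpy as np
from scipy.optimize import brentq

U, a = 20.0, 1.0                  # well depth and radius (arbitrary)

def match(k):
    """Matching condition k*cot(k*a) + kappa = 0 for the l = 0 bound states."""
    kappa = np.sqrt(U - k**2)
    return k / np.tan(k * a) + kappa

ks = []
grid = np.linspace(1e-6, np.sqrt(U) - 1e-6, 2000)
vals = [match(k) for k in grid]
for k1, k2, v1, v2 in zip(grid[:-1], grid[1:], vals[:-1], vals[1:]):
    if v1 * v2 < 0 and abs(v1) < 50 and abs(v2) < 50:   # skip the cot() poles
        ks.append(brentq(match, k1, k2))

for k in ks:
    print("k =", round(k, 4), " E =", round(k**2 - U, 4))   # E = -kappa^2 < 0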

7.7 The hydrogen atom


7.7.1 Introduction
Consider an electron moving in a Coulomb potential

V (r) = −e2 /(4πε0 r).
This potential is due to a proton stationary at r = 0. We follow results from the last
section, and set the mass µ = me , the electron mass. The joint energy eigenstates of
H, L2 and L3 are of the form

φ(x) = R(r)Y`m (θ, ϕ) for ` = 0, 1, · · · and m = 0, ±1, · · · , ±`.

The radial part of the Schrödinger equation can be written

R00 + (2/r)R0 − (`(` + 1)/r2 )R + (2λ/r)R = κ2 R, with λ = me e2 /(4πε0 ~2 ), E = −~2 κ2 /(2me ). (∗)
Note that here we work directly with R instead of χ, as this turns out to be easier
later on. The goal of this section is to understand all the (normalizable) solutions to
this equation (∗).
As in the case of the harmonic oscillator, the trick to solve this is to see what happens for large r, and “guess” a common factor of the solutions. In the case of the harmonic oscillator, we guessed the solution should have e−y²/2 as a factor. Here, for large r, we get R00 ∼ κ2 R. This implies R ∼ e−κr for large r.
For small r, we by assumption know that R is finite, while R0 and R00 could potentially
go crazy. So we multiply by r2 and discard the rR and r2 R terms to get r2 R00 + 2rR0 −
`(` + 1)R ∼ 0. This gives the solution R ∼ r` .
We try a solution of the form R(r) = Cr` e−κr . When we substitute this in, we will
get three kinds of terms. The r` e−κr terms match, and so do the terms of the form
r`−2 e−κr . Finally, we see the r`−1 e−κr terms match if and only if
2(`r`−1 )(−κe−κr ) + 2(r`−1 )(−κe−κr ) + 2λr`−1 e−κr = 0.
When we simplify this mess, we see this holds if and only if (` + 1)κ = λ. Hence, for
any integer n = ` + 1 = 1, 2, 3, · · · , there are bound states with energies
En = −~2 λ2 /(2me n2 ) = −(1/2) me (e2 /(4πε0 ~))2 (1/n2 ).

These are exactly the energy levels of the Bohr model, but now derived within the framework of quantum mechanics rather than postulated. However, there is a slight difference. In our model, the total angular momentum eigenvalue is
~2 `(` + 1) = ~2 n(n − 1),
which is not what the Bohr model predicted. Nevertheless, this is not the full solution. For each energy level, this only gives one possible angular momentum, but we know that there can be many possible angular momenta for each energy level. So there is more work to be done.
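Plugging in SI values (an added numerical aside; the constants below are standard approximations) reproduces the familiar −13.6 eV/n2 pattern.

import math

# E_n = -(1/2) m_e (e^2/(4 pi eps0 hbar))^2 / n^2
hbar = 1.054571817e-34      # J s
m_e = 9.1093837015e-31      # kg
e = 1.602176634e-19         # C
eps0 = 8.8541878128e-12     # F/m

for n in (1, 2, 3):
    E = -0.5 * m_e * (e**2 / (4 * math.pi * eps0 * hbar))**2 / n**2
    print(n, round(E / e, 3), "eV")      # about -13.606, -3.401, -1.512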

7.7.2 General solution


We guessed our solution r` e−κr above by looking at the asymptotic behaviour at large
and small r. We then managed to show that this is one solution of the hydrogen atom.
There are however more solutions. Similar to what we did for the harmonic oscillator,
we guess that our general solution is of the form R(r) = e−κr f (r). Putting it in, we
obtain
f 00 + (2/r) f 0 − (`(` + 1)/r2 ) f = 2 ( κ f 0 + (κ − λ) f /r ).
We immediately see an advantage of this substitution — now each side of the equality
is equidimensional, and equidimensionality makes our life much easier when seeking
series solution. This equation is regular singular at r = 0, and hence we guess a
solution of the form
f (r) = Σp≥0 ap rp+σ , a0 6= 0.

Then substitution gives


X X
((p + σ)(p + σ + 1) − `(` + 1))ap rp+σ−2 = 2(κ(p + σ + 1) − λ)ap rp+σ−1 .
p≥0 p≥0

The lowest term gives us the indicial equation

σ(σ + 1) − `(` + 1) = (σ − `)(σ + ` + 1) = 0.

So either σ = ` or σ = −(` + 1). We discard the σ = −(` + 1) solution since this would
make f and hence R singular at r = 0. So we have σ = `. Given this, the coefficients
are then determined by
ap = ( 2(κ(p + `) − λ) / (p(p + 2` + 1)) ) ap−1 , p ≥ 1.
Similar to the harmonic oscillator, we now observe that, unless the series terminates,
we have ap /ap−1 ∼ 2κ/p as p → ∞, which matches the behaviour of rα e2κr (for some
α). So R(r) is normalizable only if the series terminates. Hence the possible values of κ are κ = κn = λ/n for integers n ≥ ` + 1. So the resulting energy levels are exactly those we found before:
En = −(~2 /2me ) κn2 = −~2 λ2 /(2me n2 ) = −(1/2) me (e2 /(4πε0 ~))2 (1/n2 ) for n ∈ N.

This n is called the principal quantum number . For any given n, the possible angular
momentum quantum numbers are ` = 0, 1, 2, 3, · · · , n − 1 with m = 0, ±1, ±2, · · · , ±`.
The simultaneous eigenstates are then

ψn`m (x) = Rn` (r)Y`m (θ, ϕ) with Rn` (r) = r` gn` (r)e−λr/n ,

where gn` (r) are (proportional to) the associated Laguerre polynomials .
In general, the “shape” of probability distribution for any electron state depends on
r and θ, ϕ mostly through Y`m . For ` = 0, we have spherically symmetric solutions
ψn00 (x) = gn0 (r)e−λr/n .
This is very different from the Bohr model, that says the energy levels depend only
on the angular momentum and nothing else. Here we can have many different angular

momentums for each energy level, and can even have no angular momentum at all.
The degeneracy of each energy level En is

n−1
X `
X n−1
X
1= (2` + 1) = n2 .
`=0 m=−` `=0

In fact the degeneracy of energy eigenstates reflects the symmetries in the Coulomb
potential. Moreover, the fact that we have n2 degenerate states implies that there is
a hidden symmetry, in addition to the obvious SO(3) rotational symmetry, since just
SO(3) itself should give rise to much fewer degenerate states.
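A two-line enumeration (an added illustration) confirms the count of degenerate states per level.

# Enumerate hydrogen quantum numbers (n, l, m) and confirm degeneracy n^2
for n in range(1, 6):
    states = [(l, m) for l in range(n) for m in range(-l, l + 1)]
    print(n, len(states), n**2)    # the last two columns agree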

7.7.3 Assumptions in the treatment of the hydrogen atom

So we have solved the hydrogen atom. However, this is only after we made a lot of simplifying assumptions. It is worth revisiting these assumptions to see whether they are actually significant.

1. We assumed that the proton is stationary at the origin and the electron moves around it. We also took the mass to be µ = me . More accurately, we can consider the motion relative to the centre of mass of the system, and we should take the mass as the reduced mass
µ = me mp /(me + mp ),
just as in classical mechanics. However, the proton is so much heavier than the electron that the reduced mass is very close to the electron mass. Hence, what we've got is actually a good approximation. In principle, we can take this into account, and this will change the energy levels very slightly.

2. The entire treatment of quantum mechanics is non-relativistic. We can work a bit


harder and solve the hydrogen atom relativistically, but the corrections are also
small. These are rather moot problems. There are larger problems.

3. We have always assumed that particles are structureless, namely that we can com-
pletely specify the properties of a particle by its position and momentum. However,
it turns out electrons (and protons and neutrons) have an additional internal degree
of freedom called spin . In particular, the electron has two spin states, called up and down. This is a form of angular momentum, but with ` = 1/2 and m = ±1/2. It is not due to orbital motion: orbital motion gives integer values of ` for well-behaved wavefunctions. However, we call it angular momentum since angular momentum is conserved only if we take spin into account as well.

In Hydrogen we have a small energy difference between states in which electron


and proton spins are in the same or opposite directions. Transitions between these
two states produce EM radiation with wavelength of about 21 cm, which is important in astrophysics.

Also, for each set of quantum numbers n, `, m, since there are two possible spin states, the total degeneracy of level En is then 2n2 . This agrees with what we know from chemistry.

7.7.4 Many electron atoms


So far, we have been looking at a hydrogen atom, with just one proton and one electron.
What if we had more electrons? Consider a nucleus at the origin with charge +Ze,
where Z is the atomic number. This has Z independent electrons orbiting it with
positions xa for a = 1, · · · , Z.
We can write down the Schrödinger equation for these particles, and it looks rather
complicated, since electrons not only interact with the nucleus, but with other electrons
as well. So, to begin, we first ignore electron-electron interactions. Then the solutions
can be written down immediately:

ψ(x1 , x2 , · · · , xZ ) = ψ1 (x1 )ψ2 (x2 ) · · · ψZ (xZ ),

where ψi is any solution for the hydrogen atom, scaled appropriately by e2 7→ Ze2 to
accommodate for the larger charge of the nucleus. The energy is then

E = E1 + E2 + · · · + EZ .

We can next add in the electron-electron interactions terms, and find a more accurate
equation for ψ using something called perturbation theory. However, there is an ad-
ditional constraint on this. The Fermi-Dirac statistics or Pauli exclusion principle
states that no two identical fermions (in particular, no two electrons) can occupy the same state. In other words, if we attempt
to construct a multi-electron atom, we cannot put everything into the ground state.
We are forced to put some electrons in higher energy states. This is how chemical
reactivity arises, which depends on occupancy of energy levels.
• For n = 1, we have 2n2 = 2 electron states. This is full for Z = 2, and this is
helium.
• For n = 2, we have 2n2 = 8 electron states. Hence the first two energy levels are
full for Z = 10, and this is neon.
These are rather stable elements, since to give them an additional electron, we must
put it in a higher energy level, which costs a lot of energy. We also expect reactive
atoms when the number of electrons is one more or less than the full energy levels.
These include hydrogen (Z = 1), lithium (Z = 3), fluorine (Z = 9) and sodium
(Z = 11).
This is a recognizable sketch of the periodic table. However, for n = 3 and above, this
model does not hold well. At these energy levels, electron-electron interactions become
important, and the world is not so simple.
CHAPTER 8
Markov chains

8.1 Markov chains


D. 8-1
• Let X = (X0 , X1 , · · · ) be a sequence of random variables taking values in some
set S, the state space . We assume that S is countable (which could be finite).
• We say X has the Markov property if for all n ≥ 0 and i0 , · · · , in+1 ∈ S, we have

P(Xn+1 = in+1 | X0 = i0 , · · · , Xn = in ) = P(Xn+1 = in+1 | Xn = in ).

If X has the Markov property, we call it a Markov chain .


• We say that a Markov chain X is homogeneous if the conditional probabilities
P(Xn+1 = j | Xn = i) do not depend on n (i.e. the same for all n).
? Below we will assume all our chains X are Markov and homogeneous unless oth-
erwise specified. Also since the state space S is countable, we usually label the
states by integers i ∈ N.
E. 8-2
The Markov property, intuitively, says that the future depends only upon the
present (current state), and not past (how we got to the current state). So if we
know all the information about the present state, we know the future.
1. A random walk is a Markov chain.
2. The branching process is a Markov chain.
C. 8-3
In general, to fully specify a (homogeneous) Markov chain, we will need two items:
1. The initial distribution λi = P(X0 = i). We can write this as a vector λ = (λi :
i ∈ S).
2. The (one-step) transition probabilities pi,j = P(Xn+1 = j | Xn = i).1 We can
write this as a matrix2 P = (pi,j )i,j∈S .

P. 8-4
If X is a homogeneous Markov chain, then
1. λ is a distribution, ie. λi ≥ 0 for all i and Σi λi = 1.
2. P is a stochastic matrix, ie. pi,j ≥ 0 for all i, j and Σj pi,j = 1 for all i.

1. Obvious since λ is a probability distribution.


2. pi,j ≥ 0 since pi,j is a probability. We also have Σj pi,j = Σj P(X1 = j | X0 = i) = 1 since P(X1 = · | X0 = i) is a probability distribution function.
1 Note that this is well defined since we assume X is homogeneous.
2 Note that the matrix might not be finite dimensional.


Note that a stochastic matrix only requires each row sum to be 1; the column sums need not be.
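The pair (λ, P ) is all one needs to simulate the chain. The sketch below (an added illustration with an arbitrary 3-state example) also checks numerically the fact, established later in this section, that Xn has distribution λP n.

import numpy as np

rng = np.random.default_rng(1)
lam = np.array([0.5, 0.3, 0.2])                     # initial distribution
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2],
              [0.0, 0.5, 0.5]])                     # stochastic: rows sum to 1

def sample_path(n_steps):
    x = rng.choice(3, p=lam)
    for _ in range(n_steps):
        x = rng.choice(3, p=P[x])
    return x

n, trials = 4, 20000
empirical = np.bincount([sample_path(n) for _ in range(trials)], minlength=3) / trials
print(empirical)
print(lam @ np.linalg.matrix_power(P, n))           # close to the empirical values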
T. 8-5
Let λ be a distribution (on S) and P a stochastic matrix. Then X = (X0 , X1 , · · · )
is a (homogeneous) Markov chain with initial distribution λ and transition matrix
P iff for all n and i0 , · · · , in we have

P(X0 = i0 , X1 = i1 , · · · , Xn = in ) = λi0 pi0 ,i1 pi1 ,i2 · · · pin−1 ,in . (∗)

(Forward) Let Ak be the event Xk = ik . Then we can write (∗) as

P(A0 ∩ A1 ∩ · · · ∩ An ) = λi0 pi0 ,i1 pi1 ,i2 · · · pin−1 ,in . (∗)

We first assume that X is a Markov chain. We prove (∗) by induction on n. When


n = 0, (∗) says P(A0 ) = λi0 . This is true by definition of λ. Assume that it is true
for all n < N . Then

P(A0 ∩ A1 ∩ · · · ∩ AN ) = P(A0 ∩ · · · ∩ AN −1 )P(AN | A0 ∩ · · · ∩ AN −1 )


= λi0 pi0 ,i1 · · · piN −2 ,iN −1 P(AN | A0 ∩ · · · ∩ AN −1 )
= λi0 pi0 ,i1 · · · piN −2 ,iN −1 P(AN | AN −1 )
= λi0 pi0 ,i1 · · · piN −2 ,iN −1 piN −1 ,iN .

So it is true for N as well. Hence we are done by induction.


(Backward) Suppose that (∗) holds. Then for n = 0, we have P(X0 = i0 ) = λi0 . For n ≥ 1,
P(Xn = in | X0 = i0 , · · · , Xn−1 = in−1 ) = P(An | A0 ∩ · · · ∩ An−1 )
= P(A0 ∩ · · · ∩ An ) / P(A0 ∩ · · · ∩ An−1 ) = pin−1 ,in ,

which is independent of i0 , · · · , in−2 . So this is Markov.


Note that in the very last bit, strictly speaking we should also check P(Xn = in | Xn−1 = in−1 ) = pin−1 ,in , not just P(Xn = in | X0 = i0 , · · · , Xn−1 = in−1 ) = pin−1 ,in .
T. 8-6
<Extended Markov property> Let X be a Markov chain. For n ≥ 0, any
event H given in terms of the past {Xi : i < n}, and any event F given in terms
of the future {Xi : i > n}, we have P(F | Xn = i, H) = P(F | Xn = i).

We only give a proof for when F depends on only the finite future, so H is given in terms of X0 , X1 , · · · , Xn−1 and F is given in terms of Xn+1 , Xn+2 , · · · , XN for some N > n. Then
P(F | Xn = i, H) = P(F, Xn = i, H) / P(Xn = i, H)
= ( Σ<n Σ>n λi0 pi0 ,i1 · · · pin−1 ,i pi,in+1 · · · piN −1 ,iN ) / ( Σ<n λi0 pi0 ,i1 · · · pin−1 ,i )
= Σ>n pi,in+1 · · · piN −1 ,iN = P(F | Xn = i),
where Σ<n sums over all sequences (i0 , i1 , · · · , in−1 ) corresponding to the event H, and Σ>n over all sequences (in+1 , in+2 , · · · , iN ) corresponding to the event F .

D. 8-7
Let X be a homogeneous Markov chain. The n-step transition probability from
i to j is pi,j (n) = P(Xn = j | X0 = i). We can write this as a matrix P (m) =
(pi,j (m))i,j∈S .
T. 8-8
<Chapman-Kolmogorov equation> pi,j (m + n) = Σk∈S pi,k (m) pk,j (n). That is, P (m + n) = P (m)P (n). In particular, P (n) = P (1)P (n − 1) = · · · = P (1)n = P n .

pi,j (m + n) = P(Xm+n = j | X0 = i)
= Σk P(Xm+n = j | Xm = k, X0 = i) P(Xm = k | X0 = i)
= Σk P(Xm+n = j | Xm = k) P(Xm = k | X0 = i)
= Σk pi,k (m) pk,j (n).

We saw that the Chapman-Kolmogorov equation can be concisely stated as a rule


about matrix multiplication. In general, many statements about Markov chains
can be formulated in the language of linear algebra naturally.
For example, let X0 have distribution λ. What is the distribution of X1 ? By
definition, it is
P(X1 = j) = Σi P(X1 = j | X0 = i) P(X0 = i) = Σi λi pi,j .

Hence this has a distribution λP , where λ is treated as a row vector. Similarly,


Xn has the distribution λP n .
In fact, historically, Markov chains was initially developed as a branch of linear
algebra, and a lot of the proofs were just linear algebra manipulations. However,
nowadays, we often look at it as a branch of probability theory instead.
E. 8-9
Let S = {1, 2}, and assume 0 < α, β < 1. We are given the transition matrix
P = ( 1 − α   α
      β       1 − β )
Find the n-step transition probability.

(Method 1) We can achieve this via diagonalization. We can write P = U −1 diag(κ1 , κ2 ) U where the κi are eigenvalues of P , and U is composed of the eigenvectors. To find the eigenvalues, we calculate det(P − λI) = (1 − α − λ)(1 − β − λ) − αβ = 0. We solve this to obtain κ1 = 1 and κ2 = 1 − α − β. Usually, the next thing to do would be to find the eigenvectors to obtain U . However, here we can cheat a bit and not do that. Using the diagonalization of P , we have
P n = U −1 diag(κ1n , κ2n ) U.

We can now attempt to compute p1,2 (n). We know that it must be of the form
p1,2 (n) = A κ1n + B κ2n = A + B(1 − α − β)n

where A and B are constants coming from U and U −1 . However, we know that p1,2 (0) = 0 and p1,2 (1) = α. So we obtain A + B = 0 and A + B(1 − α − β) = α. Solving these, we obtain
p1,2 (n) = (α/(α + β)) (1 − (1 − α − β)n ) = 1 − p1,1 (n),

since row sum equals 1 as P (n) must also be stochastic. How about p2,1 and p2,2 ?
Well we don’t need additional work. We can obtain these simply by interchanging
α and β. So we obtain

P n = (1/(α + β)) ( β + α(1 − α − β)n   α − α(1 − α − β)n
                    β − β(1 − α − β)n   α + β(1 − α − β)n )

(Method 2) Alternatively, we can solve this by a difference equation. The recur-


rence relation is given by p1,1 (n + 1) = p1,1 (n)(1 − α) + p1,2 (n)β. But p1,2 (n) +
p1,1 (n) = 1, so

p1,1 (n + 1) = p1,1 (n)(1 − α) + (1 − p1,1 (n))β.

We can solve this as we have done in IA Differential Equations.


Note that as n → ∞, we have
P n → (1/(α + β)) ( β   α
                    β   α )

We see that the two rows are the same. This means that as time goes on, where
we end up does not depend on where we started. We will later (near the end of
the course) see that this is generally true for most Markov chains.
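The closed form can be sanity-checked against direct matrix powers (an added sketch with arbitrary values of α and β).

import numpy as np

alpha, beta, n = 0.3, 0.7, 6
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
r = (1 - alpha - beta) ** n
closed = np.array([[beta + alpha * r, alpha - alpha * r],
                   [beta - beta * r,  alpha + beta * r]]) / (alpha + beta)
print(np.allclose(np.linalg.matrix_power(P, n), closed))   # True
print(np.linalg.matrix_power(P, 50))   # rows both approach (beta, alpha)/(alpha+beta)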

8.2 Classification of chains and states


D. 8-10
Suppose we have two states i, j ∈ S. We write i → j and say i leads to j if there
is some n ≥ 0 such that pi,j (n) > 0, ie. it is possible for us to get from i to j
(in multiple steps).3 We write i ↔ j and say i and j communicate if i → j and
j → i.
P. 8-11
↔ defined above is an equivalence relation.

1. Reflexive: we have i ↔ i since pi,i (0) = 1.


2. Symmetric: trivial by definition.
3. Transitive: suppose i → j and j → k. Since i → j, there is some m > 0 such that pi,j (m) > 0. Since j → k, there is some n such that pj,k (n) > 0. Then pi,k (m + n) = Σr pi,r (m) pr,k (n) ≥ pi,j (m) pj,k (n) > 0. So i → k. Similarly, if
j → i and k → j, then k → i. So i ↔ j and j ↔ k implies that i ↔ k.
3 Note that we allow n = 0. So we always have i → i.

D. 8-12
• The equivalence classes of ↔ are communicating classes .

• A Markov chain is irreducible if there is a unique communication class.

• A subset C ⊆ S is closed if pi,j = 0 for all i ∈ C, j 6∈ C.4 If a singleton set {i} is


closed, we call i an absorbing state .

E. 8-13
Note that intuitively a subset closed if we cannot escape from it. Note also that
different communicating classes are not completely isolated. Within a communi-
cating class A, of course we can move between any two vertices. However, it is also
possible that we can escape from a class A to a different class B. It is just that
after going to B, we cannot return to class A. From B, we might be able to get to
another space C. We can jump around all the time, but (if there are finitely many
communicating classes) eventually we have to stop when we have visited every
class. Then we are bound to stay in that class, i.e. the class is closed. Since we
are eventually going to be stuck in that class anyway, often, we can just consider
this final communicating class and ignore the others. So wlog we can assume that
the chain only has one communicating class, i.e. it is irreducible.

P. 8-14
A subset C is closed iff j ∈ C whenever i ∈ C and i → j.

(Forward) Suppose i ∈ C and i → j. Since i → j, there is some m such that


pi,j (m) > 0. Expanding the Chapman-Kolmogorov equation, we have
pi,j (m) = Σi1 ,··· ,im−1 pi,i1 pi1 ,i2 · · · pim−1 ,j > 0.

So there is some route i, i1 , · · · , im−1 , j such that pi,i1 , pi1 ,i2 , · · · , pim−1 ,j > 0.
Since pi,i1 > 0, we have i1 ∈ C as C is closed. Since pi1 ,i2 > 0, we have i2 ∈ C.
By induction, we get that j ∈ C.

(Backward) If i ∈ C and j 6∈ C, then i 6→ j. So pi,j = 0 and hence C is closed.

E. 8-15
Consider S = {1, 2, 3, 4, 5, 6} with transition matrix P given below (the original notes also draw the corresponding transition diagram on the six states).
P = ( 1/2  1/2  0    0    0    0
      0    0    1    0    0    0
      1/3  0    0    1/3  1/3  0
      0    0    0    1/2  1/2  0
      0    0    0    0    0    1
      0    0    0    0    1    0 )
We see that the communicating classes are {1, 2, 3}, {4}, {5, 6}, where {5, 6} is
closed.
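Communicating classes can be found mechanically as the strongly connected components of the directed graph with an edge i → j whenever pi,j > 0. The sketch below (added here) applied to the matrix above recovers {1, 2, 3}, {4}, {5, 6}.

import numpy as np

P = np.array([[1/2, 1/2, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [1/3, 0, 0, 1/3, 1/3, 0],
              [0, 0, 0, 1/2, 1/2, 0],
              [0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 0]])

def reachable(start):
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        for j in np.nonzero(P[i] > 0)[0]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

n = len(P)
classes = {frozenset(k for k in range(n) if k in reachable(i) and i in reachable(k))
           for i in range(n)}
print([sorted(s + 1 for s in c) for c in classes])   # [[1, 2, 3], [4], [5, 6]] in some order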

D. 8-16
For convenience, we write Pi (A) = P(A | X0 = i) and Ei (Z) = E(Z | X0 = i).
4 An equivalent definition is that C is closed if j ∈ C whenever i → j and i ∈ C.

• The first passage time of j ∈ S starting from i = X0 is Tj = min{n ≥ 1 :


Xn = j}.5 The first passage probability is fij (n) = Pi (Tj = n).
• A state i ∈ S is recurrent (or persistent ) if Pi (Ti < ∞) = 1. Otherwise, we
call the state transient .
? For what follows, we write Pi,j (s) = Σn≥0 pi,j (n) sn and Fi,j (s) = Σn≥0 fi,j (n) sn
where fi,j is our first passage probability. For the sake of clarity, here pi,j (0) =
δi,j and fi,j (0) = 0.
E. 8-17
The major focus of this chapter is recurrence and transience. This was something
that came up when we discussed random walks in IA Probability — given a random
walk, say starting at 0, what is the probability that we will return to 0 later on?
Recurrence and transience is a qualitative way of answering this question. We
will mostly focus on irreducible chains, so there is always a non-zero probability
of returning to 0. Hence the question we want to ask is whether we are going to
return to 0 with certainty, ie. with probability 1. If we are bound to return, then
we say the state is recurrent. Otherwise, we say it is transient.
It should be clear that this notion is usually interesting only for an infinite state
space. If we have an infinite state space, we might get transience because we might
drift away to a place far away and never return. However, in a finite state space,
this can’t happen. In a finite state space, transience can occur only if we get stuck
in some other place and can’t leave, ie. we are not in an irreducible state space.
Note that transient does not mean we don’t get back. It’s just that we are not
sure that we will get back. We can show that if a state is recurrent, then the
probability that we return to i infinitely many times is also 1.
P. 8-18
Pi,j (s) = δi,j + Fi,j (s)Pj,j (s) for −1 < s ≤ 1.

Using the law of total probability, for n ≥ 1,
pi,j (n) = Σm=1..n Pi (Xn = j | Tj = m) Pi (Tj = m).
Using the Markov property, we can write this as
pi,j (n) = Σm=1..n P(Xn = j | Xm = j) Pi (Tj = m) = Σm=1..n pj,j (n − m) fi,j (m).
We can multiply through by sn and sum over all n to obtain
Σn≥1 pi,j (n) sn = Σn≥1 Σm=1..n pj,j (n − m) sn−m fi,j (m) sm .

The left hand side is almost the generating function Pi,j (s), except that we are
missing an n = 0 term, which is pi,j (0) = δi,j . The right hand side is the “convo-
lution” of the power series Pj,j (s) and Fi,j (s), which we can write as the product
Pj,j (s)Fi,j (s). So Pi,j (s) − δi,j = Pj,j (s)Fi,j (s).
5 Here we require n ≥ 1. Otherwise Ti would always be 0. Also Tj does not necessarily exist, since {n ≥ 1 : Xn = j} might be empty.

L. 8-19
<Abel’s lemma> Let a1 , a2 , · · · be positive real numbers such that f (x) = Σn an xn converges for |x| < 1. Then limx→1− f (x) = Σn an .

For x ∈ (−1, 1), note that f (x) = Σn≥0 an xn = (1 − x) Σn≥0 sn xn where sn = Σk=0..n ak . Suppose Σn an converges; write a = Σn an . Then
|f (x) − a| = |(1 − x) Σn≥0 sn xn − a(1 − x) Σn≥0 xn | ≤ (1 − x) |Σn=0..N (sn − a)xn | + (1 − x) |Σn>N (sn − a)xn |.
Given any ε > 0, we can choose N such that |sn − a| < ε/2 for all n > N . Since (1 − x) Σn=0..N (sn − a)xn is a continuous function of x that takes value 0 when x = 1, there is δ > 0 such that |(1 − x) Σn=0..N (sn − a)xn | < ε/2 for all x ∈ (1 − δ, 1). Now for all x ∈ (1 − δ, 1) we have
|f (x) − a| < ε/2 + (1 − x) Σn>N |sn − a| xn ≤ ε/2 + (ε/2)(1 − x) Σn>N xn ≤ ε.
Hence limx→1− f (x) = Σn an . Suppose instead Σn an does not converge. Since its terms are positive, sN = Σn=0..N an → ∞ as N → ∞. Given R ∈ R, we can pick N such that sn > 2R for all n ≥ N . Also there is δ > 0 such that xN > 1/2 for all x ∈ (1 − δ, 1). Now
(1 − x) Σn≥0 sn xn ≥ (1 − x) Σn≥N sn xn > 2R(1 − x) Σn≥N xn = 2R xN ≥ R.
So limx→1− f (x) = ∞ = Σn an .
T. 8-20
1. i is recurrent iff Σn pi,i (n) = ∞.
2. If j is transient, then Σn pi,j (n) < ∞ for all states i.

1. Using j = i in [P.8-18], for 0 < s < 1, we have
Pi,i (s) = 1/(1 − Fi,i (s)).
Here we need to be careful that we are not dividing by 0. This would be a problem if Fi,i (s) = 1. By definition, we have Fi,i (s) = Σn≥1 fi,i (n) sn . Also, by definition of fi,i , we have
Fi,i (1) = Σn fi,i (n) = P(ever returning to i) ≤ 1.
Since |s| < 1, we have Fi,i (s) < 1. So we are not dividing by zero. Pi,i (s) converges for |s| < 1 since it is bounded by the geometric series Σn sn . So by Abel's lemma, lims→1 Pi,i (s) = Pi,i (1) = Σn pi,i (n). Similarly, we also have
lims→1 1/(1 − Fi,i (s)) = 1/(1 − lims→1 Fi,i (s)) = 1/(1 − Σn fi,i (n)).
Hence we have Σn pi,i (n) = 1/(1 − Σn fi,i (n)). Since Σn fi,i (n) is the probability of ever returning, the probability of ever returning is 1 iff Σn pi,i (n) = ∞.
2. By part 1, Pj,j (1) = Σn pj,j (n) < ∞, so Σn pi,j (n) = Pi,j (1) = δi,j + Fi,j (1)Pj,j (1) < ∞.

T. 8-21
Let C be a communicating class. Then
1. Either every state in C is recurrent, or every state is transient.
2. If C contains a recurrent state, then C is closed.

1. Let i ↔ j and i 6= j. Then by definition of communicating, there is some m


such that pi,j (m) = α > 0, and some n such that pj,i (n) = β > 0. So for each
k, we have

pi,i (m + k + n) ≥ pi,j (m)pj,j (k)pj,i (n) = αβpj,j (k).


P P
So if k pj,j (k) = ∞, then r pi,i (r) = ∞. So j recurrent implies i recurrent.
Similarly, i recurrent implies j recurrent.

2. If C is not closed, then there is a non-zero probability that we leave the class
and never get back. So the states are not recurrent.

E. 8-22
There is a profound difference between a finite state space and an infinite state
space. A finite state space can be represented by a finite matrix, and we are all
very familiar with a finite matrices. We can use everything we know about finite
matrices. However, infinite matrices are weirder.

For example, any finite transition matrix P has an eigenvalue of 1. This is since
the row sums of a transition matrix is always 1. So if we multiply P by e =
(1, 1, · · · , 1), then we get e again. However, this is not true for infinite matrices,
since we usually don't allow arbitrary infinite vectors. To avoid getting infinitely large numbers when multiplying vectors and matrices, we usually restrict our focus to vectors x such that Σi xi2 is finite. In this case the vector e is not
allowed, and the transition matrix need not have eigenvalue 1.

Another thing about a finite state space is that probability “cannot escape”. Each
step of a Markov chain gives a probability distribution on the state space, and
we can imagine the progression of the chain as a flow of probabilities around the
state space. If we have a finite state space, then all the probability flow must be
contained within our finite state space. However, if we have an infinite state space,
then probabilities can just drift away to infinity.

T. 8-23
In a finite state space,
1. There exists at least one recurrent state.
2. If the chain is irreducible, every state is recurrent.

1. We first fix an arbitrary i. Recall that Pi,j (s) = δi,j + Pj,j (s)Fi,j (s). If j is transient, then Σn pj,j (n) = Pj,j (1) < ∞. Also, Fi,j (1) is the probability of ever reaching j from i, and is hence finite as well. So we have Pi,j (1) < ∞. By Abel's lemma, Pi,j (1) is given by Pi,j (1) = Σn pi,j (n). Since this is finite, we must have pi,j (n) → 0 as n → ∞.
But we know that Σj∈S pi,j (n) = 1, so if every state is transient, then since the sum has only finitely many terms, Σj∈S pi,j (n) → 0 as n → ∞. This is a contradiction. So we must have a recurrent state.
2. follows immediately from 1. since if a chain is irreducible, then either all states
are transient or all states are recurrent by [T.8-21].
C. 8-24
Consider Zd = {(x1 , x2 , · · · , xd ) : xi ∈ Z}. This generates a graph with x adjacent
to y if kx − yk = 1, where k · k is the Euclidean norm. The symmetric random walk on Zd is the chain in which, at each step, we move to a neighbour, each chosen with equal probability, ie.
P(Xn+1 = j | Xn = i) = 1/2d if |j − i| = 1, and 0 otherwise.
This is an irreducible chain, since it is possible to get from one point to any other
point. So the points are either all recurrent or all transient.

T. 8-25
<Pólya’s theorem> The symmetric random walk on Zd is recurrent for d = 1, 2
and transient for d ≥ 3.
We will start with the case d = 1. We want to show that Σn p0,0 (n) = ∞. Then we know the origin is recurrent. It is impossible to get back to the origin in an odd number of steps. So we can instead consider Σn p0,0 (2n). To return to the origin after 2n steps, we need to have made n steps to the left and n steps to the right, in any order. So we have
p0,0 (2n) = P(n steps left, n steps right) = ( (2n)! / (n! n!) ) (1/2)^(2n) .
Using Stirling's formula n! ≃ √(2πn) (n/e)^n , we get p0,0 (2n) ∼ 1/√(πn). These terms decay more slowly than those of the harmonic series, so we have Σn p0,0 (2n) = ∞.
In the d = 2 case, suppose after 2n steps we have taken r steps right, ` steps left, u steps up and d steps down. We must have r + ` + u + d = 2n, and we need r = `, u = d to return to the origin. Let r = ` = m and u = d = n − m. Then
p0,0 (2n) = (1/4)^(2n) Σm=0..n (2n)! / ( m! m! (n − m)! (n − m)! )
= (1/4)^(2n) ( (2n)! / (n! n!) ) Σm=0..n ( n! / (m!(n − m)!) )^2
= (1/4)^(2n) ( (2n)! / (n! n!) ) Σm=0..n (n choose m)(n choose n − m).
We now use a well-known identity (proved in IA Numbers and Sets), Σm=0..n (n choose m)(n choose n − m) = (2n choose n), to obtain
p0,0 (2n) = ( (2n choose n) (1/2)^(2n) )^2 ∼ 1/(πn).

So the sum diverges. So this is recurrent. Note that the two-dimensional proba-
bility turns out to be the square of the one-dimensional probability. This is not
a coincidence, and we will explain this after the proof. However, this does not
extend to higher dimensions.

In the d = 3 case, we have


 2n
1 X (2n)!
p0,0 (2n) = .
6 (i!j!k!)2
i+j+k=n

This time, there is no neat combinatorial formula. Since we want to show this is
summable, we can try to bound this from above. We have
 2n !  2  2n !  2
1 2n X n! 1 2n X n!
p0,0 (2m) = =
6 n i!j!k! 2 n 3n i!j!k!
 2n !   X
1 2n n! n!
≤ max
2 n 3n i!j!k! 3n i!j!k!
i+j+k=n

n!
P
Now we will use the identity i+j+k=n 3n i!j!k! = 1, which can be obtained as
follows: suppose we have three urns, and throw n balls into it. Then the probability
n!
of getting i balls in the first, j in the second and k in the third is exactly 3n i!j!k! .
Summing over all possible combinations of i, j and k gives the total probability
of getting in any configuration, which is 1. To find the maximum of n!/(3n i!j!k!),
we can replace the factorial by the gamma function and use Lagrange multipliers.
However, we would just argue that the maximum can be achieved when i, j and
k are as close to each other as possible. So we get
 2n !  3
1 2n n! 1
≤ ∼ Cn−3/2
2 n 3n bn/3c!
P
for some constant C and using Stirling’s formula. So p0,0 (2n) < ∞ and the
chain is transient. We can prove similarly for higher dimensions.
Intuitively, this result makes sense: we get recurrence only for low dimensions, since if we have more dimensions, then it is easier to get lost.
Let’s get back to why the two dimensional probability is the square of the one-
dimensional probability. This square might remind us of independence. However,
it is obviously not true that horizontal movement and vertical movement are in-
dependent — if we go sideways in one step, then we cannot move vertically. But
we can “fix” this:
We write Xn = (An , Bn ). We rotate our space and record our coordinates in a pair of axes rotated by 45° (and then stretched); the original notes include a picture of the rotated axes U, V against the original axes A, B. The new coordinates are
(Un , Vn ) = (An − Bn , An + Bn ).
In each step, either An or Bn changes by one step. So Un and Vn both change by 1. Note that Un and Vn are each one-dimensional random walks. Moreover, for any a, b = ±1,
P((Un+1 − Un , Vn+1 − Vn ) = (a, b)) = P(Un+1 − Un = a) P(Vn+1 − Vn = b)

So Un and Vn are independent. So we have

p0,0 (2n) = P0 (A2n = B2n = 0) = P0 (U2n = V2n = 0) = P0 (U2n = 0) P0 (V2n = 0) = ( (2n choose n) (1/2)^(2n) )^2 .
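A crude Monte Carlo illustration (added here; the horizon is finite, so the numbers underestimate the true return probabilities) shows the qualitative difference between d = 1, 2 and d ≥ 3.

import numpy as np

rng = np.random.default_rng(0)

def return_probability(d, steps=2000, trials=2000):
    """Estimate P(return to the origin within `steps` steps) for the walk on Z^d."""
    step = np.zeros((2 * d, d), dtype=int)
    step[np.arange(d), np.arange(d)] = 1
    step[np.arange(d) + d, np.arange(d)] = -1
    hits = 0
    for _ in range(trials):
        pos = np.cumsum(step[rng.integers(0, 2 * d, size=steps)], axis=0)
        hits += bool(np.any(np.all(pos == 0, axis=1)))
    return hits / trials

for d in (1, 2, 3):
    print(d, return_probability(d))
# near 1 for d = 1; smaller for d = 2 (the true value is still 1, but convergence is slow);
# clearly below 1 for d = 3 (the true value is about 0.34)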

D. 8-26
The hitting time of A ⊆ S is the random variable H A = min{n ≥ 0 : Xn ∈ A}.6
Also we write kiA = Ei (H A ) and hiA = Pi (H A < ∞) = Pi (ever reach A). If A is closed, then hiA is called an absorption probability .
E. 8-27
Note that we have to be careful when finding Ei (H A ) = kiA . If there is a chance that we never hit A, then H A could be infinite, and Ei (H A ) = ∞. This occurs if hiA < 1. So often we are only interested in the case where hiA = 1. Note however that hiA = 1 does not imply that kiA < ∞. It is merely a necessary condition.

T. 8-28
The vector (hiA : i ∈ S) satisfies
hiA = 1 for i ∈ A, and hiA = Σj∈S pi,j hjA for i 6∈ A,

and is minimal in that for any non-negative solution (xi : i ∈ S) to these equations,
we have hAi ≤ xi for all i.

By definition, hiA = 1 if i ∈ A. Otherwise, we have
hiA = Pi (H A < ∞) = Σj∈S Pi (H A < ∞ | X1 = j) pi,j = Σj∈S hjA pi,j .

So hiA is indeed a solution to the equations. To show that hiA is the minimal solution, suppose x = (xi : i ∈ S) is a non-negative solution, ie.
xi = 1 for i ∈ A, and xi = Σj∈S pi,j xj for i 6∈ A.

If i ∈ A, we have hAi = xi = 1. Otherwise, we can write


xi = Σj∈S pi,j xj = Σj∈A pi,j xj + Σj6∈A pi,j xj = Σj∈A pi,j + Σj6∈A pi,j xj ≥ Σj∈A pi,j = Pi (H A = 1).

By iterating this process, we can write
xi = Σj∈A pi,j + Σj6∈A pi,j Σk pj,k xk = Σj∈A pi,j + Σj6∈A pi,j ( Σk∈A pj,k + Σk6∈A pj,k xk )
≥ Pi (H A = 1) + Σj6∈A, k∈A pi,j pj,k = Pi (H A = 1) + Pi (H A = 2) = Pi (H A ≤ 2).

By induction, xi ≥ Pi (H A ≤ n) for all n. Taking the limit as n → ∞, we get xi ≥ Pi (H A < ∞) = hiA . So hiA is minimal.
6 In particular, if we start in A, then H A = 0. Also here we let H A = ∞ if {n ≥ 0 : Xn ∈ A} = ∅.

T. 8-29
(kiA : i ∈ S) is the minimal non-negative solution to
kiA = 0 for i ∈ A, and kiA = 1 + Σj pi,j kjA for i 6∈ A.

By definition, kiA = 0 if i ∈ A. Otherwise, we have
kiA = Ei (H A ) = 1 + Σj∈S Ej (H A ) pi,j = 1 + Σj∈S kjA pi,j .

Now suppose (yi : i ∈ S) is a non-negative solution. If i ∈ A, we get yi = kiA = 0. Otherwise, suppose i 6∈ A. Then (assuming hiA = 1) we have
yi = 1 + Σj pi,j yj = 1 + Σj∈A pi,j yj + Σj6∈A pi,j yj = 1 + Σj6∈A pi,j yj
= 1 + Σj6∈A pi,j ( 1 + Σk6∈A pj,k yk ) ≥ 1 + Σj6∈A pi,j = Pi (H A ≥ 1) + Pi (H A ≥ 2).

By induction, we know that yi ≥ Pi (H A ≥ 1) + · · · + Pi (H A ≥ n) for all n. Letting n → ∞, we get yi ≥ Σm≥1 Pi (H A ≥ m) = Σm≥1 m Pi (H A = m) = kiA .
Note that we have the extra “1+” since when we move from i to j, one step has
already passed.
E. 8-30
<Gambler’s ruin> This time, we will consider a random walk on N. In each
step, we either move to the right with probability p, or to the left with probability
q = 1 − p. What is the probability of ever hitting 0 from a given initial point? In other words, we want to find hi = hi{0} .
We know hi is the minimal solution to
hi = 1 for i = 0, and hi = q hi−1 + p hi+1 for i 6= 0.

We can view this as a difference equation

phi+1 − hi + qhi−1 = 0, i ≥ 1.
with the boundary condition that h0 = 1. If p 6= q, ie. p 6= 1/2, then the solution has the form hi = A + B(q/p)i for i ≥ 0.

If p < q, then for large i, (q/p)i is very large and blows up. However, since hi is a probability, it can never blow up. So we must have B = 0. So hi is constant. Since h0 = 1, we have hi = 1 for all i. So we always get to 0.
If p > q, since h0 = 1, we have A + B = 1. So hi = (q/p)i + A(1 − (q/p)i ). This is in fact a solution for all A. So we want to find the smallest solution. As i → ∞, we get hi → A. Since hi ≥ 0, we know that A ≥ 0. Subject to this constraint, the minimum is attained when A = 0 (since (q/p)i and (1 − (q/p)i ) are both positive). So we have hi = (q/p)i .
If p = q, then by similar arguments, hi = 1 for all i.

There is another way to solve this. We can give ourselves a ceiling, ie. we also
stop when we hit k > 0, ie. hk = 0. We now have two boundary conditions and
can find a unique solution. Then we take the limit as k → ∞. We have seen similar methods in IA Probability.
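The ceiling approach can be carried out numerically and compared with the minimal solution (an added sketch; the value of p is an arbitrary choice with p > q).

import numpy as np

# Gambler's ruin with a ceiling at k: solve h_0 = 1, h_k = 0, q h_{i-1} - h_i + p h_{i+1} = 0
p, q, k = 0.6, 0.4, 60
A = np.zeros((k + 1, k + 1))
b = np.zeros(k + 1)
A[0, 0], b[0] = 1, 1                 # h_0 = 1
A[k, k], b[k] = 1, 0                 # h_k = 0 (the ceiling)
for i in range(1, k):
    A[i, i - 1], A[i, i], A[i, i + 1] = q, -1, p
h = np.linalg.solve(A, b)
for i in (1, 2, 5, 10):
    print(i, round(h[i], 6), round((q / p) ** i, 6))   # nearly equal for large k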
E. 8-31
<Birth-death chain> Let (pi : i ≥ 0) be an arbitrary sequence such that
pi ∈ (0, 1), and write qi = 1 − pi . Let N be our state space and define the
transition probabilities to be pi,i+1 = pi and pi,i−1 = qi for i ≥ 1 and p0,1 = p0
and p0,0 = q0 . What is hi{0} ?
We write hi = hi{0} . We know that

h0 = 1, pi hi+1 − hi + qi hi−1 = 0 for i ≥ 1. (∗)

This is no longer a constant-coefficient difference equation, since the coefficients depend on the index i. To solve this, we rewrite it as pi hi+1 − hi + qi hi−1 = pi (hi+1 − hi ) − qi (hi − hi−1 ). We let ui = hi−1 − hi ,7 then our equation becomes
ui+1 = (qi /pi ) ui =⇒ ui+1 = (qi /pi )(qi−1 /pi−1 ) · · · (q1 /p1 ) u1 .

Let γi = (q1 q2 · · · qi )/(p1 p2 · · · pi ), then ui+1 = γi u1 . For convenience, we let γ0 = 1. Now we want
to retrieve our hi . We have h0 − hi = u1 + u2 + · · · + ui . Using h0 = 1, we get

hi = 1 − u1 (γ0 + γ1 + · · · + γi−1 ) for i ≥ 1.

This solves (∗) for any value of u1 . But our theorem tells us that we want the value of u1 that minimizes hi . Note that S = Σi=0..∞ γi either diverges or converges. If S = ∞, then we must have u1 = 0 and so hi = 1 for all i. This is since hi cannot blow up, as 0 ≤ hi ≤ 1. If S is finite, then u1 can be non-zero. We know that the γi are all positive. So to minimize hi , we need to maximize u1 . Since 0 ≤ hi , the maximum possible value of u1 is such that 0 = 1 − u1 S, in other words u1 = 1/S. So we have
hi = ( Σk=i..∞ γk ) / ( Σk=0..∞ γk ).

This is a more general case of the random walk — in contrast to the random walk
where we have a constant pi sequence. This is also a general model for population
growth, where the change in population depends on what the current population
is. Here each “step” does not correspond to some unit time, since births and
deaths occur rather randomly. Instead, we just make a “step” whenever some
birth or death occurs, regardless of what time they occur. Here, if we have no
people left, then it is impossible for us to reproduce and get more population. So
we might want to have p0,0 = 1. In this case 0 is absorbing in that {0} is closed.
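Once the pi are specified, the formula for hi can be evaluated directly. The sketch below is an added illustration; the constant sequence pi = 0.7 is an arbitrary example with Σγi < ∞, for which the answer should reduce to the gambler's ruin result (q/p)i.

import numpy as np

p = np.full(200, 0.7)                     # p_1, p_2, ... truncated at 200 terms
q = 1 - p
gamma = np.concatenate(([1.0], np.cumprod(q / p)))   # gamma_0 = 1, gamma_i = prod q_j/p_j
S = gamma.sum()                                      # good approximation to the infinite sum
h = gamma[::-1].cumsum()[::-1] / S                   # h_i = tail sum / S
print(h[:5])
print([(3 / 7) ** i for i in range(5)])              # matches the closed form (q/p)^i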
D. 8-32
A random variable T (which is a function Ω → N ∪ {∞}) is a stopping time for
the Markov chain X = (Xn ) if for n ≥ 0, the event {T = n} is given in terms of
X0 , · · · , Xn .
7 Letting ui = hi − hi−1 might seem more natural, but this definition makes ui positive.

E. 8-33
For example, suppose we are in a casino and gambling. We let Xn be the amount
of money we have at time n. Then we can set our stopping time as “the time
when we have $10 left”. This is a stopping time, in the sense that we can use this
as a guide to when to stop — it is certainly possible to set yourself a guide that
you should leave the casino when you have $10 left. However, it does not make
sense to say “I will leave if the next game will make me bankrupt”, since there is
no way to tell if the next game will make you bankrupt (it certainly will not if you
win the game!). Hence this is not a stopping time.
The hitting time H A is a stopping time. This is since {H A = n} = {Xi 6∈ A for i <
n} ∩ {Xn ∈ A}. We also know that H A + 1 is a stopping time since {H A + 1 = n} depends only on Xi for i ≤ n − 1. However, H A − 1 is not a stopping time since {H A − 1 = n} depends on Xn+1 .
T. 8-34
<Strong Markov property> Let X be a Markov chain with transition matrix
P , and let T be a stopping time for X. Given T < ∞ and XT = i, the chain
Y = (Yk : k ≥ 0) given by Yk = XT +k is a Markov chain with transition matrix
P and initial distribution XT +0 = i, and this Markov chain is independent of
X0 , · · · , XT .

Write the event $\{T < \infty\}$ as $B$, the event $\{T = m\}$ as $B_m$, the event $\{X_T = i\}$ as $A$, and the event $\{Y_k \equiv X_{T+k} = i_k\}$ as $A_k$. Also write $P_{B_m}(\,\cdot\,) = P(\,\cdot \mid B_m)$. Then
$$P_{B_m}\Big(\bigcap_{k=1}^{n} A_k \,\Big|\, A\Big) = P_{B_m}(A_1 \mid A)\, P_{B_m}\Big(\bigcap_{k=2}^{n} A_k \,\Big|\, A_1 \cap A\Big) = p_{i,i_1}\, P_{B_m}\Big(\bigcap_{k=2}^{n} A_k \,\Big|\, A_1 \cap A\Big)$$
$$= p_{i,i_1}\, P_{B_m}(A_2 \mid A_1 \cap A)\, P_{B_m}\Big(\bigcap_{k=3}^{n} A_k \,\Big|\, A_2 \cap A_1 \cap A\Big) = p_{i,i_1} p_{i_1,i_2}\, P_{B_m}\Big(\bigcap_{k=3}^{n} A_k \,\Big|\, A_2 \cap A_1 \cap A\Big) = \cdots = p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},i_n},$$
where each conditional probability is evaluated using the (weak) Markov property at the appropriate time. So $P\big((\bigcap_{k=1}^{n} A_k) \cap A \cap B_m\big) = P(A \cap B_m)\, p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},i_n}$; summing over $m \in \mathbb{N}_0$ we get $P\big(\bigcap_{k=1}^{n} A_k \mid A \cap B\big) = p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},i_n}$. Hence $Y$ is a Markov chain with transition matrix $P$ and initial distribution $X_{T+0} = i$.

Now let $H$ be an event given in terms of $X_0, X_1, \cdots, X_{T-1}$. The event $H \cap B_m$ is given in terms of $X_0, X_1, \cdots, X_m$, so by the Markov property at time $m$ we have
$$p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},i_n}\, P(A \cap B_m \cap H) = P\Big(A \cap B_m \cap H \cap \bigcap_{k=1}^{n} A_k\Big),$$
since $p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},i_n} = P\big(\bigcap_{k=1}^{n} A_k \mid A \cap B_m\big)$. Summing over $m \in \mathbb{N}_0$ we get
$$p_{i,i_1} p_{i_1,i_2} \cdots p_{i_{n-1},i_n}\, P(A \cap B \cap H) = P\Big(A \cap B \cap H \cap \bigcap_{k=1}^{n} A_k\Big)$$
$$\implies P\Big(\bigcap_{k=1}^{n} A_k \,\Big|\, A \cap B\Big)\, P(H \mid A \cap B) = P\Big(H \cap \bigcap_{k=1}^{n} A_k \,\Big|\, A \cap B\Big),$$
which is precisely the claimed independence of $Y$ from $X_0, \cdots, X_T$.

The “Markov property” we saw at the start of the chapter is the weak Markov property. In probability, we often have “strong” and “weak” versions of things.

For example, we have the strong and weak law of large numbers. The difference
is that the weak versions are expressed in terms of probabilities, while the strong
versions are expressed in terms of random variables.
Initially, when people first started developing probability theory, they just talked
about probability distributions like the Poisson distribution or the normal distri-
bution. However, later it turned out it is often nicer to talk about random variables
instead. After messing with random variables, we can just take expectations or
evaluate probabilities to get the corresponding statement about probability distri-
butions. Hence usually the “strong” versions imply the “weak” version, but not
the other way round.
E. 8-35
<Gambler’s ruin> Again, this is the Markov chain taking values on the non-
negative integers, moving to the right with probability p and left with probability
q = 1 − p. 0 is an absorbing state, since we have no money left to bet if we are
broke. Instead of computing the probability of hitting zero, we want to find the
time it takes to get to 0, ie.

H = inf{n ≥ 0 : Xn = 0}.

Here we let the infimum of the empty set be $+\infty$, ie. if we never hit zero, we say it takes infinite time. What is the distribution of $H$? We define the generating function
$$G_i(s) = E_i(s^H) = \sum_{n=0}^{\infty} s^n P_i(H = n), \qquad |s| < 1.$$

We have

G1 (s) = E1 (sH ) = p E1 (sH | X1 = 2) + q E1 (sH | X1 = 0).

How can we simplify this? The second term is easy, since if $X_1 = 0$, then we must have $H = 1$. So $E_1(s^H \mid X_1 = 0) = s$. The first term is more tricky. We are now at 2. To get to 0, we need to pass through 1. So the time needed to get to 0 is the time to get from 2 to 1 (say $H'$), plus the time to get from 1 to 0 (say $H''$). We know that $H'$ and $H''$ have the same distribution as $H$, and by the strong Markov property, they are independent. So
$$G_1 = p\, E_1(s^{H' + H'' + 1}) + qs = ps\,G_1^2 + qs \implies G_1(s) = \frac{1 \pm \sqrt{1 - 4pqs^2}}{2ps}.$$

We have to be careful here. This result says that for each value of $s$, $G_1(s)$ is either $\frac{1}{2ps}\big(1 + \sqrt{1 - 4pqs^2}\big)$ or $\frac{1}{2ps}\big(1 - \sqrt{1 - 4pqs^2}\big)$. It does not say that there is some consistent choice of $+$ or $-$ that works for everything. However, we know that if we suddenly change the sign, then $G_1(s)$ will be discontinuous at that point, but $G_1$, being a power series, has to be continuous. So the solution must be either $+$ for all $s$, or $-$ for all $s$.
To determine the sign, we can look at what happens when s = 0. We see that the
numerator becomes 1 ± 1, while the denominator is 0. We know that G converges
at s = 0. Hence the numerator must be 0. So we must pick −, ie.
$$G_1(s) = \frac{1 - \sqrt{1 - 4pqs^2}}{2ps}.$$

We can find $P_1(H = k)$ by expanding the Taylor series. What is the probability of ever hitting 0? This is
$$P_1(H < \infty) = \sum_{n=1}^{\infty} P_1(H = n) = \lim_{s \to 1} G_1(s) = \frac{1 - \sqrt{1 - 4pq}}{2p}.$$

We can rewrite this using the fact that $q = 1 - p$: $1 - 4pq = 1 - 4p(1-p) = (1 - 2p)^2 = |q - p|^2$. So
$$P_1(H < \infty) = \frac{1 - |p - q|}{2p} = \begin{cases} 1 & p \leq q \\ \frac{q}{p} & p > q. \end{cases}$$

Using this, we can also find $\mu = E_1(H)$. Firstly, if $p > q$, then it is possible that $H = \infty$, so $\mu = \infty$. If $p \leq q$, we can find $\mu$ by differentiating $G_1(s)$ and evaluating at $s = 1$. Doing this directly would result in horrible and messy algebra, which we want to avoid. Instead, we can differentiate $G_1 = psG_1^2 + qs$ and obtain
$$G_1' = pG_1^2 + ps\,2G_1 G_1' + q \implies G_1'(s) = \frac{pG_1(s)^2 + q}{1 - 2psG_1(s)}$$
$$\implies \mu = \lim_{s \to 1} G_1'(s) = \begin{cases} \infty & p = \tfrac{1}{2} \\ \dfrac{1}{q - p} & p < \tfrac{1}{2}. \end{cases}$$
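As an illustration (not in the original notes), the formulas $P_1(H < \infty) = 1$ and $E_1(H) = 1/(q-p)$ for $p < q$ can be checked by direct simulation; the following Python sketch, with the illustrative parameter $p = 0.4$, is one way to do it.

```python
import random

def simulate_hit_time(p, start=1, max_steps=10_000):
    """Run the gambler's ruin walk from `start` until it hits 0; return the hitting time,
    or None if the walk has not hit 0 within max_steps (a proxy for H = infinity)."""
    x, n = start, 0
    while x != 0 and n < max_steps:
        x += 1 if random.random() < p else -1
        n += 1
    return n if x == 0 else None

p, q = 0.4, 0.6          # p < q, so hitting 0 is certain and E_1(H) = 1/(q - p) = 5
times = [simulate_hit_time(p) for _ in range(100_000)]
hits = [t for t in times if t is not None]
print(len(hits) / len(times))    # close to P_1(H < infinity) = 1
print(sum(hits) / len(hits))     # close to 1/(q - p) = 5
```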

D. 8-36
• Let Ti be the returning time to a state i as defined before. The mean recurrence time
of i is (
∞ i transient
µi = Ei (Ti ) = P∞
n=1 nf i,i (n) i recurrent

• If $i$ is recurrent, we call $i$ a null state if $\mu_i = \infty$. Otherwise we say $i$ is non-null or positive.
• The period of a state i is di = hcf{n ≥ 1 : pi,i (n) > 0}. A state is aperiodic if
di = 1.
• A state i is ergodic if it is aperiodic and positive recurrent.
E. 8-37
Recall that in the random walk starting from the origin, we can only return to the
origin after an even number of steps. This causes problems for a lot of our future
results. For example, we will later look at the “convergence” of Markov chains. If
we are very likely to return to 0 after an even number of steps, but is impossible
for an odd number of steps, we don’t get convergence. Hence we would like to
prevent this from happening. Hence we define the notion of period.
In general, we like aperiodic states. This is not a very severe restriction. For
example, in the random walk, we can get rid of periodicity by saying there is a
very small chance of staying at the same spot in each step. We can make this
chance is so small that we can ignore its existence for most practical purposes, but
will help us get rid of the technical problem of periodicity.

T. 8-38
Suppose $X_0 = i$. Let $V_i = |\{n \geq 1 : X_n = i\}|$ and $F_{i,i} = P_i(T_i < \infty)$. Then $P_i(V_i = r) = F_{i,i}^r(1 - F_{i,i})$ (ie. $V_i$ has a genuine geometric distribution). In particular $P_i(V_i = \infty) = 1$ if $i$ is recurrent and $P_i(V_i < \infty) = 1$ if $i$ is transient.

Let $T_i^r$ be the time at which the $r$th visit back to $i$ takes place, with $T_i^r = \infty$ if $V_i < r$. Since the $T_i^r$ are increasing in $r$,
$$P_i(V_i \geq r) = P_i(T_i^r < \infty) = P_i(T_i^r < \infty \mid T_i^{r-1} < \infty)\, P_i(T_i^{r-1} < \infty) = F_{i,i}\, P_i(T_i^{r-1} < \infty) = F_{i,i}\, P_i(V_i \geq r - 1),$$
using the strong Markov property for the middle step. By iteration we have $P_i(V_i \geq r) = F_{i,i}^r$, hence $P_i(V_i = r) = F_{i,i}^r(1 - F_{i,i})$. So if $F_{i,i} = 1$ (ie. $i$ is recurrent), then $P_i(V_i = r) = 0$ for all $r$, so $P_i(V_i = \infty) = 1$. Otherwise if $F_{i,i} < 1$ (ie. $i$ is transient), $P_i(V_i = r)$ is a genuine geometric distribution, and we get $P_i(V_i < \infty) = 1$.

Intuitively we have $P_i(V_i = r) = F_{i,i}^r(1 - F_{i,i})$ since we have to return $r$ times, each with probability $F_{i,i}$, and then never return. Note that this result says that if a state is recurrent, then we return to it infinitely many times (with probability 1).

T. 8-39
If i ↔ j are communicating, then
1. di = dj .
2. i is recurrent iff j is recurrent.
3. i is positive recurrent iff j is positive recurrent.
4. i is ergodic iff j is ergodic.

1. Assume i ↔ j. Then there are m, n ≥ 1 with pi,j (m), pj,i (n) > 0. By the
Chapman-Kolmogorov equation, we know that

pi,i (m + r + n) ≥ pi,j (m)pj,j (r)pj,i (n) ≥ αpj,j (r),

where α = pi,j (m)pj,i (n) > 0. Now let Dj = {r ≥ 1 : pj,j (r) > 0}. Then by
definition, dj = hcf Dj .

We have just shown that if $r \in D_j$, then we have $m + r + n \in D_i$. We also know that $m + n \in D_i$, since $p_{i,i}(m + n) \geq p_{i,j}(m)\, p_{j,i}(n) > 0$. Hence for any $r \in D_j$, we know that $d_i \mid m + r + n$, and also $d_i \mid m + n$. So $d_i \mid r$. Hence $\operatorname{hcf} D_i \mid \operatorname{hcf} D_j$. By symmetry, $\operatorname{hcf} D_j \mid \operatorname{hcf} D_i$ as well. So $\operatorname{hcf} D_i = \operatorname{hcf} D_j$.

2. Proved before in [T.8-21].

3. This is deferred to a later time.

4. Follows directly from the previous 3 results by definition.

This result says that recurrence, period, positive recurrence and ergodicity are class properties: if two states are in the same communicating class, then they are either both recurrent or both transient; they have the same period; and so on.

P. 8-40
If the chain is irreducible and j ∈ S is recurrent, then P(Xn = j for some n ≥ 1) = 1 regardless of the distribution of X0. (Note that recurrence only says that if we start at j, then we will return to j; here we are saying that wherever we start, we will eventually visit j.)

Let $f_{i,j} = P_i(X_n = j \text{ for some } n \geq 1)$. Since $j \to i$, there exists a least integer $m \geq 1$ with $p_{j,i}(m) > 0$. Since $m$ is least, we know that $p_{j,i}(m) = P_j(X_m = i,\, X_r \neq j \text{ for } r < m)$. Then
$$p_{j,i}(m)(1 - f_{i,j}) \leq 1 - f_{j,j}.$$
This is since the left hand side is the probability that we first go from $j$ to $i$ in exactly $m$ steps (without revisiting $j$), and then never go from $i$ to $j$ again; while the right hand side is just the probability of never returning to $j$ starting from $j$; and it is easier to simply never return to $j$ than to go to $i$ in exactly $m$ steps and then never return to $j$. From this equation we see that if $f_{j,j} = 1$, then $f_{i,j} = 1$, and this holds for every state $i$. Now let $\lambda_k = P(X_0 = k)$ be our initial distribution. Then
$$P(X_n = j \text{ for some } n \geq 1) = \sum_i \lambda_i\, P_i(X_n = j \text{ for some } n \geq 1) = \sum_i \lambda_i f_{i,j} = 1.$$

8.3 Long-run behaviour


D. 8-41
• Let X be a Markov chain with transition matrix P. A distribution π = (πk : k ∈ S) (ie. with $\pi_k \geq 0$ and $\sum_k \pi_k = 1$) is an invariant distribution if π = πP. An invariant distribution is also known as an invariant measure, equilibrium distribution or steady-state distribution.
• Suppose $X_0 = k$. Let $W_i$ denote the number of visits to $i$ before the next visit to $k$, that is $W_i = \sum_{m=1}^{\infty} 1(X_m = i,\, m \leq T_k) = \sum_{m=1}^{T_k} 1(X_m = i)$, where $T_k$ is the recurrence time of $k$ and $1$ is the indicator function. In particular, $W_i = 1$ for $i = k$ (if $T_k$ is finite). We write $\rho_i = E_k(W_i)$ and $\rho = (\rho_i : i \in S)$.
E. 8-42
We want to look at what happens in the long run. Recall in [E.8-9] we calculated
the transition probabilities of the two-state Markov chain, and saw that in the
long run, as n → ∞, the probability distribution of the Xn will “converge” to
some particular distribution. Moreover, this limit does not depend on where we
started. We’d like to see if this is true for all Markov chains.
First of all, we want to make it clear what we mean by the chain “converging” to
something. For the purposes of our investigation of Markov chains here, we’ll look
at the probability mass function. For each k ∈ S, we ask if P(Xn = k) converges
to anything, and if they do, is the limit a distribution (probability mass function).
The limit will be known as an invariant distribution. Recall that if $X_0$ has distribution $\lambda$, then the distribution of $X_n$ is given by $\lambda P^n$. We have the following trivial identity: $\lambda P^{n+1} = (\lambda P^n)P$. If the distribution converges, then we have $\lambda P^n \to \pi$ for some $\pi$, and also $\lambda P^{n+1} \to \pi$. Hence the limit $\pi$ satisfies $\pi P = \pi$.

The main focus of this section is to study the existence and properties of invariant distributions, and we will then give sufficient conditions for convergence to occur.
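As a concrete illustration (not part of the original notes), here is a small Python sketch that computes an invariant distribution numerically by solving $\pi = \pi P$ together with $\sum_k \pi_k = 1$, and checks that $\lambda P^n$ approaches $\pi$; the $3 \times 3$ matrix used is just an arbitrary example.

```python
import numpy as np

# An arbitrary irreducible transition matrix (each row sums to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Solve pi = pi P, i.e. (P - I)^T pi^T = 0, together with the normalisation sum(pi) = 1.
n = P.shape[0]
A = np.vstack([(P - np.eye(n)).T, np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)                                     # the invariant distribution

lam = np.array([1.0, 0.0, 0.0])               # an arbitrary starting distribution
print(lam @ np.linalg.matrix_power(P, 50))    # approximately pi, i.e. lambda P^n -> pi
```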
P. 8-43
For an irreducible recurrent chain with $X_0 = k \in S$,
(i) $\rho_k = 1$  (ii) $\sum_i \rho_i = \mu_k$  (iii) $\rho = \rho P$  (iv) $0 < \rho_i < \infty$ for all $i \in S$.

i. This follows from the definition of $\rho_i$, since for $m < T_k$, $X_m \neq k$.
ii. Note that $\sum_i W_i = T_k$, since in each step we hit exactly one state. We have
$$\sum_i \rho_i = \sum_i E_k(W_i) = E_k\Big(\sum_i W_i\Big) = E_k(T_k) = \mu_k.$$
Note that we swapped the sum and expectation in a potentially infinite sum. However, there is a theorem (monotone convergence) that tells us this is okay whenever the summands are non-negative.
iii. We have
$$\rho_j = E_k(W_j) = E_k\Big(\sum_{m \geq 1} 1(X_m = j,\, T_k \geq m)\Big) = \sum_{m \geq 1} P_k(X_m = j,\, T_k \geq m)$$
$$= \sum_{m \geq 1}\sum_{i \in S} P_k(X_m = j \mid X_{m-1} = i,\, T_k \geq m)\, P_k(X_{m-1} = i,\, T_k \geq m).$$
We now use the Markov property. Note that $T_k \geq m$ means $X_1, \cdots, X_{m-1}$ are all not $k$, so it is an event determined by $X_1, \cdots, X_{m-1}$. The Markov property thus tells us the condition $T_k \geq m$ can be dropped from the conditioning. So we are left with
$$= \sum_{m \geq 1}\sum_{i \in S} P_k(X_m = j \mid X_{m-1} = i)\, P_k(X_{m-1} = i,\, T_k \geq m) = \sum_{m \geq 1}\sum_{i \in S} p_{i,j}\, P_k(X_{m-1} = i,\, T_k \geq m) = \sum_{i \in S} p_{i,j} \sum_{m \geq 1} P_k(X_{m-1} = i,\, T_k \geq m).$$
So we just need to show that $\sum_{m \geq 1} P_k(X_{m-1} = i,\, T_k \geq m) = \rho_i$. First we let $r = m - 1$, and get
$$\sum_{m \geq 1} P_k(X_{m-1} = i,\, T_k \geq m) = \sum_{r=0}^{\infty} P_k(X_r = i,\, T_k \geq r + 1).$$
Of course this does not by itself fix the problem. We look at the different possible cases. First, if $i = k$, then the $r = 0$ term is 1, since $T_k \geq 1$ is always true by definition and $X_0 = k$ by construction. On the other hand, the other terms are all zero, since it is impossible for the return time to be greater than or equal to $r + 1$ if we are at $k$ at time $r \geq 1$. So the sum is 1, which is $\rho_k$.
In the case where $i \neq k$, first note that when $r = 0$ we know that $X_0 = k \neq i$, so the term is zero. For $r \geq 1$, we know that if $X_r = i$ and $T_k \geq r$, then we must also have $T_k \geq r + 1$, since it is impossible for the return time to $k$ to be exactly $r$ if we are not at $k$ at time $r$. So $P_k(X_r = i,\, T_k \geq r + 1) = P_k(X_r = i,\, T_k \geq r)$, and the sum becomes $\sum_{r \geq 1} P_k(X_r = i,\, T_k \geq r) = E_k(W_i) = \rho_i$. So indeed we have $\sum_{m \geq 1} P_k(X_{m-1} = i,\, T_k \geq m) = \rho_i$.

iv. To show that $0 < \rho_i < \infty$, first fix our $i$, and note that $\rho_k = 1$. We know that $\rho = \rho P = \rho P^n$ for $n \geq 1$. So by expanding the matrix sum, we know that for any $m, n$, we have $\rho_i \geq \rho_k p_{k,i}(n)$ and $\rho_k \geq \rho_i p_{i,k}(m)$. By irreducibility, we now choose $m, n$ such that $p_{i,k}(m), p_{k,i}(n) > 0$. So the result follows since $\rho_k = 1$ and
$$\rho_k\, p_{k,i}(n) \leq \rho_i \leq \frac{\rho_k}{p_{i,k}(m)}.$$

This result says that for an irreducible recurrent Markov chain, there exists a $\rho$ (where $\rho_i$ is the mean number of visits to $i$ before returning to $k$) such that $\rho P = \rho$. This is not quite an invariant distribution, since a distribution requires $\sum_i \rho_i = 1$, and we don't know if we can actually normalise $\rho$. In particular, since $\sum_i \rho_i = \mu_k$, if $\mu_k = \infty$ then we cannot normalise $\rho$.
L. 8-44
Suppose $\lambda_i \geq 0$ and $\sum_{i=1}^{\infty} \lambda_i$ converges, also $|\alpha_i(n)| < M$ for some $M$ for all $i, n$, and $\alpha_i(n) \to 0$ as $n \to \infty$ for each $i$. Then $\sum_{i=1}^{\infty} \lambda_i \alpha_i(n) \to 0$ as $n \to \infty$.

Given $\varepsilon > 0$, we can pick $N$ such that $M \sum_{i=N+1}^{\infty} \lambda_i < \varepsilon/2$. We can then pick $L$ so that $\sum_{i=1}^{N} \lambda_i |\alpha_i(n)| < \varepsilon/2$ for all $n \geq L$. Now for all $n \geq L$,
$$\Big|\sum_{i=1}^{\infty} \lambda_i \alpha_i(n)\Big| \leq \sum_{i=1}^{N} \lambda_i |\alpha_i(n)| + M \sum_{i=N+1}^{\infty} \lambda_i < \varepsilon.$$

T. 8-45
For an irreducible Markov chain:
1. If some state is positive recurrent, then there exists an invariant distribution.
2. If there is an invariant distribution π, then every state is positive recurrent, and
πi = 1/µi for i ∈ S, where µi is the mean recurrence time of i. In particular,
π is unique.

1. Let $k$ be a positive recurrent state. Then $\pi_i = \rho_i/\mu_k$ satisfies $\pi_i \geq 0$ with $\sum_i \pi_i = 1$, and is an invariant distribution, by [P.8-43].

2. Let $\pi$ be an invariant distribution. We first show that all entries are non-zero. For all $n$, we have $\pi = \pi P^n$. Hence for all $i, j \in S$, $n \in \mathbb{N}$, we have $\pi_i \geq \pi_j p_{j,i}(n)$. $(*)$ Since $\sum_i \pi_i = 1$, there is some $k$ such that $\pi_k > 0$. By $(*)$ with $j = k$, we know that $\pi_i \geq \pi_k p_{k,i}(n) > 0$ for some $n$, by irreducibility. So $\pi_i > 0$ for all $i$.
Now we show that all states are positive recurrent. So we need to rule out the cases of transience and null recurrence. Assume all states are transient. Then $p_{j,i}(n) \to 0$ for all $i, j \in S$ as $n \to \infty$.[T.8-20] However, we know that $\pi_i = \sum_j \pi_j p_{j,i}(n)$. By our previous lemma, $\sum_j \pi_j p_{j,i}(n) \to 0$ as $n \to \infty$, since our state space is countable, $p_{j,i}(n) \leq 1$ and $p_{j,i}(n) \to 0$. This is a contradiction, since $\pi_i$ is non-zero. Hence all states are recurrent.
To rule out the case of null recurrence, we prove that $\pi_i \mu_i = 1$; this would imply that $\mu_i$ is finite, since $\pi_i > 0$. By definition $\mu_i = E_i(T_i)$, and we have the general formula $E(N) = \sum_n P(N \geq n)$. So we get $\pi_i \mu_i = \sum_{n=1}^{\infty} \pi_i P_i(T_i \geq n)$. Note that $P_i$ is a probability conditional on starting at $i$. So to work with the
expression $\pi_i P_i(T_i \geq n)$, it is helpful to let $\pi_i$ be the probability of starting at $i$. So let $X_0$ have distribution $\pi$; then
$$\pi_i \mu_i = \sum_{n=1}^{\infty} P(T_i \geq n,\, X_0 = i).$$

The first term, when $n = 1$, is $P(T_i \geq 1, X_0 = i) = P(X_0 = i) = \pi_i$, since we know that we always have $T_i \geq 1$ by definition. For other $n \geq 2$, we want to compute $P(T_i \geq n, X_0 = i)$. This is the probability of starting at $i$, and then not returning to $i$ in the next $n - 1$ steps. So we have
$$P(T_i \geq n,\, X_0 = i) = P(X_0 = i,\, X_m \neq i \text{ for } 1 \leq m \leq n-1)$$
$$= P(X_m \neq i \text{ for } 1 \leq m \leq n-1) - P(X_m \neq i \text{ for } 0 \leq m \leq n-1).$$

Since we started with an invariant distribution, we stay in the invariant distribution for all time. So looking at the time interval $1 \leq m \leq n-1$ is the same as looking at $0 \leq m \leq n-2$; in other words, the sequence $(X_0, \cdots, X_{n-2})$ has the same distribution as $(X_1, \cdots, X_{n-1})$. So we can write the expression as
$$P(T_i \geq n,\, X_0 = i) = a_{n-2} - a_{n-1}, \qquad \text{where } a_r = P(X_m \neq i \text{ for } 0 \leq m \leq r).$$

Now we are summing differences, and when we sum differences everything cancels term by term. We have
$$\pi_i \mu_i = \lim_{N \to \infty}\big[\pi_i + (a_0 - a_1) + (a_1 - a_2) + \cdots + (a_{N-2} - a_{N-1})\big] = \lim_{N \to \infty}\big[\pi_i + a_0 - a_{N-1}\big] = \pi_i + a_0 - \lim_{N \to \infty} a_N.$$

What is each term? $\pi_i$ is the probability that $X_0 = i$, and $a_0$ is the probability that $X_0 \neq i$. So we know that $\pi_i + a_0 = 1$. What about $\lim a_N$? We know that $\lim_{N\to\infty} a_N = P(X_m \neq i \text{ for all } m)$. Since the state is recurrent and the chain irreducible, the probability of never visiting $i$ is 0.[P.8-40] So we get $\pi_i \mu_i = 1$. Since $\pi_i > 0$, we get $\mu_i = 1/\pi_i < \infty$ for all $i$. Hence we have positive recurrence.
P. 8-46
Let i, k ∈ S be distinct states of an irreducible, positive recurrent Markov chain
with invariant distribution π. The mean number of visits to state i between two
consecutive visits to k equals πi /πk .

By [P.8-43] and 1 of [T.8-45], $\pi'$ defined by $\pi'_i = \rho_i/\mu_k$ is an invariant distribution. By 2 of [T.8-45] this is the unique invariant distribution, so $\pi = \pi'$. Now we have $1/\mu_k = \pi_k$ and $\pi_i = \rho_i/\mu_k$, hence $\rho_i = \pi_i/\pi_k$.
T. 8-47
<Convergence theorem for Markov chains> For an irreducible, positive
recurrent and aperiodic Markov chain, pi,k (n) → πk as n → ∞, where π is the
unique invariant distribution.

The idea of the proof is to show that for any i, j, k ∈ S, we have pi,k (n)−pj,k (n) →
0 as n → ∞. Then we can argue that no matter where we start, we will tend to
the same distribution, and hence any distribution tends to the same distribution
as π, since π doesn’t change.

Suppose we have two such independent Markov chains, one starts at i and the other
starts at j. Define the pair Z = (X, Y ) of the two chains, with X = (Xn ) and
Y = (Yn ) each having the state space S and transition matrix P . Now Z = (Zn ),
where Zn = (Xn , Yn ) is a Markov chain on state space S 2 . This has transition
probabilities $p_{ij,k\ell} = p_{i,k}\, p_{j,\ell}$ by independence of the chains.
First, it can be shown that $Z$ is irreducible. We have $p_{ij,k\ell}(n) = p_{i,k}(n)\, p_{j,\ell}(n)$, and we want this to be strictly positive for some $n$. We know that there is some $m$ such that $p_{i,k}(m) > 0$, and some $r$ such that $p_{j,\ell}(r) > 0$. However, what we need is an $n$ that makes them simultaneously positive. Such an $n$ exists by aperiodicity: a standard number-theoretic argument shows that, for an irreducible aperiodic chain, $p_{i,k}(n) > 0$ for all sufficiently large $n$, and similarly for $p_{j,\ell}(n)$.
Next we show that it is positive recurrent. We know that X and Y are positive recurrent. By our previous theorem, there is a unique invariant distribution π for
P . It is then easy to check that Z has invariant distribution

ν = (νij : ij ∈ S 2 ) where νi,j = πi πj .

This works because X and Y are independent. So Z is also positive recurrent.


The next step is to couple the two chains together. The idea is to fix some state
s ∈ S, and let T be the earliest time at which Xn = Yn = s. Because of recurrence,
we can always find such a T (it is finite with probability 1). After this time T, X and Y evolve with the exact same distribution.
Define T = inf{n : Zn = (Xn , Yn ) = (s, s)}. Write Pij ( · ) = P( · | X0 = i, Y0 = j),
we have

pi,k (n) = Pi (Xn = k) = Pij (Xn = k) = Pij (Xn = k, T ≤ n) + Pij (Xn = k, T > n)

Note that if n ≥ T , then since at time T , XT = YT , the evolution of X and Y


after time T is equal. So this is equal to

= Pij (Yn = k, T ≤ n) + Pij (Xn = k, T > n)


≤ Pij (Yn = k) + Pij (T > n) = pj,k (n) + Pij (T > n).

Hence $|p_{i,k}(n) - p_{j,k}(n)| \leq P_{ij}(T > n)$. As $n \to \infty$, we know that $P_{ij}(T > n) \to 0$ since $Z$ is recurrent. So $|p_{i,k}(n) - p_{j,k}(n)| \to 0$. Now by the invariance of $\pi$, we have $\pi = \pi P^n$ for all $n$. So we can write $\pi_k = \sum_j \pi_j p_{j,k}(n)$. Hence we have
$$|\pi_k - p_{i,k}(n)| = \Big|\sum_j \pi_j \big(p_{j,k}(n) - p_{i,k}(n)\big)\Big| \leq \sum_j \pi_j\, |p_{j,k}(n) - p_{i,k}(n)|.$$

We know that each individual |pj,k (n) − pi,k (n)| tends to zero. So by our lemma
we know πk − pi,k (n) → 0.
This proof is done by “coupling”. The idea of coupling is that here we have two
sets of probabilities, and we want to prove relations between them. The first step
is to move our attention to random variables, by considering random variables that
give rise to these probability distributions. In other words, we look at the Markov
chains themselves instead of the probabilities. In general, random variables are
nicer to work with, since they are functions, not discrete, unrelated numbers.

What happens in the null recurrent case? We would still be able to prove that $p_{i,k}(n) - p_{j,k}(n) \to 0$, since $T$ is finite by recurrence. However, we do not have a $\pi$ to make the last step.
Here is an elementary example which highlights the necessity of aperiodicity in the convergence theorem. Let $X$ be a Markov chain with state space $S = \{0, 1\}$ and transition matrix $P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$. Thus, $X$ alternates deterministically between the two states. It is immediate that $P^{2m} = I$ and $P^{2m+1} = P$ for all $m$, and in particular, the limit $\lim_{n\to\infty} p_{i,j}(n)$ doesn't exist for any $i, j \in S$. The proof of the theorem fails since the paired chain $Z$ is not irreducible: for example, if $Z_0 = (0, 1)$, then $Z_n \neq (0, 0)$ for all $n$.
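The contrast between the aperiodic and periodic cases is easy to see numerically. The following Python sketch (not from the original notes; the first matrix is just an arbitrary aperiodic example) shows the rows of $P^n$ converging to the invariant distribution, while powers of the periodic two-state matrix above simply oscillate.

```python
import numpy as np

# Aperiodic, irreducible chain: all rows of P^n converge to the invariant distribution.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
print(np.linalg.matrix_power(P, 100))   # every row is close to (0.25, 0.5, 0.25)

# Periodic two-state chain from the example above: P^n never converges.
Q = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(np.linalg.matrix_power(Q, 100))   # equals I
print(np.linalg.matrix_power(Q, 101))   # equals Q
```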
E. 8-48
<Coupling game> A pack of playing cards is shuffled, and the cards dealt (face
up) one by one. A friend is asked to select some card, secretly, from amongst the
first six. If the face value of this card is m (aces count 1 and court cards count 10
etc.), the next m − 1 cards are allowed to pass, and your friend is asked to note
the face value of the mth card. Continuing according to this rule, there arrives a
last card in this sequence, with face value of say X, and with fewer than X cards
remaining. We call X your friend’s ‘score’. If you follow the same rules as your
friend, starting for simplicity at the first card, you obtain thereby a score Y. It turns out that there is a high probability that X = Y.
Why is this the case? Suppose your friend picks the m1 th card, m2 th card, and
so on, and you pick the n1 (= 1)th, n2 th and so on. If mi = nj for some i, j, the
two of you are ‘stuck together’ forever after. When this occurs first, we say that
coupling has occurred. Prior to coupling, each time you read the value of a card,
there is a positive probability that you will arrive at the next stage on exactly
the same card as the other person. If the pack of cards were infinitely large, then
coupling would take place sooner or later. It turns out that there is a reasonable
chance that coupling takes place before the last card of a regular pack of 52 cards
has been dealt.
P. 8-49
Let X be an irreducible, recurrent Markov chain. Then every state is null recurrent
if and only if there exists a state i such that pi,i (n) → 0 as n → ∞.

We will only prove the backward direction. Suppose X is positive recurrent and, in addition, aperiodic; then $p_{ii}(n) \to \pi_i = 1/\mu_i > 0$. Therefore, there is no state such that $p_{i,i}(n) \to 0$. In the periodic case, consider the chain $Y_n = X_{nd}$, where d is the period of the chain; then $Y_n$ is aperiodic. So by the same argument $p_{ii}(nd)$ converges to a positive limit, hence $p_{ii}(n) \not\to 0$.
E. 8-50
In [T.8-25] we saw that for the 1- and 2-dimensional symmetric random walks, $p_{0,0}(2n) \to 0$ as $n \to \infty$, and we know $p_{0,0}(2n+1) = 0$ for all $n$. Hence $p_{0,0}(n) \to 0$ as $n \to \infty$. Therefore the 1- and 2-dimensional symmetric random walks are null recurrent.
E. 8-51
Intuitively, for an irreducible and positive recurrent chain, $\pi_i = 1/\mu_i$ is the proportion of time we spend in state $i$. To prove this, we let $V_i(n) = |\{m \leq n : X_m = i\}|$. We want to show $V_i(n)/n \to \pi_i$ as $n \to \infty$. Note that technically, this is not a well-formed statement, since we don't exactly know how convergence of random
variables should be defined. Nevertheless, we will give an informal proof of this


result.

The idea is to look at the average time between successive visits. We assume $X_0 = i$. We let $T_m$ be the time of the $m$th return to $i$. In particular, $T_0 = 0$. Let $U_m = T_m - T_{m-1}$. All of these are iid by the strong Markov property, and have mean $\mu_i$ by definition of $\mu_i$. Hence, by the law of large numbers,
$$\frac{1}{m} T_m = \frac{1}{m}\sum_{r=1}^{m} U_r \to E[U_1] = \mu_i. \qquad (*)$$

Now note that $V_i(n) \leq k$ if and only if $T_k \geq n$. We can write an equivalent statement by letting $k$ be a real number. We denote by $\lceil x \rceil$ the least integer greater than $x$. Then we have $V_i(n) \leq x \Leftrightarrow T_{\lceil x \rceil} \geq n$. Putting a funny value of $x$ in, we get
$$\frac{V_i(n)}{n} \leq \frac{A}{\mu_i} \iff \frac{1}{n}\, T_{\lceil An/\mu_i \rceil} \geq 1.$$

However, using $(*)$, we know that $T_{\lceil An/\mu_i\rceil}/\lceil An/\mu_i\rceil \to \mu_i$. Multiplying both sides by $A/\mu_i$, we get
$$\frac{T_{\lceil An/\mu_i\rceil}}{n} \to A.$$

So if $A < 1$, the event $\frac{1}{n} T_{\lceil An/\mu_i\rceil} \geq 1$ (and hence the event $\frac{V_i(n)}{n} \leq \frac{A}{\mu_i}$) occurs with probability tending to 0 as $n \to \infty$; if $A > 1$, it occurs with probability tending to 1. So in some sense $V_i(n)/n \to 1/\mu_i = \pi_i$. It should be clear that even if we didn't assume $X_0 = i$, this should still hold, since we will reach $i$ for the first time in a finite time $T'$, and after that the behaviour is the same as in our calculation, so $V_i(n)/(n - T') \to 1/\mu_i$. Hence
$$\frac{V_i(n)}{n} = \frac{V_i(n)}{n - T'} \cdot \frac{n - T'}{n} \to \frac{1}{\mu_i} \quad \text{as } n \to \infty.$$
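This ergodic-average interpretation is easy to check by simulation. The sketch below (my own illustration, not from the notes) runs a small aperiodic chain for many steps and compares the empirical proportions $V_i(n)/n$ with the invariant distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])   # invariant distribution (0.25, 0.5, 0.25)

n_steps = 200_000
x = 0
visits = np.zeros(3)
for _ in range(n_steps):
    x = rng.choice(3, p=P[x])    # take one step of the chain
    visits[x] += 1
print(visits / n_steps)          # close to (0.25, 0.5, 0.25), i.e. V_i(n)/n is close to pi_i
```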

8.4 Time reversal


T. 8-52
Let $X = (X_0, \cdots, X_N)$ be positive recurrent, irreducible with invariant distribution $\pi$. Suppose that $X_0$ has distribution $\pi$. Then $Y$ defined by $Y_k = X_{N-k}$ is a Markov chain with transition matrix $\hat{P} = (\hat{p}_{i,j} : i, j \in S)$, where $\hat{p}_{i,j} = \frac{\pi_j}{\pi_i}\, p_{j,i}$. Also $\pi$ is invariant for $\hat{P}$.

First we show that $\hat{P}$ is a stochastic matrix. We clearly have $\hat{p}_{i,j} \geq 0$. We also have, for each $i$, that $\sum_j \hat{p}_{i,j} = \frac{1}{\pi_i}\sum_j \pi_j p_{j,i} = \frac{1}{\pi_i}\pi_i = 1$, using the fact that $\pi = \pi P$.
We now show $\pi$ is invariant for $\hat{P}$: we have $\sum_i \pi_i \hat{p}_{i,j} = \sum_i \pi_j p_{j,i} = \pi_j$, since $P$ is a stochastic matrix and $\sum_i p_{j,i} = 1$. Note that our formula for $\hat{p}_{i,j}$ gives $\pi_i \hat{p}_{i,j} = \pi_j p_{j,i}$.

Now we show that $Y$ is a Markov chain. We have
$$P(Y_0 = i_0, \cdots, Y_k = i_k) = P(X_{N-k} = i_k,\, X_{N-k+1} = i_{k-1},\, \cdots,\, X_N = i_0)$$
$$= \pi_{i_k} p_{i_k, i_{k-1}}\, p_{i_{k-1}, i_{k-2}} \cdots p_{i_1, i_0} = (\pi_{i_k} p_{i_k, i_{k-1}})\, p_{i_{k-1}, i_{k-2}} \cdots p_{i_1, i_0} = \hat{p}_{i_{k-1}, i_k} (\pi_{i_{k-1}} p_{i_{k-1}, i_{k-2}}) \cdots p_{i_1, i_0} = \cdots = \pi_{i_0}\, \hat{p}_{i_0, i_1} \hat{p}_{i_1, i_2} \cdots \hat{p}_{i_{k-1}, i_k}.$$

So Y is a Markov chain.
Most of the results here should not be surprising, apart from the fact that Y is
Markov. Since Y is just X reversed, the transition matrix of Y is just the transpose
of the transition matrix of X, with some factors to get the normalization right.
Also, it is not surprising that π is invariant for P̂ , since each Xi , and hence Yi has
distribution π by assumption.
D. 8-53
• An irreducible Markov chain X = (X0 , · · · , XN ) in its invariant distribution π is
said to be reversible if its reversal has the same transition probabilities as does
X, that is it satisfies the detailed balance equation πi pi,j = πj pj,i for all i, j ∈ S.
• In general, if λ is a distribution that satisfies λi pi,j = λj pj,i we say (P, λ) is in
detailed balance .
E. 8-54
Time reversibility is a very useful concept in the theory of random networks. There
is a valuable analogy using the language of flows. Let X be a Markov chain with
state space S and invariant distribution π. To this chain there corresponds the
following directed network (or graph). The vertices of the network are the states
of the chain, and an arrow is placed from vertex i to vertex j if pi,j > 0. At the
start, one unit of a notional material is distributed about the vertices such that
proportion πi of the material is placed initially at vertex i. At each epoch of time
and for each vertex i, a proportion pi,j of the material at i is transported to each
vertex j.
The amount of material at vertex $i$ after one epoch is $\sum_j \pi_j p_{j,i}$, which equals $\pi_i$
since π = πP . That is to say, the deterministic flow of probability is in equilibrium:
there is ‘global balance’ in the sense that the total quantity leaving each vertex
is balanced by an equal quantity arriving there. There may or may not be ‘local
balance’, in the sense that, for every i, j ∈ S, the amount flowing from i to j equals
the amount flowing from j to i. Local balance occurs if and only if πi pi,j = πj pj, i
for all i, j ∈ S, which is to say that P and π are in detailed balance.
P. 8-55
Let P be the transition matrix of an irreducible Markov chain X. Suppose (P, λ)
is in detailed balance. Then λ is the unique invariant distribution and the chain
is reversible (when X0 has distribution λ).

It suffices to show that $\lambda$ is invariant. Then it is automatically unique and the chain is by definition reversible. This is easy to check: we have
$$\sum_j \lambda_j p_{j,i} = \sum_j \lambda_i p_{i,j} = \lambda_i \sum_j p_{i,j} = \lambda_i.$$

If we want to show reversibility, we can compute $\pi$ and check that it satisfies the detailed balance equation directly; to find $\pi$, we would need to solve $\pi_i = \sum_j \pi_j p_{j,i}$. This result gives another method: we just need to solve $\lambda_i p_{i,j} = \lambda_j p_{j,i}$, and there is no sum involved. So this is indeed a helpful result.
E. 8-56
<Birth-death chain with immigration> Recall our birth-death chain in [E.8-31], where at each state i > 0, we move to i + 1 with probability pi and to i − 1 with probability qi = 1 − pi. When we are at 0, we are dead and no longer change. We wouldn't be able to apply our previous result to this scenario, since 0 is an absorbing state, and this chain is obviously not irreducible, let alone positive recurrent. Hence we make a slight modification to our scenario: if we have population 0, we allow ourselves to have a probability p0 of having an immigrant and getting to 1, or probability q0 = 1 − p0 that we stay at 0.
We can try to find a solution to the detailed balance equations $\lambda_i p_{i,j} = \lambda_j p_{j,i}$. If it works, we would have solved it quickly. If not, we have just wasted a minute or two. Note that the equation is automatically satisfied if $j$ and $i$ differ by at least 2, since both sides are zero. So we only look at the case where $j = i + 1$ (the case $j = i - 1$ is the same equation with the roles of $i$ and $j$ swapped). So the only equation we have to satisfy is $\lambda_i p_i = \lambda_{i+1} q_{i+1}$ for all $i$. This is just a recursive formula for $\lambda_i$, and we can solve it to get
$$\lambda_i = \frac{p_{i-1}}{q_i}\,\lambda_{i-1} = \cdots = \left(\frac{p_{i-1}\, p_{i-2} \cdots p_0}{q_i\, q_{i-1} \cdots q_1}\right)\lambda_0.$$
We can call the term in the brackets $\rho_i = \frac{p_{i-1}\, p_{i-2} \cdots p_0}{q_i\, q_{i-1} \cdots q_1}$. For $\lambda$ to be a distribution, we need $1 = \sum_i \lambda_i = \lambda_0 \sum_i \rho_i$. Thus if $\sum_i \rho_i < \infty$, then we can pick $\lambda_0 = 1/\sum_i \rho_i$ and $\lambda$ is a distribution. Hence this is the unique invariant distribution.
If $\sum_i \rho_i$ diverges, the method fails, and we need to use our more traditional methods to check recurrence and transience.
to check recurrence and transience.
E. 8-57
<Random walk on a finite graph> A graph is a collection of points with edges between them. More precisely, a graph here is a pair G = (V, E), where E contains distinct unordered pairs of distinct vertices (u, v), drawn as edges from u to v.
Note that the restriction of distinct pairs and distinct vertices are there to prevent
loops and parallel edges, and the fact that the pairs are unordered means our edges
don’t have orientations. A graph G is connected if for all u, v ∈ V , there exists a
path along the edges from u to v.
Let G = (V, E) be a connected graph with |V| < ∞. Let X = (Xn) be a random walk on G. Here we live on the vertices, and in each step we move to an adjacent vertex. More precisely, if Xn = x, then Xn+1 is chosen uniformly
at random from the set of neighbours of x, ie. the set {y ∈ V : (x, y) ∈ E},
independently of the past. This is a Markov chain.
For example, our previous simple symmetric random walks on Z or Zd are random
walks on graphs (despite the graphs not being finite). Our transition probabilities
are
$$p_{i,j} = \begin{cases} \frac{1}{d_i} & j \text{ is a neighbour of } i \\ 0 & \text{otherwise}, \end{cases}$$
where $d_i$ is the number of neighbours of $i$, commonly known as the degree of $i$.
By connectivity, the Markov chain is irreducible. Since it is finite, it is recurrent,
and in fact positive recurrent. This process is a rather “physical” process, and we
would expect it to be reversible. So we try to solve the detailed balance equation
λi pi,j = λj pj,i .
If $j$ is not a neighbour of $i$, then both sides are zero, and it is trivially balanced. Otherwise, the equation becomes $\lambda_i \frac{1}{d_i} = \lambda_j \frac{1}{d_j}$. The solution is obvious: take $\lambda_i = d_i$. In fact we can multiply by any constant $c$, and $\lambda_i = c\, d_i$ works for any $c$. So we pick our $c$ such that this is a distribution, ie. $1 = \sum_i \lambda_i = c \sum_i d_i$.
We now note that since each edge adds 1 to the degree of each of its two endpoints, $\sum_i d_i$ is just twice the number of edges. So the equation gives $1 = 2c|E|$. Hence we get $c = 1/(2|E|)$, and our invariant distribution is $\lambda_i = d_i/(2|E|)$.
Let's look at a specific scenario. Suppose we have a knight on the chessboard. In each step, the allowed moves are:
• Move two steps horizontally, then one step vertically;
• Move two steps vertically, then one step horizontally.
[Diagram: the knight's square is marked with a dot, and the squares reachable in one move are marked with crosses.] At each epoch of time, our erratic knight follows a legal move chosen uniformly from the set of possible moves. Hence we have a Markov chain derived from the chessboard. What is his invariant distribution? We can compute the number of possible moves from each position; on one quarter of the board the degrees are

4 4 3 2
6 6 4 3
8 8 6 4
8 8 6 4

(the corner square has degree 2), and the rest of the board follows by symmetry. The sum of degrees is $\sum_i d_i = 336$. So the invariant distribution at, say, the corner is $\pi_{\mathrm{corner}} = \frac{2}{336}$.
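The degree counts and the value $\sum_i d_i = 336$ are easy to reproduce computationally; here is a short Python sketch (not part of the notes) that builds the knight's-move degrees on an 8 by 8 board.

```python
import numpy as np

# Degrees of the knight's-move graph on an 8x8 chessboard.
moves = [(1, 2), (2, 1), (-1, 2), (-2, 1), (1, -2), (2, -1), (-1, -2), (-2, -1)]
deg = np.zeros((8, 8), dtype=int)
for r in range(8):
    for c in range(8):
        deg[r, c] = sum(0 <= r + dr < 8 and 0 <= c + dc < 8 for dr, dc in moves)

print(deg[:4, :4])            # one quarter of the board; cf. the table above (up to reflection)
print(deg.sum())              # 336 = 2|E|
print(deg[0, 0] / deg.sum())  # invariant probability of a corner square, 2/336
```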
CHAPTER 9
Groups, Rings and Modules

9.1 Groups
For the basics see Part IA groups. We will only repeat here a minimal amount of
things that are covered in Part IA groups.
E. 9-1
We have the following familiar examples of groups
1. (Z, +, 0), (Q, +, 0), (R, +, 0), (C, +, 0) where we write 0 to emphasise it’s the
identity.
2. The symmetric group Sn is the collection of all permutations of {1, 2, · · · , n}.
The alternating group An ≤ Sn .
3. The dihedral group D2n is the symmetries of a regular n-gon. The cyclic group
Cn ≤ D2n .
4. The group GLn (R) is the group of invertible n × n real matrices, which also is
the group of invertible R-linear maps from the vector space Rn to itself. The
special linear group SLn (R) ≤ GLn (R), the subgroup of matrices of determi-
nant 1.
5. The Klein-four group C2 × C2 .
6. The quaternions Q8 = {±1, ±i, ±j, ±k} with ij = k, ji = −k, i2 = j 2 = k2 =
−1, (−1)2 = 1.
T. 9-2
<First isomorphism theorem> Let φ : G → H be a homomorphism. Then
ker(φ) ◁ G and G/ker(φ) ≅ Im(φ).

Suppose g, h ∈ ker φ, then φ(gh−1 ) = φ(g)φ(h)−1 = ee−1 = e, so gh−1 ∈ ker φ.


Also, φ(e) = e, so ker(φ) is non-empty. Hence ker(φ) is a subgroup. Suppose
g ∈ ker(φ) and x ∈ G, then
φ(x−1 gx) = φ(x−1 )φ(g)φ(x) = φ(x−1 )φ(x) = φ(x−1 x) = φ(e) = e.
So x−1 gx ∈ ker(φ). Hence ker(φ) is normal.
We now construct a homomorphism f : G/ ker(φ) → Im(φ), and prove it is an
isomorphism. Define
f : G/ ker(φ) → Im(φ) with g ker(φ) 7→ φ(g).
Firstly, f is well-defined: If g ker(φ) = g 0 ker(φ), then g −1 g 0 ∈ ker(φ), so φ(g −1 g 0 ) =
e. Now φ(g) = φ(g)φ(g −1 · g 0 ) = φ(g 0 ). Hence this function is well-defined.
Secondly, f is a homomorphism since
f (g ker(φ) · g 0 ker(φ)) = f (gg 0 ker(φ)) = φ(gg 0 ) = φ(g)φ(g 0 )
= f (g ker(φ))f (g 0 ker(φ)).


Finally, we show it is a bijection. Suppose $h \in \mathrm{Im}(\varphi)$; then $h = \varphi(g)$ for some $g$, and so $h = f(g\ker(\varphi))$ is in the image of $f$. To show injectivity, suppose $f(g\ker(\varphi)) = f(g'\ker(\varphi))$; then $\varphi(g) = \varphi(g')$. So $\varphi(g^{-1}g') = e$. Hence $g^{-1}g' \in \ker(\varphi)$, and so $g\ker(\varphi) = g'\ker(\varphi)$.
E. 9-3
Consider φ : C → C \ {0} given by z 7→ ez . We know ez+w = ez ew , so φ is a
homomorphism if we think of it as φ : (C, +) → (C \ {0}, ×).
What is the image of this homomorphism? The existence of log shows that φ is
surjective. So Im φ = C \ {0}. What about the kernel? It is given by ker(φ) =
{z ∈ C : ez = 1} = 2πiZ, ie. the set of all integer multiples of 2πi. Hence

(C/(2πiZ), +) ≅ (C \ {0}, ×).

T. 9-4
<Second isomorphism theorem> Let $H \leq G$ and $K \lhd G$. Then $HK = \{hk : h \in H, k \in K\}$ is a subgroup of $G$, and $H \cap K \lhd H$. Moreover,
$$\frac{HK}{K} \cong \frac{H}{H \cap K}.$$

Suppose $hk, h'k' \in HK$. Then
$$h'k'(hk)^{-1} = h'k'k^{-1}h^{-1} = (h'h^{-1})(hk'k^{-1}h^{-1}) \in HK,$$
since the first term is in $H$, while the second term is $k'k^{-1} \in K$ conjugated by $h$, which has to be in $K$ by normality. Now $HK$ is non-empty as it contains $e$, hence it is a subgroup.
Suppose x ∈ H ∩ K and h ∈ H. Consider h−1 xh. Since x ∈ K, the normality
of K implies h−1 xh ∈ K. Also, since x, h ∈ H, closure implies h−1 xh ∈ H. So
h−1 xh ∈ H ∩ K. Therefore H ∩ K C H.
Now we prove the isomorphism part. To do so, we apply the first isomorphism
theorem. Define φ : H → G/K by h 7→ hK. This is easily seen to be a homo-
morphism. The image is all K cosets represented by something in H, ie. Im(φ) =
HK/K. Then the kernel of φ is ker(φ) = {h : hK = eK} = {h : h ∈ K} = H ∩ K.
So the first isomorphism theorem says H/(H ∩ K) ∼= HK/K.
Notice we did more work than we really had to. We could have started by writing
down φ and checked it is a homomorphism. Then since H ∩ K is its kernel, it has
to be a normal subgroup.
C. 9-5
<Subgroup correspondence> For $K \lhd G$, there is a bijection between subgroups of $G/K$ and subgroups of $G$ containing $K$:
$$\{\text{subgroups of } G/K\} \longleftrightarrow \{\text{subgroups of } G \text{ which contain } K\},$$
where $X \leq G/K$ is sent to $\{g \in G : gK \in X\}$, and in the other direction $K \lhd L \leq G$ is sent to $L/K \leq G/K$.

Using the same bijection this specializes to the bijection of normal subgroups:

{normal subgroups of G/K} ←→ {normal subgroups of G which contain K}

T. 9-6
<Third isomorphism theorem> Let $K \leq L \leq G$ be normal subgroups of $G$. Then
$$\frac{G/K}{L/K} \cong \frac{G}{L}.$$

Define the homomorphism $\varphi : G/K \to G/L$ by $gK \mapsto gL$. This is well-defined: if $gK = g'K$, then $g^{-1}g' \in K \subseteq L$, so $gL = g'L$. This is also a homomorphism, since
$$\varphi(gK \cdot g'K) = \varphi(gg'K) = gg'L = gL \cdot g'L = \varphi(gK)\varphi(g'K).$$
This is clearly surjective, since any coset $gL$ is $\varphi(gK)$. So the image is $G/L$. The kernel is
$$\ker(\varphi) = \{gK : gL = L\} = \{gK : g \in L\} = \frac{L}{K}.$$
So the conclusion follows by the first isomorphism theorem.
The general idea of these theorems is to take a group, find a normal subgroup,
and then quotient it out. Then hopefully the normal subgroup and the quotient
group will be simpler. However, this doesn’t always work.
L. 9-7
An abelian group is simple if and only if it is isomorphic to the cyclic group Cp
for some prime number p.

(Backward) By Lagrange’s theorem, any subgroup of Cp has order dividing |Cp | =


p. Hence if p is prime, then it has no such divisors, and any subgroup must have
order 1 or p, ie. it is either {e} or Cp itself. Hence in particular any normal
subgroup must be {e} or Cp . So it is simple.
(Forward) Suppose G is abelian and simple. Let e 6= g ∈ G be a non-trivial
element, and consider H = {· · · , g −2 , g −1 , e, g, g 2 , · · · }. Since G is abelian, conju-
gation does nothing, and every subgroup is normal. So H is a normal subgroup.
As G is simple, H = {e} or H = G. Since it contains g 6= e, it is non-trivial. So
we must have H = G. So G is cyclic.
If G is infinite cyclic, then it is isomorphic to Z. But Z is not simple, since 2Z C Z.
So G is a finite cyclic group, ie. G ∼ = Cm for some finite m. If n | m, then g m/n
generates a subgroup of G of order n. So this is a normal subgroup. Therefore n
must be m or 1. Hence m is a prime.
T. 9-8
Let G be any finite group. Then there are subgroups H1 , · · · , Hn such that

G = H1 ▷ H2 ▷ H3 ▷ H4 ▷ · · · ▷ Hn = {e} and Hi/Hi+1 is simple for all i.

If G is simple, let H2 = {e}. Then we are done.


If G is not simple, let H2 be a maximal proper normal subgroup of G. We now
claim that G/H2 is simple. If G/H2 is not simple, it contains a proper non-
trivial normal subgroup L C G/H2 such that L 6= {e}, G/H2 . However, there is a
correspondence between normal subgroups of G/H2 and normal subgroups of G
containing H2 . So L must be K/H2 for some K C G such that K ≥ H2 . Moreover,


since L is non-trivial and not G/H2 , we know K is not G or H2 . So K is a larger
normal subgroup. Contradiction.
So we have found an H2 C G such that G/H2 is simple. Iterating this process on
H2 gives the desired result. Note that this process eventually stops, as Hi+1 < Hi ,
and hence |Hi+1 | < |Hi |, and all these numbers are finite.
Note that here we only claim that Hi+1 is normal in Hi . This does not say that,
say, H3 is a normal subgroup of H1 .
E. 9-9
Recall the alternating group $A_n \leq S_n$ is the subgroup of even permutations, ie. $A_n$ is the kernel of sgn. This immediately tells us $A_n \lhd S_n$, and we can immediately work out its index, since $S_n/A_n \cong \mathrm{Im}(\mathrm{sgn}) = \{\pm 1\}$ unless $n = 1$. So $A_n$ has index 2.
D. 9-10
• Let X be a set. We write Sym(X) for the group of all permutations of X. A
group G is called a permutation group if it is a subgroup of Sym(X) for some X.
We say G is a permutation group of order n if in addition |X| = n.
• An action of a group (G, · ) on a set X is a function ∗ : G × X → X such that
1. g1 ∗ (g2 ∗ x) = (g1 · g2 ) ∗ x for all g1 , g2 ∈ G and x ∈ X.
2. e ∗ x = x for all x ∈ X.
• A permutation representation of a group G on a set X is a homomorphism G →
Sym(X).
E. 9-11
We will soon see, every group is (isomorphic to) a permutation group. Sometimes
thinking of a group as a permutation group of some object gives us better intuition
on what the group is about.
Sn and An are obviously permutation groups, as they are permutations of X =
{1, 2, · · · , n}. Also, the dihedral group D2n is a permutation group of order n,
viewing it as a permutation of the vertices of a regular n-gon.
We would next want to recover the idea of a group being a “permutation”. If
G ≤ Sym(X), then each g ∈ G should be able to give us a permutation of X, in a
way that is consistent with the group structure. We say the group G acts on X.
L. 9-12
An action of G on X is equivalent to a homomorphism φ : G → Sym X.

1. Suppose ∗ : G × X → X is an action. Define φ : G → Sym X by sending g to


the function φ(g) = (g ∗ · : X → X). This is indeed a permutation – g −1 ∗ ·
is an inverse since

φ(g −1 )(φ(g)(x)) = g −1 ∗ (g ∗ x) = (g −1 · g) ∗ x = e ∗ x = x,

and a similar argument shows φ(g) ◦ φ(g −1 ) = id. To show it is a homomor-


phism, just note that

φ(g1 )(φ(g2 )(x)) = g1 ∗ (g2 ∗ x) = (g1 · g2 ) ∗ x = φ(g1 · g2 )(x).



Since this is true for all x ∈ X, we know φ(g1 ) ◦ φ(g2 ) = φ(g1 · g2 ). Also,
φ(e)(x) = e ∗ x = x.
2. Suppose φ : G → Sym X is a homomorphism, define a function G × X → X
by g ∗ x = φ(g)(x). This is a group action since
• g1 ∗(g2 ∗x) = φ(g1 )(φ(g2 )(x)) = (φ(g1 )◦φ(g2 ))(x) = φ(g1 ·g2 )(x) = (g1 ·g2 )∗x.
• e ∗ x = φ(e)(x) = idX (x) = x.
These two operations are clearly inverses to each other. So group actions of G on
X is the same as homomorphisms G → Sym(X).
We have thus shown that a permutation representation is the same as a group
action. Also a good thing about thinking of group actions as homomorphisms is
that we can use all we know about homomorphisms on it.
D. 9-13
• For an action of G on X given by a homomorphism φ : G → Sym(X), we write $G^X = \mathrm{Im}(\varphi)$ and $G_X = \ker(\varphi)$.
• Given a subgroup H ≤ G, write G/H for the set of left cosets of H. (Note that this is only a set, not necessarily a group; we use the same notation for the quotient group only when H is normal in G.)
• If G acts on a set X, the orbit of x ∈ X is G · x = {g ∗ x ∈ X : g ∈ G}. The stabilizer of x ∈ X is Gx = {g ∈ G : g ∗ x = x}.
E. 9-14
• The first isomorphism theorem immediately gives $G_X \lhd G$ and $G/G_X \cong G^X$. In particular, if $G_X = \{e\}$ is trivial, then $G \cong G^X \leq \mathrm{Sym}(X)$.
• Let G be the group of symmetries of a cube. Let X be the
set of diagonals of the cube. Then G acts on X, and so we
get φ : G → Sym(X).
Kernel: To preserve the diagonals, a symmetry in the kernel either does nothing to each diagonal, or flips the two vertices of each diagonal. So $G_X = \ker(\varphi) = \{\mathrm{id},\ \text{the symmetry sending each vertex to its opposite}\} \cong C_2$.

Image: We have $G^X = \mathrm{Im}(\varphi) \leq \mathrm{Sym}(X) \cong S_4$. We can show that $\mathrm{Im}(\varphi) = \mathrm{Sym}(X)$, ie. it is surjective. We are not doing this here, as it is more an exercise in geometry.
Now the first isomorphism theorem tells us $G^X \cong G/G_X$. So $|G| = |G^X||G_X| = 4! \cdot 2 = 48$. This is an example of how we can use group actions to count elements in a group.
T. 9-15
<Cayley’s theorem> Every group is (isomorphic to) a subgroup of a symmetric
group.

For any group G, we have an action of G on G itself via g ∗ g1 = gg1 . It is trivial to


check this is indeed an action. This gives a group homomorphism φ : G → Sym(G).
What is its kernel? If g ∈ ker(φ), then it acts trivially on every element. In

particular, it acts trivially on the identity. So g ∗ e = e, which means g = e. So


ker(φ) = {e}. By the first isomorphism theorem G ∼ = G/{e} ∼= Im φ ≤ Sym(G).
E. 9-16
Let H be a subgroup of G, and X = G/H be the set of left cosets of H. We let G
act on X via g ∗ g1 H = gg1 H. It is easy to check this is well-defined and is indeed
a group action. So we get φ : G → Sym(X).
Now consider $G_X = \ker(\varphi)$. If $g \in G_X$, then for every $g_1 \in G$, we have $g * g_1 H = g_1 H$. This means $g_1^{-1} g g_1 \in H$, in other words $g \in g_1 H g_1^{-1}$. This has to happen for all $g_1 \in G$. So
$$G_X \subseteq \bigcap_{g_1 \in G} g_1 H g_1^{-1}.$$

This argument is completely reversible: if $g \in \bigcap_{g_1 \in G} g_1 H g_1^{-1}$, then for each $g_1 \in G$, we know $g_1^{-1} g g_1 \in H$ and hence $g g_1 H = g_1 H$. So $g * g_1 H = g_1 H$ for all $g_1 \in G$, and so $g \in G_X$. Hence we indeed have equality:
$$\ker(\varphi) = G_X = \bigcap_{g_1 \in G} g_1 H g_1^{-1}.$$

Since this is a kernel, it is a normal subgroup of $G$, and it is contained in $H$. Starting with an arbitrary subgroup $H$, this allows us to generate a normal subgroup, and this is in fact the biggest normal subgroup of $G$ that is contained in $H$: suppose $N \lhd G$ and $N \subseteq H$; then $N = \bigcap_{g_1 \in G} g_1 N g_1^{-1}$, but $g_1 N g_1^{-1} \subseteq g_1 H g_1^{-1}$ for all $g_1 \in G$, hence $N \subseteq \bigcap_{g_1 \in G} g_1 H g_1^{-1}$.

T. 9-17
Let G be a finite group, and H ≤ G a subgroup of index n. Then there is a normal
subgroup K C G with K ≤ H such that G/K is isomorphic to a subgroup of Sn .
Hence |G/K| | n! and |G/K| ≥ n.

We apply the previous example, giving φ : G → Sym(G/H), and let K be the


kernel of this homomorphism. We have already shown that K ≤ H. Then the
first isomorphism theorem gives G/K ≅ Im φ ≤ Sym(G/H) ≅ Sn. By Lagrange's
theorem, we know |G/K| | |Sn | = n!, and we also have |G/K| ≥ |G/H| = n as
K ≤ H.
P. 9-18
If a non-abelian simple group G has an index n proper subgroup H, then G is
isomorphic to a subgroup of An . Moreover, we must have n ≥ 5, ie. G cannot
have a subgroup of index less than 5.

The action of G on X = G/H gives a homomorphism φ : G → Sym(X). Then


ker(φ)CG. Since G is simple, ker(φ) is either G or {e}. We first show that it cannot
be G. If ker(φ) = G, then every element of G acts trivially on X = G/H. But if
g ∈ G \ H, which exists since the index of H is not 1, then g ∗ H = gH 6= H. So g
does not act trivially. So the kernel cannot be the whole of G. Hence ker(φ) = {e}.
Thus by the first isomorphism theorem, we get G ∼= Im(φ) ≤ Sym(X) ∼
= Sn . We
now need to show that G is in fact a subgroup of An .
We know An C Sn . So Im(φ) ∩ An C Im(φ) ∼ = G. As G is simple, Im(φ) ∩ An is
either {e} or G = Im(φ). We want to show that the second thing happens, ie. the
intersection is not the trivial group. We use the second isomorphism theorem. If
$\mathrm{Im}(\varphi) \cap A_n = \{e\}$, then
$$\mathrm{Im}(\varphi) \cong \frac{\mathrm{Im}(\varphi)}{\mathrm{Im}(\varphi) \cap A_n} \cong \frac{\mathrm{Im}(\varphi)\, A_n}{A_n} \leq \frac{S_n}{A_n} \cong C_2.$$
So G ∼= Im(φ) is a subgroup of C2 , ie. either {e} or C2 itself. Neither of these are
non-abelian. So this cannot be the case. So we must have Im(φ) ∩ An = Im(φ),
ie. Im(φ) ≤ An .
The last part follows from the fact that S1, S2, S3, S4 have no non-abelian simple subgroups, which can be checked by listing out all their subgroups.
T. 9-19
<Orbit-stabilizer theorem> Let G act on X. Then for any x ∈ X, there is a
bijection between G · x and G/Gx , given by g · x ↔ g · Gx . In particular, if G is
finite, then |G| = |Gx ||G · x|.

The proof is done in IA Groups. In IA Groups, we stated only the "in particular" part as the theorem, but this result is more generally true for infinite groups.
the theorem, but this result is more generally true for infinite groups.
D. 9-20
• The automorphism group of G is

Aut(G) = {f : f is a group isomorphism G → G}.

This is a group under composition, with the identity map as the identity.
• The conjugacy class of g ∈ G is cclG (g) = {hgh−1 : h ∈ G}, ie. the orbit of
g ∈ G under the conjugation action.
• The centralizer of g ∈ G is CG (g) = {h ∈ G : hgh−1 = g}, ie. the stabilizer
of g under the conjugation action. This is alternatively the set of all h ∈ G that
commutes with g.
• The center of a group G is
$$Z(G) = \{h \in G : hgh^{-1} = g \text{ for all } g \in G\} = \bigcap_{g \in G} C_G(g) = \ker(\varphi),$$
where φ is the homomorphism associated to the conjugation action. It consists of the elements of the group that commute with everything else.
• Let H ≤ G. The normalizer of H in G is NG (H) = {g ∈ G : g −1 Hg = H}.
E. 9-21
We have seen that every group acts on itself by multiplying on the left. A group
G can also act on itself by conjugation g ∗ g1 = gg1 g −1 .
Let φ : G → Sym(G) be the associated permutation representation. We know, by
definition, that φ(g) is a bijection from G to G as sets. However, here G is not
an arbitrary set, but is a group. A natural question to ask is whether φ(g) is a
homomorphism or not. Indeed, we have

φ(g)(g1 · g2 ) = gg1 g2 g −1 = (gg1 g −1 )(gg2 g −1 ) = φ(g)(g1 )φ(g)(g2 ).

So φ(g) is a homomorphism from G to G. Since φ(g) is bijective (as in any group


action), it is in fact an isomorphism.

Thus, for any group G, there are many isomorphisms from G to itself, one for
every g ∈ G, and can be obtained from a group action of G on itself. We can, of
course, take the collection of all isomorphisms of G, and form a new group out of
it, the automorphism group. It is a subgroup of Sym(G), and the homomorphism
φ : G → Sym(G) by conjugation lands in Aut(G).
This is pretty fun — we can use this to cook up some more groups, by taking a
group and looking at its automorphism group. We can also take a group, take
its automorphism group, and then take its automorphism group again, and do it
again, and see if this process stabilizes, or becomes periodic, or something!
P. 9-22
Let G be a finite group. Then | ccl(x)| = |G : CG (x)| = |G|/|CG (x)|.

By the orbit-stabilizer theorem, for each x ∈ G, we obtain a bijection ccl(x) ↔


G/CG (x).
In particular, we see that the size of each conjugacy class divides the order of the
group.
E. 9-23
• Note that we certainly have H ≤ NG (H). Even better, H C NG (H), essentially
by definition. This is in fact the biggest subgroup of G in which H is normal:
Suppose H C U ≤ G, for any g ∈ U , gHg −1 = H, hence g ∈ NG (H), therefore
U ⊆ NG (H).
• Recall from IA Groups that permutations in Sn are conjugate if and only if they
have the same cycle type when written as a product of disjoint cycles. We can
think of the cycle types as partitions of n. For example, the partition 2, 2, 1 of 5
corresponds to the conjugacy class of (1 2)(3 4)(5). So the conjugacy classes of Sn
are exactly the partitions of n.
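As a small computational aside (not part of the notes), one can list the conjugacy classes of $S_4$ by cycle type with a few lines of Python and check that each class size divides $|S_4| = 24$, in line with [P.9-22].

```python
from itertools import permutations
from collections import Counter

def cycle_type(perm):
    """Cycle type of a permutation of {0, ..., n-1} given as a tuple."""
    n, seen, lengths = len(perm), set(), []
    for i in range(n):
        if i not in seen:
            length, j = 0, i
            while j not in seen:
                seen.add(j)
                j = perm[j]
                length += 1
            lengths.append(length)
    return tuple(sorted(lengths, reverse=True))

# Conjugacy classes of S_4, grouped by cycle type (= partition of 4).
classes = Counter(cycle_type(p) for p in permutations(range(4)))
print(dict(classes))
# Class sizes are 1, 6, 8, 3, 6 (one per partition of 4), and each divides |S_4| = 24.
```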
T. 9-24
The alternating groups An are simple for n ≥ 5 (also for n = 1, 2, 3).

The cases in brackets follow from a direct check, since $A_1 \cong A_2 \cong \{e\}$ and $A_3 \cong C_3$, all of which are simple. We can also check manually that $A_4$ has non-trivial normal subgroups, and hence is not simple. To prove the case $n \geq 5$ we just need to prove the following:
the followings:
i. An is generated by 3-cycles.
ii. If H C An contains a 3-cycle, then H = An .
iii. If H C An is non-trivial, then H contains a 3-cycle.
(Part i) As any element of An is a product of evenly-many transpositions, it suffices
to show that every product of two transpositions is also a product of 3-cycles.
There are three possible cases: let a, b, c, d be distinct. Then

(a b)(a b) = e, (a b)(b c) = (a b c), (a b)(c d) = (a c b)(a c d)

So every possible product of two transpositions is a product of three-cycles.


(Part ii) Since An is generated by 3-cycles, we just need to show that if H contains a
3-cycle, then every 3-cycle is in H. For concreteness, suppose we know (a b c) ∈ H,
and we want to show (1 2 3) ∈ H.

Since they have the same cycle type, we have σ ∈ Sn such that (a b c) =
σ(1 2 3)σ −1 . If σ is even, ie. σ ∈ An , then we have that (1 2 3) ∈ σ −1 Hσ = H,
by the normality of H and we are trivially done.
If σ is odd, replace it by σ̄ = σ · (4 5). Here is where we use the fact that n ≥ 5
(we will use it again later). Then we have
σ̄(1 2 3)σ̄ −1 = σ(4 5)(1 2 3)(4 5)σ −1 = σ(1 2 3)σ −1 = (a b c),
since (1 2 3) and (4 5) commute. Now σ̄ is even, so (1 2 3) ∈ H as above.
(Part iii) We separate this into many cases
1. Suppose H contains an element which can be written in disjoint cycle notation
σ = (1 2 3 · · · r)τ,
for r ≥ 4. We now let δ = (1 2 3) ∈ An . Then by normality of H, we know
δ −1 σδ ∈ H. Then σ −1 δ −1 σδ ∈ H. Also, we notice that τ does not contain
1, 2, 3. So it commutes with δ, and also trivially with (1 2 3 · · · r). We can
expand σ −1 δ −1 σδ to obtain a 3 cycle in H:
σ −1 δ −1 σδ = (r · · · 2 1)(1 3 2)(1 2 3 · · · r)(1 2 3) = (2 3 r),
The same argument goes through if σ = (a1 a2 · · · ar )τ for any a1 , · · · , ar .
2. Suppose H contains an element consisting of at least two 3-cycles in disjoint
cycle notation, say σ = (1 2 3)(4 5 6)τ . We now let δ = (1 2 4), and again
calculate
σ −1 δ −1 σδ = (1 3 2)(4 6 5)(1 4 2)(1 2 3)(4 5 6)(1 2 4) = (1 2 4 3 6).
This is a 5-cycle, which is necessarily in H. By the previous case, we get a
3-cycle in H too, and hence H = An .
3. Suppose H contains σ = (1 2 3)τ , with τ a product of 2-cycles (if τ contains
anything longer, then it would fit in one of the previous two cases). Then
σ 2 = (1 2 3)2 = (1 3 2) is a three-cycle.
4. Suppose H contains σ = (1 2)(3 4)τ , where τ is a product of 2-cycles. We first
let δ = (1 2 3) and calculate
u = σ −1 δ −1 σδ = (1 2)(3 4)(1 3 2)(1 2)(3 4)(1 2 3) = (1 4)(2 3),
which is again in H. We landed in the same case, but instead of two transpo-
sitions times a mess, we just have two transpositions, which is nicer. Let
v = (1 5 2)u(1 2 5) = (1 3)(4 5) ∈ H.
Note that we used n ≥ 5 again. We have yet again landed in the same case.
Notice however, that these are not the same transpositions. We multiply
uv = (1 4)(2 3)(1 3)(4 5) = (1 2 3 4 5) ∈ H.
This is then covered by the first case, and we are done.
Recall we proved that A5 is simple in IA Groups by brute force: we listed all its conjugacy classes, and saw that they cannot be put together to make a normal subgroup. This obviously cannot be easily generalized to higher values of n. Hence we need to prove this with a different approach.

D. 9-25
A finite group G is a p-group if |G| = pn for some prime number p and n ≥ 1.
T. 9-26
If G is a finite p-group, then Z(G) = {x ∈ G : xg = gx for all g ∈ G} is non-trivial.

Let G act on itself by conjugation. The orbits of this action (ie. the conjugacy
classes) have order dividing |G| = pn . So it is either a singleton, or its size is
divisible by p. Since the conjugacy classes partition G, we know the total size of
the conjugacy classes is |G|. In particular,
X
|G| = number of conjugacy class of size 1 + size of other conjugacy classes.

We know the second term is divisible by p. Also |G| = pn is divisible by p. Hence


the number of conjugacy classes of size 1 is divisible by p. We know {e} is a
conjugacy class of size 1. So there must be at least p conjugacy classes of size 1.
Since the smallest prime number is 2, there is a conjugacy class {x} ≠ {e}.
But if {x} is a conjugacy class on its own, then by definition g −1 xg = x for all
g ∈ G, ie. xg = gx for all g ∈ G. So x ∈ Z(G). So Z(G) is non-trivial.
This immediately tells us that for n ≥ 2, a p-group is never simple: Z(G) is a normal subgroup of G, so if Z(G) is a proper subgroup, then we are done; and if Z(G) = G, then G is an abelian group of non-prime order, hence not simple.
Also, this theorem allows us to prove interesting things about p-groups by induc-
tion – we can quotient G by Z(G), and get a smaller p-group. One way to do this
is via the below lemma.
L. 9-27
For any group G, if G/Z(G) is cyclic, then G is abelian.
Let gZ(G) be a generator of the cyclic group G/Z(G). Hence every coset of Z(G)
is of the form g r Z(G). So every element x ∈ G must be of the form g r z for
z ∈ Z(G) and r ∈ Z. To show G is abelian, let x̄ = g r̄ z̄ be another element, with
z̄ ∈ Z(G), r̄ ∈ Z. Note that z and z̄ are in the center, and hence commute with
every element. So we have
xx̄ = g r zg r̄ z̄ = g r g r̄ z z̄ = g r̄ g r z̄z = g r̄ z̄g r z = x̄x.
So they commute. Hence G is abelian.
In other words, if G/Z(G) is cyclic, then it is in fact trivial, since the center of an
abelian group is the abelian group itself. This is a general lemma for groups, but
is particularly useful when applied to p-groups.
P. 9-28
If p is prime and |G| = p2 , then G is abelian.
Since Z(G) ≤ G, its order must be 1, p or p2 . Since it is not trivial, it can only
be p or p2 . If it has order p2 , then it is the whole group and the group is abelian.
Otherwise, G/Z(G) has order p2 /p = p. But then it must be cyclic, and thus G
must be abelian.
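For example, the only groups of order 9 are C9 and C3 × C3 . More generally, combining this proposition with the classification of finite abelian groups below, every group of order p^2 is isomorphic to Cp2 or Cp × Cp .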
T. 9-29
Let G be a group of order pa , where p is a prime number. Then it has a subgroup
of order pb for any 0 ≤ b ≤ a.
We induct on a. If a = 1, then {e}, G give subgroups of order p0 and p1 .
Now suppose a > 1, and we want to construct a subgroup of order pb . If b = 0,
then this is trivial, namely {e} ≤ G has order 1. Otherwise, we know Z(G) is
non-trivial. So let x ∈ Z(G) with x 6= e. Since ord(x) | |G|, its order is a power of
p. If it in fact has order p^c , then x^(p^(c−1)) has order p. So wlog x has order p. We
have thus generated a subgroup hxi of order exactly p. Moreover, since x is in the
center, hxi commutes with everything in G. So hxi is in fact a normal subgroup of
G. This is the point of choosing it in the center. Therefore G/hxi has order pa−1 .
Since this is a strictly smaller group, we can by induction suppose G/hxi has a
subgroup of any order. In particular, it has a subgroup L of order pb−1 . By the
subgroup correspondence, there is some K ≤ G such that L = K/hxi and hxi C K.
But then K has order pb .
This means there is a subgroup of every conceivable order. This is not true for
general groups. For example, A5 has no subgroup of order 30 or else that would
be a normal subgroup.
T. 9-30
<Classification of finite abelian groups> Let G be a finite abelian group.
Then there exists some d1 , · · · , dr such that
G ∼= Cd1 × Cd2 × · · · × Cdr .
Moreover, we can pick di such that di+1 | di for each i, and this expression is
unique.
We will prove this later in [T.9-145] as it turns out the best way to prove this is
not to think of it as a group, but as a Z-module.
E. 9-31
So it turns out finite abelian groups are very easy to classify. We can just write
down a list of all finite abelian groups. For example the abelian groups of order 8
are C8 , C4 × C2 , C2 × C2 × C2 .
L. 9-32
If n and m are coprime, then Cmn ∼= Cm × Cn .
It suffices to find an element of order nm in Cm × Cn . Then since Cn × Cm has
order nm, it must be cyclic, and hence isomorphic to Cnm .
Let g ∈ Cm have order m; h ∈ Cn have order n, and consider (g, h) ∈ Cm × Cn .
Suppose the order of (g, h) is k. Then (g, h)k = (e, e). Hence (g k , hk ) = (e, e). So
the order of g and h divide k, ie. m | k and n | k. As m and n are coprime, this
means that mn | k. As k = ord((g, h)) and (g, h) ∈ Cm × Cn is a group of order
mn, we must have k | nm. So k = nm.
This is a grown-up version of the Chinese remainder theorem. This is what the
Chinese remainder theorem really says.
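For example, writing the cyclic groups additively, the element (1, 1) ∈ C2 × C3 has order lcm(2, 3) = 6, so C2 × C3 ∼= C6 . By contrast C2 × C2 is not isomorphic to C4 , since 2 and 2 are not coprime.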
P. 9-33
For any finite abelian group G, we have G ∼= Cd1 × Cd2 × · · · × Cdr where each di
is some prime power.
From the classification theorem, iteratively apply the previous lemma to break
each component up into products of prime powers.
This is a somewhat more useful form of the decomposition of finite abelian groups.
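For example, C12 ∼= C4 × C3 and C2 × C6 ∼= C2 × C2 × C3 , which exhibits the two abelian groups of order 12 as products of cyclic groups of prime power order.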
T. 9-34
<Sylow theorems> Let G be a finite group of order pa m, with p a prime
and p - m. Then
1. The set of Sylow p-subgroups of G, given by Sylp (G) = {P ≤ G : |P | = pa } is
non-empty. In other words, G has a subgroup of order pa .
2. All elements of Sylp (G) are conjugate in G.
3. The number of Sylow p-subgroups np = | Sylp (G)| satisfies np ≡ 1 (mod p)
and np | |G| (in fact np | m, since p is not a factor of np ).
1. We need to find some subgroup of order pa . We find something clever for G to
act on. Let Ω = {X ⊆ G : |X| = pa }. We let G act on Ω by
g ∗ {g1 , g2 , · · · , gpa } = {gg1 , gg2 , · · · , ggpa }.
Let Σ be an orbit of this. We first note that if {g1 , · · · , gpa } ∈ Σ, then by the
definition of an orbit, for every g ∈ G,
gg1−1 ∗ {g1 , · · · , gpa } = {g, gg1−1 g2 , · · · , gg1−1 gpa } ∈ Σ.
The important thing is that this set contains g. So for each g, Σ contains a set
X which contains g. Since each set X has size pa , we must have |Σ|pa ≥ |G|,
so |Σ| ≥ |G|/pa = m.
Suppose |Σ| = m. Then the orbit-stabilizer theorem says the stabilizer H of
any {g1 , · · · , gpa } ∈ Σ has index m, hence |H| = pa , and thus H ∈ Sylp (G).
So we just need to show that not every orbit Σ can have size > m. Again,
by the orbit-stabilizer, the size of any orbit divides the order of the group,
|G| = pa m. So if |Σ| > m, then p | |Σ|. Suppose we can show that p - |Ω|.
Then not every orbit Σ can have size > m, since Ω is the disjoint union of all
the orbits, and thus we are done.
So we have to show p - |Ω|. We have
|Ω| = (|G| choose pa ) = (pa m choose pa ) = ∏_{j=0}^{pa −1} (pa m − j)/(pa − j).
Now note that the largest power of p dividing pa m − j is the largest power of
p dividing j. Similarly, the largest power of p dividing pa − j is also the largest
power of p dividing j. So we have the same power of p on top and bottom for
each item in the product, and they cancel. So |Ω| is not divisible by p.
This proof is not straightforward. We first needed the clever idea of letting G
act on Ω. But then if we are given this set, the obvious thing to do would be
to find something in Ω that is also a group. This is not what we do. Instead,
we find an orbit whose stabilizer is a Sylow p-subgroup.
2. We will prove something stronger: if Q ≤ G is a p-subgroup (ie. |Q| = pb , for
b not necessarily a), and P ≤ G is a Sylow p-subgroup, then there is a g ∈ G
such that g −1 Qg ≤ P . Applying this to the case where Q is another Sylow
p-subgroup says there is a g such that g −1 Qg ≤ P , but since g −1 Qg has the
same size as P , they must be equal.
We let Q act on the set of cosets of G/P via q ∗ gP = qgP . The orbits of this
action have size dividing |Q|, so is either 1 or divisible by p. But they can’t
all be divisible by p, since |G/P | is coprime to p. So at least one of them have
size 1, say {gP }. In other words, for every q ∈ Q, we have qgP = gP . This
means g −1 qg ∈ P . This holds for every element q ∈ Q. So we have found a g
such that g −1 Qg ≤ P .
3. Finally, we need to show that np ≡ 1 (mod p) and np | |G|. The second part is
easier — by result 2, the action of G on Sylp (G) by conjugation has one orbit.
By the orbit-stabilizer theorem, the size of the orbit, which is | Sylp (G)| = np ,
divides |G|. This proves the second part.
For the first part, let P ∈ Sylp (G). Consider the action by conjugation of P
on Sylp (G). Again by the orbit-stabilizer theorem, the orbits each have size 1
or size divisible by p. But we know there is one orbit of size 1, namely {P }
itself. To show np = | Sylp (G)| ≡ 1 (mod p), it is enough to show there are no
other orbits of size 1.
Suppose {Q} is an orbit of size 1. This means for every p ∈ P , we get
p−1 Qp = Q.
In other words, P ≤ NG (Q). Now NG (Q) is itself a group, and we can look at
its Sylow p-subgroups. We know Q ≤ NG (Q) ≤ G. So pa | |NG (Q)| | pa m. So
pa is the biggest power of p that divides |NG (Q)|. So Q is a Sylow p-subgroup
of NG (Q).
Now we know P ≤ NG (Q) is also a Sylow p-subgroup of NG (Q). By Sylow's
second theorem, they must be conjugate in NG (Q). But conjugating anything
in Q by something in NG (Q) does nothing, by definition of NG (Q). So we
must have P = Q. So the only orbit of size 1 is {P } itself.
These are sometimes known as Sylow's first/second/third theorems respectively.
L. 9-35
A Sylow p-subgroup is normal in G iff it is the only Sylow p-subgroup in G (i.e.
np = | Sylp (G)| = 1)
(Forward) It it’s normal, then we can’t conjugate it to any other subgroup, hence
it’s the only Sylow p-subgroup.

(Backward) Let P be the unique Sylow p-subgroup, and let g ∈ G, and consider
g −1 P g. Since this is isomorphic to P , we must have |g −1 P g| = pa , ie. it is also
a Sylow p-subgroup. Since there is only one, we must have P = g −1 P g. So P is
normal.
E. 9-36
• Suppose |G| = 1000. Then G is not simple. To show this, we need to factorize
1000. We have |G| = 23 · 53 . We pick our favorite prime to be p = 5. We know
n5 ≡ 1 (mod 5), and n5 | 23 = 8. The only number that satisfies this is n5 = 1.
So the Sylow 5-subgroup is normal, and hence G is not simple.
• Consider the conjugation action of G on the set Sylp (G) of Sylow p-subgroups. There
is a single orbit, consisting of all the Sylow p-subgroups, so the stabiliser (ie. normaliser)
of any Sylow p-subgroup must have index np .
P. 9-37
Let G be a non-abelian simple group. Then |G| | np !/2 for every prime p such that
p | |G|.
The group G acts on Sylp (G) by conjugation. So it gives a permutation represen-
tation φ : G → Sym(Sylp (G)) ∼= Snp . We know ker φ C G. But G is simple. So
ker(φ) = {e} or G. We want to show it is not the whole of G.
Suppose we had G = ker(φ), then g −1 P g = P for all g ∈ G. Hence P is a normal
subgroup. As G is simple, either P = {e}, or P = G. We know P cannot be trivial
since p | |G|. So G = P , then G is a p-group, has a non-trivial center, and hence
G is not non-abelian simple. So we must have ker(φ) = {e}.
Then by the first isomorphism theorem, we know G ∼ = Im φ ≤ Snp . We have
proved the theorem without the divide-by-two part. To prove the whole result,
we need to show that in fact Im(φ) ≤ Anp . Consider the following composition of
homomorphisms:
G −→ Snp −→ {±1}, where the first map is φ and the second is sgn.
If this is surjective, then ker(sgn ◦φ) C G has index 2, and is not the whole of
G. This means G is not simple (the case where G = C2 is ruled out since it is
abelian). So the kernel must be the whole of G. In other words, sgn(φ(g)) = +1 for
all g ∈ G. Hence G ∼= Im(φ) ≤ Anp . So we get |G| | np !/2.
Note that this result in fact follows directly from [P.9-18] applied to the normaliser
of a Sylow p-subgroup. This proof simply writes out that argument directly.
E. 9-38
No group of order 132 is simple.
Suppose G with |G| = 132 = 22 ·3·11 is simple. We start by looking at p = 11. We
know n11 ≡ 1 (mod 11). Also n11 | 12. As G is simple, we must have n11 = 12.
Now look at p = 3. We have n3 = 1 (mod 3) and n3 | 44. The possible values of
n3 are 4 and 22. If n3 = 4, then |G| | 4!/2 = 12, which is of course nonsense. So
n3 = 22.
At this point, we count how many elements of each order there are. This is
particularly useful if p | |G| but p2 - |G|, ie. the Sylow p-subgroups have order p
and hence are cyclic.
As all Sylow 11-subgroups are disjoint, apart from {e}, we know there are 12·(11−
1) = 120 elements of order 11. We do the same thing with the Sylow 3-subgroups.
We need 22 · (3 − 1) = 44 elements of order 3. But this is more elements than the
group has. This can’t happen, contradiction. So G must be simple.
L. 9-39
Suppose |G| = pn a with p and a coprime. If H C G with |H| = pn b, then
| Sylp (G)| = | Sylp (H)|.
Every Sylow p subgroup of H is a Sylow p subgroup of G, hence | Sylp (G)| ≥
| Sylp (H)|. Suppose H1 is a Sylow p subgroup of G, pick H2 which is a Sylow p
subgroup of H, then H1 and H2 are conjugates, but g −1 H2 g ≤ H for all g ∈ G by
normality of H, hence H1 ≤ H. So | Sylp (G)| ≤ | Sylp (H)|. Therefore | Sylp (G)| =
| Sylp (H)|.
E. 9-40
• Let G = GLn (Zp ) where p is a prime. To find | GLn (Zp )|, note that choosing an
invertible n × n matrix in Zp is choosing n linearly independent vectors in (Zp )n . We
have pn − 1 choices for the first vector, as it can be anything but the zero vector.
The second vector cannot be a multiple of the first vector, hence we have pn − p
choices. The third vector cannot be a linear combination of the first two vectors,
hence we have pn − p2 choices. Similarly for the remaining vectors. So
| GLn (Zp )| = ∏_{i=0}^{n−1} (pn − pi ) = ∏_{i=0}^{n−1} pi (pn−i − 1) = pn(n−1)/2 ∏_{i=0}^{n−1} (pn−i − 1).
Let U ≤ GLn (Zp ) be the subgroup of upper triangular matrices with all diagonal entries
equal to 1. Such a matrix has 0 + 1 + · · · + (n − 1) = n(n − 1)/2 free entries above the
diagonal, so |U | = |Zp |n(n−1)/2 = pn(n−1)/2 . Therefore U is in fact a Sylow p-subgroup.
• Recall that SL2 (Zp ) is defined as ker(det : GL2 (Zp ) → Z∗p ) where Z∗p = Zp \ {0} is the multiplica-
tive group of Zp . The determinant homomorphism is surjective since det( λ 0 ; 0 1 ) = λ.
So SL2 (Zp ) C GL2 (Zp ) has index p − 1, hence | SL2 (Zp )| = (p − 1)p(p + 1).
Now PSL2 (Zp ) = SL2 (Zp )/{( λ 0 ; 0 λ ) ∈ SL2 (Zp )}. In order for ( λ 0 ; 0 λ ) ∈ SL2 (Zp ) we
need λ2 = 1. Since Z∗p ∼= Cp−1 (see part IA notes or the latter part of the chapter),
as long as p > 2 we have λ = ±1. Therefore |PSL2 (Zp )| = (1/2)(p − 1)p(p + 1). Let
Z∞p = Zp ∪ {∞}; then PSL2 (Zp ) acts on this set by Möbius maps
( a b ; c d ) ∗ z = (az + b)/(cz + d).
• Consider the case p = 5. The action gives a homomorphism φ : PSL2 (Z5 ) → Sym(Z∞5 ) ∼= S6 .
This is injective since if
(az + b)/(cz + d) = z for all z ∈ Z∞5 ,
then z = 0 tells us b = 0, z = ∞ tells us c = 0 and z = 1 tells us a = d. So
in PSL2 (Z5 ), ( a b ; c d ) = ( 1 0 ; 0 1 ). Moreover we claim that Im φ ≤ A6 . Consider ψ =
sgn ◦φ; we need to show ψ( a b ; c d ) = 1 always. The possible element orders in PSL2 (Z5 )
are 2, 3, 4, 5, 6, 12, 15, 20, 30. Firstly, elements of odd order in PSL2 (Z5 ) must be
sent to 1. Secondly, elements of order 2 and 4 must be sent to 1, this is because
H = {( λ 0 ; 0 λ−1 ), ( 0 λ ; λ−1 0 ) : λ ∈ Z∗5 }
is a subgroup of order 4, so is a Sylow 2-subgroup. Now we claim that ψ(H) = {1}.
Note that H is generated by ( 2 0 ; 0 −2 ) and ( 0 1 ; −1 0 ). ψ on the first of these gives a
permutation z 7→ −z which is even. Similarly ψ on the second matrix also gives
an even permutation z 7→ −z −1 . So ψ(H) = {1}. Any order 4 element forms a
Sylow 2-subgroup, so must be conjugate to H, and hence is sent to 1 by ψ. Any
order 2 element generates a 2-subgroup of order 2, so by the proof of the second Sylow
theorem [T.9-34], we can conjugate it so that it is contained in H, hence it is also
sent to 1 by ψ. Finally, elements whose order is even but not a power of 2 must also
be sent to 1: if we write ord(a) = mn with m odd and n a power of 2, then
ψ(a) = ψ(a)m = ψ(am ) since ψ(a) = ±1 and m is odd, but am has order n which is
either 2 or 4, so ψ(am ) = 1 by the above. Therefore elements of all orders are sent to 1 under ψ, hence the
result.
9.2 Rings I
In a ring, we are allowed to add, subtract, multiply but not divide. Our canonical
example of a ring would be Z. In this course, we are only going to consider rings in
which multiplication is commutative, since these rings behave like “number systems”,
where we can study number theory. We will study properties of arbitrary rings.
D. 9-41
• A ring is a quintuple (R, +, · , 0R , 1R ) where 0R , 1R ∈ R, and +, · : R × R → R
are binary operations such that
1. (R, +, 0R ) is an abelian group.
2. The operation · : R × R → R satisfies associativity a · (b · c) = (a · b) · c and
identity: 1R · r = r · 1R = r (some people don't insist on the existence of the multiplicative identity, but we will).
3. Multiplication distributes over addition, that is

r1 · (r2 + r3 ) = (r1 · r2 ) + (r1 · r3 ) (r1 + r2 ) · r3 = (r1 · r3 ) + (r2 · r3 ).

If R is a ring and r ∈ R, we write −r for the inverse to r in (R, +, 0R ). This
satisfies r + (−r) = 0R . We write r − s to mean r + (−s) etc.
• We say a ring R is commutative if a · b = b · a for all a, b ∈ R. All rings we will
consider are assumed to be commutative.
• Let (R, +, · , 0R , 1R ) be a ring, and S ⊆ R is a subset. We say S is a subring of
R if 0R , 1R ∈ S, and the operations +, · make S into a ring in its own right. In
this case we write S ≤ R.
• An element u ∈ R in ring R is a unit if there is another element v ∈ R such that
u · v = 1R .
• A field is a non-zero ring where every u ∈ R except 0R is a unit.
E. 9-42
• Since we can add and multiply two elements in a ring, by induction, we can add and
multiply any finite number of elements. However, the notions of infinite sum and
product are undefined. It doesn’t make sense to ask if an infinite sum converges.
• The familiar number systems are all rings: we have Z ≤ Q ≤ R ≤ C, under the
usual 0, 1, +, · . The set Z[i] = {a + ib : a, b ∈ Z} ≤ C is the Gaussian integers ,
which is a ring. We also have the ring Q[√2] = {a + b√2 ∈ R : a, b ∈ Q} ≤ R.
We will use the square brackets notation quite frequently. It should be clear what
it should mean, and we will define it properly later.
• In general, elements in a ring do not have inverses. This is not a bad thing. This
is what makes rings interesting. For example, the division algorithm would be
rather contentless if everything in Z had an inverse. Fortunately, Z only has two
invertible elements — 1 and −1. We call these units.
• Note that the notion of units depends on R, not just on u. For example, 2 ∈ Z is
not a unit, but 2 ∈ Q is a unit (since 1/2 is an inverse). We will later show that 0R
cannot be a unit unless in a very degenerate case.
• Z is not a field, but Q, R, C are all fields. Similarly, Z[i] is not a field, while Q[√2]
is.
• Let R be a ring. Then 0R + 0R = 0R , since this is true in the group (R, +, 0R ).
Then for any r ∈ R, we get r · (0R + 0R ) = r · 0R . Multiplication distributes
over addition, so r · 0R + r · 0R = r · 0R . Adding −(r · 0R ) to both sides give
r · 0R = 0R . This is true for any element r ∈ R. From this, it follows that if
R 6= {0}, then 1R 6= 0R — if they were equal, then take r 6= 0R and we have
r = r · 1R = r · 0R = 0R , which is a contradiction.
Note, however, that {0} forms a ring (with the only possible operations and iden-
tities), the zero ring, albeit a boring one. However, this is often a counterexample
to many things.
D. 9-43
• Let R, S be rings. Then the product R × S is a ring via
(r, s) + (r0 , s0 ) = (r + r0 , s + s0 ), (r, s) · (r0 , s0 ) = (r · r0 , s · s0 ).
The zero is (0R , 0S ) and the one is (1R , 1S ) (one can check that R × S is indeed a ring).
• Let R be a ring. Then a polynomial with coefficients in R is an expression
f = a0 + a1 X + a2 X 2 + · · · + an X n ,
with ai ∈ R. The symbols X i are formal symbols. The degree of a polynomial
f is the largest m such that am 6= 0. For f with degree m, if am = 1, then f is
called monic .
• We write R[X] for the set (ring) of all polynomials with coefficients in the ring R.
The operations are performed in the obvious way, ie. if f = a0 + a1 X + · · · + an X n
and g = b0 + b1 X + · · · + bk X k are polynomials, then
f + g = ∑_{i=0}^{max{n,k}} (ai + bi )X i and f · g = ∑_{i=0}^{n+k} ( ∑_{j=0}^{i} aj bi−j ) X i ,
We identify the ring R with the constant polynomials, ie. polynomials ∑ ai X i
with ai = 0 for i > 0. In particular, 0R ∈ R and 1R ∈ R are the zero and one of
R[X].
• We write R[[X]] for the ring of power series on the ring R, ie. f = a0 + a1 X +
a2 X 2 + · · · where each ai ∈ R. This has addition and multiplication the same as
for polynomials, but without upper limits.
• The Laurent polynomials on the ring R form the set R[X, X −1 ] in which each element
is of the form f = ∑_{i∈Z} ai X i where ai ∈ R and only finitely many ai with i < 0
are non-zero. The operations are the obvious ones.
E. 9-44
• For polynomial f , we identify f and f + 0R · X n+1 as the same thing.
Note that a polynomial is just a sequence of numbers, interpreted as the coefficients
of some formal symbols. While it does indeed induce a function in the obvious way,
we shall not identify the polynomial with the function given by it, since different
polynomials can give rise to the same function.
For example, in Z/2Z[X], f = X 2 + X is not the zero polynomial, since its
coefficients are not zero. However, f (0) = 0 and f (1) = 0. As a function, this is
identically zero. So f 6= 0 as a polynomial but f = 0 as a function.
• A power series is very much not a function. We don't talk about whether the sum
converges or not, because it is not a sum.
• R[X] is in fact a ring. Is 1 − X ∈ R[X] a unit? For every g = a0 + · · · + an X n
(with an 6= 0), we get
(1 − X)g = stuff + · · · − an X n+1 ,
which is not 1. So g cannot be the inverse of (1 − X). So (1 − X) is not a unit.
However, 1 − X ∈ R[[X]] is a unit, since
(1 − X)(1 + X + X 2 + X 3 + · · · ) = 1.
• We can also think of Laurent series, but we have to be careful. We allow infinitely
many positive coefficients, but only finitely many negative ones. Or else, in the
formula for multiplication, we will have an infinite sum, which is undefined.
• Let X be a set, and R be a ring. Then the set of all functions on X, ie. functions
f : X → R is a ring given by
(f + g)(x) = f (x) + g(x), (f · g)(x) = f (x) · g(x).
Here zero is the constant function 0 and one is the constant function 1.
Usually, we don’t want to consider all functions X → R. Instead, we look at some
subrings of this. For example, we can consider the ring of all continuous functions
R → R. This contains, for example, the polynomial functions, which is just R[X]
(since in R, polynomials are functions).
D. 9-45
• Let R, S be rings. A function φ : R → S is a ring homomorphism if it preserves
everything we can think of, that is
1. φ(r1 + r2 ) = φ(r1 ) + φ(r2 ),
2. φ(0R ) = 0S ,
3. φ(r1 · r2 ) = φ(r1 ) · φ(r2 ),
4. φ(1R ) = 1S .
If a homomorphism φ : R → S is a bijection, we call it an isomorphism .
• The kernel of a homomorphism φ : R → S is ker(φ) = {r ∈ R : φ(r) = 0S }, and
its image is Im(φ) = {s ∈ S : s = φ(r) for some r ∈ R}.
• A subset I ⊆ R is an ideal , written I C R, if
1. It is an additive subgroup of (R, +, 0R ), ie. it is closed under addition and
additive inverses. (additive closure)
2. If a ∈ I and b ∈ R, then a · b ∈ I. (strong closure)
We say I is a proper ideal if I 6= R.
• Let R be a ring,
 For an element a ∈ R, we write (a) = aR = {a · r : r ∈ R} C R. This is the
ideal generated by a . An ideal I is a principal ideal if I = (a) for some a ∈ R.
 Let a1 , a2 , · · · , ak ∈ R, the ideal generated by a1 , · · · , ak is
(a1 , a2 , · · · , ak ) = {a1 r1 + · · · + ak rk : r1 , · · · , rk ∈ R}.
 For A ⊆ R a subset, the ideal generated by A is
(A) = { ∑_{a∈A} ra · a : ra ∈ R, only finitely-many non-zero }.
E. 9-46
In the group scenario, we had groups, subgroups and normal subgroups, which
are special subgroups. Here, we have a special kind of subsets of a ring that act
like normal subgroups, known as ideals.
Note that the multiplicative closure is stronger than what we require for subrings
— for subrings, it has to be closed under multiplication by its own elements; for
ideals, it has to be closed under multiplication by everything in the world. This
is similar to how normal subgroups not only have to be closed under internal
multiplication, but also conjugation by external elements.
Principal ideals are rather nice ideals, since they are easy to describe, and often
have some nice properties.
Note that it is easier to come up with ideals than normal subgroups — we can
just pick up random elements, and then take the ideal generated by them.
L. 9-47
A homomorphism φ : R → S is injective if and only if ker φ = {0R }.
A ring homomorphism is in particular a group homomorphism φ : (R, +, 0R ) →
(S, +, 0S ) of abelian groups. So this follows from the case of groups.
L. 9-48
If φ : R → S is a homomorphism, then ker(φ) C R.
Since φ : (R, +, 0R ) → (S, +, 0S ) is a group homomorphism, the kernel is a sub-
group of (R, +, 0R ). For the second condition, let a ∈ ker(φ), b ∈ R. We have
φ(a · b) = φ(a) · φ(b) = 0 · φ(b) = 0. So a · b ∈ ker(φ).
E. 9-49
Suppose I C R is an ideal, and 1R ∈ I. Then for any r ∈ R, the axioms entail
1R · r ∈ I. But 1R · r = r. So if 1R ∈ I, then I = R.
In other words, every proper ideal does not contain 1. In particular, every proper
ideal is not a subring, since a subring must contain 1. We are starting to diverge
from groups. In groups, a normal subgroup is a subgroup, but here an ideal is not
a subring.
We can generalize the above a bit. Suppose I C R and u ∈ I is a unit, ie. there is
some v ∈ R such that uv = 1R . Then by strong closure, 1R = u · v ∈ I. So I = R.
Hence proper ideals are not allowed to contain any unit at all, not just 1R .
E. 9-50
Consider the ring Z of integers. Then every ideal of Z is of the form
nZ = {· · · , −2n, −n, 0, n, 2n, · · · } ⊆ Z.
To show these are all the ideals, let I C Z. If I = {0}, then I = 0Z. Otherwise,
let n ∈ N be the smallest positive element of I. We want to show in fact I = nZ.
Certainly nZ ⊆ I by strong closure.
Now let m ∈ I. By the Euclidean algorithm, we can write m = q · n + r with
0 ≤ r < n. Now n, m ∈ I. So by strong closure, m, qn ∈ I. So r = m − q · n ∈ I.
As n is the smallest positive element of I, and r < n, we must have r = 0. So
m = q · n ∈ nZ. So I ⊆ nZ. So I = nZ.
So what we have just shown for Z is that all ideals are principal. Not all rings are
like this. These are special types of rings. The key to proving this was that we
can perform the Euclidean algorithm on Z. Thus, for any ring R in which we can
“do Euclidean algorithm”, every ideal is of the form aR = {a · r : r ∈ R} for some
a ∈ R. We will make this notion precise later.
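For instance, the ideal (4, 6) = {4r + 6s : r, s ∈ Z} C Z is just 2Z = (2): its smallest positive element is 4 · (−1) + 6 · 1 = 2, and the argument above then gives (4, 6) = 2Z.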
E. 9-51
Consider {f ∈ R[X] : the constant coefficient of f is 0}. This is an ideal, as we
can check manually (alternatively, it is the kernel of the “evaluate at 0” homomor-
phism). It turns out this is a principal ideal, in fact it is (X).
D. 9-52
Let I C R. The quotient ring R/I consists of the (additive) cosets r + I with the
zero and one as 0R + I and 1R + I, and operations
(r1 + I) + (r2 + I) = (r1 + r2 ) + I (r1 + I) · (r2 + I) = r1 r2 + I.
P. 9-53
The quotient ring is a ring, and the function R → R/I with r 7→ r + I is a ring
homomorphism.
We know the group (R/I, +, 0R/I ) is well-defined, since I is a (normal) subgroup of
R. So we only have to check multiplication is well-defined. Suppose r1 + I = r10 + I
and r2 + I = r20 + I. Then r10 − r1 = a1 ∈ I and r20 − r2 = a2 ∈ I. So
r10 r20 = (r1 + a1 )(r2 + a2 ) = r1 r2 + r1 a2 + r2 a1 + a1 a2 .
By the strong closure property, the last three objects are in I. So r10 r20 +I = r1 r2 +I.
It is easy to check that 0R + I and 1R + I are indeed the zero and one, and the
function given is clearly a homomorphism.
This is true, because we defined ideals to be those things that can be quotiented
by. So we just have to check we made the right definition. Just as we could have
come up with the definition of a normal subgroup by requiring operations on the
cosets to be well-defined, we could have come up with the definition of an ideal by
requiring the multiplication of cosets is well-defined, and we will end up with the
strong closure property.
E. 9-54
• We have the ideals nZ C Z. So we have the quotient ring Z/nZ. The elements are
of the form m + nZ, so are just
0 + nZ, 1 + nZ, 2 + nZ, · · · , (n − 1) + nZ.
Addition and multiplication are just what we are used to — addition and
multiplication modulo n.
• Consider (X) C C[X]. What is C[X]/(X)? Elements are represented by
a0 + a1 X + a2 X 2 + · · · + an X n + (X).
But everything but the first term is in (X). So every such thing is equivalent to
a0 + (X). This representation is unique since a0 + (X) = b0 + (X) ⇒ a0 − b0 is
divisible by X ⇒ a0 − b0 = 0 ⇒ a0 = b0 . So in fact C[X]/(X) ∼ = C, with the
isomorphism a0 + (X) ↔ a0 .
T. 9-55
<Euclidean algorithm for polynomials> Let F be a field and f, g ∈ F[X].
Then there is some r, q ∈ F[X] such that f = gq + r with deg r < deg g.
Let deg(f ) = n and write f = ∑_{i=0}^{n} ai X i with an 6= 0. Similarly, write deg g = m
and g = ∑_{i=0}^{m} bi X i with bm 6= 0. If n < m, we let q = 0 and r = f , and done.
Otherwise, suppose n ≥ m, and proceed by induction on n. We let f1 = f −
(an /bm )X n−m g. This is possible since bm 6= 0, and F is a field. Then by construction,
the coefficients of X n cancel out. So deg(f1 ) < n.
If n = m, then deg(f1 ) < n = m. So we can write f = ((an /bm )X n−m )g + f1
and deg(f1 ) < m = deg(g). So done. Otherwise, if n > m, then as deg(f1 ) < n, by
induction, we can find r1 , q1 such that f1 = gq1 + r1 and deg(r1 ) < deg g = m.
Then
f = (an /bm )X n−m g + q1 g + r1 = ((an /bm )X n−m + q1 )g + r1 .
This is like the usual Euclidean algorithm, except that instead of the absolute
value, we use the degree to measure how “big” the polynomial is.
Now that we have a Euclidean algorithm for polynomials, we should be able to
show that every ideal of F[X] is generated by one polynomial. We will not prove it
specifically here, but later show that in general, in every ring where the Euclidean
algorithm is possible, all ideals are principal.
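As a quick worked instance of the division: over Q, take f = X 3 + X + 1 and g = X 2 + 1. Then f = X · g + 1, so q = X and r = 1, with deg r = 0 < 2 = deg g.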
E. 9-56
Consider R[X], and consider the principal ideal (X 2 + 1) C R[X]. We let R =
R[X]/(X 2 + 1). Elements of R are cosets
f + (X 2 + 1), where f = a0 + a1 X + a2 X 2 + · · · + an X n .
By the Euclidean algorithm, we have f = q(X 2 + 1) + r with deg(r) < 2, ie.
r = b0 + b1 X. Thus f + (X 2 + 1) = r + (X 2 + 1). So every element of R[X]/(X 2 + 1)
is representable as a + bX for some a, b ∈ R.
Is this representation unique? If a + bX + (X 2 + 1) = a0 + b0 X + (X 2 + 1), then
the difference (a − a0 ) + (b − b0 )X ∈ (X 2 + 1). So it is (X 2 + 1)q for some q. This
is possible only if q = 0, since for non-zero q, we know (X 2 + 1)q has degree at
least 2. So we must have (a − a0 ) + (b − b0 )X = 0. So a + bX = a0 + b0 X. So the
representation is unique.
What we’ve got is that every element in R is of the form a + bX, and X 2 + 1 = 0,
ie. X 2 = −1. This sounds like the complex numbers, just that we are calling it X
instead of i. To show this formally, we define the function
φ : R[X]/(X 2 + 1) → C with a + bX + (X 2 + 1) 7→ a + bi.
This is well-defined and a bijection. It is also clearly additive. So to prove this
is an isomorphism, we have to show it is multiplicative. We check this manually.
We have
φ((a + bX + (X 2 + 1))(c + dX + (X 2 + 1))) = φ(ac + (ad + bc)X + bdX 2 + (X 2 + 1))
= φ((ac − bd) + (ad + bc)X + (X 2 + 1)) = (ac − bd) + (ad + bc)i = (a + bi)(c + di)
= φ(a + bX + (X 2 + 1))φ(c + dX + (X 2 + 1)).
So this is indeed an isomorphism.
L. 9-57
Let R be any ring. If w ∈ R is a root of a degree-n polynomial f ∈ R[X], then
f = (X − w)g where g ∈ R[X] is a polynomial of degree n − 1.
Write f = ∑_{k=0}^{n} ak X k . We try to write f = (X − w)(bn−1 X n−1 + bn−2 X n−2 +
· · · + b0 ) + r where r ∈ R. Equating the coefficients from the highest power to the
lowest we find that
f = (X − w) ∑_{k=0}^{n−1} ( ∑_{j=k+1}^{n} w^{j−(k+1)} aj ) X k + ( a0 + ∑_{j=1}^{n} w^j aj ),
where the first summand is called α(X) and the second summand β.
Now f (w) = 0 = α(w), so we must have β = 0. Hence f (X) = α(X).
A field enables us to use the Euclidean algorithm; however, it turns out that for a general
ring, even without this we can always write a non-zero polynomial f in the form f =
(X − w)g + r for any w, where g has degree less than f and r is a constant in R.
T. 9-58
<First isomorphism theorem> Let φ : R → S be a ring homomorphism.
Then ker(φ) C R and R/ ker(φ) ∼= Im(φ) ≤ S.
We have already seen ker(φ) C R. Now define
Φ : R/ ker(φ) → Im(φ) by r + ker(φ) 7→ φ(r).
This is well-defined, since if r + ker(φ) = r0 + ker(φ), then r − r0 ∈ ker(φ). So
φ(r − r0 ) = 0. So φ(r) = φ(r0 ).
We don’t have to check this is bijective and additive, since that comes for free
from the (proof of the) isomorphism theorem of groups. So we just have to check
it is multiplicative. To show Φ is multiplicative, we have
Φ((r + ker(φ))(t + ker(φ))) = Φ(rt + ker(φ)) = φ(rt) = φ(r)φ(t)
= Φ(r + ker(φ))Φ(t + ker(φ)).
This is more-or-less the same proof as the one for groups, just that we had a few
more things to check.
E. 9-59
Consider the homomorphism ψ : R[X] → C given by ψ(f ) = f (i). This is surjective
since ψ(a + bX) = a + bi. Also
ker(ψ) = {f ∈ R[X] : f (i) = 0} = {f ∈ R[X] : f (i) = 0 = f (−i)} = (X 2 + 1).
So by the first isomorphism theorem R[X]/(X 2 + 1) ∼= C.
T. 9-60
<Second isomorphism theorem> Let R ≤ S and J C S. Then J ∩ R C R,
(R + J)/J = {r + J : r ∈ R} ≤ S/J and R/(R ∩ J) ∼= (R + J)/J.
Define the function φ : R → S/J by r 7→ r + J. Since this is the quotient map, it
is a ring homomorphism. The kernel and image are
ker(φ) = {r ∈ R : r + J = 0, ie. r ∈ J} = R ∩ J and Im(φ) = {r + J : r ∈ R} = (R + J)/J.
By the first isomorphism theorem, R ∩ J C R and (R + J)/J ≤ S/J, and also R/(R ∩ J) ∼= (R + J)/J.
C. 9-61
<Ideal correspondence> Recall we had the subgroup correspondence for groups.
Analogously, for I C R,
{subrings of R/I} ←→ {subrings of R which contain I}
L ≤ R/I 7−→ {x ∈ R : x + I ∈ L}
S/I ≤ R/I ←−[ I C S ≤ R.
This is exactly the same formula as for groups. For groups, we had a correspondence
for normal subgroups. Here, we have a correspondence between ideals
{ideals of R/I} ←→ {ideals of R which contain I}
It is important to note here quotienting in groups and rings have different purposes.
In groups, we take quotients so that we have a simpler group to work with. In rings,
we often take quotients to get more interesting rings. For example, R[X] is quite
boring, but R[X]/(X 2 +1) ∼ = C is more interesting. Thus this ideal correspondence
allows us to occasionally get interesting ideals from boring ones.
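For instance, taking I = 6Z C Z, the ideals of Z/6Z correspond to the ideals of Z containing 6Z, namely 6Z, 2Z, 3Z and Z; these give exactly the ideals {0}, (2), (3) and the whole ring of Z/6Z.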
T. 9-62
<Third isomorphism theorem> Let I C R and J C R, and I ⊆ J. Then
J/I C R/I and
(R/I)/(J/I) ∼= R/J.
We define the map φ : R/I → R/J by r + I 7→ r + J. This is well-defined and
surjective by the groups case. Also it is a ring homomorphism since multiplication
in R/I and R/J are “the same”. The kernel is
ker(φ) = {r + I : r + J = 0} = {r + I : r ∈ J} = J/I.
So the result follows from the first isomorphism theorem.
E. 9-63
Note that for any ring R, there is a unique ring homomorphism Z → R, given by
ι : Z → R with ι(n) = 1R + 1R + · · · + 1R (|n| times) for n ≥ 0, and
ι(n) = −(1R + 1R + · · · + 1R ) (|n| times) for n < 0.
Any homomorphism Z → R must be given by this formula, since it must send 1 to
1, and we can show this is indeed a homomorphism by distributivity. So the ring
homomorphism is unique. We say Z is the initial object in (the category of) rings.
We then know ker(ι) C Z. Thus ker(ι) = nZ for some n. The characteristic
of R is the unique non-negative n such that ker(ι) = nZ.
The rings Z, Q, R, C all have characteristic 0. The ring Z/nZ has characteristic n.
In particular, all natural numbers can be characteristics.
The notion of the characteristic will not be too useful in this course. However, fields
of non-zero characteristic often provide interesting examples and counterexamples
to some later theory.
D. 9-64
• A non-zero ring R is an integral domain if for all a, b ∈ R, a · b = 0R implies
a = 0R or b = 0R .
• An element x ∈ R is a zero divisor if x 6= 0 and there is a y 6= 0 such that
xy = 0 ∈ R.
• Write R[X, Y ] for (R[X])[Y ], the polynomial ring of R in two variables. In general,
write R[X1 , · · · , Xn ] = (· · · ((R[X1 ])[X2 ]) · · · )[Xn ].
• Let R be an integral domain. A field of fractions F of R is a field with the
following properties
1. R ≤ F
2. Every element of F may be written as a · b−1 for a, b ∈ R, where b−1 means
the multiplicative inverse to b 6= 0 in F .
E. 9-65
Many rings can be completely different to Z. For example, in Z, we know that if
a, b 6= 0, then ab 6= 0. However, in, say, Z/6Z, we get 2, 3 6= 0, but 2 · 3 = 0. Also,
Z has some nice properties such as every ideal is principal, and every integer has
an (essentially) unique factorization. We want to classify rings according to which
properties they have.
We start with the most fundamental property, that the product of two non-zero
elements is non-zero: this is the notion of an integral domain. We will almost exclusively work with rings
that satisfy this property. An element that violates this property is known as a
zero divisor. In other words, a ring is an integral domain if it has no zero divisors.
• All fields are integral domains, since if a·b = 0, and b 6= 0, then a = a·(b·b−1 ) =
0. Similarly, if a 6= 0, then b = 0.
• A subring of an integral domain is an integral domain, since a zero divisor in
the small ring would also be a zero divisor in the big ring.
• Z, Q, R, C are integral domains, since C is a field, and the others are subrings
of it. Also, Z[i] ≤ C is also an integral domain.
These are the nice rings we like in number theory, since there we can sensibly
talk about things like factorization.
Mimicking the construction of Q from Z, for any integral domain R we want to construct
a field F that consists of “fractions” of elements in R; such a field, if it exists, is called
a field of fractions.
L. 9-66
Let R be a finite ring which is an integral domain, then R is a field.
Let a ∈ R be non-zero, and consider the map a · − : R → R given by
b 7→ a · b.
If r ∈ ker(a · −), then a · r = 0 and r = 0 since R is an integral domain. So its
kernel is trivial, hence the map is injective. Since R is finite, a · − must also be
surjective. In particular, given any a ∈ R, there is an element b ∈ R such that
a · b = 1R , so a has an inverse. Since a was arbitrary, R is a field.
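For example, for p a prime the ring Z/pZ is a finite integral domain (if ab ≡ 0 (mod p) then p | ab, so p | a or p | b), so by this lemma it is a field: every non-zero residue modulo p has a multiplicative inverse.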
L. 9-67
Let R be an integral domain, then R[X] is also an integral domain.
We need to show the product of two non-zero elements is non-zero. Let f, g ∈
R[X] be non-zero, say
f = a0 + a1 X + · · · + an X n ∈ R[X]
g = b0 + b1 X + · · · + bm X m ∈ R[X],
with an , bm 6= 0. Then the coefficient of X n+m in f g is an bm . This is non-zero
since R is an integral domain. So f g is non-zero. So R[X] is an integral domain.
So, for instance, Z[X] is an integral domain. We can also iterate this: if R is an
integral domain, so is R[X1 , · · · , Xn ].
T. 9-68
Every integral domain has a field of fractions.
The construction is exactly how we construct the rationals from the integers – as
equivalence classes of pairs of integers. We let S = {(a, b) ∈ R × R : b 6= 0}. We
think of (a, b) ∈ S as a/b. We define the equivalence relation ∼ on S by
(a, b) ∼ (c, d) ⇔ ad = bc.
To show this is indeed an equivalence relation: symmetry and reflexivity are obvious.
To show transitivity, suppose (a, b) ∼ (c, d) and (c, d) ∼ (e, f ), ie. ad = bc and
cf = de. We multiply the first equation by f and the second by b, to obtain
adf = bcf and bcf = bed. Rearranging, we get d(af − be) = 0. Since d is in the
denominator, d 6= 0. Since R is an integral domain, we must have af − be = 0, ie.
af = be. So (a, b) ∼ (e, f ). This is where being an integral domain is important.
Now let F = S/∼ be the set of equivalence classes. We now want to check this
is indeed the field of fractions. We first want to show it is a field. We write
a/b = [(a, b)] ∈ F , and define the operations by
a/b + c/d = (ad + bc)/(bd) and (a/b) · (c/d) = (ac)/(bd).
We can check that this is well-defined, and makes (F, +, ·, 0/1, 1/1) into a ring.
Finally, we need to show every non-zero element has an inverse. Suppose a/b 6= 0F ,
that is a/b 6= 0/1, or a · 1 6= b · 0, so equivalently a 6= 0. Hence b/a ∈ F is defined, and
(b/a) · (a/b) = ba/ba = 1.
So a/b has a multiplicative inverse. So F is a field.
We now need to construct a subring of F that is isomorphic to R. We define
an injective homomorphism φ : R → F by r 7→ r/1. We can check this is a ring
homomorphism. The kernel is the set of all r ∈ R such that r/1 = 0, ie. r = 0. So
the kernel is trivial, and φ is injective. Then by the first isomorphism theorem,
R ∼= Im(φ) ⊆ F . Finally, everything is a quotient of two things in R since
a/b = (a/1) · (1/b) = (a/1) · (b/1)−1 .
Recall that the subring of any field is an integral domain. This says the converse
– every integral domain is the subring of some field. For example, Q is the field of
fractions of Z. The field of fractions of C[X] is the field of all rational functions
p(X)/q(X), where p, q ∈ C[X].
This gives us a very useful tool. Since this gives us a field from an integral domain,
this allows us to use field techniques to study integral domains. Moreover, we can
use this to construct new interesting fields from integral domains.
D. 9-69
A proper ideal I of a ring R (i.e. with I 6= R) is said to be
• maximal if for any ideal J with I ≤ J ≤ R, either J = I or J = R.
• prime if whenever a, b ∈ R are such that a · b ∈ I, then a ∈ I or b ∈ I.
E. 9-70
A non-zero ideal nZ C Z is prime if and only if n is a prime. To show this, first
suppose n = p is a prime, and a · b ∈ pZ, then p | a · b. So p | a or p | b, ie.
a ∈ pZ or b ∈ pZ. For the other direction, suppose n = pq is a composite number
(p, q 6= 1). Then n ∈ nZ but p 6∈ nZ and q 6∈ nZ, since 0 < p, q < n.
So instead of talking about prime numbers, we can talk about prime ideals instead.
L. 9-71
A (non-zero) ring R is a field if and only if its only ideals are {0} and R.
(Forward) Let I C R and R be a field. Suppose x 6= 0 ∈ I. Then as x is a unit,
I = R.
(Backward) Suppose x 6= 0 ∈ R. Then (x) is an ideal of R. It is not {0} since it
contains x. So (x) = R and hence 1R ∈ (x). But (x) is defined to be {x·y : y ∈ R},
so there is some u ∈ R such that x · u = 1R . So x is a unit. Since x was arbitrary,
R is a field.
Note that we don’t need elements to define the ideals {0} and R. {0} can be
defined as the ideal that all other ideals contain, and R is the ideal that contains
all other ideals. Alternatively, we can reword this as “R is a field if and only if
it has only two ideals” to avoid mentioning explicit ideals. This is another reason
why fields are special. They have the simplest possible ideal structure.
L. 9-72
1. An ideal I C R is maximal if and only if R/I is a field.
2. An ideal I C R is prime if and only if R/I is an integral domain.
1. R/I is a field if and only if {0} and R/I are the only ideals of R/I. By the
ideal correspondence, this is equivalent to saying I and R are the only ideals
of R which contains I, ie. I is maximal.
2. (Forward) Let I be prime. Let a + I, b + I ∈ R/I, and suppose (a + I)(b + I) =
0R/I . By definition, (a + I)(b + I) = ab + I. So we must have ab ∈ I. As I is
prime, either a ∈ I or b ∈ I. So a + I = 0R/I or b + I = 0R/I . So R/I is an
integral domain.
(Backward) Suppose R/I is an integral domain. Let a, b ∈ R be such that
ab ∈ I. Then (a + I)(b + I) = ab + I = 0R/I ∈ R/I. Since R/I is an integral
domain, either a + I = 0R/I or b + I = 0R/I , ie. a ∈ I or b ∈ I. So I is a prime
ideal.
This is a nice result. This makes a correspondence between properties of ideals I
and properties of the quotient R/I.
P. 9-73
Every maximal ideal is a prime ideal.
I C R is maximal implies R/I is a field, which implies R/I is an integral domain,
which implies I is prime.
The converse is not true. For example, {0} ⊆ Z is prime but not maximal. Also,
(X) C Z[X, Y ] is prime but not maximal (since Z[X, Y ]/(X) ∼= Z[Y ]).
L. 9-74
Let R be an integral domain. Then its characteristic is either 0 or a prime number.
Consider the unique map φ : Z → R, and ker(φ) = nZ. Then n is the characteristic
of R by definition. By the first isomorphism theorem, Z/nZ ∼ = Im(φ) ≤ R. So
Z/nZ is an integral domain. So nZ C Z is a prime ideal. So n = 0 or a prime number.
9.3 Rings II
D. 9-75
Let R be an integral domain.
• An element a ∈ R is a unit if there is some b ∈ R such that ab = 1R . Equivalently,
if the ideal (a) = R.
• For elements a, b ∈ R, we say a divides b, written a | b, if ∃c ∈ R such that
b = ac. Equivalently, if (b) ⊆ (a).
• We say a, b ∈ R are associates if a = bc for some unit c. Equivalently, if
(a) = (b). Equivalently, if a | b and b | a.
• A non-zero, non-unit element a ∈ R is said to be
 irreducible if whenever a = xy, x or y is a unit.
 prime if whenever a | xy, either a | x or a | y.

E. 9-76
In integers, two numbers are associates only if a and b differ by a sign, but in
more interesting rings, more interesting things can happen. When considering
division in rings, we often consider two associates to be “the same”. For example,
in Z, we can factorize 6 as 6 = 2 · 3 = (−2) · (−3) but this does not violate unique
factorization, since 2 and −2 are associates (and so are 3 and −3), and we consider
these two factorizations to be “the same”.

For integers, being irreducible is the same as being a prime number. However,
“prime” means something different in general rings.

It is important to note all these properties depend on the ring, not the element
itself. For example 2 ∈ Z is a prime, but 2 ∈ Q is not (since it is a unit). Similarly,
the polynomial 2X ∈ Q[X] is irreducible (since 2 is a unit), but 2X ∈ Z[X] is not
irreducible.
L. 9-77
Let R be an integral domain. A principal ideal (r) is a prime ideal iff r = 0 or r is
prime.
(Forward) Let (r) be a prime ideal. If r = 0, then done. Otherwise, as prime
ideals are proper, ie. not the whole ring, r is not a unit. Now suppose r | a · b.
Then a · b ∈ (r). But (r) is prime. So a ∈ (r) or b ∈ (r). So r | a or r | b. So r is
prime.
(Backward) If r = 0, then (0) = {0} C R, which is prime since R is an integral
domain. Otherwise, let r 6= 0 be prime. Suppose a · b ∈ (r). This means r | a · b.
So r | a or r | b. So a ∈ (r) or b ∈ (r). So (r) is prime.
L. 9-78
Let R be an integral domain. If r ∈ R is prime, then it is irreducible.
Let r ∈ R be prime, and suppose r = ab. Since r | r = ab, and r is prime, we must
have r | a or r | b. wlog, r | a. So a = rc for some c ∈ R. So r = ab = rcb. Since
we are in an integral domain, we must have 1 = cb. So b is a unit.
The converse is in general not true, although it’s true in Z.
E. 9-79
Consider R = Z[√−5] = {a + b√−5 : a, b ∈ Z} ≤ C.
By definition, it is a subring of a field. So it is an integral domain. What are the
units of the ring? There is a nice trick we can use, when things are lying inside C.
Consider the function
N : R → Z≥0 given by N (a + b√−5) = a2 + 5b2 .
It is convenient to think of this as z 7→ z z̄ = |z|2 . This satisfies N (z · w) =
N (z)N (w). This is a desirable thing to have for a ring, since it immediately
implies all units have norm 1 — if r · s = 1, then 1 = N (1) = N (rs) = N (r)N (s).
So N (r) = N (s) = 1.
So to find the units, we need to solve a2 + 5b2 = 1. The only solutions are ±1.
So only ±1 ∈ R can be units, and these obviously are units. So these are all the
units.
Next, we claim 2 ∈ R is irreducible. We again use the norm. Suppose 2 = ab.
Then 4 = N (2) = N (a)N (b). Now note that nothing has norm 2 as a2 + 5b2 can
never be 2 for integers a, b ∈ Z. So we must have, wlog, N (a) = 4, N (b) = 1. So b
must be a unit. Similarly, we see that 3, 1 + √−5, 1 − √−5 are irreducible (since
there is also no element of norm 3).
We have four irreducible elements in this ring. They are in fact not prime. Note
that
(1 + √−5)(1 − √−5) = 6 = 2 · 3.
We now claim 2 does not divide 1 + √−5 or 1 − √−5, so 2 is not prime. To
show this, suppose 2 | 1 + √−5. Then N (2) | N (1 + √−5). But N (2) = 4 and
N (1 + √−5) = 6, and 4 - 6. Similarly, N (1 − √−5) = 6 as well. So 2 - 1 ± √−5.
This illustrates that primes and irreducibles are not the same thing in general
(unless we are in Z). Secondly, factorization into irreducibles is not necessarily
unique, since 2 · 3 = (1 + √−5)(1 − √−5) are two factorizations into irreducibles.
However, it turns out there is one situation in which unique factorization holds – when
we have a Euclidean algorithm available.
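The norm computations above are easy to verify by brute force. Here is a small Python sketch (ours, not from the notes) confirming that no element of Z[√−5] has norm 2 or 3, the key fact behind the irreducibility of 2, 3 and 1 ± √−5:

# norms a^2 + 5b^2 of elements a + b*sqrt(-5); a small box of (a, b) suffices,
# since the norm is at least a^2 and at least 5b^2
norms = {a*a + 5*b*b for a in range(-3, 4) for b in range(-3, 4)}
print(2 in norms, 3 in norms)   # False False
print(1 + 5*1, 2*2, 3*3)        # norms of 1 ± sqrt(-5), 2 and 3: 6 4 9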
D. 9-80
• Let R be an integral domain, we say R is a
 Euclidean domain (ED) if there is a Euclidean function φ : R \ {0} → Z≥0
such that
1. φ(a · b) ≥ φ(b) for all a, b 6= 0
2. If a, b ∈ R, with b 6= 0, then there are q, r ∈ R such that a = b · q + r with
either r = 0 or φ(r) < φ(b).
 principal ideal domain (PID) if every ideal is a principal ideal, i.e for all I C R,
there is some a such that I = (a).
 unique factorization domain (UFD) if
1. Every non-unit may be written as a product of irreducibles;
2. If p1 p2 · · · pn = q1 · · · qm with pi , qj irreducibles, then n = m, and they can
be reordered such that pi is an associate of qi .
• A ring satisfies the ascending chain condition (ACC) if there is no infinite strictly
increasing chain of ideals. In which case we call the ring a Noetherian ring .
E. 9-81
Z is a Euclidean domain with φ(n) = |n|. For any field F, F[X] is a Euclidean
domain with φ(f ) = deg(f ). The Gaussian integers R = Z[i] ≤ C is a Euclidean
domain with φ(z) = N (z) = |z|2 , which we can check:
1. We have φ(zw) = φ(z)φ(w) ≥ φ(z), since φ(w) is a positive integer.
2. Given a, b ∈ Z[i] with b 6= 0, consider the complex number a/b ∈ C. Viewing
Z[i] as the lattice of integer points in the complex plane (the original notes include a
picture of this lattice), there is some q ∈ Z[i] with |a/b − q| < 1, namely the nearest
lattice point. So we can write a/b = q + c with |c| < 1. Then we have
a = b · q + b · c, and we set r = b · c.
We know r = a − bq ∈ Z[i], and φ(r) = N (bc) = N (b)N (c) < N (b) = φ(b).
This is not just true for the Gaussian integers. All we really needed was that R ≤ C,
and for any x ∈ C, there is some point in R that is less than 1 away
from x. If we draw some more pictures, we will see this is not true for Z[√−5].
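As a concrete sketch of this Euclidean function at work (our own illustration using Python's built-in complex numbers, not part of the notes), the following rounds a/b to the nearest Gaussian integer and checks that the remainder has strictly smaller norm:

def divmod_gaussian(a, b):
    # a, b: Gaussian integers represented as Python complex numbers, b != 0
    t = a / b
    q = complex(round(t.real), round(t.imag))   # nearest point of Z[i]
    r = a - b * q
    return q, r

a, b = complex(7, 3), complex(2, -1)
q, r = divmod_gaussian(a, b)
print(q, r)                        # q = (2+3j), r = -1j
print(abs(r) ** 2 < abs(b) ** 2)   # True: N(r) < N(b)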
E. 9-82
Z is a principal ideal domain.
P. 9-83
If R is a Euclidean domain, then R is a principal ideal domain.
Let R have a Euclidean function φ : R \ {0} → Z≥0 . We let I C R be a non-zero
ideal, and let b ∈ I \ {0} be an element with φ(b) minimal. Then for any a ∈ I,
we write a = bq + r with r = 0 or φ(r) < φ(b). However, any such r must be in I
since r = a − bq ∈ I. So we cannot have φ(r) < φ(b). So we must have r = 0. So
a = bq. So a ∈ (b). Since this is true for all a ∈ I, we must have I ⊆ (b). On the
other hand, since b ∈ I, we must have (b) ⊆ I. So we must have I = (b).
This is exactly the same proof as we gave for the integers, except we replaced the
absolute value with φ.
E. 9-84
• Z is a Euclidean domain, and hence a principal ideal domain. Also, for any field F,
F[X] is a Euclidean domain, hence principal ideal domain. Also, Z[i] is a Euclidean
domain, and hence a principal ideal domain.
• In Z[X], the ideal (2, X) C Z[X] is not a principal ideal. Suppose it were. Then
(2, X) = (f ). Since 2 ∈ (2, X) = (f ), we know 2 = f · g for some g. So f has
degree zero, and hence constant. So f = ±1 or ±2.
If f = ±1, then (f ) = Z[X] since ±1 are units. But (2, X) 6= Z[X], since 1 6∈ (2, X).
If f = ±2, then since X ∈ (2, X) = (f ), we must have ±2 | X, but this is clearly
false. So (2, X) cannot be a principal ideal.
E. 9-85
Let A ∈ Mn×n (F) be an n × n matrix over a field F. We consider the following set
I = {f ∈ F[X] : f (A) = 0}.
This is an ideal – if f, g ∈ I, then (f + g)(A) = f (A) + g(A) = 0. Similarly, if
f ∈ I and h ∈ F[X], then (f h)(A) = f (A)h(A) = 0.
Now as F[X] is a principal ideal domain, there must be some m ∈ F[X] such that
I = (m). Suppose f ∈ F[X] is such that f (A) = 0, ie. f ∈ I; then m | f .
So m is a polynomial that divides all polynomials that kill A, in other words m is
the minimal polynomial of A.
We have just proved that all matrices have minimal polynomials, and that the
minimal polynomial divides all other polynomials that kill A. (Note that I 6= {0}:
the matrices 1, A, A2 , · · · all lie in the n2 -dimensional space Mn×n (F), so some
non-trivial linear combination of them vanishes, ie. some non-zero polynomial kills A.)
Also, the minimal polynomial is unique up to multiplication of units.
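For a concrete example: over Q, the matrix A = ( 0 −1 ; 1 0 ) satisfies A2 = −I, so X 2 + 1 kills A, while no non-zero polynomial of degree ≤ 1 does (A is not a scalar multiple of the identity). Hence the minimal polynomial of A is X 2 + 1, and every polynomial killing A is a multiple of it.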
L. 9-86
Let R be a principal ideal domain. If p ∈ R is irreducible, then it is prime.
Let p ∈ R be irreducible, and suppose p | a · b. Wlog suppose p - a and we will
show that p | b. Consider the ideal (p, a) C R. Since R is a principal ideal domain,
there is some d ∈ R such that (p, a) = (d). So d | p and d | a.
Since d | p, there is some q1 such that p = q1 d. As p is irreducible, either q1 or d
is a unit. If q1 is a unit, then d = q1−1 p, and this divides a. So a = q1−1 px for some
x. This is a contradiction, since p - a.
Therefore d is a unit. So (p, a) = (d) = R. In particular, 1R ∈ (p, a). So we can write
1R = rp + sa for some r, s ∈ R. Multiplying by b we get b = rpb + sab. Now since
both rpb and sab are divisible by p (as p | ab), we have p | b.
This is similar to the argument for integers. For integers, we would say if p - a,
then p and a are coprime. Therefore there are some r, s such that 1 = rp + sa.
Then we continue the proof as above. Hence what we did in the middle is to do
something similar to showing p and a are “coprime”.
Note that this result is also true for general unique factorization domains, which we
can prove directly using unique factorization.
L. 9-87
Let R be a principal ideal domain. Let I1 ⊆ I2 ⊆ I3 ⊆ · · · be a chain of ideals.
Then there is some N ∈ N such that In = In+1 for all n ≥ N . (Every principal
ideal domain is Noetherian)
The obvious thing to do when we have an infinite chain of ideals is to take the
union of them. Let I = ∪n≥1 In , which is again an ideal. Since R is a principal
ideal domain, I = (a) for some a ∈ R. We know a ∈ I = ∪n≥1 In . So a ∈ IN
for some N . Then we have (a) ⊆ IN ⊆ I = (a). So we must have IN = I. So
In = IN = I for all n ≥ N .
So in a principal ideal domain, we cannot have an infinite chain of bigger and
bigger ideals.
Notice it is not important that I is generated by one element. If, for some reason,
we know I is generated by finitely many elements, then the same argument works.
So if every ideal is finitely generated, then the ring must be Noetherian. It turns
out this is an if-and-only-if — if you are Noetherian, then every ideal is finitely
generated. We will prove this later.

P. 9-88
If R is a principal ideal domain, then R is a unique factorization domain.
We first need to show any (non-unit) r ∈ R is a product of irreducibles. Suppose
r ∈ R cannot be factored as a product of irreducibles. Then it is certainly not
irreducible. So we can write r = r1 s1 , with r1 , s1 both non-units. Since r cannot
be factored as a product of irreducibles, wlog r1 cannot be factored as a product
of irreducibles (if both can, then r would be a product of irreducibles). So we
can write r1 = r2 s2 , with r2 , s2 not units. Again, wlog r2 cannot be factored as a
product of irreducibles. We continue this way.

By assumption, the process does not end, and then we have the following chain
of ideals: (r) ⊆ (r1 ) ⊆ (r2 ) ⊆ · · · ⊆ (rn ) ⊆ · · · . But then we have an ascending
chain of ideals. By the ascending chain condition, these are all eventually equal,
ie. there is some n such that (rn ) = (rn+1 ) = (rn+2 ) = · · · . In particular, since
(rn ) = (rn+1 ), and rn = rn+1 sn+1 , then sn+1 is a unit. But this is a contradiction,
since sn+1 is not a unit. So r must be a product of irreducibles.

To show uniqueness, we let p1 p2 · · · pn = q1 q2 · · · qm , with pi , qi irreducible. So in


particular p1 | q1 · · · qm . Since p1 is irreducible, it is prime. So p1 divides some qi .
We reorder and suppose p1 | q1 . So q1 = p1 ·a for some a. But since q1 is irreducible,
a must be a unit. So p1 , q1 are associates. Since R is a principal ideal domain,
hence integral domain, we can cancel p1 to obtain p2 p3 · · · pn = (aq2 )q3 · · · qm . We
rename aq2 as q2 , so that we in fact have p2 p3 · · · pn = q2 q3 · · · qm . We can then
continue to show that pi and qi are associates for all i. This also shows that n = m,
or else if n = m + k with k > 0, say, then we would be left with pm+1 · · · pn equal to a unit, which is a contradiction.
D. 9-89
• d is a greatest common divisor (gcd) of a1 , a2 , · · · , an if d | ai for all i, and if any
other d0 satisfies d0 | ai for all i, then d0 | d.
m is a least common multiple (lcm) of a1 , a2 , · · · , an if ai | m for all i, and if any
other m0 satisfies ai | m0 for all i, then m | m0 .
• Let R be a UFD and f = a0 + a1 X + · · · + an X n ∈ R[X]. The content c(f ) of
f is c(f ) = gcd(a0 , a1 , · · · , an ) ∈ R. A polynomial is called primitive if c(f ) is a
unit, ie. the ai are coprime.
E. 9-90
Note that the gcd or lcm of a set of elements, if it exists (which it might not), is not
unique; it is only well-defined up to a unit. And since the gcd is only defined up
to a unit, so is the content. In the definition of primitive, we ask for c(f ) to be a
unit; we cannot ask for c(f ) to be exactly 1, since the gcd is only well-defined up
to a unit.
L. 9-91
In a unique factorization domain, the gcd and lcm always exist, and are unique
up to associates.

We construct the greatest common divisor using the good-old way of prime fac-
torization. We let p1 , p2 , · · · , pm be a list of all irreducible factors of the ai s, such
that no two of these are associates of each other. For each i we can write
ai = ui ∏_{j=1}^{m} pj^{nij}    where nij ∈ N and the ui are units

We let mj = mini {nij } and let d = ∏_{j=1}^{m} pj^{mj} . As, by definition, mj ≤ nij for all
i, we know d | ai for all i.
Finally, if d′ | ai for all i, then we can write d′ = v ∏_{j=1}^{m} pj^{tj} for some unit v. Then we must have tj ≤ nij
for all i, j. Hence tj ≤ mj for all j. So d′ | d.
Uniqueness is immediate since any two greatest common divisors have to divide
each other. The argument for lcm is similar.
This result tells us that c(f ) in the definition above always exists.
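For concreteness in the familiar case R = Z, here is a minimal Python sketch of this construction (using only the standard library; purely an illustration of the proof, not needed for it):

from collections import Counter

def factor(n):
    # prime factorisation of a positive integer as a Counter {prime: exponent}
    f, p = Counter(), 2
    while p * p <= n:
        while n % p == 0:
            f[p] += 1
            n //= p
        p += 1
    if n > 1:
        f[n] += 1
    return f

def gcd_by_factorisation(*nums):
    # d = prod_j p_j ** min_i(n_ij), exactly as in the proof above
    factorisations = [factor(n) for n in nums]
    primes = set().union(*factorisations)
    d = 1
    for p in primes:
        d *= p ** min(f[p] for f in factorisations)   # a Counter gives 0 for missing primes
    return d

print(gcd_by_factorisation(84, 120, 36))   # 12, ie. 2^2 * 3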
E. 9-92
<Factorisations of polynomials over a field> Since polynomial rings are a
bit more special than general integral domains, we can say a bit more about them.
Recall that for F a field, we know F [X] is a Euclidean domain, hence a principal
ideal domain, hence a unique factorization domain. Therefore we know
1. If I C F [X], then I = (f ) for some f ∈ F [X].
2. If f ∈ F [X], then f is irreducible if and only if f is prime.
3. Let f be irreducible, and suppose (f ) ⊆ J ⊆ F [X]. Then J = (g) for some g.
Since (f ) ⊆ (g), we must have f = gh for some h. But f is irreducible. So
either g or h is a unit. If g is a unit, then (g) = F [X]. If h is a unit, then
(f ) = (g). So (f ) is a maximal ideal. Note that this argument is valid for any
PID, not just polynomial rings.

4. Let (f ) be a prime ideal. Then f is prime. So f is irreducible. So (f ) is


maximal. But we also know in complete generality that maximal ideals are
prime. So in F [X], prime ideals are the same as maximal ideals. Again, this
is true for all PIDs in general.
5. Thus f is irreducible if and only if F [X]/(f ) is a field.
To use the last item, we can first show that F [X]/(f ) is a field, and then use this
to deduce that f is irreducible. But we can also do something more interesting
— find an irreducible f , and then generate an interesting field F [X]/(f ). So we
want to understand reducibility, ie. we want to know whether we can factorize a
polynomial f .
L. 9-93
Let R be a UFD. If f, g ∈ R[X] are primitive, then so is f g.

Let f = a0 + a1 X + · · · + an X n and g = b0 + b1 X + · · · + bm X m where an , bm 6= 0,


and f, g are primitive. We want to show that the content of f g is a unit.
Now suppose f g is not primitive. Then c(f g) is not a unit. Since R is a UFD, we
can find an irreducible p which divides c(f g). By assumption, c(f ) and c(g) are
units. So p - c(f ) and p - c(g). So suppose p | a0 , p | a1 , . . . , p | ak−1 but p - ak .
Note it is possible that k = 0. Similarly, suppose p | b0 , p | b1 , · · · , p | b`−1 , p - b` .
We look at the coefficient of X^{k+ℓ} in f g. It is given by

∑_{i+j=k+ℓ} ai bj = ak+ℓ b0 + · · · + ak+1 bℓ−1 + ak bℓ + ak−1 bℓ+1 + · · · + a0 bℓ+k .

By assumption, this is divisible by p. However, the sum ak+ℓ b0 + · · · + ak+1 bℓ−1
is divisible by p, as p | bj for j < ℓ. Similarly, ak−1 bℓ+1 + · · · + a0 bℓ+k is divisible
by p. So we must have p | ak b` . As p is irreducible, and hence prime, we must
have p | ak or p | b` . This is a contradiction. So c(f g) must be a unit.
P. 9-94
Let R be a UFD and f, g ∈ R[X]. Then c(f g) is an associate of c(f )c(g).

We can write f = c(f )f1 and g = c(g)g1 , with f1 and g1 primitive. Now f g =
c(f )c(g)f1 g1 . Since f1 g1 is primitive, c(f )c(g) is a gcd of the coefficients of f g,
and so is c(f g), by definition. So they are associates.
Again, we cannot say they are equal, since content is only well-defined up to a
unit.
L. 9-95
<Gauss’ lemma> Let R be a UFD, and f ∈ R[X] be a primitive polynomial.
Then f is reducible in R[X] if and only if f is reducible in F [X], where F is the
field of fractions of R.
(Forward) Let f = gh be a product in R[X] with g, h not units. As f is primitive,
so are g and h. So both have degree > 0. So g, h are not units in F [X]. So f is
reducible in F [X].
(Backward) Let f = gh in F [X], with g, h not units. So g and h have degree
> 0, since F is a field. So we can clear denominators by finding a, b ∈ R such
that (ag), (bh) ∈ R[X] (eg. let a be the product of denominators of coefficients
of g). Then we get abf = (ag)(bh) and this is a factorization in R[X]. Here we

have to be careful — (ag) is one thing that lives in R[X], and is not necessarily
a product in R[X], since g might not be in R[X]. So we should just treat it as a
single symbol. We now write (ag) = c(ag)g1 and (bh) = c(bh)h1 where g1 , h1 are
primitive. So we have

ab = c(abf ) = c((ag)(bh)) = u · c(ag)c(bh) where u ∈ R is a unit,

by the previous result. But also we have abf = c(ag)c(bh)g1 h1 = u^{−1} ab g1 h1 . So
cancelling ab gives f = u^{−1} g1 h1 ∈ R[X]. So f is reducible in R[X].
If this looks fancy and magical, you can try to do this explicitly in the case where
R = Z and F = Q. Then you will probably get enlightened.
E. 9-96
Consider X 3 + X + 1 ∈ Z[X]. This has content 1 so is primitive. We show it is
not reducible in Z[X], and hence not reducible in Q[X].
Suppose f is reducible in Q[X]. Then by Gauss’ lemma, this is reducible in Z[X].
So we can write X 3 + X + 1 = gh for some polynomials g, h ∈ Z[X], with g, h
not units. But if g and h are not units, then they cannot be constant, since the
coefficients of X 3 + X + 1 are all 1 or 0. So they have degree at least 1. Since
the degrees add up to 3, we wlog suppose g has degree 1 and h has degree 2. So
suppose
g = b0 + b1 X, h = c0 + c1 X + c2 X 2 .
Multiplying out and equating coefficients, we get b0 c0 = 1 and c2 b1 = 1. So b0
and b1 must be ±1. So g is either 1 + X, 1 − X, −1 + X or −1 − X, and hence has
±1 as a root. But this is a contradiction, since ±1 is not a root of X 3 + X + 1.
So f is not reducible in Q. In particular, f has no root in Q.
We see the advantage of using Gauss’ lemma — if we worked in Q instead, we
could have gotten to the step b0 c0 = 1, and then we can do nothing, since b0 and
c0 can be many things if we live in Q.
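The final step of the argument is easy to check mechanically; a minimal Python sketch of it (illustrative only):

def f(x):
    return x**3 + x + 1

# A degree-1 factor b0 + b1*X in Z[X] forces b0*c0 = 1 and b1*c2 = 1, so b0, b1 = +-1,
# and such a factor would give f a root of +-1.  Neither value is a root:
for candidate_root in (1, -1):
    print(candidate_root, f(candidate_root))   # prints 1 3 and -1 -1, never 0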
P. 9-97
Let R be a UFD, and F be its field of fractions. Let g ∈ R[X] be primitive. Write
J = (g)CR[X] and I = (g)CF [X], then J = I ∩R[X]. In other words, if f ∈ R[X]
and we can write it as f = gh, with h ∈ F [X], then we also have f = gh0 with
h0 ∈ R[X].

The strategy is the same as for Gauss’ lemma – we clear denominators in the equation
f = gh, and then use contents to get everything down into R[X]. We certainly have
J ⊆ I ∩ R[X]. Now let f ∈ I ∩ R[X]. So we can write f = gh with h ∈ F [X].
We can choose b ∈ R such that bh ∈ R[X]. Then bf = g(bh) ∈ R[X]. Write
(bh) = c(bh)h1 with h1 ∈ R[X] primitive. Thus bf = c(bh)gh1 .
Since g is primitive, so is gh1 and so c(bh) = uc(bf ) for u a unit. Moreover bf is a
product in R[X], so c(bf ) = vc(b)c(f ) = vbc(f ) for some unit v. So now we have
bf = uvbc(f )gh1 . Cancelling b gives f = g(uvc(f )h1 ). So f ∈ J.
T. 9-98
If R is a UFD, then R[X] is a UFD. In particular, R[X1 , · · · , Xn ] is also a UFD.

We know R[X] has a notion of degree. So we will combine this with the fact that
R is a UFD. Let f ∈ R[X]. We can write f = c(f )f1 , with f1 primitive. Firstly,

as R is a UFD, we may factor c(f ) = p1 p2 · · · pn for pi ∈ R irreducible (and also


irreducible in R[X]). Now we want to deal with f1 .
If f1 is not irreducible, then we can write f1 = f2 f3 with f2 , f3 both not units.
Since f1 is primitive, f2 , f3 also cannot be constants. So we must have deg f2 , deg f3 >
0. Also, since deg f2 + deg f3 = deg f1 , we must have deg f2 , deg f3 < deg f1 . If
f2 , f3 are irreducible, then done. Otherwise, keep on going. We will eventually stop
since the degrees have to keep on decreasing. So we can write it as f1 = q1 · · · qm
with qi irreducible. So we can write f = p1 p2 · · · pn q1 q2 · · · qm a product of irre-
ducibles.
For uniqueness, we first deal with the p’s. We note that c(f ) = p1 p2 · · · pn is a
unique factorization of the content, up to reordering and associates, as R is a UFD.
So cancelling the content, we only have to show that primitives can be factored
uniquely.
Suppose we have two factorizations f1 = q1 q2 · · · qm = r1 r2 · · · r` . Note that each
qi and each ri is a factor of the primitive polynomial f1 , so are also primitive. Let
F be the field of fractions of R, and consider qi , ri ∈ F [X]. Since F is a field, F
is a Euclidean domain, hence principal ideal domain, hence unique factorization
domain.
By Gauss’ lemma, since the qi and ri are irreducible in R[X], they are also irre-
ducible in F [X]. As F [X] is a UFD, we find that ` = m, and after reordering,
ri and qi are associates, say ri = ui qi with ui ∈ F [X] a unit. What we want to
say is that ri is a unit times qi in R[X]. Firstly, note that ui ∈ F as it is a unit.
Clearing denominators, we can write ai ri = bi qi ∈ R[X]. Taking contents, since
ri , qi are primitives, we know ai and bi are associates, say bi = vi ai with vi ∈ R a
unit. Cancelling ai on both sides, we know ri = vi qi as required.
The key idea is to use Gauss’ lemma to say the reducibility in R[X] is the same
as reducibility in F [X], as long as we are primitive. The first part about contents
is just to turn everything into primitives.
Note that the last part of the proof is just our previous proposition. We could
have applied it, but we decide to spell it out in full for clarity.
P. 9-99
<Eisenstein’s criterion> Let R be a UFD, and let f = a0 + a1 X + · · · + an X^n ∈
R[X] be primitive with an ≠ 0. If ∃p ∈ R irreducible (hence prime) such that

1. p ∤ an   2. p | ai for all 0 ≤ i < n   3. p^2 ∤ a0 ,

then f is irreducible in R[X], and hence also in F [X], where F is the field of
fractions of R.

Suppose we have a factorization f = gh with g = r0 + r1 X + · · · + rk X k and


h = s0 + s1 X + · · · + s` X ` with rk , s` 6= 0.
We know rk sℓ = an . Since p ∤ an , we have p ∤ rk and p ∤ sℓ . We can also look at
the bottom coefficients. We know r0 s0 = a0 . We know p | a0 and p^2 ∤ a0 . So p divides
exactly one of r0 and s0 ; wlog, p | r0 and p ∤ s0 .
Now let j be the smallest natural number such that p ∤ rj . We now look at aj . This
is, by definition, aj = r0 sj + r1 sj−1 + · · · + rj−1 s1 + rj s0 . We know r0 , · · · , rj−1
are all divisible by p. Also, since p ∤ rj and p ∤ s0 , we know p ∤ rj s0 , using the fact
that p is prime. So p ∤ aj . So we must have j = n.
We also know that j ≤ k ≤ n. So we must have j = k = n. So deg g = n. Hence
ℓ = deg f − deg g = 0. So h is a constant. But we also know f is primitive. So h must be
a unit. So this is not a proper factorization.
It is important that we work in R[X] all the time, until the end where we apply
Gauss’ lemma. Otherwise, we cannot possibly apply Eisenstein’s criterion since
there are no primes in F .
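The three conditions are purely mechanical to verify; the following minimal Python sketch (for polynomials over Z given as coefficient lists, and purely illustrative) checks them at a given prime:

def eisenstein(coeffs, p):
    # coeffs = [a0, a1, ..., an]; True if Eisenstein's criterion applies at the prime p,
    # so a primitive polynomial with these coefficients is irreducible in Z[X]
    a0, an = coeffs[0], coeffs[-1]
    return (an % p != 0                               # p does not divide a_n
            and all(a % p == 0 for a in coeffs[:-1])  # p divides a_0, ..., a_{n-1}
            and a0 % (p * p) != 0)                    # p^2 does not divide a_0

print(eisenstein([-2, 0, 0, 0, 0, 1], 2))   # X^5 - 2 with p = 2: True
print(eisenstein([1, 1, 0, 1], 2))          # X^3 + X + 1: the criterion does not apply at 2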
E. 9-100
• Consider the polynomial X^n − p ∈ Z[X] for p a prime. Apply Eisenstein’s criterion
with the prime p, and observe that all the conditions hold. This is certainly primitive, since
it is monic. So X^n − p is irreducible in Z[X], hence in Q[X]. In particular, X^n − p
has no rational roots, ie. the n-th root of p is irrational (for n > 1).
• Consider a polynomial f = X^{p−1} + X^{p−2} + · · · + X^2 + X + 1 ∈ Z[X] where p is
a prime number. If we look at this, we notice Eisenstein’s criterion does not apply
directly. What should we do? We observe that

f = (X^p − 1)/(X − 1).

So it might be a good idea to let Y = X − 1. Then we get a new polynomial

f̂ = f̂(Y ) = ((Y + 1)^p − 1)/Y = Y^{p−1} + C(p, 1) Y^{p−2} + C(p, 2) Y^{p−3} + · · · + C(p, p−1),

where C(p, i) denotes the binomial coefficient. When we look at it hard enough, we notice
Eisenstein’s criterion can be applied — we know p | C(p, i) for 1 ≤ i ≤ p − 1, but
p^2 ∤ C(p, p−1) = p. So f̂ is irreducible in Z[Y].


Now if we had a factorization f (X) = g(X)h(X) ∈ Z[X] then we get fˆ(Y ) =


g(Y + 1)h(Y + 1) in Z[Y ]. So f is irreducible. Hence none of the roots of f are
rational (but we already know that — they are not even real!).
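For any particular prime the divisibility claims are easy to confirm numerically; a small Python sketch for p = 7 (illustrative only, using the standard library):

from math import comb

p = 7
coeffs = [comb(p, i) for i in range(1, p + 1)]   # constant term first: C(p,1), ..., C(p,p)
print(coeffs)                                    # [7, 21, 35, 35, 21, 7, 1]
print(all(c % p == 0 for c in coeffs[:-1]))      # True: p divides everything but the leading 1
print(coeffs[0] % (p * p) != 0)                  # True: p^2 does not divide the constant term p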
D. 9-101
The Gaussian integers is the subring Z[i] = {a + bi : a, b ∈ Z} ≤ C.
E. 9-102
We have already shown that the norm N (a + ib) = a2 + b2 is a Euclidean function
for Z[i]. So Z[i] is a Euclidean domain, hence principal ideal domain, hence a
unique factorization domain.
Since the units must have norm 1, they are precisely ±1, ±i. What does factor-
ization in Z[i] look like? What are the primes? We know we are going to get new
primes, ie. primes that are not integers, while we will lose some other primes. For
example, we have 2 = (1 + i)(1 − i). So 2 is not irreducible, hence not prime. Also,
5 is not prime, since 5 = (1 + 2i)(1 − 2i).
However, 3 is a prime. We have N (3) = 9. So if 3 = uv, with u, v not units, then
9 = N (u)N (v), and neither N (u) nor N (v) are 1. So N (u) = N (v) = 3. However,
3 = a2 + b2 has no solutions with a, b ∈ Z. So there is nothing of norm 3. So 3 is
irreducible, hence a prime. Similarly 7 is a prime.
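Whether a rational prime stays irreducible in Z[i] is therefore just a question about norms, which can be searched by hand or by machine; a minimal Python sketch (illustrative only):

def is_sum_of_two_squares(n):
    # is there a Gaussian integer a + bi with norm a^2 + b^2 = n?
    return any(round((n - a * a) ** 0.5) ** 2 == n - a * a
               for a in range(int(n ** 0.5) + 1))

for p in (2, 3, 5, 7, 13):
    print(p, "splits" if is_sum_of_two_squares(p) else "stays prime in Z[i]")
# 2, 5 and 13 split, eg. (1+i)(1-i), (2+i)(2-i), (2+3i)(2-3i); 3 and 7 stay prime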

P. 9-103
A prime number p ∈ Z is prime in Z[i] if and only if p 6= a2 +b2 for all a, b ∈ Z\{0}.

(Forward) If p = a2 + b2 , then p = (a + ib)(a − ib). So p is not irreducible.

(Backward) Now suppose p = uv, with u, v not units. Taking norms, we get
p2 = N (u)N (v). Since u and v are not units, N (u) = N (v) = p. Writing
u = a + ib, then this says a2 + b2 = p.

So to understand which primes stay as primes in the Gaussian integers, we have


to understand when a prime p can be written as a sum of two squares.

L. 9-104
Let p be a prime number. Let Fp = Z/pZ be the field with p elements. Let
Fp^× = Fp \ {0} be the group of invertible elements under multiplication. Then
Fp^× ≅ Cp−1 .

Certainly Fp^× has order p − 1, and is abelian. We know from the classification
of finite abelian groups that if Fp^× is not cyclic, then it must contain a subgroup
Cm × Cm for some m > 1 (we can write it as Cd × Cd′ × · · · with d′ | d, and Cd has
a subgroup isomorphic to Cd′ ).

We consider the polynomial X^m − 1 ∈ Fp [X], and recall that Fp [X] is a UFD. At best, X^m − 1
factors into m linear factors. So X^m − 1 has at most m distinct roots. But if
Cm × Cm ≤ Fp^× , then we can find m^2 elements of order dividing m. So there are
m^2 elements of Fp which are roots of X^m − 1. This is a contradiction. So Fp^× is
cyclic.

This is a funny proof, since we have not found any element that has order p − 1.
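Although the proof produces no explicit generator, one is easy to find by brute force for any small prime; a minimal Python sketch (illustrative only):

def order(a, p):
    # multiplicative order of a modulo the prime p
    k, x = 1, a % p
    while x != 1:
        x = (x * a) % p
        k += 1
    return k

p = 23
print([a for a in range(1, p) if order(a, p) == p - 1])
# the generators of (Z/23Z)^x, a cyclic group of order 22; eg. 5 is one of them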

P. 9-105
The primes in Z[i] are, up to associates,
1. Prime numbers p ∈ Z ≤ Z[i] such that p ≡ 3 (mod 4).
2. Gaussian integers z ∈ Z[i] with N (z) = z z̄ = p for some prime p such that
p = 2 or p ≡ 1 (mod 4).

We first show these are indeed primes in Z[i]. If p ≡ 3 (mod 4), then p 6= a2 + b2 ,
since a square number mod 4 is always 0 or 1. So those in 1. are primes in Z[i]. If
N (z) = p, and z = uv, then N (u)N (v) = p. So N (u) is 1 or N (v) is 1. So u or v
is a unit. Note that we did not use the condition that p 6≡ 3 (mod 4). This is not
needed, since N (z) is always a sum of squares, and hence N (z) cannot be a prime
that is 3 mod 4.

Now we show that they are all the primes in Z[i]. Suppose z ∈ Z[i] is a irreducible,
hence prime. Then z̄ is also irreducible. So N (z) = z z̄ is a factorization of N (z)
into irreducibles. Let p ∈ Z be an ordinary prime number dividing N (z), which
exists since N (z) 6= 1.

Now if p ≡ 3 (mod 4), then p itself is prime in Z[i] by the first part of the proof.
So p | N (z) = z z̄. So p | z or p | z̄. Note that if p | z̄, then p | z by taking complex
conjugates. So we get p | z. Since both p and z are both irreducible, they must
be equal up to associates.

Otherwise, we get p = 2 or p ≡ 1 (mod 4). If p ≡ 1 (mod 4), then p − 1 = 4k for
some k ∈ Z. As Fp^× ≅ Cp−1 = C4k , there is a unique element of order 2 (this is
true for any cyclic group of order 4k — think Z/4kZ). This must be [−1] ∈ Fp .
Now let a ∈ Fp^× be an element of order 4. Then a^2 has order 2. So [a^2] = [−1]. In
other words, we have an a such that p | a^2 + 1. Thus p | (a + i)(a − i). In the case
where p = 2, we know by checking directly that 2 = (1 + i)(1 − i).
In either case, we deduce that p (or 2) is not prime (hence not irreducible), since
it clearly does not divide a ± i (or 1 ± i). So we can write p = z1 z2 , for z1 , z2 ∈ Z[i]
not units. Now we get p2 = N (p) = N (z1 )N (z2 ). As the zi are not units, we
know N (z1 ) = N (z2 ) = p. By definition, this means p = z1 z̄1 = z2 z̄2 . But also
p = z1 z2 . So we must have z̄1 = z2 . Finally, we have p = z1 z̄1 | N (z) = z z̄. All
these z, zi are irreducible. So z must be an associate of z1 (or maybe z̄1 ). So in
particular N (z) = p.
P. 9-106
An integer n ∈ Z≥0 may be written as x^2 + y^2 (a sum of two squares) if and only
if in its prime factorisation n = p1^{n1} p2^{n2} · · · pk^{nk} (where the pi are distinct primes), ni
is even for all pi such that pi ≡ 3 (mod 4).

Note that we have already proved this in the case when n is a prime.
(Forward) If n = x2 + y 2 , then we have n = (x + iy)(x − iy) = N (x + iy). Let
z = x + iy. So we can write z = α1 · · · αq as a product of irreducibles in Z[i]. By
the previous proposition, each αi is either αi = p (a genuine prime number with
p ≡ 3 (mod 4)), or N (αi ) = p is a prime number which is either 2 or ≡ 1 (mod 4).
We now take the norm to obtain

n = x2 + y 2 = N (z) = N (α1 )N (α2 ) · · · N (αq ).

Now each N (αi ) is either p^2 with p ≡ 3 (mod 4), or is just p for p = 2 or p ≡ 1
(mod 4). So if p^m is the largest power of p dividing n, we find that m must be even
if p ≡ 3 (mod 4).
(Backward) Let n = p1^{n1} p2^{n2} · · · pk^{nk} with the pi distinct primes. Now for each
pi , either pi ≡ 3 (mod 4) and ni is even, in which case pi^{ni} = (pi^2)^{ni/2} = N (pi^{ni/2});
or pi = 2 or pi ≡ 1 (mod 4), in which case the above proof shows that pi = N (αi )
for some αi , so pi^{ni} = N (αi^{ni}). Since the norm is multiplicative, we can write n as
the norm of some z ∈ Z[i]. So n = N (z) = N (x + iy) = x^2 + y^2 as required.
E. 9-107
We know 65 = 5 × 13. Since 5, 13 ≡ 1 (mod 4), it is a sum of squares. Moreover,
the proof tells us how to find 65 as the sum of squares. We have to factor 5 and
13 in Z[i]. We have 5 = (2 + i)(2 − i) and 13 = (2 + 3i)(2 − 3i). So we know

65 = N (2 + i)N (2 + 3i) = N ((2 + i)(2 + 3i)) = N (1 + 8i) = 12 + 82 .

But there is a choice here. We had to pick which factor is α and which is ᾱ. So
we can also write

65 = N ((2 + i)(2 − 3i)) = N (7 − 4i) = 72 + 42 .

So not only are we able to write them as sum of squares, but this also gives us
many ways of writing 65 as a sum of squares.
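A brute-force search (illustrative Python only) recovers exactly these two representations:

n = 65
reps = {(x, y) for x in range(1, n) for y in range(x, n) if x * x + y * y == n}
print(sorted(reps))   # [(1, 8), (4, 7)]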

D. 9-108
An α ∈ C is called an algebraic integer if it is a root of a monic polynomial in
Z[X], ie. there is a monic f ∈ Z[X] such that f (α) = 0.
For α an algebraic integer, we write Z[α] ≤ C for the smallest subring of C con-
taining α.
E. 9-109
We generalize the idea of Gaussian integers to algebraic integers. We can im-
mediately check that this is a sensible definition — not all complex numbers are
algebraic integers, since there are only countably many polynomials with integer
coefficients, hence only countably many algebraic integers, but there are uncount-
ably many complex numbers.
Z[α] can also be defined for arbitrary complex numbers, but it is less interesting.
We can also construct Z[α] by taking it as the image of the map φ : Z[X] → C
given by g 7→ g(α). So we can also write Z[α] = Z[X]/I where I = ker φ. Note
that I is non-zero, since, say, f ∈ I, by definition of an algebraic integer.
P. 9-110
If α ∈ C is an algebraic integer, then the ideal I = ker(φ : Z[X] → C, f 7→ f (α))
is principal, and equal to (fα ) for some irreducible monic fα .

By definition, there is a monic f ∈ Z[X] such that f (α) = 0. So f ∈ I. So I 6= 0.


Now let fα ∈ I be such a polynomial of minimal degree. We may suppose that fα
is primitive. We want to show that I = (fα ), and that fα is irreducible.
Let h ∈ I. We pretend we are living in Q[X]. Then we have the Euclidean
algorithm. So we can write h = fα q + r with r = 0 or deg r < deg fα . This was
done over Q[X], not Z[X]. We now clear denominators. We multiply by some
a ∈ Z to get ah = fα (aq) + (ar) where now (aq), (ar) ∈ Z[X].
We now evaluate these polynomials at α. Then we have ah(α) = fα (α)aq(α) +
ar(α). We know fα (α) = h(α) = 0, since fα and h are both in I. So ar(α) = 0.
So (ar) ∈ I. As fα ∈ I has minimal degree, we cannot have deg(r) = deg(ar) <
deg(fα ). So we must have r = 0.
Hence we know ah = fα · (aq) is a factorization in Z[X]. This is almost right, but
we want to factor h, not ah. Again, taking contents of everything, we get

ac(h) = c(ah) = c(fα (aq)) = c(aq),

as fα is primitive. In particular, a | c(aq). This, by definition of content, means


(aq) can be written as aq̄, where q̄ ∈ Z[X]. Cancelling, we get q = q̄ ∈ Z[X]. So
we know h = fα q ∈ (fα ). Therefore I = (fα ). The leading coefficient of fα must
be a unit since the monic f ∈ (fα ), so wlog fα is monic.
To show fα is irreducible, note that

Z[X]/(fα ) = Z[X]/ ker φ ≅ Im(φ) = Z[α] ≤ C.
Since C is an integral domain, so is Im(φ). So we know Z[X]/(fα ) is an integral
domain. So (fα ) is prime. So fα is prime, hence irreducible.
If the final line looks magical, we can unravel this proof as follows: suppose fα = pq
for some non-units p, q. Then since fα (α) = 0, we know p(α)q(α) = 0. Since

p(α), q(α) ∈ C, which is an integral domain, we must have, say, p(α) = 0. But
then deg p < deg fα . Contradiction.

This is a non-trivial theorem, since Z[X] is not a principal ideal domain. So there
is no immediate guarantee that I is generated by one polynomial. From this result
we see that the irreducible monic polynomial fα is the minimal polynomial of the
algebraic integer α over Z.

E. 9-111
1. We know α = i is an algebraic integer with fα = X^2 + 1.

2. Also, α = √2 is an algebraic integer with fα = X^2 − 2.

3. More interestingly, α = (1 + √−3)/2 is an algebraic integer with fα = X^2 − X + 1.

4. The polynomial X^5 − X + d ∈ Z[X] with d ∈ Z≥0 has precisely one real root
α, which is an algebraic integer. In fact there is a theorem that says this
α cannot be constructed from integers via +, −, ×, ÷ and taking n-th roots. There is another
theorem which says that degree 5 polynomials are the smallest degree for which
this can happen (the proof involves writing down formulas analogous to the
quadratic formula for degree 3 and 4 polynomials).
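As a quick numerical sanity check (a Python sketch using floating-point complex arithmetic, so only approximate), the first three examples are indeed roots of the stated monic polynomials:

import cmath

checks = [
    (1j,                       lambda x: x**2 + 1),      # alpha = i
    (2 ** 0.5,                 lambda x: x**2 - 2),      # alpha = sqrt(2)
    ((1 + cmath.sqrt(-3)) / 2, lambda x: x**2 - x + 1),  # alpha = (1 + sqrt(-3))/2
]
for alpha, f in checks:
    print(abs(f(alpha)) < 1e-12)   # True each time, up to rounding error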

L. 9-112
If α ∈ Q is an algebraic integer, then α ∈ Z.

Let fα ∈ Z[X] be the minimal polynomial, which is irreducible. In Q[X], the


polynomial X −α must divide fα . However, by Gauss’ lemma, we know fα ∈ Q[X]
is irreducible. So we must have fα = X − α ∈ Z[X]. So α is an integer.

It turns out the collection of all algebraic integers form a subring of C. This is not
at all obvious — given f, g ∈ Z[X] monic such that f (α) = g(α) = 0, there is no
easy way to find a new monic h such that h(α + β) = 0. We will prove this much
later on in the course.

D. 9-113
An ideal I is finitely generated if it can be written as I = (r1 , · · · , rn ) for some
r1 , · · · , rn ∈ R.

E. 9-114
Recall a ring is Noetherian if for any chain of ideals I1 ⊆ I2 ⊆ I3 ⊆ · · · , there is
some N such that IN = IN +1 = IN +2 = · · · .

• Every finite ring is Noetherian. This is since there are only finitely many pos-
sible ideals.

• Every field is Noetherian. This is since there are only two possible ideals.

• Every principal ideal domain (eg. Z) is Noetherian. This is easy to check


directly, but the next proposition will make this utterly trivial.

• Most rings we love and know are indeed Noetherian. However, we can explic-
itly construct some non-Noetherian rings. The ring Z[X1 , X2 , X3 , · · · ] is not
Noetherian, it has the chain of strictly increasing ideals (X1 ) ⊆ (X1 , X2 ) ⊆
(X1 , X2 , X3 ) ⊆ · · · .

P. 9-115
A ring is Noetherian if and only if every ideal is finitely generated.

(Backward) Suppose every ideal of R is finitely generated. Given the chain I1 ⊆
I2 ⊆ · · · , consider the ideal I = ⋃i Ii . This is obviously an ideal, which one can
check. We know I is finitely generated, say I = (r1 , · · · , rn ), with ri ∈ Iki . Let
K = max1≤i≤n {ki }, then r1 , · · · , rn ∈ IK . So IK = I. Therefore IK = IK+1 =
IK+2 = · · · .
(Forward) Suppose there is an ideal I C R that is not finitely generated. We pick
r1 ∈ I. Since I is not finitely generated, we know (r1 ) 6= I. So we can find some
r2 ∈ I \ (r1 ). Again (r1 , r2 ) 6= I. So we can find r3 ∈ I \ (r1 , r2 ). We continue on,
and then can find an infinite strictly ascending chain (r1 ) ⊆ (r1 , r2 ) ⊆ (r1 , r2 , r3 ) ⊆
· · · . Therefore R is not Noetherian.
This proposition makes Noetherian rings much more concrete, and makes it obvi-
ous why PIDs are Noetherian. Every PID trivially satisfies this condition. So we
know every PID is Noetherian.
When we have developed some properties or notions, a natural thing to ask is
whether it passes on to subrings and quotients. If R is Noetherian, does every
subring of R have to be Noetherian? The answer is no. For example, since
Z[X1 , X2 , · · · ] is an integral domain, we can take its field of fractions, which is
a field, hence Noetherian, but Z[X1 , X2 , · · · ] is a subring of its field of fractions.
However, for quotients we can say something.
P. 9-116
Let R be a Noetherian ring and I be an ideal of R. Then R/I is Noetherian.

Whenever we see quotients, we should think of them as the image of a homomor-


phism. Consider the quotient map π : R → R/I with x 7→ x+I. We can prove this
result either via finite generation or via the ascending chain condition. We go for the former.
Let J C R/I be an ideal. We want to show that J is finitely generated. Consider
the inverse image π −1 (J). This is an ideal of R, and is hence finitely generated,
since R is Noetherian. So π −1 (J) = (r1 , · · · , rn ) for some r1 , · · · , rn ∈ R. Then J
is generated by π(r1 ), · · · , π(rn ).
T. 9-117
<Hilbert basis theorem> If R is a Noetherian ring, then so is R[X].

Let I C R[X] be an ideal. We want to show it is finitely generated. Since we know


R is Noetherian, we want to generate some ideals of R from I. For n = 0, 1, 2, · · · ,
we let

In = {r ∈ R : there is some f ∈ I such that f = rX n + · · · } ∪ {0}.

Then it is easy to see, using the strong closure property, that each ideal In is an
ideal of R. Moreover, they form a chain, since if f ∈ I, then Xf ∈ I, by strong
closure. So In ⊆ In+1 for all n.
By the ascending chain condition of R, we know there is some N such that IN =
IN +1 = · · · . Now for each 0 ≤ n ≤ N , since R is Noetherian, we can write
In = (r1^(n) , r2^(n) , · · · , rk(n)^(n) ).

Now for each ri^(n) , we choose some fi^(n) ∈ I with fi^(n) = ri^(n) X^n + · · · . We now
claim that the polynomials fi^(n) for 0 ≤ n ≤ N and 1 ≤ i ≤ k(n) generate I. Suppose
not. We pick g ∈ I of minimal degree not generated by the fi^(n) . There are two
possible cases:
• If deg g = n ≤ N , say g = rX^n + · · · , we know r ∈ In . So we can write
r = ∑i λi ri^(n) for some λi ∈ R. Then we know ∑i λi fi^(n) = rX^n + · · · ∈ I. But
if g is not in the span of the fi^(j) , then neither is g − ∑i λi fi^(n) . But this has a
lower degree than g. This is a contradiction.
• In the case deg g = n > N , since In = IN , we have the same proof. We write
g = rX^n + · · · , but we know r ∈ In = IN . So we know r = ∑i λi ri^(N) . So

X^{n−N} ∑i λi fi^(N) = rX^n + · · · ∈ I.

Hence g − X^{n−N} ∑i λi fi^(N) has smaller degree than g, but is not in the span of
the fi^(j) . Contradiction.
Since Z is Noetherian, we know Z[X] also is. Hence so is Z[X, Y ] etc.
Before the Hilbert basis theorem, there were many mathematicians studying some-
thing known as invariant theory. The idea is that we have some interesting objects,
and we want to look at their symmetries. Often, there are infinitely many possible
such symmetries, and one interesting question to ask is whether there is a finite
set of symmetries that generate all possible symmetries. Turns out the collection
of such symmetries are often just ideals of some funny ring. So Hilbert came along
and proved the Hilbert basis theorem, and showed once and for all that those rings
are Noetherian, and hence the symmetries are finitely generated.
As an aside, let E ⊆ F [X1 , X2 , · · · , Xn ] be any set of polynomials. We view this as
a set of equations f = 0 for each f ∈ E. The claim is that to solve the potentially
infinite set of equations E, we actually only have to solve finitely many equations.
Consider the ideal (E) C F [X1 , · · · , Xn ]. By the Hilbert basis theorem, there is
a finite list f1 , · · · , fk such that (f1 , · · · , fk ) = (E). We want to show that we
only have to solve fi (x) = 0 for these fi . Given (α1 , · · · , αn ) ∈ F n , consider the
homomorphism

φα : F [X1 , · · · , Xn ] → F defined by f 7→ f (α1 , · · · , αn ).

Then we know (α1 , · · · , αn ) ∈ F n is a solution to the equations E if and only


if (E) ⊆ ker(ϕα ). By our choice of fi , this is true if and only if (f1 , · · · , fk ) ⊆
ker(ϕα ). By inspection, this is true if and only if (α1 , · · · , αn ) is a solution to all
of f1 , · · · , fk . So solving E is the same as solving f1 , · · · , fk . This is a useful thing
in, say, algebraic geometry.
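In computational practice, a finite generating set such as (f1 , · · · , fk ) is usually found as a Gröbner basis. A minimal sketch, assuming the third-party SymPy library is available (the call below is purely an illustration, not part of the argument):

from sympy import groebner, symbols

x, y = symbols('x y')
# two generators of an ideal in Q[x, y]; the Groebner basis is another finite
# generating set for the same ideal, better suited to solving the equations
print(groebner([x**2 + y**2 - 1, x - y], x, y, order='lex'))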

9.4 Modules I
Recall that to define a vector space, we first pick some base field F. We then defined
a vector space to be an abelian group V with an action of F on V (ie. scalar multi-
plication) that is compatible with the multiplicative and additive structure of F. In
the definition, we did not at all mention division in F. So in fact we can make the

same definition, but allow F to be a ring instead of a field. We call these modules.
Unfortunately, most results we prove about vector spaces do use the fact that F is a
field, so many linear algebra results do not apply to modules, and modules have much
richer structures.
D. 9-118
• Let R be a commutative ring. We say a quadruple (M, +, 0M , · ) is an R- module
if
1. (M, +, 0M ) is an abelian group
2. The operation · : R × M → M satisfies
i. (r1 + r2 ) · m = (r1 · m) + (r2 · m);
ii. r · (m1 + m2 ) = (r · m1 ) + (r · m2 );
iii. r1 · (r2 · m) = (r1 · r2 ) · m; and
iv. 1R · m = m.
E. 9-119
Note that there are two different additions going on – addition in the ring and
addition in the module, and similarly two notions of multiplication. However, it
is easy to distinguish them since they operate on different things. If needed, we
can make them explicit by writing, say, +R and +M .
We can imagine modules as rings acting on abelian groups, just as groups can act
on sets. Hence we might say “R acts on M ” to mean M is an R-module.
• Let F be a field. An F-module is precisely the same as a vector space over F
(the axioms are the same).
• For any ring R, we have the R-module Rn = R×R×· · ·×R via r·(r1 , · · · , rn ) =
(rr1 , · · · , rrn ) using the ring multiplication. This is the same as the definition
of the vector space Fn for fields F.
• Let I C R be an ideal. Then it is a R-module via r ·M a = r ·R a and r1 +M r2 =
r1 +R r2 . Also, R/I is an R-module via r ·M (a + I) = (r ·R a) + I.
• A Z-module is precisely the same as an abelian group. For A an abelian group,
we have
Z × A → A with (n, a) ↦ a + · · · + a (n times),

where if n is negative that means adding −a to itself |n| times, and adding
something to itself 0 times is just 0. This definition is essentially forced upon
us, since by the axioms of a module, we must have (1, a) 7→ a. Then we must
send, say, (2, a) = (1 + 1, a) 7→ a + a.
• Let F be a field and V a vector space over F, and α : V → V be a linear map.
Then V is an F[X]-module via F[X] × V → V with (f, v) 7→ f (α)(v). This is
a module. Note that we cannot just say that V is an F[X] module. We have
to specify the α as well. Picking a different α will give a different F[X]-module
structure.
• Let φ : R → S be a homomorphism of rings. Then any S-module M may be
considered as an R module via R × M → M with (r, m) 7→ φ(r) ·M m.

D. 9-120
• Let M be an R-module. A subset N ⊆ M is an R- submodule if it is a subgroup
of (M, +, 0M ), and rn ∈ N whenever n ∈ N and r ∈ R. We write N ≤ M .

• Let N ≤ M be an R-submodule. The quotient module M/N is the set of N -
cosets in (M, +, 0M ), with the R action given by r · (m + N ) = (r · m) + N . (It is
easy to check that this is well-defined and is indeed a module.)

• A function f : M → N between R-modules is a R- module homomorphism if it is


a homomorphism of abelian groups, and satisfies f (r · m) = r · f (m) for all r ∈ R
and m ∈ M .

An isomorphism is a bijective homomorphism, and two R-modules are isomorphic


if there is an isomorphism between them.

E. 9-121
• We know R itself is an R-module. Then a subset of R is a submodule if and only
if it is an ideal.

• A subset of an F-module V , where F is a field, is a F-submodule if and only if it


is a vector subspace of V .

• Note that modules are different from rings and groups. In groups, we had sub-
groups, and we have some really nice ones called normal subgroups. We are only
allowed to quotient by normal subgroups. In rings, we have subrings and ideals,
which are unrelated objects, and we only quotient by ideals. In modules, we only
have submodules, and we can quotient by arbitrary submodules.

• If F is a field and V, W are F-modules (ie. vector spaces over F), then an F-module
homomorphism is precisely an F-linear map.

T. 9-122
<Isomorphism theorems>
1. Let f : M → N be an R-module homomorphism. Then ker f = {m ∈ M :
f (m) = 0} ≤ M is an R-submodule of M , and Im f = {f (m) : m ∈ M } ≤ N
is an R-submodule of N . Moreover,

M/ ker f ≅ Im f.

2. Let A, B ≤ M . Then A + B = {m ∈ M : m = a + b for some a ∈ A, b ∈ B}


and A ∩ B are submodules of M . Moreover,
(A + B)/A ≅ B/(A ∩ B).

3. Let N ≤ L ≤ M . Then we have


(M/N )/(L/N ) ≅ M/L.

The proof is almost exactly the same as for rings and groups.


C. 9-123
<Submodule correspondence> Similar to groups and rings, given N ≤ M ,
we have a correspondence

{submodules of M/N } ←→ {submodules of M which contain N }

D. 9-124
• Let M be a R-module, and m ∈ M . The annihilator of m is Ann(m) = {r ∈ R :
r · m = 0}. For any set S ⊆ M , we define
Ann(S) = {r ∈ R : r · m = 0 for all m ∈ S} = ⋂m∈S Ann(m).

• Let M be an R-module, and m ∈ M . The submodule generated by m is Rm =


{r · m ∈ M : r ∈ R}.
• An R-module M is finitely generated if there is a finite list of elements m1 , · · · , mk
such that M = Rm1 + Rm2 + · · · + Rmk = {r1 m1 + r2 m2 + · · · + rk mk : ri ∈ R}.
? Sometimes we write f : M ↠ N to mean f is surjective onto N .
E. 9-125
Note that the annihilator is a subset of R. Moreover it is an ideal — if r · m = 0
and s · m = 0, then (r + s) · m = r · m + s · m = 0. So r + s ∈ Ann(m). Moreover,
if r · m = 0, then also (sr) · m = s · (r · m) = 0. So sr ∈ Ann(m).
What is this good for? We first note that any m ∈ M generates a submodule Rm.
We consider the R-module homomorphism φ : R → M given by r 7→ rm. This is
clearly a homomorphism. Then we have Rm = Im(φ) and Ann(m) = ker(φ). The
conclusion is that Rm ≅ R/ Ann(m).
As we mentioned, rings acting on modules is like groups acting on sets. We can
think of this as the analogue of the orbit stabilizer theorem. A finitely generated
module is in some sense analogous to the idea of a vector space being finite-
dimensional, however, it behaves much more differently.
L. 9-126
An R-module M is finitely-generated if and only if there is a surjective R-module
homomorphism f : Rk → M for some finite k.

(Forward) If M = Rm1 +Rm2 +· · ·+Rmk we define f : Rk → M by (r1 , · · · , rk ) 7→


r1 m1 + · · · + rk mk . It is clear that this is an R-module homomorphism. This is
by definition surjective.
(Backward) Given a surjective homomorphism f : Rk → M , we let
mi = f (0, 0, · · · , 0, 1, 0, · · · , 0)
where the 1 appears in the ith position. We now claim that M = Rm1 + Rm2 +
· · · + Rmk . Let m ∈ M . As f is surjective, we know m = f (r1 , r2 , · · · , rk ) for
some ri . We then have
f (r1 , r2 , · · · , rk ) = f ((r1 , 0, · · · , 0) + (0, r2 , 0, · · · , 0) + · · · + (0, 0, · · · , 0, rk ))
= f (r1 , 0, · · · , 0) + f (0, r2 , 0, · · · , 0) + f (0, 0, · · · , 0, rk )
= r1 f (1, 0, · · · , 0) + r2 f (0, 1, 0, · · · , 0) + rk f (0, 0, · · · , 0, 1)
= r1 m1 + r2 m2 + · · · + rk mk .

So the mi generate M .

This view is a convenient way of thinking about finitely-generated modules. For


example, we can immediately prove the following result.

P. 9-127
Let N ≤ M and M be finitely-generated. Then M/N is also finitely generated.

Since M is finitely generated, we have some surjection f : Rk ↠ M . Moreover,
we have the surjective quotient map q : M ↠ M/N . Then q ◦ f is a surjection
Rk ↠ M/N . So M/N is finitely generated.

E. 9-128
• It is very tempting to believe that if a module is finitely generated, then its sub-
modules are also finitely generated. But in fact a submodule of a finitely-generated
module need not be finitely generated.

We let R = C[X1 , X2 , · · · ]. We consider the R-module R, which is finitely gener-


ated (by 1). A submodule of the ring is the same as an ideal. Moreover, an ideal
is finitely generated as an ideal if and only if it is finitely generated as a module.
We pick the submodule I = (X1 , X2 , · · · ) which we have already shown to be not
finitely-generated.

• For a complex number α, the ring Z[α] (ie. the smallest subring of C containing
α) is a finitely-generated as a Z-module if and only if α is an algebraic integer,
which one can prove. This allows us to prove that algebraic integers are closed
under addition and multiplication, since it is easier to argue about whether Z[α]
is finitely generated.

D. 9-129
• Let M1 , M2 , · · · , Mk be R-modules. The direct sum is the R-module M1 ⊕
M2 ⊕ · · · ⊕ Mk which is the set M1 × M2 × · · · × Mk , with addition given by
(m1 , · · · , mk ) + (m01 , · · · , m0k ) = (m1 + m01 , · · · , mk + m0k ) and the R action is
given by r · (m1 , · · · , mk ) = (rm1 , · · · , rmk ).
• Let m1 , · · · , mk ∈ M . Then {m1 , · · · , mk } is linearly independent if ∑_{i=1}^{k} ri mi =
0 implies r1 = r2 = · · · = rk = 0.

• A subset S ⊆ M generates M freely if

1. S generates M

2. Any set function ψ : S → N to an R-module N extends to an R-module


homomorphism θ : M → N .

• An R-module is free if it is freely generated by some subset S ⊆ M , and S is


called a basis .

• If M is a finitely-generated R-module, we have shown that there is a surjective


R-module φ : Rk → M . We call ker(φ) the relation module for those generators.
We say M is finitely presented if the relation module ker φ is finitely generated.

E. 9-130
• We’ve been using one example of the direct sum already, namely

Rn = R ⊕ R ⊕ · · · ⊕ R (n times).

Recall we said modules are like vector spaces. So we can try to define things like
basis and linear independence. However, we will fail massively, since we really
can’t prove much about them. Still, we can define them.

• Suppose S ⊆ M generates M freely and θ1 , θ2 are two extensions of ψ : S → N ,


we can consider θ1 − θ2 : S → M . Then θ1 − θ2 sends everything in S to 0. So
S ⊆ ker(θ1 − θ2 ) ≤ M . So the submodule generated by S lies in ker(θ1 − θ2 )
too. But this is by definition M . So M ≤ ker(θ1 − θ2 ) ≤ M , ie. equality holds.
So θ1 − θ2 = 0. So θ1 = θ2 . So any such extension is unique. Thus, what this
definition tells us is that giving a map from M to N is exactly the same thing as
giving a function from S to N .

We will soon prove that if R is a field, then every module is free. However, if
R is not a field, then there are non-free modules. For example, the Z-module
Z/2Z is not freely generated. Suppose Z/2Z were generated freely by some S ⊆
Z/2Z. Then this can only possibly be S = {1}. Then this implies there is a
homomorphism θ : Z/2Z → Z sending 1 to 1. But it does not send 0 = 1 + 1 to
1 + 1, since homomorphisms send 0 to 0. So Z/2Z is not freely generated.

• Being finitely presented means I can tell you everything about the module using
only a finite amount of information. More precisely, if {m1 , · · · , mk } generate M and {n1 , n2 , · · · , nl }
generate ker(φ), then each ni = (ri1 , · · · , rik ) corresponds to the relation

ri1 m1 + ri2 m2 + · · · + rik mk = 0

in M . So M is the module generated by writing down R-linear combinations of


m1 , · · · , mk , and say two elements are the same if they are related to one another
by these relations. Since there are only finitely many generators and finitely many
such relations, we can specify the module with finitely many information.

P. 9-131
For a subset S = {m1 , · · · , mk } ⊆ M , the following are equivalent:
1. S generates M freely.
2. S generates M and the set S is independent.
3. Every element of M is uniquely expressible as r1 m1 + r2 m2 + · · · + rk mk for
some ri ∈ R.

The fact that (2) and (3) are equivalent is something we would expect from what
we know from linear algebra, and in fact the proof is the same. So we only show
that (1) and (2) are equivalent.

1 ⇒ 2: Suppose S generate M freely. If S is not independent, then we can write


r1 m1 + · · · + rk mk = 0 with ri ∈ R and, say, r1 non-zero. We define the
set function ψ : S → R by sending m1 7→ 1R and mi 7→ 0 for all i 6= 1.
As S generates M freely, this extends to an R-module homomorphism

θ : M → R. By definition of a homomorphism, we can compute

0 = θ(0) = θ(r1 m1 + r2 m2 + · · · + rk mk )
= r1 θ(m1 ) + r2 θ(m2 ) + · · · + rk θ(mk ) = r1 .

This is a contradiction. So S must be independent.


2 ⇒ 1: Suppose every element can be uniquely written as r1 m1 + · · · + rk mk .
Given any set function ψ : S → N , we define θ : M → N by

θ(r1 m1 + · · · + rk mk ) = r1 ψ(m1 ) + · · · + rk ψ(mk ).

This is well-defined by uniqueness, and is clearly a homomorphism. So it
follows that S generates M freely.
It follows that if S = {m1 , · · · , mk } generates M freely, then M ≅ Rk as an R-
module via the isomorphism φ : Rk → M with φ(r1 , r2 , · · · , rk ) = r1 m1 + r2 m2 +
· · · + rk mk .
E. 9-132
The set {2, 3} ⊆ Z generates Z. However, since 3 · 2 + (−2) · 3 = 0, they do not
generate Z freely. Recall from linear algebra that if a set S spans a vector space
V , and it is not independent, then we can just pick some useless ones and throw
them away in order to get a basis. However, this is no longer the case in modules.
Neither 2 nor 3 generate Z.
L. 9-133
<Zorn’s lemma> Suppose a partially ordered set P has the property that every
chain has an upper bound in P . Then the set P contains at least one maximal
element.
We will not prove this. In fact Zorn’s lemma is equivalent to the axiom of
choice.
L. 9-134
Every non-zero ring has a maximal ideal.

We observe that an ideal I C R is proper if and only if 1R 6∈ I. So every increasing


union of proper ideals is proper. Let P be set of all proper ideals, then every chain
in P has an upper bound in P , namely the union of everything in the chain. So
by Zorn’s lemma, there exists a maximal ideal.
This is a rather strong statement, since it talks about “all rings”, and we can have
weird rings.
P. 9-135
<Invariance of dimension> Let R be a non-zero ring. If Rn ≅ Rm as R-modules,
then n = m.

We know this is true if R is a field. We now want to reduce this to the case where R
is a field. If R is an integral domain, then we can produce a field by taking the
field of fractions, and this might be a good starting point. However, we want to
do this for general rings. So we need some more magic.
Firstly we will show that if I C R is an ideal and M is an R-module, then M/IM
is an R/I module in a natural way, where IM = {am ∈ M : a ∈ I, m ∈ M } ≤ M .
We can take the quotient module M/IM , which is an R-module again. Now if

b ∈ I, then its action on M/IM is b(m + IM ) = bm + IM = IM . So everything


in I kills everything in M/IM . So we can consider M/IM as an R/I module by

(r + I) · (m + IM ) = r · m + IM.

Let I be a maximal ideal of R. Suppose we have Rn ≅ Rm . Then we must have

Rn/IRn ≅ Rm/IRm

as R/I-modules. But staring at it long enough, we figure that Rn/IRn ≅ (R/I)n
and similarly for m. Since R/I is a field, the result follows by linear algebra.
The point of this proposition is not the result itself (which is not too interesting),
but the general constructions used behind the proof.

9.5 Modules II
In this section we will prove the classification of finite abelian groups and Jordan
normal forms. We will mostly work with R a Euclidean domain, and we write φ :
R \ {0} → Z≥0 for its Euclidean function.
D. 9-136
• Elementary row operations on an m×n matrix A with entries in R are operations
of the form
1. Add a multiple of one row to another.
2. Swap two rows.
3. Multiply a row by a unit.
We also have elementary column operations defined in a similar fashion.
• Two matrices are equivalent if we can get from one to the other via a sequence
of such elementary row and column operations.
• A k × k minor of a matrix A is the determinant of a k × k sub-matrix of A (ie.
a matrix formed by removing all but k rows and all but k columns).
• For a matrix A, the kth Fitting ideal Fitk (A) C R is the ideal generated by the
set of all k × k minors of A.
E. 9-137
Elementary row and column operations on an m × n matrix A with entries in R
can be achieved as follows
1. Add c ∈ R times the ith row to the jth row: this may be done by multiplying
on the left by the matrix, which is the m×m identity matrix but with c instead
of 0 at the j, i entry.
2. Swap the ith and jth rows: this may be done by multiplying on the left by the
matrix, which is the m × m identity matrix but swapping the i and j rows.
3. Multiplying the ith row by a unit c ∈ R: this may be done by multiplying on
the left by the matrix, which is the m × m identity matrix by with c instead
of 1 at the i, i entry. Notice that if R is a field, then we can multiply any row
by any non-zero number, since they are all units.

For elementary column operations, this corresponds to right multiplication by the
analogous matrices. Notice that all the matrices corresponding to these operations are invertible.
Note that if A and B are equivalent, then we can write B = QAT^{−1} for some
invertible matrices Q and T .
For each matrix, we want to find an equivalent matrix that is as simple as
possible. Recall from IB Linear Algebra that if R is a field, then via elementary
row and column operations we can put any matrix into the block form

( Ir 0 )
( 0  0 )

This is no longer true when working with rings. For example, over Z, we cannot put the matrix

( 2 0 )
( 0 0 )

into that form, since no operation can turn the 2 into a 1.
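For concreteness, here is a minimal Python sketch (using plain lists, purely illustrative) of the first operation as a left multiplication by an elementary matrix:

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[3, 7, 4],
     [1, -1, 2],
     [3, 5, 1]]

# add -3 times the second row to the first: E is the identity with -3 in the (1, 2) entry
E = [[1, -3, 0],
     [0, 1, 0],
     [0, 0, 1]]
print(matmul(E, A))   # [[0, 10, -2], [1, -1, 2], [3, 5, 1]]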
T. 9-138
<Smith normal form> An m × n matrix over a Euclidean domain R is equiva-
lent to a matrix with only diagonal entries (that is only non-zero entries in the i, i
positions), and those entries are (d1 , d2 , · · · , dr , 0, · · · , 0) with the di all non-zero
and d1 | d2 | d3 | · · · | dr .

Throughout the process, we will keep calling our matrix A, even though it keeps
changing in each step, so that we don’t have to invent hundreds of names for these
matrices.
1. If A = 0, then done! So suppose A 6= 0. So some entry is not zero, say,
Aij 6= 0. Swapping the ith and first row, then jth and first column, we
arrange that A11 6= 0.
We now try to reduce A11 as much as possible. We do the following:
2. If there is an A1j not divisible by A11 , then we can use the Euclidean algo-
rithm to write A1j = qA11 + r. By assumption, r 6= 0. So φ(r) < φ(A11 )
(where φ is the Euclidean function). So we subtract q copies of the first col-
umn from the jth column. Then in position (1, j), we now have r. We swap
the first and jth column such that r is in position (1, 1), and we have strictly
reduced the value of φ at the first entry.
If there is an Ai1 not divisible by A11 , we do the same thing, and this again
reduces φ(A11 ). We keep performing these until no move is possible. Since
the value of φ(A11 ) strictly decreases every move, we stop after finitely many
applications. Then we know that we must have A11 dividing all A1j and
Ai1 . Now we can just subtract appropriate multiples of the first column from
others so that A1j = 0 for j 6= 1. We do the same thing with rows so that
the first row is cleared. Then we have a matrix of the form

A = ( d 0 )
    ( 0 C )

in block form, ie. the first row and column are zero apart from the (1, 1) entry d.

We would like to say “do the same thing with C”, but then this would get us a
regular diagonal matrix, not necessarily in Smith normal form. So we need some
preparation.
3. Suppose there is an entry of C not divisible by d, say Aij with i, j > 1. We
suppose Aij = qd + r with r 6= 0 and φ(r) < φ(d). We add column 1 to
column j, and subtract q times row 1 from row i. Now we get r in the (i, j)th
entry, and we want to send it back to the (1, 1) position. We swap row i
with row 1, swap column j with row 1, so that r is in the (1, 1)th entry, and
φ(r) < φ(d).
Now we have messed up the first row and column. So we go back and do (1)
again until the first row and columns are cleared. Then we get

A = ( d′ 0  )
    ( 0  C′ )

in block form, where φ(d′) ≤ φ(r) < φ(d).
We keep on repeating this process. As this strictly decreases the value of
φ(A11 ), we can only repeat this finitely many times. When we stop, we will
end up with a matrix

A = ( d 0 )
    ( 0 C )

in block form,
and d divides every entry of C.
4. Now we apply the entire process again to C. When we do this process, notice
all allowed operations don’t change the fact that d divides every entry of C.
So applying this recursively, we obtain a diagonal matrix with the claimed
divisibility property.
Note that if we didn’t have to care about the divisibility property, we can just do
(1) and (2), and we can get a diagonal matrix. The magic to get to the Smith
normal form is (3).
The dk obtained in the Smith normal form are called the invariant factors of A.
It would be nice if we can prove that the di are indeed invariant. It is not clear
from the algorithm that we will always end up with the same di . Indeed, we can
multiply a whole row by −1 and get different invariant factors. However, it turns
out that these are unique up to multiplication by units. To study the uniqueness
of the invariant factors of a matrix A, we relate them to other invariants, which
involves minors. Any given matrix has many minors, since we get to decide which
rows and columns we can throw away. The idea is to consider the ideal generated
by all the minors of matrix. A key property (we will show next) is that equivalent
matrices have the same Fitting ideal, even if they might have very different minors.
Note that the divisibility criterion is similar to the classification of finitely-generated
abelian groups. In fact, we will derive that as a consequence of the Smith normal
form.
E. 9-139
We exhibit the algorithm for producing the Smith normal form with a matrix over Z.
We start with the matrix

( 3  7  4 )
( 1 −1  2 )
( 3  5  1 )

We want to move the 1 to the top-left corner. So we swap the first and second rows to obtain

( 1 −1  2 )
( 3  7  4 )
( 3  5  1 )

We then try to eliminate the other entries in the first row by column operations.
We add multiples of the first column to the second and third to obtain

( 1  0  0 )
( 3 10 −2 )
( 3  8 −5 )

We similarly clear the first column to get

( 1  0  0 )
( 0 10 −2 )
( 0  8 −5 )

We are left with a 2 × 2 matrix to fiddle with. We swap the second and third columns
so that 2 is in the (2, 2) entry, and secretly change sign to get

( 1  0  0 )
( 0  2 10 )
( 0  5  8 )

We notice that (2, 5) = 1. So we can use linear combinations to introduce a 1 at the bottom:

( 1  0   0 )
( 0  2  10 )
( 0  1 −12 )

Swapping rows, we get

( 1  0   0 )
( 0  1 −12 )
( 0  2  10 )

We then clear the remaining rows and columns to get

( 1  0   0 )
( 0  1   0 )
( 0  0  34 )
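The procedure is entirely mechanical, so for R = Z it is easy to code. The following minimal Python sketch (an illustration, with no attempt at efficiency) follows the algorithm above — move a non-zero entry to the top left, shrink it by division with remainder until it divides and clears its row and column, then recurse on the remaining block — and, as a shortcut differing slightly from step (3) above, it only enforces the divisibility chain at the end, using the fact that diag(a, b) is equivalent to diag(gcd(a, b), ab/gcd(a, b)):

from math import gcd

def smith_invariant_factors(A):
    # invariant factors d1 | d2 | ... | dr of a non-zero integer matrix A (list of lists)
    A = [row[:] for row in A]                 # work on a copy
    m, n = len(A), len(A[0])

    def clear_row_and_column(t):
        # move a non-zero entry of the lower-right block to (t, t), then use Euclidean
        # steps until it divides, and then clears, the rest of its row and column
        pivot = next(((i, j) for i in range(t, m) for j in range(t, n) if A[i][j] != 0), None)
        if pivot is None:
            return False                      # remaining block is zero
        i, j = pivot
        A[t], A[i] = A[i], A[t]               # row swap
        for row in A:                         # column swap
            row[t], row[j] = row[j], row[t]
        changed = True
        while changed:
            changed = False
            for i in range(t + 1, m):         # clear column t below the pivot
                q = A[i][t] // A[t][t]
                A[i] = [A[i][k] - q * A[t][k] for k in range(n)]
                if A[i][t] != 0:              # non-zero remainder becomes the new, smaller pivot
                    A[t], A[i] = A[i], A[t]
                    changed = True
            for j in range(t + 1, n):         # clear row t to the right of the pivot
                q = A[t][j] // A[t][t]
                for row in A:
                    row[j] -= q * row[t]
                if A[t][j] != 0:
                    for row in A:
                        row[t], row[j] = row[j], row[t]
                    changed = True
        return True

    r = 0
    while r < min(m, n) and clear_row_and_column(r):
        r += 1
    d = [abs(A[i][i]) for i in range(r)]
    for i in range(r):                        # sort divisibility: diag(a, b) ~ diag(gcd, ab/gcd)
        for j in range(i + 1, r):
            g = gcd(d[i], d[j])
            d[i], d[j] = g, d[i] * d[j] // g
    return d

print(smith_invariant_factors([[3, 7, 4], [1, -1, 2], [3, 5, 1]]))   # [1, 1, 34], as found by hand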
L. 9-140
If A and B are equivalent matrices, then Fitk (A) = Fitk (B) for all k.

It suffices to show that changing A by a row or column operation does not change
the Fitting ideal. Since taking the transpose does not change the determinant, ie.
Fitk (A) = Fitk (AT ), it suffices to consider the row operations.
The most difficult one is taking linear combinations. Let B be the result of adding
c times the ith row to the jth row, and fix C a k × k sub-matrix of A. Suppose the
corresponding matrix wrt B is C 0 . We then want to show that det C 0 ∈ Fitk (A).
If the jth row is outside of C, then the minor det C is unchanged. If both the ith
and jth rows are in C, then the submatrix C changes by a row operation, which
does not affect the determinant. These are the boring cases.
Suppose the jth row is in C and the ith row is not. Suppose the ith row is
f1 , · · · , fk . Then C is changed to C 0 , with the jth row being

(Cj1 + cf1 , Cj2 + cf2 , · · · , Cjk + cfk ).

We compute det C 0 by expanding along this row. Then we get

det C 0 = det C + c det D,

where D is the matrix obtained by replacing the jth row of C with (f1 , · · · , fk ).
The point is that det C is definitely a minor of A, and det D is still a minor of A,
just another one. Since ideals are closed under addition and multiplications, we
know det(C 0 ) ∈ Fitk (A).
The other operations are much simpler. They just follow by standard properties
of the effect of swapping rows or multiplying rows on determinants. So after any

row operation, the resultant submatrix C 0 satisfies det(C 0 ) ∈ Fitk (A). Since this
is true for all minors, we must have Fitk (B) ⊆ Fitk (A). But row operations are
invertible. So we must have Fitk (A) ⊆ Fitk (B) as well. So they must be equal.
P. 9-141
If A has Smith normal form B = diag(d1 , d2 , · · · , dr , 0, · · · , 0) then Fitk (A) =
(d1 d2 · · · dk ) (where dk = 0 if k > r). And in fact dk is unique up to associates.

The fact Fitk (B) = (d1 d2 · · · dk ) is clear once we notice that the only possible
contributing minors are from the diagonal submatrices, and the minor from the
top left square submatrix divides all other diagonal ones. The uniqueness of di
follows since we can find dk by dividing the generator of Fitk (A) by the generator
of Fitk−1 (A).
E. 9-142
Consider the matrix over Z

A = ( 2 0 )
    ( 0 3 )

This is diagonal, but not in Smith normal form. We can potentially apply the algorithm, but
that would be messy. We notice that Fit1 (A) = (2, 3) = (1). So we know d1 = ±1. We can then look at the
second Fitting ideal Fit2 (A) = (6), hence d1 d2 = ±6. So we must have d2 = ±6.
So the Smith normal form is

( 1 0 )
( 0 6 )

That was much easier.
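This shortcut is also easy to automate; a small Python sketch (illustrative only) computes the generator of each Fitting ideal as the gcd of the k × k minors and divides successive generators to read off the invariant factors:

from itertools import combinations
from functools import reduce
from math import gcd

def det(M):
    # integer determinant by cofactor expansion along the first row (fine for small matrices)
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def fitting_generator(A, k):
    # gcd of all k x k minors of A, ie. the generator of Fit_k(A) over Z
    rows, cols = range(len(A)), range(len(A[0]))
    minors = [det([[A[i][j] for j in cs] for i in rs])
              for rs in combinations(rows, k) for cs in combinations(cols, k)]
    return reduce(gcd, (abs(x) for x in minors))

A = [[2, 0], [0, 3]]
f1, f2 = fitting_generator(A, 1), fitting_generator(A, 2)
print(f1, f2 // f1)   # 1 6: the invariant factors, so the Smith normal form is diag(1, 6)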
L. 9-143
Let R be a principal ideal domain. Then any submodule of Rm is generated by at
most m elements.
Let N ≤ Rm be a submodule. Consider the ideal

I = {r ∈ R : (r, r2 , · · · , rm ) ∈ N for some r2 , · · · , rm ∈ R}.

It is clear this is an ideal. Since R is a principal ideal domain, we must have


I = (a) for some a ∈ R. We now choose an n = (a, a2 , · · · , am ) ∈ N . Then for
any vector (r1 , r2 , · · · , rm ) ∈ N , we know that r1 ∈ I. So a | r1 . So we can write
r1 = ra. Then we can form

(r1 , r2 , · · · , rm ) − r(a, a2 , · · · , am ) = (0, r2 − ra2 , · · · , rm − ram ) ∈ N.

This lies in N 0 = N ∩ ({0} × Rm−1 ) ≤ Rm−1 . Thus everything in N can be written


as a multiple of n plus something in N 0 . But by induction, since N 0 ≤ Rm−1 , we
know N 0 is generated by at most m − 1 elements. So there are n2 , · · · , nm ∈ N 0
generating N 0 . So n, n2 , · · · , nm generate N .
This result is obvious for vector spaces, but is slightly more difficult here.
If we have a submodule of Rm , then it has at most m generators. However, these
might generate the submodule in a terrible way. The next theorem tells us there
is a nice way of finding generators.
T. 9-144
Let R be a Euclidean domain, and let N ≤ Rm be a submodule, then
1. there exists a basis v1 , · · · , vm of Rm such that N is generated by d1 v1 , d2 v2 , · · · , dr vr
for some 0 ≤ r ≤ m and some di ∈ R such that d1 | d2 | · · · | dr ;
2. N is free and N ∼ = Rr with r ≤ m. In other words, the submodule of a free
module is free, and of a smaller (or equal) rank.

1. By the previous lemma, N is generated by some elements x1 , · · · , xn with


n ≤ m. Each xi is an element of Rm . Now we form the m × n matrix A =
(x1 , x2 , · · · , xn ). We put it in Smith normal form diag(d1 , · · · , dr , 0, · · · , 0)!

Recall that we got to the Smith normal form by row and column operations.
Performing row operations is just changing the basis of Rm , while each column
operation changes the generators of N . So what this tells us is that there is
a new basis v1 , · · · , vm of Rm such that N is generated by d1 v1 , · · · , dr vr . By
definition of Smith normal form, the divisibility condition holds.

2. Let N ≤ Rm be a submodule. By the above, there is a basis v1 , · · · , vn of Rm


such that N is generated by d1 v1 , · · · , dr vr for r ≤ m. So it is certainly gener-
ated by at most m elements. So we only have to show that d1 v1 , · · · , dr vr are
independent. But if they were linearly dependent, then so would be v1 , · · · , vm .
But v1 , · · · , vn are a basis, hence independent. So d1 v1 , · · · , dr vr generate N
freely. So N ∼ = Rr .

T. 9-145
<Classification of finitely-generated modules over a Euclidean domain>
Let R be a Euclidean domain, and M be a finitely generated R-module. Then
M ∼= R/(d1) ⊕ R/(d2) ⊕ · · · ⊕ R/(dr) ⊕ R ⊕ R ⊕ · · · ⊕ R

for some di 6= 0, and d1 | d2 | · · · | dr .

Since M is finitely-generated, there is a surjection φ : Rm → M . So by the first


isomorphism, we have M ∼ = Rm / ker φ. Since ker φ is a submodule of Rm , by the
previous theorem, there is a basis v1 , · · · , vm of Rm such that ker φ is generated
by d1 v1 , · · · , dr , vr for 0 ≤ r ≤ m and d1 | d2 | · · · | dr . So we know

M ∼= Rm /((d1 , 0, · · · , 0), (0, d2 , 0, · · · , 0), · · · , (0, · · · , 0, dr , 0, · · · , 0))
∼= R/(d1) ⊕ R/(d2) ⊕ · · · ⊕ R/(dr) ⊕ R ⊕ · · · ⊕ R    (with m − r copies of R at the end).

So all finitely-generated modules are of this simple form, so we can prove things
about them assuming they look like this.

This result is particularly useful in the case where R = Z, where R-modules are
abelian groups. In which case we get: Any finitely-generated abelian group is
isomorphic to Cd1 × · · · × Cdr × C∞ × · · · × C∞ where C∞ ∼ = Z is the infinite cyclic
group, with d1 | d2 | · · · | dr . Note that if the group is finite, then there cannot be
any C∞ factors. So it is just a product of finite cyclic groups. That is, if A is a
finite abelian group, then A ∼= Cd1 × · · · × Cdr with d1 | d2 | · · · | dr .

E. 9-146
Let A be the abelian group generated by a, b, c with relations 2a + 3b + c = 0,
a + 2b = 0, and 5a + 6b + 7c = 0. In other words, we have

A = Z3 /((2, 3, 1), (1, 2, 0), (5, 6, 7)).
We would like to get a better description of A. It is not even obvious if this module
is the zero module or not. To work out a good description, we consider the matrix

    X = ( 2  1  5 )
        ( 3  2  6 )
        ( 1  0  7 ).
To figure out the Smith normal form, we find the fitting ideals. We have Fit1 (X) =
(1, · · · ) = (1). So d1 = 1. We have to work out the second fitting ideal. In
principle, we have to check all the minors, but we immediately notice | 23 12 | = 1.
So Fit2 (X) = (1), and d2 = 1. Finally, we find Fit3 (X) = (det X) = (3), so d3 = 3.
So we know
A ∼= Z/(1) ⊕ Z/(1) ⊕ Z/(3) ∼= Z/(3) ∼= C3 .
If you don’t feel like computing determinants, doing row and column reduction is
often as quick and straightforward.
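As a hedged cross-check of this example (not in the notes): assuming a recent SymPy with smith_normal_form available, the computation above can be confirmed directly. The import path and the domain keyword are assumptions about the library.

    from sympy import Matrix, ZZ
    from sympy.matrices.normalforms import smith_normal_form   # assumed available

    X = Matrix([[2, 1, 5],
                [3, 2, 6],
                [1, 0, 7]])
    # Smith normal form over the integers; expect diag(1, 1, 3) up to signs,
    # confirming A = Z^3 / (relations) is isomorphic to C_3.
    print(smith_normal_form(X, domain=ZZ))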
L. 9-147
<Chinese remainder theorem> Let R be a Euclidean domain, and a, b ∈ R
be such that gcd(a, b) = 1. Then
R/(ab) ∼= R/(a) × R/(b)    as R-modules.

Consider the R-module homomorphism


φ : R/(a) × R/(b) → R/(ab)    given by    (r1 + (a), r2 + (b)) 7→ br1 + ar2 + (ab).
To show this is well-defined, suppose (r1 + (a), r2 + (b)) = (r10 + (a), r20 + (b)), then
r1 = r10 + xa and r2 = r20 + yb. So
br1 + ar2 + (ab) = br10 + xab + ar20 + yab + (ab) = br10 + ar20 + (ab).
So this is indeed well-defined. It is clear that this is a module map, by inspection.
We now have to show it is surjective and injective. So far, we have not used
the hypothesis, that gcd(a, b) = 1. As we know gcd(a, b) = 1, by the Euclidean
algorithm, we can write 1 = ax + by for some x, y ∈ R. So we have
φ(y + (a), x + (b)) = by + ax + (ab) = 1 + (ab).
So 1 ∈ Im φ. Since this is an R-module map, we get
φ(r(y + (a), x + (b))) = r · (1 + (ab)) = r + (ab).
The key fact is that R/(ab) as an R-module is generated by 1. Thus we know φ
is surjective. Finally, we have to show it is injective, ie. that the kernel is trivial.
Suppose
φ(r1 + (a), r2 + (b)) = 0 + (ab).
Then br1 + ar2 ∈ (ab). So we can write br1 + ar2 = abx for some x ∈ R. Since
a | ar2 and a | abx, we know a | br1 . Since a and b are coprime, unique factorization
implies a | r1 . Similarly, we know b | r2 . Therefore (r1 + (a), r2 + (b)) = (0 +
(a), 0 + (b)).
The proof is just that of the Chinese remainder theorem written in ring language.
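A tiny numerical illustration (not in the notes) of the explicit map used in the proof, taking R = Z and a = 4, b = 9 with gcd(a, b) = 1: the map (r1, r2) ↦ b r1 + a r2 mod ab hits every residue class exactly once, so it is a bijection Z/4 × Z/9 → Z/36.

    a, b = 4, 9
    images = {(b * r1 + a * r2) % (a * b) for r1 in range(a) for r2 in range(b)}
    print(len(images) == a * b)   # True: surjective, hence bijective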
T. 9-148
<Prime decomposition theorem> Let R be a Euclidean domain, and M be
a finitely-generated R-module. Then

M∼
= N1 ⊕ N2 ⊕ · · · ⊕ Nt ,

where each Ni is either R or is R/(pn ) for some prime p ∈ R and some n ≥ 1.

We already know

M ∼= R/(d1) ⊕ · · · ⊕ R/(dr) ⊕ R ⊕ · · · ⊕ R.

So it suffices to show that each R/(di) can be written in that form. We let

    di = p1^{n1} p2^{n2} · · · pk^{nk}

with pi distinct primes, so the pi^{ni} are pairwise coprime. Applying the lemma repeatedly, we have

    R/(di) ∼= R/(p1^{n1}) ⊕ · · · ⊕ R/(pk^{nk}).

Recall that we were also to decompose a finite abelian group into products of the
form Cpk , where p is a prime, and it was just the Chinese remainder theorem.
This is again in general true.

C. 9-149
<F-vector space as F[X]-module> We next want to consider the Jordan normal
form. This is less straightforward, since considering V directly as an F module
would not be too helpful (since that would just be pure linear algebra). Instead,
we use the following trick: For a field F, the polynomial ring F[X] is a Euclidean
domain, so the results we had apply. If V is a vector space on F, and α : V → V
is a linear map, then we can make V into an F[X]-module via

F[X] × V → V given by (f, v) 7→ (f (α))(v).

We write Vα for this F[X]-module.

L. 9-150
If V is a finite-dimensional vector space, then Vα is a finitely-generated F[X]-
module.

If v1 , · · · , vn generate V as an F-module, ie. spans V as a vector space over F,


then they also generate Vα as an F[X]-module, since F ≤ F[X] (hence a scalar in
F is a constant polynomial in F[X]).

E. 9-151
1. Suppose Vα ∼= F[X]/(X r ) as F[X]-modules, then in particular also Vα ∼= F[X]/(X r ) as F-modules (since being a map of F-modules has fewer requirements than being a map of F[X]-modules).

Under this bijection, the elements 1, X, X 2 , · · · , X r−1 ∈ F[X]/(X r ) form a


vector space basis for Vα . Viewing F[X]/(X r ) as an F-vector space, the action
of X has the matrix

    ( 0 0 · · · 0 0 )
    ( 1 0 · · · 0 0 )
    ( 0 1 · · · 0 0 )        (∗)
    ( ·  ·  · · ·  ·  · )
    ( 0 0 · · · 1 0 )

i.e. the matrix with 1s just below the diagonal and 0s elsewhere.
We also know that in Vα , the action of X is by definition the linear map α. So
under this basis, α is represented by the same matrix.
2. Suppose Vα ∼= F[X]/((X − λ)r ) for some λ ∈ F. Consider the new linear map
β = α − λ · ι where ι : V → V is the identity map. Then Vβ ∼ = F[Y ]/(Y r ) for
Y = X − λ. So there is a basis for V so that β looks like (∗). So we know α is
represented by the matrix

    ( λ 0 · · · 0 0 )
    ( 1 λ · · · 0 0 )
    ( 0 1 · · · 0 0 )
    ( ·  ·  · · ·  ·  · )
    ( 0 0 · · · 1 λ )

This is a Jordan block. The Jordan blocks we defined in Linear Algebra are the other way round, with zeroes below the diagonal. However a simple change of basis (reversing the order of the basis vectors, ie. conjugating by the matrix with 1s on the anti-diagonal) gives us the “right” form.
3. Suppose Vα ∼
= F[X]/(f ) for some polynomial f , for

f = a0 + a1 X + · · · + ar−1 X r−1 + X r .

This has a basis 1, X, X 2 , · · · , X r−1 as well, in which α is represented by

    c(f ) = ( 0 0 · · · 0  −a0   )
            ( 1 0 · · · 0  −a1   )
            ( 0 1 · · · 0  −a2   )
            ( ·  ·  · · ·  ·   ·    )
            ( 0 0 · · · 1  −ar−1 )

We call this the companion matrix of the monic polynomial f (despite the notation, c(f ) here does not denote the content of f ).
These are different things that can possibly happen. Since we have already classi-
fied all finitely generated F[X] modules, this allows us to put matrices in a rather
nice form.
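A small numerical sketch (not in the notes, and assuming NumPy's np.poly helper) of the companion matrix claim above: building c(f ) for f = X^3 − 2X^2 + 3X + 5 and checking that its characteristic polynomial recovers f.

    import numpy as np

    def companion(coeffs):
        """Companion matrix of X^r + a_{r-1}X^{r-1} + ... + a_0, in the form used
        above (1s below the diagonal, -a_i in the last column);
        coeffs = [a_0, a_1, ..., a_{r-1}]."""
        r = len(coeffs)
        C = np.zeros((r, r))
        for i in range(1, r):
            C[i, i - 1] = 1.0
        C[:, -1] = [-a for a in coeffs]
        return C

    C = companion([5.0, 3.0, -2.0])
    # np.poly returns the coefficients of det(X*I - C), leading coefficient first.
    print(np.poly(C))   # approximately [1, -2, 3, 5], i.e. f itself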
T. 9-152
<Rational canonical form> Let α : V → V be a linear endomorphism of
a finite-dimensional vector space over F, and Vα be the associated F[X]-module.
Then
Vα ∼= F[X]/(f1 ) ⊕ F[X]/(f2 ) ⊕ · · · ⊕ F[X]/(fs ),
with f1 | f2 | · · · | fs . Thus there is a basis for V in which the matrix for α is
represented by the block diagonal matrix diag(c(f1 ), c(f2 ), · · · , c(fs )) where c(f )
denote the companion matrix of f .
We already know that Vα is a finitely-generated F[X] module. By the structure


theorem of F[X]-modules, we know

Vα ∼= F[X]/(f1 ) ⊕ F[X]/(f2 ) ⊕ · · · ⊕ F[X]/(fs ) ⊕ F[X] ⊕ · · · ⊕ F[X].

We know there are no copies of F[X], since Vα = V is finite-dimensional over F,


but F[X] is not. The divisibility criterion also follows from the structure theorem.
Then the form of the matrix is immediate.
This is really a canonical form. The Jordan normal form is not canonical, since
we can move the blocks around. The structure theorem determines the factors fi
up to units, and once we require it is monic, there is no choice left.
In terms of matrices, this says that if α is represented by a matrix A ∈ Mn,n (F )
in some basis, then A is conjugate to a matrix of the form above.
From the rational canonical form, we can immediately read off the minimal poly-
nomial as fs . This is since if we view Vα as the decomposition above, we find that
fs (α) kills everything in F[X]/(fs ). It also kills the other factors since fi | fs for
all i. So fs (α) = 0. We also know no smaller polynomial kills V , since it does
not kill F[X]/(fs ). Similarly, we find that the characteristic polynomial of α is
f1 f2 · · · fs .
Recall we had a different way of decomposing a module over a Euclidean domain,
namely the prime decomposition, and this in fact gives us the Jordan normal form,
which we’ll next show.
L. 9-153
The prime elements of C[X] are X − λ for λ ∈ C (up to multiplication by units).

Let f ∈ C[X]. If f is constant, then it is either a unit or 0. Otherwise, by the


fundamental theorem of algebra, it has a root λ. So it is divisible by X − λ. So
if f is irreducible, it must have degree 1. And clearly everything of degree 1 is
prime.
Applying the prime decomposition theorem to C[X] modules gives us the Jordan
normal form.
T. 9-154
<Jordan normal form> Let α : V → V be an endomorphism of a vector space
V over C, and Vα be the associated C[X] module. Then

Vα ∼= C[X]/((X − λ1 )^{a1}) ⊕ C[X]/((X − λ2 )^{a2}) ⊕ · · · ⊕ C[X]/((X − λt )^{at}),
where λi ∈ C do not have to be distinct. There is a basis of V in which α
is represented by the block diagonal matrix diag(Ja1 (λ1 ), Ja2 (λ2 ), · · · , Jat (λt ))
where Jm (λ) are Jordan blocks.

Apply the prime decomposition theorem to Vα . Then all primes are of the form
X − λ. We then use 2 of [E.9-151] to get the form of the matrix.
The blocks Jm (λ) are called the Jordan λ-blocks. It turns out that the Jordan
blocks are unique up to reordering, which we proved in the Linear Algebra.
• We read off the minimal polynomial (and characteristic polynomial) of α: the minimal polynomial is ∏λ (X − λ)^{aλ} where aλ is the size of the largest λ-block.
• The characteristic polynomial of α is ∏λ (X − λ)^{bλ} where bλ is the sum of the sizes of the λ-blocks. Alternatively, it is ∏_{i=1}^{t} (X − λi )^{ai} .
• We can also read off another invariant from the Jordan normal form, namely
the size of the λ-eigenspace of α, namely the number of λ-blocks.
E. 9-155
Consider X · − : Vα → Vα (here X · − means left multiplication by the polynomial
X), what is its nullity (dimension of its kernel)?
1. For λ 6= 0, the map X · − : C[X]/((X − λ)q ) → C[X]/((X − λ)q ) is an isomorphism. Injectivity follows since ker(X · −) = {f + ((X − λ)q ) | Xf ∈ ((X − λ)q )}: for Xf ∈ ((X − λ)q ) we must have Xf = (X − λ)q g for some polynomial g, but X and X − λ are coprime, so (X − λ)q | f , and hence the kernel is trivial. The map is also surjective (an injective endomorphism of a finite-dimensional space). Therefore its nullity is 0.
2. For λ = 0, X · − : C[X]/(X q ) → C[X]/(X q ) has matrix

       ( 0 0 · · · 0 0 )
       ( 1 0 · · · 0 0 )
       ( 0 1 · · · 0 0 )   = Jq (0),
       ( ·  ·  · · ·  ·  · )
       ( 0 0 · · · 1 0 )

so it has nullity 1.
Combining these two we see that nullity of X · − : Vα → Vα is equal to the number
of Jordan blocks with eigenvalue 0. In linear algebra language this says that the
linear map α has nullity equal to the number of Jordan blocks (in its Jordan
normal form) with eigenvalue 0.
Similarly X 2 · − : C[X]/((X − λ)q ) → C[X]/((X − λ)q ) is an isomorphism for
λ 6= 0, whereas for λ = 0 it has matrix (1s two places below the diagonal, 0s elsewhere)

       ( 0 0 0 · · · 0 0 0 )
       ( 0 0 0 · · · 0 0 0 )
       ( 1 0 0 · · · 0 0 0 )
       ( 0 1 0 · · · 0 0 0 )
       ( ·  ·  ·  · · ·  ·  ·  · )
       ( 0 0 0 · · · 1 0 0 )
which has a 2 dimensional kernel for q ≥ 2 and a 1 dimensional kernel when q = 1.


So the nullity of X 2 · − : Vα → Vα is equal to the number of Jordan blocks with
eigenvalue 0 plus the number of Jordan blocks with eigenvalue 0 that are of size greater than or equal to 2. Hence

    number of 0-eigenvalue Jordan blocks of size 1 = 2 n(X · − : Vα → Vα ) − n(X 2 · − : Vα → Vα ),

where n = dim ker. Similarly, we can extract the numbers of Jordan blocks of any
size and any eigenvalue. This is what we do in [P.4-119].
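A numerical sketch (not in the notes, assuming NumPy) of reading off Jordan data from nullities of powers, as in the example above: we assemble a matrix from known Jordan blocks and recover the block counts from ranks.

    import numpy as np

    def jordan_block(lam, size):
        J = lam * np.eye(size)
        for i in range(1, size):
            J[i, i - 1] = 1.0            # 1s below the diagonal, as in these notes
        return J

    def block_diag(*blocks):
        n = sum(b.shape[0] for b in blocks)
        A, i = np.zeros((n, n)), 0
        for b in blocks:
            k = b.shape[0]
            A[i:i + k, i:i + k] = b
            i += k
        return A

    def nullity(M):
        return M.shape[0] - np.linalg.matrix_rank(M)

    # 0-blocks of sizes 1, 2, 3, and one 2-block of size 2:
    A = block_diag(jordan_block(0, 1), jordan_block(0, 2),
                   jordan_block(0, 3), jordan_block(2, 2))
    n1, n2 = nullity(A), nullity(A @ A)
    print(n1)            # 3: number of 0-blocks
    print(2 * n1 - n2)   # 1: number of 0-blocks of size exactly 1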
T. 9-156
<Cayley-Hamilton theorem> Let M be a finitely-generated R-module, where
R is some commutative ring. Let α : M → M be an R-module homomorphism.
Let A be a matrix representation of α under some choice of generators, and let
p(t) = det(tI − A). Then p(α) = 0.

We consider M as an R[X]-module with action given by (f (X))(m) = f (α)m.


Suppose e1 , · · · , en span M , and that for all i, we have α(ei ) = ∑_{j=1}^{n} aji ej . So

    0 = ∑_{j=1}^{n} (α δij − aji ) ej = ∑_{j=1}^{n} (X δij − aji ) ej .

We write C for the matrix with entries cij = Xδij − aji ∈ F[X]. We now use
the fact that adj(C)C = det(C)I which we proved in [T.4-81] (the proof did
not assume that the underlying ring is a field). Expanding this out, we get the
following equation (in F [X]).
χα (X)I = det(XI − A)I = (adj(XI − A))(XI − A).
Writing this in components, and multiplying by ek , we have

    χα (X) δik ek = ∑_j (adj(XI − A)ij )(X δjk − akj ) ek .

Then for each i, we sum over k to obtain

    ∑_{k=1}^{n} χα (X) δik ek = ∑_{k=1}^{n} ∑_j (adj(XI − A)ij )(X δjk − akj ) ek = 0,

by our choice of aij . But the left hand side is just χα (X)ei . So χα (X) acts trivially
on all of the generators ei . So it in fact acts trivially. So χα (α) is the zero map
(since acting by X is the same as acting by α, by construction).
So we can also use the idea of viewing V as an F[X] module to prove Cayley-
Hamilton theorem. In fact, we don’t need F to be a field.
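A quick exact check of Cayley-Hamilton for a specific integer matrix (not in the notes; a minimal sketch assuming SymPy): we evaluate the characteristic polynomial at the matrix itself and get the zero matrix.

    from sympy import Matrix, zeros

    A = Matrix([[1, 2, 0], [3, -1, 4], [2, 2, 2]])
    coeffs = A.charpoly().all_coeffs()   # coefficients of det(X*I - A), leading first
    n = A.rows
    P = zeros(n, n)
    for i, c in enumerate(coeffs):
        P += c * A**(n - i)
    print(P)                             # the zero matrix, exactly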
L. 9-157
Let α, β : V → V be two linear maps. Vα ∼ = Vβ as F[X]-modules if and only if
α and β are conjugate as linear maps, ie. there is some γ : V → V such that
α = γ −1 βγ.

(Forward) Let γ : Vβ → Vα be an F[X]-module isomorphism. Then for v ∈ V , we


notice that β(v) is just X · v in Vβ , and α(v) is just X · v in Vα . So we get
β ◦ γ(v) = X · (γ(v)) = γ(X · v) = γ ◦ α(v),
using the definition of an F[X]-module homomorphism. Hence βγ = γα. So
α = γ −1 βγ.
(Backward) Let γ : V → V be a linear isomorphism such that γ −1 βγ = α. We
now claim that γ : Vα → Vβ is an F[X]-module map. We just have to check that
γ(f · v) = γ(f (α)(v)) = γ((a0 + a1 α + · · · + ar α^r )(v))
= γ(a0 v) + γ(a1 α(v)) + γ(a2 α^2 (v)) + · · · + γ(ar α^r (v))
= (a0 + a1 β + a2 β^2 + · · · + ar β^r )(γ(v)) = f · γ(v),
using γα^k = β^k γ.
So classifying linear maps up to conjugation is the same as classifying modules.
We can reinterpret this a little bit, using [T.9-152], our classification of finitely-
generated modules: there is a bijection between conjugacy classes of n×n matrices
over F and sequences of monic polynomials d1 , · · · , dr such that d1 | d2 | · · · | dr
and deg(d1 · · · dr ) = n.
E. 9-158
Let’s classify conjugacy classes in GL2 (F), ie. we need to classify F[X] modules of
the form
F[X]/(d1 ) ⊕ F[X]/(d2 ) ⊕ · · · ⊕ F[X]/(dr )
which are two-dimensional as F-modules. As we must have deg(d1 d2 · · · dr ) = 2,
we either have a quadratic thing or two linear things, ie. either
1. r = 1 and deg(d1 ) = 2. In this first case, the module is F[X]/(d1 ) where, say,
d1 = X 2 + a1 X + a2 .
2. r = 2 and deg(d1 ) = deg(d2 ) = 1. In this case, since we have d1 | d2 , and they
are both monic linear, we must have d1 = d2 = X − λ for some λ. In this case,
we get
F[X]/(X − λ) ⊕ F[X]/(X − λ).

In case 1 we use the basis 1, X; the linear maps corresponding to cases 1 and 2 respectively have matrices ( 0 −a2 ; 1 −a1 ) and ( λ 0 ; 0 λ ).
Do these cases overlap? Suppose the two of them are conjugate. Then they have
the same determinant and same trace. So we know −a1 = 2λ and a2 = λ2 . So in
fact our polynomial is

X 2 + a1 X + a2 = X 2 − 2λ + λ2 = (X − λ)2 .

This is just the polynomial of a Jordan block. So the matrix ( 0 −a2 ; 1 −a1 ) is conjugate to the Jordan block ( λ 0 ; 1 λ ),
but this is not conjugate to λI, eg. by looking at eigenspaces. So these cases are
disjoint. Note that we have done more work than we really needed, since λI is
invariant under conjugation.
But the first case is not too satisfactory. We can further classify it as follows. If
X 2 + a1 X + a2 is reducible, then it is (X − λ)(X − µ) for some µ, λ ∈ F. If λ = µ,
then the matrix is conjugate to ( λ1 λ0 ). Otherwise, it is conjugate to ( λ0 µ0 ).
In the case where X 2 +a1 X +a2 is irreducible, there is nothing we can do in general.
However, we can look at some special scenarios and see if there is anything we can
do.
E. 9-159
Consider GL2 (Z/3) (here Z/3 = Z/3Z). We want to classify its conjugacy classes.
By the general theory, we know everything is conjugate to one of

    ( λ 0 ; 0 µ ),   ( λ 0 ; 1 λ ),   ( 0 −a2 ; 1 −a1 ),

with X 2 + a1 X + a2 irreducible. So we need to figure out what the irreducibles


are.
A reasonable strategy is to guess. Given any quadratic, it is easy to see if it is
irreducible, since we can try to see if it has any roots, and there are just three
things to try. However, we can be slightly more clever. We first count how
many irreducibles we are expecting, and then find that many of them.
There are 9 monic quadratic polynomials in total, since a1 , a2 ∈ Z/3. The re-
ducibles are (X − λ)2 or (X − λ)(X − µ) with λ 6= µ. There are three of each kind.
So we have 6 reducible polynomials, and so 3 irreducible ones.
We can then check that X 2 + 1, X 2 + X + 2 and X 2 + 2X + 2 are the irreducible
polynomials. So every matrix in GL2 (Z/3) is conjugate to one of

    ( 0 −1 ; 1 0 ),   ( 0 −2 ; 1 −1 ),   ( 0 −2 ; 1 −2 ),   ( λ 0 ; 0 µ ),   ( λ 0 ; 1 λ ),

where λ 6= µ and λ, µ ∈ (Z/3)× (since the matrix has to be invertible). The


number of conjugacy classes of each type are 1, 1, 1, 3, 2. So there are 8 conjugacy
classes. The first three classes have elements of order 4, 8, 8 respectively, by trying.
We notice that the identity matrix has order 1, and ( λ0 µ0 ) has order 2 otherwise.
Finally, for the last type, we have ord( 11 01 ) = 3, and ord( 21 02 ) = 6. Note that we
also have
| GL2 (Z/3)| = 48 = 24 · 3.
Since there is no element of order 16, the Sylow 2-subgroup of GL2 (Z/3) is not
cyclic. To construct the Sylow 2-subgroup, we might start with an element of
order 8, say B = ( 0 1 ; 1 2 ).
To make a subgroup of order 16, a sensible guess would be to take an element of
order 2, but that doesn’t work, since B 4 will give you the only element of order 2.
Instead, we pick A = ( 0 2 ; 1 0 ).
We notice

    A−1 BA = ( 0 1 ; 2 0 )( 0 1 ; 1 2 )( 0 2 ; 1 0 ) = ( 1 2 ; 0 2 )( 0 2 ; 1 0 ) = ( 2 2 ; 2 0 ) = B 3 .
So this is a bit like the dihedral group. We know that hBi C hA, Bi. Also, we
know |hBi| = 8. So if we can show that hBi has index 2 in hA, Bi, then this is the
Sylow 2-subgroup. By the second isomorphism theorem, something we have never
used in our life, we know
hA, Bi/hBi ∼= hAi/(hAi ∩ hBi).
We can list things out, and then find hAi ∩ hBi = h( 2 0 ; 0 2 )i ∼= C2 .
We also know hAi ∼= C4 . Now we know |hA, Bi|/|hBi| = 2. So |hA, Bi| = 16. Hence this is the Sylow 2-subgroup. In fact, it is

    hA, B | A4 = B 8 = e, A2 = B 4 , A−1 BA = B 3 i.

We call this the semi-dihedral group of order 16, because it is a bit like a dihedral group. Note that finding this subgroup was purely guesswork. There is no method to know that A and B are the right choices.
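A small computational check of the claims above (not in the notes): we represent 2 × 2 matrices over Z/3 as tuples (a, b, c, d) = ( a b ; c d ), verify the relations, and enumerate the subgroup generated by A and B.

    def mul(X, Y, p=3):
        a, b, c, d = X; e, f, g, h = Y
        return ((a*e + b*g) % p, (a*f + b*h) % p, (c*e + d*g) % p, (c*f + d*h) % p)

    def power(X, n):
        R = (1, 0, 0, 1)
        for _ in range(n):
            R = mul(R, X)
        return R

    B = (0, 1, 1, 2)
    A = (0, 2, 1, 0)
    Ainv = (0, 1, 2, 0)                            # inverse of A mod 3

    print(mul(mul(Ainv, B), A) == power(B, 3))     # A^{-1} B A = B^3 : True
    print(power(A, 2) == power(B, 4))              # A^2 = B^4       : True

    # order of <A, B>: close the set under multiplication by the generators
    G = {(1, 0, 0, 1)}
    frontier = [(1, 0, 0, 1)]
    while frontier:
        X = frontier.pop()
        for gen in (A, B):
            Y = mul(X, gen)
            if Y not in G:
                G.add(Y)
                frontier.append(Y)
    print(len(G))                                  # 16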
L. 9-160
Let M be a module over a ring R, and let N be a submodule of M . If M/N is
free, then M ∼
= N ⊕ M/N .

Suppose a1 + N, · · · , ak + N generate M/N freely. Define

    f : N ⊕ M/N → M    by    f (n, ∑_i αi (ai + N )) = n + ∑_i αi ai .

We can easily check that this is a homomorphism. Moreover this is injective since if n + ∑_i αi ai = 0, then ∑_i αi (ai + N ) = N , so αi = 0 for all i by linear independence of a1 + N, · · · , ak + N , and hence also n = 0. Also this is surjective since given x ∈ M we can find αi such that x + N = ∑_i αi (ai + N ), so x − ∑_i αi ai ∈ N , so

    f (x − ∑_i αi ai , ∑_i αi (ai + N )) = x.

E. 9-161

Let R = Z[X]/(X 2 + 5) ∼= Z[√−5] ⊆ C, then (1 + X)(1 − X) = 1 − X 2 = 1 + 5 = 6 = 2 × 3. Now 1 ± X, 2, 3 are irreducible, so R is not a UFD. Let

I1 = (3, 1 + X) and I2 = (3, 1 − X)

be ideals (submodule) of R. Consider φ : I1 ⊕ I2 → R given by (a, b) 7→ a + b.


Then Im(φ) = (3, 1 + X, 1 − X). But 3 − ((1 + X) + (1 − X)) = 1, so Im φ = R,
i.e. φ is surjective. Also

ker φ = {(a, b) ∈ I1 ⊕ I2 : a + b = 0} = I1 ∩ I2

So (x, −x) ∈ ker φ whenever x ∈ I1 ∩ I2 . Note that (3) ⊆ I1 ∩ I2 . Suppose


a ∈ I1 ∩ I2 , then we can write a = 3s + (1 + X)t ∈ (3, 1 − X) ⊆ R. Working in
modulo 3 as well we have

(1 + X)t = (1 − X)p mod (3, X 2 + 5) = (3, X 2 − 1) = (3, (X + 1)(X − 1))

so 1−X | t, so (1+X)(1−X) | t(1+X), so t(1+X) = q(X 2 −1) = q(X 2 +5−6) =


−6q. So a = 3s + (1 + X)t is divisible by 3, so I1 ∩ I2 ⊆ (3), so in fact I1 ∩ I2 = (3).
(I1 ⊕ I2 )/ ker φ ∼= Im φ = R, so by the above lemma I1 ⊕ I2 ∼= R ⊕ ker φ = R ⊕ (3).
Consider the surjective module homomorphism ψ : R → (3) given by ψ(x) = 3x.
Let ker ψ = {x ∈ R : 3x = 0} = 0 as R is an integral domain, so ψ is an
isomorphism. So R ∼ = (3), so I1 ⊕ I2 ∼
= R ⊕ R.
We claim that I1 is not principal. R has an isomorphism to itself X 7→ −X which exchanges I1 and I2 . Suppose I1 = (a + bX), then I2 = (a − bX). Then

(3) = I1 ∩ I2 = ((a + bX)(a − bX)) = (a2 − b2 X 2 ) = (a2 + 5b2 ).

So 3 ∈ (a2 + 5b2 ), so 3 = (a2 + 5b2 )(c + dX), so a2 + 5b2 divides 3 in Z. But a2 + 5b2 = 3 is impossible over the integers, and a2 + 5b2 = 1 would give (3) = (1) = R, which is false. Contradiction.


In fact we see that I1 ∼= I2 ∼= I, say, where I is not principal, and also I ⊕ I ∼= R ⊕ R. Now I needs two elements to generate it, but it is not the free module R2 . And I is a submodule of the free module R ⊕ R, but it is not free. This shows how strange things in modules can be.
CHAPTER 10
Complex Analysis and Methods
Complex analysis is the study of complex differentiable functions. It turns out complex
differentiable functions behave rather differently to the real case. Requiring that a
function is complex differentiable is a very strong condition, and consequently these
functions have very nice properties.
For example Liouville’s theorem says that every bounded complex differentiable function f : C → C must be constant. This is very false for real functions (eg. sin x). This gives a simple proof of the fundamental theorem of algebra – if the polynomial p has no roots, then 1/p is well-defined on all of C and bounded, hence constant; so p is constant, which is absurd for a non-constant polynomial.
Also, if a complex function is once differentiable, then it is infinitely differentiable,
again false for the real case. In particular, every complex differentiable function has a
Taylor series and is equal to it.
Another result is that the uniform limit of complex differentiable functions is again complex differentiable. Contrast this with the huge list of weird conditions we needed for real analysis!
Not only is differentiation nice. It turns out integration is also easier in complex
analysis. We can exploit this fact to perform real integrals by pretending they are
complex integrals.

10.1 Basic notions


10.1.1 Complex differentiability
D. 10-1
? An open ball in C centred at x with radius ε is written Bε (x) = B(x; ε) = D(x; ε).
Also for a function u : R2 → R we will write the partial derivative with respect to the x variable as ∂u/∂x = ux .
• A domain is a non-empty open path-connected1 subset of C.
• Let U ⊆ C be a domain and f : U → C be a function. We say f is (complex)
differentiable at w ∈ U if
f (z) − f (w)
lim
z→w z−w
exists. In which case we write the limit as f 0 (w), and call it the derivative of f at
w.
• A function f is called holomorphic or analytic2 at w ∈ U if f is differentiable on an open neighbourhood of w.
1
or just connected since an open subset of the plane is connected if and only if it is path-connected.
2
Strictly speaking analytic means that the function has and is given by its Taylor series every-
where locally. But we will see later that this is in fact equivalent to being holomorphic in complex
analysis for complex differentiable functions.


• If f : C → C is defined on all of C and is holomorphic on C, then f is said to be


entire .
E. 10-2
If the domain of a function is not path-connected, then we do not have results
such as functions with zero derivative must be constant. Hence, we would require
our subset to be connected.
C. 10-3
<The extended complex plane> Sometimes we want to consider the point ∞
as well. The extended complex plane is C∞ = C ∪ {∞}. We can reach the “point
at infinity” by going off in any direction in the plane, and all are equivalent. In
particular, there is no concept of −∞. All infinities are the same.3 Operations
with ∞ are done in the obvious way.
(Diagram: the Riemann sphere, with North Pole N, South Pole S resting at z = 0, and a point z ∈ C projecting to a point P on the sphere.)
Conceptually, we can visualize this using the Riemann sphere , which is a sphere
resting on the complex plane with its “South Pole” S at z = 0. For any point z ∈ C,
drawing a line through the “North Pole” N of the sphere to z, and noting where
this intersects the sphere. This specifies an equivalent point P on the sphere. Then
∞ is equivalent to the North Pole of the sphere itself. So the extended complex
plane is mapped bijectively to the sphere.
This is a useful way to visualize things, but is not as useful when we actually want
to do computations. To investigate properties of ∞, we use the substitution ζ = 1/z, ie. a Möbius map to take the point ∞ to 0. A function f (z) is said to have a particular property at ∞ if f (1/ζ) has that same property at ζ = 0.

P. 10-4
Let f be defined on an open set U ⊆ C. Write f (x + iy) = u(x, y) + iv(x, y), where
u, v are functions R2 → R. Then f is complex differentiable at w = c + id ∈ U
if and only if u, v are differentiable at (c, d) and the Cauchy-Riemann equations
ux = vy and uy = −vx holds at (c, d). Moreover, when these holds we have
f 0 (w) = ux (c, d) + ivx (c, d) = vy (c, d) − iuy (c, d).

f is differentiable at w with f 0 (w) = p + iq if and only if

    lim_{z→w} (f (z) − f (w) − (p + iq)(z − w))/|z − w| = 0.    (†)

Write z = x + iy, then

(p + iq)(z − w) = p(x − c) − q(y − d) + i(q(x − c) + p(y − d)).


3
Sometimes, we do write down things like −∞. This does not refer to a different point. Instead,
this indicates a limiting process. We mean we are approaching this infinity from the direction of
the negative real axis. However, we still end up in the same place.
So, breaking into real and imaginary parts, we know (†) holds if and only if
    lim_{(x,y)→(c,d)} (u(x, y) − u(c, d) − (p(x − c) − q(y − d)))/√((x − c)^2 + (y − d)^2 ) = 0
    and    lim_{(x,y)→(c,d)} (v(x, y) − v(c, d) − (q(x − c) + p(y − d)))/√((x − c)^2 + (y − d)^2 ) = 0.
Comparing this to the definition of the differentiability of a real-valued function, we
see this holds exactly if u and v are differentiable at (c, d) with Du(c, d) = (p, −q)
and Dv(c, d) = (q, p).
Note that if we just have ux = vy and uy = −vx at (c, d) ∈ U , we cannot conclude
that f is complex differentiable at (c, d). These conditions only say the partial
derivatives exist, but this does not imply that u and v are differentiable,
as required by the proposition. However, if the partial derivatives exist and are
continuous, then by Analysis II we know they are differentiable.
E. 10-5
• The usual rules of differentiation (sum rule, product, rule, chain rule, derivative of
inverse) all hold for complex differentiable functions, with the same proof as the
real case.
• A polynomial p : C → C is entire. This can be checked directly from definition, or
using the product rule.
• A rational function p(z)/q(z) : U → C, where U ⊆ C \ {z : q(z) = 0}, is holomor-
phic on U . Here p, q are polynomials.
• f (z) = |z| is not complex differentiable
p at any point of C. Indeed, we can write
this as f = u + iv, where u(x, y) = x2 + y 2 and v(x, y) = 0. If (x, y) 6= (0, 0),
then
    ux = x/√(x^2 + y^2 ),    uy = y/√(x^2 + y^2 ).
If we are not at the origin, then clearly both cannot vanish, but the partials of
v both vanish. Hence the Cauchy-Riemann equations do not hold and it is not
differentiable outside of the origin. At the origin, we can compute directly that
    (f (h) − f (0))/h = |h|/h.
This is, say, +1 for h ∈ R+ and −1 for h ∈ R− . So the limit as h → 0 does not
exist.
• For f (z) = |z|2 = x2 + y 2 , the Cauchy-Riemann equations are satisfied only at the
origin. So f is only differentiable at z = 0. However, it is not analytic since there
is no neighbourhood of 0 throughout which f is differentiable.
• Let f (z) = Re z. This has u = x, v = 0. But f is nowhere analytic as
    ∂u/∂x = 1 6= 0 = ∂v/∂y.

• The complex conjugate f (z) = z̄ = z ∗ = x − iy has u = x, v = −y. So the


Cauchy-Riemann equations don’t hold. Hence this is nowhere analytic. We could
have deduced that |z| are not analytic from this — if it is, then so would |z|2 , and
hence z̄ = |z|2 /z also has to be analytic, which is not true.
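A quick numerical illustration of the examples above (not in the notes): approximate the difference quotient along two different directions. For a complex-differentiable function like z 7→ z^2 the answers agree; for z 7→ z̄ they do not.

    w, h = 1.0 + 2.0j, 1e-6

    for name, f in [("z^2", lambda z: z * z), ("conj(z)", lambda z: z.conjugate())]:
        along_real = (f(w + h) - f(w)) / h              # h along the real axis
        along_imag = (f(w + 1j * h) - f(w)) / (1j * h)  # h along the imaginary axis
        print(name, along_real, along_imag)
    # z^2     : both approximately 2w = 2 + 4j
    # conj(z) : 1 along the real direction, -1 along the imaginary direction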
10.1.2 Revision of power series


Recall that given any constants {cn }n≥0 ⊆ C, there is a unique R ∈ [0, ∞] such that the series z 7→ ∑_{n=0}^∞ cn (z − a)^n converges absolutely if |z − a| < R and diverges if |z − a| > R. This R is known as the radius of convergence. Recall also that R can be given by the formula R = (lim sup_n |cn |^{1/n} )^{−1} .
T. 10-6
Let f (z) = ∑_{n=0}^∞ cn (z − a)^n be a power series with radius of convergence R > 0. Then
1. The series converges uniformly on any compact subset of BR (a).
2. f is holomorphic on B(a; R) = {z : |z − a| < R} with f 0 (z) = ∑_{n=1}^∞ ncn (z − a)^{n−1} , which also has radius of convergence R.
3. f is infinitely complex differentiable on B(a; R). Furthermore, cn = f (n) (a)/n!.

1. Any compact subset of BR (a) is contained within some B̄r (a) for some r < R.
So it suffice to show that the series converge uniformly on B̄r (a) for any r < R.
The proof is same as that in [T.5-14].
2. Without loss of generality, take a = 0. We first prove that the derivative
series has radius of convergence R, so that we can happily manipulate it. Certainly, we have |ncn | ≥ |cn |. So by comparison to the series for f , we can see that the radius of convergence of ∑ ncn z^{n−1} is at most R. Given |z| < R, pick |z| < ρ < R, then we can see

    |ncn z^{n−1} | / |cn ρ^{n−1} | = n |z/ρ|^{n−1} → 0

as n → ∞. So by comparison to ∑ |cn |ρ^{n−1} , which converges, ∑ ncn z^{n−1} converges absolutely. So the radius of convergence must be exactly R.
Now we want to show f really is differentiable with that derivative. Pick z, w
such that |z|, |w| ≤ ρ for some ρ < R as before. Define a new function

    ϕ(z, w) = ∑_{n=1}^∞ cn ∑_{j=0}^{n−1} z^j w^{n−1−j} .

Now noting

    | cn ∑_{j=0}^{n−1} z^j w^{n−1−j} | ≤ n |cn | ρ^{n−1} ,

we know the series defining ϕ converges uniformly on {|z|, |w| ≤ ρ} by the


Weierstrass M-test, and hence to a continuous limit by the uniform limit the-
orem. If z 6= w, then using the formula for the (finite) geometric series, we
know

    ϕ(z, w) = ∑_{n=1}^∞ cn (z^n − w^n )/(z − w) = (f (z) − f (w))/(z − w).

On the other hand, if z = w, then ϕ(z, z) = ∑_{n=1}^∞ cn n z^{n−1} . Since ϕ is continuous, we know

    lim_{w→z} (f (z) − f (w))/(z − w) = ∑_{n=1}^∞ cn n z^{n−1} .
So f 0 (z) = ϕ(z, z) as claimed.


3. Follows from the second part.
E. 10-7
The exponential function exp(z) := ∑_{n=0}^∞ z^n /n! has infinite radius of convergence,
hence it is entire. The exponential function is always non-zero. To see this consider
F (z) = exp(w + z) exp(−z), by the the product rule, F 0 (z) = 0, so F is a constant
function. Since F (0) = exp(w), we have exp(w + z) exp(−z) = F (z) = exp(w).
Setting w = 0 we see that exp(z) exp(−z) = 1, hence exp(z) 6= 0 for any z. Furthermore we see that exp(w + z) = exp(w) exp(z) for all w, z since exp(z) exp(−z) = 1.
The exponential function is in fact surjective onto C \ {0} as can be seen from
ex+iy = ex eiy .
P. 10-8
If a power series f (z) = ∑_{n≥0} cn (z − a)^n with radius of convergence R > 0 vanishes on some B(a, ε) where 0 < ε < R, then f vanishes identically.

If f vanishes on B(a, ε), then all its derivatives at a vanish, and hence the coeffi-
cients all vanish as cn = f (n) (a)/n!. So it is identically zero.

10.1.3 Revision of Möbius map


We know from general theory that the Möbius map

    z 7→ w = (az + b)/(cz + d)    with ad − bc 6= 0

is holomorphic except at z = −d/c. It is useful to consider it as a map from C∗ → C∗ = C ∪ {∞}, with −d/c 7→ ∞ and ∞ 7→ a/c. It is then a bijective map between C∗ and itself, with the inverse being

    w 7→ (−dw + b)/(cw − a),

another Möbius map. These are all holomorphic everywhere when considered as a map C∗ → C∗ .
D. 10-9
A circline is either a circle or a line.
P. 10-10
Möbius maps take circlines to circlines.

Any circline can be expressed as a circle of Apollonius |z − z1 | = λ|z − z2 | where


z1 , z2 ∈ C and λ ∈ R+ . This was proved in part IA. The case λ = 1 corresponds
to a line, while λ 6= 1 corresponds to a circle. Substituting z in terms of w, we get

−dw + b −dw + b
cw − a − z1 = λ cw − a − z2 .

Rearranging this gives


|(cz1 + d)w − (az1 + b)| = λ|(cz2 + d)w − (az2 + b)|. (∗)
A bit more rearranging gives

w − az1 + b = λ cz2 + d w − az2 + b .

cz1 + d cz1 + d cz2 + d
This is another circle of Apollonius. Note that the proof fails if either cz1 + d = 0
or cz2 + d = 0, but then (∗) trivially represents a circle.

Geometrically, it is clear that choosing three distinct points in C∗ uniquely specifies


a circline. If one of the points is ∞, then we have specified the straight line through
the other two points.

P. 10-11
Given six points α, β, γ, α0 , β 0 , γ 0 ∈ C∗ , we can find a Möbius map which sends
α 7→ α0 , β 7→ β 0 and γ → γ 0 .

Define the Möbius map

    f1 (z) = ((β − γ)/(β − α)) · ((z − α)/(z − γ)).

By direct inspection, this sends α → 0, β → 1 and γ → ∞. Again, we let

    f2 (z) = ((β 0 − γ 0 )/(β 0 − α0 )) · ((z − α0 )/(z − γ 0 )).

This clearly sends α0 → 0, β 0 → 1 and γ 0 → ∞. Then f2−1 ◦ f1 is the required mapping. It is a Möbius map since Möbius maps form a group.

Therefore, we can find a Möbius map taking any given circline to any other, which is convenient.
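A minimal sketch (not in the notes) of the construction in the proof, for points that are all finite: represent each Möbius map by its coefficient matrix (a, b, c, d), use matrix multiplication for composition, and check that f2−1 ◦ f1 sends α, β, γ to α0, β 0, γ 0. The helper names here are illustrative, not from the notes.

    def mobius_through(alpha, beta, gamma):
        """Coefficients (a, b, c, d) of z |-> ((beta-gamma)/(beta-alpha)) * (z-alpha)/(z-gamma),
        which sends alpha, beta, gamma to 0, 1, infinity."""
        k = (beta - gamma) / (beta - alpha)
        return (k, -k * alpha, 1, -gamma)

    def compose(M, N):                       # matrix product = composition of maps
        a, b, c, d = M; e, f, g, h = N
        return (a*e + b*g, a*f + b*h, c*e + d*g, c*f + d*h)

    def inverse(M):
        a, b, c, d = M
        return (d, -b, -c, a)

    def apply(M, z):
        a, b, c, d = M
        return (a * z + b) / (c * z + d)

    f1 = mobius_through(0, 1j, 2)            # alpha, beta, gamma
    f2 = mobius_through(1, 2, 3)             # alpha', beta', gamma'
    f = compose(inverse(f2), f1)             # f2^{-1} o f1

    for z, target in [(0, 1), (1j, 2), (2, 3)]:
        print(abs(apply(f, z) - target) < 1e-12)   # True, True, True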

10.1.4 Branches and branch points

D. 10-12
• We say that a point p ∈ C is a branch point of a multivalued function if the
function cannot be given a continuous single-valued definition in a (punctured)
neighbourhood B(p, ε) \ {p} of p for any ε > 0. The function is said to have a
branch point singularity there.

• Let U ⊆ C∗ be an open subset. A branch of the logarithm on U is a continuous


function λ : U → C for which eλ(z) = z for all z ∈ U .

The principal branch of logarithm is λ : C\R≤0 → C given by λ(riθ ) = log(r)+iθ


where we restrict −π < θ < π.

E. 10-13
When we are attempt to define an inverse for an non-injective function, we have
many choice of sending each point to. So we end up with a multi-valued function.
For example for the exponential function ez , we want to define the inverse log z =
log r + iθ where r = |z| and θ = arg(z). There are infinitely many possible values
of log z, for every choice of θ (differing by a multiple of 2π). However when we
write down an expression we often want them to be singled-valued, well-defined,
and continuous.
Consider the three curves C1 , C2 , C3 shown in the diagram (C1 and C2 do not encircle the origin, while C3 does). On C1 , we could choose θ to lie always in the range (0, π/2), and then log z would be continuous and single-valued going round C1 . On C2 , we could choose θ ∈ (π/2, 3π/2) and log z would again be continuous and single-valued. However, this doesn’t work for C3 : since it encircles the origin, there is no such choice. Whatever
we do, log z cannot be made continuous and single-valued around C3 . It must
either “jump” somewhere, or the value has to increase by 2πi every time we go
round the circle, ie. the function is multi-valued. This is true for any curves going
around the origin, so the origin in this case is a branch point.
1. log(z − a) has a branch point at z = a.
2. log((z − 1)/(z + 1)) = log(z − 1) − log(z + 1) has two branch points, at ±1.

3. z α = rα eiαθ has a branch point at the origin as well for α 6∈ Z — consider a


circle of radius of r0 centered at 0, and wlog that we start at θ = 0 and go once
round anticlockwise. Just as before, θ must vary continuous to ensure conti-
nuity of eiαθ . So as we get back almost to where we started, θ will approach
2π, and there will be a jump in θ from 2π back to 0. So there will be a jump
in z α from r0α e2πiα to r0α . So z α is not continuous if e2πiα 6= 1, ie. α is not an
integer.
4. log z also has a branch point at ∞. Recall that to investigate the properties of a function f (z) at infinity, we investigate the properties of f (1/ζ) at zero. If ζ = 1/z, then log z = − log ζ, which has a branch point at ζ = 0. Similarly, z α has a branch point at ∞ for α 6∈ Z.
5. The function log((z − 1)/(z + 1)) does not have a branch point at infinity, since if ζ = 1/z, then

       log((z − 1)/(z + 1)) = log((1 − ζ)/(1 + ζ)).

   For ζ close to zero, (1 − ζ)/(1 + ζ) remains close to 1, and therefore well away from the branch point of log at the origin. So we can encircle ζ = 0 without log((1 − ζ)/(1 + ζ)) being discontinuous.
If we wish to make log z continuous and single valued, therefore, we must stop any curve from encircling the origin. We do this by introducing a branch cut from −∞ on the real axis to the origin. No curve is allowed to cross this cut. Once we’ve decided where our branch cut is, we can use it to fix on values of θ lying in the range (−π, π), and we have defined a branch of log z. Note that a branch cut
is the squiggly line, while a branch is a particular choice of the value of log z.
In general, we define a branch of the logarithm. This is a partially defined inverse
λ to the exponential function, only defined on some domain U . These need not
exist for all U . For example, there is no branch of the logarithm defined on the
whole C∗ , as we will later prove. The branch defined above has U = C\R≤0 , a “slit plane”. Then for each z ∈ U , λ(z) = log(r) + iθ where z = re^{iθ} with −π < θ < π.
This is a branch of the logarithm called the principal branch. The resulting value
of λ(z) is called the principal value of the logarithm. On U , there is a continuous
function arg : U → (−π, π), which is why we can construct a branch. This is
not true on, say, the unit circle. For practical (and applied) purposes, instead of restricting θ ∈ (−π, π) (i.e. on U ) we might also give the function a value at θ = π, so that θ ∈ (−π, π] and we have log on C∗ . However we should still imagine there being an imaginary barrier at the negative real axis which we can’t go through and where we have a discontinuity of 2πi. The branch is then single-valued and continuous on any curve C that does not cross the cut.
We have picked an arbitrary branch cut and branch. We can pick other branch
cuts or branches. Even with the same branch cut, we can still have a different
branch — we can instead require θ to fall in (π, 3π]. Of course, we can also pick
other branch cuts, eg. the non-negative imaginary axis. Any cut that stops curves
wrapping around the branch point will do:

On the first diagram we can choose θ ∈ (−3π/2, π/2). The exact choice of θ in the
second diagram is more difficult to write down, but this is an equally valid cut,
since it stops curves from encircling the origin. Exactly the same considerations
(and possible branch cuts) apply for z α (for α 6∈ Z).
In practice, whenever a problem requires the use of a branch, it is important to
specify it clearly. This can be done in two ways:
1. Define the function and parameter range explicitly, eg.
log z = log |z| + i arg z, arg z ∈ (−π, π].

2. Specify the location of the branch cut and give the value of the required branch
at a single point not on the cut. The values everywhere else are then defined
uniquely by continuity. For example, we have log z with a branch cut along
R≤0 and log 1 = 0. Of course, we could have defined log 1 = 2πi as well, and
this would correspond to picking arg z ∈ (π, 3π].
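A short illustration (not in the notes) of the principal branch: Python's cmath.log returns log|z| + i arg(z) with arg(z) ∈ (−π, π], so its value jumps by 2πi across the negative real axis, which is exactly the branch cut chosen above.

    import cmath

    just_above = cmath.log(complex(-1.0, 1e-12))
    just_below = cmath.log(complex(-1.0, -1e-12))
    print(just_above)                                  # approximately 0 + i*pi
    print(just_below)                                  # approximately 0 - i*pi
    print((just_above - just_below).imag / cmath.pi)   # approximately 2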
E. 10-14
<Powers> Having defined the logarithm, we define general power functions.
Let α ∈ C and log : U → C be a branch of the logarithm. Then we can define
z α = eα log z on U . This is again only defined when log is.
Consider the function f (z) = √(z(z − 1)). This has two branch points, z = 0
and z = 1, since we cannot define a square root consistently near 0, as it is
defined via the logarithm. Note we can define a continuous branch of f on either
C \ ((−∞, 0) ∪ (1, ∞)) or C \ (0, 1). Why is the second case possible? Note that
f (z) = e^{(1/2)(log(z)+log(z−1))} .
If we move around a path encircling the finite slit (0, 1), the argument of each of
log(z) and log(z − 1) will jump by 2πi, and the total change in the exponent is
2πi. So the expression for f (z) becomes uniquely defined. While these two ways of
cutting slits look rather different, if we consider this to be on the Riemann sphere,
then these two cuts now look similar. It’s just that one passes through the point
∞, and the other doesn’t.
E. 10-15
<Riemann surfaces> The introduction of these
slits/cuts is practical and helpful for many of our
problems. However, theoretically, this is not the
best way to think about multi-valued functions.
Instead of this brutal way of introducing a cut
and forbidding crossing, Riemann imagined different branches as separate copies of C, all stacked on top of each other but each one joined to the next at the branch cut. This structure is a Riemann surface.

The idea is that traditionally, we are not allowed to cross branch cuts. Here, when
we cross a branch cut, we will move to a different copy of C, and this corresponds
to a different branch of our function. We will not investigate this further here —
the Part II Riemann Surfaces course will study this in detail.

L. 10-16
Let D be a domain. Suppose f : D → C is holomorphic with 0 derivative every-
where, then f is constant on D.

Write f (x + iy) = u(x, y) + iv(x, y), where u, v are functions R2 → R. By [P.10-4], u and v are differentiable with 0 derivative everywhere, so u and v are constant by [T.5-79], hence so is f .

In particular we see that if two holomorphic functions f, g are such that f 0 = g 0 everywhere and f, g agree at a point, then we must have f = g.

L. 10-17
If f is holomorphic at w ∈ C and f 0 (w) 6= 0, then f is locally invertible at w and
its local inverse g is holomorphic at f (w) with g 0 (f (w)) = 1/f 0 (w).

We write f = u + iv, then viewed as a map R2 → R2 , the Jacobian matrix is given by

    Df = ( ux  uy ; vx  vy ).

Then det(Df ) = ux vy − uy vx = ux^2 + uy^2 . Using the formula for the complex derivative in terms of the partials, this shows that if f 0 (w) 6= 0, then det(Df |w ) 6=
0. Hence, by the inverse function theorem (viewing f as a function R2 → R2 ), f
is locally invertible at w (technically, we need f to be continuously differentiable,
instead of just differentiable, but we will later show that f in fact must be infinitely
differentiable and hence have continuous derivatives). Moreover, by the same
proof as in real analysis, the local inverse is holomorphic. More precisely, say
f |U : U → f (U ) (with w ∈ U and U open) has local inverse g : f (U ) → U . Wlog
by continuity f 0 (z) 6= 0 on U . Fix any z ∈ f (U ) and let k = g(z + h) − g(z), then
f (k + g(z)) − f (g(z)) = h. So

    (g(z + h) − g(z))/h = k/(f (g(z) + k) − f (g(z))) → 1/f 0 (g(z))    as h → 0,

since k → 0 as h → 0 by continuity.
P. 10-18
1. A branch of logarithm λ : U → C is holomorphic with λ0 (z) = 1/z.
2. Let log : U → C (where U = {z ∈ C : z 6∈ R≤0 }) be the principal branch of logarithm. Then for |z| < 1,

       log(1 + z) = ∑_{n≥1} (−1)^{n−1} z^n /n = z − z^2 /2 + z^3 /3 − · · · .

1. The derivative of the exponential function is everywhere non-zero, hence for


all z ∈ U we have exp0 (λ(z)) = exp(λ(z)) 6= 0. By the above lemma, λ is
holomorphic at z with λ0 (z) = 1/ exp0 (λ(z)) = 1/z.
2. To show that log(1 + z) is indeed given by the said power series, note that the
power series does have a radius of convergence 1 by, say, the ratio test. So it
has derivative

    1 − z + z^2 − · · · = 1/(1 + z).
Therefore, log(1 + z) and the claimed power series have equal derivative, and
hence coincide up to a constant. Since they agree at z = 0, they must in fact
be equal.

10.2 Conformal maps


D. 10-19
• Let f : U → C be a function holomorphic at w ∈ U . If f 0 (w) 6= 0, we say f is
conformal at w.4
• If U and V are open subsets of C and f : U → V is a conformal bijection, then it
is a conformal equivalence .
? We write C∗ = C \ {0} and H = {z ∈ C : Im(z) > 0} is the upper half plane.
E. 10-20
• As we seen previously, f holomorphic at w with f 0 (w) 6= 0 tell us that f is locally
invertible, and its local inverse is conformal at f (w). Note however a global inverse
might not exist, for example f (z) = ez is conformal on C but not injective, hence
global inverse does not exist.
• But being conformal is more than just being locally invertible. An important
property of conformal mappings is that they preserve angles. To give a precise
statement of this, we need to specify how “angles” work.
The idea is to look at tangent vectors of paths. Let γ1 , γ2 : [−1, 1] → U be
continuously differentiable paths that intersect when t = 0 at w = γ1 (0) = γ2 (0).
Moreover, assume γi0 (0) 6= 0. Then we can compare the angles between the paths
by looking at the difference in arguments of the tangents at w. In particular, we
define the intersection angle

angle(γ1 , γ2 ) = arg(γ10 (0)) − arg(γ20 (0)).


4
An alternative definition is that a conformal map is one that preserves the angle (in both
magnitude and orientation) between intersecting curves.
Let f : U → C and w ∈ U . Suppose f is conformal at w. Then f maps our two


paths to f ◦ γi : [−1, 1] → C. These two paths now intersect at f (w). Then the
angle between them is

angle(f ◦ γ1 , f ◦ γ2 ) = arg((f ◦ γ1 )0 (0)) − arg((f ◦ γ2 )0 (0)) = arg((f ◦ γ1 )0 (0)/(f ◦ γ2 )0 (0)) = arg(γ10 (0)/γ20 (0)) = angle(γ1 , γ2 ),

using the chain rule and the fact that f 0 (w) 6= 0. So angles are preserved.
E. 10-21
<Examples of conformal maps/equivalence>
• Any Möbius map A(z) = (az + b)/(cz + d) (with ad − bc 6= 0) defines a conformal equivalence C ∪ {∞} → C ∪ {∞} in the obvious sense. A0 (z) 6= 0 follows from the chain rule and the invertibility of A(z). In particular, the Möbius group of the disk D,

    Möb(D) = {f ∈ Möbius group : f (D) = D} = { λ (z − a)/(āz − 1) ∈ Möb : |a| < 1, |λ| = 1 },

is a group of conformal equivalences of the disk. One can prove that the Möbius
group of the disk is indeed of this form, and that in fact these are all conformal
equivalences of the disk.
• The map z 7→ z n for n ≥ 2 is holomorphic everywhere and conformal except at
z = 0. This gives a conformal equivalence
n πo
z ∈ C∗ : 0 < arg(z) < ↔ H,
n

• Suppose we want to conformally map the left-hand half-plane U = {z : Re z < 0}


to a wedge V = {w : − π4 < arg w ≤ π4 }.

U
V

We need to halve the angle. We saw that z 7→ z 2 doubles the angle, so we might
try z 1/2 , for which we need to choose a branch (of log). The branch cut must not
lie in U , since z 1/2 is not analytic on the branch cut. In particular, the principal
branch does not work. So we choose √ a cut along the negative imaginary axis, and
the function is defined by reiθ 7→ reiθ/2 , where θ ∈ (− π2 , 3π2
]. This produces the
wedge {z 0 : π4 < arg z 0 < 3π4
}. This isn’t exactly the wedge we want. So we need
to rotate it through − π2 . So the final map is f (z) = −iz 1/2 .
• The exponential function

    ez = 1 + z + z^2 /2! + z^3 /3! + · · ·
defines a function C → C∗ . In fact it is a conformal mapping. ez takes rectangles
conformally to sectors of annuli:

(Diagram: the rectangle [x1 , x2 ] × [y1 , y2 ] maps under ez to the sector of an annulus between radii ex1 and ex2 .)

With an appropriate choice of branch, log z does the reverse. In particular the
map sends the region {z : Re(z) ∈ [a, b]} to the annulus {ea ≤ |w| ≤ eb }. One
is simply connected, but the other is not — this is not a problem since ez is not
bijective on the strip (and hence not a conformal equivalence).


Note that the above strip to annulus conformal map cannot be achieved by a Möbius
map since in the strip both lines (the boundary) pass through the point ∞ while
on the annulus the two boundaries do not intersect.
• Note that z ∈ H if and only if z is closer to i than to −i. In other words |z − i| <
|z + i|, equivalently |(z − i)/(z + i)| < 1. So z 7→ (z − i)/(z + i) defines a conformal equivalence H → D,
the unit disk. We know this is conformal since it is a special case of the Möbius
map. Another way to see that this map maps H → D is to note that the maps
sends −1, 0, 1 to i, −1, −i. Then since Möbius map maps circle/line to circle/line
we see that the real line is mapped to the unit circle, ie. ∂H 7→ ∂D. Moreover,
the map sends i ∈ H to 0 ∈ D, hence by continuity we must have H → D.
Similarly the map f (z) = (z − 1)/(z + 1) maps D to the left-hand half plane. In fact these maps can be deployed more generally on quadrants; in particular f (z) = (z − 1)/(z + 1) permutes the 8 regions of the complex plane shown in the accompanying diagram as follows: it sends 1 7→ 2 7→ 3 7→ 4 7→ 1 and 5 7→ 6 7→ 7 7→ 8 7→ 5. In particular, this agrees with what we had above — it sends the unit disk to the left-hand half plane.
complete circle to the left hand half plane. 8 5

E. 10-22
Consider the map

    z 7→ w = (1/2)(z + 1/z)    assuming z ∈ C∗ = C \ {0}.

This can also be written as

    (w + 1)/(w − 1) = 1 + 2/(w − 1) = 1 + 4/(z + 1/z − 2) = 1 + 4z/(z^2 − 2z + 1) = ((z + 1)/(z − 1))^2 .
So this is just squaring in some funny coordinates given by (z + 1)/(z − 1). This map is holomorphic (except at z = 0). Also, we have

    f 0 (z) = 1 − (z^2 + 1)/(2z^2 ) ... wait, more precisely f 0 (z) = (1/2)(1 − 1/z^2 ) = (z^2 − 1)/(2z^2 ).

So f is conformal except at ±1. Recall Möbius maps map lines and circles to lines and circles. This does something different. We write z = re^{iθ} . Then if we write z 7→ w = u + iv, we have

    u = (1/2)(r + 1/r) cos θ,    v = (1/2)(r − 1/r) sin θ.

Fixing the radius, we see that a circle of radius ρ is mapped to the ellipse

    u^2 /((1/4)(ρ + 1/ρ)^2 ) + v^2 /((1/4)(ρ − 1/ρ)^2 ) = 1,

Fixing the argument, we see that the half-line arg(z) = µ is mapped to the hyperbola

    u^2 / cos^2 µ − v^2 / sin^2 µ = 1.
We can do something more interesting. Consider an off-centred circle, chosen to pass through the points −1 and −i. (Diagram: the circle and its image under f , an aerofoil-like shape with a sharp point at f (−1) = −1 and marked point f (−i).)

Note that we have a singularity at f (−1) = −1. This is exactly the point where
f is not conformal, and is no longer required to preserve angles. This is a crude
model of an aerofoil, and the transformation is known as the Joukowsky transform.
In applied mathematics, this is used to model fluid flow over a wing in terms of
the analytically simpler flow across a circular section.
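A numerical sketch (not in the notes) of the Joukowsky map above: points on a circle |z| = ρ centred at the origin land on the stated ellipse.

    import cmath, math

    def f(z):
        return 0.5 * (z + 1 / z)

    rho = 2.0
    a = 0.5 * (rho + 1 / rho)        # semi-axis along u
    b = 0.5 * (rho - 1 / rho)        # semi-axis along v

    ok = True
    for k in range(100):
        theta = 2 * math.pi * k / 100
        w = f(rho * cmath.exp(1j * theta))
        ok = ok and abs((w.real / a) ** 2 + (w.imag / b) ** 2 - 1) < 1e-9
    print(ok)                        # True: the image satisfies the ellipse equation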
E. 10-23
Often, there is no simple way to describe regions in space. However, if the region
is bounded by circular arcs, there is a trick that can be useful.
Suppose we have a circular arc between α and β. Along this arc, µ = θ − φ = arg(z − α) − arg(z − β) is constant, by elementary geometry. Thus, for each fixed µ, the equation arg(z − α) − arg(z − β) = µ determines an arc through the points α, β.

To obtain a region bounded by two arcs, we find the two values µ− and µ+ that describe the boundary arcs. Then a point lies between the two arcs if and only if its µ is in between µ− and µ+ , ie. the region is

    { z : arg((z − α)/(z − β)) ∈ [µ− , µ+ ] }.
This says the point has to lie in some arc between those given by µ− and µ+ . For example, the region bounded below by the segment of the real axis from −1 to 1 and above by the circular arc through −1, i, 1 is

    U = { z : arg((z − 1)/(z + 1)) ∈ [π/2, π] }.

Thus for instance the map

    z 7→ −((z − 1)/(z + 1))^2
is a conformal equivalence from U to H. This is since if z ∈ U , then (z − 1)/(z + 1) has argument in [π/2, π], and can have arbitrary magnitude since z can be made as close to −1 as you wish.
close to −1 as you wish. Squaring doubles the angle and gives the lower half-
plane, and multiplying by −1 gives the upper half plane.

z−1
z 7→ z+1 z 7→ z 2 z 7→ −z

In practice, complicated conformal maps are usually built up from individual build-
ing blocks, each a simple conformal map like the above. This makes use of the fact
that composition of conformal maps is conformal, by the chain rule. We know H is conformally equivalent to D, hence U is conformally equivalent to D. In fact, there
is a really powerful theorem telling us most things are conformally equivalent to
the unit disk.
D. 10-24
• A simple closed curve is the image of a continuous injective map S 1 → C.
• Two functions u, v : R2 → R satisfying the Cauchy-Riemann equations are called
harmonic conjugates .
E. 10-25
• It should be clear (though not trivial to prove) that a simple closed curve separates
C into a bounded part and an unbounded part.
• If we know one of the harmonic conjugates, then we can find the other up to a
constant. For example, if u(x, y) = x2 − y 2 , then v must satisfy
    ∂v/∂y = ∂u/∂x = 2x.
So we must have v = 2xy + g(x) for some function g(x). The other Cauchy-
Riemann equation gives

    −2y = ∂u/∂y = −∂v/∂x = −2y − g 0 (x).
This tells us g 0 (x) = 0. So g must be a genuine constant, say α. The corresponding
analytic function whose real part is u is therefore
f (z) = x2 − y 2 + 2ixy + iα = (x + iy)2 + iα = z 2 + iα.

• Recall that a domain U ⊆ C is simply connected if every continuous map from the
circle f : S 1 → U can be extended to a continuous map from the disk F : D2 → U
such that F |∂D2 = f . Alternatively, any loop can be continuously shrunk to a
point. For example, the unit disk is simply-connected, but the region defined by
1 < |z| < 2 is not, since the circle |z| = 1.5 cannot be extended to a map from a
disk.

T. 10-26
<Riemann mapping theorem> Let U ⊆ C be the bounded domain enclosed
by a simple closed curve, or more generally any simply connected domain not equal
to all of C. Then U is conformally equivalent to D = {z : |z| < 1} ⊆ C.

We will not prove this. This in particular tells us any two simply connected
domains are conformally equivalent. If we believe that the unit disk is relatively
simple, then since all simply connected regions are conformally equivalent to the
disk, all simply connected domains are boring. We will later encounter domains
with holes to make things interesting.

L. 10-27
A map is a conformal equivalence iff it is a bijective holomorphic map.

We will only prove the forward direction. If a map is conformal, then the inverse
mapping theorem tells us there is a local conformal inverse. And if the function is
also bijective, these patch together to give a global conformal inverse.

P. 10-28
The real and imaginary parts of any holomorphic function are harmonic.

We will later prove that if f : U → C is holomorphic on an open set U , then


f 0 : U → C is also holomorphic. Hence f is infinitely differentiable. In particular,
if we write f = u + iv, then using the formula for f 0 in terms of the partials,
we know u and v are also infinitely differentiable. Using the Cauchy-Riemann
equations, we get uxx = (ux )x = (vy )x = (vx )y = (−uy )y = −uyy . In other words,
uxx + uyy = 0. We get similar results for v too. Hence Re(f ) and Im(f ) satisfy
the Laplace equation and are hence harmonic (by definition).

In light of this result, conformal maps have another use: to solve Laplace’s equa-
tion. The idea is that if we are required to solve the 2D Laplace’s equation on a
funny domain U subject to some boundary conditions, we can try to find a con-
formal equivalence (bijective holomorphic map) f between U and some other nice
domain V . We can then solve Laplace’s equation on V subject to the boundary
conditions carried forward by f , which is hopefully easier. And then we bring the
solution back to U . More concretely, the following algorithm can be used to solve
Laplace’s Equation ∇2 φ(x, y) = 0 on a tricky domain U ⊆ R2 with given Dirich-
let boundary conditions on ∂U . We now pretend R2 is actually C, and identify
subsets of R2 with subsets of C in the obvious manner.

1. Find a conformal map f : U → V , where U is now considered a subset of C,


and V is a “nice” domain of our choice.

2. Map the boundary conditions on ∂U directly to the equivalent points on ∂V .

3. Now solve ∇2 Φ = 0 in V with the new boundary conditions.


4. The required harmonic function φ in U is then given by

φ(x, y) = Φ(Re(f (x + iy)), Im f (x + iy)).

To prove this works, we can take the ∇2 of this expression, write f = u + iv, use the Cauchy-Riemann equations, and expand out to see it gives 0. Alternatively, it can be shown (via the simply-connected version of Cauchy's theorem, which we will prove at the end of the chapter) that any harmonic function on a simply-connected domain has a harmonic conjugate, unique up to an additive constant. Then since Φ is harmonic, it is the real part of some holomorphic function F (z) = Φ(x, y) + iΨ(x, y), where z = x + iy. Now F (f (z)) is holomorphic, as it is a composition of holomorphic functions. So its real part Φ(Re f, Im f ) is harmonic.

E. 10-29
Find a bounded solution of ∇2 φ = 0 on the first quadrant of R2 subject to φ(x, 0) =
0 and φ(0, y) = 1 for all x, y > 0.

We choose f (z) = log z, which maps U to the strip V = {z : 0 < Im z < π/2}.

(Diagram: the first quadrant U, with boundary value 1 on the positive imaginary axis and 0 on the positive real axis, is mapped by z ↦ log z to the horizontal strip V between Im z = 0 and Im z = π/2, with the boundary values carried across.)

Recall that we said log maps (sections of) an annulus to a rectangle. This is indeed the case here — U is an annulus with zero inner radius and infinite outer radius; V is an infinitely long rectangle. Now we must solve ∇²Φ = 0 in V subject to Φ(x, 0) = 0 and Φ(x, π/2) = 1 for all x ∈ R. Note that we have these boundary conditions since f (z) takes the positive real axis of ∂U to the line Im z = 0, and the positive imaginary axis to the line Im z = π/2. By inspection, the solution is Φ(x, y) = (2/π)y. Hence,

    φ(x, y) = Φ(Re log z, Im log z) = (2/π) Im log z = (2/π) tan⁻¹(y/x).

Notice this is just 2θ/π, a rescaling of the argument θ of z.
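As a quick sanity check (an illustrative sketch assuming the sympy library, not part of the notes), one can verify symbolically that φ(x, y) = (2/π) tan⁻¹(y/x) is harmonic on the first quadrant and has the required boundary values:

    import sympy as sp

    x, y = sp.symbols('x y', positive=True)
    phi = (2/sp.pi) * sp.atan(y/x)

    # Laplace's equation holds
    print(sp.simplify(sp.diff(phi, x, 2) + sp.diff(phi, y, 2)))   # 0

    # boundary values: 0 on the positive real axis, 1 on the positive imaginary axis
    print(sp.limit(phi, y, 0, dir='+'))   # 0
    print(sp.limit(phi, x, 0, dir='+'))   # 1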

10.3 Contour integration


D. 10-30
We say a function f : [a, b] → C is Riemann integrable if Re(f ) and Im(f ) are
Riemann integrable individually, and the integral is defined to be
    ∫_a^b f(t) dt = ∫_a^b Re(f(t)) dt + i ∫_a^b Im(f(t)) dt.

L. 10-31
Suppose f : [a, b] → C is continuous (and hence integrable). Then

    |∫_a^b f(t) dt| ≤ (b − a) sup_t |f(t)|

with equality if and only if f is constant.


Write θ = arg(∫_a^b f(t) dt) and M = sup_t |f(t)|. Then we have

    |∫_a^b f(t) dt| = ∫_a^b e^{−iθ} f(t) dt = ∫_a^b Re(e^{−iθ} f(t)) dt ≤ (b − a)M,

with equality if and only if |f(t)| = M and arg f(t) = θ for all t, ie. f is constant.
D. 10-32
• A path (or curve) in C is a continuous function γ : [a, b] → C, where a, b ∈ R. A
path γ : [a, b] → C is
 simple if γ(t1 ) = γ(t2 ) implies t1 = t2 or {t1 , t2 } = {a, b}.
 closed if γ(a) = γ(b).
• A contour is a simple closed path which is piecewise C 1 , ie. piecewise continuously
differentiable.5
? Given a path γ : [a, b] → C, sometimes we write −γ to mean the path γ traversed
in the opposite direction, that is −γ : [a, b] → C is given by (−γ)(t) = γ(a + b − t).
Given two paths γ₁ : [a, b] → C and γ₂ : [α, β] → C with γ₁(b) = γ₂(α), we sometimes write γ₁ + γ₂ to denote the path formed by joining the two paths at γ₁(b) = γ₂(α). That is, γ₁ + γ₂ : [a, b + (β − α)] → C is given by

    (γ₁ + γ₂)(t) = γ₁(t) for t ∈ [a, b],   and   (γ₁ + γ₂)(t) = γ₂(t − b + α) for t ∈ [b, b + (β − α)].

? If we only specify the image of a contour but not its orientation, the convention is that when we integrate over it the direction of traversal is anticlockwise. This is also the direction that keeps the interior of the contour on the left. This direction of traversal is sometimes called positive .
• If γ : [a, b] → U ⊆ C is C¹-smooth and f : U → C is continuous, then we define the integral of f along γ as

    ∫_γ f(z) dz = ∫_a^b f(γ(t)) γ′(t) dt.

By summing over subdomains, the definition extends to piecewise C¹-smooth paths, and in particular contours.
• Let U ⊆ C and f : U → C be continuous. An antiderivative of f is a holomorphic
function F : U → C such that F 0 (z) = f (z).
5 The lecturer for the Methods course does not require a contour to be closed.

E. 10-33
• For general paths, we just require continuity, and do not impose any conditions about, say, differentiability. Unfortunately, the world is full of weird paths. There are even paths that fill up the whole of the unit square. So we might want to look at some nicer paths, called simple paths. These are paths that either do not intersect themselves, or only intersect themselves at the end points.
• For example, a contour can look something like the diagram on the
right. Most of the time, we are just interested in integration along
contours. However, it is also important to understand integration
along just simple C 1 smooth paths, since we might want to break
our contour up into different segments.
• Our definition of integration along paths has the following elementary properties:
1. The definition is insensitive to reparametrization. Let φ : [a′, b′] → [a, b] be C¹ such that φ(a′) = a, φ(b′) = b. If γ is a C¹ path and δ = γ ◦ φ, then

    ∫_γ f(z) dz = ∫_δ f(z) dz.

This is just the regular change of variables formula: let u = φ(t), then

    ∫_{a′}^{b′} f(γ(φ(t))) γ′(φ(t)) φ′(t) dt = ∫_a^b f(γ(u)) γ′(u) du.

2. If a < u < b, then

    ∫_γ f(z) dz = ∫_{γ|[a,u]} f(z) dz + ∫_{γ|[u,b]} f(z) dz.

These together tells us the integral depends only on the path itself, not how we
look at the path or how we cut up the path into pieces. We also have the following
easy properties:
3. If −γ is γ with reversed orientation, then ∫_{−γ} f(z) dz = −∫_γ f(z) dz.

4. If we set length(γ) = ∫_a^b |γ′(t)| dt for γ : [a, b] → C, then

    |∫_γ f(z) dz| ≤ length(γ) sup_t |f(γ(t))|.

The integral of f along a path γ : [a, b] → C is defined so that it is what we would expect it to be. More precisely, if we dissect [a, b] into a = t₀ < t₁ < · · · < t_N = b, and let zₙ = γ(tₙ) for n = 0, · · · , N, then

    ∫_γ f(z) dz = lim_{Δ→0} Σ_{n=0}^{N−1} f(zₙ) δzₙ,

where δtₙ = tₙ₊₁ − tₙ, δzₙ = zₙ₊₁ − zₙ and Δ = max_{n∈{0,··· ,N−1}} δtₙ.
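This Riemann-sum picture is easy to play with numerically. The sketch below (assuming the numpy library, which is not part of the notes) approximates ∫_γ f(z) dz by Σ f(zₙ) δzₙ along the unit circle, anticipating the example that follows:

    import numpy as np

    def contour_integral(f, gamma, a, b, N=4000):
        # dissect [a, b] and form the Riemann sum  sum_n f(z_n) (z_{n+1} - z_n)
        t = np.linspace(a, b, N + 1)
        z = gamma(t)
        return np.sum(f(z[:-1]) * np.diff(z))

    unit_circle = lambda t: np.exp(1j * t)
    print(contour_integral(lambda z: 1/z, unit_circle, 0, 2*np.pi))   # approx 2*pi*i
    print(contour_integral(lambda z: z**2, unit_circle, 0, 2*np.pi))  # approx 0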



E. 10-34
• Take U = C∗, and let f(z) = zⁿ for n ∈ Z. We pick φ : [0, 2π] → U that sends θ ↦ e^{iθ}. Then

    ∫_φ f(z) dz = 2πi if n = −1, and 0 otherwise.

To show this, we have

    ∫_φ f(z) dz = ∫_0^{2π} e^{inθ} i e^{iθ} dθ = i ∫_0^{2π} e^{i(n+1)θ} dθ.

If n = −1, then the integrand is constantly 1, and hence gives 2πi. Otherwise, the integrand is a non-trivial exponential which is made of trigonometric functions, and when integrated over 2π gives zero.
• Take γ to be the contour given in two parts by

    γ₁ : [−R, R] → C with t ↦ t,    γ₂ : [0, 1] → C with t ↦ Re^{iπt},

ie. the segment of the real axis from −R to R, followed by the upper semicircle of radius R back to −R. Consider the function f(z) = z². Then the integral is

    ∫_γ f(z) dz = ∫_{−R}^R t² dt + ∫_0^1 R²e^{2πit} · iπRe^{iπt} dt = (2/3)R³ + R³iπ ∫_0^1 e^{3πit} dt
                = (2/3)R³ + R³iπ [e^{3πit}/(3πi)]_0^1 = 0.

We worked this out explicitly, but we have just wasted our time, since this is just
an instance of the fundamental theorem of calculus!
T. 10-35
<Fundamental theorem of calculus> Let f : U → C be continuous with
antiderivative F . If γ : [a, b] → U is piecewise C 1 -smooth, then
    ∫_γ f(z) dz = F(γ(b)) − F(γ(a)).

We have

    ∫_γ f(z) dz = ∫_a^b f(γ(t)) γ′(t) dt = ∫_a^b (F ◦ γ)′(t) dt.

Then the result follows from the usual fundamental theorem of calculus, applied
to the real and imaginary parts separately.
In particular, the integral depends only on the end points, and not the path itself.
Moreover, if γ is closed, then the integral vanishes.
E. 10-36
This allows us to understand the first example we had. We had the function
f(z) = zⁿ integrated along the path φ(t) = e^{it} (for 0 ≤ t ≤ 2π). If n ≠ −1, then

    f(z) = d/dz ( z^{n+1}/(n+1) ).

So f has a well-defined antiderivative, and the integral vanishes. On the other hand, if n = −1, then

    f(z) = d/dz (log z),

where log can only be defined on a slit plane. It is not defined on the whole unit circle. So we cannot apply the fundamental theorem of calculus.
Reversing the argument around, since ∫_φ z^{−1} dz does not vanish, this implies there isn't a continuous branch of log on any set U containing the unit circle.
D. 10-37
• A star-shaped domain or star domain is a domain U such that there is some a₀ ∈ U such that the line segment [a₀, w] ⊆ U for all w ∈ U.

• A triangle in a domain U is what it ought to be — the Euclidean convex hull of


3 points in U , lying wholly in U . We write its boundary as ∂T , which we view as
an oriented piecewise C 1 path, ie. a contour.
E. 10-38
Star domain is a weaker notion than requiring U to be convex, which says any
line segment between any two points in U , lies in U . In general, we have the
implications

U is a disc ⇒ U is convex ⇒ U is star-shaped ⇒ U is path-connected,

and none of the implications reverse.


P. 10-39
1. Let U ⊆ C be a domain, and f : U → C be continuous. If ∫_γ f(z) dz = 0 for every closed piecewise C¹-smooth path γ in U, then f has an antiderivative.
2. Let U be a star domain, and f : U → C be continuous. If ∫_{∂T} f(z) dz = 0 for all triangles T ⊆ U, then f has an antiderivative on U.

(Version 1) Fix some a0 ∈ U . For w ∈ U , we choose a path γw : [0, 1] → U such


that γw (0) = a0 and γw (1) = w. We first show that we can pick γw such that
this is piecewise C 1 . We already know a continuous path γ : [0, 1] → U from a0
to w exists, by definition of path connectedness. Since U is open, for all x in the
image of γ, there is some ε(x) > 0 such that B(x, ε(x)) ⊆ U . Since the image of
γ is compact, it is covered by finitely many such balls. Then it is trivial to pick
a piecewise straight path living inside the union of these balls, which is clearly
piecewise smooth.
Now define F(w) = ∫_{γ_w} f(z) dz. This F(w) is independent of the choice of γ_w, by our hypothesis on f – given another choice γ̃_w, we can form the new path γ_w ∗ (−γ̃_w), namely the path obtained by concatenating γ_w with −γ̃_w. This is a closed piecewise C¹-smooth curve. So

    0 = ∫_{γ_w ∗ (−γ̃_w)} f(z) dz = ∫_{γ_w} f(z) dz + ∫_{−γ̃_w} f(z) dz = ∫_{γ_w} f(z) dz − ∫_{γ̃_w} f(z) dz.

So the two integrals agree. Now we need to check that F is complex differentiable. Since U is open, we can pick ε > 0 such that B(w; ε) ⊆ U. Let δ_h be the radial path in B(w, ε) from w to w + h, with |h| < ε. Now note that γ_w ∗ δ_h is a path from a₀ to w + h. So

    F(w + h) = ∫_{γ_w ∗ δ_h} f(z) dz = F(w) + ∫_{δ_h} f(z) dz
             = F(w) + h f(w) + ∫_{δ_h} (f(z) − f(w)) dz.

Thus, we know

    |(F(w + h) − F(w))/h − f(w)| ≤ (1/|h|) |∫_{δ_h} (f(z) − f(w)) dz|
        ≤ (1/|h|) length(δ_h) sup_{δ_h} |f(z) − f(w)| = sup_{δ_h} |f(z) − f(w)|.

Since f is continuous, as h → 0, we know f (z) − f (w) → 0. So F is differentiable


with derivative f . 1

(Version 2) Similar to 1, but this time we take γ_w = [a₀, w] ⊆ U for U that is star-shaped about a₀. In the proof, we need to construct a small straight line segment δ_h. There's no issue: by the openness of U, we can pick an open ball B(w, ε) ⊆ U, and we can certainly construct the straight line in this ball.
Finally, we get to the integration part. Suppose we picked all our γ_w to be the fixed straight line segment from a₀. Then for the antiderivative to be differentiable, we needed

    ∫_{γ_w ∗ δ_h} f(z) dz = ∫_{γ_{w+h}} f(z) dz.

In other words, we needed the integral along the path γ_w ∗ δ_h ∗ (−γ_{w+h}) to vanish, which is true by assumption (this closed path is the boundary of the triangle with vertices a₀, w and w + h). 2

This is more-or-less the same proof we gave in IA Vector Calculus that a real
function is a gradient if and only if the integral about any closed path vanishes.
L. 10-40
Let M be a metric space. Suppose Uₙ ⊆ M is compact for all n and is such that U₁ ⊇ U₂ ⊇ U₃ ⊇ · · · . Then ∩ₙ Uₙ is non-empty.

Pick any xₙ ∈ Uₙ. Since xₙ ∈ U₁ and U₁ is (sequentially) compact, there is a subsequence x_{n_i} that converges to some x. Suppose x ∉ ∩ₙ Uₙ; then x ∉ U_N for some N. Define g : U_N → R by g(y) = d(y, x). Then g is continuous; moreover, since U_N is compact, ∃y₀ ∈ U_N such that d(y₀, x) = inf_{y∈U_N} g(y) by the maximum value theorem. Since x ≠ y₀, d(y₀, x) > 0. But for n ≥ N we have xₙ ∈ Uₙ ⊆ U_N, so d(xₙ, x) ≥ d(y₀, x) > 0, hence x_{n_i} cannot converge to x. Contradiction. Therefore x ∈ ∩ₙ Uₙ.
T. 10-41
<Cauchy’s theorem for a triangle> LetR U be a domain, and let f : U → C
be holomorphic. If T ⊆ U is a triangle, then ∂T f (z) dz = 0.
R
Fix a triangle T . Let η = ∂T f (z) dz and ` = length(∂T ). We will bound η by
ε, for every ε > 0, and hence we must have η = 0. To do so, we subdivide our
triangles.
By subdividing the triangle further and further, we are focusing on
a smaller and smaller region of the complex plane. This is helpful
since we are given that f is holomorphic, and holomorphicity is a
local property. We start with the triangle T = T 0 shown on the
right.
We then add lines between the midpoints of the edges to get four smaller triangles T⁰_a, T⁰_b, T⁰_c, T⁰_d (it doesn't really matter which is which), as in the diagram on the right. We orient the middle triangle in the anti-clockwise direction. Then we have

    ∫_{∂T⁰} f(z) dz = Σ_{·∈{a,b,c,d}} ∫_{∂T⁰_·} f(z) dz,

since each internal edge occurs twice, with opposite orientation. For this to be possible, if η = |∫_{∂T⁰} f(z) dz|, then there must be some subscript · ∈ {a, b, c, d} such that

    |∫_{∂T⁰_·} f(z) dz| ≥ η/4.

We call this T⁰_· = T¹. Then we notice ∂T¹ has length length(∂T¹) = ℓ/2. Iterating this, we obtain triangles T⁰ ⊇ T¹ ⊇ T² ⊇ · · · such that

    |∫_{∂T^i} f(z) dz| ≥ η/4^i,    length(∂T^i) = ℓ/2^i.
Now we are given a nested sequence of compact sets. By the lemma, there is some z₀ ∈ ∩_{i≥0} T^i. Now fix an ε > 0. Since f is holomorphic at z₀, we can find a δ > 0 such that

    |f(w) − f(z₀) − (w − z₀)f′(z₀)| ≤ ε|w − z₀| whenever |w − z₀| < δ.

Since the diameters of the triangles are shrinking each time, we can pick an n such that T^n ⊆ B(z₀, δ). Now note that since 1 and z both have antiderivatives on T^n, we have

    ∫_{∂T^n} 1 dz = 0 = ∫_{∂T^n} z dz.

Therefore, noting that f(z₀) and f′(z₀) are just constants, we have

    |∫_{∂T^n} f(z) dz| = |∫_{∂T^n} (f(z) − f(z₀) − (z − z₀)f′(z₀)) dz|
        ≤ ∫_{∂T^n} |f(z) − f(z₀) − (z − z₀)f′(z₀)| |dz|
        ≤ length(∂T^n) · ε · sup_{z∈∂T^n} |z − z₀| ≤ ε length(∂T^n)²,

where the last inequality comes from the fact that z₀ ∈ T^n, and the distance between any two points in the triangle cannot be greater than the perimeter of the triangle. Substituting our formulas for these in, we have

    η/4^n ≤ (1/4^n) ℓ²ε    ⟹    η ≤ ℓ²ε.

Since ℓ is fixed and ε was arbitrary, it follows that we must have η = 0.

P. 10-42
<Star-shaped Cauchy’s theorem> If U is a star-shaped domain, and f : U →
1
R is holomorphic, then for any closed piecewise C paths γ ∈ U , we must have
C
γ
f (z) dz = 0.

If f is holomorphic, then Cauchy’s theorem says the integral over any triangle
vanishes. If U is star shaped, 2 of [P.10-39] says f has an antiderivative. Then
the fundamental theorem of calculus tells us the integral around any closed path
vanishes.

Is this the best we can do? Can we formulate this for an arbitrary domain, and not just star-shaped ones? It is obviously not true if the domain is not simply connected, eg. for f(z) = 1/z defined on C \ {0}. However, it turns out this holds as long as the domain is simply connected, as we will show in a later part of the course. However, this is not surprising given the Riemann mapping theorem, since any simply connected domain is conformally equivalent to the unit disk, which is star-shaped (and in fact convex).

We can generalize our result to the case when f : U → C is continuous on the whole of U, and holomorphic except at finitely many points. In this case, the same conclusion holds — ∫_γ f(z) dz = 0 for all piecewise smooth closed γ.

Why is this? In the proof, it was sufficient to focus on showing ∫_{∂T} f(z) dz = 0 for a triangle T ⊆ U. Consider the simple case where we only have a single point of non-holomorphicity a ∈ T. The idea is again to subdivide as in the diagram, so that a lies in the small central triangle, which we call T′. Along all other triangles in our subdivision, we get ∫ f(z) dz = 0, as these triangles lie in a region where f is holomorphic. So

    ∫_{∂T} f(z) dz = ∫_{∂T′} f(z) dz.

Note now that we can make T′ as small as we like. But

    |∫_{∂T′} f(z) dz| ≤ length(∂T′) sup_{z∈∂T′} |f(z)|.

Since f is continuous, it is bounded. As we take smaller and smaller subdivisions, length(∂T′) → 0. So we must have ∫_{∂T} f(z) dz = 0.

From here, it’s straightforward to conclude the general case with many points of
non-holomorphicity — we can divide the triangle in a way such that each small
triangle contains one bad point.

D. 10-43
Let U ⊆ C be open and φ, ψ piecewise C¹-smooth closed paths in U. We say ψ is an elementary deformation of φ if φ and ψ can each be expressed as a join of paths φ₁, φ₂, · · · , φₙ and ψ₁, ψ₂, · · · , ψₙ such that for all i both φᵢ, ψᵢ ⊆ Cᵢ for some convex set Cᵢ ⊆ U.

E. 10-44
The idea of elementary deformation is that for each of the sections φi and ψi , since
they lies in a convex region, we can continuously deform ψi to φi . So that in fact
the two curves φ and ψ are “the same”.
Suppose f : U → C is holomorphic. Wlog suppose each of φᵢ and ψᵢ is parametrised by [0, 1]. Let ℓᵢ be the straight line from φᵢ(0) = φᵢ₋₁(1) to ψᵢ(0) = ψᵢ₋₁(1) (where φ₀ means φₙ and same for ψ). Note that φᵢ + ℓᵢ₊₁ − ψᵢ − ℓᵢ (where ℓₙ₊₁ means ℓ₁) is a closed curve that lies in the convex (hence star-shaped) region Cᵢ, so by [P.10-42], ∫_{φᵢ+ℓᵢ₊₁−ψᵢ−ℓᵢ} f(z) dz = 0 for all i. Now since ℓₙ₊₁ = ℓ₁, we have

    ∫_φ f(z) dz − ∫_ψ f(z) dz = Σ_{i=1}^n ( ∫_{φᵢ} f(z) dz − ∫_{ψᵢ} f(z) dz ) = Σ_{i=1}^n ∫_{φᵢ−ψᵢ} f(z) dz
        = Σ_{i=1}^n ( ∫_{φᵢ−ψᵢ} f(z) dz + ∫_{ℓᵢ₊₁} f(z) dz − ∫_{ℓᵢ} f(z) dz )
        = Σ_{i=1}^n ∫_{φᵢ+ℓᵢ₊₁−ψᵢ−ℓᵢ} f(z) dz = 0.

Hence ∫_φ f(z) dz = ∫_ψ f(z) dz when ψ and φ are elementary deformations of each other.
L. 10-45
Let M be an open subset of a metric space. If U ⊆ M is compact, then ∃δ > 0 such that B(u, δ) ⊆ M for all u ∈ U.

Define f : M → R by f (x) = sup{R : BR (x) ⊆ M }. Now f is continuous since


for any 0 < ε < f (x) we have d(y, x) < ε implies f (x) − ε < f (y) < f (x) + ε, ie.
|f (y) − f (x)| < 2ε. Now U is compact, so by the maximum value theorem, ∃b ∈ U
such that f (b) = inf x∈U f (x). Moreover f (b) 6= 0 by openness of M . Therefore
B(u, f (b)) ⊆ M for all u ∈ U .
In particular when M ⊆ C is a domain, since B̄(z, r) ⊆ C is compact we see that if B̄(z, r) ⊆ M, then ∃δ > 0 such that B̄(z, r + δ/2) ⊆ M.
T. 10-46
<Cauchy integral formula> Let U be a domain, and f : U → C be holomorphic. Suppose B̄(z₀; r) ⊆ U. Then for all z ∈ B(z₀; r) we have

    f(z) = (1/2πi) ∫_{∂B̄(z₀;r)} f(w)/(w − z) dw.

(Proof 1) Since U is open, there is some δ > 0 such that B̄(z₀; r + δ) ⊆ U. We define g : B(z₀; r + δ) → C by

    g(w) = (f(w) − f(z))/(w − z) for w ≠ z,   and   g(z) = f′(z),

where we have fixed z ∈ B(z₀; r) as in the statement of the theorem. Now note that g is holomorphic as a function of w ∈ B(z₀, r + δ), except perhaps at w = z. But since f is holomorphic, by definition g is continuous everywhere on B(z₀, r + δ). So the previous result says

    ∫_{∂B̄(z₀;r)} g(w) dw = 0    ⟹    ∫_{∂B̄(z₀;r)} f(w)/(w − z) dw = ∫_{∂B̄(z₀;r)} f(z)/(w − z) dw.

We now rewrite

    1/(w − z) = (1/(w − z₀)) · 1/(1 − (z − z₀)/(w − z₀)) = Σ_{n=0}^∞ (z − z₀)ⁿ/(w − z₀)^{n+1}.

Note that this sum converges uniformly on ∂B̄(z₀; r) since |(z − z₀)/(w − z₀)| < 1 for w on this circle. By uniform convergence, we can exchange summation and integration. So

    ∫_{∂B̄(z₀;r)} f(w)/(w − z) dw = Σ_{n=0}^∞ f(z)(z − z₀)ⁿ ∫_{∂B̄(z₀,r)} dw/(w − z₀)^{n+1}.

We note that f(z)(z − z₀)ⁿ is just a constant, and that we have previously proven

    ∫_{∂B̄(z₀;r)} (w − z₀)^k dw = 2πi if k = −1, and 0 if k ≠ −1.

So the right hand side is just 2πi f(z).
(Proof 2) Given ε > 0, we pick δ > 0 such that B̄(z, δ) ⊆ B(z₀, r), and such that whenever |w − z| < δ, then |f(w) − f(z)| < ε. This is possible since f is uniformly continuous on a neighbourhood of z. We now cut our region apart as in the diagram: the region between ∂B̄(z₀, r) and a small circle ∂B̄(z, δ) around z is divided into two halves by two cuts joining the circles.
We know f(w)/(w − z) is holomorphic on the shaded area. The shaded area enclosed by the contours might not be star-shaped, but we can definitely divide it once more so that it is. Hence the integral of f(w)/(w − z) around each half-contour vanishes by Cauchy's theorem. Adding these together, we get

    ∫_{∂B̄(z₀,r)} f(w)/(w − z) dw = ∫_{∂B̄(z,δ)} f(w)/(w − z) dw,

where the balls are both oriented anticlockwise. Now we have

    |f(z) − (1/2πi) ∫_{∂B̄(z₀,r)} f(w)/(w − z) dw| = |f(z) − (1/2πi) ∫_{∂B̄(z,δ)} f(w)/(w − z) dw|.

Now we use the fact that ∫_{∂B̄(z,δ)} dw/(w − z) = 2πi to show this is equal to

    |(1/2πi) ∫_{∂B̄(z,δ)} (f(z) − f(w))/(w − z) dw| ≤ (1/2π) · 2πδ · (1/δ) · ε = ε.

Taking ε → 0, we see that the Cauchy integral formula holds.


Below are some remarks:

• The subdivision we did in proof 2 above was something we can do in general. Note also that ∫_{∂B̄(0,1)} (1/z) dz = 2πi is a special case of the Cauchy integral formula, and also something we used to prove the formula.
Both proofs involve knowing how to integrate things of the form 1/(w − z₀)ⁿ, and manipulating the formula so that the integral is made up of things like this. The first proof achieves this by manipulating the integrand, while the second achieves it by manipulating the contour of integration.
• In a nutshell, proof 2 is doing the following:

    ∫_{∂B̄(z₀,r)} f(w)/(w − z) dw = ∫_{∂B̄(z,δ)} f(w)/(w − z) dw = ∫_0^{2π} (f(z + δe^{iθ})/(δe^{iθ})) iδe^{iθ} dθ
        = i ∫_0^{2π} (f(z) + O(δ)) dθ → 2πi f(z)   as δ → 0.

• In fact the Cauchy integral formula is like the mean-value property for harmonic functions:

    f(z) = (1/2πi) ∫_{∂B̄(z;r)} f(w)/(w − z) dw = (1/2π) ∫_0^{2π} f(z + re^{iθ}) dθ.

• So this result says that, if we know the value of f on the boundary of a disc
and that it is holomorphic on a domain containing the disc, then we know the
value of f at all points within the disc. While this seems magical, it is less
surprising if we look at it in another way. We can write f = u + iv, where u and
v are harmonic functions, ie. they satisfy Laplace’s equation. Then if we know
the values of u and v on the boundary of a disc, then what we essentially have
is Laplace’s equation with Dirichlet boundary conditions! Then the fact that
this tells us everything about f within the boundary is just the statement that
Laplace’s equation with Dirichlet boundary conditions has a unique solution!
• We can have a slightly more general form of this result: in fact we have

    f(z) = (1/2πi) ∫_γ f(w)/(w − z) dw

for any closed piecewise C¹ curve γ that can be obtained from ∂B̄(z₀; r) via a sequence of elementary deformations in U \ {z}.
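Here is a minimal numerical illustration of the formula (a sketch assuming the numpy library; the choice f = exp and the sample point are arbitrary): we recover f(z) purely from its values on a surrounding circle.

    import numpy as np

    z0, r = 0.0, 1.0
    z = 0.3 + 0.2j                                  # a point inside B(z0, r)
    N = 4000
    theta = np.linspace(0, 2*np.pi, N, endpoint=False)
    w = z0 + r*np.exp(1j*theta)                     # points on the contour
    dw = 1j*r*np.exp(1j*theta) * (2*np.pi/N)        # w'(theta) dtheta
    value = np.sum(np.exp(w)/(w - z) * dw) / (2j*np.pi)
    print(value, np.exp(z))                         # the two agree to high accuracy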
T. 10-47
<Taylor's theorem> Let f : B(a, r) → C be holomorphic. Then f has a convergent power series representation f(z) = Σ_{n=0}^∞ cₙ(z − a)ⁿ on B(a, r). Moreover,

    cₙ = f⁽ⁿ⁾(a)/n! = (1/2πi) ∫_{∂B(a,ρ)} f(z)/(z − a)^{n+1} dz    for any 0 < ρ < r.

We'll use Cauchy's integral formula. If |w − a| < ρ < r, then

    f(w) = (1/2πi) ∫_{∂B(a,ρ)} f(z)/(z − w) dz.

Now just like in the proof of [T.10-46], we note that

    1/(z − w) = 1/((z − a)(1 − (w − a)/(z − a))) = Σ_{n=0}^∞ (w − a)ⁿ/(z − a)^{n+1}.

This series is uniformly convergent for z on the circle ∂B(a, ρ), since |w − a|/|z − a| = |w − a|/ρ < 1 there. By uniform convergence, we can exchange integration and summation to get

    f(w) = Σ_{n=0}^∞ ( (1/2πi) ∫_{∂B(a,ρ)} f(z)/(z − a)^{n+1} dz ) (w − a)ⁿ = Σ_{n=0}^∞ cₙ(w − a)ⁿ.

Since cn does not depend on w, this is a genuine power series representation, and
this is valid on any disk B(a, ρ) ⊆ B(a, r). Then the formula for cn in terms of
the derivative comes for free since that’s the formula for the derivative of a power
series.

This tells us every holomorphic function behaves like a power series. The statement of the theorem implies any holomorphic function has to be infinitely differentiable! Also, we do not get weird things like e^{−1/x²} on R, which has a trivial Taylor series expansion at 0 but is itself non-trivial. Similarly, we know that there are no "bump functions" on C that are non-zero only on a compact set (since power series don't behave like that). Of course, we already knew that from Liouville's theorem.

Note that the formula for f⁽ⁿ⁾(a) is what we would expect: differentiating Cauchy's integral formula with respect to a, we get

    f(a) = (1/2πi) ∫_{∂B̄(a;ρ)} f(z)/(z − a) dz    ⟹    f′(a) = (1/2πi) ∫_{∂B̄(a;ρ)} f(z)/(z − a)² dz.

We have just taken the differentiation inside the integral sign. This works since the integrand, both before and after, is a continuous function of both z and a. We can do this any number of times to get

    f⁽ⁿ⁾(a) = (n!/2πi) ∫_{∂B̄(a;ρ)} f(z)/(z − a)^{n+1} dz.
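The coefficient formula can be checked numerically in the same spirit (a sketch assuming the numpy library; f = exp and a = 0 are arbitrary choices), comparing the contour integral with 1/n!:

    import numpy as np
    from math import factorial

    a, rho, N = 0.0, 1.0, 4000
    theta = np.linspace(0, 2*np.pi, N, endpoint=False)
    z = a + rho*np.exp(1j*theta)
    dz = 1j*rho*np.exp(1j*theta) * (2*np.pi/N)

    for n in range(5):
        c_n = np.sum(np.exp(z)/(z - a)**(n + 1) * dz) / (2j*np.pi)
        print(n, c_n, 1/factorial(n))   # c_n is approximately 1/n! (imaginary part ~ 0)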

P. 10-48
If f : U → C is holomorphic on a disc, then f is infinitely differentiable on the
disc.

Complex power series are infinitely differentiable (and f had better be infinitely
differentiable for us to write down the formula for cn in terms of f (n) ).

This justifies our claim from the very beginning that Re(f ) and Im(f ) are harmonic
functions if f is holomorphic.

T. 10-49
<Liouville’s theorem> Let f : C → C be an entire function. If f is bounded,
then f is constant.

(Proof 1) Suppose |f(z)| ≤ M for all z ∈ C. We fix z₁, z₂ ∈ C, and estimate |f(z₁) − f(z₂)| with the integral formula. Let R > max{2|z₁|, 2|z₂|}. By the integral formula, we know

    |f(z₁) − f(z₂)| = |(1/2πi) ∫_{∂B(0,R)} ( f(w)/(w − z₁) − f(w)/(w − z₂) ) dw|
                    = |(1/2πi) ∫_{∂B(0,R)} f(w)(z₁ − z₂)/((w − z₁)(w − z₂)) dw|
                    ≤ (1/2π) · 2πR · M|z₁ − z₂|/(R/2)² = 4M|z₁ − z₂|/R.

Note that we get the bound on the denominator since |w| = R implies |w − zᵢ| > R/2 by our choice of R. Letting R → ∞, we know we must have f(z₁) = f(z₂). So f is constant.

(Proof 2) Suppose that |f(z)| ≤ M for all z ∈ C, and consider an arbitrary point z₀ ∈ C. By our formula for the derivative, we have

    |f′(z₀)| = |(1/2πi) ∫_{∂B(z₀;r)} f(z)/(z − z₀)² dz| ≤ (1/2π) · 2πr · M/r² = M/r → 0   as r → ∞,

since the first equality is valid for any r > 0. Hence f′(z₀) = 0 for all z₀ ∈ C. So f is constant.

This, for example, means there are no interesting holomorphic periodic functions like sin and cos that are bounded everywhere.

T. 10-50
<Fundamental theorem of algebra> Every non-constant complex polynomial has a root in C.

Let P (z) = an z n + an−1 z n−1 + · · · + a0 where an 6= 0 and n > 0. So P is non-


constant. Thus, as |z| → ∞, |P (z)| → ∞. In particular, there is some R such that
for |z| > R, we have |P (z)| ≥ 1. Now suppose for contradiction that P does not
have a root in C. Consider f(z) = 1/P(z),

which is then an entire function, since it is a rational function. On B̄(0, R), we


know f is certainly continuous, and hence bounded. Outside this ball, we get
|f (z)| ≤ 1. So f (z) is constant, by Liouville’s theorem. But P is non-constant.
This is absurd. Hence the result follows.

There are many many ways we can prove the fundamental theorem of algebra. However, none of them belong wholly to algebra. They all involve some analysis or topology. This is not surprising since the construction of R, and hence C, is intrinsically analytic — we get from N to Z by requiring additive inverses; Z to Q by requiring multiplicative inverses; R to C by requiring a root of x² + 1 = 0. These are all algebraic. However, to get from Q to R, we are requiring something about convergence in Q. This is not algebraic. It requires a particular choice of metric on Q. If we pick a different metric, then we get a different completion, as you may have seen in IB Metric and Topological Spaces. Hence the construction of R is actually analytic, and not purely algebraic.

P. 10-51
If f : U → C is a complex-valued function, then f = u + iv is holomorphic at p ∈ U if and only if u, v satisfy the Cauchy-Riemann equations and u_x, u_y, v_x, v_y are continuous in a neighbourhood of p.

(Backward) If ux , uy , vx , vy exist and are continuous in an open neighbourhood of


p, then u and v are differentiable as functions R2 → R at p, and then together with
the Cauchy-Riemann equations this implies[P.10-4] differentiability at each point in
the neighbourhood of p. So f is differentiable at a neighbourhood of p.
(Forward) If f is holomorphic at p, then it is infinitely differentiable in a neigh-
bourhood of p. In particular, f 0 (z) is also holomorphic. So ux , uy , vx , vy are
differentiable, hence continuous.
T. 10-52
<Morera's theorem> Let U ⊆ C be a domain. Suppose f : U → C is continuous and such that ∫_γ f(z) dz = 0 for all piecewise-C¹ closed curves γ in U. Then f is holomorphic on U.

We have previously shown that the condition implies that f has an antiderivative
F : U → C, ie. F is a holomorphic function such that F 0 = f . But F is infinitely
differentiable. So f must be holomorphic.
So we have here a (partial) converse to Cauchy’s theorem. Recall that Cauchy’s
theorem required U to be sufficiently nice, eg. being star-shaped or just simply-
connected. However, Morera’s theorem does not. It just requires that U is a
domain. This is since holomorphicity is a local property, while vanishing on closed
curves is a global result. Cauchy’s theorem gets us from a local property to a
global property, and hence we need to assume more about what the “globe” looks
like. On the other hand, passing from a global property to a local one does not.
Hence we have this asymmetry.
L. 10-53
Let U be open. A sequence of functions fn : U → C is locally uniformly convergent
on U iff it is uniformly convergent on all compact subsets of U .

(Forward) Since (fₙ) is locally uniformly convergent, given any z ∈ U we can find a ball B_{δ(z)}(z) inside which fₙ converges uniformly. Now {B_{δ(z)}(z) : z ∈ U} is an open cover of U. Given any compact subset V ⊆ U, we can find a finite subcover of V, hence fₙ converges uniformly on V.
(Backward) Given any z ∈ U , since U is open we can find B̄δ (z) ⊆ U , and this is
compact, hence fn converges uniformly on B̄δ (z) ⊆ U .
P. 10-54
Let U ⊆ C be a domain, fn : U → C be holomorphic functions. If fn → f locally
uniformly on U , then f is holomorphic and fn0 → f 0 locally uniformly on U .

Since f being holomorphic is a local condition, we fix p ∈ U and work in some small, convex disc B(p, ε) ⊆ U inside which fₙ converges uniformly. For any closed piecewise C¹ curve γ inside this disc, we have ∫_γ fₙ(z) dz = 0 (by the star-shaped Cauchy theorem). Uniformity of convergence says ∫_γ fₙ(z) dz → ∫_γ f(z) dz. Hence we also have ∫_γ f(z) dz = 0. Since this is true for all such curves, we conclude f is holomorphic inside B(p, ε) by Morera's theorem. Since p was arbitrary, we know f is holomorphic.

We know that f′(z) = limₙ fₙ′(z), since we can express f′(a) in terms of the integral of f(z)/(z − a)², as in Taylor's theorem, and exchange the limit and the integral. To show that this convergence is locally uniform we need more work. Pick any a ∈ U; then we can find B_r(a) ⊆ U such that fₙ → f uniformly inside it. Then for any w ∈ B(a, r/2) we have

    |fₙ′(w) − f′(w)| = |(1/2πi) ∫_{∂B̄(w;r/2)} (fₙ(z) − f(z))/(z − w)² dz|
        ≤ length(∂B̄(w; r/2)) · sup_{z∈B_r(a)} |fₙ(z) − f(z)| / (2π(r/2)²)
        = (2/r) sup_{z∈B_r(a)} |fₙ(z) − f(z)| → 0   as n → ∞,

hence fₙ′ → f′ locally uniformly on U.

There is a lot of passing between knowledge of integrals and knowledge of holo-


morphicity all the time, as we can see in these few results. These few sections are
in some sense the heart of the course, where we start from Cauchy’s theorem and
Cauchy’s integral formula, and derive all the other amazing consequences.

D. 10-55
• Let f : B(a, r) → C be holomorphic. Then we can write f(z) = Σ_{n=0}^∞ cₙ(z − a)ⁿ as a convergent power series. Then either all cₙ = 0, in which case f = 0 on B(a, r), or there is a least N such that c_N ≠ 0 (N is the smallest n such that f⁽ⁿ⁾(a) ≠ 0). If N > 0, then we say f has a zero (root) of order N. A zero of order one is called a simple zero .

• Let U0 be a domain and f : U0 → C be holomorphic. An analytic continuation


of f is a holomorphic function h : U → C, where U is a domain with U ⊇ U0 , such
that h|U0 = f , ie. h(z) = f (z) for all z ∈ U0 .

E. 10-56
If f has a zero of order N at a, then we can write f (z) = (z − a)N g(z) on B(a, r),
where g(a) = cN 6= 0. Often, it is not the actual order that is too important.
Instead, it is the ability to factor f in this way.

• z 3 + iz 2 + z + i = (z − i)(z + i)2 has a simple zero at z = i and a zero of order


2 at z = −i.

• sinh z has zeros where (1/2)(e^z − e^{−z}) = 0, ie. e^{2z} = 1, ie. z = nπi, where n ∈ Z. The zeros are all simple, since cosh(nπi) = cos(nπ) ≠ 0.

Since sinh z has a simple zero at z = πi, we know sinh3 z has a zero of order
3 there. This is since the first term of the Taylor series of sinh z about z = πi
has order 1, and hence the first term of the Taylor series of sinh3 z has order 3.
We can also find the Taylor series about πi by writing ζ = z − πi:

    sinh³ z = (sinh(ζ + πi))³ = (−sinh ζ)³ = −(ζ + ζ³/3! + · · ·)³ = −ζ³ − (1/2)ζ⁵ − · · ·
            = −(z − πi)³ − (1/2)(z − πi)⁵ − · · · .
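A quick symbolic check of this expansion (an illustrative sketch assuming the sympy library), working in the variable ζ = z − πi:

    import sympy as sp

    zeta = sp.symbols('zeta')
    # expand sinh^3(z) = sinh^3(zeta + pi*i) about zeta = 0
    print(sp.series(sp.sinh(zeta + sp.I*sp.pi)**3, zeta, 0, 7))
    # -zeta**3 - zeta**5/2 + O(zeta**7), matching the series above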

L. 10-57
<Principle of isolated zeroes> Let f : B(a, r) → C be holomorphic and not identically zero. Then there exists some 0 < ρ < r such that f(z) ≠ 0 in the punctured neighbourhood B(a, ρ) \ {a}.

If f (a) 6= 0, then the result is obvious by continuity of f . If f has a zero of order


N at a, then we can write f (z) = (z − a)N g(z) with g(a) 6= 0. By continuity of g,
g does not vanish on some small neighbourhood of a, say B(a, ρ). Then f (z) does
not vanish on B(a, ρ) \ {a}.
A consequence is that given two holomorphic functions on the same domain, if
they agree on sufficiently many points, then they must in fact be equal.
T. 10-58
<Identity theorem> Let U ⊆ C be a domain, and f, g : U → C be holomorphic.
Let S = {z ∈ U : f (z) = g(z)}. Suppose S contains a non-isolated point, ie. there
exists some w ∈ S such that for all ε > 0, S ∩ B(w, ε) 6= {w}. Then f = g on U .

Consider the function h(z) = f(z) − g(z). Then the hypothesis says h(z) has a non-isolated zero at w, ie. there is no punctured neighbourhood of w on which h is nowhere zero. By the previous lemma, this means there is some ρ > 0 such that h = 0 on B(w, ρ) ⊆ U. Now let

U0 = {a ∈ U : h = 0 on some neighbourhood B(a, δ) of a in U },


U1 = {a ∈ U : there exists n ≥ 0 such that h(n) (a) 6= 0}.

Clearly, U₀ ∩ U₁ = ∅, and the existence of Taylor expansions shows U₀ ∪ U₁ = U. Moreover, U₀ is open by definition, and U₁ is open since h⁽ⁿ⁾(z) is continuous near any given a ∈ U₁. Since U is (path) connected, such a decomposition can only happen if one of U₀ and U₁ is empty. But w ∈ U₀. So in fact U₀ = U, ie. h vanishes on the whole of U. So f = g.
In particular, if two holomorphic functions agree on some small open subset of the
domain, then they must in fact be identical. This is a very strong result, and is
very false for real functions. Hence, to specify, say, an entire function, all we need
to do is to specify it on an arbitrarily small domain we like.
E. 10-59
From the Identity theorem we know the analytic continuation is unique if it exists.
Thus, given any holomorphic function f : U → C, it is natural to ask how far we
can extend the domain, ie. what is the largest U 0 ⊇ U such that there is an
analytic continuation of f to U 0 . There is no general method that does this for
us. However, one useful trick is to try to write our function f in a different way
so that it is clear how we can extend it to elsewhere.
• Consider the function f(z) = Σ_{n≥0} zⁿ = 1 + z + z² + · · · defined on B(0, 1). By itself, this series diverges for z outside B(0, 1). However, we know well that this function is just

    f(z) = 1/(1 − z).

This alternative representation makes sense on the whole of C except at z = 1. So we see that f has an analytic continuation to C \ {1}. There is clearly no extension to the whole of C, since it blows up near z = 1.
• Consider f(z) = Σ_{n≥0} z^{2^n}. Then this again converges on B(0, 1). But in fact there is no analytic continuation of f to any larger domain.
• The Riemann zeta function

    ζ(z) = Σ_{n=1}^∞ n^{−z}

defines a holomorphic function on {z : Re(z) > 1} ⊆ C. Indeed, we have |n^{−z}| = n^{−Re(z)}, and we know Σ n^{−t} converges for real t > 1, and in fact does so uniformly on any compact subset of {z : Re(z) > 1}. So [P.10-54] tells us that ζ(z) is holomorphic on Re(z) > 1.
We know this cannot converge as z → 1, since we approach the harmonic series
which diverges. However, it turns out ζ(z) has an analytic continuation to
C \ {1}. We will not prove this.
Let pᵢ (i = 1, 2, · · ·) be a list of all primes ordered from small to big. Let Sₙ = {m ∈ N : m = Π_{i=1}^n pᵢ^{mᵢ} for some mᵢ ∈ N₀}, i.e. the set of all natural numbers that can be written as a product of powers of the primes p₁, · · · , pₙ. Then

    Pₙ := Π_{i=1}^n (1 + pᵢ^{−z} + pᵢ^{−2z} + · · ·) = Σ_{m∈Sₙ} m^{−z}.

So we have

    |ζ(z) − Pₙ| = |Σ_{m∉Sₙ} m^{−z}| ≤ Σ_{m=p_{n+1}}^∞ |m^{−z}| = Σ_{m=p_{n+1}}^∞ m^{−Re(z)} → 0   as n → ∞.

Hence we can write

    ζ(z) = Π_{i=1}^∞ (1 + pᵢ^{−z} + pᵢ^{−2z} + · · ·) = Π_{i=1}^∞ 1/(1 − pᵢ^{−z}).

If there were only finitely many primes, then Sₙ = N for some n, and so we would have ζ(z) = Pₙ = Π_{i=1}^n (1 − pᵢ^{−z})^{−1}, which is well-defined at z = 1 since this is a finite product. Hence, the fact that ζ blows up at z = 1 implies that there are infinitely many primes.
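The Euler product is easy to see numerically. The sketch below (plain Python, no special libraries; the truncation points are arbitrary) compares a truncated product over primes with a truncated sum for ζ(2) ≈ π²/6:

    def primes_up_to(n):
        sieve = [True] * (n + 1)
        sieve[0] = sieve[1] = False
        for p in range(2, int(n**0.5) + 1):
            if sieve[p]:
                sieve[p*p::p] = [False] * len(sieve[p*p::p])
        return [p for p, is_prime in enumerate(sieve) if is_prime]

    z = 2.0
    zeta_sum = sum(m**(-z) for m in range(1, 100000))
    euler_product = 1.0
    for p in primes_up_to(200):
        euler_product *= 1.0/(1.0 - p**(-z))
    print(zeta_sum, euler_product)   # both approximately pi^2/6 = 1.6449...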
L. 10-60
Let f : U → C be holomorphic on an open set U ⊆ C. If |f| is constant, then f must also be constant.

Write f(x + iy) = u(x, y) + iv(x, y), where u, v are functions R² → R. By [P.10-4], u_x = v_y and u_y = −v_x. Also |f|² = u² + v² is differentiable with derivative

    0 = (2uu_x + 2vv_x, 2uu_y + 2vv_y)    ⟹    0 = (uu_x − vu_y, uu_y + vu_x).

So uu_x − vu_y = 0 and uu_y + vu_x = 0. Eliminating u_y and u_x respectively, we have (u² + v²)u_x = 0 and (u² + v²)u_y = 0. These equations need to hold everywhere. Suppose at c + id ∈ U we have u_x(c, d) ≠ 0. Then u² + v² = 0, so u = v = 0 at (c, d). We can pick δ small enough so that u(c + δ, d) ≠ 0, since u_x(c, d) ≠ 0. But that means |f(c + δ, d)|² ≠ |f(c, d)|² = 0. So |f|² is not constant. Contradiction, so u_x = 0 everywhere. Similarly u_y = 0 everywhere, and then the Cauchy-Riemann equations give v_x = v_y = 0 as well. Hence f is constant.

P. 10-61
1. <Local maximum principle> Let U be a domain and f : U → C be
holomorphic. If for some z ∈ U we have |f (w)| ≤ |f (z)| for all w ∈ B(z; r) ⊆ U ,
then f is constant. In other words, a non-constant holomorphic function cannot
achieve an interior local maximum.
2. <Global maximum principle> Let U be a bounded domain and Ū its closure. If f : Ū → C is a continuous function that is holomorphic on U, then |f| achieves its maximum on the boundary ∂U = Ū \ U.

1. Let 0 < ρ < r. Applying the Cauchy integral formula, we get

    |f(z)| = |(1/2πi) ∫_{∂B̄(z;ρ)} f(w)/(w − z) dw| = |∫_0^1 f(z + ρe^{2πiθ}) dθ|
           ≤ sup_{|z−w|=ρ} |f(w)| ≤ |f(z)|.

So we must have equality throughout. When we proved the supremum bound for the integral, we showed equality can happen only if the integrand is constant. So f(w) is constant on the circle |z − w| = ρ, and equal to f(z). Since this is true for all ρ ∈ (0, r), it follows that f is constant on B(z; r). By the identity theorem, f must be constant on U.
2. |f| is continuous on the compact set Ū, so it attains its global maximum. If it attains its maximum in U, then by the local maximum principle, f is constant on U (and hence on Ū), so in any case the global maximum is attained at a point of the boundary ∂U.
P. 10-62
<Removal of singularities> Let U be a domain, z₀ ∈ U and f : U \ {z₀} → C holomorphic. If f is bounded near z₀, or if (z − z₀)f(z) → 0 as z → z₀, then there exists an a such that f(z) → a as z → z₀. Furthermore, the function g : U → C defined by

    g(z) = f(z) for z ∈ U \ {z₀},   and   g(z₀) = a,

is holomorphic on U.

Define a new function h : U → C by

    h(z) = (z − z₀)² f(z) for z ≠ z₀,   and   h(z₀) = 0.

Then since f is holomorphic away from z₀, we know h is also holomorphic away from z₀. In the case f is bounded near z₀, say |f(z)| < M in some neighbourhood of z₀, we have

    |(h(z) − h(z₀))/(z − z₀)| ≤ |z − z₀| M → 0   as z → z₀.

In the case (z − z₀)f(z) → 0 as z → z₀ we also have the same conclusion that the LHS tends to 0. So in fact h is also differentiable at z₀, and h(z₀) = h′(z₀) = 0. So near z₀, h has a Taylor series h(z) = Σ_{n≥0} aₙ(z − z₀)ⁿ. We know a₀ = a₁ = 0. Now define g(z) by

    g(z) = Σ_{n≥0} a_{n+2}(z − z₀)ⁿ,

defined on some ball B(z0 , ρ), where the Taylor series for h is defined, (and equal
to f elsewhere). By construction, on the punctured ball B(z0 , ρ) \ {z0 }, we get
g(z) = f (z). Moreover, g(z) → a2 as z → z0 . So f (z) → a2 as z → z0 . Since g is
a power series, it is holomorphic at z0 (and hence on U ).
Singularities of holomorphic functions are places where the function is not defined. There are many ways a function can be ill-defined. For example, if we write

    f(z) = (1 − z)/(1 − z),

then on the face of it, this function is not defined at z = 1. However, elsewhere, f is just the constant function 1, and we might as well define f(1) = 1. Then we get a holomorphic function. These are rather silly singularities, and are singular solely because we were not bothered to define f there. Some singularities however are genuinely singular. For example, the function

    f(z) = 1/(1 − z)
is actually singular at z = 1, since f is unbounded near the point. It turns out
these are the only possibilities. This result tells us the only way for a function to
fail to be holomorphic at an isolated point is that it blows up near the point. This
won’t happen because f fails to be continuous in some weird ways.
However, we are not yet done with our classification. There are many ways in
which things can blow up. We can further classify these into two cases — the
case where |f (z)| → ∞ as z → z0 , and the case where |f (z)| does not converge as
z → z0 . It happens that the first case is almost just as boring as the removable
ones.
P. 10-63
Let U be a domain, z0 ∈ U and f : U \ {z0 } → C be holomorphic. Suppose
|f (z)| → ∞ as z → z0 . Then there is a unique k ∈ Z≥1 and a unique holomorphic
function g : U → C such that g(z₀) ≠ 0, and

    f(z) = g(z)/(z − z₀)^k.

We shall construct g near z0 in some small neighbourhood, and then apply analytic
continuation to the whole of U . The idea is that since f (z) blows up nicely as
z → z0 , we know 1/f (z) behaves sensibly near z0 . We pick some δ > 0 such
that |f (z)| ≥ 1 for all z ∈ B(z0 ; δ) \ {z0 }. In particular, f (z) is non-zero on
B(z0 ; δ) \ {z0 }. So we can define
    h(z) = 1/f(z) for z ∈ B(z₀; δ) \ {z₀},   and   h(z₀) = 0.

Since |1/f(z)| ≤ 1 on B(z₀; δ) \ {z₀}, by the removal of singularities, h is holomorphic on B(z₀, δ). Since h vanishes at z₀ (and is not identically zero), it has a definite order at z₀, ie. there is a unique integer k ≥ 1 such that h has a zero of order k at z₀. In other words,

    h(z) = (z − z₀)^k ℓ(z),

for some holomorphic ℓ : B(z₀; δ) → C with ℓ(z₀) ≠ 0. Now by continuity of ℓ, there is some 0 < ε < δ such that ℓ(z) ≠ 0 for all z ∈ B(z₀, ε). Now define

g : B(z₀; ε) → C by g(z) = 1/ℓ(z). Then g is holomorphic on this disc. By construction, at least away from z₀, we have

    g(z) = 1/ℓ(z) = (z − z₀)^k/h(z) = (z − z₀)^k f(z).

g was initially defined on B(z₀; ε), but this last expression certainly makes sense on all of U. So g admits an analytic continuation from B(z₀; ε) to U.
D. 10-64
• Let U be a domain and f a complex-valued function defined on some subset of U. We say a point z₀ ∈ U is an isolated singularity of f if there exists some open disc (ball) Bε(z₀) such that f is defined and holomorphic on Bε(z₀) \ {z₀} but not on Bε(z₀). Such a singularity is called a
 removable singularity if f is bounded near z₀, or equivalently if there exists a holomorphic function g : Bε(z₀) → C such that f = g on Bε(z₀) \ {z₀}.
 pole if |f(z)| → ∞ as z → z₀, or equivalently if on Bε(z₀) \ {z₀} one can write

    f(z) = g(z)/(z − z₀)^k,   where g : Bε(z₀) → C is holomorphic with g(z₀) ≠ 0,

in which case we call it a pole of order k.
 essential singularity if it is neither removable nor a pole.
• If U is a domain and S ⊆ U is a finite or discrete set, a function f : U \ S → C
which is holomorphic and has (at worst) poles on S is said to be meromorphic
on U .
E. 10-65
log z has a non-isolated singularity at z = 0, because it is not holomorphic at any
point on the branch cut. It is easy to give examples of removable singularities and
poles. z 7→ e1/z has an isolated essential singularity at z = 0.
E. 10-66
From one point of view, a pole is not a singularity at all. Note that if f : B(z₀, ε) \ {z₀} → C has a pole of order k at z₀, then f naturally defines a map f̂ : B(z₀; ε) → CP¹ = C ∪ {∞}, the Riemann sphere, by

    f̂(z) = ∞ for z = z₀,   and   f̂(z) = f(z) for z ≠ z₀.

This is then a “continuous” function. So the singularity is just a point that gets
mapped to the point ∞. The point infinity is not a special point in the Riemann
sphere. Similarly, poles are also not really singularities from the viewpoint of the
Riemann sphere. It's just that we are looking at it in the wrong way. Indeed, if we change coordinates on the Riemann sphere so that we label each point w ∈ CP¹ by w′ = 1/w instead, then f̂ just maps z₀ to 0 under the new coordinate system. In particular, at the point z₀, we find that f̂ is holomorphic and has an innocent zero of order k.
Since poles are not bad, we might as well allow them, so we talk about meromorphic
functions. Note that the requirement that S is discrete is so that each pole in S
is actually an isolated singularity.
A rational function P(z)/Q(z), where P, Q are polynomials, is holomorphic on C \ {z : Q(z) = 0}, and meromorphic on C. More is true — it is in fact holomorphic as a function CP¹ → CP¹!

Note that a non-constant polynomial f(z) satisfies |f(z)| → ∞ as z → ∞, so it's just a pole at infinity! Indeed, if we apply the Möbius transformation z ↦ 1/z, we find that f(1/z) has a pole at z = 0. This is useful since it means that studying polynomials is, to some extent, the same as studying poles!

As an aside, if we want to get an interesting holomorphic function with domain


CP1 , its image must contain the point ∞, or else its image will be a compact subset
of C (since CP1 is compact), thus bounded, and therefore constant by Liouville’s
theorem.

T. 10-67
<Casorati-Weierstrass theorem> Let U be a domain, z ∈ U , and suppose
f : U \ {z} → C has an essential singularity at z. Then for all w ∈ C, there
is a sequence zn → z such that f (zn ) → w. In other words, on any punctured
neighbourhood B(z; ε) \ {z}, the image of f is dense in C.

Suppose ∃a ∈ C such that no sequence zₙ → z satisfies f(zₙ) → a. Then ∃ε > 0, ∃δ > 0 s.t. f(B_δ(z) \ {z}) ∩ B_ε(a) = ∅. Define g : B_δ(z) \ {z} → C by g(x) = 1/(f(x) − a); then g is holomorphic and bounded (by 1/ε), hence z is a removable singularity of g, so g extends to a holomorphic function g : B_δ(z) → C. Now g is non-zero on B_δ(z) \ {z}, and f(x) = a + 1/g(x) there. If g(z) ≠ 0, then f extends holomorphically over z, so z is a removable singularity of f; if instead g has a zero of some finite order at z, then f has a pole at z. Either way, z is not an essential singularity. Contradiction.

So essential singularities are very bad. In fact it's actually worse than that. The theorem only tells us the image is dense, but not that we will hit every point. It is in fact not true that every point will get hit. For example, e^{1/z} can never be zero. However, Picard's theorem says that this is the worst that can happen.
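For a concrete illustration (a sketch assuming the numpy library; the target value w is arbitrary): for f(z) = e^{1/z} and any w ≠ 0, the points z_k = 1/(log w + 2πik) tend to 0 while f(z_k) = w, exactly as the density statement (and indeed Picard's theorem) suggests.

    import numpy as np

    w = 2.0 - 3.0j                         # arbitrary non-zero target value
    for k in (1, 10, 100, 1000):
        zk = 1.0/(np.log(w) + 2j*np.pi*k)  # z_k -> 0 as k -> infinity
        print(k, abs(zk), np.exp(1.0/zk))  # |z_k| shrinks, while exp(1/z_k) stays at w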

T. 10-68
<Picard’s theorem> If f has an isolated essential singularity at z0 , then there
is some b ∈ C such that on each punctured neighbourhood B(z0 ; ε) \ {z0 }, the
image of f contains C \ {b}.

We will not prove this.

T. 10-69
<Laurent series> Let 0 ≤ r < R < ∞, and let A = {z ∈ C : r < |z − a| < R}. If f : A → C is holomorphic, then f has a (unique) convergent series expansion

    f(z) = Σ_{n=−∞}^∞ cₙ(z − a)ⁿ    where    cₙ = (1/2πi) ∫_{∂B̄(a,ρ)} f(w)/(w − a)^{n+1} dw

with ρ ∈ (r, R). Moreover, the series converges uniformly on compact subsets of A.

Let w ∈ A. We let r < ρ′ < |w − a| < ρ″ < R. We cut the annulus {ρ′ < |z − a| < ρ″} into two closed contours, as shown in the diagram: let γ̃ be the contour containing w, and γ̃′ be the other contour.
Let B ⊆ A be a small ball around w; then we can see that we can get from γ̃ to ∂B via a sequence of elementary deformations. Also, we can further decompose γ̃′ into pieces of closed curves such that each closed curve is in a star-shaped subset of A. Therefore by [T.10-46] and [P.10-42] we have

    f(w) = (1/2πi) ∫_{γ̃} f(z)/(z − w) dz    and    0 = (1/2πi) ∫_{γ̃′} f(z)/(z − w) dz.

So we get

    f(w) = (1/2πi) ∫_{∂B(a,ρ″)} f(z)/(z − w) dz − (1/2πi) ∫_{∂B(a,ρ′)} f(z)/(z − w) dz.

As in the first proof of the Cauchy integral formula, we make the following expansions: for the first integral, we have |w − a| < |z − a|. So

    1/(z − w) = (1/(z − a)) · 1/(1 − (w − a)/(z − a)) = Σ_{n=0}^∞ (w − a)ⁿ/(z − a)^{n+1},

which is uniformly convergent on z ∈ ∂B(a, ρ″). For the second integral, we have |w − a| > |z − a|. So

    −1/(z − w) = (1/(w − a)) · 1/(1 − (z − a)/(w − a)) = Σ_{m=1}^∞ (z − a)^{m−1}/(w − a)^m,

which is uniformly convergent for z ∈ ∂B(a, ρ′). By uniform convergence, we can swap summation and integration. So we get

    f(w) = Σ_{n=0}^∞ ( (1/2πi) ∫_{∂B(a,ρ″)} f(z)/(z − a)^{n+1} dz ) (w − a)ⁿ
         + Σ_{m=1}^∞ ( (1/2πi) ∫_{∂B(a,ρ′)} f(z)(z − a)^{m−1} dz ) (w − a)^{−m}.

Now we substitute n = −m in the second sum, and get

    f(w) = Σ_{n=−∞}^∞ c̃ₙ(w − a)ⁿ,

for the integrals c̃ₙ. However, some of the coefficients are integrals around the ρ″ circle, while the others are around the ρ′ circle. This is not a problem. For any r < ρ < R, these circles are elementary deformations of |z − a| = ρ inside the annulus A. So

    ∫_{∂B(a,ρ)} f(z)/(z − a)^{n+1} dz

is independent of ρ as long as ρ ∈ (r, R). To show uniqueness of the series, suppose also that f(z) = Σ_{n=−∞}^∞ bₙ(z − a)ⁿ. Using our formula for c_k, we know

    c_k = (1/2πi) ∫_{∂B(a,ρ)} f(z)/(z − a)^{k+1} dz = (1/2πi) ∫_{∂B(a,ρ)} ( Σ_{n∈Z} bₙ(z − a)^{n−k−1} ) dz
        = (1/2πi) Σ_{n∈Z} bₙ ∫_{∂B(a,ρ)} (z − a)^{n−k−1} dz = b_k.

To show uniform convergence, note that f(z) = Σ_{n=−∞}^∞ cₙ(z − a)ⁿ can be viewed as the sum of two power series: the positive part is a power series that converges inside B_R(a), while the negative part is a power series in 1/(z − a) that converges outside B̄_r(a). Power series converge uniformly on any compact subset inside their (open) region of convergence, hence the Laurent series converges uniformly on any compact subset of A.
The proof looks very much like the blend of the two proofs we’ve given for the
Cauchy integral formula. In one of them, we took a power series expansion of the
integrand, and in the second, we changed our contour by cutting it up. This is
like a mix of the two.
If f is holomorphic at z₀, then we have a local power series expansion f(z) = Σ_{n=0}^∞ cₙ(z − z₀)ⁿ near z₀. If f is singular at z₀ (and the singularity is not removable), then there is no hope we can get a Taylor series, since the existence of a Taylor series would imply f is holomorphic at z = z₀. However, it turns out we can get a series expansion if we allow ourselves to have negative powers of z.
We can interpret the Laurent series as follows – write f(z) = f_in(z) + f_out(z), where f_in consists of the terms with non-negative powers and f_out consists of those with negative powers. Then f_in is the part that is holomorphic on the disk |z − a| < R, while f_out(z) is the part that is holomorphic on |z − a| > r. These two combine to give an expression holomorphic on r < |z − a| < R. This is a nice way of thinking about it.
The Laurent series in fact provides another way of classifying singularities. In the case where r = 0, we just have

    f(z) = Σ_{n=−∞}^∞ cₙ(z − a)ⁿ   on B(a, R) \ {a},

then we have the following possible scenarios:6


1. cn = 0 for all n < 0. Then f is bounded near a, and hence this is a removable
singularity.
2. Only finitely many negative coefficients are non-zero, ie. there is a k ≥ 1 such
that cn = 0 for all n < −k and c−k 6= 0. Then f has a pole of order k at a.
3. There are infinitely many non-zero negative coefficients. Then we have an
isolated essential singularity.
So our classification of singularities fits nicely with the Laurent series expansion.
6 Some take these scenarios as the definitions of isolated removable/essential singularities and poles. Anyhow it's equivalent to our definitions.
If f|_{B_r(a)\{a}} is holomorphic with Laurent series f(z) = Σ_{n∈Z} cₙ(z − a)ⁿ, we call f_principal = Σ_{n=−∞}^{−1} cₙ(z − a)ⁿ the principal part of f at a. We have that f − f_principal is holomorphic near a, and f_principal carries the information of what kind of singularity f has at a.
When we talked about Taylor series, if f : B(a, r) → C is holomorphic with Taylor series f(z) = Σ_{n=0}^∞ cₙ(z − a)ⁿ, then we had two possible ways of expressing the coefficients cₙ. We had

    cₙ = (1/2πi) ∫_{∂B(a,ρ)} f(z)/(z − a)^{n+1} dz = f⁽ⁿ⁾(a)/n!.

In particular, the second expression makes it obvious that the Taylor series is uniquely determined by f. For the Laurent series, we cannot expect to have a simple expression of the coefficients in terms of the derivatives of the function, for the very reason that f is not even defined, let alone differentiable, at a.
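Even though there is no derivative formula, the integral formula for the Laurent coefficients is perfectly usable numerically. The sketch below (assuming the numpy library; the function e^z/z² about a = 0 is an arbitrary choice) recovers cₙ = 1/(n + 2)! for n ≥ −2 and 0 for n < −2:

    import numpy as np

    a, rho, N = 0.0, 0.5, 4000
    theta = np.linspace(0, 2*np.pi, N, endpoint=False)
    z = a + rho*np.exp(1j*theta)
    dz = 1j*rho*np.exp(1j*theta) * (2*np.pi/N)
    f = np.exp(z)/z**2

    for n in range(-3, 3):
        c_n = np.sum(f/(z - a)**(n + 1) * dz) / (2j*np.pi)
        print(n, np.round(c_n, 6))
    # expected: c_{-3} = 0, c_{-2} = 1, c_{-1} = 1, c_0 = 1/2, c_1 = 1/6, c_2 = 1/24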
E. 10-70
We don't really know how to find a Laurent series. For a Taylor series, we can just keep differentiating and then get the coefficients. For Laurent series, the above integral is often almost impossible to evaluate. So the technique to compute a Laurent series is blind guesswork. Sometimes finding the Laurent series helps us to classify the singularity. However, sometimes it's easier to show that the function exhibits the defining property of a particular type of singularity than to compute the Laurent series.
• The Laurent series of 1/(1 − z) for |z| < 1 is the obvious geometric series, and its Laurent series for |z| > 1 can also be obtained using a geometric series:

    1/(1 − z) = −z^{−1}/(1 − z^{−1}) = −z^{−1} Σ_{j=0}^∞ z^{−j} = −Σ_{j=1}^∞ z^{−j}.

• We know sin z = z − z³/3! + z⁵/5! − · · · defines a holomorphic function, with a radius of convergence of ∞. Now consider cosec z = 1/sin z, which is holomorphic except at z = kπ, with k ∈ Z. So cosec z has a Laurent series near z = 0. Using

    sin z = z(1 − z²/6 + O(z⁴))    we get    cosec z = (1/z)(1 + z²/6 + O(z⁴)).

From this, we can read off that the Laurent series has cₙ = 0 for all n ≤ −2, c₋₁ = 1, c₁ = 1/6. If we want, we can go further, but we already see that cosec has a simple pole at z = 0. By periodicity, cosec has a simple pole at all its other singularities.
• Consider instead

    sin(1/z) = 1/z − 1/(3! z³) + 1/(5! z⁵) − · · · .

We see this is holomorphic on C∗, with cₙ ≠ 0 for infinitely many n < 0. So this has an isolated essential singularity.
• Consider cosec(1/z). This has singularities at z = 1/(kπ) for k ∈ N = {1, 2, 3, · · · }.


So it is not holomorphic at any punctured neighbourhood B(0, r) \ {0} of zero.


So this has a non-isolated singularity at zero, and there is no Laurent series in
a neighbourhood of zero.

• Now consider

    f(z) = e^z/(z² − 1).

This has a singularity at z₀ = 1 but is holomorphic in the annulus 0 < |z − z₀| < 2 (the 2 comes from the other singularity at z = −1). How do we find its Laurent series? This is a standard trick that turns out to be useful — we write everything in terms of ζ = z − z₀. Then

    f(z) = e^{z₀} e^ζ/(ζ(ζ + 2)) = (e^{z₀} e^ζ/(2ζ)) (1 + ζ/2)^{−1}
         = (e^{z₀}/(2ζ)) (1 + ζ + ζ²/2! + · · ·)(1 − ζ/2 + · · ·) = (e^{z₀}/(2ζ)) (1 + ζ/2 + · · ·)
         = e^{z₀} ( (1/2) · 1/(z − z₀) + 1/4 + · · · ).

This is now a Laurent series, with a₋₁ = e^{z₀}/2, a₀ = e^{z₀}/4, etc.

• f (z) = z −1/2 has no Laurent series about 0. The reason is that the required
branch cut of z −1/2 would pass through any annulus about z = 0. So we cannot
find an annulus on which f is holomorphic.

• Consider z²/((z − 1)³(z − i)²). This has a double pole at z = i and a triple pole at z = 1. To show formally that, for instance, there is a double pole at z = i, notice first that z²/(z − 1)³ is analytic at z = i. So it has a Taylor series, say,

    b₀ + b₁(z − i) + b₂(z − i)² + · · ·

for some bₙ. Moreover, since z²/(z − 1)³ is non-zero at z = i, we have b₀ ≠ 0. Hence

    z²/((z − 1)³(z − i)²) = b₀/(z − i)² + b₁/(z − i) + b₂ + · · · .

So this has a double pole at z = i.
• If g(z) has a zero of order N at z = z₀, then 1/g(z) has a pole of order N there, and vice versa. Hence cot z has a simple pole at the origin, because tan z has a simple zero there. To prove the general statement, write

    g(z) = (z − z₀)^N G(z)

for some G with G(z₀) ≠ 0. Then 1/G(z) has a Taylor series about z₀, and then the result follows.

• z² has a double pole at infinity, since 1/ζ² has a double pole at ζ = 0.
• f(z) = e^{1/z} = Σ_{n=0}^∞ (1/n!)(1/z)ⁿ has an isolated essential singularity at z = 0, because the Laurent coefficients cₙ are non-zero for all n ≤ 0, so infinitely many negative coefficients are non-zero. Another way to see that the singularity is essential is to note that zₙ = (inπ)^{−1} → 0 as n → ∞, yet f(zₙ) = e^{inπ} oscillates between −1 and 1. This can only happen if we have an essential singularity at 0. Sometimes, to show that a singularity is essential, finding such a sequence is easier than computing the Laurent series.

E. 10-71
<Series summation> We claim that the following two functions are equal and holomorphic on C \ Z:

    f(z) = Σ_{n=−∞}^∞ 1/(z − n)²,    g(z) = π²/sin²(πz).

Our strategy is as follows — we first show that f (z) converges and is holomorphic,
which is not hard, given the Weierstrass M -test and Morera’s theorem. To show
that indeed we have f (z) = g(z), we first show that they have equal principal
part, so that f (z) − g(z) is entire. We then show it is zero by proving f − g is
bounded, hence constant, and that f (z)−g(z) → 0 as z → ∞ (in some appropriate
direction).
For any fixed w ∈ C \ Z, we can compare it with $\sum 1/n^2$ and apply the Weierstrass
M-test. We pick r > 0 such that |w − n| > 2r for all n ∈ Z. Then for all
z ∈ B(w; r), we have |z − n| ≥ max{r, |n − |w| − r|}. Hence
$$\frac{1}{|z-n|^2} \le \min\left\{\frac{1}{r^2}, \frac{1}{(n-|w|-r)^2}\right\} = M_n.$$
By comparison to $\sum 1/n^2$, we know $\sum_n M_n$ converges. So by the Weierstrass M-
test, we know our series converges uniformly on B(w, r). We see that f|_{B(w,r)} is a
uniform limit of the holomorphic functions $\sum_{n=-N}^{N} 1/(z-n)^2$, and hence holomorphic
on B(w, r). Since w was arbitrary, we know f is holomorphic on C \ Z. Note that
we do not say the sum converges uniformly on C \ Z. It’s just that for any point
w ∈ C \ Z, there is a small neighbourhood of w on which the sum is uniformly
convergent, and this is sufficient to apply [P.10-54].
For the second part, note that f is periodic, since f(z + 1) = f(z). Also, at 0,
f has a double pole, since f(z) = 1/z² + (holomorphic) near z = 0. So f has a
double pole at each k ∈ Z. Note that 1/sin²(πz) also has a double pole at each
k ∈ Z.
Now, consider the principal parts of our functions — at k ∈ Z, f(z) has principal
part 1/(z − k)². Looking at our previous Laurent series for cosec(z), we see that
$$g(z) = \left(\frac{\pi}{\sin \pi z}\right)^2 \implies \lim_{z\to 0} z^2 g(z) = 1.$$

So g(z) must have the same principal part at 0 and hence at k for all k ∈ Z. Thus
h(z) = f (z) − g(z) is holomorphic on C \ Z. However, since its principal part
vanishes at the integers, it has at worst a removable singularity. Removing the
singularity, we know h(z) is entire.
Now we will show that h(z) = 0. We first show it is bounded. We know f
and g are both periodic with period 1. So it suffices to focus attention on the strip
− 12 ≤ x = Re(z) ≤ 12 . To show this is bounded on the rectangle, it suffices to
show that h(x + iy) → 0 as y → ±∞, by continuity. To do so, we show that f and
g both vanish as y → ∞. We set z = x + iy, with |x| ≤ ½. Then we have
$$|g(z)| \le \frac{4\pi^2}{|e^{\pi y} - e^{-\pi y}|^2} \to 0 \quad\text{as } y\to\infty,$$
$$|f(z)| \le \sum_{n\in\mathbb{Z}} \frac{1}{|x+iy-n|^2} \le \frac{1}{y^2} + 2\sum_{n=1}^{\infty} \frac{1}{(n-\frac12)^2 + y^2} \to 0 \quad\text{as } y\to\infty.$$
So h is bounded on the strip and tends to 0 as y → ∞; by periodicity it is bounded on all of C,
hence constant by Liouville's theorem. But if h → 0 as y → ∞, then the constant must be zero. So
we get h(z) = 0.
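As an added numerical illustration (not part of the original argument), one can compare a truncated version of the sum with π²/sin²(πz) at a non-integer point; the truncation error is O(1/N).

```python
import numpy as np

z = 0.3 + 0.7j                                        # any point in C \ Z
N = 100_000
f_trunc = sum(1/(z - n)**2 for n in range(-N, N + 1))  # partial sum of f(z)
g = (np.pi / np.sin(np.pi * z))**2                     # pi^2 / sin^2(pi z)
print(abs(f_trunc - g))                                # small, of order 1/N
```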

10.4 Residue calculus


L. 10-72
Let γ : [a, b] → C be a continuous closed curve, and pick a point w ∈ C \ image(γ).
Then there are continuous functions r : [a, b] → R>0 and θ : [a, b] → R such that
γ(t) = w + r(t)eiθ(t) .7

Clearly r(t) = |γ(t) − w| exists and is continuous, since it is the composition of


continuous functions. Note that this is never zero since γ(t) is never w. The actual
content is in defining θ.

To define θ(t), we for simplicity assume w = 0. Furthermore, by considering


instead the function γ(t)/r(t), which is continuous and well-defined since r is
never zero, we can assume |γ(t)| = 1 for all t.

Recall that the principal branch of log, and hence of the argument Im(log), takes
values in (−π, π) and is defined on C \ R≤0. If γ(t) always lay in, say, the
right-hand half-plane, we would have no problem defining θ consistently, since we
could just let θ(t) = arg(γ(t)) for arg the principal branch.
There is nothing special about the right-hand half-plane. Similarly, if γ lies in
$$\left\{ z : \operatorname{Re}\left(\frac{z}{e^{i\alpha}}\right) > 0 \right\}$$
for a fixed α, we can define
$$\theta(t) = \alpha + \arg\left(\frac{\gamma(t)}{e^{i\alpha}}\right).$$

Since γ : [a, b] → C is continuous, it is uniformly continuous, and we can find
a subdivision a = a_0 < a_1 < · · · < a_m = b such that if s, t ∈ [a_{i−1}, a_i], then
|γ(s) − γ(t)| < √2, and hence γ(s) and γ(t) belong to such a half-plane.
So we define θ_j : [a_{j−1}, a_j] → R such that γ(t) = e^{iθ_j(t)} for t ∈ [a_{j−1}, a_j], for
1 ≤ j ≤ m. On each interval [a_{j−1}, a_j], this gives a continuous argument function.
We cannot immediately extend this to the whole of [a, b], since it is entirely possible
that θj (aj ) 6= θj+1 (aj ). However, we do know that θj (aj ) are both values of the
argument of γ(aj ). So they must differ by an integer multiple of 2π, say 2nπ.
Then we can just replace θj+1 by θj+1 − 2nπ, which is an equally valid argument
function, and then the two functions will agree at aj . Hence, for j > 1, we can
successively re-define θj such that the resulting map θ is continuous.

If we assume γ to be C¹ (or piecewise C¹), then it's simpler and we can find a
more explicit expression for a continuous θ. Again wlog w = 0. Define
$$h(t) = \int_a^t \frac{\gamma'(s)}{\gamma(s)}\,ds, \quad\text{so that}\quad h'(t) = \frac{\gamma'(t)}{\gamma(t)}.$$
Now
$$0 = \frac{d}{dt}\left(\gamma e^{-h}\right) = \gamma' e^{-h} - \gamma h' e^{-h} = e^{-h}(\gamma' - \gamma h').$$
So γ(t) = γ(a)e^{h(t)}. Hence θ(t) = arg(γ(a)) + Im(h(t)).

⁷ Of course, at each point t, we can find r and θ such that the above holds. The key point of the
lemma is that we can do so continuously.


D. 10-73
• Let f : B_r(a) \ {a} → C be holomorphic with Laurent series $f(z) = \sum_{n=-\infty}^{\infty} c_n(z-a)^n$.
Then the residue of f at a is Res(f, a) = Res_f(a) = c_{−1}.
• Given a continuous path γ : [a, b] → C such that γ(a) = γ(b) and w ∉ image(γ),
the winding number of γ about w is $\frac{1}{2\pi}(\theta(b) - \theta(a))$, where θ : [a, b] → R is a
continuous function as in the lemma above. This is denoted by I(γ, w) (standing for
index) or n_γ(w) (standing for number).
E. 10-74
What is special about the coefficient c_{−1} in the Laurent series expansion? Well, it
is what is left when we integrate the function around a circle about the point: let
ρ < r, then
$$\oint_{\partial\bar B(a,\rho)} f(z)\,dz = \oint_{\partial\bar B(a,\rho)} \sum_{n=-\infty}^{\infty} c_n(z-a)^n\,dz
= \sum_{n=-\infty}^{\infty} c_n \oint_{\partial\bar B(a,\rho)} (z-a)^n\,dz
= c_{-1}\oint_{\partial\bar B(a,\rho)} \frac{dz}{z-a} = 2\pi i\, c_{-1},$$
where by uniform convergence we can swap the integral and sum, and then all the
other terms vanish since they have an anti-derivative. Indeed, by the definition of the
Laurent coefficients, $\oint_{\partial\bar B(a,\rho)} f(z)\,dz = 2\pi i c_{-1}$. So we can alternatively write the
residue as
$$\operatorname*{Res}_{f}(a) = \frac{1}{2\pi i}\oint_{\partial\bar B(a,\rho)} f(z)\,dz.$$
This gives us a formulation of the residue without reference to the Laurent series.
Deforming paths if necessary, it is not too far-fetched to imagine that for
any simple curve γ around the singularity a, we have $\int_\gamma f(z)\,dz = 2\pi i\operatorname{Res}(f, a)$.
Moreover, if the path actually encircles two singularities a and b, then deforming
the path, we would expect to have $\int_\gamma f(z)\,dz = 2\pi i(\operatorname{Res}(f, a) + \operatorname{Res}(f, b)),$
and this generalizes to multiple singularities in the obvious way. If this were true,
then it would be very helpful, since this turns integration into addition, which is
(hopefully) much easier!
Indeed, we will soon prove that this result holds. However, we first get rid of
the technical restriction that we only work with simple (ie. non-self intersecting)
curves. This is completely unnecessary. We are not actually worried about the
curve intersecting itself. The reason why we've always talked about simple closed
curves is that we want to avoid the curve going around the same point many times.
There is a simple workaround to this problem — we consider arbitrary curves, and
then count how many times we are looping around the point. If we are looping
around it twice, then we count its contribution twice!

Indeed, suppose we have a closed curve that loops around a singularity a twice.
By the additivity of the integral, we can break this curve
into two closed contours. So we have
$$\frac{1}{2\pi i}\int_\gamma f(z)\,dz = 2\operatorname*{Res}_f(a).$$

So we define what it means for a curve to loop around a point n times, called the
winding number. There are many ways we can define the winding number. The
definition we pick is based on the following observation — suppose, for convenience,
that the point in question is the origin. As we move along a simple closed curve
around 0, our argument will change. If we keep track of our argument continuously,
then we will find that when we return to starting point, the argument would have
increased by 2π. If we have a curve that winds around the point twice, then our
argument will increase by 4π. What we do is exactly this — given a path, find
a continuous function that gives the “argument” of the path, and then define the
winding number to be the difference between the argument at the start and end
points, divided by 2π.
Note that we always have I(γ, w) ∈ Z, since θ(b) and θ(a) are arguments of
the same number. More importantly, I(γ, w) is well-defined — suppose γ(t) =
r(t)eiθ1 (t) = r(t)eiθ2 (t) for continuous functions θ1 , θ2 : [a, b] → R. Then θ1 − θ2 :
[a, b] → R is continuous, but takes values in the discrete set 2πZ. So it must in
fact be constant, and thus θ1 (b) − θ1 (a) = θ2 (b) − θ2 (a).
L. 10-75
Suppose γ : [a, b] → C is a piecewise C¹-smooth closed path, and w ∉ image(γ). Then
$$I(\gamma, w) = \frac{1}{2\pi i}\int_\gamma \frac{1}{z-w}\,dz.$$

Let γ(t) − w = r(t)e^{iθ(t)}, with now r and θ piecewise C¹-smooth. Then
$$\int_\gamma \frac{1}{z-w}\,dz = \int_a^b \frac{\gamma'(t)}{\gamma(t)-w}\,dt = \int_a^b \left(\frac{r'(t)}{r(t)} + i\theta'(t)\right)dt
= \left[\ln r(t) + i\theta(t)\right]_a^b = i(\theta(b)-\theta(a)) = 2\pi i\, I(\gamma, w).$$

In some books, this integral expression is taken as the definition of the winding
number. While this is elegant in complex analysis, it is not clear a priori that this
is an integer, and only works for piecewise C 1 -smooth closed curves, not arbitrary
continuous closed curves.
On the other hand, what is evident from this expression is that I(γ, w) is contin-
uous as a function of w ∈ C \ image(γ), since it is even holomorphic as a function
of w. Since I(γ; w) is integer valued, I(γ; w) must be locally constant on path
components of C \ image(γ).
We can quickly verify that this is a sensible definition, in that the winding number
around a point “outside” the curve is zero. More precisely, since image(γ) is
compact (so contained in some B̄r (0)), all points of sufficiently large modulus in C
belong to one component of C \ image(γ). This is indeed the only path component
of C \ image(γ) that is unbounded.
To find the winding number about a point in this unbounded component, note
that I(γ; w) is constant on this component, and so we can consider arbitrarily
large w. By the integral formula,
$$|I(\gamma, w)| \le \frac{1}{2\pi}\,\mathrm{length}(\gamma)\,\max_{z\in\gamma}\frac{1}{|w-z|} \to 0 \quad\text{as } w\to\infty.$$

So it does vanish outside the curve. Alternatively one could show this by noting
that if w is outside of B̄r (0), we can draw a line through w such that B̄r (0) is
entirely on one side of the line, hence we can define a branch of log so that 1/(z−w)
has an anti-derivative, and so the integral vanishes. Of course, inside the other path
components, we can still have some interesting values of the winding number.
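As an added illustration (not in the original notes), the integral formula can be checked numerically by discretising a contour and summing (1/2πi) Δz/(z − w); the helper below is a hypothetical sketch.

```python
import numpy as np

def winding_number(pts, w):
    """Approximate (1/2πi) ∮ dz/(z - w) for a closed polygonal contour pts."""
    z = np.append(pts, pts[0])            # close the curve
    dz = np.diff(z)
    mid = (z[:-1] + z[1:]) / 2            # midpoint rule
    return np.sum(dz / (mid - w)) / (2j * np.pi)

t = np.linspace(0, 4 * np.pi, 4000)       # parameter runs around the circle twice
curve = np.exp(1j * t)
print(winding_number(curve, 0).real)      # ≈ 2  (loops around 0 twice)
print(winding_number(curve, 3 + 0j).real) # ≈ 0  (point outside the curve)
```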
D. 10-76
Let U ⊆ C be a domain, and let φ : [a, b] → U and ψ : [a, b] → U be piecewise
C 1 -smooth closed paths. A homotopy from φ to ψ is a continuous map F :
[0, 1] × [a, b] → U such that F (0, t) = φ(t) and F (1, t) = ψ(t), and moreover for all
s ∈ [0, t] the map t 7→ F (s, t) viewed as a map [a, b] → U is closed and piecewise
C 1 -smooth.
E. 10-77
The idea now is to define a more general and natural notion of deforming a curve,
known as “homotopy”. We will then show that each homotopy can be given by
a sequence of elementary deformations. So homotopies also preserve integrals
of holomorphic functions. We can imagine this as a process of “continuously
deforming” the path φ to ψ, with a path F (s, · ) at each point in time s ∈ [0, 1].
P. 10-78
Let φ, ψ : [a, b] → U be homotopic (piecewise C 1 ) closed paths in a domain U .
1. There exists a sequence of paths φ = φ0 , φ1 , · · · , φN = ψ such that each φj is
piecewise C 1 closed and φi+1 is obtained from φi by elementary deformation.
2. If f : U → C is holomorphic, then $\int_\phi f(z)\,dz = \int_\psi f(z)\,dz$.

Clearly 2 is a consequence of 1, so just need to prove 1. This is an exercise in


uniform continuity. We let F : [0, 1] × [a, b] → U be a homotopy from φ to ψ.
Since image(F ) is compact and U is open, by [L.10-45] there is some ε > 0 such
that B(F (s, t), ε) ⊆ U for all (s, t) ∈ [0, 1] × [a, b]. Since F is uniformly continuous,
there is some δ such that k(s, t) − (s0 , t0 )k < δ implies |F (s, t) − F (s0 , t0 )| < ε.
Now pick n ∈ N such that $\frac{1}{n}\|(1, b-a)\| < \delta$, and let $x_j = a + (b-a)\frac{j}{n}$. Write
$I_j = [x_{j-1}, x_j]$. Let $\phi^i(t) = F(\frac{i}{n}, t)$ and $\phi^i_j = \phi^i|_{I_j}$. Also let $C^i_j = B(F(\frac{i}{n}, x_j), \varepsilon)$;
then $C^i_j$ is clearly convex. Also we see that $\phi^{i+1}_j, \phi^i_j \subseteq C^i_j$, hence $\phi^{i+1}$ is obtained
from $\phi^i$ by elementary deformation.
Note that the numbers are cooked up precisely so that if $s \in [\frac{i}{n}, \frac{i+1}{n}]$ and $t \in I_j$, then
$F(s, t) \in C^i_j$.
2. means the integral around any path depends only on the homotopy class of
the path, and not the actual path itself. We can now use this to “upgrade” our
Cauchy’s theorem to allow arbitrary simply connected domains. The theorem
will become immediate if we adopt the following alternative definition of a simply


connected domain:
P. 10-79
A domain U is simply connected iff every C 1 smooth closed path is homotopic to
a constant path (i.e. a path that is just a point).

This is in fact equivalent to our definition of simply connectedness that every con-
tinuous map S 1 → U can be extended to a continuous map D2 → U , which is
equivalent to saying that any loop can be continuously shrunk to a point. This
result is basically this, except that we now only work with piecewise C¹ paths
(instead of general continuous paths). We can show their equivalence by approx-
imating any continuous curve with a piecewise C 1 -smooth one, but we shall not
do that here.
In fact U being simply connected is also equivalent to the following:
1. I(γ, w) = 0 for any closed curve γ in U and any w 6∈ U .
2. The complement of U in the extended complex plane C∞ is connected.
P. 10-80
<Cauchy's theorem> Let U be a simply connected domain, and let f : U → C
be holomorphic. If γ is any piecewise C¹-smooth closed curve in U, then
$$\int_\gamma f(z)\,dz = 0.$$

γ is homotopic to the constant path, and the integral along a constant path is
zero.
We will sometimes refer to this theorem as “simply-connected Cauchy”. This
theorem in fact also follows from Green’s theorem (that is if we can prove Green’s
theorem rigorously). Recall Green’s theorem: Let ∂S be a positively oriented,
piecewise smooth, simple closed curve in a plane, and let S be the region bounded
by ∂S. If P and Q are functions of (x, y) defined on an open region containing S
and have continuous partial derivatives there, then
$$\oint_{\partial S} (P\,dx + Q\,dy) = \iint_S \left(\frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y}\right) dx\,dy.$$
Let u, v be the real and imaginary parts of f . Then
$$\oint_\gamma f(z)\,dz = \oint_\gamma (u+iv)(dx + i\,dy) = \oint_\gamma (u\,dx - v\,dy) + i\oint_\gamma (v\,dx + u\,dy)$$
$$= \iint_S \left(-\frac{\partial v}{\partial x} - \frac{\partial u}{\partial y}\right) dx\,dy + i\iint_S \left(\frac{\partial u}{\partial x} - \frac{\partial v}{\partial y}\right) dx\,dy.$$
But both integrands vanish by the Cauchy–Riemann equations, since f is differ-
entiable throughout S. So the result follows. This proof relies on u and v having
continuous partial derivatives. We know this is true since a holomorphic func-
tion f is infinitely differentiable; however, our proof that holomorphic functions are
infinitely differentiable uses Cauchy's theorem! So we would still have to do
most of the work we did before even if we assumed Green's theorem.
One useful consequence of Cauchy’s theorem is that we can freely deform contours
along regions where f is holomorphic without changing the value of the integral.
More precisely, if γ_1 and γ_2 are contours from a to b, and f is holomorphic
on the contours and between the contours, then $\int_{\gamma_1} f(z)\,dz = \int_{\gamma_2} f(z)\,dz$. In
particular, if f is a holomorphic function defined on a simply connected domain,
then $\int_a^b f(z)\,dz$ does not depend on the chosen contour. This path
independence is very much related to viewing $\int f(z)\,dz$ as a path integral in R², because
$$f(z)\,dz = (u+iv)(dx + i\,dy) = (u+iv)\,dx + (-v+iu)\,dy$$
is an exact differential, since $\frac{\partial}{\partial y}(u+iv) = \frac{\partial}{\partial x}(-v+iu)$ by the Cauchy–Riemann
equations.
T. 10-81
<Cauchy’s residue theorem> Let {z1 , · · · , zk } ⊆ U , where U is a simply
connected domain. If f : U \ {z1 , · · · , zk } → C is holomorphic and γ : [a, b] →
U \ {z1 , · · · , zk } is a piecewise C 1 -smooth closed curve, then
$$\frac{1}{2\pi i}\int_\gamma f(z)\,dz = \sum_{j=1}^{k} I(\gamma, z_j)\operatorname{Res}(f; z_j).$$

At each z_i, f has a Laurent expansion $f(z) = \sum_{n\in\mathbb{Z}} c^{(i)}_n (z-z_i)^n$ valid in some
neighbourhood of z_i. Let g_i(z) be the principal part, $g_i(z) = \sum_{n=-\infty}^{-1} c^{(i)}_n (z-z_i)^n$.
From the proof of the Laurent series, we know gi (z) gives a holomorphic function
on U \ {zi }.
We now consider f − g_1 − g_2 − · · · − g_k, which is holomorphic on U \ {z_1, · · · , z_k},
and has a removable singularity at each z_i. So $\int_\gamma (f - g_1 - \cdots - g_k)(z)\,dz = 0$ by
simply-connected Cauchy. Hence we know
$$\int_\gamma f(z)\,dz = \sum_{j=1}^{k} \int_\gamma g_j(z)\,dz.$$

For each j, we use uniform convergence of the series $\sum_{n\le -1} c^{(j)}_n (z-z_j)^n$ on compact
subsets of U \ {z_j}, and hence on γ, to write
$$\int_\gamma g_j(z)\,dz = \sum_{n\le -1} c^{(j)}_n \int_\gamma (z-z_j)^n\,dz = c^{(j)}_{-1}\int_\gamma \frac{1}{z-z_j}\,dz.$$
The last equality holds since for n ≠ −1, the function (z − z_j)^n has an antiderivative,
and hence the integral around γ vanishes. But $c^{(j)}_{-1}$ is by definition the residue of
f at z_j, and the integral is just the integral definition of the winding number (up
to a factor of 2πi). So we get
$$\int_\gamma f(z)\,dz = 2\pi i\sum_{j=1}^{k} \operatorname{Res}(f; z_j)\, I(\gamma, z_j).$$

The Cauchy integral formula and simply-connected Cauchy are special cases of
this. This is in some sense a mix of all the results we've previously had. Simply-
connected Cauchy tells us the integral of a holomorphic f around a closed curve
depends only on its homotopy class, ie. we can deform curves by homotopy and
this preserves the integral. This means the value of the integral really only depends
on the “holes” enclosed by the curve.
We also had the Cauchy integral formula. This says if f : B(a, r) → C is holomor-
phic, w ∈ B(a, ρ) and ρ < r, then
$$f(w) = \frac{1}{2\pi i}\int_{\partial\bar B(a,\rho)} \frac{f(z)}{z-w}\,dz.$$
Note that f(w) also happens to be the residue of the function f(z)/(z − w). So this really
says that if g has a simple pole at a inside the region bounded by a simple closed curve
γ, then
$$\frac{1}{2\pi i}\int_\gamma g(z)\,dz = \operatorname{Res}(g, a).$$
The Cauchy’s residue theorem says the result holds for any type of singularities,
and any number of singularities.
We can picture a simple case of our proof as follows. Consider a simple curve γ
encircling singularities z_1, z_2, z_3.
Consider the simple curve γ̂, consisting of small clockwise circles γ_1, · · · , γ_n around
each singularity; cross cuts, which cancel in the limit as they approach each other,
in pairs; and the large outer curve (which is the same as γ in the limit). Note that
γ̂ encircles no singularities. So $\oint_{\hat\gamma} f(z)\,dz = 0$ by Cauchy's theorem. So in the
limit when the cross cuts cancel, we have
$$\oint_\gamma f(z)\,dz + \sum_{k=1}^{n}\oint_{\gamma_k} f(z)\,dz = \oint_{\hat\gamma} f(z)\,dz = 0.$$
But from what we did in the previous section, we know $\oint_{\gamma_k} f(z)\,dz = -2\pi i\operatorname{Res}(f, z_k)$
(since γ_k encircles only one singularity, and we get a negative sign since γ_k is a
clockwise contour). So $\oint_\gamma f(z)\,dz = \sum_{k=1}^{n} 2\pi i\operatorname{Res}(f, z_k)$.
L. 10-82
Let f : U \ {a} → C be holomorphic with a pole at a, i.e. f is meromorphic on U.
1. If the pole is simple, then Res(f, a) = lim_{z→a} (z − a)f(z).
2. If there exist g, h holomorphic on some B(a, ε) with g(a) ≠ 0 and h with
   a simple zero at a, such that
   $$f(z) = \frac{g(z)}{h(z)}, \quad\text{then}\quad \operatorname{Res}(f, a) = \frac{g(a)}{h'(a)}.$$
3. If, near a, there exists a holomorphic g with g(a) ≠ 0 such that
   $$f(z) = \frac{g(z)}{(z-a)^k}, \quad\text{then}\quad \operatorname{Res}(f, a) = \frac{g^{(k-1)}(a)}{(k-1)!}.$$
1. By definition, if f has a simple pole at a, then
   $$f(z) = \frac{c_{-1}}{z-a} + c_0 + c_1(z-a) + \cdots,$$
   and by definition c_{−1} = Res(f, a). Then the result is obvious.
2. This is basically L'Hôpital's rule. By the previous part, we have
   $$\operatorname{Res}(f; a) = \lim_{z\to a}(z-a)\frac{g(z)}{h(z)} = \frac{g(a)}{h'(a)}.$$
3. We know the residue Res(f; a) is the coefficient of (z − a)^{k−1} in the Taylor
   series of g at a, which is exactly $\frac{1}{(k-1)!}g^{(k-1)}(a)$.
Note that result 3 applies when we have a pole of order k; we can also write the
result as
$$\operatorname{Res}(f, a) = \lim_{z\to a}\frac{1}{(k-1)!}\frac{d^{k-1}}{dz^{k-1}}\left[(z-a)^k f(z)\right],$$
which resembles the formula in result 1.
E. 10-83
• Consider f(z) = e^z/z³. We can find the residue by directly computing the Laurent
series about z = 0:
$$\frac{e^z}{z^3} = z^{-3} + z^{-2} + \frac{1}{2}z^{-1} + \frac{1}{3!} + \cdots.$$
Hence the residue is ½. Alternatively, we can use the fact that f has a pole of
order 3 at z = 0. So we can use the formula to obtain
$$\operatorname{Res}(f, 0) = \lim_{z\to 0}\frac{1}{2!}\frac{d^2}{dz^2}\left(z^3 f(z)\right) = \lim_{z\to 0}\frac{1}{2}\frac{d^2}{dz^2}e^z = \frac{1}{2}.$$

• Consider h(z) = (z⁸ − w⁸)^{−1}, for any complex constant w. We know this has 8
simple poles at z = we^{nπi/4} for n = 0, · · · , 7. What is the residue at z = w? We
can try to compute this directly by
$$\operatorname{Res}(h, w) = \lim_{z\to w}\frac{z-w}{(z-w)(z-we^{i\pi/4})\cdots(z-we^{7\pi i/4})}
= \frac{1}{(w-we^{i\pi/4})\cdots(w-we^{7\pi i/4})} = \frac{1}{w^7}\,\frac{1}{(1-e^{i\pi/4})\cdots(1-e^{7i\pi/4})}.$$
Now we are quite stuck. We don't know what to do with this. We can think really
hard about complex numbers and figure out what it should be, but this is difficult.
What we should do is to apply L'Hôpital's rule and obtain
$$\operatorname{Res}(h, w) = \lim_{z\to w}\frac{z-w}{z^8-w^8} = \frac{1}{8w^7}.$$

• Consider the function (sinh πz)−1 . This has a simple pole at z = ni for all integers
n (because the zeros of sinh z are at nπi and are simple). Again, we can compute
this by finding the Laurent expansion. However, it turns out it is easier to use our
magic formula together with L’Hôpital’s rule. We have
$$\lim_{z\to ni}\frac{z-ni}{\sinh\pi z} = \lim_{z\to ni}\frac{1}{\pi\cosh\pi z} = \frac{1}{\pi\cosh n\pi i} = \frac{1}{\pi\cos n\pi} = \frac{(-1)^n}{\pi}.$$
• Consider the function (sinh³ z)^{−1}. This time, we find the residue by looking at
the Laurent series. We first look at sinh³ z. This has a zero of order 3 at z = πi.
Its Taylor series is sinh³ z = −(z − πi)³ − ½(z − πi)⁵ + · · · . So
$$\frac{1}{\sinh^3 z} = -(z-\pi i)^{-3}\left(1 + \frac{1}{2}(z-\pi i)^2 + \cdots\right)^{-1}
= -(z-\pi i)^{-3}\left(1 - \frac{1}{2}(z-\pi i)^2 + \cdots\right)
= -(z-\pi i)^{-3} + \frac{1}{2}(z-\pi i)^{-1} + \cdots.$$
Therefore the residue is ½.
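These residues are easy to double-check by machine; the following sympy sketch (an illustrative addition, not part of the notes) reproduces the values computed above.

```python
import sympy as sp

z = sp.symbols('z')

print(sp.residue(sp.exp(z)/z**3, z, 0))             # 1/2
print(sp.residue(1/(z**8 - 2**8), z, 2))            # 1/1024 = 1/(8*2**7), i.e. 1/(8 w^7) with w = 2
print(sp.residue(1/sp.sinh(sp.pi*z), z, sp.I))      # -1/pi   (the n = 1 pole)
print(sp.residue(1/sp.sinh(z)**3, z, sp.pi*sp.I))   # 1/2
```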
E. 10-84
Compute the integral $\int_0^\infty \frac{1}{1+x^4}\,dx$.

We consider the closed semicircular contour γ_R consisting of the segment [−R, R] of the real
axis together with the semicircular arc of radius R in the upper half-plane.
We notice 1/(1 + z⁴) has poles where z⁴ = −1, i.e. at e^{iπ/4}, e^{3iπ/4}, e^{5iπ/4} and e^{7iπ/4};
two of the poles lie in the unbounded region outside the contour, so I(γ_R, ·) = 0 for
these. We can write the integral as
$$\int_{\gamma_R} \frac{1}{1+z^4}\,dz = \int_{-R}^{R} \frac{1}{1+x^4}\,dx + \int_0^{\pi} \frac{iRe^{i\theta}}{1+R^4e^{4i\theta}}\,d\theta.$$

The first term is something we care about, while the second is something we
despise. So we might want to get rid of it. We notice the integrand of the second
integral is O(R−3 ). Since we are integrating it over something of length R, the
whole thing tends to 0 as R → ∞. We also know the left-hand side is just
$$\int_{\gamma_R} \frac{1}{1+z^4}\,dz = 2\pi i\left(\operatorname{Res}(f, e^{i\pi/4}) + \operatorname{Res}(f, e^{3i\pi/4})\right).$$

So we just have to compute the residues. But our function is of the form given by
part 2 of the lemma above. So we know
$$\operatorname{Res}(f, e^{i\pi/4}) = \left.\frac{1}{4z^3}\right|_{z=e^{i\pi/4}} = \frac{1}{4}e^{-3\pi i/4},$$
and similarly at e^{3iπ/4}. On the other hand, as R → ∞, the first integral on the
right is $\int_{-\infty}^{\infty}(1+x^4)^{-1}\,dx$, which is, by evenness, twice what we want. So
$$2\int_0^\infty \frac{1}{1+x^4}\,dx = \int_{-\infty}^{\infty} \frac{1}{1+x^4}\,dx = -\frac{2\pi i}{4}\left(e^{i\pi/4} + e^{3\pi i/4}\right) = \frac{\pi}{\sqrt 2}.$$
Hence our integral is $\int_0^\infty (1+x^4)^{-1}\,dx = \frac{\pi}{2\sqrt 2}$.
When computing contour integrals, there are two things we have to decide. First,
we need to pick a nice contour to integrate along. Secondly, as we will see in the
next example, we have to decide what function to integrate.
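As a quick numerical cross-check of the value just obtained (an added aside, not from the notes):

```python
from scipy.integrate import quad
import numpy as np

value, _ = quad(lambda x: 1/(1 + x**4), 0, np.inf)
print(value, np.pi/(2*np.sqrt(2)))   # both ≈ 1.1107
```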
E. 10-85
Compute $\int_{\mathbb{R}} \frac{\cos x}{1+x+x^2}\,dx$.

We know cos, as a complex function, is everywhere
holomorphic, and 1 + x + x² has two simple zeroes,
namely at the primitive cube roots of unity e^{2πi/3} and e^{−2πi/3}.
We pick the same semicircular contour as before, and write ω = e^{2πi/3} (the pole inside the contour). Life
would be good if cos were bounded, for the integrand
would then be O(R^{−2}) and the circular integral would vanish. Unfortunately, cos(z) is large at, say, iR. So
instead, we consider
$$f(z) = \frac{e^{iz}}{1+z+z^2}.$$

Now, again by the previous lemma, we get Res(f; ω) = e^{iω}/(2ω + 1). On the
semicircle, we have
$$\left|\int_0^\pi f(Re^{i\theta})Re^{i\theta}\,d\theta\right| \le \int_0^\pi \frac{Re^{-R\sin\theta}}{|R^2e^{2i\theta}+Re^{i\theta}+1|}\,d\theta,$$
which is O(R^{−1}). So this vanishes as R → ∞. What remains is not quite the
integral we want, but we can just take the real part. We have
$$\int_{\mathbb{R}} \frac{\cos x}{1+x+x^2}\,dx = \operatorname{Re}\int_{\mathbb{R}} f(z)\,dz
= \operatorname{Re}\left(\lim_{R\to\infty}\int_{\gamma_R} f(z)\,dz\right)
= \operatorname{Re}\left(2\pi i\operatorname{Res}(f, \omega)\right)
= \frac{2\pi}{\sqrt 3}\,e^{-\sqrt 3/2}\cos\frac{1}{2}.$$

E. 10-86
Compute $\int_0^{\pi/2} \frac{1}{1+\sin^2 t}\,dt$.

We use the expression for sin in terms of the exponential function, namely
$\sin t = \frac{1}{2i}(e^{it}-e^{-it})$. So if we are on the unit circle, and z = e^{it}, then
$\sin t = \frac{1}{2i}(z-z^{-1})$. Moreover, we can check $\frac{dz}{dt} = ie^{it}$, so $dt = \frac{dz}{iz}$. Hence we get
$$\int_0^{\pi/2} \frac{dt}{1+\sin^2 t} = \frac{1}{4}\int_0^{2\pi} \frac{dt}{1+\sin^2 t}
= \frac{1}{4}\int_{|z|=1} \frac{1}{1-(z-z^{-1})^2/4}\,\frac{dz}{iz}
= \int_{|z|=1} \frac{iz}{z^4-6z^2+1}\,dz.$$

The denominator is a quadratic in z², which we can solve. We
find the roots to be ±(√2 − 1) and ±(√2 + 1); only the first pair lies inside the unit circle. The residues
at the points √2 − 1 and −√2 + 1 each give −√2 i/16. So the
integral we want is
$$\int_0^{\pi/2} \frac{dt}{1+\sin^2 t} = 2\pi i\left(\frac{-\sqrt 2\, i}{16} + \frac{-\sqrt 2\, i}{16}\right) = \frac{\pi}{2\sqrt 2}.$$
Another class of integrals that often comes up is integrals of trigonometric functions,
where we are integrating along the unit circle. Most rational functions of
trigonometric functions can be integrated around |z| = 1 in this way, using the fact that
$$\sin(kt) = \frac{e^{ikt}-e^{-ikt}}{2i} = \frac{z^k-z^{-k}}{2i}, \qquad
\cos(kt) = \frac{e^{ikt}+e^{-ikt}}{2} = \frac{z^k+z^{-k}}{2}.$$
L. 10-87
Let f : B(a, r) \ {a} → C be holomorphic, and suppose f has a simple pole at a.
We let γε : [α, β] → C be given by t 7→ a + εeit . Then
$$\lim_{\varepsilon\to 0}\int_{\gamma_\varepsilon} f(z)\,dz = (\beta-\alpha)\,i\operatorname{Res}(f, a).$$

We can write $f(z) = \frac{c}{z-a} + g(z)$ near a, where c = Res(f; a), and g : B(a, δ) → C
is holomorphic. We take ε < δ. Then
$$\left|\int_{\gamma_\varepsilon} g(z)\,dz\right| \le (\beta-\alpha)\cdot\varepsilon\sup_{z\in\gamma_\varepsilon}|g(z)|.$$
But g is bounded on B(a, δ). So this vanishes as ε → 0. So the remaining integral
is
$$\lim_{\varepsilon\to 0}\int_{\gamma_\varepsilon} \frac{c}{z-a}\,dz
= c\lim_{\varepsilon\to 0}\int_{\gamma_\varepsilon} \frac{1}{z-a}\,dz
= c\lim_{\varepsilon\to 0}\int_\alpha^\beta \frac{1}{\varepsilon e^{it}}\cdot i\varepsilon e^{it}\,dt = i(\beta-\alpha)c.$$

L. 10-88
<Jordan’s lemma> Let f be holomorphic on a neighbourhood of infinity in C
(i.e. on {|z| > r} for some r > 0), and that zf (z) is bounded in this region. Let
γR (t) = Reit for t ∈ [0, π] (which is not closed). Then for α > 0, we have
$$\int_{\gamma_R} f(z)e^{i\alpha z}\,dz \to 0 \quad\text{as } R\to\infty.$$

By assumption, we have |f(z)| ≤ M/|z| for large |z| and some constant M > 0.
We also have |e^{iαz}| = e^{−Rα sin t} on γ_R. To avoid messing with sin t, we note that
on (0, π/2], the function (sin t)/t is decreasing. Then by considering the end points, we
find sin t ≥ 2t/π for t ∈ [0, π/2]. This gives us the bound
$$|e^{i\alpha z}| = e^{-R\alpha\sin t} \le
\begin{cases} e^{-R\alpha\,2t/\pi} & 0 \le t \le \frac{\pi}{2}\\ e^{-R\alpha\,2t'/\pi} & 0 \le t' = \pi - t \le \frac{\pi}{2}\end{cases}$$
So we get
$$\left|\int_0^{\pi/2} e^{iR\alpha e^{it}} f(Re^{it})\,Re^{it}\,dt\right|
\le \int_0^{\pi/2} e^{-2\alpha Rt/\pi}\cdot M\,dt = \frac{M\pi}{2\alpha R}\left(1-e^{-\alpha R}\right) \to 0$$
as R → ∞. The estimate for $\int_{\pi/2}^{\pi} f(z)e^{i\alpha z}\,dz$ is analogous.
Note that the condition that zf (z) is bounded near infinity is saying that its
Laurent series only has terms of negative power, which is equivalent to saying that
f (z) → 0 as |z| → ∞.
This lemma allows us to consider integrals on expanding semicircles. In previous
cases, we had f(z) = O(R^{−2}), and then we could bound the integral simply as
O(R^{−1}) → 0. In this case, we only require f(z) = O(R^{−1}). The drawback is that
$\int_{\gamma_R} f(z)\,dz$ on its own need not vanish.
However, if we have the extra help from the factor e^{iαz}, then we do get that the integral
vanishes.
E. 10-89
Show that $\int_0^\infty \frac{\sin x}{x}\,dx = \frac{\pi}{2}$.

Note that sin(x)/x has a removable singularity at x = 0. So everything is fine. Our
first thought might be to use our usual (upper) semicircular contour. But if we
take the function sin(z)/z, then we get no control at iR ∈ γ_R. So what
we would like to do is to replace the sine with an exponential, by noting that
$$\int_0^\infty \frac{\sin x}{x}\,dx = \operatorname{Im}\int_0^\infty \frac{e^{ix}}{x}\,dx.$$
So we consider f(z) = e^{iz}/z; this is helpful since we can apply Jordan's lemma.
However, we now have the problem that f has a simple pole at 0.
So instead we consider a modified contour γ_{R,ε}: the usual upper semicircular contour of radius R,
indented by a small semicircular arc of radius ε around the origin. The singularity of
e^{iz}/z then lies outside the contour, and Cauchy's theorem says $\int_{\gamma_{R,\varepsilon}} f(z)\,dz = 0$.
Considering the R-semicircle γ_R, and using Jordan's lemma with α = 1 and 1/z as the function, we
know $\int_{\gamma_R} f(z)\,dz \to 0$ as R → ∞.
Considering the ε-semicircle γ_ε, and using the first lemma, we get a contribution
of −iπ, where the sign comes from the orientation. Rearranging, and using the
fact that the integrand is even, we get the desired result.
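The Dirichlet integral is easy to confirm symbolically; the following sympy one-liner is an added check, not part of the notes.

```python
import sympy as sp

x = sp.symbols('x')
print(sp.integrate(sp.sin(x)/x, (x, 0, sp.oo)))   # pi/2
```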
E. 10-90

Evaluate $\int_{-\infty}^{\infty} \frac{e^{ax}}{\cosh x}\,dx$ where a ∈ (−1, 1) is a real constant.

Note that the function f(z) = e^{az}/cosh z has simple poles where z = (n + ½)iπ
for n ∈ Z. So if we did as we have done above, then we would run into infinitely
many singularities, which is not fun.
Instead, we note that cosh(x + iπ) = −cosh x and
consider a rectangular contour with corners at ±R and ±R + iπ, consisting of the
segment γ_0 along the real axis, the top side γ_1 at height iπ, and the two vertical sides
γ_vert^±. We now enclose only one singularity, namely ρ = iπ/2, where
$$\operatorname{Res}(f, \rho) = \frac{e^{a\rho}}{\cosh'(\rho)} = -ie^{a\pi i/2}.$$

We first want to see what happens at the edges. We have
$$\int_{\gamma_{\mathrm{vert}}^+} f(z)\,dz = \int_0^\pi \frac{e^{a(R+iy)}}{\cosh(R+iy)}\,i\,dy,$$
hence we can bound this as
$$\left|\int_{\gamma_{\mathrm{vert}}^+} f(z)\,dz\right| \le \int_0^\pi \frac{2e^{aR}}{e^{R}-e^{-R}}\,dy \to 0 \quad\text{as } R\to\infty,$$
since a < 1. We can do a similar bound for γvert− , where we use the fact that
a > −1. Thus, letting R → ∞, we get

$$\int_{-\infty}^{\infty} \frac{e^{ax}}{\cosh x}\,dx + \int_{+\infty}^{-\infty} \frac{e^{a\pi i}e^{ax}}{\cosh(x+i\pi)}\,dx = 2\pi i\left(-ie^{a\pi i/2}\right).$$
Using the fact that cosh(x + iπ) = −cosh(x), we get
$$\int_{-\infty}^{\infty} \frac{e^{ax}}{\cosh x}\,dx = \frac{2\pi e^{a i\pi/2}}{1+e^{a\pi i}} = \pi\sec\left(\frac{\pi a}{2}\right).$$

E. 10-91

Show that $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$.

Recall we just avoided having to encircle infinitely many poles by picking a rectangular


contour. Here we do the opposite — we encircle infinitely many poles, and then
we can use this to evaluate the infinite sum of residues using contour integrals.

We consider the function f(z) = π cot(πz)/z², which is holomorphic on C except
for simple poles at Z \ {0}, and a triple pole at 0. We can check that at n ∈ Z \ {0},
we can write
$$f(z) = \frac{\pi\cos(\pi z)}{z^2}\,\frac{1}{\sin(\pi z)},$$
where sin(πz) has a simple zero at n, and the first factor is
non-vanishing at n ≠ 0. Then, using part 2 of the residue lemma, we compute
$$\operatorname{Res}(f; n) = \frac{\pi\cos(\pi n)}{n^2}\,\frac{1}{\pi\cos(\pi n)} = \frac{1}{n^2}.$$

Note that the reason why we have those funny π's all around the place is so that
we can get this nice expression for the residue. At z = 0, we get
$$\cot z = \left(1 - \frac{z^2}{2} + O(z^4)\right)\left(z - \frac{z^3}{6} + O(z^5)\right)^{-1} = \frac{1}{z} - \frac{z}{3} + O(z^2).$$
So we get
$$\frac{\pi\cot(\pi z)}{z^2} = \frac{1}{z^3} - \frac{\pi^2}{3z} + \cdots.$$
So the residue at 0 is −π²/3. Now we consider the square contour γ_N with corners at
±(N + ½) ± (N + ½)i. Since we don't want the contour itself to
pass through singularities, we make the square pass through ±(N + ½) on the real axis.
Then the residue theorem says
$$\int_{\gamma_N} f(z)\,dz = 2\pi i\left(2\sum_{n=1}^{N} \frac{1}{n^2} - \frac{\pi^2}{3}\right).$$
We can thus get the desired series if we can show that $\int_{\gamma_N} f(z)\,dz \to 0$ as N → ∞.
We first note that
$$\left|\int_{\gamma_N} f(z)\,dz\right| \le \sup_{\gamma_N}\left|\frac{\pi\cot\pi z}{z^2}\right|\,4(2N+1)
\le \sup_{\gamma_N}|\cot\pi z|\,\frac{4(2N+1)\pi}{(N+\frac12)^2} = \sup_{\gamma_N}|\cot\pi z|\,O(N^{-1}).$$

So everything is good if we can show sup_{γ_N}|cot πz| is bounded as N → ∞. On
the vertical sides, we have z = ±(N + ½) + iy, and thus
$$|\cot(\pi z)| = |\tan(i\pi y)| = |\tanh(\pi y)| \le 1,$$
while on the horizontal sides, we have z = x ± i(N + ½), and
$$|\cot(\pi z)| \le \frac{e^{\pi(N+1/2)} + e^{-\pi(N+1/2)}}{e^{\pi(N+1/2)} - e^{-\pi(N+1/2)}} = \coth\left(\pi\left(N+\frac12\right)\right).$$
This is bounded since x 7→ coth x is decreasing and positive for x ≥ 0.
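As an added illustration (not in the notes), sympy confirms the residue at the triple pole, and the partial sums visibly approach π²/6:

```python
import sympy as sp

z, n = sp.symbols('z n')

# Residue of pi*cot(pi*z)/z**2 at the triple pole z = 0 is -pi**2/3.
print(sp.residue(sp.pi*sp.cot(sp.pi*z)/z**2, z, 0))          # -pi**2/3

# Partial sums of sum 1/n**2 approach pi**2/6 ≈ 1.644934.
for N in (10, 100, 1000):
    print(N, float(sp.Sum(1/n**2, (n, 1, N)).doit()))
print(float(sp.pi**2/6))
```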
E. 10-92
Compute the integral $\int_0^\infty \frac{\log x}{1+x^2}\,dx$.

The point is that to define log z, we have to cut the plane
to avoid multi-valuedness. In this case, we might choose
to cut it along iR≤0, giving a branch of log for which
arg(z) ∈ (−π/2, 3π/2). We need to avoid running through
zero, so we use the upper semicircular contour of radius R, indented by a small
semicircle of radius ε around the origin (the only enclosed singularity is at z = i). On the
large semicircular arc of radius R, the integrand satisfies
$$|f(z)||dz| = O\left(R\cdot\frac{\log R}{R^2}\right) = O\left(\frac{\log R}{R}\right) \to 0 \quad\text{as } R\to\infty.$$
On the small semicircular arc of radius ε, the integrand satisfies
$$|f(z)||dz| = O(\varepsilon\log\varepsilon) \to 0 \quad\text{as } \varepsilon\to 0.$$
Hence, as ε → 0 and R → ∞, we are left with the integral along the real axis.
Along the negative real axis, we have log z = log |z| + iπ. So the residue theorem
says
$$\int_0^\infty \frac{\log x}{1+x^2}\,dx + \int_\infty^0 \frac{\log|z|+i\pi}{1+x^2}\,(-dx) = 2\pi i\operatorname{Res}(f; i).$$
We can compute the residue as
$$\operatorname{Res}(f, i) = \frac{\log i}{2i} = \frac{\frac{i\pi}{2}}{2i} = \frac{\pi}{4}.$$
So we find
$$2\int_0^\infty \frac{\log x}{1+x^2}\,dx + i\pi\int_0^\infty \frac{1}{1+x^2}\,dx = \frac{i\pi^2}{2}.$$
Taking the real part of this, we obtain 0 as the answer to the original integral.
In this case, we had a branch cut, and we managed to avoid it by going around
our magic contour. Sometimes, it is helpful to run our integral along the branch
cut.
E. 10-93
Compute $\int_0^\infty \frac{\sqrt x}{x^2+ax+b}\,dx$ where a, b ∈ R.

To define √z, we need to pick a branch cut. We pick
it to lie along the positive real line, and consider a
keyhole contour. As usual, this has a small circle
of radius ε around the origin and a large circle of
radius R, joined along the two sides of the branch
cut. Again, on the R-circle, we have
$$|f(z)||dz| = O\left(\frac{1}{\sqrt R}\right) \to 0 \quad\text{as } R\to\infty.$$
On the ε-circle, we have |f(z)||dz| = O(ε^{3/2}) → 0 as ε → 0. Viewing √z = e^{½ log z},
on the two pieces of the contour along R≥0, log z differs by 2πi. So √z changes
sign. This cancels with the sign change arising from going in the wrong direction.
Therefore the residue theorem says
$$2\pi i\sum(\text{residues inside contour}) = 2\int_0^\infty \frac{\sqrt x}{x^2+ax+b}\,dx.$$
go into details.
E. 10-94


Compute $I = \int_0^\infty \frac{x^\alpha}{1+\sqrt 2\,x+x^2}\,dx$ where −1 < α < 1.

Like before, we need a branch cut for z^α. We take our
branch cut to be along the positive real axis, and define
z^α = r^α e^{iαθ}, where z = re^{iθ} and 0 ≤ θ < 2π. We use
the same keyhole contour. This consists of a large circle C_R of
radius R, a small circle C_ε of radius ε, and the two lines
just above and below the branch cut (the poles at e^{3πi/4} and e^{5πi/4} lie inside). The idea is to
simultaneously take the limits ε → 0 and R → ∞. We
have four integrals to work out. The first is
$$\int_{C_R} \frac{z^\alpha}{1+\sqrt 2\,z+z^2}\,dz = O(R^{\alpha-2})\cdot 2\pi R = O(R^{\alpha-1}) \to 0 \quad\text{as } R\to\infty.$$
To obtain the contribution from C_ε, we substitute z = εe^{iθ}, and obtain
$$\int_{2\pi}^{0} \frac{\varepsilon^\alpha e^{i\alpha\theta}}{1+\sqrt 2\,\varepsilon e^{i\theta}+\varepsilon^2 e^{2i\theta}}\,i\varepsilon e^{i\theta}\,d\theta = O(\varepsilon^{\alpha+1}) \to 0 \quad\text{as } \varepsilon\to 0.$$
Finally, we look at the integrals above and below the branch cut. The contribution
from just above the branch cut is
$$\int_\varepsilon^R \frac{x^\alpha}{1+\sqrt 2\,x+x^2}\,dx \to I.$$
Similarly, the integral below is
$$\int_R^\varepsilon \frac{x^\alpha e^{2\alpha\pi i}}{1+\sqrt 2\,x+x^2}\,dx \to -e^{2\alpha\pi i}I.$$
So we get
$$\oint_\gamma \frac{z^\alpha}{1+\sqrt 2\,z+z^2}\,dz \to (1-e^{2\alpha\pi i})I.$$
All that remains is to compute the residues. We write the integrand as
$$\frac{z^\alpha}{(z-e^{3\pi i/4})(z-e^{5\pi i/4})}.$$
So the poles are at z_0 = e^{3πi/4} and z_1 = e^{5πi/4}. The residues are e^{3απi/4}/(√2 i)
and e^{5απi/4}/(−√2 i) respectively. Hence we know
$$(1-e^{2\alpha\pi i})I = 2\pi i\left(\frac{e^{3\alpha\pi i/4}}{\sqrt 2\,i} + \frac{e^{5\alpha\pi i/4}}{-\sqrt 2\,i}\right).$$
In other words, we get $e^{\alpha\pi i}(e^{-\alpha\pi i}-e^{\alpha\pi i})I = \sqrt 2\,\pi\,e^{\alpha\pi i}(e^{-\alpha\pi i/4}-e^{\alpha\pi i/4})$. Thus
we have
$$I = \sqrt 2\,\pi\,\frac{\sin(\alpha\pi/4)}{\sin(\alpha\pi)}.$$
Note that we labeled the poles as e^{3πi/4} and e^{5πi/4}. The second point is the
same point as e^{−3πi/4}, but it would be wrong to label it like that. We decided
at the beginning to pick the branch such that 0 ≤ θ < 2π, and −3π/4 is not in
that range. If we wrote it as e^{−3πi/4} instead, we might have got the residue as
e^{−3απi/4}/(−√2 i), which is not the same as e^{5απi/4}/(−√2 i).
L. 10-95
Suppose the holomorphic f have a zero (or pole) of order k > 0 at z = a, then
f 0 (z)/f (z) has a simple pole at z = a with residue k (respectively −k for pole).

If f has a zero of order k at z = a, then f (z) = (z −a)k g(z) where g is holomorphic


and non-zero at z = a. Then
$$\frac{f'(z)}{f(z)} = \frac{k}{z-a} + \frac{g'(z)}{g(z)},$$

whence the result. For a pole we have f (z) = (z − a)−k g(z) and proceed in the
same way.
T. 10-96
<Argument principle> Let U be a simply connected domain, and let f be
meromorphic on U . Suppose in fact f has finitely many zeroes z1 , · · · , zk and
finitely many poles w1 , · · · , w` . Let γ be a piecewise-C 1 closed curve in U such
that z_i, w_j ∉ image(γ) for all i, j. Then
$$I(f\circ\gamma, 0) = \frac{1}{2\pi i}\int_\gamma \frac{f'(z)}{f(z)}\,dz
= \sum_{i=1}^{k}\operatorname{ord}(f; z_i)\,I(\gamma, z_i) - \sum_{j=1}^{\ell}\operatorname{ord}(f; w_j)\,I(\gamma, w_j).$$

$$I(f\circ\gamma, 0) = \frac{1}{2\pi i}\int_{f\circ\gamma}\frac{dw}{w} = \frac{1}{2\pi i}\int_\gamma \frac{df}{f(z)} = \frac{1}{2\pi i}\int_\gamma \frac{f'(z)}{f(z)}\,dz.$$
Let S = {z_1, · · · , z_k, w_1, · · · , w_ℓ}. By the residue theorem, we have
$$\frac{1}{2\pi i}\int_\gamma \frac{f'(z)}{f(z)}\,dz = \sum_{z\in S}\operatorname{Res}\left(\frac{f'}{f}, z\right)I(\gamma, z),$$
Note that outside these zeroes and poles, the function f 0 (z)/f (z) is holomorphic.
By the above lemma Res(f 0 /f, zj ) = ord(f ; zj ) and Res(f 0 /f, wj ) = − ord(f ; wj ).
Note that if c is a constant, then
$$I((f-c)\circ\gamma, 0) = \frac{1}{2\pi i}\int_\gamma \frac{f'(z)}{f(z)-c}\,dz = \frac{1}{2\pi i}\int_{f\circ\gamma}\frac{dw}{w-c} = I(f\circ\gamma, c).$$
This "shifting property" will be useful later.
Recall we said that if f : B(a; r) → C is holomorphic, and f (a) = 0, then f has
a zero of order k if, locally f (z) = (z − a)k g(z) with g holomorphic and g(a) 6= 0.
Analogously, if f : B(a, r) \ {a} → C is holomorphic, and f has at worst a pole
at a, we can again write f (z) = (z − a)k g(z) where now k ∈ Z may be negative.
Since we like numbers to be positive, we say the order of the zero/pole is |k|. It
turns out we can use integrals to help count poles and zeroes. In particular, if γ is
a simple closed curve, then all the winding numbers of γ about points zi , wj lying
in the region bound by γ are all +1 (with the right choice of orientation). Then
this result says that, in the region,
$$\text{number of zeroes} - \text{number of poles} = \frac{1}{2\pi}\,(\text{change in argument of } f \text{ along } \gamma),$$
where the zeroes and poles are counted with multiplicity.
This might be the right place to put the following remark — all the time, we have
assumed that a simple closed curve “bounds a region”, and then we talk about
which poles or zeroes are bounded by the curve. While this seems obvious, it is
not. This is given by the Jordan curve theorem, which is actually hard. Instead of
resorting to this theorem, we can instead define what it means to bound a region
in a more convenient way. One can say that for a domain U , a closed curve γ ⊆ U
bounds a domain D ⊆ U if
(
+1 z ∈ D
I(γ, z) = ,
0 z 6∈ D

for a particular choice of orientation on γ. However, we shall not worry ourselves


with this.
T. 10-97
<Rouché's theorem> Let U be a domain and γ a closed curve which bounds
a domain in U . Let f, g be holomorphic on U , and suppose |f | > |g| for all
z ∈ image(γ). Then f and f + g have the same number of zeroes in the domain
bound by γ, when counted with multiplicity.

If |f | > |g| on γ, then f and f + g cannot have zeroes on the curve γ. We let
$$h(z) = \frac{f(z)+g(z)}{f(z)} = 1 + \frac{g(z)}{f(z)}.$$
This is a natural thing to consider, since zeroes of f + g is zeroes of h, while poles
of h are zeroes of f . Note that by assumption, for all z ∈ γ, we have

h(z) ∈ B(1, 1) ⊆ {z : Re(z) > 0}.

Therefore h◦γ is a closed curve in the half-plane {z : Re(z) > 0}. So I(h◦γ; 0) = 0.
Then by the argument principle, h must have the same number of zeros as poles
in D, when counted with multiplicity (note that the winding numbers are all +1).
Thus, as the zeroes of h are the zeroes of f + g, and the poles of h are the poles
of f , the result follows.
E. 10-98
A typical application of Rouché's theorem is to determine the approximate location
of the zeroes of (say) a polynomial. Consider the function z⁴ + 6z + 3; we claim
this has three roots (with multiplicity) in {1 < |z| < 2}. To show this, note that
on |z| = 2, we have
|z|4 = 16 > 6|z| + 3 ≥ |6z + 3|.
So if we let f (z) = z 4 and g(z) = 6z + 3, then f and f + g have the same number
of roots in {|z| < 2}. Hence all four roots lie inside {|z| < 2}.
On the other hand, on |z| = 1, we have |6z| = 6 > |z 4 + 3|. So 6z and z 4 + 6z + 3
have the same number of roots in {|z| < 1}. So there is exactly one root in there,
and the remaining three must lie in {1 < |z| < 2} (the bounds above show that
|z| cannot be exactly 1 or 2).
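As an added numerical illustration, numpy's polynomial root finder confirms this count (it also shows the single remaining root lies in the open unit disc):

```python
import numpy as np

roots = np.roots([1, 0, 0, 6, 3])           # roots of z**4 + 6z + 3
radii = np.abs(roots)
print(sorted(radii))                        # one modulus below 1, three between 1 and 2
print(sum(1 < r < 2 for r in radii))        # 3
```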
E. 10-99
Let P (x) = xn + an−1 xn−1 + · · · + a1 x + a0 ∈ Z[x] and suppose a0 6= 0. We claim
that if |an−1 | > 1 + |an−2 | + · · · + |a1 | + |a0 |, then P is irreducible over Z (and
hence irreducible over Q, by Gauss’ lemma).
To show this, we let f (z) = an−1 z n−1 and g(z) = z n + an−2 z n−2 + · · · + a1 z + a0 .
Then our hypothesis tells us |f | > |g| on |z| = 1. So f and P = f + g both have
n − 1 roots in the open unit disc {|z| < 1}.
Now if we could factor P(z) = Q(z)R(z), where Q, R ∈ Z[x] are non-constant, then at least one of
Q, R must have all its roots inside the unit disk. Say all roots of Q are inside the
unit disk. But we assumed a_0 ≠ 0. So 0 is not a root of P, hence it is not a root
of Q. But the product of the roots of Q is, up to sign, a coefficient of Q, hence an integer strictly
between −1 and 1. This is a contradiction.
The argument principle and Rouché's theorem tell us how many roots we have got.
However, we do not know if they are distinct or not. This information is given to
us via the local degree theorem which we’ll do next.
D. 10-100
Let f : B(a, r) → C be holomorphic and non-constant. Then the local degree of
f at a, written deg(f, a) is the order of the zero of f (z) − f (a) at a.
E. 10-101
If we take the Taylor expansion of f about a, then the local degree is the degree
of the first non-zero term after the constant term.
L. 10-102
The local degree is given by deg(f, a) = I(f ◦ γ, f (a)) where γ(t) = a + reit with
0 ≤ t ≤ 2π, for r > 0 sufficiently small.

Note that by the identity theorem, we know that, f (z) − f (a) has an isolated zero
at a (since f is non-constant). So for sufficiently small r, the function f (z) − f (a)
does not vanish on B̄(a, r) \ {a}. If we use this r, then f ◦ γ never hits f (a),
and the winding number is well-defined. The result then follows directly from the
argument principle.
T. 10-103
<Local degree theorem> Let f : B(a, R) → C be holomorphic and non-
constant. Then ∃δ > 0 s.t. ∀r ∈ (0, δ], ∃ε > 0 s.t. ∀w ∈ B(f (a), ε) \ {f (a)}, the
equation f (z) = w has exactly deg(f, a) distinct solutions in B(a, r).

We pick δ > 0 such that f (z) − f (a) and f 0 (z) don’t vanish on B̄(a, δ) \ {a}.
Then in particular the same applies to r ∈ (0, δ]. We let γ(t) = a + reit . Then
f(a) ∉ image(f ◦ γ). So there is some ε > 0 such that B(f(a), ε) ∩ image(f ◦ γ) = ∅
(since C \ image(f ◦ γ) is open).
We now let w ∈ B(f (a), ε). Then the number of zeros of f (z) − w in B(a, r) is
just I(f ◦ γ, w), by the argument principle. This is just equal to I(f ◦ γ, f (a)) =
deg(f, a), by the invariance of I(Γ, ∗) as we move ∗ on path components of C \ Γ.
Now if w 6= f (a), since f 0 (z) 6= 0 on B(a, r) \ {a}, all roots of f (z) − w must be
simple. So there are exactly deg(f, a) distinct zeros.
P. 10-104
<Open mapping theorem> Let U be a domain and f : U → C is holomorphic
and non-constant, then f is an open map, ie. for all open V ⊆ U , we get that
f (V ) is open.

This is an immediate consequence of the local degree theorem. Firstly note that
by the identity theorem, f is not locally constant anywhere. For every a ∈ U we
pick r > 0 sufficiently small so that r ∈ (0, δ] and B(a, r) ⊆ V . Then by the local
degree theorem ∃ε > 0 such that B(f (a), ε) ⊆ f (B(a, r)) ⊆ f (V ). Hence f (V ) is
open.
L. 10-105
Suppose U ⊆ C is a simply connected domain and 0 ∉ U. Then there exists a branch
of logarithm on U.

Pick a ∈ U. Since exp is surjective onto C∗, we can pick b such that e^b = a. Given
any x ∈ U, let γ_x be any piecewise C¹ path from a to x. Define F : U → C by
$F(x) = b + \int_{\gamma_x}\frac{1}{z}\,dz$. By [P.10-80], if γ is a closed piecewise C¹ path in U, then
$\int_\gamma \frac{1}{z}\,dz = 0$. Hence by 1 of [P.10-39], F is holomorphic with derivative 1/z. Now
$$\frac{d}{dz}\left(\frac{e^{F(z)}}{z}\right) = \frac{e^{F(z)}}{z^2} - \frac{e^{F(z)}}{z^2} = 0$$
by the chain rule, hence e^{F(z)} = Az for some constant A.[L.10-16] Now e^{F(a)} = e^b =
a, hence A = 1. Therefore F is a continuous logarithm on U.
P. 10-106
Let U ⊆ C be a simply connected domain, and U 6= C. Then there is a non-
constant holomorphic function U → B(0, 1).

We let q ∈ C \ U , and define φ : U → C∗ by φ(z) = z − q. Define g : U → C by


g(z) = log(z − q). By the lemma, log can be chosen so that g is continuous. Now
we have eg(z) = z − q = φ(z). In particular, we can write φ(z) = h(z)2 , where
h : U → C∗ is given by h(z) = e^{½ g(z)}.
We let y ∈ h(U), and then the open mapping theorem says there is some r > 0
with B(y, r) ⊆ h(U). Let r be small enough so that r < |y|. Now note that φ
is injective, and that h(z_1) = ±h(z_2) implies φ(z_1) = φ(z_2). So we deduce that
B(−y, r) ∩ h(U) = ∅. Now define
$$f : z \mapsto \frac{r}{2(h(z)+y)}.$$

This is a holomorphic function f : U → B(0, 1), and is non-constant.


Recall that Liouville’s theorem says every holomorphic f : C → B(0, 1) is con-
stant. However, for any other simply connected domain, we know there are some
interesting functions we can write down.
This is a weak form of the Riemann mapping theorem (which says that there is a
conformal equivalence to B(0, 1)). This just says there is a map that is not boring.
This also shows the amazing difference between C and C∗ .

10.5 Transform theory


We are now going to consider two different types of “transforms”. The first is the
Fourier transform, which we already met in IB methods. The second is the Laplace
transform. While the formula of a Laplace transform is completely real, the inverse
Laplace transform involves a more complicated contour integral. In either case, the
new tool of contour integrals allows us to compute more transforms. Apart from that,
the two transforms are pretty similar, and most properties of the Fourier transform
also carry over to the Laplace transform.

10.5.1 Fourier transforms


Recall the Fourier transform of a function f (x) that decays sufficiently as |x| → ∞ is
defined as
$$\tilde f(k) = \int_{-\infty}^{\infty} f(x)e^{-ikx}\,dx,$$
and the inverse transform is
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \tilde f(k)e^{ikx}\,dk.$$

It is common for the terms e−ikx and eikx to be swapped around in these definitions.
It might even be swapped around by the same author in the same paper — for some
reason, if we have a function in two variables, then it is traditional to transform one
variable with e^{-ikx} and the other with e^{ikx}, just to confuse people. More rarely,
factors of 2π or √(2π) are rearranged. Traditionally, if f is a function of position x,
then the transform variable is called k; while if f is a function of time t, then it is
called ω.
In fact, a more precise version of the inverse transform is
$$\frac{1}{2}\left(f(x^+)+f(x^-)\right) = \frac{1}{2\pi}\,\mathrm{PV}\int_{-\infty}^{\infty}\tilde f(k)e^{ikx}\,dk.$$
The left-hand side indicates that at a discontinuity, the inverse Fourier transform gives
the average value. The right-hand side shows that only the Cauchy principal value
of the integral (denoted PV∫, P∫ or ⨍) is required, ie. the limit
$$\lim_{R\to\infty}\int_{-R}^{R}\tilde f(k)e^{ikx}\,dk, \quad\text{rather than}\quad \lim_{\substack{R\to\infty\\ S\to-\infty}}\int_{S}^{R}\tilde f(k)e^{ikx}\,dk.$$

Several functions have PV integrals, but not normal ones. For example,
$$\mathrm{PV}\int_{-\infty}^{\infty}\frac{x}{1+x^2}\,dx = 0,$$
since the integrand is odd, but $\int_{-\infty}^{\infty}\frac{x}{1+x^2}\,dx$ diverges at both −∞ and ∞. So the
normal (proper) integral does not exist. So for the inverse Fourier transform, we only
have to care about the Cauchy principal value. This is convenient, because that is how
we have been computing contour integrals all along!
E. 10-107
Consider f(x) = e^{−x²/2}. Then
$$\tilde f(k) = \int_{-\infty}^{\infty} e^{-x^2/2}e^{-ikx}\,dx = \int_{-\infty}^{\infty} e^{-(x+ik)^2/2}e^{-k^2/2}\,dx
= e^{-k^2/2}\int_{-\infty+ik}^{\infty+ik} e^{-z^2/2}\,dz.$$
We create a rectangular contour with one side γ_0 along the line Im z = k, the opposite side
γ_1 along the real axis, and two vertical sides γ_R^± at Re z = ±R.
The integral we want is the integral along γ_0 in the limit as R → ∞. We can show that
$\int_{\gamma_R^+} \to 0$ and $\int_{\gamma_R^-} \to 0$. Then we notice there are no singularities
inside the contour. So
$$\int_{\gamma_0} e^{-z^2/2}\,dz = -\int_{\gamma_1} e^{-z^2/2}\,dz$$
in the limit. Since γ_1 is traversed in the reverse direction, we have
$$\tilde f(k) = e^{-k^2/2}\int_{-\infty}^{\infty} e^{-z^2/2}\,dz = \sqrt{2\pi}\,e^{-k^2/2}.$$
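As an added check using the defining integral directly (this matches the sign convention used in these notes), sympy reproduces the Gaussian transform pair:

```python
import sympy as sp

x, k = sp.symbols('x k', real=True)
f_tilde = sp.integrate(sp.exp(-x**2/2) * sp.exp(-sp.I*k*x), (x, -sp.oo, sp.oo))
print(sp.simplify(f_tilde))   # sqrt(2)*sqrt(pi)*exp(-k**2/2)
```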

E. 10-108
When inverting Fourier transforms, we generally use a semicircular contour (in the
upper half-plane if x > 0, lower if x < 0), and apply Jordan’s lemma. Consider
the real function
$$f(x) = \begin{cases} 0 & x < 0\\ e^{-ax} & x > 0\end{cases}$$
where a > 0 is a real constant. The Fourier transform of f is
$$\tilde f(k) = \int_{-\infty}^{\infty} f(x)e^{-ikx}\,dx = \int_0^{\infty} e^{-ax-ikx}\,dx
= -\frac{1}{a+ik}\left[e^{-ax-ikx}\right]_0^{\infty} = \frac{1}{a+ik}.$$

We shall compute the inverse Fourier transform by
evaluating $\frac{1}{2\pi}\int_{-\infty}^{\infty}\tilde f(k)e^{ikx}\,dk$. In the complex plane,
we let γ_0 be the contour from −R to R along the real
axis, γ_R the semicircle of radius R in the upper
half-plane, and γ'_R the semicircle in the lower half-plane.
We let γ = γ_0 + γ_R and γ' = γ_0 + γ'_R. We see that f̃(k) has
only one pole, at k = ia, which is a simple pole. So we
get
$$\oint_\gamma \tilde f(k)e^{ikx}\,dk = 2\pi i\operatorname*{Res}_{k=ia}\frac{e^{ikx}}{i(k-ia)} = 2\pi e^{-ax},
\qquad \oint_{\gamma'} \tilde f(k)e^{ikx}\,dk = 0.$$
For x > 0, Jordan's lemma (with α = x) on γ_R shows that $\int_{\gamma_R}\tilde f(k)e^{ikx}\,dk \to 0$ as
R → ∞. Hence we get
$$\frac{1}{2\pi}\int_{-\infty}^{\infty}\tilde f(k)e^{ikx}\,dk
= \frac{1}{2\pi}\lim_{R\to\infty}\int_{\gamma_0}\tilde f(k)e^{ikx}\,dk
= \frac{1}{2\pi}\lim_{R\to\infty}\left(\int_\gamma \tilde f(k)e^{ikx}\,dk - \int_{\gamma_R}\tilde f(k)e^{ikx}\,dk\right) = e^{-ax}.$$
For x < 0, we have to close in the lower half-plane (to apply Jordan's lemma).
Since there are no singularities there, we get $\frac{1}{2\pi}\int_{-\infty}^{\infty}\tilde f(k)e^{ikx}\,dk = 0$. Combining these
results, we obtain
$$\frac{1}{2\pi}\int_{-\infty}^{\infty}\tilde f(k)e^{ikx}\,dk = \begin{cases} 0 & x < 0\\ e^{-ax} & x > 0\end{cases}$$
2π −∞ e x>0

10.5.2 Laplace transform


The Fourier transform is a powerful tool for solving differential equations and investi-
gating physical systems, but it has two key restrictions:
1. Many functions of interest grow exponentially (eg. ex ), and so do not have Fourier
transforms;
2. There is no way of incorporating initial or boundary conditions in the transform
variable. When used to solve an ODE, the Fourier transform merely gives a par-
ticular integral: there are no arbitrary constants produced by the method.
So for solving differential equations, the Fourier transform is pretty limited. Laplace
transform gets around these two restrictions. However, we have to pay the price with
a different restriction — it is only defined for functions f (t) which vanishes for t < 0
(by convention). For the remaining of this section, we shall make this assumption, so
that if we refer to the function f (t) = et for instance, we really mean f (t) = et H(t),
where H(t) is the Heaviside step function.
D. 10-109
The Laplace transform of a function f (t) such that f (t) = 0 for t < 0 is defined
by
$$\hat f(p) = \int_0^\infty f(t)e^{-pt}\,dt.$$

This exists for functions that grow no more than exponentially fast. There is no
standard notation for the Laplace transform. We sometimes write fˆ = L(f ) or
fˆ(p) = L(f (t)). The variable p is also not standard. Sometimes, s is used instead.
E. 10-110
Many functions (eg. t and et ) which do not have Fourier transforms do have
Laplace transforms. Note that fˆ(p) = f˜(−ip), where f˜ is the Fourier transform,
provided that both transforms exist.
1. $L(1) = \int_0^\infty e^{-pt}\,dt = \frac{1}{p}$
2. Integrating by parts, we find $L(t) = 1/p^2$.
3. $L(e^{\lambda t}) = \int_0^\infty e^{(\lambda-p)t}\,dt = \frac{1}{p-\lambda}$
4. $L(\sin t) = L\left(\frac{1}{2i}\left(e^{it}-e^{-it}\right)\right) = \frac{1}{2i}\left(\frac{1}{p-i}-\frac{1}{p+i}\right) = \frac{1}{p^2+1}$
Note that the integral only converges if Re p is sufficiently large. For example, in
(3), we require Re p > Re λ. However, once we have calculated fˆ in this domain, we
can consider it to exist everywhere in the complete p-plane, except at singularities
(such as at p = λ in this example). That is we use the analytic continuation.
C. 10-111
<Properties of the Laplace transform> We will come up with seven elemen-
tary properties of the Laplace transform. The first 4 properties are easily proved
by direct substitution
1. Linearity: L(αf + βg) = αL(f ) + βL(g).
2. Translation: L(f (t − t0 )H(t − t0 )) = e−pt0 fˆ(p).
3. Scaling: $L(f(\lambda t)) = \frac{1}{\lambda}\hat f\left(\frac{p}{\lambda}\right)$, where we require λ > 0 so that f(λt) vanishes
   for t < 0.
4. Shifting: L(ep0 t f (t)) = fˆ(p − p0 ).
5. Transform of a derivative: L(f 0 (t)) = pfˆ(p) − f (0). To see this note
   $$\int_0^\infty f'(t)e^{-pt}\,dt = \left[f(t)e^{-pt}\right]_0^\infty + p\int_0^\infty f(t)e^{-pt}\,dt = p\hat f(p) - f(0).$$

Repeating the process,

L(f 00 (t)) = pL(f 0 (t)) − f 0 (0) = p2 fˆ(p) − pf (0) − f 0 (0),

and so on. This is the key fact for solving ODEs using Laplace transforms.
6. Derivative of a transform: $\hat f'(p) = L(-tf(t))$. To see this, note that $\hat f(p) = \int_0^\infty f(t)e^{-pt}\,dt$;
   differentiating with respect to p, we have $\hat f'(p) = -\int_0^\infty tf(t)e^{-pt}\,dt$.
   Of course, the point of this is not that we know what the derivative of f̂ is.
   It is that we know how to find the Laplace transform of tf(t)! For example, this
   lets us find the Laplace transform of t² with ease. In general, $\hat f^{(n)}(p) = L((-t)^n f(t))$.
7. Asymptotic limits:
   $$p\hat f(p) \to \begin{cases} f(0) & \text{as } p\to\infty\\ f(\infty) & \text{as } p\to 0\end{cases}$$
   where the second case requires f to have a limit at ∞. To see this, we use
   property 5 to write
   $$p\hat f(p) = f(0) + \int_0^\infty f'(t)e^{-pt}\,dt.$$

As p → ∞, we know e−pt → 0 for all t. So pfˆ(p) → f (0). This proof looks


dodgy, but is actually valid since f 0 grows no more than exponentially fast.
   Similarly, as p → 0, we have e^{-pt} → 1. So
   $$p\hat f(p) \to f(0) + \int_0^\infty f'(t)\,dt = f(\infty).$$

E. 10-112
$$L(t\sin t) = -\frac{d}{dp}L(\sin t) = -\frac{d}{dp}\frac{1}{p^2+1} = \frac{2p}{(p^2+1)^2}.$$
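These elementary transforms and the example above are easy to spot-check with sympy's laplace_transform (an added illustration; sympy uses the same convention as these notes):

```python
import sympy as sp

t, p = sp.symbols('t p', positive=True)

print(sp.laplace_transform(1, t, p))              # (1/p, 0, True)
print(sp.laplace_transform(sp.sin(t), t, p))      # (1/(p**2 + 1), 0, True)
print(sp.laplace_transform(t*sp.sin(t), t, p))    # (2*p/(p**2 + 1)**2, 0, True)
```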

P. 10-113
<Inverse Laplace transform> The inverse Laplace transform is given by
$$f(t) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty}\hat f(p)e^{pt}\,dp,$$

where c is a real constant such that the Bromwich inversion contour γ from c − i∞
to c + i∞ lies to the right of all the singularities of fˆ.

Since f has a Laplace transform, it grows no more than exponentially. So we can


find a c ∈ R such that g(t) = f (t)e−ct decays at infinity (and is zero for t < 0, of
course). So g has a Fourier transform, and

$$\tilde g(\omega) = \int_{-\infty}^{\infty} f(t)e^{-ct}e^{-i\omega t}\,dt = \hat f(c+i\omega).$$
Then we use the Fourier inversion formula to obtain
$$g(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\hat f(c+i\omega)e^{i\omega t}\,d\omega.$$
Making the substitution p = c + iω, we thus obtain
$$f(t)e^{-ct} = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty}\hat f(p)e^{(p-c)t}\,dp.$$

Multiplying both sides by ect , we get the result we were looking for (the require-
ment that c lies to the right of all singularities is to fix the “constant of integration”
so that f (t) = 0 for all t < 0, as we will soon see).

This formula is known as the Bromwich inversion formula . In most cases, we


don’t have to use the full inversion formula. Instead, we use the following special
case:

P. 10-114
In the case that fˆ(p) has only a finite number of isolated singularities pk for
k = 1, · · · , n, and fˆ(p) → 0 as |p| → ∞, then
$$f(t) = \left(\sum_{k=1}^{n}\operatorname*{Res}_{p=p_k}\left(\hat f(p)e^{pt}\right)\right)H(t).$$
We first do the case where t < 0. Consider the closed contour γ_0 + γ'_R, where γ_0 is the
segment of the Bromwich line from c − iR to c + iR and γ'_R is the semicircular arc of radius R closing
the contour to the right; this encloses no singularities. Now
if f̂(p) = o(|p|^{-1}) as |p| → ∞, then
$$\left|\int_{\gamma'_R}\hat f(p)e^{pt}\,dp\right| \le \pi R\,e^{ct}\sup_{p\in\gamma'_R}|\hat f(p)| \to 0 \quad\text{as } R\to\infty.$$
Here we used the fact that |e^{pt}| ≤ e^{ct}, which arises from
Re(pt) ≤ ct, noting that t < 0.

If fˆ decays less rapidly at infinity, but still tends to zero there, the same result
holds, but we need to use a slight modification of Jordan’s lemma. So in either
case, the integral
$$\int_{\gamma'_R}\hat f(p)e^{pt}\,dp \to 0 \quad\text{as } R\to\infty.$$
Thus, we know $\int_{\gamma_0} \to \int_\gamma$. Hence, by Cauchy's theorem, we know f(t) = 0 for t < 0.
This is in agreement with the requirement that functions with Laplace transform
vanish for t < 0. Here we see why γ must lie to the right of all singularities. If not,
then the contour would encircle some singularities, and then the integral would no
longer be zero.
When t > 0, we close the contour to the left with a semicircular arc γ_R. This
time, our contour does enclose some singularities. Since there
are only finitely many singularities, we enclose all of them for sufficiently large R. Once again, we get
$\int_{\gamma_R} \to 0$ as R → ∞. Thus, by the residue theorem, we know
$$\int_\gamma \hat f(p)e^{pt}\,dp = \lim_{R\to\infty}\int_{\gamma_0}\hat f(p)e^{pt}\,dp = 2\pi i\sum_{k=1}^{n}\operatorname*{Res}_{p=p_k}\left(\hat f(p)e^{pt}\right).$$
So the inversion formula gives $f(t) = \sum_{k=1}^{n}\operatorname*{Res}_{p=p_k}\left(\hat f(p)e^{pt}\right)$.

E. 10-115
• We know f̂(p) = 1/(p − 1)
  has a pole at p = 1. So we must use c > 1. We have f̂(p) → 0 as |p| → ∞. So
  Jordan's lemma applies as above. Hence f(t) = 0 for t < 0, and for t > 0, we have
  $$f(t) = \operatorname*{Res}_{p=1}\left(\frac{e^{pt}}{p-1}\right) = e^t.$$

This agrees with what we computed before.


• Consider fˆ(p) = p−n . This has a pole of order n at p = 0. So we pick c > 0. Then
  for t > 0, we have
  $$f(t) = \operatorname*{Res}_{p=0}\left(\frac{e^{pt}}{p^n}\right) = \lim_{p\to 0}\frac{1}{(n-1)!}\frac{d^{n-1}}{dp^{n-1}}e^{pt} = \frac{t^{n-1}}{(n-1)!}.$$

This again agrees with what we computed before.


10.5. TRANSFORM THEORY 471

• In the case where fˆ(p) = e−p /p, we cannot use the standard result about residues, since fˆ(p) does not vanish as |p| → ∞. But we can use the original Bromwich inversion formula to get

f (t) = (1/2πi) ∫_γ (e−p /p) ept dp = (1/2πi) ∫_γ (1/p) ept′ dp,

where t′ = t − 1. Now we can close to the right when t′ < 0, and to the left when
t′ > 0, picking up the residue from the pole at p = 0. Then we get

f (t) = 0 for t′ < 0 and f (t) = 1 for t′ > 0, ie. f (t) = 0 for t < 1 and f (t) = 1 for t > 1, so f (t) = H(t − 1).

This again agrees with what we’ve got before.

• If fˆ(p) has a branch point (at p = 0, say), then we must use a Bromwich keyhole contour, ie. the Bromwich contour modified with a keyhole cut around the branch point. (The first and third examples above are checked symbolically in the sketch below.)
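
These inverse transforms are easy to sanity-check with a computer algebra system. The following is a minimal sketch, assuming Python with sympy is available (neither is part of the course); it recovers the first and third examples above.

    # Symbolic check of two of the inverse Laplace transforms above (sketch using sympy).
    import sympy as sp

    t, p = sp.symbols('t p', positive=True)

    # f_hat(p) = 1/(p - 1): expect f(t) = e^t (times the Heaviside step).
    print(sp.inverse_laplace_transform(1/(p - 1), p, t))      # should give exp(t)*Heaviside(t)

    # f_hat(p) = exp(-p)/p: expect f(t) = H(t - 1).
    print(sp.inverse_laplace_transform(sp.exp(-p)/p, p, t))   # should give Heaviside(t - 1)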

E. 10-116
<Solve differential equations by Laplace transform> The Laplace trans-
form converts ODEs to algebraic equations, and PDEs to ODEs. We will illustrate
this by examples.

• Consider the differential equation tÿ − tẏ + y = 0 with y(0) = 2 and ẏ(0) = −1.
Note that
L(tẏ) = −d/dp L(ẏ) = −d/dp (pŷ − y(0)) = −pŷ′ − ŷ.
Similarly, we find L(tÿ) = −p2 ŷ′ − 2pŷ + y(0). Substituting and rearranging,
we obtain pŷ′ + 2ŷ = 2/p, which is a simpler differential equation (a symbolic check of this step and the next is sketched after this example). We can solve this using an integrating factor to obtain
ŷ = 2/p + A/p2 ,

where A is an arbitrary constant. Hence we have y = 2 + At, and A = −1 from


the initial conditions.

• A system of ODEs can always be written in the form ẋ = M x with x(0) = x0


where x ∈ Rn and M is an n × n matrix. Taking the Laplace transform, we
obtain px̂ − x0 = M x̂. So we get x̂ = (pI − M )−1 x0 . This has singularities
when p is equal to an eigenvalue of M .
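
The intermediate steps of the first bullet point can be checked symbolically. The sketch below assumes Python with sympy (not part of the course): it verifies that ŷ = 2/p + A/p2 satisfies the reduced equation pŷ′ + 2ŷ = 2/p quoted above, and that inverting term by term gives y = 2 + At.

    # Check of the reduced equation and its inversion in the first ODE example (sketch).
    import sympy as sp

    p, t, A = sp.symbols('p t A', positive=True)
    y_hat = 2/p + A/p**2

    # y_hat should satisfy p*y_hat' + 2*y_hat = 2/p.
    print(sp.simplify(p*sp.diff(y_hat, p) + 2*y_hat - 2/p))   # should give 0

    # Inverting term by term should give y(t) = 2 + A*t for t > 0.
    print(sp.inverse_laplace_transform(y_hat, p, t))          # expect 2 + A*t (up to a Heaviside factor)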

E. 10-117
Recall that the convolution of two functions f and g is defined as
(f ∗ g)(t) = ∫_{−∞}^{∞} f (t − t′ )g(t′ ) dt′ .

When f and g vanish for negative t, this simplifies to

(f ∗ g)(t) = ∫_0^t f (t − t′ )g(t′ ) dt′ .

Recall from Methods that the Fourier transforms turn convolutions into products,
and vice versa. We will now prove an analogous result for Laplace transforms.
T. 10-118
<Convolution theorem> The Laplace transform of a convolution is given by

L(f ∗ g)(p) = fˆ(p)ĝ(p).


L(f ∗ g)(p) = ∫_0^∞ ( ∫_0^t f (t − t′ )g(t′ ) dt′ ) e−pt dt = ∫_0^∞ ( ∫_{t′}^∞ f (t − t′ )g(t′ )e−pt dt ) dt′ ,

where we change the order of integration in the (t, t′ ) plane, and adjust the limits accordingly: the limits are correct since both iterated integrals cover the same region {0 ≤ t′ ≤ t < ∞}. We now substitute u = t − t′ to get

L(f ∗ g)(p) = ∫_0^∞ ∫_0^∞ f (u)g(t′ )e−pu e−pt′ du dt′ = ( ∫_0^∞ f (u)e−pu du )( ∫_0^∞ g(t′ )e−pt′ dt′ ) = fˆ(p)ĝ(p).
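
The theorem is also easy to test numerically. The following sketch assumes Python with numpy and scipy (not part of the course); for f (t) = e−t and g(t) = sin t it compares the transform of f ∗ g at a sample value of p with fˆ(p)ĝ(p) = 1/((p + 1)(p2 + 1)).

    # Numerical check of L(f*g) = f_hat * g_hat for f(t) = exp(-t), g(t) = sin(t) (sketch).
    import numpy as np
    from scipy.integrate import quad

    f = lambda t: np.exp(-t)
    g = lambda t: np.sin(t)

    def conv(t):
        # (f*g)(t) = integral_0^t f(t - s) g(s) ds
        val, _ = quad(lambda s: f(t - s) * g(s), 0.0, t)
        return val

    p = 2.0
    lhs, _ = quad(lambda t: conv(t) * np.exp(-p * t), 0.0, 50.0)   # truncated infinite range
    rhs = 1.0 / ((p + 1.0) * (p**2 + 1.0))                         # f_hat(p) * g_hat(p)
    print(lhs, rhs)   # the two numbers agree to several decimal places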
CHAPTER 11
Geometry
In the very beginning, Euclid came up with the axioms of geometry, one of which is
the parallel postulate. This says that given any point P and a line ` not containing
P , there is exactly one line through P that does not intersect `. Unlike the other axioms Euclid
had, this was not seen as “obvious”. For many years, geometers tried hard to prove
this axiom from the others, but failed.
Eventually, people realized that this axiom cannot be proved from the others. There
exists some other “geometries” in which the other axioms hold, but the parallel postu-
late fails. This was known as hyperbolic geometry. Together with Euclidean geometry
and spherical geometry (which is the geometry of the surface of a sphere), these con-
stitute the three classical geometries. We will study these geometries in detail, and
see that they actually obey many similar properties, while being strikingly different in
other respects. That is not the end. In later parts of the course, we will generalize the
notions we began with and eventually define an abstract smooth surface. This covers
all three classical geometries, and many more!

11.1 Euclidean geometry


Roughly speaking, Euclidean geometry is the geometry of the familiar Rn under the
usual inner product. The purpose of this section is to study maps on Rn that preserve
distances, ie. isometries of Rn . Our objective is to classify all isometries of Rn . We
will then also briefly look at curves in Rn .
D. 11-1
• The (standard) inner product on Rn is defined by hx, yi = x · y = Σ_{i=1}^{n} xi yi . The Euclidean norm of x ∈ Rn is kxk = √hx, xi. This defines a metric on Rn by d(x, y) = kx − yk.
• A map f : Rn → Rn is an isometry of Rn if d(f (x), f (y)) = d(x, y) for all
x, y ∈ Rn . The isometry group Isom(Rn ) is the group of all isometries of Rn ,
which is a group by composition.
• An n × n matrix A is orthogonal if AAT = AT A = I. The group of all orthogonal
matrices is the orthogonal group O(n). The special orthogonal group is the group
SO(n) = {A ∈ O(n) : det A = 1}.
E. 11-2
Note that the inner product and the norm both depend on our choice of origin,
but the distance does not. In general, we don’t like having a choice of origin —
choosing the origin is just to provide a (very) convenient way of labelling points. The
origin should not be a special point (in theory). In fancy language, we say we view
Rn as an affine space instead of a vector space.
Note that f is not required to be linear. This is since we are viewing Rn as an affine
space, and linearity only makes sense if we have a specified point as the origin.


Nevertheless, we will still view the linear isometries as “special” isometries, since
they are more convenient to work with, despite not being special fundamentally.
• For any matrix A and x, y ∈ Rn , we get
hAx, Ayi = (Ax)T (Ay) = xT AT Ay = hx, AT Ayi.
So A is orthogonal if and only if hAx, Ayi = hx, yi for all x, y ∈ Rn .
• Note that the inner product can be expressed in terms of the norm by
hx, yi = (1/2)(kx + yk2 − kxk2 − kyk2 ).
So if A preserves norm, then it preserves the inner product, and the converse
is obviously true. So A is orthogonal if and only if kAxk = kxk for all x ∈ Rn .
Hence matrices are orthogonal if and only if they are isometries.
• More generally, let f (x) = Ax + b. Then d(f (x), f (y)) = kA(x − y)k. So any
f of this form is an isometry if and only if A is orthogonal. This is not too
surprising. What might not be expected is that all isometries are of this form.
T. 11-3
Every isometry of f : Rn → Rn is of the form f (x) = Ax + b for A orthogonal
and b ∈ Rn .

Let f be an isometry. Let e1 , · · · , en be the standard basis of Rn . Let b = f (0)


and ai = f (ei ) − b. The idea is to construct our matrix A out of these ai . For A
to be orthogonal, {ai } must be an orthonormal basis. Indeed, we can compute
kai k = kf (ei ) − f (0)k = d(f (ei ), f (0)) = d(ei , 0) = kei k = 1.
For i 6= j, we have
hai , aj i = −hai , −aj i = −(1/2)(kai − aj k2 − kai k2 − kaj k2 )
= −(1/2)(kf (ei ) − f (ej )k2 − 2) = −(1/2)(kei − ej k2 − 2) = 0.
So ai and aj are orthogonal. In other words, {ai } forms an orthonormal set. Any
orthogonal set must be linearly independent. Since we have found n orthonormal
vectors, they form an orthonormal basis. Hence, the matrix A with columns given
by the column vectors ai is an orthogonal matrix.
We now define the isometry g(x) = Ax + b. We want to show f = g. By
construction, we know g(x) = f (x) is true for x = 0, e1 , · · · , en . We observe that
g is invertible. In particular,
g −1 (x) = A−1 (x − b) = AT x − AT b.
Moreover, it is an isometry, since AT is orthogonal (or we can appeal to the more
general fact that inverses of isometries are isometries). We define h = g −1 ◦ f .
Since it is a composition of isometries, it is also an isometry. Moreover, it fixes
x = 0, e1 , · · · , en . To prove f = g it suffices to prove that h is the identity. Let
x ∈ Rn , and expand it in the basis as x = Σ_{i=1}^{n} xi ei . Let y = h(x) = Σ_{i=1}^{n} yi ei .
We can compute
d(x, ei )2 = hx − ei , x − ei i = kxk2 + 1 − 2xi ,   d(x, 0)2 = kxk2 ,
d(y, ei )2 = hy − ei , y − ei i = kyk2 + 1 − 2yi ,   d(y, 0)2 = kyk2 .

Since h is an isometry and fixes 0, e1 , · · · , en , and by definition h(x) = y, we must


have d(x, 0) = d(y, 0) and d(x, ei ) = d(y, ei ). The first equality gives kxk2 =
kyk2 , and the others then imply xi = yi for all i. In other words, x = y = h(x).
So h is the identity.
E. 11-4
<Reflections in an affine hyperplane> Let H ⊆ Rn be an affine hyperplane
given by H = {x ∈ Rn : u · x = c} where kuk = 1 and c ∈ R. This is just a natural
generalization of a 2-dimensional plane in R3 . Note that unlike a vector subspace,
it does not have to contain the origin (since the origin is not a special point).
Reflection in H, written RH , is the map RH : Rn → Rn with x 7→ x − 2(x · u − c)u. We now check this is indeed what we think a reflection should be. Note that every point in Rn can be written as a + tu, where a ∈ H and t ∈ R. Then the reflection should send this point to a − tu. And indeed
RH (a + tu) = (a + tu) − 2tu = a − tu.

This shows us two things. Firstly we see that RH is indeed an isometry: for any x, y ∈ Rn we have RH (x) − RH (y) = (x − y) − 2((x − y) · u)u, which is just the reflection of the vector x − y in the hyperplane {v : v · u = 0}, so kRH (x) − RH (y)k = kx − yk.

Secondly we see that RH fixes exactly the points of H. The converse is also true
— any isometry S ∈ Isom(Rn ) that fixes the points in some affine hyperplane H is
either the identity or RH . To show this, we first want to translate the plane such
that it becomes a vector subspace. Then we can use our linear algebra magic. For
any a ∈ Rn , we can define the translation by a as Ta (x) = x + a. This is clearly
an isometry.
We pick an arbitrary a ∈ H, and let R = T−a STa ∈ Isom(Rn ). Then R fixes
exactly H 0 = T−a H. Since 0 ∈ H 0 , H 0 is a vector subspace. In particular, if
H = {x : x · u = c}, then by putting c = a · u, we find H 0 = {x : x · u = 0}. To
understand R, we already know it fixes everything in H 0 . So we want to see what
it does to u. Note that since R is an isometry and fixes the origin, it is in fact an
orthogonal map. Hence for any x ∈ H 0 , we get

hRu, xi = hRu, Rxi = hu, xi = 0.

So Ru is also perpendicular to H 0 . Hence Ru = λu for some λ. Since R is an


isometry, we have kRuk2 = 1. Hence |λ|2 = 1, and thus λ = ±1. So either λ = 1,
and R = id; or λ = −1, and R = RH 0 , as we already know for orthogonal matrices.
It thus follows that S = idRn , or S is the reflection in H. Hence we find that each
reflection RH is the unique isometry, other than idRn , fixing every point of H.
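
The formula for RH is easy to experiment with. The following minimal sketch, assuming Python with numpy (not part of the course), checks for a random affine hyperplane in R3 that RH preserves distances and fixes points of H.

    # Numerical check that R_H(x) = x - 2(x.u - c)u is an isometry fixing H (sketch).
    import numpy as np

    rng = np.random.default_rng(0)
    u = rng.normal(size=3); u /= np.linalg.norm(u)      # unit normal of H
    c = 0.7                                             # H = {x : u.x = c}

    def R_H(x):
        return x - 2 * (x @ u - c) * u

    x, y = rng.normal(size=3), rng.normal(size=3)
    print(np.isclose(np.linalg.norm(R_H(x) - R_H(y)), np.linalg.norm(x - y)))   # True

    w = rng.normal(size=3); w -= (w @ u) * u            # w orthogonal to u
    a = c * u + w                                       # then u.a = c, so a lies in H
    print(np.allclose(R_H(a), a))                       # True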
L. 11-5
Given points P 6= Q in Rn , there exists a hyperplane H, consisting of the points of
Rn which are equidistant from P and Q, for which the reflection RH swaps the
points P and Q.

If the points P and Q are represented by vectors p and q, we consider the perpen-
dicular bisector of the line segment P Q which is a hyperplane H with equation
x · (p − q) = (1/2)(p + q) · (p − q) = (1/2)(kpk2 − kqk2 ).

An elementary calculation confirms that H consists precisely of the points which


are equidistant from P and Q. We observe that RH (p − q) = −(p − q); moreover
(p+q)/2 ∈ H and hence is fixed under RH . Noting that p = (p+q)/2+(p−q)/2
and q = (p + q)/2 − (p − q)/2, it follows that RH (p) = q and RH (q) = p.
T. 11-6
Any isometry of Rn can be written as the composite of at most n + 1 reflections.

Let e1 , · · · , en be the standard basis of Rn . Consider the n + 1 points represented


by the vectors 0, e1 , · · · , en . Let f be an arbitrary isometry of Rn and consider
the image f (0), f (e1 ), · · · , f (en ) of these vectors. We use the above lemma: If H0
denotes the hyperplane of points equidistant from 0 and f (0), the reflection RH0
swaps the points. In particular, if we set f1 = Rh0 ◦ f , then f1 is an isometry
which fixes 0.
We now repeat this argument. Suppose by induction that we have an isometry fi ,
which is the composite of our original isometry f with at most i reflections, which
fixes all the points 0, e1 , · · · , ei−1 . If fi (ei ) = ei , we set fi+1 = fi . Otherwise, we
let Hi denote the hyperplane consisting of points equidistant from ei and fi (ei ).
Our assumptions imply that 0, e1 , · · · , ei−1 are equidistant from ei and fi (ei ),
and hence lie in Hi . Thus RHi fixes 0, e1 , · · · , ei−1 and swaps ei and fi (ei ), and
so the composite fi+1 = RHi ◦ fi is an isometry fixing 0, e1 , · · · , ei .
After n + 1 steps, we attain an isometry fn+1 , the composite of f with at most
n + 1 reflections, which fixes all of 0, e1 , · · · , en . We saw in the proof of [T.11-3]
that this is sufficient to imply that fn+1 is the identity, from which it follows that
the original isometry f is the composite of at most n + 1 reflections.
Note that if we know that the isometry f fixes the origin, the proof shows that it
can be written as the composite of at most n reflections.
E. 11-7
Consider the subgroup of Isom(Rn ) that fixes 0. By our general expression for the
general isometry, we know this is the set {f (x) = Ax : AAT = I} ∼ = O(n), the
orthogonal group. For each A ∈ O(n), we must have det(A)2 = 1. So det A = ±1.
We use this to define a further subgroup, the special orthogonal group. We can
look at these explicitly for low dimensions. Consider
A = ( a b ; c d ) ∈ O(2).
Orthogonality then requires a2 + c2 = b2 + d2 = 1 and ab + cd = 0. Now we pick
0 ≤ θ, ϕ ≤ 2π such that
a = cos θ c = sin θ b = − sin ϕ d = cos ϕ
Then ab + cd = 0 gives tan θ = tan ϕ (if cos θ and cos ϕ are zero, we formally say
these are both infinity). So either θ = ϕ or θ = ϕ ± π. Thus we have
A = ( cos θ −sin θ ; sin θ cos θ )   or   A = ( cos θ sin θ ; sin θ −cos θ )
respectively. In the first case, this is a rotation through θ about the origin. This
has determinant 1, and hence A ∈ SO(2). In the second case, this is a reflection in
the line ` at angle θ/2 to the x-axis. Then det A = −1 and A 6∈ SO(2). So in two
dimensions, the orthogonal matrices are either reflections or rotations — those in
SO(2) are rotations, and the others are reflections.

D. 11-8
• An orientation of a vector space is an equivalence class of bases — let v1 , · · · , vn
and v10 , · · · , vn0 be two bases and A be the change of basis matrix. We say the two
bases are equivalent iff det A > 0. This is an equivalence relation on the bases,
and the equivalence classes are the orientations.

• An isometry f (x) = Ax + b is orientation-preserving if det A = 1. Otherwise, if


det A = −1, we say it is orientation-reversing .

E. 11-9
We now want to look at O(3). First focus on the case where A ∈ SO(3), ie.
det A = 1. Then we can compute

det(A − I) = det(AT − I) = det(A) det(AT − I) = det(AAT − A) = det(I − A) = − det(A − I).

So det(A − I) = 0, ie. +1 is an eigenvalue in R. So there is some v1 ∈ R3 such


that Av1 = v1 . We set W = hv1 i⊥ . Let w ∈ W , then we can compute

hAw, v1 i = hAw, Av1 i = hw, v1 i = 0.

So Aw ∈ W . In other words, W is fixed by A, and A|W : W → W is well-defined.


Moreover, it is still orthogonal and has determinant 1. So it is a rotation of the
two-dimensional vector space W . We choose {v2 , v3 } an orthonormal basis of W .
Then under the bases {v1 , v2 , v3 }, A is represented by
A = ( 1 0 0 ; 0 cos θ −sin θ ; 0 sin θ cos θ ).

This is the most general orientation-preserving isometry of R3 that fixes the origin.

How about the orientation-reversing ones? Suppose det A = −1. Then det(−A) =
1. So in some orthonormal basis, we can express A as
−A = ( 1 0 0 ; 0 cos θ −sin θ ; 0 sin θ cos θ )   =⇒   A = ( −1 0 0 ; 0 cos ϕ −sin ϕ ; 0 sin ϕ cos ϕ ),

where ϕ = θ + π. This is a rotated reflection, ie. we first do a reflection, then


rotation. In the special case where ϕ = 0, this is a pure reflection.

D. 11-10
A curve Γ in Rn is a continuous map Γ : [a, b] → Rn . Given such a curve we can
define a dissection D: a = t0 < t1 < · · · < tN = b of [a, b], and set Pi = Γ(ti ). We
define
SD = Σ_i |Pi Pi+1 |.
The length of a curve Γ : [a, b] → Rn is ` = supD SD (supremum taken over all
dissections), if the supremum exists.

E. 11-11
Here we can think of the curve as the trajectory of a particle moving through time.
Our main objective of this section is to define the length of a curve. We might want
to define the length as ∫_a^b kΓ′ (t)k dt, as is familiar from, say, IA Vector Calculus.
However, we can’t do this, since our definition of a curve does not require Γ to be
differentiable. It is merely required to be continuous. Hence we have to define the
length in a more roundabout way.
Similar to the definition of the Riemann integral, we consider dissections. Notice that if we add more points to the dissection, then SD will necessarily increase, by the triangle inequality. So it makes sense to define the length as the supremum ` = supD SD . Alternatively, if we let mesh(D) = maxi (ti − ti−1 ), then if ` exists, we have
` = lim_{mesh(D)→0} SD .

Note also that by definition, we can write ` = inf{`˜ : `˜ ≥ SD for all D}. The
definition by itself isn’t too helpful, since there is no nice and easy way to check if
the supremum exists. However, differentiability allows us to compute this easily
in the expected way.

P. 11-12
If Γ is continuously differentiable (ie. C 1 ), then the length of Γ is given by
length(Γ) = ∫_a^b kΓ′ (t)k dt.

To simplify notation, we assume n = 3. However, the proof works for all possible
dimensions. We write Γ(t) = (f1 (t), f2 (t), f3 (t)). For every s 6= t ∈ [a, b], the mean
value theorem tells us for each i = 1, 2, 3

(fi (t) − fi (s))/(t − s) = fi′ (ξi ) for some ξi ∈ (s, t).

Now note that the fi′ are continuous on a closed, bounded interval, and hence uniformly
continuous. For all ε > 0, there is some δ > 0 such that whenever |t − s| < δ, we have
|fi′ (ξi ) − fi′ (ξ)| < ε/3 for all ξ ∈ (s, t). And thus for any ξ ∈ (s, t), we have
k(Γ(t) − Γ(s))/(t − s) − Γ′ (ξ)k2 = k(f1′ (ξ1 ) − f1′ (ξ), f2′ (ξ2 ) − f2′ (ξ), f3′ (ξ3 ) − f3′ (ξ))k2 ≤ ε2 /9 + ε2 /9 + ε2 /9 < ε2 .

In other words, kΓ(t) − Γ(s) − (t − s)Γ′ (ξ)k ≤ ε(t − s). We relabel t = ti , s = ti−1
and ξ = (ti−1 + ti )/2. Using the triangle inequality, we have
(ti − ti−1 ) kΓ′ ((ti + ti−1 )/2)k − ε(ti − ti−1 ) < kΓ(ti ) − Γ(ti−1 )k < (ti − ti−1 ) kΓ′ ((ti + ti−1 )/2)k + ε(ti − ti−1 ).
Summing over all i, we obtain
Σ_i (ti − ti−1 ) kΓ′ ((ti + ti−1 )/2)k − ε(b − a) < SD < Σ_i (ti − ti−1 ) kΓ′ ((ti + ti−1 )/2)k + ε(b − a),
which is valid whenever mesh(D) < δ. Since Γ′ is continuous, and hence integrable, we know
Σ_i (ti − ti−1 ) kΓ′ ((ti + ti−1 )/2)k → ∫_a^b kΓ′ (t)k dt as mesh(D) → 0,
and so length(Γ) = lim_{mesh(D)→0} SD = ∫_a^b kΓ′ (t)k dt.

This proof is just a careful check that the definition of the integral coincides with
the definition of length.
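
The proposition can be illustrated numerically: the dissection sums SD increase towards ∫_a^b kΓ′ (t)k dt as the mesh shrinks. Below is a minimal sketch, assuming Python with numpy and scipy (not part of the course), for the helix Γ(t) = (cos t, sin t, t) on [0, 2π], whose length is 2π√2.

    # Dissection sums S_D approaching the integral of ||Gamma'(t)|| (sketch).
    import numpy as np
    from scipy.integrate import quad

    gamma = lambda t: np.array([np.cos(t), np.sin(t), t])        # a C^1 curve in R^3
    speed = lambda t: np.sqrt(np.sin(t)**2 + np.cos(t)**2 + 1)   # ||Gamma'(t)||, here sqrt(2)

    a, b = 0.0, 2 * np.pi
    integral, _ = quad(speed, a, b)                              # exact length 2*pi*sqrt(2)

    for N in (10, 100, 1000):
        ts = np.linspace(a, b, N + 1)                            # dissection with mesh (b - a)/N
        pts = np.array([gamma(t) for t in ts])
        S_D = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
        print(N, S_D, integral)                                  # S_D increases towards the integral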

11.2 Spherical geometry


We are going to study the geometry of the surface of a sphere. This is a rather sen-
sible thing to study, since it so happens that we all live on something (approximately)
spherical. In this section, we will always think of S 2 as a subset of R3 so that we can
reuse what we know about R3 . We write S = S 2 ⊆ R3 for the unit sphere. We write
O = 0 for the origin, which is the center of the sphere (and not on the sphere).

D. 11-13
• A great circle (in S 2 ) is S 2 ∩ (a plane through O). We also call these (spherical)
lines .

• Given P, Q ∈ S, the distance d(P, Q) is the length of the shorter of the two (spherical) line segments (ie. arcs) P Q along the great circle through P and Q. When P and Q are antipodal, there are infinitely many line segments between them of the same length, and the distance is π.

E. 11-14
When we live on the sphere, we can no longer use regular lines in R3 , since these
do not lie fully on the sphere. Instead, we have a concept of a spherical line, also
known as a great circle. We will also call these geodesics , which is a much more
general term defined on any surface, and happens to be these great circles in S 2 .

In R3 , we know that any three points that are not collinear determine a unique
plane through them. Hence given any two non-antipodal points P, Q ∈ S, the points O, P, Q determine a unique plane, and so there exists a unique spherical line through P and Q.

Note that by the definition of the radian, d(P, Q) is the angle between the position vectors OP and OQ, which is also cos−1 (P · Q), writing P, Q for those vectors.

C. 11-15
<Triangles on a sphere> One main object of study is spherical triangles – they are defined just like Euclidean triangles, with AB, AC, BC line segments on S of length < π. As usual we write a, b, c for the sides opposite the vertices A, B, C, and α, β, γ for the angles at A, B, C. The restriction on the lengths is merely for convenience. We will take advantage of the fact that the sphere sits in R3 . We set
n1 = (C × B)/ sin a,   n2 = (A × C)/ sin b,   n3 = (B × A)/ sin c.
These are unit normals to the planes OBC, OAC and OAB respectively. They
are pointing out of the solid OABC. The angles α, β, γ are the angles between
the planes for the respective sides. Then 0 < α, β, γ < π. Note that the angle
between n2 and n3 is π − α (not α itself — if α = 0, then the angle between the
two normals is π). So

n2 · n3 = − cos α n3 · n1 = − cos β n1 · n2 = − cos γ.

T. 11-16
1. <Spherical cosine rule> sin a sin b cos γ = cos c − cos a cos b.
2. <Spherical sine rule> sin a/ sin α = sin b/ sin β = sin c/ sin γ.

1. We use the fact from IA Vectors and Matrices that


(C × B) · (A × C) = (A · C)(B · C) − (C · C)(B · A),
which follows easily from the double-epsilon identity εijk εimn = δjm δkn −
δjn δkm . In our case, since C·C = 1, the right hand side is (A·C)(B·C)−(B·A).
Thus we have
− cos γ = n1 · n2 = ((C × B)/ sin a) · ((A × C)/ sin b) = ((A · C)(B · C) − (B · A))/(sin a sin b) = (cos b cos a − cos c)/(sin a sin b).

2. We use the fact that (A × C) × (C × B) = (C · (B × A))C. The left hand


side is −(n1 × n2 ) sin a sin b. Since the angle between n1 and n2 is π − γ, we
know n1 × n2 = C sin γ. Thus the left hand side is −C sin a sin b sin γ. Thus
we know C · (A × B) = sin a sin b sin γ. However, since the scalar triple product
is cyclic C · (A × B) = A · (B × C), we have sin a sin b sin γ = sin b sin c sin α.
Thus we have
sin γ/ sin c = sin α/ sin a.
Similarly, we know this is equal to sin β/ sin b.
In the case γ = π/2, the spherical cosine rule says cos c = cos a cos b. This is
Pythagoras’ theorem on S.
Recall that for small a, b, c, we know sin a = a + O(a3 ). Similarly, cos a = 1 − a2 /2 + O(a4 ). As we take the limit a, b, c → 0, the spherical sine and cosine rules become the usual Euclidean versions. For example, the cosine rule becomes
ab cos γ = (1 − c2 /2) − (1 − a2 /2)(1 − b2 /2) + O(k(a, b, c)k3 ).

Rearranging gives c2 = a2 +b2 −2ab cos γ +O(k(a, b, c)k3 ). The sine rule transforms
similarly as well. This is what we would expect, since making a, b, c small is
equivalent to zooming into the surface of the sphere, and it looks more and more
like flat space.
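
Both rules are easy to test numerically for a triangle given by three unit vectors, computing each side as the angle between the corresponding position vectors and each vertex angle as the angle between the two side planes at that vertex. A minimal sketch, assuming Python with numpy (not part of the course):

    # Numerical check of the spherical cosine and sine rules (sketch).
    import numpy as np

    rng = np.random.default_rng(1)
    A, B, C = (v / np.linalg.norm(v) for v in rng.normal(size=(3, 3)))   # three points on S^2

    def ang(u, v):
        return np.arccos(np.clip(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)), -1, 1))

    a, b, c = ang(B, C), ang(A, C), ang(A, B)            # side lengths = central angles

    def angle_at(P, Q, R):
        # triangle angle at vertex P = angle between the planes OPQ and OPR
        return ang(np.cross(P, Q), np.cross(P, R))

    alpha, beta, gamma = angle_at(A, B, C), angle_at(B, C, A), angle_at(C, A, B)

    print(np.isclose(np.sin(a)*np.sin(b)*np.cos(gamma), np.cos(c) - np.cos(a)*np.cos(b)))  # cosine rule
    print(np.allclose(np.sin(a)/np.sin(alpha),
                      [np.sin(b)/np.sin(beta), np.sin(c)/np.sin(gamma)]))                  # sine rule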
P. 11-17
<Triangle inequality> For any P, Q, R ∈ S 2 , we have d(P, Q) + d(Q, R) ≥
d(P, R), with equality if and only if Q lies in the line segment P R of shortest
length.

Form a spherical triangle with the three points so that P, Q, R are our A, C, B
respectively. If γ = π, it then follows that Q is in the line segment given by P R.
So c = a + b. If γ 6= π, since cos γ > −1 spherical cosine rule gives

cos c > cos a cos b − sin a sin b = cos(a + b).

Since cos is decreasing on [0, π], we know c < a + b. The only case left to check
is if d(P, R) = π, since we do not allow our triangles to have side length π. But
in this case they are antipodal points, and any Q lies in a line through P R, and
equality holds.
Thus, we find that (S 2 , d) is a metric space.
On Rn , straight lines are curves that minimize distance. Since we are calling
spherical lines lines, we would expect them to minimize distance as well. This is
in fact true.
P. 11-18
The group of isometries of S 2 , Isom(S 2 ), is isomorphic to O(3, R).

By [T.11-3], any isometry of R3 which fixes the origin is given by a matrix
in O(3, R). Since such a matrix preserves the standard inner product, it preserves
both the lengths of vectors and the angles between vectors. Since the distance
between points of S 2 has been defined to be precisely the angle between the cor-
responding unit vectors, such isometries of R3 restrict to isometries of S 2 . Moreover,
since any matrix in O(3, R) is determined by its effect on the standard basis of
R3 , different matrices in O(3, R) give rise to different isometries of S 2 .
To show that every isometry f : S 2 → S 2 is of the above form, note that any such
isometry f may be extended to a map g : R3 → R3 fixing the origin, which for
non-zero x is defined by g(x) = kxkf (x/kxk). Now for any x, y ∈ R3 we have
hg(x), g(y)i = hx, yi. To see this, note that for non-zero x, y ∈ R3 we have
hg(x), g(y)i = kxkkyk hf (x/kxk), f (y/kyk)i = kxkkyk hx/kxk, y/kyki = hx, yi.

From this we see that g is an isometry of R3 which fixes the origin. Therefore
Isom(S 2 ) is naturally identified with the group O(3, R).
We define a reflection of S 2 in a spherical line/great circle H ∩ S 2 where H is
a plane through the origin, to be the restriction to S 2 of the isometry RH of
R3 , the reflection of R3 in the hyperplane H. It therefore follows from results in
the Euclidean case that any element of Isom(S 2 ) is the composite of at most three
reflections of S 2 . Note in passing that the exact same result holds for the isometries
of the Euclidean plane R2 . It follows that Isom(S 2 ) has an index two subgroup
corresponding to SO(3) ⊆ O(3); these isometries are just the rotations of S 2 , and
are the composite of two reflections. Since any element of O(3) is of the form ±A
with A ∈ SO(3), it follows that the group O(3) is isomorphic to SO(3) × C2 .
P. 11-19
Given a curve Γ on S 2 ⊆ R3 from P to Q, we have ` = length(Γ) ≥ d(P, Q).
Moreover, if ` = d(P, Q), then the image of Γ is a spherical line segment P Q.

Let Γ : [0, 1] → S and ` = length(Γ). Then for any dissection D of [0, 1], say
0 = t0 < · · · < tN = 1, write Pi = Γ(ti ). We define
S̃D = Σ_i d(Pi−1 , Pi ) > SD = Σ_i |Pi−1 Pi |,

where the length in the right hand expression is the distance in Euclidean 3-space.
Now suppose ` < d(P, Q). Then there is some ε > 0 such that `(1 + ε) < d(P, Q).
Recall from basic trigonometry that if θ > 0, then sin θ < θ, and (sin θ)/θ → 1 as θ → 0. Thus for small θ we have θ ≤ (1 + ε) sin θ, and hence 2θ ≤ (1 + ε) · 2 sin θ. This is useful since 2θ is the spherical distance between two points of S 2 subtending an angle 2θ at the centre, while 2 sin θ is the Euclidean distance between them. This means for P, Q sufficiently close, we have d(P, Q) ≤ (1 + ε)|P Q|.
From Analysis II, we know Γ is uniformly continuous on [0, 1]. So we can choose D such that
d(Pi−1 , Pi ) ≤ (1 + ε)|Pi−1 Pi | for all i.

So we know that for sufficiently fine D,

S̃D ≤ (1 + ε)SD < d(P, Q),

since SD → ` from below. However, by the triangle inequality S̃D ≥ d(P, Q). This
is a contradiction. Hence we must have ` ≥ d(P, Q).
Suppose now ` = d(P, Q) for some Γ : [0, 1] → S, ` = length(Γ). Then for every
t ∈ [0, 1], we have

d(P, Q) = ` = length Γ|[0,t] + length Γ|[t,1] ≥ d(P, Γ(t)) + d(Γ(t), Q) ≥ d(P, Q).

Hence we must have equality all along the way, that is d(P, Q) = d(P, Γ(t)) +
d(Γ(t), Q) for all Γ(t). However, this is possible only if Γ(t) lies on the shorter
spherical line segment P Q, as we have previously proved.
So if Γ is a curve of minimal length from P to Q in S 2 , then Γ is a spherical
line segment. Further, from the proof of this proposition, we know length Γ|[0,t] =
d(P, Γ(t)) for all t. So the parametrisation of Γ is monotonic. Such a Γ is called a
minimizing geodesic.
Finally, we get to an important theorem whose proof involves complicated pictures.
This is known as the Gauss-Bonnet theorem. The Gauss-Bonnet theorem is in
fact a much more general theorem; here we will specialize to the case of the sphere. Later, when doing hyperbolic geometry, we will prove the
hyperbolic version of the Gauss-Bonnet theorem. Near the end of the course,
when we have developed sufficient machinery, we would be able to state the Gauss-
Bonnet theorem in its full glory. However, we will not be able to prove the general
version.

P. 11-20
<Gauss-Bonnet theorem for S 2 > If ∆ is a spherical triangle with angles
α, β, γ, then area(∆) = (α + β + γ) − π.

We start with the concept of a double lune. A double lune with angle 0 < α < π consists of the two regions of S cut out by two planes through a pair of antipodal points, where α is the angle between the two planes. It is not hard to show that the area of a double lune is 4α, since the area of the sphere is 4π.

Now note that our triangle ∆ = ABC is the intersection of 3 single lunes, with each of A, B, C as the pole (in fact we only need two, but it is more convenient to talk about 3). Therefore ∆ together with its antipodal partner ∆′ is a subset of each of the 3 double lunes, which have areas 4α, 4β, 4γ. Also, the union of all the double lunes covers the whole sphere, and they overlap at exactly ∆ and ∆′ . Thus
4(α + β + γ) = 4π + 2(area(∆) + area(∆′ )) = 4π + 4 area(∆),
and rearranging gives area(∆) = (α + β + γ) − π.

This is easily generalized to arbitrary convex n-gons on S 2 (with n ≥ 3). Suppose


M is such a convex n-gon with interior angles α1 , · · · , αn . Then we have
area(M ) = Σ_{i=1}^{n} αi − (n − 2)π.

This follows directly from cutting the polygon up into the constituent triangles.
This is very different from Euclidean space. On R2 , we always have α + β + γ = π.
Not only is this false on S 2 , but by measuring the difference, we can tell the area
of the triangle. In fact, we can identify triangles up to congruence just by knowing
the three angles.
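
The formula can be probed numerically by estimating the area of a spherical triangle with a Monte Carlo count of (roughly) uniform points on S 2 and comparing it with the angle excess (α + β + γ) − π. Below is a minimal sketch, assuming Python with numpy (not part of the course); the membership test uses the fact that a triangle contained in a hemisphere is the intersection of the three half-spaces bounded by its side planes and containing the opposite vertices.

    # Monte Carlo check of Gauss-Bonnet on S^2: area(ABC) ~ (alpha + beta + gamma) - pi (sketch).
    import numpy as np

    rng = np.random.default_rng(2)
    # Three points in the positive octant, so the triangle lies inside a hemisphere.
    A, B, C = (np.abs(v) / np.linalg.norm(v) for v in rng.normal(size=(3, 3)))

    def ang(u, v):
        return np.arccos(np.clip(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)), -1, 1))

    def angle_at(P, Q, R):
        # interior angle of the triangle at vertex P = angle between planes OPQ and OPR
        return ang(np.cross(P, Q), np.cross(P, R))

    excess = angle_at(A, B, C) + angle_at(B, C, A) + angle_at(C, A, B) - np.pi

    # A point lies in the triangle iff it is on the same side of each side-plane as the opposite vertex.
    nA, nB, nC = np.cross(B, C), np.cross(C, A), np.cross(A, B)
    pts = rng.normal(size=(200_000, 3))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)          # roughly uniform points on S^2
    hits = ((pts @ nA) * (A @ nA) > 0) & ((pts @ nB) * (B @ nB) > 0) & ((pts @ nC) * (C @ nC) > 0)
    area = 4 * np.pi * hits.mean()

    print(area, excess)   # the two numbers agree to about two decimal places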

11.2.1 Möbius geometry


It turns out it is convenient to identify the sphere S 2 with the extended complex
plane C∞ = C ∪ {∞}. Then isometries of S 2 will translate to a nice class of maps of
C∞ .
From IA Groups, we know Möbius transformations act on C∞ and form a group G under composition. For each matrix A = ( a b ; c d ) ∈ GL(2, C) we get a Möbius map C∞ → C∞ given by
ζ 7→ (aζ + b)/(cζ + d).
Moreover, composition of Möbius maps corresponds to multiplication of the matrices.
This is not exactly a bijective correspondence between G and GL(2, C). For any λ ∈ C∗ = C \ {0}, we know λA defines the same Möbius map as A. Conversely, if A1 and A2 give the same Möbius map, then there is some λ 6= 0 such that A1 = λA2 . Hence, we have
G ∼= PGL(2, C) = GL(2, C)/C∗ ,
where C∗ ∼= {λI : λ ∈ C∗ }.

Instead of taking the whole GL(2, C) and quotienting out multiples of the identities,
we can instead start with SL(2, C). Again, A1 , A2 ∈ SL(2, C) define the same map if
and only if A1 = λA2 for some λ. What are the possible values of λ? By definition of
the special linear group, we must have 1 = det(λA) = λ2 det A = λ2 . So λ = ±1. So
each Möbius map is represented by two matrices, A and −A, and we get

G ∼= PSL(2, C) = SL(2, C)/{±1}.

Now let’s think about the sphere. On S 2 , the rotations SO(3) act as isometries. Recall
the full isometry group of S 2 is O(3). We would like to show that rotations of S 2 corre-
spond to Möbius transformations coming from the subgroup SU(2) ≤ GL(2, C).
C. 11-21
<Stereographic projection> We first find a way to identify S 2 with C∞ . We
use coordinates ζ ∈ C∞ . We define the stereographic projection π : S 2 → C∞ by
π(P ) = (line P N ) ∩ {z = 0} which is well defined except where P = N , in which
case we define π(N ) = ∞.

N
P

π(P )

To give an explicit formula for this, con-


sider the cross-section through the plane
N
ON P . If P has coordinates (x, y, z), then
P
we see that π(P ) will be a scalar multiple
r
of x + iy. To find this factor, we notice z
that we have two similar triangles, and π(P )
hence obtain O R
r 1−z
= .
R 1
Then we obtain the formula
x + iy
π(x, y, z) = .
1−z
If we do the projection from the South pole instead, we get a related formula.

L. 11-22
If π ′ : S 2 → C∞ denotes the stereographic projection from the South Pole, then
π ′ (P ) = 1/ζ̄, where ζ = π(P ) and the bar denotes complex conjugation.

Let P = (x, y, z). We know π(x, y, z) = (x + iy)/(1 − z). We have
π ′ (x, y, z) = (x + iy)/(1 + z),
since we have just flipped the z axis around. So the product of π(P ) with the complex conjugate of π ′ (P ) is
((x + iy)(x − iy))/((1 − z)(1 + z)) = (x2 + y 2 )/(1 − z 2 ) = 1,
noting that we have x2 + y 2 + z 2 = 1 since we are on the unit sphere. Hence π ′ (P ) = 1/ζ̄, as claimed.


We can use this to infer that π 0 ◦ π −1 : C∞ → C∞ takes ζ 7→ 1/ζ̄, which is the
inversion in the unit circle |ζ| = 1.
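
The two projections and the relation just proved are easy to check numerically. A minimal sketch, assuming Python with numpy (not part of the course):

    # Stereographic projections from the two poles, and the relation pi'(P) = 1/conj(pi(P)) (sketch).
    import numpy as np

    def pi_N(P):                        # projection from the North pole
        x, y, z = P
        return (x + 1j*y) / (1 - z)

    def pi_S(P):                        # projection from the South pole
        x, y, z = P
        return (x + 1j*y) / (1 + z)

    rng = np.random.default_rng(3)
    P = rng.normal(size=3); P /= np.linalg.norm(P)     # a random point of S^2 (almost surely not a pole)

    print(np.isclose(pi_S(P), 1 / np.conj(pi_N(P))))   # True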
T. 11-23
Via the stereographic projection, every rotation of S 2 induces a Möbius map de-
fined by a matrix in SU(2) ⊆ GL(2, C), where
SU(2) = { ( a −b ; b̄ ā ) : |a|2 + |b|2 = 1 }.

1. Consider r(ẑ, θ), the rotation about the z axis by θ. This corresponds to the Möbius map ζ 7→ eiθ ζ, which is given by the unitary matrix
( eiθ/2 0 ; 0 e−iθ/2 ).

2. Consider the rotation r(ŷ, π/2). This has the matrix
( 0 0 1 ; 0 1 0 ; −1 0 0 ),
which sends (x, y, z) to (z, y, −x). This corresponds to the map
ζ = (x + iy)/(1 − z) 7→ ζ ′ = (z + iy)/(1 + x).
We want to show this is a Möbius map. To do so, we guess what the Möbius map should be, and check it works. We can manually compute that −1 7→ ∞, 1 7→ 0, i 7→ i. The only Möbius map that does this is
ζ ′ = (ζ − 1)/(ζ + 1).
We now check:
(ζ − 1)/(ζ + 1) = (x + iy − 1 + z)/(x + iy + 1 − z) = (x − 1 + z + iy)/(x + 1 − (z − iy))
= (z + iy)(x − 1 + z + iy)/((x + 1)(z + iy) − (z 2 + y 2 ))
= (z + iy)(x − 1 + z + iy)/((x + 1)(z + iy) + (x2 − 1))
= (z + iy)(x − 1 + z + iy)/((x + 1)(z + iy + x − 1)) = (z + iy)/(x + 1),
as required. We finally have to write this in the form of an SU(2) matrix:
(1/√2) ( 1 −1 ; 1 1 ).

3. We claim that SO(3) is generated by r(ŷ, π/2) and r(ẑ, θ) for 0 ≤ θ < 2π. To
show this, we first observe that r(x̂, ϕ) = r(ŷ, π/2) r(ẑ, ϕ) r(ŷ, −π/2). Note that
we read the composition from right to left. You can convince yourself this is
true by taking a physical sphere and try rotating. To prove it formally, we can
just multiply the matrices out.

Next, observe that for v ∈ S 2 ⊆ R3 , there are some angles ϕ, ψ such that
g = r(ẑ, ψ)r(x̂, ϕ) maps v to x̂. We can do so by first picking r(x̂, ϕ) to rotate
v into the (x, y)-plane. Then we rotate about the z-axis to send it to x̂. Then
for any θ, we have r(v, θ) = g −1 r(x̂, θ)g, and our claim follows by composition.
4. Thus, via the stereographic projection, every rotation of S 2 corresponds to
products of Möbius transformations of C∞ with matrices in SU(2).
The key of the proof is step 3. Apart from enabling us to perform the proof, it
exemplifies a useful technique in geometry — we know how to rotate arbitrary
things in the z axis. When we want to rotate things about the x axis instead,
we first rotate the sphere to move the x axis to where the z axis used to be, do
those rotations, and then rotate it back. In general, we can use some isometries
or rotations to move what we want to do to a convenient location.
T. 11-24
The group of rotations SO(3) acting on S 2 corresponds precisely with the subgroup
PSU(2) = SU(2)/{±1} of Möbius transformations acting on C∞ .

Let g ∈ PSU(2) be a Möbius transformation
g(z) = (az + b)/(b̄z + ā).

Suppose first that g(0) = 0. So b = 0. So aā = 1. Hence a = eiθ/2 . Then g


corresponds to r(ẑ, θ), as we have previously seen.
In general, let g(0) = w ∈ C∞ . Let Q ∈ S 2 be such that π(Q) = w. Choose a
rotation A ∈ SO(3) such that A(Q) = −ẑ. Since A is a rotation, let α ∈ PSU(2)
be the corresponding Möbius transformation. By construction we have α(w) = 0.
Then the composition α ◦ g fixes zero, so we have reduced to the initial case. Thus
it corresponds to some B = r(ẑ, θ). We then see that g corresponds to A−1 B ∈
SO(3).
This is in some sense a converse of the previous theorem. We are saying that
for each Möbius map from SU(2), we can find some rotation of S 2 that induces
that Möbius map, and there is exactly one. Again, we solved an easy case, where
g(0) = 0, and then performed some rotations to transform any other case into this
simple case. We have now produced a 2-to-1 map

SU(2) → PSU(2) ∼
= SO(3).

If we treat these groups as topological spaces, this map does something funny.
Suppose we start with a (non-closed) path from I to −I in SU(2). Applying the
map, we get a closed loop from I to I in SO(3). Hence, in SO(3), loops are behave
slightly weirdly. If we go around this loop in SO(3), we didn’t really get back to
the same place. Instead, we have actually moved from I to −I in SU(2). It takes
two full loops to actually get back to I. In physics, this corresponds to the idea
of spinors.
We can also understand this topologically as follows: since SU(2) is defined by two
complex numbers a, b ∈ C such that |a|2 + |b|2 = 1, we can view it as a three-sphere S 3
sitting in C2 = R4 . A nice property of S 3 is that it is simply connected, in that any loop in S 3 can
be shrunk to a point. On the other hand, SO(3) is not simply connected. We have
just constructed a loop in SO(3) by mapping the path from I to −I in SU(2). We
cannot deform this loop until it just sits at a single point, since if we lift it back
up to SU(2), it still has to move from I to −I.
The neat thing is that in some sense, S 3 ∼= SU(2) is just “two copies” of SO(3). By
duplicating SO(3), we have produced SU(2), a simply connected space. Thus we
say SU(2) is a universal cover of SO(3). We’ve just been waffling about spaces
and loops, and throwing around terms we haven’t defined properly. These vague
notions will be made precise in Algebraic Topology.

11.3 Triangulations and the Euler number


We shall now study the idea of triangulations and the Euler number. We aren’t going to
do much with them — we will not even prove that the Euler number is a well-defined
number. However, we need Euler numbers in order to state the full Gauss-Bonnet
theorem at the end, and the idea of triangulations is useful in Algebraic Topology for
defining simplicial homology.
D. 11-25
• The (Euclidean) torus T is the set R2 /Z2 of equivalence classes of (x, y) ∈ R2
under the equivalence relation

(x1 , y1 ) ∼ (x2 , y2 ) ⇔ x1 − x2 , y1 − y2 ∈ Z.

• A topological triangle on a metric space X is a subset R ⊆ X equipped with a


homeomorphism R → ∆, where ∆ is a closed Euclidean triangle in R2 .
• A topological triangulation τ of a metric space X is a finite collection of topolog-
ical triangles of X which cover X and is such that
1. For every pair of triangles in τ , either they are disjoint, or they meet in exactly
one edge, or meet at exactly one vertex.
2. Each edge belongs to exactly two triangles.
• The Euler number e = e(X, τ ) of a triangulation is e = F − E + V where F is the
number of triangles, E is the number of edges and V is the number of vertices.
• A geodesic triangle is a triangle whose sides are geodesics, ie. paths of shortest
distance between two points.
E. 11-26
It is easy to see the relation in the definition of the torus is indeed an equivalence relation. Thus a point in the torus T represented by (x, y) is a coset (x, y) + Z2 of the subgroup Z2 ≤ R2 . Of course, there are many ways to represent a point in the torus. However, for any closed square Q ⊆ R2 with side length 1, we can obtain T by identifying the opposite sides. We can define a distance d for all P1 , P2 ∈ T to be
d(P1 , P2 ) = min{kv1 − v2 k : vi ∈ R2 , vi + Z2 = Pi }.
It is not hard to show this definition makes (T, d) into a metric space. This allows
us to talk about things like open sets and continuous functions. We will later show
that this is not just a metric space, but a smooth surface, after we have defined
what that means.

We write Q̊ for the interior of Q. Then the natural map f : Q̊ → T given


by v 7→ v + Z2 is a bijection onto some open set U ⊆ T . In fact, U is just
T \ {two circles meeting in 1 point}, where the two circles are the boundary of the
square Q.

Now given any point in the torus represented by P + Z2 , we can find a square Q
such that P ∈ Q̊. Then f : Q̊ → T restricted to an small enough open disk about
P is an isometry. Thus we say d is a locally Euclidean metric.
One can also think of the torus T as the surface of a doughnut,
“embedded” in Euclidean space R3 . Given this, it is natural
to define the distance between two points to be the length of
the shortest curve between them on the torus. However, this
distance function is not the same as what we have here. So
it is misleading to think of our locally Euclidean torus as a
“doughnut”.

With one more example in our toolkit, we can start doing what we really want to
do. The idea of a triangulation is to cut a space X up into many smaller triangles,
since we like triangles. Note that a spherical triangle is in fact a topological
triangle, using the radial projection to the plane R2 from the center of the sphere.
These notions are useful only if the space X is “two dimensional” — there is no
way we can triangulate, say R3 , or a line. We can generalize triangulation to allow
higher dimensional “triangles”, namely topological tetrahedrons, and in general,
n-simplices, and make an analogous definition of triangulation. However, we will
not bother ourselves with this.

T. 11-27
The Euler number e is independent of the choice of triangulation.

This is one important fact about triangulations from algebraic topology, which we
will state without proof. So the Euler number e = e(X) is a property of the space
X itself, not a particular triangulation.

E. 11-28
1. Consider the following triangulation of the sphere as shown
on the diagram. This has 8 faces, 12 edges and 6 vertices.
So e = 2.

2. Consider the following triangulation of the torus. Be care-


ful not to double count the edges and vertices at the sides
(dashed), since the sides are glued together. This has 18
faces, 27 edges and 9 vertices. So e = 0.

In both cases, we did not cut up our space with funny, squiggly lines. Instead, we
used “straight” lines (a curve that minimises distance between two points). These
triangles are known as geodesic triangles. In particular, we used spherical triangles
in S 2 and Euclidean triangles in Q̊. Triangulations made of geodesic triangles are
rather nice and we will prove a result about it. In general, however, a topological
triangle can have any simple closed curve (even a circle) as its boundary; it is just that there
are three distinguished points on it that we consider to be vertices.

P. 11-29
Every geodesic triangulation of S 2 has e = 2 and every geodesic triangulation of T has e = 0.

For any triangulation τ , denote the “faces” by ∆1 , · · · , ∆F , and write τi = αi + βi + γi for the sum of the interior angles of the i-th triangle (i = 1, · · · , F ). Then we have Σ_i τi = 2πV , since the total angle around each vertex is 2π. Also, each triangle has three edges, and each edge belongs to exactly two triangles. So 3F = 2E. We write this in a more convenient form: F = 2E − 2F . How we continue depends on whether we are on the sphere or the torus.
1. For the sphere, Gauss-Bonnet for the sphere says the area of ∆i is τi − π. Since the area of the sphere is 4π, we know
4π = Σ_i area(∆i ) = Σ_i (τi − π) = 2πV − F π = 2πV − (2E − 2F )π = 2π(F − E + V ).
So F − E + V = 2.
2. For the torus, the triangles are Euclidean, so τi = π for every face. So 2πV = Σ_i τi = πF , and so 2V = F = 2E − 2F . Hence we get F − E + V = 0 as required.
Of course, we know this is true for any triangulation, but it is difficult to prove
that without algebraic topology.
Note that in the definition of triangulation, we decomposed X into topological
triangles. We can also use decompositions by topological polygons, but they are
slightly more complicated, since we have to worry about convexity. However, apart
from this, everything works out well. In particular, the previous proposition also
holds, and we have Euler’s formula for S 2 : V − E + F = 2 for any polygonal
decomposition of S 2 .
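
The count F − E + V for the octahedral triangulation of S 2 in [E.11-28] can also be done mechanically, by listing the eight faces and collecting their edges and vertices as sets. A minimal sketch, assuming Python (not part of the course):

    # Euler number of the octahedral (geodesic) triangulation of S^2 (sketch).
    from itertools import combinations

    # Vertices are the six points +-e_i, labelled here by (sign, axis).
    vertices = [(s, i) for i in range(3) for s in (+1, -1)]

    # Each of the eight faces uses one vertex from each coordinate axis.
    faces = [((s0, 0), (s1, 1), (s2, 2))
             for s0 in (+1, -1) for s1 in (+1, -1) for s2 in (+1, -1)]

    # Edges are the pairs of vertices appearing together in some face.
    edges = {frozenset(e) for f in faces for e in combinations(f, 2)}

    F, E, V = len(faces), len(edges), len(vertices)
    print(F, E, V, F - E + V)   # 8 12 6 2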

11.4 Hyperbolic geometry


Historically, hyperbolic geometry was created when people tried to prove Euclid’s
parallel postulate. Instead of proving the parallel postulate, they managed to create a
new geometry where this is false, and this is hyperbolic geometry. Hyperbolic geometry
is much more complicated, since we cannot directly visualize it as a subset of R3 .
Instead, we need to develop the machinery of a Riemannian metric in order to properly
describe hyperbolic geometry. In a nutshell, this allows us to take a subset of R2 and
measure distances in it in a funny way.
D. 11-30
? Here we call a function f smooth if it’s C ∞ . A function is a C ∞ - diffeomorphism
if it’s a C ∞ bijective map with C ∞ inverse.
? Write the derivative for a function f : U → Rm at a point a ∈ U ⊆ Rn as
dfa : Rn → Rm (also written as Df (a) or f 0 (a)). We can write the derivative as
the Jacobian matrix
J(f )a = ( ∂fi /∂xj (a) ),
with the linear map given by matrix multiplication, namely h 7→ J(f )a · h.

E. 11-31
Suppose we have a holomorphic (analytic) function of complex variables f : U ⊆
C → C. Say f 0 (z) = a + ib and w = h1 + ih2 , then we have
f 0 (z)w = ah1 − bh2 + i(ah2 + bh1 ).
If we identify R2 = C, then f : U ⊆ R2 → R2 has a derivative dfz : R2 → R2 given
by the matrix ( a −b ; b a ).
D. 11-32
We use coordinates (u, v) ∈ R2 . We let V ⊆ R2 be open. Then a Riemannian metric
on V is defined by giving C ∞ functions E, F, G : V → R such that
( E(P ) F (P ) ; F (P ) G(P ) )
is a positive definite matrix for all P ∈ V . Alternatively, this is a smooth
function that gives a 2 × 2 symmetric positive definite matrix, ie. inner product
h · , · iP , for each point in V .
E. 11-33
Note that if we pick E = G = 1 and F = 0, then this is just the standard
Euclidean inner product. Note also that by definition, if e1 , e2 are the standard
basis, then
he1 , e1 iP = E(P )
he1 , e2 iP = F (P )
he2 , e2 iP = G(P ).
The basic idea of a Riemannian metric is not too unfamiliar. Presumably, we
have all seen maps of the Earth, where we try to draw the spherical Earth on
a piece of paper, ie. a subset of R2 . However, this does not behave like R2 .
You cannot measure distances on Earth by placing a ruler on the map, since
distances are distorted. Instead, you have to find the coordinates of the points (eg.
the longitude and latitude), and then plug them into some complicated formula.
Similarly, straight lines on the map are not really straight (spherical) lines on
Earth. We really should not think of Earth a subset of R2 . All we have done was
to “force” Earth to live in R2 to get a convenient way of depicting the Earth, as
well as a convenient system of labelling points (in many map projections, the x
and y axes are the longitude and latitude).
This is the idea of a Riemannian metric. To describe some complicated surface,
we take a subset U of R2 , and define a new way of measuring distances, angles
and areas on U . All these information are packed into an entity known as the
Riemannian metric. As mentioned, we should not imagine V as a subset of R2 .
Instead, we should think of it as an abstract two-dimensional surface, with some
coordinate system given by a subset of R2 . However, this coordinate system is
just a convenient way of labelling points. They do not represent any notion of
distance. For example, (0, 1) need not be closer to (0, 2) than to (7, 0). These are
just abstract labels.
With this in mind, V does not have any intrinsic notion of distances, angles and
areas. However, we do want these notions. We can certainly write down things
like the difference of two points, or even compute the derivative of a function.
However, these numbers you get are not meaningful, since we can easily use a
different coordinate system (eg. by scaling the axes) and get a different number.
They have to be interpreted with the Riemannian metric. This tells us how to
measure these things, via an inner product “that varies with space”. This variation
in space is not an oddity arising from us not being able to make up our minds.
This is since we have “forced” our space to lie in R2 . Inside V , going from (0, 1)
to (0, 2) might be very different from going from (5, 5) to (6, 5), since coordinates
don’t mean anything. Hence our inner product needs to measure “going from
(0, 1) to (0, 2)” differently from “going from (5, 5) to (6, 5)”, and must vary with
space.
We’ll soon come to defining how this inner product gives rise to the notion of
distance and similar stuff. Before that, we want to understand what we can put
into the inner product h · , · iP . Obviously these would be vectors in R2 , but where
do these vectors come from? What are they supposed to represent? The answer
is “directions” (more formally, tangent vectors). For example, he1 , e1 iP will tell
us how far we actually are going if we move in the direction of e1 from P . Note
that we say “move in the direction of e1 ”, not “move by e1 ”. We really should
read
p this as “if we move by he1 for some small h, then the distance covered is
h he1 , e1 iP ”. This statement is to be interpreted along the same lines as “if we
vary x by some small h, then the value of f will vary by f 0 (x)h”. Notice how
the inner product allows us to translate a length in R2 (namely khe1 keucl = h)
into the actual length in V . What we needed for this is just the norm induced by
the inner product. Since what we have is the whole inner product, we in fact can
define more interesting things such as areas and angles. We will formalize these
ideas very soon, after getting some more notation out of the way.
Often, instead of specifying the three functions separately, we write the metric as

E du2 + 2F du dv + G dv 2 .

This notation has some mathematical meaning. We can view the coordinates as
smooth functions u : V → R, v : V → R. Since they are smooth, they have
derivatives. They are linear maps

duP : R2 → R, (h1 , h2 ) 7→ h1 ,   and   dvP : R2 → R, (h1 , h2 ) 7→ h2 .

These formulae are valid for all P ∈ V . So we just write du and dv instead.
Since they are maps R2 → R, we can view them as vectors in the dual space,
du, dv ∈ (R2 )∗ . Moreover, they form a basis for the dual space. In particular,
they are the dual basis to the standard basis e1 , e2 of R2 . Then we can consider
du2 , du dv and dv 2 as bilinear forms on R2 . For example,

du2 (h, k) = du(h) du(k),
du dv(h, k) = (1/2)(du(h) dv(k) + du(k) dv(h)),
dv 2 (h, k) = dv(h) dv(k).
These have matrices
( 1 0 ; 0 0 ),   ( 0 1/2 ; 1/2 0 ),   ( 0 0 ; 0 1 )

respectively. Then we indeed have
E du2 + 2F du dv + G dv 2 = ( E F ; F G ).
D. 11-34
• The length of a smooth curve γ = (γ1 , γ2 ) : [0, 1] → V is defined as
∫_0^1 kγ̇kγ(t) dt = ∫_0^1 (hγ̇, γ̇iγ(t) )1/2 dt = ∫_0^1 (E γ̇1 2 + 2F γ̇1 γ̇2 + G γ̇2 2 )1/2 dt,

where E = E(γ1 (t), γ2 (t)) etc.


• We define the distance between two points P and Q to be the infimum of the
lengths of all (piecewise smooth) curves from P to Q.
• The area of a region W ⊆ V is defined as
area(W ) = ∫_W ( det( E F ; F G ) )1/2 du dv = ∫_W (EG − F 2 )1/2 du dv

when this integral exists.


• Let V, Ṽ ⊆ R2 be open sets endowed with Riemannian metrics, denoted h · , · iP
and h · , · i∼Q for P ∈ V and Q ∈ Ṽ respectively. A C ∞ -diffeomorphism ϕ : V → Ṽ
is an isometry if for every P ∈ V and x, y ∈ R2 , we get
hx, yiP = hdϕP (x), dϕP (y)i∼ϕ(P ) .

E. 11-35
One can show that the distance as defined is indeed a metric (assuming V is path-
connected). In the area formula, what we are integrating is just the determinant
of the metric, this is also known as the Gram determinant.
To understand the definition of isometry: Consider the point P ∈ V , at this point
we can go in say the two directions x and y. At Ṽ , this corresponds to going in
the two directions dϕP (x) and dϕP (y) from the point ϕ(P ). We want an isometry
to preserve the inner product, hence we want hx, yiP = hdϕP (x), dϕP (y)i∼ϕ(P ) .

This definition of isometry indeed preserves lengths: If γ : [0, 1] → V is a C ∞


curve, the composition γ̃ = ϕ ◦ γ : [0, 1] → Ṽ is a path in Ṽ . We let P = γ(t), and
hence ϕ(P ) = γ̃(t). Then
hγ̃ ′ (t), γ̃ ′ (t)i∼γ̃(t) = hdϕP ◦ γ ′ (t), dϕP ◦ γ ′ (t)i∼ϕ(P ) = hγ ′ (t), γ ′ (t)iγ(t)=P .

Integrating, we obtain length(γ̃) = length(γ).


E. 11-36
To understand this better, here is an example relating spherical and Euclidean
geometry. We will just outline and not do this in full detail. Let V = R2 , and
define the Riemannian metric by
4(du2 + dv 2 )/(1 + u2 + v 2 )2 .
This looks somewhat arbitrary, but we shall see this actually makes sense by
identifying R2 with the sphere by the stereographic projection π : S 2 \ {N } → R2 .
For every point P ∈ S 2 , the tangent plane to S 2 at P is given by
{x ∈ R3 : x · OP = 0}.

Note that we translated it so that P is the origin, so that we can view it as a


vector space (points on the tangent plane are points “from P ”). Now given any

two tangent vectors x1 , x2 ⊥ OP , we can take the inner product x1 · x2 in R3 .
We want to say this inner product is “the same as” the inner product provided by
the Riemannian metric on R2 . We cannot just require x1 · x2 = hx1 , x2 iπ(P ) since
this makes no sense at all. Apart from the obvious problem that x1 , x2 have three
components but the Riemannian metric takes in vectors of two components, we
know that x1 and x2 are vectors tangent to P ∈ S 2 , but to apply the Riemannian
metric, we need the corresponding tangent vector at π(P ) ∈ R2 . To do so, we act
by dπp . So what we want is

x1 · x2 = hdπP (x1 ), dπP (x2 )iπ(P ) .

One can verify that this equality is indeed true by noting that
π −1 (u, v) = (2u, 2v, u2 + v 2 − 1)/(1 + u2 + v 2 ).
In some sense, we say the surface S 2 \{N } is “isometric” to R2 via the stereographic
projection π. We can define the notion of isometry between two open sets with
Riemannian metrics in general.
P. 11-37
Suppose that V and Ṽ are open subsets of R2 equipped with Riemannian metrics,
and that φ : Ṽ → V is an isometry. Then for any region W ⊆ V for which the
area exists, the region φ−1 W of Ṽ has the same area as W .

Since φ is an isometry, it preserves the Riemannian metrics, that is, for any P ∈ Ṽ
( Ẽ F̃ ; F̃ G̃ )P = J T ( E F ; F G )φ(P ) J,   (∗)
where J = J(φ) is the Jacobian matrix representing dφP . Now
area(W ) = ∫_W ( det( E F ; F G )(u,v) )1/2 du dv
= ∫_{φ−1 W} ( det( E F ; F G )φ(ũ,ṽ) )1/2 | det J(φ)| dũ dṽ
= ∫_{φ−1 W} ( det( Ẽ F̃ ; F̃ G̃ )(ũ,ṽ) )1/2 dũ dṽ = area(φ−1 W ),

where the second equality is the change of variable formula from vector calculus,
and the third equality is obtained by taking the determinant of (∗).
D. 11-38
• The (Poincaré) disk model for the hyperbolic plane is given by the unit disk
D ⊆ C ∼= R2 , where D = {ζ ∈ C : |ζ| < 1}, and a Riemannian metric on this disk
given by
4(du2 + dv 2 )/(1 − u2 − v 2 )2 = 4|dζ|2 /(1 − |ζ|2 )2 ,   where ζ = u + iv.   (∗)

• The upper half-plane is H = {z ∈ C : Im(z) > 0}. The upper half-plane model
of the hyperbolic plane is the upper half-plane H with the Riemannian metric
(dx2 + dy 2 )/y 2 .

E. 11-39
Hyperbolic geometry is another possible type of geometry after Euclidean and
Spherical geometry. We provide two models of the hyperbolic plane. Each model
has its own strengths, and often proving something is significantly easier in one
model than the other.
Note that the Poincaré disk model is similar to our previous metric for the sphere,
but we have 1 − u2 − v 2 instead of 1 + u2 + v 2 . To interpret the term |dζ|2 ,
we can either formally set |dζ|2 = du2 + dv 2 , or interpret it as the derivative
dζ = du + idv : C → C.
We see that (∗) is a scaling of the standard Euclidean metric by a factor depending
on the polar radius r = |ζ|. The distances are scaled by 2/(1 − r2 ), while the
areas are scaled by 4/(1 − r2 )2 . Note, however, that the angles in the hyperbolic
disk are the same as that in R2 . This is in general true for metrics that are just
scaled versions of the Euclidean metric.
What is the appropriate Riemannian metric to put on the upper half plane? We
know D bijects to H via the Möbius transformation

ϕ : ζ ∈ D 7→ i(1 + ζ)/(1 − ζ) ∈ H.

This bijection is in fact a conformal equivalence, as defined in IB Complex Analy-


sis/Methods. The idea is to pick a metric on H such that this map is an isometry.
Then H together with this Riemannian metric will be the upper half-plane model
for the hyperbolic plane.
To avoid confusion, we reserve the letter z for points z ∈ H, with z = x + iy, while
we use ζ for points ζ ∈ D, and write ζ = u + iv. Then we have

z = i(1 + ζ)/(1 − ζ)   and   ζ = (z − i)/(z + i).

Instead of trying to convert the Riemannian metric on D to H, which would be


a complete algebraic horror, we first try converting the Euclidean metric. The
Euclidean metric on R2 = C is given by
1
hw1 , w2 i = Re(w1 w̄2 ) = (w1 w̄2 + w̄1 w2 ).
2
z−i
So if h · , · ieucl is the Euclidean metric at ζ, then at z such that ζ = z+i
, we require
(by definition of isometry)
  2 2
dζ dζ dζ dζ
hw, viz = w, v = Re(wv̄) = (w1 v1 + w2 v2 ),

dz dz eucl dz dz

where w = w1 + iw2 , v = v1 + iv2 . Hence, on H, we obtain the Riemannian metric


2

(dx2 + dy 2 ).
dz

We can compute
dζ 1 z−i 2i
= − = .
dz z+i (z + i)2 (z + i)2
11.4. HYPERBOLIC GEOMETRY 495

This is what we get if we started with a Euclidean metric. If we start with the
hyperbolic metric on D, we get an additional scaling factor. We can do some
computations to get
1 − |ζ|2 = 1 − |z − i|2 /|z + i|2 ,   and hence   1/(1 − |ζ|2 ) = |z + i|2 /(|z + i|2 − |z − i|2 ) = |z + i|2 /(4 Im z).
Putting all these together, the metric corresponding to 4|dζ|2 /(1 − |ζ|2 )2 on D is
4 · (4/|z + i|4 ) · (|z + i|2 /(4 Im z))2 |dz|2 = |dz|2 /(Im z)2 = (dx2 + dy 2 )/y 2 ,
which is the upper half-plane model as expected. The lengths on H are scaled
(from the Euclidean one) by 1/y, while the areas are scaled by 1/y 2 . Again, the
angles are the same.
Note that we did not have to go through so much mess in order to define the
sphere. This is since we can easily “embed” the surface of the sphere in R3 .
However, there is no easy surface in R3 that gives us the hyperbolic plane. As we
don’t have an actual prototype, we need to rely on the more abstract data of a
Riemannian metric in order to work with hyperbolic geometry.
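As a quick sanity check (not part of the development above; a sketch assuming only numpy), one can verify numerically that pulling back the half-plane metric |dz|²/(Im z)² along ϕ(ζ) = i(1 + ζ)/(1 − ζ) recovers the disk metric 4|dζ|²/(1 − |ζ|²)²: the two conformal factors agree at randomly chosen points of D.

import numpy as np

rng = np.random.default_rng(0)

def phi(zeta):
    # the Mobius map D -> H used above
    return 1j * (1 + zeta) / (1 - zeta)

def phi_prime(zeta):
    # its complex derivative
    return 2j / (1 - zeta) ** 2

for _ in range(1000):
    r, t = rng.uniform(0, 0.99), rng.uniform(0, 2 * np.pi)
    zeta = r * np.exp(1j * t)                                      # random point of D
    pulled_back = abs(phi_prime(zeta)) ** 2 / phi(zeta).imag ** 2  # conformal factor of the pulled-back metric
    disk = 4 / (1 - abs(zeta) ** 2) ** 2                           # conformal factor of the disk metric
    assert np.isclose(pulled_back, disk)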
C. 11-40
<Möbius maps that fixes H> Consider the following group of Möbius maps:
 
PSL(2, R) = { z ↦ (az + b)/(cz + d) : a, b, c, d ∈ R and ad − bc = 1 }.
Note that the coefficients have to be real, not complex. It is easy to check that
these are precisely the Möbius maps that send R ∪ {∞} to R ∪ {∞} and sends H
to H. A Möbius map with real coefficients may always be represented by a real
matrix with determinant ±1; the condition that the determinant is positive is just
saying that the upper half-plane is sent to itself (and not the lower half plane).
We will show that these maps are in fact isometries of H.

P. 11-41
The elements of PSL(2, R) are isometries of H.

It is easy to check that PSL(2, R) is generated by

Translations: z ↦ z + a for a ∈ R
Dilations: z ↦ az for a > 0
The single map: z ↦ −1/z.

So it suffices to show each of these preserves the metric |dz|2 /y 2 , where z = x + iy.
The first two are straightforward to see: plugging them into the formula, the metric does not change. We now look at the last one, given by z ↦ −1/z. The
derivative at z is f′(z) = 1/z². So we get

dz ↦ d(−1/z) = dz/z²,   and so   |d(−1/z)|² = |dz|²/|z|⁴.

We also have

Im(−1/z) = −(1/|z|²) Im z̄ = Im z/|z|².

and so

|d(−1/z)|²/(Im(−1/z))² = (|dz|²/|z|⁴) · (|z|⁴/(Im z)²) = |dz|²/(Im z)².
So this is an isometry, as required.
Note that each z 7→ az + b with a > 0, b ∈ R is in PSL(2, R). Also, we can use
maps of this form to send any point in H to any other point in H. So PSL(2, R)
acts transitively on H.
Recall also that each Möbius transformation preserves circles and lines in the
complex plane, as well as angles between circles/lines. In particular, consider the
line L = iR, which meets R perpendicularly, and let g ∈ PSL(2, R). Then the
image is either a circle centered at a point in R, or a straight line perpendicular
to R. We let L+ = L ∩ H = {it : t > 0}. Then g(L+ ) is either a vertical half-line
or a semi-circle that ends in R.
D. 11-42
• Hyperbolic lines in H are vertical half-lines or semicircles ending in R.
? We write L+ = {it : t > 0}.
• For points z1 , z2 ∈ H, the hyperbolic distance ρ(z1 , z2 ) is the length of the segment
[z1 , z2 ] ⊆ ` of the hyperbolic line through z1 , z2 (parametrized monotonically).
L. 11-43
Given any two distinct points z1 , z2 ∈ H, there exists a unique hyperbolic line
through z1 and z2 .

This is clear if Re z₁ = Re z₂ — we just pick the vertical half-line through them, and it is clear this is the only possible choice. Otherwise, if Re z₁ ≠ Re z₂, then we can find the centre of the desired semi-circle by constructing the perpendicular bisector of the Euclidean segment z₁z₂ and intersecting it with R, as in the diagram. Hence this is the only possible choice.
L. 11-44
PSL(2, R) acts transitively on the set of hyperbolic lines in H.

It suffices to show that for each hyperbolic line `, there is some g ∈ PSL(2, R) such
that g(`) = L+ . This is clear when ` is a vertical half-line, since we can just apply
a horizontal translation. If it is a semicircle, suppose it has end-points s < t ∈ R.
Then consider
z−t
g(z) = .
z−s
This has determinant −s+t > 0. So g ∈ PSL(2, R) (after scaling g). Then g(t) = 0
and g(s) = ∞. Then we must have g(`) = L+ , since g(`) is a hyperbolic line, and
the only hyperbolic lines passing through ∞ are the vertical half-lines.
Note that we can achieve g(s) = 0 and g(t) = ∞ by composing with − z1 . Also, for
any P ∈ ` not on the endpoints, we can construct a g such that g(P ) = i ∈ L+ , by
composing with z 7→ az. So the isometries act transitively on pairs (`, P ), where
` is a hyperbolic line and P ∈ `.
Note that PSL(2, R) preserves hyperbolic distances. Similar to Euclidean space
and the sphere, we show these lines minimize distance.

P. 11-45
1. If γ : [0, 1] → H is a piecewise C 1 -smooth curve with γ(0) = z1 and γ(1) = z2 ,
then length(γ) ≥ ρ(z1 , z2 ), with equality iff γ is a monotonic parametrisation
of [z1 , z2 ] ⊆ `, where ` is the hyperbolic line through z1 and z2 .
2. <Triangle inequality> Given three points z1 , z2 , z3 ∈ H, we have ρ(z1 , z3 ) ≤
ρ(z1 , z2 ) + ρ(z2 , z3 ) with equality if and only if z2 lies on the hyperbolic line segment between z1 and z3 .

1. We pick an isometry g ∈ PSL(2, R) so that g(`) = L+ . So without loss of


generality, we assume z1 = iu and z2 = iv, with u < v ∈ R. We decompose
the path as γ(t) = x(t) + iy(t). Then we have
length(γ) = ∫₀¹ (1/y)√(ẋ² + ẏ²) dt ≥ ∫₀¹ |ẏ|/y dt ≥ ∫₀¹ ẏ/y dt = [log y(t)]₀¹ = log(v/u).

This calculation also tells us that ρ(z₁, z₂) = log(v/u). So length(γ) ≥ ρ(z₁, z₂)




with equality if and only if x(t) = 0 (hence γ ⊆ L+ ) and ẏ ≥ 0 (hence


monotonic).
2. Follows from 1.
Hence, (H, ρ) is a metric space.
C. 11-46
<Geometry of the hyperbolic disk> So far, we have worked with the upper
half-plane model. This is since the upper half-plane model is more convenient for
those calculations. However, sometimes the disk model is more convenient. So we
also want to understand that as well. Recall that
ζ ↦ z = i(1 + ζ)/(1 − ζ)   and its inverse   z ↦ ζ = (z − i)/(z + i)
is an isometry between D and H. Moreover, since these are Möbius maps, circle-
lines are preserved, and angles between the lines are also preserved. Hence, imme-
diately from previous work on H, we know
1. PSL(2, R) ≅ {Möbius transformations sending D to itself} =: G. This is since PSL(2, R) consists of all the Möbius maps that fix H, and so ζ ↦ z gives an isomorphism of PSL(2, R) to G.
2. Hyperbolic lines in D are circle segments meeting |ζ| = 1 orthogonally, includ-
ing diameters.
3. G acts transitively on hyperbolic lines in D (and also on pairs consisting of a
line and a point on the line).
4. The length-minimizing geodesics on D are segments of hyperbolic lines parametrized
monotonically.
We write ρ for the (hyperbolic) distance in D.

L. 11-47
Let G be the set of isometries of the hyperbolic disk. Then
1. Rotations z 7→ eiθ z (for θ ∈ R) are elements of G.
2. If a ∈ D, then g(z) = (z − a)/(1 − āz) is in G.

1. This is clearly an isometry, since this is a linear map, preserves |z| and |dz|,
and hence also the metric
4|dz|²/(1 − |z|²)².

2. First, we need to check this indeed maps D to itself. To do this, we first make
sure it sends {|z| = 1} to itself. If |z| = 1, then

|1 − āz| = |z̄(1 − āz)| = |z̄ − ā| = |z − a|.

So indeed |g(z)| = 1. Finally, g(a) = 0, so by continuity g must map D to itself. We can then show it is an isometry by plugging it into the formula.
In fact one can show that every g ∈ G is of the form

g(z) = e^{iθ}(z − a)/(1 − āz)   or   g(z) = e^{iθ}(z̄ − a)/(1 − āz̄)   for some θ ∈ R and a ∈ D.
P. 11-48
ρ(0, reiθ ) = 2 tanh−1 r where 0 ≤ r < 1. In general, for z1 , z2 ∈ D,

ρ(z₁, z₂) = 2 tanh⁻¹ | (z₁ − z₂)/(1 − z̄₁z₂) |.

By the lemma above, we can rotate the hyperbolic disk so that reiθ is rotated to
r. So ρ(0, reiθ ) = ρ(0, r). We can evaluate this by performing the integral
ρ(0, r) = ∫₀ʳ 2 dt/(1 − t²) = 2 tanh⁻¹ r.
For the general case, we apply the Möbius transformation
g(z) = (z − z₁)/(1 − z̄₁z).
Then we have

g(z₁) = 0   and   g(z₂) = (z₂ − z₁)/(1 − z̄₁z₂) = | (z₁ − z₂)/(1 − z̄₁z₂) | e^{iθ}   for some θ ∈ R.

⟹  ρ(z₁, z₂) = ρ(g(z₁), g(z₂)) = 2 tanh⁻¹ | (z₁ − z₂)/(1 − z̄₁z₂) |.
Again, we exploited the idea of performing the calculation in an easy case, and
then using isometries to move everything else to the easy case. In general, when we
have a “distinguished” point in the hyperbolic plane, it is often more convenient
to use the disk model and move the distinguished point to 0 by an isometry.
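The formula can also be checked numerically. The sketch below (assuming numpy and scipy) integrates 2|dz|/(1 − |z|²) along the diameter from 0 to r, which is a hyperbolic line, and then checks the invariance of the formula under the disk isometries of L.11-47.

import numpy as np
from scipy.integrate import quad

def rho(z1, z2):
    # hyperbolic distance formula from P.11-48
    return 2 * np.arctanh(abs((z1 - z2) / (1 - np.conj(z1) * z2)))

# distance from 0 to r along the diameter, by integrating 2 dt/(1 - t^2)
r = 0.7
integral, _ = quad(lambda t: 2 / (1 - t ** 2), 0, r)
assert np.isclose(integral, rho(0, r))

# the formula is invariant under the disk isometries g of L.11-47
z1, z2, a, theta = 0.3 + 0.2j, -0.5 + 0.1j, 0.4 - 0.3j, 0.9
g = lambda z: np.exp(1j * theta) * (z - a) / (1 - np.conj(a) * z)
assert np.isclose(rho(z1, z2), rho(g(z1), g(z2)))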
P. 11-49
For every point P and hyperbolic line `, with P 6∈ `, there is a unique hyperbolic
line `0 with P ∈ `0 such that `0 meets ` orthogonally, say at Q. Moreover ρ(P, Q) ≤
ρ(P, Q̃) for all Q̃ ∈ `, ie. Q minimises the distance from P to `.

This is a familiar fact from Euclidean geometry. To prove this, we again apply the
trick of letting P = 0. Wlog, assume P = 0 ∈ D. Note that a hyperbolic line in
D (that is not a diameter) is a Euclidean circle. So it has a center, say C. Since
any line through P is a diameter, there is clearly only one line that intersects `
perpendicularly.


It is also clear that P Q minimizes the Euclidean distance between P and `. While
this is not the same as the hyperbolic distance, since hyperbolic lines through P
are diameters, having a larger hyperbolic distance is equivalent to having a higher
Euclidean distance. So this indeed minimizes the distance.
L. 11-50
<Hyperbolic reflection> Suppose g is an isometry of the hyperbolic half-plane
H and g fixes every point in L+ = {iy : y ∈ R+}. Then g is either the identity or
g(z) = −z̄, ie. it is a reflection in the vertical axis L+ .

For every P ∈ H \ L+, there is a unique line ℓ′ containing P such that ℓ′ ⊥ L+. Let Q = L+ ∩ ℓ′. We see ℓ′ is a semicircle, and by definition of isometry (g fixes Q), we must have ρ(P, Q) = ρ(g(P), Q).

Now note that g(ℓ′) is also a line meeting L+ perpendicularly at Q, since g fixes L+ and preserves angles. So we must have g(ℓ′) = ℓ′. Then in particular g(P) ∈ ℓ′. So we must have g(P) = P or g(P) = P′, where P′ is the image of P under reflection in L+.
Now it suffices to prove that if g(P) = P for any one P ∈ H \ L+, then g must be the identity (since otherwise g(P) = P′ for all P, and then g is given by g(z) = −z̄).
Suppose g(P) = P, and wlog P ∈ H⁺, where H⁺ = {z ∈ H : Re z > 0}. Let A ∈ H⁺. Now if g(A) ≠ A, then g(A) = A′, the reflection of A in L+. Then ρ(A′, P) = ρ(g(A), g(P)) = ρ(A, P). But letting B be the point where the hyperbolic line through A′ and P meets L+, we have

ρ(A′, P) = ρ(A′, B) + ρ(B, P) = ρ(A, B) + ρ(B, P) > ρ(A, P),

by the triangle inequality, noting that B ∉ (AP). This is a contradiction. So g must fix everything.

This time to study reflections, we work in the upper half-plane model, since we
have a favorite line L+ . Note that we have proved a similar result in Euclidean
geometry, and the spherical version is in the example sheets.
D. 11-51
• The map R : H → H with z 7→ −z̄ is the (hyperbolic) reflection in L+ . More
generally, given any hyperbolic line `, let T be the isometry that sends ` to L+ .
Then the (hyperbolic) reflection in ` is R` = T −1 RT .
• A hyperbolic triangle ABC is the region determined by three hyperbolic line
segments AB, BC and CA, including extreme cases where some vertices A, B, C
are allowed to be “at infinity”. More precisely, in the half-plane model, we allow
them to lie in R ∪ {∞}; in the disk model we allow them to lie on the unit circle
|z| = 1.

E. 11-52
We already know how to reflect in L+ . So to reflect in another line `, we move our
plane such that ` becomes L+ , do the reflection, and move back. By the previous
proposition, R` is the unique isometry that is not identity and fixes `.
For hyperbolic triangle we see that if A is “at infinity”, then the angle at A must
be zero. Recall for a region R ⊆ H, we can compute the area of R as
area(R) = ∬_R (dx dy)/y².
T. 11-53
<Gauss-Bonnet theorem for hyperbolic triangles> If ∆ with vertices
A, B, C is a hyperbolic triangle with angles α, β, γ ≥ 0 (note that zero angle
is possible), then area(∆) = π − (α + β + γ).

First do the case where γ = 0, so C is “at infinity”. Recall that we like to use the
disk model if we have a distinguished point in the hyperbolic plane. If we have a
distinguished point at infinity, it is often advantageous to use the upper half plane
model, since ∞ is a distinguished point at infinity.
So we use the upper-half plane model, and wlog C = ∞ (apply PSL(2, R) if necessary). Then AC and BC are vertical half-lines and AB is an arc of a semi-circle. We use the transformation z ↦ z + a (with a ∈ R) to centre the semi-circle at 0. We then apply z ↦ bz (with b > 0) to make the circle have radius 1. Thus wlog AB ⊆ {x² + y² = 1}. Now we have

area(∆) = ∫_{cos(π−α)}^{cos β} ∫_{√(1−x²)}^{∞} (1/y²) dy dx = ∫_{cos(π−α)}^{cos β} dx/√(1 − x²) = [−cos⁻¹(x)]_{cos(π−α)}^{cos β} = π − α − β,

as required. In general, we use H again, and we can arrange for AC to lie in a vertical half-line. Also, we can move AB to x² + y² = 1, noting that this transformation keeps AC vertical. We consider ∆₁ = AB∞ and ∆₂ = CB∞, where δ denotes the angle at B between BC and the vertical half-line B∞ (so the angle of ∆₁ at B is β + δ, and the angle of ∆₂ at C is π − γ). Then, since both triangles have a vertex at infinity, we can immediately write

area(∆₁) = π − α − (β + δ)
area(∆₂) = π − δ − (π − γ) = γ − δ.

So we have area(∆) = area(∆₁) − area(∆₂) = π − α − β − γ as required.


T. 11-54
Suppose we have a hyperbolic triangle with angles α, β, γ and sides of length a, b, c
(the side of length a being opposite the vertex α etc.), then
1. <Hyperbolic cosine rule> cosh a = cosh b cosh c − sinh b sinh c cos α
2. <Hyperbolic sine rule> sinh a / sin α = sinh b / sin β = sinh c / sin γ.
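These rules are stated here without proof; a quick numerical illustration (a sketch assuming numpy, reusing the distance formula of P.11-48) is to place the vertex with angle α at 0 in the disk model, where the two sides through 0 are diameters, so the hyperbolic angle there equals the Euclidean angle.

import numpy as np

def rho(z1, z2):
    # hyperbolic distance in the disk model (P.11-48)
    return 2 * np.arctanh(abs((z1 - z2) / (1 - np.conj(z1) * z2)))

r1, r2, alpha = 0.6, 0.45, 1.1                  # Euclidean radii of two vertices, and the angle at 0
A, B, C = 0.0, r1, r2 * np.exp(1j * alpha)      # hyperbolic triangle with vertex A = 0
a, b, c = rho(B, C), rho(A, C), rho(A, B)       # side a opposite vertex A, etc.

lhs = np.cosh(a)
rhs = np.cosh(b) * np.cosh(c) - np.sinh(b) * np.sinh(c) * np.cos(alpha)
assert np.isclose(lhs, rhs)                     # hyperbolic cosine rule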

D. 11-55
• We use the disk model of the hyperbolic plane. Two hyperbolic lines are parallel
iff they meet only at the boundary of the disk (at |z| = 1).
• Two hyperbolic lines are ultraparallel if they don’t meet anywhere in {|z| ≤ 1}.
E. 11-56
Recall that in S 2 , any two lines meet (in two points). In the Euclidean plane R2 ,
any two lines meet (in one point) iff they are not parallel. In the Euclidean plane,
we have the parallel axiom: given a line ` and P 6∈ `, there exists a unique line
`0 containing P with ` ∩ `0 = ∅. This fails in both S 2 and the hyperbolic plane
— but for very different reasons! In S 2 , there are no such parallel lines. In the
hyperbolic plane, there are many parallel lines. There is a deeper reason for
why this is the case, which we will come to at the very end of the course.
C. 11-57
<Hyperboloid model> Recall we said there is no way to view the hyperbolic
plane as a subset of R3 , and hence we need to mess with Riemannian metrics.
However, it turns out we can indeed embed the hyperbolic plane in R3 , if we give
R3 a different metric! The Lorentzian inner product on R3 has the matrix
 
1 0 0
0 1 0
0 0 −1

Recall from IB Linear Algebra that we can always pick a basis where a non-
degenerate symmetric bilinear form has diagonal made of 1 and −1. If we further
identify A and −A as the “same” symmetric bilinear form, then the above matrix
is the only other possibility left.
Thus, we obtain the quadratic form given by q(x) = ⟨x, x⟩ = x² + y² − z². We now define the 2-sheet hyperboloid as S = {x ∈ R³ : q(x) = −1}. This is given explicitly by the formula x² + y² = z² − 1. We don't actually need two sheets. So we define S⁺ = S ∩ {z > 0}. We let π : S⁺ → D ⊆ C = R² be the stereographic projection from (0, 0, −1), given by

π(x, y, z) = (x + iy)/(1 + z) = u + iv.

We put r2 = u2 + v 2 . Doing some calculations, we can show that


1. We always have r < 1, as expected.
2. The stereographic projection π is invertible with
σ(u, v) = π⁻¹(u, v) = (1/(1 − r²)) (2u, 2v, 1 + r²) ∈ S⁺.

3. The tangent plane to S + at P is spanned by the two vectors


σ_u = ∂σ/∂u = (2/(1 − r²)²) (1 + u² − v², 2uv, 2u),
σ_v = ∂σ/∂v = (2/(1 − r²)²) (2uv, 1 + v² − u², 2v).

We restrict the inner product ⟨ · , · ⟩ to the span of σ_u, σ_v, and we get a symmetric bilinear form assigned to each (u, v) ∈ D given by

E du² + 2F du dv + G dv²   where   E = ⟨σ_u, σ_u⟩ = 4/(1 − r²)²,   F = ⟨σ_u, σ_v⟩ = 0,   G = ⟨σ_v, σ_v⟩ = 4/(1 − r²)².

We have thus recovered the Poincaré disk model of the hyperbolic plane.
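The computations in 2. and 3. are routine but fiddly; here is a symbolic sketch (assuming sympy) that checks them, by restricting the Lorentzian inner product to the span of σ_u, σ_v and comparing with the disk metric.

import sympy as sp

u, v = sp.symbols('u v', real=True)
r2 = u**2 + v**2
sigma = sp.Matrix([2*u, 2*v, 1 + r2]) / (1 - r2)   # sigma = pi^{-1}
su, sv = sigma.diff(u), sigma.diff(v)

def lorentz(a, b):
    # the Lorentzian inner product x1*y1 + x2*y2 - x3*y3
    return a[0]*b[0] + a[1]*b[1] - a[2]*b[2]

E = sp.simplify(lorentz(su, su))
F = sp.simplify(lorentz(su, sv))
G = sp.simplify(lorentz(sv, sv))
assert F == 0
assert sp.simplify(E - 4/(1 - r2)**2) == 0
assert sp.simplify(G - 4/(1 - r2)**2) == 0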

11.5 Smooth embedded surfaces (in R3 )


So far, we have been studying some specific geometries, namely Euclidean, spherical
and hyperbolic geometry. From now on, we go towards greater generality, and study
arbitrary surfaces. We will mostly work with surfaces that are smoothly embedded as
subsets of R3 .
D. 11-58
A set S ⊆ R3 is a (parametrized) smooth embedded surface if every point P ∈ S
has an open neighbourhood U = W ∩S (with W open in R3 ) and a map σ : V → U
from an open V ⊆ R2 to U such that
1. σ is a homeomorphism (ie. a bijection with continuous inverse)
2. σ is C ∞ (smooth) on V (ie. continuous partial derivatives of all orders).
3. For all Q = σ(P ), the partial derivatives σu (P ) and σv (P ) are linearly inde-
pendent.
We say (u, v) are smooth coordinates on U ⊆ S. The subspace of R3 spanned by
σu , σv is called the tangent space TQ S of S at Q = σ(P ). The function σ is called
a smooth parametrisation of U ⊆ S.
The unit normal to S at Q ∈ S is (well-defined up to a sign)
N = N_Q = (σ_u × σ_v)/‖σ_u × σ_v‖.
E. 11-59
Recall that if we write σ(u, v) = (x(u, v), y(u, v), z(u, v)) then
 
σ_u(P) = (∂σ/∂u)(P) = (∂_u x, ∂_u y, ∂_u z)ᵀ(P) = dσ_P(e₁),

where e1 , e2 are the standard basis of R2 . Similarly, we have σv (P ) = dσP (e2 ).


We define some terminology.

P. 11-60
If σ : V → U and σ̃ : Ṽ → U are two C ∞ parametrisations of a surface, then the
homeomorphism ϕ = σ −1 ◦ σ̃ : Ṽ → V is in fact a C ∞ diffeomorphism.

Since differentiability is a local property, it suffices to consider ϕ on some small


neighbourhood of a point in Ṽ. Pick our favourite point (ũ₀, ṽ₀) ∈ Ṽ. We know σ = σ(u, v) = (x(u, v), y(u, v), z(u, v)) is differentiable. So it has a Jacobian matrix

( x_u x_v ; y_u y_v ; z_u z_v ).

By definition, this matrix has rank two at each point. Wlog, we assume the first two rows are linearly independent, so that det( x_u x_v ; y_u y_v ) ≠ 0 at (u₀, v₀) = ϕ(ũ₀, ṽ₀) ∈ V. We define a
new function F (u, v) = (x(u, v), y(u, v)) ie. composing σ with the projection map
onto the first two coordinates. Now the inverse function theorem applies. So F has
a local C ∞ inverse, ie. there are two open neighbourhoods N, N 0 with (u0 , v0 ) ∈ N
and F(u₀, v₀) ∈ N′ ⊆ R² such that F|_N : N → N′ is a diffeomorphism. Writing π : σ(N) → N′ for the projection π(x, y, z) = (x, y), we can put these things in a commutative triangle: σ : N → σ(N), π : σ(N) → N′, and F = π ∘ σ : N → N′.

We now let Ñ = σ̃ −1 (σ(N )) and F̃ = π ◦ σ̃, which is yet again smooth. Then we
have the following larger commutative picture: σ : N → σ(N) and σ̃ : Ñ → σ(N) both map into σ(N), together with π : σ(N) → N′, F = π ∘ σ : N → N′ and F̃ = π ∘ σ̃ : Ñ → N′.

Then we have ϕ = σ −1 ◦ σ̃ = σ −1 ◦ π −1 ◦ π ◦ σ̃ = F −1 ◦ F̃ which is smooth, since


F −1 and F̃ are. Hence ϕ is smooth everywhere. By symmetry of the argument,
ϕ−1 is smooth as well. So this is a diffeomorphism.
This proposition says any two parametrizations of the same surface are compatible.
P. 11-61
The tangent plane TQ S is independent of parametrization.

We know σ̃(ũ, ṽ) = σ(ϕ1 (ũ, ṽ), ϕ2 (ũ, ṽ)). We can then compute the partial deriva-
tives as σ̃ũ = ϕ1,ũ σu + ϕ2,ũ σv and σ̃ṽ = ϕ1,ṽ σu + ϕ2,ṽ σv . Here the transformation
is related by the Jacobian matrix
 
( ϕ_{1,ũ} ϕ_{1,ṽ} ; ϕ_{2,ũ} ϕ_{2,ṽ} ) = J(ϕ).

This is invertible since ϕ is a diffeomorphism. So (σũ , σṽ ) and (σu , σv ) are different
basis of the same two-dimensional vector space.
Note that we have σ̃ũ × σ̃ṽ = det(J(ϕ))σu × σv .

D. 11-62
• Let S ⊆ R3 be an embedded surface. The map θ = σ −1 : U ⊆ S → V ⊆ R2 is a
chart . A collection of charts which covers S is called an atlas .
• If S ⊆ R3 is an embedded surface, then at each point we have inner product given
by restricting the standard inner product to the tangent space at the point. This
family of inner products together are called the first fundamental form .
• Given a smooth curve Γ : [a, b] → S ⊆ R³, the length and the energy of Γ are

length(Γ) = ∫_a^b ‖Γ′(t)‖ dt,   energy(Γ) = ∫_a^b ‖Γ′(t)‖² dt.

• Given a smooth C ∞ parametrization σ : V → U ⊆ S ⊆ R3 , and a region T ⊆ U , we


define the area of T to be the area of σ −1 (T ) with respect to the first fundamental
form on V, that is

area(T) = ∫_{θ(T)} √(EG − F²) du dv,

whenever the integral exists (where θ = σ −1 is a chart).


E. 11-63
Often, instead of a parametrization σ : V ⊆ R2 → U ⊆ S, we want the function the
other way round. We call this a chart. For example let S 2 ⊆ R3 be a sphere. The
two stereographic projections from north and south pole give two charts, whose
domain together cover S 2 , so they together is an atlas. Note that stereographic
projections from north (or south) pole alone already almost covers the whole sphere
but it misses out the north (or south) pole itself.
Similar to what we did to the sphere, given a chart θ : U → V ⊆ R2 , we can induce
a Riemannian metric on V . We first get an inner product on the tangent space,
together they are the first fundamental form. This is a theoretical entity, and is
more easily worked with when we have a chart. Suppose we have a parametrization
σ : V → U ⊆ S, a, b ∈ R2 , and P ∈ V . We can then define

ha, biP = hdσP (a), dσP (b)iR3 .

With respect to the standard basis e1 , e2 ∈ R2 , we can write the first fundamental
form as
E du² + 2F du dv + G dv²   where   E = ⟨σ_u, σ_u⟩ = ⟨e₁, e₁⟩_P,   F = ⟨σ_u, σ_v⟩ = ⟨e₁, e₂⟩_P,   G = ⟨σ_v, σ_v⟩ = ⟨e₂, e₂⟩_P.

Thus, this induces a Riemannian metric on V . This is also called the first funda-
mental form corresponding to σ. This is what we do in practical examples.
We can think of the energy as something like the average kinetic energy of a particle
along the path. How do length and energy interact with parametrizations? For the sake
of simplicity, we assume Γ([a, b]) ⊆ U for some parametrization σ : V → U . Then
we define the new curve
γ = σ −1 ◦ Γ : [a, b] → V.
This curve has two components in V , say γ = (γ1 , γ2 ). Then we have

Γ0 (t) = (dσ)γ(t) (γ̇1 (t)e1 + γ̇2 (t)e2 ) = γ̇1 σu + γ̇2 σv ,



and thus (here P = γ(t))


‖Γ′(t)‖ = ⟨γ̇, γ̇⟩_P^{1/2} = ⟨dσ_P(γ̇), dσ_P(γ̇)⟩_{R³}^{1/2} = (Eγ̇₁² + 2Fγ̇₁γ̇₂ + Gγ̇₂²)^{1/2}.

So we get

length Γ = ∫_a^b (Eγ̇₁² + 2Fγ̇₁γ̇₂ + Gγ̇₂²)^{1/2} dt.

Similarly, the energy is given by

energy Γ = ∫_a^b (Eγ̇₁² + 2Fγ̇₁γ̇₂ + Gγ̇₂²) dt.

This agrees with what we’ve had for Riemannian metrics.


P. 11-64
If σ : V → U and σ̃ : Ṽ → U are two C∞ parametrisations of a surface, then ϕ = σ⁻¹ ∘ σ̃ : Ṽ → V is an isometry of Riemannian metrics (on V and Ṽ).

ϕ is an isometry since by the chain rule we have

⟨a, b⟩~_P = ⟨dσ̃_P(a), dσ̃_P(b)⟩_{R³} = ⟨dσ_{ϕ(P)} ∘ dϕ_P(a), dσ_{ϕ(P)} ∘ dϕ_P(b)⟩_{R³} = ⟨dϕ_P(a), dϕ_P(b)⟩_{ϕ(P)}.

P. 11-65
The area of T is independent of the choice of parametrization.

Isometry preserves area,[P.11-37] and by the above different parametrization gives


an isometry.
So we can define area for more general subsets T ⊆ S, not necessarily covered by a
single parametrization. By this result this is well-defined.
However to calculate areas, we often only need to consider one chart/parametrization
since in many cases the subset omitted will not affect the area. For example, if we
work with the sphere, we can easily parametrize everything but the poles. In that
case, it suffices to use just one parametrization σ for area(S).
D. 11-66
• Let V ⊆ R2u,v be open, and E du2 + 2F du dv + G dv 2 be a Riemannian metric
on V . We let γ = (γ1 , γ2 ) : [a, b] → V be a smooth curve. We say γ is a geodesic
with respect to the Riemannian metric if for all t ∈ [a, b] it satisfies
d/dt (Eγ̇₁ + Fγ̇₂) = ½(E_u γ̇₁² + 2F_u γ̇₁γ̇₂ + G_u γ̇₂²)
d/dt (Fγ̇₁ + Gγ̇₂) = ½(E_v γ̇₁² + 2F_v γ̇₁γ̇₂ + G_v γ̇₂²).
dt 2
These equations are known as the geodesic ODEs .
• Let γ : [a, b] → V be a smooth curve, and let γ(a) = p and γ(b) = q. A
proper variation of γ is a C ∞ map h : [a, b] × (−ε, ε) → V such that
1. h(t, 0) = γ(t) for all t ∈ [a, b],
2. h(a, τ ) = p and h(b, τ ) = q for all |τ | < ε,
3. γτ = h( · , τ ) : [a, b] → V is a C ∞ curve for all fixed τ ∈ (−ε, ε).

• Let S ⊆ R3 be an embedded surface. Let Γ : [a, b] → S be a smooth curve in S,


and suppose there is a parametrization σ : V → U ⊆ S such that Im Γ ⊆ U . We
let θ = σ −1 be the corresponding chart. Define a new curve in V by γ = θ ◦ Γ :
[a, b] → V . Then we say Γ is a geodesic on S if and only if γ is a geodesic with
respect to the induced Riemannian metric.
For a general Γ : [a, b] → S, we say Γ is a geodesic if for each point t0 ∈ [a, b],
there is a neighbourhood Ṽ of t0 such that Im Γ|Ṽ lies in the domain of some chart,
and Γ|Ṽ is a geodesic in the previous sense.
P. 11-67
A smooth curve γ satisfies the geodesic ODEs if and only if γ is a stationary point
of the energy function for all proper variation.

We let γ(t) = (u(t), v(t)). Then we have


energy(γ) = ∫_a^b (E(u, v)u̇² + 2F(u, v)u̇v̇ + G(u, v)v̇²) dt = ∫_a^b I(u, v, u̇, v̇) dt.

We consider this as a function of four variables u, u̇, v, v̇, which are not necessarily
related to one another. From the calculus of variations, we know γ is stationary
if and only if

d/dt (∂I/∂u̇) = ∂I/∂u,   d/dt (∂I/∂v̇) = ∂I/∂v.
The first equation gives us

d/dt (2(Eu̇ + Fv̇)) = E_u u̇² + 2F_u u̇v̇ + G_u v̇²,
which is exactly the geodesic ODE. Similarly, the second equation gives the other
geodesic ODE.
So a curve from a to b is a geodesic if it is a stationary point of the energy among all
curves from a to b. The definition of a geodesic involves the derivative only, which
is a local property, we can generalize the definition to arbitrary embedded surfaces
as done above.
P. 11-68
If a curve Γ minimizes the energy among all curves from P = Γ(a) to Q = Γ(b),
then Γ is a geodesic.

For any a1 , b1 such that a ≤ a1 ≤ b1 ≤ b, we let Γ1 = Γ|[a1 ,b1 ] . Then Γ1 also


minimizes the energy between a1 and b1 for all curves between Γ(a1 ) and Γ(b1 ).
If we picked a1 , b1 such that Γ([a1 , b1 ]) ⊆ U for some parametrized neighbourhood
U , then Γ1 is a geodesic by the previous proposition. Since the parametrized
neighbourhoods cover S, at each point t0 ∈ [a, b], we can find a1 , b1 such that
Γ([a1 , b1 ]) ⊆ U for some parametrized neighbourhood U .
L. 11-69
Let V ⊆ R2 be an open set with a Riemannian metric, and let P, Q ∈ V . Consider
C∞ curves γ : [0, 1] → V such that γ(0) = P, γ(1) = Q. Then such a γ will
minimize the energy (and therefore is a geodesic) if and only if γ minimizes the
length and has constant speed.

Recall the Cauchy-Schwarz inequality for continuous functions f, g ∈ C[0, 1], which says

(∫₀¹ f(x)g(x) dx)² ≤ (∫₀¹ f(x)² dx)(∫₀¹ g(x)² dx),

with equality iff g = λf for some λ ∈ R, or f = 0, ie. g and f are linearly dependent. We now put f = 1 and g = ‖γ̇‖. Then Cauchy-Schwarz says (length γ)² ≤ energy(γ), with equality if and only if ‖γ̇‖ is constant. From this,
we see that a curve of minimal energy must have constant speed. Then it follows
that minimizing energy is the same as minimizing length if we move at constant
speed.
This means being a geodesic is almost the same as minimizing length. It’s just
that to be a geodesic, we have to parametrize it carefully.
P. 11-70
A curve Γ is a geodesic iff it minimizes the energy locally, and this happens if it
minimizes the length locally and has constant speed.

Here minimizing a quantity locally means for every t ∈ [a, b], there is some ε > 0
such that Γ|_{[t−ε,t+ε]} minimizes the quantity over all curves from Γ(t−ε) to Γ(t+ε).
We will not prove this. Local minimization is the best we can hope for, since the
definition of a geodesic involves differentiation, and derivatives are local properties.
The geodesic ODEs imply kΓ0 (t)k is constant. In the special case of the hyperbolic
plane, we can check this directly.
E. 11-71
A natural question to ask is that if we pick a point P and a tangent direction a,
can we find a geodesic through P whose tangent vector at P is a? In the geodesic
equations, if we expand out the derivative, we can write the equation as
  
( E F ; F G ) ( γ̈₁ ; γ̈₂ ) = something.

Since the Riemannian metric is positive definite, we can invert the matrix and get an equation of the form

( γ̈₁ ; γ̈₂ ) = H(γ₁, γ₂, γ̇₁, γ̇₂)
for some function H. From the general theory of ODE’s in Analysis II, subject to
some sensible conditions, given any P = (u0 , v0 ) ∈ V and a = (p0 , q0 ) ∈ R2 , there
is a unique geodesic curve γ(t) defined for |t| < ε with γ(0) = P and γ̇(0) = a. In
other words, we can choose a point, and a direction, and then there is a geodesic
going that way. Note that we need the restriction that γ is defined only for |t| < ε
since we might run off to the boundary in finite time. So we need not be able to
define it for all t ∈ R.
How is this result useful? We can use the uniqueness part to find geodesics. We
can try to find some family of curves C that are length-minimizing. To prove that
we have found all of them, we can show that given any point P ∈ V and direction
a, there is some curve in C through P with direction a.
Consider the sphere S 2 . Recall that arcs of great circles are length-minimizing,
at least locally. So these are indeed geodesics. Are these all? We know for any

P ∈ S 2 and any tangent direction, there exists a unique great circle through P
in this direction. So there cannot be any other geodesics on S 2 , by uniqueness.
Similarly, we find that hyperbolic lines are precisely all the geodesics on the hyperbolic
plane.
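As an illustration, one can integrate the geodesic ODEs numerically. The sketch below (assuming numpy and scipy) does this for the upper half-plane metric, where E = G = 1/y² and F = 0, and checks that the resulting curve stays on a circle centred on the real axis, ie. on a hyperbolic line.

import numpy as np
from scipy.integrate import solve_ivp

def geodesic_rhs(t, s):
    x, y, xdot, ydot = s
    # geodesic ODEs for E = G = 1/y^2, F = 0, expanded into second-order form:
    #   d/dt(xdot/y^2) = 0                        =>  xddot = 2*xdot*ydot/y
    #   d/dt(ydot/y^2) = -(xdot^2 + ydot^2)/y^3   =>  yddot = (ydot^2 - xdot^2)/y
    xddot = 2 * xdot * ydot / y
    yddot = (ydot ** 2 - xdot ** 2) / y
    return [xdot, ydot, xddot, yddot]

# start at i with a generic direction of unit hyperbolic speed
sol = solve_ivp(geodesic_rhs, (0, 2), [0.0, 1.0, 0.8, 0.6], rtol=1e-10, atol=1e-12)
x, y, xdot, ydot = sol.y

# a circle centred at (c, 0) through (x, y) with tangent (xdot, ydot) has c = x + y*ydot/xdot;
# along a geodesic this quantity should be constant (here c = 0.75)
centre = x + y * ydot / xdot
assert np.allclose(centre, centre[0], atol=1e-6)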
E. 11-72
<Geodesic polar coordinates> We have defined these geodesics as solutions
of certain ODEs. It is possible to show that the solutions of these ODEs depend
C ∞ -smoothly on the initial conditions. We shall use this to construct the geodesic
polar coordinates around each point P ∈ S in a surface. The idea is that to
specify a point near P , we can just say “go in direction θ, and then move along
the corresponding geodesic for time r”.
We can make this (slightly) more precise, and provide a quick sketch of how we
can do this formally. We let ψ : U → V be some chart with P ∈ U ⊆ S. We wlog
ψ(P ) = 0 ∈ V ⊆ R2 . We denote by θ the polar angle (coordinate), defined on
V \ {0}. Then for any given θ, there is a unique geodesic γ θ : (−ε, ε) → V such
that γ θ (0) = 0, and γ̇ θ (0) is the unit vector in the θ direction (it makes an angle
of θ with the x-axis). We define σ(r, θ) = γ θ (r) whenever this is defined. It is
possible to check that σ is C ∞ -smooth. While we would like to say that σ gives us
a parametrization, this is not exactly true, since we cannot define θ continuously.
Instead, for each θ0 , we define the region

Wθ0 = {(r, θ) : 0 < r < ε, θ0 < θ < θ0 + 2π} ⊆ R2 .

Writing V0 for the image of Wθ0 under σ, the composition

W_{θ₀} →(σ) V₀ →(ψ⁻¹) U₀ ⊆ S

is a valid parametrization. Thus σ⁻¹ ∘ ψ is a valid chart on U₀. The coordinates (r, θ) of this


chart are the geodesic polar coordinates.
L. 11-73
<Gauss’ lemma> The geodesic circles {r = r0 } ⊆ W are orthogonal to their
radii, ie. to γ θ , and the Riemannian metric (first fundamental form) on W is of
the form dr2 + G(r, θ) dθ2 .

This is why we like geodesic polar coordinates. Using these, we can put the
Riemannian metric into a very simple form. Of course, this is just a sketch of
what really happens, and there are many holes to fill in.
C. 11-74
<Surfaces of revolution> So far, we do not have many examples of surfaces.
We now describe a nice way of obtaining surfaces — we obtain a surface S by
rotating a plane curve η around a line `. Wlog the coordinates is chosen so that `
is the z-axis, and η lies in the x − z plane. More precisely, we let η : (a, b) → R3 ,
and write η(u) = (f (u), 0, g(u)). Note that it is possible that a = −∞ and/or
b = ∞. We require kη 0 (u)k = 1 for all u, that is parametrization by arclength.
We also require f (u) > 0 for all u > 0, or else things won’t make sense. Finally,
we require that η is a homeomorphism to its image. In particular this requires η
to be injective (non-self intersecting) and continuous. Then S is the image of the

following map:

σ(u, v) = (f (u) cos v, f (u) sin v, g(u)) for a < u < b and 0 ≤ v ≤ 2π.

This is not exactly a parametrization, since it is not injective (v = 0 and v = 2π


give the same points). To rectify this, for each α ∈ R, we define

σ α : (a, b) × (α, α + 2π) → S,

given by the same formula, and this is a homeomorphism onto the image as one
can check. It is evidently smooth, since f and g both are. To show this is a
parametrization, we need to show that the partial derivatives are linearly indepen-
dent. We can compute the partial derivatives and show that they are non-zero. We
have σu = (f 0 cos v, f 0 sin v, g 0 ) and σv = (−f sin v, f cos v, 0). We then compute
the cross product as σu × σv = (−f g 0 cos v, −f g 0 sin v, f f 0 ). So we have

‖σ_u × σ_v‖² = f²(g′² + f′²) = f² ≠ 0.

Thus every σ α is a valid parametrization, and S is a valid embedded surface. More


generally, we can allow S to be covered by several families of parametrizations of
type σ α , ie. we can consider more than one curve or more than one axis of rotation.
This allows us to obtain, say, S 2 or the embedded torus (in the old sense, we cannot
view S 2 as a surface of revolution in the obvious way, since we will be missing the
poles).
On a surface of revolution, parallels are curves of the form γ(t) = σ(u0 , t) for fixed u0
and Meridians are curves of the form γ(t) = σ(t, v0 ) for fixed v0 . These are gen-
eralizations of the notions of longitude and latitude (in some order) on Earth.
In a general surface of revolution, we can compute the first fundamental form with
respect to σ as

E = kσu k2 = f 02 + g 02 = 1 F = σu · σv = 0 G = kσv k2 = f 2 .

So its first fundamental form is also of the simple form, like the geodesic polar
coordinates. Putting these explicit expressions into the geodesic formula, we find
that the geodesic equations are
ü = f (df/du) v̇²,    d/dt (f² v̇) = 0.
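The first fundamental form computation above is easy to verify symbolically for a concrete example. The sketch below (assuming sympy) takes the generating curve η(u) = (2 + cos u, 0, sin u), which is unit speed and produces the embedded torus, and checks that E = 1, F = 0 and G = f².

import sympy as sp

u, v = sp.symbols('u v', real=True)
f = 2 + sp.cos(u)     # distance from the axis of rotation
g = sp.sin(u)         # height; note f'^2 + g'^2 = 1, so eta is unit speed
sigma = sp.Matrix([f * sp.cos(v), f * sp.sin(v), g])
su, sv = sigma.diff(u), sigma.diff(v)

E = sp.simplify(su.dot(su))
F = sp.simplify(su.dot(sv))
G = sp.simplify(sv.dot(sv))
assert E == 1 and F == 0
assert sp.simplify(G - f**2) == 0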
P. 11-75
Consider a surface of revolution. We assume kγ̇k = 1, ie. u̇2 + f 2 (u)v̇ 2 = 1. Then
1. Every unit speed meridians is a geodesic.
2. A (unit speed) parallel will be a geodesic if and only if (df/du)(u₀) = 0, ie. u₀ is a critical point of f.

1. In a meridian, v = v0 is constant. So the second geodesic equations holds.


Also, we know kγ̇k = |u̇| = 1. So ü = 0. So the first geodesic equation is
satisfied.

2. Since u = u0 , we know f (u0 )2 v̇ 2 = 1. So v̇ = ±1/f (u0 ). So the second


equation holds. Since v̇ and f are non-zero, the first equation is satisfied if and only if (df/du)(u₀) = 0.

D. 11-76
• We let η : [0, `] → R2 be a curve parametrized with unit speed, ie. kη 0 k = 1. The
curvature κ at the point η(s) is determined by
η 00 = κn,
where n is the unit normal, chosen so that κ is non-negative.
• The second fundamental form on V with σ : V → U ⊆ S for surface S with unit
normal N is
L du2 + 2M du dv + N dv 2 ,
where L = σuu · N M = σuv · N N = σvv · N.

• The Gaussian curvature K of a surface S at P ∈ S is the ratio of the determinants of the two fundamental forms, ie.

K = (LN − M²)/(EG − F²).

This is valid since the first fundamental form is positive-definite and in particular has non-zero determinant.
E. 11-77
If f : [c, d] → [0, `] is a smooth function and f 0 (t) > 0 for all t, then we can
reparametrize our curve by γ(t) = η(f(t)). We can then find γ̇(t) = (df/dt) η′(f(t)). So we have ‖γ̇‖² = (df/dt)². We also have by definition η″(f(t)) = κn where κ is the
curvature at γ(t). On the other hand, Taylor’s theorem tells us
 
γ(t + ∆t) − γ(t) = (df/dt) η′(f(t)) ∆t + ½( (d²f/dt²) η′(f(t)) + (df/dt)² η″(f(t)) ) ∆t² + higher order terms.

Now we know by assumption that η′ · η′ = 1. Differentiating thus gives η′ · η″ = 0.


Hence we get η 0 · n = 0. We now take the dot product of the Taylor expansion
with n, killing off all the η 0 terms. Then we get

(γ(t + ∆t) − γ(t)) · n = ½ κ‖γ̇‖²(∆t)² + · · ·    (∗)

where κ is the curvature. This is the normal distance from γ(t + ∆t) to the tangent line at γ(t), shown as the double arrow in the diagram. We can also compute

‖γ(t + ∆t) − γ(t)‖² = ‖γ̇‖²(∆t)² + · · · .    (†)
So we find that 21 κ is the ratio of the leading (quadratic) terms of (∗) and (†), and
is independent of the choice of parametrization.
We now try to apply this thinking to embedded surfaces. We let σ : V → U ⊆ S be
a parametrization of a surface S (with V ⊆ R2 open). We apply Taylor’s theorem
to σ to get

σ(u + ∆u, v + ∆v) − σ(u, v) = σ_u ∆u + σ_v ∆v + ½(σ_{uu}(∆u)² + 2σ_{uv}∆u∆v + σ_{vv}(∆v)²) + · · · .

We now measure the deviation from the tangent plane, ie.

(σ(u + ∆u, v + ∆v) − σ(u, v)) · N = ½(L(∆u)² + 2M∆u∆v + N(∆v)²) + · · · ,

where L = σ_{uu} · N,  M = σ_{uv} · N,  N = σ_{vv} · N.


Note that N and N are different things. N is the unit normal, while N is the
expression given above. We can also compute

kσ(u + ∆u, v + ∆v) − σ(u, v)k2 = E(∆u)2 + 2F ∆u∆v + G(∆v)2 + · · · .

The Gaussian curvature is then a quarter of the ratio of these two equations, that
is the ratio of the determinants of the two fundamental forms.
E. 11-78
It can be deduced, similar to the curves, that K is independent of parametrization.
Note also that K > 0 means the second fundamental form is definite (ie. either
positive definite or negative definite). If K < 0, then the second fundamental form
is indefinite. If K = 0, then the second fundamental form is semi-definite (but not
definite).
Consider the unit sphere S 2 ⊆ R3 . This has K > 0 at each point. We can compute
this directly, or we can, for the moment, pretend that M = 0. Then by symmetry,
N = L. So K > 0.
On the other hand, we can imagine a Pringle crisp (also known as a hyperbolic
paraboloid), and this has K < 0. More examples are left on the third example
sheet. For example we will see that the embedded torus in R3 has points at which
K > 0, some where K < 0, and others where K = 0.
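To make the torus claim concrete, here is a symbolic sketch (assuming sympy) computing K = (LN − M²)/(EG − F²) for the embedded torus σ(u, v) = ((2 + cos u) cos v, (2 + cos u) sin v, sin u). Working with the unnormalised normal n = σ_u × σ_v avoids square roots, at the cost of an extra factor |n|² = EG − F² in LN − M².

import sympy as sp

u, v = sp.symbols('u v', real=True)
sigma = sp.Matrix([(2 + sp.cos(u)) * sp.cos(v), (2 + sp.cos(u)) * sp.sin(v), sp.sin(u)])
su, sv = sigma.diff(u), sigma.diff(v)
n = su.cross(sv)                                   # unnormalised normal, with |n|^2 = EG - F^2

E = sp.simplify(su.dot(su))                        # first fundamental form
F = sp.simplify(su.dot(sv))
G = sp.simplify(sv.dot(sv))
L = sp.simplify(sigma.diff(u).diff(u).dot(n))      # second fundamental form, scaled by |n|
M = sp.simplify(sigma.diff(u).diff(v).dot(n))
N = sp.simplify(sigma.diff(v).diff(v).dot(n))

# dividing by (EG - F^2)^2 removes the extra factor |n|^2 picked up by L, M, N
K = sp.simplify((L * N - M ** 2) / (E * G - F ** 2) ** 2)
assert sp.simplify(K - sp.cos(u) / (2 + sp.cos(u))) == 0
# K > 0 on the outer half of the torus (cos u > 0), K < 0 on the inner half, K = 0 on the top/bottom circles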
P. 11-79
We let N = σu × σv /kσu × σv k be our unit normal for a surface patch. Then at
each point, we have
    
N_u = aσ_u + bσ_v,   N_v = cσ_u + dσ_v,   where   −( L M ; M N ) = ( a b ; c d )( E F ; F G ).

In particular K = ad − bc.

Note that N · N = 1. Differentiating gives N · N_u = 0 = N · N_v. Since σ_u, σ_v and N form a basis of R³ and N_u, N_v are orthogonal to N, there are some a, b, c, d such that

Nu = aσu + bσv and Nv = cσu + dσv .

Since N is normal to the tangent plane, we have N · σ_u = 0; differentiating gives N_u · σ_u + N · σ_{uu} = 0.


So we know Nu · σu = −L. Similarly, we find Nu · σv = −M = Nv · σu and
Nv · σv = −N . We dot our original definition of Nu , Nv in terms of a, b, c, d with
σu and σv to obtain

−L = aE + bF −M = aF + bG
−M = cE + dF −N = cF + dG.

Taking determinants, we get the formula for the curvature.



T. 11-80
Suppose for a parametrization σ : V → U ⊆ S ⊆ R³, the first fundamental form is given by du² + G(u, v) dv² for some (necessarily positive) G ∈ C∞(V). Then the Gaussian curvature is given by K = −(√G)_{uu}/√G. In particular, we do not need to compute the second fundamental form of the surface.

We set e = σ_u and f = σ_v/√G. Then e and f are unit and orthogonal. We also
let N = e × f be a third unit vector orthogonal to e and f so that they form a
basis of R3 . Using the notation of the previous proposition, we have

N_u × N_v = (aσ_u + bσ_v) × (cσ_u + dσ_v) = (ad − bc)σ_u × σ_v = Kσ_u × σ_v = K√G e × f = K√G N.

Thus we know

K√G = (N_u × N_v) · N = (N_u × N_v) · (e × f) = (N_u · e)(N_v · f) − (N_u · f)(N_v · e).

Since N · e = 0, we know Nu · e + N · eu = 0. Hence to evaluate the expression


above, it suffices to compute N · eu instead of Nu · e. Since e · e = 1, we know
e · eu = 0 = e · ev . So we can write

e_u = αf + λ₁N,   f_u = −α̃e + µ₁N,

and similarly we have

e_v = βf + λ₂N,   f_v = −β̃e + µ₂N.

Our objective now is to find the coefficients µᵢ, λᵢ, and then K√G = λ₁µ₂ − λ₂µ₁.
Since we know e · f = 0, differentiating gives

e_u · f + e · f_u = 0   and   e_v · f + e · f_v = 0   ⟹   α̃ = α   and   β̃ = β.

But we have
 
α = e_u · f = σ_{uu} · σ_v/√G = ( (σ_u · σ_v)_u − ½(σ_u · σ_u)_v ) / √G = 0,

since σ_u · σ_v = 0 and σ_u · σ_u = 1. So α vanishes. Also, we have

β = e_v · f = σ_{uv} · σ_v/√G = (½ G_u)/√G = (√G)_u,

since σ_v · σ_v = G. Finally, we can use our equations again to find

λ₁µ₂ − λ₂µ₁ = e_u · f_v − e_v · f_u = (e · f_v)_u − (e · f_u)_v = −β̃_u + α̃_v = −(√G)_{uu}.

So we have K√G = −(√G)_{uu}, as required.
So if we have nice coordinates on S, then we get a nice formula for the Gaussian
curvature K. Observe, for this σ, K depends only on the first fundamental form,
not on the second fundamental form. When Gauss discovered this, he was so
impressed that he called it the Theorema Egregium :
If S1 and S2 have locally isometric charts, then K is locally the same.

We know that this is valid under the assumption of the theorem, ie. the existence
of a parametrization σ of the surface S such that the first fundamental form is
du2 + G(u, v) dv 2 . Suitable σ includes, for each point P ∈ S, the geodesic polars
(ρ, θ). However, P itself is not in the chart, ie. P 6∈ σ(U ), and there is no guarantee
that there will be some geodesic polar that covers P . To solve this problem, we
notice that K is a C ∞ function of S, and in particular continuous. So we can
determine the curvature at P as

K(P) = lim_{ρ→0} K(ρ, θ).

Note also that every surface of revolution has such a suitable parametrization, as
we have previously explicitly seen.

11.6 Abstract smooth surfaces


While embedded surfaces are quite general surfaces, they do not include, say, the
hyperbolic plane. We can generalize our notions by considering surfaces “without
embedding in R3 ”. These are known as abstract surfaces.
D. 11-81
• An abstract smooth surface S is a metric space (or Hausdorff (and second-countable)
topological space) equipped with homeomorphisms θi : Ui → Vi , where Ui ⊆ S
and Vi ⊆ R2 are open sets such that
1. S = ⋃ᵢ Uᵢ
2. For any i, j, the transition map

φij = θi ◦ θj−1 : θj (Ui ∩ Uj ) → θi (Ui ∩ Uj )

is a diffeomorphism. Note that θj (Ui ∩ Uj ) and θi (Ui ∩ Uj ) are open sets in R2 .


So it makes sense to talk about whether the function is a diffeomorphism.
Like for embedded surfaces, the maps θi are called charts , and the collection of
θi ’s satisfying our conditions is an atlas etc.
• A Riemannian metric on an abstract surface is given by Riemannian metrics
on each Vi = θi (Ui ) subject to the compatibility condition that for all i, j, the
transition map φij is an isometry, ie.

hdϕP (a), dϕP (b)iϕ(P ) = ha, biP

Note that on the right, we are computing the Riemannian metric on Vi , while on
the left we are computing it on Vj .
E. 11-82
It is clear that every embedded surface is an abstract surface, by forgetting that
it is embedded in R3 . The three classical geometries are all abstract surfaces.
1. The Euclidean space R2 with dx2 + dy 2 is an abstract surface.
2. The sphere S² ⊆ R³, being an embedded surface, is an abstract surface with metric

4(dx² + dy²)/(1 + x² + y²)².

3. The hyperbolic disc D ⊆ R2 is an abstract surface with metric

4(dx² + dy²)/(1 − x² − y²)²,

and this is isometric to the upper half plane H with metric

(dx² + dy²)/y².

Note that in the first and last example, it was sufficient to use just one chart to
cover every point of the surface, but not for the sphere. Also, in the case of the
hyperbolic plane, we can have many different charts, and they are compatible.
Finally, we notice that we really need the notion of abstract surface for the hyper-
bolic plane, since it cannot be realized as an embedded surface in R3 . The proof
is not obvious at all, and is a theorem of Hilbert.
E. 11-83
One important thing we can do is to study the curvature of surfaces. Given a
P ∈ S, the Riemannian metric (on a chart) around P determines a “reparametriza-
tion” by geodesics, similar to embedded surfaces. We have a local geodesic polar
coordinates where the metric takes the form

dρ2 + G(ρ, θ) dθ2 .

We then define the curvature as



K = −(√G)_{ρρ}/√G.
Note that for embedded surfaces, we obtained this formula as a theorem. For
abstract surfaces, we take this as a definition. We can check how this works in
some familiar examples.
1. In R2 , we use the usual polar coordinates (ρ, θ), the metric is dρ2 + ρ2 dθ2
where x = ρ cos θ and y = ρ sin θ. So the curvature is

K = −(√G)_{ρρ}/√G = −(ρ)_{ρρ}/ρ = 0.
So the Euclidean space has zero curvature.
2. For the sphere S, we use the spherical coordinates, fixing the radius to be 1.
So we specify each point by

σ(ρ, θ) = (sin ρ cos θ, sin ρ sin θ, cos ρ).

Note that ρ is not really the radius in spherical coordinates, but just one of the angle coordinates. We then have the metric dρ² + sin²ρ dθ². Then we get √G = sin ρ and K = 1.
3. For the hyperbolic plane, we use the disk model D, and we first express our
original metric in polar coordinates of the Euclidean plane to get
( 2/(1 − r²) )² (dr² + r² dθ²).

This is not geodesic polar coordinates, since r is given by the Euclidean dis-
tance, not hyperbolic distance. We will need to put
ρ = 2 tanh⁻¹ r,   dρ = (2/(1 − r²)) dr.

Then we have r = tanh(ρ/2), which gives

4r²/(1 − r²)² = sinh²ρ.

So the metric becomes dρ² + sinh²ρ dθ², and we finally get √G = sinh ρ with K = −1.
We see that the three classic geometries are characterized by having constant 0, 1
and −1 curvatures.
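A one-line symbolic check of these three computations, using the defining formula K = −(√G)_{ρρ}/√G (a sketch assuming sympy):

import sympy as sp

rho = sp.symbols('rho', positive=True)
for sqrtG, expected in [(rho, 0), (sp.sin(rho), 1), (sp.sinh(rho), -1)]:
    K = sp.simplify(-sp.diff(sqrtG, rho, 2) / sqrtG)
    assert K == expected       # Euclidean, spherical, hyperbolic curvatures respectively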
We will next state the Gauss-Bonnet theorem. Recall the definition of triangu-
lations which makes sense for (compact) abstract surfaces S. Recall the Euler
number e(S) = F − E + V is independent of triangulations, so we know that this
is invariant under homeomorphisms.
T. 11-84
<Gauss-Bonnet theorem>
1. Let ABC ⊆ S be a triangle with angles α, β, γ. If the sides of a triangle
ABC ⊆ S are geodesic segments, then
∫_{ABC} K dA = (α + β + γ) − π   where dA = √(EG − F²) du dv

is the “area element” on each domain U ⊆ S of a chart, with E, F, G as in the


respective first fundamental form.
2. If S is a compact surface, then ∫_S K dA = 2πe(S).

This is a genuine generalization of what we previously had for the sphere and
hyperbolic plane, as one can easily see. We will not prove this theorem, but we
will make some remarks. Note that we can deduce the second part from the first
part. The basic idea is to take a triangulation of S, and then use things like each
edge belongs to two triangles and each triangle has three edges.
Using the Gauss-Bonnet theorem, we can define the curvature K(P ) for a point
P ∈ S alternatively by considering triangles containing P , and then taking the
limit
lim_{area→0} ((α + β + γ) − π)/area = K(P).
Finally, we note how this relates to the problem of the parallel postulate we have
mentioned previously. The parallel postulate, in some form, states that given a line
and a point not on it, there is a unique line through the point and parallel to the
line. This holds in Euclidean geometry, but not hyperbolic and spherical geometry.
It is a fact that this is equivalent to the axiom that the angles of a triangle sum
to π. Thus, the Gauss-Bonnet theorem tells us the parallel postulate is captured
by the fact that the curvature of the Euclidean plane is zero everywhere.
CHAPTER 12

Statistics
Statistics is a set of principles and procedures for gaining and processing quantita-
tive evidence in order to help us make judgements and decisions. Here we will focus
on formal statistical inference: we assume that we have some data generated from
some unknown probability model, and we aim to use the data to learn about certain
properties of the underlying probability model.
In particular, we perform parametric inference . We assume that we have a random
variable X that follows a particular known family of distribution (eg. Poisson distribu-
tion). However, we do not know the parameters of the distribution. We then attempt
to estimate the parameter from the data given. We repeat the experiment (or observa-
tion) many times, then will have X1 , X2 , · · · , Xn being iid with the same distribution
as X. We have a simple random sample X = (X1 , X2 , · · · , Xn ), this is the data we
have. We will use the observed X = x to make inferences about the parameter.
Knowledge of part IA probability assumed. We’ll introduce two distributions not seen
explicitly in part IA probability:
• A random variable X has a negative binomial distribution (NegBin) with param-
eters k and p (k ∈ N, p ∈ (0, 1)), if
P(X = x) = C(x−1, k−1) (1 − p)^{x−k} p^k   for x = k, k + 1, · · ·

and zero otherwise. It has E[X] = k/p and var(X) = k(1 − p)/p². This is a discrete
distribution. It is the distribution of the number of trials up to and including the kth
success, in a sequence of independent Bernoulli trials each with success probability
p. The negative binomial distribution with k = 1 is the geometric distribution with
parameter p. And if X1 , · · · , Xk are iid geometric distribution with parameter p,
then ∑ᵢ₌₁ᵏ Xᵢ ∼ NegBin(k, p).

The random variable Y = X − k has


P(Y = y) = C(y+k−1, k−1) (1 − p)^y p^k   for y = 0, 1, · · ·

This is the distribution of the number of failures before the kth success in a sequence
of independent Bernoulli trials each with success probability p. It is also sometimes
called the negative binomial distribution: be careful!
• If Z₁, · · · , Zₖ are iid N(0, 1) random variables, then the random variable X = ∑ᵢ₌₁ᵏ Zᵢ² has a chi-squared distribution on k degrees of freedom; we write X ∼ χ²ₖ. This is a continuous distribution. Since E[Zᵢ²] = 1 and E[Zᵢ⁴] = 3, we find that E[X] = k and var(X) = 2k. Further, the moment generating function of Zᵢ² is

M_{Zᵢ²}(t) = E[e^{Zᵢ²t}] = ∫_{−∞}^{∞} e^{z²t} (1/√(2π)) e^{−z²/2} dz = (1 − 2t)^{−1/2}   for t < 1/2,

so that the mgf of X = ∑ᵢ₌₁ᵏ Zᵢ² is M_X(t) = (M_{Zᵢ²}(t))ᵏ = (1 − 2t)^{−k/2} for t < 1/2. We recognise this as the mgf of a Gamma(k/2, 1/2), so that X has pdf

f_X(x) = (1/Γ(k/2)) (1/2)^{k/2} x^{k/2−1} e^{−x/2},   x > 0.

We will denote the upper 100α% point of χ²ₖ by χ²ₖ(α), so that if X ∼ χ²ₖ then P(X > χ²ₖ(α)) = α. The above connection between gamma and χ² means that sometimes we can use χ²-tables to find percentage points for gamma distributions. χ² satisfies the following additive property (which can be proven via mgf's): if X ∼ χ²ₘ and Y ∼ χ²ₙ are independent, then X + Y ∼ χ²ₘ₊ₙ. It is also worth noting that if X ∼ Gamma(n, λ) then 2λX ∼ χ²₂ₙ; this can be proven via mgf's or the density transformation formula.
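The following quick simulation (a sketch assuming numpy; the tolerances are just generous Monte Carlo margins) illustrates two of the facts above: the sum of k iid geometric variables is NegBin(k, p), and the sum of squares of k standard normals has mean k and variance 2k.

import numpy as np

rng = np.random.default_rng(2)
k, p, n = 5, 0.3, 200_000

# sum of k iid Geometric(p) (trials up to and including the first success) is NegBin(k, p)
negbin = rng.geometric(p, size=(n, k)).sum(axis=1)
assert np.isclose(negbin.mean(), k / p, rtol=0.01)
assert np.isclose(negbin.var(), k * (1 - p) / p ** 2, rtol=0.05)

# sum of squares of k iid N(0, 1) variables is chi-squared on k degrees of freedom
chi2 = (rng.standard_normal(size=(n, k)) ** 2).sum(axis=1)
assert np.isclose(chi2.mean(), k, rtol=0.01)
assert np.isclose(chi2.var(), 2 * k, rtol=0.05)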

12.1 Estimation
The goal of estimation is as follows: we are given iid X1 , · · · , Xn , and we know that
their probability density/mass function is fX (x; θ) for some unknown θ. We know fX
but not θ. For example, we might know that they follow a Poisson distribution, but
we do not know what the mean is. The objective is to estimate the value of θ.
D. 12-1
• A statistic is a function T of the data x = (x1 , · · · , xn ). We obtain estimates of
θ using a statistics, so we can write our estimate as θ̂ = T (x). Then the random
variable T (X) is called an estimator of θ. The distribution of T = T (X) is the
sampling distribution of the statistic.
? We adopt the convention where capital X denotes a random variable and x is an
observed value. So T (X) is a random variable and T (x) is a particular value we
obtain after experiments.
• Let θ̂ = T(X) be an estimator of θ. The bias of θ̂ is the difference between its expected value and true value: bias(θ̂) = E_θ(θ̂) − θ. (Here the subscript θ does not denote a random variable, but the parameter we want to estimate; this is inconsistent with the use for, say, the probability mass function.) An estimator is unbiased if it has no bias, ie. E_θ(θ̂) = θ.
• The mean squared error of an estimator θ̂ is Eθ [(θ̂ − θ)2 ].
The root mean squared error is the square root of the above.
E. 12-2
• Let X₁, · · · , Xₙ be iid N(µ, 1). A possible estimator for µ is T(X) = (1/n)∑Xᵢ. Then for any particular observed sample x, our estimate is T(x) = (1/n)∑xᵢ.
What is the sampling distribution of T? Recall from IA Probability that in general, if Xᵢ ∼ N(µᵢ, σᵢ²), then ∑Xᵢ ∼ N(∑µᵢ, ∑σᵢ²), which is something we can prove
by considering moment-generating functions.
So we have T (X) ∼ N (µ, 1/n). Note that by the Central Limit Theorem, even
if Xi were not normal, we still have approximately T (X) ∼ N (µ, 1/n) for large
values of n, but here we get exactly the normal distribution even for small values
of n.

• The estimator (1/n)∑Xᵢ we had above is a rather sensible estimator. Of course, we
can also have silly estimators such as T (X) = X1 , or even T (X) = 0.32 always.
One way to decide if an estimator is silly is to look at its bias. To find out Eθ (T ),
we can either find the distribution of T and find its expected value, or evaluate
T as a function of X directly, and find its expected value. In the above example,
Eµ (T ) = µ. So T is unbiased for µ.
• Given an estimator, we want to know how good the estimator is. We have the
concept of the bias. However, this is generally not a good measure of how good
the estimator is. For example, if we do 1000 random trials X1 , · · · , X1000 , we can
pick our estimator as T (X) = X1 . This is an unbiased estimator, but is really
bad because we have just wasted the data from the other 999 trials. On the other
hand, T′(X) = 0.01 + (1/1000)∑Xᵢ is biased (with a bias of 0.01), but is in general
much more trustworthy than T . In fact, at the end of the section, we will construct
cases where the only possible unbiased estimator is a completely silly estimator to
use.
We can express the mean squared error in terms of the variance and bias:
Eθ [(θ̂ − θ)2 ] = Eθ [(θ̂ − Eθ (θ̂) + Eθ (θ̂) − θ)2 ]
= Eθ [(θ̂ − Eθ (θ̂))2 ] + [Eθ (θ̂) − θ]2 + 2[Eθ (θ̂) − θ] Eθ [θ̂ − Eθ (θ̂)]
= var(θ̂) + bias2 (θ̂).
If we are aiming for a low mean squared error, sometimes it could be preferable to
have a biased estimator with a lower variance. This is known as the “bias-variance
trade-off”.
 For example, suppose X ∼ binomial(n, θ). The standard estimator is TU =
X/n, which is unbiased. TU has variance
var_θ(T_U) = var_θ(X)/n² = θ(1 − θ)/n.
Hence the mean squared error of the usual estimator is given by
mse(TU ) = varθ (TU ) + bias2 (TU ) = θ(1 − θ)/n.
Consider an alternative estimator
T_B = (X + 1)/(n + 2) = w(X/n) + (1 − w)(1/2)   where w = n/(n + 2).

This can be interpreted to be a weighted average (by the sample size) of the sample mean and 1/2. We have

E_θ(T_B) − θ = (nθ + 1)/(n + 2) − θ = (1 − w)(1/2 − θ),

so T_B is biased. The variance is given by

var_θ(T_B) = var_θ(X)/(n + 2)² = w² θ(1 − θ)/n.

Hence the mean squared error is

mse(T_B) = var_θ(T_B) + bias²(T_B) = w² θ(1 − θ)/n + (1 − w)² (1/2 − θ)².
We can plot the mean squared error of each estimator for possible values of θ.
Here we plot the case where n = 10.

[Plot: mse against θ ∈ [0, 1] for n = 10; the unbiased estimator's mse is the parabola θ(1 − θ)/n, while the biased estimator's mse is flatter and lies below it except for θ near 0 or 1.]

This biased estimator has smaller MSE unless θ has extreme values.
 We see that sometimes biased estimators could give better mean squared er-
rors. In some cases, not only could unbiased estimators be worse — they could
be completely nonsense. Suppose X ∼ Poisson(λ), and we want to estimate
θ = [P (X = 0)]2 = e−2λ . Then any unbiased estimator T (X) must satisfy
Eθ (T (X)) = θ, or equivalently,

X λx
Eλ (T (X)) = e−λ T (x) = e−2λ .
x=0
x!

The only function T that can satisfy this equation is T(X) = (−1)^X, since the equation says ∑ₓ T(x)λˣ/x! = e^{−λ} = ∑ₓ (−λ)ˣ/x! and we can compare coefficients of λˣ. Thus the
unbiased estimator would estimate e−2λ to be 1 if X is even, -1 if X is odd.
This is clearly nonsense.
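The bias-variance trade-off above can also be seen by simulation. The following sketch (assuming numpy) estimates the mean squared errors of T_U and T_B by Monte Carlo with n = 10, as in the plot, and compares them with the exact expressions derived above.

import numpy as np

rng = np.random.default_rng(3)
n, reps, w = 10, 400_000, 10 / 12         # w = n/(n+2)

for theta in [0.05, 0.3, 0.5, 0.9]:
    X = rng.binomial(n, theta, size=reps)
    mse_U = np.mean((X / n - theta) ** 2)              # Monte Carlo mse of the unbiased estimator
    mse_B = np.mean(((X + 1) / (n + 2) - theta) ** 2)  # Monte Carlo mse of the biased estimator
    exact_U = theta * (1 - theta) / n
    exact_B = w ** 2 * theta * (1 - theta) / n + (1 - w) ** 2 * (0.5 - theta) ** 2
    assert np.isclose(mse_U, exact_U, rtol=0.05)
    assert np.isclose(mse_B, exact_B, rtol=0.05)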
D. 12-3
• A statistic T is sufficient for θ if the conditional distribution of X given T does
not depend on θ.
• A sufficient statistic T (X) is minimal if it is a function of every other sufficient
statistic, ie. if T 0 (X) is also sufficient, then T 0 (X) = T 0 (Y) ⇒ T (X) = T (Y).
E. 12-4
Often, we do experiments just to find out the value of θ. For example, we might
want to estimate what proportion of the population supports some political can-
didate. We are seldom interested in the data points themselves, and just want to
learn about the big picture. This leads us to the concept of a sufficient statis-
tic. This is a statistic T (X) that contains all information we have about θ in the
sample.
In general the distribution of X depends on θ, that’s why we can use a data x to
estimate θ. However for sufficient statistics the conditional distribution of X given
any particular T = t does not depend on θ. So if we know T , then the knowledge
of the exact form of our data x does not give us more information about θ. In
other words, for any fixed t, all x ∈ {x : T (x) = t} are “the same” to us.
Note that sufficient statistics are not unique. If T is sufficient for θ, then so is any
injective function of T . Note that X is always sufficient for θ as well, but it is not
of much use. How can we decide if a sufficient statistic is “good”?
Given any statistic T , we can partition the sample space X n into sets {x ∈ X n :
T (X) = t} for each t. Then after an experiment, instead of recording the actual
value of x, we can simply record the partition x falls into. If there are less partitions
than possible values of x, then effectively there is less information we have to store.
If T is sufficient, then this data reduction does not lose any information about θ.
The “best” sufficient statistic would be one in which we achieve the maximum
possible reduction. This is known as the minimal sufficient statistic.
E. 12-5
Let X1, · · · , Xn be iid Bernoulli(θ), so that P(Xi = 1) = 1 − P(Xi = 0) = θ for some 0 < θ < 1. Suppose T(X) = Σ xi, the total number of ones. Then
fX(x | θ) = ∏_{i=1}^{n} θ^{xi}(1 − θ)^{1−xi} = θ^{Σxi}(1 − θ)^{n−Σxi}.
This depends on the data only through T. Suppose we are now given that T(X) = t. Then what is the distribution of X? We have
fX|T=t(x) = Pθ(X = x, T = t)/Pθ(T = t) = Pθ(X = x)/Pθ(T = t) = θ^{Σxi}(1 − θ)^{n−Σxi} / [ (n choose t) θ^t(1 − θ)^{n−t} ] = (n choose t)^{−1},
where the second equality holds because if X = x, then T must equal t. This does not depend on θ, so T is sufficient.
T. 12-6
<Factorization criterion> T is sufficient for θ iff fX (x | θ) = g(T (x), θ)h(x)
for some functions g and h.

(Backward) We first prove the discrete case. Suppose fX(x | θ) = g(T(x), θ)h(x). If T(x) = t, then
fX|T=t(x) = Pθ(X = x, T(X) = t)/Pθ(T = t) = g(T(x), θ)h(x) / Σ_{y:T(y)=t} g(T(y), θ)h(y)
= g(t, θ)h(x) / [ g(t, θ) Σ_{y:T(y)=t} h(y) ] = h(x) / Σ_{y:T(y)=t} h(y),
which doesn't depend on θ. So T is sufficient. The continuous case is similar: if fX(x | θ) = g(T(x), θ)h(x) and T(x) = t, then
fX|T=t(x) = g(T(x), θ)h(x) / ∫_{y:T(y)=t} g(T(y), θ)h(y) dy = g(t, θ)h(x) / [ g(t, θ) ∫ h(y) dy ] = h(x) / ∫ h(y) dy,
which does not depend on θ.


(Forward) Now suppose T is sufficient so that the conditional distribution of X |
T = t does not depend on θ. Then

Pθ (X = x) = Pθ (X = x, T = T (x)) = Pθ (X = x | T = T (x))Pθ (T = T (x)).

The first factor does not depend on θ by assumption; call it h(x). Let the second
factor be g(t, θ), and so we have the required factorisation.
E. 12-7
• Continuing the previous example, fX(x | θ) = θ^{Σxi}(1 − θ)^{n−Σxi}. Take g(t, θ) = θ^t(1 − θ)^{n−t} and h(x) = 1 to see that T(X) = Σ Xi is sufficient for θ.
• Let X1, · · · , Xn be iid U[0, θ]. Write 1[A] for the indicator function of an arbitrary set A. We have
fX(x | θ) = ∏_{i=1}^{n} (1/θ) 1[0 ≤ xi ≤ θ] = (1/θ^n) 1[max_i xi ≤ θ] 1[min_i xi ≥ 0].

If we let T = max_i xi, then T is sufficient since
fX(x | θ) = (1/θ^n) 1[t ≤ θ] · 1[min_i xi ≥ 0],  where t = max_i xi;
the first factor is g(t, θ) and the second is h(x).

T. 12-8
Suppose T = T(X) is a statistic that satisfies
fX(x; θ)/fX(y; θ) does not depend on θ  ⇐⇒  T(x) = T(y).
Then T is minimal sufficient for θ.
First we show T is sufficient. We will use the factorization criterion to do so.
Firstly, for each possible t, pick a favorite xt such that T (xt ) = t. Now let
x ∈ X n . Then T (x) = T (xT (x) ). By the hypothesis, fX (x; θ)/fX (xT (x) ; θ) does
not depend on θ. Let this be h(x). Let g(t, θ) = fX (xt , θ). Then
fX(x; θ) = fX(x_{T(x)}; θ) · [ fX(x; θ)/fX(x_{T(x)}; θ) ] = g(T(x), θ) h(x).
So T is sufficient for θ. To show that this is minimal, suppose that S(X) is also
sufficient. By the factorization criterion, there exist functions gS and hS such that
fX (x; θ) = gS (S(x), θ)hS (x). Now suppose that S(x) = S(y). Then
fX(x; θ)/fX(y; θ) = gS(S(x), θ)hS(x) / [ gS(S(y), θ)hS(y) ] = hS(x)/hS(y).
This means that the ratio fX(x; θ)/fX(y; θ) does not depend on θ. By the hypothesis, this implies that T(x) = T(y). So we know that S(x) = S(y) implies T(x) = T(y). So T is a function of S. So T is minimal sufficient.
E. 12-9
Suppose X1, · · · , Xn are iid N(µ, σ²). Then
fX(x | µ, σ²)/fX(y | µ, σ²) = [ (2πσ²)^{−n/2} exp(−(1/2σ²) Σ_i (xi − µ)²) ] / [ (2πσ²)^{−n/2} exp(−(1/2σ²) Σ_i (yi − µ)²) ]
= exp{ −(1/2σ²)( Σ_i xi² − Σ_i yi² ) + (µ/σ²)( Σ_i xi − Σ_i yi ) }.
This is a constant function of (µ, σ²) iff Σ_i xi² = Σ_i yi² and Σ_i xi = Σ_i yi. So T(X) = (Σ_i Xi², Σ_i Xi) is minimal sufficient for (µ, σ²). Since an invertible function of a minimal sufficient statistic is also minimal sufficient, T̃(X) = (X̄, Σ (Xi − X̄)²) is also minimal sufficient for (µ, σ²), where X̄ = Σ_i Xi/n.
Note that above we have a vector T sufficient for a vector θ. The dimensions do not have to be the same. For example, one can check that for N(µ, µ²), T(X) = (Σ_i Xi², Σ_i Xi) is minimal sufficient for µ.
T. 12-10
<Rao-Blackwell Theorem> Let T be a sufficient statistic for θ and let θ̃ be
an estimator for θ with E(θ̃2 ) < ∞ for all θ. Let θ̂(x) = E[θ̃(X) | T (X) = T (x)].
Then E[(θ̂ − θ)2 ] ≤ E[(θ̃ − θ)2 ] for all θ. The inequality is strict unless θ̃ is a
function of T .

By the conditional expectation formula, we have E(θ̂) = E[E(θ̃ | T)] = E(θ̃). So they have the same bias. By the conditional variance formula,
var(θ̃) = E[var(θ̃ | T)] + var[E(θ̃ | T)] = E[var(θ̃ | T)] + var(θ̂).
Hence var(θ̃) ≥ var(θ̂). So mse(θ̃) ≥ mse(θ̂), with equality only if var(θ̃ | T) = 0.
Here we have to be careful with our definition of θ̂. It is defined as the expected
value of θ̃(X). And this could potentially depend on the actual value of θ. For-
tunately, since T is sufficient for θ, the conditional distribution of X given T = t
does not depend on θ. Hence θ̂ = E[θ̃(X) | T ] does not depend on θ, and so is a
genuine estimator.
As mentioned, minimal sufficient statistics allow us to store the results of our experiments in the most efficient way. It turns out that if we have a sufficient statistic, we can use it to improve any estimator: by this theorem, given any estimator, we can find another estimator that is a function of the sufficient statistic and is at least as good in terms of mean squared error. Moreover, if the original estimator θ̃ is unbiased, so is the new θ̂. Also, if θ̃ is already a function of T, then θ̂ = θ̃.
E. 12-11
• Suppose X1, · · · , Xn are iid Poisson(λ), and let θ = e^{−λ}, which is the probability that X1 = 0. Then
pX(x | λ) = e^{−nλ} λ^{Σxi} / ∏ xi!  =⇒  pX(x | θ) = θ^n (− log θ)^{Σxi} / ∏ xi!.
We see that T = Σ Xi is sufficient for θ (by the factorisation criterion), and we know Σ Xi ∼ Poisson(nλ). We start with an easy unbiased estimator θ̃ = 1[X1 = 0] (ie. if we observe nothing in the first observation, we estimate the event to be impossible). Then
E[θ̃ | T = t] = P( X1 = 0 | Σ_{i=1}^{n} Xi = t ) = P(X1 = 0) P(Σ_{i=2}^{n} Xi = t) / P(Σ_{i=1}^{n} Xi = t) = ((n − 1)/n)^t.
So we have θ̂ = (1 − 1/n)^{Σ xi}. This is approximately (1 − 1/n)^{nX̄} ≈ e^{−X̄} = e^{−λ̂}, which makes sense.
• Let X1, · · · , Xn be iid U[0, θ], and suppose that we want to estimate θ. We have shown before that T = max Xi is sufficient for θ. Let θ̃ = 2X1, an unbiased estimator. Then
E[θ̃ | T = t] = 2 E[X1 | max Xi = t]
= 2 E[X1 | max Xi = t, X1 = max Xi] P(X1 = max Xi) + 2 E[X1 | max Xi = t, X1 ≠ max Xi] P(X1 ≠ max Xi)
= 2( t · (1/n) + (t/2) · ((n − 1)/n) ) = ((n + 1)/n) t,
where we used E[X1 | max Xi = t, X1 ≠ max Xi] = E[X1 | X1 < t] = t/2, since X1 | X1 < t has distribution U[0, t]. Therefore θ̂ = ((n + 1)/n) max Xi is our new estimator. In case this is not clear, X1 | X1 < t has distribution U[0, t] since
fX1(x1 | X1 < t) = fX1(x1, X1 < t)/P(X1 < t) = (1/θ) 1[0 ≤ x1 ≤ t] / (t/θ) = (1/t) 1[0 ≤ x1 ≤ t].
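For the Poisson example in the first bullet above, a small simulation can illustrate how Rao-Blackwellisation reduces the mse; the sketch below is only illustrative (the values λ = 2, n = 10 and the seed are arbitrary choices).

import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 2.0, 10, 100_000
theta = np.exp(-lam)                      # true value of e^{-lambda}

X = rng.poisson(lam, size=(reps, n))
crude = (X[:, 0] == 0).astype(float)      # theta~ = 1[X1 = 0]
rb = (1 - 1 / n) ** X.sum(axis=1)         # theta^ = (1 - 1/n)^{sum xi}

print(((crude - theta) ** 2).mean())      # larger mse
print(((rb - theta) ** 2).mean())         # noticeably smaller mse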

D. 12-12
Let X1, · · · , Xn be random variables with joint pdf/pmf fX(x | θ). We observe
X = x. For any given x, the likelihood of θ is like(θ) = fX (x | θ), regarded as a
function of θ. The maximum likelihood estimator (mle) of θ is an estimator that
picks the value of θ that maximizes like(θ).
E. 12-13
There are many different estimators we can pick, and we have just come up with
some criteria to determine whether an estimator is “good”. However, these do
not give us a systematic way of coming up with an estimator to actually use. In
practice, we often use the maximum likelihood estimator. We can imagine that, given the data, the mle picks the distribution that is “most likely” to have produced that data.
Often there is no closed form for the mle, and we have to find θ̂ numerically.
When we can find the mle explicitly, in practice, we often equivalently maximize
the log-likelihood instead of the likelihood since it’s often easier. In particular, if
X1, · · · , Xn are iid, each with pdf/pmf fX(x | θ), then
like(θ) = ∏_{i=1}^{n} fX(xi | θ)  and  log like(θ) = Σ_{i=1}^{n} log fX(xi | θ).

How does the mle relate to sufficient statistics? Suppose that T is sufficient for
θ. Then the likelihood is g(T (x), θ)h(x), which depends on θ through T (x). To
maximise this as a function of θ, we only need to maximize g. So the mle θ̂ is a
function of the sufficient statistic T .
Note that if φ = h(θ) with h injective, then the mle of φ is given by h(θ̂). This is
called the invariance property of mle’s. For example, if the mle of the standard
deviation σ is σ̂, then the mle of the variance σ 2 is σ̂ 2 . This is rather useful in
practice, since we can use this to simplify a lot of computations.

It can be shown, under regularity conditions, that √n(θ̂ − θ) is asymptotically multivariate normal with mean 0 and ‘smallest attainable variance’ (see Part II Principles of Statistics).
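When no closed form exists, the log-likelihood is maximised numerically. As a sketch (the data vector and the choice of optimiser below are purely illustrative), here is a numerical maximisation for Bernoulli data, where the answer should of course equal the sample mean:

import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 0, 1, 1, 0, 1, 1])          # hypothetical observations

def neg_log_like(p):
    # minus the Bernoulli log-likelihood
    return -(x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p))

res = minimize_scalar(neg_log_like, bounds=(1e-6, 1 - 1e-6), method='bounded')
print(res.x, x.mean())                           # the two values agree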
E. 12-14
• Let X1, · · · , Xn be iid Bernoulli(p). Then
l(p) = log like(p) = (Σ xi) log p + (n − Σ xi) log(1 − p)  =⇒  dl/dp = (Σ xi)/p − (n − Σ xi)/(1 − p).
This is zero when p = Σ xi/n. So p̂ = Σ xi/n is the maximum likelihood estimator (and it is unbiased).
• Let X1, · · · , Xn be iid N(µ, σ²), and suppose we want to estimate θ = (µ, σ²). Then
l(µ, σ²) = log like(µ, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − (1/2σ²) Σ (xi − µ)².
This is maximized when ∂l/∂µ = ∂l/∂σ² = 0. We have
∂l/∂µ = (1/σ²) Σ (xi − µ),   ∂l/∂σ² = −n/(2σ²) + (1/2σ⁴) Σ (xi − µ)².
So the solution, hence the maximum likelihood estimator, is (µ̂, σ̂²) = (x̄, Sxx/n), where x̄ = (1/n) Σ xi and Sxx = Σ (xi − x̄)². We shall see later that SXX/σ² = nσ̂²/σ² ∼ χ²_{n−1}, the chi-squared distribution. This has E[χ²_{n−1}] = n − 1 and so E(σ̂²) = (n − 1)σ²/n, ie. σ̂² is biased. However it is asymptotically unbiased since E(σ̂²) − σ² → 0 as n → ∞.
• Suppose the American army discovers some German tanks that are sequentially
numbered, ie. the first tank is numbered 1, the second is numbered 2, etc. Then
if θ tanks are produced, then the probability distribution of the tank number is
U (0, θ). Suppose we have discovered n tanks whose numbers are x1 , x2 , · · · , xn ,
and we want to estimate θ, the total number of tanks produced. We want to find
the maximum likelihood estimator.
like(θ) = (1/θ^n) 1[max xi ≤ θ] 1[min xi ≥ 0].
So for θ ≥ max xi, like(θ) = 1/θ^n, which is decreasing as θ increases, while for θ < max xi, like(θ) = 0. Hence the value θ̂ = max xi maximizes the likelihood.
Is θ̂ unbiased? First we need to find the distribution of θ̂. For 0 ≤ t ≤ θ, the cumulative distribution function of θ̂ is
Fθ̂(t) = P(θ̂ ≤ t) = P(Xi ≤ t for all i) = (P(X1 ≤ t))^n = (t/θ)^n.
Differentiating with respect to t, we find the pdf fθ̂(t) = nt^{n−1}/θ^n. Hence
E(θ̂) = ∫_0^θ t · (nt^{n−1}/θ^n) dt = nθ/(n + 1).
So θ̂ is biased, but asymptotically unbiased (a quick simulation of this appears after this example).


• Smarties come in k equally frequent colours, but suppose we do not know k. Assume that we sample with replacement. Our first Smarties are Red, Purple, Red, Yellow. Then
like(k) = Pk(1st is a new colour) Pk(2nd is a new colour) Pk(3rd matches 1st) Pk(4th is a new colour)
= 1 × (k − 1)/k × 1/k × (k − 2)/k = (k − 1)(k − 2)/k³.
The maximum likelihood estimate is k̂ = 5 (by trial and error). If we plot like(k) against k, the plot is fairly flat, so k = 5 is not much more likely than any other k ≥ 3.
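As promised, a small simulation checking the bias of θ̂ = max Xi in the uniform example (the values θ = 100, n = 5 and the seed are purely illustrative):

import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 100.0, 5, 200_000
X = rng.uniform(0, theta, size=(reps, n))
mle = X.max(axis=1)

print(mle.mean())                   # close to n*theta/(n+1) = 83.3
print(((n + 1) / n * mle).mean())   # close to theta = 100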
D. 12-15
A 100γ% (0 < γ < 1) confidence interval for θ is a random interval (A(X), B(X))
such that P(A(X) < θ < B(X)) = γ, no matter what the true value of θ may be.
E. 12-16
In fact it is also possible to have confidence intervals for vector parameters. Notice
that it is the endpoints of the interval that are random quantities, while θ is a fixed
constant we want to find out. We can interpret this in terms of repeat sampling.
If we calculate (A(x), B(x)) for a large number of samples x, then approximately
100γ% of them will cover the true value of θ.
It is important to know that having observed some data x and calculated 95%
confidence interval, we cannot say that θ has 95% chance of being within the
interval. Apart from the standard objection that θ is a fixed value and either is or
is not in the interval, and hence we cannot assign probabilities to this event, we
will later construct an example where even though we have got a 50% confidence
interval, we are 100% sure that θ lies in that interval.
Note that if (A(x), B(x)) is a 100γ% confidence interval for θ, and T (θ) is a
monotone increasing function of θ, then (T (A(x)), T (B(x))) is a 100γ% confidence
interval for T (θ). And if T is monotone decreasing, then (T (B(x)), T (A(x))) is a
100γ% confidence interval for T (θ).
E. 12-17
Suppose X1 , · · · , Xn are iid N (θ, 1). Find a 95% confidence interval for θ.

We know X̄ ∼ N(θ, 1/n), so that √n(X̄ − θ) ∼ N(0, 1). Let z1, z2 be such that Φ(z2) − Φ(z1) = 0.95, where Φ is the standard normal (cumulative) distribution function. We have P[z1 < √n(X̄ − θ) < z2] = 0.95, which can be rearranged to give
P( X̄ − z2/√n < θ < X̄ − z1/√n ) = 0.95,
so we obtain the following 95% confidence interval:
( X̄ − z2/√n, X̄ − z1/√n ).
There are many possible choices for z1 and z2. Since the N(0, 1) density is symmetric, the shortest such interval is obtained by z2 = Φ^{−1}(0.975) = 1.96 = −z1. We can also choose other values such as z1 = −∞, z2 = 1.64, but we usually choose symmetric end points.
This example illustrates a common procedure for finding confidence intervals:
• Find a quantity R(X, θ) such that the Pθ-distribution of R(X, θ) does not depend on θ. This is called a pivot. In our example, R(X, θ) = √n(X̄ − θ).
• Write down a probability statement of the form Pθ (c1 < R(X, θ) < c2 ) = γ.
• Rearrange the inequalities inside P(. . .) to find the interval.
Usually c1 , c2 are percentage points from a known standardised distribution, often
equitailed. For example, we pick 2.5% and 97.5% points for a 95% confidence
interval. We could also use, say 0% and 95%, but this generally results in a wider
interval.
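As a sketch of this recipe in code (the data values below are hypothetical), the symmetric 95% interval for the N(θ, 1) example is:

import numpy as np
from scipy.stats import norm

x = np.array([5.1, 5.5, 4.9, 5.3])     # hypothetical observations
n, xbar = len(x), x.mean()
z = norm.ppf(0.975)                     # 1.96
print(xbar - z / np.sqrt(n), xbar + z / np.sqrt(n))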
E. 12-18
Suppose X1 , · · · , X50 are iid N (0, σ 2 ). Find a 99% confidence interval for σ.

We know that Xi/σ ∼ N(0, 1). So
(1/σ²) Σ_{i=1}^{50} Xi² ∼ χ²_{50}.
So R(X, σ²) = Σ_{i=1}^{50} Xi²/σ² is a pivot. Recall that χ²_n(α) is the upper 100α% point of χ²_n, ie. P(χ²_n ≤ χ²_n(α)) = 1 − α. We take c1 = χ²_{50}(0.995) = 27.99 and c2 = χ²_{50}(0.005) = 79.49. So
P( c1 < Σ Xi²/σ² < c2 ) = 0.99  =⇒  P( Σ Xi²/c2 < σ² < Σ Xi²/c1 ) = 0.99.
Since the square root is monotone increasing on [0, ∞), a 99% confidence interval for σ is
( √(Σ Xi²/c2), √(Σ Xi²/c1) ).
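A sketch of the same computation using scipy's chi-squared quantiles (the simulated data with σ = 2 are purely illustrative):

import numpy as np
from scipy.stats import chi2

x = np.random.default_rng(2).normal(0, 2.0, size=50)   # hypothetical data
s = (x ** 2).sum()
c1 = chi2.ppf(0.005, 50)        # 27.99
c2 = chi2.ppf(0.995, 50)        # 79.49
print(np.sqrt(s / c2), np.sqrt(s / c1))                 # 99% CI for sigma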

E. 12-19
Suppose X1 , · · · , Xn are iid Bernoulli(p). Find an approximate confidence interval
for p.
The mle of p is p̂ = Σ Xi/n. By the Central Limit theorem, p̂ is approximately N(p, p(1 − p)/n) for large n. So √n(p̂ − p)/√(p(1 − p)) is approximately N(0, 1) for large n. So we have
P( p̂ − z_{(1−γ)/2}√(p(1 − p)/n) < p < p̂ + z_{(1−γ)/2}√(p(1 − p)/n) ) ≈ γ.
But p is unknown! So we approximate it by p̂ to get a confidence interval for p when n is large:
P( p̂ − z_{(1−γ)/2}√(p̂(1 − p̂)/n) < p < p̂ + z_{(1−γ)/2}√(p̂(1 − p̂)/n) ) ≈ γ.
Note that we have made a lot of approximations here, but it would be difficult to
do better than this.
E. 12-20
Suppose an opinion poll says 20% of the people are going to vote UKIP, based on
a random sample of 1, 000 people. What might the true proportion be?

We assume we have an observation of x = 200 from a binomial(n, p) distribution with n = 1000. Then p̂ = x/n = 0.2 is an unbiased estimate, and also the mle. Now var(X/n) = p(1 − p)/n ≈ p̂(1 − p̂)/n = 0.00016. So a 95% confidence interval is
( p̂ − 1.96√(p̂(1 − p̂)/n), p̂ + 1.96√(p̂(1 − p̂)/n) ) = 0.20 ± 1.96 × 0.013 = (0.175, 0.225).
If we don't want to make that many approximations, we can note that p(1 − p) ≤ 1/4 for all 0 ≤ p ≤ 1. So a conservative 95% interval is p̂ ± 1.96√(1/(4n)) ≈ p̂ ± 1/√n. So whatever proportion is reported, it will be ‘accurate’ to ±1/√n.
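A quick numerical check of the arithmetic above (nothing here is new data; it just recomputes the interval):

import numpy as np

phat, n = 0.2, 1000
half = 1.96 * np.sqrt(phat * (1 - phat) / n)
print(phat - half, phat + half)          # roughly (0.175, 0.225)
print(1 / np.sqrt(n))                    # conservative half-width, about 0.032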
E. 12-21
Suppose X1 , X2 are iid from U (θ −1/2, θ +1/2). What is a sensible 50% confidence
interval for θ?

We know that each Xi is equally likely to be less than θ or greater than θ. So there is a 50% chance that we get one observation on each side, ie.
Pθ( min(X1, X2) ≤ θ ≤ max(X1, X2) ) = 1/2.
So (min(X1, X2), max(X1, X2)) is a 50% confidence interval for θ.


Suppose after the experiment we obtain |x1 − x2| ≥ 1/2. For example, we might get x1 = 0.2, x2 = 0.9; then we know that, in this particular case, θ must lie in (min(x1, x2), max(x1, x2)), and we don't have just 50% “confidence”!
This is why after we calculate a confidence interval, we should not say “there
is 100γ% chance that θ lies in here”. The confidence interval just says that “if
we keep making these intervals, 100γ% of them will contain θ”. But if we have
calculated a particular confidence interval, the probability that it contains θ is not
100γ%.
D. 12-22
• The prior distribution of θ is the probability distribution of the value of θ before
conducting the experiment. We usually write as π(θ).
• The posterior distribution of θ is the probability distribution of the value of θ
given an outcome of the experiment x. We write as π(θ | x).
E. 12-23
So far we have seen the frequentist approach to a statistical inference, ie. inferential
statements about θ are interpreted in terms of repeat sampling. For example, the
confidence interval is what’s the probability that the interval will contain θ, not
the probability that θ lies in the interval.
In contrast, the Bayesian approach treats θ as a random variable taking values
in Θ. The investigator’s information and beliefs about the possible values of θ
before any observation of data are summarised by a prior distribution π(θ). When
X = x are observed, the extra information about θ is combined with the prior to
obtain the posterior distribution π(θ | x) for θ given X = x.
There has been a long-running argument between the two approaches. Recently,
things have settled down, and Bayesian methods are seen to be appropriate in huge
numbers of application where one seeks to assess a probability about a “state of
the world”. For example, spam filters will assess the probability that a specific
email is a spam, even though from a frequentist’s point of view, this is nonsense,
because the email either is or is not a spam, and it makes no sense to assign a
probability to the email’s being a spam.
In Bayesian inference, we usually have some prior knowledge about the distribu-
tion of θ. After collecting some data, we will find a posterior distribution of θ
given X = x. By Bayes' theorem, the distributions are related by
π(θ | x) = fX(x | θ)π(θ)/fX(x)  =⇒  π(θ | x) ∝ fX(x | θ)π(θ),  ie. posterior ∝ likelihood × prior,
where the constant of proportionality is chosen to make the total mass of the pos-
terior distribution equal to one. Usually, we use this form, instead of attempting
to calculate fX (x). It should be clear that the data enters through the likelihood,
so if we have a sufficient statistic, the inference is automatically a function of the
sufficient statistic.
E. 12-24
Suppose I have 3 coins in my pocket. One is 3 : 1 in favour of tails, one is a fair
coin, and one is 3 : 1 in favour of heads. I randomly select one coin and flip it
once, observing a head. What is the probability that I have chosen coin 3?

Let X = 1 denote the event that I observe a head, and X = 0 a tail. Let θ denote the probability of a head. So θ is either 0.25, 0.5 or 0.75. Our prior distribution is π(θ = 0.25) = π(θ = 0.5) = π(θ = 0.75) = 1/3. The probability mass function is fX(x | θ) = θ^x(1 − θ)^{1−x}. So we have the following results:

 θ      π(θ)    fX(x = 1 | θ)   fX(x = 1 | θ)π(θ)   π(θ | x = 1)
 0.25   0.33    0.25            0.0825              0.167
 0.50   0.33    0.50            0.1650              0.333
 0.75   0.33    0.75            0.2475              0.500
 Sum    1.00    1.50            0.4950              1.000

So if we observe a head, then there is now a 50% chance that we have picked the
third coin.
E. 12-25
Suppose we are interested in the true mortality risk θ in a hospital H which is about
to try a new operation. On average in the country, around 10% of the people die,
but mortality rates in different hospitals vary from around 3% to around 20%.
Hospital H has no deaths in their first 10 operations. What should we believe
about θ?
Let Xi = 1 if the ith patient in H dies. Then
fX(x | θ) = θ^{Σxi}(1 − θ)^{n−Σxi}.
Suppose a priori that θ ∼ Beta(a, b) for some a > 0, b > 0, so that π(θ) ∝ θ^{a−1}(1 − θ)^{b−1}. Then the posterior is
π(θ | x) ∝ fX(x | θ)π(θ) ∝ θ^{Σxi+a−1}(1 − θ)^{n−Σxi+b−1}.
We recognize this as Beta(Σ xi + a, n − Σ xi + b). So
π(θ | x) = θ^{Σxi+a−1}(1 − θ)^{n−Σxi+b−1} / B(Σ xi + a, n − Σ xi + b).
In practice, we need to find a Beta prior distribution that matches our information from other hospitals. It turns out that the Beta(a = 3, b = 27) prior distribution has mean 0.1 and P(0.03 < θ < 0.20) = 0.9. Then we observe data Σ xi = 0, n = 10. So the posterior is Beta(Σ xi + a, n − Σ xi + b) = Beta(3, 37). This has a mean of 3/40 = 0.075.
This leads to a different conclusion than a frequentist analysis. Since nobody
has died so far, the mle is 0, which does not seem plausible. Using a Bayesian
approach, we have a higher mean than 0 because we take into account the data
from other hospitals. For this problem, a beta prior leads to a beta posterior. We
say that the beta family is a conjugate family of prior distribution for Bernoulli
samples.
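A short sketch of this conjugate update using scipy's beta distribution (only the printed checks are new; the prior and posterior are those given above):

from scipy.stats import beta

prior = beta(3, 27)
post = beta(3 + 0, 27 + 10)               # Beta(a + sum x_i, b + n - sum x_i)
print(prior.mean(), post.mean())           # 0.1  ->  0.075
print(prior.cdf(0.20) - prior.cdf(0.03))   # about 0.9, matching the prior spec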
Suppose that a = b = 1, so that π(θ) = 1 for 0 < θ < 1, the uniform distribution. Then the posterior is Beta(Σ xi + 1, n − Σ xi + 1), with properties

              mean                 mode         variance
 prior        1/2                  non-unique   1/12
 posterior    (Σxi + 1)/(n + 2)    Σxi/n        (Σxi + 1)(n − Σxi + 1)/[(n + 2)²(n + 3)]

Note that the mode of the posterior is the mle. The posterior mean estimator, (Σ xi + 1)/(n + 2), known as Laplace's estimator, was discussed before, where we showed that it has smaller mse than the mle for non-extreme values of θ. The posterior variance is bounded above by 1/(4(n + 3)); this is smaller than the prior variance, and is smaller for larger n. Again, note that the posterior automatically depends on the data through the sufficient statistic.
After we come up with the posterior distribution, we have to decide what estimator
to use. In the case above, we used the posterior mean, but this might not be the
best estimator. To determine what is the “best” estimator, we first need a loss
function that gives the loss incurred in estimating the value of a parameter to be
a when the true value is θ. And then we want a estimator that minimises the
expected loss.
D. 12-26
• A loss function L(θ, a) gives the loss incurred in estimating the value of a pa-
rameter to be a when the true value is θ.
• When our estimate is a, the expected posterior loss is h(a) = ∫ L(θ, a)π(θ | x) dθ.

• The Bayes estimator θ̂ is the estimator that minimises the expected posterior
loss.
C. 12-27
Common loss functions are quadratic loss L(θ, a) = (θ − a)2 , absolute error loss
L(θ, a) = |θ − a|, but we can have others.
• For quadratic loss, h(a) = ∫ (a − θ)² π(θ | x) dθ, and so h′(a) = 0 if
∫ (a − θ)π(θ | x) dθ = 0,  ie.  a ∫ π(θ | x) dθ = ∫ θπ(θ | x) dθ.
Since ∫ π(θ | x) dθ = 1, the Bayes estimator is θ̂ = ∫ θπ(θ | x) dθ, the posterior mean.
• For absolute error loss,
h(a) = ∫ |θ − a| π(θ | x) dθ = ∫_{−∞}^{a} (a − θ)π(θ | x) dθ + ∫_{a}^{∞} (θ − a)π(θ | x) dθ
= a ∫_{−∞}^{a} π(θ | x) dθ − ∫_{−∞}^{a} θπ(θ | x) dθ + ∫_{a}^{∞} θπ(θ | x) dθ − a ∫_{a}^{∞} π(θ | x) dθ.
Now h′(a) = 0 if ∫_{−∞}^{a} π(θ | x) dθ = ∫_{a}^{∞} π(θ | x) dθ. This occurs when each side is 1/2. So θ̂ is the posterior median.
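For a concrete comparison, the sketch below evaluates both Bayes estimators for the Beta(3, 37) posterior from the hospital example; for a skewed posterior the two differ slightly.

from scipy.stats import beta

post = beta(3, 37)
print(post.mean())     # Bayes estimator under quadratic loss
print(post.median())   # Bayes estimator under absolute error loss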

E. 12-28
Suppose that X1, · · · , Xn are iid N(µ, 1), and that a priori µ ∼ N(0, τ^{−2}) for some τ > 0. So τ measures the certainty of our prior knowledge. The posterior is given by
π(µ | x) ∝ fX(x | µ)π(µ) ∝ exp( −(1/2) Σ (xi − µ)² ) exp( −µ²τ²/2 ) ∝ exp( −(1/2)(n + τ²)( µ − Σ xi/(n + τ²) )² ),
which is a normal distribution. So the posterior distribution of µ given x is normal with mean Σ xi/(n + τ²) and variance 1/(n + τ²). The normal density is symmetric, so the posterior mean and the posterior median have the same value Σ xi/(n + τ²). This is the optimal estimator for both quadratic and absolute error loss.
E. 12-29
Suppose that X1 , · · · , Xn are iid Poisson(λ) random variables, and λ has an ex-
ponential distribution with mean 1. So π(λ) = e−λ . The posterior distribution is
given by
π(λ | x) ∝ e^{−nλ} λ^{Σxi} e^{−λ} = λ^{Σxi} e^{−(n+1)λ},  λ > 0,
which is Gamma(Σ xi + 1, n + 1). Hence under quadratic loss, our estimator is the posterior mean
λ̂ = (Σ xi + 1)/(n + 1).
Under absolute error loss, λ̂ solves
∫_0^{λ̂} [ (n + 1)^{Σxi+1} λ^{Σxi} e^{−(n+1)λ} / (Σ xi)! ] dλ = 1/2.

12.2 Hypothesis testing


Often in statistics, we have some hypothesis to test. For example, we want to test
whether a drug can lower the chance of a heart attack. Often, we will have two
hypotheses to compare: the null hypothesis states that the drug is useless, while the
alternative hypothesis states that the drug is useful. Quantitatively, suppose that the
chance of heart attack without the drug is θ0 and the chance with the drug is θ.
Then the null hypothesis is H0 : θ = θ0, while the alternative hypothesis is H1 : θ ≠ θ0.
It is important to note that the null hypothesis and alternative hypothesis are not on
equal footing. By default, we assume the null hypothesis is true. For us to reject the
null hypothesis, we need a lot of evidence to prove that. This is since we consider
incorrectly rejecting the null hypothesis to be a much more serious problem than failing to reject it when it is false. For example, it is relatively okay to reject a drug
when it is actually useful, but it is terrible to distribute drugs to patients when the
drugs are actually useless. Alternatively, it is more serious to deem an innocent person
guilty than to say a guilty person is innocent.
In general, let X1 , · · · , Xn be iid, each taking values in X , each with unknown pdf/pmf
f . We have two hypotheses, H0 and H1 , about f . On the basis of data X = x, we
make a choice between the two hypotheses.
D. 12-30
A simple hypothesis H specifies f completely (eg. H0 : θ = 1/2). Otherwise, H is
a composite hypothesis .
E. 12-31
• A coin has P(Heads) = θ, and is thrown independently n times. We could have H0 : θ = 1/2 versus H1 : θ = 3/4.
• Suppose X1 , · · · , Xn are iid discrete random variables. We could have H0 : the
distribution is Poisson with unknown mean, and H1 : the distribution is not Pois-
son.
• General parametric cases: Let X1 , · · · , Xn be iid with density f (x | θ). f is known
while θ is unknown. Then our hypotheses are H0 : θ ∈ Θ0 and H1 : θ ∈ Θ1 , with
Θ0 ∩ Θ1 = ∅.
• We could have H0 : f = f0 and H1 : f = f1, where f0 and f1 are densities that are completely specified but do not come from the same parametric family.
D. 12-32
• For testing a null hypothesis H0 against an alternative hypothesis H1 , a test
procedure has to partition X n into two disjoint exhaustive regions C and C̄, such
that if x ∈ C, then H0 is rejected, and if x ∈ C̄, then H0 is not rejected. C is
called the critical region , its complement is called the acceptance region .2
• When performing a test, we may either arrive at a correct conclusion, or make one
of the two types of error:
1. Type I error : reject H0 when H0 is true.
2. Type II error : not rejecting H0 when H0 is false.
• The p-value is the probability of obtaining a result equal to or “more extreme”
than what was actually observed, when the null hypothesis H0 is true.
• When H0 and H1 are both simple, let

α = P(Type I error) = P(X ∈ C | H0 is true).

β = P(Type II error) = P(X 6∈ C | H1 is true).


α is called the size of the test (i.e. the probability of rejecting H0 when H0 is
true), and 1 − β is the power of the test to detect H1 (i.e. the probability of
rejecting H0 when H1 is true).
• The likelihood of a simple hypothesis H : θ = θ∗ given data x is

Lx (H) = fX (x | θ = θ∗ ).
² Note that when we say “acceptance”, we really mean “non-rejection”! The name is purely for historical reasons.
The likelihood ratio of two simple hypotheses H0 , H1 given data x is

Lx (H1 )
Λx (H0 ; H1 ) = .
Lx (H0 )

A likelihood ratio test (LR test) is one where the critical region C is of the form
C = {x : Λx (H0 ; H1 ) > k} for some k.

E. 12-33
Here the hypotheses are not treated symmetrically; H0 has precedence over H1 and
a Type I error is treated as more serious than a Type II error. The null hypothesis
is a conservative hypothesis, ie one of “no change,” “no bias,” “no association,”
and is only rejected if we have clear evidence against it. H1 represents the kind of
departure from H0 that is of interest to us.

Ideally we would like both the size and power of a test to be 0 (or at least very
small), however typically it is not possible to find a test that makes both of them
arbitrarily small. Usually there’s trade-off. Nonetheless we would like to pick the
best possible test.

L. 12-34
<Neyman-Pearson lemma> Suppose H0 : f = f0 , H1 : f = f1 , where f0 and
f1 are continuous densities that are nonzero on the same regions. Then among all
tests of size less than or equal to α, the test with the largest power is the likelihood
ratio test of size α.

Under the likelihood ratio test, our critical region is C = {x : f1(x)/f0(x) > k}, where k is chosen so that α = P(reject H0 | H0) = P(X ∈ C | H0) = ∫_C f0(x) dx. The probability of a Type II error is given by
β = P(X ∉ C | f1) = ∫_{C̄} f1(x) dx.
Let C* be the critical region of any other test with size less than or equal to α. Let α* = P(X ∈ C* | f0) and β* = P(X ∉ C* | f1). We want to show β ≤ β*. We know α* ≤ α, ie
∫_{C*} f0(x) dx ≤ ∫_{C} f0(x) dx.
Also, on C we have f1(x) > kf0(x), while on C̄ we have f1(x) ≤ kf0(x). So
∫_{C̄*∩C} f1(x) dx ≥ k ∫_{C̄*∩C} f0(x) dx   and   ∫_{C̄∩C*} f1(x) dx ≤ k ∫_{C̄∩C*} f0(x) dx.
Hence
β − β* = ∫_{C̄} f1(x) dx − ∫_{C̄*} f1(x) dx
= ∫_{C̄∩C*} f1(x) dx + ∫_{C̄∩C̄*} f1(x) dx − ∫_{C̄*∩C} f1(x) dx − ∫_{C̄*∩C̄} f1(x) dx
= ∫_{C̄∩C*} f1(x) dx − ∫_{C̄*∩C} f1(x) dx
≤ k ∫_{C̄∩C*} f0(x) dx − k ∫_{C̄*∩C} f0(x) dx
= k( ∫_{C̄∩C*} f0(x) dx + ∫_{C∩C*} f0(x) dx ) − k( ∫_{C̄*∩C} f0(x) dx + ∫_{C∩C*} f0(x) dx )
= k(α* − α) ≤ 0.

Here we assumed that f0 and f1 are continuous densities. However, this assumption is needed just to ensure that a likelihood ratio test of exactly size α exists. Even
with non-continuous distributions, the likelihood ratio test is still a good idea. In
fact, for a discrete distribution, as long as a likelihood ratio test of exactly size α
exists, the same result holds.
E. 12-35
Suppose X1 , · · · , Xn are iid N (µ, σ02 ), where σ02 is known. We want to find the
best size α test of H0 : µ = µ0 against H1 : µ = µ1 , where µ0 and µ1 are known
fixed values with µ1 > µ0 . Then
Λx(H0; H1) = [ (2πσ0²)^{−n/2} exp(−(1/2σ0²) Σ (xi − µ1)²) ] / [ (2πσ0²)^{−n/2} exp(−(1/2σ0²) Σ (xi − µ0)²) ] = exp( n x̄ (µ1 − µ0)/σ0² + n(µ0² − µ1²)/(2σ0²) ).
This is an increasing function of x̄, so for any k, Λx > k ⇔ x̄ > c for some c. Hence we reject H0 if x̄ > c, where c is chosen such that P(X̄ > c | H0) = α. Under H0, X̄ ∼ N(µ0, σ0²/n), so Z = √n(X̄ − µ0)/σ0 ∼ N(0, 1). Since x̄ > c ⇔ z > c′ for some c′, the size α test rejects H0 if
z = √n(x̄ − µ0)/σ0 > zα.
For example, suppose µ0 = 5, µ1 = 6, σ0 = 1, α = 0.05, n = 4 and x =
(5.1, 5.5, 4.9, 5.3). So x̄ = 5.2. From tables, z0.05 = 1.645. We have z = 0.4 and
this is less than 1.645. So x is not in the rejection region. We do not reject H0
at the 5% level and say that the data are consistent with H0 . Note that this does
not mean that we accept H0 . While we don’t have sufficient reason to believe it
is false, we also don’t have sufficient reason to believe it is true. This is called a
z-test .
In this example, LR tests reject H0 if z > k for some constant k. The size of such a test is α = P(Z > k | H0) = 1 − Φ(k), which is decreasing as k increases. Our observed value z will be in the rejection region iff z > k ⇔ α > p* = P(Z > z | H0).
The quantity p∗ is called the p-value of our observed data x. For the example
above, z = 0.4 and so p∗ = 1 − Φ(0.4) = 0.3446.
In general, the p-value is sometimes called the “observed significance level” of x. This is the probability under H0 of seeing data that is “more extreme” than our observed data x. Extreme observations are viewed as providing evidence against H0. The p-value has a Uniform(0, 1) distribution under the null hypothesis. To see this for a z-test, note that
P(p* < p | H0) = P( 1 − Φ(Z) < p | H0 ) = P( Z > Φ^{−1}(1 − p) | H0 ) = 1 − Φ(Φ^{−1}(1 − p)) = 1 − (1 − p) = p.
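A quick numerical version of the worked z-test above (µ0 = 5, σ0 = 1, n = 4), as a sketch:

import numpy as np
from scipy.stats import norm

x = np.array([5.1, 5.5, 4.9, 5.3])
mu0, sigma0 = 5.0, 1.0
z = np.sqrt(len(x)) * (x.mean() - mu0) / sigma0
print(z, 1 - norm.cdf(z))   # z = 0.4, p-value about 0.345, so do not reject at 5%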
D. 12-36
• The power function of a hypothesis test H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1 with
a critical region C is the function W (θ) = P(X ∈ C | θ) = P(reject H0 | θ). The
size of the test is α = supθ∈Θ0 W (θ).
• A test specified by a critical region C is uniformly most powerful (UMP) size α
test for test H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1 if
1. supθ∈Θ0 W (θ) = α.
2. For any other test C ∗ with size ≤ α and with power function W ∗ , we have
W (θ) ≥ W ∗ (θ) for all θ ∈ Θ1 .
• The likelihood of a composite hypothesis H : θ ∈ Θ given data x is Lx (H) =
supθ∈Θ f (x | θ).
E. 12-37
For composite hypotheses like H : θ ≥ 0, the error probabilities do not have a
single value, so we define the power function. We want C to be such that W (θ) is
small on H0 and large on H1 . The size is the worst possible size we can get. Note
that for θ ∈ Θ1 , 1 − W (θ) = P(Type II error | θ).
Note that a uniformly most powerful test may not exist. However, the likelihood ratio test often works. Sometimes the Neyman-Pearson theory can be extended to one-sided alternatives. For example, in the previous example, we have shown that the most powerful size α test of H0 : µ = µ0 versus H1 : µ = µ1 (where µ1 > µ0) is given by
C = { x : √n(x̄ − µ0)/σ0 > zα }.
The critical region depends on µ0, n, σ0, α, and the fact that µ1 > µ0. It does not depend on the particular value of µ1. This test is then the uniformly most powerful size α test of H0 : µ = µ0 against H1 : µ > µ0.
E. 12-38
Suppose X1, · · · , Xn are iid N(µ, σ0²) where σ0 is known, and we wish to test H0 : µ ≤ µ0 against H1 : µ > µ0. First consider testing H0′ : µ = µ0 against H1′ : µ = µ1, where µ1 > µ0. The Neyman-Pearson test of size α of H0′ against H1′ has critical region
C = { x : √n(x̄ − µ0)/σ0 > zα }.
We will show that C is in fact UMP for the composite hypotheses H0 against H1 .
For µ ∈ R, the power function is
W(µ) = Pµ(reject H0) = Pµ( √n(X̄ − µ0)/σ0 > zα )
= Pµ( √n(X̄ − µ)/σ0 > zα + √n(µ0 − µ)/σ0 ) = 1 − Φ( zα + √n(µ0 − µ)/σ0 ).
To show this is UMP, we know that W (µ0 ) = α (by plugging in). W (µ) is an
increasing function of µ. So supµ≤µ0 W (µ) = α. So the first condition is satisfied.

For the second condition, observe that for any µ1 > µ0, the Neyman-Pearson size α test of H0′ vs H1′ has critical region C. Let C* and W* belong to any other test of H0 vs H1 of size ≤ α. Then C* can be regarded as a test of H0′ vs H1′ of size ≤ α, and the Neyman-Pearson lemma says that W*(µ1) ≤ W(µ1). This holds for all µ1 > µ0. So the condition is satisfied and the test is UMP.

E. 12-39
So far we have considered disjoint hypotheses Θ0 , Θ1 . Sometimes it is easier to
take Θ1 = Θ rather than Θ \ Θ0 . Then

Λx(H0; H1) = Lx(H1)/Lx(H0) = sup_{θ∈Θ1} f(x | θ) / sup_{θ∈Θ0} f(x | θ) ≥ 1,
with large values of Λ indicating departure from H0. Notice that if we let Λ*_x = sup_{θ∈Θ\Θ0} f(x | θ) / sup_{θ∈Θ0} f(x | θ), then Λx = max{1, Λ*_x}.

Here's an example testing a given mean with known variance (z-test). Suppose that X1, · · · , Xn are iid N(µ, σ0²), with σ0² known, and we wish to test H0 : µ = µ0 against H1 : µ ≠ µ0 (for a given constant µ0). Here Θ0 = {µ0} and Θ = R. For the denominator, we have sup_{θ∈Θ0} f(x | θ) = f(x | µ0). For the numerator, sup_{µ∈Θ} f(x | µ) = f(x | µ̂), where µ̂ is the mle. We know that µ̂ = x̄. Hence
Λx(H0; H1) = [ (2πσ0²)^{−n/2} exp(−(1/2σ0²) Σ (xi − x̄)²) ] / [ (2πσ0²)^{−n/2} exp(−(1/2σ0²) Σ (xi − µ0)²) ].
Then H0 is rejected if Λx is large. To make our lives easier, we can use the logarithm instead:
2 log Λx(H0; H1) = (1/σ0²)( Σ (xi − µ0)² − Σ (xi − x̄)² ) = n(x̄ − µ0)²/σ0².
So we reject H0 if |√n(x̄ − µ0)/σ0| > c for some c. We know that under H0, Z = √n(X̄ − µ0)/σ0 ∼ N(0, 1). So the size α generalised likelihood ratio test rejects H0 if
|√n(x̄ − µ0)/σ0| > z_{α/2}.
In fact a symmetric 100(1 − α)% confidence interval for µ is x̄ ± z_{α/2} σ0/√n, so we reject H0 iff µ0 is not in this confidence interval. Later we'll explore the connection between confidence intervals and hypothesis tests further. Alternatively, since n(X̄ − µ0)²/σ0² ∼ χ²₁ under H0, we reject H0 if
n(x̄ − µ0)²/σ0² > χ²₁(α).
One can check that z²_{α/2} = χ²₁(α). Note that this is a two-tailed test, ie. we reject H0 both for high and low values of x̄.
D. 12-40
Suppose we have a vector parameter Θ = {θ : θ = (θ1 , · · · , θk )}, we say Θ has k free
parameters and write |Θ| = k. If H0 : θ ∈ Θ0 imposes p independent restrictions
on Θ, then we say Θ0 has k − p free parameters and we write |Θ0 | = k − p.
E. 12-41
We consider the “size” or “dimension” of our hypotheses: suppose that H0 imposes
p independent restrictions on Θ = {θ : θ = (θ1 , · · · , θk )}, so for example
• H0 : θij = aj for j = 1, · · · , p; or
• H0 : Aθ = b (with A a p × k matrix and b a p × 1 matrix (vector) given); or
• H0 : θi = fi (ϕ) for i = 1, · · · , k for some ϕ = (ϕ1 , · · · , ϕk−p ) freely chosen.
Then Θ has k free parameters and Θ0 has k − p free parameters. And |Θ0 | = k − p
and |Θ| = k.
T. 12-42
<Generalized likelihood ratio theorem> Suppose Θ0 ⊆ Θ1 and |Θ1 |−|Θ0 | =
p. Let X = (X1 , · · · , Xn ) with all Xi iid. Then if H0 is true, as n → ∞,

2 log ΛX (H0 : H1 ) ∼ χ2p .

If H0 is not true, then 2 log Λ tends to be larger. We reject H0 if 2 log Λ > c, where
c = χ2p (α) for a test of approximately size α.

We will not prove this. In the example above, |Θ1| − |Θ0| = 1, and we saw that under H0, 2 log Λ ∼ χ²₁ exactly for all n in that particular case, rather than just approximately. This theorem allows us to use likelihood ratio tests even when we cannot find the exact null distribution.

12.2.1 Tests and examples


So far, we have considered relatively simple cases where we are attempting to figure
out, say, the mean. However, in reality, more complicated scenarios arise. For example,
we might want to know whether a die is fair, ie. whether the probability of getting each number is exactly 1/6. Our null hypothesis would be that p1 = p2 = · · · = p6 = 1/6, while the alternative hypothesis allows any possible values of pi.
In general, suppose the observation space X is partitioned into k sets, and let pi be the
probability that an observation is in set i for i = 1, · · · , k. We want to test “H0 : the
pi's arise from a fully specified model” against “H1 : the pi's are unrestricted (apart from the obvious pi ≥ 0, Σ pi = 1)”.
E. 12-43
Is following table listing the birth months of admissions to Oxford and Cambridge
in 2012 compatible with the birth distribution over the year?
Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug
470 515 470 457 473 381 466 457 437 396 384 394

Out of n independent observations, let Ni be the number of observations in the ith set. So (N1, · · · , Nk) ∼ multinomial(n; p1, · · · , pk). For a generalized likelihood ratio test of H0, we need to find the maximised likelihood under H0 and H1.
Under H1, like(p1, · · · , pk) ∝ p1^{n1} · · · pk^{nk}. So the log likelihood is l = constant + Σ ni log pi. We want to maximise this subject to Σ pi = 1. Using a Lagrange multiplier, we find that the mle is p̂i = ni/n. Also |Θ1| = k − 1 (not k, since the pi must sum to 1).
Under H0, the values of pi are specified completely, say pi = p̃i. So |Θ0| = 0. Using our formula for p̂i, we find that
2 log Λ = 2 log( p̂1^{n1} · · · p̂k^{nk} / (p̃1^{n1} · · · p̃k^{nk}) ) = 2 Σ ni log( ni/(np̃i) ).

Here |Θ1| − |Θ0| = k − 1. So we reject H0 if 2 log Λ > χ²_{k−1}(α) for an approximate size α test.
For H0 (no effect of month of birth), let p̃i be the proportion of births in month i in, say, year 1993/1994; this is not simply proportional to the number of days in each month (or, even worse, equal to 1/12), as there is for example an excess of September births (the “Christmas effect”). It turns out that
2 log Λ = 2 Σ ni log( ni/(np̃i) ) = 44.9.
P(χ²₁₁ > 44.86) = 3 × 10⁻⁹, which is our p-value. Since this is certainly less than 0.001, we can reject H0 at the 0.1% level (ie. with a test of size α = 0.001), or we can say the result is “significant at the 0.1% level”.
The traditional levels for comparison are α = 0.05, 0.01, 0.001, roughly corresponding to “evidence”, “strong evidence” and “very strong evidence”.
C. 12-44
<Pearson's Chi-squared test> Like the above example, a similar situation has H0 : pi = pi(θ) for some parameter θ and H1 unrestricted (except the obvious positivity and summing to 1). Now |Θ0| is the number of independent parameters to be estimated under H0. Under H0, we find the mle θ̂ by maximizing Σ ni log pi(θ), and then
2 log Λ = 2 log( p̂1^{n1} · · · p̂k^{nk} / (p1(θ̂)^{n1} · · · pk(θ̂)^{nk}) ) = 2 Σ ni log( ni/(npi(θ̂)) ).   (?)
The degrees of freedom are k − 1 − |Θ0|. In general, let oi = ni (the observed number) and ei = np̃i or npi(θ̂) (the expected number). Let δi = oi − ei. Then
2 log Λ = 2 Σ oi log( oi/ei ) = 2 Σ (ei + δi) log( 1 + δi/ei )
= 2 Σ (ei + δi)( δi/ei − δi²/(2ei²) + O(δi³) ) = 2 Σ ( δi + δi²/ei − δi²/(2ei) ) + O(δi³).
We know that Σ δi = 0 since Σ ei = Σ oi. So
2 log Λ ≈ Σ δi²/ei = Σ (oi − ei)²/ei.
This is known as Pearson's chi-squared test.
E. 12-45
Mendel crossed 556 smooth yellow male peas with wrinkled green peas. From the
progeny, let
1. N1 be the number of smooth yellow peas,
2. N2 be the number of smooth green peas,
3. N3 be the number of wrinkled yellow peas,
4. N4 be the number of wrinkled green peas.
We wish to test the goodness of fit of the model
H0 : (p1, p2, p3, p4) = (9/16, 3/16, 3/16, 1/16).
Suppose we observe (n1, n2, n3, n4) = (315, 108, 102, 31). We find (e1, e2, e3, e4) = (312.75, 104.25, 104.25, 34.75). The actual 2 log Λ = 0.618, and the approximation Σ (oi − ei)²/ei = 0.604. Here |Θ0| = 0 and |Θ1| = 4 − 1 = 3. So we refer the test statistic to χ²₃.
Since χ23 (0.05) = 7.815, we see that neither value is significant at 5%. So there
is no evidence against Mendel’s theory. In fact, the p-value is approximately
P(χ23 > 0.6) ≈ 0.96. This is a really good fit!
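The same computation can be done with scipy's chisquare function; a brief sketch:

import numpy as np
from scipy.stats import chisquare

obs = np.array([315, 108, 102, 31])
exp = obs.sum() * np.array([9, 3, 3, 1]) / 16
stat, p = chisquare(obs, exp)       # no parameters estimated, so df = 4 - 1 = 3
print(stat, p)                       # about 0.604, p about 0.96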
E. 12-46
In a genetics problem, each individual has one of the three possible genotypes,
with probabilities p1 , p2 , p3 . Suppose we wish to test H0 : pi = pi (θ), where

p1(θ) = θ²,  p2(θ) = 2θ(1 − θ),  p3(θ) = (1 − θ)²,  for some θ ∈ (0, 1).
We observe Ni = ni. Under H0, the mle θ̂ is found by maximising
Σ ni log pi(θ) = 2n1 log θ + n2 log(2θ(1 − θ)) + 2n3 log(1 − θ).
We find that θ̂ = (2n1 + n2)/(2n). Also, |Θ0| = 1 and |Θ1| = 2. After conducting an experiment, we can substitute pi(θ̂) into (?) in [C.12-44], or find the corresponding Pearson's chi-squared statistic, and refer to χ²₁.
D. 12-47
A contingency table is a table in which observations or individuals are classified
according to one or more criteria.
E. 12-48
Suppose N is a set of people. We classify them according to some criterion, obtaining a partition N = a1 ∪ a2 ∪ · · · ∪ ar. Suppose we then classify them according to some other criterion, obtaining another partition N = b1 ∪ b2 ∪ · · · ∪ bs. The r × s table (nij) with entries nij = |ai ∩ bj| is a two-way contingency table: the rows correspond to a1, · · · , ar, the columns to b1, · · · , bs, and the entry nij tells us how many people are in both of the categories ai and bj. If the two classifications are independent, we expect for all i, j
(|ai|/|N|) (|bj|/|N|) = |ai ∩ bj|/|N|.
C. 12-49
<Testing independence in contingency tables> Consider a two-way con-
tingency table with r rows and c columns. For i = 1, · · · , r And j = 1, · · · , c,
let pij be the probability that an individual selected from the population under
consideration is classified in row i and column j. (ie. in the (i, j) cell of the table).
Let pi+ = P(in row i) and p+j = P(in column j). Then we must have p++ = Σ_i Σ_j pij = 1. Suppose a random sample of n individuals is taken, and let nij be the number of these classified in the (i, j) cell of the table. Let ni+ = Σ_j nij and n+j = Σ_i nij. So n++ = n. We have

(N11 , · · · , N1c , N21 , · · · , Nrc ) ∼ multinomial(n; p11 , · · · , p1c , p21 , · · · , prc ).

We may be interested in testing the null hypothesis that the two classifications are
independent. So we test

H0 : pij = pi+ p+j for all i, j, ie. independence of columns and rows
H1 : pij are unrestricted (except the obvious p++ = 1, pij ≥ 0).

Under H1 , the mles are p̂ij = nij /n. Under H0 , the mles are p̂i+ = ni+ /n and
p̂+j = n+j /n. Write oij = nij and eij = np̂i+ p̂+j = ni+ n+j /n. Then
2 log Λ = 2 Σ_{i=1}^{r} Σ_{j=1}^{c} oij log( oij/eij ) ≈ Σ_{i=1}^{r} Σ_{j=1}^{c} (oij − eij)²/eij,
using the same approximating steps as for Pearson's chi-squared test. We have |Θ1| = rc − 1, because under H1 the pij's sum to one. Also, |Θ0| = (r − 1) + (c − 1), because p1+, · · · , pr+ must satisfy Σ_i pi+ = 1 and p+1, · · · , p+c must satisfy Σ_j p+j = 1. So
|Θ1| − |Θ0| = rc − 1 − (r − 1) − (c − 1) = (r − 1)(c − 1).

E. 12-50
500 people with recent car changes were asked about their previous and new cars.
The results are as follows:
                            New car
                   Large   Medium   Small   Total
 Previous  Large     56      52       42     150
 car       Medium    50      83       67     200
           Small     18      51       81     150
           Total    124     186      190     500
This is a two-way contingency table: Each person is classified according to the
previous car size and new car size. We wish to test H0 : the new and previous car
sizes are independent. The expected values given by H0 is
                            New car
                   Large   Medium   Small   Total
 Previous  Large    37.2    55.8     57.0    150
 car       Medium   49.6    74.4     76.0    200
           Small    37.2    55.8     57.0    150
           Total    124     186      190     500
Note the margins are the same. It is quite clear that they do not match well, but
we can find the p value to be sure.
Σ_i Σ_j (oij − eij)²/eij = 36.20,
and the degrees of freedom is (3 − 1)(3 − 1) = 4. From the tables, χ24 (0.05) = 9.488
and χ24 (0.01) = 13.28. So our observed value of 36.20 is significant at the 1% level,
ie. there is strong evidence against H0 . So we conclude that the new and present
car sizes are not independent.
It may be informative to look at the contributions of each cell to Pearson’s chi-
squared:
                            New car
                   Large   Medium   Small
 Previous  Large    9.50    0.26     3.95
 car       Medium   0.00    0.99     1.07
           Small    9.91    0.41    10.11
It seems that more owners of large cars than expected under H0 bought another
large car, and more owners of small cars than expected under H0 bought another
small car. Fewer than expected changed from a small to a large car.
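The whole analysis can be reproduced with scipy's contingency-table routine; a sketch:

import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[56, 52, 42],
                  [50, 83, 67],
                  [18, 51, 81]])
stat, p, dof, expected = chi2_contingency(table, correction=False)
print(stat, dof, p)      # about 36.2 on 4 degrees of freedom, p well below 0.01
print(expected)          # matches the table of expected counts above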
C. 12-51
<Tests of homogeneity> We want to test whether two or more multino-
mial distributions are equal. In general, we have independent observations from
r multinomial distributions, each of which has c categories, ie. we observe an
r × c table (nij ), for i = 1, · · · , r and j = 1, · · · , c, where (Ni1 , · · · , Nic ) ∼
multinomial(ni+ , pi1 , · · · , pic ) independently for each i = 1, · · · , r. So the rows
of the table is the results for the different multinomial distributions. We want to
test

H0 : p1j = p2j = · · · = prj = pj for all j = 1, · · · , c


H1 : pij are unrestricted.

Under H1,
like(pij) = ∏_{i=1}^{r} [ ni+! / (ni1! · · · nic!) ] pi1^{ni1} · · · pic^{nic}  =⇒  log like = constant + Σ_{i=1}^{r} Σ_{j=1}^{c} nij log pij.
Using Lagrangian methods, we find that p̂ij = nij/ni+. Under H0,
log like = constant + Σ_{j=1}^{c} n+j log pj.
By Lagrangian methods, we have p̂j = n+j/n++. Hence
2 log Λ = 2 Σ_{i=1}^{r} Σ_{j=1}^{c} nij log( p̂ij/p̂j ) = 2 Σ_{i=1}^{r} Σ_{j=1}^{c} nij log( nij/(ni+ n+j/n++) ),

which is the same as what we had last time, when the row totals are unrestricted!
We have |Θ1| = r(c − 1) and |Θ0| = c − 1. So the number of degrees of freedom is r(c − 1) − (c − 1) = (r − 1)(c − 1). Under H0, 2 log Λ is approximately χ²_{(r−1)(c−1)}. Again, this is exactly the same as what we had last time!
We reject H0 if 2 log Λ > χ²_{(r−1)(c−1)}(α) for an approximate size α test. If we let oij = nij, eij = ni+ n+j/n++, and δij = oij − eij, then using the same approximating steps as for Pearson's chi-squared, we obtain
2 log Λ ≈ Σ (oij − eij)²/eij.

E. 12-52
150 patients were randomly allocated to three groups of 50 patients each. Two
groups were given a new drug at different dosage levels, and the third group
received a placebo. The responses were as shown in the table below.
Improved No difference Worse Total
Placebo 18 17 15 50
Half dose 20 10 20 50
Full dose 25 13 12 50
Total 63 40 47 150
Here the row totals are fixed in advance, in contrast to our last section, where the
row totals are random variables. For the above, we may be interested in testing
H0 : the probability of “improved” is the same for each of the three treatment
groups, and so are the probabilities of “no difference” and “worse”, ie. H0 says
that we have homogeneity down the rows. The expected under H0 is
Improved No difference Worse Total
Placebo 21 13.3 15.7 50
Half dose 21 13.3 15.7 50
Full dose 21 13.3 15.7 50
Total 63 40 47 150
We find 2 log Λ = 5.129, and we refer this to χ24 .
Clearly this is not significant, as
the mean of χ24 is 4, and is something we would expect to happen solely by chance.
We can calculate the p-value: from tables, χ24 (0.05) = 9.488, so our observed value
is not significant at 5%, and the data are consistent with H0 . We conclude that
there is no evidence for a difference between the drug at the given doses and the placebo. For interest, Σ (oij − eij)²/eij = 5.173, giving the same conclusion.
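The same numbers fall out of the chi-squared machinery in scipy; a sketch:

import numpy as np
from scipy.stats import chi2, chi2_contingency

table = np.array([[18, 17, 15],
                  [20, 10, 20],
                  [25, 13, 12]])
stat, p, dof, _ = chi2_contingency(table, correction=False)
print(stat, dof, p)            # about 5.17 on 4 degrees of freedom
print(chi2.ppf(0.95, dof))     # 9.488, so not significant at 5%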
T. 12-53
Suppose X1 , · · · , Xn have joint pdf fX (x | θ) for θ ∈ Θ
1. Suppose that for every θ0 ∈ Θ there is a size α test of H0 : θ = θ0 . Denote
the acceptance region by A(θ0 ). Then the set I(X) = {θ : X ∈ A(θ)} is a
100(1 − α)% confidence set for θ.
2. Suppose I(X) is a 100(1 − α)% confidence set for θ. Then A(θ0 ) = {X : θ0 ∈
I(X)} is an acceptance region for a size α test of H0 : θ = θ0 .

First note that θ0 ∈ I(X) iff X ∈ A(θ0 ).


1. Since the test is size α, we have
P(accept H0 | H0 is true) = P(X ∈ A(θ0 ) | θ = θ0 ) = 1 − α.
And so P(θ0 ∈ I(X) | θ = θ0 ) = P(X ∈ A(θ0 ) | θ = θ0 ) = 1 − α.


2. Since I(X) is a 100(1 − α)% confidence set, P (θ0 ∈ I(X) | θ = θ0 ) = 1 − α. So
P(X ∈ A(θ0 ) | θ = θ0 ) = P(θ0 ∈ I(X) | θ = θ0 ) = 1 − α.
Intuitively, this says that “confidence intervals” and “hypothesis acceptance/rejection”
are the same thing. Result 1 says that a 100(1 − α)% confidence set for θ consists
of all those values of θ0 for which H0 : θ = θ0 is not rejected at level α on the basis
of X. Results 2 says that given a confidence set, we define the test by rejecting θ0
if it is not in the confidence set.
E. 12-54
Suppose X1 , · · · , Xn are iid N (µ, 1) random variables and we want a 95% con-
fidence set for µ. One way is to use the theorem and find the confidence set
that belongs to the hypothesis test that we found in the previous examples. We
find a √test of size 0.05 of H0 : µ = µ0 against H1 : µ 6= µ0 that rejects H0
when | n(x̄ − µ0 )| > 1.96 (where √ 1.96 is the upper 2.5% point of N (0, 1)). Then
I(X) = {µ : X ∈ √ A(µ)} = {µ : |
√ n(X̄ − µ)| < 1.96}. So a 95% confidence set for
µ is (X̄ − 1.96/ n, X̄ + 1.96/ n). This is in agreement of [E.12-17].

12.2.2 Multivariate normal theory


D. 12-55
• The mean of a vector-valued random variable X = (X1 , · · · , Xn )T is taken component-
wise so that µ = E[X] = (E(X1 ), · · · , E(Xn ))T = (µ1 , · · · , µn )T .
• The covariance of two vector-valued random variable X and Y is the matrix
cov(X, Y) with (i, j)th entry cov(Xi , Yj ). In particular cov(X, X) = E[(X −
µ)(X − µ)T ] = (cov(Xi , Xj ))ij . Sometimes we write cov(X, X) = cov(X).
• X has a multivariate normal distribution if for every t ∈ Rn the random variable
tT X (ie. t·X) has a normal distribution. We write X ∼ Nn (µ, Σ) where µ = E[X]
and Σ = cov(X).
E. 12-56
Note that for X = (X1 , · · · , Xn )T and a m × n matrix A we have E[AX] = Aµ
and cov(AX) = A cov(X)AT . The last one comes from

cov(AX) = E[(AX − E[AX])(AX − E[AX])T ] = E[A(X − E X)(X − E X)T AT ]


= A E[(X − E X)(X − E X)T ]AT .

More generally we have cov(AX, BX) = A cov(X)B T .


L. 12-57
Suppose X ∼ Nn(µ, Σ). Then Σ is symmetric and positive semi-definite, and X has mgf
MX(t) = E[e^{t^T X}] = exp( t^T µ + (1/2) t^T Σ t ).

Symmetry of Σ is inherited from that of cov. It is positive semi-definite since
t^T Σ t = t^T cov(X) t = var(t^T X) ≥ 0.
By definition, the mgf of a vector-valued random variable X with real components is MX(t) = E[e^{⟨t,X⟩}]. Recall that a (univariate) normal X ∼ N(µ, σ²) has mgf
MX(s) = E[e^{sX}] = exp( µs + (1/2)σ²s² ).
Now for any t, the random variable t^T X is univariate normal (by the definition of multivariate normality) with mean t^T µ and variance t^T Σ t, so its moment generating function is
M_{t^T X}(s) = E[e^{s t^T X}] = exp( t^T µ s + (1/2) t^T Σ t s² ).
Setting s = 1, we see that MX(t) = E[e^{t^T X}] = M_{t^T X}(1), as claimed.
P. 12-58
1. If X ∼ Nn (µ, Σ), and A is an m × n matrix, then AX ∼ Nm (Aµ, AΣAT ).
2. If X ∼ Nn(0, σ²I), then |X|²/σ² = X^T X/σ² = Σ_i Xi²/σ² ∼ χ²_n.

1. Note that M_{AX}(t) = MX(A^T t) = exp( t^T Aµ + (1/2) t^T AΣA^T t ).
2. Immediate from the definition of χ²_n.
Instead of writing |X|2 /σ 2 ∼ χ2n , we often just say |X|2 ∼ σ 2 χ2n .
P. 12-59
Let X ∼ Nn(µ, Σ). We split X into two parts: X = (X1, X2)^T, where Xi is an ni × 1 column vector and n1 + n2 = n. Similarly write
µ = (µ1, µ2)^T,   Σ = [ Σ11  Σ12 ; Σ21  Σ22 ],   where Σij is an ni × nj matrix.
Then (1)[[ Xi ∼ N_{ni}(µi, Σii) ]] and (2)[[ X1 and X2 are independent iff Σ12 = 0 ]].

1. Note that M_{X1}(t) = MX( (t, 0)^T ) = exp( t^T µ1 + (1/2) t^T Σ11 t ), and similarly for component 2.
2. Note that by symmetry of Σ, Σ12 = 0 if and only if Σ21 = 0. Recall MX(t) = exp( t^T µ + (1/2) t^T Σ t ) for each t ∈ R^n. We write t = (t1, t2)^T. Then
MX(t) = exp( t1^T µ1 + t2^T µ2 + (1/2) t1^T Σ11 t1 + (1/2) t2^T Σ22 t2 + (1/2) t1^T Σ12 t2 + (1/2) t2^T Σ21 t1 ).
From (1), we know that M_{Xi}(ti) = exp( ti^T µi + (1/2) ti^T Σii ti ). Therefore MX(t) = M_{X1}(t1) M_{X2}(t2) for all t if and only if Σ12 = 0.
P. 12-60
When Σ is positive definite, X ∼ Nn(µ, Σ) has pdf
fX(x; µ, Σ) = (1/|Σ|^{1/2}) (1/√(2π))^n exp( −(1/2)(x − µ)^T Σ^{−1} (x − µ) ).

Note that Σ is always positive semi-definite. The conditions just forbid the case
|Σ| = 0, since this would lead to dividing by zero.
T. 12-61
<Joint distribution of X̄ and SXX> Suppose X1, · · · , Xn are iid N(µ, σ²). Write X̄ = (1/n) Σ Xi and SXX = Σ (Xi − X̄)². Then
P 2
Write X̄ = n Xi and SXX = (Xi − X̄) . Then
1. X̄ ∼ N (µ, σ 2 /n)
2. SXX /σ 2 ∼ χ2n−1 .
3. X̄ and SXX are independent.

We can write the joint density as X ∼ Nn(µ, σ²I), where µ = (µ, µ, · · · , µ)^T. Let A be an n × n orthogonal matrix whose first row consists entirely of 1/√n (the other rows are not important). One possible choice is the matrix whose kth row, for k = 2, · · · , n, is
( 1/√(k(k−1)), · · · , 1/√(k(k−1)), −(k−1)/√(k(k−1)), 0, · · · , 0 ),
with k − 1 entries equal to 1/√(k(k−1)), followed by −(k−1)/√(k(k−1)), and then zeros.
Now define Y = AX. Then Y ∼ Nn(Aµ, Aσ²IA^T) = Nn(Aµ, σ²I). We have Aµ = (√n µ, 0, · · · , 0)^T. So Y1 ∼ N(√n µ, σ²) and Yi ∼ N(0, σ²) for i = 2, · · · , n. Also, Y1, · · · , Yn are independent, since the covariance matrix has every non-diagonal term equal to 0. But from the definition of A, we have
Y1 = (1/√n) Σ_{i=1}^{n} Xi = √n X̄.
So √n X̄ ∼ N(√n µ, σ²), or X̄ ∼ N(µ, σ²/n). Also
Y2² + · · · + Yn² = Y^T Y − Y1² = X^T A^T A X − Y1² = X^T X − nX̄² = Σ_{i=1}^{n} Xi² − nX̄² = Σ_{i=1}^{n} (Xi − X̄)² = SXX.
So SXX = Y2² + · · · + Yn² ∼ σ²χ²_{n−1}. Finally, since Y1 and (Y2, · · · , Yn) are independent, so are X̄ and SXX.

D. 12-62
• Suppose that Z ∼ N(0, 1) and Y ∼ χ²_k are independent; then T = Z/√(Y/k) is said to have a t-distribution on k degrees of freedom, and we write T ∼ t_k. We write t_k(α) for the upper 100α% point of the t_k distribution, so that P(T > t_k(α)) = α.
• Suppose U and V are independent with U ∼ χ²_m and V ∼ χ²_n. Then X = (U/m)/(V/n) is said to have an F-distribution on m and n degrees of freedom, and we write X ∼ F_{m,n}. We write F_{m,n}(α) for the upper 100α% point of the F_{m,n}-distribution, so that if X ∼ F_{m,n}, then P(X > F_{m,n}(α)) = α.
E. 12-63
• Since U and V have means m and n respectively, U/m and V/n are each approximately 1. So an F_{m,n} random variable is often approximately 1.
• Note that it is immediate from the definition that if X ∼ F_{m,n}, then 1/X ∼ F_{n,m}. Suppose that we have the upper 5% point for all F_{n,m}. Using this information, it is easy to find the lower 5% point for F_{m,n}, since we know that P(F_{m,n} < 1/x) = P(F_{n,m} > x).
• Note that it is immediate from the definitions of t_n and F_{1,n} that if Y ∼ t_n, then Y² ∼ F_{1,n}, i.e. it is a ratio of independent χ²₁ and χ²_n variables.
P. 12-64
Let T ∼ t_k. Then
1. The density of T is

    f_T(t) = [Γ((k + 1)/2) / (Γ(k/2) √(πk))] (1 + t²/k)^{−(k+1)/2}.

2. E(T) = 0 if k > 1, otherwise undefined.
3. var(T) = k/(k − 2) if k > 2, and var(T) = ∞ if k = 2. Otherwise undefined.

Note that the density is symmetric, bell-shaped, and has a maximum at t = 0, which is rather like the standard normal density. However, it can be shown that P(T > t) > P(Z > t), i.e. the t distribution has a “fatter” tail. Also, as k → ∞, t_k approaches a normal distribution. The k = 1 case is known as the Cauchy distribution, and has undefined mean and variance.
E. 12-65
Why would we define such a weird distribution? The typical application is to study random samples with unknown mean and unknown variance. Let X₁, · · · , Xₙ be iid N(µ, σ²). Then X̄ ∼ N(µ, σ²/n), so Z = √n(X̄ − µ)/σ ∼ N(0, 1). Also, S_XX/σ² ∼ χ²_{n−1} and is independent of X̄, and hence of Z. So

    √n(X̄ − µ)/σ / √(S_XX/((n − 1)σ²)) ∼ t_{n−1},   i.e.   √n(X̄ − µ)/√(S_XX/(n − 1)) ∼ t_{n−1}.

We write σ̃² = S_XX/(n − 1); this is an unbiased estimator of σ². A 100(1 − α)% confidence interval for µ is then found from

    1 − α = P( −t_{n−1}(α/2) ≤ √n(X̄ − µ)/σ̃ ≤ t_{n−1}(α/2) ).

This has endpoints X̄ ± (σ̃/√n) t_{n−1}(α/2).
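As a quick numerical illustration (not from the notes), the following Python sketch computes such a t-based confidence interval; the sample values and parameters below are invented.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(loc=5.0, scale=2.0, size=20)   # hypothetical sample, n = 20
    n = len(x)

    xbar = x.mean()
    sigma_tilde = np.sqrt(((x - xbar) ** 2).sum() / (n - 1))   # unbiased estimate of sigma
    alpha = 0.05
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)              # t_{n-1}(alpha/2)

    half_width = sigma_tilde / np.sqrt(n) * t_crit
    print("95% CI for mu:", (xbar - half_width, xbar + half_width))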




12.3 Linear models


Linear models can be used to explain or model the relationship between a response
(or dependent ) variable, and one or more explanatory variables (or covariates or
predictors ). We ask questions like how do motor insurance claim rates (response)

depend on the age and sex of the driver, and where they live (explanatory variables)?
As the name suggests, we assume the relationship is linear. In general we do not assume
normality (that the variables are normally distributed) in our calculations here, but
we will consider it in places.
Suppose we have p covariates xj , and we have n observations Yi . We assume n > p, or
else we can pick the parameters to fix our data exactly. We assume our n observations
(responses) are modelled as

Yi = β1 xi1 + · · · + βp xip + εi for i = 1, · · · , n (∗)

where
• β1 , · · · , βp are unknown, fixed parameters we wish to work out (with n > p)
• xi1 , · · · , xip are the values of the p covariates for the ith response (which are all
known).
• ε1 , · · · , εn are independent (or possibly just uncorrelated) random variables with
mean 0 and variance σ 2 . We assume homoscedasticity here, that is all these εi have
the same variance. The case where the error variances differ is called heteroscedasticity.
We think of the βj xij terms to be the causal effects of xij and εi to be a random
fluctuation (error term). Then we clearly have
1. E(Yᵢ) = β₁xᵢ₁ + · · · + βₚxᵢₚ.
2. var(Yi ) = var(εi ) = σ 2 .
3. Y1 , · · · , Yn are independent.
Note that (∗) is linear in the parameters β1 , · · · , βp . Obviously the real world can be
much more complicated. But this is much easier to work with. In terms of matrices:

    Y = (Y₁, · · · , Yₙ)ᵀ,   X = the n × p matrix with (i, j) entry xᵢⱼ,   β = (β₁, · · · , βₚ)ᵀ,   ε = (ε₁, · · · , εₙ)ᵀ.

Then the equation becomes Y = Xβ + ε. We also have E(ε) = 0 and cov(Y) = σ²I.


We assume throughout that X has full rank p, ie. the columns are independent, and
that the error variance is the same for each observation.
E. 12-66
<Oxygen/time example> For each of 24 males, the maximum volume of oxy-
gen uptake in the blood and the time taken to run 2 miles (in minutes) were
measured. We want to know how the time taken depends on oxygen uptake. We
might get the results

Oxygen 42.3 53.1 42.1 50.1 42.5 42.5 47.8 49.9


Time 918 805 892 962 968 907 770 743
Oxygen 36.2 49.7 41.5 46.2 48.2 43.2 51.8 53.3
Time 1045 810 927 813 858 860 760 747
Oxygen 53.3 47.2 56.9 47.8 48.7 53.7 60.6 56.7
Time 743 803 683 844 755 700 748 775

For each individual i, we let Yi be the time to run 2 miles, and xi be the maximum
volume of oxygen uptake, i = 1, · · · , 24. We might want to fit a straight line to it.
So a possible model is Yi = a + bxi + εi where εi are independent random variables
with variance σ², and a and b are constants. We have, in matrix form,

    Y = (Y₁, · · · , Y₂₄)ᵀ,   X = the 24 × 2 matrix with ith row (1, xᵢ),   β = (a, b)ᵀ,   ε = (ε₁, · · · , ε₂₄)ᵀ.

Then Y = Xβ + ε.
D. 12-67
In a linear model Y = Xβ + ε, the least squares estimator β̂ of β minimizes

    S(β) = ‖Y − Xβ‖² = (Y − Xβ)ᵀ(Y − Xβ) = Σᵢ₌₁ⁿ (Yᵢ − xᵢⱼβⱼ)²

with implicit summation over j.
E. 12-68
If we plot the points on a graph, then the least squares estimator minimizes the
(square of the) vertical distance between the points and the line.
P. 12-69
The least squares estimator satisfies the least squares equation X T X β̂ = X T Y.
Moreover X T X is invertible, so β̂ = (X T X)−1 X T Y. This has E(β̂) = β and
cov(β̂) = σ 2 (X T X)−1 .

To minimize S(β), we want

    ∂S/∂βₖ |_{β=β̂} = 0   for all k.

So −2xᵢₖ(Yᵢ − xᵢⱼβ̂ⱼ) = 0 for each k (with implicit summation over i and j), that is xᵢₖxᵢⱼβ̂ⱼ = xᵢₖYᵢ for all k. Putting this back in matrix form, we have the result. We could also have derived this by completing the square of (Y − Xβ)ᵀ(Y − Xβ), but that would be more complicated.
We assumed that X is of full rank p, so ‖Xt‖ ≠ 0 for all non-zero t. Hence

    tᵀXᵀXt = (Xt)ᵀ(Xt) = ‖Xt‖² > 0   for all t ≠ 0 in Rᵖ.

So XᵀX is positive definite, and hence has an inverse. So β̂ = (XᵀX)⁻¹XᵀY, which is linear in Y. We have

    E(β̂) = (XᵀX)⁻¹Xᵀ E[Y] = (XᵀX)⁻¹XᵀXβ = β.

So β̂ is an unbiased estimator for β. Also, since cov(Y) = σ²I, we have

    cov(β̂) = (XᵀX)⁻¹Xᵀ cov(Y) X(XᵀX)⁻¹ = σ²(XᵀX)⁻¹.
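To see these formulae in action, here is a small Python sketch (not part of the notes) that computes β̂ = (XᵀX)⁻¹XᵀY for made-up data and checks it against numpy's built-in least squares routine.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix, full rank p
    beta_true = np.array([2.0, -1.0, 0.5])                          # hypothetical parameters
    Y = X @ beta_true + rng.normal(scale=0.3, size=n)

    # Least squares estimator: solve (X^T X) beta_hat = X^T Y
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

    # Cross-check with numpy's least squares solver
    beta_np, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(beta_hat, beta_np)   # the two should agree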



T. 12-70
<Gauss Markov theorem> In a full rank linear model, let β̂ be the least
squares estimator of β and let β ∗ be any other unbiased estimator for β which is
linear in the Yi ’s. Then var(tT β̂) ≤ var(tT β ∗ ) for all t ∈ Rp .

Since β ∗ is linear in the Yi ’s, β ∗ = AY for some p × n matrix A. Since β ∗ is


an unbiased estimator, we must have E[β ∗ ] = β. However, we also have E[β ∗ ] =
A E[Y] = AXβ. So we must have β = AXβ. Since this holds for any β, we must
have AX = Ip . Now
cov(β ∗ ) = E[(β ∗ − β)(β ∗ − β)T ] = E[(AY − β)(AY − β)T ]
= E[(AXβ + Aε − β)(AXβ + Aε − β)T ]
Since AXβ = β, this is equal to
= E[Aε(Aε)T ] = A(σ 2 I)AT = σ 2 AAT .
Now let B = A − (X T X)−1 X T , then β ∗ − β̂ = (A − (X T X)−1 X T )Y = BY and
BX = AX − (X T X)−1 X T X = Ip − Ip = 0.
Hence
cov(β ∗ ) = σ 2 AAT = σ 2 (B + (X T X)−1 X T )(B + (X T X)−1 X T )T
= σ 2 (BB T + (X T X)−1 ) = σ 2 BB T + cov(β̂).
Note that in the second line, the cross-terms disappear since BX = 0. So for any
t ∈ Rp , we have
var(tT β ∗ ) = tT cov(β ∗ )t = tT cov(β̂)t + tT BB T tσ 2 = var(tT β̂) + σ 2 kB T tk2
≥ var(tT β̂).
Taking t = (0, · · · , 1, 0, · · · , 0)T with a 1 in the ith position, we see that var(βˆi ) ≤
var(βi∗ ).
Hence we say that β̂ is the best linear unbiased estimator (BLUE) of β .
D. 12-71
Let β̂ be the least squares estimator of β.
• The vector of fitted values is Ŷ = X β̂, this is what our model says Y should
be.
• The vector of residuals is R = Y− Ŷ, the deviations of our model from reality.
• The residual sum of squares is RSS = kRk2 = RT R = (Y − X β̂)T (Y − X β̂).
E. 12-72
We can give these a geometric interpretation. Note X T R = X T (Y − Ŷ) =
X T Y − X T X β̂ = 0 by our formula for β̂. So R is orthogonal to the column space
of X.
Write Ŷ = X β̂ = X(X T X)−1 X T Y = P Y, where P = X(X T X)−1 X T . Then P
represents an orthogonal projection of Rn onto the space spanned by the columns
of X (note that P 2 = P with P X = X and ker P = X ⊥ ), ie. it projects the actual
data Y to the fitted values Ŷ. It follows that R is the part of Y orthogonal to
the column space of X. The projection matrix P is idempotent and symmetric,
ie. P 2 = P and P T = P .

12.3.1 Linear models with normal assumptions


So far, we have not assumed anything about our variables. In particular, we have not
assumed that they are normal. So what further information can we obtain by assuming
normality?
P. 12-73
The linear model under normal assumptions (ie. εi have normal distribution so
that Y = Xβ + ε with ε ∼ Nn (0, σ 2 I)),
1. The MLE (maximum likelihood estimator) for β is β̂ = (XᵀX)⁻¹XᵀY, the same as the least squares estimator.
2. The MLE for σ² is σ̂² = RSS/n.

1. Since Y ∼ N_n(Xβ, σ²I), the log-likelihood is

    l(β, σ²) = −(n/2) log 2π − (n/2) log σ² − S(β)/(2σ²),

where S(β) = (Y − Xβ)ᵀ(Y − Xβ). If we want to maximize l with respect to β, we have to minimize the only term containing β, i.e. S(β).
2. To obtain the MLE for σ², we require

    ∂l/∂σ² |_{β̂,σ̂²} = 0,   that is   −n/(2σ̂²) + S(β̂)/(2σ̂⁴) = 0
    ⟹ σ̂² = S(β̂)/n = (1/n)(Y − Xβ̂)ᵀ(Y − Xβ̂) = RSS/n.

This isn’t coincidence! Historically, when Gauss devised the normal distribution,
he designed it so that the least squares estimator is the same as the maximum
likelihood estimator. Note also that the linear model under normal assumptions
is a special case of the linear model we just had, so all previous results hold.
L. 12-74
1. If Z ∼ Nn (0, σ 2 I) and A is n × n, symmetric, idempotent with rank r, then
ZT AZ ∼ σ 2 χ2r .
2. For a symmetric idempotent matrix A, rank(A) = tr(A).

1. Since A is idempotent, A² = A by definition. Since A is also symmetric, it is diagonalizable and has real eigenvalues. The eigenvalues of A are either 0 or 1 (since λx = Ax = A²x = λ²x). So there exists an orthogonal Q such that

    Λ = QᵀAQ = diag(λ₁, · · · , λₙ) = diag(1, · · · , 1, 0, · · · , 0)

with r copies of 1 and n − r copies of 0. Let W = QᵀZ, so Z = QW. Then W ∼ N_n(0, σ²I), since cov(W) = Qᵀσ²IQ = σ²I. Then

    ZᵀAZ = WᵀQᵀAQW = WᵀΛW = Σᵢ₌₁ʳ Wᵢ² ∼ σ²χ²_r.

2. rank(A) = rank(Λ) = tr(Λ) = tr(QT AQ) = tr(AQQT ) = tr A.


Recall that the matrix P = X(X T X)−1 X T that projects Y to Ŷ is idempotent
and symmetric, so we can apply this result. Our ultimate goal now is to show that
β̂ and σ̂ 2 are independent. Then we can apply our other standard results such as
the t-distribution.
T. 12-75
For the normal linear model Y ∼ Nn (Xβ, σ 2 I) with rank(X) = p,
1. β̂ ∼ N_p(β, σ²(XᵀX)⁻¹).
2. RSS ∼ σ²χ²_{n−p}, and so σ̂² ∼ (σ²/n) χ²_{n−p}.
3. β̂ and σ̂² are independent.

1. We have β̂ = (X T X)−1 X T Y. Call this CY for later use. Then β̂ has a normal
distribution with mean (X T X)−1 X T (Xβ) = β and covariance
(X T X)−1 X T (σ 2 I)[(X T X)−1 X T ]T = σ 2 (X T X)−1 .
So β̂ ∼ Np (β, σ 2 (X T X)−1 ).
2. Our previous lemma says that ZT AZ ∼ σ 2 χ2r . So we want to pick our Z and
A so that ZT AZ = RSS, and the degrees of freedom of A being r = n − p. Let
Z = Y − Xβ and A = (In − P ), where P = X(X T X)−1 X T .
We first check that the conditions of the lemma hold: Since Y ∼ Nn (Xβ, σ 2 I),
Z = Y − Xβ ∼ Nn (0, σ 2 I). Since P is idempotent, In − P also is. We also
have rank(In − P ) = tr(In − P ) = n − p. Therefore the conditions of the lemma
hold.
To get the final useful result, we want to show that the RSS is indeed ZT AZ.
We simplify the expressions of RSS and ZT AZ and show that they are equal:
ZT AZ = (Y − Xβ)T (In − P )(Y − Xβ) = YT (In − P )Y.
Noting the fact that (In − P )X = 0. Since R = Y − Ŷ = (In − P )Y, we have
RSS = RT R = YT (In − P )Y using the symmetry and idempotence of In − P .
Hence RSS = ZT AZ ∼ σ 2 χ2n−p . Therefore

    σ̂² = RSS/n ∼ (σ²/n) χ²_{n−p}.

3. Let V = (β̂; R) = DY, where D = (C; Iₙ − P) is a (p + n) × n matrix. Since Y is multivariate normal, so is V, with

    cov(V) = Dσ²IDᵀ = σ² ( CCᵀ  C(Iₙ − P)ᵀ ; (Iₙ − P)Cᵀ  (Iₙ − P)(Iₙ − P)ᵀ ) = σ² ( CCᵀ  0 ; 0  Iₙ − P ),

using C(Iₙ − P) = 0 (since (Iₙ − P)X = 0, so (XᵀX)⁻¹Xᵀ(Iₙ − P) = 0). Hence β̂ and R are independent, since the off-diagonal covariance terms are 0. So β̂ and RSS = RᵀR are independent, and hence β̂ and σ̂² are independent.

From (2), E(RSS) = σ²(n − p). So σ̃² = RSS/(n − p) is an unbiased estimator of σ².

D. 12-76
• σ̃² = RSS/(n − p) (an estimator of σ²) is called the residual standard error on n − p degrees of freedom.
• In the linear normal model, β̂ ∼ N_p(β, σ²(XᵀX)⁻¹). So β̂ⱼ ∼ N(βⱼ, σ²[(XᵀX)⁻¹]ⱼⱼ). The standard error of β̂ⱼ is defined to be

    SE(β̂ⱼ) = √( σ̃² [(XᵀX)⁻¹]ⱼⱼ )   where σ̃² = RSS/(n − p).

Alternative notation for SE includes s.e.


E. 12-77
<Inference for β> In the normal linear model β̂ ∼ N_p(β, σ²(XᵀX)⁻¹), the standard error SE(β̂ⱼ) plays the role of the standard deviation √(σ²[(XᵀX)⁻¹]ⱼⱼ). But unlike the actual standard deviation, the standard error is calculable from our data. Note that

    (β̂ⱼ − βⱼ)/SE(β̂ⱼ) = (β̂ⱼ − βⱼ)/√(σ̃²[(XᵀX)⁻¹]ⱼⱼ) = [ (β̂ⱼ − βⱼ)/√(σ²[(XᵀX)⁻¹]ⱼⱼ) ] / √( RSS/((n − p)σ²) ).

By writing it in this somewhat weird form, we now recognize both the numerator and denominator. The numerator is a standard normal N(0, 1), and the denominator is the square root of an independent χ²_{n−p}/(n − p), as we have previously shown. But a standard normal divided by the square root of an independent χ² divided by its degrees of freedom is, by definition, the t distribution. So

    (β̂ⱼ − βⱼ)/SE(β̂ⱼ) ∼ t_{n−p}.

So a 100(1 − α)% confidence interval for βⱼ has end points β̂ⱼ ± SE(β̂ⱼ) t_{n−p}(α/2). In particular, if we want to test H₀ : βⱼ = 0, we use the fact that under H₀,
β̂ⱼ/SE(β̂ⱼ) ∼ t_{n−p}.
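A hedged Python sketch of this computation (again with made-up data and a made-up design matrix):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, p = 40, 2
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])
    Y = X @ np.array([1.0, 0.7]) + rng.normal(scale=1.5, size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    RSS = float((Y - X @ beta_hat) @ (Y - X @ beta_hat))
    sigma_tilde2 = RSS / (n - p)

    se = np.sqrt(sigma_tilde2 * np.diag(XtX_inv))        # SE(beta_hat_j)
    t_crit = stats.t.ppf(0.975, df=n - p)
    for j in range(p):
        lo, hi = beta_hat[j] - se[j] * t_crit, beta_hat[j] + se[j] * t_crit
        print(f"beta_{j}: estimate {beta_hat[j]:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")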
E. 12-78
<Wafer example> Suppose we want to measure the resistivity of silicon wafers.
We have five instruments, and five wafers were measured by each instrument (so
we have 25 wafers in total). We assume that the silicon wafers are all the same, and want to see whether the instruments are consistent with each other. The results are as follows:

                           Wafer
                     1      2      3      4      5
    Instrument 1   130.5  112.4  118.9  125.7  134.0
               2   130.4  138.2  116.7  132.6  104.2
               3   113.0  120.5  128.9  103.4  118.1
               4   128.0  117.5  114.9  114.9   98.8
               5   121.2  110.5  118.5  100.5  120.9
Let Yi,j be the resistivity of the jth wafer measured by instrument i, where
i, j = 1, · · · , 5. A possible model is Yi,j = µi + εi,j where εij are independent

random variables such that E[εᵢⱼ] = 0 and var(εᵢⱼ) = σ², and the µᵢ's are unknown constants. This can be written in matrix form Y = Xβ + ε, with Y = (Y₁,₁, · · · , Y₁,₅, Y₂,₁, · · · , Y₅,₅)ᵀ, β = (µ₁, · · · , µ₅)ᵀ, ε the corresponding vector of the εᵢ,ⱼ, and X the 25 × 5 matrix whose row for observation (i, j) has a 1 in column i and 0 elsewhere, so that Y = Xβ + ε.

We have XᵀX = 5I₅, so (XᵀX)⁻¹ = (1/5)I₅. So we have

    µ̂ = (XᵀX)⁻¹XᵀY = (Ȳ₁, · · · , Ȳ₅)ᵀ.

The residual sum of squares is

    RSS = Σᵢ₌₁⁵ Σⱼ₌₁⁵ (Yᵢ,ⱼ − µ̂ᵢ)² = Σᵢ₌₁⁵ Σⱼ₌₁⁵ (Yᵢ,ⱼ − Ȳᵢ)² = 2170.

This RSS has n − p = 25 − 5 = 20 degrees of freedom. So σ̃ = √(RSS/(n − p)) = √(2170/20) = 10.4.

12.3.2 Simple linear regression


A simple linear regression is when we have a single explanatory variable (i.e. one independent variable). In this case we can plot our data in two-dimensional (x, y) axes, where y is the value of our dependent variable and x is the independent variable. The objective is to find a straight line y = a + bx that predicts the dependent variable values as a function of the independent variable.
If we have a simple linear regression model Yᵢ = a + bxᵢ + εᵢ, then we can reparameterise it to Yᵢ = a′ + b(xᵢ − x̄) + εᵢ, where x̄ = Σxᵢ/n and a′ = a + bx̄. Since Σ(xᵢ − x̄) = 0, this leads to simplified calculations. In matrix form,

    X = the n × 2 matrix with ith row (1, xᵢ − x̄)   ⟹   XᵀX = ( n  0 ; 0  Sxx )   where Sxx = Σ(xᵢ − x̄)²,

since Σ(xᵢ − x̄) = 0 and so in XᵀX the off-diagonals are all 0. Hence

    (XᵀX)⁻¹ = ( 1/n  0 ; 0  1/Sxx )   ⟹   β̂ = (XᵀX)⁻¹XᵀY = ( Ȳ ; SxY/Sxx ),

where SxY = Σ Yᵢ(xᵢ − x̄). Hence the estimated intercept is â′ = ȳ, and the estimated gradient is

    b̂ = Sxy/Sxx = Σᵢ yᵢ(xᵢ − x̄) / Σᵢ(xᵢ − x̄)² = [ Σᵢ(yᵢ − ȳ)(xᵢ − x̄) / √( Σᵢ(xᵢ − x̄)² Σᵢ(yᵢ − ȳ)² ) ] × √(Syy/Sxx) = r √(Syy/Sxx),

where the second-to-last equality holds since Σ ȳ(xᵢ − x̄) = 0, so subtracting it from the numerator changes nothing, and the square-root factors just multiply and divide by the same quantity. So the gradient is the Pearson product-moment correlation coefficient r times the ratio of the empirical standard deviations of the y's and x's (note that the gradient is the same whether or not the x's are centred to have mean 0). Also cov(β̂) = σ²(XᵀX)⁻¹, and so from our expression for (XᵀX)⁻¹,

    var(â′) = var(Ȳ) = σ²/n,   var(b̂) = σ²/Sxx.

Note that these estimators are uncorrelated. Note also that these results are obtained without any explicit distributional assumptions.
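As a sketch (not from the notes), these closed-form expressions can be computed directly; the data below are invented.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])          # hypothetical covariate values
    y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.3])          # hypothetical responses

    xbar, ybar = x.mean(), y.mean()
    Sxx = ((x - xbar) ** 2).sum()
    Sxy = ((x - xbar) * (y - ybar)).sum()
    Syy = ((y - ybar) ** 2).sum()

    a_prime_hat = ybar                    # intercept in the centred parameterisation
    b_hat = Sxy / Sxx                     # gradient
    r = Sxy / np.sqrt(Sxx * Syy)          # Pearson correlation; b_hat == r * sqrt(Syy/Sxx)
    print(a_prime_hat, b_hat, r, r * np.sqrt(Syy / Sxx))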
E. 12-79
Continuing our previous oxygen/time example, we have ȳ = 826.5, Sxx = 783.5 = 28.0², Sxy = −10077, Syy = 444², r = −0.81, b̂ = −12.9.

With normality
Suppose in our simple linear regression Yᵢ = a′ + b(xᵢ − x̄) + εᵢ, where x̄ = Σxᵢ/n, we have εᵢ iid N(0, σ²) for i = 1, · · · , n. Then

    â′ = Ȳ ∼ N(a′, σ²/n),   b̂ = SxY/Sxx ∼ N(b, σ²/Sxx),
    Ŷᵢ = â′ + b̂(xᵢ − x̄),   RSS = Σ(Yᵢ − Ŷᵢ)² ∼ σ²χ²_{n−2},

and (â′, b̂) and σ̂² = RSS/n are independent, as we have previously shown. Note that σ̂² is obtained by dividing RSS by n, and is the maximum likelihood estimator. On the other hand, σ̃² is obtained by dividing RSS by n − p, and is an unbiased estimator.
E. 12-80
Using the oxygen/time example, we have RSS = Σᵢ(yᵢ − ŷᵢ)² = Σᵢ(yᵢ − â′ − b̂(xᵢ − x̄))² = 67968. So the residual standard error squared is

    σ̃² = RSS/(n − p) = 67968/(24 − 2) = 3089 = 55.6²

on 22 degrees of freedom. So the standard error of b̂ is

    SE(b̂) = √( σ̃² [(XᵀX)⁻¹]₂₂ ) = √(3089/Sxx) = 55.6/28.0 = 1.99.

So a 95% interval for b has end points

    b̂ ± SE(b̂) × t_{n−p}(0.025) = −12.9 ± 1.99 × t₂₂(0.025) = (−17.0, −8.8),

using the fact that t₂₂(0.025) = 2.07. Note that this interval does not contain 0. So if we want to carry out a size 0.05 test of H₀ : b = 0 (uncorrelated) vs H₁ : b ≠ 0 (correlated), the test statistic would be b̂/SE(b̂) = −12.9/1.99 = −6.48. Then we reject H₀ because this is less than −t₂₂(0.025) = −2.07.

12.3.3 Expected response at x∗


After performing the linear regression, we can now make predictions from it. Suppose
that x∗ is a new vector of values for the explanatory variables. The expected response
at x∗ is E[Y | x∗ ] = x∗T β. We estimate this by x∗T β̂. Then we have

x∗T (β̂ − β) ∼ N (0, x∗T cov(β̂)x∗ ) = N (0, σ 2 x∗T (X T X)−1 x∗ ).

Let τ² = x∗ᵀ(XᵀX)⁻¹x∗. Then

    x∗ᵀ(β̂ − β)/(σ̃τ) ∼ t_{n−p}.

Then a confidence interval for the expected response x∗ᵀβ has end points

    x∗ᵀβ̂ ± σ̃τ t_{n−p}(α/2).

We have a confidence interval for x∗ᵀβ; what about one for Y∗ = x∗ᵀβ + ε∗? The predicted response at x∗ is Y∗ = x∗ᵀβ + ε∗, where ε∗ ∼ N(0, σ²), and Y∗ is independent of Y₁, · · · , Yₙ. Here we have more uncertainty in our prediction: both β and ε∗. A 100(1 − α)% prediction interval for Y∗ is an interval I(Y) such that P(Y∗ ∈ I(Y)) = 1 − α, where the probability is over the joint distribution of Y∗, Y₁, · · · , Yₙ. So I is a random function of the past data Y that outputs an interval.
First of all, as above, the predicted expected response is Ŷ∗ = x∗ᵀβ̂. This is an unbiased estimator of Y∗ since Ŷ∗ − Y∗ = x∗ᵀ(β̂ − β) − ε∗, and hence E[Ŷ∗ − Y∗] = x∗ᵀ(β − β) = 0. To find the variance, we use the fact that x∗ᵀ(β̂ − β) and ε∗ are independent, and the variance of a sum of independent variables is the sum of the variances. So

    var(Ŷ∗ − Y∗) = var(x∗ᵀ(β̂ − β)) + var(ε∗) = σ²x∗ᵀ(XᵀX)⁻¹x∗ + σ² = σ²(τ² + 1).

We can see this as the uncertainty in the regression line, σ²τ², plus the wobble about the regression line, σ². So Ŷ∗ − Y∗ ∼ N(0, σ²(τ² + 1)). We therefore find that

    (Ŷ∗ − Y∗)/(σ̃√(τ² + 1)) ∼ t_{n−p}.

So the interval with endpoints

    x∗ᵀβ̂ ± σ̃√(τ² + 1) t_{n−p}(α/2)

is a 100(1 − α)% prediction interval for Y∗. We don't call this a confidence interval: confidence intervals are about finding parameters of the distribution, while the prediction interval is about our predictions.
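A hedged sketch of both intervals in Python (made-up data; x_star is a hypothetical new covariate vector):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, p = 30, 2
    X = np.column_stack([np.ones(n), rng.uniform(0, 5, size=n)])
    Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    RSS = float((Y - X @ beta_hat) @ (Y - X @ beta_hat))
    sigma_tilde = np.sqrt(RSS / (n - p))
    t_crit = stats.t.ppf(0.975, df=n - p)

    x_star = np.array([1.0, 2.5])                  # hypothetical new point
    tau = np.sqrt(x_star @ XtX_inv @ x_star)
    centre = x_star @ beta_hat

    ci = (centre - sigma_tilde * tau * t_crit, centre + sigma_tilde * tau * t_crit)
    pi = (centre - sigma_tilde * np.sqrt(tau**2 + 1) * t_crit,
          centre + sigma_tilde * np.sqrt(tau**2 + 1) * t_crit)
    print("confidence interval for expected response:", ci)
    print("prediction interval for a new response:  ", pi)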

E. 12-81
Previous example continued: suppose we wish to estimate the time to run 2 miles for a man with an oxygen take-up measurement of 50. Here x∗ᵀ = (1, 50 − x̄), where x̄ = 48.6. The estimated expected response at x∗ is

    x∗ᵀβ̂ = â′ + (50 − 48.6) × b̂ = 826.5 − 1.4 × 12.9 = 808.5,

which is obtained by plugging x∗ into our fitted line. We find

    τ² = x∗ᵀ(XᵀX)⁻¹x∗ = 1/n + (50 − x̄)²/Sxx = 1/24 + 1.4²/783.5 = 0.044 = 0.21².

So a 95% confidence interval for E[Y | x∗] is

    x∗ᵀβ̂ ± σ̃τ t_{n−p}(α/2) = 808.5 ± 55.6 × 0.21 × 2.07 = (783.6, 832.2).

Note that this is the confidence interval for the predicted expected value, NOT an interval for the actual observed value. A 95% prediction interval for Y∗ at x∗ᵀ = (1, 50 − x̄) is

    x∗ᵀβ̂ ± σ̃√(τ² + 1) t_{n−p}(α/2) = 808.5 ± 55.6 × 1.02 × 2.07 = (691.1, 925.8).

Note that this is much wider than the interval for the expected response! This is because there are three sources of uncertainty: we don't know what σ is, what b is, and there is the random ε∗ fluctuation.
E. 12-82
Wafer example continued: suppose we wish to estimate the expected resistivity of a new wafer in the first instrument. Here x∗ᵀ = (1, 0, · · · , 0) (recall that x is an indicator vector that marks which instrument is used). The estimated response at x∗ is x∗ᵀµ̂ = µ̂₁ = ȳ₁ = 124.3. We find τ² = x∗ᵀ(XᵀX)⁻¹x∗ = 1/5. So a 95% confidence interval for E[Y₁∗] is

    x∗ᵀµ̂ ± σ̃τ t_{n−p}(α/2) = 124.3 ± (10.4/√5) × 2.09 = (114.6, 134.0).

Note that we are using an estimate of σ obtained from all five instruments. If we had only used the data from the first instrument, σ would be estimated as

    σ̃₁ = √( Σⱼ₌₁⁵ (y₁,ⱼ − ȳ₁)² / (5 − 1) ) = 8.74.

The observed 95% confidence interval for µ₁ would have been

    ȳ₁ ± (σ̃₁/√5) t₄(α/2) = 124.3 ± 3.91 × 2.78 = (113.5, 135.1),

which is slightly wider. Usually it is much wider, but in this special case we only get a little difference, since the data from the first instrument are relatively tighter than the others.
A 95% prediction interval for Y₁∗ at x∗ᵀ = (1, 0, · · · , 0) is

    x∗ᵀµ̂ ± σ̃√(τ² + 1) t_{n−p}(α/2) = 124.3 ± 10.4 × 1.1 × 2.09 = (100.5, 148.1).

12.3.4 Hypothesis testing


In hypothesis testing, we want to know whether certain variables influence the result.
If, say, the variable x1 does not influence Y , then we must have β1 = 0. So the goal is
to test the hypothesis H0 : β1 = 0 versus H1 : β1 6= 0. We will tackle a more general
case, where β can be split into two vectors β0 and β1 , and we test if β1 is zero.
L. 12-83
Suppose Z ∼ Nn (0, σ 2 In ), and A1 and A2 are symmetric, idempotent n × n
matrices with A1 A2 = 0 (ie. they are orthogonal). Then ZT A1 Z and ZT A2 Z are
independent.

Let Xᵢ = AᵢZ for i = 1, 2, and

    W = (W₁; W₂) = (A₁; A₂) Z,   so that   W ∼ N_{2n}( 0, σ² (A₁  0 ; 0  A₂) ),

since the off-diagonal blocks are σ²A₁ᵀA₂ = σ²A₁A₂ = 0. So W₁ and W₂ are independent, which implies that

    W₁ᵀW₁ = ZᵀA₁ᵀA₁Z = ZᵀA₁A₁Z = ZᵀA₁Z   and   W₂ᵀW₂ = ZᵀA₂ᵀA₂Z = ZᵀA₂A₂Z = ZᵀA₂Z

are independent.
This is geometrically intuitive, because A1 and A2 being orthogonal means they
are concerned about different parts of the vector Z.

Now we go to hypothesis testing in general linear models. Suppose

    X = ( X₀  X₁ )   and   β = ( β₀ ; β₁ ),

where X is n × p, X₀ is n × p₀, X₁ is n × (p − p₀), rank(X) = p and rank(X₀) = p₀. We want to test H₀ : β₁ = 0 against H₁ : β₁ ≠ 0. Under H₀, X₁β₁ vanishes and Y = X₀β₀ + ε. Also the MLEs of β₀ and σ² are

    β̂̂₀ = (X₀ᵀX₀)⁻¹X₀ᵀY,   σ̂̂² = RSS₀/n = (1/n)(Y − X₀β̂̂₀)ᵀ(Y − X₀β̂̂₀),

and we have previously shown these are independent. Note that our estimators wear two hats instead of one: we adopt the convention that the estimators under the null hypothesis have two hats, while those under the alternative hypothesis have one. The fitted values under H₀ are

    Ŷ̂ = X₀(X₀ᵀX₀)⁻¹X₀ᵀY = P₀Y   where P₀ = X₀(X₀ᵀX₀)⁻¹X₀ᵀ.
The generalized likelihood ratio test of H₀ against H₁ uses

    Λ_Y(H₀, H₁) = [ (1/√(2πσ̂²))ⁿ exp( −(Y − Xβ̂)ᵀ(Y − Xβ̂)/(2σ̂²) ) ] / [ (1/√(2πσ̂̂²))ⁿ exp( −(Y − X₀β̂̂₀)ᵀ(Y − X₀β̂̂₀)/(2σ̂̂²) ) ]
                = (σ̂̂²/σ̂²)^{n/2} = (RSS₀/RSS)^{n/2} = (1 + (RSS₀ − RSS)/RSS)^{n/2}.

We reject H₀ when 2 log Λ is large, equivalently when (RSS₀ − RSS)/RSS is large. Under H₀ we have

    2 log Λ = n log(1 + (RSS₀ − RSS)/RSS),

which is approximately a χ²_{p−p₀} random variable.[T.12-42] This is a good approximation, but we can get an exact null distribution, and hence an exact test. We have previously shown that RSS = Yᵀ(Iₙ − P)Y, and so

    RSS₀ − RSS = Yᵀ(Iₙ − P₀)Y − Yᵀ(Iₙ − P)Y = Yᵀ(P − P₀)Y.

Now note that the column space of X₀ is a subspace of that of X, so Im P₀ is a subspace of the column space of X, hence PP₀ = P₀. Now we see that both Iₙ − P and P − P₀ are symmetric and idempotent, and therefore

    rank(Iₙ − P) = n − p,
    rank(P − P₀) = tr(P − P₀) = tr(P) − tr(P₀) = rank(P) − rank(P₀) = p − p₀,
    (Iₙ − P)(P − P₀) = (Iₙ − P)P − (Iₙ − P)P₀ = (P − P²) − (P₀ − PP₀) = 0.

Finally, since (Iₙ − P)X₀ = (P − P₀)X₀ = 0,

    Yᵀ(Iₙ − P)Y = (Y − X₀β₀)ᵀ(Iₙ − P)(Y − X₀β₀),
    Yᵀ(P − P₀)Y = (Y − X₀β₀)ᵀ(P − P₀)(Y − X₀β₀).

Let Z = Y − X₀β₀, A₁ = Iₙ − P and A₂ = P − P₀. Applying the previous lemma and the fact that ZᵀAᵢZ ∼ σ²χ²_r (with r the rank of Aᵢ), we have, under H₀,

    RSS = Yᵀ(Iₙ − P)Y ∼ σ²χ²_{n−p},
    RSS₀ − RSS = Yᵀ(P − P₀)Y ∼ σ²χ²_{p−p₀},

and these random variables are independent. So under H₀,

    F = [ Yᵀ(P − P₀)Y/(p − p₀) ] / [ Yᵀ(Iₙ − P)Y/(n − p) ] = [ (RSS₀ − RSS)/(p − p₀) ] / [ RSS/(n − p) ] ∼ F_{p−p₀, n−p}.

Hence we reject H₀ if F > F_{p−p₀, n−p}(α). Here RSS₀ − RSS is the reduction in the sum of squares due to fitting β₁ in addition to β₀.

    Source of variation   d.f.     sum of squares   mean squares            F statistic
    Fitted model          p − p₀   RSS₀ − RSS       (RSS₀ − RSS)/(p − p₀)   F = [(RSS₀ − RSS)/(p − p₀)] / [RSS/(n − p)]
    Residual              n − p    RSS              RSS/(n − p)
    Total                 n − p₀   RSS₀

The ratio (RSS₀ − RSS)/RSS₀ is sometimes known as the proportion of variance explained by β₁, and denoted R².
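A brief Python sketch of this F-test given the two residual sums of squares (the numbers below are placeholders):

    from scipy import stats

    n, p, p0 = 30, 4, 2          # hypothetical sample size and model dimensions
    RSS, RSS0 = 41.2, 58.7       # hypothetical residual sums of squares under H1 and H0

    F = ((RSS0 - RSS) / (p - p0)) / (RSS / (n - p))
    crit = stats.f.ppf(0.95, dfn=p - p0, dfd=n - p)       # F_{p-p0, n-p}(0.05)
    p_value = stats.f.sf(F, dfn=p - p0, dfd=n - p)
    print(F, crit, p_value)      # reject H0 at the 5% level if F > crit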

E. 12-84
<Simple linear regression> We assume that Yᵢ = a′ + b(xᵢ − x̄) + εᵢ, where x̄ = Σxᵢ/n and the εᵢ are iid N(0, σ²). Suppose we want to test the hypothesis H₀ : b = 0, i.e. no linear relationship. We have previously seen how to construct a confidence interval, and so we could simply see if it includes 0. Alternatively, under H₀, the model is Yᵢ ∼ N(a′, σ²), and so â′ = Ȳ, and the fitted values are Ŷᵢ = Ȳ. The observed RSS₀ is therefore

    RSS₀ = Σᵢ(yᵢ − ȳ)² = Syy.

The fitted sum of squares is therefore

    RSS₀ − RSS = Σᵢ(yᵢ − ȳ)² − Σᵢ(yᵢ − ȳ − b̂(xᵢ − x̄))² = b̂² Σᵢ(xᵢ − x̄)² = b̂² Sxx.

    Source of variation   d.f.    sum of squares        mean squares   F statistic
    Fitted model          1       RSS₀ − RSS = b̂²Sxx    b̂²Sxx          F = b̂²Sxx/σ̃²
    Residual              n − 2   RSS = Σᵢ(yᵢ − ŷᵢ)²    σ̃²
    Total                 n − 1   RSS₀ = Σᵢ(yᵢ − ȳ)²

Note that the proportion of variance explained by the fitted model is

    b̂²Sxx/Syy = S²ₓy/(Sxx Syy) = r²,   where r = Sxy/√(Sxx Syy)

is Pearson's product-moment correlation coefficient. We have previously seen that under H₀ we have b̂/SE(b̂) ∼ t_{n−2}, where SE(b̂) = σ̃/√Sxx. So we let

    t = b̂/SE(b̂) = b̂ √Sxx / σ̃.

Checking whether |t| > t_{n−2}(α/2) is precisely the same as checking whether t² = F > F_{1,n−2}(α), since an F_{1,n−2} variable is t²_{n−2}. Hence the same conclusion is reached, regardless of whether we use the t-distribution or the F statistic derived from an analysis of variance table.
E. 12-85
<One way analysis of variance with equal numbers in each group> Recall that in our wafer example, we made measurements in groups, and want to know if there is a difference between groups. In general, suppose J measurements are taken in each of I groups, and that Yᵢⱼ = µᵢ + εᵢⱼ, where the εᵢⱼ are independent N(0, σ²) random variables and the µᵢ are unknown constants. Fitting this model gives

    RSS = Σᵢ₌₁ᴵ Σⱼ₌₁ᴶ (Yᵢⱼ − µ̂ᵢ)² = Σᵢ₌₁ᴵ Σⱼ₌₁ᴶ (Yᵢⱼ − Ȳᵢ.)²

on n − I degrees of freedom (where n = IJ). Suppose we want to test the hypothesis H₀ : µᵢ = µ for all i, i.e. no difference between groups. Under H₀, the model is Yᵢⱼ ∼ N(µ, σ²), so µ̂ = Ȳ.. and the fitted values are Ŷᵢⱼ = Ȳ... The observed RSS₀ is therefore

    RSS₀ = Σᵢ,ⱼ (yᵢⱼ − ȳ..)².

The fitted sum of squares is therefore

    RSS₀ − RSS = Σᵢ Σⱼ (yᵢⱼ − ȳ..)² − Σᵢ Σⱼ (yᵢⱼ − ȳᵢ.)² = J Σᵢ (ȳᵢ. − ȳ..)².

    Source of variation   d.f.    sum of squares        mean squares                F statistic
    Fitted model          I − 1   J Σᵢ(ȳᵢ. − ȳ..)²      J Σᵢ(ȳᵢ. − ȳ..)²/(I − 1)    F = J Σᵢ(ȳᵢ. − ȳ..)²/((I − 1)σ̃²)
    Residual              n − I   Σᵢ Σⱼ(yᵢⱼ − ȳᵢ.)²     σ̃²
    Total                 n − 1   Σᵢ Σⱼ(yᵢⱼ − ȳ..)²

12.3.5 Examples
Suppose we have two independent samples X₁, · · · , Xₘ iid N(µ_X, σ²) and Y₁, · · · , Yₙ iid N(µ_Y, σ²), with σ² unknown. We wish to test H₀ : µ_X = µ_Y = µ against H₁ : µ_X ≠ µ_Y. Using the generalised likelihood ratio test, L_{x,y}(H₀) = sup_{µ,σ²} f_X(x | µ, σ²) f_Y(y | µ, σ²). Under H₀ the MLEs are

    µ̂ = (mx̄ + nȳ)/(m + n),
    σ̂₀² = (1/(m + n)) [ Σ(xᵢ − µ̂)² + Σ(yᵢ − µ̂)² ] = (1/(m + n)) [ Sxx + Syy + (mn/(m + n))(x̄ − ȳ)² ].

So

    L_{x,y}(H₀) = (2πσ̂₀²)^{−(m+n)/2} exp( −(1/(2σ̂₀²)) [ Σ(xᵢ − µ̂)² + Σ(yᵢ − µ̂)² ] ) = (2πσ̂₀²)^{−(m+n)/2} e^{−(m+n)/2}.

Similarly

    L_{x,y}(H₁) = sup_{µ_X,µ_Y,σ²} f_X(x | µ_X, σ²) f_Y(y | µ_Y, σ²) = (2πσ̂₁²)^{−(m+n)/2} e^{−(m+n)/2},

achieved by µ̂_X = x̄, µ̂_Y = ȳ and σ̂₁² = (Sxx + Syy)/(m + n). Hence

    Λ_{x,y}(H₀, H₁) = (σ̂₀²/σ̂₁²)^{(m+n)/2} = ( 1 + mn(x̄ − ȳ)²/((m + n)(Sxx + Syy)) )^{(m+n)/2}.

We reject H₀ if mn(x̄ − ȳ)²/((Sxx + Syy)(m + n)) is large, or equivalently if

    |t| = |x̄ − ȳ| / √( ((Sxx + Syy)/(n + m − 2)) (1/m + 1/n) )

is large. Under H₀, X̄ ∼ N(µ, σ²/m) and Ȳ ∼ N(µ, σ²/n), and so

    (X̄ − Ȳ) / (σ √(1/m + 1/n)) ∼ N(0, 1).

By [T.12-61], we know S_XX/σ² ∼ χ²_{m−1} independently of X̄, and S_YY/σ² ∼ χ²_{n−1} independently of Ȳ. Hence (S_XX + S_YY)/σ² ∼ χ²_{n+m−2} from additivity of independent χ² distributions. Since our two random samples are independent, X̄ − Ȳ and S_XX + S_YY are independent. This means that under H₀,

    (X̄ − Ȳ) / √( ((S_XX + S_YY)/(n + m − 2)) (1/m + 1/n) ) ∼ t_{n+m−2}.

A size α test is to reject H₀ if |t| > t_{n+m−2}(α/2). The analysis of variance gives

    Source of variation   d.f.        sum of squares         mean squares                    F statistic
    Fitted model          1           (mn/(m + n))(x̄ − ȳ)²   (mn/(m + n))(x̄ − ȳ)²            F = (mn/(m + n))(x̄ − ȳ)²/σ̃²
    Residual              m + n − 2   Sxx + Syy              σ̃² = (Sxx + Syy)/(m + n − 2)
    Total                 m + n − 1

Seeing if F > F_{1, m+n−2}(α) is exactly the same as checking if |t| > t_{n+m−2}(α/2).
Suppose now we are not observing two iid samples; instead we have X₁, · · · , Xₙ all different but independent, and they correspond to Y₁, · · · , Yₙ respectively. More precisely, we have Xᵢ ∼ N(µ_X + γᵢ, σ²) and Yᵢ ∼ N(µ_Y + γᵢ, σ²) for i = 1, · · · , n, all independent, where the parameters γᵢ are such that Σᵢ γᵢ = 0. So observations are made in pairs Xᵢ, Yᵢ, each i corresponding to one pair, and each pair is slightly different.
Working through the generalised likelihood ratio test, or expressing the model in matrix form, leads to the intuitive conclusion that we should work with the differences Dᵢ = Xᵢ − Yᵢ (i = 1, · · · , n), so that Dᵢ ∼ N(µ_X − µ_Y, φ²) where φ² = 2σ². Thus D̄ ∼ N(µ_X − µ_Y, φ²/n), and we test H₀ : µ_X − µ_Y = 0 by the t statistic

    t = D̄/(φ̃/√n)   where   φ̃² = S_DD/(n − 1) = Σᵢ(Dᵢ − D̄)²/(n − 1),

and t ∼ t_{n−1} under H₀.
E. 12-86
Seeds of a particular variety of plant were randomly assigned either to a nutrition-
ally rich environment (the treatment) or to the standard conditions (the control).
After a predetermined period, all plants were harvested, dried and weighed, with
weights as shown below in grams.
Control 4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14
Treatment 4.81 4.17 4.41 3.59 5.87 3.83 6.03 4.89 4.32 4.69
Is there a difference between the mean weights due to the environmental condi-
tions?

Control observations are realisations of X₁, · · · , X₁₀ iid N(µ_X, σ²), and for the treatment we have Y₁, · · · , Y₁₀ iid N(µ_Y, σ²). We test H₀ : µ_X = µ_Y vs H₁ : µ_X ≠ µ_Y. Here m = n = 10, x̄ = 5.032, Sxx = 3.060, ȳ = 4.661 and Syy = 5.669, so σ̃² = (Sxx + Syy)/(m + n − 2) = 0.485. Then

    |t| = |x̄ − ȳ| / √( σ̃² (1/m + 1/n) ) = 1.19.

From tables t18 (0.025) = 2.101, so we do not reject H0 . We conclude that there is
no evidence for a difference between the mean weights due to the environmental
conditions.
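As a hedged cross-check, scipy's pooled two-sample t-test should reproduce this calculation for the data above (equal_var=True gives the equal-variance test derived in this section):

    import numpy as np
    from scipy import stats

    control   = np.array([4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14])
    treatment = np.array([4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69])

    result = stats.ttest_ind(control, treatment, equal_var=True)
    print(result.statistic, result.pvalue)   # |t| should be about 1.19, p-value well above 0.05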
E. 12-87
Suppose we have 10 different species of plants. We sample a pair of seeds from
each species of plants; one seed from each pair is assigned to a nutritionally
rich environment (the treatment) and the other to the standard conditions (the
control).
Pair 1 2 3 4 5 6 7 8 9 10
Control 4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14
Treatment 4.81 4.17 4.41 3.59 5.87 3.83 6.03 4.89 4.32 4.69
Difference -0.64 1.41 0.77 2.52 -1.37 0.78 -0.86 -0.36 1.01 0.45
Does the treatment have any effect?

Observed statistics are d̄ = 0.37, S_dd = 12.54 and n = 10, so that φ̃ = √(S_dd/(n − 1)) = √(12.54/9) = 1.18. Thus

    t = d̄/(φ̃/√n) = 0.37/(1.18/√10) = 0.99.

This can be compared to t₉(0.025) = 2.262 to show that we cannot reject H₀ :
E[D] = 0, i.e. that there is no effect of the treatment. Alternatively, we see that
the observed p-value is the probability of getting such an extreme result under H0 ,
i.e.
P(|t9 | > |t| | H0 ) = 2P(t9 > |t|) = 2 × 0.17 = 0.34.
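A hedged sketch of the same paired analysis with scipy, which is equivalent to a one-sample t-test on the differences:

    import numpy as np
    from scipy import stats

    control   = np.array([4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14])
    treatment = np.array([4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69])

    result = stats.ttest_rel(control, treatment)
    print(result.statistic, result.pvalue)   # t about 0.99, p about 0.34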

12.4 Rules of thumb


We’ll give some rules of thumb!
E. 12-88
If there have been n opportunities for an event to occur, and yet it has not occurred
yet, then we can be 95% confident that the chance of it occurring at the next
opportunity is less than 3/n.

Let p be the chance of it occurring at each opportunity. Assume independent Bernoulli trials, so essentially we have X ∼ Binom(n, p), we have observed X = 0, and want a one-sided 95% confidence interval for p. Base this on the set of values that cannot be rejected at the 5% level in a one-sided test, i.e. the 95% interval is (0, p₀) where the one-sided p-value for p₀ is 0.05, so 0.05 = P(X = 0 | p₀) = (1 − p₀)ⁿ. Hence, since log(0.05) = −2.9957, we have

    p₀ = 1 − e^{log(0.05)/n} ≈ −log(0.05)/n ≈ 3/n.
For example, suppose we have given a drug to 100 people and none of them
have had a serious adverse reaction. Then we can be 95% confident that the
chance the next person has a serious reaction is less than 3%. The exact p0 is
1 − elog(0.05)/100 = 0.0295.
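A tiny sketch comparing the exact p₀ to the 3/n approximation for a few hypothetical values of n:

    import numpy as np

    for n in (20, 100, 1000):                     # hypothetical numbers of opportunities
        p0_exact = 1 - np.exp(np.log(0.05) / n)   # solves (1 - p0)^n = 0.05
        print(n, p0_exact, 3 / n)                 # 3/n slightly overstates the exact bound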

E. 12-89
After n observations, if the number of events differs from that expected under a null hypothesis H₀ by more than √n, reject H₀.

We assume X ∼ Binom(n, p) and H₀ : p = p₀, so the expected number of events is E[X | H₀] = np₀. Then the probability that the difference between observed and expected exceeds √n, given H₀ is true, is

    P(|X − np₀| > √n | H₀) = P( |X − np₀|/√(np₀(1 − p₀)) > 1/√(p₀(1 − p₀)) | H₀ )
                           < P( |X − np₀|/√(np₀(1 − p₀)) > 2 | H₀ ) ≈ P(|Z| > 2) ≈ 0.05,

where the inequality follows from 1/√(p₀(1 − p₀)) > 2 and Z ∼ N(0, 1).
For example, suppose we flip a coin 1000 times and it comes up heads 550 times; do we think the coin is odd? We expect 500 heads, and observe 50 more. √n = √1000 ≈ 32, which is less than 50, so this suggests the coin is odd. The 2-sided p-value is actually 2 × P(X ≥ 550) = 2 × (1 − P(X ≤ 549)) = 0.0017, where X ∼ Binom(1000, 0.5).
This rule of thumb is fine for chances around 0.5, but is too lenient for rarer events,
in which case the following can be used.
E. 12-90
After n observations, if the number of rare events differs from that expected under a null hypothesis H₀ by more than √(4 × expected), reject H₀.

We assume X ∼ Binom(n, p) and H₀ : p = p₀, so the expected number of events is E[X | H₀] = np₀. Under H₀, X has standard deviation √(np₀(1 − p₀)), so the critical difference is approximately 2 standard deviations, i.e. √(4np₀(1 − p₀)), which is at most √n: this is the rule of root n we saw above. But √(4np₀(1 − p₀)) < √(4np₀), which will be less than √n if p₀ < 0.25. So for smaller p₀, a more powerful rule is to reject H₀ if the difference between observed and expected is greater than √(4 × expected). This is essentially a Poisson approximation.
For example, suppose we throw a die 120 times and it comes up 'six' 30 times; is this 'significant'? We expect 20 sixes, and so the difference between observed and expected is 10. Since √n = √120 ≈ 11, which is more than 10, the 'rule of root n' does not suggest a significant difference. But since √(4 × expected) = √80 ≈ 9, the second rule does suggest significance. The 2-sided p-value is actually 2 × P(X ≥ 30) = 2 × (1 − P(X ≤ 29)) = 0.026, where X ∼ Binom(120, 1/6).
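A sketch of the exact p-value calculations quoted above, using scipy's binomial distribution:

    from scipy import stats

    # Coin: 550 heads in 1000 flips under p = 0.5
    p_coin = 2 * (1 - stats.binom.cdf(549, 1000, 0.5))
    # Die: 30 sixes in 120 throws under p = 1/6
    p_die = 2 * (1 - stats.binom.cdf(29, 120, 1 / 6))
    print(p_coin, p_die)   # roughly 0.0017 and 0.026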
E. 12-91
Suppose we have 95% confidence intervals for µ1 and µ2 based on independent
estimates ȳ1 and ȳ2 . Let H0 : µ1 = µ2 .
1. If the confidence intervals do not overlap, then we can reject H0 at p < 0.05.
2. If the confidence intervals do overlap, then this does not necessarily imply that
we cannot reject H0 at p < 0.05.

Assume for simplicity that the confidence intervals are based on assuming Ȳ₁ ∼ N(µ₁, s₁²) and Ȳ₂ ∼ N(µ₂, s₂²), where s₁ and s₂ are known standard errors. Suppose wlog that ȳ₁ > ȳ₂. Then since Ȳ₁ − Ȳ₂ ∼ N(µ₁ − µ₂, s₁² + s₂²), we can reject H₀ at α = 0.05 if ȳ₁ − ȳ₂ > 1.96 √(s₁² + s₂²). The two CIs will not overlap if ȳ₁ − 1.96s₁ > ȳ₂ + 1.96s₂, i.e. ȳ₁ − ȳ₂ > 1.96(s₁ + s₂). But since s₁ + s₂ > √(s₁² + s₂²) for positive s₁, s₂, we have the first part of this 'rule of thumb'. Non-overlapping CIs is a more stringent criterion: we cannot conclude 'not significantly different' just because CIs overlap.
So if 95% CIs just touch, what is the p-value? Suppose s₁ = s₂ = s. Then the CIs just touch if |ȳ₁ − ȳ₂| = 1.96 × 2s = 3.92s. So the p-value is

    P(|Ȳ₁ − Ȳ₂| > 3.92s) = P( |Ȳ₁ − Ȳ₂|/(√2 s) > 3.92/√2 ) = P(|Z| > 2.77) = 2P(Z > 2.77) = 0.0055,

where Z ∼ N(0, 1). And if 'just not touching' 100(1 − α)% CIs were to be equivalent to 'just rejecting H₀', then we would need to set α so that the critical difference in ȳ₁ − ȳ₂ was exactly the width of each of the CIs, and so 1.96√2 s = 2s Φ⁻¹(1 − α/2), where √2 s is the standard deviation of Ȳ₁ − Ȳ₂ with s₁ = s₂ = s. This means α = 2Φ(−1.96/√2) = 0.16. So in these specific circumstances, we would need to use 84% intervals in order to make non-overlapping CIs the same as rejecting H₀ at the 5% level.
CHAPTER 13
Numerical Analysis
Numerical analysis is the study of algorithms. There are many problems we would
like algorithms to solve. In general, there are two things we are concerned with —
accuracy and speed. We want our programs to be able to solve problem quickly and
accurately. Sometimes we are also concerned about stability, the sensitivity of the
solution of a given problem to small changes in the data or the given parameters of
the problem.

13.1 Polynomial interpolation


D. 13-1
• Write Pn [x] for the real linear vector space of polynomials (with real coefficients)
having degree n or less. For simplicity, here we will only deal with real polynomials.
• The interpolation problem is the following: we are given n + 1 distinct interpolation points {xᵢ}ⁿᵢ₌₀ ⊆ R and n + 1 data values {fᵢ}ⁿᵢ₌₀ ⊆ R. The objective is to find a p ∈ Pₙ[x] such that p(xᵢ) = fᵢ for i = 0, · · · , n. In other words, we want to fit a polynomial of degree ≤ n through the n + 1 points (xᵢ, fᵢ). Such a function is called a polynomial interpolant.
• The Lagrange cardinal polynomials with respect to the interpolation points {xᵢ}ⁿᵢ₌₀ are, for k = 0, · · · , n,

    ℓₖ(x) = ∏_{i=0, i≠k}ⁿ (x − xᵢ)/(xₖ − xᵢ).

E. 13-2
It is easy to show that dim(Pn [x]) = n + 1. Writing down a polynomial of degree n
involves only n+1 numbers. They are easy to evaluate, integrate and differentiate.
So it would be nice if we can approximate things with polynomials.
There are many situations where the interpolation problem may come up. For
example, we may be given n+1 actual data points, and we want to fit a polynomial
through the points. Alternatively, we might have a complicated function f , and
want to approximate it with a polynomial p such that p and f agree on at least
that n + 1 points.
The naive way of looking at this is that we try a polynomial

    p(x) = aₙxⁿ + aₙ₋₁xⁿ⁻¹ + · · · + a₀,

and then solve the system of equations

    fᵢ = p(xᵢ) = aₙxᵢⁿ + aₙ₋₁xᵢⁿ⁻¹ + · · · + a₀.

This is a perfectly respectable system of n+1 equations in n+1 unknowns. Solving


an (n + 1) × (n + 1) linear system requires O(n3 ) operations in general, but we
want to do better. Moreover, from linear algebra, we know that in general, such


a system is not guaranteed to have a solution, and if the solution exists, it is not
guaranteed to be unique. That was not helpful. So our first goal is to show that
in the case of polynomial interpolation, the solution exists and is unique.

Note that the Lagrange cardinal polynomials have degree exactly n. The signif-
icance of these polynomials is we have `k (xi ) = 0 for i 6= k, and `k (xk ) = 1. In
other words, we have `k (xj ) = δjk . This is obvious from definition. With these
cardinal polynomials, we immediately write down a solution to the interpolation
problem.

T. 13-3
The interpolation problem has exactly one solution.
We define p ∈ Pₙ[x] by p(x) = Σₖ₌₀ⁿ fₖℓₖ(x). Evaluating at xⱼ gives p(xⱼ) = Σₖ₌₀ⁿ fₖℓₖ(xⱼ) = Σₖ₌₀ⁿ fₖδⱼₖ = fⱼ. So we get existence.
For uniqueness, suppose p, q ∈ Pₙ[x] are solutions. Then the difference r = p − q ∈ Pₙ[x] satisfies r(xⱼ) = 0 for all j, i.e. it has n + 1 roots. However, a non-zero polynomial of degree n can have at most n roots. So in fact p − q is zero, i.e. p = q.
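A small Python sketch of this construction (the points below are invented); it evaluates the Lagrange form p(x) = Σₖ fₖ ℓₖ(x) directly.

    import numpy as np

    def lagrange_eval(xs, fs, x):
        """Evaluate the interpolating polynomial through (xs[i], fs[i]) at x."""
        total = 0.0
        for k in range(len(xs)):
            ell_k = 1.0
            for i in range(len(xs)):
                if i != k:
                    ell_k *= (x - xs[i]) / (xs[k] - xs[i])   # cardinal polynomial ell_k(x)
            total += fs[k] * ell_k
        return total

    xs = [0.0, 1.0, 2.0, 3.0]          # hypothetical interpolation points
    fs = [1.0, 2.0, 0.0, 5.0]          # hypothetical data values
    print([lagrange_eval(xs, fs, x) for x in xs])   # reproduces fs exactly
    print(lagrange_eval(xs, fs, 1.5))               # value at a new point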

If we define the nodal polynomial ω(x) = ∏ᵢ₌₀ⁿ (x − xᵢ), then in the expression for ℓₖ the numerator is simply ω(x)/(x − xₖ), while the denominator is equal to ω′(xₖ) (all but one term of the product vanish when we differentiate and substitute xₖ). With that, we arrive at the compact Lagrange form

    p(x) = Σₖ₌₀ⁿ fₖℓₖ(x) = Σₖ₌₀ⁿ [fₖ/ω′(xₖ)] ω(x)/(x − xₖ).

While this works, the Lagrange forms are not ideal for numerical evaluation, both
because of speed of calculation and because of the accumulation of rounding error.
Moreover if we one day decide we should add one more interpolation point, we
would have to recompute all the cardinal polynomials, and that is not fun. Ideally,
we would like some way to reuse our previous computations when we have new
interpolation points, this lead us to the Newton form.

D. 13-4
• Write f [xj , · · · , xk ] as the leading coefficient of the unique q ∈ Pk−j [x] such that
q(xi ) = fi for i = j, · · · , k. This is called the Newton divided difference of degree
(or order ) k.

• Write Cⁿ[a, b] for the set of all n-times differentiable functions [a, b] → R with continuous nth derivative.

• Suppose that our data values are derived from a particular function, i.e. fi = f (xi )
for i = 0, · · · , n for some smooth f . Write en (x) = f (x) − pn (x), i.e. the error of
the approximation.

• The Chebyshev polynomial of degree n on [−1, 1] is defined by Tn (x) = cos(nθ)


where x = cos θ with θ ∈ [0, π].

C. 13-5
<The Newton formula> The idea of Newton’s formula is as follows — for
k = 0, · · · , n, we write pk ∈ Pk [x] for the polynomial that satisfies

pk (xi ) = fi for i = 0, · · · , k.

This is the unique degree-k polynomial that satisfies the first k + 1 conditions,
whose existence (and uniqueness) is guaranteed by the previous section. Then we
can write

p(x) = pn (x) = p0 (x) + (p1 (x) − p0 (x)) + · · · + (pn (x) − pn−1 (x)).

Hence we are done if we have an efficient way of finding the differences pₖ − pₖ₋₁. We know that pₖ and pₖ₋₁ agree on x₀, · · · , xₖ₋₁. So pₖ − pₖ₋₁ evaluates to 0 at those points, and we must have

    pₖ(x) − pₖ₋₁(x) = Aₖ ∏ᵢ₌₀^{k−1} (x − xᵢ),

for some Aₖ yet to be found. Then we can write

    p(x) = pₙ(x) = A₀ + Σₖ₌₁ⁿ Aₖ ∏ᵢ₌₀^{k−1} (x − xᵢ).

This formula has the advantage that it is built up gradually from the interpolation
points one-by-one. If we stop the sum at any point, we have obtained the polyno-
mial that interpolates the data for the first k points (for some k). Conversely, if we
have a new data point, we just need to add a new term, instead of re-computing
everything.
All that remains is to find the coefficients Ak . For k = 0, we know A0 is the unique
constant polynomial that interpolates the point at x0 , ie. A0 = f0 . For the others,
we note that in the formula for pk − pk−1 , we find that Ak is the leading coefficient
of xk . But pk−1 (x) has no degree k term. So Ak must be the leading coefficient of
pk . That is Ak = f [x0 , · · · , xk ].
So we have reduced our problem to finding the leading coefficients of pk . The algo-
rithm to obtain the solution to this is known as the Newton divided differences .
While we do not have an explicit formula for what these coefficients f [x0 , · · · , xk ]
are, it turns out if we consider a larger set of coefficient f [xj , · · · , xk ], we can come
up with a recurrence relation for these coefficients.

T. 13-6
<Recurrence relation for Newton divided differences> For 0 ≤ j < k ≤ n,
we have

    f[xⱼ, · · · , xₖ] = ( f[xⱼ₊₁, · · · , xₖ] − f[xⱼ, · · · , xₖ₋₁] ) / (xₖ − xⱼ).

The key to proving this is to relate the interpolating polynomials. Let q₀, q₁ ∈ P_{k−j−1}[x] and q₂ ∈ P_{k−j}[x] satisfy

q0 (xi ) = fi i = j, · · · , k − 1
q1 (xi ) = fi i = j + 1, · · · , k
q2 (xi ) = fi i = j, · · · , k

We now claim that

    q₂(x) = [(x − xⱼ)/(xₖ − xⱼ)] q₁(x) + [(xₖ − x)/(xₖ − xⱼ)] q₀(x).

We can check directly that the expression on the right correctly interpolates the
points xi for i = j, · · · , k. By uniqueness, the two expressions agree. Since
f [xj , · · · , xk ], f [xj+1 , · · · , xk ] and f [xj , · · · , xk−1 ] are the leading coefficients of
q2 , q1 , q0 respectively, the result follows.

Using this result, the Newton divided difference table can be constructed:

    xᵢ    fᵢ        f[∗,∗]       f[∗,∗,∗]        · · ·    f[∗, · · · , ∗]
    x₀    f[x₀]
                    f[x₀,x₁]
    x₁    f[x₁]                  f[x₀,x₁,x₂]
                    f[x₁,x₂]                     · · ·    f[x₀,x₁, · · · ,xₙ]
    x₂    f[x₂]                  f[x₁,x₂,x₃]
                    f[x₂,x₃]
    x₃    f[x₃]
    ⋮     ⋮
    xₙ    f[xₙ]

From the first n columns, we can find the n + 1th column using the recurrence
relation above. The values of Ak can then be found at the top diagonal, and this
is all we really need. However, to compute this diagonal, we will need to compute
everything in the table. The whole table can be evaluated in O(n2 ) operations.

In practice, we often need not find the actual interpolating polynomial. If we just want to evaluate p(x̂) at some new point x̂ using the divided difference table, we can simply use p(x̂) = A₀ + Σₖ₌₁ⁿ Aₖ ∏ᵢ₌₀^{k−1} (x̂ − xᵢ). This is described by Horner's scheme, given by

    S <- f[x0,...,xn]
    for k = n-1,...,0
        S <- (x̂ - xk) S + f[x0,...,xk]
    end
This only takes O(n) operations. If an extra data point {xn+1 , fn+1 } is added,
then we only have to compute an extra diagonal f [xk , · · · , xn+1 ] for k = n, · · · , 0
in the divided difference table to obtain the new coefficient, and the old results
can be reused. This requires O(n) operations. This is less straightforward for
Lagrange’s method.
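A sketch in Python of the divided-difference table and the Horner-style evaluation described above (array indices play the role of the subscripts; the data are invented):

    import numpy as np

    def divided_differences(xs, fs):
        """Return [f[x0], f[x0,x1], ..., f[x0,...,xn]] (the top diagonal of the table)."""
        a = np.array(fs, dtype=float)
        n = len(xs)
        for k in range(1, n):                        # column k of the table
            # overwrite in place: a[i] becomes f[x_{i-k}, ..., x_i]
            for i in range(n - 1, k - 1, -1):
                a[i] = (a[i] - a[i - 1]) / (xs[i] - xs[i - k])
        return a

    def newton_eval(xs, coeffs, x_hat):
        """Evaluate the Newton form at x_hat by Horner's scheme."""
        S = coeffs[-1]
        for k in range(len(xs) - 2, -1, -1):
            S = (x_hat - xs[k]) * S + coeffs[k]
        return S

    xs = [0.0, 1.0, 2.0, 3.0]
    fs = [1.0, 2.0, 0.0, 5.0]
    A = divided_differences(xs, fs)
    print([newton_eval(xs, A, x) for x in xs])   # reproduces fs
    print(newton_eval(xs, A, 1.5))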

L. 13-7
If g ∈ Cᵐ[a, b] is zero at m + ℓ distinct points, then g⁽ᵐ⁾ has at least ℓ distinct zeros in [a, b].

This is a repeated application of Rolle’s theorem. We know that between every


two zeros of g, there is at least one zero of g 0 ∈ C m−1 [a, b]. So by differentiating
once, we have lost at most one zero. So after differentiating m times, g⁽ᵐ⁾ has lost at most m zeros. So it still has at least ℓ zeros.
T. 13-8
Let {xᵢ}ⁿᵢ₌₀ ⊆ [a, b] and f ∈ Cⁿ[a, b]. Then there exists some ξ ∈ (a, b) such that

    f[x₀, · · · , xₙ] = (1/n!) f⁽ⁿ⁾(ξ).

Consider e = f − pₙ ∈ Cⁿ[a, b]. This has at least n + 1 distinct zeros in [a, b]. So by the lemma, e⁽ⁿ⁾ = f⁽ⁿ⁾ − pₙ⁽ⁿ⁾ must vanish at some ξ ∈ (a, b). But pₙ⁽ⁿ⁾ = n! f[x₀, · · · , xₙ] constantly. So the result follows.
A method of estimating a derivative, say f (n) (ξ) where ξ is given, is to let the
distinct points {xi }n i=0 be suitably close to ξ, and to make the approximation
f (n) (ξ) ≈ n!f [x0 , x1 , ..., xn ]. However, a drawback is that, although one achieves
good accuracy in theory by picking such close interpolation points, if f is smooth
and if the precision of the arithmetic is finite, significant loss of accuracy may
occur due to cancellation of the leading digits of the function values.
T. 13-9
Assume {xᵢ}ⁿᵢ₌₀ ⊆ [a, b] and f ∈ C[a, b]. Let x̄ ∈ [a, b] be a non-interpolation point. Then

    eₙ(x̄) ≡ f(x̄) − pₙ(x̄) = f[x₀, x₁, · · · , xₙ, x̄] ω(x̄)   where ω(x) = ∏ᵢ₌₀ⁿ (x − xᵢ).

We think of x̄ = xₙ₊₁ as a new interpolation point, so that

    pₙ₊₁(x) − pₙ(x) = f[x₀, · · · , xₙ, x̄] ω(x)

for all x ∈ R. In particular, putting x = x̄, we have pₙ₊₁(x̄) = f(x̄), and we get the result.
Note that we forbid the case where x̄ is an interpolation point, since it is not clear
what the expression f [x0 , x1 , · · · , xn , x̄] means. However, if x̄ is an interpolation
point, then both en (x̄) and ω(x̄) are zero, so there isn’t much to say. This results
says that the error e = f − pn of the approximation is “like the next term in the
Newton’s formula”.
T. 13-10
Given f ∈ Cⁿ⁺¹[a, b] and distinct interpolation points {xᵢ}ⁿᵢ₌₀ ⊆ [a, b], let pₙ ∈ Pₙ[x] be the unique solution of the polynomial interpolation problem for data values {f(xᵢ)}ⁿᵢ₌₀. Then for each x ∈ [a, b], we can find ξₓ ∈ (a, b) such that

    eₙ(x) ≡ f(x) − pₙ(x) = (1/(n + 1)!) f⁽ⁿ⁺¹⁾(ξₓ) ω(x).   (∗)

In particular, |eₙ(x)| ≤ (1/(n + 1)!) ‖f⁽ⁿ⁺¹⁾‖∞ |ω(x)| for all x ∈ [a, b].

(Proof 1) (∗) is trivial if x is an interpolation point — pick arbitrary ξx , and both


sides are zero. Otherwise, this follows directly from the last two theorems. The
final part follows by definition of the max norm kgk∞ = maxt∈[a,b] |g(t)|.
(Proof 2) (∗) is trivially true if x is an interpolation point. Let x ∈ [a, b] be any other point (which we fix) and define

    φ(t) = (f(t) − p(t)) ∏ᵢ₌₀ⁿ (x − xᵢ) − (f(x) − p(x)) ∏ᵢ₌₀ⁿ (t − xᵢ),   t ∈ [a, b].

Next, note that φ(xⱼ) = 0 for j = 0, 1, · · · , n, and φ(x) = 0. Hence φ has at least n + 2 distinct zeros in [a, b]. Moreover, φ ∈ Cⁿ⁺¹[a, b]. We deduce that φ′ has at least n + 1 distinct zeros in (a, b), that φ″ vanishes at n points in (a, b), etc. We conclude that φ⁽ˢ⁾ vanishes at n + 2 − s distinct points of (a, b) for s = 0, 1, · · · , n + 1. Letting s = n + 1, we have φ⁽ⁿ⁺¹⁾(ξₓ) = 0 for some ξₓ ∈ (a, b), and hence

    0 = φ⁽ⁿ⁺¹⁾(ξₓ) = (f⁽ⁿ⁺¹⁾(ξₓ) − p⁽ⁿ⁺¹⁾(ξₓ)) ∏ᵢ₌₀ⁿ (x − xᵢ) − (f(x) − p(x))(n + 1)!.

The final error bound follows by definition of the max norm.


Assuming our function f is fixed, this error bound depends only on ω(x), which depends on our choice of interpolation points. So can we minimize ω(x) in some sense by picking clever interpolation points ∆ = {xᵢ}ⁿᵢ₌₀? Here n is fixed, so instead we put ∆ as the subscript. We can write our bound as

    ‖f − p_∆‖∞ ≤ (1/(n + 1)!) ‖f⁽ⁿ⁺¹⁾‖∞ ‖ω_∆‖∞.

So the objective is to find a ∆ that minimizes ‖ω_∆‖∞.


For the moment, we focus on the special case where the interval is [−1, 1]. The gen-
eral solution can be obtained by an easy change of variable x = ½(b + a) + ½(b − a)t
which maps t ∈ [−1, 1] to x ∈ [a, b]. For some magical reasons that hopefully will
become clear soon, the optimal choice of ∆ comes from the Chebyshev polynomials.
C. 13-11
<Chebyshev polynomial> The Chebyshev polynomial is in fact a polynomial,
since from trigonometric identities, we know cos(nθ) can be expanded as a poly-
nomial in cos θ up to degree n. Two key properties of Tn on [−1, 1] are
1. The maximum absolute value is obtained at Xk = cos(πk/n) for k = 0, · · · , n
with Tn (Xk ) = (−1)k .
2. It has n distinct zeros at xₖ = cos((2k − 1)π/(2n)) for k = 1, · · · , n.

[Plot: T₄(x) on [−1, 1], oscillating between −1 and 1.] All that really matters about the Chebyshev polynomials is that the maximum absolute value is attained at n + 1 distinct points with alternating sign; the exact form of the polynomial is not really important. Notice there is an intentional clash between the use of xₖ for the zeros and xₖ for the interpolation points: we will show these are indeed the optimal interpolation points.

L. 13-12
<3-term recurrence relation> The Chebyshev polynomials satisfy the recur-
rence relations Tn+1 (x) = 2xTn (x) − Tn−1 (x) with initial condition T0 (x) = 1,
T1 (x) = x.

cos((n + 1)θ) + cos((n − 1)θ) = 2 cos θ cos(nθ).


This recurrence relation can be useful for many things, but for our purposes, we
only use it to show that the leading coefficient of Tₙ is 2ⁿ⁻¹ (for n ≥ 1), which follows by induction.
T. 13-13
<Minimal property (for n ≥ 1)> On [−1, 1], among all polynomials p ∈ Pₙ[x] with leading coefficient 1, p = (1/2ⁿ⁻¹)Tₙ minimizes ‖p‖∞. In particular, the minimum value of ‖p‖∞ is 1/2ⁿ⁻¹.

We proceed by contradiction. Suppose there is a polynomial qₙ ∈ Pₙ[x] with leading coefficient 1 such that ‖qₙ‖∞ < 1/2ⁿ⁻¹. Define a new polynomial

    r = (1/2ⁿ⁻¹)Tₙ − qₙ.

This is, by assumption, non-zero. Since both polynomials have leading coefficient 1, the difference must have degree at most n − 1, i.e. r ∈ Pₙ₋₁[x]. Since (1/2ⁿ⁻¹)Tₙ(Xₖ) = ±1/2ⁿ⁻¹ and |qₙ(Xₖ)| < 1/2ⁿ⁻¹ by assumption, r alternates in sign between these n + 1 points Xₖ. But then by the intermediate value theorem, r has at least n zeros. This is a contradiction, since r is non-zero and has degree at most n − 1.
Consider ω_∆ = ∏ᵢ₌₀ⁿ (x − xᵢ) ∈ Pₙ₊₁[x] for any distinct points ∆ = {xᵢ}ⁿᵢ₌₀ ⊆ [−1, 1]. Then by this result min_∆ ‖ω_∆‖∞ = 1/2ⁿ. This minimum is achieved by picking the interpolation points to be the zeros of Tₙ₊₁, so that ω_∆ = (1/2ⁿ)Tₙ₊₁. For a general interval [a, b],

    [(b − a)ⁿ/2²ⁿ⁻¹] Tₙ( (2x − (b + a))/(b − a) )

is the element of Pₙ[x] with leading coefficient 1 which has minimal ∞-norm over [a, b]. Moreover we also have

    min_∆ ‖ω_∆‖∞ = (b − a)ⁿ⁺¹/2²ⁿ⁺¹,

and this is achieved by the zeros of Tₙ₊₁ under the linear mapping [−1, 1] → [a, b], i.e. xₖ = ½(b + a) + ½(b − a) cos((2k + 1)π/(2n + 2)) for k = 0, · · · , n.

T. 13-14
For f ∈ Cⁿ⁺¹[−1, 1], the Chebyshev choice of interpolation points gives

    ‖f − pₙ‖∞ ≤ (1/2ⁿ) (1/(n + 1)!) ‖f⁽ⁿ⁺¹⁾‖∞.

This follows from the last three theorems.
For the general interval [a, b] this result becomes

    ‖f − pₙ‖∞ ≤ [(b − a)ⁿ⁺¹/2²ⁿ⁺¹] (1/(n + 1)!) ‖f⁽ⁿ⁺¹⁾‖∞,

where the transformed zeros of Tₙ₊₁ are used as interpolation points.
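A small sketch of how these Chebyshev interpolation points could be generated for a general interval (the interval and n below are arbitrary):

    import numpy as np

    def chebyshev_nodes(a, b, n):
        """Zeros of T_{n+1} mapped from [-1, 1] to [a, b]; n+1 points in total."""
        k = np.arange(n + 1)
        t = np.cos((2 * k + 1) * np.pi / (2 * n + 2))    # zeros of T_{n+1} in [-1, 1]
        return 0.5 * (b + a) + 0.5 * (b - a) * t

    print(chebyshev_nodes(0.0, 2.0, 4))   # 5 interpolation points in [0, 2]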
E. 13-15
Suppose f has as many continuous derivatives as we want. Then as we increase
n, what happens to the error bounds? The coefficients involve dividing by an
exponential and a factorial. Hence as long as the higher derivatives of f don’t
blow up too badly, in general, the error will tend to zero as n → ∞, which makes
sense.

13.2 Orthogonal polynomials


D. 13-16
• Given a vector space V and an inner product h · , · i, two vectors f, g ∈ V are
orthogonal if hf, gi = 0.
• Given a vector space V of polynomials and inner product h · , · i, we say pn ∈ Pn [x]
is the nth orthogonal polynomial if hpn , qi = 0 for all q ∈ Pn−1 [x]. In particular,
hpn , pm i = 0 for n 6= m.
E. 13-17
<Scalar products>
1. Let V = C s [a, b], where [a, b] is a finite interval and s ≥ 0. Pick a weight
function w(x) ∈ C(a, b) such that w(x) > 0 for all x ∈ (a, b), and w is integrable
over [a, b]. In particular, we allow w to vanish at the end points, or blow up
mildly such that it is still integrable. We can then define the inner product to
be

    ⟨f, g⟩ = ∫ₐᵇ w(x) f(x) g(x) dx.

2. We can allow [a, b] to be infinite, e.g. [0, ∞) or even (−∞, ∞), but we have to be more careful. We first define

    ⟨f, g⟩ = ∫ₐᵇ w(x) f(x) g(x) dx

as before, but we now need more conditions. We require that ∫ₐᵇ w(x) xⁿ dx exists for all n ≥ 0, since we want to allow polynomials in our vector space. For example, w(x) = e⁻ˣ on [0, ∞) works, as does w(x) = e^{−x²} on (−∞, ∞). These are scalar products for Pₙ[x] for n ≥ 0, but we cannot extend this definition

to all smooth functions since they might blow up too fast at infinity. We will
not go into the technical details, since we are only interested in polynomials,
and knowing it works for polynomials suffices.
3. We can also have a discrete inner product, defined by

    ⟨f, g⟩ = Σⱼ₌₁ᵐ wⱼ f(ξⱼ) g(ξⱼ)

with {ξⱼ}ᵐⱼ₌₁ distinct points and {wⱼ}ᵐⱼ₌₁ > 0. Now we have to restrict ourselves a lot. This is a scalar product for V = P_{m−1}[x], but not for higher
degrees, since a scalar product should satisfy hf, f i > 0 for f 6= 0. In particu-
lar, we cannot extend this to all smooth functions.
T. 13-18
Given a vector space V of functions containing ∪ₙ Pₙ[x] and an inner product ⟨ · , · ⟩, there exists a unique monic orthogonal polynomial pₙ of each degree n ≥ 0. In addition, {pₖ}ⁿₖ₌₀ form a basis for Pₙ[x].

This is a big induction proof over both parts of the theorem. We induct over n. For
the base case, we pick p0 (x) = 1, which is the only degree-zero monic polynomial.
Suppose we already have \{p_k\}_{k=0}^{n} satisfying the induction hypothesis.

1. Now pick any monic qn+1 ∈ Pn+1 [x], eg. xn+1 . We now construct pn+1 from
qn+1 by the Gram-Schmidt process. We define
    p_{n+1} = q_{n+1} − \sum_{k=0}^{n} \frac{\langle q_{n+1}, p_k\rangle}{\langle p_k, p_k\rangle}\, p_k.

This is again monic since qn+1 is, and we have hpn+1 , pm i = 0 for all m ≤ n,
and hence hpn+1 , pi = 0 for all p ∈ Pn [x] = h{p0 , · · · , pn }i.
2. To obtain uniqueness, assume p_{n+1}, p̂_{n+1} ∈ P_{n+1}[x] are both monic or-
thogonal polynomials. Then r = pn+1 − p̂n+1 ∈ Pn [x]. Now

hr, ri = hr, pn+1 − p̂n+1 i = hr, pn+1 i − hr, p̂n+1 i = 0 − 0 = 0.

So r = 0, hence p_{n+1} = p̂_{n+1}.


3. Finally, we have to show that p0 , · · · , pn+1 form a basis for Pn+1 [x]. Now
note that every p ∈ Pn+1 [x] can be written uniquely as p = cpn+1 + q where
q ∈ P_n[x]. But \{p_k\}_{k=0}^{n} is a basis for P_n[x]. So q can be uniquely decomposed
as a linear combination of p0 , · · · , pn .
Alternatively, this follows from the fact that any set of orthogonal vectors must
be linearly independent, and since there are n + 2 of these vectors and Pn+1 [x]
has dimension n + 2, they must be a basis.
In practice, following the proof naively is not the best way of producing the new
pn+1 . Instead, we can reduce a lot of our work by making a clever choice of qn+1 .
T. 13-19
<Three-term recurrence relation> Let h · , · i be an inner product on the
space of polynomials such that x is self adjoint. Then monic orthogonal polynomials are generated by
    p_{k+1}(x) = (x − α_k)\,p_k(x) − β_k\,p_{k−1}(x) \quad where \quad α_k = \frac{\langle x p_k, p_k\rangle}{\langle p_k, p_k\rangle}, \quad β_k = \frac{\langle p_k, p_k\rangle}{\langle p_{k−1}, p_{k−1}\rangle},

with initial conditions p0 = 1, p1 (x) = (x − α0 )p0 .

By inspection, the p_1 given is monic and satisfies \langle p_1, p_0\rangle = 0. Using q_{n+1} = x p_n
in the Gram-Schmidt process gives
    p_{n+1} = x p_n − \sum_{k=0}^{n} \frac{\langle x p_n, p_k\rangle}{\langle p_k, p_k\rangle}\, p_k = x p_n − \sum_{k=0}^{n} \frac{\langle p_n, x p_k\rangle}{\langle p_k, p_k\rangle}\, p_k.
We notice that \langle p_n, x p_k\rangle vanishes whenever x p_k has degree less than n. So we
are left with
    p_{n+1} = x p_n − \frac{\langle x p_n, p_n\rangle}{\langle p_n, p_n\rangle}\, p_n − \frac{\langle p_n, x p_{n−1}\rangle}{\langle p_{n−1}, p_{n−1}\rangle}\, p_{n−1} = (x − α_n)\,p_n − \frac{\langle p_n, x p_{n−1}\rangle}{\langle p_{n−1}, p_{n−1}\rangle}\, p_{n−1}.
Now we notice that xpn−1 is a monic polynomial of degree n so we can write this as
xpn−1 = pn + q. Thus hpn , xpn−1 i = hpn , pn + qi = hpn , pn i. Hence the coefficient
of pn−1 is indeed the β we defined.
Note that x being self adjoint, i.e. \langle xf, g\rangle = \langle f, xg\rangle, is not necessarily true for
arbitrary inner products, but it does hold for most sensible inner products we will meet in this
course. In particular, it is clearly true for inner products of the form
\langle f, g\rangle = \int w(x)\,f(x)\,g(x)\,dx.
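The recurrence is easy to run numerically. The sketch below (my own illustration, not part of the notes; the helper names and the grid-based inner product are my assumptions) generates the monic orthogonal polynomials for a given weight via the three-term recurrence; with w = 1 on [−1, 1] these are the monic Legendre polynomials, e.g. p_2 = x^2 − 1/3.

```python
# Three-term recurrence for monic orthogonal polynomials (illustrative sketch).
import numpy as np

def inner(p, q, w=lambda x: 1.0, a=-1.0, b=1.0, m=2001):
    """<p, q> = int_a^b w(x) p(x) q(x) dx, approximated by the composite trapezoid rule."""
    x = np.linspace(a, b, m)
    vals = w(x) * np.polyval(p, x) * np.polyval(q, x)
    return float(np.sum(0.5 * (vals[1:] + vals[:-1])) * (b - a) / (m - 1))

def monic_orthogonal(n, **kw):
    """Return [p_0, ..., p_n] as numpy coefficient arrays (highest power first)."""
    x_poly = np.array([1.0, 0.0])                       # the polynomial "x"
    p = [np.array([1.0])]                               # p_0 = 1
    alpha0 = inner(x_poly, p[0], **kw) / inner(p[0], p[0], **kw)
    p.append(np.array([1.0, -alpha0]))                  # p_1 = x - alpha_0
    for k in range(1, n):
        xpk = np.polymul(x_poly, p[k])                  # x * p_k
        alpha = inner(xpk, p[k], **kw) / inner(p[k], p[k], **kw)
        beta = inner(p[k], p[k], **kw) / inner(p[k - 1], p[k - 1], **kw)
        p.append(np.polysub(np.polysub(xpk, alpha * p[k]), beta * p[k - 1]))
    return p[: n + 1]

print(monic_orthogonal(2)[2])    # approximately [1, 0, -0.3333], i.e. x^2 - 1/3
```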
E. 13-20
Legendre polynomials, Chebyshev polynomials, Laguerre polynomials and Hermite
polynomials are all examples of orthogonal polynomials that can be generated by
this recurrence relation. Chebyshev is based on the scalar product defined by
    \langle f, g\rangle = \int_{-1}^{1} \frac{1}{\sqrt{1 − x^2}}\, f(x)\,g(x)\,dx.
Note that the weight function blows up mildly at the end, but this is fine since it
is still integrable. This links up with
Tn (x) = cos(nθ)
for x = cos θ via the usual trigonometric substitution. We have
    \langle T_n, T_m\rangle = \int_0^{\pi} \frac{1}{\sqrt{1 − \cos^2 θ}}\,\cos(nθ)\cos(mθ)\sin θ\,dθ = \int_0^{\pi} \cos(nθ)\cos(mθ)\,dθ = 0 \quad if m ≠ n.

The other orthogonal polynomials all come from scalar products of the form
\langle f, g\rangle = \int_a^b w(x)\,f(x)\,g(x)\,dx, as described in the table below:

Type        Notation   Range       Weight          Recurrence
Legendre    P_n        [−1, 1]     1               (n + 1)P_{n+1}(x) = (2n + 1)x P_n(x) − n P_{n−1}(x)
Chebyshev   T_n        [−1, 1]     1/√(1 − x²)     T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x)
Laguerre    L_n        [0, ∞)      e^{−x}          (n + 1)L_{n+1}(x) = (2n + 1 − x)L_n(x) − n L_{n−1}(x)
Hermite     H_n        (−∞, ∞)     e^{−x²}         H_{n+1}(x) = 2x H_n(x) − 2n H_{n−1}(x)

C. 13-21
<Least-squares polynomial approximation> If we want to approximate a
function with a polynomial, polynomial interpolation might not be the best idea,
since all we do is make sure the polynomial agrees with f at certain points, but
it might not be a good approximation elsewhere. Instead, we want to choose a
polynomial p in Pn [x] that “minimizes the error”.
What exactly do we mean by minimizing the error? The error is defined as the
function f − p. So given an appropriate inner product on the vector space of
continuous functions, we want to minimize kf − pk2 = hf − p, f − pi. This is
usually of the form
    \langle f − p, f − p\rangle = \int_a^b w(x)\,[f(x) − p(x)]^2\,dx,
but we can also use inner products such as \langle f − p, f − p\rangle = \sum_{j=1}^{m} w_j\,(f(ξ_j) − p(ξ_j))^2.
Unlike polynomial interpolation, there is no guarantee that the approximation
agrees with the function anywhere. Unlike polynomial interpolation, there is some
guarantee that the total error is small (or as small as we can get, by definition).
In fact, if f is continuous, then the Weierstrass approximation theorem tells us
the total error must eventually vanish as we approximate it with polynomials of
higher and higher degrees.
Unsurprisingly, the solution involves the use of the orthogonal polynomials with
respect to the corresponding inner products.

T. 13-22
Let f be a given function and \{p_k\}_{k=0}^{n} orthogonal polynomials with respect to
\langle\,\cdot\,,\,\cdot\,\rangle. Then the p ∈ P_n[x] that minimises \|f − p\|^2 is given by
    p = \sum_{k=0}^{n} c_k p_k \quad where \quad c_k = \frac{\langle f, p_k\rangle}{\|p_k\|^2},
and the formula for the error is
    \|f − p\|^2 = \|f\|^2 − \sum_{k=0}^{n} \frac{\langle f, p_k\rangle^2}{\|p_k\|^2}.

We consider a general polynomial p = \sum_{k=0}^{n} c_k p_k ∈ P_n[x]. We substitute this in
to obtain
    \langle f − p, f − p\rangle = \langle f, f\rangle − 2\sum_{k=0}^{n} c_k \langle f, p_k\rangle + \sum_{k=0}^{n} c_k^2\,\|p_k\|^2.

Note that there are no cross terms between the different coefficients. We minimize
this quadratic by setting the partial derivatives to zero:
    0 = \frac{\partial}{\partial c_k}\langle f − p, f − p\rangle = −2\langle f, p_k\rangle + 2 c_k\,\|p_k\|^2.

To check this is indeed a minimum, note that the Hessian matrix is simply 2I,
which is positive definite. So this is really a minimum. So we get the formula for
the ck ’s as claimed, and putting the formula for ck gives the error formula.
Note the following:

1. The solution decouples, in the sense that ck depends only on f and pk . If we


want to take one more term, we just need to compute an extra term, and not
redo our previous work.
2. Our constructed p ∈ Pn [x] has a nice property: for k ≤ n, we have
    \langle f − p, p_k\rangle = \langle f, p_k\rangle − \langle p, p_k\rangle = \langle f, p_k\rangle − \frac{\langle f, p_k\rangle}{\|p_k\|^2}\langle p_k, p_k\rangle = 0.

Thus for all q ∈ Pn [x], we have hf − p, qi = 0. In particular, this is true when


q = p, and tells us hf, pi = hp, pi. Using this to expand hf − p, f − pi gives

kf − pk2 + kpk2 = kf k2 ,

which is just a glorified Pythagoras theorem.


3. Also, we notice that the formula for the error is a positive term kf k2 subtracting
a lot of squares. As we increase n, we subtract more squares, and the error
decreases. If we are lucky, the error tends to 0 as we take n → ∞. Even
though we might not know how many terms we need in order to get the error
to be sufficiently small, we can just keep adding terms until the computed error is
small enough (which is something we have to do anyway even if we knew what
n to take).
4. Without further technical restrictions, we cannot deduce the Parseval identity

    \sum_{k=0}^{∞} \frac{\langle f, p_k\rangle^2}{\|p_k\|^2} = \langle f, f\rangle \qquad (∗)

which is necessary for the above error to tend to 0 as we take n → ∞. However,


when our vector space is C[a, b] for a finite interval and our scalar product is
that of 1 in [E.13.2], this result follows easily from the Weierstrass approxima-
tion theorem,[T.5-61] which states that given f ∈ C[a, b], we can find a sequence
p̄n ∈ Pn [x] so that kf − p̄n k∞ → 0 as n → ∞.
Firstly note that the infinite sum in (∗) must converge, since it is increasing
and bounded above by \|f\|^2. Also write our least-squares minimising p as p̂_n.
Then, by the error formula and the minimising property of p̂_n, as n → ∞ we have
    \|f\|^2 − \sum_{k=0}^{n} \frac{\langle f, p_k\rangle^2}{\|p_k\|^2} = \|f − p̂_n\|^2 ≤ \|f − \bar{p}_n\|^2 ≤ \|w\|_1\,\|f − \bar{p}_n\|_∞^2 → 0.

5. We have spent most of this subsection looking at least-squares polynomial


approximation with respect to the continuous scalar product. Least-squares
problems with respect to the discrete scalar product often arise in practice for
a different reason: i.e. {fi }m
i=1 correspond to inexact experimental data and
we wish to construct a low-degree (n  m) polynomial that smooths out the
data errors.
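As a concrete illustration of the formula c_k = ⟨f, p_k⟩/‖p_k‖², the sketch below (my own, not from the notes; the function names and the non-smooth test function |x| are my choices) builds a least-squares Legendre approximation on [−1, 1] with w = 1, using the known norms ‖P_k‖² = 2/(2k + 1).

```python
# Least-squares approximation by Legendre polynomials on [-1, 1] (illustrative sketch).
import numpy as np
from numpy.polynomial import legendre as L

def trap(vals, x):
    """Composite trapezoid rule for samples vals on the grid x."""
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(x)))

def legendre_least_squares(f, n, m=4001):
    """Coefficients c_k of the minimiser sum_k c_k P_k, using c_k = <f, P_k> / ||P_k||^2."""
    x = np.linspace(-1.0, 1.0, m)
    fx = f(x)
    c = np.zeros(n + 1)
    for k in range(n + 1):
        pk = L.legval(x, [0.0] * k + [1.0])          # P_k evaluated on the grid
        c[k] = trap(fx * pk, x) / (2.0 / (2 * k + 1))  # ||P_k||^2 = 2/(2k+1)
    return c

c = legendre_least_squares(np.abs, 6)                 # approximate |x|, which is not smooth
xs = np.linspace(-1.0, 1.0, 5)
print(np.abs(xs) - L.legval(xs, c))                   # pointwise errors of the degree-6 fit
```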

13.3 Approximation of linear functionals



D. 13-23
• Given a vector space V of functions, a linear functional is an element of the dual
space of V .
? Here we’ll assume V is a real vector space, so a linear functional is a linear
mapping L : V → R.
E. 13-24
We usually don’t put so much emphasis on the actual vector space V . Instead, we
provide a formula for L, and take V to be the vector space of functions for which
the formula makes sense.
1. We can choose some fixed ξ ∈ R, and define a linear functional by L(f ) = f (ξ).
2. Alternatively, for fixed η ∈ R we can define our functional by L(f ) = f 0 (η).
In this case, we need to pick a vector space in which this makes sense, eg. the
space of continuously differentiable functions.
3. We can define L(f) = \int_a^b f(x)\,dx. The set of continuous (or even just integrable)
functions defined on [a, b] will be a sensible domain for this linear functional.
4. Any linear combination of these linear functionals is also a linear functional.
For example, we can pick some fixed α, β ∈ R, and define
    L(f) = f(β) − f(α) − \frac{β − α}{2}\big[f′(β) + f′(α)\big].

The objective of this section is to construct approximations to more complicated linear
functionals (usually integrals, or possibly derivative or point values) in terms of simpler linear
functionals (usually point values of f itself). For example, given a linear functional
L we might produce an approximation of the form
    L(f) ≈ \sum_{i=0}^{N} a_i f(x_i),
where V = C^p[a, b], p ≥ 0, and \{x_i\}_{i=0}^{N} ⊆ [a, b] are distinct points.

How can we choose the coefficients a_i and the points x_i so that our approximation is
“good”? Note that most of our functionals can be easily evaluated exactly when f is
a polynomial. So we might approximate our function f by a polynomial, and then do
it exactly for polynomials. More precisely, we let \{x_i\}_{i=0}^{N} ⊆ [a, b] be arbitrary points.
Then using the Lagrange cardinal polynomials \ell_i, we have f(x) ≈ \sum_{i=0}^{N} f(x_i)\,\ell_i(x).
Then using linearity, we can approximate
    L(f) ≈ L\!\left(\sum_{i=0}^{N} f(x_i)\,\ell_i(x)\right) = \sum_{i=0}^{N} L(\ell_i)\,f(x_i).
So we can pick a_i = L(\ell_i). Similar to polynomial interpolation, this formula is exact for
f ∈ P_N[x]. But we could do better. If we can freely choose \{a_i\}_{i=0}^{N} and \{x_i\}_{i=0}^{N}, then
since we now have 2N + 2 free parameters, we might expect to find an approximation
that is exact for f ∈ P_{2N+1}[x]. This is not always possible, but there are cases when
we can. The most famous example is Gaussian quadrature.

13.3.1 Numerical integration


We consider the space of continuous functions C[a, b] with [a, b] finite, and L(f) =
\int_a^b f(x)\,w(x)\,dx where w is a fixed integrable function. If \{x_i\}_{i=0}^{N} ⊆ [a, b] are distinct
points, then a_i = \int_a^b w(x)\,\ell_i(x)\,dx for i = 0, \cdots, N, where \{\ell_i\}_{i=0}^{N} are the Lagrange
cardinal polynomials with respect to \{x_i\}_{i=0}^{N}. Our approximation \int_a^b f(x)\,w(x)\,dx ≈
\sum_{i=0}^{N} a_i f(x_i) will be exact when f ∈ P_N[x]. In certain commonly-occurring situations,
we can slightly improve this result.

T. 13-25
In the above scenario, if in addition
1. N is even;
2. \{x_i\}_{i=0}^{N} are symmetrically placed in [a, b], i.e. x_{N/2} = \frac{1}{2}(a + b) and x_k + x_{N−k} = a + b for k = 0, \cdots, N/2 − 1;
3. w is even with respect to [a, b], i.e. w(x + \frac{1}{2}(a + b)) is an even function for x ∈ [−\frac{1}{2}(b − a), \frac{1}{2}(b − a)],
then the approximation \int_a^b f(x)\,w(x)\,dx ≈ \sum_{i=0}^{N} a_i f(x_i) is exact when f ∈ P_{N+1}[x].

Since \{x_i\}_{i=0}^{N} are symmetrically placed in [a, b], the nodal polynomial ω(x) =
\prod_{i=0}^{N}(x − x_i) is an odd function with respect to [a, b]; thus ω' is an even function
with respect to [a, b]. Writing our Lagrange cardinal polynomials in terms of ω we
have
    a_i = \int_a^b w(x)\,\frac{ω(x)}{ω'(x_i)(x − x_i)}\,dx, \qquad i = 0, \cdots, N.

Thus for k = 0, \cdots, N/2 − 1,
    a_{N−k} − a_k = \int_a^b w(x)\left[\frac{ω(x)}{ω'(x_{N−k})(x − x_{N−k})} − \frac{ω(x)}{ω'(x_k)(x − x_k)}\right] dx = \frac{x_{N−k} − x_k}{d_k}\int_a^b w(x)\,\frac{ω(x)}{(x − x_{N−k})(x − x_k)}\,dx,
where d_k = ω'(x_k) = ω'(x_{N−k}). This must be zero because the integrand is odd with
respect to [a, b]. The main part of the theorem is now simple and relies on the
following decomposition: given any f ∈ P_{N+1}[x], there is a unique c ∈ R and a
unique q ∈ P_N[x] such that f(x) = c\,(x − \frac{a+b}{2})^{N+1} + q(x), where c is the leading
coefficient of f. Now
    \int_a^b w(x)\,f(x)\,dx = \int_a^b w(x)\,q(x)\,dx = \sum_{i=0}^{N} a_i\,q(x_i) = \sum_{i=0}^{N} a_i\,f(x_i).
The first equality holds since c(x − \frac{a+b}{2})^{N+1} is odd with respect to [a, b] (and w is even),
hence it vanishes when integrated over [a, b]; the second holds since we know the
approximation is exact for P_N[x]; the final one holds since c(x − \frac{a+b}{2})^{N+1} is odd with
respect to [a, b] and the a_i are symmetric (a_i = a_{N−i}), so its quadrature sum also vanishes.

In fact we can prove a similar result with N and w odd.



E. 13-26
In practice, the restriction that the weight function w must be even usually occurs
because it is constant.
• Mid-point rule: This has w(x) = 1, N = 0 and x_0 = \frac{a+b}{2}. The approximation
is \int_a^b f(x)\,dx ≈ (b − a)\,f(\frac{a+b}{2}); this is exact for P_1[x].
• Simpson's rule: This has w(x) = 1, N = 2 and x_0 = a, x_1 = \frac{a+b}{2}, x_2 = b.
The approximation is \int_a^b f(x)\,dx ≈ \frac{b−a}{6}\big(f(a) + 4f(\frac{a+b}{2}) + f(b)\big); this is exact for P_3[x].
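These exactness claims are easy to spot-check numerically. The short sketch below (my own, with a random cubic as the test integrand) confirms that the mid-point rule fails on a cubic while Simpson's rule integrates it exactly, up to rounding.

```python
# Exactness check of the mid-point and Simpson rules on a random cubic (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
a, b = 0.0, 2.0
p = np.poly1d(rng.standard_normal(4))            # a random cubic
P = p.integ()                                    # an antiderivative
exact = P(b) - P(a)

midpoint = (b - a) * p(0.5 * (a + b))
simpson = (b - a) / 6.0 * (p(a) + 4 * p(0.5 * (a + b)) + p(b))

print(exact - midpoint)   # nonzero in general: the mid-point rule is only exact on P_1
print(exact - simpson)    # ~ 0 up to rounding: Simpson's rule is exact on P_3
```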

Gaussian quadrature
The objective of Gaussian quadrature is to approximate integrals of the form
    L(f) = \int_a^b w(x)\,f(x)\,dx,
where w(x) is a positive weight function that determines a scalar product. In particular
\langle f, g\rangle = \int_a^b w(x)\,f(x)\,g(x)\,dx is a scalar product for P_ν[x]. We will show that we can
find weights, written \{b_k\}_{k=1}^{ν}, and nodes, written \{c_k\}_{k=1}^{ν} ⊆ [a, b], such that the
approximation
    \int_a^b w(x)\,f(x)\,dx ≈ \sum_{k=1}^{ν} b_k\,f(c_k)
is exact for f ∈ P_{2ν−1}[x]. Such a quadrature with ν nodes that is exact on P_{2ν−1}[x]
is called Gaussian quadrature. We will show that the nodes \{c_k\}_{k=1}^{ν} are in fact the
zeros of the orthogonal polynomial p_ν with respect to the scalar product. We start by
showing that this is the best we can achieve.
P. 13-27
There is no choice of ν weights and nodes such that the approximation of \int_a^b w(x)\,f(x)\,dx
is exact for all f ∈ P_{2ν}[x].

Define q(x) = \prod_{k=1}^{ν}(x − c_k) ∈ P_ν[x]. Then we know \int_a^b w(x)\,q^2(x)\,dx > 0. However,
\sum_{k=1}^{ν} b_k\,q^2(c_k) = 0. So this cannot be exact for q^2.
T. 13-28
<Ordinary quadrature> For any distinct {ck }νk=1 ⊆ [a, b], let {`k }νk=1 be the
Lagrange cardinal polynomials with respect to \{c_k\}_{k=1}^{ν}. Then the approximation
    L(f) = \int_a^b w(x)\,f(x)\,dx ≈ \sum_{k=1}^{ν} b_k\,f(c_k) \quad where \quad b_k = \int_a^b w(x)\,\ell_k(x)\,dx
is exact for f ∈ P_{ν−1}[x].


We have L(\ell_m) = \int_a^b w(x)\,\ell_m(x)\,dx = b_m = \sum_{k=1}^{ν} b_k δ_{mk} = \sum_{k=1}^{ν} b_k\,\ell_m(c_k), so
the approximation is exact for each \ell_m. Each \ell_k is an element of P_{ν−1}[x] and they are
linearly independent, hence \{\ell_k\}_{k=1}^{ν} is a basis for P_{ν−1}[x]. It follows by linearity of L
that the approximation is exact for f ∈ P_{ν−1}[x].
Here we produce an approximation given {ck }νk=1 . This simple idea is how we
generate many classical numerical integration techniques, such as the trapezoidal

rule. But those are quite inaccurate. It turns out a clever choice of {ck } does much
better — take them to be the zeros of the orthogonal polynomials. However, to do
this, we must make sure the roots indeed lie in [a, b]. This is what we will prove
now — given any inner product, the roots of the orthogonal polynomials must lie
in [a, b].
T. 13-29
For ν ≥ 1, the zeros of the orthogonal polynomial pν are real, distinct and lie in
(a, b).

First we show there is at least one root. Notice that p0 = 1. Thus for ν ≥ 1, by
orthogonality, we know
    \int_a^b w(x)\,p_ν(x)\,p_0(x)\,dx = \int_a^b w(x)\,p_ν(x)\,dx = 0.

So there is at least one sign change in (a, b). We have already got the result we
need for ν = 1, since we only need one zero in (a, b). Now for ν > 1, suppose
\{ξ_j\}_{j=1}^{m} are the places where the sign of p_ν changes in (a, b) (which is a subset of
the roots of p_ν). We define
    q(x) = \prod_{j=1}^{m}(x − ξ_j) ∈ P_m[x].

Since this changes sign at the same place as pν , we know qpν maintains the same
sign in (a, b). Now if we had m < ν, then orthogonality gives
    \langle q, p_ν\rangle = \int_a^b w(x)\,q(x)\,p_ν(x)\,dx = 0,

which is impossible, since qpν does not change sign. Hence we must have m = ν.
T. 13-30
In the ordinary quadrature, if we pick \{c_k\}_{k=1}^{ν} to be the roots of p_ν(x), then we get
exactness for f ∈ P_{2ν−1}[x]. In addition, the weights \{b_k\}_{k=1}^{ν} are all positive.

Let f ∈ P2ν−1 [x]. Then by polynomial division, we get f = qpν + r where q, r are
polynomials of degree at most ν − 1. We apply orthogonality to get
    \int_a^b w(x)\,f(x)\,dx = \int_a^b w(x)\big(q(x)\,p_ν(x) + r(x)\big)\,dx = \int_a^b w(x)\,r(x)\,dx.

Also, since each ck is a root of pν , we get


    \sum_{k=1}^{ν} b_k\,f(c_k) = \sum_{k=1}^{ν} b_k\big(q(c_k)\,p_ν(c_k) + r(c_k)\big) = \sum_{k=1}^{ν} b_k\,r(c_k).

But r has degree at most ν −1, and this formula is exact for polynomials in Pν−1 [x].
Hence we know
    \int_a^b w(x)\,f(x)\,dx = \int_a^b w(x)\,r(x)\,dx = \sum_{k=1}^{ν} b_k\,r(c_k) = \sum_{k=1}^{ν} b_k\,f(c_k).

To show the weights are positive, we pick a special f. Consider f = \ell_k^2 ∈ P_{2ν−2}[x],
for \ell_k the Lagrange cardinal polynomials for \{c_k\}_{k=1}^{ν}. Since the quadrature is exact
for these, we get
    0 < \int_a^b w(x)\,\ell_k^2(x)\,dx = \sum_{j=1}^{ν} b_j\,\ell_k^2(c_j) = \sum_{j=1}^{ν} b_j δ_{jk} = b_k.

Since this is true for all k = 1, · · · , ν, we get the desired result.
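As a numerical illustration (my own sketch, not from the notes), the Gauss-Legendre rule with ν nodes, i.e. the case w = 1 on [−1, 1] where the nodes are the roots of P_ν, should integrate monomials of degree up to 2ν − 1 exactly and start failing at degree 2ν.

```python
# Gauss-Legendre quadrature exactness check (illustrative sketch).
import numpy as np

nu = 4
x, b = np.polynomial.legendre.leggauss(nu)     # nodes c_k (roots of P_nu) and weights b_k

for deg in (2 * nu - 2, 2 * nu):               # an even degree that is exact, and one that is not
    exact = 2.0 / (deg + 1)                    # int_{-1}^{1} t^deg dt for even deg
    approx = float(np.dot(b, x**deg))
    print(deg, approx, exact)                  # degree 6 matches; degree 8 does not
```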

13.3.2 Numerical differentiation

In this subsection our vector space is C^k[a, b], for a finite interval and some k ≥ 1, and
our linear functional is L(f) = f^{(k)}(ξ) for some fixed ξ ∈ [a, b]. For N ≥ k, we seek
distinct \{x_i\}_{i=0}^{N} ⊆ [a, b] and \{a_i\}_{i=0}^{N} ⊆ R so that
    f^{(k)}(ξ) ≈ \sum_{i=0}^{N} a_i\,f(x_i)
is a “good” approximation. Just as before, for any choice of distinct \{x_i\}_{i=0}^{N} we can take
a_i = \ell_i^{(k)}(ξ) for i = 0, \cdots, N, where \{\ell_i\}_{i=0}^{N} are the Lagrange cardinal polynomials
with respect to \{x_i\}_{i=0}^{N}. Then our approximation is exact when f ∈ P_N[x].

Note that our approximation takes a particularly simple form when N = k, because
then \sum_{i=0}^{k} a_i\,f(x_i) = p^{(k)}(ξ) = k!\,f[x_0, \cdots, x_k], where p ∈ P_k[x] is the interpolating
polynomial for f with respect to \{x_i\}_{i=0}^{k}. The last equality is since f[x_0, \cdots, x_k] is by
definition the leading coefficient of p.

Like before, we can also slightly improve the above result in special cases. However
there is no analogue of Gaussian quadrature.

T. 13-31
In the above scenario (without assuming N = k), if in addition
1. k is even and ξ = \frac{a+b}{2};
2. N is even and \{x_i\}_{i=0}^{N} are symmetrically placed in [a, b], i.e. x_{N/2} = \frac{a+b}{2} and x_i + x_{N−i} = a + b for i = 0, \cdots, N/2 − 1,
then the coefficients satisfy a_i = a_{N−i} for i = 0, \cdots, N/2 − 1 and the approximation
is exact for f ∈ P_{N+1}[x].

The proof has the same pattern as [T.13-25]. Similarly, we can also obtain an
analogous result for k and N odd.

E. 13-32
 
• For N = 1 (and k = 1), f′(\frac{a+b}{2}) ≈ \frac{f(b) − f(a)}{b − a} is exact for P_2[x].
• For N = 2 (and k = 2), f″(\frac{a+b}{2}) ≈ \frac{f(b) − 2f(\frac{a+b}{2}) + f(a)}{(b − a)^2/4} is exact for P_3[x].
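A two-line numerical check of these formulas (my own sketch; the cubic test polynomial is an arbitrary choice): on a cubic the first difference misses f′ by the cubic term only, while the second difference reproduces f″ exactly.

```python
# Check of the two centred difference formulas on [a, b] = [-1, 1] (illustrative sketch).
import numpy as np

a, b = -1.0, 1.0
m = 0.5 * (a + b)
f = np.poly1d([2.0, -1.0, 3.0, 0.5])                      # a cubic test polynomial

d1 = (f(b) - f(a)) / (b - a)                               # approximates f'(m), exact on P_2
d2 = (f(b) - 2 * f(m) + f(a)) / ((b - a) / 2) ** 2          # approximates f''(m), exact on P_3

print(d1, f.deriv(1)(m))    # differ (f is cubic, not quadratic)
print(d2, f.deriv(2)(m))    # agree exactly
```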

13.3.3 Expressing errors in terms of derivatives


As before, we approximate a linear functional L by
    L(f) ≈ \sum_{i=0}^{n} a_i L_i(f),
where the L_i are some simpler linear functionals. If the error e_L(f) = L(f) − \sum_{i=0}^{n} a_i L_i(f)
vanishes whenever f ∈ P_k[x] (i.e. this approximation is exact for f ∈ P_k[x]), we say the
error annihilates polynomials of degree at most k.
If the error annihilates P_k[x], we will show that we can find a bound for the error
of L on (k + 1)-times continuously differentiable functions (i.e. f ∈ C^{k+1}[a, b]) of the
form
    |e_L(f)| ≤ c_L\,\|f^{(k+1)}\|_∞ \quad for some constant c_L.
Moreover, we want to make cL as small as possible. Ideally we want it to be sharp ,
that is

∀ε > 0, ∃fε ∈ C k+1 [a, b] such that |eL (fε )| ≥ (cL − ε)kfε(k+1) k∞ .

This doesn’t say anything about whether cL can actually be achieved. This depends
on the particular form of the question.
Note that so far, everything we’ve done works if the interval is infinite, as long as the
weight function vanishes sufficiently quickly as we go far away. However, for this little
bit, we will need to require [a, b] to be finite, since we want to make sure we can take
the supremum of our functions.
T. 13-33
<Peano kernel theorem> Let f ∈ C^{k+1}[a, b] and write (x − θ)_+^k = (x − θ)^k\,I[θ ≤ x].
Suppose the linear functional λ annihilates polynomials in P_k[x] and that we
can exchange the order of λ and the integral in λ\!\left(\int_a^b (x − θ)_+^k\,f^{(k+1)}(θ)\,dθ\right); then
    λ(f) = \frac{1}{k!}\int_a^b K(θ)\,f^{(k+1)}(θ)\,dθ \quad where \quad K(θ) = λ\big((x − θ)_+^k\big).

Using Taylor’s theorem with the integral remainder,
    f(x) = f(a) + (x − a)f′(a) + \cdots + \frac{(x − a)^k}{k!}\,f^{(k)}(a) + \frac{1}{k!}\int_a^x (x − θ)^k\,f^{(k+1)}(θ)\,dθ.

We write the integral as
    \int_a^b (x − θ)_+^k\,f^{(k+1)}(θ)\,dθ \quad where \quad (x − θ)_+^k = \begin{cases} (x − θ)^k & θ ≤ x \\ 0 & θ > x. \end{cases}

Then since λ is a linear functional that annihilates P_k[x], we have
    λ(f) = \frac{1}{k!}\,λ\!\left(\int_a^b (x − θ)_+^k\,f^{(k+1)}(θ)\,dθ\right) \quad for all f ∈ C^{k+1}[a, b].
Taking the λ inside the integral sign, we obtain
    λ(f) = \frac{1}{k!}\int_a^b λ\big((x − θ)_+^k\big)\,f^{(k+1)}(θ)\,dθ,

noting that λ acts on (x − θ)_+^k ∈ C^{k−1}[a, b] as a function of x, thinking of θ as
being held constant.
Note the following:
• Of course, there are linear functionals for which we cannot move the λ inside
the integral, but for our linear functionals (point values, derivative point values,
integrals etc.), this is valid, as we can verify directly.
• K(θ) is called the Peano kernel . The important thing is that the kernel K is
independent of f . Taking suprema in different ways, we obtain different forms
of bounds:
    |λ(f)| ≤ \frac{1}{k!}\begin{cases} \left(\int_a^b |K(θ)|\,dθ\right)\|f^{(k+1)}\|_∞ = \|K\|_1\,\|f^{(k+1)}\|_∞ \\ \left(\int_a^b |K(θ)|^2\,dθ\right)^{1/2}\|f^{(k+1)}\|_2 = \|K\|_2\,\|f^{(k+1)}\|_2 \quad \text{(Cauchy-Schwarz)} \\ \|K\|_∞\,\|f^{(k+1)}\|_1 \end{cases}

Hence we can find the constant c_L for different choices of the norm. When
computing c_L, don’t forget the factor of \frac{1}{k!}! By fiddling with functions a bit,
we can show these bounds are indeed sharp.
• If K(θ) does not change sign on [a, b], we can say a bit more. First, we note
that the bound
    |λ(f)| ≤ \frac{1}{k!}\left|\int_a^b K(θ)\,dθ\right|\,\|f^{(k+1)}\|_∞
can be achieved by x^{k+1}, since this has constant (k + 1)th derivative. Also, we
can use the integral mean value theorem to get the bound
    λ(f) = \frac{1}{k!}\left(\int_a^b K(θ)\,dθ\right) f^{(k+1)}(ξ),

where ξ ∈ (a, b) depends on f. To see this, note that λ(f) lies between
\frac{1}{k!}\left(\int_a^b K(θ)\,dθ\right)\inf_{[a,b]} f^{(k+1)} and \frac{1}{k!}\left(\int_a^b K(θ)\,dθ\right)\sup_{[a,b]} f^{(k+1)},
hence we can find such a ξ by the intermediate value theorem.
• Finally, note that Peano’s kernel theorem says if eL (f ) = 0 for all f ∈ Pk [x],
then
    e_L(f) = \frac{1}{k!}\int_a^b K(θ)\,f^{(k+1)}(θ)\,dθ \quad for all f ∈ C^{k+1}[a, b].
But for any other fixed j = 0, \cdots, k − 1, we also have e_L(f) = 0 for all f ∈ P_j[x].
So we also know
    e_L(f) = \frac{1}{j!}\int_a^b K_j(θ)\,f^{(j+1)}(θ)\,dθ \quad for all f ∈ C^{j+1}[a, b].

Note that we have a different kernel. In general, this might not be a good idea,
since we are throwing information away. Yet, this can be helpful if we get some
less smooth functions that don’t have k + 1 derivatives.

E. 13-34
• Let L(f ) = f (β). We decide to be silly and approximate L(f ) by

    L(f) ≈ f(α) + \frac{β − α}{2}\big(f′(β) + f′(α)\big) \quad where α ≠ β.
The error is given by
    e_L(f) = f(β) − f(α) − \frac{β − α}{2}\big(f′(β) + f′(α)\big),
and this vanishes for f ∈ P_2[x]. We wlog assume α < β. Then
    K(θ) = e_L((x − θ)_+^2) = (β − θ)_+^2 − (α − θ)_+^2 − (β − α)\big((β − θ)_+ + (α − θ)_+\big).

Hence we get
    K(θ) = \begin{cases} 0 & a ≤ θ ≤ α \\ (α − θ)(β − θ) & α ≤ θ ≤ β \\ 0 & β ≤ θ ≤ b. \end{cases}

Hence we know
    e_L(f) = \frac{1}{2}\int_α^β (α − θ)(β − θ)\,f'''(θ)\,dθ \quad for all f ∈ C^3[a, b].

Note that in this particular case, our function K(θ) does not change sign on [a, b].
We have K(θ) ≤ 0 on [a, b], and \int_a^b K(θ)\,dθ = −\frac{1}{6}(β − α)^3. Hence we have the
bound
    |e_L(f)| ≤ \frac{1}{12}(β − α)^3\,\|f'''\|_∞,
and this bound is achieved for x^3. We also have e_L(f) = −\frac{1}{12}(β − α)^3 f'''(ξ) for
some f-dependent ξ ∈ (a, b).
• Consider the approximation f′(0) ≈ −\frac{3}{2}f(0) + 2f(1) − \frac{1}{2}f(2). The error of this
approximation is the linear functional e_L(f) = f′(0) + \frac{3}{2}f(0) − 2f(1) + \frac{1}{2}f(2), and
(as may be verified by trying f(x) = 1, x, x^2) e_L(f) = 0 for f ∈ P_2[x]. Hence the
Peano kernel theorem tells us that, for f ∈ C^3[0, 2],
    e_L(f) = \frac{1}{2}\int_0^2 K(θ)\,f'''(θ)\,dθ,
where
    K(θ) ≡ e_L((x − θ)_+^2) = 2(0 − θ)_+ + \frac{3}{2}(0 − θ)_+^2 − 2(1 − θ)_+^2 + \frac{1}{2}(2 − θ)_+^2 = \begin{cases} −2(1 − θ)^2 + \frac{1}{2}(2 − θ)^2 = 2θ − \frac{3}{2}θ^2 & 0 ≤ θ ≤ 1 \\ \frac{1}{2}(2 − θ)^2 & 1 ≤ θ ≤ 2. \end{cases}
Note that K ≥ 0, so
    \int_0^2 K(θ)\,dθ = \int_0^1 \left(2θ − \frac{3}{2}θ^2\right) dθ + \int_1^2 \frac{1}{2}(2 − θ)^2\,dθ = \frac{1}{2} + \frac{1}{6} = \frac{2}{3}.
So the bound |e_L(f)| ≤ \frac{1}{3}\|f'''\|_∞ is achievable, and e_L(f) = \frac{1}{3}f'''(ξ) for some
ξ ∈ (0, 2).
? As an alternative to determining \int_a^b K(θ)\,dθ directly, we can use the Peano kernel
theorem, noting that the formula is valid for all f ∈ C^{k+1}[a, b]: for the particular
case f(x) = x^{k+1} (so that f^{(k+1)}(x) = (k + 1)!) we deduce that
    \int_a^b K(θ)\,dθ = \frac{1}{k + 1}\,e_L(x^{k+1}). \qquad (∗)

Furthermore since eL (p) = 0 for p ∈ Pk [x], (∗) remains true if xk+1 is replaced by
any monic q ∈ Pk+1 [x], hence we can choose such a q for which the evaluation of
eL (q) is straightforward.
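A quick numeric sanity check of the second worked example (my own sketch, not from the notes): by (∗), \int_0^2 K(θ)\,dθ should equal e_L(x^3)/3 = 2/3, and integrating the piecewise kernel directly should give the same value.

```python
# Numeric check of the Peano kernel example e_L(f) = f'(0) + 3/2 f(0) - 2 f(1) + 1/2 f(2).
import numpy as np

def e_L(f, df):
    """The error functional; it needs f and its derivative f' (here k = 2)."""
    return df(0.0) + 1.5 * f(0.0) - 2.0 * f(1.0) + 0.5 * f(2.0)

# via (*), with f(x) = x^3:
print(e_L(lambda x: x**3, lambda x: 3 * x**2) / 3.0)       # 0.6666...

# directly, integrating the piecewise kernel K on a fine grid:
theta = np.linspace(0.0, 2.0, 200001)
K = np.where(theta <= 1.0, 2 * theta - 1.5 * theta**2, 0.5 * (2 - theta) ** 2)
print(float(np.sum(0.5 * (K[1:] + K[:-1]) * np.diff(theta))))   # also ~ 0.6666...
```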

13.4 Ordinary differential equations


Our goal is to solve ordinary differential equations numerically. We will focus on
differential equations of the form

y0 (t) = f (t, y(t)) for 0≤t≤T with initial conditions y(0) = y0 .


The data we are provided is the function f : R × R^N → R^N, the ending time T > 0, and
the initial condition y_0 ∈ R^N. What we seek is the function y : [0, T] → R^N.
When solving the differential equation numerically, our goal would be to make our
numerical solution as close to the true solution as possible. This makes sense only if
a “true” solution actually exists, and is unique. From Analysis II, we know a unique
solution to the ODE exists if f is Lipschitz (on the spatial, i.e. RN , part for all
time with the same Lipschitz constant).[T.5-60] We will assume our function has this
property, that is ∃λ such that

kf (t, x) − f (t, x̂)k ≤ λkx − x̂k for all t ∈ [0, T ] and x, x̂ ∈ RN

It doesn’t really matter what norm we pick. It will just change the λ. The importance
is the existence of a λ. A special case is when λ = 0, ie. f does not depend on x. In this
case, this is just an integration problem, and is usually easy. This is a convenient test
case — if our numerical approximation does not even work for these easy problems,
then it’s pretty useless. An extra assumption we will often make is that f can be
expanded in a Taylor series to as many degrees as we want, since this is convenient for
our analysis.
What exactly does a numerical solution to the ODE consist of? We first choose a small
time step h > 0, let tn = nh and then construct approximations

yn ≈ y(tn ), n = 1, 2, · · · .

In particular, tn − tn−1 = h and is always constant. In practice, we don’t fix the step
size tn −tn−1 , and allow it to vary in each step. However, this makes the analysis much
more complicated, and we will mostly not consider varying time steps here. If we make
h smaller, then we will (probably) make better approximations. However, this is more
computationally demanding. So we want to study the behaviour of numerical methods
in order to figure out what h we should pick.

13.4.1 One-step methods



D. 13-35
• A numerical method is said to be
 explicit if the value for the next step can be obtain by a direct computation
in terms of known quantities (e.g. values of previous steps), or in other words
the value for the next step are given explicitly by know quantities.
 implicit if the next step are defined by systems of equations which we need
to solve to get the next step, or in other words the value for the next step are
given implicitly by known quantities.
• A numerical method is
 one-step if yn+1 depends only on tn and yn , that is yn+1 = φh (tn , yn ) for
some function φh : R × RN → RN .1
 multi-step if yn+1 also depends on yk for some k < n.
• The Euler’s method employs the formula yn+1 = yn + hf (tn , yn ).
• For θ ∈ [0, 1], the θ-method is
 
    y_{n+1} = y_n + h\big[θ\,f(t_n, y_n) + (1 − θ)\,f(t_{n+1}, y_{n+1})\big].

The θ-method with θ = 1 is called Euler's method, the one with θ = 0 is called
the backward Euler method, and the one with θ = \frac{1}{2} is called the trapezoidal rule.
• For each h > 0, produce a sequence of discrete values y_n for n = 0, 1, 2, \cdots, [T/h],
where [T /h] is the integer part of T /h. We say a method converges if, as h → 0
and nh → t (hence n → ∞), we get yn → y(t) where y is the true solution to the
differential equation. Moreover, we require the convergence to be uniform in t.
• For a general (multi-step) numerical method yn+1 = φ(tn , y0 , y1 , · · · , yn ) the
local truncation error is

ηn+1 = y(tn+1 ) − φ(tn , y(t0 ), y(t1 ), · · · , y(tn )).

The order of a numerical method is the largest p ≥ 1 such that ηn+1 = O(hp+1 ).
E. 13-36
There are many ways we can classify numerical methods. One important classi-
fication is one-step versus multi-step methods. In one-step methods, the value of
yn+1 depends only on the previous iteration tn and yn . In multi-step methods,
we are allowed to look back further in time and use further results.
We want to show that Euler’s method “converges” to the real solution. First of
all, we need to make precise the notion of “convergence”. The Lipschitz condition
means there is a unique solution to the differential equation. So we would want
the numerical solution to be able to approximate the actual solution to arbitrary
accuracy as long as we take a small enough h. Hence our definition of convergence.
The local truncation error is the error we will make at the (n + 1)th step if we had
accurate values for the first n steps. In the definition of the order of a method, we
1
Sometimes a one-step method would be given in the form yn+1 = φh (tn , yn , yn+1 ) for some
function φh : R × RN × RN → RN , in which case by solving the equation we can still find yn+1
in terms of tn and yn , i.e. it is still of the form stated in the definition. This is an example of an
implicit method since yn+1 is defined in terms of itself and we have to solve equations to get yn+1 .

have the +1 in p + 1 because when we sum all the local errors to get the global
error we’ll drop a power of h (at least this is true for one-step methods).
When θ 6= 1, the θ-method is an implicit method. In general, we can’t just write
down the value of yn+1 given the value of yn . Instead, we have to treat the formula
as N (in general) non-linear equations, and solve them to find yn+1 !
In the past, people did not like to use this, because they didn’t have computers, or
computers were too slow. It is tedious to have to solve these equations in every step
of the method. Nowadays, these are becoming more and more popular because
it is getting easier to solve equations, and θ-methods have some huge theoretical
advantages, but we will not go into it.
We now look at the error of the θ-method. We have
    η = y(t_{n+1}) − y(t_n) − h\big[θ\,y′(t_n) + (1 − θ)\,y′(t_{n+1})\big].
We expand all terms about t_n with Taylor's series to obtain
    η = \left(θ − \frac{1}{2}\right)h^2 y″(t_n) + \left(\frac{θ}{2} − \frac{1}{3}\right)h^3 y‴(t_n) + O(h^4).
We see that θ = \frac{1}{2} gives us an order 2 method. Otherwise, we get an order 1
method.
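For concreteness, here is a minimal sketch of the θ-method for a scalar ODE (my own code, not from the notes; the fixed-point loop for the implicit case and the test problem y′ = −y are my choices and assume hλ is small enough for the iteration to converge).

```python
# The theta-method for a scalar ODE y' = f(t, y) (illustrative sketch).
import numpy as np

def theta_method(f, y0, T, h, theta=0.5, fixed_point_iters=20):
    n_steps = int(round(T / h))
    t, y = 0.0, float(y0)
    ys = [y]
    for _ in range(n_steps):
        if theta == 1.0:                        # explicit Euler
            y_next = y + h * f(t, y)
        else:                                    # implicit step, solved by fixed-point iteration
            y_next = y + h * f(t, y)             # simple predictor
            for _ in range(fixed_point_iters):
                y_next = y + h * (theta * f(t, y) + (1 - theta) * f(t + h, y_next))
        t, y = t + h, y_next
        ys.append(y)
    return np.array(ys)

# y' = -y, y(0) = 1: compare Euler (theta = 1) with the trapezoidal rule (theta = 1/2) at T = 1.
for theta in (1.0, 0.5):
    ys = theta_method(lambda t, y: -y, 1.0, 1.0, h=0.1, theta=theta)
    print(theta, abs(ys[-1] - np.exp(-1.0)))     # the trapezoidal error is much smaller
```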
T. 13-37
<Convergence of Euler’s method>
1. For all t ∈ [0, T ], we have limh→0,nh→t yn − y(t) = 0.
2. Let λ be the Lipschitz constant of f . Then there exists a c ≥ 0 such that

    \|e_n\| ≤ ch\,\frac{e^{λT} − 1}{λ} \quad for all 0 ≤ n ≤ [T/h], \quad where e_n = y_n − y(t_n).
Note that the bound in 2 is uniform, so 1 follows from 2. We only need to prove
2. There are two parts to proving this. We first look at the local truncation error.
This is the error we would get at each step assuming we got the previous steps
right. More precisely, we write
y(tn+1 ) = y(tn ) + hf (tn , y(tn )) + Rn ,
and Rn is the local truncation error. For the Euler’s method, it is easy to get Rn ,
since f (tn , y(tn )) = y0 (tn ), by definition. So this is just the Taylor series expansion
of y. We can write Rn as the integral remainder of the Taylor series,
    R_n = \int_{t_n}^{t_{n+1}} (t_{n+1} − θ)\,y″(θ)\,dθ.

By some careful analysis, we get \|R_n\|_∞ ≤ ch^2 where c = \frac{1}{2}\|y″\|_∞. This is the


easy part, and tends to go rather smoothly even for more complicated methods.
Once we have bounded the local truncation error, we patch them together to get
the actual error. We can write
 
en+1 = yn+1 − y(tn+1 ) = yn + hf (tn , yn ) − y(tn ) + hf (tn , y(tn )) + Rn
 
= (yn − y(tn )) + h f (tn , yn ) − f (tn , y(tn )) − Rn

Taking the infinity norm, we get

ken+1 k∞ ≤ kyn − y(tn )k∞ + hkf (tn , yn ) − f (tn , y(tn ))k∞ + kRn k∞
≤ ken k∞ + hλken k∞ + ch2 = (1 + λh)ken k∞ + ch2 .

This is valid for all n ≥ 0. We also know ke0 k = 0. Doing some algebra, we get

    \|e_n\|_∞ ≤ ch^2 \sum_{j=0}^{n−1}(1 + hλ)^j ≤ \frac{ch}{λ}\big((1 + hλ)^n − 1\big).

Finally, we have 1 + hλ ≤ eλh since 1 + λh is the first two terms of the Taylor series,
and the other terms are positive. So (1 + hλ)n ≤ eλhn ≤ eλT . So we obtain the
bound
    \|e_n\|_∞ ≤ ch\,\frac{e^{λT} − 1}{λ}.
Then this tends to 0 as we take h → 0. So the method converges.
This works as long as λ 6= 0. However, λ = 0 is the easy case, since it is just
integration. We can either check this case directly, or use the fact that
\frac{1}{λ}(e^{λT} − 1) → T as λ → 0.
The same proof strategy works for most numerical methods, but the algebra will
be much messier.
This result tell us that the Euler method has order 1. This is one less than the
power of the local truncation error. When we look at the global error, we drop a
power, and only have en ∼ h as expected.

13.4.2 Multistep methods


D. 13-38
• An s-step (linear) method is given by
    \sum_{\ell=0}^{s} ρ_\ell\, y_{n+\ell} = h \sum_{\ell=0}^{s} σ_\ell\, f(t_{n+\ell}, y_{n+\ell}).
This formula is used to find the value of y_{n+s} given the others. For the method
we define the two polynomials ρ(w) = \sum_{\ell=0}^{s} ρ_\ell w^\ell and σ(w) = \sum_{\ell=0}^{s} σ_\ell w^\ell.
• We say ρ(w) satisfies the root condition if all its zeros are bounded by 1 in size,
ie. all roots w satisfy |w| ≤ 1. Moreover any zero with |w| = 1 must be simple.
• The 2-step Adams-Bashforth (AB2) method has

    y_{n+2} = y_{n+1} + \frac{h}{2}\big(3 f(t_{n+1}, y_{n+1}) − f(t_n, y_n)\big).
E. 13-39
The idea behind multi-step methods is that we might be able to make our methods
more efficient/accurate by making use of previous values of y_n instead of just the
most recent one.

Note that in the s-step numerical method we get the same method if we multiply
all the constants ρ` , σ` by a non-zero constant. By convention, we normalize this
by setting ρs = 1. Then we can alternatively write this as
    y_{n+s} = h \sum_{\ell=0}^{s} σ_\ell\, f(t_{n+\ell}, y_{n+\ell}) − \sum_{\ell=0}^{s−1} ρ_\ell\, y_{n+\ell}.

This method is an implicit method if σs 6= 0. Otherwise, it is explicit.


Note that this method is linear in the sense that the coefficients ρ` and σ` appear
linearly, outside the f s. Later we will see more complicated numerical methods
where these coefficients appear inside the arguments of f .
For multi-step methods, we have a slight problem to solve. In a one-step method,
we are given y0 , and this allows us to immediately apply the one-step method to get
higher values of yn . However, for an s-step method, we need to use other (possibly
1-step) method to obtain y1 , · · · , ys−1 before we can get started. Fortunately, we
only need to apply the one-step method a fixed, small number of times, even as
h → 0. So the accuracy of the one-step method at the start does not matter too
much.
T. 13-40
An s-step method has order p (p ≥ 1) if and only if
    \sum_{\ell=0}^{s} ρ_\ell = 0 \quad and \quad \sum_{\ell=0}^{s} ρ_\ell\,\ell^k = k\sum_{\ell=0}^{s} σ_\ell\,\ell^{k−1} \quad for k = 1, \cdots, p, \quad where 0^0 = 1.

Alternatively, this condition is equivalent to p being the largest number such that
ρ(e^x) − xσ(e^x) = O(x^{p+1}) as x → 0.

The local truncation error is
    \sum_{\ell=0}^{s} ρ_\ell\, y(t_{n+\ell}) − h \sum_{\ell=0}^{s} σ_\ell\, y′(t_{n+\ell}).
We now expand the y and y′ about t_n, and obtain
    \left(\sum_{\ell=0}^{s} ρ_\ell\right) y(t_n) + \sum_{k=1}^{∞} \frac{h^k}{k!}\left(\sum_{\ell=0}^{s} ρ_\ell\,\ell^k − k\sum_{\ell=0}^{s} σ_\ell\,\ell^{k−1}\right) y^{(k)}(t_n).

This is O(hp+1 ) under the given conditions. Hence the first result. To see the
equivalence of the two conditions, we expand ρ(e^x) − xσ(e^x):
    ρ(e^x) − xσ(e^x) = \sum_{\ell=0}^{s} ρ_\ell\, e^{\ell x} − x \sum_{\ell=0}^{s} σ_\ell\, e^{\ell x}.
We now expand the e^{\ell x} in Taylor series about x = 0. This comes out as
    \left(\sum_{\ell=0}^{s} ρ_\ell\right) + \sum_{k=1}^{∞} \frac{1}{k!}\left(\sum_{\ell=0}^{s} ρ_\ell\,\ell^k − k\sum_{\ell=0}^{s} σ_\ell\,\ell^{k−1}\right) x^k.

We see the equivalence of the two conditions.



Instead of dealing with the coefficients ρ_\ell and σ_\ell directly, it is often convenient to
pack them together into the two polynomials ρ(w) and σ(w), as illustrated by this
result.

Note that \sum_{\ell=0}^{s} ρ_\ell = 0, which is the condition required for the method to even
have an order at all, can be expressed as ρ(1) = 0.

By using the change of variable w = ez , an equivalent statement of the theorem


is that the multistep method is of order p ≥ 1 iff

ρ(w) − log(w)σ(w) = O(|w − 1|p+1 ) as w → 1.

E. 13-41
In the two-step Adams-Bashforth method, we see that the conditions hold for
p = 2 but not p = 3. So the order is 2. Alternatively,
    ρ(w) = w^2 − w, \qquad σ(w) = \frac{3}{2}w − \frac{1}{2}.

We can immediately check that ρ(1) = 0. We also have ρ(e^x) − xσ(e^x) = \frac{5}{12}x^3 + O(x^4).
So the order is 2 as expected.
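The order conditions of [T.13-40] are also easy to check mechanically. The helper below (my own sketch; the function name and tolerance are my choices) computes the order of a linear s-step method directly from its coefficients, and reports 2 for AB2.

```python
# Order of a linear s-step method from the conditions of T.13-40 (illustrative sketch).
import numpy as np

def multistep_order(rho, sigma, kmax=10, tol=1e-12):
    """rho, sigma: coefficient arrays (rho_0, ..., rho_s) and (sigma_0, ..., sigma_s)."""
    s = len(rho) - 1
    l = np.arange(s + 1, dtype=float)
    if abs(np.sum(rho)) > tol:                     # need rho(1) = 0 first
        return 0
    p = 0
    for k in range(1, kmax + 1):
        lhs = np.sum(rho * l**k)
        rhs = k * np.sum(sigma * l**(k - 1))       # 0^0 = 1 under numpy's convention
        if abs(lhs - rhs) > tol:
            break
        p = k
    return p

# AB2: y_{n+2} - y_{n+1} = h (3/2 f_{n+1} - 1/2 f_n)
print(multistep_order(np.array([0.0, -1.0, 1.0]), np.array([-0.5, 1.5, 0.0])))   # 2
```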

T. 13-42
<The Dahlquist equivalence theorem> A multi-step method is convergent
if and only if its order p is at least 1 and the root condition holds.

The proof is quite difficult, so we will not give it here. We can imagine this as
saying that large roots of ρ are bad (for convergence): they cannot exceed 1 in modulus,
and we cannot have repeated roots of modulus 1.
Intuitively, if a method \sum_{\ell=0}^{s} ρ_\ell y_{n+\ell} = h\sum_{\ell=0}^{s} σ_\ell f(t_{n+\ell}, y_{n+\ell}) has order p, then
\sum_{\ell=0}^{s} ρ_\ell y(t_{n+\ell}) = h\sum_{\ell=0}^{s} σ_\ell f(t_{n+\ell}, y(t_{n+\ell})) + O(h^{p+1}). Taking the difference of the two
gives \sum_{\ell=0}^{s} ρ_\ell e_{n+\ell}(h) = h\sum_{\ell=0}^{s} σ_\ell\big(f(t_{n+\ell}, y_{n+\ell}) − f(t_{n+\ell}, y(t_{n+\ell}))\big) + O(h^{p+1}),
where e_n(h) is the error at step n with step size h. In the limiting case h = 0 we
have \sum_{\ell=0}^{s} ρ_\ell e_{n+\ell}(0) = 0. This is a difference equation with solutions of the form
e_n(0) = r^n, where r is a root of the characteristic polynomial ρ(w) = \sum_{\ell=0}^{s} ρ_\ell w^\ell.
By consideration of continuity, for small h the behaviour would be similar to the case h = 0,
so we can sort of see that roots of modulus greater than 1 are “bad”.

We saw any sensible multi-step method must have ρ(1) = 0. So in particular,


1 must be a simple zero for the method to be convergent. Note where the difference
between one-step and multi-step methods comes in. For one-step methods, we only
needed the order to understand convergence. It is a fact that a one-step method
converges whenever it has an order p ≥ 1. For multi-step methods, we need this
extra root condition.

E. 13-43
Consider the two-step Adams-Bashforth method. We have seen it has order p =
2 ≥ 1. So we need to check the root condition. ρ(w) = w2 − w = w(w − 1). So it
satisfies the root condition.

C. 13-44
<Constructing multi-step methods> Let’s now come up with a sensible
strategy for constructing convergent s-step methods:
1. Choose ρ(w) = \sum_{\ell=0}^{s} ρ_\ell w^\ell so that ρ(1) = 0 and the root condition holds.
2. Choose σ(w) = \sum_{\ell=0}^{s} σ_\ell w^\ell to maximize the order.

For 2. we want to choose σ to maximise the maximum p (which is the order) in
ρ(e^x) − xσ(e^x) = O(x^{p+1}); this can be written as σ(e^x) = \frac{1}{x}ρ(e^x) + O(x^p).
Now let w = e^x; noting that e^x − 1 \sim x as x → 0, this can be further written as
    σ(w) = \frac{ρ(w)}{\log w} + O(|w − 1|^p) \quad as w → 1.
We can expand \frac{ρ(w)}{\log w} = \sum_{\ell=0}^{∞} a_\ell (w − 1)^\ell as a series about w = 1. We see that
to maximise p, we need to choose σ as the truncation of this series at \ell = s,
that is σ(w) = \sum_{\ell=0}^{s} a_\ell (w − 1)^\ell, in which case we would have p ≥ s + 1. If
however we want an explicit method, we require σ_s = 0; then we can only choose
σ(w) = \sum_{\ell=0}^{s−1} a_\ell (w − 1)^\ell, so we only have p ≥ s. In summary we can choose σ so
that
    σ(w) = \frac{ρ(w)}{\log w} + \begin{cases} O(|w − 1|^{s+1}) & \text{if implicit} \\ O(|w − 1|^{s}) & \text{if explicit} \end{cases} \quad as w → 1.
And in fact we see that there is only one way to pick this σ. So the key is in
picking a good enough ρ. It turns out the root condition is “best” satisfied if
ρ(w) = w^{s−1}(w − 1), i.e. all but one of the roots are 0. Then we have
    y_{n+s} − y_{n+s−1} = h \sum_{\ell=0}^{s} σ_\ell\, f(t_{n+\ell}, y_{n+\ell}),
where σ is chosen to maximize order.


Of course we can also create multi-step methods in the opposite way: we choose a particular
σ, and then find the most optimal ρ.

D. 13-45
• An Adams method is a multi-step numerical method with ρ(w) = ws−1 (w − 1).
An Adams-Bashforth method is an explicit Adams method.
An Adams-Moulton method is an implicit Adams method.
• A backward differentiation method has σ(w) = σ_s w^s for some σ_s ≠ 0, i.e.
    \sum_{\ell=0}^{s} ρ_\ell\, y_{n+\ell} = h\,σ_s\, f(t_{n+s}, y_{n+s}).

This is a generalization of the one-step backwards Euler method.


E. 13-46
We look at the two-step third-order Adams-Moulton method. This is given by
    y_{n+2} − y_{n+1} = h\left(−\frac{1}{12}f(t_n, y_n) + \frac{2}{3}f(t_{n+1}, y_{n+1}) + \frac{5}{12}f(t_{n+2}, y_{n+2})\right).
Where do these coefficients come from? We have to first expand \frac{ρ(w)}{\log w} about
w = 1:
    \frac{ρ(w)}{\log w} = \frac{w(w − 1)}{\log w} = 1 + \frac{3}{2}(w − 1) + \frac{5}{12}(w − 1)^2 + O(|w − 1|^3).
These aren't yet our coefficients of σ, since what we need to do is to rearrange the
first three terms to be expressed in terms of w. So we have
    \frac{ρ(w)}{\log w} = −\frac{1}{12} + \frac{2}{3}w + \frac{5}{12}w^2 + O(|w − 1|^3).
L. 13-47
An s-step backward differentiation method of order s is obtained by choosing
    ρ(w) = σ_s \sum_{\ell=1}^{s} \frac{1}{\ell}\, w^{s−\ell}(w − 1)^\ell,
with σ_s chosen such that ρ_s = 1, namely σ_s = \left(\sum_{\ell=1}^{s} \frac{1}{\ell}\right)^{-1}.

We need to construct ρ so that ρ(w) = σ_s w^s \log w + O(|w − 1|^{s+1}). Write
    \log w = −\log\left(\frac{1}{w}\right) = −\log\left(1 − \frac{w − 1}{w}\right) = \sum_{\ell=1}^{∞} \frac{1}{\ell}\left(\frac{w − 1}{w}\right)^{\ell}.
Multiplying both sides by σ_s w^s gives the desired result.
For this method to be convergent, we need to make sure it does satisfy the root
condition. It turns out the root condition is satisfied only for s ≤ 6. This is not
obvious at first sight, but we can certainly verify this manually.
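Rather than by hand, one can also check this numerically. The sketch below (my own, not from the notes; helper names are my choices, and the simplicity of any unit-modulus roots is not checked) builds ρ(w) for each s from the formula above and inspects the largest root modulus; it stays within 1 up to s = 6 and exceeds 1 at s = 7.

```python
# BDF root condition check via the largest root modulus of rho(w) (illustrative sketch).
import numpy as np

def bdf_rho(s):
    """rho(w) of the s-step BDF method as a numpy poly1d, normalised so that rho_s = 1."""
    rho = np.poly1d([0.0])
    for l in range(1, s + 1):
        rho = rho + (1.0 / l) * (np.poly1d([1.0, -1.0]) ** l) * np.poly1d([1.0] + [0.0] * (s - l))
    sigma_s = 1.0 / sum(1.0 / l for l in range(1, s + 1))
    return sigma_s * rho

for s in range(1, 8):
    max_mod = max(abs(bdf_rho(s).roots))
    print(s, round(float(max_mod), 4), "ok" if max_mod <= 1 + 1e-9 else "violated")
```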

13.4.3 Runge-Kutta methods


The general (implicit) ν-stage Runge-Kutta (RK) methods have the form
    y_{n+1} = y_n + h \sum_{\ell=1}^{ν} b_\ell k_\ell, \quad where \quad k_\ell = f\!\left(t_n + c_\ell h,\; y_n + h\sum_{j=1}^{ν} a_{\ell j} k_j\right).

The Runge-Kutta methods are complicated, and are tedious to analyse. They have
been ignored for a long time, until more powerful computers came along and made
them much more practical to use. Nowadays they are used quite a lot since they have
many nice properties. The Runge-Kutta methods is a one-step method. So once we
get order p ≥ 1, we will have convergence (we don’t need to worry about things like
root condition!).
There are a lot of parameters we have to choose. We need to pick

{b` }ν`=1 , {c` }ν`=1 , {a`j }ν`,j=1 .

In fact the optimal choice of parameters makes the method achieve order 2ν. The
conditions we need for a decent order are in general very complicated. However, we
can quickly obtain some necessary conditions. We can consider the case where f is a
constant. Then k_\ell is always that constant. So we must have \sum_{\ell=1}^{ν} b_\ell = 1. It turns
out we also need c_\ell = \sum_{j=1}^{ν} a_{\ell j} for \ell = 1, \cdots, ν. While these are necessary conditions,
they are not sufficient.

Note that in general, we have an implicit method, since \{k_\ell\}_{\ell=1}^{ν} have to be solved for, as
they are defined in terms of one another. However, for certain choices of parameters, we
can make this an explicit method. This makes it easier to compute, but we would have
lost some accuracy and flexibility. Note also unlike all the other methods we’ve seen
so far, the parameters appear inside f . They appear non-linearly inside the functions.
This makes the method much more complicated and difficult to analyse using Taylor
series; the algebra rapidly becomes unmanageable.

To describe a Runge-Kutta method, a standard notation is to put the coefficients in


the Butcher table :

    c_1 | a_{11}  · · ·  a_{1ν}
     ·  |   ·      · ·     ·
    c_ν | a_{ν1}  · · ·  a_{νν}        or, more compactly,        c | A
        |  b_1    · · ·   b_ν                                       | b^T

This table allows for a general implicit method. Initially, explicit methods came out
first, since they are much easier to compute. In this case, the matrix A is strictly lower
triangular, ie. a`j = 0 whenever ` ≤ j.

E. 13-48
The most famous explicit Runge-Kutta method is the 4-stage 4th order one, often
called the classical Runge-Kutta method . The formula can be given explicitly by

    y_{n+1} = y_n + \frac{h}{6}(k_1 + 2k_2 + 2k_3 + k_4),
where
    k_1 = f(x_n, y_n), \qquad k_2 = f\!\left(x_n + \tfrac{1}{2}h,\; y_n + \tfrac{1}{2}h k_1\right),
    k_3 = f\!\left(x_n + \tfrac{1}{2}h,\; y_n + \tfrac{1}{2}h k_2\right), \qquad k_4 = f(x_n + h,\; y_n + h k_3).

We see that this is an explicit method. We don’t need to solve any equations.
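The formula translates into code almost verbatim. The sketch below (my own; the test problem y′ = −y is an arbitrary choice) implements one RK4 step and illustrates the fourth-order behaviour: halving h should shrink the error at T = 1 by roughly a factor of 16.

```python
# The classical RK4 step for a scalar ODE (illustrative sketch).
import numpy as np

def rk4_step(f, t, y, h):
    k1 = f(t, y)
    k2 = f(t + 0.5 * h, y + 0.5 * h * k1)
    k3 = f(t + 0.5 * h, y + 0.5 * h * k2)
    k4 = f(t + h, y + h * k3)
    return y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

for h in (0.1, 0.05):
    t, y = 0.0, 1.0
    while t < 1.0 - 1e-12:
        y = rk4_step(lambda t, y: -y, t, y, h)
        t += h
    print(h, abs(y - np.exp(-1.0)))      # errors in roughly a 16 : 1 ratio
```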

E. 13-49
The Runge-Kutta methods can be motivated by Gaussian quadrature. Since ẏ = f ,
integrating over a single time step gives
    y(t_{n+1}) = y(t_n) + \int_{t_n}^{t_{n+1}} f(t, y(t))\,dt.

We can use the quadrature formula on the integral (transformed from [0, 1] to
[tn , tn+1 ]) to generate the “one-step method”

    y_{n+1} = y_n + h \sum_{\ell=1}^{ν} b_\ell\, f(t_n + c_\ell h,\; y(t_n + c_\ell h)).

But we certainly don’t know the exact values of {y(tn + c` h)}ν`=1 , so in some way
we have to approximate these values, which is where the k` comes in.

E. 13-50
Choosing the parameters for the Runge-Kutta method to maximize order is hard.
Consider the simplest case, the 2-stage explicit method. The general formula is

yn+1 = yn + h(b1 k1 + b2 k2 ),

where k_1 = f(x_n, y_n) and k_2 = f(x_n + c_2 h,\; y_n + a_{2,1} h k_1). Since c_\ell = \sum_{j=1}^{ν} a_{\ell j},
we need a_{2,1} = c_2. To analyse this, we insert the true solution into the method.
First, we need to insert the true solution of the ODE into the k’s. We get (below
we write tn instead of xn )

    k_1 = f(x_n, y_n) = y′(t_n)
    k_2 = f(t_n + c_2 h,\; y(t_n) + c_2 h\,y′(t_n))
        = y′(t_n) + c_2 h\left(\frac{\partial f}{\partial t}(t_n, y(t_n)) + ∇f(t_n, y(t_n))\cdot y′(t_n)\right) + O(h^2)
        = y′(t_n) + c_2 h\,y″(t_n) + O(h^2),
where we used that the bracketed term equals y″(t_n).

Hence, our local truncation error for the Runge-Kutta method is

    y(t_{n+1}) − y(t_n) − h(b_1 k_1 + b_2 k_2) = (1 − b_1 − b_2)\,h\,y′(t_n) + \left(\frac{1}{2} − b_2 c_2\right) h^2 y″(t_n) + O(h^3).

Now we see why Runge-Kutta methods are hard to analyse. The coefficients
appear non-linearly in this expression. It is still solvable in this case, in the
obvious way, but for higher stage methods, this becomes much more complicated.
In this case, we have a 1-parameter family of order 2 methods, satisfying
    b_1 + b_2 = 1, \qquad b_2 c_2 = \frac{1}{2}.
It is easy to check, using the simple equation y′ = λy, that it is not possible to
get a higher order method (the optimal 2-stage Runge-Kutta method, of order 4, is
implicit). So as long as our choice of b_1, b_2 and c_2 satisfies these equations, we get a
decent order 2 method. For example, for b_1 = b_2 = \frac{1}{2} and c_2 = 1 we have
<Heun's method>
    y_{n+1} = y_n + \frac{h}{2}\big(f(t_n, y_n) + f(t_n + h,\; y_n + h f(t_n, y_n))\big).
This formula also makes sense in that it can be derived from the Taylor expansion of y:
    y(t_{n+1}) = y(t_n) + h\,y′(t_n) + \frac{1}{2}h^2 y″(t_n) + O(h^3)
              = y(t_n) + h f(t_n, y(t_n)) + \frac{1}{2}h^2\left(\frac{f(t_n + h,\; y_n + h f(t_n, y_n)) − f(t_n, y_n)}{h} + O(h)\right) + O(h^3)
              = y(t_n) + \frac{h}{2}\big(f(t_n, y_n) + f(t_n + h,\; y_n + h f(t_n, y_n))\big) + O(h^3).

13.4.4 Stiff equations


Initially, when people were developing numerical methods, people focused mostly on
quantitative properties like order and accuracy like we did previously. More recently,
people started to look at structural properties. Often, equations come with some special
properties. For example, a differential equation describing the motion of a particle
would most probably conserve energy. When we approximate it numerically, we would
like the numerical approximation to satisfy conservation of energy as well. We want
to look at whether numerical methods preserve some of these nice properties. We are
not going to look at conservation of energy, this is too complicated. Instead, we look
at the following problem. Suppose we have a system of ODEs for 0 ≤ t ≤ T :

y0 (t) = f (t, y(t)) with y(0) = y0 .

Suppose T > 0 is arbitrary, and limt→∞ y(t) = 0. What restriction on h is necessary


for a numerical method to satisfy limn→∞ yn = 0? This question is still too compli-
cated for us to tackle. It can only be easily solved for linear problems, namely ODEs
of the form
y0 (t) = Ay(t) where A ∈ RN ×N

Firstly, for what A do we have y(t) → 0 as t → ∞? By some basic linear algebra, we


know this holds only if Re(λ) < 0 for all eigenvalues λ of A. It suffices for us to consider the
one-dimensional case

y 0 (t) = λy(t), with Re(λ) < 0.

It should be clear that if A is diagonalizable, then it can be reduced to multiple


instances of this case. Otherwise, we need to do some more work.
D. 13-51
• If we apply a numerical method to y 0 (t) = λy(t) with y(0) = 1, λ ∈ C, then its
linear stability domain is D = \{z = hλ : \lim_{n→∞} y_n = 0\}, where h is the step
size.
• A numerical method is said to be A-stable if, when it acts on the test equations
y′(t) = λy(t), we have C^− ≡ \{z ∈ C : Re(z) < 0\} ⊆ D.
E. 13-52
•  Consider the Euler method on y′(t) = λy(t). The discrete solution is
    y_n = y_{n−1} + hλ\,y_{n−1} = (1 + hλ)\,y_{n−1} = \cdots = (1 + hλ)^n.
Thus we get D = \{z ∈ C : |1 + z| < 1\}. We can visualize this on the complex plane
as the open unit disc centred at −1.

 Consider instead the backward Euler method on y′(t) = λy(t). We have
y_n = y_{n−1} + hλ\,y_n, so
    y_n = (1 − hλ)^{−n}.
Then we get D = \{z ∈ C : |1 − z| > 1\}. We can visualize this as the exterior of the
closed unit disc centred at 1.

 Again consider y′(t) = λy, with the trapezoidal rule. Then we can find
    y_n = \left(\frac{1 + hλ/2}{1 − hλ/2}\right)^{n} \quad\Longrightarrow\quad D = \left\{z ∈ C : \left|\frac{2 + z}{2 − z}\right| < 1\right\} = C^−.
Note that \left|\frac{2 + z}{2 − z}\right| < 1 says that z has to be closer to −2 than to 2, therefore
D = C^−.
So we see that the backward Euler method and the trapezoidal rule are A-stable, but
the Euler method is not.
• Suppose a method is A-stable. If y(t) → 0 as t → ∞, then Re(λ) < 0, so
Re(hλ) < 0, so hλ ∈ D, hence yn → 0 as n → ∞. This says that everything that
is meant to converge to 0 will converge to 0 under the numerical method if it’s
A-stable. (Note that A-stability does not mean that any step size will do! We still
need to choose h small enough to ensure the required accuracy.)
Usually, when applying a numerical method to the problem y 0 = λy, we get
yn = (r(hλ))n where r is some rational function. So D = {z ∈ C : |r(z)| < 1}. In
particular whether yn → 0 or not should depend on just hλ, not them separately.
A-stability is a very strong requirement. It is hard to get A-stability. In particular,
Dahlquist proved that no multi-step method of order p ≥ 3 is A-stable (the p = 2
barrier is attained by the trapezoidal rule). Moreover, no explicit Runge-Kutta
method can be A-stable (although implicit ones can).
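One can eyeball these stability domains numerically. The sketch below (my own, not from the notes; the grid and sampling choices are arbitrary) samples part of the left half-plane and reports the fraction of sampled points z with |r(z)| < 1 for each of the three one-step methods; the A-stable methods cover essentially all of it, Euler does not.

```python
# Sampling the linear stability domains of Euler, backward Euler and the trapezoidal rule.
import numpy as np

r = {
    "euler":          lambda z: 1 + z,
    "backward_euler": lambda z: 1 / (1 - z),
    "trapezoidal":    lambda z: (1 + z / 2) / (1 - z / 2),
}

x = np.linspace(-5.0, -1e-6, 200)
y = np.linspace(-5.0, 5.0, 200)
Z = x[None, :] + 1j * y[:, None]           # a grid covering part of C^-

for name, fn in r.items():
    frac = float(np.mean(np.abs(fn(Z)) < 1))   # fraction of sampled C^- points inside D
    print(f"{name}: {frac:.3f}")                # ~1.0 for the A-stable methods, < 1 for Euler
```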
C. 13-53
<A-stability and the maximum principle> Suppose when applying a nu-
merical method to the problem y 0 = λy, we get yn = (r(hλ))n where r is some
function. We want to know whether C− ⊆ D = {z ∈ C : |r(z)| < 1}, i.e. whether
the method is A-stable. Suppose we know r is holomorphic/analytic in C− . The
Maximum principle from complex analysis states that if a function g is analytic
and non-constant in an open set Ω ⊆ C, then |g| has no maximum in Ω. Since
|r| needs to have a maximum in the closure of C− , the maximum must occur on
the boundary. So to show |r| ≤ 1 on the region C− , we only need to show the
inequality holds on the boundary which is the imaginary axis and “infinity”.

E. 13-54
Consider the following 2-stage 3rd-order implicit Runge-Kutta method applied on
y 0 = λy,
    hk_1 = hλ\left(y_n + \tfrac{1}{4}hk_1 − \tfrac{1}{4}hk_2\right)
    hk_2 = hλ\left(y_n + \tfrac{1}{4}hk_1 + \tfrac{5}{12}hk_2\right).
This is a linear system for hk_1 and hk_2, whose solution is
    \begin{pmatrix} hk_1 \\ hk_2 \end{pmatrix} = \begin{pmatrix} 1 − \tfrac{1}{4}hλ & \tfrac{1}{4}hλ \\ −\tfrac{1}{4}hλ & 1 − \tfrac{5}{12}hλ \end{pmatrix}^{-1} \begin{pmatrix} hλ y_n \\ hλ y_n \end{pmatrix} = \frac{hλ y_n}{1 − \tfrac{2}{3}hλ + \tfrac{1}{6}(hλ)^2} \begin{pmatrix} 1 − \tfrac{2}{3}hλ \\ 1 \end{pmatrix},
and therefore
    y_{n+1} = y_n + \tfrac{1}{4}hk_1 + \tfrac{3}{4}hk_2 = \frac{1 + \tfrac{1}{3}hλ}{1 − \tfrac{2}{3}hλ + \tfrac{1}{6}(hλ)^2}\, y_n.

Thus we have yn+1 = r(hλ)yn where

    r(z) = \frac{6 + 2z}{6 − 4z + z^2}.

We first check if it is analytic. This certainly has some poles, but they are 2 ± 2i,

and are in the right-half plane. So this is analytic in C .
Next, what happens at the boundary of the left-half plane? Firstly, as |z| → ∞,
we find r(z) → 0, since we have a z 2 at the denominator. The next part is checking
when z is on the imaginary axis, say z = it with t ∈ R. Then we can check by some
messy algebra that |r(it)| ≤ 1 for t ∈ R. Therefore, by the maximum principle, we
must have |r(z)| ≤ 1 for all z ∈ C− .

13.4.5 Implementation of ODE methods


Throughout this section we have simplified matters by assuming that the step size h
is constant. In practice, under control of a well-written software package, this will not
be the case: i.e. the step size hn = tn+1 − tn will vary with n. The step size h is
not a pre-ordained quantity chosen by the user, but a parameter of the method. More
precisely, the user do not choose the step size, rather they choose an error tolerance:
this being the accuracy of the numerical approximation that the user requires. The
software package then chooses the step length hn (varying with n in general) to bound
a local estimate of the error by the user-given error tolerance. It is this strategy that
is called a time-stepping algorithm. We shall briefly describe a few of the key ideas
behind such algorithms for error control.

Local error estimation


The first problem we want to tackle is what h we (or in fact the program) should
pick to get the desired accuracy. Milne's device is a technique for estimating the local
error and thus choosing the step size h to achieve a given error tolerance. It uses two
multistep methods of the same order, one explicit and one implicit. It is the implicit
(and hence better) method that produces our approximation for the exact solution
y(tn+1 ); the explicit method is just used to provide a local error estimate for this
approximation. We illustrate the use of Milne's device with two 2nd-order multistep
methods: the explicit Adams-Bashforth method (AB) and the implicit trapezoidal rule
(TR).2 Recall the Adams-Bashforth method is given by

    y_{n+1} = y_n + \frac{h}{2}\big(3f(t_n, y_n) − f(t_{n−1}, y_{n−1})\big).
This is an order 2 method, with local truncation error η_{n+1} = \frac{5}{12}h^3 y'''(t_n) + O(h^4).
The trapezoidal rule is an order 2 implicit method,
    y_{n+1} = y_n + \frac{h}{2}\big(f(t_n, y_n) + f(t_{n+1}, y_{n+1})\big).
2
TR is a far better method than AB: it is A-stable, hence its global behaviour is superior.
Employing AB to estimate the local error adds very little to the overall cost of TR, since AB is an
explicit method.
This has a local truncation error of η_{n+1} = −\frac{1}{12}h^3 y'''(t_n) + O(h^4). The key to Milne's
device is the coefficients of h^3 y'''(t_n), namely
    c_{AB} = \frac{5}{12}, \qquad c_{TR} = −\frac{1}{12}.
These are called the error constants of the methods (when they exist). Since these are two
different methods, we get different y_{n+1}'s. We distinguish these by superscripts, and
have
    y(t_{n+1}) − y_{n+1}^{AB} ≃ c_{AB}\,h^3 y'''(t_n), \qquad y(t_{n+1}) − y_{n+1}^{TR} ≃ c_{TR}\,h^3 y'''(t_n).
We can now eliminate y'''(t_n) to obtain
    y(t_{n+1}) − y_{n+1}^{TR} ≃ \frac{−c_{TR}}{c_{AB} − c_{TR}}\,\big(y_{n+1}^{AB} − y_{n+1}^{TR}\big).
In this case, the constant we have is \frac{1}{6}. So we can estimate the local truncation error
for the trapezoidal rule, without knowing the value of y000 . We can then use this to
adjust h accordingly.
In general, for error control for multistep methods we employ a predictor-corrector pair :
that is, we use two multistep methods of the same order, one explicit (the predictor)
and the other implicit (the corrector). The predictor is employed not just to estimate
the local error for the corrector using Milne's device, but also to provide a good initial
guess for the solution of the implicit corrector equations. Depending on whether an
error tolerance has been achieved, we (or the program) amend the step size h.
Since we are using a multi-step method one might wonder, if we change the step size
halfway through a simulation, how are the unknown previous approximations
(e.g. yn−1 , yn−2 etc.) obtained? We cannot use the originals since they correspond to
a different step size. The answer is that they are obtained by suitable polynomial inter-
polation from the other approximations calculated with different step lengths.
The strategy used for error control of multistep methods (i.e. Milne's device with
predictor-corrector pairs) cannot be applied to RK methods. This is because the
nonlinear nature of RK methods means that the leading term in the local truncation
error is no longer simply an error constant multiplying a derivative of the exact solution.
We replace the idea of predictor-corrector pairs by embedded Runge-Kutta methods,
where a lower-order RK method is hidden inside a higher-order RK method. We will
not go into too much detail but in short, the embedded RK approach requires two
(typically explicit) RK methods: one having ν stages and order p, while the other
has ν + ` stages (with ` > 1) and order p + 1. The key restriction is that the first
ν stages of both methods must be identical. This restriction ensures that the cost
of implementing the higher-order method is marginal, once we have computed the
lower-order approximation.
The Zadunaisky device is a general technique for obtaining error estimates for nu-
merical approximations of initial-value ODEs. Suppose we have used an arbitrary
numerical method of order p and that we have stored the previously computed values
yn, yn−1, ..., yn−p (the time steps between them need not be equal). We construct the degree-p
interpolating polynomial (with vector coefficients) d, such that d(tn−i) = yn−i
for i = 0, 1, · · · , p, and consider the initial-value ODE

   z'(t) = f(t, z(t)) + (d'(t) − f(t, d(t)))   for t ∈ [tn, tn+1]   with z(tn) = yn.

Note that
1. Since d(t) − y(t) = O(h^{p+1}) and y'(t) = f(t, y(t)), the term d'(t) − f(t, d(t)) is usually
small. Therefore, the new ODE for z is a small perturbation of the original ODE.
2. The exact solution of this new ODE is z(t) = d(t).
So, having applied our numerical method to the original ODE to produce yn+1 , we
apply exactly the same numerical method and implementation details to the new ODE
to produce zn+1 . We then evaluate the error in zn+1 , namely zn+1 − d(tn+1 ), and use
it as an estimate of the error in yn+1 .
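A minimal sketch of the idea, assuming the underlying method is forward Euler (so p = 1) and using numpy's polynomial fitting for the interpolant d; all names are illustrative and not prescribed by the notes.

```python
import numpy as np

def zadunaisky_estimate(f, ts, ys, h):
    """Error estimate for one forward Euler step (order p = 1), given the
    stored values ys = [y_{n-1}, y_n] at times ts = [t_{n-1}, t_n]."""
    coeffs = np.polyfit(ts, ys, deg=len(ts) - 1)        # interpolant d(t)
    d = lambda t: np.polyval(coeffs, t)
    d_prime = lambda t: np.polyval(np.polyder(coeffs), t)
    tn, yn = ts[-1], ys[-1]

    y_next = yn + h * f(tn, yn)                         # same method, original ODE
    g = lambda t, z: f(t, z) + (d_prime(t) - f(t, d(t)))
    z_next = yn + h * g(tn, yn)                         # same method, perturbed ODE

    # The perturbed ODE has exact solution z = d, so this error is known exactly
    # and is used as an estimate of the error in y_next.
    return y_next, z_next - d(tn + h)

f = lambda t, y: -y
y1, est = zadunaisky_estimate(f, ts=[0.0, 0.1], ys=[1.0, np.exp(-0.1)], h=0.1)
```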

Solving for implicit methods


Implicit methods are often more likely to preserve nice properties like conservation of
energy. Since we have more computational power nowadays, it is often preferable to
use these more complicated methods. When using these implicit methods, we have to
come up with some way to solve the equations involved. As an example, we consider
the backward Euler method

yn+1 = yn + hf (tn+1 , yn+1 ).

There are various ways to solve this equation for yn+1. We describe the two commonest
types of iterative nonlinear solver:
1. Functional iteration: As the name suggests, this method is iterative, so we use
   superscripts to denote the iterates. In this case, we use the formula

      y^{(k+1)}_{n+1} = yn + hf(tn+1, y^{(k)}_{n+1}).

   Usually, we start with y^{(0)}_{n+1} = yn, or we use some simpler explicit method to obtain
   our first guess y^{(0)}_{n+1}. The question, of course, is whether y^{(k)}_{n+1} converges to
   yn+1 as k → ∞. Fortunately, it converges to a locally unique solution if λh is
   sufficiently small, where λ is the Lipschitz constant of f. For backward Euler,
   we require λh < 1. This relies on the contraction mapping theorem.
   Unlike Newton's method, functional iteration requires neither the solution of N × N
   linear systems nor the computation of Jacobian matrices. Hence it has a much
   lower computational cost than Newton's method (which we will see next). However,
   functional iteration needs λh < 1, and this restriction can (especially for stiff
   equations) lead to very small time-steps and a large amount of computation time.
2. Newton's method (Newton–Raphson method): yn+1 is the root of F(x) ≡ x −
   (yn + hf(tn+1, x)), so we use the scheme y^{(k+1)}_{n+1} = y^{(k)}_{n+1} − (∇F(y^{(k)}_{n+1}))^{−1} F(y^{(k)}_{n+1}).
   That is,

      y^{(k+1)}_{n+1} = y^{(k)}_{n+1} − z^{(k)}   with   (I − hJ^{(k)}) z^{(k)} = y^{(k)}_{n+1} − (yn + hf(tn+1, y^{(k)}_{n+1})),

   where J^{(k)} is the Jacobian matrix J^{(k)} = ∇f(tn+1, y^{(k)}_{n+1}) ∈ R^{N×N}. This requires
   us to first solve for z^{(k)}, but this is a linear system, which we have efficient
   methods for solving.
   There are several variants of Newton's method. The above is the full Newton's method,
   where we re-compute the Jacobian in every iteration. It is also possible to just
   use the same Jacobian J^{(0)} over and over again. This gains some speed in solving
   the linear system, but then we will need more iterations before we obtain our yn+1.
   The only role the Jacobian matrix plays is to ensure convergence: its precise
   value makes no difference to lim_{k→∞} y^{(k)}_{n+1}. Therefore we might replace it with a
   finite-difference approximation and/or evaluate it only once every several steps. Both
   solvers are sketched below.
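A rough sketch (with illustrative names and a user-supplied Jacobian, neither of which is prescribed by the notes) of the two solvers for a single backward Euler step:

```python
import numpy as np

def backward_euler_fixed_point(f, t_next, y_n, h, n_iter=20):
    """Solve y = y_n + h f(t_next, y) by functional iteration."""
    y = y_n.copy()
    for _ in range(n_iter):
        y = y_n + h * f(t_next, y)
    return y

def backward_euler_newton(f, jac, t_next, y_n, h, n_iter=10):
    """Solve the same equation by Newton's method; jac(t, y) is the Jacobian
    of f with respect to y, recomputed at every iteration (full Newton)."""
    y = y_n.copy()
    I = np.eye(len(y_n))
    for _ in range(n_iter):
        residual = y - (y_n + h * f(t_next, y))              # F(y)
        z = np.linalg.solve(I - h * jac(t_next, y), residual)
        y = y - z
    return y

# Example: the linear system y' = A y, whose Jacobian is simply A.
A = np.array([[-2.0, 1.0], [1.0, -2.0]])
f = lambda t, y: A @ y
jac = lambda t, y: A
y0 = np.array([1.0, 0.0])
print(backward_euler_fixed_point(f, 0.1, y0, h=0.1))
print(backward_euler_newton(f, jac, 0.1, y0, h=0.1))
```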

13.5 Numerical linear algebra


We start off with the simplest problem in linear algebra: given A ∈ Rn×n , b ∈ Rn ,
we want to find an x such that Ax = b. We all know about the theory — if A is
non-singular, then there is a unique solution for every possible b. Otherwise, there are
no solutions for some b and infinitely many solutions for other b’s.
D. 13-55
• A (square) matrix A is said to be upper triangular if Aij = 0 whenever i > j, and
lower triangular if Aij = 0 whenever i < j. We usually denote upper triangular
matrices as U and lower triangular matrices as L.
• A unit triangular matrix is a triangular matrix with 1 in all diagonal entries.
• A = LU is an LU factorization of A if U is an upper triangular matrix and L a
lower triangular matrix.3
E. 13-56
<Properties of triangular matrices> Triangular matrices are nice:
• (Determinants) It is easy to find the determinants of triangular matrices:

     det(L) = ∏_{i=1}^{n} L_ii,    det(U) = ∏_{i=1}^{n} U_ii.

  In particular, a triangular matrix is non-singular if and only if it has no zero
  entries in its diagonal.
• (Matrix equation) Suppose we wish to solve a lower triangular matrix equation

     [ L11   0   ···   0  ] [ x1 ]   [ b1 ]
     [ L21  L22  ···   0  ] [ x2 ] = [ b2 ]
     [  :    :         :  ] [ :  ]   [ :  ]
     [ Ln1  Ln2  ···  Lnn ] [ xn ]   [ bn ]

  There is nothing to solve in the first row: we can immediately write down the
  value of x1. Substituting this into the second row, we can then solve for x2. In
  general, we have

     xi = (1/Lii) ( bi − Σ_{j=1}^{i−1} Lij xj ).

  This is known as forward substitution. For upper triangular matrices, we can
  do a similar thing, but we have to solve from the bottom row upwards; this is
  known as backward substitution, and we have

     xi = (1/Uii) ( bi − Σ_{j=i+1}^{n} Uij xj ).

3
The lecturer of this course requires L to be a unit lower triangular matrix.

  It is helpful to analyse how many operations it takes to compute this: if we
  look carefully, solving Lx = b or Ux = b this way requires O(n²) operations.
• (Inverse) The ability to solve triangular matrix equations can be used to
  find the inverses of triangular matrices as well. The solution to Lxj = ej is the jth column
  of L−1. It is then easy to see from this that the inverse of a lower triangular
  matrix is also lower triangular. Similarly, the columns of U−1 are given by
  solving Uxj = ej, and we see that U−1 is also upper triangular.
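A minimal sketch of the two substitutions (illustrative code, not from the notes):

```python
import numpy as np

def forward_substitution(L, b):
    """Solve Lx = b for lower triangular L in O(n^2) operations."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

def backward_substitution(U, b):
    """Solve Ux = b for upper triangular U, working from the bottom row up."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

L = np.array([[2.0, 0.0], [1.0, 3.0]])
b = np.array([2.0, 7.0])
print(forward_substitution(L, b))   # [1. 2.]
```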

13.5.1 LU factorization
Since triangular matrices are easy to work with, it will be useful if we can factorise an
arbitrary matrix into products of triangular matrices. If we can find a LU factorization
A = LU of A where L and U are lower and upper triangular, then we can solve Ax = b
in two steps — we first find a y such that Ly = b. Then we find an x such that U x = y.
Then Ax = LU x = Ly = b.
Unfortunately, even if A is non-singular, it may not have an LU factorization. For
example ( 01 11 ) is non-singular with determinant −1, but we can manually check that
there is no LU factorization of this. On the other hand, while we don’t really like
singular matrices, singular matrices may still have LU factorizations. For example,
( 00 01 ) = ( 10 01 )( 00 01 ) is trivially an LU factorization of a singular matrix.

Construction and algorithm


To understand when LU factorizations exist, we try to construct one, and see when it
could fail. Let l1, l2, · · · , ln be the columns of L, and u1^T, u2^T, · · · , un^T be the rows
of U. Since L and U are triangular, li and ui must be zero in the first i − 1 entries.
Suppose this is an LU factorization of A. Then, expanding the product of L (with columns li)
and U (with rows ui^T) as a sum of rank-one terms, we can write

   A = LU = l1 u1^T + l2 u2^T + · · · + ln un^T.

If our A is invertible, then both L and U must have non-zero determinant, hence none
of their diagonal entries are 0. So, replacing li ui^T with (li/Lii)(ui^T Lii), we can assume
L is a unit lower triangular matrix. This extra condition will ensure that (as we will
see) our LU factorisation of A is unique. In the case that A is not invertible, there
exist matrices (e.g. ( 01 01 )) for which an LU factorisation exists, yet no LU factorisation
exists with L unit. But we will not care too much about non-invertible matrices, so here
we will impose the extra condition that L is unit.
For each i, we know li and ui have the first i − 1 entries zero. So the first i − 1
rows and columns of li ui^T are zero. In particular, the first row and column only have
contributions from l1 u1^T, the second row/column only has contributions from l2 u2^T and
l1 u1^T, etc. We can find the LU factorisation as follows:
1. Obtain l1 and u1 from the first row and column of A. Since the first entry of l1
is 1, u1^T is exactly the first row of A. We can then obtain l1 by taking the first
column of A and dividing by U11 = A11.
2. Obtain l2 and u2 from the second row and column of A − l1 u1^T similarly.

3. · · ·
4. Obtain ln and un from the nth row and column of A − Σ_{i=1}^{n−1} li ui^T.
We can turn this into an algorithm. We define intermediate matrices, starting with
A^(0) = A. For k = 1, · · · , n, we let

   U_kj = A^(k−1)_kj                          for j = k, · · · , n,
   L_ik = A^(k−1)_ik / A^(k−1)_kk             for i = k, · · · , n,
   A^(k)_ij = A^(k−1)_ij − L_ik U_kj          for i, j > k.

We only update the entries with i, j > k in the last step because we know A^(k)_ij would be 0 for any
other i, j. At the end of the algorithm, when k = n, we end up with a zero matrix, and
then U and L are completely filled.
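The algorithm translates almost directly into code; a rough sketch (no pivoting, so it assumes every pivot A^(k−1)_kk is non-zero) might be:

```python
import numpy as np

def lu_factorize(A):
    """LU factorization A = L U with L unit lower triangular.
    Breaks down (division by zero) if some pivot A^(k-1)_kk vanishes."""
    A = A.astype(float).copy()          # working copy: the intermediate A^(k)
    n = A.shape[0]
    L = np.zeros((n, n))
    U = np.zeros((n, n))
    for k in range(n):
        U[k, k:] = A[k, k:]                         # kth row of the remainder
        L[k:, k] = A[k:, k] / A[k, k]               # scaled so that L[k, k] = 1
        A[k:, k:] -= np.outer(L[k:, k], U[k, k:])   # subtract l_k u_k^T
    return L, U

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
L, U = lu_factorize(A)
print(np.allclose(L @ U, A))   # True
```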
We can now see when this will break down. A sufficient condition for A = LU to exist
is that A^(k−1)_kk ≠ 0 for all k. Since A^(k−1)_kk = U_kk, this sufficient condition ensures U,
and hence A, is non-singular. Conversely, if A is non-singular and an LU factorization
exists, then the algorithm always works, since we must have A^(k−1)_kk = U_kk ≠ 0. Moreover,
the LU factorization must be given by this algorithm. So the LU factorization of an
invertible matrix into LU with L unit is unique. This is not necessarily true if A
is singular. For example, ( 00 01 ) = ( a1 10 )( 00 01 ) for any real number a. The problem with
this sufficient condition is that most of these coefficients do not appear in the matrix
A: they are constructed during the algorithm, so we don't easily know what they are
in terms of the coefficients of A. We will later come up with an equivalent condition
on our original A that is easier to check.
Note that as long as this method does not break down, we need O(n³) operations
to perform this factorization. Recall we only needed O(n²) operations to solve the
equation after factorization. So the bulk of the work in solving Ax = b is in doing the
LU factorization. In fact, LU factorisation is a good way to
1. find det A: we use the formula det A = det L det U = det U = ∏_{k=1}^{n} U_kk, since L is
unit triangular.
2. find the inverse of A if it is non-singular: solving Axj = ej gives the jth column
of A−1. Note that we are solving the system with the same A for each j, so we
only have to perform the LU factorization once, and then solve n different
equations. So in total we need O(n³) operations.

Relation to Gaussian elimination

At the kth step of the LU algorithm, the operation A^(k) = A^(k−1) − lk uk^T (done for
entries i, j > k) has the property that the ith row of A^(k) is the ith row of A^(k−1)
minus Lik times uk^T (the kth row of A^(k−1)), i.e.

   [the ith row of A^(k)] = [the ith row of A^(k−1)] − Lik × [the kth row of A^(k−1)],

where the multipliers Lik = A^(k−1)_ik / A^(k−1)_kk are chosen so that, at the outcome, the (i, k)
entry of A^(k) is zero. This action is performed over i = k + 1, · · · , n, so that the last n − k
entries of the kth column are 0. This construction is analogous to Gaussian elimination
for solving Ax = b. Indeed, in the algorithm, we don't need to define a new A^(k) for
each k at every step; we can just overwrite the existing matrix A. In this case we also
manage to conveniently store U gradually in A (we store it in the rows that are in the
process of becoming 0). The algorithm becomes: for k = 1, · · · , n,

   Lik = Aik / Akk              for i = k, · · · , n,
   Aij = Aij − Lik Akj          for i, j > k.

The resulting A is our U. We see that in this process, A gradually becomes U through
row operations, just as in Gaussian elimination. We note in passing that, if one wishes,
the program can also be made to store L gradually in A, by making use of the columns
that are becoming zero (and not storing the unit diagonal entries of L, since we know
they are 1).
The difference between the LU and the Gaussian approach is that in the LU approach
we also store L; this allows us to ignore b (for solving Ax = b) until the
factorization is complete. To solve Ax = b for many different b using the LU algorithm,
O(n³) operations are only required for the single initial factorisation (which we only
need to do once); the solution for each new b then only requires O(n²) operations
(for the forward and backward substitutions). Whereas to solve Ax = b using Gaussian
elimination, we would need to perform the row operations not just on the matrix
A but also at the same time on the particular b, so for each new b we would need to
perform the whole elimination again, which requires O(n³) computational operations
each time.

LU factorization with (partial) pivoting


However, we still have the problem that the factorization is not always possible. Requiring
that we must factorize A itself as LU is too restrictive; the idea is to factor something closely
related, but not exactly A. Instead, we want a factorization

   P A = LU,

where P is a permutation matrix, whose action on A just permutes the rows
of A. So we want to factor A up to a permutation of rows. Note that P is invertible,
so A = P^{−1}LU. If we manage to do this, we can, as before, easily solve Ax = b,
since we know how to solve P Ax = P b. Now we will extend the previous algorithm
to allow permutations of rows, and we shall show that this factorization is possible for
all matrices.
Suppose our breakdown occurs at k = 1, ie. A^(0)_11 = A11 = 0. We find a permutation
matrix P1 and let it act via P1 A^(0). The idea is to look down the first column of A, and
find a row starting with a non-zero element, say row p. Then we use P1 to interchange rows
1 and p, so that P1 A^(0) has a non-zero top-most entry. For simplicity, we assume we
always apply a P1; if A^(0)_11 is non-zero in the first place, we just take P1 to be the
identity. After that, we can carry on. We construct l1 and u1 from P1 A^(0) as before,
and set A^(1) = P1 A^(0) − l1 u1^T.
But what happens if the first column of A is completely zero? Then no interchange
will make the (1, 1) entry non-zero. However, in this case, we don't actually have to
do anything. We can immediately find our l1 and u1, namely set l1 = e1 (or anything)
and let u1^T be the first row of A^(0). Then this already works. Note however that this
corresponds to A (and hence U) being singular, and we are not too interested in
these.
The later steps are exactly analogous. Suppose we have A^(k−1)_kk = 0. Again we find
a Pk such that Pk A^(k−1) has a non-zero (k, k) entry. We then construct lk and uk
from Pk A^(k−1) and set A^(k) = Pk A^(k−1) − lk uk^T. Again, if the kth column of A^(k−1) is
completely zero, we set lk = ek and uk^T to be the kth row of A^(k−1).
However, as we do this, the permutation matrices appear all over the place inside
the algorithm. It is not immediately clear that we do get a factorization of the form
P A = LU. Fortunately, keeping track of the interchanges, we do have an LU factor-
ization

   P A = L̃U,

where U is what we got from the algorithm, P = Pn−1 · · · P2 P1, while L̃ is the matrix
whose columns are l̃1, · · · , l̃n, where l̃k = Pn−1 · · · Pk+1 lk. Note that in particular, we have
l̃n−1 = ln−1 and l̃n = ln.
One problem we have not considered is the problem of inexact arithmetic. While
these formulae are correct mathematically, when we actually implement things, we do
them on computers with finite precision. As we go through the algorithm, errors will
accumulate, and the error might be amplified to a significant amount by the time we
reach the end. We want an algorithm that is insensitive to errors. In order to
work safely in inexact arithmetic, every time we permute the rows we will choose the
element of largest modulus in the kth column and put it in the (k, k)th position, not
just an arbitrary non-zero one, as this minimizes the error when dividing. We perform
this permutation at every step, even if the original element in the (k, k)th position is already
non-zero, i.e. our aim is now to permute the largest element to the (k, k)th position
rather than merely to get rid of a zero element.
So far we allow the permutation of rows. We can in fact allow the permutation of
columns as well, so that we have the factorisation P A Q = LU, where P and Q are
permutation matrices that reorder the rows and columns of A respectively. This
is called full pivoting. Now at every step, we don't just move the largest-modulus
element of the kth column to the (k, k)th position; we move the largest-modulus
element of the whole remaining matrix to the (k, k)th position. So this minimises the
effect of inexact arithmetic further. In practice, however, the extra computational effort
required for full pivoting is not regarded as worthwhile, and partial pivoting remains
the standard choice.
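A rough sketch of the partially pivoted factorization, tracking the permutation as a row index array (all names illustrative, not from the notes):

```python
import numpy as np

def lu_partial_pivot(A):
    """Return perm, L, U with A[perm] = L @ U, choosing at each step the
    largest-modulus element of the kth column as the pivot."""
    A = A.astype(float).copy()
    n = A.shape[0]
    perm = np.arange(n)
    L = np.zeros((n, n))
    for k in range(n):
        p = k + np.argmax(np.abs(A[k:, k]))       # row of largest pivot
        A[[k, p], :] = A[[p, k], :]               # swap rows of the remainder
        L[[k, p], :k] = L[[p, k], :k]             # keep earlier columns of L consistent
        perm[[k, p]] = perm[[p, k]]
        L[k:, k] = A[k:, k] / A[k, k]
        A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
    U = np.triu(A)
    return perm, L, U

A = np.array([[0.0, 1.0], [1.0, 1.0]])            # has no LU factorization itself
perm, L, U = lu_partial_pivot(A)
print(np.allclose(A[perm], L @ U))                # True
```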

13.5.2 Factorization theory


D. 13-57
• The leading principal submatrices Ak ∈ Rk×k for k = 1, · · · , n of A ∈ Rn×n are

(Ak )ij = Aij , i, j = 1, · · · , k.

In other words, we take the first k rows and columns of A.


• A matrix A ∈ Rn×n is positive-definite if xT Ax > 0 for all non-zero x ∈ Rn .
• A band matrix of band width r is a matrix A such that Aij 6= 0 implies |i−j| ≤ r.
In other words, all the non-zero elements of A reside in a narrow band of width
2r + 1 about the principal diagonal.
• A (square) matrix is called tridiagonal if it has band width 1.

• A leading zero of a string of numbers or a vector is any zero entry that comes before
the first non-zero entry.
E. 13-58
A band matrix of band width 0 is a diagonal matrix; a band matrix
of band width 1 is a tridiagonal matrix.

T. 13-59
A sufficient condition for both the existence and uniqueness of LU factorization
A = LU of a n × n matrix A with L unit is that det(Ak ) 6= 0 for k = 1, · · · , n − 1.

(Proof 1) By induction. This is trivially true for 1 × 1 matrices. Suppose it is
true for m × m matrices. Let A be an (m + 1) × (m + 1) matrix such that det(Ak) ≠ 0
for k = 1, · · · , m. Then A11 = det(A1) ≠ 0, so we can perform the first step of
our LU factorization algorithm to obtain A − l1 u1^T = A^(1), where (u1)i = A1i,
(l1)i = Ai1/A11, and A^(1) has its first row and column equal to 0, so we can write
A^(1) = ( 00 C0 ) where C is an m × m matrix.
Form the matrix Bk by adding the first column of Ak times −A1j/A11 to the jth
column for each j ≠ 1, so that (Bk)ij = (Ak)ij − Ai1 A1j/A11 for j ≠ 1. This
operation does not change the determinant; furthermore, the first row of Bk is
(A11, 0, · · · , 0). In fact we see that

   Bk = [ A11    0    ]      hence   det Ak = det Bk = A11 det Ck−1.
        [  ∗   Ck−1   ]

So C is a matrix such that det(Ck) ≠ 0 for k = 1, · · · , m − 1. So C has a unique
LU factorization C = l2 u2^T + · · · + lm um^T; now add a zero entry before the first
entry of each li and ui for i = 2, · · · , m. Then A^(1) = l2 u2^T + · · · + lm um^T, and so
A = l1 u1^T + · · · + lm um^T, and we obtain a unique LU factorization of A.
(Proof 2) We use induction on n, the size of the matrix. For n = 1, the result is
clear. Assume the result is true for (n − 1) × (n − 1) matrices and write A ∈ R^{n×n}
as

   A = [ An−1   b   ]
       [  c^T   Ann ].

We want

   A = LU   with   L = [ Ln−1  0 ]   and   U = [ Un−1   y   ]
                        [  x^T  1 ]             [  0^T   Unn ],

where Ln−1, Un−1 ∈ R^{(n−1)×(n−1)} and x, y ∈ R^{n−1}. Multiplying out these block
matrices, we see that we want to have

   [ An−1   b   ]   =   [ Ln−1 Un−1        Ln−1 y      ]
   [  c^T   Ann ]       [ x^T Un−1      x^T y + Unn    ].

By the induction assumption Ln−1 and Un−1 exist and are unique; furthermore
Un−1 is non-singular since it is assumed that An−1 is non-singular. Hence Ln−1 y =
b and x^T Un−1 = c^T can be uniquely solved to obtain the required x, y and Unn.
Note that we don't need A to be non-singular. Our condition is equivalent to the
restriction A^(k−1)_kk ≠ 0 for k = 1, · · · , n − 1. Note also that

   det(An) = det(A) = det(L) det(U) = det(U) = Unn det(Un−1),
   det(An−1) = det(Ln−1) det(Un−1) = det(Un−1).

So det(An) = Unn det(An−1), and hence by induction Ukk = det(Ak)/det(Ak−1)
for k = 2, · · · , n. Furthermore U11 = A11 = det(A1), and by definition of the
algorithm Ukk = A^(k−1)_kk for k = 1, · · · , n. So we have

   A^(0)_11 = det(A1)   and   A^(k−1)_kk = det(Ak)/det(Ak−1)   for k = 2, · · · , n.
T. 13-60
<LDU decomposition> If det(Ak ) 6= 0 for all k = 1, · · · , n, then A ∈ Rn×n
has a unique factorization of the form A = LDÛ where D is non-singular diagonal
matrix, and both L and Û are unit triangular (lower and upper respectively).

From the previous theorem, A = LU exists. Since A is non-singular, U is non-


singular. So we can write this as U = DÛ where D consists of the diagonals of U
and Û = D−1 U is unit.
This is not hard, but is rather convenient. In particular, it allows us to factorize
symmetric matrices in a nice way.
T. 13-61
If A is non-singular and an LU factorization of A with L unit exists, then
1. Ak is non-singular for k = 1, · · · , n − 1,
2. this LU factorization with L unit is unique.

1. We know det(A) = ∏_{i=1}^{n} Uii ≠ 0, so det(Ak) = ∏_{i=1}^{k} Uii ≠ 0 for
k = 1, · · · , n − 1.
2. By part 1 and [T.13-59].
P. 13-62
If A is non-singular, it is impossible for more than one LU factorization with L
unit to exist.
Suppose A = L1 U1 = L2 U2 with L1, L2 unit lower triangular and U1, U2 upper
triangular. If A is non-singular, then both U1 and U2 are non-singular, and hence the
equality L1 U1 = L2 U2 implies L2^{−1} L1 = U2 U1^{−1} = V. The product of lower/upper
triangular matrices is lower/upper triangular, and the inverse of a lower/upper triangular
matrix is lower/upper triangular. Consequently, V is simultaneously lower and
upper triangular, hence it is diagonal. Since L2^{−1} L1 has unit diagonal, we obtain
V = I.
T. 13-63
Let A ∈ Rn×n be symmetric and non-singular and det(Ak ) 6= 0 for all k = 1, · · · , n.
Then there is a unique “symmetric” factorization A = LDLT with L unit lower
triangular and D diagonal and non-singular.

From the previous theorem, we can factorize A uniquely as A = LDÛ . We take


the transpose to obtain A = AT = Û T DLT . This is a factorization of the form
“unit lower”-“diagonal”-“unit upper”. By uniqueness, we must have Û = LT .
Clearly this form of LU factorisation is a suitable exploitation of symmetry and
requires roughly half the storage of conventional LU. Specifically, to compute this
factorization, we let A^(0) = A and, for k = 1, 2, · · · , n, let lk be the multiple of the
kth column of A^(k−1) such that Lkk = 1. We then set Dkk = A^(k−1)_kk and form
A^(k) = A^(k−1) − Dkk lk lk^T.

Even with symmetric matrices, some form of pivoting is generally necessary, both
to avoid breakdown and to maintain accuracy when using inexact arithmetic.
Clearly, permuting the rows of A will destroy symmetry unless we simultaneously
permute the corresponding columns: i.e. A → P AP T , where P is a permutation
matrix. One would like to prove that, for any symmetric A a symmetric factori-
sation of the form P AP T = LDLT where L is unit lower triangular and D is
diagonal, exists. This however is not true, even if A is restricted to be nonsingu-
lar. Fortunately, the next best result is true: for any symmetric A, a symmetric
factorisation of the form P AP T = LT LT where L is unit lower triangular and T
is both symmetric and tridiagonal, exists.
T. 13-64
Let A ∈ Rn×n be a positive-definite matrix. Then det(Ak ) 6= 0 for all k = 1, · · · , n.

First consider k = n. To show A is non-singular, it suffices to show that Ax = 0


implies x = 0. But Ax = 0 implies xT Ax = 0, so by positive-definiteness x = 0.
Now suppose Ak y = 0 for k < n and y ∈ Rk. Then y^T Ak y = 0. We form a
new x ∈ Rn by taking y and padding it with zeros. Then x^T Ax = 0. By positive-
definiteness x = 0, so y = 0.
It follows from this result that a positive definite matrix always has a unique LU
factorisation with L unit. However, in practice we still do pivoting, due to concerns
about inexact arithmetic.
T. 13-65
A symmetric matrix A ∈ Rn×n is positive-definite iff we can factor it as A = LDLT
where L is unit lower triangular, D is diagonal and Dkk > 0 for all k.

(Backward) We have x^T Ax = x^T LDL^T x = (L^T x)^T D (L^T x). We let y = L^T x.
Note that y = 0 if and only if x = 0, since L is invertible. So

   x^T Ax = y^T Dy = Σ_{k=1}^{n} Dkk yk² > 0   for all y ≠ 0.

(Forward) Since A is positive definite and symmetric, by the previous two theorems
A = LDL^T, where L is unit lower triangular and D is diagonal. Now we have to
show Dkk > 0. We define yk such that L^T yk = ek, which exists since L is
invertible. Then clearly yk ≠ 0, and we have

   Dkk = ek^T D ek = yk^T L D L^T yk = yk^T A yk > 0.
In fact if A is both symmetric and positive definite, pivoting is no longer required
either theoretically or practically.
This result gives a practical check for whether a symmetric A is positive definite:
we can perform this LU factorization, and then check whether the diagonal has
positive entries. This decomposition of a symmetric positive-definite matrix A
into A = LDL^T, with L unit lower triangular and D a positive-definite diagonal
matrix, is called Cholesky factorization. There is another way of stating this
factorization. We let D^{1/2} be the "square root" of D, obtained by taking the positive
square root of each diagonal entry of D. Then we have

   A = LDL^T = L D^{1/2} D^{1/2} L^T = (L D^{1/2})(L D^{1/2})^T = G G^T,

where G is lower triangular with Gkk > 0.
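A small sketch of the symmetric factorization (illustrative code; it also serves as a positive-definiteness check by inspecting the signs of the Dkk):

```python
import numpy as np

def ldl_factorize(A):
    """Symmetric factorization A = L D L^T with L unit lower triangular.
    For symmetric positive definite A all Dkk come out positive, and
    G = L @ np.diag(np.sqrt(d)) is then the Cholesky factor with A = G G^T."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for k in range(n):
        d[k] = A[k, k]
        L[k+1:, k] = A[k+1:, k] / d[k]
        # A^(k) = A^(k-1) - D_kk l_k l_k^T on the remaining block
        A[k+1:, k+1:] -= d[k] * np.outer(L[k+1:, k], L[k+1:, k])
    return L, d

A = np.array([[4.0, 2.0], [2.0, 3.0]])
L, d = ldl_factorize(A)
print(d)                                     # [4. 2.] -- all positive, so A is SPD
print(np.allclose(L @ np.diag(d) @ L.T, A))  # True
```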

T. 13-66
Let A = LU be the LU factorization of the non-singular matrix A with L unit.
Then
1. all leading zeros in the rows of A to the left of the principal diagonal are
inherited by L,
2. all leading zeros in the columns of A above the principal diagonal are inherited
by U .

We know that Ukk ≠ 0 for k = 1, · · · , n since A is non-singular. If Ai,1 = Ai,2 = · · · = 0
are the leading zeros in the ith row, then we obtain

   0 = Ai,1 = Li,1 U1,1 =⇒ Li,1 = 0,
   0 = Ai,2 = Li,1 U1,2 + Li,2 U2,2 =⇒ Li,2 = 0,
   0 = Ai,3 = Li,1 U1,3 + Li,2 U2,3 + Li,3 U3,3 =⇒ Li,3 = 0,   and so on.

Similarly for the leading zeros in the jth column.

A matrix A is called a sparse matrix if nearly all of its elements are zero. It is
often required to solve very large systems Ax = b where A is sparse. The efficient
solution of such systems should exploit the sparsity. In particular, we wish the
matrices L and U to inherit as much as possible of the sparsity of A, so that
the cost of performing the forward and backward substitutions with L and U is
comparable with the cost of forming the product Ax: i.e. the cost of computation
should be determined by the number of nonzero entries, rather than by n. Hence
this and the next result are useful.

Also, this result suggests that for the factorization of a sparse matrix A, one might
try to reorder its rows and columns beforehand so that many of the zero elements
become leading zeros in rows and columns. Thus we are using interchanges to re-
duce the fill-in for L and U, rather than to prevent breakdown of the factorisation.

P. 13-67
If a band matrix A has band width r and an LU factorization A = LU , then L
and U are both band matrices of width r.

Follows from the above result.

13.6 Linear least squares


We will consider the problem Ax = b where A ∈ Rm×n with m > n, b ∈ Rm and
x ∈ Rn . This is an over-determined system. In general, this has no solution, so
we aim to find the “best” solution to this system of equations. More precisely, we
want to find an x∗ ∈ Rn that minimizes kAx − bk2 (Euclidean norm). This is the
least squares problem . Problems of this form occur frequently when we collect m
observations (xi , yi ), which are typically prone to measurement error, and wish to fit
them to an n-variable linear model, typically with m  n. In statistics, this is called
linear regression.

T. 13-68
A vector x∗ ∈ Rn minimizes kAx − bk2 if and only if AT (Ax∗ − b) = 0.

(Forward) x∗ minimizes

   f(x) = ⟨Ax − b, Ax − b⟩ = x^T A^T A x − 2 x^T A^T b + b^T b.

Then the partial derivatives of f evaluated at x∗ must vanish. We have
∇f(x) = 2A^T(Ax − b). So a necessary condition is A^T(Ax∗ − b) = 0.
(Backward) We have A^T(Ax∗ − b) = 0. For any x ∈ Rn write x = x∗ + y; then

   ‖Ax − b‖² = ‖A(x∗ + y) − b‖² = ‖Ax∗ − b‖² + 2y^T A^T(Ax∗ − b) + ‖Ay‖²
             = ‖Ax∗ − b‖² + ‖Ay‖² ≥ ‖Ax∗ − b‖².

So x∗ must minimize the Euclidean norm.
This result makes sense geometrically: let U be the space spanned by the columns of
A. Then Ax is a point in U, so ‖Ax − b‖ is the distance between b and a point
in U. So if x∗ minimises ‖Ax − b‖, then Ax∗ − b must be orthogonal to the space
U, hence A^T(Ax∗ − b) = 0.
P. 13-69
If A ∈ Rm×n is a full-rank matrix, then there is a unique solution to the least
squares problem.

We know all minimizers are solutions to (A^T A)x = A^T b. The matrix A being full
rank means that y ≠ 0 ∈ Rn implies Ay ≠ 0 ∈ Rm. Now

   x^T A^T A x = (Ax)^T(Ax) = ‖Ax‖² > 0   for all x ≠ 0.

Hence A^T A ∈ R^{n×n} is positive definite (and in particular non-singular). So we
can invert A^T A and find a unique solution x.
Now to find the x∗ minimizing ‖Ax − b‖², we just need to solve the normal equations
A^T Ax = A^T b. If A has full rank, then the Gram matrix A^T A is non-singular,
and there is a unique solution. If not, then the general theory of linear equa-
tions tells us there are either infinitely many solutions or no solutions. But for this
particular form of equations, it turns out there will always be a solution.
However, solving these equations is sometimes not the best way of finding x∗, for
accuracy or practical reasons. For example, A may have useful sparsity properties
which are lost when forming A^T A. The "squaring" process is inherently danger-
ous, and A^T A can be a much more ill-conditioned matrix than A. With inexact
arithmetic, A can have full rank but A^T A may be singular to the computer (eg.
due to rounding errors when numbers get large: (10⁸)² = 10¹⁶). Instead, a better
approach to the problem makes use of QR factorization.
D. 13-70
• A QR factorization of an m × n matrix A is a factorization of the form A = QR
where Q ∈ Rm×m is an orthogonal matrix, and R ∈ Rm×n is an upper triangular
matrix.4 Since the last m − n rows of R are 0, we can remove these useless rows in
R and corresponding useless column in Q, so that we obtain the “skinny”5 version
of QR factorization A = Q̃R̃ where Q̃ ∈ Rm×n and R̃ ∈ Rn×n .
4
Note that here R is not a square matrix, but we have the same definition of upper triangular,
that is Rij = 0 whenever i > j.
5
Sometimes called “thin” or “reduced” instead.

• We say the matrix R is in a standard form if it has the property that the number
of leading zeros in each row increases strictly monotonically: i.e. if Ri,ji is the first
non-zero entry in the ith row, then j1 , · · · , jp is a strictly monotonically increasing
sequence, where p is the last row with a non-zero element. Completely zero rows
of R are allowed, but they must all be at the bottom of R.
• In Rm, where m > 2, we define the Givens rotation with the 3 parameters 1 ≤ p < q ≤ m
  and θ ∈ [−π, π] to be the matrix Ω_θ^[p,q] ∈ R^{m×m} that agrees with the identity
  matrix everywhere except in the pth and qth rows and columns, where

     (Ω_θ^[p,q])_pp = cos θ,   (Ω_θ^[p,q])_pq = sin θ,
     (Ω_θ^[p,q])_qp = −sin θ,  (Ω_θ^[p,q])_qq = cos θ.
• For u ≠ 0 ∈ Rm, we define the Householder reflection by

     Hu = I − 2 (u u^T)/(u^T u) ∈ R^{m×m}.
E. 13-71
• Every A ∈ Rm×n has a QR factorization, as we will soon show, but this is not
unique (eg. we can multiply both Q and R by −1, or in fact just changing
sign in a column of Q and a corresponding row of R). In the “skinny” QR
factorization, with A = Q̃R̃, we will see that if A has full rank, then the skinny
QR is unique up to a sign, ie. unique if we require R̃kk > 0 for k = 1, · · · , n.
• Let A = QR be a QR factorisation. If we denote the columns of A and Q
  by {aj}_{j=1}^{n} and {qj}_{j=1}^{m} respectively, then we see that aj = Σ_{i=1}^{j} qi Rij for
  j = 1, 2, · · · , n. In other words, the jth column of A is expressed as a linear
  combination of the first j columns of Q (remember that the columns of Q form
  an orthonormal set in Rm).
• A key property of orthogonal matrices is that ‖Qx‖ = ‖x‖ for all x ∈ Rm. Once
  we have the QR factorization of A, we can multiply Ax − b by Q^T = Q^{−1} inside
  the norm and get the equivalent problem of minimizing ‖Rx − Q^T b‖. We will
  not go into details, but it should be clear that this is not too hard to solve. The
  key to why QR factorization is useful here is that Q preserves distance while R
  is easy to deal with.
• If R is in standard form, then the matrix equation Rx = b is easy to analyse. For
  example, consider

     [ R11  R12  R13  R14 ] [ x1 ]   [ b1 ]
     [ 0    0    R23  R24 ] [ x2 ] = [ b2 ]
     [ 0    0    0    0   ] [ x3 ]   [ 0  ]
                            [ x4 ]

  where R11, R23 ≠ 0. This has infinitely many solutions. In particular, we see
  that x4 and x2 can be freely chosen; then x3 is determined by R23 x3 +
  R24 x4 = b2, and after that x1 is determined by R11 x1 + · · · + R14 x4 = b1. In
  general, for R in standard form, the variables other than the xji can be freely chosen,
  and the xji for i = 1, · · · , p are then determined by working up from the bottom
  non-zero row.
We shall demonstrate three standard algorithms for QR factorization:
1. Gram-Schmidt factorization: The Gram-Schmidt process is used to orthogo-
nalise the columns of A.
2. Givens rotations: Simple rotation (hence orthogonal) matrices are used to
gradually transform A element-by-element into “upper triangular” form.
3. Householder reflections: Simple reflection (hence orthogonal) matrices are
used to gradually transform A column-by-column into “upper triangular”
form.

13.6.1 Gram-Schmidt factorization


First algorithm
This targets the skinny version; for convenience we'll not write the tildes, and
just write Q ∈ R^{m×n} and R ∈ R^{n×n}. We will use the Gram–Schmidt process to orthogo-
nalize the columns of A. We write A = (a1 · · · an) and Q = (q1 · · · qn).
By definition of the QR factorization, we need aj = Σ_{i=1}^{j} Rij qi. This is done in
the usual Gram–Schmidt way.
1. To construct column 1, if a1 ≠ 0, then we set

      q1 = a1/‖a1‖   and   R11 = ‖a1‖.

   Note that the only non-unique possibility here is the sign — we can let R11 = −‖a1‖
   and q1 = −a1/‖a1‖ instead. But if we require R11 > 0, then this is fixed. In the
   degenerate case a1 = 0, we can just set R11 = 0 and pick any q1 ∈ Rm with
   ‖q1‖ = 1.
2. For columns 1 < k ≤ n: for i = 1, · · · , k − 1, we set Rik = ⟨qi, ak⟩, and set

      dk = ak − Σ_{i=1}^{k−1} qi ⟨qi, ak⟩.

   If dk ≠ 0, then we set

      qk = dk/‖dk‖   and   Rkk = ‖dk‖.

   In the case where dk = 0, we again set Rkk = 0, and pick qk to be any unit vector
   orthogonal to q1, · · · , qk−1.
When using this algorithm we see that if A ∈ Rm×n (m ≥ n) has full rank (so no
degenerate cases), the only lack of uniqueness in the solution is the choice of sign for
each column of Q and corresponding row of R. Thus its skinny QR factorisation is
unique provided we impose the restrictions Rii > 0 for i = 1, · · · , n.
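A rough sketch of this (classical) Gram–Schmidt QR for a full-rank A, with the sign convention Rkk > 0 (illustrative code, not from the notes):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Skinny QR factorization A = Q R via classical Gram-Schmidt.
    Assumes A (m x n, m >= n) has full rank, so no degenerate columns occur."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        d = A[:, k].astype(float).copy()
        for i in range(k):
            R[i, k] = Q[:, i] @ A[:, k]
            d -= R[i, k] * Q[:, i]
        R[k, k] = np.linalg.norm(d)
        Q[:, k] = d / R[k, k]
    return Q, R

A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(2)))  # True True
```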
When A has full rank, we can also link its unique skinny QR factorisation with
the unique Cholesky factorisation of the Gram matrix AT A (recall AT A ∈ Rn×n
is symmetric positive definite and so has AT A = GGT , where G ∈ Rn×n is a lower
triangular matrix with positive diagonal elements). Thus AG−T ∈ Rm×n satisfies
(AG−T )T AG−T = G−1 AT AG−T = I and so the columns of AG−T form an orthonor-
mal set in Rm . Hence A = AG−T GT is our unique skinny QR factorisation, with
AG−T playing the role of Q and GT playing the role of R.

Second algorithm
When A doesn't have full rank (say it has rank p < n), we'll encounter cases like
a1 = 0 or dk = 0. Instead of defining an arbitrary vector qk orthonormal to q1, · · · , qk−1, we
might want to postpone introducing a new q until the next step, so that what would have
been qk takes on the role of qk+1. If we do this, we find that at the end the p orthonormal
vectors q1, · · · , qp are enough to create all the columns a1, · · · , an of A. So our factorisation
A = QR looks like (a1 · · · an) = (q1 · · · qp ∗ · · · ∗)R, where R is in standard
form with its last n − p rows zero. We can remove the last n − p rows of R and the
corresponding last n − p columns of Q to get the skinny version.
More precisely, we have the following algorithm (which also works for full rank) producing
the skinny version. Set j = 0, k = 0, where we use j to keep track of the number of
columns of A and R that have already been considered, and k to keep track of the number
of columns of Q that have been formed (k ≤ j). On termination of the algorithm, we will
have k = p, the rank of A.
1. Increase j by 1.
   If k = 0, set dj = aj.
   If k ≥ 1, set Rij = ⟨qi, aj⟩ for i = 1, 2, · · · , k and compute dj = aj − Σ_{i=1}^{k} Rij qi.
2. If dj ≠ 0, set qk+1 = dj/‖dj‖, Rk+1,j = ‖dj‖, put Rij = 0 for all i with
   j ≥ i ≥ k + 2 (if j ≥ k + 2), and then increase k by 1.
   If dj = 0, put Rij = 0 for all i with j ≥ i ≥ k + 1.
3. Terminate if j = n, otherwise go to Step 1.
As constructed in the above algorithm, the first non-zero element of each row of R also
has the property that it is positive. This gives us conditions that make the skinny QR
factorisation unique. If A ∈ Rm×n has rank p ≤ n, then the skinny QR factorisation
is unique if R is in standard form and the first non-zero element in each row of R is
greater than zero.
In practice, a slightly different algorithm (modified Gram-Schmidt process) is used,
which is (much) superior with inexact arithmetic. The modified Gram-Schmidt process
is in fact the same algorithm, but performed in a different order in order to minimize
errors. However, this is often not an ideal algorithm for large matrices, since there are
many divisions and normalizations involved in computing the qi , and the accumulation
of errors will cause the resulting matrix Q to lose orthogonality.

13.6.2 Givens rotations


This works with the full QR factorization. Recall that in R², a clockwise rotation by
θ ∈ [−π, π] is performed by

   [ cos θ   sin θ ] [ α ]   [  α cos θ + β sin θ ]
   [ −sin θ  cos θ ] [ β ] = [ −α sin θ + β cos θ ].

By choosing θ such that cos θ = α/√(α² + β²) and sin θ = β/√(α² + β²), we have

   [ cos θ   sin θ ] [ α ]   [ √(α² + β²) ]
   [ −sin θ  cos θ ] [ β ] = [     0      ].

Of course, by choosing a slightly different θ, we can instead make the result zero in the first
component and √(α² + β²) in the second.
Note that for y ∈ Rm, the rotation Ω_θ^[p,q] only alters the pth and qth components of y. In
general, for B ∈ R^{m×n}, Ω_θ^[p,q] B only alters the pth and qth rows of B. Moreover, just like
the R² case, given a particular z ∈ Rm, we can choose θ such that the qth component
(Ω_θ^[p,q] z)_q = 0.
Hence, A ∈ R^{m×n} can be transformed into an "upper triangular" form by applying
s = mn − n(n + 1)/2 Givens rotations, since we need to introduce s many zeros. Then
Qs · · · Q1 A = R. We'll illustrate this with an example of a matrix A ∈ R^{4×3}. We will
apply the Givens rotations in the following order:

   Ω_θ6^[3,4] Ω_θ5^[2,4] Ω_θ4^[2,3] Ω_θ3^[1,4] Ω_θ2^[1,3] Ω_θ1^[1,2] A = R.

The matrix A transforms as follows:

   [× × ×]         [× × ×]         [× × ×]         [× × ×]
   [× × ×]  [1,2]  [0 × ×]  [1,3]  [0 × ×]  [1,4]  [0 × ×]
   [× × ×]  --->   [× × ×]  --->   [0 × ×]  --->   [0 × ×]
   [× × ×]         [× × ×]         [× × ×]         [0 × ×]

           [× × ×]         [× × ×]         [× × ×]
    [2,3]  [0 × ×]  [2,4]  [0 × ×]  [3,4]  [0 × ×]
    --->   [0 0 ×]  --->   [0 0 ×]  --->   [0 0 ×]
           [0 × ×]         [0 0 ×]         [0 0 0]

Note that when applying, say, Ω_θ4^[2,3], the zeros of the first column are preserved, since
Ω_θ4^[2,3] only mixes together rows 2 and 3, both of which are zero in the first
column. So we are safe. This gives us something of the form Qs · · · Q1 A = R. We can
obtain a QR factorization by inverting to get

   A = Q1^T · · · Qs^T R.

However, we don't really need to do this if we just want to solve the least
squares problem, since for that we need to multiply by Q^T, not Q, and Q^T is exactly
Qs · · · Q1. Note finally that we can perform each rotation so that the diagonal entries
of A remain non-negative, so if A has full rank, at the end we can simply remove the
redundant rows/columns to get the unique skinny version
(as the diagonal entries are positive).
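A rough sketch of QR by Givens rotations, zeroing one sub-diagonal entry at a time (illustrative code, not from the notes):

```python
import numpy as np

def givens_qr(A):
    """QR factorization by Givens rotations: returns Q, R with A = Q R."""
    A = A.astype(float)
    m, n = A.shape
    R = A.copy()
    Q = np.eye(m)
    for p in range(n):                 # column whose sub-diagonal we clear
        for q in range(p + 1, m):      # row of the entry to zero
            a, b = R[p, p], R[q, p]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            G = np.array([[c, s], [-s, c]])   # acts on rows p and q only
            R[[p, q], :] = G @ R[[p, q], :]
            Q[[p, q], :] = G @ Q[[p, q], :]   # accumulates Q_s ... Q_1
    return Q.T, R                      # A = Q R with Q = (Q_s ... Q_1)^T

A = np.random.rand(4, 3)
Q, R = givens_qr(A)
print(np.allclose(Q @ R, A), np.allclose(np.tril(R, -1), 0))  # True True
```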

13.6.3 Householder reflections

Note that Hu is symmetric, and we can see Hu² = I. So Hu is indeed orthogonal.
To see that it is a reflection, resolve x into parallel and perpendicular parts as
x = αu + w ∈ Rm, where α = u^T x/(u^T u) and u^T w = 0. Then we have Hu x =
−αu + w. So this is a reflection in the (m − 1)-dimensional hyperplane u^T y = 0. What
is the cost of computing Hu z? This is evaluated as

   Hu z = z − 2 (u^T z / u^T u) u.

This only requires O(m) operations, which is nice.
L. 13-72
1. Let a, b ∈ Rm with a ≠ b but ‖a‖ = ‖b‖. Then if we pick u = a − b, we get
   Hu a = b.
2. Let a, b ∈ Rm with (ak, · · · , am) ≠ (bk, · · · , bm) but Σ_{j=k}^{m} aj² = Σ_{j=k}^{m} bj². If
   u = (0, 0, · · · , 0, ak − bk, · · · , am − bm)^T, then Hu a = (a1, · · · , ak−1, bk, · · · , bm)^T.

1. Hu a = a − [2(‖a‖² − a^T b) / (‖a‖² − 2a^T b + ‖b‖²)] (a − b) = a − (a − b) = b,
   using ‖a‖ = ‖b‖.
2. This is a generalization of 1. The proof is just straightforward verification,
   noting the next lemma.

These results are obvious if we draw some pictures in low dimensions.
L. 13-73
Suppose the first k − 1 components of u are zero, then
1. For every x ∈ Rm , Hu x does not alter the first k − 1 components of x.
2. If the last m − k + 1 components of y ∈ Rm are zero, then Hu y = y.

These are all obvious from the definition. All they say is that a reflection does not affect
components perpendicular to u, and in particular fixes all vectors perpendicular
to u.

So we begin our algorithm: we want to clear the first column of A. We let a be the
first column of A, and assume a ∈ Rm is not already in the correct form, ie. a is not
a multiple of e1. Then we define u = a ∓ ‖a‖e1, where either choice of the sign is
pure-mathematically valid (we will later see that one choice is better when we have
to deal with inexact arithmetic). Then by part 1 of [L.13-72] we end up with
H1 a = Hu a = ±‖a‖ e1, hence H1 A has the form

   [ × × ··· × ]
   [ 0 × ··· × ]
   [ :  :      : ]
   [ 0 × ··· × ].

To do the next step, we need to be careful, since we don't want to destroy the previously
created zeros. We let a′ be the second column of H1 A, and assume a′3, · · · , a′m are
not all zero, ie. (0, a′2, · · · , a′m)^T is not a multiple of e2. We choose u′ = (0, a′2 ∓
γ, a′3, · · · , a′m)^T, where γ = (Σ_{j=2}^{m} (a′j)²)^{1/2}. Then by part 2 of [L.13-72],
H2 a′ = Hu′ a′ = (a′1, ±γ, 0, · · · , 0)^T, and H2 H1 A has the form

   [ × × × ··· × ]
   [ 0 × × ··· × ]
   [ 0 0 × ··· × ]
   [ :  :  :     : ]
   [ 0 0 × ··· × ].

The first column (and row) is unchanged, by [L.13-73]. Suppose we have reached
Hk−1 · · · H1 A, where the first k − 1 rows are of the correct form. We consider a^(k), the
kth column of Hk−1 · · · H1 A. We assume (0, · · · , 0, a^(k)_k, · · · , a^(k)_m)^T is not a multiple
of ek. Choosing

   u^(k) = (0, · · · , 0, a^(k)_k ∓ γ^(k), a^(k)_{k+1}, · · · , a^(k)_m)^T   where   γ^(k) = ( Σ_{j=k}^{m} (a^(k)_j)² )^{1/2},

we find that Hk a^(k) = H_{u^(k)} a^(k) = (a^(k)_1, · · · , a^(k)_{k−1}, ±γ^(k), 0, · · · , 0)^T. So in
Hk · · · H1 A the first k columns are now cleared below the diagonal, where previously only
the first k − 1 columns of Hk−1 · · · H1 A were. Note that Hk does not alter the first k − 1
rows and columns of Hk−1 · · · H1 A.
There is one thing we have to decide on — which sign to pick. As mentioned, this does
not matter in pure mathematics, but with inexact arithmetic we should pick the sign
in ak ∓ γ such that ak ∓ γ has maximum magnitude, ie. ak ∓ γ = ak + sgn(ak)γ. It
takes some analysis to justify why this is the right choice, but it is not too surprising
that some choice is better than the other. Alternatively, we could choose the sign so
that the diagonal entries are non-negative; then, if A has full rank, at the end we can
simply remove the redundant rows/columns to get the unique skinny version of the
factorisation.
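A rough sketch of Householder QR (illustrative code, not from the notes; the sign of γ is chosen to avoid cancellation, as discussed above):

```python
import numpy as np

def householder_qr(A):
    """QR factorization by Householder reflections: returns Q, R with A = Q R."""
    A = A.astype(float)
    m, n = A.shape
    R = A.copy()
    Q = np.eye(m)
    for k in range(n):
        x = R[k:, k]
        gamma = np.linalg.norm(x)
        if gamma == 0.0:
            continue                  # column already zero below the diagonal
        u = x.copy()
        u[0] += np.sign(x[0]) * gamma if x[0] != 0 else gamma   # a_k + sgn(a_k) gamma
        beta = 2.0 / (u @ u)
        # Apply H = I - beta u u^T to the trailing rows of R and of the accumulator
        R[k:, :] -= beta * np.outer(u, u @ R[k:, :])
        Q[k:, :] -= beta * np.outer(u, u @ Q[k:, :])
    return Q.T, R                     # the accumulator holds H_n ... H_1

A = np.random.rand(5, 3)
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A), np.allclose(np.tril(R, -1), 0))  # True True
```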
So how do Householder and Givens compare? We notice that the Givens method
generates zeros at one entry at a time, while Householder does it column by column.
So in general, the Householder method is superior. However, for certain matrices with
special structures, we might need the extra delicacy of introducing a zero one at a
time. For example, if A already has a lot of zero entries in the lower triangular part
(eg. a band matrix), then it might be beneficial to just remove the few non-zero entries
one by one.

13.6.4 Solving least squares problems with QR factorisation


Suppose that A ∈ R^{m×n} has full rank and that we have a QR factorisation of A.
Then the least squares problem min_x ‖Ax − b‖ is the same as min_x ‖Rx − Q^T b‖.
The last m − n rows of R are zero, so let R̃ ∈ R^{n×n} contain the first n rows of R and
Q̃ ∈ R^{m×n} contain the first n columns of Q; then the unique solution of our least
squares problem is the x∗ ∈ Rn which satisfies the n × n non-singular upper triangular
system R̃x∗ = Q̃^T b. The solution can of course be obtained by back substitution!
The size of the residual for our least squares problem is just the Euclidean norm of the
neglected last m − n components: i.e.

   ‖Rx∗ − Q^T b‖ = ( Σ_{i=n+1}^{m} (Q^T b)_i² )^{1/2}.

Note that A = Q̃R̃ is exactly the skinny QR factorisation of A, so the skinny factorisa-
tion is sufficient to solve our least squares problem (although the norm of the residual
would then have to be calculated in a different way).
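A minimal sketch of the full-rank case using numpy's built-in (skinny) QR; the straight-line fitting example and all names are illustrative assumptions.

```python
import numpy as np

def least_squares_qr(A, b):
    """Solve min_x ||Ax - b|| for full-rank A (m >= n) via the skinny QR."""
    Q, R = np.linalg.qr(A, mode='reduced')   # A = Q R, Q is m x n, R is n x n
    # Solve the triangular system R x = Q^T b (back substitution would suffice).
    return np.linalg.solve(R, Q.T @ b)

# Fit a straight line y = c0 + c1 t to noisy data (an overdetermined system).
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 20)
y = 1.0 + 2.0 * t + 0.01 * rng.standard_normal(20)
A = np.column_stack([np.ones_like(t), t])
print(least_squares_qr(A, y))                # approximately [1.0, 2.0]
```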

Now suppose that A ∈ R^{m×n} has rank p < n, and that we have a QR factorisation of
A with R in standard form. Again the least squares problem simplifies to min_x ‖Rx −
Q^T b‖. Since the last m − p rows of R are zero, this simplified minimisation problem
again reduces to R̃x∗ = Q̃^T b, where R̃ ∈ R^{p×n} contains the first p rows of R and Q̃ ∈
R^{m×p} contains the first p columns of Q. This however is an under-determined system
with an infinite number of solutions depending on n − p free parameters. Nevertheless,
since R is in standard form, we can easily describe the solutions. If we denote any of
these solutions by x∗ ∈ Rn, then the residual for our least squares problem is exactly
the same for each of them: i.e.

   ‖Rx∗ − Q^T b‖ = ( Σ_{i=p+1}^{m} (Q^T b)_i² )^{1/2}.
CHAPTER 14
Electromagnetism
Electromagnetism is one of the four fundamental forces of the universe. Apart from
gravity, most daily phenomena can be explained by electromagnetism. The very exis-
tence of atoms requires the electric force (plus weird quantum effects) to hold electrons
to the nucleus, and molecules are formed again due to electric forces between atoms.
More macroscopically, electricity powers all our electrical appliances. Finally, it is the
force that gives rise to light, and allows us to see things.

Charge and Current


The strength of the electromagnetic force experienced by a particle is determined by
its (electric) charge . The SI unit of charge is the Coulomb . In this course, we assume
that the charge can be any real number. However, at the fundamental level, charge
is quantised. All particles carry charge q = ne for some integer n, with the basic unit
e ≈ 1.6 × 10⁻¹⁹ C. For example, the electron has n = −1, the proton has n = +1 and the
neutron has n = 0.¹
Often, it will be more useful to talk about charge density ρ(x, t) (depending on po-
sition and time). The charge density is the charge per unit volume. The total charge
in a region V is then

   Q = ∫_V ρ(x, t) dV.

When we study charged sheets or lines, the charge density would be charge per unit
area or length instead, but this would be clear from context.
Electric current describes the coherent motions of electric charge across some surface
S, it counts the number of charge passing through the surface per unit time. The
motion of charge is described by the current density J(x, t), which is the “current
per unit area”, so that the current passing through a small surface dS located at x at
time t is J(x, t) · dS. For any surface S, the current through it is then
   I = ∫_S J · dS,

which counts the charge per unit time passing through S. Intuitively, if the charges in
the charge distribution ρ(x, t) move with velocity v(x, t), then (neglecting relativistic
effects) we have J = ρv.
It is well known that charge is conserved — we cannot create or destroy charge. How-
ever, the conservation of charge does not simply say that “the total charge in the
universe does not change”. We want to rule out scenarios where a charge on Earth
disappears, and instantaneously appears on the Moon. So what we really want to
say is that charge is conserved locally: if it disappears here, it must have moved to
1
The charge of quarks is actually −e/3 or 2e/3. This doesn’t change the spirit of the above
discussion since we could just change the basic unit. But, apart from in extreme circumstances,
quarks are confined inside protons and neutrons so we rarely have to worry about this.


somewhere nearby. Alternatively, charge density can only change due to continuous
currents. This is captured by the:
<Continuity equation>   ∂ρ/∂t + ∇ · J = 0
We can write this in a more intuitive integral form via the divergence theorem. The
charge Q in some region V is defined to be Q = ∫_V ρ dV. So

   dQ/dt = ∫_V ∂ρ/∂t dV = − ∫_V ∇ · J dV = − ∫_S J · dS.

Hence the continuity equation states that the change in total charge in a volume is
given by the total current passing through its boundary. In particular, we can take
V = R³, the whole of space. If there are no currents at infinity, then dQ/dt = 0. So the
continuity equation implies the conservation of charge.
E. 14-1
A wire is a cylinder of cross-sectional area A. Suppose there are n electrons per
unit volume. Then ρ = nq = −ne, J = nqv and I = nqvA.

Forces and Fields


In modern physics, we believe that all forces are mediated by fields .2 A field is
a dynamical quantity (ie. a function) that assigns a value to every point in space
and time. In electromagnetism, we have two fields: the electric field E(x, t) and the
magnetic field B(x, t). Each of these fields is a vector, ie. it assigns a vector to every
point in space and time, instead of a single number.
The fields interact with particles in two ways. On the one hand, fields cause particles
to move. On the other hand, particles create fields. The first aspect is governed by
the Lorentz force law³ F = q(E + v × B), while the second aspect is governed by
Maxwell's equations, the four equations below:

   <Gauss' law>                   ∇ · E = ρ/ε0
   <Gauss' law for magnetism>     ∇ · B = 0
   <Faraday's law of induction>   ∇ × E + ∂B/∂t = 0
   <Ampere–Maxwell law>           ∇ × B − µ0 ε0 ∂E/∂t = µ0 J
These are the differential form of Maxwell's equations (we'll later see the integral form).
All of classical electromagnetism is captured by these equations. Here we have two
constants of nature:⁴

   Electric constant: ε0 = 8.85 × 10⁻¹² m⁻³ kg⁻¹ s² C²;
   Magnetic constant: µ0 = 4π × 10⁻⁷ m kg C⁻².
2
not to be confused with “fields” in algebra, or agriculture.
3
From this equation we see that the SI units for E are force per unit charge (Newton per coulomb).
The units for B are tesla (symbol T), it is equivalent to newtons per meter per ampere.
4
Alternative name for these constants are the “permittivity of free space” and “permeability of
free space” respectively.

The presence of 4π in this formula isn't telling us anything deep about Nature. It's
more a reflection of the definition of the Coulomb as the unit of charge. The two
constants can be thought of as characterising the strength of the electric interactions and
the strength of the magnetic interactions respectively.

14.1 Electrostatics
Electrostatics is the study of stationary charges in the absence of magnetic fields. We
take ρ = ρ(x), J = 0 and B = 0. We then look for time-independent solutions. In this
case, the only relevant equations are ∇ · E = ρ/ε0 and ∇ × E = 0, and the other two
equations just give 0 = 0. In this section, our goal is to find E for any ρ.

14.1.1 Gauss’ Law


Here we transform the first Maxwell’s equation ∇ · E = ρ/ε0 into an integral form,
known as Gauss’ Law . Consider a region V ⊆ R3 with boundary S = ∂V . Then
integrating the equation over the volume V gives

   ∫_V ∇ · E dV = (1/ε0) ∫_V ρ dV = Q/ε0,

where Q is the total charge inside V. The divergence theorem says ∫_V ∇ · E dV =
∫_S E · dS. So we end up with

   <Gauss' law>   ∫_S E · dS = Q/ε0.

We call ∫_S E · dS the flux of E through the surface S. Gauss' law tells us that the
flux depends only on the total charge contained inside the surface. In particular any
external charge does not contribute to the total flux. While external charges do create
fields that pass through the surface, the fields have to enter the volume through one
side of the surface and leave through the other. Gauss’ law tells us that these two cancel
each other out exactly, and the total flux caused by external charges is zero.
E. 14-2
Consider a spherically symmetric charge density ρ(r) with ρ(r) = 0 for r > R, ie.
all the charge is contained in a ball of radius R. By symmetry, the field has the
same magnitude in all directions and points radially outward. So E = E(r)r̂. This
immediately ensures that ∇ × E = 0. Let S be a sphere of radius r > R. Then the total
flux is

   ∫_S E · dS = ∫_S E(r) r̂ · dS = E(r) ∫_S r̂ · dS = E(r) · 4πr².

By Gauss' law, we know that this is equal to Q/ε0. Therefore

   E(r) = Q/(4πε0 r²)   and so   E(r) = (Q/(4πε0 r²)) r̂.

By the Lorentz force law, the force experienced by a second charge q is

   <Coulomb's law>   F(r) = (Qq/(4πε0 r²)) r̂,

Strictly speaking, this only holds when the charges are not moving. However,
for most practical purposes, we can still use this because the corrections required
when they are moving are tiny.
Consider ρ(r) = ρ I[r < R], where I is the indicator function. Outside the sphere
of radius R, i.e. for r > R, we know that E(r) = (Q/(4πε0 r²)) r̂. Now suppose we
are inside the sphere, so r < R. Taking S = ∂Br(0), the enclosed charge is Q r³/R³, so

   (Q/ε0)(r³/R³) = ∫_S E · dS = E(r) 4πr²   =⇒   E(r) = (Qr/(4πε0 R³)) r̂.

So the field increases with radius inside the sphere.
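As a rough numerical illustration of the two formulas above (the specific charge, radius and test points are arbitrary assumptions, not from the notes):

```python
import numpy as np

eps0 = 8.85e-12
Q, R = 1e-9, 0.5                      # a 1 nC ball of radius 0.5 m

def E_field(r):
    """Radial field magnitude from Gauss's law with the enclosed charge."""
    Q_enc = Q * min(r, R)**3 / R**3   # charge enclosed by a sphere of radius r
    return Q_enc / (4 * np.pi * eps0 * r**2)

# The field is continuous at r = R, and inside it matches Q r / (4 pi eps0 R^3).
print(np.isclose(E_field(R - 1e-9), E_field(R + 1e-9)))                     # True
print(np.isclose(E_field(0.25), Q * 0.25 / (4 * np.pi * eps0 * R**3)))      # True
```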
E. 14-3
<Line charge> Consider an infinite line with uniform charge
density per unit length η. We use cylindrical polar coordinates (so
our line lies along the z-axis and r = √(x² + y²)). By symmetry, the field E
is radial, ie. E = E(r)r̂. Pick S to be a cylinder of length L
and radius r centred on the line. The end caps do not contribute to the
flux, since the field lines are perpendicular to their normals, while the
curved surface has area 2πrL. Then

   ηL/ε0 = ∫_S E · dS = E(r) 2πrL   =⇒   E = (η/(2πε0 r)) r̂.

Note that the field varies as 1/r, not 1/r2 . Intuitively, this is because we have
one more dimension of “stuff” compared to the point charge, so the field does not
drop as fast.
E. 14-4
<Surface charge> Consider an infinite plane z = 0 with
uniform charge per unit area σ. By symmetry, the
field points vertically, and the field below the plane is the oppo-
site of that above. So E = E(z)ẑ with E(z) = −E(−z).
Consider a vertical cylinder of height 2z and cross-
sectional area A straddling the plane. Now only the end caps contribute,
so

   σA/ε0 = ∫_S E · dS = E(z)A − E(−z)A   =⇒   E(z) = σ/(2ε0).
So E is constant for z > 0 (note here z is positive). This electric field is
discontinuous across the surface: we have E(0+) − E(0−) = σ/ε0. Another
example is a spherical shell of radius R with surface charge density σ; it has electric
field

   E = (σ/ε0)(R/r)²  for r > R,    E = 0  for r < R.

So again we find that E(R+) − E(R−) = σ/ε0. This is in fact a general result,
true for arbitrary surfaces and σ. We can prove it by considering
a small cylinder across the surface and shrinking it indefinitely. Then we find that
n̂ · E+ − n̂ · E− = σ/ε0. So the component of E normal to the surface is not
continuous across the surface.
However, the components of E tangential to the surface are continuous. To see
this, let t be any tangent to the surface at a point. Consider a small straight line
of length L in the direction of t at that point on the surface. We make two
copies of this line, one slightly above the surface and one slightly below, a distance
of a apart, and we join their ends with straight lines of length a so that we have a
rectangular loop C. We integrate E around the loop. Using Stokes' theorem, we have

   ∮_C E · dr = ∫_S ∇ × E · dS,

where S is the surface bounded by C. In the limit a → 0, the surface S shrinks
to zero size, so this integral gives zero. This means that the contribution to the line
integral from the two long sides must also vanish, leaving us with L t · E+ − L t · E− = 0.
Since t can be any tangent, we in fact have n̂ × (E+ − E−) = 0, which says that the
components of E tangential to the surface are continuous.

14.1.2 Electrostatic potential


In the most general case, we will have to solve both ∇ · E = ρ/ε0 and ∇ × E = 0.
However, we already know that the general form of the solution to the second equation
is E = −∇φ for some scalar field φ. Such a φ is called the electrostatic potential .
Substituting this into ∇ · E = ρ/ε0 , we obtain ∇2 φ = −ρ/ε0 , the Poisson equation. If
we are in the middle of nowhere and ρ = 0, then we get the Laplace equation. There
are a few interesting things to note about our result:
• φ is only defined up to a constant. We usually fix this by insisting φ(r) → 0
as r → ∞. This statement seems trivial, but this property of φ is actually very
important and gives rise to a lot of interesting properties. However, we will not
explore this.
• The Poisson equation is linear. So if we have two charges ρ1 and ρ2 , then the poten-
tial is simply φ1 +φ2 and the field is E1 +E2 . This is the principle of superposition .
Among the four fundamental forces of nature, electromagnetism is the only force
with this property.

Point charge
Consider a point particle with charge Q at the origin, then ρ(r) = Qδ³(r), where δ³ is
the generalization of the usual delta function for (3D) vectors. The equation we have
to solve is

∇²φ = −(Q/ε0) δ³(r).

Away from the origin r = 0, δ³(r) = 0, and we have the Laplace equation. From the
IA Vector Calculus course, the general solution is

φ = α/r   for some constant α.

The constant α is determined by the delta function. We integrate the equation over a
sphere of radius r centered at the origin. Then

−Q/ε0 = ∫_V ∇²φ dV = ∫_S ∇φ · dS = −∫_S (α/r²) r̂ · dS = −4πα

=⇒   α = Q/(4πε0)   and so   E = −∇φ = (Q/(4πε0 r²)) r̂.
This is just what we get from Coulomb’s law.

Dipole
A dipole consists of two point charges, +Q and −Q at r = 0 and r = −d respectively.
To find the potential of a dipole, we simply apply the principle of superposition and
obtain

φ = (1/(4πε0)) [Q/r − Q/|r + d|].

This is not a very helpful result, but we can consider the case when we are far, far
away, ie. r ≫ d. To do so, we Taylor expand the second term. For a general f(r), we
have

f(r + d) = f(r) + d · ∇f(r) + ½(d · ∇)²f(r) + ···.

Applying this to the term we are interested in gives

1/|r + d| = 1/r + d · ∇(1/r) + ½(d · ∇)²(1/r) + ···
          = 1/r − (d · r)/r³ − ½( d · d/r³ − 3(d · r)²/r⁵ ) + ···.

Plugging this into our equation gives

φ = (Q/(4πε0)) [1/r − 1/r + (d · r)/r³ + ···] ∼ (Q/(4πε0)) (d · r)/r³.

We define the electric dipole moment to be p = Qd. By convention, it points from
−ve to +ve. Then

φ = (p · r̂)/(4πε0 r²),   and so   E = −∇φ = (1/(4πε0)) (3(p · r̂)r̂ − p)/r³.
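A short symbolic check of the last step (not part of the original notes): the sketch below uses sympy to compute −∇φ for the dipole potential and confirms that it agrees with (3(p·r̂)r̂ − p)/(4πε0 r³). The choice of p along ẑ is just for concreteness.

    import sympy as sp

    x, y, z = sp.symbols("x y z", real=True)
    p, eps0 = sp.symbols("p epsilon0", positive=True)
    r = sp.Matrix([x, y, z])
    rn = sp.sqrt(x**2 + y**2 + z**2)
    pvec = sp.Matrix([0, 0, p])                      # dipole moment along z

    phi = pvec.dot(r) / (4 * sp.pi * eps0 * rn**3)   # p.r/(4 pi eps0 r^3) = p.r_hat/(4 pi eps0 r^2)
    E = -sp.Matrix([sp.diff(phi, v) for v in (x, y, z)])

    # Expected field: (3 (p.r_hat) r_hat - p) / (4 pi eps0 r^3)
    expected = (3 * pvec.dot(r) / rn**2 * r - pvec) / (4 * sp.pi * eps0 * rn**3)
    print((E - expected).applyfunc(sp.simplify))     # zero vector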

General charge distribution


To find φ for a general charge distribution ρ, we use the Green's function for the
Laplacian. The Green's function is defined to be the solution to ∇²_r G(r, r′) = δ³(r − r′).
In the section about point charges, we have shown that

G(r, r′) = −1/(4π|r − r′|).

We assume all charge is contained in some compact region V. Then

φ(r) = −(1/ε0) ∫_V ρ(r′) G(r, r′) d³r′ = (1/(4πε0)) ∫_V ρ(r′)/|r − r′| d³r′

will be such that ∇²φ(r) = −ρ(r)/ε0. So

E(r) = −∇φ(r) = −(1/(4πε0)) ∫_V ρ(r′) ∇(1/|r − r′|) d³r′ = (1/(4πε0)) ∫_V ρ(r′) (r − r′)/|r − r′|³ d³r′

We can ask what φ and E look like very far from V, ie. |r| ≫ |r′|. We again use the
Taylor expansion:

1/|r − r′| = 1/r − r′ · ∇(1/r) + ··· = 1/r + (r · r′)/r³ + ···.

Then we get

φ(r) = (1/(4πε0)) ∫_V ρ(r′) [1/r + (r · r′)/r³ + ···] d³r′ = (1/(4πε0)) [Q/r + (p · r̂)/r² + ···],

where   Q = ∫_V ρ(r′) dV′,   p = ∫_V r′ρ(r′) dV′,   r̂ = r/|r|.

So if we have a huge lump of charge, we can consider it to be a point charge Q, plus
some dipole correction terms. Here p is again called the electric dipole moment; this
is its general form.
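The multipole truncation above is easy to test numerically. The following minimal sketch (an illustration, not from the notes; it works in units with 4πε0 = 1 and all numbers are arbitrary) compares the exact potential of a random localized charge distribution with the monopole-plus-dipole approximation at a distant point.

    import numpy as np

    rng = np.random.default_rng(0)
    pos = rng.uniform(-1, 1, size=(50, 3))      # charge positions in a small region
    q = rng.uniform(-1, 1, size=50)             # charges (arbitrary units, 4 pi eps0 = 1)

    Q = q.sum()
    p = (q[:, None] * pos).sum(axis=0)          # dipole moment

    r = np.array([30.0, 10.0, 20.0])            # far-away field point
    rn = np.linalg.norm(r)

    phi_exact = (q / np.linalg.norm(r - pos, axis=1)).sum()
    phi_multipole = Q / rn + p @ r / rn**3
    print(phi_exact, phi_multipole)             # agree closely at this distance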
E. 14-5
<Field lines and equipotentials> Vectors are usually visualized using arrows,
where longer arrows represent larger vectors. However, this is not a practical
approach when it comes to visualizing fields, since a field assigns a vector to every
single point in space, and we don’t want to draw infinitely many arrows. Instead,
we use field lines. A field line is a continuous line tangent to the electric field E.
The density of lines is proportional to |E|. They begin and end only at charges
(and infinity), and never cross. We can also draw the equipotentials , which are
surfaces of constant φ. Because E = −∇φ, they are always perpendicular to field
lines. One can sketch the field lines (solid lines) and equipotentials (dashed lines) for a
positive point charge, a negative charge, and a dipole.

14.1.3 Electrostatic energy


We want to calculate how much energy is stored in the electric field. Recall from IA
Dynamics and Relativity that a particle of charge q in a field E = −∇φ has potential
energy U(r) = qφ(r); this energy can be thought of as the work done in bringing the
particle in from infinity:

work done = −∫_∞^r F · dr = −q ∫_∞^r E · dr = q ∫_∞^r ∇φ · dr = q[φ(r) − φ(∞)] = U(r)

where we set φ(∞) = 0. Now consider N charges qi at positions ri . The total potential
energy stored is the work done to assemble these particles. Let’s put them in one by
one.
1. The first charge is free. The work done is W1 = 0.
2. To place the second charge at position r2 takes work. The work is

   W2 = (1/(4πε0)) q1q2/|r1 − r2|.

3. To place the third charge at position r3, we do work

   W3 = (q3/(4πε0)) [ q1/|r1 − r3| + q2/|r2 − r3| ].

4. etc.

The total work done is

U = Σ_{i=1}^N Wi = (1/(4πε0)) Σ_{i<j} qiqj/|ri − rj| = (1/(4πε0)) (1/2) Σ_{i≠j} qiqj/|ri − rj|.

We can write this in an alternative form. The potential at point ri due to all the other
particles is

φ(ri) = (1/(4πε0)) Σ_{j≠i} qj/|ri − rj|,   so we can write   U = (1/2) Σ_{i=1}^N qi φ(ri).

There is an obvious generalization to continuous charge distributions: U = (1/2) ∫ ρ(r)φ(r) d³r.
Hence we obtain

U = (ε0/2) ∫ (∇ · E)φ d³r = (ε0/2) ∫ [ ∇ · (Eφ) − E · ∇φ ] d³r.

The first term is a total derivative and vanishes. In the second term, we use the
definition E = −∇φ and obtain

U = (ε0/2) ∫ E · E d³r.
This derivation of potential energy is not satisfactory. The final result shows that the
potential energy depends only on the field itself, and not the charges. However, the
result was derived using charges and electric potentials — there should be a way to
derive this result directly with the field, and indeed there is, however we will not do
this.
Also, we have waved our hands a lot when generalizing to continuous distributions,
which was not entirely correct. If we have a single point particle, the original discrete
formula implies that there is no potential energy. However, since the associated field
is non-zero, our continuous formula gives a non-zero potential energy. This does not mean
that the final result is wrong. It is correct, but it describes a more sophisticated (and
preferred) conception of “potential energy”. Again, we shall not go into the details
here.

14.1.4 Conductors
A conductor is a region of space which contains lots of charges that are free to
move. In electrostatic situations, we must have E = 0 inside a conductor (in the
interior where the charges can move in all directions freely). Otherwise, the charges
inside the conductor would move till equilibrium. This almost describes the whole
of the section: if you apply an electric field onto a conductor, the charges inside the
conductor move until the external field is cancelled out. Since E = 0 inside a conductor,
the potential φ must be a constant throughout the conductor (also on the surface by
continuity).
From this, we can derive a lot of results. Since 0 = ∇ · E = ρ/ε0 inside the conductor,
we must have ρ = 0. Hence any net charge must live on the surface. Note that
inside the conductor there can still be charge, just that in the interior the positive and
negative charges must balance out to give ρ = 0. Since φ is constant throughout the

conductor, the surface (in fact the whole) of the conductor is an equipotential. Hence
the electric field is perpendicular to the surface. This makes sense since any electric
field with components parallel to the surface would cause the charges to move.
Recall also from before that across a surface, we have n̂ · Eoutside − n̂ · Einside = σ/ε0
where σ is the surface charge density. Since Einside = 0, we obtain
Eoutside = (σ/ε0) n̂
This allows us to compute the surface charge given the field, and vice versa.
E. 14-6
<Faraday Cage> Consider some region of space that doesn’t contain any
charges, surrounded by a conductor. The conductor sits at constant φ = φ0 while,
since there are no charges inside, we must have ∇2 φ = 0. But this means that
φ = φ0 everywhere. This is because, if it didn’t then there would be a maximum
or minimum of φ somewhere inside, violating the maximum principle. Or alterna-
tively we know φ = φ0 is the unique solution since we have a Dirichlet boundary
condition φ = φ0 on the boundary. Therefore, inside a region surrounded by a
conductor, we must have E = 0. This is a very useful result if you want to shield
a region from electric fields. In this context, the surrounding conductor is called
a Faraday cage.
E. 14-7
Consider a spherical conductor with Q = 0. We put
a positive plate on the left and a negative plate on
the right. This creates a field from the left to the
right. With the conductor in place, since the electric
field lines must be perpendicular to the surface, they
have to bend towards the conductor. Since field lines
end and start at charges, there must be negative
charges at the left and positive charges at the right.
We get an induced surface charge.

To find the exact potential φ we'll work in spherical polar coordinates and choose
the original, constant electric field (in the absence of the conductor) to point in the
ẑ direction, E0 = E0 ẑ. This has potential φ0 = −E0 z = −E0 r cos θ. Take the
conducting sphere to have radius R and be centred on the origin. Let's add to this
an image dipole at the origin. This makes sense, as the induced − and + charges in
the diagram suggest. The resulting potential is

φ = −E0 (r − R³/r²) cos θ

Since we've added a dipole term, we can be sure that this still solves the Laplace
equation outside the conductor (as the Laplace equation is linear). Moreover, by
construction, φ = 0 when r = R. This is all we wanted from our solution. An
alternative way to solve this is to use the general solution we obtained for the
axisymmetric Laplace equation in [E.6-30] and match the boundary conditions. The
induced surface charge can be computed by evaluating the electric field just outside
the conductor. It is

σ/ε0 = −∂φ/∂r|_{r=R} = E0 (1 + 2R³/r³) cos θ|_{r=R} = 3E0 cos θ

We see that the surface charge is positive in one hemisphere and negative in the
other. The total induced charge averages to zero.
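A small symbolic check (not in the original notes) that this φ does what we claimed: the sketch below verifies with sympy that φ satisfies the axisymmetric Laplace equation for r > 0 and reproduces the induced surface charge σ = 3ε0E0 cos θ.

    import sympy as sp

    r, th, E0, R = sp.symbols("r theta E_0 R", positive=True)
    phi = -E0 * (r - R**3 / r**2) * sp.cos(th)

    # Axisymmetric Laplacian in spherical polar coordinates
    lap = (sp.diff(r**2 * sp.diff(phi, r), r) / r**2
           + sp.diff(sp.sin(th) * sp.diff(phi, th), th) / (r**2 * sp.sin(th)))
    print(sp.simplify(lap))                          # 0, so phi solves Laplace's equation

    sigma_over_eps0 = -sp.diff(phi, r).subs(r, R)
    print(sp.simplify(sigma_over_eps0))              # 3*E_0*cos(theta)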
E. 14-8
Suppose we have a conductor that fills all space x < 0. We ground it such that φ = 0
throughout the conductor. Then we place a charge q at x = d > 0. We are looking for
a potential that corresponds to a source at x = d and satisfies φ = 0 for x < 0. Since
the solution to the Poisson equation is unique, we can use the method of images to
guess a solution and see if it works; if it does, we are done.

To guess a solution, we pretend that we don't have a conductor. Instead, we have a
charge −q at x = −d. Then by symmetry we will get φ = 0 when x = 0. The potential
of this pair is

φ = (1/(4πε0)) [ q/√((x − d)² + y² + z²) − q/√((x + d)² + y² + z²) ].

To get the solution we want, we "steal" part of this potential and declare our potential
to be

φ = (1/(4πε0)) [ q/√((x − d)² + y² + z²) − q/√((x + d)² + y² + z²) ]   if x > 0,
φ = 0   if x ≤ 0.

Using this solution, we can immediately see that it satisfies Poisson’s equations
both outside and inside the conductor. To complete our solution, we need to find
the surface charge required such that the equations are satisfied on the surface as
well.
To do so, we can calculate the electric field near the surface, and use the relation
σ = ε0 Eoutside · n̂. To find σ, we only need the component of E in the x direction:
 
Ex = −∂φ/∂x = (q/(4πε0)) [ (x − d)/((x − d)² + y² + z²)^{3/2} − (x + d)/((x + d)² + y² + z²)^{3/2} ]   for x > 0.

Then the induced surface charge density is given by ε0 Ex at x = 0:

σ = ε0 Ex = −(q/(2π)) d/(d² + y² + z²)^{3/2}.

The total surface charge is then given by ∫ σ dy dz = −q.
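One can confirm the last claim numerically. The following sketch (illustrative only; the values of q and d are arbitrary) integrates σ over the plane in polar coordinates with scipy and recovers −q.

    import numpy as np
    from scipy.integrate import quad

    q, d = 1.0, 0.5                                  # charge and distance (arbitrary)

    # sigma depends only on s = sqrt(y^2 + z^2), so integrate in polar coordinates.
    sigma = lambda s: -(q / (2 * np.pi)) * d / (d**2 + s**2) ** 1.5
    total, _ = quad(lambda s: sigma(s) * 2 * np.pi * s, 0, np.inf)
    print(total)                                     # -1.0, i.e. -q as claimed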
E. 14-9
<Capacitors> We'll consider a parallel plate capacitor. Suppose there are two identical
flat plane conductors with surface area A, parallel to each other at a distance d apart,
one carrying charge Q, the other charge −Q. We assume that the distance d between
the surfaces is much smaller than √A. This means that we can neglect the effects that
arise around the edge of the plates, and we're justified in assuming that the electric
field between the two plates is the same as it would be if the plates were infinite in
extent.
Then between the plates we have electric field E = (σ/ε0)ẑ where σ = Q/A and we
have assumed the plates are separated in the z-direction, with the plate carrying Q at
z = 0 and the plate carrying −Q at z = d. We define the capacitance C to be C = Q/V
where V is the voltage or potential difference, the difference in the potential on the
two conductors. Since E = −dφ/dz is constant, we must have

φ = −Ez + c   =⇒   V = φ(0) − φ(d) = Ed = Qd/(Aε0)
and the capacitance for parallel plates of area A, separated by distance d, is
C = Aε0 /d. Because V was proportional to Q, the charge has dropped out of our
expression for the capacitance. Instead, C depends only on the geometry of the
set-up. This is a general property.
Capacitors are usually employed as a method to store electrical energy. Using our
result from the previous section, the energy stored in a parallel plate capacitor is

U = (ε0/2) ∫ E · E dV = (Aε0/2) ∫_0^d (σ/ε0)² dz = Q²/(2C).

14.2 Magnetostatics
Charges give rise to electric fields, currents give rise to magnetic fields. In this section,
we study the magnetic fields induced by steady currents . This means that we are
again looking for time independent solutions to the Maxwell equations. We will also
restrict to situations in which the charge density vanishes, so ρ = 0. We can then set
E = 0 and focus our attention only on the magnetic field. The remaining Maxwell’s
equations are ∇ × B = µ0 J and ∇ · B = 0. The objective is, given a J, find the
resultant B.
Before we start, what does the condition ρ = 0 mean? It does not mean that there
are no charges around. We want charges to be moving around to create current.
What it means is that the positive and negative charges balance out exactly. More
importantly, it stays that way. We don’t have a net charge flow from one place to
another. At any specific point in space, the amount of charge entering is the same as
the amount of charge leaving. This is the case in many applications. For example, in a
wire, all electrons move together at the same rate, and we don’t have charge building
up at parts of the circuit. Mathematically, we can obtain the interpretation from the
continuity equation ∂ρ/∂t + ∇ · J = 0. In the case of steady currents, ∂ρ/∂t = 0, so
∇ · J = 0, which says that there is no net flow in or out of any point.

14.2.1 Ampere’s Law


Consider a surface S with boundary C, and a current density J. Let I be the current
through the surface. We now integrate ∇ × B = µ0 J over the surface S to obtain

µ0 I = µ0 ∫_S J · dS = ∫_S (∇ × B) · dS = ∮_C B · dr.

So we have

<Ampere's law>   ∮_C B · dr = µ0 I.

E. 14-10
<A long straight wire> A wire is a cylinder with current I flowing through it. We use
cylindrical polar coordinates (r, ϕ, z), where z is along the direction of the current,
and r points in the radial direction.

By symmetry, the magnetic field can only depend on the radius, and must lie in the
x, y plane. Since we require that ∇ · B = 0, we cannot have a radial component. So
the general form is B(r) = B(r)ϕ̂. To find B(r), we apply Ampere's law to a disc that
cuts through the wire horizontally, with boundary circle C of radius r. This gives
µ0 I = ∮_C B · dr = 2πrB(r). So

B(r) = (µ0 I/(2πr)) ϕ̂.
E. 14-11
<Surface current> Consider the plane z = 0 with surface current density k (ie. current
per unit length). Take the x-direction to be the direction of the current, and the z
direction to be the normal to the plane.

We can imagine this situation as infinitely many copies of the above wire situation,
placed horizontally side by side and pointing in the x direction. Then the magnetic
fields must point in the y-direction. By symmetry, we must have B = −B(z)ŷ with
B(z) = −B(−z). Consider a vertical rectangular loop C of length L through the
surface. Then

µ0 kL = ∮_C B · dr = LB(z) − LB(−z)   =⇒   B(z) = µ0 k/2   for z > 0.
Similar to the electrostatic case, the magnetic field is constant on each side, and the
part parallel to the surface is discontinuous across the plane. This is a general result,
ie. across any surface, n̂ × B+ − n̂ × B− = µ0 k. Meanwhile, the magnetic field normal
to the surface is continuous, so n̂ · (B+ − B−) = 0 (to see this, use a Gaussian pillbox,
together with the other Maxwell equation ∇ · B = 0). Note that for the electric field
it is the other way round.
E. 14-12
<Solenoid> A solenoid is a cylindrical surface current, usually made by wrapping a
wire around a cylinder. We use cylindrical polar coordinates with z along the axis of
the cylinder. By symmetry, B = B(r)ẑ. Away from the cylinder, ∇ × B = 0, so
∂B/∂r = 0, which means that B(r) is constant outside. Since we know that B = 0 at
infinity, B = 0 everywhere outside the cylinder.

To compute B inside, use Ampere's law with a rectangular curve C through the side
of the cylinder. Note that only the vertical part (say of length L) inside the cylinder
contributes to the integral. Then

BL = ∮_C B · dr = µ0 INL,

where N is the number of turns of wire per unit length and I is the current in each
wire (so INL is the total amount of current passing through C). Hence B = µ0 IN.

14.2.2 Biot-Savart law


For general current distributions J, we also need to solve ∇ · B = 0. Recall that for
the electric case, the equation is ∇ · E = ρ/ε0 . For the B field, we have 0 on the right
hand side instead of ρ/ε0 . This is telling us that there are no magnetic monopoles, ie.
magnetic charges. The general solution to this equation is B = ∇×A for some A. Such
an A is called the vector potential . The other Maxwell equation then says

∇ × B = −∇2 A + ∇(∇ · A) = µ0 J. (∗)


This is rather difficult to solve, but it can be made easier by noting that A is not
unique. If A is a vector potential, then for any function χ(x), A′ = A + ∇χ is also a
vector potential of B, since ∇ × (A + ∇χ) = ∇ × A. The transformation A ↦ A + ∇χ
is called a gauge transformation. Each choice of A is called a gauge. An A such that
∇ · A = 0 is called a Coulomb gauge.
L. 14-13
We can always pick χ such that ∇ · A′ = 0.

Suppose that B = ∇ × A with ∇ · A = ψ(x). Then for any A′ = A + ∇χ, we have

∇ · A′ = ∇ · A + ∇²χ = ψ + ∇²χ.

So we need a χ such that ∇²χ = −ψ. This is a Poisson equation, which we know always
has a solution, given for example by the Green's function. Hence we can find a χ that
works.

If B = ∇ × A and ∇ · A = 0, then the Maxwell equation (∗) becomes ∇²A = −µ0 J.
Or, in Cartesian components, ∇²Ai = −µ0 Ji. This is 3 copies of the Poisson equation,
which we know how to solve using Green's functions. The solution is

Ai(r) = (µ0/4π) ∫ Ji(r′)/|r − r′| dV′   or   A(r) = (µ0/4π) ∫ J(r′)/|r − r′| dV′,
both integrating over r′. We have randomly written down a solution of ∇²A = −µ0 J.
However, this is a solution to Maxwell's equations only if it is in Coulomb gauge.
Fortunately, it is:

∇ · A(r) = (µ0/4π) ∫ J(r′) · ∇(1/|r − r′|) dV′ = −(µ0/4π) ∫ J(r′) · ∇′(1/|r − r′|) dV′

Here we employed a clever trick: differentiating 1/|r − r′| with respect to r is the
negative of differentiating it with respect to r′. Now since we are differentiating with
respect to r′, we can integrate by parts (using ∇ · (f F) = (∇f) · F + f(∇ · F)) to
obtain

= −(µ0/4π) ∫ [ ∇′ · ( J(r′)/|r − r′| ) − (∇′ · J(r′))/|r − r′| ] dV′.

Here both terms vanish. We assume that the current is localized in some region of
space so that J = 0 on the boundary. Then the first term vanishes since it is a total
derivative. The second term vanishes since we assumed that the current is steady
(∇ · J = 0). Hence we are indeed in Coulomb gauge. So the magnetic field is

<Biot-Savart law>   B(r) = ∇ × A = (µ0/4π) ∫ J(r′) × (r − r′)/|r − r′|³ dV′,

using the fact that ∇ × (f g) = (∇f) × g + f(∇ × g). If the current is localized on a
curve, then J(r′) is non-zero only on the curve, so this becomes

B = (µ0 I/4π) ∮_C dr′ × (r − r′)/|r − r′|³.
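As a sanity check on the Biot-Savart law (not part of the notes), one can discretise the line integral for a circular loop and compare with the standard on-axis result B_z = µ0IR²/(2(R² + z²)^{3/2}); the current, radius and number of segments below are arbitrary choices.

    import numpy as np

    mu0 = 4e-7 * np.pi
    I, R, n = 2.0, 1.0, 4000                         # arbitrary current, radius, segments

    phi = np.linspace(0, 2 * np.pi, n, endpoint=False)
    pts = R * np.stack([np.cos(phi), np.sin(phi), np.zeros_like(phi)], axis=1)     # r'
    dl = R * np.stack([-np.sin(phi), np.cos(phi), np.zeros_like(phi)], axis=1) * (2 * np.pi / n)

    def B_loop(r):
        sep = r - pts                                # r - r'
        dist = np.linalg.norm(sep, axis=1)[:, None]
        return mu0 * I / (4 * np.pi) * np.cross(dl, sep / dist**3).sum(axis=0)

    z = 0.7                                          # a point on the axis of the loop
    exact = mu0 * I * R**2 / (2 * (R**2 + z**2) ** 1.5)
    print(B_loop(np.array([0.0, 0.0, z]))[2], exact) # z-components agree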

14.2.3 Magnetic dipoles


A Current Loop
According to Maxwell's equations, magnetic monopoles don't exist. However, it turns
out that a localized current looks like a dipole from far, far away. Take a current loop
of wire C, of radius R and carrying current I. Based on the fields generated by a
straight wire, we can guess roughly what B looks like, but we want to calculate it. By
the Biot-Savart law, we know that

A(r) = (µ0/4π) ∫ J(r′)/|r − r′| dV′ = (µ0 I/4π) ∮_C dr′/|r − r′|.

Far from the loop, |r′|/|r| is small and we can use the Taylor expansion
1/|r − r′| = 1/r + (r · r′)/r³ + ···. Then

A(r) = (µ0 I/4π) ∮_C [ 1/r + (r · r′)/r³ + ··· ] dr′.

Note that r is a constant of the integral, and we can take it out. The first 1/r term
vanishes because it is a constant, and integrating a constant along a closed loop gives 0.
So we only consider the second term. We claim that for any constant vector g,

∮_C (g · r′) dr′ = S × g,   where   S = ∫_S dS

is the vector area of the surface bounded by C. This follows from the result
∮_C f(r′) dr′ = ∫_S dS × ∇f by taking f(r′) = g · r′. This identity results from
Stokes' theorem:

∮_C f(r) dr_i = ∮_C f(r) e_i · dr = ∫_S ∇ × (f(r) e_i) · dS = ∫_S dS_ℓ ε_{ℓjk} ∂_j f(r) δ_{ik}
             = ∫_S dS_ℓ ε_{ℓji} ∂_j f(r) = ∫_S ε_{iℓj} dS_ℓ ∂_j f(r) = ∫_S (dS × ∇f(r))_i

Now we have

A(r) ≈ (µ0/4π) (m × r)/r³,   where   m = IS.

m = IS is called the magnetic dipole moment. Now

B = ∇ × A = (µ0/4π) (3(m · r̂)r̂ − m)/r³.

This is the same form as E for an electric dipole! Note however that the B field due
to a current loop and the E field due to two charges do not look the same close up; it
is just that they have identical long-range dipole fall-offs.

General Current Distributions


After doing it for a loop, we can do it for a general current distribution. We have

Ai(r) = (µ0/4π) ∫ Ji(r′)/|r − r′| dV′ = (µ0/4π) ∫ [ Ji(r′)/r + Ji(r′)(r · r′)/r³ + ··· ] dV′.

We will show that the first term vanishes by showing that it is a total derivative. We
have

∂′_j (Jj r′_i) = (∂′_j Jj) r′_i + (∂′_j r′_i) Jj = Ji,

since ∂′_j Jj = ∇ · J = 0 and ∂′_j r′_i = δij. For the second term, we look at

∂′_j (Jj r′_i r′_k) = (∂′_j Jj) r′_i r′_k + Ji r′_k + Jk r′_i = Jk r′_i + Ji r′_k.

Apply this trick to

∫ Ji (r · r′) dV′ = rj ∫ Ji r′_j dV′ = (rj/2) ∫ (Ji r′_j − Jj r′_i) dV′,

where we discarded a total derivative ∂′_k (Jk r′_i r′_j). Putting it back in vector
notation,

∫ Ji (r · r′) dV′ = (1/2) ∫ [ Ji (r · r′) − r′_i (J · r) ] dV′ = (1/2) ∫ [ r × (J × r′) ]_i dV′.

So the long-distance vector potential is again

A(r) = (µ0/4π) (m × r)/r³,

with magnetic dipole moment m = (1/2) ∫ r′ × J(r′) dV′.
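A quick numerical check of this formula (not from the notes): for a flat circular loop, m = ½∮ r′ × I dr′ should reduce to IS = IπR² ẑ. The sketch below confirms this with an arbitrary current and radius.

    import numpy as np

    I, R, n = 2.0, 1.0, 4000
    phi = np.linspace(0, 2 * np.pi, n, endpoint=False)
    pts = R * np.stack([np.cos(phi), np.sin(phi), np.zeros_like(phi)], axis=1)
    dl = R * np.stack([-np.sin(phi), np.cos(phi), np.zeros_like(phi)], axis=1) * (2 * np.pi / n)

    # m = (1/2) \oint r' x (I dr'), which should equal I S = I pi R^2 z_hat for a flat loop
    m = 0.5 * np.cross(pts, I * dl).sum(axis=0)
    print(m, I * np.pi * R**2)                       # [0, 0, 6.283...] vs 6.283...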

14.2.4 Magnetic forces


We've seen that moving charges produce currents, which generate a magnetic field. But
we also know that a charge moving in a magnetic field experiences a force F = q ṙ × B.
So two currents will exert a force on each other.

A current I1, flowing along a closed curve C1, sets up a magnetic field

B(r) = (µ0 I1/4π) ∮_{C1} dr1 × (r − r1)/|r − r1|³.

A second current J2 on C2 experiences the Lorentz force F = ∫ J2(r) × B(r) dV. While
we are integrating over all of space, the current is localized on the curve C2. So

F = I2 ∮_{C2} dr2 × B(r2) = (µ0/4π) I1 I2 ∮_{C1} ∮_{C2} dr2 × ( dr1 × (r2 − r1)/|r2 − r1|³ ).

For well-separated currents, approximated by magnetic dipole moments m1 and m2,
the force can be written as

F = (µ0/4π) ∇ ( [ 3(m1 · r̂)(m2 · r̂) − m1 · m2 ] / r³ ),

whose proof is too long and complicated to include here.

E. 14-14
<Two parallel wires> Consider two straight wires carrying currents in the z direction,
one passing through (0, 0, 0), the other through (d, 0, 0), so they are a distance d apart.
We know that the field produced by each current is

Bi = (µ0 Ii/(2πr)) ϕ̂.

The particles in the second wire will feel a force

F = qv × B1 = qv × (µ0 I1/(2πd)) ŷ.

But J2 = nqv and I2 = J2 A, where n is the density of particles and A is the
cross-sectional area of the wire. So the number of particles per unit length is nA, and
the force per unit length is

nAF = (µ0 I1 I2/(2πd)) ẑ × ŷ = −µ0 (I1 I2/(2πd)) x̂.
So if I1 I2 > 0, ie. the currents are in the same direction, the force is attractive.
Otherwise the force is repulsive.
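Plugging in numbers (an illustration, not from the notes): for two wires carrying 1 A each, placed 1 m apart, the force per unit length is µ0/(2π) ≈ 2 × 10⁻⁷ N m⁻¹, the figure used in the pre-2019 definition of the ampere.

    import math

    mu0 = 4e-7 * math.pi
    I1 = I2 = 1.0        # amperes
    d = 1.0              # metre separation
    f = mu0 * I1 * I2 / (2 * math.pi * d)
    print(f)             # 2e-07 N per metre of wire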

14.3 Electrodynamics
So far, we have only looked at fields that do not change with time. However, in real
life, fields do change with time. We will now look at time-dependent E and B fields.
We'll explore the Maxwell equation ∇ × E + ∂B/∂t = 0. In short, if the magnetic field
changes in time, ie. ∂B/∂t ≠ 0, this creates an E that accelerates charges, which creates
a current in a wire. This process is called induction.
Consider a wire, which is a closed curve C, with a surface S. We integrate over the
surface S to obtain

∫_S (∇ × E) · dS = −∫_S ∂B/∂t · dS.

By Stokes' theorem and commutativity of integration and differentiation (assuming S
and C do not change in time), we have

<Faraday's law of induction for a fixed curve>   ∮_C E · dr = −(d/dt) ∫_S B · dS,

where the left-hand side is E = ∮_C E · dr, the electromotive force (emf), and the
right-hand integral is Φ = ∫_S B · dS, the magnetic flux. Despite the name, the emf
is not a force! It is in fact E = ∮_C F · dr/q; we can think of it as the work done on
a unit charge moving around the curve, or the "voltage" of the system. In the most
general form, we have

<Faraday's law of induction>   E = −∂Φ/∂t,
which holds also for moving curves, as we will show later. Faraday's law of induction
says that when we change the magnetic flux through S, then a current is induced. In

practice, there are many ways we can change the magnetic flux, such as by moving
bar magnets or using an electromagnet and turning it on and off.
The minus sign has a significance. When we change a magnetic field, an emf is created.
This induces a current around the wire. However, we also know that currents produce
magnetic fields. The minus sign indicates that the induced magnetic field opposes the
initial change in magnetic field. If it didn’t and the induced magnetic field reinforces
the change, we will get runaway behaviour and the world will explode. This is known
as Lenz’s law .
E. 14-15

Consider a circular wire with a magnetic field perpendicular


to it. If we decrease B such that Φ̇ < 0, then E > 0. So
the current flows anticlockwise (viewed from above). The
current generates its own B. This acts to increase B inside,
which counteracts the initial decrease. This means you don’t
get runaway behaviour.

E. 14-16
There is another, related way to induce currents in the presence of a magnetic field:
you can keep the field fixed, but move the wire. Perhaps the simplest example is a
rectangular circuit in which one of the wires is a metal bar of length d that can slide
backwards and forwards. This whole set-up is then placed in a magnetic field, which
passes up, perpendicular through the circuit.

Slide the bar to the left with speed v. Each charge q will experience a Lorentz force
F = qvB in the counterclockwise direction. The emf, defined as the work done per
unit charge, is E = vBd because work is only done on particles (charges) as they pass
through the bar.

Meanwhile, the change of flux is dΦ/dt = −vBd since the area changes at a rate of −vd.
We again have E = −dΦ/dt. Note that we obtain the same formula but different
physics: we used the Lorentz force law, not Maxwell's equations.

Now we consider the general case: a moving loop C(t) (which need not be a circle)
bounding a surface S(t). As the curve moves, it sweeps out a cylinder-like surface Sc.
The change in flux is

Φ(t + δt) − Φ(t) = ∫_{S(t+δt)} B(t + δt) · dS − ∫_{S(t)} B(t) · dS
                 = ∫_{S(t+δt)} [ B(t) + (∂B/∂t) δt ] · dS − ∫_{S(t)} B(t) · dS + O(δt²)
                 = δt ∫_{S(t)} (∂B/∂t) · dS + [ ∫_{S(t+δt)} − ∫_{S(t)} ] B(t) · dS + O(δt²)

We know that S(t + δt), S(t) and Sc together form a closed surface. Since ∇ · B = 0,
the integral of B over a closed surface is 0. So we obtain ∫_{S(t+δt)−S(t)+Sc} B(t) · dS = 0.

Hence we have

Φ(t + δt) − Φ(t) = δt ∫_{S(t)} (∂B/∂t) · dS − ∫_{Sc} B(t) · dS.

We can simplify the integral over Sc by writing the surface element as dS = (dr × v) δt.
Then B · dS = δt (v × B) · dr. So

dΦ/dt = lim_{δt→0} δΦ/δt = ∫_{S(t)} (∂B/∂t) · dS − ∮_{C(t)} (v × B) · dr = −∮_C (E + v × B) · dr,

where the last equality comes from the Maxwell equation ∂B/∂t = −∇ × E together
with Stokes' theorem. Now the emf (work done per unit charge moving around the
curve) is E = ∮_C (E + v × B) · dr, which now includes the force tangential to the wire
from both the electric field and also from the motion of the wire in the presence of the
magnetic field. We obtain Faraday's law of induction E = −dΦ/dt for the most general
case, where the curve itself can change.

14.3.1 Magnetostatic energy


Suppose that a current I flows along a wire C. From magnetostatics, we know that
this gives rise to a magnetic field B, and hence a flux Φ given by Φ = ∫_S B · dS, where
S is the surface bounded by C. The inductance of a curve C, defined as L = Φ/I, is
the amount of flux it generates per unit current passing through C. This is a property
only of the curve C.
E. 14-17

Consider a solenoid of length ℓ and cross-sectional area A (with ℓ ≫ √A so we can
ignore end effects). We know that B = µ0 IN where N is the number of turns of wire
per unit length and I is the current. The flux through a single turn (pretending it is
closed) is Φ0 = µ0 INA. So the total flux is Φ = Φ0 Nℓ = µ0 IN²V where V = Aℓ is
the volume. So L = µ0 N²V.

We can use the idea of inductance to compute the energy stored in magnetic fields.
The idea is to compute the work done in building up a current. As we build the
current, the change in current results in a change in magnetic field. This produces an
induced emf that we need to do work to oppose. The emf is given by

E = −dΦ/dt = −L dI/dt.

This opposes the change in current, by Lenz's law. In time δt, a charge Iδt flows around
C. The work done is

δW = E I δt = −LI (dI/dt) δt   =⇒   dW/dt = −LI dI/dt = −(1/2) L d(I²)/dt.

So the work done to build up a current is W = ½LI² = ½IΦ. Note that we dropped
the minus sign because we switched from talking about the work done by the emf to
the work done to oppose the emf.
This work done is identified with the energy stored in the system. Recall that the
vector potential A is given by B = ∇ × A. So

U = ½ I ∫_S B · dS = ½ I ∫_S (∇ × A) · dS = ½ I ∮_C A · dr = ½ ∫_{R³} J · A dV

Using Maxwell's equation ∇ × B = µ0 J, we obtain

U = (1/(2µ0)) ∫ (∇ × B) · A dV = (1/(2µ0)) ∫ [ ∇ · (B × A) + B · (∇ × A) ] dV.

Assuming that B × A vanishes sufficiently fast at infinity, the integral of the first term
vanishes. So we are left with

U = (1/(2µ0)) ∫ B · B dV.

So the energy stored in a magnetic field is U = (1/(2µ0)) ∫ B · B dV. In general,
combining this with the electrostatic energy we obtained previously, the energy stored
in E and B is

U = ∫ [ (ε0/2) E · E + (1/(2µ0)) B · B ] dV.
Note that while this is true, it does not follow directly from our results for pure
magnetic and pure electric fields. It is entirely plausible that when both are present,
they interact in weird ways that increases the energy stored. However, it turns out
that this does not happen, and this formula is right.

14.3.2 Resistance
The story so far is that we change the flux, an emf is produced, and charges are
accelerated. In principle, we should be able to compute the current. But accelerating
charges are complicated (they emit light). Instead, we invoke a new effect, friction. In
a wire, this is called resistance. In most materials, the effect of resistance is that E is
proportional to the speed of the charged particles, rather than the acceleration.
We can think of the particles as accelerating for a very short period of time, and then
reaching a terminal velocity. So we have Ohm's law E = IR, where the constant R is
called the resistance. Note that E = ∫ E · dr and E = −∇φ, so E = V, the potential
difference. So Ohm's law can also be written as V = IR.
For a wire of length L and cross-sectional area A, we define the resistivity to be
ρ = AR/L, and the conductivity to be σ = 1/ρ. These are properties only of the
substance and not of the actual shape of the wire. Now Ohm's law reads J = σE. One
can formally derive Ohm's law by considering the field and the interactions between
the electrons and the atoms, but we will not do so here.

With resistance, we need to do work to keep a constant current. In time δt, the work
needed is δW = EIδt = I²Rδt using Ohm's law. So the Joule heating, the energy lost
as heat in a circuit due to friction, is given by

dW/dt = I²R.
E. 14-18
Take again the circuit with the sliding bar. Suppose that the sliding bar has length ℓ
and resistance R, and the remaining parts of the circuit are superconductors with no
resistance. There are two dynamical variables, the position of the bar x(t) and the
current I(t).

If a current I flows, the force on the bar is F = Iℓŷ × Bẑ = IBℓx̂, since Iℓ = qv where
q is the total charge in the bar and v its speed. Suppose the bar can slide without
friction, so we have mẍ = IBℓ. We can compute the emf as

E = −dΦ/dt = −Bℓẋ.

Ohm's law gives IR = −Bℓẋ. Hence

mẍ = −(B²ℓ²/R) ẋ   =⇒   ẋ(t) = ẋ(0) e^{−B²ℓ² t/(mR)}.

So the speed of the bar decays exponentially from its initial speed. Here we assumed
that all the emf is induced, but we can easily modify this if we have, say, an external
supply of emf E0 (eg. a battery connected to the circuit), in which case E = E0 − dΦ/dt.
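The decay can also be checked by integrating the equation of motion numerically. The following sketch (not part of the notes; all parameter values are arbitrary) solves mẍ = −(B²ℓ²/R)ẋ with scipy and compares with the exponential solution.

    import numpy as np
    from scipy.integrate import solve_ivp

    m, B, ell, R = 0.1, 0.5, 0.2, 2.0                 # arbitrary SI values
    x0, v0 = 0.0, 1.0

    def rhs(t, y):
        x, v = y
        return [v, -(B**2 * ell**2 / (m * R)) * v]    # m x'' = -(B^2 l^2 / R) x'

    sol = solve_ivp(rhs, (0, 10), [x0, v0], dense_output=True)
    t = np.linspace(0, 10, 5)
    print(sol.sol(t)[1])                              # numerically integrated velocities
    print(v0 * np.exp(-B**2 * ell**2 * t / (m * R)))  # exponential decay, same to solver tolerance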

14.3.3 Displacement currents

We have studied every Maxwell equation, except for one term. Recall that the final
Maxwell equation is

∇ × B = µ0 ( J + ε0 ∂E/∂t ).

Its time-independent version is Ampere's law ∇ × B = µ0 J. But now that we allow
the field to change with time, we have the µ0ε0 ∂E/∂t term, which we haven't previously
discussed. Historically this term is called the displacement current. The need for this
term was discovered purely mathematically: people noticed that Maxwell's equations
would be inconsistent with charge conservation without it.

Without the term, the last equation is ∇ × B = µ0 J. Take the divergence of the
equation to obtain µ0 ∇ · J = ∇ · (∇ × B) = 0, which says that any current that flows
into a given volume has to also flow out. But we know that's not always the case. To
give a simple example, we can imagine putting lots of charge in a small region and
watching it disperse. Since the charge is leaving the central region, the current does
not obey ∇ · J = 0, seemingly in violation of Ampere's law. In fact we know charge
conservation says that ρ̇ + ∇ · J = 0. With the new term however, taking the divergence
yields

µ0 ( ∇ · J + ε0 ∇ · ∂E/∂t ) = 0.

Since partial derivatives commute, we have

ε0 ∇ · ∂E/∂t = ε0 ∂/∂t (∇ · E) = ρ̇

by the first Maxwell equation. So it gives ∇ · J + ρ̇ = 0. So with the new term, not
only is Maxwell's equation consistent with charge conservation, it actually implies
charge conservation.

14.3.4 Electromagnetic waves


We now look for solutions to Maxwell's equations in the case where ρ = 0 and J = 0,
ie. in vacuum. Differentiating the final Maxwell equation with respect to time, we have

µ0ε0 ∂²E/∂t² = ∂/∂t (∇ × B) = ∇ × ∂B/∂t = −∇ × (∇ × E) = ∇²E − ∇(∇ · E) = ∇²E,

using ∇ · E = ρ/ε0 = 0 in the last step. So each component of E obeys the wave
equation

(1/c²) ∂²E/∂t² − ∇²E = 0   where   c = 1/√(µ0ε0).

So the speed of the wave is c = 1/√(µ0ε0) ≈ 3 × 10⁸ m s⁻¹, which is the speed of light!
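A one-line numerical check (not from the notes), using the standard SI values of µ0 and ε0:

    import math

    mu0 = 4e-7 * math.pi          # vacuum permeability (pre-2019 exact value)
    eps0 = 8.8541878128e-12       # vacuum permittivity
    print(1 / math.sqrt(mu0 * eps0))   # ~2.998e8 m/s, the speed of light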
We now look for plane wave solutions which propagate in the x direction, and are
independent of y and z. So we can write our electric field as
E(x) = (Ex(x, t), Ey(x, t), Ez(x, t)). Hence any derivatives with respect to y and z are
zero. Since we know that ∇ · E = 0, Ex must be constant. We take Ex = 0, since any
constant electric field can always be added as a solution to the Maxwell equations, so
wlog we choose this constant to vanish (in fact we also can't have Ex(x, t) = At). Also
wlog, assume Ez = 0, ie. the wave propagates in the x direction and oscillates in the y
direction. Then we look for solutions of the form E = (0, E(x, t), 0) with

(1/c²) ∂²E/∂t² − ∂²E/∂x² = 0.
The general solution is E(x, t) = f(x − ct) + g(x + ct). The most important solutions
are the monochromatic waves

E = E0 sin(kx − ωt),

where E0 is the amplitude, k is the wave number, and ω is the (angular) frequency.
The wave number is related to the wavelength by λ = 2π/k. Since the wave has to
travel at speed c, we must have ω² = c²k². So the value of k determines the value of
ω, and vice versa.
To solve for B, we use ∇ × E = −∂B/∂t and ∇ · B = 0, so B = (0, 0, B) for some
B(x, t). (If we also use ∇ × B = µ0ε0 ∂E/∂t, we find we cannot have Ex(x, t) = At.)
Hence the equation gives

∂B/∂t = −∂E/∂x   =⇒   B = (E0/c) sin(kx − ωt).

Note that this is uniquely determined by E, and we do not get to choose our favorite
amplitude, frequency etc. for the magnetic component.
We see that E and B oscillate in phase, orthogonal to each other, and orthogonal to
the direction of travel. These waves are what we usually consider to be “light”. Also
note that Maxwell’s equations are linear, so we can add up two solutions to get a
new one. This is particularly important, since it allows light waves to pass through
each other without interfering. It is useful to use complex notation. The most general
monochromatic wave takes the form

E = E0 exp(i(k · x − ωt)),   B = B0 exp(i(k · x − ωt)),   with ω² = c²|k|².

k is called the wave vector, which is real. The "actual" solutions are just the real
parts of these expressions. There are some restrictions on the values of E0 etc. due to
Maxwell's equations:

∇ · E = 0   ⇒   k · E0 = 0
∇ · B = 0   ⇒   k · B0 = 0
∇ × E = −∂B/∂t   ⇒   k × E0 = ωB0
If E0 and B0 are real, then k, E0 /c and B0 form a right-handed orthogonal triad of
vectors. A solution with real E0 , B0 , k is said to be linearly polarized . This says that
the waves oscillate up and down in a fixed plane. If E0 and B0 are complex, then the
polarization is not in a fixed direction. If we write E0 = α + iβ for α, β ∈ R3 , then
the “real solution” is
Re(E) = α cos(k · x − ωt) − β sin(k · x − ωt).
Note that ∇·E = 0 requires that k·α = k·β = 0. It is not difficult to see that this traces
out an ellipse. If E0 and B0 are complex, then it is said to be elliptically polarized .
In the special case where |α| = |β| and α · β = 0, this is circular polarization .
E. 14-19
We can give a simple application: why metals are shiny. A metal is a conductor.
Suppose the region x > 0 is filled with a conductor, and a light wave is incident on it,
ie. Einc = E0 ŷ exp(i(kx − ωt)) with ω = ck. We know that inside a conductor, E = 0,
and at the surface the tangential component vanishes, ie. E · ŷ|_{x=0} = 0. Then
clearly our solution above does not satisfy the boundary conditions! To achieve the
boundary conditions, we add a reflected wave

Eref = −E0 ŷ exp(i(−kx − ωt)).

Then our total electric field is E = Einc + Eref. This is a solution to Maxwell's
equations, since it is a sum of two solutions, and it satisfies E · ŷ|_{x=0} = 0 as
required. Maxwell's equations say ∇ × E = −∂B/∂t. So

Binc = (E0/c) ẑ exp(i(kx − ωt)),   Bref = (E0/c) ẑ exp(i(−kx − ωt)).

This obeys B · n̂ = 0, where n̂ is the normal to the surface. But we also have

B · ẑ|_{x=0−} = (2E0/c) e^{−iωt},

so there is a magnetic field at the surface. However, we know that inside the conductor
we have B = 0. This means that there is a discontinuity across the surface! We know
that such a discontinuity happens when there is a surface current. Using the formula
we previously obtained, the surface current is given by

K = (2E0/(µ0 c)) ŷ e^{−iωt}.

We see that the surface current oscillates with the frequency of the reflected wave. So
shining light onto a metal will cause an oscillating current. We can imagine the process
as follows: the incident light hits the conductor and causes an oscillating current, which
generates the reflected wave (since accelerating charges generate light). We can do the
same for light incident at an angle, and prove that the incident angle is equal to the
reflected angle.

14.3.5 Poynting vector

Electromagnetic waves carry energy; that's how the Sun heats up the Earth! We
will compute how much. The energy stored in the fields in a volume V is

U = ∫_V [ (ε0/2) E · E + (1/(2µ0)) B · B ] dV.

Differentiating under the integral sign and using Maxwell's equations,

dU/dt = ∫_V [ ε0 E · ∂E/∂t + (1/µ0) B · ∂B/∂t ] dV
      = ∫_V [ (1/µ0) E · (∇ × B) − E · J − (1/µ0) B · (∇ × E) ] dV.

But E · (∇ × B) − B · (∇ × E) = −∇ · (E × B) by vector identities. So

dU/dt = −∫_V J · E dV − (1/µ0) ∫_S (E × B) · dS.

Recall that the work done on a particle of charge q moving with velocity v is
δW = qv · E δt. So the J · E term is the rate of work done on the charged particles in
V (note that no work is done by the magnetic field). We can thus write

<Poynting theorem>   dU/dt + ∫_V J · E dV = −(1/µ0) ∫_S (E × B) · dS.

The left-hand side is the combined change in energy of both fields and particles in the
region V. Since energy is conserved, the right-hand side must describe the energy that
escapes through the surface S of region V. The Poynting vector is S = (1/µ0) E × B.
This is a vector field; it tells us the magnitude and direction of the flow of energy at
any point in space. We can write the Poynting theorem in differential form as

∂u/∂t + J · E + ∇ · S = 0,

where u is the energy density given by the integrand of U. This is similar to the
continuity equation ∂ρ/∂t + ∇ · J = 0, but instead of describing the movement of
charge it describes the movement of energy.

Because the Poynting vector is quadratic in E and B, we're not allowed to use the
complex form of the waves, since otherwise the imaginary parts would mix into the
real part when we multiply E and B together. For a linearly polarized wave,

E = E0 sin(k · x − ωt),   B = (1/c) (k̂ × E0) sin(k · x − ωt)
=⇒   S = (E0²/(cµ0)) k̂ sin²(k · x − ωt).

The average over a period T = 2π/ω is thus ⟨S⟩ = (E0²/(2cµ0)) k̂.
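As an illustration (not from the notes), one can invert ⟨S⟩ = E0²/(2cµ0) to find the field amplitude corresponding to a given intensity; the 1.36 kW/m² figure below is an assumed, roughly solar-constant-sized value.

    import math

    mu0 = 4e-7 * math.pi
    c = 299_792_458.0
    intensity = 1.36e3            # W/m^2, an assumed figure of roughly solar-constant size
    E0 = math.sqrt(2 * c * mu0 * intensity)
    B0 = E0 / c
    print(E0, B0)                 # roughly 1e3 V/m and 3.4e-6 T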

14.4 Electromagnetism and relativity


14.4.1 A review of special relativity
In special relativity, we combine space and time into one single object, instead of
treating them separately. We pack this information into a single 4-vector

X^µ = (ct, x, y, z),

where the index µ ranges over 0, 1, 2, 3. In special relativity, we use Greek letters
(eg. µ, ν, ρ, σ) to denote the indices. If we want to refer to the spatial components
(1, 2, 3) only, we will use Roman letters (eg. i, j, k) to denote them. Note also that the
index is a superscript instead of a subscript.

Instead of the usual Euclidean metric, we use the Minkowski metric, defined by

ηµν = diag(+1, −1, −1, −1).

The “dot product” of 4-vectors is defined by X · Y = X T ηY . In particular, the length


of a vector is given by kXk2 = X · X = X T ηX. For comparison, under the usual
Euclidean metric, the metric is the identity matrix. However, the above notation is
very bad and we should not use it in special relativity. Instead, we use summation
convention, which can be easily generalized to higher-rank tensors.

Given the funny dot product, we have to be slightly more careful about our summation
convention. If we simply write X^µ X^µ for the dot product, then we will get
(ct)² + x² + y² + z², which is nonsense. Instead, we define a quantity

Xµ = (ct, −x, −y, −z).

Then the dot product will be given by Xµ X^µ. This is the general rule of special
relativity: when contracting indices, we must have one index up and one index down.
Summing over indices on the same side is forbidden.

Since we know that the dot product is also given by X T ηX, we can view Xµ as a
shorthand Xµ = ηµν X ν . If we are given Xµ and want to obtain X µ , we have to
multiply by the inverse of ηµν , which fortunately is the same matrix (but with indices
up): η µν = diag(+1, −1, −1, −1). Hence X µ = η µν Xν . In general, the metric tensor
can be used to raise and lower indices. Note that since we have to distinguish between
indices-up and indices-down, we will in general not write things simply as X or η. We
always write the indices out as well. In special relativity (and electromagnetism), we
will have a lot more other 4-vectors, and the rules above all apply.

Lorentz transformations
The basic laws of relativity tell us how things look from the viewpoints of different
observers moving relative to each other. Suppose our first observer sits in an inertial
frame S with coordinates (ct, x, y, z), while the second sits in S′ with coordinates
(ct′, x′, y′, z′). Suppose S′ is moving with speed v in the x direction relative to S.
Then the two coordinates are related by the Lorentz transform

ct′ = γ (ct − (v/c)x),   x′ = γ (x − (v/c)ct),   y′ = y,   z′ = z,

with γ = 1/√(1 − (v/c)²) and c = 299 792 458 m s⁻¹.

Note that we think in terms of ct instead of t, since ct has the same dimensions as x.
Also, we look at v/c instead of v itself since v/c is dimensionless. In general, changing
the frame of reference corresponds to applying a Lorentz transformation Λµν . Vectors
transform according to
X µ 7→ Λµν X ν ,
Of course, not all matrices Λµν represent Lorentz transforms. To represent a Lorentz
transform, Λµν must obey Λρµ ηρσ Λσν = ηµν . This definition is analogous to the defini-
tion of orthogonal matrices, which can be written (in a convoluted way) as Oij δik Ok` =
δj` . In particular, this definition requires that Λµν preserves the Minkowski metric.
Indeed, if we define the (pseudo) inner product of two tensors X, Y as

⟨X, Y⟩ = X^µ Yµ = X^µ ηµν Y^ν,

and we write ΛX for Λ^µ_ν X^ν, then the equation above just says that we need
⟨ΛX, ΛY⟩ = ⟨X, Y⟩ for all tensors X, Y: the classic definition of orthogonal matrices!
So a Lorentz transform really is just an orthogonal transform under the Minkowski
metric.
Excluding the strange time-reversal transformation, which we discard, there are two
classes of Lorentz transformations:
 
1. Rotations: here

   Λ^µ_ν = [ [1, 0], [0, R] ]   (in block form),

   where R is an orthogonal 3 × 3 matrix, RᵀR = 1.

2. Boosts: eg. a boost with speed v in the x direction is

   Λ^µ_ν = [ [γ, −γv/c, 0, 0], [−γv/c, γ, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1] ].
We know that a 4-vector transforms as X^µ ↦ Λ^µ_ν X^ν. How does Xµ transform? We
have

Xµ ↦ X′µ = ηµν X′^ν = ηµν Λ^ν_σ X^σ = ηµν Λ^ν_σ η^{σρ} Xρ = Λµ^ρ Xρ,
where we used the rules for lowering and raising indices in the last line. What is this
mysterious Λµρ ? We can view it as the transpose of Λρµ . It turns out that this is also
the inverse of Λµρ . Recall that Λρµ ηρσ Λσν = ηµν . We multiply both sides by η ντ to
obtain Λρµ ηρσ Λσν η ντ = δµτ . So raising and lowering indices gives Λρµ Λρτ = δµτ . This
is analogous to the fact that the transpose of an orthogonal matrix is its inverse.
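The defining relation Λρµ ηρσ Λσν = ηµν is easy to verify numerically. The sketch below (not part of the notes; v = 0.6c is an arbitrary choice, and units with c = 1 are used) checks that a boost matrix satisfies ΛᵀηΛ = η.

    import numpy as np

    eta = np.diag([1.0, -1.0, -1.0, -1.0])

    def boost(v, c=1.0):
        g = 1.0 / np.sqrt(1 - (v / c) ** 2)
        L = np.eye(4)
        L[0, 0] = L[1, 1] = g
        L[0, 1] = L[1, 0] = -g * v / c
        return L

    L = boost(0.6)
    print(np.allclose(L.T @ eta @ L, eta))   # True: Lambda^T eta Lambda = eta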

Vectors, co-vectors and tensors


Recall we started with a quantity X µ = (ct, x, y, z), and declared Xµ = (ct, −x, −y, −z).
However, defining up and down indices in terms of “negating the last three terms” is
not satisfactory. For one, it becomes rather confusing when you have higher-order
terms. More importantly, if we discover a new quantity Y = (a, s, d, f ), how could we
decide if Y should have indices up or indices down?
We can make use of our results above: X^µ and Xµ transform differently. Vectors
have indices up and transform according to X^µ ↦ Λ^µ_ν X^ν. Co-vectors have indices
down and transform according to Xµ ↦ Λµ^ν Xν. We can explore different objects and
see if they are vectors or co-vectors.
The relativistic generalization of ∇ is the 4-derivative, defined to be

∂µ = ∂/∂X^µ = ( (1/c) ∂/∂t, ∇ ).

Note that ∂µ is a co-vector, not a vector. This follows from the chain rule:

∂µ = ∂/∂X^µ ↦ ∂/∂X′^µ = (∂X^ν/∂X′^µ) ∂/∂X^ν = (Λ⁻¹)^ν_µ ∂ν = Λµ^ν ∂ν.
We can have higher-dimensional objects: a tensor of type (m, n) is a quantity
T^{µ1···µn}_{ν1···νm} which transforms as

T′^{µ1···µn}_{ν1···νm} = Λ^{µ1}_{ρ1} ··· Λ^{µn}_{ρn} Λ_{ν1}^{σ1} ··· Λ_{νm}^{σm} T^{ρ1···ρn}_{σ1···σm}.
We can change the type of a tensor by raising and lowering indices via ηµν . However,
the total n + m will not be changed. To build theories consistent with special rela-
tivity, all we have to do is work only with tensors, and make sure indices match in
equations.

14.4.2 Conserved currents


Recall the continuity equation ∂ρ/∂t + ∇ · J = 0. An implication of this was that we
cannot have a charge disappear on Earth and appear on the Moon. One way to explain
this impossibility would be that the charge would have to travel faster than the speed
of light, which is something related to relativity. This suggests that we might be able
to write it relativistically. We define

J^µ = (ρc, J).

Before we do anything with it, we must show that this is a 4-vector, ie. it transforms
via left multiplication by Λ^µ_ν. Suppose we have a static charge density ρ0(x) with
J = 0. In a frame boosted by v, we want to show that the new current is

J′^µ = Λ^µ_ν J^ν = (γρ0 c, −γρ0 v).

The new charge density is now γρ0 instead of ρ0. This is correct, since we have Lorentz
contraction: as space is contracted, we get more charge per unit volume. Then the
new current is the charge density times its velocity (which is −v in the boosted frame),
ie. J′ = −γρ0 v. So this is a sensible
definition. With this definition of J µ , local charge conservation is just ∂µ J µ = 0. This
is invariant under Lorentz transformation, simply because the indices work out (ie. we
match indices up with indices down all the time). We see that once we have the right
notation, the laws of physics are so short and simple!

14.4.3 Gauge potentials and electromagnetic fields


Recall that when solving electrostatic and magnetostatic problems, we introduced the
potentials φ and A to help solve two of Maxwell's equations:

∇ × E = 0   ⟺   E = −∇φ
∇ · B = 0   ⟺   B = ∇ × A.

However, in general, we do not have ∇ × E = 0. Instead, the equations are

∇ × E + ∂B/∂t = 0,   ∇ · B = 0.

So we cannot use E = −∇φ directly if B changes. Nonetheless, it's still possible to use
the scalar and vector potentials to solve both of these equations. The solutions are

E = −∇φ − ∂A/∂t,   B = ∇ × A.

It is important to note that φ and A are not unique. We can shift them by
A ↦ A + ∇χ and φ ↦ φ − ∂χ/∂t for any function χ(x, t). These are known as gauge
transformations. Plug them into the equations and you will see that you get the same
E and B. Now we have ended up with four objects, 1 from φ and 3 from A, so we can
put them into a 4-vector gauge potential:

A^µ = (φ/c, A).

We will assume that this makes sense, ie. that it is a genuine 4-vector. Now gauge
transformations take a really nice form: Aµ ↦ Aµ − ∂µ χ. Finally, we define the
anti-symmetric electromagnetic tensor Fµν = ∂µ Aν − ∂ν Aµ. Since this is antisymmetric,
the diagonal entries are all 0, and Fµν = −Fνµ. So this object has (4 × 4 − 4)/2 = 6
independent components, so it could encapsulate information about the electric and
magnetic fields (and nothing else). We note that F is invariant under gauge
transformations:

Fµν ↦ Fµν − ∂µ∂ν χ + ∂ν∂µ χ = Fµν.
We can compute components of Fµν:

F01 = (1/c) ∂(−Ax)/∂t − ∂(φ/c)/∂x = Ex/c.

Note that we have −Ax instead of Ax since Fµν was defined in terms of Aµ with indices
down. We can similarly obtain F02 = Ey/c and F03 = Ez/c. We also have

F12 = ∂(−Ay)/∂x − ∂(−Ax)/∂y = −Bz,

etc. Therefore

Fµν = [ [0, Ex/c, Ey/c, Ez/c], [−Ex/c, 0, −Bz, By], [−Ey/c, Bz, 0, −Bx], [−Ez/c, −By, Bx, 0] ].

Raising both indices, we have

F^{µν} = η^{µρ} η^{νσ} Fρσ = [ [0, −Ex/c, −Ey/c, −Ez/c], [Ex/c, 0, −Bz, By], [Ey/c, Bz, 0, −Bx], [Ez/c, −By, Bx, 0] ].

So the electric fields are inverted and the magnetic field is intact. Both Fµν and F^{µν}
are tensors, since they are constructed out of Aµ, ∂µ and ηµν, which themselves
transform nicely under the Lorentz group. Under a Lorentz transformation, we have
F′^{µν} = Λ^µ_ρ Λ^ν_σ F^{ρσ}. For example, under a rotation Λ = [ [1, 0], [0, R] ], we
find that E′ = RE and B′ = RB. Under a boost by v in the x-direction, we have

E′x = Ex,   E′y = γ(Ey − vBz),   E′z = γ(Ez + vBy),
B′x = Bx,   B′y = γ(By + (v/c²)Ez),   B′z = γ(Bz − (v/c²)Ey).
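These transformation rules can be checked directly from F′^{µν} = Λ^µ_ρ Λ^ν_σ F^{ρσ}, ie. F′ = ΛFΛᵀ as matrices. The following sketch (not from the notes; the field values and boost speed are arbitrary) builds F^{µν}, boosts it, and compares a few components with the formulas above.

    import numpy as np

    c = 3e8

    def F_up(E, B):
        Ex, Ey, Ez = E / c
        Bx, By, Bz = B
        return np.array([[0, -Ex, -Ey, -Ez],
                         [Ex,  0, -Bz,  By],
                         [Ey,  Bz,  0, -Bx],
                         [Ez, -By,  Bx,  0]])

    def boost(v):
        g = 1 / np.sqrt(1 - (v / c) ** 2)
        L = np.eye(4)
        L[0, 0] = L[1, 1] = g
        L[0, 1] = L[1, 0] = -g * v / c
        return L

    E = np.array([1.0, 2.0, 3.0]); B = np.array([4e-8, 5e-8, 6e-8]); v = 0.6 * c
    Fp = boost(v) @ F_up(E, B) @ boost(v).T           # F'^{mu nu} = L^mu_rho L^nu_sigma F^{rho sigma}
    g = 1 / np.sqrt(1 - (v / c) ** 2)

    print(Fp[1, 0] * c, E[0])                         # E'_x = E_x
    print(Fp[2, 0] * c, g * (E[1] - v * B[2]))        # E'_y = gamma (E_y - v B_z)
    print(-Fp[1, 2],    g * (B[2] - v * E[1] / c**2)) # B'_z = gamma (B_z - v E_y / c^2)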
E. 14-20
<Boosted line charge> An infinite line along the x direction with uniform charge per
unit length η has electric field

E = (η/(2πε0 (y² + z²))) (0, y, z)

and magnetic field B = 0. Plugging this into the transformation equations above, an
observer in a frame S′ boosted with v = (v, 0, 0), ie. parallel to the wire, sees

E′ = (ηγ/(2πε0 (y² + z²))) (0, y, z) = (ηγ/(2πε0 (y′² + z′²))) (0, y′, z′),
B′ = (ηγv/(2πε0 c² (y² + z²))) (0, z, −y) = (ηγv/(2πε0 c² (y′² + z′²))) (0, z′, −y′),

where we have y = y′ and z = z′ because the boost is in the x-direction. In frame S′,
the charge density is Lorentz contracted to γη. Since the charge is now moving, the
observer in frame S′ sees a current I′ = −γηv. The magnetic field can be written as

B′ = (µ0 I′/(2π√(y′² + z′²))) ϕ̂′,   where   ϕ̂′ = (1/√(y′² + z′²)) (0, −z′, y′)

is the azimuthal basis vector of cylindrical coordinates. This is just the magnetic field
due to a current in a wire. This is what we calculated from Ampere's law previously.
But we didn't use Ampere's law here. We used Gauss' law (to get the electric field),
and then applied a Lorentz boost.
We see that magnetic fields are relativistic effects of electric fields. They are what
we get when we apply a Lorentz boost to an electric field. So relativity is not only
about very fast objects. It is there when you stick a magnet onto your fridge!
E. 14-21
<Boosted point charge> A boosted point charge generates a current, but is
not the steady current we studied in magnetostatics. As the point charge moves,
the current density moves. A point charge Q, at rest in frame S has
 
x
Q Q y 
E= 2
r̂ = 2 2 2 3/2
and B=0
4πε0 r 4πε0 (x + y + z )
z
14.4. ELECTROMAGNETISM AND RELATIVITY 645

In frame S 0 , boosted with v = (v, 0, 0), we have


 0
x + vt0
  
x
Q Qγ
E0 = γy  =  y0  .
4πε0 (x2 + y 2 + z 2 )3/2 4πε0 (γ 2 (x0 + vt0 )2 + y 02 + z 02 )3/2
γz z0

The particle is at (−vt0 , 0, 0) in S 0 , so we see that the electric field emanates


from the position of the charge, as it should. Let’s look at the electric field at
t0 = 0. Then the radial vector is r0 = (x0 , y 0 , z 0 ). However, the electric field is
not isotropic. This arises from the denominator of which is not proportional to
r03 because there’s an extra factor of γ 2 in front of the x0 component. In the
denominator, we write

v2 γ 2
γ 2 x02 + y 02 + z 02 = (γ 2 − 1)x02 + r02 = 2 x02 + r02
c
 2 2
v2
  
v γ 02
= cos 2
θ + 1 r = γ 2
1 − sin 2
θ r02
c2 c2

where θ is the angle between the x0 axis and r0 . So at t0 = 0,


1 Q ˆ0
E0 = v2 2
r.
γ 2 (1 − c2
sin2 θ)3/2 4πε0 r
2
The factor 1/γ 2 (1 − vc2 sin2 θ)3/2 squashes the electric field in the direction of mo-
tion. This result was first discovered by Lorentz from solving Maxwell’s equations
directly, which lead to him discovering the Lorentz transformations. There is also
a magnetic field
 
0
µ0 Qγ  z0  .
B=
4π(γ 2 (x0 + vt0 )2 + y 02 + z 02 )3/2
−y 0

Lorentz invariants
We can ask the question: are there any combinations of E and B that all observers
agree on? With index notation, all we have to do is to contract all indices such that
there are no dangling indices. It turns out that there are two such possible combinations.
The first thing we might try would be

(1/2) Fµν F^{µν} = −E²/c² + B²,
which works great. To describe the second invariant, we need to introduce a new object
in Minkowski space. Analogous to εijk in R³, we define the anti-symmetric tensor

ε^{µνρσ} = +1 if µνρσ is an even permutation of 0123,  −1 if it is an odd permutation,
and 0 otherwise.

Under a Lorentz transformation, ε0µνρσ = Λµκ Λνλ Λρα Λσβ εκλαβ . Since εµνρσ is fully
anti-symmetric, so is ε0µνρσ . Similar to what we did in R3 , we can show that the only
646 CHAPTER 14. ELECTROMAGNETISM

fully anti-symmetric tensors in Minkowski space are multiples of ε^{µνρσ}. So ε'^{µνρσ} = aε^{µνρσ} for some constant a. To figure out what a is, test a single component:

        \varepsilon'^{0123} = \Lambda^0{}_\kappa \Lambda^1{}_\lambda \Lambda^2{}_\alpha \Lambda^3{}_\beta \varepsilon^{\kappa\lambda\alpha\beta} = \det\Lambda.

Lorentz transformations have det Λ = +1 (rotations, boosts), or det Λ = −1 (reflections, time reversal). We restrict ourselves to Λ such that det Λ = +1. Then a = 1,
and εµνρσ is invariant. In particular, it is a tensor.
We can finally apply this to electromagnetic fields. The dual electromagnetic tensor
is defined to be
 
        \tilde{F}^{\mu\nu} = \frac{1}{2}\varepsilon^{\mu\nu\rho\sigma}F_{\rho\sigma} = \begin{pmatrix} 0 & -B_x & -B_y & -B_z\\ B_x & 0 & E_z/c & -E_y/c\\ B_y & -E_z/c & 0 & E_x/c\\ B_z & E_y/c & -E_x/c & 0 \end{pmatrix}.

Why do we have the factor of a half? Consider a single component, say F̃^{12}. It gets contributions from both F_{03} and F_{30}, so we need to average the sum to avoid double-counting. F̃^{µν} is yet another antisymmetric matrix. This is obtained from F^{µν} through the substitution E ↦ cB and B ↦ −E/c. Note the minus sign!
F̃^{µν} is a tensor. We can construct the other Lorentz invariant using F̃^{µν}. We don't get anything new if we contract this with itself, since F̃_{µν}F̃^{µν} = −F_{µν}F^{µν}. Instead, the second Lorentz invariant is

        \frac{1}{4}\tilde{F}^{\mu\nu}F_{\mu\nu} = \mathbf{E}\cdot\mathbf{B}/c.
4

14.4.4 Maxwell Equations


To write out Maxwell’s Equations relativistically, we have to write them out in the
language of tensors. It turns out that they are

        \partial_\mu F^{\mu\nu} = \mu_0 J^\nu \qquad\qquad \partial_\mu \tilde{F}^{\mu\nu} = 0.

As we said before, if we find the right way of writing equations, they look really simple!
We don’t have to worry ourselves with where the c and µ0 , ε0 go! Note that each law is
actually 4 equations, one for each of ν = 0, 1, 2, 3. Under a Lorentz boost, the equations
are not invariant individually. Instead, they all transform nicely by left-multiplication
of Λνρ .
We now check that these agree with Maxwell's equations. First work with the first equation: when ν = 0, we are left with ∂_i F^{i0} = µ_0 J^0, where i ranges over 1, 2, 3. This is equivalent to saying

        \nabla\cdot\left(\frac{\mathbf{E}}{c}\right) = \mu_0 \rho c \qquad\text{or}\qquad \nabla\cdot\mathbf{E} = c^2\mu_0\rho = \frac{\rho}{\varepsilon_0}.

When ν = i for some i = 1, 2, 3, we get ∂_µ F^{µi} = µ_0 J^i. So after some tedious calculation, we obtain

        -\frac{1}{c}\frac{\partial}{\partial t}\left(\frac{\mathbf{E}}{c}\right) + \nabla\times\mathbf{B} = \mu_0\mathbf{J}.

Similarly for the second equation, we have

        \partial_i \tilde{F}^{i0} = 0 \implies \nabla\cdot\mathbf{B} = 0
        \partial_\mu \tilde{F}^{\mu i} = 0 \implies \frac{\partial\mathbf{B}}{\partial t} + \nabla\times\mathbf{E} = 0.
So we recover Maxwell’s equations. Then we now see why the J ν term appears in the
first equation and not the second — it tells us that there is only electric charge, not
magnetic charge.
We can derive the continuity equation from Maxwell’s equation here. Since ∂ν ∂µ F µν =
0 due to anti-symmetry, we must have ∂ν J ν = 0. Recall that we once derived the
continuity equation from Maxwell’s equations without using relativity, which worked
but is not as clean as this.
Finally, we recall the long-forgotten potential Aµ . For Fµν defined in terms of it:
Fµν = ∂µ Aν − ∂ν Aµ , the equation ∂µ F̃ µν = 0 comes for free since we have
        \partial_\mu \tilde{F}^{\mu\nu} = \frac{1}{2}\varepsilon^{\mu\nu\rho\sigma}\partial_\mu F_{\rho\sigma} = \frac{1}{2}\varepsilon^{\mu\nu\rho\sigma}\partial_\mu(\partial_\rho A_\sigma - \partial_\sigma A_\rho) = 0
where the last equality holds because of the symmetry of the two derivatives, combined
with the anti-symmetry of the ε-tensor. This means that we can also write the Maxwell
equations as
∂µ F µν = µ0 J ν where Fµν = ∂µ Aν − ∂ν Aµ .

14.4.5 The Lorentz force law


The final aspect of electromagnetism is the Lorentz force law for a particle with charge q moving with velocity u:

        \frac{d\mathbf{p}}{dt} = q(\mathbf{E} + \mathbf{u}\times\mathbf{B}).
To write this in relativistic form, we use the proper time τ (time experienced by
particle), which obeys
        \frac{dt}{d\tau} = \gamma(u) = \frac{1}{\sqrt{1 - u^2/c^2}}.

We define the 4-velocity U = \frac{dX}{d\tau} = \gamma\begin{pmatrix}c\\\mathbf{u}\end{pmatrix}, and 4-momentum P = \begin{pmatrix}E/c\\\mathbf{p}\end{pmatrix}, where E is the energy. Note that E is the energy while \mathbf{E} is the electric field. The Lorentz force law can be written as

        \frac{dP^\mu}{d\tau} = qF^{\mu\nu}U_\nu.

We show that this does give our original Lorentz force law: When µ = 1, 2, 3, we
obtain
        \frac{d\mathbf{p}}{d\tau} = q\gamma(\mathbf{E} + \mathbf{u}\times\mathbf{B}) \implies \frac{d\mathbf{p}}{dt} = q(\mathbf{E} + \mathbf{u}\times\mathbf{B})

by the chain rule dt/dτ = γ. Note that here p = mγu, the relativistic momentum, not the ordinary momentum. But how about the µ = 0 component? We get

        \frac{dP^0}{d\tau} = \frac{1}{c}\frac{dE}{d\tau} = \frac{q}{c}\gamma\,\mathbf{E}\cdot\mathbf{u} \implies \frac{dE}{dt} = q\mathbf{E}\cdot\mathbf{u},
which is the work done by an electric field.

E. 14-22
<Motion in a constant field> Consider a particle in a vanishing magnetic
field and constant electric field E = (E, 0, 0) (E here is not the energy) and
u = (u(t), 0, 0). Assuming that the particle starts from rest at t = 0, the Lorentz force gives

        m\frac{d(\gamma u)}{dt} = qE \implies m\gamma u = qEt \implies u = \frac{dx}{dt} = \frac{qEt}{\sqrt{m^2 + q^2E^2t^2/c^2}}.

Note that u → c as t → ∞. We can solve to find

        x = \frac{mc^2}{qE}\left(\sqrt{1 + \frac{q^2E^2t^2}{m^2c^2}} - 1\right).

For small t, x ≈ \frac{qEt^2}{2m}, which is the usual non-relativistic result for a particle undergoing constant acceleration in a straight line.
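A small numerical sketch of this (not from the notes; the field strength and particle are illustrative), comparing the relativistic trajectory with the non-relativistic approximation:

```python
import numpy as np

# Illustrative values (roughly an electron in a strong field), SI units.
q, m, c, E = 1.6e-19, 9.1e-31, 3.0e8, 1.0e6

t = np.linspace(0.0, 1e-8, 5)
u = q * E * t / np.sqrt(m**2 + (q * E * t / c)**2)                       # speed, always < c
x_rel = (m * c**2 / (q * E)) * (np.sqrt(1.0 + (q * E * t / (m * c))**2) - 1.0)
x_newt = q * E * t**2 / (2.0 * m)                                        # non-relativistic result

print(u / c)           # approaches 1 but never exceeds it
print(x_rel / x_newt)  # ~1 at small t, then < 1: the particle falls behind the Newtonian prediction
```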
E. 14-23
<Motion in constant magnetic field> Now suppose we have no electric field
and B = (0, 0, B). In the non-relativistic world, we know that particles move in circles with frequency ω = qB/m. Then

        \frac{dP^0}{d\tau} = qF^{0\nu}U_\nu = 0 \implies E = m\gamma c^2 = \text{constant}.

So |u| is constant. Now

        m\frac{d(\gamma\mathbf{u})}{dt} = q\mathbf{u}\times\mathbf{B} \implies m\gamma\frac{d\mathbf{u}}{dt} = q\mathbf{u}\times\mathbf{B}
since |u|, and hence γ, is constant. This is the same equation as the non-relativistic
case, except for the extra γ term. The particle goes in circles with frequency
ω = qB/mγ.
CHAPTER 15

Fluid Dynamics
In real life, we encounter a lot of fluids. For example, there is air and water. These
are known as Newtonian fluids, whose dynamics follow relatively simple equations.
This is fundamentally because they have simple composition — they are made up of
simple molecules. There are non-Newtonian fluids like toothpaste and shampoo, which are molecularly more complex and have more complicated properties. Sand, rice and
foams also have some fluid-like properties, although they are fundamentally made of
small granular solids. Here we will study only Newtonian fluids.
There are many applications of fluid dynamics. On a small scale, the dynamics of fluids
in cells is important for biology. On a larger scale, the fluid flow of the mantle affects
the movement of tectonic plates, while the dynamics of the atmosphere can be used
to explain climate and weather. On an even larger scale, we can use fluid dynamics to
analyse the flow of galactic systems in the universe.

15.0.6 Preliminaries

A fluid is a material that flows. A Newtonian fluid is a fluid with a linear relation-
ship between stress and rate of strain. The constant of proportionality is called the
(dynamic) viscosity . We will consider such fluids with viscosity, although sometimes we will make a simplifying assumption, the inviscid approximation , where we set the viscosity to 0. Otherwise, when we do not assume zero viscosity, we say we have a viscous flow .
Stress is force per unit area. For example, pressure is a stress. Strain is the extension per unit length, and the rate of strain d(strain)/dt here is concerned with gradients (w.r.t. space) of velocity. These quantities are in fact tensor fields, but we will not treat them as such here. We will just consider "simplified" cases. Concepts we define here will become clearer when we start to write down our equations later.
Stresses are the forces that exist inside the fluid. If we have a boundary, we can classify
the stress according to the direction of the force — whether it is normal to or parallel
to the boundary. The boundary can either be an actual physical boundary, or an
imaginary surface we cook up in order to compute things.

Normal stress
Suppose we have a fluid with pressure p acting on a surface with unit normal n pointing into the fluid. The normal stress is τ_p = −pn.
The normal stress is present everywhere, as long as
we have a fluid (with pressure). However, pressure by itself does not do anything,
since pressure acts in all directions, and the net effect cancels out. However, if we have
a pressure gradient, then it gives an actual force and drives fluid flow. For example,
suppose we have a pipe, with the pump on the left:


The pressure is high at the pump end and falls to the atmospheric pressure p_atm at the far end; the resulting body force −∇p drives the water from left to right.

Tangential stress

We can also have stress in the tangential direction of a surface. Suppose we have two infinite plates a distance h apart, with fluid in the middle. We keep the bottom plate at rest, and move the top plate with velocity U. The tangential stress (also called shear stress ) τ_s in this case is the force per unit area required to move the top plate at speed U. This is also the force we need to exert on the bottom plate to keep it still.

By definition of a Newtonian fluid, this stress τ_s is proportional to the velocity gradient U/h, the rate of extension in the x direction per unit length in the y direction. This result is also given by experiments. The dynamic viscosity µ of the fluid is the constant of proportionality in τ_s = µU/h. We can try to figure out the dimensions of these quantities:

        [\tau_s] = ML^{-1}T^{-2} \qquad \left[\frac{U}{h}\right] = T^{-1} \qquad [\mu] = ML^{-1}T^{-1}.

In SI units, µ has unit kg m−1 s−1 .

We have not yet said what the fluid in the middle does. It turns out this is simple: at the bottom, the fluid is at rest, and at the top, the fluid moves with velocity U. In between, the speed varies linearly. We will derive this formally later.

We can imagine viscosity µ as a measure of the "stickiness" or "thickness" of the fluid. Suppose we have a flow that moves in the x direction with speed u(y), and suppose u(0) < u(δ) for some small δ > 0. So fluid particles at y = δ are travelling faster than the fluid particles at y = 0. This is rather unnatural. By the "stickiness" of fluid particles, fluid particles at y = δ will "pull" the fluid particles at y = 0 and try to make them travel faster. So fluid particles at y = 0 will feel a force pulling them forward, caused by the fluid particles above them: every unit area of fluid at y = 0 feels a force µ du/dy (0); this is the tangential stress.

In short the tangential stress exerted by a fluid on a surface with normal n pointing
into the fluid is
        \tau_s = \mu\frac{\partial u}{\partial n} = \mu\,\mathbf{n}\cdot\nabla u
where µ is the dynamic viscosity of the fluid and u is the flow velocity along the surface.
This tangential stress is caused by the fluid on the side of the surface which the normal
is pointing into.

Boundary conditions
How does a fluid behave at a physical boundary? In general, Newtonian fluids satisfy (as experiments show) one of the following two boundary conditions:
1. No-slip condition : More precisely this is the no-slip no-penetration condition, which requires that at the boundary, the fluid velocity equals the velocity of the boundary. In particular, if the boundary is stationary, the fluid velocity is zero at the boundary. This no-slip condition is normally applied when we have a fluid-solid boundary, where we think the fluid "sticks" to the solid boundary. The no-slip condition of course only makes sense when we have a viscous fluid. In the case of an inviscid flow (i.e. zero viscosity) we simply apply the no-penetration condition and allow the fluid to slip.
2. Stress condition : At the boundary (with normal n pointing into the fluid), a
tangential stress τ is imposed on the fluid. In this case,
        -\mu\frac{\partial u_T}{\partial n} = \tau.
This stress condition is common when we have a fluid-fluid boundary (like liquid
and gas), where we require the tangential stresses to match up.

Streamlines, streaklines, and pathlines


In order to visualise a flow, we can draw diagrams. Streamlines, streaklines, and
pathlines are different sorts of diagrams which we can draw:
1. Streamline is a curve that is instantaneously tangent to the velocity vector of the
flow. These show the direction in which different fluid particles on the line are
travelling at the current moment. Streamlines are an instantaneous picture of the
flow. Drawing the streamlines of the flow u(x, t) at time t = t₀ is the same as drawing the "field lines" (like electric field lines) of u(x, t₀).
2. Streakline is the locus of all the fluid particles that have passed continuously through a particular spatial point in the past. Dye steadily injected into the fluid at a fixed point extends along a streakline.
3. Pathline is the trajectory of a single fluid particle. That is, it "records" the path of a fluid element in the flow over a certain period.
Note that if the flow is unsteady, then the streamlines are not particle paths.
E. 15-1
Consider u = (t, 1, 0). When t = 0, the velocity is purely in the y direction, and the streamlines are also vertical; at t = 1, the velocity makes a 45° angle with the horizontal, and the streamlines are slanted. Indeed the streamlines at time t ≠ 0 satisfy dy/dx = 1/t, so the streamlines are y = x/t + c for c a constant. For t = 0, the streamline (x(s), y(s)) satisfies dx/ds = 0 and dy/ds = 1, so (x(s), y(s)) = (c, s), and the streamlines are x = c.

However, no particle will actually follow any of these streamlines. Pathlines satisfy ẋ(t) = u = t and ẏ = v = 1, so x = ½t² + A and y = t + B for constants A, B. For a particle released at x₀ = (x₀, y₀) at t = 0, we have x = ½t² + x₀ and y = t + y₀. Eliminating t, we get that the path is given by x − x₀ = ½(y − y₀)². So the particle paths are parabolas.
What are the streaklines? At time t = T, suppose we wish to find the locus of points that have passed through (x₀, y₀) in the past. We know particle paths satisfy x = ½t² + A and y = t + B, so at time T we have x = ½T² + A and y = T + B. For the particle to have passed through (x₀, y₀) at some point in the past we need A and B to satisfy x₀ = ½t'² + A and y₀ = t' + B for some t' < T. Hence we have x = ½T² + x₀ − ½t'² and y = T + y₀ − t'. Eliminating t' we have x = ½T² + x₀ − ½(y − T − y₀)². So the streakline is x − x₀ = (y − y₀)T − ½(y − y₀)².
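A short numerical cross-check (a sketch, not part of the notes): integrating ẋ = u for this flow reproduces the parabolic pathlines found above.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Velocity field u = (t, 1); pathlines should satisfy x - x0 = (y - y0)^2 / 2.
def rhs(t, xy):
    return [t, 1.0]

x0, y0 = 0.0, 0.0
sol = solve_ivp(rhs, (0.0, 2.0), [x0, y0], dense_output=True, rtol=1e-8)
t = np.linspace(0.0, 2.0, 9)
x, y = sol.sol(t)
print(np.max(np.abs((x - x0) - 0.5 * (y - y0)**2)))   # ~0 up to integration error
```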

15.1 Parallel viscous flow


D. 15-2
• A steady flow is a flow that does not change in time. In other words, all forces
balance, and there is no acceleration.

• A parallel flow is a flow in which the fluid flows only in one direction, and the velocity depends only on the coordinate perpendicular to a plane containing that direction.

For example u = (u(y), 0, 0) is a parallel flow, the fluid only flows in the x direction,
and it only depends on y which is perpendicular to the x − z plane. Note that our
velocity does not depend on the x direction. This can be justified by the assumption
that the fluid is incompressible. If we had a velocity gradient in the x-direction, then
we will have fluid “piling up” at certain places, violating incompressibility.

We will give a formal definition of incompressibility later. In general, though, fluids are
not incompressible. For example, sound waves are exactly waves of compression in air,
and cannot exist if air is incompressible. So we can alternatively state the assumption
of incompressibility as “sound travels at infinite speed”. Hence, compressibility matters
mostly when we are travelling near the speed of sound. If we are moving at low speeds, we can just pretend the fluid is indeed incompressible. This is what we will do mostly,
since the speed of sound is in general very high.

15.1.1 Parallel viscous flow equation


Firstly we will consider the steady case, i.e. a steady parallel viscous flow, and derive its equations of motion. For a general steady parallel viscous flow, we can draw a flow profile in which the lengths of the arrows signify the magnitude of the velocity at each height. To derive the equations of motion, we consider a small box in the fluid with corners at (x, y), (x + δx, y), (x, y + δy) and (x + δx, y + δy). We know that this block of fluid moves in the x direction without acceleration. So the total force of the surrounding environment on the box should vanish.

We first consider the x direction. There are normal stresses on all the sides, and
tangential stresses at the top and bottom. The sum of forces in the x-direction (per
unit transverse width) gives

p(x)δy − p(x + δx)δy + τs (y + δy)δx + τs (y)δx = 0.

By the definition of τ_s, we can write τ_s(y + δy) = µ ∂u/∂y (y + δy) and τ_s(y) = −µ ∂u/∂y (y), where the different signs come from the different normals. Dividing by δxδy, we get

        \frac{1}{\delta x}\bigl(p(x) - p(x+\delta x)\bigr) + \mu\frac{1}{\delta y}\left(\frac{\partial u}{\partial y}(y+\delta y) - \frac{\partial u}{\partial y}(y)\right) = 0.
Taking the limit as δx, δy → 0, we end up with the equation of motion

        -\frac{\partial p}{\partial x} + \mu\frac{\partial^2 u}{\partial y^2} = 0.

Performing similar calculations in the y direction, we obtain

        -\frac{\partial p}{\partial y} = 0.

In the second equation, we keep the negative sign for consistency, but obviously in this
case it is not necessary.
We can extend this result a bit by allowing non-steady flows and external forces on the
fluid. Then the velocity is of the form u = (u(y, t), 0, 0). Writing the external body
force (per unit volume) as (fx , fy , 0), we obtain the equations

        \rho\frac{\partial u}{\partial t} = -\frac{\partial p}{\partial x} + \mu\frac{\partial^2 u}{\partial y^2} + f_x \qquad 0 = -\frac{\partial p}{\partial y} + f_y.

The derivation of these equations is straightforward. Here ρ is the density, ie. the
mass per unit volume. The following table gives the approximate values of ρ and µ for
water and air

              µ (kg m^{-1} s^{-1})   ρ (kg m^{-3})   ν (m^2 s^{-1})
      water   10^{-3}                10^3            10^{-6}
      air     2 × 10^{-5}            1               2 × 10^{-5}
where ν := µ/ρ is known as the kinematic viscosity .
E. 15-3
<Couette flow> Consider the flow driven by the motion of a boundary, as we have previously seen: the bottom plate at y = 0 is at rest and the top plate at y = h moves with speed U. We assume that this is a steady flow, and there is no pressure gradient. So our equations give

        \frac{\partial^2 u}{\partial y^2} = 0.

Moreover, the no-slip condition says u = 0 on y = 0 and u = U on y = h. The solution is thus u = Uy/h.

E. 15-4
<Poiseuille flow> Consider a flow driven by a pressure gradient between stationary boundaries. We have a high pressure P₁ on the left, and a low pressure P₀ < P₁ on the right. We solve this problem again, but we will also include gravity. So the equations of motion become

        -\frac{\partial p}{\partial x} + \mu\frac{\partial^2 u}{\partial y^2} = 0 \qquad -\frac{\partial p}{\partial y} - \rho g = 0.

The boundary conditions are u = 0 at y = 0, h. The second equation implies p = −ρgy + f(x) for some function f. So ∂p/∂x = f'(x). Substituting into the first gives

        \mu\frac{\partial^2 u}{\partial y^2} = f'(x).

The left is a function of y only, while the right depends only on x. So both must be constant, say f'(x) = −G. Write L for the length of the tube. Using the boundary conditions, P₀ − P₁ = \int_0^L \partial p/\partial x\,dx = \int_0^L f'(x)\,dx = −LG, so G = (P₁ − P₀)/L and

        \mu\frac{\partial^2 u}{\partial y^2} = f'(x) = -G = -\frac{P_1 - P_0}{L} \implies u = \frac{G}{2\mu}y(h - y).
Here the velocity is the greatest at the middle, where y = h/2. Since the equations
of motion are linear, if we have both a moving boundary and a pressure gradient,
we can just add the two solutions up.
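As a sanity check (a sketch with made-up parameter values, not part of the notes), one can solve µ d²u/dy² = −G with u(0) = u(h) = 0 by finite differences and compare with the profile above:

```python
import numpy as np

mu, G, h, N = 1e-3, 2.0, 0.01, 101            # illustrative values
y = np.linspace(0.0, h, N)
dy = y[1] - y[0]

# Finite differences on the interior points: (u[i-1] - 2u[i] + u[i+1]) / dy^2 = -G/mu.
A = (np.diag(-2.0 * np.ones(N - 2)) + np.diag(np.ones(N - 3), 1)
     + np.diag(np.ones(N - 3), -1)) / dy**2
u = np.concatenate([[0.0], np.linalg.solve(A, -G / mu * np.ones(N - 2)), [0.0]])

u_exact = G / (2.0 * mu) * y * (h - y)
print(np.max(np.abs(u - u_exact)))             # essentially zero: the exact profile is quadratic
```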

15.1.2 Derived properties of a flow


The velocity and pressure already fully describe the flow. However, there are some
other useful quantities we can compute out of these. For example how much stuff is
being transported by the flow, how much “rotation” there is in the flow, and the force
it exerts on the boundaries:
1. The volume flux (volumetric flow rate) is the volume of fluid traversing a cross-
section per unit time. For parallel flow u = (u(y, t), 0, 0) confined in y ∈ [0, h] the
volume flux is given by q = \int_0^h u(y, t)\,dy per unit transverse width.
• For the Couette flow, we have q = \int_0^h \frac{Uy}{h}\,dy = \frac{Uh}{2}.
• For the Poiseuille flow, we have q = \int_0^h \frac{G}{2\mu}y(h - y)\,dy = \frac{Gh^3}{12\mu}.
Clearly the volume flux can be defined for any surface we choose (not necessarily the
cross-section of the whole thing), so that it is the volume of fluid passing through
a given surface per unit time.
2. The vorticity is defined by ω = ∇ × u. In our case, since we have u = (u(y, t), 0, 0), we have ω = (0, 0, −∂u/∂y).
• For the case of the Couette flow, the vorticity is ω = (0, 0, −U/h). This is a constant, ie. the vorticity is uniform.

• For the case of the Poiseuille flow, we have ω = \left(0, 0, \frac{G}{\mu}\left(y - \frac{h}{2}\right)\right).

3. We can also calculate the force exerted on the boundary by the fluid. Recall that
the tangential stress τs is the tangential force per unit area exerted by the fluid on
the surface, given by

        \tau_s = \mu\frac{\partial u}{\partial n} \qquad\text{where } \mathbf{n} \text{ points into the fluid.}

• For the Couette flow,

        \tau_s = \begin{cases} \mu U/h & y = 0\\ -\mu U/h & y = h \end{cases}

so at y = 0 the stress is positive, and pulls the surface forward; at y = h it is negative, and the surface is pulled backwards.
• For the Poiseuille flow,

        \tau_s = \begin{cases} Gh/2 & y = 0\\ Gh/2 & y = h \end{cases}

so both surfaces are pulled forward, and this is independent of the viscosity. This makes sense since the force on the surface is given by the pressure gradient, which is independent of the fluid.
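These quantities are easy to reproduce symbolically; a minimal sympy sketch (not in the original notes) for the Poiseuille profile:

```python
import sympy as sp

y, h, G, mu = sp.symbols('y h G mu', positive=True)
u = G / (2 * mu) * y * (h - y)                    # Poiseuille profile

q = sp.integrate(u, (y, 0, h))                    # volume flux per unit transverse width
omega_z = -sp.diff(u, y)                          # vorticity component (0, 0, -du/dy)
tau_bottom = (mu * sp.diff(u, y)).subs(y, 0)      # stress on y = 0 (normal +y into the fluid)
tau_top = (-mu * sp.diff(u, y)).subs(y, h)        # stress on y = h (normal -y into the fluid)

print(sp.simplify(q))                             # G*h**3/(12*mu)
print(sp.simplify(omega_z))                       # G*(2*y - h)/(2*mu), i.e. (G/mu)*(y - h/2)
print(sp.simplify(tau_bottom), sp.simplify(tau_top))   # both G*h/2
```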

15.1.3 Examples of more interesting flows

Gravity-driven flow down a slope


Suppose we have some fluid of depth h flowing down a slope inclined at angle α to the horizontal, with just atmosphere (at pressure p₀) above the fluid. We take x along the slope and y perpendicular to it, and assume the fluid flow is steady, ie. u is merely a function of y. We further assume that the atmospheric pressure does not vary over the vertical extent of the flow. This is a very good approximation because ρ_air ≪ ρ_liq. Similarly, we assume µ_air ≪ µ_liq. So the air exerts no significant tangential stress. This is known as a free surface.

We first solve the y momentum equation. The force (per area) in the y direction is
−ρgδy cos α. Hence the equation is

        \frac{\partial p}{\partial y} = -\rho g\cos\alpha \implies p = p_0 - \rho g(y - h)\cos\alpha

using the fact that p = p0 at the top boundary. In particular, p is independent of x.


In the x component, we get

        \mu\frac{\partial^2 u}{\partial y^2} = -\rho g\sin\alpha \implies u = \frac{\rho g\sin\alpha}{2\mu}y(2h - y),

using the no-slip condition u = 0 when y = 0 and the condition that there is no stress at y = h, i.e. ∂u/∂y = 0 when y = h. This is a bit like the Poiseuille flow, with ρg sin α playing the role of the pressure gradient G. But instead of going to zero at y = h, the velocity goes to zero at y = 2h instead. So this is half a Poiseuille flow.

An unsteady flow in semi-infinite domain


For a change, we do a case where we have an unsteady flow. Consider fluid initially at rest in y > 0, resting on a flat surface. At time t = 0, the boundary y = 0 starts to move at constant speed U. There is no force and no pressure gradient. We use the x-momentum equation to get

        \frac{\partial u}{\partial t} = \nu\frac{\partial^2 u}{\partial y^2} \qquad\text{where}\qquad \nu = \frac{\mu}{\rho}.
∂t ∂y ρ
This is clearly the diffusion equation, with the diffusivity ν, which is the kinematic
viscosity. We can view this as the diffusion coefficient for motion/momentum/vorticity.
The diffusion equation can be solved by methods from the Methods course (including Fourier series or Fourier/Laplace transforms in time, etc.). The boundary conditions are u = 0
for t = 0 and u → 0 as y → ∞ for all t. The other boundary condition is obviously
u = U when y = 0, for t > 0.
Before we start, we try to do some dimensional analysis to gain some intuition about
the problem. We will try to figure out the approximate scales of things in our system, in
particular we will try to figure out how far we have to go away from the boundary before
we don’t feel any significant motion and how fast the movement of fluid propagates
up the y axis. We are already provided with a velocity U , which we use as our
characteristic speed. We let T be the time scale which we care about. We note that
in this case, we don’t really have an extrinsic length scale — in the case where we
have two boundaries, the distance between them is a natural length scale to compare
with, but here the fluid is infinitely thick. So what is a characteristic length scale
δ corresponding to the chosen time scale? Well, putting them into our differential
equation we obtain
        \frac{U}{T} \sim \nu\frac{U}{\delta^2} \implies \delta \sim \sqrt{\nu T}.

So we expect the decay length of u up the y axis to be O(\sqrt{\nu t}), and we expect the movement of fluid to propagate up the y axis like \sqrt{\nu t}.
We now solve the problem properly. By symmetry of the system we expect the solution u to depend only on the variables y and t. In an infinite domain with no extrinsic length scale, the diffusion equation admits a similarity solution. Note that if u(y, t) is a solution to our problem, then so is ũ(y, t) = u(λy, λ²t) for any λ > 0. If we are to have a unique solution to the problem, the two solutions must be the same, that is our solution must be self-similar in the sense that u(y, t) = ũ(y, t) = u(λy, λ²t). This suggests that our solution should only depend on the single combination y/√t.
Indeed if we write u(y, t) = Uf(η), where f(η) is a dimensionless function of the dimensionless variable η = y/δ = y/√(νt), and substitute this form of the solution into the differential equation, we get

        -\frac{1}{2}\eta f'(\eta) = f''(\eta) \qquad\text{with boundary condition } f = 1 \text{ on } \eta = 0.
The solution is by definition f = erfc(η/2), where erfc(z) = 1 − erf(z) = \frac{2}{\sqrt{\pi}}\int_z^\infty e^{-s^2}\,ds, so

        u = U\,\operatorname{erfc}\left(\frac{y}{2\sqrt{\nu t}}\right).
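A brief numerical sketch (not from the notes; the parameter values are illustrative) confirming this: an explicit finite-difference solution of u_t = ν u_yy with an impulsively started boundary approaches U erfc(y/(2√(νt))).

```python
import numpy as np
from scipy.special import erfc

nu, U = 1e-6, 1.0                          # kinematic viscosity of water; plate speed
y = np.linspace(0.0, 0.02, 401)            # domain much deeper than sqrt(nu*t)
dy = y[1] - y[0]
dt = 0.4 * dy**2 / nu                      # explicit scheme, stable for dt <= dy^2 / (2*nu)

u = np.zeros_like(y)
u[0] = U                                   # boundary y = 0 impulsively set in motion at t = 0
t = 0.0
while t < 10.0:
    u[1:-1] += nu * dt / dy**2 * (u[2:] - 2.0 * u[1:-1] + u[:-2])
    u[0], u[-1] = U, 0.0
    t += dt

print(np.max(np.abs(u - U * erfc(y / (2.0 * np.sqrt(nu * t))))))   # small discretisation error
```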

Using our table of values for kinematic viscosities, we find that ν_air ≈ 20ν_water, so motion is induced further/faster into air than into water. We can also compute the magnitude of the tangential stress of the fluid in the above case to be

        |\tau_s| = \mu\left|\frac{\partial u}{\partial y}\right|_{y=0} = \mu\frac{U}{\sqrt{\nu t}}\frac{1}{\sqrt{\pi}} = \frac{\mu U}{\sqrt{\pi\nu t}}.

Using values for viscosities, we find that µ/√ν is about 1 for water and about 5 × 10⁻³ for air. So water exerts a much greater shear stress for the same motion. This is significant in, say, the motion of ocean currents. When the wind blows, it causes the water in the ocean to move along with it. This happens in a way such that the tangential stresses of the two fluids match at the boundary. Hence we see that even if the air blows really quickly, the resultant ocean current is much smaller, say a hundredth to a thousandth of it.

15.2 Kinematics
15.2.1 Material time derivative
We first want to consider the problem of how we can measure the change in a quantity,
say f . This might be pressure, velocity, temperature, or anything else you can imagine.
The obvious thing to do would be to consider the time derivative ∂f/∂t. In physical terms, this would be equivalent to fixing our measurement instrument at a point and measuring the quantity over time. This is known as the Eulerian picture . However,
often we want to consider something else. We pretend we are a fluid particle, and move
along with the flow. We then measure the change in f along our trajectory. This is
known as the Lagrangian picture .
Let’s look at these two pictures, and see how they relate to each other. Consider a
time-dependent field f (x, t). For example, it might be the pressure of the system, or
the temperature of the fluid. Consider a path x(t) through the field, and we want to
know how the field varies as we move along the path. Along the path x(t), the chain
rule gives
        \frac{df}{dt}(x(t), t) = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} + \frac{\partial f}{\partial z}\frac{dz}{dt} + \frac{\partial f}{\partial t} = \nabla f\cdot\dot{x} + \frac{\partial f}{\partial t}.
If x(t) is the (Lagrangian) path followed by a fluid particle, then necessarily ẋ(t) = u
by definition. In this case, we write this as
        \frac{Df}{Dt} = \mathbf{u}\cdot\nabla f + \frac{\partial f}{\partial t}.
This is called the material derivative or Lagrangian derivative , which is the change
in f as we follow the path. On the right hand side of this equation, the first term
u · ∇f is the advective derivative , which is the change due to change in position; the second term ∂f/∂t is the Eulerian time derivative , the change at a fixed point.
For example, consider a river that gets wider as we go downstream. We know (from, say, experience) that the flow is faster upstream than downstream. If the motion is steady, then the Eulerian time derivative vanishes, but the Lagrangian derivative does not, since as the fluid goes down the stream, the fluid slows down, and there is a spatial variation.

Often, it is the Lagrangian derivative that is relevant, but the Eulerian time derivative
is what we can usually put into differential equations. So we will need both of them,
and relate them by the formula above. A final remark: later we will take the material time derivative of a vector field, say f, so Df/Dt = u · ∇f + ∂f/∂t; while u · ∇f might seem confusing, it is just u · ∇f = Σᵢ(u · ∇fᵢ)eᵢ = (u · ∇)f.

15.2.2 Conservation of mass


We can start formulating equations. The first one is the conservation of mass. We fix an arbitrary region of space D with boundary ∂D and outward normal n. We imagine there is some flow through this volume. What we want to say is that the change in the mass inside D is equal to the total flow of fluid through the boundary. We can write this as

        \frac{d}{dt}\int_D \rho\,dV = -\int_{\partial D} \rho\mathbf{u}\cdot\mathbf{n}\,dS.

We have the negative sign since we picked the outward normal, and hence the integral measures the outward flow of fluid. This makes sense since if fluid flows out, then the mass should decrease. Since the domain is fixed, we can interchange the derivative and the integral on the left; on the right, we can use the divergence theorem. Then we get

        \int_D \left(\frac{\partial\rho}{\partial t} + \nabla\cdot(\rho\mathbf{u})\right) dV = 0.

Since D was arbitrary, everywhere in space we must have

<Conservation law>        \frac{\partial\rho}{\partial t} + \nabla\cdot(\rho\mathbf{u}) = 0 \qquad\text{or equivalently}\qquad \frac{D\rho}{Dt} + \rho\nabla\cdot\mathbf{u} = 0.
This is the general form of a conservation law — the rate of change of “stuff” density
plus the divergence of the “stuff flux” is constantly zero. Similar conservation laws
appear everywhere in physics. In the (first) conservation equation, we can expand
∇ · (ρu) to get

        \frac{\partial\rho}{\partial t} + \mathbf{u}\cdot\nabla\rho + \rho\nabla\cdot\mathbf{u} = 0 \implies \frac{D\rho}{Dt} + \rho\nabla\cdot\mathbf{u} = 0,

since the first two terms together are just the material derivative of ρ. With the conservation of mass, we can now properly say what incompressibility is. What exactly happens when we compress a fluid? When we compress a blob of fluid, in order to conserve mass, we must increase the density. Indeed we say a fluid is incompressible if the density of a fluid particle does not (cannot) change. If we don't allow changes in density, then the material derivative Dρ/Dt must vanish. Hence an incompressible flow satisfies ∇ · u = 0; this is known as the continuity equation .
Of course, incompressibility is just an approximation. Since we can hear things, ev-
erything must be compressible. So what really matters is whether the speed is small
compared to the speed of sound. If it is relatively small, then incompressibility is a
good approximation. In air, the speed of sound is approximately 340 m s−1 . In water,
the speed of sound is approximately 1500 m s−1 .

E. 15-5
For a flow u = (u, 0, 0) which flows in one direction, if the flow is incompressible, we must have ∇ · u = ∂u/∂x = 0. So we consider u of the form u = u(y, z, t).

15.2.3 Kinematic boundary conditions


Suppose our system has a boundary. There are two possible cases: either the boundary
is rigid, like a wall, or it moves with the fluid. In either case, fluids are not allowed to
pass through the boundary.
To deal with boundaries, we first suppose it has a velocity U (which may vary with
position if the boundary extends through space). We define a local reference frame
moving with velocity U, so that in this frame of reference, the boundary is stationary.
In this frame of reference, the fluid has relative velocity u0 = u − U. As we mentioned,
fluids are not allowed to cross the boundary. Let the normal to the boundary be n,
then we must have u0 · n = 0. In terms of the outside frame, we get u · n = U · n, this
is the condition we have for the boundary.
If we have a stationary rigid boundary, eg. a wall, then U = 0. So our boundary condition is u · n = 0. Free boundaries are more complicated. A good example of a free material boundary is the surface of a water wave, or the interface between two immiscible fluids — say oil and water. We can define the surface by a function z = ζ(x, y, t). We can alternatively define the surface as a contour of

        F(x, y, z, t) = z - \zeta(x, y, t).
Then the surface is defined by F = 0. A fluid particle on the free surface must remain on the free surface, hence F = 0 along that particle, that is we must have DF/Dt = 0 on the free surface. Alternatively, to see this recall from IA Vector Calculus that the normal n is parallel to ∇F = (−ζ_x, −ζ_y, 1). Also, we have U = (0, 0, ζ_t). We now let the velocity of the fluid be u = (u, v, w). Then the boundary condition requires u · n = U · n. In other words, we have −uζ_x − vζ_y + w = ζ_t. So we get

        w = u\zeta_x + v\zeta_y + \zeta_t = \mathbf{u}\cdot\nabla\zeta + \frac{\partial\zeta}{\partial t} = \frac{D\zeta}{Dt}.

So all the boundary condition says is Dζ/Dt = w. This is the same as saying DF/Dt = 0 on the free surface. We will discuss surface waves later on, and then come back and use this condition.

15.2.4 Streamfunction for incompressible flow


Two dimensional flow
Suppose our fluid is incompressible, ie. ∇ · u = 0. By IA Vector Calculus, this implies
there is a vector potential A such that u = ∇ × A. In the special case where the flow
is two dimensional, say u = (u(x, y, t), v(x, y, t), 0), we can take A to be of the form A = (0, 0, ψ(x, y, t)). Taking the curl of this, we get

        \mathbf{u} = \left(\frac{\partial\psi}{\partial y}, -\frac{\partial\psi}{\partial x}, 0\right).

The ψ such that A = (0, 0, ψ) is called the stream function . Alternatively we can derive this by noting that the incompressibility condition ∂u/∂x + ∂v/∂y = 0 means that −v dx + u dy is an exact differential, hence we can find ψ such that ∇ψ = (−v, u). This stream function is both physically significant and mathematically convenient, as we will soon see. We look at some properties of the stream function. The first thing we can do is to look at the contours ψ = c. These have normal n = ∇ψ = (ψ_x, ψ_y, 0). We immediately see that

        \mathbf{u}\cdot\mathbf{n} = \frac{\partial\psi}{\partial y}\frac{\partial\psi}{\partial x} - \frac{\partial\psi}{\partial x}\frac{\partial\psi}{\partial y} = 0.

So the flow is perpendicular to the normal, ie. tangent to the contours of ψ. So the contours of the stream function ψ are in fact the streamlines, which describe an instantaneous picture of the flow. It is also worth noting that in this case we have u =
∇×(0, 0, ψ) = (∇ψ)×(0, 0, 1). Also, note that ψ must be constant on a stationary rigid
boundary, ie. the boundary is a streamline, since the flow is tangential at the boundary.
This is a consequence of u · n = 0. We often choose ψ = 0 as our boundary.
To draw streamlines we just need to draw curves given by ψ = constant. Typically, we draw streamlines that are "evenly spaced", ie. we pick the streamlines ψ = c₀, ψ = c₁, ψ = c₂ etc. such that c₁ − c₀ = c₂ − c₁. Then we know the flow is faster where streamlines are closer together. This is because the fluid between any two streamlines must remain between those streamlines. So if the flow is incompressible, to conserve mass, it must move faster where the streamlines are closer.
We can also consider the volume flux (per unit length in the z-direction) crossing any curve from x₀ to x₁. The volume flux is q = \int_{x_0}^{x_1} \mathbf{u}\cdot\mathbf{n}\,d\ell. We see that n dℓ = (−dy, dx). So we can write this as

        q = \int_{x_0}^{x_1} \left(-\frac{\partial\psi}{\partial y}\,dy - \frac{\partial\psi}{\partial x}\,dx\right) = \psi(x_0) - \psi(x_1).

So the flux depends only on the difference in the value of ψ. Hence, for closer streamlines, to maintain the same volume flux, we need a higher speed.
Sometimes it is convenient to consider the case when we have plane polars. We embed
these in cylindrical polars (r, θ, z). Then we have

        \mathbf{u} = \nabla\times(0, 0, \psi) = \frac{1}{r}\begin{vmatrix} \mathbf{e}_r & r\mathbf{e}_\theta & \mathbf{e}_z\\ \partial_r & \partial_\theta & \partial_z\\ 0 & 0 & \psi \end{vmatrix} = \left(\frac{1}{r}\frac{\partial\psi}{\partial\theta}, -\frac{\partial\psi}{\partial r}, 0\right).

One can verify that ∇ · u = 0. It is convenient to note that in plane polars,

        \nabla\cdot\mathbf{u} = \frac{1}{r}\frac{\partial}{\partial r}(ru_r) + \frac{1}{r}\frac{\partial u_\theta}{\partial\theta}.
E. 15-6
Consider the flow u = (y, x) (in Cartesian coordinates); this has stream function ψ = ½(y² − x²) by inspection. So the streamlines are given by y² − x² = c for c constant, a family of hyperbolas with asymptotes y = ±x.
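A tiny sympy sketch (not from the notes) checking this stream function and the incompressibility of the flow:

```python
import sympy as sp

x, y = sp.symbols('x y')
psi = sp.Rational(1, 2) * (y**2 - x**2)

u = sp.diff(psi, y)       # u = d(psi)/dy
v = -sp.diff(psi, x)      # v = -d(psi)/dx
print(u, v)                                           # y, x: the flow u = (y, x)
print(sp.simplify(sp.diff(u, x) + sp.diff(v, y)))     # 0: incompressible
```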

Three dimensional axisymmetric flow


We can also have these stream functions for a three dimensional axisymmetric flow. These stream functions for three dimensional axisymmetric incompressible flows are called Stokes stream functions .
1. Firstly we work in cylindrical polar coordinates (r, θ, z) and consider a flow u = (u_r, 0, u_z) with no θ dependence. The incompressibility condition reads

        0 = \nabla\cdot\mathbf{u} = \frac{1}{r}\frac{\partial}{\partial r}(ru_r) + \frac{\partial}{\partial z}(u_z) \implies 0 = \frac{\partial}{\partial r}(ru_r) + \frac{\partial}{\partial z}(ru_z).

So (ru_z)\,dr - (ru_r)\,dz is an exact differential, hence we can find a stream function ψ such that

        \left(\frac{\partial\psi}{\partial r}, \frac{\partial\psi}{\partial z}\right) = (ru_z, -ru_r) \qquad\text{so}\qquad \mathbf{u} = \left(-\frac{1}{r}\frac{\partial\psi}{\partial z}, 0, \frac{1}{r}\frac{\partial\psi}{\partial r}\right).

2. Similarly this also works for spherical polar axisymmetric flow. In spherical polar coordinates (r, θ, ϕ) we consider a flow u = (u_r, u_θ, 0) with no ϕ dependence. The incompressibility condition reads

        0 = \nabla\cdot\mathbf{u} = \frac{1}{r^2}\frac{\partial(r^2 u_r)}{\partial r} + \frac{1}{r\sin\theta}\frac{\partial(\sin\theta\,u_\theta)}{\partial\theta} \implies 0 = \frac{\partial}{\partial r}(r^2\sin\theta\,u_r) + \frac{\partial}{\partial\theta}(r\sin\theta\,u_\theta).

So (-r\sin\theta\,u_\theta)\,dr + (r^2\sin\theta\,u_r)\,d\theta is an exact differential, hence we can find a stream function ψ such that

        \left(\frac{\partial\psi}{\partial r}, \frac{\partial\psi}{\partial\theta}\right) = (-r\sin\theta\,u_\theta, r^2\sin\theta\,u_r) \qquad\text{so}\qquad \mathbf{u} = \left(\frac{1}{r^2\sin\theta}\frac{\partial\psi}{\partial\theta}, -\frac{1}{r\sin\theta}\frac{\partial\psi}{\partial r}, 0\right).

We can also manually check that for these flows u · ∇ψ = 0, so the flow is tangential
to the surface ψ =constant. However ψ =constant now defines a surface, which can’t
be a streamline. But fortunately the intersection of the ψ =constant surface with any
plane of the form y = Ax (or x = Ay) is a streamline. So we can still use it to draw
the stream lines.

15.3 Dynamics
The equations for the parallel viscous flow was a good start, and it turns out the
general equation of motion is of a rather similar form. Unfortunately, the equation
is rather complicated and difficult to derive, and we will not derive it here. How-
ever, we will be able to derive some special cases of it later under certain simplifying
assumptions.
<Navier-Stokes equation>        \rho\frac{D\mathbf{u}}{Dt} = -\nabla p + \mu\nabla^2\mathbf{u} + \mathbf{f}
This is the general equation for fluid motion. The left hand side is mass times ac-
celeration, and the right is the individual forces — the pressure gradient, viscosity,
and the body forces (per unit volume) respectively. In general, these are very difficult
equations to solve because of non-linearity. For example, in the material derivative,
we have the term u · ∇u. There are a few things to note:

1. The acceleration of a fluid particle is the Lagrangian material derivative of the velocity.
2. The derivation of the viscous term is complicated, since for each side of the cube,
there is one normal direction and two tangential directions. Thus this requires
consideration of the stress tensor, which we will not look at.
3. In a gravitational field, we just have f = ρg. This is the only body force we will
consider.
4. In Cartesian coordinates, for u = (ux , uy , uz ), we have ∇2 u = (∇2 ux , ∇2 uy , ∇2 uz ).
Note that ∇2 u can be written as ∇2 u = ∇(∇·u)−∇×(∇×u). In an incompressible
fluid, this reduces to ∇2 u = −∇ × (∇ × u) = −∇ × ω where ω = ∇ × u is the
vorticity .
5. The Navier-Stokes equation reduces to the parallel flow equation in the special case
of parallel flow, ie. u = (u(y, t), 0, 0).

15.3.1 Pressure
In the Navier-Stokes equation, we have a pressure term. In general, we classify the
pressure into two categories. If there is gravity, then we will get pressure in the fluid due
to the weight of fluid above it. This is what we call hydrostatic pressure . Technically,
this is the pressure in a fluid at rest, ie. when u = 0. We denote the hydrostatic
pressure as pH .
To find this, we put in u = 0 into the Navier-Stokes equation to get ∇pH = f = ρg.
We can integrate this to obtain p_H = ρg · x + p₀ where p₀ is some arbitrary constant.
Usually, we have g = (0, 0, −g). Then pH = p0 − ρgz. This exactly says that the
hydrostatic pressure is the weight of the fluid above you.
What can we infer from this? Suppose we have a body D with boundary ∂D and
outward normal n. Then the force due to the pressure is
        \mathbf{F} = -\int_{\partial D} p_H\mathbf{n}\,dS = -\int_D \nabla p_H\,dV = -\int_D \rho\mathbf{g}\,dV = -\mathbf{g}\int_D \rho\,dV = -M\mathbf{g},

where M is the mass of fluid displaced. The second equality holds since \int_{\partial D} p_H n_i\,dS = \int_{\partial D} p_H\mathbf{e}_i\cdot d\mathbf{S} = \int_D \nabla\cdot(p_H\mathbf{e}_i)\,dV = \int_D (\nabla p_H)_i\,dV. This is Archimedes' principle .
In particular, if the body is less dense than the fluid, it will float; if the body is denser
than the fluid, it will sink; if the density is the same, then it does not move, and we
say it is neutrally buoyant. This is valid only when nothing is moving, since that was
our assumption. Things can be very different when things are moving, which is why
planes can fly.
In general, when there is motion, we might expect some other pressure gradient. It
can either be some external pressure gradient driving the motion (eg. in the case of
Poiseuille flow), or a pressure gradient caused by the flow itself. In either case, we can write p = p_H + p′ where p_H is the hydrostatic pressure, and p′ is what causes/results from the motion. We substitute this into the Navier-Stokes equation to obtain

        \rho\frac{D\mathbf{u}}{Dt} = -\nabla p' + \mu\nabla^2\mathbf{u}.
So the hydrostatic pressure term cancels with the gravitational term. What we usually
do is drop the “prime”, and just look at the deviation from hydrostatic pressure. What

this means is that gravity no longer plays a role, and we can ignore gravity in any flow
in which the density is constant. Then all fluid particles are neutrally buoyant. This is
the case in most of what we will do, except when we consider motion of water waves,
since there is a difference in air density and water density.

15.3.2 Reynolds number


As we mentioned, the Navier-Stokes equation is very difficult to solve. So we want to
find some approximations to the equation. We would like to know if we can ignore
some terms. For example, if we can neglect the viscous term, then we are left with a
first-order equation, not a second-order one. To do so, we look at the balance of terms,
and see if some terms dominate the others. This is done via Reynolds number.
We suppose the flow has a characteristic speed U and an extrinsic length scale L,
externally imposed by geometry. For example, if we look at the flow between two
planes, the characteristic speed can be the maximum (or average) speed of the fluid,
and a sensible length scale would be the length between the planes. And if we are
looking at say a flow pass a solid sphere, then a sensible length scale would be the
radius of the sphere. Next, we have to define the time scale T = L/U . Finally, we
suppose pressure differences have characteristic magnitude P . We are concerned with
differences since it is pressure differences that drive the flow. We are going to take
the Navier-Stokes equation, and look at the scales of the terms. Dividing by ρ, we
get
∂u 1 µ
+ u · ∇u = − ∇p + ν∇2 u where ν= .
∂t ρ ρ
We are going to estimate the size of these terms. We get
U U 1P U
U· ν .
(L/U ) L ρL L2

Dividing by U 2 /L, we get


P ν
1 1 .
ρU 2 UL

The Reynolds number is Re = UL/ν, which is a dimensionless number. This is a measure of the ratio of inertial terms to viscous terms within a fluid. If Re is very
large, then the viscous term is small, and we can probably neglect it. For example,
for an aircraft, we have U ∼ 104 , L ∼ 10 and ν ∼ 10−5 . So the Reynolds number
is large, and we can ignore the viscous term. On the other hand, if we have a small
slow-moving object in viscous fluid, then Re will be small, and the viscous term is
significant. Note that the pressure always scales to balance the dominant terms in the
equation, so as to impose incompressibility, ie. ∇ · u = 0. So we don’t have to care
about its scale.
In practice, it is the Reynolds number, and not the factors U, L, ν individually, that
determines the behaviour of a flow. For example, even though lava is very very viscous,
on a large scale, the flow of lava is just like the flow of water in a river, since they have
comparable Reynolds number. Flows with the same geometry and equal Reynolds
numbers are said to be dynamically similar .
We have the following two cases:

1. When Re ≪ 1, the inertia terms are negligible, and we now have P ∼ ρνU/L = µU/L. So the pressure balances the shear stress. We can approximate the Navier-Stokes equation by dropping the terms on the left hand side, and then we have

<Stokes equations>        0 = -\nabla p + \mu\nabla^2\mathbf{u} \qquad \nabla\cdot\mathbf{u} = 0

where ∇ · u = 0 is the incompressibility condition. This is now a linear system, with four equations and four unknowns (three components of u and the pressure).
2. When Re ≫ 1, the viscous terms are negligible on the extrinsic length scale. Then the pressure scales on the momentum flux, P ∼ ρU², and on extrinsic scales we can approximate the Navier-Stokes equations by the

<Euler equations>        \rho\frac{D\mathbf{u}}{Dt} = -\nabla p \qquad \nabla\cdot\mathbf{u} = 0.
In this case, the acceleration is proportional to the pressure gradient.
Why do we emphasize "extrinsic scale"? Well, we only get the large Reynolds number using the extrinsic scale L. If instead we look at a small length scale δ ≪ L, then we'll end up with a much smaller Reynolds number, in which case viscosity does matter. This is to say that we can only ignore the viscous term when we are looking at the big picture on big length scales like L; we cannot zoom in too much, because viscosity still acts on small length scales. In particular we know that near rigid boundaries viscosity plays a big role, so we cannot look too close to the boundary. Indeed, since we ignored the viscous term we can no longer enforce the no-slip boundary condition on rigid boundaries (note that the order of the differential equation drops by one when we ignore the viscous term). So the solution we get would be wrong close to the boundary. So what is the intrinsic length scale δ at which viscosity matters? At this length scale the viscous and inertia effects are comparable, so we need U² ∼ νU/δ. So we get δ = ν/U. Alternatively, we have

        \frac{\delta}{L} = \frac{\nu}{UL} = \frac{1}{Re}.

Thus, for large Reynolds number, the intrinsic length scale is small as expected, and viscous boundary effects don't matter that much for the big picture. For much of the rest of this course, we will ignore viscosity, and consider inviscid flow .
E. 15-7
Suppose we have an object with characteristic length scale L moving with speed U in a fluid; how does the drag force exerted on the object by the fluid behave? Well, let us list the things that the drag can depend on: µ (kg m⁻¹ s⁻¹), ρ (kg m⁻³), U (m s⁻¹) and L (m). If the Reynolds number is small, then the equation of motion reduces to the Stokes equation, which doesn't depend on the density ρ anymore! Then dimensional analysis tells us that the drag must be proportional to µUL. Alternatively, since for Re ≪ 1 the characteristic pressure difference is P ∼ µU/L, we know the drag force on the object is ∼ µUL. It can be shown that the drag on a sphere of radius a (at very small Reynolds numbers) is in fact 6πµaU (consistent with what we obtained); this is known as Stokes' law .
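A quick numerical sketch (illustrative values, not from the notes) for a small sphere moving through water, using the table of µ, ρ, ν above and the quoted Stokes drag 6πµaU:

```python
import numpy as np

mu, rho, nu = 1e-3, 1e3, 1e-6      # water, from the table of viscosities above
a, U = 1e-5, 1e-4                  # a 10-micron sphere moving at 0.1 mm/s (illustrative)

Re = U * (2 * a) / nu              # Reynolds number based on the diameter
drag = 6 * np.pi * mu * a * U      # Stokes' law, valid since Re << 1
print(Re, drag)
```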

15.3.3 A case study: stagnation point flow


We attempt to find a solution to the Navier-Stokes equation in the half-plane y ≥ 0, subject to the boundary condition u → (Ex, −Ey, 0) as y → ∞, where E > 0 is a constant, and u = 0 on y = 0 (no slip). Intuitively, we expect the flow to come in from above, slow down near the wall, and spread out sideways in both directions.
The boundary condition as y → ∞ gives us a picture of what is going on at large y, but near the boundary the velocity has to start to vanish. So we want to solve the equations to see what happens near the boundary.
This problem does not have any extrinsic length scale. So we can seek a solution in terms of η = y/δ with δ = √(ν/E). Note that δ has dimensions of length: E has dimension [E] = T⁻¹ and ν has dimension [ν] = L²T⁻¹, hence [ν/E] = L² and δ has dimension L. Therefore η = y/δ is dimensionless. Note that we make η a function of y instead of a function of x so that we can impose the boundary condition as y → ∞.
We seek a solution such that u is of the form u = Exg′(η). Applying the incompressibility condition ∇ · u = 0, we must have

        \mathbf{u} = (u, v, 0) = (Exg'(\eta), -E\delta g(\eta), 0).

This has streamfunction ψ = √(νE)\,x\,g(η). Finally, we look at the Navier-Stokes equations. The x and y components are

        u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} = -\frac{1}{\rho}\frac{\partial p}{\partial x} + \nu\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right)
        u\frac{\partial v}{\partial x} + v\frac{\partial v}{\partial y} = -\frac{1}{\rho}\frac{\partial p}{\partial y} + \nu\left(\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}\right).
Substituting our expression for u and v in the first equation, we get

        Exg'\cdot Eg' - E\delta g\cdot Exg''\frac{1}{\delta} = -\frac{1}{\rho}\frac{\partial p}{\partial x} + \nu Exg'''\frac{1}{\delta^2}
        \implies E^2 x(g'^2 - gg'') = -\frac{1}{\rho}\frac{\partial p}{\partial x} + E^2 xg'''.

We can do the same thing for the y-component, and get

        E\sqrt{\nu E}\,gg' = -\frac{1}{\rho}\frac{\partial p}{\partial y} - E\sqrt{\nu E}\,g''.
So we’ve got two equations for g and p. We take the y derivative of the first, and
x derivative of the second, and we can use that to eliminate the p terms. Then we
have
        g'g'' - gg''' = g^{(4)}.

So we have a single equation for g that we shall solve. We now look at our boundary conditions. The no-slip (and no-penetration) condition gives u = v = 0 on y = 0, so g′(0) = g(0) = 0. As y → ∞, the boundary conditions for u give g′(η) → 1 and g(η) → η as η → ∞.
All dimensional variables are absorbed into the scaled variables g and η. So we only
have to solve the ODE once. The far field velocity u = (Ex, −Ey, 0) is reached to
a very good approximation when η > C or equivalently y > Cδ for some constant
C.

y
g 0 (η)
Ex

Cδ = O(1)

η u
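One can integrate the equation once: since g′g″ − gg‴ = (g′² − gg″)′, and g → η, g′ → 1, g″ → 0 in the far field, we get g‴ = g′² − gg″ − 1 (the classical stagnation-point, or Hiemenz, equation). The following is a sketch (my own, not in the notes) of a numerical solution with scipy:

```python
import numpy as np
from scipy.integrate import solve_bvp

# Solve g''' = g'^2 - g*g'' - 1 with g(0) = g'(0) = 0 and g'(infinity) = 1,
# using Y = (g, g', g'') and a finite but large value of eta for "infinity".
def rhs(eta, Y):
    g, gp, gpp = Y
    return np.vstack([gp, gpp, gp**2 - g * gpp - 1.0])

def bc(Ya, Yb):
    return np.array([Ya[0], Ya[1], Yb[1] - 1.0])

eta = np.linspace(0.0, 10.0, 200)
guess = np.vstack([eta, np.ones_like(eta), np.zeros_like(eta)])   # far-field behaviour as a guess
sol = solve_bvp(rhs, bc, eta, guess)
print(sol.status, sol.y[2, 0])   # converged if status == 0; g''(0) is about 1.23,
                                 # which sets the wall stress mu * E * x * g''(0) / delta
```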

At the scale δ, we get a Reynolds number of Re_δ = Uδ/ν ∼ O(1). This is the boundary layer. For a larger extrinsic scale L ≫ δ, we get Re_L = UL/ν ≫ 1. When interested in flow on scales much larger than δ, we ignore the region y < δ (since it is small), and we imagine a rigid boundary at y = δ at which the no-slip condition does not apply.
When Re_L ≫ 1, we solve the Euler equations, namely

        \rho\frac{D\mathbf{u}}{Dt} = -\nabla p + \mathbf{f} \qquad \nabla\cdot\mathbf{u} = 0.

We also have a boundary condition u · n = 0 at the stationary rigid boundary — we don't allow fluid to flow through the boundary. The no-slip condition is no longer satisfied. One can show that u = (Ex, −Ey, 0) satisfies the Euler equations in y > 0 with a rigid boundary at y = 0, with

        p = p_0 - \frac{1}{2}\rho E^2(x^2 + y^2).
We can plot the curves of constant pressure, as well as the streamlines. As the flow enters from the top, the pressure keeps increasing, and this slows down the flow. We say the y-pressure gradient is adverse. As the flow moves down and spreads sideways, the pressure pushes the flow. So the x-pressure gradient is favorable. At the origin, the velocity is zero, and this is a stagnation point. This is also the point of highest pressure. In general, velocity is high at low pressures and low at high pressures. Note that the pressure acts as an internal reaction force imposing the constraint of incompressibility.

15.3.4 Momentum equation for inviscid incompressible fluid


We are now going to derive the Navier-Stokes equation in the special case where
viscosity is assumed to vanish. We will derive this by considering the change in mo-
mentum. Consider an arbitrary volume D with boundary ∂D and outward pointing
normal n. The total momentum of the fluid in D can change owing to four things:
1. Momentum flux across the boundary ∂D;
2. Surface pressure forces;
3. Body forces;
4. Viscous surface forces.

We will ignore the last one. We can then write the rate of change of the total momentum as

        \frac{d}{dt}\int_D \rho\mathbf{u}\,dV = -\int_{\partial D} \rho\mathbf{u}(\mathbf{u}\cdot\mathbf{n})\,dS - \int_{\partial D} p\mathbf{n}\,dS + \int_D \mathbf{f}\,dV. \qquad (*)

It is helpful to write this in suffix notation. In this case, the equation becomes
        \frac{d}{dt}\int_D \rho u_i\,dV = -\int_{\partial D} \rho u_i u_j n_j\,dS - \int_{\partial D} p n_i\,dS + \int_D f_i\,dV.

Just as in the case of mass conservation, we can use the divergence theorem to write

        \int_D \left(\rho\frac{\partial u_i}{\partial t} + \rho\frac{\partial}{\partial x_j}(u_i u_j)\right) dV = \int_D \left(-\frac{\partial p}{\partial x_i} + f_i\right) dV.

Since D is arbitrary, we must have

        \rho\frac{\partial u_i}{\partial t} + \rho u_j\frac{\partial u_i}{\partial x_j} + \rho u_i\frac{\partial u_j}{\partial x_j} = -\frac{\partial p}{\partial x_i} + f_i.

The last term on the left contains ∂u_j/∂x_j = ∇ · u, which vanishes by incompressibility, and the remaining terms give just ρ times the material derivative of u. So we get

<Euler momentum equation>        \rho\frac{D\mathbf{u}}{Dt} = -\nabla p + \mathbf{f}.
This is just the equation we get from the Navier-Stokes equation by ignoring the
viscous terms. However, we were able to derive this directly from momentum conser-
vation.
Suppose we have a conservative force f = −∇χ, where χ is a scalar potential. For example, gravity can be given by f = ρg = ∇(ρg · x) (for ρ constant), so χ = −ρg · x = ρgz if g = (0, 0, −g). Further suppose that we have a steady flow, so ∂u/∂t vanishes. Then the momentum equation (∗) becomes

        0 = -\int_{\partial D} \rho\mathbf{u}(\mathbf{u}\cdot\mathbf{n})\,dS - \int_{\partial D} p\mathbf{n}\,dS - \int_D \nabla\chi\,dV.

We can then convert this to

<Momentum integral for steady flow>        \int_{\partial D} \bigl(\rho\mathbf{u}(\mathbf{u}\cdot\mathbf{n}) + p\mathbf{n} + \chi\mathbf{n}\bigr)\,dS = 0.

This can sometimes be useful. Of course, in the case that we don't have a steady flow and conservative force we can always revert back to using (∗).
E. 15-8
<Force on curved pipe> Suppose in space (no gravity) we have a curved circular pipe with constant cross-sectional area A and a steady inviscid incompressible flow of speed U flowing through it. What is the force the fluid exerts on the pipe?
We use the momentum integral for steady flow, with ∂D in this case made up of three parts: the flat ends S₁ (with outward normal n₁) and S₂ (with outward normal n₂), and the curved surface S of the pipe wall. The

force on this section of the pipe is thus


Z Z Z
F= pndS = − ρu(u · n)dS − pndS
S 1 S ∪S ∪S
2 S ∪S
  1 2
= − ρ(−U n1 )(−U )A + ρ(U n2 )U A + 0 − (pn1 A + pn2 A)

= −A(p + ρU 2 )(n1 + n2 )
Note that the force depend on the background pressure p determined by the pump-
ing station.

15.3.5 Bernoulli’s equation and principle


We saw that for inviscid incompressible flow we have the Euler momentum equation ρ Du/Dt = −∇p + f, where ρ is a fixed constant. If we further assume that we have a conservative force f = −∇χ, then using the vector identity u × (∇ × u) = ∇(½|u|²) − u · ∇u we can rewrite the equation as

        \rho\frac{\partial\mathbf{u}}{\partial t} + \rho\nabla\left(\frac{1}{2}|\mathbf{u}|^2\right) - \rho\mathbf{u}\times(\nabla\times\mathbf{u}) = -\nabla p - \nabla\chi. \qquad (\dagger)

Dotting with u, the last term on the left vanishes, and we get

<Bernoulli's equation>        \frac{1}{2}\rho\frac{\partial|\mathbf{u}|^2}{\partial t} = -\mathbf{u}\cdot\nabla\left(\frac{1}{2}\rho|\mathbf{u}|^2 + p + \chi\right).
In the case of a steady flow, the LHS vanishes, so we know H = ½ρ|u|² + p + χ is constant along streamlines. Since we have a steady flow, the streamlines are particle paths, so this says that an increase in the speed of a fluid occurs simultaneously with a decrease in pressure and vice versa (assuming the value of χ does not change along the part of the streamline we are considering, e.g. a horizontal flow under gravity); this is Bernoulli's principle . Even if the flow is not steady, we can still define the value H, and then we can integrate Bernoulli's equation over a volume D to obtain

        \frac{d}{dt}\int_D \frac{1}{2}\rho|\mathbf{u}|^2\,dV = -\int_{\partial D} H\mathbf{u}\cdot\mathbf{n}\,dS.

So H is the transportable energy of the flow.


We saw that for steady inviscid incompressible flow under a conservative force we have u · ∇H = 0; we can obtain a similar result for the vorticity ω = ∇ × u. Firstly note that in (†), ∂u/∂t vanishes since the flow is steady. Dotting the equation with ω, the last term on the LHS vanishes and we get ω · ∇H = 0. This says that H is constant along lines tangential to ω (these lines are called vortex lines , analogous to streamlines).
E. 15-9
<Venturi meter> Consider a pipe with a constriction. Ignore the vertical pipes for the moment — they are there just so that we can measure the pressure in the fluid (the fluid in them rises to heights differing by h). Suppose at the left, we have a uniform speed U, area A and pressure P. In the middle constricted section, we have speed u, area a and pressure p.

By the conservation of mass, we have q = UA = ua. We apply Bernoulli along the central streamline, using the fact that H is constant along streamlines. We can omit the body force χ = ρgy, since this is the same at both locations. Then we get

        \frac{1}{2}\rho U^2 + P = \frac{1}{2}\rho u^2 + p.

Replacing our U with q/A and u with q/a, we obtain

        \frac{1}{2}\rho\left(\frac{q}{A}\right)^2 + P = \frac{1}{2}\rho\left(\frac{q}{a}\right)^2 + p \implies \left(\frac{1}{A^2} - \frac{1}{a^2}\right)q^2 = \frac{2(p - P)}{\rho}.

We see there is a difference in pressure due to the difference in area. This is balanced by the difference in heights h. Using the y-momentum equation, we get

        \frac{2(p - P)}{\rho} = -2gh \implies q = \sqrt{2gh}\,\frac{Aa}{\sqrt{A^2 - a^2}}.

Therefore we can measure h in order to find out the flow rate. This allows us to
measure fluid velocity just by creating a constriction and then putting in some
pipes.
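For instance (a sketch with illustrative numbers, not from the notes):

```python
import numpy as np

g = 9.81                  # m/s^2
A, a = 8e-4, 2e-4         # pipe and throat cross-sections in m^2 (illustrative)
h = 0.05                  # measured height difference in m

q = np.sqrt(2 * g * h) * A * a / np.sqrt(A**2 - a**2)
print(q)                  # volume flux in m^3/s
```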
E. 15-10
<Force on a fire hose nozzle> Suppose we have a fire hose nozzle like this:

The water enters through the straight section (1) with speed U, cross-section A and pressure P, flows along the pipe wall (2), through the converging nozzle surface (3) and the outlet wall (4), and leaves through the exit (5) with speed u, cross-section a and pressure p = 0 (taking atmospheric pressure as zero).

We consider the steady-flow momentum integral over the surface made up of (1)–(5). We integrate each section separately to find the total rate of change of momentum in the x direction. The end (1) contributes ρU(−U)A − PA. On (2), everything vanishes. On (3), the first term vanishes since the velocity is parallel to the surface, so we get a contribution of \int_{\text{nozzle}} p\mathbf{n}\cdot\hat{\mathbf{x}}\,dS. Similarly, everything on (4) vanishes. Finally, on (5), noting that p = 0, we get ρu²a. By the steady flow equation, these all sum to zero. Hence, the force on the nozzle is just

        F = \int_{\text{nozzle}} p\mathbf{n}\cdot\hat{\mathbf{x}}\,dS = \rho AU^2 - \rho au^2 + PA.

We can again apply Bernoulli along a streamline in the middle, which says ½ρU² + P = ½ρu². So we get

        F = \rho AU^2 - \rho au^2 + \frac{1}{2}\rho A(u^2 - U^2) = \frac{1}{2}\rho\frac{A}{a^2}q^2\left(1 - \frac{a}{A}\right)^2.

Let's now put some numbers in. Suppose A = (0.1)²π m² and a = (0.05)²π m². So we get A/a = 4. A typical fire hose has a flow rate of q = 0.01 m³ s⁻¹. So we get

        F = \frac{1}{2}\cdot 1000\cdot\frac{4}{\pi/400}\cdot 10^{-4}\cdot\left(\frac{3}{4}\right)^2 \approx 14\text{ N}.
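The arithmetic can be checked directly (a sketch using the same numbers as above):

```python
import numpy as np

rho = 1000.0              # kg/m^3
A = 0.1**2 * np.pi        # m^2
a = 0.05**2 * np.pi       # m^2
q = 0.01                  # m^3/s

F = 0.5 * rho * (A / a**2) * q**2 * (1 - a / A)**2
print(F)                  # about 14 N
```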

E. 15-11
Consider a two-dimensional flow where a jet of water of width a travelling with speed U is incident on an inclined plane at angle β. Assume we have no gravity, or that the jet is fast enough that gravitational effects are negligible. Wlog assume the atmospheric pressure is 0. Suppose also that we have reached the point where the flow is steady. The jet splits into two streams of widths a1 and a2 along the plane. We want to find a1, a2 and the force the fluid exerts on the plane.

Far away upstream in the incident jet, the water flows in a straight line with constant speed U , so 0 = ρ Du/Dt = −∇p, hence the pressure p within the jet is constant, and so equals 0, the atmospheric pressure, since the boundary has that pressure. The same goes for the two streams created after the impact. In fact the speed of both of these streams far away from the point of impact must be U as well: along the free-surface streamline the pressure is always 0, so by Bernoulli (½ρ|u|² + p = constant along a streamline) the speed of the streams must be U .
By mass conservation we must have Ua = Ua1 + Ua2. Let D be the volume of the flow as shown in the diagram. The momentum integral for steady flow says
\int_{\partial D}\left(\rho u(u\cdot n) + p n\right) dS = 0.
Along the x-direction (along the plane) we have
0 = -\int_{\partial D} p n_1\, dS = \int_{\partial D}\rho u_1(u\cdot n)\, dS = \rho U\cos\beta\,(-U)a + \rho U(U)a_2 + \rho(-U)(U)a_1.
Hence −ρa cos β + ρa2 − ρa1 = 0. Together with mass conservation, we find that a2 = ½a(1 + cos β) and a1 = ½a(1 − cos β). The force the fluid exerts on the plane is given by
F = \int_{y=0} p(-e_2)\, dS = \int_{\partial D} p n_2\, dS = -\int_{\partial D}\rho u_2(u\cdot n)\, dS = \rho U\sin\beta\,(-U)a = -\rho a U^2\sin\beta.

15.3.6 Linear flows


Suppose we have a favorite point x0 . Near the point x0 , it turns out we can break up
the flow into three parts — uniform flow, pure strain, and pure rotation. To do this,
we take the Taylor expansion of u about x0 :

u(x) = u(x0 ) + (x − x0 ) · ∇u(x0 ) + · · · ≈ u0 + r · ∇u0 ,

with r = x − x0 and u0 = u(x0 ). This is a linear approximation to the flow field. We


can do something more about the ∇u term. This is a rank-2 tensor, ie. a matrix, and
we can split it into its symmetric and antisymmetric parts:
 
(\nabla u)_{ij} = \frac{\partial u_i}{\partial x_j} = E_{ij} + \Omega_{ij}, \qquad E_{ij} = \frac12\left(\frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i}\right), \quad \Omega_{ij} = \frac12\left(\frac{\partial u_i}{\partial x_j} - \frac{\partial u_j}{\partial x_i}\right).

We can write the second part in terms of the vorticity ω = ∇ × u. Then we have
\omega\times r = (\nabla\times u)\times r = \varepsilon_{piq}\varepsilon_{ijk}\frac{\partial u_k}{\partial x_j} r_q e_p = (\delta_{qj}\delta_{pk} - \delta_{qk}\delta_{pj})\frac{\partial u_k}{\partial x_j} r_q e_p
= \frac{\partial u_k}{\partial x_j} r_j e_k - \frac{\partial u_k}{\partial x_j} r_k e_j = \left(\frac{\partial u_i}{\partial x_j} - \frac{\partial u_j}{\partial x_i}\right) r_j e_i = 2\Omega_{ij} r_j e_i.
So we can write u = u0 + Er + ½ω × r. The first component is uniform flow, the second is the strain field and the last is the rotation component.
[Sketches: uniform flow, pure strain, pure rotation.]
Since the fluid is incompressible, we have ∇ · u = tr E = 0. Hence the eigenvalues of E cannot all have the same sign; this justifies the strain picture above, with stretching along some principal axes and compression along the others.
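A minimal numerical sketch of this decomposition (the velocity gradient matrix below is made up, chosen traceless so that ∇ · u = 0):

    # Split grad(u) into strain E (symmetric) and rotation Omega (antisymmetric),
    # and check that u - u0 = E r + (1/2) omega x r reproduces (grad u) r.
    import numpy as np

    grad_u = np.array([[ 0.3, 0.5, 0.0],
                       [-0.1, 0.1, 0.2],
                       [ 0.4, 0.0, -0.4]])    # (i, j) entry is du_i/dx_j

    E     = 0.5*(grad_u + grad_u.T)
    Omega = 0.5*(grad_u - grad_u.T)
    omega = np.array([grad_u[2,1]-grad_u[1,2],     # vorticity = curl u
                      grad_u[0,2]-grad_u[2,0],
                      grad_u[1,0]-grad_u[0,1]])

    r = np.array([0.1, -0.2, 0.05])
    print(np.allclose(E @ r + 0.5*np.cross(omega, r), grad_u @ r))   # True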

15.3.7 Vorticity equation


The Navier-Stokes equation tells us how the velocity changes with time. Can we obtain
a similar equation for the vorticity? Consider the Navier-Stokes equation for a viscous
fluid,
 
\rho\left(\frac{\partial u}{\partial t} + u\cdot\nabla u\right) = -\nabla p - \nabla\chi + \mu\nabla^2 u.
We use the vector identity u · ∇u = ½∇|u|² − u × ω and take the curl of the above equation to obtain
\frac{\partial\omega}{\partial t} - \nabla\times(u\times\omega) = \nu\nabla^2\omega,
exploiting the fact that the curl of a gradient vanishes. We now use the fact that
\nabla\times(u\times\omega) = (\nabla\cdot\omega)u + \omega\cdot\nabla u - (\nabla\cdot u)\omega - u\cdot\nabla\omega.
The divergence of a curl vanishes, and so does ∇ · u by incompressibility. So we get
\frac{\partial\omega}{\partial t} + u\cdot\nabla\omega - \omega\cdot\nabla u = \nu\nabla^2\omega.
Now we use the definition of the material derivative, and rearrange terms to obtain

<Vorticity equation> \frac{D\omega}{Dt} = \omega\cdot\nabla u + \nu\nabla^2\omega.
Hence, the rate of change of vorticity about a fluid particle is caused by ω · ∇u (am-
plification by stretching or twisting) and ν∇2 ω (dissipation of vorticity by viscosity).
The second term also allows for generation of vorticity at boundaries by the no-slip
condition. This will be explained shortly.

Consider an inviscid fluid, where ν = 0. So we are left with Dω/Dt = ω · ∇u. If we take the dot product with ω, we get
\frac{D}{Dt}\left(\frac12|\omega|^2\right) = \omega\cdot(\omega\cdot\nabla u) = \omega(E+\Omega)\omega = \omega_i(E_{ij}+\Omega_{ij})\omega_j.
Since ω_iω_j is symmetric, while Ω_{ij} is antisymmetric, the second term vanishes. In the principal axes, E is diagonal. So we get
\frac{D}{Dt}\left(\frac12|\omega|^2\right) = E_1\omega_1^2 + E_2\omega_2^2 + E_3\omega_3^2.
Wlog, we assume E1 > 0 (since the Ei's sum to 0), and imagine E2, E3 < 0. So the flow is stretched in the e1 direction and compressed radially. We consider what happens to a vortex in the direction of the stretching, ω = (ω1, 0, 0). We then get
\frac{D}{Dt}\left(\frac12\omega_1^2\right) = E_1\omega_1^2.

So the vorticity grows exponentially. This is vorticity amplification by stretching. This


makes sense — as the fluid particles have to get closer to the axis of rotation, they have
to rotate faster, by the conservation of angular momentum. This is in general true —
vorticity increases as the length of a material line increases. To show this, we consider two neighbouring (Lagrangian) fluid particles, x1(t), x2(t). We let δℓ(t) = x2(t) − x1(t). Note that
\frac{Dx_2}{Dt} = u(x_2), \quad \frac{Dx_1}{Dt} = u(x_1) \implies \frac{D\,\delta\ell}{Dt} = u(x_2) - u(x_1) = \delta\ell\cdot\nabla u,

by taking a first-order Taylor expansion. This is exactly the same equation as that
for ω in an inviscid fluid. So vorticity increases as the length of a material line in-
creases.
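A small sketch of this stretching, integrating D(δℓ)/Dt = δℓ · ∇u for the illustrative pure-strain field with E1 = 1, E2 = E3 = −1/2 (my own choice of numbers):

    # A material line element in an axisymmetric straining flow:
    # its component along the stretching axis grows like exp(E1 t).
    import numpy as np

    E1 = 1.0
    grad_u = np.diag([E1, -0.5*E1, -0.5*E1])     # traceless, so incompressible

    dl = np.array([1.0, 1.0, 1.0]) * 1e-3        # initial line element
    dt, T = 1e-3, 2.0
    for _ in range(int(T/dt)):
        dl = dl + dt * (grad_u @ dl)             # forward Euler step

    print(dl[0], 1e-3*np.exp(E1*T))              # numerical vs exact growth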

Note that the vorticity is generated by viscous forces near boundaries. When we make the inviscid approximation, we are losing this source of vorticity, and we sometimes have to be careful.
We have so far assumed the density is constant. If the fluid has a non-uniform density ρ(x), then it turns out
\frac{D\omega}{Dt} = \omega\cdot\nabla u + \frac{1}{\rho^2}\nabla\rho\times\nabla p.
This is what happens in convection flow. The difference in density drives a motion of the fluid. For example, if we have a horizontal density gradient and a vertical pressure gradient (eg. we have a heater at the end of a room with air pressure varying according to height), then ∇ρ × ∇p is non-zero and we get vorticity as shown in the diagram.

E. 15-12
Below are some examples of vortex amplification by stretching. [Sketches: a bathtub vortex, a hurricane (convection above warm ground), and vorticity build-up around obstacles — in each case low vorticity is amplified to high vorticity by stretching of the vortex lines.]

15.4 Inviscid irrotational flow


We now make some further simplifying assumptions. We are going to assume that our flow is incompressible, inviscid, and irrotational.
We first check this is a sensible assumption, in that if we start off with an irrotational flow, then the flow will continue being irrotational. Suppose we have ∇ × u = 0 at t = 0, and the fluid is inviscid and homogeneous (ie. ρ is constant). Then by the vorticity equation Dω/Dt = ω · ∇u = 0. So the vorticity remains zero, and for all time we can write u = ∇φ for some scalar function φ called the velocity potential . Note that here we don't put a negative sign in front of ∇φ, unlike for example gravity.
Together with the incompressible condition ∇ · u = 0, we have ∇2 φ = 0. So the
potential satisfies Laplace’s equation. A key property of Laplace’s equation is that it
is linear. So we can add two solutions up to get a third solution.
D. 15-13
A potential flow is a flow whose velocity potential satisfies Laplace’s equation.

15.4.1 Three-dimensional potential flows


We will not consider Laplace’s equation in full generality. Instead, we will con-
sider some solutions with symmetry. We will thus use spherical coordinates (r, θ, ϕ).
Then
\nabla^2\phi = \frac{1}{r^2}\frac{\partial}{\partial r}\left(r^2\frac{\partial\phi}{\partial r}\right) + \frac{1}{r^2\sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\frac{\partial\phi}{\partial\theta}\right) + \frac{1}{r^2\sin^2\theta}\frac{\partial^2\phi}{\partial\varphi^2}.
It is also useful to know what the gradient is. This is given by
u = \nabla\phi = \left(\frac{\partial\phi}{\partial r}, \frac{1}{r}\frac{\partial\phi}{\partial\theta}, \frac{1}{r\sin\theta}\frac{\partial\phi}{\partial\varphi}\right).
We start off with the simplest case. The simplest thing we can possibly imagine is if φ = φ(r) depends only on r. So the velocity is purely radial. Then Laplace's equation implies
\frac{\partial}{\partial r}\left(r^2\frac{\partial\phi}{\partial r}\right) = 0 \implies \frac{\partial\phi}{\partial r} = \frac{A}{r^2} \text{ for some constant } A.
So φ = −A/r + B where B is yet another constant. Since we only care about the gradient ∇φ, we can wlog take B = 0. So the only possible velocity potential is φ = −A/r. Then the speed, given by ∂φ/∂r, falls off as 1/r².

What is the physical significance of the factor A? Consider the volume flux q across
the surface of the sphere r = a. Then
q = \int_S u\cdot n\, dS = \int_S u_r\, dS = \int_S \frac{\partial\phi}{\partial r}\, dS = \int_S \frac{A}{a^2}\, dS = 4\pi A.
So we can write φ = −q/(4πr). When q > 0, this corresponds to a point source of fluid. When q < 0, this is a point sink of fluid. We can also derive this solution directly, using incompressibility. Since the flow is incompressible, the flux through any sphere containing the source/sink at the origin should be constant. Since the surface area increases as 4πr², the velocity must drop as 1/r², in agreement with what we obtained above. Notice that we have ∇²φ = qδ(x). So φ is (a multiple of) the Green's function for the Laplacian.
That was not too interesting. We can consider a more general solution, where φ depends on r and θ but not ϕ. Then Laplace's equation becomes
\nabla^2\phi = \frac{1}{r^2}\frac{\partial}{\partial r}\left(r^2\frac{\partial\phi}{\partial r}\right) + \frac{1}{r^2\sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\frac{\partial\phi}{\partial\theta}\right) = 0.
As we know from IB Methods, we can use Legendre polynomials to write the solution as
\phi = \sum_{n=0}^\infty (A_n r^n + B_n r^{-n-1}) P_n(\cos\theta), \quad\text{and then}\quad u = \left(\frac{\partial\phi}{\partial r}, \frac{1}{r}\frac{\partial\phi}{\partial\theta}, 0\right).

E. 15-14
We can immediately look at some possible flows.
1. Source/sink: In the case that only B0 ≠ 0 we have φ = B0/r and u = ∇φ = −B0 e_r/r². This is a source if B0 < 0 and a sink if B0 > 0. Indeed if we look at the outward mass flux over a sphere S_R of radius R,
\int_{S_R}\rho u\cdot dS = \rho(-B_0/R^2)\,4\pi R^2 = -4\pi B_0\rho,
which is independent of R. In fact any surface containing the origin would have this flux. The volume flux is of course −4πB0. So m = |4πB0| is called the strength of the source/sink.
2. Uniform flow: In the case only A1 ≠ 0, we have φ = A1 r cos θ = A1 z and u = ∇φ = A1 e_z. This is a uniform flow of velocity A1.
3. Doublet flow (dipole): In the case only B1 ≠ 0, we have φ = B1 cos θ/r² = B1 z/r³ and
u = \nabla\phi = B_1\left(\frac{1}{r^3}e_z - \frac{3z}{r^4}e_r\right).
This is a doublet (dipole) as
\phi = \lim_{h\to 0}\left(\frac{-B_1/h}{(x^2+y^2+(z+\frac12 h)^2)^{1/2}} + \frac{B_1/h}{(x^2+y^2+(z-\frac12 h)^2)^{1/2}}\right),
i.e. a superposition of a sink and a source with the same strength. Next we will look at uniform flow past a sphere; it turns out the solution is a combination of 2 and 3.

E. 15-15
<Uniform flow past sphere> We can look at uniform flow around a sphere of radius a. We suppose the upstream flow is u = U x̂, so φ → U x = U r cos θ far upstream. So we need to solve
\nabla^2\phi = 0 \quad r > a, \qquad \phi \to U r\cos\theta \quad r\to\infty, \qquad \frac{\partial\phi}{\partial r} = 0 \quad r = a.

The last condition is there to ensure no fluid flows into the sphere, ie. u · n = 0,
for n the outward normal.
Since P1(cos θ) = cos θ, and the Pn are orthogonal, our boundary conditions at infinity require φ to be of the form
\phi = \left(Ar + \frac{B}{r^2}\right)\cos\theta \quad\text{and in fact}\quad \phi = U\left(r + \frac{a^3}{2r^2}\right)\cos\theta,
where we determined the two constants using the two boundary conditions. The condition that φ → U r cos θ tells us A = U , and the other condition tells us A − 2B/a³ = 0. We can interpret U r cos θ as the uniform flow, and U(a³/2r²) cos θ as the dipole response due to the sphere. We can compute the velocity to be
u_r = \frac{\partial\phi}{\partial r} = U\left(1 - \frac{a^3}{r^3}\right)\cos\theta, \qquad u_\theta = \frac{1}{r}\frac{\partial\phi}{\partial\theta} = -U\left(1 + \frac{a^3}{2r^3}\right)\sin\theta.
We notice that ur = 0 when r = a or θ = ±π/2, and uθ = 0 when θ = 0, π.


At the north and south poles, when θ = ±π/2, we get ur = 0 and uθ = ±3U/2, so the speed there is 3U/2. So the velocity is faster at the top than at the infinity boundary. This is why it is windier at the top of a hill than below. This is illustrated in the diagram, where the streamlines are closer to each other near the poles B, B′ than near the stagnation points A, A′, since the velocity is faster there.
To obtain the pressure p on the surface of the sphere, we apply Bernoulli’s equation
on a streamline to the surface (a, θ). Comparing with what happens at infinity,
we get
 
p_\infty + \frac12\rho U^2 = p + \frac12\rho U^2\,\frac94\sin^2\theta \implies p = p_\infty + \frac12\rho U^2\left(1 - \frac94\sin^2\theta\right).
At A and A′, we have p = p∞ + ½ρU². At B and B′, we find p = p∞ − (5/8)ρU². Note that the pressure is a function of sin²θ. So the pressure at the back of the sphere is exactly the same as that at the front. Thus, if we integrate the pressure over the whole surface, the net force is 0. So the fluid exerts no net force on the sphere! This is d'Alembert's paradox . This is, of course, because we have ignored viscosity.
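A short numerical check of d'Alembert's result, integrating the surface pressure over the sphere (the flow parameters are arbitrary illustrative values):

    # Drag on the sphere = -int p n_x dS over r = a, which vanishes by symmetry.
    import numpy as np

    rho, U, a = 1000.0, 2.0, 0.5
    theta = np.linspace(0.0, np.pi, 2001)
    p_excess = 0.5*rho*U**2*(1 - 2.25*np.sin(theta)**2)     # p - p_inf on r = a

    # n_x = cos(theta), surface element dS = 2 pi a^2 sin(theta) dtheta
    f = p_excess*np.cos(theta)*2*np.pi*a**2*np.sin(theta)
    drag = -np.sum(0.5*(f[1:] + f[:-1])*np.diff(theta))     # trapezoidal rule
    print(drag)                                             # ~0 up to round-off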
In practice, for a solid sphere in a viscous fluid at high Reynolds numbers, the flow looks very similar upstream, but at some point after passing the sphere, the flow separates and forms a wake behind the sphere. In the wake, the pressure is approximately p∞. The separation is caused by the adverse pressure gradient at the rear. Empirically, we find the drag force is F = C_D ½ρU²πa², where C_D is a dimensionless drag coefficient. This is, in general, a function of the Reynolds number Re.
E. 15-16
<Rising bubble> Suppose we have a rising bubble, rising at speed U . We
assume our bubble is small and spherical, and has a free surface. In this case, we
do not have to re-calculate the velocity of the fluid. We can change our frame of
reference, and suppose the bubble is stationary and the fluid is moving past it at
U . Then this is exactly what we have calculated above. We then translate the
velocities back by U to get the solution. Doing all this, we find that the kinetic
energy of the fluid is
\int_{r>a}\frac12\rho|u|^2\, dV = \frac{\pi}{3}a^3\rho U^2 = \frac12 M_A U^2 \quad\text{where}\quad M_A = \frac12\cdot\frac43\pi a^3\rho = \frac12 M_D
is the added mass of the bubble (and MD is the mass of the fluid displaced by the bubble). Now suppose we raise the bubble by a distance h. The change in potential energy of the system is ∆PE = −MD gh. So
\frac12 M_A U^2 - M_D g h = \text{Energy}
is constant, since we assume there is no dissipation of energy due to viscosity. We differentiate this to get MA U U̇ = MD g ḣ = MD g U. We can cancel the U's
and get U̇ = 2g. So in an inviscid fluid, the bubble rises at twice the acceleration
of gravity.

15.4.2 Two-dimensional potential flow


We now consider the case where the world is two-dimensional. We use polar coordinates (r, θ). Then we have
\nabla^2\phi = \frac{1}{r}\frac{\partial}{\partial r}\left(r\frac{\partial\phi}{\partial r}\right) + \frac{1}{r^2}\frac{\partial^2\phi}{\partial\theta^2}, \qquad u = \nabla\phi = \left(\frac{\partial\phi}{\partial r}, \frac{1}{r}\frac{\partial\phi}{\partial\theta}\right).
The general solution to Laplace's equation is given by
\phi = A_0\log r + B_0\theta + \sum_{n=1}^\infty (A_n r^n + B_n r^{-n})\begin{cases}\cos n\theta\\ \sin n\theta\end{cases}
= A_0\log r + B_0\theta + \sum_{n=1}^\infty\left(A_n r^n\cos(n\theta + \alpha_n) + B_n r^{-n}\cos(n\theta + \beta_n)\right).
Note that even though we say the flow is two-dimensional, it could still be a flow in three dimensions, e.g. a flow that is confined to the x-y plane and is the same for all z.

E. 15-17
We can immediately look at some possible flows.
1. Source/sink: In the case only A0 ≠ 0, we have φ = A0 log r and u = ∇φ = A0 e_r/r. The mass flux through a circle of radius R about the origin is ∫ ρu · dS = 2πR ρA0/R = 2πρA0. So A0 > 0 corresponds to having a point source at the origin, and A0 < 0 corresponds to having a sink at the origin. And we call m = |2πA0| the strength of the source/sink.
Alternatively, to get a point source of strength q, we can either solve ∇²φ = qδ(r) or use the conservation of mass to obtain 2πr u_r = q. Then
u_r = \frac{q}{2\pi r} \implies \phi = \frac{q}{2\pi}\log r.
2. Vortex: In the case only B0 ≠ 0, we have φ = B0θ and u = ∇φ = B0 e_θ/r. This is a point vortex in 2D or a line vortex in 3D. The circulation is ∮_{r=R} u · dℓ = 2πB0, hence we can also understand K = 2πB0 as the "strength" of the vortex.
Conversely, if we want a vortex in the sense that the flow goes around in circles with circulation K, since there is no radial velocity, we must have ∂φ/∂r = 0. So φ only depends on θ. This corresponds to the solution φ = Bθ. To find out the value of B, consider the circulation around a loop of radius a:
K = \oint_{r=a} u\cdot d\ell = \int_0^{2\pi}\frac{B}{a}\, a\, d\theta = 2\pi B \implies \phi = \frac{K}{2\pi}\theta, \quad u_\theta = \frac{K}{2\pi r}.
This flow is in fact irrotational, ie. ∇ × u = 0 for r ≠ 0, despite it looking rotational. Moreover, for any (simple) loop c, we get
\oint_c u\cdot d\ell = \begin{cases} K & \text{the origin is inside } c\\ 0 & \text{otherwise.}\end{cases}
We can interpret this as saying there is no vorticity anywhere, except at the singular point at the origin, where we have infinite vorticity. Indeed, for S the surface bounded by c we have ∮_c u · dℓ = ∫_S ω · n dS.
3. Uniform flow: In the case we only have A1 ≠ 0, we have
\phi = A_1 r\cos(\theta + \alpha_1) = A_1 r(\cos\theta\cos\alpha_1 - \sin\theta\sin\alpha_1) = A_1(x\cos\alpha_1 - y\sin\alpha_1) = A_1(x\cos(-\alpha_1) + y\sin(-\alpha_1)).
This is a uniform flow in the direction making angle −α1 with the x-axis.
4. Dipole: In the case only B1 ≠ 0, we have φ = B1 cos(θ + β1)/r. This is a 2D dipole.
E. 15-18
<Uniform flow with circulation past cylinder> Consider the flow with
circulation past a cylinder. We can kind of imagine this as uniform flow passing
around a rotating cylinder (although not exactly). We need to solve

r
∇2 φ = 0 r>a
φ → U r cos θ r→∞ U θ x
∂φ
=0 r = a.
∂r

We already have the general solution above. So we just write it down. We find
\phi = U\left(r + \frac{a^2}{r}\right)\cos\theta + \frac{K}{2\pi}\theta \quad\text{and so}\quad u_r = U\left(1 - \frac{a^2}{r^2}\right)\cos\theta, \quad u_\theta = -U\left(1 + \frac{a^2}{r^2}\right)\sin\theta + \frac{K}{2\pi r}.
The last term in φ allows for a net circulation K around the cylinder, to account for vorticity in the viscous boundary layer on the surface of the cylinder. We can find the streamfunction for this as
\psi = U r\sin\theta\left(1 - \frac{a^2}{r^2}\right) - \frac{K}{2\pi}\log r.
If there is no circulation, ie. K = 0, then we get a flow similar to the flow around a sphere. Again, there is no net force on the cylinder, since the flow is symmetric fore and aft, above and below. Again, we get two stagnation points at A, A′.
What happens when K ≠ 0 is more interesting. We first look at the stagnation points. We get ur = 0 if and only if r = a or cos θ = 0. For uθ = 0, when r = a, we require K = 4πaU sin θ. So provided |K| ≤ 4πaU , there is a solution to this problem, and we get (two) stagnation points on the boundary. For |K| > 4πaU , we do not get a stagnation point on the boundary. However, we still have a stagnation point at cos θ = 0 (ie. θ = ±π/2) for some r > a. Looking at the equation for uθ = 0, only θ = π/2 works (taking K > 0).
Let's now look at the effect on the cylinder. For steady potential flow, Bernoulli works (ie. H is constant) everywhere, not just along each streamline (see later). So we can calculate the pressure on the surface. Let p be the pressure on the surface. Then we get
p_\infty + \frac12\rho U^2 = p + \frac12\rho\left(\frac{K}{2\pi a} - 2U\sin\theta\right)^2 \implies p = p_\infty + \frac12\rho U^2 - \frac{\rho K^2}{8\pi^2 a^2} + \frac{\rho K U\sin\theta}{\pi a} - 2\rho U^2\sin^2\theta.
We see the pressure is symmetrical fore and aft. So there is no force in the x direction. However, we get a transverse force (per unit length) in the y-direction. We have
F_y = -\int_0^{2\pi} p\sin\theta\,(a\, d\theta) = -\int_0^{2\pi}\frac{\rho K U}{\pi a}\sin^2\theta\, a\, d\theta = -\rho U K,
where we have dropped all the terms that integrate to zero. So there is a sideways force in the direction perpendicular to the flow, directly proportional to the circulation of the system. In general, the Magnus force (lift force) resulting from the interaction between the flow U and the vortex K is F = ρU × K.
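A quick sketch verifying the lift by integrating the surface pressure numerically (the parameter values below are arbitrary):

    # Lift per unit length on the cylinder: compare the pressure integral with -rho U K.
    import numpy as np

    rho, U, a, K = 1.0, 1.0, 1.0, 2.0
    theta = np.linspace(0.0, 2*np.pi, 4001)
    u_theta = -2*U*np.sin(theta) + K/(2*np.pi*a)    # speed on r = a
    p = -0.5*rho*u_theta**2                         # p minus constants (they integrate to 0)

    f = -p*np.sin(theta)*a                          # integrand of F_y = -int p n_y a dtheta
    Fy = np.sum(0.5*(f[1:] + f[:-1])*np.diff(theta))
    print(Fy, -rho*U*K)                             # the two agree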

15.4.3 Time dependent potential flows


Consider the time-dependent Euler equation
\rho\left(\frac{\partial u}{\partial t} + \nabla\left(\frac12|u|^2\right) - u\times\omega\right) = -\nabla p - \nabla\chi.
We assume that we have a potential flow u = ∇φ. So ω = 0. Then we can write the whole equation as
\nabla\left(\rho\frac{\partial\phi}{\partial t} + \frac12\rho|u|^2 + p + \chi\right) = 0.
Thus we can integrate this to obtain
\rho\frac{\partial\phi}{\partial t} + \frac12\rho|\nabla\phi|^2 + p + \chi = f(t), \tag{$*$}
where f(t) is a function independent of space. This equation allows us to correlate the properties of φ, p etc. at different points in space. Note that if we wish we can absorb f(t) into the potential φ (by subtracting (1/ρ) times an anti-derivative of f(t) from φ) so that the RHS of the equation is 0.
E. 15-19
<Fast jet generation> Suppose we have a tube, say extending from x = 0 to x = L, connected to a large pressurised container at x = 0. The tube contains water supplied from the container. The flow starts from rest at t = 0. We have a free jet at x = L with p = patm and u parallel to x̂. The pressure at x = 0 is controlled by a feedback mechanism so that
p|_{x=0} = p_{\text{atm}} + p_0(t)
where p0(t) is positive for t > 0 and zero otherwise. We neglect gravity. What is the velocity of the flow in the tube?
Firstly, starting from rest implies that the flow is irrotational. So we can write the potential as φ = u(t)x. Note that the flow speed u cannot depend on x since the flow is incompressible. Applying the time-dependent Bernoulli equation (∗) at both ends of the tube we find
\dot u L + \frac12 u^2 + \frac{p_{\text{atm}}}{\rho} = \frac{f(t)}{\rho} = \frac12 u^2 + \frac{p_{\text{atm}} + p_0}{\rho}.
Therefore we have u̇ = p0(t)/(ρL) with initial condition u(0) = 0. In the case that p0 is a constant, we have u = (p0/ρL)t.
................................................................................
Consider a slight variation of the problem where, instead of specifying the pressure at x = 0 by p|x=0 = patm + p0(t), the pressure is controlled at x = −ξ, the back of the container. We can think of this as there being a piston at x = −ξ pushing on the water in the container with force per area patm + p0(t). We assume the container is large enough so that the flow velocity in the container is negligible.
Applying (∗) at x = L and x = −ξ gives u̇L + ½u² + patm/ρ = (p0(t) + patm)/ρ. Therefore we have the non-linear ODE
\dot u + \frac{u^2}{2L} = \frac{p_0(t)}{\rho L} \quad\text{with } u(0) = 0.
In the case where p0 is constant, we define the velocity scale u0 = √(2p0/ρ) and time scale t0 = 2L/u0. Then in terms of the dimensionless parameters η = u/u0 and τ = t/t0 we have dη/dτ = 1 − η² with η(0) = 0. The solution is therefore η = tanh τ, that is u = u0 tanh(u0 t/2L).
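A minimal sketch integrating the start-up ODE and comparing with the tanh solution (the tube length and driving pressure are made-up values):

    # du/dt + u^2/(2L) = p0/(rho L), u(0) = 0, versus u = u0 tanh(u0 t / 2L)
    import numpy as np

    rho, L, p0 = 1000.0, 2.0, 5.0e4
    u0 = np.sqrt(2*p0/rho)                 # velocity scale (10 m/s here)

    u, dt, T = 0.0, 1e-4, 2.0
    for _ in range(int(T/dt)):
        u += dt*(p0/(rho*L) - u**2/(2*L))  # forward Euler

    print(u, u0*np.tanh(u0*T/(2*L)))       # numerical vs exact at t = T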
E. 15-20
<Oscillations in a manometer> A manometer is a U-shaped tube. We use some magic to set it up such that the water level in the left tube is h above the equilibrium position H (the equilibrium column height above the reservoir, where y = 0). Then when we release the system, the water levels on both sides oscillate. We can get quite far just by doing dimensional analysis. There are only two parameters g, H. Hence the frequency must be proportional to √(g/H). To get the constant of proportionality, we have to do proper calculations.
We are going to assume the reservoir at the bottom is large, so velocities are
negligible. So φ is constant in the reservoir, say φ = 0. We want to figure out
the velocity on the left. The water there only moves vertically. So we have φ = uy = ḣy, and hence ∂φ/∂t = ḧy. On the right hand side, we just have φ = −uy = −ḣy and ∂φ/∂t = −ḧy. We now apply the time-dependent Bernoulli equation at the two free surfaces — we get
\rho\ddot h(H+h) + \frac12\rho\dot h^2 + p_{\text{atm}} + \rho g(H+h) = f(t) = -\rho\ddot h(H-h) + \frac12\rho\dot h^2 + p_{\text{atm}} + \rho g(H-h).
Quite a lot of these terms cancel, and we are left with 2ρHḧ + 2ρgh = 0. Simplifying, we get ḧ + gh/H = 0. So this is simple harmonic motion with frequency √(g/H).
E. 15-21
<Oscillations of a bubble> Suppose we have a spherical bubble of radius a(t)
in some fluid. Spherically symmetric oscillations induce a flow in the fluid. This
satisfies
\nabla^2\phi = 0 \text{ for } r > a, \qquad \phi\to 0 \text{ as } r\to\infty, \qquad \frac{\partial\phi}{\partial r} = \dot a \text{ for } r = a.
In spherical polars, we write Laplace's equation as
\frac{1}{r^2}\frac{\partial}{\partial r}\left(r^2\frac{\partial\phi}{\partial r}\right) = 0 \implies \phi = \frac{A(t)}{r}, \quad u_r = \frac{\partial\phi}{\partial r} = -\frac{A(t)}{r^2}.
This certainly vanishes as r → ∞. We also know −A/a² = ȧ. So we have φ = −a²ȧ/r. So
\left.\frac{\partial\phi}{\partial t}\right|_{r=a} = \left.\left(-\frac{2a\dot a^2}{r} - \frac{a^2\ddot a}{r}\right)\right|_{r=a} = -(a\ddot a + 2\dot a^2).
We now consider the pressure on the surface of the bubble. We will ignore gravity, and apply the time-dependent Bernoulli equation at the bubble surface and at infinity. Then we get
-\rho(a\ddot a + 2\dot a^2) + \frac12\rho\dot a^2 + p(a,t) = p_\infty \implies \rho\left(a\ddot a + \frac32\dot a^2\right) = p(a,t) - p_\infty. \tag{$*$}

This is a difficult equation to solve, because it is non-linear. So we assume we have small oscillations about equilibrium, and write a = a0 + η(t) where η(t) ≪ a0. Then we can write
a\ddot a + \frac32\dot a^2 = (a_0+\eta)\ddot\eta + \frac32\dot\eta^2 = a_0\ddot\eta + O(\eta^2).
Ignoring second order terms, we get ρa0η̈ = p(a, t) − p∞. We also know that p∞ is the pressure when we are in equilibrium. So p(a, t) − p∞ is the small change in pressure δp caused by the change in volume.
To relate the change in pressure with the change in volume, we need to know some
thermodynamics. We suppose the oscillation is adiabatic, ie. it does not involve
any heat exchange. This is valid if the oscillation is fast, since there isn’t much
time for heat transfer. Then it is a fact from thermodynamics that the motion
obeys

P V^\gamma = \text{constant} \quad\text{where}\quad \gamma = \frac{\text{specific heat under constant pressure}}{\text{specific heat under constant volume}}.
We can take logs to obtain log p + γ log V = constant. Then taking small variations of p and V about p∞ and V0 = (4/3)πa0³ gives
\delta(\log p) + \gamma\,\delta(\log V) = 0, \quad\text{that is}\quad \frac{\delta p}{p_\infty} = -\gamma\frac{\delta V}{V_0}.
Thus we find
p(a,t) - p_\infty = \delta p = -p_\infty\gamma\frac{\delta V}{V_0} = -3p_\infty\gamma\frac{\eta}{a_0}.
Thus we get ρa0η̈ = −3(γp∞/a0)η. This is again simple harmonic motion with frequency ω = (3γp∞/ρa0²)^{1/2}. We know all these numbers, so we can put them in. For a 1 cm bubble, we get ω ≈ 2 × 10³ s⁻¹ (roughly 300 Hz). For reference, the human audible range is about 20–20 000 Hz. This is why we can hear, say, waves breaking.
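A one-line check of that number (standard values for air and water assumed):

    # Natural frequency of a 1 cm air bubble in water
    from math import sqrt, pi

    gamma, p_inf, rho, a0 = 1.4, 1.0e5, 1000.0, 0.01
    omega = sqrt(3*gamma*p_inf/(rho*a0**2))
    print(omega, omega/(2*pi))   # ~2e3 rad s^-1, i.e. a few hundred Hz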
Finally we will point out that the pressure is not at its extreme at the boundary of the bubble. Applying Euler's equation (time-dependent Bernoulli) at any point r and at infinity, we get
\rho\frac{\partial}{\partial t}\left(-\frac{a^2\dot a}{r}\right) + \frac12\rho\frac{a^4\dot a^2}{r^4} + p(r,t) = p_\infty.
Eliminating ä between this equation and (∗) gives
p(r,t) = p_\infty + \frac{a}{r}\big(p(a,t) - p_\infty\big) + \frac12\rho\dot a^2\left(\frac{a}{r} - \frac{a^4}{r^4}\right).
The maximum of p is achieved at r ≈ 4^{1/3}a (at least when ½ρȧ² ≫ p(a, t) − p∞).

15.4.4 Vortex motion


Recall that a point vortex has φ = (K/2π)θ and
u = \nabla\phi = \frac{K}{2\pi r}e_\theta = \frac{K}{2\pi}\frac{e_z\times x}{\|x\|^2} = \frac{K}{2\pi}\left(-\frac{y}{x^2+y^2}, \frac{x}{x^2+y^2}\right).

Since Laplace's equation is linear, for N vortices at x_i with circulations K_i we have
\phi = \sum_{i=1}^N \frac{K_i\theta_i}{2\pi} \implies u = \sum_{i=1}^N \frac{K_i}{2\pi}\frac{e_z\times(x - x_i)}{\|x - x_i\|^2}.
In fact each vortex is moved by the velocity field due to all the other vortices, so
\dot x_i(t) = \sum_{j\neq i}\frac{K_j}{2\pi}\frac{e_z\times(x_i - x_j)}{\|x_i - x_j\|^2}.
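A small sketch integrating these equations for a pair of opposite vortices, which (as the next example shows analytically) should translate together at speed K/(2πd); the values of K and d are arbitrary:

    # Two point vortices with circulations +K and -K a distance d apart
    import numpy as np

    K, d = 1.0, 1.0
    x = np.array([[0.0, 0.0], [d, 0.0]])     # vortex positions
    circ = np.array([K, -K])

    def velocities(x):
        v = np.zeros_like(x)
        for i in range(len(x)):
            for j in range(len(x)):
                if i != j:
                    r = x[i] - x[j]
                    # e_z x r = (-r_y, r_x)
                    v[i] += circ[j]/(2*np.pi*np.dot(r, r))*np.array([-r[1], r[0]])
        return v

    dt = 1e-3
    for _ in range(1000):                    # integrate up to t = 1
        x = x + dt*velocities(x)

    print(x)                                 # both vortices moved in y by ~ K/(2 pi d)
    print(K/(2*np.pi*d))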

E. 15-22
• Consider a vortex pair N = 2 with K = K1 = −K2 > 0. Then
\dot x_1(t) = -\frac{K}{2\pi}\frac{e_z\times(x_1 - x_2)}{\|x_1 - x_2\|^2}, \qquad \dot x_2(t) = \frac{K}{2\pi}\frac{e_z\times(x_2 - x_1)}{\|x_1 - x_2\|^2}.
So ẋ1 = ẋ2 = U. In particular the distance between the two vortices does not change, d/dt ‖x1 − x2‖ = 0. Also we have |U| = |K|/(2π‖x2 − x1‖). Note that ∂φ/∂n = 0 on the perpendicular bisector of x1 − x2, since φ is symmetric about it.
• Consider a single vortex of strength K a distance d from a rigid straight boundary. To find its evolution, we use the method of images. We place an image vortex of strength −K at a distance d on the other side of the wall and remove the boundary. So we have reduced it to the first case. An example of this happening is when planes are taking off or landing: vortices form at the wing tips, the ground acts as the rigid boundary, and the vortices migrate away from the plane/runway towards the sides.

15.5 Water waves


We now consider water waves. We are first going to use dimensional analysis to under-
stand the qualitative behaviour of water waves in deep and shallow water respectively.
Afterwards, we will try to solve it properly, and see how we can recover the results
from the dimensional analysis. We will use the inviscid approximation.

[Set-up: free surface at z = h(x, y, t) oscillating about z = 0, with a flat bottom at z = −H.]

15.5.1 Dimensional analysis


Consider waves with wave number k = 2π/λ, where λ is the wavelength, on a layer of water of depth H. We suppose the fluid is inviscid. Then the wave speed c depends on k, g and H. Dimensionally, we can write the answer as c = √(gH) f(kH) for some dimensionless function f .

1. Now suppose we have deep water. Then H ≫ λ. Therefore kH ≫ 1. In this limit, we would expect the speed not to depend on H, since H is just too big. The only way this can be true is if f ∝ 1/√(kH). Then we know c = α√(g/k) where α is some dimensionless constant.
2. What happens near the shore? Here the water is shallow. So we have kH ≪ 1. Since the wavelength is now so long, the speed should be independent of k. So f is a constant, say β. So c = β√(gH).
We don't know α and β. To find them, we would need a proper theory, and we also need that to connect up the regimes of deep and shallow water, as we will soon do. Yet these results can already explain several phenomena we see in daily life. For example, we see that wave fronts are always parallel to the shore, regardless of how the shore is shaped and positioned. This is because if a wave is coming in at an angle, the parts further away from the shore move faster (since the water is deeper there), causing the wave front to rotate until it is parallel.
We can also use this to explain why waves break. Near the shore, the water is shallow, and the difference in height between the peaks (depth H + h) and troughs (depth H − h) of the wave is significant. Hence the peaks travel faster than the troughs, causing the waves to break.

15.5.2 Equation and boundary conditions


We now try to solve for the actual solution. We assume the fluid is inviscid, and the
motion starts from rest. Thus the vorticity ∇ × u is initially zero, and hence always
zero. Together with the incompressibility condition ∇·u = 0, we end up with Laplace’s
equation ∇²φ = 0. We have some kinematic boundary conditions. First of all, there can be no flow through the bottom. So we have u_z = ∂φ/∂z = 0 when z = −H. At the free surface, we have
u_z = \frac{\partial\phi}{\partial z} = \frac{Dh}{Dt} = \frac{\partial h}{\partial t} + u\frac{\partial h}{\partial x} + v\frac{\partial h}{\partial y} \quad\text{when } z = h.
We then have the dynamic boundary condition that the pressure at the surface is the atmospheric pressure, that is, at z = h we have p = p0 = constant. We need to relate this to the flow. So we apply the time-dependent Bernoulli equation
\rho\frac{\partial\phi}{\partial t} + \frac12\rho|\nabla\phi|^2 + \rho g h + p_0 = f(t) \quad\text{on } z = h.

The equation is not hard, but the boundary conditions are. Apart from them being
non-linear, there is this surface h that we know nothing about.

Small amplitude water waves


It is impossible to solve these equations just as they are. So we want to make some approximations. We assume that the wave amplitudes are small, ie. that h ≪ H. Moreover, we assume that the waves are relatively flat, so that ∂h/∂x, ∂h/∂y ≪ 1. We then ignore quadratic terms in small quantities. For example, since the waves are small, the velocities u and v also are. So we ignore u ∂h/∂x and v ∂h/∂y. Similarly, we ignore the whole of |∇φ|² in Bernoulli's equation since it is small. Next, we use Taylor series to write
\left.\frac{\partial\phi}{\partial z}\right|_{z=h} = \left.\frac{\partial\phi}{\partial z}\right|_{z=0} + h\left.\frac{\partial^2\phi}{\partial z^2}\right|_{z=0} + \cdots \approx \left.\frac{\partial\phi}{\partial z}\right|_{z=0}.
Again, we ignore all quadratic terms. We do the same for ∂φ/∂t. We are then left with linear water waves: the equations are ∇²φ = 0 for −H < z ≤ 0 with
\frac{\partial\phi}{\partial z} = \begin{cases} 0 & \text{for } z = -H\\ \dfrac{\partial h}{\partial t} & \text{for } z = 0\end{cases} \qquad\text{and}\qquad \frac{\partial\phi}{\partial t} + gh = f(t) \quad\text{for } z = 0.
Note that the last equation is just Bernoulli's equation, after removing the small terms and throwing the constants and factors into the function f . We now have a nice, straightforward problem. We have a linear equation with linear boundary conditions, which we can solve. General strategies for solving these include looking for separable solutions, or taking the Fourier transform with respect to x.

15.5.3 Two-dimensional waves (straight crested waves)


We are going to further simplify the situation by considering the case where the wave does not depend on y. We consider a simple wave form h = h0 e^{i(kx−ωt)}. Using the boundary condition at z = 0, we know we must have a solution of the form φ = φ̂(z)e^{i(kx−ωt)}. Putting this into Laplace's equation, we have −k²φ̂ + φ̂'' = 0. We notice that the solutions are of the form φ̂ = φ0 cosh k(z + H), where the constants are chosen so that ∂φ/∂z = 0 at z = −H.
We now have three unknowns, namely h0, φ0 and ω (we assume k is given, and we want to find waves of this wave number). We use the boundary condition ∂φ/∂z = ∂h/∂t at z = 0 to get kφ0 sinh kH = −iωh0. We put this into Bernoulli's equation to get
-i\omega\hat\phi(0)e^{i(kx-\omega t)} + g h_0 e^{i(kx-\omega t)} = f(t).
For this not to depend on x, we must have −iωφ0 cosh kH + gh0 = 0. A trivial solution is h0 = φ0 = 0. Otherwise, we can solve to get ω² = gk tanh kH. This is the dispersion relation , and relates the frequency to the wavelength of the wave. We can use the dispersion relation to find the speed of the wave:
c = \frac{\omega}{k} = \sqrt{\frac{g}{k}\tanh kH}.

Analysis of result
We can look at the limits for large and small H.
• In deep water (or short waves), we have kH ≫ 1. We know that as kH → ∞, we get tanh kH → 1. So we get c = √(g/k).
• In shallow water (or long waves), we have kH ≪ 1. In the limit kH → 0, we get tanh kH → kH. Then we get c = √(gH).
These are exactly as predicted using dimensional analysis, with all the dimensionless
constants being 1.

We can plot how the wave speed varies with k (the plot shows c/√(gH) against kH, starting at 1 and decreasing). We see that the wave speed decreases monotonically with k, and long waves travel faster than short waves. This means if we start with, say, a square wave, the long components of the wave travel faster than the short components. So the square wave disintegrates as it travels.
Note also that this puts an upper bound on the maximum value of the speed c. There can be no wave travelling faster than √(gH). Thus if you travel faster than √(gH), all the waves you produce are left behind you. In general, if you have velocity U , we can define the Froude number Fr = U/√(gH). This is like the Mach number (the equivalent for sound).
E. 15-23
For waves in the ocean we have H ∼ 5 km and wave period ∼ 15 s, so ω ∼ 2π/15 ≈ 0.4 s⁻¹. So we are in the deep water case. Hence 1/k = g/ω² ≈ 63 m, λ = 2π/k ≈ 400 m and c ≈ 25 m s⁻¹.
For a tsunami, we have λ ∼ 400 km and H ∼ 4 km. We are thus in the regime of small kH, and c = √(10 × 4 × 10³) = 200 m s⁻¹. Note that the speed of the tsunami depends on the depth of the water. So the topography of the bottom of the ocean will affect how tsunamis move. So knowing the topography of the sea bed allows us to predict how tsunamis will move, and can save lives.
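These numbers are easy to reproduce from the dispersion relation (using g = 10 and the rounded ω ≈ 0.4 s⁻¹, as in the text):

    # Ocean swell (deep water) and tsunami (shallow water) speeds
    from math import sqrt, tanh, pi

    g = 10.0

    omega = 0.4                       # s^-1, roughly 2*pi/15
    k = omega**2/g                    # deep-water relation omega^2 = g k
    print(1/k, 2*pi/k, g/omega)       # ~63 m, ~390 m wavelength, ~25 m/s

    H = 4000.0                        # m, ocean depth
    k = 2*pi/400e3                    # tsunami wavelength 400 km
    print(sqrt(g/k*tanh(k*H)), sqrt(g*H))   # both about 200 m/s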

15.5.4 Free-surface modes in a container


We consider water confined in a box given by 0 ≤ x ≤ a, 0 ≤ y ≤ b and −H ≤ z. Let
z = h(x, y, t) be the height of the free surface, which oscillates about z = 0. Assume
h ≪ H. The linearised equations and boundary conditions are

\nabla^2\phi = 0 \quad\text{for } -H \le z \le 0,\ 0 \le x \le a,\ 0 \le y \le b;
\frac{\partial\phi}{\partial z} = 0 \text{ on } z = -H, \qquad \frac{\partial\phi}{\partial x} = 0 \text{ on } x = 0, a, \qquad \frac{\partial\phi}{\partial y} = 0 \text{ on } y = 0, b;
\frac{\partial h}{\partial t} - \frac{\partial\phi}{\partial z} = 0 \quad\text{and}\quad \frac{\partial\phi}{\partial t} + gh = f(t) \quad\text{on } z = 0.
We seek solutions of the form φ = Re(φ̂(x, y, z)e^{−iωt}). In order for ∂_tφ + gh to be independent of x, y on z = 0 we need h = Re((iω/g)φ̂(x, y, z)e^{−iωt}). Now the condition ∂_t h − ∂_zφ = 0 on z = 0 says that ∂_zφ̂|_{z=0} = (ω²/g)φ̂|_{z=0}.
Performing separation of variables on φ̂(x, y, z) = α(x)β(y)γ(z), using the boundary conditions and the fact that φ is harmonic, we find that α(x) = cos(mπx/a) and β(y) = cos(nπy/b) for n, m ∈ Z. So we can write φ̂ = φ̂_{mn}(z) cos(mπx/a) cos(nπy/b). This solves ∇²φ̂ = 0 provided
\frac{d^2\hat\phi_{mn}}{dz^2} - k_{mn}^2\hat\phi_{mn} = 0 \quad\text{where}\quad k_{mn}^2 = \left(\frac{m\pi}{a}\right)^2 + \left(\frac{n\pi}{b}\right)^2. \tag{$*$}
The boundary conditions in z tell us that ∂_zφ̂_{mn}(−H) = 0 and ∂_zφ̂_{mn}(0) = (ω²/g)φ̂_{mn}(0). Using the first of these we find that the solution to (∗) is φ̂_{mn}(z) = C cosh(k_{mn}(z + H)) for some constant C. Using the second boundary condition we find that ω satisfies ω_{mn}² = g k_{mn} tanh(k_{mn}H). Each of these solutions,
h = \mathrm{Re}\left(\frac{i\omega_{mn}}{g}\, C\cosh(k_{mn}H)\cos\left(\frac{m\pi x}{a}\right)\cos\left(\frac{n\pi y}{b}\right)e^{-i\omega_{mn}t}\right),
is a standing wave. Of course we can also superimpose them.
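A short sketch evaluating the lowest few mode frequencies for an illustrative tank (the dimensions below are my own choice):

    # Free-surface mode frequencies omega_mn for a box of base a x b and depth H
    import numpy as np

    g, a, b, H = 9.81, 1.0, 0.5, 0.3
    for m in range(3):
        for n in range(3):
            if m == 0 and n == 0:
                continue                       # (0, 0) is not an oscillating mode
            k_mn = np.sqrt((m*np.pi/a)**2 + (n*np.pi/b)**2)
            print(m, n, np.sqrt(g*k_mn*np.tanh(k_mn*H)))   # rad s^-1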

15.5.5 Group velocity


Now suppose we have two waves travelling closely together with similar wave numbers, eg.
\sin k_1 x + \sin k_2 x = 2\sin\left(\frac{k_1+k_2}{2}x\right)\cos\left(\frac{k_1-k_2}{2}x\right).
Since k1 and k2 are very similar, we know (k1 + k2)/2 ≈ k1, and (k1 − k2)/2 is small, hence the cosine factor has a long period. We would then expect the waves to look like a rapidly oscillating carrier modulated by a slowly varying envelope. So the amplitudes of the waves fluctuate. We say the wave travels in groups. It turns out the "packets" don't travel at the same velocity as the waves themselves. The group velocity is given by
c_g = \frac{\partial\omega}{\partial k}.
In particular, for deep water waves, where ω = √(gk), we get c_g = ½√(g/k) = ½c. This is also the velocity at which energy propagates.
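A quick numerical check that cg is half the phase speed in deep water, by differentiating the full dispersion relation (depth and wavenumber chosen so that kH ≫ 1):

    # Group velocity c_g = d(omega)/dk versus half the phase speed
    import numpy as np

    g, H, k, dk = 9.81, 100.0, 1.0, 1e-6
    omega = lambda k: np.sqrt(g*k*np.tanh(k*H))

    cg = (omega(k + dk) - omega(k - dk))/(2*dk)   # centred difference
    print(cg, 0.5*omega(k)/k)                     # nearly equal since kH = 100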

15.5.6 Rayleigh-Taylor instability


Now we turn the problem upside down, and imagine we have water over air. We can imagine this as the same scenario as before, with water at the bottom, but with gravity pulling upwards. So exactly the same equations hold, but we replace −g with g. In deep water, we have ω² = −gk. So we have ω = ±i√(gk) and so
h \propto A e^{\sqrt{gk}\,t} + B e^{-\sqrt{gk}\,t}.
We thus have an exponentially growing solution. So the system is un-
stable, and water will fall down. This very interesting fact is known as
Rayleigh-Taylor instability .

15.6 Fluid dynamics on a rotating frame


We would like to study fluid dynamics in a rotating frame, because the Earth is rotat-
ing. So this is particularly relevant if we want to study ocean currents or atmospheric
flow.

15.6.1 Equations of motion in a rotating frame


The Lagrangian (particle) acceleration in a rotating frame of reference rotating with
constant angular velocity Ω is given by
\frac{Du}{Dt} + 2\Omega\times u + \Omega\times(\Omega\times x),
from IA Dynamics and Relativity. So we have the equation of motion
\rho\left(\frac{\partial u}{\partial t} + u\cdot\nabla u + 2\Omega\times u\right) = -\nabla p - \rho\,\Omega\times(\Omega\times x) + \rho g.
This is complicated, so we want to simplify it. We first compare the centrifugal force with gravity. We have |Ω| = 2π/(1 day) ≈ 2π × 10⁻⁵ s⁻¹. The largest length scales are about 10⁴ km, i.e. 10⁷ m. Compared to gravity g, we have
\frac{|\Omega\times(\Omega\times x)|}{|g|} \le \frac{(2\pi)^2\times 10^{-10}\times 10^7}{10} \approx 4\times 10^{-3}.
So the centrifugal term is tiny compared to gravity, and we will ignore it. Alternatively,
we can show that Ω×(Ω×x) can be given by a scalar potential, and we can incorporate
it into the potential term, but we will not do that.
Next, we want to get rid of the non-linear terms. We consider motions for which |u · ∇u| ≪ |2Ω × u|. The scales of these two terms are U²/L and ΩU respectively. So we need Ro = U/(ΩL) ≪ 1; this is known as the Rossby number . In our atmosphere, we have U ∼ 10 m s⁻¹ and L ∼ 10³ km. So we get Ro = 10/(10⁶ · 10⁻⁴) ≈ 0.1. So we can ignore the non-linear terms. Now we are left with

<Euler's equation in a rotating frame> \frac{\partial u}{\partial t} + 2\Omega\times u = -\frac{1}{\rho}\nabla p + g.
This holds when we have small Rossby number, for example when we are in a strongly
rotating frame with large Ω. We conventionally write 2Ω = f , this is called the
Coriolis parameter or the planetary vorticity . Note that since we take the cross
product of f with u, only the component perpendicular to the velocity matters. As-
suming that fluid flows along the surface of the earth, we only need the component
of f normal to the surface, namely f = 2Ω sin θ where θ is the angle from the equa-
tor.

15.6.2 Shallow water equations


We are now going to derive the shallow water equations, where we have some water
shallow relative to its horizontal extent. This is actually quite a good approximation
for the ocean — while the Atlantic is around 4 km deep, it is several thousand kilo-
meters across. So the ratio is just like a piece of paper. Similarly, this is also a good
approximation for the atmosphere.
Suppose we have a shallow layer of depth z = h(x, y, t) with p = p0 on z = h. We
consider motions with horizontal scales L much greater than vertical scales H. We
use the fact that the fluid is incompressible, ie. ∇ · u = 0. Writing u = (u, v, w), we
get
\frac{\partial w}{\partial z} = -\frac{\partial u}{\partial x} - \frac{\partial v}{\partial y}.

The scales of the terms are W/H, U/L and V/L respectively. Since H ≪ L, we know W ≪ U, V , ie. most of the movement is horizontal, which makes sense, since there isn't much vertical space to move around. We consider only horizontal velocities, and write u = (u, v, 0) and f = (0, 0, f). Then from Euler's equations, we get
\frac{\partial u}{\partial t} - f v = -\frac{1}{\rho}\frac{\partial p}{\partial x}, \qquad \frac{\partial v}{\partial t} + f u = -\frac{1}{\rho}\frac{\partial p}{\partial y}, \qquad 0 = -\frac{1}{\rho}\frac{\partial p}{\partial z} - g.

From the last equation, plus the boundary conditions, we know p = p0 + ρg(h − z).
This is just the hydrostatic balance. We now put this expression into the horizontal
components to get

\frac{\partial u}{\partial t} - f v = -g\frac{\partial h}{\partial x} \qquad\text{and}\qquad \frac{\partial v}{\partial t} + f u = -g\frac{\partial h}{\partial y}.

Note that the right hand sides of both equations are independent of z. So the ac-
celerations are independent of z. The initial conditions are usually that u and v are
independent of z. So we assume that the velocities always do not depend on z.

15.6.3 Geostrophic balance


When we have steady flow, the time derivatives vanish. So we get
   
u = \frac{\partial}{\partial y}\left(-\frac{gh}{f}\right) = \frac{\partial}{\partial y}\left(-\frac{p}{\rho f}\right), \qquad v = -\frac{\partial}{\partial x}\left(-\frac{gh}{f}\right) = -\frac{\partial}{\partial x}\left(-\frac{p}{\rho f}\right).

This has streamfunction ψ = −gh/f , called the shallow water streamfunction . The
streamlines are places where h is constant, ie. the surface is of constant height, ie. the
pressure is constant.
In general, near a low pressure zone (marked L in the diagram), there is a pressure gradient pushing the flow towards the low pressure area. As soon as the air starts to move, however, the Coriolis force deflects it. As the air moves from the high-pressure area, its speed increases, and so does its Coriolis deflection. The deflection increases until the Coriolis and pressure gradient forces are in geostrophic balance : at this point, the air flow is no longer moving from high to low pressure, but instead moves along an isobar, like in a cyclone. The diagram shows what a cyclone looks like in the Northern hemisphere. If we are on the other side of the Earth, cyclones go the other way round.
We now look at the continuity equation, ie. the conservation of mass. We consider a
horizontal surface D in the water. Then we can compute
\frac{d}{dt}\int_D\rho h\, dA = -\int_{\partial D} h\rho\, u_H\cdot n\, d\ell = -\int_D \nabla_H\cdot(\rho h u_H)\, dA,
where u_H is the horizontal velocity and n the outward normal of the boundary curve. In the last equality we applied the divergence theorem, where ∇_H = (∂/∂x, ∂/∂y, 0). Since this was an arbitrary surface, we can take the integral away, and we have the continuity equation
\frac{\partial h}{\partial t} + \nabla_H\cdot(u_H h) = 0.
So if there is water flowing into a point (ie. a vertical line), then the height of the surface at the point increases, and vice versa. We can write this out in full, in Cartesian coordinates:
\frac{\partial h}{\partial t} + \frac{\partial}{\partial x}(uh) + \frac{\partial}{\partial y}(vh) = 0.

Small oscillations (linearise equations)


To simplify the situation, we suppose we have small oscillations, so we have h = h0 + η(x, y, t), where η ≪ h0, and write u = (u(x, y), v(x, y)). Then we can rewrite our equations of motion as
\frac{\partial u}{\partial t} + f\times u = -g\nabla\eta \tag{$*$}
and, ignoring terms like uη, vη, the continuity equation gives
\frac{\partial\eta}{\partial t} + h_0\nabla\cdot u = 0. \tag{$\dagger$}
Taking the curl of (∗), we get ∂ζ/∂t + f∇ · u = 0 where ζ = ∇ × u. Note that even though we wrote this as a vector equation, really only the z-component is non-zero. So we can also view ζ and f as scalars, and get a scalar equation.
We can express ∇ · u in terms of η using (†). So we get
\frac{\partial}{\partial t}\left(\zeta - \frac{\eta}{h_0}f\right) = \frac{dQ}{dt} = 0,
where we get a conserved quantity Q = ζ − (η/h0)f called the potential vorticity .¹


Hence given any initial condition, we can compute Q(x, y, 0) = Q0 , and then we
have Q(x, y, t) = Q0 for all time. How can we make use of this? We start by taking
the divergence of (∗) above to get

\frac{\partial}{\partial t}(\nabla\cdot u) - f\cdot\nabla\times u = -g\nabla^2\eta,
and use (†) to substitute ∇ · u = −(1/h0)∂η/∂t. We then get
-\frac{1}{h_0}\frac{\partial^2\eta}{\partial t^2} - f\cdot\zeta = -g\nabla^2\eta.
We now use the conservation of potential vorticity, ζ = Q0 + (η/h0)f, to rewrite this as
\frac{\partial^2\eta}{\partial t^2} - gh_0\nabla^2\eta + f\cdot f\,\eta = -h_0\, f\cdot Q_0.
Note that the right hand side is just a constant (in time). So we have a nice differential
equation we can solve.
¹ These assume small oscillations and small Rossby number. The more general non-linearised version is \frac{D}{Dt}\left(\frac{\zeta + f}{h}\right) = 0.

E. 15-24
Suppose we have fluid with mean depth h0, and we start with the following scenario: the free surface sits at −η0 for x < 0 and at +η0 for x > 0. Due to the difference in height, we have higher pressure on the right and lower pressure on the left. If there is no rotation, then the final state is a flat surface with no flow. However, this cannot be the case if there is rotation, since this violates the conservation of Q. So what happens if there is rotation?

At the beginning, there is no movement. So we have ζ(t = 0) = 0. Thus we have
Q_0 = -\frac{\eta_0}{h_0}f\,\mathrm{sign}(x).
We seek the final steady state such that ∂η/∂t = 0. We further assume that the final solution is independent of y, so ∂η/∂y = 0. So η = η(x) is just a function of x. Our equation then says
\frac{\partial^2\eta}{\partial x^2} - \frac{f^2}{gh_0}\eta = \frac{f}{g}Q_0 = -\frac{f^2}{gh_0}\eta_0\,\mathrm{sign}(x).


It is convenient to define a new variable R = √(gh0)/f, which is a length scale called the Rossby radius of deformation . This is the fundamental length scale to use in rotating systems when gravity is involved as well. We know √(gh0) is the fastest possible wave speed, and thus R is roughly how far a wave can travel in one rotation period. We rewrite our equation as
\frac{d^2\eta}{dx^2} - \frac{1}{R^2}\eta = -\frac{1}{R^2}\eta_0\,\mathrm{sign}(x).

We now impose our boundary conditions. We require η → ±η0 as x → ±∞. We also require η and dη/dx to be continuous at x = 0. The solution is
\eta = \eta_0\times\begin{cases} 1 - e^{-x/R} & x > 0\\ -(1 - e^{x/R}) & x < 0.\end{cases}
We can see that this looks quite different from the non-rotating case. The horizontal length scale involved is 2R. We now look at the velocities. Using the steady flow equations, we have
u = -\frac{g}{f}\frac{\partial\eta}{\partial y} = 0, \qquad v = \frac{g}{f}\frac{\partial\eta}{\partial x} = \eta_0\sqrt{\frac{g}{h_0}}\,e^{-|x|/R}.

So there is still flow in this system, and it is a flow in the y direction, into the page.
This flow gives Coriolis force to the right, and hence balances the pressure gradient
of the system. The final state is not one of rest, but one with motion in which the
Coriolis force balances the pressure gradient. This is geostrophic flow.

E. 15-25
Going back to our pressure maps, if we have high and low pressure systems (L and H), we can have flows that circulate around them. Then the Coriolis force will balance the pressure gradients.
Weather maps describe balanced flows. We can compute the scales here. In the atmosphere, we have approximately
R \approx \frac{\sqrt{10\cdot 10^3}}{10^{-4}} = 10^6\text{ m} \approx 1000\text{ km}.
So the scales of cyclones are approximately 1000 km. On the other hand, in the ocean, we have
R \approx \frac{\sqrt{10\cdot 10}}{10^{-4}} = 10^5\text{ m} = 100\text{ km}.
So ocean scales are much smaller than atmospheric scales.
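The two estimates as a two-line sketch (g = 10 and the equivalent depths 1 km and 10 m are the rough values used above):

    # Rossby radius of deformation R = sqrt(g*h0)/f
    from math import sqrt

    f = 1e-4                       # s^-1, mid-latitude Coriolis parameter
    for h0 in (1.0e3, 10.0):       # atmosphere, ocean (equivalent depths)
        print(sqrt(10.0*h0)/f)     # ~1e6 m (1000 km) and ~1e5 m (100 km)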
APPENDIX A

Some useful results


Binomial series

(1 + x)^a = 1 + \sum_{k=1}^{\infty}\frac{a(a-1)\cdots(a-k+1)}{k!}x^k   for |x| < 1, a ∈ C

Maclaurin's series

e^x = \sum_{n=0}^{\infty}\frac{x^n}{n!}   for x ∈ C
\ln(1+x) = \sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n}x^n   for −1 < x ≤ 1
\sin x = \sum_{n=0}^{\infty}\frac{(-1)^n}{(2n+1)!}x^{2n+1}   for x ∈ C
\cos x = \sum_{n=0}^{\infty}\frac{(-1)^n}{(2n)!}x^{2n}   for x ∈ C
\tan^{-1}x = \sum_{n=0}^{\infty}\frac{(-1)^n}{2n+1}x^{2n+1}   for |x| ≤ 1, x ≠ ±i

The series for the corresponding hyperbolic functions are the same but without the factor (−1)^n.

Hyperbolic functions

cosh⁻¹ x = ln(x + √(x² − 1)) for x ≥ 1
sinh⁻¹ x = ln(x + √(x² + 1))
tanh⁻¹ x = ½ ln((1 + x)/(1 − x)) for |x| < 1

Trigonometric identities

sin(A + B) = sin A cos B + cos A sin B
cos(A + B) = cos A cos B − sin A sin B
tan(A + B) = (tan A + tan B)/(1 − tan A tan B)
sin A + sin B = 2 sin((A + B)/2) cos((A − B)/2)
cos A + cos B = 2 cos((A + B)/2) cos((A − B)/2)
cos A − cos B = −2 sin((A + B)/2) sin((A − B)/2)
tan x = sin(2x)/(1 + cos(2x)),  cot x = sin(2x)/(1 − cos(2x))

Summation

\sum_{r=1}^{n} r^2 = \tfrac16 n(n+1)(2n+1), \qquad \sum_{r=1}^{n} r^3 = \tfrac14 n^2(n+1)^2

Trigonometric values

sin 0 = cos(π/2) = 0,  sin(π/6) = cos(π/3) = 1/2,  sin(π/4) = cos(π/4) = √2/2,
sin(π/3) = cos(π/6) = √3/2,  sin(π/2) = cos 0 = 1

Tan half-angle substitution

t = tan(x/2),  dx = 2 dt/(1 + t²),  sin x = 2t/(1 + t²),  cos x = (1 − t²)/(1 + t²)

Vectors

a × (b × c) = (a · c)b − (a · b)c.  Dot with d to obtain the result for (b × c) · (d × a).
For r = |r| where r = x_i e_i in R^n:  ∇r^α = α r^{α−2} r,  ∇ · (r^α r) = (α + n) r^α,  and in R³, ∇ × (r^α r) = 0.


A.1 Integration and differentiation


Derivatives:

d/dx(tan x) = sec² x          d/dx(sin⁻¹ x) = 1/√(1 − x²)
d/dx(cot x) = −cosec² x       d/dx(cos⁻¹ x) = −1/√(1 − x²)
d/dx(sec x) = sec x tan x     d/dx(tan⁻¹ x) = 1/(1 + x²)
d/dx(cosec x) = −cosec x cot x
d/dx(sinh⁻¹ x) = 1/√(1 + x²)  d/dx(cosh⁻¹ x) = 1/√(x² − 1)  d/dx(tanh⁻¹ x) = 1/(1 − x²)

Integrals:

∫ tan x dx = ln|sec x|
∫ cot x dx = ln|sin x|
∫ sec x dx = ln|sec x + tan x| = ln|tan(x/2 + π/4)|
∫ cosec x dx = −ln|cosec x + cot x| = ln|tan(x/2)|
∫ dx/√(x² + a²) = sinh⁻¹(x/a)
∫ dx/√(a² − x²) = sin⁻¹(x/a)
∫ dx/√(x² − a²) = cosh⁻¹(x/a)
∫ dx/(a² + x²) = (1/a) tan⁻¹(x/a)
∫ dx/(a² − x²) = (1/a) tanh⁻¹(x/a) = (1/2a) ln|(a + x)/(a − x)|
∫ dx/(x² − a²) = (1/2a) ln|(x − a)/(x + a)|
∫ dx/(a² ± x²)^{3/2} = x/(a²√(a² ± x²))
∫ dx/(x² − a²)^{3/2} = −x/(a²√(x² − a²))

∫ sin²(ax) dx = ∫ ½(1 − cos(2ax)) dx = x/2 − sin(2ax)/(4a)
∫ cos²(ax) dx = ∫ ½(1 + cos(2ax)) dx = x/2 + sin(2ax)/(4a)

\int e^{ax}(\cos(bx) + i\sin(bx))\,dx = \frac{e^{ax}}{a^2+b^2}\big(a\cos(bx) + b\sin(bx) + i(a\sin(bx) - b\cos(bx))\big)

\int x^n e^{mx}\,dx = \sum_{r=0}^{n}\frac{(-1)^r n^{\underline{r}}}{m^{r+1}}x^{n-r}e^{mx}

\int x^n\cos(mx)\,dx = \sum_{r=1}^{\lceil n/2\rceil}\frac{(-1)^{r+1} n^{\underline{2r-1}}}{m^{2r}}x^{n-2r+1}\cos(mx) + \sum_{r=0}^{\lfloor n/2\rfloor}\frac{(-1)^r n^{\underline{2r}}}{m^{2r+1}}x^{n-2r}\sin(mx)

\int x^n\sin(mx)\,dx = \sum_{r=1}^{\lceil n/2\rceil}\frac{(-1)^{r+1} n^{\underline{2r-1}}}{m^{2r}}x^{n-2r+1}\sin(mx) - \sum_{r=0}^{\lfloor n/2\rfloor}\frac{(-1)^r n^{\underline{2r}}}{m^{2r+1}}x^{n-2r}\cos(mx)

(here n^{\underline{r}} = n(n−1)···(n−r+1) denotes the falling factorial)

\int_0^a \sin\left(\frac{k\pi x}{a}\right)\sin\left(\frac{n\pi x}{a}\right)dx = \frac{a}{2}\delta_{k,n}

a\cos(mx) + b\sin(mx) = \mathrm{sgn}(b)\sqrt{a^2+b^2}\,\sin\!\left(mx + \tan^{-1}\frac{a}{b}\right) = \mathrm{sgn}(a)\sqrt{a^2+b^2}\,\cos\!\left(mx - \tan^{-1}\frac{b}{a}\right)

A.2 Coordinate systems and operators

Cylindrical polars (ρ, ϕ, z):
x1 = ρ cos ϕ, x2 = ρ sin ϕ, x3 = z
e_ρ = (cos ϕ, sin ϕ, 0), e_ϕ = (−sin ϕ, cos ϕ, 0), e_z = (0, 0, 1)
h_ρ = 1, h_ϕ = ρ, h_z = 1
dV = ρ dρ dϕ dz
dS = e_ρ ρ dϕ dz (ρ const), e_ϕ dρ dz (ϕ const), e_z ρ dρ dϕ (z const)

Spherical polars (r, θ, ϕ):
x1 = r sin θ cos ϕ, x2 = r sin θ sin ϕ, x3 = r cos θ
e_r = (sin θ cos ϕ, sin θ sin ϕ, cos θ), e_θ = (cos θ cos ϕ, cos θ sin ϕ, −sin θ), e_ϕ = (−sin ϕ, cos ϕ, 0)
h_r = 1, h_θ = r, h_ϕ = r sin θ
dV = r² sin θ dr dθ dϕ
dS = e_r r² sin θ dθ dϕ (r const), e_θ r sin θ dr dϕ (θ const), e_ϕ r dr dθ (ϕ const)

For orthogonal curvilinear coordinates (u, v, w) with scale factors h_u, h_v, h_w:

\nabla = e_u\frac{1}{h_u}\frac{\partial}{\partial u} + e_v\frac{1}{h_v}\frac{\partial}{\partial v} + e_w\frac{1}{h_w}\frac{\partial}{\partial w}

\nabla\times F = \frac{1}{h_u h_v h_w}\begin{vmatrix} h_u e_u & h_v e_v & h_w e_w\\ \partial/\partial u & \partial/\partial v & \partial/\partial w\\ h_u F_u & h_v F_v & h_w F_w\end{vmatrix}

\nabla\cdot F = \frac{1}{h_u h_v h_w}\left(\frac{\partial}{\partial u}(h_v h_w F_u) + \frac{\partial}{\partial v}(h_u h_w F_v) + \frac{\partial}{\partial w}(h_u h_v F_w)\right)
.............................................................................
In cylindrical polar coordinates (ρ, ϕ, z) we have

\nabla f = \frac{\partial f}{\partial\rho}\hat\rho + \frac{1}{\rho}\frac{\partial f}{\partial\varphi}\hat\varphi + \frac{\partial f}{\partial z}\hat z

\nabla\cdot F = \frac{1}{\rho}\frac{\partial(\rho F_\rho)}{\partial\rho} + \frac{1}{\rho}\frac{\partial F_\varphi}{\partial\varphi} + \frac{\partial F_z}{\partial z}

\nabla^2 f = \frac{1}{\rho}\frac{\partial}{\partial\rho}\left(\rho\frac{\partial f}{\partial\rho}\right) + \frac{1}{\rho^2}\frac{\partial^2 f}{\partial\varphi^2} + \frac{\partial^2 f}{\partial z^2}

In spherical polar coordinates (r, θ, ϕ) we have

\nabla f = \frac{\partial f}{\partial r}\hat r + \frac{1}{r}\frac{\partial f}{\partial\theta}\hat\theta + \frac{1}{r\sin\theta}\frac{\partial f}{\partial\varphi}\hat\varphi

\nabla\cdot F = \frac{1}{r^2}\frac{\partial(r^2 F_r)}{\partial r} + \frac{1}{r\sin\theta}\frac{\partial(F_\theta\sin\theta)}{\partial\theta} + \frac{1}{r\sin\theta}\frac{\partial F_\varphi}{\partial\varphi}

\nabla^2 f = \frac{1}{r^2}\frac{\partial}{\partial r}\left(r^2\frac{\partial f}{\partial r}\right) + \frac{1}{r^2\sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\frac{\partial f}{\partial\theta}\right) + \frac{1}{r^2\sin^2\theta}\frac{\partial^2 f}{\partial\varphi^2}

A.3 Transform tables

Laplace transforms (valid for p > p0):

f(t)              f̂(p)                     p0
c                 c/p                       0
c t^n             c n!/p^{n+1}              0
sin(bt)           b/(p² + b²)               0
cos(bt)           p/(p² + b²)               0
e^{at}            1/(p − a)                 a
t^n e^{at}        n!/(p − a)^{n+1}          a
sinh(at)          a/(p² − a²)               |a|
cosh(at)          p/(p² − a²)               |a|
e^{at} sin(bt)    b/((p − a)² + b²)         a
e^{at} cos(bt)    (p − a)/((p − a)² + b²)   a
√t                (1/2)√(π/p³)              0
1/√t              √(π/p)                    0
δ(t − t0)         e^{−p t0}                 0
H(t − t0)         e^{−p t0}/p               0

Fourier transforms:

f(x)                               f̃(k)
1                                  2πδ(k)
δ(x − c)                           e^{−ick}
−1/2 + H(x)                        1/(ik)
H(x)                               πδ(k) + 1/(ik)
e^{−αx} H(x)  (Re α > 0)           1/(ik + α)
e^{−α|x|}  (Re α > 0)              2α/(α² + k²)
cos(ax + b)                        π(e^{−ib}δ(k + a) + e^{ib}δ(k − a))
sin(ax + b)                        iπ(e^{−ib}δ(k + a) − e^{ib}δ(k − a))
rect(x/τ)                          τ sinc(τk/2)
(τ/2π) sinc(τx/2)                  rect(k/τ)
(1 − 2|x|/τ) rect(x/τ)             (τ/2) sinc²(τk/4)
(τ/4π) sinc²(τx/4)                 (1 − 2|k|/τ) rect(k/τ)

The above Laplace transforms are valid for p > p0. Here sinc(x) := sin(x)/x, rect(x/τ) is the rectangular pulse function of width τ, H(x) is the Heaviside step function and δ(x) is the Dirac delta function.

A.4 Distributions
Discrete distributions (here q = 1 − p); each entry gives PMF; mean; variance; PGF:

• Discrete uniform U{1,…,n}: 1/n for k ∈ {1,…,n};  (n+1)/2;  (n²−1)/12;  (1/n)∑_{i=1}^{n} z^i
• Bernoulli Bin(1,p): p^k(1−p)^{1−k} for k ∈ {0,1};  p;  p(1−p);  q + pz
• Binomial Bin(n,p): \binom{n}{k}p^k(1−p)^{n−k} for k ∈ {0,1,…,n};  np;  np(1−p);  (q + pz)^n
• Geometric v.1: (1−p)^k p for k ∈ N₀;  q/p;  q/p²;  p/(1−qz)
• Geometric v.2: (1−p)^{k−1}p for k ∈ N;  1/p;  q/p²;  pz/(1−qz)
• Negative binomial NegBin(n,p): \binom{k−1}{n−1}p^n(1−p)^{k−n} for k ∈ {n, n+1,…};  n/p;  nq/p²;  (pz)^n/(1−qz)^n
• Poisson Poisson(λ): (λ^k/k!)e^{−λ} for k ∈ N₀;  λ;  λ;  e^{λ(z−1)}

Continuous distributions; each entry gives PDF; CDF; mean; variance; MGF:

• Uniform U[a,b]: 1/(b−a) for a ≤ x ≤ b;  (x−a)/(b−a);  (a+b)/2;  (b−a)²/12;  (e^{θb}−e^{θa})/(θ(b−a))
• Normal N(µ,σ²): (1/(√(2π)σ))e^{−(x−µ)²/(2σ²)};  /;  µ;  σ²;  e^{θµ + σ²θ²/2}
• Exponential(λ): λe^{−λx} for x ≥ 0;  1 − e^{−λx};  1/λ;  1/λ²;  λ/(λ−θ)
• Cauchy: 1/(π(1+x²));  /;  undefined;  undefined;  undefined
• Gamma(α,λ): λ^α x^{α−1}e^{−λx}/Γ(α) for x ≥ 0;  /;  α/λ;  α/λ²;  (λ/(λ−θ))^α for θ < λ
• Beta(a,b): (Γ(a+b)/(Γ(a)Γ(b)))x^{a−1}(1−x)^{b−1} for 0 ≤ x ≤ 1;  /;  a/(a+b);  ab/((a+b)²(a+b+1));  /
• Multivariate normal N_n(µ,Σ): e^{−(z−µ)^TΣ^{−1}(z−µ)/2}/((2π)^{n/2}√(det Σ));  /;  µ;  Σ;  e^{t^Tµ + t^TΣt/2}

• Here Γ(z) = ∫_0^∞ x^{z−1}e^{−x} dx is the gamma function. It is such that Γ(n) = (n − 1)! if n is a positive integer.
• For n ∈ N the χ²_n distribution is the same as the Gamma(n/2, 1/2) distribution. If Y ∼ Gamma(n, λ), then 2λY ∼ χ²_{2n}. The χ²_n distribution is also the distribution of the sum of squares of n iid standard normal N(0, 1) random variables.

A.5 Statistics tables


These tables give values x such that a certain percentage of the distribution lies less
than x.

Percentage points of N (0, 1)


60.0% 66.7% 75.0% 80.0% 87.5% 90.0% 95.0% 97.5% 99.0% 99.5% 99.9%
0.253 0.432 0.674 0.842 1.150 1.282 1.645 1.960 2.326 2.576 3.090

Percentage points of tn
n 60.0% 66.7% 75.0% 80.0% 87.5% 90.0% 95.0% 97.5% 99.0% 99.5% 99.9%
1 0.325 0.577 1.000 1.376 2.414 3.078 6.314 12.706 31.821 63.657 318.31
2 0.289 0.500 0.816 1.061 1.604 1.886 2.920 4.303 6.965 9.925 22.327
3 0.277 0.476 0.765 0.978 1.423 1.638 2.353 3.182 4.541 5.841 10.215
4 0.271 0.464 0.741 0.941 1.344 1.533 2.132 2.776 3.747 4.604 7.173
5 0.267 0.457 0.727 0.920 1.301 1.476 2.015 2.571 3.365 4.032 5.893
6 0.265 0.453 0.718 0.906 1.273 1.440 1.943 2.447 3.143 3.707 5.208
7 0.263 0.449 0.711 0.896 1.254 1.415 1.895 2.365 2.998 3.499 4.785
8 0.262 0.447 0.706 0.889 1.240 1.397 1.860 2.306 2.896 3.355 4.501
9 0.261 0.445 0.703 0.883 1.230 1.383 1.833 2.262 2.821 3.250 4.297
10 0.260 0.444 0.700 0.879 1.221 1.372 1.812 2.228 2.764 3.169 4.144
11 0.260 0.443 0.697 0.876 1.214 1.363 1.796 2.201 2.718 3.106 4.025
12 0.259 0.442 0.695 0.873 1.209 1.356 1.782 2.179 2.681 3.055 3.930
13 0.259 0.441 0.694 0.870 1.204 1.350 1.771 2.160 2.650 3.012 3.852
14 0.258 0.440 0.692 0.868 1.200 1.345 1.761 2.145 2.624 2.977 3.787
15 0.258 0.439 0.691 0.866 1.197 1.341 1.753 2.131 2.602 2.947 3.733
16 0.258 0.439 0.690 0.865 1.194 1.337 1.746 2.120 2.583 2.921 3.686
17 0.257 0.438 0.689 0.863 1.191 1.333 1.740 2.110 2.567 2.898 3.646
18 0.257 0.438 0.688 0.862 1.189 1.330 1.734 2.101 2.552 2.878 3.610
19 0.257 0.438 0.688 0.861 1.187 1.328 1.729 2.093 2.539 2.861 3.579
20 0.257 0.437 0.687 0.860 1.185 1.325 1.725 2.086 2.528 2.845 3.552
21 0.257 0.437 0.686 0.859 1.183 1.323 1.721 2.080 2.518 2.831 3.527
22 0.256 0.437 0.686 0.858 1.182 1.321 1.717 2.074 2.508 2.819 3.505
23 0.256 0.436 0.685 0.858 1.180 1.319 1.714 2.069 2.500 2.807 3.485
24 0.256 0.436 0.685 0.857 1.179 1.318 1.711 2.064 2.492 2.797 3.467
25 0.256 0.436 0.684 0.856 1.178 1.316 1.708 2.060 2.485 2.787 3.450
26 0.256 0.436 0.684 0.856 1.177 1.315 1.706 2.056 2.479 2.779 3.435
27 0.256 0.435 0.684 0.855 1.176 1.314 1.703 2.052 2.473 2.771 3.421
28 0.256 0.435 0.683 0.855 1.175 1.313 1.701 2.048 2.467 2.763 3.408
29 0.256 0.435 0.683 0.854 1.174 1.311 1.699 2.045 2.462 2.756 3.396
30 0.256 0.435 0.683 0.854 1.173 1.310 1.697 2.042 2.457 2.750 3.385
35 0.255 0.434 0.682 0.852 1.170 1.306 1.690 2.030 2.438 2.724 3.340
40 0.255 0.434 0.681 0.851 1.167 1.303 1.684 2.021 2.423 2.704 3.307
45 0.255 0.434 0.680 0.850 1.165 1.301 1.679 2.014 2.412 2.690 3.281
50 0.255 0.433 0.679 0.849 1.164 1.299 1.676 2.009 2.403 2.678 3.261
55 0.255 0.433 0.679 0.848 1.163 1.297 1.673 2.004 2.396 2.668 3.245
60 0.254 0.433 0.679 0.848 1.162 1.296 1.671 2.000 2.390 2.660 3.232
120 0.254 0.677 1.289 1.658 1.980 2.358 2.617 3.160
∞ 0.253 0.431 0.674 0.842 1.150 1.282 1.645 1.960 2.326 2.576 3.090

Percentage points of χ2n


n 60.0% 66.7% 75.0% 80.0% 87.5% 90.0% 95.0% 97.5% 99.0% 99.5% 99.9%
1 0.708 0.936 1.323 1.642 2.354 2.706 3.841 5.024 6.635 7.879 10.828
2 1.833 2.197 2.773 3.219 4.159 4.605 5.991 7.378 9.210 10.597 13.816
3 2.946 3.405 4.108 4.642 5.739 6.251 7.815 9.348 11.345 12.838 16.266
4 4.045 4.579 5.385 5.989 7.214 7.779 9.488 11.143 13.277 14.860 18.467
5 5.132 5.730 6.626 7.289 8.625 9.236 11.070 12.833 15.086 16.750 20.515
6 6.211 6.867 7.841 8.558 9.992 10.645 12.592 14.449 16.812 18.548 22.458
7 7.283 7.992 9.037 9.803 11.326 12.017 14.067 16.013 18.475 20.278 24.322
8 8.351 9.107 10.219 11.030 12.636 13.362 15.507 17.535 20.090 21.955 26.125
9 9.414 10.215 11.389 12.242 13.926 14.684 16.919 19.023 21.666 23.589 27.877
10 10.473 11.317 12.549 13.442 15.198 15.987 18.307 20.483 23.209 25.188 29.588
11 11.530 12.414 13.701 14.631 16.457 17.275 19.675 21.920 24.725 26.757 31.264
12 12.584 13.506 14.845 15.812 17.703 18.549 21.026 23.337 26.217 28.300 32.910
13 13.636 14.595 15.984 16.985 18.939 19.812 22.362 24.736 27.688 29.819 34.528
14 14.685 15.680 17.117 18.151 20.166 21.064 23.685 26.119 29.141 31.319 36.123
15 15.733 16.761 18.245 19.311 21.384 22.307 24.996 27.488 30.578 32.801 37.697
16 16.780 17.840 19.369 20.465 22.595 23.542 26.296 28.845 32.000 34.267 39.252
17 17.824 18.917 20.489 21.615 23.799 24.769 27.587 30.191 33.409 35.718 40.790
18 18.868 19.991 21.605 22.760 24.997 25.989 28.869 31.526 34.805 37.156 42.312
19 19.910 21.063 22.718 23.900 26.189 27.204 30.144 32.852 36.191 38.582 43.820
20 20.951 22.133 23.828 25.038 27.376 28.412 31.410 34.170 37.566 39.997 45.315
21 21.991 23.201 24.935 26.171 28.559 29.615 32.671 35.479 38.932 41.401 46.797
22 23.031 24.268 26.039 27.301 29.737 30.813 33.924 36.781 40.289 42.796 48.268
23 24.069 25.333 27.141 28.429 30.911 32.007 35.172 38.076 41.638 44.181 49.728
24 25.106 26.397 28.241 29.553 32.081 33.196 36.415 39.364 42.980 45.559 51.179
25 26.143 27.459 29.339 30.675 33.247 34.382 37.652 40.646 44.314 46.928 52.620
26 27.179 28.520 30.435 31.795 34.410 35.563 38.885 41.923 45.642 48.290 54.052
27 28.214 29.580 31.528 32.912 35.570 36.741 40.113 43.195 46.963 49.645 55.476
28 29.249 30.639 32.620 34.027 36.727 37.916 41.337 44.461 48.278 50.993 56.892
29 30.283 31.697 33.711 35.139 37.881 39.087 42.557 45.722 49.588 52.336 58.301
30 31.316 32.754 34.800 36.250 39.033 40.256 43.773 46.979 50.892 53.672 59.703
35 36.475 38.024 40.223 41.778 44.753 46.059 49.802 53.203 57.342 60.275 66.619
40 41.622 43.275 45.616 47.269 50.424 51.805 55.758 59.342 63.691 66.766 73.402
45 46.761 48.510 50.985 52.729 56.052 57.505 61.656 65.410 69.957 73.166 80.077
50 51.892 53.733 56.334 58.164 61.647 63.167 67.505 71.420 76.154 79.490 86.661
55 57.016 58.945 61.665 63.577 67.211 68.796 73.311 77.380 82.292 85.749 93.168
60 62.135 64.147 66.981 68.972 72.751 74.397 79.082 83.298 88.379 91.952 99.607
70 79.715 85.527 90.531 95.023 100.425 104.215 112.319
80 90.405 96.578 101.880 106.629 112.329 116.321 124.842
90 101.054 107.565 113.145 118.136 124.117 128.300 137.211
100 111.667 118.498 124.342 129.561 135.807 140.170 149.452
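
The χ² percentage points can be reproduced in the same way; again this is only a sketch assuming SciPy, with 66.7% read as 2/3.

    # Sketch (not from the notes): the chi-squared percentage points via SciPy.
    from scipy.stats import chi2

    levels = [0.60, 2/3, 0.75, 0.80, 0.875, 0.90, 0.95, 0.975, 0.99, 0.995, 0.999]

    def chi2_row(n):
        """Percentage points of chi^2 with n degrees of freedom at the tabulated levels."""
        return [round(chi2.ppf(p, n), 3) for p in levels]

    print(chi2_row(5))    # matches the n = 5 row: 5.132, 5.730, ..., 20.515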

95 percent points of F_{n,m} (n indexes the columns, m the rows)

m\n 1 2 3 4 5 6 8 12 16 20 30 40 50
1 161.4 199.5 215.7 224.5 230.1 233.9 238.8 243.9 246.4 248.0 250.1 251.1 251.7
2 18.51 19.00 19.16 19.25 19.30 19.33 19.37 19.41 19.43 19.45 19.46 19.47 19.48
3 10.13 9.55 9.28 9.12 9.01 8.94 8.85 8.74 8.69 8.66 8.62 8.59 8.58
4 7.71 6.94 6.59 6.39 6.26 6.16 6.04 5.91 5.84 5.80 5.75 5.72 5.70
5 6.61 5.79 5.41 5.19 5.05 4.95 4.82 4.68 4.60 4.56 4.50 4.46 4.44
6 5.99 5.14 4.76 4.53 4.39 4.28 4.15 4.00 3.92 3.87 3.81 3.77 3.75
7 5.59 4.74 4.35 4.12 3.97 3.87 3.73 3.57 3.49 3.44 3.38 3.34 3.32
8 5.32 4.46 4.07 3.84 3.69 3.58 3.44 3.28 3.20 3.15 3.08 3.04 3.02
9 5.12 4.26 3.86 3.63 3.48 3.37 3.23 3.07 2.99 2.94 2.86 2.83 2.80
10 4.96 4.10 3.71 3.48 3.33 3.22 3.07 2.91 2.83 2.77 2.70 2.66 2.64
11 4.84 3.98 3.59 3.36 3.20 3.09 2.95 2.79 2.70 2.65 2.57 2.53 2.51
12 4.75 3.89 3.49 3.26 3.11 3.00 2.85 2.69 2.60 2.54 2.47 2.43 2.40
13 4.67 3.81 3.41 3.18 3.03 2.92 2.77 2.60 2.51 2.46 2.38 2.34 2.31
14 4.60 3.74 3.34 3.11 2.96 2.85 2.70 2.53 2.44 2.39 2.31 2.27 2.24
15 4.54 3.68 3.29 3.06 2.90 2.79 2.64 2.48 2.38 2.33 2.25 2.20 2.18
16 4.49 3.63 3.24 3.01 2.85 2.74 2.59 2.42 2.33 2.28 2.19 2.15 2.12
17 4.45 3.59 3.20 2.96 2.81 2.70 2.55 2.38 2.29 2.23 2.15 2.10 2.08
18 4.41 3.55 3.16 2.93 2.77 2.66 2.51 2.34 2.25 2.19 2.11 2.06 2.04
19 4.38 3.52 3.13 2.90 2.74 2.63 2.48 2.31 2.21 2.16 2.07 2.03 2.00
20 4.35 3.49 3.10 2.87 2.71 2.60 2.45 2.28 2.18 2.12 2.04 1.99 1.97
22 4.30 3.44 3.05 2.82 2.66 2.55 2.40 2.23 2.13 2.07 1.98 1.94 1.91
24 4.26 3.40 3.01 2.78 2.62 2.51 2.36 2.18 2.09 2.03 1.94 1.89 1.86
26 4.23 3.37 2.98 2.74 2.59 2.47 2.32 2.15 2.05 1.99 1.90 1.85 1.82
28 4.20 3.34 2.95 2.71 2.56 2.45 2.29 2.12 2.02 1.96 1.87 1.82 1.79
30 4.17 3.32 2.92 2.69 2.53 2.42 2.27 2.09 1.99 1.93 1.84 1.79 1.76
40 4.08 3.23 2.84 2.61 2.45 2.34 2.18 2.00 1.90 1.84 1.74 1.69 1.66
50 4.03 3.18 2.79 2.56 2.40 2.29 2.13 1.95 1.85 1.78 1.69 1.63 1.60
60 4.00 3.15 2.76 2.53 2.37 2.25 2.10 1.92 1.82 1.75 1.65 1.59 1.56
70 3.98 3.13 2.74 2.50 2.35 2.23 2.07 1.89 1.79 1.72 1.62 1.57 1.53
80 3.96 3.11 2.72 2.49 2.33 2.21 2.06 1.88 1.77 1.70 1.60 1.54 1.51
100 3.94 3.09 2.70 2.46 2.31 2.19 2.03 1.85 1.75 1.68 1.57 1.52 1.48
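
Likewise for the F table, where n is the numerator (column) and m the denominator (row) degrees of freedom; the snippet below is again only an illustrative check assuming SciPy.

    # Sketch (not from the notes): the 95% points of F_{n,m} via SciPy.
    from scipy.stats import f

    def f95(n, m):
        """Upper 95% point of the F distribution with (n, m) degrees of freedom."""
        return round(f.ppf(0.95, n, m), 2)

    print(f95(4, 5))    # 5.19: column n = 4, row m = 5
    print(f95(1, 1))    # about 161.4: the top-left entry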
APPENDIX B
List of symbols
□          end of proof
{?}        set consisting of ?
[[ ? ]]    just a fancy bracket containing ?
|?|        magnitude/size of ?
‖·‖        norm; sometimes a subscript is used to distinguish different norms
⟨ , ⟩      inner product, or the group or vector space generated by the given elements
⟨ | ⟩      inner product, or the group generated by the given elements
=          equals
≈          approximately equal to
≝          by definition equal to; this is only used when the context is potentially not clear
≡          by definition equal to, or equal as an identity, or congruent to (modulo)
≅          is isomorphic to
⇒          implies
⇔          implies both ways, if and only if
↦          maps to
? → ∗      from ? to ∗, or ? tends to ∗, or ? leads to ∗
∀          for all
∃          there exists
∃!         there exists a unique
∈          is an element of
⊆          is a subset of
⊂          is a (proper) subset of
¬          negation of
∧          and; some authors use this as the cross product
∨          or
⊕          either but not both, or direct sum
∩          intersection; e.g. {a, b} ∩ {b, c} = {b}
∪          union; e.g. {a, b} ∪ {b, c} = {a, b, c}
̸          a slash through a symbol negates it; e.g. ≠ is not equal to, ∉ is not an element of
+          add
−          minus, additive inverse
×          times, or vector cross product
·          dot product, or times
◦          function composition, e.g. f ◦ g(x) = f(g(x))
/          divide, quotient
Σ          sum; e.g. Σ_{i=1}^n a_i = a_1 + · · · + a_n
Π          product; e.g. Π_{i=1}^n a_i = a_1 · · · a_n
√·         square root
ⁿ√·        nth root
±          plus or minus; note that if ± and ∓ are used in the same line then a ± b ∓ c means a + b − c or a − b + c
<          strictly less than
≤          less than or equal to, or is a subgroup of, depending on context
≪          a lot less than
◁          normal subgroup, ideal
⊴          normal subgroup, ideal
∇          gradient operator
∇²         Laplacian operator, that is ∇ · ∇
∝          is proportional to
∞          infinity
∫          integral
∮          closed line integral
∅          empty set
N          Natural numbers, i.e. {1, 2, · · · }

N_0        Natural numbers with 0, i.e. {0, 1, 2, · · · }
Z          Integers
Z_n        integers modulo n
Q          Rational numbers
R          Real numbers
C          Complex numbers
C∞         one-point compactification of C, the extended complex numbers
F*         the field F without 0, e.g. Z_p* is the integers modulo p without 0
A\B        set difference of A and B, that is {x ∈ A : x ∉ B}
A∆B        symmetric difference of A and B
?|∗        ? divides ∗, or conditional probability when used with P
(?, ∗)     open interval from ? to ∗, or the coordinate (?, ∗) in R², or hcf(?, ∗), or the ideal generated by the given elements
[?, ∗]     closed interval from ? to ∗, or the commutator of ? and ∗
(? over ∗) binomial coefficient, or vector
(? over ∗_1, · · · , ∗_n)  multinomial coefficient
n!         factorial, that is ∏_{k=1}^n k
n_r        ∏_{k=1}^r (n − k + 1)
î, ĵ, k̂    standard basis of R³
Re         real part
Im         image, or imaginary part
CL         Closure
ED         Euclidean domain
PV         Cauchy principal value
SE         Standard error
RK         Runge-Kutta methods
LP         Linear program
s.e.       standard error
id         identity map
s.t.       such that
r.v.       random variable
tr( · )    trace of
BFS        Basic feasible solution
PSL        projective special linear group
PID        principal ideal domain
UFD        unique factorization domain
ACC        ascending chain condition
RSS        residual sum of squares
AB2        2-step Adams-Bashforth method
Ann        annihilator
Int        interior
End        Endomorphism
Sym(?)     Group of permutations of ?
Syl_p(?)   Sylow p-subgroup of ?
Fit_k      kth Fitting ideal
Res        residue
arg        argument
mod        modulo
ker        kernel
dim        dimension
deg        degree
exp        exponential function
log        logarithm
hcf        highest common factor
gcd        greatest common divisor, highest common factor
lcm        least common multiple
ccl        conjugacy class
det        determinant
sgn        sign
max        maximum
min        minimum
var        variance
cov        covariance
pgf        probability generating function
pdf        probability density function
sup        supremum
inf        infimum
cos        trigonometric function cos
sin        trigonometric function sin
sec        1/cos

cosec      1/sin
tan        sin/cos
cot        1/tan
cosh       hyperbolic function cosh
sinh       hyperbolic function sinh
sech       1/cosh
cosech     1/sinh
tanh       sinh/cosh
coth       1/tanh
lim_{x→a}  limit as x → a
like       likelihood
iid        independent identically distributed
i.i.d.     independent identically distributed
stab       stabilizers
corr       correlation coefficient
span       span of
diag(?)    Diagonal matrix with diagonal entries ?

Below are probable meanings for notations whose meaning is relatively less certain or definite, i.e. notations that are more easily reused for something else.

∗          Convolution
?′         derivative of ?
?′′        second derivative of ?
?̇          time derivative of ?
?̈          second time derivative of ?
?^(n)      nth derivative of ?
∂_i ?      ∂?/∂x_i
?_x        ∂?/∂x
?*         complex conjugation of ?, or Legendre transform of the function ?, or dual space of the vector space ?
?̄          complex conjugation of ?, or closure of the set ?
?̃          Fourier transform of ?
?̂          denotes a unit vector, or the vector obtained by normalising ?, or the Laplace transform of a function ?
?^T        transpose of the matrix ?
?†         hermitian conjugate of the matrix ?
?^0        annihilator of ?
Y^X        set of all functions from X to Y
[n]        {1, 2, 3, · · · , n}
D(a, R)    disc in the complex plane centred at a with radius R
D?         derivative of ?
d?         derivative of ?, or infinitesimal change of ?
B(a, R)    Open ball centred at a with radius R
B_R(a)     Open ball centred on a with radius R
M_?        minimal polynomial of ?
T_?        distribution corresponding to the function ?
C(?, ∗)    continuous functions from ? to ∗; sometimes ∗ is missed out
D^n        {v ∈ R^n : ‖v‖ ≤ 1}, the n-dimensional closed unit disk
S^n        {v ∈ R^{n+1} : ‖v‖ = 1}, the n-dimensional sphere
C^n        differentiability class
J_n(?)     Jordan block of size n with eigenvalue ?, or Bessel functions of order n of the first kind
Y_n        Bessel functions of order n of the second kind
h_n        Hermite polynomials
H_n        Hermite polynomials
P_ℓ        Legendre polynomials
ℓ_k        Lagrange cardinal polynomials
T_n        Chebyshev polynomial
H^A        hitting time of A
C_n        cyclic group of order n
D_{2n}     Dihedral group of order 2n
S_n        symmetric group of degree n
A_n        alternating group of degree n
GL_n       general linear group
O_n        orthogonal group
M_{n×m}    the set of n × m matrices
C_G        centralizers of the group G
UD         upper sum
LD         lower sum
SO_n       special orthogonal group
e_i        standard basis of R^n or C^n
δ_ij       Kronecker delta; it equals 0 if i ≠ j and 1 if i = j
M_ij       minor (of a matrix)
∆_ij       cofactor (of a matrix)
ε_{ij···k} the sign of the permutation ij · · · k
ε_0        permittivity of free space
µ_0        permeability of free space
µ_i        mean recurrence time of state i
d_i        period of state i
n_p        number of Sylow p-subgroups

e          identity of a group, or the constant e
i          imaginary unit, so that i² = −1
g          acceleration caused by gravity
h          Planck constant
ħ          Reduced Planck constant
I          identity matrix, or moment of inertia, or inertia tensor
H          Hessian matrix, or Hamiltonian
J          Jacobian
T          kinetic energy
V          potential energy, or volume
L          Lagrangian
P          stochastic matrix
K          Gaussian curvature
γ          Lorentz factor, or photon
τ          proper time, or stress
π          constant π
A          vector potential
E          electric field
B          magnetic field
F          force
G          torque
R          position of centre of mass, or vector of residuals
J          current density
L          angular momentum
p          momentum
r          position vector
r_c        position with respect to centre of mass
v          velocity
a          acceleration
ω          Vorticity
r( · )     rank of
n( · )     nullity of
c( · )     content of
o( · )     little o notation
O( · )     big O notation, or the orthogonal group
F[?]       Fourier transform of ?
R[?]       the set (ring) of all polynomials with coefficients in the ring R
R[[?]]     the ring of power series over the ring R
H( · )     Heaviside step function
Θ( · )     Heaviside step function
Z(?)       centre of the group ?
P(?)       Poisson distribution
P(?)       probability of ?
E( · )     expectation
I( · )     indicator function
I[ · ]     indicator function
I(?, ∗)    winding number of ? about ∗
B(?, ∗)    Binomial distribution
N(?, ∗)    Normal distribution
χ²_k       chi-squared distribution on k degrees of freedom
d(?, ∗)    metric function
δ( · )     Dirac delta function
φ( · )     Euler's totient function
ε(?)       sign of the permutation ?
E(?)       exponential variable with parameter ?
P(?)       power set of ?, that is the set consisting of all the subsets of ?
L( · )     Laplace transform, or a differential operator
SO( · )    special orthogonal group
ds         scalar line element
dS         scalar area element
dS         vector area element
dV         volume element
∆s         invariant interval or space-time interval
(∆?)_ψ     uncertainty of ? under ψ
⟨?⟩_ψ      expected value of ? under ψ
Z[i]       Gaussian integers
Index
λ-eigenspace, 132 angular momentum quantum numbers,
F-vector space as F[X]-module, 397 306
θ-method, 586 annihilates, 582
nth orthogonal polynomial, 572 annihilator, 119, 386
p-adic metric, 5 antiderivative, 421
p-group, 350 aperiodic, 328
(external) direct sum, 109 Application to Hamiltonian mechanics,
(internal) direct sum, 109 64
(weak) Heine-Borel theorem, 34 Application to thermodynamics, 48
2-step Adams-Bashforth (AB2) method, Archimedes’ principle, 662
588 area, 492, 504
2nd derivatives as bilinear map, 205 Argument principle, 461
3-term recurrence relation, 571 ascending chain condition, 370
4-derivative, 642 associated Laguerre polynomials, 309
associated Legendre function, 306
associates, 368
A long straight wire, 628
atlas, 504, 513
A-stability and the maximum princi-
augmenting path, 99
ple, 596
automorphism group, 347
A-stable, 595
axisymmetric solutions, 228
Abel’s lemma, 319
absorbing state, 317
backward differentiation method, 591
absorption probability, 323 backward Euler method, 586
abstract smooth surface, 513 backwards substitution, 600
acceptance region, 532 band matrix, 604
action, 344 band width, 604
Adams method, 591 basic, 80
Adams-Bashforth, 591 basic variables, 80
Adams-Moulton, 591 basis, 19, 80, 105, 387
added mass, 676 basis of τ , 19
adjoint, 156, 296 Bayes estimator, 530
adjugate matrix, 128 Bayesian approach, 528
advective derivative, 657 Bernoulli’s equation, 668
affine space, 473 Bernoulli’s principle, 668
algebraic integer, 380 Bessel’s equation, 224, 230
algebraic multiplicity, 140 best linear unbiased estimator, 549
algebraically closed field, 134 best response, 89
Alternative first-order convexity condi- bias, 518
tions, 44 bilinear form, 123
alternative hypothesis, 532 bilinear map, 123
Ampere’s law, 627 bimatrix game, 89
Ampere-Maxwell law, 618 Biot-Savart law, 629
amplitude, 297, 637 bipartite graph, 94
amplitudes, 296 Birth-death chain, 325
analytic, 405 Birth-death chain with immigration, 338
analytic continuation, 434 body force, 650
angular momentum, 304 Bolzano-Weierstrass theorem in Rn , 172


Boosted line charge, 644 circulation problem, 93


Boosted point charge, 644 classical Runge-Kutta method, 593
bound state, 290 Classification of finite abelian groups,
bounded, 32, 73 351
bounds, 462 Classification of finitely-generated mod-
Brachistochrone, 55 ules over a Euclidean domain,
branch cut, 411 395
branch of the logarithm, 410 closed, 317, 421
branch point, 410 closed ball, 6
branch point singularity, 410 closed subset, 6
British railway metric, 2 closure, 15
Bromwich inversion formula, 469 Co-vectors, 642
Butcher table, 593 coarse topology, 11
cofinite topology, 11
Cantor set, 192 column player, 89
Capacitors, 626 column rank, 117
capacity, 99 communicate, 316
Casorati-Weierstrass theorem, 440 communicating classes, 317
Cauchy, 40 commutative, 356
Cauchy data, 258 commutator, 287
Cauchy integral formula, 428 Commuting observables, 301
Cauchy principal value, 465 compact, 31
Cauchy problem, 258 compact support, 240
Cauchy’s residue theorem, 451 companion matrix, 398
Cauchy’s theorem, 450 complementary slackness, 76
Cauchy’s theorem for a triangle, 425 complementary subspaces, 109
Cauchy-Riemann equations, 406 complete, 40
Cauchy-Schwarz inequality, 4, 153, 288 complete set of commuting observables
Cayley’s theorem, 345 (CSCO), 302
Cayley-Hamilton theorem, 138, 400 composite hypothesis, 532
center, 347 conductivity, 635
Central force fields, 59 conductor, 624
centralizer, 347 confidence interval, 526
Chain rule, 199 configuration space, 58
Chapman-Kolmogorov equation, 315 conformal, 414
characteristic, 364 conformal equivalence, 414
characteristic polynomial, 132 congruent, 146
characteristic speed, 663 conjugacy class, 347
characteristic surface, 263 conjugate, 130, 530
Characteristic surfaces, 263 conjugate momentum, 60
charge, 617 conjugate variable, 46
charge density, 617 conjugation, 347
chart, 504 connected, 24, 92
charts, 513 connected component, 29
Chebyshev polynomial, 566, 570 Conservation law, 658
chi-squared distribution, 517 conserved quantity, 54
Chicken, 90 constrained maximization, 50
Chinese remainder theorem, 396 constrained optimization, 73
Cholesky factorization, 607 Constructing multi-step methods, 591
circline, 409 consumers, 94
circular polarization, 638 content, 373

contingency table, 539 Diagonalizability theorem v2, 136


Continuity equation, 618 diffeomorphism, 200, 489
continuity equation, 658 differentiable, 192, 204, 405
continuous, 1, 10 diffusion constant, 231
continuous symmetry, 61 dimension, 107
contour, 421 Dipole, 677
contraction, 181 dipole, 229, 622
Contraction mapping theorem, 182 Dirac delta, 241
converge, 14 direct sum, 387
Convergence of Euler’s method, 587 directed graph, 92
Convergence theorem for Markov chains, directional derivative, 193
333 Dirichlet, 224
converges, 1, 167, 586 disconnect, 24
converges absolutely, 167 disconnected, 24
converges absolutely uniformly, 167 Discrete and continuous spectra, 299
converges pointwise, 161 Discrete Fourier transform, 255
converges uniformly, 161, 167 Discrete metric, 2
convex, 43 discrete topology, 11
convolution, 247 disk model, 493
Convolution theorem, 472 dispersion relation, 684
Cooling of a uniform sphere, 235 displacement current, 636
Coriolis parameter, 687 dissection, 477
cost vector, 73 distance, 479, 492
Couette flow, 653 Distributions, 240
Coulomb, 617 divided difference, 566
Coulomb gauge, 629 divides, 368
Coulomb’s law, 619 domain, 405
Coupling game, 335 domain of dependence, 265
covariates, 546 dominant, 89
covers, 31 Doublet flow (dipole), 674
critical region, 532 dual, 118
current, 617 dual basis, 119
current density, 617 dual electromagnetic tensor, 646
curvature, 510 dual function, 77
curve, 477 dual map, 120
cut, 99 dual problem, 77
cycle, 92 Duhamel’s principle, 268
dynamically similar, 663
d’Alembert’s solution, 265 Dynamics, 297
d’Alemberts’ paradox, 675
degeneracy, 300 effective potential, 59
degenerate, 124, 300 Ehrenfest’s theorem, 288
degree, 133, 357 Ehrenfest’s theorem (General form), 299
degree of a vertex, 92 Eigenfunction expansion of Green’s func-
dense, 15 tion, 247
dependent, 546 eigenfunction with weight, 217
derivative, 193, 241 eigenstate, 279
detailed balance, 337 eigenvalue, 132, 217, 296
detailed balance equation, 337 eigenvector, 132, 296
determinant, 125, 131 Eisentein’s criterion, 376
Diagonalizability theorem, 135 Electric constant, 618

electric dipole moment, 622, 623 Factorisations of polynomials over a field,


electric field, 618 373
electromotive force, 632 Factorization criterion, 521
electrostatic potential, 621 Faraday Cage, 625
elementary column operations, 390 Faraday’s law of induction, 618, 632
elementary deformation, 427 Faraday’s law of induction for fix curve,
elementary matrix, 117 632
Elementary row operations, 390 Fast Fourier transform, 257
elliptic, 262 Fast jet generation, 679
elliptically polarized, 638 feasible, 73, 80
embedded RungeKutta methods, 598 feasible set, 73
endomorphism, 130 Fejér’s theorem, 212
energy, 504 Fermi-Dirac statistics, 311
entire, 406 field, 356, 618
equilibrium, 89 field line, 623
equipotentials, 623 Field lines and equipotentials, 623
equivalent, 117, 390, 477 field of fractions, 365
ergodic, 328 fields, 618
error constant, 598 finite dimensional, 105
essential singularity, 439 finitely generated, 381, 386
estimator, 518 finitely presented, 387
Euclidean algorithm for polynomials, first fundamental form, 504
361 first integral, 53
Euclidean domain, 370 First isomorphism theorem, 114, 341,
Euclidean function, 370 363, 385
first passage probability, 318
Euclidean metric, 2
first passage time, 318
Euclidean norm, 473
First-order convexity conditions, 43
Euler equations, 664
Fitting ideal, 390
Euler momentum equation, 667
fluid, 649
Euler number, 487
flux, 619
Euler’s equation in a rotating frame,
Force on a fire hose nozzle, 669
687
Force on curved pipe, 667
Euler’s method, 586
Ford-Fulkerson algorithm, 100
Euler-Lagrange equation, 52
forward substitution, 600
Eulerian picture, 657
Fourier coefficients, 210
Eulerian time derivative, 657
Fourier inversion theorem, 249
evaluation map, 122
Fourier series, 210
Examples of conformal maps/equivalence, Fourier transform, 247
415 Fourier transform on differential equa-
excited states, 290 tion, 250
expectation value, 287 Fourier transformation of distributions,
expected payoff, 89 252
expected posterior loss, 530 free, 387
explanatory, 546 freely, 387
explicit, 586 frequency, 210, 637
Extended Markov property, 314 Froude number, 685
extreme point, 80 Function space, 5
extrinsic length scale, 663 functional, 51
functional constraints, 73
F-distribution, 545 functional derivative, 52

Functional iteration, 599 Hall’s theorem, 101


fundamental solution, 266, 270 Hamilton’s equations, 60
Fundamental theorem of algebra, 134, Hamilton’s principle, 58
432 Hamiltonian, 60, 279, 297
Fundamental theorem of calculus, 423 harmonic, 224
harmonic conjugates, 418
Gambler’s ruin, 324, 327 Hausdorff, 14
gauge, 629 Heat conduction in a finite rod, 234
gauge transformation, 629 Heat conduction in uniform medium,
gauge transformations, 643 234
Gauss Markov theorem, 549 heat equation, 231
Gauss’ Law, 619 heat kernel, 232
Gauss’ law, 618, 619 Heine-Borel theorem, 36
Gauss’ law for magnetism, 618 Heisenberg’s uncertainty principle, 289
Gauss’ lemma, 374, 508 Hermite polynomials, 218, 219, 287
Gauss-Bonnet theorem, 515 Hermite’s equation, 286
Gauss-Bonnet theorem for S 2 , 483 Hermitian, 150, 287, 296
Gauss-Bonnet theorem for hyperbolic Hermitian conjugate, 296
triangles, 500 Hermitian form of Sylvester’s law of in-
Gaussian curvature, 510 ertia, 152
Gaussian integers, 357, 377 heteroscedasticity, 547
Gaussian wavepacket, 290 Heun’s method, 594
General form, 73 Higher derivatives as multi-linear map,
General form of transfer function, 254 205
generalized coordinates, 58 Hilbert basis theorem, 382
generalized eigenspace, 144 Hitchcock transportation problem, 94
Generalized likelihood ratio theorem, hitting time, 323
537 holomorphic, 405
geodesic, 505, 506 homeomorphic, 13
geodesic ODEs, 505 homeomorphism, 13
Geodesic polar coordinates, 508 homogeneous, 209, 313
geodesic triangle, 487 homomorphism, 385
Geodesics, 57 homoscedasticity, 547
geodesics, 479 homotopy, 449
Geodesics of a plane, 53 Horner’s scheme, 568
Geodesics on unit sphere, 70 Householder reflection, 610
geometric multiplicity, 140 hydrostatic pressure, 662
Geometry of the hyperbolic disk, 497 hyperbolic, 262
geostrophic balance, 688 Hyperbolic cosine rule, 500
Givens rotation, 610 hyperbolic distance, 496
Global maximum principle, 437 Hyperbolic lines, 496
Gram matrix, 609 Hyperbolic reflection, 499
Gram-Schmidt process, 153 Hyperbolic sine rule, 500
great circle, 479 hyperbolic triangle, 499
greatest common divisor, 373 Hyperboloid model, 501
Green’s first identity, 269 hyperplane, 78
Green’s function, 221, 244
Green’s functions, 244 ideal, 359
Green’s second identity, 270 Ideal correspondence, 363
Green’s third identity, 271 ideal generated by A, 359
ground state, 290 ideal generated by a, 359

ideal generated by a1 , · · · , ak , 359 kernel, 110, 359


Identity theorem, 435 keyhole contour, 460
image, 110, 359 kinematic viscosity, 653
implicit, 586
Implicit function Theorem, 203 Lagrange cardinal polynomials, 565
inclusion function, 18 Lagrange multiplier, 50
incompressible, 658 Lagrangian, 50, 58
indiscrete topology, 11 Lagrangian derivative, 657
inductance, 634 Lagrangian picture, 657
Inference for β, 552 Lagrangian sufficiency, 74
Infinite well: particle in a box, 283 Laplace transform, 467
Inhomogeneous equations and Green’s Laplace’s equation, 224
functions, 221 Laplace’s equation in cylindrical coor-
inner product, 3, 152, 287, 473 dinates, 229
inner product space, 152 Laplace’s equation in spherical polar
inner product with weight, 217 coordinates, 228
integral, 421 Laplace’s equation on a disk, 225
integral curves, 259 Laurent polynomials, 358
integral domain, 364 Laurent series, 440
Integration and differentiation of Fourier LDU decomposition, 606
series, 211 leading principal submatrices, 604
interior, 15 leading zero, 605
Intermediate value theorem, 26 leads to, 316
interpolation problem, 565 least common multiple, 373
Invariance of dimension, 389 Least squares approximation, 219
invariance property, 524 least squares equation, 548
invariant distribution, 330 least squares estimator, 548
invariant factors, 392 least squares problem, 608
invariants, 130 Least-squares polynomial approximation,
Inverse function theorem, 201 575
Inverse Laplace transform, 469 Lebesgue measure zero, 191
invertible, 116 Lebesgue’s theorem on the Riemann in-
inviscid approximation, 649 tegral, 192
inviscid flow, 664 left kernel, 124
irreducible, 317, 368 Legendre condition, 69
isolated singularity, 439 Legendre polynomials, 223
isometry, 473, 492 Legendre transform, 46
isometry group, 473 Legendre’s equation, 222
isomorphic, 110 Leibnitz property, 304
isomorphism, 110, 359, 385 length, 477, 492, 504
Isomorphism theorems, 385 Lenz’s law, 633
Isoperimetric problem, 56 likelihood, 524, 532, 535
likelihood ratio, 533
Jacobi accessory equation, 70 likelihood ratio test, 533
Jacobian matrix, 489 limit point, 6
Joint distribution of X̄ and SXX , 545 Line charge, 620
Jordan blocks, 140 linear, 296, 589
Jordan normal form, 140, 399 linear forms, 118
Jordan normal form and gλ , aλ , cλ , 142 linear functional, 577
Jordan’s lemma, 456 linear functionals, 118
Joule heating, 635 linear map, 110

Linear programming (LP), 73 meromorphic, 439


Linear Programming Duality, 82 metric, 1
linear stability domain, 595 metric space, 1
Linear systems and response functions, Milnes device, 597
254 minimal, 520
linearly dependent, 105 minimal polynomial, 135, 371, 381
linearly independent, 105, 387 Minimal property (for n ≥ 1), 571
linearly polarized, 638 Minimal surfaces in E3 , 65
lines, 479 minimal-surface equation, 66
Liouville’s theorem, 431 minor, 390
Lipschitz, 181 module, 384
Lipschitz equivalent, 170, 179 module homomorphism, 385
Little o notation, 192 Momentum integral for steady flow, 667
local degree, 463 monic, 357
Local degree theorem, 464 monochromatic, 637
Local maximum principle, 437 monopole, 229
local truncation error, 586 Morera’s theorem, 433
Lorentz force law, 618 Motion in a constant field, 648
Lorentz transform, 641 Motion in constant magnetic field, 648
Lorentz transformation, 641 multi-step, 586
Lorentzian inner product, 501 multiple expansion, 229
loss function, 530 multiplicity, 133
lower triangular, 600 Multipole expansions for Laplace’s equa-
LU factorization, 600 tion, 229
multivariate normal distribution, 543
Möbius maps that fixes H, 495
Magnetic constant, 618 n-connected, 28
Magnetic dipole moment, 631 Nash equilibria, 89
magnetic dipole moment, 630 Navier-Stokes equation, 661
magnetic field, 618 negative binomial distribution, 517
magnetic flux, 632 negative definite, 148
Magnetic forces, 631 negative semi-definite, 148
magnus force, 678 neighbourhood, 10
Manhattan metric, 2 network, 92
Markov chain, 313 Neumann, 224
Markov property, 313 Newton divided differences, 567
material derivative, 657 Newton’s method, 599
matrix game, 91 Newtonian fluid, 649
Matrix representation, 124 Neyman-Pearson lemma, 533
matrix representing, 113, 124, 150 nilpotent, 145
Max-flow min-cut theorem, 100 No-slip condition, 651
maximal, 367 nodal polynomial, 566
maximin strategy, 89 nodes, 579
maximum likelihood estimator, 524 Noether’s theorem, 61
Maximum value theorem, 35 Noetherian ring, 370
Maxwell’s equations, 67, 618 non-degenerate, 80, 124, 300
mean recurrence time, 328 non-null, 328
mean squared error, 518 non-singular, 116, 125
Mean value inequality, 199 norm, 3, 287
Measurement, 297 normal equations, 609
Meridians, 509 normal stress, 649

normalizer, 347 Pareto optimal, 89


normed space, 3 Parseval’s identity, 153, 214
null hypothesis, 532 Parseval’s identity II, 219
null state, 328 Parseval’s theorem, 251
nullity, 113 partial derivative, 193
partial Fourier sum, 210
objective function, 73 Particle state and probability, 278
observable, 279 path, 27, 92, 421
Observables, 297 path components, 29
Ohm’s law, 635 path connected, 27
One way analysis of variance with equal Pathline, 651
numbers in each group, 559 Pauli exclusion principle, 311
one-step, 586 Peano kernel, 583
open ball, 6 Peano kernel theorem, 582
open cover, 31 Pearson product-moment correlation co-
Open mapping theorem, 464 efficient, 554
open neighbourhood, 6, 10 Pearson’s Chi-squared test, 538
open subset, 6 period, 210, 328
open subsets, 10 periodic, 210
operator, 279, 296 periodic boundary condition, 216
operator norm, 197 permutation group, 344
Operators, 279 permutation group of order n, 344
optimal, 73 permutation representation, 344
orbit, 345 permutations, 344
Orbit-stabilizer theorem, 347 persistent, 318
order, 216, 434, 439, 586 phase space, 60
Ordinary quadrature, 579 phase-space form, 60
orientation, 477 Picard’s theorem, 440
orientation-preserving, 477 Picard-Lindelöf existence theorem, 187
orientation-reversing, 477 pivot, 526
orthogonal, 152, 156, 473, 572 pivot column, 85
orthogonal complement, 154 pivot row, 85
orthogonal external direct sum, 154 planetary vorticity, 687
orthogonal group, 156 points, 10
orthogonal internal direct sum, 154 pointwise Cauchy, 161
orthogonal projection, 154 Poiseuille flow, 654
orthonormal basis, 152 Poisson bracket, 289
orthonormal set, 152 Polarization identity, 146, 151
Oscillations in a manometer, 680 pole, 439
Oscillations of a bubble, 680 polynomial, 133, 357
Oxygen/time example, 547 Polynomial division, 133
polynomial interpolant, 565
p-norm, 4 positive, 328, 421
p-value, 532 positive definite, 148
Pólya’s theorem, 321 positive semi-definite, 43, 148
parabolic, 262 positive-definite, 604
parallel, 501 posterior distribution, 528
parallel flow, 652 posterior mean, 530
parallels, 509 posterior median, 531
parametric inference, 517 Postulates for quantum mechanics, 297
Pareto dominated, 89 Potential barrier, 294

potential flow, 673 Random walk on a finite graph, 338


Potential step, 293 range of influence, 265
potential vorticity, 689 rank, 113, 124
Potential well, 284 Rank-nullity theorem, 114
power, 532 Rao-Blackwell Theorem, 522
power function, 535 rate of strain, 649
power series, 358 Rational canonical form, 398
Powers, 412 Rayleigh-Taylor instability, 686
Poynting theorem, 639 Real Fourier series, 210
Poynting vector, 639 Recurrence relation for Newton divided
prediction interval, 555 differences, 567
predictor-corrector pair, 598 recurrent, 318
predictors, 546 reflection in `, 499
primal, 77 Reflections in an affine hyperplane, 475
prime, 367, 368 regional constraint, 73
Prime decomposition theorem, 397 regular, 217
primitive, 373 relation module, 387
principal branch, 410 removable singularity, 439
principal ideal, 359 Removal of singularities, 437
principal ideal domain, 370 Representation of constraints, 73
principal part, 262, 443 residual standard error, 552
Principle of isolated zeroes, 435 residual sum of squares, 549
principle of superposition, 621 residue, 447
principle quantum number, 309 resistance, 635
prior distribution, 528 resistivity, 635
Prisoner’s dilemma, 90 resolution, 254
probability current, 281 response, 546
product, 357 response function, 254
product topology, 20 reversible, 337
projection map, 154 Reynolds number, 663
proper variation, 505 Riemann integrable, 185
Properties of δ(x), 242 Riemann mapping theorem, 419
Properties of Fourier transform, 248 Riemann sphere, 406
Properties of the Laplace transform, 468 Riemann surfaces, 413
Properties of triangular matrices, 600 Riemannian metric, 490, 513
proportion of variance explained, 558 right kernel, 124
pure strategy, 89 right order topology, 11
Pythagoras theorem, 480 ring, 356
ring homomorphism, 358
QR factorization, 609 Rising bubble, 676
quadratic form, 146 root condition, 588
quantum tunneling, 295 root mean squared error, 518
quotient module, 385 Rossby number, 687
quotient ring, 360 Rossby radius of deformation, 690
Quotient space, 104 Rotational invariance, 64
quotient space, 104 Rouchés theorem, 462
quotient topology, 22 row player, 89
row rank, 117
radial part, 306 Runge-Kutta (RK) methods, 592
Radial Schrödinger equation, 307
radial wavefunction, 307 s-step (linear) method, 588

sampling distribution, 518 sparse matrix, 608


Sawtooth function, 211 special orthogonal group, 473
Scalar products, 572 spectrum, 296
second fundamental form, 510 Spherical cosine rule, 480
Second isomorphism theorem, 342, 363, Spherical sine rule, 480
385 spin, 310
Second-order convexity condition, 44 stabilizer, 345
security level, 89 standard error, 552
self-adjoint, 156, 296 Standard form, 73
separated boundary conditions, 216 standard form, 610
Separation of variables, 225 star domain, 424
sequentially compact, 38 Star-shaped Cauchy’s theorem, 427
sequentially continuous, 14 star-shaped domain, 424
Series summation, 445 state, 278
sesquilinear form, 150 state space, 313
shallow water streamfunction, 688 States, 297
sharp, 582 stationary state, 280
shear stress, 650 statistic, 518
signature, 150 steady currents, 627
similar, 130 steady flow, 652
similarity variable, 231 Steinitz exchange lemma, 106
simple, 421 Stereographic projection, 484
simple closed curve, 418 Stokes equations, 664
simple hypothesis, 532 Stokes stream function, 661
Simple linear regression, 559 Stokes’ law, 664
simple random sample, 517 stopping time, 325
simple zero, 434 Strain, 649
simplex tableau, 84 strategies, 89
simply connected, 28, 419 strategy profile, 89
singular, 125 Streakline, 651
sink, 93 stream function, 660
size, 532, 535 Streamline, 651
slack variable, 73 strength, 674
Small amplitude oscillations of uniform Stress, 649
string, 66 Stress condition, 651
Smith normal form, 391 strictly convex, 43
smooth, 489 strictly dominant, 89
smooth embedded surface, 502 strong duality, 77
smooth parametrisation, 502 strong Legendre condition, 70
Solenoid, 628 Strong Markov property, 326
Solve differential equations by Laplace structureless, 302
transform, 471 Sturm-Liouville eigenvalue, 57
Some common topology, 11 Sturm-Liouville form, 216
source, 93 Sturm-Liouville operator, 57
Source/sink, 674, 677 Sturm-Liouville operators, 216
space, 1 Sturm-Liouville problem, 56
Space of sequences, 170 Sturm-Liouville problem, 217
Space translation invariance, 64 subcover, 31
span, 104 subgraph, 92
spanning tree, 92 Subgroup correspondence, 342
spans, 105 submodule, 385

Submodule correspondence, 386 topology, 10


submodule generated by m, 386 topology induced by, 10
subring, 356 torus, 487
subspace, 1, 103 total angular momentum, 304
subspace topology, 18 totally bounded, 180
sufficient, 520 trace, 130, 131
sum, 103 transfer function, 254
suppliers, 94 transient, 318
supporting hyperplane, 78 transportable energy, 668
Supporting hyperplane theorem, 79 Transportation Algorithm, 95
supremum norm, 5 trapezoidal rule, 586
Surface charge, 620 triangle, 424
Surface current, 628 Triangle inequality, 153, 481, 497
Surfaces of revolution, 508 Triangles on a sphere, 480
Sylow theorems, 352 triangulable, 137
Sylvester’s law of inertia, 150 tridiagonal, 604
symbol, 262 Two parallel wires, 632
symmetric, 146 Two-phase simplex method, 86
symmetric random walk, 321 Type I error, 532
symmetry, 61 Type II error, 532

t-distribution, 545 ultra-hyperbolic, 262


tangent space, 502 ultraparallel, 501
tangent vector, 259 unbiased, 518
tangential stress, 650 uncapacitated problem, 93
Taylor’s theorem, 207, 430 uncertainty, 287
tensor, 642 Uncertainty principle (General form),
Testing independence in contingency ta- 300
bles, 540 unconstrained maximization, 49
Tests of homogeneity, 541 undirected graph, 92
The Dahlquist equivalence theorem, 590 undirected walk, 92
The extended complex plane, 406 Uniform flow, 674, 677
The method of characteristics, 260 Uniform flow pass sphere, 675
The Newton formula, 567 Uniform flow with circulation past cylin-
Theorema Egregium, 512 der, 677
Third isomorphism theorem, 343, 364, Uniform limit theorem, 164
385 Uniform metric, 5
Three-dimensional well, 307 uniform norm, 5
Three-term recurrence relation, 573 uniformly Cauchy, 161
Time-dependent Schrödinger equation, uniformly continuous, 181
280 uniformly most powerful, 535
Time-independent potential, 59 unique factorization domain, 370
time-independent Schrödinger equation, Uniqueness of limits, 3
279 unit, 356, 368
Times translational invariance, 64 unit normal, 502
topological group, 24 unit triangular matrix, 600
topological notion, 10 unitary, 156
topological property, 10 unitary group, 156
topological space, 10 universal cover, 487
topological triangle, 487 upper half-plane, 493
topological triangulation, 487 upper triangular, 600

value, 92
vector of decision variables, 73
vector of fitted values, 549
vector of residuals, 549
vector potential, 629
vector space, 103
Vectors, 642
velocity potential, 673
Venturi meter, 668
Vibrations of a circular membrane, 239
viscosity, 649
viscous flow, 649
volume flux, 654
volume form, 125
Vortex, 677
vortex lines, 668
vorticity, 654, 662
Vorticity equation, 671

Wafer example, 552


walk, 92
wave equation, 236
wave number, 637
wave vector, 638
wavefunction, 278
wavepacket, 290
Weak duality, 77
weakly dominant, 89
Weierstrass Approximation Theorem,
190
Weierstrass M-test, 168
weight function, 57, 217
weights, 579
well-posed, 258
winding number, 447
Wronskian, 245

z-test, 534
Zadunaisky device, 598
zero divisor, 364
zero-sum game, 91
Zorn’s lemma, 389
